*This model was published in HF papers on 2024-08-28 and contributed to Hugging Face Transformers on 2026-04-28.*
# Laguna
Laguna is Poolside's mixture-of-experts language model family. Compared with a
standard SwiGLU MoE transformer, Laguna differs in two ways:
- **Per-layer head counts** via `num_attention_heads_per_layer` — different decoder
layers can have different query-head counts while sharing the same KV cache shape.
- **Sigmoid MoE router with auxiliary-loss-free load balancing**
([arXiv:2408.15664](https://huggingface.co/papers/2408.15664)) and optional logit
soft-capping (`moe_router_logit_softcapping`). Router scores are the element-wise
sigmoid of the gate logits. A learned per-expert bias (`e_score_correction_bias`)
is added to those scores at selection time only, so it influences which experts are
chosen but not the weights used to combine their outputs.
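
The routing rule described above can be illustrated with a minimal, framework-free sketch. It is not the Transformers implementation: the function names `softcap` and `route` are hypothetical, the `cap * tanh(x / cap)` soft-capping form is an assumption borrowed from other models, and renormalising the selected weights so they sum to one is likewise an assumption.

```python
import math


def softcap(logit, cap):
    """Squash a logit into (-cap, cap) with tanh; cap=None disables capping.

    Assumption: this tanh form is a common soft-capping choice, not
    necessarily the one Laguna uses.
    """
    if cap is None:
        return logit
    return cap * math.tanh(logit / cap)


def route(gate_logits, correction_bias, top_k, cap=None):
    """Pick top_k experts for one token and return (indices, weights)."""
    # Element-wise sigmoid: each expert is scored independently (no softmax).
    scores = [1.0 / (1.0 + math.exp(-softcap(x, cap))) for x in gate_logits]
    # The per-expert bias is used only to decide which experts are selected.
    biased = [s + b for s, b in zip(scores, correction_bias)]
    chosen = sorted(range(len(scores)), key=lambda i: biased[i], reverse=True)[:top_k]
    # Mixing weights come from the unbiased scores.
    # Assumption: they are renormalised to sum to one.
    total = sum(scores[i] for i in chosen)
    weights = [scores[i] / total for i in chosen]
    return chosen, weights


# Expert 2 has the lowest raw score, but its bias lifts it into the top 2.
experts, weights = route([2.0, 0.0, -1.0, 1.0], [0.0, 0.0, 1.0, 0.0], top_k=2)
print(experts, weights)
```

In this toy call the bias changes the selected set, while the weight given to the boosted expert still reflects its raw sigmoid score.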
## Usage
```python
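from transformers import pipeline

# Build a text-generation pipeline; dtype="auto" uses the dtype stored in the checkpoint.
pipe = pipeline(
    "text-generation",
    model="poolside/Laguna-XS.2",
    dtype="auto",
    device_map="auto",
)
# Greedy decoding (do_sample=False) keeps the output deterministic.
print(pipe("The capital of France is", max_new_tokens=20, do_sample=False)[0]["generated_text"])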
```
```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "poolside/Laguna-XS.2"

# Load the tokenizer and the model weights in bfloat16, placed automatically on available devices.
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    dtype=torch.bfloat16,
    device_map="auto",
)

# Tokenize the prompt and move the tensors to the model's device.
prompt = "The capital of France is"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

# Greedy decoding of up to 20 new tokens.
generated = model.generate(**inputs, max_new_tokens=20, do_sample=False)
print(tokenizer.decode(generated[0], skip_special_tokens=True))
```
## Notes
- **Attention backends.** SDPA (default), FlashAttention-2, and flex attention are
supported. Attention-output gating is applied outside the kernel call and
therefore works with all backends.
- **`num_attention_heads_per_layer`.** When provided, its length must equal
`num_hidden_layers`. Each entry must be divisible by `num_key_value_heads`.
- **`layer_types`.** Defaults to `["full_attention"] * num_hidden_layers` when left
unset. To enable sliding-window attention, pass a list of
`"full_attention"` / `"sliding_attention"` values.
- **`mlp_layer_types`.** Per-layer MLP type, values `"dense"` or `"sparse"`. Length must
equal `num_hidden_layers`. Defaults to `["dense"] + ["sparse"] * (num_hidden_layers - 1)`
(first layer dense, rest MoE) when left unset.
- **`moe_apply_router_weight_on_input=True`** is not currently supported alongside the
fused experts kernel (`grouped_mm_experts_forward`); `validate_architecture` raises an
error at config-construction time. Leave it at `False` (the default).
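
The per-layer defaults and constraints in the notes above can be summarised in a short standalone sketch. The helper `resolve_layer_settings` is hypothetical and does not exist in Transformers; it only mirrors the rules as stated here (list lengths, divisibility by `num_key_value_heads`, and the default values), and the real `LagunaConfig` validation may differ in detail.

```python
def resolve_layer_settings(
    num_hidden_layers,
    num_key_value_heads,
    num_attention_heads_per_layer=None,
    layer_types=None,
    mlp_layer_types=None,
):
    """Hypothetical helper: apply the documented defaults and checks."""
    # Default: full attention in every layer.
    if layer_types is None:
        layer_types = ["full_attention"] * num_hidden_layers

    # Default: first layer dense, remaining layers MoE.
    if mlp_layer_types is None:
        mlp_layer_types = ["dense"] + ["sparse"] * (num_hidden_layers - 1)
    if len(mlp_layer_types) != num_hidden_layers:
        raise ValueError("mlp_layer_types must have one entry per layer")
    if any(t not in ("dense", "sparse") for t in mlp_layer_types):
        raise ValueError("mlp_layer_types entries must be 'dense' or 'sparse'")

    if num_attention_heads_per_layer is not None:
        if len(num_attention_heads_per_layer) != num_hidden_layers:
            raise ValueError("num_attention_heads_per_layer must have one entry per layer")
        for idx, heads in enumerate(num_attention_heads_per_layer):
            # Query heads must split evenly across the shared KV heads.
            if heads % num_key_value_heads != 0:
                raise ValueError(
                    f"layer {idx}: {heads} query heads not divisible by "
                    f"{num_key_value_heads} KV heads"
                )

    return layer_types, mlp_layer_types


# Four layers, two KV heads, and a different query-head count per layer.
print(resolve_layer_settings(4, 2, num_attention_heads_per_layer=[8, 8, 6, 4]))
```

Because every entry is a multiple of `num_key_value_heads`, each layer can use a different number of query heads per KV head while the KV cache keeps the same shape across layers.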
## LagunaConfig
[[autodoc]] LagunaConfig
## LagunaModel
[[autodoc]] LagunaModel
    - forward
## LagunaForCausalLM
[[autodoc]] LagunaForCausalLM
    - forward