*This model was published in HF papers on 2024-08-28 and contributed to Hugging Face Transformers on 2026-04-28.*

# Laguna

Laguna is Poolside's mixture-of-experts language model family. The Laguna-specific deltas versus a standard SwiGLU MoE transformer are:

- **Per-layer head counts** via `num_attention_heads_per_layer`: different decoder layers can have different query-head counts while sharing the same KV cache shape.
- **Sigmoid MoE router with auxiliary-loss-free load balancing** ([arXiv:2408.15664](https://huggingface.co/papers/2408.15664)) and optional logit soft-capping (`moe_router_logit_softcapping`): router scores are the element-wise sigmoid of the gate logits plus a learned per-expert bias (`e_score_correction_bias`) that is added at selection time only.

## Usage

```python
from transformers import pipeline

pipe = pipeline(
    "text-generation",
    model="poolside/Laguna-XS.2",
    dtype="auto",
    device_map="auto",
)
print(pipe("The capital of France is", max_new_tokens=20, do_sample=False)[0]["generated_text"])
```

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "poolside/Laguna-XS.2"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    dtype=torch.bfloat16,
    device_map="auto",
)

prompt = "The capital of France is"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
generated = model.generate(**inputs, max_new_tokens=20, do_sample=False)
print(tokenizer.decode(generated[0], skip_special_tokens=True))
```

## Notes

- **Attention backends.** SDPA (default), FlashAttention-2, and flex attention are supported. Attention-output gating is applied outside the kernel call and therefore works with all backends.
- **`num_attention_heads_per_layer`.** When provided, its length must equal `num_hidden_layers`. Each entry must be divisible by `num_key_value_heads`.
- **`layer_types`.** Defaults to `["full_attention"] * num_hidden_layers` when left unset. To enable sliding-window attention, pass a list of `"full_attention"` / `"sliding_attention"` values.
- **`mlp_layer_types`.** Per-layer MLP type, values `"dense"` or `"sparse"`. Length must equal `num_hidden_layers`. Defaults to `["dense"] + ["sparse"] * (num_hidden_layers - 1)` (first layer dense, rest MoE) when left unset.
- **`moe_apply_router_weight_on_input=True`** is not currently supported alongside the fused experts kernel (`grouped_mm_experts_forward`); `validate_architecture` raises at config-construction time. Set it to `False` (the default).

## LagunaConfig

[[autodoc]] LagunaConfig

## LagunaModel

[[autodoc]] LagunaModel
    - forward

## LagunaForCausalLM

[[autodoc]] LagunaForCausalLM
    - forward
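
## Router selection sketch

The sigmoid router described at the top of this page adds `e_score_correction_bias` only when choosing experts, not when weighting their outputs. The snippet below is a minimal, dependency-free sketch of that idea for a single token; it is not the library implementation. The function name `route_token`, the `top_k` argument, the `softcap * tanh(x / softcap)` form of soft-capping, and the renormalization of the selected weights are all assumptions made for illustration.

```python
import math


def route_token(gate_logits, score_bias, top_k, softcap=None):
    """Illustrative sigmoid routing for one token (hypothetical helper).

    gate_logits: one float per expert.
    score_bias:  per-expert bias, used only to pick experts.
    softcap:     optional soft-capping value (tanh form is an assumption).
    """
    if softcap is not None:
        # Assumed soft-capping: squash logits smoothly into (-softcap, softcap).
        gate_logits = [softcap * math.tanh(x / softcap) for x in gate_logits]

    # Element-wise sigmoid of the gate logits gives the unbiased scores.
    scores = [1.0 / (1.0 + math.exp(-x)) for x in gate_logits]

    # The bias shifts which experts are selected...
    biased = [s + b for s, b in zip(scores, score_bias)]
    selected = sorted(range(len(scores)), key=lambda i: biased[i], reverse=True)[:top_k]

    # ...but the mixing weights come from the unbiased scores.
    # Renormalizing to sum to 1 is an assumption of this sketch.
    total = sum(scores[i] for i in selected)
    weights = [scores[i] / total for i in selected]
    return selected, weights


logits = [2.0, 1.0, 0.0, -1.0]
no_bias = route_token(logits, [0.0, 0.0, 0.0, 0.0], top_k=2)
with_bias = route_token(logits, [0.0, 0.0, 0.5, 0.0], top_k=2)
print(no_bias[0], with_bias[0])
```

In this example the bias on expert 2 is enough to pull it into the top-2 set, yet its mixing weight is still derived from its unbiased sigmoid score, so it contributes less than expert 0. This separation is what lets the bias steer load across experts without changing how the selected experts' outputs are weighted.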