*This model was contributed to Hugging Face Transformers on 2025-08-05.*

# GptOss

[GptOss](https://openai.com/index/introducing-gpt-oss/) is a sparse mixture-of-experts (MoE) language model from OpenAI that routes each token to 4 experts, out of 128 in gpt-oss-120b and 32 in gpt-oss-20b. It uses attention sinks (learnable auxiliary tokens appended to each attention head) and YaRN rotary embeddings for sequences up to 131k tokens.

The example below demonstrates how to generate text with [`Pipeline`] or the [`AutoModelForCausalLM`] class.

```python
from transformers import pipeline

pipe = pipeline(
    task="text-generation",
    model="openai/gpt-oss-20b",
)
pipe("Plants create energy through a process known as")
```

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("openai/gpt-oss-20b")
model = AutoModelForCausalLM.from_pretrained(
    "openai/gpt-oss-20b",
    device_map="auto",
)

# The tokenizer returns input_ids and attention_mask; move both to the model's device.
inputs = tokenizer("Plants create energy through a process known as", return_tensors="pt").to(model.device)

output = model.generate(**inputs, max_new_tokens=50)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```

## Notes

- SDPA is not supported because attention sinks require direct access to the full attention logits before softmax. Use Flash Attention or Flex Attention instead.
- When using Flex Attention, attention sinks require special handling. The `score_mod` function operates on individual score elements rather than the full attention matrix, so sink renormalization is applied after the attention computation, using the log-sum-exp (LSE) values returned by Flex Attention.

## GptOssConfig

[[autodoc]] GptOssConfig

## GptOssModel

[[autodoc]] GptOssModel
    - forward

## GptOssForCausalLM

[[autodoc]] GptOssForCausalLM
    - forward

## GptOssForSequenceClassification

[[autodoc]] GptOssForSequenceClassification
    - forward

## GptOssForTokenClassification

[[autodoc]] GptOssForTokenClassification
    - forward
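
As a rough illustration of the Flex Attention note in the Notes section, the sketch below shows why sink renormalization can be applied after the fact. It assumes the sink acts as one extra logit in the softmax denominator whose own probability mass is discarded; under that assumption, an ordinary softmax rescaled by `sigmoid(lse - sink)` gives the same weights. The function names and toy values are hypothetical, and this is plain Python for a single query row, not the library's implementation.

```python
import math


def softmax_with_sink(scores, sink):
    """Softmax over the scores plus one extra sink logit; the sink's share is dropped."""
    m = max(max(scores), sink)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    denom = sum(exps) + math.exp(sink - m)
    return [e / denom for e in exps]


def plain_softmax_and_lse(scores):
    """Ordinary softmax plus the log-sum-exp of the scores (what a fused kernel might return)."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps], m + math.log(total)


def renormalize_with_sink(probs, lse, sink):
    """Rescale sink-free weights by sigmoid(lse - sink) to account for the sink logit."""
    factor = 1.0 / (1.0 + math.exp(sink - lse))
    return [p * factor for p in probs]


# Toy attention scores for one query position and one head (illustrative values).
scores = [2.0, 0.5, -1.0, 1.5]
sink = 0.3

direct = softmax_with_sink(scores, sink)
probs, lse = plain_softmax_and_lse(scores)
post_hoc = renormalize_with_sink(probs, lse, sink)

print(direct)
print(post_hoc)
```

Because the attention output is a weighted sum of value vectors, scaling the weights by this factor is equivalent to scaling the output itself, which is why only the LSE is needed and not the full attention matrix.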