first commit

2026-06-05 16:53:03 +08:00
commit 06f1fd69a6
6047 changed files with 1895387 additions and 0 deletions
--- a/docs/source/en/model_doc/gemma4_unified_assistant.md
+++ b/docs/source/en/model_doc/gemma4_unified_assistant.md
@@ -0,0 +1,123 @@
+<!--Copyright 2026 the HuggingFace Team. All rights reserved.
+
+Licensed under the Apache License, Version 2.0 (the "License");
+you may not use this file except in compliance with the License.
+You may obtain a copy of the License at
+
+    http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing, software
+distributed under the License is distributed on an "AS IS" BASIS,
+WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+See the License for the specific language governing permissions and
+limitations under the License.
+
+
+⚠️ Note that this file is in Markdown but contains specific syntax for our doc-builder (similar to MDX) that may not be rendered properly in your Markdown viewer.
+
+-->
+*This model was contributed to Hugging Face Transformers on 2026-06-03.*
+
+
+# Gemma 4 Unified Assistant
+
+## Overview
+
+Gemma 4 Unified Assistant is a small, text-only model that enables speculative decoding for Gemma 4 Unified models using the
+Multi-Token Prediction (MTP) method and associated candidate generator. Pre-trained models are provided for the IT
+variants of the Gemma 4 12B model.
+
+For more information, please see [Gemma4 Assistant](./gemma4_assistant.md). The two assistants share the same concepts
+and the same architectural differences from their base models:
+
+*   **The entire model uses KV sharing**. This technique, originally introduced with [Gemma 3n](./gemma3n), lets the
+    assistant reuse the KV cache populated by the target model it supports. The assistant can therefore skip the
+    pre-fill phase entirely, which considerably reduces attention compute during the forward pass.
+*   **The `position_ids` values are constant**. Since the KV cache is shared and the assistant has no means of
+    updating the cache, the assistant predicts all tokens from the same position ID.
+*   **Inputs are the concatenation of embeddings and hidden states**. To adapt to the static KV cache and
+    `position_ids`, the model takes its inputs as the concatenation of the `embedding` and `hidden_states` for the last
+    seen token from the target model and projects them into assistant model space with an `nn.Linear` transform. The
+    definition of last seen token changes throughout the assisted decoding loop. For the first token drafted after
+    pre-fill, the last seen token will be the last token from the prompt. For subsequent drafting steps, the last seen
+    token will be the last token generated by the assistant (within a drafting round) or the last token accepted by the
+    target model (between drafting rounds).
+*   **Cross-attention is used to make the most of the target model's context**. Cross-attention allows the query states
+    generated by the assistant to attend to the shared KV cache values from the target model, so the assistant can
+    accurately predict more drafted tokens per drafting round.
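+
+The "last seen token" bookkeeping described above can be illustrated with a toy, pure-Python sketch. This is not the
+Transformers implementation: `toy_target_next`, `toy_draft_step`, `draft_round`, `verify` and `assisted_generate` are
+hypothetical helpers, tokens are plain integers, the hidden states that the real assistant also receives are omitted,
+greedy verification is assumed, and the choice of `position_id` (the index of the last cached token) is an assumption
+made only for illustration.
+
+```python
+from typing import List
+
+
+def toy_target_next(context: List[int]) -> int:
+    """Hypothetical target model: greedily predicts last token + 1 (mod 10)."""
+    return (context[-1] + 1) % 10
+
+
+def toy_draft_step(last_seen: int, position_id: int) -> int:
+    """Hypothetical assistant: agrees with the target except after token 4.
+
+    `position_id` is accepted but unused here; it only marks that every
+    drafting step in a round is issued from the same position.
+    """
+    return 0 if last_seen == 4 else (last_seen + 1) % 10
+
+
+def draft_round(draft_step, last_seen, position_id, num_draft_tokens) -> List[int]:
+    drafted = []
+    for _ in range(num_draft_tokens):
+        # Within a round, the last seen token is the assistant's previous draft,
+        # while position_id stays constant because the shared cache is not updated.
+        last_seen = draft_step(last_seen, position_id)
+        drafted.append(last_seen)
+    return drafted
+
+
+def verify(target_next, context, drafted) -> List[int]:
+    """Greedy verification: keep matching drafts, replace the first mismatch."""
+    accepted = []
+    for token in drafted:
+        expected = target_next(context + accepted)
+        if token != expected:
+            accepted.append(expected)  # the target's token replaces the bad draft
+            return accepted
+        accepted.append(token)
+    accepted.append(target_next(context + accepted))  # all drafts accepted: bonus token
+    return accepted
+
+
+def assisted_generate(prompt, target_next, draft_step, max_new_tokens, num_draft_tokens=3):
+    tokens = list(prompt)
+    while len(tokens) - len(prompt) < max_new_tokens:
+        # Between rounds, the last seen token is the last one accepted by the
+        # target (or the last prompt token right after pre-fill).
+        last_seen = tokens[-1]
+        position_id = len(tokens) - 1  # assumption: index of the last cached token
+        drafted = draft_round(draft_step, last_seen, position_id, num_draft_tokens)
+        tokens.extend(verify(target_next, tokens, drafted))
+    return tokens[: len(prompt) + max_new_tokens]
+
+
+output = assisted_generate([1, 2], toy_target_next, toy_draft_step, max_new_tokens=6)
+print(output)
+```
+
+In this toy run the first round drafts `[3, 4, 0]`, the target rejects the third draft and supplies its own token, and
+the next round starts from that accepted token. Because verification is greedy, the final sequence is the one the toy
+target would produce on its own. The toy calls the target once per token for clarity, whereas a real target model
+scores all drafted tokens in a single forward pass.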
+
+## Usage examples
+
+The examples below demonstrate how to generate text from an image with assisted decoding, using either [`Pipeline`] or the [`AutoModel`] class.
+
+<hfoptions id="usage">
+<hfoption id="Pipeline">
+
+```py
+import torch
+from transformers import pipeline
+
+pipeline = pipeline(
+    task="image-text-to-text",
+    model="google/gemma-4-12B-it",
+    assistant_model="google/gemma-4-12B-it-assistant",
+)
+pipeline(
+    images="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/pipeline-cat-chonk.jpeg",
+    text="<|image|>\n\nWhat is shown in this image?"
+)
+```
+
+</hfoption>
+<hfoption id="AutoModel">
+
+```py
+import torch
+from transformers import AutoModelForCausalLM, AutoModelForImageTextToText, AutoProcessor
+
+model = AutoModelForImageTextToText.from_pretrained(
+    "google/gemma-4-12B-it",
+    dtype=torch.bfloat16,
+    device_map="auto",
+)
+assistant_model = AutoModelForCausalLM.from_pretrained(
+    "google/gemma-4-12B-it-assistant",
+    dtype=torch.bfloat16,
+    device_map="auto",
+)
+
+processor = AutoProcessor.from_pretrained(
+    "google/gemma-4-12B-it",
+    padding_side="left"
+)
+messages = [
+    {
+        "role": "user", "content": [
+            {"type": "image", "url": "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/pipeline-cat-chonk.jpeg"},
+            {"type": "text", "text": "What is shown in this image?"},
+        ]
+    },
+]
+inputs = processor.apply_chat_template(
+    messages,
+    tokenize=True,
+    return_dict=True,
+    return_tensors="pt",
+    add_generation_prompt=True,
+).to(model.device)
+input_len = inputs["input_ids"].shape[-1]
+
+output = model.generate(**inputs, max_new_tokens=50, assistant_model=assistant_model)
+print(processor.decode(output[0][input_len:], skip_special_tokens=True))
+```
+
+</hfoption>
+</hfoptions>
+
+## Gemma4UnifiedAssistantConfig
+
+[[autodoc]] Gemma4UnifiedAssistantConfig
+
+## Gemma4UnifiedAssistantForCausalLM
+
+[[autodoc]] Gemma4UnifiedAssistantForCausalLM