*This model was contributed to Hugging Face Transformers on 2026-05-05.*

# Gemma 4 Assistant

## Overview

Gemma 4 Assistant is a small, text-only model that enables speculative decoding for Gemma 4 models using the Multi-Token Prediction (MTP) method and its associated candidate generator. Pre-trained models are provided for the IT variants of the Gemma 4 E2B, E4B, 31B and 26B-A4B (MoE) models.

Architecturally, the Gemma 4 Assistant shares the same [`Gemma4TextModel` backbone](gemma4#transformers.Gemma4TextModel) as other Gemma 4 models, but differs in a few key ways:

* **The entire model uses KV sharing**. This technique, originally introduced with [Gemma 3n](./gemma3n), allows the assistant to reuse the KV cache populated by the target model it supports. The assistant can therefore skip the pre-fill phase entirely, which considerably reduces attention compute during the forward pass.
* **The `position_ids` values are constant**. Since the KV cache is shared and the assistant does not have a means of updating the cache, the assistant predicts all tokens from the same position ID.
* **Inputs are the concatenation of embeddings and hidden states**. To adapt to the static KV cache and `position_ids`, the model takes as input the concatenation of the `embedding` and `hidden_states` for the last seen token from the target model and projects them into the assistant model space with an `nn.Linear` transform. The definition of the last seen token changes throughout the assisted decoding loop. For the first token drafted after pre-fill, the last seen token is the last token of the prompt. For subsequent drafting steps, the last seen token is the last token generated by the assistant (within a drafting round) or the last token accepted by the target model (between drafting rounds).
* **Cross-attention is used to make the most of the target model's context**.
Cross-attention allows the query states generated by the assistant to attend to the shared KV cache values from the target model, so the assistant can accurately predict more drafted tokens per drafting round.

You can find all the original Gemma 4 Assistant checkpoints under the [Gemma 4](https://huggingface.co/collections/google/gemma-4) release.

## Usage examples

The examples below demonstrate how to generate text based on an image with [`Pipeline`] or the [`AutoModel`] class, using the assistant model for speculative decoding.

```py
from transformers import pipeline

# Use a distinct variable name so the `pipeline` factory function is not shadowed.
pipe = pipeline(
    task="image-text-to-text",
    model="google/gemma-4-E2B-it",
    assistant_model="google/gemma-4-E2B-it-assistant",
)
pipe(
    images="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/pipeline-cat-chonk.jpeg",
    text="<|image|>\n\nWhat is shown in this image?"
)
```

```py
import torch
from transformers import AutoModelForCausalLM, AutoModelForImageTextToText, AutoProcessor

# Target model: the full multimodal Gemma 4 checkpoint.
model = AutoModelForImageTextToText.from_pretrained(
    "google/gemma-4-E2B-it",
    dtype=torch.bfloat16,
    device_map="auto",
)
# Assistant model: the text-only drafter paired with the target checkpoint.
assistant_model = AutoModelForCausalLM.from_pretrained(
    "google/gemma-4-E2B-it-assistant",
    dtype=torch.bfloat16,
    device_map="auto",
)
processor = AutoProcessor.from_pretrained(
    "google/gemma-4-E2B-it",
    padding_side="left"
)

messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "url": "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/pipeline-cat-chonk.jpeg"},
            {"type": "text", "text": "What is shown in this image?"},
        ]
    },
]
inputs = processor.apply_chat_template(
    messages,
    tokenize=True,
    return_dict=True,
    return_tensors="pt",
    add_generation_prompt=True,
).to(model.device)
input_len = inputs["input_ids"].shape[-1]

# Passing `assistant_model` enables assisted (speculative) decoding.
output = model.generate(**inputs, max_new_tokens=50, assistant_model=assistant_model)
# Decode only the newly generated tokens, skipping the prompt.
print(processor.decode(output[0][input_len:], skip_special_tokens=True))
```

## Gemma4AssistantConfig

[[autodoc]] Gemma4AssistantConfig

## Gemma4AssistantForCausalLM

[[autodoc]] Gemma4AssistantForCausalLM
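
## Drafting loop sketch

The overview describes how the "last seen token" changes during assisted decoding. The toy loop below is a minimal sketch of that rule only, not the Transformers implementation. The names `last_seen_token`, `toy_target`, `toy_assistant` and `assisted_decode` are hypothetical, the "models" are simple arithmetic functions over integer token IDs, and verification is a simplified greedy check that stops a round at the first mismatch.

```python
def last_seen_token(prompt_ids, accepted_ids, drafted_ids):
    """Pick the token the assistant conditions on for its next draft step."""
    if drafted_ids:        # within a drafting round: last token the assistant drafted
        return drafted_ids[-1]
    if accepted_ids:       # between rounds: last token accepted by the target
        return accepted_ids[-1]
    return prompt_ids[-1]  # first draft after pre-fill: last prompt token


def toy_target(token):
    """Hypothetical target model: the 'correct' next token is token + 1 (mod 10)."""
    return (token + 1) % 10


def toy_assistant(token):
    """Hypothetical assistant: agrees with the target except on multiples of 4."""
    return (token + 1) % 10 if token % 4 else (token + 2) % 10


def assisted_decode(prompt_ids, max_new_tokens, draft_len=3):
    accepted, trace = [], []
    while len(accepted) < max_new_tokens:
        # Drafting round: the assistant proposes `draft_len` tokens.
        drafted = []
        for _ in range(draft_len):
            seen = last_seen_token(prompt_ids, accepted, drafted)
            trace.append(seen)  # record which token each draft step saw
            drafted.append(toy_assistant(seen))

        # Verification: keep the target's token at each position and
        # end the round at the first position where the draft disagrees.
        context_last = (accepted or prompt_ids)[-1]
        for token in drafted:
            expected = toy_target(context_last)
            accepted.append(expected)
            context_last = expected
            if token != expected or len(accepted) >= max_new_tokens:
                break
    return accepted[:max_new_tokens], trace


tokens, seen_trace = assisted_decode([1, 2, 3], max_new_tokens=5)
print(tokens)      # tokens kept after verification
print(seen_trace)  # last seen token at each draft step
```

In this toy run, the first draft step sees the last prompt token, the later steps of a round see the assistant's own previous draft, and the first step of the next round sees the last token the target accepted.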