This model was contributed to Hugging Face Transformers on 2026-06-03.

Gemma4 Unified

Overview

Gemma 4 12B Unified is an encoder-free multimodal model with pretrained and instruction-tuned variants. Unlike standard Gemma 4, which uses dedicated encoder towers, Gemma 4 12B Unified projects raw inputs directly into the language model's embedding space through lightweight linear pipelines. This results in a simpler architecture while maintaining strong multimodal performance.

Key differences from standard Gemma 4:

No Vision Tower: Raw pixel patches are projected directly into LM space via a LayerNorm → Dense → LayerNorm pipeline with factorized 2D positional embeddings, replacing the vision encoder.
No Audio Tower: Raw 16 kHz waveform samples are chunked into fixed-length frames and projected through a simple RMSNorm → Linear pipeline, replacing the mel spectrogram + Conformer encoder.
Shared Multimodal Pipeline: Both vision and audio use the same Gemma4UnifiedMultimodalEmbedder (RMSNorm → Linear) for the final projection to text hidden space.
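As a rough illustration of the RMSNorm → Linear projection named above, here is a minimal pure-Python sketch. It is not the library's implementation: the function names, the epsilon value, the plain per-channel weight, and the bias-free linear layer are all assumptions made for the example.

```python
import math

def rms_norm(x, weight, eps=1e-6):
    """Divide a vector by its root-mean-square, then scale per channel.

    The eps value and the plain multiplicative weight are assumptions.
    """
    mean_sq = sum(v * v for v in x) / len(x)
    inv_rms = 1.0 / math.sqrt(mean_sq + eps)
    return [v * inv_rms * w for v, w in zip(x, weight)]

def linear(x, matrix):
    """Multiply a length-d_in vector by a (d_out x d_in) matrix, no bias (assumed)."""
    return [sum(w * v for w, v in zip(row, x)) for row in matrix]

def multimodal_embed(x, norm_weight, proj_matrix):
    """Apply RMSNorm, then Linear, in the order the text gives."""
    return linear(rms_norm(x, norm_weight), proj_matrix)

# Toy sizes: a 4-dim modality feature projected to a 2-dim "text hidden size".
feature = [3.0, 4.0, 0.0, 0.0]
norm_weight = [1.0, 1.0, 1.0, 1.0]
proj_matrix = [
    [1.0, 0.0, 0.0, 0.0],
    [0.0, 1.0, 0.0, 0.0],
]
embedding = multimodal_embed(feature, norm_weight, proj_matrix)
```

The same two-step function would serve both modalities; only the input feature size differs between image patches and audio frames.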

You can find the original Gemma 4 12B Unified checkpoints under the Gemma 4 release.

Encoder-Free Vision Pipeline

The key architectural difference from standard Gemma 4 is the removal of the vision encoder tower. Instead, Gemma 4 12B Unified processes images through a lightweight pipeline:

Patchification: Images are split into 16×16 pixel patches
Patch Merging: Each 3×3 block of adjacent patches is merged into one 48×48 model patch with 48² × 3 = 6,912 raw pixel values
Projection: LayerNorm → Dense → LayerNorm projects each merged patch into the LM embedding dimension
Positional Embedding: Factorized 2D positional embeddings are added (separate learned embeddings for x and y axes, summed together)
Final Norm: A final LayerNorm is applied
Multimodal Embedder: RMSNorm → Linear projects to the text hidden size
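The shape arithmetic and the factorized positional embedding from the steps above can be sketched as follows. This is an illustrative sketch only: the helper names and the toy embedding tables are hypothetical, and the real model uses learned tables at the LM embedding dimension.

```python
PATCH_SIZE = 16                          # pixels per side of a base patch (from the text)
POOL_KERNEL = 3                          # base patches merged per side (from the text)
MERGED_SIZE = PATCH_SIZE * POOL_KERNEL   # 48 pixels per side
CHANNELS = 3                             # RGB

def merged_grid(height, width):
    """Return (rows, cols) of 48x48 merged patches; assumes valid dimensions."""
    return height // MERGED_SIZE, width // MERGED_SIZE

def factorized_positions(rows, cols, emb_y, emb_x):
    """Sum per-axis embeddings: position (r, c) gets emb_y[r] + emb_x[c]."""
    return [
        [[a + b for a, b in zip(emb_y[r], emb_x[c])] for c in range(cols)]
        for r in range(rows)
    ]

rows, cols = merged_grid(96, 144)                  # -> 2 rows, 3 cols
features_per_patch = MERGED_SIZE ** 2 * CHANNELS   # 48 * 48 * 3 = 6912

# Toy 2-dim tables standing in for the learned per-axis embeddings.
emb_y = [[0.0, 10.0], [1.0, 11.0]]
emb_x = [[0.0, 100.0], [2.0, 102.0], [4.0, 104.0]]
pos = factorized_positions(rows, cols, emb_y, emb_x)
```

Factorizing this way needs only rows + cols embedding vectors rather than rows × cols, which is one plausible reason for the design.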

Like standard Gemma 4, the model handles images of different sizes within a fixed token budget. The same constraints apply:

The total number of pixels must fit within a patch budget
Both height and width must be divisible by 48 (= patch size 16 × pooling kernel 3)
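A minimal sketch of how these two constraints might be checked before an image is passed to the model. The function name is hypothetical, and the check assumes one soft token per 48×48 merged patch; the actual image processor may resize or pad images rather than reject them.

```python
MERGED_SIZE = 48  # patch size 16 x pooling kernel 3, per the text

def fits_budget(height, width, max_soft_tokens=280):
    """Return True if both sides are divisible by 48 and the image fits the budget.

    Assumes each 48x48 merged patch yields exactly one soft token.
    """
    if height % MERGED_SIZE or width % MERGED_SIZE:
        return False
    soft_tokens = (height // MERGED_SIZE) * (width // MERGED_SIZE)
    return soft_tokens <= max_soft_tokens

ok_square = fits_budget(768, 768)    # 16 x 16 = 256 merged patches
too_large = fits_budget(960, 960)    # 20 x 20 = 400 merged patches
misaligned = fits_budget(770, 768)   # 770 is not a multiple of 48
```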

Important

Gemma 4 12B Unified does not apply mean/std normalization. The model's own patch embedding layer handles the final scaling internally.

The number of soft tokens per image is configurable. The supported options are listed below; the default is 280 soft tokens.

Soft Tokens	Patches (before pooling)	Approx. Image Area
70	630	~161K pixels
140	1,260	~323K pixels
280	2,520	~645K pixels
560	5,040	~1.3M pixels
1,120	10,080	~2.6M pixels
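The table values follow from the patch size and pooling kernel given earlier: each soft token covers 3 × 3 = 9 base patches, and each base patch covers 16 × 16 = 256 pixels. A short sketch that reproduces the rows (the function name is illustrative):

```python
PATCH_SIZE = 16   # pixels per side of a base patch
POOL_KERNEL = 3   # base patches merged per side

def budget_row(soft_tokens):
    """Return (patches before pooling, total pixels) for a soft-token budget."""
    patches = soft_tokens * POOL_KERNEL ** 2
    pixels = patches * PATCH_SIZE ** 2
    return patches, pixels

table = {n: budget_row(n) for n in (70, 140, 280, 560, 1120)}
for soft_tokens, (patches, pixels) in table.items():
    print(f"{soft_tokens:>5} {patches:>6} {pixels:>9}")
```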

Encoder-Free Audio Pipeline

The audio pipeline is similarly simplified. Instead of computing mel spectrograms and processing them through a Conformer encoder, raw 16 kHz waveform samples are:

Chunked into fixed-length frames of 640 samples each (40 ms per frame at 16 kHz)
Projected directly through RMSNorm → Linear via the shared Gemma4UnifiedMultimodalEmbedder

Since there is no downsampling, the number of output soft tokens equals the number of input frames: ceil(num_samples / 640).
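The framing step and the token count can be sketched in a few lines. The helper names are hypothetical, and zero-padding the final partial frame is an assumption; the text only specifies the frame length and the ceil formula.

```python
import math

FRAME_LENGTH = 640  # samples per frame, i.e. 40 ms at 16 kHz

def num_audio_tokens(num_samples):
    """Number of soft tokens for a waveform: ceil(num_samples / 640)."""
    return math.ceil(num_samples / FRAME_LENGTH)

def chunk_waveform(samples, pad_value=0.0):
    """Split raw samples into 640-sample frames, padding the last frame.

    Zero padding is an assumption, not something the text confirms.
    """
    n_frames = num_audio_tokens(len(samples))
    missing = n_frames * FRAME_LENGTH - len(samples)
    padded = list(samples) + [pad_value] * missing
    return [
        padded[i * FRAME_LENGTH:(i + 1) * FRAME_LENGTH]
        for i in range(n_frames)
    ]

one_second = num_audio_tokens(16_000)   # 16000 / 640 = 25 frames
frames = chunk_waveform([1.0] * 1000)   # 2 frames, the second one padded
```

Under this sketch, one second of 16 kHz audio costs 25 soft tokens, so a one-minute clip would cost 1,500.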

Usage examples

The examples below demonstrate how to generate text from an image or an audio sample with [Pipeline] or the [AutoModel] class.

from transformers import pipeline


pipe = pipeline(
    task="any-to-any",
    model="google/gemma-4-12B-it",
)

image_messages = [
    {
        "role": "user",
        "content": [
            {
                "type": "image",
                "url": "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/pipeline-cat-chonk.jpeg"
            },
            {
                "type": "text",
                "text": "What is shown in this image?"
            }
        ]
    }
]

image_output = pipe(image_messages, return_full_text=False)
print(image_output[0]["generated_text"])

audio_messages = [
    {
        "role": "user",
        "content": [
            {"type": "text", "text": "Please transcribe the following audio:"},
            {
                "type": "audio",
                "url": "https://huggingface.co/datasets/eustlb/audio-samples/resolve/main/bcn_weather.mp3",
            },
        ],
    }
]

audio_output = pipe(audio_messages, return_full_text=False)
print(audio_output[0]["generated_text"])

Image

from transformers import AutoModelForMultimodalLM, AutoProcessor


model = AutoModelForMultimodalLM.from_pretrained(
    "google/gemma-4-12B-it",
    device_map="auto"
)
processor = AutoProcessor.from_pretrained(
    "google/gemma-4-12B-it"
)

messages = [
    {
        "role": "user", "content": [
            {"type": "image", "url": "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/pipeline-cat-chonk.jpeg"},
            {"type": "text", "text": "What is shown in this image?"},
        ]
    },
]
inputs = processor.apply_chat_template(
    messages,
    tokenize=True,
    return_dict=True,
    return_tensors="pt",
    add_generation_prompt=True,
).to(model.device)
input_len = inputs["input_ids"].shape[-1]

output = model.generate(**inputs, max_new_tokens=50, cache_implementation="static")
print(processor.decode(output[0][input_len:], skip_special_tokens=True))

Audio

from transformers import AutoModelForMultimodalLM, AutoProcessor


messages = [
    {
        "role": "user",
        "content": [
            {"type": "text", "text": "Please transcribe the following audio:"},
            {
                "type": "audio",
                "url": "https://huggingface.co/datasets/eustlb/audio-samples/resolve/main/bcn_weather.mp3",
            },
        ],
    }
]

model = AutoModelForMultimodalLM.from_pretrained(
    "google/gemma-4-12B-it",
    device_map="auto"
)
processor = AutoProcessor.from_pretrained(
    "google/gemma-4-12B-it"
)

inputs = processor.apply_chat_template(
    messages,
    tokenize=True,
    add_generation_prompt=True,
    return_dict=True,
    return_tensors="pt"
).to(model.device, dtype=model.dtype)

input_len = inputs["input_ids"].shape[-1]

outputs = model.generate(**inputs, max_new_tokens=200)
print(processor.decode(outputs[0][input_len:], skip_special_tokens=False))

Gemma4UnifiedAudioConfig

autodoc Gemma4UnifiedAudioConfig

Gemma4UnifiedConfig

autodoc Gemma4UnifiedConfig

Gemma4UnifiedTextConfig

autodoc Gemma4UnifiedTextConfig

Gemma4UnifiedVisionConfig

autodoc Gemma4UnifiedVisionConfig

Gemma4UnifiedAudioFeatureExtractor

autodoc Gemma4UnifiedAudioFeatureExtractor - __call__

Gemma4UnifiedImageProcessor

autodoc Gemma4UnifiedImageProcessor

Gemma4UnifiedVideoProcessor

autodoc Gemma4UnifiedVideoProcessor

Gemma4UnifiedProcessor

autodoc Gemma4UnifiedProcessor - __call__

Gemma4UnifiedPreTrainedModel

autodoc Gemma4UnifiedPreTrainedModel - forward

Gemma4UnifiedModel

autodoc Gemma4UnifiedModel - forward

Gemma4UnifiedTextModel

autodoc Gemma4UnifiedTextModel - forward

Gemma4UnifiedForCausalLM

autodoc Gemma4UnifiedForCausalLM

Gemma4UnifiedForConditionalGeneration

autodoc Gemma4UnifiedForConditionalGeneration - forward
