This model was published in HF papers on 2025-07-10 and contributed to Hugging Face Transformers on 2025-11-12.

Audio Flamingo 3

Overview

Audio Flamingo 3 (AF3) is a fully open large audio–language model designed for robust understanding and reasoning over speech, environmental sounds, and music. AF3 pairs a Whisper-style audio encoder with a causal language model and performs replace-in-place audio–text fusion: the processor aligns post-pool audio frames to a dedicated placeholder token and the model replaces those token slots with projected audio embeddings during the forward pass.

The model checkpoint is available at: nvidia/audio-flamingo-3-hf

Highlights:

Unified audio encoder across speech, sound, and music.
Long-audio support via windowing and post-pool alignment. The model processes audio in 30-second windows with a hard limit of 20 windows (10 minutes total); audio longer than 10 minutes is truncated.
Deterministic fusion that preserves sequence length by replacing audio placeholder tokens with audio embeddings.

This model was contributed by Lasha Koroshinadze and Eric Bezzam.

Paper

Audio Flamingo 3: Advancing Audio Intelligence with Fully Open Large Audio Language Models
A. Goel, S. Ghosh, J. Kim, S. Kumar, Z. Kong, S. Lee, C.-H. H. Yang, R. Duraiswami, D. Manocha, R. Valle, B. Catanzaro
NVIDIA and University of Maryland
Project: https://research.nvidia.com/labs/adlr/AF3/

Usage

Audio Instruct Mode

The model supports audio-text instructions, including multi-turn interactions, and all of them can be processed in batches.

➡️ audio + text instruction

from transformers import AudioFlamingo3ForConditionalGeneration, AutoProcessor


model_id = "nvidia/audio-flamingo-3-hf"
processor = AutoProcessor.from_pretrained(model_id)
model = AudioFlamingo3ForConditionalGeneration.from_pretrained(model_id, device_map="auto")

conversation = [
    {
        "role": "user",
        "content": [
            {"type": "text", "text": "Transcribe the input speech."},
            {"type": "audio", "path": "https://huggingface.co/datasets/nvidia/AudioSkills/resolve/main/assets/WhDJDIviAOg_120_10.mp3"},
        ],
    }
]

inputs = processor.apply_chat_template(
    conversation,
    tokenize=True,
    add_generation_prompt=True,
    return_dict=True,
).to(model.device, dtype=model.dtype)

outputs = model.generate(**inputs, max_new_tokens=500)

decoded_outputs = processor.decode(outputs[:, inputs.input_ids.shape[1]:], skip_special_tokens=True)
print(decoded_outputs)

➡️ multi-turn:

from transformers import AudioFlamingo3ForConditionalGeneration, AutoProcessor


model_id = "nvidia/audio-flamingo-3-hf"
processor = AutoProcessor.from_pretrained(model_id)
model = AudioFlamingo3ForConditionalGeneration.from_pretrained(model_id, device_map="auto")

conversation = [
    {
        "role": "user",
        "content": [
            {
                "type": "text",
                "text": "Instruction: How does the tone of female speech change throughout the audio? Choose the correct option among the options below: (A) Sad to happy (B) Happy to sad (C) Neutral to happy (D) Happy to neutral.",
            },
            {"type": "audio", "path": "https://huggingface.co/datasets/nvidia/AudioSkills/resolve/main/assets/000000786159.31.wav"},
        ],
    },
    {
        "role": "assistant",
        "content": [{"type": "text", "text": "(A) Sad to happy"}],
    },
    {
        "role": "user",
        "content": [
            {"type": "text", "text": "Why do you think so?"},
        ],
    },
]

inputs = processor.apply_chat_template(
    conversation,
    tokenize=True,
    add_generation_prompt=True,
    return_dict=True,
).to(model.device, dtype=model.dtype)

outputs = model.generate(**inputs, max_new_tokens=500)

decoded_outputs = processor.decode(outputs[:, inputs.input_ids.shape[1]:], skip_special_tokens=True)
print(decoded_outputs)

➡️ text only:

from transformers import AudioFlamingo3ForConditionalGeneration, AutoProcessor


model_id = "nvidia/audio-flamingo-3-hf"
processor = AutoProcessor.from_pretrained(model_id)
model = AudioFlamingo3ForConditionalGeneration.from_pretrained(model_id, device_map="auto")

conversation = [
    {
        "role": "user",
        "content": [
            {"type": "text", "text": "What is the capital of France?"},
        ],
    }
]

inputs = processor.apply_chat_template(
    conversation,
    tokenize=True,
    add_generation_prompt=True,
    return_dict=True,
).to(model.device, dtype=model.dtype)

outputs = model.generate(**inputs, max_new_tokens=500)

decoded_outputs = processor.decode(outputs[:, inputs.input_ids.shape[1]:], skip_special_tokens=True)
print(decoded_outputs)

➡️ audio only:

from transformers import AudioFlamingo3ForConditionalGeneration, AutoProcessor


model_id = "nvidia/audio-flamingo-3-hf"
processor = AutoProcessor.from_pretrained(model_id)
model = AudioFlamingo3ForConditionalGeneration.from_pretrained(model_id, device_map="auto")

conversation = [
    {
        "role": "user",
        "content": [
            {"type": "audio", "path": "https://huggingface.co/datasets/nvidia/AudioSkills/resolve/main/assets/WhDJDIviAOg_120_10.mp3"},
        ],
    }
]

inputs = processor.apply_chat_template(
    conversation,
    tokenize=True,
    add_generation_prompt=True,
    return_dict=True,
).to(model.device, dtype=model.dtype)

outputs = model.generate(**inputs, max_new_tokens=500)

decoded_outputs = processor.decode(outputs[:, inputs.input_ids.shape[1]:], skip_special_tokens=True)
print(decoded_outputs)

➡️ batched inference!

from transformers import AudioFlamingo3ForConditionalGeneration, AutoProcessor


model_id = "nvidia/audio-flamingo-3-hf"
processor = AutoProcessor.from_pretrained(model_id)
model = AudioFlamingo3ForConditionalGeneration.from_pretrained(model_id, device_map="auto")

conversations = [
    [
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "Transcribe the input speech."},
                {
                    "type": "audio",
                    "path": "https://huggingface.co/datasets/nvidia/AudioSkills/resolve/main/assets/t_837b89f2-26aa-4ee2-bdf6-f73f0dd59b26.wav",
                },
            ],
        }
    ],
    [
        {
            "role": "user",
            "content": [
                {
                    "type": "text",
                    "text": "This track feels really peaceful and introspective. What elements make it feel so calming and meditative?",
                },
                {"type": "audio", "path": "https://huggingface.co/datasets/nvidia/AudioSkills/resolve/main/assets/FPSbCAANfbJLVSwD.mp3"},
            ],
        }
    ],
]

inputs = processor.apply_chat_template(
    conversations,
    tokenize=True,
    add_generation_prompt=True,
    return_dict=True,
).to(model.device, dtype=model.dtype)

outputs = model.generate(**inputs, max_new_tokens=500)

decoded_outputs = processor.decode(outputs[:, inputs.input_ids.shape[1]:], skip_special_tokens=True)
print(decoded_outputs)

➡️ Training:

from transformers import AudioFlamingo3ForConditionalGeneration, AutoProcessor


model_id = "nvidia/audio-flamingo-3-hf"
processor = AutoProcessor.from_pretrained(model_id)
model = AudioFlamingo3ForConditionalGeneration.from_pretrained(model_id, device_map="auto")
model.train()

conversation = [
    [
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "Transcribe the input speech."},
                {"type": "audio", "path": "https://huggingface.co/datasets/nvidia/AudioSkills/resolve/main/assets/WhDJDIviAOg_120_10.mp3"},
            ],
        },
        {
            "role": "assistant",
            "content": [{"type": "text", "text": "The transcription of the audio is 'summer follows spring the days grow longer and the nights are warm'."}],
        }
    ],
    [
        {
            "role": "user",
            "content": [
                {
                    "type": "text",
                    "text": "This track feels really peaceful and introspective. What elements make it feel so calming and meditative?",
                },
                {"type": "audio", "path": "https://huggingface.co/datasets/nvidia/AudioSkills/resolve/main/assets/FPSbCAANfbJLVSwD.mp3"},
            ],
        },
        {
            "role": "assistant",
            "content": [{"type": "text", "text": "The transcription of the audio is 'some transcription of the audio'."}],
        }
    ]
]

inputs = processor.apply_chat_template(
    conversation,
    tokenize=True,
    add_generation_prompt=True,
    return_dict=True,
    output_labels=True,
).to(model.device, dtype=model.dtype)

loss = model(**inputs).loss
loss.backward()

➡️ transcription shortcut

from transformers import AudioFlamingo3ForConditionalGeneration, AutoProcessor


model_id = "nvidia/audio-flamingo-3-hf"
processor = AutoProcessor.from_pretrained(model_id)
model = AudioFlamingo3ForConditionalGeneration.from_pretrained(model_id, device_map="auto")

inputs = processor.apply_transcription_request(audio="https://huggingface.co/datasets/nvidia/AudioSkills/resolve/main/assets/t_837b89f2-26aa-4ee2-bdf6-f73f0dd59b26.wav").to(model.device, dtype=model.dtype)

outputs = model.generate(**inputs, max_new_tokens=500)
decoded_outputs = processor.decode(outputs[:, inputs.input_ids.shape[1]:], skip_special_tokens=True, strip_prefix=True)

print(decoded_outputs)

The model is trained to emit transcriptions prefixed with assistant framing such as: The spoken content of the audio is "<text>". Use strip_prefix=True (as shown above) to remove the fixed assistant sentence and the surrounding quotes so that only the transcription remains.
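To illustrate what this post-processing amounts to, here is a minimal sketch in plain Python. The helper `strip_transcription_prefix` and its regular expression are hypothetical and only cover the example framing quoted above; the processor's own `strip_prefix` logic may handle more variants.

```python
import re

# Assumption: the framing sentence matches the example quoted in the text.
_PREFIX = re.compile(r"^\s*The spoken content of the audio is\s*")


def strip_transcription_prefix(text):
    """Remove the fixed assistant sentence, a trailing period, and surrounding quotes."""
    text = _PREFIX.sub("", text).strip()
    # Drop the final period only when it follows a closing quote.
    if text.endswith("."):
        candidate = text[:-1].rstrip()
        if candidate[-1:] in {'"', "'"}:
            text = candidate
    # Drop one pair of matching surrounding quotes.
    if len(text) >= 2 and text[0] == text[-1] and text[0] in {'"', "'"}:
        text = text[1:-1]
    return text.strip()


raw = 'The spoken content of the audio is "hello world".'
print(strip_transcription_prefix(raw))  # prints: hello world
```

Text that does not start with the assumed framing passes through unchanged, which keeps the helper safe to apply to free-form answers.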

How the model works

Architecture

AudioFlamingo3Encoder: Whisper-style feature extractor + encoder → average-pool over time (stride 2) → LayerNorm. Produces per-frame hidden states at the post-pool rate.
AudioFlamingo3MultiModalProjector: A small MLP that maps encoder features to the language model’s hidden size.
AudioFlamingo3ForConditionalGeneration: A causal language model that accepts text embeddings where each audio placeholder token slot is replaced, in place, by an audio frame embedding. No sequence-length change is introduced by fusion.
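The replace-in-place fusion can be sketched with plain Python lists standing in for tensors. This is an illustration only, not the model's implementation: `AUDIO_TOKEN_ID`, `fuse_in_place`, and the toy embedding values are hypothetical.

```python
# Hypothetical placeholder id; the real id comes from the model configuration.
AUDIO_TOKEN_ID = 99


def fuse_in_place(input_ids, text_embeds, audio_embeds, audio_token_id=AUDIO_TOKEN_ID):
    """Return a copy of text_embeds with each placeholder slot replaced by an audio frame."""
    slots = [i for i, tok in enumerate(input_ids) if tok == audio_token_id]
    if len(slots) != len(audio_embeds):
        raise ValueError(
            f"{len(slots)} placeholder slots but {len(audio_embeds)} audio frames"
        )
    fused = [list(vec) for vec in text_embeds]
    for slot, frame in zip(slots, audio_embeds):
        fused[slot] = list(frame)
    return fused


input_ids = [5, 99, 99, 7]  # two placeholder slots between two text tokens
text_embeds = [[0.1, 0.1], [0.0, 0.0], [0.0, 0.0], [0.4, 0.4]]
audio_embeds = [[1.0, 2.0], [3.0, 4.0]]  # projected audio frames, one per slot

fused = fuse_in_place(input_ids, text_embeds, audio_embeds)
print(len(fused) == len(input_ids))  # prints: True
```

Because the number of slots must equal the number of audio frames, a mismatch between the processor's placeholder expansion and the encoder output surfaces as an error rather than a silent shift.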

Processor-level alignment

Each raw waveform is split into fixed-length windows based on the feature extractor’s chunk_length (seconds) and sampling_rate (Hz).
For each window, the processor computes the number of post-pool frames post_pool_len that the encoder will output (matching the conv/pool schedule).
The processor expands the audio placeholder token into one token per post-pool frame, summed across all windows.
The model later replaces those token positions with the corresponding projected audio embeddings.
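The windowing and placeholder expansion steps above can be sketched as follows. The defaults (16 kHz, 30-second windows, 20-window cap) come from this page; the function names and the `<audio>` placeholder string are hypothetical and stand in for whatever the processor uses internally.

```python
import math


def count_windows(num_samples, sampling_rate=16_000, chunk_length=30, max_windows=20):
    """Number of fixed-length windows for a waveform, capped at max_windows."""
    window_size = sampling_rate * chunk_length  # samples per window
    return min(math.ceil(num_samples / window_size), max_windows)


def expand_placeholder(prompt, total_frames, placeholder="<audio>"):
    """Repeat the first placeholder so there is one slot per post-pool frame."""
    return prompt.replace(placeholder, placeholder * total_frames, 1)


print(count_windows(45 * 16_000))       # 45 s of audio -> prints: 2
print(count_windows(15 * 60 * 16_000))  # 15 min is capped -> prints: 20
print(expand_placeholder("<audio>Transcribe.", total_frames=3))
# prints: <audio><audio><audio>Transcribe.
```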

Usage patterns

Transcription shortcut

For automatic speech recognition, you can skip writing the default instruction each time and call AudioFlamingo3Processor.apply_transcription_request:

inputs = processor.apply_transcription_request(audio=audio_array)

Pass prompt="Transcribe the input speech." (or a list of prompts for batch audio) to customize the instruction while keeping the audio placeholder handling.

The audio argument accepts in-memory arrays, local file paths, or URLs. Any processor kwargs (text_kwargs, audio_kwargs, etc.) are forwarded, so you can tweak padding or tensor formats just as you would when calling processor(...).

Long audio and windowing

Important: Maximum audio length is 10 minutes. Audio longer than this will be truncated.

The default setup processes 30-second windows at 16 kHz mono.
The processor enforces a hard limit of 20 windows per sample, resulting in a maximum of 10 minutes of audio (20 windows × 30 seconds).
For each window:
- mel_len is the padded mel length.
- A conv stack reduces time as conv_output_len = (mel_len - 1) // 2 + 1.
- Post-pool frames per window: post_pool_len = (conv_output_len - 2) // 2 + 1.
- An audio placeholder token is expanded to the sum of post_pool_len across all windows.
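As a worked example of the formulas above, the sketch below computes the frame counts for a full window. It assumes 3000 mel frames per 30-second window (100 mel frames per second, as in the Whisper feature extractor defaults); that value and the function name are assumptions, not taken from this page.

```python
def post_pool_len(mel_len):
    """Post-pool frames for one window, following the two formulas above."""
    conv_output_len = (mel_len - 1) // 2 + 1   # conv stack halves the time axis
    return (conv_output_len - 2) // 2 + 1      # stride-2 average pooling


MEL_FRAMES_PER_WINDOW = 3000  # assumption: 30 s at 100 mel frames per second
MAX_WINDOWS = 20              # hard limit stated in this section

per_window = post_pool_len(MEL_FRAMES_PER_WINDOW)
print(per_window)                # prints: 750
print(MAX_WINDOWS * per_window)  # prints: 15000
```

Under these assumptions, a full 10-minute input occupies 15000 placeholder slots in the prompt, which is worth keeping in mind when budgeting context length.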

Padding, attention, and caching

Left padding vs right padding: For generation with mixed prompt lengths in a batch, left padding is usually preferable. For training, right padding is common; AF3’s fusion mechanism itself is padding-agnostic because it replaces in place.
Attention masks: The processor returns attention_mask (text) and input_features_mask (audio). The model builds an internal 4-D mask on the encoder’s pre-pool axis with negative infinity at pad positions.
Caching: During generation, input_features and input_features_mask are only passed on the first step. Subsequent steps use cached keys/values from the language model.
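The reason left padding suits batched generation can be seen in a small sketch with plain lists. The pad id, the token ids, and the `left_pad` helper are hypothetical; in practice the processor and tokenizer build these tensors.

```python
PAD = 0  # hypothetical pad token id


def left_pad(batch, pad_id=PAD):
    """Left-pad token id lists to a common width and build the attention mask."""
    width = max(len(seq) for seq in batch)
    input_ids = [[pad_id] * (width - len(seq)) + list(seq) for seq in batch]
    attention_mask = [[0] * (width - len(seq)) + [1] * len(seq) for seq in batch]
    return input_ids, attention_mask


prompts = [[11, 12, 13], [21]]
input_ids, attention_mask = left_pad(prompts)

# Pretend every row generated two new tokens after its prompt.
outputs = [row + new for row, new in zip(input_ids, [[91, 92], [93, 94]])]

# With left padding all prompts end at the same column, so one slice
# recovers only the newly generated tokens for every row.
prompt_len = len(input_ids[0])
new_tokens = [row[prompt_len:] for row in outputs]
print(new_tokens)  # prints: [[91, 92], [93, 94]]
```

This mirrors the `outputs[:, inputs.input_ids.shape[1]:]` slice used in the usage examples: it is only correct for every row when the padding sits on the left.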

Troubleshooting

Empty or truncated outputs when batching: Use left padding for batched generation and decode only the new tokens after the prompt length, as shown in the Usage examples.

AudioFlamingo3Config

autodoc AudioFlamingo3Config

AudioFlamingo3EncoderConfig

autodoc AudioFlamingo3EncoderConfig

AudioFlamingo3Processor

autodoc AudioFlamingo3Processor - call

AudioFlamingo3Encoder

autodoc AudioFlamingo3Encoder - forward

AudioFlamingo3Model

autodoc AudioFlamingo3Model - forward

AudioFlamingo3ForConditionalGeneration

autodoc AudioFlamingo3ForConditionalGeneration - forward - get_audio_features
