This model was published in HF papers on 2025-07-10 and contributed to Hugging Face Transformers on 2025-11-12.

Audio Flamingo 3

FlashAttention SDPA

Overview

Audio Flamingo 3 (AF3) is a fully open large audio-language model designed for robust understanding and reasoning over speech, environmental sounds, and music. AF3 pairs a Whisper-style audio encoder with a causal language model and performs replace-in-place audio-text fusion: the processor aligns post-pool audio frames to a dedicated placeholder token, and the model replaces those token slots with projected audio embeddings during the forward pass.

The model checkpoint is available at: nvidia/audio-flamingo-3-hf

Highlights:

  • Unified audio encoder across speech, sound, and music.
  • Long-audio support via windowing and post-pool alignment. The model processes audio in 30-second windows with a hard limit of 20 windows, so at most 10 minutes of audio is used; anything longer is truncated.
  • Deterministic fusion that preserves sequence length by replacing audio placeholder tokens with audio embeddings.

This model was contributed by Lasha Koroshinadze and Eric Bezzam.

Paper

Audio Flamingo 3: Advancing Audio Intelligence with Fully Open Large Audio Language Models
A. Goel, S. Ghosh, J. Kim, S. Kumar, Z. Kong, S. Lee, C.-H. H. Yang, R. Duraiswami, D. Manocha, R. Valle, B. Catanzaro
NVIDIA and University of Maryland
Project: https://research.nvidia.com/labs/adlr/AF3/

Usage

Audio Instruct Mode

The model supports audio-text instructions, including multi-turn interactions, and all of these can be processed in batches.

➡️ audio + text instruction

from transformers import AudioFlamingo3ForConditionalGeneration, AutoProcessor


model_id = "nvidia/audio-flamingo-3-hf"
processor = AutoProcessor.from_pretrained(model_id)
model = AudioFlamingo3ForConditionalGeneration.from_pretrained(model_id, device_map="auto")

conversation = [
    {
        "role": "user",
        "content": [
            {"type": "text", "text": "Transcribe the input speech."},
            {"type": "audio", "path": "https://huggingface.co/datasets/nvidia/AudioSkills/resolve/main/assets/WhDJDIviAOg_120_10.mp3"},
        ],
    }
]

inputs = processor.apply_chat_template(
    conversation,
    tokenize=True,
    add_generation_prompt=True,
    return_dict=True,
).to(model.device, dtype=model.dtype)

outputs = model.generate(**inputs, max_new_tokens=500)

decoded_outputs = processor.decode(outputs[:, inputs.input_ids.shape[1]:], skip_special_tokens=True)
print(decoded_outputs)

➡️ multi-turn:

from transformers import AudioFlamingo3ForConditionalGeneration, AutoProcessor


model_id = "nvidia/audio-flamingo-3-hf"
processor = AutoProcessor.from_pretrained(model_id)
model = AudioFlamingo3ForConditionalGeneration.from_pretrained(model_id, device_map="auto")

conversation = [
    {
        "role": "user",
        "content": [
            {
                "type": "text",
                "text": "Instruction: How does the tone of female speech change throughout the audio? Choose the correct option among the options below: (A) Sad to happy (B) Happy to sad (C) Neutral to happy (D) Happy to neutral.",
            },
            {"type": "audio", "path": "https://huggingface.co/datasets/nvidia/AudioSkills/resolve/main/assets/000000786159.31.wav"},
        ],
    },
    {
        "role": "assistant",
        "content": [{"type": "text", "text": "(A) Sad to happy"}],
    },
    {
        "role": "user",
        "content": [
            {"type": "text", "text": "Why do you think so?"},
        ],
    },
]

inputs = processor.apply_chat_template(
    conversation,
    tokenize=True,
    add_generation_prompt=True,
    return_dict=True,
).to(model.device, dtype=model.dtype)

outputs = model.generate(**inputs, max_new_tokens=500)

decoded_outputs = processor.decode(outputs[:, inputs.input_ids.shape[1]:], skip_special_tokens=True)
print(decoded_outputs)

➡️ text only:

from transformers import AudioFlamingo3ForConditionalGeneration, AutoProcessor


model_id = "nvidia/audio-flamingo-3-hf"
processor = AutoProcessor.from_pretrained(model_id)
model = AudioFlamingo3ForConditionalGeneration.from_pretrained(model_id, device_map="auto")

conversation = [
    {
        "role": "user",
        "content": [
            {"type": "text", "text": "What is the capital of France?"},
        ],
    }
]

inputs = processor.apply_chat_template(
    conversation,
    tokenize=True,
    add_generation_prompt=True,
    return_dict=True,
).to(model.device, dtype=model.dtype)

outputs = model.generate(**inputs, max_new_tokens=500)

decoded_outputs = processor.decode(outputs[:, inputs.input_ids.shape[1]:], skip_special_tokens=True)
print(decoded_outputs)

➡️ audio only:

from transformers import AudioFlamingo3ForConditionalGeneration, AutoProcessor


model_id = "nvidia/audio-flamingo-3-hf"
processor = AutoProcessor.from_pretrained(model_id)
model = AudioFlamingo3ForConditionalGeneration.from_pretrained(model_id, device_map="auto")

conversation = [
    {
        "role": "user",
        "content": [
            {"type": "audio", "path": "https://huggingface.co/datasets/nvidia/AudioSkills/resolve/main/assets/WhDJDIviAOg_120_10.mp3"},
        ],
    }
]

inputs = processor.apply_chat_template(
    conversation,
    tokenize=True,
    add_generation_prompt=True,
    return_dict=True,
).to(model.device, dtype=model.dtype)

outputs = model.generate(**inputs, max_new_tokens=500)

decoded_outputs = processor.decode(outputs[:, inputs.input_ids.shape[1]:], skip_special_tokens=True)
print(decoded_outputs)

➡️ batched inference!

from transformers import AudioFlamingo3ForConditionalGeneration, AutoProcessor


model_id = "nvidia/audio-flamingo-3-hf"
processor = AutoProcessor.from_pretrained(model_id)
model = AudioFlamingo3ForConditionalGeneration.from_pretrained(model_id, device_map="auto")

conversations = [
    [
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "Transcribe the input speech."},
                {
                    "type": "audio",
                    "path": "https://huggingface.co/datasets/nvidia/AudioSkills/resolve/main/assets/t_837b89f2-26aa-4ee2-bdf6-f73f0dd59b26.wav",
                },
            ],
        }
    ],
    [
        {
            "role": "user",
            "content": [
                {
                    "type": "text",
                    "text": "This track feels really peaceful and introspective. What elements make it feel so calming and meditative?",
                },
                {"type": "audio", "path": "https://huggingface.co/datasets/nvidia/AudioSkills/resolve/main/assets/FPSbCAANfbJLVSwD.mp3"},
            ],
        }
    ],
]

inputs = processor.apply_chat_template(
    conversations,
    tokenize=True,
    add_generation_prompt=True,
    return_dict=True,
).to(model.device, dtype=model.dtype)

outputs = model.generate(**inputs, max_new_tokens=500)

decoded_outputs = processor.decode(outputs[:, inputs.input_ids.shape[1]:], skip_special_tokens=True)
print(decoded_outputs)

➡️ Training:

from transformers import AudioFlamingo3ForConditionalGeneration, AutoProcessor


model_id = "nvidia/audio-flamingo-3-hf"
processor = AutoProcessor.from_pretrained(model_id)
model = AudioFlamingo3ForConditionalGeneration.from_pretrained(model_id, device_map="auto")
model.train()

conversation = [
    [
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "Transcribe the input speech."},
                {"type": "audio", "path": "https://huggingface.co/datasets/nvidia/AudioSkills/resolve/main/assets/WhDJDIviAOg_120_10.mp3"},
            ],
        },
        {
            "role": "assistant",
            "content": [{"type": "text", "text": "The transcription of the audio is 'summer follows spring the days grow longer and the nights are warm'."}],
        }
    ],
    [
        {
            "role": "user",
            "content": [
                {
                    "type": "text",
                    "text": "This track feels really peaceful and introspective. What elements make it feel so calming and meditative?",
                },
                {"type": "audio", "path": "https://huggingface.co/datasets/nvidia/AudioSkills/resolve/main/assets/FPSbCAANfbJLVSwD.mp3"},
            ],
        },
        {
            "role": "assistant",
            "content": [{"type": "text", "text": "The transcription of the audio is 'some transcription of the audio'."}],
        }

    ]
]

inputs = processor.apply_chat_template(
    conversation,
    tokenize=True,
    add_generation_prompt=True,
    return_dict=True,
    output_labels=True,
).to(model.device, dtype=model.dtype)

loss = model(**inputs).loss
loss.backward()

➡️ transcription shortcut

from transformers import AudioFlamingo3ForConditionalGeneration, AutoProcessor


model_id = "nvidia/audio-flamingo-3-hf"
processor = AutoProcessor.from_pretrained(model_id)
model = AudioFlamingo3ForConditionalGeneration.from_pretrained(model_id, device_map="auto")

inputs = processor.apply_transcription_request(audio="https://huggingface.co/datasets/nvidia/AudioSkills/resolve/main/assets/t_837b89f2-26aa-4ee2-bdf6-f73f0dd59b26.wav").to(model.device, dtype=model.dtype)

outputs = model.generate(**inputs, max_new_tokens=500)
decoded_outputs = processor.decode(outputs[:, inputs.input_ids.shape[1]:], skip_special_tokens=True, strip_prefix=True)

print(decoded_outputs)

The model is trained to emit transcriptions prefixed with assistant framing, such as: The spoken content of the audio is "<text>". Use strip_prefix=True (as shown above) to remove the fixed assistant sentence and the surrounding quotes so that only the transcription remains.
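To make the effect of prefix stripping concrete, here is a minimal sketch of what such post-processing could look like. It is not the library's implementation: the function name `strip_transcription_prefix` and the regular expression are hypothetical, and the pattern assumes the assistant sentence has exactly the form quoted above.

```python
import re

# Hypothetical pattern (an assumption, not the processor's actual logic):
# the fixed sentence, then the transcription wrapped in quotes, then an optional period.
_PREFIX_PATTERN = re.compile(
    r"""^\s*The spoken content of the audio is\s*["'](?P<text>.*)["']\.?\s*$""",
    re.DOTALL,
)


def strip_transcription_prefix(decoded: str) -> str:
    """Return only the quoted transcription, or the input unchanged if no prefix matches."""
    match = _PREFIX_PATTERN.match(decoded)
    if match is None:
        # No recognizable framing: leave the text as it is (minus outer whitespace).
        return decoded.strip()
    return match.group("text").strip()


cleaned = strip_transcription_prefix('The spoken content of the audio is "hello world".')
print(cleaned)
```

Falling back to the unmodified text when the pattern does not match keeps the helper safe to apply to non-transcription answers.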

How the model works

Architecture

  • AudioFlamingo3Encoder: Whisper-style feature extractor + encoder → average-pool over time (stride 2) → LayerNorm. Produces per-frame hidden states at the post-pool rate.

  • AudioFlamingo3MultiModalProjector: A small MLP that maps encoder features to the language model's hidden size.

  • AudioFlamingo3ForConditionalGeneration: A causal language model that accepts text embeddings where each audio placeholder token slot is replaced, in place, by an audio frame embedding. No sequence-length change is introduced by fusion.
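The replace-in-place fusion described above can be illustrated with a small, dependency-free sketch. Plain Python lists stand in for tensors here; the placeholder id `AUDIO_TOKEN_ID`, the function name `fuse_in_place`, and the toy embeddings are all hypothetical and only illustrate the idea, not the model's actual code.

```python
AUDIO_TOKEN_ID = 99  # hypothetical placeholder id, not the real vocabulary id


def fuse_in_place(input_ids, text_embeds, audio_embeds, audio_token_id=AUDIO_TOKEN_ID):
    """Return a copy of text_embeds with each placeholder slot overwritten by one audio frame."""
    slots = [i for i, tok in enumerate(input_ids) if tok == audio_token_id]
    if len(slots) != len(audio_embeds):
        raise ValueError(
            f"{len(slots)} placeholder tokens but {len(audio_embeds)} audio frames"
        )
    fused = [list(vec) for vec in text_embeds]  # copy so the input is not mutated
    for slot, frame in zip(slots, audio_embeds):
        fused[slot] = list(frame)
    return fused


# Toy example: six tokens, three of which are audio placeholders.
input_ids = [1, 99, 99, 99, 7, 8]
text_embeds = [[float(tok)] * 2 for tok in input_ids]
audio_embeds = [[0.1, 0.1], [0.2, 0.2], [0.3, 0.3]]

fused = fuse_in_place(input_ids, text_embeds, audio_embeds)
print(len(fused) == len(input_ids))  # sequence length is unchanged by fusion
```

The explicit count check mirrors why the processor must expand the placeholder to exactly one token per post-pool frame: any mismatch would leave slots unfilled or frames unused.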

Processor-level alignment

  1. Each raw waveform is split into fixed-length windows based on the feature extractor's chunk_length (seconds) and sampling_rate (Hz).
  2. For each window, the processor computes the number of post-pool frames post_pool_len that the encoder will output (matching the conv/pool schedule).
  3. The processor expands the audio placeholder token by the total number of post-pool frames across all windows.
  4. The model later replaces those token positions with the corresponding projected audio embeddings.
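The steps above can be sketched in a few lines of plain Python. This is an illustration under stated assumptions rather than the processor's code: the helper names `count_windows` and `expand_placeholder` and the placeholder string `<AUDIO>` are hypothetical, while the defaults (16 kHz, 30-second windows, at most 20 windows) come from this page.

```python
import math


def count_windows(num_samples, sampling_rate=16_000, chunk_length=30, max_windows=20):
    """Number of fixed-length windows for a waveform, capped at max_windows (extra audio is dropped)."""
    window_samples = chunk_length * sampling_rate
    return min(math.ceil(num_samples / window_samples), max_windows)


def expand_placeholder(prompt, frames_per_window, placeholder="<AUDIO>"):
    """Repeat the (hypothetical) placeholder once per post-pool frame, summed over all windows."""
    total_frames = sum(frames_per_window)
    return prompt.replace(placeholder, placeholder * total_frames, 1)


# 45 seconds of 16 kHz audio needs two 30-second windows.
print(count_windows(45 * 16_000))
# 15 minutes of audio is capped at 20 windows (10 minutes).
print(count_windows(15 * 60 * 16_000))
# Toy frame counts (2 and 1) keep the expanded prompt readable.
print(expand_placeholder("Transcribe: <AUDIO>", [2, 1]))
```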

Usage patterns

Transcription shortcut

For automatic speech recognition, you can skip writing the default instruction each time and call AudioFlamingo3Processor.apply_transcription_request:

inputs = processor.apply_transcription_request(audio=audio_array)

Pass a prompt argument, for example prompt="Transcribe the input speech." (or a list of prompts for batched audio), to customize the instruction while keeping the audio placeholder handling.

audio accepts in-memory arrays, local file paths, or URLs. Any processor kwargs (text_kwargs, audio_kwargs, etc.) are forwarded, so you can tweak padding or tensor formats just like when calling processor(...).

Long audio and windowing

Important: Maximum audio length is 10 minutes. Audio longer than this will be truncated.

  • The default setup processes 30-second windows at 16 kHz mono.

  • The processor enforces a hard limit of 20 windows per sample, resulting in a maximum of 10 minutes of audio (20 windows × 30 seconds).

  • For each window:

    • mel_len is the padded mel length.
    • A conv stack reduces time as conv_output_len = (mel_len - 1) // 2 + 1.
    • Post-pool frames per window: post_pool_len = (conv_output_len - 2) // 2 + 1.
    • An audio placeholder token is expanded to the sum of post_pool_len across all windows.
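As a worked example of the two formulas above, the sketch below computes the frame count for a single window and for the 20-window maximum. The value `mel_len = 3000` is an assumption (a Whisper-style 10 ms hop over a 30-second window); the function name `post_pool_len` simply mirrors the quantity named in the text.

```python
def post_pool_len(mel_len):
    """Apply the conv and pool length formulas from the text to one window."""
    conv_output_len = (mel_len - 1) // 2 + 1
    return (conv_output_len - 2) // 2 + 1


MEL_LEN_PER_WINDOW = 3000  # assumption: 30 s window with a 10 ms hop, as in Whisper
MAX_WINDOWS = 20           # hard limit stated above

frames_per_window = post_pool_len(MEL_LEN_PER_WINDOW)
max_placeholder_tokens = MAX_WINDOWS * frames_per_window

print(frames_per_window)       # 750
print(max_placeholder_tokens)  # 15000
```

Under that assumption, a full 10-minute input expands the audio placeholder to 15,000 tokens, which is worth keeping in mind when budgeting context length.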

Padding, attention, and caching

  • Left padding vs right padding: For generation with mixed prompt lengths in a batch, left padding is usually preferable. For training, right padding is common; AF3's fusion mechanism itself is padding-agnostic because it replaces in place.
  • Attention masks: The processor returns attention_mask (text) and input_features_mask (audio). The model builds an internal 4-D mask on the encoder's pre-pool axis with negative infinity at pad positions.
  • Caching: During generation, input_features and input_features_mask are only passed on the first step. Subsequent steps use cached keys/values from the language model.
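To show why left padding suits batched generation, here is a minimal sketch of left-padding token sequences and building the matching attention mask. The helper `left_pad` and the pad id `0` are hypothetical; in practice the processor and tokenizer handle this for you.

```python
def left_pad(sequences, pad_id=0):
    """Left-pad token id lists to a common length and build the matching attention mask."""
    max_len = max(len(seq) for seq in sequences)
    input_ids, attention_mask = [], []
    for seq in sequences:
        n_pad = max_len - len(seq)
        input_ids.append([pad_id] * n_pad + list(seq))
        attention_mask.append([0] * n_pad + [1] * len(seq))
    return input_ids, attention_mask


ids, mask = left_pad([[5, 6, 7], [8]])
print(ids)
print(mask)

# Every prompt now ends at the same column, so the generated tokens of each row
# start at index max_len and a single slice recovers them for the whole batch.
prompt_len = len(ids[0])
fake_outputs = [row + [101, 102] for row in ids]  # stand-in for generate() output
new_tokens = [row[prompt_len:] for row in fake_outputs]
print(new_tokens)
```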

Troubleshooting

  • Empty or truncated outputs when batching: Use left padding for batched generation and decode only the new tokens after the prompt length, as shown in the usage examples above.

AudioFlamingo3Config

autodoc AudioFlamingo3Config

AudioFlamingo3EncoderConfig

autodoc AudioFlamingo3EncoderConfig

AudioFlamingo3Processor

autodoc AudioFlamingo3Processor - call

AudioFlamingo3Encoder

autodoc AudioFlamingo3Encoder - forward

AudioFlamingo3Model

autodoc AudioFlamingo3Model - forward

AudioFlamingo3ForConditionalGeneration

autodoc AudioFlamingo3ForConditionalGeneration - forward - get_audio_features