This model was published in HF papers on 2025-07-10 and contributed to Hugging Face Transformers on 2025-11-12.
Audio Flamingo 3
Overview
Audio Flamingo 3 (AF3) is a fully open large audio–language model designed for robust understanding and reasoning over speech, environmental sounds, and music. AF3 pairs a Whisper-style audio encoder with a causal language model and performs replace-in-place audio–text fusion: the processor aligns post-pool audio frames to a dedicated placeholder token and the model replaces those token slots with projected audio embeddings during the forward pass.
The model checkpoint is available at: nvidia/audio-flamingo-3-hf
Highlights:
- Unified audio encoder across speech, sound, and music.
- Long-audio support via windowing and post-pool alignment. The model processes audio in 30-second windows with a hard limit of 20 windows (10 minutes total); longer audio is truncated.
- Deterministic fusion that preserves sequence length by replacing audio placeholder tokens with audio embeddings.
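The windowing limit in the highlights can be illustrated with a small sketch. The helper name `plan_windows` and its parameters are hypothetical and not part of the library; the defaults (16 kHz, 30-second windows, 20 windows) are taken from the figures quoted on this page.

```python
import math


def plan_windows(num_samples, sampling_rate=16_000, chunk_length=30, max_windows=20):
    """Return (start, end) sample ranges per window, dropping audio past max_windows.

    Hypothetical helper for illustration only; the real processor does this internally.
    """
    window_size = sampling_rate * chunk_length
    num_windows = min(math.ceil(num_samples / window_size), max_windows)
    return [
        (i * window_size, min((i + 1) * window_size, num_samples))
        for i in range(num_windows)
    ]


# A 45-second clip needs two windows; the second one is only half full.
short_clip = plan_windows(45 * 16_000)
print(short_clip)  # [(0, 480000), (480000, 720000)]

# A 15-minute clip is cut off after 20 windows, i.e. at 600 seconds.
long_clip = plan_windows(15 * 60 * 16_000)
print(len(long_clip), long_clip[-1][1] / 16_000)  # 20 600.0
```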
This model was contributed by Lasha Koroshinadze and Eric Bezzam.
Paper
Audio Flamingo 3: Advancing Audio Intelligence with Fully Open Large Audio Language Models
A. Goel, S. Ghosh, J. Kim, S. Kumar, Z. Kong, S. Lee, C.-H. H. Yang, R. Duraiswami, D. Manocha, R. Valle, B. Catanzaro
NVIDIA and University of Maryland
Project: https://research.nvidia.com/labs/adlr/AF3/
Usage
Audio Instruct Mode
The model supports audio-text instructions, including multi-turn interactions, and can process them in batches.
➡️ audio + text instruction
from transformers import AudioFlamingo3ForConditionalGeneration, AutoProcessor
model_id = "nvidia/audio-flamingo-3-hf"
processor = AutoProcessor.from_pretrained(model_id)
model = AudioFlamingo3ForConditionalGeneration.from_pretrained(model_id, device_map="auto")
conversation = [
    {
        "role": "user",
        "content": [
            {"type": "text", "text": "Transcribe the input speech."},
            {"type": "audio", "path": "https://huggingface.co/datasets/nvidia/AudioSkills/resolve/main/assets/WhDJDIviAOg_120_10.mp3"},
        ],
    }
]
inputs = processor.apply_chat_template(
    conversation,
    tokenize=True,
    add_generation_prompt=True,
    return_dict=True,
).to(model.device, dtype=model.dtype)
outputs = model.generate(**inputs, max_new_tokens=500)
decoded_outputs = processor.decode(outputs[:, inputs.input_ids.shape[1]:], skip_special_tokens=True)
print(decoded_outputs)
➡️ multi-turn:
from transformers import AudioFlamingo3ForConditionalGeneration, AutoProcessor
model_id = "nvidia/audio-flamingo-3-hf"
processor = AutoProcessor.from_pretrained(model_id)
model = AudioFlamingo3ForConditionalGeneration.from_pretrained(model_id, device_map="auto")
conversation = [
    {
        "role": "user",
        "content": [
            {
                "type": "text",
                "text": "Instruction: How does the tone of female speech change throughout the audio? Choose the correct option among the options below: (A) Sad to happy (B) Happy to sad (C) Neutral to happy (D) Happy to neutral.",
            },
            {"type": "audio", "path": "https://huggingface.co/datasets/nvidia/AudioSkills/resolve/main/assets/000000786159.31.wav"},
        ],
    },
    {
        "role": "assistant",
        "content": [{"type": "text", "text": "(A) Sad to happy"}],
    },
    {
        "role": "user",
        "content": [
            {"type": "text", "text": "Why do you think so?"},
        ],
    },
]
inputs = processor.apply_chat_template(
    conversation,
    tokenize=True,
    add_generation_prompt=True,
    return_dict=True,
).to(model.device, dtype=model.dtype)
outputs = model.generate(**inputs, max_new_tokens=500)
decoded_outputs = processor.decode(outputs[:, inputs.input_ids.shape[1]:], skip_special_tokens=True)
print(decoded_outputs)
➡️ text only:
from transformers import AudioFlamingo3ForConditionalGeneration, AutoProcessor
model_id = "nvidia/audio-flamingo-3-hf"
processor = AutoProcessor.from_pretrained(model_id)
model = AudioFlamingo3ForConditionalGeneration.from_pretrained(model_id, device_map="auto")
conversation = [
    {
        "role": "user",
        "content": [
            {"type": "text", "text": "What is the capital of France?"},
        ],
    }
]
inputs = processor.apply_chat_template(
    conversation,
    tokenize=True,
    add_generation_prompt=True,
    return_dict=True,
).to(model.device, dtype=model.dtype)
outputs = model.generate(**inputs, max_new_tokens=500)
decoded_outputs = processor.decode(outputs[:, inputs.input_ids.shape[1]:], skip_special_tokens=True)
print(decoded_outputs)
➡️ audio only:
from transformers import AudioFlamingo3ForConditionalGeneration, AutoProcessor
model_id = "nvidia/audio-flamingo-3-hf"
processor = AutoProcessor.from_pretrained(model_id)
model = AudioFlamingo3ForConditionalGeneration.from_pretrained(model_id, device_map="auto")
conversation = [
    {
        "role": "user",
        "content": [
            {"type": "audio", "path": "https://huggingface.co/datasets/nvidia/AudioSkills/resolve/main/assets/WhDJDIviAOg_120_10.mp3"},
        ],
    }
]
inputs = processor.apply_chat_template(
    conversation,
    tokenize=True,
    add_generation_prompt=True,
    return_dict=True,
).to(model.device, dtype=model.dtype)
outputs = model.generate(**inputs, max_new_tokens=500)
decoded_outputs = processor.decode(outputs[:, inputs.input_ids.shape[1]:], skip_special_tokens=True)
print(decoded_outputs)
➡️ batched inference!
from transformers import AudioFlamingo3ForConditionalGeneration, AutoProcessor
model_id = "nvidia/audio-flamingo-3-hf"
processor = AutoProcessor.from_pretrained(model_id)
model = AudioFlamingo3ForConditionalGeneration.from_pretrained(model_id, device_map="auto")
conversations = [
    [
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "Transcribe the input speech."},
                {
                    "type": "audio",
                    "path": "https://huggingface.co/datasets/nvidia/AudioSkills/resolve/main/assets/t_837b89f2-26aa-4ee2-bdf6-f73f0dd59b26.wav",
                },
            ],
        }
    ],
    [
        {
            "role": "user",
            "content": [
                {
                    "type": "text",
                    "text": "This track feels really peaceful and introspective. What elements make it feel so calming and meditative?",
                },
                {"type": "audio", "path": "https://huggingface.co/datasets/nvidia/AudioSkills/resolve/main/assets/FPSbCAANfbJLVSwD.mp3"},
            ],
        }
    ],
]
inputs = processor.apply_chat_template(
    conversations,
    tokenize=True,
    add_generation_prompt=True,
    return_dict=True,
).to(model.device, dtype=model.dtype)
outputs = model.generate(**inputs, max_new_tokens=500)
decoded_outputs = processor.decode(outputs[:, inputs.input_ids.shape[1]:], skip_special_tokens=True)
print(decoded_outputs)
➡️ Training:
from transformers import AudioFlamingo3ForConditionalGeneration, AutoProcessor
model_id = "nvidia/audio-flamingo-3-hf"
processor = AutoProcessor.from_pretrained(model_id)
model = AudioFlamingo3ForConditionalGeneration.from_pretrained(model_id, device_map="auto")
model.train()
conversation = [
    [
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "Transcribe the input speech."},
                {"type": "audio", "path": "https://huggingface.co/datasets/nvidia/AudioSkills/resolve/main/assets/WhDJDIviAOg_120_10.mp3"},
            ],
        },
        {
            "role": "assistant",
            "content": [{"type": "text", "text": "The transcription of the audio is 'summer follows spring the days grow longer and the nights are warm'."}],
        }
    ],
    [
        {
            "role": "user",
            "content": [
                {
                    "type": "text",
                    "text": "This track feels really peaceful and introspective. What elements make it feel so calming and meditative?",
                },
                {"type": "audio", "path": "https://huggingface.co/datasets/nvidia/AudioSkills/resolve/main/assets/FPSbCAANfbJLVSwD.mp3"},
            ],
        },
        {
            "role": "assistant",
            "content": [{"type": "text", "text": "The transcription of the audio is 'some transcription of the audio'."}],
        }
    ]
]
inputs = processor.apply_chat_template(
    conversation,
    tokenize=True,
    add_generation_prompt=True,
    return_dict=True,
    output_labels=True,
).to(model.device, dtype=model.dtype)
loss = model(**inputs).loss
loss.backward()
➡️ transcription shortcut
from transformers import AudioFlamingo3ForConditionalGeneration, AutoProcessor
model_id = "nvidia/audio-flamingo-3-hf"
processor = AutoProcessor.from_pretrained(model_id)
model = AudioFlamingo3ForConditionalGeneration.from_pretrained(model_id, device_map="auto")
inputs = processor.apply_transcription_request(audio="https://huggingface.co/datasets/nvidia/AudioSkills/resolve/main/assets/t_837b89f2-26aa-4ee2-bdf6-f73f0dd59b26.wav").to(model.device, dtype=model.dtype)
outputs = model.generate(**inputs, max_new_tokens=500)
decoded_outputs = processor.decode(outputs[:, inputs.input_ids.shape[1]:], skip_special_tokens=True, strip_prefix=True)
print(decoded_outputs)
The model is trained to emit transcriptions prefixed with assistant framing such as: The spoken content of the audio is "<text>". Use strip_prefix=True (as shown above) to remove the fixed assistant sentence and surrounding quotes so that only the transcription remains.
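To make the effect of prefix stripping concrete, here is a minimal sketch of the kind of post-processing described above. It is not the library's implementation: the helper name `strip_transcription_prefix` and the regular expression are assumptions, and only the single framing sentence quoted above is handled.

```python
import re

# Assumption: the framing sentence is exactly the one quoted in the text above.
_PREFIX_RE = re.compile(
    r"""^\s*The spoken content of the audio is\s*["'](?P<text>.*)["']\s*\.?\s*$""",
    re.DOTALL,
)


def strip_transcription_prefix(decoded):
    """Return only the quoted transcription, or the input unchanged if no prefix matches."""
    match = _PREFIX_RE.match(decoded)
    return match.group("text").strip() if match else decoded.strip()


framed = 'The spoken content of the audio is "summer follows spring".'
print(strip_transcription_prefix(framed))        # summer follows spring
print(strip_transcription_prefix("plain text"))  # plain text
```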
How the model works
Architecture
- AudioFlamingo3Encoder: Whisper-style feature extractor + encoder → average-pool over time (stride 2) → LayerNorm. Produces per-frame hidden states at the post-pool rate.
- AudioFlamingo3MultiModalProjector: A small MLP that maps encoder features to the language model’s hidden size.
- AudioFlamingo3ForConditionalGeneration: A causal language model that accepts text embeddings where each audio placeholder token slot is replaced, in place, by an audio frame embedding. No sequence-length change is introduced by fusion.
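The in-place fusion described in the last item can be sketched with plain Python lists. This is an illustration under stated assumptions, not the model's code: the placeholder id `99`, the helper name `fuse_in_place`, and the toy two-dimensional embeddings are all hypothetical, and the real model operates on tensors.

```python
AUDIO_TOKEN_ID = 99  # hypothetical placeholder id, chosen only for this example


def fuse_in_place(input_ids, text_embeds, audio_embeds, audio_token_id=AUDIO_TOKEN_ID):
    """Overwrite each placeholder slot with the next audio frame embedding."""
    slots = [i for i, tok in enumerate(input_ids) if tok == audio_token_id]
    if len(slots) != len(audio_embeds):
        raise ValueError(
            f"{len(slots)} placeholder slots but {len(audio_embeds)} audio frames"
        )
    fused = list(text_embeds)  # shallow copy so the caller's list is untouched
    for slot, frame in zip(slots, audio_embeds):
        fused[slot] = frame
    return fused


input_ids = [1, 99, 99, 99, 7, 8]
text_embeds = [[float(tok)] * 2 for tok in input_ids]  # toy 2-d embeddings
audio_embeds = [[0.1, 0.1], [0.2, 0.2], [0.3, 0.3]]    # toy projected audio frames

fused = fuse_in_place(input_ids, text_embeds, audio_embeds)
print(len(fused) == len(input_ids))  # True: fusion does not change sequence length
```

Because every audio frame lands in a slot that already exists in the token sequence, position ids and attention masks computed for the text remain valid after fusion.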
Processor-level alignment
- Each raw waveform is split into fixed-length windows based on the feature extractor’s chunk_length (seconds) and sampling_rate (Hz).
- For each window, the processor computes the number of post-pool frames, post_pool_len, that the encoder will output (matching the conv/pool schedule).
- The processor expands the audio placeholder token by the total number of post-pool frames across all windows.
- The model later replaces those token positions with the corresponding projected audio embeddings.
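The expansion step in the list above can be sketched as follows. The helper `expand_audio_placeholder`, the token id `99`, and the small frame counts are hypothetical and chosen for readability; the sketch also assumes one placeholder per audio clip.

```python
def expand_audio_placeholder(input_ids, frames_per_window, audio_token_id):
    """Repeat the single audio placeholder once per post-pool frame.

    Assumes input_ids contains one placeholder for the audio clip
    (illustrative only; not the processor's actual code).
    """
    total_frames = sum(frames_per_window)
    expanded = []
    for tok in input_ids:
        if tok == audio_token_id:
            expanded.extend([audio_token_id] * total_frames)
        else:
            expanded.append(tok)
    return expanded


# Toy prompt: <bos>, <audio>, <text token>; two windows with 4 and 2 frames.
expanded = expand_audio_placeholder([1, 99, 7], frames_per_window=[4, 2], audio_token_id=99)
print(expanded)  # [1, 99, 99, 99, 99, 99, 99, 7]
```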
Usage patterns
Transcription shortcut
For automatic speech recognition you can skip writing the default instruction each time and call
AudioFlamingo3Processor.apply_transcription_request:
inputs = processor.apply_transcription_request(audio=audio_array)
Pass prompt="Transcribe the input speech." (or a list of prompts for batch audio) to customize the instruction while
keeping the audio placeholder handling.
audio accepts in-memory arrays, local file paths, or URLs. Any processor kwargs (text_kwargs, audio_kwargs, etc.)
are forwarded, so you can tweak padding or tensor formats just like when calling processor(...).
Long audio and windowing
Important: Maximum audio length is 10 minutes. Audio longer than this will be truncated.
- The default setup processes 30-second windows at 16 kHz mono.
- The processor enforces a hard limit of 20 windows per sample, resulting in a maximum of 10 minutes of audio (20 windows × 30 seconds).
- For each window:
  - mel_len is the padded mel length.
  - A conv stack reduces time as conv_output_len = (mel_len - 1) // 2 + 1.
  - Post-pool frames per window: post_pool_len = (conv_output_len - 2) // 2 + 1.
  - An audio placeholder token is expanded to the sum of post_pool_len across all windows.
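The two length formulas above can be checked with a short worked example. The value `mel_len = 3000` is an assumption (a 30-second window at 100 mel frames per second, as in Whisper-style feature extractors) and is not stated on this page; the function names simply mirror the variable names used above.

```python
def conv_output_len(mel_len):
    """Time length after the conv stack, per the formula above."""
    return (mel_len - 1) // 2 + 1


def post_pool_len(mel_len):
    """Time length after average pooling, per the formula above."""
    return (conv_output_len(mel_len) - 2) // 2 + 1


mel_len = 3000  # assumption: 30 s window at 100 mel frames per second
print(conv_output_len(mel_len))   # 1500
print(post_pool_len(mel_len))     # 750

# With the 20-window cap, one sample expands to at most this many placeholder tokens.
print(20 * post_pool_len(mel_len))  # 15000
```

Under that assumption, each full window contributes 750 placeholder tokens, so a 10-minute clip consumes roughly 15,000 positions of the language model's context before any text is added.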
Padding, attention, and caching
- Left padding vs right padding: For generation with mixed prompt lengths in a batch, left padding is usually preferable. For training, right padding is common; AF3’s fusion mechanism itself is padding-agnostic because it replaces in place.
- Attention masks: The processor returns attention_mask (text) and input_features_mask (audio). The model builds an internal 4-D mask on the encoder’s pre-pool axis with negative infinity at pad positions.
- Caching: During generation, input_features and input_features_mask are only passed on the first step. Subsequent steps use cached keys/values from the language model.
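As a rough illustration of the attention-mask item, the sketch below turns a 2-D validity mask into a 4-D additive mask with negative infinity at pad positions. The `(batch, 1, query, key)` layout, the convention that 1 marks a real frame, and the helper name `build_additive_mask` are assumptions; the model builds its mask internally with tensors, and its exact layout is not specified here.

```python
NEG_INF = float("-inf")


def build_additive_mask(input_features_mask):
    """Map a (batch, time) 0/1 mask to a (batch, 1, time, time) additive mask.

    Entry [b][0][q][k] is 0.0 when key frame k is real and -inf when it is padding,
    so adding it to attention scores removes padded keys after softmax.
    """
    mask = []
    for valid in input_features_mask:
        key_row = [0.0 if keep else NEG_INF for keep in valid]
        # Every query position shares the same per-key row.
        mask.append([[list(key_row) for _ in valid]])
    return mask


# Two samples of length 3; the first has one padded frame at the end.
additive = build_additive_mask([[1, 1, 0], [1, 1, 1]])
print(additive[0][0][0])  # [0.0, 0.0, -inf]
print(additive[1][0][2])  # [0.0, 0.0, 0.0]
```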
Troubleshooting
- Empty or truncated outputs when batching: Use left padding for batched generation and decode only the new tokens after the prompt length, as shown in the usage examples above.
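The reason left padding and prompt-length slicing work together can be seen in a toy example with plain lists. The pad id `0`, the helper `left_pad`, and the simulated generation step are hypothetical; no model is involved.

```python
PAD_ID = 0  # hypothetical pad token id


def left_pad(prompts, pad_id=PAD_ID):
    """Left-pad token id lists to a common width and build the attention mask."""
    width = max(len(p) for p in prompts)
    input_ids = [[pad_id] * (width - len(p)) + list(p) for p in prompts]
    attention_mask = [[0] * (width - len(p)) + [1] * len(p) for p in prompts]
    return input_ids, attention_mask


prompts = [[5, 6, 7], [8]]
input_ids, attention_mask = left_pad(prompts)
print(input_ids)       # [[5, 6, 7], [0, 0, 8]]
print(attention_mask)  # [[1, 1, 1], [0, 0, 1]]

# Simulate generation: every row gets two new tokens appended after the prompt.
outputs = [row + [40 + i, 50 + i] for i, row in enumerate(input_ids)]

# All prompts end at the same index, so one slice isolates the new tokens per row.
prompt_len = len(input_ids[0])
new_tokens = [row[prompt_len:] for row in outputs]
print(new_tokens)  # [[40, 50], [41, 51]]
```

With right padding, the shorter prompt would be followed by pad tokens before generation starts, so the same slice would not line up with each row's real prompt end. In Transformers, tokenizers expose a `padding_side` attribute; setting `processor.tokenizer.padding_side = "left"` is one way to request left padding, assuming the processor exposes its tokenizer under that name.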
AudioFlamingo3Config
autodoc AudioFlamingo3Config
AudioFlamingo3EncoderConfig
autodoc AudioFlamingo3EncoderConfig
AudioFlamingo3Processor
autodoc AudioFlamingo3Processor - call
AudioFlamingo3Encoder
autodoc AudioFlamingo3Encoder - forward
AudioFlamingo3Model
autodoc AudioFlamingo3Model - forward
AudioFlamingo3ForConditionalGeneration
autodoc AudioFlamingo3ForConditionalGeneration - forward - get_audio_features