first commit
Some checks failed
Self-hosted runner (nightly-past-ci-caller) / Get number (push) Has been cancelled
Self-hosted runner (nightly-past-ci-caller) / TensorFlow 2.11 (push) Has been cancelled
Self-hosted runner (nightly-past-ci-caller) / TensorFlow 2.10 (push) Has been cancelled
Self-hosted runner (nightly-past-ci-caller) / TensorFlow 2.9 (push) Has been cancelled
Self-hosted runner (nightly-past-ci-caller) / TensorFlow 2.8 (push) Has been cancelled
Self-hosted runner (nightly-past-ci-caller) / TensorFlow 2.7 (push) Has been cancelled
Self-hosted runner (nightly-past-ci-caller) / TensorFlow 2.6 (push) Has been cancelled
Self-hosted runner (nightly-past-ci-caller) / TensorFlow 2.5 (push) Has been cancelled
Self-hosted runner (benchmark) / Benchmark (aws-g5-4xlarge-cache) (push) Has been cancelled
Build documentation / build (push) Has been cancelled
Build documentation / build_other_lang (push) Has been cancelled
CodeQL Security Analysis / CodeQL Analysis (push) Has been cancelled
New model PR merged notification / Notify new model (push) Has been cancelled
PR CI / pr-ci (push) Has been cancelled
Slow tests on important models (on Push - A10) / Get all modified files (push) Has been cancelled
Secret Leaks / trufflehog (push) Has been cancelled
Update Transformers metadata / build_and_package (push) Has been cancelled
Slow tests on important models (on Push - A10) / Model CI (push) Has been cancelled
Check Tiny Models / Check tiny models (push) Has been cancelled
Self-hosted runner (Intel Gaudi3 scheduled CI caller) / Model CI (push) Has been cancelled
Self-hosted runner (Intel Gaudi3 scheduled CI caller) / Pipeline CI (push) Has been cancelled
Self-hosted runner (Intel Gaudi3 scheduled CI caller) / Example CI (push) Has been cancelled
Self-hosted runner (Intel Gaudi3 scheduled CI caller) / DeepSpeed CI (push) Has been cancelled
Self-hosted runner (Intel Gaudi3 scheduled CI caller) / Trainer/FSDP CI (push) Has been cancelled
Nvidia CI - Flash Attn / Setup (push) Has been cancelled
Nvidia CI - Flash Attn / Model CI (push) Has been cancelled
Nvidia CI / Setup (push) Has been cancelled
Nvidia CI / Model CI (push) Has been cancelled
Nvidia CI / Torch pipeline CI (push) Has been cancelled
Nvidia CI / Example CI (push) Has been cancelled
Nvidia CI / Trainer/FSDP CI (push) Has been cancelled
Nvidia CI / DeepSpeed CI (push) Has been cancelled
Nvidia CI / Quantization CI (push) Has been cancelled
Nvidia CI / Kernels CI (push) Has been cancelled
Doctests / Setup (push) Has been cancelled
Doctests / Call doctest jobs (push) Has been cancelled
Doctests / Send results to webhook (push) Has been cancelled
Extras Smoke Test / Get supported Python versions (push) Has been cancelled
Extras Smoke Test / Test extras on Python ${{ matrix.python-version }} (push) Has been cancelled
Extras Smoke Test / Check Slack token availability (push) Has been cancelled
Extras Smoke Test / Notify failures to Slack (push) Has been cancelled
Self-hosted runner (AMD scheduled CI caller) / Trigger Scheduled AMD CI (push) Has been cancelled
Stale Bot / Close Stale Issues (push) Has been cancelled
Some checks failed
Self-hosted runner (nightly-past-ci-caller) / Get number (push) Has been cancelled
Self-hosted runner (nightly-past-ci-caller) / TensorFlow 2.11 (push) Has been cancelled
Self-hosted runner (nightly-past-ci-caller) / TensorFlow 2.10 (push) Has been cancelled
Self-hosted runner (nightly-past-ci-caller) / TensorFlow 2.9 (push) Has been cancelled
Self-hosted runner (nightly-past-ci-caller) / TensorFlow 2.8 (push) Has been cancelled
Self-hosted runner (nightly-past-ci-caller) / TensorFlow 2.7 (push) Has been cancelled
Self-hosted runner (nightly-past-ci-caller) / TensorFlow 2.6 (push) Has been cancelled
Self-hosted runner (nightly-past-ci-caller) / TensorFlow 2.5 (push) Has been cancelled
Self-hosted runner (benchmark) / Benchmark (aws-g5-4xlarge-cache) (push) Has been cancelled
Build documentation / build (push) Has been cancelled
Build documentation / build_other_lang (push) Has been cancelled
CodeQL Security Analysis / CodeQL Analysis (push) Has been cancelled
New model PR merged notification / Notify new model (push) Has been cancelled
PR CI / pr-ci (push) Has been cancelled
Slow tests on important models (on Push - A10) / Get all modified files (push) Has been cancelled
Secret Leaks / trufflehog (push) Has been cancelled
Update Transformers metadata / build_and_package (push) Has been cancelled
Slow tests on important models (on Push - A10) / Model CI (push) Has been cancelled
Check Tiny Models / Check tiny models (push) Has been cancelled
Self-hosted runner (Intel Gaudi3 scheduled CI caller) / Model CI (push) Has been cancelled
Self-hosted runner (Intel Gaudi3 scheduled CI caller) / Pipeline CI (push) Has been cancelled
Self-hosted runner (Intel Gaudi3 scheduled CI caller) / Example CI (push) Has been cancelled
Self-hosted runner (Intel Gaudi3 scheduled CI caller) / DeepSpeed CI (push) Has been cancelled
Self-hosted runner (Intel Gaudi3 scheduled CI caller) / Trainer/FSDP CI (push) Has been cancelled
Nvidia CI - Flash Attn / Setup (push) Has been cancelled
Nvidia CI - Flash Attn / Model CI (push) Has been cancelled
Nvidia CI / Setup (push) Has been cancelled
Nvidia CI / Model CI (push) Has been cancelled
Nvidia CI / Torch pipeline CI (push) Has been cancelled
Nvidia CI / Example CI (push) Has been cancelled
Nvidia CI / Trainer/FSDP CI (push) Has been cancelled
Nvidia CI / DeepSpeed CI (push) Has been cancelled
Nvidia CI / Quantization CI (push) Has been cancelled
Nvidia CI / Kernels CI (push) Has been cancelled
Doctests / Setup (push) Has been cancelled
Doctests / Call doctest jobs (push) Has been cancelled
Doctests / Send results to webhook (push) Has been cancelled
Extras Smoke Test / Get supported Python versions (push) Has been cancelled
Extras Smoke Test / Test extras on Python ${{ matrix.python-version }} (push) Has been cancelled
Extras Smoke Test / Check Slack token availability (push) Has been cancelled
Extras Smoke Test / Notify failures to Slack (push) Has been cancelled
Self-hosted runner (AMD scheduled CI caller) / Trigger Scheduled AMD CI (push) Has been cancelled
Stale Bot / Close Stale Issues (push) Has been cancelled
This commit is contained in:
194
docs/source/en/model_doc/voxtral_realtime.md
Normal file
194
docs/source/en/model_doc/voxtral_realtime.md
Normal file
@@ -0,0 +1,194 @@
|
||||
<!--Copyright 2025 The HuggingFace Team. All rights reserved.
|
||||
|
||||
Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
|
||||
the License. You may obtain a copy of the License at
|
||||
|
||||
http://www.apache.org/licenses/LICENSE-2.0
|
||||
|
||||
Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
|
||||
an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
|
||||
specific language governing permissions and limitations under the License.
|
||||
|
||||
⚠️ Note that this file is in Markdown but contain specific syntax for our doc-builder (similar to MDX) that may not be
|
||||
rendered properly in your Markdown viewer.
|
||||
|
||||
-->
|
||||
*This model was contributed to Hugging Face Transformers on 2026-02-16.*
|
||||
|
||||
# VoxtralRealtime
|
||||
|
||||
VoxtralRealtime is a streaming speech-to-text model from [Mistral AI](https://mistral.ai), designed for real-time automatic speech recognition (ASR). Unlike the offline [Voxtral](./voxtral) model which processes complete audio files, VoxtralRealtime is architected for low-latency, incremental transcription by processing audio in chunks as they arrive.
|
||||
|
||||
The model combines an audio encoder with a Mistral-based language model decoder, using time conditioning embeddings and causal convolutions with padding caches to enable efficient streaming inference.
|
||||
|
||||
|
||||
## Usage
|
||||
|
||||
### Offline Transcription
|
||||
|
||||
For transcribing complete audio files, use the processor and model directly. The generation length is automatically determined from the audio length.
|
||||
|
||||
```python
|
||||
from datasets import load_dataset
|
||||
|
||||
from transformers import AutoProcessor, VoxtralRealtimeForConditionalGeneration
|
||||
|
||||
|
||||
repo_id = "mistralai/Voxtral-Mini-4B-Realtime-2602"
|
||||
|
||||
processor = AutoProcessor.from_pretrained(repo_id)
|
||||
model = VoxtralRealtimeForConditionalGeneration.from_pretrained(repo_id, device_map="auto")
|
||||
|
||||
ds = load_dataset("hf-internal-testing/librispeech_asr_dummy", "clean", split="validation")
|
||||
audio = ds[0]["audio"]["array"]
|
||||
|
||||
inputs = processor(audio, return_tensors="pt").to(model.device)
|
||||
inputs = inputs.to(model.device, dtype=model.dtype)
|
||||
|
||||
outputs = model.generate(**inputs)
|
||||
decoded_outputs = processor.batch_decode(outputs, skip_special_tokens=True)
|
||||
|
||||
print(decoded_outputs[0])
|
||||
```
|
||||
|
||||
### Batched Offline Transcription
|
||||
|
||||
Multiple audio samples can be transcribed in a single forward pass:
|
||||
|
||||
```python
|
||||
from datasets import load_dataset
|
||||
|
||||
from transformers import AutoProcessor, VoxtralRealtimeForConditionalGeneration
|
||||
|
||||
|
||||
repo_id = "mistralai/Voxtral-Mini-4B-Realtime-2602"
|
||||
|
||||
processor = AutoProcessor.from_pretrained(repo_id)
|
||||
model = VoxtralRealtimeForConditionalGeneration.from_pretrained(repo_id, device_map="auto")
|
||||
|
||||
ds = load_dataset("hf-internal-testing/librispeech_asr_dummy", "clean", split="validation")
|
||||
audio = [ds[i]["audio"]["array"] for i in range(2)]
|
||||
|
||||
inputs = processor(audio, return_tensors="pt").to(model.device)
|
||||
inputs = inputs.to(model.device, dtype=model.dtype)
|
||||
|
||||
outputs = model.generate(**inputs)
|
||||
decoded_outputs = processor.batch_decode(outputs, skip_special_tokens=True)
|
||||
|
||||
for decoded_output in decoded_outputs:
|
||||
print(decoded_output)
|
||||
```
|
||||
|
||||
### Streaming Transcription
|
||||
> [!NOTE]
|
||||
> This is an experimental feature and the API is subject to change.
|
||||
|
||||
For real-time transcription, audio is split into chunks following:
|
||||
|
||||
```python
|
||||
from threading import Thread
|
||||
|
||||
import numpy as np
|
||||
from datasets import load_dataset
|
||||
|
||||
from transformers import (
|
||||
TextIteratorStreamer,
|
||||
VoxtralRealtimeForConditionalGeneration,
|
||||
VoxtralRealtimeProcessor,
|
||||
)
|
||||
|
||||
|
||||
model_id = "mistralai/Voxtral-Mini-4B-Realtime-2602"
|
||||
processor = VoxtralRealtimeProcessor.from_pretrained(model_id)
|
||||
model = VoxtralRealtimeForConditionalGeneration.from_pretrained(model_id, device_map="cuda:0")
|
||||
|
||||
ds = load_dataset("hf-internal-testing/librispeech_asr_dummy", "clean", split="validation")
|
||||
audio = ds[0]["audio"]["array"]
|
||||
# Manually pad the audio to account for right padding tokens required by the model
|
||||
xaudio = np.pad(audio, (0, processor.num_right_pad_tokens * processor.raw_audio_length_per_tok))
|
||||
|
||||
first_chunk_inputs = processor(
|
||||
audio[:processor.num_samples_first_audio_chunk],
|
||||
is_streaming=True,
|
||||
is_first_audio_chunk=True,
|
||||
return_tensors="pt"
|
||||
)
|
||||
first_chunk_inputs.to(model.device, dtype=model.dtype)
|
||||
|
||||
def input_features_generator():
|
||||
yield first_chunk_inputs.input_features
|
||||
|
||||
mel_frame_idx = processor.num_mel_frames_first_audio_chunk
|
||||
hop_length = processor.feature_extractor.hop_length
|
||||
win_length = processor.feature_extractor.win_length
|
||||
|
||||
start_idx = mel_frame_idx * hop_length - win_length // 2
|
||||
end_idx = start_idx + processor.num_samples_per_audio_chunk
|
||||
|
||||
while (end_idx:=start_idx + processor.num_samples_per_audio_chunk) < audio.shape[0]:
|
||||
inputs = processor(
|
||||
audio[start_idx:end_idx],
|
||||
is_streaming=True,
|
||||
is_first_audio_chunk=False,
|
||||
return_tensors="pt"
|
||||
)
|
||||
inputs.to(model.device, dtype=model.dtype)
|
||||
yield inputs.input_features
|
||||
|
||||
mel_frame_idx += processor.audio_length_per_tok
|
||||
start_idx = mel_frame_idx * hop_length - win_length // 2
|
||||
|
||||
streamer = TextIteratorStreamer(processor.tokenizer, skip_special_tokens=True, clean_up_tokenization_spaces=True)
|
||||
generate_kwargs = {
|
||||
"input_ids": first_chunk_inputs.input_ids,
|
||||
"input_features": input_features_generator(),
|
||||
"num_delay_tokens": first_chunk_inputs.num_delay_tokens,
|
||||
"streamer": streamer,
|
||||
}
|
||||
thread = Thread(target=model.generate, kwargs=generate_kwargs)
|
||||
thread.start()
|
||||
|
||||
# Iterate over the streamer to get text chunks as they are generated
|
||||
print("Model output (streaming):", end=" ", flush=True)
|
||||
for text_chunk in streamer:
|
||||
print(text_chunk, end="", flush=True)
|
||||
```
|
||||
|
||||
This model was contributed by [Eustache Le Bihan](https://huggingface.co/eustlb).
|
||||
|
||||
## VoxtralRealtimeConfig
|
||||
|
||||
[[autodoc]] VoxtralRealtimeConfig
|
||||
|
||||
## VoxtralRealtimeEncoderConfig
|
||||
|
||||
[[autodoc]] VoxtralRealtimeEncoderConfig
|
||||
|
||||
## VoxtralRealtimeTextConfig
|
||||
|
||||
[[autodoc]] VoxtralRealtimeTextConfig
|
||||
|
||||
## VoxtralRealtimeFeatureExtractor
|
||||
|
||||
[[autodoc]] VoxtralRealtimeFeatureExtractor
|
||||
|
||||
## VoxtralRealtimeProcessor
|
||||
|
||||
[[autodoc]] VoxtralRealtimeProcessor
|
||||
- __call__
|
||||
|
||||
## VoxtralRealtimeEncoder
|
||||
|
||||
[[autodoc]] VoxtralRealtimeEncoder
|
||||
- forward
|
||||
|
||||
## VoxtralRealtimeModel
|
||||
|
||||
[[autodoc]] VoxtralRealtimeModel
|
||||
- forward
|
||||
|
||||
## VoxtralRealtimeForConditionalGeneration
|
||||
|
||||
[[autodoc]] VoxtralRealtimeForConditionalGeneration
|
||||
- forward
|
||||
- get_audio_features
|
||||
Reference in New Issue
Block a user