This model was contributed to Hugging Face Transformers on 2026-03-26.
CohereAsr
Overview
Cohere ASR, released by Cohere on March 26, 2026, is a 2B-parameter, Conformer-based encoder-decoder speech recognition model.
This model was contributed by Eustache Le Bihan.
Usage
Short-form transcription
from transformers import AutoProcessor, CohereAsrForConditionalGeneration
from transformers.audio_utils import load_audio
revision = "refs/pr/6"
processor = AutoProcessor.from_pretrained("CohereLabs/cohere-transcribe-03-2026", revision=revision)
model = CohereAsrForConditionalGeneration.from_pretrained("CohereLabs/cohere-transcribe-03-2026", device_map="auto", revision=revision)
audio = load_audio(
"https://huggingface.co/datasets/hf-internal-testing/dummy-audio-samples/resolve/main/bcn_weather.mp3",
sampling_rate=16000,
)
# Move the inputs to the model's device and cast floating-point features to the model's dtype.
inputs = processor(audio, sampling_rate=16000, return_tensors="pt", language="en").to(model.device, dtype=model.dtype)
outputs = model.generate(**inputs, max_new_tokens=256)
text = processor.decode(outputs, skip_special_tokens=True)
print(text)
Punctuation control
Pass punctuation=False to obtain lower-cased output without punctuation marks.
inputs_pnc = processor(audio, sampling_rate=16000, return_tensors="pt", language="en", punctuation=True).to(model.device)
inputs_nopnc = processor(audio, sampling_rate=16000, return_tensors="pt", language="en", punctuation=False).to(model.device)
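When comparing punctuation=False output against a punctuated reference transcript, it can help to bring the reference into the same lower-cased, unpunctuated form. The sketch below is a post-processing approximation only: `normalize_like_nopnc` is a hypothetical helper, not part of the processor, and the model's own handling of characters such as apostrophes or hyphens may differ.

```python
import unicodedata


def normalize_like_nopnc(text: str) -> str:
    """Hypothetical helper: lower-case the text and drop punctuation characters.

    Assumption: "punctuation" means any character whose Unicode category
    starts with "P". The model's actual output conventions may differ.
    """
    kept = [ch for ch in text if not unicodedata.category(ch).startswith("P")]
    # Collapse whitespace left behind by removed characters.
    return " ".join("".join(kept).lower().split())


reference = "Hello, world! The weather in Barcelona is sunny."
print(normalize_like_nopnc(reference))
# hello world the weather in barcelona is sunny
```

Using Unicode categories rather than `string.punctuation` keeps accented letters intact and also removes marks such as `¿` that appear in non-English transcripts.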
Long-form transcription
When the audio is longer than the feature extractor's max_audio_clip_s, the feature extractor automatically splits the waveform into chunks.
The processor then reassembles the per-chunk transcriptions using the audio_chunk_index returned alongside the input features.
audio_long = load_audio(
"https://huggingface.co/datasets/hf-internal-testing/dummy-audio-samples/resolve/main/obama_first_45_secs.mp3",
sampling_rate=16000,
)
inputs = processor(audio=audio_long, return_tensors="pt", language="en", sampling_rate=16000).to(model.device, dtype=model.dtype)
# Keep the chunk index so that decode() can stitch the per-chunk transcriptions back together.
audio_chunk_index = inputs.get("audio_chunk_index")
outputs = model.generate(**inputs, max_new_tokens=256)
text = processor.decode(outputs, skip_special_tokens=True, audio_chunk_index=audio_chunk_index, language="en")
print(text)
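To make the chunking step more concrete, here is a minimal sketch of splitting a waveform into fixed-length windows. It is an illustration under stated assumptions, not the feature extractor's implementation: `split_into_chunks` is a hypothetical helper, the 30-second limit is a placeholder rather than the real max_audio_clip_s default, and the actual extractor may choose chunk boundaries differently (for example, with overlap or at quiet points).

```python
import numpy as np


def split_into_chunks(waveform, sampling_rate=16000, max_audio_clip_s=30.0):
    """Hypothetical helper: cut a 1-D waveform into consecutive chunks.

    Every chunk holds at most max_audio_clip_s seconds of audio; the last
    chunk may be shorter. The 30.0 default is a placeholder value.
    """
    chunk_len = int(max_audio_clip_s * sampling_rate)
    return [waveform[start:start + chunk_len] for start in range(0, len(waveform), chunk_len)]


# 45 seconds of silence as a stand-in for a real recording.
waveform = np.zeros(45 * 16000, dtype=np.float32)
chunks = split_into_chunks(waveform)
print([len(chunk) / 16000 for chunk in chunks])
# [30.0, 15.0]
```

Each chunk is transcribed on its own, which is why a chunk index is needed afterwards to put the pieces back in order.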
Batched inference
Multiple audio files can be processed in a single call. When the batch mixes short-form and long-form audio, the processor handles chunking and reassembly.
audio_short = load_audio(
"https://huggingface.co/datasets/hf-internal-testing/dummy-audio-samples/resolve/main/bcn_weather.mp3",
sampling_rate=16000,
)
audio_long = load_audio(
"https://huggingface.co/datasets/hf-internal-testing/dummy-audio-samples/resolve/main/obama_first_45_secs.mp3",
sampling_rate=16000,
)
inputs = processor([audio_short, audio_long], sampling_rate=16000, return_tensors="pt", language="en").to(model.device, dtype=model.dtype)
# The chunk index records which chunks belong to which input audio.
audio_chunk_index = inputs.get("audio_chunk_index")
outputs = model.generate(**inputs, max_new_tokens=256)
text = processor.decode(
outputs, skip_special_tokens=True, audio_chunk_index=audio_chunk_index, language="en"
)
print(text)
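The reassembly step can be pictured as grouping chunk transcriptions by the audio they came from. The sketch below assumes the chunk index is a list of (sample_index, chunk_index) pairs, one per decoded chunk; the real layout of audio_chunk_index is not specified here and may differ. `reassemble` is a hypothetical helper, not the processor's API.

```python
from collections import defaultdict


def reassemble(chunk_texts, chunk_index):
    """Hypothetical helper: join per-chunk transcriptions per input audio.

    Assumption: chunk_index is a list of (sample_index, chunk_index) pairs
    aligned with chunk_texts.
    """
    grouped = defaultdict(list)
    for text, (sample_idx, chunk_idx) in zip(chunk_texts, chunk_index):
        grouped[sample_idx].append((chunk_idx, text))
    # Order chunks within each sample, then emit one string per sample.
    return [
        " ".join(text for _, text in sorted(grouped[i], key=lambda pair: pair[0]))
        for i in sorted(grouped)
    ]


# One short clip (a single chunk) and one long clip (two chunks).
chunk_texts = ["short clip", "first part", "second part"]
chunk_index = [(0, 0), (1, 0), (1, 1)]
print(reassemble(chunk_texts, chunk_index))
# ['short clip', 'first part second part']
```

A short clip simply contributes a single chunk, so the same grouping logic covers batches that mix short-form and long-form audio.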
Non-English transcription
Set language to the code of the spoken language (for example, "es" for Spanish) to transcribe audio in any of the 14 supported languages.
audio_es = load_audio(
"https://huggingface.co/datasets/hf-internal-testing/dummy-audio-samples/resolve/main/fleur_es_sample.wav",
sampling_rate=16000,
)
inputs = processor(audio_es, sampling_rate=16000, return_tensors="pt", language="es", punctuation=True).to(model.device, dtype=model.dtype)
outputs = model.generate(**inputs, max_new_tokens=256)
text = processor.decode(outputs, skip_special_tokens=True)
print(text)
CohereAsrConfig
autodoc CohereAsrConfig
CohereAsrFeatureExtractor
autodoc CohereAsrFeatureExtractor - __call__
CohereAsrProcessor
autodoc CohereAsrProcessor - __call__
CohereAsrPreTrainedModel
autodoc CohereAsrPreTrainedModel - forward
CohereAsrModel
autodoc CohereAsrModel - forward
CohereAsrForConditionalGeneration
autodoc CohereAsrForConditionalGeneration - forward