*This model was contributed to Hugging Face Transformers on 2024-09-18.*
# Mimi
[Mimi](https://huggingface.co/papers/2410.00037) is a neural audio codec model with pretrained and quantized variants, designed for efficient speech representation and compression. The model operates at 1.1 kbps with a 12.5 Hz frame rate and uses a convolutional encoder-decoder architecture combined with a residual vector quantizer of 16 codebooks. Mimi outputs two token streams, semantic and acoustic, to balance linguistic richness with high-fidelity reconstruction. Key features include a causal streaming encoder for low-latency use, dual-path tokenization for flexible downstream generation, and ready integration with large speech models like Moshi.
You can find the original Mimi checkpoints under the [Kyutai](https://huggingface.co/kyutai/models?search=mimi) organization.
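
To make the idea of a residual vector quantizer more concrete, here is a minimal, pure-Python sketch. It is a toy illustration only and not Mimi's actual implementation: the 2-D grid codebooks and the helper names (`make_codebook`, `rvq_encode`, `rvq_decode`) are hypothetical, whereas a real codec learns its codebooks over high-dimensional latent vectors. Each stage quantizes whatever error the previous stages left behind, so every additional codebook refines the reconstruction.

```python
from typing import List, Sequence

Vector = List[float]


def make_codebook(scale: float) -> List[Vector]:
    """Toy 2-D codebook (an assumption for illustration): a 3x3 grid that includes the zero vector."""
    return [[a * scale, b * scale] for a in (-1, 0, 1) for b in (-1, 0, 1)]


def squared_distance(u: Sequence[float], v: Sequence[float]) -> float:
    return sum((x - y) ** 2 for x, y in zip(u, v))


def rvq_encode(x: Sequence[float], codebooks: List[List[Vector]]) -> List[int]:
    """Return one code index per codebook; each stage quantizes the remaining residual."""
    residual = list(x)
    codes = []
    for codebook in codebooks:
        # pick the entry closest to what is still unexplained
        index = min(range(len(codebook)), key=lambda i: squared_distance(residual, codebook[i]))
        codes.append(index)
        # subtract the chosen entry so the next stage only sees the leftover error
        residual = [r - c for r, c in zip(residual, codebook[index])]
    return codes


def rvq_decode(codes: Sequence[int], codebooks: List[List[Vector]]) -> Vector:
    """Reconstruct the vector by summing the selected entry from each codebook."""
    out = [0.0] * len(codebooks[0][0])
    for index, codebook in zip(codes, codebooks):
        out = [o + c for o, c in zip(out, codebook[index])]
    return out


# four toy codebooks at progressively finer scales: 1, 0.5, 0.25, 0.125
codebooks = [make_codebook(0.5 ** k) for k in range(4)]
x = [0.3, -0.7]
codes = rvq_encode(x, codebooks)
reconstruction = rvq_decode(codes, codebooks)
print(codes, reconstruction)
```

Decoding with only the first few codes gives a coarser reconstruction, which is why a codec built this way can trade bitrate for quality by keeping fewer codebooks.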
> [!TIP]
> This model was contributed by [ylacombe](https://huggingface.co/ylacombe).
>
> Click on the Mimi models in the right sidebar for more examples of how to apply Mimi.
The example below demonstrates how to encode and decode audio with the [`MimiModel`] class.
```python
from datasets import Audio, load_dataset
from transformers import AutoFeatureExtractor, MimiModel
librispeech_dummy = load_dataset("hf-internal-testing/librispeech_asr_dummy", "clean", split="validation")
# load model and feature extractor
model = MimiModel.from_pretrained("kyutai/mimi", device_map="auto")
feature_extractor = AutoFeatureExtractor.from_pretrained("kyutai/mimi")
# load audio sample
librispeech_dummy = librispeech_dummy.cast_column("audio", Audio(sampling_rate=feature_extractor.sampling_rate))
audio_sample = librispeech_dummy[-1]["audio"]["array"]
# pre-process the raw waveform into model inputs
inputs = feature_extractor(raw_audio=audio_sample, sampling_rate=feature_extractor.sampling_rate, return_tensors="pt").to(model.device)
# encode the waveform into discrete audio codes, then decode the codes back into a waveform
encoder_outputs = model.encode(inputs["input_values"], inputs["padding_mask"])
audio_values = model.decode(encoder_outputs.audio_codes, inputs["padding_mask"])[0]
# or the equivalent with a forward pass
audio_values = model(inputs["input_values"], inputs["padding_mask"]).audio_values
```
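
As a rough sanity check on the length of the encoded sequence, the sketch below estimates how many code frames a clip produces. It assumes a 24 kHz sampling rate and the nominal 12.5 Hz frame rate (often rounded to 12 Hz); both values and the helper names are assumptions for illustration, and you can substitute `feature_extractor.sampling_rate` for the sampling rate. The exact count returned by the model may differ slightly depending on how the encoder pads its input.

```python
import math


def samples_per_frame(sampling_rate: int = 24_000, frame_rate: float = 12.5) -> float:
    """Number of waveform samples summarized by one code frame (assumed rates)."""
    return sampling_rate / frame_rate


def estimate_num_frames(num_samples: int, sampling_rate: int = 24_000, frame_rate: float = 12.5) -> int:
    """Approximate number of code frames for a clip, rounding up a partial last frame."""
    duration_s = num_samples / sampling_rate
    return math.ceil(duration_s * frame_rate)


# a hypothetical 5-second clip at the assumed 24 kHz sampling rate
num_samples = 5 * 24_000
print(samples_per_frame())               # samples covered by each frame
print(estimate_num_frames(num_samples))  # approximate length of the code sequence
```

You can compare the estimate against the last dimension of `encoder_outputs.audio_codes`, which should be laid out as `(batch_size, num_quantizers, codes_length)`.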
## MimiConfig
[[autodoc]] MimiConfig
## MimiModel
[[autodoc]] MimiModel
- decode
- encode
- forward