*This model was contributed to Hugging Face Transformers on 2024-09-18.*

# Mimi

[Mimi](https://huggingface.co/papers/2410.00037) is a neural audio codec model with pretrained and quantized variants, designed for efficient speech representation and compression. The model operates at 1.1 kbps with a 12.5 Hz frame rate and uses a convolutional encoder-decoder architecture combined with a residual vector quantizer of 16 codebooks. Mimi outputs two token streams, semantic and acoustic, to balance linguistic richness with high-fidelity reconstruction. Key features include a causal streaming encoder for low-latency use, dual-path tokenization for flexible downstream generation, and ready integration with large speech models like Moshi.

You can find the original Mimi checkpoints under the [Kyutai](https://huggingface.co/kyutai/models?search=mimi) organization.

> [!TIP]
> This model was contributed by [ylacombe](https://huggingface.co/ylacombe).
>
> Click on the Mimi models in the right sidebar for more examples of how to apply Mimi.

The example below demonstrates how to encode and decode audio with the [`MimiModel`] class.

```python
from datasets import Audio, load_dataset
from transformers import AutoFeatureExtractor, MimiModel

librispeech_dummy = load_dataset("hf-internal-testing/librispeech_asr_dummy", "clean", split="validation")

# load model and feature extractor
model = MimiModel.from_pretrained("kyutai/mimi", device_map="auto")
feature_extractor = AutoFeatureExtractor.from_pretrained("kyutai/mimi")

# load audio sample
librispeech_dummy = librispeech_dummy.cast_column("audio", Audio(sampling_rate=feature_extractor.sampling_rate))
audio_sample = librispeech_dummy[-1]["audio"]["array"]
inputs = feature_extractor(raw_audio=audio_sample, sampling_rate=feature_extractor.sampling_rate, return_tensors="pt").to(model.device)

encoder_outputs = model.encode(inputs["input_values"], inputs["padding_mask"])
audio_values = model.decode(encoder_outputs.audio_codes, inputs["padding_mask"])[0]

# or the equivalent with a forward pass
audio_values = model(inputs["input_values"], inputs["padding_mask"]).audio_values
```

## MimiConfig

[[autodoc]] MimiConfig

## MimiModel

[[autodoc]] MimiModel
    - decode
    - encode
    - forward
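
## Estimating token counts

To get a feel for how compact the codec's output is, the sketch below estimates how many frames and discrete tokens a clip of a given length would produce. It is plain arithmetic and does not call the model. The 24 kHz sampling rate and 12.5 Hz frame rate are assumptions made for this sketch, and the helper names are hypothetical, not part of Transformers. The frame count returned by `MimiModel.encode` may differ by a frame depending on how the input is padded.

```python
import math

# Assumptions for this sketch (check them against the checkpoint's config):
SAMPLING_RATE_HZ = 24_000  # assumed input sampling rate
FRAME_RATE_HZ = 12.5       # assumed codec frame rate
NUM_CODEBOOKS = 16         # number of codebooks quoted in the overview


def samples_per_frame(sampling_rate=SAMPLING_RATE_HZ, frame_rate=FRAME_RATE_HZ):
    """Number of waveform samples summarized by one codec frame."""
    return sampling_rate / frame_rate


def num_frames(num_samples, sampling_rate=SAMPLING_RATE_HZ, frame_rate=FRAME_RATE_HZ):
    """Frames needed to cover num_samples, rounding up for a partial last frame."""
    return math.ceil(num_samples * frame_rate / sampling_rate)


def num_tokens(num_samples, num_codebooks=NUM_CODEBOOKS):
    """Total discrete tokens: one token per codebook per frame."""
    return num_frames(num_samples) * num_codebooks


if __name__ == "__main__":
    two_seconds = 2 * SAMPLING_RATE_HZ
    print(samples_per_frame())      # samples covered by one frame
    print(num_frames(two_seconds))  # frames for a 2-second clip
    print(num_tokens(two_seconds))  # tokens across all codebooks
```

Under these assumptions each frame covers 1920 samples, so a 2-second clip maps to 25 frames, or 400 tokens when all 16 codebooks are used.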