<<<<<<< Updated upstream *This model was released on 2025-04-17 and added to Hugging Face Transformers on 2025-12-16.* ======= *This model was published in HF papers on 2025-04-17 and contributed to Hugging Face Transformers on 2025-12-16.* >>>>>>> Stashed changes # PE Audio [PE Audio](https://huggingface.co/papers/2504.13181) is the audio branch of Meta's Perception Encoder family. It contrastively aligns raw waveforms with text into a shared embedding space, trained on paired audio–caption data for cross-modal retrieval and zero-shot audio classification. Two heads are exposed on top of the same encoder. [`PeAudioModel`] returns one pooled embedding per clip for clip-level retrieval, while [`PeAudioFrameLevelModel`] returns one embedding every 40 ms for event localization and fine-grained temporal analysis. You can find all the official PE Audio checkpoints under the [perception-encoder-audio-visual](https://huggingface.co/collections/facebook/perception-encoder-audio-visual) collection. ## Quickstart ```py import torch from datasets import load_dataset from transformers import AutoProcessor, PeAudioModel processor = AutoProcessor.from_pretrained("facebook/pe-av-large") model = PeAudioModel.from_pretrained( "facebook/pe-av-large", device_map="auto", ) ds = load_dataset("hf-internal-testing/librispeech_asr_dummy", "clean", split="validation") audio = ds[0]["audio"]["array"] labels = ["a dog barking", "a person speaking", "music playing"] audio_inputs = processor.feature_extractor(audio, sampling_rate=48_000, return_tensors="pt").to(model.device) text_inputs = processor.tokenizer(labels, padding=True, return_tensors="pt").to(model.device) inputs = {**audio_inputs, **text_inputs} with torch.no_grad(): outputs = model(**inputs) probs = outputs.logits_audio_text.sigmoid() print({label: p.item() for label, p in zip(labels, probs[0])}) ``` ## Usage tips and notes - Audio must be mono (`feature_size=1`) and resampled to 48 kHz — the feature extractor warns but does not resample for you. Stereo input is not supported. - Variable-length audio is handled with `padding_mask` (not the usual `attention_mask`). The mask is downsampled internally by `dac_config.hop_length` before it reaches the encoder, so pass the raw waveform-resolution mask that the feature extractor returns. - [`PeAudioModel`] returns logits of shape `(n_audio, n_text)`. [`PeAudioFrameLevelModel`] returns `(n_audio, n_text, n_frames)` with one frame every 40 ms. Pick the class that matches the task — they share weights so swapping is cheap. - The text tower is a shared encoder loaded via `AutoModel` from `config.text_config`. The tokenizer is attached to the processor via `AutoTokenizer`, not a dedicated class. ## PeAudioConfig [[autodoc]] PeAudioConfig ## PeAudioEncoderConfig [[autodoc]] PeAudioEncoderConfig ## PeAudioFeatureExtractor [[autodoc]] PeAudioFeatureExtractor - __call__ ## PeAudioProcessor [[autodoc]] PeAudioProcessor ## PeAudioEncoder [[autodoc]] PeAudioEncoder - forward ## PeAudioModel [[autodoc]] PeAudioModel - forward ## PeAudioFrameLevelModel [[autodoc]] PeAudioFrameLevelModel - forward