This model was released on 2025-04-17 and added to Hugging Face Transformers on 2025-12-16.
PE Audio
PE Audio is the audio branch of Meta's Perception Encoder family. It is trained contrastively on paired audio-caption data to align raw waveforms and text in a shared embedding space, which supports cross-modal retrieval and zero-shot audio classification.
Two heads are exposed on top of the same encoder. [PeAudioModel] returns one pooled embedding per clip for clip-level retrieval, while [PeAudioFrameLevelModel] returns one embedding every 40 ms for event localization and fine-grained temporal analysis.
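To make the 40 ms frame spacing concrete, here is a minimal sketch that turns per-frame scores for one clip and one text prompt into time spans. The helper name `frames_to_events`, the 0.5 threshold, and the assumption that frame `i` covers the interval from `i * 40 ms` to `(i + 1) * 40 ms` are illustrative choices, not part of the library.

```python
FRAME_SECONDS = 0.040  # one frame every 40 ms, as stated above


def frames_to_events(scores, threshold=0.5):
    """Group consecutive frames scoring >= threshold into (start_s, end_s) spans.

    Hypothetical helper: assumes frame i covers [i * 40 ms, (i + 1) * 40 ms).
    """
    events = []
    start = None
    for i, score in enumerate(scores):
        if score >= threshold and start is None:
            start = i  # a new event begins at this frame
        elif score < threshold and start is not None:
            events.append((round(start * FRAME_SECONDS, 3), round(i * FRAME_SECONDS, 3)))
            start = None
    if start is not None:
        # The event runs to the end of the clip.
        events.append((round(start * FRAME_SECONDS, 3), round(len(scores) * FRAME_SECONDS, 3)))
    return events


# Made-up per-frame probabilities for a single (audio, text) pair.
scores = [0.1, 0.2, 0.9, 0.8, 0.7, 0.1, 0.1, 0.6, 0.9]
print(frames_to_events(scores))  # [(0.08, 0.2), (0.28, 0.36)]
```

In practice the threshold usually needs tuning per label, and short gaps between spans may be worth merging.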
You can find all the official PE Audio checkpoints under the perception-encoder-audio-visual collection.
Quickstart
import torch
from datasets import Audio, load_dataset
from transformers import AutoProcessor, PeAudioModel

processor = AutoProcessor.from_pretrained("facebook/pe-av-large")
model = PeAudioModel.from_pretrained(
    "facebook/pe-av-large",
    device_map="auto",
)

ds = load_dataset("hf-internal-testing/librispeech_asr_dummy", "clean", split="validation")
# The dummy LibriSpeech clips are 16 kHz; resample on load because the feature extractor does not.
ds = ds.cast_column("audio", Audio(sampling_rate=48_000))
audio = ds[0]["audio"]["array"]
labels = ["a dog barking", "a person speaking", "music playing"]

# Encode audio and text separately, then merge them into one batch.
audio_inputs = processor.feature_extractor(audio, sampling_rate=48_000, return_tensors="pt").to(model.device)
text_inputs = processor.tokenizer(labels, padding=True, return_tensors="pt").to(model.device)
inputs = {**audio_inputs, **text_inputs}

with torch.no_grad():
    outputs = model(**inputs)

# Sigmoid scores each audio-text pair independently.
probs = outputs.logits_audio_text.sigmoid()
print({label: p.item() for label, p in zip(labels, probs[0])})
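Because the quickstart applies a sigmoid rather than a softmax, each label is scored independently and the values for one clip need not sum to 1. The sketch below illustrates this with made-up logits in plain Python; `rank_labels` is a hypothetical helper, not a library function.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def rank_labels(labels, logits_row):
    """Return (label, probability) pairs sorted from most to least likely."""
    probs = [sigmoid(v) for v in logits_row]
    return sorted(zip(labels, probs), key=lambda pair: pair[1], reverse=True)


labels = ["a dog barking", "a person speaking", "music playing"]
logits_row = [-2.0, 3.0, 0.0]  # made-up logits for one clip

ranked = rank_labels(labels, logits_row)
best_label, best_prob = ranked[0]
print(best_label, round(best_prob, 3))  # a person speaking 0.953
print(round(sum(p for _, p in ranked), 3))  # 1.572
```

For multi-label audio (speech over music, for example), thresholding each score separately is usually more appropriate than taking only the top label.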
Usage tips and notes
- Audio must be mono (feature_size=1) and resampled to 48 kHz. The feature extractor warns but does not resample for you. Stereo input is not supported.
- Variable-length audio is handled with padding_mask (not the usual attention_mask). The mask is downsampled internally by dac_config.hop_length before it reaches the encoder, so pass the raw waveform-resolution mask that the feature extractor returns.
- [PeAudioModel] returns logits of shape (n_audio, n_text). [PeAudioFrameLevelModel] returns (n_audio, n_text, n_frames) with one frame every 40 ms. Pick the class that matches the task; they share weights, so swapping is cheap.
- The text tower is a shared encoder loaded via AutoModel from config.text_config. The tokenizer is attached to the processor via AutoTokenizer, not a dedicated class.
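Since the feature extractor expects mono 48 kHz input and does not resample, raw arrays need preparing first. The following is a minimal sketch using NumPy and SciPy; the helper name `to_mono_48k` and the channels-first `(channels, samples)` layout are assumptions made for illustration.

```python
from math import gcd

import numpy as np
from scipy.signal import resample_poly

TARGET_SR = 48_000  # sampling rate required by the tips above


def to_mono_48k(waveform, orig_sr):
    """Downmix a (channels, samples) or (samples,) array to mono at 48 kHz.

    Hypothetical helper; assumes a channels-first layout for 2-D input.
    """
    waveform = np.asarray(waveform, dtype=np.float32)
    if waveform.ndim == 2:
        # Average the channels to get a single mono track.
        waveform = waveform.mean(axis=0)
    if orig_sr != TARGET_SR:
        g = gcd(TARGET_SR, orig_sr)
        # Polyphase resampling applies an anti-aliasing filter.
        waveform = resample_poly(waveform, TARGET_SR // g, orig_sr // g)
    return waveform.astype(np.float32)


# One second of synthetic stereo noise at 16 kHz.
stereo_16k = np.random.default_rng(0).standard_normal((2, 16_000))
mono_48k = to_mono_48k(stereo_16k, orig_sr=16_000)
print(mono_48k.shape)  # (48000,)
```

The result can then be passed to the feature extractor with sampling_rate=48_000, as in the quickstart.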
PeAudioConfig
autodoc PeAudioConfig
PeAudioEncoderConfig
autodoc PeAudioEncoderConfig
PeAudioFeatureExtractor
autodoc PeAudioFeatureExtractor - __call__
PeAudioProcessor
autodoc PeAudioProcessor
PeAudioEncoder
autodoc PeAudioEncoder - forward
PeAudioModel
autodoc PeAudioModel - forward
PeAudioFrameLevelModel
autodoc PeAudioFrameLevelModel - forward