first commit

2026-06-05 16:53:03 +08:00
commit 06f1fd69a6
6047 changed files with 1895387 additions and 0 deletions
--- a/docs/source/en/model_doc/pe_audio_video.md
+++ b/docs/source/en/model_doc/pe_audio_video.md
@@ -0,0 +1,89 @@
+<!--Copyright 2025 The HuggingFace Team. All rights reserved.
+
+Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
+the License. You may obtain a copy of the License at
+
+http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
+an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
+specific language governing permissions and limitations under the License.
+
+⚠️ Note that this file is in Markdown but contains specific syntax for our doc-builder (similar to MDX) that may not be
+rendered properly in your Markdown viewer.
+
+-->
+*This model was released on 2025-04-17 and added to Hugging Face Transformers on 2025-12-16.*
+
+# PE Audio Video
+
+[PE Audio Video](https://huggingface.co/papers/2504.13181) is the joint audio–video branch of Meta's Perception Encoder family. It encodes audio and video streams together with a shared text tower, producing contrastive embeddings for every pairwise combination (audio-text, video-text, audio-video, and audio+text-video) in a single forward pass.
+
+Internally, the model aligns the video feature sequence to the audio's temporal resolution via nearest-neighbor interpolation, so clips whose video frame rates and audio sample rates differ stay in lockstep. The text encoder weights are tied across the audio and video branches.
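+
+The alignment idea can be sketched with `torch.nn.functional.interpolate` in `nearest` mode. This is a minimal illustration and not the model's internal code: the helper name `align_video_to_audio` and the tensor shapes are assumptions made for the example.
+
+```py
+import torch
+import torch.nn.functional as F
+
+
+def align_video_to_audio(video_hidden: torch.Tensor, audio_len: int) -> torch.Tensor:
+    """Hypothetical helper: resample (batch, video_len, dim) features to (batch, audio_len, dim)."""
+    # F.interpolate resamples the last axis of a (batch, channels, length) tensor.
+    x = video_hidden.transpose(1, 2)
+    x = F.interpolate(x, size=audio_len, mode="nearest")
+    return x.transpose(1, 2)
+
+
+# Four video steps stretched to eight audio steps: each step is repeated twice.
+video_hidden = torch.arange(4, dtype=torch.float32).view(1, 4, 1)
+aligned = align_video_to_audio(video_hidden, audio_len=8)
+print(aligned.flatten().tolist())  # [0.0, 0.0, 1.0, 1.0, 2.0, 2.0, 3.0, 3.0]
+```
+
+Nearest-neighbor resampling only repeats or drops existing feature vectors, so no new values are blended across frames.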
+
+You can find all the official PE Audio Video checkpoints under the [perception-encoder-audio-visual](https://huggingface.co/collections/facebook/perception-encoder-audio-visual) collection.
+
+## Quickstart
+
+```py
+import torch
+from datasets import Audio, load_dataset
+from transformers import AutoProcessor, PeAudioVideoModel
+from transformers.video_utils import load_video
+
+processor = AutoProcessor.from_pretrained("facebook/pe-av-large")
+model = PeAudioVideoModel.from_pretrained(
+    "facebook/pe-av-large",
+    device_map="auto",
+)
+
+ds = load_dataset("hf-internal-testing/librispeech_asr_dummy", "clean", split="validation")
+# The dummy LibriSpeech clips are stored at 16 kHz; resample them to the 48 kHz rate declared below.
+ds = ds.cast_column("audio", Audio(sampling_rate=48_000))
+audio = ds[0]["audio"]["array"]
+video, _ = load_video("https://huggingface.co/datasets/hf-internal-testing/fixtures_videos/resolve/main/tennis.mp4")
+labels = ["a person playing tennis with background crowd", "a dog barking in a park"]
+
+# Preprocess each modality separately, then merge the results into one set of model inputs.
+audio_inputs = processor.feature_extractor(audio, sampling_rate=48_000, return_tensors="pt").to(model.device)
+video_inputs = processor.video_processor(video, num_frames=16, return_tensors="pt").to(model.device)
+text_inputs = processor.tokenizer(labels, padding=True, return_tensors="pt").to(model.device)
+inputs = {**audio_inputs, **video_inputs, **text_inputs}
+
+with torch.no_grad():
+    outputs = model(**inputs)
+
+print("audio-text:", outputs.logits_audio_text.sigmoid().tolist())
+print("video-text:", outputs.logits_video_text.sigmoid().tolist())
+print("audio-video:", outputs.logits_audio_video.sigmoid().tolist())
+```
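+
+To turn a score matrix into a prediction, one option is to take the highest-scoring label per row. The sketch below uses made-up logits in place of real model outputs and assumes a `(num_clips, num_labels)` layout, which is an assumption for illustration and not a documented shape.
+
+```py
+import torch
+
+labels = ["a person playing tennis with background crowd", "a dog barking in a park"]
+# Made-up logits standing in for `outputs.logits_video_text`; the (num_clips, num_labels) shape is assumed.
+logits = torch.tensor([[2.0, -3.0]])
+
+scores = logits.sigmoid()     # per-pair scores in [0, 1]
+best = scores.argmax(dim=-1)  # index of the top-scoring label for each clip
+for clip_idx, label_idx in enumerate(best.tolist()):
+    print(clip_idx, labels[label_idx], round(scores[clip_idx, label_idx].item(), 3))
+```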
+
+## Usage tips and notes
+
+- [`PeAudioVideoModel`] requires at least two of `input_ids`, `input_values`, and `pixel_values_videos`. If only two are provided, it dispatches to the audio-only or video-only sub-model. Passing all three triggers the joint audio-video-text path and returns the full set of logit matrices in [`PeAudioVideoOutput`].
+- Audio uses `padding_mask` and video uses `padding_mask_videos`, and both can be passed in the same call. They are independent masks; do not conflate them with `attention_mask`, which is reserved for the text tower.
+- Audio–video alignment runs once per batch element inside `_align_video_hidden_state`, so batches whose items have very different audio/video lengths are processed in a loop rather than vectorized. Keep batch items roughly balanced for throughput.
+- The text tower's weights are tied across branches via `_tied_weights_keys`, so do not try to load separate text encoders for the audio and video halves.
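+
+One way to keep batch items balanced, as suggested in the tips above, is to group clips of similar length before batching. The helper below is a hypothetical sketch in plain Python (`bucket_by_length` is not part of Transformers, and the sample counts are invented):
+
+```py
+def bucket_by_length(lengths: list[int], batch_size: int) -> list[list[int]]:
+    """Hypothetical helper: group item indices into batches of similar length."""
+    # Sort indices by clip length, then cut the sorted order into consecutive chunks.
+    order = sorted(range(len(lengths)), key=lambda i: lengths[i])
+    return [order[i:i + batch_size] for i in range(0, len(order), batch_size)]
+
+
+# Invented audio lengths in samples: two short clips and two long ones.
+audio_lengths = [48_000, 480_000, 52_000, 470_000]
+batches = bucket_by_length(audio_lengths, batch_size=2)
+print(batches)  # [[0, 2], [3, 1]]
+```
+
+Each batch then holds clips of comparable duration, which limits padding and keeps the per-element alignment work similar across items.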
+
+## PeAudioVideoConfig
+
+[[autodoc]] PeAudioVideoConfig
+
+## PeAudioVideoEncoderConfig
+
+[[autodoc]] PeAudioVideoEncoderConfig
+
+## PeAudioVideoProcessor
+
+[[autodoc]] PeAudioVideoProcessor
+
+## PeAudioVideoEncoder
+
+[[autodoc]] PeAudioVideoEncoder
+    - forward
+
+## PeAudioVideoModel
+
+[[autodoc]] PeAudioVideoModel
+    - forward