first commit

2026-06-05 16:53:03 +08:00
commit 06f1fd69a6
6047 changed files with 1895387 additions and 0 deletions
--- a/docs/source/en/model_doc/pe_video.md
+++ b/docs/source/en/model_doc/pe_video.md
@@ -0,0 +1,87 @@
+<!--Copyright 2025 The HuggingFace Team. All rights reserved.
+
+Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
+the License. You may obtain a copy of the License at
+
+http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
+an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
+specific language governing permissions and limitations under the License.
+
+⚠️ Note that this file is in Markdown but contain specific syntax for our doc-builder (similar to MDX) that may not be
+rendered properly in your Markdown viewer.
+
+-->
+<<<<<<< Updated upstream
+*This model was released on 2025-04-17 and added to Hugging Face Transformers on 2025-12-16.*
+=======
+*This model was published in HF papers on 2025-04-17 and contributed to Hugging Face Transformers on 2025-12-16.*
+>>>>>>> Stashed changes
+
+# PE Video
+
+[PE Video](https://huggingface.co/papers/2504.13181) is the video branch of Meta's Perception Encoder family. It contrastively aligns video clips with text into a shared embedding space, enabling zero-shot video classification and video–text retrieval from a single pretrained backbone.
+
+The encoder's rotary embeddings and patch embedder treat the temporal axis as a first-class dimension, so variable-length clips can be encoded without tiling each frame independently.
+
+You can find all the official PE Audio checkpoints under the [perception-encoder-audio-visual](https://huggingface.co/collections/facebook/perception-encoder-audio-visual) collection.
+
+## Quickstart
+
+```py
+import torch
+from transformers import AutoProcessor, PeVideoModel
+from transformers.video_utils import load_video
+
+processor = AutoProcessor.from_pretrained("facebook/pe-av-large")
+model = PeVideoModel.from_pretrained(
+    "facebook/pe-av-large",
+    device_map="auto",
+)
+
+video, _ = load_video("https://huggingface.co/datasets/hf-internal-testing/fixtures_videos/resolve/main/tennis.mp4")
+labels = ["a person playing tennis", "a person cooking", "a cat sleeping"]
+
+video_inputs = processor.video_processor(video, num_frames=16, return_tensors="pt").to(model.device)
+text_inputs = processor.tokenizer(labels, padding=True, return_tensors="pt").to(model.device)
+inputs = {**video_inputs, **text_inputs}
+
+with torch.no_grad():
+    outputs = model(**inputs)
+
+probs = outputs.logits_video_text.sigmoid()
+print({label: p.item() for label, p in zip(labels, probs[0])})
+```
+
+## Usage tips and notes
+
+- Variable-length videos use `padding_mask_videos` (not `attention_mask`). The video processor only pads and returns this mask when `return_tensors` is set — without it you get a list of per-clip tensors and no mask.
+- Pass `num_frames` to the video processor for fixed-length uniform sampling across `[0, total_frames-1]`. Omit it to fall back to fps-based sampling from the base class. Checkpoints are usually trained at a specific frame count, so match what the checkpoint expects.
+- Encoder input is `pixel_values_videos`. The encoder's `main_input_name` is `"pixel_values_videos"` while the full model's is `"input_ids"`, which matters when routing through generic utilities that inspect `main_input_name`.
+
+## PeVideoConfig
+
+[[autodoc]] PeVideoConfig
+
+## PeVideoEncoderConfig
+
+[[autodoc]] PeVideoEncoderConfig
+
+## PeVideoVideoProcessor
+
+[[autodoc]] PeVideoVideoProcessor
+
+## PeVideoProcessor
+
+[[autodoc]] PeVideoProcessor
+
+## PeVideoEncoder
+
+[[autodoc]] PeVideoEncoder
+    - forward
+
+## PeVideoModel
+
+[[autodoc]] PeVideoModel
+    - forward