first commit

2026-06-05 16:53:03 +08:00
commit 06f1fd69a6
6047 changed files with 1895387 additions and 0 deletions
--- a/docs/source/en/model_doc/qwen3_vl.md
+++ b/docs/source/en/model_doc/qwen3_vl.md
@@ -0,0 +1,125 @@
+<!--Copyright 2025 The Qwen Team and The HuggingFace Inc. team. All rights reserved.
+
+Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
+the License. You may obtain a copy of the License at
+
+http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
+an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
+specific language governing permissions and limitations under the License.
+
+⚠️ Note that this file is in Markdown but contains specific syntax for our doc-builder (similar to MDX) that may not be
+rendered properly in your Markdown viewer.
+
+-->
+*This model was published in HF papers on 2025-02-19 and contributed to Hugging Face Transformers on 2025-09-15.*
+
+<div style="float: right;">
+    <div class="flex flex-wrap space-x-1">
+<img alt="FlashAttention" src="https://img.shields.io/badge/%E2%9A%A1%EF%B8%8E%20FlashAttention-eae0c8?style=flat">
+<img alt="SDPA" src="https://img.shields.io/badge/SDPA-DE3412?style=flat&logo=pytorch&logoColor=white">    </div>
+</div>
+
+# Qwen3-VL
+
+[Qwen3-VL](https://huggingface.co/papers/2502.13923) is a series of multimodal vision-language models that includes dense and Mixture-of-Experts (MoE) variants, as well as Instruct and Thinking versions. Compared with its predecessors, Qwen3-VL improves visual understanding while maintaining strong text-only capabilities. The key architectural changes are: an enhanced MRoPE with an interleaved layout for better spatial-temporal modeling; DeepStack integration, which leverages multi-level features from the Vision Transformer (ViT); and improved video understanding through text-based time alignment, which moves from T-RoPE to textual timestamps for more precise temporal grounding. Together, these changes target stronger performance on complex multimodal tasks.
+
+## Model usage
+
+<hfoptions id="usage">
+<hfoption id="AutoModel">
+
+```python
+from transformers import AutoProcessor, Qwen3VLForConditionalGeneration
+
+# Load the model and its processor; device_map="auto" places the weights on the available device(s).
+model = Qwen3VLForConditionalGeneration.from_pretrained(
+    "Qwen/Qwen3-VL",
+    device_map="auto",
+    attn_implementation="sdpa",
+)
+processor = AutoProcessor.from_pretrained("Qwen/Qwen3-VL")
+
+# A single user turn containing one image and one text prompt.
+messages = [
+    {
+        "role": "user",
+        "content": [
+            {
+                "type": "image",
+                "url": "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/pipeline-cat-chonk.jpeg",
+            },
+            {"type": "text", "text": "Describe this image."},
+        ],
+    }
+]
+
+# Apply the chat template, tokenize, and move the inputs to the model's device.
+inputs = processor.apply_chat_template(
+    messages,
+    tokenize=True,
+    add_generation_prompt=True,
+    return_dict=True,
+    return_tensors="pt",
+).to(model.device)
+inputs.pop("token_type_ids", None)
+
+generated_ids = model.generate(**inputs, max_new_tokens=128)
+# Drop the prompt tokens so that only the newly generated tokens are decoded.
+generated_ids_trimmed = [
+    out_ids[len(in_ids) :] for in_ids, out_ids in zip(inputs.input_ids, generated_ids)
+]
+output_text = processor.batch_decode(
+    generated_ids_trimmed, skip_special_tokens=True, clean_up_tokenization_spaces=False
+)
+print(output_text)
+```
+
+</hfoption>
+</hfoptions>
+
+## Qwen3VLConfig
+
+[[autodoc]] Qwen3VLConfig
+
+## Qwen3VLVisionConfig
+
+[[autodoc]] Qwen3VLVisionConfig
+
+## Qwen3VLTextConfig
+
+[[autodoc]] Qwen3VLTextConfig
+
+## Qwen3VLProcessor
+
+[[autodoc]] Qwen3VLProcessor
+    - __call__
+
+## Qwen3VLVideoProcessor
+
+[[autodoc]] Qwen3VLVideoProcessor
+
+## Qwen3VLVisionModel
+
+[[autodoc]] Qwen3VLVisionModel
+    - forward
+
+## Qwen3VLTextModel
+
+[[autodoc]] Qwen3VLTextModel
+    - forward
+
+## Qwen3VLModel
+
+[[autodoc]] Qwen3VLModel
+    - forward
+    - get_video_features
+    - get_image_features
+
+## Qwen3VLForConditionalGeneration
+
+[[autodoc]] Qwen3VLForConditionalGeneration
+    - forward
+    - get_video_features
+    - get_image_features