first commit

2026-06-05 16:53:03 +08:00
commit 06f1fd69a6
6047 changed files with 1895387 additions and 0 deletions
--- a/docs/source/en/chat_content_patterns.md
+++ b/docs/source/en/chat_content_patterns.md
@@ -0,0 +1,198 @@
+<!--Copyright 2025 The HuggingFace Team. All rights reserved.
+
+Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
+the License. You may obtain a copy of the License at
+
+http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
+an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
+specific language governing permissions and limitations under the License.
+
+⚠️ Note that this file is in Markdown but contain specific syntax for our doc-builder (similar to MDX) that may not be
+rendered properly in your Markdown viewer.
+
+-->
+
+# Chat message patterns
+
+Chat models expect conversations as a list of dictionaries. Each dictionary uses `role` and `content` keys. The `content` key holds the user message passed to the model. Large language models accept text and tools and multimodal models combine text with images, videos, and audio.
+
+Transformers uses a unified format where each modality type is specified explicitly, making it straightforward to mix and match inputs in a single message.
+
+This guide covers message formatting patterns for each modality, tools, batch inference, and multi-turn conversations.
+
+## Text
+
+Text is the most basic content type. It's the foundation for all other patterns. Pass your message to `"content"` as a string.
+
+```py
+message = [
+    {
+        "role": "user",
+        "content": "Explain the French Bread Law."
+    }
+]
+```
+
+You could also use the explicit `"type": "text"` format to keep your code consistent when you add images, videos, or audio later.
+
+```py
+message = [
+    {
+        "role": "user",
+        "content": [{"type": "text", "text": "Explain the French Bread Law."}]
+    }
+]
+```
+
+## Tools
+
+[Tools](./chat_extras) are functions a chat model can call, like getting real-time weather data, instead of generating it on its own.
+
+The `assistant` role handles the tool request. Set `"type": "function"` in the `"tool_calls"` key and provide your tool to the `"function"` key. Append the assistant's tool request to your message.
+
+```py
+weather = {"name": "get_current_temperature", "arguments": {"location": "Paris, France", "unit": "celsius"}}
+message.append(
+    {
+        "role": "assistant", 
+        "tool_calls": [{"type": "function", "function": weather}]
+    }
+)
+```
+
+The `tool` role handles the result. Append it in `"content"`. This value should always be a string.
+
+```py
+message.append({"role": "tool", "content": "22"})
+```
+
+## Multimodal
+
+Multimodal models extend this format to handle images, videos, and audio. Each input specifies its `"type"` and provides the media with `"url"` or `"path"`.
+
+### Image
+
+Set `"type": "image"` and use `"url"` for links or `"path"` for local files.
+
+```py
+message = [
+    {
+        "role": "user",
+        "content": [
+            {"type": "image", "url": "https://assets.bonappetit.com/photos/57ad4ebc53e63daf11a4ddc7/master/w_1280,c_limit/kouign-amann.jpg"},
+            {"type": "text", "text": "What pastry is shown in the image?"}
+        ]
+    }
+]
+```
+
+### Video
+
+Set `"type": "video"` and use `"url"` for links or `"path"` for local files.
+
+```py
+message = [
+    {
+        "role": "user",
+        "content": [
+            {"type": "video", "url": "https://static01.nyt.com/images/2019/10/01/dining/01Sourdough-GIF-1/01Sourdough-GIF-1-superJumbo.gif"},
+            {"type": "text", "text": "What is shown in this video?"}
+        ]
+    }
+]
+```
+
+### Audio
+
+Set `"type": "audio"` and use `"url"` for links or `"path"` for local files.
+
+```py
+message = [
+    {
+        "role": "user",
+        "content": [
+            {"type": "audio", "url": "https://huggingface.co/datasets/Narsil/asr_dummy/resolve/main/mlk.flac"},
+            {"type": "text", "text": "Transcribe the speech."}
+        ]
+    }
+]
+```
+
+### Mixed multiple
+
+The `content` list accepts any combination of types. The model processes all inputs together, enabling comparisons or cross-modal reasoning.
+
+```py
+message = [
+    {
+        "role": "user",
+        "content": [
+            {"type": "image", "url": "https://assets.bonappetit.com/photos/57ad4ebc53e63daf11a4ddc7/master/w_1280,c_limit/kouign-amann.jpg"},
+            {"type": "video", "url": "https://static01.nyt.com/images/2019/10/01/dining/01Sourdough-GIF-1/01Sourdough-GIF-1-superJumbo.gif"},
+            {"type": "text", "text": "What does the image and video share in common?"},
+        ],
+    },
+    {
+        "role": "user",
+        "content": [
+            {"type": "image", "url": "https://assets.bonappetit.com/photos/57ad4ebc53e63daf11a4ddc7/master/w_1280,c_limit/kouign-amann.jpg"},
+            {"type": "image", "url": "https://assets.bonappetit.com/photos/57e191f49f19b4610e6b7693/master/w_1600%2Cc_limit/undefined"},
+            {"type": "text", "text": "What type of pastries are these?"},
+        ],
+    }
+]
+```
+
+## Batched
+
+Batched inference processes multiple conversations in a single forward pass to improve throughput and efficiency. Wrap each conversation in its own list, then pass them together as a list of lists.
+
+```py
+messages = [
+    [
+        {"role": "user",
+        "content": [
+                {"type": "image", "url": "https://assets.bonappetit.com/photos/57ad4ebc53e63daf11a4ddc7/master/w_1280,c_limit/kouign-amann.jpg"},
+                {"type": "text", "text": "What type of pastry is this?"}
+            ]
+        },
+    ],
+    [
+        {"role": "user",
+        "content": [
+                {"type": "image", "url": "https://assets.bonappetit.com/photos/57e191f49f19b4610e6b7693/master/w_1600%2Cc_limit/undefined"},
+                {"type": "text", "text": "What type of pastry is this?"}
+            ]
+        },
+    ],
+]
+```
+
+## Multi-turn
+
+Conversations span multiple exchanges, alternating between `"user"` and `"assistant"` roles. Each turn adds a new message to the list, giving the model access to the full conversation history. This context helps the model generate more appropriate responses.
+
+```py
+message = [
+    {
+        "role": "user",
+        "content": [
+            {"type": "image", "url": "https://assets.bonappetit.com/photos/57ad4ebc53e63daf11a4ddc7/master/w_1280,c_limit/kouign-amann.jpg"},
+            {"type": "text", "text": "What pastry is shown in the image?"}
+        ]
+    },
+    {
+        "role": "assistant",
+        "content": [{"type": "text", "text": "This is kouign amann, a laminated dough pastry (i.e., dough folded with layers of butter) that also incorporates sugar between layers so that during baking the sugar caramelizes."}]
+    },
+    {
+        "role": "user",
+        "content": [
+            {"type": "image", "url": "https://static01.nyt.com/images/2023/07/21/multimedia/21baguettesrex-hbkc/21baguettesrex-hbkc-videoSixteenByNineJumbo1600.jpg"},
+            {"type": "text", "text": "Compare it to this image now."}
+        ]
+    }
+]
+```