first commit

2026-06-05 16:53:03 +08:00
commit 06f1fd69a6
6047 changed files with 1895387 additions and 0 deletions
--- a/docs/source/en/model_doc/donut.md
+++ b/docs/source/en/model_doc/donut.md
@@ -0,0 +1,227 @@
+<!--Copyright 2022 The HuggingFace Team. All rights reserved.
+
+Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with the
+License. You may obtain a copy of the License at
+
+http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an
+"AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
+specific language governing permissions and limitations under the License.
+
+⚠️ Note that this file is in Markdown but contains specific syntax for our doc-builder (similar to MDX) that may not be
+rendered properly in your Markdown viewer.
+-->
+*This model was published in HF papers on 2021-11-30 and contributed to Hugging Face Transformers on 2022-08-12.*
+
+
+# Donut
+
+[Donut (Document Understanding Transformer)](https://huggingface.co/papers/2111.15664) is a visual document understanding model that doesn't require an Optical Character Recognition (OCR) engine. Unlike traditional approaches that extract text with OCR before processing it, Donut uses an end-to-end Transformer-based architecture to analyze document images directly. This eliminates OCR-related inefficiencies, making the model more accurate and more adaptable to diverse languages and document formats.
+
+Donut features a vision encoder ([Swin](./swin)) and a text decoder ([BART](./bart)). Swin converts document images into embeddings, and BART decodes those embeddings into text sequences.
+
+You can find all the original Donut checkpoints under the [Naver Clova Information Extraction](https://huggingface.co/naver-clova-ix) organization.
+
+> [!TIP]
+> Click on the Donut models in the right sidebar for more examples of how to apply Donut to different language and vision tasks.
+
+The examples below demonstrate how to perform document understanding tasks using Donut with [`Pipeline`] and [`AutoModel`].
+
+<hfoptions id="usage">
+<hfoption id="Pipeline">
+
+```python
+# pip install datasets
+from datasets import load_dataset
+
+from transformers import pipeline
+
+
+pipeline = pipeline(
+    task="document-question-answering",
+    model="naver-clova-ix/donut-base-finetuned-docvqa",
+    device=0,
+)
+dataset = load_dataset("hf-internal-testing/example-documents", split="test")
+image = dataset[0]["image"]
+
+pipeline(image=image, question="What time is the coffee break?")
+```
+
+</hfoption>
+<hfoption id="AutoModel">
+
+```python
+# pip install datasets
+from datasets import load_dataset
+
+from transformers import AutoModelForImageTextToText, AutoProcessor
+
+
+processor = AutoProcessor.from_pretrained("naver-clova-ix/donut-base-finetuned-docvqa")
+model = AutoModelForImageTextToText.from_pretrained("naver-clova-ix/donut-base-finetuned-docvqa", device_map="auto")
+
+dataset = load_dataset("hf-internal-testing/example-documents", split="test")
+image = dataset[0]["image"]
+question = "What time is the coffee break?"
+task_prompt = f"<s_docvqa><s_question>{question}</s_question><s_answer>"
+inputs = processor(image, task_prompt, return_tensors="pt").to(model.device)
+
+outputs = model.generate(
+    input_ids=inputs.input_ids,
+    pixel_values=inputs.pixel_values,
+    max_length=512
+)
+answer = processor.decode(outputs[0], skip_special_tokens=True)
+print(answer)
+```
+
+</hfoption>
+</hfoptions>
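+
+The DocVQA task prompt in the `AutoModel` example wraps the question in tags and leaves `<s_answer>` open for the decoder to complete. The sketch below illustrates that string format in plain Python, without loading a model. The helper names `build_docvqa_prompt` and `extract_answer` and the sample generated string are hypothetical and exist only for illustration; with a real model, `processor.token2json` converts the tagged output into a dictionary.
+
+```python
+import re
+
+
+def build_docvqa_prompt(question: str) -> str:
+    # Same template as the AutoModel example: the answer tag is left open.
+    return f"<s_docvqa><s_question>{question}</s_question><s_answer>"
+
+
+def extract_answer(sequence: str):
+    # Return the text between <s_answer> and </s_answer>, or None if absent.
+    match = re.search(r"<s_answer>(.*?)</s_answer>", sequence, flags=re.DOTALL)
+    return match.group(1).strip() if match else None
+
+
+prompt = build_docvqa_prompt("What time is the coffee break?")
+# Hand-written stand-in for decoder output; "10:30 a.m." is a made-up value,
+# not a real model prediction.
+generated = prompt + " 10:30 a.m.</s_answer>"
+print(extract_answer(generated))
+```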
+
+Quantization reduces the memory burden of large models by representing the weights in a lower precision. Refer to the [Quantization](../quantization/overview) overview for more available quantization backends.
+
+The example below uses [torchao](../quantization/torchao) to quantize only the weights to int4.
+
+```python
+# pip install datasets torchao
+from datasets import load_dataset
+
+from transformers import AutoModelForImageTextToText, AutoProcessor, TorchAoConfig
+
+
+quantization_config = TorchAoConfig("int4_weight_only", group_size=128)
+processor = AutoProcessor.from_pretrained("naver-clova-ix/donut-base-finetuned-docvqa")
+model = AutoModelForImageTextToText.from_pretrained("naver-clova-ix/donut-base-finetuned-docvqa", quantization_config=quantization_config, device_map="auto")
+
+dataset = load_dataset("hf-internal-testing/example-documents", split="test")
+image = dataset[0]["image"]
+question = "What time is the coffee break?"
+task_prompt = f"<s_docvqa><s_question>{question}</s_question><s_answer>"
+inputs = processor(image, task_prompt, return_tensors="pt").to(model.device)
+
+outputs = model.generate(
+    input_ids=inputs.input_ids,
+    pixel_values=inputs.pixel_values,
+    max_length=512
+)
+answer = processor.decode(outputs[0], skip_special_tokens=True)
+print(answer)
+```
+
+## Notes
+
+- Use Donut for document image classification as shown below.
+
+    ```py
+    import re
+
+    from datasets import load_dataset
+    from transformers import DonutProcessor, VisionEncoderDecoderModel
+
+    processor = DonutProcessor.from_pretrained("naver-clova-ix/donut-base-finetuned-rvlcdip")
+    model = VisionEncoderDecoderModel.from_pretrained("naver-clova-ix/donut-base-finetuned-rvlcdip", device_map="auto")
+
+    model.to(model.device)  # doctest: +IGNORE_RESULT
+
+    # load document image
+    dataset = load_dataset("hf-internal-testing/example-documents", split="test")
+    image = dataset[1]["image"]
+
+    # prepare decoder inputs
+    task_prompt = "<s_rvlcdip>"
+    decoder_input_ids = processor.tokenizer(task_prompt, add_special_tokens=False, return_tensors="pt").to(model.device).input_ids
+
+    pixel_values = processor(image, return_tensors="pt").to(model.device).pixel_values
+
+    outputs = model.generate(
+        pixel_values.to(model.device),
+        decoder_input_ids=decoder_input_ids.to(model.device),
+        max_length=model.decoder.config.max_position_embeddings,
+        pad_token_id=processor.tokenizer.pad_token_id,
+        eos_token_id=processor.tokenizer.eos_token_id,
+        use_cache=True,
+        bad_words_ids=[[processor.tokenizer.unk_token_id]],
+        return_dict_in_generate=True,
+    )
+
+    sequence = processor.batch_decode(outputs.sequences)[0]
+    sequence = sequence.replace(processor.tokenizer.eos_token, "").replace(processor.tokenizer.pad_token, "")
+    sequence = re.sub(r"<.*?>", "", sequence, count=1).strip()  # remove first task start token
+    print(processor.token2json(sequence))
+    # {'class': 'advertisement'}
+    ```
+
+- Use Donut for document parsing as shown below.
+
+    ```py
+    import re
+
+    from datasets import load_dataset
+    from transformers import DonutProcessor, VisionEncoderDecoderModel
+
+    processor = DonutProcessor.from_pretrained("naver-clova-ix/donut-base-finetuned-cord-v2")
+    model = VisionEncoderDecoderModel.from_pretrained("naver-clova-ix/donut-base-finetuned-cord-v2", device_map="auto")
+
+    model.to(model.device)  # doctest: +IGNORE_RESULT
+
+    # load document image
+    dataset = load_dataset("hf-internal-testing/example-documents", split="test")
+    image = dataset[2]["image"]
+
+    # prepare decoder inputs
+    task_prompt = "<s_cord-v2>"
+    decoder_input_ids = processor.tokenizer(task_prompt, add_special_tokens=False, return_tensors="pt").to(model.device).input_ids
+
+    pixel_values = processor(image, return_tensors="pt").to(model.device).pixel_values
+
+    outputs = model.generate(
+        pixel_values.to(model.device),
+        decoder_input_ids=decoder_input_ids.to(model.device),
+        max_length=model.decoder.config.max_position_embeddings,
+        pad_token_id=processor.tokenizer.pad_token_id,
+        eos_token_id=processor.tokenizer.eos_token_id,
+        use_cache=True,
+        bad_words_ids=[[processor.tokenizer.unk_token_id]],
+        return_dict_in_generate=True,
+    )
+
+    sequence = processor.batch_decode(outputs.sequences)[0]
+    sequence = sequence.replace(processor.tokenizer.eos_token, "").replace(processor.tokenizer.pad_token, "")
+    sequence = re.sub(r"<.*?>", "", sequence, count=1).strip()  # remove first task start token
+    print(processor.token2json(sequence))
+    # {'menu': {'nm': 'CINNAMON SUGAR', 'unitprice': '17,000', 'cnt': '1 x', 'price': '17,000'}, 'sub_total': {'subtotal_price': '17,000'},
+    #  'total': {'total_price': '17,000', 'cashprice': '20,000', 'changeprice': '3,000'}}
+    ```
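+
+Both notes above post-process the decoded string the same way: they remove the end-of-sequence and padding tokens, then strip only the first tag (the task start token) so that the remaining field tags can be converted to JSON. Below is a minimal sketch of that cleanup on a hand-written string. It assumes `</s>` and `<pad>` as the special tokens (read the real values from `processor.tokenizer`), and the helper name `clean_sequence` and the sample string are hypothetical.
+
+```python
+import re
+
+
+def clean_sequence(sequence: str, eos_token: str = "</s>", pad_token: str = "<pad>") -> str:
+    # Drop end-of-sequence and padding tokens left over from generation.
+    sequence = sequence.replace(eos_token, "").replace(pad_token, "")
+    # count=1 removes only the leading task start token and keeps the field tags.
+    return re.sub(r"<.*?>", "", sequence, count=1).strip()
+
+
+# Hand-written stand-in for a decoded sequence, not real model output.
+raw = "<s_rvlcdip><s_class><advertisement/></s_class></s><pad><pad>"
+print(clean_sequence(raw))
+# <s_class><advertisement/></s_class>
+```
+
+Without `count=1`, the regular expression would strip every tag and leave nothing for `token2json` to parse.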
+
+## DonutSwinConfig
+
+[[autodoc]] DonutSwinConfig
+
+## DonutImageProcessor
+
+[[autodoc]] DonutImageProcessor
+    - preprocess
+
+## DonutImageProcessorPil
+
+[[autodoc]] DonutImageProcessorPil
+    - preprocess
+
+## DonutProcessor
+
+[[autodoc]] DonutProcessor
+    - __call__
+    - from_pretrained
+    - save_pretrained
+    - batch_decode
+    - decode
+
+## DonutSwinModel
+
+[[autodoc]] DonutSwinModel
+    - forward
+
+## DonutSwinForImageClassification
+
+[[autodoc]] transformers.DonutSwinForImageClassification
+    - forward