first commit

2026-06-05 16:53:03 +08:00
commit 06f1fd69a6
6047 changed files with 1895387 additions and 0 deletions
--- a/docs/source/en/model_doc/trocr.md
+++ b/docs/source/en/model_doc/trocr.md
@@ -0,0 +1,125 @@
+<!--Copyright 2021 The HuggingFace Team. All rights reserved.
+
+Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with the
+License. You may obtain a copy of the License at
+
+http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an
+"AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
+
+⚠️ Note that this file is in Markdown but contain specific syntax for our doc-builder (similar to MDX) that may not be
+rendered properly in your Markdown viewer.
+
+specific language governing permissions and limitations under the License. -->
+*This model was published in HF papers on 2021-09-21 and contributed to Hugging Face Transformers on 2021-10-13.*
+
+
+# TrOCR
+
+[TrOCR](https://huggingface.co/papers/2109.10282) is a text recognition model for both image understanding and text generation. It doesn't require separate models for image processing or character generation. TrOCR is a simple single end-to-end system that uses a transformer to handle visual understanding and text generation.
+
+You can find all the original TrOCR checkpoints under the [Microsoft](https://huggingface.co/microsoft/models?search=trocr) organization.
+
+<img src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/trocr_architecture.jpg"
+alt="drawing" width="600"/>
+<small> TrOCR architecture. Taken from the <a href="https://huggingface.co/papers/2109.10282">original paper</a>. </small>
+
+> [!TIP]
+> This model was contributed by [nielsr](https://huggingface.co/nielsr).
+>
+> Click on the TrOCR models in the right sidebar for more examples of how to apply TrOCR to different image and text tasks.
+
+The example below demonstrates how to perform optical character recognition (OCR) with the [`AutoModel`] class.
+
+<hfoptions id="usage">
+<hfoption id="AutoModel">
+
+```python
+import requests
+from PIL import Image
+
+from transformers import TrOCRProcessor, VisionEncoderDecoderModel
+
+
+processor = TrOCRProcessor.from_pretrained("microsoft/trocr-base-handwritten")
+model = VisionEncoderDecoderModel.from_pretrained("microsoft/trocr-base-handwritten", device_map="auto")
+
+# load image from the IAM dataset
+url = "https://fki.tic.heia-fr.ch/static/img/a01-122-02.jpg"
+image = Image.open(requests.get(url, stream=True).raw).convert("RGB")
+
+pixel_values = processor(image, return_tensors="pt").to(model.device).pixel_values
+generated_ids = model.generate(pixel_values)
+
+generated_text = processor.batch_decode(generated_ids, skip_special_tokens=True)[0]
+print(generated_text)
+```
+
+</hfoption>
+</hfoptions>
+
+## Quantization
+
+Quantization reduces the memory burden of large models by representing the weights in a lower precision. Refer to the [Quantization](../quantization/overview) overview for more available quantization backends.
+
+The example below uses [bitsandbytes](../quantization/bitsandbytes) to quantize the weights to 8-bits.
+
+```python
+# pip install bitsandbytes accelerate
+from transformers import TrOCRProcessor, VisionEncoderDecoderModel, BitsandBytesConfig
+import requests
+from PIL import Image
+
+# Set up the quantization configuration
+quantization_config = BitsandBytesConfig(load_in_8bit=True)
+
+# Use a large checkpoint for a more noticeable impact
+processor = TrOCRProcessor.from_pretrained("microsoft/trocr-large-handwritten")
+model = VisionEncoderDecoderModel.from_pretrained(
+    "microsoft/trocr-large-handwritten",
+    quantization_config=quantization_config
+ device_map="auto")
+
+# load image from the IAM dataset
+url = "[https://fki.tic.heia-fr.ch/static/img/a01-122-02.jpg](https://fki.tic.heia-fr.ch/static/img/a01-122-02.jpg)"
+image = Image.open(requests.get(url, stream=True).raw).convert("RGB")
+
+pixel_values = processor(image, return_tensors="pt").to(model.device).pixel_values
+generated_ids = model.generate(pixel_values)
+
+generated_text = processor.batch_decode(generated_ids, skip_special_tokens=True)[0]
+print(generated_text)
+```
+
+## Notes
+
+- TrOCR wraps [`ViTImageProcessor`]/[`DeiTImageProcessor`] and [`RobertaTokenizer`]/[`XLMRobertaTokenizer`] into a single instance of [`TrOCRProcessor`] to handle images and text.
+- TrOCR is always used within the [VisionEncoderDecoder](vision-encoder-decoder) framework.
+
+## Resources
+
+- A blog post on [Accelerating Document AI](https://huggingface.co/blog/document-ai) with TrOCR.
+- A blog post on how to [Document AI](https://github.com/philschmid/document-ai-transformers) with TrOCR.
+- A notebook on how to [finetune TrOCR on IAM Handwriting Database using Seq2SeqTrainer](https://colab.research.google.com/github/NielsRogge/Transformers-Tutorials/blob/master/TrOCR/Fine_tune_TrOCR_on_IAM_Handwriting_Database_using_Seq2SeqTrainer.ipynb).
+- An interactive-demo on [TrOCR handwritten character recognition](https://huggingface.co/spaces/nielsr/TrOCR-handwritten).
+- A notebook on [inference with TrOCR](https://colab.research.google.com/github/NielsRogge/Transformers-Tutorials/blob/master/TrOCR/Inference_with_TrOCR_%2B_Gradio_demo.ipynb) and Gradio demo.
+- A notebook on [evaluating TrOCR on the IAM test set](https://colab.research.google.com/github/NielsRogge/Transformers-Tutorials/blob/master/TrOCR/Evaluating_TrOCR_base_handwritten_on_the_IAM_test_set.ipynb).
+
+## TrOCRConfig
+
+[[autodoc]] TrOCRConfig
+
+## TrOCRProcessor
+
+[[autodoc]] TrOCRProcessor
+    - __call__
+    - from_pretrained
+    - save_pretrained
+    - batch_decode
+    - decode
+
+## TrOCRForCausalLM
+
+[[autodoc]] TrOCRForCausalLM
+     - forward