*This model was published in HF papers on 2026-01-28 and contributed to Hugging Face Transformers on 2026-06-01.*

# DeepSeek-OCR-2

## Overview

The DeepSeek-OCR-2 model was proposed in [Visual Causal Flow: A Novel Approach to OCR-Specialized Vision-Language Models](https://huggingface.co/papers/2601.20552) by the DeepSeek team.

DeepSeek-OCR-2 is an OCR-specialized vision-language model built on a distinctive architecture: a SAM ViT-B vision encoder feeds into a Qwen2 hybrid attention encoder, which is connected through an MLP projector to a DeepSeek-V2 Mixture-of-Experts (MoE) language model. A key feature of the model is its hybrid attention mechanism, which applies bidirectional attention over image tokens and causal attention over query tokens, enabling efficient and accurate document understanding.
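
The hybrid attention pattern described above can be illustrated with a small mask-building sketch. This is not the model's implementation. It assumes that image tokens come first in the sequence and query tokens follow, that image tokens attend to every image token, and that each query token attends to all image tokens plus the query tokens up to and including itself. The function name `hybrid_attention_mask` and its parameters are hypothetical.

```python
def hybrid_attention_mask(num_image_tokens: int, num_query_tokens: int) -> list[list[bool]]:
    """Build an illustrative hybrid mask (hypothetical helper, not a library API).

    mask[i][j] is True when position i may attend to position j.
    Assumed layout: image tokens occupy the first positions, query tokens the rest.
    """
    total = num_image_tokens + num_query_tokens
    mask = [[False] * total for _ in range(total)]
    for i in range(total):
        for j in range(total):
            if j < num_image_tokens:
                # Image-token keys are visible to every position, so attention
                # among image tokens is bidirectional.
                mask[i][j] = True
            else:
                # Query-token keys are visible only to positions at or after
                # them, which is causal attention. Image tokens sit before all
                # query tokens, so they never see a query token.
                mask[i][j] = i >= j
    return mask


# Three image tokens followed by two query tokens.
for row in hybrid_attention_mask(3, 2):
    print("".join("1" if allowed else "0" for allowed in row))
# 11100
# 11100
# 11100
# 11110
# 11111
```

The top-left block of ones is the bidirectional image region, and the bottom-right block is lower triangular, which is the causal region over query tokens.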

DeepSeek-OCR 2: Visual Causal Flow.

This model was contributed by [thisisiron](https://huggingface.co/thisisiron).

## Usage example

### Plain OCR

```python
from transformers import AutoProcessor, AutoModelForImageTextToText

model = AutoModelForImageTextToText.from_pretrained(
    "deepseek-community/DeepSeek-OCR-2", device_map="auto"
)
processor = AutoProcessor.from_pretrained("deepseek-community/DeepSeek-OCR-2")

image = "https://huggingface.co/datasets/hf-internal-testing/fixtures_got_ocr/resolve/main/image_ocr.jpg"
inputs = processor(images=image, text="\nFree OCR.", return_tensors="pt").to(model.device)

generate_ids = model.generate(**inputs, do_sample=False, max_new_tokens=256)
processor.decode(generate_ids[0, inputs["input_ids"].shape[1] :], skip_special_tokens=True)
# "R&D QUALITY IMPROVEMENT\nSUGGESTION/SOLUTION FORM\nName/Phone Ext. : (...)"
```

### Grounding with markdown conversion

The `<|grounding|>` token enables coordinate-aware output with `<|ref|>` and `<|det|>` tags.

```python
inputs = processor(
    images=image,
    text="\n<|grounding|>Convert the document to markdown.",
    return_tensors="pt",
).to(model.device)

generate_ids = model.generate(**inputs, do_sample=False, max_new_tokens=256)
processor.decode(generate_ids[0, inputs["input_ids"].shape[1] :], skip_special_tokens=False)
# "<|ref|>title<|/ref|><|det|>[[330, 198, 558, 230]]<|/det|>\n# R&D QUALITY (...)"
```

## DeepseekOcr2Config

[[autodoc]] DeepseekOcr2Config

## DeepseekOcr2VisionConfig

[[autodoc]] DeepseekOcr2VisionConfig

## DeepseekOcr2SamVisionConfig

[[autodoc]] DeepseekOcr2SamVisionConfig

## DeepseekOcr2VisionEncoderConfig

[[autodoc]] DeepseekOcr2VisionEncoderConfig

## DeepseekOcr2TextConfig

[[autodoc]] DeepseekOcr2TextConfig

## DeepseekOcr2ImageProcessor

[[autodoc]] DeepseekOcr2ImageProcessor

## DeepseekOcr2ImageProcessorPil

[[autodoc]] DeepseekOcr2ImageProcessorPil

## DeepseekOcr2Processor

[[autodoc]] DeepseekOcr2Processor

## DeepseekOcr2TextModel

[[autodoc]] DeepseekOcr2TextModel

## DeepseekOcr2VisionModel

[[autodoc]] DeepseekOcr2VisionModel

## DeepseekOcr2Model

[[autodoc]] DeepseekOcr2Model

## DeepseekOcr2ForConditionalGeneration

[[autodoc]] DeepseekOcr2ForConditionalGeneration
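
## Parsing grounded output

The grounded output shown in the usage example interleaves `<|ref|>` and `<|det|>` tags with the generated markdown. A minimal sketch of pulling the labels and boxes out of such a string follows. It assumes the tag layout seen in that one sample, `<|ref|>label<|/ref|><|det|>[[...]]<|/det|>`, and that the box list is valid JSON. The helper `parse_grounding` is hypothetical and not part of Transformers, and the sketch makes no claim about the coordinate system of the four numbers.

```python
import json
import re

# Assumed layout, inferred from the sample output in the usage example:
# <|ref|>LABEL<|/ref|><|det|>[[n, n, n, n], ...]<|/det|>
GROUNDING_PATTERN = re.compile(
    r"<\|ref\|>(.*?)<\|/ref\|><\|det\|>(.*?)<\|/det\|>", re.DOTALL
)


def parse_grounding(text: str) -> list[tuple[str, list[list[int]]]]:
    """Return (label, boxes) pairs found in decoded text (hypothetical helper)."""
    regions = []
    for label, raw_boxes in GROUNDING_PATTERN.findall(text):
        try:
            boxes = json.loads(raw_boxes)
        except json.JSONDecodeError:
            # Skip regions whose box list is malformed, for example when
            # generation stopped at max_new_tokens in the middle of a tag.
            continue
        regions.append((label.strip(), boxes))
    return regions


sample = "<|ref|>title<|/ref|><|det|>[[330, 198, 558, 230]]<|/det|>\n# R&D QUALITY"
print(parse_grounding(sample))
# [('title', [[330, 198, 558, 230]])]
```

The tags are only present when decoding with `skip_special_tokens=False`, as in the grounding example.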