This model was published in HF papers on 2026-01-28 and contributed to Hugging Face Transformers on 2026-06-01.
DeepSeek-OCR-2
Overview
The DeepSeek-OCR-2 model was proposed in Visual Causal Flow: A Novel Approach to OCR-Specialized Vision-Language Models by the DeepSeek team.
DeepSeek-OCR-2 is an OCR-specialized vision-language model with three stages: a SAM ViT-B vision encoder feeds a Qwen2 hybrid attention encoder, which connects through an MLP projector to a DeepSeek-V2 Mixture-of-Experts (MoE) language model. The key feature is the hybrid attention mechanism, which applies bidirectional attention over image tokens and causal attention over query tokens, a design aimed at efficient and accurate document understanding.
DeepSeek-OCR 2: Visual Causal Flow.
This model was contributed by thisisiron.
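To make the hybrid attention described in the overview concrete, here is a minimal sketch of such a mask in plain Python. It is not the Transformers implementation: the function name `hybrid_attention_mask` is hypothetical, and the layout is an assumption (image tokens come first, query tokens follow, every query may see all image tokens, and image tokens do not see queries).

```python
def hybrid_attention_mask(num_image_tokens: int, num_query_tokens: int) -> list[list[bool]]:
    """Return mask[i][j] == True when position i may attend to position j.

    Assumed layout (illustrative only): image tokens first, then query tokens.
    """
    total = num_image_tokens + num_query_tokens
    mask = [[False] * total for _ in range(total)]
    for i in range(total):
        for j in range(total):
            if j < num_image_tokens:
                # Image keys are visible to every position: bidirectional
                # among image tokens, and fully visible to query tokens.
                mask[i][j] = True
            elif i >= num_image_tokens:
                # Query attending to query: causal, only current and earlier.
                mask[i][j] = j <= i
            # Otherwise an image token looking at a query token: stays False.
    return mask


# Two image tokens followed by three query tokens.
for row in hybrid_attention_mask(2, 3):
    print(" ".join("1" if allowed else "0" for allowed in row))
# 1 1 0 0 0
# 1 1 0 0 0
# 1 1 1 0 0
# 1 1 1 1 0
# 1 1 1 1 1
```

The top-left block is fully dense (bidirectional), while the bottom-right block is lower triangular (causal).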
Usage example
Plain OCR
from transformers import AutoProcessor, AutoModelForImageTextToText

# Load the model and its matching processor from the Hub.
model = AutoModelForImageTextToText.from_pretrained(
    "deepseek-community/DeepSeek-OCR-2", device_map="auto"
)
processor = AutoProcessor.from_pretrained("deepseek-community/DeepSeek-OCR-2")

image = "https://huggingface.co/datasets/hf-internal-testing/fixtures_got_ocr/resolve/main/image_ocr.jpg"
# Prompt: the <image> placeholder followed by the instruction.
inputs = processor(images=image, text="<image>\nFree OCR.", return_tensors="pt").to(model.device)

# Greedy decoding, capped at 256 new tokens.
generate_ids = model.generate(**inputs, do_sample=False, max_new_tokens=256)
# Decode only the newly generated tokens, skipping the prompt.
processor.decode(generate_ids[0, inputs["input_ids"].shape[1] :], skip_special_tokens=True)
# "R&D QUALITY IMPROVEMENT\nSUGGESTION/SOLUTION FORM\nName/Phone Ext. : (...)"
Grounding with markdown conversion
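Adding the <|grounding|> token to the prompt enables coordinate-aware output: each detected element is labeled inside <|ref|> tags and followed by its coordinates inside <|det|> tags. The example below reuses the model, processor, and image from the plain OCR example.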
inputs = processor(
    images=image,
    text="<image>\n<|grounding|>Convert the document to markdown.",
    return_tensors="pt",
).to(model.device)

generate_ids = model.generate(**inputs, do_sample=False, max_new_tokens=256)
# Keep special tokens so the <|ref|> and <|det|> tags remain in the decoded text.
processor.decode(generate_ids[0, inputs["input_ids"].shape[1] :], skip_special_tokens=False)
# "<|ref|>title<|/ref|><|det|>[[330, 198, 558, 230]]<|/det|>\n# R&D QUALITY (...)"
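The grounded output can be post-processed into structured records. The following is a minimal sketch that uses only the Python standard library and assumes the tag layout seen in the sample output above (a label in <|ref|> tags immediately followed by a JSON-style list of boxes in <|det|> tags). The helper name `parse_grounding` is hypothetical and not part of Transformers, and the meaning and scale of the coordinates are not specified here.

```python
import json
import re

# Assumed layout, inferred from the sample output only:
# <|ref|>LABEL<|/ref|><|det|>[[x1, y1, x2, y2], ...]<|/det|>
GROUNDING_PATTERN = re.compile(
    r"<\|ref\|>(.*?)<\|/ref\|><\|det\|>(\[\[.*?\]\])<\|/det\|>",
    re.DOTALL,
)


def parse_grounding(text: str) -> list[dict]:
    """Extract (label, boxes) records from decoded grounding output."""
    records = []
    for label, boxes in GROUNDING_PATTERN.findall(text):
        # The box list is valid JSON in the sample, so json.loads is enough.
        records.append({"label": label, "boxes": json.loads(boxes)})
    return records


sample = "<|ref|>title<|/ref|><|det|>[[330, 198, 558, 230]]<|/det|>\n# R&D QUALITY"
print(parse_grounding(sample))
# [{'label': 'title', 'boxes': [[330, 198, 558, 230]]}]
```

Text that contains no grounding tags yields an empty list, so the helper can be applied to plain OCR output without special handling.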
DeepseekOcr2Config
autodoc DeepseekOcr2Config
DeepseekOcr2VisionConfig
autodoc DeepseekOcr2VisionConfig
DeepseekOcr2SamVisionConfig
autodoc DeepseekOcr2SamVisionConfig
DeepseekOcr2VisionEncoderConfig
autodoc DeepseekOcr2VisionEncoderConfig
DeepseekOcr2TextConfig
autodoc DeepseekOcr2TextConfig
DeepseekOcr2ImageProcessor
autodoc DeepseekOcr2ImageProcessor
DeepseekOcr2ImageProcessorPil
autodoc DeepseekOcr2ImageProcessorPil
DeepseekOcr2Processor
autodoc DeepseekOcr2Processor
DeepseekOcr2TextModel
autodoc DeepseekOcr2TextModel
DeepseekOcr2VisionModel
autodoc DeepseekOcr2VisionModel
DeepseekOcr2Model
autodoc DeepseekOcr2Model
DeepseekOcr2ForConditionalGeneration
autodoc DeepseekOcr2ForConditionalGeneration