*This model was published in HF papers on 2025-10-01 and contributed to Hugging Face Transformers on 2026-02-23.*
# ModernVBert
## Overview
ModernVBert is a vision-language encoder that pairs a [ModernBERT](modernbert) text encoder with a [SigLIP](siglip) vision encoder. It is optimized for visual document understanding and retrieval.
The model was introduced in [ModernVBERT: Towards Smaller Visual Document Retrievers](https://huggingface.co/papers/2510.01149).
The example below runs masked-token prediction on an image-text pair.
```python
import torch
from huggingface_hub import hf_hub_download
from PIL import Image
from transformers import AutoModelForMaskedLM, AutoProcessor

# "./mvb" is a local checkpoint directory; point it at your own checkpoint
processor = AutoProcessor.from_pretrained("./mvb")
model = AutoModelForMaskedLM.from_pretrained("./mvb", device_map="auto")

image = Image.open(hf_hub_download("HuggingFaceTB/SmolVLM", "example_images/rococo.jpg", repo_type="space"))
text = "This [MASK] is on the wall."

# Create input messages: one image followed by the masked sentence
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image"},
            {"type": "text", "text": text},
        ],
    },
]

# Prepare inputs
prompt = processor.apply_chat_template(messages)
inputs = processor(text=prompt, images=[image], return_tensors="pt").to(model.device)

# Inference
with torch.no_grad():
    outputs = model(**inputs)

# Locate the first [MASK] token and take the highest-scoring vocabulary entry
masked_index = inputs["input_ids"][0].tolist().index(processor.tokenizer.mask_token_id)
predicted_token_id = outputs.logits[0, masked_index].argmax(dim=-1)
predicted_token = processor.tokenizer.decode(predicted_token_id)
print("Predicted token:", predicted_token)  # Predicted token: painting
```
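The example above reads a single `[MASK]` with `list.index`, which returns only the first match and raises `ValueError` when the mask token is missing (for instance, after truncation). The following is a minimal sketch, in plain Python, of one way to handle several masks and keep the top-k candidates for each. The helper names `find_mask_positions` and `top_k_token_ids`, the mask id, and the toy scores are hypothetical illustrations, not part of the Transformers API.

```python
import heapq


def find_mask_positions(input_ids, mask_token_id):
    """Return the index of every mask token in a flat list of token ids."""
    return [i for i, token_id in enumerate(input_ids) if token_id == mask_token_id]


def top_k_token_ids(logits_row, k=5):
    """Return the k vocabulary ids with the highest scores, best first."""
    return heapq.nlargest(k, range(len(logits_row)), key=logits_row.__getitem__)


# Toy data standing in for inputs["input_ids"][0].tolist() and outputs.logits[0].tolist().
MASK_ID = 4  # hypothetical mask token id, for illustration only
toy_input_ids = [1, 7, 4, 9, 4, 2]
toy_logits = [[0.0] * 6 for _ in toy_input_ids]  # one row of 6 vocabulary scores per position
toy_logits[2] = [0.1, 0.9, 0.3, 0.2, 0.0, 0.5]
toy_logits[4] = [0.7, 0.1, 0.2, 0.6, 0.0, 0.3]

positions = find_mask_positions(toy_input_ids, MASK_ID)
candidates = {pos: top_k_token_ids(toy_logits[pos], k=2) for pos in positions}
print(candidates)  # {2: [1, 5], 4: [0, 3]}
```

With real model outputs, the same helpers could take `inputs["input_ids"][0].tolist()` and `outputs.logits[0, pos].tolist()`, and each returned id can be decoded with `processor.tokenizer.decode`. An empty list from `find_mask_positions` signals that no mask token reached the model.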
## ModernVBertConfig
[[autodoc]] ModernVBertConfig
## ModernVBertModel
[[autodoc]] ModernVBertModel
    - forward
## ModernVBertForMaskedLM
[[autodoc]] ModernVBertForMaskedLM
    - forward
## ModernVBertForSequenceClassification
[[autodoc]] ModernVBertForSequenceClassification
    - forward
## ModernVBertForTokenClassification
[[autodoc]] ModernVBertForTokenClassification
    - forward