*This model was published in HF papers on 2025-10-01 and contributed to Hugging Face Transformers on 2026-02-23.*

# ModernVBert

## Overview

ModernVBert is a Vision-Language encoder that combines [ModernBert](modernbert) with a [SigLIP](siglip) vision encoder. It is optimized for visual document understanding and retrieval tasks.

The model was introduced in [ModernVBERT: Towards Smaller Visual Document Retrievers](https://huggingface.co/papers/2510.01149).

```python
import torch
from huggingface_hub import hf_hub_download
from PIL import Image

from transformers import AutoModelForMaskedLM, AutoProcessor

processor = AutoProcessor.from_pretrained("./mvb")
model = AutoModelForMaskedLM.from_pretrained("./mvb", device_map="auto")

image = Image.open(hf_hub_download("HuggingFaceTB/SmolVLM", "example_images/rococo.jpg", repo_type="space"))
text = "This [MASK] is on the wall."

# Create input messages
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image"},
            {"type": "text", "text": text},
        ],
    },
]

# Prepare inputs
prompt = processor.apply_chat_template(messages)
inputs = processor(text=prompt, images=[image], return_tensors="pt").to(model.device)

# Inference
with torch.no_grad():
    outputs = model(**inputs)

# To get predictions for the mask:
masked_index = inputs["input_ids"][0].tolist().index(processor.tokenizer.mask_token_id)
predicted_token_id = outputs.logits[0, masked_index].argmax(dim=-1)
predicted_token = processor.tokenizer.decode(predicted_token_id)
print("Predicted token:", predicted_token)
# Predicted token: painting
```

## ModernVBertConfig

[[autodoc]] ModernVBertConfig

## ModernVBertModel

[[autodoc]] ModernVBertModel
    - forward

## ModernVBertForMaskedLM

[[autodoc]] ModernVBertForMaskedLM
    - forward

## ModernVBertForSequenceClassification

[[autodoc]] ModernVBertForSequenceClassification
    - forward

## ModernVBertForTokenClassification

[[autodoc]] ModernVBertForTokenClassification
    - forward