6.7 KiB
This model was published in HF papers on 2025-09-16 and contributed to Hugging Face Transformers on 2026-04-28.
MiniCPM-V
MiniCPM-V is a series of efficient multimodal large language models developed by OpenBMB. The MiniCPM-V 4.6 architecture uses a SigLIP vision encoder with a window-attention merger and a Qwen3.5 language model backbone, supporting both 4x and 16x visual downsampling modes.
This model was contributed by OpenBMB. The original code can be found here.
Usage example
Inference with Pipeline
from transformers import pipeline
messages = [
{
"role": "user",
"content": [
{
"type": "image",
"image": "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/bee.jpg",
},
{"type": "text", "text": "Describe this image."},
],
},
]
pipe = pipeline("image-text-to-text", model="openbmb/MiniCPM-V-4_6")
outputs = pipe(text=messages, max_new_tokens=50, return_full_text=False)
outputs[0]["generated_text"]
Inference on a single image
Note
The model has been trained with a specific prompt format for chatting. Use
processor.apply_chat_template(my_conversation_dict)to correctly format your prompts.
from transformers import AutoProcessor, AutoModelForImageTextToText
model_checkpoint = "openbmb/MiniCPM-V-4_6"
processor = AutoProcessor.from_pretrained(model_checkpoint)
model = AutoModelForImageTextToText.from_pretrained(model_checkpoint, device_map="auto")
messages = [
{
"role": "user",
"content": [
{"type": "image", "url": "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/pipeline-cat-chonk.jpeg"},
{"type": "text", "text": "Describe this image."},
],
}
]
inputs = processor.apply_chat_template(
messages, add_generation_prompt=True, tokenize=True,
return_dict=True, return_tensors="pt",
).to(model.device, dtype=model.dtype)
output = model.generate(**inputs, max_new_tokens=100)
decoded_output = processor.decode(output[0, inputs["input_ids"].shape[1]:], skip_special_tokens=True)
print(decoded_output)
Downsampling mode
MiniCPM-V 4.6 supports two visual downsampling modes:
- 16x (default): More aggressive downsampling, fewer visual tokens, faster inference.
- 4x: Less downsampling, more visual tokens, better for detail-rich tasks.
You can change the downsampling mode at runtime by passing downsample_mode via processor_kwargs and to model.generate:
inputs = processor.apply_chat_template(
messages, add_generation_prompt=True, tokenize=True,
return_dict=True, return_tensors="pt",
processor_kwargs={"downsample_mode": "4x"},
).to(model.device, dtype=model.dtype)
output = model.generate(**inputs, max_new_tokens=100, downsample_mode="4x")
Thinking mode
The model supports a thinking mode controlled by enable_thinking in the chat template. When enabled, the model generates internal reasoning before providing the final answer:
inputs = processor.apply_chat_template(
messages, add_generation_prompt=True, tokenize=True,
return_dict=True, return_tensors="pt",
enable_thinking=True,
).to(model.device, dtype=model.dtype)
output = model.generate(**inputs, max_new_tokens=1024)
To disable thinking (default for evaluation):
inputs = processor.apply_chat_template(
messages, add_generation_prompt=True, tokenize=True,
return_dict=True, return_tensors="pt",
enable_thinking=False,
).to(model.device, dtype=model.dtype)
Image processing backend
MiniCPM-V 4.6 provides two image processing backends:
- torchvision (default): Uses
torchvision.transformsfor image resizing. - pil: Uses
PIL.Image.resize, matching the original implementation.
To use the PIL backend:
from transformers import AutoProcessor, AutoImageProcessor
processor = AutoProcessor.from_pretrained(model_checkpoint)
processor.image_processor = AutoImageProcessor.from_pretrained(model_checkpoint, backend="pil")
Video inference
MiniCPM-V 4.6 supports video understanding.
messages = [
{
"role": "user",
"content": [
{"type": "video", "video": "path/to/video.mp4"},
{"type": "text", "text": "Describe what happens in this video."},
],
}
]
inputs = processor.apply_chat_template(
messages, add_generation_prompt=True, tokenize=True,
return_dict=True, return_tensors="pt",
).to(model.device, dtype=model.dtype)
output = model.generate(**inputs, max_new_tokens=200)
decoded_output = processor.decode(output[0, inputs["input_ids"].shape[1]:], skip_special_tokens=True)
print(decoded_output)
If you already have the rendered prompt string, you can call processor(text=..., videos=[...]) directly instead.
MiniCPMV4_6Config
autodoc MiniCPMV4_6Config
MiniCPMV4_6VisionConfig
autodoc MiniCPMV4_6VisionConfig
MiniCPMV4_6Model
autodoc MiniCPMV4_6Model - forward - get_image_features
MiniCPMV4_6ForConditionalGeneration
autodoc MiniCPMV4_6ForConditionalGeneration - forward - get_image_features
MiniCPMV4_6Processor
autodoc MiniCPMV4_6Processor - call
MiniCPMV4_6ImageProcessor
autodoc MiniCPMV4_6ImageProcessor - preprocess
MiniCPMV4_6ImageProcessorPil
autodoc MiniCPMV4_6ImageProcessorPil - preprocess
MiniCPMV4_6VideoProcessor
autodoc MiniCPMV4_6VideoProcessor - preprocess