*This model was published in HF papers on 2025-09-16 and contributed to Hugging Face Transformers on 2026-04-28.*
# MiniCPM-V
[MiniCPM-V](https://huggingface.co/papers/2509.18154) is a series of efficient multimodal large language models developed by [OpenBMB](https://github.com/OpenBMB). The MiniCPM-V 4.6 architecture uses a [SigLIP](siglip) vision encoder with a window-attention merger and a [Qwen3.5](qwen3_5) language model backbone, supporting both 4x and 16x visual downsampling modes.
This model was contributed by [OpenBMB](https://huggingface.co/openbmb).
The original code can be found [here](https://github.com/OpenBMB/MiniCPM-V).
## Usage example
### Inference with Pipeline
```python
from transformers import pipeline

messages = [
    {
        "role": "user",
        "content": [
            {
                "type": "image",
                "image": "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/bee.jpg",
            },
            {"type": "text", "text": "Describe this image."},
        ],
    },
]

pipe = pipeline("image-text-to-text", model="openbmb/MiniCPM-V-4_6")
outputs = pipe(text=messages, max_new_tokens=50, return_full_text=False)
outputs[0]["generated_text"]
```
### Inference on a single image
> [!NOTE]
> The model has been trained with a specific prompt format for chatting. Use `processor.apply_chat_template(my_conversation_dict)` to correctly format your prompts.
```python
from transformers import AutoProcessor, AutoModelForImageTextToText

model_checkpoint = "openbmb/MiniCPM-V-4_6"
processor = AutoProcessor.from_pretrained(model_checkpoint)
model = AutoModelForImageTextToText.from_pretrained(model_checkpoint, device_map="auto")

messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "url": "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/pipeline-cat-chonk.jpeg"},
            {"type": "text", "text": "Describe this image."},
        ],
    }
]

inputs = processor.apply_chat_template(
    messages, add_generation_prompt=True, tokenize=True,
    return_dict=True, return_tensors="pt",
).to(model.device, dtype=model.dtype)

output = model.generate(**inputs, max_new_tokens=100)
# Decode only the newly generated tokens, skipping the prompt.
decoded_output = processor.decode(output[0, inputs["input_ids"].shape[1]:], skip_special_tokens=True)
print(decoded_output)
```
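The conversation passed to `apply_chat_template` is plain Python data, so it can be assembled programmatically when running many prompts. The following is a minimal sketch: `build_user_message` is a hypothetical helper (not part of Transformers) that only reproduces the `role`/`content` layout used in the example above.

```python
def build_user_message(image_url: str, prompt: str) -> list[dict]:
    """Return a single-turn conversation with one image and one text prompt.

    Hypothetical helper: it mirrors the message layout shown above and
    performs no validation of the URL or the prompt.
    """
    return [
        {
            "role": "user",
            "content": [
                {"type": "image", "url": image_url},
                {"type": "text", "text": prompt},
            ],
        }
    ]


messages = build_user_message(
    "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/pipeline-cat-chonk.jpeg",
    "Describe this image.",
)
```

The resulting `messages` list can be passed to `processor.apply_chat_template` exactly as in the single-image example.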
### Downsampling mode
MiniCPM-V 4.6 supports two visual downsampling modes:
- **16x** (default): More aggressive downsampling, fewer visual tokens, faster inference.
- **4x**: Less downsampling, more visual tokens, better for detail-rich tasks.
To change the downsampling mode at runtime, pass `downsample_mode` in two places: to the processor via `processor_kwargs`, and to `model.generate`:
```python
inputs = processor.apply_chat_template(
    messages, add_generation_prompt=True, tokenize=True,
    return_dict=True, return_tensors="pt",
    processor_kwargs={"downsample_mode": "4x"},
).to(model.device, dtype=model.dtype)
output = model.generate(**inputs, max_new_tokens=100, downsample_mode="4x")
```
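For intuition about the trade-off between the two modes, the sketch below estimates how many visual tokens each mode leaves for a given image size. It rests on assumptions that are not stated in this page: a square patch size of 14 pixels (a placeholder; check `MiniCPMV4_6VisionConfig` for the actual value), and that 4x and 16x denote the factor by which the number of patch tokens is reduced. `estimate_visual_tokens` is an illustrative function, not a library API, and it ignores any image slicing or resizing the processor may perform.

```python
import math

# Assumed reduction factor in the number of visual tokens for each mode.
DOWNSAMPLE_FACTORS = {"4x": 4, "16x": 16}


def estimate_visual_tokens(height: int, width: int, mode: str = "16x", patch_size: int = 14) -> int:
    """Rough visual-token count for an image of the given size.

    Assumes one token per patch before merging, then divides by the
    mode's reduction factor (rounding up). The patch size is an assumption.
    """
    if mode not in DOWNSAMPLE_FACTORS:
        raise ValueError(f"Unknown downsample mode: {mode!r}")
    patches = math.ceil(height / patch_size) * math.ceil(width / patch_size)
    return math.ceil(patches / DOWNSAMPLE_FACTORS[mode])


for mode in ("16x", "4x"):
    print(mode, estimate_visual_tokens(448, 448, mode=mode))
```

Under these assumptions, a 448x448 image yields 64 tokens in 16x mode and 256 tokens in 4x mode, which illustrates why 4x is slower but preserves more detail.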
### Thinking mode
The model supports a thinking mode controlled by `enable_thinking` in the chat template. When enabled, the model generates internal reasoning before providing the final answer:
```python
inputs = processor.apply_chat_template(
    messages, add_generation_prompt=True, tokenize=True,
    return_dict=True, return_tensors="pt",
    enable_thinking=True,
).to(model.device, dtype=model.dtype)
output = model.generate(**inputs, max_new_tokens=1024)
```
To disable thinking (default for evaluation):
```python
inputs = processor.apply_chat_template(
    messages, add_generation_prompt=True, tokenize=True,
    return_dict=True, return_tensors="pt",
    enable_thinking=False,
).to(model.device, dtype=model.dtype)
```
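When thinking is enabled, the decoded text contains the reasoning followed by the final answer, and it is often useful to separate the two. The sketch below assumes the reasoning is terminated by a `</think>` tag, as in Qwen3-family chat templates, and that the tag survives decoding. Neither point is confirmed here for MiniCPM-V 4.6, so inspect a decoded sample first. `split_thinking` is a hypothetical helper, not part of Transformers.

```python
def split_thinking(decoded: str, close_tag: str = "</think>", open_tag: str = "<think>") -> tuple[str, str]:
    """Split decoded text into (reasoning, answer).

    Assumes the reasoning ends with `close_tag`. If the tag is absent,
    the whole text is treated as the answer.
    """
    reasoning, sep, answer = decoded.partition(close_tag)
    if not sep:
        return "", decoded.strip()
    # The opening tag may already be part of the prompt, so it is optional here.
    reasoning = reasoning.replace(open_tag, "", 1).strip()
    return reasoning, answer.strip()


# Example with a hand-written string standing in for real model output.
reasoning, answer = split_thinking("<think>\nThe photo shows a cat.\n</think>\n\nA cat on a sofa.")
print(answer)
```

In practice, `decoded_output` from `processor.decode(...)` would be passed in place of the hand-written string.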
### Image processing backend
MiniCPM-V 4.6 provides two image processing backends:
- **torchvision** (default): Uses `torchvision.transforms` for image resizing.
- **pil**: Uses `PIL.Image.resize`, matching the original implementation.
To use the PIL backend:
```python
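from transformers import AutoProcessor, AutoImageProcessor

model_checkpoint = "openbmb/MiniCPM-V-4_6"
processor = AutoProcessor.from_pretrained(model_checkpoint)
# Replace the default torchvision-based image processor with the PIL-based one.
processor.image_processor = AutoImageProcessor.from_pretrained(model_checkpoint, backend="pil")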
```
### Video inference
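MiniCPM-V 4.6 supports video understanding. Pass a `video` entry in the message content; the remaining steps are the same as for image inference: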
```python
messages = [
    {
        "role": "user",
        "content": [
            {"type": "video", "video": "path/to/video.mp4"},
            {"type": "text", "text": "Describe what happens in this video."},
        ],
    }
]

inputs = processor.apply_chat_template(
    messages, add_generation_prompt=True, tokenize=True,
    return_dict=True, return_tensors="pt",
).to(model.device, dtype=model.dtype)

output = model.generate(**inputs, max_new_tokens=200)
decoded_output = processor.decode(output[0, inputs["input_ids"].shape[1]:], skip_special_tokens=True)
print(decoded_output)
```
If you already have the rendered prompt string, you can call `processor(text=..., videos=[...])` directly instead.
## MiniCPMV4_6Config
[[autodoc]] MiniCPMV4_6Config
## MiniCPMV4_6VisionConfig
[[autodoc]] MiniCPMV4_6VisionConfig
## MiniCPMV4_6Model
[[autodoc]] MiniCPMV4_6Model
    - forward
    - get_image_features
## MiniCPMV4_6ForConditionalGeneration
[[autodoc]] MiniCPMV4_6ForConditionalGeneration
    - forward
    - get_image_features
## MiniCPMV4_6Processor
[[autodoc]] MiniCPMV4_6Processor
    - __call__
## MiniCPMV4_6ImageProcessor
[[autodoc]] MiniCPMV4_6ImageProcessor
    - preprocess
## MiniCPMV4_6ImageProcessorPil
[[autodoc]] MiniCPMV4_6ImageProcessorPil
    - preprocess
## MiniCPMV4_6VideoProcessor
[[autodoc]] MiniCPMV4_6VideoProcessor
    - preprocess