This model was published in HF papers on 2024-03-08 and contributed to Hugging Face Transformers on 2025-07-25.
DeepseekVLHybrid
Deepseek-VL-Hybrid was introduced by the DeepSeek AI team. It is a vision-language model (VLM) that processes both text and images to generate contextually relevant responses. It is a variant of Deepseek-VL that uses SAM (Segment Anything Model) for high-resolution image encoding: LLaMA serves as the language model, SigLip encodes low-resolution images, and SAM encodes high-resolution images, which improves the model's ability to capture fine-grained visual details.
You can find all the original Deepseek-VL-Hybrid checkpoints under the deepseek-community organization.
Tip
Click on the Deepseek-VL-Hybrid models in the right sidebar for more examples of how to apply Deepseek-VL-Hybrid to different vision and language tasks.
The example below demonstrates how to generate text based on an image with [Pipeline] or the [AutoModel] class.
from transformers import pipeline

pipe = pipeline(
    task="image-text-to-text",
    model="deepseek-community/deepseek-vl-7b-chat",
    device=0,
)

messages = [
    {
        "role": "user",
        "content": [
            {
                "type": "image",
                "url": "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/pipeline-cat-chonk.jpeg",
            },
            {"type": "text", "text": "Describe this image."},
        ],
    }
]

pipe(text=messages, max_new_tokens=20, return_full_text=False)

from transformers import AutoProcessor, DeepseekVLHybridForConditionalGeneration

model = DeepseekVLHybridForConditionalGeneration.from_pretrained(
    "deepseek-community/deepseek-vl-7b-chat",
    device_map="auto",
    attn_implementation="sdpa",
)
processor = AutoProcessor.from_pretrained("deepseek-community/deepseek-vl-7b-chat")

messages = [
    {
        "role": "user",
        "content": [
            {
                "type": "image",
                "url": "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/pipeline-cat-chonk.jpeg",
            },
            {"type": "text", "text": "Describe this image."},
        ],
    }
]

# Tokenize the chat and move the tensors to the model's device and dtype.
inputs = processor.apply_chat_template(
    messages,
    add_generation_prompt=True,
    tokenize=True,
    return_dict=True,
    return_tensors="pt",
).to(model.device, dtype=model.dtype)

generated_ids = model.generate(**inputs, max_new_tokens=128)

# Keep only the newly generated tokens by dropping the prompt prefix.
generated_ids_trimmed = [
    out_ids[len(in_ids) :] for in_ids, out_ids in zip(inputs.input_ids, generated_ids)
]

output_text = processor.batch_decode(
    generated_ids_trimmed, skip_special_tokens=True, clean_up_tokenization_spaces=False
)
print(output_text)

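The trimming step works because, for decoder-only generation, each returned sequence starts with its prompt tokens followed by the new tokens. The sketch below reproduces that slicing with plain Python lists. The token ids and the pad id `0` are made up for illustration, and `trim_prompt` is a hypothetical helper, not part of Transformers.

```python
def trim_prompt(input_ids, generated_ids):
    """Drop the prompt prefix from each generated sequence.

    Assumes every generated row starts with the matching (possibly padded) input row.
    """
    return [out[len(inp):] for inp, out in zip(input_ids, generated_ids)]


# Made-up token ids: two prompts padded to the same length (0 = hypothetical pad id).
input_ids = [[0, 0, 11, 12], [21, 22, 23, 24]]
generated_ids = [[0, 0, 11, 12, 101, 102], [21, 22, 23, 24, 201, 202]]

new_tokens = trim_prompt(input_ids, generated_ids)
print(new_tokens)  # [[101, 102], [201, 202]]
```

Because every row in a padded batch has the same input length, slicing by `len(in_ids)` removes the padding and the prompt together.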
Quantization reduces the memory burden of large models by representing the weights in a lower precision. Refer to the Quantization overview for more available quantization backends.
The example below uses torchao to only quantize the weights to int4.
from transformers import DeepseekVLHybridForConditionalGeneration, TorchAoConfig

quantization_config = TorchAoConfig(
    "int4_weight_only",
    group_size=128,
)

model = DeepseekVLHybridForConditionalGeneration.from_pretrained(
    "deepseek-community/deepseek-vl-7b-chat",
    device_map="auto",
    quantization_config=quantization_config,
)

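To get a feel for the potential savings, the following back-of-the-envelope sketch estimates weight storage at two precisions. It is only a rough illustration under stated assumptions: about 7 billion parameters (inferred from the checkpoint name alone), every weight quantized, and one 16-bit scale plus one 16-bit zero point stored per group of 128 weights. Actual layouts depend on the backend, and some modules may stay in higher precision. `weight_memory_gib` is a hypothetical helper.

```python
def weight_memory_gib(num_params, bits_per_weight, group_size=None, overhead_bits_per_group=0):
    """Estimate weight storage in GiB, optionally adding per-group quantization metadata."""
    total_bits = num_params * bits_per_weight
    if group_size:
        num_groups = num_params / group_size
        total_bits += num_groups * overhead_bits_per_group
    return total_bits / 8 / 1024**3


NUM_PARAMS = 7e9  # assumption: taken from "7b" in the checkpoint name

fp16 = weight_memory_gib(NUM_PARAMS, bits_per_weight=16)
# Assumption: 32 bits of metadata (scale + zero point) per group of 128 weights.
int4 = weight_memory_gib(NUM_PARAMS, bits_per_weight=4, group_size=128, overhead_bits_per_group=32)

print(f"fp16 weights: {fp16:.1f} GiB")
print(f"int4 weights (group_size=128): {int4:.1f} GiB")
```

Under these assumptions the int4 weights take a little more than a quarter of the fp16 footprint, with the per-group metadata accounting for the difference from an exact 4x reduction.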
Notes
- Do inference with multiple images in a single conversation.

import torch
from transformers import DeepseekVLHybridForConditionalGeneration, AutoProcessor

model = DeepseekVLHybridForConditionalGeneration.from_pretrained(
    "deepseek-community/deepseek-vl-7b-chat",
    device_map="auto",
    attn_implementation="sdpa",
)
processor = AutoProcessor.from_pretrained("deepseek-community/deepseek-vl-7b-chat")

# A batch of two conversations; the first one interleaves text with two images.
messages = [
    [
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "What’s the difference between"},
                {"type": "image", "url": "http://images.cocodataset.org/val2017/000000039769.jpg"},
                {"type": "text", "text": " and "},
                {"type": "image", "url": "https://www.ilankelman.org/stopsigns/australia.jpg"},
            ],
        }
    ],
    [
        {
            "role": "user",
            "content": [
                {"type": "image", "url": "https://huggingface.co/microsoft/kosmos-2-patch14-224/resolve/main/snowman.jpg"},
                {"type": "text", "text": "What do you see in this image?"},
            ],
        }
    ],
]

# Pad the batch so both conversations share one tensor shape.
inputs = processor.apply_chat_template(
    messages,
    add_generation_prompt=True,
    padding=True,
    truncation=True,
    tokenize=True,
    return_dict=True,
    return_tensors="pt",
).to(model.device, dtype=model.dtype)

generated_ids = model.generate(**inputs, max_new_tokens=128)

# Keep only the newly generated tokens by dropping the prompt prefix.
generated_ids_trimmed = [
    out_ids[len(in_ids) :] for in_ids, out_ids in zip(inputs.input_ids, generated_ids)
]

output_text = processor.batch_decode(
    generated_ids_trimmed, skip_special_tokens=True, clean_up_tokenization_spaces=False
)
print(output_text)

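The nested message structure used for multi-image input is easy to get wrong when written by hand. As a sketch, a small helper can build one user turn from an ordered sequence of text and image parts. `build_user_message` is a hypothetical name introduced here for illustration and is not part of Transformers; it only produces the dictionary shape shown in the example above.

```python
def build_user_message(*parts):
    """Build one user turn from ("text", str) and ("image", url) pairs, kept in order."""
    content = []
    for kind, value in parts:
        if kind == "text":
            content.append({"type": "text", "text": value})
        elif kind == "image":
            content.append({"type": "image", "url": value})
        else:
            raise ValueError(f"unsupported part type: {kind!r}")
    return {"role": "user", "content": content}


# A batch of two conversations, each holding a single user turn.
messages = [
    [
        build_user_message(
            ("text", "What's the difference between"),
            ("image", "http://images.cocodataset.org/val2017/000000039769.jpg"),
            ("text", " and "),
            ("image", "https://www.ilankelman.org/stopsigns/australia.jpg"),
        )
    ],
    [
        build_user_message(
            ("image", "https://huggingface.co/microsoft/kosmos-2-patch14-224/resolve/main/snowman.jpg"),
            ("text", "What do you see in this image?"),
        )
    ],
]

print([part["type"] for part in messages[0][0]["content"]])
```

The resulting `messages` list can then be passed to `processor.apply_chat_template` in the same way as the hand-written version.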
DeepseekVLHybridConfig
autodoc DeepseekVLHybridConfig
DeepseekVLHybridProcessor
autodoc DeepseekVLHybridProcessor - call
DeepseekVLHybridImageProcessor
autodoc DeepseekVLHybridImageProcessor - preprocess
DeepseekVLHybridImageProcessorPil
autodoc DeepseekVLHybridImageProcessorPil - preprocess
DeepseekVLHybridModel
autodoc DeepseekVLHybridModel - forward - get_image_features
DeepseekVLHybridForConditionalGeneration
autodoc DeepseekVLHybridForConditionalGeneration - forward