This model was published in HF papers on 2025-07-01 and contributed to Hugging Face Transformers on 2025-06-25.
GLM-V
Overview
The GLM-V model was proposed in GLM-4.5V and GLM-4.1V-Thinking: Towards Versatile Multimodal Reasoning with Scalable Reinforcement Learning.
The abstract from the paper is the following:
We present GLM-4.1V-Thinking, GLM-4.5V, and GLM-4.6V, a family of vision-language models (VLMs) designed to advance general-purpose multimodal understanding and reasoning. In this report, we share our key findings in the development of the reasoning-centric training framework. We first develop a capable vision foundation model with significant potential through large-scale pre-training, which arguably sets the upper bound for the final performance. We then propose Reinforcement Learning with Curriculum Sampling (RLCS) to unlock the full potential of the model, leading to comprehensive capability enhancement across a diverse range of tasks, including STEM problem solving, video understanding, content recognition, coding, grounding, GUI-based agents, and long document interpretation. In a comprehensive evaluation across 42 public benchmarks, GLM-4.5V achieves state-of-the-art performance on nearly all tasks among open-source models of similar size, and demonstrates competitive or even superior results compared to closed-source models such as Gemini-2.5-Flash on challenging tasks including Coding and GUI Agents. Meanwhile, the smaller GLM-4.1V-9B-Thinking remains highly competitive, achieving superior results to the much larger Qwen2.5-VL-72B on 29 benchmarks. We open-source both GLM-4.1V-9B-Thinking and GLM-4.5V. We further introduce the GLM-4.6V series, open-source multimodal models with native tool use and a 128K context window. A brief overview is available at this https URL. Code, models and more information are released at https://github.com/zai-org/GLM-V
Supported Models
This model type supports the following zai-org models:
- GLM-4.1V-9B-Base
- GLM-4.1V-9B-Thinking
- GLM-4.6V-Flash
- AutoGLM-Phone-9B
- AutoGLM-Phone-9B-Multilingual
- Glyph
- WebVIA-Agent
- UI2Code_N
This model was contributed by Raushan Turganbay and Yuxuan Zhang.
Usage
The example below demonstrates how to generate text based on an image with [Pipeline] or the [AutoModel] class.
# Option 1: the high-level Pipeline API
from transformers import pipeline

pipe = pipeline(
    task="image-text-to-text",
    model="THUDM/GLM-4.1V-9B-Thinking",
    device=0,
)
messages = [
    {
        "role": "user",
        "content": [
            {
                "type": "image",
                "url": "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/pipeline-cat-chonk.jpeg",
            },
            {"type": "text", "text": "Describe this image."},
        ],
    }
]
# return_full_text=False returns only the newly generated text, not the prompt
pipe(text=messages, max_new_tokens=20, return_full_text=False)
# Option 2: load the model and processor directly
from transformers import AutoProcessor, Glm4vForConditionalGeneration

model = Glm4vForConditionalGeneration.from_pretrained(
    "THUDM/GLM-4.1V-9B-Thinking",
    device_map="auto",
    attn_implementation="sdpa",
)
processor = AutoProcessor.from_pretrained("THUDM/GLM-4.1V-9B-Thinking")
messages = [
    {
        "role": "user",
        "content": [
            {
                "type": "image",
                "url": "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/pipeline-cat-chonk.jpeg",
            },
            {
                "type": "text",
                "text": "Describe this image.",
            },
        ],
    }
]
# Render the chat template, fetch the image, and tokenize in one call
inputs = processor.apply_chat_template(
    messages,
    add_generation_prompt=True,
    tokenize=True,
    return_dict=True,
    return_tensors="pt",
).to(model.device)

generated_ids = model.generate(**inputs, max_new_tokens=128)
# generate() returns prompt + completion; keep only the newly generated tokens
generated_ids_trimmed = [
    out_ids[len(in_ids):] for in_ids, out_ids in zip(inputs.input_ids, generated_ids)
]
output_text = processor.batch_decode(
    generated_ids_trimmed, skip_special_tokens=True, clean_up_tokenization_spaces=False
)
print(output_text)
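The trimming step in the example above relies on `generate` returning the prompt tokens followed by the new tokens for each sequence. The following is a minimal sketch of that slicing, using plain Python lists in place of tensors; the token IDs are made up and `trim_prompt` is a hypothetical helper name, not a Transformers function.

```python
# Toy token IDs standing in for tensors; the values are illustrative only.
prompt_ids = [[101, 7, 8], [101, 9, 10]]
generated = [[101, 7, 8, 42, 43], [101, 9, 10, 44, 45]]


def trim_prompt(input_ids, output_ids):
    """Return only the newly generated tokens for each sequence in the batch.

    Hypothetical helper: it mirrors the list comprehension used in the
    example above and is not part of the Transformers API.
    """
    return [out[len(inp):] for inp, out in zip(input_ids, output_ids)]


new_tokens = trim_prompt(prompt_ids, generated)
print(new_tokens)  # [[42, 43], [44, 45]]
```

Decoding `new_tokens` instead of the full output keeps the prompt text out of the printed result.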
Using GLM-4.1V with video input is similar to using it with image input. The model can process video data and generate text based on the content of the video.
from transformers import AutoProcessor, Glm4vForConditionalGeneration

processor = AutoProcessor.from_pretrained("THUDM/GLM-4.1V-9B-Thinking")
model = Glm4vForConditionalGeneration.from_pretrained(
    pretrained_model_name_or_path="THUDM/GLM-4.1V-9B-Thinking",
    device_map="auto",
)
messages = [
    {
        "role": "user",
        "content": [
            {
                "type": "video",
                "url": "https://test-videos.co.uk/vids/bigbuckbunny/mp4/h264/720/Big_Buck_Bunny_720_10s_10MB.mp4",
            },
            {
                "type": "text",
                "text": "Describe this video.",
            },
        ],
    }
]
inputs = processor.apply_chat_template(
    messages,
    tokenize=True,
    add_generation_prompt=True,
    return_dict=True,
    return_tensors="pt",
    padding=True,
).to(model.device)

generated_ids = model.generate(**inputs, max_new_tokens=1024, do_sample=True, temperature=1.0)
# Skip the prompt tokens and decode only the generated continuation
output_text = processor.decode(generated_ids[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True)
print(output_text)
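In the examples above, the image and video requests differ only in the `type` field of the media entry. The sketch below makes that explicit with a small helper; `build_messages` is a hypothetical name introduced here for illustration and is not part of Transformers.

```python
def build_messages(media_type, url, prompt):
    """Build a single-turn chat message list with one media item and one text item.

    Hypothetical helper (not a Transformers API); it only reproduces the
    message structure used in the examples above.
    """
    if media_type not in ("image", "video"):
        raise ValueError(f"unsupported media type: {media_type!r}")
    return [
        {
            "role": "user",
            "content": [
                {"type": media_type, "url": url},
                {"type": "text", "text": prompt},
            ],
        }
    ]


video_messages = build_messages(
    "video",
    "https://test-videos.co.uk/vids/bigbuckbunny/mp4/h264/720/Big_Buck_Bunny_720_10s_10MB.mp4",
    "Describe this video.",
)
print(video_messages[0]["content"][0]["type"])  # video
```

The resulting list can be passed to `processor.apply_chat_template` in the same way as the hand-written `messages` above.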
Glm4vConfig
autodoc Glm4vConfig
Glm4vVisionConfig
autodoc Glm4vVisionConfig
Glm4vTextConfig
autodoc Glm4vTextConfig
Glm4vImageProcessor
autodoc Glm4vImageProcessor - preprocess
Glm4vVideoProcessor
autodoc Glm4vVideoProcessor - preprocess
Glm4vImageProcessorPil
autodoc Glm4vImageProcessorPil - preprocess
Glm4vProcessor
autodoc Glm4vProcessor - __call__
Glm4vVisionModel
autodoc Glm4vVisionModel - forward
Glm4vTextModel
autodoc Glm4vTextModel - forward
Glm4vModel
autodoc Glm4vModel - forward - get_video_features - get_image_features
Glm4vForConditionalGeneration
autodoc Glm4vForConditionalGeneration - forward - get_video_features - get_image_features