first commit
Some checks failed
Self-hosted runner (nightly-past-ci-caller) / Get number (push) Has been cancelled
Self-hosted runner (nightly-past-ci-caller) / TensorFlow 2.11 (push) Has been cancelled
Self-hosted runner (nightly-past-ci-caller) / TensorFlow 2.10 (push) Has been cancelled
Self-hosted runner (nightly-past-ci-caller) / TensorFlow 2.9 (push) Has been cancelled
Self-hosted runner (nightly-past-ci-caller) / TensorFlow 2.8 (push) Has been cancelled
Self-hosted runner (nightly-past-ci-caller) / TensorFlow 2.7 (push) Has been cancelled
Self-hosted runner (nightly-past-ci-caller) / TensorFlow 2.6 (push) Has been cancelled
Self-hosted runner (nightly-past-ci-caller) / TensorFlow 2.5 (push) Has been cancelled
Self-hosted runner (benchmark) / Benchmark (aws-g5-4xlarge-cache) (push) Has been cancelled
Build documentation / build (push) Has been cancelled
Build documentation / build_other_lang (push) Has been cancelled
CodeQL Security Analysis / CodeQL Analysis (push) Has been cancelled
New model PR merged notification / Notify new model (push) Has been cancelled
PR CI / pr-ci (push) Has been cancelled
Slow tests on important models (on Push - A10) / Get all modified files (push) Has been cancelled
Secret Leaks / trufflehog (push) Has been cancelled
Update Transformers metadata / build_and_package (push) Has been cancelled
Slow tests on important models (on Push - A10) / Model CI (push) Has been cancelled
Check Tiny Models / Check tiny models (push) Has been cancelled
Self-hosted runner (Intel Gaudi3 scheduled CI caller) / Model CI (push) Has been cancelled
Self-hosted runner (Intel Gaudi3 scheduled CI caller) / Pipeline CI (push) Has been cancelled
Self-hosted runner (Intel Gaudi3 scheduled CI caller) / Example CI (push) Has been cancelled
Self-hosted runner (Intel Gaudi3 scheduled CI caller) / DeepSpeed CI (push) Has been cancelled
Self-hosted runner (Intel Gaudi3 scheduled CI caller) / Trainer/FSDP CI (push) Has been cancelled
Nvidia CI - Flash Attn / Setup (push) Has been cancelled
Nvidia CI - Flash Attn / Model CI (push) Has been cancelled
Nvidia CI / Setup (push) Has been cancelled
Nvidia CI / Model CI (push) Has been cancelled
Nvidia CI / Torch pipeline CI (push) Has been cancelled
Nvidia CI / Example CI (push) Has been cancelled
Nvidia CI / Trainer/FSDP CI (push) Has been cancelled
Nvidia CI / DeepSpeed CI (push) Has been cancelled
Nvidia CI / Quantization CI (push) Has been cancelled
Nvidia CI / Kernels CI (push) Has been cancelled
Doctests / Setup (push) Has been cancelled
Doctests / Call doctest jobs (push) Has been cancelled
Doctests / Send results to webhook (push) Has been cancelled
Extras Smoke Test / Get supported Python versions (push) Has been cancelled
Extras Smoke Test / Test extras on Python ${{ matrix.python-version }} (push) Has been cancelled
Extras Smoke Test / Check Slack token availability (push) Has been cancelled
Extras Smoke Test / Notify failures to Slack (push) Has been cancelled
Self-hosted runner (AMD scheduled CI caller) / Trigger Scheduled AMD CI (push) Has been cancelled
Stale Bot / Close Stale Issues (push) Has been cancelled
Some checks failed
Self-hosted runner (nightly-past-ci-caller) / Get number (push) Has been cancelled
Self-hosted runner (nightly-past-ci-caller) / TensorFlow 2.11 (push) Has been cancelled
Self-hosted runner (nightly-past-ci-caller) / TensorFlow 2.10 (push) Has been cancelled
Self-hosted runner (nightly-past-ci-caller) / TensorFlow 2.9 (push) Has been cancelled
Self-hosted runner (nightly-past-ci-caller) / TensorFlow 2.8 (push) Has been cancelled
Self-hosted runner (nightly-past-ci-caller) / TensorFlow 2.7 (push) Has been cancelled
Self-hosted runner (nightly-past-ci-caller) / TensorFlow 2.6 (push) Has been cancelled
Self-hosted runner (nightly-past-ci-caller) / TensorFlow 2.5 (push) Has been cancelled
Self-hosted runner (benchmark) / Benchmark (aws-g5-4xlarge-cache) (push) Has been cancelled
Build documentation / build (push) Has been cancelled
Build documentation / build_other_lang (push) Has been cancelled
CodeQL Security Analysis / CodeQL Analysis (push) Has been cancelled
New model PR merged notification / Notify new model (push) Has been cancelled
PR CI / pr-ci (push) Has been cancelled
Slow tests on important models (on Push - A10) / Get all modified files (push) Has been cancelled
Secret Leaks / trufflehog (push) Has been cancelled
Update Transformers metadata / build_and_package (push) Has been cancelled
Slow tests on important models (on Push - A10) / Model CI (push) Has been cancelled
Check Tiny Models / Check tiny models (push) Has been cancelled
Self-hosted runner (Intel Gaudi3 scheduled CI caller) / Model CI (push) Has been cancelled
Self-hosted runner (Intel Gaudi3 scheduled CI caller) / Pipeline CI (push) Has been cancelled
Self-hosted runner (Intel Gaudi3 scheduled CI caller) / Example CI (push) Has been cancelled
Self-hosted runner (Intel Gaudi3 scheduled CI caller) / DeepSpeed CI (push) Has been cancelled
Self-hosted runner (Intel Gaudi3 scheduled CI caller) / Trainer/FSDP CI (push) Has been cancelled
Nvidia CI - Flash Attn / Setup (push) Has been cancelled
Nvidia CI - Flash Attn / Model CI (push) Has been cancelled
Nvidia CI / Setup (push) Has been cancelled
Nvidia CI / Model CI (push) Has been cancelled
Nvidia CI / Torch pipeline CI (push) Has been cancelled
Nvidia CI / Example CI (push) Has been cancelled
Nvidia CI / Trainer/FSDP CI (push) Has been cancelled
Nvidia CI / DeepSpeed CI (push) Has been cancelled
Nvidia CI / Quantization CI (push) Has been cancelled
Nvidia CI / Kernels CI (push) Has been cancelled
Doctests / Setup (push) Has been cancelled
Doctests / Call doctest jobs (push) Has been cancelled
Doctests / Send results to webhook (push) Has been cancelled
Extras Smoke Test / Get supported Python versions (push) Has been cancelled
Extras Smoke Test / Test extras on Python ${{ matrix.python-version }} (push) Has been cancelled
Extras Smoke Test / Check Slack token availability (push) Has been cancelled
Extras Smoke Test / Notify failures to Slack (push) Has been cancelled
Self-hosted runner (AMD scheduled CI caller) / Trigger Scheduled AMD CI (push) Has been cancelled
Stale Bot / Close Stale Issues (push) Has been cancelled
This commit is contained in:
209
docs/source/en/model_doc/glm_image.md
Normal file
209
docs/source/en/model_doc/glm_image.md
Normal file
@@ -0,0 +1,209 @@
|
||||
<!--Copyright 2025 the HuggingFace Team. All rights reserved.
|
||||
|
||||
Licensed under the Apache License, Version 2.0 (the "License");
|
||||
you may not use this file except in compliance with the License.
|
||||
You may obtain a copy of the License at
|
||||
|
||||
http://www.apache.org/licenses/LICENSE-2.0
|
||||
|
||||
Unless required by applicable law or agreed to in writing, software
|
||||
distributed under the License is distributed on an "AS IS" BASIS,
|
||||
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
|
||||
See the License for the specific language governing permissions and
|
||||
limitations under the License.
|
||||
|
||||
|
||||
⚠️ Note that this file is in Markdown but contain specific syntax for our doc-builder (similar to MDX) that may not be rendered properly in your Markdown viewer.
|
||||
|
||||
-->
|
||||
*This model was contributed to Hugging Face Transformers on 2026-01-13.*
|
||||
|
||||
# GlmImage
|
||||
|
||||
## Overview
|
||||
|
||||
GLM-Image is an image generation model adopts a hybrid autoregressive + diffusion decoder architecture, effectively pushing the upper bound of visual fidelity and fine-grained details. In general image generation quality, it aligns with industry-standard LDM-based approaches, while demonstrating significant advantages in knowledge-intensive image generation scenarios.
|
||||
|
||||
Model architecture: a hybrid autoregressive + diffusion decoder design、
|
||||
|
||||
+ Autoregressive generator: a 9B-parameter model initialized from [GLM-4-9B-0414](https://huggingface.co/zai-org/GLM-4-9B-0414), with an expanded vocabulary to incorporate visual tokens. The model first generates a compact encoding of approximately 256 tokens, then expands to 1K–4K tokens, corresponding to 1K–2K high-resolution image outputs.
|
||||
+ Diffusion Decoder: a 7B-parameter decoder based on a single-stream DiT architecture for latent-space image decoding. It is equipped with a Glyph Encoder text module, significantly improving accurate text rendering within images.
|
||||
|
||||
Post-training with decoupled reinforcement learning: the model introduces a fine-grained, modular feedback strategy using the GRPO algorithm, substantially enhancing both semantic understanding and visual detail quality.
|
||||
|
||||
+ Autoregressive module: provides low-frequency feedback signals focused on aesthetics and semantic alignment, improving instruction following and artistic expressiveness.
|
||||
+ Decoder module: delivers high-frequency feedback targeting detail fidelity and text accuracy, resulting in highly realistic textures, lighting, and color reproduction, as well as more precise text rendering.
|
||||
|
||||
GLM-Image supports both text-to-image and image-to-image generation within a single model
|
||||
|
||||
+ Text-to-image: generates high-detail images from textual descriptions, with particularly strong performance in information-dense scenarios.
|
||||
+ Image-to-image: supports a wide range of tasks, including image editing, style transfer, multi-subject consistency, and identity-preserving generation for people and objects.
|
||||
|
||||
+ `GlmImageForConditionalGeneration` is the AR part of GLM-Image model, and for full image generation pipeline, please refer to [here](https://github.com/huggingface/diffusers/tree/main/src/diffusers/pipelines/glm_image).
|
||||
|
||||
This model was contributed by [Raushan Turganbay](https://huggingface.co/RaushanTurganbay) and [Yuxuan Zhang](https://huggingface.co/ZHANGYUXUAN-zR).
|
||||
|
||||
## Usage examples
|
||||
|
||||
Using GLM-Image with image input to generate vision token for DIT using.
|
||||
|
||||
```python
|
||||
|
||||
from transformers import AutoProcessor, GlmImageForConditionalGeneration
|
||||
|
||||
|
||||
model = GlmImageForConditionalGeneration.from_pretrained(
|
||||
pretrained_model_name_or_path="zai-org/GLM-Image/vision_language_encoder",
|
||||
device_map="cuda:0"
|
||||
)
|
||||
processor = AutoProcessor.from_pretrained(
|
||||
pretrained_model_name_or_path="zai-org/GLM-Image/processor",
|
||||
use_fast=True
|
||||
)
|
||||
|
||||
# Case1 T2I
|
||||
prompt = "现代美食杂志风格的甜点制作教程图,主题为覆盆子慕斯蛋糕。整体布局干净明亮,分为四个主要区域:顶部左侧是黑色粗体标题“覆盆子慕斯蛋糕制作指南”,右侧搭配光线柔和的成品蛋糕特写照片,蛋糕呈淡粉色,表面点缀新鲜覆盆子与薄荷叶;左下方为配料清单区域,标题“配料”使用简洁字体,下方列有“面粉 150g”“鸡蛋 3个”“细砂糖 120g”“覆盆子果泥 200g”“明胶片 10g”“淡奶油 300ml”“新鲜覆盆子”等配料,每种配料旁配有简约线图标(如面粉袋、鸡蛋、糖罐等);右下方是四个等大的步骤方框,每个方框内含高清微距实拍图及对应操作说明,从上到下依次为:步骤1展示打蛋器打发白色泡沫(对应说明“打发蛋白至干性发泡”),步骤2展示红白相间的混合物被刮刀翻拌(对应说明“轻柔翻拌果泥与面糊”),步骤3展示粉色液体被倒入圆形模具(对应说明“倒入模具并冷藏4小时”),步骤4展示成品蛋糕表面装饰覆盆子与薄荷叶(对应说明“用覆盆子和薄荷装饰”);底部边缘设浅棕色信息条,左侧图标分别代表“准备时间:30分钟”“烹饪时间:20分钟”“份量:8人份”。整体色调以奶油白、淡粉色为主,背景带轻微纸质纹理,图文排版紧凑有序,信息层级分明。"
|
||||
target_h, target_w = 1152, 768
|
||||
use_reference_images = False
|
||||
reference_image_paths = None
|
||||
|
||||
# ## Case2
|
||||
# prompt = "Replace the background of the snow forest with an underground station featuring an automatic escalator."
|
||||
# cond_0 = "cond.jpg"
|
||||
# target_h, target_w = 1152, 768
|
||||
# use_reference_images = True
|
||||
# reference_image_paths = [cond_0]
|
||||
|
||||
## Case3
|
||||
# prompt = "Make the man in the first figure and the child from the second image bow at the same time in a respectful KTV."
|
||||
# cond_0 = "cond_0.jpg"
|
||||
# cond_1 = "cond_1.jpg"
|
||||
# target_h, target_w = 1152, 768
|
||||
# use_reference_images = True
|
||||
# reference_image_paths = [cond_0, cond_1]
|
||||
|
||||
|
||||
def build_messages(prompt, use_reference_images, reference_image_paths):
|
||||
content = []
|
||||
if use_reference_images:
|
||||
for img_path in reference_image_paths:
|
||||
content.append({"type": "image", "url": img_path})
|
||||
content.append({"type": "text", "text": prompt})
|
||||
return [{"role": "user", "content": content}]
|
||||
|
||||
|
||||
def compute_generation_params(image_grid_thw, use_reference_images):
|
||||
grid_sizes = []
|
||||
for i in range(image_grid_thw.shape[0]):
|
||||
t, h, w = image_grid_thw[i].tolist()
|
||||
grid_sizes.append(int(h * w))
|
||||
|
||||
target_output_length = grid_sizes[0]
|
||||
|
||||
if use_reference_images:
|
||||
max_new_tokens = grid_sizes[-1] + 1
|
||||
output_start_offset = 0
|
||||
output_length = grid_sizes[-1]
|
||||
else:
|
||||
total_tokens = sum(grid_sizes)
|
||||
max_new_tokens = total_tokens + 1
|
||||
output_start_offset = sum(grid_sizes[1:])
|
||||
output_length = target_output_length
|
||||
|
||||
return max_new_tokens, output_start_offset, output_length
|
||||
|
||||
|
||||
messages = build_messages(prompt, use_reference_images, reference_image_paths if use_reference_images else None)
|
||||
|
||||
inputs = processor.apply_chat_template(
|
||||
messages,
|
||||
target_h=target_h,
|
||||
target_w=target_w,
|
||||
tokenize=True,
|
||||
return_dict=True,
|
||||
return_tensors="pt",
|
||||
).to(model.device)
|
||||
|
||||
image_grid_thw = inputs.get('image_grid_thw')
|
||||
print(f"image_grid_thw: {image_grid_thw}")
|
||||
|
||||
max_new_tokens, output_start_offset, output_length = compute_generation_params(
|
||||
image_grid_thw, use_reference_images
|
||||
)
|
||||
|
||||
print(f"use_reference_images: {use_reference_images}")
|
||||
print(f"max_new_tokens: {max_new_tokens}")
|
||||
print(f"output_start_offset: {output_start_offset}")
|
||||
print(f"output_length: {output_length}")
|
||||
|
||||
outputs = model.generate(
|
||||
**inputs,
|
||||
max_new_tokens=max_new_tokens,
|
||||
do_sample=True
|
||||
)
|
||||
|
||||
input_length = inputs["input_ids"].shape[-1]
|
||||
output_tokens = outputs[0][input_length:][output_start_offset:output_start_offset + output_length]
|
||||
print(f"Input length: {input_length}")
|
||||
print(f"Total generated tokens: {outputs[0].shape[-1] - input_length}")
|
||||
print(f"Extracted output tokens shape: {output_tokens.shape}")
|
||||
print(f"Output tokens: {output_tokens}")
|
||||
```
|
||||
|
||||
## GlmImageConfig
|
||||
|
||||
[[autodoc]] GlmImageConfig
|
||||
|
||||
## GlmImageVisionConfig
|
||||
|
||||
[[autodoc]] GlmImageVisionConfig
|
||||
|
||||
## GlmImageTextConfig
|
||||
|
||||
[[autodoc]] GlmImageTextConfig
|
||||
|
||||
## GlmImageVQVAEConfig
|
||||
|
||||
[[autodoc]] GlmImageVQVAEConfig
|
||||
|
||||
## GlmImageImageProcessor
|
||||
|
||||
[[autodoc]] GlmImageImageProcessor
|
||||
- preprocess
|
||||
|
||||
## GlmImageImageProcessorPil
|
||||
|
||||
[[autodoc]] GlmImageImageProcessorPil
|
||||
- preprocess
|
||||
|
||||
## GlmImageProcessor
|
||||
|
||||
[[autodoc]] GlmImageProcessor
|
||||
- __call__
|
||||
|
||||
## GlmImageVisionModel
|
||||
|
||||
[[autodoc]] GlmImageVisionModel
|
||||
- forward
|
||||
|
||||
## GlmImageTextModel
|
||||
|
||||
[[autodoc]] GlmImageTextModel
|
||||
- forward
|
||||
|
||||
## GlmImageVQVAE
|
||||
|
||||
[[autodoc]] GlmImageVQVAE
|
||||
- forward
|
||||
|
||||
## GlmImageModel
|
||||
|
||||
[[autodoc]] GlmImageModel
|
||||
- forward
|
||||
- get_image_features
|
||||
|
||||
## GlmImageForConditionalGeneration
|
||||
|
||||
[[autodoc]] GlmImageForConditionalGeneration
|
||||
- forward
|
||||
- get_image_features
|
||||
Reference in New Issue
Block a user