This model was contributed to Hugging Face Transformers on 2026-01-13.

GlmImage

Overview

GLM-Image is an image generation model that adopts a hybrid autoregressive + diffusion decoder architecture, pushing the upper bound of visual fidelity and fine-grained detail. In general image generation quality it is on par with industry-standard latent diffusion model (LDM) approaches, and it shows significant advantages in knowledge-intensive image generation scenarios.

Model architecture: a hybrid autoregressive + diffusion decoder design.

  • Autoregressive generator: a 9B-parameter model initialized from GLM-4-9B-0414, with an expanded vocabulary to incorporate visual tokens. The model first generates a compact encoding of approximately 256 tokens, then expands to 1K-4K tokens, corresponding to 1K-2K high-resolution image outputs.
  • Diffusion decoder: a 7B-parameter decoder based on a single-stream DiT architecture for latent-space image decoding. It is equipped with a Glyph Encoder text module, which significantly improves the accuracy of text rendered within images.
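The token counts quoted above are consistent with a simple patch-grid calculation. The sketch below is only a back-of-the-envelope check: the 32-pixel patch size is an assumption inferred from the quoted numbers (1K-2K pixels per side mapping to 1K-4K tokens), not a value stated in this document, and `tokens_for` is a hypothetical helper rather than part of the library.

```python
# Assumption: one visual token per 32x32 pixel patch (inferred, not confirmed).
ASSUMED_PATCH = 32


def tokens_for(height: int, width: int, patch: int = ASSUMED_PATCH) -> int:
    """Number of grid tokens for an image whose sides are multiples of `patch`."""
    if height % patch or width % patch:
        raise ValueError("height and width must be multiples of the patch size")
    return (height // patch) * (width // patch)


print(tokens_for(1024, 1024))  # 1024
print(tokens_for(2048, 2048))  # 4096
```

Under the same assumption, a 512 x 512 grid would need 256 tokens, which matches the size of the compact encoding mentioned above.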

Post-training with decoupled reinforcement learning: the model introduces a fine-grained, modular feedback strategy using the GRPO algorithm, substantially enhancing both semantic understanding and visual detail quality.

  • Autoregressive module: provides low-frequency feedback signals focused on aesthetics and semantic alignment, improving instruction following and artistic expressiveness.
  • Decoder module: delivers high-frequency feedback targeting detail fidelity and text accuracy, resulting in highly realistic textures, lighting, and color reproduction, as well as more precise text rendering.
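GRPO scores each sample relative to the other samples drawn for the same prompt. A minimal sketch of that group-relative normalization follows. It is a generic illustration, not the GLM-Image training code: the function name and the reward values are made up, and the population standard deviation is one possible choice of normalizer.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize each reward against the mean and std of its own sample group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    # eps guards against division by zero when all rewards are identical.
    return [(r - mu) / (sigma + eps) for r in rewards]


# Hypothetical rewards for four images sampled from the same prompt.
advantages = group_relative_advantages([0.2, 0.5, 0.9, 0.4])
print(advantages)
```

Samples that beat their group's average receive a positive advantage and the rest a negative one, so no separate value model is needed to provide a baseline.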

GLM-Image supports both text-to-image and image-to-image generation within a single model:

  • Text-to-image: generates high-detail images from textual descriptions, with particularly strong performance in information-dense scenarios.

  • Image-to-image: supports a wide range of tasks, including image editing, style transfer, multi-subject consistency, and identity-preserving generation for people and objects.

  • GlmImageForConditionalGeneration is the autoregressive (AR) part of the GLM-Image model. For the full image generation pipeline, please refer to here.

This model was contributed by Raushan Turganbay and Yuxuan Zhang.

Usage examples

Use GLM-Image with a text prompt and optional reference images to generate the vision tokens consumed by the DiT decoder.


from transformers import AutoProcessor, GlmImageForConditionalGeneration


model = GlmImageForConditionalGeneration.from_pretrained(
    pretrained_model_name_or_path="zai-org/GLM-Image/vision_language_encoder",
    device_map="cuda:0"
)
processor = AutoProcessor.from_pretrained(
    pretrained_model_name_or_path="zai-org/GLM-Image/processor",
    use_fast=True
)

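# Case 1: text-to-image (T2I). The Chinese prompt below describes a magazine-style
# raspberry mousse cake tutorial with a title, an ingredient list, and four step panels.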
prompt = "现代美食杂志风格的甜点制作教程图,主题为覆盆子慕斯蛋糕。整体布局干净明亮,分为四个主要区域:顶部左侧是黑色粗体标题“覆盆子慕斯蛋糕制作指南”,右侧搭配光线柔和的成品蛋糕特写照片,蛋糕呈淡粉色,表面点缀新鲜覆盆子与薄荷叶;左下方为配料清单区域,标题“配料”使用简洁字体,下方列有“面粉 150g”“鸡蛋 3个”“细砂糖 120g”“覆盆子果泥 200g”“明胶片 10g”“淡奶油 300ml”“新鲜覆盆子”等配料每种配料旁配有简约线图标如面粉袋、鸡蛋、糖罐等右下方是四个等大的步骤方框每个方框内含高清微距实拍图及对应操作说明从上到下依次为步骤1展示打蛋器打发白色泡沫对应说明“打发蛋白至干性发泡”步骤2展示红白相间的混合物被刮刀翻拌对应说明“轻柔翻拌果泥与面糊”步骤3展示粉色液体被倒入圆形模具对应说明“倒入模具并冷藏4小时”步骤4展示成品蛋糕表面装饰覆盆子与薄荷叶对应说明“用覆盆子和薄荷装饰”底部边缘设浅棕色信息条左侧图标分别代表“准备时间30分钟”“烹饪时间20分钟”“份量8人份”。整体色调以奶油白、淡粉色为主背景带轻微纸质纹理图文排版紧凑有序信息层级分明。"
target_h, target_w = 1152, 768
use_reference_images = False
reference_image_paths = None

# Case 2: image editing with one reference image (uncomment to use)
# prompt = "Replace the background of the snow forest with an underground station featuring an automatic escalator."
# cond_0 = "cond.jpg"
# target_h, target_w = 1152, 768
# use_reference_images = True
# reference_image_paths = [cond_0]

# Case 3: generation from two reference images (uncomment to use)
# prompt = "Make the man in the first figure and the child from the second image bow at the same time in a respectful KTV."
# cond_0 = "cond_0.jpg"
# cond_1 = "cond_1.jpg"
# target_h, target_w = 1152, 768
# use_reference_images = True
# reference_image_paths = [cond_0, cond_1]


def build_messages(prompt, use_reference_images, reference_image_paths):
    content = []
    if use_reference_images:
        for img_path in reference_image_paths:
            content.append({"type": "image", "url": img_path})
    content.append({"type": "text", "text": prompt})
    return [{"role": "user", "content": content}]


def compute_generation_params(image_grid_thw, use_reference_images):
    # Token count of each image grid is height * width (the temporal dim is unused).
    grid_sizes = []
    for i in range(image_grid_thw.shape[0]):
        _, h, w = image_grid_thw[i].tolist()
        grid_sizes.append(int(h * w))

    if use_reference_images:
        # Budget for the last grid only (plus one extra token); the tokens to
        # keep start at the beginning of the generated sequence.
        max_new_tokens = grid_sizes[-1] + 1
        output_start_offset = 0
        output_length = grid_sizes[-1]
    else:
        # Budget for every grid (plus one extra token). Keep the tokens of the
        # first grid, which follow the tokens of all remaining grids.
        max_new_tokens = sum(grid_sizes) + 1
        output_start_offset = sum(grid_sizes[1:])
        output_length = grid_sizes[0]

    return max_new_tokens, output_start_offset, output_length


messages = build_messages(prompt, use_reference_images, reference_image_paths if use_reference_images else None)

inputs = processor.apply_chat_template(
    messages,
    target_h=target_h,
    target_w=target_w,
    tokenize=True,
    return_dict=True,
    return_tensors="pt",
).to(model.device)

image_grid_thw = inputs.get('image_grid_thw')
print(f"image_grid_thw: {image_grid_thw}")

max_new_tokens, output_start_offset, output_length = compute_generation_params(
    image_grid_thw, use_reference_images
)

print(f"use_reference_images: {use_reference_images}")
print(f"max_new_tokens: {max_new_tokens}")
print(f"output_start_offset: {output_start_offset}")
print(f"output_length: {output_length}")

outputs = model.generate(
    **inputs,
    max_new_tokens=max_new_tokens,
    do_sample=True
)

input_length = inputs["input_ids"].shape[-1]
output_tokens = outputs[0][input_length:][output_start_offset:output_start_offset + output_length]
print(f"Input length: {input_length}")
print(f"Total generated tokens: {outputs[0].shape[-1] - input_length}")
print(f"Extracted output tokens shape: {output_tokens.shape}")
print(f"Output tokens: {output_tokens}")
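To make the slicing arithmetic in compute_generation_params concrete, here is a small worked example that needs neither the model nor PyTorch. `slice_plan` is a hypothetical pure-Python mirror of that function, and the grid shapes are made-up values chosen for illustration; the real values come from the processor's `image_grid_thw` output.

```python
def slice_plan(grids, use_reference_images):
    """Pure-Python mirror of compute_generation_params for (t, h, w) tuples."""
    grid_sizes = [int(h * w) for _, h, w in grids]
    if use_reference_images:
        # Generate and keep only the last grid.
        return grid_sizes[-1] + 1, 0, grid_sizes[-1]
    # Generate every grid, then keep the first grid's tokens, which come last.
    return sum(grid_sizes) + 1, sum(grid_sizes[1:]), grid_sizes[0]


# Made-up text-to-image grids: a 36x24 grid listed first, then a 16x16 grid.
print(slice_plan([(1, 36, 24), (1, 16, 16)], use_reference_images=False))
# (1121, 256, 864)

# Made-up image-to-image grids: one 20x20 reference grid, then a 36x24 grid.
print(slice_plan([(1, 20, 20), (1, 36, 24)], use_reference_images=True))
# (865, 0, 864)
```

In the first case the 256 tokens of the smaller grid are skipped and the following 864 tokens are extracted; in the second case the 864 extracted tokens start at offset 0.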

GlmImageConfig

autodoc GlmImageConfig

GlmImageVisionConfig

autodoc GlmImageVisionConfig

GlmImageTextConfig

autodoc GlmImageTextConfig

GlmImageVQVAEConfig

autodoc GlmImageVQVAEConfig

GlmImageImageProcessor

autodoc GlmImageImageProcessor - preprocess

GlmImageImageProcessorPil

autodoc GlmImageImageProcessorPil - preprocess

GlmImageProcessor

autodoc GlmImageProcessor - __call__

GlmImageVisionModel

autodoc GlmImageVisionModel - forward

GlmImageTextModel

autodoc GlmImageTextModel - forward

GlmImageVQVAE

autodoc GlmImageVQVAE - forward

GlmImageModel

autodoc GlmImageModel - forward - get_image_features

GlmImageForConditionalGeneration

autodoc GlmImageForConditionalGeneration - forward - get_image_features