
GLM-Image: A Hybrid Autoregressive-Diffusion Model for Knowledge-Intensive Image Generation

Introduction

Today marks the release of GLM-Image, the first open-source, industrial-grade discrete autoregressive image generation model. Built on a hybrid architecture that pairs an autoregressive module with a diffusion decoder, GLM-Image represents a significant advance in generative AI, particularly for text rendering and knowledge-intensive scenarios.

Architectural Innovation

GLM-Image adopts a decoupled two-stage approach:

  • Autoregressive Generator: Partially based on GLM-4-9B-0414 (9B parameters), it produces tokens that carry the image's low-frequency semantic signals
  • Diffusion Decoder: Following CogView4's single-stream DiT structure (7B parameters), it refines high-frequency details to produce the final image

This hybrid design enables the model to excel in both semantic understanding and fine-grained detail generation, addressing limitations of traditional diffusion models in complex instruction following.
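
To make the decoupled flow concrete, here is a minimal sketch of how a two-stage pipeline like this could be wired in PyTorch. The `ar_model` and `diffusion_decoder` interfaces are hypothetical stand-ins for illustration, not GLM-Image's actual API:

```python
import torch

def generate_image(prompt_ids, ar_model, diffusion_decoder, num_tokens=1024):
    """Hypothetical two-stage pipeline: the AR model emits discrete
    semantic tokens; the diffusion decoder renders them into pixels."""
    # Stage 1: autoregressively sample a grid of semantic-VQ token ids
    # (low-frequency layout and semantics), conditioned on the prompt.
    tokens = prompt_ids
    for _ in range(num_tokens):
        logits = ar_model(tokens)[:, -1, :]            # next-token logits (B, V)
        next_tok = torch.multinomial(logits.softmax(-1), num_samples=1)
        tokens = torch.cat([tokens, next_tok], dim=1)
    semantic_tokens = tokens[:, prompt_ids.shape[1]:]

    # Stage 2: the diffusion decoder conditions on the semantic tokens
    # and iteratively denoises a latent into the final high-detail image.
    return diffusion_decoder.sample(cond=semantic_tokens)
```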

Technical Highlights
Visual Token Selection

GLM-Image employs semantic-VQ as its primary tokenization strategy, balancing information completeness with semantic relevance. The choice was informed by a comparative analysis showing that semantic tokens offer markedly better convergence (training loss of roughly 3, versus roughly 7 for a conventional VQ-VAE) while retaining sufficient visual information.
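
As a rough illustration of what semantic-VQ tokenization involves, the sketch below shows the standard nearest-neighbor codebook lookup with a straight-through gradient estimator, applied to features from a semantic encoder rather than raw-pixel VAE features. The shapes and the encoder itself are assumptions, not GLM-Image's published design:

```python
import torch
import torch.nn.functional as F

def semantic_vq_quantize(features, codebook):
    """Quantize semantic encoder features to discrete token ids.

    features: (B, N, D) patch features from a semantic encoder (assumed)
    codebook: (K, D) learned code embeddings
    """
    B, N, D = features.shape
    flat = features.reshape(-1, D)                   # (B*N, D)
    d = torch.cdist(flat, codebook)                  # distance to each code
    ids = d.argmin(dim=-1).reshape(B, N)             # (B, N) token ids
    quantized = F.embedding(ids, codebook)           # (B, N, D)
    # Straight-through estimator: gradients bypass the argmin
    quantized = features + (quantized - features).detach()
    return ids, quantized
```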

Progressive Training Strategy

The model undergoes multi-resolution training:

  • Initial stage: 256px with raster scan token generation
  • Advanced stages: 512px to 1024px with progressive generation
  • Final output: 1024px to 2048px images via 32× upscaling

The progressive approach addresses controllability issues at higher resolutions by generating layout-determining tokens first with increased training weight.
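
One simple way to realize "increased training weight" on layout tokens is to upweight the early positions in the autoregressive cross-entropy loss. The sketch below is an assumption about the mechanism; the actual weighting scheme and token counts are not specified here:

```python
import torch
import torch.nn.functional as F

def weighted_ar_loss(logits, targets, num_layout_tokens, layout_weight=2.0):
    """Per-token cross-entropy with early, layout-determining tokens
    upweighted. logits: (B, T, V); targets: (B, T). Weights illustrative."""
    B, T, V = logits.shape
    per_token = F.cross_entropy(
        logits.reshape(B * T, V), targets.reshape(B * T), reduction="none"
    ).reshape(B, T)
    weights = torch.ones(T, device=logits.device)
    weights[:num_layout_tokens] = layout_weight      # emphasize coarse layout
    return (per_token * weights).sum() / (weights.sum() * B)
```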

Enhanced Text Rendering

A key innovation is the integration of Glyph-ByT5, a lightweight model that performs character-level encoding of text regions. Attending over these glyph embeddings gives the model exceptional performance on Chinese character generation and complex text rendering.
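
To illustrate what character-level (byte-level) text encoding looks like in practice, the snippet below uses the public google/byt5-small checkpoint as a stand-in; the actual Glyph-ByT5 weights and how GLM-Image injects the resulting embeddings are assumptions here:

```python
import torch
from transformers import AutoTokenizer, T5EncoderModel

# ByT5 operates on raw bytes, so every character (including Chinese
# glyphs) is represented explicitly rather than merged by a subword vocab.
tokenizer = AutoTokenizer.from_pretrained("google/byt5-small")
encoder = T5EncoderModel.from_pretrained("google/byt5-small")

def encode_text_region(text):
    """Character-level embeddings for a text span to be rendered; a
    generator could cross-attend to these glyph embeddings (assumed)."""
    inputs = tokenizer(text, return_tensors="pt")
    with torch.no_grad():
        out = encoder(**inputs)
    return out.last_hidden_state        # (1, num_bytes + 1, hidden)

glyph_emb = encode_text_region("限时特惠 50% OFF")
```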

Efficient Image Editing

For image editing tasks, GLM-Image uses both semantic-VQ tokens and VAE latents as conditioning inputs. The implementation of block-causal attention (inspired by ControlNet-Reference-Only) significantly reduces computational overhead while preserving fine-grained details from reference images.
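
For intuition, here is a minimal sketch of a block-causal attention mask: tokens attend freely within their own block and to all earlier blocks, but never to later ones, so reference-image tokens are encoded once and reused. The two-block layout (reference first, target second) is an illustrative assumption:

```python
import torch

def block_causal_mask(block_lens):
    """Boolean (T, T) mask, True = attention allowed. Each block attends
    to itself and to every preceding block, never to later blocks."""
    T = sum(block_lens)
    mask = torch.zeros(T, T, dtype=torch.bool)
    start = 0
    for n in block_lens:
        end = start + n
        mask[start:end, :end] = True     # self-block + all prior blocks
        start = end
    return mask

# Example: a 4-token reference block followed by a 6-token target block;
# target rows can see the reference, reference rows cannot see the target.
m = block_causal_mask([4, 6])
```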

Conclusion

GLM-Image's hybrid architecture represents a strategic move toward combining artistic aesthetics with informational precision. By decoupling semantic understanding from detail generation, it achieves competitive performance while opening new possibilities for creative applications requiring both knowledge representation and visual fidelity. The model's open-source nature and industrial-grade implementation make it a valuable contribution to the generative AI ecosystem, particularly for applications demanding precise semantic alignment alongside high visual quality.

Note: GLM-Image requires further benchmarking against established models like DALL-E 3, Midjourney, and Stable Diffusion 3 for comprehensive performance evaluation across diverse use cases.
