GLM-5V-Turbo is Z.AI’s first multimodal coding foundation model, built for vision-based coding tasks. It can natively process multimodal inputs such as images, video, and text, while also excelling at long-horizon planning, complex coding, and action execution. Deeply optimized for agent workflows, it works seamlessly with agents such as Claude Code and OpenClaw to complete the full loop of “understand the environment → plan actions → execute tasks”.
Overview
GLM-5V-Turbo is positioned as a multimodal coding model, capable of understanding and generating code based on visual inputs. It aims to bridge the gap between visual understanding and software development, enabling more intuitive and efficient coding processes.
Capability
- Positioning: Multimodal coding model; natively handles video, image, text, and file inputs.
- Input Modality: Video / Image / Text / File
- Output Modality: Text
- Context Length: 200K tokens
- Maximum Output Tokens: 128K tokens
Key Capabilities:
- Thinking Mode: Offers multiple thinking modes that adapt reasoning depth to the scenario.
- Vision Comprehension: Strong visual understanding of images, video, and files for detailed analysis.
- Streaming Output: Supports real-time streaming responses for interactive applications.
- Function Call: Supports tool invocation for integration with external toolsets.
- Context Caching: Intelligently caches shared context to keep long conversations efficient and responsive.
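These capabilities surface as request parameters in a chat-completions call. As a rough sketch only: the payload below assumes an OpenAI-style message format, and the `stream`, `thinking`, and image content-part field names are assumptions for illustration, not confirmed API fields — consult the API documentation for the real parameter names.

```python
import json

def build_request(image_url: str, question: str) -> dict:
    """Assemble a hypothetical multimodal chat-completions payload."""
    return {
        "model": "glm-5v-turbo",
        "messages": [
            {
                "role": "user",
                "content": [
                    # Image input exercising vision comprehension
                    {"type": "image_url", "image_url": {"url": image_url}},
                    # Text instruction accompanying the image
                    {"type": "text", "text": question},
                ],
            }
        ],
        # Streaming output: tokens arrive incrementally
        "stream": True,
        # Thinking-mode toggle (hypothetical parameter name)
        "thinking": {"type": "enabled"},
    }

payload = build_request("https://example.com/mockup.png", "Describe this layout.")
print(json.dumps(payload, indent=2))
```

The same payload shape would carry a `tools` array for function calling; the sketch omits it for brevity.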
Usage
GLM-5V-Turbo is designed for a variety of coding-related tasks, including:
- Frontend Recreation: Recreating mobile pages from design mockups, understanding layouts, color palettes, component hierarchies, and interaction logic to generate runnable frontend projects. It can reconstruct structure and functionality from wireframes and achieve pixel-level visual consistency from high-fidelity designs.
- GUI Autonomous Exploration and Recreation: Working with frameworks like Claude Code, it can autonomously browse target websites, map page transitions, collect visual assets and interaction details, and generate code based on exploration results, moving beyond static recreation to dynamic understanding.
- Code Debugging: Analyzing screenshots of buggy pages to identify rendering issues like layout misalignment, component overlap, and color mismatches, thereby assisting in problem localization and generating fix code to improve debugging efficiency.
- OpenClaw Integration: Enhances OpenClaw's capabilities by enabling it to understand webpage layouts, GUI elements, and chart information, supporting agents in complex real-world tasks that combine perception, planning, and execution.
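For the debugging workflow above, the screenshot has to reach the model as an image input. One common convention is a base64 data URL embedded in an image content part; the data-URL approach and the message field names here are assumptions for illustration, and the actual upload mechanism may differ.

```python
import base64

def to_data_url(png_bytes: bytes) -> str:
    """Encode raw PNG bytes as a base64 data URL (assumed upload convention)."""
    b64 = base64.b64encode(png_bytes).decode("ascii")
    return f"data:image/png;base64,{b64}"

def debug_messages(png_bytes: bytes) -> list:
    """Build a hypothetical vision-debugging message list from a screenshot."""
    return [
        {
            "role": "user",
            "content": [
                {"type": "image_url", "image_url": {"url": to_data_url(png_bytes)}},
                {"type": "text",
                 "text": "This page renders incorrectly. Identify layout "
                         "misalignment, overlapping components, or color "
                         "mismatches, and propose a fix."},
            ],
        }
    ]

# In practice png_bytes would come from open("bug_screenshot.png", "rb").read();
# the placeholder below is just the PNG magic header, not a real image.
fake_png = b"\x89PNG\r\n\x1a\n"
print(debug_messages(fake_png)[0]["content"][0]["image_url"]["url"][:30])
```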
Resources
- API Documentation: Provides comprehensive documentation for interacting with the GLM-5V-Turbo API, including details on chat completion.
Introducing GLM-5V-Turbo
GLM-5V-Turbo represents a significant advancement in AI model capabilities, particularly in the domain of multimodal understanding and code generation. Its development is underpinned by systematic upgrades across four key layers:
- Native Multimodal Fusion: The model is trained from pre-training through post-training to continuously enhance visual-text alignment. By integrating the CogViT vision encoder and an inference-friendly MTP architecture, it achieves improved multimodal understanding and reasoning efficiency.
- 30+ Task Joint Reinforcement Learning: During the reinforcement learning phase, GLM-5V-Turbo is jointly optimized across more than 30 task types, encompassing STEM, grounding, video, GUI agents, and coding agents. This comprehensive optimization leads to more robust gains in perception, reasoning, and agentic execution.
- Agentic Data and Task Construction: To address the scarcity of agent data and the difficulty of verifying it, Z.AI has developed a multi-level, controllable, and verifiable data system. Injecting agentic meta-capabilities during pretraining strengthens the model's action prediction and execution abilities.
- Expanded Multimodal Toolchain: The model incorporates a suite of multimodal tools, including box drawing, screenshots, and webpage reading (with image understanding capabilities). This expands agent capabilities beyond text-only interactions to visual interaction, supporting a more complete perception–planning–execution loop.
Official Skills
GLM-5V-Turbo supports a range of official Skills designed for various scenarios and tasks, building upon previous models like GLM-OCR and GLM-Image. These skills include:
- Image Captioning: Automatically analyzes image content to generate natural-language descriptions, identifying objects, relationships, scene atmosphere, and actions for accurate and fluent textual descriptions.
- Visual Grounding: Precisely locates objects or regions in an image based on natural-language descriptions, establishing alignment between text and pixels, typically marking targets with bounding boxes for grounded interactions and fine-grained analysis.
- Document-Grounded Writing: Extracts key information from documents (PDFs, Word files) to generate text in specified formats, ensuring output is grounded in content for document interpretation, report generation, and proposal drafting.
- Resume Screening: Reads candidate resumes and intelligently compares them against job requirements, extracting key information and assessing fit to improve recruiting efficiency.
- Prompt Generation: Automatically generates high-quality, structured prompts based on reference images/videos and user goals, improving the accuracy and quality of AI-generated content.
These skills are available on ClawHub for installation.
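Visual grounding results are typically returned as bounding boxes in a normalized coordinate space that the caller maps back onto the original image. The [0, 1000] normalization below is an assumption about the output format, shown only to illustrate the conversion step:

```python
def to_pixels(box, width, height):
    """Map a bounding box from an assumed [0, 1000] normalized space
    to pixel coordinates on an image of the given size."""
    x1, y1, x2, y2 = box
    return (
        round(x1 / 1000 * width),
        round(y1 / 1000 * height),
        round(x2 / 1000 * width),
        round(y2 / 1000 * height),
    )

# A grounded "submit button" box on a 1920x1080 screenshot
print(to_pixels((100, 250, 900, 750), 1920, 1080))  # (192, 270, 1728, 810)
```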
Examples
GLM-5V-Turbo demonstrates its capabilities through various practical examples:
- Web Page Coding: Recreating web pages from design mockups, including handling different states and interactions.
- Website Generation: Creating academic websites from article content, showcasing content structuring and layout design.
- Document Comprehension & Writing: Summarizing research papers and extracting key arguments.
- Video Object Tracking: Identifying and tracking objects within video frames, outputting results in JSON format.
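For the video-tracking example, the JSON output can be consumed programmatically. The per-frame schema below (`frame`, `objects`, `id`, `label`, `bbox` keys) is a hypothetical shape invented for this sketch, not the model's documented format:

```python
import json

# Hypothetical tracking output: one entry per frame, each listing the
# detected objects with an id, label, and bounding box.
raw = """
[
  {"frame": 0, "objects": [{"id": 1, "label": "car", "bbox": [120, 80, 340, 220]}]},
  {"frame": 1, "objects": [{"id": 1, "label": "car", "bbox": [128, 82, 348, 224]}]}
]
"""

def track(frames, object_id):
    """Collect one object's bounding box across all frames it appears in."""
    return {f["frame"]: o["bbox"]
            for f in frames
            for o in f["objects"] if o["id"] == object_id}

frames = json.loads(raw)
print(track(frames, 1))
```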
Quick Start
Users can interact with GLM-5V-Turbo via cURL, Python SDK, or Java SDK. The examples provided demonstrate basic and streaming calls, illustrating how to integrate the model into various applications.
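As a minimal sketch of a basic call without an SDK: the snippet below builds (but does not send) an HTTP request with Python's standard library. The endpoint URL is a placeholder and the Bearer-auth header is an assumed convention — take the real base URL and authentication scheme from the API documentation.

```python
import json
import os
import urllib.request

# Placeholder endpoint; substitute the real base URL from the API docs.
API_URL = "https://api.example.com/v1/chat/completions"

def make_request(prompt: str) -> urllib.request.Request:
    """Build a POST request for a basic text completion (not sent here)."""
    body = json.dumps({
        "model": "glm-5v-turbo",
        "messages": [{"role": "user", "content": prompt}],
    }).encode("utf-8")
    return urllib.request.Request(
        API_URL,
        data=body,
        headers={
            "Content-Type": "application/json",
            # Bearer auth is a common convention, assumed here
            "Authorization": f"Bearer {os.environ.get('API_KEY', 'unset')}",
        },
        method="POST",
    )

req = make_request("Write a hello-world page in HTML.")
print(req.get_method(), req.full_url)
```

Sending it would be `urllib.request.urlopen(req)`; a streaming call would additionally set `"stream": true` in the body and read the response incrementally.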

