Z.ai debuts open-source GLM-4.6V, a native tool-calling vision model for multimodal reasoning


Chinese AI startup Zhipu AI, also known as Z.ai, has released its GLM-4.6V series, a new generation of open-source vision language models (VLMs) optimized for multimodal reasoning, frontend automation, and highly efficient deployment.
The release includes two models in “large” and “small” sizes:
- GLM-4.6V (106B): a larger, 106-billion-parameter model aimed at cloud-scale inference
- GLM-4.6V-Flash (9B): a smaller model with only 9 billion parameters, designed for low-latency local applications
As a rule, models with more parameters (the internal settings, i.e. weights and biases, that determine a model's behavior) are more capable and perform at a higher overall level across more varied tasks.
However, smaller models can provide better efficiency for edge or real-time applications where latency and resource constraints are critical.
The defining innovation in this series is the introduction of native function calling in a vision language model, allowing direct use of tools such as search, cropping, or map recognition with visual input.
With a context length of 128,000 tokens (equivalent to the text of a 300-page novel exchanged in a single input/output interaction with the user) and state-of-the-art (SoTA) results across more than 20 benchmarks, the GLM-4.6V series positions itself as a highly competitive alternative to both closed and open-source VLMs. It is available in the following formats:
- API access via an OpenAI-compatible interface (a usage sketch follows this list)
- A demo on Zhipu’s web interface
- Downloadable weights on Hugging Face
- A desktop assistant app available via Hugging Face Spaces
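Because the API is OpenAI-compatible, existing client code carries over with minimal changes. Below is a minimal sketch of a multimodal request; the base URL, model id, and image URL are illustrative assumptions, so check Z.ai's API documentation for the exact values.

```python
# Minimal sketch of calling GLM-4.6V through an OpenAI-compatible endpoint.
# The base_url and model id are assumptions, not confirmed values.
from openai import OpenAI

client = OpenAI(
    api_key="YOUR_ZAI_API_KEY",                       # assumption: key issued by the Z.ai console
    base_url="https://open.bigmodel.cn/api/paas/v4/", # assumption: OpenAI-compatible endpoint
)

response = client.chat.completions.create(
    model="glm-4.6v",  # assumption: exact model id may differ
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "Describe the chart in this image."},
                {"type": "image_url", "image_url": {"url": "https://example.com/chart.png"}},
            ],
        }
    ],
)
print(response.choices[0].message.content)
```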
Licensing and Business Use
GLM‑4.6V and GLM‑4.6V‑Flash are distributed under the MIT license, a permissive open-source license that allows free commercial and non-commercial use, modification, redistribution, and local deployment without any obligation to open-source derivative works.
This licensing model makes the series suitable for enterprise adoption, including scenarios that require full control of infrastructure, internal governance compliance or air-gapped environments.
Model weights and documentation are publicly hosted on Hugging Face, with supporting code and tooling available on GitHub.
The MIT license guarantees maximum flexibility for integration into proprietary systems, including internal tools, production pipelines, and edge deployments.
Architecture and technical capabilities
The GLM-4.6V models follow a conventional encoder-decoder architecture with significant adjustments for multimodal input.
Both models feature a Vision Transformer (ViT) encoder, based on AIMv2-Huge, and an MLP projector to align visual features with a Large Language Model (LLM) decoder.
Video input benefits from 3D convolutions and temporal compression, while spatial encoding is handled using 2D RoPE and bicubic interpolation of absolute positional embeddings.
A key technical feature is the system’s support for arbitrary image resolutions and aspect ratios, including wide panoramic inputs up to 200:1.
In addition to parsing static images and documents, GLM-4.6V can ingest temporal sequences of video frames with explicit timestamp tokens, enabling robust temporal reasoning.
On the decoding side, the model supports token generation aligned with function calling protocols, allowing structured reasoning about text, image, and tool output. This is supported by an extensive tokenizer vocabulary and output formatting templates to ensure consistent API or agent compatibility.
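The high-level data flow described above can be illustrated with a toy sketch. The module sizes, layer counts, and the stand-in transformer blocks below are placeholders for illustration only; the actual GLM-4.6V uses an AIMv2-Huge ViT encoder, 2D RoPE, and a GLM decoder.

```python
# Simplified illustration of the encoder -> MLP projector -> decoder flow.
# Not the GLM-4.6V implementation; dimensions and modules are placeholders.
import torch
import torch.nn as nn

class ToyVisionLanguageModel(nn.Module):
    def __init__(self, vision_dim=256, text_dim=512, vocab_size=32000):
        super().__init__()
        # "ViT" stand-in: patchify the image and run a small transformer encoder
        self.patch_embed = nn.Conv2d(3, vision_dim, kernel_size=16, stride=16)
        self.vision_encoder = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model=vision_dim, nhead=8, batch_first=True),
            num_layers=2,
        )
        # MLP projector: align visual features with the language model's hidden size
        self.projector = nn.Sequential(
            nn.Linear(vision_dim, text_dim), nn.GELU(), nn.Linear(text_dim, text_dim)
        )
        # "LLM decoder" stand-in: token embeddings, a small transformer, and an LM head
        self.token_embed = nn.Embedding(vocab_size, text_dim)
        self.decoder = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model=text_dim, nhead=8, batch_first=True),
            num_layers=2,
        )
        self.lm_head = nn.Linear(text_dim, vocab_size)

    def forward(self, image, input_ids):
        # image: (B, 3, H, W) -> visual tokens (B, N, vision_dim)
        patches = self.patch_embed(image).flatten(2).transpose(1, 2)
        visual_tokens = self.projector(self.vision_encoder(patches))
        text_tokens = self.token_embed(input_ids)
        # Prepend projected visual tokens to the text sequence, then decode
        sequence = torch.cat([visual_tokens, text_tokens], dim=1)
        return self.lm_head(self.decoder(sequence))

model = ToyVisionLanguageModel()
logits = model(torch.randn(1, 3, 224, 224), torch.randint(0, 32000, (1, 16)))
print(logits.shape)  # (1, num_visual_tokens + 16, vocab_size)
```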
Native multimodal tool use
GLM-4.6V introduces native multimodal function calls, allowing visual assets, such as screenshots, images, and documents, to be passed directly to tools as parameters. This eliminates the need for intermediate text-only conversions, which historically involved information loss and complexity.
The tool-calling mechanism works bidirectionally:
- Input tools allow images or videos to be passed directly (for example, document pages to crop or analyze).
- Output tools, such as graph renderers or web snapshot utilities, return visual data, which GLM-4.6V integrates directly into the reasoning chain.
In practice, this means GLM-4.6V can perform tasks such as the following (a request sketch appears after this list):
- Generate structured reports from documents in different formats
- Perform visual audits of candidate images
- Automatically crop figures from papers during generation
- Perform visual web searches and answer multimodal queries
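As a shape of the interaction, the sketch below passes an image alongside a tool definition in the standard OpenAI-compatible format. The "crop_image" tool, the endpoint, and the model id are hypothetical; GLM-4.6V's exact tool schema may differ.

```python
# Illustrative multimodal tool-calling request. The tool, endpoint, and model id
# are assumptions used only to show the structure of such a call.
from openai import OpenAI

client = OpenAI(api_key="YOUR_ZAI_API_KEY",
                base_url="https://open.bigmodel.cn/api/paas/v4/")  # assumption

tools = [{
    "type": "function",
    "function": {
        "name": "crop_image",  # hypothetical tool the model may call with pixel coordinates
        "description": "Crop a region of the supplied document page for closer inspection.",
        "parameters": {
            "type": "object",
            "properties": {
                "x": {"type": "integer"}, "y": {"type": "integer"},
                "width": {"type": "integer"}, "height": {"type": "integer"},
            },
            "required": ["x", "y", "width", "height"],
        },
    },
}]

response = client.chat.completions.create(
    model="glm-4.6v",  # assumption
    tools=tools,
    messages=[{
        "role": "user",
        "content": [
            {"type": "text", "text": "Extract the revenue figure from this scanned page."},
            {"type": "image_url", "image_url": {"url": "https://example.com/page-17.png"}},
        ],
    }],
)
# If the model decides to call the tool, the arguments arrive as structured JSON.
print(response.choices[0].message.tool_calls)
```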
High benchmark performance compared to other models of similar size
GLM-4.6V was evaluated against more than twenty public benchmarks, including general VQA, diagram understanding, OCR, STEM reasoning, frontend replication, and multimodal agents.
According to the benchmark chart released by Zhipu AI:
- GLM-4.6V (106B) achieves SoTA or near-SoTA scores among similarly sized open-source models on MMBench, MathVista, MMLongBench, ChartQAPro, RefCOCO, TreeBench, and more.
- GLM-4.6V-Flash (9B) outperforms other lightweight models (e.g. Qwen3-VL-8B, GLM-4.1V-9B) in almost all categories tested.
- The 128K-token window of the 106B model allows it to outperform larger models such as Step-3 (321B) and Qwen3-VL-235B on long-context document tasks, video summarization, and structured multimodal reasoning.
Example scores from the leaderboard include:
- MathVista: 88.2 (GLM-4.6V) vs. 84.6 (GLM-4.5V) vs. 81.4 (Qwen3-VL-8B)
- WebVoyager: 81.0 vs. 68.4 (Qwen3-VL-8B)
- Ref-L4 test: 88.9 vs. 89.5 (GLM-4.5V), but with better grounding fidelity at 87.7 (Flash) vs. 86.8
Both models were evaluated using the vLLM inference backend and support SGLang for video-based tasks.
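For teams that want to reproduce this setup locally, the sketch below loads the model through vLLM's offline inference API. The Hugging Face repo id, context-length cap, and prompt are illustrative assumptions; consult the model card for the exact repo name and any image-placeholder conventions required for multimodal prompts.

```python
# Minimal local-inference sketch with vLLM (the backend the article says was used
# for evaluation). Repo id and settings are assumptions, not confirmed values.
from vllm import LLM, SamplingParams

llm = LLM(
    model="zai-org/GLM-4.6V-Flash",  # assumption: actual Hugging Face repo id may differ
    trust_remote_code=True,
    max_model_len=32768,             # trim the 128K window to fit smaller GPUs
)

params = SamplingParams(temperature=0.2, max_tokens=512)
outputs = llm.generate(["Summarize the key claims of this press release."], params)
print(outputs[0].outputs[0].text)
```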
Frontend automation and long context workflows
Zhipu AI highlighted GLM-4.6V’s ability to support frontend development workflows. The model can:
- Replicate pixel-accurate HTML/CSS/JS from UI screenshots
- Accept natural-language editing commands to change layouts
- Visually identify and manipulate specific UI components
This capability is integrated into an end-to-end visual programming interface, where the model iterates on layout, design intent, and output code using its native understanding of screenshots.
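A screenshot-to-code loop of this kind can be approximated over the same OpenAI-compatible client shown earlier. The model id, endpoint, and file path below are illustrative assumptions rather than a documented workflow.

```python
# Sketch of a screenshot-to-code loop: generate HTML from a screenshot, then
# apply a natural-language edit in a follow-up turn. Endpoint, model id, and
# file path are assumptions.
import base64
from openai import OpenAI

client = OpenAI(api_key="YOUR_ZAI_API_KEY",
                base_url="https://open.bigmodel.cn/api/paas/v4/")  # assumption

with open("dashboard.png", "rb") as f:  # hypothetical UI screenshot
    screenshot = base64.b64encode(f.read()).decode()

messages = [{
    "role": "user",
    "content": [
        {"type": "text", "text": "Replicate this screen as a single HTML file with inline CSS."},
        {"type": "image_url", "image_url": {"url": f"data:image/png;base64,{screenshot}"}},
    ],
}]
first = client.chat.completions.create(model="glm-4.6v", messages=messages)  # assumption: model id

# Follow up with a natural-language edit, keeping the generated code in context.
messages += [
    {"role": "assistant", "content": first.choices[0].message.content},
    {"role": "user", "content": "Move the sidebar to the right and use a dark theme."},
]
second = client.chat.completions.create(model="glm-4.6v", messages=messages)
print(second.choices[0].message.content)
```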
In long-document scenarios, GLM-4.6V can process up to 128,000 tokens in a single inference, enough to handle:
- 150 pages of text (as input)
- a 200-slide deck
- one hour of video
Zhipu AI reported successful use of the model in financial analysis of multi-document corpora and in summarizing entire sports broadcasts with timestamp event detection.
Training and reinforcement learning
The model was trained using multi-stage pre-training, followed by supervised fine-tuning (SFT) and reinforcement learning (RL). Key innovations include:
- Reinforcement Learning with Curriculum Sampling (RLCS): dynamically adjusts the difficulty of training examples based on model progress
- Multi-domain reward systems: task-specific verifiers for STEM, diagram reasoning, GUI agents, video QA, and spatial grounding
- Function-aware training: uses structured tags (e.g. <|begin_of_box|>) to align the reasoning and answer format
The reinforcement learning pipeline emphasizes reinforcement learning with verifiable rewards (RLVR) over reinforcement learning from human feedback (RLHF) for scalability, and avoids KL/entropy losses to stabilize training in multimodal domains.
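The curriculum-sampling idea can be illustrated with a toy sampler that favors examples near the model's current ability rather than ones it already solves or cannot yet solve. This is a generic sketch of the concept, not Zhipu AI's pipeline; the weighting function and task names are invented for illustration.

```python
# Toy curriculum sampling: weight examples by how close the model's recent
# success rate is to a target difficulty, then sample a training batch.
import math
import random

def curriculum_weights(success_rates, target=0.5, sharpness=8.0):
    """Highest weight for examples whose recent solve rate is near `target`."""
    return [math.exp(-sharpness * (r - target) ** 2) for r in success_rates]

# success_rates[i] = fraction of recent rollouts in which the model solved example i
examples = ["easy_vqa", "chart_reasoning", "gui_navigation", "olympiad_geometry"]
success_rates = [0.95, 0.60, 0.45, 0.05]

weights = curriculum_weights(success_rates)
batch = random.choices(examples, weights=weights, k=4)
print(batch)  # mid-difficulty tasks dominate the sampled batch
```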
Pricing (API)
Zhipu AI offers competitive pricing for the GLM-4.6V series, with both the flagship model and the lightweight variant positioned for high accessibility; a quick cost estimate follows the price list below.
- GLM-4.6V: $0.30 (input) / $0.90 (output) per 1 million tokens
- GLM-4.6V-Flash: free
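Using the listed GLM-4.6V rates, a back-of-the-envelope estimate shows what a typical long-document workload costs; the workload sizes below are illustrative assumptions.

```python
# Cost estimate using the listed GLM-4.6V rates:
# $0.30 per 1M input tokens, $0.90 per 1M output tokens.
INPUT_RATE = 0.30 / 1_000_000   # USD per input token
OUTPUT_RATE = 0.90 / 1_000_000  # USD per output token

def request_cost(input_tokens, output_tokens):
    return input_tokens * INPUT_RATE + output_tokens * OUTPUT_RATE

# Example: a 100K-token document summarized into 2K tokens, repeated 1,000 times
per_request = request_cost(100_000, 2_000)
print(f"per request: ${per_request:.4f}")                 # $0.0318
print(f"per 1,000 requests: ${per_request * 1000:.2f}")   # $31.80
```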
Compared to large vision and text-first LLMs, GLM-4.6V is one of the most cost-efficient options for multimodal reasoning at scale. Below is a comparative snapshot of prices from different providers:
USD per 1 million tokens, sorted from lowest to highest total cost

| Model | Input | Output | Total |
|---|---|---|---|
| Qwen3 Turbo | $0.05 | $0.20 | $0.25 |
| ERNIE 4.5 Turbo | $0.11 | $0.45 | $0.56 |
| Grok 4.1 Fast (reasoning) | $0.20 | $0.50 | $0.70 |
| Grok 4.1 Fast (non-reasoning) | $0.20 | $0.50 | $0.70 |
| DeepSeek Chat (V3.2-Exp) | $0.28 | $0.42 | $0.70 |
| DeepSeek Reasoner (V3.2-Exp) | $0.28 | $0.42 | $0.70 |
| GLM‑4.6V | $0.30 | $0.90 | $1.20 |
| Qwen3 Plus | $0.40 | $1.20 | $1.60 |
| ERNIE 5.0 | $0.85 | $3.40 | $4.25 |
| Qwen-Max | $1.60 | $6.40 | $8.00 |
| GPT-5.1 | $1.25 | $10.00 | $11.25 |
| Gemini 2.5 Pro (≤200K) | $1.25 | $10.00 | $11.25 |
| Gemini 3 Pro (≤200K) | $2.00 | $12.00 | $14.00 |
| Gemini 2.5 Pro (>200K) | $2.50 | $15.00 | $17.50 |
| Grok 4 (0709) | $3.00 | $15.00 | $18.00 |
| Gemini 3 Pro (>200K) | $4.00 | $18.00 | $22.00 |
| Claude Opus 4.1 | $15.00 | $75.00 | $90.00 |
Previous releases: GLM‑4.5 series and Enterprise applications
Before GLM-4.6V, Z.ai released the GLM-4.5 family in mid-2025, making the company a serious competitor in the field of open-source LLM development.
The flagship GLM‑4.5 and its smaller sibling GLM‑4.5‑Air both support reasoning, tooling, coding, and agent behavior, while delivering strong performance in standard benchmarks.
The models introduced dual reasoning modes (“thinking” and “non-thinking”) and could automatically generate entire PowerPoint presentations from a single prompt – a feature suitable for use in corporate reporting, education and internal communications workflows. Z.ai has also expanded the GLM-4.5 series with additional variants such as GLM-4.5-X, AirX and Flash, aimed at ultra-fast inference and low-cost scenarios.
Together, these features position the GLM‑4.5 series as a cost-effective, open, and production-ready alternative for enterprises needing autonomy across model deployment, lifecycle management, and integration pipelines.
Implications for the ecosystem
The GLM-4.6V release represents a notable advance in open-source multimodal AI. Although many vision language models have emerged in the past year, few offer:
- Integrated visual tool use
- Structured multimodal generation
- Agent-oriented memory and decision logic
Zhipu AI’s emphasis on “closing the loop” from perception to action via native function calling marks a step toward agentic multimodal systems.
The model’s architecture and training pipeline demonstrate a continued evolution of the GLM family, positioning it competitively alongside offerings such as OpenAI’s GPT-4V and Google DeepMind’s Gemini-VL.
Takeaway for business leaders
With GLM-4.6V, Zhipu AI introduces an open-source VLM capable of native visual tool use, long-context reasoning, and frontend automation. It sets a new performance standard among similarly sized models and provides a scalable platform for building agentic, multimodal AI systems.




