Z.ai debuts open-source GLM-4.6V, a native tool-calling vision model for multimodal reasoning


Chinese AI startup Zhipu AI, also known as Z.ai, has released its GLM-4.6V series, a new generation of open-source vision language models (VLMs) optimized for multimodal reasoning, frontend automation, and highly efficient deployment.
The release includes two models in “large” and “small” sizes:
- GLM-4.6V (106B): a larger, 106-billion-parameter model aimed at cloud-scale inference
- GLM-4.6V-Flash (9B): a smaller model with only 9 billion parameters, designed for low-latency local applications
As a rule, models with more parameters (the internal settings, i.e. weights and biases, that determine a model's behavior) are more capable and perform at a higher overall level across more varied tasks.
However, smaller models can provide better efficiency for edge or real-time applications where latency and resource constraints are critical.
The defining innovation in this series is the introduction of native function calling in a vision language model, allowing direct use of tools such as search, cropping, or map recognition with visual input.
With a context length of 128,000 tokens (equivalent to the text of a 300-page novel exchanged in a single input/output interaction with the user) and state-of-the-art (SoTA) results across more than 20 benchmarks, the GLM-4.6V series positions itself as a highly competitive alternative to both closed and open-source VLMs. It is available in the following formats:
- API access via an OpenAI-compatible interface (a usage sketch follows this list)
- A demo on Zhipu’s web interface
- Downloadable weights on Hugging Face
- A desktop assistant app available via Hugging Face Spaces
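Because the API is OpenAI-compatible, existing client code carries over with minimal changes. Below is a minimal sketch of a multimodal request; the base URL, model id, and image URL are illustrative assumptions, so check Z.ai's API documentation for the exact values.

```python
# Minimal sketch of calling GLM-4.6V through an OpenAI-compatible endpoint.
# The base_url and model id are assumptions, not confirmed values.
from openai import OpenAI

client = OpenAI(
    api_key="YOUR_ZAI_API_KEY",                       # assumption: key issued by the Z.ai console
    base_url="https://open.bigmodel.cn/api/paas/v4/", # assumption: OpenAI-compatible endpoint
)

response = client.chat.completions.create(
    model="glm-4.6v",  # assumption: exact model id may differ
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "Describe the chart in this image."},
                {"type": "image_url", "image_url": {"url": "https://example.com/chart.png"}},
            ],
        }
    ],
)
print(response.choices[0].message.content)
```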
Licensing and Business Use
GLM‑4.6V and GLM‑4.6V‑Flash are distributed under the MIT license, a permissive open-source license that allows free commercial and non-commercial use, modification, redistribution, and local deployment without any obligation to open-source derivative works.
This licensing model makes the series suitable for enterprise adoption, including scenarios that require full control of infrastructure, internal governance compliance or air-gapped environments.
Model weights and documentation are publicly hosted on Hugging Face, with supporting code and tooling available on GitHub.
The MIT license guarantees maximum flexibility for integration into proprietary systems, including internal tools, production pipelines, and edge deployments.
Architecture and technical capabilities
The GLM-4.6V models follow a conventional encoder-decoder architecture with significant adjustments for multimodal input.
Both models feature a Vision Transformer (ViT) encoder, based on AIMv2-Huge, and an MLP projector to align visual features with a Large Language Model (LLM) decoder.
Video input benefits from 3D convolutions and temporal compression, while spatial encoding is handled using 2D RoPE and bicubic interpolation of absolute positional embeddings.
A key technical feature is the system’s support for arbitrary image resolutions and aspect ratios, including wide panoramic inputs up to 200:1.
In addition to parsing static images and documents, GLM-4.6V can ingest temporal sequences of video frames with explicit timestamp tokens, enabling robust temporal reasoning.
On the decoding side, the model supports token generation aligned with function calling protocols, allowing structured reasoning about text, image, and tool output. This is supported by an extensive tokenizer vocabulary and output formatting templates to ensure consistent API or agent compatibility.
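The high-level data flow described above can be illustrated with a toy sketch. The module sizes, layer counts, and the stand-in transformer blocks below are placeholders for illustration only; the actual GLM-4.6V uses an AIMv2-Huge ViT encoder, 2D RoPE, and a GLM decoder.

```python
# Simplified illustration of the encoder -> MLP projector -> decoder flow.
# Not the GLM-4.6V implementation; dimensions and modules are placeholders.
import torch
import torch.nn as nn

class ToyVisionLanguageModel(nn.Module):
    def __init__(self, vision_dim=256, text_dim=512, vocab_size=32000):
        super().__init__()
        # "ViT" stand-in: patchify the image and run a small transformer encoder
        self.patch_embed = nn.Conv2d(3, vision_dim, kernel_size=16, stride=16)
        self.vision_encoder = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model=vision_dim, nhead=8, batch_first=True),
            num_layers=2,
        )
        # MLP projector: align visual features with the language model's hidden size
        self.projector = nn.Sequential(
            nn.Linear(vision_dim, text_dim), nn.GELU(), nn.Linear(text_dim, text_dim)
        )
        # "LLM decoder" stand-in: token embeddings, a small transformer, and an LM head
        self.token_embed = nn.Embedding(vocab_size, text_dim)
        self.decoder = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model=text_dim, nhead=8, batch_first=True),
            num_layers=2,
        )
        self.lm_head = nn.Linear(text_dim, vocab_size)

    def forward(self, image, input_ids):
        # image: (B, 3, H, W) -> visual tokens (B, N, vision_dim)
        patches = self.patch_embed(image).flatten(2).transpose(1, 2)
        visual_tokens = self.projector(self.vision_encoder(patches))
        text_tokens = self.token_embed(input_ids)
        # Prepend projected visual tokens to the text sequence, then decode
        sequence = torch.cat([visual_tokens, text_tokens], dim=1)
        return self.lm_head(self.decoder(sequence))

model = ToyVisionLanguageModel()
logits = model(torch.randn(1, 3, 224, 224), torch.randint(0, 32000, (1, 16)))
print(logits.shape)  # (1, num_visual_tokens + 16, vocab_size)
```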
Native multimodal tool use
GLM-4.6V introduces native multimodal function calls, allowing visual assets, such as screenshots, images, and documents, to be passed directly to tools as parameters. This eliminates the need for intermediate text-only conversions, which historically involved information loss and complexity.
The tool-calling mechanism works bidirectionally:
- Input tools allow images or videos to be passed directly (for example, document pages to crop or analyze).
- Output tools, such as graph renderers or web snapshot utilities, return visual data, which GLM-4.6V integrates directly into the reasoning chain.
In practice, this means GLM-4.6V can perform tasks such as the following (a request sketch appears after this list):
- Generate structured reports from documents in different formats
- Perform visual audits of candidate images
- Automatically crop figures from papers during generation
- Perform visual web searches and answer multimodal queries
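As a shape of the interaction, the sketch below passes an image alongside a tool definition in the standard OpenAI-compatible format. The "crop_image" tool, the endpoint, and the model id are hypothetical; GLM-4.6V's exact tool schema may differ.

```python
# Illustrative multimodal tool-calling request. The tool, endpoint, and model id
# are assumptions used only to show the structure of such a call.
from openai import OpenAI

client = OpenAI(api_key="YOUR_ZAI_API_KEY",
                base_url="https://open.bigmodel.cn/api/paas/v4/")  # assumption

tools = [{
    "type": "function",
    "function": {
        "name": "crop_image",  # hypothetical tool the model may call with pixel coordinates
        "description": "Crop a region of the supplied document page for closer inspection.",
        "parameters": {
            "type": "object",
            "properties": {
                "x": {"type": "integer"}, "y": {"type": "integer"},
                "width": {"type": "integer"}, "height": {"type": "integer"},
            },
            "required": ["x", "y", "width", "height"],
        },
    },
}]

response = client.chat.completions.create(
    model="glm-4.6v",  # assumption
    tools=tools,
    messages=[{
        "role": "user",
        "content": [
            {"type": "text", "text": "Extract the revenue figure from this scanned page."},
            {"type": "image_url", "image_url": {"url": "https://example.com/page-17.png"}},
        ],
    }],
)
# If the model decides to call the tool, the arguments arrive as structured JSON.
print(response.choices[0].message.tool_calls)
```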
High benchmark performance compared to other models of similar size
GLM-4.6V was evaluated against more than twenty public benchmarks, including general VQA, diagram understanding, OCR, STEM reasoning, frontend replication, and multimodal agents.
According to the benchmark chart released by Zhipu AI:
- GLM-4.6V (106B) achieves SoTA or near-SoTA scores among similarly sized open-source models on MMBench, MathVista, MMLongBench, ChartQAPro, RefCOCO, TreeBench, and more.
- GLM-4.6V-Flash (9B) outperforms other lightweight models (e.g. Qwen3-VL-8B, GLM-4.1V-9B) in almost all categories tested.
- The 128K-token window of the 106B model allows it to outperform larger models such as Step-3 (321B) and Qwen3-VL-235B on long-context document tasks, video summarization, and structured multimodal reasoning.
Example scores from the leaderboard include:
- MathVista: 88.2 (GLM-4.6V) vs. 84.6 (GLM-4.5V) vs. 81.4 (Qwen3-VL-8B)
- WebVoyager: 81.0 vs. 68.4 (Qwen3-VL-8B)
- Ref-L4 test: 88.9 vs. 89.5 (GLM-4.5V), but with better grounding fidelity at 87.7 (Flash) vs. 86.8
Both models were evaluated using the vLLM inference backend and support SGLang for video-based tasks.
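For teams that want to reproduce this setup locally, the sketch below loads the model through vLLM's offline inference API. The Hugging Face repo id, context-length cap, and prompt are illustrative assumptions; consult the model card for the exact repo name and any image-placeholder conventions required for multimodal prompts.

```python
# Minimal local-inference sketch with vLLM (the backend the article says was used
# for evaluation). Repo id and settings are assumptions, not confirmed values.
from vllm import LLM, SamplingParams

llm = LLM(
    model="zai-org/GLM-4.6V-Flash",  # assumption: actual Hugging Face repo id may differ
    trust_remote_code=True,
    max_model_len=32768,             # trim the 128K window to fit smaller GPUs
)

params = SamplingParams(temperature=0.2, max_tokens=512)
outputs = llm.generate(["Summarize the key claims of this press release."], params)
print(outputs[0].outputs[0].text)
```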
Frontend automation and long context workflows
Zhipu AI highlighted GLM-4.6V’s ability to support frontend development workflows. The model can:
- Replicate pixel-accurate HTML/CSS/JS from UI screenshots
- Accept natural-language editing commands to change layouts
- Visually identify and manipulate specific UI components
This capability is integrated into an end-to-end visual programming interface, where the model iterates on layout, design intent, and output code using its native understanding of screenshots.
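A screenshot-to-code loop of this kind can be approximated over the same OpenAI-compatible client shown earlier. The model id, endpoint, and file path below are illustrative assumptions rather than a documented workflow.

```python
# Sketch of a screenshot-to-code loop: generate HTML from a screenshot, then
# apply a natural-language edit in a follow-up turn. Endpoint, model id, and
# file path are assumptions.
import base64
from openai import OpenAI

client = OpenAI(api_key="YOUR_ZAI_API_KEY",
                base_url="https://open.bigmodel.cn/api/paas/v4/")  # assumption

with open("dashboard.png", "rb") as f:  # hypothetical UI screenshot
    screenshot = base64.b64encode(f.read()).decode()

messages = [{
    "role": "user",
    "content": [
        {"type": "text", "text": "Replicate this screen as a single HTML file with inline CSS."},
        {"type": "image_url", "image_url": {"url": f"data:image/png;base64,{screenshot}"}},
    ],
}]
first = client.chat.completions.create(model="glm-4.6v", messages=messages)  # assumption: model id

# Follow up with a natural-language edit, keeping the generated code in context.
messages += [
    {"role": "assistant", "content": first.choices[0].message.content},
    {"role": "user", "content": "Move the sidebar to the right and use a dark theme."},
]
second = client.chat.completions.create(model="glm-4.6v", messages=messages)
print(second.choices[0].message.content)
```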
In long-document scenarios, GLM-4.6V can process up to 128,000 tokens in a single inference, enough to handle:
- 150 pages of text (as input)
- a 200-slide deck
- one hour of video
Zhipu AI reported successful use of the model in financial analysis of multi-document corpora and in summarizing entire sports broadcasts with timestamp event detection.
Training and reinforcement learning
The model was trained using multi-stage pre-training, followed by supervised fine-tuning (SFT) and reinforcement learning (RL). Key innovations include:
- Reinforcement Learning with Curriculum Sampling (RLCS): dynamically adjusts the difficulty of training examples based on model progress
- Multi-domain reward systems: task-specific verifiers for STEM, diagram reasoning, GUI agents, video QA, and spatial grounding
- Function-aware training: uses structured tags (e.g. <|begin_of_box|>) to align the reasoning and answer format
The reinforcement learning pipeline emphasizes reinforcement learning with verifiable rewards (RLVR) over reinforcement learning from human feedback (RLHF) for scalability, and avoids KL/entropy losses to stabilize training in multimodal domains.
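The curriculum-sampling idea can be illustrated with a toy sampler that favors examples near the model's current ability rather than ones it already solves or cannot yet solve. This is a generic sketch of the concept, not Zhipu AI's pipeline; the weighting function and task names are invented for illustration.

```python
# Toy curriculum sampling: weight examples by how close the model's recent
# success rate is to a target difficulty, then sample a training batch.
import math
import random

def curriculum_weights(success_rates, target=0.5, sharpness=8.0):
    """Highest weight for examples whose recent solve rate is near `target`."""
    return [math.exp(-sharpness * (r - target) ** 2) for r in success_rates]

# success_rates[i] = fraction of recent rollouts in which the model solved example i
examples = ["easy_vqa", "chart_reasoning", "gui_navigation", "olympiad_geometry"]
success_rates = [0.95, 0.60, 0.45, 0.05]

weights = curriculum_weights(success_rates)
batch = random.choices(examples, weights=weights, k=4)
print(batch)  # mid-difficulty tasks dominate the sampled batch
```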
Pricing (API)
Zhipu AI offers competitive pricing for the GLM-4.6V series, with both the flagship model and the lightweight variant positioned for high accessibility; a quick cost estimate follows the price list below.
- GLM-4.6V: $0.30 (input) / $0.90 (output) per 1 million tokens
- GLM-4.6V-Flash: free
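Using the listed GLM-4.6V rates, a back-of-the-envelope estimate shows what a typical long-document workload costs; the workload sizes below are illustrative assumptions.

```python
# Cost estimate using the listed GLM-4.6V rates:
# $0.30 per 1M input tokens, $0.90 per 1M output tokens.
INPUT_RATE = 0.30 / 1_000_000   # USD per input token
OUTPUT_RATE = 0.90 / 1_000_000  # USD per output token

def request_cost(input_tokens, output_tokens):
    return input_tokens * INPUT_RATE + output_tokens * OUTPUT_RATE

# Example: a 100K-token document summarized into 2K tokens, repeated 1,000 times
per_request = request_cost(100_000, 2_000)
print(f"per request: ${per_request:.4f}")                 # $0.0318
print(f"per 1,000 requests: ${per_request * 1000:.2f}")   # $31.80
```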
Compared to large vision and text-first LLMs, GLM-4.6V is one of the most cost-efficient options for multimodal reasoning at scale. Below is a comparative snapshot of prices from different providers:
USD per 1 million tokens, sorted from lowest to highest total cost

| Model | Input | Output | Total |
|---|---|---|---|
| Qwen3 Turbo | $0.05 | $0.20 | $0.25 |
| ERNIE 4.5 Turbo | $0.11 | $0.45 | $0.56 |
| Grok 4.1 Fast (reasoning) | $0.20 | $0.50 | $0.70 |
| Grok 4.1 Fast (non-reasoning) | $0.20 | $0.50 | $0.70 |
| DeepSeek Chat (V3.2-Exp) | $0.28 | $0.42 | $0.70 |
| DeepSeek Reasoner (V3.2-Exp) | $0.28 | $0.42 | $0.70 |
| GLM‑4.6V | $0.30 | $0.90 | $1.20 |
| Qwen3 Plus | $0.40 | $1.20 | $1.60 |
| ERNIE 5.0 | $0.85 | $3.40 | $4.25 |
| Qwen-Max | $1.60 | $6.40 | $8.00 |
| GPT-5.1 | $1.25 | $10.00 | $11.25 |
| Gemini 2.5 Pro (≤200K) | $1.25 | $10.00 | $11.25 |
| Gemini 3 Pro (≤200K) | $2.00 | $12.00 | $14.00 |
| Gemini 2.5 Pro (>200K) | $2.50 | $15.00 | $17.50 |
| Grok 4 (0709) | $3.00 | $15.00 | $18.00 |
| Gemini 3 Pro (>200K) | $4.00 | $18.00 | $22.00 |
| Claude Opus 4.1 | $15.00 | $75.00 | $90.00 |
Previous releases: GLM‑4.5 series and Enterprise applications
Before GLM-4.6V, Z.ai released the GLM-4.5 family in mid-2025, making the company a serious competitor in the field of open-source LLM development.
The flagship GLM‑4.5 and its smaller sibling GLM‑4.5‑Air both support reasoning, tooling, coding, and agent behavior, while delivering strong performance in standard benchmarks.
The models introduced dual reasoning modes (“thinking” and “non-thinking”) and could automatically generate entire PowerPoint presentations from a single prompt – a feature suitable for use in corporate reporting, education and internal communications workflows. Z.ai has also expanded the GLM-4.5 series with additional variants such as GLM-4.5-X, AirX and Flash, aimed at ultra-fast inference and low-cost scenarios.
Together, these features position the GLM‑4.5 series as a cost-effective, open, and production-ready alternative for enterprises needing autonomy across model deployment, lifecycle management, and integration pipelines.
Implications for the ecosystem
The GLM-4.6V release represents a notable advance in open-source multimodal AI. Although many vision language models have emerged in the past year, few offer:
- Integrated visual tool use
- Structured multimodal generation
- Agent-oriented memory and decision logic
Zhipu AI’s emphasis on “closing the loop” from perception to action via native function calling marks a step toward agentic multimodal systems.
The model’s architecture and training pipeline demonstrate a continued evolution of the GLM family, positioning it competitively alongside offerings such as OpenAI’s GPT-4V and Google DeepMind’s Gemini-VL.
Takeaway for business leaders
With GLM-4.6V, Zhipu AI introduces an open-source VLM capable of native visual tool use, long-context reasoning, and frontend automation. It sets a new performance standard among similarly sized models and provides a scalable platform for building agentic, multimodal AI systems.




