Nvidia researchers unlock 4-bit LLM training that matches 8-bit performance

Researchers at Nvidia have developed a new approach to train large language models (LLMs) in 4-bit quantized format while maintaining stability and accuracy on par with high-precision models. Their technique, NVFP4, makes it possible to train models that not only outperform other leading 4-bit formats, but also match the performance of the larger 8-bit FP8 format, while using half the memory and a fraction of the processing power.

The success of NVFP4 shows that companies can continue to reduce inference costs by using leaner models that match the performance of larger ones. It also points to a future where the cost of training LLMs drops to a point where many more organizations can train their own custom models from scratch rather than merely fine-tuning existing ones.

The quantization challenge

Model quantization is a technique used to reduce the computational and memory costs of running and training AI models. It works by converting the model’s parameters or weights from high-precision formats, such as 16- and 32-bit floating point (BF16 and FP32), to lower-precision formats. The main challenge of quantization is to reduce the size of the model while retaining as much of its knowledge and capabilities as possible.
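As a generic illustration of the idea (not Nvidia's specific recipe), quantization can be sketched as mapping float weights onto a small integer grid through a single scale factor, then recovering approximate values on the way back:

```python
import numpy as np

def quantize_symmetric(w, n_bits):
    # One scale factor for the whole tensor; values are rounded
    # onto a symmetric signed integer grid and clipped to its range.
    qmax = 2 ** (n_bits - 1) - 1          # 7 for 4-bit signed integers
    scale = np.abs(w).max() / qmax
    q = np.clip(np.round(w / scale), -qmax, qmax)
    return q.astype(np.int8), scale

def dequantize(q, scale):
    # Recover approximate float values from the integer codes.
    return q.astype(np.float32) * scale

w = np.array([0.12, -0.5, 0.31, 0.02], dtype=np.float32)
q, scale = quantize_symmetric(w, n_bits=4)
w_hat = dequantize(q, scale)
```

The reconstruction error is bounded by half a grid step, which is why fewer bits (a coarser grid) make preserving the model's behavior harder.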

In recent years, 8-bit floating point (FP8) formats have become a popular industry standard, offering a good balance between performance and efficiency. They significantly reduce the computational cost and memory demand for LLM training without a major decrease in accuracy.

The next logical step is 4-bit floating point (FP4), which promises to halve memory usage again and further improve performance on high-end hardware. However, this transition was a challenge. Existing 4-bit formats, such as MXFP4, often struggle to maintain the same level of accuracy as their 8-bit counterparts, necessitating a difficult trade-off between cost and performance.

How NVFP4 works

NVFP4 overcomes the stability and accuracy challenges of other FP4 techniques through smarter design and targeted training methodology. A major problem with 4-bit precision is its extremely limited range: it can only represent 16 different values. When converting from a high-precision format, outliers can distort the entire data set, affecting the accuracy of the model. NVFP4 uses a more advanced multi-level scaling approach that better handles these outliers, allowing for “more precise and accurate representation of tensor values during training,” according to Nvidia.
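Public descriptions of NVFP4 pair a fine-grained scale for each small block of values with a second, per-tensor scale. The toy sketch below (block size 4 and simplified round-trip scaling are illustrative choices, not Nvidia's exact parameters) shows why finer-grained scaling blunts the effect of outliers: with one scale for the whole tensor, a single large value crushes all the small ones to zero, while per-block scales confine the damage to the outlier's own block.

```python
import numpy as np

# The 8 non-negative magnitudes representable in FP4 (E2M1); with a
# sign bit this yields the format's 16 values.
FP4_GRID = np.array([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0])

def round_to_grid(v):
    # Round each magnitude to the nearest representable FP4 value.
    return FP4_GRID[np.abs(v[:, None] - FP4_GRID[None, :]).argmin(axis=1)]

def fake_quantize(x, block):
    # Scale each block so its largest magnitude lands on the top of the
    # grid, round, then scale back ("fake" quantization for illustration).
    blocks = np.abs(x).reshape(-1, block)
    scales = blocks.max(axis=1, keepdims=True) / FP4_GRID[-1]
    q = round_to_grid((blocks / scales).ravel()).reshape(blocks.shape)
    return (np.sign(x).reshape(blocks.shape) * q * scales).ravel()

# Four small weights followed by four large ones (an "outlier" region).
x = np.array([0.10, 0.20, 0.15, 0.12, 8.0, 7.0, 6.0, 5.0])

coarse = fake_quantize(x, block=8)  # one scale for the whole tensor
fine = fake_quantize(x, block=4)    # outliers confined to their block
```

With the single coarse scale, all four small weights round to zero; with per-block scales they survive with only a few percent of error.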

In addition to the format, the researchers introduce a 4-bit training recipe that achieves an accuracy comparable to FP8. A central component is their ‘mixed precision strategy’. Instead of converting the entire model to NVFP4, the majority of the layers are quantized, while a small portion of the numerically sensitive layers are kept in a higher precision format such as BF16. This maintains stability where it matters most. The methodology also adjusts the way gradients are calculated during backpropagation (or the model learning phase) to reduce biases that can accumulate from low-precision arithmetic.
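The article does not spell out the paper's exact gradient adjustments, but a standard way to remove the bias that deterministic rounding injects into low-precision gradients is stochastic rounding: round up or down with probability proportional to proximity, so the expected result equals the original value. A minimal sketch of that generic technique (not necessarily Nvidia's exact method):

```python
import numpy as np

rng = np.random.default_rng(0)

def stochastic_round(x):
    # Round to a neighboring integer with probability proportional to
    # proximity, so E[stochastic_round(x)] == x (unbiased in expectation).
    floor = np.floor(x)
    frac = x - floor
    return floor + (rng.random(x.shape) < frac)

# Gradient values sitting between grid points: deterministic rounding
# always snaps 0.3 to 0, so the mean drifts; stochastic rounding does not.
g = np.full(100_000, 0.3)
det_mean = np.round(g).mean()            # biased: collapses to 0
sto_mean = stochastic_round(g).mean()    # unbiased: stays near 0.3
```

Over millions of optimizer steps, that small per-step bias is exactly the kind of accumulation a low-precision training recipe has to suppress.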

NVFP4 in practice

To test their approach, the Nvidia team trained a high-performance 12-billion-parameter hybrid Mamba-Transformer model on a massive 10 trillion tokens. They then compared its performance directly against a baseline model trained in the widely popular FP8 format. The results showed that the training loss and downstream task accuracy of the NVFP4 model closely tracked the FP8 version throughout the process.

Performance held up across a wide range of domains, including knowledge-intensive reasoning, mathematics, and common-sense tasks, with only a slight decline on coding benchmarks late in training.

“This is, to our knowledge, the first successful demonstration of training billion-parameter language models with 4-bit precision over a multi-trillion-token horizon, laying the foundation for faster and more efficient training of future frontier models,” the researchers write.

According to Shar Narasimhan, Nvidia’s director of product for AI and data center GPUs, NVFP4’s 4-bit precision format allows real-world developers and enterprises to train and deploy AI models with nearly the same accuracy as traditional 8-bit formats.

“By training model weights directly in 4-bit format while maintaining accuracy, developers can experiment with new architectures, iterate faster, and discover insights without being hampered by limited resources,” he told VentureBeat.

In contrast, FP8 (although already a leap forward over FP16) still imposes limitations on model size and inference performance due to higher memory and bandwidth requirements. “NVFP4 breaks through that ceiling and offers equivalent quality with significantly more room for growth and experimentation,” said Narasimhan.

Compared to the alternative 4-bit format, MXFP4, the advantages of NVFP4 become even more apparent. In an experiment with a model with 8 billion parameters, NVFP4 converged to a better loss score than MXFP4. To achieve the same level of performance as the NVFP4 model, the MXFP4 model had to be trained on 36% more data, which represents a significant increase in training time and costs.

In addition to making pretraining more efficient, NVFP4 also redefines what is possible. “Demonstrating that 4-bit precision can maintain model quality at scale opens the door to a future where highly specialized models can be trained from scratch by mid-market enterprises or startups, and not just by hyperscalers,” Narasimhan said, adding that over time we can expect a shift from developing general-purpose LLMs to “a diverse ecosystem of custom, high-performance models built by a broader range of innovators.”

Beyond pretraining

Although the paper focuses on the benefits of NVFP4 during pretraining, its impact also extends to inference.

“Models trained on NVFP4 can not only deliver faster inference and higher throughput, but also shorten the time it takes for AI factories to achieve ROI – accelerating the cycle from model development to real-world deployment,” said Narasimhan.

Because these models are smaller and more efficient, they unlock new possibilities for delivering complex, high-performance responses in real time, even in token-intensive, agentic applications, without increasing energy and computational costs.

Narasimhan said he looks forward to a future of model efficiency that is not just about lowering precision, but about building smarter systems.

“There are many opportunities to extend research to lower precisions and to adapt architectures to address the components that increasingly dominate computing power in large-scale models,” he said. “These areas are rich with opportunity, especially as we move toward agentic systems that require high throughput, low latency, and adaptive reasoning. NVFP4 proves that precision can be optimized without compromising quality, and it sets the stage for a new era of intelligent, efficient AI design.”
