Microsoft’s Inference Framework Brings 1-Bit Large Language Models to Local Devices
On October 17, 2024, Microsoft announced BitNet.cpp, an inference framework designed to run 1-bit quantized large language models (LLMs). BitNet.cpp is a significant advancement in generative AI, enabling the efficient deployment of 1-bit LLMs on commodity CPUs without the need for expensive GPUs. This development democratizes access to LLMs, making them available on a wide range of devices and opening up new possibilities in on-device AI applications.
Understanding 1-bit large language models
Large language models (LLMs) traditionally require significant computational resources due to their use of high-precision floating-point numbers (typically FP16 or BF16) for model weights. This necessity has made the use of LLMs expensive and energy-intensive.
At their core, 1-bit LLMs use extreme quantization techniques to represent model weights with only three possible values: -1, 0, and 1. This is where the term "1.58-bit" comes from: encoding three states takes slightly more than one bit, since log2(3) ≈ 1.58.
Ternary weight system
The Concept
The 1-bit quantization in BitNet.cpp is a ternary weight system. BitNet works with only three possible values for each parameter:
- -1 (negative)
- 0 (neutral)
- 1 (positive)
This results in a storage requirement of approximately 1.58 bits per parameter, hence the name BitNet b1.58. This drastic reduction in parameter bitwidth leads to an impressive reduction in memory usage and computational complexity, as most floating point multiplications are replaced by simple additions and subtractions.
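To make the "additions and subtractions" point concrete, here is a minimal sketch in plain Python. The function name `ternary_matvec` and the sample values are illustrative assumptions, not part of BitNet.cpp; real kernels work on packed weights with SIMD instructions rather than Python lists.

```python
import math

# Information needed to distinguish three states, in bits (about 1.585).
bits_per_ternary_weight = math.log2(3)

def ternary_matvec(weights, x):
    """Multiply a ternary matrix (entries in {-1, 0, 1}) by a vector
    using only additions and subtractions (hypothetical helper)."""
    out = []
    for row in weights:
        acc = 0.0
        for w, xi in zip(row, x):
            if w == 1:
                acc += xi   # +1 weight: add the input
            elif w == -1:
                acc -= xi   # -1 weight: subtract the input
            # 0 weight: the input is skipped entirely
        out.append(acc)
    return out

W = [[1, 0, -1],
     [-1, 1, 1]]
x = [0.5, 2.0, -1.5]
y = ternary_matvec(W, x)
print(y)
```

No multiplication appears in the inner loop, and zero weights cost nothing, which is where the computational savings come from.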
Mathematical Foundation
1-bit quantization involves transforming weights and activations into their ternary representation via the following steps:
1. Weight binarization
Binarizing the weights means centering them around their mean (α) and taking the sign of each entry, which maps every weight to +1 or -1. The transformation is expressed mathematically as:
$$W_f = \operatorname{Sign}(W - \alpha)$$
Where:
- W is the original weight matrix.
- α is the average of the weights.
- Sign(x) returns +1 if x > 0 and -1 otherwise.
2. Quantization of activation
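Quantizing the activations restricts the inputs to a specified bit width:
$$\widetilde{X} = \operatorname{Quant}(X) = \operatorname{Clip}\left(X \times \frac{Q_b}{\gamma},\; -Q_b + \epsilon,\; Q_b - \epsilon\right)$$
Where:
- $Q_b = 2^{b-1}$ is the maximum quantization level for a b-bit width.
- γ is the maximum absolute value of X (written as $\lVert X \rVert_\infty$).
- ε is a small number that prevents overflow during clipping.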
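The definitions above suggest an absmax scheme: scale the input by Q_b/γ, then clip to just inside [-Q_b, Q_b]. The following is a minimal sketch under that assumption. The function name, the default bit width of 8, and the value of `eps` are illustrative choices, not values taken from BitNet.cpp.

```python
def quantize_activations(x, b=8, eps=1e-5):
    """Absmax quantization of a list of activations (hypothetical helper).

    Scales by Q_b / gamma and clips to [-Q_b + eps, Q_b - eps],
    where Q_b = 2**(b - 1) and gamma is the largest absolute value in x.
    """
    q_b = 2 ** (b - 1)
    gamma = max(abs(v) for v in x)      # infinity norm of x
    if gamma == 0:
        return [0.0 for _ in x], gamma  # all-zero input: nothing to scale
    lo, hi = -q_b + eps, q_b - eps
    scaled = [v * q_b / gamma for v in x]
    return [max(lo, min(hi, s)) for s in scaled], gamma

x = [0.5, -2.0, 1.0]
x_q, gamma = quantize_activations(x, b=8)
print(x_q, gamma)
```

The largest-magnitude entry lands at the edge of the range and is pulled inside by ε, while the other entries keep their relative proportions.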
3. BitLinear operation
The BitLinear layer replaces traditional matrix multiplications with a simplified operation:
$$y = W_f \times \widetilde{X} \times \frac{\beta \gamma}{Q_b}$$
Where:
- $\widetilde{X}$ is the quantized activation matrix.
- β is a scaling factor used to minimize approximation errors.
- γ scales the activations.
- $Q_b$ is the quantization factor.
This transformation enables efficient calculations while maintaining model performance.
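The three steps can be combined into a single forward pass. The sketch below is a plain-Python illustration, not BitNet.cpp's kernel. It assumes β is the mean absolute value of the weights (the source only calls it "a scaling factor"), and the function name and sample matrix are made up for the example.

```python
def bitlinear_forward(W, x, b=8, eps=1e-5):
    """Sketch of a BitLinear-style forward pass on Python lists."""
    n = sum(len(row) for row in W)
    alpha = sum(sum(row) for row in W) / n            # mean of the weights
    beta = sum(abs(w) for row in W for w in row) / n  # assumed: mean |W|
    # Step 1: binarize the centered weights to +1 / -1.
    W_f = [[1 if w - alpha > 0 else -1 for w in row] for row in W]
    # Step 2: absmax-quantize the activations.
    q_b = 2 ** (b - 1)
    gamma = max(abs(v) for v in x)
    if gamma == 0:
        return [0.0] * len(W)
    x_q = [max(-q_b + eps, min(q_b - eps, v * q_b / gamma)) for v in x]
    # Step 3: add/subtract quantized inputs, then rescale by beta*gamma/Q_b.
    scale = beta * gamma / q_b
    return [scale * sum(xq if w == 1 else -xq for w, xq in zip(row, x_q))
            for row in W_f]

W = [[0.9, -0.7],
     [-0.8, 1.0]]
x = [1.0, 0.5]
y = bitlinear_forward(W, x)
print(y)
```

For this input the full-precision product would be [0.55, -0.3]. The sketch returns roughly [0.425, -0.425]: the signs and rough magnitudes survive, and the remaining gap is the approximation error that the scaling factors are meant to limit.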
Performance implications
Memory efficiency
The ternary weight system significantly reduces memory requirements:
- Traditional LLMs: 16 bits per weight
- BitNet.cpp: 1.58 bits per weight
This reduction translates into memory savings of approximately 90% compared to traditional 16-bit models, allowing larger models to fit within the same hardware limitations.
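The arithmetic behind the 90% figure is easy to check. This sketch counts weight storage only and treats 1.58 bits per weight as an idealized figure; the 7B parameter count is a hypothetical example, and real storage formats may pack ternary weights differently.

```python
def weight_memory_gb(num_params, bits_per_weight):
    """Weight storage in gigabytes (10**9 bytes), ignoring all overhead."""
    return num_params * bits_per_weight / 8 / 1e9

params = 7e9  # hypothetical 7B-parameter model
fp16_gb = weight_memory_gb(params, 16)
ternary_gb = weight_memory_gb(params, 1.58)
savings = 1 - ternary_gb / fp16_gb

print(f"FP16: {fp16_gb:.2f} GB, ternary: {ternary_gb:.2f} GB, "
      f"savings: {savings:.1%}")
```

The ratio 1.58 / 16 is just under 0.1, so the savings do not depend on the model size chosen.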
1. Inference speed: faster on both CPUs
Inference speed is measured as the number of tokens processed per second. Here is an overview of the observations:
- On Apple M2 Ultra: BitNet.cpp reaches a maximum speedup of 5.07x over Llama.cpp for larger models (30B). Its peak speed is 593.43 tokens per second for a 125M model, a 1.37x speedup. For larger models such as the 3.8B and 7B, BitNet.cpp maintains more than 84.77 tokens per second, demonstrating efficiency across scales.
- On Intel i7-13700H: BitNet.cpp achieves even more dramatic speed improvements. At the 7B model size, it delivers a 5.68x speedup over Llama.cpp. For smaller models such as the 125M, it processes 389.08 tokens per second, which is 2.37x faster than Llama.cpp.
2. Energy efficiency: a game-changer for edge devices
The benchmark graphs also include energy cost comparisons, showing a significant reduction in energy consumption per token processed:
- On Apple M2 Ultra: The energy savings of BitNet.cpp are significant. For the 700M model it consumes 55.4% less energy per token than Llama.cpp, dropping from 0.314 to 0.140. This trend continues for larger models, with the 70B model showing a 70.0% reduction in energy consumption.
- On Intel i7-13700H: BitNet.cpp delivers 71.9% energy savings for the 700M model, where consumption dropped from 1.367 to 0.384. Although energy data for the 70B model in Llama.cpp is not available, BitNet.cpp remains efficient, with an energy consumption of 5.33 for the 70B model.
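The quoted percentages follow directly from the before/after figures. As a quick check, reading the Intel baseline as 1.367 (the helper name is an illustrative choice):

```python
def percent_reduction(before, after):
    """Relative reduction from `before` to `after`, in percent."""
    return (1 - after / before) * 100

m2_700m = percent_reduction(0.314, 0.140)  # Apple M2 Ultra, 700M model
i7_700m = percent_reduction(1.367, 0.384)  # Intel i7-13700H, 700M model

print(f"M2 Ultra: {m2_700m:.1f}%, i7-13700H: {i7_700m:.1f}%")
```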
3. Exceeding the human reading speed benchmark
One of the most interesting insights from these graphs is the reference to human reading speed, marked at 5-7 tokens per second. This red line shows that both implementations, especially BitNet.cpp, can comfortably exceed human reading speed even for the largest models:
- On Apple M2 Ultra, BitNet.cpp exceeds human reading speed for all model sizes, with the slowest speed being 8.67 tokens per second for a 70B model.
- On Intel i7-13700H, the 100B model still runs at 1.70 tokens per second, which falls short of the lower end of the human reading range, while all smaller models exceed this benchmark.
Considerations when training
Straight-Through Estimator (STE)
Because 1-bit quantization introduces non-differentiable functions, training involves a specialized technique known as the Straight-Through Estimator (STE). In this approach, the gradients flow unchanged through non-differentiable points. Here is a simplified implementation in Python:
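    from torch.autograd import Function

    class StraightThroughEstimator(Function):
        @staticmethod
        def forward(ctx, input):
            # Forward pass: binarize with the non-differentiable sign function.
            return input.sign()

        @staticmethod
        def backward(ctx, grad_output):
            # Backward pass: let the incoming gradient through unchanged.
            return grad_output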
Mixed precision training
To maintain stability during training, mixed precision is used:
- Weights and activations: Quantized to 1-bit precision.
- Gradients and optimizer states: Stored with higher precision.
- Latent weights: Maintained with high precision to enable accurate updates during training.
Strategy for high learning rates
A unique challenge with 1-bit models is that small updates may not affect the binarized weights. To mitigate this, the learning rate is increased, ensuring faster convergence and better optimization compared to traditional approaches.
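A toy example shows why small updates can leave a binarized weight untouched. The sketch below applies plain gradient descent to a single high-precision latent weight with a constant gradient and counts the steps until its sign flips. All names and values are illustrative assumptions, chosen as powers of two so the arithmetic is exact.

```python
def sign(v):
    """+1 if v > 0, otherwise -1 (matches the Sign definition above)."""
    return 1 if v > 0 else -1

def steps_until_flip(latent, grad, lr, max_steps=10_000):
    """Count gradient steps until the binarized weight changes sign."""
    start = sign(latent)
    for step in range(1, max_steps + 1):
        latent -= lr * grad          # update the high-precision latent weight
        if sign(latent) != start:    # the 1-bit weight only changes here
            return step
    return None

small_lr_steps = steps_until_flip(latent=0.5, grad=1.0, lr=1 / 32)
large_lr_steps = steps_until_flip(latent=0.5, grad=1.0, lr=1 / 8)
print(small_lr_steps, large_lr_steps)
```

With the smaller learning rate, the latent weight moves at every step but the binarized weight stays at +1 for four times as long. This is the effect a higher learning rate is meant to counteract.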
Group quantization and normalization
BitNet.cpp introduces group quantization and normalization to improve the parallelism of the model. Instead of calculating parameters for the entire weight matrix, BitNet divides weights and activations into multiple groups (G).
This grouping enables efficient parallel processing without additional communication between groups, enabling large-scale model training and inference.
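As a rough sketch of the idea, the function below splits a flat list of weights into G equal groups and binarizes each group around its own mean. The function name, the per-group scale β (assumed to be the mean absolute value), and the sample weights are illustrative, not taken from BitNet.cpp.

```python
def group_binarize(weights, num_groups):
    """Binarize each group of weights using only that group's statistics."""
    if len(weights) % num_groups != 0:
        raise ValueError("weights must divide evenly into groups")
    size = len(weights) // num_groups
    groups = []
    for g in range(num_groups):
        chunk = weights[g * size:(g + 1) * size]
        alpha = sum(chunk) / size                 # per-group mean
        beta = sum(abs(w) for w in chunk) / size  # per-group scale (assumed)
        signs = [1 if w - alpha > 0 else -1 for w in chunk]
        groups.append({"alpha": alpha, "beta": beta, "signs": signs})
    return groups

weights = [0.5, -0.5, 1.0, 3.0, 4.0, 2.0, 1.0, 5.0]
result = group_binarize(weights, num_groups=2)
for group in result:
    print(group)
```

Each iteration reads only its own slice, so the groups could be processed on different cores or devices without exchanging statistics.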
Implementation notes and optimizations
CPU optimization
BitNet.cpp uses several low-level optimizations to achieve maximum CPU performance:
- Vectorized operations: Uses SIMD instructions to perform bit manipulations efficiently.
- Cache-friendly memory access: Structures data to minimize cache misses.
- Parallel processing: Effectively distributes the workload across multiple CPU cores.
Here is an example of a key function that implements quantization and inference in BitNet:
Supported models
The current version of BitNet.cpp supports the following 1-bit LLMs available on Hugging Face:
- bitnet_b1_58-large (0.7B parameters)
- bitnet_b1_58-3B (3.3B parameters)
- Llama3-8B-1.58-100B-tokens (8.0B parameters)
These models are publicly available to demonstrate the inference capabilities of the framework. Although not officially trained or released by Microsoft, they illustrate the versatility of the framework.
Installation guide
Follow the steps below to get started with BitNet.cpp:
Requirements
- Python >=3.9
- CMake >=3.22
- Clang >=18
- Conda (highly recommended)
For Windows users, Visual Studio must be installed with the following components enabled:
- Desktop development with C++
- C++ CMake Tools for Windows
- Git for Windows
- C++ Clang Compiler for Windows
- MS-Build support for LLVM Toolset (Clang)
For Debian/Ubuntu users, an automatic installation script is available:
Step-by-step installation
- Clone the repository:
- Install dependencies:
- Build and prepare the project: You can download a model directly from Hugging Face and convert it to a quantized format:
You can also download and convert the model manually:
Performing inference with BitNet.cpp
To perform inference using the framework, use the following command:
Explanation:
- -m specifies the model file path.
- -p defines the prompt text.
- -n sets the number of tokens to predict.
- -temp adjusts the randomness of sampling (temperature) during inference.
Output example
Technical details of BitNet.cpp
BitLinear layer
BitNet.cpp implements a custom Transformer architecture, replacing standard matrix multiplications with BitLinear operations. This approach centers the weights around zero before quantization and scales them to reduce approximation errors. The key transformation function looks like this:

    import numpy as np

    # Binarization function for 1-bit weights
    def binarize_weights(W):
        alpha = W.mean()  # center the weights around their mean
        # +1 where the centered weight is positive, -1 otherwise
        W_binarized = np.where(W - alpha > 0, 1.0, -1.0)
        return W_binarized

The combination of centered weights and scaling minimizes quantization error, preserving performance.
Impact on the industry
BitNet.cpp can have far-reaching consequences for the use of LLMs:
- Accessibility: Allows LLMs to run on commodity devices, democratizing access to powerful AI.
- Cost efficiency: Reduces the need for expensive GPUs, lowering the barrier to adoption.
- Energy efficiency: Saves energy by using standard CPU-based inference.
- Innovation: Opens up new possibilities for on-device AI, such as real-time language translation, voice assistants and privacy-focused applications without cloud dependencies.
Challenges and future directions
Although 1-bit LLMs show promise, several challenges remain. These include developing robust 1-bit models for various tasks, optimizing hardware for 1-bit computations, and encouraging developers to adopt this new paradigm. Furthermore, exploring 1-bit quantization for computer vision or audio tasks represents an exciting future direction.
Conclusion
Microsoft's launch of BitNet.cpp marks significant progress. By enabling efficient 1-bit inference on commodity CPUs, BitNet.cpp improves the accessibility and sustainability of AI. The framework lays the foundation for more portable and cost-effective LLMs, expanding what is possible with on-device AI.