
Microsoft’s Inference Framework Brings 1-Bit Large Language Models to Local Devices

On October 17, 2024, Microsoft announced BitNet.cpp, an inference framework designed to run 1-bit quantized large language models (LLMs). BitNet.cpp is a significant advance in generative AI, enabling efficient inference of 1-bit LLMs on commodity CPUs without the need for expensive GPUs. This development democratizes access to LLMs, making them available on a wide range of devices and opening up new possibilities for on-device AI applications.

Understanding 1-bit large language models

Large language models (LLMs) traditionally require significant computational resources due to their use of high-precision floating-point numbers (typically FP16 or BF16) for model weights. This necessity has made the use of LLMs expensive and energy-intensive.

At their core, 1-bit LLMs use extreme quantization to represent each model weight with only three possible values: -1, 0, and 1. This is where the term ‘1.58-bit’ comes from, since encoding three states requires log2(3) ≈ 1.58 bits of information.

Ternary weight system

The Concept

The 1-bit quantization in BitNet.cpp is a ternary weight system. BitNet works with only three possible values for each parameter:

  • -1 (negative)
  • 0 (neutral)
  • 1 (positive)

This results in a storage requirement of approximately 1.58 bits per parameter, hence the name BitNet b1.58. This drastic reduction in parameter bitwidth leads to an impressive reduction in memory usage and computational complexity, as most floating point multiplications are replaced by simple additions and subtractions.
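As a quick sanity check on the 1.58-bit figure, here is a small, self-contained Python sketch:

import math

# Each weight takes one of three values, so the information content per weight is log2(3).
bits_per_weight = math.log2(3)
print(f"{bits_per_weight:.2f} bits per weight")        # 1.58 bits per weight

# Compared with 16-bit weights, that is roughly a 10x reduction in raw storage.
print(f"{16 / bits_per_weight:.1f}x fewer bits than FP16")   # 10.1x fewer bits than FP16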

Mathematical Foundation

1-bit quantization involves transforming weights and activations into their ternary representation via the following steps:

1. Weight binarization

Binarizing the weights means centralizing them around the mean (α), resulting in a ternary representation. The transformation is expressed mathematically as:

W_f = Sign(W − α)

Where:

  • W is the original weight matrix.
  • α is the average of the weights.
  • Sign(x) returns +1 if x > 0 and -1 otherwise.
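To make the transformation concrete, here is a tiny illustrative example (the weight values are made up):

import numpy as np

# Illustrative only: a small weight matrix, centered around its mean and reduced to signs.
W = np.array([[ 0.42, -0.13],
              [ 0.05, -0.81]])
alpha = W.mean()              # -0.1175
W_f = np.sign(W - alpha)      # [[ 1., -1.], [ 1., -1.]]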

2. Quantization of activation

Quantizing the activations ensures that inputs are limited to a specified bit width:

x̂ = Clip(x × Q_b / γ, −Q_b + ε, Q_b − ε)

Where:

  • Q_b = 2^(b−1) is the maximum quantization level for a bit width of b.
  • γ is the maximum absolute value of x (denoted ||x||∞).
  • ε is a small number to avoid overflow during calculations.
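A minimal PyTorch sketch of this clipping-based activation quantization (the function name and the 8-bit default are illustrative assumptions):

import torch

def quantize_activations(x: torch.Tensor, b: int = 8, eps: float = 1e-5) -> torch.Tensor:
    # Scale by the maximum absolute value gamma, then clip to the b-bit range.
    Qb = 2 ** (b - 1)
    gamma = x.abs().max()
    return torch.clamp(x * Qb / gamma, -Qb + eps, Qb - eps)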

3. BitLinear operation

The BitLinear layer replaces traditional matrix multiplications with a simplified operation:

y = W_f × x̂ × (βγ / Q_b)

Where:

  • β is a scaling factor used to minimize approximation errors.
  • γ scales the activations.
  • Q_b is the quantization factor.

This transformation enables efficient calculations while maintaining model performance.
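Putting the pieces together, here is a schematic Python version of the BitLinear computation above (the function name, argument layout, and the use of a plain matrix product in place of the optimized kernel are illustrative):

import torch

def bitlinear(x_q: torch.Tensor, W_f: torch.Tensor, beta: float, gamma: float, Qb: int) -> torch.Tensor:
    # With ternary weights the matrix product reduces to additions and subtractions;
    # the result is rescaled by beta * gamma / Qb to recover the original range.
    return (x_q @ W_f.T) * (beta * gamma / Qb)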

Performance implications

Memory efficiency

The ternary weight system significantly reduces memory requirements:

  • Traditional LLMs: 16 bits per weight
  • BitNet.cpp: 1.58 bits per weight

This reduction translates into memory savings of approximately 90% compared to traditional 16-bit models, allowing larger models to fit within the same hardware limitations.
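The arithmetic behind the roughly 90% figure, sketched for a hypothetical 7B-parameter model (weights only, ignoring activations and overheads):

# Back-of-the-envelope memory for the weights of a hypothetical 7B-parameter model.
params = 7e9
fp16_gb = params * 16 / 8 / 1e9      # ≈ 14.0 GB at 16 bits per weight
b158_gb = params * 1.58 / 8 / 1e9    # ≈ 1.38 GB at 1.58 bits per weight
print(f"savings: {1 - b158_gb / fp16_gb:.0%}")   # savings: 90%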

Inference speed and energy efficiency

[Figure: inference speed and energy cost of BitNet.cpp compared with Llama.cpp on Apple M2 Ultra and Intel i7-13700H]

1. Inference speed: faster on both CPUs

Inference speed is represented as the number of tokens processed per second. Here is an overview of the observations:

  • On Apple M2 Ultra: BitNet.cpp reaches up to a 5.07x speedup over Llama.cpp for larger models (30B), and a peak throughput of 593.43 tokens per second for the 125M model, a 1.37x speedup. For larger models such as the 3.8B and 7B, BitNet.cpp maintains speeds above 84.77 tokens per second, demonstrating efficiency at different scales.
  • On Intel i7-13700H: BitNet.cpp achieves even more dramatic speed improvements. At the 7B model size, it delivers an impressive 5.68x speedup over Llama.cpp. For smaller models such as the 125M, it processes 389.08 tokens per second, 2.37x faster than Llama.cpp.

2. Energy efficiency: a game-changer for edge devices

The graphs also include energy cost comparisons, showing a significant reduction in energy consumption per token processed:

  • On Apple M2 Ultra: The energy savings of BitNet.cpp are significant. For the 700M model it consumes 55.4% less energy per token than Llama.cpp, dropping from 0.314 to 0.140. The trend continues for larger models, with the 70B model showing a 70.0% reduction in energy consumption.
  • On Intel i7-13700H: BitNet.cpp delivers 71.9% energy savings for the 700M model, with consumption dropping from 1.367 to 0.384. Although energy data for the 70B model in Llama.cpp is not available, BitNet.cpp remains efficient, with an energy consumption of 5.33 for the 70B model.

3. Exceeding the human reading speed benchmark

One of the most interesting insights from these graphs is the reference to human reading speed, marked at 5-7 tokens per second. This reference line shows that both implementations, and especially BitNet.cpp, can comfortably exceed human reading speed even for the largest models:

  • On Apple M2 Ultra, BitNet.cpp exceeds human reading speed for all model sizes, with the slowest being 8.67 tokens per second for the 70B model.
  • On Intel i7-13700H, the 100B model still processes 1.70 tokens per second, which approaches the lower end of human reading speed, while all smaller models exceed this benchmark.

Considerations when training

Straight-Through Estimator (STE)

Because 1-bit quantization introduces non-differentiable functions, training involves a specialized technique known as the Straight-Through Estimator (STE). In this approach, the gradients flow unchanged through non-differentiable points. Here is a simplified implementation in Python:

import torch
from torch.autograd import Function

class StraightThroughEstimator(Function):
    @staticmethod
    def forward(ctx, input):
        # Forward pass: binarize the input via the sign function
        return input.sign()

    @staticmethod
    def backward(ctx, grad_output):
        # Backward pass: let the gradient flow through unchanged
        return grad_output
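A brief usage sketch, assuming the class and import above: the forward pass sees only the sign of the weights, yet gradients reach the full-precision tensor unchanged.

w_latent = torch.randn(4, 4, requires_grad=True)    # full-precision latent weights
w_q = StraightThroughEstimator.apply(w_latent)       # forward: sign(); backward: identity
loss = w_q.sum()
loss.backward()
print(w_latent.grad)                                  # all ones: the gradient passed straight through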

Mixed precision training

To maintain stability during training, mixed precision is used, as sketched below:

  • Weights and activations: Quantized to 1-bit precision.
  • Gradients and optimizer states: Stored with higher precision.
  • Latent weights: Maintained with high precision to enable accurate updates during training.
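A minimal sketch of this scheme, reusing the StraightThroughEstimator class from the previous section (the layer size, loss, and optimizer choice are illustrative):

import torch

# Latent weights are kept in full precision; the forward pass sees only their 1-bit version.
w_latent = torch.randn(16, 16, requires_grad=True)
opt = torch.optim.AdamW([w_latent], lr=1e-2)

x = torch.randn(8, 16)
w_q = StraightThroughEstimator.apply(w_latent)   # quantized weights used in the forward pass
loss = (x @ w_q.T).pow(2).mean()
loss.backward()                                  # gradients are stored in full precision
opt.step()                                       # the update is applied to the latent weights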

Strategy for high learning rates

A unique challenge with 1-bit models is that small updates may not affect the binarized weights. To mitigate this, the learning rate is increased, ensuring faster convergence and better optimization compared to traditional approaches.

Group quantization and normalization

BitNet.cpp introduces group quantization and normalization to improve the model's parallelism. Instead of computing scaling parameters over the entire weight matrix, BitNet divides weights and activations into multiple groups (G).

This grouping enables efficient parallel processing without additional communication between groups, enabling large-scale model training and inference.
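An illustrative sketch of the idea (the row-wise grouping and the function name are assumptions):

import numpy as np

def group_binarize(W: np.ndarray, G: int = 4) -> np.ndarray:
    # Split the weight matrix into G row groups and binarize each group
    # around its own mean, so the groups can be processed independently in parallel.
    groups = np.array_split(W, G, axis=0)
    return np.vstack([np.sign(g - g.mean()) for g in groups])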

Implementation notes and optimizations

CPU optimization

BitNet.cpp uses several low-level optimizations to achieve maximum CPU performance:

  • Vectorized operations: Uses SIMD instructions to perform bit manipulations efficiently.
  • Cache-friendly memory access: Structures data to minimize cache misses.
  • Parallel processing: Effectively distributes the workload across multiple CPU cores.

Here is an example of a key function that implements quantization and inference in BitNet:

 
import torch

def quantize(x):
    # Absmax quantization: scale by the maximum absolute value,
    # clamp to [-1, 1], then rescale back to the original range
    scale = torch.max(torch.abs(x))
    return torch.clamp(x / scale, -1, 1) * scale

def bitlinear_forward(input, weight, scale):
    # Quantize the input using absmax quantization
    input_q = quantize(input)

    # Perform the binary matrix multiplication
    # (binary_matmul is a placeholder for the framework's optimized 1-bit kernel)
    output = binary_matmul(input_q, weight)

    # Scale the output to match the original precision
    return output * scale

Supported models

The current version of BitNet.cpp supports the following 1-bit LLMs available on Hugging Face:

  • bitnet_b1_58-large (0.7B parameters)
  • bitnet_b1_58-3B (3.3B parameters)
  • Llama3-8B-1.58-100B-tokens (8.0B parameters)

These models are publicly available to demonstrate the inference capabilities of the framework. Although not officially trained or released by Microsoft, they illustrate the versatility of the framework.


Installation guide

Follow the steps below to get started with BitNet.cpp:

Requirements

  1. Python >=3.9
  2. CMake >=3.22
  3. Clang >=18
  4. Conda (highly recommended)

For Windows users, Visual Studio must be installed with the following components enabled:

  • Desktop development with C++
  • C++ CMake Tools for Windows
  • Git for Windows
  • C++ Clang Compiler for Windows
  • MS-Build support for LLVM Toolset (Clang)

For Debian/Ubuntu users, an automatic installation script is available.

Step-by-step installation

  1. Clone the repository.
  2. Install dependencies.
  3. Build and prepare the project: you can download a model directly from Hugging Face and convert it to a quantized format, or download and convert the model manually (see the sketch below).
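The commands below are a sketch of these steps; the script names, flags, and Hugging Face repository IDs are assumptions based on the public microsoft/BitNet repository:

# 1. Clone the repository (with submodules)
git clone --recursive https://github.com/microsoft/BitNet.git
cd BitNet

# 2. Install dependencies (a Conda environment is recommended)
conda create -n bitnet-cpp python=3.9
conda activate bitnet-cpp
pip install -r requirements.txt

# 3. Download a model from Hugging Face and convert it to a quantized format
python setup_env.py --hf-repo HF1BitLLM/Llama3-8B-1.58-100B-tokens -q i2_s

# Or download and convert the model manually
huggingface-cli download HF1BitLLM/Llama3-8B-1.58-100B-tokens --local-dir models/Llama3-8B-1.58-100B-tokens
python setup_env.py -md models/Llama3-8B-1.58-100B-tokens -q i2_s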

Performing inference with BitNet.cpp

To perform inference using the framework, use the following command:
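The exact invocation below is a sketch: the run_inference.py entry point and the model path are assumptions based on the public microsoft/BitNet repository, while the flags match those explained underneath.

python run_inference.py -m models/Llama3-8B-1.58-100B-tokens/ggml-model-i2_s.gguf -p "Once upon a time" -n 128 -temp 0.8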

Explanation:

  • -m specifies the model file path.
  • -p defines the prompt text.
  • -n sets the number of tokens to predict.
  • -temp adjusts the randomness of sampling (temperature) during inference.

Output example

Technical details of BitNet.cpp

BitLinear layer

BitNet.cpp implements a custom Transformer architecture, replacing standard matrix multiplications with BitLinear operations. This approach centralizes the weights to zero before quantization and scales them to reduce approximation errors. The key transformation function looks like this:

import numpy as np

# Binarization function for 1-bit weights
def binarize_weights(W):
    # Center the weights around their mean, then keep only the sign
    alpha = W.mean()
    W_binarized = np.sign(W - alpha)
    return W_binarized

The combination of centralized weights and scaling minimizes quantization error, preserving performance.

Impact on the industry

BitNet.cpp can have far-reaching consequences for the use of LLMs:

  • Accessibility: Allows LLMs to run on commodity devices, democratizing access to powerful AI.
  • Cost efficiency: Reduces the need for expensive GPUs, lowering the barrier to adoption.
  • Energy efficiency: Saves energy by using standard CPU-based inference.
  • Innovation: Opens up new possibilities for on-device AI, such as real-time language translation, voice assistants and privacy-focused applications without cloud dependencies.

Challenges and future directions

Although 1-bit LLMs show promise, several challenges remain. These include developing robust 1-bit models for various tasks, optimizing hardware for 1-bit computations, and encouraging developers to adopt this new paradigm. Furthermore, exploring 1-bit quantization for computer vision or audio tasks represents an exciting future direction.

Conclusion

Microsoft’s launch of BitNet.cpp marks significant progress. By enabling efficient 1-bit inference on commodity CPUs, BitNet.cpp improves the accessibility and sustainability of AI. The framework lays the foundation for more portable and cost-effective LLMs, expanding what’s possible with on-device AI.
