
Jamba: AI21 Labs’ New Hybrid Transformer-Mamba Language Model

Language models have advanced rapidly, with Transformer-based architectures leading the way in natural language processing. However, as models grow in size, the challenges of handling long contexts, memory efficiency, and throughput have grown with them.

AI21 Labs has introduced a new answer to these challenges with Jamba, an advanced large language model (LLM) that combines the strengths of the Transformer and Mamba architectures in a hybrid framework. This article delves into the details of Jamba, exploring its architecture, performance, and potential applications.

Overview of Jamba

Jamba is a hybrid large language model developed by AI21 Labs that combines Transformer layers and Mamba layers, integrated with a Mixture-of-Experts (MoE) module. This architecture allows Jamba to balance memory usage, throughput, and quality, making it a powerful tool for a wide range of NLP tasks. Designed to fit on a single 80GB GPU, the model offers high throughput and a small memory footprint while maintaining state-of-the-art performance on various benchmarks.

The architecture of Jamba

Jamba’s architecture is the cornerstone of its capabilities. It is built on a new hybrid design that interweaves Transformer layers with Mamba layers, incorporating MoE modules to increase the model’s capacity without significantly increasing computational requirements.

1. Transformer layers

The Transformer architecture has become the standard for modern LLMs due to its ability to efficiently handle parallel processing and capture long-range dependencies in text. However, performance is often limited by high memory and computation requirements, especially when processing long contexts. Jamba addresses these limitations by integrating Mamba layers, which we will explore next.

2. Mamba layers

Mamba is a recent state-space model (SSM) designed to handle long-range dependencies in sequences more efficiently than traditional RNNs or even Transformers. Mamba layers are particularly effective at avoiding the memory cost of storing key-value (KV) caches in Transformers. By interweaving Mamba layers with Transformer layers, Jamba reduces overall memory usage while maintaining high performance, especially on tasks that require long-context processing.
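To see why this matters, consider a rough back-of-envelope comparison between a growing KV cache and a fixed-size SSM state. The dimensions below are hypothetical placeholders chosen only for illustration, not Jamba's actual configuration.

# Back-of-envelope sketch (illustrative numbers, not Jamba's exact dimensions):
# a Transformer's KV cache grows linearly with context length, while a Mamba
# layer carries a fixed-size recurrent state regardless of sequence length.

def kv_cache_bytes(n_layers, n_kv_heads, head_dim, seq_len, bytes_per_elem=2):
    """Memory for keys + values across all attention layers (fp16/bf16)."""
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * bytes_per_elem

def ssm_state_bytes(n_layers, d_model, state_dim, bytes_per_elem=2):
    """Fixed-size state carried by SSM (Mamba-style) layers."""
    return n_layers * d_model * state_dim * bytes_per_elem

# Hypothetical model: 32 layers, 8 KV heads of dim 128, d_model 4096, SSM state 16.
for seq_len in (8_000, 64_000, 256_000):
    attn = kv_cache_bytes(32, 8, 128, seq_len)
    ssm = ssm_state_bytes(32, 4096, 16)
    print(f"{seq_len:>8} tokens: KV cache ~{attn / 2**30:.1f} GiB, "
          f"SSM state ~{ssm / 2**20:.1f} MiB")

The point of the arithmetic is simply that attention memory scales with sequence length while the SSM state does not, which is why replacing most attention layers with Mamba layers shrinks the long-context memory footprint.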

3. Modules of mix of experts (MoE).

The MoE module in Jamba introduces a flexible approach to scaling model capacity. MoE allows the model to increase its total number of parameters without proportionally increasing the parameters active during inference. In Jamba, MoE is applied to some of the MLP layers, where a router mechanism selects which experts to activate for each token. This selective activation allows Jamba to maintain high efficiency while performing complex tasks, as sketched below.
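The routing idea can be sketched in a few lines of PyTorch. The sizes, expert count, and expert architecture below are illustrative placeholders rather than Jamba's internals.

# Minimal sketch of top-k expert routing over an MLP block, in the spirit of
# the MoE layers described above; dimensions and expert layout are illustrative.
import torch
import torch.nn as nn
import torch.nn.functional as F

class TopKMoE(nn.Module):
    def __init__(self, d_model=512, d_ff=2048, n_experts=16, top_k=2):
        super().__init__()
        self.router = nn.Linear(d_model, n_experts, bias=False)
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, d_ff), nn.SiLU(), nn.Linear(d_ff, d_model))
            for _ in range(n_experts)
        )
        self.top_k = top_k

    def forward(self, x):                      # x: (tokens, d_model)
        scores = self.router(x)                # one logit per expert, per token
        weights, idx = scores.topk(self.top_k, dim=-1)
        weights = F.softmax(weights, dim=-1)   # normalize over the chosen experts
        out = torch.zeros_like(x)
        for k in range(self.top_k):            # only the selected experts run
            for e in idx[:, k].unique().tolist():
                mask = idx[:, k] == e
                out[mask] += weights[mask, k:k+1] * self.experts[e](x[mask])
        return out

tokens = torch.randn(4, 512)
print(TopKMoE()(tokens).shape)   # torch.Size([4, 512])

Because only the top 2 of 16 experts run per token, the compute per token stays close to that of a single MLP even though the parameter count grows with the number of experts.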

The image below demonstrates the behavior of an induction head in a hybrid Attention-Mamba model, a key feature of Jamba. In this example, an attention head is responsible for predicting labels such as “Positive” or “Negative” in a sentiment-analysis task. The highlighted words illustrate how the model’s attention focuses heavily on the label tokens of the few-shot examples, especially at the critical moment just before predicting the final label. This attention mechanism plays a crucial role in in-context learning, where the model must infer the correct label from the given context and a handful of examples.


The performance improvements provided by the integration of Mixture-of-Experts (MoE) with the Attention-Mamba hybrid architecture are highlighted in the table. By using MoE, Jamba increases its capacity without proportionately increasing the computation cost. This is especially reflected in the significant performance improvement in various benchmarks such as HellaSwag, WinoGrande and Natural Questions (NQ). The model with MoE not only achieves higher accuracy (e.g., 66.0% on WinoGrande compared to 62.5% without MoE), but also demonstrates improved log-likelihoods across domains (e.g., -0.534 on C4).

Main architectural features

  • Layer composition: Jamba’s architecture consists of blocks that combine Mamba and Transformer layers in a fixed ratio (e.g., 1:7, meaning one Transformer layer for every seven Mamba layers), chosen to balance performance and efficiency. A sketch of this layout appears after this list.
  • MoE integration: MoE layers replace the standard MLP every few layers, with 16 experts available and the top 2 experts activated per token. This configuration allows Jamba to scale capacity while managing the trade-off between memory usage and compute efficiency.
  • Normalization and stability: To ensure stability during training, Jamba applies RMSNorm inside the Mamba layers, which helps reduce issues such as large activation spikes that can occur at scale.
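For intuition, here is a toy sketch of how one such block could be laid out under the ratios above. The layer labels and the MoE period are illustrative placeholders, not Jamba's actual module structure.

# Illustrative layout of one hybrid block, following the ratios described above
# (1 attention layer per 7 Mamba layers, with an MoE MLP on alternating layers).
ATTN_PER_BLOCK = 1     # attention layers per block
MAMBA_PER_BLOCK = 7    # Mamba layers per block
MOE_EVERY = 2          # use an MoE MLP in place of the plain MLP every 2 layers

def build_block(n_attn=ATTN_PER_BLOCK, n_mamba=MAMBA_PER_BLOCK, moe_every=MOE_EVERY):
    layers = []
    sequence = ["attention"] * n_attn + ["mamba"] * n_mamba
    for i, kind in enumerate(sequence):
        mlp = "moe-mlp" if i % moe_every == 1 else "mlp"
        layers.append(f"{kind} + {mlp}")
    return layers

for i, layer in enumerate(build_block()):
    print(f"layer {i}: {layer}")

Stacking several such blocks gives a model in which attention appears only occasionally, which is exactly what keeps the KV cache small, while the periodic MoE MLPs provide the extra capacity.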

Jamba’s performance and benchmarking

Jamba has been extensively tested against a wide range of benchmarks and shows competitive performance across the board. The following sections highlight some of the key benchmarks in which Jamba excelled, showcasing its strengths in both general NLP tasks and long-context scenarios.

1. General NLP benchmarks

Jamba has been evaluated on several academic benchmarks, including:

  • HellaSwag (10-shot): A common sense reasoning task where Jamba achieved a performance score of 87.1%, surpassing many competing models.
  • WinoGrande (5-shot): Another reasoning task, where Jamba scored 82.5%, again demonstrating its ability to handle complex linguistic reasoning.
  • ARC Challenge (25-shot): Jamba showed strong performance with a score of 64.4%, reflecting its ability to answer challenging multiple-choice questions.

In aggregate benchmarks such as MMLU (5-shot), Jamba achieved a score of 67.4%, indicating its robustness across various tasks.

2. Long-context evaluations

One of the standout features of Jamba is its ability to handle extremely long contexts. The model supports a context length of up to 256,000 tokens, among the longest of any publicly available model at the time of its release. This capability was tested using the Needle-in-a-Haystack benchmark, with Jamba showing strong retrieval accuracy across different context lengths, up to 256K tokens.
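For readers who want to reproduce this kind of evaluation informally, the prompt construction can be sketched as follows. This is an illustrative harness, not the official benchmark implementation; the filler text, token estimate, and question are placeholders.

# Minimal sketch of a needle-in-a-haystack style check: bury a fact ("needle")
# at a chosen depth inside filler text and ask the model to retrieve it.
def build_haystack_prompt(needle, filler_sentence, target_tokens, depth=0.5,
                          tokens_per_sentence=12):
    n_sentences = max(1, target_tokens // tokens_per_sentence)
    insert_at = int(n_sentences * depth)
    sentences = [filler_sentence] * n_sentences
    sentences.insert(insert_at, needle)
    context = " ".join(sentences)
    question = "What is the magic number mentioned in the document above?"
    return f"{context}\n\n{question}\nAnswer:"

prompt = build_haystack_prompt(
    needle="The magic number is 48213.",
    filler_sentence="The weather report noted mild temperatures and light wind.",
    target_tokens=2_000,   # scale this up toward 256K for long-context tests
    depth=0.25,            # place the needle a quarter of the way in
)
print(prompt[:200], "...")
# In practice, feed `prompt` to the model (e.g. with the generation snippet
# later in this article) and check whether the completion contains "48213".

Sweeping both the context length and the needle depth, and scoring whether the answer is recovered, reproduces the general shape of the published evaluation.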

3. Throughput and efficiency

Jamba’s hybrid architecture significantly improves throughput, especially for long sequences.

In tests comparing the throughput (tokens per second) of different models, Jamba consistently outperformed its competitors, especially in scenarios with large batch sizes and long contexts. For example, with a context of 128,000 tokens, Jamba achieved 3x the throughput of Mixtral, a similar model.
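A rough way to measure generation throughput on your own hardware is sketched below, using the same transformers loading pattern shown in the next section. This is not AI21's benchmark setup; the prompt and token counts are arbitrary, and absolute numbers will depend heavily on hardware and batch size.

# Rough sketch of measuring end-to-end generation throughput (tokens/second).
import time
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "ai21labs/Jamba-v0.1"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map="auto"
)

prompt = "Summarize the history of state-space models in machine learning."
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

new_tokens = 256
start = time.perf_counter()
outputs = model.generate(**inputs, max_new_tokens=new_tokens, do_sample=False)
elapsed = time.perf_counter() - start

generated = outputs.shape[-1] - inputs["input_ids"].shape[-1]
print(f"{generated} tokens in {elapsed:.1f}s -> {generated / elapsed:.1f} tokens/s")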

Using Jamba in Python

For developers and researchers eager to experiment with Jamba, AI21 Labs has made the model available on platforms like Hugging Face, making it accessible to a wide range of applications. The following code snippet shows how to load and generate text with Jamba:

from transformers import AutoModelForCausalLM, AutoTokenizer

# Load the base model and its tokenizer from the Hugging Face Hub.
model = AutoModelForCausalLM.from_pretrained("ai21labs/Jamba-v0.1")
tokenizer = AutoTokenizer.from_pretrained("ai21labs/Jamba-v0.1")

# Tokenize a prompt and move it to the same device as the model.
input_ids = tokenizer("In the recent Super Bowl LVIII,", return_tensors="pt").to(model.device)["input_ids"]

# Generate a continuation and decode it back to text.
outputs = model.generate(input_ids, max_new_tokens=216)
print(tokenizer.batch_decode(outputs))

This simple script loads the Jamba model and tokenizer, generates text based on a given input prompt, and prints the generated output.

Fine-tuning Jamba

Jamba is designed as a base model, meaning it can be tailored to specific tasks or applications. Fine-tuning allows users to adapt the model to niche domains, improving performance on specialized tasks. The following example shows how to fine-tune Jamba with LoRA using the PEFT and TRL libraries:

import torch
from datasets import load_dataset
from trl import SFTTrainer, SFTConfig
from peft import LoraConfig
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("ai21labs/Jamba-v0.1")
model = AutoModelForCausalLM.from_pretrained(
    "ai21labs/Jamba-v0.1", device_map="auto", torch_dtype=torch.bfloat16
)

# LoRA adapters are attached to the Mamba, MLP, and attention projections.
lora_config = LoraConfig(
    r=8,
    target_modules=[
        "embed_tokens", "x_proj", "in_proj", "out_proj",  # mamba
        "gate_proj", "up_proj", "down_proj",              # mlp
        "q_proj", "k_proj", "v_proj",                     # attention
    ],
    task_type="CAUSAL_LM",
    bias="none",
)

dataset = load_dataset("Abirate/english_quotes", split="train")

training_args = SFTConfig(
    output_dir="./results",
    num_train_epochs=2,
    per_device_train_batch_size=4,
    logging_dir="./logs",
    logging_steps=10,
    learning_rate=1e-5,
    dataset_text_field="quote",   # dataset column that holds the training text
)

trainer = SFTTrainer(
    model=model,
    tokenizer=tokenizer,
    args=training_args,
    peft_config=lora_config,
    train_dataset=dataset,
)
trainer.train()

This code snippet fine-tunes Jamba on a dataset of English quotes, training lightweight LoRA adapters rather than updating every parameter, so the model better suits the specific task of generating text in a specialized domain.
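Continuing from the training script above, the resulting LoRA adapter can be saved on its own and re-attached to the base model at inference time. The output path below is a placeholder.

# After training, only the LoRA adapter weights need to be saved; they can be
# re-attached to the base model at inference time.
from peft import PeftModel

trainer.save_model("./jamba-quotes-lora")   # writes the adapter, not the full model

base = AutoModelForCausalLM.from_pretrained(
    "ai21labs/Jamba-v0.1", device_map="auto", torch_dtype=torch.bfloat16
)
tuned = PeftModel.from_pretrained(base, "./jamba-quotes-lora")

prompt = tokenizer("A wise person once said", return_tensors="pt").to(tuned.device)
print(tokenizer.batch_decode(tuned.generate(**prompt, max_new_tokens=40)))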

Deployment and integration

AI21 Labs has made the Jamba family broadly accessible through a variety of platforms and deployment options:

  1. Cloud platforms:
    • Available from major cloud providers including Google Cloud Vertex AI, Microsoft Azure, and NVIDIA NIM.
    • Coming soon on Amazon Bedrock, Databricks Marketplace, and Snowflake Cortex.
  2. AI development frameworks:
    • Integration with popular frameworks such as LangChain and LlamaIndex (upcoming).
  3. AI21 Studio:
    • Direct access via AI21’s own development platform.
  4. Hugging Face:
    • Model weights available to download and experiment with.
  5. On-premises deployment:
    • Private on-premises deployment options for organizations with specific security or compliance needs.
  6. Custom solutions:
    • AI21 offers custom model training and fine-tuning services for enterprise clients.

Developer-friendly features

Jamba models come with several built-in capabilities that make them particularly attractive to developers:

  1. Function calling: Easily integrate external tools and APIs into your AI workflows.
  2. Structured JSON output: Generate clean, parsable data structures directly from natural language input.
  3. Document object ingestion: Efficiently process and understand complex document structures.
  4. RAG optimizations: Built-in features to improve retrieval-augmented generation pipelines.

These features, combined with the model’s long context window and efficient processing, make Jamba a versatile tool for a wide range of development scenarios.
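As one small illustration of the structured-output idea: the built-in JSON capability listed above refers to AI21's hosted and instruct-tuned offerings, but with the open base weights a prompt-plus-validation approach is a common fallback. The prompt wording, keys, and helper function below are hypothetical examples.

# Sketch of requesting JSON at the prompt level and validating it before use.
import json

prompt = (
    "Extract the fields from the sentence below and reply with JSON only, "
    'using the keys "name", "city", and "year".\n'
    "Sentence: Ada moved to Zurich in 2019.\n"
    "JSON:"
)
# In practice, `prompt` would be fed to the generation snippet shown earlier
# (tokenize, model.generate, decode) to produce `raw_completion`.

def parse_model_json(raw_completion: str):
    start, end = raw_completion.find("{"), raw_completion.rfind("}") + 1
    try:
        return json.loads(raw_completion[start:end])
    except json.JSONDecodeError:
        return None   # caller can retry or fall back when the model drifts from JSON

print(parse_model_json('{"name": "Ada", "city": "Zurich", "year": 2019}'))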

Ethical considerations and responsible AI

While Jamba’s capabilities are impressive, it’s critical to approach its use with a responsible AI mindset. AI21 Labs highlights a number of key points:

  1. Base model nature: Jamba 1.5 models are pretrained base models without specific alignment or instruction tuning.
  2. Lack of built-in protections: The models have no inherent moderation mechanisms.
  3. Careful deployment: Additional fine-tuning and safety measures should be implemented before using Jamba in production or end-user-facing environments.
  4. Data privacy: When using cloud-based deployments, consider data processing and compliance requirements.
  5. Bias awareness: Like all large language models, Jamba can reflect biases present in its training data. Users should be aware of this and take appropriate mitigation measures.

By keeping these factors in mind, developers and organizations can leverage Jamba’s capabilities in a responsible and ethical manner.

A new chapter in AI development?

The introduction of the Jamba family by AI21 Labs marks an important milestone in the evolution of large language models. By combining the strengths of Transformers and state-space models, integrating Mixture-of-Experts techniques, and pushing the boundaries of context length and processing speed, Jamba opens new possibilities for AI applications across industries.

As the AI community continues to explore and build on this innovative architecture, we can expect further advances in model efficiency, long-context understanding, and practical AI implementation. The Jamba family represents not just a new set of models, but a potential shift in the way we approach the design and implementation of large-scale AI systems.
