AI21’s Jamba Reasoning 3B Redefines What “Small” Means in LLMs — 250K Context on a Laptop


The latest addition to the small enterprise model wave comes from AI21 Labs, which is betting that moving models onto devices will free up traffic in data centers.
The company's Jamba Reasoning 3B is a 'small' open-source model that can perform extended reasoning, generate code, and produce responses grounded in ground truth. Jamba Reasoning 3B handles a context window of more than 250,000 tokens and can run inference on edge devices.
The company said Jamba Reasoning 3B works on devices such as laptops and mobile phones.
Ori Goshen, co-CEO of AI21, told VentureBeat that the company sees more business use cases for small models, especially as moving most inference to devices frees up data centers.
“What we’re seeing in the industry now is an economic problem where there are very expensive data center expansions, and the revenue generated from the data centers versus the depreciation rate of all their chips shows that the calculations are off,” Goshen said.
He added that in the future, “the industry would generally be hybrid, in the sense that some of the computation will happen locally on devices and other inference will go to GPUs.”
Tested on a MacBook
Jamba Reasoning 3B combines the Mamba architecture with Transformers to support a 250K-token context window on devices. AI21 said the model delivers 2-4x faster inference speeds, and Goshen credited the Mamba architecture for much of that speed.
Jamba Reasoning 3B’s hybrid architecture also makes it possible to reduce memory requirements, thus reducing computing needs.
AI21 tested the model on a standard MacBook Pro and found that it can process 35 tokens per second.
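At that reported rate, back-of-the-envelope response-time estimates are straightforward. The 35 tokens-per-second figure is AI21's; the response lengths below are illustrative, not from the company:

```python
# Rough decode-latency estimate at AI21's reported 35 tokens/second on a MacBook Pro.
TOKENS_PER_SECOND = 35

def generation_time(num_tokens: int, rate: float = TOKENS_PER_SECOND) -> float:
    """Seconds to generate num_tokens at a constant decode rate."""
    return num_tokens / rate

print(round(generation_time(350), 1))    # 10.0 -> a 350-token answer takes ~10 seconds
print(round(generation_time(1_000), 1))  # 28.6 -> a long 1,000-token response, under half a minute
```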
Goshen said the model works best for tasks involving function calling, policy-based generation and tool routing. Simple requests, such as asking for information about an upcoming meeting or asking the model to create an agenda for it, can be handled on-device, while more complex reasoning tasks can be reserved for GPU clusters.
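The hybrid split Goshen describes can be sketched as a simple dispatcher. Everything below is a hypothetical illustration, not AI21's API: the task names and the routing function are invented for the sketch, and only the 250K context figure comes from the article:

```python
# Illustrative sketch of hybrid on-device / GPU-cluster routing.
# Task names and function are hypothetical; AI21 has not published this interface.

ON_DEVICE_TASKS = {"function_call", "policy_generation", "tool_routing", "meeting_agenda"}

def route_request(task_type: str, prompt_tokens: int, context_limit: int = 250_000) -> str:
    """Decide where to run inference: locally on the device or on a GPU cluster."""
    if prompt_tokens > context_limit:
        raise ValueError("prompt exceeds the model's 250K-token context window")
    # Simple, latency-sensitive tasks stay on the laptop or phone...
    if task_type in ON_DEVICE_TASKS:
        return "on_device"
    # ...while heavier reasoning is shipped to remote GPUs.
    return "gpu_cluster"

print(route_request("meeting_agenda", 1_200))   # on_device
print(route_request("deep_reasoning", 40_000))  # gpu_cluster
```

The design choice mirrors Goshen's economics argument: the cheap, frequent requests never touch the data center, so GPU capacity is reserved for the work that actually needs it.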
Small models in enterprises
Companies are interested in using a mix of small models, some designed specifically for their sector and others that are slimmed-down versions of larger LLMs.
In September, Meta released MobileLLM-R1, a family of reasoning models ranging from 140M to 950M parameters. These models are designed for math, coding and scientific reasoning rather than chat applications, and can run on devices with limited computing power.
Google's Gemma was one of the first small models to hit the market, designed for portable devices such as laptops and mobile phones. The Gemma family has since expanded considerably.
Companies like FICO have also started building their own models. FICO launched the small FICO Focused Language and FICO Focused Sequence models, which answer only finance-specific questions.
Goshen said the big difference their model offers is that it is even smaller than most models and can still perform reasoning tasks without sacrificing speed.
Benchmark testing
In benchmark testing, Jamba Reasoning 3B showed strong performance compared to other small models, including Qwen 4B, Meta's Llama 3.2 3B and Microsoft's Phi-4-Mini.
It outperformed all of those models on the IFBench test and Humanity's Last Exam, although it came second to Qwen 4B on MMLU-Pro.
Goshen said another advantage of small models like Jamba Reasoning 3B is that they are highly controllable and offer better privacy options to companies because the inference is not sent to another server.
“I really believe there is a world where you can optimize for customer needs and experience, and the models held on devices are a big part of that,” he said.




