MIT offshoot Liquid AI releases blueprint for enterprise-grade small-model training


When Liquid AI, a startup founded by MIT computer scientists in 2023, introduced its Liquid Foundation Models 2 (LFM2) series in July 2025, the pitch was simple: deliver the fastest on-device foundation models on the market using the company's new "liquid" architecture, with training and inference efficiencies that made small models a serious alternative to cloud-only large language models (LLMs) like OpenAI's GPT series and Google's Gemini.
The initial release featured dense checkpoints at 350M, 700M, and 1.2B parameters, a hybrid architecture heavily weighted toward gated short convolutions, and benchmark figures that gave LFM2 an edge over similarly sized competitors like Qwen3, Llama 3.2, and Gemma 3 on both quality and CPU throughput. The message to businesses was clear: real-time, privacy-preserving AI on phones, laptops, and vehicles no longer requires sacrificing capability for latency.
In the months since that launch, Liquid has expanded LFM2 into a broader product line – with task- and domain-specific variants, small vision and audio models, and an edge-focused deployment stack called LEAP – and positioned the models as the control layer for on-device and on-premises agentic systems.
Now, with the publication of a detailed 51-page LFM2 technical report on arXiv, the company goes a step further, disclosing the architecture search process, training data mix, distillation objective, curriculum strategy, and post-training pipeline behind these models.
And unlike previous open models, LFM2 is built around a repeatable recipe: a hardware-in-the-loop search process, a training curriculum that compensates for smaller parameter budgets, and a post-training pipeline tailored to instruction following and tool use.
Instead of just providing weights and an API, Liquid publishes a detailed blueprint that other organizations can use as a reference for training their own small, efficient models from scratch, tailored to their own hardware and deployment constraints.
A model family designed around real-world constraints, not GPU labs
The technical report starts from a premise enterprises know well: real AI systems hit their limits long before benchmarks do. Latency budgets, peak memory caps, and thermal throttling determine what can actually run in production, especially on laptops, tablets, commodity servers, and mobile devices.
To address this, Liquid AI ran its architecture search directly on target hardware, including mobile Snapdragon SoCs and Ryzen laptop CPUs. The outcome is consistent across all model sizes: a minimal hybrid architecture dominated by gated short-convolution blocks plus a small number of grouped-query attention (GQA) layers. This design was repeatedly chosen over more exotic linear-attention and SSM hybrids because it produced a better quality-latency-memory Pareto frontier under real device conditions.
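To make the dominant block type concrete, here is a minimal NumPy sketch of a double-gated, causal short-convolution mixing layer in the spirit of LFM2's convolution blocks. The gating arrangement, kernel length, and projection shapes below are illustrative assumptions, not the published operator.

```python
import numpy as np

def gated_short_conv_block(x, Wb, Wc, Wo, kernel):
    """Illustrative gated short-convolution mixer (x has shape [T, d]).

    Structure loosely follows a double-gated short conv: gate on the way
    in, causal depthwise conv over time, gate on the way out, project.
    Details here are assumptions, not Liquid AI's exact operator.
    """
    T, d = x.shape
    b = x @ Wb                 # input gate projection
    c = x @ Wc                 # output gate projection
    u = b * x                  # first multiplicative gate
    K = kernel.shape[0]        # short kernel, e.g. length 3
    pad = np.vstack([np.zeros((K - 1, d)), u])  # left-pad => causal
    conv = np.zeros_like(u)
    for t in range(T):
        conv[t] = (pad[t:t + K] * kernel).sum(axis=0)  # per-channel conv
    return (c * conv) @ Wo     # second gate, then output projection

rng = np.random.default_rng(0)
T, d, K = 6, 8, 3
x = rng.standard_normal((T, d))
Wb, Wc, Wo = (rng.standard_normal((d, d)) * 0.1 for _ in range(3))
kernel = rng.standard_normal((K, d)) * 0.1
y = gated_short_conv_block(x, Wb, Wc, Wo, kernel)
print(y.shape)  # (6, 8)
```

Because the convolution is short and causal, each output token depends only on a fixed, small window of past tokens, which is what makes prefill and decode cheap on CPUs.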
This matters for enterprise teams in three ways:
- Predictability. The architecture is simple, parameter-efficient, and stable across model sizes from 350 million to 2.6 billion parameters.
- Operational portability. Dense and MoE variants share the same structural backbone, simplifying deployment across mixed hardware fleets.
- On-device feasibility. Prefill and decode throughput on CPUs exceeds comparable open models by roughly 2x in many cases, reducing the need to offload routine tasks to cloud inference endpoints.
Rather than optimizing for academic novelty, the report reads like a systematic effort to design models that enterprises can actually ship.
That is notable, and more practical for enterprises, in a field where many open models quietly assume access to multi-H100 clusters, even for inference.
A training pipeline tailored to business-relevant behavior
LFM2 takes a training approach that compensates for the smaller scale of its models with structure rather than brute force. Key elements include:
- Pretraining on 10–12T tokens, plus a mid-training phase that extends the context window to 32K without exploding compute costs.
- A decoupled top-K knowledge-distillation objective that avoids the instability of standard KL distillation when the teacher provides only partial logits.
- A three-phase post-training sequence (SFT, length-normalized preference alignment, and model merging) designed to produce more reliable instruction following and tool use.
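The decoupled top-K distillation idea can be illustrated with a toy loss: score the teacher's stored top-K tokens individually, and lump the teacher's remaining probability mass into a single tail bucket matched against the student's tail, rather than renormalizing over the K tokens alone. This plain-Python sketch shows that general idea under those assumptions; it is not Liquid AI's exact objective.

```python
import math

def softmax(logits):
    m = max(logits)
    exps = [math.exp(l - m) for l in logits]
    s = sum(exps)
    return [e / s for e in exps]

def topk_decoupled_kd_loss(teacher_topk, student_logits, vocab):
    """Hypothetical sketch of a decoupled top-K distillation loss.

    teacher_topk: dict token -> teacher prob for the K stored tokens.
    The teacher's leftover mass becomes one 'tail' bucket and is matched
    against the student's mass on non-top-K tokens, so the truncated
    teacher distribution need not be renormalized over K tokens only.
    """
    p_student = dict(zip(vocab, softmax(student_logits)))
    tail_t = 1.0 - sum(teacher_topk.values())          # teacher tail mass
    tail_s = sum(p for tok, p in p_student.items() if tok not in teacher_topk)
    loss = 0.0
    for tok, p_t in teacher_topk.items():              # head: token-level KL
        loss += p_t * math.log(p_t / p_student[tok])
    if tail_t > 0 and tail_s > 0:                      # tail: one-bucket KL
        loss += tail_t * math.log(tail_t / tail_s)
    return loss

vocab = ["a", "b", "c", "d"]
student_logits = [2.0, 1.0, 0.5, 0.1]
teacher_topk = {"a": 0.6, "b": 0.25}   # teacher stored only K=2 entries
print(round(topk_decoupled_kd_loss(teacher_topk, student_logits, vocab), 4))
```

The loss is a KL divergence between two coarsened distributions (K head tokens plus one tail bucket), so it stays non-negative and is zero only when the student matches the teacher on both head and tail.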
For enterprise AI developers, the key point is that LFM2 models behave less like "little LLMs" and more like practical agents: they follow structured formats, adhere to JSON schemas, and manage multi-turn chat flows. Many open models of similar size fail not for lack of reasoning ability, but because they adhere poorly to instruction templates. LFM2's post-training recipe targets these rough edges directly.
In other words, Liquid AI optimized small models for operational reliability, not just leaderboard scores.
Multimodality designed for device limitations, not lab demos
The LFM2-VL and LFM2-Audio variants reflect another shift: multimodality built on top of token efficiency.
Instead of bolting a massive vision transformer onto an LLM, LFM2-VL connects a SigLIP2 encoder through a connector that aggressively reduces the number of visual tokens via PixelUnshuffle. High-resolution input automatically triggers dynamic tiling, keeping token budgets manageable even on mobile hardware. LFM2-Audio uses a split audio path (one for input embedding, one for generation) and supports real-time transcription or speech-to-speech on modest CPUs.
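The token-reduction step is easy to picture. PixelUnshuffle folds each r x r patch of the vision encoder's feature map into the channel dimension, cutting the visual token count by a factor of r squared before a projection layer (omitted here) maps the result into the LLM's width. A NumPy sketch of that rearrangement, with illustrative shapes rather than LFM2-VL's actual dimensions:

```python
import numpy as np

def pixel_unshuffle(feats, r):
    """Fold r x r spatial neighborhoods into the channel dimension.

    feats: [H, W, C] vision-encoder features; returns [H/r, W/r, C*r*r].
    The token count (H*W) drops by r^2; a connector projection (not
    shown) would then map the fatter channels to the LLM width.
    """
    H, W, C = feats.shape
    assert H % r == 0 and W % r == 0
    x = feats.reshape(H // r, r, W // r, r, C)
    return x.transpose(0, 2, 1, 3, 4).reshape(H // r, W // r, C * r * r)

feats = np.arange(4 * 4 * 2, dtype=float).reshape(4, 4, 2)
out = pixel_unshuffle(feats, 2)
print(feats.shape[0] * feats.shape[1], "->", out.shape[0] * out.shape[1])  # 16 -> 4 tokens
```

No information is discarded by the rearrangement itself; it simply trades spatial positions (tokens) for channels, which is what keeps per-image token budgets flat as resolution grows.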
For enterprise platform architects, this design points to a practical future where:
- document understanding runs directly on endpoints such as field devices;
- audio transcription and voice agents run locally to meet privacy requirements;
- multimodal agents operate within fixed latency envelopes without streaming data off the device.
The through line is the same: multimodal capabilities without the need for a GPU farm.
Retrieval models built for agent systems, not legacy search
LFM2-ColBERT shrinks late-interaction retrieval to a footprint small enough for enterprise deployments that need multilingual RAG, without the overhead of specialized vector-DB accelerators.
This is especially important as organizations begin to orchestrate fleets of agents. Fast local retrieval, running on the same hardware as the reasoning model, reduces latency and delivers a governance win: documents never leave the device boundary.
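Late interaction itself is compact enough to run on-device: queries and documents are embedded per token, and scoring reduces to a matrix multiply plus a per-query-token max (the MaxSim operator). A hedged NumPy sketch with made-up shapes, not LFM2-ColBERT's actual dimensions:

```python
import numpy as np

def late_interaction_score(q_vecs, d_vecs):
    """ColBERT-style MaxSim: each query token takes its best-matching
    document token; the score is the sum of those maxima.
    q_vecs: [nq, dim], d_vecs: [nd, dim], both L2-normalized per row.
    """
    sim = q_vecs @ d_vecs.T          # cosine similarities, [nq, nd]
    return sim.max(axis=1).sum()     # MaxSim per query token, then sum

def normalize(v):
    return v / np.linalg.norm(v, axis=1, keepdims=True)

rng = np.random.default_rng(1)
q = normalize(rng.standard_normal((4, 16)))              # 4 query tokens
docs = [normalize(rng.standard_normal((12, 16))) for _ in range(3)]
# Document embeddings can be precomputed offline on-device;
# ranking at query time is just matmuls and maxes.
ranked = sorted(range(3), key=lambda i: late_interaction_score(q, docs[i]),
                reverse=True)
print(ranked)
```

Because document token embeddings are precomputed, query-time cost is linear in collection size with tiny constants, which is what makes same-device retrieval alongside the reasoning model plausible.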
Taken together, the VL, Audio, and ColBERT variants demonstrate LFM2 as a modular system, rather than a single model.
The emerging blueprint for hybrid enterprise AI architectures
Across all variants, the LFM2 report implicitly outlines what tomorrow's enterprise AI stack will look like: hybrid local-cloud orchestration, where small, fast on-device models handle time-critical perception, formatting, tool calling, and evaluation tasks, while larger cloud models provide heavy-duty reasoning when needed.
Several trends come together here:
- Cost control. Running routine inference locally avoids unpredictable cloud billing.
- Latency determinism. Time to first token (TTFT) and decode stability matter in agent workflows; on-device execution eliminates network jitter.
- Governance and compliance. Local execution simplifies PII handling, data residency, and auditability.
- Resilience. Agent systems degrade gracefully if the cloud path becomes unavailable.
Companies adopting these architectures will likely view small on-device models as the “control plane” of agentic workflows, while large cloud models serve as on-demand accelerators.
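That "control plane" framing can be made concrete with a toy local-first routing policy. The task fields and thresholds below are entirely hypothetical, but they capture the division of labor the report implies: private or latency-critical work stays local, and only heavy reasoning escalates to the cloud, when the cloud is reachable at all.

```python
def route(task, cloud_available=True):
    """Toy local-first routing policy for a hybrid agent stack.

    Hypothetical policy, not a product API: PII and tight latency
    budgets pin work to the on-device small model; deep reasoning
    escalates to a cloud model only while the cloud path is up.
    """
    if task["contains_pii"] or task["latency_budget_ms"] < 300:
        return "local-small-model"
    if task["needs_deep_reasoning"] and cloud_available:
        return "cloud-frontier-model"
    return "local-small-model"

print(route({"contains_pii": False, "latency_budget_ms": 100,
             "needs_deep_reasoning": True}))  # local-small-model
```

The notable property is the default: everything falls back to the local model, so a cloud outage degrades capability rather than availability.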
LFM2 is one of the clearest open source foundations for that control layer yet.
The strategic takeaway: On-device AI is now a design choice, not a compromise
For years, organizations building AI capabilities have accepted that "real AI" requires cloud inference. LFM2 challenges that assumption. The models perform competitively on reasoning, instruction following, multilingual tasks, and RAG, while delivering significant latency gains over other open small-model families.
For CIOs and CTOs finalizing 2026 roadmaps, the implication is immediate: small, open on-device models are now strong enough to support meaningful portions of the production workload.
LFM2 will not replace frontier cloud models for frontier-scale reasoning. But it offers something companies demonstrably need more: a reproducible, open, and operationally feasible foundation for agentic systems that must run everywhere, from phones to industrial endpoints to air-gapped secure facilities.
In the ever-widening landscape of enterprise AI, LFM2 is less a research milestone and more a sign of architectural convergence. The future is not the cloud or the edge; it’s both working together. And releases like LFM2 provide the building blocks for organizations willing to build that hybrid future intentionally, not accidentally.




