
Microsoft’s Fara-7B is a computer-use AI agent that rivals GPT-4o and works directly on your PC

Microsoft has introduced Fara-7B, a new 7-billion-parameter model designed to act as a Computer Use Agent (CUA) that can perform complex tasks directly on a user’s device. Fara-7B delivers state-of-the-art results for its size and offers a way to build AI agents that do not depend on massive, cloud-hosted models, running instead on compact systems with lower latency and improved privacy.

Although the model is experimental, its architecture addresses a primary barrier to enterprise adoption: data security. Because Fara-7B is small enough to run locally, users can automate sensitive workflows, such as managing internal accounts or processing sensitive corporate data, without that information ever leaving the device.

How Fara-7B sees the web

Fara-7B is designed to navigate user interfaces with the same tools as a human: a mouse and keyboard. The model works by visually perceiving a web page through screenshots and predicting specific coordinates for actions such as clicking, typing and scrolling.

Crucially, Fara-7B does not rely on ‘accessibility trees’, the underlying code structure that browsers use to describe web pages to screen readers. Instead, it relies solely on pixel-level visual data. This approach allows the agent to interact with websites even if the underlying code is unclear or complex.
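The perception-action loop described above can be sketched in a few lines. This is an illustrative toy, not Microsoft's actual API: the names (`Action`, `run_step`, `predict`) and the toy model are assumptions made for the example.

```python
from dataclasses import dataclass

@dataclass
class Action:
    kind: str          # "click", "type", or "scroll"
    x: int = 0         # pixel coordinates predicted from the screenshot
    y: int = 0
    text: str = ""

def run_step(screenshot: bytes, model) -> Action:
    """One perception-action step: the model sees only pixels (no
    accessibility tree) and emits an action with explicit coordinates."""
    return model.predict(screenshot)

# Toy stand-in for the model: always clicks the center of a 1280x720 page.
class ToyModel:
    def predict(self, screenshot: bytes) -> Action:
        return Action(kind="click", x=640, y=360)

action = run_step(b"\x89PNG...", ToyModel())
```

The key property the sketch captures is that the agent's only input is the raw screenshot; everything else, including where to click, is inferred from pixels.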

According to Yash Lara, Senior PM Lead at Microsoft Research, processing all visual input on the device provides true “pixel sovereignty” because screenshots and the reasoning required for automation remain on the user’s device. “This approach helps organizations meet stringent requirements in regulated industries, including HIPAA and GLBA,” he told VentureBeat in written comments.


In benchmark tests, this visual approach has produced strong results. On WebVoyager, a standard benchmark for web agents, Fara-7B achieved a task success rate of 73.5%. This outperforms larger, more resource-intensive systems, including GPT-4o when prompted to act as a computer-use agent (65.1%) and the native UI-TARS-1.5-7B model (66.4%).

Efficiency is another key differentiator. In comparison tests, Fara-7B completed tasks in an average of about 16 steps, compared to about 41 steps for the UI-TARS-1.5-7B model.

Dealing with risks

However, the transition to autonomous agents is not without risks. Microsoft notes that Fara-7B has the same limitations as other AI models, including possible hallucinations, errors in following complex instructions, and degradation of accuracy on complex tasks.

To limit these risks, the model was trained to recognize ‘critical points’. A critical point is defined as any situation where a user’s personal information or consent is required before an irreversible action takes place, such as sending an email or completing a financial transaction. When such a moment is reached, Fara-7B is designed to pause and explicitly request user permission before continuing.
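The critical-point behavior amounts to a gate in front of irreversible actions. The sketch below illustrates the idea; the action names and function signatures are hypothetical, not Fara-7B's real interface.

```python
# Actions treated as irreversible "critical points" (illustrative list).
IRREVERSIBLE = {"send_email", "submit_payment", "delete_account"}

def requires_approval(action_name: str) -> bool:
    """A critical point: any irreversible action must pause for the user."""
    return action_name in IRREVERSIBLE

def execute(action_name: str, user_approves) -> str:
    """Run the action, pausing for explicit permission at critical points."""
    if requires_approval(action_name) and not user_approves(action_name):
        return "paused: awaiting user permission"
    return f"executed: {action_name}"

# The agent scrolls freely but pauses before sending an email.
print(execute("scroll", user_approves=lambda a: False))      # executed: scroll
print(execute("send_email", user_approves=lambda a: False))  # paused: awaiting user permission
```

In a real deployment the `user_approves` callback would surface a prompt in the agent's UI rather than a lambda.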

Managing this interaction without frustrating the user is a key design challenge. “Balancing robust protections like Critical Points with seamless user journeys is critical,” said Lara. “Having a user interface, like Microsoft Research’s Magentic UI, is critical to giving users the ability to intervene when necessary, while also preventing approval fatigue.” Magentic-UI is a research prototype designed specifically to facilitate these human-agent interactions, and Fara-7B is built to run inside it.

Distilling complexity into one model

The development of Fara-7B highlights a growing trend in knowledge distillation: compressing the capabilities of a complex system into a smaller, more efficient model.


Creating a CUA typically requires vast amounts of training data showing how to navigate the web, and collecting this data through human annotation is prohibitively expensive. To solve this, Microsoft used a synthetic data pipeline built on Magentic-One, a multi-agent framework. In this setup, an “Orchestrator” agent created plans and instructed a “WebSurfer” agent to browse the web, generating 145,000 successful task trajectories.

The researchers then “distilled” this complex interaction data into Fara-7B, which is built on Qwen2.5-VL-7B, a base model chosen for its long context window (up to 128,000 tokens) and strong ability to connect text instructions to visual elements on a screen. Although the data generation required a heavy multi-agent system, Fara-7B itself is a single model, demonstrating that a small model can effectively learn advanced behavior without the need for complex scaffolding at runtime.

The training process relied on supervised fine-tuning, in which the model learns by imitating the successful examples generated by the synthetic pipeline.
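Conceptually, each successful trajectory from the pipeline is flattened into per-step supervised training pairs: (task instruction, screenshot) in, action out. The sketch below illustrates that transformation; the field names and action strings are assumptions for illustration, not the actual data schema.

```python
def to_sft_examples(trajectory):
    """Flatten one successful task trajectory into per-step
    supervised fine-tuning pairs of (instruction, screenshot) -> action."""
    examples = []
    for step in trajectory["steps"]:
        examples.append({
            "input": {
                "instruction": trajectory["task"],
                "screenshot": step["screenshot"],   # pixel observation
            },
            "target": step["action"],               # action to imitate
        })
    return examples

# Hypothetical two-step trajectory produced by the synthetic pipeline.
traj = {
    "task": "Find the cheapest flight to Lisbon",
    "steps": [
        {"screenshot": "step0.png", "action": "type('Lisbon')"},
        {"screenshot": "step1.png", "action": "click(412, 318)"},
    ],
}
pairs = to_sft_examples(traj)
```

Applied across the 145,000 trajectories, this yields the per-step imitation data on which the single model is fine-tuned, with no multi-agent scaffolding needed at inference time.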

Looking ahead

While the current version is trained on static datasets, future iterations will focus on making the model smarter, and not necessarily bigger. “Going forward, we will strive to maintain the small size of our models,” Lara said. “Our ongoing research is focused on making agentic models smarter and more secure, not just bigger.” This includes exploring techniques such as reinforcement learning (RL) in live, sandbox environments, allowing the model to learn through trial and error in real time.

Microsoft has made the model available under an MIT license on Hugging Face and Microsoft Foundry. However, Lara warns that while the license allows commercial use, the model is not yet production-ready. “You can freely experiment and prototype Fara-7B under the MIT license,” he said, “but it is best suited for pilots and proofs-of-concept rather than mission-critical deployments.”


