Google's 'Watch & Learn' framework cracks the data bottleneck for training computer-use agents

A new framework developed by researchers at Google Cloud and DeepMind aims to address one of the key challenges in developing computer use agents (CUAs): collecting high-quality training examples at scale.

The framework, called Watch & Learn (W&L), tackles the problem of training data generation in a way that requires no human annotation and can automatically extract demonstrations from raw videos.

Their experiments show that data generated by W&L can be used to train new computer-use models or fine-tune existing foundation models, improving their performance on computer-use tasks. Just as importantly, the same approach can produce in-context learning (ICL) examples for computer-use agents, allowing companies to build CUAs for bespoke internal tasks without the expense of training specialized models.

The CUA data bottleneck

The Internet is rich with video tutorials and screencasts that describe complex workflows for using applications. These videos are a goldmine that can provide computer-use agents with domain knowledge and instructions for performing various tasks through user interface interactions.

However, before they can be used to train CUA agents, these videos must be converted into annotated trajectories (i.e., a series of task descriptions, screenshots, and actions), a process that is prohibitively expensive and time-consuming if done manually.

Existing approaches to this data bottleneck rely on annotating videos with multimodal language models, which usually yields low-precision, error-prone examples. Another approach uses self-play agents that autonomously explore user interfaces to collect trajectories. However, techniques of this kind usually produce simple examples that fail to generalize to unpredictable real-world situations.

As the researchers note in their paper, “In general, these approaches either rely on fragile heuristics, are costly because they rely on explorations in real environments, or generate low-complexity demonstrations that do not match human intentions.”

Watch & Learn

The Watch & Learn framework attempts to address the challenges of creating CUA demonstrations by rethinking the problem formulation.

Rather than directly generating trajectories or relying on complex multi-phase pipelines, the researchers frame the problem as an “inverse dynamics objective”: predict, given two consecutive observations, the intervening action that caused the transition.

According to the researchers, this formulation is “easier to learn, avoids handcrafted heuristics, and generalizes robustly across applications.”
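The inverse dynamics interface can be illustrated with a toy sketch. The real IDM is a small transformer over screenshots; the lookup-table model, the observation strings, and the action syntax below are all illustrative assumptions, not the paper's implementation.

```python
from dataclasses import dataclass

@dataclass
class Transition:
    """One state transition: two consecutive UI observations
    and the action that caused the change."""
    obs_before: str  # placeholder for a screenshot / encoded frame
    obs_after: str
    action: str      # e.g. "click(menu)" or "scroll(down)"

class LookupIDM:
    """Toy stand-in for the inverse dynamics model: memorizes
    (obs_before, obs_after) -> action pairs to show the interface."""
    def __init__(self):
        self.table = {}

    def fit(self, transitions):
        for t in transitions:
            self.table[(t.obs_before, t.obs_after)] = t.action

    def predict_action(self, obs_before, obs_after):
        # A trained IDM generalizes; this stub only recalls seen pairs.
        return self.table.get((obs_before, obs_after), "noop")

transitions = [
    Transition("home_screen", "menu_open", "click(menu)"),
    Transition("menu_open", "settings_page", "click(settings)"),
]
idm = LookupIDM()
idm.fit(transitions)
```

The key design point is that the model never has to plan a whole trajectory; it only answers the local question "what single action explains this before/after pair?", which is what makes the objective easier to learn.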

The W&L framework can be broken down into three major phases: training an inverse dynamics model (IDM), retrieving raw videos, and training CUA agents.

In the first phase, the researchers used agents to interact with live web pages to create a large corpus of 500,000 state transitions (two consecutive observations and the action that led to the transition). They then used this data, along with 132,000 human-annotated transitions from existing open datasets, to train an inverse dynamics model (IDM) that takes two consecutive observations and predicts the transition action. Their trained IDM, a small transformer model, outperformed off-the-shelf foundation models at predicting transition actions.

The researchers then designed a pipeline that pulls videos from platforms like YouTube and runs them through the IDM to generate high-quality trajectories. The IDM examines consecutive video frames and infers the actions (scrolls, clicks) that caused the changes in the environment, which are then packaged into annotated trajectories. Using this method, they generated 53,125 trajectories with highly accurate action labels.
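The labeling step above amounts to sliding a window over consecutive frame pairs and asking the IDM for the action behind each transition. The sketch below assumes a hypothetical `predict_action` interface and a stub model; the real pipeline operates on actual video frames.

```python
class StubIDM:
    """Placeholder for the trained inverse dynamics model. The real
    model compares the two screenshots; this stub returns a fixed
    action so the pipeline shape can be demonstrated."""
    def predict_action(self, frame_before, frame_after):
        return "scroll(down)"

def frames_to_trajectory(frames, idm, task):
    """Hypothetical sketch: pair up consecutive frames, let the IDM
    label each transition, and package the result as one trajectory."""
    steps = [
        {"observation": before, "action": idm.predict_action(before, after)}
        for before, after in zip(frames, frames[1:])
    ]
    return {"task": task, "steps": steps}

# Three frames yield two transitions, hence two labeled steps.
trajectory = frames_to_trajectory(
    ["frame0.png", "frame1.png", "frame2.png"],
    StubIDM(),
    "Export a spreadsheet as PDF",
)
```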

These examples can be used to train effective CUA models for specific tasks. But the researchers also found that trajectories extracted via the IDM can serve as in-context learning examples to improve CUAs' performance on bespoke tasks at inference time. For ICL, they use Gemini 2.5 Flash to add reasoning annotations to the observation/action examples in the trajectories, which can then be inserted into the CUA agent's prompt at inference time (usually 3-5 examples).
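Assembling the ICL prompt can be sketched as follows. The field names and prompt layout here are assumptions for illustration, not the paper's actual format.

```python
def build_icl_prompt(new_task, demonstrations, k=4):
    """Hypothetical prompt assembly: prepend up to k annotated
    demonstrations (observation / reasoning / action triples)
    before the new task the agent must solve."""
    blocks = []
    for demo in demonstrations[:k]:
        lines = [f"Example task: {demo['task']}"]
        for step in demo["steps"]:
            lines.append(f"  Observation: {step['observation']}")
            lines.append(f"  Reasoning: {step['reasoning']}")
            lines.append(f"  Action: {step['action']}")
        blocks.append("\n".join(lines))
    blocks.append(f"Task: {new_task}")
    return "\n\n".join(blocks)

demo = {
    "task": "Open settings",
    "steps": [{
        "observation": "home_screen",
        "reasoning": "The gear icon opens the settings page.",
        "action": "click(gear_icon)",
    }],
}
prompt = build_icl_prompt("Change the display language", [demo])
```

Because the demonstrations are injected at inference time, swapping in trajectories for a different internal workflow changes the agent's behavior without any retraining.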

“This dual role (in-context training and guidance) enables flexible integration with both open-source models and general-purpose agents,” the researchers write.

W&L in action

To test the usefulness of W&L, the researchers conducted a series of experiments with closed and open-source models on the OSWorld benchmark, which evaluates agents in real desktop and operating-system environments across a variety of tasks, including productivity, programming, and design.

For fine-tuning, they used their corpus of 53,000 trajectories to train two open-source models: UI-TARS-1.5, a strong open-source vision-language-action model designed specifically for computer use, and Qwen 2.5-VL, a multimodal open-weight LLM.

For in-context learning tests, they applied W&L examples to general-purpose multimodal models such as Gemini 2.5 Flash, OpenAI o3, and Claude Sonnet 4.

W&L resulted in improvements on OSWorld across all model categories, including up to 3 points for ICL with general-purpose models and up to 11 points for fine-tuned open-source models.

More importantly, these benefits were achieved without any manual annotation, “demonstrating that web-scale human workflows can serve as a practical and scalable foundation for advancing CUAs toward real-world implementation,” the researchers write.

This could have important implications for real-world applications, allowing companies to turn their existing corpora of videos and conference recordings into training data for CUAs. It also makes it easier to generate new training data: record videos of various tasks being performed and have them annotated by an IDM. And as frontier models continue to improve and become cheaper, you can expect to extract more value from existing data as the field evolves.
