
Google's AI can now surf the web for you, click on buttons, and fill out forms with Gemini 2.5 Computer Use

Some of the biggest providers of large language models (LLMs) have tried to go beyond multimodal chatbots, expanding their models into “agents” that can take actions on websites on behalf of the user. Think of OpenAI’s ChatGPT Agent (previously known as “Operator”) and Anthropic’s Computer Use, both released in the past two years.

Now Google is in the same game. Today, the search giant’s DeepMind AI lab unveiled a new, refined version of its powerful Gemini 2.5 Pro LLM known as “Gemini 2.5 Computer Use,” which can use a virtual browser to surf the web on your behalf, pull up information, fill in forms, and even take actions on websites, all from a single text prompt from a user.

“These are early days, but the model’s ability to interact with the web, such as scrolling, filling out forms and navigating drop-down menus, is an important next step in building general agents,” said Google CEO Sundar Pichai in a longer post on the social network X.

However, the model is not available to consumers directly from Google.

Instead, Google partnered with another company, Browserbase, founded by former Twilio engineer Paul Klein in early 2024, which offers virtual “headless” web browsers designed specifically for use by AI agents and applications. (A “headless” browser is one that does not require a graphical user interface, or GUI, to navigate the web, although in this case and others Browserbase shows a graphical representation to the user.)

Users can demo the new Gemini 2.5 Computer Use model directly on Browserbase here and even compare it side by side with the older, rival offerings from OpenAI and Anthropic in a new “Browser Arena” launched by the startup (although only one additional model can be selected alongside Gemini at a time).

For AI builders and developers, it is available as a proprietary LLM through the Gemini API in Google AI Studio for rapid prototyping, and through Vertex AI, Google Cloud’s model and application building platform.

The new offering builds on the capabilities of Gemini 2.5 Pro, released in March 2025 but since updated considerably several times, with a specific focus on enabling AI agents to interact directly with user interfaces, including browsers and mobile applications.

In general, Gemini 2.5 Computer Use appears designed to let developers build agents that can complete interface tasks autonomously, such as clicking, typing, scrolling, filling in forms, and navigating behind login screens.


Instead of relying solely on APIs or structured inputs, this model enables AI systems to interact visually and functionally with software, much like a person would.

Brief hands-on tests

In my brief, unscientific initial hands-on tests on the Browserbase website, Gemini 2.5 Computer Use navigated to Taylor Swift’s official website as instructed and gave me a summary of what was being sold or promoted at the top of the page for her latest album, “The Life of a Showgirl.”

In another test, I asked Gemini 2.5 Computer Use to search Amazon for highly rated solar lamps I could use in my backyard, and I was pleasantly surprised to watch it successfully complete a Google search CAPTCHA designed to weed out bots (“Select all the boxes with a motorcycle”).

Once it got through, however, it stalled and was unable to complete the task, despite serving up a “task complete” message.

I should also note that, while OpenAI’s ChatGPT Agent and Anthropic’s Claude can create and edit local files, such as PowerPoint presentations, spreadsheets, or text documents, on behalf of the user, Gemini 2.5 Computer Use currently has no access to file systems or native file creation.

Instead, it is designed to operate and navigate web and mobile user interfaces through actions such as clicking, typing, and scrolling. Its output is limited to proposed UI actions or chatbot-style text responses; any structured output such as a document or file must be handled separately by the developer, often via custom code or third-party integrations.

Performance benchmarks

Google says that Gemini 2.5 Computer Use has demonstrated leading results on multiple interface-control benchmarks, particularly in comparison with other large AI systems, including Claude Sonnet and OpenAI’s agent-based models.

Evaluations were carried out via Browserbase and Google’s own tests.

Some highlights include:

  • Online-Mind2Web (Browserbase): 65.7% for Gemini 2.5 versus 61.0% (Claude Sonnet 4) and 44.3% (OpenAI agent)

  • WebVoyager (Browserbase): 79.9% for Gemini 2.5 versus 69.4% (Claude Sonnet 4) and 61.0% (OpenAI agent)

  • AndroidWorld (DeepMind): 69.7% for Gemini 2.5 versus 62.1% (Claude Sonnet 4); the OpenAI model could not be measured due to lack of access

  • OSWorld: currently not supported by Gemini 2.5; the top competitor’s result was 61.4%


In addition to strong accuracy, Google reports that the model operates at lower latency than other browser-control solutions, a key factor in production use cases such as UI automation and testing.

How it works

Agents powered by the Computer Use model work within an iterative loop. They receive:

  • A user task prompt

  • A screenshot of the interface

  • A history of recent actions

The model analyzes these inputs and produces a recommended UI action, such as clicking a button or typing into a field.

If necessary, it can request end-user confirmation for riskier tasks, such as making a purchase.

Once the action has been carried out, the interface state is updated and a new screenshot is returned to the model. The loop continues until the task is complete or stops due to an error or a safety decision.

The model uses a specialized tool called computer_use, and it can be integrated into custom environments with the help of tools such as Playwright or via the Browserbase demo sandbox.
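The loop described above can be sketched in a few lines of Python. This is an illustrative skeleton only: the model call, browser hooks, and action names here are stand-ins, not the actual Gemini API or Browserbase interfaces.

```python
# Illustrative sketch of the screenshot -> proposed action -> execute loop.
# All callables are injected stand-ins; a real integration would call the
# Gemini API's computer_use tool and drive a browser via e.g. Playwright.

from dataclasses import dataclass, field


@dataclass
class Action:
    name: str                          # e.g. "click_at", "type_text_at", "done"
    args: dict = field(default_factory=dict)
    needs_confirmation: bool = False   # riskier steps, e.g. purchases


def run_agent(task, take_screenshot, propose_action, execute, confirm,
              max_steps=20):
    """Loop: (task, screenshot, history) -> action -> optional confirm -> execute."""
    history = []
    for _ in range(max_steps):
        # The model sees the task, the current screenshot, and recent actions.
        action = propose_action(task, take_screenshot(), history)
        if action.name == "done":
            return history
        # Riskier actions pause for end-user confirmation before executing.
        if action.needs_confirmation and not confirm(action):
            break
        execute(action)                # update the interface state
        history.append(action)         # new screenshot is taken next iteration
    return history
```

In a real integration, propose_action would call the Gemini API with the computer_use tool enabled, and execute would translate the returned action into browser commands.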

Use cases and adoption

According to Google, teams have already started using the model in different domains, both internally and externally:

  • Google’s payments platform team reports that Gemini 2.5 Computer Use successfully repaired more than 60% of failed test runs, reducing an important source of inefficiency for its engineers.

  • Autotab, a third-party AI agent platform, said that the model outperformed others on complex data-parsing tasks, increasing performance by up to 18% in their most difficult evaluations.

  • Poke.com, a proactive AI assistant provider, noted that the Gemini model often works 50% faster than competing solutions during interface interactions.

The model is also used in Google’s own product development efforts, including Project Mariner, the Firebase testing agent, and AI Mode in Search.

Safety measures

Because this model controls software interfaces, Google emphasizes a multi-layered safety approach:

  • A per-step safety service inspects every proposed action before it is executed.

  • Developers can define system-level instructions to block, or require confirmation for, specific actions.

  • The model contains built-in safeguards to prevent actions that could endanger security or violate Google’s prohibited-use policy.

For example, if the model encounters a CAPTCHA, it generates an action to click the checkbox but flags it as requiring user confirmation, so the system does not proceed without human supervision.
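A minimal sketch of how such a per-step gate might look, assuming hypothetical action names and policy sets (these are illustrative only, not Google’s actual safety service or policies):

```python
# Hypothetical per-step safety gate mirroring the flow described above:
# some actions are blocked outright, others pause for human confirmation.
# Action names and policy sets are made up for illustration.

BLOCKED = {"bypass_security_control"}
NEEDS_CONFIRMATION = {"make_purchase", "solve_captcha"}


def vet_action(action_name, ask_user):
    """Return True if the action may proceed, False if blocked or declined."""
    if action_name in BLOCKED:
        return False                 # built-in safeguard: never execute
    if action_name in NEEDS_CONFIRMATION:
        return ask_user(action_name)  # human-in-the-loop confirmation
    return True                      # routine actions pass through
```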

See also  Boston Dynamics CEO Robert Playter steps down after 30 years at the company

Technical capabilities

The model supports a wide range of built-in UI actions, such as:

  • click_at, type_text_at, scroll_document, drag_and_drop, and more

  • User-defined functions can be added to extend the action set for mobile or custom environments

  • Screen coordinates are normalized (0-1000 scale) and translated into pixel dimensions at execution time

It accepts image and text input and outputs text responses or function calls to perform tasks. The recommended screen resolution for optimal results is 1440×900, although it can work with other sizes.
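The coordinate normalization can be illustrated with a small helper, assuming the 0-1000 grid and the recommended 1440×900 resolution (the helper name is ours, not part of the API):

```python
# Map model coordinates on a normalized 0-1000 grid to actual screen pixels,
# as done at execution time. Defaults assume the recommended 1440x900 screen.

def to_pixels(norm_x, norm_y, width=1440, height=900):
    """Translate normalized (0-1000) coordinates into pixel coordinates."""
    return round(norm_x / 1000 * width), round(norm_y / 1000 * height)
```

For example, a click at normalized (500, 500) lands at the center of the screen regardless of the actual resolution.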

API pricing remains almost identical to Gemini 2.5 Pro

Pricing for Gemini 2.5 Computer Use closely matches the standard Gemini 2.5 Pro model. Both follow the same per-token billing structure: input tokens are priced at $1.25 per million tokens for prompts under 200,000 tokens, and $2.50 per million tokens for prompts longer than that.

Output tokens follow a similar split: $10.00 per million below the 200,000-token prompt threshold and $15.00 per million above it.
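Based on the rates quoted above, a rough per-request cost estimate can be sketched as follows (a simplification that ignores context caching, free-tier quotas, and any other surcharges):

```python
# Rough cost estimate from the published per-million-token rates:
# prompts up to 200k tokens: $1.25 in / $10.00 out per million tokens;
# longer prompts: $2.50 in / $15.00 out per million tokens.

def estimate_cost(input_tokens, output_tokens):
    """Return an approximate cost in USD for one request."""
    long_prompt = input_tokens > 200_000
    in_rate = 2.50 if long_prompt else 1.25
    out_rate = 15.00 if long_prompt else 10.00
    return (input_tokens * in_rate + output_tokens * out_rate) / 1_000_000
```

For instance, a 100,000-token prompt with a 10,000-token response would cost roughly $0.23.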

Where the models vary is in availability and extra features.

Gemini 2.5 Pro includes a free tier that enables developers to use the model free of charge, published without an explicit token cap, although usage may be subject to rate limits or quota reductions depending on the platform (e.g., Google AI Studio).

This free access covers both input and output tokens. Once developers exceed their allotted quota or switch to the paid tier, standard per-token pricing applies.

By contrast, Gemini 2.5 Computer Use is available exclusively via the paid tier. There is no free access currently offered for this model, and all usage incurs token-based costs from the start.

In terms of features, Gemini 2.5 Pro supports optional capabilities such as context caching (starting at $0.31 per million tokens) and grounding with Google Search (free for up to 1,500 requests per day, then $35 per 1,000 additional requests). These are currently not available for Computer Use.

Another distinction is in data handling: output from the Computer Use model is not used to improve Google products in the paid tier, while free-tier use of Gemini 2.5 Pro does contribute to model improvement unless users explicitly opt out.

In general, developers can expect comparable token-based costs across both models, but they should weigh tier access, caching options, and data-use policies when deciding which model meets their needs.
