
From Intent to Execution: How Microsoft is Transforming Large Language Models into Action-Oriented AI

Large Language Models (LLMs) have changed the way we handle natural language processing. They can answer questions, write code, and hold conversations. Yet they fall short when it comes to performing real-world tasks. For example, an LLM can guide you through purchasing a jacket, but it cannot place the order for you. This gap between thinking and doing is a major limitation. People don’t just need information; they want results.

To bridge this gap, Microsoft is turning LLMs into actionable AI agents. By enabling them to plan tasks, break them into steps, and interact with the real world, Microsoft is equipping LLMs to handle practical tasks effectively. This shift has the potential to redefine what LLMs can do, turning them into tools that automate complex workflows and simplify everyday tasks. Let’s take a look at what it takes to make this happen and how Microsoft is approaching the problem.

What LLMs need to take action

If LLMs want to perform tasks in the real world, they must go beyond understanding text. They must deal with digital and physical environments while adapting to changing circumstances. Here are some of the capabilities they need:

  1. Understanding user intent

To act effectively, LLMs must understand user requests. Inputs such as text or voice commands are often vague or incomplete. The system must fill the gaps using its knowledge and the context of the request. Multi-step conversations can help refine these intentions so the AI understands them before taking action.

  2. Converting intentions into actions

After understanding a task, LLMs must convert it into actionable steps. This may involve clicking buttons, calling APIs, or controlling physical devices. LLMs must tailor their actions to the specific task, adapt to the environment, and solve challenges as they arise (see the sketch after this list).

  3. Adapting to change

Tasks in the real world don’t always go as planned. LLMs must anticipate problems, adjust their steps, and find alternatives when something goes wrong. For example, if a necessary resource is unavailable, the system must find another way to complete the task. This flexibility keeps the process from stalling when circumstances change.

  4. Specializing in specific tasks

Although LLMs are designed for general use, specialization makes them more efficient. By focusing on specific tasks, these systems can deliver better results with fewer resources. This is especially important for devices with limited computing power, such as smartphones or embedded systems.
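
Capabilities 2 and 3 amount to a simple control loop: draft a plan, execute each step, and fall back to an alternative route when a step fails. Here is a minimal sketch in Python; the `plan`, `execute`, and `fallback` helpers are hypothetical stand-ins, not Microsoft’s implementation.

```python
# Minimal sketch of a plan-execute-adapt loop. All three helpers are
# hypothetical stand-ins for illustration, not Microsoft's code.

def plan(request):
    # A real system would ask an LLM to decompose the request;
    # here the plan for one example task is hard-coded.
    return ["open document", "select text", "apply highlight"]

def execute(step):
    # Stand-in for a UI action (clicking a button, calling an API).
    print(f"executing: {step}")
    return step != "apply highlight"  # simulate one failing step

def fallback(step):
    # Adaptation: try an alternative route for a failed step,
    # e.g. a keyboard shortcut instead of a ribbon button.
    print(f"retrying {step!r} via an alternative route")
    return True

def run(request):
    for step in plan(request):
        if not execute(step) and not fallback(step):
            raise RuntimeError(f"could not complete step: {step}")
    print("task complete")

run("Highlight the word 'important' in this document")
```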


By developing these skills, LLMs can go beyond just processing information. They can take meaningful actions, paving the way for seamless integration of AI into daily workflows.

How Microsoft is Transforming LLMs

Microsoft’s approach to creating actionable AI follows a structured process. The main goal is to enable LLMs to understand requests, plan effectively, and take action. Here’s how they do it:

Step 1: Collect and prepare data

In the first phase, Microsoft collected data tailored to a specific use case: the UFO Agent (described below). The data includes user queries, environment details, and task-specific actions. This phase gathers two different types of data. First, task-plan data helps LLMs outline the high-level steps required to complete a task; for example, “Change font size in Word” can include steps such as selecting the text and adjusting toolbar settings. Second, task-action data allows LLMs to translate those steps into precise instructions, such as clicking specific buttons or using keyboard shortcuts.

This combination gives the model both the big picture and the detailed instructions it needs to perform tasks effectively.
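
As a concrete illustration, the two record types could be shaped roughly as follows. The field names are invented for this sketch; Microsoft’s actual schema is not public.

```python
# Illustrative shapes for the two data types collected in this phase.
# Field names are invented for this sketch, not Microsoft's schema.

task_plan_record = {
    "request": "Change font size in Word",
    "plan": [
        "select the target text",
        "open the Home tab",
        "adjust the font size in the toolbar",
    ],
}

task_action_record = {
    "step": "adjust the font size in the toolbar",
    "action": {
        "type": "set_value",
        "control": "Font Size",      # name of the UI control
        "control_type": "ComboBox",  # kind of control to operate
        "value": "14",
    },
}
```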

Step 2: Train the model

Once the data is collected, LLMs are refined through multiple training stages. The first stage trains LLMs in task planning, teaching them to break user requests into executable steps. Next, expert-labeled data teaches them to translate these plans into specific actions. To further strengthen their problem-solving skills, LLMs then engage in a self-reinforcing exploration process, tackling previously unsolved tasks and generating new examples for continued learning. Finally, reinforcement learning is applied, using feedback from successes and failures to further improve their decision-making.
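
Put together, the four stages form a pipeline that could be orchestrated roughly as below. Every function here is a placeholder that only mirrors the shape of the process; none of this is Microsoft’s published API.

```python
# Rough sketch of the staged training flow. Every function is a
# placeholder mirroring the shape of the process, not a real API.

def finetune(model, dataset, stage):
    print(f"{stage}: fine-tuning on {len(dataset)} examples")
    return model  # a real pipeline would return updated weights

def explore(model, tasks):
    # Self-reinforcing exploration: attempt unsolved tasks and keep
    # successful trajectories as fresh training examples.
    return tasks  # placeholder: pretend every attempt succeeded

def rl_update(model, outcomes):
    print(f"reinforcement learning on {len(outcomes)} outcomes")
    return model

model = "base-llm"  # placeholder for the actual model object
plan_data, action_data, unsolved = ["..."], ["..."], ["..."]

model = finetune(model, plan_data, "stage 1 (task plans)")
model = finetune(model, action_data, "stage 2 (expert actions)")
model = finetune(model, explore(model, unsolved), "stage 3 (exploration)")
model = rl_update(model, ["success", "failure"])
```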


Step 3: Offline testing

After training, the model is tested in controlled environments to ensure reliability. Metrics like Task Success Rate (TSR) and Step Success Rate (SSR) are used to measure performance. For example, testing a calendar management agent might include verifying its ability to schedule meetings and send invitations without errors.
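
Assuming the usual definitions (TSR counts whole tasks that finish end to end, SSR counts individual steps that execute correctly), both metrics reduce to a few lines:

```python
# TSR and SSR over a set of test episodes, where each episode is a
# list of per-step booleans (True = the step executed correctly).

def task_success_rate(episodes):
    # A task succeeds only if every one of its steps succeeded.
    return sum(all(ep) for ep in episodes) / len(episodes)

def step_success_rate(episodes):
    steps = [s for ep in episodes for s in ep]
    return sum(steps) / len(steps)

# Example: three runs of a calendar-management agent
episodes = [[True, True, True], [True, False, True], [True, True]]
print(task_success_rate(episodes))  # 2/3 of tasks completed
print(step_success_rate(episodes))  # 7/8 of steps succeeded
```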

Step 4: Integration into real systems

Once validated, the model is integrated into an agent framework. This allows it to interact with real environments, such as clicking buttons or navigating menus. Tools such as the UI Automation API help the system dynamically identify and manipulate user-interface elements.

For example, if the agent is tasked with highlighting text in Word, it identifies the highlight button, selects the text, and applies the formatting. A memory component helps the LLM keep track of previous actions, allowing it to adapt to new scenarios.
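
For a sense of what such a UIA-driven action looks like in code, here is a sketch using the open-source pywinauto library, which wraps the same UI Automation APIs. The article does not say UFO is built on pywinauto, and the ribbon control title varies by Word version and locale, so treat both as assumptions.

```python
# Sketch of a UIA-driven highlight action via pywinauto (an
# assumption for illustration; UFO's own tooling may differ).
from pywinauto import Application

# Attach to an already running Word window through the UIA backend
app = Application(backend="uia").connect(title_re=".*Word")
doc = app.top_window()

# Select the text (Ctrl+A for brevity; a real agent would first
# locate the specific word to highlight)
doc.type_keys("^a")

# Find the ribbon button in the UIA tree and click it
# (the control title is version- and locale-dependent)
highlight = doc.child_window(title="Text Highlight Color",
                             control_type="Button")
highlight.click_input()
```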

Step 5: Testing in the real world

The final step is online evaluation. Here the system is tested in real-world scenarios to ensure it can handle unexpected changes and errors. For example, a customer support bot can guide users through a password reset while adjusting for incorrect input or missing information. These tests ensure that the AI is robust and ready for everyday use.

A practical example: the UFO Agent

To demonstrate how actionable AI works, Microsoft created the UFO Agent. This system is designed to perform real tasks in Windows environments, turning user requests into completed actions.

At its core, the UFO Agent uses an LLM to interpret requests and plan actions. For example, if a user says, “Highlight the word ‘important’ in this document,” the agent works with Word to complete the task. It collects contextual information, such as the positions of UI controls, and uses it to plan and execute actions.
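
Conceptually, each turn reduces to an observe-plan-act loop with a memory of what has been done so far. The sketch below is purely illustrative; `choose_action` stands in for the LLM call and is not the UFO Agent’s real interface.

```python
# Conceptual observe-plan-act loop for a UFO-style agent. The helper
# functions are illustrative stubs, not the UFO Agent's interface.

def observe():
    # Collect context: the visible UI controls and their positions
    return [{"name": "Text Highlight Color", "type": "Button"}]

def choose_action(request, controls, history):
    # A real agent sends the request, control list, and action
    # history to an LLM and parses the reply into an action.
    if history:  # stub: pretend one click completes the task
        return None
    return {"click": controls[0]["name"]}

def run_agent(request, max_steps=5):
    history = []  # memory of previous actions in this episode
    for _ in range(max_steps):
        action = choose_action(request, observe(), history)
        if action is None:  # the planner decides the task is done
            break
        print(f"performing {action}")
        history.append(action)
    return history

run_agent("Highlight the word 'important' in this document")
```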


The UFO Agent relies on tools such as the Windows UI Automation (UIA) API. This API scans applications for controls, such as buttons or menus. For a task like “Save the document as PDF,” the agent uses the UIA to identify the “File” button, locate the “Save As” option, and perform the necessary steps. By structuring data consistently, the system ensures smooth operation from training to real-world application.
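
The scanning step itself can be reproduced in a few lines, again using pywinauto as an assumed stand-in for whatever UIA binding UFO uses internally:

```python
# Enumerating the controls an application exposes through the UIA
# tree, as the agent does when mapping an interface. pywinauto is
# an assumed stand-in here; UFO's exact tooling is not specified.
from pywinauto import Application

app = Application(backend="uia").connect(title_re=".*Word")
win = app.top_window()

# List every button UIA exposes, e.g. "File", "Save As", ...
for button in win.descendants(control_type="Button"):
    print(button.window_text())
```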

Overcoming challenges

While this is an exciting development, creating action-oriented AI comes with challenges. Scalability is a big problem. Training and deploying these models for a variety of tasks requires significant resources. Guaranteeing safety and reliability is just as important. Models must perform tasks without unintended consequences, especially in sensitive environments. And since these systems handle private data, maintaining ethical standards around privacy and security is also critical.

Microsoft’s roadmap focuses on improving efficiency, expanding use cases and maintaining ethical standards. With these improvements, LLMs could redefine the way AI interacts with the world, making them more practical, adaptable, and actionable.

The future of AI

Transforming LLMs into action-oriented agents could be a game changer. These systems can automate tasks, simplify workflows and make technology more accessible. Microsoft’s work in actionable AI and tools like the UFO Agent is just the beginning. As AI continues to evolve, we can expect smarter, more capable systems that not only communicate with us, but also get jobs done.
