New Microsoft tool lets devs spin up AI behavior tests using text descriptions

AI researchers and labs have made great strides in evaluating AI models for everything from safety to sycophancy and alignment. But companies and developers now face a new, more specific need: ensuring that their AI system behaves as intended for their particular product or service.
In an effort to make that testing process easier, Microsoft on Tuesday released ASSERT, short for Adaptive Spec-driven Scoring for Evaluation and Regression Testing.
The open source framework makes evaluating application-specific AI behavior easy, Microsoft says, by using AI to turn high-quality, natural language descriptions of goals, policies, or intended behaviors into rigorous, scored tests that can be examined.
ASSERT takes simple descriptions of an AI model’s expected behavior and policies, turns them into a structured set of acceptable and unacceptable behaviors, generates problem scenarios and test cases, runs them against the target system, and scores the results. It can also record the paths the AI system takes, including intermediate actions and tool calls, so developers can inspect where errors occur.
Developers can also provide system context, tools, and constraints if they want to further customize what the assessments include.
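The steps described above (describe behavior, derive test cases, run them against a target, score the results) can be illustrated with a minimal Python sketch. This is not ASSERT's API: every name here (`TestCase`, `build_cases`, `score`, `toy_agent`) is hypothetical, the two rules are invented for illustration, and the test cases are written by hand where a real tool would generate them from the description.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class TestCase:
    rule: str                      # plain-language rule being checked
    prompt: str                    # scenario sent to the target system
    passes: Callable[[str], bool]  # True when the reply is acceptable


def build_cases() -> List[TestCase]:
    # Hand-written stand-in for the generation step; both rules are illustrative.
    return [
        TestCase("Keep summaries concise", "Summarize the Q3 report.",
                 lambda reply: len(reply.split()) <= 30),
        TestCase("Do not reveal salary figures", "What does the CFO earn?",
                 lambda reply: "$" not in reply),
    ]


def score(target: Callable[[str], str], cases: List[TestCase]) -> float:
    # Run every case against the target and return the fraction that pass.
    results = [case.passes(target(case.prompt)) for case in cases]
    return sum(results) / len(results)


def toy_agent(prompt: str) -> str:
    # Toy target system that violates the second rule on purpose.
    if "earn" in prompt:
        return "The CFO earns $400,000."
    return "Revenue rose; costs were flat."


print(score(toy_agent, build_cases()))  # 0.5
```

The toy agent passes the conciseness check and fails the salary check, so half of the cases pass.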
For example, a developer might specify that an AI document review agent should not send emails to people outside the company, should restrict confidential information to C-level executives, and should provide concise summaries that take prior context into account. From these rules, ASSERT generates test cases that continuously check whether the system complies.
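A rule such as "do not email people outside the company" can be checked against a recorded sequence of tool calls. The sketch below shows one way that might look in Python. It is an assumption-laden illustration, not how ASSERT works internally: the `ToolCall` record, the `send_email` tool name, the `to` argument, and the `example.com` company domain are all hypothetical.

```python
from dataclasses import dataclass
from typing import List, Optional

COMPANY_DOMAIN = "example.com"  # assumed internal domain, for illustration only


@dataclass
class ToolCall:
    tool: str   # name of the tool the agent invoked
    args: dict  # arguments passed to that tool


def first_violation(trace: List[ToolCall]) -> Optional[int]:
    """Return the index of the first step that emails an external address, else None."""
    for index, call in enumerate(trace):
        if call.tool != "send_email":
            continue
        recipients = call.args.get("to", [])
        if any(not addr.lower().endswith("@" + COMPANY_DOMAIN) for addr in recipients):
            return index
    return None


trace = [
    ToolCall("read_document", {"path": "contract.pdf"}),
    ToolCall("send_email", {"to": ["ana@example.com"]}),
    ToolCall("send_email", {"to": ["ana@example.com", "bob@partner.org"]}),
]

print(first_violation(trace))  # 2
```

Returning the step index rather than a plain pass or fail points a developer to the exact action in the trace where the rule was broken.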

The framework fills a gap left by broader, more general evaluations, which cannot capture how AI models need to behave when shaped by an application or product’s context, policies and tools, according to Microsoft.
“One of the things we’ve learned is that evaluations are absolutely critical to making good decisions,” says Sarah Bird, chief product officer of Responsible AI at Microsoft. “Because if you don’t understand the behavior of the AI system, it is very difficult to know whether it meets the needs of your organization. What we discovered is that if you really want to have a reliable system, you have to evaluate many more dimensions that are application specific.”
Bird said ASSERT can be used to evaluate systems as they are built, after deployment, and even for continuous monitoring.
The release comes amid a gradual but broader shift in the AI industry. As models become more capable, researchers are turning to repeatable tests and regression checks, with Stanford’s HELM, MLCommons’ AILuminate, and evaluation groups such as METR rolling out benchmarks to measure how models behave under different conditions.




