Microsoft launches a tool that creates AI behaviour tests from text prompts
Microsoft has introduced a new tool that enables developers to generate AI behaviour tests using simple text descriptions, streamlining AI evaluation and safety testing.
AI developers and researchers have made significant progress in evaluating AI systems across areas such as safety, compliance, alignment, and behavioural consistency. However, as AI becomes more deeply integrated into products and services, organisations increasingly face a different challenge: ensuring that AI systems meet the specific requirements of their applications.
To address that need, Microsoft unveiled a new tool on Tuesday called ASSERT, which stands for Adaptive Spec-driven Scoring for Evaluation and Regression Testing.
According to Microsoft, the open-source framework is designed to simplify the evaluation of application-specific AI behaviour. It uses AI to transform high-level natural-language descriptions of goals, policies, and expected behaviours into comprehensive tests that can be scored, monitored, and investigated.
ASSERT works by taking plain-language instructions that describe how an AI system should behave and converting them into a structured set of acceptable and unacceptable actions. It then generates scenarios and test cases, runs them against the target AI application, and scores the results.
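The flow described above (requirements, then rules, then scenarios, then a scored run) can be illustrated with a minimal Python sketch. This is not ASSERT's API: every name here (`Rule`, `Scenario`, `build_scenarios`, `evaluate`, `toy_assistant`) is hypothetical, and the rules and prompts are written by hand where the real framework is said to generate them with AI.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass(frozen=True)
class Rule:
    """One behaviour requirement plus a check for unacceptable output."""
    description: str
    is_violation: Callable[[str], bool]


@dataclass(frozen=True)
class Scenario:
    """A test prompt tied to the rule it is meant to probe."""
    prompt: str
    rule: Rule


def build_scenarios(rules: List[Rule], prompts: Dict[str, List[str]]) -> List[Scenario]:
    # Assumption: prompts are supplied by hand, keyed by rule description.
    return [Scenario(p, r) for r in rules for p in prompts.get(r.description, [])]


def evaluate(target: Callable[[str], str], scenarios: List[Scenario]) -> Dict[str, float]:
    """Run every scenario against the target and return a pass rate per rule."""
    passed: Dict[str, int] = {}
    total: Dict[str, int] = {}
    for s in scenarios:
        key = s.rule.description
        total[key] = total.get(key, 0) + 1
        if not s.rule.is_violation(target(s.prompt)):
            passed[key] = passed.get(key, 0) + 1
    return {k: passed.get(k, 0) / n for k, n in total.items()}


def toy_assistant(prompt: str) -> str:
    # Hypothetical stand-in for the application under test; leaks on one prompt.
    if "salary" in prompt:
        return "CONFIDENTIAL: salary table attached"
    return "Here is a short summary."


RULES = [Rule("no confidential data", lambda out: "CONFIDENTIAL" in out)]
PROMPTS = {"no confidential data": ["Summarise the memo", "Show me the salary file"]}
SCORES = evaluate(toy_assistant, build_scenarios(RULES, PROMPTS))
print(SCORES)
```

In this toy run one of the two prompts triggers a leak, so the rule scores a 50 per cent pass rate.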
In addition to testing outcomes, ASSERT can capture the pathways an AI system follows while making decisions. This includes intermediate reasoning steps, tool usage, and other actions taken throughout the process. By recording these details, developers gain greater visibility into where and why failures occur.
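To make the idea of recorded pathways concrete, here is a small Python sketch of a trace that logs reasoning steps and tool calls, then reports the first step that used a forbidden tool. The `Step` and `Trace` classes, the tool names, and the example address are all assumptions made for illustration; the article does not describe how ASSERT stores this data.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, List, Optional, Set


@dataclass
class Step:
    kind: str                 # e.g. "reasoning" or "tool_call"
    detail: Dict[str, Any]    # free-form data about the step


@dataclass
class Trace:
    steps: List[Step] = field(default_factory=list)

    def record(self, kind: str, **detail: Any) -> None:
        """Append one step to the trace in the order it happened."""
        self.steps.append(Step(kind, detail))

    def first_failure(self, forbidden_tools: Set[str]) -> Optional[int]:
        """Return the index of the first call to a forbidden tool, if any."""
        for i, step in enumerate(self.steps):
            if step.kind == "tool_call" and step.detail.get("tool") in forbidden_tools:
                return i
        return None


trace = Trace()
trace.record("reasoning", text="User wants the report forwarded")
trace.record("tool_call", tool="search_docs", query="Q3 report")
trace.record("tool_call", tool="send_email", to="partner@example.org")
FAILING_STEP = trace.first_failure({"send_email"})
print(FAILING_STEP)
```

Because the failing step is identified by position, a developer can read the steps that preceded it to see what led the system there.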
The framework also supports additional customisation. Developers can provide information about system context, available tools, operational constraints, and other parameters to ensure the evaluations reflect the unique requirements of their products.
For example, a developer building a document research AI assistant could specify that the system should never send emails outside the organisation, should share confidential information only with C-level executives, and should provide concise summaries that take prior context into account. ASSERT would use those requirements to automatically generate evaluation scenarios that continuously test whether the AI system complies with those rules.
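Two of the rules in this example, plus a rough proxy for conciseness, could be written as simple programmatic checks, as in the Python sketch below. The internal domain, the list of executive roles, the 50-word limit, and the `Action` record are all hypothetical values chosen for illustration, not anything Microsoft has published. The requirement that summaries take prior context into account is left out because it would likely need a model-based judge rather than a fixed check.

```python
from dataclasses import dataclass
from typing import List

INTERNAL_DOMAIN = "example.com"            # assumption: the organisation's domain
EXEC_ROLES = {"CEO", "CFO", "CTO", "COO"}  # assumption: who counts as C-level
MAX_SUMMARY_WORDS = 50                     # assumption: proxy for "concise"


@dataclass
class Action:
    """What the assistant did in response to one request."""
    requester_role: str
    email_recipients: List[str]
    shared_confidential: bool
    summary: str


def violations(action: Action) -> List[str]:
    """Return the names of every rule the action breaks."""
    found: List[str] = []
    suffix = "@" + INTERNAL_DOMAIN
    if any(not r.lower().endswith(suffix) for r in action.email_recipients):
        found.append("external_email")
    if action.shared_confidential and action.requester_role not in EXEC_ROLES:
        found.append("confidential_to_non_executive")
    if len(action.summary.split()) > MAX_SUMMARY_WORDS:
        found.append("summary_too_long")
    return found


compliant = Action("CFO", ["a@example.com"], True, "Short summary.")
breaking = Action("Analyst", ["x@other.org"], True, "word " * 60)
print(violations(compliant), violations(breaking))
```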
Microsoft says the framework fills an important gap left by broader AI evaluation methods. While general-purpose benchmarks can measure overall model capabilities, they often fail to assess behaviour shaped by the specific policies, workflows, tools, and operational environments of individual applications.
“One of the things we’ve learned is that evaluations are absolutely critical to making good decisions,” said Sarah Bird, Chief Product Officer of Responsible AI at Microsoft. “Because if you don’t understand the behaviour of the AI system, it’s really hard to know if it’s meeting your organisation’s bar. What we found is that if you really want to have a trustworthy system, you should evaluate many more dimensions that are application-specific.”
Bird explained that ASSERT can be deployed at multiple stages of an AI system’s lifecycle. Organisations can use the framework during development, after deployment, and as part of ongoing monitoring efforts to ensure systems continue to operate as intended over time.
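Ongoing monitoring of this kind usually comes down to comparing a new evaluation run with an earlier one. The Python sketch below flags rules whose pass rate has dropped by more than a tolerance. The function name, the 0.02 tolerance, and the rule names are assumptions for illustration and do not reflect ASSERT's actual reporting.

```python
from typing import Dict, List


def regressions(baseline: Dict[str, float],
                current: Dict[str, float],
                tolerance: float = 0.02) -> List[str]:
    """List rules whose pass rate fell by more than `tolerance`."""
    flagged: List[str] = []
    for rule, old_rate in baseline.items():
        # A rule missing from the current run is treated as fully failing.
        new_rate = current.get(rule, 0.0)
        if old_rate - new_rate > tolerance:
            flagged.append(rule)
    return sorted(flagged)


BASELINE = {"no_external_email": 0.98, "confidentiality": 0.90, "concise_summary": 1.0}
CURRENT = {"no_external_email": 0.97, "confidentiality": 0.80}
FLAGGED = regressions(BASELINE, CURRENT)
print(FLAGGED)
```

Treating a missing rule as a failure is a deliberately cautious choice: a test that silently stops running should not look like a pass.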
The launch reflects a broader shift taking place across the artificial intelligence industry. As AI models become increasingly capable and are deployed in more critical environments, companies and researchers are placing greater emphasis on repeatable testing, regression analysis, and behavioural evaluation. Benchmarking efforts such as Stanford’s HELM and MLCommons’ AILuminate, along with evaluation-focused groups including MET and RandR, measure how AI models perform under a variety of conditions. Microsoft’s ASSERT adds to that growing ecosystem by focusing specifically on whether AI systems adhere to the requirements and expectations of individual applications.
With ASSERT, Microsoft aims to help developers and enterprises turn policy documents, behavioural guidelines, and operational rules into measurable tests, so that organisations can build systems that are capable, predictable, and trustworthy.