Will updating your AI agents help or hamper their performance? Raindrop's new Experiments tool tells you


It seems as though new large language models (LLMs) from OpenAI and rival labs have been released almost every week in the two years since ChatGPT launched. Businesses are struggling to keep up with the sheer pace of change, let alone understand how to adapt to it. Which, if any, of these new models should they use to support their workflows and the custom AI agents they build to execute them?
Help has arrived: AI application observability startup Raindrop has launched Experiments, a new analytics feature that the company describes as the first A/B testing suite designed specifically for enterprise AI agents. It lets companies see and compare how updating agents to new underlying models, or changing their instructions and tool access, affects their performance with real end users.
The release extends Raindrop’s existing observability tools, giving developers and teams a way to see how their agents behave and evolve in real-world conditions.
With Experiments, teams can track how changes (such as a new tool, prompt, model update, or full pipeline refactor) affect AI performance across millions of user interactions. The new feature is now available to users on Raindrop’s Pro plan ($350 per month) at raindrop.ai.
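To make the idea concrete, here is a minimal sketch of what such an agent A/B experiment could look like in code, assuming a stable 50/50 split of users between a baseline and a variant configuration. The `Experiment` class and its method names are illustrative assumptions, not Raindrop’s actual API.

```python
# Hypothetical sketch of an agent A/B experiment. The Experiment class and its
# methods are illustrative assumptions, not Raindrop's actual API.
import hashlib
import random
from dataclasses import dataclass, field

@dataclass
class Experiment:
    name: str
    baseline: dict   # e.g. {"model": "old-model", "prompt": "v1"}
    variant: dict    # e.g. {"model": "new-model", "prompt": "v2"}
    results: dict = field(default_factory=lambda: {"baseline": [], "variant": []})

    def assign(self, user_id: str) -> str:
        # Stable 50/50 split: the same user always lands in the same arm.
        digest = int(hashlib.md5(user_id.encode()).hexdigest(), 16)
        return "variant" if digest % 2 else "baseline"

    def log_interaction(self, arm: str, task_succeeded: bool) -> None:
        self.results[arm].append(task_succeeded)

    def success_rates(self) -> dict:
        return {arm: sum(r) / len(r) for arm, r in self.results.items() if r}

# Simulate a day of traffic in which the variant succeeds slightly more often.
exp = Experiment("model-upgrade", {"model": "old-model"}, {"model": "new-model"})
for i in range(2000):
    arm = exp.assign(f"user-{i}")
    exp.log_interaction(arm, random.random() < (0.82 if arm == "variant" else 0.78))
print(exp.success_rates())
```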
A data-driven look at agent development
Raindrop co-founder and Chief Technology Officer Ben Hylak notes in a product announcement video that Experiments helps teams see “how literally everything has changed,” including tool usage, user intent, and problem rates, and explore differences by demographic factors such as language. The goal is to make model iteration more transparent and measurable.
The Experiments interface presents results visually and shows when an experiment is performing better or worse than the baseline. An increase in negative signals may indicate more frequent task failures or partial code execution, while improvements in positive signals may reflect more complete responses or better user experiences.
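The sketch below illustrates the kind of baseline-versus-experiment signal comparison this describes, assuming per-signal rates are already computed; the signal names and their grouping into positive and negative signals are assumptions, not Raindrop’s actual schema.

```python
# Illustrative baseline-vs-experiment signal comparison; signal names and the
# positive/negative grouping are assumptions, not Raindrop's schema.
NEGATIVE_SIGNALS = {"task_failure", "partial_code_execution"}

def signal_deltas(baseline: dict, experiment: dict) -> dict:
    """Per-signal change in rate (experiment minus baseline)."""
    return {s: experiment.get(s, 0.0) - baseline[s] for s in baseline}

baseline_rates = {"task_failure": 0.06, "partial_code_execution": 0.03, "thumbs_up": 0.41}
experiment_rates = {"task_failure": 0.09, "partial_code_execution": 0.02, "thumbs_up": 0.44}

for signal, delta in signal_deltas(baseline_rates, experiment_rates).items():
    worse = delta > 0 if signal in NEGATIVE_SIGNALS else delta < 0
    print(f"{signal}: {delta:+.2%} ({'worse' if worse else 'better'} than baseline)")
```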
By making this data easy to interpret, Raindrop encourages AI teams to approach agent iteration with the same rigor as modern software deployment: tracking results, sharing insights, and addressing regressions before they can worsen.
Background: from AI observability to experimentation
Raindrop’s launch of Experiments builds on the company’s foundation as one of the first AI-native observability platforms, designed to help companies monitor and understand how their generative AI systems behave in production.
As VentureBeat reported earlier this year, the company – originally known as Dawn AI – was created to tackle what Hylak, a former Apple human interface designer, calls the “black box problem” of AI performance, helping teams catch errors “as they happen” and explaining to companies what went wrong and why.
At the time, Hylak described how “AI products fail all the time – in both hilarious and terrifying ways,” noting that unlike traditional software, which throws obvious exceptions, “AI products fail silently.” Raindrop’s original platform focused on detecting these silent failures by analyzing signals such as user frustration, task failures, refusals, and other conversational anomalies across millions of daily events.
The company’s co-founders – Hylak, Alexis Gauba, and Zubin Singh Koticha – built Raindrop after experiencing firsthand the difficulty of debugging AI systems in production.
“We started building AI products, not infrastructure,” Hylak told VentureBeat. “But we soon saw that to build anything serious, we needed tools to understand AI behavior – and that tooling didn’t exist.”
With Experiments, Raindrop expands that mission from detecting errors to measuring improvements. The new tool turns observability data into actionable comparisons, letting companies test whether changes to their models, prompts, or pipelines actually make their AI agents better – or just different.
Solving the “evals pass, agents fail” problem
Traditional evaluation frameworks, while useful for benchmarking, rarely capture the unpredictable behavior of AI agents operating in dynamic environments.
As Raindrop co-founder Alexis Gauba explained in her LinkedIn announcement, “Traditional evals don’t really answer this question. They’re great unit tests, but they can’t predict what your users will do when your agent spends hours calling hundreds of tools.”
Gauba said the company consistently heard a common frustration from teams: “Evals pass, agents fail.”
Experiments is intended to bridge that gap by showing what actually changes when developers ship updates to their systems.
The tool makes it possible to compare models, tools, user intents, or other attributes side by side, revealing measurable differences in behavior and performance.
Designed for AI behavior in the real world
In the announcement video, Raindrop described Experiments as a way to “compare everything and measure how your agent’s behavior actually changed in production across millions of real interactions.”
The platform helps users identify issues such as spikes in task failures, the agent forgetting context, or new tools causing unexpected errors.
It can also be used in reverse: starting from a known problem, such as an agent stuck in a loop, and tracing back to which model, tool, or flag is causing it.
From there, developers can dig into detailed traces to find the root cause and quickly deliver a fix.
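A rough illustration of that reverse workflow, assuming each production event is tagged with the model, tool, and flag in play; the field names are hypothetical, not Raindrop’s event schema.

```python
# Working backwards from a known issue: given event records tagged with model,
# tool, and flag, count how often the issue co-occurs with each attribute value.
from collections import Counter

events = [
    {"model": "model-a", "tool": "web_search", "flag": "loop_guard_off", "issue": "stuck_in_loop"},
    {"model": "model-b", "tool": "web_search", "flag": "loop_guard_on",  "issue": None},
    {"model": "model-a", "tool": "calculator", "flag": "loop_guard_off", "issue": "stuck_in_loop"},
    {"model": "model-a", "tool": "web_search", "flag": "loop_guard_on",  "issue": None},
]

def attribute_breakdown(events, issue, dimension):
    """Count issue occurrences per value of one dimension (model, tool, or flag)."""
    return Counter(e[dimension] for e in events if e["issue"] == issue)

for dim in ("model", "tool", "flag"):
    print(dim, dict(attribute_breakdown(events, "stuck_in_loop", dim)))
# In this toy data the 'stuck_in_loop' issue only appears when loop_guard_off
# is set, pointing at the flag as the likely culprit to investigate first.
```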
Each experiment provides a visual overview of metrics such as tool usage frequency, error rates, call duration, and response length.
Users can click on any comparison to access the underlying event data, giving them a clear picture of how agent behavior has changed over time. Shareable links make it easy to collaborate with teammates or report findings.
Integration, scalability and accuracy
According to Hylak, Experiments integrates directly with “the feature flag platforms that companies know and love (like Statsig!)” and is designed to work seamlessly with existing telemetry and analytics pipelines.
For businesses without these integrations, it can still compare performance over time (such as yesterday versus today) without additional setup.
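In spirit, such an integration amounts to tagging each agent interaction with the feature-flag variant that served it, so results can later be sliced by flag. The sketch below assumes hypothetical `get_flag_variant` and `send_telemetry` helpers standing in for a feature-flag SDK (such as Statsig) and an analytics pipeline; it is not Raindrop’s or Statsig’s API.

```python
# Hypothetical glue code: tag each agent interaction with the active feature-flag
# variant so production metrics can be split by flag. Both helpers are stand-ins.
import time

def get_flag_variant(user_id: str, flag_name: str) -> str:
    # In practice this would call the feature-flag SDK; here it is a stub.
    return "new-model" if hash(user_id) % 2 else "control"

def send_telemetry(event: dict) -> None:
    # Stand-in for sending the event to the analytics/observability backend.
    print("telemetry:", event)

def run_agent_turn(user_id: str, message: str) -> None:
    variant = get_flag_variant(user_id, "agent-model-rollout")
    start = time.time()
    # ... run the agent with the model/prompt chosen by `variant` ...
    send_telemetry({
        "user_id": user_id,
        "variant": variant,          # lets the A/B suite group results by flag
        "latency_s": round(time.time() - start, 3),
        "task_failed": False,
    })

run_agent_turn("user-42", "summarize this document")
```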
Hylak said teams typically need about 2,000 users per day to produce statistically meaningful results.
To ensure the accuracy of comparisons, Experiments checks for adequate sample size and alerts users if a test contains insufficient data to draw valid conclusions.
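The kind of check such a guard might run can be illustrated with a standard two-proportion z-test on task-failure rates. The statistics below are a generic example; the 2,000-users-per-day figure is Raindrop’s stated rule of thumb, not a property of this particular test.

```python
# Generic significance check: a two-proportion z-test on task-failure rates for
# baseline vs. experiment arms. |z| > 1.96 corresponds to the usual 95% threshold.
import math

def two_proportion_z(fail_a: int, n_a: int, fail_b: int, n_b: int) -> float:
    """Return the z statistic for the difference in failure rates."""
    p_a, p_b = fail_a / n_a, fail_b / n_b
    pooled = (fail_a + fail_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    return (p_b - p_a) / se

# With only 200 users per arm, a 6% -> 8% failure-rate change is not significant...
print(round(two_proportion_z(12, 200, 16, 200), 2))       # ~0.78, well under 1.96
# ...but with 2,000 users per arm the same rates clear the 95% threshold.
print(round(two_proportion_z(120, 2000, 160, 2000), 2))   # ~2.48, above 1.96
```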
“We’re obsessed with making sure metrics like task errors and user frustration are metrics you’d wake up an on-call engineer for,” Hylak explained. He added that teams can drill down into the specific conversations or events driving those metrics, ensuring transparency behind every aggregate number.
Security and data protection
Raindrop operates as a cloud-hosted platform, but also offers on-premise redaction of personally identifiable information (PII) for companies that need additional control.
Hylak said the company is SOC 2 compliant and offers a PII Guard feature that uses AI to automatically remove sensitive information from stored data. “We take the protection of customer data very seriously,” he emphasized.
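Raindrop says PII Guard uses AI for this; as a simple illustration of the general idea of redacting sensitive fields before storage, here is a regex-based stand-in. The patterns and placeholders are assumptions for demonstration, not Raindrop’s implementation.

```python
# Simple regex-based redaction sketch (Raindrop's PII Guard reportedly uses AI;
# this stand-in only shows the general idea of scrubbing before storage).
import re

PATTERNS = {
    "email": re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+"),
    "phone": re.compile(r"\+?\d[\d\s().-]{8,}\d"),
}

def redact(text: str) -> str:
    """Replace matched PII with typed placeholders before the event is stored."""
    for label, pattern in PATTERNS.items():
        text = pattern.sub(f"<{label}>", text)
    return text

print(redact("Contact me at jane.doe@example.com or +1 (555) 010-7788."))
# -> "Contact me at <email> or <phone>."
```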
Prices and plans
Experiments is part of Raindrop’s Pro subscription, which costs $350 per month or $0.0007 per interaction. The Pro tier also includes deep research tools, topic clustering, custom issue tracking, and semantic search capabilities.
Raindrop’s Starter plan – $65 per month or $0.001 per interaction – provides core analytics, including issue detection, user feedback signals, Slack alerts, and user tracking. Both plans come with a 14-day free trial.
Larger organizations can opt for an Enterprise plan with custom pricing and advanced features such as SSO sign-in, custom alerts, integrations, edge PII redaction, and priority support.
Continuous improvement for AI systems
With Experiments, Raindrop positions itself at the intersection of AI analytics and software observability. Its focus on “measuring the truth,” as described in the product video, reflects a broader industry push for accountability and transparency in AI operations.
Rather than relying solely on offline benchmarks, Raindrop’s approach emphasizes real user data and contextual understanding. The company hopes this will allow AI developers to act faster, identify root causes more quickly, and deliver better-performing models with confidence.




