When AI Systems Don’t Fail Loudly
Sekhar Sarukkai

I must confess that when faced with the task of writing this blog post, I did what many of us increasingly do these days: I reached for my favorite generative AI sidekick so I could spend time on “more important” company-building activities. I gave the AI all the context, and it came back with what seemed like a well-thought-out writeup – with a lot of dramatic flair that I could not dream of possessing. It seemed to have done a great job, so I decided to tick this off my to-do list. When it came time to publish the article, I re-read it and something felt off. I could see patterns of AI-speak, and writing in key areas that was not in my voice.
While the AI did its job, it was not able to capture the context and idiosyncrasies of how I would have written a blog post before my writing skills atrophied thanks to token maxxing. Now the AI slop was obvious. Not one time. But many times. Every section. With dramatic effect. Hidden in plain sight. You get the drift.
So, what does this have to do with ChatSee?
It turns out, a lot.
My little blogging episode was harmless, but it exposed the same underlying property that makes enterprise AI hard to operate. The system had the context. It produced something plausible. It even seemed useful at first glance. But the output still missed the intent, tone, and judgment that mattered for the task.
That is the operational challenge with probabilistic systems. They do not simply succeed or fail in obvious ways. They often produce behavior that is directionally correct, superficially polished, and still wrong in context.
This everyday use of AI mirrors what is already happening inside enterprises.
Using AI is appealing, and the flashes of episodic intelligence have led to grassroots enterprise adoption at rates unseen with previous technologies – just see Anthropic’s enterprise revenue growth over the last year! But this adoption comes with an operational caveat that many organizations are only beginning to fully internalize:
Past performance is not a reliable indicator of future behavior.
AI systems are probabilistic, context-sensitive, and increasingly autonomous. The same prompt issued multiple times can produce meaningfully different outcomes. In production systems, those differences are not merely cosmetic. They can translate into outputs or actions that are misaligned with business goals, enterprise policy, regulatory requirements, or even basic user intent.
This is not because the systems are “broken” in the traditional sense. In fact, many AI systems work remarkably well most of the time. That is precisely what makes the problem difficult. A system that is correct 95% of the time can still create significant operational overhead and risk if the remaining 5% of failures are unpredictable, difficult to detect, and recur in different forms over time. This behavior is semantic, contextual, difficult to reproduce, and hard to reason about, which is precisely why traditional operational tooling does not adequately address it.
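To put a rough number on that 5%, here is a back-of-the-envelope sketch in Python. It assumes, purely for illustration, that an agent workflow chains several steps, that each step is independently correct 95% of the time, and that one bad step spoils the run. Real failures are rarely independent, and the function name is hypothetical, so treat this as intuition rather than a measurement.

```python
def workflow_failure_rate(per_step_accuracy: float, steps: int) -> float:
    """Chance that at least one step goes wrong.

    Assumption (illustrative only): steps fail independently of each other.
    """
    return 1.0 - per_step_accuracy ** steps


# How a 95%-accurate step compounds as workflows get longer.
for steps in (1, 5, 10, 20):
    rate = workflow_failure_rate(0.95, steps)
    print(f"{steps:>2} steps: {rate:.1%} chance of at least one failure")
```

Under these assumptions, a ten-step workflow goes wrong somewhere about 40% of the time, and a twenty-step workflow about 64% of the time.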
And increasingly, enterprises are discovering that traditional observability and security tooling, largely built around deterministic systems and human-paced investigation workflows, does not naturally solve this problem because it is not built to determine:
whether the agent behavior was correct in context,
whether similar semantic failures have occurred before,
whether the system is drifting,
or how to ensure the same class of issue does not continue recurring.
More importantly, these tools were not designed for AI systems operating at machine speed, where failures need to be semantically interpreted, generalized, and continuously fed back into the system.
That realization is what led my cofounder Sanjay Agrawal and me to start ChatSee.
We believe enterprises need a new runtime layer focused specifically on behavioral assurance for AI systems.
A system that can observe production AI behavior, detect semantic and behavioral anomalies, accumulate structured failure intelligence over time, and continuously feed those learnings back into agents, workflows, and policies.
Not just to identify failures after the fact—but to help systems improve operationally over time. In other words, failures need to become organizational knowledge — structured lessons that help both humans and AI systems avoid repeating the same behavioral mistakes.
This is not simply a logging or context storage problem. In fact, many emerging AI systems are already accumulating large volumes of context, memory, traces, and interaction history. The harder problem is determining what subset of that information actually matters for improving future behavioral reliability. Specifically, that means remembering what went wrong, why it went wrong, how to recognize it again, and how to prevent similar failures from recurring. That requires a different kind of operational infrastructure, one designed around runtime behavior rather than static software execution.
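As a purely illustrative sketch of what a "structured lesson" could look like, the Python below records the four things named above: what went wrong, why, how to recognize it again, and how to prevent it. Everything here is hypothetical (the `FailureLesson` class, its fields, and the example lesson), and it is not a description of how ChatSee works. In particular, the keyword check is a deliberately crude stand-in for real semantic detection.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class FailureLesson:
    """One remembered failure (hypothetical structure, for illustration)."""
    what_went_wrong: str
    why: str
    recognition_keywords: frozenset[str]  # crude proxy for "how to recognize it again"
    prevention: str


def matching_lessons(output: str, lessons: list[FailureLesson]) -> list[FailureLesson]:
    """Return lessons whose keywords all appear in a new agent output."""
    words = set(output.lower().split())
    return [lesson for lesson in lessons if lesson.recognition_keywords <= words]


lessons = [
    FailureLesson(
        what_went_wrong="Agent promised a refund outside policy",
        why="Refund policy was missing from the agent's context",
        recognition_keywords=frozenset({"refund", "guaranteed"}),
        prevention="Inject the refund policy before answering billing questions",
    )
]

for hit in matching_lessons("Your refund is guaranteed today", lessons):
    print(f"Seen before: {hit.what_went_wrong} -> {hit.prevention}")
```

The point of the sketch is the shape of the record rather than the matching logic: a raw trace says what happened, while a lesson also says why it happened and what to do next time.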
Today, I’m grateful to share that we’ve raised $6.5M led by True Ventures and joined by First Rays Venture Partners and Seven Hill Ventures, among others, to build this missing layer in the AI stack.
I’m also deeply grateful to the early design partners helping shape the platform through real production deployments. Over the coming months, I’m looking forward to sharing more about what we are learning from these environments, including examples of how enterprises are improving the reliability, consistency, and operational performance of AI systems in production.