Are you supposed to be logging your AI system’s outputs? Yes. Do you know what “anomalous” looks like for a language model response? Probably not. Neither does the vendor selling you the observability platform.
That’s the actual state of the field right now. Not the conference talk version. The real version.
The Problem Has a Name But Not a Definition
“AI observability” is a real phrase that real companies are spending real money on. The category has venture funding, dedicated tooling, and a growing pile of blog posts explaining why you need it. What it does not have is consensus on what signals actually matter.
Traditional observability has three pillars: logs, metrics, and traces. Those pillars exist because distributed systems fail in specific, structured ways. Latency spikes. Error rates climb. A span in a trace shows you exactly where the request fell apart. The data is ugly sometimes, but the underlying question is clean. “Did this function do what it was supposed to do, and how long did it take?”
Ask that question about a language model response and watch the room go quiet.
“Did this response do what it was supposed to do” requires a definition of “supposed to.” That definition is fuzzy by design. The whole reason you’re using an LLM instead of a lookup table is that you wanted something that handles ambiguity. You’ve introduced non-determinism on purpose, and now you’re surprised that your monitoring stack can’t tell you whether the output was good.
Two Approaches, Two Different Problems
Right now there are two dominant schools of thought on how to handle this, and they are genuinely different philosophies, not just different implementations.
School One: Instrument everything and sort it out later.
Log the prompt. Log the response. Log the token count, latency, model version, temperature setting, and whatever metadata you can attach. Store it all. Build a pipeline to analyze it. Use another model to evaluate the outputs of the first model. Ship it.
This approach is popular. It’s also the observability equivalent of keeping every receipt you’ve ever gotten and calling it a budget. The data exists. Whether it tells you anything useful is a separate question that you’ve deferred to future-you.
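For concreteness, here is a minimal sketch of what the School One capture step tends to look like. Everything named here is an assumption for illustration: `generate` stands in for whatever model client you use, `LLMCallRecord` and `logged_call` are made-up names, and the token count is a crude whitespace split rather than a real tokenizer.

```python
import json
import time
from dataclasses import dataclass, asdict
from typing import Callable


@dataclass
class LLMCallRecord:
    """One captured model call (hypothetical schema, not a vendor format)."""
    prompt: str
    response: str
    model_version: str
    temperature: float
    latency_ms: float
    token_count: int


def logged_call(
    generate: Callable[[str], str],  # assumption: your model client goes here
    prompt: str,
    model_version: str,
    temperature: float,
) -> LLMCallRecord:
    """Call the model and capture the metadata School One says to keep."""
    start = time.perf_counter()
    response = generate(prompt)
    latency_ms = (time.perf_counter() - start) * 1000.0
    # Whitespace split is a rough stand-in for a real tokenizer count.
    token_count = len(prompt.split()) + len(response.split())
    return LLMCallRecord(
        prompt, response, model_version, temperature, latency_ms, token_count
    )


def to_log_line(record: LLMCallRecord) -> str:
    """Serialize the record as one JSON line for whatever store you use."""
    return json.dumps(asdict(record), sort_keys=True)
```

Notice what the record does not contain: any field that says whether the response was good. That is the receipt problem in about thirty lines.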
The “LLM-as-judge” pattern that’s emerged here is genuinely interesting and also genuinely unproven at scale. You’re asking a probabilistic system to evaluate the outputs of a probabilistic system. There are scenarios where that works. There are scenarios where it just compounds the ambiguity. The tooling is not yet mature enough to tell you which situation you’re in.
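One cheap way to get a hint about which situation you’re in is to ask the judge more than once and measure how often it agrees with itself. The sketch below assumes a hypothetical `judge` callable that wraps your judge model and returns a verdict string; `judge_with_agreement` is an illustrative name, not an API from any of the tools in this space.

```python
from collections import Counter
from typing import Callable, Tuple


def judge_with_agreement(
    judge: Callable[[str, str], str],  # assumption: wraps a call to a judge model
    prompt: str,
    response: str,
    samples: int = 5,
) -> Tuple[str, float]:
    """Ask the judge several times; return (majority verdict, agreement rate)."""
    if samples < 1:
        raise ValueError("samples must be at least 1")
    verdicts = [judge(prompt, response) for _ in range(samples)]
    # The most common verdict wins; its share of the votes is the agreement rate.
    verdict, count = Counter(verdicts).most_common(1)[0]
    return verdict, count / samples
```

Low agreement suggests the judge is adding ambiguity on that case rather than resolving it. High agreement does not prove the verdict is right. A judge can be consistently wrong.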
School Two: Define success criteria first, then instrument backward.
Some teams are approaching this differently. Before they write a single logging statement, they answer: what would a bad output look like, specifically? What would a good one look like? What’s the threshold for “good enough”? Then they build evals around those definitions and instrument only what feeds into those evals.
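In code, “criteria first” can be as plain as a list of named checks plus a threshold. This is a sketch under loud assumptions: the three criteria are invented for an imagined support bot, `GOOD_ENOUGH` is an arbitrary bar, and real evals are rarely this tidy.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass(frozen=True)
class Criterion:
    name: str
    check: Callable[[str], bool]  # returns True when the output passes


# Hypothetical criteria for an imagined support bot; yours will differ.
CRITERIA: List[Criterion] = [
    Criterion("non_empty", lambda out: bool(out.strip())),
    Criterion("under_600_chars", lambda out: len(out) <= 600),
    Criterion("no_refund_promise", lambda out: "guaranteed refund" not in out.lower()),
]

GOOD_ENOUGH = 1.0  # assumed threshold: every criterion must pass


def evaluate(output: str) -> Dict[str, object]:
    """Run every criterion and report per-check results plus an overall call."""
    results = {c.name: c.check(output) for c in CRITERIA}
    score = sum(results.values()) / len(results)
    return {"results": results, "score": score, "good_enough": score >= GOOD_ENOUGH}
```

The only things worth instrumenting under this approach are the fields in that return value. The hard part is not the code. It is getting people to agree on what goes in `CRITERIA`.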
This approach is more rigorous and significantly harder. It requires product and engineering to agree on what the system is for before they can agree on what to measure. That conversation is uncomfortable. Most organizations avoid it by defaulting to School One and calling the resulting data pile “observability.”
Where Each One Actually Falls Apart
School One fails at signal-to-noise. The data volume is enormous and the meaningful events are sparse. You’ll end up with dashboards that look impressive and tell you very little. Every incident report I’ve read about LLM systems in production has a version of this problem buried in it. Nobody noticed the model drifting because nobody knew what drift looked like in context. They had logs. They had metrics. They had latency charts. They did not have a way to distinguish “the model gave a weird answer” from “the model gave a subtly wrong answer that looks completely normal.”
School Two fails at coverage. If you only instrument what you defined upfront, you’ll miss the failure modes you didn’t anticipate. And in AI systems, the failure modes you didn’t anticipate are often the ones that matter most. The definition of “bad output” that your team wrote in a planning meeting will not cover everything the model does in production with real users asking real questions. It never does.
There’s a version of this problem that’s very old. It just used to be about database queries and API contracts. The difference now is that the blast radius of a bad output is harder to contain and harder to trace back to a root cause. “A JOIN was missing an index” is fixable and auditable. “The model developed a tendency to hedge on legally sensitive topics in a way that quietly undermines user trust” is a different category of problem.
The Tooling Is Ahead of the Understanding
Here’s the part that should make you uncomfortable. The observability vendors are not waiting for the field to mature. They’re shipping product now, into the gap, with confidence that sounds like expertise but is often just good marketing.
LangSmith, Arize, Weights & Biases, Langfuse, Helicone. All real tools. All worth knowing about. None of them can tell you what to care about. They can show you what you captured. What you capture is still your problem.
The vendor pitch is usually some version of “get visibility into your AI systems.” That sentence is doing a lot of work. Visibility into what, exactly? Token usage? Okay. That’s a cost metric, not a quality metric. Response latency? Sure. That tells you the infrastructure is working. Whether the answer helped the user? That requires judgment that no dashboard provides out of the box.
The category got named before it got understood. That happens in technology constantly, and it’s not necessarily fatal. But it means the current generation of “AI observability” implementations is mostly capturing data that future engineers will find inadequate. Not useless. Inadequate.
What Actually Matters Right Now
If you’re building something with LLMs and you need to decide where to invest in observability, the two questions worth spending time on are:
- What does a failure look like that a user would notice but your system would not?
- What does a failure look like that your system would flag but a user would consider acceptable?
The gap between those two questions is where your actual observability strategy lives. Everything else is plumbing.
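If you can pair each response with two booleans, you can at least count the gap. The sketch below assumes you have some system-side flag and some user-side signal (a thumbs-down, a support ticket) per response; both are assumptions, and the user signal in particular will be incomplete. `failure_gap` and the bucket names are made up for illustration.

```python
from collections import Counter
from typing import Iterable, Tuple


def failure_gap(events: Iterable[Tuple[bool, bool]]) -> Counter:
    """Tally events shaped as (system_flagged, user_reported_problem)."""
    tally: Counter = Counter()
    for system_flagged, user_reported in events:
        if user_reported and not system_flagged:
            tally["silent_failure"] += 1  # the first question: user noticed, system did not
        elif system_flagged and not user_reported:
            tally["false_alarm"] += 1     # the second question: system flagged, user was fine
        elif system_flagged and user_reported:
            tally["caught"] += 1
        else:
            tally["fine"] += 1
    return tally
```

A growing `silent_failure` count says your checks are missing what users care about. A growing `false_alarm` count says your checks are measuring something users don’t.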
The field will figure this out. It always does, eventually, usually after enough production incidents that the post-mortems start showing patterns. The problem is that right now, most teams are collecting everything, the digital equivalent of keeping every receipt, and calling it a monitoring strategy.
Knowing you have a problem is not the same as knowing what to measure. The sooner the AI observability space admits that distinction out loud, the more useful it gets.