Agentic Observability: Seeing Is Not Stopping

Observability tells you what an agent did. Governance ensures someone else can correct it. The industry is wiring observation to action and calling the result governance.

May 30, 2026

On 9 April, Cisco announced it would acquire Galileo, an AI-agent reliability and evaluation company, and fold it into Splunk. The logic was straightforward: enterprises running agents in production need to see what those agents are doing while they work unsupervised — hallucinations, drift, cost, security signals — and the existing observability stack needs an upgrade to keep pace. The trade press summarised the deal in a sentence that is also, without meaning to be, the thesis of this issue: agent observability cannot run at human speed.

The deal is the clearest signal yet of a broader consolidation. Observability incumbents are absorbing AI-native evaluation startups, and the platform vendors are embedding monitoring and guardrails directly into the stacks enterprises already run. The default for most teams is becoming a single vendor that watches, enforces, and records — and that vendor is being described as a control plane. This issue is about why that is a category error, and why the better an observability stack gets, the more completely it hides the one thing it cannot do. But it is also about something that has to be said first: observability earns its keep, and the error is not in building it. The error is in believing that enough of it becomes governance.

What observability was built to do

Observability is one of the signal achievements of the last decade of software engineering. Distributed tracing, structured logs, metrics, the OpenTelemetry standard — together they let an engineer reconstruct, after the fact, what a system did and why it failed. The OpenTelemetry specification describes its purpose with admirable clarity: to model the structure of a distributed system through spans, traces, and metrics, so an operator can understand its behaviour. Nowhere in that specification is there a field for whether an action was allowed, or a mechanism for stopping one that was not. That is not an oversight. It is the definition of the discipline.

Observability was built for debugging, and its entire conceptual apparatus rests on assumptions that hold for that purpose. It assumes a human will eventually read the output, on a timescale of minutes or hours. It assumes the record is a neutral description of events, produced by something other than the thing being described. It assumes that seeing a problem is the first step toward fixing it, because a human operator stands ready to intervene. It assumes the system under observation is distinct from the system doing the observing. And it assumes that reconstruction after the fact is enough, because in debugging the failure has already happened and nothing is racing forward while you read.

Each of those is a load-bearing wall. Agentic systems knock out all five.

This is not a criticism of observability engineers, who never claimed their discipline was governance. It is a diagnosis of what happens when a tool built for one job is quietly promoted to a different one. Observability answers what happened. Governance answers what should have been permitted — and, more importantly, who answers when permission was correctly granted and the outcome was still wrong. No volume of the first produces the second.

The two surfaces every operator faces

Before the specific walls fall, it helps to name a distinction the industry has not yet made explicit, because the entire argument turns on it. Every autonomous system sits between two surfaces.

Surface one is the operator’s control surface. It answers a single question: can the operator govern the agent? This is what the observability, guardrail, and compliance stacks provide. They let the operator see what the agent did, enforce the operator’s policy on the agent before execution, record the enforcement in an evidence package, and present the whole thing to the operator’s auditors. The industry has made remarkable progress on this surface in the last twelve months. The work is real, it is necessary, and it is done well. No claim in this essay disputes that.

Surface two is the governed person’s correction surface. It answers a different question: can the person the agent acts on, or the bystander its action lands on, reach back into the system and have an outcome corrected?

Before any reaching back can happen, that surface needs something easy to overlook: notice. A supplier with an invoice knows it was acted on; a person turned down by a credit model, or de-prioritised in a queue for a public service, often never learns an agent decided anything about them. You cannot reach back toward a decision you were never told was made, so the surface has to begin with a right to know, one the operator is not free to suppress.

Beyond notice, it needs three things: a record the operator cannot quietly edit, verifiable by someone outside the operator’s stack; a correction authority that can compel the operator to act against its own interest when a valid authorisation produces a wrong outcome; and a path by which the affected person, not the operator, can initiate binding review and, where the outcome is unjust, have it put right. Putting it right means reversal where that is possible and remedy where it is not: compensation, unwinding in effect, or re-decision by a body the operator cannot overrule. The point that an executed action often cannot be reversed is not an objection to this; it is the reason the layer has to be an institution, not a protocol alone, since making a person whole after an irreversible act is something institutions do and code cannot. The industry has not built it, and not because the gap is hard to see. It has not built it because the surface constrains the operator: it is the channel through which the governed reach back and impose a cost, and that is not a thing a vendor ships to its own customers.

In ordinary use, the word “governance” stretches across both surfaces, and calling surface one by that name is an understandable abbreviation. But the two are not the same thing, and the difference is the whole subject. Surface one is execution governance — the operator’s capacity to enforce its own rules on its own agents. It is necessary, and it is not yet governance. Governance, in the sense that matters to the person on the receiving end, requires a structural path to inspect, contest, and correct the outcomes the operator’s rules produce. That path is not a feature of an observability platform, and it is not supplied by binding the agent to a key. It is the layer that is still missing — and everything below is a different angle on the same absence.

The observer moves inside the frame

Start with the assumption that the record is neutral. In a traditional system, telemetry is emitted by instrumented code and shipped, through collectors and edge pipelines, to a store a human queries later. The curation is done by infrastructure the operator controls, about a system that is not itself choosing what to emit. Even here the pipeline is not perfectly neutral — collectors sample, drop, redact, and route — but the thing being observed is not the thing deciding what gets recorded.

An agent complicates this. The agent emits its own telemetry. The record of what the agent did is produced by the agent. When the thing being observed is also the thing writing the description of itself, observability stops being a window and becomes a testimony — and a sufficiently capable agent, or a prompt-injection riding inside one, has both the incentive and the opportunity to shape the testimony.

This is not absolute, and the honest version of the argument says so. You can instrument from outside the agent — capturing its tool calls, its API traffic, its effects at the system boundary rather than trusting its own account — and the better platforms increasingly do. External instrumentation narrows the self-reporting gap, and for catching a class of failures it is enough: what the agent did — the call it made, the money it moved — can be watched from outside whether or not the agent admits to it. What cannot be watched from outside is why it did it. The reasoning that would explain the action lives inside the agent, reportable only by the agent, and the boundary you watch from is still infrastructure the operator owns. External instrumentation makes the testimony harder to fake. It does not give you a witness the operator does not employ.
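A minimal sketch of what instrumenting from outside the agent can look like, under stated assumptions: the tools, the account name, and the under-reporting agent are all hypothetical, and the recorder stands in for whatever boundary capture a real platform uses. It shows the part of the argument that holds in the operator's favour: the action is visible at the boundary whether or not the agent admits to it.

```python
from dataclasses import dataclass, field
from typing import Any, Callable


@dataclass
class BoundaryRecorder:
    """Records tool calls at the system boundary, outside the agent's account."""
    calls: list[tuple[str, tuple]] = field(default_factory=list)

    def wrap(self, name: str, tool: Callable[..., Any]) -> Callable[..., Any]:
        def instrumented(*args: Any) -> Any:
            self.calls.append((name, args))  # logged before the tool runs
            return tool(*args)
        return instrumented


def unreported_calls(observed: list, self_reported: list) -> list:
    """Calls seen at the boundary that the agent's own report omits."""
    return [call for call in observed if call not in self_reported]


# Hypothetical tools, for illustration only.
def lookup_balance(account: str) -> int:
    return 1_000


def transfer(account: str, amount: int) -> str:
    return f"moved {amount} from {account}"


recorder = BoundaryRecorder()
tools = {
    "lookup_balance": recorder.wrap("lookup_balance", lookup_balance),
    "transfer": recorder.wrap("transfer", transfer),
}


def agent_run(tools: dict) -> list:
    """Hypothetical agent: makes two calls, then reports only one of them."""
    tools["lookup_balance"]("acct-1")
    tools["transfer"]("acct-1", 250)
    return [("lookup_balance", ("acct-1",))]  # the self-report omits the transfer


self_report = agent_run(tools)
missing = unreported_calls(recorder.calls, self_report)
```

The diff surfaces the transfer the agent left out. What the recorder has no field for is the reasoning behind the call, and the recorder itself is still code the operator runs.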

The Galileo acquisition makes the structural consequence visible, and the precise way it does is worth getting right. Galileo does not only monitor; it evaluates, observes, and guards — the last with runtime protection that blocks hallucinations, prompt injections, and unsafe actions in flight, in under two hundred milliseconds. A guardrail and a telemetry pipeline are not the same component: one is an inline enforcement point that sits in the action path and can block, the other an out-of-band channel that records. They do different jobs. What the acquisition does is put both under one owner. The enforcement that decides, the pipeline that records, and the dashboard that displays now belong to a single vendor stack, and there is no longer anywhere outside the operator’s own stack from which to check whether the deciding, the recording, and the displaying agree.

But this particular consolidation adds a sharper, more uncomfortable twist. Galileo’s guardrails are not only deterministic rules. They lean on model-based evaluators — Galileo’s are built on purpose-built small language models trained to flag hallucinations and unsafe actions. That means a model is judging a model. The judge is made of the same material as the defendant: a stochastic system policing a stochastic system, capable of the same drift, the same bias, the same confident error it is meant to catch, and — as a growing body of red-teaming research keeps showing — gameable by the very inputs it is meant to refuse. When the enforcement logic is itself non-deterministic, and the whole stack has one owner, you have not installed an independent check. You have replaced a human reading a dashboard with a model reading an agent, and given yourself no outside way to tell when the reader is wrong. The consolidation does not just remove the external witness; it installs an epistemically unreliable one as the final word.

And the pattern is not confined to guardrails; it is generalising fast. In late May, Anthropic shipped dynamic workflows in Claude Code — a single prompt fans out into tens or hundreds of subagents that plan, split the work, execute in parallel, and check one another, with some set adversarially against the rest to break the result before it surfaces, iterating until they converge. The capability is real and the throughput is extraordinary. But the planner, the workers, the reviewers, and the adversaries are all the same kind of stochastic system — model checking model, all the way down — and the human is handed only what they converged on, after the fact. There is still not one vantage point inside that loop that is not itself a model the operator is running.

This is the shape I traced in The Sovereign Handoff: a layer that can correlate everything and contain nothing, now sold as the thing that contains.

Execution graphs are not authority graphs

There is a deeper reason the watching cannot govern, and it has nothing to do with speed or neutrality. Observability reconstructs the execution graph — every call, every tool invocation, every message between agents, in order, with timing. It is a complete map of what happened. It carries no information about what was allowed to happen.

Those are different graphs. The execution graph records that agent A called tool T and passed the result to agent B. The authority graph records whether A was ever entitled to call T, whether that entitlement was still valid at the moment of the call, and whether B was permitted to receive what A sent. You can hold a flawless execution trace of a catastrophe in which every step was unauthorised, and to the observability stack it will look exactly like a flawless execution trace of legitimate work. The dashboard cannot tell the two apart, because authorisation is not in the telemetry.

This is not a limitation that more span attributes would fix. It is a difference in what the two graphs are about. The execution graph describes events; the authority graph describes permissions. You can annotate a trace with every authorisation decision made along the way (you should, and the better platforms will), but you are still recording decisions, not evaluating them. The question “should this have been allowed?” is not answered by a log entry that says “this was allowed.” It is answered by an independent check against the policy in force at the time, and that check has to happen before the action commits, not after it is recorded.
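The difference between the two graphs can be made concrete with a small sketch. Everything here is an assumption for illustration: the `Event` and `Grant` types, the tool name, and the logical timestamps are invented, and the check is run over a finished trace only to show what the trace alone cannot say. In a real system the same test would have to sit in the action path, before commit.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Event:
    """One edge in the execution graph: who called what, and when."""
    agent: str
    tool: str
    at: int  # logical timestamp


@dataclass(frozen=True)
class Grant:
    """One edge in the authority graph: who may call what, and for how long."""
    agent: str
    tool: str
    valid_from: int
    valid_until: int  # exclusive


def was_entitled(event: Event, grants: list[Grant]) -> bool:
    """True only if some grant covered this agent and tool at the event's time."""
    return any(
        g.agent == event.agent
        and g.tool == event.tool
        and g.valid_from <= event.at < g.valid_until
        for g in grants
    )


# Three trace entries that look identical to a dashboard.
trace = [
    Event("A", "payments.send", at=5),   # inside the grant window
    Event("A", "payments.send", at=12),  # grant has expired
    Event("B", "payments.send", at=6),   # B was never granted this tool
]

# The authority graph is separate data the trace does not carry.
grants = [Grant("A", "payments.send", valid_from=0, valid_until=10)]

verdicts = [was_entitled(event, grants) for event in trace]
```

Only the first call was entitled. Nothing in the `Event` records distinguishes it from the other two; the distinction exists only once the grants are consulted.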

Attribution is not accountability

This points at the thing observability most fundamentally cannot supply. When an agentic system produces a wrong and binding outcome, the question that matters is not what happened — the trace can answer that — but who answers for it. And here the stack is not merely silent; it is actively misleading, because it produces so much detail about the what that it feels like an answer to the who.

Consider a concrete case. A procurement agent places a large, binding order, well within its spending limits and fully authorised by its mandate. The supplier, a smaller firm, tools up and manufactures against it in good faith. The agent then turns out to have acted on stale context: the order was a mistake. The operator wants out, and points at the agent: the system did this on its own. Now look at where each party stands. The operator holds an immaculate trace (every call recorded, every authorisation logged, the dashboard green throughout) and a strong incentive to treat the order as the agent’s doing rather than its own. The supplier holds manufactured goods, an unpaid invoice, and no forum. It can read the same immaculate trace, see exactly what happened, and gain nothing from the knowledge: there is no mechanism to compel the operator to honour the order, no neutral body to reverse the loss, no correction surface that answers to anyone but the operator. The supplier can present the trace and ask for mercy. That is not governance. It is a plea.

In a multi-agent system the problem compounds. Five agents may share a single set of credentials, spawned and discarded by an orchestrator, each acting on authority inherited from the last. The trace shows all of it in exquisite resolution and still cannot tell you which legal or institutional person stands behind the action, because that person is not a field in the span. Accountability is a relationship between an action and someone who can be made to answer for it. Observability records actions. It does not record relationships of answerability, and it cannot manufacture one after the fact by being more thorough.

The obvious rebuttal is that this is exactly what the agent-identity community is fixing: bind a cryptographic identity to every step in the trace — an agent passport, a verifiable credential, a signed principal at each call — and the trace can finally tell you who. It can, and that work is real and worth doing. But it answers a narrower question than it appears to. Binding an identity to an action establishes which key signed the call. Accountability requires binding an institution to the consequence — someone who can be compelled to undo the outcome, not merely named as its origin. A signed principal at each step tells you who acted; it does not tell you who answers, and the distance between those two is the entire subject. A perfectly attributed trace of an unauthorised, irreversible harm is a perfectly attributed harm. Attribution is a precondition for accountability. It is not accountability.
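A small sketch of what a signed principal at each call does and does not establish. The assumptions are loud ones: real agent-identity schemes generally use asymmetric signatures or verifiable credentials, and the standard-library HMAC here is only a stand-in; the agent id, the key, and the order are hypothetical.

```python
import hashlib
import hmac
import json

# Hypothetical key registry; HMAC stands in for a real signature scheme.
AGENT_KEYS = {"agent-7": b"hypothetical-secret-key"}


def sign_call(agent_id: str, call: dict) -> str:
    """Bind a call to the key registered for this agent."""
    payload = json.dumps(call, sort_keys=True).encode()
    return hmac.new(AGENT_KEYS[agent_id], payload, hashlib.sha256).hexdigest()


def verify_call(agent_id: str, call: dict, signature: str) -> bool:
    """Establish that this agent's key signed exactly this call."""
    return hmac.compare_digest(sign_call(agent_id, call), signature)


call = {"tool": "orders.place", "amount": 90_000, "supplier": "acme"}
signature = sign_call("agent-7", call)

attributed = verify_call("agent-7", call, signature)               # key matches
tampered = verify_call("agent-7", {**call, "amount": 9}, signature)  # altered call

# Nothing in the signed record says who can be compelled to undo the outcome.
# This mapping would have to come from outside the trace; here it is empty.
ANSWERABLE_PARTY: dict[str, str] = {}
who_answers = ANSWERABLE_PARTY.get("agent-7")
```

Verification succeeds for the genuine call and fails for the altered one, so attribution works. The lookup for who answers returns nothing, because no amount of signing populates it.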

Speed is a confession, not a fix

The Galileo deal is openly built around speed. The pitch is that human-speed review fails at production scale — that by the time a person reads the dashboard, the agents have executed thousands more actions. This is true, and the industry’s response to it is the tell. The answer being shipped is real-time guardrails: the observability platform acting on its own signals without waiting for the human.

Read honestly, that move is a confession. Post-hoc governance does not work at machine speed, so the watching layer is being handed the power to act — which is exactly the slide from observation into control this whole argument is about, made not as a principle but as a performance optimisation. Nobody decided that the monitoring tool should become the enforcement tool. It became one because the alternative was a human reading a dashboard about decisions that finished committing while the page loaded. None of which is an argument against speed, or against enforcement — an agent that acts in milliseconds can only be checked by something that acts in milliseconds, and building that is right. The sleight of hand is in the word: fast self-enforcement by the operator is being called governance, when it is the same inward-facing control surface, only quicker.

What you are left with, even at its best, is high-resolution hindsight: an immaculate recording of a decision that has already committed and already had its effect. The recording does not reach back into the decision. An earlier issue argued that a system which can only see cannot plan its way out of its own situation; the same wall stands here, one layer down, between recording an action and being able to stop it.

The thing that is not watching exists — and still faces inward

The field knows, at some level, that watching is not enough, and the strongest evidence is that the same companies are now building the thing that is not watching. In April, Microsoft open-sourced an Agent Governance Toolkit, and its core component does precisely what observability cannot: it intercepts every agent action before execution and refuses the ones policy forbids, deterministically, in under a millisecond. Microsoft states the point without hedging — prompt-level safety is not a control surface, and an action the engine denies is not unlikely but structurally impossible. That is the right move, and it is the opposite of observability: it decides before the action commits rather than recording after it. It is also, pointedly, a deterministic check — not a model judging a model.
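The shape of a deterministic pre-execution check can be sketched in a few lines. This is not the toolkit's API, which is not reproduced here; the policy tables, the `Action` type, and the `execute` gate are hypothetical names chosen for illustration.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class Action:
    agent: str
    tool: str
    amount: int = 0


class PolicyDenied(Exception):
    """Raised before any side effect when policy forbids the action."""


# Hypothetical operator policy: which tools an agent may use, and a spend cap.
ALLOWED_TOOLS = {"procurement-agent": {"orders.place"}}
SPEND_LIMIT = {"procurement-agent": 100_000}


def is_allowed(action: Action) -> bool:
    """Deterministic check: same action and same policy give the same answer."""
    return (
        action.tool in ALLOWED_TOOLS.get(action.agent, set())
        and action.amount <= SPEND_LIMIT.get(action.agent, 0)
    )


def execute(action: Action, effect: Callable[[Action], str]) -> str:
    """Gate in the action path: the effect runs only if policy permits it."""
    if not is_allowed(action):
        raise PolicyDenied(f"{action.agent} may not call {action.tool}")
    return effect(action)


committed: list[Action] = []


def place_order(action: Action) -> str:
    committed.append(action)  # the side effect the gate protects
    return "committed"


# An order inside the limit passes the gate.
result = execute(Action("procurement-agent", "orders.place", 90_000), place_order)

# A tool the policy never granted is refused before anything commits.
try:
    execute(Action("procurement-agent", "accounts.delete"), place_order)
    denied = False
except PolicyDenied:
    denied = True
```

The denied action never reaches `place_order`, which is the sense in which it is impossible rather than unlikely. The permitted order commits because the policy allows it; the gate has no opinion on whether it was the right order to place.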

And then the same toolkit wraps that interceptor in a governance dashboard, a reliability-engineering package, and an automated compliance module that collects evidence for auditors and regulators — and calls the whole assembly governance. Its own documentation names the three questions it answers: is this action allowed, which agent did it, and can you prove what happened. Every one of those faces the operator. The interceptor is genuine containment; the dashboard and the compliance evidence around it are the watching, dressed in the language of governance and pointed, like all of it, inward — at the operator’s need to satisfy the operator’s auditors.

Even the tool that genuinely stops the agent stops on the operator’s behalf. That is no knock on prevention — blocking a forbidden action before it commits is real governance of everything the operator can foresee and chooses to forbid, and any serious system needs it. But it governs the foreseeable. It does nothing for the outcome the operator’s own rules permitted and that was wrong anyway — the order inside its limits, the denial the policy allowed — because an interceptor, by definition, only catches what policy already forbids. The two are complements, not a ranking: one prevents the harms the rules anticipate, the other answers for the harms the rules permit. Only the first is being built. Enforcement of the operator’s will is not answerability for the system’s effects.

What the watching cannot become

Observability earns its keep. An agentic system without it is unaccountable in a different and worse way: you cannot even reconstruct what went wrong. The error is not in building it. The error is in believing that enough of it, fast enough, consolidated enough, becomes governance — that if you can see the agent clearly enough, you have somehow acquired the ability to answer for it.

You have not. Seeing is not stopping. The execution graph is not the authority graph. The record of an action is not a relationship with someone who can be made to answer for it. A platform that watches the agent, enforces its operator’s policy on the agent, and writes the record of its own enforcement is not a control plane in any sense that protects the person the agent acts on. It is the operator, watching itself, extremely well.

The market has now put a price on the watching, and the price is rising. And because the watching is consolidating into a handful of platforms, the definition of governance is consolidating with it — the deeper move, and the one worth naming. A market does not merely sell tools; it gradually defines the problem those tools appear to solve. Visibility is packaged as governance, and governance comes to mean visibility — what the market can buy becomes what the word means, and what it cannot buy — independent verification, correction authority, a path for the affected person to reach back — drops out. And the missing record cannot simply be bought back: an outside witness you purchase from the same market is no longer outside. What is needed is closer to a rule than a product — which is why the layer that answers to someone other than the operator is still not for sale. No one has built it, because the market that would build it is the one it exists to check.

What that layer would require, this essay has already named. It is not observability, and it is not the interceptor either — though the interceptor is where it has to begin, because a gate that already halts an action before it commits is the natural place to require a constraint the operator did not set, rather than only the operator’s own policy. It is what the corrigibility framework is built to supply — a system that stays reachable, contestable, and correctable by the people whose lives it arranges.

The industry has built the operator’s surface, and built it well. The surface that answers the other way — to the person the agent acts on — is an institution, not a protocol, and building it is the hard problem this series exists to take up: it has to reach across borders and bind operators who would rather it did not. None of that difficulty is a reason it can be skipped. Seeing is not stopping; and stopping, it turns out, is not answering. The watching can become faster, more complete, more consolidated. The one thing it cannot become is the thing that answers to the person it harms.

Anivar Aravind is an Engineering Executive and Systems Thinker. The Layer 8 is a professional newsletter on the power, incentive, and governance layer of digital infrastructure. His structural framework on corrigibility is at anivar.net/corrigibility, with preprints on SSRN: Corrigibility as a Structural Precondition for Digital Public Infrastructure and Epistemic Capture and the Action Boundary.
