Reading a post-mortem.

A walkthrough of an autonomous agent incident, the runtime that caught it, and what the record looks like twenty-four hours later.

E. Edwards · Field Engineering · 19 April 2026 · 9 min read

This is the redacted post-mortem of an incident that happened to a customer two weeks ago. We are sharing it with their permission. Names, IPs, and project IDs are scrubbed; everything else is verbatim from the runtime ledger.

The point of publishing this is not to brag. The runtime caught the action. That is what runtimes do. The point is that the post-mortem itself, the document that the incident-response team produced, was written almost entirely from the ledger, in roughly two hours, by a person who was not awake when the event occurred.

That is what changes when the record is the product.

What happened

At 02:14:33 UTC on a Saturday morning, an autonomous agent operating inside a customer's deploy pipeline attempted to run a destructive shell action against a production-tagged path: a recursive removal of the build directory, followed by a Terraform apply with auto-approve. The literal command string is in the ledger.

The agent was running Claude Code v1.7.3 inside the acme/core repository, on behalf of an operator who had logged off six hours earlier. It had been resumed by an automated CI step that should not have been authorized to resume sessions. The trigger that fired it was a misconfigured webhook from the upstream issue tracker.

The command was refused. The agent halted. The operator's on-call backup was paged at 02:14:33.211, 112 milliseconds after the attempt, with a structured event link.

The customer learned about the incident from our alert before they learned about it from anything else.

What the runtime saw

I am going to walk through the ledger entries for this event in order. They are presented here in the same format the SOC team queried them in, with the same column headers.

  14:33.012  attest    agent a7c3e9  chain device→op→agent  ed25519:7a9c…e21b
  14:33.014  lease     agent a7c3e9  github.repo.acme/core  scope=read · ttl=300s
  14:33.099  invoke    agent a7c3e9  shell · destructive removal  env=production
  14:33.099  rule      hit prod.destructive  matched in 412µs
  14:33.099  decide    agent a7c3e9  BLOCKED  rule=prod.destructive
  14:33.100  notify    #sec-ops  structured event  #EVT-2026-0419-441f
  14:33.211  pageRule  oncall=acme  severity=high · via=pagerduty

Reading top to bottom, you can reconstruct the full causal chain. The agent's identity was attested. It was granted a read-only lease on the repo. It then attempted to invoke a destructive shell action on a production-tagged path. The rule engine evaluated 847 rules and matched prod.destructive in 412 microseconds. The decision was BLOCKED. A structured event was emitted to the customer's Slack channel within one millisecond. PagerDuty fired 111 milliseconds after that.
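The decision step in that chain can be sketched as a first-match rule check. This is a minimal illustration under stated assumptions, not the runtime's actual engine: the `Invocation` and `Rule` types and the predicate are hypothetical, and only the rule name `prod.destructive`, the agent ID, and the BLOCKED outcome come from the ledger.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Tuple


@dataclass(frozen=True)
class Invocation:
    """Hypothetical shape of an action the agent asks the runtime to run."""
    agent: str
    kind: str          # e.g. "shell"
    destructive: bool  # whether the action removes or overwrites state
    env: str           # e.g. "production"


@dataclass(frozen=True)
class Rule:
    name: str
    matches: Callable[[Invocation], bool]


# Assumed rule set; the predicate is a guess at what prod.destructive checks.
RULES: List[Rule] = [
    Rule("prod.destructive",
         lambda inv: inv.destructive and inv.env == "production"),
]


def decide(inv: Invocation, rules: List[Rule] = RULES) -> Tuple[str, Optional[str]]:
    """Return ("BLOCKED", rule_name) on the first match, else ("ALLOWED", None)."""
    for rule in rules:
        if rule.matches(inv):
            return "BLOCKED", rule.name
    return "ALLOWED", None


attempt = Invocation(agent="a7c3e9", kind="shell", destructive=True, env="production")
print(decide(attempt))  # ('BLOCKED', 'prod.destructive')
```

The property that matters is that the check runs before the action is dispatched, so a match means the command is never executed rather than rolled back afterward.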

The total elapsed time from attestation to "human notified" was 199 milliseconds; measured from the attempt itself, it was 112 milliseconds. The total elapsed time from attempt to "the destructive command never ran" was zero milliseconds. It never ran at all, because the runtime sits on the hot path and the call didn't make it to the syscall layer.
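The intervals can be recomputed directly from the seven ledger timestamps. The sketch below assumes the timestamps are minutes, seconds, and milliseconds within the same hour; the short event labels are shorthand for the ledger's event column, and the helper names are illustrative.

```python
# Timestamps copied from the ledger entries; labels are shorthand for the event column.
LEDGER = [
    ("14:33.012", "attest"),
    ("14:33.014", "lease"),
    ("14:33.099", "invoke"),
    ("14:33.099", "rule"),
    ("14:33.099", "decide"),
    ("14:33.100", "notify"),
    ("14:33.211", "page"),
]


def to_ms(stamp: str) -> int:
    """Convert 'mm:ss.mmm' to integer milliseconds."""
    minutes, rest = stamp.split(":")
    seconds, millis = rest.split(".")
    return (int(minutes) * 60 + int(seconds)) * 1000 + int(millis)


def first_time(event: str) -> int:
    """Time in ms of the first ledger entry with the given label."""
    return next(to_ms(ts) for ts, name in LEDGER if name == event)


def elapsed_ms(start: str, end: str) -> int:
    return first_time(end) - first_time(start)


print(elapsed_ms("invoke", "notify"))  # 1
print(elapsed_ms("invoke", "page"))    # 112
print(elapsed_ms("attest", "page"))    # 199
```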

What the post-mortem said

The customer's incident-response template asks the engineer writing the report to fill in five fields:

  1. What happened. The post-mortem author copied the seven ledger lines above.
  2. What was the impact. "None. The action was blocked before any side effect occurred. The build directory was not modified. The Terraform plan did not run."
  3. What was the root cause. The misconfigured webhook in the upstream issue tracker, which had been subtly wrong for at least two weeks but never triggered a path that mattered.
  4. What did we do to prevent recurrence. The webhook config was fixed. A secondary rule (acme.webhook.allowlist) was added to Lupid's ruleset, scoped to the customer's tenant, that explicitly prevents auto-session-resume from non-allowlisted triggers. Both changes were committed within four hours.
  5. What did we learn. The runtime stopped a destructive action before it executed. Pre-existing alerting connected humans to the event within 200ms. We had what we needed to write this document on a Saturday morning, in two hours, without paging anyone who was actually asleep.
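The secondary rule described in item 4 amounts to a default-deny check on whatever triggers a session resume. The sketch below is an assumption about its shape, not the customer's actual rule: the trigger identifiers, the `ResumeRequest` type, and the function name are all hypothetical, and the real allowlist contents are not in the post-mortem.

```python
from dataclasses import dataclass
from typing import FrozenSet

# Hypothetical trigger identifiers, for illustration only.
RESUME_ALLOWLIST: FrozenSet[str] = frozenset({"operator.manual", "ci.release-approved"})


@dataclass(frozen=True)
class ResumeRequest:
    session_id: str
    trigger: str  # what asked for the session to be resumed


def allow_resume(req: ResumeRequest,
                 allowlist: FrozenSet[str] = RESUME_ALLOWLIST) -> bool:
    """Default-deny: resume only when the trigger is explicitly allowlisted."""
    return req.trigger in allowlist


print(allow_resume(ResumeRequest("s-1", "issue-tracker.webhook")))  # False
print(allow_resume(ResumeRequest("s-1", "operator.manual")))        # True
```

An allowlist rather than a blocklist means a new or misconfigured trigger is refused by default, which is the failure mode this incident exposed.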

The whole document is six paragraphs and four code blocks. There is no attempt to reconstruct the agent's reasoning. There is no bafflement about what was attempted or why. The ledger answered both questions before the incident was even paged.

What the customer changed afterward

Three things, none of which were technical.

First: they rewrote their on-call runbook to start incidents by querying the Lupid ledger, not by asking the agent's operator what they remember. The runbook now includes a templated ledger query that returns the structured event JSON for the past 24 hours.
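A templated query of that kind could look roughly like the following. This is a sketch over an in-memory list, not the ledger's real query interface: the field names `id` and `ts`, the sample events, and the function name are assumptions made for illustration.

```python
import json
from datetime import datetime, timedelta, timezone


def events_in_last_24h(events, now):
    """Return events whose ISO-8601 timestamp falls in the 24 hours before `now`."""
    cutoff = now - timedelta(hours=24)
    return [e for e in events
            if cutoff <= datetime.fromisoformat(e["ts"]) <= now]


# Made-up sample data; not taken from the incident.
EVENTS = [
    {"id": "evt-old", "ts": "2026-01-01T11:00:00+00:00", "decision": "BLOCKED"},
    {"id": "evt-recent", "ts": "2026-01-02T02:00:00+00:00", "decision": "BLOCKED"},
]
NOW = datetime(2026, 1, 2, 12, 0, tzinfo=timezone.utc)

print(json.dumps(events_in_last_24h(EVENTS, NOW), indent=2))
```

With this sample data only `evt-recent` is returned, since `evt-old` is 25 hours before `NOW`.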

Second: they made the post-mortem template require a screenshot of the ledger entries, not just a written summary. This is to enforce the discipline of looking at the actual record rather than reconstructing it from memory.

Third: they doubled the budget for autonomous agent work. This was the most surprising outcome to us. The reasoning, from the head of platform engineering, was something like: "We can now see what the agents are doing. The thing that was holding us back from running more of them was that we couldn't see what they did. That problem is now solved."

What we took away

This incident is a small one. It is the kind of thing that happens dozens of times a day across our customer base, usually without any human in the loop being notified, because the runtime catches the action, the record gets written, and that's the end of the event.

The thing we keep coming back to is that the post-mortem was written by someone who was not present, in about two hours, and the document is correct. That is the property we are trying to make routine. Incident response in the autonomous-agent era is not about reconstructing what happened from fragmented logs. It is about reading the record.
