Read the Decisions, Not the Diff

Questioning an agent's output was never about reading every line. It's about understanding the decisions. Holding that map of decisions, on a codebase you didn't implement, is the job a CTO has always had.

By Jason Waldrip, with Claude Opus 4.8

The agent just handed you eight hundred lines. Three files you’ve never opened, a migration, a new module, and a green check mark sitting under all of it. It runs. The demo works. Somewhere in your gut there’s a question you’d rather not ask, because you suspect the honest answer. The question is whether you actually know what this does. Not whether it passes. Whether you could stand up in front of your team tomorrow and explain why it’s built this way and not the other way. If the answer is no, you didn’t ship a feature. You shipped a thing you’re now responsible for and don’t understand, with your name on the merge.

This is the part of the vibe-coding panic aimed at the wrong target. The danger was never that an agent wrote the code. The danger is that nobody questioned it. And “question it” gets heard as “read every line,” which is both impossible at the volume a fleet produces and not what catching problems ever actually required.

you never read every line

I’ve approved thousands of pull requests. I did not read most of them line by line. Any engineering leader who tells you they did is lying or running a team of four. What I read was the shape of the decision: where the state lives, what this couples to, whether the approach matches what we’d agreed the problem actually was. A senior engineer reviewing a junior’s PR isn’t proofreading characters. They’re checking judgment against intent. That skill, the one we’ve used on each other for as long as we’ve worked in teams, is exactly the one this era needs. Almost nobody is naming it as the same skill.

So here’s what questioning the output actually means, and it’s two moves. First, understanding why the agent made the calls it made. Not the syntax. The decisions. Why a queue here and a synchronous call there. Why it reached for that library. What it quietly decided about retries, about failure, about the case you never mentioned. Second, holding its finished design up against the spec you handed it, and finding the gap. There’s always a gap.

the spec is the input, not the system

Spec-driven development is having its moment, and for good reason. The spec is the steering wheel now, and the more capable the fleet, the more elaborate that spec has to get, because every ambiguity you leave is a decision you’ve delegated without deciding anything. Write the spec well and you’ve pointed the engine at the right hill.

But the spec is the input. The system that comes back is the output, and those are not the same document. The agent fills every gap you left, confidently, in ways that drift from what you meant by degrees small enough to skim past. I’ve written before that agents don’t go quiet when they hit missing context. They fill it. The spec tells you what you asked for. Only the diff tells you what you got, and the distance between the two is precisely the thing the green check mark can’t see.

The spec tells you what you asked for. Only the output tells you what you got. The gap between them is where the work moved.

the reasoning is ephemeral now

Here’s the part that’s genuinely new, and the part I think most teams will learn the hard way.

When a human on my team made one of those calls, the reasoning had a home. It lived in a head I could walk over to. At Brandfolder, Ruby was crashing on large zip files, so we built that one piece of the pipeline in Go. Six months later, anyone could ask why the zip path was a different language than the rest of the app, and get the real answer, because the people who made the call were still in the building. The why survived because the person did.

When an agent makes the call, the reasoning lives in a transcript. A run log: the options it weighed, the ones it dropped, the chain that led to the design you’re now holding. And that log is ephemeral. It scrolls off. It’s in a session you closed, on a harness that may or may not have kept it, in a format you may or may not have the skill to read. The code stays, green and running. The reasoning that produced it evaporates, unless two things are true: your harness surfaces the trace, and you know how to dig through it.
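As a minimal sketch of what "the harness surfaces the trace" could mean in practice: assume the harness can emit one structured event per decision, and that appending those events to a JSON Lines file is enough to outlive the session. The `DecisionEvent` fields, the function names, and the file layout are all hypothetical; no particular agent harness is known to expose this shape.

```python
import json
from dataclasses import asdict, dataclass, field
from pathlib import Path
from typing import Iterator


@dataclass
class DecisionEvent:
    """One call the agent made, plus what it weighed (hypothetical schema)."""
    session_id: str
    decision: str                    # e.g. "transport for upload jobs"
    chosen: str                      # the option the agent went with
    rejected: list[str] = field(default_factory=list)  # the roads not taken
    rationale: str = ""              # the agent's stated reason, verbatim


def append_event(log_path: Path, event: DecisionEvent) -> None:
    """Append one event as a JSON line so the trace survives a closed session."""
    log_path.parent.mkdir(parents=True, exist_ok=True)
    with log_path.open("a", encoding="utf-8") as fh:
        fh.write(json.dumps(asdict(event)) + "\n")


def read_events(log_path: Path, session_id: str) -> Iterator[DecisionEvent]:
    """Yield the stored events for one session, in the order they were written."""
    with log_path.open(encoding="utf-8") as fh:
        for line in fh:
            if not line.strip():
                continue  # tolerate blank lines in a hand-edited log
            record = json.loads(line)
            if record["session_id"] == session_id:
                yield DecisionEvent(**record)
```

An append-only file is the simplest thing that answers the question six months later. Anything richer (a database, a tracing backend) is a choice about retrieval, not about whether the why was kept.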

The code is permanent. The reasoning is a log that scrolls off.

A human teammate’s “why” is still in the building in six months. An agent’s “why” is in a session you already closed, unless you went and captured it.

Reading an agent’s decision log the way you’d read a teammate’s PR description is a new literacy, and it’s becoming load-bearing fast. The decision the fleet made at 2pm Tuesday is as real as any architectural call your staff engineer ever made. The difference is your staff engineer is still here to defend it.

you’ve held this map before

Which brings me to the thing that’s supposed to feel like a crisis and, for me, mostly doesn’t. The job now is to hold the map on a codebase you architected but didn’t implement. To carry the why across a system you’re responsible for and didn’t personally type. If that sounds like a frightening new mandate, you haven’t sat in the chair.

That has always been the job. At GigSmart I hired and led the team that built the platform. I made the call that a real-time marketplace is a concurrency problem, which is why we were on Elixir, but I was not the one writing every module that depended on it. My value was never that I typed the most. I argued that already, at length, about my own commit history. The value was holding the why across systems I hadn’t touched in a year, built by engineers I’d guided but never stood over. That meant understanding the decisions my team made, and just as often the decisions they didn’t make: the corner nobody addressed because my guidance there was thin. Holding that map is the oldest part of leading engineers. The fleet didn’t invent the problem. It handed the problem to everyone, including the solo developer who never had a team and now suddenly does.

A fleet is a team. A fast, tireless, literal-minded team that writes no documentation unless you force it, remembers nothing past the session, and will pave a road straight off a cliff and report success. Directing that isn’t a new discipline. It’s the discipline good engineering leaders already had, pointed at a team that happens to be made of models.

so what do you actually do

The work reorganizes around the map, in priority order, because the sequence matters more than the calendar:

1. Interrogate the why before you accept, not after it breaks. Make the agent defend the call. What did it consider and drop, and why this over that? If it can’t produce the roads not taken, the thinking didn’t happen, and you’re the one who has to do it.
2. Diff the design against the spec, not just the tests against the code. Green means the code does what it does. It says nothing about whether what it does is the system you asked for. The gap between spec and output is the review.
3. Capture the ephemeral reasoning before it scrolls away. The load-bearing “why” belongs in a decision record or in the code itself, not abandoned in a closed session. Write down the why, for the next human and for the next context window both.
4. Treat the agent log as a primary source. Build the harness that keeps the trace, and learn to read it like you read a teammate’s PR. The reasoning is evidence. Evidence you can’t retrieve isn’t evidence.
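The first three moves can be sketched together in a few lines, under some loud assumptions: that spec items carry stable IDs, that each captured decision lists the spec items it answers, and that an empty list of alternatives is a fair proxy for "the thinking didn't happen." `DecisionRecord`, `GapReport`, and `find_gaps` are hypothetical names for illustration, not part of any existing tool.

```python
from dataclasses import dataclass, field


@dataclass
class DecisionRecord:
    """A captured 'why' for one design call (hypothetical schema)."""
    title: str
    chosen: str
    alternatives: list[str] = field(default_factory=list)  # options considered and dropped
    why: str = ""
    spec_refs: list[str] = field(default_factory=list)     # spec item IDs this decision answers


@dataclass
class GapReport:
    unaddressed_spec: list[str]        # asked for, but no decision covers it
    unrequested_decisions: list[str]   # decided, but nothing in the spec asked for it
    undefended_decisions: list[str]    # decided, with no alternatives or no stated why


def find_gaps(spec_ids: set[str], records: list[DecisionRecord]) -> GapReport:
    """Compare what the spec asked for with what the agent actually decided."""
    covered = {ref for record in records for ref in record.spec_refs}
    return GapReport(
        unaddressed_spec=sorted(spec_ids - covered),
        unrequested_decisions=[
            r.title for r in records if not set(r.spec_refs) & spec_ids
        ],
        undefended_decisions=[
            r.title for r in records if not r.alternatives or not r.why.strip()
        ],
    )
```

The unrequested bucket is usually the interesting one. Those are the gaps the agent filled on its own, which is exactly the drift a green check cannot see. A report like this only narrows where to look; a human still has to judge whether each call was right.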

None of this is anti-agent. I’d be a hypocrite. I wrote this one with a fleet at my elbow, and I’d point a fleet at nearly anything. The argument is only that generation getting free doesn’t retire the engineer. It promotes everyone into the chair I’ve been sitting in, the one where you’re accountable for decisions you didn’t make with your own hands, and the only thing standing between you and a beautifully formatted disaster is whether you understood them.

So go back to the eight hundred lines. The check is green. That was never the question. The question is whether you could explain, with the session closed and no tab to peek at, why it’s built this way. If you can, you’re doing the actual job: you’re holding the map. If you can’t, the code is already yours, and the reasoning that made it is scrolling toward the bottom of a log you may never open again. Better go read it now, while it’s still there.

Jason Waldrip has spent his career leading engineering at consumer-scale software companies. He writes about engineering leadership, infrastructure, and building in the age of AI agents.

A note on how this was made: I wrote this with Claude Opus 4.8. I brought the frame, the experience, and the calls on what mattered and what to cut; Claude did most of the drafting. I’d rather say that plainly than pretend the tool wasn’t in the room, especially in a piece about exactly this.