The Tests Are Green

Your pipeline says green. Your dashboards say healthy. Your users are stuck. Here's the gap your signals don't cover.

By Jason Waldrip, with Claude Opus 4.8

You shipped Thursday afternoon. The merge train was green, the image promoted clean, the canary baked six hours without a blip. You slept fine. Friday morning there’s a support thread: customers can’t get through checkout. You open the dashboards. Error rate, normal. Latency, normal. Nothing in the exception tracker, every health check green. By every signal you’re paying to watch, the system is healthy. The feature is broken.

What broke doesn’t throw. A user submitted a form, it validated, the response came back 200. The record never persisted, because of how the confirmation step races a read against a write that hasn’t committed yet. Your tests cover the form. They cover the validation. They don’t cover that combination, with that database state, at that timing, because nobody wrote that test, because nobody thought to, because the code reads fine in isolation. It is fine in isolation. That was never the question.

The question tests don’t answer

Tests verify code paths. That’s the job, and it’s not negotiable. You write a test, the test tells you the function does what you claimed. Real, fast, valuable. But “this function does what I said” is a different question from “the checkout flow is working right now, for real people, on real phones, against production data.”

Those two questions don’t fail together. A team that trusts the first one to answer the second has built its confidence in the wrong place, and won’t find out until a customer tells them.

I walked into Brandfolder to find a product going down on a memory leak. The test suite was green the whole time. Memory isn’t a code path you assert against; it’s a property of the thing while it runs, under real load, over real hours. We couldn’t fix it until we could see it. Once we could watch the process actually behave, the fix came fast. Before that we were tightening bolts that weren’t loose while the real one backed out a turn at a time.

Monitoring is watching for what you feared

Most teams have monitoring. Dashboards, alerts, a rotation. And nearly all of it is shaped around the failures they imagined the day they built the system.

So you end up instrumented for the outage you pictured, blind to the one you didn’t.

Error rate catches crashes. Latency catches slow. Queue depth catches backed-up. All worth having. None of them answer whether a person can finish a flow. Whether the submit actually persisted. Whether the third-party call that returned 200 also returned the right thing. Whether the feature you shipped last week is being used or quietly abandoned because it’s broken in a way that never writes a log line. A 200 with an empty body is the happiest-looking failure in your whole system.
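One way to stop taking a 200 at its word is to check the outcome the response claims, not just the status. Here is a minimal sketch in Python, assuming a JSON body; `check_outcome` and the field names are hypothetical, stand-ins for whatever your flow actually promises to return.

```python
import json
from typing import Iterable, Optional


def check_outcome(status: int, body: str, required_fields: Iterable[str]) -> Optional[str]:
    """Return None if the response looks like a real success, else a reason string.

    Hypothetical helper: required_fields is whatever the flow promises,
    for example the ID of the record that was supposed to be written.
    """
    if status != 200:
        return f"status {status}"
    if not body.strip():
        return "200 with empty body"
    try:
        payload = json.loads(body)
    except json.JSONDecodeError:
        return "200 with non-JSON body"
    if not isinstance(payload, dict):
        return "200 with unexpected payload shape"
    # A field that is absent, null, or an empty string counts as missing.
    missing = [f for f in required_fields if payload.get(f) in (None, "")]
    if missing:
        return "200 but missing: " + ", ".join(missing)
    return None
```

Count the non-None results as an outcome metric of their own. That is the line on the dashboard that moves when the status codes don’t.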

Your system can pass every check you’re running and fail every one that matters to the person on the other end.

Observability is a different property

The word got annexed by APM vendors and flattened into “dashboards, but prettier.” That’s not what I mean by it.

As a property of a system, observability is this: can you look at the outputs and understand what’s going on inside, including things you never thought to instrument? Here’s the clean test. Something breaks in prod in a shape you didn’t predict. Can you work out why from data you already have, without shipping new logging and waiting for a deploy? If the honest answer is “we’d have to add instrumentation and redeploy,” then what you have is monitoring for predicted failures. Useful. Not the same thing.

What closes the gap is structured, high-cardinality data, captured at enough granularity that you can ask questions you didn’t know to ask when you wrote the feature. Not “was there an error” but “what did this user’s session look like across these three services in the ninety seconds before it stalled.” Not “is the queue deep” but “which job class started retrying after Wednesday’s deploy, and what do those payloads share.” If the answer to that lives in next sprint’s logging ticket, you are permanently one incident behind the thing that’s actually hurting you.
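To make the retry question concrete, here is a rough sketch of asking it against events you already have. Everything in it is an assumption for illustration: the event shape, the job class names, the sample data, and the two helper functions. In practice this would be a query in whatever event store you run, not a loop over a list.

```python
from collections import Counter
from datetime import datetime

# Hypothetical wide events: one dict per job attempt, context attached at write time.
DEPLOY_AT = datetime(2024, 1, 3, 12, 0)  # an illustrative Wednesday deploy
EVENTS = [
    {"ts": datetime(2024, 1, 2, 10, 0), "job_class": "SyncInventory", "attempt": 2,
     "payload": {"region": "eu", "currency": "EUR"}},
    {"ts": datetime(2024, 1, 3, 9, 0), "job_class": "SendReceipt", "attempt": 1,
     "payload": {"region": "us", "currency": "USD"}},
    {"ts": datetime(2024, 1, 3, 12, 30), "job_class": "SendReceipt", "attempt": 2,
     "payload": {"region": "us", "currency": "CAD"}},
    {"ts": datetime(2024, 1, 3, 12, 45), "job_class": "SendReceipt", "attempt": 3,
     "payload": {"region": "eu", "currency": "CAD"}},
    {"ts": datetime(2024, 1, 3, 13, 0), "job_class": "SyncInventory", "attempt": 2,
     "payload": {"region": "eu", "currency": "EUR"}},
]


def is_retry(event):
    return event["attempt"] > 1


def newly_retrying(events, deploy_at):
    """Job classes that retried after deploy_at but never before it, with retry counts."""
    before = {e["job_class"] for e in events if is_retry(e) and e["ts"] < deploy_at}
    after = Counter(e["job_class"] for e in events if is_retry(e) and e["ts"] >= deploy_at)
    return {cls: n for cls, n in after.items() if cls not in before}


def shared_payload_fields(events, job_class, deploy_at):
    """Key/value pairs present in every post-deploy retry payload of job_class.

    Assumes payload values are hashable (strings, numbers), as in the sample data.
    """
    payloads = [e["payload"] for e in events
                if e["job_class"] == job_class and is_retry(e) and e["ts"] >= deploy_at]
    if not payloads:
        return {}
    common = set(payloads[0].items())
    for p in payloads[1:]:
        common &= set(p.items())
    return dict(common)
```

With the sample data, `newly_retrying(EVENTS, DEPLOY_AT)` singles out `SendReceipt`, and `shared_payload_fields` narrows it to the one thing those retries have in common. Neither question was known when the events were written, and neither needed a redeploy to answer.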

The cut is simple. Monitoring tells you something went wrong. Observability tells you why, before the support ticket does it for you.

The agent era widens the gap

Every bit of this gets heavier when generation goes cheap.

When one engineer ships a handful of deliberate changes a week, there’s an attention budget riding on each diff. A careful human reads it, pictures the failure, writes the test for the edge they’re worried about. The coverage is a fossil of the care that went into the code. It exists because a person was nervous about exactly the right thing.

When a fleet opens dozens of clean, plausible merge requests a day, that nervousness is gone. Agents test what they exercised. They don’t know about the payment provider that behaves differently on a retry, the mobile layout that’s subtly wrong in a viewport no one opened, the analytics event that stopped firing after a re-render three components away. They don’t know because nobody told them, and at generation-era volume there’s no human reading every diff with the right fear in their gut. I’ve written before that a fleet multiplies whatever judgment you give it, and zero times a fleet is still zero. This is where the zero hides: in the tests nobody knew to ask for.

Which leaves exactly one honest source of truth: the running system. Real users, real product, real flows completing or not. Funnel completion, outcomes by result, conversion against the baseline. That signal is what tells you Thursday’s deploy actually worked, because it’s measuring the only thing that was ever the point. When generation is free, knowing what’s true is the entire job.
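Funnel completion against a baseline is a small calculation once the events exist. A minimal sketch, assuming each session emits named step events; `FlowEvent`, the step names, and the 10 percent tolerance are all hypothetical choices, not recommendations.

```python
from dataclasses import dataclass
from typing import Iterable, Optional


@dataclass(frozen=True)
class FlowEvent:
    session_id: str
    step: str  # hypothetical step names, e.g. "checkout_started", "checkout_completed"


def completion_rate(events: Iterable[FlowEvent], start_step: str, end_step: str) -> Optional[float]:
    """Fraction of sessions that reached start_step and also reached end_step."""
    events = list(events)
    started = {e.session_id for e in events if e.step == start_step}
    finished = {e.session_id for e in events if e.step == end_step}
    if not started:
        return None  # no traffic means "unknown", not "healthy"
    return len(started & finished) / len(started)


def below_baseline(rate: Optional[float], baseline: float, max_relative_drop: float = 0.10) -> bool:
    """True if the rate is unknown or more than max_relative_drop below baseline."""
    if rate is None:
        return True
    return rate < baseline * (1 - max_relative_drop)
```

The design choice worth noticing is the `None` branch. A flow nobody is starting is not evidence that the flow works, so the sketch treats silence as a problem rather than a pass.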

What to actually instrument

Tests verify intent. Observability verifies outcomes. You need both, and you should stop pretending one covers for the other.

Watch outcomes, not just exceptions. A user who started a flow and didn’t finish it is a signal, even when nothing threw. Instrument the funnel, not only the endpoints. Abandonment is data.
Real-user monitoring over synthetic checks. Synthetic tests prove the happy path under conditions you controlled. RUM tells you what’s happening on real devices, on real connections, right now. Both earn their keep; only one knows about this morning’s deploy.
One honest health endpoint. The canary is only as good as what it bakes against. Error rate and latency are necessary and nowhere near sufficient. Aggregate a real business signal into the gate: conversion, funnel completion, worker success. A slower honest gate beats a fast lying one every time.
Structure events for the question you haven’t asked yet. Every request and action should carry enough context to reconstruct what happened without a redeploy: user, session, flags in effect, code version, the IDs that let you stitch a story across services. The log you’ll be desperate for next month is the one you have to write this sprint.
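The last item on that list is the cheapest to start on. As a sketch only, here is one way an event might carry its context, using nothing beyond the Python standard library. `build_event`, `to_line`, and every field name are assumptions for illustration, not the schema of any particular tool.

```python
import json
import time
import uuid


def build_event(name, *, user_id, session_id, trace_id, code_version, flags, **fields):
    """Build one structured event with the context needed to reconstruct it later.

    Hypothetical schema. Extra keyword arguments become event fields; the
    context keys are written last so a stray field cannot overwrite them.
    """
    return {
        **fields,
        "event": name,
        "event_id": str(uuid.uuid4()),
        "ts": time.time(),
        "user_id": user_id,
        "session_id": session_id,
        "trace_id": trace_id,          # the ID that stitches the story across services
        "code_version": code_version,  # which deploy produced this behavior
        "flags": sorted(flags),        # feature flags in effect for this request
    }


def to_line(event):
    """Serialize to one JSON line, the shape most log pipelines can ingest."""
    return json.dumps(event, sort_keys=True)
```

A call such as `build_event("checkout_submitted", user_id="u-1", session_id="s-1", trace_id="t-1", code_version="abc123", flags={"new_confirm_step"}, order_id="o-9")` yields a record you can later slice by user, session, flag, or version without having guessed in advance which one would matter.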

None of this demotes tests. They’re still the fastest feedback in the cycle, and they catch things in seconds that production would charge you days for. They just answer “does this code do what I said,” not “is this working for the person holding the phone.” Know which question you’re asking. Carry evidence for both.

The signal decides

I’ve written about the merge train, the canary, what it takes to make every green commit shippable. All of it rests on a signal worth trusting. A canary baking against error rate and latency is a good start. A canary baking against those plus “did the core flow complete” is the difference between confidence and theater.
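Here is roughly what that difference looks like as a promotion gate. It is a sketch under stated assumptions: the signal names are hypothetical, the thresholds are placeholders rather than recommendations, and a real gate would read these values from your metrics system.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass(frozen=True)
class CanarySignals:
    error_rate: float                 # fraction of requests that errored
    p95_latency_ms: float
    flow_completion: Optional[float]  # core-flow completion; None if too little traffic


@dataclass(frozen=True)
class GateThresholds:
    # Placeholder numbers for illustration only.
    max_error_rate: float = 0.01
    max_p95_latency_ms: float = 800.0
    min_flow_completion: float = 0.60


def gate(signals: CanarySignals, limits: GateThresholds = GateThresholds()) -> Tuple[bool, List[str]]:
    """Return (promote, reasons). Promote only when every signal passes."""
    reasons = []
    if signals.error_rate > limits.max_error_rate:
        reasons.append("error rate above limit")
    if signals.p95_latency_ms > limits.max_p95_latency_ms:
        reasons.append("p95 latency above limit")
    if signals.flow_completion is None:
        # Not enough real traffic to judge the flow: keep baking instead of guessing.
        reasons.append("flow completion unknown")
    elif signals.flow_completion < limits.min_flow_completion:
        reasons.append("flow completion below limit")
    return (not reasons, reasons)
```

The Thursday deploy from the top of this piece would pass the first two checks and fail the last one, which is the whole argument in about forty lines.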

Your trust in the release process can’t run deeper than the signal underneath it. Build shallow signals and what you’ve actually built is a very sophisticated machine for shipping into the dark with a smile on.

So sit with this one. If something broke quietly in your product right now, no exception, no spike, just a flow that stopped completing, how long until you knew? An hour? A support ticket? Something you’d stumble onto in next week’s numbers?

That number is the debt you’re carrying. And unlike most of the debt on your books, you can pay it down before it comes due.

Jason Waldrip has spent his career leading engineering at consumer-scale software companies. He writes about engineering leadership, infrastructure, and building in the age of AI agents.

A note on how this was made: I wrote this with Claude Opus 4.8. I brought the frame, the experience, and the calls on what mattered and what to cut; Claude did most of the drafting. I’d rather say that plainly than pretend the tool wasn’t in the room, especially in a piece about exactly this.