$ cat choices/distributed-systems.md

Distributed Systems

the call

I treat scale as a design problem before a code problem. The first question on anything that has to grow is where state lives. Decide that before you write a line, because you can't refactor your way out of the wrong answer.

why

Most “we can’t scale” problems aren’t slow code. They’re state in the wrong place. Once you’ve shipped a design that assumes one database, one process, one source of truth, throwing hardware at it just moves the wall. The leverage is in the decision you make before any of that exists: where does state live, who owns it, and what happens when a node dies. Get that right and the code is almost boring. Get it wrong and no amount of optimization saves you.

when I don’t

A distributed system is a distributed problem. You trade one hard thing (a slow monolith) for a pile of harder ones: partial failure, consistency, ordering, debugging across boundaries. Most products never hit the scale that justifies any of it. Reaching for queues, sharding, and eventual consistency before you’ve proven you need them is résumé-driven architecture. The default should be the boring single store until the load curve says otherwise. Distribute when the data forces your hand, not when it flatters your design doc.

in production

iTriage / Aetna was a consumer health app at real scale, and it’s where I learned the lesson the hard way: scale is a systems-design problem before it’s a code problem. The questions that mattered weren’t about hot loops. They were about where state lived and who owned it, decided up front. The other durable thing I took from it: the most lasting thing a senior engineer ships isn’t code, it’s other engineers. Most of that came from pairing, not from documents.— see: works / iTriage / Aetna

the principle under it

The transferable rule is simple and almost everyone violates it: decide where state lives first, and treat that decision as the architecture. Everything downstream (the language, the framework, the deploy story) is negotiable. State ownership isn’t. It’s the one call you can’t cheaply reverse later, which is exactly why it deserves the most thought up front and the least improvisation under deadline.

the gaps — what it costs even when it’s right

Failure modes multiply. The moment state crosses a network boundary you inherit partial failure, retries, and ordering: bugs that don’t exist in a single process and don’t reproduce on your laptop. You’re now debugging across machines, which is a different and slower craft.

Consistency becomes a daily tax. “Where does state live” answers one question and opens another: how stale is allowed, and who reconciles when two truths disagree. Every feature after that pays a little interest on the answer.

It’s easy to pay the cost too early. The same up-front thinking that saves you at scale is pure overhead before you get there. Designing for distribution you don’t need is its own failure: slower delivery now for a problem you may never have.

used at

works / iTriage / Aetna