$ cat choices/local-llms.md

Local LLMs

the call

I keep a serious eye on open-weight, local models, and reach for them more every month. When the data is sensitive, the cost is per-token-at-scale, or you need control and offline capability, a model on your own hardware beats a frontier API. The gap to the frontier is closing fast enough that 'local' is no longer the compromise it used to be.

why

When you run an open-weight model yourself, three things change. The data never leaves, which is the whole ballgame for regulated, sensitive, or competitive workloads. The cost is hardware, not per-token, so at volume the economics of an always-on fleet flip. And you own the thing: quantize it, fine-tune it, pin a version, run it offline, and never wake up to a deprecated endpoint or a surprise price change. For the high-volume, lower-stakes layer of the work, that control is worth a lot.

when I don’t

For the genuinely hard reasoning (the load-bearing call, the gnarly agentic task) the hosted frontier models still lead, and I won’t run a weaker local model on the part that actually needs to be right just to be ideologically local. And “local” means you own the ops: GPUs, serving, scaling, evals. That’s real cost and real expertise. The move isn’t all local. It’s the right model in the right place.

in production

What’s changed my posture is the pace. Open-weight releases (Qwen (3.6), the latest DeepSeek, and the steady drumbeat behind them) now land within striking distance of frontier on a growing share of tasks, and the cadence is measured in months. The pattern that’s emerging: run the fleet’s bulk volume on local/open models where “good enough, owned, and private” wins, and reserve the frontier API for the hard reasoning. The crossover where local is the default for most of the work, not the exception, is arriving faster than most teams have planned for.— see: choices / the-agent-fleet · choices / llm-apis

the principle under it

Match the model to the task and the constraints (privacy, cost, latency, control) not to the leaderboard. The frontier isn’t always the right tool; “good enough, owned, and private” beats “best, rented, and exposed” for a large and growing slice of real work. The teams that win won’t be the ones who picked one model. They’ll be the ones who route each job to the model that fits it, and who saw the local curve coming.

the gaps — what it costs even when it’s right

You own the ops now. GPUs, serving infra, quantization tradeoffs, evals, and keeping the lights on. The per-token price went away and an infrastructure bill (and a skillset) took its place. Cheaper at scale, not cheaper to start.

It still trails on the hardest reasoning. The gap is closing, not closed. Put a local model on a task past its weight class and you get exactly the confident-wrong failure that makes unattended generation dangerous.

The ground moves monthly. Today’s best open model is stale by next quarter. That’s the upside and the tax: staying current is a standing commitment, and betting the stack on a single snapshot is how you end up frozen on yesterday’s weights.

used at

choices / llm-apis choices / the-agent-fleet