Newsletter
AI: Inference
Inference is the part of AI nobody talks about until the cloud bill arrives. It's the runtime moment—the actual generating-an-answer step—that fires every time your model "thinks," every tool call, every approval round-trip inside an agent loop. Training built the ship. Inference is what burns the fuel. And the fuel economy of AI in 2026 is what quietly decides whether your product survives.
The simple picture
AI inference is what happens when a trained model actually does work—reads your prompt, runs the math, produces an answer. Training is the one-time event that gives the model its weights; inference is the billion-times-a-day event that turns those weights into output. Every chat reply, every code completion, every agent decision is one inference call. In 2026, the world runs about 40 trillion of them per day, and that number doubles roughly every five months.
The model is the recipe. Inference is the kitchen at lunch rush.
Before vs After
Before the inference shift: GPT-class models lived in remote data centers. You called an API, paid per token, and prayed the model remembered your last ten hours of context. Latency was measured in seconds. Costs were measured in "is the CFO going to ask about this?"
After: Inference is everywhere. A 70B model runs on a single consumer GPU at 80 tokens/sec. A 4B model runs on your phone at 60 tokens/sec, offline, in airplane mode. Specialized chips quote inference cost in fractions of a cent per million tokens. The smartest model in your workflow today might literally be running on the laptop in your bag.
Where it came from
The pressure had been building since 2024, but three things broke it open.
First, speculative decoding went mainstream in mid-2025. The idea is brutally clever—a tiny "draft" model guesses the next 4-8 tokens, the big model verifies them in a single forward pass, and you get 3-5× the speed for free. By Q4 2025 every serious inference engine had it on by default.
Second, the hardware fan-out. NVIDIA's Blackwell ramped, AMD's MI400 finally shipped credible CUDA-alternative tooling, Apple's M5 Max ate the prosumer ML market overnight (128GB unified memory at MacBook prices), and at least four custom inference ASICs—Groq's LPU, Cerebras WSE-3, Etched's Sohu, SambaNova—each carved out a niche where they beat NVIDIA on tokens-per-dollar for some specific shape of workload.
Third, open weights kept getting better. Llama 4, Qwen 3, DeepSeek V4, and the surprise hit Hermes 4 from Nous all shipped between December 2025 and March 2026. Combined with 4-bit quantization that actually preserves quality, the gap between "frontier closed model" and "what I can run at home" closed to weeks for most tasks.
By the time the AAF released the Open Inference Benchmark in February 2026, the conversation had already shifted from "which model is smartest" to "which inference stack ships the answer fastest, cheapest, and most privately."
How the pieces actually work
A modern inference stack is five layers, each with its own engineering culture.
The model: weights on disk. Quantized—typically Q4_K_M, AWQ, or GPTQ for open weights, FP8 for newer hardware. The smaller the format without quality loss, the cheaper everything downstream.
The runtime: vLLM, TensorRT-LLM, llama.cpp, MLX, SGLang, TGI. Each makes different bets on batching strategy, kernel fusion, and which hardware it loves. vLLM dominates server-side; llama.cpp and MLX own the local and Mac story; SGLang is winning the structured-output crowd.
The KV cache: the model's working memory for the current conversation. Continuous batching, paged attention, and prefix sharing turned this from "the bottleneck" into "the place where 10× speedups live." Modern stacks share cache across users when prompts share prefixes—huge for system prompts and RAG.
Speculative and parallel decoding: draft model plus verifier, Medusa heads, EAGLE-2, lookahead decoding. Doubles to triples throughput on most workloads, free of charge.
The router: decides which model gets which request. Cheap model for cheap queries, frontier model only when needed. The router is the last 10× cost reduction nobody talks about.
All of this is increasingly invisible. You point your code at an OpenAI-compatible endpoint and the stack figures out the rest.
What this actually unlocks
The inference shift isn't a feature—it's a phase change in what AI products can do.
Agents that don't think about cost. When inference drops 100× in two years, agent loops that were "too expensive to run" become "run them all night." Every codebase audit, every market scan, every prototype critique now happens in parallel because the marginal token is approximately free.
Local-first by default. A Tauri app shipping a 4B model in the binary is now normal. Your IDE assistant, your notes app, your translator—all running offline, all private, all instant. Cloud inference becomes the fallback, not the default.
Inference-time compute. Models like o1 and DeepSeek-R1 proved you can trade more inference for smarter outputs. Want a better answer? Spend 30 seconds of GPU time instead of 3. The intelligence dial is now on the runtime, not the model.
Specialized routing. A coding query goes to a code-tuned 70B at 200 tok/s. A summary goes to a 3B at 800 tok/s. A creative draft goes to a frontier model. The router does what humans used to do—pick the right tool—and saves 80% of the bill.
Edge inference for latency-critical paths. Game NPCs that respond in 40ms. AR glasses that caption the world in real time. Robots that plan locally and don't freeze when WiFi flickers. None of this works on cloud round-trips.
One stranger experiment worth watching: inference markets. Spot-priced GPU capacity, sold by the second, routed by latency and price across providers. Akash, Vast, and Together's combined spot tier moved $40M in compute in March 2026—up from nothing a year earlier. Your inference stack now has a yield curve.
Looking ahead
Three things are about to get weird.
Inference will eat training's budget. For the first time, Anthropic, OpenAI, and Google are all spending more on serving inference than on training the next model. The flywheel reversed—inference revenue funds training, not the other way around—and that's reshaping how new architectures get evaluated. "How well does it serve?" is now a first-class question alongside "how well does it train?"
Hardware-software co-design becomes mandatory. The next generation of models will be designed against specific inference targets—specific chips, specific quantization formats, specific batch sizes. The "train once, serve anywhere" era is ending. Llama 5 is rumored to ship in three "shapes" tuned for three different inference profiles.
Compliance and observability eat the next layer. As inference moves into regulated industries, every token needs to be auditable—which model, which prompt, which user, which approval. The testing story we covered last issue was the prototype for this; the same logging hooks are now being wired into the inference runtime itself, not just the tool layer.
The hidden shift? Inference is no longer an implementation detail—it's the product. The model gives you intelligence; the inference stack decides whether that intelligence is fast enough, cheap enough, and private enough to actually use. The next billion-dollar AI companies won't be model labs. They'll be the people who figured out how to run someone else's model 50× cheaper than the lab can.