Guust Goossens

April 2026

The Algebra of Intelligence

From Tokens to Terawatts

Intelligence, when delivered at scale, is a stack. At the top sits the token, the smallest unit of output a user cares about. At the bottom sits the electron, pulled from a power plant somewhere and eventually dissipated as heat in a datacenter. Between them are four or five layers of engineering, each with its own growth rate, its own ceiling, and its own bottleneck.

Most people who build on top of this stack only see one or two layers of it. Application developers see tokens and latency. Model labs see parameters and training runs. Infrastructure teams see racks and interconnects. Utility planners see megawatts and grid capacity. But each layer constrains the one above it, and the dynamics of each layer are different enough that looking at only one gives you a distorted picture of the whole.

This essay walks the stack from top to bottom. Tokens, transformers, chips, datacenters, power. At each layer I want to answer two questions: what is the current state, and what is the trajectory. By the time we reach the bottom, the formula for universal AI inference writes itself, and the picture of what the future looks like, what becomes possible and where the real challenges remain, falls out of the math.

Layer One: The Token

A token is the atomic unit of an AI system. For text, it is a sub-word chunk, typically three to four characters long. "Strawberry" is two tokens. A short sentence is twenty. A page is around five hundred. Everything an AI model reads or writes is measured in these units, and everything the industry charges for is priced per million of them.

Two years ago, token consumption per user was small. A typical ChatGPT session was a few hundred tokens of input and a few hundred of output. The model was a token generator with a chat interface wrapped around it, and the volume was bounded by how fast a human could read and type.

Today that ceiling is gone. Agentic workflows consume tokens in the background. Reasoning models think silently, often burning tens of thousands of tokens to produce a single answer. Coding agents read entire repositories, consuming hundreds of thousands of tokens per task, continuously. Voice models stream tokens in real time. Vision models turn each frame into a few hundred tokens and process them at thirty frames per second. One developer documented 202 million tokens in a single weekend. Eight months of Claude Code usage by a single user amounted to 10 billion tokens, roughly 40 million per day.

The growth rate of tokens-per-user is around 2x per year, and it has been remarkably stable. Not because users got greedier, but because each new capability (reasoning, agents, vision, voice) multiplies the token cost of delivering it.

Multiply tokens-per-user by users and you have the demand side of the ledger. If the goal is eight billion people at six million tokens per day, that is 4.8×10¹⁶ tokens per day globally. Today, the entire industry processes roughly 50 trillion tokens per day. The target is still about 1,000x above that. Every one of those tokens has a concrete compute cost, which is where the next layer comes in.

T_total = N_users · T_tokens ≈ 4.8 × 10¹⁶ tokens/day
r_demand ≈ 3x/year
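
A quick sanity check of the demand arithmetic, as a sketch in Python; the per-user budget and the 50-trillion-tokens-per-day industry estimate are the figures assumed above, not measurements.

# Demand side of the ledger: users times tokens-per-user, compared against
# today's rough industry-wide throughput.
users = 8e9                      # target population
tokens_per_user_per_day = 6e6    # assumed universal-inference budget
total_demand = users * tokens_per_user_per_day    # tokens/day, ~4.8e16

industry_today = 50e12           # rough estimate of tokens processed per day now
gap = total_demand / industry_today

print(f"T_total = {total_demand:.1e} tokens/day")
print(f"gap     = {gap:,.0f}x above today's volume")   # ~1,000x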

Layer Two: The Transformer

A transformer is a machine that turns tokens into more tokens. To do so, it performs a well-defined number of floating-point operations per token, and that number is what connects the token layer to the compute layer.

The rule of thumb is simple. For every token a dense transformer produces, it performs roughly two floating-point operations per model parameter. A 200-billion-parameter model burns around 4×10¹¹ FLOP per output token. Input tokens are cheaper because they can be processed in parallel during prefill. Output tokens are more expensive because each one is generated autoregressively (meaning each new token depends on the one before it, so they have to be produced one at a time in sequence).

In the last few years, though, Mixture-of-Experts (MoE) techniques have changed the math. A sparse model has many more total parameters than it activates per token. DeepSeek-V3 activates 37 billion of its 671 billion parameters per token, a ratio of roughly 18x. The model behaves like a large model when you evaluate its knowledge and reasoning, and like a small model when you measure the per-token compute cost. MoE is the single biggest efficiency lever pulled in the last three years, and it is still accelerating.
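
To make the per-token arithmetic concrete, here is a minimal sketch using the two-FLOP-per-parameter rule of thumb and DeepSeek-V3's published parameter counts; the 200-billion-parameter dense model is the illustrative example from above.

# Per-output-token compute under the ~2 FLOP per parameter rule of thumb.
FLOP_PER_PARAM = 2

dense_params = 200e9                          # illustrative dense model
dense_cost = FLOP_PER_PARAM * dense_params    # ~4e11 FLOP/token

total_params = 671e9                          # DeepSeek-V3 total parameters
active_params = 37e9                          # parameters activated per token
moe_cost = FLOP_PER_PARAM * active_params     # ~7.4e10 FLOP/token

print(f"dense 200B : {dense_cost:.1e} FLOP/token")
print(f"MoE active : {moe_cost:.1e} FLOP/token "
      f"({total_params / active_params:.0f}x sparsity ratio)")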

Transformer architecture also has other knobs. Quantization reduces the precision of each parameter from 16 bits to 8, 4, or in the BitNet case, roughly 1.58 bits. Each step cuts both memory bandwidth and arithmetic cost. NVIDIA's NVFP4 format on Blackwell is production-ready and delivers 2x throughput over FP8. Google's TurboQuant compresses the KV cache (the memory used to store context from prior tokens) to 3-4 bits with zero accuracy loss. Speculative decoding has a small cheap model guess the next few tokens, and the big expensive model verifies the guesses in a single pass. When the guesses are right, which they often are, you get several tokens for the price of one. Batching packs many user requests into a single forward pass, amortizing the fixed cost of loading model weights from memory. vLLM benchmarks at 23x, and newer stacks like SGLang are pushing about 29% faster on top of that.
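
To put a number on the "several tokens for the price of one" claim, here is a rough model of speculative decoding throughput, assuming each drafted token is accepted independently with some probability; the acceptance rates and draft length below are illustrative, not benchmarks.

# Expected tokens produced per verification pass of the large model, assuming
# each of the k drafted tokens is accepted i.i.d. with probability alpha.
def expected_tokens(alpha: float, k: int) -> float:
    return (1 - alpha ** (k + 1)) / (1 - alpha)

for alpha in (0.6, 0.8, 0.9):
    print(f"alpha={alpha:.1f}, k=4 -> {expected_tokens(alpha, 4):.2f} tokens/pass")
# At alpha=0.8 the big model produces ~3.4 tokens per forward pass.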

Compose these factors and you get the software efficiency equation:

η_software = η_MoE · η_batch · η_quant · η_spec

MoE at 18-25x, batching at 23x, quantization at 3-4x, speculative decoding at 2-3x. The multipliers are not fully independent, but even the conservative read is a combined reduction of around 500x in effective FLOP per token compared to a naive dense forward pass.
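
A sketch of how those multipliers compose; the overlap discount is an assumption, chosen only to land on the conservative ~500x figure rather than the naive product.

# Composing the software efficiency stack. Individual multipliers are the
# figures quoted above; the overlap discount is an assumed correction for
# the fact that the levers are not fully independent.
eta_moe   = 18    # MoE sparsity (18-25x)
eta_batch = 23    # continuous batching
eta_quant = 3     # quantization (3-4x)
eta_spec  = 2     # speculative decoding (2-3x)

naive_product = eta_moe * eta_batch * eta_quant * eta_spec   # ~2,484x
overlap_discount = 5                                         # assumed
eta_software = naive_product / overlap_discount              # ~500x

print(f"naive product      : {naive_product:,}x")
print(f"conservative stack : ~{eta_software:,.0f}x")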

The growth rate here has been running at roughly 3x per year for algorithmic efficiency alone, and sometimes much faster in short bursts. Researchers have started calling it the Densing Law: the capability density of models (capability per active parameter) has been doubling every 3.3 months. The distilled 32-billion-parameter variant of DeepSeek-R1, released at the start of 2025, already runs on a single RTX 4090 and scores competitively with the frontier reasoning models of a year before. A year earlier, that level of performance required a cluster.

This has implications beyond datacenters. When a frontier-class reasoning model fits on consumer hardware, the economics of local AI change completely. The efficiency stack is not just closing the gap between supply and demand at datacenter scale. It is also pulling intelligence toward the edge, toward laptops and phones, in a way that seemed structurally impossible two years ago.

F_effective = (T_total · C_per_token) / η_software
r_η ≈ 3x/year

Layer Three: The Chip

FLOPs have to run somewhere, and that somewhere is a GPU or a TPU. The chip is the layer where physics starts to bite.

An H100, NVIDIA's 2022 flagship, delivers around 2×10¹⁵ FLOP per second in FP16 (16-bit floating-point precision, where each number is stored in 16 bits) and roughly twice that in FP8. In a full day, running at 30-50% real-world utilization, that is around 5×10¹⁹ effective FLOP per GPU-day. Blackwell, the 2024 successor, roughly doubles bandwidth and adds native FP4 support at 20 PFLOPS sparse. Rubin, NVIDIA's 2026 architecture, exceeded expectations: 50 PFLOPS at FP4, a 5x jump over Blackwell, with 288 GB of HBM4 and 22 TB/s of memory bandwidth.
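
Converting peak chip throughput into usable compute per GPU-day, as a sketch; the H100-class figure and the 30% utilization are the assumptions from the text.

# Usable compute per GPU-day from peak throughput and real-world utilization.
peak_flops = 2e15          # H100-class FP16 throughput, FLOP/s (approximate)
utilization = 0.30         # low end of the 30-50% range above
seconds_per_day = 86_400

flop_per_gpu_day = peak_flops * utilization * seconds_per_day
print(f"~{flop_per_gpu_day:.1e} effective FLOP per GPU-day")   # ~5e19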

Chip performance has been growing at around 1.6-1.7x per year over the last several generations. The growth is driven by a combination of node shrinks (3nm to 2nm), packaging innovations (chiplets, advanced interconnects), and memory upgrades (HBM3e to HBM4). None of these levers are free, and each one has a visible ceiling.

Memory bandwidth is the quiet bottleneck most people miss. Autoregressive decoding (the one-at-a-time token generation described above) is memory-bound, not compute-bound, because producing each output token requires loading the relevant model weights from HBM (High Bandwidth Memory, the fast memory stacked on top of the chip). H100 has 3.35 TB/s of memory bandwidth. HBM4, which SK Hynix and Samsung began mass-producing in February 2026, pushes this to 22 TB/s. But memory bandwidth grows at around 1.22x per year, slower than raw compute, which means the gap between peak theoretical FLOPs and sustainable token throughput keeps widening. MoE helps because it shrinks the active parameter count per token, but the underlying bandwidth constraint is structural.
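
A sketch of why decode is memory-bound: a single-stream ceiling on tokens per second is just memory bandwidth divided by the bytes of weights streamed per token. KV-cache traffic is ignored and the 8-bit weight assumption is illustrative, so these are optimistic upper bounds, not throughput predictions.

# Roofline-style ceiling on single-stream decode: every output token requires
# streaming the active weights from HBM at least once.
def max_tokens_per_s(active_params: float, bytes_per_param: float, hbm_bw: float) -> float:
    bytes_per_token = active_params * bytes_per_param
    return hbm_bw / bytes_per_token

h100_bw = 3.35e12    # bytes/s
hbm4_bw = 22e12      # bytes/s

# 37 billion active parameters (DeepSeek-V3-like) at 8-bit weights
print(f"H100 ceiling : {max_tokens_per_s(37e9, 1.0, h100_bw):.0f} tokens/s")
print(f"HBM4 ceiling : {max_tokens_per_s(37e9, 1.0, hbm4_bw):.0f} tokens/s")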

The supply side of chips is a separate bottleneck from performance. TSMC's CoWoS advanced packaging capacity (needed to integrate HBM with the GPU die) is the hard constraint on how many high-end AI chips can exist. It has been scaling rapidly, from 35K wafers per month in late 2024 to 75-80K in late 2025, with a 120-130K target for late 2026. But build-out times for new packaging lines are multi-year, and Blackwell backlog hit 3.6 million units, sold out through mid-2026.

The competitive landscape is widening. AMD's MI400 series, arriving in 2026, offers 432 GB of HBM4 and 19.6 TB/s bandwidth. Hyperscaler custom silicon is surging: Microsoft's Maia 200 on 3nm, Amazon's Trainium 3 (already used by Anthropic and OpenAI for inference), Google's TPU v7 at 4,614 TFLOPS. The pure NVIDIA dependency for inference is decreasing, though NVIDIA still dominates training.

N_GPU = F_effective / (t_day · Φ_GPU · η_util)
r_Φ ≈ 1.7x/year
r_BW ≈ 1.22x/year

Layer Four: The Datacenter

Chips have to sit in a building, connected to other chips, cooled, powered, and networked. The datacenter is the layer where individual silicon becomes a cluster.

A single H100 draws around 700 watts. A B200 draws closer to 1,000 watts. A full rack of modern AI accelerators pulls 100 kilowatts or more, compared to 10-20 kilowatts for a traditional server rack. This has reshaped datacenter design. Air cooling no longer suffices for AI racks, so liquid cooling is becoming the default for new builds (a $6 billion market in 2026, growing at 20%+ per year). Power density per square foot is climbing, and new AI-optimized datacenters look more like power plants with some chips bolted on than like the row-of-servers image most people hold.

Global datacenter capacity sits around 130 GW in 2025, of which roughly 30-35 GW is AI-specific. Projections put 2030 total capacity around 200-219 GW, with the AI-specific slice approaching 100 GW. Growth is running at roughly 14% per year overall, and closer to 30% per year for AI-dedicated capacity. Fast for physical infrastructure. Slow for the demand curve it is trying to catch.

P_AI = N_GPU · P_chip · PUE
r_DC ≈ 1.30x/year (AI-dedicated)
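
The translation from fleet size to grid load, as a sketch; the fleet size is a placeholder, and the per-chip draw and PUE (total facility power over IT power) are typical values consistent with the figures above.

# Facility power from chip count: P_AI = N_GPU * P_chip * PUE.
n_gpu = 50e6      # placeholder fleet size
p_chip = 1_000    # watts per accelerator (B200-class, from above)
pue = 1.2         # assumed overhead for a modern liquid-cooled facility

p_ai_gw = n_gpu * p_chip * pue / 1e9
print(f"~{p_ai_gw:.0f} GW of dedicated AI power")   # ~60 GW for this fleet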

Layer Five: The Electron

At the bottom of the stack sits the power grid. This is the layer where the engineering meets civilization.

The universal-inference scenario, once you apply all the software efficiency gains, requires somewhere between 100 and 400 gigawatts of dedicated AI power. Global electricity generation averages around 3,400 GW of continuous output (nameplate capacity is much higher, around 10,400 GW, but availability factors bring it down). AI currently consumes a small single-digit percentage of that. Pushing it to 10-15% requires building new plants, and building new plants requires time that the rest of the stack does not need.

Nuclear, specifically small modular reactors, is the option with the right shape. Always on, high energy density, deployable next to datacenters, timelines of five to ten years. Microsoft, Amazon, Meta, Google, and newer entrants like the NuScale/TVA partnership have collectively committed to over 20 gigawatts of nuclear capacity, most of it scheduled to come online between 2028 and 2035. Three Mile Island's restart is ahead of schedule, targeting 2027.

Geothermal is emerging as a serious second option. Fervo Energy's Cape Station is targeting 100 MW in 2026, scaling to 500 MW by 2028. Google has invested, and the Rhodium Group estimates enhanced geothermal could serve roughly two-thirds of new datacenter demand by 2030.

Fusion is accelerating at the research level (Helion achieved D-T fusion, Commonwealth is targeting first plasma in 2027) but remains a 2035-2040 technology for grid-scale power at the earliest.

The growth rate of AI-specific power capacity is around 15% per year, which is the slowest term in the entire stack. Hyperscaler capital expenditure is trying to force this faster ($660-690 billion committed for 2026, up from $256 billion in 2024, Amazon alone at $200 billion) but money cannot accelerate grid interconnection queues or reactor construction timelines beyond a certain point.

r_power ≈ 1.15x/year

The Formula

Put the whole stack together and the algebra writes itself. Demand is users times tokens times FLOPs-per-token. Supply is chips times FLOPs-per-chip times utilization times the efficiency stack. Divide one by the other and you have the number of chips required.

N_GPU = (N_users · T_tokens · C_per_token) / (t_day · Φ_GPU · η_util · η_MoE · η_batch · η_quant · η_spec)
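
The formula evaluated end to end, as an order-of-magnitude sketch; every input is one of the illustrative figures used earlier, so the outputs show the shape of the calculation and the size of the efficiency lever, not a forecast.

# Demand in FLOP/day divided by effective supply per GPU-day.
SECONDS_PER_DAY = 86_400

def gpus_required(users, tokens_per_user, flop_per_token,
                  peak_flops, utilization, eta_software):
    demand = users * tokens_per_user * flop_per_token             # FLOP/day
    supply_per_gpu = peak_flops * utilization * SECONDS_PER_DAY   # FLOP/GPU-day
    return demand / (supply_per_gpu * eta_software)

dense = gpus_required(8e9, 6e6, 4e11, 2e15, 0.30, eta_software=1)
lean  = gpus_required(8e9, 6e6, 4e11, 2e15, 0.30, eta_software=500)
print(f"dense, no efficiency stack : {dense:.1e} GPUs")   # ~4e8
print(f"with a ~500x software stack: {lean:.1e} GPUs")    # ~7e5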

Every term in the numerator is growing. Every term in the denominator is also growing, and growing faster. The supply and demand rates compose out of the layer rates above, with r_prod standing for the growth rate of chip production volume:

r_supply = r_Φ · r_η · r_prod ≈ 1.7 · 3 · 1.35 ≈ 6.9x/year
r_demand = r_users · r_tokens/user ≈ 1.5 · 2 = 3x/year

The ratio of growth rates is what determines the timeline. Taken at face value, the rates above compose to a closing rate of about 2.3x per year; discounting for the overlap between the efficiency multipliers, the conservative read is roughly 1.5-1.7x of closing per year. The gap today sits around 20-30x across the combined constraints of hardware, energy, and cost (down from roughly 50x a year ago). The convergence formula is:

T_feasible = log(Gap_today) / log(r_supply / r_demand)

That resolves to roughly 2031-2033. If demand grows faster than 3x (which the agentic explosion suggests it might), the timeline stretches by a year or two. If a breakthrough architecture cuts C_per_token by another order of magnitude, it compresses. The formula does not predict the exact year. It predicts the shape of the trajectory, and the shape is unambiguous.
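
The convergence arithmetic, for the gap and closing rates quoted above (a sketch; the specific rates are the estimates from the text, not measurements):

# Years to close the gap at a given net closing rate: T = log(gap) / log(rate).
import math

def years_to_close(gap: float, closing_rate: float) -> float:
    return math.log(gap) / math.log(closing_rate)

for gap in (20, 30):
    for rate in (1.5, 1.7, 2.3):    # conservative reads and the face-value ratio
        print(f"gap {gap}x at {rate}x/yr -> {years_to_close(gap, rate):.1f} years")
# At 1.5-1.7x closing on a 20-30x gap, the window lands in the early 2030s.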

What the algebra reveals is not just a timeline but a map of what matters at each stage. In the near term, the efficiency stack is the story: algorithmic gains are compounding so fast that frontier-class reasoning is already migrating from clusters to consumer GPUs, and that trend will only accelerate. In the medium term, the binding constraint shifts to energy: reactor buildouts and grid interconnection queues become the variables that determine whether universal access arrives in 2031 or 2035. And in the long term, the math suggests something that would have sounded absurd three years ago: intelligence, measured as inference at scale, follows a cost curve that looks a lot like electricity's. A utility. Metered, ubiquitous, and eventually too cheap to meter at the margin.

The algebra of intelligence is one equation with a dozen terms, and every term is moving. The stack tells you which ones to watch.