An interactive exploration of how matrix multiplications can execute deterministic programs
A transformer is just matrix multiplications and softmax. How can that execute a program exactly?
The secret: attention is a lookup table. If you set up the keys and queries just right, the softmax becomes so peaked that it acts like an exact array[index] read.
Below are 5 scores. One is the "correct" lookup target. Watch what happens as you increase the temperature multiplier — the softmax sharpens until it's essentially a hard lookup.
At a low multiplier, attention is spread out — fuzzy, not useful for exact computation.
Key insight: When the score gap is large enough, softmax gives >99.99% weight to one entry. The "weighted average" becomes an exact read. This is how weights produce deterministic behavior — not by magic, but by engineering the scores to be extremely peaked.
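Here's a minimal numpy sketch of that effect. The scores and the multiplier values are made up for illustration; the point is only how fast the gap turns a weighted average into a hard pick:

```python
import numpy as np

def softmax(s):
    e = np.exp(s - s.max())        # subtract max for numerical stability
    return e / e.sum()

scores = np.array([1.0, 2.0, 3.0, 2.5, 0.5])   # slot 2 is the "correct" target

for mult in [1, 4, 32]:                        # the temperature multiplier
    w = softmax(mult * scores)
    print(mult, w.round(6))
# At mult=32 the winning slot holds >99.99% of the weight:
# the "weighted average" is effectively an exact read.
```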
A computer needs to read from memory. You say "give me the value at address 3" and you get back whatever is stored there. Simple.
But a transformer has no memory. It only has a growing list of past tokens and one operation: attention — which computes a weighted average over all past tokens. How do you turn a weighted average into an exact memory read?
Before attention can happen, every token goes through the same pipeline. Here it is, step by step:
Then this embedding gets multiplied by three separate weight matrices to produce Q, K, and V:
A common misconception: the value v is not computed from q · k. The dot product q · k only produces a score — a single number. Here's the actual flow:
Think of it like a library: k is the label on the spine of each book, q is the title you're searching for, and v is the content inside the book. You use q and k to find the right book, then you read v from it. The content was always there — the lookup just selects which book to open.
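As a numpy sketch of that pipeline (the sizes d_model=8, d_head=2 and the random weights are placeholders; in this paper the weights are engineered, not random):

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, d_head = 8, 2

W_Q = rng.normal(size=(d_model, d_head))
W_K = rng.normal(size=(d_model, d_head))
W_V = rng.normal(size=(d_model, d_head))

x = rng.normal(size=d_model)          # the token's residual-stream state

q, k, v = x @ W_Q, x @ W_K, x @ W_V   # three independent projections of x
score = q @ k                         # q·k is just one number, a score
print(q, k, v, score)
# Note: v never touches q·k. The score only decides how much of v gets read.
```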
x: The input to each layer isn't the raw token embedding — it's the residual stream: the original embedding plus all outputs from previous layers added on top. So by the time a token reaches layer 3, its x already contains results computed by layers 1 and 2 (like the current instruction pointer, or a value just loaded from the stack). If an earlier layer wrote "the loaded value is 42" into the residual stream, then layer 3's WV can extract that 42 and make it available as this token's value.
(Preview of the engineered vectors: the query is [i, 1] and the key for address j is [2j, −j²] — Tab 3 explains why.)
Now that you know how they're produced, here's what each one means:
| Vector | What it means |
| --- | --- |
| Key (k) | The token's address label. It answers: "Where am I?" Think of it like the number on a mailbox. Token at memory address 3 gets a key that encodes "I'm address 3." |
| Query (q) | The current token's request. It answers: "What address do I need?" If the current instruction needs to read address 3, the query encodes "I want address 3." |
| Value (v) | The token's stored data. It answers: "What's in me?" This is the actual content — the number 42, or a stack entry, or whatever was written there. |
Attention computes the dot product between the query and every key. The dot product is just: multiply corresponding entries and add up. A high dot product means "these two vectors point in a similar direction" — i.e. this key matches what the query is asking for.
Then softmax turns those scores into weights that sum to 1. If one score is much higher than the rest, that key gets nearly 100% of the weight — and the output is just that key's value. That's an exact read.
score_j = q · k_j               ← dot product of query with each key
weight_j = softmax(scores)      ← highest score gets ~100%
output = Σ weight_j × value_j   ← weighted average ≈ exact value
In this paper, each key is a 2-number column vector. The weight matrix W_K is engineered so that a token at address j gets mapped to:
k_j = [2j, −j²]
Why these specific numbers? Because they sit on a parabola, and that shape has a magical property: when you dot-product a query q = [i, 1] with every key, the math simplifies to −(i − j)² + i². That's an upside-down parabola centered at j = i — meaning the key at the exact target address always wins. (Tab 3 visualizes this in detail.)
The key vector isn't something the model "learns" in the usual sense — the weight matrix W_K is set by construction to produce these parabola coordinates. It's more like compiling an address decoder into the weights.
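For illustration, here's one way such a decoder could be compiled. The assumption (ours, not necessarily the paper's exact layout) is that the token's state already exposes the features [j, j²] at known coordinates; W_K is then a fixed 2×2 map:

```python
import numpy as np

# Assumed: the state carries [j, j²] for a token at address j.
W_K = np.array([[ 2.0,  0.0],    # picks up 2·j
                [ 0.0, -1.0]])   # picks up −j²

for j in range(4):
    x_feats = np.array([j, j**2])   # hypothetical state features for address j
    print(j, x_feats @ W_K)         # → [2j, −j²], the parabola key
```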
Below is a simulated memory with 6 slots. Click a slot to change which address you're reading. Watch how the query vector changes, and how the dot products with each key vector determine which value gets read.
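The same demo as a runnable numpy sketch. The six stored values and the sharpness constant are made-up illustrations; the mechanism (score → softmax → weighted average) is exactly the one above:

```python
import numpy as np

values = np.array([10.0, 20.0, 30.0, 42.0, 50.0, 60.0])  # v_j for slots 0..5
keys   = np.array([[2*j, -j**2] for j in range(6)])       # k_j = [2j, −j²]

def read(i, sharpness=20.0):
    q = np.array([i, 1.0])              # query for address i
    scores = sharpness * (keys @ q)     # score_j = q · k_j, scaled up
    w = np.exp(scores - scores.max())
    w /= w.sum()                        # softmax → weights summing to 1
    return w @ values                   # weighted average ≈ values[i]

print(read(3))   # ≈ 42.0 — an exact array read built from attention math
```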
Both use the same mechanism (dot product → softmax → weighted average). The difference is what the keys and queries represent and how sharp the result is.
A trained LLM's heads retrieve fuzzy, approximate matches; the engineered heads here read stack[3] or mem[addr] exactly.
Same math, different weights. Traditional LLMs learn W_K and W_Q from data → fuzzy semantic similarity. This paper engineers W_K and W_Q by construction → exact address lookup. The architecture is identical.
In both cases, the query is produced the same way: multiply the current token's embedding by the WQ weight matrix. The difference is what WQ contains.
Trained LLM — example queries look like [0.3, 0.8, −0.1, ...] or [0.7, 0.8, 0.0, ...]. WQ is trained via gradient descent on billions of tokens. It learns to produce queries that find contextually relevant tokens — "delicious" attends to "cake" because they co-occur. The query is a fuzzy semantic direction.
This paper — the query is exactly [i, 1]. WQ is not learned — it's set so that it extracts the target address from the current token's state and formats it as [i, 1]. The address i itself comes from earlier layers: e.g. the instruction pointer head computes the current IP, and the stack head uses the current stack depth. Each head's WQ is wired to extract the specific piece of state it needs.
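As a sketch, assuming (purely for illustration) a state layout where the address i and a constant 1 sit at fixed coordinates, WQ just copies those two coordinates into the query:

```python
import numpy as np

d_model = 6
W_Q = np.zeros((d_model, 2))
W_Q[2, 0] = 1.0   # assumed coordinate holding the address i
W_Q[5, 1] = 1.0   # assumed coordinate holding the constant 1

x = np.array([0.0, 0.0, 3.0, 0.0, 0.0, 1.0])  # state encoding i = 3
print(x @ W_Q)                                 # → [3, 1] = [i, 1]
```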
Each attention head in the transformer does exactly this — but the "indices" and "values" are determined by the weight matrices W_Q, W_K, W_V. The weights are set so that keys encode addresses, queries encode the address being read, and values carry the stored data.
Multiple heads = multiple independent arrays. That's enough to build registers, a stack, and memory.
Here's the mathematical trick that makes exact lookup work with just 2D keys.
Store index j as the 2D key: k_j = (2j, −j²)
To look up index i, query with direction: q = (i, 1)
The dot product becomes: q · k_j = 2ij − j² = −(i − j)² + i²
Since i² is constant for a given query, the argmax is at j = i — exact match!
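A quick numeric check of the identity and the argmax claim, runnable as-is:

```python
# Verify q · k_j = 2ij − j² = −(i−j)² + i², and that the peak is at j = i.
for i in range(6):
    scores = [2*i*j - j*j for j in range(6)]
    assert all(s == -(i - j)**2 + i*i for j, s in enumerate(scores))
    assert max(range(6), key=lambda j: scores[j]) == i
print("q·k_j peaks exactly at j = i for every i")
```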
The keys sit on a parabola. Drag the query slider to change which index you're looking up. The dot product scores form an inverted parabola peaked at the target.
Index 3 gets the highest score. The penalty −(i−j)² ensures only the exact match wins.
Why this matters: The weight matrix W_K is set so it maps token positions to points on this parabola. W_Q produces the query direction. The result: attention performs an exact array read — no scanning, no approximation. And because the keys are 2D, you can use a convex hull to find the max in O(log n) time instead of checking every key.
A WebAssembly VM needs: an instruction pointer, an operand stack, and memory. Here's how attention heads provide each:
Mechanism: Cumulative sum via attention
Each trace token emits a delta (+1 for next instruction, or a jump offset). One head averages all deltas uniformly, then multiplies by the token count to recover the running sum = current IP.
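A sketch of that rescaling trick (the delta trace below is a made-up example):

```python
import numpy as np

deltas = np.array([1, 1, 1, 5, 1, -3, 1])   # +1 = next instr, ±k = jump offset

for t in range(1, len(deltas) + 1):
    uniform = np.full(t, 1.0 / t)      # attention spread evenly over the past
    mean = uniform @ deltas[:t]        # what the head's weighted average yields
    ip = mean * t                      # rescale by token count → running sum
    assert np.isclose(ip, deltas[:t].sum())
print("mean × count = running sum = current IP")
```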
Mechanism: Index lookup via 2D keys
Stack depth is tracked as a cumulative sum of push/pop deltas. To read the top of stack, the head queries for the current depth using the parabola trick — exact lookup of the most recent value at that depth.
Mechanism: Index lookup via 2D keys
Memory addresses work the same way as stack indices. To read mem[addr], the head queries with direction (addr, 1). The most recent write to that address has the highest score because it has the largest position component.
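Here's a sketch of that trace-as-memory read with a recency term. The sharpness and recency constants are our assumptions for illustration: the recency bonus must be large enough to separate duplicate writes to the same address, yet small enough to never beat the address-match term:

```python
import numpy as np

writes = [(3, 10.0), (5, 7.0), (3, 42.0), (1, 9.0)]   # (addr, value) trace

def read_mem(addr, sharpness=50.0, recency=5.0):
    scores, vals = [], []
    for pos, (a, v) in enumerate(writes):
        match = -(addr - a) ** 2              # parabola trick: 0 iff a == addr
        scores.append(sharpness * match + recency * pos)
        vals.append(v)
    w = np.exp(np.array(scores) - max(scores))
    w /= w.sum()                              # softmax over the whole trace
    return w @ np.array(vals)

print(read_mem(3))   # ≈ 42.0 — the *latest* write to address 3 wins
```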
This is the crucial insight: writing = emitting a new token. There is no separate memory. The growing sequence of trace tokens is the memory.
Click "Step" to watch the token sequence grow. Each token carries a key (address) and value (data). Reading is just attention looking back at past tokens.
Click "Step" to execute WASM instructions. Watch how each head reads/writes.
Now let's see the complete picture. The transformer receives a WASM program as tokens. Then it generates an execution trace — each token is produced by the attention heads doing lookups into the growing trace.
Tool use:
- LLM writes print(3+5)
- Pauses, sends to external Python
- Gets back "8" — a black box
- Execution happened outside

This paper:
- LLM emits WASM tokens
- Each next token = one VM step
- Attention heads ARE the CPU
- Every step visible in the trace
The weights aren't learned from data in the usual sense — they're engineered (or compiled) so that the transformer's forward pass literally executes a WASM interpreter. The architecture is a standard PyTorch transformer. Only the weight values are special.