How Transformer Weights Become a Computer

An interactive exploration of how matrix multiplications can execute deterministic programs

The Puzzle: How Can Weights Be Deterministic?

A transformer is just matrix multiplications and softmax. How can that execute a program exactly?

The secret: attention is a lookup table. If you set up the keys and queries just right, the softmax becomes so peaked that it acts like an exact array[index] read.

Try it: Softmax Temperature

Below are 5 scores. One is the "correct" lookup target. Watch what happens as you increase the temperature multiplier, which scales the scores up — the softmax sharpens until it's essentially a hard lookup.

At a low multiplier, the scores stay close together and attention is spread out — fuzzy, not useful for exact computation.

Key insight: When the score gap is large enough, softmax gives >99.99% weight to one entry. The "weighted average" becomes an exact read. This is how weights produce deterministic behavior — not by magic, but by engineering the scores to be extremely peaked.
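Here's a minimal numpy sketch of that effect (the scores and multiplier values are made up for illustration):

import numpy as np

def softmax(scores):
    e = np.exp(scores - scores.max())   # subtract max for numerical stability
    return e / e.sum()

scores = np.array([1.0, 2.0, 9.0, 3.0, 0.5])   # index 2 is the "correct" lookup target

for multiplier in [0.1, 1.0, 10.0]:
    weights = softmax(multiplier * scores)      # bigger multiplier = sharper softmax
    print(multiplier, np.round(weights, 4))

# 0.1  -> weight spread across all five entries (a fuzzy blend)
# 10.0 -> >99.99% of the weight on index 2 (effectively a hard array read)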

Attention as a Lookup Table

Start here: What problem are we solving?

A computer needs to read from memory. You say "give me the value at address 3" and you get back whatever is stored there. Simple.

But a transformer has no memory. It only has a growing list of past tokens and one operation: attention — which computes a weighted average over all past tokens. How do you turn a weighted average into an exact memory read?

First: How tokens become vectors

Before attention can happen, every token goes through the same pipeline. Here it is, step by step:

Plain text "cake" → (tokenizer) → token ID 4821 → (embedding table) → token embedding x = [0.3, 0.8, −0.1, 0.5, ...] — one vector per token, d dimensions.

Then this embedding gets multiplied by three separate weight matrices to produce Q, K, and V:

x × WQ = q   ← "what address do I need?"
x × WK = k   ← "what address am I at?"
x × WV = v   ← "what data do I hold?"

How Q, K, V work together (this is the key!)

A common misconception: the value v is not computed from q · k. The dot product q · k only produces a score — a single number. Here's the actual flow:

Each past token (independently):
  k_j = x_j × WK   ← its address label
  v_j = x_j × WV   ← its stored data

Current token:
  q = x_now × WQ   ← the address I want

Step 1: Score every past token.   score_j = q · k_j   ← "how well does this key match my query?" — just a number, not a vector
Step 2: Pick the winner.   weight_j = softmax(scores)   ← highest score → ~100% weight, all others → ~0%
Step 3: Retrieve the value.   output = Σ weight_j × v_j   ≈ just v_winner (because its weight is ~100%)

Think of it like a library: k is the label on the spine of each book, q is the title you're searching for, and v is the content inside the book. You use q and k to find the right book, then you read v from it. The content was always there — the lookup just selects which book to open.
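
Here are the three steps as a runnable numpy sketch. The keys already use the paper's parabola encoding; the stored values and softmax scale are made up for illustration:

import numpy as np

def softmax(s):
    e = np.exp(s - s.max())
    return e / e.sum()

# Past tokens: the token at address j carries key [2j, -j^2] and some stored value
addresses = np.arange(6)
K = np.stack([2 * addresses, -addresses**2], axis=1)   # keys, one row per past token
V = np.array([10, 42, 7, 99, 3, 55], dtype=float)      # values (made-up stored data)

# The current token wants to read address 3
i = 3
q = np.array([i, 1])

scores = K @ q                      # Step 1: score_j = q · k_j
weights = softmax(50.0 * scores)    # Step 2: sharp softmax -> essentially one-hot
output = weights @ V                # Step 3: weighted average ≈ V[3]
print(output)                       # ~99.0, an exact read of address 3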

So where is the "memory" physically?
There's no separate memory chip. The past tokens in the sequence are the memory. Each past token carries a key (its address) and a value (its data), both computed from its residual stream via WK and WV. When the model "writes 42 to address 3," it emits a trace token whose key encodes address 3 and whose value encodes 42. When a later step "reads address 3," attention finds that token and retrieves its value.
Important nuance about x: The input to each layer isn't the raw token embedding — it's the residual stream: the original embedding plus all outputs from previous layers added on top. So by the time a token reaches layer 3, its x already contains results computed by layers 1 and 2 (like the current instruction pointer, or a value just loaded from the stack).

This means WV doesn't just extract the original token — it extracts whatever information previous layers have written into the residual stream. For example, if layer 2 computed "the value at stack position 3 is 42", then layer 3's WV can extract that 42 and make it available as this token's value.
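
A tiny sketch of that accumulation, assuming each layer simply adds its output back onto the stream (the vectors here are placeholders for whatever the layers actually compute):

import numpy as np

d = 8
embedding = np.random.randn(d)            # what the token started as

layer1_out = np.random.randn(d)           # stand-in for layer 1's attention + FFN output
layer2_out = np.random.randn(d)           # stand-in for layer 2's output

# The residual stream: each layer adds its output on top of everything before it
x_into_layer3 = embedding + layer1_out + layer2_out

# So layer 3's W_V sees not just the original token, but whatever layers 1-2 wrote
W_V = np.random.randn(d, 2)
v = x_into_layer3 @ W_V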
The mechanism is identical in both traditional LLMs and this paper. The only difference is what the weight matrices contain:
● Traditional: WQ, WK, WV are learned from data via gradient descent → produce fuzzy semantic vectors
● This paper: WQ, WK, WV are set by construction (compiled) → produce exact address vectors like [i, 1] and [2j, −j²]

The three players: Key, Query, Value

Now that you know how they're produced, here's what each one means:

Key (k) The token's address label. It answers: "Where am I?"
Think of it like the number on a mailbox. Token at memory address 3 gets a key that encodes "I'm address 3."
Query (q) The current token's request. It answers: "What address do I need?"
If the current instruction needs to read address 3, the query encodes "I want address 3."
Value (v) The token's stored data. It answers: "What's in me?"
This is the actual content — the number 42, or a stack entry, or whatever was written there.

How the lookup works

Attention computes the dot product between the query and every key. The dot product is just: multiply corresponding entries and add up. A high dot product means "these two vectors point in a similar direction" — i.e. this key matches what the query is asking for.

Then softmax turns those scores into weights that sum to 1. If one score is much higher than the rest, that key gets nearly 100% of the weight — and the output is just that key's value. That's an exact read.

1. Compute scores:  score_j = q · k_j  ← dot product of query with each key
2. Normalize:  weight_j = softmax(scores)  ← highest score gets ~100%
3. Read:  output = Σ weight_j × value_j  ← weighted average ≈ exact value

But what are the key vectors, concretely?

In this paper, each key is a 2-number column vector. The weight matrix W_K is engineered so that a token at address j gets mapped to:

k_j = [2j, −j²]

Why these specific numbers? Because they sit on a parabola, and that shape has a magical property: when you dot-product a query q = [i, 1] with every key, the math simplifies to −(i − j)² + i². That's an upside-down parabola centered at j = i — meaning the key at the exact target address always wins. (Tab 3 visualizes this in detail.)

The key vector isn't something the model "learns" in the usual sense — the weight matrix W_K is set by construction to produce these parabola coordinates. It's more like compiling an address decoder into the weights.

Interactive: See It In Action

Below is a simulated memory with 6 slots. Click a slot to change which address you're reading. Watch how the query vector changes, and how the dot products with each key vector determine which value gets read.

Query Vector

Key Vectors & Dot Products

Side by Side: Traditional LLM Attention vs Memory Lookup

Both use the same mechanism (dot product → softmax → weighted average). The difference is what the keys and queries represent and how sharp the result is.

Traditional LLM Attention
Keys = learned semantic embeddings (high-dimensional, shown as 4D slice)
Weights are spread across multiple tokens — attention is a soft blend. Good for language understanding ("what words are related?") but useless for exact computation.
Memory Lookup (this paper)
Keys = engineered 2D addresses on a parabola
Weight is 100% on one token — attention is a hard lookup. This is what you need to read stack[3] or mem[addr] exactly.

Same math, different weights. Traditional LLMs learn W_K and W_Q from data → fuzzy semantic similarity. This paper engineers W_K and W_Q by construction → exact address lookup. The architecture is identical.

But where does the query vector come from?

In both cases, the query is produced the same way: multiply the current token's embedding by the WQ weight matrix. The difference is what WQ contains.

Traditional LLM:
x = [0.3, 0.8, −0.1, ...]   ← current token embedding, ~4096 dimensions
× WQ   ← weight matrix learned from data; encodes "what's semantically relevant to me?"
= q = [0.7, 0.8, 0.0, ...]   ← query vector that points toward semantically similar keys

WQ is trained via gradient descent on billions of tokens. It learns to produce queries that find contextually relevant tokens — "delicious" attends to "cake" because they co-occur. The query is a fuzzy semantic direction.

Memory Lookup (this paper):
x includes the target address i   ← current state from prior layers (earlier attention heads computed which address to read)
× WQ   ← weight matrix engineered by construction; extracts address i and formats it as (i, 1)
= q = [i, 1]   ← query vector that points exactly at key k_i on the parabola

WQ is not learned — it's set so that it extracts the target address from the current token's state and formats it as [i, 1]. The address i itself comes from earlier layers: e.g. the instruction pointer head computes the current IP, and the stack head uses the current stack depth. Each head's WQ is wired to extract the specific piece of state it needs.
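
As a toy sketch of what "set by construction" can look like, suppose the residual stream stores the target address at a known coordinate and keeps a constant-1 feature (both are my assumptions for illustration; the paper's actual layout differs):

import numpy as np

d = 8
ADDR_SLOT, ONE_SLOT = 2, 7     # hypothetical residual-stream coordinates

# Current token's residual stream: earlier layers wrote address i = 3 into ADDR_SLOT,
# and ONE_SLOT always holds the constant 1
x = np.zeros(d)
x[ADDR_SLOT] = 3.0
x[ONE_SLOT] = 1.0

# W_Q is just a selection matrix: copy the address into q[0] and the constant into q[1]
W_Q = np.zeros((d, 2))
W_Q[ADDR_SLOT, 0] = 1.0
W_Q[ONE_SLOT, 1] = 1.0

q = x @ W_Q
print(q)   # [3. 1.]  -> exactly the query direction (i, 1) from the parabola trick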

The chain of computation across layers:
Layer 1: A head computes the instruction pointer (cumulative sum of deltas) → "we're at instruction 5"
Layer 2: A head uses the IP to look up instruction 5 from the program → "it's i32.add"
Layer 3: A head uses the stack depth to read the top two stack values → operands
Layer 4: The feed-forward network does the arithmetic → result
Layer 5: A head writes the result back to the stack at the new depth

Each layer's WQ is wired to extract exactly the right piece of state from the previous layer's output. The query vector is never a vague "what's relevant?" — it's always a precise "give me the value at this specific address."

Each attention head in the transformer does exactly this — but the "indices" and "values" are determined by the weight matrices W_Q, W_K, W_V. The weights are set so that:

  • W_K maps each token to a 2D "address" on a parabola
  • W_Q maps the current step to a "query direction" — extracting the target address from the current state
  • W_V extracts the stored value

Multiple heads = multiple independent arrays. That's enough to build registers, a stack, and memory.

The 2D Parabola Trick: Exact Index Lookup

Here's the mathematical trick that makes exact lookup work with just 2D keys.

Store index j as the 2D key:  k_j = (2j, −j²)

To look up index i, query with direction:  q = (i, 1)

The dot product becomes:  q · k_j = 2ij − j² = −(i − j)² + i²

Since i² is constant for a given query, the argmax is at j = i — exact match!
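
You can check the identity numerically with a quick sketch:

import numpy as np

j = np.arange(10)                 # candidate addresses 0..9
keys = np.stack([2 * j, -j**2], axis=1)

i = 6                             # address we want to read
q = np.array([i, 1])

scores = keys @ q                 # 2*i*j - j^2
assert np.allclose(scores, -(i - j)**2 + i**2)   # same expression, rewritten
print(scores.argmax())            # 6 -> the exact target address always wins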

Interactive: See the Parabola

The keys sit on a parabola. Drag the query slider to change which index you're looking up. The dot product scores form an inverted parabola peaked at the target.

Keys on Parabola (2D space)

Dot Product Scores (q · k_j)

Index 3 gets the highest score. The penalty −(i−j)² ensures only the exact match wins.

Why this matters: The weight matrix W_K is set so it maps token positions to points on this parabola. W_Q produces the query direction. The result: attention performs an exact array read — no scanning, no approximation. And because the keys are 2D, you can use a convex hull to find the max in O(log n) time instead of checking every key.

Building a Stack Machine from Attention Heads

A WebAssembly VM needs: an instruction pointer, an operand stack, and memory. Here's how attention heads provide each:

📍 Instruction Pointer

Mechanism: Cumulative sum via attention

Each trace token emits a delta (+1 for next instruction, or a jump offset). One head averages all deltas uniformly, then multiplies by the token count to recover the running sum = current IP.
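
A sketch of that mechanism: uniform attention over the per-token deltas is just their mean, and multiplying by the token count recovers the sum (the delta values below are made up):

import numpy as np

# Per-token IP deltas emitted so far: mostly +1, with one jump of +4 (made-up trace)
deltas = np.array([1, 1, 1, 4, 1, 1], dtype=float)

n = len(deltas)
uniform_weights = np.full(n, 1.0 / n)      # attention that ignores content: every token equal

mean_delta = uniform_weights @ deltas      # what a uniform attention head outputs
ip = mean_delta * n                        # multiply by token count -> running sum

print(ip)                                  # 9.0 = 1+1+1+4+1+1, the current instruction pointer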

📚 Operand Stack

Mechanism: Index lookup via 2D keys

Stack depth is tracked as a cumulative sum of push/pop deltas. To read the top of stack, the head queries for the current depth using the parabola trick — exact lookup of the most recent value at that depth.

💾 Memory

Mechanism: Index lookup via 2D keys

Memory addresses work the same way as stack indices. To read mem[addr], the head queries with direction (addr, 1). The most recent write to that address has the highest score because it has the largest position component.

But how does "writing" work? There's no memory!

This is the crucial insight: writing = emitting a new token. There is no separate memory. The growing sequence of trace tokens is the memory.

Click "Step" to watch the token sequence grow. Each token carries a key (address) and value (data). Reading is just attention looking back at past tokens.

Summary:
Write = emit a token. Its WK gives it an address, its WV gives it data. It just sits there in the sequence.
Read = a later token's WQ produces a query. Attention scans all past tokens' keys, finds the match, returns that token's value.
Overwrite = emit another token with the same address. Both tokens tie on the parabola score, but the newer one wins: a position component in the score breaks ties in favor of later tokens.

There is no "write back" step. The sequence only grows. Old values are never erased — they're just overshadowed by newer tokens at the same address.

Interactive: Step Through a Stack Machine

Click "Step" to execute WASM instructions. Watch how each head reads/writes.

Putting It All Together: 3 + 5 Inside a Transformer

Now let's see the complete picture. The transformer receives a WASM program as tokens. Then it generates an execution trace — each token is produced by the attention heads doing lookups into the growing trace.

Full Execution Trace: 3 + 5

Why This Is Different From Tool Use

🔧 Tool Use

LLM writes print(3+5)

Pauses, sends to external Python

Gets back "8" — a black box

Execution happened outside

⚡ In-Model Execution

LLM emits WASM tokens

Each next token = one VM step

Attention heads ARE the CPU

Every step visible in the trace

The Weight-Level Summary

● WK maps positions → parabola points. Encodes "what address am I?"
● WQ maps current state → query direction. Encodes "what address do I need?"
● WV extracts stored values. Encodes "what data is here?"
● Wff (feed-forward) does ALU ops: addition, comparison, branching logic.
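
As a flavor of what "Wff does ALU ops" can mean, here is a toy two-layer feed-forward block wired by hand to add two operands stored at known residual-stream slots (the slot layout and the non-negative-operand assumption are mine, purely for illustration):

import numpy as np

d = 8
A_SLOT, B_SLOT, OUT_SLOT = 0, 1, 2        # hypothetical residual-stream layout

x = np.zeros(d)
x[A_SLOT], x[B_SLOT] = 3.0, 5.0           # operands previously read from the stack

# First linear layer: route both operands into one hidden unit, so it holds a + b
W1 = np.zeros((d, 1))
W1[A_SLOT, 0] = 1.0
W1[B_SLOT, 0] = 1.0
hidden = np.maximum(x @ W1, 0.0)          # ReLU; a + b passes through when operands are non-negative

# Second linear layer: write the sum into the output slot of the residual stream
W2 = np.zeros((1, d))
W2[0, OUT_SLOT] = 1.0
x = x + hidden @ W2                       # the FFN adds its result back onto the stream

print(x[OUT_SLOT])                        # 8.0 -> the i32.add result for 3 + 5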

The weights aren't learned from data in the usual sense — they're engineered (or compiled) so that the transformer's forward pass literally executes a WASM interpreter. The architecture is a standard PyTorch transformer. Only the weight values are special.