<!--
  Interactive explainer of Percepta's "Can LLMs Be Computers?": how transformer
  weights can execute deterministic WASM programs (softmax sharpening, the 2D
  parabola trick for exact memory lookup, a stack-machine step-through, and a
  full execution trace).
  Author: kennethnym (commit 6c3abab032, 2026-03-22). Co-authored-by: Ona <no-reply@ona.com>
-->
<!DOCTYPE html>
<html lang="en">
<head>
<meta charset="UTF-8">
<meta name="viewport" content="width=device-width, initial-scale=1.0">
<title>How Transformer Weights Become a Computer</title>
<link rel="stylesheet" href="style.css">
</head>
<body>
<div id="app">
<header>
<h1>How Transformer Weights Become a Computer</h1>
<p class="subtitle">An interactive exploration of how matrix multiplications can execute deterministic programs</p>
</header>
<nav id="tabs">
<button class="tab active" data-tab="tab1">1. The Puzzle</button>
<button class="tab" data-tab="tab2">2. Attention = Lookup</button>
<button class="tab" data-tab="tab3">3. The 2D Trick</button>
<button class="tab" data-tab="tab4">4. Stack Machine</button>
<button class="tab" data-tab="tab5">5. Full Execution</button>
</nav>
<!-- TAB 1: The Puzzle -->
<section id="tab1" class="panel active">
<h2>The Puzzle: How Can Weights Be Deterministic?</h2>
<div class="explainer">
<p>A transformer is just matrix multiplications and softmax. How can that <em>execute a program</em> exactly?</p>
<p>The secret: attention is a <strong>lookup table</strong>. If you set up the keys and queries just right, the softmax becomes so peaked that it acts like an exact <code>array[index]</code> read.</p>
</div>
<div class="demo-box">
<h3>Try it: Softmax Temperature</h3>
<p>Below are 5 scores. One is the "correct" lookup target. Watch what happens as you increase the multiplier (equivalently, lowering the softmax temperature) — the softmax sharpens until it's essentially a hard lookup.</p>
<div class="slider-row">
<label>Temperature multiplier: <strong id="tempVal">1</strong></label>
<input type="range" id="tempSlider" min="1" max="50" value="1" step="1">
</div>
<div id="softmaxBars"></div>
<p class="insight" id="tempInsight">At low temperature, attention is spread out — fuzzy, not useful for exact computation.</p>
</div>
<div class="explainer">
<p><strong>Key insight:</strong> When the score gap is large enough, softmax gives >99.99% weight to one entry. The "weighted average" becomes an exact read. This is how weights produce deterministic behavior — not by magic, but by engineering the scores to be extremely peaked.</p>
</div>
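The sharpening effect is easy to verify outside the demo. A minimal sketch in plain Python (the five scores are invented for illustration, in the spirit of the slider above):

```python
import math

def softmax(scores):
    # Subtract the max for numerical stability, then normalize.
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

scores = [1.0, 2.0, 5.0, 2.5, 0.5]  # index 2 is the "correct" lookup target

for mult in (1, 10, 50):
    weights = softmax([s * mult for s in scores])
    print(mult, [round(w, 4) for w in weights])

# At multiplier 1 the weights are spread out; by multiplier 50 the
# winning entry holds essentially 100% of the weight.
```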
<button class="next-btn" data-next="tab2">Next: Attention = Lookup →</button>
</section>
<!-- TAB 2: Attention as Lookup -->
<section id="tab2" class="panel">
<h2>Attention as a Lookup Table</h2>
<div class="explainer">
<h3 style="color:var(--gold);margin-bottom:0.6rem">Start here: What problem are we solving?</h3>
<p>A computer needs to <em>read from memory</em>. You say "give me the value at address 3" and you get back whatever is stored there. Simple.</p>
<p>But a transformer has no memory. It only has a growing list of past tokens and one operation: <strong>attention</strong> — which computes a weighted average over all past tokens. How do you turn a weighted average into an exact memory read?</p>
</div>
<div class="explainer">
<h3 style="color:var(--accent);margin-bottom:0.6rem">First: How tokens become vectors</h3>
<p>Before attention can happen, every token goes through the same pipeline. Here it is, step by step:</p>
<div class="pipeline-diagram">
<div class="pipe-step">
<div class="pipe-label">Plain text</div>
<div class="pipe-box pipe-text">"cake"</div>
</div>
<div class="pipe-arrow"><br><span class="pipe-arrow-label">tokenizer</span></div>
<div class="pipe-step">
<div class="pipe-label">Token ID</div>
<div class="pipe-box pipe-id">4821</div>
</div>
<div class="pipe-arrow"><br><span class="pipe-arrow-label">embedding table</span></div>
<div class="pipe-step">
<div class="pipe-label">Token embedding (x)</div>
<div class="pipe-box pipe-emb">[0.3, 0.8, 0.1, 0.5, ...]</div>
<div class="pipe-dim">one vector per token, d dimensions</div>
</div>
</div>
<p style="margin-top:0.8rem">Then this embedding gets multiplied by <strong>three separate weight matrices</strong> to produce Q, K, and V:</p>
<div class="qkv-matmul-diagram">
<div class="qkv-row">
<div class="qkv-input">x</div>
<div class="qkv-times">×</div>
<div class="qkv-matrix" style="border-color:var(--warn)">W<sub>Q</sub></div>
<div class="qkv-eq">=</div>
<div class="qkv-output" style="color:var(--warn)">q</div>
<div class="qkv-meaning">← "what address do I need?"</div>
</div>
<div class="qkv-row">
<div class="qkv-input">x</div>
<div class="qkv-times">×</div>
<div class="qkv-matrix" style="border-color:var(--accent)">W<sub>K</sub></div>
<div class="qkv-eq">=</div>
<div class="qkv-output" style="color:var(--accent)">k</div>
<div class="qkv-meaning">← "what address am I at?"</div>
</div>
<div class="qkv-row">
<div class="qkv-input">x</div>
<div class="qkv-times">×</div>
<div class="qkv-matrix" style="border-color:var(--green)">W<sub>V</sub></div>
<div class="qkv-eq">=</div>
<div class="qkv-output" style="color:var(--green)">v</div>
<div class="qkv-meaning">← "what data do I hold?"</div>
</div>
</div>
<div class="explainer" style="margin-top:0.8rem;border-color:var(--gold);border-width:2px">
<h3 style="color:var(--gold);margin-bottom:0.6rem">How Q, K, V work together (this is the key!)</h3>
<p>A common misconception: the value <code>v</code> is <strong>not</strong> computed from <code>q · k</code>. The dot product <code>q · k</code> only produces a <em>score</em> — a single number. Here's the actual flow:</p>
<div class="attn-flow-diagram">
<div class="attn-flow-section">
<div class="attn-flow-label">Each past token (independently)</div>
<div class="attn-flow-row">
<div class="attn-flow-box" style="border-color:var(--accent)">k<sub>j</sub> = x<sub>j</sub> × W<sub>K</sub></div>
<div class="attn-flow-desc">its address label</div>
</div>
<div class="attn-flow-row">
<div class="attn-flow-box" style="border-color:var(--green)">v<sub>j</sub> = x<sub>j</sub> × W<sub>V</sub></div>
<div class="attn-flow-desc">its stored data</div>
</div>
</div>
<div class="attn-flow-section">
<div class="attn-flow-label">Current token</div>
<div class="attn-flow-row">
<div class="attn-flow-box" style="border-color:var(--warn)">q = x<sub>now</sub> × W<sub>Q</sub></div>
<div class="attn-flow-desc">the address I want</div>
</div>
</div>
<div class="attn-flow-section">
<div class="attn-flow-label">Step 1: Score every past token</div>
<div class="attn-flow-row">
<div class="attn-flow-box computed">score<sub>j</sub> = q · k<sub>j</sub></div>
<div class="attn-flow-desc">"how well does this key match my query?"<br>→ just a number, not a vector</div>
</div>
</div>
<div class="attn-flow-section">
<div class="attn-flow-label">Step 2: Pick the winner</div>
<div class="attn-flow-row">
<div class="attn-flow-box computed">weight<sub>j</sub> = softmax(scores)</div>
<div class="attn-flow-desc">highest score → ~100% weight<br>all others → ~0%</div>
</div>
</div>
<div class="attn-flow-section">
<div class="attn-flow-label">Step 3: Retrieve the value</div>
<div class="attn-flow-row">
<div class="attn-flow-box" style="border-color:var(--gold);border-width:2px">output = Σ weight<sub>j</sub> × v<sub>j</sub></div>
<div class="attn-flow-desc">≈ just <strong>v<sub>winner</sub></strong> (because its weight is ~100%)</div>
</div>
</div>
</div>
<p style="margin-top:0.6rem">Think of it like a library: <strong style="color:var(--accent)">k</strong> is the label on the spine of each book, <strong style="color:var(--warn)">q</strong> is the title you're searching for, and <strong style="color:var(--green)">v</strong> is the content inside the book. You use q and k to <em>find</em> the right book, then you <em>read</em> v from it. The content was always there — the lookup just selects which book to open.</p>
</div>
<div style="background:var(--surface2);border-radius:8px;padding:0.7rem 1rem;margin-top:0.8rem;font-size:0.82rem;line-height:1.6">
<strong style="color:var(--gold)">So where is the "memory" physically?</strong><br>
There's no separate memory chip. The <strong>past tokens in the sequence are the memory</strong>. Each past token carries a key (its address) and a value (its data), both computed from its residual stream via W<sub>K</sub> and W<sub>V</sub>. When the model "writes 42 to address 3," it emits a trace token whose key encodes address 3 and whose value encodes 42. When a later step "reads address 3," attention finds that token and retrieves its value.
</div>
<div style="background:var(--surface2);border-radius:8px;padding:0.7rem 1rem;margin-top:0.6rem;font-size:0.82rem;line-height:1.6">
<strong style="color:var(--gold)">Important nuance about <code>x</code>:</strong> The input to each layer isn't the raw token embedding — it's the <strong>residual stream</strong>: the original embedding <em>plus</em> all outputs from previous layers added on top. So by the time a token reaches layer 3, its <code>x</code> already contains results computed by layers 1 and 2 (like the current instruction pointer, or a value just loaded from the stack).<br><br>
This means W<sub>V</sub> doesn't just extract the original token — it extracts <em>whatever information previous layers have written into the residual stream</em>. For example, if layer 2 computed "the value at stack position 3 is <code>42</code>", then layer 3's W<sub>V</sub> can extract that <code>42</code> and make it available as this token's value.
</div>
<div style="background:var(--surface2);border-radius:8px;padding:0.7rem 1rem;margin-top:0.6rem;font-size:0.82rem;line-height:1.6">
<strong style="color:var(--gold)">The mechanism is identical in both traditional LLMs and this paper.</strong> The only difference is what the weight matrices contain:<br>
<span style="color:var(--accent2)">● Traditional:</span> W<sub>Q</sub>, W<sub>K</sub>, W<sub>V</sub> are <em>learned from data</em> via gradient descent → produce fuzzy semantic vectors<br>
<span style="color:var(--gold)">● This paper:</span> W<sub>Q</sub>, W<sub>K</sub>, W<sub>V</sub> are <em>set by construction</em> (compiled) → produce exact address vectors like <code>[i, 1]</code> and <code>[2j, j²]</code>
</div>
</div>
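As a sanity check on the pipeline above, the three projections are plain matrix-vector products. A minimal sketch (the 3-dimensional x and the tiny weight matrices are toy values, not the paper's):

```python
def matvec(W, x):
    # Multiply vector x by matrix W (each row of W is one output dimension).
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

x = [0.3, 0.8, 0.1]           # residual-stream vector for one token (toy values)

W_Q = [[1, 0, 0], [0, 1, 0]]  # three separate projections: in this paper they
W_K = [[0, 1, 0], [0, 0, 1]]  # are set by construction; in a normal LLM they
W_V = [[0, 0, 1], [1, 0, 0]]  # are learned by gradient descent

q = matvec(W_Q, x)  # "what address do I need?"
k = matvec(W_K, x)  # "what address am I at?"
v = matvec(W_V, x)  # "what data do I hold?"
print(q, k, v)
```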
<div class="explainer">
<h3 style="color:var(--accent);margin-bottom:0.6rem">The three players: Key, Query, Value</h3>
<p>Now that you know how they're produced, here's what each one <em>means</em>:</p>
<table class="kqv-table">
<tr>
<td class="kqv-name" style="color:var(--accent)">Key (k)</td>
<td>The token's <em>address label</em>. It answers: <strong>"Where am I?"</strong><br>
Think of it like the number on a mailbox. Token at memory address 3 gets a key that encodes "I'm address 3."</td>
</tr>
<tr>
<td class="kqv-name" style="color:var(--warn)">Query (q)</td>
<td>The current token's <em>request</em>. It answers: <strong>"What address do I need?"</strong><br>
If the current instruction needs to read address 3, the query encodes "I want address 3."</td>
</tr>
<tr>
<td class="kqv-name" style="color:var(--green)">Value (v)</td>
<td>The token's <em>stored data</em>. It answers: <strong>"What's in me?"</strong><br>
This is the actual content — the number 42, or a stack entry, or whatever was written there.</td>
</tr>
</table>
</div>
<div class="explainer">
<h3 style="color:var(--accent);margin-bottom:0.6rem">How the lookup works</h3>
<p>Attention computes the <strong>dot product</strong> between the query and every key. The dot product is just: multiply corresponding entries and add up. A high dot product means "these two vectors point in a similar direction" — i.e. <em>this key matches what the query is asking for</em>.</p>
<p>Then <strong>softmax</strong> turns those scores into weights that sum to 1. If one score is much higher than the rest, that key gets nearly 100% of the weight — and the output is just that key's value. That's an exact read.</p>
<div style="background:var(--surface2);border-radius:8px;padding:0.8rem 1rem;margin-top:0.6rem;font-size:0.85rem;line-height:1.7">
<span style="color:var(--dim)">1.</span> Compute scores: &nbsp;<code>score_j = q · k_j</code> &nbsp;<span style="color:var(--dim)">← dot product of query with each key</span><br>
<span style="color:var(--dim)">2.</span> Normalize: &nbsp;<code>weight_j = softmax(scores)</code> &nbsp;<span style="color:var(--dim)">← highest score gets ~100%</span><br>
<span style="color:var(--dim)">3.</span> Read: &nbsp;<code>output = Σ weight_j × value_j</code> &nbsp;<span style="color:var(--dim)">← weighted average ≈ exact value</span>
</div>
</div>
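Those three steps fit in a few lines. A minimal sketch (the keys follow the parabola construction; the values, query, and sharpness scale are toy choices):

```python
import math

def softmax(scores):
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    t = sum(exps)
    return [e / t for e in exps]

# One key/value pair per past token.
keys   = [[2 * j, -j * j] for j in range(6)]   # parabola keys
values = [10, 20, 30, 40, 50, 60]              # toy stored data
q      = [3, 1]                                # "give me address 3"

scale   = 20                                               # sharpness factor
scores  = [scale * (q[0] * k[0] + q[1] * k[1]) for k in keys]  # 1. dot products
weights = softmax(scores)                                      # 2. normalize
output  = sum(w * v for w, v in zip(weights, values))          # 3. weighted read
print(round(output, 6))   # ≈ 40, the value stored at address 3
```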
<div class="explainer">
<h3 style="color:var(--accent);margin-bottom:0.6rem">But what <em>are</em> the key vectors, concretely?</h3>
<p>In this paper, each key is a <strong>2-number column vector</strong>. The weight matrix <code>W_K</code> is engineered so that a token at address <code>j</code> gets mapped to:</p>
<div style="text-align:center;margin:0.6rem 0;font-size:1.1rem">
<code>k<sub>j</sub> = <span style="color:var(--accent)">[2j, −j²]</span></code>
</div>
<p>Why these specific numbers? Because they sit on a <strong>parabola</strong>, and that shape has a magical property: when you dot-product a query <code>q = [i, 1]</code> with every key, the math simplifies to <code>2ij − j² = −(i − j)² + i²</code>. That's an upside-down parabola centered at <code>j = i</code> — meaning the key at the exact target address always wins. (Tab 3 visualizes this in detail.)</p>
<p>The key vector isn't something the model "learns" in the usual sense — the weight matrix <code>W_K</code> is <strong>set by construction</strong> to produce these parabola coordinates. It's more like compiling an address decoder into the weights.</p>
</div>
<div class="demo-box">
<h3>Interactive: See It In Action</h3>
<p>Below is a simulated memory with 6 slots. Click a slot to change which address you're reading. Watch how the query vector changes, and how the dot products with each key vector determine which value gets read.</p>
<div id="memorySlots"></div>
<h4 style="color:var(--dim);margin:1rem 0 0.5rem;font-size:0.85rem">Query Vector</h4>
<div id="queryVec"></div>
<h4 style="color:var(--dim);margin:1rem 0 0.5rem;font-size:0.85rem">Key Vectors &amp; Dot Products</h4>
<div id="vecColumns"></div>
<div id="readResult"></div>
</div>
<div class="demo-box">
<h3>Side by Side: Traditional LLM Attention vs Memory Lookup</h3>
<p>Both use the same mechanism (dot product → softmax → weighted average). The difference is <em>what the keys and queries represent</em> and <em>how sharp the result is</em>.</p>
<div class="sbs-controls">
<div class="slider-row">
<label>Input word to attend to: <strong id="sbsTargetLabel">delicious</strong></label>
<input type="range" id="sbsTargetSlider" min="0" max="4" value="2" step="1">
</div>
</div>
<div class="sbs-container">
<!-- LEFT: Traditional -->
<div class="sbs-panel">
<div class="sbs-header trad">Traditional LLM Attention</div>
<div class="sbs-subtitle">Keys = learned semantic embeddings (high-dimensional, shown as 4D slice)</div>
<div id="sbsTradQ" class="sbs-vec-row"></div>
<div id="sbsTradKeys" class="sbs-keys"></div>
<div id="sbsTradResult" class="sbs-result"></div>
<div class="sbs-note">
Weights are <strong>spread across multiple tokens</strong> — attention is a soft blend. Good for language understanding ("what words are related?") but useless for exact computation.
</div>
</div>
<!-- RIGHT: Memory Lookup -->
<div class="sbs-panel">
<div class="sbs-header lookup">Memory Lookup (this paper)</div>
<div class="sbs-subtitle">Keys = engineered 2D addresses on a parabola</div>
<div id="sbsLookupQ" class="sbs-vec-row"></div>
<div id="sbsLookupKeys" class="sbs-keys"></div>
<div id="sbsLookupResult" class="sbs-result"></div>
<div class="sbs-note">
Weight is <strong>100% on one token</strong> — attention is a hard lookup. This is what you need to read <code>stack[3]</code> or <code>mem[addr]</code> exactly.
</div>
</div>
</div>
<p class="insight">Same math, different weights. Traditional LLMs learn W_K and W_Q from data → fuzzy semantic similarity. This paper engineers W_K and W_Q by construction → exact address lookup. The architecture is identical.</p>
</div>
<div class="explainer">
<h3 style="color:var(--accent);margin-bottom:0.6rem">But where does the query vector come from?</h3>
<p>In both cases, the query is produced the same way: <strong>multiply the current token's embedding by the W<sub>Q</sub> weight matrix</strong>. The difference is what W<sub>Q</sub> contains.</p>
<div class="query-origin-grid">
<!-- Traditional -->
<div class="qo-panel">
<div class="qo-header" style="color:var(--accent2)">Traditional LLM</div>
<div class="qo-pipeline">
<div class="qo-step">
<div class="qo-label">Current token embedding</div>
<div class="qo-box">x = <code>[0.3, 0.8, 0.1, ...]</code></div>
<div class="qo-dim">~4096 dimensions</div>
</div>
<div class="qo-arrow">× W<sub>Q</sub></div>
<div class="qo-step">
<div class="qo-label">W<sub>Q</sub> weight matrix</div>
<div class="qo-box learned">Learned from data</div>
<div class="qo-dim">Encodes "what's semantically relevant to me?"</div>
</div>
<div class="qo-arrow">=</div>
<div class="qo-step">
<div class="qo-label">Query vector</div>
<div class="qo-box">q = <code>[0.7, 0.8, 0.0, ...]</code></div>
<div class="qo-dim">Points toward semantically similar keys</div>
</div>
</div>
<p class="qo-note">W<sub>Q</sub> is trained via gradient descent on billions of tokens. It learns to produce queries that find <em>contextually relevant</em> tokens — "delicious" attends to "cake" because they co-occur. The query is a fuzzy semantic direction.</p>
</div>
<!-- Memory lookup -->
<div class="qo-panel">
<div class="qo-header" style="color:var(--gold)">Memory Lookup (this paper)</div>
<div class="qo-pipeline">
<div class="qo-step">
<div class="qo-label">Current state (from prior layers)</div>
<div class="qo-box">x includes the <em>target address</em> i</div>
<div class="qo-dim">Prior attention heads computed which address to read</div>
</div>
<div class="qo-arrow">× W<sub>Q</sub></div>
<div class="qo-step">
<div class="qo-label">W<sub>Q</sub> weight matrix</div>
<div class="qo-box engineered">Engineered by construction</div>
<div class="qo-dim">Extracts address i and formats it as (i, 1)</div>
</div>
<div class="qo-arrow">=</div>
<div class="qo-step">
<div class="qo-label">Query vector</div>
<div class="qo-box">q = <code>[i, 1]</code></div>
<div class="qo-dim">Points exactly at key k<sub>i</sub> on the parabola</div>
</div>
</div>
<p class="qo-note">W<sub>Q</sub> is <strong>not learned</strong> — it's set so that it extracts the target address from the current token's state and formats it as <code>[i, 1]</code>. The address <code>i</code> itself comes from earlier layers: e.g. the instruction pointer head computes the current IP, and the stack head uses the current stack depth. Each head's W<sub>Q</sub> is wired to extract the specific piece of state it needs.</p>
</div>
</div>
<div style="background:var(--surface2);border-radius:8px;padding:0.8rem 1rem;margin-top:0.8rem;font-size:0.85rem;line-height:1.7">
<strong style="color:var(--accent)">The chain of computation across layers:</strong><br>
<span style="color:var(--dim)">Layer 1:</span> A head computes the instruction pointer (cumulative sum of deltas) → "we're at instruction 5"<br>
<span style="color:var(--dim)">Layer 2:</span> A head uses the IP to look up instruction 5 from the program → "it's <code>i32.add</code>"<br>
<span style="color:var(--dim)">Layer 3:</span> A head uses the stack depth to read the top two stack values → operands<br>
<span style="color:var(--dim)">Layer 4:</span> The feed-forward network does the arithmetic → result<br>
<span style="color:var(--dim)">Layer 5:</span> A head writes the result back to the stack at the new depth<br>
<br>
Each layer's W<sub>Q</sub> is wired to extract exactly the right piece of state from the previous layer's output. The query vector is never a vague "what's relevant?" — it's always a precise "give me the value at this specific address."
</div>
</div>
<div class="explainer">
<p>Each attention head in the transformer does exactly this — but the "indices" and "values" are determined by the weight matrices <code>W_Q</code>, <code>W_K</code>, <code>W_V</code>. The weights are set so that:</p>
<ul>
<li><strong>W_K</strong> maps each token to a 2D "address" on a parabola</li>
<li><strong>W_Q</strong> maps the current step to a "query direction" — extracting the target address from the current state</li>
<li><strong>W_V</strong> extracts the stored value</li>
</ul>
<p>Multiple heads = multiple independent arrays. That's enough to build registers, a stack, and memory.</p>
</div>
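Because each head has its own weight matrices, each head implements an independent array over the same token sequence. A toy sketch (a hard argmax stands in for a sharply peaked softmax; all stored values are invented):

```python
def attend(values, addr):
    # Parabola-trick scores: 2*addr*j - j^2 peaks exactly at j == addr.
    scores = [2 * addr * j - j * j for j in range(len(values))]
    return values[scores.index(max(scores))]

stack_vals  = [3, 5, 8]        # head 1's value projections (toy)
memory_vals = [0, 42, 0, 7]    # head 2's value projections (toy)

print(attend(stack_vals, 2))    # head 1 reads stack[2] -> 8
print(attend(memory_vals, 1))   # head 2 reads mem[1]   -> 42
```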
<button class="next-btn" data-next="tab3">Next: The 2D Trick →</button>
</section>
<!-- TAB 3: The 2D Parabola Trick -->
<section id="tab3" class="panel">
<h2>The 2D Parabola Trick: Exact Index Lookup</h2>
<div class="explainer">
<p>Here's the mathematical trick that makes exact lookup work with just 2D keys.</p>
<p>Store index <code>j</code> as the 2D key: <strong>k<sub>j</sub> = (2j, −j²)</strong></p>
<p>To look up index <code>i</code>, query with direction: <strong>q = (i, 1)</strong></p>
<p>The dot product becomes: <code>q · k<sub>j</sub> = 2ij − j² = −(i − j)² + i²</code></p>
<p>Since <code>i²</code> is constant for a given query, the argmax is at <code>j = i</code> — exact match!</p>
</div>
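The claim can be checked exhaustively for a small address range (plain Python):

```python
# k_j = (2j, -j^2), q = (i, 1)  =>  q . k_j = 2ij - j^2 = -(i - j)^2 + i^2
for i in range(8):                                           # every query index
    scores = [2 * i * j - j * j for j in range(8)]
    assert scores == [-(i - j) ** 2 + i ** 2 for j in range(8)]  # algebra check
    assert scores.index(max(scores)) == i                        # argmax at j = i
print("argmax is the exact target index for all i")
```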
<div class="demo-box">
<h3>Interactive: See the Parabola</h3>
<p>The keys sit on a parabola. Drag the query slider to change which index you're looking up. The dot product scores form an inverted parabola peaked at the target.</p>
<div class="slider-row">
<label>Query index i = <strong id="queryIdxVal">3</strong></label>
<input type="range" id="queryIdxSlider" min="0" max="7" value="3" step="1">
</div>
<div class="two-col">
<div>
<h4>Keys on Parabola (2D space)</h4>
<canvas id="parabolaCanvas" width="380" height="300"></canvas>
</div>
<div>
<h4>Dot Product Scores (q · k<sub>j</sub>)</h4>
<canvas id="scoresCanvas" width="380" height="300"></canvas>
</div>
</div>
<p class="insight" id="parabolaInsight">Index 3 gets the highest score. The penalty (ij)² ensures only the exact match wins.</p>
</div>
<div class="explainer">
<p><strong>Why this matters:</strong> The weight matrix <code>W_K</code> is set so it maps token positions to points on this parabola. <code>W_Q</code> produces the query direction. The result: attention performs an exact array read — no scanning, no approximation. And because the keys are 2D, you can use a convex hull to find the max in O(log n) time instead of checking every key.</p>
</div>
<button class="next-btn" data-next="tab4">Next: Stack Machine →</button>
</section>
<!-- TAB 4: Stack Machine -->
<section id="tab4" class="panel">
<h2>Building a Stack Machine from Attention Heads</h2>
<div class="explainer">
<p>A WebAssembly VM needs: an <strong>instruction pointer</strong>, an <strong>operand stack</strong>, and <strong>memory</strong>. Here's how attention heads provide each:</p>
</div>
<div class="mechanism-grid">
<div class="mechanism">
<h3>📍 Instruction Pointer</h3>
<p><strong>Mechanism:</strong> Cumulative sum via attention</p>
<p>Each trace token emits a delta (+1 for next instruction, or a jump offset). One head averages all deltas uniformly, then multiplies by the token count to recover the running sum = current IP.</p>
<div class="mini-demo" id="ipDemo">
<div class="ip-trace"></div>
</div>
</div>
<div class="mechanism">
<h3>📚 Operand Stack</h3>
<p><strong>Mechanism:</strong> Index lookup via 2D keys</p>
<p>Stack depth is tracked as a cumulative sum of push/pop deltas. To read the top of stack, the head queries for the current depth using the parabola trick — exact lookup of the most recent value at that depth.</p>
<div class="mini-demo" id="stackDemo">
<div class="stack-vis"></div>
</div>
</div>
<div class="mechanism">
<h3>💾 Memory</h3>
<p><strong>Mechanism:</strong> Index lookup via 2D keys</p>
<p>Memory addresses work the same way as stack indices. To read <code>mem[addr]</code>, the head queries with direction <code>(addr, 1)</code>. The most recent write to that address has the highest score because it has the largest position component.</p>
</div>
</div>
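The instruction-pointer mechanism above is just a rescaled mean. A sketch with an invented delta trace:

```python
# Each trace token emits a control-flow delta: +1 for "next instruction",
# or a jump offset. Uniform attention over n tokens computes the mean of
# the deltas; multiplying by the token count n recovers the running sum,
# i.e. the current instruction pointer.
deltas = [1, 1, 1, -2, 1, 1]   # toy trace: three steps, a jump back, two steps

ip = []
for n in range(1, len(deltas) + 1):
    mean = sum(deltas[:n]) / n      # what a uniform attention head outputs
    ip.append(round(mean * n))      # rescale by token count -> cumulative sum
print(ip)   # [1, 2, 3, 1, 2, 3]
```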
<div class="explainer" style="border-color:var(--gold);border-width:2px">
<h3 style="color:var(--gold);margin-bottom:0.6rem">But how does "writing" work? There's no memory!</h3>
<p>This is the crucial insight: <strong>writing = emitting a new token</strong>. There is no separate memory. The growing sequence of trace tokens <em>is</em> the memory.</p>
<div class="write-read-demo">
<p style="font-size:0.82rem;color:var(--dim);margin-bottom:0.6rem">Click "Step" to watch the token sequence grow. Each token carries a key (address) and value (data). Reading is just attention looking back at past tokens.</p>
<div id="wrTrace" class="wr-trace"></div>
<div id="wrAction" class="wr-action"></div>
<div class="btn-row" style="margin-top:0.6rem">
<button id="wrStep" class="action-btn">Step →</button>
<button id="wrReset" class="action-btn secondary">Reset</button>
</div>
</div>
<div style="background:var(--surface2);border-radius:8px;padding:0.7rem 1rem;margin-top:0.8rem;font-size:0.82rem;line-height:1.6">
<strong style="color:var(--gold)">Summary:</strong><br>
<strong>Write</strong> = emit a token. Its W<sub>K</sub> gives it an address, its W<sub>V</sub> gives it data. It just sits there in the sequence.<br>
<strong>Read</strong> = a later token's W<sub>Q</sub> produces a query. Attention scans all past tokens' keys, finds the match, returns that token's value.<br>
<strong>Overwrite</strong> = emit another token with the same address. Because it's later in the sequence, the parabola trick gives it a higher score than the old one (the position component breaks ties).<br><br>
There is no "write back" step. The sequence only grows. Old values are never erased — they're just overshadowed by newer tokens at the same address.
</div>
</div>
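The same story in code: a minimal sketch of trace-as-memory, with a small position bonus as a toy stand-in for the tie-breaking term (addresses and values are invented):

```python
# The trace IS the memory: writing = appending a token, reading = scoring
# every past token's key. A small position bonus makes the newest write
# to an address win; old values are never erased, only overshadowed.
trace = []   # list of (address, value) "tokens", in emission order

def write(addr, value):
    trace.append((addr, value))

def read(addr):
    # Exact-match parabola term + position bonus; best score wins.
    best = max(range(len(trace)),
               key=lambda t: -(addr - trace[t][0]) ** 2 + 0.01 * t)
    return trace[best][1]

write(3, 42)      # "write 42 to address 3"
write(5, 7)
write(3, 99)      # overwrite: a newer token at the same address
print(read(3))    # 99 — the newer write overshadows the old one
print(read(5))    # 7
```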
<div class="demo-box">
<h3>Interactive: Step Through a Stack Machine</h3>
<p>Click "Step" to execute WASM instructions. Watch how each head reads/writes.</p>
<div id="stackMachineVis">
<div id="smProgram"></div>
<div id="smState"></div>
<div id="smHeads"></div>
</div>
<div class="btn-row">
<button id="smStep" class="action-btn">Step →</button>
<button id="smReset" class="action-btn secondary">Reset</button>
</div>
</div>
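For reference, the semantics the demo animates can be written as an ordinary interpreter. A minimal sketch covering just the two opcodes used here (the tuple-based program encoding is an illustrative choice, not WASM's binary format):

```python
def run(program):
    # Minimal stack machine: each step reads one instruction and updates
    # the operand stack, mirroring what the attention heads compute.
    stack, ip, trace = [], 0, []
    while ip < len(program):
        op = program[ip]
        if op[0] == "i32.const":
            stack.append(op[1])            # push immediate
        elif op[0] == "i32.add":
            b, a = stack.pop(), stack.pop()
            stack.append(a + b)            # ALU op = feed-forward network
        trace.append((ip, list(stack)))    # every step visible in the trace
        ip += 1
    return stack, trace

stack, trace = run([("i32.const", 3), ("i32.const", 5), ("i32.add",)])
print(stack)   # [8]
```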
<button class="next-btn" data-next="tab5">Next: Full Execution →</button>
</section>
<!-- TAB 5: Full Execution -->
<section id="tab5" class="panel">
<h2>Putting It All Together: 3 + 5 Inside a Transformer</h2>
<div class="explainer">
<p>Now let's see the complete picture. The transformer receives a WASM program as tokens. Then it generates an execution trace — each token is produced by the attention heads doing lookups into the growing trace.</p>
</div>
<div class="demo-box full-width">
<h3>Full Execution Trace: 3 + 5</h3>
<div id="fullExec">
<div id="feProgram"></div>
<div id="feTrace"></div>
<div id="feDetail"></div>
</div>
<div class="btn-row">
<button id="feStep" class="action-btn">Step →</button>
<button id="feReset" class="action-btn secondary">Reset</button>
<button id="feRunAll" class="action-btn secondary">Run All</button>
</div>
</div>
<div class="explainer">
<h3>Why This Is Different From Tool Use</h3>
<div class="comparison">
<div class="comp-col">
<h4>🔧 Tool Use</h4>
<p>LLM writes <code>print(3+5)</code></p>
<p>Pauses, sends to external Python</p>
<p>Gets back "8" — a black box</p>
<p>Execution happened <em>outside</em></p>
</div>
<div class="comp-col highlight">
<h4>⚡ In-Model Execution</h4>
<p>LLM emits WASM tokens</p>
<p>Each next token = one VM step</p>
<p>Attention heads ARE the CPU</p>
<p>Every step visible in the trace</p>
</div>
</div>
</div>
<div class="explainer">
<h3>The Weight-Level Summary</h3>
<div class="summary-grid">
<div class="summary-item">
<strong>W<sub>K</sub></strong> maps positions → parabola points<br>
<span class="dim">Encodes "what address am I?"</span>
</div>
<div class="summary-item">
<strong>W<sub>Q</sub></strong> maps current state → query direction<br>
<span class="dim">Encodes "what address do I need?"</span>
</div>
<div class="summary-item">
<strong>W<sub>V</sub></strong> extracts stored values<br>
<span class="dim">Encodes "what data is here?"</span>
</div>
<div class="summary-item">
<strong>W<sub>ff</sub></strong> (feed-forward) does ALU ops<br>
<span class="dim">Addition, comparison, branching logic</span>
</div>
</div>
<p>The weights aren't learned from data in the usual sense — they're <strong>engineered</strong> (or compiled) so that the transformer's forward pass literally executes a WASM interpreter. The architecture is a standard PyTorch transformer. Only the weight values are special.</p>
</div>
</section>
</div>
<script src="app.js"></script>
</body>
</html>