<!DOCTYPE html>
<html lang="en">
<head>
<meta charset="UTF-8">
<meta name="viewport" content="width=device-width, initial-scale=1.0">
<title>How Transformer Weights Become a Computer</title>
<link rel="stylesheet" href="style.css">
</head>
<body>
<div id="app">
<header>
<h1>How Transformer Weights Become a Computer</h1>
<p class="subtitle">An interactive exploration of how matrix multiplications can execute deterministic programs</p>
</header>

<nav id="tabs">
<button class="tab active" data-tab="tab1">1. The Puzzle</button>
<button class="tab" data-tab="tab2">2. Attention = Lookup</button>
<button class="tab" data-tab="tab3">3. The 2D Trick</button>
<button class="tab" data-tab="tab4">4. Stack Machine</button>
<button class="tab" data-tab="tab5">5. Full Execution</button>
</nav>

<!-- TAB 1: The Puzzle -->
<section id="tab1" class="panel active">
<h2>The Puzzle: How Can Weights Be Deterministic?</h2>
<div class="explainer">
<p>A transformer is just matrix multiplications and softmax. How can that <em>execute a program</em> exactly?</p>
<p>The secret: attention is a <strong>lookup table</strong>. If you set up the keys and queries just right, the softmax becomes so peaked that it acts like an exact <code>array[index]</code> read.</p>
</div>
<div class="demo-box">
<h3>Try it: Softmax Temperature</h3>
|
||
<p>Below are 5 scores. One is the "correct" lookup target. Watch what happens as you increase the temperature multiplier — the softmax sharpens until it's essentially a hard lookup.</p>
|
||
<div class="slider-row">
|
||
<label>Temperature multiplier: <strong id="tempVal">1</strong></label>
|
||
<input type="range" id="tempSlider" min="1" max="50" value="1" step="1">
|
||
</div>
|
||
<div id="softmaxBars"></div>
|
||
<p class="insight" id="tempInsight">At low temperature, attention is spread out — fuzzy, not useful for exact computation.</p>
|
||
</div>
|
||
<div class="explainer">
|
||
<p><strong>Key insight:</strong> When the score gap is large enough, softmax gives >99.99% weight to one entry. The "weighted average" becomes an exact read. This is how weights produce deterministic behavior — not by magic, but by engineering the scores to be extremely peaked.</p>
|
||
</div>
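<div class="explainer">
<p>If you want to check the sharpening outside the demo, here's a minimal NumPy sketch (not part of app.js — the scores below are made up for illustration). Scaling the scores drives softmax toward a one-hot vector, i.e. an exact read:</p>
<pre style="background:var(--surface2);border-radius:8px;padding:0.8rem 1rem;font-size:0.82rem;line-height:1.5;overflow-x:auto"><code>import numpy as np

def softmax(scores):
    e = np.exp(scores - scores.max())   # subtract the max for numerical stability
    return e / e.sum()

scores = np.array([1.0, 2.0, 7.0, 3.0, 0.5])   # index 2 is the "correct" target

for multiplier in (1, 5, 50):
    print(multiplier, softmax(multiplier * scores).round(4))
# multiplier 1  -> ~0.97 on index 2: dominant, but still a blend
# multiplier 50 -> 1.0000 on index 2: numerically a hard lookup
</code></pre>
</div>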
<button class="next-btn" data-next="tab2">Next: Attention = Lookup →</button>
</section>

<!-- TAB 2: Attention as Lookup -->
<section id="tab2" class="panel">
<h2>Attention as a Lookup Table</h2>

<div class="explainer">
<h3 style="color:var(--gold);margin-bottom:0.6rem">Start here: What problem are we solving?</h3>
<p>A computer needs to <em>read from memory</em>. You say "give me the value at address 3" and you get back whatever is stored there. Simple.</p>
<p>But a transformer has no memory. It only has a growing list of past tokens and one operation: <strong>attention</strong> — which computes a weighted average over all past tokens. How do you turn a weighted average into an exact memory read?</p>
</div>

<div class="explainer">
<h3 style="color:var(--accent);margin-bottom:0.6rem">First: How tokens become vectors</h3>
<p>Before attention can happen, every token goes through the same pipeline. Here it is, step by step:</p>

<div class="pipeline-diagram">
<div class="pipe-step">
<div class="pipe-label">Plain text</div>
<div class="pipe-box pipe-text">"cake"</div>
</div>
<div class="pipe-arrow">→<br><span class="pipe-arrow-label">tokenizer</span></div>
<div class="pipe-step">
<div class="pipe-label">Token ID</div>
<div class="pipe-box pipe-id">4821</div>
</div>
<div class="pipe-arrow">→<br><span class="pipe-arrow-label">embedding table</span></div>
<div class="pipe-step">
<div class="pipe-label">Token embedding (x)</div>
<div class="pipe-box pipe-emb">[0.3, 0.8, −0.1, 0.5, ...]</div>
<div class="pipe-dim">one vector per token, d dimensions</div>
</div>
</div>

<p style="margin-top:0.8rem">Then this embedding gets multiplied by <strong>three separate weight matrices</strong> to produce Q, K, and V:</p>

<div class="qkv-matmul-diagram">
<div class="qkv-row">
<div class="qkv-input">x</div>
<div class="qkv-times">×</div>
<div class="qkv-matrix" style="border-color:var(--warn)">W<sub>Q</sub></div>
<div class="qkv-eq">=</div>
<div class="qkv-output" style="color:var(--warn)">q</div>
<div class="qkv-meaning">← "what address do I need?"</div>
</div>
<div class="qkv-row">
<div class="qkv-input">x</div>
<div class="qkv-times">×</div>
<div class="qkv-matrix" style="border-color:var(--accent)">W<sub>K</sub></div>
<div class="qkv-eq">=</div>
<div class="qkv-output" style="color:var(--accent)">k</div>
<div class="qkv-meaning">← "what address am I at?"</div>
</div>
<div class="qkv-row">
<div class="qkv-input">x</div>
<div class="qkv-times">×</div>
<div class="qkv-matrix" style="border-color:var(--green)">W<sub>V</sub></div>
<div class="qkv-eq">=</div>
<div class="qkv-output" style="color:var(--green)">v</div>
<div class="qkv-meaning">← "what data do I hold?"</div>
</div>
</div>

<div class="explainer" style="margin-top:0.8rem;border-color:var(--gold);border-width:2px">
<h3 style="color:var(--gold);margin-bottom:0.6rem">How Q, K, V work together (this is the key!)</h3>
<p>A common misconception: the value <code>v</code> is <strong>not</strong> computed from <code>q · k</code>. The dot product <code>q · k</code> only produces a <em>score</em> — a single number. Here's the actual flow:</p>

<div class="attn-flow-diagram">
<div class="attn-flow-section">
<div class="attn-flow-label">Each past token (independently)</div>
<div class="attn-flow-row">
<div class="attn-flow-box" style="border-color:var(--accent)">k<sub>j</sub> = x<sub>j</sub> × W<sub>K</sub></div>
<div class="attn-flow-desc">its address label</div>
</div>
<div class="attn-flow-row">
<div class="attn-flow-box" style="border-color:var(--green)">v<sub>j</sub> = x<sub>j</sub> × W<sub>V</sub></div>
<div class="attn-flow-desc">its stored data</div>
</div>
</div>

<div class="attn-flow-section">
<div class="attn-flow-label">Current token</div>
<div class="attn-flow-row">
<div class="attn-flow-box" style="border-color:var(--warn)">q = x<sub>now</sub> × W<sub>Q</sub></div>
<div class="attn-flow-desc">the address I want</div>
</div>
</div>

<div class="attn-flow-section">
<div class="attn-flow-label">Step 1: Score every past token</div>
<div class="attn-flow-row">
<div class="attn-flow-box computed">score<sub>j</sub> = q · k<sub>j</sub></div>
<div class="attn-flow-desc">"how well does this key match my query?"<br>→ just a number, not a vector</div>
</div>
</div>

<div class="attn-flow-section">
<div class="attn-flow-label">Step 2: Pick the winner</div>
<div class="attn-flow-row">
<div class="attn-flow-box computed">weight<sub>j</sub> = softmax(scores)</div>
<div class="attn-flow-desc">highest score → ~100% weight<br>all others → ~0%</div>
</div>
</div>

<div class="attn-flow-section">
<div class="attn-flow-label">Step 3: Retrieve the value</div>
<div class="attn-flow-row">
<div class="attn-flow-box" style="border-color:var(--gold);border-width:2px">output = Σ weight<sub>j</sub> × v<sub>j</sub></div>
<div class="attn-flow-desc">≈ just <strong>v<sub>winner</sub></strong> (because its weight is ~100%)</div>
</div>
</div>
</div>

<p style="margin-top:0.6rem">Think of it like a library: <strong style="color:var(--accent)">k</strong> is the label on the spine of each book, <strong style="color:var(--warn)">q</strong> is the title you're searching for, and <strong style="color:var(--green)">v</strong> is the content inside the book. You use q and k to <em>find</em> the right book, then you <em>read</em> v from it. The content was always there — the lookup just selects which book to open.</p>
</div>

<div style="background:var(--surface2);border-radius:8px;padding:0.7rem 1rem;margin-top:0.8rem;font-size:0.82rem;line-height:1.6">
<strong style="color:var(--gold)">So where is the "memory" physically?</strong><br>
There's no separate memory chip. The <strong>past tokens in the sequence are the memory</strong>. Each past token carries a key (its address) and a value (its data), both computed from its residual stream via W<sub>K</sub> and W<sub>V</sub>. When the model "writes 42 to address 3," it emits a trace token whose key encodes address 3 and whose value encodes 42. When a later step "reads address 3," attention finds that token and retrieves its value.
</div>

<div style="background:var(--surface2);border-radius:8px;padding:0.7rem 1rem;margin-top:0.6rem;font-size:0.82rem;line-height:1.6">
<strong style="color:var(--gold)">Important nuance about <code>x</code>:</strong> The input to each layer isn't the raw token embedding — it's the <strong>residual stream</strong>: the original embedding <em>plus</em> all outputs from previous layers added on top. So by the time a token reaches layer 3, its <code>x</code> already contains results computed by layers 1 and 2 (like the current instruction pointer, or a value just loaded from the stack).<br><br>
This means W<sub>V</sub> doesn't just extract the original token — it extracts <em>whatever information previous layers have written into the residual stream</em>. For example, if layer 2 computed "the value at stack position 3 is <code>42</code>", then layer 3's W<sub>V</sub> can extract that <code>42</code> and make it available as this token's value.
</div>
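<div style="background:var(--surface2);border-radius:8px;padding:0.7rem 1rem;margin-top:0.6rem;font-size:0.82rem;line-height:1.6">
<strong style="color:var(--gold)">In code, the residual stream is just an accumulating sum.</strong> A tiny illustrative sketch (the layer stand-ins and sizes are placeholders, not the paper's construction):
<pre style="margin-top:0.5rem;overflow-x:auto"><code>import numpy as np

rng = np.random.default_rng(0)
d = 64                                        # residual stream width (placeholder)
blocks = [rng.normal(size=(d, d)) * 0.01 for _ in range(3)]   # stand-ins for layer blocks

x = rng.normal(size=d)                        # starts as the token's embedding
for W_block in blocks:
    x = x + W_block @ x                       # each layer ADDS its output; nothing is erased

# After 3 layers, x = embedding + out_1 + out_2 + out_3, so a later layer's W_V can
# extract whatever earlier layers wrote into the stream (e.g. a value just loaded).
</code></pre>
</div>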

<div style="background:var(--surface2);border-radius:8px;padding:0.7rem 1rem;margin-top:0.6rem;font-size:0.82rem;line-height:1.6">
<strong style="color:var(--gold)">The mechanism is identical in both traditional LLMs and this paper.</strong> The only difference is what the weight matrices contain:<br>
<span style="color:var(--accent2)">● Traditional:</span> W<sub>Q</sub>, W<sub>K</sub>, W<sub>V</sub> are <em>learned from data</em> via gradient descent → produce fuzzy semantic vectors<br>
<span style="color:var(--gold)">● This paper:</span> W<sub>Q</sub>, W<sub>K</sub>, W<sub>V</sub> are <em>set by construction</em> (compiled) → produce exact address vectors like <code>[i, 1]</code> and <code>[2j, −j²]</code>
</div>
</div>

<div class="explainer">
<h3 style="color:var(--accent);margin-bottom:0.6rem">The three players: Key, Query, Value</h3>
<p>Now that you know how they're produced, here's what each one <em>means</em>:</p>
<table class="kqv-table">
<tr>
<td class="kqv-name" style="color:var(--accent)">Key (k)</td>
<td>The token's <em>address label</em>. It answers: <strong>"Where am I?"</strong><br>
Think of it like the number on a mailbox. Token at memory address 3 gets a key that encodes "I'm address 3."</td>
</tr>
<tr>
<td class="kqv-name" style="color:var(--warn)">Query (q)</td>
<td>The current token's <em>request</em>. It answers: <strong>"What address do I need?"</strong><br>
If the current instruction needs to read address 3, the query encodes "I want address 3."</td>
</tr>
<tr>
<td class="kqv-name" style="color:var(--green)">Value (v)</td>
<td>The token's <em>stored data</em>. It answers: <strong>"What's in me?"</strong><br>
This is the actual content — the number 42, or a stack entry, or whatever was written there.</td>
</tr>
</table>
</div>

<div class="explainer">
<h3 style="color:var(--accent);margin-bottom:0.6rem">How the lookup works</h3>
<p>Attention computes the <strong>dot product</strong> between the query and every key. The dot product is just: multiply corresponding entries and add up. A high dot product means "these two vectors point in a similar direction" — i.e. <em>this key matches what the query is asking for</em>.</p>
<p>Then <strong>softmax</strong> turns those scores into weights that sum to 1. If one score is much higher than the rest, that key gets nearly 100% of the weight — and the output is just that key's value. That's an exact read.</p>
<div style="background:var(--surface2);border-radius:8px;padding:0.8rem 1rem;margin-top:0.6rem;font-size:0.85rem;line-height:1.7">
<span style="color:var(--dim)">1.</span> Compute scores: <code>score_j = q · k_j</code> <span style="color:var(--dim)">← dot product of query with each key</span><br>
<span style="color:var(--dim)">2.</span> Normalize: <code>weight_j = softmax(scores)</code> <span style="color:var(--dim)">← highest score gets ~100%</span><br>
<span style="color:var(--dim)">3.</span> Read: <code>output = Σ weight_j × value_j</code> <span style="color:var(--dim)">← weighted average ≈ exact value</span>
</div>
</div>
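<div class="explainer">
<p>The same three steps, as a runnable NumPy sketch. The keys and values here are made up for illustration (scaled one-hot address flags rather than the paper's parabola keys — those come next):</p>
<pre style="background:var(--surface2);border-radius:8px;padding:0.8rem 1rem;font-size:0.82rem;line-height:1.5;overflow-x:auto"><code>import numpy as np

def softmax(s):
    e = np.exp(s - s.max())
    return e / e.sum()

# 6 "past tokens": slot j stores the value 10*j (made-up data).
scale  = 20.0
keys   = scale * np.eye(6)           # k_j loudly says "I am address j"
values = np.array([0., 10., 20., 30., 40., 50.])

q = np.eye(6)[3]                     # query: "give me the value at address 3"

scores  = keys @ q                   # 1. score_j = q · k_j
weights = softmax(scores)            # 2. the winner takes ~100% of the weight
output  = weights @ values           # 3. weighted average ≈ exact read

print(weights.round(6))              # [0. 0. 0. 1. 0. 0.]
print(round(float(output), 4))       # 30.0 == values[3]
</code></pre>
</div>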

<div class="explainer">
<h3 style="color:var(--accent);margin-bottom:0.6rem">But what <em>are</em> the key vectors, concretely?</h3>
<p>In this paper, each key is a <strong>2-number column vector</strong>. The weight matrix <code>W_K</code> is engineered so that a token at address <code>j</code> gets mapped to:</p>
<div style="text-align:center;margin:0.6rem 0;font-size:1.1rem">
<code>k<sub>j</sub> = <span style="color:var(--accent)">[2j, −j²]</span></code>
</div>
<p>Why these specific numbers? Because they sit on a <strong>parabola</strong>, and that shape has a magical property: when you dot-product a query <code>q = [i, 1]</code> with every key, the math simplifies to <code>−(i − j)² + i²</code>. That's an upside-down parabola centered at <code>j = i</code> — meaning the key at the exact target address always wins. (Tab 3 visualizes this in detail.)</p>
<p>The key vector isn't something the model "learns" in the usual sense — the weight matrix <code>W_K</code> is <strong>set by construction</strong> to produce these parabola coordinates. It's more like compiling an address decoder into the weights.</p>
</div>

<div class="demo-box">
<h3>Interactive: See It In Action</h3>
<p>Below is a simulated memory with 6 slots. Click a slot to change which address you're reading. Watch how the query vector changes, and how the dot products with each key vector determine which value gets read.</p>
<div id="memorySlots"></div>
<h4 style="color:var(--dim);margin:1rem 0 0.5rem;font-size:0.85rem">Query Vector</h4>
<div id="queryVec"></div>
<h4 style="color:var(--dim);margin:1rem 0 0.5rem;font-size:0.85rem">Key Vectors & Dot Products</h4>
<div id="vecColumns"></div>
<div id="readResult"></div>
</div>
<div class="demo-box">
<h3>Side by Side: Traditional LLM Attention vs Memory Lookup</h3>
<p>Both use the same mechanism (dot product → softmax → weighted average). The difference is <em>what the keys and queries represent</em> and <em>how sharp the result is</em>.</p>
<div class="sbs-controls">
<div class="slider-row">
<label>Input word to attend to: <strong id="sbsTargetLabel">delicious</strong></label>
<input type="range" id="sbsTargetSlider" min="0" max="4" value="2" step="1">
</div>
</div>
<div class="sbs-container">
<!-- LEFT: Traditional -->
<div class="sbs-panel">
<div class="sbs-header trad">Traditional LLM Attention</div>
<div class="sbs-subtitle">Keys = learned semantic embeddings (high-dimensional, shown as 4D slice)</div>
<div id="sbsTradQ" class="sbs-vec-row"></div>
<div id="sbsTradKeys" class="sbs-keys"></div>
<div id="sbsTradResult" class="sbs-result"></div>
<div class="sbs-note">
Weights are <strong>spread across multiple tokens</strong> — attention is a soft blend. Good for language understanding ("what words are related?") but useless for exact computation.
</div>
</div>
<!-- RIGHT: Memory Lookup -->
<div class="sbs-panel">
<div class="sbs-header lookup">Memory Lookup (this paper)</div>
<div class="sbs-subtitle">Keys = engineered 2D addresses on a parabola</div>
<div id="sbsLookupQ" class="sbs-vec-row"></div>
<div id="sbsLookupKeys" class="sbs-keys"></div>
<div id="sbsLookupResult" class="sbs-result"></div>
<div class="sbs-note">
Weight is <strong>100% on one token</strong> — attention is a hard lookup. This is what you need to read <code>stack[3]</code> or <code>mem[addr]</code> exactly.
</div>
</div>
</div>
<p class="insight">Same math, different weights. Traditional LLMs learn W_K and W_Q from data → fuzzy semantic similarity. This paper engineers W_K and W_Q by construction → exact address lookup. The architecture is identical.</p>
</div>

<div class="explainer">
<h3 style="color:var(--accent);margin-bottom:0.6rem">But where does the query vector come from?</h3>
<p>In both cases, the query is produced the same way: <strong>multiply the current token's embedding by the W<sub>Q</sub> weight matrix</strong>. The difference is what W<sub>Q</sub> contains.</p>

<div class="query-origin-grid">
<!-- Traditional -->
<div class="qo-panel">
<div class="qo-header" style="color:var(--accent2)">Traditional LLM</div>
<div class="qo-pipeline">
<div class="qo-step">
<div class="qo-label">Current token embedding</div>
<div class="qo-box">x = <code>[0.3, 0.8, −0.1, ...]</code></div>
<div class="qo-dim">~4096 dimensions</div>
</div>
<div class="qo-arrow">× W<sub>Q</sub></div>
<div class="qo-step">
<div class="qo-label">W<sub>Q</sub> weight matrix</div>
<div class="qo-box learned">Learned from data</div>
<div class="qo-dim">Encodes "what's semantically relevant to me?"</div>
</div>
<div class="qo-arrow">=</div>
<div class="qo-step">
<div class="qo-label">Query vector</div>
<div class="qo-box">q = <code>[0.7, 0.8, 0.0, ...]</code></div>
<div class="qo-dim">Points toward semantically similar keys</div>
</div>
</div>
<p class="qo-note">W<sub>Q</sub> is trained via gradient descent on billions of tokens. It learns to produce queries that find <em>contextually relevant</em> tokens — "delicious" attends to "cake" because they co-occur. The query is a fuzzy semantic direction.</p>
</div>

<!-- Memory lookup -->
<div class="qo-panel">
<div class="qo-header" style="color:var(--gold)">Memory Lookup (this paper)</div>
<div class="qo-pipeline">
<div class="qo-step">
<div class="qo-label">Current state (from prior layers)</div>
<div class="qo-box">x includes the <em>target address</em> i</div>
<div class="qo-dim">Prior attention heads computed which address to read</div>
</div>
<div class="qo-arrow">× W<sub>Q</sub></div>
<div class="qo-step">
<div class="qo-label">W<sub>Q</sub> weight matrix</div>
<div class="qo-box engineered">Engineered by construction</div>
<div class="qo-dim">Extracts address i and formats it as (i, 1)</div>
</div>
<div class="qo-arrow">=</div>
<div class="qo-step">
<div class="qo-label">Query vector</div>
<div class="qo-box">q = <code>[i, 1]</code></div>
<div class="qo-dim">Points exactly at key k<sub>i</sub> on the parabola</div>
</div>
</div>
<p class="qo-note">W<sub>Q</sub> is <strong>not learned</strong> — it's set so that it extracts the target address from the current token's state and formats it as <code>[i, 1]</code>. The address <code>i</code> itself comes from earlier layers: e.g. the instruction pointer head computes the current IP, and the stack head uses the current stack depth. Each head's W<sub>Q</sub> is wired to extract the specific piece of state it needs.</p>
</div>
</div>

<div style="background:var(--surface2);border-radius:8px;padding:0.8rem 1rem;margin-top:0.8rem;font-size:0.85rem;line-height:1.7">
<strong style="color:var(--accent)">The chain of computation across layers:</strong><br>
<span style="color:var(--dim)">Layer 1:</span> A head computes the instruction pointer (cumulative sum of deltas) → "we're at instruction 5"<br>
<span style="color:var(--dim)">Layer 2:</span> A head uses the IP to look up instruction 5 from the program → "it's <code>i32.add</code>"<br>
<span style="color:var(--dim)">Layer 3:</span> A head uses the stack depth to read the top two stack values → operands<br>
<span style="color:var(--dim)">Layer 4:</span> The feed-forward network does the arithmetic → result<br>
<span style="color:var(--dim)">Layer 5:</span> A head writes the result back to the stack at the new depth<br>
<br>
Each layer's W<sub>Q</sub> is wired to extract exactly the right piece of state from the previous layer's output. The query vector is never a vague "what's relevant?" — it's always a precise "give me the value at this specific address."
</div>
</div>
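<div class="explainer">
<p>To make "set by construction" concrete, here's a toy sketch of hand-set W<sub>Q</sub> and W<sub>K</sub> matrices. The state layout (address, address², a constant 1, then payload) is a hypothetical choice for illustration — the paper's actual layout may differ — but it shows that fixed linear maps are enough to emit <code>q = [i, 1]</code> and <code>k<sub>j</sub> = [2j, −j²]</code>:</p>
<pre style="background:var(--surface2);border-radius:8px;padding:0.8rem 1rem;font-size:0.82rem;line-height:1.5;overflow-x:auto"><code>import numpy as np

# Hypothetical per-token state layout: [addr, addr**2, 1, payload]
def token_state(addr, payload=0.0):
    return np.array([addr, addr**2, 1.0, payload])

W_K = np.array([[2.,  0.],    # addr    -> 2j
                [0., -1.],    # addr**2 -> -j**2
                [0.,  0.],    # constant (unused by W_K)
                [0.,  0.]])   # payload  (unused by W_K)

W_Q = np.array([[1., 0.],     # target address i (put here by an earlier layer) -> i
                [0., 0.],
                [0., 1.],     # constant 1 -> the query's second component
                [0., 0.]])

k_4 = token_state(4) @ W_K    # [8., -16.]  == [2j, -j**2] for j = 4
q_3 = token_state(3) @ W_Q    # [3.,   1.]  == [i, 1]      for i = 3
print(k_4, q_3, q_3 @ k_4)    # score = 2*3*4 - 4**2 = 8
</code></pre>
</div>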

<div class="explainer">
<p>Each attention head in the transformer does exactly this — but the "indices" and "values" are determined by the weight matrices <code>W_Q</code>, <code>W_K</code>, <code>W_V</code>. The weights are set so that:</p>
<ul>
<li><strong>W_K</strong> maps each token to a 2D "address" on a parabola</li>
<li><strong>W_Q</strong> maps the current step to a "query direction" — extracting the target address from the current state</li>
<li><strong>W_V</strong> extracts the stored value</li>
</ul>
<p>Multiple heads = multiple independent arrays. That's enough to build registers, a stack, and memory.</p>
</div>
<button class="next-btn" data-next="tab3">Next: The 2D Trick →</button>
</section>

<!-- TAB 3: The 2D Parabola Trick -->
<section id="tab3" class="panel">
<h2>The 2D Parabola Trick: Exact Index Lookup</h2>
<div class="explainer">
<p>Here's the mathematical trick that makes exact lookup work with just 2D keys.</p>
<p>Store index <code>j</code> as the 2D key: <strong>k<sub>j</sub> = (2j, −j²)</strong></p>
<p>To look up index <code>i</code>, query with direction: <strong>q = (i, 1)</strong></p>
<p>The dot product becomes: <code>q · k<sub>j</sub> = 2ij − j² = −(i − j)² + i²</code></p>
<p>Since <code>i²</code> is constant for a given query, the argmax is at <code>j = i</code> — exact match!</p>
</div>
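<div class="explainer">
<p>A quick NumPy check of that algebra. The scale factor at the end is illustrative (the sharpening is the same score-scaling idea as Tab 1); the essential point is that the argmax always lands at <code>j = i</code>:</p>
<pre style="background:var(--surface2);border-radius:8px;padding:0.8rem 1rem;font-size:0.82rem;line-height:1.5;overflow-x:auto"><code>import numpy as np

def softmax(s):
    e = np.exp(s - s.max())
    return e / e.sum()

j = np.arange(8)
keys = np.stack([2 * j, -j**2], axis=1)   # k_j = (2j, -j**2), one row per index

i = 3
q = np.array([i, 1])                      # query for index i

scores = keys @ q                         # 2*i*j - j**2 == -(i - j)**2 + i**2
print(scores)                             # [ 0  5  8  9  8  5  0 -7] -- peak at j = 3
print(int(np.argmax(scores)) == i)        # True

print(softmax(20.0 * scores).round(6))    # ~one-hot at index 3 once scores are scaled up
</code></pre>
</div>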
<div class="demo-box">
<h3>Interactive: See the Parabola</h3>
<p>The keys sit on a parabola. Drag the query slider to change which index you're looking up. The dot product scores form an inverted parabola peaked at the target.</p>
<div class="slider-row">
<label>Query index i = <strong id="queryIdxVal">3</strong></label>
<input type="range" id="queryIdxSlider" min="0" max="7" value="3" step="1">
</div>
<div class="two-col">
<div>
<h4>Keys on Parabola (2D space)</h4>
<canvas id="parabolaCanvas" width="380" height="300"></canvas>
</div>
<div>
<h4>Dot Product Scores (q · k<sub>j</sub>)</h4>
<canvas id="scoresCanvas" width="380" height="300"></canvas>
</div>
</div>
<p class="insight" id="parabolaInsight">Index 3 gets the highest score. The penalty −(i−j)² ensures only the exact match wins.</p>
</div>
<div class="explainer">
<p><strong>Why this matters:</strong> The weight matrix <code>W_K</code> is set so it maps token positions to points on this parabola. <code>W_Q</code> produces the query direction. The result: attention performs an exact array read — no scanning, no approximation. And because the keys are 2D, you can use a convex hull to find the max in O(log n) time instead of checking every key.</p>
</div>
<button class="next-btn" data-next="tab4">Next: Stack Machine →</button>
</section>

<!-- TAB 4: Stack Machine -->
<section id="tab4" class="panel">
<h2>Building a Stack Machine from Attention Heads</h2>
<div class="explainer">
<p>A WebAssembly VM needs: an <strong>instruction pointer</strong>, an <strong>operand stack</strong>, and <strong>memory</strong>. Here's how attention heads provide each:</p>
</div>
<div class="mechanism-grid">
<div class="mechanism">
<h3>📍 Instruction Pointer</h3>
<p><strong>Mechanism:</strong> Cumulative sum via attention</p>
<p>Each trace token emits a delta (+1 for the next instruction, or a jump offset). One head averages all deltas uniformly, then multiplies by the token count to recover the running sum = the current IP (see the sketch just below this grid).</p>
<div class="mini-demo" id="ipDemo">
<div class="ip-trace"></div>
</div>
</div>
<div class="mechanism">
<h3>📚 Operand Stack</h3>
<p><strong>Mechanism:</strong> Index lookup via 2D keys</p>
<p>Stack depth is tracked as a cumulative sum of push/pop deltas. To read the top of stack, the head queries for the current depth using the parabola trick — an exact lookup of the most recent value at that depth.</p>
<div class="mini-demo" id="stackDemo">
<div class="stack-vis"></div>
</div>
</div>
<div class="mechanism">
<h3>💾 Memory</h3>
<p><strong>Mechanism:</strong> Index lookup via 2D keys</p>
<p>Memory addresses work the same way as stack indices. To read <code>mem[addr]</code>, the head queries with direction <code>(addr, 1)</code>. The most recent write to that address has the highest score because it has the largest position component.</p>
</div>
</div>
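<div class="explainer">
<p>A quick numeric illustration of the instruction-pointer mechanism from the grid above. The trace deltas are made up; the point is that a uniform-attention average, rescaled by the token count, is exactly a cumulative sum:</p>
<pre style="background:var(--surface2);border-radius:8px;padding:0.8rem 1rem;font-size:0.82rem;line-height:1.5;overflow-x:auto"><code>import numpy as np

# Deltas emitted so far: +1 per ordinary instruction, a jump offset otherwise.
deltas = np.array([1., 1., 1., -2., 1.])   # three straight-line steps, a jump back, one more
n = len(deltas)

uniform = np.full(n, 1.0 / n)              # one head attends to every past token equally
avg_delta = uniform @ deltas               # what that attention head actually outputs
ip = n * avg_delta                         # rescale by the token count -> running sum

print(ip, deltas.sum())                    # 2.0 2.0 -- the cumulative sum, i.e. the IP
</code></pre>
</div>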
<div class="explainer" style="border-color:var(--gold);border-width:2px">
<h3 style="color:var(--gold);margin-bottom:0.6rem">But how does "writing" work? There's no memory!</h3>
<p>This is the crucial insight: <strong>writing = emitting a new token</strong>. There is no separate memory. The growing sequence of trace tokens <em>is</em> the memory.</p>

<div class="write-read-demo">
<p style="font-size:0.82rem;color:var(--dim);margin-bottom:0.6rem">Click "Step" to watch the token sequence grow. Each token carries a key (address) and value (data). Reading is just attention looking back at past tokens.</p>
<div id="wrTrace" class="wr-trace"></div>
<div id="wrAction" class="wr-action"></div>
<div class="btn-row" style="margin-top:0.6rem">
<button id="wrStep" class="action-btn">Step →</button>
<button id="wrReset" class="action-btn secondary">Reset</button>
</div>
</div>

<div style="background:var(--surface2);border-radius:8px;padding:0.7rem 1rem;margin-top:0.8rem;font-size:0.82rem;line-height:1.6">
<strong style="color:var(--gold)">Summary:</strong><br>
<strong>Write</strong> = emit a token. Its W<sub>K</sub> gives it an address, its W<sub>V</sub> gives it data. It just sits there in the sequence.<br>
<strong>Read</strong> = a later token's W<sub>Q</sub> produces a query. Attention scans all past tokens' keys, finds the match, returns that token's value.<br>
<strong>Overwrite</strong> = emit another token with the same address. Because it's later in the sequence, the parabola trick gives it a higher score than the old one (the position component breaks ties).<br><br>
There is no "write back" step. The sequence only grows. Old values are never erased — they're just overshadowed by newer tokens at the same address.
</div>
</div>
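<div class="explainer">
<p>The append-only idea is easy to simulate. In the sketch below the scoring is a simplified stand-in (a parabola match on the address plus a small position bonus), not the paper's exact compiled construction, but it shows why the newest write to an address wins:</p>
<pre style="background:var(--surface2);border-radius:8px;padding:0.8rem 1rem;font-size:0.82rem;line-height:1.5;overflow-x:auto"><code>import numpy as np

def softmax(s):
    e = np.exp(s - s.max())
    return e / e.sum()

# The trace IS the memory: "writing" just appends (address, value) tokens.
trace = [(3, 7.0),     # write 7 to address 3
         (5, 1.0),     # write 1 to address 5
         (3, 42.0)]    # overwrite address 3: a newer token, never an in-place edit

def read(addr, trace, scale=100.0, tiebreak=0.1):
    j   = np.array([a for a, _ in trace], dtype=float)    # each token's address
    v   = np.array([val for _, val in trace])             # each token's stored value
    pos = np.arange(len(trace), dtype=float)              # each token's position
    scores = (2 * addr * j - j**2) + tiebreak * pos       # address match + recency tiebreak
    return softmax(scale * scores) @ v

print(read(3, trace))   # ~42.0 -- the newer write overshadows the older one
print(read(5, trace))   # ~1.0
</code></pre>
</div>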

<div class="demo-box">
<h3>Interactive: Step Through a Stack Machine</h3>
<p>Click "Step" to execute WASM instructions. Watch how each head reads/writes.</p>
<div id="stackMachineVis">
<div id="smProgram"></div>
<div id="smState"></div>
<div id="smHeads"></div>
</div>
<div class="btn-row">
<button id="smStep" class="action-btn">Step →</button>
<button id="smReset" class="action-btn secondary">Reset</button>
</div>
</div>
<button class="next-btn" data-next="tab5">Next: Full Execution →</button>
</section>

<!-- TAB 5: Full Execution -->
<section id="tab5" class="panel">
<h2>Putting It All Together: 3 + 5 Inside a Transformer</h2>
<div class="explainer">
<p>Now let's see the complete picture. The transformer receives a WASM program as tokens. Then it generates an execution trace — each token is produced by the attention heads doing lookups into the growing trace.</p>
</div>
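<div class="explainer">
<p>For reference, here's the same little program as an ordinary Python stack machine (nothing transformer-specific — just so you know what each step below should produce). One executed instruction corresponds to one trace token:</p>
<pre style="background:var(--surface2);border-radius:8px;padding:0.8rem 1rem;font-size:0.82rem;line-height:1.5;overflow-x:auto"><code>program = [("i32.const", 3), ("i32.const", 5), ("i32.add", None)]

stack = []
for ip, (op, arg) in enumerate(program):   # the IP advances by +1 each step (no jumps here)
    if op == "i32.const":
        stack.append(arg)                  # push: a new value lands on the stack
    elif op == "i32.add":
        b, a = stack.pop(), stack.pop()    # read the top two operands
        stack.append(a + b)                # write the result at the new depth
    print(f"step {ip + 1}: {op:10s} stack={stack}")
# step 3: i32.add    stack=[8]
</code></pre>
</div>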
<div class="demo-box full-width">
<h3>Full Execution Trace: 3 + 5</h3>
<div id="fullExec">
<div id="feProgram"></div>
<div id="feTrace"></div>
<div id="feDetail"></div>
</div>
<div class="btn-row">
<button id="feStep" class="action-btn">Step →</button>
<button id="feReset" class="action-btn secondary">Reset</button>
<button id="feRunAll" class="action-btn secondary">Run All</button>
</div>
</div>
<div class="explainer">
<h3>Why This Is Different From Tool Use</h3>
<div class="comparison">
<div class="comp-col">
<h4>🔧 Tool Use</h4>
<p>LLM writes <code>print(3+5)</code></p>
<p>Pauses, sends to external Python</p>
<p>Gets back "8" — a black box</p>
<p>Execution happened <em>outside</em></p>
</div>
<div class="comp-col highlight">
<h4>⚡ In-Model Execution</h4>
<p>LLM emits WASM tokens</p>
<p>Each next token = one VM step</p>
<p>Attention heads ARE the CPU</p>
<p>Every step visible in the trace</p>
</div>
</div>
</div>
<div class="explainer">
<h3>The Weight-Level Summary</h3>
<div class="summary-grid">
<div class="summary-item">
<strong>W<sub>K</sub></strong> maps positions → parabola points<br>
<span class="dim">Encodes "what address am I?"</span>
</div>
<div class="summary-item">
<strong>W<sub>Q</sub></strong> maps current state → query direction<br>
<span class="dim">Encodes "what address do I need?"</span>
</div>
<div class="summary-item">
<strong>W<sub>V</sub></strong> extracts stored values<br>
<span class="dim">Encodes "what data is here?"</span>
</div>
<div class="summary-item">
<strong>W<sub>ff</sub></strong> (feed-forward) does ALU ops<br>
<span class="dim">Addition, comparison, branching logic</span>
</div>
</div>
<p>The weights aren't learned from data in the usual sense — they're <strong>engineered</strong> (or compiled) so that the transformer's forward pass literally executes a WASM interpreter. The architecture is a standard PyTorch transformer. Only the weight values are special.</p>
</div>
</section>
</div>

<script src="app.js"></script>
</body>
</html>