{
  "$type": "site.standard.document",
  "bskyPostRef": {
    "cid": "bafyreidchtuf3qvqdno6fgt2aebfvrqk3he7d34wepzglrmuwt5cvftwn4",
    "uri": "at://did:plc:pgryn3ephfd2xgft23qokfzt/app.bsky.feed.post/3mhdyydkpnir2"
  },
  "path": "/t/reflow-a-feature-decoupled-transformer-with-native-interpretability/174380#post_1",
  "publishedAt": "2026-03-18T12:49:32.000Z",
  "site": "https://discuss.huggingface.co",
  "tags": [
    "reuAC/reFlow · Hugging Face"
  ],
  "textContent": "**TL;DR** : We decompose the embedding matrix E ∈ R^{V×d} into W_recipe × W_basis, forcing every token to be a readable “recipe” over a shared signal basis. Without any sparsity constraint, the signal space spontaneously develops semantic structure (three↔four cos=0.76, king+woman−man=queen rank #1), 11% natural sparsity, and single-signal causal traceability. Full training code, 12 interpretability experiments, and pretrained weights are MIT-licensed.\n\n* * *\n\n## Motivation\n\nStandard Transformer embeddings are unstructured lookup tables — every token gets an independent d-dimensional vector with no compositional constraint. This makes the latent space a semantic tangle: you can probe it after the fact (SAE, probing classifiers), but the model was never _designed_ to be interpretable. reFlow flips this: by factoring the embedding into a **recipe matrix** (how to mix signals) and a **signal basis** (what the signals mean), the architecture forces all computation onto a signal manifold. Interpretability isn’t bolted on — it’s load-bearing structure.\n\n## Key Results\n\n  * **Convergence** : reFlow-1 (32 layers, 464M) is ~3% above GPT-2-New (36 layers, 514M) due to 4 fewer layers and 9% fewer params. When aligned to the same depth (reFlow-1-Big, 36 layers, 515M), the gap narrows to ~1%. Three-point scaling: Small (46M, 3.55) → reFlow-1 (464M, 3.01) → Big (515M, 2.92).\n  * **Semantic organization in recipe space** : Top-20 nearest-neighbor pairs are all semantically valid (three↔four 0.7551, king↔queen 0.54, France↔Germany 0.53). PCA silhouette score = 0.1052 (positive → real clusters).\n  * **Semantic algebra** : 3/3 hit — king + woman − man → queen (#1), walked + running − walking → ran (#1), Paris + China − France → Beijing (#2).\n  * **Emergent sparsity** : Mean 116.6/1024 signals active per token (11.38% activation rate), with no L0/L1 penalty. Gini coefficient only 0.085 → all signals utilized evenly.\n  * **Causal traceability** : Ablating 1 signal on “The capital of France is” drops target probability from 8.31% → 0.03%. That signal’s codebook = {the, a, in, to, an, at} — a pure function-word channel.\n  * **Behavioral steering** : Emotion surgery flips “terrible” → “great” (L0–L12 injection). Concept inception: critical α ≈ 18.4. Gene tampering: modifying W_recipe globally flips sentiment while maintaining grammatical coherence.\n  * **Hard sparsity destroys semantics** : Top-64 constraint collapses recipe structure (cos 0.76 → 0.30, algebra 3/3 → 0/3, silhouette drops to −0.02). Sparsity ≠ interpretability.\n  * **Information crystallization boundary** : Semantic decisions solidify around L12–L18; interventions after this layer range have no effect.\n\n\n\n## Architecture in brief\n\n\n    Input token i → W_recipe[i, :] (S-dim recipe vector)\n                        ↓\n                e_i = W_recipe[i] × W_basis    (S×d shared signal basis)\n                        ↓\n                36-layer Transformer (RMSNorm, RoPE, SwiGLU)\n                        ↓\n                Logits = H_out × (W_recipe × W_basis)^T    (dynamic vocab matrix, no separate LM head)\n\n\nThe same factored product is used for both input embedding and output projection — a closed loop that forces the backbone to operate entirely on the signal manifold.\n\n## Links\n\n  * **HuggingFace** : reuAC/reFlow · Hugging Face\n  * **Paper** : `paper/paper.pdf` in the repo\n\n\n\nBuilt on nanoGPT. Trained on OpenWebText (9B tokens), 4×T4 GPUs, 50k steps.\n\n* * *\n\nHappy to answer questions! Especially interested in discussion around:\n\n  * The tension between hard sparsity and semantic fidelity (Section 6)\n  * Signal distillation prospects (teacher/student sharing W_basis)\n  * How this compares to SAE-based post-hoc interpretability\n\n",
  "title": "reFlow: A Feature-Decoupled Transformer with Native Interpretability"
}