reFlow: A Feature-Decoupled Transformer with Native Interpretability

Hugging Face Forums [Unofficial] March 18, 2026

TL;DR: We decompose the embedding matrix E ∈ R^{V×d} into a product W_recipe × W_basis (W_recipe ∈ R^{V×S}, W_basis ∈ R^{S×d}), so every token becomes a readable “recipe” over a shared signal basis. Without any sparsity constraint, the signal space spontaneously develops semantic structure (three↔four cos=0.76, king+woman−man=queen at rank #1), natural sparsity (about 11% of signals active per token), and single-signal causal traceability. Full training code, 12 interpretability experiments, and pretrained weights are MIT-licensed.


Motivation

Standard Transformer embeddings are unstructured lookup tables: every token gets an independent d-dimensional vector with no compositional constraint. This makes the latent space a semantic tangle. You can probe it after the fact (sparse autoencoders, probing classifiers), but the model was never designed to be interpretable. reFlow flips this: by factoring the embedding into a recipe matrix (how to mix signals) and a signal basis (what the signals mean), the architecture forces all computation onto a signal manifold. Interpretability isn’t bolted on; it’s load-bearing structure.

Key Results

  • Convergence: reFlow-1 (32 layers, 464M) ends at a loss ~3% above GPT-2-New (36 layers, 514M) due to 4 fewer layers and 9% fewer params. When aligned to the same depth (reFlow-1-Big, 36 layers, 515M), the gap narrows to ~1%. Three-point scaling (params, loss): Small (46M, 3.55) → reFlow-1 (464M, 3.01) → Big (515M, 2.92).
  • Semantic organization in recipe space: The top-20 nearest-neighbor pairs are all semantically valid (cosine similarity: three↔four 0.7551, king↔queen 0.54, France↔Germany 0.53). PCA silhouette score = 0.1052 (positive → real clusters).
  • Semantic algebra: 3/3 hits. king + woman − man → queen (#1), walked + running − walking → ran (#1), Paris + China − France → Beijing (#2).
  • Emergent sparsity: On average 116.6 of 1024 signals are active per token (11.38% activation rate), with no L0/L1 penalty. The Gini coefficient is only 0.085 → all signals are utilized evenly.
  • Causal traceability: Ablating 1 signal on “The capital of France is” drops the target probability from 8.31% → 0.03%. That signal’s codebook is {the, a, in, to, an, at}, a pure function-word channel.
  • Behavioral steering: Emotion surgery flips “terrible” → “great” (L0–L12 injection). Concept inception: critical α ≈ 18.4. Gene tampering: modifying W_recipe globally flips sentiment while maintaining grammatical coherence.
  • Hard sparsity destroys semantics: A Top-64 constraint collapses recipe structure (cos 0.76 → 0.30, algebra 3/3 → 0/3, silhouette drops to −0.02). Sparsity ≠ interpretability.
  • Information crystallization boundary: Semantic decisions solidify around L12–L18; interventions after this layer range have no effect.
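
For anyone who wants to reproduce the recipe-space measurements on their own checkpoint, here is a minimal NumPy sketch of three of them: the analogy rank, the activation rate, and the Gini coefficient. It is not the repo's evaluation code. The function names, the toy 5-token vocabulary, and in particular the `threshold` that decides when a signal counts as "active" are assumptions made for illustration; the post does not state how activity is defined.

```python
import numpy as np

def analogy_rank(recipes, vocab, a, b, c, target):
    """Rank (1 = best) of `target` among cosine neighbours of a + b - c."""
    idx = {tok: i for i, tok in enumerate(vocab)}
    unit = recipes / np.linalg.norm(recipes, axis=1, keepdims=True)
    query = recipes[idx[a]] + recipes[idx[b]] - recipes[idx[c]]
    query = query / np.linalg.norm(query)
    scores = unit @ query
    # Exclude the three query tokens, as is conventional for analogy tests.
    for tok in (a, b, c):
        scores[idx[tok]] = -np.inf
    order = np.argsort(-scores)
    return int(np.where(order == idx[target])[0][0]) + 1

def activation_rate(recipes, threshold):
    """Fraction of (token, signal) entries with |weight| > threshold (assumed definition)."""
    return float((np.abs(recipes) > threshold).mean())

def gini(values):
    """Gini coefficient of non-negative values (0 = perfectly even usage)."""
    v = np.sort(np.asarray(values, dtype=float))
    n = v.size
    ranks = np.arange(1, n + 1)
    return float((2 * ranks - n - 1) @ v / (n * v.sum()))

# Hypothetical toy recipes over 4 signals: royal, male, female, food.
toy_vocab = ["king", "queen", "man", "woman", "apple"]
toy_recipes = np.array([
    [1.0, 1.0, 0.0, 0.0],  # king
    [1.0, 0.0, 1.0, 0.0],  # queen
    [0.0, 1.0, 0.0, 0.0],  # man
    [0.0, 0.0, 1.0, 0.0],  # woman
    [0.0, 0.0, 0.0, 1.0],  # apple
])

rank = analogy_rank(toy_recipes, toy_vocab, "king", "woman", "man", "queen")
rate = activation_rate(toy_recipes, threshold=0.5)
# How many tokens use each signal; Gini over these counts measures evenness.
signal_usage = (np.abs(toy_recipes) > 0.5).sum(axis=0)
usage_gini = gini(signal_usage)
```

On a real model you would pass the trained W_recipe (V×1024) instead of `toy_recipes`. The reported numbers will depend on the threshold and on whether usage is counted or weighted, so treat this only as a starting point for comparison.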

Architecture in brief

Input token i → W_recipe[i, :] (S-dim recipe vector)
                    ↓
            e_i = W_recipe[i] × W_basis    (S×d shared signal basis)
                    ↓
            Transformer backbone, 32 layers for reFlow-1 or 36 for reFlow-1-Big (RMSNorm, RoPE, SwiGLU)
                    ↓
            Logits = H_out × (W_recipe × W_basis)^T    (dynamic vocab matrix, no separate LM head)

The same factored product is used for both input embedding and output projection — a closed loop that forces the backbone to operate entirely on the signal manifold.
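
The closed loop can be sketched in a few lines of NumPy. This is a toy illustration under stated assumptions, not the training code: the sizes `V`, `S`, `d` are made up, the weights are random, and the Transformer backbone is replaced by an identity placeholder so that only the factored embedding and the tied output projection are shown.

```python
import numpy as np

rng = np.random.default_rng(0)
V, S, d = 50, 8, 16  # toy vocab size, signal count, model width (illustrative only)

W_recipe = rng.normal(size=(V, S))  # one S-dim recipe per token
W_basis = rng.normal(size=(S, d))   # shared signal basis

def vocab_matrix(W_recipe, W_basis):
    """Dynamic V x d vocabulary matrix E = W_recipe @ W_basis."""
    return W_recipe @ W_basis

def embed(token_ids, W_recipe, W_basis):
    """Input embedding: mix the basis rows according to each token's recipe."""
    return W_recipe[token_ids] @ W_basis

def logits(h_out, W_recipe, W_basis):
    """Output projection reusing the same factored product (no separate LM head)."""
    return h_out @ vocab_matrix(W_recipe, W_basis).T

ids = np.array([3, 7, 42])
x = embed(ids, W_recipe, W_basis)   # shape (3, d)
h_out = x                           # placeholder: a real model runs the backbone here
out = logits(h_out, W_recipe, W_basis)  # shape (3, V)

E = vocab_matrix(W_recipe, W_basis)
rank_E = int(np.linalg.matrix_rank(E))  # cannot exceed S, since E factors through S signals
```

One consequence visible even in the toy version: E has rank at most S, so every input embedding and every output direction lies in the span of the S basis rows.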

Links

  • Hugging Face: reuAC/reFlow
  • Paper: paper/paper.pdf in the repo

Built on nanoGPT. Trained on OpenWebText (9B tokens), 4×T4 GPUs, 50k steps.


Happy to answer questions! Especially interested in discussion around:

  • The tension between hard sparsity and semantic fidelity (Section 6)
  • Signal distillation prospects (teacher/student sharing W_basis)
  • How this compares to SAE-based post-hoc interpretability
