{
"$type": "site.standard.document",
"bskyPostRef": {
"cid": "bafyreieyadhbeyotruav4dod6ydjuuezfg5seszlkbqxvgsyjlagxwoty4",
"uri": "at://did:plc:pgryn3ephfd2xgft23qokfzt/app.bsky.feed.post/3mo5sigkpyqa2"
},
"path": "/t/removing-the-embedding-from-my-embedding-a-byte-transformer-with-a-0-parameter-input-layer-25m-single-rtx-4070/176731#post_3",
"publishedAt": "2026-06-13T07:07:43.000Z",
"site": "https://discuss.huggingface.co",
"textContent": "This is exactly the framing I was missing — thank you. “How much input geometry has to be learned before attention sees the bytes” is a sharper question than “no embeddings,” and the prior-art table places it correctly. I’m adopting the **failure-boundary map** as the actual roadmap, not another single-winner table. Here’s where the boundaries already stand, and where they’re still blank:\n\n**Already measured:**\n\n * **Geometry boundary** (same 25M body, 3-modality mix, 2 seeds, via `hsl_embedding.ablation`):\n\n\n\n**input door** | **text bpb** | **learned input params**\n---|---|---\nzero HSL | 2.456 | 0\nlearned projection over HSL | 2.443 | ~125k\nplain learned byte embedding | 2.773 | ~132k\nrandom / permuted HSL | 2.44–2.45 | 0\n\nThe interesting wrinkle: **random/permuted land near zero-HSL on _text_ bpb but lose the cross-modal binding gap** — which points exactly at your “HSL ≈ permuted > random ⇒ distribution matters more than value-geometry” interpretation for text, while value-geometry shows up in binding. So the answer is already modality-dependent, as you predicted.\n\n * **K boundary** — your prediction was correct. K=16 (half the raw-byte-position length) holds text bpb (2.48→2.48) while the **audio→caption binding gap softens (0.063 → 0.042)**. Compression and fine-grained binding pull in different directions, exactly as you framed it. K up to 18 stays at 0 learned params on dim-512.\n * **Implementation boundary** — tail handling is explicit (`pad`/`drop`), no silent byte loss; an adversarial pass caught a real bug where a zero-door checkpoint silently loaded random Linear doors, now gated by a round-trip test.\n\n\n\n**Still blank — and the order I’ll fill them, taking your “cleanest question” first:**\n\n 1. **Schedule boundary** (does plain learned embedding catch up with a longer budget?) — this separates “better cold-start” from “better final,” and I agree it’s the single most informative one.\n 2. 
**Binding boundary with hard negatives** (same-class wrong-instance, length-/entropy-matched mismatch) — to tell real association from a shortcut.\n 3. **Scale** (25M → larger), then the **modality boundary by data type** (decoded pixels / μ-law vs UTF-8 vs compressed/encrypted as a negative control), which your numeric-locality argument predicts should split.\n\nTwo of your points I want to flag as already on my path: the **unmixed fixed channels as an interpretability handle** (Δ/Δ²/Fourier/phase enter at fixed addresses, so a first-layer feature-group attribution + eval-time channel knockout is cheap), and keeping the **input door separate from the output head** — the current claim is input-only; HSL-aware output geometry is a later branch.\n\nI’ll report these as boundaries, including the ones where the zero door _stops_ behaving like a learned door — that’s the useful result either way. This is “it works, and here’s where it breaks,” not a superiority claim. Genuinely grateful for the map.",
"title": "Removing the embedding from my embedding: a byte transformer with a 0-parameter input layer (25M, single RTX 4070)"
}