Removing the embedding from my embedding: a byte transformer with a 0-parameter input layer (25M, single RTX 4070)

Hugging Face Forums [Unofficial] June 13, 2026

This is exactly the framing I was missing — thank you. “How much input geometry has to be learned before attention sees the bytes” is a sharper question than “no embeddings,” and the prior-art table places it correctly. I’m adopting the failure-boundary map as the actual roadmap, not another single-winner table. Here’s where the boundaries already stand, and where they’re still blank:

Already measured:

  • Geometry boundary (same 25M body, 3-modality mix, 2 seeds, via hsl_embedding.ablation):
    input door                      text bpb     learned input params
    zero HSL                        2.456        0
    learned projection over HSL     2.443        ~125k
    plain learned byte embedding    2.773        ~132k
    random / permuted HSL           2.44–2.45    0

The interesting wrinkle: random and permuted HSL land within about 0.02 bpb of zero HSL on text (2.44–2.45 vs 2.456), yet they lose the cross-modal binding gap. That lines up with your “HSL ≈ permuted > random ⇒ distribution matters more than value-geometry” interpretation for text, while value-geometry shows up in binding. So the answer is already modality-dependent, as you predicted.
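For anyone comparing the bpb column against their own runs, here is a minimal sketch of the usual conversion from a summed cross-entropy in nats to bits per byte. It assumes the loss is a natural-log negative log-likelihood summed over byte positions. The function name `bits_per_byte` is hypothetical and is not taken from `hsl_embedding`.

```python
import math


def bits_per_byte(total_nll_nats: float, n_bytes: int) -> float:
    """Convert a summed negative log-likelihood (in nats) into bits per byte.

    Assumption: one prediction per byte, loss computed with the natural log.
    """
    if n_bytes <= 0:
        raise ValueError("n_bytes must be positive")
    # Dividing by ln(2) converts nats to bits; dividing by n_bytes averages.
    return total_nll_nats / (n_bytes * math.log(2))


# Sanity check: a uniform guess over 256 byte values costs ln(256) nats
# per byte, which is 8 bits per byte.
uniform_bpb = bits_per_byte(1000 * math.log(256), 1000)
```

On this scale, lower is better, and 8.0 is what a model with no knowledge of the data would score.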

  • K boundary — your prediction was correct. K=16 (half the raw-byte-position length) holds text bpb (2.48→2.48) while the audio→caption binding gap softens (0.063 → 0.042). Compression and fine-grained binding pull in different directions, exactly as you framed it. K up to 18 stays at 0 learned params on dim-512.
  • Implementation boundary — tail handling is explicit (pad/drop), no silent byte loss; an adversarial pass caught a real bug where a zero-door checkpoint silently loaded random Linear doors, now gated by a round-trip test.
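To make "explicit (pad/drop), no silent byte loss" concrete, here is a minimal sketch of tail handling that always reports how many bytes were padded or dropped. It assumes the input is cut into fixed-size groups of bytes, which is my reading and not something stated above. `group_bytes`, `group_size`, and `pad_value` are hypothetical names, not the actual `hsl_embedding` API.

```python
from typing import List, Tuple


def group_bytes(
    data: bytes, group_size: int, tail: str = "pad", pad_value: int = 0
) -> Tuple[List[bytes], int]:
    """Split data into groups of exactly group_size bytes.

    Returns (groups, affected), where affected is the number of pad bytes
    added (tail="pad") or the number of bytes dropped (tail="drop").
    The caller always sees the count, so nothing is lost silently.
    """
    if group_size <= 0:
        raise ValueError("group_size must be positive")
    if tail not in ("pad", "drop"):
        raise ValueError("tail must be 'pad' or 'drop'")

    remainder = len(data) % group_size
    if remainder == 0:
        affected = 0
    elif tail == "pad":
        affected = group_size - remainder
        data = data + bytes([pad_value]) * affected
    else:
        affected = remainder
        data = data[: len(data) - remainder]

    groups = [data[i : i + group_size] for i in range(0, len(data), group_size)]
    return groups, affected


padded, n_padded = group_bytes(b"abcdefg", 3, tail="pad")
dropped, n_dropped = group_bytes(b"abcdefg", 3, tail="drop")
```

Returning the count alongside the groups means a caller can log it or assert on it, which is the property a round-trip test can check.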

Still blank — and the order I’ll fill them, taking your “cleanest question” first:

  1. Schedule boundary (does plain learned embedding catch up with a longer budget?) — this separates “better cold-start” from “better final,” and I agree it’s the single most informative one.
  2. Binding boundary with hard negatives (same-class wrong-instance, length-/entropy-matched mismatch) — to tell real association from a shortcut.
  3. Scale (25M → larger), then the modality boundary by data type (decoded pixels / μ-law vs UTF-8 vs compressed/encrypted as a negative control), which your numeric-locality argument predicts should split.
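For the length-/entropy-matched mismatch in item 2, one way to pick a hard negative is sketched below. This is an illustration under my own assumptions, not the planned protocol: entropy here is the Shannon entropy of the empirical byte histogram, the two gaps are simply added after normalization, and `byte_entropy` and `pick_matched_negative` are hypothetical names.

```python
import math
from collections import Counter
from typing import Sequence


def byte_entropy(data: bytes) -> float:
    """Shannon entropy of the empirical byte distribution, in bits per byte."""
    if not data:
        return 0.0
    n = len(data)
    return -sum((c / n) * math.log2(c / n) for c in Counter(data).values())


def pick_matched_negative(
    anchor: bytes, candidates: Sequence[bytes], exclude: int
) -> int:
    """Index of the candidate closest to anchor in length and byte entropy.

    exclude is the index of the true partner, which must not be chosen.
    """
    a_len, a_ent = len(anchor), byte_entropy(anchor)

    def distance(c: bytes) -> float:
        len_gap = abs(len(c) - a_len) / max(a_len, 1)  # relative length gap
        ent_gap = abs(byte_entropy(c) - a_ent) / 8.0   # 8 bits is the maximum
        return len_gap + ent_gap

    choices = [i for i in range(len(candidates)) if i != exclude]
    if not choices:
        raise ValueError("no candidate left after excluding the true partner")
    return min(choices, key=lambda i: distance(candidates[i]))


pool = [b"abab", b"cdcd", b"aaaaaaaaaaaa"]
negative_index = pick_matched_negative(b"abab", pool, exclude=0)
```

If the binding gap survives negatives chosen this way, length and byte statistics alone cannot explain it.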

Two of your points I want to flag as already on my path: the unmixed fixed channels as an interpretability handle (Δ/Δ²/Fourier/phase enter at fixed addresses, so a first-layer feature-group attribution + eval-time channel knockout is cheap), and keeping the input door separate from the output head — the current claim is input-only; HSL-aware output geometry is a later branch.
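The eval-time channel knockout can be sketched in a few lines. Everything specific here is an assumption for illustration: the channel address ranges are made up, features are plain nested lists indexed as [position][channel], and `metric` stands in for whatever gets evaluated (text bpb, binding gap). None of the names come from `hsl_embedding`.

```python
from typing import Callable, Dict, List, Tuple

Features = List[List[float]]  # indexed as [position][channel]


def knock_out(features: Features, start: int, stop: int) -> Features:
    """Return a copy with channels in [start, stop) zeroed at every position."""
    return [
        [0.0 if start <= j < stop else v for j, v in enumerate(row)]
        for row in features
    ]


def knockout_deltas(
    features: Features,
    groups: Dict[str, Tuple[int, int]],
    metric: Callable[[Features], float],
) -> Dict[str, float]:
    """Change in metric when each named channel group is zeroed in turn."""
    baseline = metric(features)
    return {
        name: metric(knock_out(features, lo, hi)) - baseline
        for name, (lo, hi) in groups.items()
    }


# Toy data: 2 positions, 4 channels. The group addresses are hypothetical.
toy = [[1.0, 2.0, 3.0, 4.0], [5.0, 6.0, 7.0, 8.0]]
toy_groups = {"delta": (0, 2), "fourier": (2, 4)}
toy_metric = lambda f: sum(sum(row) for row in f)  # placeholder metric
deltas = knockout_deltas(toy, toy_groups, toy_metric)
```

Because the channels are unmixed and sit at fixed addresses, each knockout needs only a forward pass, with no retraining.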

I’ll report these as boundaries, including the ones where the zero door stops behaving like a learned door — that’s the useful result either way. This is “it works, and here’s where it breaks,” not a superiority claim. Genuinely grateful for the map.
