Removing the embedding from my embedding: a byte transformer with a 0-parameter input layer (25M, single RTX 4070)
This is exactly the framing I was missing — thank you. “How much input geometry has to be learned before attention sees the bytes” is a sharper question than “no embeddings,” and the prior-art table places it correctly. I’m adopting the failure-boundary map as the actual roadmap, not another single-winner table. Here’s where the boundaries already stand, and where they’re still blank:
Already measured:
- Geometry boundary (same 25M body, 3-modality mix, 2 seeds, via hsl_embedding.ablation):
| input door | text bpb | learned input params |
|---|---|---|
| zero HSL | 2.456 | 0 |
| learned projection over HSL | 2.443 | ~125k |
| plain learned byte embedding | 2.773 | ~132k |
| random / permuted HSL | 2.44–2.45 | 0 |
The interesting wrinkle: random/permuted land near zero-HSL on text bpb but lose the cross-modal binding gap — which points exactly at your “HSL ≈ permuted > random ⇒ distribution matters more than value-geometry” interpretation for text, while value-geometry shows up in binding. So the answer is already modality-dependent, as you predicted.
- K boundary — your prediction was correct. K=16 (half the raw-byte-position length) holds text bpb (2.48→2.48) while the audio→caption binding gap softens (0.063 → 0.042). Compression and fine-grained binding pull in different directions, exactly as you framed it. K up to 18 stays at 0 learned params on dim-512.
- Implementation boundary: tail handling is explicit (pad/drop), no silent byte loss; an adversarial pass caught a real bug where a zero-door checkpoint silently loaded random Linear doors, now gated by a round-trip test.
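For readers who want the tail-handling point concrete, here is a minimal sketch of an explicit pad/drop policy. It assumes K means the number of consecutive bytes grouped into one input position, which this post does not spell out; `group_bytes`, `tail`, and `pad_byte` are hypothetical names for illustration, not the actual code in the repo.

```python
from typing import List, Tuple


def group_bytes(
    data: bytes, k: int, tail: str = "pad", pad_byte: int = 0
) -> Tuple[List[bytes], int, int]:
    """Split `data` into groups of `k` bytes with an explicit tail policy.

    Returns (groups, n_padded, n_dropped) so the caller can always see
    how many bytes were added or discarded at the tail.
    """
    if k <= 0:
        raise ValueError("k must be positive")
    if tail not in ("pad", "drop"):
        raise ValueError("tail must be 'pad' or 'drop'")
    if not 0 <= pad_byte <= 255:
        raise ValueError("pad_byte must fit in one byte")

    full, rem = divmod(len(data), k)
    groups = [data[i * k:(i + 1) * k] for i in range(full)]
    n_padded = 0
    n_dropped = 0

    if rem:
        if tail == "pad":
            # Fill the last group up to k bytes and report how many were added.
            n_padded = k - rem
            groups.append(data[full * k:] + bytes([pad_byte]) * n_padded)
        else:
            # Discard the incomplete group, but report the loss.
            n_dropped = rem

    return groups, n_padded, n_dropped
```

Returning the padded and dropped counts next to the groups is what makes the policy auditable: a test can assert that the counts account for every byte of the input.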
Still blank — and the order I’ll fill them, taking your “cleanest question” first:
- Schedule boundary (does plain learned embedding catch up with a longer budget?) — this separates “better cold-start” from “better final,” and I agree it’s the single most informative one.
- Binding boundary with hard negatives (same-class wrong-instance, length-/entropy-matched mismatch) — to tell real association from a shortcut.
- Scale (25M → larger), then the modality boundary by data type (decoded pixels / μ-law vs UTF-8 vs compressed/encrypted as a negative control), which your numeric-locality argument predicts should split.
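One way the length-/entropy-matched negatives could be selected is sketched below. The matching cost (relative length difference plus byte-entropy difference scaled by 8 bits) is my own assumption for illustration, and `byte_entropy` and `pick_hard_negative` are hypothetical helpers rather than part of the existing evaluation code.

```python
import math
from collections import Counter
from typing import Optional, Sequence


def byte_entropy(data: bytes) -> float:
    """Shannon entropy of the byte histogram, in bits per byte (0 to 8)."""
    if not data:
        return 0.0
    n = len(data)
    return -sum((c / n) * math.log2(c / n) for c in Counter(data).values())


def pick_hard_negative(
    anchor_idx: int, samples: Sequence[bytes], labels: Sequence[str]
) -> Optional[int]:
    """Pick a same-class, wrong-instance sample closest in length and entropy.

    Returns the index of the best match, or None if the anchor's class
    has no other instance.
    """
    target = samples[anchor_idx]
    t_len = len(target)
    t_ent = byte_entropy(target)
    best, best_cost = None, float("inf")
    for i, (sample, label) in enumerate(zip(samples, labels)):
        if i == anchor_idx or label != labels[anchor_idx]:
            continue  # skip the anchor itself and other classes
        # Assumed cost: relative length gap plus entropy gap normalised to [0, 1].
        cost = abs(len(sample) - t_len) / max(t_len, 1)
        cost += abs(byte_entropy(sample) - t_ent) / 8.0
        if cost < best_cost:
            best, best_cost = i, cost
    return best
```

If the binding gap survives negatives chosen this way, length and byte-level entropy alone cannot explain it, which is the shortcut this boundary is meant to rule out.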
Two of your points I want to flag as already on my path: the unmixed fixed channels as an interpretability handle (Δ/Δ²/Fourier/phase enter at fixed addresses, so a first-layer feature-group attribution + eval-time channel knockout is cheap), and keeping the input door separate from the output head — the current claim is input-only; HSL-aware output geometry is a later branch.
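Because the channels enter at fixed addresses, the eval-time knockout can be sketched in a few lines. The channel layout in `CHANNEL_GROUPS` below is invented purely for illustration (the real addresses are not listed in this post), and `metric` stands in for whatever evaluation is run on the modified input features, for example text bpb from a forward pass.

```python
from typing import Callable, Dict, List

Features = List[List[float]]  # one row of input channels per position

# Hypothetical layout: group name -> fixed channel indices (assumption).
CHANNEL_GROUPS: Dict[str, range] = {
    "value": range(0, 3),
    "delta": range(3, 5),
    "fourier": range(5, 8),
}


def knock_out(features: Features, channels: range) -> Features:
    """Return a copy of `features` with the given channel indices zeroed."""
    dead = set(channels)
    return [
        [0.0 if j in dead else v for j, v in enumerate(row)]
        for row in features
    ]


def knockout_report(
    features: Features,
    metric: Callable[[Features], float],
    groups: Dict[str, range] = CHANNEL_GROUPS,
) -> Dict[str, float]:
    """Change in `metric` when each channel group alone is zeroed."""
    base = metric(features)
    return {
        name: metric(knock_out(features, idx)) - base
        for name, idx in groups.items()
    }
```

Zeroing is the simplest knockout; replacing a group with its dataset mean is a common alternative when zero is itself an informative value for that channel.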
I’ll report these as boundaries, including the ones where the zero door stops behaving like a learned door — that’s the useful result either way. This is “it works, and here’s where it breaks,” not a superiority claim. Genuinely grateful for the map.