External Publication
Visit Post

Removing the embedding from my embedding: a byte transformer with a 0-parameter input layer (25M, single RTX 4070)

Hugging Face Forums [Unofficial] June 12, 2026
Source

Hi everyone, a follow-up — and a slightly absurd experiment that worked.

Since the last post, the substrate ablation toolkit shipped inside the encoder (hsl_embedding.ablation — capacity-matched hsl / learned / random / permuted arms, as discussed in this thread). While running the full A/B I got curious about a stranger question:

what happens if I remove the embedding from my embedding?

I.e. feed the frozen 27-D signal features straight into the transformer through a fixed zero-pad — no tokenizer, no embedding table, no learned input projection. Zero learned parameters at the door.

It runs!!

input front door text bpb caption bpb learned input params
zero (frozen features, zero-pad) 2.456 ±0.027 1.526 0
learned projection on same features 2.443 ±0.014 1.402 ~125k
plain learned byte embedding 2.773 ±0.076 2.556 ~132k

(2 seeds, same lean ~25M body, same 3-modality byte mix, fixed 3000-step budget. Doubling bytes-per-slot (K=16, half the prefix positions) holds text bpb at 2.455.)

Reading this honestly: not “embeddings are beaten.” At this small budget the frozen substrate already carries what a learned front door would have to learn, and a plain learned byte embedding doesn’t get there in 3k steps — it may well close the gap with a longer schedule. One consumer GPU, small body, the table is the claim.

So I shipped it as a tiny package plus a live proof model:

  • pip install hsl-embedding-zero — the zero door as a drop-in module (GitHub, MIT, DOI 10.5281/zenodo.20643551)
  • HoLo_ZeRo — a 25M model trained entirely behind the zero door (the casing is the signal: HoLoZeRo = 10101010): weights · live demo (byte generation + the 27-D cosmos it literally reads)

If you’re curious, poke it and tell me where it breaks.

(Also since last time: HoLo 6.5.1 finished its 3-stage curriculum — weights public, knowledge-grounding gap grew 0.001 → 1.835 across training, full numbers in the repo.)

Discussion in the ATmosphere

Loading comments...