{
"$type": "site.standard.document",
"bskyPostRef": {
"cid": "bafyreia7ui5zkybmgdzghcyqp6a7fvzbkeafanaajfu7mcodhh34bnfwya",
"uri": "at://did:plc:pgryn3ephfd2xgft23qokfzt/app.bsky.feed.post/3mo33alfcwg22"
},
"path": "/t/removing-the-embedding-from-my-embedding-a-byte-transformer-with-a-0-parameter-input-layer-25m-single-rtx-4070/176731#post_1",
"publishedAt": "2026-06-12T04:32:07.000Z",
"site": "https://discuss.huggingface.co",
"tags": [
"GitHub",
"weights",
"live demo",
"weights public"
],
"textContent": "Hi everyone, a follow-up, and a slightly absurd experiment that worked.\n\nSince the last post, the substrate ablation toolkit has shipped inside the encoder (`hsl_embedding.ablation`: capacity-matched hsl / learned / random / permuted arms, as discussed in this thread). While running the full A/B, I got curious about a stranger question:\n\n**What happens if I remove the embedding from my embedding?**\n\nThat is, feed the frozen 27-D signal features straight into the transformer through a fixed zero-pad: no tokenizer, no embedding table, no learned input projection. **Zero learned parameters at the door.**\n\nIt runs!!\n\n**input front door** | **text bpb** | **caption bpb** | **learned input params**\n---|---|---|---\nzero (frozen features, zero-pad) | 2.456 ±0.027 | 1.526 | **0**\nlearned projection on same features | 2.443 ±0.014 | 1.402 | ~125k\nplain learned byte embedding | 2.773 ±0.076 | 2.556 | ~132k\n\n(bpb = bits per byte, lower is better. 2 seeds, same lean ~25M body, same 3-modality byte mix, fixed 3000-step budget. Doubling bytes-per-slot (K=16, half the prefix positions) holds text bpb at 2.455.)\n\nReading this honestly: this is **not** “embeddings are beaten.” At this small budget the frozen substrate already carries what a learned front door would have to learn, and a plain learned byte embedding doesn’t get there in 3k steps; it may well close the gap with a longer schedule. 
One consumer GPU, small body; the table is the claim.\n\nSo I shipped it as a tiny package plus a live proof model:\n\n * `pip install hsl-embedding-zero`: the zero door as a drop-in module (GitHub, MIT, DOI 10.5281/zenodo.20643551)\n * **HoLo_ZeRo**, a 25M model trained entirely behind the zero door (the casing is the signal: HoLoZeRo = 10101010): weights · live demo (byte generation + the 27-D cosmos it literally reads)\n\nIf you’re curious, poke it and tell me where it breaks.\n\n(Also since last time: HoLo 6.5.1 finished its 3-stage curriculum. The weights are public, the knowledge-grounding gap grew 0.001 → 1.835 across training, and the full numbers are in the repo.)",
"title": "Removing the embedding from my embedding: a byte transformer with a 0-parameter input layer (25M, single RTX 4070)"
}