Raw Record Source

{
  "$type": "site.standard.document",
  "bskyPostRef": {
    "cid": "bafyreie7ckaiyep2yqdppmxrnuf5y3xym2bqf7kpw3goxunzaui2utwq2e",
    "uri": "at://did:plc:pgryn3ephfd2xgft23qokfzt/app.bsky.feed.post/3mo5f2uiayz32"
  },
  "path": "/t/removing-the-embedding-from-my-embedding-a-byte-transformer-with-a-0-parameter-input-layer-25m-single-rtx-4070/176731#post_2",
  "publishedAt": "2026-06-13T03:45:17.000Z",
  "site": "https://discuss.huggingface.co",
  "tags": [
    "Shaham & Levy, Neural Machine Translation without Embeddings",
    "Shaham & Levy",
    "CANINE",
    "Charformer",
    "MEGABYTE",
    "Byte Latent Transformer"
  ],
  "textContent": "Hmm… maybe something like this?:\n\n* * *\n\nI think this makes the earlier HoLo/HSL line much sharper.\n\nI would read this less as “removing embeddings” in the broad, final sense, and more as a **front-door experiment** :\n\n> how much input geometry has to be learned before attention sees the bytes?\n\nThat framing makes the result easier to place. The interesting claim does not seem to be simply “no embeddings.” There is already a close embedding-free byte precedent in Shaham & Levy, Neural Machine Translation without Embeddings, where UTF-8 bytes can be represented with one-hot vectors instead of a learned embedding layer.\n\nThe sharper claim here seems to be the combination:\n\n> **zero learned input door + K-byte packing + deterministic HSL geometry**\n\nThat puts HoLo_ZeRo between a few existing lines of work:\n\nExisting thread | What it already covers | What HoLo_ZeRo adds\n---|---|---\nShaham & Levy | Embedding-free byte input via one-hot bytes | A packed, shorter-than-raw-byte-position input using deterministic HSL geometry\nCANINE / Charformer | Token-free character/byte modeling with downsampling or learned subword formation | A zero learned input door rather than a learned front-end\nMEGABYTE | Byte patching with local/global multiscale modeling | Fixed HSL feature packing instead of learned/local patch modeling\nByte Latent Transformer | Dynamic byte patches and FLOP-controlled byte scaling | Deterministic K-packing as a simpler fixed front door\nfixed / random / compressed feature maps | Learned embeddings are not the only way to create vector input | Structured byte-signal geometry as the fixed map\n\nSo the way I would summarize the research position is:\n\n> Shaham & Levy asks: can bytes avoid learned embeddings?\n>  Charformer / MEGABYTE / BLT ask: how do we shorten byte streams?\n>  HoLo_ZeRo asks: can we do both with a deterministic byte-signal front door?\n\nThat is a pretty clean question.\n\n## The next useful artifact: 
a failure-boundary map\n\nSince you explicitly asked where it breaks, I think the most useful next artifact may not be another single-winner table, but a **failure-boundary map**.\n\nSomething like:\n\nBoundary | Question | Why it matters\n---|---|---\n**Schedule boundary** | Does the plain learned byte embedding catch up after longer training? | Separates “zero door has better cold start” from “zero door has better final behavior.”\n**Scale boundary** | Does the gap persist at 50M / 100M / 300M bodies? | Larger bodies may learn input geometry that a small model cannot.\n**K boundary** | What happens across K=4 / 8 / 16 / 18? | K controls sequence density versus fine-grained binding.\n**Modality boundary** | Does it work equally for text, decoded audio, decoded image bytes, and compressed-like bytes? | HSL geometry may be more natural for signal-like bytes than semantically arbitrary byte streams.\n**Geometry boundary** | HSL vs learned / random / permuted / raw-bit controls | Tests whether the structured geometry matters, not just capacity or invertibility.\n**Binding boundary** | Does the binding gap survive harder negatives? | Separates real cross-modal association from shortcut learning.\n**Implementation boundary** | Does tail padding / streaming / byte recovery behave safely under odd lengths? 
| K-packed byte models can get misleading results if bytes are silently dropped or shifted.\n\nThat table would answer the exact question I think many readers will have:\n\n> where does the zero door stop behaving like a learned input door?\n\n## K looks like the main design knob\n\nThe K-sweep seems especially important to me.\n\nK does several things at once:\n\nIncreasing K does this | Benefit | Possible cost\n---|---|---\nReduces the number of attention positions | Lower input-side attention cost |\nPacks more local bytes into one slot | Higher local byte density |\nMakes each slot cover a wider byte span |  | Coarser temporal / modal alignment\nPreserves zero learned input params | Keeps the front door fixed |\nMakes slot attribution more complex |  | Harder to tell which byte/channel drove a decision\n\nSo I would treat K less like a normal hyperparameter and more like a design knob:\n\n> **sequence density versus fine-grained binding**\n\nThe K=16 result is interesting for exactly that reason. 
If text/caption bpb holds up while binding softens, that suggests a useful compression/alignment frontier rather than just “K bigger is better” or “K smaller is better.”\n\nA nice future plot might be:\n\nK | text bpb | caption bpb | audio→caption binding gap | image/audio/text retrieval | tokens/sec or bytes/sec | VRAM\n---|---|---|---|---|---|---\n4 |  |  |  |  |  |\n8 |  |  |  |  |  |\n16 |  |  |  |  |  |\n18 |  |  |  |  |  |\n\nThat would make the trade-off visible.\n\n## Short-term experiments that seem most useful\n\nIf I were trying to poke this in the most useful way, I would prioritize these:\n\nTest | What it would clarify\n---|---\n**Longer schedule for learned byte embedding** | Whether the learned embedding is just slower to warm up.\n**Multi-seed runs** | Whether the 25M / 3000-step gap is stable or seed-sensitive.\n**Same-FLOP / same-wallclock reporting** | Whether K-packing improves the compute-quality frontier, not just the loss table.\n**HSL vs learned / random / permuted controls** | Whether HSL geometry matters beyond capacity or invertibility.\n**K-sweep** | Whether compression and binding pull in different directions.\n**Hard-negative binding** | Whether the audio/image/text relation survives more difficult mismatches.\n**Tail and odd-length tests** | Whether K-packing is safe under non-divisible byte lengths.\n**Streaming path tests** | Whether the AR path behaves consistently with the packed prefix path.\n\nThe cleanest short-term question, to me, is:\n\n> At what training budget does a learned input door catch up, if it catches up at all?\n\nThe second cleanest is:\n\n> At what K does sequence compression start to damage binding?\n\nThose two together would already say a lot.\n\n## Geometry controls\n\nThe geometry controls seem central.\n\nI would want to see the same body and data under something like:\n\nInput door | What it tests\n---|---\nzero HSL | The proposed structured deterministic geometry\nlearned projection over HSL | Whether 
early learned mixing helps\nplain learned byte embedding | Standard learned byte identity baseline\nraw bit features | Whether simple bit identity is enough\nrandom fixed features | Whether fixed capacity is enough\npermuted HSL LUT | Whether the HSL value geometry matters\nlearned byte projection with same dimension | Whether the model can learn equivalent geometry itself\n\nThe `permuted HSL` control seems especially useful. If it keeps marginal feature statistics but breaks byte-value adjacency geometry, then the comparison is more informative than just random features.\n\nA rough interpretation table could look like this:\n\nResult pattern | Possible interpretation\n---|---\nHSL > permuted ≈ random | HSL geometry likely matters.\nHSL ≈ permuted > random | feature distribution/capacity may matter more than value geometry.\nHSL ≈ learned projection ≫ learned byte embedding at short schedule | fixed geometry gives a cold-start advantage.\nlearned byte embedding catches up at long schedule | zero door may be a sample-efficiency / early-training advantage rather than a final-performance advantage.\nlearned projection over HSL wins consistently | HSL features help, but early learned mixing still matters.\nall fixed variants collapse on harder data | the zero-door effect may depend on toy-scale or signal-like structure.\n\nThat kind of result table would make the claim much easier to read.\n\n## Modality boundary\n\nOne place I would expect different behavior is modality.\n\nHSL geometry should be more natural when byte values have local numeric meaning. 
That is true for decoded signal-like data, but less obviously true for text or compressed media bytes.\n\nInput type | Expected HSL fit | Reason\n---|---|---\ndecoded grayscale pixels | high | nearby byte values often mean nearby brightness\nPCM / μ-law audio | high to medium | byte values often relate to amplitude-like quantities\nrasterized numeric signals | high | byte values often preserve numeric locality\nUTF-8 text | medium to low | numeric byte adjacency is not the same as semantic adjacency\ncompressed image/audio/video bytes | low | codec structure and entropy coding dominate\nencrypted/random bytes | near zero | byte adjacency has no semantic meaning\n\nSo I would not expect one global answer to “does zero door work?”\nI would expect a boundary map by data type.\n\nThat could be a useful long-term table:\n\nModality / encoding | zero HSL | learned projection | learned byte embedding | random/permuted | Notes\n---|---|---|---|---|---\nUTF-8 text |  |  |  |  | byte identity may matter more than numeric adjacency\ngrayscale raster |  |  |  |  | value geometry should help\nμ-law audio |  |  |  |  | amplitude-like structure may help\ndecoded video frames |  |  |  |  | likely K/alignment-sensitive\ncompressed bytes |  |  |  |  | useful negative control\nshuffled/random bytes |  |  |  |  | sanity check\n\nThis would also help avoid overgeneralizing from the current mixture.\n\n## Binding probes\n\nThe binding result is probably where I would be most careful.\n\nFor text bpb, K-packing can be judged fairly directly. 
For cross-modal binding, I would want harder negatives.\n\nPossible hard negatives:\n\nProbe | Why it helps\n---|---\nsame-class wrong-instance | Reduces class-label shortcuts\nsame-length caption mismatch | Reduces length cues\nentropy-matched caption mismatch | Reduces byte-distribution cues\nshifted audio/video windows | Tests temporal leakage\nimage-only / audio-only / image+audio ablation | Shows which modality contributes\ntop-k retrieval over captions | Easier to interpret than bpb alone\ncross-dataset transfer | Tests whether binding survives outside the original toy distribution\n\nIf K=16 preserves caption bpb but weakens binding, the hard-negative setting may be the best way to see whether K is losing fine-grained alignment or just losing an easy shortcut.\n\n## Fixed channel addresses are also an interpretability opportunity\n\nOne thing I like about the zero-door design is that the channels enter unmixed.\n\nThat is not only a parameter-saving trick. It may make the first learned layer unusually inspectable.\n\nFor example, one could ask:\n\nAnalysis | What it might show\n---|---\nfirst-layer attention by feature group | whether Δ / Δ² / Fourier / phase-like channels are used differently\nmodality-specific channel usage | whether text, audio, and image bytes rely on different feature groups\nK-slot byte-position attribution | which byte positions inside the packed slot matter most\nlearned projection comparison | whether the learned projection rediscovers similar channel mixing\nfeature ablation during eval | which channels matter after training\n\nThis might become one of the cleaner advantages of the zero-door setup: the model has fewer learned parameters before the first attention operation, so attribution at the input boundary is less hidden.\n\n## Longer-term: separate input door from output head\n\nLonger-term, I would also separate the **zero input door** question from the **output-head** question.\n\nShaham & Levy is relevant here because that 
work discusses replacing embedding layers with one-hot byte representations in the first and last layers. HoLo_ZeRo, as I understand it, is mainly about the input front door.\n\nThose are related but not identical questions:\n\nQuestion | Why separate it\n---|---\nzero input door | asks how bytes should enter the model\noutput byte head | asks how predictions should be parameterized\nweight tying | changes meaning when there is no learned input embedding table\nHSL-aware output geometry | possible future direction, but separate from the present claim\nAR streaming path | may interact with output-side design\n\nI would keep the current claim focused on the input door, but maybe mark output-head structure as a later research branch.\n\n## Small implementation note\n\nI also like the explicit tail handling.\n\nIn K-packed byte models, silent byte loss would be an easy source of false confidence. Making pad/drop behavior explicit is a small but important engineering detail.\n\n## Bottom line\n\nMy rough read:\n\nThis is not just an “embedding removed” experiment. It is more specifically a test of whether a deterministic HSL byte-signal geometry can replace the learned input door while also reducing raw byte-position length through K-packing.\n\nThe next most useful thing may be a failure-boundary map:\n\n  1. when learned embeddings catch up;\n  2. when larger models erase the advantage;\n  3. when K starts hurting binding;\n  4. which modalities benefit from HSL geometry;\n  5. whether HSL beats random/permuted/learned controls;\n  6. whether hard-negative binding still holds.\n\n\n\nIf that map looks good, then the result becomes much more than a neat parameter-saving trick. It becomes a concrete design rule for where a fixed byte-signal front door is useful.",
  "title": "Removing the embedding from my embedding: a byte transformer with a 0-parameter input layer (25M, single RTX 4070)"
}