Raw Record Source

{
  "$type": "site.standard.document",
  "bskyPostRef": {
    "cid": "bafyreibhxjntgetd64p2vvhryjpmjgwi4ymhe62njxvtjsc6a36qm42rge",
    "uri": "at://did:plc:pgryn3ephfd2xgft23qokfzt/app.bsky.feed.post/3mobyq7ysiul2"
  },
  "path": "/t/removing-the-embedding-from-my-embedding-a-byte-transformer-with-a-0-parameter-input-layer-25m-single-rtx-4070/176731#post_5",
  "publishedAt": "2026-06-14T22:43:37.000Z",
  "site": "https://discuss.huggingface.co",
  "tags": [
    "Shaham & Levy, Neural Machine Translation without Embeddings",
    "CANINE",
    "Charformer",
    "MEGABYTE",
    "Byte Latent Transformer",
    "RAG",
    "Shaham & Levy — Neural Machine Translation without Embeddings"
  ],
  "textContent": "Oh. Looks like there’s a useful next framing here:\n\n* * *\n\n## Short version\n\nThis update makes the result more interesting, not less, because it shows real boundaries.\n\nMy updated read would be:\n\n> zero HSL door looks promising for primary byte streams; fixed channel addresses make it inspectable; K is a compression/alignment knob; retrieved memory probably needs a separate learned interface; output packing is a separate question; and absolute sensor tasks may need an explicit magnitude channel.\n\nSo the next design rule may not be:\n\n> use zero door everywhere.\n\nIt may be:\n\n> use zero HSL for primary observations, add learned interfaces where the path is semantically different, and add explicit channels only when a measured boundary shows they are needed.\n\nThat feels like a much more useful research direction.\n\n## This is now a path-specific door question\n\nThis update is useful because the map is no longer just:\n\n> does the zero door work?\n\nIt is becoming:\n\n> which path can use a zero door, which path needs a learned interface, and which task needs extra absolute information?\n\nThat distinction feels important.\n\nA useful reference point here is the earlier embedding-free byte line, especially Shaham & Levy, Neural Machine Translation without Embeddings, which showed that byte models can avoid learned embeddings by using fixed byte representations. But HoLo_ZeRo is now moving into a more specific question: not merely “can the input embedding be removed?”, but “which interfaces can remain fixed, and which interfaces need learned adapters?”\n\nThat is a sharper question.\n\n## The channel-knockout matrix is probably the most important new result\n\nThe channel-knockout table seems especially valuable because it makes the fixed-address design pay off.\n\nThe zero door is not only saving parameters. 
It is making the first interface inspectable.\n\nObservation | My read\n---|---\ntext/caption lean heavily on Fourier | local byte prediction may rely more on bit-pattern structure/distribution than on raw numeric locality\nimage leans more on Δ + phase | decoded image-like bytes seem to use numeric-local / phase-like structure\naudio spreads over Δ + Fourier + phase | audio seems to use a mixed signal-like representation rather than one dominant channel family\nboundary is near zero everywhere | useful negative control under fixed K-packing\n\nThe boundary-channel result is actually nice. If every channel looked important, the knockout would be harder to trust. The fact that boundary stays idle under fixed patching makes sense: with fixed K, the model does not need to ask “where does the unit end?”\n\nThat gives a clean future prediction:\n\n> boundary should matter more under adaptive or content-determined slotting.\n\nSo the channel matrix is not only diagnostic; it creates a falsifiable next experiment.\n\nThis also connects nicely to the broader byte-modeling literature. CANINE, Charformer, MEGABYTE, and Byte Latent Transformer all deal, in different ways, with the fact that byte/character streams are long and need some kind of downsampling, grouping, or patching. 
What is different here is that the fixed HSL channels let you inspect which parts of the byte-signal representation are load-bearing after training.\n\nThat inspectability may be one of the strongest reasons to keep the zero-door setup around, even apart from parameter count.\n\n## The grounding failure is the most informative break\n\nThe disk-grounding result looks like the most important failure case so far.\n\nMy updated read is:\n\nPath | Current read\n---|---\nprimary byte input | zero HSL door looks plausible\ndecoded sensor-like input | zero HSL door can be competitive with a learned door, at least in these small probes\nretrieved memory / disk facts | zero-padding alone is not enough; this path likely needs its own learned interface\noutput head | K-packed symmetry is a separate structural choice and currently seems to cost bits\nabsolute-distance tasks | change-rate geometry may need an explicit absolute-magnitude channel\n\nThat makes me think the next map should be a **path-specific door map**, not only a failure-boundary map.\n\nSomething like:\n\nPath | Candidate door | Why\n---|---|---\nprimary decoded byte stream | zero HSL door | fixed geometry seems enough for ordinary observation bytes\nretrieved memory / disk facts | small learned projection | retrieved facts are not just another byte stream; the model must be forced to read their content\noutput head | separate design, not automatically symmetric | input K-packing and output K-packing do not have to share the same answer\nabsolute sensor tasks | HSL + coarse absolute channel | change-rate features may miss magnitude-critical information\nadaptive slotting | HSL + active boundary channel | boundary may wake up only when slot boundaries become content-dependent\n\nThe grounding break is useful because it suggests that “zero learned door” may be **path-specific**, not universal.\n\nThat is a good result, not a bad one. 
It narrows the design rule.\n\n## Memory path may be a different object than input bytes\n\nThe retrieved-memory failure suggests a conceptual distinction:\n\n> primary input bytes are observations; retrieved memory is evidence.\n\nThose may need different interfaces.\n\nFor primary input, zero HSL features can work because the model is learning from local byte structure.\nFor retrieved facts, the model has to treat the memory content as something to consult. If the memory path is too thin or too diluted by positional structure, the model can learn to ignore it.\n\nSo I would phrase the result this way:\n\n> zero door is promising for primary observation streams, but retrieved memory probably needs an explicit learned reading interface.\n\nThat is a more specific and more useful claim than either:\n\n  * “zero door works everywhere,” or\n  * “zero door fails grounding.”\n\nThe monotonic recovery after adding a small learned memory projection is encouraging, even if the current gap is still far below the old learned-door number. It shows directionality: the model can be made to read the disk again when the memory path has its own interface.\n\nThe next useful split might be:\n\nCondition | What it tells us\n---|---\nzero input door + zero memory door | whether the pure zero-door version can read retrieved content\nzero input door + learned memory projection | whether a separate memory interface restores grounding\nlearned input door + zero memory door | whether the failure is specifically memory-path dilution\nlearned input door + learned memory projection | upper/reference condition\nzero input door + low-rank memory adapter | whether a very small adapter is enough\nzero input door + per-channel scale only | whether memory needs full projection or just rescaling\nisolated knowledge-only vs mixed training | whether the memory path overfits when the batch is too knowledge-heavy\n\nThe isolated-vs-mixed distinction seems important. 
If the isolated knowledge probe overshoots and overfits, but the mixed run stays modest and stable, then the memory path may need curriculum/mixing control as much as architecture control.\n\nThis is close in spirit to the broader lesson from retrieval-augmented modeling: retrieval content is not just “more input.” It is a different information path. Work like RAG makes that separation explicit by giving retrieved evidence its own retrieval/conditioning mechanism. I do not mean HoLo_ZeRo should copy that architecture, only that the failure mode here fits a known pattern: retrieved evidence often needs a distinct interface.\n\n## The absolute-distance failure is also a useful encoder hint\n\nThe sensor probe is interesting because it gives a more concrete version of the modality-boundary story.\n\nThe result is not simply:\n\n> HSL works on sensor bytes.\n\nIt is more like:\n\n> HSL is competitive on several signal-like byte streams, but absolute-magnitude tasks may expose a missing channel.\n\nThat is a much more useful finding.\n\nIf HSL is mainly change-rate / relational geometry, then losing to a raw range profile on an absolute-distance task makes sense. The next encoder hypothesis becomes very concrete:\n\n> add a small coarse absolute-magnitude channel and test whether it improves absolute tasks without damaging change/binding tasks.\n\nI would probably test it as an explicit card:\n\nEncoder variant | Test\n---|---\ncurrent HSL | baseline\nHSL + coarse absolute bucket | does absolute-distance improve?\nHSL + normalized absolute channel | does magnitude help without scale instability?\nHSL + per-modality absolute channel | does this help only sensors, or also image/audio?\nHSL + absolute channel ablated at eval | does the model actually use it?\n\nThe important part is to avoid making the encoder more complex without a direct failure case. 
Here there is a direct failure case, so the extra channel has a clear reason to exist.\n\nThis also fits the broader pattern from byte/signal models: the usefulness of a byte representation depends on whether byte-value geometry corresponds to something meaningful in the underlying data. Decoded grayscale, μ-law audio, lidar/radar ranges, UTF-8 text, and compressed bytes should not necessarily behave the same way.\n\n## Output head should stay separate from input door\n\nThe I/O-symmetric run is useful precisely because it did **not** simply validate symmetry.\n\nMy read would be:\n\n> input-side zero K-packing and output-side K-packing are related ideas, but not the same claim.\n\nInput packing gives the model a compressed observation interface.\nOutput packing changes the prediction factorization.\n\nThose can have very different costs.\n\nComponent | Question\n---|---\nzero input door | can fixed geometry replace a learned input embedding/projection?\nK-packed input | can several bytes share one attention slot efficiently?\nK-packed output | can several output bytes be bundled without paying too much bpb?\nI/O symmetry | is structural neatness worth the output prediction cost?\n\nSo I would keep the output-head result as its own branch:\n\n> structurally interesting, but currently not a win.\n\nThat is still useful. It prevents the input-door result from being overextended.\n\nThis distinction is also useful relative to prior embedding-free work. Shaham & Levy discuss replacing learned embeddings in byte models, including the first/last layer framing. But HoLo_ZeRo’s strongest current evidence seems input-side. 
Output-side bundling should probably remain a separate experiment rather than being assumed from the input-side result.\n\n## Updated map\n\nMy updated map would be:\n\nArea | Current state | Next useful test\n---|---|---\nprimary input door | promising | long-schedule learned-embedding head-to-head\nchannel geometry | now inspectable | repeat knockout across seeds/checkpoints\nmodality boundary | starting to appear | decoded signal-like vs UTF-8 vs compressed/encrypted controls\nsensor boundary | HSL ≈ learned in small probes, but absolute distance exposes weakness | HSL + coarse absolute channel\nK boundary | K=16 preserves some bpb but softens binding | K sweep with binding/hard negatives\ngrounding | zero memory path breaks; learned memory projection recovers directionally | path-specific memory-door ablation\noutput head | I/O symmetry costs bits | keep separate from input-door claim\nboundary channel | idle under fixed K | adaptive/content-determined slotting\n\nThat is a much sharper story than a single “zero door wins/loses” table.\n\n## What I would prioritize next\n\nGiven this update, I would prioritize three things.\n\n### 1. Long-schedule learned-embedding head-to-head\n\nThis is still the cleanest missing test.\n\nIt separates:\n\nPossibility | Meaning\n---|---\nlearned embedding catches up | zero door is mostly a cold-start / sample-efficiency advantage\nlearned embedding does not catch up | fixed HSL geometry remains useful even after enough schedule\nlearned embedding catches up on bpb but not binding | HSL geometry matters more for association than local prediction\nlearned embedding wins text but loses sensors | modality boundary becomes the main story\n\nThis curve matters more than one endpoint.\n\n### 2. 
Path-specific door ablation\n\nThe grounding failure makes this almost as important as the schedule curve.\n\nI would test:\n\nInput door | Memory door | Expected use\n---|---|---\nzero | zero | pure zero-door stress test\nzero | learned projection | current recovery direction\nlearned | zero | tests whether memory failure is independent of input door\nlearned | learned projection | reference\nzero | low-rank / gated memory adapter | minimal learned memory interface\nzero | per-channel scale only | tests whether memory needs full projection or just rescaling\n\nThis would answer:\n\n> how much learned interface is needed for retrieved memory?\n\n### 3. Hard-negative binding\n\nBecause the channel/geometry story now looks different for text bpb and binding, binding deserves its own stronger test.\n\nUseful negatives:\n\nNegative | Why\n---|---\nsame-class wrong instance | removes simple class shortcut\nsame-length caption mismatch | removes length cue\nentropy-matched caption mismatch | removes byte-distribution cue\nshifted audio/video window | tests temporal leakage\nimage-only / audio-only ablation | shows modality contribution\ntop-k retrieval | easier to interpret than bpb alone\n\nIf HSL geometry still helps under hard negatives, that becomes a much stronger result than “zero door gives better text bpb.”\n\n## Links / orientation points\n\nI would keep these as orientation points, not as direct baselines:\n\nLink | Why it is useful here\n---|---\nShaham & Levy — Neural Machine Translation without Embeddings | closest reference for embedding-free byte input\nCANINE | tokenization-free character encoder with downsampling\nCharformer | byte/character sequence shortening via learned GBST\nMEGABYTE | byte patching and long byte-sequence modeling\nByte Latent Transformer | dynamic byte patches and FLOP-controlled byte scaling\nRAG | useful reminder that retrieved evidence is often a separate interface, not just more input\n\n## Bottom line\n\nThis update makes the 
result more interesting because it shows real boundaries.\n\nMy updated summary would be:\n\n> zero HSL door looks promising for primary byte streams; fixed channel addresses make it inspectable; K is a compression/alignment knob; retrieved memory probably needs a separate learned interface; output packing is a separate question; and absolute sensor tasks may need an explicit magnitude channel.\n\nSo the next design rule may not be:\n\n> use zero door everywhere.\n\nIt may be:\n\n> use zero HSL for primary observations, add learned interfaces where the path is semantically different, and add explicit channels only when a measured boundary shows they are needed.\n\nThat is a more useful research direction than a universal zero-door claim.",
  "title": "Removing the embedding from my embedding: a byte transformer with a 0-parameter input layer (25M, single RTX 4070)"
}