Removing the embedding from my embedding: a byte transformer with a 0-parameter input layer (25M, single RTX 4070)

Hugging Face Forums [Unofficial] June 13, 2026
Hmm… maybe something like this?:


I think this makes the earlier HoLo/HSL line much sharper.

I would read this less as “removing embeddings” in the broad, final sense, and more as a front-door experiment:

how much input geometry has to be learned before attention sees the bytes?

That framing makes the result easier to place. The interesting claim does not seem to be simply “no embeddings.” There is already a close embedding-free byte precedent in Shaham & Levy, Neural Machine Translation without Embeddings, where UTF-8 bytes can be represented with one-hot vectors instead of a learned embedding layer.
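
For concreteness, a one-hot byte input in that sense can be sketched in a few lines. This is only my illustration in plain Python, not code from the paper; `one_hot_bytes` is a hypothetical helper name.

```python
def one_hot_bytes(text: str) -> list[list[int]]:
    """Map each UTF-8 byte to a 256-dim one-hot row; nothing here is learned."""
    rows = []
    for b in text.encode("utf-8"):
        row = [0] * 256
        row[b] = 1  # the byte value itself is the only "address"
        rows.append(row)
    return rows


rows = one_hot_bytes("héllo")
print(len(rows), len(rows[0]))  # 6 256 ("é" takes two bytes in UTF-8)
```

Note that this keeps one position per byte, which is exactly the part that K-packing changes.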

The sharper claim here seems to be the combination:

zero learned input door + K-byte packing + deterministic HSL geometry

That puts HoLo_ZeRo between a few existing lines of work:

| Existing thread | What it already covers | What HoLo_ZeRo adds |
|---|---|---|
| Shaham & Levy | Embedding-free byte input via one-hot bytes | A packed, shorter-than-raw-byte-position input using deterministic HSL geometry |
| CANINE / Charformer | Token-free character/byte modeling with downsampling or learned subword formation | A zero learned input door rather than a learned front-end |
| MEGABYTE | Byte patching with local/global multiscale modeling | Fixed HSL feature packing instead of learned/local patch modeling |
| Byte Latent Transformer | Dynamic byte patches and FLOP-controlled byte scaling | Deterministic K-packing as a simpler fixed front door |
| fixed / random / compressed feature maps | Learned embeddings are not the only way to create vector input | Structured byte-signal geometry as the fixed map |

So the way I would summarize the research position is:

Shaham & Levy asks: can bytes avoid learned embeddings?
Charformer / MEGABYTE / BLT ask: how do we shorten byte streams?
HoLo_ZeRo asks: can we do both with a deterministic byte-signal front door?

That is a pretty clean question.
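
To make the "zero learned door + K-byte packing" combination concrete, here is a rough sketch. Big caveat: the feature map below is a stand-in I made up (scaled value plus raw bits), not the actual HSL map, and `byte_features`, `LUT`, and `pack` are hypothetical names. The only point is that the table is built once and never trained, and that K bytes share one slot.

```python
def byte_features(b: int) -> list[float]:
    """Stand-in fixed map (NOT the HSL map): scaled value plus 8 raw bits."""
    return [b / 255.0] + [float((b >> i) & 1) for i in range(8)]


# Built once, never trained: 256 rows, zero learned parameters.
LUT = [byte_features(b) for b in range(256)]
PAD = [0.0] * len(LUT[0])


def pack(data: bytes, k: int) -> list[list[float]]:
    """Concatenate the fixed features of k consecutive bytes into one slot."""
    slots = []
    for start in range(0, len(data), k):
        chunk = data[start:start + k]
        slot: list[float] = []
        for b in chunk:
            slot.extend(LUT[b])
        # Explicit tail padding. Byte 0 also maps to all zeros under this
        # stand-in map, so the true length has to be tracked separately.
        slot.extend(PAD * (k - len(chunk)))
        slots.append(slot)
    return slots


slots = pack(b"hello world", k=4)
print(len(slots), len(slots[0]))  # 3 36 (11 bytes -> 3 slots of 4 * 9 features)
```

Attention then sees 3 positions instead of 11, with no parameters spent before the first learned layer.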

The next useful artifact: a failure-boundary map

Since you explicitly asked where it breaks, I think the most useful next artifact may not be another single-winner table, but a failure-boundary map.

Something like:

| Boundary | Question | Why it matters |
|---|---|---|
| Schedule boundary | Does the plain learned byte embedding catch up after longer training? | Separates “zero door has better cold start” from “zero door has better final behavior.” |
| Scale boundary | Does the gap persist at 50M / 100M / 300M bodies? | Larger bodies may learn input geometry that a small model cannot. |
| K boundary | What happens across K=4 / 8 / 16 / 18? | K controls sequence density versus fine-grained binding. |
| Modality boundary | Does it work equally for text, decoded audio, decoded image bytes, and compressed-like bytes? | HSL geometry may be more natural for signal-like bytes than semantically arbitrary byte streams. |
| Geometry boundary | HSL vs learned / random / permuted / raw-bit controls | Tests whether the structured geometry matters, not just capacity or invertibility. |
| Binding boundary | Does the binding gap survive harder negatives? | Separates real cross-modal association from shortcut learning. |
| Implementation boundary | Does tail padding / streaming / byte recovery behave safely under odd lengths? | K-packed byte models can get misleading results if bytes are silently dropped or shifted. |

That table would answer the exact question I think many readers will have:

where does the zero door stop behaving like a learned input door?

K looks like the main design knob

The K-sweep seems especially important to me.

K does several things at once:

| Increasing K does this | Benefit | Possible cost |
|---|---|---|
| Reduces the number of attention positions | Lower input-side attention cost | |
| Packs more local bytes into one slot | Higher local byte density | |
| Makes each slot cover a wider byte span | | Coarser temporal / modal alignment |
| Preserves zero learned input params | Keeps the front door fixed | |
| Makes slot attribution more complex | | Harder to tell which byte/channel drove a decision |

So I would treat K less like a normal hyperparameter and more like a design knob:

sequence density versus fine-grained binding

The K=16 result is interesting for exactly that reason. If text/caption bpb holds up while binding softens, that suggests a useful compression/alignment frontier rather than just “K bigger is better” or “K smaller is better.”

A nice future plot might be:

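| K | text bpb | caption bpb | audio→caption binding gap | image/audio/text retrieval | tokens/sec or bytes/sec | VRAM |
|---|---|---|---|---|---|---|
| 4 | | | | | | |
| 8 | | | | | | |
| 16 | | | | | | |
| 18 | | | | | | |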

That would make the trade-off visible.
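
The sequence-density side of that trade-off is just arithmetic, so it can be filled in before any training run. A small sketch, assuming a 4096-byte window (my number, not from the post) and full self-attention; `slot_count` and `attention_pairs` are hypothetical helpers.

```python
def slot_count(n_bytes: int, k: int) -> int:
    """Number of K-byte slots needed to cover n_bytes (tail slot included)."""
    return (n_bytes + k - 1) // k


def attention_pairs(n_bytes: int, k: int) -> int:
    """Query-key pairs for full self-attention; causal masking roughly halves it."""
    s = slot_count(n_bytes, k)
    return s * s


N_BYTES = 4096  # assumed window size, for illustration only
for k in (4, 8, 16, 18):
    print(k, slot_count(N_BYTES, k), attention_pairs(N_BYTES, k))
# 4 1024 1048576
# 8 512 262144
# 16 256 65536
# 18 228 51984
```

This only covers the cost column. The binding column still has to come from experiments, which is the interesting part.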

Short-term experiments that seem most useful

If I were trying to poke this in the most useful way, I would prioritize these:

| Test | What it would clarify |
|---|---|
| Longer schedule for learned byte embedding | Whether the learned embedding is just slower to warm up. |
| Multi-seed runs | Whether the 25M / 3000-step gap is stable or seed-sensitive. |
| Same-FLOP / same-wallclock reporting | Whether K-packing improves the compute-quality frontier, not just the loss table. |
| HSL vs learned / random / permuted controls | Whether HSL geometry matters beyond capacity or invertibility. |
| K-sweep | Whether compression and binding pull in different directions. |
| Hard-negative binding | Whether the audio/image/text relation survives more difficult mismatches. |
| Tail and odd-length tests | Whether K-packing is safe under non-divisible byte lengths. |
| Streaming path tests | Whether the AR path behaves consistently with the packed prefix path. |
Streaming path tests Whether the AR path behaves consistently with the packed prefix path.

The cleanest short-term question, to me, is:

At what training budget does a learned input door catch up, if it catches up at all?

The second cleanest is:

At what K does sequence compression start to damage binding?

Those two together would already say a lot.
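
For the first question, the readout could be as simple as the first evaluation step where the learned door matches or beats the zero door. A sketch, assuming each run logs eval bpb keyed by step; `catch_up_step` is a hypothetical helper, and the numbers are placeholders that only exercise the function, not measurements.

```python
from typing import Optional


def catch_up_step(zero_bpb: dict[int, float],
                  learned_bpb: dict[int, float]) -> Optional[int]:
    """First shared eval step where learned bpb <= zero-door bpb, else None."""
    for step in sorted(set(zero_bpb) & set(learned_bpb)):
        if learned_bpb[step] <= zero_bpb[step]:
            return step
    return None


# Placeholder values, NOT results from any run.
zero = {1000: 2.0, 2000: 1.8, 3000: 1.7, 6000: 1.6}
learned = {1000: 2.4, 2000: 2.0, 3000: 1.8, 6000: 1.6}
print(catch_up_step(zero, learned))  # 6000
```

With multi-seed runs, I would report this per seed rather than on averaged curves, since a single crossover on a mean can hide seed-sensitive behavior.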

Geometry controls

The geometry controls seem central.

I would want to see the same body and data under something like:

| Input door | What it tests |
|---|---|
| zero HSL | The proposed structured deterministic geometry |
| learned projection over HSL | Whether early learned mixing helps |
| plain learned byte embedding | Standard learned byte identity baseline |
| raw bit features | Whether simple bit identity is enough |
| random fixed features | Whether fixed capacity is enough |
| permuted HSL LUT | Whether the HSL value geometry matters |
| learned byte projection with same dimension | Whether the model can learn equivalent geometry itself |

The permuted HSL control seems especially useful. If it keeps marginal feature statistics but breaks byte-value adjacency geometry, then the comparison is more informative than just random features.
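
A sketch of how I imagine that control, assuming the front door is a 256-row lookup table. The LUT here is a one-feature stand-in (not HSL), and `permuted_lut` / `adjacency_gap` are hypothetical helpers. The permutation reassigns the same rows to different byte values, so the set of rows is untouched while value adjacency is scrambled.

```python
import random


def permuted_lut(lut: list[list[float]], seed: int = 0) -> list[list[float]]:
    """Reassign the same feature rows to byte values in a fixed random order."""
    order = list(range(len(lut)))
    random.Random(seed).shuffle(order)  # seeded, so the control is reproducible
    return [lut[j] for j in order]


def adjacency_gap(lut: list[list[float]]) -> float:
    """Mean L1 distance between feature rows of neighbouring byte values."""
    total = 0.0
    for b in range(len(lut) - 1):
        total += sum(abs(x - y) for x, y in zip(lut[b], lut[b + 1]))
    return total / (len(lut) - 1)


# Stand-in LUT (not HSL): one smooth feature, value / 255.
lut = [[b / 255.0] for b in range(256)]
perm = permuted_lut(lut, seed=0)

print(sorted(perm) == sorted(lut))               # True: same rows, same marginals
print(adjacency_gap(lut) < adjacency_gap(perm))  # True: adjacency is broken
```

The permutation should be fixed across the whole run (and ideally repeated over a few seeds), otherwise the control mixes geometry effects with noise.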

A rough interpretation table could look like this:

| Result pattern | Possible interpretation |
|---|---|
| HSL > permuted ≈ random | HSL geometry likely matters. |
| HSL ≈ permuted > random | feature distribution/capacity may matter more than value geometry. |
| HSL ≈ learned projection ≫ learned byte embedding at short schedule | fixed geometry gives a cold-start advantage. |
| learned byte embedding catches up at long schedule | zero door may be a sample-efficiency / early-training advantage rather than a final-performance advantage. |
| learned projection over HSL wins consistently | HSL features help, but early learned mixing still matters. |
| all fixed variants collapse on harder data | the zero-door effect may depend on toy-scale or signal-like structure. |

That kind of result table would make the claim much easier to read.

Modality boundary

One place I would expect different behavior is modality.

HSL geometry should be more natural when byte values have local numeric meaning. That is true for decoded signal-like data, but less obviously true for text or compressed media bytes.

| Input type | Expected HSL fit | Reason |
|---|---|---|
| decoded grayscale pixels | high | nearby byte values often mean nearby brightness |
| PCM / μ-law audio | high to medium | byte values often relate to amplitude-like quantities |
| rasterized numeric signals | high | byte values often preserve numeric locality |
| UTF-8 text | medium to low | numeric byte adjacency is not the same as semantic adjacency |
| compressed image/audio/video bytes | low | codec structure and entropy coding dominate |
| encrypted/random bytes | near zero | byte adjacency has no semantic meaning |

So I would not expect one global answer to “does zero door work?” I would expect a boundary map by data type.

That could be a useful long-term table:

| Modality / encoding | zero HSL | learned projection | learned byte embedding | random/permuted | Notes |
|---|---|---|---|---|---|
| UTF-8 text | | | | | byte identity may matter more than numeric adjacency |
| grayscale raster | | | | | value geometry should help |
| μ-law audio | | | | | amplitude-like structure may help |
| decoded video frames | | | | | likely K/alignment-sensitive |
| compressed bytes | | | | | useful negative control |
| shuffled/random bytes | | | | | sanity check |

This would also help avoid overgeneralizing from the current mixture.
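
One very rough first screen for "signal-like" streams is how far consecutive bytes jump on average. This is my own proxy, not something from the original experiments, and it measures smoothness of the stream rather than whether byte values carry meaning, so I would treat it as a sanity check at most. `mean_step` is a hypothetical helper and the inputs are synthetic.

```python
import random


def mean_step(data: bytes) -> float:
    """Mean absolute difference between consecutive byte values."""
    if len(data) < 2:
        return 0.0
    return sum(abs(a - b) for a, b in zip(data, data[1:])) / (len(data) - 1)


ramp = bytes(range(256))  # smooth, signal-like stand-in
rng = random.Random(0)
noise = bytes(rng.randrange(256) for _ in range(4096))  # random-byte stand-in
text = "numeric byte adjacency is not semantic adjacency".encode("utf-8")

for name, data in (("ramp", ramp), ("text", text), ("noise", noise)):
    print(name, round(mean_step(data), 1))
```

ASCII text can look fairly smooth under this measure simply because its bytes sit in a narrow range, which is a good reminder that a low score does not show that numeric adjacency is semantically useful.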

Binding probes

The binding result is probably where I would be most careful.

For text bpb, K-packing can be judged fairly directly. For cross-modal binding, I would want harder negatives.

Possible hard negatives:

| Probe | Why it helps |
|---|---|
| same-class wrong-instance | Reduces class-label shortcuts |
| same-length caption mismatch | Reduces length cues |
| entropy-matched caption mismatch | Reduces byte-distribution cues |
| shifted audio/video windows | Tests temporal leakage |
| image-only / audio-only / image+audio ablation | Shows which modality contributes |
| top-k retrieval over captions | Easier to interpret than bpb alone |
| cross-dataset transfer | Tests whether binding survives outside the original toy distribution |

If K=16 preserves caption bpb but weakens binding, the hard-negative setting may be the best way to see whether K is losing fine-grained alignment or just losing an easy shortcut.
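
As one example, the same-length caption mismatch probe needs nothing more than a filter over the caption pool. A sketch under my own assumptions: captions are plain strings, length is measured in UTF-8 bytes (since the model sees bytes), and `same_length_negatives` is a hypothetical helper. The captions are invented for illustration.

```python
def same_length_negatives(captions: list[str], index: int) -> list[int]:
    """Indices of other captions with the same UTF-8 byte length but different text."""
    target = captions[index]
    target_len = len(target.encode("utf-8"))
    return [
        i
        for i, caption in enumerate(captions)
        if i != index
        and caption != target
        and len(caption.encode("utf-8")) == target_len
    ]


# Invented captions, for illustration only.
captions = ["a dog barks", "a cat meows", "rain on a tin roof", "a dog howls"]
print(same_length_negatives(captions, 0))  # [1, 3]
```

With K-packing, matching on byte length also matches slot count, so the model cannot separate matched from mismatched pairs by the number of caption slots alone.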

Fixed channel addresses are also an interpretability opportunity

One thing I like about the zero-door design is that the channels enter unmixed.

That is not only a parameter-saving trick. It may make the first learned layer unusually inspectable.

For example, one could ask:

| Analysis | What it might show |
|---|---|
| first-layer attention by feature group | whether Δ / Δ² / Fourier / phase-like channels are used differently |
| modality-specific channel usage | whether text, audio, and image bytes rely on different feature groups |
| K-slot byte-position attribution | which byte positions inside the packed slot matter most |
| learned projection comparison | whether the learned projection rediscovers similar channel mixing |
| feature ablation during eval | which channels matter after training |

This might become one of the cleaner advantages of the zero-door setup: the model has fewer learned parameters before the first attention operation, so attribution at the input boundary is less hidden.

Longer-term: separate input door from output head

Longer-term, I would also separate the zero input door question from the output-head question.

Shaham & Levy is relevant here because that work discusses replacing embedding layers with one-hot byte representations in the first and last layers. HoLo_ZeRo, as I understand it, is mainly about the input front door.

Those are related but not identical questions:

| Question | Why separate it |
|---|---|
| zero input door | asks how bytes should enter the model |
| output byte head | asks how predictions should be parameterized |
| weight tying | changes meaning when there is no learned input embedding table |
| HSL-aware output geometry | possible future direction, but separate from the present claim |
| AR streaming path | may interact with output-side design |

I would keep the current claim focused on the input door, but maybe mark output-head structure as a later research branch.

Small implementation note

I also like the explicit tail handling.

In K-packed byte models, silent byte loss would be an easy source of false confidence. Making pad/drop behavior explicit is a small but important engineering detail.
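
The kind of check I have in mind is a plain round-trip test over awkward lengths. This is a generic sketch, not the project's actual tail handling: `pack_bytes` and `unpack_bytes` are hypothetical helpers, and carrying the true length alongside the slots is my assumption about one safe way to do it.

```python
def pack_bytes(data: bytes, k: int, pad: int = 0) -> tuple[list[bytes], int]:
    """Split into k-byte slots, padding the tail; also return the true length."""
    if k <= 0:
        raise ValueError("k must be positive")
    slots = []
    for start in range(0, len(data), k):
        chunk = data[start:start + k]
        slots.append(chunk + bytes([pad]) * (k - len(chunk)))
    return slots, len(data)


def unpack_bytes(slots: list[bytes], n_bytes: int) -> bytes:
    """Recover the original bytes; the stored length strips tail padding."""
    return b"".join(slots)[:n_bytes]


# Round-trip every length from 0 to 39 under each K, including non-divisible ones.
for n in range(40):
    data = bytes(range(n))
    for k in (4, 8, 16, 18):
        slots, length = pack_bytes(data, k)
        assert all(len(s) == k for s in slots)
        assert unpack_bytes(slots, length) == data
print("round trip ok")
```

Keeping the length explicit matters because a pad byte of 0 is also a legitimate data byte, so padding cannot be stripped by value alone.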

Bottom line

My rough read:

This is not just an “embedding removed” experiment. It is more specifically a test of whether a deterministic HSL byte-signal geometry can replace the learned input door while also reducing raw byte-position length through K-packing.

The next most useful thing may be a failure-boundary map:

  1. when learned embeddings catch up;
  2. when larger models erase the advantage;
  3. when K starts hurting binding;
  4. which modalities benefit from HSL geometry;
  5. whether HSL beats random/permuted/learned controls;
  6. whether hard-negative binding still holds.

If that map looks good, then the result becomes much more than a neat parameter-saving trick. It becomes a concrete design rule for where a fixed byte-signal front door is useful.