Removing the embedding from my embedding: a byte transformer with a 0-parameter input layer (25M, single RTX 4070)

Hugging Face Forums [Unofficial] June 13, 2026

Hmm… maybe something like this?:

I think this makes the earlier HoLo/HSL line much sharper.

I would read this less as “removing embeddings” in the broad, final sense, and more as a front-door experiment:

how much input geometry has to be learned before attention sees the bytes?

That framing makes the result easier to place. The interesting claim does not seem to be simply “no embeddings.” There is already a close embedding-free byte precedent in Shaham & Levy, Neural Machine Translation without Embeddings, where UTF-8 bytes can be represented with one-hot vectors instead of a learned embedding layer.

The sharper claim here seems to be the combination:

zero learned input door + K-byte packing + deterministic HSL geometry

That puts HoLo_ZeRo between a few existing lines of work:

Existing thread	What it already covers	What HoLo_ZeRo adds
Shaham & Levy	Embedding-free byte input via one-hot bytes	A packed, shorter-than-raw-byte-position input using deterministic HSL geometry
CANINE / Charformer	Token-free character/byte modeling with downsampling or learned subword formation	A zero learned input door rather than a learned front-end
MEGABYTE	Byte patching with local/global multiscale modeling	Fixed HSL feature packing instead of learned/local patch modeling
Byte Latent Transformer	Dynamic byte patches and FLOP-controlled byte scaling	Deterministic K-packing as a simpler fixed front door
fixed / random / compressed feature maps	Learned embeddings are not the only way to create vector input	Structured byte-signal geometry as the fixed map

So the way I would summarize the research position is:

Shaham & Levy asks: can bytes avoid learned embeddings?
Charformer / MEGABYTE / BLT ask: how do we shorten byte streams?
HoLo_ZeRo asks: can we do both with a deterministic byte-signal front door?

That is a pretty clean question.

The next useful artifact: a failure-boundary map

Since you explicitly asked where it breaks, I think the most useful next artifact may not be another single-winner table, but a failure-boundary map.

Something like:

Boundary	Question	Why it matters
Schedule boundary	Does the plain learned byte embedding catch up after longer training?	Separates “zero door has better cold start” from “zero door has better final behavior.”
Scale boundary	Does the gap persist at 50M / 100M / 300M bodies?	Larger bodies may learn input geometry that a small model cannot.
K boundary	What happens across K=4 / 8 / 16 / 18?	K controls sequence density versus fine-grained binding.
Modality boundary	Does it work equally for text, decoded audio, decoded image bytes, and compressed-like bytes?	HSL geometry may be more natural for signal-like bytes than semantically arbitrary byte streams.
Geometry boundary	HSL vs learned / random / permuted / raw-bit controls	Tests whether the structured geometry matters, not just capacity or invertibility.
Binding boundary	Does the binding gap survive harder negatives?	Separates real cross-modal association from shortcut learning.
Implementation boundary	Does tail padding / streaming / byte recovery behave safely under odd lengths?	K-packed byte models can get misleading results if bytes are silently dropped or shifted.

That table would answer the exact question I think many readers will have:

where does the zero door stop behaving like a learned input door?

K looks like the main design knob

The K-sweep seems especially important to me.

K does several things at once:

Increasing K does this	Benefit	Possible cost
Reduces the number of attention positions	Lower input-side attention cost	-
Packs more local bytes into one slot	Higher local byte density	-
Makes each slot cover a wider byte span	-	Coarser temporal / modal alignment
Preserves zero learned input params	Keeps the front door fixed	-
Makes slot attribution more complex	-	Harder to tell which byte/channel drove a decision

So I would treat K less like a normal hyperparameter and more like a design knob:

sequence density versus fine-grained binding

The K=16 result is interesting for exactly that reason. If text/caption bits-per-byte (bpb) holds up while binding softens, that points to a real compression/alignment frontier rather than a simple “bigger K is better” or “smaller K is better” story.
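The sequence-density half of that knob is plain arithmetic. Here is a minimal sketch, assuming the tail slot is padded (so the slot count is a ceiling division) and that attention cost grows with the square of the number of positions. `packed_positions` and `relative_attention_cost` are illustrative names of mine, not functions from the HoLo_ZeRo code.

```python
def packed_positions(n_bytes: int, k: int) -> int:
    """Attention positions when k bytes share one slot (tail slot padded)."""
    if n_bytes < 0 or k < 1:
        raise ValueError("n_bytes must be >= 0 and k must be >= 1")
    return -(-n_bytes // k)  # integer ceiling division


def relative_attention_cost(n_bytes: int, k: int) -> float:
    """Pairwise attention cost relative to unpacked bytes (k=1).

    Assumption: cost scales with positions squared; constants are ignored.
    """
    if n_bytes == 0:
        return 0.0
    return (packed_positions(n_bytes, k) / n_bytes) ** 2


if __name__ == "__main__":
    n = 4096  # arbitrary example length in bytes
    for k in (4, 8, 16, 18):
        print(k, packed_positions(n, k), relative_attention_cost(n, k))
```

At this example length, going from K=4 to K=8 halves the slot count, while going from K=16 to K=18 trims it by only about a tenth. The byte span covered by each slot keeps growing linearly with K either way, which is where the alignment cost would come from.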

A nice future plot might be:

K	text bpb	caption bpb	audio→caption binding gap	image/audio/text retrieval	tokens/sec or bytes/sec	VRAM
4
8
16
18

That would make the trade-off visible.

Short-term experiments that seem most useful

If I were trying to poke this in the most useful way, I would prioritize these:

Test	What it would clarify
Longer schedule for learned byte embedding	Whether the learned embedding is just slower to warm up.
Multi-seed runs	Whether the 25M / 3000-step gap is stable or seed-sensitive.
Same-FLOP / same-wallclock reporting	Whether K-packing improves the compute-quality frontier, not just the loss table.
HSL vs learned / random / permuted controls	Whether HSL geometry matters beyond capacity or invertibility.
K-sweep	Whether compression and binding pull in different directions.
Hard-negative binding	Whether the audio/image/text relation survives more difficult mismatches.
Tail and odd-length tests	Whether K-packing is safe under non-divisible byte lengths.
Streaming path tests	Whether the AR path behaves consistently with the packed prefix path.

The cleanest short-term question, to me, is:

At what training budget does a learned input door catch up, if it catches up at all?

The second cleanest is:

At what K does sequence compression start to damage binding?

Those two together would already say a lot.
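For the first question, "catch up" needs an operational definition before it can be reported. One possible definition, sketched below as an assumption of mine, is the first logged step from which the learned door's eval loss stays at or below the zero door's loss (plus an optional margin) for the rest of the run. The curves in the example are synthetic placeholders, not numbers from the post.

```python
def catch_up_step(steps, zero_door_loss, learned_loss, margin=0.0):
    """First logged step from which the learned door stays within `margin`
    of the zero door until the end of the run; None if it never does."""
    if not (len(steps) == len(zero_door_loss) == len(learned_loss)):
        raise ValueError("all three sequences must have the same length")
    catch = None
    for step, zero, learned in zip(steps, zero_door_loss, learned_loss):
        if learned <= zero + margin:
            if catch is None:
                catch = step  # start of a candidate catch-up stretch
        else:
            catch = None      # fell behind again, reset
    return catch


if __name__ == "__main__":
    # Synthetic curves for illustration only.
    steps = [1000, 2000, 3000, 4000, 5000]
    zero = [2.00, 1.80, 1.70, 1.65, 1.62]
    learned = [2.40, 2.00, 1.75, 1.64, 1.60]
    print(catch_up_step(steps, zero, learned))
```

Requiring the learned door to stay ahead, rather than touch the other curve once, keeps a single noisy eval point from being reported as a crossover.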

Geometry controls

The geometry controls seem central.

I would want to see the same body and data under something like:

Input door	What it tests
zero HSL	The proposed structured deterministic geometry
learned projection over HSL	Whether early learned mixing helps
plain learned byte embedding	Standard learned byte identity baseline
raw bit features	Whether simple bit identity is enough
random fixed features	Whether fixed capacity is enough
permuted HSL LUT	Whether the HSL value geometry matters
learned byte projection with same dimension	Whether the model can learn equivalent geometry itself

The permuted HSL control seems especially useful. If the permutation keeps the marginal feature statistics but breaks byte-value adjacency, then HSL versus permuted isolates the value geometry itself, which a comparison against random features cannot do.
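A rough sketch of how such a permuted control could be built. I do not know the actual HSL feature table, so `toy_value_lut` below is a hypothetical stand-in (a value ramp plus one sine/cosine pair), and `adjacency_roughness` is just an illustrative check of mine. The only point is the mechanics: shuffle which feature row each byte value gets, using a fixed seed.

```python
import math
import random


def toy_value_lut():
    """Hypothetical stand-in for a fixed byte -> feature table.
    NOT the real HSL map; one row of 3 features per byte value."""
    return [
        (v / 255.0,
         math.sin(2 * math.pi * v / 256),
         math.cos(2 * math.pi * v / 256))
        for v in range(256)
    ]


def permuted_lut(lut, seed=0):
    """Reassign the same feature rows to byte values in a fixed random order."""
    rows = list(lut)
    random.Random(seed).shuffle(rows)
    return rows


def adjacency_roughness(lut):
    """Mean Euclidean distance between rows of neighbouring byte values."""
    dists = [math.dist(lut[v], lut[v + 1]) for v in range(len(lut) - 1)]
    return sum(dists) / len(dists)


if __name__ == "__main__":
    lut = toy_value_lut()
    perm = permuted_lut(lut, seed=0)
    # Same multiset of rows, so per-feature marginals are identical.
    assert sorted(lut) == sorted(perm)
    print(adjacency_roughness(lut), adjacency_roughness(perm))
```

Because the permuted table contains exactly the same rows, it stays invertible and has the same feature distribution; only the mapping from numeric byte value to position in feature space is scrambled.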

A rough interpretation table could look like this:

Result pattern	Possible interpretation
HSL > permuted ≈ random	HSL geometry likely matters.
HSL ≈ permuted > random	Feature distribution/capacity may matter more than value geometry.
HSL ≈ learned projection ≫ learned byte embedding at short schedule	Fixed geometry gives a cold-start advantage.
Learned byte embedding catches up at long schedule	Zero door may be a sample-efficiency / early-training advantage rather than a final-performance advantage.
Learned projection over HSL wins consistently	HSL features help, but early learned mixing still matters.
All fixed variants collapse on harder data	The zero-door effect may depend on toy-scale or signal-like structure.

That kind of result table would make the claim much easier to read.

Modality boundary

One place I would expect different behavior is modality.

HSL geometry should be more natural when byte values have local numeric meaning. That is true for decoded signal-like data, but less obviously true for text or compressed media bytes.

Input type	Expected HSL fit	Reason
decoded grayscale pixels	high	nearby byte values often mean nearby brightness
PCM / μ-law audio	high to medium	byte values often relate to amplitude-like quantities
rasterized numeric signals	high	byte values often preserve numeric locality
UTF-8 text	medium to low	numeric byte adjacency is not the same as semantic adjacency
compressed image/audio/video bytes	low	codec structure and entropy coding dominate
encrypted/random bytes	near zero	byte adjacency has no semantic meaning

So I would not expect one global answer to “does zero door work?” I would expect a boundary map by data type.

That could be a useful long-term table:

Modality / encoding	zero HSL	learned projection	learned byte embedding	random/permuted	Notes
UTF-8 text					byte identity may matter more than numeric adjacency
grayscale raster					value geometry should help
μ-law audio					amplitude-like structure may help
decoded video frames					likely K/alignment-sensitive
compressed bytes					useful negative control
shuffled/random bytes					sanity check

This would also help avoid overgeneralizing from the current mixture.
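Before running any model, the "expected HSL fit" column could get a cheap sanity check. The sketch below uses a crude proxy of my own choosing, the mean absolute difference between consecutive byte values, on synthetic stand-ins for three input types. It measures smoothness of the byte stream, which is only loosely related to whether value geometry helps a model, so treat it as a rough pre-screen and not as a test of the zero door.

```python
import math
import random


def mean_adjacent_delta(data: bytes) -> float:
    """Mean absolute difference between consecutive byte values."""
    if len(data) < 2:
        return 0.0
    total = sum(abs(b - a) for a, b in zip(data, data[1:]))
    return total / (len(data) - 1)


if __name__ == "__main__":
    # Synthetic stand-ins, not real datasets.
    sine = bytes(
        int(127.5 + 127.5 * math.sin(2 * math.pi * i / 256))
        for i in range(4096)
    )
    text = ("the quick brown fox jumps over the lazy dog " * 100).encode("utf-8")
    rng = random.Random(0)
    noise = bytes(rng.randrange(256) for _ in range(4096))

    for name, blob in (("sine", sine), ("text", text), ("noise", noise)):
        print(name, round(mean_adjacent_delta(blob), 2))
```

On these stand-ins the smooth signal scores far lower than text, and text lower than uniform noise, which matches the ordering in the table above.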

Binding probes

The binding result is probably where I would be most careful.

For text bpb, K-packing can be judged fairly directly. For cross-modal binding, I would want harder negatives.

Possible hard negatives:

Probe	Why it helps
same-class wrong-instance	Reduces class-label shortcuts
same-length caption mismatch	Reduces length cues
entropy-matched caption mismatch	Reduces byte-distribution cues
shifted audio/video windows	Tests temporal leakage
image-only / audio-only / image+audio ablation	Shows which modality contributes
top-k retrieval over captions	Easier to interpret than bpb alone
cross-dataset transfer	Tests whether binding survives outside the original toy distribution

If K=16 preserves caption bpb but weakens binding, the hard-negative setting may be the best way to see whether K is losing fine-grained alignment or just losing an easy shortcut.
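As one concrete example of the probes above, here is a minimal sketch of the same-length caption mismatch. It assumes captions are plain strings compared by UTF-8 byte length; `length_matched_negatives` and the sample captions are illustrative, not taken from the original setup.

```python
def byte_len(text: str) -> int:
    """Length of the caption as the model would see it: UTF-8 bytes."""
    return len(text.encode("utf-8"))


def length_matched_negatives(captions, index, tolerance=0):
    """Indices of other captions whose byte length is within `tolerance`
    of captions[index]. Exact duplicates of the target are excluded."""
    target = captions[index]
    target_len = byte_len(target)
    return [
        j for j, cap in enumerate(captions)
        if j != index
        and cap != target
        and abs(byte_len(cap) - target_len) <= tolerance
    ]


if __name__ == "__main__":
    # Made-up captions for illustration.
    captions = [
        "a dog barks",
        "a cat meows",
        "rain on a tin roof",
        "a dog barks loudly",
    ]
    print(length_matched_negatives(captions, 0))
    print(length_matched_negatives(captions, 0, tolerance=7))
```

With `tolerance=0` the negative for "a dog barks" is "a cat meows", so caption length alone cannot separate the matched pair from the mismatched one. The same filter could be stacked with a same-class constraint to get the harder same-class wrong-instance probe.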

Fixed channel addresses are also an interpretability opportunity

One thing I like about the zero-door design is that the channels enter unmixed.

That is not only a parameter-saving trick. It may make the first learned layer unusually inspectable.

For example, one could ask:

Analysis	What it might show
first-layer attention by feature group	whether Δ / Δ² / Fourier / phase-like channels are used differently
modality-specific channel usage	whether text, audio, and image bytes rely on different feature groups
K-slot byte-position attribution	which byte positions inside the packed slot matter most
learned projection comparison	whether the learned projection rediscovers similar channel mixing
feature ablation during eval	which channels matter after training

This might become one of the cleaner advantages of the zero-door setup: the model has fewer learned parameters before the first attention operation, so attribution at the input boundary is less hidden.

Longer-term: separate input door from output head

Longer-term, I would also separate the zero input door question from the output-head question.

Shaham & Levy is relevant here because that work discusses replacing embedding layers with one-hot byte representations in the first and last layers. HoLo_ZeRo, as I understand it, is mainly about the input front door.

Those are related but not identical questions:

Question	Why separate it
zero input door	asks how bytes should enter the model
output byte head	asks how predictions should be parameterized
weight tying	changes meaning when there is no learned input embedding table
HSL-aware output geometry	possible future direction, but separate from the present claim
AR streaming path	may interact with output-side design

I would keep the current claim focused on the input door, but maybe mark output-head structure as a later research branch.

Small implementation note

I also like the explicit tail handling.

In K-packed byte models, silent byte loss would be an easy source of false confidence. Making pad/drop behavior explicit is a small but important engineering detail.
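To show what "explicit" could mean here, a minimal sketch of K-packing that pads the tail slot and carries the true byte count alongside the slots, so recovery never has to guess. This is my own illustration, not the HoLo_ZeRo implementation; `pack_bytes`, `unpack_bytes`, and the zero pad value are assumptions.

```python
def pack_bytes(data: bytes, k: int, pad_value: int = 0):
    """Split data into k-byte slots, padding the last slot with pad_value.
    Returns (slots, n_valid) so the original length is never lost."""
    if k < 1:
        raise ValueError("k must be >= 1")
    n_valid = len(data)
    pad = (-n_valid) % k  # bytes needed to reach a multiple of k
    padded = data + bytes([pad_value]) * pad
    slots = [padded[i:i + k] for i in range(0, len(padded), k)]
    return slots, n_valid


def unpack_bytes(slots, n_valid: int) -> bytes:
    """Inverse of pack_bytes: join slots and cut the padding back off."""
    flat = b"".join(slots)
    if not 0 <= n_valid <= len(flat):
        raise ValueError("n_valid is inconsistent with the packed slots")
    return flat[:n_valid]


if __name__ == "__main__":
    payload = b"odd-length input!"  # 17 bytes, not divisible by 16
    slots, n_valid = pack_bytes(payload, k=16)
    assert all(len(slot) == 16 for slot in slots)
    assert unpack_bytes(slots, n_valid) == payload
    print(len(slots), n_valid)
```

A round-trip check like the one at the bottom, run over every length from 0 up to a few multiples of K, is the kind of odd-length test that would catch silently dropped or shifted bytes.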

Bottom line

My rough read:

This is not just an “embedding removed” experiment. It is more specifically a test of whether a deterministic HSL byte-signal geometry can replace the learned input door while also reducing raw byte-position length through K-packing.

The next most useful thing may be a failure-boundary map:

when learned embeddings catch up;
when larger models erase the advantage;
when K starts hurting binding;
which modalities benefit from HSL geometry;
whether HSL beats random/permuted/learned controls;
whether hard-negative binding still holds.

If that map looks good, then the result becomes much more than a neat parameter-saving trick. It becomes a concrete design rule for where a fixed byte-signal front door is useful.