Removing the embedding from my embedding: a byte transformer with a 0-parameter input layer (25M, single RTX 4070)

Hugging Face Forums [Unofficial] June 14, 2026

Oh. Looks like there’s a useful next framing here:

Short version

This update makes the result more interesting, not less, because it shows real boundaries.

My updated read would be:

zero HSL door looks promising for primary byte streams; fixed channel addresses make it inspectable; K is a compression/alignment knob; retrieved memory probably needs a separate learned interface; output packing is a separate question; and absolute sensor tasks may need an explicit magnitude channel.

So the next design rule may not be:

use zero door everywhere.

It may be:

use zero HSL for primary observations, add learned interfaces where the path is semantically different, and add explicit channels only when a measured boundary shows they are needed.

That feels like a much more useful research direction.

This is now a path-specific door question

This update is useful because the map is no longer just:

does the zero door work?

It is becoming:

which path can use a zero door, which path needs a learned interface, and which task needs extra absolute information?

That distinction feels important.

A useful reference point here is the earlier embedding-free byte line, especially Shaham & Levy, Neural Machine Translation without Embeddings, which showed that byte models can avoid learned embeddings by using fixed byte representations. But HoLo_ZeRo is now moving into a more specific question: not merely “can the input embedding be removed?”, but “which interfaces can remain fixed, and which interfaces need learned adapters?”

That is a sharper question.

The channel-knockout matrix is probably the most important new result

The channel-knockout table seems especially valuable because it makes the fixed-address design pay off.

The zero door is not only saving parameters. It is making the first interface inspectable.

Observation	My read
text/caption lean heavily on Fourier	local byte prediction may rely more on bit-pattern structure/distribution than on raw numeric locality
image leans more on Δ + phase	decoded image-like bytes seem to use numeric-local / phase-like structure
audio spreads over Δ + Fourier + phase	audio seems to use a mixed signal-like representation rather than one dominant channel family
boundary is near zero everywhere	useful negative control under fixed K-packing

The boundary-channel result is actually nice. If every channel looked important, the knockout would be harder to trust. The fact that boundary stays idle under fixed patching makes sense: with fixed K, the model does not need to ask “where does the unit end?”

That gives a clean future prediction:

boundary should matter more under adaptive or content-determined slotting.

So the channel matrix is not only diagnostic; it creates a falsifiable next experiment.
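
The knockout procedure behind a table like the one above can be sketched roughly as follows. This is a minimal illustration, not the HoLo_ZeRo code: the channel index ranges in `CHANNELS`, the helper names, and `toy_loss` (a stand-in for evaluating a trained model) are all assumptions made for the example.

```python
from typing import Callable, Dict, List, Tuple

Features = List[List[float]]

# Hypothetical fixed channel addresses: name -> (start, stop) in the feature vector.
CHANNELS: Dict[str, Tuple[int, int]] = {
    "delta": (0, 2),
    "fourier": (2, 4),
    "phase": (4, 6),
    "boundary": (6, 7),
}


def knock_out(features: Features, span: Tuple[int, int]) -> Features:
    """Return a copy of the features with one channel range set to zero."""
    start, stop = span
    return [
        [0.0 if start <= i < stop else v for i, v in enumerate(row)]
        for row in features
    ]


def knockout_matrix(
    features: Features,
    loss_fn: Callable[[Features], float],
    channels: Dict[str, Tuple[int, int]] = CHANNELS,
) -> Dict[str, float]:
    """Loss increase per channel when that channel alone is zeroed at eval time."""
    base = loss_fn(features)
    return {
        name: loss_fn(knock_out(features, span)) - base
        for name, span in channels.items()
    }


def toy_loss(features: Features) -> float:
    """Stand-in for a trained model: a fixed linear readout with squared error."""
    weights = [0.0, 0.0, 1.0, 1.0, 0.5, 0.0, 0.0]  # ignores delta and boundary
    target = 3.0
    errors = [
        (sum(w * v for w, v in zip(weights, row)) - target) ** 2
        for row in features
    ]
    return sum(errors) / len(errors)


example = [[1.0, 1.0, 1.0, 1.0, 2.0, 1.0, 1.0]]
matrix = knockout_matrix(example, toy_loss)
```

In this toy setup the `boundary` entry comes out as zero because the readout never touches that address, which is the same reason an idle channel works as a negative control: a knockout that costs nothing is evidence the model is not reading that channel.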

This also connects nicely to the broader byte-modeling literature. CANINE, Charformer, MEGABYTE, and Byte Latent Transformer all deal, in different ways, with the fact that byte/character streams are long and need some kind of downsampling, grouping, or patching. What is different here is that the fixed HSL channels let you inspect which parts of the byte-signal representation are load-bearing after training.

That inspectability may be one of the strongest reasons to keep the zero-door setup around, even apart from parameter count.

The grounding failure is the most informative break

The disk-grounding result looks like the most important failure case so far.

My updated read is:

Path	Current read
primary byte input	zero HSL door looks plausible
decoded sensor-like input	zero HSL door can be competitive with a learned door, at least in these small probes
retrieved memory / disk facts	zero-padding alone is not enough; this path likely needs its own learned interface
output head	K-packed symmetry is a separate structural choice and currently seems to cost bits
absolute-distance tasks	change-rate geometry may need an explicit absolute-magnitude channel

That makes me think the next map should be a path-specific door map, not only a failure-boundary map.

Something like:

Path	Candidate door	Why
primary decoded byte stream	zero HSL door	fixed geometry seems enough for ordinary observation bytes
retrieved memory / disk facts	small learned projection	retrieved facts are not just another byte stream; the model must be forced to read their content
output head	separate design, not automatically symmetric	input K-packing and output K-packing do not have to share the same answer
absolute sensor tasks	HSL + coarse absolute channel	change-rate features may miss magnitude-critical information
adaptive slotting	HSL + active boundary channel	boundary may wake up only when slot boundaries become content-dependent

The grounding break is useful because it suggests that “zero learned door” may be path-specific, not universal.

That is a good result, not a bad one. It narrows the design rule.

Memory path may be a different object than input bytes

The retrieved-memory failure suggests a conceptual distinction:

primary input bytes are observations; retrieved memory is evidence.

Those may need different interfaces.

For primary input, zero HSL features can work because the model is learning from local byte structure. For retrieved facts, the model has to treat the memory content as something to consult. If the memory path is too thin or too diluted by positional structure, the model can learn to ignore it.

So I would phrase the result this way:

zero door is promising for primary observation streams, but retrieved memory probably needs an explicit learned reading interface.

That is a more specific and more useful claim than either:

“zero door works everywhere,” or
“zero door fails grounding.”

The monotonic recovery after adding a small learned memory projection is encouraging, even if the current gap is still far below the old learned-door number. It shows directionality: the model can be made to read the disk again when the memory path has its own interface.

The next useful split might be:

Condition	What it tells us
zero input door + zero memory door	whether the pure zero-door version can read retrieved content
zero input door + learned memory projection	whether a separate memory interface restores grounding
learned input door + zero memory door	whether the failure is specifically memory-path dilution
learned input door + learned memory projection	upper/reference condition
zero input door + low-rank memory adapter	whether a very small adapter is enough
zero input door + per-channel scale only	whether memory needs full projection or just rescaling
isolated knowledge-only vs mixed training	whether the memory path overfits when the batch is too knowledge-heavy

The isolated-vs-mixed distinction seems important. If the isolated knowledge probe overshoots and overfits, but the mixed run stays modest and stable, then the memory path may need curriculum/mixing control as much as architecture control.

This is close in spirit to the broader lesson from retrieval-augmented modeling: retrieval content is not just “more input.” It is a different information path. Work like RAG makes that separation explicit by giving retrieved evidence its own retrieval/conditioning mechanism. I do not mean HoLo_ZeRo should copy that architecture, only that the failure mode here fits a known pattern: retrieved evidence often needs a distinct interface.

The absolute-distance failure is also a useful encoder hint

The sensor probe is interesting because it gives a more concrete version of the modality-boundary story.

The result is not simply:

HSL works on sensor bytes.

It is more like:

HSL is competitive on several signal-like byte streams, but absolute-magnitude tasks may expose a missing channel.

That is a much more useful finding.

If HSL is mainly change-rate / relational geometry, then losing to a raw range profile on an absolute-distance task makes sense. The next encoder hypothesis becomes very concrete:

add a small coarse absolute-magnitude channel and test whether it improves absolute tasks without damaging change/binding tasks.

I would probably test it as an explicit card:

Encoder variant	Test
current HSL	baseline
HSL + coarse absolute bucket	does absolute-distance improve?
HSL + normalized absolute channel	does magnitude help without scale instability?
HSL + per-modality absolute channel	does this help only sensors, or also image/audio?
HSL + absolute channel ablated at eval	does the model actually use it?

The important part is to avoid making the encoder more complex without a direct failure case. Here there is a direct failure case, so the extra channel has a clear reason to exist.
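
To make the missing-magnitude argument concrete, here is a small sketch. It assumes the Δ channel behaves like a first difference over byte values, which is my reading and not something confirmed in the thread; `deltas`, `coarse_bucket`, and the bucket count are hypothetical names and choices.

```python
from typing import List


def deltas(values: List[int]) -> List[int]:
    """First differences: a simple stand-in for a change-rate channel."""
    return [b - a for a, b in zip(values, values[1:])]


def coarse_bucket(value: int, n_buckets: int = 8) -> int:
    """Map a byte value (0..255) to one of n_buckets coarse magnitude bins."""
    if not 0 <= value <= 255:
        raise ValueError("expected a byte value in 0..255")
    return value * n_buckets // 256


# Two range profiles with the same shape but very different absolute distance.
near = [10, 12, 15]
far = [210, 212, 215]

same_shape = deltas(near) == deltas(far)
near_buckets = [coarse_bucket(v) for v in near]
far_buckets = [coarse_bucket(v) for v in far]
```

The two profiles are indistinguishable to the difference features, while the coarse buckets separate them. That is the whole case for the extra channel: a few bins of absolute magnitude, added only because a measured task needs them.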

This also fits the broader pattern from byte/signal models: the usefulness of a byte representation depends on whether byte-value geometry corresponds to something meaningful in the underlying data. Decoded grayscale, μ-law audio, lidar/radar ranges, UTF-8 text, and compressed bytes should not necessarily behave the same way.

Output head should stay separate from input door

The I/O-symmetric run is useful precisely because it did not simply validate symmetry.

My read would be:

input-side zero K-packing and output-side K-packing are related ideas, but not the same claim.

Input packing gives the model a compressed observation interface. Output packing changes the prediction factorization.

Those can have very different costs.
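
A rough sketch of the two pieces being compared, under my own assumptions: `pack` groups K bytes per attention slot with zero padding (the padding value is a guess), and `bits_per_byte` is the standard conversion from summed negative log-likelihood in nats to bpb. Neither is taken from the HoLo_ZeRo code.

```python
import math
from typing import List


def pack(data: bytes, k: int, pad: int = 0) -> List[List[int]]:
    """Group a byte stream into slots of k bytes, padding the last slot."""
    if k <= 0:
        raise ValueError("k must be positive")
    padded_len = -(-len(data) // k) * k  # round up to a multiple of k
    padded = list(data) + [pad] * (padded_len - len(data))
    return [padded[i:i + k] for i in range(0, padded_len, k)]


def bits_per_byte(total_nll_nats: float, n_bytes: int) -> float:
    """Convert a summed negative log-likelihood (nats) to bits per byte."""
    return total_nll_nats / (math.log(2) * n_bytes)


slots = pack(b"abcdefg", 4)
n_slots_k16 = len(pack(bytes(64), 16))
```

On the input side, packing only shortens the sequence the attention stack sees (64 bytes become 4 slots at K=16). On the output side, each slot has to predict K bytes at once, and the cost still gets normalized per byte, so any loss from bundling shows up directly in bpb.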

Component	Question
zero input door	can fixed geometry replace a learned input embedding/projection?
K-packed input	can several bytes share one attention slot efficiently?
K-packed output	can several output bytes be bundled without paying too much bpb?
I/O symmetry	is structural neatness worth the output prediction cost?

So I would keep the output-head result as its own branch:

structurally interesting, but currently not a win.

That is still useful. It prevents the input-door result from being overextended.

This distinction is also useful relative to prior embedding-free work. Shaham & Levy discuss replacing learned embeddings in byte models, including the first/last layer framing. But HoLo_ZeRo’s strongest current evidence seems input-side. Output-side bundling should probably remain a separate experiment rather than being assumed from the input-side result.

Updated map

My updated map would be:

Area	Current state	Next useful test
primary input door	promising	long-schedule learned-embedding head-to-head
channel geometry	now inspectable	repeat knockout across seeds/checkpoints
modality boundary	starting to appear	decoded signal-like vs UTF-8 vs compressed/encrypted controls
sensor boundary	HSL ≈ learned in small probes, but absolute distance exposes weakness	HSL + coarse absolute channel
K boundary	K=16 preserves some bpb but softens binding	K sweep with binding/hard negatives
grounding	zero memory path breaks; learned memory projection recovers directionally	path-specific memory-door ablation
output head	I/O symmetry costs bits	keep separate from input-door claim
boundary channel	idle under fixed K	adaptive/content-determined slotting

That is a much sharper story than a single “zero door wins/loses” table.

What I would prioritize next

Given this update, I would prioritize three things.

1. Long-schedule learned-embedding head-to-head

This is still the cleanest missing test.

It separates:

Possibility	Meaning
learned embedding catches up	zero door is mostly a cold-start / sample-efficiency advantage
learned embedding does not catch up	fixed HSL geometry remains useful even after enough schedule
learned embedding catches up on bpb but not binding	HSL geometry matters more for association than local prediction
learned embedding wins text but loses sensors	modality boundary becomes the main story

This curve matters more than one endpoint.
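
One simple way to read such curves is to ask at which step, if any, the learned-embedding run first matches the zero-door run. A minimal sketch; the function name and the tolerance are assumptions, and the numbers below are invented toy values, not measurements from any run.

```python
from typing import List, Optional


def first_catch_up(
    steps: List[int],
    zero_bpb: List[float],
    learned_bpb: List[float],
    tol: float = 0.0,
) -> Optional[int]:
    """First step where the learned-embedding bpb is within tol of the zero-door bpb."""
    for step, z, l in zip(steps, zero_bpb, learned_bpb):
        if l <= z + tol:
            return step
    return None  # learned embedding never caught up on this schedule


# Invented toy curves, for illustration only.
steps = [1000, 2000, 4000, 8000]
zero = [2.0, 1.8, 1.7, 1.65]
learned = [2.4, 2.0, 1.72, 1.6]

strict = first_catch_up(steps, zero, learned)
loose = first_catch_up(steps, zero, learned, tol=0.05)
```

Running the same check separately on bpb and on a binding metric would distinguish the "catches up on bpb but not binding" row from the others.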

2. Path-specific door ablation

The grounding failure makes this almost as important as the schedule curve.

I would test:

Input door	Memory door	Expected use
zero	zero	pure zero-door stress test
zero	learned projection	current recovery direction
learned	zero	tests whether memory failure is independent of input door
learned	learned projection	reference
zero	low-rank / gated memory adapter	minimal learned memory interface
zero	per-channel scale only	tests whether memory needs full projection or just rescaling

This would answer:

how much learned interface is needed for retrieved memory?
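
One way to make "how much learned interface" concrete is to count the learned parameters each memory-door variant would add. A rough sketch, assuming a memory feature width `d_in` and a model width `d_model` that are not given in the thread, and ignoring bias terms and any gating; the variant names are hypothetical labels for the table rows.

```python
from typing import List, Tuple

# (input door, memory door) pairs mirroring the ablation table.
DOOR_GRID: List[Tuple[str, str]] = [
    ("zero", "zero"),
    ("zero", "projection"),
    ("learned", "zero"),
    ("learned", "projection"),
    ("zero", "low_rank"),
    ("zero", "scale"),
]


def memory_door_params(kind: str, d_in: int, d_model: int, rank: int = 4) -> int:
    """Learned parameter count for one memory-door variant (no biases)."""
    if kind == "zero":
        return 0  # fixed features, zero-padded to model width
    if kind == "scale":
        return d_in  # one learned gain per memory channel
    if kind == "low_rank":
        return rank * (d_in + d_model)  # two thin matrices: d_in x r and r x d_model
    if kind == "projection":
        return d_in * d_model  # full dense projection
    raise ValueError(f"unknown memory door: {kind}")


def memory_budget(d_in: int, d_model: int, rank: int = 4) -> List[Tuple[str, str, int]]:
    """Attach the memory-path parameter count to every grid condition."""
    return [
        (inp, mem, memory_door_params(mem, d_in, d_model, rank))
        for inp, mem in DOOR_GRID
    ]


budget = memory_budget(d_in=64, d_model=256, rank=4)
```

With these placeholder widths the ladder runs 0, 64, 1280, 16384 parameters, so the scale-only and low-rank rows test whether grounding recovers at a small fraction of the full projection's cost.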

3. Hard-negative binding

Because the channel/geometry story now looks different for text bpb and binding, binding deserves its own stronger test.

Useful negatives:

Negative	Why
same-class wrong instance	removes simple class shortcut
same-length caption mismatch	removes length cue
entropy-matched caption mismatch	removes byte-distribution cue
shifted audio/video window	tests temporal leakage
image-only / audio-only ablation	shows modality contribution
top-k retrieval	easier to interpret than bpb alone

If HSL geometry still helps under hard negatives, that becomes a much stronger result than “zero door gives better text bpb.”
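
Two of the rows above are easy to sketch: picking same-length caption mismatches, and scoring binding with top-k retrieval. This is an illustrative sketch with hypothetical helper names; it assumes a square score matrix where `scores[i][j]` is the similarity of query `i` to candidate `j` and the correct candidate for query `i` is `j == i`.

```python
from typing import List


def same_length_negatives(captions: List[bytes], index: int) -> List[int]:
    """Indices of captions with the same byte length as the target but different content."""
    target = captions[index]
    return [
        i
        for i, c in enumerate(captions)
        if i != index and len(c) == len(target) and c != target
    ]


def recall_at_k(scores: List[List[float]], k: int) -> float:
    """Fraction of queries whose correct candidate (j == i) ranks in the top k."""
    hits = 0
    for i, row in enumerate(scores):
        ranked = sorted(range(len(row)), key=lambda j: row[j], reverse=True)
        if i in ranked[:k]:
            hits += 1
    return hits / len(scores)


captions = [b"a red car", b"a big dog", b"cat", b"a red car"]
negatives = same_length_negatives(captions, 0)

scores = [
    [0.9, 0.1, 0.0],
    [0.8, 0.7, 0.1],
    [0.2, 0.9, 0.5],
]
r1 = recall_at_k(scores, 1)
r2 = recall_at_k(scores, 2)
```

Exact duplicates of the target are excluded from the negatives, since a negative with identical bytes would penalize the model for a correct match.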

Links / orientation points

I would keep these as orientation points, not as direct baselines:

Link	Why it is useful here
Shaham & Levy — Neural Machine Translation without Embeddings	closest reference for embedding-free byte input
CANINE	tokenization-free character encoder with downsampling
Charformer	byte/character sequence shortening via learned GBST
MEGABYTE	byte patching and long byte-sequence modeling
Byte Latent Transformer	dynamic byte patches and FLOP-controlled byte scaling
RAG	useful reminder that retrieved evidence is often a separate interface, not just more input

Bottom line

This update makes the result more interesting because it shows real boundaries.

My updated summary would be: