Removing the embedding from my embedding: a byte transformer with a 0-parameter input layer (25M, single RTX 4070)

Hugging Face Forums [Unofficial] June 14, 2026
Oh. Looks like there’s a useful next framing here:


Short version

This update makes the result more interesting, not less, because it shows real boundaries.

My updated read would be:

zero HSL door looks promising for primary byte streams; fixed channel addresses make it inspectable; K is a compression/alignment knob; retrieved memory probably needs a separate learned interface; output packing is a separate question; and absolute sensor tasks may need an explicit magnitude channel.

So the next design rule may not be:

use zero door everywhere.

It may be:

use zero HSL for primary observations, add learned interfaces where the path is semantically different, and add explicit channels only when a measured boundary shows they are needed.

That feels like a much more useful research direction.

This is now a path-specific door question

This update is useful because the map is no longer just:

does the zero door work?

It is becoming:

which path can use a zero door, which path needs a learned interface, and which task needs extra absolute information?

That distinction feels important.

A useful reference point here is the earlier embedding-free byte line, especially Shaham & Levy, Neural Machine Translation without Embeddings, which showed that byte models can avoid learned embeddings by using fixed byte representations. But HoLo_ZeRo is now moving into a more specific question: not merely “can the input embedding be removed?”, but “which interfaces can remain fixed, and which interfaces need learned adapters?”

That is a sharper question.

The channel-knockout matrix is probably the most important new result

The channel-knockout table seems especially valuable because it makes the fixed-address design pay off.

The zero door is not only saving parameters. It is making the first interface inspectable.

| Observation | My read |
| --- | --- |
| text/caption lean heavily on Fourier | local byte prediction may rely more on bit-pattern structure/distribution than on raw numeric locality |
| image leans more on Δ + phase | decoded image-like bytes seem to use numeric-local / phase-like structure |
| audio spreads over Δ + Fourier + phase | audio seems to use a mixed signal-like representation rather than one dominant channel family |
| boundary is near zero everywhere | useful negative control under fixed K-packing |

The boundary-channel result is actually nice. If every channel looked important, the knockout would be harder to trust. The fact that boundary stays idle under fixed patching makes sense: with fixed K, the model does not need to ask “where does the unit end?”

That gives a clean future prediction:

boundary should matter more under adaptive or content-determined slotting.

So the channel matrix is not only diagnostic; it creates a falsifiable next experiment.
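To make the knockout idea concrete, here is a minimal sketch, not HoLo_ZeRo's actual code. The channel names follow the knockout table, but the index layout in `CHANNELS`, the helpers `knock_out` and `knockout_matrix`, and the `toy_loss` stand-in are all hypothetical. In a real run `loss_fn` would be a model evaluation returning bpb.

```python
from typing import Callable, Dict, List

Features = List[List[float]]

# Hypothetical fixed channel addresses: channel name -> feature indices.
CHANNELS: Dict[str, range] = {
    "delta": range(0, 2),
    "fourier": range(2, 4),
    "phase": range(4, 6),
    "boundary": range(6, 7),
}


def knock_out(features: Features, channel: str) -> Features:
    """Return a copy of the features with one channel family zeroed."""
    idx = set(CHANNELS[channel])
    return [[0.0 if j in idx else v for j, v in enumerate(row)] for row in features]


def knockout_matrix(features: Features, loss_fn: Callable[[Features], float]) -> Dict[str, float]:
    """Loss increase caused by zeroing each channel family at eval time."""
    base = loss_fn(features)
    return {name: loss_fn(knock_out(features, name)) - base for name in CHANNELS}


def toy_loss(features: Features) -> float:
    # Stand-in for a trained model that only reads the Fourier indices (2 and 3).
    return sum(1.0 - min(1.0, abs(r[2]) + abs(r[3])) for r in features) / len(features)


feats = [[0.5] * 7 for _ in range(4)]
deltas = knockout_matrix(feats, toy_loss)
for name, d in deltas.items():
    print(f"{name:9s} loss increase = {d:.2f}")
```

Because the addresses are fixed, the same `CHANNELS` map can be reused across seeds and checkpoints, which is what makes a repeat-knockout comparison cheap.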

This also connects nicely to the broader byte-modeling literature. CANINE, Charformer, MEGABYTE, and Byte Latent Transformer all deal, in different ways, with the fact that byte/character streams are long and need some kind of downsampling, grouping, or patching. What is different here is that the fixed HSL channels let you inspect which parts of the byte-signal representation are load-bearing after training.

That inspectability may be one of the strongest reasons to keep the zero-door setup around, even apart from parameter count.

The grounding failure is the most informative break

The disk-grounding result looks like the most important failure case so far.

My updated read is:

| Path | Current read |
| --- | --- |
| primary byte input | zero HSL door looks plausible |
| decoded sensor-like input | zero HSL door can be competitive with a learned door, at least in these small probes |
| retrieved memory / disk facts | zero-padding alone is not enough; this path likely needs its own learned interface |
| output head | K-packed symmetry is a separate structural choice and currently seems to cost bits |
| absolute-distance tasks | change-rate geometry may need an explicit absolute-magnitude channel |

That makes me think the next map should be a path-specific door map, not only a failure-boundary map.

Something like:

| Path | Candidate door | Why |
| --- | --- | --- |
| primary decoded byte stream | zero HSL door | fixed geometry seems enough for ordinary observation bytes |
| retrieved memory / disk facts | small learned projection | retrieved facts are not just another byte stream; the model must be forced to read their content |
| output head | separate design, not automatically symmetric | input K-packing and output K-packing do not have to share the same answer |
| absolute sensor tasks | HSL + coarse absolute channel | change-rate features may miss magnitude-critical information |
| adaptive slotting | HSL + active boundary channel | boundary may wake up only when slot boundaries become content-dependent |

The grounding break is useful because it suggests that “zero learned door” may be path-specific, not universal.

That is a good result, not a bad one. It narrows the design rule.

Memory path may be a different object than input bytes

The retrieved-memory failure suggests a conceptual distinction:

primary input bytes are observations; retrieved memory is evidence.

Those may need different interfaces.

For primary input, zero HSL features can work because the model is learning from local byte structure. For retrieved facts, the model has to treat the memory content as something to consult. If the memory path is too thin or too diluted by positional structure, the model can learn to ignore it.

So I would phrase the result this way:

zero door is promising for primary observation streams, but retrieved memory probably needs an explicit learned reading interface.

That is a more specific and more useful claim than either:

  • “zero door works everywhere,” or
  • “zero door fails grounding.”

The monotonic recovery after adding a small learned memory projection is encouraging, even if the current gap is still far below the old learned-door number. It shows directionality: the model can be made to read the disk again when the memory path has its own interface.

The next useful split might be:

| Condition | What it tells us |
| --- | --- |
| zero input door + zero memory door | whether the pure zero-door version can read retrieved content |
| zero input door + learned memory projection | whether a separate memory interface restores grounding |
| learned input door + zero memory door | whether the failure is specifically memory-path dilution |
| learned input door + learned memory projection | upper/reference condition |
| zero input door + low-rank memory adapter | whether a very small adapter is enough |
| zero input door + per-channel scale only | whether memory needs full projection or just rescaling |
| isolated knowledge-only vs mixed training | whether the memory path overfits when the batch is too knowledge-heavy |

The isolated-vs-mixed distinction seems important. If the isolated knowledge probe overshoots and overfits, but the mixed run stays modest and stable, then the memory path may need curriculum/mixing control as much as architecture control.
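One way to keep that split honest is to enumerate the conditions up front rather than adding runs ad hoc. The sketch below is only an illustration: `AblationConfig`, the option strings, and `build_grid` are hypothetical names, and it takes the full cross of the three axes, which is a superset of the conditions in the table.

```python
from dataclasses import dataclass
from itertools import product
from typing import List

# Hypothetical option names for each axis of the ablation.
INPUT_DOORS = ("zero", "learned")
MEMORY_DOORS = ("zero", "per_channel_scale", "low_rank_adapter", "learned_projection")
TRAINING_MIXES = ("isolated_knowledge", "mixed")


@dataclass(frozen=True)
class AblationConfig:
    input_door: str
    memory_door: str
    training_mix: str

    @property
    def name(self) -> str:
        return f"in={self.input_door}|mem={self.memory_door}|mix={self.training_mix}"


def build_grid() -> List[AblationConfig]:
    """Full cross of input door x memory door x training mix."""
    return [AblationConfig(i, m, t) for i, m, t in product(INPUT_DOORS, MEMORY_DOORS, TRAINING_MIXES)]


grid = build_grid()
for cfg in grid:
    print(cfg.name)
```

On a single-GPU budget the full cross is probably too many runs, so filtering `grid` down to the table's rows first would be the pragmatic choice.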

This is close in spirit to the broader lesson from retrieval-augmented modeling: retrieval content is not just “more input.” It is a different information path. Work like RAG makes that separation explicit by giving retrieved evidence its own retrieval/conditioning mechanism. I do not mean HoLo_ZeRo should copy that architecture, only that the failure mode here fits a known pattern: retrieved evidence often needs a distinct interface.

The absolute-distance failure is also a useful encoder hint

The sensor probe is interesting because it gives a more concrete version of the modality-boundary story.

The result is not simply:

HSL works on sensor bytes.

It is more like:

HSL is competitive on several signal-like byte streams, but absolute-magnitude tasks may expose a missing channel.

That is a much more useful finding.

If HSL is mainly change-rate / relational geometry, then losing to a raw range profile on an absolute-distance task makes sense. The next encoder hypothesis becomes very concrete:

add a small coarse absolute-magnitude channel and test whether it improves absolute tasks without damaging change/binding tasks.

I would probably test it as an explicit card:

| Encoder variant | Test |
| --- | --- |
| current HSL | baseline |
| HSL + coarse absolute bucket | does absolute-distance improve? |
| HSL + normalized absolute channel | does magnitude help without scale instability? |
| HSL + per-modality absolute channel | does this help only sensors, or also image/audio? |
| HSL + absolute channel ablated at eval | does the model actually use it? |

The important part is to avoid making the encoder more complex without a direct failure case. Here there is a direct failure case, so the extra channel has a clear reason to exist.
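A tiny sketch of why a change-rate view can miss magnitude, and what a coarse absolute bucket would add. This is not the HSL encoder: `delta_channel` is a deliberately simplified stand-in for change-rate features, and `coarse_bucket`, the 8-bucket choice, and the sample readings are assumptions for illustration.

```python
from typing import List, Sequence


def delta_channel(values: Sequence[int]) -> List[int]:
    """Change-rate stand-in: difference between consecutive readings."""
    return [b - a for a, b in zip(values, values[1:])]


def coarse_bucket(values: Sequence[int], n_buckets: int = 8, max_value: int = 255) -> List[int]:
    """Coarse absolute-magnitude channel: which of n_buckets equal ranges each byte falls in."""
    width = (max_value + 1) / n_buckets
    return [min(int(v // width), n_buckets - 1) for v in values]


# Two made-up range profiles with the same shape at different absolute distances.
near = [10, 12, 15, 13]
far = [210, 212, 215, 213]

print(delta_channel(near), delta_channel(far))  # identical change-rate view
print(coarse_bucket(near), coarse_bucket(far))  # different absolute buckets
```

The two profiles are indistinguishable through the delta view alone, which is the kind of failure an absolute-distance task would expose. A bucket channel this coarse adds very little input width, so it is a cheap first variant to test.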

This also fits the broader pattern from byte/signal models: the usefulness of a byte representation depends on whether byte-value geometry corresponds to something meaningful in the underlying data. Decoded grayscale, μ-law audio, lidar/radar ranges, UTF-8 text, and compressed bytes should not necessarily behave the same way.

Output head should stay separate from input door

The I/O-symmetric run is useful precisely because it did not simply validate symmetry.

My read would be:

input-side zero K-packing and output-side K-packing are related ideas, but not the same claim.

Input packing gives the model a compressed observation interface. Output packing changes the prediction factorization.

Those can have very different costs.

| Component | Question |
| --- | --- |
| zero input door | can fixed geometry replace a learned input embedding/projection? |
| K-packed input | can several bytes share one attention slot efficiently? |
| K-packed output | can several output bytes be bundled without paying too much bpb? |
| I/O symmetry | is structural neatness worth the output prediction cost? |

So I would keep the output-head result as its own branch:

structurally interesting, but currently not a win.

That is still useful. It prevents the input-door result from being overextended.
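For comparing a K-packed output head against a per-byte head, the bookkeeping matters: a packed head's loss is per slot, and each slot covers K bytes. Here is a minimal sketch under my own assumptions. `pack_bytes` pads the last slot with zeros (an assumption, the thread does not say how the tail is handled), and `bits_per_byte` just converts summed negative log-likelihood in nats to bits per byte.

```python
import math
from typing import List


def pack_bytes(data: bytes, k: int, pad: int = 0) -> List[List[int]]:
    """Group a byte stream into slots of k bytes, padding the final slot."""
    if k <= 0:
        raise ValueError("k must be positive")
    padded = list(data) + [pad] * (-len(data) % k)
    return [padded[i:i + k] for i in range(0, len(padded), k)]


def bits_per_byte(total_nll_nats: float, n_bytes: int) -> float:
    """Convert a summed NLL in nats into bits per byte."""
    return total_nll_nats / (n_bytes * math.log(2))


slots = pack_bytes(b"abcde", k=2)
print(slots)

# Sanity check: a head that predicts every byte uniformly over 256 values
# should score 8 bpb whether it emits 1 byte or K bytes per slot.
k, n_slots = 4, 10
slot_nll = k * math.log(256)  # NLL of one K-byte slot, in nats
bpb = bits_per_byte(n_slots * slot_nll, n_slots * k)
print(round(bpb, 6))
```

Dividing by slots instead of bytes would inflate the packed head's number by a factor of K, so normalizing by bytes keeps the input-side and output-side comparisons on the same axis.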

This distinction is also useful relative to prior embedding-free work. Shaham & Levy discuss replacing learned embeddings in byte models, including the first/last layer framing. But HoLo_ZeRo’s strongest current evidence seems input-side. Output-side bundling should probably remain a separate experiment rather than being assumed from the input-side result.

Updated map

My updated map would be:

| Area | Current state | Next useful test |
| --- | --- | --- |
| primary input door | promising | long-schedule learned-embedding head-to-head |
| channel geometry | now inspectable | repeat knockout across seeds/checkpoints |
| modality boundary | starting to appear | decoded signal-like vs UTF-8 vs compressed/encrypted controls |
| sensor boundary | HSL ≈ learned in small probes, but absolute distance exposes weakness | HSL + coarse absolute channel |
| K boundary | K=16 preserves some bpb but softens binding | K sweep with binding/hard negatives |
| grounding | zero memory path breaks; learned memory projection recovers directionally | path-specific memory-door ablation |
| output head | I/O symmetry costs bits | keep separate from input-door claim |
| boundary channel | idle under fixed K | adaptive/content-determined slotting |

That is a much sharper story than a single “zero door wins/loses” table.

What I would prioritize next

Given this update, I would prioritize three things.

1. Long-schedule learned-embedding head-to-head

This is still the cleanest missing test.

It separates:

| Possibility | Meaning |
| --- | --- |
| learned embedding catches up | zero door is mostly a cold-start / sample-efficiency advantage |
| learned embedding does not catch up | fixed HSL geometry remains useful even after enough schedule |
| learned embedding catches up on bpb but not binding | HSL geometry matters more for association than local prediction |
| learned embedding wins text but loses sensors | modality boundary becomes the main story |

This curve matters more than one endpoint.
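Reading the curve rather than the endpoint can be as simple as asking when, if ever, the learned-embedding run catches up. A rough sketch follows; `first_catch_up_step` is a hypothetical helper, and the bpb values are made-up placeholders to show the mechanics, not results from any run.

```python
from typing import Optional, Sequence


def first_catch_up_step(
    steps: Sequence[int],
    zero_bpb: Sequence[float],
    learned_bpb: Sequence[float],
    tolerance: float = 0.0,
) -> Optional[int]:
    """First logged step where the learned-embedding bpb is within `tolerance`
    of, or below, the zero-door bpb. Returns None if that never happens."""
    if not (len(steps) == len(zero_bpb) == len(learned_bpb)):
        raise ValueError("curves must be logged at the same steps")
    for step, z, l in zip(steps, zero_bpb, learned_bpb):
        if l <= z + tolerance:
            return step
    return None


# Placeholder curves for illustration only.
steps = [1000, 2000, 4000, 8000]
zero_door = [2.0, 1.8, 1.7, 1.65]
learned_catches_up = [2.4, 2.0, 1.7, 1.6]
learned_never = [2.4, 2.0, 1.9, 1.8]

print(first_catch_up_step(steps, zero_door, learned_catches_up))
print(first_catch_up_step(steps, zero_door, learned_never))
```

Running the same check separately on a bpb curve and a binding-metric curve would separate the "catches up on bpb but not binding" case from the others.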

2. Path-specific door ablation

The grounding failure makes this almost as important as the schedule curve.

I would test:

| Input door | Memory door | Expected use |
| --- | --- | --- |
| zero | zero | pure zero-door stress test |
| zero | learned projection | current recovery direction |
| learned | zero | tests whether memory failure is independent of input door |
| learned | learned projection | reference |
| zero | low-rank / gated memory adapter | minimal learned memory interface |
| zero | per-channel scale only | tests whether memory needs full projection or just rescaling |

This would answer:

how much learned interface is needed for retrieved memory?
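To put rough numbers on "how much", here is a back-of-envelope parameter count for the three memory-door sizes. The dimensions (`d_mem=64`, `d_model=256`, `rank=4`) are placeholders I picked, not HoLo_ZeRo's real widths, and the per-channel-scale variant assumes the memory features already line up with the slots they feed.

```python
def full_projection_params(d_mem: int, d_model: int, bias: bool = True) -> int:
    """Dense learned projection from memory features to model width."""
    return d_mem * d_model + (d_model if bias else 0)


def low_rank_params(d_mem: int, d_model: int, rank: int) -> int:
    """Two-factor adapter: (d_mem x rank) followed by (rank x d_model), no bias."""
    return rank * (d_mem + d_model)


def per_channel_scale_params(d_mem: int) -> int:
    """One learned scalar gain per memory channel."""
    return d_mem


# Placeholder sizes for illustration only.
d_mem, d_model, rank = 64, 256, 4
counts = {
    "learned projection": full_projection_params(d_mem, d_model),
    "low-rank adapter": low_rank_params(d_mem, d_model, rank),
    "per-channel scale": per_channel_scale_params(d_mem),
}
for name, n in counts.items():
    print(f"{name:20s} {n:6d} params")
```

If grounding recovers with the two smaller variants, the memory path arguably needs rescaling or a thin read interface rather than a full learned door.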

3. Hard-negative binding

Because the channel/geometry story now looks different for text bpb and binding, binding deserves its own stronger test.

Useful negatives:

| Negative | Why |
| --- | --- |
| same-class wrong instance | removes simple class shortcut |
| same-length caption mismatch | removes length cue |
| entropy-matched caption mismatch | removes byte-distribution cue |
| shifted audio/video window | tests temporal leakage |
| image-only / audio-only ablation | shows modality contribution |
| top-k retrieval | easier to interpret than bpb alone |

If HSL geometry still helps under hard negatives, that becomes a much stronger result than “zero door gives better text bpb.”
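Two of those negatives are easy to sketch. The code below is illustrative only: `recall_at_k` assumes a square score matrix where row `i`'s correct match is column `i`, `same_length_negatives` picks mismatched captions with the same byte length, and the scores and captions are toy values.

```python
from typing import List, Sequence


def recall_at_k(scores: Sequence[Sequence[float]], k: int) -> float:
    """Fraction of queries whose true match (column i for row i) ranks in the top k."""
    hits = 0
    for i, row in enumerate(scores):
        ranked = sorted(range(len(row)), key=lambda j: row[j], reverse=True)
        if i in ranked[:k]:
            hits += 1
    return hits / len(scores)


def same_length_negatives(captions: Sequence[bytes], i: int) -> List[int]:
    """Indices of other captions with the same byte length as caption i."""
    return [j for j, c in enumerate(captions) if j != i and len(c) == len(captions[i])]


# Toy similarity scores: rows are queries, columns are candidates.
scores = [
    [0.9, 0.1, 0.0],
    [0.8, 0.7, 0.1],
    [0.2, 0.3, 0.1],
]
captions = [b"a cat", b"a dog", b"two birds"]

print(recall_at_k(scores, 1), recall_at_k(scores, 2))
print(same_length_negatives(captions, 0))
```

Restricting the candidate columns to `same_length_negatives` before ranking gives the length-controlled version of the same metric.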

Links / orientation points

I would keep these as orientation points, not as direct baselines:

| Link | Why it is useful here |
| --- | --- |
| Shaham & Levy, Neural Machine Translation without Embeddings | closest reference for embedding-free byte input |
| CANINE | tokenization-free character encoder with downsampling |
| Charformer | byte/character sequence shortening via learned GBST |
| MEGABYTE | byte patching and long byte-sequence modeling |
| Byte Latent Transformer | dynamic byte patches and FLOP-controlled byte scaling |
| RAG | useful reminder that retrieved evidence is often a separate interface, not just more input |

Bottom line

This update makes the result more interesting because it shows real boundaries.

My updated summary would be:

zero HSL door looks promising for primary byte streams; fixed channel addresses make it inspectable; K is a compression/alignment knob; retrieved memory probably needs a separate learned interface; output packing is a separate question; and absolute sensor tasks may need an explicit magnitude channel.

So the next design rule may not be:

use zero door everywhere.

It may be:

use zero HSL for primary observations, add learned interfaces where the path is semantically different, and add explicit channels only when a measured boundary shows they are needed.

That is a more useful research direction than a universal zero-door claim.
