
HoLo-ToLk: tokenizer-free speech (STT + TTS) on the 0-parameter HSL byte substrate

Hugging Face Forums [Unofficial] June 30, 2026

I think that’s probably the safer direction. If I were organizing that route, I’d frame it like this:


Short read

This route becomes easier to evaluate if each branch gets its own narrow label.

For the speech branch:

tokenizer-free speech on the fixed HSL substrate

For the BPE text-generation branch:

embedding-table-free text generation over BPE-segmented HSL streams

That keeps the useful constraint visible without forcing every branch into the same label.

The pattern is becoming:

fixed HSL substrate where possible, explicit path-specific lenses where the path needs extra structure.

For STT, the lens is spectral. For TTS, the next lens is diagnostic: coverage / stop / acoustic clarity / vocoder controls. For text generation, BPE-style segmentation becomes a segmentation lens.


1. The route becomes clearer if BPE is treated as a segmentation lens

The route becomes easier to evaluate if the global constraint is split into path-level claims.

In that map, BPE is not just an implementation detail. It is the text-generation lens.

| Path | What changed | Clean framing |
|---|---|---|
| HoLo_ZeRo input | remove learned byte embedding door | zero-param observation door |
| STT | raw HSL alone is weak, spectral lens helps | speech needs a time-frequency lens |
| TTS | text input can use HSL, free-run still rough | acoustic generation is a separate path |
| text generation | raw byte HSL gives word salad at small scale | text generation needs segmentation structure |
| memory / grounding | zero memory path was weak | retrieved facts need a learned/explicit interface |

That gives a consistent story:

fixed substrate where possible, explicit lens where necessary.


2. The BPE branch needs its own label

Once BPE is in the loop, the branch has segmentation and a vocabulary. So I would label the branch around the remaining constraint:

no learned embedding table for the segmented units; the chunks are still encoded through the fixed HSL substrate.

A useful naming table:

| Possible phrase | Why it works / risk |
|---|---|
| segmentation lens | fits the path-specific lens story |
| embedding-table-free BPE-HSL | precise and compact |
| BPE-segmented HSL stream | descriptive and low-risk |
| tokenizer-free with BPE | mixes two different claims |
| semantic chunks | useful intuition, but too strong if literal |

A clean version might be:

BPE-HSL: embedding-table-free generation over HSL-coded subword chunks

That separates three things:

| Layer | Question | BPE-HSL answer |
|---|---|---|
| Segmentation / vocabulary | Are there explicit chunks? | yes, BPE-style chunks |
| Embedding table | Does each chunk get a learned vector? | no, if each chunk is streamed through HSL |
| Model body | Does the Transformer learn mixing/generation? | yes |

So the interesting text-generation question becomes:

how much segmentation structure is needed before the fixed HSL substrate becomes useful for generation?
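
The three-layer split can be sketched in a few lines. This is only an illustration, not the project's code: the post does not define the HSL encoding, so `fixed_byte_code` is a hypothetical parameter-free stand-in, and `segment` uses a tiny hand-written vocabulary with greedy longest-match rather than a trained BPE merge table.

```python
import math

# Hypothetical toy vocabulary; a real run would use a trained BPE merge table.
TOY_VOCAB = ["the", "th", "he", "at", "cat", " "]

def segment(text, vocab=TOY_VOCAB):
    """Greedy longest-match segmentation; unknown characters become one-char chunks."""
    chunks, i = [], 0
    by_length = sorted(vocab, key=len, reverse=True)
    while i < len(text):
        match = next((v for v in by_length if text.startswith(v, i)), text[i])
        chunks.append(match)
        i += len(match)
    return chunks

def fixed_byte_code(b):
    """Stand-in for the fixed substrate: a deterministic 3-vector per byte, no parameters.
    This is NOT the actual HSL definition, which the post does not specify."""
    angle = 2.0 * math.pi * b / 256.0
    return (math.cos(angle), math.sin(angle), b / 255.0)

def encode_chunk(chunk):
    """A chunk has no learned vector; it is just the fixed codes of its UTF-8 bytes."""
    return [fixed_byte_code(b) for b in chunk.encode("utf-8")]

chunks = segment("the cat")                  # segmentation layer: explicit chunks
stream = [encode_chunk(c) for c in chunks]   # no embedding table: fixed codes only
```

Only the model body that consumes `stream` would carry learned weights in this picture; the segmentation and the byte codes stay fixed.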




3. The key control is segmentation vs HSL geometry

BPE may help for two separate reasons:

  1. It reduces sequence burden and gives the model larger text units.
  2. The HSL stream over those chunks may still provide useful fixed geometry.

Those are different claims.

A result like:

BPE-HSL reduces word salad

would be useful, but it would not yet say whether HSL geometry is the important part. It might simply mean the model needed segmentation.

So I would make the BPE-HSL branch easy to interpret with a small control table.

| Condition | What it tests |
|---|---|
| raw byte HSL | pure byte-substrate baseline |
| raw byte HSL + longer schedule | whether the issue is only training budget |
| char-level / byte-level HSL with same model size | whether shorter chunks are needed |
| BPE segmentation + HSL stream | segmentation lens effect |
| BPE tokens + learned embedding baseline | what is lost by avoiding embedding tables |
| BPE segmentation + random/fixed vectors | whether segmentation alone explains the gain |
| BPE segmentation + permuted HSL mapping | whether HSL geometry matters after segmentation |
| vocab-size sweep | chunk granularity boundary |
| rare words / numbers / punctuation bucket | whether BPE helps the actual failure cases |

The key comparison is not only:

BPE-HSL vs raw byte HSL

but also:

BPE-HSL vs BPE learned embeddings

because that is the cleanest way to test the “embedding-table-free” part.
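
The two geometry controls (permuted mapping, random fixed vectors) can be built from the same segmentation. A minimal sketch, assuming a hypothetical `fixed_byte_code` stand-in for the HSL mapping, which the post does not define: the permuted control keeps the same 256 codes but shuffles which byte gets which code, and the random control draws one frozen, untrained vector per chunk.

```python
import math
import random

def fixed_byte_code(b):
    """Hypothetical stand-in for the fixed HSL byte code (not the real definition)."""
    angle = 2.0 * math.pi * b / 256.0
    return (math.cos(angle), math.sin(angle), b / 255.0)

def make_permuted_code(seed=0):
    """Same set of 256 codes, assigned to bytes in a shuffled order.
    Keeps the code distribution but destroys the byte-to-code geometry."""
    order = list(range(256))
    random.Random(seed).shuffle(order)
    return lambda b: fixed_byte_code(order[b])

def make_random_chunk_vectors(vocab, dim=3, seed=0):
    """One frozen random vector per chunk: segmentation with no byte geometry at all."""
    rng = random.Random(seed)
    # sorted() makes the assignment independent of the input ordering
    return {c: tuple(rng.uniform(-1.0, 1.0) for _ in range(dim)) for c in sorted(vocab)}

def encode(chunks, byte_code):
    """Encode each chunk as the codes of its UTF-8 bytes under the given byte mapping."""
    return [[byte_code(b) for b in c.encode("utf-8")] for c in chunks]

chunks = ["the", " ", "cat"]  # hypothetical BPE-style segmentation
hsl_stream = encode(chunks, fixed_byte_code)
permuted_stream = encode(chunks, make_permuted_code(seed=1))
random_table = make_random_chunk_vectors(set(chunks), seed=1)
random_stream = [random_table[c] for c in chunks]
```

Training the same model body on each of the three streams, with everything else held equal, is what would separate "segmentation helped" from "the HSL geometry helped".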



4. HoLo-ToLk and BPE-HSL can stay related but distinct

The branches can share the same substrate story without sharing the same public label.

| Branch | Claim |
|---|---|
| HoLo-ToLk STT | HSL substrate needs spectral lens for speech recognition |
| HoLo-ToLk TTS | HSL text front end is feasible, but free-run acoustic generation needs a failure map |
| BPE-HSL text generation | text generation may need a segmentation lens, while still avoiding learned embedding tables |

That keeps the speech claim and the text-generation claim from interfering with each other.

The phrase “tokenizer-free speech” can still describe the HoLo-ToLk STT/TTS demo if the text path is UTF-8 bytes and the STT path outputs chars/bytes. Once BPE is added for text generation, I would give that branch its own label.

Something like:

HoLo-BPE-HSL: embedding-table-free text generation over HSL-coded subword chunks

A bit long, but very clear.


5. The updated wording can stay narrow and testable

A scoped version of the claims might be:

| Claim | Safer status |
|---|---|
| fixed HSL substrate | yes |
| no learned embedding table | plausible branch claim |
| tokenizer-free | only for branches that actually avoid segmentation/vocab |
| no learned interface anywhere | probably too broad |
| path-specific lens map | increasingly useful |
| BPE for text generation | segmentation-lens branch |

That makes the route more testable.

Instead of one global claim, each branch gets a measurable boundary:

| Branch | Main question |
|---|---|
| STT | Does spectral lens + HSL beat mel in the same controlled setup? |
| TTS | Does free-run cover, stop, and preserve phonetic content? |
| BPE-HSL text | Does segmentation help without needing a learned embedding table? |
| Memory / grounding | Does a small learned interface recover evidence use? |

Bottom line

I would frame the new route like this:

BPE-HSL is the segmentation-lens branch. It has a vocabulary/segmentation step, but it can still test the more interesting constraint: no learned embedding table for the vocabulary items.

That fits the larger pattern:

fixed substrate where possible, explicit lens where necessary.

If BPE-HSL reduces word salad while staying reasonably close to a learned-embedding BPE baseline, that would be a strong result. If it only matches random fixed vectors, then segmentation is doing most of the work. Either way, the experiment becomes interpretable.
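
That reading can be written down as a rule before any run. A minimal sketch, assuming one lower-is-better score per condition (for example validation loss) and a relative tolerance `tol` fixed in advance; the metric, the threshold, and the labels are all assumptions for illustration, not values from this thread.

```python
def interpret(bpe_hsl, bpe_learned, bpe_random, raw_byte_hsl, tol=0.05):
    """Classify a BPE-HSL result from four lower-is-better scores.
    `tol` is the relative margin treated as "about the same"."""
    def close(a, b):
        return abs(a - b) <= tol * max(abs(a), abs(b))

    # No clear gain over the raw byte baseline: segmentation did not help here.
    if bpe_hsl >= raw_byte_hsl or close(bpe_hsl, raw_byte_hsl):
        return "no gain from segmentation"
    # Random fixed vectors do as well or better: the chunks, not the geometry, matter.
    if bpe_random <= bpe_hsl or close(bpe_hsl, bpe_random):
        return "segmentation explains the gain"
    # Beats random vectors and stays near the learned-embedding baseline.
    if bpe_hsl <= bpe_learned or close(bpe_hsl, bpe_learned):
        return "fixed codes competitive with learned table"
    # Beats random vectors but trails the learned table by more than tol.
    return "geometry helps, learned table still ahead"

# Made-up scores, only to show the call shape.
label = interpret(bpe_hsl=2.0, bpe_learned=1.95, bpe_random=2.6, raw_byte_hsl=3.0)
```

Fixing `tol` and the metric up front keeps the outcome from being reinterpreted after the numbers come in.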
