HoLo-ToLk: tokenizer-free speech (STT + TTS) on the 0-parameter HSL byte substrate

Hugging Face Forums [Unofficial] June 30, 2026

I think that’s probably the safer direction. If I were organizing that route, I’d frame it like this:

Short read

This route becomes easier to evaluate if each branch gets its own narrow label.

For the speech branch:

tokenizer-free speech on the fixed HSL substrate

For the BPE text-generation branch:

embedding-table-free text generation over BPE-segmented HSL streams

That keeps the useful constraint visible without forcing every branch into the same label.

The pattern is becoming:

fixed HSL substrate where possible, explicit path-specific lenses where the path needs extra structure.

For STT, the lens is spectral. For TTS, the next lens is diagnostic: coverage / stop / acoustic clarity / vocoder controls. For text generation, BPE-style segmentation becomes a segmentation lens.

1. The route becomes clearer if BPE is treated as a segmentation lens

Splitting the global constraint into path-level claims makes each claim easier to evaluate on its own.

In that map, BPE is not just an implementation detail. It is the text-generation lens.

Path	What changed	Clean framing
HoLo_ZeRo input	remove learned byte embedding door	zero-param observation door
STT	raw HSL alone is weak, spectral lens helps	speech needs a time-frequency lens
TTS	text input can use HSL, free-run still rough	acoustic generation is a separate path
text generation	raw byte HSL gives word salad at small scale	text generation needs segmentation structure
memory / grounding	zero memory path was weak	retrieved facts need a learned/explicit interface

That gives a consistent story:

fixed substrate where possible, explicit lens where necessary.

2. The BPE branch needs its own label

Once BPE is in the loop, the branch has segmentation and a vocabulary. So I would label the branch around the remaining constraint:

no learned embedding table for the segmented units; the chunks are still encoded through the fixed HSL substrate.

A useful naming table:

Possible phrase	Why it works / risk
segmentation lens	fits the path-specific lens story
embedding-table-free BPE-HSL	precise and compact
BPE-segmented HSL stream	descriptive and low-risk
tokenizer-free with BPE	mixes two different claims
semantic chunks	useful intuition, but too strong if literal

A clean version might be:

BPE-HSL: embedding-table-free generation over HSL-coded subword chunks

That separates three things:

Layer	Question	BPE-HSL answer
Segmentation / vocabulary	Are there explicit chunks?	yes, BPE-style chunks
Embedding table	Does each chunk get a learned vector?	no, if each chunk is streamed through HSL
Model body	Does the Transformer learn mixing/generation?	yes

So the interesting text-generation question becomes:

how much segmentation structure is needed before the fixed HSL substrate becomes useful for generation?
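
As a rough illustration of the three layers above, here is a minimal sketch of segmentation without a per-chunk embedding table. Everything in it is an assumption for illustration: `TOY_VOCAB` stands in for a trained BPE vocabulary, greedy longest-match stands in for real BPE merges, and `fixed_byte_code` is only a placeholder for the HSL substrate, whose actual definition is not given in this post. The structural point is that the vocabulary decides chunk boundaries, but no chunk ID ever indexes a learned vector.

```python
from typing import List, Tuple

# Hypothetical toy vocabulary; a real run would use a trained BPE vocabulary.
TOY_VOCAB = {b"the", b"th", b"he", b" ", b"cat", b"at"}
MAX_CHUNK = max(len(v) for v in TOY_VOCAB)


def segment(data: bytes) -> List[bytes]:
    """Greedy longest-match segmentation with a single-byte fallback."""
    chunks: List[bytes] = []
    i = 0
    while i < len(data):
        for size in range(min(MAX_CHUNK, len(data) - i), 0, -1):
            piece = data[i:i + size]
            # A single byte is always accepted, so the loop always advances.
            if size == 1 or piece in TOY_VOCAB:
                chunks.append(piece)
                i += size
                break
    return chunks


def fixed_byte_code(b: int) -> Tuple[int, ...]:
    """Placeholder for the fixed substrate: the 8 bits of the byte.

    This is NOT the real HSL code; it only shares the property of
    having zero learned parameters.
    """
    return tuple((b >> k) & 1 for k in range(7, -1, -1))


def encode_chunk(chunk: bytes) -> List[Tuple[int, ...]]:
    """Stream one chunk through the fixed code, byte by byte."""
    return [fixed_byte_code(b) for b in chunk]


def encode_text(text: str) -> List[List[Tuple[int, ...]]]:
    """Segment first, then encode each chunk with the fixed code."""
    return [encode_chunk(c) for c in segment(text.encode("utf-8"))]
```

With this toy vocabulary, `encode_text("the cat")` produces three chunks (`the`, a space, `cat`), each a short sequence of fixed codes. Swapping the vocabulary changes the chunk boundaries but adds no trainable parameters at the input.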

3. The key control is segmentation vs HSL geometry

BPE may help for two separate reasons:

First, it shortens the sequence and gives the model larger text units.
Second, the HSL stream over those chunks may still provide useful fixed geometry.

Those are different claims.

A result like:

BPE-HSL reduces word salad

would be useful, but it would not yet say whether HSL geometry is the important part. It might simply mean the model needed segmentation.

So I would make the BPE-HSL branch easy to interpret with a small control table.

Condition	What it tests
raw byte HSL	pure byte-substrate baseline
raw byte HSL + longer schedule	whether the issue is only training budget
char-level / byte-level HSL with same model size	whether shorter chunks are needed
BPE segmentation + HSL stream	segmentation lens effect
BPE tokens + learned embedding baseline	what is lost by avoiding embedding tables
BPE segmentation + random/fixed vectors	whether segmentation alone explains the gain
BPE segmentation + permuted HSL mapping	whether HSL geometry matters after segmentation
vocab-size sweep	chunk granularity boundary
rare words / numbers / punctuation bucket	whether BPE helps the actual failure cases
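
The three BPE-side input conditions in the table could be set up along these lines. This is a sketch under stated assumptions, not the project's implementation: `stand_in_code` is a placeholder for the HSL code (not defined in this post), `DIM` and the condition names are hypothetical, and "permuted HSL mapping" is read here as keeping the same set of code vectors while shuffling which byte receives which vector.

```python
import hashlib
import random
from typing import Callable, Dict, List

Vector = List[float]
DIM = 8  # hypothetical code width


def stand_in_code(b: int) -> Vector:
    """Placeholder for the fixed substrate: byte bits as floats."""
    return [float((b >> k) & 1) for k in range(DIM - 1, -1, -1)]


def make_permuted_code(seed: int) -> Callable[[int], Vector]:
    """Same code vectors, assigned to bytes through a fixed shuffle."""
    perm = list(range(256))
    random.Random(seed).shuffle(perm)
    return lambda b: stand_in_code(perm[b])


def stream_chunk(chunk: bytes, code: Callable[[int], Vector]) -> List[Vector]:
    """Encode a chunk as a stream of per-byte code vectors."""
    return [code(b) for b in chunk]


def random_fixed_vector(chunk: bytes, seed: int) -> Vector:
    """One frozen random vector per chunk identity; nothing is trained."""
    digest = hashlib.sha256(seed.to_bytes(8, "big") + chunk).digest()
    rng = random.Random(int.from_bytes(digest, "big"))
    return [rng.uniform(-1.0, 1.0) for _ in range(DIM)]


def build_conditions(seed: int = 0) -> Dict[str, Callable[[bytes], List[Vector]]]:
    """Map each control condition to a chunk encoder."""
    permuted = make_permuted_code(seed)
    return {
        "bpe_hsl_stream": lambda chunk: stream_chunk(chunk, stand_in_code),
        "bpe_permuted_hsl": lambda chunk: stream_chunk(chunk, permuted),
        "bpe_random_fixed": lambda chunk: [random_fixed_vector(chunk, seed)],
    }
```

All three encoders are parameter-free and share the same segmentation, so any gap between them in downstream quality would come from the input code rather than from chunking. The learned-embedding baseline is left out because it needs a trainable table.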

The key comparison is not only:

BPE-HSL vs raw byte HSL

but also:

BPE-HSL vs BPE learned embeddings

because that is the cleanest way to test the “embedding-table-free” part.
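
To make the comparisons explicit, the contrasts can be written as simple differences between conditions. This is only a sketch: the dictionary keys are hypothetical condition names, the metric is assumed to be higher-is-better, and no results are implied.

```python
from typing import Dict


def contrasts(score: Dict[str, float]) -> Dict[str, float]:
    """Differences between conditions for a higher-is-better metric."""
    return {
        # Does adding segmentation help the fixed substrate at all?
        "segmentation_effect": score["bpe_hsl"] - score["raw_byte_hsl"],
        # What is lost by avoiding a learned embedding table?
        "embedding_table_gap": score["bpe_learned_embedding"] - score["bpe_hsl"],
        # Does the code geometry still matter once chunks exist?
        "geometry_effect": score["bpe_hsl"] - score["bpe_permuted_hsl"],
        # How much of the gain does segmentation explain on its own?
        "segmentation_alone": score["bpe_random_fixed"] - score["raw_byte_hsl"],
    }
```

A large `segmentation_effect` with a `geometry_effect` near zero would point toward segmentation doing the work, which is the case the text above warns about.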

4. HoLo-ToLk and BPE-HSL can stay related but distinct

The branches can share the same substrate story without sharing the same public label.

Branch	Claim
HoLo-ToLk STT	HSL substrate needs spectral lens for speech recognition
HoLo-ToLk TTS	HSL text front end is feasible, but free-run acoustic generation needs a failure map
BPE-HSL text generation	text generation may need a segmentation lens, while still avoiding learned embedding tables

That keeps the speech claim and the text-generation claim from interfering with each other.

The phrase “tokenizer-free speech” can still describe the HoLo-ToLk STT/TTS demo if the text path is UTF-8 bytes and the STT path outputs chars/bytes. Once BPE is added for text generation, I would give that branch its own label.

Something like:

HoLo-BPE-HSL: embedding-table-free text generation over HSL-coded subword chunks

A bit long, but very clear.

5. The updated wording can stay narrow and testable

A scoped version of the claims might be:

Claim	Safer status
fixed HSL substrate	yes
no learned embedding table	plausible branch claim
tokenizer-free	only for branches that actually avoid segmentation/vocab
no learned interface anywhere	probably too broad
path-specific lens map	increasingly useful
BPE for text generation	segmentation-lens branch

That makes the route more testable.

Instead of one global claim, each branch gets a measurable boundary:

Branch	Main question
STT	Does spectral lens + HSL beat mel in the same controlled setup?
TTS	Does free-run cover, stop, and preserve phonetic content?
BPE-HSL text	Does segmentation help without needing a learned embedding table?
Memory / grounding	Does a small learned interface recover evidence use?

Bottom line

I would frame the new route like this:

BPE-HSL is the segmentation-lens branch. It has a vocabulary/segmentation step, but it can still test the more interesting constraint: no learned embedding table for the vocabulary items.

That fits the larger pattern: