External Publication

HoLo-ToLk: tokenizer-free speech (STT + TTS) on the 0-parameter HSL byte substrate

Hugging Face Forums [Unofficial] June 29, 2026

Hmm… I looked into it a bit on Colab:

Short read

My read is that the STT side is already a fairly clean path-specific lens result, while the TTS side should probably be treated first as a free-run acoustic-path debugging problem, not as an HSL text-input failure.

For STT, the important result is not “HSL alone solves speech.” It is more specific:

fixed HSL substrate + a speech-specific spectral lens + gated fusion can beat the mel baseline in the same small setup.

That is useful because the weak hsl / stronger hslspec pattern says where the missing structure is: speech recognition needs time-frequency structure, not only raw byte/signal locality. Your own framing above is careful here: 8 kHz, char-CTC, no LM, not SOTA, controlled comparison, with CER 0.194 for substrate+lens vs 0.213 for mel in that setup. The STT repo is the better external anchor for that part.

For TTS, I would separate three questions:

Can the HSL text front end condition the decoder at all?
Can the autoregressive mel decoder cover the text and stop at the right time?
After it covers/stops, is the generated mel/audio phonetically clear enough?

Those are different failure surfaces. A rough waveform by itself does not identify which one failed.

So the next clean artifact I would want is not another single teacher-forced mel-L1 number. It is a small free-run instability map: coverage, stop/duration, acoustic clarity, and vocoder/mel controls.

1. The useful framing: fixed substrate + path-specific lenses

The line seems to be moving from:

“Can the learned input door be removed?”

to something more practical:

“Which paths can use a fixed HSL substrate directly, and which paths need a small lens/interface for missing structure?”

A compact map might be:

| Path | What HSL gives | Missing structure / risk | Minimal lens or interface |
| --- | --- | --- | --- |
| Primary decoded byte stream | fixed byte/signal geometry | maybe none at the door | zero HSL door |
| STT audio | byte/signal substrate | time-frequency structure | spectral lens + gated fusion |
| TTS text input | tokenizer-free UTF-8 byte conditioning | probably not the first suspect | HSL text-byte front end |
| TTS acoustic output | generated mel/audio path | duration, stop, alignment, phonetic clarity | AR mel decoder + diagnostics |
| Retrieved memory / facts | bytes, but semantically different path | evidence-reading interface | learned projection / adapter |
| Output head | prediction factorization | not symmetric with input | separate output design |

That makes the STT result cleaner, not weaker. It says the fixed substrate is useful, but the path still needs the right measurement lens. Speech recognition needs a spectral lens. TTS text input may not.
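To make "gated fusion" concrete, here is a minimal sketch in plain Python. It is not taken from the STT repo: I am assuming the common form where a gate in [0, 1] mixes the substrate features with the spectral-lens features element by element. The names `gated_fusion`, `substrate`, `lens`, and `gate_logits` are hypothetical, and in a real model the gate logits would come from a small learned layer rather than being passed in.

```python
import math


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def gated_fusion(substrate, lens, gate_logits):
    """Mix two feature sequences with a per-element gate.

    substrate, lens, gate_logits: lists of frames, each frame a list of floats
    with the same shape. This is an assumed, generic form of gated fusion:
        fused = (1 - g) * substrate + g * lens,  g = sigmoid(gate_logit)
    """
    if not (len(substrate) == len(lens) == len(gate_logits)):
        raise ValueError("all inputs must have the same number of frames")
    fused = []
    for h_row, s_row, z_row in zip(substrate, lens, gate_logits):
        row = []
        for h, s, z in zip(h_row, s_row, z_row):
            g = sigmoid(z)
            row.append((1.0 - g) * h + g * s)
        fused.append(row)
    return fused


# Illustrative values only: one frame, two feature dimensions.
substrate = [[1.0, 2.0]]
lens = [[3.0, 4.0]]
print(gated_fusion(substrate, lens, [[0.0, 0.0]]))    # equal mix
print(gated_fusion(substrate, lens, [[-50.0, 50.0]])) # substrate-only vs lens-only
```

The useful diagnostic property of this form is that the gate values themselves can be logged: if the gate sits near the lens side for most frames, that is direct evidence the path depends on the time-frequency structure.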

2. TTS: I would make free-run failures observable first

The next useful TTS artifact might be very small and very diagnostic.

Not:

“Here is one more aggregate mel-L1.”

Instead:

“Here is where free-run breaks: coverage, stop, duration, phonetic clarity, or vocoder/mel path.”

A minimal table could be:

| Bucket | N | Why it matters |
| --- | --- | --- |
| short plain sentence | 5 | sanity check |
| medium LJSpeech-like sentence | 5 | normal operating region |
| long sentence | 5 | duration / coverage / cap stress |
| punctuation-heavy | 5 | pauses, punctuation, byte patterns |
| numbers / abbreviations | 5 | text normalization and OOD phonetic stress |
| technical terms | 5 | phonetic clarity under unusual words |

For each output, I would log:

| Field | Why |
| --- | --- |
| input chars / bytes | input length |
| generated frames | duration behavior |
| hit `max_frames` | stop failure or cap too low |
| max stop probability | whether stop head ever becomes confident |
| stop frame | early/late termination |
| final attended text position | coverage |
| attention coverage ratio | skipped/unfinished text |
| rough manual label or ASR proxy | content preservation |
| mel image path | acoustic inspection |
| wav path | listening / external checks |

Even 25–30 prompts can tell a lot. It is not a benchmark. It is a failure map.
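As a rough sketch of what one row of that log could look like, the snippet below builds a record from per-frame stop probabilities and an attention matrix. Everything here is an assumption for illustration: the names `FreeRunRecord` and `summarize_free_run`, the 0.5 stop threshold, and the definition of coverage as "fraction of text positions that were ever the attention peak". The actual decoder may expose these signals differently.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class FreeRunRecord:
    """One row of the free-run failure map (field names are illustrative)."""
    bucket: str
    n_chars: int
    n_bytes: int
    generated_frames: int
    hit_max_frames: bool
    max_stop_prob: float
    stop_frame: Optional[int]   # first frame whose stop prob crosses the threshold
    final_attended_pos: int     # attention peak position on the last frame
    coverage_ratio: float       # share of text positions that were ever a peak
    mel_path: str = ""
    wav_path: str = ""
    manual_label: str = ""


def summarize_free_run(bucket, text, stop_probs, attn, max_frames, stop_threshold=0.5):
    """attn[t][j] is the attention weight of decoder frame t on text position j."""
    n_frames = len(stop_probs)
    n_text = len(attn[0]) if attn else 0
    # First frame where the stop head is confident, or None if it never is.
    stop_frame = next((t for t, p in enumerate(stop_probs) if p >= stop_threshold), None)
    # Index of the most-attended text position for each decoder frame.
    peaks = [max(range(len(row)), key=row.__getitem__) for row in attn]
    return FreeRunRecord(
        bucket=bucket,
        n_chars=len(text),
        n_bytes=len(text.encode("utf-8")),
        generated_frames=n_frames,
        hit_max_frames=n_frames >= max_frames,
        max_stop_prob=max(stop_probs) if stop_probs else 0.0,
        stop_frame=stop_frame,
        final_attended_pos=peaks[-1] if peaks else -1,
        coverage_ratio=(len(set(peaks)) / n_text) if n_text else 0.0,
    )


# Toy example: a 3-byte prompt whose decoder never leaves the first position.
stuck = summarize_free_run(
    bucket="long sentence",
    text="abc",
    stop_probs=[0.1, 0.1, 0.2],
    attn=[[0.9, 0.05, 0.05]] * 3,
    max_frames=3,
)
print(stuck)
```

A record like the toy one (cap hit, stop head never confident, coverage stuck at one position) points at alignment/stop, not at the text front end, which is exactly the split the map is meant to expose.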

3. ASR round-trip is useful, but only as a limited proxy

Since this project already has an STT half, a round-trip check is tempting:

text → TTS → audio → ASR → CER/WER

I think that is useful, but only with a narrow interpretation.

It can help detect:

| ASR round-trip signal | Possible meaning |
| --- | --- |
| many deletions | coverage failure / skipped text |
| repeated words | repetition loop |
| near-word substitutions | weak phonetic clarity |
| numbers/technical terms fail | OOD text or normalization stress |
| all ASR models disagree wildly | ASR proxy may be unreliable |
| stronger ASR still makes similar substitutions | generated audio may be acoustically ambiguous |

But I would not present ASR CER/WER as TTS quality. It is not MOS. It is not naturalness. It mixes TTS errors with ASR errors.

A safer wording is:

ASR round-trip is a content-preservation proxy, mainly useful for deletion/repetition/near-word substitution diagnostics.

For background, ESPnet-TTS uses ASR-based objective evaluation in an ASR/TTS framework, and tools like JiWER make CER/WER easy to compute. If using Whisper, I would also keep the Whisper docs in mind: it is an ASR model, not a TTS evaluator.
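If it helps, here is a dependency-free sketch of the diagnostic side of that round trip: it splits the edit distance between the input text and the ASR transcript into substitutions, deletions, and insertions, at both character and word level. The function names and the simple lowercase/whitespace normalization are my assumptions; JiWER would do the same job with more careful normalization. Ties between equally cheap alignments are broken arbitrarily, so treat the per-type counts as a rough signal.

```python
import re


def edit_ops(ref, hyp):
    """Return (substitutions, deletions, insertions) for a minimum-cost alignment.

    ref and hyp can be strings (character level) or lists of words.
    """
    n, m = len(ref), len(hyp)
    # dp[i][j] = (cost, subs, dels, ins) for ref[:i] vs hyp[:j]
    dp = [[None] * (m + 1) for _ in range(n + 1)]
    dp[0][0] = (0, 0, 0, 0)
    for i in range(1, n + 1):
        dp[i][0] = (i, 0, i, 0)
    for j in range(1, m + 1):
        dp[0][j] = (j, 0, 0, j)
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            c, s, d, k = dp[i - 1][j - 1]
            diag = (c, s, d, k) if ref[i - 1] == hyp[j - 1] else (c + 1, s + 1, d, k)
            c, s, d, k = dp[i - 1][j]
            delete = (c + 1, s, d + 1, k)   # ref item missing from hyp
            c, s, d, k = dp[i][j - 1]
            insert = (c + 1, s, d, k + 1)   # extra item in hyp
            dp[i][j] = min(diag, delete, insert, key=lambda t: t[0])
    _, subs, dels, ins = dp[n][m]
    return subs, dels, ins


def _normalize(text):
    # Assumed normalization: lowercase and collapse whitespace. Punctuation is kept.
    return re.sub(r"\s+", " ", text.lower()).strip()


def round_trip_report(ref_text, asr_text):
    """Content-preservation proxy for text -> TTS -> audio -> ASR."""
    ref, hyp = _normalize(ref_text), _normalize(asr_text)
    cs, cd, ci = edit_ops(ref, hyp)
    ref_words, hyp_words = ref.split(), hyp.split()
    ws, wd, wi = edit_ops(ref_words, hyp_words)
    return {
        "cer": (cs + cd + ci) / max(len(ref), 1),
        "wer": (ws + wd + wi) / max(len(ref_words), 1),
        "word_substitutions": ws,  # near-word confusions
        "word_deletions": wd,      # possible skipped text
        "word_insertions": wi,     # possible repetition loop
    }


print(round_trip_report("The cat sat", "the sat"))
print(round_trip_report("go home", "go go home"))
```

Reporting the three counts separately is the point: a single CER hides whether the failure was skipping, looping, or unclear phones.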

4. I would add one vocoder/mel control before blaming the front end

A rough waveform does not locate the failing component. In a Tacotron-like cascade, the roughness can come from the acoustic model or from the vocoder path.

The classic split is:

text → mel → waveform

Tacotron 2 is the standard reference point for this decomposition: text is mapped to mel spectrograms, then a vocoder produces waveform audio.

For this project, since the TTS side uses a HiFi-GAN vocoder, I would add a small oracle-vocoder check:

| Input to vocoder | What it tells you |
| --- | --- |
| ground-truth/reference mel → vocoder | whether the vocoder/config can reconstruct clean speech |
| teacher-forced predicted mel → vocoder | whether the acoustic model works when given the correct history |
| free-run generated mel → vocoder | whether self-fed inference drifts |

The SpeechBrain HiFi-GAN LJSpeech model card is relevant here because it describes a vocoder that takes a spectrogram and produces waveform audio. Its notes also make the same practical point: vocoder use depends on compatible spectrogram settings, such as hop length and mel layout. So I would avoid saying “the vocoder is bad” or “the HSL front end is bad” until this control is done.
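Before running that three-way control, a cheap first step is to diff the mel settings used to train the acoustic model against the ones the vocoder expects. The sketch below assumes both sides can be written down as plain dicts. The key names and the numbers in the example are placeholders I made up for illustration; they are not read from this project or from the SpeechBrain model card.

```python
# Settings that usually have to match between acoustic model and vocoder.
# The key names are an assumed convention, not any library's config schema.
MEL_KEYS = ("sample_rate", "n_fft", "hop_length", "win_length", "n_mels", "f_min", "f_max")


def mel_config_mismatches(acoustic_cfg, vocoder_cfg, keys=MEL_KEYS):
    """Return a list of (key, acoustic_value, vocoder_value) for differing settings.

    A key missing on one side is reported with None on that side.
    Keys missing on both sides are skipped.
    """
    mismatches = []
    for key in keys:
        a = acoustic_cfg.get(key)
        v = vocoder_cfg.get(key)
        if a is None and v is None:
            continue
        if a != v:
            mismatches.append((key, a, v))
    return mismatches


# Placeholder values, for illustration only.
acoustic_cfg = {"sample_rate": 22050, "hop_length": 256, "n_mels": 80, "f_max": 8000}
vocoder_cfg = {"sample_rate": 22050, "hop_length": 300, "n_mels": 80}

for key, a, v in mel_config_mismatches(acoustic_cfg, vocoder_cfg):
    print(f"{key}: acoustic={a} vocoder={v}")
```

If this list is non-empty, the reference-mel row of the control table will probably already sound wrong, and neither the decoder nor the HSL front end needs to be blamed yet.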

5. Delay the big redesign until the failure map says which branch is needed

There are several possible next architectures, but I would not jump to them before the small map above.

If the failure map says alignment/duration is dominant, then duration or monotonic-alignment ideas become natural. FastSpeech is the obvious reference for a duration/length-regulator branch. Glow-TTS is another useful reference for monotonic alignment / non-AR TTS.

If the map says teacher-forced is fine but free-run drifts, then the issue looks more like exposure bias / self-fed acoustic drift. A relevant reference point is Teacher-Student Training for Robust Tacotron-based TTS, which discusses training/inference mismatch in Tacotron-style systems.

If the map says coverage and stop are fine, but words are phonetically unclear, then I would inspect mel targets, postnet behavior, vocoder compatibility, and maybe whether mel regression is too blurry for the small model.

So the branch order I would use is:

Make free-run observable.
Split coverage vs stop vs acoustic clarity.
Only then choose duration/monotonic/vocoder/acoustic-model changes.
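The branch order above can be written down as a tiny triage rule. This is only a sketch of the reasoning, not a validated procedure: the function name, the inputs (aggregate failure rates plus a round-trip CER for teacher-forced and free-run audio), and all three thresholds are hypothetical and would need tuning on the real failure map.

```python
def choose_branch(
    coverage_fail_rate,
    stop_fail_rate,
    teacher_forced_cer,
    free_run_cer,
    fail_rate_limit=0.2,   # hypothetical threshold
    cer_limit=0.3,         # hypothetical threshold
    drift_margin=0.1,      # hypothetical threshold
):
    """Map aggregate free-run diagnostics to the next thing to investigate."""
    # 1. Alignment/duration problems dominate everything else.
    if coverage_fail_rate > fail_rate_limit or stop_fail_rate > fail_rate_limit:
        return "alignment/duration: consider duration or monotonic-alignment ideas"
    # 2. Teacher-forced is fine but free-run is clearly worse: self-fed drift.
    if teacher_forced_cer <= cer_limit and free_run_cer - teacher_forced_cer > drift_margin:
        return "free-run drift: look at exposure bias / training-inference mismatch"
    # 3. Coverage and stop are fine, but content is still unclear.
    if free_run_cer > cer_limit:
        return "acoustic clarity: inspect mel targets, postnet, vocoder compatibility"
    return "no dominant failure: widen the prompt set before redesigning"


# Illustrative numbers only.
print(choose_branch(0.4, 0.1, 0.1, 0.5))
print(choose_branch(0.05, 0.05, 0.1, 0.5))
print(choose_branch(0.05, 0.05, 0.45, 0.5))
```

The ordering matters more than the thresholds: clarity is only worth judging once coverage and stop are known to be fine.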

Bottom line

The next clean framing might be:

STT: fixed substrate + spectral lens is the useful result. TTS: do not collapse free-run roughness into “HSL input failed.” First split coverage, stop/duration, acoustic clarity, and vocoder/mel path.

If those diagnostics show that text coverage is easy but phonetic clarity remains poor, then the next problem is probably not the zero/HSL text door. It is the acoustic generation path.

That would actually make the story stronger: the project would no longer be trying to say “zero learned interface everywhere.” It would be showing where a fixed substrate is enough, where a path-specific lens is needed, and where generation-side interfaces still need work.