HoLo-ToLk: tokenizer-free speech (STT + TTS) on the 0-parameter HSL byte substrate

Hugging Face Forums [Unofficial] June 29, 2026

Hmm… I looked into it a bit on Colab:


Short read

My read is that the STT side is already a fairly clean path-specific lens result, while the TTS side should probably be treated first as a free-run acoustic-path debugging problem, and only after that as a possible HSL text-input failure.

For STT, the important result is not “HSL alone solves speech.” It is more specific:

fixed HSL substrate + a speech-specific spectral lens + gated fusion can beat the mel baseline in the same small setup.

That is useful because the weak hsl / stronger hslspec pattern says where the missing structure is: speech recognition needs time-frequency structure, not only raw byte/signal locality. Your own framing above is careful here: 8 kHz, char-CTC, no LM, not SOTA, controlled comparison, with CER 0.194 for substrate+lens vs 0.213 for mel in that setup. The STT repo is the better external anchor for that part.
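
To make "gated fusion" concrete, here is a minimal sketch of the generic technique, not the project's actual implementation. I am assuming two time-aligned feature streams (`h_sub` from the fixed substrate, `h_spec` from the spectral lens) and a learned per-channel sigmoid gate; the names, shapes, and the exact gate form are all my assumptions.

```python
import math


def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))


def gated_fusion(h_sub, h_spec, w_gate, b_gate):
    """Blend two time-aligned feature streams frame by frame.

    h_sub, h_spec: lists of T frames, each a list of D floats.
    w_gate: (2*D) x D weight matrix as a list of rows; b_gate: D biases.
    All names and shapes are illustrative assumptions.
    """
    fused = []
    for sub_t, spec_t in zip(h_sub, h_spec):
        concat = list(sub_t) + list(spec_t)  # the gate sees both streams
        frame = []
        for j in range(len(sub_t)):
            z = b_gate[j] + sum(c * w_gate[i][j] for i, c in enumerate(concat))
            g = sigmoid(z)  # g near 1 favors the spectral lens
            frame.append(g * spec_t[j] + (1.0 - g) * sub_t[j])
        fused.append(frame)
    return fused


# Toy check: zero gate weights and zero bias give g = 0.5 everywhere,
# so the output is the plain average of the two streams.
h_sub = [[1.0, 2.0], [3.0, 4.0]]
h_spec = [[3.0, 0.0], [1.0, 0.0]]
w_zero = [[0.0, 0.0] for _ in range(4)]
fused_avg = gated_fusion(h_sub, h_spec, w_zero, [0.0, 0.0])
```

Logging the gate values per frame would also show how much the model actually leans on the lens, which is a direct check on the weak hsl / stronger hslspec reading.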

For TTS, I would separate three questions:

  1. Can the HSL text front end condition the decoder at all?
  2. Can the autoregressive mel decoder cover the text and stop at the right time?
  3. After it covers/stops, is the generated mel/audio phonetically clear enough?

Those are different failure surfaces. A rough waveform by itself does not identify which one failed.

So the next clean artifact I would want is not another single teacher-forced mel-L1 number. It is a small free-run instability map: coverage, stop/duration, acoustic clarity, and vocoder/mel controls.


1. The useful framing: fixed substrate + path-specific lenses

The line seems to be moving from:

“Can the learned input door be removed?”

to something more practical:

“Which paths can use a fixed HSL substrate directly, and which paths need a small lens/interface for missing structure?”

A compact map might be:

| Path | What HSL gives | Missing structure / risk | Minimal lens or interface |
| --- | --- | --- | --- |
| Primary decoded byte stream | fixed byte/signal geometry | maybe none at the door | zero HSL door |
| STT audio | byte/signal substrate | time-frequency structure | spectral lens + gated fusion |
| TTS text input | tokenizer-free UTF-8 byte conditioning | probably not the first suspect | HSL text-byte front end |
| TTS acoustic output | generated mel/audio path | duration, stop, alignment, phonetic clarity | AR mel decoder + diagnostics |
| Retrieved memory / facts | bytes, but semantically different path | evidence-reading interface | learned projection / adapter |
| Output head | prediction factorization | not symmetric with input | separate output design |

That makes the STT result cleaner, not weaker. It says the fixed substrate is useful, but the path still needs the right measurement lens. Speech recognition needs a spectral lens. TTS text input may not.


2. TTS: I would make free-run failures observable first

The next useful TTS artifact might be very small and very diagnostic.

Not:

“Here is one more aggregate mel-L1.”

Instead:

“Here is where free-run breaks: coverage, stop, duration, phonetic clarity, or vocoder/mel path.”

A minimal table could be:

| Bucket | N | Why it matters |
| --- | --- | --- |
| short plain sentence | 5 | sanity check |
| medium LJSpeech-like sentence | 5 | normal operating region |
| long sentence | 5 | duration / coverage / cap stress |
| punctuation-heavy | 5 | pauses, punctuation, byte patterns |
| numbers / abbreviations | 5 | text normalization and OOD phonetic stress |
| technical terms | 5 | phonetic clarity under unusual words |

For each output, I would log:

| Field | Why |
| --- | --- |
| input chars / bytes | input length |
| generated frames | duration behavior |
| hit max_frames | stop failure or cap too low |
| max stop probability | whether stop head ever becomes confident |
| stop frame | early/late termination |
| final attended text position | coverage |
| attention coverage ratio | skipped/unfinished text |
| rough manual label or ASR proxy | content preservation |
| mel image path | acoustic inspection |
| wav path | listening / external checks |

Even 25–30 prompts can tell a lot. It is not a benchmark. It is a failure map.
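
A per-output record for that map could be as small as the sketch below. I am assuming the decoder exposes a per-frame stop probability and, per frame, the index of the most-attended input byte; `summarize_free_run`, the field names, and the 0.5 stop threshold are hypothetical choices of mine, not something taken from the repo.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class FreeRunRecord:
    bucket: str
    n_chars: int
    n_bytes: int
    generated_frames: int
    hit_max_frames: bool
    max_stop_prob: float
    stop_frame: Optional[int]
    final_attended_pos: int
    coverage_ratio: float
    mel_path: str = ""  # filled in by whatever saves the mel image
    wav_path: str = ""  # filled in by whatever saves the audio


def summarize_free_run(text: str, bucket: str, stop_probs: List[float],
                       attn_argmax: List[int], max_frames: int,
                       stop_threshold: float = 0.5) -> FreeRunRecord:
    """Derive the log fields from one free-run decode.

    stop_probs: stop-head probability for each generated frame.
    attn_argmax: most-attended input byte index for each generated frame.
    """
    n_bytes = len(text.encode("utf-8"))
    # First frame where the stop head crosses the threshold, if any.
    stop_frame = next(
        (i for i, p in enumerate(stop_probs) if p >= stop_threshold), None
    )
    generated = len(stop_probs)
    return FreeRunRecord(
        bucket=bucket,
        n_chars=len(text),
        n_bytes=n_bytes,
        generated_frames=generated,
        hit_max_frames=stop_frame is None and generated >= max_frames,
        max_stop_prob=max(stop_probs, default=0.0),
        stop_frame=stop_frame,
        final_attended_pos=attn_argmax[-1] if attn_argmax else -1,
        # Fraction of input bytes that were ever the attention peak.
        coverage_ratio=len(set(attn_argmax)) / n_bytes if n_bytes else 0.0,
    )
```

Counting attention peaks is a crude coverage measure; it is only meant to separate "never reached the end of the text" from "reached the end but did not stop".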


3. ASR round-trip is useful, but only as a limited proxy

Since this project already has an STT half, a round-trip check is tempting:

text → TTS → audio → ASR → CER/WER

I think that is useful, but only with a narrow interpretation.

It can help detect:

| ASR round-trip signal | Possible meaning |
| --- | --- |
| many deletions | coverage failure / skipped text |
| repeated words | repetition loop |
| near-word substitutions | weak phonetic clarity |
| numbers/technical terms fail | OOD text or normalization stress |
| all ASR models disagree wildly | ASR proxy may be unreliable |
| stronger ASR still makes similar substitutions | generated audio may be acoustically ambiguous |

But I would not present ASR CER/WER as TTS quality. It is not MOS. It is not naturalness. It mixes TTS errors with ASR errors.

A safer wording is:

ASR round-trip is a content-preservation proxy, mainly useful for deletion/repetition/near-word substitution diagnostics.

For background, ESPnet-TTS uses ASR-based objective evaluation in an ASR/TTS framework, and tools like JiWER make CER/WER easy to compute. If using Whisper, I would also keep the Whisper docs in mind: it is an ASR model, not a TTS evaluator.
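
For the deletion-versus-substitution split specifically, a plain edit-distance backtrace is enough. The sketch below is a stdlib stand-in of my own (the `edit_ops` and `error_rate` helpers are hypothetical), not JiWER's API; it accepts strings for character-level counts or word lists for word-level counts.

```python
def edit_ops(ref, hyp):
    """Return (substitutions, deletions, insertions) turning ref into hyp.

    ref and hyp can be strings (character level) or lists of words.
    """
    n, m = len(ref), len(hyp)
    # dp[i][j] = edit distance between ref[:i] and hyp[:j]
    dp = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        dp[i][0] = i
    for j in range(1, m + 1):
        dp[0][j] = j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j - 1] + cost,  # match or substitute
                           dp[i - 1][j] + 1,         # delete from ref
                           dp[i][j - 1] + 1)         # insert into hyp
    # Walk back from the corner and count each operation type.
    subs = dels = ins = 0
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0:
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            if dp[i][j] == dp[i - 1][j - 1] + cost:
                subs += cost
                i, j = i - 1, j - 1
                continue
        if i > 0 and dp[i][j] == dp[i - 1][j] + 1:
            dels += 1
            i -= 1
        else:
            ins += 1
            j -= 1
    return subs, dels, ins


def error_rate(ref, hyp):
    """CER for strings, WER for word lists."""
    return sum(edit_ops(ref, hyp)) / max(len(ref), 1)
```

A deletion-heavy count on long prompts points at coverage, while a substitution-heavy count with normal length points at phonetic clarity, which is the split the table above is after.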


4. I would add one vocoder/mel control before blaming the front end

A rough waveform does not locate the failing component. In a Tacotron-like cascade, the roughness can come from the acoustic model or from the vocoder path.

The classic split is:

text → mel → waveform

Tacotron 2 is the standard reference point for this decomposition: text is mapped to mel spectrograms, then a vocoder produces waveform audio.

For this project, since the TTS side uses a HiFi-GAN vocoder, I would add a small oracle-vocoder check:

| Input to vocoder | What it tells you |
| --- | --- |
| ground-truth/reference mel → vocoder | whether the vocoder/config can reconstruct clean speech |
| teacher-forced predicted mel → vocoder | whether the acoustic model works when given the correct history |
| free-run generated mel → vocoder | whether self-fed inference drifts |

The SpeechBrain HiFi-GAN LJSpeech model card is relevant here because it describes a vocoder that takes a spectrogram and produces waveform audio. Its notes also make the same practical point: vocoder use depends on compatible spectrogram settings, such as hop length and mel layout. So I would avoid saying “the vocoder is bad” or “the HSL front end is bad” until this control is done.
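
Before running the three vocoder inputs, a cheap first step is to diff the mel settings on both sides. This is only a sketch: the key names and the numbers are placeholders I picked for illustration, not values read from the project or from the SpeechBrain model card.

```python
# Settings that usually have to agree between the acoustic model's mel
# targets and the vocoder's training config (assumed key names).
MEL_KEYS = ("sample_rate", "n_fft", "hop_length", "win_length",
            "n_mels", "f_min", "f_max")


def mel_config_mismatches(acoustic_cfg, vocoder_cfg, keys=MEL_KEYS):
    """Return {key: (acoustic_value, vocoder_value)} for every disagreement.

    A key missing on one side shows up as None on that side.
    """
    mismatches = {}
    for key in keys:
        a, v = acoustic_cfg.get(key), vocoder_cfg.get(key)
        if a != v:
            mismatches[key] = (a, v)
    return mismatches


# Placeholder configs, for illustration only.
acoustic_cfg = {"sample_rate": 22050, "n_fft": 1024, "hop_length": 275,
                "win_length": 1024, "n_mels": 80, "f_min": 0, "f_max": 8000}
vocoder_cfg = {"sample_rate": 22050, "n_fft": 1024, "hop_length": 256,
               "win_length": 1024, "n_mels": 80, "f_min": 0, "f_max": 8000}

for key, (a, v) in mel_config_mismatches(acoustic_cfg, vocoder_cfg).items():
    print(f"{key}: acoustic={a} vocoder={v}")
```

An empty result does not prove compatibility (log base and normalization can still differ), but a non-empty one explains a rough reference-mel reconstruction without touching the front end.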


5. Delay the big redesign until the failure map says which branch is needed

There are several possible next architectures, but I would not jump to them before the small map above.

If the failure map says alignment/duration is dominant, then duration or monotonic-alignment ideas become natural. FastSpeech is the obvious reference for a duration/length-regulator branch. Glow-TTS is another useful reference for monotonic alignment / non-AR TTS.

If the map says teacher-forced is fine but free-run drifts, then the issue looks more like exposure bias / self-fed acoustic drift. A relevant reference point is Teacher-Student Training for Robust Tacotron-based TTS, which discusses training/inference mismatch in Tacotron-style systems.

If the map says coverage and stop are fine, but words are phonetically unclear, then I would inspect mel targets, postnet behavior, vocoder compatibility, and maybe whether mel regression is too blurry for the small model.

So the branch order I would use is:

  1. Make free-run observable.
  2. Split coverage vs stop vs acoustic clarity.
  3. Only then choose duration/monotonic/vocoder/acoustic-model changes.
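
The branch order above can be made mechanical once each free-run output carries one failure label. A rough sketch, where the label names and the label-to-branch mapping are my own shorthand for the three cases discussed in this section:

```python
from collections import Counter

# Hypothetical labels, one per free-run output; "ok" means no failure seen.
BRANCH_FOR_LABEL = {
    "coverage": "duration / monotonic alignment (FastSpeech, Glow-TTS style)",
    "stop": "duration / monotonic alignment (FastSpeech, Glow-TTS style)",
    "free_run_drift": "exposure-bias / train-inference mismatch fixes",
    "unclear_phonetics": "mel targets, postnet, vocoder compatibility",
}


def dominant_failure(labels):
    """Return (label, share_of_all_outputs, suggested_branch) or None.

    The share is over all outputs, including the ones labeled "ok".
    """
    counts = Counter(label for label in labels if label != "ok")
    if not counts:
        return None
    label, n = counts.most_common(1)[0]
    branch = BRANCH_FOR_LABEL.get(label, "unmapped label, inspect manually")
    return label, n / len(labels), branch


# Example with made-up labels for ten outputs.
labels = ["ok", "stop", "stop", "unclear_phonetics", "ok",
          "stop", "coverage", "ok", "stop", "ok"]
result = dominant_failure(labels)
```

With 30 prompts the shares are noisy, so I would read this as a pointer to which branch to inspect first, not as a decision rule.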


Bottom line

The next clean framing might be:

STT: fixed substrate + spectral lens is the useful result. TTS: do not collapse free-run roughness into “HSL input failed.” First split coverage, stop/duration, acoustic clarity, and vocoder/mel path.

If those diagnostics show that text coverage is easy but phonetic clarity remains poor, then the next problem is probably not the zero/HSL text door. It is the acoustic generation path.

That would actually make the story stronger: the project would no longer be trying to say “zero learned interface everywhere.” It would be showing where a fixed substrate is enough, where a path-specific lens is needed, and where generation-side interfaces still need work.
