HoLo-ToLk: tokenizer-free speech (STT + TTS) on the 0-parameter HSL byte substrate
Hmm… I looked into it a bit on Colab:
Short read
My read is that the STT side is already a fairly clean path-specific lens result, while the TTS side should probably be treated first as a free-run acoustic-path debugging problem, not as an HSL text-input failure.
For STT, the important result is not “HSL alone solves speech.” It is more specific:
fixed HSL substrate + a speech-specific spectral lens + gated fusion can beat the mel baseline in the same small setup.
That is useful because the weak hsl / stronger hslspec pattern says where the missing structure is: speech recognition needs time-frequency structure, not only raw byte/signal locality. Your own framing above is careful here: 8 kHz, char-CTC, no LM, not SOTA, controlled comparison, with CER 0.194 for substrate+lens vs 0.213 for mel in that setup. The STT repo is the better external anchor for that part.
For TTS, I would separate three questions:
- Can the HSL text front end condition the decoder at all?
- Can the autoregressive mel decoder cover the text and stop at the right time?
- After it covers/stops, is the generated mel/audio phonetically clear enough?
Those are different failure surfaces. A rough waveform by itself does not identify which one failed.
So the next clean artifact I would want is not another single teacher-forced mel-L1 number. It is a small free-run instability map: coverage, stop/duration, acoustic clarity, and vocoder/mel controls.
1. The useful framing: fixed substrate + path-specific lenses
The line seems to be moving from:
“Can the learned input door be removed?”
to something more practical:
“Which paths can use a fixed HSL substrate directly, and which paths need a small lens/interface for missing structure?”
A compact map might be:
| Path | What HSL gives | Missing structure / risk | Minimal lens or interface |
|---|---|---|---|
| Primary decoded byte stream | fixed byte/signal geometry | maybe none at the door | zero HSL door |
| STT audio | byte/signal substrate | time-frequency structure | spectral lens + gated fusion |
| TTS text input | tokenizer-free UTF-8 byte conditioning | probably not the first suspect | HSL text-byte front end |
| TTS acoustic output | generated mel/audio path | duration, stop, alignment, phonetic clarity | AR mel decoder + diagnostics |
| Retrieved memory / facts | bytes, but semantically different path | evidence-reading interface | learned projection / adapter |
| Output head | prediction factorization | not symmetric with input | separate output design |
That makes the STT result cleaner, not weaker. It says the fixed substrate is useful, but the path still needs the right measurement lens. Speech recognition needs a spectral lens. TTS text input may not.
2. TTS: I would make free-run failures observable first
The next useful TTS artifact might be very small and very diagnostic.
Not:
“Here is one more aggregate mel-L1.”
Instead:
“Here is where free-run breaks: coverage, stop, duration, phonetic clarity, or vocoder/mel path.”
A minimal table could be:
| Bucket | N | Why it matters |
|---|---|---|
| short plain sentence | 5 | sanity check |
| medium LJSpeech-like sentence | 5 | normal operating region |
| long sentence | 5 | duration / coverage / cap stress |
| punctuation-heavy | 5 | pauses, punctuation, byte patterns |
| numbers / abbreviations | 5 | text normalization and OOD phonetic stress |
| technical terms | 5 | phonetic clarity under unusual words |
For each output, I would log:
| Field | Why |
|---|---|
| input chars / bytes | input length |
| generated frames | duration behavior |
| hit max_frames | stop failure or cap too low |
| max stop probability | whether stop head ever becomes confident |
| stop frame | early/late termination |
| final attended text position | coverage |
| attention coverage ratio | skipped/unfinished text |
| rough manual label or ASR proxy | content preservation |
| mel image path | acoustic inspection |
| wav path | listening / external checks |
Even 25–30 prompts can tell a lot. It is not a benchmark. It is a failure map.
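The log fields above could be collected per prompt with something like the following. This is only a minimal sketch: `FreeRunRecord`, `summarize_free_run`, the 0.5 stop threshold, and the argmax-based definition of coverage are all assumptions for illustration, not anything taken from the HoLo-ToLk code. It assumes the decoder exposes one attention row per generated frame over the input text bytes, plus one stop probability per frame.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class FreeRunRecord:
    """One row of the free-run failure map (hypothetical field names)."""
    bucket: str
    n_chars: int
    n_bytes: int
    generated_frames: int
    hit_max_frames: bool
    max_stop_prob: float
    stop_frame: Optional[int]
    final_attended_pos: Optional[int]
    coverage_ratio: float


def summarize_free_run(text: str, bucket: str,
                       attention: List[List[float]],
                       stop_probs: List[float],
                       max_frames: int,
                       stop_threshold: float = 0.5) -> FreeRunRecord:
    """Summarize one free-run output.

    attention: one row per generated frame, one column per text byte
    stop_probs: stop-head probability per generated frame
    """
    n_positions = len(attention[0]) if attention else 0
    # Text position with the largest attention weight at each frame.
    attended = [max(range(len(row)), key=row.__getitem__) for row in attention]
    # First frame where the stop head crosses the (assumed) threshold.
    stop_frame = next(
        (i for i, p in enumerate(stop_probs) if p >= stop_threshold), None)
    # Rough coverage: fraction of text positions that were ever the argmax.
    coverage = len(set(attended)) / n_positions if n_positions else 0.0
    return FreeRunRecord(
        bucket=bucket,
        n_chars=len(text),
        n_bytes=len(text.encode("utf-8")),
        generated_frames=len(attention),
        hit_max_frames=len(attention) >= max_frames,
        max_stop_prob=max(stop_probs) if stop_probs else 0.0,
        stop_frame=stop_frame,
        final_attended_pos=attended[-1] if attended else None,
        coverage_ratio=coverage,
    )
```

Argmax coverage is deliberately crude. It is enough to separate "never reached the end of the text" from "reached the end but never stopped", which is the split the map is for.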
3. ASR round-trip is useful, but only as a limited proxy
Since this project already has an STT half, a round-trip check is tempting:
text → TTS → audio → ASR → CER/WER
I think that is useful, but only with a narrow interpretation.
It can help detect:
| ASR round-trip signal | Possible meaning |
|---|---|
| many deletions | coverage failure / skipped text |
| repeated words | repetition loop |
| near-word substitutions | weak phonetic clarity |
| numbers/technical terms fail | OOD text or normalization stress |
| all ASR models disagree wildly | ASR proxy may be unreliable |
| stronger ASR still makes similar substitutions | generated audio may be acoustically ambiguous |
But I would not present ASR CER/WER as TTS quality. It is not MOS. It is not naturalness. It mixes TTS errors with ASR errors.
A safer wording is:
ASR round-trip is a content-preservation proxy, mainly useful for deletion/repetition/near-word substitution diagnostics.
For background, ESPnet-TTS uses ASR-based objective evaluation in an ASR/TTS framework, and tools like JiWER make CER/WER easy to compute. If using Whisper, I would also keep the Whisper docs in mind: it is an ASR model, not a TTS evaluator.
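To keep the round-trip diagnostic rather than a single score, it helps to split the edit distance into substitutions, deletions, and insertions. Here is a minimal pure-Python sketch; `edit_ops` and `error_rate` are hypothetical helper names, and in practice a library such as JiWER would do the same job. Pass strings for CER or lists of words for WER.

```python
def edit_ops(ref, hyp):
    """Return (substitutions, deletions, insertions) for one minimal
    edit script turning ref into hyp. Works on strings or word lists."""
    n, m = len(ref), len(hyp)
    # dp[i][j] = edit distance between ref[:i] and hyp[:j]
    dp = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        dp[i][0] = i
    for j in range(1, m + 1):
        dp[0][j] = j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j - 1] + cost,  # match / substitute
                           dp[i - 1][j] + 1,         # delete from ref
                           dp[i][j - 1] + 1)         # insert into hyp
    # Backtrace to count each operation type.
    subs = dels = ins = 0
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0:
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            if dp[i][j] == dp[i - 1][j - 1] + cost:
                subs += cost
                i, j = i - 1, j - 1
                continue
        if i > 0 and dp[i][j] == dp[i - 1][j] + 1:
            dels += 1
            i -= 1
        else:
            ins += 1
            j -= 1
    return subs, dels, ins


def error_rate(ref, hyp):
    """CER (strings) or WER (word lists): total edits / reference length."""
    subs, dels, ins = edit_ops(ref, hyp)
    return (subs + dels + ins) / max(len(ref), 1)
```

A deletion-heavy result points toward skipped text, while a substitution-heavy result points toward weak phonetic clarity. Either way the numbers still mix TTS errors with ASR errors.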
4. I would add one vocoder/mel control before blaming the front end
A rough waveform does not locate the failing component. In a Tacotron-like cascade, the degradation can come from either the acoustic model or the vocoder path.
The classic split is:
text → mel → waveform
Tacotron 2 is the standard reference point for this decomposition: text is mapped to mel spectrograms, then a vocoder produces waveform audio.
For this project, since the TTS side uses a HiFi-GAN vocoder, I would add a small oracle-vocoder check:
| Input to vocoder | What it tells you |
|---|---|
| ground-truth/reference mel → vocoder | whether the vocoder/config can reconstruct clean speech |
| teacher-forced predicted mel → vocoder | whether the acoustic model works when given the correct history |
| free-run generated mel → vocoder | whether self-fed inference drifts |
The SpeechBrain HiFi-GAN LJSpeech model card is relevant here because it describes a vocoder that takes a spectrogram and produces waveform audio. Its notes also make the same practical point: vocoder use depends on compatible spectrogram settings, such as hop length and mel layout. So I would avoid saying “the vocoder is bad” or “the HSL front end is bad” until this control is done.
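Before running the three vocoder inputs, a cheap first step is to compare the mel settings the acoustic model was trained on against the ones the vocoder expects. The sketch below assumes both sides can be described by plain dictionaries. The key names in `MEL_KEYS` and the example values are placeholders for illustration, not values taken from this project or from the SpeechBrain model card.

```python
# Hypothetical key names; use whatever the two configs actually expose.
MEL_KEYS = ("sample_rate", "n_fft", "hop_length", "win_length",
            "n_mels", "f_min", "f_max")


def mel_config_mismatches(acoustic_cfg, vocoder_cfg, keys=MEL_KEYS):
    """Return {key: (acoustic_value, vocoder_value)} for every setting
    that differs or is missing on one side."""
    mismatches = {}
    for key in keys:
        a, v = acoustic_cfg.get(key), vocoder_cfg.get(key)
        if a != v:
            mismatches[key] = (a, v)
    return mismatches


# Illustrative values only.
acoustic = {"sample_rate": 16000, "n_fft": 1024, "hop_length": 200,
            "win_length": 800, "n_mels": 80, "f_min": 0, "f_max": 8000}
vocoder = dict(acoustic, hop_length=256)

for key, (a, v) in mel_config_mismatches(acoustic, vocoder).items():
    print(f"{key}: acoustic model uses {a}, vocoder expects {v}")
```

If this reports any mismatch, even the ground-truth mel → vocoder row is likely to sound wrong, and the other two rows cannot be interpreted yet.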
5. Delay the big redesign until the failure map says which branch is needed
There are several possible next architectures, but I would not jump to them before the small map above.
If the failure map says alignment/duration is dominant, then duration or monotonic-alignment ideas become natural. FastSpeech is the obvious reference for a duration/length-regulator branch. Glow-TTS is another useful reference for monotonic alignment / non-AR TTS.
If the map says teacher-forced is fine but free-run drifts, then the issue looks more like exposure bias / self-fed acoustic drift. A relevant reference point is Teacher-Student Training for Robust Tacotron-based TTS, which discusses training/inference mismatch in Tacotron-style systems.
If the map says coverage and stop are fine, but words are phonetically unclear, then I would inspect mel targets, postnet behavior, vocoder compatibility, and maybe whether mel regression is too blurry for the small model.
So the branch order I would use is:
- Make free-run observable.
- Split coverage vs stop vs acoustic clarity.
- Only then choose duration/monotonic/vocoder/acoustic-model changes.
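As a rough illustration of that ordering, the per-prompt labels from the failure map can be tallied and the dominant failure mapped to a candidate branch. The label names and the `BRANCHES` mapping below are assumptions that simply restate the three cases above. This is a sketch for reading the map, not a decision rule from the project.

```python
from collections import Counter

# Hypothetical labels, one per free-run prompt, assigned by hand or by proxy.
BRANCHES = {
    "coverage": "duration / monotonic-alignment ideas",
    "stop": "duration / monotonic-alignment ideas",
    "free_run_drift": "exposure-bias / training-inference mismatch remedies",
    "clarity": "inspect mel targets, postnet, vocoder compatibility",
}


def dominant_failure(labels):
    """Return (label, count) for the most frequent non-'ok' label, or None."""
    failures = Counter(label for label in labels if label != "ok")
    if not failures:
        return None
    return failures.most_common(1)[0]


def suggest_branch(labels):
    """Map the dominant failure to a candidate next step."""
    top = dominant_failure(labels)
    if top is None:
        return "no dominant failure; keep the current design"
    label, _count = top
    return BRANCHES.get(label, "unlabelled failure type; inspect manually")


labels = ["ok", "clarity", "stop", "clarity", "ok", "clarity"]
print(dominant_failure(labels), "->", suggest_branch(labels))
```

With only 25 to 30 prompts, a narrow margin between two labels should be read as "both matter" rather than as a ranking.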
Bottom line
The next clean framing might be:
STT: fixed substrate + spectral lens is the useful result. TTS: do not collapse free-run roughness into “HSL input failed.” First split coverage, stop/duration, acoustic clarity, and vocoder/mel path.
If those diagnostics show that text coverage is easy but phonetic clarity remains poor, then the next problem is probably not the zero/HSL text door. It is the acoustic generation path.
That would actually make the story stronger: the project would no longer be trying to say “zero learned interface everywhere.” It would be showing where a fixed substrate is enough, where a path-specific lens is needed, and where generation-side interfaces still need work.