{
  "$type": "site.standard.document",
  "bskyPostRef": {
    "cid": "bafyreif6us55lsxu37jo5ztyforlnx4ozlhtehx5qqvz4z2upxj6gt4s5e",
    "uri": "at://did:plc:pgryn3ephfd2xgft23qokfzt/app.bsky.feed.post/3mpg5a6fgvmh2"
  },
  "path": "/t/holo-tolk-tokenizer-free-speech-stt-tts-on-the-0-parameter-hsl-byte-substrate/177216#post_2",
  "publishedAt": "2026-06-29T08:36:08.000Z",
  "site": "https://discuss.huggingface.co",
  "tags": [
    "STT repo",
    "(click for more details)",
    "ESPnet-TTS",
    "JiWER",
    "Whisper docs",
    "Tacotron 2",
    "SpeechBrain HiFi-GAN LJSpeech model card",
    "FastSpeech",
    "Glow-TTS",
    "Teacher-Student Training for Robust Tacotron-based TTS"
  ],
  "textContent": "Hmm… I looked into it a bit on Colab:\n\n* * *\n\n## Short read\n\nMy read is that the STT side is already a fairly clean **path-specific lens** result, while the TTS side should probably be treated as a **free-run acoustic-path debugging problem** , not first as an HSL text-input failure.\n\nFor STT, the important result is not “HSL alone solves speech.” It is more specific:\n\n> fixed HSL substrate + a speech-specific spectral lens + gated fusion can beat the mel baseline in the same small setup.\n\nThat is useful because the weak `hsl` / stronger `hslspec` pattern says where the missing structure is: speech recognition needs time-frequency structure, not only raw byte/signal locality. Your own framing above is careful here: 8 kHz, char-CTC, no LM, not SOTA, controlled comparison, with `CER 0.194` for substrate+lens vs `0.213` for mel in that setup. The STT repo is the better external anchor for that part.\n\nFor TTS, I would separate three questions:\n\n  1. **Can the HSL text front end condition the decoder at all?**\n  2. **Can the autoregressive mel decoder cover the text and stop at the right time?**\n  3. **After it covers/stops, is the generated mel/audio phonetically clear enough?**\n\n\n\nThose are different failure surfaces. A rough waveform by itself does not identify which one failed.\n\nSo the next clean artifact I would want is not another single teacher-forced mel-L1 number. It is a small **free-run instability map** : coverage, stop/duration, acoustic clarity, and vocoder/mel controls.\n\n* * *\n\n## 1. 
The useful framing: fixed substrate + path-specific lenses\n\nThe line seems to be moving from:\n\n> “Can the learned input door be removed?”\n\nto something more practical:\n\n> “Which paths can use a fixed HSL substrate directly, and which paths need a small lens/interface for missing structure?”\n\nA compact map might be:\n\nPath | What HSL gives | Missing structure / risk | Minimal lens or interface\n---|---|---|---\nPrimary decoded byte stream | fixed byte/signal geometry | maybe none at the door | zero HSL door\nSTT audio | byte/signal substrate | time-frequency structure | spectral lens + gated fusion\nTTS text input | tokenizer-free UTF-8 byte conditioning | probably not the first suspect | HSL text-byte front end\nTTS acoustic output | generated mel/audio path | duration, stop, alignment, phonetic clarity | AR mel decoder + diagnostics\nRetrieved memory / facts | bytes, but semantically different path | evidence-reading interface | learned projection / adapter\nOutput head | prediction factorization | not symmetric with input | separate output design\n\nThat makes the STT result cleaner, not weaker. It says the fixed substrate is useful, but the path still needs the right measurement lens. Speech recognition needs a spectral lens. TTS text input may not.\n\n* * *\n\n## 2. 
TTS: I would make free-run failures observable first\n\nThe next useful TTS artifact might be very small and very diagnostic.\n\nNot:\n\n> “Here is one more aggregate mel-L1.”\n\nInstead:\n\n> “Here is where free-run breaks: coverage, stop, duration, phonetic clarity, or vocoder/mel path.”\n\nA minimal table could be:\n\nBucket | N | Why it matters\n---|---|---\nshort plain sentence | 5 | sanity check\nmedium LJSpeech-like sentence | 5 | normal operating region\nlong sentence | 5 | duration / coverage / cap stress\npunctuation-heavy | 5 | pauses, punctuation, byte patterns\nnumbers / abbreviations | 5 | text normalization and OOD phonetic stress\ntechnical terms | 5 | phonetic clarity under unusual words\n\nFor each output, I would log:\n\nField | Why\n---|---\ninput chars / bytes | input length\ngenerated frames | duration behavior\nhit `max_frames` | stop failure or cap too low\nmax stop probability | whether stop head ever becomes confident\nstop frame | early/late termination\nfinal attended text position | coverage\nattention coverage ratio | skipped/unfinished text\nrough manual label or ASR proxy | content preservation\nmel image path | acoustic inspection\nwav path | listening / external checks\n\nEven 25–30 prompts can tell a lot. It is not a benchmark. It is a failure map.\n\n* * *\n\n## 3. 
ASR round-trip is useful, but only as a limited proxy\n\nSince this project already has an STT half, a round-trip check is tempting:\n\n> text → TTS → audio → ASR → CER/WER\n\nI think that is useful, but only with a narrow interpretation.\n\nIt can help detect:\n\nASR round-trip signal | Possible meaning\n---|---\nmany deletions | coverage failure / skipped text\nrepeated words | repetition loop\nnear-word substitutions | weak phonetic clarity\nnumbers/technical terms fail | OOD text or normalization stress\nall ASR models disagree wildly | ASR proxy may be unreliable\nstronger ASR still makes similar substitutions | generated audio may be acoustically ambiguous\n\nBut I would not present ASR CER/WER as TTS quality. It is not MOS. It is not naturalness. It mixes TTS errors with ASR errors.\n\nA safer wording is:\n\n> ASR round-trip is a content-preservation proxy, mainly useful for deletion/repetition/near-word substitution diagnostics.\n\nFor background, ESPnet-TTS uses ASR-based objective evaluation in an ASR/TTS framework, and tools like JiWER make CER/WER easy to compute. If using Whisper, I would also keep the Whisper docs in mind: it is an ASR model, not a TTS evaluator.\n\n* * *\n\n## 4. I would add one vocoder/mel control before blaming the front end\n\nA rough waveform does not locate the failing component. 
In a Tacotron-like cascade, the badness can come from the acoustic model or the vocoder path.\n\nThe classic split is:\n\n> text → mel → waveform\n\nTacotron 2 is the standard reference point for this decomposition: text is mapped to mel spectrograms, then a vocoder produces waveform audio.\n\nFor this project, since the TTS side uses a HiFi-GAN vocoder, I would add a small oracle-vocoder check:\n\nInput to vocoder | What it tells you\n---|---\nground-truth/reference mel → vocoder | whether the vocoder/config can reconstruct clean speech\nteacher-forced predicted mel → vocoder | whether the acoustic model works when given the correct history\nfree-run generated mel → vocoder | whether self-fed inference drifts\n\nThe SpeechBrain HiFi-GAN LJSpeech model card is relevant here because it describes a vocoder that takes a spectrogram and produces waveform audio. Its notes also make the same practical point: vocoder use depends on compatible spectrogram settings, such as hop length and mel layout. So I would avoid saying “the vocoder is bad” or “the HSL front end is bad” until this control is done.\n\n* * *\n\n## 5. Delay the big redesign until the failure map says which branch is needed\n\nThere are several possible next architectures, but I would not jump to them before the small map above.\n\nIf the failure map says **alignment/duration** is dominant, then duration or monotonic-alignment ideas become natural. FastSpeech is the obvious reference for a duration/length-regulator branch. Glow-TTS is another useful reference for monotonic alignment / non-AR TTS.\n\nIf the map says **teacher-forced is fine but free-run drifts**, then the issue looks more like exposure bias / self-fed acoustic drift. 
A relevant reference point is Teacher-Student Training for Robust Tacotron-based TTS, which discusses training/inference mismatch in Tacotron-style systems.\n\nIf the map says **coverage and stop are fine, but words are phonetically unclear**, then I would inspect mel targets, postnet behavior, vocoder compatibility, and maybe whether mel regression is too blurry for the small model.\n\nSo the branch order I would use is:\n\n  1. Make free-run observable.\n  2. Split coverage vs stop vs acoustic clarity.\n  3. Only then choose duration/monotonic/vocoder/acoustic-model changes.\n\n* * *\n\n## Bottom line\n\nThe next clean framing might be:\n\n> **STT:** fixed substrate + spectral lens is the useful result.\n> **TTS:** do not collapse free-run roughness into “HSL input failed.” First split coverage, stop/duration, acoustic clarity, and vocoder/mel path.\n\nIf those diagnostics show that text coverage is easy but phonetic clarity remains poor, then the next problem is probably not the zero/HSL text door. It is the acoustic generation path.\n\nThat would actually make the story stronger: the project would no longer be trying to say “zero learned interface everywhere.” It would be showing where a fixed substrate is enough, where a path-specific lens is needed, and where generation-side interfaces still need work.",
  "title": "HoLo-ToLk: tokenizer-free speech (STT + TTS) on the 0-parameter HSL byte substrate"
}