Raw Record Source

{
  "$type": "site.standard.document",
  "bskyPostRef": {
    "cid": "bafyreie6ju4r3zxnmh24zqkm4zq4owa35j56eko6niqfqsic5fvhi5udpq",
    "uri": "at://did:plc:pgryn3ephfd2xgft23qokfzt/app.bsky.feed.post/3mpiadpgb6wq2"
  },
  "path": "/t/holo-tolk-tokenizer-free-speech-stt-tts-on-the-0-parameter-hsl-byte-substrate/177216#post_4",
  "publishedAt": "2026-06-30T03:38:07.000Z",
  "site": "https://discuss.huggingface.co",
  "tags": [],
  "textContent": "I think that’s probably the safer direction. If I were organizing that route, I’d frame it like this:\n\n* * *\n\n## Short read\n\nThis route becomes easier to evaluate if each branch gets its own narrow label.\n\nFor the speech branch:\n\n> **tokenizer-free speech on the fixed HSL substrate**\n\nFor the BPE text-generation branch:\n\n> **embedding-table-free text generation over BPE-segmented HSL streams**\n\nThat keeps the useful constraint visible without forcing every branch into the same label.\n\nThe pattern is becoming:\n\n> fixed HSL substrate where possible, explicit path-specific lenses where the path needs extra structure.\n\nFor STT, the lens is spectral.\nFor TTS, the next lens is diagnostic: coverage / stop / acoustic clarity / vocoder controls.\nFor text generation, BPE-style segmentation becomes a **segmentation lens**.\n\n* * *\n\n## 1. The route becomes clearer if BPE is treated as a segmentation lens\n\nThe route becomes easier to evaluate if the global constraint is split into path-level claims.\n\nIn that map, BPE is not just an implementation detail. It is the text-generation lens.\n\nPath | What changed | Clean framing\n---|---|---\nHoLo_ZeRo input | remove learned byte embedding door | zero-param observation door\nSTT | raw HSL alone is weak, spectral lens helps | speech needs a time-frequency lens\nTTS | text input can use HSL, free-run still rough | acoustic generation is a separate path\ntext generation | raw byte HSL gives word salad at small scale | text generation needs segmentation structure\nmemory / grounding | zero memory path was weak | retrieved facts need a learned/explicit interface\n\nThat gives a consistent story:\n\n> fixed substrate where possible, explicit lens where necessary.\n\n* * *\n\n## 2. The BPE branch needs its own label\n\nOnce BPE is in the loop, the branch has segmentation and a vocabulary. 
So I would label the branch around the remaining constraint:\n\n> no learned embedding table for the segmented units; the chunks are still encoded through the fixed HSL substrate.\n\nA useful naming table:\n\nPossible phrase | Why it works / risk\n---|---\n**segmentation lens** | fits the path-specific lens story\n**embedding-table-free BPE-HSL** | precise and compact\n**BPE-segmented HSL stream** | descriptive and low-risk\n**tokenizer-free with BPE** | mixes two different claims\n**semantic chunks** | useful intuition, but too strong if taken literally\n\nA clean version might be:\n\n> **BPE-HSL: embedding-table-free generation over HSL-coded subword chunks**\n\nThat separates three things:\n\nLayer | Question | BPE-HSL answer\n---|---|---\nSegmentation / vocabulary | Are there explicit chunks? | yes, BPE-style chunks\nEmbedding table | Does each chunk get a learned vector? | no, if each chunk is streamed through HSL\nModel body | Does the Transformer learn mixing/generation? | yes\n\nSo the interesting text-generation question becomes:\n\n> how much segmentation structure is needed before the fixed HSL substrate becomes useful for generation?\n\n* * *\n\n## 3. The key control is segmentation vs HSL geometry\n\nBPE may help for two separate reasons:\n\n  1. It reduces sequence burden and gives the model larger text units.\n  2. The HSL stream over those chunks may still provide useful fixed geometry.\n\nThose are different claims.\n\nA result like:\n\n> BPE-HSL reduces word salad\n\nwould be useful, but it would not yet say whether HSL geometry is the important part. 
It might simply mean the model needed segmentation.\n\nSo I would make the BPE-HSL branch easy to interpret with a small control table.\n\nCondition | What it tests\n---|---\nraw byte HSL | pure byte-substrate baseline\nraw byte HSL + longer schedule | whether the issue is only training budget\nchar-level / byte-level HSL with same model size | whether shorter chunks are needed\nBPE segmentation + HSL stream | segmentation lens effect\nBPE tokens + learned embedding baseline | what is lost by avoiding embedding tables\nBPE segmentation + random/fixed vectors | whether segmentation alone explains the gain\nBPE segmentation + permuted HSL mapping | whether HSL geometry matters after segmentation\nvocab-size sweep | chunk granularity boundary\nrare words / numbers / punctuation bucket | whether BPE helps the actual failure cases\n\nThe key comparison is not only:\n\n> BPE-HSL vs raw byte HSL\n\nbut also:\n\n> BPE-HSL vs BPE learned embeddings\n\nbecause that is the cleanest way to test the “embedding-table-free” part.\n\n* * *\n\n## 4. HoLo-ToLk and BPE-HSL can stay related but distinct\n\nThe branches can share the same substrate story without sharing the same public label.\n\nBranch | Claim\n---|---\nHoLo-ToLk STT | HSL substrate needs spectral lens for speech recognition\nHoLo-ToLk TTS | HSL text front end is feasible, but free-run acoustic generation needs a failure map\nBPE-HSL text generation | text generation may need a segmentation lens, while still avoiding learned embedding tables\n\nThat keeps the speech claim and the text-generation claim from interfering with each other.\n\nThe phrase “tokenizer-free speech” can still describe the HoLo-ToLk STT/TTS demo if the text path is UTF-8 bytes and the STT path outputs chars/bytes. 
Once BPE is added for text generation, I would give that branch its own label.\n\nSomething like:\n\n> **HoLo-BPE-HSL: embedding-table-free text generation over HSL-coded subword chunks**\n\nA bit long, but very clear.\n\n* * *\n\n## 5. The updated wording can stay narrow and testable\n\nA scoped version of the claims might be:\n\nClaim | Safer status\n---|---\nfixed HSL substrate | yes\nno learned embedding table | plausible branch claim\ntokenizer-free | only for branches that actually avoid segmentation/vocab\nno learned interface anywhere | probably too broad\npath-specific lens map | increasingly useful\nBPE for text generation | segmentation-lens branch\n\nThat makes the route more testable.\n\nInstead of one global claim, each branch gets a measurable boundary:\n\nBranch | Main question\n---|---\nSTT | Does spectral lens + HSL beat mel in the same controlled setup?\nTTS | Does free-run cover, stop, and preserve phonetic content?\nBPE-HSL text | Does segmentation help without needing a learned embedding table?\nMemory / grounding | Does a small learned interface recover evidence use?\n\n* * *\n\n## Bottom line\n\nI would frame the new route like this:\n\n> BPE-HSL is the **segmentation-lens branch**.\n>  It has a vocabulary/segmentation step, but it can still test the more interesting constraint: **no learned embedding table for the vocabulary items**.\n\nThat fits the larger pattern:\n\n> fixed substrate where possible, explicit lens where necessary.\n\nIf BPE-HSL reduces word salad while staying reasonably close to a learned-embedding BPE baseline, that would be a strong result. If it only matches random fixed vectors, then segmentation is doing most of the work. Either way, the experiment becomes interpretable.",
  "title": "HoLo-ToLk: tokenizer-free speech (STT + TTS) on the 0-parameter HSL byte substrate"
}