HoLo-ToLk: tokenizer-free speech (STT + TTS) on the 0-parameter HSL byte substrate
Follow-up to my earlier post on the 0-parameter input layer.
I took the HSL byte substrate (no tokenizer, no learned input embedding) and built two small speech models on top, to see whether “bytes as signal” carries through to audio. I’m calling the line HoLo-ToLk.
STT (speech → text): the result I’m most confident about. A char-CTC baseline fed only the raw HSL substrate is weak (CER ~0.67). Adding a small model-side spectral lens (log-mel features plus a learnable gated fusion over the frozen substrate) brings CER down to 0.194, which beats a mel-spectrogram baseline (0.213) in the same setup and holds across 4 seeds. So the honest takeaway is a controlled comparison (substrate + lens > mel, same setup), not a SOTA number: this is 8 kHz audio, char-CTC, no LM, and the output is readable but rough.
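For anyone who wants to pin down what the CER figures mean, here is a minimal sketch of character error rate as it is usually defined: character-level edit distance divided by the number of reference characters. The function names are hypothetical, and the sketch assumes edits are pooled over the whole test set rather than averaged per utterance, which the numbers above do not specify.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two strings, counted in characters."""
    # prev[j] holds the distance between the processed prefix of ref and hyp[:j].
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]  # distance from ref[:i] to the empty hypothesis
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(
                prev[j] + 1,         # deletion of a reference character
                curr[j - 1] + 1,     # insertion of a hypothesis character
                prev[j - 1] + cost,  # substitution, or a free match
            ))
        prev = curr
    return prev[-1]


def corpus_cer(refs, hyps):
    """Pooled CER: total character edits / total reference characters.

    Assumption: pooling over the corpus. Averaging per-utterance CER
    is a different (also common) convention and gives different numbers.
    """
    total_edits = sum(edit_distance(r, h) for r, h in zip(refs, hyps))
    total_chars = sum(len(r) for r in refs)
    return total_edits / total_chars


if __name__ == "__main__":
    refs = ["hello world"]
    hyps = ["helo wrld"]
    print(round(corpus_cer(refs, hyps), 3))
```

On this scale, a CER of 0.194 means roughly one character edit for every five reference characters, which fits the "readable but rough" description.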
TTS (text → speech): here the byte substrate is an even more natural fit. UTF-8 text bytes go straight in as HSL features, with no tokenizer or vocabulary. A small AR transformer with guided attention plus a HiFi-GAN vocoder gives a single-speaker voice. Held-out teacher-forced mel-L1 is 0.296 (multi-seed), and some samples sound genuinely natural, but free-run synthesis on arbitrary sentences is still rough and unstable. So I’m framing TTS as a feasibility demo, not a usable TTS system.
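To make the "bytes straight in" and "guided attention" parts concrete, here is a rough sketch under stated assumptions. It shows text becoming a UTF-8 byte sequence (how HSL maps those bytes to features is not shown here), and the commonly used guided-attention penalty that favours near-diagonal alignment between input position and output frame. The function names and the width `g = 0.2` are illustrative choices, not taken from the HoLo-ToLk code.

```python
import math


def text_to_byte_ids(text):
    """UTF-8 encode text; each byte is an integer in 0..255, no vocab needed."""
    return list(text.encode("utf-8"))


def guided_attention_weights(n_bytes, n_frames, g=0.2):
    """Penalty matrix W[n][t] = 1 - exp(-((n/N - t/T)^2) / (2 g^2)).

    Entries are near 0 on the diagonal and approach 1 far away from it.
    g = 0.2 is an assumed width, not a value from the post.
    """
    weights = []
    for n in range(n_bytes):
        row = []
        for t in range(n_frames):
            diff = n / n_bytes - t / n_frames
            row.append(1.0 - math.exp(-(diff * diff) / (2.0 * g * g)))
        weights.append(row)
    return weights


def guided_attention_loss(attention, g=0.2):
    """Mean of attention * penalty; attention[n][t] is byte n vs. frame t."""
    n_bytes, n_frames = len(attention), len(attention[0])
    weights = guided_attention_weights(n_bytes, n_frames, g)
    total = sum(
        attention[n][t] * weights[n][t]
        for n in range(n_bytes)
        for t in range(n_frames)
    )
    return total / (n_bytes * n_frames)


if __name__ == "__main__":
    ids = text_to_byte_ids("héllo")
    print(ids)  # 'é' takes two bytes, so 5 characters become 6 byte ids

    size = 4
    diagonal = [[1.0 if n == t else 0.0 for t in range(size)] for n in range(size)]
    reversed_ = [[1.0 if n == size - 1 - t else 0.0 for t in range(size)] for n in range(size)]
    print(guided_attention_loss(diagonal), guided_attention_loss(reversed_))
```

A monotonic, roughly diagonal alignment scores near zero while a scrambled one is penalised. That pressure applies during teacher-forced training, so it does not by itself guarantee stable alignment in free-run synthesis.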
Both are research/devlog results, not production or SOTA. The two models are separate today; the goal is to unify them into one over time.
Try it (combined demo, both tabs):
HoLo-ToLk, a Hugging Face Space by ggunio (huggingface.co). Tokenizer-free speech line: STT + TTS on HSL.
Substrate: pip install hsl-embedding-zero
Happy to answer questions on the lens design or the byte→signal encoding — and very open to critique, especially on the TTS free-run instability.