{
"$type": "site.standard.document",
"bskyPostRef": {
"cid": "bafyreidx3tjarjvxi6rb3j75qx3hrmai3hpdyfiey5chrb6c77irqmq3tu",
"uri": "at://did:plc:pgryn3ephfd2xgft23qokfzt/app.bsky.feed.post/3mkubvidoa6n2"
},
"path": "/t/how-do-i-make-stt-work-for-my-ai-vtuber-on-discord-vc-calls/175621#post_8",
"publishedAt": "2026-05-02T08:57:53.000Z",
"site": "https://discuss.huggingface.co",
"tags": [
"Whisper-Streaming paper",
"Whisper-Streaming GitHub",
"OpenAI Whisper real-time discussion",
"Hugging Face ASR chunking guide",
"faster-whisper",
"ASR chunking guide",
"Silero VAD",
"Whisper hallucination discussion: VAD + condition_on_previous_text=False",
"Hugging Face ASR task guide",
"Wav2Vec2 docs",
"nvidia/parakeet-unified-en-0.6b",
"NVIDIA Parakeet ASR collection",
"NVIDIA blog: Scaling Real-Time Voice Agents with Cache-Aware Streaming ASR",
"nvidia/nemotron-speech-streaming-en-0.6b",
"NVIDIA cache-aware streaming ASR blog"
],
"textContent": "Whisper can handle “chunks,” but there’s probably a slight discrepancy in how you and the Whisper model interpret about “chunks”.\n\nIf you can use a different model that prioritizes real-time performance, that’s a viable option; however, if Whisper is already integrated and you can’t use another model, I think you’ll need to create a buffer.\nIt comes down to whether you prioritize the final transcription quality or real-time streaming.\n\n* * *\n\n# Can Whisper transcribe chunks? Could another Hugging Face model handle streaming better?\n\n## Short answer\n\nYes, **Whisper can transcribe chunks** in the basic sense: you can pass it a short audio array or short WAV segment and it may return text.\n\nBut Whisper is **not a true streaming ASR model** in the sense of:\n\n\n 20 ms Discord audio frame in\n → stable partial transcript out\n → updated transcript as more audio arrives\n\n\nThat difference is the important part.\n\nYour current chunk is roughly:\n\n\n 5760 samples / 16000 Hz = 0.36 seconds\n\n\nThat is extremely short for independent Whisper transcription. It may contain only part of a syllable, a clipped word edge, a breath, keyboard noise, silence, Discord compression residue, or TTS feedback.\n\nSo the practical answer is:\n\n\n Whisper can transcribe chunks, but not tiny independent Discord callback chunks reliably.\n\n\nFor Whisper/faster-whisper, “chunking” should usually mean:\n\n\n small audio frames\n → buffered into a larger speech window\n → optional VAD trimming\n → optional overlap/stride\n → transcribe meaningful segment\n\n\nNot:\n\n\n tiny Discord frame\n → independent STT call\n → send result to Ollama\n\n\nUseful references:\n\n * Whisper-Streaming paper\n * Whisper-Streaming GitHub\n * OpenAI Whisper real-time discussion\n * Hugging Face ASR chunking guide\n * faster-whisper\n\n\n\n* * *\n\n# 1. The important distinction: chunks vs streaming\n\nThese are not the same thing.\n\nTerm | Meaning | Good fit for tiny Discord chunks?\n---|---|---\n**Independent chunk transcription** | Each audio chunk is treated as a complete standalone clip | Usually bad\n**Chunked transcription with overlap/stride** | Larger chunks are decoded with left/right context so boundary errors are reduced | Better\n**Utterance-based STT** | VAD detects speech start/end, then STT transcribes the completed utterance | Best first version\n**True streaming ASR** | Model keeps state/cache and emits partial/final text incrementally | Best for low-latency live captions\n**Raw Discord frame transcription** | Every small callback/frame goes straight to STT | Usually the failure mode\n\nYour current system seems closest to this:\n\n\n small audio callback\n → immediate STT\n → empty or bad transcript\n → empty transcript still sent to Ollama\n\n\nThat is the wrong shape for Whisper.\n\n* * *\n\n# 2. Why Whisper struggles with your current chunks\n\nWhisper is a sequence-to-sequence model. It is strong, but it expects enough audio context to infer words.\n\nIt works best with something like:\n\n\n 1–15 seconds of speech-like audio\n mostly intact word boundaries\n reasonable volume\n correct sample rate\n silence/noise trimmed\n\n\nIt works badly with:\n\n\n 0.12–0.36 seconds of audio\n half a word\n wrong sample rate\n wrong dtype\n clipping / over-amplification\n silence or no-speech\n bot TTS leaking into input\n Discord receive artifacts\n\n\nThe Whisper-Streaming paper states the key issue directly: Whisper is not designed for real-time transcription, so the authors built a streaming wrapper around it using local agreement and adaptive latency.\n\nThat means:\n\n\n Whisper can be used in streaming systems,\n but Whisper itself is not a native streaming recognizer.\n\n\n* * *\n\n# 3. What “chunking” should mean for Whisper\n\nBad Whisper chunking:\n\n\n chunk 1 alone → text?\n chunk 2 alone → text?\n chunk 3 alone → text?\n\n\nBetter Whisper chunking:\n\n\n audio stream\n → collect 1–5 seconds\n → add overlap/padding\n → transcribe\n → commit only stable/final text\n\n\nBest first version for your AI VTuber:\n\n\n audio stream\n → VAD detects speech start\n → buffer while user speaks\n → wait for 700–1200 ms silence\n → transcribe the completed utterance\n → reject empty/garbage\n → send valid text to Ollama\n\n\nThis is not “true streaming,” but it is usually the best first working design for a Discord AI VTuber.\n\n* * *\n\n# 4. Why overlap/stride matters\n\nIf you cut audio at arbitrary boundaries, words get chopped.\n\nExample:\n\n\n chunk 1: \"can you hea\"\n chunk 2: \"r me now\"\n\n\nA model may misread both chunks because neither one has the full word boundary context.\n\nHugging Face’s ASR chunking guide explains this for CTC models such as Wav2Vec2: chunks are decoded with stride/overlap so the model has context around the cut points, and the unreliable edges can be dropped/merged.\n\nThe same general idea matters for Whisper too, even though Whisper is not CTC-based:\n\n\n do not decode arbitrary tiny independent slices\n\n\nUse:\n\n\n VAD padding\n overlap\n larger windows\n or utterance-level transcription\n\n\n* * *\n\n# 5. Can another Hugging Face model handle chunks better?\n\nYes, but the details matter.\n\nThere are three realistic paths:\n\n 1. **Stay with Whisper/faster-whisper, but add VAD + utterance buffering.**\n 2. **Try CTC models like Wav2Vec2 with chunking/stride.**\n 3. **Use true streaming ASR models like NVIDIA Parakeet/Nemotron-style RNN-T/FastConformer models.**\n\n\n\n* * *\n\n# 6. Option A — Stay with faster-whisper + VAD buffering\n\nThis is still my recommended first fix.\n\nUse:\n\n\n Silero VAD\n + utterance buffer\n + faster-whisper\n + transcript validation\n\n\nReferences:\n\n * faster-whisper\n * Silero VAD\n * Whisper hallucination discussion: VAD + condition_on_previous_text=False\n\n\n\nWhy this is best first:\n\nReason | Explanation\n---|---\nEasier setup | Much easier than integrating a true streaming ASR runtime\nGood accuracy | Whisper-family models are strong when audio is clean\nGood enough latency | Utterance-based latency is acceptable for conversational bots\nFewer moving parts | You can debug audio conversion, VAD, STT, and Ollama separately\n\nRecommended flow:\n\n\n Discord/local mic chunks\n → convert to mono 16 kHz float32\n → VAD\n → buffer complete utterance\n → faster-whisper\n → reject blank/garbage\n → Ollama\n\n\nThis will likely solve more of your current issue than switching models immediately.\n\n* * *\n\n# 7. Option B — Wav2Vec2 / CTC models with chunking + stride\n\nCTC models can be more natural for chunking than Whisper.\n\nExamples:\n\n\n Wav2Vec2\n HuBERT\n WavLM-style ASR checkpoints\n\n\nWhy CTC models can work better for chunked audio:\n\n * they produce frame-level logits,\n * overlapping chunks can be merged more naturally,\n * boundary handling is simpler than seq2seq decoding,\n * Hugging Face pipelines support chunking/stride for many CTC ASR models.\n\n\n\nReferences:\n\n * Hugging Face ASR chunking guide\n * Hugging Face ASR task guide\n * Wav2Vec2 docs\n\n\n\nExample shape:\n\n\n from transformers import pipeline\n\n pipe = pipeline(\n \"automatic-speech-recognition\",\n model=\"facebook/wav2vec2-base-960h\",\n )\n\n result = pipe(\n audio_16k,\n chunk_length_s=5,\n stride_length_s=1,\n )\n\n\nBut this still does **not** mean:\n\n\n 0.36-second Discord chunk → independent transcript\n\n\nIt means:\n\n\n larger windows\n + overlap/stride\n + merge outputs\n\n\nCTC chunking may be worth testing, but it does not remove the need for:\n\n\n audio conversion\n buffering\n VAD/endpointing\n empty transcript filtering\n feedback prevention\n\n\n* * *\n\n# 8. Option C — True streaming ASR models\n\nThis is the route closest to what you are asking for.\n\nStreaming ASR models are designed around:\n\n\n small incoming chunks\n + preserved model state/cache\n + partial/final transcript updates\n\n\nCommon architectures include:\n\n\n RNN-T / Transducer\n Conformer / FastConformer\n cache-aware streaming encoders\n\n\nThese are better suited to live voice agents than naive Whisper-per-chunk.\n\n* * *\n\n## NVIDIA Parakeet Unified\n\nnvidia/parakeet-unified-en-0.6b is a strong example.\n\nThe model card describes it as an English ASR model based on transducer architecture / RNN-T / FastConformer, supporting both offline and streaming inference in one model. It also mentions a minimum latency of 160 ms and configurable streaming chunk sizes from 2080 ms down to 160 ms in 80 ms steps.\n\nUseful links:\n\n * nvidia/parakeet-unified-en-0.6b\n * NVIDIA Parakeet ASR collection\n\n\n\nWhy it matters:\n\n\n This is much closer to “streaming ASR” than plain Whisper.\n\n\nCaveat:\n\n\n You still need to integrate its streaming/buffered-streaming API correctly.\n Do not call it independently on every Discord chunk as if each chunk is a complete utterance.\n\n\n* * *\n\n## NVIDIA Nemotron Speech Streaming\n\nNVIDIA’s Nemotron Speech streaming ASR is another relevant route.\n\nThe Hugging Face blog describes cache-aware streaming inference for voice agents, with latency modes such as 80 ms, 160 ms, 560 ms, and 1.12 s. It also explains why cache-aware streaming is more efficient than repeatedly re-encoding overlapping windows.\n\nUseful links:\n\n * NVIDIA blog: Scaling Real-Time Voice Agents with Cache-Aware Streaming ASR\n * nvidia/nemotron-speech-streaming-en-0.6b\n\n\n\nWhy it matters:\n\n\n This is the kind of architecture built for live voice agents.\n\n\nCaveat:\n\n\n It is more engineering-heavy than faster-whisper.\n Expect NeMo/runtime-specific setup and more integration work.\n\n\n* * *\n\n# 9. Practical comparison\n\nApproach | Can handle tiny chunks directly? | Setup difficulty | Good for Discord AI VTuber? | Recommendation\n---|---|---|---|---\nNaive Whisper per chunk | No | Low | Bad | Avoid\nfaster-whisper + VAD utterance buffering | Not directly; buffers into utterances | Medium-low | Good | Best first working route\nWhisper-Streaming | More streaming-like with local agreement | Medium-high | Good if you need partials | Try after basic STT works\nWav2Vec2/CTC + chunk/stride | Better chunk merging than Whisper | Medium | Maybe | Worth testing\nNVIDIA Parakeet/Nemotron streaming | Yes, designed for streaming modes | Higher | Strong candidate | Best true-streaming HF route\nCloud STT | Yes | Low-medium | Technically good | Not free/local long-term\n\n* * *\n\n# 10. Important: streaming ASR still needs a controller\n\nEven with a true streaming model, you still need:\n\n\n correct audio conversion\n RMS/peak validation\n VAD or endpointing\n partial/final transcript handling\n empty transcript filtering\n bot_is_speaking guard\n TTS feedback prevention\n Discord receive debugging\n\n\nStreaming ASR can help with this:\n\n\n tiny chunks are too small for independent Whisper transcription\n\n\nIt does not automatically fix this:\n\n\n empty transcript sent to Ollama\n bot hears itself\n wrong sample rate\n bad amplitude\n Discord receive broken\n missing transcribe_audio function\n\n\nYour current logs show controller/audio-path issues clearly, so switching models first may hide the real bug.\n\n* * *\n\n# 11. What I would do in your exact case\n\n## Step 1 — Fix the current pipeline first\n\nBefore changing models, fix these:\n\n\n define transcribe_audio()\n block empty Ollama calls\n save actual STT input WAVs\n validate sample rate / duration / RMS / peak\n prevent Chatterbox feedback\n\n\nYour current error:\n\n\n NameError: name 'transcribe_audio' is not defined\n\n\nmeans model choice is not the first blocker.\n\n* * *\n\n## Step 2 — Make faster-whisper work on full utterances\n\nUse:\n\n\n audio chunks\n → mono 16 kHz conversion\n → VAD\n → utterance buffer\n → faster-whisper\n → reject blank\n → Ollama\n\n\nThis is the best first stable version.\n\n* * *\n\n## Step 3 — If you need live partial captions, try Whisper-Streaming\n\nUse these:\n\n * Whisper-Streaming paper\n * Whisper-Streaming GitHub\n\n\n\nThis keeps Whisper-like behavior but adds a streaming policy layer.\n\n* * *\n\n## Step 4 — If you need true low-latency streaming, test Parakeet/Nemotron\n\nStart here:\n\n * nvidia/parakeet-unified-en-0.6b\n * NVIDIA Parakeet ASR collection\n * nvidia/nemotron-speech-streaming-en-0.6b\n * NVIDIA cache-aware streaming ASR blog\n\n\n\nThese are better for true streaming, but they are more complex than faster-whisper.\n\n* * *\n\n# 12. Best mental model\n\n## Whisper/faster-whisper route\n\n\n small chunks\n → buffer into complete utterance\n → transcribe utterance\n\n\n## Streaming ASR route\n\n\n small chunks\n → preserve model state/cache\n → emit partial/final transcript\n → controller decides when text is final enough\n\n\n## Bad route\n\n\n small chunk\n → independent STT call\n → send result to Ollama\n\n\nThat last route is what you should avoid.\n\n* * *\n\n# Bottom line\n\nWhisper **can** transcribe chunks, but it is bad at transcribing **tiny independent live chunks** like `5760` samples.\n\nIf you want to stay simple:\n\n\n Use faster-whisper + Silero VAD + utterance buffering.\n\n\nIf you want streaming behavior while keeping Whisper-like transcription:\n\n\n Try Whisper-Streaming.\n\n\nIf you want a real Hugging Face streaming ASR model:\n\n\n Try NVIDIA Parakeet Unified or Nemotron Speech Streaming.\n\n\nBut do not skip the basics:\n\n\n define transcribe_audio()\n block empty Ollama calls\n buffer audio\n validate audio\n save debug WAVs\n prevent TTS feedback\n\n\nA streaming model can improve latency. It will not fix a broken controller or bad audio path.",
"title": "How Do i Make Stt Work for my ai Vtuber on Discord Vc calls?"
}