Raw Record Source

{
  "$type": "site.standard.document",
  "bskyPostRef": {
    "cid": "bafyreidx3tjarjvxi6rb3j75qx3hrmai3hpdyfiey5chrb6c77irqmq3tu",
    "uri": "at://did:plc:pgryn3ephfd2xgft23qokfzt/app.bsky.feed.post/3mkubvidoa6n2"
  },
  "path": "/t/how-do-i-make-stt-work-for-my-ai-vtuber-on-discord-vc-calls/175621#post_8",
  "publishedAt": "2026-05-02T08:57:53.000Z",
  "site": "https://discuss.huggingface.co",
  "tags": [
    "Whisper-Streaming paper",
    "Whisper-Streaming GitHub",
    "OpenAI Whisper real-time discussion",
    "Hugging Face ASR chunking guide",
    "faster-whisper",
    "ASR chunking guide",
    "Silero VAD",
    "Whisper hallucination discussion: VAD + condition_on_previous_text=False",
    "Hugging Face ASR task guide",
    "Wav2Vec2 docs",
    "nvidia/parakeet-unified-en-0.6b",
    "NVIDIA Parakeet ASR collection",
    "NVIDIA blog: Scaling Real-Time Voice Agents with Cache-Aware Streaming ASR",
    "nvidia/nemotron-speech-streaming-en-0.6b",
    "NVIDIA cache-aware streaming ASR blog"
  ],
  "textContent": "Whisper can handle “chunks,” but there’s probably a slight discrepancy in how you and the Whisper model interpret “chunks”.\n\nIf you can use a different model that prioritizes real-time performance, that’s a viable option; however, if Whisper is already integrated and you can’t use another model, I think you’ll need to create a buffer.\nIt comes down to whether you prioritize the final transcription quality or real-time streaming.\n\n* * *\n\n# Can Whisper transcribe chunks? Could another Hugging Face model handle streaming better?\n\n## Short answer\n\nYes, **Whisper can transcribe chunks** in the basic sense: you can pass it a short audio array or short WAV segment and it may return text.\n\nBut Whisper is **not a true streaming ASR model** in the sense of:\n\n\n    20 ms Discord audio frame in\n    → stable partial transcript out\n    → updated transcript as more audio arrives\n\n\nThat difference is the important part.\n\nYour current chunk is roughly:\n\n\n    5760 samples / 16000 Hz = 0.36 seconds\n\n\nThat is extremely short for independent Whisper transcription. It may contain only part of a syllable, a clipped word edge, a breath, keyboard noise, silence, Discord compression residue, or TTS feedback.\n\nSo the practical answer is:\n\n\n    Whisper can transcribe chunks, but not tiny independent Discord callback chunks reliably.\n\n\nFor Whisper/faster-whisper, “chunking” should usually mean:\n\n\n    small audio frames\n    → buffered into a larger speech window\n    → optional VAD trimming\n    → optional overlap/stride\n    → transcribe meaningful segment\n\n\nNot:\n\n\n    tiny Discord frame\n    → independent STT call\n    → send result to Ollama\n\n\nUseful references:\n\n  * Whisper-Streaming paper\n  * Whisper-Streaming GitHub\n  * OpenAI Whisper real-time discussion\n  * Hugging Face ASR chunking guide\n  * faster-whisper\n\n\n\n* * *\n\n# 1. 
The important distinction: chunks vs streaming\n\nThese are not the same thing.\n\nTerm | Meaning | Good fit for tiny Discord chunks?\n---|---|---\n**Independent chunk transcription** | Each audio chunk is treated as a complete standalone clip | Usually bad\n**Chunked transcription with overlap/stride** | Larger chunks are decoded with left/right context so boundary errors are reduced | Better\n**Utterance-based STT** | VAD detects speech start/end, then STT transcribes the completed utterance | Best first version\n**True streaming ASR** | Model keeps state/cache and emits partial/final text incrementally | Best for low-latency live captions\n**Raw Discord frame transcription** | Every small callback/frame goes straight to STT | Usually the failure mode\n\nYour current system seems closest to this:\n\n\n    small audio callback\n    → immediate STT\n    → empty or bad transcript\n    → empty transcript still sent to Ollama\n\n\nThat is the wrong shape for Whisper.\n\n* * *\n\n# 2. Why Whisper struggles with your current chunks\n\nWhisper is a sequence-to-sequence model. It is strong, but it expects enough audio context to infer words.\n\nIt works best with something like:\n\n\n    1–15 seconds of speech-like audio\n    mostly intact word boundaries\n    reasonable volume\n    correct sample rate\n    silence/noise trimmed\n\n\nIt works badly with:\n\n\n    0.12–0.36 seconds of audio\n    half a word\n    wrong sample rate\n    wrong dtype\n    clipping / over-amplification\n    silence or no-speech\n    bot TTS leaking into input\n    Discord receive artifacts\n\n\nThe Whisper-Streaming paper states the key issue directly: Whisper is not designed for real-time transcription, so the authors built a streaming wrapper around it using local agreement and adaptive latency.\n\nThat means:\n\n\n    Whisper can be used in streaming systems,\n    but Whisper itself is not a native streaming recognizer.\n\n\n* * *\n\n# 3. 
What “chunking” should mean for Whisper\n\nBad Whisper chunking:\n\n\n    chunk 1 alone → text?\n    chunk 2 alone → text?\n    chunk 3 alone → text?\n\n\nBetter Whisper chunking:\n\n\n    audio stream\n    → collect 1–5 seconds\n    → add overlap/padding\n    → transcribe\n    → commit only stable/final text\n\n\nBest first version for your AI VTuber:\n\n\n    audio stream\n    → VAD detects speech start\n    → buffer while user speaks\n    → wait for 700–1200 ms silence\n    → transcribe the completed utterance\n    → reject empty/garbage\n    → send valid text to Ollama\n\n\nThis is not “true streaming,” but it is usually the best first working design for a Discord AI VTuber.\n\n* * *\n\n# 4. Why overlap/stride matters\n\nIf you cut audio at arbitrary boundaries, words get chopped.\n\nExample:\n\n\n    chunk 1: \"can you hea\"\n    chunk 2: \"r me now\"\n\n\nA model may misread both chunks because neither one has the full word boundary context.\n\nHugging Face’s ASR chunking guide explains this for CTC models such as Wav2Vec2: chunks are decoded with stride/overlap so the model has context around the cut points, and the unreliable edges can be dropped/merged.\n\nThe same general idea matters for Whisper too, even though Whisper is not CTC-based:\n\n\n    do not decode arbitrary tiny independent slices\n\n\nUse:\n\n\n    VAD padding\n    overlap\n    larger windows\n    or utterance-level transcription\n\n\n* * *\n\n# 5. Can another Hugging Face model handle chunks better?\n\nYes, but the details matter.\n\nThere are three realistic paths:\n\n  1. **Stay with Whisper/faster-whisper, but add VAD + utterance buffering.**\n  2. **Try CTC models like Wav2Vec2 with chunking/stride.**\n  3. **Use true streaming ASR models like NVIDIA Parakeet/Nemotron-style RNN-T/FastConformer models.**\n\n\n\n* * *\n\n# 6. 
Option A — Stay with faster-whisper + VAD buffering\n\nThis is still my recommended first fix.\n\nUse:\n\n\n    Silero VAD\n    + utterance buffer\n    + faster-whisper\n    + transcript validation\n\n\nReferences:\n\n  * faster-whisper\n  * Silero VAD\n  * Whisper hallucination discussion: VAD + condition_on_previous_text=False\n\n\n\nWhy this is best first:\n\nReason | Explanation\n---|---\nEasier setup | Much easier than integrating a true streaming ASR runtime\nGood accuracy | Whisper-family models are strong when audio is clean\nGood enough latency | Utterance-based latency is acceptable for conversational bots\nFewer moving parts | You can debug audio conversion, VAD, STT, and Ollama separately\n\nRecommended flow:\n\n\n    Discord/local mic chunks\n    → convert to mono 16 kHz float32\n    → VAD\n    → buffer complete utterance\n    → faster-whisper\n    → reject blank/garbage\n    → Ollama\n\n\nThis will likely solve more of your current issue than switching models immediately.\n\n* * *\n\n# 7. 
Option B — Wav2Vec2 / CTC models with chunking + stride\n\nCTC models can be more natural for chunking than Whisper.\n\nExamples:\n\n\n    Wav2Vec2\n    HuBERT\n    WavLM-style ASR checkpoints\n\n\nWhy CTC models can work better for chunked audio:\n\n  * they produce frame-level logits,\n  * overlapping chunks can be merged more naturally,\n  * boundary handling is simpler than seq2seq decoding,\n  * Hugging Face pipelines support chunking/stride for many CTC ASR models.\n\n\n\nReferences:\n\n  * Hugging Face ASR chunking guide\n  * Hugging Face ASR task guide\n  * Wav2Vec2 docs\n\n\n\nExample shape:\n\n\n    from transformers import pipeline\n\n    pipe = pipeline(\n        \"automatic-speech-recognition\",\n        model=\"facebook/wav2vec2-base-960h\",\n    )\n\n    result = pipe(\n        audio_16k,\n        chunk_length_s=5,\n        stride_length_s=1,\n    )\n\n\nBut this still does **not** mean:\n\n\n    0.36-second Discord chunk → independent transcript\n\n\nIt means:\n\n\n    larger windows\n    + overlap/stride\n    + merge outputs\n\n\nCTC chunking may be worth testing, but it does not remove the need for:\n\n\n    audio conversion\n    buffering\n    VAD/endpointing\n    empty transcript filtering\n    feedback prevention\n\n\n* * *\n\n# 8. Option C — True streaming ASR models\n\nThis is the route closest to what you are asking for.\n\nStreaming ASR models are designed around:\n\n\n    small incoming chunks\n    + preserved model state/cache\n    + partial/final transcript updates\n\n\nCommon architectures include:\n\n\n    RNN-T / Transducer\n    Conformer / FastConformer\n    cache-aware streaming encoders\n\n\nThese are better suited to live voice agents than naive Whisper-per-chunk.\n\n* * *\n\n## NVIDIA Parakeet Unified\n\nnvidia/parakeet-unified-en-0.6b is a strong example.\n\nThe model card describes it as an English ASR model based on transducer architecture / RNN-T / FastConformer, supporting both offline and streaming inference in one model. 
It also mentions a minimum latency of 160 ms and configurable streaming chunk sizes from 2080 ms down to 160 ms in 80 ms steps.\n\nUseful links:\n\n  * nvidia/parakeet-unified-en-0.6b\n  * NVIDIA Parakeet ASR collection\n\n\n\nWhy it matters:\n\n\n    This is much closer to “streaming ASR” than plain Whisper.\n\n\nCaveat:\n\n\n    You still need to integrate its streaming/buffered-streaming API correctly.\n    Do not call it independently on every Discord chunk as if each chunk is a complete utterance.\n\n\n* * *\n\n## NVIDIA Nemotron Speech Streaming\n\nNVIDIA’s Nemotron Speech streaming ASR is another relevant route.\n\nThe Hugging Face blog describes cache-aware streaming inference for voice agents, with latency modes such as 80 ms, 160 ms, 560 ms, and 1.12 s. It also explains why cache-aware streaming is more efficient than repeatedly re-encoding overlapping windows.\n\nUseful links:\n\n  * NVIDIA blog: Scaling Real-Time Voice Agents with Cache-Aware Streaming ASR\n  * nvidia/nemotron-speech-streaming-en-0.6b\n\n\n\nWhy it matters:\n\n\n    This is the kind of architecture built for live voice agents.\n\n\nCaveat:\n\n\n    It is more engineering-heavy than faster-whisper.\n    Expect NeMo/runtime-specific setup and more integration work.\n\n\n* * *\n\n# 9. Practical comparison\n\nApproach | Can handle tiny chunks directly? | Setup difficulty | Good for Discord AI VTuber? 
| Recommendation\n---|---|---|---|---\nNaive Whisper per chunk | No | Low | Bad | Avoid\nfaster-whisper + VAD utterance buffering | Not directly; buffers into utterances | Medium-low | Good | Best first working route\nWhisper-Streaming | More streaming-like with local agreement | Medium-high | Good if you need partials | Try after basic STT works\nWav2Vec2/CTC + chunk/stride | Better chunk merging than Whisper | Medium | Maybe | Worth testing\nNVIDIA Parakeet/Nemotron streaming | Yes, designed for streaming modes | Higher | Strong candidate | Best true-streaming HF route\nCloud STT | Yes | Low-medium | Technically good | Not free/local long-term\n\n* * *\n\n# 10. Important: streaming ASR still needs a controller\n\nEven with a true streaming model, you still need:\n\n\n    correct audio conversion\n    RMS/peak validation\n    VAD or endpointing\n    partial/final transcript handling\n    empty transcript filtering\n    bot_is_speaking guard\n    TTS feedback prevention\n    Discord receive debugging\n\n\nStreaming ASR can help with this:\n\n\n    tiny chunks are too small for independent Whisper transcription\n\n\nIt does not automatically fix this:\n\n\n    empty transcript sent to Ollama\n    bot hears itself\n    wrong sample rate\n    bad amplitude\n    Discord receive broken\n    missing transcribe_audio function\n\n\nYour current logs show controller/audio-path issues clearly, so switching models first may hide the real bug.\n\n* * *\n\n# 11. 
What I would do in your exact case\n\n## Step 1 — Fix the current pipeline first\n\nBefore changing models, fix these:\n\n\n    define transcribe_audio()\n    block empty Ollama calls\n    save actual STT input WAVs\n    validate sample rate / duration / RMS / peak\n    prevent Chatterbox feedback\n\n\nYour current error:\n\n\n    NameError: name 'transcribe_audio' is not defined\n\n\nmeans model choice is not the first blocker.\n\n* * *\n\n## Step 2 — Make faster-whisper work on full utterances\n\nUse:\n\n\n    audio chunks\n    → mono 16 kHz conversion\n    → VAD\n    → utterance buffer\n    → faster-whisper\n    → reject blank\n    → Ollama\n\n\nThis is the best first stable version.\n\n* * *\n\n## Step 3 — If you need live partial captions, try Whisper-Streaming\n\nUse these:\n\n  * Whisper-Streaming paper\n  * Whisper-Streaming GitHub\n\n\n\nThis keeps Whisper-like behavior but adds a streaming policy layer.\n\n* * *\n\n## Step 4 — If you need true low-latency streaming, test Parakeet/Nemotron\n\nStart here:\n\n  * nvidia/parakeet-unified-en-0.6b\n  * NVIDIA Parakeet ASR collection\n  * nvidia/nemotron-speech-streaming-en-0.6b\n  * NVIDIA cache-aware streaming ASR blog\n\n\n\nThese are better for true streaming, but they are more complex than faster-whisper.\n\n* * *\n\n# 12. 
Best mental model\n\n## Whisper/faster-whisper route\n\n\n    small chunks\n    → buffer into complete utterance\n    → transcribe utterance\n\n\n## Streaming ASR route\n\n\n    small chunks\n    → preserve model state/cache\n    → emit partial/final transcript\n    → controller decides when text is final enough\n\n\n## Bad route\n\n\n    small chunk\n    → independent STT call\n    → send result to Ollama\n\n\nThat last route is what you should avoid.\n\n* * *\n\n# Bottom line\n\nWhisper **can** transcribe chunks, but it is bad at transcribing **tiny independent live chunks** like `5760` samples.\n\nIf you want to stay simple:\n\n\n    Use faster-whisper + Silero VAD + utterance buffering.\n\n\nIf you want streaming behavior while keeping Whisper-like transcription:\n\n\n    Try Whisper-Streaming.\n\n\nIf you want a real Hugging Face streaming ASR model:\n\n\n    Try NVIDIA Parakeet Unified or Nemotron Speech Streaming.\n\n\nBut do not skip the basics:\n\n\n    define transcribe_audio()\n    block empty Ollama calls\n    buffer audio\n    validate audio\n    save debug WAVs\n    prevent TTS feedback\n\n\nA streaming model can improve latency. It will not fix a broken controller or bad audio path.",
  "title": "How Do i Make Stt Work for my ai Vtuber on Discord Vc calls?"
}