
How do I make STT work for my AI VTuber on Discord VC calls?

Hugging Face Forums [Unofficial] May 2, 2026

Whisper can handle “chunks,” but you and the Whisper model probably mean slightly different things by “chunks.”

If you can use a different model that prioritizes real-time performance, that’s a viable option; however, if Whisper is already integrated and you can’t use another model, I think you’ll need to create a buffer. It comes down to whether you prioritize the final transcription quality or real-time streaming.


Can Whisper transcribe chunks? Could another Hugging Face model handle streaming better?

Short answer

Yes, Whisper can transcribe chunks in the basic sense: you can pass it a short audio array or short WAV segment and it may return text.

But Whisper is not a true streaming ASR model in the sense of:

20 ms Discord audio frame in
→ stable partial transcript out
→ updated transcript as more audio arrives

That difference is the important part.

Your current chunk is roughly:

5760 samples / 16000 Hz = 0.36 seconds

That is extremely short for independent Whisper transcription. It may contain only part of a syllable, a clipped word edge, a breath, keyboard noise, silence, Discord compression residue, or TTS feedback.
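To make the arithmetic concrete, here is a small sketch that computes a chunk's duration and how many such chunks you would need to buffer before reaching a target window. The function names and the 2-second target are illustrative assumptions, not something from your code.

```python
import math


def chunk_duration_s(num_samples: int, sample_rate_hz: int) -> float:
    """Return the duration of an audio chunk in seconds."""
    return num_samples / sample_rate_hz


def chunks_needed(target_s: float, num_samples: int, sample_rate_hz: int) -> int:
    """Return how many chunks of this size cover at least target_s seconds."""
    return math.ceil(target_s / chunk_duration_s(num_samples, sample_rate_hz))


print(chunk_duration_s(5760, 16000))    # 0.36
print(chunks_needed(2.0, 5760, 16000))  # 6
```

So even a modest 2-second window means collecting about six of your current callbacks before calling STT once.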

So the practical answer is:

Whisper can transcribe chunks, but not tiny independent Discord callback chunks reliably.

For Whisper/faster-whisper, “chunking” should usually mean:

small audio frames
→ buffered into a larger speech window
→ optional VAD trimming
→ optional overlap/stride
→ transcribe meaningful segment

Not:

tiny Discord frame
→ independent STT call
→ send result to Ollama

Useful references:

  • Whisper-Streaming paper
  • Whisper-Streaming GitHub
  • OpenAI Whisper real-time discussion
  • Hugging Face ASR chunking guide
  • faster-whisper

1. The important distinction: chunks vs streaming

These are not the same thing.

Term | Meaning | Good fit for tiny Discord chunks?
--- | --- | ---
Independent chunk transcription | Each audio chunk is treated as a complete standalone clip | Usually bad
Chunked transcription with overlap/stride | Larger chunks are decoded with left/right context so boundary errors are reduced | Better
Utterance-based STT | VAD detects speech start/end, then STT transcribes the completed utterance | Best first version
True streaming ASR | Model keeps state/cache and emits partial/final text incrementally | Best for low-latency live captions
Raw Discord frame transcription | Every small callback/frame goes straight to STT | Usually the failure mode

Your current system seems closest to this:

small audio callback
→ immediate STT
→ empty or bad transcript
→ empty transcript still sent to Ollama

That is the wrong shape for Whisper.
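The last arrow in that flow is the cheapest one to fix. A minimal sketch of a guard that refuses to forward empty or punctuation-only transcripts could look like this. `send_to_llm` is a hypothetical stand-in for whatever function calls Ollama in your bot, and the `MIN_CHARS` threshold is an assumption you should tune.

```python
MIN_CHARS = 2  # assumption: anything shorter is treated as noise


def is_usable_transcript(text) -> bool:
    """Return True only for text that contains real letters or digits."""
    if not text:
        return False
    cleaned = text.strip()
    if len(cleaned) < MIN_CHARS:
        return False
    # Reject results that are only punctuation, e.g. "..." or "?!"
    return any(ch.isalnum() for ch in cleaned)


def handle_transcript(text, send_to_llm) -> bool:
    """Forward text to the LLM only if it is usable; report whether it was sent."""
    if not is_usable_transcript(text):
        return False
    send_to_llm(text.strip())
    return True
```

With this in place, a blank STT result simply stops at the guard instead of producing an Ollama reply to nothing.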


2. Why Whisper struggles with your current chunks

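Whisper is a sequence-to-sequence model that processes audio in 30-second windows and pads shorter input with silence. It is strong, but it needs enough real audio context inside that window to infer words.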

It works best with something like:

1–15 seconds of speech-like audio
mostly intact word boundaries
reasonable volume
correct sample rate
silence/noise trimmed

It works badly with:

0.12–0.36 seconds of audio
half a word
wrong sample rate
wrong dtype
clipping / over-amplification
silence or no-speech
bot TTS leaking into input
Discord receive artifacts
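Several items on that list can be checked before STT ever runs. The following is a rough sketch that measures duration, RMS, and peak on a list of float samples in the range -1.0 to 1.0 and reports likely problems. The thresholds are assumptions for illustration, not values from Whisper or faster-whisper, and the pure-Python loops are only meant to keep the example dependency-free.

```python
import math


def audio_stats(samples, sample_rate_hz):
    """Return duration, RMS, and peak for float samples in [-1.0, 1.0]."""
    n = len(samples)
    if n == 0:
        return {"duration_s": 0.0, "rms": 0.0, "peak": 0.0}
    rms = math.sqrt(sum(s * s for s in samples) / n)
    peak = max(abs(s) for s in samples)
    return {"duration_s": n / sample_rate_hz, "rms": rms, "peak": peak}


def audio_problems(samples, sample_rate_hz,
                   min_duration_s=0.5, min_rms=0.005, max_peak=0.99):
    """Return a list of human-readable problems; empty means it looks usable."""
    problems = []
    if sample_rate_hz != 16000:
        problems.append("sample rate is not 16 kHz")
    stats = audio_stats(samples, sample_rate_hz)
    if stats["duration_s"] < min_duration_s:
        problems.append("too short")
    if stats["rms"] < min_rms:
        problems.append("near silent")
    if stats["peak"] > max_peak:
        problems.append("likely clipping")
    return problems
```

Logging the returned list next to each STT call makes it obvious whether a bad transcript came from bad input or from the model.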

The Whisper-Streaming paper states the key issue directly: Whisper is not designed for real-time transcription, so the authors built a streaming wrapper around it using local agreement and adaptive latency.

That means:

Whisper can be used in streaming systems,
but Whisper itself is not a native streaming recognizer.

3. What “chunking” should mean for Whisper

Bad Whisper chunking:

chunk 1 alone → text?
chunk 2 alone → text?
chunk 3 alone → text?

Better Whisper chunking:

audio stream
→ collect 1–5 seconds
→ add overlap/padding
→ transcribe
→ commit only stable/final text

Best first version for your AI VTuber:

audio stream
→ VAD detects speech start
→ buffer while user speaks
→ wait for 700–1200 ms silence
→ transcribe the completed utterance
→ reject empty/garbage
→ send valid text to Ollama

This is not “true streaming,” but it is usually the best first working design for a Discord AI VTuber.
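The utterance flow above can be sketched as a small state machine. Note the swap: this example uses a plain RMS energy threshold in place of a real VAD such as Silero VAD, only so that it runs without extra dependencies. The class name, the 800 ms silence limit, and the 0.01 threshold are all assumptions.

```python
import math


class UtteranceBuffer:
    """Collect frames while the user speaks; emit the utterance after silence."""

    def __init__(self, sample_rate_hz=16000, silence_ms=800, rms_threshold=0.01):
        self.silence_limit = sample_rate_hz * silence_ms // 1000  # in samples
        self.rms_threshold = rms_threshold
        self.samples = []
        self.trailing_silence = 0
        self.in_speech = False

    def _is_speech(self, frame):
        # Energy check standing in for a real VAD decision.
        if not frame:
            return False
        rms = math.sqrt(sum(s * s for s in frame) / len(frame))
        return rms >= self.rms_threshold

    def feed(self, frame):
        """Add one frame; return the finished utterance (list) or None."""
        if self._is_speech(frame):
            self.in_speech = True
            self.trailing_silence = 0
            self.samples.extend(frame)
            return None
        if not self.in_speech:
            return None  # silence before any speech: ignore it
        self.samples.extend(frame)
        self.trailing_silence += len(frame)
        if self.trailing_silence < self.silence_limit:
            return None  # short pause inside the utterance
        # Enough silence: drop the trailing silence and reset for the next turn.
        utterance = self.samples[:len(self.samples) - self.trailing_silence]
        self.samples, self.trailing_silence, self.in_speech = [], 0, False
        return utterance
```

You would call `feed()` from the audio callback and run STT only when it returns a list. Replacing `_is_speech` with a real VAD decision is the natural next step.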


4. Why overlap/stride matters

If you cut audio at arbitrary boundaries, words get chopped.

Example:

chunk 1: "can you hea"
chunk 2: "r me now"

A model may misread both chunks because neither one has the full word boundary context.

Hugging Face’s ASR chunking guide explains this for CTC models such as Wav2Vec2: chunks are decoded with stride/overlap so the model has context around the cut points, and the unreliable edges can be dropped/merged.

The same general idea matters for Whisper too, even though Whisper is not CTC-based:

do not decode arbitrary tiny independent slices

Use:

VAD padding
overlap
larger windows
or utterance-level transcription
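As an illustration of the overlap idea, this sketch slices a sample buffer into windows that share samples with their neighbours, so a word cut at one boundary appears whole in the adjacent window. The function name and parameters are hypothetical, and it does not merge the resulting transcripts, which is the harder part.

```python
def overlapping_windows(samples, window, overlap):
    """Return (start_index, window_samples) pairs with the given overlap."""
    if not 0 <= overlap < window:
        raise ValueError("overlap must be >= 0 and smaller than window")
    step = window - overlap
    windows = []
    start = 0
    while start < len(samples):
        windows.append((start, samples[start:start + window]))
        if start + window >= len(samples):
            break  # this window already reaches the end of the buffer
        start += step
    return windows


# 5-second windows with 1 second of overlap at 16 kHz:
# overlapping_windows(audio, window=5 * 16000, overlap=1 * 16000)
```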

5. Can another Hugging Face model handle chunks better?

Yes, but the details matter.

There are three realistic paths:

  1. Stay with Whisper/faster-whisper, but add VAD + utterance buffering.
  2. Try CTC models like Wav2Vec2 with chunking/stride.
  3. Use true streaming ASR models like NVIDIA Parakeet/Nemotron-style RNN-T/FastConformer models.

6. Option A — Stay with faster-whisper + VAD buffering

This is still my recommended first fix.

Use:

Silero VAD
+ utterance buffer
+ faster-whisper
+ transcript validation

References:

  • faster-whisper
  • Silero VAD
  • Whisper hallucination discussion: VAD + condition_on_previous_text=False

Why this is best first:

Reason | Explanation
--- | ---
Easier setup | Much easier than integrating a true streaming ASR runtime
Good accuracy | Whisper-family models are strong when audio is clean
Good enough latency | Utterance-based latency is acceptable for conversational bots
Fewer moving parts | You can debug audio conversion, VAD, STT, and Ollama separately

Recommended flow:

Discord/local mic chunks
→ convert to mono 16 kHz float32
→ VAD
→ buffer complete utterance
→ faster-whisper
→ reject blank/garbage
→ Ollama

This will likely solve more of your current issue than switching models immediately.
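The conversion step is where a lot of silent failures happen. Here is a dependency-free sketch, assuming your receive path hands you interleaved 48 kHz stereo signed 16-bit little-endian PCM, which is the usual Discord voice format. If your library already gives you something else, this does not apply as written. The 48 kHz to 16 kHz step here just averages groups of three samples, which is a crude stand-in for a proper resampler.

```python
import struct


def discord_pcm_to_mono_16k(pcm_bytes):
    """Convert 48 kHz stereo int16 LE PCM bytes to mono 16 kHz floats in [-1, 1]."""
    count = len(pcm_bytes) // 2  # number of int16 values
    ints = struct.unpack("<%dh" % count, pcm_bytes[:count * 2])
    # Average left and right, then scale int16 to roughly [-1.0, 1.0].
    mono = [(ints[i] + ints[i + 1]) / 2 / 32768.0 for i in range(0, count - 1, 2)]
    # 48 kHz -> 16 kHz: average each group of 3 samples (crude low-pass + decimate).
    usable = len(mono) - len(mono) % 3
    return [sum(mono[i:i + 3]) / 3 for i in range(0, usable, 3)]
```

A 20 ms Discord frame (3840 bytes in this format) comes out as 320 samples, which is a quick sanity check that the sample rate math is right.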


7. Option B — Wav2Vec2 / CTC models with chunking + stride

CTC models can be more natural for chunking than Whisper.

Examples:

Wav2Vec2
HuBERT
WavLM-style ASR checkpoints

Why CTC models can work better for chunked audio:

  • they produce frame-level logits,
  • overlapping chunks can be merged more naturally,
  • boundary handling is simpler than seq2seq decoding,
  • Hugging Face pipelines support chunking/stride for many CTC ASR models.

References:

  • Hugging Face ASR chunking guide
  • Hugging Face ASR task guide
  • Wav2Vec2 docs

Example shape:

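import numpy as np
from transformers import pipeline

# CTC model; the pipeline handles chunking and stride merging internally.
pipe = pipeline(
    "automatic-speech-recognition",
    model="facebook/wav2vec2-base-960h",
)

# Placeholder input: replace with your buffered mono 16 kHz float32 audio.
audio_16k = np.zeros(16000 * 10, dtype=np.float32)

result = pipe(
    audio_16k,
    chunk_length_s=5,   # decode in 5-second windows
    stride_length_s=1,  # 1 second of overlap on each side of a window
)
print(result["text"])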

But this still does not mean:

0.36-second Discord chunk → independent transcript

It means:

larger windows
+ overlap/stride
+ merge outputs

CTC chunking may be worth testing, but it does not remove the need for:

audio conversion
buffering
VAD/endpointing
empty transcript filtering
feedback prevention

8. Option C — True streaming ASR models

This is the route closest to what you are asking for.

Streaming ASR models are designed around:

small incoming chunks
+ preserved model state/cache
+ partial/final transcript updates

Common architectures include:

RNN-T / Transducer
Conformer / FastConformer
cache-aware streaming encoders

These are better suited to live voice agents than naive Whisper-per-chunk.


NVIDIA Parakeet Unified

nvidia/parakeet-unified-en-0.6b is a strong example.

The model card describes it as an English ASR model based on transducer architecture / RNN-T / FastConformer, supporting both offline and streaming inference in one model. It also mentions a minimum latency of 160 ms and configurable streaming chunk sizes from 2080 ms down to 160 ms in 80 ms steps.

Useful links:

  • nvidia/parakeet-unified-en-0.6b
  • NVIDIA Parakeet ASR collection

Why it matters:

This is much closer to “streaming ASR” than plain Whisper.

Caveat:

You still need to integrate its streaming/buffered-streaming API correctly.
Do not call it independently on every Discord chunk as if each chunk is a complete utterance.

NVIDIA Nemotron Speech Streaming

NVIDIA’s Nemotron Speech streaming ASR is another relevant route.

The Hugging Face blog describes cache-aware streaming inference for voice agents, with latency modes such as 80 ms, 160 ms, 560 ms, and 1.12 s. It also explains why cache-aware streaming is more efficient than repeatedly re-encoding overlapping windows.

Useful links:

  • NVIDIA blog: Scaling Real-Time Voice Agents with Cache-Aware Streaming ASR
  • nvidia/nemotron-speech-streaming-en-0.6b

Why it matters:

This is the kind of architecture built for live voice agents.

Caveat:

It is more engineering-heavy than faster-whisper.
Expect NeMo/runtime-specific setup and more integration work.

9. Practical comparison

Approach | Can handle tiny chunks directly? | Setup difficulty | Good for Discord AI VTuber? | Recommendation
--- | --- | --- | --- | ---
Naive Whisper per chunk | No | Low | Bad | Avoid
faster-whisper + VAD utterance buffering | Not directly; buffers into utterances | Medium-low | Good | Best first working route
Whisper-Streaming | More streaming-like with local agreement | Medium-high | Good if you need partials | Try after basic STT works
Wav2Vec2/CTC + chunk/stride | Better chunk merging than Whisper | Medium | Maybe | Worth testing
NVIDIA Parakeet/Nemotron streaming | Yes, designed for streaming modes | Higher | Strong candidate | Best true-streaming HF route
Cloud STT | Yes | Low-medium | Technically good | Not free/local long-term

10. Important: streaming ASR still needs a controller

Even with a true streaming model, you still need:

correct audio conversion
RMS/peak validation
VAD or endpointing
partial/final transcript handling
empty transcript filtering
bot_is_speaking guard
TTS feedback prevention
Discord receive debugging

Streaming ASR can help with this:

tiny chunks are too small for independent Whisper transcription

It does not automatically fix this:

empty transcript sent to Ollama
bot hears itself
wrong sample rate
bad amplitude
Discord receive broken
missing transcribe_audio function

Your current logs show controller/audio-path issues clearly, so switching models first may hide the real bug.
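One of those controller pieces, the bot_is_speaking guard, is small enough to sketch here. The idea is to drop incoming audio while TTS is playing and for a short hold-off afterwards, so the bot does not transcribe its own voice. The class name, method names, and the 0.5 s hold-off are assumptions. You would call `tts_started()` and `tts_finished()` from wherever your TTS playback begins and ends.

```python
import time


class SpeakingGate:
    """Block STT input while the bot is speaking and briefly afterwards."""

    def __init__(self, holdoff_s=0.5, clock=time.monotonic):
        self.holdoff_s = holdoff_s
        self.clock = clock  # injectable so the gate can be tested without sleeping
        self.bot_is_speaking = False
        self._last_tts_end = float("-inf")

    def tts_started(self):
        self.bot_is_speaking = True

    def tts_finished(self):
        self.bot_is_speaking = False
        self._last_tts_end = self.clock()

    def accepts_audio(self) -> bool:
        """Return True if incoming audio should be passed on to VAD/STT."""
        if self.bot_is_speaking:
            return False
        return self.clock() - self._last_tts_end >= self.holdoff_s
```

In the audio callback, checking `gate.accepts_audio()` first and discarding the frame when it returns False is usually enough to stop the feedback loop.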


11. What I would do in your exact case

Step 1 — Fix the current pipeline first

Before changing models, fix these:

define transcribe_audio()
block empty Ollama calls
save actual STT input WAVs
validate sample rate / duration / RMS / peak
prevent Chatterbox feedback
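For the "save actual STT input WAVs" item, the standard library is enough. This sketch writes exactly the float samples you are about to send to STT as a 16-bit mono WAV so you can listen to them. The function name and the default 16 kHz rate are assumptions.

```python
import struct
import wave


def save_debug_wav(path, samples, sample_rate_hz=16000):
    """Write float samples in [-1.0, 1.0] to a 16-bit mono WAV file."""
    # Clamp first so out-of-range values do not overflow int16.
    clipped = [max(-1.0, min(1.0, s)) for s in samples]
    frames = struct.pack(
        "<%dh" % len(clipped), *(int(s * 32767) for s in clipped)
    )
    with wave.open(path, "wb") as wav:
        wav.setnchannels(1)
        wav.setsampwidth(2)  # 2 bytes = 16-bit
        wav.setframerate(sample_rate_hz)
        wav.writeframes(frames)
```

If the saved file sounds sped up, slowed down, or like static, the problem is in conversion, not in the STT model.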

Your current error:

NameError: name 'transcribe_audio' is not defined

means model choice is not the first blocker.


Step 2 — Make faster-whisper work on full utterances

Use:

audio chunks
→ mono 16 kHz conversion
→ VAD
→ utterance buffer
→ faster-whisper
→ reject blank
→ Ollama

This is the best first stable version.
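For the faster-whisper step, a minimal sketch of the missing `transcribe_audio()` might look like the following. The model size, CPU/int8 settings, and `language="en"` are assumptions to adjust for your setup, and the signature is my guess at what your code expects, since I have not seen the caller.

```python
def load_model(model_size="small"):
    """Load a faster-whisper model (assumed settings: CPU, int8)."""
    # Imported lazily so the rest of the module loads without faster-whisper.
    from faster_whisper import WhisperModel
    return WhisperModel(model_size, device="cpu", compute_type="int8")


def transcribe_audio(model, audio_16k, language="en"):
    """Transcribe one complete utterance (mono 16 kHz float32 array or a path)."""
    segments, _info = model.transcribe(
        audio_16k,
        language=language,
        # Avoid carrying hallucinated text from one utterance into the next.
        condition_on_previous_text=False,
    )
    # segments is a generator; iterating it is what actually runs decoding.
    return " ".join(seg.text.strip() for seg in segments).strip()
```

Load the model once at startup, not per utterance, and pass the result through your blank/garbage check before it reaches Ollama.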


Step 3 — If you need live partial captions, try Whisper-Streaming

Use these:

  • Whisper-Streaming paper
  • Whisper-Streaming GitHub

This keeps Whisper-like behavior but adds a streaming policy layer.
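To give a feel for what that policy layer does, here is a heavily simplified sketch of the local agreement idea: re-transcribe the growing buffer, and only commit words that two consecutive hypotheses agree on as a common prefix. This is my own illustration, not code from the Whisper-Streaming repository, and it leaves out buffer trimming, timestamps, and everything else the real tool handles.

```python
def common_prefix(a, b):
    """Return the longest common prefix of two word lists."""
    prefix = []
    for x, y in zip(a, b):
        if x != y:
            break
        prefix.append(x)
    return prefix


class LocalAgreement:
    """Commit only words confirmed by two consecutive hypotheses."""

    def __init__(self):
        self.committed = []
        self.previous = []

    def update(self, hypothesis_words):
        """Feed the latest full hypothesis; return the newly committed words."""
        agreed = common_prefix(self.previous, hypothesis_words)
        self.previous = list(hypothesis_words)
        new_words = agreed[len(self.committed):]
        self.committed.extend(new_words)
        return new_words
```

The trailing words of each hypothesis stay uncommitted until the next pass confirms them, which is why the unstable tail of a partial transcript never reaches the LLM.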


Step 4 — If you need true low-latency streaming, test Parakeet/Nemotron

Start here:

  • nvidia/parakeet-unified-en-0.6b
  • NVIDIA Parakeet ASR collection
  • nvidia/nemotron-speech-streaming-en-0.6b
  • NVIDIA cache-aware streaming ASR blog

These are better for true streaming, but they are more complex than faster-whisper.


12. Best mental model

Whisper/faster-whisper route

small chunks
→ buffer into complete utterance
→ transcribe utterance

Streaming ASR route

small chunks
→ preserve model state/cache
→ emit partial/final transcript
→ controller decides when text is final enough

Bad route

small chunk
→ independent STT call
→ send result to Ollama

That last route is what you should avoid.


Bottom line

Whisper can transcribe chunks, but it is bad at transcribing tiny independent live chunks such as your 5760-sample (0.36-second) callbacks.

If you want to stay simple:

Use faster-whisper + Silero VAD + utterance buffering.

If you want streaming behavior while keeping Whisper-like transcription:

Try Whisper-Streaming.

If you want a real Hugging Face streaming ASR model:

Try NVIDIA Parakeet Unified or Nemotron Speech Streaming.

But do not skip the basics:

define transcribe_audio()
block empty Ollama calls
buffer audio
validate audio
save debug WAVs
prevent TTS feedback

A streaming model can improve latency. It will not fix a broken controller or bad audio path.
