How do I make STT work for my AI VTuber on Discord VC calls?

Hugging Face Forums [Unofficial] May 2, 2026

There’s some other error in that log…


Why raising max_samples will not fix short Discord STT chunks

Short answer

No, simply turning max_samples up will not fix your current issue.

Your code:

# Limit audio length
max_samples = 16000 * 15  # 15 seconds max

if len(audio_np) > max_samples:
    audio_np = audio_np[-max_samples:]
    print(f"[TRIM] Trimmed audio to last {max_samples} samples")

only handles audio that is too long.

It says:

If audio is longer than 15 seconds, trim it down to the last 15 seconds.

It does not say:

Wait until I have enough audio before transcribing.

Your new log shows the opposite problem:

[SEARCH] Audio analysis - size: 5760, max_amplitude: 1.304917
[MIC] Incoming audio | amp=1.304917 | samples=5760
[PROCESS] Processing audio: 5760 samples, 1.304917 max amplitude
[STT] Transcribing with improved local STT...
...
Transcription error: name 'transcribe_audio' is not defined
...
Transcribed text: ''
Sending to Ollama: '...'

5760 samples is very short.

At different sample rates, that means:

Sample rate assumption   Duration
16 kHz                   5760 / 16000 = 0.36 s
24 kHz                   5760 / 24000 = 0.24 s
48 kHz                   5760 / 48000 = 0.12 s

So raising max_samples from 15 seconds to 30 seconds would not help. Your audio is not being cut because it is too long. It is being sent to STT before enough speech has accumulated.
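The arithmetic in the table above can be checked with a few lines of Python. This is a minimal sketch; chunk_duration_seconds is a hypothetical helper name, not something from your code.

```python
def chunk_duration_seconds(num_samples: int, sample_rate: int) -> float:
    """Return how many seconds of mono audio num_samples represents."""
    if sample_rate <= 0:
        raise ValueError("sample_rate must be positive")
    return num_samples / sample_rate


# The 5760-sample chunk from the log, under three sample-rate assumptions.
for rate in (16000, 24000, 48000):
    print(f"{rate} Hz -> {chunk_duration_seconds(5760, rate):.2f}s")

# The 15-second cap only triggers above this many samples at 16 kHz.
print(16000 * 15)
```

Even under the most generous assumption (16 kHz), the chunk is far below one second, while the cap sits at 240000 samples.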

What you need is not a bigger maximum. You need:

minimum duration gate
+ audio chunk buffering
+ VAD / speech-end detection
+ transcript validation
+ empty transcript blocking

Useful references:

  • Whisper-Streaming paper
  • faster-whisper
  • Silero VAD
  • Hugging Face Whisper docs
  • LiveKit turn detection docs
  • Discord Voice Connections docs
  • Pycord voice docs

What your new log says

You now have several separate problems at the same time.

1. The audio chunk is too short

This line matters:

samples=5760

At 16 kHz, that is only 0.36 seconds.

That is not a complete utterance. It might be a breath, half a syllable, background noise, a clipped word, or a small piece of the bot’s own audio.

Whisper-style models are not good at:

tiny fragment in
→ reliable transcript out

Whisper-style models are better at:

complete speech segment in
→ transcript out

The Whisper-Streaming paper is relevant because it explicitly says Whisper is not designed for native real-time transcription. It wraps Whisper with a streaming policy so it can work on live/unsegmented speech.

For your bot, the practical translation is:

Do not transcribe tiny chunks.
Buffer chunks into completed speech turns.

2. The audio amplitude is still suspicious

Your log says:

max_amplitude: 1.304917

For normalized float audio going into STT, you usually want roughly:

-1.0 to +1.0

A peak above 1.0 can happen if there is gain/normalization, but it is suspicious enough to inspect. It may mean:

Possible issue                             Result
int16 PCM converted incorrectly            static / garbage waveform
stereo interleaved audio treated as mono   distorted audio
gain too high                              clipping
double normalization                       harsh waveform
wrong dtype                                nonsense values
wrong sample-rate path                     sped-up or slowed-down speech

This can explain “when it works, it is off as heck.”

Before changing models, save the exact STT input as a WAV and listen to it.
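Several of the issues listed above come down to how raw PCM bytes are turned into floats. The sketch below assumes the input is interleaved 16-bit little-endian stereo PCM, which is an assumption about your receive path that you should verify. pcm16_stereo_to_float_mono is a hypothetical helper name.

```python
import numpy as np


def pcm16_stereo_to_float_mono(raw: bytes) -> np.ndarray:
    """Convert interleaved int16 stereo PCM bytes to float32 mono in [-1, 1]."""
    # One stereo frame is 4 bytes (2 channels x 2 bytes); drop any partial frame.
    raw = raw[: len(raw) - (len(raw) % 4)]
    samples = np.frombuffer(raw, dtype="<i2")        # little-endian int16
    stereo = samples.reshape(-1, 2).astype(np.float32)
    mono = stereo.mean(axis=1)                       # average left and right
    return mono / 32768.0                            # scale exactly once
```

If the peak of this function's output is still above 1.0, the extra gain is being applied somewhere after conversion.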


3. Your STT function path is broken

This is a hard code bug:

NameError: name 'transcribe_audio' is not defined

That means your code tried to call:

transcribe_audio(...)

but no such function exists in that scope.

So that run did not prove anything about Whisper quality. The STT path crashed before a real transcription could happen.

You need either:

def transcribe_audio(audio_16k):
    ...

or change your code to call the function that actually exists.

Example:

def safe_transcribe(audio_16k):
    try:
        return transcribe_audio(audio_16k)
    except Exception as e:
        print(f"[STT] Transcription error: {e}")
        return ""

With this wrapper, a missing transcribe_audio no longer crashes the pipeline, but every STT attempt still comes back as an empty transcript. The function itself has to exist.


4. Empty transcripts are still being sent to Ollama

This is the most important controller bug.

Your log shows:

Transcribed text: ''
You:
Sending to Ollama: '...'
Ollama response status: 200
...
AI: What's good, chat? Ready to get this conversation started!

That means:

STT failed
→ empty text
→ sent to Ollama anyway
→ Ollama generated a generic opener
→ TTS generated audio

That creates a loop where the AI responds even though no valid user speech was heard.

This must be blocked.


What max_samples actually does

Your current code:

max_samples = 16000 * 15

if len(audio_np) > max_samples:
    audio_np = audio_np[-max_samples:]

means:

Keep at most the last 15 seconds.

It is an upper cap.

It only triggers when:

len(audio_np) > 240000

But your log has:

len(audio_np) = 5760

So:

5760 > 240000  # False

Nothing happens.

What you actually need

You need a minimum:

min_samples = int(16000 * 0.8)

if len(audio_np) < min_samples:
    print("Too short; keep buffering instead of transcribing.")
    return ""

But even that is only a guard. The real fix is buffering.


The correct idea: concatenate chunks before STT

Your incoming chunks are tiny. That is normal for real-time audio.

The wrong pipeline is:

chunk 1 → STT
chunk 2 → STT
chunk 3 → STT
chunk 4 → STT

The better pipeline is:

chunk 1
+ chunk 2
+ chunk 3
+ chunk 4
+ ...
→ enough speech collected
→ STT once

The best pipeline is:

chunks
→ VAD detects speech start
→ buffer while user speaks
→ VAD detects enough silence
→ finalize utterance
→ STT once

That is the difference between chunk transcription and utterance transcription.


Minimal fix order

Do these in this order.

1. Define or correctly call transcribe_audio

Your log has:

NameError: name 'transcribe_audio' is not defined

Fix that first.

Example wrapper:

def transcribe_audio(audio_16k):
    return transcribe_with_faster_whisper(audio_16k)

Or rename the call:

# Wrong if transcribe_audio does not exist:
text = transcribe_audio(audio_np)

# Right if this is the function that actually exists:
text = transcribe_with_faster_whisper(audio_np)

Until this is fixed, STT cannot work.


2. Stop sending empty transcripts to Ollama

Add this immediately:

def should_send_to_ollama(text: str) -> bool:
    text = (text or "").strip()

    if not text:
        return False

    if len(text) < 2:
        return False

    bad_outputs = {
        ".",
        "...",
        "you",
        "thank you",
        "thanks for watching",
        "subscribe",
    }

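    # Compare without surrounding punctuation so "Thank you." is also caught.
    normalized = text.lower().strip(" .!?,")
    if not normalized or normalized in bad_outputs:
        return False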

    return True

Use it before every Ollama call:

text = safe_transcribe(audio_np)

if not should_send_to_ollama(text):
    print("[CTRL] Empty/invalid transcript; not sending to Ollama.")
    return

send_to_ollama(text)

This prevents:

blank STT
→ generic AI greeting
→ TTS
→ possible feedback loop

3. Add audio validation before STT

Use this before calling Whisper/faster-whisper:

import numpy as np

def valid_audio_for_stt(audio_16k, sr=16000):
    audio_16k = np.asarray(audio_16k, dtype=np.float32)

    duration = len(audio_16k) / sr
    peak = float(np.max(np.abs(audio_16k))) if len(audio_16k) else 0.0
    rms = float(np.sqrt(np.mean(audio_16k ** 2))) if len(audio_16k) else 0.0

    if duration < 0.8:
        return False, f"too short: {duration:.2f}s"

    if peak < 0.015:
        return False, f"too quiet: peak={peak:.4f}"

    if rms < 0.003:
        return False, f"too quiet: rms={rms:.4f}"

    if peak > 1.05:
        return False, f"bad normalization: peak={peak:.4f}"

    return True, "ok"

Your current chunk would fail on duration alone (0.36 s at 16 kHz is below the 0.8 s minimum), and its peak is above 1.05 as well:

samples=5760
peak=1.304917

That is good. Bad audio should be rejected before STT.


Simple concatenation buffer

This is not the final ideal version, but it is a useful first patch.

import numpy as np

class RollingSTTBuffer:
    def __init__(self, sample_rate=16000, min_seconds=1.0, max_seconds=15.0):
        self.sample_rate = sample_rate
        self.min_samples = int(sample_rate * min_seconds)
        self.max_samples = int(sample_rate * max_seconds)
        self.buffer = np.zeros(0, dtype=np.float32)

    def add(self, chunk):
        chunk = np.asarray(chunk, dtype=np.float32)
        self.buffer = np.concatenate([self.buffer, chunk])

        if len(self.buffer) > self.max_samples:
            self.buffer = self.buffer[-self.max_samples:]

    def ready(self):
        return len(self.buffer) >= self.min_samples

    def pop(self):
        audio = self.buffer
        self.buffer = np.zeros(0, dtype=np.float32)
        return audio

Usage:

stt_buffer = RollingSTTBuffer(
    sample_rate=16000,
    min_seconds=1.0,
    max_seconds=15.0,
)

def handle_audio_chunk(chunk_16k):
    stt_buffer.add(chunk_16k)

    if not stt_buffer.ready():
        print("[BUFFER] Not enough audio yet.")
        return

    audio_for_stt = stt_buffer.pop()

    text = safe_transcribe(audio_for_stt)

    if not should_send_to_ollama(text):
        return

    send_to_ollama(text)

This proves whether concatenating chunks helps.

But it has a weakness: it transcribes after a fixed amount of audio, not after the user actually finishes speaking.

The better solution is VAD-based buffering.


Better solution: VAD-based utterance buffering

Use VAD to decide:

speech started
speech continued
speech ended

Then transcribe the completed utterance.

Recommended tools:

  • Silero VAD
  • faster-whisper VAD filtering
  • LiveKit turn detection docs

Silero VAD is useful because it supports 8 kHz and 16 kHz audio and is designed for fast chunk-level speech detection.

VAD-based utterance buffer

import numpy as np

class UtteranceBuffer:
    def __init__(
        self,
        sample_rate=16000,
        min_speech_seconds=0.8,
        end_silence_ms=900,
        max_seconds=15.0,
    ):
        self.sample_rate = sample_rate
        self.min_speech_samples = int(sample_rate * min_speech_seconds)
        self.end_silence_ms = end_silence_ms
        self.max_samples = int(sample_rate * max_seconds)

        self.frames = []
        self.speaking = False
        self.silence_ms = 0.0
        self.speech_samples = 0

    def _frame_ms(self, frame):
        return 1000.0 * len(frame) / self.sample_rate

    def push(self, frame_16k, is_speech: bool):
        frame_16k = np.asarray(frame_16k, dtype=np.float32)

        if is_speech:
            self.speaking = True
            self.silence_ms = 0.0
            self.speech_samples += len(frame_16k)
            self.frames.append(frame_16k)

        elif self.speaking:
            self.silence_ms += self._frame_ms(frame_16k)
            self.frames.append(frame_16k)

        else:
            return None

        # Enforce the upper cap; only concatenate when the cap is exceeded.
        total_samples = sum(len(f) for f in self.frames)
        if total_samples > self.max_samples:
            audio = np.concatenate(self.frames)[-self.max_samples:]
            self.frames = [audio]

        if self.speaking and self.silence_ms >= self.end_silence_ms:
            utterance = (
                np.concatenate(self.frames)
                if self.frames
                else np.zeros(0, dtype=np.float32)
            )

            enough_speech = self.speech_samples >= self.min_speech_samples

            self.frames = []
            self.speaking = False
            self.silence_ms = 0.0
            self.speech_samples = 0

            if not enough_speech:
                print("[VAD] Dropped utterance: too little speech")
                return None

            return utterance

        return None

Conceptual usage:

utt_buffer = UtteranceBuffer(sample_rate=16000)

def process_audio_frame(frame_16k):
    is_speech = vad_is_speech(frame_16k)  # implement with Silero/WebRTC/etc.

    utterance = utt_buffer.push(frame_16k, is_speech=is_speech)

    if utterance is None:
        return ""

    text = safe_transcribe(utterance)

    if not should_send_to_ollama(text):
        return ""

    send_to_ollama(text)
    return text

This is the direction you want.
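The vad_is_speech call above is left unimplemented. As a stand-in for wiring things together, here is a crude energy-threshold detector. It is not Silero VAD and it will misfire on loud background noise; the 0.01 RMS threshold is an assumption to tune against your own recordings. Swap it for a real VAD once the buffering works.

```python
import numpy as np


def vad_is_speech(frame_16k, rms_threshold: float = 0.01) -> bool:
    """Crude stand-in VAD: True when the frame's RMS energy exceeds a threshold."""
    frame = np.asarray(frame_16k, dtype=np.float32)
    if frame.size == 0:
        return False
    rms = float(np.sqrt(np.mean(frame ** 2)))
    return rms > rms_threshold
```

Because it only measures loudness, this will also fire on the bot's own voice, so the routing advice later in this post still applies.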


faster-whisper starter config

Use faster-whisper instead of a hand-rolled “simple Whisper STT” path if possible.

Reference:

  • faster-whisper
  • Hugging Face Whisper docs
  • Whisper hallucination discussion: VAD + condition_on_previous_text=False

Example:

from faster_whisper import WhisperModel
import numpy as np

model = WhisperModel(
    "small.en",
    device="cpu",        # use "cuda" if available
    compute_type="int8", # use "float16" on CUDA
)

def transcribe_with_faster_whisper(audio_16k: np.ndarray) -> str:
    audio_16k = np.asarray(audio_16k, dtype=np.float32)

    # Validate before clipping; clipping first would hide a peak above 1.05.
    ok, reason = valid_audio_for_stt(audio_16k, sr=16000)
    if not ok:
        print("[STT] Skipping:", reason)
        return ""

    audio_16k = np.clip(audio_16k, -1.0, 1.0)

    segments, info = model.transcribe(
        audio_16k,
        language="en",
        task="transcribe",
        beam_size=1,
        temperature=0.0,
        condition_on_previous_text=False,
        vad_filter=True,
        vad_parameters={
            "min_silence_duration_ms": 700,
            "speech_pad_ms": 300,
        },
        no_speech_threshold=0.6,
        compression_ratio_threshold=1.35,
        log_prob_threshold=-1.0,
    )

    return " ".join(seg.text.strip() for seg in segments).strip()

Why these settings help:

Setting                            Reason
language="en"                      Avoids unstable language detection on short clips
task="transcribe"                  Prevents accidental translation
beam_size=1                        Lower latency
temperature=0.0                    More deterministic
condition_on_previous_text=False   Reduces carry-over hallucination between short turns
vad_filter=True                    Extra silence cleanup
min_silence_duration_ms=700        Reasonable conversational silence threshold
speech_pad_ms=300                  Avoids cutting word edges
no_speech_threshold=0.6            Helps ignore no-speech chunks
compression_ratio_threshold=1.35   Helps catch repetitive hallucinations (stricter than the library default of 2.4)
log_prob_threshold=-1.0            Helps catch low-confidence output
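Beyond the decoder thresholds above, each segment that faster-whisper yields carries avg_logprob, no_speech_prob, and compression_ratio attributes, so you can also filter after decoding. The sketch below is one way to do that. The cut-off values and the keep_segment / join_confident_segments names are assumptions, and FakeSegment exists only so the example runs without loading a model.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass
class FakeSegment:
    """Stand-in with the same attribute names as a faster-whisper segment."""
    text: str
    avg_logprob: float
    no_speech_prob: float
    compression_ratio: float


def keep_segment(seg, max_no_speech=0.6, min_logprob=-1.0, max_compression=2.4) -> bool:
    """Return True when a segment looks like real, confident speech."""
    if seg.no_speech_prob > max_no_speech:
        return False   # the model thinks this is probably silence or noise
    if seg.avg_logprob < min_logprob:
        return False   # low-confidence decoding
    if seg.compression_ratio > max_compression:
        return False   # highly repetitive text, often a hallucination
    return True


def join_confident_segments(segments: Iterable) -> str:
    """Join the text of the segments that pass keep_segment."""
    return " ".join(s.text.strip() for s in segments if keep_segment(s)).strip()


segments = [
    FakeSegment(" can you hear me now", -0.3, 0.05, 1.1),
    FakeSegment(" Thank you.", -1.4, 0.8, 0.9),
]
print(join_confident_segments(segments))
```

In the real pipeline you would pass the segments returned by model.transcribe(...) instead of the fake ones.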

Save the exact STT input as WAV

This is still the most important debug step.

# deps:
# pip install soundfile numpy

import numpy as np
import soundfile as sf
from pathlib import Path

debug_dir = Path("debug_stt")
debug_dir.mkdir(exist_ok=True)

def save_debug_wav(audio, sr, filename):
    audio = np.asarray(audio, dtype=np.float32)
    audio = np.clip(audio, -1.0, 1.0)
    sf.write(debug_dir / filename, audio, sr)

Use it right before STT:

save_debug_wav(audio_16k, 16000, "actual_stt_input.wav")

Then listen.

What the WAV sounds like   Diagnosis
Silence                    wrong source / VAD issue / Discord receive issue
Static                     dtype/decode issue
Fast voice                 sample-rate mismatch
Slow voice                 sample-rate mismatch
Distorted/clipped          normalization/gain issue
Half-word only             chunking problem
Bot voice                  TTS feedback loop
Multiple speakers          need per-user buffers
Clean sentence             STT settings/model issue

Do not skip this. It usually reveals the actual problem faster than changing models.


Sample rates in your system

You now likely have several sample rates:

STT target: 16000 Hz
Chatterbox TTS output: 24000 Hz
Discord voice audio: commonly 48000 Hz stereo/Opus/PCM path

Your log says:

TTS result: sr=24000, audio_shape=(86400,)

That is:

86400 / 24000 = 3.6 seconds

Chatterbox generated 3.6 seconds of TTS audio.

That audio should go to the TTS/playback path, not the STT input path.

Keep these separate:

Input audio → 16 kHz mono → STT
TTS audio → Discord playback format → Discord output

Do not let Chatterbox/TTS output leak into your mic/STT input.


Is Chatterbox affecting STT?

Probably not directly.

Chatterbox is TTS. It generates speech. It does not transcribe speech.

But it can affect your STT system indirectly in three ways.

1. Feedback loop

If the bot’s generated voice is captured by your mic or virtual audio cable, the STT system may hear the bot instead of you.

Bad routing:

TTS output
→ speakers / desktop mix / virtual cable
→ STT input
→ bot hears itself
→ bot replies to itself

Better routing:

Human mic or per-user Discord receive
→ STT

Bot TTS
→ Discord output only

2. Sample-rate confusion

Chatterbox output is 24 kHz in your log.

STT should usually get 16 kHz mono.

Discord playback often involves 48 kHz audio.

So do not reuse one conversion path for everything.
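One way to keep the paths apart is to give the STT input its own explicit conversion step. The sketch below downsamples 48 kHz mono float audio to 16 kHz by averaging each group of three samples. That is a crude low-pass plus decimation, chosen here only because it needs nothing beyond NumPy; a proper polyphase resampler such as scipy.signal.resample_poly will sound better. The function name is hypothetical, and the 48 kHz input rate is an assumption about your Discord path.

```python
import numpy as np


def downsample_48k_to_16k(audio_48k) -> np.ndarray:
    """Downsample 48 kHz mono float audio to 16 kHz by averaging groups of 3."""
    audio = np.asarray(audio_48k, dtype=np.float32)
    usable = len(audio) - (len(audio) % 3)   # drop up to 2 trailing samples
    if usable == 0:
        return np.zeros(0, dtype=np.float32)
    return audio[:usable].reshape(-1, 3).mean(axis=1)


one_second_48k = np.zeros(48000, dtype=np.float32)
print(len(downsample_48k_to_16k(one_second_48k)))  # 16000 samples = 1 second at 16 kHz
```

Keep this function on the input side only. Chatterbox's 24 kHz output needs its own conversion to whatever format your Discord playback expects.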

3. The Turbo warning is not your STT bug

Your log says:

WARNING - CFG, min_p and exaggeration are not supported by Turbo version and will be ignored.

That warning is about Chatterbox Turbo TTS settings. It means those generation settings are ignored by the Turbo model.

Relevant links:

  • Chatterbox Turbo discussion: CFG/exaggeration not supported
  • Chatterbox TTS configuration warning
  • Chatterbox-TTS-Server issue about Turbo warning

That warning can affect TTS behavior/customization, but it does not explain blank STT.


Add a bot_is_speaking guard while debugging

For the first stable version, disable listening while the bot speaks.

bot_is_speaking = False

Around TTS playback:

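bot_is_speaking = True  # inside a function, declare `global bot_is_speaking` first
try:
    play_tts_audio(...)  # assumed to block until playback finishes
finally:
    bot_is_speaking = False  # reset even if playback raises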

In audio handling:

def handle_audio_chunk(chunk_16k):
    if bot_is_speaking:
        print("[AUDIO] Ignoring input while bot is speaking.")
        return

    # continue STT path

This disables barge-in, but it prevents feedback while debugging.

Later, implement real barge-in:

if human starts speaking while bot speaks:
    stop TTS
    clear playback queue
    cancel current LLM/TTS response
    return to listening
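The steps above can be sketched with a thread-safe flag instead of a bare global, since audio callbacks and TTS playback often run on different threads. Everything here is hypothetical scaffolding: PlaybackController and the stop_playback callback stand in for whatever your playback code actually exposes.

```python
import threading
from typing import Callable


class PlaybackController:
    """Tracks whether the bot is speaking and supports interrupting it."""

    def __init__(self, stop_playback: Callable[[], None]):
        self._speaking = threading.Event()
        self._stop_playback = stop_playback   # hypothetical: stops TTS output

    def start_speaking(self) -> None:
        self._speaking.set()

    def finish_speaking(self) -> None:
        self._speaking.clear()

    @property
    def bot_is_speaking(self) -> bool:
        return self._speaking.is_set()

    def on_human_speech(self) -> bool:
        """Call when VAD detects human speech. Returns True if it interrupted the bot."""
        if not self._speaking.is_set():
            return False
        self._stop_playback()     # stop TTS and clear the playback queue
        self._speaking.clear()    # return to listening
        return True
```

Cancelling an in-flight Ollama request is not covered here; that depends on how your LLM call is issued.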

Live voice-agent systems treat turn detection and interruption handling as separate concerns. See:

  • LiveKit turn detection docs
  • LiveKit turn handling options
  • LiveKit turn detection guide

Better logs to add

Your logs should include:

sample_rate
samples
duration_seconds
min
max
peak
rms
bot_is_speaking
buffer_size
vad_state
utterance_ready
stt_called
ollama_called

Example logging helper:

import numpy as np

def log_audio_debug(label, audio, sr):
    audio = np.asarray(audio, dtype=np.float32)
    duration = len(audio) / sr if sr else 0.0
    peak = float(np.max(np.abs(audio))) if len(audio) else 0.0
    rms = float(np.sqrt(np.mean(audio ** 2))) if len(audio) else 0.0

    print(
        f"[{label}] sr={sr} samples={len(audio)} "
        f"duration={duration:.3f}s peak={peak:.4f} rms={rms:.4f}"
    )

Healthy logs should look like:

[MIC] sr=16000 samples=320 duration=0.020s peak=0.12 rms=0.02
[VAD] speech_start
[BUFFER] speech_ms=1240 silence_ms=0
[VAD] endpoint after silence_ms=900
[UTTERANCE] sr=16000 samples=35680 duration=2.23s peak=0.44 rms=0.06
[STT] text="can you hear me now"
[OLLAMA] sending valid transcript

Unhealthy logs look like:

samples=5760
transcribe immediately
NameError
empty text
send to Ollama anyway

Discord-specific note

If you are receiving audio from Discord VC, remember that Discord receive is its own fragile layer.

Discord voice docs:

  • Discord Voice Connections

Pycord warns recording/listening may be affected by DAVE:

  • Pycord voice docs

Receive extension:

  • discord-ext-voice-recv

Before debugging STT, prove Discord receive works by saving clean WAV files:

Discord receive
→ decode/convert
→ save WAV
→ listen manually

Only after the WAV sounds correct should you send it into STT.


Recommended build order

Phase 1: local mic STT only

local mic
→ VAD
→ utterance buffer
→ faster-whisper
→ print transcript

Pass criteria:

silence produces no transcript
one sentence produces one transcript
partial speech is not sent
empty text is ignored

Phase 2: add Ollama

local mic
→ STT
→ Ollama
→ print reply

Pass criteria:

Ollama is called only for real speech
blank transcripts are ignored

Phase 3: add Chatterbox locally

local mic
→ STT
→ Ollama
→ Chatterbox TTS
→ local playback

Pass criteria:

the bot does not hear itself
the bot does not respond to its own voice

Phase 4: send TTS to Discord

local mic
→ STT
→ Ollama
→ Chatterbox TTS
→ Discord VC output

Phase 5: add Discord receive later

First:

Discord receive
→ save clean WAV per speaker

Then:

Discord receive
→ per-user VAD
→ per-user STT
→ speaker-labeled transcript

Do not start with full Discord receive unless you need it. It adds several failure points.
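When you do reach the per-user stage, the main structural change is one buffer per speaker instead of one shared buffer. A minimal sketch, assuming your receive callback gives you a user ID alongside each decoded 16 kHz mono chunk (the on_user_audio and pop_user_audio names and signatures are hypothetical):

```python
from collections import defaultdict

import numpy as np

# One growing list of chunks per Discord user ID.
user_chunks = defaultdict(list)


def on_user_audio(user_id: int, chunk_16k) -> None:
    """Append a decoded 16 kHz mono chunk to that user's own buffer."""
    user_chunks[user_id].append(np.asarray(chunk_16k, dtype=np.float32))


def pop_user_audio(user_id: int) -> np.ndarray:
    """Return and clear everything buffered for one user."""
    chunks = user_chunks.pop(user_id, [])
    if not chunks:
        return np.zeros(0, dtype=np.float32)
    return np.concatenate(chunks)
```

The same idea extends to VAD state: each user gets an independent utterance buffer, so one person's silence never ends another person's turn.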


Final summary

Turning max_samples up will not fix this because max_samples is an upper cap, not a minimum buffer target.

Your immediate problems are:

1. You are calling STT on tiny chunks like 5760 samples.
2. Your waveform amplitude is suspiciously above 1.0.
3. Your code is calling a missing function: transcribe_audio.
4. Empty transcripts are still being sent to Ollama.
5. Chatterbox may be feeding back into STT if audio routing is not separated.

Fix order:

1. Define or correctly call transcribe_audio().
2. Stop sending empty transcripts to Ollama.
3. Add minimum-duration/RMS/peak validation before STT.
4. Concatenate chunks into a buffer.
5. Replace fixed buffering with VAD-based utterance buffering.
6. Keep Chatterbox/TTS output out of the STT input path.
7. Use faster-whisper with vad_filter=True and condition_on_previous_text=False.

The core rule:

Do not make Whisper transcribe chunks.
Make Whisper transcribe completed utterances.
