Raw Record Source

{
  "$type": "site.standard.document",
  "bskyPostRef": {
    "cid": "bafyreif7qxfze5z2layu2ionwjj3cqvx6shb57samzc7rfq54jkikyb724",
    "uri": "at://did:plc:pgryn3ephfd2xgft23qokfzt/app.bsky.feed.post/3mkubvdrysli2"
  },
  "path": "/t/how-do-i-make-stt-work-for-my-ai-vtuber-on-discord-vc-calls/175621#post_9",
  "publishedAt": "2026-05-02T08:57:58.000Z",
  "site": "https://discuss.huggingface.co",
  "tags": [
    "Whisper-Streaming paper",
    "faster-whisper",
    "Silero VAD",
    "Hugging Face Whisper docs",
    "LiveKit turn detection docs",
    "Discord Voice Connections docs",
    "Pycord voice docs",
    "faster-whisper VAD filtering",
    "Whisper hallucination discussion: VAD + condition_on_previous_text=False",
    "Chatterbox Turbo discussion: CFG/exaggeration not supported",
    "Chatterbox TTS configuration warning",
    "Chatterbox-TTS-Server issue about Turbo warning",
    "LiveKit turn handling options",
    "LiveKit turn detection guide",
    "Discord Voice Connections",
    "discord-ext-voice-recv"
  ],
  "textContent": "There’s some other error in that log…\n\n* * *\n\n# Why raising `max_samples` will not fix short Discord STT chunks\n\n## Short answer\n\nNo, simply turning `max_samples` up will **not** fix your current issue.\n\nYour code:\n\n\n    # Limit audio length\n    max_samples = 16000 * 15  # 15 seconds max\n\n    if len(audio_np) > max_samples:\n        audio_np = audio_np[-max_samples:]\n        print(f\"[TRIM] Trimmed audio to last {max_samples} samples\")\n\n\nonly handles audio that is **too long**.\n\nIt says:\n\n\n    If audio is longer than 15 seconds, trim it down to the last 15 seconds.\n\n\nIt does **not** say:\n\n\n    Wait until I have enough audio before transcribing.\n\n\nYour new log shows the opposite problem:\n\n\n    [SEARCH] Audio analysis - size: 5760, max_amplitude: 1.304917\n    [MIC] Incoming audio | amp=1.304917 | samples=5760\n    [PROCESS] Processing audio: 5760 samples, 1.304917 max amplitude\n    [STT] Transcribing with improved local STT...\n    ...\n    Transcription error: name 'transcribe_audio' is not defined\n    ...\n    Transcribed text: ''\n    Sending to Ollama: '...'\n\n\n`5760 samples` is very short.\n\nAt different sample rates, that means:\n\nSample rate assumption | Duration\n---|---\n16 kHz | `5760 / 16000 = 0.36s`\n24 kHz | `5760 / 24000 = 0.24s`\n48 kHz | `5760 / 48000 = 0.12s`\n\nSo raising `max_samples` from 15 seconds to 30 seconds would not help. Your audio is not being cut because it is too long. It is being sent to STT before enough speech has accumulated.\n\nWhat you need is not a bigger maximum. 
You need:\n\n\n    minimum duration gate\n    + audio chunk buffering\n    + VAD / speech-end detection\n    + transcript validation\n    + empty transcript blocking\n\n\nUseful references:\n\n  * Whisper-Streaming paper\n  * faster-whisper\n  * Silero VAD\n  * Hugging Face Whisper docs\n  * LiveKit turn detection docs\n  * Discord Voice Connections docs\n  * Pycord voice docs\n\n\n\n* * *\n\n# What your new log says\n\nYou now have several separate problems at the same time.\n\n## 1. The audio chunk is too short\n\nThis line matters:\n\n\n    samples=5760\n\n\nAt 16 kHz, that is only 0.36 seconds.\n\nThat is not a complete utterance. It might be a breath, half a syllable, background noise, a clipped word, or a small piece of the bot’s own audio.\n\nWhisper-style models are not good at:\n\n\n    tiny fragment in\n    → reliable transcript out\n\n\nWhisper-style models are better at:\n\n\n    complete speech segment in\n    → transcript out\n\n\nThe Whisper-Streaming paper is relevant because it explicitly says Whisper is not designed for native real-time transcription. It wraps Whisper with a streaming policy so it can work on live/unsegmented speech.\n\nFor your bot, the practical translation is:\n\n\n    Do not transcribe tiny chunks.\n    Buffer chunks into completed speech turns.\n\n\n* * *\n\n## 2. The audio amplitude is still suspicious\n\nYour log says:\n\n\n    max_amplitude: 1.304917\n\n\nFor normalized float audio going into STT, you usually want roughly:\n\n\n    -1.0 to +1.0\n\n\nA peak above `1.0` can happen if there is gain/normalization, but it is suspicious enough to inspect. 
It may mean:\n\nPossible issue | Result\n---|---\nint16 PCM converted incorrectly | static / garbage waveform\nstereo interleaved audio treated as mono | distorted audio\ngain too high | clipping\ndouble normalization | harsh waveform\nwrong dtype | nonsense values\nwrong sample-rate path | sped-up or slowed-down speech\n\nThis can explain “when it works, it is off as heck.”\n\nBefore changing models, save the exact STT input as a WAV and listen to it.\n\n* * *\n\n## 3. Your STT function path is broken\n\nThis is a hard code bug:\n\n\n    NameError: name 'transcribe_audio' is not defined\n\n\nThat means your code tried to call:\n\n\n    transcribe_audio(...)\n\n\nbut no such function exists in that scope.\n\nSo that run did **not** prove anything about Whisper quality. The STT path crashed before a real transcription could happen.\n\nYou need either:\n\n\n    def transcribe_audio(audio_16k):\n        ...\n\n\nor change your code to call the function that actually exists.\n\nExample:\n\n\n    def safe_transcribe(audio_16k):\n        try:\n            return transcribe_audio(audio_16k)\n        except Exception as e:\n            print(f\"[STT] Transcription error: {e}\")\n            return \"\"\n\n\nIf `transcribe_audio` is not defined, every STT attempt becomes an empty transcript.\n\n* * *\n\n## 4. Empty transcripts are still being sent to Ollama\n\nThis is the most important controller bug.\n\nYour log shows:\n\n\n    Transcribed text: ''\n    You:\n    Sending to Ollama: '...'\n    Ollama response status: 200\n    ...\n    AI: What's good, chat? 
Ready to get this conversation started!\n\n\nThat means:\n\n\n    STT failed\n    → empty text\n    → sent to Ollama anyway\n    → Ollama generated a generic opener\n    → TTS generated audio\n\n\nThat creates a loop where the AI responds even though no valid user speech was heard.\n\nThis must be blocked.\n\n* * *\n\n# What `max_samples` actually does\n\nYour current code:\n\n\n    max_samples = 16000 * 15\n\n    if len(audio_np) > max_samples:\n        audio_np = audio_np[-max_samples:]\n\n\nmeans:\n\n\n    Keep at most the last 15 seconds.\n\n\nIt is an **upper cap**.\n\nIt only triggers when:\n\n\n    len(audio_np) > 240000\n\n\nBut your log has:\n\n\n    len(audio_np) = 5760\n\n\nSo:\n\n\n    5760 > 240000  # False\n\n\nNothing happens.\n\n## What you actually need\n\nYou need a **minimum** :\n\n\n    min_samples = int(16000 * 0.8)\n\n    if len(audio_np) < min_samples:\n        print(\"Too short; keep buffering instead of transcribing.\")\n        return \"\"\n\n\nBut even that is only a guard. The real fix is buffering.\n\n* * *\n\n# The correct idea: concatenate chunks before STT\n\nYour incoming chunks are tiny. That is normal for real-time audio.\n\nThe wrong pipeline is:\n\n\n    chunk 1 → STT\n    chunk 2 → STT\n    chunk 3 → STT\n    chunk 4 → STT\n\n\nThe better pipeline is:\n\n\n    chunk 1\n    + chunk 2\n    + chunk 3\n    + chunk 4\n    + ...\n    → enough speech collected\n    → STT once\n\n\nThe best pipeline is:\n\n\n    chunks\n    → VAD detects speech start\n    → buffer while user speaks\n    → VAD detects enough silence\n    → finalize utterance\n    → STT once\n\n\nThat is the difference between **chunk transcription** and **utterance transcription**.\n\n* * *\n\n# Minimal fix order\n\nDo these in this order.\n\n## 1. 
Define or correctly call `transcribe_audio`\n\nYour log has:\n\n\n    NameError: name 'transcribe_audio' is not defined\n\n\nFix that first.\n\nExample wrapper:\n\n\n    def transcribe_audio(audio_16k):\n        return transcribe_with_faster_whisper(audio_16k)\n\n\nOr rename the call:\n\n\n    # Wrong if transcribe_audio does not exist:\n    text = transcribe_audio(audio_np)\n\n    # Right if this is the function that actually exists:\n    text = transcribe_with_faster_whisper(audio_np)\n\n\nUntil this is fixed, STT cannot work.\n\n* * *\n\n## 2. Stop sending empty transcripts to Ollama\n\nAdd this immediately:\n\n\n    def should_send_to_ollama(text: str) -> bool:\n        text = (text or \"\").strip()\n\n        if not text:\n            return False\n\n        if len(text) < 2:\n            return False\n\n        bad_outputs = {\n            \".\",\n            \"...\",\n            \"you\",\n            \"thank you\",\n            \"thanks for watching\",\n            \"subscribe\",\n        }\n\n        if text.lower() in bad_outputs:\n            return False\n\n        return True\n\n\nUse it before every Ollama call:\n\n\n    text = safe_transcribe(audio_np)\n\n    if not should_send_to_ollama(text):\n        print(\"[CTRL] Empty/invalid transcript; not sending to Ollama.\")\n        return\n\n    send_to_ollama(text)\n\n\nThis prevents:\n\n\n    blank STT\n    → generic AI greeting\n    → TTS\n    → possible feedback loop\n\n\n* * *\n\n## 3. 
Add audio validation before STT\n\nUse this before calling Whisper/faster-whisper:\n\n\n    import numpy as np\n\n    def valid_audio_for_stt(audio_16k, sr=16000):\n        audio_16k = np.asarray(audio_16k, dtype=np.float32)\n\n        duration = len(audio_16k) / sr\n        peak = float(np.max(np.abs(audio_16k))) if len(audio_16k) else 0.0\n        rms = float(np.sqrt(np.mean(audio_16k ** 2))) if len(audio_16k) else 0.0\n\n        if duration < 0.8:\n            return False, f\"too short: {duration:.2f}s\"\n\n        if peak < 0.015:\n            return False, f\"too quiet: peak={peak:.4f}\"\n\n        if rms < 0.003:\n            return False, f\"too quiet: rms={rms:.4f}\"\n\n        if peak > 1.05:\n            return False, f\"bad normalization: peak={peak:.4f}\"\n\n        return True, \"ok\"\n\n\nYour current chunk would probably fail:\n\n\n    samples=5760\n    peak=1.304917\n\n\nThat is good. Bad audio should be rejected before STT.\n\n* * *\n\n# Simple concatenation buffer\n\nThis is not the final ideal version, but it is a useful first patch.\n\n\n    import numpy as np\n\n    class RollingSTTBuffer:\n        def __init__(self, sample_rate=16000, min_seconds=1.0, max_seconds=15.0):\n            self.sample_rate = sample_rate\n            self.min_samples = int(sample_rate * min_seconds)\n            self.max_samples = int(sample_rate * max_seconds)\n            self.buffer = np.zeros(0, dtype=np.float32)\n\n        def add(self, chunk):\n            chunk = np.asarray(chunk, dtype=np.float32)\n            self.buffer = np.concatenate([self.buffer, chunk])\n\n            if len(self.buffer) > self.max_samples:\n                self.buffer = self.buffer[-self.max_samples:]\n\n        def ready(self):\n            return len(self.buffer) >= self.min_samples\n\n        def pop(self):\n            audio = self.buffer\n            self.buffer = np.zeros(0, dtype=np.float32)\n            return audio\n\n\nUsage:\n\n\n    stt_buffer = RollingSTTBuffer(\n        
sample_rate=16000,\n        min_seconds=1.0,\n        max_seconds=15.0,\n    )\n\n    def handle_audio_chunk(chunk_16k):\n        stt_buffer.add(chunk_16k)\n\n        if not stt_buffer.ready():\n            print(\"[BUFFER] Not enough audio yet.\")\n            return\n\n        audio_for_stt = stt_buffer.pop()\n\n        text = safe_transcribe(audio_for_stt)\n\n        if not should_send_to_ollama(text):\n            return\n\n        send_to_ollama(text)\n\n\nThis proves whether concatenating chunks helps.\n\nBut it has a weakness: it transcribes after a fixed amount of audio, not after the user actually finishes speaking.\n\nThe better solution is VAD-based buffering.\n\n* * *\n\n# Better solution: VAD-based utterance buffering\n\nUse VAD to decide:\n\n\n    speech started\n    speech continued\n    speech ended\n\n\nThen transcribe the completed utterance.\n\nRecommended tools:\n\n  * Silero VAD\n  * faster-whisper VAD filtering\n  * LiveKit turn detection docs\n\n\n\nSilero VAD is useful because it supports 8 kHz and 16 kHz audio and is designed for fast chunk-level speech detection.\n\n## VAD-based utterance buffer\n\n\n    import numpy as np\n\n    class UtteranceBuffer:\n        def __init__(\n            self,\n            sample_rate=16000,\n            min_speech_seconds=0.8,\n            end_silence_ms=900,\n            max_seconds=15.0,\n        ):\n            self.sample_rate = sample_rate\n            self.min_speech_samples = int(sample_rate * min_speech_seconds)\n            self.end_silence_ms = end_silence_ms\n            self.max_samples = int(sample_rate * max_seconds)\n\n            self.frames = []\n            self.speaking = False\n            self.silence_ms = 0.0\n            self.speech_samples = 0\n\n        def _frame_ms(self, frame):\n            return 1000.0 * len(frame) / self.sample_rate\n\n        def push(self, frame_16k, is_speech: bool):\n            frame_16k = np.asarray(frame_16k, dtype=np.float32)\n\n            if 
is_speech:\n                self.speaking = True\n                self.silence_ms = 0.0\n                self.speech_samples += len(frame_16k)\n                self.frames.append(frame_16k)\n\n            elif self.speaking:\n                self.silence_ms += self._frame_ms(frame_16k)\n                self.frames.append(frame_16k)\n\n            else:\n                return None\n\n            audio = (\n                np.concatenate(self.frames)\n                if self.frames\n                else np.zeros(0, dtype=np.float32)\n            )\n\n            if len(audio) > self.max_samples:\n                audio = audio[-self.max_samples:]\n                self.frames = [audio]\n\n            if self.speaking and self.silence_ms >= self.end_silence_ms:\n                utterance = (\n                    np.concatenate(self.frames)\n                    if self.frames\n                    else np.zeros(0, dtype=np.float32)\n                )\n\n                enough_speech = self.speech_samples >= self.min_speech_samples\n\n                self.frames = []\n                self.speaking = False\n                self.silence_ms = 0.0\n                self.speech_samples = 0\n\n                if not enough_speech:\n                    print(\"[VAD] Dropped utterance: too little speech\")\n                    return None\n\n                return utterance\n\n            return None\n\n\nConceptual usage:\n\n\n    utt_buffer = UtteranceBuffer(sample_rate=16000)\n\n    def process_audio_frame(frame_16k):\n        is_speech = vad_is_speech(frame_16k)  # implement with Silero/WebRTC/etc.\n\n        utterance = utt_buffer.push(frame_16k, is_speech=is_speech)\n\n        if utterance is None:\n            return \"\"\n\n        text = safe_transcribe(utterance)\n\n        if not should_send_to_ollama(text):\n            return \"\"\n\n        send_to_ollama(text)\n        return text\n\n\nThis is the direction you want.\n\n* * *\n\n# faster-whisper starter 
config\n\nUse `faster-whisper` instead of a hand-rolled “simple Whisper STT” path if possible.\n\nReference:\n\n  * faster-whisper\n  * Hugging Face Whisper docs\n  * Whisper hallucination discussion: VAD + condition_on_previous_text=False\n\n\n\nExample:\n\n\n    from faster_whisper import WhisperModel\n    import numpy as np\n\n    model = WhisperModel(\n        \"small.en\",\n        device=\"cpu\",        # use \"cuda\" if available\n        compute_type=\"int8\", # use \"float16\" on CUDA\n    )\n\n    def transcribe_with_faster_whisper(audio_16k: np.ndarray) -> str:\n        audio_16k = np.asarray(audio_16k, dtype=np.float32)\n\n        # Validate before clipping; clipping first would hide a peak above 1.05.\n        ok, reason = valid_audio_for_stt(audio_16k, sr=16000)\n        if not ok:\n            print(\"[STT] Skipping:\", reason)\n            return \"\"\n\n        audio_16k = np.clip(audio_16k, -1.0, 1.0)\n\n        segments, info = model.transcribe(\n            audio_16k,\n            language=\"en\",\n            task=\"transcribe\",\n            beam_size=1,\n            temperature=0.0,\n            condition_on_previous_text=False,\n            vad_filter=True,\n            vad_parameters={\n                \"min_silence_duration_ms\": 700,\n                \"speech_pad_ms\": 300,\n            },\n            no_speech_threshold=0.6,\n            compression_ratio_threshold=1.35,\n            log_prob_threshold=-1.0,\n        )\n\n        return \" \".join(seg.text.strip() for seg in segments).strip()\n\n\nWhy these settings help:\n\nSetting | Reason\n---|---\n`language=\"en\"` | Avoids unstable language detection on short clips\n`task=\"transcribe\"` | Prevents accidental translation\n`beam_size=1` | Lower latency\n`temperature=0.0` | More deterministic\n`condition_on_previous_text=False` | Reduces carry-over hallucination between short turns\n`vad_filter=True` | Extra silence cleanup\n`min_silence_duration_ms=700` | Reasonable conversational silence threshold\n`speech_pad_ms=300` | Avoids cutting word 
edges\n`no_speech_threshold=0.6` | Helps ignore no-speech chunks\n`compression_ratio_threshold=1.35` | Helps catch repetitive hallucinations\n`log_prob_threshold=-1.0` | Helps catch low-confidence output\n\n* * *\n\n# Save the exact STT input as WAV\n\nThis is still the most important debug step.\n\n\n    # deps:\n    # pip install soundfile numpy\n\n    import numpy as np\n    import soundfile as sf\n    from pathlib import Path\n\n    debug_dir = Path(\"debug_stt\")\n    debug_dir.mkdir(exist_ok=True)\n\n    def save_debug_wav(audio, sr, filename):\n        audio = np.asarray(audio, dtype=np.float32)\n        audio = np.clip(audio, -1.0, 1.0)\n        sf.write(debug_dir / filename, audio, sr)\n\n\nUse it right before STT:\n\n\n    save_debug_wav(audio_16k, 16000, \"actual_stt_input.wav\")\n\n\nThen listen.\n\nWhat the WAV sounds like | Diagnosis\n---|---\nSilence | wrong source / VAD issue / Discord receive issue\nStatic | dtype/decode issue\nFast voice | sample-rate mismatch\nSlow voice | sample-rate mismatch\nDistorted/clipped | normalization/gain issue\nHalf-word only | chunking problem\nBot voice | TTS feedback loop\nMultiple speakers | need per-user buffers\nClean sentence | STT settings/model issue\n\nDo not skip this. 
It usually reveals the actual problem faster than changing models.\n\n* * *\n\n# Sample rates in your system\n\nYou now likely have several sample rates:\n\n\n    STT target: 16000 Hz\n    Chatterbox TTS output: 24000 Hz\n    Discord voice audio: commonly 48000 Hz stereo/Opus/PCM path\n\n\nYour log says:\n\n\n    TTS result: sr=24000, audio_shape=(86400,)\n\n\nThat is:\n\n\n    86400 / 24000 = 3.6 seconds\n\n\nChatterbox generated 3.6 seconds of TTS audio.\n\nThat audio should go to the TTS/playback path, not the STT input path.\n\nKeep these separate:\n\n\n    Input audio → 16 kHz mono → STT\n    TTS audio → Discord playback format → Discord output\n\n\nDo not let Chatterbox/TTS output leak into your mic/STT input.\n\n* * *\n\n# Is Chatterbox affecting STT?\n\nProbably not directly.\n\nChatterbox is TTS. It generates speech. It does not transcribe speech.\n\nBut it can affect your STT system indirectly in three ways.\n\n## 1. Feedback loop\n\nIf the bot’s generated voice is captured by your mic or virtual audio cable, the STT system may hear the bot instead of you.\n\nBad routing:\n\n\n    TTS output\n    → speakers / desktop mix / virtual cable\n    → STT input\n    → bot hears itself\n    → bot replies to itself\n\n\nBetter routing:\n\n\n    Human mic or per-user Discord receive\n    → STT\n\n    Bot TTS\n    → Discord output only\n\n\n## 2. Sample-rate confusion\n\nChatterbox output is 24 kHz in your log.\n\nSTT should usually get 16 kHz mono.\n\nDiscord playback often involves 48 kHz audio.\n\nSo do not reuse one conversion path for everything.\n\n## 3. The Turbo warning is not your STT bug\n\nYour log says:\n\n\n    WARNING - CFG, min_p and exaggeration are not supported by Turbo version and will be ignored.\n\n\nThat warning is about Chatterbox Turbo TTS settings. 
It means those generation settings are ignored by the Turbo model.\n\nRelevant links:\n\n  * Chatterbox Turbo discussion: CFG/exaggeration not supported\n  * Chatterbox TTS configuration warning\n  * Chatterbox-TTS-Server issue about Turbo warning\n\n\n\nThat warning can affect TTS behavior/customization, but it does not explain blank STT.\n\n* * *\n\n# Add a `bot_is_speaking` guard while debugging\n\nFor the first stable version, disable listening while the bot speaks.\n\n\n    bot_is_speaking = False\n\n\nAround TTS playback:\n\n\n    bot_is_speaking = True\n    play_tts_audio(...)\n    bot_is_speaking = False\n\n\nIn audio handling:\n\n\n    def handle_audio_chunk(chunk_16k):\n        if bot_is_speaking:\n            print(\"[AUDIO] Ignoring input while bot is speaking.\")\n            return\n\n        # continue STT path\n\n\nThis disables barge-in, but it prevents feedback while debugging.\n\nLater, implement real barge-in:\n\n\n    if human starts speaking while bot speaks:\n        stop TTS\n        clear playback queue\n        cancel current LLM/TTS response\n        return to listening\n\n\nLive voice-agent systems treat turn detection and interruption handling as separate concerns. 
See:\n\n  * LiveKit turn detection docs\n  * LiveKit turn handling options\n  * LiveKit turn detection guide\n\n\n\n* * *\n\n# Better logs to add\n\nYour logs should include:\n\n\n    sample_rate\n    samples\n    duration_seconds\n    min\n    max\n    peak\n    rms\n    bot_is_speaking\n    buffer_size\n    vad_state\n    utterance_ready\n    stt_called\n    ollama_called\n\n\nExample logging helper:\n\n\n    import numpy as np\n\n    def log_audio_debug(label, audio, sr):\n        audio = np.asarray(audio, dtype=np.float32)\n        duration = len(audio) / sr if sr else 0.0\n        peak = float(np.max(np.abs(audio))) if len(audio) else 0.0\n        rms = float(np.sqrt(np.mean(audio ** 2))) if len(audio) else 0.0\n\n        print(\n            f\"[{label}] sr={sr} samples={len(audio)} \"\n            f\"duration={duration:.3f}s peak={peak:.4f} rms={rms:.4f}\"\n        )\n\n\nHealthy logs should look like:\n\n\n    [MIC] sr=16000 samples=320 duration=0.020s peak=0.12 rms=0.02\n    [VAD] speech_start\n    [BUFFER] speech_ms=1240 silence_ms=0\n    [VAD] endpoint after silence_ms=900\n    [UTTERANCE] sr=16000 samples=35680 duration=2.23s peak=0.44 rms=0.06\n    [STT] text=\"can you hear me now\"\n    [OLLAMA] sending valid transcript\n\n\nUnhealthy logs look like:\n\n\n    samples=5760\n    transcribe immediately\n    NameError\n    empty text\n    send to Ollama anyway\n\n\n* * *\n\n# Discord-specific note\n\nIf you are receiving audio from Discord VC, remember that Discord receive is its own fragile layer.\n\nDiscord voice docs:\n\n  * Discord Voice Connections\n\n\n\nPycord warns recording/listening may be affected by DAVE:\n\n  * Pycord voice docs\n\n\n\nReceive extension:\n\n  * discord-ext-voice-recv\n\n\n\nBefore debugging STT, prove Discord receive works by saving clean WAV files:\n\n\n    Discord receive\n    → decode/convert\n    → save WAV\n    → listen manually\n\n\nOnly after the WAV sounds correct should you send it into STT.\n\n* * *\n\n# Recommended 
build order\n\n## Phase 1: local mic STT only\n\n\n    local mic\n    → VAD\n    → utterance buffer\n    → faster-whisper\n    → print transcript\n\n\nPass criteria:\n\n\n    silence produces no transcript\n    one sentence produces one transcript\n    partial speech is not sent\n    empty text is ignored\n\n\n## Phase 2: add Ollama\n\n\n    local mic\n    → STT\n    → Ollama\n    → print reply\n\n\nPass criteria:\n\n\n    Ollama is called only for real speech\n    blank transcripts are ignored\n\n\n## Phase 3: add Chatterbox locally\n\n\n    local mic\n    → STT\n    → Ollama\n    → Chatterbox TTS\n    → local playback\n\n\nPass criteria:\n\n\n    the bot does not hear itself\n    the bot does not respond to its own voice\n\n\n## Phase 4: send TTS to Discord\n\n\n    local mic\n    → STT\n    → Ollama\n    → Chatterbox TTS\n    → Discord VC output\n\n\n## Phase 5: add Discord receive later\n\nFirst:\n\n\n    Discord receive\n    → save clean WAV per speaker\n\n\nThen:\n\n\n    Discord receive\n    → per-user VAD\n    → per-user STT\n    → speaker-labeled transcript\n\n\nDo not start with full Discord receive unless you need it. It adds several failure points.\n\n* * *\n\n# Final summary\n\nTurning `max_samples` up will not fix this because `max_samples` is an upper cap, not a minimum buffer target.\n\nYour immediate problems are:\n\n\n    1. You are calling STT on tiny chunks like 5760 samples.\n    2. Your waveform amplitude is suspiciously above 1.0.\n    3. Your code is calling a missing function: transcribe_audio.\n    4. Empty transcripts are still being sent to Ollama.\n    5. Chatterbox may be feeding back into STT if audio routing is not separated.\n\n\nFix order:\n\n\n    1. Define or correctly call transcribe_audio().\n    2. Stop sending empty transcripts to Ollama.\n    3. Add minimum-duration/RMS/peak validation before STT.\n    4. Concatenate chunks into a buffer.\n    5. Replace fixed buffering with VAD-based utterance buffering.\n    6. 
Keep Chatterbox/TTS output out of the STT input path.\n    7. Use faster-whisper with vad_filter=True and condition_on_previous_text=False.\n\n\nThe core rule:\n\n\n    Do not make Whisper transcribe chunks.\n    Make Whisper transcribe completed utterances.\n",
  "title": "How Do i Make Stt Work for my ai Vtuber on Discord Vc calls?"
}