Whisper API keeps returning empty transcript for videos longer than 30 minutes — stuck in production
The 25MB limit on the OpenAI Whisper API is the main bottleneck here. Compressing a 90-minute recording to fit that size kills the audio quality (hence the hallucinations or empty transcripts).
Since you’re looking for a production-ready solution that handles 60+ minutes and speaker labels, here are three solid approaches:
Deepgram API: It’s often the go-to for long-form audio. It accepts URLs directly, doesn’t have that strict 25MB limit, and has excellent diarization (speaker labels) that stays consistent throughout the entire recording.
AssemblyAI: Similar to Deepgram, it’s built for long files and handles speaker diarization much better than ‘stitching’ Whisper chunks together.
Self-hosted Whisper (Faster-Whisper): If you have the infra (or use a GPU cloud like RunPod), you can run
faster-whisper. You won’t have file size limits, and you can usepyannote-audiofor speaker labeling, though it requires more setup.
Sticking with OpenAI’s Whisper for 90-minute files will always feel like a hack because of the chunking/compression trade-off.The 25MB limit on the OpenAI Whisper API is the main bottleneck here. Compressing a 90-minute recording to fit that size kills the audio quality (hence the hallucinations or empty transcripts).
Since you’re looking for a production-ready solution that handles 60+ minutes and speaker labels, here are three solid approaches:
Deepgram API: It’s often the go-to for long-form audio. It accepts URLs directly, doesn’t have that strict 25MB limit, and has excellent diarization (speaker labels) that stays consistent throughout the entire recording.
AssemblyAI: Similar to Deepgram, it’s built for long files and handles speaker diarization much better than ‘stitching’ Whisper chunks together.
Self-hosted Whisper (Faster-Whisper): If you have the infra (or use a GPU cloud like RunPod), you can run
faster-whisper. You won’t have file size limits, and you can usepyannote-audiofor speaker labeling, though it requires more setup.
Sticking with OpenAI’s Whisper for 90-minute files will always feel like a hack because of the chunking/compression trade-off.
Discussion in the ATmosphere