Remove Background Noise from Video Without Re-encoding: An Audio-Only Approach with DeepFilterNet3
The Problem

You record a 15-minute interview, a drone flyover, or a screen capture, and the audio has that familiar hum: HVAC, wind, fan noise, room tone. The footage itself is great. The audio ruins it.

The standard fix is to run the whole file through a video editor or ffmpeg with a noise filter. That works, but it re-encodes the video stream. For a 6 GB 4K HEVC file, that means:

- 30–90 minutes of CPU time
- A generation of quality loss from re-encoding
- Another 6 GB of temporary disk space

There’s a much better way.

The Insight: Only the Audio Needs to Change

Video containers (MP4, MKV) store the video and audio as separate streams. You can replace just the audio track and copy the video bytes untouched: no decoding, no re-encoding, no quality loss.

The pipeline is:

1. Extract audio → lossless FLAC (about a minute for a multi-gigabyte file)
2. Denoise the audio with an ML model
3. Remux: original video stream + cleaned audio → new file (takes seconds)

That 6 GB 4K file? The video remux step takes about 8 seconds. The only slow part is the ML inference on the audio, which is proportional to the audio length, not the video resolution.

The Tool

I built denoise, a Python CLI that implements this pipeline with multiple denoising backends and a clean interactive interface:

    ╭─────────────────────────────────────────────────╮
    │ denoise · video background noise removal        │
    ╰─────────────────────────────────────────────────╯
    file    interview.mp4
    video   hevc
    audio   aac
    length  15m 32s
    size    6.7 GB

    1  DeepFilterNet3 ★  state-of-the-art ML speech enhancement
    2  DeepFilterNet2    previous generation — faster
    3  noisereduce       profiles your actual noise — preserves voice
    4  RNNoise · bd      FFmpeg neural net, no extra deps
    ...
    select method [1]:

    parameter      default   range   description
    passes         1         1–4     runs of the model
    atten_lim_db   15.0      6–40    max attenuation in dB

    adjust (key=value · Enter to keep defaults): passes=2

    ✓ Extracted 77.5 MB (64s)
    ✓ Denoised [pass 1/2] (14m 23s)
    ✓ Denoised [pass 2/2] (14m 19s)
    ✓ Audio saved → outputs/interview_audio_clean.flac
    ✓ Done → interview_clean.mp4 6.7 GB (8s)

Denoising Backends

The tool auto-detects what’s installed and builds the menu accordingly.

DeepFilterNet3 (best quality)

DeepFilterNet3 is a recurrent neural network trained specifically for speech enhancement. It operates in the frequency domain, separating speech from stationary and non-stationary noise with remarkable precision. It’s the recommended choice for any footage with human voice.

It runs on CPU, no GPU required, though inference is slower than the other options (1× real-time on an M-series Mac).

The key parameter is atten_lim_db: the maximum attenuation applied to any frequency band. The default of 15 dB is conservative and preserves naturalness. Crank it to 30+ for aggressive cleanup at the risk of some artefacts. Setting it below 10 gives a gentle pass that’s almost imperceptible.

Multi-pass is particularly effective with DeepFilterNet3. Running the model twice (passes=2) often cleans residual noise that survived the first pass without introducing new artefacts; the model is stable enough to run on its own output.

noisereduce

noisereduce uses spectral gating: it samples a short clip of noise-only audio (the first ~0.75s by default), builds a noise profile, then subtracts that profile from the full recording.

This approach shines when your noise is consistent: air conditioning, projector hum, camera body noise. It’s faster than DeepFilterNet and tends to leave voice texture more intact when the noise is well-profiled.

Tune noise_clip_s to point it at a section of your recording that contains only noise (no speech).
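One way to find a candidate noise-only stretch is to scan the recording for its quietest window. The sketch below is a hypothetical helper, not part of denoise: `quietest_window` and its parameters are illustrative names, and it assumes the audio is already loaded as a mono sequence of float samples (for example with soundfile). It also assumes the quietest stretch really is background noise, which is only a heuristic.

```python
import math


def quietest_window(samples, sample_rate, window_s=0.75):
    """Return (start_seconds, rms) of the lowest-RMS window.

    Hypothetical helper: scans half-overlapping windows of length
    window_s and keeps the one with the smallest RMS level.
    """
    win = int(round(window_s * sample_rate))
    if win <= 0 or len(samples) < win:
        raise ValueError("recording is shorter than the requested window")

    hop = max(1, win // 2)  # 50% overlap between candidate windows
    best_start, best_rms = 0, float("inf")

    for start in range(0, len(samples) - win + 1, hop):
        chunk = samples[start:start + win]
        rms = math.sqrt(sum(x * x for x in chunk) / win)
        if rms < best_rms:
            best_start, best_rms = start, rms

    return best_start / sample_rate, best_rms
```

The returned start time tells you where to cut the noise sample. Low RMS does not guarantee the window is speech-free, so listen to the clip before relying on it.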
If your recording starts with speech, trim a noise sample first and adjust noise_clip_s to match.

FFmpeg RNNoise (arnndn)

ffmpeg’s built-in arnndn filter runs Mozilla’s RNNoise, a recurrent neural net, entirely in C. No Python dependencies at all once you have a .rnnn model file.

Clone the model zoo:

    git clone https://github.com/GregorR/rnnoise-models rnn-models

The tool will pick them up automatically and list each one in the menu. The models vary in training data; beguiling-drafter and conjoined-burgers tend to work well for general speech.

afftdn and anlmdn

These are pure-ffmpeg filters, no extra installs needed.

afftdn (FFT-based denoiser) uses spectral subtraction. The critical parameters are nr (noise reduction in dB, default 15) and nf (noise floor in dBFS, default -40). The original reason for building this tool was that the nf=-25 setting used in most ffmpeg recipes is far too aggressive: it treats everything below -25 dBFS as noise, which eats soft consonants and produces the hoarse, “underwater” quality people complain about.

anlmdn (Non-Local Means denoiser) compares overlapping audio windows to detect and suppress repetitive noise patterns. It’s gentler on transients than spectral subtraction and worth trying when afftdn sounds over-processed.

Install

    git clone https://github.com/lynchaos/denoise
    cd denoise
    python3 -m venv .venv
    source .venv/bin/activate
    pip install rich

For the ML backends:

    # macOS: Rust is required to compile deepfilterlib
    brew install rust
    pip install deepfilternet noisereduce soundfile torch torchaudio torchcodec

Model weights (50 MB) are downloaded from Hugging Face on first use and cached locally.

Usage

    python denoise.py video.mp4

Select a method, optionally adjust parameters inline (passes=2 atten_lim_db=20), and wait. The cleaned audio is saved to outputs/ as a standalone FLAC alongside the remuxed video, which is useful for checking the result before committing.

Technical Notes

Why FLAC for the intermediate file?
It’s lossless and roughly half the size of WAV, which matters when you’re working with 15+ minute recordings. The audio never loses quality through the extract → denoise → remux pipeline.

Why not just use ffmpeg’s arnndn filter directly on the video?
You can, but you lose the ability to use Python-based ML models (DeepFilterNet, noisereduce), tune parameters interactively, save the isolated audio, or run multi-pass. The intermediate extraction step costs ~60 seconds but unlocks the full range of options.

Container compatibility:
MP4 is used when the source is MP4 with an H.264/HEVC/AV1 video stream. Everything else falls back to MKV, which accepts virtually any codec combination without complaints.

Stereo handling in DeepFilterNet3:
The model is trained on mono speech, so stereo channels are processed independently and then recombined. This avoids the phase artefacts that can appear when stereo audio is naively downmixed to mono and back.

Source

MIT licensed: github.com/lynchaos/denoise

Kemal Yaylali
orcid.org/0000-0003-1190-7807 · support@yaylali.uk
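The extract and remux steps, together with the container rule from the technical notes, can be sketched as ffmpeg argument lists. This is a minimal sketch under stated assumptions, not the code in denoise.py: the function names are hypothetical, and encoding the cleaned audio as AAC on remux is an assumption, since the article does not say which audio codec the tool writes.

```python
from pathlib import Path

# Video codecs (ffprobe codec_name values) the article lists as staying in MP4.
MP4_SAFE_VIDEO_CODECS = {"h264", "hevc", "av1"}


def choose_container(source: Path, video_codec: str) -> str:
    """MP4 only for MP4 sources with H.264/HEVC/AV1 video; otherwise MKV."""
    if source.suffix.lower() == ".mp4" and video_codec.lower() in MP4_SAFE_VIDEO_CODECS:
        return ".mp4"
    return ".mkv"


def extract_cmd(source: Path, flac_out: Path) -> list[str]:
    """Drop the video (-vn) and write the audio as lossless FLAC."""
    return ["ffmpeg", "-y", "-i", str(source), "-vn", "-c:a", "flac", str(flac_out)]


def remux_cmd(source: Path, clean_audio: Path, output: Path,
              audio_codec: str = "aac") -> list[str]:
    """Copy the original video stream and pair it with the cleaned audio.

    audio_codec="aac" is an assumption made for this sketch.
    """
    return [
        "ffmpeg", "-y",
        "-i", str(source),       # input 0: original file (video taken from here)
        "-i", str(clean_audio),  # input 1: denoised audio
        "-map", "0:v:0",         # first video stream of input 0
        "-map", "1:a:0",         # first audio stream of input 1
        "-c:v", "copy",          # no video decode or re-encode
        "-c:a", audio_codec,
        str(output),
    ]


if __name__ == "__main__":
    src = Path("interview.mp4")
    out = src.with_name(src.stem + "_clean" + choose_container(src, "hevc"))
    print(" ".join(extract_cmd(src, Path("interview_audio.flac"))))
    print(" ".join(remux_cmd(src, Path("interview_audio_clean.flac"), out)))
```

Each list can be passed to `subprocess.run(cmd, check=True)`. Building the commands as lists rather than shell strings avoids quoting problems with file names that contain spaces.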