{
  "$type": "site.standard.document",
  "bskyPostRef": {
    "cid": "bafyreibckujbqhzkobqda7qwpgwoyduvu2g3ymk3o27vqpilifdoofx3eq",
    "uri": "at://did:plc:pgryn3ephfd2xgft23qokfzt/app.bsky.feed.post/3mjg7wbhe4sw2"
  },
  "path": "/t/continous-increase-in-memory-usage/127891#post_15",
  "publishedAt": "2026-04-13T23:54:16.000Z",
  "site": "https://discuss.huggingface.co",
  "tags": [
    "FastAPI",
    "PyTorch Docs",
    "GitHub",
    "Hugging Face",
    "man7.org",
    "Hugging Face Forums",
    "@router.post"
  ],
  "textContent": "Beyond the hypothesis below, datasets containing media files rather than text consume more RAM than expected in so many different ways that I’m not sure pinpointing the problem will be easy:\n\n* * *\n\nThe most likely cause in your case is **host RAM growth from the request path itself**, then **whole-file Wav2Vec2 inference on long audio**, then **allocator retention** that makes memory appear “not freed” even after cleanup. I do **not** think the main problem is your `finally` block being too weak. I think the main problem is that the expensive allocations have already happened before that block runs. FastAPI’s `UploadFile` is built around an internal `SpooledTemporaryFile`, Torchaudio can load from a file-like object directly, and Wav2Vec2 has well-known long-audio memory problems when you push full clips through in one shot instead of chunking. (FastAPI)\n\n## What your handler is doing to memory\n\nThe first big amplification happens here:\n\n\n    audio_bytes = await audio_file.read()\n    waveform, sample_rate = torchaudio.load(io.BytesIO(audio_bytes))\n\n\n`UploadFile` already wraps a spooled file object, so reading it fully into `audio_bytes` forces a whole extra in-process copy of the upload. Then `torchaudio.load()` decodes that into a waveform tensor, and the docs state that it accepts a file-like object directly and returns `float32` tensors for common compressed formats. So a compressed file can become a much larger decoded tensor immediately, before inference even starts. (FastAPI)\n\nThe next amplification is the preprocessing chain. Your mono conversion creates a new tensor. Your resampling step creates another tensor. Torchaudio’s resampling docs note that `transforms.Resample` precomputes and caches a kernel, which is useful when reused, but in your code you instantiate it per request. 
None of these steps is “wrong,” but together they mean one incoming request can temporarily hold multiple full-waveform tensors in RAM. (PyTorch Docs)\n\nThen the audio goes into the Hugging Face ASR pipeline. The pipeline source shows that the automatic speech recognition pipeline preprocesses audio with `return_attention_mask=True`, and it also has built-in `chunk_length_s` and `stride_length_s` handling for chunked processing. That matters because the high-level pipeline is convenient, but it is not the minimum-allocation path, and it is generic rather than tailored to your exact Wav2Vec2 checkpoint. (GitHub)\n\n## Why Wav2Vec2 is especially prone to this\n\nWav2Vec2 is a CTC model, and Hugging Face’s own long-audio guide exists because long files should be handled by **chunking with stride**, not by pushing the entire waveform through one forward pass. There are public reports of Wav2Vec2 consuming **all 64 GB of RAM** on a 7-minute file, more than **200 GB of RAM** on a large decoding case, and OOM even on a **2 minute 17 second** sample on a 32 GB machine. That is the same symptom family as yours. (Hugging Face)\n\nSo the background model should be this: **Wav2Vec2 on long raw waveforms is memory-hungry by default**. If your endpoint accepts arbitrary-duration audio and does full decode plus full inference plus generic pipeline preprocessing, then steady RAM growth under real traffic is exactly what one would expect. (Hugging Face)\n\n## The most important checkpoint-specific detail\n\nThe Wav2Vec2 docs state that models with `config.feat_extract_norm == \"group\"` such as `wav2vec2-base` were **not trained using `attention_mask`**, and for those models inputs should simply be padded with zeros and no attention mask should be passed. Only `layer`-norm variants such as `wav2vec2-lv60` should get `attention_mask` for batched inference. The pipeline source, however, shows it sets `return_attention_mask=True` in its preprocessing path. 
That does not mean the pipeline is broken. It means the pipeline is generic, while your service may need a tighter manual path that avoids unnecessary tensors for your specific model family. (Hugging Face)\n\n## Why your cleanup is not solving it\n\n`torch.cuda.empty_cache()` only releases **unoccupied cached GPU memory**. PyTorch explicitly says it does **not** increase the amount of GPU memory available to PyTorch itself, though it can help reduce fragmentation in some cases. It also says nothing about host RAM because it is a GPU-cache function. So it cannot fix CPU-side growth from uploads, decoded waveforms, or Python/native heap behavior. (PyTorch Docs)\n\n`malloc_trim(0)` is also weaker than people often think. The Linux man page says it only **attempts** to release free heap memory from the process heap back to the OS. That means it may help sometimes and do nothing sometimes. It is not a primary control mechanism for a service that is over-allocating per request. (man7.org)\n\nThis is why your logs can show “I deleted everything” while RSS stays high. Some of that can be real live memory. Some can be allocator retention. Hugging Face users have reported the same “first batch fits, second similar batch OOMs” pattern with Wav2Vec2, and FastAPI/Uvicorn users have also reported persistent growth under repeated inference loads. (Hugging Face Forums)\n\n## My diagnosis, ranked\n\n### 1. Highest-probability cause: whole-file CPU memory amplification\n\nYou are reading the full upload into Python memory, decoding the full file into float32, then creating more full-size tensors for mono conversion and resampling. That is the clearest architectural problem in the code. (FastAPI)\n\n### 2. Very likely: full-length Wav2Vec2 inference instead of chunking\n\nThe public Wav2Vec2 OOM reports and the official chunking guide point strongly in this direction. Even a couple of minutes can be enough to blow memory depending on the exact model and path. (Hugging Face)\n\n### 3. 
Likely: generic pipeline preprocessing doing more work than needed\n\nThe ASR pipeline preprocesses with `return_attention_mask=True` and has its own chunking behavior. A manual `processor + model` path gives you tighter control over what tensors are built and when. (GitHub)\n\n### 4. Secondary amplifier: allocator retention and fragmentation\n\nThis explains why RAM does not visibly return to baseline after cleanup. It does not explain the initial spike by itself. (PyTorch Docs)\n\n## What I would change first\n\n### First: stop materializing the upload as `bytes`\n\nUse the file object you already have:\n\n\n    await audio_file.seek(0)\n    waveform, sample_rate = torchaudio.load(audio_file.file)\n\n\nFastAPI documents that `UploadFile` exposes the underlying spooled file, and Torchaudio documents that `load()` accepts a file-like object. This removes one full-copy allocation of the uploaded payload. (FastAPI)\n\n### Second: cap duration or chunk before full inference\n\nDo not let arbitrary-duration audio go straight into the model. Use chunking with overlap, or reject or trim overly long inputs. Hugging Face’s long-audio guide is explicit that chunking with stride is the right approach for Wav2Vec2 on long files. (Hugging Face)\n\n### Third: replace the high-level pipeline in the hot path\n\nLoad `AutoProcessor` and `AutoModelForCTC` once at startup, then call them directly in the request handler. This lets you control `return_attention_mask`, input dtype, chunking, and device transfer yourself. The pipeline docs describe pipeline as a convenience abstraction, which is exactly why it is good for prototypes and sometimes suboptimal for tight production serving paths. (Hugging Face)\n\n### Fourth: use `torch.inference_mode()`\n\nPyTorch states that `inference_mode` is analogous to `no_grad`, but removes additional overhead by disabling view tracking and version-counter bumps. For pure inference endpoints, that is generally the better mode. 
(PyTorch Docs)\n\n### Fifth: only pass `attention_mask` if your checkpoint needs it\n\nIf your model is a `group`-norm Wav2Vec2 checkpoint, drop the attention mask. If it is a `layer`-norm variant, keep it. That one decision can remove a large extra tensor from every request. (Hugging Face)\n\n## A safer version of the endpoint\n\nThis is the shape I would move toward:\n\n\n    import gc\n    import logging\n    import psutil\n    import torch\n    import torchaudio\n\n    from fastapi import APIRouter, UploadFile, File, HTTPException\n    from fastapi.responses import JSONResponse\n    from transformers import AutoProcessor, AutoModelForCTC\n\n    # Defined here so the snippet is self-contained; reuse your own if they exist\n    router = APIRouter()\n    logger = logging.getLogger(__name__)\n\n    TARGET_SR = 16000\n    MAX_SECONDS = 30\n    DEVICE = \"cuda:0\" if torch.cuda.is_available() else \"cpu\"\n\n    # Load the processor and model once at import time, not per request\n    MODEL_ID = \"your-model-id\"\n    processor = AutoProcessor.from_pretrained(MODEL_ID)\n    model = AutoModelForCTC.from_pretrained(MODEL_ID).to(DEVICE).eval()\n\n    # Only layer-norm checkpoints were trained with an attention mask\n    USE_ATTENTION_MASK = getattr(model.config, \"feat_extract_norm\", None) == \"layer\"\n\n    @router.post(\"/transcribe\")\n    async def quran(audio_file: UploadFile = File(...)):\n        process = psutil.Process()\n        start_ram = process.memory_info().rss / (1024 ** 2)\n\n        waveform = None\n        inputs = None\n        logits = None\n        pred_ids = None\n\n        try:\n            await audio_file.seek(0)\n\n            # No full bytes copy\n            waveform, sample_rate = torchaudio.load(audio_file.file)\n\n            # Stereo -> mono\n            if waveform.ndim == 2 and waveform.shape[0] > 1:\n                waveform = waveform.mean(dim=0)\n            else:\n                waveform = waveform.squeeze(0)\n\n            # Resample only if needed\n            if sample_rate != TARGET_SR:\n                waveform = torchaudio.functional.resample(waveform, sample_rate, TARGET_SR)\n\n            # Hard cap input size before inference\n            waveform = waveform[: TARGET_SR * MAX_SECONDS].contiguous()\n\n            inputs = processor(\n          
      waveform.numpy(),\n                sampling_rate=TARGET_SR,\n                return_tensors=\"pt\",\n                padding=False,\n                return_attention_mask=USE_ATTENTION_MASK,\n            )\n\n            inputs = {k: v.to(DEVICE, non_blocking=True) for k, v in inputs.items()}\n\n            with torch.inference_mode():\n                logits = model(**inputs).logits\n                pred_ids = torch.argmax(logits, dim=-1)\n                transcript = processor.batch_decode(pred_ids)[0]\n\n            return JSONResponse({\"transcript\": transcript}, status_code=200)\n\n        except Exception:\n            raise HTTPException(status_code=500, detail=\"Internal processing error\")\n\n        finally:\n            try:\n                await audio_file.close()\n            except Exception:\n                pass\n\n            # Rebind the names to drop the tensor references. Deleting from\n            # locals() does not unbind local variables inside a function.\n            waveform = inputs = logits = pred_ids = None\n\n            # Collect first so freed tensors return their blocks to the\n            # caching allocator before the cache is emptied\n            gc.collect()\n\n            if torch.cuda.is_available():\n                torch.cuda.empty_cache()\n\n            final_ram = process.memory_info().rss / (1024 ** 2)\n            logger.info(\"RAM Status: %.1fMB -> %.1fMB\", start_ram, final_ram)\n\n\nThis version changes the core memory behavior in the places that matter: it removes the extra upload copy, bounds input length before the model sees it, uses a manual path instead of the generic pipeline, and makes attention-mask creation checkpoint-aware. Those recommendations follow directly from the FastAPI, Torchaudio, Wav2Vec2, pipeline, and PyTorch docs. 
(FastAPI)\n\n## How to tell whether this is a true leak or allocator retention\n\nAdd logging for:\n\n  * process RSS\n  * `torch.cuda.memory_allocated()`\n  * `torch.cuda.memory_reserved()`\n\n\n\nPyTorch documents that `memory_allocated()` is memory currently occupied by tensors, while `memory_reserved()` is memory managed by the caching allocator. If `allocated` stays flat but `reserved` climbs, that points to allocator behavior or fragmentation. If both are flat while RSS rises, the pressure is on the CPU side: uploads, audio decode, resample buffers, Python/native heap, or FastAPI/Uvicorn request handling. PyTorch’s memory snapshot tools are the right next step if CUDA behavior is unclear. (PyTorch Docs)\n\n## The big picture\n\nThe core issue is not that you forgot one magic cleanup call. The core issue is that your endpoint is currently shaped like this:\n\n  1. read all uploaded bytes into Python\n  2. decode all audio into float32\n  3. create more full-size tensors for mono and resample\n  4. run a generic pipeline on the whole clip\n  5. try to clean everything after the expensive work already happened\n\n\n\nFor Wav2Vec2 serving, the more stable shape is:\n\n  1. keep the upload as a file-like object\n  2. decode without creating an extra bytes copy\n  3. bound or chunk the audio early\n  4. use a manual processor/model path\n  5. make attention-mask use depend on the checkpoint\n  6. treat `empty_cache()` and `malloc_trim()` as secondary hygiene, not the main fix (FastAPI)\n\n\n\nMy final judgment: **your RAM growth is most likely caused by architectural over-allocation in the request path plus long-audio Wav2Vec2 inference, with allocator retention making the symptom look worse over time.** The highest-value fixes are: remove `audio_bytes`, stop whole-file inference, move off the generic pipeline in the API route, and make preprocessing explicit. (Hugging Face)",
  "title": "Continous increase in Memory usage"
}