Raw Record Source

{
  "$type": "site.standard.document",
  "bskyPostRef": {
    "cid": "bafyreib4e5ee2qre5ifnnnskopecye7nue7f6p2duwtkfgdwvy3ohidazm",
    "uri": "at://did:plc:pgryn3ephfd2xgft23qokfzt/app.bsky.feed.post/3mjhi6xhhovi2"
  },
  "path": "/t/continous-increase-in-memory-usage/127891#post_18",
  "publishedAt": "2026-04-14T12:11:50.000Z",
  "site": "https://discuss.huggingface.co",
  "tags": [
    "man7.org",
    "Uvicorn",
    "FastAPI",
    "PyTorch Docs",
    "Kernel Documentation"
  ],
  "textContent": "Hmm… This seems more like a problem related to C or the operating system than Python…\nSince I’m a Windows user, I don’t really know much about Linux issues…\n\n* * *\n\nMy view is that your latest test shows **two different phenomena** :\n\n  1. **The machine is still healthy at the system level right now.**\n  2. **The Uvicorn process is retaining memory across requests.**\n\n\n\nThose are not the same problem. Linux `free` reports `used` as `total - available`, and `available` is the kernel’s estimate of how much memory can still be given to new applications without swapping. In your posted snapshot, `available` is still about **28.1 GB** out of **33.5 GB** , and swap is still **0**. That means the box is **not currently close to OOM** , even though `free -h used` looks worse. The `free(1)` man page is explicit that `used` includes memory that is unavailable because it is in cache or otherwise reclaimable, and that `available` is the more useful “can I still allocate more?” field. (man7.org)\n\nThat said, your **Uvicorn RSS at ~1.5 GB after 3,600 requests** means the process is not returning to its original steady state. That part is real. The next question is whether this is **file-backed/cache-like growth** , **anonymous heap retention** , or a true object leak. Linux exposes that breakdown in `/proc/<pid>/status` via `VmRSS`, `RssAnon`, `RssFile`, and `RssShmem`, and in `/proc/<pid>/smaps_rollup` via pre-summed `Rss` and `Pss`. `smaps_rollup` exists specifically to give you the summed process memory picture without manually aggregating every VMA. (man7.org)\n\n## What I think your result means\n\nI do **not** think your latest result proves that the server will “collapse after some time” by itself. With the numbers you posted, the machine still has large headroom and no swap activity. 
What it **does** prove is that the process is **retaining memory across requests**, and that your desired target of “RSS and `free used` must return exactly to the initial point after every request” is not a realistic expectation for a long-lived Linux/Python/PyTorch service. `malloc_trim()` returns `1` only when memory was actually released back to the system and `0` when nothing could be released; it is an attempt, not a guarantee. glibc also documents that allocator arenas are a trade-off: **more arenas reduce contention but increase memory usage**. (man7.org)\n\nSo the core conclusion is:\n\n> Your current problem is less “one missing cleanup call” and more “steady-state memory behavior of a long-lived inference worker.” (Uvicorn)\n\n## Why `free -h used` is misleading in your test\n\n`free` is not a process-leak detector. The man page says the displayed values come from `/proc/meminfo`, and the `used` column is calculated as `total - available` in current procps-ng releases (older releases used `total - free - buffers - cache`). The `available` column exists specifically because `free` and cache numbers alone are not enough to judge memory pressure. In practice, tmpfs and shared-memory pages, unreclaimable kernel slabs, and other memory that the kernel does not count as reclaimable can make `used` climb without any growth in your process, while the machine remains healthy. (man7.org)\n\nThat is why your earlier observation matters:\n\n  * previously, `free used` grew while RSS did not\n  * now, both `free used` and RSS grow\n\n\n\nThe first pattern points more toward **kernel caching or file-backed effects**. The second pattern adds **process-retained memory** on top. The correct way to separate them is not `ps RSS` alone, but `RssAnon` vs `RssFile` and `Pss` from procfs. Linux documents those fields directly. (man7.org)\n\n## Why the newer approach may have made RSS look worse\n\nThe earlier changes removed one obvious extra copy of the upload, but they did not solve the long-lived-process problem. 
FastAPI documents that `UploadFile` uses a **spooled file** that stays in memory up to a threshold and then spills to disk, and that the underlying `SpooledTemporaryFile` can be passed directly to libraries. Torchaudio documents that `torchaudio.load()` accepts a **file-like object** , returns **float32 tensors** , and supports `frame_offset` and `num_frames` so you do not have to decode the whole file. Those are useful improvements, but they do **not** guarantee per-request RSS reset. They only reduce some avoidable transient allocations. (FastAPI)\n\nSo if the new version shows higher settled RSS, that does **not** automatically mean the old version was better. It may simply mean the memory has shifted from reclaimable file/cache behavior to longer-lived anonymous heap retention, or that allocator reuse is now more visible in RSS. That distinction is why `RssAnon` and `RssFile` matter so much here. (man7.org)\n\n## Why exact per-request RSS reset is the wrong target\n\nFor a long-lived worker, exact RSS reset after every request is usually not achievable as a design guarantee. There are three reasons.\n\nFirst, glibc arenas can keep memory around for reuse, and glibc explicitly says more arenas increase memory usage. Second, `malloc_trim()` only sometimes succeeds. Third, Uvicorn itself exposes `--limit-max-requests` specifically to protect long-running workers from memory leak or leak-like behavior, which is an admission that long-lived Python web workers often do accumulate or retain memory in practice. (man7.org)\n\nThat is why my opinion is blunt:\n\n> If your hard requirement is “the process must return to its exact initial RSS after every request,” the robust solution is **process recycling or subprocess isolation** , not more aggressive in-process cleanup. 
(Uvicorn)\n\n## What I would treat as the real success criteria\n\nThe right targets are:\n\n  * `MemAvailable` stays comfortably high\n  * `VmSwap` stays at 0\n  * `RssAnon` or `Pss` plateaus rather than rising linearly forever\n  * no OOM-killer events\n  * stable latency under your target load\n\n\n\nThose criteria follow directly from the meanings of `free`’s `available` field and procfs memory fields. (man7.org)\n\n## What I would do next\n\n### 1. Measure the right fields\n\nDo not rely on `ps RSS` alone. Use:\n\n\n    PID=$(pgrep -n -f 'uvicorn app:app')\n\n    watch -n 5 \"grep -E 'VmRSS|RssAnon|RssFile|RssShmem|VmSwap' /proc/$PID/status\"\n\n    watch -n 5 \"grep -E '^(Rss|Pss|Anonymous|Swap):' /proc/$PID/smaps_rollup\"\n\n    watch -n 5 \"free -w -h\"\n\n\nLinux documents all of those fields. `smaps_rollup` is specifically the pre-summed memory report, and `/proc/<pid>/status` exposes the anon/file/shmem split. (man7.org)\n\nInterpretation:\n\n  * If **`RssFile`** rises more than `RssAnon`, a lot of what you see is file-backed memory or cache-like effects.\n  * If **`RssAnon`** and **`Pss`** rise steadily under a fixed workload, that is real process-retained memory.\n  * If `MemAvailable` remains high and swap remains zero, the system is not in imminent danger even if RSS is larger than you want. (man7.org)\n\n\n\n### 2. Bound the workload earlier\n\nTorchaudio supports `frame_offset` and `num_frames`, and always returns float32 tensors. That means decoding whole uploads is expensive by construction. For Wav2Vec2 specifically, Hugging Face’s long-audio guidance is explicit: use **chunking with stride** for arbitrarily long files or live inference, because the transformer cost grows badly with long sequence length and whole-file inference can crash from memory. 
(PyTorch Docs)\n\nThis means your production path should enforce **one or both** of these:\n\n  * a hard maximum accepted duration per request\n  * chunked inference with overlap instead of whole-file inference\n\n\n\n### 3. Control the worker lifecycle\n\nUvicorn documents three settings that matter here:\n\n  * `--limit-concurrency` to keep memory use predictable under load\n  * `--limit-max-requests` to terminate a process after N requests\n  * `--limit-max-requests-jitter` to stagger restarts across workers\n\n\n\nUvicorn’s own wording says `--limit-concurrency` is useful for ensuring known memory usage patterns and `--limit-max-requests` is useful for preventing memory leaks from impacting long-running processes. That is exactly your scenario. (Uvicorn)\n\nA practical starting point is:\n\n\n    uvicorn app:app \\\n      --host 0.0.0.0 \\\n      --port 9000 \\\n      --limit-concurrency 1 \\\n      --limit-max-requests 500 \\\n      --limit-max-requests-jitter 100\n\n\nFor your 1 req/s test, I would benchmark 300, 500, and 1000 and choose the smallest number that keeps latency acceptable while flattening `RssAnon` and `Pss`.\n\n### 4. Tune glibc only as a controlled experiment\n\nglibc documents `M_ARENA_MAX` and the `MALLOC_ARENA_MAX` environment variable. It also states the trade-off clearly: **more arenas lower contention but increase memory usage**. It documents `M_TRIM_THRESHOLD` too, which affects when `free()` will try to release memory back to the system. (man7.org)\n\nThe safest allocator experiment is:\n\n\n    MALLOC_ARENA_MAX=2 uvicorn app:app ...\n\n\nor:\n\n\n    GLIBC_TUNABLES=glibc.malloc.arena_max=2 uvicorn app:app ...\n\n\nA second experiment is to combine low arena count with a more aggressive trim threshold. But this is a tuning experiment, not a guaranteed fix. glibc documents these as trade-offs, not as magic leak cures. (man7.org)\n\n### 5. 
Stop expecting `empty_cache()` to solve host RAM\n\nPyTorch is explicit that `torch.cuda.memory.empty_cache()` releases **unoccupied cached GPU memory** , does **not** increase the amount of GPU memory available to PyTorch, and may only help reduce fragmentation in some cases. It is not a host-RAM control and it does not force a long-lived process back to baseline. (PyTorch Docs)\n\n## My direct opinion on your current result\n\nYour latest test does **not** show that the box is about to collapse. It **does** show that the current worker model does not meet your desired memory behavior. Those are different statements.\n\nThe operationally correct answer is not “find a better `gc.collect()` pattern.”\nIt is:\n\n  1. measure `RssAnon`, `RssFile`, `Pss`, and `MemAvailable`\n  2. bound audio length and chunk inference\n  3. recycle workers proactively\n  4. optionally tune glibc arenas\n  5. judge success by **plateau** , not by exact per-request RSS reset\n\n\n\nThat is the most realistic path for a Python + FastAPI + Torchaudio + Wav2Vec2 service on Linux. (man7.org)\n\nThe sharpest summary is this:\n\n> Your goal should be **bounded steady-state memory** , not **perfect baseline restoration after every request**. Linux, glibc, and Uvicorn are all telling you the same thing through their documented behavior. (man7.org)\n\n* * *\n\nUse this as a **repeatable 80-minute memory characterization test** for your ASR worker. The plan is built around three Linux-backed signals: **`RssAnon`** from `/proc/<pid>/status`, **`Pss`** from `/proc/<pid>/smaps_rollup`, and **`MemAvailable`** from `/proc/meminfo`. `RssAnon` is resident anonymous memory, `VmRSS` is the sum of `RssAnon`, `RssFile`, and `RssShmem`, and `smaps_rollup` gives you a pre-summed process view where **PSS** is the process’s proportional share of shared pages. 
`free` reports **used** as `total - available`, and **available** is the kernel’s estimate of how much memory can still be given to new apps without swapping. That is why this plan treats **PSS as the primary process metric** and **MemAvailable as the primary system-safety metric**. (man7.org)\n\n## What this test is trying to answer\n\nYou are not testing “does RSS return exactly to baseline after every request.” That is too strict for a long-lived Linux worker. You are testing three simpler questions:\n\n  1. Does **`Pss`** plateau after warm-up, or keep rising through the hour?\n  2. Does **`RssAnon`** plateau after warm-up, or keep rising through the hour?\n  3. Does **`MemAvailable`** stay comfortably high with **swap still at 0**?\n\n\n\nThat framing matches what the kernel and `free` actually report, and it avoids being misled by cache growth alone. (Kernel Documentation)\n\n## Test layout\n\nRun one worker, one concurrency slot, fixed audio input, fixed rate. Uvicorn documents `--limit-concurrency` as useful for ensuring known memory usage patterns, and `--limit-max-requests` as useful for preventing leaks from impacting long-running processes. For the **first characterization run** , keep `--limit-max-requests` **off** so you can observe the natural memory shape. (Uvicorn)\n\nTimeline:\n\n  * **5 minutes idle baseline**\n  * **60 minutes load** at **1 request/second**\n  * **15 minutes idle recovery**\n\n\n\nSample every **5 seconds**. That gives enough resolution to see trend shape without creating noisy data files.\n\n## Terminal 1: start the server\n\nRun a single worker and cap concurrency so the test stays deterministic.\n\n\n    MALLOC_ARENA_MAX=2 \\\n    uvicorn app:app \\\n      --host 0.0.0.0 \\\n      --port 9000 \\\n      --workers 1 \\\n      --limit-concurrency 1\n\n\n`MALLOC_ARENA_MAX=2` is included because glibc documents that more arenas reduce lock contention but increase memory usage. 
This is a sensible controlled starting point for long-lived services, but it is still an experiment, not a guaranteed fix. (man7.org)\n\n## Terminal 2: identify the PID\n\n\n    PID=$(pgrep -n -f 'uvicorn app:app')\n    echo \"$PID\"\n\n\n## Terminal 2: create the collector\n\nThis collector writes one CSV row every 5 seconds with exactly the fields you care about.\n\n\n    cat > collect_mem.sh <<'SH'\n    #!/usr/bin/env bash\n    set -euo pipefail\n\n    PID=\"$1\"\n    OUT=\"${2:-mem.csv}\"\n\n    echo \"ts,VmRSS_kB,RssAnon_kB,RssFile_kB,RssShmem_kB,VmSwap_kB,Pss_kB,MemAvailable_kB,MemTotal_kB\" > \"$OUT\"\n\n    while kill -0 \"$PID\" 2>/dev/null; do\n      ts=$(date +%s)\n\n      read vmrss rssanon rssfile rssshmem vmswap < <(\n        awk '\n          /^VmRSS:/{vmrss=$2}\n          /^RssAnon:/{ra=$2}\n          /^RssFile:/{rf=$2}\n          /^RssShmem:/{rs=$2}\n          /^VmSwap:/{vs=$2}\n          END{print vmrss+0, ra+0, rf+0, rs+0, vs+0}\n        ' /proc/$PID/status\n      )\n\n      pss=$(awk '/^Pss:/{print $2; exit}' /proc/$PID/smaps_rollup)\n      memavail=$(awk '/^MemAvailable:/{print $2; exit}' /proc/meminfo)\n      memtotal=$(awk '/^MemTotal:/{print $2; exit}' /proc/meminfo)\n\n      echo \"$ts,$vmrss,$rssanon,$rssfile,$rssshmem,$vmswap,$pss,$memavail,$memtotal\" >> \"$OUT\"\n      sleep 5\n    done\n    SH\n\n    chmod +x collect_mem.sh\n\n\nWhy these fields:\n\n  * `RssAnon` tracks resident anonymous memory in the process. (man7.org)\n  * `Pss` is a better total-process footprint measure than RSS because it prorates shared pages. (Kernel Documentation)\n  * `MemAvailable` is the system-level safety number that `free` itself tells you to care about. 
(man7.org)\n\n\n\n## Terminal 2: start collection\n\n\n    ./collect_mem.sh \"$PID\" mem_run1.csv &\n    COLLECTOR_PID=$!\n    echo \"$COLLECTOR_PID\"\n\n\n## Optional live watch during the run\n\nThis is only for sanity checking while the CSV is being recorded.\n\n\n    watch -n 5 \"\n    echo '=== /proc/\\$PID/status ==='\n    grep -E 'VmRSS|RssAnon|RssFile|RssShmem|VmSwap' /proc/$PID/status\n    echo\n    echo '=== /proc/\\$PID/smaps_rollup ==='\n    grep -E '^(Rss|Pss|Anonymous|Swap):' /proc/$PID/smaps_rollup\n    echo\n    echo '=== free -w -h ==='\n    free -w -h\n    \"\n\n\n`/proc/<pid>/status` gives the anon/file/shmem split, and `smaps_rollup` gives the pre-summed process map totals, including `Pss`. (man7.org)\n\n## Terminal 2: baseline idle for 5 minutes\n\n\n    sleep 300\n\n\n## Terminal 2: exact 1 request/second load for 1 hour\n\nThis version paces requests to one-second periods, so the loop stays close to **1 req/s** rather than “request time plus one second”.\n\n\n    python3 - <<'PY'\n    import subprocess, time, sys\n\n    URL = \"http://127.0.0.1:9000/transcribe\"\n    FILE = \"sample.wav\"\n    TOTAL_REQUESTS = 3600\n    PERIOD = 1.0\n\n    ok = 0\n    fail = 0\n\n    for i in range(TOTAL_REQUESTS):\n        t0 = time.time()\n        try:\n            r = subprocess.run(\n                [\"curl\", \"-sS\", \"-o\", \"/dev/null\", \"-w\", \"%{http_code}\", \"-F\", f\"audio_file=@{FILE}\", URL],\n                capture_output=True, text=True, check=False\n            )\n            code = r.stdout.strip()\n            if code == \"200\":\n                ok += 1\n            else:\n                fail += 1\n                print(f\"request {i+1}: http {code}\", file=sys.stderr)\n        except Exception as e:\n            fail += 1\n            print(f\"request {i+1}: exception {e}\", file=sys.stderr)\n\n        dt = time.time() - t0\n        if dt < PERIOD:\n            time.sleep(PERIOD - dt)\n\n    print(f\"ok={ok} fail={fail}\")\n    
PY\n\n\n## Terminal 2: idle recovery for 15 minutes\n\n\n    sleep 900\n    kill \"$COLLECTOR_PID\"\n    wait \"$COLLECTOR_PID\" 2>/dev/null || true\n\n\n## Terminal 2: analyzer\n\nThis analyzer gives you **PASS / WARN / FAIL** for `RssAnon`, `Pss`, and `MemAvailable`. The thresholds below are **operational heuristics** , not kernel-defined rules. They are tuned for a **single worker** , **concurrency 1** , **fixed input** , and **1 req/s for 1 hour**.\n\n\n    cat > analyze_mem.py <<'PY'\n    import csv, statistics as st, sys\n\n    path = sys.argv[1]\n    rows = list(csv.DictReader(open(path)))\n\n    for r in rows:\n        for k in r:\n            r[k] = int(r[k])\n\n    t0 = rows[0][\"ts\"]\n    for r in rows:\n        r[\"t\"] = r[\"ts\"] - t0  # seconds since start\n\n    def median_in_window(start_s, end_s, key):\n        vals = [r[key] for r in rows if start_s <= r[\"t\"] < end_s]\n        if not vals:\n            raise RuntimeError(f\"No samples for {key} in window {start_s}-{end_s}\")\n        return int(st.median(vals))\n\n    KiB = 1\n    MiB = 1024 * KiB\n    GiB = 1024 * MiB\n\n    # Windows for this exact plan:\n    # 0-300s    baseline idle\n    # 300-3900s load\n    # 3900-4800s recovery idle\n    baseline_start, baseline_end = 180, 300\n    early_start, early_end       = 600, 900    # 5-10 min into load\n    late_start, late_end         = 3600, 3900  # last 5 min of load\n    reco_start, reco_end         = 4500, 4800  # last 5 min of recovery\n\n    mem_total = median_in_window(baseline_start, baseline_end, \"MemTotal_kB\")\n\n    baseline = {\n        \"RssAnon\": median_in_window(baseline_start, baseline_end, \"RssAnon_kB\"),\n        \"Pss\": median_in_window(baseline_start, baseline_end, \"Pss_kB\"),\n        \"MemAvailable\": median_in_window(baseline_start, baseline_end, \"MemAvailable_kB\"),\n    }\n    early = {\n        \"RssAnon\": median_in_window(early_start, early_end, \"RssAnon_kB\"),\n        \"Pss\": median_in_window(early_start, 
early_end, \"Pss_kB\"),\n        \"MemAvailable\": median_in_window(early_start, early_end, \"MemAvailable_kB\"),\n    }\n    late = {\n        \"RssAnon\": median_in_window(late_start, late_end, \"RssAnon_kB\"),\n        \"Pss\": median_in_window(late_start, late_end, \"Pss_kB\"),\n        \"MemAvailable\": median_in_window(late_start, late_end, \"MemAvailable_kB\"),\n    }\n    recovery = {\n        \"RssAnon\": median_in_window(reco_start, reco_end, \"RssAnon_kB\"),\n        \"Pss\": median_in_window(reco_start, reco_end, \"Pss_kB\"),\n        \"MemAvailable\": median_in_window(reco_start, reco_end, \"MemAvailable_kB\"),\n    }\n\n    swap_used = max(r[\"VmSwap_kB\"] for r in rows)\n\n    def fmt_mib(kib):\n        return f\"{kib / MiB:.1f} MiB\"\n\n    def classify_pss(early_kb, late_kb):\n        growth = late_kb - early_kb\n        if growth <= max(int(0.10 * early_kb), 150 * MiB):\n            return \"PASS\", growth\n        if growth <= max(int(0.20 * early_kb), 300 * MiB):\n            return \"WARN\", growth\n        return \"FAIL\", growth\n\n    def classify_rssanon(early_kb, late_kb):\n        growth = late_kb - early_kb\n        if growth <= max(int(0.15 * early_kb), 200 * MiB):\n            return \"PASS\", growth\n        if growth <= max(int(0.30 * early_kb), 400 * MiB):\n            return \"WARN\", growth\n        return \"FAIL\", growth\n\n    def classify_memavailable(baseline_kb, late_kb, reco_kb, mem_total_kb, swap_kb):\n        drop = baseline_kb - late_kb\n        min_safe = int(0.20 * mem_total_kb)  # 20% of RAM still available\n        # PASS: small drop, still plenty available, no swap\n        if drop <= min(1 * GiB, int(0.05 * mem_total_kb)) and late_kb >= min_safe and swap_kb == 0:\n            return \"PASS\", drop\n        # WARN: moderate drop, still no swap, recovery not catastrophic\n        if drop <= min(2 * GiB, int(0.10 * mem_total_kb)) and late_kb >= int(0.15 * mem_total_kb) and swap_kb == 0:\n            return \"WARN\", 
drop\n        return \"FAIL\", drop\n\n    pss_status, pss_growth = classify_pss(early[\"Pss\"], late[\"Pss\"])\n    rss_status, rss_growth = classify_rssanon(early[\"RssAnon\"], late[\"RssAnon\"])\n    mem_status, mem_drop = classify_memavailable(\n        baseline[\"MemAvailable\"], late[\"MemAvailable\"], recovery[\"MemAvailable\"], mem_total, swap_used\n    )\n\n    print(\"=== Window medians ===\")\n    for label, d in [(\"baseline\", baseline), (\"early_load\", early), (\"late_load\", late), (\"recovery\", recovery)]:\n        print(\n            f\"{label:10s} \"\n            f\"RssAnon={fmt_mib(d['RssAnon'])} \"\n            f\"Pss={fmt_mib(d['Pss'])} \"\n            f\"MemAvailable={fmt_mib(d['MemAvailable'])}\"\n        )\n\n    print(\"\\n=== Classification ===\")\n    print(f\"PSS        : {pss_status}  (late - early = {fmt_mib(pss_growth)})\")\n    print(f\"RssAnon    : {rss_status}  (late - early = {fmt_mib(rss_growth)})\")\n    print(f\"MemAvailable: {mem_status} (baseline - late = {fmt_mib(mem_drop)})\")\n    print(f\"VmSwap max : {fmt_mib(swap_used)}\")\n\n    overall = \"PASS\"\n    if \"FAIL\" in (pss_status, rss_status, mem_status) or swap_used > 0:\n        overall = \"FAIL\"\n    elif \"WARN\" in (pss_status, rss_status, mem_status):\n        overall = \"WARN\"\n\n    print(f\"\\nOVERALL: {overall}\")\n    PY\n\n\nRun it:\n\n\n    python3 analyze_mem.py mem_run1.csv\n\n\n## How to read the output\n\n### PASS pattern\n\nA **PASS** means:\n\n  * `Pss` rises during early load, then **late-load PSS is not much higher than early-load PSS**\n  * `RssAnon` shows the same plateau shape\n  * `MemAvailable` stays high\n  * `VmSwap` stays **0**\n\n\n\nThat pattern means the process likely has **warm-up and retention** , but not a leak-like linear climb. PSS is the best single-process memory indicator here because the kernel defines it as the process’s proportional share of shared pages. 
(Kernel Documentation)\n\n### WARN pattern\n\nA **WARN** means:\n\n  * `Pss` and/or `RssAnon` still grow between early and late windows, but not explosively\n  * `MemAvailable` drops moderately, yet remains healthy\n  * swap is still **0**\n\n\n\nThat pattern usually means **bounded retention** or **allocator effects** rather than immediate OOM risk. In that case, the next experiment is operational mitigation: same workload, but run Uvicorn with `--limit-max-requests` and jitter, because Uvicorn explicitly documents that setting as protection against memory leaks affecting long-running processes. (Uvicorn)\n\n### FAIL pattern\n\nA **FAIL** means one of these:\n\n  * `Pss` keeps rising materially from early to late load\n  * `RssAnon` keeps rising materially from early to late load\n  * `MemAvailable` drops too far\n  * swap becomes nonzero\n\n\n\nThat pattern is “leak-like” for operations purposes, even if the root cause is allocator retention rather than a literal object leak. At that point, do not chase exact per-request reset. Move to **worker recycling** and **workload bounding**. Uvicorn’s docs are explicit that `--limit-concurrency` is for predictable memory usage and `--limit-max-requests` is for preventing memory leaks from impacting long-running processes. (Uvicorn)\n\n## Expected output patterns\n\n### Pattern A: healthy plateau\n\nYou want this shape:\n\n  * baseline idle low and stable\n  * early load higher than baseline\n  * late load close to early load\n  * recovery equal to or somewhat below late load\n  * `MemAvailable` high throughout\n  * `VmSwap = 0`\n\n\n\nThis says “memory warms up and then stabilizes.”\n\n### Pattern B: file-backed or cache-heavy growth\n\nYou may see:\n\n  * `free used` rises\n  * `MemAvailable` stays high\n  * `Pss` grows only a little\n  * optional `RssFile` grows more than `RssAnon`\n\n\n\nThat says the machine is using more memory, but the worker itself is not accumulating much anonymous footprint. 
`free` reports `used = total - available`, so this pattern is not an immediate danger signal by itself. (man7.org)\n\n### Pattern C: real worker retention\n\nYou do **not** want this shape:\n\n  * `Pss` rises steadily from early to late load\n  * `RssAnon` rises in parallel\n  * recovery stays near the late-load high-water mark\n  * `MemAvailable` trends down materially\n  * swap eventually becomes nonzero\n\n\n\nThat is the pattern that justifies worker recycling and further narrowing of request memory. (Kernel Documentation)\n\n## What to do after Run 1\n\nIf Run 1 is **PASS** , keep the current process model and move on to chunking and throughput tuning.\n\nIf Run 1 is **WARN** , run the same test again with request-count recycling:\n\n\n    uvicorn app:app \\\n      --host 0.0.0.0 \\\n      --port 9000 \\\n      --workers 1 \\\n      --limit-concurrency 1 \\\n      --limit-max-requests 500 \\\n      --limit-max-requests-jitter 50\n\n\nUvicorn documents both `--limit-max-requests` and jitter for exactly this long-running-worker problem class. (Uvicorn)\n\nIf Run 1 is **FAIL** , run two follow-ups:\n\n  1. same test with `--limit-max-requests 300`\n  2. same test with bounded audio length or chunked inference\n\n\n\nFor Wav2Vec2 specifically, the documented path for long or repeated long-audio inference is **chunking with stride** , not naïve whole-file inference. (Uvicorn)\n\n## Decision rule to keep\n\nUse this rule going forward:\n\n  * **Primary process metric:** `Pss`\n  * **Secondary corroboration:** `RssAnon`\n  * **Primary system metric:** `MemAvailable`\n  * **Hard stop:** any nonzero `VmSwap`\n\n\n\nThat matches the semantics Linux exposes for these fields and will keep your memory diagnosis grounded in the right signals instead of `free used` alone. (Kernel Documentation)",
  "title": "Continous increase in Memory usage"
}