Raw Record Source

{
  "$type": "site.standard.document",
  "bskyPostRef": {
    "cid": "bafyreia7rp5wngq6e7ogqwvwaqnlgsov4oxe4lrvbsbepfdudb4scjaj5u",
    "uri": "at://did:plc:pgryn3ephfd2xgft23qokfzt/app.bsky.feed.post/3mk77e2uhhxi2"
  },
  "path": "/t/why-am-i-facing-this-error-while-running-this-code/175485#post_2",
  "publishedAt": "2026-04-23T23:30:18.000Z",
  "site": "https://discuss.huggingface.co",
  "tags": [
    "Hugging Face",
    "GitHub"
  ],
  "textContent": "Based on that screenshot alone, I can’t pinpoint the exact cause…\nBut it might be a case of a Transformers version mismatch:\n\n* * *\n\n## Plain-language answer\n\nYour code is failing because **the Hugging Face Whisper speech-recognition pipeline expected an internal value called`num_frames`, but that value was missing**.\n\nYou did **not** forget to write `num_frames` yourself.\nYou are **not** supposed to pass `num_frames` manually.\n\nThe error is coming from inside the installed `transformers` package, not from your own `print(result)` line.\n\nYour code is basically this:\n\n\n    from transformers import pipeline\n\n    transcriber = pipeline(\n        task=\"automatic-speech-recognition\",\n        model=\"openai/whisper-large-v3\"\n    )\n\n    result = transcriber(\n        \"https://huggingface.co/datasets/Narsil/asr_dummy/resolve/main/mlk.flac\"\n    )\n\n    print(result)\n\n\nThe important error is:\n\n\n    KeyError: 'num_frames'\n\n\nThat means:\n\n> A part of the Whisper ASR pipeline tried to read `num_frames`, but the processed audio data did not contain that key.\n\n* * *\n\n# What is happening behind the scenes\n\nThe simple-looking line:\n\n\n    transcriber = pipeline(\n        task=\"automatic-speech-recognition\",\n        model=\"openai/whisper-large-v3\"\n    )\n\n\nbuilds a full speech-recognition system.\n\nIt does not only load the model. It also loads:\n\n  * the Whisper model,\n  * the tokenizer,\n  * the feature extractor,\n  * audio loading logic,\n  * audio decoding logic,\n  * preprocessing logic,\n  * generation logic,\n  * postprocessing logic.\n\n\n\nThe Hugging Face pipeline docs describe pipelines as high-level wrappers around model inference, and the ASR pipeline specifically works with audio files or raw waveforms. The docs also say audio-file input needs FFmpeg support for multiple audio formats. 
(Hugging Face)\n\nSo this call:\n\n\n    result = transcriber(\".../mlk.flac\")\n\n\ndoes many hidden steps:\n\n  1. download/read the `.flac` file,\n  2. decode the audio,\n  3. convert the audio into numerical features,\n  4. send those features into Whisper,\n  5. generate text,\n  6. return the transcript.\n\n\n\nYour error happens around step 3 or 4, before the final transcript is produced.\n\n* * *\n\n# What `num_frames` means\n\n`num_frames` is internal audio metadata.\n\nA “frame” here means a small processed unit of audio. Whisper does not read the raw `.flac` file directly. The audio has to be converted into model-ready features first.\n\nThe pipeline uses frame-related metadata for things like:\n\n  * audio length,\n  * chunking,\n  * timestamps,\n  * batching,\n  * long audio handling,\n  * mapping generated text back to time positions.\n\n\n\nSo when you see:\n\n\n    KeyError: 'num_frames'\n\n\nyou can read it as:\n\n> The pipeline expected audio-length bookkeeping information, but the object it received did not include that information.\n\nThis usually points to a **library/version mismatch or pipeline bug**, not a mistake in your visible code.\n\n* * *\n\n# Why your code is not obviously wrong\n\n## 1. The model name is valid\n\nThis model is real:\n\n\n    \"openai/whisper-large-v3\"\n\n\nThe official model card says Whisper large-v3 is supported in Hugging Face Transformers. It also shows how to run it with `AutoModelForSpeechSeq2Seq`, `AutoProcessor`, and `pipeline`. (Hugging Face)\n\nSo the problem is probably not the model ID.\n\n* * *\n\n## 2. Passing an audio URL is allowed\n\nThe ASR pipeline supports a string input that is either:\n\n  * a local audio file path, or\n  * a public URL to an audio file.\n\n\n\nThe current ASR pipeline source says a string can be a filename or public URL, and the file is read at the correct sampling rate using FFmpeg. 
(GitHub)\n\nSo this is valid in principle:\n\n\n    transcriber(\"https://huggingface.co/datasets/Narsil/asr_dummy/resolve/main/mlk.flac\")\n\n\nThe URL style using `/resolve/main/...` is the correct “raw file” style.\n\n* * *\n\n## 3. The same error pattern exists online\n\nThere is a very similar public Transformers issue where a Whisper ASR pipeline fails with:\n\n\n    KeyError: 'num_frames'\n\n\ninside the feature-extraction / pipeline code. The issue is labeled as a bug. (GitHub)\n\nThere are also related reports where batching fails because some audio examples contain `num_frames` and others do not. One issue says `batch_size=1` worked, but `batch_size>1` failed with a key mismatch involving `num_frames`. (GitHub)\n\nThat strongly suggests your case belongs to a known family of Whisper pipeline problems.\n\n* * *\n\n# Most likely cause\n\nThe most likely cause is:\n\n> Your installed `transformers` / audio stack has a mismatch where the ASR pipeline expects `num_frames`, but the feature extractor path you are hitting does not return it.\n\nThis can happen because of:\n\n  * an older `transformers` version,\n  * a very new `transformers` version with a regression,\n  * mixed package versions,\n  * a notebook runtime that was upgraded without restart,\n  * audio dependencies not matching the current ASR stack,\n  * hidden changes in the hosted environment.\n\n\n\nIn simple terms:\n\n> Your code is small, but the environment underneath it is complicated.\n\n* * *\n\n# Best solution path\n\n## Step 1: restart with a clean package setup\n\nRun this first:\n\n\n    pip install --upgrade pip\n    pip install --upgrade transformers datasets[audio] accelerate\n\n\nThen **restart the runtime/kernel**.\n\nThis restart is important. 
Installing new packages while old modules are already imported can leave Python using stale code.\n\nThe official Whisper large-v3 model card recommends installing/upgrading `transformers`, `datasets[audio]`, and `accelerate` before running the model. (Hugging Face)\n\n* * *\n\n## Step 2: check your versions\n\nAfter restarting, run:\n\n\n    import sys\n    import transformers\n    import torch\n\n    print(\"python:\", sys.version)\n    print(\"transformers:\", transformers.__version__)\n    print(\"torch:\", torch.__version__)\n    print(\"cuda available:\", torch.cuda.is_available())\n\n\nThis tells you what you are actually running.\n\nThis matters because the same code may behave differently depending on:\n\n  * Python version,\n  * `transformers` version,\n  * `torch` version,\n  * audio decoding dependencies,\n  * CPU vs GPU runtime.\n\n\n\n* * *\n\n## Step 3: use the safer official-style code\n\nInstead of the shortest `pipeline(...)` version, use the more explicit pattern.\n\n\n    import torch\n    from transformers import AutoModelForSpeechSeq2Seq, AutoProcessor, pipeline\n\n    model_id = \"openai/whisper-large-v3\"\n\n    device = \"cuda:0\" if torch.cuda.is_available() else \"cpu\"\n    torch_dtype = torch.float16 if torch.cuda.is_available() else torch.float32\n\n    model = AutoModelForSpeechSeq2Seq.from_pretrained(\n        model_id,\n        torch_dtype=torch_dtype,\n        low_cpu_mem_usage=True,\n        use_safetensors=True,\n    )\n\n    model.to(device)\n\n    processor = AutoProcessor.from_pretrained(model_id)\n\n    transcriber = pipeline(\n        \"automatic-speech-recognition\",\n        model=model,\n        tokenizer=processor.tokenizer,\n        feature_extractor=processor.feature_extractor,\n        torch_dtype=torch_dtype,\n        device=device,\n    )\n\n    audio_url = \"https://huggingface.co/datasets/Narsil/asr_dummy/resolve/main/mlk.flac\"\n\n    result = transcriber(audio_url)\n\n    print(result[\"text\"])\n\n\nThis 
matches the official model-card style more closely: load the model, load the processor, then pass the tokenizer and feature extractor into the pipeline explicitly. (Hugging Face)\n\nThis is better because the error involves audio feature extraction. Making the feature extractor explicit reduces hidden auto-loading ambiguity.\n\n* * *\n\n# Why this code is safer\n\nYour original code:\n\n\n    transcriber = pipeline(\n        task=\"automatic-speech-recognition\",\n        model=\"openai/whisper-large-v3\"\n    )\n\n\nasks Transformers to infer everything automatically.\n\nThe safer code:\n\n\n    processor = AutoProcessor.from_pretrained(model_id)\n\n    transcriber = pipeline(\n        \"automatic-speech-recognition\",\n        model=model,\n        tokenizer=processor.tokenizer,\n        feature_extractor=processor.feature_extractor,\n    )\n\n\nmakes the important pieces visible:\n\n  * the model,\n  * the tokenizer,\n  * the feature extractor.\n\n\n\nThat matters because `num_frames` is related to how the audio is processed before reaching the model.\n\n* * *\n\n# Step 4: test whether the URL is involved\n\nYour URL is probably not the main problem, but it is easy to test.\n\nTry downloading the file first:\n\n\n    import requests\n\n    url = \"https://huggingface.co/datasets/Narsil/asr_dummy/resolve/main/mlk.flac\"\n    path = \"mlk.flac\"\n\n    response = requests.get(url)\n    response.raise_for_status()\n\n    with open(path, \"wb\") as f:\n        f.write(response.content)\n\n    result = transcriber(path)\n    print(result[\"text\"])\n\n\nInterpret the result like this:\n\nResult | Meaning\n---|---\nLocal file works, URL fails | The issue may be URL reading or remote audio decoding.\nLocal file also fails | The issue is probably the Transformers/Whisper pipeline stack.\nBoth work after upgrade | The issue was likely a package-version problem.\n\n* * *\n\n# Step 5: make sure audio dependencies are present\n\nFor audio work, you may need a fuller audio 
stack:\n\n\n    pip install --upgrade soundfile librosa torchcodec\n\n\nFor Linux/Colab-style systems, also check FFmpeg:\n\n\n    ffmpeg -version\n\n\nIf FFmpeg is missing on a Debian/Ubuntu-style system:\n\n\n    sudo apt-get update\n    sudo apt-get install -y ffmpeg\n\n\nHugging Face Datasets audio decoding uses TorchCodec, which uses FFmpeg under the hood. (Hugging Face)\n\n* * *\n\n# Step 6: do not add extra options yet\n\nFirst make plain transcription work.\n\nAvoid these at the beginning:\n\n\n    batch_size=8\n    chunk_length_s=30\n    return_timestamps=True\n    return_timestamps=\"word\"\n    generate_kwargs={...}\n\n\nWhy?\n\nBecause `num_frames` is tied to pipeline bookkeeping for things like batching, timestamps, and chunking. The ASR source shows the pipeline handles chunking, stride, `num_frames`, timestamps, and postprocessing internally. (GitHub)\n\nStart with:\n\n\n    result = transcriber(audio_url)\n    print(result[\"text\"])\n\n\nThen add features one by one.\n\n* * *\n\n# If upgrading does not fix it\n\nTry one of these controlled paths.\n\n## Option A: reinstall cleanly\n\n\n    pip install --upgrade --force-reinstall transformers datasets[audio] accelerate\n\n\nThen restart the runtime.\n\n* * *\n\n## Option B: install the newest code from GitHub\n\n\n    pip install --upgrade --force-reinstall git+https://github.com/huggingface/transformers.git\n    pip install --upgrade datasets[audio] accelerate\n\n\nThen restart the runtime.\n\nThis is useful when the bug has already been fixed in the repository but not yet in the normal pip release.\n\n* * *\n\n## Option C: pin a known working version\n\nIf a specific version works, save it.\n\nFor example, after finding a working setup:\n\n\n    pip freeze | grep -E \"transformers|torch|datasets|accelerate|torchcodec\"\n\n\nThen put those exact working versions in your notebook or `requirements.txt`.\n\nExample format:\n\n\n    transformers==...\n    torch==...\n    datasets==...\n    
accelerate==...\n    torchcodec==...\n\n\nUse the actual versions that worked for you.\n\n* * *\n\n# Guides worth opening\n\n## 1. Whisper large-v3 model card\n\nUse this for the official recommended code pattern for `openai/whisper-large-v3`. It shows the explicit `AutoModelForSpeechSeq2Seq` + `AutoProcessor` + `pipeline` approach. (Hugging Face)\n\n## 2. Transformers pipeline docs\n\nUse this to understand what `pipeline(...)` does and why a simple call can fail inside hidden preprocessing code. (Hugging Face)\n\n## 3. ASR pipeline source code\n\nUse this only when you want to compare your traceback to the actual internal code. It shows that ASR input can be a local file path, public URL, bytes, raw NumPy array, or dictionary with sampling rate. (GitHub)\n\n## 4. Datasets audio loading docs\n\nUse this when audio loading or decoding fails. It explains that Datasets audio decoding relies on TorchCodec and FFmpeg. (Hugging Face)\n\n## 5. Related GitHub issues\n\nUse these to confirm that `num_frames` errors are a real known problem family, especially around Whisper, batching, and pipeline internals. (GitHub)\n\n* * *\n\n# What not to do\n\n## Do not manually pass `num_frames`\n\nThis is not the solution:\n\n\n    num_frames = ...\n\n\nThe missing value is inside the pipeline’s internal processed-audio object. It is not a parameter you are expected to provide.\n\n* * *\n\n## Do not edit the installed package first\n\nAvoid editing files inside:\n\n\n    site-packages/transformers/...\n\n\nFor example, changing:\n\n\n    processed.pop(\"num_frames\")\n\n\nto something else may hide the error, but it may break timestamp or chunking behavior later.\n\nA package upgrade, clean reinstall, or explicit model/processor loading is safer.\n\n* * *\n\n## Do not start with batching\n\nDo not start with:\n\n\n    result = transcriber(list_of_audio_files, batch_size=8)\n\n\nFirst make one audio file work. Related reports show `num_frames` can be involved in batching failures. 
(GitHub)\n\n* * *\n\n# Recommended final code\n\nUse this after upgrading and restarting:\n\n\n    import torch\n    from transformers import AutoModelForSpeechSeq2Seq, AutoProcessor, pipeline\n\n    model_id = \"openai/whisper-large-v3\"\n\n    device = \"cuda:0\" if torch.cuda.is_available() else \"cpu\"\n    torch_dtype = torch.float16 if torch.cuda.is_available() else torch.float32\n\n    model = AutoModelForSpeechSeq2Seq.from_pretrained(\n        model_id,\n        torch_dtype=torch_dtype,\n        low_cpu_mem_usage=True,\n        use_safetensors=True,\n    )\n\n    model.to(device)\n\n    processor = AutoProcessor.from_pretrained(model_id)\n\n    transcriber = pipeline(\n        \"automatic-speech-recognition\",\n        model=model,\n        tokenizer=processor.tokenizer,\n        feature_extractor=processor.feature_extractor,\n        torch_dtype=torch_dtype,\n        device=device,\n    )\n\n    result = transcriber(\n        \"https://huggingface.co/datasets/Narsil/asr_dummy/resolve/main/mlk.flac\"\n    )\n\n    print(result[\"text\"])\n\n\n* * *\n\n# Bottom line\n\nYour error is best understood like this:\n\n> The Whisper ASR pipeline is trying to process audio, but the installed pipeline stack expects an internal `num_frames` value that is missing. Your code is not obviously wrong; the issue is most likely a `transformers` / audio dependency / runtime-version mismatch or bug.\n\n## Quick checklist\n\n  * Upgrade:\n\n\n\n\n    pip install --upgrade pip\n    pip install --upgrade transformers datasets[audio] accelerate\n\n\n  * Restart runtime.\n  * Use explicit `AutoModelForSpeechSeq2Seq` + `AutoProcessor`.\n  * Test one audio file first.\n  * Avoid batching/timestamps/chunking until the basic call works.\n  * Check FFmpeg / audio dependencies if audio decoding fails.\n  * Pin the working package versions once it runs.\n\n",
  "title": "Why am I facing this Error while running this code"
}