Raw Record Source

{
  "$type": "site.standard.document",
  "bskyPostRef": {
    "cid": "bafyreievvwp4rcqxcfdijuj3yjlhatoi4al2ppxuh2msmqqm6wvtuygkci",
    "uri": "at://did:plc:25rdn5elo5izoxrmtis34zuk/app.bsky.feed.post/3moiu7wgj3lo2"
  },
  "coverImage": {
    "$type": "blob",
    "ref": {
      "$link": "bafkreicjqiwnuplpnexwrg55q3332akhh6p4ejafpio5kkrocfahlicfgu"
    },
    "mimeType": "image/webp",
    "size": 83180
  },
  "path": "/jamhimself/why-your-youtube-transcript-scraper-started-returning-empty-strings-and-how-to-fix-it-in-2026-20ed",
  "publishedAt": "2026-06-17T17:21:11.000Z",
  "site": "https://dev.to",
  "tags": [
    "ai",
    "python",
    "webdev",
    "opensource",
    "bgutils-js",
    "github.com/jamhimself/youtube-transcript-cli",
    "a transcript scraper",
    "whole-channel-to-RAG"
  ],
  "textContent": "If you have a script that pulled YouTube transcripts a year ago, there's a good chance it quietly broke. It still runs, no errors — it just returns **empty**. Here's what changed, and how to actually get captions again.\n\n##  The symptom\n\nYou hit YouTube's caption endpoint (`timedtext`), get back **HTTP 200** … and an **empty body**. No exception, no 403, nothing to catch. Just nothing. So your pipeline happily writes empty transcripts and you don't notice until your RAG index is full of blanks.\n\nThis is why a lot of the popular libraries went dark in 2025–2026, even ones that are still \"maintained.\" The request shape that used to work now returns nothing.\n\n##  What actually changed: PoToken\n\nYouTube now requires a **Proof-of-Origin Token (PoToken)** — generated by its BotGuard system — on caption requests. Without a valid token bound to the specific video, `timedtext` returns that empty `200`. Datacenter IPs (AWS/GCP/Azure) also get blocked or throttled hard, which is the _second_ reason server-side scrapers silently fail.\n\nSo the modern recipe is three steps:\n\n  1. **Fetch the watch page** and parse `ytInitialPlayerResponse` for the caption tracks + `visitorData`.\n  2. **Mint a PoToken** bound to the video ID by solving the BotGuard challenge.\n  3. **Request the caption track** with `&pot=<token>&c=WEB&fmt=json3` — _then_ you get real JSON back.\n\n\n\nThe PoToken part is the bit everyone gets stuck on. You don't have to reverse-engineer BotGuard yourself — bgutils-js (paired with `jsdom` to give it a DOM to run in) handles the challenge. Here's the shape of it:\n\n\n\n    import { BG, buildURL, GOOG_API_KEY } from 'bgutils-js';\n    import { JSDOM } from 'jsdom';\n\n    // 1. give BotGuard a DOM to run in\n    const dom = new JSDOM('<!DOCTYPE html><html><body></body></html>', {\n      url: 'https://www.youtube.com/',\n    });\n    Object.assign(globalThis, { window: dom.window, document: dom.window.document });\n\n    // 2. 
solve the challenge -> integrity token -> a minter bound to your session\n    // (not defined in this snippet: visitorData, videoId and captionTrack come from step 1;\n    //  requestKey is the BotGuard request key you pass to bgutils-js)\n    const challenge = await BG.Challenge.create({ fetch, globalObj: globalThis, requestKey, identifier: visitorData });\n    // load the BotGuard interpreter into the jsdom-backed global scope\n    new Function(challenge.interpreterJavascript.privateDoNotAccessOrElseSafeScriptWrappedValue)();\n    const bg = await BG.BotGuardClient.create({ program: challenge.program, globalName: challenge.globalName, globalObj: globalThis });\n    const out = [];\n    // exchange the BotGuard snapshot for an integrity token\n    const it = await fetch(buildURL('GenerateIT', false), {\n      method: 'POST',\n      headers: { 'Content-Type': 'application/json+protobuf', 'x-goog-api-key': GOOG_API_KEY },\n      body: JSON.stringify([requestKey, await bg.snapshot({ webPoSignalOutput: out })]),\n    });\n    const minter = await BG.WebPoMinter.create({ integrityToken: (await it.json())[0] }, out);\n\n    // 3. mint a token bound to THIS video, attach to the caption URL\n    const pot = await minter.mintAsWebsafeString(videoId);\n    const url = new URL(captionTrack.baseUrl);\n    url.searchParams.set('fmt', 'json3');\n    url.searchParams.set('pot', pot);\n    url.searchParams.set('c', 'WEB');\n    const segments = (await (await fetch(url)).json()).events; // <- finally, real data\n\n\nA couple of things that bit me:\n\n  * **Bind the token to the video ID** (`mintAsWebsafeString(videoId)`), not a generic identifier — a session-only token still returns empty on `timedtext`.\n  * **`&c=WEB` is required** alongside `&pot=`. Miss it and you're back to the empty `200`.\n  * The integrity token has a TTL (~12h), so for batches you bootstrap **once** and reuse the minter.\n\n\n\n##  I packaged it as a tiny CLI\n\nI got tired of re-deriving this, so I put it in a zero-config MIT package:\n\n\n\n    npx get-youtube-transcript https://www.youtube.com/watch?v=jNQXAC9IVRw\n\n\nIt does the whole watch-page → PoToken → captions dance and prints the transcript (text/JSON/SRT). Source: github.com/jamhimself/youtube-transcript-cli. 
It's single-video and runs from your IP — perfect for scripts and notebooks.\n\n##  Where it gets hard: scale\n\nThe CLI works great until you need **hundreds or thousands** of videos. Then you hit the _other_ wall: YouTube rate-limits and blocks datacenter IPs, so an unattended server job gets throttled fast. At that point you need rotating residential proxies + retries + uptime monitoring, which is a different project from \"parse the captions.\"\n\nThat's the part I run as a hosted service on Apify — a transcript scraper and a whole-channel-to-RAG version that lists a channel and returns every video's transcript as chunked, embed-ready text. Same engine as the CLI, just with the proxy/uptime layer so you don't babysit it. (Mentioning it because \"how do I do this at scale\" is the inevitable next question — but the CLI above is genuinely all you need for low-volume.)\n\n##  TL;DR\n\n  * Empty transcripts in 2026 = missing **PoToken** + datacenter-IP blocks.\n  * Recipe: watch page → caption tracks → mint a video-bound PoToken (`bgutils-js`) → fetch with `&pot=&c=WEB&fmt=json3`.\n  * For one-off use, `npx get-youtube-transcript <url>`.\n  * For scale, you need residential proxies + retries on top — that's the real cost, not the parsing.\n\n\n\nIf your captions pipeline has been quietly returning blanks, now you know why. Go check your RAG index. 🙃",
  "title": "Why your YouTube transcript scraper started returning empty strings (and how to fix it in 2026)"
}