{
"$type": "site.standard.document",
"bskyPostRef": {
"cid": "bafyreievvwp4rcqxcfdijuj3yjlhatoi4al2ppxuh2msmqqm6wvtuygkci",
"uri": "at://did:plc:25rdn5elo5izoxrmtis34zuk/app.bsky.feed.post/3moiu7wgj3lo2"
},
"coverImage": {
"$type": "blob",
"ref": {
"$link": "bafkreicjqiwnuplpnexwrg55q3332akhh6p4ejafpio5kkrocfahlicfgu"
},
"mimeType": "image/webp",
"size": 83180
},
"path": "/jamhimself/why-your-youtube-transcript-scraper-started-returning-empty-strings-and-how-to-fix-it-in-2026-20ed",
"publishedAt": "2026-06-17T17:21:11.000Z",
"site": "https://dev.to",
"tags": [
"ai",
"python",
"webdev",
"opensource",
"bgutils-js",
"github.com/jamhimself/youtube-transcript-cli",
"a transcript scraper",
"whole-channel-to-RAG"
],
"textContent": "If you have a script that pulled YouTube transcripts a year ago, there's a good chance it quietly broke. It still runs, no errors — it just returns **empty**. Here's what changed, and how to actually get captions again.\n\n## The symptom\n\nYou hit YouTube's caption endpoint (`timedtext`), get back **HTTP 200** … and an **empty body**. No exception, no 403, nothing to catch. Just nothing. So your pipeline happily writes empty transcripts and you don't notice until your RAG index is full of blanks.\n\nThis is why a lot of the popular libraries went dark in 2025–2026, even ones that are still \"maintained.\" The request shape that used to work now returns nothing.\n\n## What actually changed: PoToken\n\nYouTube now requires a **Proof-of-Origin Token (PoToken)** — generated by its BotGuard system — on caption requests. Without a valid token bound to the specific video, `timedtext` returns that empty `200`. Datacenter IPs (AWS/GCP/Azure) also get blocked or throttled hard, which is the _second_ reason server-side scrapers silently fail.\n\nSo the modern recipe is three steps:\n\n 1. **Fetch the watch page** and parse `ytInitialPlayerResponse` for the caption tracks + `visitorData`.\n 2. **Mint a PoToken** bound to the video ID by solving the BotGuard challenge.\n 3. **Request the caption track** with `&pot=<token>&c=WEB&fmt=json3` — _then_ you get real JSON back.\n\n\n\nThe PoToken part is the bit everyone gets stuck on. You don't have to reverse-engineer BotGuard yourself — bgutils-js (paired with `jsdom` to give it a DOM to run in) handles the challenge. Here's the shape of it:\n\n\n\n import { BG, buildURL, GOOG_API_KEY } from 'bgutils-js';\n import { JSDOM } from 'jsdom';\n\n // 1. give BotGuard a DOM to run in\n const dom = new JSDOM('<!DOCTYPE html><html><body></body></html>', {\n url: 'https://www.youtube.com/',\n });\n Object.assign(globalThis, { window: dom.window, document: dom.window.document });\n\n // 2. 
solve the challenge -> integrity token -> a minter bound to your session\n // (assumes requestKey is the BotGuard request key from the bgutils-js README, and that\n // visitorData, videoId and captionTrack were already parsed from the watch page in step 1)\n const challenge = await BG.Challenge.create({ fetch, globalObj: globalThis, requestKey, identifier: visitorData });\n new Function(challenge.interpreterJavascript.privateDoNotAccessOrElseSafeScriptWrappedValue)();\n const bg = await BG.BotGuardClient.create({ program: challenge.program, globalName: challenge.globalName, globalObj: globalThis });\n const out = [];\n const it = await fetch(buildURL('GenerateIT', false), {\n method: 'POST',\n headers: { 'Content-Type': 'application/json+protobuf', 'x-goog-api-key': GOOG_API_KEY },\n body: JSON.stringify([requestKey, await bg.snapshot({ webPoSignalOutput: out })]),\n });\n const minter = await BG.WebPoMinter.create({ integrityToken: (await it.json())[0] }, out);\n\n // 3. mint a token bound to THIS video, attach to the caption URL\n const pot = await minter.mintAsWebsafeString(videoId);\n const url = new URL(captionTrack.baseUrl);\n url.searchParams.set('fmt', 'json3');\n url.searchParams.set('pot', pot);\n url.searchParams.set('c', 'WEB');\n const segments = (await (await fetch(url)).json()).events; // <- finally, real data\n\n\nA few things that bit me:\n\n * **Bind the token to the video ID** (`mintAsWebsafeString(videoId)`), not a generic identifier; a session-only token still returns empty on `timedtext`.\n * **`&c=WEB` is required** alongside `&pot=`. Miss it and you're back to the empty `200`.\n * The integrity token has a TTL (~12h), so for batches you bootstrap **once** and reuse the minter.\n\n\n\n## I packaged it as a tiny CLI\n\nI got tired of re-deriving this, so I put it in a zero-config MIT package:\n\n\n\n npx get-youtube-transcript https://www.youtube.com/watch?v=jNQXAC9IVRw\n\n\nIt does the whole watch-page → PoToken → captions dance and prints the transcript (text/JSON/SRT). Source: github.com/jamhimself/youtube-transcript-cli. 
It handles one video at a time and runs from your own IP, which makes it a good fit for scripts and notebooks.\n\n## Where it gets hard: scale\n\nThe CLI works great until you need **hundreds or thousands** of videos. Then you hit the _other_ wall: YouTube rate-limits and blocks datacenter IPs, so an unattended server job gets throttled fast. At that point you need rotating residential proxies + retries + uptime monitoring, which is a different project from \"parse the captions.\"\n\nThat's the part I run as a hosted service on Apify: a transcript scraper and a whole-channel-to-RAG version that lists a channel and returns every video's transcript as chunked, embed-ready text. Same engine as the CLI, just with the proxy/uptime layer so you don't babysit it. (Mentioning it because \"how do I do this at scale\" is the inevitable next question, but the CLI above is genuinely all you need for low-volume use.)\n\n## TL;DR\n\n * Empty transcripts in 2026 = missing **PoToken** + datacenter-IP blocks.\n * Recipe: watch page → caption tracks → mint a video-bound PoToken (`bgutils-js`) → fetch with `&pot=<token>&c=WEB&fmt=json3`.\n * For one-off use, `npx get-youtube-transcript <url>`.\n * For scale, you need residential proxies + retries on top; that's the real cost, not the parsing.\n\n\n\nIf your captions pipeline has been quietly returning blanks, now you know why. Go check your RAG index. 🙃",
"title": "Why your YouTube transcript scraper started returning empty strings (and how to fix it in 2026)"
}