{
"$type": "site.standard.document",
"bskyPostRef": {
"cid": "bafyreiekkcysqdw2lo4v5xvbvxbndtihuqyunwnszjq6ksmddino47kco4",
"uri": "at://did:plc:25rdn5elo5izoxrmtis34zuk/app.bsky.feed.post/3moxd4j2qt5i2"
},
"coverImage": {
"$type": "blob",
"ref": {
"$link": "bafkreigwqznwe2cs554tdexcqntnj2gnyql3z5ttci57ym2mjf2sdugefu"
},
"mimeType": "image/webp",
"size": 115348
},
"path": "/h_amro_13de6b93cc1ce/arabic-ocr-with-an-api-make-scanned-arabic-pdfs-searchable-python-5hah",
"publishedAt": "2026-06-23T11:25:30.000Z",
"site": "https://dev.to",
"tags": [
"api",
"python",
"ocr",
"arabic",
"Try the Arabic OCR API free — 1,000 requests/month, no card",
"@arabic_scan.pdf"
],
"textContent": "If you've ever tried to extract text from a scanned Arabic document, you already know the pain. Most OCR tooling is built English-first. Arabic adds three problems on top:\n\n 1. **Right-to-left (RTL) text** that breaks naive layout assumptions.\n 2. **Connected letters (ligatures)** — the same letter changes shape depending on its position in the word.\n 3. **Diacritics and a different numeral set** that generic models drop or mangle.\n\n\n\nThe result: you run a scanned Arabic contract, invoice, or government form through a typical \"PDF to text\" tool and get back garbage — reversed words, missing letters, or nothing at all.\n\nThis post shows a practical way to turn a **scanned Arabic PDF into a searchable PDF** (a real, selectable text layer underneath the original page image) with a single API call — no ML pipeline to build, no GPU, no model weights to host. Code is in Python, cURL, and JavaScript.\n\n## Contents\n\n * What \"searchable PDF\" actually means\n * The approach\n * Tips for better Arabic OCR results\n * Honest limitations\n * Why an API instead of self-hosting Tesseract\n * Pricing\n * Wrap-up\n\n\n\n## What \"searchable PDF\" actually means\n\nThere are two different things people call \"OCR\":\n\n * **Text extraction** — you get back a string of the recognized text.\n * **Searchable PDF** — you get back a _PDF that looks identical to the scan_ , but now has an invisible text layer, so `Ctrl+F`, copy-paste, and indexing all work.\n\n\n\nThe second is what most real workflows need: you keep the original document exactly as scanned (important for legal/official docs), but it becomes searchable and accessible. That's what we'll produce here.\n\n## The approach\n\nWe'll use the **PDF Tools API** `/ocr` endpoint. Under the hood it runs Tesseract with the Arabic (`ara`) and English (`eng`) language models and rebuilds the PDF with an invisible OCR text layer. 
The relevant detail for us: you can pass `lang=eng+ara` to recognize **mixed Arabic/English documents** in one pass, which is what most real MENA paperwork actually is (Arabic body text, English brand names, Latin numbers).\n\nYou'll need a free API key from the listing (the free tier is 1,000 requests/month, no card). Then:\n\n### Python\n\n\n    import requests\n\n    API_KEY = \"YOUR_RAPIDAPI_KEY\"\n    HOST = \"pdf-tools-api2.p.rapidapi.com\"\n\n    # Upload the scan; the response body is the searchable PDF.\n    with open(\"arabic_scan.pdf\", \"rb\") as f:\n        resp = requests.post(\n            f\"https://{HOST}/ocr\",\n            headers={\"X-RapidAPI-Key\": API_KEY, \"X-RapidAPI-Host\": HOST},\n            files={\"file\": (\"arabic_scan.pdf\", f, \"application/pdf\")},\n            data={\"lang\": \"eng+ara\"},  # mixed Arabic + English\n        )\n    resp.raise_for_status()  # fail loudly instead of saving an error body as a PDF\n\n    with open(\"searchable.pdf\", \"wb\") as out:\n        out.write(resp.content)\n\n    print(\"Done: searchable.pdf now has a real text layer.\")\n\n\nOpen `searchable.pdf` and try selecting the Arabic text or searching it. It's there now.\n\n### cURL\n\n\n    curl -X POST \"https://pdf-tools-api2.p.rapidapi.com/ocr\" \\\n      -H \"X-RapidAPI-Key: YOUR_RAPIDAPI_KEY\" \\\n      -H \"X-RapidAPI-Host: pdf-tools-api2.p.rapidapi.com\" \\\n      -F \"file=@arabic_scan.pdf\" \\\n      -F \"lang=eng+ara\" \\\n      --output searchable.pdf\n\n\n### JavaScript (Node / browser)\n\n\n    // fileInput: a file input element on the page (browser)\n    const form = new FormData();\n    form.append(\"file\", fileInput.files[0]);\n    form.append(\"lang\", \"eng+ara\");\n\n    const res = await fetch(\"https://pdf-tools-api2.p.rapidapi.com/ocr\", {\n      method: \"POST\",\n      headers: {\n        \"X-RapidAPI-Key\": \"YOUR_RAPIDAPI_KEY\",\n        \"X-RapidAPI-Host\": \"pdf-tools-api2.p.rapidapi.com\",\n      },\n      body: form,\n    });\n    if (!res.ok) throw new Error(`OCR request failed: ${res.status}`);\n    const blob = await res.blob(); // application/pdf, now searchable\n\n    // Browser: download the searchable PDF\n    const url = URL.createObjectURL(blob);\n    const a = Object.assign(document.createElement(\"a\"), { href: url, download: \"searchable.pdf\" });\n    a.click();\n    URL.revokeObjectURL(url);\n\n\nJust need the raw text instead of a 
searchable PDF?\n\nIf you only want the extracted string (for a database, a search index, an LLM pipeline), run the searchable PDF through `/extract-text`:\n\n\n\n    # Reuses API_KEY and HOST from the Python example above.\n    with open(\"searchable.pdf\", \"rb\") as f:\n        resp = requests.post(\n            f\"https://{HOST}/extract-text\",\n            headers={\"X-RapidAPI-Key\": API_KEY, \"X-RapidAPI-Host\": HOST},\n            files={\"file\": (\"searchable.pdf\", f, \"application/pdf\")},\n        )\n    resp.raise_for_status()\n    print(resp.json()[\"text\"])\n\n\n## Tips for better Arabic OCR results\n\nOCR quality depends mostly on the **input scan**, not the engine. To get clean output:\n\n * **Scan at 300 DPI** or higher. Below ~200 DPI, connected Arabic letters blur together.\n * **Deskew** crooked scans before sending. Even 2–3° of rotation hurts RTL recognition.\n * **Use `eng+ara`, not `ara` alone**, for any document that mixes in Latin characters (almost all real-world ones do).\n * **Keep it under 15 pages per request** (split larger docs first; there's a `/split` endpoint).\n * **Black-on-white** beats colored backgrounds; if your scan is noisy, cleaning up the background is the biggest quality lever.\n\n\n\n## Honest limitations\n\n\nThis is Tesseract-based OCR, not a frontier vision model. It's excellent for **printed** Arabic (forms, contracts, books, invoices). It is **not** built for handwritten Arabic, heavily stylized calligraphy, or low-resolution phone photos; accuracy drops sharply there, as it does with most OCR engines. For clean printed scans it's genuinely good and, importantly, it's _available_, which is more than most PDF APIs can say for Arabic at all.\n\n\n## Why an API instead of self-hosting Tesseract\n\nYou _can_ `apt install tesseract-ocr-ara` and wire up the PDF rebuild yourself. People do. 
But you then own:\n\n * installing and updating Tesseract + the Arabic language data,\n * the rasterize → OCR → re-embed-text-layer pipeline (the fiddly part),\n * font/encoding edge cases for the invisible RTL text layer,\n * scaling it without melting your server on a 15-page scan.\n\n\n\nIf Arabic OCR is core to your product, self-hosting is fine. If it's one feature among many, one HTTP call you can put in a spreadsheet beats a maintenance project.\n\n## Pricing, briefly\n\nThe API is **flat per-request** — one OCR call is one request, whether it's a 1-page or 15-page scan. No credit tables, no per-page billing (iLovePDF, for comparison, charges OCR per page in credits). Free tier is 1,000 requests/month, permanently, no card. The same key also does merge, split, compress, encrypt, HTML→PDF, Office→PDF, redaction, and table extraction — 26 endpoints total.\n\n## Wrap-up\n\nArabic OCR has a reputation for being painful, and self-hosting it is. But for printed documents, turning a scanned Arabic PDF into a searchable one is now a single API call with `lang=eng+ara`. If you're digitizing Arabic archives, building a MENA document-management product, or just need `Ctrl+F` to work on a scanned contract, this gets you there in five minutes.\n\n**Your turn:** what trips you up most with Arabic OCR — RTL layout, connected-letter ligatures, or diacritics getting dropped? And what are you digitizing: contracts, old books, or handwritten notes? Tell me in the comments. 👇\n\nTry the Arabic OCR API free — 1,000 requests/month, no card\n\n_Built and maintained by a solo developer (based in Syria) who actually answers — questions welcome in the comments._",
"title": "Arabic OCR with an API: Make Scanned Arabic PDFs Searchable (Python)"
}