Raw Record Source

{
  "$type": "site.standard.document",
  "bskyPostRef": {
    "cid": "bafyreiekkcysqdw2lo4v5xvbvxbndtihuqyunwnszjq6ksmddino47kco4",
    "uri": "at://did:plc:25rdn5elo5izoxrmtis34zuk/app.bsky.feed.post/3moxd4j2qt5i2"
  },
  "coverImage": {
    "$type": "blob",
    "ref": {
      "$link": "bafkreigwqznwe2cs554tdexcqntnj2gnyql3z5ttci57ym2mjf2sdugefu"
    },
    "mimeType": "image/webp",
    "size": 115348
  },
  "path": "/h_amro_13de6b93cc1ce/arabic-ocr-with-an-api-make-scanned-arabic-pdfs-searchable-python-5hah",
  "publishedAt": "2026-06-23T11:25:30.000Z",
  "site": "https://dev.to",
  "tags": [
    "api",
    "python",
    "ocr",
    "arabic"
  ],
  "textContent": "If you've ever tried to extract text from a scanned Arabic document, you already know the pain. Most OCR tooling is built English-first. Arabic adds three problems on top:\n\n  1. **Right-to-left (RTL) text** that breaks naive layout assumptions.\n  2. **Connected letters (ligatures)** — the same letter changes shape depending on its position in the word.\n  3. **Diacritics and a different numeral set** that generic models drop or mangle.\n\n\n\nThe result: you run a scanned Arabic contract, invoice, or government form through a typical \"PDF to text\" tool and get back garbage — reversed words, missing letters, or nothing at all.\n\nThis post shows a practical way to turn a **scanned Arabic PDF into a searchable PDF** (a real, selectable text layer underneath the original page image) with a single API call — no ML pipeline to build, no GPU, no model weights to host. Code is in Python, cURL, and JavaScript.\n\n##  Contents\n\n  * What \"searchable PDF\" actually means\n  * The approach\n  * Tips for better Arabic OCR results\n  * Honest limitations\n  * Why an API instead of self-hosting Tesseract\n  * Pricing\n  * Wrap-up\n\n\n\n##  What \"searchable PDF\" actually means\n\nThere are two different things people call \"OCR\":\n\n  * **Text extraction** — you get back a string of the recognized text.\n  * **Searchable PDF** — you get back a _PDF that looks identical to the scan_ , but now has an invisible text layer, so `Ctrl+F`, copy-paste, and indexing all work.\n\n\n\nThe second is what most real workflows need: you keep the original document exactly as scanned (important for legal/official docs), but it becomes searchable and accessible. That's what we'll produce here.\n\n##  The approach\n\nWe'll use the **PDF Tools API** `/ocr` endpoint. Under the hood it runs Tesseract with the Arabic (`ara`) and English (`eng`) language models and rebuilds the PDF with an invisible OCR text layer. 
The relevant detail for us: you can pass `lang=eng+ara` to recognize **mixed Arabic/English documents** in one pass — which is what most real MENA paperwork actually is (Arabic body text, English brand names, Latin numbers).\n\nYou'll need a free API key from the listing (the free tier is 1,000 requests/month, no card). Then:\n\n###  Python\n\n\n    import requests\n\n    API_KEY = \"YOUR_RAPIDAPI_KEY\"\n    HOST = \"pdf-tools-api2.p.rapidapi.com\"\n\n    with open(\"arabic_scan.pdf\", \"rb\") as f:\n        resp = requests.post(\n            f\"https://{HOST}/ocr\",\n            headers={\"X-RapidAPI-Key\": API_KEY, \"X-RapidAPI-Host\": HOST},\n            files={\"file\": (\"arabic_scan.pdf\", f, \"application/pdf\")},\n            data={\"lang\": \"eng+ara\"},   # mixed Arabic + English\n        )\n    resp.raise_for_status()\n\n    with open(\"searchable.pdf\", \"wb\") as out:\n        out.write(resp.content)\n\n    print(\"Done — searchable.pdf now has a real text layer.\")\n\n\nOpen `searchable.pdf` and try selecting the Arabic text or searching it. 
It's there now.\n\n###  cURL\n\n\n    curl -X POST \"https://pdf-tools-api2.p.rapidapi.com/ocr\" \\\n      -H \"X-RapidAPI-Key: YOUR_RAPIDAPI_KEY\" \\\n      -H \"X-RapidAPI-Host: pdf-tools-api2.p.rapidapi.com\" \\\n      -F \"file=@arabic_scan.pdf\" \\\n      -F \"lang=eng+ara\" \\\n      --output searchable.pdf\n\n\n###  JavaScript (Node / browser)\n\n\n    const form = new FormData();\n    form.append(\"file\", fileInput.files[0]);\n    form.append(\"lang\", \"eng+ara\");\n\n    const res = await fetch(\"https://pdf-tools-api2.p.rapidapi.com/ocr\", {\n      method: \"POST\",\n      headers: {\n        \"X-RapidAPI-Key\": \"YOUR_RAPIDAPI_KEY\",\n        \"X-RapidAPI-Host\": \"pdf-tools-api2.p.rapidapi.com\",\n      },\n      body: form,\n    });\n    // Fail loudly instead of saving an error body as a PDF\n    if (!res.ok) throw new Error(`OCR request failed: ${res.status}`);\n    const blob = await res.blob(); // application/pdf, now searchable\n\n    // Browser: download the searchable PDF\n    const url = URL.createObjectURL(blob);\n    const a = Object.assign(document.createElement(\"a\"), { href: url, download: \"searchable.pdf\" });\n    a.click();\n    URL.revokeObjectURL(url);\n\n\nJust need the raw text instead of a searchable PDF?\n\nIf you only want the extracted string (for a database, a search index, an LLM pipeline), run the searchable PDF through `/extract-text`:\n\n\n\n    # Reuses API_KEY and HOST from the Python example above\n    with open(\"searchable.pdf\", \"rb\") as f:\n        resp = requests.post(\n            \"https://pdf-tools-api2.p.rapidapi.com/extract-text\",\n            headers={\"X-RapidAPI-Key\": API_KEY, \"X-RapidAPI-Host\": HOST},\n            files={\"file\": (\"searchable.pdf\", f, \"application/pdf\")},\n        )\n    resp.raise_for_status()\n    print(resp.json()[\"text\"])\n\n\n##  Tips for better Arabic OCR results\n\nOCR quality depends mostly on the **input scan**, not the engine. To get clean output:\n\n  * **Scan at 300 DPI** or higher. Below ~200 DPI, connected Arabic letters blur together.\n  * **Deskew** crooked scans before sending. 
Even 2–3° of rotation hurts RTL recognition.\n  * **Use `eng+ara`, not `ara` alone**, for any document that mixes Latin characters (almost all real-world ones do).\n  * **Keep it under 15 pages per request** (split larger docs first — there's a `/split` endpoint).\n  * **Black-on-white** beats colored backgrounds; if your scan is noisy, that's the biggest quality lever.\n\n\n\n##  Honest limitations\n\n\nThis is Tesseract-based OCR, not a frontier vision model. It's excellent for **printed** Arabic (forms, contracts, books, invoices). It is **not** built for handwritten Arabic, heavily stylized calligraphy, or low-resolution phone photos — accuracy drops sharply there, same as every OCR engine. For clean printed scans it's genuinely good and, importantly, it's _available_ — which is more than most PDF APIs can say for Arabic at all.\n\n\n##  Why an API instead of self-hosting Tesseract\n\nYou _can_ `apt install tesseract-ocr-ara` and wire up the PDF rebuild yourself. People do. But you then own:\n\n  * installing and updating Tesseract + the Arabic language data,\n  * the rasterize → OCR → re-embed-text-layer pipeline (the fiddly part),\n  * font/encoding edge cases for the invisible RTL text layer,\n  * scaling it without melting your server on a 15-page scan.\n\n\n\nIf Arabic OCR is core to your product, self-hosting is fine. If it's one feature among many, one HTTP call you can put in a spreadsheet beats a maintenance project.\n\n##  Pricing, briefly\n\nThe API is **flat per-request** — one OCR call is one request, whether it's a 1-page or 15-page scan. No credit tables, no per-page billing (iLovePDF, for comparison, charges OCR per page in credits). Free tier is 1,000 requests/month, permanently, no card. The same key also does merge, split, compress, encrypt, HTML→PDF, Office→PDF, redaction, and table extraction — 26 endpoints total.\n\n##  Wrap-up\n\nArabic OCR has a reputation for being painful, and self-hosting it is. 
But for printed documents, turning a scanned Arabic PDF into a searchable one is now a single API call with `lang=eng+ara`. If you're digitizing Arabic archives, building a MENA document-management product, or just need `Ctrl+F` to work on a scanned contract, this gets you there in five minutes.\n\n**Your turn:** what trips you up most with Arabic OCR — RTL layout, connected-letter ligatures, or diacritics getting dropped? And what are you digitizing: contracts, old books, or handwritten notes? Tell me in the comments. 👇\n\nTry the Arabic OCR API free — 1,000 requests/month, no card\n\n_Built and maintained by a solo developer (based in Syria) who actually answers — questions welcome in the comments._",
  "title": "Arabic OCR with an API: Make Scanned Arabic PDFs Searchable (Python)"
}
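
The post's tips say to keep each OCR request under 15 pages and to split larger documents first. A minimal sketch of the planning step, assuming you already know the page count: it only computes 1-based page ranges and does not call the API. The helper name `chunk_page_ranges` is hypothetical, and the parameters the `/split` endpoint accepts are not described in the post, so mapping these ranges onto a split request is left open.

```python
from typing import List, Tuple

# Page limit quoted in the post's tips; treated here as an inclusive cap (an assumption).
MAX_PAGES_PER_REQUEST = 15


def chunk_page_ranges(
    total_pages: int, max_pages: int = MAX_PAGES_PER_REQUEST
) -> List[Tuple[int, int]]:
    """Return 1-based inclusive (first, last) ranges, each at most max_pages long."""
    if total_pages < 0:
        raise ValueError("total_pages must be non-negative")
    if max_pages < 1:
        raise ValueError("max_pages must be at least 1")

    ranges: List[Tuple[int, int]] = []
    first = 1
    while first <= total_pages:
        # Clamp the final chunk to the end of the document.
        last = min(first + max_pages - 1, total_pages)
        ranges.append((first, last))
        first = last + 1
    return ranges


if __name__ == "__main__":
    for first, last in chunk_page_ranges(40):
        print(f"pages {first}-{last}")
```

Each resulting range would become one split output and therefore one OCR request, which also makes it easy to estimate how many requests a large archive will consume against the free tier.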