Raw Record Source

{
  "$type": "site.standard.document",
  "bskyPostRef": {
    "cid": "bafyreicrsrihruzmumpiinhoc24hjawuhfz4p5uiv2b3cznfmusvzfdeae",
    "uri": "at://did:plc:25rdn5elo5izoxrmtis34zuk/app.bsky.feed.post/3moowrvcsydi2"
  },
  "coverImage": {
    "$type": "blob",
    "ref": {
      "$link": "bafkreicl7me5lxwqsdjrfko43ov457cn242qzwibhzpuecymvba3fqupwm"
    },
    "mimeType": "image/webp",
    "size": 78336
  },
  "path": "/jacob_gong/parsing-and-rebuilding-epub-files-in-python-lessons-learned-from-building-an-ai-translation-service-jpb",
  "publishedAt": "2026-06-20T03:01:16.000Z",
  "site": "https://dev.to",
  "tags": [
    "python",
    "webdev",
    "ai",
    "tutorial"
  ],
  "textContent": "_How we extract, translate, and reconstruct entire ebooks with Python while preserving every detail_\n\nAt LectuLibre, we built a service that translates entire books using large language models. Our users upload EPUB files, and our backend pipeline parses them, extracts the text, sends it to an LLM for translation, and then rebuilds the EPUB with the translated content—all while preserving the original formatting, images, and metadata. This sounded straightforward until we looked inside a real EPUB.\n\nEPUB is essentially a ZIP file containing a structured set of XHTML, CSS, and XML files. The `content.opf` file defines the reading order (spine), metadata, and manifest. The `toc.ncx` holds the table of contents. The actual text lives in XHTML documents, often split per chapter. To translate a book, we needed to: 1) reliably parse the EPUB, 2) locate all translatable text, 3) send it chunk by chunk to the LLM, and 4) rebuild the EPUB with the translated text while keeping every byte of the formatting intact.\n\n##  The Problem with Off-the-Shelf Libraries\n\nWe initially reached for `ebooklib`, the most popular Python library for EPUB manipulation. It worked great for simple EPUBs—until we threw a few hundred real-world files at it. We quickly hit issues:\n\n  * **Metadata loss** : `ebooklib` didn’t fully preserve custom metadata or namespace-prefixed properties in the OPF.\n  * **Namespace handling** : When modifying XHTML, it could strip or mangle `xmlns` attributes, breaking rendering on some devices.\n  * **TOC and spine sync** : After rebuilding, the table of contents and spine often got out of sync unless we manually repaired them.\n  * **Large files** : Processing a 200‑chapter book consumed surprising memory because `ebooklib` loaded everything at once.\n\n\n\nWe could have used a heavyweight tool like Calibre’s command-line interface, but that introduced external dependencies and wasn’t as programmatically flexible. 
Instead, we decided to stick with `ebooklib` for high-level book structure and augment it with `lxml` for precise XML control.\n\n##  Our Parsing and Rebuilding Pipeline\n\nHere’s the core approach we landed on:\n\n  1. **Read the EPUB** with `ebooklib` to get a list of items (documents, images, CSS).\n  2. **Identify translatable content** – usually `ITEM_DOCUMENT` (XHTML) and sometimes `ITEM_NAVIGATION` (NCX for titles).\n  3. **Parse each XHTML document** with `lxml`, extract text, while keeping a map of each text node to its parent element.\n  4. **Send blocks of text** to the LLM for translation, preserving order and context.\n  5. **Rebuild the XHTML** by replacing original text nodes with their translations using the saved mapping.\n  6. **Write the new EPUB** with `ebooklib`, manually ensuring the OPF and spine are correct.\n\n\n\nLet’s dive into the code.\n\n###  Step 1: Reading and Filtering Items\n\n\n    import ebooklib\n    from ebooklib import epub\n\n    book = epub.read_epub('original.epub')\n\n    translatable_items = []\n    for item in book.get_items():\n        if item.get_type() == ebooklib.ITEM_DOCUMENT:\n            translatable_items.append(item)\n        # Some books use NCX for chapter titles\n        elif item.get_type() == ebooklib.ITEM_NAVIGATION:\n            translatable_items.append(item)\n\n\nWe ignore images, fonts, and CSS—they don’t contain translatable text.\n\n###  Step 2: Extracting Text with Context\n\nWe need to extract text while remembering exactly where it came from. 
We use `lxml.etree` to parse the XHTML and walk the tree, collecting text nodes and their XPath locations:\n\n\n\n    from lxml import etree\n\n    def extract_text_with_xpath(content):\n        # HTMLParser tolerates malformed markup, but it discards XML namespaces\n        parser = etree.HTMLParser()\n        root = etree.fromstring(content, parser)\n        tree = etree.ElementTree(root)\n\n        text_mapping = []  # list of (xpath, original_text, element)\n        for elem in root.iter():\n            # Comments and processing instructions have a non-string tag:\n            # skip their own text, but still collect the tail that follows them\n            if isinstance(elem.tag, str) and elem.text and elem.text.strip():\n                xpath = tree.getpath(elem)\n                text_mapping.append((xpath, elem.text, elem))\n            if elem.tail and elem.tail.strip():\n                # lxml stores tail text on the element it follows, but it sits inside the parent\n                parent = elem.getparent()\n                xpath = tree.getpath(parent) if parent is not None else None\n                if xpath:\n                    text_mapping.append((xpath, elem.tail, elem))\n        # Return the root as well, so the modified tree can be serialized later\n        return root, text_mapping\n\n\nPay attention to `tail` text: it’s the text that follows a closing tag, common in interleaved markup. Missing it leads to lost sentences.\n\n###  Step 3: Translating in Chunks\n\nWe batch the collected text nodes into chunks that respect LLM token limits. For instance, we group consecutive text from the same XHTML document, aiming for ~3000 tokens per batch. We then send each chunk to our translation model (e.g., Claude 3.5 Sonnet) and receive a block of translated text. We split the translated block back into individual strings by comparing lengths (advanced: we use a diff algorithm to align original and translated sentences). 
This is simplified here for brevity.\n\n###  Step 4: Replacing Text in the Original XHTML\n\nNow we map translations back:\n\n\n\n    # root and text_mapping come from the extraction step; translations is in the same order\n    for (xpath, original, elem), translated_text in zip(text_mapping, translations):\n        # The element objects were cached during extraction, so we update them directly\n        # (the stored xpath is only needed if the document is parsed again)\n        if elem.text and elem.text == original:\n            elem.text = translated_text\n        elif elem.tail and elem.tail == original:\n            elem.tail = translated_text\n\n    # Serialize back to string\n    new_content = etree.tostring(root, encoding='unicode', method='html')\n\n\nThe result is the modified XHTML as a string, ready to replace the item’s content in the EPUB.\n\n###  Step 5: Rebuilding the EPUB\n\nHere’s where `ebooklib` shines. We create a new `EpubBook`, set the same core metadata (identifier, title, language), and add items:\n\n\n\n    # original_book is the book returned by epub.read_epub() in Step 1\n    new_book = epub.EpubBook()\n    new_book.set_identifier(original_book.get_metadata('DC', 'identifier')[0][0])\n    new_book.set_title(original_book.get_metadata('DC', 'title')[0][0])\n    new_book.set_language(original_book.get_metadata('DC', 'language')[0][0])\n\n    # Add all original items, replacing document content where needed\n    for item in original_book.get_items():\n        if item.get_name() in modified_content_map:\n            # Replace with translated XHTML\n            new_content = modified_content_map[item.get_name()]\n            new_item = epub.EpubItem(\n                uid=item.get_id(),\n                file_name=item.get_name(),\n                # media_type must be the MIME string; get_type() returns an integer constant\n                media_type=item.media_type,\n                content=new_content.encode('utf-8')\n            )\n        else:\n            # Copy image, CSS, etc. 
as-is\n            new_item = item\n        new_book.add_item(new_item)\n\n    # Replicate the spine and table of contents\n    new_book.spine = original_book.spine\n    new_book.toc = original_book.toc\n\n    # Write out\n    epub.write_epub('translated.epub', new_book, {})\n\n\nBut wait: this naive approach can corrupt the OPF. We found that `ebooklib` sometimes rewrites the spine order incorrectly if the original had complex nesting. To fix this, we manually post-process the written EPUB’s `content.opf` using `lxml`:\n\n\n\n    import zipfile\n    from lxml import etree\n\n    # ebooklib writes the OPF under EPUB/ by default; confirm the path in META-INF/container.xml\n    OPF_PATH = 'EPUB/content.opf'\n\n    # ZIP entries cannot be overwritten in place, so copy every entry into a new archive\n    with zipfile.ZipFile('translated.epub') as src:\n        with zipfile.ZipFile('translated_fixed.epub', 'w') as dst:\n            for info in src.infolist():\n                data = src.read(info.filename)\n                if info.filename == OPF_PATH:\n                    opf = etree.fromstring(data)\n                    # Ensure itemref order matches original spine\n                    spine = opf.find('.//{http://www.idpf.org/2007/opf}spine')\n                    # Reorder based on original spine list\n                    # ... custom correction logic ...\n                    data = etree.tostring(opf, xml_declaration=True, encoding='UTF-8')\n                # Passing the ZipInfo keeps each entry's name, order, and compression type\n                dst.writestr(info, data)\n\n\nYes, it’s ugly, but it saved us from countless validation errors.\n\n##  Performance and Real-World Numbers\n\nWe benchmarked on a typical novel: 50 chapters, 350 KB uncompressed. Parsing and extracting text: ~0.2 seconds. Rebuilding after translation: ~0.3 seconds. The LLM translation step dominates (around 45 seconds for the whole book), so we worked on parallelism for that part instead.\n\nHowever, with larger educational texts containing hundreds of images and complex tables, memory usage spiked to over 500 MB. We mitigated this by processing documents one by one and releasing them immediately.\n\n##  Key Lessons Learned\n\n  1. **Namespaces are the devil** : Always preserve `xmlns=\"http://www.w3.org/1999/xhtml\"` and any custom namespaces on the `<html>` tag. lxml’s `etree.tostring()` with `method='html'` can drop them unless you explicitly add them back.\n  2. 
**Validate, validate, validate** : After rebuilding, we run `epubcheck` (via Python subprocess) to catch issues. False positives from custom metadata? We whitelist them after manual review.\n  3. **Don’t trust the library for everything** : `ebooklib` is great for reading, but for writing, we ended up doing a lot of OPF and NCX manipulation ourselves to ensure compliance.\n  4. **Handle encoding upfront** : Some old EPUBs use Latin-1. We transcode everything to UTF-8 early in the pipeline to avoid crashes later.\n  5. **DRM is a dead end** : We detect encrypted books by checking the `<encryption>` element in `META-INF/encryption.xml` and gracefully reject them.\n\n\n\n##  The Open Question for the Community\n\nWe’d love to know how others are managing complex EPUB manipulation in production. Have you found a more robust library than `ebooklib`? How do you deal with interactive EPUB3 elements (JavaScript, form fields) when translating? We’re still iterating on our pipeline and would appreciate any battle stories.\n\nIf you’re tackling similar problems or want to try translating your own ebooks, you can see the result of this work at LectuLibre. But most importantly, we hope this deep dive saves you a few late nights the next time you need to mess with EPUB internals.",
  "title": "Parsing and Rebuilding EPUB Files in Python: Lessons Learned from Building an AI Translation Service"
}