{
  "$type": "site.standard.document",
  "bskyPostRef": {
    "cid": "bafyreiftuzusjfhowfqkrbjihpoytyelbzqc6s53zxa5p5bhioh52nlqga",
    "uri": "at://did:plc:25rdn5elo5izoxrmtis34zuk/app.bsky.feed.post/3molytgklmw42"
  },
  "coverImage": {
    "$type": "blob",
    "ref": {
      "$link": "bafkreieccvdoay3fh5itbmutsho6k4xiy35hx2dwstcdcrirsq43lhobg4"
    },
    "mimeType": "image/webp",
    "size": 64204
  },
  "path": "/jacob_gong/how-we-translate-entire-books-with-llms-without-losing-context-2em5",
  "publishedAt": "2026-06-18T23:19:00.000Z",
  "site": "https://dev.to",
  "tags": [
    "python",
    "ai",
    "tutorial",
    "showdev",
    "LectuLibre"
  ],
  "textContent": "_Our chunking strategy that keeps chapters coherent, respects context windows, and handles multi-lingual books._\n\n##  The problem: books don’t fit in a prompt\n\nAt LectuLibre, we translate entire books — novels, technical manuals, poetry — using large language models. It sounds simple: feed each paragraph to an LLM, concatenate results, done. But the moment we tried a 300‑page EPUB, chaos ensued. Chapters bled into each other, sentences were chopped mid‑word, and the translation of chapter 5 had no idea what happened in chapter 4.\n\nLLMs have limited context windows. Even the massive 200K token window of Claude 3 can’t hold a whole 150K‑word book. And even if it could, the cost and latency would be absurd. We needed a way to split the book into manageable chunks while preserving enough context so that the translation remains coherent across thousands of pages.\n\nHere’s how we designed a chunking pipeline that respects your wallet, the context window, and the book’s narrative flow.\n\n##  Step 1: extract structure, not just text\n\nNaively splitting by character count is a recipe for disaster. Instead, we first parse the document to understand its logical units: chapters, sections, headings. For EPUB, we use `ebooklib`; for PDF, `pdfplumber`. Both give us a stream of items (paragraphs, headings) that we then organize into a tree of chapters and sub‑sections.\n\n\n\n    import ebooklib\n    from ebooklib import epub\n\n    def get_chapters(epub_path):\n        book = epub.read_epub(epub_path)\n        chapters = []\n        for item in book.get_items_of_type(ebooklib.ITEM_DOCUMENT):\n            # Simplified: each document is a chapter\n            content = item.get_content().decode('utf-8')\n            chapters.append(content)\n        return chapters\n\n\nIn practice, we use `BeautifulSoup` to extract `<body>` text and identify heading tags (`<h1>`–`<h6>`) to build a table of contents. 
This way, even if a chapter is 20,000 tokens, we keep it together as a single unit until later splitting.\n\n##  Step 2: sentence‑aware splitting with token budgets\n\nA chapter still needs to be broken down to fit the model’s context window. But we never split mid‑sentence. We use `spaCy` to segment the text into sentences, then greedily group them until we hit a token limit.\n\nWhy not simple character‑based splitting? Because sentences carry semantic boundaries. Breaking inside a sentence occasionally produces artefacts like “He walked to the sta‑” / “‑tion.” LLMs are forgiving but not that forgiving.\n\n\n\n    import spacy\n    from transformers import AutoTokenizer  # for token counting\n\n    nlp = spacy.load(\"en_core_web_sm\")\n    # \"claude-tokenizer\" is a placeholder name: load whichever tokenizer best approximates your model's counts\n    tokenizer = AutoTokenizer.from_pretrained(\"claude-tokenizer\")\n\n    def sentence_split(text):\n        doc = nlp(text)\n        return [sent.text for sent in doc.sents]\n\n    def chunk_sentences(sentences, max_tokens=1800, overlap_sentences=5):\n        chunks = []\n        current_chunk = []\n        current_token_count = 0\n\n        for sent in sentences:\n            sent_tokens = len(tokenizer.encode(sent))\n            # Close the chunk once adding this sentence would exceed the budget\n            if current_chunk and current_token_count + sent_tokens > max_tokens:\n                chunks.append(current_chunk)\n                # Overlap: carry over the last `overlap_sentences` from the chunk just concluded\n                current_chunk = current_chunk[-overlap_sentences:] if overlap_sentences > 0 else []\n                current_token_count = sum(len(tokenizer.encode(s)) for s in current_chunk)\n                # Trim the overlap from the front if it would push the new chunk over budget\n                while current_chunk and current_token_count + sent_tokens > max_tokens:\n                    current_token_count -= len(tokenizer.encode(current_chunk.pop(0)))\n            current_chunk.append(sent)\n            current_token_count += sent_tokens\n        if current_chunk:\n            chunks.append(current_chunk)\n        return chunks\n\n\nWe set `max_tokens` to 1800, leaving room for the system prompt, context from previous chunks, and the model’s response. 
That is far below the 200K‑token context window of Claude 3 Haiku. We keep chunks small on purpose: the translated text must also fit within the model’s output limit, and smaller chunks mean faster, cheaper API calls.\n\n##  Step 3: passing context across chunks\n\nThe real magic is what we do _between_ chunks. A standalone translation of chunk #5 has no clue that the protagonist just entered a dark cave in chunk #4. Two techniques solved this:\n\n  1. **Sliding window of previous sentences**: we include the last 5–10 sentences from the preceding chunk directly in the prompt as “context left.”\n  2. **A running summary**: after translating a chunk, we ask the LLM to generate a one‑sentence summary of that chunk. This summary is accumulated and fed into every subsequent prompt, so the model remembers high‑level events.\n\n\n\n\n    def build_prompt(chunk, previous_context_sentences, summary_so_far):\n        context_left = \" \".join(previous_context_sentences)\n        prompt = f\"\"\"You are translating a book. Here is a summary of the story so far:\n        {summary_so_far}\n\n        And the previous text (for immediate context):\n        \"{context_left}\"\n\n        Now translate the following text to Spanish, preserving tone and style:\n        {chunk}\"\"\"\n        return prompt\n\n\nThe summary is generated using a separate, cheap call (we use DeepSeek for summaries, even if the main translation uses Claude). This keeps the context token usage minimal while still giving long‑range coherence.\n\nWhy not just include the entire previous chunk? That roughly doubles the input tokens per call. On a 200K‑word book, that adds up to hundreds of dollars. 
Summaries cut that cost by ~80% with negligible quality loss.\n\nThe translation loop then looks like this:\n\n\n\n    overall_summary = \"\"\n    previous_context = []\n    full_translation = []\n\n    for chapter_chunks in all_chunks_by_chapter:\n        chapter_summary = \"\"\n        for chunk in chapter_chunks:\n            chunk_text = \" \".join(chunk)\n            # Earlier chapters first, then the current chapter so far (empty only for the very first chunk)\n            summary_so_far = (overall_summary + \"\\n\" + chapter_summary).strip()\n            prompt = build_prompt(chunk_text, previous_context, summary_so_far)\n            translated = call_llm(prompt)\n            full_translation.append(translated)\n\n            # Update context: keep last 5 sentences of the translated chunk as next context\n            trans_sents = sentence_split(translated)\n            previous_context = trans_sents[-5:]\n\n            # Summarize the source chunk with a separate call (synchronous here for clarity)\n            chunk_summary = call_llm(f\"Summarize this passage in one sentence: {chunk_text}\")\n            chapter_summary += chunk_summary + \" \"\n        overall_summary += chapter_summary\n\n\nThe loop is shown sequentially for readability; in practice we issue the calls concurrently using `asyncio` and `httpx` to keep translation times reasonable.\n\n##  Real‑world results and trade‑offs\n\nTranslating a 120K‑word Spanish novel (“El Quijote”) into English took about 4 minutes end‑to‑end with Claude 3 Haiku. Total API cost: $0.67. The translation was surprisingly fluid: chapters felt connected, and the occasional flashback or pronoun reference (“she” referring to a character introduced three pages earlier) was correctly resolved. Without the context pipeline, the same book would have been riddled with inconsistencies.\n\nWe experimented with other models: DeepSeek‑V3 gave similar quality at half the price but with higher latency, making it better for batch jobs where speed isn’t critical. GPT‑4 Turbo reproduced stylistic flourishes more naturally, but in our setup we had to use even smaller chunks with it, which sometimes fragmented dialogue. 
Claude struck the best balance.\n\nBut it’s not perfect. Humor and idioms still occasionally fall flat because the summary can’t encapsulate a running joke. Code blocks and tables inside technical books need special handling — we’re working on a parser that detects them and wraps them in `[CODE]` markers so the LLM doesn’t try to translate variable names. And poetry, with its line breaks and meter, remains a challenge; we’re considering a dedicated poetry‑aware chunker.\n\n##  The key takeaway\n\nIf you’re building long‑document translation using LLMs, invest in a pipeline that:\n\n  * **Respects document structure** (chapters, paragraphs) before splitting.\n  * **Splits on sentences** , and always leaves room for context.\n  * **Provides both immediate context** (last few sentences) and **global context** (summaries) to each chunk.\n  * **Uses separate, cheap models** for auxiliary tasks like summarization to keep costs down.\n\n\n\nOur code is not open‑source yet, but we plan to release the core chunking library once we’ve battle‑tested it on more formats.\n\n**How do you handle context in LLM translations?** We’re especially curious about handling highly technical books with equations, footnotes, and cross‑references. Drop your ideas in the comments — let’s figure this out together.",
  "title": "How We Translate Entire Books with LLMs Without Losing Context"
}