{
"$type": "site.standard.document",
"bskyPostRef": {
"cid": "bafyreidztkswg54txlsm4kad7gpveuz4mlaxyy6veg5sodoiqn4gozi47q",
"uri": "at://did:plc:5sgu76a53rz3n6unbykmovqy/app.bsky.feed.post/3mlodgmv2q6v2"
},
"description": "Chunking is the process of splitting documents into smaller passages before embedding them for retrieval. Chunk size and boundaries directly determine what a retrieval system can find: a chunk that is too large blurs the meaning of its embedding, and a chunk that is too small lacks the context to answer most questions.\n\n\nCommon strategies\n\n * Fixed-size character or token splitting. Cuts every N characters or tokens. Simple but ignores semantic boundaries.\n * Recursive character splitting. Tries",
"path": "/engineering-glossary/chunking-document-splitting/",
"publishedAt": "2026-05-12T17:39:34.000Z",
"site": "https://sahilkapoor.com",
"tags": [
"RAG",
"Embeddings",
"Vector Database",
"Reranker",
"Context Window"
],
"textContent": "**Chunking** is the process of splitting documents into smaller passages before embedding them for retrieval. Chunk size and boundaries directly determine what a retrieval system can find: a chunk that is too large blurs the meaning of its embedding, and a chunk that is too small lacks the context to answer most questions.\n\n## Common strategies\n\n * **Fixed-size character or token splitting.** Cuts every N characters or tokens. Simple but ignores semantic boundaries.\n * **Recursive character splitting.** Tries to split on paragraph boundaries first, then sentence boundaries, then word boundaries, falling back to the next level only when a piece is still too large. A common baseline in LangChain and LlamaIndex.\n * **Structural chunking.** Splits on headings, sections, code blocks, or table rows. Often suited to technical documentation.\n * **Semantic chunking.** Splits where the embedding similarity between adjacent sentences drops, which signals a topic shift.\n * **Overlap.** Adjacent chunks share a small region, with the tail of one repeated as the head of the next (typically 10 to 20 percent of the chunk size), so context isn't lost at boundaries.\n\n**Related Terms**\nRAG, Embeddings, Vector Database, Reranker, Context Window.",
"title": "Chunking",
"updatedAt": "2026-05-13T19:15:21.726Z"
}