Raw Record Source

{
  "$type": "site.standard.document",
  "bskyPostRef": {
    "cid": "bafyreigkhvkbxithnh4qld2iwhtqxnrygqp3zdihgzy7tjggcdxawgwzfm",
    "uri": "at://did:plc:5sgu76a53rz3n6unbykmovqy/app.bsky.feed.post/3mandhd3ciko2"
  },
  "coverImage": {
    "$type": "blob",
    "ref": {
      "$link": "bafkreidogfje6r2crdkg2ninwvea2s4giqerjy3a3wwt427kdcpyqiiav4"
    },
    "mimeType": "image/jpeg",
    "size": 40964
  },
  "description": "A fast MongoDB system comes from modeling data around how your application reads and writes it. This guide breaks down how to structure documents, when to embed or reference, the patterns used in real production systems, and the indexing strategies that keep performance predictable as data grows.",
  "path": "/mongodb-data-modeling/",
  "publishedAt": "2025-12-23T08:29:36.000Z",
  "site": "https://sahilkapoor.com",
  "tags": [
    "MongoDB",
    "microservices",
    "pagination",
    "TTL"
  ],
  "textContent": "Every time I see a MongoDB system that performs beautifully at scale, it’s never because the team did something exotic. It’s because they aligned their schema with one simple truth: **your data model must follow your application’s access patterns.** Not theoretical relationships. Not entity diagrams. Actual reads and writes.\n\nMongoDB is built for this. Once you stop thinking in terms of entities and start thinking in terms of how your application consumes data, schema design becomes far more intuitive.\n\nThis piece breaks down the practical way MongoDB expects you to model data for real-world systems, the patterns that make distributed queries fast, and the anti-patterns that quietly destroy performance.\n\n## **1. The Golden Rule: Data Accessed Together, Stored Together**\n\nMongoDB’s core strength is **data locality**. When all the data you need for a screen or an API call lives inside a single document, you get:\n\n  * predictable read performance\n  * fewer network hops\n  * minimal coordination overhead across nodes\n\n\n\nImagine a user profile screen that shows: `user info`, `subscription details`, `last 3 orders`, `preferences` etc.\n\nIn MongoDB, this works best when these pieces live inside a single document. 
Your application reads once, renders once, and moves on.\n\nHere’s how a real-world User document might look:\n\n\n    {\n      \"_id\": 101,\n      \"name\": \"Aditi Sharma\",\n      \"email\": \"aditi.sharma@example\",\n      \"preferences\": {\n        \"language\": \"en\",\n        \"theme\": \"dark\"\n      },\n      \"recent_orders\": [\n        {\n          \"order_id\": 9001,\n          \"amount\": 450,\n          \"placed_at\": \"2024-12-10T12:00:00Z\"\n        },\n        {\n          \"order_id\": 9002,\n          \"amount\": 199,\n          \"placed_at\": \"2024-12-11T16:00:00Z\"\n        }\n      ],\n      \"subscription\": {\n        \"tier\": \"Gold\",\n        \"renewal\": \"2025-01-01\"\n      }\n    }\n\n\nOne API call. One predictable latency. No fan-out queries.\n\nThis design philosophy is the backbone of fast MongoDB systems; group fields that are read together into one document so your read path stays stable and efficient.\n\n## **2. Embed vs Reference: The Practical Decision Matrix**\n\nMongoDB gives you two big tools: embedding and referencing. The challenge is knowing when to use which.\n\nA clean mental model is this: **how many items sit on the many side of your relationship and how often are they accessed?**\n\nLet’s break it down.\n\n### **A. One-to-Few: Embed**\n\nIf the child objects are:\n\n  * small\n  * bounded\n  * frequently accessed with the parent\n\n\n\nThen embedding is perfect.\n\n**Example: User + Addresses**\n\n\n    {\n      \"name\": \"Sahil\",\n      \"addresses\": [\n        { \"type\": \"home\", \"city\": \"Gurgaon\" },\n        { \"type\": \"office\", \"city\": \"Bangalore\" }\n      ]\n    }\n\n\nBounded arrays shine here. Fast reads, minimal overhead. The key idea is that when the list will never grow beyond a small, safe upper limit, embedding ensures consistent performance without worrying about document bloat.\n\n### **B. 
One-to-Many: Reference**\n\nIf you have potentially thousands of children, embedding becomes impractical. Document size grows. Updates become slow.\n\nA classic example is products and reviews.\n\n**Product document:**\n\n\n    {\n      \"_id\": 77,\n      \"title\": \"Don 3\",\n      \"price\": 399\n    }\n\n\n**Review document:**\n\n\n    {\n      \"product_id\": 77,\n      \"rating\": 5,\n      \"comment\": \"Insane movie!\"\n    }\n\n\nThis keeps your primary document light and responsive. The reviews load only when the user requests them, which is exactly the point: when a related dataset grows large, referencing preserves both agility and performance.\n\n### **C. One-to-Squillions: Hybrid**\n\nUnbounded relationships like logs, activity feeds, or transactions require a hybrid model using **bucketing**, **sharded collections**, or **capped collections**.\n\nThe idea is to avoid:\n\n  * unbounded arrays\n  * massive documents\n  * unpredictable write behavior\n\n\n\nMongoDB works best when documents stay reasonably sized. For unbounded data, spread writes across multiple documents instead of forcing everything into a single growing structure.\n\n## **3. Production-Proven Patterns That Make MongoDB Fly**\n\nThe following patterns aren’t theoretical. They show up everywhere across high-scale systems.\n\n### **A. Subset Pattern (The Homepage Problem)**\n\nLet’s say a movie has 10,000 reviews. The homepage needs only the top 3.\n\nEmbedding all 10,000 bloats the movie document and pushes it toward MongoDB’s 16 MB document size limit. 
Querying reviews separately for every homepage view is expensive.\n\nThe subset pattern solves this by keeping only the frequently accessed slice of data inside the main document.\n\n\n    {\n      \"_id\": 77,\n      \"title\": \"Don 3\",\n      \"top_reviews\": [\n        { \"rating\": 5, \"user\": \"Aarav\", \"comment\": \"🔥\" },\n        { \"rating\": 4, \"user\": \"Reema\", \"comment\": \"Loved it\" }\n      ]\n    }\n\n\nThis gives instant page loads while keeping the full review set separate.\n\nYou’ll see this pattern everywhere:\n\n  * product listings\n  * home feeds\n  * dashboards\n  * content cards\n\n\n\nIt optimizes for the 95 percent case by keeping just enough data in the parent document to serve the common path quickly.\n\n### **B. Extended Reference Pattern (Minimizing Follow-Up Calls)**\n\nSometimes a reference isn’t enough. Your API often needs a few extra fields from the referenced document.\n\nInstead of making another query, you store just those fields alongside the reference.\n\n**Example: Order document embedding commonly used customer fields:**\n\n\n    {\n      \"order_id\": 99,\n      \"customer\": {\n        \"id\": 123,\n        \"name\": \"Jane Doe\",\n        \"avatar\": \"jane.jpg\"\n      }\n    }\n\n\nThis isn’t about duplicating entire objects. It’s about tuning the document so your read path becomes a single operation.\n\nIt’s especially powerful in microservices where latency adds up quickly. The broader idea is to store the small fields that your read path depends on so you avoid extra lookups during critical flows.\n\n### **C. Bucket Pattern (For Logs and Event Streams)**\n\nLogs arrive continuously. 
Storing each log event as an individual document introduces huge overhead.\n\nMongoDB’s bucket pattern groups related events into a single document.\n\n\n    {\n      \"user_id\": 123,\n      \"day\": \"2024-12-10\",\n      \"events\": [\n        { \"ts\": 1733834410, \"type\": \"click\" },\n        { \"ts\": 1733834422, \"type\": \"scroll\" }\n      ]\n    }\n\n\nThis cuts the number of documents and index entries dramatically. Queries also become more predictable.\n\n## **4. Anti-Patterns That Hurt Real-World Systems**\n\nThese traps look harmless when data is small but explode at scale.\n\n### **A. Unbounded Arrays**\n\n\n    {\n      \"log_entries\": []\n    }\n\n\nAn array like this grows forever. And every write rewrites the entire document. Your database becomes slower and slower until the document hits MongoDB’s 16 MB size limit. Always bucket or reference.\n\n### **B. Overly Fragmented Collections**\n\nSome teams create a separate collection for every small entity:\n\n  * users\n  * addresses\n  * preferences\n  * phone numbers\n  * tags\n\n\n\nEach extra collection increases the number of queries required to assemble a single response.\n\nHigh-scale MongoDB systems aggressively minimize the number of collections needed for a single screen.\n\n### **C. Bloated Documents**\n\nEmbedding large blobs like images or PDFs inside documents leads to heavy reads.\n\n\n    {\n      \"user\": \"Aditi\",\n      \"profile_pic\": \"<2MB binary>\"\n    }\n\n\nEven a simple metadata lookup now transfers megabytes.\n\nKeep large objects in object storage or GridFS. MongoDB should carry metadata, not media, so each request moves only the bytes it truly needs.\n\n## **5. 
How MongoDB Wants You to Think About Data**\n\nMongoDB rewards schemas that follow how your application actually consumes data.\n\nThe right mental model is simple:\n\n  * If multiple fields are always read together, embed them.\n  * If a child grows large or unbounded, reference it.\n  * If a child is partially read frequently, embed a subset.\n  * If writes dominate, keep documents small.\n\n\n\nThis approach helps to keep reads predictable, writes efficient, documents maintainable and performance steady as scale increases.\n\n### **A Real Example: Swiggy-Style Order Flows**\n\nTake a food delivery app. On the order history screen, the user only needs a lightweight summary of each order: the order_id, restaurant name, amount, a thumbnail, and maybe the top few items. On the order detail page, the same order expands into the full item list, delivery timeline events, delivery agent details, payment breakdown, and the restaurant’s full address.\n\nA practical schema for this might look like:\n\n\n    // orders collection: optimized for history listings and quick lookups\n    {\n      _id: ObjectId(\"675abc123...\"),\n      user_id: 123,\n      restaurant: {\n        id: 45,\n        name: \"Bombay Biryani\",\n        thumbnail: \"biryani-thumb.jpg\"\n      },\n      amount: 375,\n      summary_items: [\n        { name: \"Chicken Biryani\", qty: 1 },\n        { name: \"Gulab Jamun\", qty: 2 }\n      ],\n      created_at: ISODate(\"2024-12-10T13:05:00Z\"),\n      status: \"delivered\"\n    }\n\n    // order_items collection: full detail for the order detail page\n    {\n      order_id: ObjectId(\"675abc123...\"),\n      items: [\n        {\n          name: \"Chicken Biryani\",\n          qty: 1,\n          price: 250\n        },\n        {\n          name: \"Gulab Jamun\",\n          qty: 2,\n          price: 125\n        }\n      ],\n      restaurant_address: {\n        line1: \"Sector 29\",\n        city: \"Gurgaon\",\n        lat: 28.4595,\n        lng: 77.0266\n   
   },\n      payment_breakdown: {\n        subtotal: 375,\n        taxes: 45,\n        delivery_fee: 25,\n        discounts: 50,\n        total: 395\n      }\n    }\n\n    // order_events collection: bucketed delivery timeline\n    {\n      order_id: ObjectId(\"675abc123...\"),\n      day: \"2024-12-10\",\n      events: [\n        { ts: 1733835900, type: \"created\" },\n        { ts: 1733836000, type: \"accepted_by_restaurant\" },\n        { ts: 1733836200, type: \"picked_up\" },\n        { ts: 1733836500, type: \"delivered\" }\n      ]\n    }\n\nIn this design, the history screen queries only the `orders` collection, the detail page loads the matching `order_items` document when needed, and the tracking UI reads from `order_events`. Each flow reads just enough data to do its job and nothing more. This becomes a clean split between fast summary access and deeper detail access, ensuring the common path stays lightweight:\n\n  * Orders contain a **subset** of frequently accessed fields.\n  * Full details live in their own structure.\n  * Delivery events use **buckets**.\n  * Restaurant metadata is embedded **if used often**.\n\n\n\nThe result is an absurdly fast system for millions of users, even during lunch peak. The lesson: tune your schema to the real flow of data consumption and the system naturally scales.\n\n## **6. Indexing Strategies**\n\nIndex design in MongoDB isn’t an afterthought. It’s what turns a well-structured schema into a fast system. Here are battle-tested indexing patterns that pair naturally with the modeling techniques above.\n\n### **A. Single-Field Indexes for High-Cardinality Fields**\n\nFields like `email`, `product_id`, or `order_id` are strong index candidates whenever your queries filter on them by equality.\n\n\n    // Fast lookup by product\n    db.reviews.createIndex({ product_id: 1 });\n\n\n### **B. 
Compound Indexes for Common Query Shapes**\n\nMongoDB matches queries to indexes by prefix. If most of your queries look like:\n\n\n    db.orders.find({ user_id: 123 }).sort({ placed_at: -1 })\n\n\nThen your index should match that shape:\n\n\n    db.orders.createIndex({ user_id: 1, placed_at: -1 });\n\n\nThis avoids in-memory sorts and keeps pagination fast.\n\n### **C. Indexing Embedded Fields**\n\n\n    // Index the user's city inside embedded addresses\n    db.users.createIndex({ \"addresses.city\": 1 });\n\n\nEmbedded objects and arrays can be indexed directly. MongoDB creates a multikey index automatically when the indexed field is an array.\n\n### **D. Partial Indexes for Sparse or Optional Fields**\n\nUseful when only a subset of documents contains the field. This keeps indexes small and efficient.\n\n\n    // Only users that actually have a subscription are indexed\n    db.users.createIndex(\n      { \"subscription.renewal\": 1 },\n      { partialFilterExpression: { \"subscription.renewal\": { $exists: true } } }\n    );\n\n\n### **E. TTL Indexes for Bucketed or Ephemeral Data**\n\nGreat for logs, events, sessions. TTL + buckets gives extremely efficient log deletion. The indexed field must hold a BSON date; documents where it is a number or a string never expire.\n\n\n    db.events.createIndex(\n      { created_at: 1 },\n      { expireAfterSeconds: 86400 }\n    );\n\n\n### **F. Prefix Rule Reminder**\n\nIf you create an index like:\n\n\n    db.orders.createIndex({ user_id: 1, placed_at: -1, amount: 1 });\n\n\nMongoDB can use it for queries that include:\n\n  * `user_id`\n  * `user_id` + `placed_at`\n  * `user_id` + `placed_at` + `amount`\n\n\n\nBut NOT for:\n\n  * `placed_at` alone\n  * `amount` alone\n\n\n\nDesign indexes around actual query patterns. The principle behind all indexing in MongoDB is simple: optimize for the queries that hit your system most often, not hypothetical ones.\n\n## **Bringing It All Together**\n\nGood MongoDB schema design feels like UI-driven modeling. 
You organize data based on the screens and API calls your application actually serves.\n\nWhen you:\n\n  * embed intentionally\n  * duplicate selectively for performance\n  * reference when data grows large\n  * bucket when data grows endlessly\n\n\n\nMongoDB becomes one of the most efficient databases to operate. Predictable performance. Reduced infra cost. Cleaner code.",
  "title": "MongoDB Data Modeling: How to Design Schemas for Real-World Applications",
  "updatedAt": "2026-06-03T22:31:57.954Z"
}