{
"$type": "site.standard.document",
"bskyPostRef": {
"cid": "bafyreig6epydtsuyt2xrec4exuniszyqeozmbpseyymdwrpjrndd2m4kxa",
"uri": "at://did:plc:25rdn5elo5izoxrmtis34zuk/app.bsky.feed.post/3mp4egceholf2"
},
"coverImage": {
"$type": "blob",
"ref": {
"$link": "bafkreib5tgko2boj55qth6spryakmmdfg2pzrqerm7ixtctjhqseqge6t4"
},
"mimeType": "image/webp",
"size": 281486
},
"path": "/arslan_ah/ai-system-design-interview-questions-chatgpt-rag-llm-inference-and-agents-1doi",
"publishedAt": "2026-06-25T11:38:22.000Z",
"site": "https://dev.to",
"tags": [
"ai",
"rag",
"chatgpt",
"claude",
"64 System Design Interview Questions, Ranked From Easiest to Hardest",
"Design ChatGPT walkthrough",
"Grokking Modern AI Fundamentals",
"Grokking System Design Fundamentals",
"Grokking the System Design Interview",
"System Design Interview Crash Course",
"Advanced System Design Interview, Volume II",
"Grokking Scalable Systems for Interviews"
],
"textContent": "System design interviews are changing.\n\nTraditional questions such as “Design Twitter,” “Design Uber,” and “Design YouTube” are still important. They test whether you understand databases, caching, partitioning, replication, messaging, and high availability.\n\nBut engineers working on modern platforms now encounter a different category of problem:\n\n * Design a ChatGPT-like conversational assistant.\n * Design a retrieval-augmented generation system.\n * Design an LLM inference platform.\n * Design an AI agent that can call external tools.\n * Design an enterprise AI assistant for private documents.\n * Design an evaluation platform for generative AI applications.\n\n\n\nThese questions still require classical distributed-systems knowledge. An AI product needs APIs, queues, storage, authentication, observability, rate limiting, and reliable deployment.\n\nThe difference is that it also introduces expensive accelerators, probabilistic output, long-running requests, model routing, vector retrieval, prompt construction, safety controls, and quality evaluation.\n\nThis guide explains the most important AI system design interview questions and what a strong candidate should discuss for each.\n\nFor a broader preparation roadmap covering traditional and modern problems, see 64 System Design Interview Questions, Ranked From Easiest to Hardest.\n\n# Why AI System Design Is Different\n\nA conventional service usually transforms an input into a deterministic output.\n\nIf a user requests order number 123, the service should retrieve order 123. Two identical requests should usually return the same underlying information.\n\nGenerative AI systems behave differently.\n\nA model may produce different responses to the same prompt. A response can be grammatically convincing while being factually wrong. Latency depends on the number of generated tokens. Serving capacity is constrained by accelerator memory, not merely CPU utilization. 
Product quality may depend on prompts, retrieved context, model versions, safety filters, and external tools.\n\nThis creates several new design dimensions.\n\n## 1. Quality is part of the architecture\n\nTraditional systems are often measured using availability, latency, throughput, and error rate.\n\nAI systems need those metrics, but they also need measures such as:\n\n * Answer correctness\n * Relevance\n * Groundedness\n * Retrieval quality\n * Hallucination rate\n * Tool-use success\n * Safety-policy compliance\n * User satisfaction\n\n\n\nA system that returns a response in 200 milliseconds is not useful if that response is wrong.\n\n## 2. Requests are computationally expensive\n\nAn ordinary API server may process thousands of lightweight requests per second.\n\nAn LLM request can occupy expensive GPU memory while processing a long prompt and generating hundreds of tokens. The architecture must therefore optimize batching, memory utilization, model placement, and request scheduling.\n\n## 3. Latency is experienced as a stream\n\nUsers do not normally wait for an entire answer before seeing anything. Tokens are streamed as they are generated.\n\nThis introduces at least two important latency measurements:\n\n * **Time to first token:** How quickly generation begins.\n * **Inter-token latency:** How smoothly subsequent tokens arrive.\n\n\n\nA system may have acceptable total latency but still feel slow if the first token takes too long.\n\n## 4. Data enters the system in several ways\n\nAn AI application may depend on:\n\n * Model-training data\n * User prompts\n * Conversation history\n * Retrieved documents\n * Tool results\n * Feedback\n * Evaluation datasets\n * Safety policies\n\n\n\nEach data type has distinct requirements for retention, privacy, freshness, and consistency.\n\n## 5. 
Failure is not always binary\n\nA traditional request may succeed or fail.\n\nAn AI request can technically succeed but produce a low-quality answer, retrieve the wrong documents, call the wrong tool, exceed a cost budget, or violate a safety rule.\n\nThe architecture must detect and respond to these softer failure modes.\n\n# A Framework for Answering Any AI System Design Question\n\nBefore considering individual questions, use a consistent interview structure.\n\n## Step 1: Clarify the product\n\nAsk what the system is expected to do.\n\nFor example:\n\n * Is the assistant general-purpose or domain-specific?\n * Does it need private enterprise data?\n * Can it take actions or only provide answers?\n * Does it support text only, or also images, audio, and files?\n * Are responses expected in real time?\n * Does the system need citations?\n * Which decisions require human approval?\n\n\n\nWithout this clarification, “Design an AI assistant” is too broad.\n\n## Step 2: Define scale and service-level objectives\n\nEstimate:\n\n * Daily and peak requests\n * Average prompt size\n * Average output length\n * Concurrent users\n * Required time to first token\n * Model size\n * GPU-memory requirements\n * Availability target\n * Cost per request\n\n\n\nAI systems are often constrained by cost as much as by technical capacity.\n\n## Step 3: Separate the application layer from the model layer\n\nThe application layer may include:\n\n * Authentication\n * Billing\n * Conversation history\n * File management\n * User preferences\n * Rate limiting\n * Analytics\n\n\n\nThe AI layer may include:\n\n * Prompt construction\n * Retrieval\n * Model routing\n * Inference scheduling\n * Safety checks\n * Tool execution\n * Evaluation\n\n\n\nKeeping these concerns separate makes the design easier to explain and evolve.\n\n## Step 4: Trace the complete request path\n\nDescribe what happens from the moment a user submits a prompt until the final result is displayed.\n\nA typical path 
may be:\n\n 1. Authenticate the request.\n 2. Enforce quotas.\n 3. Load conversation state.\n 4. Retrieve relevant context.\n 5. Construct the model prompt.\n 6. Run input-safety checks.\n 7. Select a model.\n 8. Schedule inference.\n 9. Stream tokens.\n 10. Run output-safety checks.\n 11. Store the response.\n 12. Record metrics and feedback.\n\n\n\n## Step 5: Discuss failure, quality, and cost\n\nA strong answer should explain:\n\n * What happens when the primary model is overloaded?\n * What happens when retrieval returns no useful documents?\n * How are duplicate tool calls prevented?\n * How does the system degrade gracefully?\n * How are model changes evaluated?\n * How is tenant data isolated?\n * How are expensive requests controlled?\n\n\n\nThese discussions distinguish a production design from a demo.\n\n# Question 1: Design ChatGPT\n\nA ChatGPT-like system is one of the most comprehensive AI system design questions.\n\nThe functional requirements may include:\n\n * Starting a conversation\n * Sending prompts\n * Receiving streamed responses\n * Viewing conversation history\n * Regenerating an answer\n * Uploading files\n * Choosing among models\n * Enforcing free and paid usage limits\n\n\n\nA useful high-level architecture contains the following components.\n\n## API gateway\n\nThe gateway handles authentication, request routing, rate limiting, quotas, and basic validation.\n\nLong-running generation requests may use Server-Sent Events or WebSockets to stream tokens to clients.\n\n## Conversation service\n\nThis service manages:\n\n * Conversations\n * Messages\n * User preferences\n * Message ordering\n * Conversation titles\n * Retention and deletion\n\n\n\nConversation metadata can live in a transactional database, while large attachments may be placed in object storage.\n\n## Context builder\n\nModels have finite context windows. 
The context builder decides what information should be included in the next request.\n\nIt may combine:\n\n * The system prompt\n * Recent conversation messages\n * A summary of older messages\n * Retrieved documents\n * User preferences\n * Tool outputs\n\n\n\nSimply sending the entire conversation forever is expensive and eventually impossible. Older content may need summarization or selective retrieval.\n\n## Model gateway\n\nThe model gateway provides a single interface to multiple model backends.\n\nIt can route requests based on:\n\n * Task type\n * Required quality\n * User subscription\n * Context length\n * Latency target\n * Current capacity\n * Cost budget\n * Model availability\n\n\n\nA simple request may use a smaller, faster model, while complex reasoning may be routed to a more capable one.\n\n## Inference scheduler\n\nThe scheduler assigns requests to model replicas running on accelerators.\n\nIt should consider:\n\n * Available GPU memory\n * Model placement\n * Prompt length\n * Output-token budget\n * Priority\n * Batch compatibility\n * Tenant quotas\n\n\n\nA naive first-in, first-out scheduler can allow a few extremely long prompts to delay many short requests.\n\n## Streaming layer\n\nGenerated tokens should be forwarded incrementally to the user.\n\nThe system must also handle:\n\n * Client disconnections\n * User cancellation\n * Partial responses\n * Network retries\n * Moderation during generation\n * Final persistence after streaming completes\n\n\n\n## Safety layer\n\nInput and output policies may detect:\n\n * Prompt injection\n * Sensitive information\n * Disallowed requests\n * Malicious files\n * Unsafe tool instructions\n * Data leakage\n\n\n\nSafety should not be treated as one filter placed at the end. 
Different checks may be required before retrieval, before tool execution, before inference, and before returning the final response.\n\n## Important deep dives\n\nAn interviewer may ask:\n\n * How would you reduce time to first token?\n * How would you support 100 million users?\n * How would you prevent one tenant from consuming all GPU capacity?\n * How would you summarize long conversations?\n * How would you route between multiple models?\n * How would you preserve availability during GPU shortages?\n * How would you limit the cost for free users?\n\n\n\nThe Design ChatGPT walkthrough provides a structured example of this problem.\n\n# Question 2: Design a RAG System\n\nRetrieval-augmented generation, or RAG, allows a model to answer using information retrieved from external sources.\n\nA common interview prompt is:\n\n> Design an enterprise assistant that answers employee questions using internal documents and provides citations.\n\nA RAG system has two major paths:\n\n 1. The ingestion path\n 2. The query path\n\n\n\n## The ingestion path\n\nDocuments may come from file uploads, internal wikis, cloud drives, databases, or support systems.\n\nThe ingestion pipeline performs several stages.\n\n### Document extraction\n\nFiles must be converted into usable text.\n\nThe system may need parsers for:\n\n * PDFs\n * Word documents\n * Presentations\n * HTML pages\n * Spreadsheets\n * Scanned images\n\n\n\nThe extraction process should preserve useful metadata such as titles, headings, page numbers, owners, and access permissions.\n\n### Chunking\n\nLong documents are divided into smaller segments.\n\nChunks that are too large may contain irrelevant text and consume excessive context. Chunks that are too small may lose meaning.\n\nPossible strategies include:\n\n * Fixed token windows\n * Paragraph-based chunking\n * Heading-aware chunking\n * Overlapping windows\n * Semantic chunking\n\n\n\nThere is no universally correct chunk size. 
It should be tested against representative questions.\n\n### Embedding generation\n\nEach chunk is converted into a numerical vector using an embedding model.\n\nThe embedding service should be versioned because changing models can require re-embedding the entire corpus.\n\n### Indexing\n\nThe system stores:\n\n * Embeddings\n * Original text\n * Document metadata\n * Access-control information\n * Source location\n * Embedding version\n * Update timestamp\n\n\n\nA vector index enables semantic retrieval. A traditional inverted index can support keyword retrieval. Many production systems combine both.\n\n## The query path\n\nWhen a user submits a question:\n\n 1. Authenticate the user.\n 2. Generate an embedding for the query.\n 3. Retrieve candidate chunks.\n 4. Apply access-control filtering.\n 5. Rerank the candidates.\n 6. Select the best context.\n 7. Construct the prompt.\n 8. Generate the answer.\n 9. Attach citations.\n 10. Evaluate or log the result.\n\n\n\n## Hybrid retrieval\n\nSemantic retrieval is useful when the query and source use different words with similar meanings.\n\nKeyword retrieval is useful for exact terms such as:\n\n * Product codes\n * Error messages\n * Names\n * Dates\n * Identifiers\n\n\n\nCombining both methods often produces better coverage.\n\n## Reranking\n\nVector similarity may retrieve documents that are generally related but not directly useful.\n\nA reranker can score the top candidates more accurately before they are sent to the LLM. This improves answer quality while keeping the final prompt small.\n\n## Access control\n\nSecurity is one of the most important parts of enterprise RAG.\n\nA user should never retrieve a document they are not authorized to view. 
Filtering after the model has already received the document is too late.\n\nPermissions should be enforced during retrieval, with tenant and user identity included in the query path.\n\n## Freshness and deletion\n\nThe system must react when:\n\n * A document changes.\n * A document is deleted.\n * Permissions change.\n * A user loses access.\n * A newer policy replaces an older one.\n\n\n\nThe ingestion pipeline may use event-driven updates, periodic crawling, or both.\n\n## RAG evaluation\n\nA RAG system should separately evaluate:\n\n * **Retrieval quality:** Did the system find the relevant document?\n * **Generation quality:** Did the model use the retrieved context correctly?\n * **Citation quality:** Do the cited sources actually support the answer?\n\n\n\nThis separation is important. A poor answer can result from failed retrieval even when the model behaves correctly.\n\n# Question 3: Design an LLM Inference Platform\n\nThis question focuses less on the product interface and more on the infrastructure that serves models.\n\nA possible prompt is:\n\n> Design a multi-tenant platform that serves several large language models and handles millions of requests.\n\nThe platform may need to support:\n\n * Multiple model families\n * Different model sizes\n * Streaming generation\n * Priority tiers\n * Autoscaling\n * Usage accounting\n * Model versioning\n * Regional deployment\n * Fine-tuned adapters\n\n\n\n## Inference gateway\n\nThe gateway exposes a consistent API and performs:\n\n * Authentication\n * Quota enforcement\n * Request validation\n * Model selection\n * Token-limit checks\n * Admission control\n * Cost estimation\n\n\n\nAdmission control is critical. 
Accepting unlimited work and allowing it to queue indefinitely creates poor latency and can destabilize the system.\n\n## Model registry\n\nThe registry tracks:\n\n * Model version\n * Artifact location\n * Supported hardware\n * Memory requirements\n * Context length\n * Quantization format\n * Deployment status\n * Safety and evaluation results\n\n\n\nRollouts should use immutable versions so requests and incidents can be traced to the exact model that served them.\n\n## Model placement\n\nLoading a large model into GPU memory can take substantial time. The scheduler cannot treat models like lightweight stateless application containers.\n\nIt must decide:\n\n * Which models remain loaded\n * How many replicas each model receives\n * Where fine-tuned adapters are placed\n * When models should be unloaded\n * How capacity is distributed across regions\n\n\n\nPopular models may remain warm, while rarely used models may accept a cold-start delay.\n\n## Prefill and decode\n\nLLM inference contains two different computational phases.\n\n**Prefill** processes the input prompt and can often benefit from parallel computation.\n\n**Decode** generates tokens sequentially and is usually memory-bandwidth intensive.\n\nSeparating or independently scheduling these phases can improve utilization, but it also adds network and orchestration complexity.\n\n## Continuous batching\n\nInstead of waiting for a fixed group of requests to finish together, continuous batching adds and removes requests dynamically as generation progresses.\n\nThis improves GPU utilization, especially when responses have different lengths.\n\nThe scheduler must still prevent long requests from starving shorter ones.\n\n## KV cache\n\nThe key-value cache stores intermediate attention state so the model does not recompute the entire prompt for every generated token.\n\nKV-cache management affects:\n\n * Maximum concurrency\n * Memory pressure\n * Long-context support\n * Prefix reuse\n * Request 
eviction\n\n\n\nA shared prompt prefix—such as a large system prompt—may sometimes be cached and reused across compatible requests.\n\n## Scaling\n\nGPU utilization alone may not be sufficient for autoscaling.\n\nUseful signals include:\n\n * Queue length\n * Time to first token\n * Tokens generated per second\n * KV-cache pressure\n * Number of active sequences\n * Predicted token demand\n * Model-specific backlog\n\n\n\nBecause accelerator provisioning may be slow, the platform may need reserved capacity and predictive scaling.\n\n## Graceful degradation\n\nWhen capacity is limited, the system may:\n\n * Route to a smaller model.\n * Reduce the maximum output length.\n * Reject low-priority requests.\n * Disable expensive features.\n * Queue batch workloads.\n * Move traffic to another region.\n * Use a third-party model provider.\n\n\n\nA strong interview answer discusses the quality and cost consequences of each fallback.\n\n# Question 4: Design an AI Agent Platform\n\nAn AI agent does more than produce text. It can plan a sequence of actions, call tools, observe results, update its state, and continue until a goal is completed.\n\nA typical prompt might be:\n\n> Design an enterprise agent that can search internal documents, update tickets, send emails, and request human approval for sensitive actions.\n\n## Core components\n\n### Agent orchestrator\n\nThe orchestrator controls the execution loop:\n\n 1. Receive a goal.\n 2. Construct the current context.\n 3. Ask the model for the next action.\n 4. Validate the proposed action.\n 5. Execute the selected tool.\n 6. Store the result.\n 7. Decide whether to continue.\n 8. 
Produce the final response.\n\n\n\nThe orchestrator—not the model—should enforce hard limits such as maximum steps, timeouts, budgets, and approval requirements.\n\n### Tool registry\n\nThe tool registry describes each available capability:\n\n * Tool name\n * Purpose\n * Input schema\n * Required permissions\n * Timeout\n * Retry policy\n * Risk level\n * Whether human approval is required\n\n\n\nTool definitions should be versioned because changing their schemas can break existing agent behavior.\n\n### Tool execution service\n\nTool calls should run through controlled executors rather than allowing the model unrestricted access to internal systems.\n\nThe executor handles:\n\n * Authentication\n * Input validation\n * Secrets\n * Network policy\n * Timeouts\n * Retries\n * Audit logging\n * Output normalization\n\n\n\nHigh-risk operations should use narrow, purpose-built APIs.\n\n### State and memory\n\nAgents may need several kinds of memory.\n\n**Working memory** contains the current task, observations, and intermediate steps.\n\n**Session memory** preserves information during one user interaction.\n\n**Long-term memory** stores information across sessions.\n\n**External memory** may contain documents retrieved from databases or vector indexes.\n\nNot everything should be stored forever. 
Memory needs explicit retention, privacy, and deletion policies.\n\n### Human approval\n\nActions such as sending payments, deleting data, publishing content, or modifying production systems should not be executed solely because a model requested them.\n\nThe agent can create a proposed action, pause its workflow, and wait for authorized approval.\n\nThe approval record should contain:\n\n * The intended action\n * The relevant parameters\n * Why it was proposed\n * The expected effect\n * The identity of the approver\n * An expiration time\n\n\n\n### Idempotency\n\nAgents may retry actions after timeouts.\n\nWithout idempotency, a retry could send the same email twice, create duplicate tickets, or repeat a transaction.\n\nEvery state-changing tool call should include a stable execution identifier or idempotency key.\n\n## Agent-specific failure modes\n\nA strong candidate should discuss:\n\n * Infinite planning loops\n * Repeated tool calls\n * Prompt injection inside retrieved content\n * Tool hallucination\n * Stale observations\n * Excessive cost\n * Partial workflow completion\n * Conflicting actions\n * Unauthorized data access\n\n\n\nThe system should impose:\n\n * Maximum step counts\n * Token budgets\n * Time limits\n * Per-tool permissions\n * Human checkpoints\n * Detailed audit logs\n * Recovery or compensation workflows\n\n\n\nThe Grokking Modern AI Fundamentals course can provide additional background on agentic AI, planning, memory, and tool-based behavior.\n\n# Additional AI System Design Questions to Practice\n\nThe four core questions cover much of the modern AI stack, but interviewers can frame the same concepts in narrower ways.\n\n## Design an enterprise AI copilot\n\nFocus on tenant isolation, document permissions, RAG, conversation history, model routing, auditability, and data retention.\n\n## Design a coding assistant\n\nDiscuss repository indexing, code-aware chunking, low-latency suggestions, context selection, IDE integration, 
private-code protection, and evaluation of generated code.\n\n## Design an AI evaluation platform\n\nCover dataset versioning, offline evaluation, human review, model comparison, prompt experiments, regression detection, and production feedback.\n\n## Design a multi-model gateway\n\nExplain routing between internal and third-party models based on cost, quality, latency, privacy, context length, and availability.\n\n## Design a semantic-search platform\n\nFocus on ingestion, embeddings, vector indexes, hybrid retrieval, filtering, reranking, index updates, and relevance metrics.\n\n## Design an AI safety and guardrails service\n\nDiscuss policy versioning, input and output classification, prompt-injection detection, tool restrictions, personally identifiable information, appeals, and false-positive handling.\n\n## Design a prompt-management platform\n\nCover prompt templates, version control, experiments, rollout, rollback, tenant overrides, caching, and compatibility with changing model versions.\n\n## Design a multimodal assistant\n\nAdd image, audio, and document ingestion, media storage, preprocessing, modality-specific models, content safety, and larger payload management.\n\n# What Interviewers Look for in AI System Design Answers\n\nA weak answer places an LLM box in the center of a diagram and connects it to an API.\n\nA strong answer explains the system around the model.\n\nInterviewers want to see whether you can reason about:\n\n## End-to-end architecture\n\nCan you connect the client, application services, retrieval layer, model platform, storage, and observability systems?\n\n## Trade-offs\n\nCan you compare:\n\n * Larger models versus smaller models\n * Quality versus latency\n * Quality versus cost\n * Long context versus retrieval\n * Hosted APIs versus self-hosting\n * Semantic retrieval versus keyword search\n * Autonomy versus human control\n\n\n\n## Reliability\n\nCan the system continue operating when a model, vector index, tool, region, or 
third-party provider is unavailable?\n\n## Evaluation\n\nCan you tell whether a new model or prompt actually improved the product?\n\n## Security\n\nCan you protect tenant data, prevent unauthorized retrieval, constrain tool use, and manage sensitive prompts?\n\n## Cost\n\nCan you estimate and control token usage, accelerator capacity, retrieval cost, storage, and third-party API spending?\n\nThe model is only one component. Production readiness comes from the architecture surrounding it.\n\n# How to Prepare\n\nCandidates new to large-scale architecture should first learn the traditional foundations.\n\nGrokking System Design Fundamentals introduces the core building blocks behind scalable systems, including databases, caches, queues, replication, partitioning, and load balancing.\n\nThe original Grokking the System Design Interview applies those concepts to common interview problems and teaches a structured way to move from requirements to architecture and trade-offs.\n\nThe System Design Interview Crash Course is useful for practicing a consistent interview framework across modern case studies, including a complete ChatGPT design problem.\n\nEngineers preparing for senior and staff-level discussions can continue with Advanced System Design Interview, Volume II, which emphasizes open-ended problems, failures, and defensible architectural decisions.\n\nGrokking Scalable Systems for Interviews is a useful next step for strengthening scalability, observability, fault tolerance, and performance reasoning.\n\nFor every AI design problem, practice three times:\n\n 1. **Design the happy path.**\n 2. **Design for failure and overload.**\n 3. 
**Defend the quality, safety, and cost trade-offs.**\n\n\n\nThat third pass is where most of the valuable interview discussion occurs.\n\n# Final Takeaway\n\nAI system design is not a replacement for traditional system design.\n\nIt is traditional system design combined with a new set of constraints.\n\nYou still need to understand APIs, storage, caching, queues, partitioning, replication, security, observability, and fault tolerance.\n\nBut you must now apply those concepts to systems with:\n\n * Probabilistic outputs\n * Expensive inference\n * Streaming generation\n * Vector retrieval\n * Dynamic prompts\n * Long-lived context\n * External tools\n * Model evaluation\n * Safety requirements\n * Human approval\n\n\n\nStart with four foundational problems:\n\n 1. Design ChatGPT.\n 2. Design a RAG platform.\n 3. Design an LLM inference service.\n 4. Design an AI agent platform.\n\n\n\nMaster the request flow, deep dives, failure modes, and trade-offs behind each one.\n\nOnce you can explain those systems clearly, most other AI system design questions become variations on the same underlying building blocks.",
"title": "AI System Design Interview Questions: ChatGPT, RAG, LLM Inference, and Agents"
}