{
"$type": "site.standard.document",
"bskyPostRef": {
"cid": "bafyreifr7amineppmjlbce2rq5p46nlzknt7lj77duh3m6wfsd6bra4eda",
"uri": "at://did:plc:5sgu76a53rz3n6unbykmovqy/app.bsky.feed.post/3mm33hzvdi2r2"
},
"description": "An inference endpoint is the serving layer for a trained model. After training (or downloading) an LLM, you need infrastructure to accept requests, run the forward pass, and return outputs at scale. That infrastructure, whether it's Hugging Face Inference Endpoints, AWS SageMaker, your own vLLM deployment, or a managed service like OpenAI, is the inference endpoint.\n\n\nRequest Flow\n\n 1. Client sends HTTP POST with prompt, model params (temperature, max_tokens)\n 2. Endpoint tokenizes the prompt (T",
"path": "/engineering-glossary/inference-endpoint/",
"publishedAt": "2026-05-17T19:21:01.000Z",
"site": "https://sahilkapoor.com",
"tags": [
"Vllm",
"Tokenization",
"Ollama",
"Openrouter",
"Kubernetes"
],
"textContent": "An inference endpoint is the serving layer for a trained model. After training (or downloading) an LLM, you need infrastructure to accept requests, run the forward pass, and return outputs at scale. That infrastructure, whether it's Hugging Face Inference Endpoints, AWS SageMaker, your own vLLM deployment, or a managed service like OpenAI, is the inference endpoint.\n\n## Request Flow\n\n 1. Client sends an HTTP POST with the prompt and model parameters (temperature, max_tokens)\n 2. Endpoint tokenizes the prompt (see Tokenization)\n 3. Model runs the forward pass on GPU/CPU, generating output tokens one at a time\n 4. Output tokens stream back (SSE) or return in a single response\n 5. Client receives the generated text\n\n\n\n## Key Metrics\n\n * **Time to First Token (TTFT)**: latency before streaming starts; affects perceived responsiveness\n * **Tokens per Second (TPS)**: generation throughput once streaming begins\n * **Requests per Second (RPS)**: how many requests the endpoint can sustain under concurrent load\n * **P99 latency**: tail latency, the value that 99% of requests stay under; critical for SLAs\n\n\n\n## Managed vs Self-Hosted\n\n**Managed:** OpenAI, Anthropic, Hugging Face Inference Endpoints. Pay per token or per compute hour, no infrastructure to manage, model choice limited to the provider's catalog.\n\n**Self-hosted with vLLM:** Full control, any model, predictable cost at scale, but requires GPU infrastructure, an on-call rotation, and ops work. As a rough rule of thumb, the economics favor self-hosting above ~$10k/month in model API spend.\n\n**Local with Ollama:** Runs on developer hardware with no per-token cost and no network latency, but throughput is limited (typically 1 concurrent user).\n\n**Aggregated via OpenRouter:** Multiple providers through one API. Convenient, but adds a small markup and is limited to providers in the catalog.\n\n## Streaming\n\nProduction UX almost always requires streaming: showing tokens as they arrive rather than waiting for the full response. Endpoints typically implement this via Server-Sent Events (SSE), and most provider SDKs handle stream parsing automatically. 
Streaming does not shorten total generation time, but it sharply reduces perceived latency for long responses because the user starts reading as soon as the first token arrives.\n\n## Context and Batching\n\nInference endpoints batch concurrent requests together so the GPU stays efficiently utilized. vLLM's continuous batching admits new requests into the running batch as they arrive rather than waiting for a full batch to form, which raises throughput and reduces average wait time.\n\n## Related Terms\n\n * vLLM: open-source engine for building high-throughput inference endpoints\n * Ollama: local inference endpoint for development\n * OpenRouter: managed aggregator for multiple inference endpoints\n * Tokenization: the first processing step on every inference endpoint\n * Kubernetes: standard orchestration for self-hosted inference endpoint clusters\n\n",
"title": "Inference Endpoint",
"updatedAt": "2026-05-18T20:03:52.162Z"
}