Inference Endpoint
An inference endpoint is the serving layer for a trained model. After training (or downloading) an LLM, you need infrastructure to accept requests, run the forward pass, and return outputs at scale. T…
Sahil Kapoor's Playbook · May 17 · 3 min read
Vllm · Tokenization · Ollama · Openrouter

Tokenization
Tokenization is the first step in any LLM pipeline: converting raw text into a sequence of integer IDs that the model actually processes. Understanding tokenization helps you reason about context wind…
Sahil Kapoor's Playbook · May 17 · 3 min read
Langchain · Vllm · Ollama · Prompt Engineering

LoRA (Low-Rank Adaptation)
LoRA (Low-Rank Adaptation) is a fine-tuning method introduced by Hu et al. at Microsoft in 2021. Instead of updating all billions of parameters in a large model, LoRA freezes the original weights and …
Sahil Kapoor's Playbook · May 17 · 3 min read
Prompt Engineering · System Prompt · Rlhf · Vllm

OpenRouter
A unified API gateway for large language models that lets you call 100+ LLMs from different providers through a single OpenAI-compatible endpoint with automatic fallback and cost routing.
Sahil Kapoor's Playbook · May 17 · 2 min read
Ollama · Vllm · Inference Endpoint · Langchain

GitHub Copilot
GitHub Copilot, launched in 2021 and built on OpenAI Codex (later GPT-4), was the first AI pair programmer to reach mainstream adoption. It integrates as an extension into VS Code, JetBrains, Neovim, …
Sahil Kapoor's Playbook · May 17 · 3 min read
Cursor · Windsurf · Ollama · Vllm

Ollama
Ollama makes running open-source LLMs as straightforward as running a Docker container. You pull a model, and it starts serving a local REST API that your code can call, no cloud, no API key, no per-t…
Sahil Kapoor's Playbook · May 17 · 3 min read
Vllm · Langchain · Cursor · Openhands