#vLLM

Inference Endpoint

An inference endpoint is the serving layer for a trained model. After training (or downloading) an LLM, you need infrastructure to accept requests, run the forward pass, and return outputs at scale. T…

Sahil Kapoor's Playbook·May 17·3 min read

Tokenization

Tokenization is the first step in any LLM pipeline: converting raw text into a sequence of integer IDs that the model actually processes. Understanding tokenization helps you reason about context wind…

Sahil Kapoor's Playbook·May 17·3 min read

LangChain · vLLM · Ollama · Prompt Engineering

LoRA (Low-Rank Adaptation)

LoRA (Low-Rank Adaptation) is a fine-tuning method introduced by Hu et al. at Microsoft in 2021. Instead of updating all billions of parameters in a large model, LoRA freezes the original weights and …

Sahil Kapoor's Playbook·May 17·3 min read

Prompt Engineering · System Prompt · RLHF · vLLM

OpenRouter

A unified API gateway for large language models that lets you call 100+ LLMs from different providers through a single OpenAI-compatible endpoint with automatic fallback and cost routing.

Sahil Kapoor's Playbook·May 17·2 min read

Ollama · vLLM · Inference Endpoint · LangChain

GitHub Copilot

GitHub Copilot, launched in 2021 and built on OpenAI Codex (later GPT-4), was the first AI pair programmer to reach mainstream adoption. It integrates as an extension into VS Code, JetBrains, Neovim, …

Sahil Kapoor's Playbook·May 17·3 min read

Cursor · Windsurf · Ollama · vLLM

Ollama

Ollama makes running open-source LLMs as straightforward as running a Docker container. You pull a model, and it starts serving a local REST API that your code can call: no cloud, no API key, no per-t…

Sahil Kapoor's Playbook·May 17·3 min read

vLLM · LangChain · Cursor · OpenHands