{
"$type": "site.standard.document",
"bskyPostRef": {
"cid": "bafyreic7bke53unwymoxk44weqxr6vijaza4pt35xou7hfnx7zrj3ltnai",
"uri": "at://did:plc:5sgu76a53rz3n6unbykmovqy/app.bsky.feed.post/3mm33g4p7unl2"
},
"description": "Ollama makes running open-source LLMs as straightforward as running a Docker container. You pull a model, and it starts serving a local REST API that your code can call, no cloud, no API key, no per-token billing.\n\n\nHow It Works\n\nOllama bundles model weights, a Go-based runtime, and a simple model definition format (Modelfiles) into a single binary. When you run ollama run llama3.2, it downloads the model (GGUF format, quantized for CPU/GPU), starts a server at localhost:11434, and drops you int",
"path": "/engineering-glossary/ollama/",
"publishedAt": "2026-05-17T19:20:47.000Z",
"site": "https://sahilkapoor.com",
"tags": [
"Vllm",
"Langchain",
"Cursor",
"Openhands",
"Mcp Model Context Protocol",
"Inference Endpoint",
"Tokenization"
],
"textContent": "Ollama makes running open-source LLMs as straightforward as running a Docker container. You pull a model, and it starts serving a local REST API that your code can call, no cloud, no API key, no per-token billing.\n\n## How It Works\n\nOllama bundles model weights, a Go-based runtime, and a simple model definition format (Modelfiles) into a single binary. When you run `ollama run llama3.2`, it downloads the model (GGUF format, quantized for CPU/GPU), starts a server at `localhost:11434`, and drops you into an interactive chat. The HTTP API mirrors the OpenAI completions endpoint, so existing code that hits `api.openai.com` can point to Ollama with a URL change.\n\n\n ollama pull llama3.2\n ollama run mistral\n curl http://localhost:11434/api/generate -d '{\"model\":\"llama3.2\",\"prompt\":\"Explain REST APIs\"}'\n\n## Model Library\n\nOllama's model library includes Llama 3.x, Mistral, Gemma 2, Phi-3, Qwen, Code Llama, DeepSeek Coder, and dozens more. Models are stored in `~/.ollama/models` and each one includes a Modelfile that sets the system prompt and parameters.\n\n## Use Cases\n\n * **Privacy-sensitive workloads** , legal, medical, or proprietary data that can't leave your network\n * **Offline/air-gapped environments** , dev environments without internet access\n * **Cost control** , development and testing without per-token costs\n * **Local RAG pipelines** , combine with a local vector DB for fully offline retrieval\n * **Custom models** , fine-tuned models via Modelfiles or GGUF import\n\n\n\n## Ollama vs vLLM\n\nOllama is optimized for ease of use on developer laptops; Vllm is optimized for throughput in production. Ollama runs on CPU if no GPU is present; vLLM requires CUDA/ROCm. For a single developer experimenting with models, Ollama is the right choice. 
For serving models to multiple users or benchmarking throughput, vLLM is the better fit.\n\n## Integration with AI Tooling\n\nBecause Ollama exposes an OpenAI-compatible API, it plugs into LangChain, Cursor, and most LLM SDKs with little more than a base-URL change. You can use it as a local backend for OpenHands or any agent framework. Combined with the Model Context Protocol (MCP), Ollama can power fully local agentic workflows.\n\n## Related Terms\n\n * vLLM: production-grade inference server for high throughput\n * Inference Endpoint: cloud-hosted equivalent of what Ollama provides locally\n * LangChain: orchestration framework that can use Ollama as its LLM backend\n * Tokenization: how the model converts your prompt to numbers before processing\n\n",
"title": "Ollama",
"updatedAt": "2026-05-18T20:04:15.235Z"
}