{
  "$type": "site.standard.document",
  "bskyPostRef": {
    "cid": "bafyreic7bke53unwymoxk44weqxr6vijaza4pt35xou7hfnx7zrj3ltnai",
    "uri": "at://did:plc:5sgu76a53rz3n6unbykmovqy/app.bsky.feed.post/3mm33g4p7unl2"
  },
  "description": "Ollama makes running open-source LLMs as straightforward as running a Docker container. You pull a model, and it starts serving a local REST API that your code can call, no cloud, no API key, no per-token billing.\n\n\nHow It Works\n\nOllama bundles model weights, a Go-based runtime, and a simple model definition format (Modelfiles) into a single binary. When you run ollama run llama3.2, it downloads the model (GGUF format, quantized for CPU/GPU), starts a server at localhost:11434, and drops you int",
  "path": "/engineering-glossary/ollama/",
  "publishedAt": "2026-05-17T19:20:47.000Z",
  "site": "https://sahilkapoor.com",
  "tags": [
    "Vllm",
    "Langchain",
    "Cursor",
    "Openhands",
    "Mcp Model Context Protocol",
    "Inference Endpoint",
    "Tokenization"
  ],
  "textContent": "Ollama makes running open-source LLMs as straightforward as running a Docker container. You pull a model, and it starts serving a local REST API that your code can call: no cloud, no API key, no per-token billing.\n\n## How It Works\n\nOllama packages a Go-based runtime, model weights, and a simple model definition format (Modelfiles) behind a single command-line tool. When you run `ollama run llama3.2`, it downloads the model (GGUF format, quantized for CPU/GPU), starts a server at `localhost:11434`, and drops you into an interactive chat. Alongside its native API (`/api/generate`, `/api/chat`), Ollama exposes an OpenAI-compatible endpoint under `/v1`, so existing code that hits `api.openai.com` can usually point to Ollama by changing the base URL.\n\n    # Download the model weights without starting a chat\n    ollama pull llama3.2\n    # Start an interactive chat (pulls the model first if it is missing)\n    ollama run mistral\n    # Call the native generate endpoint over HTTP\n    curl http://localhost:11434/api/generate -d '{\"model\":\"llama3.2\",\"prompt\":\"Explain REST APIs\"}'\n\n## Model Library\n\nOllama's model library includes Llama 3.x, Mistral, Gemma 2, Phi-3, Qwen, Code Llama, DeepSeek Coder, and dozens more. Models are stored in `~/.ollama/models` by default, and each one includes a Modelfile that sets the system prompt and parameters.\n\n## Use Cases\n\n  * **Privacy-sensitive workloads**: legal, medical, or proprietary data that can't leave your network\n  * **Offline/air-gapped environments**: dev environments without internet access\n  * **Cost control**: development and testing without per-token costs\n  * **Local RAG pipelines**: combine with a local vector DB for fully offline retrieval\n  * **Custom models**: fine-tuned models via Modelfiles or GGUF import\n\n## Ollama vs vLLM\n\nOllama is optimized for ease of use on developer laptops; vLLM is optimized for throughput in production. Ollama runs on CPU if no GPU is present; vLLM is built primarily for GPU (CUDA/ROCm) deployments. For a single developer experimenting with models, Ollama is the right choice. For serving models to many concurrent users or benchmarking throughput, vLLM wins.\n\n## Integration with AI Tooling\n\nBecause Ollama exposes an OpenAI-compatible API, it plugs into LangChain, Cursor, and most LLM SDKs with little more than a base-URL change. You can use it as a local backend for OpenHands or any agent framework. Combined with the Model Context Protocol (MCP), Ollama can power fully local agentic workflows.\n\n## Related Terms\n\n  * vLLM: production-grade inference server for high throughput\n  * Inference Endpoint: cloud-hosted equivalent of what Ollama provides locally\n  * LangChain: orchestration framework that can use Ollama as its LLM backend\n  * Tokenization: how the model converts your prompt to numbers before processing\n\n",
  "title": "Ollama",
  "updatedAt": "2026-05-18T20:04:15.235Z"
}