External Publication
Visit Post

Ollama

Sahil Kapoor's Playbook May 17, 2026
Source

Ollama makes running open-source LLMs as straightforward as running a Docker container. You pull a model, and it starts serving a local REST API that your code can call, no cloud, no API key, no per-token billing.

How It Works

Ollama bundles model weights, a Go-based runtime, and a simple model definition format (Modelfiles) into a single binary. When you run ollama run llama3.2, it downloads the model (GGUF format, quantized for CPU/GPU), starts a server at localhost:11434, and drops you into an interactive chat. The HTTP API mirrors the OpenAI completions endpoint, so existing code that hits api.openai.com can point to Ollama with a URL change.

ollama pull llama3.2
ollama run mistral
curl http://localhost:11434/api/generate -d '{"model":"llama3.2","prompt":"Explain REST APIs"}'

Model Library

Ollama's model library includes Llama 3.x, Mistral, Gemma 2, Phi-3, Qwen, Code Llama, DeepSeek Coder, and dozens more. Models are stored in ~/.ollama/models and each one includes a Modelfile that sets the system prompt and parameters.

Use Cases

  • Privacy-sensitive workloads , legal, medical, or proprietary data that can't leave your network
  • Offline/air-gapped environments , dev environments without internet access
  • Cost control , development and testing without per-token costs
  • Local RAG pipelines , combine with a local vector DB for fully offline retrieval
  • Custom models , fine-tuned models via Modelfiles or GGUF import

Ollama vs vLLM

Ollama is optimized for ease of use on developer laptops; Vllm is optimized for throughput in production. Ollama runs on CPU if no GPU is present; vLLM requires CUDA/ROCm. For a single developer experimenting with models, Ollama is the right choice. For serving models to multiple users or benchmarking throughput, vLLM wins.

Integration with AI Tooling

Because Ollama exposes an OpenAI-compatible API, it plugs into Langchain, Cursor, and most LLM SDKs without changes. You can use it as a local backend for Openhands or any agent framework. Combined with Mcp Model Context Protocol, Ollama can power fully local agentic workflows.

Related Terms

  • Vllm, production-grade inference server for high throughput
  • Inference Endpoint, cloud-hosted equivalent of what Ollama provides locally
  • Langchain, orchestration framework that can use Ollama as its LLM backend
  • Tokenization, how the model converts your prompt to numbers before processing

Discussion in the ATmosphere

Loading comments...