Ollama
Ollama makes running open-source LLMs as straightforward as running a Docker container. You pull a model, and Ollama serves it through a local REST API that your code can call: no cloud, no API key, no per-token billing.
How It Works
Ollama packages a Go-based runtime and a simple model definition format (Modelfiles) in a single binary, and fetches model weights on demand. When you run ollama run llama3.2, it downloads the model if it is not already present (GGUF format, quantized for CPU/GPU), serves it from the local server at localhost:11434, and drops you into an interactive chat. Alongside its native API under /api, Ollama exposes an OpenAI-compatible API under /v1, so existing code that hits api.openai.com can usually point to Ollama by changing the base URL.
ollama pull llama3.2
ollama run mistral
curl http://localhost:11434/api/generate -d '{"model":"llama3.2","prompt":"Explain REST APIs"}'
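The same call can be made from code. The following is a minimal Python sketch using only the standard library. It assumes the server is listening on the default localhost:11434 and that llama3.2 has already been pulled; the helper names are illustrative, and "stream": false asks for one JSON object instead of the default stream of chunks.

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434"  # default address; adjust if yours differs


def build_generate_request(model: str, prompt: str) -> urllib.request.Request:
    """Build (but do not send) a POST request for Ollama's /api/generate."""
    payload = {"model": model, "prompt": prompt, "stream": False}
    return urllib.request.Request(
        f"{OLLAMA_URL}/api/generate",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def generate(model: str, prompt: str, timeout: float = 120.0) -> str:
    """Send the request and return the generated text from the JSON reply."""
    request = build_generate_request(model, prompt)
    with urllib.request.urlopen(request, timeout=timeout) as resp:
        body = json.load(resp)
    return body["response"]
```

With a server running, generate("llama3.2", "Explain REST APIs") should return the model's answer as a string.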
Model Library
Ollama's model library includes Llama 3.x, Mistral, Gemma 2, Phi-3, Qwen, Code Llama, DeepSeek Coder, and dozens more. Models are stored in ~/.ollama/models and each one includes a Modelfile that sets the system prompt and parameters.
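To make the Modelfile idea concrete, here is a small Python sketch that renders one as text. FROM, PARAMETER, and SYSTEM are Modelfile instructions; the function name, the base model, and the parameter values are assumptions chosen for illustration.

```python
def render_modelfile(base: str, system: str, parameters: dict) -> str:
    """Render Modelfile text: a base model, sampling parameters, a system prompt."""
    lines = [f"FROM {base}"]
    for name, value in parameters.items():
        # One PARAMETER line per setting, e.g. "PARAMETER temperature 0.2"
        lines.append(f"PARAMETER {name} {value}")
    # Triple quotes let the system prompt span multiple lines
    lines.append(f'SYSTEM """{system}"""')
    return "\n".join(lines) + "\n"


# Illustrative values only; pick ones that suit your model and task.
modelfile_text = render_modelfile(
    base="llama3.2",
    system="You are a concise assistant.",
    parameters={"temperature": 0.2, "num_ctx": 4096},
)
```

Saving modelfile_text to a file named Modelfile and running ollama create concise -f Modelfile would register it as a local model (the name concise is hypothetical).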
Use Cases
- Privacy-sensitive workloads: legal, medical, or proprietary data that can't leave your network
- Offline/air-gapped environments: dev environments without internet access
- Cost control: development and testing without per-token costs
- Local RAG pipelines: combine with a local vector DB for fully offline retrieval
- Custom models: fine-tuned models via Modelfiles or GGUF import
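As a rough illustration of the local RAG use case, the sketch below pairs Ollama's /api/embed endpoint with a plain in-memory cosine-similarity search. The in-memory list stands in for a real vector DB, the embedding model name nomic-embed-text is an assumption (any embedding model you have pulled would do), and the function names are hypothetical.

```python
import json
import math
import urllib.request


def embed(texts, model="nomic-embed-text", base_url="http://localhost:11434"):
    """Ask the local Ollama server for one embedding vector per input text."""
    request = urllib.request.Request(
        f"{base_url}/api/embed",
        data=json.dumps({"model": model, "input": texts}).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request) as resp:
        return json.load(resp)["embeddings"]


def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def top_k(query_vec, doc_vecs, k=3):
    """Return indices of the k document vectors most similar to the query."""
    ranked = sorted(
        enumerate(doc_vecs),
        key=lambda pair: cosine(query_vec, pair[1]),
        reverse=True,
    )
    return [index for index, _ in ranked[:k]]
```

In a full pipeline you would embed your documents once, embed each incoming question, pass the top_k documents to the model as context, and never send anything off the machine.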
Ollama vs vLLM
Ollama is optimized for ease of use on developer laptops; vLLM is optimized for throughput in production. Ollama falls back to CPU if no GPU is present, while vLLM primarily targets GPU deployments (CUDA/ROCm). For a single developer experimenting with models, Ollama is the right choice. For serving models to many concurrent users or benchmarking throughput, vLLM wins.
Integration with AI Tooling
Because Ollama exposes an OpenAI-compatible API, it plugs into LangChain, Cursor, and most LLM SDKs with little more than a base URL change. You can use it as a local backend for OpenHands or any agent framework that accepts a custom endpoint. Combined with the Model Context Protocol (MCP), Ollama can power fully local agentic workflows.
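The compatibility claim can be sketched without any SDK. The Python below builds a chat request in the OpenAI wire format and aims it at Ollama's /v1/chat/completions path. The helper names are illustrative, and the default model and address are assumptions about a typical local setup.

```python
import json
import urllib.request


def build_chat_request(messages, model="llama3.2",
                       base_url="http://localhost:11434/v1"):
    """Build (but do not send) an OpenAI-style chat completion request."""
    payload = {"model": model, "messages": messages}
    return urllib.request.Request(
        f"{base_url}/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def extract_reply(response: dict) -> str:
    """Pull the assistant text out of an OpenAI-style response body."""
    return response["choices"][0]["message"]["content"]


def chat(messages, **kwargs) -> str:
    """Send the request to the local server and return the assistant's reply."""
    with urllib.request.urlopen(build_chat_request(messages, **kwargs)) as resp:
        return extract_reply(json.load(resp))
```

Because only base_url differs from a hosted setup, the same request shape works against either backend; SDKs that insist on an API key generally accept any placeholder string when pointed at Ollama.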
Related Terms
- vLLM: production-grade inference server for high throughput
- Inference Endpoint: cloud-hosted equivalent of what Ollama provides locally
- LangChain: orchestration framework that can use Ollama as its LLM backend
- Tokenization: how the model converts your prompt to numbers before processing