{
  "$type": "site.standard.document",
  "bskyPostRef": {
    "cid": "bafyreic3drlbcgys4lzbumtenxwvz7kejgjmqrfoihbcv64ozldgrnnuaq",
    "uri": "at://did:plc:25rdn5elo5izoxrmtis34zuk/app.bsky.feed.post/3mpcuhhederb2"
  },
  "coverImage": {
    "$type": "blob",
    "ref": {
      "$link": "bafkreigjej6cbnf7lobm3ql2kn7j7rd7tmfhpr5d6jxyzj36lg4x4yfxtu"
    },
    "mimeType": "image/webp",
    "size": 62374
  },
  "path": "/mohitkumar4/getting-started-with-ollama-run-llms-locally-in-10-minutes-5g98",
  "publishedAt": "2026-06-28T01:18:12.000Z",
  "site": "https://dev.to",
  "tags": [
    "ai",
    "tutorial",
    "opensource",
    "beginners"
  ],
  "textContent": "If you've ever wanted to run a large language model on your own machine — no API key, no cloud bill, no data leaving your laptop — **Ollama** is the easiest way to get there. It packages model weights, a runtime (built on `llama.cpp`), and a simple CLI/REST API into one tool that works the same way on macOS, Linux, and Windows.\n\nThis guide covers installation, running your first model, the core commands you'll actually use, picking a model for your hardware, and hooking Ollama into your own code via its API.\n\n##  Why run models locally?\n\n  * **Privacy** — your prompts and data never leave your machine.\n  * **Cost** — no per-token billing. You pay once, in hardware (or nothing, if you already have a decent laptop).\n  * **Offline** — works on a plane, in a SCIF, or wherever your Wi-Fi doesn't.\n  * **Control** — swap models, tweak parameters, fine-tune behavior with no rate limits.\n\n\n\nThe tradeoff: local models are generally smaller and slightly behind frontier cloud models (GPT, Claude, Gemini) on raw capability — though the gap keeps shrinking fast.\n\n##  Installation\n\n###  macOS\n\nDownload the app from ollama.com/download, or use Homebrew:\n\n\n\n    brew install ollama\n\n\n###  Linux\n\n\n    curl -fsSL https://ollama.com/install.sh | sh\n\n\nThis installs the `ollama` binary and sets up a systemd service so it runs in the background. Check it's alive:\n\n\n\n    systemctl status ollama\n\n\n###  Windows\n\nDownload `OllamaSetup.exe` from ollama.com/download and run it — no admin rights required. Recent versions ship a full desktop app with a chat window, so you can skip the terminal entirely if you prefer. 
A native ARM64 build is also available for Windows-on-Arm devices.\n\n###  Docker\n\n\n    docker run -d -v ollama:/root/.ollama -p 11434:11434 --name ollama ollama/ollama\n\n\nAdd `--gpus=all` if you have an NVIDIA GPU and the NVIDIA Container Toolkit installed.\n\n###  Verify it's working\n\n\n    ollama --version\n    ollama list\n\n\nAn empty list is expected on a fresh install — it just confirms the daemon is up and responding.\n\n##  Run your first model\n\n\n    ollama run llama3.2\n\n\nThis pulls the model (a few GB, one-time download) and drops you into an interactive chat session. Type a prompt, hit enter, get a response. `Ctrl+D` or `/bye` exits.\n\n##  Core commands cheat sheet\n\nCommand | What it does\n---|---\n`ollama run <model>` | Pull (if needed) and chat with a model\n`ollama pull <model>` | Download a model without starting a chat\n`ollama list` | Show models you have installed\n`ollama ps` | Show models currently loaded in memory\n`ollama show <model>` | Show details/parameters for a model\n`ollama rm <model>` | Delete a model to free disk space\n`ollama stop <model>` | Unload a model from memory\n`ollama create <name> -f Modelfile` | Build a custom model from a Modelfile\n\nAlways pull with an explicit tag for anything you depend on (`ollama pull qwen2.5-coder:7b`), since `:latest` can change under you.\n\n##  Picking a model for your hardware\n\nOllama's library has hundreds of models. 
As a starting point:\n\nUse case | Try | Rough RAM/VRAM\n---|---|---\nGeneral daily driver, light hardware | `llama3.2:3b` | ~4 GB\nGeneral daily driver, mid hardware | `llama3.1:8b` or `qwen3:8b` | ~6–8 GB\nCoding | `qwen2.5-coder:7b` or `qwen3-coder:30b` (MoE, runs faster than its size suggests) | 6–20 GB\nReasoning / math / step-by-step logic | `deepseek-r1:7b` or `deepseek-r1:14b` | 6–12 GB\nBest quality you can fit on a single consumer GPU | `qwen3.6:27b` or `gpt-oss:20b` | ~16–24 GB\nVision (images + text) | `llava` or `gemma3:12b` | 8–16 GB\nEmbeddings (for RAG / semantic search) | `nomic-embed-text` | <1 GB\n\nRule of thumb for sizing: a 7–8B model at Q4 quantization needs roughly 5–6 GB of memory for the weights plus a modest context window. Treat these as estimates, not guarantees. Mixture-of-experts models (the ones with an \"active/total\" parameter split, like `qwen3-coder:30b`) activate only a fraction of their parameters for each token, so they're often faster than their total parameter count implies. They still need the _full_ model in memory, though, not just the active slice. 
Always check `ollama.com/library` for the current tag list, since model lineups change often.\n\nIf you're not sure where to start: pull a small model, use it for a week on your actual tasks, and let what it struggles with point you toward the next one.\n\n##  Using the API\n\nOllama exposes a REST API on `localhost:11434`, which is how IDE plugins, chat UIs, and frameworks talk to it under the hood.\n\n\n\n    curl http://localhost:11434/api/chat -d '{\n      \"model\": \"llama3.2\",\n      \"messages\": [{ \"role\": \"user\", \"content\": \"Explain Ollama in one sentence.\" }],\n      \"stream\": false\n    }'\n\n\nIt also exposes an **OpenAI-compatible endpoint**, so anything built for the OpenAI SDK can point at Ollama by changing its base URL to `http://localhost:11434/v1`. The chat completions route is then:\n\n\n\n    http://localhost:11434/v1/chat/completions\n\n\n###  Python\n\n\n    pip install ollama\n\n\n\n    from ollama import chat\n\n    response = chat(model='llama3.2', messages=[\n        {'role': 'user', 'content': 'Why is the sky blue?'}\n    ])\n    print(response.message.content)\n\n\n##  Customizing a model with a Modelfile\n\nWant a model with a fixed system prompt or different default parameters? Create a `Modelfile`:\n\n\n\n    FROM llama3.2\n\n    PARAMETER temperature 0.7\n    PARAMETER num_ctx 4096\n\n    SYSTEM \"\"\"\n    You are a terse code reviewer. Point out bugs and style issues only — no praise, no fluff.\n    \"\"\"\n\n\nBuild it:\n\n\n\n    ollama create code-reviewer -f Modelfile\n    ollama run code-reviewer\n\n\nNow `code-reviewer` is its own model in `ollama list`, with your settings baked in.\n\n##  A few practical tips\n\n  * **Bind address**: by default Ollama only listens on `127.0.0.1`. 
Setting `OLLAMA_HOST=0.0.0.0` exposes the API to your whole network with **no authentication** — fine on a trusted LAN, risky anywhere else.\n  * **Multiple models loaded at once**: `OLLAMA_MAX_LOADED_MODELS` caps how many models stay loaded at the same time, and `OLLAMA_NUM_PARALLEL` sets how many requests each loaded model handles in parallel.\n  * **Long contexts are expensive**: KV cache memory scales with context length, not just model size. A 70B model at 128K context can add tens of GB beyond the weights alone. Set `num_ctx` deliberately in a Modelfile instead of leaving it at whatever default your VRAM tier triggers.\n  * **GPU not being used?** Check `ollama ps`, which shows whether a model is running on CPU or GPU. Driver issues (CUDA/ROCm) are a common cause of silent CPU fallback.\n\n\n\n##  Where to go next\n\n  * Browse ollama.com/library for the full, constantly updated model list.\n  * Point any OpenAI-SDK-based tool (LangChain, LlamaIndex, Continue, etc.) at `http://localhost:11434/v1` to swap in local models with minimal code changes.\n  * Pair a small embedding model (`nomic-embed-text`) with a chat model to build a local RAG pipeline with zero API cost.\n\n\n\nThat's the whole loop: install, pull, run, integrate. Everything else is just picking the right model for the job.",
  "title": "Getting Started with Ollama: Run LLMs Locally in 10 Minutes"
}