Raw Record Source

{
  "$type": "site.standard.document",
  "bskyPostRef": {
    "cid": "bafyreicyu5cioh46wu5r3acgoevuoy7pxuf27tnkmyicnteabic37mcfxq",
    "uri": "at://did:plc:pgryn3ephfd2xgft23qokfzt/app.bsky.feed.post/3mkgxqgrljah2"
  },
  "path": "/t/benchmark-6-local-ollama-models-for-code-gen-delegation-with-variance-analysis/175579#post_1",
  "publishedAt": "2026-04-27T00:08:00.000Z",
  "site": "https://discuss.huggingface.co",
  "textContent": "I’ve been building a local Ollama pool to delegate small, well-scoped coding chores from a main agent. Before cabling routing rules into the agent, I wanted a defensible answer to “which model for which task family.” This post is the bench I ran, the surprises, and the methodology lessons. The full repro (bash wrappers, prompts, verifier) is single-file Python + curl + jq, so it should be easy to reproduce or extend.\n\n**## TL;DR**\n\nI ran 6 models against 3 strict, single-function prompts, auto-graded by I/O equivalence (32 test cases total). Then I ran the most discriminating prompt 3 times on every model to measure variance. The single-shot ranking and the post-variance ranking did not agree.\n\nHeadline findings:\n\n1. The post-variance winner on narrow code-gen tasks is `gemma4:latest`. Byte-stable 22/22 across 3 runs. Single-shot ranking placed it 5th because it failed an unrelated test-scaffolding prompt that needed Python module-level reasoning.\n\n2. `qwen2.5-coder:14b` is the right pick for prompts requiring runtime/Python semantics. Stable 20-22/22, only model that handled a stale-reference trap correctly.\n\n3. `qwen3.5:9b` failed 2 of 3 runs on the same prompt. Produced byte-identical buggy code in two consecutive runs at `temperature=0.2`. The 21/22 score that put it #1 in the single-shot ranking was the _*less common*_ sampling path.\n\n4. `qwen3.5:4b` was wildly unstable. Score swung from 4/22 to 19/22 across runs at `temperature=0.2`. Useful only with a best-of-N + verifier wrapper.\n\n5. The Qwen3 thinking variants returned empty `response` fields on 100% of constrained code-gen prompts until I set `think:false`. Default-on thinking was a complete trap.\n\nMethodological lesson: single-shot LLM benchmarks lie in both directions. 
Variance flipped my “winner” and uncovered a “loser” that was actually best-in-class for a specific task family.\n\n**## Setup**\n\n- Hardware: single workstation, 16 GB VRAM (Quadro), Ollama on `127.0.0.1:11434`.\n\n- Driver: a 60-line bash wrapper that POSTs each prompt with `temperature=0.2`, `stream=false`, and writes each response to a file.\n\n- Verifier: a Python script that strips markdown fences, `exec()`s each model’s output, and runs a battery of valid + invalid inputs against the resulting function. Every score below is automated.\n\n**## The three prompts**\n\nAll three explicitly forbid markdown fences, imports outside the function body, and any preamble.\n\n- **P1**: pytest test generator with a stale-reference trap. The function under test rebinds the module global, so the test must re-read by attribute, not hold a local. Binary pass/fail.\n\n- **P2**: `parse_iso_duration(s: str) -> int` for `PT<h>H<m>M<s>S` strings, raising `ValueError(\"invalid ISO duration: …\")` on malformed input. 6 valid + 8 invalid cases.\n\n- **P3**: `flatten(d: dict, sep: str = \".\") -> dict` that recurses into nested dicts but leaves lists/tuples as-is, and drops empty nested dicts entirely. 
10 cases including custom separators, depth>3, mixed types, and the “only empty subtree” edge case.\n\n**## Single-shot results (N=3 prompts, 1 run each)**\n\nScore per prompt is normalized to [0,1] (P1 is 0/1, P2 is /22, P3 is /10) and averaged.\n\n| # | Model | Size | P1 | P2 | P3 | Score |\n|---|---|---|---|---|---|---|\n| 1 | qwen3.5:9b (`think:false`) | 6.6 GB | yes | 21/22 | 10/10 | 0.985 |\n| 2 | qwen2.5-coder:14b | 9.0 GB | yes | 20/22 | 10/10 | 0.970 |\n| 3 | qwen3.5:4b (`think:false`) | **3.4 GB** | yes | 20/22 | 8/10 | 0.903 |\n| 4 | qwen3:14b (`think:false`) | 9.3 GB | yes | 8/22 | 10/10 | 0.788 |\n| 5 | gemma4:latest | 9.6 GB | no | **22/22** | 10/10 | 0.667 |\n| 6 | deepseek-coder-v2:16b | 8.9 GB | no | 16/22 | 9/10 | 0.542 |\n\nThis ranking turned out to be misleading. Read on.\n\n**## Variance check that flipped the ranking (3 runs of P2, all 6 models)**\n\nSame prompt, same `temperature=0.2`, three independent calls per model:\n\n| Model | Run 1 | Run 2 | Run 3 | Mean | Stability |\n|---|---|---|---|---|---|\n| **gemma4:latest** | **22/22** | **22/22** | **22/22** | **22.0** | perfect x 3 |\n| qwen2.5-coder:14b | 22/22 | 20/22 | 20/22 | 20.7 | tight cluster |\n| qwen3:14b (`think:false`) | 17/22 | 16/22 | 17/22 | 16.7 | stable, mediocre |\n| deepseek-coder-v2:16b | 16/22 | 16/22 | 12/22 | 14.7 | stable, wrong on valid inputs |\n| qwen3.5:9b (`think:false`) | 9/22 | 9/22 | 21/22 | 13.0 | bimodal |\n| qwen3.5:4b (`think:false`) | 4/22 | 19/22 | 16/22 | 13.0 | wild |\n\n`gemma4` was byte-stable perfect across 3 independent runs. Not just hitting 22/22 once, but the only model where I’d trust the answer without re-checking. The single-shot ranking placed it 5th because it failed the unrelated P1 prompt.\n\n`qwen3.5:9b` returned byte-identical buggy code in runs 1 and 2 (725 bytes each) and a different correct-ish answer in run 3. The 21/22 score that put it #1 in single-shot was the less common sampling path. 
Its dominant decoding mode is broken on this prompt.\n\n`deepseek-coder-v2:16b` is stably wrong: 0/6 valid inputs across all 3 runs. Same regex bug every time. Rerunning won’t save it.\n\nThe bug that hit `qwen3.5:9b` twice in a row at temp 0.2 was a regex requiring all three letters: `^(\\d+)?H(\\d+)?M(\\d+)?S$`. So `\"PT5M\"` fails because there’s no `H` and no `S` literal. Subtle, plausible-looking, and it ships unless you actually run the function.\n\n**## Gotcha: Qwen3 thinking models silently return empty**\n\nFirst pass on Qwen3, with the default `think:true`:\n\n| Model | Wall time | `response` bytes |\n|---|---|---|\n| qwen3:14b | **1174 s** | 1 (just `\\n`) |\n| qwen3.5:9b | 116 s | 1 |\n| qwen3.5:4b | 81 s | 1 |\n\nTwenty minutes of GPU time on the 14B and zero output. Ollama’s `/api/generate` returns two fields for thinking-mode models: `response` and `thinking`. My script only logged `response`. When I dumped the raw JSON, the 9B’s `thinking` field was 21 KB of this:\n\n```\n* Wait, I need to check if I can use `src` if `import src.main_improved` is used.\n* Yes.\n* So I will use `src.main_improved`.\n* Wait, I need to check if I can use `src` if `import src` is used.\n* Yes.\n* So I will use `src.main_improved`.\n...repeats until context fills...\n```\n\n`done_reason: \"stop\"` on a 21,000-character thinking trace with no output. The model talked itself in circles and never committed to an answer.\n\nThe fix is one parameter: `\"think\": false` in the request body. With it, all three Qwen3 sizes responded in 8-11 seconds and produced clean code. Worth being aware of if you’re benchmarking thinking-capable models with strict output requirements: smoke-test `think:false` first, and log both fields.\n\n**## Same model, opposite verdicts on different prompts**\n\n`gemma4:latest` scored a perfect 22/22 on the regex parser. 
On P1 (the test-generation prompt), it produced this:\n\n```python\ndef test_invalidate_model_cache_resets_all_keys():\n    global _model_cache    # <-- bug\n    _model_cache = {\"model\": \"x\", \"cost_matrix\": \"y\", \"timestamp\": \"z\"}\n    invalidate_model_cache()\n    assert _model_cache[\"model\"] is None\n    ...\n```\n\nThe `global` binding inside the test creates a `_model_cache` in the *test* module, not in `src.main_improved`. So `invalidate_model_cache` rebinds the source module’s dict and the assertion checks an unrelated local. The test silently passes for the wrong reason.\n\n`deepseek-coder-v2:16b` made the same mistake. A model that handles regex flawlessly cannot necessarily reason about Python’s module-level rebinding semantics in a test scaffold. This is the strongest case I have for running at least two unrelated tasks before deciding which model to route where.\n\n**## Markdown fence compliance**\n\nBoth prompts said “no markdown fences.” Compliance:\n\n| Model | P1 fences | P2 fences |\n|---|---|---|\n| qwen2.5-coder:14b | yes | yes |\n| deepseek-coder-v2:16b | yes | yes |\n| gemma4:latest | no | no |\n| qwen3:14b (`think:false`) | no | no |\n| qwen3.5:9b (`think:false`) | no | no |\n| qwen3.5:4b (`think:false`) | no | no |\n\nThe instruct-tuned coder models (`qwen2.5-coder`, `deepseek-coder`) wrap output in fences regardless of the instruction. The Qwen3 family and `gemma4` follow the no-fences instruction. 
If your delegation wrapper does not strip fences before `exec()`, you’ll see “broken” output that’s actually correct code in a string.\n\n**## Cross-prompt confirmation: gemma4 + qwen2.5-coder on P3 (3 runs each)**\n\nTo check that the gemma4 specialization wasn’t a one-prompt fluke, I ran 3 more runs of P3 (the dict flatten task) on the two stable models:\n\n| Model | P3 Run 1 | Run 2 | Run 3 | Mean |\n|---|---|---|---|---|\n| **gemma4:latest** | 10/10 | 10/10 | 10/10 | **10.0** |\n| qwen2.5-coder:14b | 10/10 | 10/10 | 9/10 | 9.7 |\n\n`gemma4` went 6 for 6 across both code-gen prompts: perfect, byte-stable. `qwen2.5-coder` lost a single point on P3 run 3 with `if v:` (truthy check) instead of `if v is not None`, silently dropping a `None` value. Subtle, but the kind of idiomatic Python bug a real test would catch.\n\n**## Best-of-N + verifier rescue for the unstable models**\n\nThe qwen3.5 family failed variance because at `temperature=0.2` they produced byte-identical buggy outputs. Natural fix: bump temperature for diversity, sample N times, run the verifier, keep the passer.\n\n5 samples of P2 at `temperature=0.7`:\n\n| Model | Run scores (best to worst) | Best-of-5 | Hit rate >=18/22 | Wall time | VRAM |\n|---|---|---|---|---|---|\n| qwen3.5:9b (`think:false`) | 22, 21, 14, 14, 8 | **22/22** | 2/5 (40%) | 30 s | 6.6 GB |\n| qwen3.5:4b (`think:false`) | 20, 20, 20, 13, 8 | **20/22** | 3/5 (60%) | 20 s | 3.4 GB |\n\nBoth produced 5 distinct hashes, so the diversity is real, not pathological. With a verifier in the loop:\n\n- `qwen3.5:9b` best-of-5 matches `gemma4` and `qwen2.5-coder` single-shot (22/22 ceiling) at 6.6 GB and ~30s. Comparable to running `gemma4` directly. Not worth the complexity unless `gemma4` isn’t available.\n\n- `qwen3.5:4b` best-of-5 is the real win: 20/22 ceiling at 3.4 GB and ~20s total. 
Fills the mini-tier slot for laptops or any machine where 9 GB of model is too much.\n\nCaveat: best-of-N only works for tasks with a cheap automated verifier. For “draft a commit message” or “write a docstring” there’s no programmatic way to pick the best, so this strategy doesn’t help.\n\n**## Routing rules I ended up with**\n\n- Parsers, regex, recursive transformers: `gemma4:latest`. Byte-stable 22/22 across 6 runs of 2 different prompts at temp 0.2.\n\n- Tests, fixtures, anything needing Python module/runtime semantics: `qwen2.5-coder:14b`. Stable 20-22/22, the only model that handled the test-scaffolding trap correctly.\n\n- Mini tier (laptop, 4 GB VRAM): `qwen3.5:4b` with `think:false`, sample 5x at temp 0.7, run verifier, keep passer. 3.4 GB, ~20s total.\n\n- Skip: `qwen3:14b` (stably mediocre, 16/22 mean) and `deepseek-coder-v2:16b` (stably wrong, 0/6 valid inputs same regex bug 3/3 runs).\n\nNote on MCP wrappers: if you’re routing through a community Ollama MCP server, check whether it exposes `think:false`. The one I tested doesn’t, and it timed out at 120s on a prompt that the underlying model handles in 30s via direct `/api/generate`. The wrapper’s description also misreported which model it was wrapping. Verify before relying on it.\n\n**## What surprised me**\n\nThe general-purpose model (`gemma4`) beat the dedicated coder model (`qwen2.5-coder:14b`) on every code-gen prompt that didn’t require Python runtime reasoning. The “coder” label means trained on code, not best at every code task. I went into this assuming the coder-tuned model was the safe default and I was wrong.\n\nThe single-shot ranking placed `qwen3.5:9b` at the top with a 0.985/1.0 score. Variance check showed 2 of 3 runs were broken with byte-identical output. 
If I’d shipped a routing policy off that ranking, I would have sent every parser-style task to a model that fails most of the time at temperature 0.2.\n\nLogging only the `response` field on Ollama thinking-mode calls cost me 20 minutes of GPU debugging for what looked like crashes but were actually 21 KB infinite-loop self-arguments inside `thinking`. One missing line of logging.\n\n**## Open questions**\n\n- Which model are you using for the parser/transformer slot? I want to compare against `gemma4`. Especially curious about `granite-code:3b` and `phi-4-mini` for the same prompts.\n\n- For the mini-tier slot, has anyone shipped `qwen3.5:4b` (or smaller) in a best-of-N + verifier loop in production? What’s your hit rate and N?\n\n- Is anyone seeing similar bimodal behavior on `qwen3.5:9b` at low temperature on other constrained-format prompts, or is this specific to my prompt template?\n\n**## Repro**\n\nBash wrappers, prompts, and verifiers are single-file scripts: no deps beyond `curl`, `jq`, and stdlib Python. Hardware was a 16 GB consumer GPU on WSL2. Happy to share if there’s interest.",
  "title": "Benchmark: 6 local Ollama models for code-gen delegation, with variance analysis"
}