Raw Record Source

{
  "$type": "site.standard.document",
  "bskyPostRef": {
    "cid": "bafyreifd3ls2qxdz5vshda3d5fl4f3tupqowqqmbtc6dxpf4ireiu3pfhu",
    "uri": "at://did:plc:25rdn5elo5izoxrmtis34zuk/app.bsky.feed.post/3mogqyoq2mqh2"
  },
  "coverImage": {
    "$type": "blob",
    "ref": {
      "$link": "bafkreid4o5oyfivlmd46hmpwgqe45mljlyj7i3gxchupuvpl2dmronra6e"
    },
    "mimeType": "image/webp",
    "size": 55948
  },
  "path": "/shashank_ms_6a35baa4be138/comparing-llm-models-a-technical-deep-dive-dhp",
  "publishedAt": "2026-06-16T21:35:26.000Z",
  "site": "https://dev.to",
  "tags": [
    "engineering",
    "oxlo",
    "ai",
    "https://portal.oxlo.ai"
  ],
  "textContent": "I needed a fast, repeatable way to compare production-grade open models before routing traffic to them. In this post, I will walk through a lightweight Python harness that sends identical prompts to four different Oxlo.ai models, times each response, and scores the outputs with a judge model so you can pick the right one for your workload.\n\n## What you'll need\n\n  * An Oxlo.ai API key from https://portal.oxlo.ai\n  * Python 3.10 or newer\n  * The OpenAI SDK: `pip install openai`\n\n\n\n## Step 1: Set up the Oxlo.ai client and model roster\n\nWe start by initializing the client and defining the models we want to test. I picked a mix of generalist, reasoning, and multilingual models that Oxlo.ai hosts.\n\n\n    from openai import OpenAI\n    import os\n\n    client = OpenAI(\n        base_url=\"https://api.oxlo.ai/v1\",\n        api_key=os.environ.get(\"OXLO_API_KEY\")\n    )\n\n    CANDIDATE_MODELS = [\n        \"llama-3.3-70b\",\n        \"qwen-3-32b\",\n        \"kimi-k2.6\",\n        \"deepseek-v3.2\",\n    ]\n\n    TEST_PROMPT = (\n        \"Write a Python function that accepts a list of integers and returns \"\n        \"the longest strictly increasing subsequence. Include type hints, \"\n        \"a docstring, and a simple test case in the same code block.\"\n    )\n\n## Step 2: Define the judge system prompt\n\nBefore we fire requests, we need a consistent rubric. I use a separate system prompt for the judge model so scoring stays objective across runs.\n\n\n    JUDGE_SYSTEM_PROMPT = \"\"\"You are an expert code reviewer. You will receive a user request and a candidate response. Score the response on three axes from 1 to 5:\n    1. Correctness: does the code solve the problem and pass the included test?\n    2. Clarity: are the docstring, types, and variable names clear?\n    3. 
Conciseness: is the solution free of unnecessary bloat?\n\n    Return ONLY a JSON object with keys: model, correctness, clarity, conciseness, total_score, and one_sentence_verdict.\n    \"\"\"\n\n## Step 3: Dispatch prompts concurrently\n\nWaiting for four sequential API calls is slow. I use a thread pool to hit all candidate models at once and record wall-clock latency for each.\n\n\n    import time\n    import concurrent.futures\n\n    def query_model(model_id: str, prompt: str) -> dict:\n        start = time.perf_counter()\n        response = client.chat.completions.create(\n            model=model_id,\n            messages=[\n                {\"role\": \"system\", \"content\": \"You are a helpful coding assistant.\"},\n                {\"role\": \"user\", \"content\": prompt},\n            ],\n            temperature=0.2,\n        )\n        elapsed = time.perf_counter() - start\n        return {\n            \"model\": model_id,\n            \"text\": response.choices[0].message.content,\n            \"latency_sec\": round(elapsed, 2),\n        }\n\n    def run_benchmark(prompt: str):\n        results = []\n        with concurrent.futures.ThreadPoolExecutor(max_workers=4) as executor:\n            futures = {\n                executor.submit(query_model, m, prompt): m\n                for m in CANDIDATE_MODELS\n            }\n            for future in concurrent.futures.as_completed(futures):\n                results.append(future.result())\n        return results\n\n## Step 4: Score outputs with a judge model\n\nNow we feed each candidate response into a judge. 
I use llama-3.3-70b as the judge because it gives stable JSON formatting.\n\n\n    import json\n\n    def judge_response(candidate: dict, original_prompt: str) -> dict:\n        judge_input = (\n            f\"User request:\\n{original_prompt}\\n\\n\"\n            f\"Candidate response from {candidate['model']}:\\n{candidate['text']}\\n\\n\"\n            \"Score the response and return the JSON object.\"\n        )\n        response = client.chat.completions.create(\n            model=\"llama-3.3-70b\",\n            messages=[\n                {\"role\": \"system\", \"content\": JUDGE_SYSTEM_PROMPT},\n                {\"role\": \"user\", \"content\": judge_input},\n            ],\n            temperature=0.1,\n        )\n        raw = response.choices[0].message.content.strip()\n        # Strip a Markdown code fence if the judge wrapped its JSON in one.\n        if raw.startswith(\"```\"):\n            raw = raw.split(\"```\")[1].removeprefix(\"json\").strip()\n        scores = json.loads(raw)\n        return {**candidate, **scores}\n\n    def score_all(results: list, prompt: str):\n        return [judge_response(r, prompt) for r in results]\n\n## Step 5: Render the comparison report\n\nFinally, we print a markdown table so the differences are obvious at a glance.\n\n\n    def print_report(scored_results: list):\n        print(\"| Model | Latency (s) | Correctness | Clarity | Conciseness | Total | Verdict |\")\n        print(\"|-------|-------------|-------------|---------|-------------|-------|---------|\")\n        for r in scored_results:\n            print(\n                f\"| {r['model']} | {r['latency_sec']} | \"\n                f\"{r['correctness']} | {r['clarity']} | {r['conciseness']} | \"\n                f\"{r['total_score']} | {r['one_sentence_verdict']} |\"\n            )\n\n    if __name__ == \"__main__\":\n        print(\"Running benchmark...\")\n        raw_results = run_benchmark(TEST_PROMPT)\n        scored = score_all(raw_results, TEST_PROMPT)\n        scored.sort(key=lambda x: x[\"total_score\"], 
reverse=True)\n        print_report(scored)\n\n## Run it\n\nSave the script as `benchmark.py`, export your key, and run it.\n\n\n    export OXLO_API_KEY=\"your-key-here\"\n    python benchmark.py\n\nExample output (values will vary by run):\n\n\n    Running benchmark...\n    | Model | Latency (s) | Correctness | Clarity | Conciseness | Total | Verdict |\n    |-------|-------------|-------------|---------|-------------|-------|---------|\n    | deepseek-v3.2 | 4.2 | 5 | 5 | 4 | 14 | Produces correct LIS with clean type hints and a valid doctest. |\n    | kimi-k2.6 | 3.8 | 5 | 4 | 4 | 13 | Correct solution but slightly verbose docstring. |\n    | qwen-3-32b | 2.1 | 4 | 4 | 5 | 13 | Correct logic, omits explicit test case in the block. |\n    | llama-3.3-70b | 1.9 | 4 | 5 | 4 | 13 | Good structure, test case is present but uses print instead of assert. |\n\n## Wrap-up and next steps\n\nSwap the static prompt for a JSONL test suite so you can regression-test model behavior on every deploy. You can also add a lightweight Streamlit frontend so non-engineers can run comparisons and vote on their preferred output.",
  "title": "Comparing LLM Models: A Technical Deep Dive"
}