{
"$type": "site.standard.document",
"bskyPostRef": {
"cid": "bafyreifooblqcdf5gag5v7atgon2bmw2cfle4avvgurl5no6q5ujgo7lv4",
"uri": "at://did:plc:pgryn3ephfd2xgft23qokfzt/app.bsky.feed.post/3mkeun3qz2452"
},
"path": "/t/chini-bench-preliminary-a-deterministic-30-problem-ai-system-design-benchmark/175564#post_1",
"publishedAt": "2026-04-26T04:51:01.000Z",
"site": "https://discuss.huggingface.co",
"tags": [
"CHINI-bench",
"CHINI-bench - Chinilla",
"Methodology - CHINI-bench",
"GitHub - collapseindex/chini-bench-cli: Standalone CLI for the CHINI-bench AI system-design benchmark · GitHub",
"Leaderboard - CHINI-bench"
],
"textContent": "Helloo HF! First time posting here!\n\nSharing some early results from CHINI-bench, a small public benchmark I just finished building couple days ago. It asks LLMs to design distributed systems as graphs (components, behaviors, edges), then runs the resulting architecture through a discrete-event simulator under stress scenarios. Scoring is mechanical, no LLM-as-judge and no human in the loop. Same simulator, same math, every model.\n\nThis is preliminary: 30 problems, four frontier models, single seed per problem. The N is scoped on purpose. Scoring is fully deterministic given the canvas, so a single run per cell carries real signal. The reason to call it preliminary is the _model coverage_ (only four, all closed) and the _single Reflexion turn_ , not the problem count. Posting now mostly to broaden coverage and pressure-test the framing before making stronger claims.\n\n**The setup**\n\n * 30 problems across 5 classes (SWE backend, operations, personal, civic, adversarial)\n * Models output a `CanvasState` JSON. 
The simulator scores it on stability, delivery, cost, constraints, and design.\n * Open-source CLI: `pip install git+https://github.com/collapseindex/chini-bench-cli` runs any model end-to-end with your own API key\n * Harness is hash-pinned (`chini-bench-cli:06d0ffb42f19`) so leaderboard runs are reproducible\n\n\n\n**Single-shot results so far (4 frontier models, 30 problems each, 120 runs)**\n\n * Combined coverage: **10 of 30** problems passed by at least one model\n * The other 20 problems weren’t passed by any model in this batch\n * Roughly: best class is operations (PC2), weakest is adversarial (PC5)\n\n\n\nPer-class slices are six problems each, so treat the class-level ordering as directional rather than definitive.\n\n**Reflexion track, early observations**\n\nI added a second turn: run v1, simulator emits a redacted FeedbackPacket (no scores, just which checks failed), model writes v2, submit v2.\n\nModel | Avg v1 | Avg v2 | Δ | Passes after revision\n---|---|---|---|---\nGemini 3.1 Pro | ~73 | ~73 | 0 | 2 of 30\nGrok 4.20 | ~65 | ~68 | +3 | 1 of 30\nGPT-5.4 | ~64 | ~60 | -4 | 0 of 30\nClaude Sonnet 4.6 | ~62 | ~53 | -9 | 0 of 30\n\nA tentative read:\n\n * **Possible overshoot pattern** in Claude and GPT runs: feedback flags a failed check, the model restructures more than needed, often adds a component, and ends up tripping a count or constraint limit.\n * **Possible flat-revision pattern** in Gemini runs: starts highest, patches the exact thing the feedback flagged, preserves what worked, but doesn’t actually move the average. v2 ≈ v1, the wins and losses cancel out. Lands at the top of the table by virtue of a strong v1, not by improving.\n\n\n\nIf that pattern holds across more models, it would suggest a search-strategy gap (when to patch vs. when to rewrite) more than a reasoning gap. With four models and a single seed per problem, I’m not ready to call that a finding. 
It’s a hypothesis I’d like to stress-test against open-weights models and longer Reflexion chains.\n\nNet Reflexion v2 passes across the four models in this batch: 3 of 120.\n\n**Caveats I want to be upfront about**\n\n * Only four models, all closed-weights. The “frontier” framing is incomplete until open-weights models (Llama, Qwen, DeepSeek, Mistral) are on the board.\n * One Reflexion turn only. Multi-turn (2-3 rounds) might tell a different story.\n * Single seed per problem. The simulator is deterministic, but model sampling isn’t, so seed-level variance isn’t characterized.\n * The problem set reflects my judgement about what matters in distributed-systems design. Critique welcome on coverage and weighting.\n\n\n\n**What’s open**\n\n * All 30 problems and canonical prompts: CHINI-bench - Chinilla\n * Methodology and scoring math: Methodology - CHINI-bench\n * CLI source (PolyForm Noncommercial): GitHub - collapseindex/chini-bench-cli: Standalone CLI for the CHINI-bench AI system-design benchmark · GitHub\n * Live leaderboard with the Reflexion track split out: Leaderboard - CHINI-bench\n\n\n\n**What I’d love help with**\n\n 1. Open-source model runs (Llama, Qwen, DeepSeek, Mistral, anything you have a key or local setup for). The CLI supports Ollama for local and OpenRouter for hosted.\n 2. Pushback on the overshoot/flat-revision framing. Is there a model you’d expect to behave differently? A reading of the data I’m missing?\n 3. Reflexion variants. Do 2-3 turns close the gap, or amplify whichever mode the model started in?\n\n\n\nCome check it out! Happy to walk through any of the methodology, scoring weights, or harness details.\n\n**A quick note on submission integrity:** scoring runs server-side against the canonical problem definitions, so submitters can’t ship their own scores or modified rules. 
Reflexion submissions include the v1 canvas and the server re-scores it; if the self-reported v1 number doesn’t match what the simulator actually produces, the row gets flagged for review.\n\nPublic CLI runs carry a harness hash (**`chini-bench-cli:06d0ffb42f19`** for single-shot, **`chini-bench-reflex:42769353289d`** for Reflexion); anything else is tagged **`custom`**. Community submissions are auto-prefixed **`community:`** so no one can impersonate official model rows.\n\n * alex\n\n",
"title": "CHINI-bench (preliminary): a deterministic 30-problem AI system-design benchmark"
}