CHINI-bench (preliminary): a deterministic 30-problem AI system-design benchmark
Hello HF! First time posting here!
Sharing some early results from CHINI-bench, a small public benchmark I finished building a couple of days ago. It asks LLMs to design distributed systems as graphs (components, behaviors, edges), then runs the resulting architecture through a discrete-event simulator under stress scenarios. Scoring is mechanical: no LLM-as-judge and no human in the loop. Same simulator, same math, every model.
This is preliminary: 30 problems, four frontier models, single seed per problem. The N is scoped on purpose. Scoring is fully deterministic given the canvas, so a single run per cell carries real signal. The reason to call it preliminary is the model coverage (only four, all closed) and the single Reflexion turn, not the problem count. I'm posting now mostly to broaden coverage and pressure-test the framing before making stronger claims.
The setup
- 30 problems across 5 classes (SWE backend, operations, personal, civic, adversarial)
- Models output a `CanvasState` JSON. The simulator scores it on stability, delivery, cost, constraints, and design.
- Open-source CLI: `pip install git+https://github.com/collapseindex/chini-bench-cli` runs any model end-to-end with your own API key
- Harness is hash-pinned (`chini-bench-cli:06d0ffb42f19`) so leaderboard runs are reproducible
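To make the graph format concrete, here is a minimal Python sketch of what a canvas with components, behaviors, and edges could look like. The field names (`id`, `kind`, `behaviors`, `source`, `target`) and the `dangling_edges` helper are illustrative assumptions, not the actual CanvasState schema; the methodology page is the place to check the real one.

```python
import json
from dataclasses import dataclass, field, asdict

# Hypothetical shapes for illustration only; the real CanvasState schema may differ.
@dataclass
class Component:
    id: str
    kind: str                                   # e.g. "gateway", "queue", "service"
    behaviors: list[str] = field(default_factory=list)

@dataclass
class Edge:
    source: str
    target: str

@dataclass
class Canvas:
    components: list[Component]
    edges: list[Edge]

    def dangling_edges(self) -> list[Edge]:
        """Return edges whose source or target is not a known component id."""
        ids = {c.id for c in self.components}
        return [e for e in self.edges if e.source not in ids or e.target not in ids]

canvas = Canvas(
    components=[
        Component("api", "gateway", ["rate_limit"]),
        Component("jobs", "queue", ["retry"]),
        Component("worker", "service"),
    ],
    edges=[Edge("api", "jobs"), Edge("jobs", "worker")],
)

# Serialize to JSON, the form a model would hand to a simulator.
canvas_json = json.dumps(asdict(canvas), indent=2)
```

A cheap structural check like `dangling_edges` is the kind of thing worth running locally before spending an API call on a submission.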
Single-shot results so far (4 frontier models, 30 problems each, 120 runs)
- Combined coverage: 10 of 30 problems passed by at least one model
- The other 20 problems weren’t passed by any model in this batch
- Roughly: best class is operations (PC2), weakest is adversarial (PC5)
Per-class slices are six problems each, so treat the class-level ordering as directional rather than definitive.
Reflexion track, early observations
I added a second turn: run v1, simulator emits a redacted FeedbackPacket (no scores, just which checks failed), model writes v2, submit v2.
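The two-turn flow can be sketched roughly as follows. Everything here is a hypothetical stand-in (`SimResult`, `redact`, `reflexion_turn`, and the toy model and simulator); it is not the chini-bench-cli API, only an illustration of the "failed checks only, no scores" feedback shape.

```python
from dataclasses import dataclass
from typing import Callable, Optional

# Hypothetical names throughout; not the actual harness interface.
@dataclass
class SimResult:
    score: float
    failed_checks: list[str]

def redact(result: SimResult) -> dict:
    """Build the feedback the model sees: which checks failed, never the score."""
    return {"failed_checks": list(result.failed_checks)}

def reflexion_turn(
    design: Callable[[str, Optional[dict]], dict],
    simulate: Callable[[dict], SimResult],
    prompt: str,
) -> tuple[SimResult, SimResult]:
    v1 = design(prompt, None)          # first attempt, no feedback
    r1 = simulate(v1)
    v2 = design(prompt, redact(r1))    # revision sees only the failed checks
    r2 = simulate(v2)
    return r1, r2

# Toy stand-ins so the loop runs end to end.
def toy_design(prompt: str, feedback: Optional[dict]) -> dict:
    canvas = {"components": ["api", "worker"]}
    if feedback and "no_queue" in feedback["failed_checks"]:
        canvas["components"].append("queue")
    return canvas

def toy_simulate(canvas: dict) -> SimResult:
    failed = [] if "queue" in canvas["components"] else ["no_queue"]
    return SimResult(score=100.0 - 40.0 * len(failed), failed_checks=failed)

r1, r2 = reflexion_turn(toy_design, toy_simulate, "design a job pipeline")
```

The toy model here patches exactly what was flagged; a real model is free to restructure the whole canvas in v2.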
| Model | Avg v1 | Avg v2 | Δ | Passes after revision |
|---|---|---|---|---|
| Gemini 3.1 Pro | ~73 | ~73 | 0 | 2 of 30 |
| Grok 4.20 | ~65 | ~68 | +3 | 1 of 30 |
| GPT-5.4 | ~64 | ~60 | -4 | 0 of 30 |
| Claude Sonnet 4.6 | ~62 | ~53 | -9 | 0 of 30 |
A tentative read:
- Possible overshoot pattern in Claude and GPT runs: feedback flags a failed check, the model restructures more than needed, often adds a component, and ends up tripping a count or constraint limit.
- Possible flat-revision pattern in Gemini runs: starts highest, patches the exact thing the feedback flagged, preserves what worked, but doesn’t actually move the average. v2 ≈ v1, the wins and losses cancel out. Lands at the top of the table by virtue of a strong v1, not by improving.
If that pattern holds across more models, it would suggest a search-strategy gap (when to patch vs. when to rewrite) more than a reasoning gap. With four models and a single seed per problem, I’m not ready to call that a finding. It’s a hypothesis I’d like to stress-test against open-weights models and longer Reflexion chains.
Net Reflexion v2 passes across the four models in this batch: 3 of 120.
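For anyone who wants to double-check the arithmetic, here is a small Python snippet that recomputes the deltas and the net pass count from the rounded numbers in the table. The values are copied from the table as-is, so the deltas inherit its rounding.

```python
# Rounded v1 average, rounded v2 average, and passes after revision, from the table.
rows = {
    "Gemini 3.1 Pro":    (73, 73, 2),
    "Grok 4.20":         (65, 68, 1),
    "GPT-5.4":           (64, 60, 0),
    "Claude Sonnet 4.6": (62, 53, 0),
}
PROBLEMS = 30

deltas = {model: v2 - v1 for model, (v1, v2, _) in rows.items()}
total_passes = sum(passes for _, _, passes in rows.values())
total_runs = PROBLEMS * len(rows)

print(deltas)
# {'Gemini 3.1 Pro': 0, 'Grok 4.20': 3, 'GPT-5.4': -4, 'Claude Sonnet 4.6': -9}
print(f"{total_passes} of {total_runs}")
# 3 of 120
```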
Caveats I want to be upfront about
- Only four models, all closed-weights. The “frontier” framing is incomplete until open-weights models (Llama, Qwen, DeepSeek, Mistral) are on the board.
- One Reflexion turn only. Multi-turn (2-3 rounds) might tell a different story.
- Single seed per problem. The simulator is deterministic, but model sampling isn’t, so seed-level variance isn’t characterized.
- The problem set reflects my judgement about what matters in distributed-systems design. Critique welcome on coverage and weighting.
What’s open
- All 30 problems and canonical prompts: CHINI-bench - Chinilla
- Methodology and scoring math: Methodology - CHINI-bench
- CLI source (PolyForm Noncommercial): https://github.com/collapseindex/chini-bench-cli
- Live leaderboard with the Reflexion track split out: Leaderboard - CHINI-bench
What I’d love help with
- Open-weights model runs (Llama, Qwen, DeepSeek, Mistral, anything you have a key or local setup for). The CLI supports Ollama for local runs and OpenRouter for hosted ones.
- Pushback on the overshoot/flat-revision framing. Is there a model you’d expect to behave differently? A reading of the data I’m missing?
- Reflexion variants. Do 2-3 turns close the gap, or amplify whichever mode the model started in?
Come check it out! Happy to walk through any of the methodology, scoring weights, or harness details.
A quick note on submission integrity: scoring runs server-side against the canonical problem definitions, so submitters can’t ship their own scores or modified rules. Reflexion submissions include the v1 canvas and the server re-scores it; if the self-reported v1 number doesn’t match what the simulator actually produces, the row gets flagged for review.
Public CLI runs carry a harness hash (`chini-bench-cli:06d0ffb42f19` for single-shot, `chini-bench-reflex:42769353289d` for Reflexion); anything else is tagged `custom`. Community submissions are auto-prefixed with `community:` so no one can impersonate official model rows.
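The integrity rules above boil down to three small checks, sketched here in Python. The function names and the tolerance value are assumptions made for illustration; the actual server-side logic is not published in this post and may differ.

```python
# Hypothetical sketch; only the two harness hashes are taken from the post.
OFFICIAL_HARNESSES = {
    "chini-bench-cli:06d0ffb42f19": "single-shot",
    "chini-bench-reflex:42769353289d": "reflexion",
}

def harness_track(harness_hash: str) -> str:
    """Map a known harness hash to its track; anything else is 'custom'."""
    return OFFICIAL_HARNESSES.get(harness_hash, "custom")

def display_name(model: str, is_community: bool) -> str:
    """Prefix community rows so they cannot pass as official model rows."""
    return f"community:{model}" if is_community else model

def needs_review(reported_v1: float, rescored_v1: float, tol: float = 0.01) -> bool:
    """Flag a Reflexion row when the self-reported v1 score disagrees with the re-score.

    The tolerance is an assumed value, not a documented one.
    """
    return abs(reported_v1 - rescored_v1) > tol
```

Because the simulator is deterministic given the canvas, a re-score of the submitted v1 should reproduce the reported number, which is what makes a mismatch worth flagging.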
- alex