CHINI-bench (preliminary): a deterministic 30-problem AI system-design benchmark

Hugging Face Forums [Unofficial] April 26, 2026

Hello HF! First time posting here!

Sharing some early results from CHINI-bench, a small public benchmark I finished building a couple of days ago. It asks LLMs to design distributed systems as graphs (components, behaviors, edges), then runs the resulting architecture through a discrete-event simulator under stress scenarios. Scoring is mechanical: no LLM-as-judge and no human in the loop. Same simulator, same math, every model.

This is preliminary: 30 problems, four frontier models, single seed per problem. The small N is deliberate. Scoring is fully deterministic given the canvas, so a single run per cell carries real signal. The reason to call it preliminary is the model coverage (only four, all closed) and the single Reflexion turn, not the problem count. Posting now mostly to broaden coverage and pressure-test the framing before making stronger claims.

The setup

  • 30 problems across 5 classes (SWE backend, operations, personal, civic, adversarial)
  • Models output a CanvasState JSON. The simulator scores it on stability, delivery, cost, constraints, and design.
  • Open-source CLI: pip install git+https://github.com/collapseindex/chini-bench-cli runs any model end-to-end with your own API key
  • Harness is hash-pinned (chini-bench-cli:06d0ffb42f19) so leaderboard runs are reproducible
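To make the "design as a graph" idea concrete, here is a minimal sketch of what a canvas could look like and a basic structural check on it. The field names (`components`, `behaviors`, `edges`, `source`, `target`) and the `dangling_edges` helper are illustrative assumptions, not the actual CanvasState schema; the real format is defined by the benchmark's methodology page and CLI.

```python
import json

# Hypothetical canvas. Field names are assumptions for illustration only,
# not the real CHINI-bench CanvasState schema.
canvas = {
    "components": [
        {"id": "gateway", "behaviors": ["rate_limit"]},
        {"id": "queue", "behaviors": ["buffer"]},
        {"id": "worker", "behaviors": ["retry"]},
    ],
    "edges": [
        {"source": "gateway", "target": "queue"},
        {"source": "queue", "target": "worker"},
    ],
}


def dangling_edges(canvas):
    """Return edges whose endpoints are not declared as components."""
    ids = {c["id"] for c in canvas["components"]}
    return [
        e for e in canvas["edges"]
        if e["source"] not in ids or e["target"] not in ids
    ]


print(json.dumps(canvas, indent=2))
print("dangling edges:", dangling_edges(canvas))
```

A check like this is only a pre-flight sanity test; the actual scoring on stability, delivery, cost, constraints, and design happens in the simulator.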

Single-shot results so far (4 frontier models, 30 problems each, 120 runs)

  • Combined coverage: 10 of 30 problems passed by at least one model
  • A handful of problems weren’t passed by anyone in this batch
  • Roughly: best class is operations (PC2), weakest is adversarial (PC5)

Per-class slices are six problems each, so treat the class-level ordering as directional rather than definitive.

Reflexion track, early observations

I added a second turn: run v1, simulator emits a redacted FeedbackPacket (no scores, just which checks failed), model writes v2, submit v2.
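The two-turn flow can be sketched roughly as below. Everything here is a stand-in: `redact_feedback`, `reflexion_turn`, the stub model, the stub simulator, and the shape of the result dict are assumptions made for illustration, not the harness's real code. The point is only that the packet carries the names of failed checks and nothing numeric.

```python
def redact_feedback(sim_result):
    """Keep only the names of failed checks; drop every numeric score."""
    failed = sorted(
        name for name, passed in sim_result["checks"].items() if not passed
    )
    return {"failed_checks": failed}


def reflexion_turn(problem, generate, simulate):
    """Run v1, build the redacted packet, and ask the model for v2."""
    v1 = generate(problem, None)
    packet = redact_feedback(simulate(problem, v1))
    v2 = generate(problem, packet)
    return v1, packet, v2


# Stub model and simulator (hypothetical) so the sketch runs on its own.
def fake_generate(problem, feedback):
    canvas = {"components": ["api", "db"]}
    if feedback and "delivery" in feedback["failed_checks"]:
        canvas["components"].append("queue")  # patch only what was flagged
    return canvas


def fake_simulate(problem, canvas):
    has_queue = "queue" in canvas["components"]
    return {
        "score": 70 if has_queue else 55,
        "checks": {"stability": True, "delivery": has_queue},
    }


v1, packet, v2 = reflexion_turn("demo-problem", fake_generate, fake_simulate)
print(packet)
print(v2)
```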

Model             | Avg v1 | Avg v2 | Δ  | Passes after revision
Gemini 3.1 Pro    | ~73    | ~73    | 0  | 2 of 30
Grok 4.20         | ~65    | ~68    | +3 | 1 of 30
GPT-5.4           | ~64    | ~60    | -4 | 0 of 30
Claude Sonnet 4.6 | ~62    | ~53    | -9 | 0 of 30

A tentative read:

  • Possible overshoot pattern in Claude and GPT runs: feedback flags a failed check, the model restructures more than needed, often adds a component, and ends up tripping a count or constraint limit.
  • Possible flat-revision pattern in Gemini runs: starts highest, patches the exact thing the feedback flagged, preserves what worked, but doesn’t actually move the average. v2 ≈ v1, the wins and losses cancel out. Lands at the top of the table by virtue of a strong v1, not by improving.

If that pattern holds across more models, it would suggest a search-strategy gap (when to patch vs. when to rewrite) more than a reasoning gap. With four models and a single seed per problem, I’m not ready to call that a finding. It’s a hypothesis I’d like to stress-test against open-weights models and longer Reflexion chains.

Net Reflexion v2 passes across the four models in this batch: 3 of 120.
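For anyone who wants to check the arithmetic, the deltas and the 3-of-120 total follow directly from the table. The averages are the rounded (~) values quoted above, so the deltas are approximate too; the variable names are just for this sketch.

```python
# (model, avg v1, avg v2, passes after revision), rounded values from the table.
rows = [
    ("Gemini 3.1 Pro", 73, 73, 2),
    ("Grok 4.20", 65, 68, 1),
    ("GPT-5.4", 64, 60, 0),
    ("Claude Sonnet 4.6", 62, 53, 0),
]

PROBLEMS_PER_MODEL = 30

# Delta is simply v2 minus v1 for each model.
deltas = {name: v2 - v1 for name, v1, v2, _ in rows}

# Net passes and total runs across all four models.
total_passes = sum(passes for *_, passes in rows)
total_runs = PROBLEMS_PER_MODEL * len(rows)

for name, delta in deltas.items():
    print(f"{name}: {delta:+d}")
print(f"Reflexion v2 passes: {total_passes} of {total_runs}")
```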

Caveats I want to be upfront about

  • Only four models, all closed-weights. The “frontier” framing is incomplete until open-weights models (Llama, Qwen, DeepSeek, Mistral) are on the board.
  • One Reflexion turn only. Multi-turn (2-3 rounds) might tell a different story.
  • Single seed per problem. The simulator is deterministic, but model sampling isn’t, so seed-level variance isn’t characterized.
  • The problem set reflects my judgement about what matters in distributed-systems design. Critique welcome on coverage and weighting.

What’s open

  • All 30 problems and canonical prompts: CHINI-bench - Chinilla
  • Methodology and scoring math: Methodology - CHINI-bench
  • CLI source (PolyForm Noncommercial): https://github.com/collapseindex/chini-bench-cli
  • Live leaderboard with the Reflexion track split out: Leaderboard - CHINI-bench

What I’d love help with

  1. Open-weights model runs (Llama, Qwen, DeepSeek, Mistral, anything you have a key or local setup for). The CLI supports Ollama for local and OpenRouter for hosted.
  2. Pushback on the overshoot/flat-revision framing. Is there a model you’d expect to behave differently? A reading of the data I’m missing?
  3. Reflexion variants. Do 2-3 turns close the gap, or amplify whichever mode the model started in?

Come check it out! Happy to walk through any of the methodology, scoring weights, or harness details.

A quick note on submission integrity: scoring runs server-side against the canonical problem definitions, so submitters can’t ship their own scores or modified rules. Reflexion submissions include the v1 canvas and the server re-scores it; if the self-reported v1 number doesn’t match what the simulator actually produces, the row gets flagged for review.

Public CLI runs carry a harness hash (chini-bench-cli:06d0ffb42f19 for single-shot, chini-bench-reflex:42769353289d for Reflexion); anything else is tagged custom. Community submissions are auto-prefixed community: so no one can impersonate official model rows.
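A rough sketch of how those three rules fit together, for readers who think better in code. The two harness hashes and the community: prefix come from the text above; the function names, the track labels, and the zero-tolerance comparison are assumptions for illustration and not the server's actual implementation.

```python
# Harness hashes quoted in the post; the track labels are illustrative.
OFFICIAL_HARNESSES = {
    "chini-bench-cli:06d0ffb42f19": "single-shot",
    "chini-bench-reflex:42769353289d": "reflexion",
}


def harness_tag(harness_hash):
    """Known public harness hashes map to a track; anything else is custom."""
    return OFFICIAL_HARNESSES.get(harness_hash, "custom")


def display_name(model, is_community):
    """Community rows always get the prefix so they cannot mimic official rows."""
    if not is_community:
        return model
    return model if model.startswith("community:") else f"community:{model}"


def needs_review(reported_v1, rescored_v1, tolerance=0.0):
    """Flag a row when the self-reported v1 differs from the server's re-score.

    The tolerance parameter is an assumption; the post only says a
    mismatch gets flagged.
    """
    return abs(reported_v1 - rescored_v1) > tolerance


print(harness_tag("chini-bench-cli:06d0ffb42f19"))
print(harness_tag("my-fork:deadbeef"))
print(display_name("some-model", is_community=True))
print(needs_review(reported_v1=71.0, rescored_v1=64.5))
```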

  • alex
