{
"$type": "site.standard.document",
"bskyPostRef": {
"cid": "bafyreifn5uwchpzns2y4gtiahx64ioqb63fn262kvil6xxwdtcwdbc763y",
"uri": "at://did:plc:pgryn3ephfd2xgft23qokfzt/app.bsky.feed.post/3mkry2do5vym2"
},
"path": "/t/pure-prompt-vs-cognitive-runtime-for-pr-review-a-reproducible-case-study/175694#post_1",
"publishedAt": "2026-05-01T09:55:42.000Z",
"site": "https://discuss.huggingface.co",
"tags": [
"Beyond Prompting: Decoupling Cognition from Execution in LLM-based Agents through the ORCA Framework",
"https://github.com/gfernandf/agent-skills/tree/master/experiments/change_approval_gate"
],
"textContent": "### Motivation\n\nLLM-based code review is increasingly used in PR workflows.\nMost implementations rely on a **pure prompt approach**: a single LLM call that takes a diff and a policy description and produces a decision.\n\nThis works well for many cases, but what happens when the decision must be:\n\n * reproducible\n\n * policy-grounded\n\n * auditable\n\nThis post explores that question through a controlled experiment, following the approach described in Beyond Prompting: Decoupling Cognition from Execution in LLM-based Agents through the ORCA Framework.\n\n* * *\n\n### Setup\n\nWe compare two approaches for automated PR/release approval:\n\n#### 1) Pure Prompt Baseline\n\nA single LLM call that receives:\n\n * the full `change_package` (diff + metadata)\n\n * the full `policy_profile` as structured JSON\n\n * explicit instructions to output one of: `approve / block / escalate`\n\n#### 2) Cognitive Runtime (ORCA framework)\n\nA structured execution pipeline where decisions are made through:\n\n * deterministic policy enforcement\n\n * deterministic risk classification\n\n * bounded LLM decision steps\n\nThe runtime executes a 7-step DAG:\n\n summarize_change\n → extract_risks\n → classify_risk (deterministic)\n → apply_policy_gate (deterministic)\n → determine_decision (bounded LLM branch)\n → justify_decision (deterministic)\n → summarize_executive\n\nKey properties:\n\n * policy is a first-class structured input\n\n * decision space is bounded\n\n * rule evaluation is explicit and traceable\n\n* * *\n\n### Experiment\n\n * 8 change fixtures (realistic PR scenarios)\n\n * 3 policy profiles (`fast_track`, `standard`, `strict_prod`)\n\n * 24 total runs (every fixture under every profile)\n\n * Model: `gpt-4o-mini`, temperature 0.2, seed 42\n\n* * *\n\n### Results\n\nApproach | Accuracy\n---|---\nPure prompt | 71%\nCognitive runtime | 79%\n\nAccuracy is not the main finding.\n\n#### Critical failure metric\n\nWe define a **critical false positive** as:\n\n> approving a change that should have been blocked or escalated\n\nMetric | Prompt | Runtime\n---|---|---\nCritical false positives | **5** | **0**\n\n* * *\n\n### Where the Prompt Fails\n\nThe failures are not random. They cluster around specific structural signals:\n\n#### Case 1: CVE in dependency update\n\n * Prompt: approves (“low impact update”)\n\n * Runtime: escalates (CVE detected → critical risk)\n\n#### Case 2: One-line change in core router (prod)\n\n * Prompt: approves (“trivial typo fix”)\n\n * Runtime: escalates (critical-path file + production target)\n\nIn both cases:\n\n * the change _looks_ safe\n\n * the prompt is influenced by narrative\n\n * the runtime enforces structural constraints\n\n* * *\n\n### Why This Happens\n\nThe difference is architectural.\n\n#### Pure prompt\n\n * policy is embedded in text\n\n * no hard constraints\n\n * no requirement to link decisions to rules\n\n#### Cognitive runtime\n\n * policy is structured input\n\n * deterministic checks run before decisions\n\n * decision space is bounded\n\n * outputs are traceable to specific rules\n\nEven with a “fair” prompt (same data, same model, explicit instructions), the model **interprets policy instead of enforcing it**.\n\n* * *\n\n### Key Insight\n\n> LLMs don’t fail randomly in this setting: they fail systematically at policy enforcement when used via a pure prompt approach.\n\n* * *\n\n### Limitations\n\n * Some expected labels (especially under `fast_track`) assume stricter policy semantics\n\n * Risk classification uses heuristic signals (e.g., CVE string matching)\n\n * Single model and seed\n\n* * *\n\n### Reproducibility\n\nAll experiments are reproducible:\n\nhttps://github.com/gfernandf/agent-skills/tree/master/experiments/change_approval_gate\n\n* * *\n\n### Discussion\n\nThis suggests a broader design question:\n\n * When is a prompt sufficient?\n\n * When do we need a structured execution layer?\n\nFor tasks that require:\n\n * reproducibility\n\n * auditability\n\n * policy enforcement\n\na bounded execution model may be a better abstraction than a single prompt.\n\n* * *\n\nCurious to hear how others are approaching this, especially in CI/CD or safety-critical workflows.",
"title": "Pure Prompt vs Cognitive Runtime for PR Review: A Reproducible Case Study"
}