Raw Record Source

{
  "$type": "site.standard.document",
  "bskyPostRef": {
    "cid": "bafyreifn5uwchpzns2y4gtiahx64ioqb63fn262kvil6xxwdtcwdbc763y",
    "uri": "at://did:plc:pgryn3ephfd2xgft23qokfzt/app.bsky.feed.post/3mkry2do5vym2"
  },
  "path": "/t/pure-prompt-vs-cognitive-runtime-for-pr-review-a-reproducible-case-study/175694#post_1",
  "publishedAt": "2026-05-01T09:55:42.000Z",
  "site": "https://discuss.huggingface.co",
  "tags": [
    "Beyond Prompting: Decoupling Cognition from Execution in LLM-based Agents through the ORCA Framework",
    "https://github.com/gfernandf/agent-skills/tree/master/experiments/change_approval_gate"
  ],
  "textContent": "### Motivation\n\nLLM-based code review is increasingly used in PR workflows.\nMost implementations rely on a **pure prompt approach** : a single LLM call that takes a diff and a policy description, and produces a decision.\n\nThis works well for many cases — but what happens when the decision must be:\n\n  * reproducible\n\n  * policy-grounded\n\n  * auditable\n\n\n\n\nThis post explores that question through a controlled experiment following the approach stated here Beyond Prompting: Decoupling Cognition from Execution in LLM-based Agents through the ORCA Framework\n\n* * *\n\n### Setup\n\nWe compare two approaches for automated PR/release approval:\n\n#### 1) Pure Prompt Baseline\n\nA single LLM call that receives:\n\n  * the full `change_package` (diff + metadata)\n\n  * the full `policy_profile` as structured JSON\n\n  * explicit instructions to output one of: `approve / block / escalate`\n\n\n\n\n#### 2) Cognitive Runtime (ORCA framework)\n\nA structured execution pipeline where decisions are made through:\n\n  * deterministic policy enforcement\n\n  * deterministic risk classification\n\n  * bounded LLM decision steps\n\n\n\n\nThe runtime executes a 7-step DAG:\n\n\n\n\n\n    summarize_change\n    → extract_risks\n    → classify_risk          (deterministic)\n    → apply_policy_gate      (deterministic)\n    → determine_decision     (bounded LLM branch)\n    → justify_decision       (deterministic)\n    → summarize_executive\n\n\nKey properties:\n\n  * policy is a first-class structured input\n\n  * decision space is bounded\n\n  * rule evaluation is explicit and traceable\n\n\n\n\n* * *\n\n### Experiment\n\n  * 8 change fixtures (realistic PR scenarios)\n\n  * 3 policy profiles (`fast_track`, `standard`, `strict_prod`)\n\n  * 24 total runs\n\n  * Model: `gpt-4o-mini`, temperature 0.2, seed 42\n\n\n\n\n* * *\n\n### Results\n\nApproach | Accuracy\n---|---\nPure prompt | 71%\nCognitive runtime | 79%\n\nAccuracy is not the main 
finding.\n\n#### Critical failure metric\n\nWe define a **critical false positive** as:\n\n> approving a change that should have been blocked or escalated\n\nMetric | Prompt | Runtime\n---|---|---\nCritical false positives | **5** | **0**\n\n* * *\n\n### Where the Prompt Fails\n\nThe failures are not random. They cluster around specific structural signals:\n\n#### Case 1: CVE in dependency update\n\n  * Prompt: approves (“low impact update”)\n\n  * Runtime: escalates (CVE detected → critical risk)\n\n\n\n\n#### Case 2: One-line change in core router (prod)\n\n  * Prompt: approves (“trivial typo fix”)\n\n  * Runtime: escalates (critical-path file + production target)\n\n\n\n\nIn both cases:\n\n  * the change _looks_ safe\n\n  * the prompt is influenced by narrative\n\n  * the runtime enforces structural constraints\n\n\n\n\n* * *\n\n### Why This Happens\n\nThe difference is architectural.\n\n#### Pure prompt\n\n  * policy is embedded in text\n\n  * no hard constraints\n\n  * no requirement to link decisions to rules\n\n\n\n\n#### Cognitive runtime\n\n  * policy is structured input\n\n  * deterministic checks run before decisions\n\n  * decision space is bounded\n\n  * outputs are traceable to specific rules\n\n\n\n\nEven with a “fair” prompt (same data, same model, explicit instructions), the model **interprets policy instead of enforcing it**.\n\n* * *\n\n### Key Insight\n\n> LLMs don’t fail randomly in this setting; they fail systematically at policy enforcement when used via a pure prompt approach.\n\n* * *\n\n### Limitations\n\n  * Some expected labels (especially under `fast_track`) assume stricter policy semantics\n\n  * Risk classification uses heuristic signals (e.g., CVE string matching)\n\n  * Single model and seed\n\n\n\n\n* * *\n\n### Reproducibility\n\nAll experiments are reproducible:\n\nhttps://github.com/gfernandf/agent-skills/tree/master/experiments/change_approval_gate\n\n* * *\n\n### Discussion\n\nThis suggests a broader design question:\n\n 
 * When is a prompt sufficient?\n\n  * When do we need a structured execution layer?\n\n\n\n\nFor tasks that require:\n\n  * reproducibility\n\n  * auditability\n\n  * policy enforcement\n\n\n\n\na bounded execution model may be a better abstraction than a single prompt.\n\n* * *\n\nCurious to hear how others are approaching this, especially in CI/CD or safety-critical workflows.",
  "title": "Pure Prompt vs Cognitive Runtime for PR Review: A Reproducible Case Study"
}