Pure Prompt vs Cognitive Runtime for PR Review: A Reproducible Case Study
Motivation
LLM-based code review is increasingly used in PR workflows. Most implementations rely on a pure prompt approach: a single LLM call that takes a diff and a policy description and produces a decision.
This works well for many cases — but what happens when the decision must be:
reproducible
policy-grounded
auditable
This post explores that question through a controlled experiment, following the approach described in “Beyond Prompting: Decoupling Cognition from Execution in LLM-based Agents through the ORCA Framework”.
Setup
We compare two approaches for automated PR/release approval:
1) Pure Prompt Baseline
A single LLM call that receives:
the full change_package (diff + metadata)
the full policy_profile as structured JSON
explicit instructions to output one of: approve / block / escalate
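To make the baseline concrete, here is a minimal sketch of what such a single-call setup could look like. The function names, the prompt wording, and the `llm` callable are assumptions for illustration, not the code used in the experiment.

```python
import json
from typing import Callable

ALLOWED = ("approve", "block", "escalate")


def build_prompt(change_package: dict, policy_profile: dict) -> str:
    """Put the policy and the change into one text prompt (hypothetical wording)."""
    return (
        "You are a release gatekeeper.\n"
        "Policy profile (JSON):\n"
        + json.dumps(policy_profile, indent=2, sort_keys=True)
        + "\nChange package (JSON):\n"
        + json.dumps(change_package, indent=2, sort_keys=True)
        + "\nAnswer with exactly one word: " + ", ".join(ALLOWED) + "."
    )


def pure_prompt_decision(
    change_package: dict,
    policy_profile: dict,
    llm: Callable[[str], str],  # assumption: any function mapping prompt -> text
) -> str:
    """One model call; the answer is normalized but otherwise taken at face value."""
    raw = llm(build_prompt(change_package, policy_profile))
    return raw.strip().lower()
```

Note that nothing in this sketch checks the answer against the policy: the policy only exists as text inside the prompt.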
2) Cognitive Runtime (ORCA framework)
A structured execution pipeline where decisions are made through:
deterministic policy enforcement
deterministic risk classification
bounded LLM decision steps
The runtime executes a 7-step DAG:
summarize_change
→ extract_risks
→ classify_risk (deterministic)
→ apply_policy_gate (deterministic)
→ determine_decision (bounded LLM branch)
→ justify_decision (deterministic)
→ summarize_executive
Key properties:
policy is a first-class structured input
decision space is bounded
rule evaluation is explicit and traceable
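The pipeline above can be sketched as an ordered list of named steps that share a state dictionary and leave a trace. This is a simplified illustration, not the ORCA implementation: `run_dag`, the state keys, and all step bodies are hypothetical stubs, and the DAG is reduced to a linear chain.

```python
from typing import Any, Callable, Dict, List, Tuple

State = Dict[str, Any]
Step = Callable[[State], State]


def run_dag(steps: List[Tuple[str, Step]], initial: State) -> State:
    """Run steps in order; record which step produced which state keys."""
    state: State = dict(initial)
    trace: List[dict] = []
    for name, step in steps:
        produced = step(state)
        state.update(produced)
        trace.append({"step": name, "produced": sorted(produced)})
    state["trace"] = trace
    return state


# Stub steps (assumptions). In a real runtime, the summarize/extract/determine
# steps would call an LLM; the others are plain deterministic code.
def summarize_change(s):
    return {"summary": f"{len(s['change_package']['files'])} file(s) changed"}

def extract_risks(s):
    return {"risks": ["critical_path_in_production"]}

def classify_risk(s):
    return {"risk_level": "critical" if s["risks"] else "low"}

def apply_policy_gate(s):
    if s["risk_level"] == "critical":
        return {"allowed_decisions": ["block", "escalate"]}
    return {"allowed_decisions": ["approve", "block", "escalate"]}

def determine_decision(s):
    # Stub: a real step would ask the LLM to pick from allowed_decisions only.
    return {"decision": s["allowed_decisions"][-1]}

def justify_decision(s):
    return {"justification": f"risk_level={s['risk_level']} -> {s['decision']}"}

def summarize_executive(s):
    return {"executive_summary": f"{s['summary']}; decision: {s['decision']}"}


STEPS = [
    ("summarize_change", summarize_change),
    ("extract_risks", extract_risks),
    ("classify_risk", classify_risk),
    ("apply_policy_gate", apply_policy_gate),
    ("determine_decision", determine_decision),
    ("justify_decision", justify_decision),
    ("summarize_executive", summarize_executive),
]

result = run_dag(
    STEPS,
    {"change_package": {"files": ["src/core/router.py"]},
     "policy_profile": {"name": "strict_prod"}},
)
print(result["decision"])  # escalate
```

The point of the structure is that the decision step only ever sees the options the gate left open, and the trace shows which step contributed what.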
Experiment
8 change fixtures (realistic PR scenarios)
3 policy profiles (fast_track, standard, strict_prod)
24 total runs (8 fixtures × 3 profiles)
Model: gpt-4o-mini, temperature 0.2, seed 42
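As a rough sketch, the run matrix is just the cross product of fixtures and profiles with a fixed model configuration. The fixture IDs below are placeholders, not the names used in the repository.

```python
from itertools import product

FIXTURES = [f"fixture_{i:02d}" for i in range(1, 9)]  # placeholder IDs (assumption)
PROFILES = ["fast_track", "standard", "strict_prod"]
MODEL_CONFIG = {"model": "gpt-4o-mini", "temperature": 0.2, "seed": 42}


def build_runs() -> list:
    """One run per (fixture, profile) pair, all sharing the same model settings."""
    return [
        {"fixture": fixture, "profile": profile, **MODEL_CONFIG}
        for fixture, profile in product(FIXTURES, PROFILES)
    ]


runs = build_runs()
print(len(runs))  # 24
```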
Results
| Approach | Accuracy |
|---|---|
| Pure prompt | 71% |
| Cognitive runtime | 79% |
Accuracy is not the main finding.
Critical failure metric
We define a critical false positive as:
approving a change that should have been blocked or escalated
| Metric | Prompt | Runtime |
|---|---|---|
| Critical false positives | 5 | 0 |
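Both metrics are easy to compute from (expected, predicted) pairs. The sketch below uses a four-row toy list that is purely illustrative and is not the experiment's data; the function names are assumptions.

```python
from typing import List, Tuple

Pair = Tuple[str, str]  # (expected, predicted)


def accuracy(pairs: List[Pair]) -> float:
    """Fraction of runs where the predicted decision matches the expected one."""
    return sum(expected == predicted for expected, predicted in pairs) / len(pairs)


def critical_false_positives(pairs: List[Pair]) -> int:
    """Count approvals of changes that should have been blocked or escalated."""
    return sum(
        1
        for expected, predicted in pairs
        if predicted == "approve" and expected in ("block", "escalate")
    )


# Toy data for illustration only.
toy = [
    ("approve", "approve"),    # correct
    ("block", "approve"),      # wrong AND critical
    ("escalate", "block"),     # wrong, but not critical
    ("escalate", "escalate"),  # correct
]
print(accuracy(toy))                  # 0.5
print(critical_false_positives(toy))  # 1
```

The toy list shows why the two numbers can diverge: an escalate that comes back as block costs accuracy, but only an unwarranted approve counts as critical.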
Where the Prompt Fails
The failures are not random. They cluster around specific structural signals:
Case 1 — CVE in dependency update
Prompt: approves (“low impact update”)
Runtime: escalates (CVE detected → critical risk)
Case 2 — One-line change in core router (prod)
Prompt: approves (“trivial typo fix”)
Runtime: escalates (critical-path file + production target)
In both cases:
the change looks safe
the prompt is influenced by narrative
the runtime enforces structural constraints
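The two cases can be illustrated with a small deterministic signal extractor. This is a sketch under assumptions: the field names, the critical-path prefix, and the signal labels are hypothetical, and the CVE check is plain string matching on the usual CVE-YYYY-NNNN identifier shape.

```python
import re
from typing import List

CVE_PATTERN = re.compile(r"CVE-\d{4}-\d{4,}")
CRITICAL_PATHS = ("src/core/router",)  # hypothetical critical-path prefix


def structural_signals(change: dict) -> List[str]:
    """Return signals derived from structure, ignoring how the change is described."""
    signals = []
    text = change.get("diff", "") + " " + change.get("description", "")
    if CVE_PATTERN.search(text):
        signals.append("cve_detected")
    touches_critical = any(
        path.startswith(CRITICAL_PATHS) for path in change.get("files", [])
    )
    if touches_critical and change.get("target") == "production":
        signals.append("critical_path_in_production")
    return signals


dependency_update = {
    "description": "low impact update",
    "diff": "bump libfoo; advisory CVE-0000-0000",  # placeholder ID, not a real CVE
    "files": ["requirements.txt"],
    "target": "staging",
}
typo_fix = {
    "description": "trivial typo fix",
    "diff": "-retrun x\n+return x",
    "files": ["src/core/router.py"],
    "target": "production",
}
print(structural_signals(dependency_update))  # ['cve_detected']
print(structural_signals(typo_fix))           # ['critical_path_in_production']
```

The reassuring description never enters the check, which is how a change that reads as safe can still be flagged.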
Why This Happens
The difference is architectural.
Pure prompt
policy is embedded in text
no hard constraints
no requirement to link decisions to rules
Cognitive runtime
policy is structured input
deterministic checks run before decisions
decision space is bounded
outputs are traceable to specific rules
Even with a “fair” prompt (same data, same model, explicit instructions),
the model interprets policy instead of enforcing it.
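One way to picture enforcement instead of interpretation: rules run first and remove options, and the model's answer is only accepted if it is still in the remaining set. The `Rule` type, the rule ID, and the escalate fallback below are assumptions for illustration, not the framework's API.

```python
from dataclasses import dataclass
from typing import Callable, FrozenSet, List, Set, Tuple

DECISIONS = ("approve", "block", "escalate")


@dataclass(frozen=True)
class Rule:
    rule_id: str
    applies: Callable[[dict], bool]
    forbids: FrozenSet[str]


def apply_policy_gate(facts: dict, rules: List[Rule]) -> Tuple[Set[str], List[str]]:
    """Deterministically shrink the decision space; return it with the fired rule IDs."""
    allowed = set(DECISIONS)
    fired = []
    for rule in rules:
        if rule.applies(facts):
            allowed -= rule.forbids
            fired.append(rule.rule_id)
    return allowed, fired


def bounded_decision(proposal: str, allowed: Set[str]) -> str:
    """Accept the model's proposal only if the gate still allows it."""
    choice = proposal.strip().lower()
    if choice in allowed:
        return choice
    # Assumed fallback: prefer escalation, else the strictest option.
    return "escalate" if "escalate" in allowed else "block"


RULES = [
    Rule(
        rule_id="R-critical-no-approve",  # hypothetical rule
        applies=lambda facts: facts.get("risk_level") == "critical",
        forbids=frozenset({"approve"}),
    ),
]

allowed, fired = apply_policy_gate({"risk_level": "critical"}, RULES)
print(sorted(allowed))                        # ['block', 'escalate']
print(fired)                                  # ['R-critical-no-approve']
print(bounded_decision("Approve", allowed))   # escalate
```

The list of fired rule IDs is what makes the outcome traceable: every removed option points back to a specific rule.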
Key Insight
LLMs don’t fail randomly in this setting — they fail systematically at policy enforcement when used via a pure prompt approach.
Limitations
Some expected labels (especially under fast_track) assume stricter policy semantics
Risk classification uses heuristic signals (e.g., CVE string matching)
Single model and seed
Reproducibility
All experiments are reproducible:
https://github.com/gfernandf/agent-skills/tree/master/experiments/change_approval_gate
Discussion
This suggests a broader design question:
When is a prompt sufficient?
When do we need a structured execution layer?
For tasks that require:
reproducibility
auditability
policy enforcement
a bounded execution model may be a better abstraction than a single prompt.
Curious to hear how others are approaching this —
especially in CI/CD or safety-critical workflows.