External Publication

Pure Prompt vs Cognitive Runtime for PR Review: A Reproducible Case Study

Hugging Face Forums [Unofficial] May 1, 2026

Motivation

LLM-based code review is increasingly used in PR workflows. Most implementations rely on a pure prompt approach : a single LLM call that takes a diff and a policy description, and produces a decision.

This works well for many cases — but what happens when the decision must be:

reproducible
policy-grounded
auditable

This post explores that question through a controlled experiment following the approach stated here Beyond Prompting: Decoupling Cognition from Execution in LLM-based Agents through the ORCA Framework

Setup

We compare two approaches for automated PR/release approval:

1) Pure Prompt Baseline

A single LLM call that receives:

the full change_package (diff + metadata)
the full policy_profile as structured JSON
explicit instructions to output one of: approve / block / escalate

2) Cognitive Runtime (ORCA framework)

A structured execution pipeline where decisions are made through:

deterministic policy enforcement
deterministic risk classification
bounded LLM decision steps

The runtime executes a 7-step DAG:

summarize_change
→ extract_risks
→ classify_risk          (deterministic)
→ apply_policy_gate      (deterministic)
→ determine_decision     (bounded LLM branch)
→ justify_decision       (deterministic)
→ summarize_executive

Key properties:

policy is a first-class structured input
decision space is bounded
rule evaluation is explicit and traceable

Experiment

8 change fixtures (realistic PR scenarios)
3 policy profiles (fast_track, standard, strict_prod)
24 total runs
Model: gpt-4o-mini, temperature 0.2, seed 42

Results

Approach	Accuracy
Pure prompt	71%
Cognitive runtime	79%

Accuracy is not the main finding.

Critical failure metric

We define a critical false positive as:

approving a change that should have been blocked or escalated

Metric	Prompt	Runtime
Critical false positives	5	0

Where the Prompt Fails

The failures are not random. They cluster around specific structural signals:

Case 1 — CVE in dependency update

Prompt: approves (“low impact update”)
Runtime: escalates (CVE detected → critical risk)

Case 2 — One-line change in core router (prod)

Prompt: approves (“trivial typo fix”)
Runtime: escalates (critical-path file + production target)

In both cases:

the change looks safe
the prompt is influenced by narrative
the runtime enforces structural constraints

Why This Happens

The difference is architectural.

Pure prompt

policy is embedded in text
no hard constraints
no requirement to link decisions to rules

Cognitive runtime

policy is structured input
deterministic checks run before decisions
decision space is bounded
outputs are traceable to specific rules

Even with a “fair” prompt (same data, same model, explicit instructions),

the model interprets policy instead of enforcing it.

Key Insight

LLMs don’t fail randomly in this setting — they fail systematically at policy enforcement when used via a pure prompt approach.

Limitations

Some expected labels (especially under fast_track) assume stricter policy semantics
Risk classification uses heuristic signals (e.g., CVE string matching)
Single model and seed

Reproducibility

All experiments are reproducible:

https://github.com/gfernandf/agent-skills/tree/master/experiments/change_approval_gate

Discussion

This suggests a broader design question:

When is a prompt sufficient?
When do we need a structured execution layer?

For tasks that require:

reproducibility
auditability
policy enforcement

a bounded execution model may be a better abstraction than a single prompt.

Curious to hear how others are approaching this —

especially in CI/CD or safety-critical workflows.