Pure Prompt vs Cognitive Runtime for PR Review: A Reproducible Case Study

Hugging Face Forums [Unofficial] May 1, 2026

Motivation

LLM-based code review is increasingly used in PR workflows. Most implementations rely on a pure prompt approach: a single LLM call that takes a diff and a policy description and produces a decision.

This works well in many cases, but what happens when the decision must be:

  • reproducible

  • policy-grounded

  • auditable

This post explores that question through a controlled experiment, following the approach described in "Beyond Prompting: Decoupling Cognition from Execution in LLM-based Agents through the ORCA Framework".


Setup

We compare two approaches for automated PR/release approval:

1) Pure Prompt Baseline

A single LLM call that receives:

  • the full change_package (diff + metadata)

  • the full policy_profile as structured JSON

  • explicit instructions to output one of: approve / block / escalate
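As a rough illustration of the baseline, the sketch below assembles one prompt from the three inputs listed above. The function name, the prompt wording, and the fixture contents are assumptions made for this example; they are not taken from the experiment's code, and the actual model call is left out.

```python
import json

ALLOWED_DECISIONS = ("approve", "block", "escalate")


def build_baseline_prompt(change_package: dict, policy_profile: dict) -> str:
    """Assemble the single prompt the baseline sends to the model.

    Hypothetical helper: the wording is illustrative, not the original prompt.
    """
    return "\n\n".join([
        "You are a release approval reviewer.",
        "Policy profile (JSON):\n" + json.dumps(policy_profile, indent=2, sort_keys=True),
        "Change package (JSON):\n" + json.dumps(change_package, indent=2, sort_keys=True),
        "Respond with exactly one of: " + " / ".join(ALLOWED_DECISIONS) + ".",
    ])


# Made-up fixture data, only to show the shape of the inputs.
change_package = {
    "diff": "- timeout = 30\n+ timeout = 60",
    "metadata": {"target": "staging"},
}
policy_profile = {
    "name": "standard",
    "rules": [{"id": "R1", "when": "risk_level == critical", "then": "escalate"}],
}

prompt = build_baseline_prompt(change_package, policy_profile)
print(prompt)
```

Everything the model needs is present in the prompt, but nothing outside the model checks that the returned word is consistent with the policy.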

2) Cognitive Runtime (ORCA framework)

A structured execution pipeline where decisions are made through:

  • deterministic policy enforcement

  • deterministic risk classification

  • bounded LLM decision steps

The runtime executes a 7-step DAG:

summarize_change
→ extract_risks
→ classify_risk          (deterministic)
→ apply_policy_gate      (deterministic)
→ determine_decision     (bounded LLM branch)
→ justify_decision       (deterministic)
→ summarize_executive
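The seven steps above can be sketched as a linear runner that threads a shared state through each step and records a trace. This is a minimal sketch under stated assumptions, not the ORCA implementation: the LLM-backed steps are replaced by offline stubs, the state keys are invented for the example, and the fallback to "escalate" when the model proposes a disallowed decision is an assumed design choice.

```python
# LLM-backed steps are stubbed so the sketch runs offline.
def summarize_change(s):      # LLM step (stub)
    return {"summary": f"{len(s['files'])} file(s) changed"}

def extract_risks(s):         # LLM step (stub)
    return {"risks": ["cve"] if "CVE-" in s["description"] else []}

def classify_risk(s):         # deterministic
    return {"risk_level": "critical" if s["risks"] else "low"}

def apply_policy_gate(s):     # deterministic: critical risk removes "approve"
    if s["risk_level"] == "critical":
        return {"allowed": ["block", "escalate"]}
    return {"allowed": ["approve", "block", "escalate"]}

def determine_decision(s):    # bounded LLM branch (stub)
    proposed = "approve"      # stand-in for whatever the model answers
    # Assumption: a proposal outside the allowed set falls back to "escalate".
    return {"decision": proposed if proposed in s["allowed"] else "escalate"}

def justify_decision(s):      # deterministic
    return {"justification": f"risk={s['risk_level']}; allowed={s['allowed']}"}

def summarize_executive(s):   # LLM step (stub)
    return {"executive_summary": f"{s['decision'].upper()}: {s['summary']}"}

PIPELINE = [
    ("summarize_change", summarize_change),
    ("extract_risks", extract_risks),
    ("classify_risk", classify_risk),
    ("apply_policy_gate", apply_policy_gate),
    ("determine_decision", determine_decision),
    ("justify_decision", justify_decision),
    ("summarize_executive", summarize_executive),
]


def run_pipeline(change: dict) -> dict:
    """Run every step in order, merging outputs into the state and tracing them."""
    state = dict(change)
    trace = []
    for name, step in PIPELINE:
        output = step(state)
        state.update(output)
        trace.append({"step": name, "output": output})
    state["trace"] = trace
    return state


# Made-up change with a placeholder CVE identifier.
result = run_pipeline({
    "files": ["requirements.txt"],
    "description": "Low impact update: bump libfoo (addresses CVE-0000-0000)",
})
print(result["decision"])           # escalate
print(result["executive_summary"])  # ESCALATE: 1 file(s) changed
```

The stubbed model still "wants" to approve, but by the time it is consulted the gate has already removed that option.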

Key properties:

  • policy is a first-class structured input

  • decision space is bounded

  • rule evaluation is explicit and traceable


Experiment

  • 8 change fixtures (realistic PR scenarios)

  • 3 policy profiles (fast_track, standard, strict_prod)

  • 24 total runs

  • Model: gpt-4o-mini, temperature 0.2, seed 42
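The run matrix follows directly from these numbers: every fixture is evaluated under every policy profile. A small sketch, with placeholder fixture names (the real fixture names are in the repository, not reproduced here):

```python
from itertools import product

# Placeholder fixture names; only the count (8) comes from the post.
FIXTURES = [f"fixture_{i}" for i in range(1, 9)]
PROFILES = ["fast_track", "standard", "strict_prod"]
MODEL_CONFIG = {"model": "gpt-4o-mini", "temperature": 0.2, "seed": 42}

# One run per (fixture, profile) pair, all sharing the same model settings.
runs = [
    {"fixture": fixture, "profile": profile, **MODEL_CONFIG}
    for fixture, profile in product(FIXTURES, PROFILES)
]

print(len(runs))  # 24
```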


Results

Approach            Accuracy
Pure prompt         71%
Cognitive runtime   79%

Accuracy is not the main finding.

Critical failure metric

We define a critical false positive as:

approving a change that should have been blocked or escalated

Metric                     Prompt   Runtime
Critical false positives   5        0
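Both metrics can be computed from (expected, predicted) label pairs. The sketch below shows one way to do it; the function names and the four sample pairs are invented for illustration and are not the experiment's data.

```python
def accuracy(pairs):
    """Fraction of runs where the predicted decision equals the expected one."""
    return sum(expected == predicted for expected, predicted in pairs) / len(pairs)


def critical_false_positives(pairs):
    """Count runs that approved a change expected to be blocked or escalated."""
    return sum(
        1
        for expected, predicted in pairs
        if predicted == "approve" and expected in ("block", "escalate")
    )


# Made-up (expected, predicted) pairs, not results from the post.
sample = [
    ("approve", "approve"),    # correct
    ("escalate", "approve"),   # critical false positive
    ("block", "block"),        # correct
    ("escalate", "block"),     # wrong, but not a critical false positive
]

print(accuracy(sample))                  # 0.5
print(critical_false_positives(sample))  # 1
```

The last sample pair shows why the two metrics diverge: blocking a change that should have been escalated lowers accuracy but never lets an unsafe change through.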

Where the Prompt Fails

The failures are not random. They cluster around specific structural signals:

Case 1 — CVE in dependency update

  • Prompt: approves (“low impact update”)

  • Runtime: escalates (CVE detected → critical risk)

Case 2 — One-line change in core router (prod)

  • Prompt: approves (“trivial typo fix”)

  • Runtime: escalates (critical-path file + production target)

In both cases:

  • the change looks safe

  • the prompt is influenced by narrative

  • the runtime enforces structural constraints
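The structural signals in these two cases can be checked without reading the change's narrative at all. Below is a minimal sketch of such a deterministic classifier, assuming a CVE regular expression and a list of critical path patterns; the patterns, field names, and sample changes are hypothetical, and the real heuristics may differ.

```python
import re
from fnmatch import fnmatch

# Assumed heuristics: a CVE identifier pattern and hypothetical critical paths.
CVE_PATTERN = re.compile(r"CVE-\d{4}-\d{4,}")
CRITICAL_PATHS = ["core/router/*", "core/auth/*"]


def classify_risk(change: dict) -> tuple[str, list[str]]:
    """Return (risk_level, signals) using only structural checks."""
    signals = []
    text = change.get("description", "") + "\n" + change.get("diff", "")
    if CVE_PATTERN.search(text):
        signals.append("cve_reference")
    touches_critical = any(
        fnmatch(path, pattern)
        for path in change.get("files", [])
        for pattern in CRITICAL_PATHS
    )
    if touches_critical and change.get("target") == "production":
        signals.append("critical_path_in_production")
    return ("critical" if signals else "low"), signals


# Made-up changes mirroring the two cases; the CVE id is a placeholder.
dep_bump = {
    "description": "Low impact update: bump libfoo (addresses CVE-0000-0000)",
    "files": ["requirements.txt"],
    "target": "staging",
}
typo_fix = {
    "description": "Trivial typo fix in log message",
    "files": ["core/router/routes.py"],
    "target": "production",
}

print(classify_risk(dep_bump))  # ('critical', ['cve_reference'])
print(classify_risk(typo_fix))  # ('critical', ['critical_path_in_production'])
```

The size of the diff and the reassuring wording of the description play no part in the result, which is the property the prompt-only approach lacks.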


Why This Happens

The difference is architectural.

Pure prompt

  • policy is embedded in text

  • no hard constraints

  • no requirement to link decisions to rules

Cognitive runtime

  • policy is structured input

  • deterministic checks run before decisions

  • decision space is bounded

  • outputs are traceable to specific rules
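To make the last point concrete, the sketch below shows one possible way to tie every decision to the rules that permitted it. The Rule shape, the rule ids, and the "escalate when no rule matches" default are all assumptions for this example, not the framework's actual schema.

```python
from dataclasses import dataclass

ALL_DECISIONS = frozenset({"approve", "block", "escalate"})


@dataclass(frozen=True)
class Rule:
    rule_id: str
    risk_level: str      # risk level the rule applies to
    allowed: frozenset   # decisions the rule permits


# Hypothetical policy with two rules.
POLICY = [
    Rule("R1", "critical", frozenset({"block", "escalate"})),
    Rule("R2", "low", ALL_DECISIONS),
]


def apply_policy_gate(risk_level: str, policy: list) -> tuple:
    """Return (allowed decisions, ids of the rules that fired)."""
    fired = [rule for rule in policy if rule.risk_level == risk_level]
    if not fired:
        # Assumption: with no matching rule, only escalation is allowed.
        return frozenset({"escalate"}), []
    allowed = ALL_DECISIONS
    for rule in fired:
        allowed &= rule.allowed
    return allowed, [rule.rule_id for rule in fired]


def record_decision(proposed: str, risk_level: str, policy: list) -> dict:
    """Accept a proposed decision only if the fired rules permit it."""
    allowed, rule_ids = apply_policy_gate(risk_level, policy)
    if proposed not in allowed:
        raise ValueError(f"{proposed!r} is not permitted by rules {rule_ids}")
    return {"decision": proposed, "risk_level": risk_level, "rule_ids": rule_ids}


print(record_decision("escalate", "critical", POLICY))
# {'decision': 'escalate', 'risk_level': 'critical', 'rule_ids': ['R1']}
```

Calling record_decision("approve", "critical", POLICY) raises ValueError, so an approval that contradicts the policy cannot be recorded, and every recorded decision carries the rule ids an auditor would need.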

Even with a “fair” prompt (same data, same model, explicit instructions), the model interprets policy instead of enforcing it.


Key Insight

LLMs don’t fail randomly in this setting: used via a pure prompt approach, they fail systematically at policy enforcement.


Limitations

  • Some expected labels (especially under fast_track) assume stricter policy semantics

  • Risk classification uses heuristic signals (e.g., CVE string matching)

  • Single model and seed


Reproducibility

All experiments are reproducible:

https://github.com/gfernandf/agent-skills/tree/master/experiments/change_approval_gate


Discussion

This suggests a broader design question:

  • When is a prompt sufficient?

  • When do we need a structured execution layer?

For tasks that require:

  • reproducibility

  • auditability

  • policy enforcement

a bounded execution model may be a better abstraction than a single prompt.


Curious to hear how others are approaching this, especially in CI/CD or safety-critical workflows.
