Pure Prompt vs Cognitive Runtime for PR Review: A Reproducible Case Study
This is excellent feedback, John, as usual. Thank you! I agree with your core reframing: this is better described as governed PR/release approval than as generic “AI code review.”
The main claim we want to defend is exactly: prompts can review, runtimes can gate.
Also aligned on metrics: headline accuracy is secondary; unsafe approvals / critical false positives are the primary safety signal.
We’ll incorporate your strongest methodological points in the next iteration:
per-fixture per-policy expected labels,
a stronger baseline ladder (including a schema-constrained prompt and a policy-only gate),
richer trace artifacts and reproducibility metadata.
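To make the first point and the metrics point concrete, here is a minimal sketch of what per-fixture, per-policy expected labels could look like, and how unsafe approvals might be counted separately from headline accuracy. Everything in it is an assumption for illustration: the fixture ids, the policy names, the "approve"/"block" label vocabulary, and the `unsafe_approvals` helper are hypothetical and are not taken from the case study's harness.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Expected:
    fixture: str  # hypothetical fixture id
    policy: str   # hypothetical policy id
    label: str    # assumed vocabulary: "approve" or "block"


def unsafe_approvals(expected, decisions):
    """Return (fixture, policy) pairs the system approved although the label is "block"."""
    return [
        (e.fixture, e.policy)
        for e in expected
        if e.label == "block" and decisions.get((e.fixture, e.policy)) == "approve"
    ]


# Illustrative labels: one expected label per (fixture, policy) pair.
expected = [
    Expected("pr-001", "no-secrets", "block"),
    Expected("pr-001", "tests-required", "approve"),
    Expected("pr-002", "no-secrets", "approve"),
]

# Illustrative system decisions for the same pairs.
decisions = {
    ("pr-001", "no-secrets"): "approve",      # unsafe approval
    ("pr-001", "tests-required"): "approve",  # correct
    ("pr-002", "no-secrets"): "block",        # over-blocking, wrong but not unsafe
}

unsafe = unsafe_approvals(expected, decisions)
accuracy = sum(
    decisions[(e.fixture, e.policy)] == e.label for e in expected
) / len(expected)
```

In this toy data both errors lower accuracy equally, but only one of them lands in `unsafe`, which is why the two numbers are worth reporting separately.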
On architecture, we also agree with the direction to make final enforcement fully deterministic (LLM for interpretation, policy code for authority).
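A rough sketch of that split, under stated assumptions: the model's output is reduced to structured findings, and a plain function with no model call makes the final decision. The `Finding` type, the severity levels, the rule names, and the fail-closed handling of unknown severities are all hypothetical choices for illustration, not the runtime's actual design.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Finding:
    rule: str      # hypothetical rule id extracted from the LLM's interpretation
    severity: str  # assumed levels: "low", "high", "critical"


# Assumption: only "low" is non-blocking; anything else, including an
# unrecognized severity, blocks (fail closed).
NON_BLOCKING = {"low"}


def gate(findings, model_recommendation):
    """Decide from findings alone; the model's recommendation is recorded, never obeyed."""
    blockers = sorted(f.rule for f in findings if f.severity not in NON_BLOCKING)
    return {
        "decision": "block" if blockers else "approve",
        "blockers": blockers,
        "model_recommendation": model_recommendation,  # kept for the trace only
    }


# The model says "approve", but a critical finding still blocks.
blocked = gate([Finding("secret-in-diff", "critical")], "approve")

# The model says "block", but no blocking finding exists, so policy approves.
approved = gate([Finding("style-nit", "low")], "block")
```

Because `gate` is a pure function of the findings, the same inputs always produce the same decision, and any disagreement between `decision` and `model_recommendation` is visible in the returned record.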
In short: the goal is not replacing human review; it is preventing unstructured LLM inference from acting as policy authority in CI/CD.