External Publication

Seeking arXiv cs.AI endorsement - Harvard Data Science paper on structured judgment for agentic AI

Hugging Face Forums [Unofficial] April 27, 2026

Hi all,

I’m looking for an arXiv cs.AI endorsement to submit my first paper. I’m in Harvard’s Data Science program and lead product on Amazon’s AI Decision Science team. The paper is on structured judgment for agentic AI, it engages with ReAct, AgentBench, GAIA, and BIG-Bench.

What endorsement involves: It’s a quick administrative step, you click the arXiv link below to confirm the paper fits the cs.AI category. It doesn’t require you to have read the paper or more formally endorse it. You just need to have published 3+ papers in a CS category on arXiv in the past 5 years. Endorsement code: 6GZZO4

About the work: The paper argues that frontier AI agents fail on judgment-intensive tasks because agent architectures don’t force models to apply capability with structural discipline, not because the models lack capability. We propose a structured judgment scaffold, evaluate it against a three-level task taxonomy, and report results from a blinded comparison scored by raters from two model families (Claude and GPT-5). The scaffold closes 54–79% of the performance gap over chain-of-thought, never loses a prompt under either rater, and adds zero on execution tasks.

Full abstract below. Happy to share the draft with anyone interested. Thank you!

Abstract: Frontier AI agents fail on judgment-intensive tasks not because the underlying models lack capability, but because agent architectures do not force those models to apply capability with structural discipline. We call this the rigor deficit. We propose a structured judgment scaffold that compels named criteria, quantified trade-offs, explicit fragility analysis, and clear stopping rules, and we evaluate it against a three-level task taxonomy (Direction-finding, Solution-selection, Execution). Judgment-intensive tasks represent a small fraction of the work delegated to AI agents but govern a steeply disproportionate share of the value those agents are expected to produce. We identify the Direction-Layer Ground Truth Problem: direction-finding tasks cannot have their ground truth specified before the work is complete, rendering standard evaluation infrastructure blind to rigor failures at the layer where consequential value is generated. In a blinded three-condition comparison scored by raters from two laboratories (Claude and GPT-5), the scaffold closes 54–79% of the performance gap over chain-of-thought on a 15-point rubric, never loses a prompt under either rater, holds on load-bearing rubric dimensions when formatting is stripped, and adds zero on execution tasks. The practical implication is economic: structured reasoning converts agent output from plausibly-wrong to reviewable, allowing one senior operator to validate judgment work that previously required $350k–$700k+ talent to produce from scratch. This compression of the most expensive labor band in knowledge work, not faster execution, is the next meaningful unlock from agentic AI.

Discussion in the ATmosphere