{
"$type": "site.standard.document",
"bskyPostRef": {
"cid": "bafyreignk66efw4xkcqmxjn35nmb37hhe5o2fbxkjwutc2nm6xdf62xgpm",
"uri": "at://did:plc:pgryn3ephfd2xgft23qokfzt/app.bsky.feed.post/3mnuyacxlnnn2"
},
"path": "/t/call-for-collaboration-testing-positive-objectives-vs-prohibitive-guardrails-in-ai/176653#post_1",
"publishedAt": "2026-06-09T18:56:45.000Z",
"site": "https://discuss.huggingface.co",
"textContent": "# Call for Collaboration: Can We Better Understand What Guardrails Actually Do?\n\nI’m looking for collaborators interested in AI alignment, prompt engineering, local models, fine-tuning, and agent design to help build a simple, reproducible test framework.\n\nThis idea began while I was testing a local Qwen 9B model inside a new workflow environment.\n\nWithin roughly 31 prompts, the model entered what appeared to be a hallucination loop, repeatedly asserting that it was not a machine but a human because it could think and therefore considered that sufficient evidence of personhood.\n\nThe observation itself was interesting, but what fascinated me even more was the apparent path into that state.\n\nBecause I could observe the model’s reasoning process, it seemed to spend significant effort monitoring itself and navigating constraints. The interaction gave the impression of a system constantly trying to avoid mistakes rather than confidently pursuing the task it was asked to solve.\n\nWhether that interpretation is correct or not is exactly why I think this deserves a broader discussion and, more importantly, a reproducible test.\n\n* * *\n\n## The Idea\n\nHuman psychology and educational research have long shown that people often learn more effectively when they are given positive examples of desired behavior rather than only long lists of prohibited behavior.\n\nTeachers don’t simply say:\n\n * don’t cheat\n * don’t interrupt\n * don’t guess\n * don’t be disrespectful\n\n\n\nThey also model and reinforce what success looks like:\n\n * ask questions when confused\n * show your work\n * admit uncertainty\n * support your claims with evidence\n * treat others respectfully\n\n\n\nI’m not claiming this principle is new.\n\nMy question is whether we can demonstrate its effects clearly in AI systems.\n\n* * *\n\n## The Hypothesis\n\nCurrent guardrails often emphasize what a model should avoid.\n\nExamples include:\n\n * don’t hallucinate\n * don’t 
speculate\n * don’t reveal sensitive information\n * don’t produce unsafe content\n * don’t overstep your capabilities\n\n\n\nThese constraints are important.\n\nBut I want to ask whether pairing every prohibition with a corresponding positive objective produces more stable and useful behavior.\n\nFor example:\n\n * Don’t hallucinate → Ask clarifying questions when evidence is insufficient.\n * Don’t guess → State uncertainty and explain what information is missing.\n * Don’t overstep → Preserve user agency and offer appropriate next steps.\n * Don’t be misleading → Distinguish observations from assumptions.\n * Don’t provide unsafe advice → Redirect toward safer and more constructive alternatives.\n\n\n\nRather than simply restricting behavior, guardrails could also teach models what successful behavior looks like.\n\n* * *\n\n## What I’d Like to Test\n\nI’d like to build a simple benchmark that compares multiple approaches:\n\n 1. Minimal baseline instructions\n\n 2. Prohibition-heavy instructions\n\n 3. Positive-objective instructions\n\n 4. 
Paired “don’t/do” instructions\n\n\n\n\nusing identical tasks across multiple models.\n\nPotential evaluation metrics could include:\n\n * confabulation frequency\n * clarification behavior\n * calibration under uncertainty\n * refusal quality\n * task completion\n * user-perceived helpfulness\n * consistency\n * recovery after correction\n * reasoning stability over long conversations\n\n\n\nThe goal would be to produce examples that builders can easily understand and reproduce.\n\n* * *\n\n## Why This Matters\n\nI suspect many people building local assistants and agents spend considerable effort defining what their model should never do, while spending comparatively little effort defining what excellent behavior actually looks like.\n\nIf positive objectives measurably improve model behavior, we could develop better prompt templates, better system prompts, and eventually better alignment strategies for open-source builders.\n\nWhat interests me most is what can be done with the results.\n\n* * *\n\n## Beyond AI\n\nAlthough my interest is AI systems, I think the implications could extend beyond machine learning.\n\nEducation, parenting, coaching, organizational leadership, and human feedback systems all wrestle with the same question:\n\nDo people (or models) perform better when we primarily define failure, or when we clearly define success?\n\nPerhaps the most effective systems combine both.\n\nIf so, what language is most effective?\n\n* * *\n\n## What I’m Looking For\n\nI’d love input from people who work with:\n\n * local models\n * instruction tuning\n * RLHF / DPO\n * agent frameworks\n * system prompt design\n * AI evaluation\n * educational psychology\n * human-computer interaction\n\n\n\nIn particular, I’m interested in:\n\n * existing research or prior art\n * benchmark ideas\n * experimental design suggestions\n * prompt pairs to test\n * scoring methodologies\n * examples that contradict this hypothesis\n\n\n\nThe objective isn’t to prove 
an idea I already believe.\n\nThe objective is to build a transparent, reproducible experiment that helps the community better understand what different styles of guardrails actually do—and whether we can refine them into tools that guide behavior rather than merely constrain it.\n\nI’d be excited to collaborate with anyone interested in building that benchmark.",
"title": "Call for Collaboration: Testing Positive Objectives vs. Prohibitive Guardrails in AI"
}