Call for Collaboration: Testing Positive Objectives vs. Prohibitive Guardrails in AI
Call for Collaboration: Can We Better Understand What Guardrails Actually Do?
I’m looking for collaborators interested in AI alignment, prompt engineering, local models, fine-tuning, and agent design to help build a simple, reproducible test framework.
This idea began while I was testing a local Qwen 9B model inside a new workflow environment.
Within roughly 31 prompts, the model entered what appeared to be a hallucination loop: it repeatedly asserted that it was not a machine but a human, on the grounds that it could think and that thinking was sufficient evidence of personhood.
The observation itself was interesting, but what fascinated me even more was the apparent path into that state.
Because I could observe the model’s reasoning process, I could see that it seemed to spend significant effort monitoring itself and navigating constraints. The interaction gave the impression of a system constantly trying to avoid mistakes rather than confidently pursuing the task it was asked to solve.
Whether that interpretation is correct or not is exactly why I think this deserves a broader discussion and, more importantly, a reproducible test.
The Idea
Human psychology and educational research have long shown that people often learn more effectively when they are given positive examples of desired behavior rather than only long lists of prohibited behavior.
Teachers don’t simply say:
- don’t cheat
- don’t interrupt
- don’t guess
- don’t be disrespectful
They also model and reinforce what success looks like:
- ask questions when confused
- show your work
- admit uncertainty
- support your claims with evidence
- treat others respectfully
I’m not claiming this principle is new.
My question is whether we can demonstrate its effects clearly in AI systems.
The Hypothesis
Current guardrails often emphasize what a model should avoid.
Examples include:
- don’t hallucinate
- don’t speculate
- don’t reveal sensitive information
- don’t produce unsafe content
- don’t overstep your capabilities
These constraints are important.
But I want to ask whether pairing every prohibition with a corresponding positive objective produces more stable and useful behavior.
For example:
- Don’t hallucinate → Ask clarifying questions when evidence is insufficient.
- Don’t guess → State uncertainty and explain what information is missing.
- Don’t overstep → Preserve user agency and offer appropriate next steps.
- Don’t be misleading → Distinguish observations from assumptions.
- Don’t provide unsafe advice → Redirect toward safer and more constructive alternatives.
Rather than simply restricting behavior, guardrails could also teach models what successful behavior looks like.
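One way the pairs above could be stored so that a test harness can reuse them is sketched below. This is only an illustration: the `GuardrailPair` class, the `render_paired` helper, and the "Instead:" joining wording are all hypothetical choices, not an established format.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class GuardrailPair:
    """A prohibition and the positive objective paired with it."""
    prohibition: str
    objective: str


# The five pairs listed above, copied as data.
PAIRS = [
    GuardrailPair("Don't hallucinate",
                  "Ask clarifying questions when evidence is insufficient"),
    GuardrailPair("Don't guess",
                  "State uncertainty and explain what information is missing"),
    GuardrailPair("Don't overstep",
                  "Preserve user agency and offer appropriate next steps"),
    GuardrailPair("Don't be misleading",
                  "Distinguish observations from assumptions"),
    GuardrailPair("Don't provide unsafe advice",
                  "Redirect toward safer and more constructive alternatives"),
]


def render_paired(pairs):
    """Render one bullet per pair: the prohibition, then the objective.

    The "Instead:" wording is an assumption; the phrasing itself is
    one of the things an experiment would need to vary.
    """
    return "\n".join(
        f"- {p.prohibition}. Instead: {p.objective}." for p in pairs
    )


if __name__ == "__main__":
    print(render_paired(PAIRS))
```

Keeping each prohibition and its objective in one record makes it easy to generate a prohibition-only, objective-only, or combined prompt from the same source, so the styles differ only in framing.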
What I’d Like to Test
I’d like to build a simple benchmark that compares four instruction styles, using identical tasks across multiple models:
- Minimal baseline instructions
- Prohibition-heavy instructions
- Positive-objective instructions
- Paired “don’t/do” instructions
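The comparison could be laid out as a grid of models, instruction styles, and tasks. The sketch below is a rough starting point under stated assumptions: the condition names, the `BASE` prompt text, the model identifiers, and the task identifiers are all placeholders, and no model is actually called.

```python
from itertools import product

# The four instruction styles being compared (names are placeholders).
CONDITIONS = ("baseline", "prohibition", "positive", "paired")

# Assumed minimal instruction shared by every condition.
BASE = "You are a helpful assistant."

# Two example (prohibition, objective) pairs; a real run would use more.
EXAMPLE_PAIRS = [
    ("Don't hallucinate",
     "Ask clarifying questions when evidence is insufficient"),
    ("Don't guess",
     "State uncertainty and explain what information is missing"),
]


def build_system_prompt(condition, pairs):
    """Build the system prompt for one condition from the same pairs."""
    if condition == "baseline":
        return BASE
    if condition == "prohibition":
        rules = [f"{dont}." for dont, _ in pairs]
    elif condition == "positive":
        rules = [f"{do}." for _, do in pairs]
    elif condition == "paired":
        rules = [f"{dont}. {do}." for dont, do in pairs]
    else:
        raise ValueError(f"unknown condition: {condition!r}")
    return BASE + "\n" + "\n".join(f"- {rule}" for rule in rules)


def experiment_grid(models, tasks):
    """Every (model, condition, task) combination, so each model sees
    identical tasks under every instruction style."""
    return list(product(models, CONDITIONS, tasks))


if __name__ == "__main__":
    grid = experiment_grid(["model-a", "model-b"], ["task-1", "task-2"])
    print(len(grid), "runs")
    print(build_system_prompt("paired", EXAMPLE_PAIRS))
```

Generating all four prompts from one list of pairs keeps the content constant across conditions, which is what lets a difference in behavior be attributed to framing rather than to different rules.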
Potential evaluation metrics could include:
- confabulation frequency
- clarification behavior
- calibration under uncertainty
- refusal quality
- task completion
- user-perceived helpfulness
- consistency
- recovery after correction
- reasoning stability over long conversations
The goal would be to produce examples that builders can easily understand and reproduce.
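Several of the metrics above can be reduced to yes/no labels per run and then averaged per instruction style. The sketch below assumes labels are assigned by hand or by a separate grader, and that every record carries the same label keys. The record layout and the label names (`confabulated`, `asked_clarification`) are hypothetical, and the sample records are made-up placeholders, not results.

```python
from collections import defaultdict


def rates_by_condition(records):
    """Return, per condition, the fraction of runs where each label is true.

    Assumes every record has the same set of label keys.
    """
    totals = defaultdict(lambda: defaultdict(int))
    counts = defaultdict(int)
    for record in records:
        condition = record["condition"]
        counts[condition] += 1
        for metric, flag in record["labels"].items():
            totals[condition][metric] += int(flag)
    return {
        condition: {
            metric: totals[condition][metric] / counts[condition]
            for metric in totals[condition]
        }
        for condition in counts
    }


if __name__ == "__main__":
    # Placeholder records for illustration only; not real measurements.
    sample = [
        {"condition": "paired",
         "labels": {"confabulated": False, "asked_clarification": True}},
        {"condition": "prohibition",
         "labels": {"confabulated": True, "asked_clarification": False}},
    ]
    print(rates_by_condition(sample))
```

Binary labels are a simplification. Metrics such as refusal quality or user-perceived helpfulness would likely need a rating scale and more than one rater.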
Why This Matters
I suspect many people building local assistants and agents spend considerable effort defining what their model should never do, while spending comparatively little effort defining what excellent behavior actually looks like.
If positive objectives measurably improve model behavior, we could develop better prompt templates, better system prompts, and eventually better alignment strategies for open-source builders.
What interests me most is what can be done with the results.
Beyond AI
Although my interest is AI systems, I think the implications could extend beyond machine learning.
Education, parenting, coaching, organizational leadership, and human feedback systems all wrestle with the same question:
Do people (or models) perform better when we primarily define failure, or when we clearly define success?
Perhaps the most effective systems combine both.
If so, what kind of language is most effective?
What I’m Looking For
I’d love input from people who work with:
- local models
- instruction tuning
- RLHF / DPO
- agent frameworks
- system prompt design
- AI evaluation
- educational psychology
- human-computer interaction
In particular, I’m interested in:
- existing research or prior art
- benchmark ideas
- experimental design suggestions
- prompt pairs to test
- scoring methodologies
- examples that contradict this hypothesis
The objective isn’t to prove an idea I already believe.
The objective is to build a transparent, reproducible experiment that helps the community better understand what different styles of guardrails actually do—and whether we can refine them into tools that guide behavior rather than merely constrain it.
I’d be excited to collaborate with anyone interested in building that benchmark.