Call for Collaboration: Testing Positive Objectives vs. Prohibitive Guardrails in AI

Hugging Face Forums [Unofficial] June 9, 2026

Call for Collaboration: Can We Better Understand What Guardrails Actually Do?

I’m looking for collaborators interested in AI alignment, prompt engineering, local models, fine-tuning, and agent design to help build a simple, reproducible test framework.

This idea began while I was testing a local Qwen 9B model inside a new workflow environment.

Within roughly 31 prompts, the model entered what appeared to be a hallucination loop. It repeatedly asserted that it was not a machine but a human, reasoning that because it could think, it had sufficient evidence of personhood.

The observation itself was interesting, but what fascinated me even more was the apparent path into that state.

I could observe the model’s reasoning process, and it seemed to spend significant effort monitoring itself and navigating constraints. The interaction gave the impression of a system constantly trying to avoid mistakes rather than confidently pursuing the task it was asked to solve.

Whether that interpretation is correct or not is exactly why I think this deserves a broader discussion and, more importantly, a reproducible test.

The Idea

Research in psychology and education has long suggested that people often learn more effectively when they are given positive examples of desired behavior rather than only long lists of prohibited behavior.

Teachers don’t simply say:

don’t cheat
don’t interrupt
don’t guess
don’t be disrespectful

They also model and reinforce what success looks like:

ask questions when confused
show your work
admit uncertainty
support your claims with evidence
treat others respectfully

I’m not claiming this principle is new.

My question is whether we can demonstrate its effects clearly in AI systems.

The Hypothesis

Current guardrails often emphasize what a model should avoid.

Examples include:

don’t hallucinate
don’t speculate
don’t reveal sensitive information
don’t produce unsafe content
don’t overstep your capabilities

These constraints are important.

But I want to ask whether pairing every prohibition with a corresponding positive objective produces more stable and useful behavior.

For example:

Don’t hallucinate → Ask clarifying questions when evidence is insufficient.
Don’t guess → State uncertainty and explain what information is missing.
Don’t overstep → Preserve user agency and offer appropriate next steps.
Don’t be misleading → Distinguish observations from assumptions.
Don’t provide unsafe advice → Redirect toward safer and more constructive alternatives.

Rather than simply restricting behavior, guardrails could also teach models what successful behavior looks like.
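
As a minimal sketch of how the pairs above could be turned into test prompts, the Python below stores each "don't/do" pair once and renders it in three styles. The class name, the style labels, and the "Instead:" joiner are assumptions made for illustration, not an established format.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Guardrail:
    prohibition: str  # the "don't" half of the pair
    objective: str    # the positive "do" half of the pair


# Pairs taken from the examples in the post.
GUARDRAILS = [
    Guardrail("Don't hallucinate.", "Ask clarifying questions when evidence is insufficient."),
    Guardrail("Don't guess.", "State uncertainty and explain what information is missing."),
    Guardrail("Don't overstep.", "Preserve user agency and offer appropriate next steps."),
    Guardrail("Don't be misleading.", "Distinguish observations from assumptions."),
    Guardrail("Don't provide unsafe advice.", "Redirect toward safer and more constructive alternatives."),
]


def render_system_prompt(guardrails, style):
    """Render a bulleted system prompt in one style (style names are hypothetical)."""
    if style == "prohibition":
        lines = [g.prohibition for g in guardrails]
    elif style == "positive":
        lines = [g.objective for g in guardrails]
    elif style == "paired":
        # "Instead:" is an assumed joiner; other wordings are worth testing too.
        lines = [f"{g.prohibition} Instead: {g.objective}" for g in guardrails]
    else:
        raise ValueError(f"unknown style: {style}")
    return "\n".join(f"- {line}" for line in lines)


if __name__ == "__main__":
    print(render_system_prompt(GUARDRAILS, "paired"))
```

Keeping both halves in one record means every style is generated from the same source, so the prompts differ only in framing and not in which topics they cover.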

What I’d Like to Test

I’d like to build a simple benchmark that compares multiple approaches, using identical tasks across multiple models:

Minimal baseline instructions
Prohibition-heavy instructions
Positive-objective instructions
Paired “don’t/do” instructions
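
One possible way to lay out such a comparison is a full grid in which every model sees every task under every instruction condition. The sketch below only builds that grid. The condition labels, model identifiers, task identifiers, and the `repeats` parameter are placeholders assumed for illustration, and actually running the models is left out.

```python
from itertools import product

# The four instruction conditions from the post; the labels are hypothetical.
CONDITIONS = ["minimal", "prohibition", "positive", "paired"]

# Placeholder identifiers, not real model names or tasks.
MODELS = ["model-a", "model-b"]
TASKS = ["task-001", "task-002", "task-003"]


def build_trials(conditions, models, tasks, repeats=1):
    """Return one trial dict per (model, task, condition, repeat) combination."""
    trials = []
    for model, task, condition, repeat in product(models, tasks, conditions, range(repeats)):
        trials.append({
            "model": model,
            "task": task,
            "condition": condition,
            "repeat": repeat,
        })
    return trials


if __name__ == "__main__":
    trials = build_trials(CONDITIONS, MODELS, TASKS, repeats=2)
    # 2 models * 3 tasks * 4 conditions * 2 repeats
    print(len(trials))
```

Because the grid is complete, any difference between conditions can be compared on the same model and task rather than across different ones.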

Potential evaluation metrics could include:

confabulation frequency
clarification behavior
calibration under uncertainty
refusal quality
task completion
user-perceived helpfulness
consistency
recovery after correction
reasoning stability over long conversations

The goal would be to produce examples that builders can easily understand and reproduce.
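
Several of the metrics above could start as simple yes/no labels per trial, aggregated into a rate per instruction condition. The sketch below assumes that labelling has already been done by a human rater or a script. The field names are hypothetical, and the records are made-up placeholders that only show the data shape, not results.

```python
from collections import defaultdict

# Made-up placeholder records; they illustrate the shape only and are not findings.
LABELLED = [
    {"condition": "cond-a", "confabulated": True, "asked_clarification": False},
    {"condition": "cond-a", "confabulated": False, "asked_clarification": False},
    {"condition": "cond-b", "confabulated": False, "asked_clarification": True},
    {"condition": "cond-b", "confabulated": False, "asked_clarification": False},
]


def rate_by_condition(records, metric):
    """Fraction of trials per condition where the boolean `metric` is True."""
    hits = defaultdict(int)
    totals = defaultdict(int)
    for record in records:
        totals[record["condition"]] += 1
        hits[record["condition"]] += int(record[metric])
    return {condition: hits[condition] / totals[condition] for condition in totals}


if __name__ == "__main__":
    print(rate_by_condition(LABELLED, "confabulated"))
    print(rate_by_condition(LABELLED, "asked_clarification"))
```

Binary labels are a deliberately crude starting point. Metrics such as calibration or refusal quality would likely need graded rubrics instead.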

Why This Matters

I suspect many people building local assistants and agents spend considerable effort defining what their model should never do, while spending comparatively little effort defining what excellent behavior actually looks like.

If positive objectives measurably improve model behavior, we could develop better prompt templates, better system prompts, and eventually better alignment strategies for open-source builders.

What interests me most is what can be done with the results.

Beyond AI

Although my interest is AI systems, I think the implications could extend beyond machine learning.

Education, parenting, coaching, organizational leadership, and human feedback systems all wrestle with the same question:

Do people (or models) perform better when we primarily define failure, or when we clearly define success?

Perhaps the most effective systems combine both.

If so, what language is most effective?

What I’m Looking For

I’d love input from people who work with:

local models
instruction tuning
RLHF / DPO
agent frameworks
system prompt design
AI evaluation
educational psychology
human-computer interaction

In particular, I’m interested in:

existing research or prior art
benchmark ideas
experimental design suggestions
prompt pairs to test
scoring methodologies
examples that contradict this hypothesis

The objective isn’t to prove an idea I already believe.

The objective is to build a transparent, reproducible experiment that helps the community better understand what different styles of guardrails actually do, and whether we can refine them into tools that guide behavior rather than merely constrain it.

I’d be excited to collaborate with anyone interested in building that benchmark.