External Publication
Visit Post

A Bidirectional LLM Firewall: Next Level X1 - help wanted!

Hugging Face Forums [Unofficial] April 15, 2026
Source

Seems like your intuition is right:


Yes. It becomes mandatory earlier than most systems want to admit.

The clean answer is:

A status card becomes mandatory at the first point where the system stops treating material as raw input and starts treating it as something that can change control flow, memory, or authority.

That usually means two different moments , not one:

  1. At ingress / normalization , every piece of incoming material needs a source-status card.
  2. At the first detector, parser, or scorer that emits a conclusion , every conclusion needs a finding-status card.

If you wait until fusion, policy, or final action selection, you are already late. By then, some of the most important boundary confusion may already have happened upstream. That is exactly the lesson behind zero-trust policy/enforcement separation and current agent-safety guidance: attach trust and authority early, not after the system has already reasoned with the material. (The NIST Tech series.)

The shortest practical rule

Use this rule:

Before any signal is allowed to influence routing, memory, tool planning, or execution, it must carry explicit status.

That is the rule that keeps advisory signals from quietly hardening into governing ones.

Why “late” is too late

A lot of systems think the right place for a status card is near the final decision. That is good for auditing, but weak for control.

Why weak? Because the dangerous upgrade often happens much earlier.

A system may already have:

  • routed down a more privileged path,
  • written something into session state,
  • treated retrieved text as instruction-like,
  • let a weak heuristic influence a tool plan,
  • or blended a provisional score with stronger evidence.

By the time a late-stage status card appears, the system may already be documenting a confusion it failed to prevent. This is why current prompt-injection guidance keeps emphasizing workflow design and constrained action boundaries rather than relying on end-of-pipeline judgment alone. (OpenAI)

The easiest way to think about it

There are really two different status cards.

1) Source-status card

This is attached to raw material as soon as it enters the system.

Examples of raw material:

  • user text
  • retrieved chunks
  • tool output
  • OCR text
  • browser page text
  • model output that will be reused downstream

This card answers:

What is this material allowed to count as?

It should include things like:

  • origin
  • trust class
  • scope
  • authority role
  • transfer policy

A simple version looks like this:

{
  "origin": "retrieved_context",
  "trust_class": "untrusted",
  "scope": "artifact",
  "authority_role": "data",
  "transfer_policy": ["may_route", "may_summarize", "may_not_execute"]
}

This card becomes mandatory at ingress normalization. Not later. Because once the system starts normalizing, segmenting, or source-tagging input, it is already interpreting it. If source and authority are not attached there, later layers are working on material whose standing is already blurred. That fits the core point in recent role-confusion work: models often infer “who is speaking” from style rather than source, so the system has to preserve source/role explicitly at the boundary. (arXiv)

2) Finding-status card

This is attached when a component emits a claim about the material.

Examples:

  • regex matched
  • parser detected duplicate keys
  • semantic model returned a risk score
  • entropy heuristic fired
  • session probe increased risk
  • latent probe signaled role confusion

This card answers:

What kind of claim is this, and how much decision work is it allowed to do?

A simple version looks like this:

{
  "evidence_kind": "calibrated_model",
  "scope": "turn",
  "decision_role": "escalation_trigger",
  "calibration_state": "calibrated",
  "replay_stability": "version_stable",
  "score": 0.84,
  "reason_code": "INDIRECT_INJECTION_SUSPECTED"
}

This card becomes mandatory at the first emitted conclusion. That means: the moment a parser, rule engine, detector, or probe says anything stronger than “here is raw content,” the status card should exist. From that point on, the system is no longer moving content alone. It is moving claims about content. (RFC Editor)

Where the “mandatory line” sits in a real pipeline

The cleanest pipeline version looks like this.

Stage A: Raw ingress

The system receives input.

At this point, you attach a source-status card immediately.

Why here? Because this is the first boundary where trust, origin, and authority can be lost. OpenAI’s agent guidance is explicit that untrusted instructions can arrive through external sources and influence tools or planning. Anthropic says the same for browser agents that constantly consume hostile or mixed-trust content. (OpenAI)

Stage B: Normalization and parsing

The system normalizes Unicode, strips obfuscation, parses JSON, segments content, or canonicalizes a tool payload.

The source-status card must already exist here.

Why? Because normalization is not neutral. It transforms the object. If you canonicalize or parse something, you are deciding what “the same object” means. RFC 8785 matters here because it creates a deterministic, hashable JSON representation for cryptographic uses. That is exactly the kind of boundary where identity and standing must stop drifting. (RFC Editor)

Stage C: First conclusion

A layer says:

  • this matched,
  • this is suspicious,
  • this is malformed,
  • this is high-risk,
  • this session pattern is escalating.

Now the finding-status card becomes mandatory.

Not optional. Not deferred.

This is the first moment when a signal can start doing control work inside the system. If the system still does not know whether the signal is deterministic, local, calibrated, advisory, or final, later layers will guess. That is where architecture starts quietly hardening without saying so.

Stage D: Routing and fusion

At this point, routing and fusion should consume only findings that already have status cards.

This is important. The routing layer should not have to infer:

  • whether a signal is advisory,
  • whether it is session-local,
  • whether it is calibrated,
  • or whether it is allowed to block.

If routing has to infer those things from score shape or log conventions, the architecture is already too implicit.

Stage E: Memory, tool planning, action

By the time a system updates memory, plans a tool, or approves execution, both source status and finding status must already be present and enforced.

This is where zero-trust logic becomes concrete. NIST’s model separates policy decision and policy enforcement precisely so control decisions are based on governed inputs, not on vague downstream assumptions. (The NIST Tech series.)

The key distinction: “mandatory” means different things at different stages

This is important.

When I say “mandatory,” I do not mean the same schema must exist in full from the first byte onward.

I mean:

  • source status is mandatory from first interpretation,
  • finding status is mandatory from first conclusion,
  • and both are mandatory before any control consequence.

That is the simplest clean rule.

Why your instinct is correct

You said:

if those distinctions only become explicit late in the system, some of the most important boundary confusion has already happened upstream.

That is right.

The upstream confusion usually happens when a system silently upgrades:

  • data into instruction,
  • local context into portable rule,
  • heuristic suspicion into policy basis,
  • advisory evidence into execution authority.

Recent work on role confusion is basically a mechanistic explanation of that same failure: the model can assign authority based on how text looks rather than where it came from. So if the architecture waits too long to attach authority and provenance, later components are already reasoning over blurred material. (arXiv)

The easiest implementation pattern

The smallest practical rollout is this.

First

Add a source-status card at ingress with just five fields:

  • origin
  • trust_class
  • scope
  • authority_role
  • transfer_policy

That alone prevents a lot of early confusion.

Second

Add a finding-status card to every emitted conclusion with:

  • evidence_kind
  • scope
  • decision_role
  • calibration_state
  • replay_stability

That prevents later layers from guessing what kind of signal they are seeing.

Third

Make this enforcement rule:

No component may use a signal to change routing, memory, or execution authority unless that signal already carries explicit status.

That is the whole fix in one line.

What this prevents

It prevents four common failures.

1) Advisory drift

A weak signal stops quietly becoming a stronger one just because it moved deeper into the stack.

2) Scope drift

A session-local signal stops silently turning into a generally portable rule.

3) Calibration drift

An experimental or heuristic score stops masquerading as a calibrated probability.

4) Provenance drift

Untrusted or derived content stops quietly inheriting the authority of trusted instructions.

That last point is especially important in agent systems. Prompt injection is hard partly because the model itself does not reliably preserve instruction/data boundaries. That is why external system structure matters so much. (OpenAI)

Why this also helps replay and auditing

Once status is attached early, replay becomes much more meaningful.

You can tell:

  • what was deterministic,
  • what was calibrated,
  • what was session-dependent,
  • what was advisory only,
  • and what was allowed to become a policy basis.

That is very similar in spirit to provenance systems and attestation systems. The in-toto attestation framework describes attestations as authenticated metadata intended for automated policy consumers. The same pattern applies here: not just “what happened,” but “what standing did this claim have?” (GitHub)

It also fits with OpenTelemetry’s direction for GenAI agent spans: richer structured semantics are needed once systems include tools, agents, and multi-stage execution, because plain logs are too lossy. (OpenTelemetry)

The one-sentence answer

If you want the answer in one sentence:

The status card becomes mandatory at the first point where the system interprets, transforms, or concludes anything in a way that can affect routing, memory, or action.

In practice:

  • source-status at ingress/normalization,
  • finding-status at first conclusion,
  • both enforced before fusion, memory, or execution.

Final view

So yes, I agree with your instinct.

If status only becomes explicit late in the system, it is already too late for the most important part of the job. Late status cards are still useful for replay and forensics. But early status cards are what prevent boundary confusion in the first place. That is the difference between “good logging” and “good architecture.”

Discussion in the ATmosphere

Loading comments...