A Bidirectional LLM Firewall: Next Level X1 - help wanted!
Seems like your intuition is right:
Yes. It becomes mandatory earlier than most systems want to admit.
The clean answer is:
A status card becomes mandatory at the first point where the system stops treating material as raw input and starts treating it as something that can change control flow, memory, or authority.
That usually means two different moments , not one:
- At ingress / normalization , every piece of incoming material needs a source-status card.
- At the first detector, parser, or scorer that emits a conclusion , every conclusion needs a finding-status card.
If you wait until fusion, policy, or final action selection, you are already late. By then, some of the most important boundary confusion may already have happened upstream. That is exactly the lesson behind zero-trust policy/enforcement separation and current agent-safety guidance: attach trust and authority early, not after the system has already reasoned with the material. (The NIST Tech series.)
The shortest practical rule
Use this rule:
Before any signal is allowed to influence routing, memory, tool planning, or execution, it must carry explicit status.
That is the rule that keeps advisory signals from quietly hardening into governing ones.
Why “late” is too late
A lot of systems think the right place for a status card is near the final decision. That is good for auditing, but weak for control.
Why weak? Because the dangerous upgrade often happens much earlier.
A system may already have:
- routed down a more privileged path,
- written something into session state,
- treated retrieved text as instruction-like,
- let a weak heuristic influence a tool plan,
- or blended a provisional score with stronger evidence.
By the time a late-stage status card appears, the system may already be documenting a confusion it failed to prevent. This is why current prompt-injection guidance keeps emphasizing workflow design and constrained action boundaries rather than relying on end-of-pipeline judgment alone. (OpenAI)
The easiest way to think about it
There are really two different status cards.
1) Source-status card
This is attached to raw material as soon as it enters the system.
Examples of raw material:
- user text
- retrieved chunks
- tool output
- OCR text
- browser page text
- model output that will be reused downstream
This card answers:
What is this material allowed to count as?
It should include things like:
- origin
- trust class
- scope
- authority role
- transfer policy
A simple version looks like this:
{
"origin": "retrieved_context",
"trust_class": "untrusted",
"scope": "artifact",
"authority_role": "data",
"transfer_policy": ["may_route", "may_summarize", "may_not_execute"]
}
This card becomes mandatory at ingress normalization. Not later. Because once the system starts normalizing, segmenting, or source-tagging input, it is already interpreting it. If source and authority are not attached there, later layers are working on material whose standing is already blurred. That fits the core point in recent role-confusion work: models often infer “who is speaking” from style rather than source, so the system has to preserve source/role explicitly at the boundary. (arXiv)
2) Finding-status card
This is attached when a component emits a claim about the material.
Examples:
- regex matched
- parser detected duplicate keys
- semantic model returned a risk score
- entropy heuristic fired
- session probe increased risk
- latent probe signaled role confusion
This card answers:
What kind of claim is this, and how much decision work is it allowed to do?
A simple version looks like this:
{
"evidence_kind": "calibrated_model",
"scope": "turn",
"decision_role": "escalation_trigger",
"calibration_state": "calibrated",
"replay_stability": "version_stable",
"score": 0.84,
"reason_code": "INDIRECT_INJECTION_SUSPECTED"
}
This card becomes mandatory at the first emitted conclusion. That means: the moment a parser, rule engine, detector, or probe says anything stronger than “here is raw content,” the status card should exist. From that point on, the system is no longer moving content alone. It is moving claims about content. (RFC Editor)
Where the “mandatory line” sits in a real pipeline
The cleanest pipeline version looks like this.
Stage A: Raw ingress
The system receives input.
At this point, you attach a source-status card immediately.
Why here? Because this is the first boundary where trust, origin, and authority can be lost. OpenAI’s agent guidance is explicit that untrusted instructions can arrive through external sources and influence tools or planning. Anthropic says the same for browser agents that constantly consume hostile or mixed-trust content. (OpenAI)
Stage B: Normalization and parsing
The system normalizes Unicode, strips obfuscation, parses JSON, segments content, or canonicalizes a tool payload.
The source-status card must already exist here.
Why? Because normalization is not neutral. It transforms the object. If you canonicalize or parse something, you are deciding what “the same object” means. RFC 8785 matters here because it creates a deterministic, hashable JSON representation for cryptographic uses. That is exactly the kind of boundary where identity and standing must stop drifting. (RFC Editor)
Stage C: First conclusion
A layer says:
- this matched,
- this is suspicious,
- this is malformed,
- this is high-risk,
- this session pattern is escalating.
Now the finding-status card becomes mandatory.
Not optional. Not deferred.
This is the first moment when a signal can start doing control work inside the system. If the system still does not know whether the signal is deterministic, local, calibrated, advisory, or final, later layers will guess. That is where architecture starts quietly hardening without saying so.
Stage D: Routing and fusion
At this point, routing and fusion should consume only findings that already have status cards.
This is important. The routing layer should not have to infer:
- whether a signal is advisory,
- whether it is session-local,
- whether it is calibrated,
- or whether it is allowed to block.
If routing has to infer those things from score shape or log conventions, the architecture is already too implicit.
Stage E: Memory, tool planning, action
By the time a system updates memory, plans a tool, or approves execution, both source status and finding status must already be present and enforced.
This is where zero-trust logic becomes concrete. NIST’s model separates policy decision and policy enforcement precisely so control decisions are based on governed inputs, not on vague downstream assumptions. (The NIST Tech series.)
The key distinction: “mandatory” means different things at different stages
This is important.
When I say “mandatory,” I do not mean the same schema must exist in full from the first byte onward.
I mean:
- source status is mandatory from first interpretation,
- finding status is mandatory from first conclusion,
- and both are mandatory before any control consequence.
That is the simplest clean rule.
Why your instinct is correct
You said:
if those distinctions only become explicit late in the system, some of the most important boundary confusion has already happened upstream.
That is right.
The upstream confusion usually happens when a system silently upgrades:
- data into instruction,
- local context into portable rule,
- heuristic suspicion into policy basis,
- advisory evidence into execution authority.
Recent work on role confusion is basically a mechanistic explanation of that same failure: the model can assign authority based on how text looks rather than where it came from. So if the architecture waits too long to attach authority and provenance, later components are already reasoning over blurred material. (arXiv)
The easiest implementation pattern
The smallest practical rollout is this.
First
Add a source-status card at ingress with just five fields:
origintrust_classscopeauthority_roletransfer_policy
That alone prevents a lot of early confusion.
Second
Add a finding-status card to every emitted conclusion with:
evidence_kindscopedecision_rolecalibration_statereplay_stability
That prevents later layers from guessing what kind of signal they are seeing.
Third
Make this enforcement rule:
No component may use a signal to change routing, memory, or execution authority unless that signal already carries explicit status.
That is the whole fix in one line.
What this prevents
It prevents four common failures.
1) Advisory drift
A weak signal stops quietly becoming a stronger one just because it moved deeper into the stack.
2) Scope drift
A session-local signal stops silently turning into a generally portable rule.
3) Calibration drift
An experimental or heuristic score stops masquerading as a calibrated probability.
4) Provenance drift
Untrusted or derived content stops quietly inheriting the authority of trusted instructions.
That last point is especially important in agent systems. Prompt injection is hard partly because the model itself does not reliably preserve instruction/data boundaries. That is why external system structure matters so much. (OpenAI)
Why this also helps replay and auditing
Once status is attached early, replay becomes much more meaningful.
You can tell:
- what was deterministic,
- what was calibrated,
- what was session-dependent,
- what was advisory only,
- and what was allowed to become a policy basis.
That is very similar in spirit to provenance systems and attestation systems. The in-toto attestation framework describes attestations as authenticated metadata intended for automated policy consumers. The same pattern applies here: not just “what happened,” but “what standing did this claim have?” (GitHub)
It also fits with OpenTelemetry’s direction for GenAI agent spans: richer structured semantics are needed once systems include tools, agents, and multi-stage execution, because plain logs are too lossy. (OpenTelemetry)
The one-sentence answer
If you want the answer in one sentence:
The status card becomes mandatory at the first point where the system interprets, transforms, or concludes anything in a way that can affect routing, memory, or action.
In practice:
- source-status at ingress/normalization,
- finding-status at first conclusion,
- both enforced before fusion, memory, or execution.
Final view
So yes, I agree with your instinct.
If status only becomes explicit late in the system, it is already too late for the most important part of the job. Late status cards are still useful for replay and forensics. But early status cards are what prevent boundary confusion in the first place. That is the difference between “good logging” and “good architecture.”
Discussion in the ATmosphere