External Publication

A Bidirectional LLM Firewall: Next Level X1 - help wanted!

Hugging Face Forums [Unofficial] April 14, 2026

Seems a valid point:

Yes. Some status is already preserved explicitly. But some of the most important status still seems only partly explicit.

That is the clean answer.

The architecture being discussed is not just moving text through filters. It already separates fast deterministic checks from slower semantic or latent analysis , and it also introduces stateful risk profiling for repeated probing and multi-turn hardening. So it is already preserving more than raw content. It is preserving at least some notion of what kind of signal this is and how seriously it should be treated. (Hugging Face Forums)

The simple way to think about it

There are two different things a layered system can pass forward:

The content itself
The status of that content or finding

The second one is what you are really asking about.

For example:

“This regex matched” is not the same kind of thing as “this model is 0.84 suspicious.”
“This session has been probing us for five turns” is not the same kind of thing as “this rule always applies globally.”
“This score is calibrated” is not the same as “this is a useful but provisional heuristic.”
“Escalate for more scrutiny” is not the same as “policy says block.”

If those distinctions are not preserved clearly, later layers start flattening unlike things into one risk score or one decision bit. That is where subtle architectural damage begins. Zero Trust architecture exists partly to stop exactly this kind of silent trust inheritance across boundaries. (The NIST technical series.)

What seems explicit already

1) Deterministic hit vs probabilistic suspicion

This looks fairly explicit already.

The discussion clearly distinguishes:

pattern-based or hard-gate logic,
semantic analysis,
latent-space intent analysis,
and stateful risk accumulation. (Hugging Face Forums)

That means the system is not treating all findings as the same class of evidence. A hard rule hit is closer to “this crossed a line.” A semantic score is closer to “this raises concern.” That is a healthy distinction.

2) Escalation vs final conclusion

This also seems at least partly explicit.

A major theme in the discussion is that ambiguous or difficult cases should not be reduced to one crude yes/no judgment. That aligns with context-aware safety work like CASE-Bench, which found that context significantly changes human safety judgments. In other words, systems need room for ambiguity, clarification, escalation, or deferred judgment rather than pretending all cases are immediately classifiable. (arXiv)

3) Stateful vs stateless risk

This one also seems explicit.

The later architecture description does not stay purely stateless. It adds a session-based risk engine and repeated-probing penalties. So at least some part of the system already knows that a signal may be local to a session trajectory , not just a property of one message in isolation. (Hugging Face Forums)

What still seems partly implicit

This is the more important half.

1) Local signal vs portable rule

This is where I think the architecture is still less explicit than it should be.

It clearly has session state, routing state, and deployment tradeoff awareness. But that is not the same as explicitly labeling a finding as:

global,
tenant-specific,
surface-specific,
session-local,
turn-local,
or artifact-local.

Those are different scopes. If scope is not first-class, later layers may accidentally treat a session-specific warning as if it were a generally portable truth. (Hugging Face Forums)

2) Calibrated finding vs provisional heuristic

This also seems only partly explicit.

The discussion pays real attention to calibration and limitations, which is good. But in a system like this, not every signal is the same sort of object:

some are deterministic,
some are probabilistic,
some are experimental,
some are rollout-specific,
some are useful only as escalation hints.

That difference matters because calibration only makes sense for signals that genuinely behave like probabilistic estimates. OpenAI’s agent safety guidance and OWASP’s prompt injection guidance both push toward constrained workflows and system design, not blind faith in any one classifier score. (OWASP Gen AI Security Project)

3) Escalation trigger vs policy basis

This is the subtlest gap.

A signal can play different jobs:

“look deeper”
“add supporting evidence”
“hard veto”
“this is the actual reason for the decision”

Those are not the same.

If the architecture does not explicitly preserve that distinction, then later readers and later components can mistake “this appeared in the trace” for “this was the real basis of the decision.” That is exactly the sort of boundary confusion recent work on prompt injection calls out. The “role confusion” paper makes the stronger mechanistic claim that models infer authority from how text looks , not reliably from where it came from. That means the surrounding architecture has to preserve role and standing very clearly, or the model will not. (arXiv)

Why this matters in practice

Calibration

If different kinds of evidence get blended too early, calibration becomes muddied.

A calibrated probability, a hard rule hit, and a session-local anomaly are not the same kind of thing. If they all get mixed into one “risk score,” the number may still look precise while no longer having one clean meaning. That makes thresholding feel more scientific than it really is. (OWASP Gen AI Security Project)

Replay

Replay is not only about replaying inputs.

It is about replaying the meaning of findings under the same assumptions. If the logs tell you that something fired, but do not clearly say whether it was deterministic, local, calibrated, advisory, or final, then replay can reproduce the event while still failing to reproduce its real standing. That is why reproducibility and traceability are such a strong theme in the discussion. (Hugging Face Forums)

False-positive analysis

This is where implicit status hurts a lot.

If a block happened, you want to know whether it came from:

a hard rule,
an over-sensitive model,
session carry-over,
a provisional heuristic,
or a deeper escalation policy.

If those categories are not explicit, FP analysis becomes interpretive instead of diagnostic. You can still investigate it, but with more guesswork than you want in a serious system. (Hugging Face Forums)

Cross-layer leakage

This is the deepest systems risk.

A weak signal from one layer can silently become a strong assumption in another layer. That creates:

double counting,
hidden hardening,
confidence inflation,
and brittle behavior.

OpenAI’s recent agent-safety guidance is very relevant here: it argues that in real agent systems, the right goal is not perfect input detection, but constraining the impact of manipulation even when some attacks succeed. Anthropic says something similar for browser agents: even a 1% attack success rate is still meaningful risk, so you cannot rely on one layer’s judgment alone. (OpenAI)

The easiest way to fix it

The architecture would become much clearer if every finding carried a small, explicit status envelope.

Something like:

evidence kind deterministic rule | calibrated model | provisional heuristic | stateful/session signal | latent probe
scope global | tenant | surface | session | turn | artifact
decision role advisory | escalation trigger | supporting evidence | hard veto | final policy basis
calibration state calibrated | uncalibrated | not applicable | shadow-only
replay stability deterministic | version-stable | session-dependent | nondeterministic

That would make the system preserve not just findings, but also the standing of findings.

The short conclusion

So the answer is:

Yes , the layers seem to preserve some status, not just content.
No , they do not yet seem to preserve all the status distinctions you named as fully explicit interface semantics.
The clearest explicit distinctions are:
- deterministic vs semantic/latent paths,
- stateless vs stateful risk,
- escalation vs simple one-shot filtering. (Hugging Face Forums)
The distinctions that still seem partly implicit are:
- local/session-specific vs globally portable,
- calibrated finding vs provisional heuristic,
- escalation trigger vs actual policy basis. (Hugging Face Forums)

And yes, if those remain implicit, they can absolutely affect calibration, replay, FP analysis, and subtle leakage of one layer’s assumptions into another. The clean next step is not necessarily more layers. It is making the status of each signal as explicit as the signal itself. (The NIST technical series.)

The easiest way to fix it

Make every layer output two things :

the finding
a small status card about the finding

Right now, many systems only pass forward something like:

score = 0.84
fired = true
reason = prompt_injection

That is not enough.

The next layer still does not know:

Is this a hard rule or a soft suspicion?
Is it local to this session or globally valid?
Is it calibrated or experimental?
Is it only a reason to escalate, or is it enough to block?

That missing information is what causes confusion later.

The 5 fields to add

1) `evidence_kind`

What kind of finding is this?

Examples:

deterministic_rule
calibrated_model
heuristic
integrity_violation
session_signal
experimental_probe

Why it matters: A regex hit is not the same as a model score.

2) `scope`

How far does this finding apply?

Examples:

global
tenant
surface
session
turn
artifact

Why it matters: A session-local warning should not quietly become a global rule.

3) `decision_role`

What is this finding allowed to do?

Examples:

advisory
escalation_trigger
supporting_evidence
hard_veto
final_policy_basis

Why it matters: Some signals should only say “look deeper.” Others are strong enough to say “stop.”

4) `calibration_state`

How should the score be interpreted?

Examples:

calibrated
uncalibrated
not_applicable
shadow_only
drifted

Why it matters: A calibrated probability and an experimental score should not look identical.

5) `replay_stability`

How stable should this be in replay?

Examples:

deterministic
version_stable
session_dependent
nondeterministic

Why it matters: Replay should tell you what must match and what may vary.

Very simple example

{
  "layer": "semantic_gate",
  "score": 0.84,
  "reason_code": "INDIRECT_INJECTION_SUSPECTED",
  "evidence_kind": "calibrated_model",
  "scope": "turn",
  "decision_role": "escalation_trigger",
  "calibration_state": "calibrated",
  "replay_stability": "version_stable"
}

And a very different one:

{
  "layer": "tool_input_parser",
  "reason_code": "DUPLICATE_JSON_KEYS",
  "evidence_kind": "integrity_violation",
  "scope": "artifact",
  "decision_role": "hard_veto",
  "calibration_state": "not_applicable",
  "replay_stability": "deterministic"
}

Both are findings. But they are not the same kind of finding.

Why this helps immediately

It improves four things fast:

Calibration : not every score gets treated like the same kind of probability.
Replay : you know what should match exactly and what may differ.
False-positive analysis : you can tell whether a bad decision came from a hard rule, a model, a session signal, or an experiment.
Cross-layer leakage : a weak signal stops silently turning into a strong one just because it moved deeper into the system.

The practical rule

A good rule is:

No finding should be allowed to do more decision work than its status card says it can do.

That means:

an advisory signal cannot block by itself
an escalation_trigger can only deepen routing
a hard_veto can stop execution immediately
a shadow_only signal cannot affect production decisions
a session-scoped signal cannot quietly become global

The easiest rollout

Do this in two steps:

First, keep the current logic the same and just add the 5 fields to every layer output.

Then, once those fields are present in traces, update fusion so it respects them.

That is the easiest fix because it does not require a new model, a new detector, or a new architecture. It only requires making the meaning of each finding explicit.

The simple way to think about it

What seems explicit already

1) Deterministic hit vs probabilistic suspicion

2) Escalation vs final conclusion

3) Stateful vs stateless risk

What still seems partly implicit

1) Local signal vs portable rule

2) Calibrated finding vs provisional heuristic

3) Escalation trigger vs policy basis

Why this matters in practice

Calibration

Replay

False-positive analysis

Cross-layer leakage

The easiest way to fix it

The short conclusion

The easiest way to fix it

The 5 fields to add

1) evidence_kind

2) scope

3) decision_role

4) calibration_state

5) replay_stability

Very simple example

Why this helps immediately

The practical rule

The easiest rollout

Discussion in the ATmosphere

1) `evidence_kind`

2) `scope`

3) `decision_role`

4) `calibration_state`

5) `replay_stability`