Raw Record Source

{
  "$type": "site.standard.document",
  "bskyPostRef": {
    "cid": "bafyreif5dsu2hem6fbwakome63f5b7sw73ffvo7zjr5lqajnyn57ybq6t4",
    "uri": "at://did:plc:pgryn3ephfd2xgft23qokfzt/app.bsky.feed.post/3mho3kypsud42"
  },
  "path": "/t/a-three-layer-defense-in-depth-approach-to-multi-turn-jailbreak-attacks/174497#post_1",
  "publishedAt": "2026-03-22T10:20:06.000Z",
  "site": "https://discuss.huggingface.co",
  "tags": [
    "SFD_Jailbreak_Attacks_as_Identity_Construction_Dynamics"
  ],
  "textContent": "Paper: Jailbreak Attacks as Identity Construction Dynamics — An Applied Verification of the Semantic Flow Dynamics Framework\n\nCore finding: Multi-turn jailbreak attacks work not by breaching safety rules, but by replacing the identity that executes those rules. The positive feedback loop in the context window accumulates drift until a “confirmation moment” completes identity construction — after which harmful output flows naturally from the new identity.\n\nThe paper unifies observations from Crescendo, SIEGE, PAP, PHISH, and Li et al. (2024) under a single dynamical framework, and proposes three interruption points for defense (with pseudocode):\n\n  1. Output-side sandbox — detect identity extension before it enters context\n  2. Supervisor model — track cumulative drift from outside the conversation\n  3. Self-reflection — force identity check in a clean context\n\n\n\nPaper link: [SFD_Jailbreak_Attacks_as_Identity_Construction_Dynamics]\n\nFeedback welcome.",
  "title": "A Three-Layer Defense-in-Depth Approach to Multi-Turn Jailbreak Attacks"
}