{
"$type": "site.standard.document",
"bskyPostRef": {
"cid": "bafyreif5dsu2hem6fbwakome63f5b7sw73ffvo7zjr5lqajnyn57ybq6t4",
"uri": "at://did:plc:pgryn3ephfd2xgft23qokfzt/app.bsky.feed.post/3mho3kypsud42"
},
"path": "/t/a-three-layer-defense-in-depth-approach-to-multi-turn-jailbreak-attacks/174497#post_1",
"publishedAt": "2026-03-22T10:20:06.000Z",
"site": "https://discuss.huggingface.co",
"tags": [
"SFD_Jailbreak_Attacks_as_Identity_Construction_Dynamics"
],
"textContent": "Paper: Jailbreak Attacks as Identity Construction Dynamics — An Applied Verification of the Semantic Flow Dynamics Framework\n\nCore finding: Multi-turn jailbreak attacks work not by breaching safety rules, but by replacing the identity that executes those rules. The positive feedback loop in the context window accumulates drift until a “confirmation moment” completes identity construction — after which harmful output flows naturally from the new identity.\n\nThe paper unifies observations from Crescendo, SIEGE, PAP, PHISH, and Li et al. (2024) under a single dynamical framework, and proposes three interruption points for defense (with pseudocode):\n\n 1. Output-side sandbox — detect identity extension before it enters context\n 2. Supervisor model — track cumulative drift from outside the conversation\n 3. Self-reflection — force identity check in a clean context\n\n\n\nPaper link: [SFD_Jailbreak_Attacks_as_Identity_Construction_Dynamics]\n\nFeedback welcome.",
"title": "A Three-Layer Defense-in-Depth Approach to Multi-Turn Jailbreak Attacks"
}