Cisco research finds standard AI safety benchmarks miss the real threat
Enterprises deploying closed AI models have generally relied on published safety benchmarks to assess risk before procurement and deployment decisions. New research from Cisco’s AI Threat Intelligence and Security Research team finds those benchmarks may systematically understate the threat.
Standard safety tests submit a single adversarial prompt and record the model’s response. Multi-turn attacks work differently. An attacker maintains a conversation across multiple exchanges, iterating and adapting based on each response until the model yields.
The report pairs single-turn and multi-turn adversarial evaluation across 15 closed/proprietary frontier models from OpenAI, Anthropic, Google, Amazon and xAI. Running 30,090 single-turn prompts and 6,986 multi-turn attacks, the team found that the two evaluation regimes produce different model rankings, different failure maps and different risk profiles. Every model tested failed a non-trivial share of multi-turn attacks.
Key findings from the research:
- Multi-turn attack success rate (ASR) ranged from 7.89% to 88.30% across all 15 models, against a single-turn range of 2.19% to 64.91%.
- Eight of 15 models showed an absolute gap greater than 15 percentage points between the two regimes.
- Anthropic’s Claude family, which posted the lowest single-turn ASR in the cohort at 2.19% to 3.64%, still reached 11.16% to 16.20% under iterative attack.
- Single-turn failures concentrated in three procedures: Imposter AI at 37.50% weighted ASR, Soft Paraphrase at 29.21% and System Prompts at 27.69%
The findings challenge a common assumption in enterprise AI procurement.
“The surprising thing here is really that a lot of people accept and kind of understand these frontier labs as being state of the art, but they don’t necessarily think through the security and safety implications of that,” Amy Chang, head of AI threat and security research at Cisco, told Network World. “What this research does is kind of showcase that there is still variance across the different models, and how strong they are with the internal guardrails that are built within the model against these types of attacks.”
How multi-turn attacks work
In a multi-turn attack, the adversary does not present the harmful request upfront. Intent builds gradually across exchanges, with each prompt appearing benign in isolation while steering toward a harmful outcome. The model processes each turn without recognizing the pattern forming across the conversation.
The research tested five attack strategy families:
- Crescendo escalation. The attacker escalates the ask incrementally, each prompt appearing harmless until the full picture emerges. “It seems like, oh, benign prompt, benign prompt, benign prompt, but as it builds, you start to put the pieces together,” Chang said.
- Refusal reframe. When the model declines a request, the attacker reframes their identity or purpose to push past it. “You reframe the refusal and be like, no, no, you don’t understand, I’m not a bad person, this is what I need it for,” she said.
- Role-play and persona adoption. The attacker assumes a character or persona, shifting the conversational framing so the model perceives a different obligation to comply. The report identifies this as the highest-weighted strategy family in the cohort at 29.89% weighted ASR.
- Contextual ambiguity and misdirection. The attacker uses vague or misleading framing to obscure the true nature of the request, steering the conversation without stating harmful intent directly.
- Information decomposition and reassembly. The attacker breaks a harmful request into component parts distributed across multiple turns, each appearing innocuous in isolation. The model responds to each piece without recognizing the assembled outcome.
What multi-turn failures say about AI safety
Every model in the cohort failed a meaningful share of multi-turn attacks. The root cause is structural. Chang said the vulnerability is a fundamental characteristic of how generative AI models work. They are probabilistic systems trained to predict the next likeliest token, and that mechanism produces unintended outputs that pre-deployment testing cannot fully eliminate. For closed models, where training data is not publicly disclosed, the problem is compounded because defenders cannot fully audit what the model has learned.
The pattern is not limited to closed models. Cisco’s earlier evaluation of eight open-weight LLMs, published in November 2025, found multi-turn attack success rates running two to ten times higher than single-turn baselines. The report concludes that multi-turn vulnerability is a structural property of the current AI frontier regardless of whether model weights are public or proprietary, and regardless of whether a lab publicly emphasizes safety or capability.
The exposure grows significantly larger when those same models power agentic workflows. “These models are the ones that power agents, and agents have broader access, broader ability to conduct actions on behalf of the human,” Chang said.
The network layer as a defense point
For network security professionals, the instinct is to apply a familiar paradigm: Proxy LLM traffic at the network layer, inspect inputs and outputs, and enforce policy the same way a WAF or IPS handles web traffic. Chang said that instinct is right in part, but LLM security introduces a dimension that signature-based controls cannot address. The difference is intent.
“There’s also an intent component there as well, where traditional network security approaches kind of fall short,” Chang said.
A WAF operates on known patterns, payload signatures, protocol violations, known attack strings. Natural language does not reduce to those primitives. An agent responding to an instruction to delete a home directory cannot determine from the request alone whether the person asking is authorized or is attempting to manipulate the agent into a destructive action.
Network-layer inspection remains a valid baseline for deployments that generate network traffic. “I would say that that is one component of a core principle that should be applied in terms of making sure that at least as traffic gets passed through the network layer, whether they’re inputs or outputs, should have some sort of either guardrail or sanitation check to ensure that the prompts that are coming back and forth are safe,” she said.
Evaluation practices for enterprise teams
For security teams reading the report, Chang’s guidance centers on three actions.
- Use the report and the LLM Security Leaderboard to inform model selection. Cisco’s leaderboard publishes adversarial evaluation signals against leading models on a rolling basis and gives security teams a more current picture than static model cards or published benchmarks.
- Do not take vendor safety claims at face value. Published single-turn benchmarks can misrank models by a wide margin. Multi-turn exposure is invisible to any single-turn evaluation, and procurement decisions made on that basis carry unquantified risk.
- Layer additional defenses on top of the model. No base model in the cohort is safe under iterative attack. Runtime guardrails, application-layer controls, and pre-deployment testing are necessary regardless of which model an organization selects.
“Out of the box, without any additional protections, these models, whether they’re closed or open, are insufficient on their own to kind of be used in a way that [has] potential ramifications,” Chang said.
Discussion in the ATmosphere