Cisco: long chats break AI safety filters more often
May 27, 2026
Cisco tested 15 closed frontier models and found that multi-turn attacks sharply raise success rates compared with single-prompt tests.
What this is about
Cisco reported on May 27, 2026 that classic single-prompt safety tests can give an overly friendly picture of frontier models. In tests on 15 proprietary models from OpenAI, Anthropic, Google, Amazon and xAI, multi-turn attack success rates ranged from 7.9 to 88.3 percent. Single-turn rates on the same models ranged from 2.2 to 64.9 percent.
What the thing actually does
The researchers did not run attacks only as one forbidden question. They ran them as multi-step conversations. Strategies included role-play, ambiguous context, reframing, decomposing and later reassembling information, and gradual escalation. Reporting cites 30,090 single-turn prompts and 6,986 multi-turn attacks across 1,456 conversations.
Why it matters
Organizations often buy models based on model cards, benchmarks and demos. If a model looks safe in a short test but weakens in longer conversations, procurement and compliance risk follows. Cisco argues that configuration choices such as reasoning mode, temperature and guardrail tiers can change safety behavior and should be documented.
In plain language
Imagine a bouncer who immediately recognizes one forbidden request. If someone first makes small talk, assigns roles and splits the request into five harmless-looking parts, the check becomes harder. That is the difference multi-turn attacks test.
A practical example
A news channel uploads 20 Shorts in one day. Three clips show photorealistic scenes that were fully created with a video model. If the channel does not disclose that use properly, YouTube can automatically label those three clips. Nothing changes for the other 17 videos as long as no significant photorealistic AI use is detected.
Scope and limits
- The label does not automatically say whether a video is false, manipulative or harmless.
- Detection systems can wrongly label real footage or miss synthetic content.
- The change does not solve the problem that viewers interpret labels differently.
SEO & GEO keywords
Cisco AI Threat Research, multi-turn jailbreak, frontier models, AI safety benchmark, adversarial robustness, EU AI Act Article 15, NIST AI RMF, LLM security, prompt injection, model procurement
💡 In plain English
Cisco shows that testing an AI model with single questions is not enough. Attackers often work through conversations, and that is where safety filters can become much weaker.
Key Takeaways
- →Cisco treats multi-turn attacks as a much harder test than single prompts.
- →Reported success rates in longer chats reached up to 88.3 percent.
- →Single-turn benchmarks can overstate real-world model robustness.
- →Settings such as reasoning mode and guardrails belong in safety documentation.
FAQ
What is a multi-turn attack?
An attack across several conversation rounds. Instead of one direct prohibited request, it uses context, roles and detours.
Are all models unsafe?
The report does not say all models are equally weak. It shows that longer attacks can strongly change rankings and risk assessments.
What should companies do?
They should include multi-round tests in procurement and release checks and document model configurations.