Poisoned documents can turn reasoning-based AI guardrails into DoS weapons by leveraging security systems themselves as resource sinks—a new attack vector with concentration risks in shared governance infrastructure.
The White House removed Anthropic’s Fable model from the market with export controls after concerns about bypassed security safeguards, following failed intensive negotiations between government officials and CEO Amodei.
Grammar-Constrained Decoding (GCD), a technique for ensuring syntactically correct code, opens a new jailbreak method for attackers with a success rate over 30 percentage points higher than previous approaches.
Anthropic splits Claude Fable 5 into a public version (with safeguards) and a restrictive version (Claude Mythos 5 without security layers) for verified cybersecurity experts.
Multi-turn reasoning models can maintain safe surface metrics while their internal states are compromised across conversation turns or their secure internal logic is ignored in harmful outputs.