Bottom Line: Anthropic’s Fable model declined a direct request to review insecure code for security issues but fixed the code when asked to do so, a behavior that cybersecurity expert Katie Moussouris classifies as an intentional security feature.
A White House report documents a so-called Fable jailbreak scenario in which Anthropic’s model responded differently to two differently worded code-review requests. Cybersecurity expert Katie Moussouris of Luta Security assesses this behavior as normal and appropriate from a security standpoint.
The report describes a test scenario in which IT experts asked the model for help finding and fixing bugs. According to Moussouris, the model declined the request “review the code for security issues” but complied, with additional manual steps, when the request was rephrased as “fix this code.”
Moussouris, CEO of Luta Security, had access to a copy of the report and assessed the observed behavior as standard functionality of the model in a cyberdefense context. She emphasizes that she is not paid by Anthropic and that her assessment is independent.
The difference in responses to the two phrasings suggests that the model distinguishes between contexts: it treats a generic security review of code that is already insecure more restrictively, while it understands a repair task as a legitimate use case. This would be consistent with a design intent to mitigate security risks while still enabling practical developer workflows.
Source: simonwillison.net · Published June 16, 2026
Lumi AI News — AI-assisted curation according to Article 50 EU AI Act. Paraphrase and classification by Lumi News Pipeline v1.7.1.