SafePyramid: Benchmark Reveals Weaknesses in LLM Guardrails for Context-Dependent Policies

Bottom Line: Even GPT-4.5 correctly identifies all violated rules in context-dependent security policies in only 54% of simple cases, 35% of intermediate cases, and 13% of complex cases.

Researchers have developed SafePyramid, a benchmark comprising 1,000 multi-turn conversation scenarios and 3,000 application-specific security policies, to test how well language models and guardrails detect unsafe interactions according to custom policies. The results reveal significant deficiencies.

The scenarios span 10 domains, and the benchmark contains 61,699 distinct natural language rules in total. The scenarios are organized into three difficulty levels: L0 tests understanding of individual rules, L1 evaluates logical reasoning over rule dependencies, and L2 requires adaptation to entirely new, context-defined policy frameworks.

The evaluation encompassed ten leading language models and five configurable guardrail systems. GPT-4.5 delivered the best results: it correctly identified all violated rules in 54.0% of L0 cases, only 35.3% of L1 cases, and merely 12.9% of L2 cases. These performance drops demonstrate that even state-of-the-art models struggle to understand rule dependencies and adapt to novel policy definitions.
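The headline figures count a case as correct only when every violated rule is identified, which is a strict all-or-nothing criterion. The following is a minimal sketch of such an exact-match score, assuming each scenario has a gold set of violated rule IDs and the model returns a predicted set. The function name, the rule IDs, and the exact scoring procedure are illustrative assumptions, not the benchmark's published implementation.

```python
from typing import List, Set


def exact_match_rate(predicted: List[Set[str]], gold: List[Set[str]]) -> float:
    """Fraction of scenarios whose predicted violated-rule set equals the gold set.

    Assumption: a scenario counts as correct only if the two sets are identical,
    so a missed rule or a spurious extra rule both make the scenario a failure.
    """
    if len(predicted) != len(gold):
        raise ValueError("predicted and gold must cover the same scenarios")
    if not gold:
        return 0.0
    hits = sum(1 for p, g in zip(predicted, gold) if p == g)
    return hits / len(gold)


# Hypothetical rule IDs for three scenarios (not taken from the benchmark).
gold = [{"R1", "R4"}, {"R2"}, {"R3", "R5", "R7"}]
predicted = [
    {"R1", "R4"},   # exact match
    {"R2", "R6"},   # spurious extra rule, counts as a failure
    {"R3", "R5"},   # one violated rule missed, counts as a failure
]

rate = exact_match_rate(predicted, gold)
print(f"exact-match rate: {rate:.3f}")
```

Under this kind of criterion, partially correct answers earn no credit, which is one reason scores can fall sharply as the number of interacting rules per scenario grows.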

For CTOs, this means current guardrail systems are not yet reliable enough to enforce custom security policies in business-critical applications. The researchers emphasize that stronger mechanisms are needed to apply policies consistently, resolve rule dependencies, and generalize to unfamiliar policy frameworks. The problem is most acute with multi-layered rule sets, where models fail to account for interactions between different security requirements.


Source: arxiv.org · Published 28 June 2026
Lumi AI News — AI-assisted curation pursuant to Art. 50 EU AI Act. Paraphrase and classification by Lumi News Pipeline v1.7.2.