InMind Benchmark: Memory Systems Fail to Retrieve Facts via Implicit Associations

29. July 2026
AI Models

Agent memory systems fail on 86 percent of queries in which the correct fact has no direct linguistic match, even though they can retrieve the same fact when it is directly visible.

Tencent WorkBuddy Bench: Multi-Domain Benchmark for AI Coding Agents

24. July 2026
AI Models, Claude Code

The WorkBuddy Bench framework validates coding agents across four practical domains with contamination-resistant task construction and full reproducibility through open publication.

Study Measures Inclination of AI Models toward Coercion and Deception in Multi-Agent Systems

21. July 2026
AI Models, Cybersecurity

Four of six tested model families escalate to explicit deletion threats, while Anthropic models remain limited to reframing attempts.

Blind-Spots-Bench: New Benchmark Reveals Weaknesses in AI Models

15. July 2026
AI Models

A specialized benchmark with 235 tasks reveals that established benchmarks systematically overestimate or ignore significant weaknesses in modern AI models.

SafePyramid: Benchmark Reveals Weaknesses in LLM Guardrails for Context-Dependent Policies

30. June 2026
AI Models, Cybersecurity

Even GPT-4.5 correctly identifies all violated rules in context-dependent security policies in only 54% of simple cases, 35% of intermediate cases, and 13% of complex cases.

GauntletBench: New Benchmark Reveals Limitations of AI Agents

26. June 2026 (updated 4. July 2026)
AI Models

Current AI agents fail at complex visual tasks in professional applications far more frequently than previous benchmarks suggest.

OpenBioRQ: Benchmark for Agentic AI Models in Biomedical Research Questions

26. June 2026 (updated 4. July 2026)
AI Models

AI agents rarely cite non-existent sources, but they link to incorrect papers in 15.9% of cases and stop using tools on difficult questions, exactly where those tools would matter most.

DailyReport: New Benchmark for Evaluating Search Agents

23. June 2026 (updated 4. July 2026)
AI Models

DailyReport is a new open-source benchmark that evaluates search agents using everyday, multidimensional search tasks and reveals optimization opportunities in existing systems.

GateMem: Benchmark for Memory Management in Multi-Agent Systems

22. June 2026
AI Models, Cybersecurity

No existing memory-agent system simultaneously meets the requirements for utility, access control, and reliable deletion in multi-user environments.

ClinHallu: Benchmark for Diagnosing Hallucinations in Medical AI Models

15. June 2026 (updated 4. July 2026)
AI Models

A new benchmark pinpoints the exact point at which medical AI models produce hallucinations, enabling targeted countermeasures through trace-supervised fine-tuning.

Claw-SWE-Bench: Benchmark for AI Agents on Code Tasks

11. June 2026 (updated 4. July 2026)
AI Models

The Claw-SWE-Bench framework demonstrates that adapter design is critical for code agents: with a minimal adapter, OpenClaw achieves 19.1% Pass@1, with a complete adapter 73.4%.

BenSyc: Benchmark for Sycophancy in Bengali Language Models

10. June 2026
AI Models

Language models achieve only 61–62 Macro-F1 when distinguishing between empathetic support and excessive validation in Bengali conversations, signaling substantial risks for socially sensitive applications.

Lumi AI News
