A new benchmark pinpoints the exact point at which medical AI models begin to hallucinate, enabling targeted countermeasures through trace-supervised fine-tuning.
The Claw-SWE-Bench framework shows that adapter design is critical for code agents: OpenClaw achieves 19.1% Pass@1 with a minimal adapter and 73.4% with a complete one.
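
To make the two Pass@1 figures above concrete, here is a minimal Python sketch. It assumes Pass@1 means the fraction of tasks solved on the first attempt, which is the usual reading of the metric; the blurb itself does not define it. The function name `pass_at_1` and the toy result list are hypothetical and are not taken from Claw-SWE-Bench. Only the two percentages come from the text.

```python
def pass_at_1(first_attempt_results):
    """Return the fraction of tasks whose first attempt passed.

    Assumption: one boolean per task, True if the first attempt succeeded.
    """
    if not first_attempt_results:
        raise ValueError("need at least one task result")
    passed = sum(1 for ok in first_attempt_results if ok)
    return passed / len(first_attempt_results)


# Hypothetical toy outcomes, for illustration only (not benchmark data).
toy_results = [True, False, True, True]
print(f"toy Pass@1: {pass_at_1(toy_results):.1%}")

# The two scores reported in the text, in percent.
minimal_adapter = 19.1
complete_adapter = 73.4
gap = complete_adapter - minimal_adapter
print(f"adapter gap: {gap:.1f} percentage points")
```

Under these assumptions, the gap between the two adapter configurations is 54.3 percentage points, an absolute difference rather than a relative improvement.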