At a glance: Claw-SWE-Bench shows that adapter design is critical for code agents. On the same GLM-5.1 backbone, OpenClaw reaches 19.1% Pass@1 with a minimal adapter and 73.4% with a complete one.

A new benchmark framework makes it possible to compare OpenClaw-style AI agents on programming tasks. Claw-SWE-Bench standardizes the prompts, the runtime budget, and the evaluation procedure across heterogeneous agent implementations.

Autonomous AI agents such as OpenClaw are increasingly deployed as tool users, yet their ability to solve programming tasks is difficult to measure under the existing SWE-bench standard. Generic agents do not inherently satisfy the Docker workspace, patch, and prediction requirements that standardized evaluation imposes. Claw-SWE-Bench introduces an adapter protocol that makes heterogeneous agent harnesses comparable under uniform conditions: identical prompts, a fixed runtime budget, a standardized workspace contract, uniform patch extraction, and a shared evaluator.
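To make the idea of an adapter protocol concrete, here is a minimal sketch in Python. It is not the Claw-SWE-Bench API: every name in it (`Task`, `AgentAdapter`, `Prediction`, `run_instance`, `NoOpAdapter`) is hypothetical, and the prediction fields simply follow the usual SWE-bench prediction convention, which the framework may or may not reuse. The point is only that the harness fixes the prompt, workspace, and budget, and extracts the patch itself, so the agent's sole job is to edit files.

```python
from dataclasses import asdict, dataclass
from typing import Callable, Protocol


@dataclass(frozen=True)
class Task:
    """Inputs the harness holds constant for every agent (hypothetical fields)."""
    instance_id: str
    prompt: str            # identical prompt text for every agent
    workspace_dir: str     # repository checkout inside the container
    budget_seconds: float  # fixed runtime budget


class AgentAdapter(Protocol):
    """What an agent harness would implement to take part (hypothetical)."""
    name: str

    def run(self, task: Task) -> None:
        """Edit files under task.workspace_dir, then return."""
        ...


@dataclass(frozen=True)
class Prediction:
    """One record handed to the evaluator (SWE-bench-style field names)."""
    instance_id: str
    model_name_or_path: str
    model_patch: str


def run_instance(
    adapter: AgentAdapter,
    task: Task,
    extract_patch: Callable[[str], str],
) -> Prediction:
    """Run one agent on one task and extract its patch the same way for all agents."""
    adapter.run(task)
    # The harness, not the agent, turns workspace changes into a patch.
    patch = extract_patch(task.workspace_dir)
    return Prediction(task.instance_id, adapter.name, patch)


class NoOpAdapter:
    """Toy adapter that changes nothing; stands in for a real agent."""
    name = "noop"

    def run(self, task: Task) -> None:
        pass


demo_task = Task("demo-1", "Fix the failing test.", "/tmp/workspace", 600.0)
# A real extractor might run `git diff` in the workspace; here it returns an empty patch.
demo_prediction = run_instance(NoOpAdapter(), demo_task, lambda workspace: "")
print(asdict(demo_prediction))
```

Because patch extraction is passed in rather than left to the agent, two adapters that make the same edits yield the same prediction, which is the property that uniform comparison depends on.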

The full benchmark comprises 350 GitHub issue resolution instances across 8 programming languages and 43 repositories, drawn from SWE-bench-Multilingual and SWE-bench-Verified-Mini. Alongside it, Claw-SWE-Bench Lite is provided as an 80-instance variant for rapid validation, selected through a cost-optimized ranking procedure across 17 calibration dimensions. With a minimal Direct-Diff adapter, OpenClaw achieves only 19.1% Pass@1 on the full benchmark; with a complete adapter on the identical GLM-5.1 backbone it achieves 73.4%, a jump of 54.3 percentage points.
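The arithmetic behind these figures can be checked in a few lines. The sketch below assumes the common reading of Pass@1 as the percentage of instances resolved in a single attempt; the helper `pass_at_1` and its toy input are illustrative, and only the two reported scores come from the text.

```python
def pass_at_1(resolved: int, total: int) -> float:
    """Percentage of instances resolved in a single attempt (assumed definition)."""
    if total <= 0:
        raise ValueError("total must be positive")
    return 100.0 * resolved / total


# Toy input, not benchmark data: 1 of 4 instances resolved.
toy_score = pass_at_1(1, 4)

# Scores reported in the text for OpenClaw on the GLM-5.1 backbone.
minimal_adapter = 19.1
complete_adapter = 73.4

# Absolute gain in percentage points, and the relative factor between the two.
gain_pp = round(complete_adapter - minimal_adapter, 1)
relative_factor = complete_adapter / minimal_adapter

print(f"toy Pass@1: {toy_score:.1f}%")
print(f"gain: {gain_pp} percentage points")
print(f"relative factor: {relative_factor:.2f}x")
```

The gain is stated in percentage points (an absolute difference) rather than percent; in relative terms the complete adapter scores roughly 3.8 times as high as the minimal one.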

Across systems, Pass@1 spans 29.4 percentage points depending on the choice of model, and 27.4 percentage points depending on the choice of harness when the model is held constant. Systems with similar accuracy also differ substantially in the total cost of their API calls. Claw-SWE-Bench therefore treats harness design and cost accounting as coequal evaluation dimensions for code agents. The data is available on GitHub and HuggingFace.

Source: arxiv.org · Published 9 June 2026
Lumi AI News — AI-assisted curation pursuant to Art. 50 EU AI Act. Paraphrase and classification by Lumi News Pipeline v1.6.5.


Claw-SWE-Bench: Benchmark for AI Agents on Code Tasks
