In a nutshell: Arbor enables AI-driven research through systematic hypothesis management and, across six test tasks, achieved on average 2.5 times the relative held-out gains of Codex and Claude Code.
Anthropic has developed Arbor, a framework for autonomous research that empowers AI agents to independently test hypotheses over extended periods, interpret results, and integrate insights into subsequent experiments. The system combines a coordinating agent with specialized executors and a persistent hypothesis tree structure to build research findings cumulatively rather than as isolated attempts.
The core of the Arbor framework consists of three components. A central coordinator steers the overarching research strategy via Hypothesis Tree Refinement (HTR), which operates on a persistent tree linking hypotheses, artifacts, evidence, and distilled insights with one another. Short-lived executor agents implement and test individual hypotheses in isolated working environments. As results flow back, Arbor updates the tree, propagates reusable lessons, refines the search boundary, and integrates verified improvements.
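To make the coordinator, executor, and tree roles concrete, here is a minimal Python sketch of such a loop. It is not Arbor's implementation: every name in it (`HypothesisNode`, `Result`, `open_leaves`, `coordinate`) is hypothetical, the executor is a plain callable standing in for an agent in an isolated environment, and "verified" simply means the score beat the best score seen so far. Refining the search boundary, that is, proposing new child hypotheses, is left out for brevity.

```python
from __future__ import annotations

from dataclasses import dataclass, field
from typing import Callable, Iterator, Optional


@dataclass
class HypothesisNode:
    """One node of a hypothesis tree (field names are illustrative assumptions)."""
    statement: str
    parent: Optional[HypothesisNode] = field(default=None, repr=False, compare=False)
    children: list[HypothesisNode] = field(default_factory=list)
    artifact: Optional[str] = None       # e.g. a patch or checkpoint id from an executor
    evidence: Optional[float] = None     # e.g. a validation score
    insights: list[str] = field(default_factory=list)
    status: str = "open"                 # "open", "verified", or "rejected"

    def add_child(self, statement: str) -> HypothesisNode:
        child = HypothesisNode(statement, parent=self)
        self.children.append(child)
        return child


@dataclass
class Result:
    """What a (hypothetical) executor hands back to the coordinator."""
    artifact: str
    score: float
    insight: str


def open_leaves(node: HypothesisNode) -> Iterator[HypothesisNode]:
    """Yield untested leaf hypotheses, i.e. the current search frontier."""
    if not node.children and node.status == "open":
        yield node
    for child in node.children:
        yield from open_leaves(child)


def coordinate(
    root: HypothesisNode,
    executor: Callable[[str], Result],
    baseline: float,
    rounds: int = 3,
) -> float:
    """Test frontier hypotheses, record evidence, and return the best score."""
    best = baseline
    for _ in range(rounds):
        frontier = list(open_leaves(root))
        if not frontier:
            break
        for node in frontier:
            result = executor(node.statement)
            node.artifact, node.evidence = result.artifact, result.score
            node.insights.append(result.insight)
            # Propagate the lesson upward so later siblings can reuse it.
            ancestor = node.parent
            while ancestor is not None:
                ancestor.insights.append(result.insight)
                ancestor = ancestor.parent
            # Integrate only improvements over the best score so far.
            if result.score > best:
                node.status, best = "verified", result.score
            else:
                node.status = "rejected"
    return best
```

In this sketch a caller would build a root node, attach candidate hypotheses with `add_child`, and pass any function that maps a hypothesis statement to a `Result`. Because every node keeps its artifact, evidence, and insights, the tree doubles as an audit trail of which hypotheses were tried and why they were kept or rejected.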
In the practical evaluation setting “Autonomous Optimization” (AO), the agent iteratively improves an initial research artifact through experiments without step-by-step human oversight. Anthropic tested Arbor on six real research tasks in the areas of model training, harness engineering, and data synthesis. The system achieved the best held-out results on all six tasks and, on average, 2.5 times the relative held-out gains of Codex and Claude Code under an identical task interface and the same resource budget.
On MLE-Bench Lite, Arbor with GPT-5.5 achieved an Any Medal rate of 86.36%, the strongest result among the systems compared in the study. For CTOs, Arbor is relevant because it demonstrates how AI systems can independently manage longer-term research cycles, an approach that transfers beyond fundamental research to internal optimization tasks, model engineering, and data pipeline improvements. Explicit tracking of hypotheses and insights also makes automated research decisions easier to trace.
Source: arxiv.org · Published June 9, 2026
Lumi AI News — AI-assisted curation pursuant to Art. 50 EU AI Act. Paraphrase and classification through Lumi News Pipeline v1.6.5.