
PACE: Predicting Agent Benchmark Performance from Low-Cost Individual Tests

In brief: A framework that predicts agent benchmark scores from low-cost individual tests achieves about 85% pairwise ranking accuracy at less than 1% of the cost of a full evaluation.

Researchers present PACE, a framework that predicts how LLM agents will perform on expensive benchmarks such as SWE-Bench and GAIA from much cheaper tests of individual capabilities. The method cuts evaluation costs to below 1% of a full agent evaluation while keeping the prediction error in the single digits (a mean absolute error below 4% in the reported tests).

The problem is scalability: a complete evaluation of LLM agents on established benchmarks costs several thousand dollars and takes days of compute on complex infrastructure. Non-agentic benchmarks, which test isolated capabilities such as reasoning or code generation, are fast and cheap to run by comparison.

PACE addresses this gap with a regression approach: from a pool of existing non-agentic evaluations, the framework selects a compact subset of instances whose aggregated scores reliably predict model performance on agent benchmarks. The selection combines two complementary strategies: local selection, which picks instances by their relevance to the target benchmark, and global selection, which picks instances that are informative overall.
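The idea can be illustrated with a small sketch. It is not the paper's algorithm: the function names, the use of Pearson correlation as the "local" relevance criterion, score variance as the "global" informativeness criterion, and a one-variable linear fit are all assumptions chosen to keep the example short.

```python
from statistics import mean, pvariance


def pearson(xs, ys):
    """Pearson correlation; returns 0.0 if either input is constant."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    if vx == 0 or vy == 0:
        return 0.0
    return cov / (vx * vy) ** 0.5


def select_instances(scores, target, k_local, k_global):
    """Pick proxy instances (hypothetical criteria, not the paper's).

    scores: one row per model, one column per cheap test instance.
    target: known agent-benchmark score per model.
    """
    n_inst = len(scores[0])
    cols = [[row[j] for row in scores] for j in range(n_inst)]
    # "Local": instances most correlated with the target benchmark.
    by_relevance = sorted(range(n_inst), key=lambda j: -pearson(cols[j], target))
    local = by_relevance[:k_local]
    # "Global": remaining instances that best separate the models.
    rest = [j for j in range(n_inst) if j not in local]
    by_variance = sorted(rest, key=lambda j: -pvariance(cols[j]))
    return sorted(local + by_variance[:k_global])


def fit_line(xs, ys):
    """Ordinary least squares for y = slope * x + intercept."""
    mx, my = mean(xs), mean(ys)
    vx = sum((x - mx) ** 2 for x in xs)
    slope = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / vx if vx else 0.0
    return slope, my - slope * mx


def predict_agent_score(scores, target, new_row, k_local=2, k_global=1):
    """Predict a new model's agent score from its selected proxy instances."""
    chosen = select_instances(scores, target, k_local, k_global)
    xs = [mean(row[j] for j in chosen) for row in scores]
    slope, intercept = fit_line(xs, target)
    return slope * mean(new_row[j] for j in chosen) + intercept


# Toy data (invented for illustration): 4 models x 4 cheap instances.
toy_scores = [[0, 0, 1, 0], [1, 0, 1, 1], [1, 1, 1, 0], [1, 1, 1, 1]]
toy_target = [0.1, 0.3, 0.5, 0.6]
print(select_instances(toy_scores, toy_target, k_local=2, k_global=1))
print(predict_agent_score(toy_scores, toy_target, new_row=[1, 1, 0, 1]))
```

In this toy setup, an instance that every model passes carries no signal and is never selected, which is the intuition behind choosing a compact, informative subset rather than running every cheap benchmark in full.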

In tests with 14 models, 4 agent benchmarks and 19 non-agentic benchmarks, PACE-Bench achieved a mean absolute error (MAE) below 4% under leave-one-out cross-validation, Spearman correlations above 0.80, and pairwise ranking accuracy around 85%. Costs were below 1% of a complete agent evaluation.
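For readers less familiar with the three reported metrics, the following sketch shows one common way to compute them from predicted and actual benchmark scores. The definitions are standard, but the exact variants used in the paper (for example, how ties are handled in the pairwise comparison) are not stated in this summary, so the choices below are assumptions, and the numbers are invented toy values.

```python
from itertools import combinations


def mean_absolute_error(pred, actual):
    """Average absolute gap between predicted and actual scores."""
    return sum(abs(p - a) for p, a in zip(pred, actual)) / len(actual)


def ranks(values):
    """1-based ranks, with tied values sharing their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = avg_rank
        i = j + 1
    return result


def spearman(pred, actual):
    """Spearman correlation: Pearson correlation of the two rank vectors."""
    rp, ra = ranks(pred), ranks(actual)
    mp, ma = sum(rp) / len(rp), sum(ra) / len(ra)
    cov = sum((x - mp) * (y - ma) for x, y in zip(rp, ra))
    vp = sum((x - mp) ** 2 for x in rp)
    va = sum((y - ma) ** 2 for y in ra)
    return cov / (vp * va) ** 0.5 if vp and va else 0.0


def pairwise_ranking_accuracy(pred, actual):
    """Share of model pairs whose predicted order matches the actual order.

    Assumption: pairs with identical actual scores are skipped.
    """
    pairs = [(i, j) for i, j in combinations(range(len(actual)), 2)
             if actual[i] != actual[j]]
    correct = sum((pred[i] - pred[j]) * (actual[i] - actual[j]) > 0
                  for i, j in pairs)
    return correct / len(pairs)


# Invented toy values: four models, with the last two predicted in the wrong order.
actual_scores = [0.10, 0.30, 0.50, 0.60]
predicted_scores = [0.15, 0.25, 0.55, 0.50]
print(mean_absolute_error(predicted_scores, actual_scores))
print(spearman(predicted_scores, actual_scores))
print(pairwise_ranking_accuracy(predicted_scores, actual_scores))
```

Under leave-one-out cross-validation, each model's prediction would come from a predictor fitted on all the other models, and the metrics would then be computed over those held-out predictions.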

For CTOs, the relevance is practical: PACE allows reliable estimates of agent capabilities during model development, selection and routing without the infrastructure burden of running expensive full benchmarks. An analysis of the selected proxy instances also showed which individual capabilities each agent benchmark specifically requires.


Source: arxiv.org · Published July 1, 2026
Lumi AI News — AI-assisted curation pursuant to Article 50 EU AI Act. Paraphrase and classification by Lumi News Pipeline v1.7.2.
