Bottom line: By automatically composing 50 base environments, RACES matches the training performance obtained from 300 individual environments.
Researchers introduce a framework that treats verifiable environments as recursively composable building blocks in order to scale reinforcement learning for language models more efficiently. The method improves reasoning in DeepSeek-R1-Distill-Qwen-14B and Qwen3-14B by 3.1 and 2.3 points, respectively, with lower environment construction effort and training overhead.
RACES (Recursive Automated Composition for Environment Scaling) builds on a core insight: when the output type of one environment matches the input type of another, the two can be automatically combined into a new verifiable environment. On this principle, the authors build a composition system with four operators: SEQUENTIAL (sequential execution), PARALLEL (concurrent execution), SORT (sorting), and SELECT (selection). Because a composed environment is itself verifiable, it can be composed again, and the operators generate different reasoning patterns that promote the model’s generalization capability.
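The type-matching idea can be illustrated with a minimal Python sketch. Everything in it is an assumption made for illustration and not the authors' implementation: the `Env` class, its fields, the `sequential` and `parallel` helpers, and the toy `sum` and `square` environments are all hypothetical names. SORT and SELECT are left out because the summary does not describe how they behave.

```python
from dataclasses import dataclass
from typing import Any, Callable


@dataclass(frozen=True)
class Env:
    """Hypothetical verifiable environment: a typed task with a reference solver."""
    name: str
    in_type: type
    out_type: type
    solve: Callable[[Any], Any]  # ground-truth solution used for verification

    def verify(self, x: Any, answer: Any) -> bool:
        # An answer is accepted only if it equals the reference solution.
        return self.solve(x) == answer


def sequential(first: Env, second: Env) -> Env:
    """Chain two environments; allowed only when the types line up."""
    if first.out_type is not second.in_type:
        raise TypeError(
            f"{first.name} outputs {first.out_type.__name__}, "
            f"but {second.name} expects {second.in_type.__name__}"
        )
    return Env(
        name=f"SEQUENTIAL({first.name}, {second.name})",
        in_type=first.in_type,
        out_type=second.out_type,
        solve=lambda x: second.solve(first.solve(x)),
    )


def parallel(left: Env, right: Env) -> Env:
    """Pair two environments; the input and the answer are both 2-tuples."""
    return Env(
        name=f"PARALLEL({left.name}, {right.name})",
        in_type=tuple,
        out_type=tuple,
        solve=lambda xs: (left.solve(xs[0]), right.solve(xs[1])),
    )


# Two toy base environments (assumed for illustration only).
sum_list = Env("sum", list, int, lambda xs: sum(xs))
square = Env("square", int, int, lambda n: n * n)

# The composed environment is itself an Env, so it can be composed again.
sum_then_square = sequential(sum_list, square)
nested = sequential(sum_then_square, square)
both = parallel(sum_list, square)

print(sum_then_square.name, sum_then_square.solve([1, 2, 3]))
print(nested.name, nested.verify([1, 2, 3], 1296))
print(both.name, both.solve(([1, 2], 3)))
```

In this sketch, `sequential(square, sum_list)` raises a `TypeError` because `square` produces an `int` while `sum` expects a `list`, which is the kind of mismatch a type check would filter out before an environment is generated.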
The framework was implemented and evaluated with 300 individual environments. The results show consistent improvements from RL training on composed environments. Across six benchmarks, DeepSeek-R1-Distill-Qwen-14B achieved an average gain of 3.1 points (from 48.2 to 51.3), while Qwen3-14B improved by 2.3 points (from 58.8 to 61.1). Critically, these scores come from benchmarks that were not consulted while the training environments were being constructed.
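As a quick arithmetic check, the gains follow directly from the before and after scores reported above. The snippet is only a sketch; the dictionary layout and variable names are illustrative.

```python
# Average benchmark scores (before, after) as reported in the summary.
scores = {
    "DeepSeek-R1-Distill-Qwen-14B": (48.2, 51.3),
    "Qwen3-14B": (58.8, 61.1),
}

# Round to one decimal place to avoid floating-point noise in the difference.
gains = {
    model: round(after - before, 1)
    for model, (before, after) in scores.items()
}

for model, gain in gains.items():
    print(f"{model}: +{gain} points")
```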
Scalability is where the efficiency gain shows most clearly: with only 50 base environments, RACES achieves performance equivalent to training on 300 individual environments. This represents a significant reduction in environment construction effort and training overhead. The method thus overcomes the linear scaling of manual environment construction, where every new training environment has to be built by hand, and opens the way to faster iteration on the reasoning capabilities of LLMs.
Source: arxiv.org · Published June 9, 2026
Lumi AI News — AI-assisted curation pursuant to Art. 50 EU AI Act. Paraphrasing and classification by Lumi News Pipeline v1.6.5.