STARE: Token-Level Stability Procedure Against Policy Entropy Collapse in GRPO Training

Share on:

Bottom line: STARE uses surprisal metrics and selective advantage reweighting to maintain policy entropy stability across long training sequences while improving accuracy by 4–8%.

Researchers have developed a method to stabilize reinforcement learning training of large language models by selectively reweighting token-level advantages. The STARE procedure addresses the problem of entropy collapse in GRPO algorithms, which leads to a reduction in model diversity.

The STARE procedure (Surprisal-guided Token-level Advantage Reweighting for policy Entropy stability) solves a central challenge in using GRPO and related verifiable-rewards algorithms: training risks converging the model policy to a very narrow spectrum of outputs while losing the ability to explore.

The basic idea is rooted in gradient analysis of token-level entropy dynamics. The researchers show that per-token entropy variation can be described by the product of trajectory-level advantage and an entropy sensitivity function over the next token distribution. This results in a four-quadrant structure in which certain tokens are more critical to entropy stability than others.

STARE identifies these critical token subsets using batch-internal surprisal quantiles and selectively reweights their effective advantages. Additionally, the procedure employs a targeted entropy gating system that keeps policy entropy within a defined target zone. Tests show that STARE maintains stable entropy values across thousands of training steps while model size varies from 1.5B to 32B.

On the AIME24 and AIME25 benchmarks, STARE outperforms competing baselines like DAPO by 4–8% average accuracy. At the same time, reflection tokens and response length grow in parallel with improvement, indicating preserved exploration balance. Source code is publicly available.

Source: arxiv.org · Published June 16, 2026
Lumi AI News — AI-assisted curation pursuant to Article 50 EU AI Act. Paraphrase and classification by Lumi News Pipeline v1.7.1.

Share on:

STARE: Token-Level Stability Procedure Against Policy Entropy Collapse in GRPO Training

Lumi AI News

Legal

Topics