Key takeaway: Bebop combines rejection sampling with a TV loss to keep MTP acceptance rates stable during RL training, accelerating end-to-end training by up to 1.8x.

Researchers have developed a systematic method to address the rollout bottleneck in reinforcement learning (RL) for large language models. By combining rejection sampling with a new TV loss, they achieve up to 1.8x end-to-end acceleration in RL training.

Rollout generation is currently the central performance bottleneck in RL training pipelines for large language models. Multi-Token Prediction (MTP) is a natural remedy because it enables speculative decoding. In practice, however, MTP acceptance rates decline sharply during RL training, so the resulting speedup stays limited.
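To see why a falling acceptance rate limits the speedup, the following minimal sketch uses the standard speculative decoding estimate of tokens emitted per target-model pass. It assumes every draft token is accepted independently with the same probability `alpha`, which is a simplification and not a claim from the Bebop study. The function name and the example values are illustrative.

```python
def expected_tokens_per_step(alpha: float, k: int) -> float:
    """Expected tokens emitted per target-model pass with k draft tokens.

    Assumption (not from the source): each draft token is accepted
    independently with probability alpha. The result is the geometric
    sum 1 + alpha + alpha**2 + ... + alpha**k.
    """
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    if k < 0:
        raise ValueError("k must be non-negative")
    if alpha == 1.0:
        return float(k + 1)
    return (1.0 - alpha ** (k + 1)) / (1.0 - alpha)


if __name__ == "__main__":
    # Illustrative acceptance rates with 3 draft tokens per step.
    for alpha in (0.95, 0.85, 0.60):
        print(f"alpha={alpha:.2f}: {expected_tokens_per_step(alpha, 3):.3f} tokens/step")
```

Under this simplified model, every drop in the acceptance rate lowers the number of tokens gained per verification pass, which is why a declining rate erodes the benefit of speculative decoding.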

The Bebop study identifies the root cause: the MTP acceptance rate is fundamentally limited by fluctuations in model entropy, and it degrades significantly during RL training. Probabilistic rejection sampling dampens the effect of these entropy fluctuations substantially better than greedy draft sampling. The researchers further show that conventional MTP training objectives (cross-entropy or KL divergence) are suboptimal in this setting.
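The contrast between the two drafting schemes can be illustrated with a small sketch for a single token position. It assumes the standard speculative sampling rule, in which a draft token x sampled from q is accepted with probability min(1, p(x)/q(x)), so the overall acceptance probability is the sum of min(p, q) over the vocabulary. It also assumes that "greedy draft sampling" means the draft always proposes argmax q, which is then accepted with probability p(argmax q). Both interpretations, the function names, and the toy distributions are assumptions for illustration, not details confirmed by the source.

```python
from typing import Sequence


def acceptance_probabilistic(p: Sequence[float], q: Sequence[float]) -> float:
    """Acceptance probability when the draft token is sampled from q
    and verified against target p with the min(1, p/q) rule."""
    return sum(min(pi, qi) for pi, qi in zip(p, q))


def acceptance_greedy_draft(p: Sequence[float], q: Sequence[float]) -> float:
    """Acceptance probability when the draft always proposes argmax q
    (a point-mass proposal), verified against a sampling target p."""
    best = max(range(len(q)), key=lambda i: q[i])
    return p[best]


if __name__ == "__main__":
    # Toy distributions over a 4-token vocabulary (hypothetical values).
    low_entropy_p = [0.97, 0.01, 0.01, 0.01]
    low_entropy_q = [0.90, 0.05, 0.03, 0.02]
    high_entropy_p = [0.40, 0.30, 0.20, 0.10]
    high_entropy_q = [0.35, 0.30, 0.25, 0.10]

    for name, p, q in (
        ("low entropy", low_entropy_p, low_entropy_q),
        ("high entropy", high_entropy_p, high_entropy_q),
    ):
        print(name,
              "probabilistic:", round(acceptance_probabilistic(p, q), 4),
              "greedy draft:", round(acceptance_greedy_draft(p, q), 4))
```

In this toy setup the greedy draft does well while the target is sharply peaked but loses most of its acceptance once the target distribution flattens, whereas the probabilistic rule depends only on how closely q tracks p.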

The solution is a new end-to-end TV loss that directly optimizes the multi-step rejection sampling acceptance rate. The reported gains are roughly 10 percent higher acceptance rates, peaking at 95 percent, and up to 25 percent more inference throughput across mathematical reasoning, code generation, and agent tasks.
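The link between a TV objective and the acceptance rate can be sketched as follows, assuming that TV denotes total variation distance. For two distributions p and q, the single-step rejection sampling acceptance probability equals 1 - TV(p, q), so lowering TV raises acceptance directly. The multi-step quantity below chains per-position acceptance probabilities at fixed distributions. It is a hypothetical simplification for illustration, not the loss formulation used in the Bebop study, and all names are invented for this example.

```python
from typing import Sequence, Tuple

Dist = Sequence[float]


def tv_distance(p: Dist, q: Dist) -> float:
    """Total variation distance: half the L1 distance between p and q."""
    return 0.5 * sum(abs(pi - qi) for pi, qi in zip(p, q))


def single_step_acceptance(p: Dist, q: Dist) -> float:
    """Rejection sampling acceptance probability, sum of min(p, q)."""
    return sum(min(pi, qi) for pi, qi in zip(p, q))


def expected_accepted_drafts(steps: Sequence[Tuple[Dist, Dist]]) -> float:
    """Expected number of accepted draft tokens over several positions.

    Hypothetical simplification: position j only counts if all earlier
    positions were accepted, and each position accepts with probability
    1 - TV(p_j, q_j) for fixed per-position distributions.
    """
    total, survive = 0.0, 1.0
    for p, q in steps:
        survive *= 1.0 - tv_distance(p, q)
        total += survive
    return total


if __name__ == "__main__":
    p = [0.40, 0.30, 0.20, 0.10]
    q = [0.35, 0.30, 0.25, 0.10]
    # The identity acceptance = 1 - TV holds for normalized distributions.
    print(single_step_acceptance(p, q), 1.0 - tv_distance(p, q))
    # A training loss could, for example, be the negative of this value.
    print(expected_accepted_drafts([(p, q), (p, q)]))
```

Cross-entropy and KL divergence do not share this direct identity with the acceptance probability, which is consistent with the article's statement that they are suboptimal objectives here.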

The researchers also compared several online MTP training strategies applied during the RL process. They find that training MTP before RL with the end-to-end TV loss, combined with rejection sampling, maintains acceptance rates throughout RL training and makes costly online updates unnecessary. Experiments on Qwen-3.5, Qwen-3.6, and Qwen-3.7 models show up to 1.8x end-to-end acceleration in asynchronous RL training.

Source: arxiv.org · Published June 9, 2026
Lumi AI News — AI-assisted curation pursuant to Art. 50 EU AI Act. Paraphrase and classification via Lumi News Pipeline v1.6.5.


Bebop: Rejection Sampling Improves Multi-Token Prediction in RL Training
