LongStraw: Reinforcement Learning on Millions of Tokens within Fixed GPU Budget

17 July 2026
AI Models

LongStraw enables RL training on 2.1 million tokens using Group Relative Policy Optimization (GRPO) on eight H20 GPUs by optimizing memory accesses and compressing computational graphs through response-branch replay.

REVES: Iterative Training for More Efficient Test-Time Scaling in LLMs

19 June 2026 / 4 July 2026
AI Models

REVES leverages intermediate steps from successful error corrections as separate training data, achieving better performance with less computational overhead than conventional multi-turn reinforcement learning methods.

EfficientRollout: Self-Speculative Decoding for Faster RL Rollouts

19 June 2026 / 4 July 2026
AI Models

EfficientRollout uses self-speculative decoding with adaptive system utilization to reduce rollout latency in RL scenarios, without pretraining a separate drafter or compromising the target model.

Bebop: Rejection Sampling Improves Multi-Token Prediction in RL Training

11 June 2026 / 4 July 2026
AI Models

Bebop uses rejection sampling and TV loss optimization to keep multi-token prediction (MTP) acceptance rates stable during RL training, accelerating rollouts by up to 1.8x.

Lumi AI News