Skill Self-Play: New Co-Evolution for LLM Training Methods

27. July 2026
AI Models

Skill Self-Play combines task generation, solution search, and dynamic skill control in a reinforcement learning loop to achieve both task diversity and training reliability.

SLPO: Outcome-Reward Training for Latent Reasoners Without Token Decoding

24. July 2026
AI Models

Surrogate Latent Policy Optimization enables efficient outcome-reward training for latent reasoners that use continuous vectors instead of tokens for intermediate steps.

SEED: Self-Evolving Behavior Clarification for Agent-Based Reinforcement Learning Models

17. July 2026
AI Models

SEED uses hindsight supervision that the language model generates by analyzing its own trajectories, bridging the supervision gap between episode-level outcomes and token-level learning signals.

Direct-OPD: Transferring Policy Shifts from Smaller to Larger Models

14. July 2026
AI Models

Direct-OPD transfers RL-induced policy shifts from weaker to stronger models by using the log-ratio between the RL-shifted and original policies as an implicit reward signal.

SAO: Single-Rollout Method Improves Stability in Agent-Based RL Training

9. July 2026
AI Models

Replacing batch sampling with single-rollout sampling stabilizes asynchronous RL training and outperforms GRPO on agent-based benchmarks.

Reinforcement Learning with Metacognition Improves Uncertainty Expression in LLMs

1. July 2026
AI Models

Reinforcement Learning with Metacognitive Feedback (RLMF) enables LLMs to express their own uncertainty in a calibrated manner and outperforms standard RL methods by up to 63 percent.

Structure-Aware Curriculum Learning for LLMs via Manifold Bandits

23. June 2026 / 4. July 2026
AI Models

Structured curriculum learning strategies that leverage task relationships in latent space achieve better downstream performance than pure difficulty prioritization.

STARE: Token-Level Stability Procedure Against Policy Entropy Collapse in GRPO Training

19. June 2026 / 4. July 2026
AI Models

STARE uses surprisal metrics and selective advantage reweighting to maintain policy entropy stability across long training sequences while improving accuracy by 4–8%.

ZPPO: Teacher Models as Prompts Instead of Gradients

17. June 2026 / 4. July 2026
AI Models

ZPPO integrates teacher models as prompt components instead of gradients, improving generalization in knowledge transfer to smaller models.

RACES: Automatic Composition of Verifiable Environments for LLM Training

11. June 2026 / 4. July 2026
AI Models

By automatically composing 50 base environments, RACES matches the training performance of 300 individual environments.

RACES: Verifiable Environments as Recursively Composable Building Blocks for LLM Reasoning

11. June 2026 / 4. July 2026
AI Models

RACES enables automatic composition of verifiable environments through recursive combination, with DeepSeek-R1-Distill-Qwen-14B improving by 3.1 points and Qwen3-14B by 2.3 points across six benchmarks.

FlowTracer: Targeted Reinforcement Learning Through Information Flow Tracking in LLMs

10. June 2026 / 4. July 2026
AI Models

FlowTracer models information propagation as a directed graph and derives token credits from global flow structure to precisely concentrate reinforcement learning signals on critical reasoning steps.

Lumi AI News