Bottom line: Post-training is migrating from monolithic RL pipelines to decentralized specialist systems whose models are merged into a generalist student through on-policy distillation, a scaling pattern that resolves capability conflicts across domains.
Post-training methodology for large language models has changed more in the past year than in the three years before it. In 2026, frontier models are increasingly adopting Multi-Teacher On-Policy Distillation (MOPD), a paradigm in which independently trained specialist networks are distilled into a single generalist model.
Evolution of the post-training pipeline: InstructGPT (2022) followed a linear three-step recipe consisting of supervised fine-tuning (SFT), reward model training, and PPO-based reinforcement learning. By 2024, open models such as Llama 3 and Tülu 3 had established multi-stage pipelines of SFT → DPO → RL with verifiable rewards, while closed systems employed more complex multi-stage RLHF variants. DeepSeek R1 (2025) marked a turning point: large-scale reasoning RL became a core component of the pipeline.
MOPD as the new standard: MOPD is the pattern taking hold among frontier models in 2026. The procedure first trains N specialized teacher models, each through SFT followed by RL on its own domain. A generalist student model then samples its own trajectories and, at every token of each rollout, is trained to minimize the reverse KL divergence to the output distributions of the relevant teacher models. MiMo Flash V2 introduced MOPD; DeepSeek V4 and Nvidia Nemotron 3 Ultra scale the procedure to more than ten teachers.
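The per-token objective described above can be illustrated with a minimal sketch in plain Python. It is not taken from any of the named systems. It assumes that each rollout carries a domain label that selects one teacher, and that the reverse KL is computed over the full next-token distribution; the source specifies neither detail. The names `reverse_kl`, `mopd_loss`, and `teacher_steps_by_domain` are hypothetical.

```python
import math


def log_softmax(logits):
    """Numerically stable log-softmax over one list of logits."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_z for x in logits]


def reverse_kl(student_logits, teacher_logits):
    """KL(student || teacher) for a single next-token distribution.

    'Reverse' means the expectation is taken under the student, so the
    student is penalized for putting mass where the teacher puts little.
    """
    log_p = log_softmax(student_logits)  # student log-probabilities
    log_q = log_softmax(teacher_logits)  # teacher log-probabilities
    return sum(math.exp(lp) * (lp - lq) for lp, lq in zip(log_p, log_q))


def mopd_loss(student_steps, teacher_steps_by_domain, domain):
    """Mean per-token reverse KL for one student-sampled rollout.

    student_steps: one logit vector per token the student generated.
    teacher_steps_by_domain: maps a domain label to the logits that
        domain's teacher assigns at the same positions of the same
        student-generated rollout (assumption: one teacher per domain).
    domain: label used to pick the relevant teacher (assumption).
    """
    teacher_steps = teacher_steps_by_domain[domain]
    if len(teacher_steps) != len(student_steps):
        raise ValueError("teacher and student must score the same tokens")
    per_token = [reverse_kl(s, t) for s, t in zip(student_steps, teacher_steps)]
    return sum(per_token) / len(per_token)


# Toy rollout: two generated tokens over a two-token vocabulary.
student = [[0.0, 0.0], [1.0, 0.0]]
teachers = {
    "math": [[0.0, 0.0], [1.0, 0.0]],  # agrees with the student
    "code": [[3.0, 0.0], [0.0, 3.0]],  # disagrees with the student
}
math_loss = mopd_loss(student, teachers, "math")
code_loss = mopd_loss(student, teachers, "code")
```

In this toy case `math_loss` is zero because the student already matches the math teacher, while `code_loss` is positive. In a real training loop the logits would come from the models themselves and the loss would be differentiated with respect to the student's parameters; practical systems may also estimate the divergence from the sampled tokens only rather than summing over the full vocabulary.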
Motivation for specialization: Monolithic RL became more costly and conflict-prone on heterogeneous tasks (mathematics, code, agentic tasks) as trade-offs between capabilities emerged. Specialist models can be trained cost-effectively and scale well organizationally, because SFT followed by domain-specific RL is a well-understood process that separate teams can run in parallel. At the same time, on-policy distillation matured through theoretical advances and practical experience in the RLVR literature.
Source: www.interconnects.ai · Published June 16, 2026
Lumi AI News — AI-assisted curation according to Art. 50 EU AI Act. Paraphrase and classification by Lumi News Pipeline v1.7.1.