The point: ZPPO feeds teacher-model outputs to the student as prompt components instead of as gradient signals, improving generalization when knowledge is transferred to smaller models.
Researchers from the Alibaba Group present Zone of Proximal Policy Optimization (ZPPO), a method for knowledge transfer to small language models that integrates teacher models into the input text rather than into the training process.
The central problem of classical knowledge distillation: when a large teacher model passes its logits (the raw, unnormalized scores a model produces before they are converted to probabilities) to a significantly smaller student model, the student latches onto the teacher's sharpest patterns and generalizes poorly to new tasks outside the training distribution.
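For readers unfamiliar with logit-based distillation, the following minimal sketch shows the textbook form of the idea: teacher and student logits are turned into probabilities with a temperature-scaled softmax, and the student is penalized by the KL divergence between the two distributions. The function names and the temperature value are illustrative assumptions; this is not necessarily the exact baseline used in the paper.

```python
import math


def softmax(logits, temperature=1.0):
    """Convert raw logits to probabilities; higher temperature flattens the result."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the maximum for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def kl_divergence(p, q):
    """KL(p || q) for two discrete distributions of equal length."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def distillation_loss(teacher_logits, student_logits, temperature=2.0):
    """Textbook distillation objective (assumed form): KL between softened distributions."""
    teacher_probs = softmax(teacher_logits, temperature)
    student_probs = softmax(student_logits, temperature)
    return kl_divergence(teacher_probs, student_probs)


# Hypothetical logits over three classes, for illustration only.
teacher = [4.0, 1.0, 0.5]
student = [2.0, 1.5, 1.0]
loss = distillation_loss(teacher, student)
```

The loss is zero only when the student reproduces the teacher's distribution exactly, which is the pressure that, according to the summary above, a much smaller student cannot absorb without overfitting to the teacher's most peaked predictions.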
ZPPO takes a different approach, inspired by Vygotsky’s pedagogical concept of the “Zone of Proximal Development”. Instead of forcing the teacher's competence into the gradient update, it constructs two reformulated prompt types. Binary Candidate-included Questions (BCQ) present a correct answer from the teacher model and an incorrect answer from the student model as anonymous candidates, and the student must tell them apart. Negative Candidate-included Questions (NCQ) collect multiple incorrect attempts by the student into one prompt to reveal common error patterns. A replay buffer keeps difficult questions in circulation until the student’s average accuracy on them reaches at least 50 percent.
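The two prompt types and the replay rule can be sketched roughly as follows. This is an illustration under stated assumptions, not the authors' implementation: the prompt wording, the `Item` class, and the function names `build_bcq`, `build_ncq`, and `update_buffer` are all hypothetical, and only the 50 percent threshold comes from the description above.

```python
import random
from collections import deque
from dataclasses import dataclass, field


@dataclass
class Item:
    """A training question plus the student's graded attempts (1 = correct, 0 = wrong)."""
    question: str
    attempts: list = field(default_factory=list)

    def accuracy(self) -> float:
        return sum(self.attempts) / len(self.attempts) if self.attempts else 0.0


def build_bcq(question, teacher_answer, student_answer, rng):
    """BCQ: show one correct (teacher) and one incorrect (student) answer, unlabeled."""
    candidates = [teacher_answer, student_answer]
    rng.shuffle(candidates)  # hide which candidate came from the teacher
    lines = [f"Question: {question}",
             "Exactly one of these candidate answers is correct:"]
    lines += [f"Candidate {label}: {text}" for label, text in zip("AB", candidates)]
    lines.append("Decide which candidate is correct, then answer the question.")
    correct_label = "AB"[candidates.index(teacher_answer)]
    return "\n".join(lines), correct_label


def build_ncq(question, wrong_answers):
    """NCQ: collect several incorrect student attempts into one prompt."""
    lines = [f"Question: {question}", "These earlier attempts are all incorrect:"]
    lines += [f"Attempt {i}: {text}" for i, text in enumerate(wrong_answers, start=1)]
    lines.append("Avoid these mistakes and answer the question.")
    return "\n".join(lines)


def update_buffer(buffer, item, threshold=0.5):
    """Re-queue a question while the student's mean accuracy on it is below threshold."""
    if item.accuracy() < threshold:
        buffer.append(item)
        return True
    return False


rng = random.Random(0)
bcq_prompt, bcq_label = build_bcq("What is 2 + 2?", "4", "5", rng)
ncq_prompt = build_ncq("What is 2 + 2?", ["5", "3"])
replay = deque()
still_hard = update_buffer(replay, Item("What is 2 + 2?", [1, 0, 0]))
```

Shuffling the candidates matters in this sketch because the student should have to discriminate on content rather than learn that a fixed position always holds the teacher's answer.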
Tests on the Qwen3.5 family with four student models (0.8B to 9B parameters) and a 27B teacher model show that ZPPO outperforms classical off-policy and on-policy distillation as well as GRPO baselines. The advantage is greatest for the smallest models. The evaluation covers 31 benchmarks (16 vision-language tests, 10 pure language tests, 5 video tests).
Source: arxiv.org · Published June 15, 2026
Lumi AI News — AI-assisted curation according to Art. 50 EU AI Act. Paraphrase and classification by Lumi News Pipeline v1.7.1.