chrisliu298/awesome-on-policy-distillation

⚗️ Awesome On-Policy Distillation

A curated collection of papers, technical reports, frameworks, and tools for on-policy distillation (OPD) of large language models.

On-policy distillation trains a student on samples from its own evolving policy, while a teacher (external, privileged, or self-conditioned) provides dense supervision on those same samples.

On-policy distillation (OPD) trains a student on trajectories sampled from its own policy while a teacher scores the student-visited prefixes with dense token-level guidance. Collecting data on-policy reduces the train-inference distribution gap that affects off-policy KD and SFT on fixed traces. Depending on the estimator, OPD resembles either GKD on student rollouts or policy-gradient RL with teacher-defined per-token KL or log-prob rewards, so its natural contrast is sparse outcome-reward RL rather than RL as a whole. As of 2026, OPD is a standard post-training primitive at Alibaba (Qwen3), DeepSeek (V4), Xiaomi (MiMo), Zhipu (GLM-5), NVIDIA (Nemotron-Cascade 2), and others.
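To make the "dense token-level guidance" concrete, here is a minimal sketch in plain Python. It assumes reverse KL, KL(student || teacher), as the per-token signal, which is only one of several divergences used in this literature. The function names and toy logits are hypothetical, and a real trainer would compute this on batched tensors.

```python
import math


def log_softmax(logits):
    """Numerically stable log-softmax over one vocabulary-sized list of logits."""
    m = max(logits)
    lse = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - lse for x in logits]


def reverse_kl(student_logits, teacher_logits):
    """KL(student || teacher) at a single position, summed over the vocabulary."""
    ls = log_softmax(student_logits)
    lt = log_softmax(teacher_logits)
    return sum(math.exp(a) * (a - b) for a, b in zip(ls, lt))


def token_reward(student_logits, teacher_logits, token_id):
    """Single-sample view: teacher log-prob minus student log-prob of the
    token the student actually sampled. Its expectation under the student
    distribution equals the negative reverse KL at that position."""
    ls = log_softmax(student_logits)
    lt = log_softmax(teacher_logits)
    return lt[token_id] - ls[token_id]


def opd_losses(student_seq, teacher_seq):
    """Per-token losses for one student rollout (one logit vector per position).
    Both models are assumed to be evaluated on the same student-sampled prefix."""
    return [reverse_kl(s, t) for s, t in zip(student_seq, teacher_seq)]


# Toy rollout: 2 positions, vocabulary of 3 tokens (values are made up).
student_seq = [[2.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
teacher_seq = [[2.0, 0.0, 0.0], [0.0, 0.0, 3.0]]
losses = opd_losses(student_seq, teacher_seq)
```

The first position carries zero loss because the two distributions agree. The second carries a positive loss, and `token_reward` for the student's preferred token there is negative, which is the per-token penalty the policy-gradient framing would use.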

Shipping today? Jump to Frameworks and Implementations. New to OPD? Read Start Here.

Start Here

A fast path through the field:

  1. Survey. OPD Survey — taxonomy, methods, and open problems in one place.
  2. Foundations. MiniLLM, GKD, and ExOPD — the core student-rollout plus teacher-supervision loop, including its dense KL-constrained RL framing.
  3. Practical intuition. Thinking Machines blog — the clearest end-to-end explanation of why and when OPD applies.
  4. When OPD works and when it breaks. Revisiting OPD, Entropy-Aware OPD, and Rethinking OPD — failure modes (instability, diversity collapse, tokenizer mismatch) and success conditions (compatible thinking patterns, novel teacher capability).
  5. No teacher logits. Black-Box OPD — discriminator-based reward when the teacher is API-only.
  6. No teacher at all. OPSD and SDFT — same model as student and self-teacher.
  7. Context and experience. OPCD and OEL — distill prompts and deployment traces into weights.
  8. Industrial recipes. Qwen3, DeepSeek-V4, MiMo-V2-Flash, GLM-5 — how labs ship OPD in production.

Key decision: do you have access to teacher logits? Yes → white-box (GKD, Veto, Entropy-Aware OPD). No → black-box (GAD, OVD) or self-distillation (OPSD, SDFT).
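The decision above can be written as a small lookup. This is an illustrative sketch only: the `teacher_access` labels and the function name are hypothetical, and the example methods are the ones named in this list.

```python
# Example methods per family, copied from the decision rule in this README.
EXAMPLES = {
    "white-box": ("GKD", "Veto", "Entropy-Aware OPD"),
    "black-box": ("GAD", "OVD"),
    "self-distillation": ("OPSD", "SDFT"),
}


def choose_opd_family(teacher_access: str) -> str:
    """Map the kind of teacher access to an OPD family.

    teacher_access (hypothetical labels):
      "logits"   -> teacher token distributions are available
      "api_only" -> only sampled teacher outputs are available
      "none"     -> no external teacher; the model teaches itself
    """
    families = {
        "logits": "white-box",
        "api_only": "black-box",
        "none": "self-distillation",
    }
    if teacher_access not in families:
        raise ValueError(f"unknown teacher_access: {teacher_access!r}")
    return families[teacher_access]


family = choose_opd_family("api_only")
suggestions = EXAMPLES[family]
```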

Surveys and Essays

Surveys and Position Papers

Essays, Blog Posts, and Walkthroughs

Core OPD Papers

The papers that define on-policy distillation for LLMs.

Scope rule: A paper belongs here if its primary contribution is a new component of the OPD training loop itself — an objective, divergence formulation, stability fix, teacher access-mode variant, self-distillation variant, context-internalization mechanism, or systems/efficiency/privacy constraint applied to that loop — with student rollouts central to the learning signal, evaluated on LLM text generation or reasoning. Operational test: if removing the OPD-loop component leaves a working contribution (a working RL recipe, preference method, or KD baseline), the OPD piece is auxiliary → Adjacent. Papers that enable OPD (cross-tokenizer alignment, calibration), compose with OPD as one component of a larger RL/preference structure, or apply OPD to non-text-reasoning substrates live in Adjacent and Enabling Work or Domain Extensions.

Foundations

Gap-Bridging

Stability and Objective Design

Self-Distillation

Context and Experience Internalization

Efficiency, Systems, and Privacy Variants

Taxonomy

Cross-cutting views over the canonical papers. Many entries span multiple categories — this is for orientation, not strict partitioning.

By Teacher Type

| Teacher Type | Papers |
| --- | --- |
| External white-box | MiniLLM, GKD, DistiLLM, DistiLLM-2, Veto, Entropy-Aware OPD, ExOPD, REOPOLD, PACED, Prefix OPD, Revisiting OPD, Rethinking OPD, Lightning OPD, Uni-OPD, SOD, AOPD, vOPD, SCOPE, HPD, TIP, DP-OPD, NPD, Prune-OPD, EffOPD, CoDistill-GRPO, Rock Tokens, Sparse-to-Dense, MOTAB, TGPO, TA-OPD, ESR, ADWIN, dGRPO, LGR, TRB, POPD/TOPD, f-OPD, NF-OPD, OPD+, TrOPD, MPD |
| External black-box | Black-Box OPD / GAD, OVD, ROPD, OmniOPD |
| Self-teacher with privileged context | OPSD, SDFT, SDPO, OPSDC, GATES, pi-Distill, RLSD, SDZero, OGLS-SD, PBSD, UniSD, ATESD, RLRT, EGRSD, CREDIT, SDAR, MixSD, AntiSD, TRACE, AVSD, VPD, RMSD, SPD, MSD-Safety, COPSD, EDGE-OPD, EMPO², StepOPSD, PW-OPSD, Ditto, ROSD, SGSD, AMR-SD, Feedback Distillation, SC-SDPO, SSOPD, OPSA, OPCritD |
| Internal self-teacher (cross-depth) | OISD |
| Self-teacher (non-privileged / answer-free) | CAST, TS-OPSD, SafeSteer |
| Context-conditioned | OPCD, OEL, Multi-Rollout MOPD, MAIGO, FiC, CCOPD |
| Multiple / lifecycle teachers | MiMo-V2-Flash MOPD, GLM-5, Qwen3, Baichuan-M3, DeepSeek-V4, CoPD, MAD-OPD, KAT-Coder-V2, CaMOPD, CollectionLoRA |

By Primary Goal

| Goal | Papers |
| --- | --- |
| Compression / strong-to-weak transfer | MiniLLM, GKD, Qwen3, Prefix OPD, Rethinking OPD, Lightning OPD, MOTAB, ADWIN, TRB, POPD/TOPD, NF-OPD, OPD+, TrOPD |
| Post-RL consolidation / skill integration | MiMo MOPD, GLM-5, ExOPD, CoPD, OPCritD, TS-OPSD |
| Continual learning | SDFT, OPCD, OEL, MixSD, EDGE-OPD, CaMOPD, MAIGO, FiC, CCOPD |
| RL replacement / augmentation | SDPO, RLTF-SD, RLAD, REOPOLD, RLSD, SDZero, OGLS-SD, PBSD, CoDistill-GRPO, RLRT, EGRSD, CREDIT, SDAR, Sparse-to-Dense, AntiSD, TRACE, AVSD, VPD, RMSD, Multi-Rollout MOPD, EMPO², StepOPSD, TGPO, Ditto, OISD, ROSD, SGSD, AMR-SD, dGRPO, Feedback Distillation, CAST, SC-SDPO, f-OPD, SSOPD, OPSA, SafeSteer |
| Reasoning compression | OPSDC, MPD |
| Black-box distillation | GAD, OVD, ROPD, OmniOPD |

Adjacent and Enabling Work

Papers that are not canonical OPD but matter for understanding or deploying it.

Cross-Tokenizer and Model-Family Enablers

Mismatch Mitigation and Student Quality

Preference, Reward-Guided, and Hybrid RL+KD

Self-Play and Iterative Bootstrapping

Precursors

Domain Extensions

OPD applied to non-text-reasoning settings — agents, multimodal models, diffusion, audio, robotics — and to inference acceleration via speculative decoding. These pass the inclusion criterion (student rollouts central to the learning signal) but on substrates beyond LLM text reasoning.

Agent, Multimodal, and Other Extensions

Speculative Decoding (Draft-Model Training)

Draft-model training for speculative decoding shares OPD's core loop: the draft (student) generates, the target (teacher) verifies, and the draft is updated to match. Included for breadth even though the goal is inference acceleration rather than student capability.
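The generate-then-verify half of that loop can be sketched as follows. This is a simplified illustration that assumes greedy verification (a draft token is accepted only if it equals the target model's argmax at that position). Lossless sampling-based speculative decoding uses a probabilistic acceptance rule instead, and the function name and token ids here are hypothetical.

```python
def speculative_step(draft_tokens, target_greedy_tokens):
    """One greedy verification step.

    draft_tokens: k tokens proposed by the draft (student) model.
    target_greedy_tokens: k + 1 tokens, the target (teacher) model's argmax
        at each position when conditioned on the draft's proposed prefix.

    Returns (num_accepted, emitted_tokens). Accepted draft tokens are kept;
    the target's own token is emitted at the first mismatch, or as a bonus
    token when every draft token is accepted.
    """
    if len(target_greedy_tokens) != len(draft_tokens) + 1:
        raise ValueError("target must score one more position than the draft")

    num_accepted = 0
    for drafted, verified in zip(draft_tokens, target_greedy_tokens):
        if drafted != verified:
            break
        num_accepted += 1

    emitted = list(draft_tokens[:num_accepted])
    emitted.append(target_greedy_tokens[num_accepted])
    return num_accepted, emitted


# Toy token ids (made up): the draft is wrong at the third position.
accepted, emitted = speculative_step([5, 7, 9], [5, 7, 2, 4])
```

The position where the draft and target disagree is where the target's distribution supplies a training signal for the draft model, and that is the part of the loop speculative decoding shares with OPD.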

Technical Reports and Industrial Recipes

Production training pipelines that use OPD as a post-training stage.

| Year | System | OPD Usage | Link |
| --- | --- | --- | --- |
| 2024 | Gemma 2 | KD as alternative to next-token prediction for 2B and 9B students | arXiv |
| 2025 | Qwen3 | Strong-to-weak; off-policy then on-policy distillation | arXiv |
| 2025 | Qwen3-Omni | Off-policy then on-policy distillation before GSPO | arXiv |
| 2025 | GLM-4.5 / 4.6 | Multi-stage post-training with expert model iteration and RL | arXiv |
| 2025 | HY-MT1.5 | Multi-stage translation: SFT + OPD + RL | arXiv |
| 2026 | MiMo-V2-Flash | Multi-Teacher OPD (MOPD) as post-training stage | arXiv |
| 2026 | GLM-5 | On-policy cross-stage distillation to recover earlier skills | arXiv |
| 2026 | Typhoon-S | Minimal sovereign recipe: SFT + OPD + small-scale RFT | arXiv |
| 2026 | Nemotron-Cascade 2 | Cascade RL + multi-domain on-policy distillation | arXiv |
| 2026 | Baichuan-M3 | Task RL → offline policy distillation → multi-teacher OPD | arXiv |
| 2026 | MobileLLM-R1.5 | Final-stage on-policy KD as primary improvement over R1 | model card |
| 2026 | Nanbeige4-3B-Thinking | OPD preferred over off-policy for math reasoning | model card |
| 2026 | DeepSeek-V4 | Domain-expert SFT+GRPO → unified model consolidation via OPD | report |
| 2026 | Qwen3.5-Omni | Specialist distillation → privileged-input self-distillation aligning audio to text | arXiv |
| 2026 | HY-Embodied-0.5 | 32B → 2B on-policy distillation; student rollouts, teacher token-level supervision | arXiv |
| 2026 | KAT-Coder-V2 | Specialize-then-Unify: 5 domain-expert agents → unified via OPD on student trajectories | arXiv |
| 2026 | Cursor Composer 2.5 | Hint-conditioned self-teacher OPD KL added to RL for targeted behaviors (tool calls, style); built on Kimi K2.5 | blog |

Frameworks and Implementations

Training Frameworks

| Framework | Description | Link |
| --- | --- | --- |
| TRL | GKD, GOLD, and MiniLLM trainers; most accessible starting point | docs |
| NeMo-RL | Multi-teacher and cross-tokenizer OPD at scale | docs, repo |
| veRL | Async on-policy KD trading strict on-policy guarantees for throughput | docs |
| MS-Swift | GKD and OPSD sections in the ModelScope ecosystem | docs |
| EasyDistill | Comprehensive KD toolkit for black-box and white-box LLM distillation | arXiv |
| KDFlow | Off-policy, on-policy, and cross-tokenizer distillation via decoupled backends | arXiv, repo |
| slime | Unified RL stack supporting on-policy distillation and hindsight hints | repo |
| OpenClaw-RL | Agentic RL stack with hindsight-guided OPD | arXiv |
| NexRL | Dedicated on-policy distillation recipes | repo |
| SkyRL | OPD examples and blog resources | repo |
| ATLAS | Continual-learning framework using GKD/GRPO from runtime traces | docs |
| AReaL | OPD and KDRL over student-sampled trajectories with teacher log-prob guidance | docs |
| rLLM | Agent RL framework (UC Berkeley Sky) with first-class OPD: examples/math_distill/ (DeepMath OPSD + train_deepmath_distill_tinker.{py,sh}) and rllm/trainer/distill/ modules over verl or tinker backends | docs, repo |
| SpecForge | Speculative draft training with EAGLE-3 support and hybrid parallelism | arXiv, repo |
| TorchSpec | Torch-native speculative draft training with disaggregated inference/training; streams target hidden states via Mooncake store; Kimi-K2.5/MiniMax-M2.5/Qwen3-Coder-Next examples | blog, repo |
| Tinker Cookbook | Thinking Machines' Tinker SDK recipes for off-policy KD, single/multi-teacher OPD, multi-turn tool use | recipes, repo |
| ROLL | Alibaba's scalable RL library for LLMs/VLMs with an OPD pipeline | repo |

Implementations

  • OPSD — Official code for Self-Distilled Reasoner / OPSD.
  • SCOPE — Dual-path OPD: student-PPL-weighted MLE for correct rollouts, teacher-PPL-weighted KL for incorrect.
  • CaOPD — K student rollouts → empirical success rate → confidence target replacement → reverse-KL OPD.
  • OPSD-OnPolicyDistillation — verl-based OPD with separate teacher, agent-loop rollouts, and memory-efficient execution.
  • nano-opd — Hackable OPD library decoupling vLLM rollout, FSDP training, and teacher forwards across independent GPU groups.
  • Rethinking OPD — Official code for Rethinking OPD, with verl-based scripts and top-k teacher–student overlap diagnostics merged upstream.
  • DiffusionOPD — Official implementation of round-robin multi-task diffusion OPD distilling task-specialized teachers into one student along its own rollout trajectories.

Acknowledgments

This list draws on the parallel curation effort at thinkwee/AwesomeOPD, which provided pointers to several papers (notably speculative-decoding draft training, BoN distillation, self-play, multilingual and crosslingual self-distillation, clinical and multimodal agentic OPD, additional industrial reports, and several training frameworks). The two lists organize differently — thinkwee/AwesomeOPD groups by feedback signal and access mode; this list groups by methodological role — and are best read together.

Contributing

Contributions welcome. See CONTRIBUTING.md for criteria, section placement, and formatting.

  • Inclusion criteria: student rollouts should be central to the work's learning signal, or the work should directly enable OPD deployment (cross-tokenizer alignment, frameworks, etc.).
  • Entry format: [Title](url) *(Year)* — One-line description. See CONTRIBUTING.md for full examples.
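As a rough aid for contributors, the entry format above can be checked with a regular expression. This is a sketch rather than the repository's own tooling: the pattern, the `parse_entry` helper, and the sample entry (including its example.com URL) are hypothetical, and CONTRIBUTING.md remains the authority on edge cases such as URLs containing parentheses.

```python
import re

# [Title](url) *(Year)* <dash> One-line description.
# The separator is the long dash used in the README, written as \u2014.
ENTRY_RE = re.compile(
    r"^\[(?P<title>[^\]]+)\]"        # [Title]
    r"\((?P<url>https?://[^\s)]+)\)"  # (url), no spaces or ')' inside
    r" \*\((?P<year>\d{4})\)\*"       # *(Year)* with a four-digit year
    r" \u2014 "                       # dash separator with single spaces
    r"(?P<description>.+)$"           # one-line description
)


def parse_entry(line):
    """Return the entry's fields as a dict, or None if the line is malformed."""
    match = ENTRY_RE.match(line.strip())
    return match.groupdict() if match else None


# Made-up sample entry; the title and URL are placeholders.
sample = "[Example Paper](https://example.com/paper) *(2026)* \u2014 One-line description."
fields = parse_entry(sample)
```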

Citation

@software{awesome-on-policy-distillation,
  title = {{Awesome On-Policy Distillation}},
  author = {Liu, Chris Yuhao and others},
  year = {2026},
  doi = {10.5281/zenodo.19411493},
  url = {https://github.com/chrisliu298/awesome-on-policy-distillation},
  version = {v1.0.0}
}
