TCOD: Exploring Temporal Curriculum In On-policy Distillation For Multi-turn Autonomous Agents
2026 · Jiaqi Wang, Wenhao Zhang, Weijie Shi, et al.
Abstract
arXiv:2604.24005v2
On-policy distillation (OPD) has shown strong potential for transferring reasoning ability from frontier or domain-specific models to smaller students. While effective on static single-turn tasks, its behavior in multi-turn agent settings remains underexplored. In this work, we identify a key limitation of vanilla OPD in such settings, which we term Trajectory-Level KL Instability. Specifically, we observe that KL divergence increases together with a drop in success rate, and even after convergence, the KL remains high, leading to unstable training. This instability arises from inter-turn error compounding: as errors accumulate, the student is driven beyond the teacher's effective support, rendering the supervision signal unreliable. To address this, we propose TCOD (Temporal Curriculum On-Policy Distillation), a simple yet effective framework that controls the trajectory depth exposed to the student and progressively expands it from
Related papers
- Online Policy Distillation With Decision-attention (2024) 0.00
- Dual Policy Distillation (2020) 10.61
- Continual Deep Reinforcement Learning With Task-agnostic Policy Distillation (2024) 0.00
- Continual Policy Distillation From Distributed Reinforcement Learning Teachers (2026) 0.00
- Offline Behavior Distillation (2024) 0.00
- The Nature Of Temporal Difference Errors In Multi-step Distributional Reinforcement Learning (2022) 0.00
- A Conservative Approach For Few-shot Transfer In Off-dynamics Reinforcement Learning (2023) 0.00
- Adaptive Temporal-difference Learning For Policy Evaluation With Per-state Uncertainty Estimates (2019) 0.00