Decoupled Non-parametric Knowledge Distillation For End-to-end Speech Translation
2023 · Hao Zhang, Nianwen Si, Yaqi Chen, et al.
Abstract
Existing methods often attempt to transfer knowledge from a powerful machine translation (MT) model to a speech translation (ST) model using elaborate techniques, which often require transcriptions as extra input during training. However, transcriptions are not always available, and how to improve ST model performance without transcriptions, i.e., data efficiency, has rarely been studied in the literature. In this paper, we propose Decoupled Non-parametric Knowledge Distillation (DNKD), which approaches the problem from a data perspective to improve data efficiency. Our method follows the knowledge distillation paradigm. However, instead of obtaining the teacher distribution from a sophisticated MT model, we construct it from a non-parametric datastore via k-Nearest-Neighbor (kNN) retrieval, which removes the dependence on transcriptions and on an MT model. Then we decouple the classic knowledge distillation loss into target and non-target distillation to enhance the effect of the knowledge among non-target logit
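The two ideas in the abstract (a kNN-derived teacher distribution and a loss split into target and non-target parts) can be illustrated with a minimal NumPy sketch. The abstract gives no formulas, so everything below is an assumption: the distance-softmax weighting follows the common kNN-MT recipe, the target/non-target split follows the usual decoupled knowledge distillation formulation, and the function names, `k`, `temperature`, `alpha`, and `beta` are hypothetical illustration choices, not the authors' implementation.

```python
import numpy as np


def knn_teacher_distribution(query, keys, values, vocab_size, k=4, temperature=10.0):
    """Build a teacher distribution from the k nearest datastore entries.

    Assumed datastore layout: keys[i] is a decoder hidden state and
    values[i] is the id of the target token that followed it.
    """
    # Squared L2 distance from the query to every key in the datastore.
    dists = np.sum((keys - query) ** 2, axis=1)
    # Indices of the k closest entries.
    nearest = np.argsort(dists)[:k]
    # Softmax over negative distances, shifted for numerical stability.
    logits = -dists[nearest] / temperature
    weights = np.exp(logits - logits.max())
    weights /= weights.sum()
    # Accumulate neighbour weights onto their token ids (handles repeats).
    probs = np.zeros(vocab_size)
    np.add.at(probs, values[nearest], weights)
    return probs


def kl(p, q, eps=1e-8):
    """KL(p || q) for two discrete distributions, smoothed by eps."""
    return float(np.sum(p * (np.log(p + eps) - np.log(q + eps))))


def decoupled_kd_loss(teacher, student, target, alpha=1.0, beta=1.0, eps=1e-8):
    """Split distillation into a target part and a non-target part."""
    # Target part: binary distributions [p_target, 1 - p_target].
    t_bin = np.array([teacher[target], 1.0 - teacher[target]])
    s_bin = np.array([student[target], 1.0 - student[target]])
    target_kd = kl(t_bin, s_bin, eps)
    # Non-target part: renormalise over every class except the target.
    mask = np.ones(teacher.shape[0], dtype=bool)
    mask[target] = False
    t_non = teacher[mask] / (teacher[mask].sum() + eps)
    s_non = student[mask] / (student[mask].sum() + eps)
    non_target_kd = kl(t_non, s_non, eps)
    return alpha * target_kd + beta * non_target_kd, target_kd, non_target_kd


# Toy datastore: four 2-d keys, each paired with a token id in a 4-word vocabulary.
keys = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [5.0, 5.0]])
values = np.array([2, 2, 1, 0])
teacher = knn_teacher_distribution(np.array([0.0, 0.0]), keys, values, vocab_size=4, k=3)
student = np.full(4, 0.25)  # a uniform student prediction
loss, target_kd, non_target_kd = decoupled_kd_loss(teacher, student, target=2)
print(teacher, loss)
```

In this sketch, weighting the non-target term separately through `beta` is what lets the non-target knowledge contribute more than it would in a single undivided KL term; how the paper actually weights the two terms is not stated in the abstract.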
Related papers
- End-to-end Speech Translation With Knowledge Distillation (2019) 0.00
- Source And Target Bidirectional Knowledge Distillation For End-to-end Speech Translation (2021) 9.03
- Inter-KD: Intermediate Knowledge Distillation For CTC-based Automatic Speech Recognition (2022) 7.50
- Knowledge Distillation For Neural Transducer-based Target-speaker ASR: Exploiting Parallel Mixture/single-talker Speech Data (2023) 4.52
- Improving End-to-end Speech Translation By Imitation-based Knowledge Distillation With Synthetic Transcripts (2023) 0.60
- Data Efficient Direct Speech-to-text Translation With Modality Agnostic Meta-learning (2019) 0.00
- Leave No Knowledge Behind During Knowledge Distillation: Towards Practical And Effective Knowledge Distillation For Code-switching ASR Using Realistic Data (2024) 3.58
- Reducing The Gap Between Streaming And Non-streaming Transducer-based ASR By Adaptive Two-stage Knowledge Distillation (2023) 4.52