Application Of Knowledge Distillation To Multi-task Speech Representation Learning
2022 · Mine Kerpicci, Van Nguyen, Shuhua Zhang, et al.
Abstract
Model architectures such as wav2vec 2.0 and HuBERT have been proposed to learn speech representations from audio waveforms in a self-supervised manner. When combined with downstream tasks such as keyword spotting and speaker verification, they provide state-of-the-art performance. However, these models are large: even the smallest version has 95 million parameters, which makes deployment on edge AI devices challenging. In this paper, we investigate the application of knowledge distillation to speech representation learning (SRL) models, followed by joint fine-tuning with multiple downstream voice-activated tasks. In our experiments on two such tasks, our approach results in a nearly 75% reduction in model size while suffering only 0.1% accuracy and 0.9% equal error rate degradation compared to the full-size model. In addition, we show that fine-tuning the SRL models results in a significant performance boost compared to using frozen SRL models.
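The abstract does not state which distillation objective or task weighting is used, so the following is only a minimal sketch of the general idea: a student is trained to match a teacher's features, and the downstream task losses are then combined for joint fine-tuning. The L1-plus-cosine objective, the function names, and the `cos_weight` and `task_weights` parameters are all illustrative assumptions, not the authors' method.

```python
import math

def l1_distance(student, teacher):
    """Mean absolute difference between two equal-length feature vectors."""
    return sum(abs(s - t) for s, t in zip(student, teacher)) / len(student)

def cosine_similarity(student, teacher):
    """Cosine similarity; both vectors are assumed to be non-zero."""
    dot = sum(s * t for s, t in zip(student, teacher))
    norm_s = math.sqrt(sum(s * s for s in student))
    norm_t = math.sqrt(sum(t * t for t in teacher))
    return dot / (norm_s * norm_t)

def distillation_loss(student, teacher, cos_weight=1.0):
    """Assumed objective: L1 distance plus a (1 - cosine) penalty.

    The loss is near zero when the student reproduces the teacher features.
    """
    return l1_distance(student, teacher) + cos_weight * (
        1.0 - cosine_similarity(student, teacher)
    )

def joint_task_loss(task_losses, task_weights):
    """Weighted sum of per-task losses for joint fine-tuning (assumed form)."""
    return sum(task_weights[name] * loss for name, loss in task_losses.items())

# Hypothetical usage with two downstream tasks, keyword spotting ("kws")
# and speaker verification ("sv"), weighted equally.
kd = distillation_loss([0.2, 0.8, -0.1], [0.25, 0.7, 0.0])
total = joint_task_loss({"kws": 0.5, "sv": 1.5}, {"kws": 1.0, "sv": 1.0})
```

In practice the feature vectors would be frame-level hidden states from the teacher and student networks, and the losses would be computed with a tensor library so that gradients can flow to the student.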
Related papers
- An Efficient End-to-end Approach To Noise Invariant Speech Features Via Multi-task Learning (2024)
- One-step Knowledge Distillation And Fine-tuning In Using Large Pre-trained Self-supervised Learning Models For Speaker Verification (2023)
- SKILL: Similarity-aware Knowledge Distillation For Speech Self-supervised Learning (2024)
- Deep Versus Wide: An Analysis Of Student Architectures For Task-agnostic Knowledge Distillation Of Self-supervised Speech Models (2022)
- Knowledge Distillation From Language Model To Acoustic Model: A Hierarchical Multi-task Learning Approach (2021)
- Audio-visual Representation Learning Via Knowledge Distillation From Speech Foundation Models (2025)
- DistilHuBERT: Speech Representation Learning By Layer-wise Distillation Of Hidden-unit BERT (2021)
- Two-stage Textual Knowledge Distillation For End-to-end Spoken Language Understanding (2020)