STaR: Distilling Speech Temporal Relation For Lightweight Speech Self-supervised Learning Models
2023 · Kangwook Jang, Sungnyun Kim, Hoirin Kim
Abstract
Despite the strong performance of Transformer-based speech self-supervised learning (SSL) models, their large parameter size and computational cost make them difficult to use in practice. In this study, we propose compressing speech SSL models by distilling speech temporal relation (STaR). Unlike previous works that directly match the representation of each speech frame, STaR distillation transfers the temporal relation between speech frames, which is more suitable for a lightweight student with limited capacity. We explore three STaR distillation objectives and select the best combination as the final STaR loss. Our model distilled from HuBERT BASE achieves an overall score of 79.8 on the SUPERB benchmark, the best performance among models with up to 27 million parameters. We show that our method is applicable across different speech SSL models and maintains robust performance with further reduced parameters.
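The abstract does not spell out the three STaR objectives, so the following is only a minimal sketch of the general idea of matching frame-to-frame relations instead of per-frame representations. It assumes the relation is a T x T matrix of scaled inner products between frames and that the loss is a mean squared error; the names `temporal_gram` and `relation_loss`, the scaling by feature dimension, and the toy sizes are all hypothetical and are not taken from the paper.

```python
import random


def temporal_gram(frames):
    """Return the T x T matrix of inner products between frames.

    Each entry is divided by the feature dimension (an assumed
    normalization) so that teacher and student matrices are on a
    comparable scale even when their widths differ.
    """
    dim = len(frames[0])
    return [
        [sum(a * b for a, b in zip(fi, fj)) / dim for fj in frames]
        for fi in frames
    ]


def relation_loss(teacher_frames, student_frames):
    """Mean squared error between teacher and student relation matrices."""
    if len(teacher_frames) != len(student_frames):
        raise ValueError("teacher and student must have the same number of frames")
    g_t = temporal_gram(teacher_frames)
    g_s = temporal_gram(student_frames)
    n = len(g_t)
    total = sum(
        (g_t[i][j] - g_s[i][j]) ** 2 for i in range(n) for j in range(n)
    )
    return total / (n * n)


# Toy data: 5 frames, a wider teacher (8 dims) and a narrower student (4 dims).
random.seed(0)
T, D_TEACHER, D_STUDENT = 5, 8, 4
teacher = [[random.gauss(0, 1) for _ in range(D_TEACHER)] for _ in range(T)]
student = [[random.gauss(0, 1) for _ in range(D_STUDENT)] for _ in range(T)]

loss = relation_loss(teacher, student)       # mismatch between relations
self_loss = relation_loss(teacher, teacher)  # identical relations
print(loss, self_loss)
```

Because both relation matrices are T x T regardless of feature width, a sketch like this needs no projection layer to align the student's dimension with the teacher's, which is one plausible reading of why relation matching suits a student with limited capacity.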
Related papers
- Recycle-and-distill: Universal Compression Strategy For Transformer-based Speech SSL Models With Attention Map Reusing And Masking Distillation (2023) · 5.84
- FitHuBERT: Going Thinner And Deeper For Knowledge Distillation Of Speech Self-supervised Learning (2022) · 10.97
- DistilHuBERT: Speech Representation Learning By Layer-wise Distillation Of Hidden-unit BERT (2021) · 15.06
- Fast-HuBERT: An Efficient Training Framework For Self-supervised Speech Representation Learning (2023) · 0.00
- SKILL: Similarity-aware Knowledge Distillation For Speech Self-supervised Learning (2024) · 3.58
- UniSpeech-SAT: Universal Speech Representation Learning With Speaker Aware Pre-training (2021) · 0.00
- LASER: Learning By Aligning Self-supervised Representations Of Speech For Improving Content-related Tasks (2024) · 4.52
- Is Smaller Always Faster? Tradeoffs In Compressing Self-supervised Speech Transformers (2022) · 0.00