Recycle-and-distill: Universal Compression Strategy For Transformer-based Speech SSL Models With Attention Map Reusing And Masking Distillation
2023 Β· Kangwook Jang, Sungnyun Kim, Se-Young Yun, et al.
Abstract
Transformer-based speech self-supervised learning (SSL) models, such as HuBERT, show surprising performance in various speech processing tasks. However, huge number of parameters in speech SSL models necessitate the compression to a more compact model for wider usage in academia or small companies. In this study, we suggest to reuse attention maps across the Transformer layers, so as to remove key and query parameters while retaining the number of layers. Furthermore, we propose a novel masking distillation strategy to improve the student model's speech representation quality. We extend the distillation loss to utilize both masked and unmasked speech frames to fully leverage the teacher model's high-quality representation. Our universal compression strategy yields the student model that achieves phoneme error rate (PER) of 7.72% and word error rate (WER) of 9.96% on the SUPERB benchmark.
Authors
(none)
Tags
Stats
Related papers
- Star: Distilling Speech Temporal Relation For Lightweight Speech Self-supervised Learning Models (2023)5.24
- Fithubert: Going Thinner And Deeper For Knowledge Distillation Of Speech Self-supervised Learning (2022)10.97
- Unispeech-sat: Universal Speech Representation Learning With Speaker Aware Pre-training (2021)0.00
- Distilhubert: Speech Representation Learning By Layer-wise Distillation Of Hidden-unit BERT (2021)15.06
- Pushing The Limits Of Unsupervised Unit Discovery For SSL Speech Representation (2023)6.34
- Multi-resolution Hubert: Multi-resolution Speech Self-supervised Learning With Masked Unit Prediction (2023)0.00
- Efficient Infusion Of Self-supervised Representations In Automatic Speech Recognition (2024)0.00
- CHAPTER: Exploiting Convolutional Neural Network Adapters For Self-supervised Speech Models (2022)7.50