How Redundant Is The Transformer Stack In Speech Representation Models?
2024 · Teresa Dorszewski, Albert Kjøller Jacobsen, Lenka Tětková, et al.
Abstract
Self-supervised speech representation models, particularly those built on transformer architectures, have demonstrated strong performance across tasks such as speech recognition, speaker identification, and emotion detection. Recent studies on transformer models have revealed high redundancy between layers and the potential for significant pruning, which we investigate here for transformer-based speech representation models. We perform a detailed analysis of layer similarity in speech representation models using three similarity metrics: cosine similarity, centered kernel alignment, and mutual nearest-neighbor alignment. Our findings reveal a block-like structure of high similarity, suggesting two main processing steps and significant redundancy of layers. We demonstrate the effectiveness of pruning transformer-based speech representation models without the need for post-training, achieving up to 40% reduction in transformer layers while maintaining over 95% of the mode
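The three similarity metrics named in the abstract can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the function names, the linear-kernel form of centered kernel alignment, the choice of k, and the synthetic "layers" are all assumptions made for illustration. Each layer is assumed to be a matrix of shape (frames, features).

```python
import numpy as np

def cosine_similarity(X, Y):
    """Mean cosine similarity between matching rows of X and Y (same shape)."""
    num = np.sum(X * Y, axis=1)
    den = np.linalg.norm(X, axis=1) * np.linalg.norm(Y, axis=1)
    return float(np.mean(num / den))

def linear_cka(X, Y):
    """Centered kernel alignment with a linear kernel (an assumed variant)."""
    X = X - X.mean(axis=0)  # center each feature column
    Y = Y - Y.mean(axis=0)
    cross = np.linalg.norm(Y.T @ X, "fro") ** 2
    norm = np.linalg.norm(X.T @ X, "fro") * np.linalg.norm(Y.T @ Y, "fro")
    return float(cross / norm)

def mutual_knn(X, Y, k=5):
    """Average fraction of shared k nearest neighbors (cosine) per frame."""
    def knn(Z):
        Zn = Z / np.linalg.norm(Z, axis=1, keepdims=True)
        sim = Zn @ Zn.T
        np.fill_diagonal(sim, -np.inf)  # a frame is not its own neighbor
        return np.argsort(-sim, axis=1)[:, :k]
    nx, ny = knn(X), knn(Y)
    overlap = [len(set(a) & set(b)) / k for a, b in zip(nx, ny)]
    return float(np.mean(overlap))

def similarity_matrix(layers, metric):
    """Pairwise layer-by-layer similarity under the given metric."""
    n = len(layers)
    S = np.zeros((n, n))
    for i in range(n):
        for j in range(n):
            S[i, j] = metric(layers[i], layers[j])
    return S

# Synthetic stand-in for per-layer activations: each "layer" is the
# previous one plus noise, so nearby layers should look more alike.
rng = np.random.default_rng(0)
layers = [rng.normal(size=(200, 16))]
for _ in range(5):
    layers.append(layers[-1] + 0.3 * rng.normal(size=(200, 16)))

S = similarity_matrix(layers, linear_cka)
```

With real models, `layers` would instead hold the hidden states of each transformer layer for the same audio frames. In a matrix such as `S`, a block-like structure would show up as square regions of high values along the diagonal.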
Related papers
- Is Smaller Always Faster? Tradeoffs In Compressing Self-supervised Speech Transformers (2022)
- Structured Pruning Of Self-supervised Pre-trained Models For Speech Recognition And Understanding (2023)
- Speechformer: Reducing Information Loss In Direct Speech Translation (2021)
- Simplified Self-attention For Transformer-based End-to-end Speech Recognition (2020)
- Exploring Heterogeneous Characteristics Of Layers In ASR Models For More Efficient Training (2021)
- EfficientASR: Speech Recognition Network Compression Via Attention Redundancy And Chunk-level FFN Optimization (2024)
- Input-independent Attention Weights Are Expressive Enough: A Study Of Attention In Self-supervised Audio Transformers (2020)
- Resource-efficient Separation Transformer (2022)