An Empirical Analysis Of Speech Self-supervised Learning At Multiple Resolutions
2024 · Theo Clark, Benedetta Cevoli, Eloy de Jong, et al.
Abstract
Self-supervised learning (SSL) models have become crucial in speech processing, with recent work concentrating on architectures that capture representations across multiple timescales. The primary goal of these multi-scale architectures is to exploit the hierarchical nature of speech, where lower-resolution components aim to capture representations that align with increasingly abstract concepts (e.g., from phones to words to sentences). Although multi-scale approaches have demonstrated some improvements over single-scale models, the precise reasons for these improvements are poorly supported empirically. In this study, we present an initial analysis of layer-wise representations in multi-scale architectures, with a focus on Canonical Correlation Analysis (CCA) and Mutual Information (MI). We apply this analysis to Multi-Resolution HuBERT (MR-HuBERT) and find that (1) the improved performance on SUPERB tasks is primarily due to the auxiliary low-resolution loss rather than
Related papers
- Multi-resolution HuBERT: Multi-resolution Speech Self-supervised Learning With Masked Unit Prediction (2023)
- Speech Representation Analysis Based On Inter- And Intra-model Similarities (2024)
- A Large-scale Probing Analysis Of Speaker-specific Attributes In Self-supervised Speech Representations (2025)
- Comparative Layer-wise Analysis Of Self-supervised Speech Models (2022)
- What Do Self-supervised Speech And Speaker Models Learn? New Findings From A Cross Model Layer-wise Analysis (2024)
- Evidence Of Vocal Tract Articulation In Self-supervised Learning Of Speech (2022)
- UniSpeech-SAT: Universal Speech Representation Learning With Speaker Aware Pre-training (2021)
- Analyzing The Factors Affecting Usefulness Of Self-supervised Pre-trained Representations For Speech Recognition (2022)