Residual Speech Embeddings For Tone Classification: Removing Linguistic Content To Enhance Paralinguistic Analysis
2025 · Hamdan Al Ahbabi, Gautier Marti, Saeed Almarri, et al.
Abstract
Self-supervised learning models for speech processing, such as wav2vec2, HuBERT, WavLM, and Whisper, generate embeddings that capture both linguistic and paralinguistic information, making it challenging to analyze tone independently of spoken content. In this work, we introduce a method for disentangling paralinguistic features from linguistic content by regressing speech embeddings onto their corresponding text embeddings and using the residuals as a representation of vocal tone. We evaluate this approach across multiple self-supervised speech embeddings, demonstrating that residual embeddings significantly improve tone classification performance compared to raw speech embeddings. Our results show that this method enhances linear separability, enabling improved classification even with simple models such as logistic regression. Visualization of the residual embeddings further confirms the successful removal of linguistic information while preserving tone-related features. These findi
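The regression step described in the abstract can be sketched as follows. This is a minimal illustration, not the authors' implementation: it assumes the regression is ordinary linear least squares with an intercept (the abstract does not state the form), and the function name `residual_embeddings` and the synthetic data are hypothetical stand-ins for real speech and text embeddings.

```python
import numpy as np


def residual_embeddings(speech, text):
    """Regress speech embeddings on text embeddings and return the residuals.

    Assumption: ordinary least squares with an intercept; the source only
    says speech embeddings are regressed onto text embeddings.
    """
    speech = np.asarray(speech, dtype=float)
    text = np.asarray(text, dtype=float)
    # Append a column of ones so the fit includes an intercept term.
    design = np.hstack([text, np.ones((text.shape[0], 1))])
    # Least-squares weights minimizing ||design @ weights - speech||.
    weights, *_ = np.linalg.lstsq(design, speech, rcond=None)
    # Whatever the text embeddings cannot explain linearly is kept.
    return speech - design @ weights


# Synthetic stand-in data (hypothetical): speech = linear image of the text
# embedding + a tone-dependent offset + small noise.
rng = np.random.default_rng(0)
n, d_text, d_speech = 200, 8, 16
text = rng.normal(size=(n, d_text))
tone = rng.integers(0, 2, size=n)
mixing = rng.normal(size=(d_text, d_speech))
tone_direction = rng.normal(size=d_speech)
speech = (
    text @ mixing
    + np.outer(tone, tone_direction)
    + 0.1 * rng.normal(size=(n, d_speech))
)

residuals = residual_embeddings(speech, text)
```

By construction the residuals are linearly uncorrelated with the text embeddings on the data used for the fit. A simple classifier such as logistic regression could then be trained on `residuals` in place of `speech`; in a held-out evaluation, the weights would presumably be fitted on training data only and then applied to the test set.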
Related papers
- Contentvec: An Improved Self-supervised Speech Representation By Disentangling Speakers (2022) · 0.00
- Speaker Disentanglement Of Speech Pre-trained Model Based On Interpretability (2025) · 0.00
- Do Discrete Self-supervised Representations Of Speech Capture Tone Distinctions? (2024) · 0.00
- Disentangling Voice And Content With Self-supervision For Speaker Recognition (2023) · 2.26
- Disentangling Prosody Representations With Unsupervised Speech Reconstruction (2022) · 0.00
- A Layer-wise Analysis Of Mandarin And English Suprasegmentals In SSL Speech Models (2024) · 0.00
- Disentangling Textual And Acoustic Features Of Neural Speech Representations (2024) · 0.00
- Investigating Disentanglement In A Phoneme-level Speech Codec For Prosody Modeling (2024) · 4.52