Learning Disentangled Speech Representations
2023 · Yusuf Brima, Ulf Krumnack, Simone Pika, et al.
Abstract
Disentangled representation learning in speech processing has lagged behind other domains, largely due to the lack of datasets with annotated generative factors for robust evaluation. To address this, we propose SynSpeech, a novel large-scale synthetic speech dataset specifically designed to enable research on disentangled speech representations. SynSpeech includes controlled variations in speaker identity, spoken text, and speaking style, with three dataset versions to support experimentation at different levels of complexity. In this study, we present a comprehensive framework to evaluate disentangled representation learning techniques, applying both linear probing and established supervised disentanglement metrics to assess the modularity, compactness, and informativeness of the representations learned by a state-of-the-art model. Using the RAVE model as a test case, we find that SynSpeech facilitates benchmarking across a range of factors, achieving promising disentanglement of s
Related papers
- 3D-Speaker: A Large-scale Multi-device, Multi-distance, And Multi-dialect Corpus For Speech Representation Disentanglement (2023) 0.00
- DSVAE: Interpretable Disentangled Representation For Synthetic Speech Detection (2023) 0.00
- Speech Resynthesis From Discrete Disentangled Self-supervised Representations (2021) 16.25
- Towards The Next Frontier In Speech Representation Learning Using Disentanglement (2024) 0.00
- Self-supervised Disentangled Representation Learning For Robust Target Speech Extraction (2023) 5.24
- Adversarially Learning Disentangled Speech Representations For Robust Multi-factor Voice Conversion (2021) 9.92
- Disentangling Speech And Non-speech Components For Building Robust Acoustic Models From Found Data (2019) 0.00
- Unsupervised Speech Enhancement With Speech Recognition Embedding And Disentanglement Losses (2021) 8.35