Self-supervised Disentangled Representation Learning For Robust Target Speech Extraction
2023 · Zhaoxi Mu, Xinyu Yang, Sining Sun, et al.
Abstract
Speech signals are inherently complex as they encompass both global acoustic characteristics and local semantic information. However, in the task of target speech extraction, certain elements of global and local semantic information in the reference speech, which are irrelevant to speaker identity, can lead to speaker confusion within the speech extraction network. To overcome this challenge, we propose a self-supervised disentangled representation learning method. Our approach tackles this issue through a two-phase process, utilizing a reference speech encoding network and a global information disentanglement network to gradually disentangle the speaker identity information from other irrelevant factors. We exclusively employ the disentangled speaker identity information to guide the speech extraction network. Moreover, we introduce the adaptive modulation Transformer to ensure that the acoustic representation of the mixed signal remains undisturbed by the speaker embeddings. This com
Related papers
- Intra-class Variation Reduction Of Speaker Representation In Disentanglement Framework (2020)
- ContentVec: An Improved Self-supervised Speech Representation By Disentangling Speakers (2022)
- Towards The Next Frontier In Speech Representation Learning Using Disentanglement (2024)
- Disentangled Representation Learning For Environment-agnostic Speaker Recognition (2024)
- DEAAN: Disentangled Embedding And Adversarial Adaptation Network For Robust Speaker Representation Learning (2020)
- Learning Disentangled Speech Representations (2023)
- Unsupervised Speech Enhancement With Speech Recognition Embedding And Disentanglement Losses (2021)
- Robust Disentangled Variational Speech Representation Learning For Zero-shot Voice Conversion (2022)