Speaker And Style Disentanglement Of Speech Based On Contrastive Predictive Coding Supported Factorized Variational Autoencoder
2024 · Yuying Xie, Michael Kuhlmann, Frederik Rautenberg, et al.
Abstract
Speech signals carry information at multiple levels, including content, speaker, and style. Disentangling this information, although challenging, is important for applications such as voice conversion. The contrastive predictive coding supported factorized variational autoencoder achieves unsupervised disentanglement of a speech signal into speaker and content embeddings by assuming that speaker information is temporally more stable than content-induced variations. However, this assumption may introduce other temporally stable information, such as environment or emotion, into the speaker embeddings; we call this information style. In this work, we propose a method to further disentangle non-content features into distinct speaker and style features, notably by leveraging readily accessible and well-defined speaker labels without requiring style labels. Experimental results validate the proposed method's effectiveness in extracting disentangled features, thereby facilitating speaker,
Related papers
- Contrastive Predictive Coding Supported Factorized Variational Autoencoder For Unsupervised Learning Of Disentangled Speech Representations (2020) · 8.09
- Unsupervised Learning Of Disentangled Speech Content And Style Representation (2020) · 7.50
- Improved Disentangled Speech Representations Using Contrastive Learning In Factorized Hierarchical Variational Autoencoder (2022) · 2.26
- Disentanglement Of Emotional Style And Speaker Identity For Expressive Voice Conversion (2021) · 10.97
- Multi-speaker Multi-style Speech Synthesis With Timbre And Style Disentanglement (2022) · 6.77
- Many-to-many Voice Conversion Based Feature Disentanglement Using Variational Autoencoder (2021) · 7.81
- Disentangled Representation Learning For Environment-agnostic Speaker Recognition (2024) · 4.82
- Multi-speaker Expressive Speech Synthesis Via Multiple Factors Decoupling (2022) · 0.00