Contrastive Predictive Coding Supported Factorized Variational Autoencoder For Unsupervised Learning Of Disentangled Speech Representations
2020 · Janek Ebbers, Michael Kuhlmann, Tobias Cord-Landwehr, et al.
Abstract
In this work we address disentanglement of style and content in speech signals. We propose a fully convolutional variational autoencoder employing two encoders: a content encoder and a style encoder. To foster disentanglement, we propose adversarial contrastive predictive coding. This new disentanglement method needs neither parallel data nor any supervision. We show that the proposed technique is capable of separating speaker and content traits into the two different representations, and that its speaker-content disentanglement performance is competitive with other unsupervised approaches. We further demonstrate that, when used for phone recognition, the content representation is more robust against a train-test mismatch than spectral features.
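The contrastive objective that contrastive predictive coding builds on can be illustrated with a minimal sketch. This is not the authors' implementation: the function name `info_nce_loss`, the dot-product scoring, and the toy vectors are assumptions made for illustration, and the abstract does not specify how the loss is applied adversarially, so that part is not shown.

```python
import math


def dot(u, v):
    """Plain dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def info_nce_loss(query, candidates, positive_index):
    """InfoNCE-style loss (hypothetical helper, not from the paper).

    Scores every candidate against the query with a dot product and
    returns the negative log-softmax score of the positive candidate.
    The loss is small when the positive scores higher than the negatives.
    """
    scores = [dot(query, c) for c in candidates]
    # Numerically stable log-sum-exp over all candidate scores.
    top = max(scores)
    log_norm = top + math.log(sum(math.exp(s - top) for s in scores))
    return log_norm - scores[positive_index]


# Toy example: the query points in the same direction as candidate 0.
query = [1.0, 0.0]
candidates = [[2.0, 0.0], [0.0, 2.0], [-2.0, 0.0]]

loss_match = info_nce_loss(query, candidates, positive_index=0)
loss_mismatch = info_nce_loss(query, candidates, positive_index=2)
```

In this toy setup the loss is lower when the positive is the candidate aligned with the query, which is the behavior a contrastive encoder is trained toward. In a real system the vectors would be learned embeddings of speech segments rather than hand-picked values.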
Related papers
- Speaker And Style Disentanglement Of Speech Based On Contrastive Predictive Coding Supported Factorized Variational Autoencoder (2024)
- Unsupervised Learning Of Disentangled Speech Content And Style Representation (2020)
- Adversarially Learning Disentangled Speech Representations For Robust Multi-factor Voice Conversion (2021)
- Many-to-many Voice Conversion Based Feature Disentanglement Using Variational Autoencoder (2021)
- Improved Disentangled Speech Representations Using Contrastive Learning In Factorized Hierarchical Variational Autoencoder (2022)
- Disentangled Feature Learning For Real-time Neural Speech Coding (2022)
- Fine-grained Style Modeling, Transfer And Prediction In Text-to-speech Synthesis Via Phone-level Content-style Disentanglement (2020)
- Robust Disentangled Variational Speech Representation Learning For Zero-shot Voice Conversion (2022)