Improved Disentangled Speech Representations Using Contrastive Learning In Factorized Hierarchical Variational Autoencoder
2022 · Yuying Xie, Thomas Arildsen, Zheng-Hua Tan
Abstract
Leveraging the fact that speaker identity and content vary on different time scales, the factorized hierarchical variational autoencoder (FHVAE) uses different latent variables to represent these two attributes. The attributes are disentangled by assigning different priors to the corresponding latent variables. For the speaker identity variable, FHVAE assumes a Gaussian prior whose mean varies from utterance to utterance and whose variance is fixed. Setting a small fixed variance encourages the identity variables within one utterance to gather close to the mean of their prior. However, this constraint is relatively weak, as the mean of the prior changes between utterances. Therefore, we introduce contrastive learning into the FHVAE framework, so that speaker identity variables cluster together when they represent the same speaker while staying as far as possible from those of other speakers. The model structure has not been chang
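The abstract does not state the exact form of the contrastive objective, so the following is only a minimal sketch of the general idea: a generic supervised, InfoNCE-style loss that pulls speaker identity vectors of the same speaker together and pushes those of other speakers apart. The function names, the use of cosine similarity, and the `temperature` parameter are assumptions made for illustration and are not taken from the paper.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def speaker_contrastive_loss(z, speakers, temperature=0.1):
    """Average InfoNCE-style loss over all same-speaker (anchor, positive) pairs.

    z        : list of speaker identity vectors (hypothetical model outputs)
    speakers : list of speaker labels, one per vector
    Returns 0.0 when no same-speaker pair exists.
    """
    total, pairs = 0.0, 0
    n = len(z)
    for i in range(n):
        # Scaled similarities between the anchor and every other vector.
        logits = {k: cosine(z[i], z[k]) / temperature for k in range(n) if k != i}
        if not logits:
            continue
        # Numerically stable log-sum-exp for the softmax denominator.
        m = max(logits.values())
        log_denom = m + math.log(sum(math.exp(v - m) for v in logits.values()))
        for p, logit in logits.items():
            if speakers[p] == speakers[i]:
                total -= logit - log_denom
                pairs += 1
    return total / pairs if pairs else 0.0


# Toy usage: two well-separated clusters in 2-D.
z = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
loss_clustered = speaker_contrastive_loss(z, ["a", "a", "b", "b"])
loss_mixed = speaker_contrastive_loss(z, ["a", "b", "a", "b"])
```

In this toy example the loss is lower when the labels agree with the geometric clusters than when they cut across them, which is the behaviour the abstract describes for same-speaker versus different-speaker identity variables.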
Related papers
- Disentangled Speech Representation Learning Based On Factorized Hierarchical Variational Autoencoder With Self-supervised Objective (2022) · 7.81
- Unsupervised Representation Learning Of Speech For Dialect Identification (2018) · 7.16
- Speaker And Style Disentanglement Of Speech Based On Contrastive Predictive Coding Supported Factorized Variational Autoencoder (2024) · 2.26
- Adversarially Learning Disentangled Speech Representations For Robust Multi-factor Voice Conversion (2021) · 9.92
- Disentangled Speech Representation Learning For One-shot Cross-lingual Voice Conversion Using β-VAE (2022) · 7.50
- Contrastive Predictive Coding Supported Factorized Variational Autoencoder For Unsupervised Learning Of Disentangled Speech Representations (2020) · 8.09
- Scalable Factorized Hierarchical Variational Autoencoder Training (2018) · 7.81
- Investigation Of Using Disentangled And Interpretable Representations For One-shot Cross-lingual Voice Conversion (2018) · 6.77