Learning Disentangled Phone And Speaker Representations In A Semi-supervised VQ-VAE Paradigm
2020 · Jennifer Williams, Yi Zhao, Erica Cooper, et al.
Abstract
We present a new approach to disentangle speaker voice and phone content by introducing new components to the VQ-VAE architecture for speech synthesis. The original VQ-VAE does not generalize well to unseen speakers or content. To alleviate this problem, we have incorporated a speaker encoder and speaker VQ codebook that learns global speaker characteristics entirely separate from the existing sub-phone codebooks. We also compare two training methods: self-supervised with global conditions and semi-supervised with speaker labels. Adding a speaker VQ component improves objective measures of speech synthesis quality (estimated MOS, speaker similarity, ASR-based intelligibility) and provides learned representations that are meaningful. Our speaker VQ codebook indices can be used in a simple speaker diarization task and perform slightly better than an x-vector baseline. Additionally, phones can be recognized from sub-phone VQ codebook indices in our semi-supervised VQ-VAE better than self-
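The split between per-frame sub-phone codes and a single global speaker code can be illustrated with a small sketch. This is not the authors' implementation: the random vectors stand in for encoder outputs, the codebook sizes and dimensions are arbitrary, and mean pooling over time is only one assumed way to obtain an utterance-level speaker vector. All names (`quantize`, `mean_pool`, `phone_codebook`, `speaker_codebook`) are hypothetical.

```python
import random


def quantize(vector, codebook):
    """Return the index of the codebook entry nearest to `vector`."""
    def sq_dist(a, b):
        # Squared Euclidean distance between two equal-length vectors.
        return sum((x - y) ** 2 for x, y in zip(a, b))
    return min(range(len(codebook)), key=lambda k: sq_dist(vector, codebook[k]))


def mean_pool(frames):
    """Average a list of frame vectors into one utterance-level vector."""
    n = len(frames)
    return [sum(column) / n for column in zip(*frames)]


random.seed(0)
DIM, N_FRAMES = 4, 6  # illustrative sizes, not taken from the paper


def rand_vec():
    return [random.gauss(0.0, 1.0) for _ in range(DIM)]


# Two separate codebooks: one for sub-phone content, one for speaker identity.
phone_codebook = [rand_vec() for _ in range(8)]
speaker_codebook = [rand_vec() for _ in range(3)]

# Stand-ins for the outputs of a content encoder and a speaker encoder.
content_latents = [rand_vec() for _ in range(N_FRAMES)]
speaker_latents = [rand_vec() for _ in range(N_FRAMES)]

# Content is quantized frame by frame, giving one index per frame.
phone_indices = [quantize(frame, phone_codebook) for frame in content_latents]

# Speaker identity is quantized once per utterance (assumed: mean pooling).
speaker_index = quantize(mean_pool(speaker_latents), speaker_codebook)

# A decoder would be conditioned on both codes; here they are concatenated.
decoder_inputs = [
    phone_codebook[i] + speaker_codebook[speaker_index] for i in phone_indices
]
```

In this sketch, `phone_indices` plays the role of the sequence a phone recognizer would consume, while `speaker_index` is the kind of single discrete label that a diarization experiment could compare across utterances.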
Related papers
- Robust Disentangled Variational Speech Representation Learning For Zero-shot Voice Conversion (2022)
- VQMIVC: Vector Quantization And Mutual Information-based Unsupervised Speech Representation Disentanglement For One-shot Voice Conversion (2021)
- Disentangled Speech Representation Learning For One-shot Cross-lingual Voice Conversion Using \(\beta\)-VAE (2022)
- Improved Prosody From Learned F0 Codebook Representations For VQ-VAE Speech Waveform Reconstruction (2020)
- Towards Unsupervised Phone And Word Segmentation Using Self-supervised Vector-quantized Neural Networks (2020)
- DART: Disentanglement Of Accent And Speaker Representation In Multispeaker Text-to-speech (2024)
- Contrastive Predictive Coding Supported Factorized Variational Autoencoder For Unsupervised Learning Of Disentangled Speech Representations (2020)
- Many-to-many Voice Conversion Based Feature Disentanglement Using Variational Autoencoder (2021)