Non-parallel Sequence-to-sequence Voice Conversion With Disentangled Linguistic And Speaker Representations
2019 · Jing-Xuan Zhang, Zhen-Hua Ling, Li-Rong Dai
Abstract
This paper presents a method of sequence-to-sequence (seq2seq) voice conversion using non-parallel training data. In this method, disentangled linguistic and speaker representations are extracted from acoustic features, and voice conversion is achieved by preserving the linguistic representations of source utterances while replacing the speaker representations with the target ones. Our model is built under the framework of encoder-decoder neural networks. A recognition encoder is designed to learn the disentangled linguistic representations with two strategies. First, phoneme transcriptions of training data are introduced to provide references for learning linguistic representations of audio signals. Second, an adversarial training strategy is employed to further wipe out speaker information from the linguistic representations. Meanwhile, speaker representations are extracted from audio signals by a speaker encoder. The model parameters are estimated by two-stage training, including
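The conversion idea described in the abstract (keep the linguistic representation of the source utterance, swap in the speaker representation of the target) can be illustrated with a deliberately simplified sketch. Everything here is an assumption made for illustration: the function names are hypothetical, and the mean/residual arithmetic is a toy stand-in for the paper's neural recognition encoder, speaker encoder, and seq2seq decoder, not the authors' implementation.

```python
from statistics import fmean

def speaker_encoder(frames):
    """Toy stand-in: one utterance-level vector (per-dimension mean over frames)."""
    return [fmean(dim) for dim in zip(*frames)]

def recognition_encoder(frames):
    """Toy stand-in: frame-level residual left after removing the utterance mean."""
    speaker = speaker_encoder(frames)
    return [[x - s for x, s in zip(frame, speaker)] for frame in frames]

def decoder(linguistic, speaker):
    """Toy stand-in: recombine frame-level content with a speaker vector."""
    return [[c + s for c, s in zip(frame, speaker)] for frame in linguistic]

def convert(source_frames, target_frames):
    """Preserve the source's linguistic part, replace the speaker part with the target's."""
    linguistic = recognition_encoder(source_frames)
    target_speaker = speaker_encoder(target_frames)
    return decoder(linguistic, target_speaker)

# Hypothetical 2-dimensional "acoustic features", two frames per utterance.
source = [[1.0, 2.0], [3.0, 4.0]]
target = [[10.0, 10.0], [20.0, 30.0]]
converted = convert(source, target)
print(converted)
```

In this toy setting the converted utterance has the target's speaker vector and the source's frame-level content, which is the property the disentangled representations are meant to provide. The phoneme supervision and adversarial training mentioned in the abstract have no counterpart in this sketch.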
Related papers
- Learning In Your Voice: Non-parallel Voice Conversion Based On Speaker Consistency Loss (2020)
- Investigation Of Using Disentangled And Interpretable Representations For One-shot Cross-lingual Voice Conversion (2018)
- Singing Voice Conversion With Disentangled Representations Of Singer And Vocal Technique Using Variational Autoencoders (2019)
- Many-to-many Voice Conversion Based Feature Disentanglement Using Variational Autoencoder (2021)
- Any-to-one Sequence-to-sequence Voice Conversion Using Self-supervised Discrete Speech Representations (2020)
- Hierarchical Sequence To Sequence Voice Conversion With Limited Data (2019)
- Accent And Speaker Disentanglement In Many-to-many Voice Conversion (2020)
- Multi-target Voice Conversion Without Parallel Data By Adversarially Learning Disentangled Audio Representations (2018)