Any-to-many Voice Conversion With Location-relative Sequence-to-sequence Modeling
2020 · Songxiang Liu, Yuewen Cao, Disong Wang, et al.
Abstract
This paper proposes an any-to-many location-relative, sequence-to-sequence (seq2seq), non-parallel voice conversion approach, which utilizes text supervision during training. In this approach, we combine a bottle-neck feature extractor (BNE) with a seq2seq synthesis module. During the training stage, an encoder-decoder-based hybrid connectionist-temporal-classification-attention (CTC-attention) phoneme recognizer is trained, whose encoder has a bottle-neck layer. A BNE is obtained from the phoneme recognizer and is utilized to extract speaker-independent, dense and rich spoken content representations from spectral features. Then a multi-speaker location-relative attention based seq2seq synthesis model is trained to reconstruct spectral features from the bottle-neck features, conditioning on speaker representations for speaker identity control in the generated speech. To mitigate the difficulties of using seq2seq models to align long sequences, we down-sample the input spectral feature
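The abstract mentions down-sampling the input spectral features so that the seq2seq model has shorter sequences to align. The excerpt above does not say how the down-sampling is done or by what factor, so the following is only a minimal sketch under assumptions: it averages every `factor` consecutive frames, and the function name `downsample_frames`, the factor of 4, and the 80-dimensional random "mel" input are all hypothetical choices for illustration, not the authors' method.

```python
import numpy as np


def downsample_frames(features: np.ndarray, factor: int = 4) -> np.ndarray:
    """Shorten a (num_frames, feat_dim) sequence along the time axis.

    Assumption: frames are merged by averaging each group of `factor`
    consecutive frames. The paper may use a different scheme.
    """
    if factor < 1:
        raise ValueError("factor must be a positive integer")
    num_frames, feat_dim = features.shape
    # Pad by repeating the last frame so the length divides evenly.
    pad = (-num_frames) % factor
    if pad:
        tail = np.repeat(features[-1:], pad, axis=0)
        features = np.concatenate([features, tail], axis=0)
    # Group consecutive frames and average within each group.
    return features.reshape(-1, factor, feat_dim).mean(axis=1)


# Hypothetical input: 1003 frames of 80-dimensional spectral features.
mel = np.random.default_rng(0).standard_normal((1003, 80))
short = downsample_frames(mel, factor=4)
print(mel.shape, "->", short.shape)
```

With a factor of 4 the attention mechanism has roughly a quarter as many input positions to align against, which is the kind of reduction the abstract is motivating.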
Related papers
- Non-parallel Sequence-to-sequence Voice Conversion With Disentangled Linguistic And Speaker Representations (2019) 14.02
- Hierarchical Sequence To Sequence Voice Conversion With Limited Data (2019) 0.00
- Any-to-one Sequence-to-sequence Voice Conversion Using Self-supervised Discrete Speech Representations (2020) 0.00
- Disentangling Content And Fine-grained Prosody Information Via Hybrid ASR Bottleneck Features For Voice Conversion (2022) 10.48
- Improving Sequence-to-sequence Acoustic Modeling By Adding Text-supervision (2018) 9.92
- End-to-end Zero-shot Voice Conversion With Location-variable Convolutions (2022) 7.50
- AttS2S-VC: Sequence-to-sequence Voice Conversion With Attention And Context Preservation Mechanisms (2018) 14.15
- Accent And Speaker Disentanglement In Many-to-many Voice Conversion (2020) 10.35