AAS-VC: On The Generalization Ability Of Automatic Alignment Search Based Non-autoregressive Sequence-to-sequence Voice Conversion
2023 · Wen-Chin Huang, Kazuhiro Kobayashi, Tomoki Toda
Abstract
Non-autoregressive (non-AR) sequence-to-sequence (seq2seq) models for voice conversion (VC) are attractive for their ability to effectively model temporal structure while enjoying boosted intelligibility and fast inference thanks to non-AR modeling. However, the dependency of current non-AR seq2seq VC models on ground truth durations extracted from an external AR model greatly limits their ability to generalize to smaller training datasets. In this paper, we first demonstrate this problem by varying the training data size. Then, we present AAS-VC, a non-AR seq2seq VC model based on automatic alignment search (AAS), which removes the dependency on external durations and serves as a proper inductive bias to provide the required generalization ability for small datasets. Experimental results show that AAS-VC can generalize better to a training dataset of only 5 minutes. We also conducted ablation studies to justify several model design choices. The audio samples and implemen
Related papers
- Any-to-one Sequence-to-sequence Voice Conversion Using Self-supervised Discrete Speech Representations (2020) 0.00
- FastS2S-VC: Streaming Non-autoregressive Sequence-to-sequence Voice Conversion (2021) 0.00
- AttS2S-VC: Sequence-to-sequence Voice Conversion With Attention And Context Preservation Mechanisms (2018) 14.15
- ACVAE-VC: Non-parallel Many-to-many Voice Conversion With Auxiliary Classifier Variational Autoencoder (2018) 14.69
- The NU Voice Conversion System For The Voice Conversion Challenge 2020: On The Effectiveness Of Sequence-to-sequence Models And Autoregressive Neural Vocoders (2020) 3.58
- Assem-VC: Realistic Voice Conversion By Assembling Modern Speech Synthesis Techniques (2021) 11.64
- Efficient Non-autoregressive GAN Voice Conversion Using VQWav2vec Features And Dynamic Convolution (2022) 0.00
- Measuring The Effectiveness Of Voice Conversion On Speaker Identification And Automatic Speech Recognition Systems (2019) 0.00