Speaker Verification-derived Loss And Data Augmentation For DNN-based Multispeaker Speech Synthesis
2021 · Beata Lorincz, Adriana Stan, Mircea Giurgiu
Abstract
Building multispeaker neural network-based text-to-speech (TTS) synthesis systems commonly relies on the availability of large amounts of high-quality recordings from each speaker, and on conditioning the training process on the speaker's identity or on a learned representation of it. However, when little data is available from each speaker, or the number of speakers is limited, a multispeaker TTS system can be hard to train and will yield poor speaker similarity and naturalness. To address this issue, we explore two directions: forcing the network to learn a better speaker identity representation by adding a loss term, and augmenting the input data for each speaker using waveform manipulation methods. We show that both methods are effective when evaluated with both objective and subjective measures. The additional loss term improves speaker similarity, while the data augmentation improves the intelligibility of the multispeaker TTS system.
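The two directions named in the abstract can be illustrated with a minimal sketch. The abstract does not give the exact form of the loss or say which waveform manipulations are used, so everything below is an assumption made for illustration: the speaker term is taken to be one minus the cosine similarity between two speaker embeddings (as a speaker verification model might produce), the weighting parameter `weight` is hypothetical, and gain scaling and speed change stand in for the unspecified augmentation methods.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def speaker_loss(ref_embedding, synth_embedding):
    """Assumed speaker term: 0 when the embeddings point the same way."""
    return 1.0 - cosine_similarity(ref_embedding, synth_embedding)


def total_loss(reconstruction_loss, ref_embedding, synth_embedding, weight=1.0):
    """Usual TTS loss plus a weighted speaker term (weight is hypothetical)."""
    return reconstruction_loss + weight * speaker_loss(ref_embedding, synth_embedding)


def scale_gain(waveform, gain):
    """Illustrative augmentation: scale amplitude and clip to [-1, 1]."""
    return [max(-1.0, min(1.0, s * gain)) for s in waveform]


def change_speed(waveform, rate):
    """Illustrative augmentation: resample by linear interpolation.

    rate > 1 shortens the signal, rate < 1 lengthens it.
    """
    if not waveform:
        return []
    n_out = max(1, int(len(waveform) / rate))
    out = []
    for i in range(n_out):
        pos = i * rate
        lo = int(pos)
        hi = min(lo + 1, len(waveform) - 1)
        frac = pos - lo
        out.append((1.0 - frac) * waveform[lo] + frac * waveform[hi])
    return out
```

In this sketch the speaker term only pulls the synthesized speech toward the reference speaker's embedding, while the augmentation functions produce extra training waveforms from each original recording.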
Related papers
- From Speaker Verification To Multispeaker Speech Synthesis, Deep Transfer With Feedback Constraint (2020)
- Data Augmentation Enhanced Speaker Enrollment For Text-dependent Speaker Verification (2020)
- Relational Data Selection For Data Augmentation Of Speaker-dependent Multi-band MelGAN Vocoder (2021)
- Training Multi-speaker Neural Text-to-speech Systems Using Speaker-imbalanced Speech Corpora (2019)
- Transfer Learning From Speaker Verification To Multispeaker Text-to-speech Synthesis (2018)
- Exploring Voice Conversion Based Data Augmentation In Text-dependent Speaker Verification (2020)
- Adaptive Data Augmentation With NaturalSpeech3 For Far-field Speaker Verification (2025)
- Unit Selection Synthesis Based Data Augmentation For Fixed Phrase Speaker Verification (2021)