Training Universal Vocoders With Feature Smoothing-based Augmentation Methods For High-quality TTS Systems
2024 · Jeongmin Liu, Eunwoo Song
Abstract
While universal vocoders have achieved proficient waveform generation across diverse voices, their integration into text-to-speech (TTS) tasks often results in degraded synthetic quality. To address this challenge, we present a novel augmentation technique for training universal vocoders. Our training scheme randomly applies linear smoothing filters to input acoustic features, facilitating vocoder generalization across a wide range of smoothings. It significantly mitigates the training-inference mismatch, enhancing the naturalness of synthetic output even when the acoustic model produces overly smoothed features. Notably, our method is applicable to any vocoder without requiring architectural modifications or dependencies on specific acoustic models. The experimental results validate the superiority of our vocoder over conventional methods, achieving 11.99% and 12.05% improvements in mean opinion scores when integrated with Tacotron 2 and FastSpeech 2 TTS acoustic models, respectively.
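The augmentation described in the abstract can be illustrated with a minimal sketch. The abstract does not specify the filter family, the kernel sizes, or how they are sampled, so everything below is an assumption made for illustration: a moving-average filter along the time axis, a hypothetical set of odd kernel sizes, and the hypothetical function names `smooth_features` and `random_smoothing_augment`. It is not the authors' implementation.

```python
import numpy as np


def smooth_features(mel, kernel_size):
    """Moving-average filter along the time axis of a (n_mels, n_frames) array.

    Assumption: a box filter stands in for the paper's "linear smoothing
    filters"; the actual filters used by the authors may differ.
    """
    if kernel_size % 2 == 0:
        raise ValueError("kernel_size must be odd so the frame count is preserved")
    if kernel_size == 1:
        return mel.copy()  # size 1 means no smoothing
    kernel = np.ones(kernel_size) / kernel_size
    pad = kernel_size // 2
    # Edge-pad in time so the output has the same number of frames as the input.
    padded = np.pad(mel, ((0, 0), (pad, pad)), mode="edge")
    return np.stack([np.convolve(row, kernel, mode="valid") for row in padded])


def random_smoothing_augment(mel, rng, kernel_sizes=(1, 3, 5, 7)):
    """Pick a smoothing strength at random for one training example.

    Assumption: kernel_sizes is a hypothetical choice; including 1 keeps
    some unsmoothed (ground-truth-like) inputs in the training mix.
    """
    k = int(rng.choice(kernel_sizes))
    return smooth_features(mel, k)


rng = np.random.default_rng(0)
mel = rng.standard_normal((80, 50))          # stand-in for a mel-spectrogram
augmented = random_smoothing_augment(mel, rng)
```

In a training loop, the vocoder would receive `augmented` as its input while the target waveform stays unchanged, so the model sees features ranging from sharp to heavily smoothed, which is the kind of input an acoustic model such as Tacotron 2 or FastSpeech 2 may produce at inference time.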
Related papers
- BigVGAN: A Universal Neural Vocoder With Large-scale Training (2022)
- A Cyclical Post-filtering Approach To Mismatch Refinement Of Neural Vocoder For Text-to-speech Systems (2020)
- VITS2: Improving Quality And Efficiency Of Single-stage Text-to-speech With Adversarial Learning And Architecture Design (2023)
- Towards Achieving Robust Universal Neural Vocoding (2018)
- Expediting TTS Synthesis With Adversarial Vocoding (2019)
- Advances In Speech Vocoding For Text-to-speech With Continuous Parameters (2021)
- TTS-by-TTS: TTS-driven Data Augmentation For Fast And High-quality Speech Synthesis (2020)
- TTS-by-TTS 2: Data-selective Augmentation For Neural Speech Synthesis Using Ranking Support Vector Machine With Variational Autoencoder (2022)