Fast And Small Footprint Hybrid HMM-HiFiGAN Based System For Speech Synthesis In Indian Languages
2023 · Sudhanshu Srivastava, Ishika Gupta, Anusha Prakash, et al.
Abstract
Hidden-Markov-model (HMM) based text-to-speech (HTS) offers flexibility in speaking styles along with fast training and synthesis, while being computationally less intensive. HTS performs well even in low-resource scenarios. Its primary drawback is that the voice quality is poor compared to that of end-to-end (E2E) systems. A hybrid approach is proposed that combines HMM-based feature generation with a neural-network-based HiFi-GAN vocoder to improve HTS synthesis quality. HTS is trained on high-resolution mel-spectrograms instead of the conventional mel-generalized cepstral coefficients (MGC), and the mel-spectrogram generated for the input text is passed to a HiFi-GAN vocoder trained on Indic languages. The resulting speech has naturalness equivalent to that of E2E systems, as evidenced by the DMOS and PC tests.
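The abstract contrasts mel-spectrogram features with MGC. As a rough illustration of what the mel axis of such a spectrogram looks like, the sketch below computes band edges spaced evenly on the mel scale. It is not the authors' code: the 80 bands, the 0 to 11025 Hz range, the HTK-style mel formula, and all function names are assumptions chosen for illustration.

```python
import math


def hz_to_mel(hz):
    """Convert Hz to mel using the HTK-style formula (an assumed variant)."""
    return 2595.0 * math.log10(1.0 + hz / 700.0)


def mel_to_hz(mel):
    """Inverse of hz_to_mel."""
    return 700.0 * (10.0 ** (mel / 2595.0) - 1.0)


def mel_band_edges(n_mels, f_min, f_max):
    """Return n_mels + 2 band edges in Hz, equally spaced on the mel scale.

    Each triangular mel filter spans three consecutive edges, so n_mels
    filters need n_mels + 2 edges.
    """
    lo, hi = hz_to_mel(f_min), hz_to_mel(f_max)
    step = (hi - lo) / (n_mels + 1)
    return [mel_to_hz(lo + i * step) for i in range(n_mels + 2)]


# Hypothetical settings: 80 mel bands up to the Nyquist frequency of 22.05 kHz audio.
edges = mel_band_edges(n_mels=80, f_min=0.0, f_max=11025.0)
```

The bands are narrow at low frequencies and widen toward high frequencies, which is the property that makes the mel axis a compact, perceptually motivated representation for a vocoder to invert.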
Related papers
- HiFi-GAN: Generative Adversarial Networks For Efficient And High Fidelity Speech Synthesis (2020) · 0.00
- HiFTNet: A Fast High-quality Neural Vocoder With Harmonic-plus-noise Filter And Inverse Short Time Fourier Transform (2023) · 0.00
- Towards Building Text-to-speech Systems For The Next Billion Users (2022) · 0.00
- JETS: Jointly Training FastSpeech2 And HiFi-GAN For End To End Text To Speech (2022) · 12.10
- MHTTS: Fast Multi-head Text-to-speech For Spontaneous Speech With Imperfect Transcription (2022) · 0.00
- Generic Indic Text-to-speech Synthesisers With Rapid Adaptation In An End-to-end Framework (2020) · 8.82
- Rapid Speaker Adaptation In Low Resource Text To Speech Systems Using Synthetic Data And Transfer Learning (2023) · 0.00
- HMM-based Data Augmentation For E2E Systems For Building Conversational Speech Synthesis Systems (2022) · 0.00