High-quality Speech Synthesis Using Super-resolution Mel-spectrogram
2019 · Leyuan Sheng, Dong-Yan Huang, Evgeniy N. Pavlovskiy
Abstract
In speech synthesis and speech enhancement systems, mel-spectrograms need to be precise acoustic representations. However, generated spectrograms are often over-smoothed and therefore cannot produce high-quality synthesized speech. Inspired by image-to-image translation, we address this problem with a learning-based post-filter that combines Pix2PixHD and ResUnet to reconstruct the mel-spectrograms together with super-resolution. The resulting super-resolution spectrogram networks generate enhanced spectrograms that produce high-quality synthesized speech. Our proposed model achieves improved mean opinion scores (MOS) of 3.71 and 4.01, over baseline results of 3.29 and 3.84, when using the Griffin-Lim and WaveNet vocoders, respectively.
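To put the reported MOS figures in perspective, the short sketch below computes the absolute and relative gains over the baselines for each vocoder. The function name `mos_gain` and the dictionary layout are illustrative assumptions, not part of the paper; only the four MOS values come from the abstract.

```python
# Minimal sketch: compare the reported MOS values against their baselines.
# `mos_gain` is a hypothetical helper introduced here for illustration.

def mos_gain(baseline: float, proposed: float) -> tuple[float, float]:
    """Return (absolute gain, relative gain in percent) of proposed over baseline."""
    absolute = proposed - baseline
    relative = 100.0 * absolute / baseline
    return absolute, relative


# MOS values as stated in the abstract: (baseline, proposed) per vocoder.
reported = {
    "Griffin-Lim": (3.29, 3.71),
    "WaveNet": (3.84, 4.01),
}

for vocoder, (baseline, proposed) in reported.items():
    absolute, relative = mos_gain(baseline, proposed)
    print(f"{vocoder}: +{absolute:.2f} MOS ({relative:.1f}% over baseline)")
```

Under these numbers the gain is larger with Griffin-Lim (+0.42) than with WaveNet (+0.17).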
Related papers
- Universal MelGAN: A Robust Neural Vocoder For High-fidelity Waveform Generation In Multiple Domains (2020)
- Mel-FullSubNet: Mel-spectrogram Enhancement For Improving Both Speech Quality And ASR (2024)
- Wave-U-Mamba: An End-to-end Framework For High-quality And Efficient Speech Super Resolution (2024)
- HiFi-SR: A Unified Generative Transformer-convolutional Adversarial Network For High-fidelity Speech Super-resolution (2025)
- Universal Adaptor: Converting Mel-spectrograms Between Different Configurations For Speech Synthesis (2022)
- Neural Vocoder Is All You Need For Speech Super-resolution (2022)
- UnivNet: A Neural Vocoder With Multi-resolution Spectrogram Discriminators For High-fidelity Waveform Generation (2021)
- GELP: GAN-excited Linear Prediction For Speech Synthesis From Mel-spectrogram (2019)