HiFi-SR: A Unified Generative Transformer-convolutional Adversarial Network For High-fidelity Speech Super-resolution
2025 · Shengkui Zhao, Kun Zhou, Zexu Pan, et al.
Abstract
The application of generative adversarial networks (GANs) has recently advanced speech super-resolution (SR) based on intermediate representations like mel-spectrograms. However, existing SR methods that typically rely on independently trained and concatenated networks may lead to inconsistent representations and poor speech quality, especially in out-of-domain scenarios. In this work, we propose HiFi-SR, a unified network that leverages end-to-end adversarial training to achieve high-fidelity speech super-resolution. Our model features a unified transformer-convolutional generator designed to seamlessly handle both the prediction of latent representations and their conversion into time-domain waveforms. The transformer network serves as a powerful encoder, converting low-resolution mel-spectrograms into latent space representations, while the convolutional network upscales these representations into high-resolution waveforms. To enhance high-frequency fidelity, we incorporate a multi-
Related papers
- HiFi-GAN: Generative Adversarial Networks For Efficient And High Fidelity Speech Synthesis (2020) 0.00
- HiFi-GAN: High-fidelity Denoising And Dereverberation Based On Speech Deep Features In Adversarial Networks (2020) 0.00
- HiFi++: A Unified Framework For Bandwidth Extension And Speech Enhancement (2022) 11.93
- STSR: High-fidelity Speech Super-resolution Via Spectral-transient Context Modeling (2025) 0.00
- HiFi-WaveGAN: Generative Adversarial Network With Auxiliary Spectrogram-phase Loss For High-fidelity Singing Voice Generation (2022) 0.00
- TFGAN: Time And Frequency Domain Based Generative Adversarial Network For High-fidelity Speech Synthesis (2020) 0.00
- mdctGAN: Taming Transformer-based GAN For Speech Super-resolution With Modified DCT Spectra (2023) 3.65
- HiFTNet: A Fast High-quality Neural Vocoder With Harmonic-plus-noise Filter And Inverse Short Time Fourier Transform (2023) 0.00