UniverSR: Unified And Versatile Audio Super-resolution Via Vocoder-free Flow Matching
2025 · Woongjib Choi, Sangmin Lee, Hyungseob Lim, et al.
Abstract
In this paper, we present a vocoder-free framework for audio super-resolution that employs a flow matching generative model to capture the conditional distribution of complex-valued spectral coefficients. Unlike conventional two-stage diffusion-based approaches that predict a mel-spectrogram and then rely on a pre-trained neural vocoder to synthesize waveforms, our method directly reconstructs waveforms via the inverse Short-Time Fourier Transform (iSTFT), thereby eliminating the dependence on a separate vocoder. This design not only simplifies end-to-end optimization but also overcomes a critical bottleneck of two-stage pipelines, where the final audio quality is fundamentally constrained by vocoder performance. Experiments show that our model consistently produces high-fidelity 48 kHz audio across diverse upsampling factors, achieving state-of-the-art performance on both speech and general audio datasets.
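The abstract does not give the training objective, network, or sampler, so the following is only a minimal sketch of generic conditional flow matching, not the authors' implementation. It assumes a linear probability path from Gaussian noise to the target, a complex spectrum stored as stacked real and imaginary channels, and a fixed-step Euler solver. The names `fm_training_pair`, `euler_sample`, and `oracle_velocity` are hypothetical, and the oracle stands in for a trained network purely to make the example runnable.

```python
import numpy as np

def fm_training_pair(x1, rng):
    """Build one flow matching training example for target x1.

    Assumes a linear path x_t = (1 - t) * x0 + t * x1 with Gaussian x0;
    the regression target for the velocity network is then x1 - x0.
    """
    x0 = rng.standard_normal(x1.shape)   # noise sample (assumed prior)
    t = rng.uniform()                    # time drawn uniformly from [0, 1)
    x_t = (1.0 - t) * x0 + t * x1        # point on the straight-line path
    v_target = x1 - x0                   # constant velocity along that path
    return t, x_t, v_target

def euler_sample(velocity_fn, x0, n_steps=8):
    """Integrate dx/dt = velocity_fn(x, t) from t = 0 to t = 1 with Euler steps."""
    x = x0.copy()
    dt = 1.0 / n_steps
    for i in range(n_steps):
        x = x + dt * velocity_fn(x, i * dt)
    return x

rng = np.random.default_rng(0)

# Toy complex "spectrogram" (frequency bins x frames), stored as 2 real channels.
spec = rng.standard_normal((5, 4)) + 1j * rng.standard_normal((5, 4))
x1 = np.stack([spec.real, spec.imag])

t, x_t, v_target = fm_training_pair(x1, rng)

# Hypothetical stand-in for a trained network: the exact velocity that
# transports any point on the linear path toward x1.
def oracle_velocity(x, t):
    return (x1 - x) / (1.0 - t)

x0 = rng.standard_normal(x1.shape)
x_gen = euler_sample(oracle_velocity, x0, n_steps=8)

# Reassemble complex coefficients; an iSTFT would map these to a waveform.
spec_gen = x_gen[0] + 1j * x_gen[1]
```

In a real system the oracle would be replaced by a learned network conditioned on the low-resolution input, and the generated coefficients would be passed to an inverse STFT to obtain the waveform, which is the step that removes the need for a separate vocoder.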
Related papers
- FlowHigh: Towards Efficient And High-quality Audio Super-resolution With Single-step Flow Matching (2025) · 5.84
- InspireMusic: Integrating Super Resolution And Large Language Model For High-fidelity Long-form Music Generation (2025) · 6.26
- Neural Vocoder Is All You Need For Speech Super-resolution (2022) · 12.25
- Audio Dequantization For High Fidelity Audio Generation In Flow-based Neural Vocoder (2020) · 6.77
- Real-time And Accurate: Zero-shot High-fidelity Singing Voice Conversion With Multi-condition Flow Synthesis (2024) · 0.00
- FlashSR: One-step Versatile Audio Super-resolution Via Diffusion Distillation (2025) · 4.52
- STSR: High-fidelity Speech Super-resolution Via Spectral-transient Context Modeling (2025) · 0.00
- V2SFlow: Video-to-speech Generation With Speech Decomposition And Rectified Flow (2024) · 8.52