STSR: High-fidelity Speech Super-resolution Via Spectral-transient Context Modeling
2025 · Jiajun Yuan, Xiaochen Wang, Yuhang Xiao, et al.
Abstract
Speech super-resolution (SR) reconstructs high-fidelity wideband speech from low-resolution inputs, a task that necessitates reconciling global harmonic coherence with local transient sharpness. While diffusion-based generative models yield impressive fidelity, their practical deployment is often stymied by prohibitive computational demands. Conversely, efficient time-domain architectures lack the explicit frequency representations essential for capturing long-range spectral dependencies and ensuring precise harmonic alignment. We introduce STSR, a unified end-to-end framework formulated in the MDCT domain to circumvent these limitations. STSR employs a Spectral-Contextual Attention mechanism that harnesses hierarchical windowing to adaptively aggregate non-local spectral context, enabling consistent harmonic reconstruction up to 48 kHz. Concurrently, a sparse-aware regularization strategy mitigates the suppression of transient components inherent in compressed spectral re
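For readers unfamiliar with the MDCT domain the abstract refers to, the following is a minimal NumPy sketch of a forward and inverse MDCT with a sine window and 50% overlap. It is an illustration of the standard transform only, not the authors' implementation: the function names (`mdct_basis`, `sine_window`, `mdct`, `imdct`) and the frame size `N = 64` are assumptions chosen for the example, and the dense matrix product is used for clarity rather than speed.

```python
import numpy as np

def mdct_basis(N):
    # Cosine basis of shape (N, 2N): N coefficients per 2N-sample frame.
    n = np.arange(2 * N)
    k = np.arange(N)
    return np.cos(np.pi / N * np.outer(k + 0.5, n + 0.5 + N / 2))

def sine_window(N):
    # Sine window; satisfies the Princen-Bradley condition w[n]^2 + w[n+N]^2 = 1.
    return np.sin(np.pi * (np.arange(2 * N) + 0.5) / (2 * N))

def mdct(x, N):
    # Zero-pad so that every input sample is covered by exactly two frames.
    tail = (-len(x)) % N
    padded = np.concatenate([np.zeros(N), x, np.zeros(tail + N)])
    n_frames = len(padded) // N - 1
    frames = np.stack([padded[i * N:(i + 2) * N] for i in range(n_frames)])
    return (frames * sine_window(N)) @ mdct_basis(N).T  # (n_frames, N)

def imdct(X, N, length):
    # Inverse transform, synthesis window, then overlap-add with hop N.
    # The 2/N scale matches the unnormalized forward transform above.
    y = (2.0 / N) * (X @ mdct_basis(N)) * sine_window(N)  # (n_frames, 2N)
    out = np.zeros((X.shape[0] + 1) * N)
    for i, frame in enumerate(y):
        out[i * N:(i + 2) * N] += frame
    return out[N:N + length]  # drop the leading padding and any tail padding

# Usage: round-trip a random signal (hypothetical sizes).
rng = np.random.default_rng(0)
x = rng.standard_normal(1000)
X = mdct(x, 64)
x_rec = imdct(X, 64, len(x))
```

Each frame yields real-valued coefficients at critical sampling (N coefficients per N new samples), and the time-domain aliasing of adjacent frames cancels in the overlap-add, so the round trip recovers the input up to floating-point error.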
Related papers
- mdctGAN: Taming Transformer-based GAN For Speech Super-resolution With Modified DCT Spectra (2023) · 3.65
- HiFi-SR: A Unified Generative Transformer-convolutional Adversarial Network For High-fidelity Speech Super-resolution (2025) · 10.81
- Wave-U-Mamba: An End-to-end Framework For High-quality And Efficient Speech Super Resolution (2024) · 3.58
- FlashSR: One-step Versatile Audio Super-resolution Via Diffusion Distillation (2025) · 4.52
- UniverSR: Unified And Versatile Audio Super-resolution Via Vocoder-free Flow Matching (2025) · 0.00
- Neural Vocoder Is All You Need For Speech Super-resolution (2022) · 12.25
- Super Denoise Net: Speech Super Resolution With Noise Cancellation In Low Sampling Rate Noisy Environments (2023) · 0.00
- Heterogeneous Space Fusion And Dual-dimension Attention: A New Paradigm For Speech Enhancement (2024) · 2.26