Real-time Streamable Generative Speech Restoration With Flow Matching
2025 · Simon Welker, Bunlong Lay, Maris Hillemann, et al.
Abstract
Diffusion-based generative models have greatly impacted the speech processing field in recent years, exhibiting high speech naturalness and spawning a new research direction. Their application in real-time communication is, however, still lagging behind due to their computation-heavy nature involving multiple calls of large DNNs. Here, we present Stream.FM, a frame-causal flow-based generative model with an algorithmic latency of 32 milliseconds (ms) and a total latency of 48 ms, paving the way for generative speech processing in real-time communication. We propose a buffered streaming inference scheme and an optimized DNN architecture, show how learned few-step numerical solvers can boost output quality at a fixed compute budget, explore model weight compression to find favorable points along a compute/quality tradeoff, and contribute a model variant with 24 ms total latency for the speech enhancement task. Our work looks beyond theoretical latencies, showing that high-quality
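To make the compute cost of "multiple calls of large DNNs" concrete, the following is a minimal sketch of few-step flow-model sampling with a plain fixed-step Euler solver. It is not the learned solver the abstract describes, and it is not the authors' implementation. The names `euler_flow_solve` and `make_toy_velocity` are hypothetical, and the velocity field is a toy closed-form stand-in for a trained network.

```python
from typing import Callable, Dict, List, Tuple

Vector = List[float]
VelocityFn = Callable[[Vector, float], Vector]


def euler_flow_solve(velocity: VelocityFn, x0: Vector, num_steps: int) -> Vector:
    """Integrate dx/dt = velocity(x, t) from t=0 to t=1 with fixed-step Euler."""
    x = list(x0)
    h = 1.0 / num_steps
    for k in range(num_steps):
        t = k * h
        # In a real flow model, this would be one DNN forward pass per step.
        v = velocity(x, t)
        x = [xi + h * vi for xi, vi in zip(x, v)]
    return x


def make_toy_velocity(target: Vector) -> Tuple[VelocityFn, Dict[str, int]]:
    """Toy stand-in for a network (assumption): velocity of a straight path to `target`."""
    calls = {"n": 0}

    def velocity(x: Vector, t: float) -> Vector:
        calls["n"] += 1  # count evaluations, i.e. the compute budget spent
        return [(ti - xi) / (1.0 - t) for ti, xi in zip(target, x)]

    return velocity, calls


target = [0.5, -1.0, 2.0]
velocity, calls = make_toy_velocity(target)
restored = euler_flow_solve(velocity, x0=[0.0, 0.0, 0.0], num_steps=4)
print(restored, "velocity evaluations:", calls["n"])
```

The number of solver steps equals the number of velocity evaluations, so under a fixed per-frame compute budget the step count is the main quantity to trade against output quality.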
Related papers
- VoiceFlow: Efficient Text-to-Speech With Rectified Flow Matching (2023) · 0.00
- Diffusion-based Generative Modeling With Discriminative Guidance For Streamable Speech Enhancement (2024) · 7.16
- FlashAudio: Rectified Flows For Fast And High-fidelity Text-to-audio Generation (2024) · 5.13
- MeanFlowSE: One-step Generative Speech Enhancement Via Conditional Mean Flow (2025) · 3.01
- FlowAVSE: Efficient Audio-visual Speech Enhancement With Conditional Flow Matching (2024) · 0.00
- Generative Pre-training For Speech With Flow Matching (2023) · 0.00
- REST: Diffusion-based Real-time End-to-end Streaming Talking Head Generation Via ID-context Caching And Asynchronous Streaming Distillation (2025) · 0.00
- DiFlow-TTS: Compact And Low-latency Zero-shot Text-to-speech With Factorized Discrete Flow Matching (2025) · 0.00