DiFlow-TTS: Compact And Low-latency Zero-shot Text-to-speech With Factorized Discrete Flow Matching
2025 · Ngoc-Son Nguyen, Thanh V. T. Tran, Hieu-Nghia Huynh-Nguyen, et al.
Abstract
This paper introduces DiFlow-TTS, a novel zero-shot text-to-speech (TTS) system that employs discrete flow matching for generative speech modeling. We position this work as an entry point that may facilitate further advances in this research direction. Through extensive empirical evaluation, we analyze both the strengths and limitations of this approach across key aspects, including naturalness, expressive attributes, speaker identity, and inference latency. To this end, we leverage factorized speech representations and design a deterministic Phoneme-Content Mapper for modeling linguistic content, together with a Factorized Discrete Flow Denoiser that jointly models multiple discrete token streams corresponding to prosody and acoustics to capture expressive speech attributes. Experimental results demonstrate that DiFlow-TTS achieves strong performance across multiple metrics while maintaining a compact model size, up to 11.7 times smaller, and enabling low-latency inference that is up
Related papers
- F5-TTS: A Fairytaler That Fakes Fluent And Faithful Speech With Flow Matching (2024) · 0.00
- VoiceFlow: Efficient Text-to-speech With Rectified Flow Matching (2023) · 0.00
- ReFlow-TTS: A Rectified Flow Model For High-fidelity Text-to-speech (2023) · 7.50
- Flow-TSVAD: Target-speaker Voice Activity Detection Via Latent Flow Matching (2024) · 0.00
- Time-layer Adaptive Alignment For Speaker Similarity In Flow-matching Based Zero-shot TTS (2025) · 0.00
- V2SFlow: Video-to-speech Generation With Speech Decomposition And Rectified Flow (2024) · 8.52
- Real-time Streamable Generative Speech Restoration With Flow Matching (2025) · 0.00
- An Investigation Of Noise Robustness For Flow-matching-based Zero-shot TTS (2024) · 5.24