F5-TTS: A Fairytaler That Fakes Fluent And Faithful Speech With Flow Matching
2024 · Yushen Chen, Zhikang Niu, Ziyang Ma, et al.
Abstract
This paper introduces F5-TTS, a fully non-autoregressive text-to-speech system based on flow matching with Diffusion Transformer (DiT). Without requiring complex designs such as duration model, text encoder, and phoneme alignment, the text input is simply padded with filler tokens to the same length as input speech, and then the denoising is performed for speech generation, which was originally proved feasible by E2 TTS. However, the original design of E2 TTS makes it hard to follow due to its slow convergence and low robustness. To address these issues, we first model the input with ConvNeXt to refine the text representation, making it easy to align with the speech. We further propose an inference-time Sway Sampling strategy, which significantly improves our model's performance and efficiency. This sampling strategy for flow step can be easily applied to existing flow matching based models without retraining. Our design allows faster training and achieves an inference RTF of 0.15, whi
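The two mechanisms named in the abstract, padding the text with filler tokens to the speech length and warping the flow steps at inference time, can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: the filler symbol `<F>`, the function names, and character-level tokenization are hypothetical, and the sway formula t = u + s(cos(πu/2) − 1 + u) is recalled from the body of the paper rather than stated in the abstract above, so it should be checked against the paper before use.

```python
import math

FILLER = "<F>"  # hypothetical filler symbol; the abstract does not name one


def pad_text_to_speech_length(text, n_frames, filler=FILLER):
    """Pad a character sequence with filler tokens up to the speech frame count."""
    tokens = list(text)  # assumption: character-level tokens
    if len(tokens) > n_frames:
        raise ValueError("text is longer than the speech frame sequence")
    return tokens + [filler] * (n_frames - len(tokens))


def sway_sampled_steps(n_steps, s=-1.0):
    """Return n_steps + 1 flow steps on [0, 1], warped by a sway coefficient s.

    Assumed formula: t = u + s * (cos(pi/2 * u) - 1 + u) for uniform u.
    s = 0 leaves the steps uniform; s < 0 shifts them toward t = 0.
    """
    uniform = [i / n_steps for i in range(n_steps + 1)]
    return [u + s * (math.cos(math.pi / 2 * u) - 1 + u) for u in uniform]


if __name__ == "__main__":
    print(pad_text_to_speech_length("hi", 5))
    print(sway_sampled_steps(4))
```

Because the warp only changes which step values are handed to the ODE solver, a schedule like this could in principle be swapped into an existing flow-matching sampler without touching the trained weights, which is consistent with the abstract's claim that no retraining is needed.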
Related papers
- DiFlow-TTS: Compact And Low-latency Zero-shot Text-to-speech With Factorized Discrete Flow Matching (2025) · 0.00
- Cross-lingual F5-TTS: Towards Language-agnostic Voice Cloning And Speech Synthesis (2025) · 0.00
- VoiceFlow: Efficient Text-to-speech With Rectified Flow Matching (2023) · 0.00
- ReFlow-TTS: A Rectified Flow Model For High-fidelity Text-to-speech (2023) · 7.50
- F5R-TTS: Improving Flow-matching Based Text-to-speech With Group Relative Policy Optimization (2025) · 0.00
- Glow-TTS: A Generative Flow For Text-to-speech Via Monotonic Alignment Search (2020) · 0.00
- Matcha-TTS: A Fast TTS Architecture With Conditional Flow Matching (2023) · 14.73
- FLY-TTS: Fast, Lightweight And High-quality End-to-end Text-to-speech Synthesis (2024) · 0.00