F5R-TTS: Improving Flow-matching Based Text-to-speech With Group Relative Policy Optimization
2025 · Xiaohui Sun, Ruitong Xiao, Jianye Mo, et al.
Abstract
We present F5R-TTS, a novel text-to-speech (TTS) system that integrates Group Relative Policy Optimization (GRPO) into a flow-matching based architecture. By reformulating the deterministic outputs of flow-matching TTS into probabilistic Gaussian distributions, our approach enables seamless integration of reinforcement learning algorithms. During pretraining, we train a probabilistically reformulated flow-matching based model, derived from F5-TTS, on an open-source dataset. In the subsequent reinforcement learning (RL) phase, we employ a GRPO-driven enhancement stage that leverages dual reward metrics: word error rate (WER) computed via automatic speech recognition and speaker similarity (SIM) assessed by verification models. Experimental results on zero-shot voice cloning demonstrate that F5R-TTS achieves significant improvements in both speech intelligibility (a 29.5% relative reduction in WER) and speaker similarity (a 4.6% relative increase in SIM score) compared to conven
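The abstract names three ingredients: Gaussian-distributed model outputs, a dual WER/SIM reward, and group-relative advantages. The following is a minimal sketch of how these pieces could fit together, not the authors' implementation. The function names, the linear reward combination, and the weights `w_wer` and `w_sim` are assumptions made for illustration; the abstract does not specify how the two rewards are combined.

```python
import numpy as np

def gaussian_log_prob(x, mean, log_std):
    """Log-density of x under a diagonal Gaussian, summed over all elements.

    Treating the model output as (mean, log_std) instead of a single
    deterministic value is what gives a policy-gradient method a
    likelihood to work with.
    """
    x, mean, log_std = (np.asarray(a, dtype=float) for a in (x, mean, log_std))
    z = (x - mean) / np.exp(log_std)
    return float(np.sum(-0.5 * z**2 - log_std - 0.5 * np.log(2.0 * np.pi)))

def combined_reward(wer, sim, w_wer=1.0, w_sim=1.0):
    """Hypothetical scalar reward: lower WER and higher SIM are both better.

    The linear form and the unit weights are assumptions of this sketch.
    """
    return w_sim * np.asarray(sim, dtype=float) - w_wer * np.asarray(wer, dtype=float)

def group_relative_advantages(rewards, eps=1e-8):
    """Standardize rewards within one group of samples for the same prompt.

    GRPO uses the group mean as the baseline, so no learned value
    function is needed.
    """
    rewards = np.asarray(rewards, dtype=float)
    return (rewards - rewards.mean()) / (rewards.std() + eps)

# Illustrative numbers only: four sampled utterances for one prompt.
wer = [0.10, 0.02, 0.30, 0.05]   # from an ASR model (lower is better)
sim = [0.60, 0.70, 0.50, 0.65]   # from a speaker-verification model (higher is better)
advantages = group_relative_advantages(combined_reward(wer, sim))
```

In this sketch, samples with positive advantage would have their Gaussian log-probability increased during the RL phase and samples with negative advantage decreased.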
Related papers
- F5-TTS: A Fairytaler That Fakes Fluent And Faithful Speech With Flow Matching (2024)
- Advances In GRPO For Generation Models: A Survey (2026)
- Cross-lingual F5-TTS: Towards Language-agnostic Voice Cloning And Speech Synthesis (2025)
- No Verifiable Reward For Prosody: Toward Preference-guided Prosody Learning In TTS (2025)
- VoiceFlow: Efficient Text-to-speech With Rectified Flow Matching (2023)
- Improving Multi-speaker TTS Prosody Variance With A Residual Encoder And Normalizing Flows (2021)
- Glow-TTS: A Generative Flow For Text-to-speech Via Monotonic Alignment Search (2020)
- TangoFlux: Super Fast And Faithful Text To Audio Generation With Flow Matching And CLAP-ranked Preference Optimization (2024)