SyncSpeech: Efficient And Low-latency Text-to-speech Based On Temporal Masked Transformer
2025 · Zhengyan Sheng, Zhihao Du, Shiliang Zhang, et al.
Abstract
Current text-to-speech (TTS) models face a persistent limitation: autoregressive (AR) models suffer from low generation efficiency, while modern non-autoregressive (NAR) models experience high latency due to their unordered temporal nature. To bridge this divide, we introduce SyncSpeech, an efficient and low-latency TTS model based on the proposed Temporal Masked Transformer (TMT) paradigm. TMT unifies the temporally ordered generation of AR models with the parallel decoding efficiency of NAR models. TMT is realized through a sequence construction rule, a corresponding training objective, and a specialized hybrid attention mask. Furthermore, with the primary aim of enhancing training efficiency, a high-probability masking strategy is introduced, which also leads to a significant improvement in overall model performance. During inference, SyncSpeech achieves high efficiency by decoding all speech tokens corresponding to each newly arrived text token i
Related papers
- EfficientTTS: An Efficient And High-quality Text-to-speech Architecture (2020) 0.00
- MaskGCT: Zero-shot Text-to-speech With Masked Generative Codec Transformer (2024) 7.98
- MobileSpeech: A Fast And High-fidelity Framework For Mobile Zero-shot Text-to-speech (2024) 0.00
- Non-autoregressive TTS With Explicit Duration Modelling For Low-resource Highly Expressive Speech (2021) 8.82
- High Fidelity Text-to-speech Via Discrete Tokens Using Token Transducer And Group Masked Language Model (2024) 4.52
- LiveSpeech: Low-latency Zero-shot Text-to-speech Via Autoregressive Modeling Of Audio Discrete Codes (2024) 5.84
- FastSpeech: Fast, Robust And Controllable Text To Speech (2019) 0.00
- MELA-TTS: Joint Transformer-diffusion Model With Representation Alignment For Speech Synthesis (2025) 0.00