Multi-rate Attention Architecture For Fast Streamable Text-to-speech Spectrum Modeling
2021 · Qing He, Zhiping Xiu, Thilo Koehler, et al.
Abstract
Typical high-quality text-to-speech (TTS) systems today use a two-stage architecture, with a spectrum model stage that generates spectral frames and a vocoder stage that generates the actual audio. High-quality spectrum models usually incorporate an encoder-decoder architecture with self-attention or bi-directional long short-term memory (BLSTM) units. While these models can produce high-quality speech, they often incur an \(O(L)\) increase in both latency and real-time factor (RTF) with respect to input length \(L\). In other words, longer inputs lead to longer delays and slower synthesis, limiting the use of these models in real-time applications. In this paper, we propose a multi-rate attention architecture that breaks the latency and RTF bottlenecks by computing a compact representation during encoding and recurrently generating the attention vector in a streaming manner during decoding. The proposed architecture achieves high audio quality (MOS of 4.31 compared to ground truth 4.48), low latency, and
Related papers
- High Quality Streaming Speech Synthesis With Low, Sentence-length-independent Latency (2021)
- EfficientTTS: An Efficient And High-quality Text-to-speech Architecture (2020)
- Spark-TTS: An Efficient LLM-based Text-to-speech Model With Single-stream Decoupled Speech Tokens (2025)
- Triple M: A Practical Text-to-speech Synthesis System With Multi-guidance Attention And Multi-band Multi-time LPCNet (2021)
- FastSpeech: Fast, Robust And Controllable Text To Speech (2019)
- SyncSpeech: Efficient And Low-latency Text-to-speech Based On Temporal Masked Transformer (2025)
- Streaming Attention-based Models With Augmented Memory For End-to-end Speech Recognition (2020)
- Generating Synthetic Audio Data For Attention-based Speech Recognition Systems (2019)