Triple M: A Practical Text-to-speech Synthesis System With Multi-guidance Attention And Multi-band Multi-time LPCNet
2021 · Shilun Lin, Fenglong Xie, Li Meng, et al.
Abstract
In this work, a robust and efficient text-to-speech (TTS) synthesis system named Triple M is proposed for large-scale online application. The key components of Triple M are: 1) A sequence-to-sequence model that adopts a novel multi-guidance attention to transfer complementary advantages from guiding attention mechanisms to the basic attention mechanism, without in-domain performance loss or online service modification. Compared with a single attention mechanism, multi-guidance attention not only brings better naturalness to long-sentence synthesis but also reduces the word error rate by 26.8%. 2) A new, efficient multi-band multi-time vocoder framework that reduces the computational complexity from 2.8 to 1.0 GFLOP and speeds up LPCNet by 2.75x on a single CPU.
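As a rough sanity check on the vocoder figures quoted in the abstract, the short sketch below compares the stated complexity reduction with the stated speedup. Only the three numbers come from the abstract; the variable and function names are illustrative, and treating 2.8 GFLOP as the baseline LPCNet cost is an assumption drawn from the wording.

```python
# Figures quoted in the abstract (names below are illustrative, not from the paper).
baseline_gflop = 2.8     # complexity before; assumed to be the baseline LPCNet
reduced_gflop = 1.0      # complexity of the multi-band multi-time vocoder
reported_speedup = 2.75  # speedup reported on a single CPU


def complexity_ratio(before: float, after: float) -> float:
    """How many times fewer operations the reduced model needs."""
    return before / after


def relative_reduction(before: float, after: float) -> float:
    """Fraction of the original operations that were removed."""
    return (before - after) / before


ratio = complexity_ratio(baseline_gflop, reduced_gflop)
reduction = relative_reduction(baseline_gflop, reduced_gflop)

print(f"complexity ratio:   {ratio:.2f}x")
print(f"relative reduction: {reduction:.1%}")
print(f"reported speedup:   {reported_speedup:.2f}x")
```

The complexity ratio (about 2.8x) and the reported speedup (2.75x) are close, which suggests the measured gain tracks the operation count fairly directly; the abstract itself does not state this relationship.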
Related papers
- High Quality, Lightweight And Adaptable TTS Using LPCNet (2019) · 10.97
- High Quality Streaming Speech Synthesis With Low, Sentence-length-independent Latency (2021) · 8.60
- Multi-rate Attention Architecture For Fast Streamable Text-to-speech Spectrum Modeling (2021) · 6.34
- MHTTS: Fast Multi-head Text-to-speech For Spontaneous Speech With Imperfect Transcription (2022) · 0.00
- Building Multi Lingual TTS Using Cross Lingual Voice Conversion (2020) · 0.00
- Improving Audio Codec-based Zero-shot Text-to-speech Synthesis With Multi-modal Context And Large Language Model (2024) · 2.26
- Improving Robustness Of LLM-based Speech Synthesis By Learning Monotonic Alignment (2024) · 0.00
- Towards High-quality Neural TTS For Low-resource Languages By Learning Compact Speech Representations (2022) · 0.00