Improving Robustness Of LLM-based Speech Synthesis By Learning Monotonic Alignment
2024 · Paarth Neekhara, Shehzeen Hussain, Subhankar Ghosh, et al.
Abstract
Large Language Model (LLM) based text-to-speech (TTS) systems have demonstrated remarkable capabilities in handling large speech datasets and generating natural speech for new speakers. However, LLM-based TTS models are not robust: the generated output can contain repeated words, missing words, and misaligned speech (referred to as hallucinations or attention errors), especially when the text contains multiple occurrences of the same token. We examine these challenges in an encoder-decoder transformer model and find that certain cross-attention heads in such models implicitly learn the alignment between text and speech when trained to predict speech tokens for a given text. To make the alignment more robust, we propose techniques that use CTC loss and attention priors to encourage monotonic cross-attention over the text tokens. Our guided attention training technique does not introduce any new learnable parameters and significantly improves the robustness of LLM-based TTS models.
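To make the two ingredients named in the abstract more concrete, the following is a minimal NumPy sketch, not the authors' implementation. It assumes a Gaussian diagonal prior (the abstract does not specify the form of the prior) and a simplified, blank-free forward sum over monotonic paths as a stand-in for the CTC loss. The function names `diagonal_prior`, `apply_prior`, and `monotonic_alignment_loss` are hypothetical, and the attention matrix is assumed to have one row per speech step and one column per text token.

```python
import numpy as np


def diagonal_prior(num_speech_steps, num_text_tokens, sigma=0.2):
    """Row-normalized prior that favors near-diagonal alignments.

    Assumption: a Gaussian band around the diagonal; other prior
    shapes are possible and the abstract does not name one.
    """
    t = np.linspace(0.0, 1.0, num_speech_steps)[:, None]  # relative speech position
    n = np.linspace(0.0, 1.0, num_text_tokens)[None, :]   # relative text position
    prior = np.exp(-((t - n) ** 2) / (2.0 * sigma ** 2))
    return prior / prior.sum(axis=1, keepdims=True)


def apply_prior(attn_logits, prior, eps=1e-8):
    """Bias cross-attention logits with the log prior, then softmax over text tokens."""
    scores = attn_logits + np.log(prior + eps)
    scores = scores - scores.max(axis=1, keepdims=True)  # numerical stability
    weights = np.exp(scores)
    return weights / weights.sum(axis=1, keepdims=True)


def monotonic_alignment_loss(attn, eps=1e-8):
    """Negative log-likelihood of all monotonic paths through `attn`.

    A path starts at the first text token, ends at the last one, and at
    each speech step either stays on the current token or advances by one.
    This is a simplified, blank-free variant of the CTC forward sum.
    Requires at least as many speech steps as text tokens.
    """
    log_p = np.log(attn + eps)
    num_steps, num_tokens = log_p.shape
    log_alpha = np.full((num_steps, num_tokens), -np.inf)
    log_alpha[0, 0] = log_p[0, 0]
    for t in range(1, num_steps):
        for j in range(num_tokens):
            stay = log_alpha[t - 1, j]
            advance = log_alpha[t - 1, j - 1] if j > 0 else -np.inf
            log_alpha[t, j] = log_p[t, j] + np.logaddexp(stay, advance)
    return -log_alpha[num_steps - 1, num_tokens - 1]
```

Under these assumptions, an attention map that moves left to right over the text receives a lower loss than one that jumps backwards. Neither function adds trainable weights, which is consistent with the abstract's statement that the guidance introduces no new learnable parameters.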
Related papers
- Eliminating Stability Hallucinations In LLM-based TTS Models Via Attention Guidance (2025)
- Enhancing The Stability Of LLM-based Speech Generation Systems Through Self-supervised Representations (2024)
- Enhancing Code-switched Text-to-speech Synthesis Capability In Large Language Models With Only Monolingual Corpora (2024)
- Robust Sequence-to-sequence Acoustic Modeling With Stepwise Monotonic Attention For Neural TTS (2019)
- Initial Investigation Of An Encoder-decoder End-to-end TTS Framework Using Marginalization Of Monotonic Hard Latent Alignments (2019)
- Small-E: Small Language Model With Linear Attention For Efficient Speech Synthesis (2024)
- Regotron: Regularizing The Tacotron2 Architecture Via Monotonic Alignment Loss (2022)
- Boosting Large Language Model For Speech Synthesis: An Empirical Study (2023)