Eliminating Stability Hallucinations In LLM-based TTS Models Via Attention Guidance
2025 · Shiming Wang, Zhihao Du, Yang Xiang, et al.
Abstract
This paper focuses on resolving stability hallucinations (e.g., repetitive or omitted speech) in LLM-based Text-to-Speech (TTS) models by improving and leveraging the attention mechanism. First, we analyze the alignment mechanism between text tokens and speech tokens in LLMs. We then propose a metric termed the Optimal Alignment Score (OAS), which employs the Viterbi algorithm to evaluate text-speech alignment quality. OAS is then integrated into the training of CosyVoice2 to help the LLM learn continuous, stable alignment. Additionally, the pre-trained attention value is employed to guide the training of the student CosyVoice2 via chain-of-thought (CoT), which further reduces stability hallucinations in synthesized speech. Experiments on the Seed-TTS-Eval and CV3-Eval test sets demonstrate that the proposed methods effectively reduce the stability hallucinations of CosyVoice2 without introducing additional negative effects. The appendix is available at https://wsm
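The abstract does not give the exact definition of OAS, so the following is only a minimal sketch of the general idea it names: using a Viterbi-style dynamic program to score how well a text-speech attention matrix follows a monotonic alignment. The function name `monotonic_alignment_score`, the geometric-mean normalization, and the "stay or advance by one token" path constraint are all assumptions made for illustration, not the authors' formulation.

```python
import numpy as np

def monotonic_alignment_score(attn, eps=1e-8):
    """Hypothetical alignment score (NOT the paper's exact OAS).

    attn: array of shape (T_speech, N_text); row t holds the attention
          weights of speech step t over the text tokens.
    Returns (score, path), where path[t] is the text index assigned to
    speech step t and score is the geometric mean of the attention
    weights along the best monotonic path.
    """
    attn = np.asarray(attn, dtype=np.float64)
    T, N = attn.shape
    if T < N:
        raise ValueError("need at least as many speech steps as text tokens")

    log_p = np.log(attn + eps)
    dp = np.full((T, N), -np.inf)            # best log-score ending at (t, j)
    back = np.zeros((T, N), dtype=np.int64)  # 0 = stayed on j, 1 = advanced from j-1
    dp[0, 0] = log_p[0, 0]                   # assumed: path starts on the first token

    for t in range(1, T):
        for j in range(N):
            stay = dp[t - 1, j]
            advance = dp[t - 1, j - 1] if j > 0 else -np.inf
            if advance > stay:
                dp[t, j] = advance + log_p[t, j]
                back[t, j] = 1
            else:
                dp[t, j] = stay + log_p[t, j]

    # Backtrack from the last text token (assumed: path ends on it).
    path = np.zeros(T, dtype=np.int64)
    j = N - 1
    for t in range(T - 1, -1, -1):
        path[t] = j
        j -= back[t, j]

    score = float(np.exp(dp[T - 1, N - 1] / T))
    return score, path
```

Under these assumptions, a clean diagonal attention matrix scores close to 1, while an attention pattern that jumps back to an earlier token (the kind of behavior associated with repeated speech) forces the monotonic path through low-weight cells and scores much lower.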
Related papers
- Improving Robustness Of LLM-based Speech Synthesis By Learning Monotonic Alignment (2024)
- Enhancing The Stability Of LLM-based Speech Generation Systems Through Self-supervised Representations (2024)
- CosyVoice: A Scalable Multilingual Zero-shot Text-to-speech Synthesizer Based On Supervised Semantic Tokens (2024)
- StyleSpeech: Parameter-efficient Fine Tuning For Pre-trained Controllable Text-to-speech (2024)
- Data-efficient Targeted Token-level Preference Optimization For LLM-based Text-to-speech (2026)
- Efficient Emotion And Speaker Adaptation In LLM-based TTS Via Characteristic-specific Partial Fine-tuning (2025)
- Initial Investigation Of An Encoder-decoder End-to-end TTS Framework Using Marginalization Of Monotonic Hard Latent Alignments (2019)
- TacoLM: Gated Attention Equipped Codec Language Model Are Efficient Zero-shot Text To Speech Synthesizers (2024)