Retrieval Augmented Generation In Prompt-based Text-to-speech Synthesis With Context-aware Contrastive Language-audio Pretraining
2024 · Jinlong Xue, Yayue Deng, Yingming Gao, et al.
Abstract
Recent prompt-based text-to-speech (TTS) models can clone an unseen speaker using only a short speech prompt. They leverage a strong in-context ability to mimic the speech prompts, including speaker style, prosody, and emotion. Therefore, the selection of a speech prompt greatly influences the generated speech, akin to the importance of a prompt in large language models (LLMs). However, current prompt-based TTS models choose the speech prompt manually or simply at random. Hence, in this paper, we adapt retrieval augmented generation (RAG) from LLMs to prompt-based TTS. Unlike traditional RAG methods, we additionally consider contextual information during the retrieval process and present a Context-Aware Contrastive Language-Audio Pre-training (CA-CLAP) model to extract context-aware, style-related features. The objective and subjective evaluations demonstrate that our proposed RAG method outperforms baselines, and our CA-CLAP achieves better results than text-only retrieval methods.
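The retrieval step described in the abstract can be illustrated with a minimal sketch. It assumes that the target text (with its context) and every candidate speech prompt have already been mapped to fixed-size embeddings in a shared space, as a contrastive language-audio model would produce. The abstract does not state the similarity measure or the ranking procedure, so cosine similarity, the function names, and the hand-written toy vectors below are illustrative assumptions, not the authors' implementation.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def retrieve_prompts(query_emb, prompt_embs, top_k=1):
    """Rank candidate speech prompts by similarity to the query embedding.

    query_emb   -- embedding of the target text plus context (assumed precomputed)
    prompt_embs -- one embedding per candidate speech prompt (assumed precomputed)
    Returns the top_k (index, score) pairs, best first.
    """
    scored = [(i, cosine(query_emb, emb)) for i, emb in enumerate(prompt_embs)]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_k]


# Toy vectors standing in for real model outputs (hypothetical values).
prompt_embs = [
    [1.0, 0.0, 0.0],
    [0.0, 1.0, 0.0],
    [0.7, 0.7, 0.0],
]
query_emb = [0.9, 0.1, 0.0]

ranked = retrieve_prompts(query_emb, prompt_embs, top_k=2)
for index, score in ranked:
    print(f"prompt {index}: similarity {score:.3f}")
```

In this sketch, the highest-ranked candidate would be passed to the TTS model as the speech prompt, replacing a manual or random choice.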
Related papers
- Retrieval-augmented Text-to-audio Generation (2023)
- CLAPSpeech: Learning Prosody From Text Context With Contrastive Language-audio Pre-training (2023)
- AudioRAG+: Feedback-driven Retrieval-augmented Audio Generation With Large Audio Language Models (2025)
- Generating Speakers By Prompting Listener Impressions For Pre-trained Multi-speaker Text-to-speech Systems (2024)
- AutoStyle-TTS: Retrieval-augmented Generation Based Automatic Style Matching Text-to-speech Synthesis (2025)
- PromptASR For Contextualized ASR With Controllable Style (2023)
- LA-RAG: Enhancing LLM-based ASR Accuracy With Retrieval-augmented Generation (2024)
- SpeechGen: Unlocking The Generative Power Of Speech Language Models With Prompts (2023)