CLAPSpeech: Learning Prosody From Text Context With Contrastive Language-audio Pre-training
2023 · Zhenhui Ye, Rongjie Huang, Yi Ren, et al.
Abstract
Improving text representations has attracted much attention as a route to expressive text-to-speech (TTS). However, existing works learn prosody only implicitly, through masked token reconstruction tasks, which leads to low training efficiency and makes prosody difficult to model. We propose CLAPSpeech, a cross-modal contrastive pre-training framework that explicitly learns the prosody variance of the same text token under different contexts. Specifically, 1) we encourage the model to connect the text context with its corresponding prosody pattern in a joint multi-modal space through careful design of the encoder inputs and the contrastive loss; 2) we introduce a multi-scale pre-training pipeline to capture prosody patterns at multiple levels. We show how to incorporate CLAPSpeech into existing TTS models for better prosody. Experiments on three datasets not only show that CLAPSpeech can improve prosody prediction for existing TTS methods, but also demonstrate its generalization abili
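The contrastive objective named in the abstract can be illustrated with a generic CLIP-style symmetric loss. This is a minimal sketch under stated assumptions, not the authors' implementation: the function name `symmetric_contrastive_loss`, the temperature value, the embedding sizes, and the random toy data are all hypothetical. Each row is assumed to be one occurrence of the same text token in a different context, paired with the prosody embedding of the matching speech segment.

```python
import numpy as np


def _log_softmax(logits, axis):
    # Numerically stable log-softmax along the given axis.
    shifted = logits - logits.max(axis=axis, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=axis, keepdims=True))


def symmetric_contrastive_loss(text_emb, prosody_emb, temperature=0.07):
    """CLIP-style loss: row i of text_emb is the positive pair of row i of prosody_emb.

    The temperature of 0.07 is an illustrative assumption, not a value from the source.
    """
    # L2-normalise so the dot product is a cosine similarity.
    text_emb = text_emb / np.linalg.norm(text_emb, axis=1, keepdims=True)
    prosody_emb = prosody_emb / np.linalg.norm(prosody_emb, axis=1, keepdims=True)

    # Pairwise similarity matrix of shape (N, N); matched pairs sit on the diagonal.
    logits = text_emb @ prosody_emb.T / temperature
    idx = np.arange(logits.shape[0])

    # Cross-entropy in both directions: text -> prosody and prosody -> text.
    loss_t2p = -_log_softmax(logits, axis=1)[idx, idx].mean()
    loss_p2t = -_log_softmax(logits, axis=0)[idx, idx].mean()
    return 0.5 * (loss_t2p + loss_p2t)


# Toy data: 4 contexts of one token, 8-dimensional embeddings (hypothetical sizes).
rng = np.random.default_rng(0)
prosody = rng.normal(size=(4, 8))

# Perfectly aligned text embeddings versus deliberately mismatched ones.
aligned_loss = symmetric_contrastive_loss(prosody, prosody)
shuffled_loss = symmetric_contrastive_loss(prosody[::-1], prosody)
print(aligned_loss < shuffled_loss)
```

The loss is lower when each context's text embedding sits closest to its own prosody embedding, which is the behaviour the pre-training is meant to encourage. The multi-scale pipeline mentioned in the abstract would presumably apply such an objective at more than one granularity, but the abstract does not give those details.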
Related papers
- CLASP: Contrastive Language-speech Pretraining For Multilingual Multimodal Information Retrieval (2024) · 0.00
- Prior-agnostic Multi-scale Contrastive Text-audio Pre-training For Parallelized TTS Frontend Modeling (2024) · 0.00
- Learning Speech Representation From Contrastive Token-acoustic Pretraining (2023) · 7.81
- Retrieval Augmented Generation In Prompt-based Text-to-speech Synthesis With Context-aware Contrastive Language-audio Pretraining (2024) · 0.00
- Human-CLAP: Human-perception-based Contrastive Language-audio Pretraining (2025) · 4.52
- M2D-CLAP: Masked Modeling Duo Meets CLAP For Learning General-purpose Audio-language Representation (2024) · 7.81
- Multi-modal Automatic Prosody Annotation With Contrastive Pretraining Of SSWP (2023) · 0.00
- CLIPSonic: Text-to-audio Synthesis With Unlabeled Videos And Pretrained Language-vision Models (2023) · 9.03