Multi-modal Automatic Prosody Annotation With Contrastive Pretraining Of SSWP
2023 · Jinzuomu Zhong, Yang Li, Hui Huang, et al.
Abstract
In expressive and controllable Text-to-Speech (TTS), explicit prosodic features significantly improve the naturalness and controllability of synthesised speech. However, manual prosody annotation is labor-intensive and inconsistent. To address this issue, this paper proposes a novel two-stage automatic annotation pipeline. In the first stage, we use contrastive pretraining of Speech-Silence and Word-Punctuation (SSWP) pairs to enhance prosodic information in latent representations. In the second stage, we build a multi-modal prosody annotator comprising pretrained encoders, a text-speech fusing scheme, and a sequence classifier. Experiments on English prosodic boundaries demonstrate that our method achieves state-of-the-art (SOTA) performance, with F1 scores of 0.72 and 0.93 for Prosodic Word and Prosodic Phrase boundaries respectively, while remaining remarkably robust to data scarcity.
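The abstract does not specify the contrastive objective used in the first stage. As a rough illustration only, the sketch below assumes a CLIP-style symmetric InfoNCE loss over a batch of paired embeddings, where row i of `speech_emb` (a speech-silence segment) is the positive match for row i of `text_emb` (a word-punctuation segment). The function names, the temperature value, and the use of plain Python lists are assumptions made for this example, not details from the paper.

```python
import math


def l2_normalize(v, eps=1e-12):
    """Scale a vector to unit length (eps guards against zero vectors)."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / (norm + eps) for x in v]


def log_softmax(row):
    """Numerically stable log-softmax of a list of logits."""
    m = max(row)
    lse = m + math.log(sum(math.exp(x - m) for x in row))
    return [x - lse for x in row]


def contrastive_loss(speech_emb, text_emb, temperature=0.07):
    """Symmetric InfoNCE loss; pair i in one modality matches pair i in the other.

    Assumption: this CLIP-style form stands in for the paper's objective,
    which the abstract does not spell out.
    """
    s = [l2_normalize(v) for v in speech_emb]
    t = [l2_normalize(v) for v in text_emb]
    n = len(s)
    # Cosine similarity matrix scaled by the temperature.
    logits = [
        [sum(a * b for a, b in zip(s[i], t[j])) / temperature for j in range(n)]
        for i in range(n)
    ]
    # Speech-to-text direction: row i should select column i.
    loss_s2t = -sum(log_softmax(logits[i])[i] for i in range(n)) / n
    # Text-to-speech direction: column j should select row j.
    columns = [[logits[i][j] for i in range(n)] for j in range(n)]
    loss_t2s = -sum(log_softmax(columns[j])[j] for j in range(n)) / n
    return 0.5 * (loss_s2t + loss_t2s)


# Toy batch: two orthogonal embeddings per modality.
speech = [[1.0, 0.0], [0.0, 1.0]]
aligned_text = [[1.0, 0.0], [0.0, 1.0]]
swapped_text = [[0.0, 1.0], [1.0, 0.0]]
loss_aligned = contrastive_loss(speech, aligned_text)
loss_swapped = contrastive_loss(speech, swapped_text)
```

In this toy batch the correctly paired embeddings give a loss near zero, while swapping the text rows gives a much larger loss. That is the signal a contrastive pretraining stage would use to pull matching speech and text representations together.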
Related papers
- CLAPSpeech: Learning Prosody From Text Context With Contrastive Language-audio Pre-training (2023)
- Unsupervised Word-level Prosody Tagging For Controllable Speech Synthesis (2022)
- Audio-conditioned Phonemic And Prosodic Annotation For Building Text-to-speech Models From Unlabeled Speech Data (2024)
- Prior-agnostic Multi-scale Contrastive Text-audio Pre-training For Parallelized TTS Frontend Modeling (2024)
- Simple And Effective Multi-sentence TTS With Expressive And Coherent Prosody (2022)
- ProsoSpeech: Enhancing Prosody With Quantized Vector Pre-training In Text-to-speech (2022)
- DiffProsody: Diffusion-based Latent Prosody Generation For Expressive Speech Synthesis With Prosody Conditional Adversarial Training (2023)
- Dynamic Prosody Generation For Speech Synthesis Using Linguistics-driven Acoustic Embedding Selection (2019)