FluentEditor2: Text-based Speech Editing By Modeling Multi-scale Acoustic And Prosody Consistency
2024 · Rui Liu, Jiatian Xi, Ziyue Jiang, et al.
Abstract
Text-based speech editing (TSE) allows users to edit speech by modifying the corresponding text directly without altering the original recording. Current TSE techniques often focus on minimizing discrepancies between the generated speech and the reference within edited regions during training to achieve fluent TSE performance. However, the generated speech in the edited region should maintain acoustic and prosodic consistency with the unedited region and the original speech at both the local and global levels. To maintain speech fluency, we propose a new fluent speech editing scheme based on our previous FluentEditor model, termed FluentEditor2, by modeling a multi-scale acoustic and prosody consistency training criterion in TSE training. Specifically, for local acoustic consistency, we propose a hierarchical local acoustic smoothness constraint to align the acoustic properties of speech frames, phonemes, and words at the boundary between the genera
Related papers
- FluentEditor: Text-based Speech Editing By Considering Acoustic And Prosody Consistency (2023)
- DiffEditor: Enhancing Speech Editing With Semantic Enrichment And Acoustic Consistency (2024)
- Multi-scale Accent Modeling And Disentangling For Multi-speaker Multi-accent Text-to-speech Synthesis (2024)
- Improving Multi-speaker TTS Prosody Variance With A Residual Encoder And Normalizing Flows (2021)
- EdiTTS: Score-based Editing For Controllable Text-to-speech (2021)
- FluentSpeech: Stutter-oriented Automatic Speech Editing With Context-aware Diffusion Models (2023)
- DiffStyleTTS: Diffusion-based Hierarchical Prosody Modeling For Text-to-speech With Diverse And Controllable Styles (2024)
- MSStyleTTS: Multi-scale Style Modeling With Hierarchical Context Information For Expressive Speech Synthesis (2023)