Disentangling Segmental And Prosodic Factors To Non-native Speech Comprehensibility
2024 · Waris Quamer, Ricardo Gutierrez-Osuna
Abstract
Current accent conversion (AC) systems do not disentangle the two main sources of non-native accent: segmental and prosodic characteristics. Being able to manipulate a non-native speaker's segmental and/or prosodic channels independently is critical to quantify how these two channels contribute to speech comprehensibility and social attitudes. We present an AC system that not only decouples voice quality from accent, but also disentangles the latter into its segmental and prosodic characteristics. The system is able to generate accent conversions that combine (1) the segmental characteristics from a source utterance, (2) the voice characteristics from a target utterance, and (3) the prosody of a reference utterance. We show that vector quantization of acoustic embeddings and removal of consecutive duplicated codewords allow the system to transfer prosody and improve voice similarity. We conduct perceptual listening tests to quantify the individual contributions of segmental features a
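The two operations named in the abstract, vector quantization of frame-level embeddings and removal of consecutive duplicated codewords, can be illustrated with a toy sketch. This is not the authors' implementation: the two-dimensional "embeddings", the three-entry codebook, and the function names `quantize` and `collapse_repeats` are all assumptions made for illustration, and a real system would learn the codebook jointly with the acoustic model.

```python
from itertools import groupby


def quantize(frames, codebook):
    """Map each frame to the index of its nearest codeword (squared Euclidean distance)."""
    def dist2(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))

    return [
        min(range(len(codebook)), key=lambda k: dist2(frame, codebook[k]))
        for frame in frames
    ]


def collapse_repeats(codes):
    """Drop consecutive duplicates, keeping one codeword per run."""
    return [code for code, _run in groupby(codes)]


# Hypothetical codebook and frame embeddings (toy values, not from the paper).
codebook = [(0.0, 0.0), (1.0, 1.0), (2.0, 0.0)]
frames = [
    (0.1, 0.0), (0.0, 0.2),              # two frames near codeword 0
    (0.9, 1.1), (1.0, 0.8), (1.1, 1.0),  # three frames near codeword 1
    (2.1, 0.1),                          # one frame near codeword 2
    (0.1, -0.1),                         # back to codeword 0
]

codes = quantize(frames, codebook)  # frame-level codeword indices
units = collapse_repeats(codes)     # duration-free codeword sequence
print(codes)
print(units)
```

Collapsing runs discards how long each codeword lasted, so the resulting sequence no longer carries the source utterance's timing. That is one plausible reading of why the step makes room for prosody to be supplied by a separate reference utterance; the abstract itself only reports the effect.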
Related papers
- Accent Conversion Using Discrete Units With Parallel Data Synthesized From Controllable Accented TTS (2024) 0.00
- Improving Accent Conversion With Reference Encoder And End-to-end Text-to-speech (2020) 0.00
- Accent And Speaker Disentanglement In Many-to-many Voice Conversion (2020) 10.35
- Zero-shot Accent Conversion Using Pseudo Siamese Disentanglement Network (2022) 5.24
- Transfer The Linguistic Representations From TTS To Accent Conversion With Non-parallel Data (2024) 6.77
- FAC-FACodec: Controllable Zero-shot Foreign Accent Conversion With Factorized Speech Codec (2025) 0.00
- TTS-guided Training For Accent Conversion Without Parallel Data (2022) 8.60
- Remap, Warp And Attend: Non-parallel Many-to-many Accent Conversion With Normalizing Flows (2022) 0.00