Disentangling Prosody Representations With Unsupervised Speech Reconstruction
2022 · Leyuan Qu, Taihao Li, Cornelius Weber, et al.
Abstract
Human speech can be characterized by different components, including semantic content, speaker identity and prosodic information. Significant progress has been made in disentangling representations for semantic content and speaker identity in Automatic Speech Recognition (ASR) and speaker verification tasks, respectively. However, extracting prosodic information remains an open and challenging research question because of the intrinsic association of different attributes, such as timbre and rhythm, and because of the need for supervised training schemes to achieve robust large-scale and speaker-independent ASR. The aim of this paper is to address the disentanglement of emotional prosody from speech based on unsupervised reconstruction. Specifically, we identify, design, implement and integrate three crucial components in our proposed speech reconstruction model Prosody2Vec: (1) a unit encoder that transforms speech signals into discrete units for semantic content, (2) a pretrained speak
Related papers
- Speech Resynthesis From Discrete Disentangled Self-supervised Representations (2021) · 16.25
- Investigating Disentanglement In A Phoneme-level Speech Codec For Prosody Modeling (2024) · 4.52
- Unsupervised Quantized Prosody Representation For Controllable Speech Synthesis (2022) · 4.52
- Perception Of Prosodic Variation For Speech Synthesis Using An Unsupervised Discrete Representation Of F0 (2020) · 7.81
- Unsupervised Learning Of Disentangled Speech Content And Style Representation (2020) · 7.50
- Learning Disentangled Speech Representations (2023) · 0.00
- Disentangling Speech And Non-speech Components For Building Robust Acoustic Models From Found Data (2019) · 0.00
- Disentangling Voice And Content With Self-supervision For Speaker Recognition (2023) · 2.26