Learning Multilingual Expressive Speech Representation For Prosody Prediction Without Parallel Data
2023 · Jarod Duret, Titouan Parcollet, Yannick Estève
Abstract
We propose a method for speech-to-speech emotion-preserving translation that operates at the level of discrete speech units. Our approach relies on a multilingual emotion embedding that can capture affective information in a language-independent manner. We show that this embedding can be used to predict the pitch and duration of speech units in a target language, allowing us to resynthesize the source speech signal with the same emotional content. We evaluate our approach on English and French speech signals and show that it outperforms a baseline method that does not use emotional information, including when the emotion embedding is extracted from a different language. Although this preliminary study does not directly address the machine translation problem, our results demonstrate the effectiveness of our approach for cross-lingual emotion preservation in the context of speech resynthesis.