Textless And Non-parallel Speech-to-speech Emotion Style Transfer
2025 · Soumya Dutta, Avni Jain, Sriram Ganapathy
Abstract
Given a pair of source and reference speech recordings, speech-to-speech (S2S) emotion style transfer involves the generation of an output speech that mimics the emotion characteristics of the reference while preserving the content and speaker attributes of the source. In this paper, we propose a speech-to-speech zero-shot emotion style transfer framework, termed S2S Zero-shot Emotion Style Transfer (S2S-ZEST), that enables the transfer of emotional attributes from the reference to the source while retaining the speaker identity and speech content. The S2S-ZEST framework consists of an analysis-synthesis pipeline in which the analysis module extracts semantic tokens, speaker representations, and emotion embeddings from speech. Using these representations, a pitch contour estimator and a duration predictor are learned. Further, a synthesis module is designed to generate speech based on the input representations and the derived factors. The analysis-synthesis pipeline is trained using an
Related papers
- StyleS2ST: Zero-shot Style Transfer For Direct Speech-to-speech Translation (2023)
- ZET-Speech: Zero-shot Adaptive Emotion-controllable Text-to-speech Synthesis With Diffusion And Style-based Models (2023)
- Nonparallel Emotional Speech Conversion (2018)
- ZS-MSTM: Zero-shot Style Transfer For Gesture Animation Driven By Text And Speech Using Adversarial Disentanglement Of Multimodal Style Encoding (2023)
- Cross-speaker Emotion Disentangling And Transfer For End-to-end Speech Synthesis (2021)
- Speech-to-speech Translation With Discrete-unit-based Style Transfer (2023)
- Seen And Unseen Emotional Style Transfer For Voice Conversion With A New Emotional Speech Dataset (2020)
- Improving Speech Emotion Recognition With Unsupervised Speaking Style Transfer (2022)