VECL-TTS: Voice Identity And Emotional Style Controllable Cross-lingual Text-to-speech
2024 · Ashishkumar Gudmalwar, Nirmesh Shah, Sai Akarsh, et al.
Abstract
Despite the significant advancements in Text-to-Speech (TTS) systems, their full utilization in automatic dubbing remains limited. This task necessitates the extraction of voice identity and emotional style from a reference speech in a source language and subsequently transferring them to a target language using cross-lingual TTS techniques. While previous approaches have mainly concentrated on controlling voice identity within the cross-lingual TTS framework, there has been limited work on incorporating emotion and voice identity together. To this end, we introduce an end-to-end Voice Identity and Emotional Style Controllable Cross-Lingual (VECL) TTS system using multilingual speakers and an emotion embedding network. Moreover, we introduce content and style consistency losses to enhance the quality of synthesized speech further. The proposed system achieved an average relative improvement of 8.83% compared to the state-of-the-art (SOTA) methods on a database comprising English and th
Related papers
- Boosting Multi-speaker Expressive Speech Synthesis With Semi-supervised Contrastive Learning (2023)
- Text-driven Emotional Style Control And Cross-speaker Style Transfer In Neural TTS (2022)
- EmoSphere-TTS: Emotional Style And Intensity Modeling Via Spherical Emotion Vector For Controllable Emotional Text-to-speech (2024)
- Limited Data Emotional Voice Conversion Leveraging Text-to-speech: Two-stage Sequence-to-sequence Training (2021)
- Interpretable Style Transfer For Text-to-speech With ControlVAE And Diffusion Bridge (2023)
- Emotional Voice Conversion Using Multitask Learning With Text-to-speech (2019)
- Cross-speaker Emotion Transfer For Low-resource Text-to-speech Using Non-parallel Voice Conversion With Pitch-shift Data Augmentation (2022)
- Enhancing Emotional Text-to-speech Controllability With Natural Language Guidance Through Contrastive Learning And Diffusion Models (2024)