Voice Impression Control In Zero-shot TTS
2025 · Kenichi Fujita, Shota Horiguchi, Yusuke Ijima
Abstract
Para-/non-linguistic information in speech is pivotal in shaping the listeners' impression. Although zero-shot text-to-speech (TTS) has achieved high speaker fidelity, modulating subtle para-/non-linguistic information to control perceived voice characteristics, i.e., impressions, remains challenging. We have therefore developed a voice impression control method in zero-shot TTS that utilizes a low-dimensional vector to represent the intensities of various voice impression pairs (e.g., dark-bright). The results of both objective and subjective evaluations have demonstrated our method's effectiveness in impression control. Furthermore, generating this vector via a large language model enables target-impression generation from a natural language description of the desired impression, thus eliminating the need for manual optimization. Audio examples are available on our demo page (https://ntt-hilab-gensp.github.io/is2025voiceimpression/).
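As a rough illustration of the representation described in the abstract, the sketch below stores one intensity per impression pair and packs them into a fixed-order, low-dimensional vector. Only the dark-bright pair comes from the abstract; the other pair names, the [-1, 1] intensity range, and the function name `make_impression_vector` are assumptions made for illustration and should not be read as the authors' implementation.

```python
from typing import Dict, List

# Hypothetical impression pairs; only "dark-bright" is named in the abstract.
IMPRESSION_PAIRS: List[str] = ["dark-bright", "weak-strong", "cold-warm"]

# Assumed intensity range: -1.0 = fully the first pole, +1.0 = fully the second.
MIN_INTENSITY, MAX_INTENSITY = -1.0, 1.0


def make_impression_vector(intensities: Dict[str, float]) -> List[float]:
    """Pack per-pair intensities into a fixed-order vector.

    Pairs that are not specified default to 0.0 (neutral), and values are
    clamped to the assumed range.
    """
    unknown = set(intensities) - set(IMPRESSION_PAIRS)
    if unknown:
        raise KeyError(f"unknown impression pairs: {sorted(unknown)}")
    return [
        max(MIN_INTENSITY, min(MAX_INTENSITY, intensities.get(pair, 0.0)))
        for pair in IMPRESSION_PAIRS
    ]


vector = make_impression_vector({"dark-bright": 0.8, "cold-warm": 1.5})
print(vector)  # [0.8, 0.0, 1.0]
```

In the setting the abstract describes, a vector of this kind would condition the zero-shot TTS model, and a large language model could produce it from a natural language description; neither step is modeled in this sketch.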
Related papers
- LibriTTS-VI: A Public Corpus And Novel Methods For Efficient Voice Impression Control (2025)
- ControlSpeech: Towards Simultaneous And Independent Zero-shot Speaker Cloning And Zero-shot Language Style Control (2024)
- Zero-shot Personalized Lip-to-speech Synthesis With Face Image Based Voice Control (2023)
- YourTTS: Towards Zero-shot Multi-speaker TTS And Zero-shot Voice Conversion For Everyone (2021)
- Word-level Emotional Expression Control In Zero-shot Text-to-speech Synthesis (2025)
- VoiceShop: A Unified Speech-to-speech Framework For Identity-preserving Zero-shot Voice Editing (2024)
- SIG-VC: A Speaker Information Guided Zero-shot Voice Conversion System For Both Human Beings And Machines (2021)
- TCSinger: Zero-shot Singing Voice Synthesis With Style Transfer And Multi-level Style Control (2024)