LibriTTS-VI: A Public Corpus and Novel Methods for Efficient Voice Impression Control
2025 · Junki Ohmura, Yuki Ito, Emiru Tsunoo, et al.
Abstract
Numerical voice impression (VI) control (e.g., scaling brightness) enables fine-grained control in text-to-speech (TTS). However, it faces two challenges: no public corpus and impression leakage, where reference audio biases synthesized voice away from the target VI. To address the first challenge, we introduce LibriTTS-VI, the first public VI corpus built on LibriTTS-R. For the second, we hypothesize a single reference causes leakage by entangling speaker identity and VI. To mitigate this, we propose 1) disentangled training with two utterances from the same speaker for speaker and VI conditioning, and 2) a reference-free method controlling the impression solely via target VI. Experimentally, our best method improves controllability: 11-dimensional VI mean squared error drops from 0.61 to 0.41 objectively and 1.15 to 0.92 subjectively. A comparison with a prompt-based TTS reveals imprecise numerical control and entanglement between VI and text semantics, which our methods overcome.
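The controllability figures above are mean squared errors between target and realized VI vectors. The abstract does not say how the error is aggregated, so the following is only a minimal sketch, assuming the squared error is averaged over all 11 dimensions and all utterances. The function name `vi_mse` and the list-of-lists input layout are hypothetical, and the source of the realized values (an automatic impression estimator for the objective score, listener ratings for the subjective one) is likewise an assumption.

```python
N_DIMS = 11  # number of VI dimensions stated in the abstract


def vi_mse(targets, estimates):
    """Mean squared error between target and realized VI vectors.

    targets, estimates: equal-length lists of 11-element sequences,
    one per synthesized utterance. Averaging over both utterances and
    dimensions is an assumption, not a detail confirmed by the abstract.
    """
    if not targets or len(targets) != len(estimates):
        raise ValueError("need the same non-zero number of vectors")
    total = 0.0
    for target, estimate in zip(targets, estimates):
        if len(target) != N_DIMS or len(estimate) != N_DIMS:
            raise ValueError(f"each VI vector must have {N_DIMS} values")
        # Accumulate the squared difference for every dimension.
        total += sum((t - e) ** 2 for t, e in zip(target, estimate))
    return total / (len(targets) * N_DIMS)
```

Under this convention, a lower value means the synthesized voice lies closer to the requested impression across all dimensions.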
Related papers
- Voice Impression Control In Zero-shot TTS (2025)
- VECL-TTS: Voice Identity And Emotional Style Controllable Cross-lingual Text-to-speech (2024)
- LibriTTS: A Corpus Derived From LibriSpeech For Text-to-speech (2019)
- IndicVoices-R: Unlocking A Massive Multilingual Multi-speaker Speech Corpus For Scaling Indian TTS (2024)
- Computational Narrative Understanding For Expressive Text-to-speech (2025)
- VoiceLoop: Voice Fitting And Synthesis Via A Phonological Loop (2017)
- Interpretable Style Transfer For Text-to-speech With ControlVAE And Diffusion Bridge (2023)
- Enhancing Emotional Text-to-speech Controllability With Natural Language Guidance Through Contrastive Learning And Diffusion Models (2024)