Abstract

Emotional text-to-speech (TTS) systems sturggle to capture the full spectrum of human emotions due to the inherent complexity of emotional expressions and the limited coverage of existing emotion labels. To address this, we propose a language model-based TTS framework that synthesizes speech across a broad range of emotional styles. Our approach enables flexible user control along three continuous dimensions - pleasure, arousal, and dominance (PAD). To enable this, we train an emotional dimension predictor that maps categorical emotion labels in speech datasets into the PAD space, grounded in established psychological research. Importantly, while the emotional dimension predictor leverages categorical labels, the TTS framework itself does not require explict emotion labels during training. Objective and subjective evaluations demonstrate that our framework effectively generates more expressive emotional styles and enhances both naturalness and diversity compared to baselines.

Authors

(none)

Tags

  • Text-to-Speech

Stats

  • citations0
  • S2 citationsβ€”
  • github stars0
  • HF likes0
  • heat score0.00
  • arxiv keyzhou2024emotional

Related papers