Speech Rhythm-based Speaker Embeddings Extraction From Phonemes And Phoneme Duration For Multi-speaker Speech Synthesis
2024 · Kenichi Fujita, Atsushi Ando, Yusuke Ijima
Abstract
This paper proposes a speech rhythm-based method for extracting speaker embeddings that model phoneme duration from a few utterances by the target speaker. Along with acoustic features such as F0, speech rhythm is an essential speaker characteristic for reproducing individual utterances in speech synthesis. The novel feature of the proposed method is that the rhythm-based embeddings are extracted from phonemes and their durations, which are known to be related to speaking rhythm. They are extracted with a speaker identification model similar to the conventional spectral feature-based one. We conducted three experiments (speaker embedding generation, speech synthesis with the generated embeddings, and embedding space analysis) to evaluate performance. The proposed method demonstrated moderate speaker identification performance (15.2% EER), even with only phonemes and their durations as input. The objective and subjective evaluation results demonstrated that the proposed method can
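The abstract reports performance as an equal error rate (EER) but does not say how it was computed. As a point of reference only, the following is a minimal sketch of the standard EER definition: the operating point at which the false acceptance rate equals the false rejection rate. The function name `equal_error_rate` and the toy scores are hypothetical and are not taken from the paper.

```python
import numpy as np


def equal_error_rate(genuine, impostor):
    """Approximate the EER from similarity scores (higher = more similar).

    genuine:  scores for same-speaker trials
    impostor: scores for different-speaker trials
    Hypothetical helper for illustration; not the paper's evaluation code.
    """
    genuine = np.asarray(genuine, dtype=float)
    impostor = np.asarray(impostor, dtype=float)

    # Use every observed score as a candidate decision threshold.
    thresholds = np.sort(np.concatenate([genuine, impostor]))

    # False acceptance rate: impostor trials scoring at or above the threshold.
    far = np.array([(impostor >= t).mean() for t in thresholds])
    # False rejection rate: genuine trials scoring below the threshold.
    frr = np.array([(genuine < t).mean() for t in thresholds])

    # Pick the threshold where the two rates are closest and average them.
    idx = int(np.argmin(np.abs(far - frr)))
    return float((far[idx] + frr[idx]) / 2.0)


# Toy scores, made up for illustration.
toy_genuine = [0.9, 0.8, 0.7, 0.4]
toy_impostor = [0.6, 0.3, 0.2, 0.1]
print(f"EER = {equal_error_rate(toy_genuine, toy_impostor):.2%}")
```

With real trial lists the two rates rarely cross exactly at an observed score, so evaluation toolkits often interpolate between neighbouring thresholds; the nearest-threshold average used here is a simplification.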
Related papers
- Speaker Embedding Extraction With Phonetic Information (2018) · 11.85
- Recursive Attentive Pooling For Extracting Speaker Embeddings From Multi-speaker Recordings (2024) · 2.26
- An Analysis On The Effects Of Speaker Embedding Choice In Non Auto-regressive TTS (2023) · 0.00
- Acoustic BPE For Speech Generation With Discrete Tokens (2023) · 6.77
- Controllable Speech Synthesis By Learning Discrete Phoneme-level Prosodic Representations (2022) · 6.34
- Rethinking Speaker Embeddings For Speech Generation: Sub-center Modeling For Capturing Intra-speaker Diversity (2024) · 0.00
- ELF: Encoding Speaker-specific Latent Speech Feature For Speech Synthesis (2023) · 0.00
- Improved Vocal Effort Transfer Vector Estimation For Vocal Effort-robust Speaker Verification (2023) · 0.00