ELF: Encoding Speaker-specific Latent Speech Feature For Speech Synthesis
2023 · Jungil Kong, Junmo Lee, Jeongmin Kim, et al.
Abstract
In this work, we propose a novel method for modeling numerous speakers, which enables expressing the overall characteristics of speakers in detail like a trained multi-speaker model without additional training on the target speaker's dataset. Although various works with similar purposes have been actively studied, their performance has not yet reached that of trained multi-speaker models due to their fundamental limitations. To overcome previous limitations, we propose effective methods for feature learning and representing target speakers' speech characteristics by discretizing the features and conditioning them to a speech synthesis model. Our method obtained a significantly higher similarity mean opinion score (SMOS) in subjective similarity evaluation than seen speakers of a high-performance multi-speaker model, even with unseen speakers. The proposed method also outperforms a zero-shot method by significant margins. Furthermore, our method shows remarkable performance in generatin
Related papers
- Modeling Multi-speaker Latent Space To Improve Neural TTS: Quick Enrolling New Speaker And Enhancing Premium Voice (2018) · 0.00
- Msdtron: A High-capability Multi-speaker Speech Synthesis System For Diverse Data Using Characteristic Information (2021) · 4.52
- Multi-speaker Expressive Speech Synthesis Via Multiple Factors Decoupling (2022) · 0.00
- Electrolaryngeal Speech Intelligibility Enhancement Through Robust Linguistic Encoders (2023) · 6.34
- Enhancing Zero-shot Multi-speaker TTS With Negated Speaker Representations (2024) · 3.58
- SELM: Speech Enhancement Using Discrete Tokens And Language Models (2023) · 11.19
- A Unified Speaker Adaptation Method For Speech Synthesis Using Transcribed And Untranscribed Speech With Backpropagation (2019) · 0.00
- Zero-shot Multi-speaker Text-to-speech With State-of-the-art Neural Speaker Embeddings (2019) · 15.67