SATTS: Speaker Attractor Text To Speech, Learning To Speak By Learning To Separate
2022 · Nabarun Goswami, Tatsuya Harada
Abstract
The mapping of text to speech (TTS) is non-deterministic: letters may be pronounced differently depending on context, and phonemes can vary with physiological and stylistic factors such as gender, age, accent, and emotion. Neural speaker embeddings, trained to identify or verify speakers, are typically used to represent and transfer such characteristics from reference speech to synthesized speech. Speech separation, on the other hand, is the challenging task of separating individual speakers from an overlapping mixture of several speakers. Speaker attractors are high-dimensional embedding vectors that pull the time-frequency bins of each speaker's speech towards themselves while repelling those belonging to other speakers. In this work, we explore the possibility of using these powerful speaker attractors for zero-shot speaker adaptation in multi-speaker TTS synthesis and propose speaker attractor text to speech (SATTS). Through various experiments, we show that SATTS can s
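The attractor idea in the abstract can be illustrated with a small sketch. It assumes the deep-attractor-network style formulation (each attractor is the membership-weighted mean of the time-frequency bin embeddings, and masks come from a softmax over embedding-attractor dot products); whether SATTS computes its attractors exactly this way is not stated here. The function names and the toy data are hypothetical and chosen only for illustration.

```python
import numpy as np

def speaker_attractors(embeddings, assignments):
    """Membership-weighted mean embedding per speaker.

    embeddings:  (N, D) array, one row per time-frequency bin.
    assignments: (N, C) array, one-hot (or soft) speaker membership per bin.
    Returns a (C, D) array of attractors.
    """
    weights = assignments.sum(axis=0)[:, None]            # (C, 1) total membership
    return (assignments.T @ embeddings) / np.maximum(weights, 1e-8)

def soft_masks(embeddings, attractors):
    """Softmax over speakers of the dot-product similarity, shape (N, C)."""
    logits = embeddings @ attractors.T
    logits -= logits.max(axis=1, keepdims=True)           # numerical stability
    exp = np.exp(logits)
    return exp / exp.sum(axis=1, keepdims=True)

# Toy data (an assumption for this sketch): 2 speakers, 50 bins each,
# 4-dimensional embeddings clustered around two distinct centres.
rng = np.random.default_rng(0)
centres = np.array([[3.0, 0.0, 0.0, 0.0],
                    [0.0, 3.0, 0.0, 0.0]])
labels = np.repeat([0, 1], 50)
embeddings = centres[labels] + 0.1 * rng.standard_normal((100, 4))
assignments = np.eye(2)[labels]

attractors = speaker_attractors(embeddings, assignments)
masks = soft_masks(embeddings, attractors)
predicted = masks.argmax(axis=1)                          # speaker with largest mask per bin
```

In this toy setting each bin's mask is largest for the attractor of its own speaker, which is the "pull towards, repel from" behaviour the abstract describes.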
Related papers
- Learning Speaker Embedding From Text-to-speech (2020) 5.84
- DART: Disentanglement Of Accent And Speaker Representation In Multispeaker Text-to-speech (2024) 0.00
- Speaker-independent Speech Separation With Deep Attractor Network (2017) 16.84
- Boosting Unknown-number Speaker Separation With Transformer Decoder-based Attractor (2024) 0.00
- Speak, Read And Prompt: High-fidelity Text-to-speech With Minimal Supervision (2023) 0.00
- Sample Efficient Adaptive Text-to-speech (2018) 0.00
- TDASS: Target Domain Adaptation Speech Synthesis Framework For Multi-speaker Low-resource TTS (2022) 0.00
- Multi-scale Accent Modeling And Disentangling For Multi-speaker Multi-accent Text-to-speech Synthesis (2024) 2.26