BreezyVoice: Adapting TTS for Taiwanese Mandarin with Enhanced Polyphone Disambiguation -- Challenges and Insights

Abstract

We present BreezyVoice, a Text-to-Speech (TTS) system specifically adapted for Taiwanese Mandarin, with phonetic control capabilities that address the unique challenges of polyphone disambiguation in the language. Building upon CosyVoice, we incorporate an \(S^{3}\) tokenizer, a large language model (LLM), an optimal-transport conditional flow matching (OT-CFM) model, and a grapheme-to-phoneme prediction model to generate realistic speech that closely mimics human utterances. Our evaluation demonstrates BreezyVoice's superior performance in both general and code-switching contexts, highlighting its robustness and effectiveness in generating high-fidelity speech. Additionally, we address the challenges of generalizability in modeling long-tail speakers and polyphone disambiguation. Our approach significantly enhances performance and offers valuable insights into the workings of neural codec TTS systems.
