BreezyVoice: Adapting TTS For Taiwanese Mandarin With Enhanced Polyphone Disambiguation -- Challenges And Insights
2025 · Chan-Jan Hsu, Yi-Cheng Lin, Chia-Chun Lin, et al.
Abstract
We present BreezyVoice, a Text-to-Speech (TTS) system specifically adapted for Taiwanese Mandarin, highlighting phonetic control abilities to address the unique challenges of polyphone disambiguation in the language. Building upon CosyVoice, we incorporate an \(S^{3}\) tokenizer, a large language model (LLM), an optimal-transport conditional flow matching model (OT-CFM), and a grapheme-to-phoneme prediction model to generate realistic speech that closely mimics human utterances. Our evaluation demonstrates BreezyVoice's superior performance in both general and code-switching contexts, highlighting its robustness and effectiveness in generating high-fidelity speech. Additionally, we address the challenges of generalizability in modeling long-tail speakers and polyphone disambiguation. Our approach significantly enhances performance and offers valuable insights into the workings of neural codec TTS systems.
Related papers
- External Knowledge Augmented Polyphone Disambiguation Using Large Language Model (2023) 0.00
- Towards Natural Bilingual And Code-switched Speech Synthesis Based On Mix Of Monolingual Recordings And Cross-lingual Voice Conversion (2020) 0.00
- Bailing-TTS: Chinese Dialectal Speech Synthesis Towards Human-like Spontaneous Representation (2024) 0.00
- Polyphone Disambiguation And Accent Prediction Using Pre-trained Language Models In Japanese TTS Front-end (2022) 5.24
- Dict-TTS: Learning To Pronounce With Prior Dictionary Knowledge For Text-to-speech (2022) 4.27
- A Unified Sequence-to-sequence Front-end Model For Mandarin Text-to-speech Synthesis (2019) 9.41
- Building Multi Lingual TTS Using Cross Lingual Voice Conversion (2020) 0.00
- Disambiguation Of Chinese Polyphones In An End-to-end Framework With Semantic Features Extracted By Pre-trained BERT (2025) 7.16