Praxy Voice: Voice-prompt Recovery + BUPS For Commercial-class Indic TTS From A Frozen Non-indic Base At Zero Commercial-training-data Cost
2026 · Venkata Pushpak Teja Menta
Abstract
arXiv:2604.25441v1. Commercial TTS systems produce near-native Indic audio, but the best open-source bases (Chatterbox, Indic Parler-TTS, IndicF5) trail them on measured phonological dimensions, and the most widely adopted multilingual base (Chatterbox, 23 languages) does not even tokenise Telugu or Tamil. We ask: what is the minimum intervention that brings such a non-Indic-native base to commercial-class output on Telugu, Tamil, and Hindi, without training a new acoustic decoder and without any commercial TTS training data? We combine three pieces: (1) BUPS, a Brahmic Unified Phoneme Space that deterministically romanises seven Indic scripts to ISO-15919 so Chatterbox's Latin tokeniser can process them; (2) a LoRA adapter on only the text-token predictor (Chatterbox's t3), trained on ~1,220h of licensed Indic audio with a Hindi-proxy language_id; (3) a voice-prompt recovery recipe: an 8-11s same-language reference clip plus three sampling overrides (exa
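To make the idea of deterministic romanisation in piece (1) concrete, here is a minimal sketch of a rule-based Devanagari to ISO-15919 mapping. It is not the paper's BUPS implementation: the function name `romanize_devanagari`, the table names, and the small character subset are assumptions chosen for illustration. The sketch covers only part of one script, whereas the abstract describes seven, and it applies no language-specific rules such as Hindi schwa deletion.

```python
# Illustrative subset of a Devanagari -> ISO 15919 mapping.
# These tables are assumptions for this sketch, not the paper's BUPS tables.
CONSONANTS = {
    "क": "k", "ग": "g", "त": "t", "द": "d", "न": "n", "प": "p", "भ": "bh",
    "म": "m", "य": "y", "र": "r", "ल": "l", "व": "v", "स": "s", "ह": "h",
}
INDEPENDENT_VOWELS = {
    "अ": "a", "आ": "ā", "इ": "i", "ई": "ī", "उ": "u", "ऊ": "ū", "ए": "ē", "ओ": "ō",
}
VOWEL_SIGNS = {
    "ा": "ā", "ि": "i", "ी": "ī", "ु": "u", "ू": "ū", "े": "ē", "ो": "ō",
}
VIRAMA = "्"  # suppresses the inherent vowel of the preceding consonant


def romanize_devanagari(text: str) -> str:
    """Romanise a Devanagari string with a fixed, context-free rule set."""
    out = []
    i = 0
    while i < len(text):
        ch = text[i]
        nxt = text[i + 1] if i + 1 < len(text) else ""
        if ch in CONSONANTS:
            out.append(CONSONANTS[ch])
            if nxt in VOWEL_SIGNS:
                # A dependent vowel sign replaces the inherent "a".
                out.append(VOWEL_SIGNS[nxt])
                i += 2
            elif nxt == VIRAMA:
                # Virama: bare consonant, no vowel emitted.
                i += 2
            else:
                # No sign follows, so emit the inherent vowel.
                out.append("a")
                i += 1
        elif ch in INDEPENDENT_VOWELS:
            out.append(INDEPENDENT_VOWELS[ch])
            i += 1
        else:
            # Pass through anything outside the subset (spaces, punctuation).
            out.append(ch)
            i += 1
    return "".join(out)


print(romanize_devanagari("नमस्ते"))  # namastē
```

Because the mapping is a pure table lookup with one character of lookahead, the same input always yields the same Latin string, which is the property that lets a Latin-script tokeniser consume the text without retraining.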