SQuId: Measuring Speech Naturalness in Many Languages
2022 · Thibault Sellam, Ankur Bapna, Joshua Camp, et al.
Abstract
Much of text-to-speech research relies on human evaluation, which incurs heavy costs and slows down the development process. The problem is particularly acute in heavily multilingual applications, where recruiting and polling judges can take weeks. We introduce SQuId (Speech Quality Identification), a multilingual naturalness prediction model trained on over a million ratings and tested in 65 locales, the largest effort of this type to date. The main insight is that training one model on many locales consistently outperforms mono-locale baselines. We present our task and the model, and show that it outperforms a competitive baseline based on w2v-BERT and VoiceMOS by 50.0%. We then demonstrate the effectiveness of cross-locale transfer during fine-tuning and highlight its effect on zero-shot locales, i.e., locales for which there is no fine-tuning data. Through a series of analyses, we highlight the role of non-linguistic effects such as sound artifacts in cross-locale transfer. Finally, we
Related papers
- Efficient Neural Speech Synthesis for Low-Resource Languages Through Multilingual Modeling (2020)
- SpeechLLM-as-Judges: Towards General and Interpretable Speech Quality Evaluation (2025)
- Comparison of Speech Representations for Automatic Quality Estimation in Multi-Speaker Text-to-Speech Synthesis (2020)
- NaturalSpeech: End-to-End Text to Speech Synthesis with Human-Level Quality (2022)
- NaturalSpeech 2: Latent Diffusion Models Are Natural and Zero-Shot Speech and Singing Synthesizers (2023)
- Evaluating Text-to-Speech Synthesis from a Large Discrete Token-Based Speech Language Model (2024)
- Extending Multilingual Speech Synthesis to 100+ Languages Without Transcribed Data (2024)
- TorchAudio-Squim: Reference-less Speech Quality and Intelligibility Measures in TorchAudio (2023)