Mixer-TTS: Non-Autoregressive, Fast And Compact Text-to-Speech Model Conditioned On Language Model Embeddings
2021 · Oktai Tatanov, Stanislav Beliaev, Boris Ginsburg
Abstract
This paper describes Mixer-TTS, a non-autoregressive model for mel-spectrogram generation. The model is based on the MLP-Mixer architecture adapted for speech synthesis. The basic Mixer-TTS contains pitch and duration predictors, with the latter trained using an unsupervised TTS alignment framework. Alongside the basic model, we propose an extended version that additionally uses token embeddings from a pre-trained language model. Basic Mixer-TTS and its extended version achieve mean opinion scores (MOS) of 4.05 and 4.11, respectively, compared to a MOS of 4.27 for original LJSpeech samples. Both versions have a small number of parameters and enable much faster speech synthesis than models of similar quality.
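For readers unfamiliar with the MLP-Mixer architecture named in the abstract, the following is a minimal NumPy sketch of one generic Mixer block: a token-mixing MLP applied across sequence positions, followed by a channel-mixing MLP applied across features, each with layer normalization and a residual connection. It is not the Mixer-TTS implementation; the abstract does not say how the block is adapted for speech, and every name, shape, and hyperparameter below (`mixer_block`, `init_mlp`, 16 tokens, 8 channels, hidden size 32) is an assumption chosen for illustration.

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    # Normalize over the last axis (no learned scale or shift in this sketch).
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

def gelu(x):
    # Tanh approximation of the GELU activation.
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (x + 0.044715 * x**3)))

def mlp(x, w1, b1, w2, b2):
    # Two-layer perceptron acting on the last axis of x.
    return gelu(x @ w1 + b1) @ w2 + b2

def init_mlp(rng, dim, hidden):
    # Hypothetical initializer: small Gaussian weights, zero biases.
    return (rng.normal(0.0, 0.02, (dim, hidden)), np.zeros(hidden),
            rng.normal(0.0, 0.02, (hidden, dim)), np.zeros(dim))

def mixer_block(x, token_params, channel_params):
    # x has shape (tokens, channels).
    # Token mixing: transpose so the MLP mixes information across positions.
    y = layer_norm(x).T                      # (channels, tokens)
    x = x + mlp(y, *token_params).T          # residual, back to (tokens, channels)
    # Channel mixing: the MLP mixes features independently at each position.
    x = x + mlp(layer_norm(x), *channel_params)
    return x

rng = np.random.default_rng(0)
tokens, channels = 16, 8                     # illustrative sizes only
x = rng.normal(size=(tokens, channels))
out = mixer_block(x, init_mlp(rng, tokens, 32), init_mlp(rng, channels, 32))
print(out.shape)
```

Note that the token-mixing weights in this generic form are tied to a fixed number of tokens, so variable-length text or audio needs either padding to a fixed length or a different token-mixing operator. The abstract does not state which approach Mixer-TTS takes.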
Related papers
- MELA-TTS: Joint Transformer-Diffusion Model With Representation Alignment For Speech Synthesis (2025)
- Matcha-TTS: A Fast TTS Architecture With Conditional Flow Matching (2023)
- Spark-TTS: An Efficient LLM-Based Text-to-Speech Model With Single-Stream Decoupled Speech Tokens (2025)
- AlignTTS: Efficient Feed-Forward Text-to-Speech System Without Explicit Alignment (2020)
- EfficientTTS: An Efficient And High-Quality Text-to-Speech Architecture (2020)
- FastSpeech: Fast, Robust And Controllable Text To Speech (2019)
- StyleTTS: A Style-Based Generative Model For Natural And Diverse Text-to-Speech Synthesis (2022)
- STYLER: Style Factor Modeling With Rapidity And Robustness Via Speech Decomposition For Expressive And Controllable Neural Text To Speech (2021)