MELA-TTS: Joint Transformer-diffusion Model With Representation Alignment For Speech Synthesis
2025 · Keyu An, Zhiyu Zhang, Changfeng Gao, et al.
Abstract
This work introduces MELA-TTS, a novel joint transformer-diffusion framework for end-to-end text-to-speech synthesis. By autoregressively generating continuous mel-spectrogram frames from linguistic and speaker conditions, our architecture eliminates the need for speech tokenization and multi-stage processing pipelines. To address the inherent difficulties of modeling continuous features, we propose a representation alignment module that aligns output representations of the transformer decoder with semantic embeddings from a pretrained ASR encoder during training. This mechanism not only speeds up training convergence, but also enhances cross-modal coherence between the textual and acoustic domains. Comprehensive experiments demonstrate that MELA-TTS achieves state-of-the-art performance across multiple evaluation metrics while maintaining robust zero-shot voice cloning capabilities, in both offline and streaming synthesis modes. Our results establish a new benchmark for continuous fea
Related papers
- AlignTTS: Efficient Feed-forward Text-to-speech System Without Explicit Alignment (2020)
- Autoregressive Speech Synthesis Without Vector Quantization (2024)
- MegaTTS 3: Sparse Alignment Enhanced Latent Diffusion Transformer For Zero-shot Speech Synthesis (2025)
- AutoTTS: End-to-end Text-to-speech Synthesis Through Differentiable Duration Modeling (2022)
- Continuous Autoregressive Modeling With Stochastic Monotonic Alignment For Speech Synthesis (2025)
- FastSpeech: Fast, Robust And Controllable Text To Speech (2019)
- Mixer-TTS: Non-autoregressive, Fast And Compact Text-to-speech Model Conditioned On Language Model Embeddings (2021)
- End-to-end Adversarial Text-to-speech (2020)