Time-layer Adaptive Alignment For Speaker Similarity In Flow-matching Based Zero-shot TTS
2025 · Haoyu Li, Mingyang Han, Yu Xi, et al.
Abstract
Flow-Matching (FM)-based zero-shot text-to-speech (TTS) systems exhibit high-quality speech synthesis and robust generalization capabilities. However, the speaker representation ability of such systems remains underexplored, primarily due to the lack of explicit speaker-specific supervision in the FM framework. To this end, we conduct an empirical analysis of speaker information distribution and reveal its non-uniform allocation across time steps and network layers, underscoring the need for adaptive speaker alignment. Accordingly, we propose Time-Layer Adaptive Speaker Alignment (TLA-SA), a strategy that enhances speaker consistency by jointly leveraging temporal and hierarchical variations. Experimental results show that TLA-SA substantially improves speaker similarity over baseline systems on both research- and industrial-scale datasets and generalizes well across diverse model architectures, including decoder-only language model (LM)-based and LM-free TTS systems. A demo is provided.
Related papers
- DiFlow-TTS: Compact And Low-latency Zero-shot Text-to-speech With Factorized Discrete Flow Matching (2025)
- Cross-lingual F5-TTS: Towards Language-agnostic Voice Cloning And Speech Synthesis (2025)
- An Investigation Of Noise Robustness For Flow-matching-based Zero-shot TTS (2024)
- Glow-TTS: A Generative Flow For Text-to-speech Via Monotonic Alignment Search (2020)
- Flow-TSVAD: Target-speaker Voice Activity Detection Via Latent Flow Matching (2024)
- F5-TTS: A Fairytaler That Fakes Fluent And Faithful Speech With Flow Matching (2024)
- Improving Multi-speaker TTS Prosody Variance With A Residual Encoder And Normalizing Flows (2021)
- Cross-lingual Text-to-speech With Flow-based Voice Conversion For Improved Pronunciation (2022)