Awesome Speech Audio



Awesome Text-to-Speech — curated papers, datasets & benchmarks · Awesome Speech Audio


Awesome Text-to-Speech

Text-to-Speech is one of the most active areas in Awesome Speech Audio, with 2,891 papers in this collection, evaluated on datasets such as LibriSpeech, MuST-C, and LibriTTS. A strong starting point is "Contextual Audio-visual Switching For Speech Enhancement In Real-world Environments", currently the top-trending paper on this topic.

Datasets & benchmarks

LibriSpeech: 56 papers · 🤗

MuST-C: 37 papers

LibriTTS: 30 papers

LJSpeech: 23 papers · 🤗

CoVoST-2: 16 papers

VCTK: 14 papers · 🤗

IEMOCAP: 11 papers

AISHELL-1: 10 papers

English: 9 papers

GigaSpeech: 7 papers · 🤗

Key papers

60 papers · trending (default) · numbers = 🔥 heat

Contextual Audio-visual Switching For Speech Enhancement In Real-world Environments (2018)
Ahsan Adeel, Mandar Gogate, Amir Hussain
14.35
Wan-Streamer v0.1: End-to-end Real-time Interactive Foundation Models (2026)
Lianghua Huang et al.
13.88
Emotion Rendering For Conversational Speech Synthesis With Heterogeneous Graph-based Context Modeling (2023)
Rui Liu, Yifan Hu, Yi Ren, et al.
13.15
FaceXHuBERT: Text-less Speech-driven E(X)pressive 3D Facial Animation Synthesis Using Self-supervised Speech Representation Learning (2023)
Kazi Injamamul Haque, Zerrin Yumak
11.49
MinMo: A Multimodal Large Language Model for Seamless Voice Interaction (2025)
Qian Chen et al.
11.29
Textless Speech Emotion Conversion Using Discrete And Decomposed Representations (2021)
Felix Kreuk, Adam Polyak, Jade Copet, et al.
10.74
Voxtral TTS (2026)
Mistral-AI: Alexander H. Liu et al.
10.65
Semantic Speech Retrieval With A Visually Grounded Model Of Untranscribed Speech (2017)
Herman Kamper, Gregory Shakhnarovich, Karen Livescu
10.61
MiniMax-Speech: Intrinsic Zero-Shot Text-to-Speech with a Learnable Speaker Encoder (2025)
Bowen Zhang et al.
9.94
Transfer Learning From Audio-visual Grounding To Speech Recognition (2019)
Wei-Ning Hsu, David Harwath, James Glass
9.59
SQuId: Measuring Speech Naturalness In Many Languages (2022)
Thibault Sellam, Ankur Bapna, Joshua Camp, et al.
9.41
MixSpeech: Cross-modality Self-learning With Audio-visual Stream Mixup For Visual Speech Translation And Recognition (2023)
Xize Cheng, Linjun Li, Tao Jin, et al.
8.60
Lahjoita Puhetta -- A Large-scale Corpus Of Spoken Finnish With Some Benchmarks (2022)
Anssi Moisio, Dejan Porjazovski, Aku Rouhe, et al.
8.60
MoonCast: High-Quality Zero-Shot Podcast Generation (2025)
Zeqian Ju et al.
8.52
DualCodec: A Low-Frame-Rate, Semantically-Enhanced Neural Audio Codec for Speech Generation (2025)
Jiaqi Li et al.
8.34
S2S-Arena: Evaluating Paralinguistic Instruction Following in Speech-to-Speech Models (2025)
Feng Jiang et al.
7.77
Boosting Large Language Model For Speech Synthesis: An Empirical Study (2023)
Hongkun Hao, Long Zhou, Shujie Liu, et al.
6.77
Personalization Of CTC-based End-to-end Speech Recognition Using Pronunciation-driven Subword Tokenization (2023)
Zhihong Lei, Ernest Pusateri, Shiyi Han, et al.
6.77
Learning Contextually Fused Audio-visual Representations For Audio-visual Speech Recognition (2022)
Zi-Qiang Zhang, Jie Zhang, Jian-Shu Zhang, et al.
6.77
Steering Language Model to Stable Speech Emotion Recognition via Contextual Perception and Chain of Thought (2025)
Zhixian Zhao et al.
6.58
Prosody-Enhanced Acoustic Pre-training and Acoustic-Disentangled Prosody Adapting for Movie Dubbing (2025)
Zhedong Zhang et al.
6.41
Fake It To Make It: Using Synthetic Data To Remedy The Data Shortage In Joint Multimodal Speech-and-gesture Synthesis (2024)
Shivam Mehta, Anna Deichler, Jim O'Regan, et al.
6.34
Silent Speech And Emotion Recognition From Vocal Tract Shape Dynamics In Real-time MRI (2021)
Laxmi Pandey, Ahmed Sabbir Arif
6.34
Qwen2.5-Omni Technical Report (2025)
Jin Xu et al.
6.17
Lightweight End-to-end Text-to-speech Synthesis for low resource on-device applications (2025)
Biel Tura Vecino et al.
6.01
SpeakEasy: Enhancing Text-to-Speech Interactions for Expressive Content Creation (2025)
Stephen Brade et al.
5.96
Semantic-VAE: Semantic-Alignment Latent Representation for Better Speech Synthesis (2025)
Zhikang Niu et al.
5.93
High-Fidelity Simultaneous Speech-To-Speech Translation (2025)
Tom Labiausse et al.
5.87
Bayesian Example Selection Improves In-context Learning For Speech, Text, And Visual Modalities (2024)
Siyin Wang, Chao-Han Huck Yang, Ji Wu, et al.
5.84
The Tatoxa System for Text Detoxification in Low-Resource Languages: The Case of Tatar (2026)
Ilseyar Alimova et al.
5.82
MnTTS2: An Open-source Multi-speaker Mongolian Text-to-speech Synthesis Dataset (2022)
Kailin Liang, Bin Liu, Yifan Hu, et al.
5.81
Universal Speech Token Learning via Low-Bitrate Neural Codec and Pretrained Representations (2025)
Xue Jiang et al.
5.59
Cross-Modal Knowledge Distillation for Speech Large Language Models (2025)
Enzhi Wang et al.
5.57
Generalizable Audio Deepfake Detection via Latent Space Refinement and Augmentation (2025)
Wen Huang et al.
5.48
Enhancing Speech-to-Speech Dialogue Modeling with End-to-End Retrieval-Augmented Generation (2025)
Pengchao Feng et al.
5.35
VoiceStar: Robust Zero-Shot Autoregressive TTS with Duration Control and Extrapolation (2025)
Puyuan Peng et al.
5.35
DialogueAgents: A Hybrid Agent-Based Speech Synthesis Framework for Multi-Party Dialogue (2025)
Xiang Li et al.
5.29
Speech Recognition With LLMs Adapted To Disordered Speech Using Reinforcement Learning (2024)
Chirag Nagpal, Subhashini Venugopalan, Jimmy Tobin, et al.
5.24
Rethinking Speech-LLM Integration for ASR: Effective Joint Speech-Text Training by Interleaving (2026)
Ruchao Fan et al.
5.01
Unlocking Speech-Text Compositional Powers: Instruction-Following Speech Language Models without Instruction Tuning (2026)
Congrui Du et al.
5.01
Closing the Quality Gap in Low-Resource Text-to-Speech: LoRA Fine-Tuning of VoxCPM2 for Khmer and Korean (2026)
Phannet Pov et al.
4.95
RedVox: Safety and Fairness Gaps in Speech Models Across Languages (2026)
Beatrice Savoldi et al.
4.95
Efficient Speech Language Modeling via Energy Distance in Continuous Latent Space (2025)
Zhengrui Ma et al.
4.93
Speechless: Speech Instruction Training Without Speech for Low Resource Languages (2025)
Alan Dao (Gia Tuan Dao) et al.
4.93
Speech-to-Text Translation with Phoneme-Augmented CoT: Enhancing Cross-Lingual Transfer in Low-Resource Scenarios (2025)
Gerard I. Gállego, Oriol Pareras, Martí Cortada Garcia, Lucas Takanori, Javier Hernando
4.93
SlimSpeech: Lightweight and Efficient Text-to-Speech with Slim Rectified Flow (2025)
Kaidi Wang et al.
4.87
DiVISe: Direct Visual-Input Speech Synthesis Preserving Speaker Characteristics And Intelligibility (2025)
Yifan Liu et al.
4.82
AdaST: Dynamically Adapting Encoder States in the Decoder for End-to-End Speech-to-Text Translation (2025)
Wuwei Huang et al.
4.82
MoVE: Translating Laughter and Tears via Mixture of Vocalization Experts in Speech-to-Speech Translation (2026)
Szu-Chi Chen et al.
4.81
Retrieval-Augmented Speech Recognition Approach for Domain Challenges (2025)
Peng Shen et al.
4.76
TTS-Transducer: End-to-End Speech Synthesis with Neural Transducer (2025)
Vladimir Bataev et al.
4.71
WAXAL: A Large-Scale Multilingual African Language Speech Corpus (2026)
Abdoulaye Diack et al.
4.70
Contextualized Token Discrimination for Speech Search Query Correction (2025)
Junyu Lu et al.
4.64
AugSumm: Towards Generalizable Speech Summarization Using Synthetic Labels From Large Language Model (2024)
Jee-Weon Jung, Roshan Sharma, William Chen, et al.
4.53
MPE-TTS: Customized Emotion Zero-Shot Text-To-Speech Using Multi-Modal Prompt (2025)
Zhichao Wu et al.
4.42
A Text-Steerable Instrument for Sketching Procedural Soundscapes via Language Models (2026)
Prabal Gupta (Rama Labs) et al.
4.39
Enhancing Flow Matching with A Unified Guidance Framework for Efficient and Robust Speech Synthesis (2026)
Zuda Yu et al.
4.39
A Geometric Perspective on Composable Emotion Steering in Text-to-Speech Models (2026)
Siyi Wang et al.
4.39
Fast Speech Foundation Model Distillation Using Interleaved Stacking (2026)
Eungbeom Kim et al.
4.33
Mask, Sample, Revise: A Revisable CTMC Inference Stack for Guided Discrete Flow Matching Text-to-Speech (2026)
Alef Iury Siqueira Ferreira et al.
4.33