Awesome Speech Audio

Speech Translation

© 2026 Awesome Papers.

Awesome Speech Translation — curated papers, datasets & benchmarks · Awesome Speech Audio

Awesome Speech Translation

Speech Translation is one of the most active areas in Awesome Speech Audio, with 2,952 papers in this collection, evaluated on datasets such as LibriSpeech, MuST-C, and AISHELL-1. A strong starting point is "Training Speech Recognition Models With Federated Learning: A Quality/cost Framework".

Datasets & benchmarks

LibriSpeech · 120 papers · 🤗

MuST-C · 46 papers

AISHELL-1 · 26 papers

CoVoST-2 · 23 papers

Common Voice · 16 papers · 🤗

FLEURS · 16 papers

UA-Speech · 14 papers

WSJ-0-2Mix · 14 papers

Libri-2Mix · 14 papers

AMI · 12 papers · 🤗

TIMIT · 12 papers · 🤗

Key papers

60 papers · trending (default) · numbers = 🔥 heat

Training Speech Recognition Models With Federated Learning: A Quality/cost Framework (2020)
Dhruv Guliani, Francoise Beaufays, Giovanni Motta
12.93
Looking Into Your Speech: Learning Cross-modal Affinity For Audio-visual Speech Separation (2021)
Jiyoung Lee, Soo-Whan Chung, Sunok Kim, et al.
11.67
LeBenchmark: A Reproducible Framework For Assessing Self-supervised Representation Learning From Speech (2021)
Solene Evain, Ha Nguyen, Hang Le, et al.
11.39
Transformer-based Video Front-ends For Audio-visual Speech Recognition For Single And Multi-person Video (2022)
Dmitriy Serdyuk, Otavio Braga, Olivier Siohan
11.39
A Highly Adaptive Acoustic Model For Accurate Multi-dialect Speech Recognition (2022)
Sanghyun Yoo, Inchul Song, Yoshua Bengio
10.85
Textless Speech Emotion Conversion Using Discrete And Decomposed Representations (2021)
Felix Kreuk, Adam Polyak, Jade Copet, et al.
10.74
MiniMax-Speech: Intrinsic Zero-Shot Text-to-Speech with a Learnable Speaker Encoder (2025)
Bowen Zhang et al.
9.94
SynthVSR: Scaling Up Visual Speech Recognition With Synthetic Supervision (2023)
Xubo Liu, Egor Lakomkin, Konstantinos Vougioukas, et al.
9.76
Transfer Learning From Audio-visual Grounding To Speech Recognition (2019)
Wei-Ning Hsu, David Harwath, James Glass
9.59
Re-translation Strategies For Long Form, Simultaneous, Spoken Language Translation (2019)
Naveen Arivazhagan, Colin Cherry, Te I, et al.
9.23
MixSpeech: Cross-modality Self-learning With Audio-visual Stream Mixup For Visual Speech Translation And Recognition (2023)
Xize Cheng, Linjun Li, Tao Jin, et al.
8.60
DualCodec: A Low-Frame-Rate, Semantically-Enhanced Neural Audio Codec for Speech Generation (2025)
Jiaqi Li et al.
8.34
DiCoW: Diarization-conditioned Whisper For Target Speaker Automatic Speech Recognition (2024)
Alexander Polok, Dominik Klement, Martin Kocour, et al.
8.09
S2S-Arena: Evaluating Paralinguistic Instruction Following in Speech-to-Speech Models (2025)
Feng Jiang et al.
7.77
SALMONN-omni: A Standalone Speech LLM without Codec Injection for Full-duplex Conversation (2025)
Wenyi Yu et al.
7.75
Domain Robust Feature Extraction For Rapid Low Resource ASR Development (2018)
Siddharth Dalmia, Xinjian Li, Florian Metze, et al.
7.50
A Multi-purpose Audio-visual Corpus For Multi-modal Persian Speech Recognition: The Arman-AV Dataset (2023)
Javad Peymanfard, Samin Heydarian, Ali Lashini, et al.
7.50
Speech Enhancement Using Continuous Embeddings of Neural Audio Codec (2025)
Haoyang Li et al.
7.29
Challenging The Boundaries Of Speech Recognition: The MALACH Corpus (2019)
Michael Picheny, Zoltán Tüske, Brian Kingsbury, et al.
7.16
SoloSpeech: Enhancing Intelligibility and Quality in Target Speech Extraction through a Cascaded Generative Pipeline (2025)
Helin Wang et al.
7.13
GigaSpeech: An Evolving, Multi-domain ASR Corpus with 10,000 Hours of Transcribed Audio (2021)
Guoguo Chen et al.
7.01
Adaptive Audio-Visual Speech Recognition via Matryoshka-Based Multimodal LLMs (2025)
Umberto Cappellazzo et al.
6.83
Leveraging Translations For Speech Transcription In Low-resource Settings (2018)
Antonis Anastasopoulos, David Chiang
6.77
Personalization Of CTC-based End-to-end Speech Recognition Using Pronunciation-driven Subword Tokenization (2023)
Zhihong Lei, Ernest Pusateri, Shiyi Han, et al.
6.77
Incorporating Pass-phrase Dependent Background Models For Text-dependent Speaker Verification (2016)
A. K. Sarkar, Zheng-Hua Tan
6.77
Learning Contextually Fused Audio-visual Representations For Audio-visual Speech Recognition (2022)
Zi-Qiang Zhang, Jie Zhang, Jian-Shu Zhang, et al.
6.77
Zero-AVSR: Zero-Shot Audio-Visual Speech Recognition with LLMs by Learning Language-Agnostic Speech Representations (2025)
Jeong Hun Yeo et al.
6.41
Silent Speech And Emotion Recognition From Vocal Tract Shape Dynamics In Real-time MRI (2021)
Laxmi Pandey, Ahmed Sabbir Arif
6.34
Sample-efficient Unsupervised Domain Adaptation Of Speech Recognition Systems: A Case Study For Modern Greek (2022)
Georgios Paraskevopoulos, Theodoros Kouzelis, Georgios Rouvalis, et al.
6.34
Listening And Seeing Again: Generative Error Correction For Audio-visual Speech Recognition (2025)
Rui Liu, Hongyu Yuan, Haizhou Li
6.26
LiteASR: Efficient Automatic Speech Recognition with Low-Rank Approximation (2025)
Keisuke Kamahori et al.
6.12
SpeakEasy: Enhancing Text-to-Speech Interactions for Expressive Content Creation (2025)
Stephen Brade et al.
5.96
AISHELL6-whisper: A Chinese Mandarin Audio-visual Whisper Speech Dataset with Speech Recognition Baselines (2025)
Cancan Li et al.
5.93
High-Fidelity Simultaneous Speech-To-Speech Translation (2025)
Tom Labiausse et al.
5.87
Spoken Term Detection And Relevance Score Estimation Using Dot-product Of Pronunciation Embeddings (2022)
Jan Švec, Luboš Šmídl, Josef V. Psutka, et al.
5.84
Universal Speech Token Learning via Low-Bitrate Neural Codec and Pretrained Representations (2025)
Xue Jiang et al.
5.59
Enhancing Speech-to-Speech Dialogue Modeling with End-to-End Retrieval-Augmented Generation (2025)
Pengchao Feng et al.
5.35
Towards Inclusive ASR: Investigating Voice Conversion for Dysarthric Speech Recognition in Low-Resource Languages (2025)
Chin-Jou Li et al.
5.35
Intuitive Multilingual Audio-visual Speech Recognition With A Single-trained Model (2023)
Joanna Hong, Se Jin Park, Yong Man Ro
5.24
Speech Recognition With LLMs Adapted To Disordered Speech Using Reinforcement Learning (2024)
Chirag Nagpal, Subhashini Venugopalan, Jimmy Tobin, et al.
5.24
Throat and acoustic paired speech dataset for deep learning-based speech enhancement (2025)
Yunsik Kim et al.
5.18
Low-Rank and Sparse Model Merging for Multi-Lingual Speech Recognition and Translation (2025)
Qiuming Zhao et al.
5.18
Speech Denoising with Auditory Models (2020)
Mark R. Saddler et al.
5.06
Rethinking Speech-LLM Integration for ASR: Effective Joint Speech-Text Training by Interleaving (2026)
Ruchao Fan et al.
5.01
NAVER LABS Europe Submission to the Instruction-following 2026 Short Track (2026)
Marcely Zanon Boito et al.
5.01
FBK's Long-form SpeechLLMs for IWSLT 2026 Instruction Following (2026)
Zhihang Xie et al.
4.95
Transfer Learning-Based Deep Residual Learning for Speech Recognition in Clean and Noisy Environments (2025)
Noussaiba Djeffal et al.
4.93
Speechless: Speech Instruction Training Without Speech for Low Resource Languages (2025)
Alan Dao (Gia Tuan Dao) et al.
4.93
Speech-to-Text Translation with Phoneme-Augmented CoT: Enhancing Cross-Lingual Transfer in Low-Resource Scenarios (2025)
Gerard I. Gállego, Oriol Pareras, Martí Cortada Garcia, Lucas Takanori, Javier Hernando
4.93
AdaST: Dynamically Adapting Encoder States in the Decoder for End-to-End Speech-to-Text Translation (2025)
Wuwei Huang et al.
4.82
MoVE: Translating Laughter and Tears via Mixture of Vocalization Experts in Speech-to-Speech Translation (2026)
Szu-Chi Chen et al.
4.81
Retrieval-Augmented Speech Recognition Approach for Domain Challenges (2025)
Peng Shen et al.
4.76
Improving Cross-Lingual Phonetic Representation of Low-Resource Languages Through Language Similarity Analysis (2025)
Minu Kim et al.
4.71
A long-form single-speaker real-time MRI speech dataset and benchmark (2025)
Sean Foley et al.
4.64
OLKAVS: An Open Large-scale Korean Audio-visual Speech Dataset (2023)
Jeongkyun Park, Jung-Wook Hwang, Kwanghee Choi, et al.
4.52
Exploring the Effect of Segmentation and Vocabulary Size on Speech Tokenization for Speech Language Models (2025)
Shunsuke Kando et al.
4.42
Efficient Speech Translation through Model Compression and Knowledge Distillation (2025)
Yasmin Moslem
4.42
SonicSieve: Bringing Directional Speech Extraction to Smartphones Using Acoustic Microstructures (2025)
Kuang Yuan et al.
4.36
Fast Speech Foundation Model Distillation Using Interleaved Stacking (2026)
Eungbeom Kim et al.
4.33
MagpieTTS-LF: Inference-Time Long-Form Speech Generation Without Training on Long-Form data (2026)
Subhankar Ghosh et al.
4.33