Awesome Speech Audio

© 2026 Awesome Papers.

Awesome Speech Recognition — curated papers, datasets & benchmarks · Awesome Speech Audio

Awesome Speech Recognition

Speech Recognition is one of the most active areas in Awesome Speech Audio, with 6,118 papers in this collection, evaluated on datasets such as LibriSpeech, AISHELL-1, and IEMOCAP. A strong starting point is "Detection Of Glottal Closure Instants From Speech Signals: A Quantitative Review".

Datasets & benchmarks

LibriSpeech: 251 papers · 🤗

AISHELL-1: 62 papers

IEMOCAP: 60 papers

SUPERB: 39 papers

MuST-C: 36 papers

Common Voice: 35 papers · 🤗

FLEURS: 32 papers

AMI: 27 papers · 🤗

UA-Speech: 27 papers

TIMIT: 26 papers · 🤗

Google Speech Commands: 23 papers

SLURP: 22 papers · 🤗

Key papers

60 papers · trending (default) · numbers = 🔥 heat

Detection Of Glottal Closure Instants From Speech Signals: A Quantitative Review (2019)
Thomas Drugman, Mark Thomas, Jon Gudnason, et al.
16.88
emotion2vec: Self-supervised Pre-training For Speech Emotion Representation (2023)
Ziyang Ma, Zhisheng Zheng, Jiaxin Ye, et al.
15.88
Joint Robust Voicing Detection And Pitch Estimation Based On Residual Harmonics (2019)
Thomas Drugman, Abeer Alwan
14.93
Wan-Streamer v0.1: End-to-end Real-time Interactive Foundation Models (2026)
Lianghua Huang et al.
13.88
E-RNN: Design Optimization For Efficient Recurrent Neural Networks In FPGAs (2018)
Zhe Li, Caiwen Ding, Siyue Wang, et al.
13.50
Training Speech Recognition Models With Federated Learning: A Quality/cost Framework (2020)
Dhruv Guliani, Francoise Beaufays, Giovanni Motta
12.93
Adversarial Auto-encoders For Speech Based Emotion Recognition (2018)
Saurabh Sahu, Rahul Gupta, Ganesh Sivaraman, et al.
12.68
LeBenchmark: A Reproducible Framework For Assessing Self-supervised Representation Learning From Speech (2021)
Solene Evain, Ha Nguyen, Hang Le, et al.
11.39
Transformer-based Video Front-ends For Audio-visual Speech Recognition For Single And Multi-person Video (2022)
Dmitriy Serdyuk, Otavio Braga, Olivier Siohan
11.39
MinMo: A Multimodal Large Language Model for Seamless Voice Interaction (2025)
Qian Chen et al.
11.29
A Highly Adaptive Acoustic Model For Accurate Multi-dialect Speech Recognition (2022)
Sanghyun Yoo, Inchul Song, Yoshua Bengio
10.85
Textless Speech Emotion Conversion Using Discrete And Decomposed Representations (2021)
Felix Kreuk, Adam Polyak, Jade Copet, et al.
10.74
CTL-MTNet: A Novel CapsNet And Transfer Learning-based Mixed Task Net For The Single-corpus And Cross-corpus Speech Emotion Recognition (2022)
Xin-Cheng Wen, Jia-Xin Ye, Yan Luo, et al.
10.21
Few-shot Learning In Emotion Recognition Of Spontaneous Speech Using A Siamese Neural Network With Adaptive Sample Pair Formation (2021)
Kexin Feng, Theodora Chaspari
9.92
Performance Of Three Slim Variants Of The Long Short-term Memory (LSTM) Layer (2019)
Daniel Kent, Fathi M. Salem
9.92
SynthVSR: Scaling Up Visual Speech Recognition With Synthetic Supervision (2023)
Xubo Liu, Egor Lakomkin, Konstantinos Vougioukas, et al.
9.76
Transfer Learning From Audio-visual Grounding To Speech Recognition (2019)
Wei-Ning Hsu, David Harwath, James Glass
9.59
Spoken Language Identification Using ConvNets (2019)
Sarthak, Shikhar Shukla, Govind Mittal
9.59
Language Learning Using Speech To Image Retrieval (2019)
Danny Merkx, Stefan L. Frank, Mirjam Ernestus
9.41
Re-translation Strategies For Long Form, Simultaneous, Spoken Language Translation (2019)
Naveen Arivazhagan, Colin Cherry, Te I, et al.
9.23
Multistage Linguistic Conditioning Of Convolutional Layers For Speech Emotion Recognition (2021)
Andreas Triantafyllopoulos, Uwe Reichel, Shuo Liu, et al.
9.23
Transferable Positive/negative Speech Emotion Recognition Via Class-wise Adversarial Domain Adaptation (2018)
Hao Zhou, Ke Chen
9.23
Towards Visually Grounded Sub-word Speech Unit Discovery (2019)
David Harwath, James Glass
9.03
MixSpeech: Cross-modality Self-learning With Audio-visual Stream Mixup For Visual Speech Translation And Recognition (2023)
Xize Cheng, Linjun Li, Tao Jin, et al.
8.60
Few-shot Open-set Learning For On-device Customization Of Keyword Spotting Systems (2023)
Manuele Rusci, Tinne Tuytelaars
8.60
DOVER: A Method For Combining Diarization Outputs (2019)
Andreas Stolcke, Takuya Yoshioka
8.60
DualCodec: A Low-Frame-Rate, Semantically-Enhanced Neural Audio Codec for Speech Generation (2025)
Jiaqi Li et al.
8.34
DiCoW: Diarization-conditioned Whisper For Target Speaker Automatic Speech Recognition (2024)
Alexander Polok, Dominik Klement, Martin Kocour, et al.
8.09
Scalable Factorized Hierarchical Variational Autoencoder Training (2018)
Wei-Ning Hsu, James Glass
7.81
SALMONN-omni: A Standalone Speech LLM without Codec Injection for Full-duplex Conversation (2025)
Wenyi Yu et al.
7.75
Domain Robust Feature Extraction For Rapid Low Resource ASR Development (2018)
Siddharth Dalmia, Xinjian Li, Florian Metze, et al.
7.50
Federated Pruning: Improving Neural Network Efficiency With Federated Learning (2022)
Rongmei Lin, Yonghui Xiao, Tien-Ju Yang, et al.
7.50
A Multi-purpose Audio-visual Corpus For Multi-modal Persian Speech Recognition: The Arman-AV Dataset (2023)
Javad Peymanfard, Samin Heydarian, Ali Lashini, et al.
7.50
Challenging The Boundaries Of Speech Recognition: The MALACH Corpus (2019)
Michael Picheny, Zoltán Tüske, Brian Kingsbury, et al.
7.16
SoloSpeech: Enhancing Intelligibility and Quality in Target Speech Extraction through a Cascaded Generative Pipeline (2025)
Helin Wang et al.
7.13
GigaSpeech: An Evolving, Multi-domain ASR Corpus with 10,000 Hours of Transcribed Audio (2021)
Guoguo Chen et al.
7.01
Scaling On-Device GPU Inference for Large Generative Models (2025)
Jiuqiang Tang et al.
6.95
Adaptive Audio-Visual Speech Recognition via Matryoshka-Based Multimodal LLMs (2025)
Umberto Cappellazzo et al.
6.83
Personalization Of CTC-based End-to-end Speech Recognition Using Pronunciation-driven Subword Tokenization (2023)
Zhihong Lei, Ernest Pusateri, Shiyi Han, et al.
6.77
Learning Waveform-based Acoustic Models Using Deep Variational Convolutional Neural Networks (2019)
Dino Oglic, Zoran Cvetkovic, Peter Sollich
6.77
Incorporating Pass-phrase Dependent Background Models For Text-dependent Speaker Verification (2016)
A. K. Sarkar, Zheng-Hua Tan
6.77
SHARP: An Adaptable, Energy-efficient Accelerator For Recurrent Neural Network (2019)
Reza Yazdani, Olatunji Ruwase, Minjia Zhang, et al.
6.77
Learning Contextually Fused Audio-visual Representations For Audio-visual Speech Recognition (2022)
Zi-Qiang Zhang, Jie Zhang, Jian-Shu Zhang, et al.
6.77
Self-Supervised Models for Phoneme Recognition: Applications in Children's Speech for Reading Learning (2025)
Lucas Block Medin et al.
6.63
Steering Language Model to Stable Speech Emotion Recognition via Contextual Perception and Chain of Thought (2025)
Zhixian Zhao et al.
6.58
Revisiting ASR Error Correction with Specialized Models (2024)
Zijin Gu et al.
6.57
AnySimLite: A Lightweight Few-Shot Similarity Encoder for On-Device Speech-Adjacent Classification (2026)
Sourav Ghosh et al.
6.44
Zero-AVSR: Zero-Shot Audio-Visual Speech Recognition with LLMs by Learning Language-Agnostic Speech Representations (2025)
Jeong Hun Yeo et al.
6.41
SCRIBE: Diagnostic Evaluation and Rich Transcription Models for Indic ASR (2026)
Kavya Manohar et al.
6.37
Transformer-based ASR Incorporating Time-reduction Layer And Fine-tuning With Self-knowledge Distillation (2021)
Md Akmal Haidar, Chao Xing, Mehdi Rezagholizadeh
6.34
ED-CEC: Improving Rare Word Recognition Using ASR Postprocessing Based On Error Detection And Context-aware Error Correction (2023)
Jiajun He, Zekun Yang, Tomoki Toda
6.34
Silent Speech And Emotion Recognition From Vocal Tract Shape Dynamics In Real-time MRI (2021)
Laxmi Pandey, Ahmed Sabbir Arif
6.34
Sample-efficient Unsupervised Domain Adaptation Of Speech Recognition Systems: A Case Study For Modern Greek (2022)
Georgios Paraskevopoulos, Theodoros Kouzelis, Georgios Rouvalis, et al.
6.34
Listening And Seeing Again: Generative Error Correction For Audio-visual Speech Recognition (2025)
Rui Liu, Hongyu Yuan, Haizhou Li
6.26
Multilingual Source Tracing of Speech Deepfakes: A First Benchmark (2025)
Xi Xuan et al.
6.18
Qwen2.5-Omni Technical Report (2025)
Jin Xu et al.
6.17
LiteASR: Efficient Automatic Speech Recognition with Low-Rank Approximation (2025)
Keisuke Kamahori et al.
6.12
CLEP-DG: Contrastive Learning for Speech Emotion Domain Generalization via Soft Prompt Tuning (2025)
Jiacheng Shi et al.
6.12
The Interspeech 2025 Speech Accessibility Project Challenge (2025)
Xiuwen Zheng et al.
6.12
Multitask Learning with Capsule Networks for Speech-to-Intent Applications (2020)
Jakob Poncelet et al.
6.08