Awesome Speech Audio

Audio Understanding

© 2026 Awesome Papers.

Awesome Audio Understanding — curated papers, datasets & benchmarks · Awesome Speech Audio

Audio Understanding is one of the most active areas in Awesome Speech Audio, with 2,612 papers in this collection evaluated on datasets such as LibriSpeech, IEMOCAP, and SUPERB. A strong starting point is "Detection Of Glottal Closure Instants From Speech Signals: A Quantitative Review".

Datasets & benchmarks

LibriSpeech: 46 papers · 🤗

IEMOCAP: 44 papers

SUPERB: 31 papers

WSJ-0-2Mix: 21 papers

SLURP: 20 papers · 🤗

AMI: 18 papers · 🤗

CallHome: 16 papers · 🤗

AudioSet: 15 papers · 🤗

Google Speech Commands: 15 papers

LibriMix: 14 papers · 🤗

Libri-2Mix: 14 papers

VoxCeleb-1: 14 papers

Key papers

60 papers · trending (default) · numbers = 🔥 heat

Detection Of Glottal Closure Instants From Speech Signals: A Quantitative Review (2019)
Thomas Drugman, Mark Thomas, Jon Gudnason, et al.
16.88
Multi-level And Multi-scale Feature Aggregation Using Pre-trained Convolutional Neural Networks For Music Auto-tagging (2017)
Jongpil Lee, Juhan Nam
15.43
Joint Robust Voicing Detection And Pitch Estimation Based On Residual Harmonics (2019)
Thomas Drugman, Abeer Alwan
14.93
Curriculum-based Transfer Learning For An Effective End-to-end Spoken Language Understanding And Domain Portability (2019)
Antoine Caubrière, Natalia Tomashenko, Antoine Laurent, et al.
10.74
Phonetic-and-semantic Embedding Of Spoken Words With Applications In Spoken Content Retrieval (2018)
Yi-Chen Chen, Sung-Feng Huang, Chia-Hao Shen, et al.
9.76
Spoken Language Identification Using Convnets (2019)
Sarthak, Shikhar Shukla, Govind Mittal
9.59
MiDashengLM: Efficient Audio Understanding with General Audio Captions (2025)
Heinrich Dinkel et al.
9.30
Native Active Perception as Reasoning for Omni-Modal Understanding (2026)
Zhenghao Xing et al.
9.27
Lahjoita Puhetta -- A Large-scale Corpus Of Spoken Finnish With Some Benchmarks (2022)
Anssi Moisio, Dejan Porjazovski, Aku Rouhe, et al.
8.60
The MSP-Podcast Corpus (2025)
Carlos Busso et al.
8.23
PIN: A Novel Parallel Interactive Network For Spoken Language Understanding (2020)
Peilin Zhou, Zhiqi Huang, Fenglin Liu, et al.
8.09
S2S-Arena: Evaluating Paralinguistic Instruction Following in Speech-to-Speech Models (2025)
Feng Jiang et al.
7.77
SALMONN-omni: A Standalone Speech LLM without Codec Injection for Full-duplex Conversation (2025)
Wenyi Yu et al.
7.75
E-chat: Emotion-sensitive Spoken Dialogue System With Large Language Models (2023)
Hongfei Xue, Yuhao Liang, Bingshen Mu, et al.
7.50
Speech Enhancement Using Continuous Embeddings of Neural Audio Codec (2025)
Haoyang Li et al.
7.29
SoloSpeech: Enhancing Intelligibility and Quality in Target Speech Extraction through a Cascaded Generative Pipeline (2025)
Helin Wang et al.
7.13
Style Attuned Pre-training And Parameter Efficient Fine-tuning For Spoken Language Understanding (2020)
Jin Cao, Jun Wang, Wael Hamza, et al.
6.77
Can String Kernels Pass The Test Of Time In Native Language Identification? (2017)
Radu Tudor Ionescu, Marius Popescu
6.77
PragReST: Self-Reinforcing Counterfactual Reasoning for Pragmatic Language Understanding (2026)
Jihyung Park et al.
6.69
Steering Language Model to Stable Speech Emotion Recognition via Contextual Perception and Chain of Thought (2025)
Zhixian Zhao et al.
6.58
AV-SyncBench: Decoupled Benchmarking of Temporal and Semantic Audio-Visual Synchronization (2026)
Tianhong Zhou et al.
6.52
AnySimLite: A Lightweight Few-Shot Similarity Encoder for On-Device Speech-Adjacent Classification (2026)
Sourav Ghosh et al.
6.44
Label-aware Multi-level Contrastive Learning For Cross-lingual Spoken Language Understanding (2022)
Shining Liang, Linjun Shou, Jian Pei, et al.
6.34
Keyword Localisation In Untranscribed Speech Using Visually Grounded Speech Models (2022)
Kayode Olaleye, Dan Oneata, Herman Kamper
6.34
CLEP-DG: Contrastive Learning for Speech Emotion Domain Generalization via Soft Prompt Tuning (2025)
Jiacheng Shi et al.
6.12
Multitask Learning with Capsule Networks for Speech-to-Intent Applications (2020)
Jakob Poncelet et al.
6.08
LauraTSE: Target Speaker Extraction using Auto-Regressive Decoder-Only Language Models (2025)
Beilong Tang et al.
5.96
Cross-Modal Knowledge Distillation for Speech Large Language Models (2025)
Enzhi Wang et al.
5.57
Generalizable Audio Deepfake Detection via Latent Space Refinement and Augmentation (2025)
Wen Huang et al.
5.48
Enhancing Speech-to-Speech Dialogue Modeling with End-to-End Retrieval-Augmented Generation (2025)
Pengchao Feng et al.
5.35
Improving Pretrained YAMNet for Enhanced Speech Command Detection via Transfer Learning (2025)
Sidahmed Lachenani et al.
5.29
Deep Learning for Speech Emotion Recognition: A CNN Approach Utilizing Mel Spectrograms (2025)
Niketa Penumajji
5.24
On Structured Sparsity Of Phonological Posteriors For Linguistic Parsing (2016)
Milos Cernak, Afsaneh Asaei, Hervé Bourlard
5.24
Residual Shuffle-exchange Networks For Fast Processing Of Long Sequences (2020)
Andis Draguns, Emīls Ozoliņš, Agris Šostaks, et al.
5.24
End-to-end Spoken Language Understanding Using Transformer Networks And Self-supervised Pre-trained Features (2020)
Edmilson Morais, Hong-Kwang J. Kuo, Samuel Thomas, et al.
5.24
CLaMP 3: Universal Music Information Retrieval Across Unaligned Modalities and Unseen Languages (2025)
Shangda Wu et al.
5.18
Improving Low-Resource Dialect Classification Using Retrieval-based Voice Conversion (2025)
Lea Fischbach et al.
5.04
NAVER LABS Europe Submission to the Instruction-following 2026 Short Track (2026)
Marcely Zanon Boito et al.
5.01
Unlocking Speech-Text Compositional Powers: Instruction-Following Speech Language Models without Instruction Tuning (2026)
Congrui Du et al.
5.01
Adaptive Speech-to-Spike Encoding for Spiking Neural Networks (2026)
Taharim Rahman Anon et al.
4.95
HierBias: Context-Conditioned Hierarchical Media Bias Detection with Multi-Task Type Classification (2026)
Kaining Li et al.
4.95
RedVox: Safety and Fairness Gaps in Speech Models Across Languages (2026)
Beatrice Savoldi et al.
4.95
Elastic Time: Dynamic Frame Rate Bottlenecks for Neural Audio Coding (2026)
Dimitrios Bralios et al.
4.95
SpecWav-Attack: Leveraging Spectrogram Resizing and Wav2Vec 2.0 for Attacking Anonymized Speech (2025)
Yuqi Li et al.
4.93
LLaSE-G1: Incentivizing Generalization Capability for LLaMA-based Speech Enhancement (2025)
Boyi Kang, Xinfa Zhu, Zihan Zhang, Zhen Ye, Mingshuai Liu, Ziqian Wang, Yike Zhu, Guobin Ma, Jun Chen, Longshuai Xiao, Chao Weng, Wei Xue, Lei Xie
4.82
AdaST: Dynamically Adapting Encoder States in the Decoder for End-to-End Speech-to-Text Translation (2025)
Wuwei Huang et al.
4.82
MoVE: Translating Laughter and Tears via Mixture of Vocalization Experts in Speech-to-Speech Translation (2026)
Szu-Chi Chen et al.
4.81
Distillation and Pruning for Scalable Self-Supervised Representation-Based Speech Quality Assessment (2025)
Benjamin Stahl, Hannes Gamper
4.76
Reconstruction of the Complete Vocal Tract Contour Through Acoustic to Articulatory Inversion Using Real-Time MRI Data (2025)
Sofiane Azzouz et al.
4.69
VBx for End-to-End Neural and Clustering-based Diarization (2025)
Petr Pálka et al.
4.69
Contextualized Token Discrimination for Speech Search Query Correction (2025)
Junyu Lu et al.
4.64
A long-form single-speaker real-time MRI speech dataset and benchmark (2025)
Sean Foley et al.
4.64
OMAR-RQ: Open Music Audio Representation Model Trained with Multi-Feature Masked Token Prediction (2025)
Pablo Alonso-Jiménez, Pedro Ramoneda, R. Oguz Araz, Andrea Poltronieri, Dmitry Bogdanov
4.53
Augsumm: Towards Generalizable Speech Summarization Using Synthetic Labels From Large Language Model (2024)
Jee-Weon Jung, Roshan Sharma, William Chen, et al.
4.53
HC²L: Hybrid And Cooperative Contrastive Learning For Cross-lingual Spoken Language Understanding (2024)
Bowen Xing, Ivor W. Tsang
4.52
Spatio-spectral diarization of meetings by combining TDOA-based segmentation and speaker embedding-based clustering (2025)
Tobias Cord-Landwehr et al.
4.47
Exploring the Effect of Segmentation and Vocabulary Size on Speech Tokenization for Speech Language Models (2025)
Shunsuke Kando et al.
4.42
Adaptive Perturbation Selection for Contrastive Audio Decoding (2026)
Aaron Isidore Grace et al.
4.39
From Objectives to Applications: Aligning Architectural Biases in Audio Self-Supervised Learning (2026)
Kele Xu et al.
4.39
Evaluating Pretrained Music Embeddings for Cross-Performance Jazz Standard Recognition (2026)
Çağrı Eser
4.39