Awesome Speech Audio



© 2026 Awesome Papers.

Awesome Multimodal Audio — curated papers, datasets & benchmarks · Awesome Speech Audio


Awesome Multimodal Audio

Multimodal Audio is one of the most active areas in Awesome Speech Audio, with 848 papers in this collection, evaluated on datasets such as LRS-3, LibriSpeech, and AudioCaps. A strong starting point is "Multi-level And Multi-scale Feature Aggregation Using Pre-trained Convolutional Neural Networks For Music Auto-tagging".

Datasets & benchmarks

LibriSpeech · 14 papers · 🤗

AudioCaps · 11 papers · 🤗

MuST-C · 11 papers

IEMOCAP · 10 papers

VoxCeleb2 · 8 papers · 🤗

AISHELL-1 · 6 papers

MELD · 4 papers · 🤗

MusicCaps · 4 papers · 🤗

MuAViC · 4 papers · 🤗

AVSpeech · 4 papers · 🤗

Key papers

60 papers · trending (default) · numbers = 🔥 heat

Multi-level And Multi-scale Feature Aggregation Using Pre-trained Convolutional Neural Networks For Music Auto-tagging (2017)
Jongpil Lee, Juhan Nam
15.43
Contextual Audio-visual Switching For Speech Enhancement In Real-world Environments (2018)
Ahsan Adeel, Mandar Gogate, Amir Hussain
14.35
Wan-Streamer v0.1: End-to-end Real-time Interactive Foundation Models (2026)
Lianghua Huang et al.
13.88
Multi-modality Associative Bridging Through Memory: Speech Sound Recollected From Face Video (2022)
Minsu Kim, Joanna Hong, Se Jin Park, et al.
12.10
MinMo: A Multimodal Large Language Model for Seamless Voice Interaction (2025)
Qian Chen et al.
11.29
MiDashengLM: Efficient Audio Understanding with General Audio Captions (2025)
Heinrich Dinkel et al.
9.30
Native Active Perception as Reasoning for Omni-Modal Understanding (2026)
Zhenghao Xing et al.
9.27
FreeAudio: Training-Free Timing Planning for Controllable Long-Form Text-to-Audio Generation (2025)
Yuxuan Jiang et al.
7.86
Generalized Multichannel Variational Autoencoder For Underdetermined Source Separation (2018)
Shogo Seki, Hirokazu Kameoka, Li Li, et al.
7.81
S2S-Arena: Evaluating Paralinguistic Instruction Following in Speech-to-Speech Models (2025)
Feng Jiang et al.
7.77
Fast Text-to-Audio Generation with Adversarial Post-Training (2025)
Zachary Novack et al.
7.30
TCSinger 2: Customizable Multilingual Zero-shot Singing Voice Synthesis (2025)
Yu Zhang et al.
7.13
Adaptive Audio-Visual Speech Recognition via Matryoshka-Based Multimodal LLMs (2025)
Umberto Cappellazzo et al.
6.83
AV-SyncBench: Decoupled Benchmarking of Temporal and Semantic Audio-Visual Synchronization (2026)
Tianhong Zhou et al.
6.52
Zero-AVSR: Zero-Shot Audio-Visual Speech Recognition with LLMs by Learning Language-Agnostic Speech Representations (2025)
Jeong Hun Yeo et al.
6.41
Prosody-Enhanced Acoustic Pre-training and Acoustic-Disentangled Prosody Adapting for Movie Dubbing (2025)
Zhedong Zhang et al.
6.41
Fake It To Make It: Using Synthetic Data To Remedy The Data Shortage In Joint Multimodal Speech-and-gesture Synthesis (2024)
Shivam Mehta, Anna Deichler, Jim O'Regan, et al.
6.34
Listening And Seeing Again: Generative Error Correction For Audio-visual Speech Recognition (2025)
Rui Liu, Hongyu Yuan, Haizhou Li
6.26
Qwen2.5-Omni Technical Report (2025)
Jin Xu et al.
6.17
CLEP-DG: Contrastive Learning for Speech Emotion Domain Generalization via Soft Prompt Tuning (2025)
Jiacheng Shi et al.
6.12
AISHELL6-whisper: A Chinese Mandarin Audio-visual Whisper Speech Dataset with Speech Recognition Baselines (2025)
Cancan Li et al.
5.93
Flow-SLM: Joint Learning of Linguistic and Acoustic Information for Spoken Language Modeling (2025)
Ju-Chieh Chou et al.
5.87
Direct Multimodal Few-shot Learning Of Speech And Images (2020)
Leanne Nortje, Herman Kamper
5.84
Cross-Modal Knowledge Distillation for Speech Large Language Models (2025)
Enzhi Wang et al.
5.57
Intuitive Multilingual Audio-visual Speech Recognition With A Single-trained Model (2023)
Joanna Hong, Se Jin Park, Yong Man Ro
5.24
Unsupervised Vs. Transfer Learning For Multimodal One-shot Matching Of Speech And Images (2020)
Leanne Nortje, Herman Kamper
5.24
Learning Interpretable Features in Audio Latent Spaces via Sparse Autoencoders (2025)
Nathan Paek et al.
5.21
CLaMP 3: Universal Music Information Retrieval Across Unaligned Modalities and Unseen Languages (2025)
Shangda Wu et al.
5.18
Throat and acoustic paired speech dataset for deep learning-based speech enhancement (2025)
Yunsik Kim et al.
5.18
Rethinking Speech-LLM Integration for ASR: Effective Joint Speech-Text Training by Interleaving (2026)
Ruchao Fan et al.
5.01
Unlocking Speech-Text Compositional Powers: Instruction-Following Speech Language Models without Instruction Tuning (2026)
Congrui Du et al.
5.01
HierBias: Context-Conditioned Hierarchical Media Bias Detection with Multi-Task Type Classification (2026)
Kaining Li et al.
4.95
Utilizing Cognitive Signals Generated during Human Reading to Enhance Keyphrase Extraction from Microblogs (2026)
Xinyi Yan et al.
4.95
SocialPersona: Benchmarking Personalized Profiling and Response with Multimodal Social-Media Context (2026)
Qinkai Zhang et al.
4.95
Evaluation Pitfalls and Challenges in Multimedia Event Extraction (2026)
Philipp Seeberger et al.
4.95
Elastic Time: Dynamic Frame Rate Bottlenecks for Neural Audio Coding (2026)
Dimitrios Bralios et al.
4.95
LLaSE-G1: Incentivizing Generalization Capability for LLaMA-based Speech Enhancement (2025)
Boyi Kang, Xinfa Zhu, Zihan Zhang, Zhen Ye, Mingshuai Liu, Ziqian Wang, Yike Zhu, Guobin Ma, Jun Chen, Longshuai Xiao, Chao Weng, Wei Xue, Lei Xie
4.82
DiVISe: Direct Visual-Input Speech Synthesis Preserving Speaker Characteristics And Intelligibility (2025)
Yifan Liu et al.
4.82
AdaST: Dynamically Adapting Encoder States in the Decoder for End-to-End Speech-to-Text Translation (2025)
Wuwei Huang et al.
4.82
MoVE: Translating Laughter and Tears via Mixture of Vocalization Experts in Speech-to-Speech Translation (2026)
Szu-Chi Chen et al.
4.81
OLKAVS: An Open Large-scale Korean Audio-visual Speech Dataset (2023)
Jeongkyun Park, Jung-Wook Hwang, Kwanghee Choi, et al.
4.52
MPE-TTS: Customized Emotion Zero-Shot Text-To-Speech Using Multi-Modal Prompt (2025)
Zhichao Wu et al.
4.42
Adaptive Perturbation Selection for Contrastive Audio Decoding (2026)
Aaron Isidore Grace et al.
4.39
A Text-Steerable Instrument for Sketching Procedural Soundscapes via Language Models (2026)
Prabal Gupta (Rama Labs) et al.
4.39
From Objectives to Applications: Aligning Architectural Biases in Audio Self-Supervised Learning (2026)
Kele Xu et al.
4.39
Beyond Words: Towards Effective Modeling of Non-Verbal Vocalizations in ASR (2026)
Gene Yang et al.
4.39
An Efficient vLLM-Based Inference Pipeline for Unified Audio Understanding and Generation (2026)
Haoran Wang et al.
4.39
Overview of the EReL@MIR 2025 Multimodal Document Retrieval Challenge (Track 1) (2026)
Jingbiao Mei
4.33
SpectCount: Spectrotemporal Counting via Synthetic Signals Improves Large Audio Language Models (2026)
Seonuk Kim et al.
4.33
Towards Unified Song Generation and Singing Voice Conversion with Accompaniment Co-Generation (2026)
Ziyu Zhang et al.
4.33
DirectAudioEdit: Inversion-Free Text-Guided Audio Editing via Diffusion Prediction Contrast (2026)
Zhengkun Ge et al.
4.33
MaskedFOP: Polyglot Speaker Identification under Missing Visual Modality via Cascaded Graph Label Propagation (2026)
Ayoub Elkhouzari et al.
4.33
EMORSION: Examining the Impact of Audio Parameters on Emotional Responses and Immersion in Film (2026)
Nelly Garcia et al.
4.33
Continuous Audio Thinking for Large Audio Language Models (2026)
Gyojin Han et al.
4.33
Constraining to Generalize: Subspace Tuning for Few-shot Generalization of Audio-Language Models (2026)
Jaehyuk Jang et al.
4.33
SAMA: Semantic Anchor-aligned Augmentation for Unified Low-Resource Multimodal Information Extraction (2026)
Quanjiang Guo et al.
4.33
Reference-Driven Multi-Speaker Audio Scene Generation from In-the-Wild Priors (2026)
Michael Finkelson et al.
4.33
From Sounds to Scenes: A Benchmark for Evaluating Context-Aware Auditory Scene Understanding in Large Audio Language Models (2026)
Pengfei Zhang et al.
4.33
FoleySet: A Multi-Level Human-Annotated Foley Sound Dataset (2026)
Sunshiyu Wang et al.
4.33
Listening Like a Judge: A Music-Aware Framework for Automatic Singing Performance Evaluation (2026)
Neelam Saini et al.
4.33