Awesome Audio-Visual
Audio-Visual is one of the most active areas in Awesome Multimodal, with 873 papers in this collection, evaluated on datasets such as AudioCaps, InfoSeek, and CMU-MOSI. A strong starting point is "Native Active Perception as Reasoning for Omni-Modal Understanding".
Key papers
- Native Active Perception as Reasoning for Omni-Modal Understanding (2026) · Zhenghao Xing et al. · 11.01
- OmniVideo-100K: A Dataset for Audio-Visual Reasoning through Structured Scripts and Evidence Chains (2026) · Xinyue Cai et al. · 10.59
- Human Motion Video Generation: A Survey (2025) · Haiwei Xue, Xiangyang Luo, Zhanghao Hu, et al. · 10.19
- Show, Tell And Summarize: Dense Video Captioning Using Visual Cue Aided Sentence Summarization (2025) · Zhiwang Zhang, Dong Xu, Wanli Ouyang, et al. · 9.65
- OmniVec2 -- A Novel Transformer Based Network For Large Scale Multimodal And Multitask Learning (2025) · Siddharth Srivastava, Gaurav Sharma · 9.46
- You Don't Need Strong Assumptions: Visual Representation Learning via Temporal Differences (2026) · Ninad Daithankar et al. · 8.46
- MVEB: Massive Video Embedding Benchmark (2026) · Adnan El Assadi et al. · 7.98
- A Context-aware Attention And Graph Neural Network-based Multimodal Framework For Misogyny Detection (2025) · Mohammad Zia Ur Rehman, Sufyaan Zahoor, Areeb Manzoor, et al. · 7.57
- Listening To The Unspoken: Exploring "365" Aspects Of Multimodal Interview Performance Assessment (2025) · Jia Li, Yang Wang, Wenhao Qian, et al. · 6.62
- MeViS: A Multi-modal Dataset For Referring Motion Expression Video Segmentation (2025) · Henghui Ding, Chang Liu, Shuting He, et al. · 6.12
- Rethinking the Mixture of Vision Encoders Paradigm for Enhanced Visual Understanding in Multimodal LLMs (2025) · Mozhgan Nasr Azadani et al. · 5.84
- MM-HSD: Multi-modal Hate Speech Detection In Videos (2025) · Berta Céspedes-Sarrias, Carlos Collado-Capell, Pablo Rodenas-Ruiz, et al. · 5.72
- DRKF: Decoupled Representations With Knowledge Fusion For Multimodal Emotion Recognition (2025) · Peiyuan Jiang, Yao Liu, Qiao Liu, et al. · 5.67
- WDMIR: Wavelet-driven Multimodal Intent Recognition (2025) · Weiyin Gong, Kai Zhang, Yanghai Zhang, et al. · 5.46
- SingaKids: A Multilingual Multimodal Dialogic Tutor For Language Learning (2025) · Zhengyuan Liu, Geyu Lin, Hui Li Tan, et al. · 5.46
- NoteIt: A System Converting Instructional Videos To Interactable Notes Through Multimodal Video Understanding (2025) · Running Zhao, Zhihan Jiang, Xinchen Zhang, et al. · 5.46
- Fine-R1: Make Multi-modal LLMs Excel In Fine-grained Visual Recognition By Chain-of-thought Reasoning (2026) · Hulingxiao He, Zijun Geng, Yuxin Peng · 5.43
- ERNIE 5.0 Technical Report (2026) · Haifeng Wang et al. · 5.36
- Enhanced Multimodal Hate Video Detection Via Channel-wise And Modality-wise Fusion (2025) · Yinghui Zhang, Tailin Chen, Yuchen Zhang, et al. · 5.20
- Grounding Emotion Recognition With Visual Prototypes: VEGA -- Revisiting CLIP In MERC (2025) · Guanyu Hu, Dimitrios Kollias, Xinyu Yang · 5.14
- ImpliHateVid: A Benchmark Dataset And Two-stage Contrastive Learning Framework For Implicit Hate Speech Detection In Videos (2025) · Mohammad Zia Ur Rehman, Anukriti Bhatnagar, Omkar Kabde, et al. · 5.04
- SAM2-LOVE: Segment Anything Model 2 In Language-aided Audio-visual Scenes (2025) · Yuji Wang, Haoran Xu, Yong Liu, et al. · 5.04
- Learning Contrastive Multimodal Fusion With Improved Modality Dropout For Disease Detection And Prediction (2025) · Yi Gu, Kuniaki Saito, Jiaxin Ma · 4.72
- NSF-MAP: Neurosymbolic Multimodal Fusion For Robust And Interpretable Anomaly Prediction In Assembly Pipelines (2025) · Chathurangi Shyalika, Renjith Prasad, Fadi El Kalach, et al. · 4.72
- Engagement Prediction Of Short Videos With Large Multimodal Models (2025) · Wei Sun, Linhan Cao, Yuqin Cao, et al. · 4.67
- Adaptive Markup Language Generation For Contextually-grounded Visual Document Understanding (2025) · Han Xiao, Yina Xie, Guanxin Tan, et al. · 4.67
- Autoregressive Semantic Visual Reconstruction Helps VLMs Understand Better (2025) · Dianyi Wang, Wei Song, Yikun Wang, et al. · 4.56
- Advancing Talking Head Generation: A Comprehensive Survey Of Multi-modal Methodologies, Datasets, Evaluation Metrics, And Loss Functions (2025) · Vineet Kumar Rakesh, Soumya Mazumdar, Research Pratim Maity, et al. · 4.55
- The Perils Of Chart Deception: How Misleading Visualizations Affect Vision-language Models (2025) · Ridwan Mahbub, Mohammed Saidul Islam, Md Tahmid Rahman Laskar, et al. · 4.53
- VLM Can Be A Good Assistant: Enhancing Embodied Visual Tracking With Self-improving Vision-language Models (2025) · Kui Wu, Shuhang Xu, Hao Chen, et al. · 4.53
- Audio Does Matter: Importance-aware Multi-granularity Fusion For Video Moment Retrieval (2025) · Junan Lin, Daizong Liu, Xianke Chen, et al. · 4.51
- FoleyGenEx: Unified Video-to-Audio Generation with Multi-Modal Control, Temporal Alignment, and Semantic Precision (2026) · Shiyao Wang et al. · 4.39
- AudioDER: A Deduplication-Enhanced Reasoning Dataset for Post-Training Large Audio-Language Models (2026) · Hui Geng et al. · 4.39
- Mitigating Visual Hallucinations in Multimodal Systems through Retrieval-Augmented Reliability-Aware Inference (2026) · Pratheswaran Hariharan et al. · 4.39
- Object Tokens as a Bridge Between Segmentation and Visual Question Answering in Robotic Surgery (2026) · Yiping Li et al. · 4.39
- Constraining to Generalize: Subspace Tuning for Few-shot Generalization of Audio-Language Models (2026) · Jaehyuk Jang et al. · 4.39
- Who Wins the Conflict? Mechanistic Interpretability of Text Bias in Audio LLMs (2026) · Hyebin Cho et al. · 4.39
- MixProLAP: Mixture-Induced Uncertainty Modeling for Probabilistic Language-Audio Pretraining (2026) · Yu Nakagome et al. · 4.39
- E3RG: Building Explicit Emotion-driven Empathetic Response Generation System With Multimodal Large Language Model (2025) · Ronghao Lin, Shuai Shen, Weipeng Hu, et al. · 4.34
- Dynamic Interaction-Aware and Causality-Disentangled Framework for Multimodal Sentiment Analysis (2026) · Guangyuan Dong et al. · 4.33
- Does Visual Information Play a Decisive Role in Vision-Language-Action Model Driving Behavior? (2026) · Jingtao He et al. · 4.33
- VisionPulse: Dynamic Visual Sparsity for Efficient Multimodal Reasoning (2026) · Hengbo Xu et al. · 4.33
- MedVR: Annotation-Free Medical Visual Reasoning via Agentic Reinforcement Learning (2026) · Zheng Jiang et al. · 4.25
- Traceable Evidence Enhanced Visual Grounded Reasoning: Evaluation And Methodology (2025) · Haochen Wang, Xiangtai Li, Zilong Huang, et al. · 4.08
- OmniVideo-R1: Reinforcing Audio-visual Reasoning with Query Intention and Modality Attention (2026) · Zhangquan Chen et al. · 4.04
- LLM-based Multimodal Feedback Produces Equivalent Learning and Better Student Perceptions than Educator Feedback (2026) · Chloe Qianhui Zhao et al. · 3.98
- JPS: Jailbreak Multimodal Large Language Models With Collaborative Visual Perturbation And Textual Steering (2025) · Renmiao Chen, Shiyao Cui, Xuancheng Huang, et al. · 3.86
- Active Multimodal Distillation For Few-shot Action Recognition (2025) · Weijia Feng, Yichen Zhu, Ruojia Zhang, et al. · 3.86
- Exploring Machine Learning And Language Models For Multimodal Depression Detection (2025) · Javier Si Zhao Hong, Timothy Zoe Delaya, Sherwyn Chan Yin Kit, et al. · 3.86
- Enabling Chatbots With Eyes And Ears: An Immersive Multimodal Conversation System For Dynamic Interactions (2025) · Jihyoung Jang, Minwook Bae, Minji Kim, et al. · 3.86
- Exploring Object Status Recognition For Recipe Progress Tracking In Non-visual Cooking (2025) · Franklin Mingzhe Li, Kaitlyn Ng, Bin Zhu, et al. · 3.86
- Take That For Me: Multimodal Exophora Resolution With Interactive Questioning For Ambiguous Out-of-view Instructions (2025) · Akira Oyama, Shoichi Hasegawa, Akira Taniguchi, et al. · 3.86
- Fact-checking At Scale: Multimodal AI For Authenticity And Context Verification In Online Media (2025) · Van-Hoang Phan, Tung-Duong Le-Duc, Long-Khanh Pham, et al. · 3.86
- TalkSketch: Multimodal Generative AI For Real-time Sketch Ideation With Speech (2025) · Weiyan Shi, Sunaya Upadhyay, Geraldine Quek, et al. · 3.86
- A Multimodal Deviation Perceiving Framework For Weakly-supervised Temporal Forgery Localization (2025) · Wenbo Xu, Junyan Wu, Wei Lu, et al. · 3.86
- MedCFVQA: A Causal Approach To Mitigate Modality Preference Bias In Medical Visual Question Answering (2025) · Shuchang Ye, Usman Naseem, Mingyuan Meng, et al. · 3.86
- Multimodal Foundation Model-driven User Interest Modeling And Behavior Analysis On Short Video Platforms (2025) · Yushang Zhao, Yike Peng, Li Zhang, et al. · 3.86
- Text-guided Visual Prompt DINO For Generic Segmentation (2025) · Yuchen Guan, Chong Sun, Canmiao Fu, et al. · 3.84
- Character-Centered Dialogue Generation from Scene-Level Prompts (2025) · Taewon Kang et al. · 3.81
- Native Visual Understanding: Resolving Resolution Dilemmas In Vision-language Models (2025) · Junbo Niu, Yuanhong Zheng, Ziyang Miao, et al. · 3.79