Multimodal Audio
50 papers tagged Multimodal Audio (ordered by heat_score)
Papers
- WavCaps: A ChatGPT-assisted Weakly-labelled Audio Captioning Dataset For Audio-language Multimodal Research (2023). Xinhao Mei, Chutong Meng, Haohe Liu, et al. Heat score: 20.69
- Large-scale Contrastive Language-audio Pretraining With Feature Fusion And Keyword-to-caption Augmentation (2022). Yusong Wu, Ke Chen, Tianyu Zhang, et al. Heat score: 19.60
- An Overview Of Deep-learning-based Audio-visual Speech Enhancement And Separation (2020). Daniel Michelsanti, Zheng-Hua Tan, Shi-Xiong Zhang, et al. Heat score: 18.31
- Multimodal Speech Emotion Recognition Using Audio And Text (2018). Seunghyun Yoon, Seokhyun Byun, Kyomin Jung. Heat score: 18.02
- A Joint Cross-attention Model For Audio-visual Fusion In Dimensional Emotion Recognition (2022). R. Gnana Praveen, Wheidima Carneiro de Melo, Nasib Ullah, et al. Heat score: 18.00
- Audio-visual Speech Enhancement Using Multimodal Deep Convolutional Neural Networks (2017). Jen-Cheng Hou, Syu-Siang Wang, Ying-Hui Lai, et al. Heat score: 17.39
- Multimodal Transformer Networks For End-to-end Video-grounded Dialogue Systems (2019). Hung Le, Doyen Sahoo, Nancy F. Chen, et al. Heat score: 17.12
- MIntRec: A New Dataset For Multimodal Intent Recognition (2022). Hanlei Zhang, Hua Xu, Xin Wang, et al. Heat score: 17.08
- Attention-based Audio-visual Fusion For Robust Automatic Speech Recognition (2018). George Sterpu, Christian Saam, Naomi Harte. Heat score: 16.67
- End-to-end Generative Pretraining For Multimodal Video Captioning (2022). Paul Hongsuck Seo, Arsha Nagrani, Anurag Arnab, et al. Heat score: 15.85
- Learning Alignment For Multimodal Emotion Recognition From Speech (2019). Haiyang Xu, Hui Zhang, Kun Han, et al. Heat score: 15.22
- Hierarchical Multimodal Transformer To Summarize Videos (2021). Bin Zhao, Maoguo Gong, Xuelong Li. Heat score: 14.69
- Time Domain Audio Visual Speech Separation (2019). Jian Wu, Yong Xu, Shi-Xiong Zhang, et al. Heat score: 14.62
- Contextual Audio-visual Switching For Speech Enhancement In Real-world Environments (2018). Ahsan Adeel, Mandar Gogate, Amir Hussain. Heat score: 14.35
- Token-level Contrastive Learning With Modality-aware Prompting For Multimodal Intent Recognition (2023). Qianrui Zhou, Hua Xu, Hao Li, et al. Heat score: 14.17
- Information Fusion In Attention Networks Using Adaptive And Multi-level Factorized Bilinear Pooling For Audio-visual Emotion Recognition (2021). Hengshun Zhou, Jun Du, Yuanyuan Zhang, et al. Heat score: 13.97
- Jointly Fine-tuning "BERT-like" Self Supervised Models To Improve Multimodal Speech Emotion Recognition (2020). Shamane Siriwardhana, Andrew Reis, Rivindu Weerasekera, et al. Heat score: 13.74
- Key-sparse Transformer For Multimodal Speech Emotion Recognition (2021). Weidong Chen, Xiaofeng Xing, Xiangmin Xu, et al. Heat score: 13.50
- Multimodal Machine Translation Through Visuals And Speech (2019). Umut Sulubacak, Ozan Caglayan, Stig-Arne Grönroos, et al. Heat score: 12.68
- Learning Audio-video Modalities From Image Captions (2022). Arsha Nagrani, Paul Hongsuck Seo, Bryan Seybold, et al. Heat score: 12.54
- Audio-visual Speech Separation And Dereverberation With A Two-stage Multimodal Network (2019). Ke Tan, Yong Xu, Shi-Xiong Zhang, et al. Heat score: 12.47
- VX2TEXT: End-to-end Learning Of Video-based Text Generation From Multimodal Inputs (2021). Xudong Lin, Gedas Bertasius, Jue Wang, et al. Heat score: 12.17
- AVGZSLNet: Audio-visual Generalized Zero-shot Learning By Reconstructing Label Features From Multi-modal Embeddings (2020). Pratik Mazumder, Pravendra Singh, Kranti Kumar Parida, et al. Heat score: 12.10
- Interpretable Multimodal Emotion Recognition Using Hybrid Fusion Of Speech And Image Data (2022). Puneet Kumar, Sarthak Malik, Balasubramanian Raman. Heat score: 11.85
- Video-based Cross-modal Auxiliary Network For Multimodal Sentiment Analysis (2022). Rongfei Chen, Wenju Zhou, Yang Li, et al. Heat score: 11.76
- Improved Lite Audio-visual Speech Enhancement (2020). Shang-Yi Chuang, Hsin-Min Wang, Yu Tsao. Heat score: 11.39
- Group Gated Fusion On Attention-based Bidirectional Alignment For Multimodal Emotion Recognition (2022). Pengfei Liu, Kun Li, Helen Meng. Heat score: 11.39
- Recursive Joint Cross-modal Attention For Multimodal Fusion In Dimensional Emotion Recognition (2024). R. Gnana Praveen, Jahangir Alam. Heat score: 11.39
- Temporal Working Memory: Query-guided Segment Refinement For Enhanced Multimodal Understanding (2025). Xingjian Diao, Chunhui Zhang, Weiyi Wu, et al. Heat score: 11.33
- Leveraging Unimodal Self-supervised Learning For Multimodal Audio-visual Speech Recognition (2022). Xichen Pan, Peiyu Chen, Yichen Gong, et al. Heat score: 11.29
- Cross-modal Embeddings For Video And Audio Retrieval (2018). Didac Surís, Amanda Duarte, Amaia Salvador, et al. Heat score: 11.08
- MMCosine: Multi-modal Cosine Loss Towards Balanced Audio-visual Fine-grained Learning (2023). Ruize Xu, Ruoxuan Feng, Shi-Xiong Zhang, et al. Heat score: 10.97
- Detecting Expressions With Multimodal Transformers (2020). Srinivas Parthasarathy, Shiva Sundaram. Heat score: 10.74
- Auto-ACD: A Large-scale Dataset For Audio-language Representation Learning (2023). Luoyi Sun, Xuenan Xu, Mengyue Wu, et al. Heat score: 10.74
- VALOR: Vision-audio-language Omni-perception Pretraining Model And Dataset (2023). Jing Liu, Sihan Chen, Xingjian He, et al. Heat score: 10.61
- Cross-modal Prompts: Adapting Large Pre-trained Models For Audio-visual Downstream Tasks (2023). Haoyi Duan, Yan Xia, Mingze Zhou, et al. Heat score: 10.48
- Enriching Multimodal Sentiment Analysis Through Textual Emotional Descriptions Of Visual-audio Content (2024). Sheng Wu, Xiaobao Wang, Longbiao Wang, et al. Heat score: 10.48
- MM-Narrator: Narrating Long-form Videos With Multimodal In-context Learning (2023). Chaoyi Zhang, Kevin Lin, Zhengyuan Yang, et al. Heat score: 10.35
- Contrastive Conditional Latent Diffusion For Audio-visual Segmentation (2023). Yuxin Mao, Jing Zhang, Mochu Xiang, et al. Heat score: 10.29
- MMER: Multimodal Multi-task Learning For Speech Emotion Recognition (2022). Sreyan Ghosh, Utkarsh Tyagi, S Ramaneswaran, et al. Heat score: 10.07
- Learning Music Audio Representations Via Weak Language Supervision (2021). Ilaria Manco, Emmanouil Benetos, Elio Quinton, et al. Heat score: 10.07
- Data-efficient Multimodal Fusion On A Single GPU (2023). Noël Vouitsis, Zhaoyan Liu, Satya Krishna Gorti, et al. Heat score: 10.02
- Interpretability For Multimodal Emotion Recognition Using Concept Activation Vectors (2022). Ashish Ramayee Asokan, Nidarshan Kumar, Anirudh Venkata Ragam, et al. Heat score: 9.76
- Multimodal Speech Emotion Recognition Using Cross Attention With Aligned Audio And Text (2022). Yoonhyung Lee, Seunghyun Yoon, Kyomin Jung. Heat score: 9.76
- Large Language Models Are Strong Audio-visual Speech Recognition Learners (2024). Umberto Cappellazzo, Minsu Kim, Honglie Chen, et al. Heat score: 9.59
- MusCaps: Generating Captions For Music Audio (2021). Ilaria Manco, Emmanouil Benetos, Elio Quinton, et al. Heat score: 9.59
- Multimodal Semi-supervised Learning Framework For Punctuation Prediction In Conversational Speech (2020). Monica Sunkara, Srikanth Ronanki, Dhanush Bekal, et al. Heat score: 9.59
- A Study Of Dropout-induced Modality Bias On Robustness To Missing Video Frames For Audio-visual Speech Recognition (2024). Yusheng Dai, Hang Chen, Jun Du, et al. Heat score: 9.50
- Multimodal One-shot Learning Of Speech And Images (2018). Ryan Eloff, Herman A. Engelbrecht, Herman Kamper. Heat score: 9.03
- Improving Multimodal Speech Recognition By Data Augmentation And Speech Representations (2022). Dan Oneata, Horia Cucu. Heat score: 9.03