MAiVAR-T: Multimodal Audio-image And Video Action Recognizer Using Transformers
2023 · Muhammad Bilal Shaikh, Douglas Chai, Syed Mohammed Shamsul Islam, et al.
Abstract
In line with the human capacity to perceive the world by simultaneously processing and integrating high-dimensional inputs from multiple modalities like vision and audio, we propose a novel model, MAiVAR-T (Multimodal Audio-Image to Video Action Recognition Transformer). This model employs an intuitive approach for the combination of audio-image and video modalities, with a primary aim to escalate the effectiveness of multimodal human action recognition (MHAR). At the core of MAiVAR-T lies the significance of distilling substantial representations from the audio modality and transmuting these into the image domain. Subsequently, this audio-image depiction is fused with the video modality to formulate a unified representation. This concerted approach strives to exploit the contextual richness inherent in both audio and video modalities, thereby promoting action recognition. In contrast to existing state-of-the-art strategies that focus solely on audio or video modalities, MAiVAR-T demon
Related papers
- Efficient Selective Audio Masked Multimodal Bottleneck Transformer For Audio-video Classification (2024) · 0.00
- Transformer-based Video Front-ends For Audio-visual Speech Recognition For Single And Multi-person Video (2022) · 11.39
- Detecting Expressions With Multimodal Transformers (2020) · 10.74
- Multi-resolution Audio-visual Feature Fusion For Temporal Action Localization (2023) · 0.00
- A Better Use Of Audio-visual Cues: Dense Video Captioning With Bi-modal Transformer (2020) · 10.61
- Multi-modal Emotion Recognition By Text, Speech And Video Using Pretrained Transformers (2024) · 0.00
- TMT: A Transformer-based Modal Translator For Improving Multimodal Sequence Representations In Audio Visual Scene-aware Dialog (2020) · 5.24
- Attentive Fusion Enhanced Audio-visual Encoding For Transformer Based Robust Speech Recognition (2020) · 0.00