Multimodal Open-vocabulary Video Classification Via Pre-trained Vision And Language Models
2022 · Rui Qian, Yeqing Li, Zheng Xu, et al.
Abstract
Utilizing vision and language models (VLMs) pre-trained on large-scale image-text pairs is becoming a promising paradigm for open-vocabulary visual recognition. In this work, we extend this paradigm by leveraging motion and audio that naturally exist in video. We present MOV, a simple yet effective method for Multimodal Open-Vocabulary video classification. In MOV, we directly use the vision encoder from pre-trained VLMs with minimal modifications to encode video, optical flow, and audio spectrograms. We design a cross-modal fusion mechanism to aggregate complementary multimodal information. Experiments on Kinetics-700 and VGGSound show that introducing the flow or audio modality brings large performance gains over the pre-trained VLM and existing methods. Specifically, MOV greatly improves the accuracy on base classes while generalizing better on novel classes. MOV achieves state-of-the-art results on UCF and HMDB zero-shot video classification be
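As a rough illustration of the open-vocabulary setup the abstract describes, the sketch below scores a clip against class-name text embeddings by cosine similarity. It is not the MOV implementation: the embeddings are hand-written toy vectors rather than encoder outputs, the class names are hypothetical, and `fuse_average` is a plain averaging stand-in for the cross-modal fusion mechanism, whose details the abstract does not give.

```python
import math

def l2_normalize(v):
    """Scale a vector to unit length (returned unchanged if it is all zeros)."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v] if norm > 0 else list(v)

def fuse_average(embeddings):
    """Stand-in fusion (an assumption, not MOV's mechanism):
    average the L2-normalized per-modality embeddings."""
    normed = [l2_normalize(e) for e in embeddings]
    dim = len(normed[0])
    mean = [sum(e[i] for e in normed) / len(normed) for i in range(dim)]
    return l2_normalize(mean)

def classify(clip_embedding, class_text_embeddings, temperature=0.01):
    """Softmax over temperature-scaled cosine similarities
    between one clip embedding and each class text embedding."""
    clip = l2_normalize(clip_embedding)
    logits = [
        sum(a * b for a, b in zip(l2_normalize(t), clip)) / temperature
        for t in class_text_embeddings
    ]
    top = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(l - top) for l in logits]
    total = sum(exps)
    probs = [e / total for e in exps]
    return probs.index(max(probs)), probs

# Hypothetical class names and toy text embeddings (one row per class).
class_names = ["playing guitar", "dog barking", "swimming"]
class_text = [
    [1.0, 0.0, 0.0, 0.0],
    [0.0, 1.0, 0.0, 0.0],
    [0.0, 0.0, 1.0, 0.0],
]

# Toy per-modality embeddings for a single clip.
video = [0.2, 0.9, 0.1, 0.0]
flow = [0.1, 0.8, 0.3, 0.1]
audio = [0.0, 1.0, 0.0, 0.2]

fused = fuse_average([video, flow, audio])
pred, probs = classify(fused, class_text)
print(class_names[pred])
```

Because the class set enters only through `class_text`, novel classes can be scored by adding rows for their text embeddings, with no change to the clip side.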
Related papers
- Efficient Selective Audio Masked Multimodal Bottleneck Transformer For Audio-video Classification (2024) · 0.00
- Getting The Subtext Without The Text: Scalable Multimodal Sentiment Classification From Visual And Acoustic Modalities (2018) · 7.50
- WAVE: Learning Unified & Versatile Audio-visual Embeddings With Multimodal LLM (2025) · 0.00
- Unified Video-language Pre-training With Synchronized Audio (2024) · 0.00
- Effectively Obtaining Acoustic, Visual And Textual Data From Videos (2025) · 0.00
- Large Language Models Are Strong Audio-visual Speech Recognition Learners (2024) · 9.59
- VITA: Towards Open-source Interactive Omni Multimodal LLM (2024) · 0.00
- Fine-grained Grounding For Multimodal Speech Recognition (2020) · 5.84