Coordinated Joint Multimodal Embeddings For Generalized Audio-visual Zeroshot Classification And Retrieval Of Videos
2019 · Kranti Kumar Parida, Neeraj Matiyali, Tanaya Guha, et al.
Abstract
We present an audio-visual multimodal approach for the task of zero-shot learning (ZSL) for classification and retrieval of videos. ZSL has been studied extensively in the recent past but has primarily been limited to the visual modality and to images. We demonstrate that both audio and visual modalities are important for ZSL for videos. Since a dataset to study the task is currently not available, we also construct an appropriate multimodal dataset with 33 classes containing 156,416 videos, from an existing large-scale audio event dataset. We empirically show that performance improves when the audio modality is added, for both zero-shot classification and retrieval, when using multimodal extensions of embedding learning methods. We also propose a novel method to predict the 'dominant' modality using a jointly learned modality attention network. We learn the attention in a semi-supervised setting and thus do not require any additional explicit labelling for the modalities. We provide qua
Related papers
- Multimodal Clustering Networks For Self-supervised Learning From Unlabeled Videos (2021)
- Cross-modal Embeddings For Video And Audio Retrieval (2018)
- Rzenembed: Towards Comprehensive Multimodal Retrieval (2025)
- Modality-aware Representation Learning For Zero-shot Sketch-based Image Retrieval (2024)
- Vlm2vec-v2: Advancing Multimodal Embedding For Videos, Images, And Visual Documents (2025)
- WAVE: Learning Unified & Versatile Audio-visual Embeddings With Multimodal LLM (2025)
- Deep Latent Space Learning For Cross-modal Mapping Of Audio And Visual Signals (2019)
- Video And Audio Are Images: A Cross-modal Mixer For Original Data On Video-audio Retrieval (2023)