Towards Fast Adaptation Of Pretrained Contrastive Models For Multi-channel Video-language Retrieval
2022 · Xudong Lin, Simran Tiwari, Shiyuan Huang, et al.
Abstract
Multi-channel video-language retrieval requires models to understand information from different channels (e.g., video+question, video+speech) to correctly link a video with a textual response or query. Fortunately, contrastive multimodal models such as CLIP have been shown to be highly effective at aligning entities in images/videos and text, and text contrastive models such as SimCSE have recently been studied extensively for their strong ability to produce discriminative sentence embeddings. However, there is no clear way to quickly adapt these two lines of work to multi-channel video-language retrieval with limited data and resources. In this paper, we identify a principled model design space with two axes: how to represent videos and how to fuse video and text information. Based on a categorization of recent methods, we investigate the options of representing videos using continuous feature vectors or discrete text tokens; for the fusion method, we explore the use of a multimodal transformer or a p
Related papers
- X-CLIP: End-to-end Multi-grained Contrastive Learning For Video-text Retrieval (2022) 18.12
- Prompt Switch: Efficient CLIP Adaptation For Text-video Retrieval (2023) 11.93
- Multi-modal Transformer For Video Retrieval (2020) 19.47
- MV-Adapter: Multimodal Video Transfer Learning For Video Text Retrieval (2023) 9.76
- Everything At Once -- Multi-modal Fusion Transformer For Video Retrieval (2021) 15.78
- Improving Video-text Retrieval By Multi-stream Corpus Alignment And Dual Softmax Loss (2021) 0.00
- LAT: Latent Translation With Cycle-consistency For Video-text Retrieval (2022) 0.00
- CLIP-ViP: Adapting Pre-trained Image-text Model To Video-language Representation Alignment (2022) 5.42