Bridging Video-text Retrieval With Multiple Choice Questions
2022 · Yuying Ge, Yixiao Ge, Xihui Liu, et al.
Abstract
Pre-training a model to learn transferable video-text representations for retrieval has attracted a lot of attention in recent years. Previous dominant works mainly adopt two separate encoders for efficient retrieval, but ignore local associations between videos and texts. Another line of research uses a joint encoder to model interactions between video and text, but is inefficient because each text-video pair must be fed into the model. In this work, we enable fine-grained video-text interactions while maintaining high efficiency for retrieval via a novel pretext task, dubbed Multiple Choice Questions (MCQ), in which a parametric module, BridgeFormer, is trained to answer "questions" constructed from the text features by resorting to the video features. Specifically, we exploit the rich semantics of text (i.e., nouns and verbs) to build questions, with which the video encoder can be trained to capture more regional content and temporal dynamics. In the form of questions and answers, the
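The efficiency contrast drawn in the abstract, between two separate encoders and a joint encoder, can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: `embed` is a toy bag-of-letters stand-in for a learned encoder, the "videos" are plain strings standing in for clips, and `CallCounter` is a hypothetical helper that counts model invocations. It is not the paper's architecture and does not implement MCQ or BridgeFormer.

```python
import math


def embed(item):
    """Toy stand-in for an encoder: L2-normalized bag-of-letters vector."""
    vec = [0.0] * 26
    for ch in item.lower():
        if "a" <= ch <= "z":
            vec[ord(ch) - ord("a")] += 1.0
    norm = math.sqrt(sum(v * v for v in vec)) or 1.0
    return [v / norm for v in vec]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


class CallCounter:
    """Hypothetical wrapper that counts how often a model is invoked."""

    def __init__(self, fn):
        self.fn = fn
        self.calls = 0

    def __call__(self, *args):
        self.calls += 1
        return self.fn(*args)


def dual_encoder_scores(texts, videos, text_enc, video_enc):
    # Each text and each video is encoded exactly once: N + M encoder calls.
    text_embs = [text_enc(t) for t in texts]
    video_embs = [video_enc(v) for v in videos]
    # Scoring all pairs is then only cheap dot products.
    return [[dot(t, v) for v in video_embs] for t in text_embs]


def joint_encoder_scores(texts, videos, joint_model):
    # Every text-video pair goes through the model: N * M model calls.
    return [[joint_model(t, v) for v in videos] for t in texts]


texts = ["a dog runs", "a man cooks", "a girl sings"]
# Assumption: strings stand in for video clips in this toy example.
videos = ["dog running in park", "man cooking pasta",
          "girl singing on stage", "cat sleeping"]

text_enc = CallCounter(embed)
video_enc = CallCounter(embed)
# Toy "joint" model: it must see both inputs together on every call.
joint_model = CallCounter(lambda t, v: dot(embed(t), embed(v)))

dual = dual_encoder_scores(texts, videos, text_enc, video_enc)
joint = joint_encoder_scores(texts, videos, joint_model)

dual_calls = text_enc.calls + video_enc.calls  # 3 + 4
joint_calls = joint_model.calls                # 3 * 4
print(dual_calls, joint_calls)
```

With N texts and M videos, the separate encoders run N + M times while the joint model runs N × M times, which is why the gap widens quickly as the retrieval corpus grows.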
Related papers
- Prompt Switch: Efficient CLIP Adaptation For Text-video Retrieval (2023)
- Multi-query Video Retrieval (2022)
- Bridging Information Asymmetry In Text-video Retrieval: A Data-centric Approach (2024)
- BRIDGE: Multimodal-to-text Retrieval Via Reinforcement-learned Query Alignment (2026)
- Mv-adapter: Multimodal Video Transfer Learning For Video Text Retrieval (2023)
- Towards Fast Adaptation Of Pretrained Contrastive Models For Multi-channel Video-language Retrieval (2022)
- Simple Baselines For Interactive Video Retrieval With Questions And Answers (2023)
- Memory Enhanced Embedding Learning For Cross-modal Video-text Retrieval (2021)