Efficient Cross-modal Video Retrieval With Meta-optimized Frames
2022 · Ning Han, Xun Yang, Ee-Peng Lim, et al.
Abstract
Cross-modal video retrieval aims to retrieve semantically relevant videos given a text query, and is one of the fundamental tasks in multimedia. Most top-performing methods rely primarily on the Visual Transformer (ViT) to extract video features [1, 2, 3], and therefore suffer from ViT's high computational complexity, especially when encoding long videos. A common and simple solution is to uniformly sample a small number of frames (say, 4 or 8) from the video, instead of using the whole video, as input to ViT. The number of frames strongly influences the performance of ViT: for example, using 8 frames performs better than using 4 frames yet needs more computational resources, which results in a trade-off. To escape this trade-off, this paper introduces an automatic video compression method based on a bilevel optimization program (BOP) consisting of both model-level (i.e., base-level) and frame-level (i.e., meta-level) optimizations. The model-level learns a cross-modal video retrieval m
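The uniform sampling baseline mentioned in the abstract can be illustrated with a minimal sketch. This is not the paper's method or code: the function name `uniform_frame_indices` and the choice of taking the middle frame of each equal-length segment are assumptions made here for illustration only.

```python
from typing import List


def uniform_frame_indices(num_frames: int, num_samples: int) -> List[int]:
    """Return frame indices spread evenly over a video.

    Hypothetical helper: splits the video into `num_samples` equal
    segments and picks the middle frame of each segment.
    """
    if num_frames <= 0 or num_samples <= 0:
        raise ValueError("num_frames and num_samples must be positive")
    # A video shorter than the budget is used in full.
    if num_samples >= num_frames:
        return list(range(num_frames))
    segment = num_frames / num_samples
    return [int(segment * (i + 0.5)) for i in range(num_samples)]


# Sampling 4 vs. 8 frames from a 32-frame clip.
four = uniform_frame_indices(32, 4)
eight = uniform_frame_indices(32, 8)
```

If each sampled frame is encoded independently, the encoder cost grows roughly in proportion to the number of indices returned, which is the accuracy versus compute trade-off the abstract describes.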
Related papers
- Multi-modal Transformer For Video Retrieval (2020) 19.47
- Hybrid Contrastive Quantization For Efficient Cross-view Video Retrieval (2022) 9.28
- Towards Efficient And Robust Moment Retrieval System: A Unified Framework For Multi-granularity Models And Temporal Reranking (2025) 2.26
- Multimodal Contextualized Support For Enhancing Video Retrieval System (2026) 0.00
- Modality-balanced Embedding For Video Retrieval (2022) 7.16
- Vision-language Models Learn Super Images For Efficient Partially Relevant Video Retrieval (2023) 3.58
- Prompt Switch: Efficient CLIP Adaptation For Text-video Retrieval (2023) 11.93
- Vop: Text-video Co-operative Prompt Tuning For Cross-modal Retrieval (2022) 16.41