Vision-language Models Learn Super Images For Efficient Partially Relevant Video Retrieval
2023 · Taichi Nishimura, Shota Nakada, Masayoshi Kondo
Abstract
In this paper, we propose an efficient and high-performance method for partially relevant video retrieval, which aims to retrieve long videos that contain at least one moment relevant to the input text query. The challenge lies in encoding dense frames using visual backbones. This requires models to handle the increased frames, resulting in significant computation costs for long videos. To mitigate the costs, previous studies use lightweight visual backbones, yielding sub-optimal retrieval performance due to their limited capabilities. However, it is undesirable to simply replace the backbones with high-performance large vision-and-language models (VLMs) due to their low efficiency. To address this dilemma, instead of dense frames, we focus on super images, which are created by rearranging the video frames in an \(N \times N\) grid layout. This reduces the number of visual encodings to \(\frac{1}{N^2}\) and mitigates the low efficiency of large VLMs. Based on this idea, we make two
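To make the super-image idea concrete, the following is a minimal NumPy sketch of tiling consecutive frames into \(N \times N\) grids. It is an illustration under stated assumptions, not the authors' implementation: the function name `make_super_images`, the row-major frame order within each grid, and the zero-padding of an incomplete final grid are all hypothetical choices that the abstract does not specify.

```python
import numpy as np


def make_super_images(frames: np.ndarray, n: int) -> np.ndarray:
    """Tile consecutive frames into n x n grids (hypothetical helper).

    frames: array of shape (T, H, W, C)
    returns: array of shape (ceil(T / n^2), n * H, n * W, C)
    """
    t, h, w, c = frames.shape
    per_image = n * n
    num_super = -(-t // per_image)  # ceiling division
    pad = num_super * per_image - t
    if pad:
        # Assumption: fill the last, incomplete grid with black frames.
        filler = np.zeros((pad, h, w, c), dtype=frames.dtype)
        frames = np.concatenate([frames, filler], axis=0)

    # (S, row, col, H, W, C) -> (S, row, H, col, W, C) -> (S, n*H, n*W, C)
    grid = frames.reshape(num_super, n, n, h, w, c)
    grid = grid.transpose(0, 1, 3, 2, 4, 5)
    return grid.reshape(num_super, n * h, n * w, c)


# Toy input: 10 frames of size 4x4, where frame i is filled with the value i.
frames = np.stack([np.full((4, 4, 3), i, dtype=np.uint8) for i in range(10)])
super_images = make_super_images(frames, n=2)
```

With \(N = 2\), the 10 toy frames collapse into 3 grids, so a visual backbone would be called 3 times instead of 10; for long videos the count approaches \(\frac{1}{N^2}\) of the original number of frames.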
Related papers
- Exploiting Local Indexing And Deep Feature Confidence Scores For Fast Image-to-video Search (2018) 2.26
- Efficient Cross-modal Video Retrieval With Meta-optimized Frames (2022) 7.16
- Towards Efficient And Robust Moment Retrieval System: A Unified Framework For Multi-granularity Models And Temporal Reranking (2025) 2.26
- HVD: Human Vision-driven Video Representation Learning For Text-video Retrieval (2026) 0.00
- The VISIONE Video Search System: Exploiting Off-the-shelf Text Search Engines For Large-scale Video Retrieval (2020) 10.74
- Vidvec: Unlocking Video MLLM Embeddings For Video-text Retrieval (2026) 0.00
- Litevl: Efficient Video-language Learning With Enhanced Spatial-temporal Modeling (2022) 6.34
- Verve: Versatile Retrieval For Videos Via Unified Embeddings (2026) 0.00