Vision-language Models Learn Super Images For Efficient Partially Relevant Video Retrieval

Abstract

In this paper, we propose an efficient and high-performance method for partially relevant video retrieval, which aims to retrieve long videos that contain at least one moment relevant to the input text query. The challenge lies in encoding dense frames with visual backbones: models must process a large number of frames, which incurs significant computation costs for long videos. To mitigate these costs, previous studies use lightweight visual backbones, which yield sub-optimal retrieval performance because of their limited capabilities. However, simply replacing these backbones with high-performance large vision-and-language models (VLMs) is undesirable because of their low efficiency. To address this dilemma, instead of dense frames, we focus on super images, which are created by rearranging the video frames in an \(N \times N\) grid layout. This reduces the number of visual encodings to \(\frac{1}{N^2}\) of the original count and mitigates the low efficiency of large VLMs. Based on this idea, we make two
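
The grid rearrangement described in the abstract can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the function name `make_super_images`, the row-major frame order, and the zero-padding of an incomplete final grid are all assumptions made for illustration.

```python
import numpy as np


def make_super_images(frames: np.ndarray, n: int) -> np.ndarray:
    """Tile video frames into n x n grid images (hypothetical helper).

    frames: array of shape (T, H, W, C).
    Returns an array of shape (ceil(T / n^2), n * H, n * W, C).
    """
    t, h, w, c = frames.shape
    per_grid = n * n
    num_grids = -(-t // per_grid)  # ceiling division

    # Assumption: fill an incomplete final grid with black (zero) frames.
    pad = num_grids * per_grid - t
    if pad:
        filler = np.zeros((pad, h, w, c), dtype=frames.dtype)
        frames = np.concatenate([frames, filler], axis=0)

    # (G, n, n, H, W, C) -> (G, n, H, n, W, C) -> (G, n*H, n*W, C)
    # Frame k of a group lands at grid row k // n, grid column k % n.
    grids = frames.reshape(num_grids, n, n, h, w, c)
    grids = grids.transpose(0, 1, 3, 2, 4, 5)
    return grids.reshape(num_grids, n * h, n * w, c)


# 32 dummy frames tiled into 3 x 3 grids.
video = np.zeros((32, 224, 224, 3), dtype=np.uint8)
super_images = make_super_images(video, n=3)
print(super_images.shape)  # (4, 672, 672, 3)
```

In this example the visual backbone would encode 4 super images instead of 32 individual frames, which is roughly the \(\frac{1}{N^2}\) reduction with \(N = 3\) (the last grid is only partly filled).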
