Abstract

In current text-to-video retrieval (T2VR), videos to be retrieved have been properly trimmed so that a correspondence between the videos and ad-hoc textual queries naturally exists. In practice, however, videos circulated on the Internet and social media platforms, while relatively short, are typically rich in content. Often, multiple scenes / actions / events are shown in a single video, leading to a more challenging T2VR setting wherein only part of the video content is relevant w.r.t. a given query. This paper presents a first study on this setting, which we term Partially Relevant Video Retrieval (PRVR). Considering that a video typically consists of multiple moments, a video is regarded as partially relevant w.r.t. a given query if it contains a query-related moment. We formulate the PRVR task as a multiple instance learning problem, and propose a Multi-Scale Similarity Learning (MS-SL++) network that jointly learns both clip-scale and frame-scale similarities to det
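The multiple instance learning view in the abstract can be illustrated with a minimal sketch: a video is treated as a bag of candidate moments, and the video is scored by its best-matching moment rather than by its average content. Everything below is an assumption made for illustration and is not the MS-SL++ network itself: the function names, the mean pooling of frame features into clips, the sliding-window sizes, the cosine similarity, and the max aggregation are all hypothetical choices.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def mean_pool(frames):
    """Average a list of frame feature vectors into a single vector."""
    n = len(frames)
    return [sum(column) / n for column in zip(*frames)]


def candidate_clips(frames, window_sizes):
    """Build the 'bag' of moments: mean-pooled sliding windows at several lengths.

    Assumption: sliding windows with stride 1; the paper may construct clips differently.
    """
    clips = []
    for w in window_sizes:
        for start in range(0, len(frames) - w + 1):
            clips.append(mean_pool(frames[start:start + w]))
    return clips


def video_query_score(frames, query, window_sizes=(1, 2)):
    """MIL-style score: the video is as relevant as its most relevant clip.

    Assumption: max aggregation over clip similarities; returns 0.0 for an empty bag.
    """
    clips = candidate_clips(frames, window_sizes)
    return max((cosine(clip, query) for clip in clips), default=0.0)


# Toy 2-D features: the first two frames show one event, the last frame another.
frames = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0]]
query = [0.0, 1.0]  # matches only the last frame

partial_score = video_query_score(frames, query)
whole_video_score = cosine(mean_pool(frames), query)
irrelevant_score = video_query_score([[1.0, 0.0], [1.0, 0.0]], query)
```

In this toy case the bag-level score is driven by the single matching frame, while the score computed from the mean-pooled whole video is diluted by the unrelated frames, which is the motivation for scoring at the moment level.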

Authors

(none)

Tags

  • Image Retrieval

Stats

  • citations: 1
  • S2 citations: -
  • github stars: 0
  • HF likes: 0
  • heat score: 2.26
  • arxiv key: chen2022prvr

Related papers