Reading-strategy Inspired Visual Representation Learning For Text-to-video Retrieval
2022 · Jianfeng Dong, Yabing Wang, Xianke Chen, et al.
Abstract
This paper addresses text-to-video retrieval: given a query in the form of a natural-language sentence, the task is to retrieve videos that are semantically relevant to the query from a large collection of unlabeled videos. The success of this task depends on cross-modal representation learning that projects both videos and sentences into common spaces for semantic similarity computation. In this work, we concentrate on video representation learning, an essential component of text-to-video retrieval. Inspired by the reading strategy of humans, we propose Reading-strategy Inspired Visual Representation Learning (RIVRL) to represent videos, which consists of two branches: a previewing branch and an intensive-reading branch. The previewing branch is designed to briefly capture the overview information of videos, while the intensive-reading branch is designed to obtain more in-depth information. Moreover, the intensive-reading branch is aware of the video overview c
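The common-space similarity computation mentioned in the abstract can be illustrated with a minimal sketch. It assumes the query and the videos have already been projected into one shared space and that cosine similarity is the scoring function; the abstract specifies neither, and it does not describe how RIVRL produces the embeddings. The names `cosine` and `rank_videos` and the toy vectors are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # treat a zero vector as unrelated to everything
    return dot / (norm_u * norm_v)


def rank_videos(query_vec, video_vecs):
    """Return video ids sorted by descending similarity to the query."""
    scores = {vid: cosine(query_vec, vec) for vid, vec in video_vecs.items()}
    return sorted(scores, key=scores.get, reverse=True)


# Hypothetical toy embeddings, assumed to live in the common space already.
query = [1.0, 0.0]
videos = {"v1": [0.0, 1.0], "v2": [2.0, 0.1], "v3": [1.0, 1.0]}
ranking = rank_videos(query, videos)
```

Here `ranking` lists the video whose embedding points most nearly in the query's direction first. In a real system the toy vectors would be replaced by the outputs of the learned sentence and video encoders.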
Related papers
- Ambiguity-restrained Text-video Representation Learning For Partially Relevant Video Retrieval (2025) 5.84
- HVD: Human Vision-driven Video Representation Learning For Text-video Retrieval (2026) 0.00
- Learning Audio-guided Video Representation With Gated Attention For Video-text Retrieval (2025) 5.24
- Verve: Versatile Retrieval For Videos Via Unified Embeddings (2026) 0.00
- Learning To Retrieve Videos By Asking Questions (2022) 8.82
- Use What You Have: Video Retrieval Using Representations From Collaborative Experts (2019) 0.00
- Dual Encoding For Video Retrieval By Text (2020) 16.05
- Tree-augmented Cross-modal Encoding For Complex-query Video Retrieval (2020) 15.57