Learning To Retrieve Videos By Asking Questions
2022 · Avinash Madasu, Junier Oliva, Gedas Bertasius
Abstract
The majority of traditional text-to-video retrieval systems operate in static environments, i.e., there is no interaction between the user and the agent beyond the initial textual query provided by the user. This can be sub-optimal if the initial query has ambiguities, which would lead to many falsely retrieved videos. To overcome this limitation, we propose a novel framework for Video Retrieval using Dialog (ViReD), which enables the user to interact with an AI agent via multiple rounds of dialog, where the user refines retrieved results by answering questions generated by an AI agent. Our novel multimodal question generator learns to ask questions that maximize the subsequent video retrieval performance using (i) the video candidates retrieved during the last round of interaction with the user and (ii) the text-based dialog history documenting all previous interactions, to generate questions that incorporate both visual and linguistic cues relevant to video retrieval. Furthermore, to
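The interaction loop described in the abstract (retrieve candidates, ask a question conditioned on those candidates and the dialog history, fold the answer back into the query) can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: videos are represented by short captions, the encoder is a bag-of-words vector, the question generator is a keyword heuristic rather than the learned multimodal model the abstract describes, and the user is simulated from a known target video. The names `VIDEOS`, `embed`, `retrieve`, `ask_question`, and `dialog_retrieval` are hypothetical.

```python
import math
from collections import Counter

# Hypothetical toy corpus: each video is stood in for by a caption.
VIDEOS = {
    "v1": "a man cooking pasta in a kitchen",
    "v2": "a man cooking steak on a grill outdoors",
    "v3": "a dog running on a beach",
}

def embed(text):
    """Bag-of-words vector; a stand-in for a learned text/video encoder."""
    return Counter(text.lower().split())

def cosine(a, b):
    dot = sum(a[t] * b[t] for t in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def retrieve(history, videos, k=2):
    """Rank videos against the whole dialog history and return the top k ids."""
    query = embed(" ".join(history))
    ranked = sorted(videos, key=lambda vid: cosine(query, embed(videos[vid])),
                    reverse=True)
    return ranked[:k]

def ask_question(history, candidates, videos, asked):
    """Heuristic question generator (assumption, not the paper's model):
    pick a term that appears in some but not all current candidates and
    that the dialog has not covered yet."""
    seen = set(" ".join(history).lower().split()) | asked
    counts = Counter()
    for vid in candidates:
        counts.update(set(videos[vid].lower().split()) - seen)
    options = sorted(t for t, c in counts.items() if c < len(candidates))
    return options[0] if options else None

def dialog_retrieval(initial_query, videos, target_id, rounds=2, k=2):
    """Run a few question/answer rounds with a simulated user who knows the target."""
    history = [initial_query]
    asked = set()
    for _ in range(rounds):
        candidates = retrieve(history, videos, k)
        term = ask_question(history, candidates, videos, asked)
        if term is None:
            break
        asked.add(term)
        # Simulated answer: the user confirms the term only if the target shows it.
        if term in videos[target_id].lower().split():
            history.append(term)
    return retrieve(history, videos, k), history

if __name__ == "__main__":
    print(retrieve(["a man cooking"], VIDEOS))
    print(dialog_retrieval("a man cooking", VIDEOS, target_id="v2"))
```

In this toy setup the ambiguous query "a man cooking" matches both cooking videos, and a single confirmed answer is enough to move the intended video to the top of the ranking. A real system would replace each heuristic with a trained component.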
Related papers
- Interactive Video Retrieval With Dialog (2019)
- Simple Baselines For Interactive Video Retrieval With Questions And Answers (2023)
- V-agent: An Interactive Video Search System Using Vision-language Models (2025)
- Dialog-based Interactive Image Retrieval (2018)
- Reading-strategy Inspired Visual Representation Learning For Text-to-video Retrieval (2022)
- Towards Efficient And Robust Moment Retrieval System: A Unified Framework For Multi-granularity Models And Temporal Reranking (2025)
- Use What You Have: Video Retrieval Using Representations From Collaborative Experts (2019)
- Verve: Versatile Retrieval For Videos Via Unified Embeddings (2026)