Hierarchical Video-moment Retrieval And Step-captioning
2023 Β· Abhay Zala, Jaemin Cho, Satwik Kottur, et al.
Abstract
There is growing interest in searching for information from large video corpora. Prior works have studied relevant tasks, such as text-based video retrieval, moment retrieval, video summarization, and video captioning in isolation, without an end-to-end setup that can jointly search from video corpora and generate summaries. Such an end-to-end setup would allow for many interesting applications, e.g., a text-based search that finds a relevant video from a video corpus, extracts the most relevant moment from that video, and segments the moment into important steps with captions. To address this, we present the HiREST (HIerarchical REtrieval and STep-captioning) dataset and propose a new benchmark that covers hierarchical information retrieval and visual/textual stepwise summarization from an instructional video corpus. HiREST consists of 3.4K text-video pairs from an instructional video dataset, where 1.1K videos have annotations of moment spans relevant to text query and breakdown of e
Authors
(none)
Tags
Stats
Related papers
- Query-centric Audio-visual Cognition Network For Moment Retrieval, Segmentation And Step-captioning (2024)3.58
- Towards Efficient And Robust Moment Retrieval System: A Unified Framework For Multi-granularity Models And Temporal Reranking (2025)2.26
- Viseret: A Simple Yet Effective Approach To Moment Retrieval Via Fine-grained Video Segmentation (2021)0.00
- A Lightweight Moment Retrieval System With Global Re-ranking And Robust Adaptive Bidirectional Temporal Search (2025)3.58
- Tencent Text-video Retrieval: Hierarchical Cross-modal Interactions With Multi-level Representations (2022)7.81
- Delving Deeper: Hierarchical Visual Perception For Robust Video-text Retrieval (2026)1.24
- Momentseeker: A Task-oriented Benchmark For Long-video Moment Retrieval (2025)0.00
- Lovr: A Benchmark For Long Video Retrieval In Multimodal Contexts (2025)0.00