GranAlign: Granularity-aware Alignment Framework For Zero-shot Video Moment Retrieval
2026 · Mingyu Jeon, Sunjae Yoon, Jonghee Kim, et al.
Abstract
Zero-shot video moment retrieval (ZVMR) is the task of localizing a temporal moment within an untrimmed video using a natural language query without relying on task-specific training data. The primary challenge in this setting lies in the mismatch in semantic granularity between textual queries and visual content. Previous studies in ZVMR have attempted to achieve alignment by leveraging high-quality pre-trained knowledge that represents video and language in a joint space. However, these approaches failed to balance the semantic granularity between the pre-trained knowledge provided by each modality for a given scene. As a result, despite the high quality of each modality's representations, the mismatch in granularity led to inaccurate retrieval. In this paper, we propose a training-free framework, called Granularity-Aware Alignment (GranAlign), that bridges this gap between coarse and fine semantic representations. Our approach introduces two complementary techniques: granularity-bas
Related papers
- VLANet: Video-language Alignment Network For Weakly-supervised Video Moment Retrieval (2020) · 13.28
- X-aligner: Composed Visual Retrieval Without The Bells And Whistles (2026) · 0.00
- Subject-aware Multi-granularity Alignment For Zero-shot EEG-to-image Retrieval (2026) · 0.00
- Context-enhanced Video Moment Retrieval With Large Language Models (2024) · 5.84
- Towards Efficient And Robust Moment Retrieval System: A Unified Framework For Multi-granularity Models And Temporal Reranking (2025) · 2.26
- Towards Balanced Alignment: Modal-enhanced Semantic Modeling For Video Moment Retrieval (2023) · 14.33
- Unicvr: From Alignment To Reranking For Unified Zero-shot Composed Visual Retrieval (2026) · 0.00
- Coarse To Fine: Video Retrieval Before Moment Localization (2021) · 0.00