Propy: Building Interactive Prompt Pyramids Upon CLIP For Partially Relevant Video Retrieval
2025 Β· Yi Pan, Yujia Zhang, Michael Kampffmeyer, et al.
Abstract
Partially Relevant Video Retrieval (PRVR) is a practical yet challenging task that involves retrieving videos based on queries relevant to only specific segments. While existing works follow the paradigm of developing models to process unimodal features, powerful pretrained vision-language models like CLIP remain underexplored in this field. To bridge this gap, we propose ProPy, a model with systematic architectural adaption of CLIP specifically designed for PRVR. Drawing insights from the semantic relevance of multi-granularity events, ProPy introduces two key innovations: (1) A Prompt Pyramid structure that organizes event prompts to capture semantics at multiple granularity levels, and (2) An Ancestor-Descendant Interaction Mechanism built on the pyramid that enables dynamic semantic interaction among events. With these designs, ProPy achieves SOTA performance on three public datasets, outperforming previous models by significant margins. Code is available at https://github.com/BUAA
Authors
(none)
Tags
Stats
Related papers
- PRVR: Partially Relevant Video Retrieval (2022)2.26
- Uneven Event Modeling For Partially Relevant Video Retrieval (2025)1.40
- Prototypes Are Balanced Units For Efficient And Effective Partially Relevant Video Retrieval (2025)0.00
- Prompt Switch: Efficient CLIP Adaptation For Text-video Retrieval (2023)11.93
- Hlformer: Enhancing Partially Relevant Video Retrieval With Hyperbolic Learning (2025)3.50
- Ambiguity-restrained Text-video Representation Learning For Partially Relevant Video Retrieval (2025)5.84
- Imagine Before Concentration: Diffusion-guided Registers Enhance Partially Relevant Video Retrieval (2026)3.80
- Delving Deeper: Hierarchical Visual Perception For Robust Video-text Retrieval (2026)1.24