ShotFinder: Imagination-driven Open-domain Video Shot Retrieval Via Web Search
2026 · Tao Yu, Haopeng Jin, Hao Wang, et al.
Abstract
In recent years, large language models (LLMs) have made rapid progress in information retrieval, yet existing research has mainly focused on text or static multimodal settings. Open-domain video shot retrieval, which involves richer temporal structure and more complex semantics, still lacks systematic benchmarks and analysis. To fill this gap, we introduce ShotFinder, a benchmark that formalizes editing requirements as keyframe-oriented shot descriptions and introduces five types of controllable single-factor constraints: Temporal order, Color, Visual style, Audio, and Resolution. We curate 1,210 high-quality samples from YouTube across 20 thematic categories, using large models for generation with human verification. Based on the benchmark, we propose ShotFinder, a text-driven three-stage retrieval and localization pipeline: (1) query expansion via video imagination, (2) candidate video retrieval with a search engine, and (3) description-guided temporal localization. Experiments on mu
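The three stages named in the abstract can be sketched as a small pipeline. This is a minimal illustration under stated assumptions, not the authors' implementation: the names `Shot`, `find_shot`, `imagine`, `search`, `segment`, and `token_overlap` are hypothetical, the search engine and the imagination model are passed in as plain callables, and a simple word-overlap score stands in for whatever description-guided localization model the paper actually uses.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, List, Optional


@dataclass
class Shot:
    video_id: str
    start: float   # shot start time in seconds
    end: float     # shot end time in seconds
    caption: str   # text description of the shot's keyframe


def token_overlap(a: str, b: str) -> float:
    """Jaccard overlap of lowercase word sets (placeholder scorer, an assumption)."""
    sa, sb = set(a.lower().split()), set(b.lower().split())
    union = sa | sb
    return len(sa & sb) / len(union) if union else 0.0


def find_shot(
    description: str,
    imagine: Callable[[str], List[str]],       # hypothetical: description -> extra queries
    search: Callable[[str], Iterable[str]],    # hypothetical: query -> candidate video ids
    segment: Callable[[str], Iterable[Shot]],  # hypothetical: video id -> captioned shots
    score: Callable[[str, str], float] = token_overlap,
) -> Optional[Shot]:
    # Stage 1: query expansion. Keep the original description and add imagined variants.
    queries = [description] + list(imagine(description))

    # Stage 2: candidate retrieval. Merge results and drop duplicates, keeping order.
    candidates = list(dict.fromkeys(v for q in queries for v in search(q)))

    # Stage 3: temporal localization. Return the best-scoring shot across candidates.
    best, best_score = None, float("-inf")
    for video_id in candidates:
        for shot in segment(video_id):
            s = score(description, shot.caption)
            if s > best_score:
                best, best_score = shot, s
    return best


# Toy stand-ins so the sketch runs without a real search engine or model.
TOY_INDEX = {
    "v1": [Shot("v1", 0.0, 4.0, "a dog runs on a beach"),
           Shot("v1", 4.0, 9.0, "a red car at night")],
    "v2": [Shot("v2", 0.0, 3.0, "city skyline at sunset")],
}


def toy_imagine(description: str) -> List[str]:
    return [description + " footage"]


def toy_search(query: str) -> List[str]:
    return ["v1", "v2"]


def toy_segment(video_id: str) -> List[Shot]:
    return TOY_INDEX[video_id]


result = find_shot("a red car at night", toy_imagine, toy_search, toy_segment)
```

Passing the three stages in as callables keeps the sketch independent of any particular search engine or model; in a real system each would wrap an external service, and the five constraint types listed in the abstract would need their own checks, which this sketch does not model.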
Related papers
- MomentSeeker: A Task-oriented Benchmark For Long-video Moment Retrieval (2025) [0.00]
- The VISIONE Video Search System: Exploiting Off-the-shelf Text Search Engines For Large-scale Video Retrieval (2020) [10.74]
- Few Shots Text To Image Retrieval: New Benchmarking Dataset And Optimization Methods (2026) [0.00]
- Use What You Have: Video Retrieval Using Representations From Collaborative Experts (2019) [0.00]
- Towards Efficient And Robust Moment Retrieval System: A Unified Framework For Multi-granularity Models And Temporal Reranking (2025) [2.26]
- Exploiting Local Indexing And Deep Feature Confidence Scores For Fast Image-to-video Search (2018) [2.26]
- Multimodal Contextualized Support For Enhancing Video Retrieval System (2026) [0.00]
- Shotit: Compute-efficient Image-to-video Search Engine For The Cloud (2024) [0.00]