LoVR: A Benchmark For Long Video Retrieval In Multimodal Contexts
2025 · Qifeng Cai, Hao Liang, Zhaoyang Han, et al.
Abstract
Long videos contain a vast amount of information, making video-text retrieval an essential and challenging task in multimodal learning. However, existing benchmarks suffer from limited video duration, low-quality captions, and coarse annotation granularity, which hinder the evaluation of advanced video-text retrieval methods. To address these limitations, we introduce LoVR, a benchmark specifically designed for long video-text retrieval. LoVR contains 467 long videos and over 40,804 fine-grained clips with high-quality captions. To overcome the issue of poor machine-generated annotations, we propose an efficient caption generation framework that integrates VLM automatic generation, caption quality scoring, and dynamic refinement. This pipeline improves annotation accuracy while maintaining scalability. Furthermore, we introduce a semantic fusion method to generate coherent full-video captions without losing important contextual information. Our benchmark introduces longer videos, more
Related papers
- MomentSeeker: A Task-oriented Benchmark For Long-video Moment Retrieval (2025)
- Multimodal Lengthy Videos Retrieval Framework And Evaluation Metric (2025)
- MUVR: A Multi-modal Untrimmed Video Retrieval Benchmark With Multi-level Visual Correspondence (2025)
- SALOVA: Segment-augmented Long Video Assistant For Targeted Retrieval And Routing In Long-form Video Analysis (2024)
- Verve: Versatile Retrieval For Videos Via Unified Embeddings (2026)
- MultiVENT 2.0: A Massive Multilingual Benchmark For Event-centric Video Retrieval (2024)
- Sovabench: A Vehicle Surveillance Action Retrieval Benchmark For Multimodal Large Language Models (2026)
- LOVO: Efficient Complex Object Query In Large-scale Video Datasets (2025)