Towards Universal Video Retrieval: Generalizing Video Embedding Via Synthesized Multimodal Pyramid Curriculum
2025 · Zhuoning Guo, Mingxin Li, Yanzhao Zhang, et al.
Abstract
The prevailing video retrieval paradigm is structurally misaligned, as narrow benchmarks incentivize correspondingly limited data and single-task training. Therefore, universal capability is suppressed due to the absence of a diagnostic evaluation that defines and demands multi-dimensional generalization. To break this cycle, we introduce a framework built on the co-design of evaluation, data, and modeling. First, we establish the Universal Video Retrieval Benchmark (UVRB), a suite of 16 datasets designed not only to measure performance but also to diagnose critical capability gaps across tasks and domains. Second, guided by UVRB's diagnostics, we introduce a scalable synthesis workflow that generates 1.55 million high-quality pairs to populate the semantic space required for universality. Finally, we devise the Modality Pyramid, a curriculum that trains our General Video Embedder (GVE) by explicitly leveraging the latent interconnections within our diverse data. Extensive experiments
Related papers
- Verve: Versatile Retrieval For Videos Via Unified Embeddings (2026) 0.00
- Modality-balanced Embedding For Video Retrieval (2022) 7.16
- GME: Improving Universal Multimodal Retrieval By Multimodal LLMs (2024) 0.00
- MUVR: A Multi-modal Untrimmed Video Retrieval Benchmark With Multi-level Visual Correspondence (2025) 1.40
- MDMMT-2: Multidomain Multimodal Transformer For Video Retrieval, One More Step Towards Generalization (2022) 0.00
- Universal Vision-language Dense Retrieval: Learning A Unified Representation Space For Multi-modal Retrieval (2022) 3.45
- Dual Encoding For Video Retrieval By Text (2020) 16.05
- MegaPairs: Massive Data Synthesis For Universal Multimodal Retrieval (2024) 3.58