Generative Recall, Dense Reranking: Learning Multi-view Semantic IDs For Efficient Text-to-Video Retrieval
2026 · Zecheng Zhao, Zhi Chen, Zi Huang, et al.
Abstract
Text-to-Video Retrieval (TVR) is essential in video platforms. Dense retrieval with dual-modality encoders leads in accuracy, but its computation and storage scale poorly with corpus size. Thus, real-time large-scale applications adopt two-stage retrieval, where a fast recall model gathers a small candidate pool, which is then reranked by an advanced dense retriever. Because the candidate pool is far smaller than the corpus, the reranking model can use any off-the-shelf dense retriever without hurting efficiency, meaning the recall model bounds two-stage TVR performance. Generative retrieval (GR), a recent alternative, replaces dense video embeddings with discrete semantic IDs and retrieves by decoding text queries into ID tokens. GR offers near-constant inference and storage complexity, and its semantic IDs capture high-level video features via quantization, making it ideal for quickly eliminating irrelevant candidates during recall. However, as a recall model in two-stage TVR, GR suffers from (i) semantic ambiguity, where eac
Related papers
- T2vindexer: A Generative Video Indexer For Efficient Text-video Retrieval (2024) 8.24
- Generative Retrieval As Multi-vector Dense Retrieval (2024) 8.60
- UATVR: Uncertainty-adaptive Text-video Retrieval (2023) 15.46
- Dual Encoding For Video Retrieval By Text (2020) 16.05
- Towards Efficient And Robust Moment Retrieval System: A Unified Framework For Multi-granularity Models And Temporal Reranking (2025) 2.26
- Boosting Video-text Retrieval With Explicit High-level Semantics (2022) 7.50
- Use What You Have: Video Retrieval Using Representations From Collaborative Experts (2019) 0.00
- Tree-augmented Cross-modal Encoding For Complex-query Video Retrieval (2020) 15.57