Reinpool: Reinforcement Learning Pooling Multi-vector Embeddings For Retrieval System
2026 · Sungguk Cha, Dongwook Kim, Mintae Kim, et al.
Abstract
Multi-vector embedding models have emerged as a powerful paradigm for document retrieval, preserving fine-grained visual and textual details through token-level representations. However, this expressiveness comes at a staggering cost: storing an embedding for every token inflates index sizes by over \(1000\times\) compared to single-vector approaches, severely limiting scalability. We introduce ReinPool, a reinforcement learning framework that learns to dynamically filter and pool multi-vector embeddings into compact, retrieval-optimized representations. By training with an inverse retrieval objective and NDCG-based rewards, ReinPool identifies and retains only the most discriminative vectors without requiring manual importance annotations. On the ViDoRe V2 benchmark, across three vision-language embedding models, ReinPool compresses multi-vector representations by \(746\times\) to \(1249\times\) into single vectors while recovering 76–81% of full multi-vector retrieval performance.
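The abstract does not specify ReinPool's policy architecture. What follows is therefore only a minimal sketch of the general idea (RL-trained filtering and pooling with an NDCG reward), not the authors' method. It makes several assumptions. A hypothetical linear scorer `w` assigns each token vector a Bernoulli keep probability. The kept vectors are mean-pooled into one unit-norm vector, and `w` is updated with REINFORCE using NDCG@k as the reward. "Inverse retrieval" is read here as ranking a pool of candidate query vectors against the pooled document vector. That reading, the single-vector queries, the moving-average baseline, and the names `ndcg_at_k`, `pool`, and `reinforce_step` are all illustrative assumptions. If the pooled vector has the same dimensionality as each token vector, replacing n token vectors with one shrinks storage by about n×. Under that assumption, ratios like those above correspond to pages that yield several hundred to over a thousand token vectors.

```python
import numpy as np

rng = np.random.default_rng(0)


def ndcg_at_k(scores, relevant, k=5):
    """NDCG@k with binary relevance; `relevant` is a set of candidate indices."""
    order = np.argsort(-scores)[:k]
    dcg = sum(1.0 / np.log2(r + 2) for r, i in enumerate(order) if int(i) in relevant)
    idcg = sum(1.0 / np.log2(r + 2) for r in range(min(len(relevant), k)))
    return dcg / idcg if idcg > 0 else 0.0


def pool(tokens, mask):
    """Mean-pool the kept token vectors into one unit-norm vector."""
    kept = tokens[mask.astype(bool)]
    if len(kept) == 0:  # empty mask: fall back to pooling every token
        kept = tokens
    v = kept.mean(axis=0)
    return v / np.linalg.norm(v)


def reinforce_step(w, tokens, queries, relevant, baseline, lr=0.05, k=5):
    """One REINFORCE update of the hypothetical linear keep-scorer `w`."""
    logits = np.clip(tokens @ w, -30.0, 30.0)  # one keep-logit per token vector
    p = 1.0 / (1.0 + np.exp(-logits))          # Bernoulli keep probabilities
    mask = (rng.random(len(p)) < p).astype(float)
    doc_vec = pool(tokens, mask)
    # Assumed "inverse retrieval": rank candidate queries for this one document.
    reward = ndcg_at_k(queries @ doc_vec, relevant, k)
    # Gradient of log pi(mask) w.r.t. w: sum_i (m_i - p_i) * x_i.
    grad_logp = ((mask - p)[:, None] * tokens).sum(axis=0)
    w = w + lr * (reward - baseline) * grad_logp
    baseline = 0.9 * baseline + 0.1 * reward   # exponential moving-average baseline
    return w, baseline, reward, mask


# Toy document: 4 token vectors aligned with query 0, the rest noise.
d, n_tokens, n_queries = 16, 64, 20
queries = rng.normal(size=(n_queries, d))
queries /= np.linalg.norm(queries, axis=1, keepdims=True)
signal = queries[0] + 0.1 * rng.normal(size=(4, d))
noise = 0.5 * rng.normal(size=(n_tokens - 4, d))
tokens = np.vstack([signal, noise])

w, baseline = np.zeros(d), 0.0
for step in range(300):
    w, baseline, reward, mask = reinforce_step(w, tokens, queries, {0}, baseline)

print(f"kept {int(mask.sum())}/{n_tokens} tokens, last NDCG@5 reward = {reward:.2f}")
print(f"storage: {n_tokens} vectors -> 1 vector ({n_tokens}x smaller)")
```

The policy gradient relies on the identity \(\partial \log \mathrm{Bernoulli}(m_i \mid p_i) / \partial z_i = m_i - p_i\) for a keep-logit \(z_i = x_i \cdot w\). Because the baseline depends only on past rewards, subtracting it lowers gradient variance without biasing the update.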
Related papers
- Reducing The Footprint Of Multi-vector Retrieval With Minimal Performance Impact Via Token Pooling (2024)
- Investigating Multi-layer Representations For Dense Passage Retrieval (2025)
- Verve: Versatile Retrieval For Videos Via Unified Embeddings (2026)
- Nemotron Colembed V2: Top-performing Late Interaction Embedding Models For Visual Document Retrieval (2026)
- Image Retrieval Using Multi-scale CNN Features Pooling (2020)
- Docpruner: A Storage-efficient Framework For Multi-vector Visual Document Retrieval Via Adaptive Patch-level Embedding Pruning (2025)
- REMAP: Multi-layer Entropy-guided Pooling Of Dense CNN Features For Image Retrieval (2019)
- MURE: Hierarchical Multi-resolution Encoding Via Vision-language Models For Visual Document Retrieval (2026)