MetaEmbed: Scaling Multimodal Retrieval at Test-Time with Flexible Late Interaction
2025 · Zilin Xiao, Qi Ma, Mengting Gu, et al.
Abstract
Universal multimodal embedding models have achieved great success in capturing semantic relevance between queries and candidates. However, current methods either condense queries and candidates into a single vector, which can limit expressiveness for fine-grained information, or produce so many vectors that multi-vector retrieval becomes prohibitively expensive. In this work, we introduce MetaEmbed, a new framework for multimodal retrieval that rethinks how multimodal embeddings are constructed and interacted with at scale. During training, a fixed number of learnable Meta Tokens are appended to the input sequence. At test time, their last-layer contextualized representations serve as compact yet expressive multi-vector embeddings. Through the proposed Matryoshka Multi-Vector Retrieval training, MetaEmbed learns to organize information by granularity across multiple vectors. As a result, we enable test-time scaling in multimodal retrieval where users can balance retrieval quality against
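To make the test-time trade-off concrete, the following is a minimal sketch of late-interaction scoring over a variable number of vectors. It assumes a ColBERT-style MaxSim score (sum over query vectors of the best cosine match among candidate vectors) and assumes that a smaller budget simply means using a prefix of each vector list. The abstract specifies neither, and the names `late_interaction_score`, `n_query`, and `n_cand` are hypothetical, so this should be read as an illustration rather than the authors' implementation.

```python
import math


def normalize(vec):
    """Scale a vector to unit L2 norm so dot products are cosine similarities."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def late_interaction_score(query_vecs, cand_vecs, n_query=None, n_cand=None):
    """MaxSim-style late interaction over a prefix of each vector list.

    Assumption (not from the source): vectors are ordered coarse-to-fine, so
    keeping the first n_query / n_cand vectors is a valid smaller budget.
    Passing None keeps all vectors.
    """
    q = [normalize(v) for v in query_vecs[:n_query]]
    c = [normalize(v) for v in cand_vecs[:n_cand]]
    # For each query vector, take its best match among the candidate vectors,
    # then sum those maxima.
    return sum(
        max(sum(a * b for a, b in zip(qv, cv)) for cv in c)
        for qv in q
    )


# Toy stand-ins for Meta Token embeddings (three orthogonal vectors).
toy_vecs = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]

full_score = late_interaction_score(toy_vecs, toy_vecs)        # all vectors
cheap_score = late_interaction_score(toy_vecs, toy_vecs, 1, 1)  # one vector each
```

Under these assumptions, scoring costs `n_query * n_cand` dot products per query-candidate pair, so shrinking either budget reduces compute and index size at the price of a coarser match.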
Related papers
- Magic-MM-Embedding: Towards Visual-Token-Efficient Universal Multimodal Embedding with MLLMs (2026)
- RzenEmbed: Towards Comprehensive Multimodal Retrieval (2025)
- Modality Curation: Building Universal Embeddings for Advanced Multimodal Information Retrieval (2025)
- Reason to Contrast: A Cascaded Multimodal Retrieval Framework (2025)
- MM-Embed: Universal Multimodal Retrieval with Multimodal LLMs (2024)
- U-MARVEL: Unveiling Key Factors for Universal Multimodal Retrieval via Embedding Learning with MLLMs (2025)
- Unified Interactive Multimodal Moment Retrieval via Cascaded Embedding-Reranking and Temporal-Aware Score Fusion (2025)
- MAGMaR Shared Task System Description: Video Retrieval with OmniEmbed (2025)