Abstract

Universal multimodal embedding models have achieved great success in capturing semantic relevance between queries and candidates. However, current methods either condense queries and candidates into a single vector, potentially limiting the expressiveness for fine-grained information, or produce too many vectors that are prohibitive for multi-vector retrieval. In this work, we introduce MetaEmbed, a new framework for multimodal retrieval that rethinks how multimodal embeddings are constructed and interacted with at scale. During training, a fixed number of learnable Meta Tokens are appended to the input sequence. At test-time, their last-layer contextualized representations serve as compact yet expressive multi-vector embeddings. Through the proposed Matryoshka Multi-Vector Retrieval training, MetaEmbed learns to organize information by granularity across multiple vectors. As a result, we enable test-time scaling in multimodal retrieval where users can balance retrieval quality against

Authors

(none)

Tags

  • Image Retrieval

Stats

  • citations0
  • S2 citationsβ€”
  • github stars14
  • HF likes0
  • heat score2.35
  • arxiv keyxiao2025metaembed

Related papers

Metaembed: Scaling Multimodal Retrieval At Test-time With Flexible Late Interaction β€” learning-to-hash