VISTA: Visualized Text Embedding For Universal Multi-modal Retrieval
2024 · Junjie Zhou, Zheng Liu, Shitao Xiao, et al.
Abstract
Multi-modal retrieval is becoming increasingly popular in practice. However, existing retrievers are mostly text-oriented and lack the capability to process visual information. Despite the presence of vision-language models like CLIP, current methods are severely limited in representing text-only and image-only data. In this work, we present VISTA, a new embedding model for universal multi-modal retrieval. Our work makes three technical contributions. First, we introduce a flexible architecture that extends a powerful text encoder with image understanding capability by introducing visual token embeddings. Second, we develop two data generation strategies that produce high-quality composed image-text data to facilitate the training of the embedding model. Third, we introduce a multi-stage training algorithm, which first aligns the visual token embedding with the text encoder using massive weakly labeled data, and then develops multi-modal representation capabi
Related papers
- Vista: Vision And Scene Text Aggregation For Cross-modal Retrieval (2022) 14.31
- Universal Vision-language Dense Retrieval: Learning A Unified Representation Space For Multi-modal Retrieval (2022) 3.45
- Vector Embedding Of Multi-modal Texts: A Tool For Discovery? (2025) 0.00
- VIRTUE: Visual-interactive Text-image Universal Embedder (2025) 0.00
- Verve: Versatile Retrieval For Videos Via Unified Embeddings (2026) 0.00
- Evo-retriever: LLM-guided Curriculum Evolution With Viewpoint-pathway Collaboration For Multimodal Document Retrieval (2026) 0.00
- MARVEL: Unlocking The Multi-modal Capability Of Dense Retrieval Via Visual Module Plugin (2023) 9.04
- Look, Imagine And Match: Improving Textual-visual Cross-modal Retrieval With Generative Models (2017) 18.52