VIRTUE: Visual-Interactive Text-Image Universal Embedder
2025 · Wei-Yao Wang, Kazuya Tateishi, Qiyu Wu, et al.
Abstract
Multimodal representation learning models have performed well across complex tasks, and the integration of vision-language models (VLMs) has further given embedding models instruction-following capabilities. However, existing embedding models lack visual-interactive capabilities that let users specify regions of interest (e.g., point, bounding box, mask), which have been explored in generative models to broaden their human-interactive applicability. Equipping embedding models with visual interactions would not only unlock new applications with localized grounding of user intent, which remains unexplored, but also enable the models to learn entity-level information within images to complement their global representations for conventional embedding tasks. In this paper, we propose a novel Visual-InteRactive Text-Image Universal Embedder (VIRTUE) that extends the capabilities of the segmentation model and the vision-language model to the realm of representation le
Related papers
- Vlm2vec: Training Vision-language Models For Massive Multimodal Embedding Tasks (2024)
- VISTA: Visualized Text Embedding For Universal Multi-modal Retrieval (2024)
- Vlm2vec-v2: Advancing Multimodal Embedding For Videos, Images, And Visual Documents (2025)
- ABC: Achieving Better Control Of Multimodal Embeddings Using Vlms (2025)
- Vidvec: Unlocking Video MLLM Embeddings For Video-text Retrieval (2026)
- Vlm2geovec: Toward Universal Multimodal Embeddings For Remote Sensing (2025)
- OLIVE: Object Level In-context Visual Embeddings (2024)
- Give: Guiding Visual Encoder To Perceive Overlooked Information (2024)