ABC: Achieving Better Control Of Multimodal Embeddings Using VLMs
2025 · Benjamin Schneider, Florian Kerschbaum, Wenhu Chen
Abstract
Visual embedding models excel at zero-shot tasks like visual retrieval and classification. However, these models cannot be used for tasks that contain ambiguity or require user instruction. Such tasks call for an embedding model that can use a natural language instruction to control the representation of a visual embedding. Existing CLIP-based approaches embed images and text independently and then fuse the results. We find that this leads to weak interactions between modalities and poor user control over the representation. We introduce ABC, an open-source multimodal embedding model that uses a vision-language model backbone to deeply integrate image features with natural language instructions. ABC achieves best-for-size performance on MSCOCO image-to-text retrieval and is the top-performing model on classification and VQA tasks in the Massive Multimodal Embedding Benchmark. With a strongly unified vision-language representation, ABC can use natural language to solve subtle
Related papers
- VLM2Vec-V2: Advancing Multimodal Embedding For Videos, Images, And Visual Documents (2025) · 0.00
- Breaking The Modality Barrier: Universal Embedding Learning With Multimodal LLMs (2025) · 4.52
- VLM2Vec: Training Vision-language Models For Massive Multimodal Embedding Tasks (2024) · 0.00
- VL-CLIP: Enhancing Multimodal Recommendations Via Visual Grounding And LLM-augmented CLIP Embeddings (2025) · 2.26
- Exploring A Unified Vision-centric Contrastive Alternatives On Multi-modal Web Documents (2025) · 1.69
- VIRTUE: Visual-interactive Text-image Universal Embedder (2025) · 0.00
- Analyzing Diffusion And Autoregressive Vision Language Models In Multimodal Embedding Space (2026) · 0.00
- Magic-MM-Embedding: Towards Visual-token-efficient Universal Multimodal Embedding With MLLMs (2026) · 0.00