GiVE: Guiding Visual Encoder to Perceive Overlooked Information
2024 · Junjie Li, Jianghong Ma, Xiaofeng Zhang, et al.
Abstract
Multimodal Large Language Models have advanced AI in applications like text-to-video generation and visual question answering. These models rely on visual encoders to convert non-text data into vectors, but current encoders either lack semantic alignment or overlook non-salient objects. We propose the Guiding Visual Encoder to Perceive Overlooked Information (GiVE) approach. GiVE enhances visual representation with an Attention-Guided Adapter (AG-Adapter) module and an Object-focused Visual Semantic Learning module. These incorporate three novel loss terms: Object-focused Image-Text Contrast (OITC) loss, Object-focused Image-Image Contrast (OIIC) loss, and Object-focused Image Discrimination (OID) loss, improving object consideration, retrieval accuracy, and comprehensiveness. Our contributions include dynamic visual focus adjustment, novel loss functions to enhance object retrieval, and the Multi-Object Instruction (MOInst) dataset. Experiments show our approach achieves state-of-the-
Related papers
- Steerable Visual Representations (2026) · 0.00
- Finevit: Progressively Unlocking Fine-grained Perception With Dense Recaptions (2026) · 0.00
- OLIVE: Object Level In-context Visual Embeddings (2024) · 0.00
- Perception Encoder: The Best Visual Embeddings Are Not At The Output Of The Network (2025) · 6.71
- Come-vl: Scaling Complementary Multi-encoder Vision-language Learning (2026) · 0.00
- Learning The Visualness Of Text Using Large Vision-language Models (2023) · 4.52
- VIRTUE: Visual-interactive Text-image Universal Embedder (2025) · 0.00
- Combating Visual Neglect And Semantic Drift In Large Multimodal Models For Enhanced Cross-modal Retrieval (2026) · 0.00