Leveraging Retrieval-augmented Tags For Large Vision-language Understanding In Complex Scenes
2024 · Antonio Carlos Rivera, Anthony Moore, Steven Robinson
Abstract
Object-aware reasoning in vision-language tasks poses significant challenges for current models, particularly in handling unseen objects, reducing hallucinations, and capturing fine-grained relationships in complex visual scenes. To address these limitations, we propose the Vision-Aware Retrieval-Augmented Prompting (VRAP) framework, a generative approach that enhances Large Vision-Language Models (LVLMs) by integrating retrieval-augmented object tags into their prompts. VRAP introduces a novel pipeline where structured tags, including objects, attributes, and relationships, are extracted using pretrained visual encoders and scene graph parsers. These tags are enriched with external knowledge and incorporated into the LLM's input, enabling detailed and accurate reasoning. We evaluate VRAP across multiple vision-language benchmarks, including VQAv2, GQA, VizWiz, and COCO, achieving state-of-the-art performance in fine-grained reasoning and multimodal understanding. Additionally, our abl
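The tag-to-prompt step described in the abstract can be illustrated with a minimal sketch. It assumes the tags have already been extracted upstream and that external knowledge is a simple lookup table. The names Tag, KNOWLEDGE, enrich, and build_prompt and the prompt layout are hypothetical and are not taken from the paper.

```python
from dataclasses import dataclass, field


@dataclass
class Tag:
    """One structured tag: an object with optional attributes and relations."""
    name: str
    attributes: list[str] = field(default_factory=list)
    # Each relation is (predicate, other object), e.g. ("held by", "woman").
    relations: list[tuple[str, str]] = field(default_factory=list)


# Hypothetical stand-in for an external knowledge source (assumption).
KNOWLEDGE = {
    "umbrella": "a handheld canopy used for protection from rain or sun",
}


def enrich(tag: Tag, knowledge: dict[str, str]) -> str:
    """Render one tag as text, appending retrieved knowledge when available."""
    parts = [" ".join(tag.attributes + [tag.name])]
    parts += [f"{pred} {obj}" for pred, obj in tag.relations]
    line = ", ".join(parts)
    fact = knowledge.get(tag.name)
    return f"{line} ({fact})" if fact else line


def build_prompt(question: str, tags: list[Tag], knowledge: dict[str, str]) -> str:
    """Place the enriched tags ahead of the question in the model's text input."""
    tag_lines = "\n".join(f"- {enrich(t, knowledge)}" for t in tags)
    return f"Tags:\n{tag_lines}\nQuestion: {question}\nAnswer:"


tags = [Tag("umbrella", ["red"], [("held by", "woman")]), Tag("woman")]
prompt = build_prompt("What is the woman holding?", tags, KNOWLEDGE)
print(prompt)
```

In a real system the tags would come from a visual encoder and a scene graph parser, and the lookup would be replaced by an actual retrieval step. The sketch only shows how objects, attributes, and relationships might be serialized into a prompt.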
Related papers
- RAVEN: Multitask Retrieval Augmented Vision-language Learning (2024) 0.00
- Alleviating Hallucination In Large Vision-language Models With Active Retrieval Augmentation (2024) 7.16
- Pitl: Cross-modal Retrieval With Weakly-supervised Vision-language Pre-training Via Prompting (2023) 7.16
- Leveraging Large Vision-language Model As User Intent-aware Encoder For Composed Image Retrieval (2024) 3.58
- R4: Retrieval-augmented Reasoning For Vision-language Models In 4D Spatio-temporal Space (2025) 0.00
- Open-world 3D Scene Graph Generation For Retrieval-augmented Reasoning (2025) 0.00
- Reminding Multimodal Large Language Models Of Object-aware Knowledge With Retrieved Tags (2024) 0.00
- Object Retrieval For Visual Question Answering With Outside Knowledge (2024) 0.00