Developing Visual Augmented Q&A System Using Scalable Vision Embedding Retrieval & Late Interaction Re-ranker
2025 · Rachna Saxena, Abhijeet Kumar, Suresh Shanmugam
Abstract
Traditional information extraction systems built on text-only language models struggle because they ignore infographics (visual elements of information) such as tables, charts, and images, which are often used to convey complex information to readers. Multimodal LLMs (MLLMs) face the needle-in-a-haystack problem when the search space involves either a long context or a substantial number of documents. Late interaction mechanisms over visual language models have shown state-of-the-art performance in retrieval-based vision-augmented Q&A tasks. Yet a few challenges remain in using them for RAG-based multimodal Q&A. First, many popular and widely adopted vector databases do not natively support multi-vector retrieval. Second, late interaction requires computation that inflates the space footprint and can hinder enterprise adoption. Lastly, the current state of late interaction mechanisms does not leverage approximate neighbor search indexing methods for large speed-ups in retrieval
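For readers unfamiliar with the term, late interaction scoring as popularized by ColBERT-style retrievers can be sketched as follows. This is an illustrative sketch only, not the authors' implementation: the function names (`maxsim_score`, `rerank`), the page ids, and the toy 2-dimensional vectors are assumptions made for the example. Each page is represented by many patch vectors rather than one, which is why native multi-vector support and storage footprint matter.

```python
from typing import Dict, List, Sequence

Vector = Sequence[float]


def dot(a: Vector, b: Vector) -> float:
    """Plain dot product between two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def maxsim_score(query_vecs: Sequence[Vector], page_vecs: Sequence[Vector]) -> float:
    """Late-interaction (MaxSim) score.

    For every query token vector, keep the similarity of its best-matching
    page patch vector, then sum those maxima over the query tokens.
    """
    return sum(max(dot(q, p) for p in page_vecs) for q in query_vecs)


def rerank(query_vecs: Sequence[Vector],
           candidates: Dict[str, Sequence[Vector]]) -> List[str]:
    """Order candidate page ids by descending MaxSim score."""
    scores = {pid: maxsim_score(query_vecs, vecs) for pid, vecs in candidates.items()}
    return sorted(scores, key=lambda pid: scores[pid], reverse=True)


# Toy data (hypothetical): two query token vectors, two candidate pages.
query = [[1.0, 0.0], [0.0, 1.0]]
pages = {
    "page_a": [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]],  # three patch vectors
    "page_b": [[0.5, 0.5]],                          # one patch vector
}
ranking = rerank(query, pages)
```

Here `page_a` scores 2.0 and `page_b` scores 1.0, so `ranking` is `["page_a", "page_b"]`. Because every candidate must be scored against all of its patch vectors, a scoring step like this is typically applied to a short candidate list rather than to the whole corpus.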
Related papers
- Fine-grained Late-interaction Multi-modal Retrieval For Retrieval Augmented Visual Question Answering (2023)
- Object Retrieval For Visual Question Answering With Outside Knowledge (2024)
- OMGM: Orchestrate Multiple Granularities And Modalities For Efficient Multimodal Retrieval (2025)
- An Interactive Multi-modal Query Answering System With Retrieval-augmented Large Language Models (2024)
- Visual-RAG: Benchmarking Text-to-image Retrieval Augmented Generation For Visual Knowledge Intensive Queries (2025)
- Enhancing Interactive Image Retrieval With Query Rewriting Using Large Language Models And Vision Language Models (2024)
- Enhancing Question Answering Precision With Optimized Vector Retrieval And Instructions (2024)
- Index Light, Reason Deep: Deferred Visual Ingestion For Visual-dense Document Question Answering (2026)