LUMA-RAG: Lifelong Multimodal Agents With Provably Stable Streaming Alignment
2025 · Rohan Wandre, Yash Gajewar, Namrata Patel, et al.
Abstract
Retrieval-Augmented Generation (RAG) has emerged as the dominant paradigm for grounding large language model outputs in verifiable evidence. However, as modern AI agents transition from static knowledge bases to continuous multimodal streams encompassing text, images, video, and audio, two critical challenges arise: maintaining index freshness without prohibitive re-indexing costs, and preserving cross-modal semantic consistency across heterogeneous embedding spaces. We present LUMA-RAG, a lifelong multimodal agent architecture featuring three key innovations: (i) a streaming, multi-tier memory system that dynamically spills embeddings from a hot HNSW tier to a compressed IVFPQ tier under strict memory budgets; (ii) a streaming CLAP->CLIP alignment bridge that maintains cross-modal consistency through incremental orthogonal Procrustes updates; and (iii) stability-aware retrieval telemetry providing Safe@k guarantees by jointly bounding alignment drift and quantization error. Experiment
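To make innovation (ii) concrete, the following is a minimal sketch of an incremental orthogonal Procrustes update, not the authors' implementation. It assumes that paired source (CLAP-like) and target (CLIP-like) embeddings arrive in batches and share the same dimensionality. The class name `StreamingProcrustes`, the `decay` parameter, and the synthetic data are hypothetical choices made for illustration. The sketch accumulates the cross-covariance between the two spaces and re-derives the orthogonal map from its SVD after each batch.

```python
import numpy as np


class StreamingProcrustes:
    """Hypothetical helper: keeps an orthogonal map W minimizing ||X W - Y||_F."""

    def __init__(self, dim, decay=1.0):
        # decay < 1 down-weights older batches; decay = 1 matches the batch solution.
        self.decay = decay
        self.cross_cov = np.zeros((dim, dim))
        self.rotation = np.eye(dim)

    def update(self, source_batch, target_batch):
        # Accumulate X^T Y over the stream (rows are paired embeddings).
        self.cross_cov = self.decay * self.cross_cov + source_batch.T @ target_batch
        # Orthogonal Procrustes solution: W = U V^T, where U S V^T = SVD(X^T Y).
        u, _, vt = np.linalg.svd(self.cross_cov)
        self.rotation = u @ vt
        return self.rotation

    def transform(self, source):
        # Map source-space embeddings into the target space.
        return source @ self.rotation


# Synthetic check (assumed data): targets are an exact rotation of the sources.
rng = np.random.default_rng(0)
dim = 8
true_rotation, _ = np.linalg.qr(rng.normal(size=(dim, dim)))
bridge = StreamingProcrustes(dim)
for _ in range(5):
    batch = rng.normal(size=(32, dim))
    bridge.update(batch, batch @ true_rotation)
```

Because only a `dim × dim` matrix is stored, the cost of each update does not depend on how many embeddings have been seen so far. How the paper bounds the resulting alignment drift is not recoverable from the abstract and is not modeled here.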