LLandMark: A Multi-agent Framework For Landmark-aware Multimodal Interactive Video Retrieval
2026 · Minh-Chi Phung, Thien-Bao Le, Cam-Tu Tran-Thi, et al.
Abstract
The increasing diversity and scale of video data demand retrieval systems capable of multimodal understanding, adaptive reasoning, and domain-specific knowledge integration. This paper presents LLandMark, a modular multi-agent framework for landmark-aware multimodal video retrieval to handle real-world complex queries. The framework features specialized agents that collaborate across four stages: query parsing and planning, landmark reasoning, multimodal retrieval, and reranked answer synthesis. A key component, the Landmark Knowledge Agent, detects cultural or spatial landmarks and reformulates them into descriptive visual prompts, enhancing CLIP-based semantic matching for Vietnamese scenes. To expand capabilities, we introduce an LLM-assisted image-to-image pipeline, where a large language model (Gemini 2.5 Flash) autonomously detects landmarks, generates image search queries, retrieves representative images, and performs CLIP-based visual similarity matching, removing the need for
Related papers
- V-agent: An Interactive Video Search System Using Vision-language Models (2025)
- Clamr: Contextualized Late-interaction For Multimodal Content Retrieval (2025)
- Verve: Versatile Retrieval For Videos Via Unified Embeddings (2026)
- Effective Multi-query Expansions: Collaborative Deep Networks For Robust Landmark Retrieval (2017)
- Who Can We Trust? Scope-aware Video Moment Retrieval With Multi-agent Conflict (2025)
- MERLIN: Multimodal Embedding Refinement Via LLM-based Iterative Navigation For Text-video Retrieval-rerank Pipeline (2024)
- Context-enhanced Video Moment Retrieval With Large Language Models (2024)
- MAGNET: A Multi-agent Framework For Finding Audio-visual Needles By Reasoning Over Multi-video Haystacks (2025)