V-Agent: An Interactive Video Search System Using Vision-Language Models
2025 · Sunyoung Park, Jong-Hyeon Lee, Youngjune Kim, et al.
Abstract
We introduce V-Agent, a novel multi-agent platform designed for advanced video search and interactive user-system conversations. By fine-tuning a vision-language model (VLM) with a small video preference dataset and enhancing it with a retrieval vector from an image-text retrieval model, we overcome the limitations of traditional text-based retrieval systems in multimodal scenarios. The VLM-based retrieval model independently embeds video frames and audio transcriptions from an automatic speech recognition (ASR) module into a shared multimodal representation space, enabling V-Agent to interpret both visual and spoken content for context-aware video search. The system consists of three agents (a routing agent, a search agent, and a chat agent) that work collaboratively to address user intents by refining search outputs and communicating with users. The search agent utilizes the VLM-based retrieval model together with an additional re-ranking module to further enhance video retrieval qual
Related papers
- Learning To Retrieve Videos By Asking Questions (2022) 8.82
- The VISIONE Video Search System: Exploiting Off-the-shelf Text Search Engines For Large-scale Video Retrieval (2020) 10.74
- Llandmark: A Multi-agent Framework For Landmark-aware Multimodal Interactive Video Retrieval (2026) 0.00
- Interactive Video Retrieval With Dialog (2019) 8.09
- Verve: Versatile Retrieval For Videos Via Unified Embeddings (2026) 0.00
- Lazyvlm: Neuro-symbolic Approach To Video Analytics (2025) 0.00
- Meta-personalizing Vision-language Models To Find Named Instances In Video (2023) 8.60
- Simple Baselines For Interactive Video Retrieval With Questions And Answers (2023) 7.16