MAGNET: A Multi-agent Framework For Finding Audio-visual Needles By Reasoning Over Multi-video Haystacks
2025 · Sanjoy Chowdhury, Mohamed Elmoghany, Yohan Abeysinghe, et al.
Abstract
Large multimodal models (LMMs) have shown remarkable progress in audio-visual understanding, yet they struggle with real-world scenarios that require complex reasoning across extensive video collections. Existing benchmarks for video question answering remain limited in scope, typically involving one clip per query, and so fall short of representing the challenges of large-scale audio-visual retrieval and reasoning encountered in practical applications. To bridge this gap, we introduce a novel task named AV-HaystacksQA, where the goal is to identify salient segments across different videos in response to a query and link them together to generate the most informative answer. To this end, we present AVHaystacks, an audio-visual benchmark comprising 3100 annotated QA pairs designed to assess the capabilities of LMMs in multi-video retrieval and temporal grounding tasks. Additionally, we propose a model-agnostic, multi-agent framework, MAGNET, to address this challenge, achieving up to 89%
Related papers
- Visual Haystacks: A Vision-centric Needle-in-a-haystack Benchmark (2024)
- Multihaystack: Benchmarking Multimodal Retrieval And Reasoning Over 40K Images, Videos, And Documents (2026)
- Document Haystacks: Vision-language Reasoning Over Piles Of 1000+ Documents (2024)
- RAVU: Retrieval Augmented Video Understanding With Compositional Reasoning Over Graph (2025)
- V-retrver: Evidence-driven Agentic Reasoning For Universal Multimodal Retrieval (2026)
- Query-centric Audio-visual Cognition Network For Moment Retrieval, Segmentation And Step-captioning (2024)
- Llandmark: A Multi-agent Framework For Landmark-aware Multimodal Interactive Video Retrieval (2026)
- Multimodal Needle In A Haystack: Benchmarking Long-context Capability Of Multimodal Large Language Models (2024)