MLVU
Emerging · 13 papers using it
First seen: 2025
MLVU is a dataset that contains 200 videos and 800 generated summaries, used to evaluate video-to-text summarization through multimodal question answering.
Papers using MLVU (13)
- QEVA: A Reference-free Evaluation Metric For Narrative Video Summarization With Multimodal Question Answering
- Event-Anchored Frame Selection for Effective Long-Video Understanding
- Question-guided Visual Compression with Memory Feedback for Long-Term Video Understanding
- Adaptive Greedy Frame Selection for Long Video Understanding
- ForestPrune: High-ratio Visual Token Compression for Video Multimodal Large Language Models via Spatial-Temporal Forest Modeling
- ReMoRa: Multimodal Large Language Model based on Refined Motion Representation for Long-Video Understanding
- MSJoE: Jointly Evolving MLLM and Sampler for Efficient Long-Form Video Understanding
- Think-Clip-Sample: Slow-Fast Frame Selection for Video Understanding
- LiViBench: An Omnimodal Benchmark for Interactive Livestream Video Understanding
- Video-o3: Native Interleaved Clue Seeking for Long Video Multi-Hop Reasoning
- Towards Effective Long Video Understanding of Multimodal Large Language Models via One-shot Clip Retrieval
- Q-Frame: Query-aware Frame Selection and Multi-Resolution Adaptation for Video-LLMs
- FALCONEye: Finding Answers and Localizing Content in ONE-hour-long videos with multi-modal LLMs