Video-MME
Emerging
27 papers using it
First seen: 2024
Video-MME is a dataset and benchmark of long video sequences. It is used to evaluate how well multimodal models understand video content and select keyframes according to dynamic, query-driven criteria.
Papers using Video-MME (27)
- Task-Focused Memorization for Multimodal Agents
- Xiaomi Mimo-vl-miloco Technical Report
- MorphoQuant: Modality-Aware Quantization for Omni-modal Large Language Models
- Dynin-Omni: Omnimodal Unified Large Diffusion Language Model
- Where to Focus: Query-Modulated Multimodal Keyframe Selection for Long Video Understanding
- Event-Anchored Frame Selection for Effective Long-Video Understanding
- EvoPrune: Early-Stage Visual Token Pruning for Efficient MLLMs
- Beyond Single-Sample: Reliable Multi-Sample Distillation for Video Understanding
- Question-guided Visual Compression with Memory Feedback for Long-Term Video Understanding
- HiMu: Hierarchical Multimodal Frame Selection for Long Video Question Answering
- LensWalk: Agentic Video Understanding by Planning How You See in Videos
- MACD: Model-Aware Contrastive Decoding via Counterfactual Data
- MSJoE: Jointly Evolving MLLM and Sampler for Efficient Long-Form Video Understanding
- LinMU: Multimodal Understanding Made Linear
- Scaling Vision Language Models for Pharmaceutical Long Form Video Reasoning on Industrial GenAI Platform
- Video Evidence to Reasoning Efficient Video Understanding via Explicit Evidence Grounding
- Think-Clip-Sample: Slow-Fast Frame Selection for Video Understanding
- LiViBench: An Omnimodal Benchmark for Interactive Livestream Video Understanding
- Structured Over Scale: Learning Spatial Reasoning from Educational Video
- VideoAnchor: Reinforcing Subspace-Structured Visual Cues for Coherent Visual-Spatial Reasoning
- Scaling RL To Long Videos
- VSI: Visual Subtitle Integration For Keyframe Selection To Enhance Long Video Understanding
- Lightweight Structured Multimodal Reasoning For Clinical Scene Understanding In Robotics
- Enhancing Temporal Understanding In Video-llms Through Stacked Temporal Attention In Vision Encoders
- Less Is More: Token-efficient Video-qa Via Adaptive Frame-pruning And Semantic Graph Integration
- Q-Frame: Query-aware Frame Selection and Multi-Resolution Adaptation for Video-LLMs
- Video-MME: The First-Ever Comprehensive Evaluation Benchmark of Multi-modal LLMs in Video Analysis