MCIF: Multimodal Crosslingual Instruction-following Benchmark From Scientific Talks
2025 · Sara Papi, Maike Züfle, Marco Gaido, et al.
Abstract
Recent advances in large language models have laid the foundation for multimodal LLMs (MLLMs), which unify text, speech, and vision within a single framework. As these models are rapidly evolving toward general-purpose instruction following across diverse and complex tasks, a key frontier is evaluating their crosslingual and multimodal capabilities over both short- and long-form inputs. However, existing benchmarks fall short in evaluating these dimensions jointly: they are often limited to English, mostly focus on a single modality at a time, rely on short-form inputs, or lack human annotations, hindering comprehensive assessment of model performance across languages, modalities, and task complexity. To address these gaps, we introduce MCIF (Multimodal Crosslingual Instruction Following), the first crosslingual human-annotated benchmark based on scientific talks on NLP and beyond. MCIF evaluates instruction following in crosslingual, multimodal settings over different input lengths an
Related papers
- MCAT: Scaling Many-to-many Speech-to-text Translation With MLLMs To 70 Languages (2025) · 2.41
- Multimodal Large Language Models For End-to-end Affective Computing: Benchmarking And Boosting With Generative Knowledge Prompting (2025) · 0.00
- Benchmarking Large Multimodal Models Against Common Corruptions (2024) · 2.89
- VCB Bench: An Evaluation Benchmark For Audio-grounded Large Language Model Conversational Agents (2025) · 0.00
- ChatBridge: Bridging Modalities With Large Language Model As A Language Catalyst (2023) · 0.00
- Macaw-LLM: Multi-modal Language Modeling With Image, Audio, Video, And Text Integration (2023) · 0.00
- AudioBench: A Universal Benchmark For Audio Large Language Models (2024) · 10.21
- Omhbench: Benchmarking Balanced And Grounded Omni-modal Multi-hop Reasoning (2026) · 0.00