← all papers Β· overview

Measuring Epistemic Humility in Multimodal Large Language Models

Abstract

arXiv:2509.09658v2 Announce Type: replace Abstract: Hallucinations in multimodal large language models (MLLMs) -- where the model generates content inconsistent with the input image -- pose significant risks in real-world applications, from misinformation in visual question answering to unsafe errors in decision-making. Existing benchmarks primarily test recognition accuracy, i.e., evaluating whether models can select the correct answer among distractors. This overlooks another important capability for trustworthy AI: recognizing when none of the provided options is supported by the image and abstaining from committing to a false choice, a humility-related behavior. We present HumbleBench, a new hallucination benchmark designed to evaluate false-option rejection in MLLMs under a forced-choice multiple-choice setting with a ``None of the above'' option. Built from a panoptic scene graph dataset, we leverage fine-grained scene graph annotations for objects and relations, use candidate attribute cues, and prompt GPT-4-Turbo to generate multiple-choice questions, followed by a rigorous manual filtering process. Each question includes a ``None of the above'' option, requiring models not only to recognize correct visual information but also to identify when no provided answer is valid. We evaluate a variety of state-of-the-art MLLMs -- including general-purpose, specialized reasoning, and proprietary models -- on HumbleBench and report empirical findings for the community. By incorporating explicit false-option rejection, HumbleBench fills a key gap in current evaluation suites by assessing a narrower but important abstention-oriented behavior that is relevant to trustworthy multimodal reasoning. Our code and dataset are released publicly and can be accessed at \href{https://github.com/maifoundations/HumbleBench}{https://github.com/maifoundations/HumbleBench}.

Code

Related papers