Exploring Audio Hallucination In Egocentric Video Understanding
2026 · Ashish Seth, Xinhao Mei, Changsheng Zhao, et al.
Abstract
arXiv:2604.23860v1. Egocentric videos provide a distinctive setting in which sound serves as a crucial cue for understanding user activities and surroundings, particularly when visual information is unstable or occluded due to continuous camera movement. State-of-the-art large audio-visual language models (AV-LLMs) can generate multimodal descriptions. However, we show in this work that they are prone to audio hallucinations, often inferring sounds from visual cues that are visible but not heard. We present a systematic and automatic evaluation framework for analyzing audio hallucinations in egocentric video through a targeted question-answering (Q/A) protocol. We curate a dataset of 300 egocentric videos and design 1,000 sound-focused questions to probe model outputs. To characterize hallucinations, we propose a grounded taxonomy that distinguishes between foreground action sounds from the user's activities and background ambient sounds. Our evaluation shows th
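The abstract describes a Q/A protocol that probes models with sound-focused questions and groups hallucinations into foreground and background sounds. As a rough illustration only, one way such answers could be tallied per category is sketched below. The `Probe` record, its fields, and the rate definition (share of absent-sound questions the model answers "yes" to) are assumptions made for this sketch; the abstract does not specify the paper's actual data format or metrics.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class Probe:
    """One sound-focused question (hypothetical record layout)."""
    category: str         # "foreground" or "background", per the abstract's taxonomy
    sound_present: bool   # assumed ground-truth label: is the sound actually audible?
    model_says_yes: bool  # assumed parsed model answer to "do you hear X?"


def hallucination_rates(probes):
    """Per category, the fraction of absent-sound probes the model affirmed."""
    absent = defaultdict(int)
    hallucinated = defaultdict(int)
    for p in probes:
        if not p.sound_present:          # only absent sounds can be hallucinated
            absent[p.category] += 1
            if p.model_says_yes:
                hallucinated[p.category] += 1
    return {c: hallucinated[c] / absent[c] for c in absent}


# Toy, made-up answers purely to exercise the function.
probes = [
    Probe("foreground", False, True),
    Probe("foreground", False, False),
    Probe("background", False, True),
    Probe("background", True, True),   # sound is present, so not counted
]
rates = hallucination_rates(probes)
```

Keeping the two categories separate in the tally mirrors the foreground/background distinction in the abstract; the toy inputs are not drawn from the paper's dataset.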
Related papers
- Walking Through Uncertainty: An Empirical Study Of Uncertainty Estimation For Audio-aware Large Language Models (2026) 0.00
- All That Glitters Is Not Audio: Rethinking Text Priors And Audio Reliance In Audio-language Evaluation (2026) 0.00
- HalluAudio: Hallucinating Frequency As Concepts For Few-shot Audio Classification (2023) 3.58
- Towards Holistic Evaluation Of Large Audio-language Models: A Comprehensive Survey (2026) 9.75
- Mmaudioreverbs: Video-guided Acoustic Modeling For Dereverberation And Room Impulse Response Estimation (2026) 0.00
- Omni-captioner: Data Pipeline, Models, And Benchmark For Omni Detailed Perception (2025) 0.00
- Semantically Consistent Video-to-audio Generation Using Multimodal Language Large Model (2024) 0.00
- VideoLLaMA 2: Advancing Spatial-temporal Modeling And Audio Understanding In Video-LLMs (2024) 0.00