Omni-captioner: Data Pipeline, Models, And Benchmark For Omni Detailed Perception
2025 · Ziyang Ma, Ruiyang Xu, Zhenghao Xing, et al.
Abstract
Fine-grained perception of multimodal information is critical for advancing human-AI interaction. With recent progress in audio-visual technologies, Omni Language Models (OLMs), capable of processing audio and video signals in parallel, have emerged as a promising paradigm for achieving richer understanding and reasoning. However, their capacity to capture and describe fine-grained details remains underexplored. In this work, we present a systematic and comprehensive investigation of omni detailed perception from the perspectives of the data pipeline, models, and benchmark. We first identify an inherent "co-growth" between detail and hallucination in current OLMs. To address this, we propose Omni-Detective, an agentic data generation pipeline that integrates tool-calling to autonomously produce highly detailed yet minimally hallucinatory multimodal data. Based on the data generated with Omni-Detective, we train two captioning models: Audio-Captioner for audio-only detailed perception,
Related papers
- Capybara-omni: An Efficient Paradigm For Building Omni-modal Language Models (2025) 0.00
- OMCAT: Omni Context Aware Transformer (2024) 0.00
- M2-omni: Advancing Omni-mllm For Comprehensive Modality Support With Competitive Performance (2025) 0.00
- Audio-omni: Extending Multi-modal Understanding To Versatile Audio Generation And Editing (2026) 0.00
- VAST: A Vision-audio-subtitle-text Omni-modality Foundation Model And Dataset (2023) 14.55
- Omni-c: Compressing Heterogeneous Modalities Into A Single Dense Encoder (2026) 0.00
- Audiosetcaps: An Enriched Audio-caption Dataset Using Automated Generation Pipeline With Large Audio And Language Models (2024) 13.44
- VALOR: Vision-audio-language Omni-perception Pretraining Model And Dataset (2023) 10.61