From Contrast To Commonality: Audio Commonality Captioning For Enhanced Audio-text Cross-modal Understanding In Multimodal LLMs
2025 · Yuhang Jia, Xu Zhang, Yujie Guo, et al.
Abstract
Audio Captioning (AC) plays a pivotal role in enhancing audio-text cross-modal understanding during the pretraining and finetuning of Multimodal LLMs (MLLMs). To strengthen this alignment, recent works propose Audio Difference Captioning (ADC), which takes multiple audio inputs and encourages the model to describe their differences, thereby promoting fine-grained discrimination. However, despite its effectiveness, ADC introduces a semantic gap between the input audio clips, which are often rich in diverse events, and the brief, difference-focused caption. This deviation from the AC-style task causes a mismatch with the pretraining objective, leading to catastrophic forgetting. To address this, we propose Audio Commonality Captioning (ACC), a comparably challenging but gentler alternative that guides the model to capture shared semantics across audio clips rather than detailed differences. Experiments show that ACC not only improves audio-text understanding on captioning benchmarks but also better preserv
Related papers
- Audio Difference Captioning Utilizing Similarity-discrepancy Disentanglement (2023) 2.26
- Interactive Audio-text Representation For Automated Audio Captioning With Contrastive Learning (2022) 0.00
- Improving Audio Captioning Models With Fine-grained Audio Features, Text Embedding Supervision, And LLM Mix-up Augmentation (2023) 8.82
- Enhancing Automated Audio Captioning Via Large Language Models With Optimized Audio Encoding (2024) 5.24
- Improving Audio-text Retrieval Via Hierarchical Cross-modal Interaction And Auxiliary Captions (2023) 0.00
- Auto-ACD: A Large-scale Dataset For Audio-language Representation Learning (2023) 10.74
- Multiscale Matching Driven By Cross-modal Similarity Consistency For Audio-text Retrieval (2024) 4.52
- Beyond The Status Quo: A Contemporary Survey Of Advances And Challenges In Audio Captioning (2022) 9.03