Quantifying Multimodal Imbalance: A GMM-guided Adaptive Loss For Audio-visual Learning
2025 · Zhaocheng Liu, Zhiwen Yu, Xiaoqing Liu
Abstract
Multimodal learning integrates diverse modalities but suffers from modality imbalance, where dominant modalities suppress weaker ones due to inconsistent convergence rates. Existing methods predominantly rely on static modulation or heuristics, overlooking sample-level distributional variations in prediction bias. Specifically, they fail to distinguish outlier samples where the modality gap is exacerbated by low data quality. We propose a framework to quantitatively diagnose and dynamically mitigate this imbalance at the sample level. We introduce the Modality Gap metric to quantify prediction discrepancies. Analysis reveals that this gap follows a bimodal distribution, indicating the coexistence of balanced and imbalanced sample subgroups. We employ a Gaussian Mixture Model (GMM) to explicitly model this distribution, leveraging Bayesian posterior probabilities for soft subgroup separation. Our two-stage framework comprises a Warm-up stage and an Adaptive Training stage. In the latter
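The GMM step described in the abstract can be illustrated with a minimal sketch: fit a two-component, one-dimensional Gaussian mixture to per-sample gap values with EM, then read off the Bayesian posterior of each sample belonging to the imbalanced subgroup. Everything specific here is an assumption for illustration, not the authors' implementation: the gap values are synthetic, the function names (`fit_gmm_1d`, `posterior`) are hypothetical, and the excerpt does not state how the Modality Gap is computed or how the posteriors enter the loss.

```python
import math
import random


def gaussian_pdf(x, mean, var):
    """Density of a 1-D Gaussian with the given mean and variance."""
    return math.exp(-((x - mean) ** 2) / (2.0 * var)) / math.sqrt(2.0 * math.pi * var)


def posterior(x, weights, means, variances):
    """Bayesian posterior P(component k | x) for each mixture component."""
    joint = [w * gaussian_pdf(x, m, v) for w, m, v in zip(weights, means, variances)]
    total = sum(joint)
    if total == 0.0:  # every density underflowed; fall back to the priors
        return list(weights)
    return [j / total for j in joint]


def fit_gmm_1d(xs, n_iter=100, var_floor=1e-6):
    """Fit a two-component 1-D GMM with EM; returns (weights, means, variances)."""
    n = len(xs)
    ordered = sorted(xs)
    # Initialise the means at the lower and upper quartiles of the data.
    means = [ordered[n // 4], ordered[(3 * n) // 4]]
    overall_mean = sum(xs) / n
    overall_var = max(sum((x - overall_mean) ** 2 for x in xs) / n, var_floor)
    variances = [overall_var, overall_var]
    weights = [0.5, 0.5]
    for _ in range(n_iter):
        # E-step: soft assignment of every sample to both components.
        resp = [posterior(x, weights, means, variances) for x in xs]
        # M-step: re-estimate each component from its weighted samples.
        for k in range(2):
            nk = max(sum(r[k] for r in resp), 1e-12)
            weights[k] = nk / n
            means[k] = sum(r[k] * x for r, x in zip(resp, xs)) / nk
            spread = sum(r[k] * (x - means[k]) ** 2 for r, x in zip(resp, xs)) / nk
            variances[k] = max(spread, var_floor)
    return weights, means, variances


# Synthetic gap values (assumption): a "balanced" subgroup with gaps near 0
# and an "imbalanced" subgroup with clearly larger gaps.
random.seed(0)
gaps = [random.gauss(0.05, 0.05) for _ in range(300)]
gaps += [random.gauss(0.60, 0.10) for _ in range(200)]

weights, means, variances = fit_gmm_1d(gaps)
imbalanced = 0 if means[0] > means[1] else 1  # component with the larger mean gap

for g in (0.0, 0.3, 0.6):
    p = posterior(g, weights, means, variances)[imbalanced]
    print(f"gap={g:.2f}  P(imbalanced | gap)={p:.3f}")
```

Because the posterior is a probability rather than a hard label, it could serve as a per-sample weight in a training objective, which is one plausible reading of "soft subgroup separation"; the exact use in the paper's Adaptive Training stage is not given in this excerpt.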
Related papers
- MMCosine: Multi-modal Cosine Loss Towards Balanced Audio-visual Fine-grained Learning (2023) 10.97
- Attribution Regularization For Multimodal Paradigms (2024) 0.00
- Modality Collapse As Mismatched Decoding: Information-theoretic Limits Of Multimodal LLMs (2026) 0.00
- MMDisCo: Multi-modal Discriminator-guided Cooperative Diffusion For Joint Audio And Video Generation (2024) 1.91
- Identifiable Shared Component Analysis Of Unpaired Multimodal Mixtures (2024) 0.00
- A Study Of Dropout-induced Modality Bias On Robustness To Missing Video Frames For Audio-visual Speech Recognition (2024) 9.50
- Analyzing Utility Of Visual Context In Multimodal Speech Recognition Under Noisy Conditions (2019) 0.00
- Complete Cross-triplet Loss In Label Space For Audio-visual Cross-modal Retrieval (2022) 5.84