BiMa: Towards Biases Mitigation For Text-video Retrieval Via Scene Element Guidance
2025 · Huy Le, Nhat Chung, Tung Kieu, et al.
Abstract
Text-video retrieval (TVR) systems often suffer from visual-linguistic biases present in datasets, which cause pre-trained vision-language models to overlook key details. To address this, we propose BiMa, a novel framework designed to mitigate biases in both visual and textual representations. Our approach begins by generating scene elements that characterize each video by identifying relevant entities/objects and activities. For visual debiasing, we integrate these scene elements into the video embeddings, enhancing them to emphasize fine-grained and salient details. For textual debiasing, we introduce a mechanism to disentangle text features into content and bias components, enabling the model to focus on meaningful content while separately handling biased information. Extensive experiments and ablation studies across five major TVR benchmarks (i.e., MSR-VTT, MSVD, LSMDC, ActivityNet, and DiDeMo) demonstrate the competitive performance of BiMa. Additionally, the model's bias mitigati
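The abstract names two ideas: enhancing video embeddings with scene elements, and splitting text features into content and bias components. The sketch below is only a toy illustration of that pipeline, not BiMa's actual method, since the abstract does not give the mechanisms. The convex blend in `fuse_scene_elements`, the projection onto a fixed `bias_dir` in `split_content_bias`, and all function and parameter names are assumptions made for illustration.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def normalize(v):
    # Scale a vector to unit length (assumes a non-zero vector).
    norm = math.sqrt(dot(v, v))
    return [x / norm for x in v]

def fuse_scene_elements(video_feat, scene_feat, alpha=0.5):
    # ASSUMPTION: a plain convex blend stands in for whatever fusion
    # the paper uses to inject scene elements into video embeddings.
    return [(1 - alpha) * v + alpha * s for v, s in zip(video_feat, scene_feat)]

def split_content_bias(text_feat, bias_dir):
    # ASSUMPTION: bias is modelled as the projection onto one known
    # direction; the content component is the orthogonal remainder.
    u = normalize(bias_dir)
    coeff = dot(text_feat, u)
    bias = [coeff * x for x in u]
    content = [t - b for t, b in zip(text_feat, bias)]
    return content, bias

def rank_videos(text_feat, video_feats):
    # Rank video indices by cosine similarity to the text query,
    # most similar first.
    q = normalize(text_feat)
    scores = [dot(q, normalize(v)) for v in video_feats]
    return sorted(range(len(video_feats)), key=lambda i: scores[i], reverse=True)

# Hypothetical 3-d features, chosen only to make the arithmetic easy to follow.
text_feat = [1.0, 2.0, 3.0]
content, bias = split_content_bias(text_feat, bias_dir=[0.0, 0.0, 1.0])
videos = [[1.0, 2.0, 0.1], [0.0, 0.0, 1.0], [-1.0, -2.0, 0.0]]
ranking = rank_videos(content, videos)
```

In this toy setup, retrieval uses only the content component, so a video that matches the query solely along the assumed bias direction no longer ranks first.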
Related papers
- Selective Query-guided Debiasing For Video Corpus Moment Retrieval (2022) 9.59
- Bridging Information Asymmetry In Text-video Retrieval: A Data-centric Approach (2024) 0.00
- Modality-balanced Embedding For Video Retrieval (2022) 7.16
- Bidirectional Likelihood Estimation With Multi-modal Large Language Models For Text-video Retrieval (2025) 2.76
- Dual-modal Attention-enhanced Text-video Retrieval With Triplet Partial Margin Contrastive Learning (2023) 8.82
- Mitigating Test-time Bias For Fair Image Retrieval (2023) 0.00
- TokenBinder: Text-video Retrieval With One-to-many Alignment Paradigm (2024) 4.52
- MASS: Overcoming Language Bias In Image-text Matching (2025) 0.00