Modality-balanced Embedding For Video Retrieval
2022 · Xun Wang, Bingqing Ke, Xuanping Li, et al.
Abstract
Video search has become the main routine for users to discover videos relevant to a text query on large short-video sharing platforms. While training a query-video bi-encoder model on online search logs, we identify a modality bias phenomenon: the video encoder relies almost entirely on text matching, neglecting other modalities of the videos such as vision and audio. This modality imbalance results from a) modality gap: the relevance between a query and a video text is much easier to learn because the query is also a piece of text, with the same modality as the video text; b) data bias: most training samples can be solved solely by text matching. Here we share our practices to improve the first retrieval stage, including our solution for the modality imbalance issue. We propose MBVR (short for Modality Balanced Video Retrieval) with two key components: manually generated modality-shuffled (MS) samples and a dynamic margin (DM) based on visual relevance. They can encourage the video enco
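To make the two components named in the abstract concrete, here is a minimal sketch of one possible reading. It is not the authors' implementation. It assumes that a modality-shuffled sample keeps a video's text embedding but swaps in the vision embedding of another video in the batch and is then used as a negative, that modalities are fused by a plain element-wise sum, and that the dynamic margin grows linearly with the query-vision cosine similarity inside a hinge loss. All function names and the `base`, `scale`, and `shift` parameters are hypothetical.

```python
import math


def normalize(vec):
    """Scale a vector to unit length (a zero vector is returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in vec)) or 1.0
    return [x / norm for x in vec]


def cosine(a, b):
    """Cosine similarity between two vectors."""
    return sum(x * y for x, y in zip(normalize(a), normalize(b)))


def fuse(text_vec, vision_vec):
    """Assumed fusion: element-wise sum of the text and vision embeddings."""
    return [t + v for t, v in zip(text_vec, vision_vec)]


def modality_shuffled(texts, visions, shift=1):
    """Build MS samples: video i keeps its text but takes the vision
    embedding of video (i + shift) % n from the same batch."""
    n = len(texts)
    if n < 2 or shift % n == 0:
        raise ValueError("need at least two videos and a non-trivial shift")
    return [fuse(texts[i], visions[(i + shift) % n]) for i in range(n)]


def dynamic_margin(query, vision, base=0.1, scale=0.2):
    """Assumed margin: larger when the query is visually relevant."""
    return base + scale * max(0.0, cosine(query, vision))


def ms_hinge_loss(queries, texts, visions, shift=1, base=0.1, scale=0.2):
    """Hinge loss that ranks the true video above its MS counterpart.

    Text is identical in both candidates, so only the vision part can
    separate them; text matching alone cannot reduce this loss."""
    ms_videos = modality_shuffled(texts, visions, shift)
    total = 0.0
    for q, t, v, ms in zip(queries, texts, visions, ms_videos):
        pos = cosine(q, fuse(t, v))   # query vs. real video
        neg = cosine(q, ms)           # query vs. modality-shuffled video
        total += max(0.0, dynamic_margin(q, v, base, scale) - pos + neg)
    return total / len(queries)


if __name__ == "__main__":
    queries = [[1.0, 0.0], [0.0, 1.0]]
    texts = [[1.0, 0.0], [0.0, 1.0]]
    visions = [[1.0, 0.0], [0.0, 1.0]]
    print(ms_hinge_loss(queries, texts, visions))
```

Because the positive and the shuffled candidate share the same text embedding, the loss can only fall if the encoder uses the vision embedding, which is the balancing effect the abstract describes. A production system would compute this on batched tensors with learned encoders rather than Python lists.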
Related papers
- Towards Balanced Alignment: Modal-enhanced Semantic Modeling For Video Moment Retrieval (2023) · 14.33
- Embedding-based Retrieval In Multimodal Content Moderation (2025) · 2.26
- Dual Encoding For Video Retrieval By Text (2020) · 16.05
- Verve: Versatile Retrieval For Videos Via Unified Embeddings (2026) · 0.00
- Memory Enhanced Embedding Learning For Cross-modal Video-text Retrieval (2021) · 0.00
- Towards Universal Video Retrieval: Generalizing Video Embedding Via Synthesized Multimodal Pyramid Curriculum (2025) · 0.00
- Multi-modal Transformer For Video Retrieval (2020) · 19.47
- Relevance-based Margin For Contrastively-trained Video Retrieval Models (2022) · 7.74