Expectation-Maximization Contrastive Learning for Compact Video-and-Language Representations
2022 · Peng Jin, Jinfa Huang, Fenglin Liu, et al.
Abstract
Most video-and-language representation learning approaches employ contrastive learning, e.g., CLIP, to project video and text features into a common latent space according to the semantic similarities of text-video pairs. However, such learned shared latent spaces are often not optimal, and the modality gap between visual and textual representations cannot be fully eliminated. In this paper, we propose Expectation-Maximization Contrastive Learning (EMCL) to learn compact video-and-language representations. Specifically, we use the Expectation-Maximization algorithm to find a compact set of bases for the latent space, where the features can be concisely represented as linear combinations of these bases. Such feature decomposition of video-and-language representations reduces the rank of the latent space, resulting in increased representing power for the semantics. Extensive experiments on three benchmark text-video retrieval datasets prove that our EMCL can learn more discrimi
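The decomposition idea in the abstract (find a small set of bases with EM, then express each feature as a linear combination of them) can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the function name `em_decompose`, the softmax-based E-step, the weighted-mean M-step, the random initialisation, and the parameters `num_bases`, `num_iters`, and `temperature` are all assumptions made for illustration.

```python
import numpy as np

def em_decompose(features, num_bases=4, num_iters=10, temperature=1.0, seed=0):
    """Illustrative EM-style re-representation of features over a few bases.

    features: array of shape (n, d), e.g. stacked video and text embeddings.
    Returns (reconstructed, bases, responsibilities).
    All design choices here are assumptions, not the paper's exact method.
    """
    rng = np.random.default_rng(seed)
    n, d = features.shape

    # Assumed initialisation: random, L2-normalised bases.
    bases = rng.standard_normal((num_bases, d))
    bases /= np.linalg.norm(bases, axis=1, keepdims=True)

    for _ in range(num_iters):
        # E-step: soft-assign every feature to every basis (row-wise softmax).
        logits = features @ bases.T / temperature
        logits -= logits.max(axis=1, keepdims=True)  # numerical stability
        resp = np.exp(logits)
        resp /= resp.sum(axis=1, keepdims=True)

        # M-step: each basis becomes the responsibility-weighted mean of the features.
        bases = (resp.T @ features) / (resp.sum(axis=0)[:, None] + 1e-8)

    # Each reconstructed feature is a linear combination of the bases,
    # so the reconstructed matrix has rank at most num_bases.
    reconstructed = resp @ bases
    return reconstructed, bases, resp


if __name__ == "__main__":
    feats = np.random.default_rng(1).standard_normal((32, 16))
    recon, bases, resp = em_decompose(feats, num_bases=4)
    print(recon.shape, bases.shape, np.linalg.matrix_rank(recon))
```

Because the reconstructed matrix is the product of an (n, num_bases) matrix and a (num_bases, d) matrix, its rank cannot exceed `num_bases`, which is the sense in which such a decomposition yields a more compact latent space. In a contrastive setup, the reconstructed features would presumably be used in place of, or alongside, the raw features when computing text-video similarities.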
Related papers
- Prompt Switch: Efficient CLIP Adaptation For Text-video Retrieval (2023) 11.93
- Normalized Contrastive Learning For Text-video Retrieval (2022) 6.77
- Contrastive Video-language Learning With Fine-grained Frame Sampling (2022) 6.77
- X-CLIP: End-to-end Multi-grained Contrastive Learning For Video-text Retrieval (2022) 18.12
- Context-adaptive Multi-prompt Embedding With Large Language Models For Vision-language Alignment (2025) 0.00
- TCLR: Temporal Contrastive Learning For Video Representation (2021) 15.78
- Memory Enhanced Embedding Learning For Cross-modal Video-text Retrieval (2021) 0.00
- Unifying Latent And Lexicon Representations For Effective Video-text Retrieval (2024) 0.00