LaT: Latent Translation With Cycle-consistency For Video-text Retrieval
2022 · Jinbin Bai, Chunhui Liu, Feiyue Ni, et al.
Abstract
Video-text retrieval is a class of cross-modal representation learning problems, where the goal is to select, from a pool of candidate videos, the video that corresponds to a given text query. The contrastive paradigm of vision-language pretraining has shown promising success with large-scale datasets and unified transformer architectures, and has demonstrated the power of a joint latent space. Despite this, the intrinsic divergence between the visual and textual domains is still far from eliminated, and projecting different modalities into a joint latent space might distort the information within each single modality. To overcome this issue, we present a novel mechanism for learning the translation relationship from a source modality space \(\mathcal{S}\) to a target modality space \(\mathcal{T}\) without the need for a joint latent space, which bridges the gap between visual and textual domains. Furthermore, to keep cycle consis
Related papers
- Text-video Retrieval With Global-local Semantic Consistent Learning (2024) 8.75
- Towards Fast Adaptation Of Pretrained Contrastive Models For Multi-channel Video-language Retrieval (2022) 7.50
- HiT: Hierarchical Transformer With Momentum Contrast For Video-text Retrieval (2021) 15.98
- Unifying Latent And Lexicon Representations For Effective Video-text Retrieval (2024) 0.00
- X-CLIP: End-to-end Multi-grained Contrastive Learning For Video-text Retrieval (2022) 18.12
- CLIP2Video: Mastering Video-text Retrieval Via Image CLIP (2021) 0.00
- Temporal Perceiving Video-language Pre-training (2023) 0.00
- LocVTP: Video-text Pre-training For Temporal Localization (2022) 11.39