MILES: Visual BERT Pre-training With Injected Language Semantics For Video-text Retrieval
2022 · Yuying Ge, Yixiao Ge, Xihui Liu, et al.
Abstract
Dominant pre-training work for video-text retrieval mainly adopts the "dual-encoder" architecture to enable efficient retrieval, where two separate encoders are used to contrast global video and text representations, but ignores detailed local semantics. The recent success of image BERT pre-training with masked visual modeling, which promotes the learning of local visual context, motivates a possible solution to this limitation. In this work, we investigate, for the first time, masked visual modeling in video-text pre-training with the "dual-encoder" architecture. We perform Masked visual modeling with Injected LanguagE Semantics (MILES) by employing an extra snapshot video encoder as an evolving "tokenizer" to produce reconstruction targets for masked video patch prediction. Given the corrupted video, the video encoder is trained to recover text-aligned features of the masked patches via reasoning with the visible regions along the spatial and temporal dimensions, which enhanc
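The masked patch prediction objective described in the abstract can be illustrated with a minimal sketch. Everything here is an assumption made for illustration rather than the authors' implementation: the function names `sample_patch_mask` and `masked_feature_loss`, the random masking scheme, and the squared-error loss are hypothetical choices. The sketch assumes the snapshot encoder has already produced one target feature vector per patch from the uncorrupted video, and that the video encoder has produced one predicted vector per patch from the masked video.

```python
import random


def sample_patch_mask(num_patches, mask_ratio, rng):
    """Return one boolean per patch; True marks a patch hidden from the video encoder.

    Uniform random masking is an assumption; the abstract does not specify the scheme.
    """
    num_masked = int(round(num_patches * mask_ratio))
    masked = set(rng.sample(range(num_patches), num_masked))
    return [i in masked for i in range(num_patches)]


def masked_feature_loss(predicted, targets, mask):
    """Squared error per patch (summed over feature dimensions),
    averaged over the masked patches only.

    predicted: per-patch features from the video encoder (masked input).
    targets:   per-patch features from the snapshot encoder (clean input).
    The squared-error form is an illustrative assumption.
    """
    total, count = 0.0, 0
    for pred, tgt, is_masked in zip(predicted, targets, mask):
        if not is_masked:
            continue  # visible patches do not contribute to the loss
        total += sum((p - t) ** 2 for p, t in zip(pred, tgt))
        count += 1
    return total / count if count else 0.0


# Toy usage: 8 patches with 2-dimensional features, half of them masked.
rng = random.Random(0)
mask = sample_patch_mask(num_patches=8, mask_ratio=0.5, rng=rng)
targets = [[1.0, 0.0] for _ in range(8)]
predicted = [[0.0, 0.0] for _ in range(8)]
loss = masked_feature_loss(predicted, targets, mask)
```

Restricting the loss to masked positions is what forces the encoder to infer hidden content from the visible regions; visible patches carry no training signal in this sketch.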
Related papers
- Mask To Reconstruct: Cooperative Semantics Completion For Video-text Retrieval (2023)
- Masked Contrastive Pre-training For Efficient Video-text Retrieval (2022)
- E-ViLM: Efficient Video-language Model Via Masked Video Modeling With Semantic Vector-quantized Tokenizer (2023)
- Memory Enhanced Embedding Learning For Cross-modal Video-text Retrieval (2021)
- Vidvec: Unlocking Video MLLM Embeddings For Video-text Retrieval (2026)
- Temporal Perceiving Video-language Pre-training (2023)
- ImageBERT: Cross-modal Pre-training With Large-scale Weak-supervised Image-text Data (2020)
- Video-text Pre-training With Learned Regions (2021)