Egocentric Video-language Pretraining @ EPIC-KITCHENS-100 Multi-instance Retrieval Challenge 2022
2022 · Kevin Qinghong Lin, Alex Jinpeng Wang, Rui Yan, et al.
Abstract
In this report, we propose a video-language pretraining (VLP) based solution \cite{kevin2022egovlp} for the EPIC-KITCHENS-100 Multi-Instance Retrieval (MIR) challenge. Specifically, we exploit the recently released Ego4D dataset \cite{grauman2021ego4d} to pioneer egocentric VLP along three directions: the pretraining dataset, the pretraining objective, and the development set. Building on these three designs, we develop a pretrained video-language model that transfers its egocentric video-text representation to the MIR benchmark. Furthermore, we devise an adaptive multi-instance max-margin loss to fine-tune the model effectively and adopt the dual-softmax technique for reliable inference. Our best single model achieves 47.39% mAP and 61.44% nDCG on the challenge test set. The code is available at https://github.com/showlab/EgoVLP.
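To make the inference step more concrete, the following is a minimal sketch of one common form of dual-softmax re-weighting applied to a text-video similarity matrix. It is an illustration under stated assumptions, not the implementation from the report: the function names `softmax` and `dual_softmax_rerank`, the `temperature` value, and the toy similarity scores are all hypothetical, and the exact formulation used in the released code may differ.

```python
import numpy as np


def softmax(x, axis):
    """Numerically stable softmax along the given axis."""
    shifted = x - x.max(axis=axis, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=axis, keepdims=True)


def dual_softmax_rerank(sim, temperature=10.0):
    """Re-weight a (num_texts, num_videos) similarity matrix.

    Assumed formulation: each score is multiplied by a prior obtained
    from a softmax over the *other* retrieval direction (here, over all
    texts for a fixed video), so a video that many texts compete for
    contributes less to each of them.
    """
    prior = softmax(sim * temperature, axis=0)  # each column sums to 1
    return sim * prior


# Toy scores (hypothetical): both text queries initially prefer video 0.
sim = np.array([[0.90, 0.80],
                [0.85, 0.30]])

reranked = dual_softmax_rerank(sim)
print("raw top-1 per text:     ", sim.argmax(axis=1))
print("reranked top-1 per text:", reranked.argmax(axis=1))
```

In this toy matrix the two queries compete for video 0 before re-weighting. After re-weighting, the first query moves to video 1 while the second keeps video 0, which is the kind of one-to-one disambiguation the technique is meant to encourage. Note that the prior depends on the full set of test queries, so this sketch assumes all queries are available at inference time.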