Helping Hands: An Object-aware Ego-centric Video Recognition Model
2023 · Chuhan Zhang, Ankush Gupta, Andrew Zisserman
Abstract
We introduce an object-aware decoder for improving the performance of spatio-temporal representations on ego-centric videos. The key idea is to enhance object-awareness during training by tasking the model to predict hand positions, object positions, and the semantic label of the objects using paired captions when available. At inference time the model only requires RGB frames as inputs, and is able to track and ground objects (although it has not been trained explicitly for this). We demonstrate the performance of the object-aware representations learnt by our model, by: (i) evaluating it for strong transfer, i.e. through zero-shot testing, on a number of downstream video-text retrieval and classification benchmarks; and (ii) by using the representations learned as input for long-term video understanding tasks (e.g. Episodic Memory in Ego4D). In all cases the performance improves over the state of the art -- even compared to networks trained with far larger batch sizes. We also show t
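The training-time idea in the abstract (auxiliary supervision on hand and object positions, with only RGB frames needed at inference) can be illustrated with a minimal sketch. This is not the authors' implementation: the `Sample` fields, the L1 box loss, the loss weights, and the additive combination are all assumptions chosen for illustration, and the main objective is represented only as a precomputed number.

```python
from dataclasses import dataclass
from typing import Optional, Sequence

# Hypothetical box format: (x1, y1, x2, y2), normalised to [0, 1].
Box = Sequence[float]


def l1_box_loss(pred: Box, target: Box) -> float:
    """Mean absolute error over the four box coordinates."""
    return sum(abs(p - t) for p, t in zip(pred, target)) / 4.0


@dataclass
class Sample:
    # Value of the main video-text objective for this clip (assumed precomputed).
    video_text_loss: float
    # Boxes predicted by a hypothetical object-aware decoder.
    pred_hand_box: Box
    pred_object_box: Box
    # Auxiliary targets; None means no annotation exists for this clip.
    gt_hand_box: Optional[Box] = None
    gt_object_box: Optional[Box] = None


def training_loss(s: Sample, w_hand: float = 1.0, w_obj: float = 1.0) -> float:
    """Main loss plus auxiliary box losses for whichever targets are present."""
    loss = s.video_text_loss
    if s.gt_hand_box is not None:
        loss += w_hand * l1_box_loss(s.pred_hand_box, s.gt_hand_box)
    if s.gt_object_box is not None:
        loss += w_obj * l1_box_loss(s.pred_object_box, s.gt_object_box)
    return loss


# A clip with a hand annotation but no object annotation.
clip = Sample(
    video_text_loss=0.5,
    pred_hand_box=(0.0, 0.0, 1.0, 1.0),
    pred_object_box=(0.0, 0.0, 1.0, 1.0),
    gt_hand_box=(0.0, 0.0, 0.5, 0.5),
)
total = training_loss(clip)
```

Because the auxiliary terms only enter the loss, a model trained this way would need no box inputs at test time, which matches the abstract's claim that inference uses RGB frames alone.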
Related papers
- Retrieval-augmented Egocentric Video Captioning (2024)
- Object-aware Video-language Pre-training For Retrieval (2021)
- Egocentric Video-language Pretraining @ EPIC-KITCHENS-100 Multi-instance Retrieval Challenge 2022 (2022)
- Object-centric Representation Learning From Unlabeled Videos (2016)
- Object Priors For Classifying And Localizing Unseen Actions (2021)
- EgoCVR: An Egocentric Benchmark For Fine-grained Composed Video Retrieval (2024)
- Give: Guiding Visual Encoder To Perceive Overlooked Information (2024)
- Spatialmem: Metric-aligned Long-horizon Video Memory For Language Grounding And QA (2026)