Fine-grained Action Retrieval Through Multiple Parts-of-speech Embeddings
2019 · Michael Wray, Diane Larlus, Gabriela Csurka, et al.
Abstract
We address the problem of cross-modal fine-grained action retrieval between text and video. Cross-modal retrieval is commonly achieved through learning a shared embedding space that can indifferently embed modalities. In this paper, we propose to enrich the embedding by disentangling parts-of-speech (PoS) in the accompanying captions. We build a separate multi-modal embedding space for each PoS tag. The outputs of multiple PoS embeddings are then used as input to an integrated multi-modal space, where we perform action retrieval. All embeddings are trained jointly through a combination of PoS-aware and PoS-agnostic losses. Our proposal enables learning specialised embedding spaces that offer multiple views of the same embedded entities. We report the first retrieval results on fine-grained actions for the large-scale EPIC dataset, in a generalised zero-shot setting. Results show the advantage of our approach for both video-to-text and text-to-video action retrieval. We also demonstr
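The pipeline described in the abstract (one embedding space per PoS tag, a fused space on top, and a joint PoS-aware plus PoS-agnostic loss) can be sketched roughly as follows. This is an illustrative toy, not the authors' implementation: the class and function names, the verb/noun tag set, the dimensions, the random weights standing in for learned projections, and the cosine-based triplet loss are all assumptions made for the example.

```python
import math
import random

def make_linear(in_dim, out_dim, rng):
    """Random weight matrix standing in for a learned projection (assumption)."""
    scale = 1.0 / math.sqrt(in_dim)
    return [[rng.gauss(0.0, scale) for _ in range(in_dim)] for _ in range(out_dim)]

def apply_linear(weights, vec):
    return [sum(w * x for w, x in zip(row, vec)) for row in weights]

def l2_normalise(vec):
    norm = math.sqrt(sum(x * x for x in vec)) or 1.0
    return [x / norm for x in vec]

def cosine(a, b):
    return sum(x * y for x, y in zip(l2_normalise(a), l2_normalise(b)))

def triplet_loss(anchor, positive, negative, margin=0.2):
    """Hinge loss: the positive should be closer to the anchor than the negative."""
    return max(0.0, margin - cosine(anchor, positive) + cosine(anchor, negative))

class PosAwareEmbedder:
    """One projection per PoS tag, then a fused projection over their outputs."""

    def __init__(self, in_dim, pos_tags=("verb", "noun"), pos_dim=8, out_dim=8, seed=0):
        rng = random.Random(seed)
        self.pos_tags = tuple(pos_tags)  # hypothetical tag set
        self.pos_proj = {t: make_linear(in_dim, pos_dim, rng) for t in self.pos_tags}
        self.fuse = make_linear(pos_dim * len(self.pos_tags), out_dim, rng)

    def embed_pos(self, features):
        """features maps each PoS tag to an input vector for that tag."""
        return {t: l2_normalise(apply_linear(self.pos_proj[t], features[t]))
                for t in self.pos_tags}

    def embed(self, features):
        """Concatenate the per-PoS embeddings and project into the fused space."""
        per_pos = self.embed_pos(features)
        concatenated = [x for t in self.pos_tags for x in per_pos[t]]
        return l2_normalise(apply_linear(self.fuse, concatenated))

def joint_loss(video_enc, text_enc, video, caption, negative_caption):
    """Sum of per-PoS (PoS-aware) losses and one fused-space (PoS-agnostic) loss."""
    v, c, n = (video_enc.embed_pos(video), text_enc.embed_pos(caption),
               text_enc.embed_pos(negative_caption))
    pos_aware = sum(triplet_loss(v[t], c[t], n[t]) for t in video_enc.pos_tags)
    pos_agnostic = triplet_loss(video_enc.embed(video), text_enc.embed(caption),
                                text_enc.embed(negative_caption))
    return pos_aware + pos_agnostic
```

In this sketch a video would pass the same feature vector under every tag, while a caption would pass separate vectors built from its verbs and its nouns; retrieval would then rank candidates by `cosine` in the fused space returned by `embed`.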
Related papers
- Video-adverb Retrieval With Compositional Adverb-action Embeddings (2023)
- Domain Adaptation In Multi-view Embedding For Cross-modal Video Retrieval (2021)
- Retrieval-augmented Egocentric Video Captioning (2024)
- Dual Encoding For Video Retrieval By Text (2020)
- Multilevel Language And Vision Integration For Text-to-clip Retrieval (2018)
- Exploiting Semantic Role Contextualized Video Features For Multi-instance Text-video Retrieval EPIC-KITCHENS-100 Multi-instance Retrieval Challenge 2022 (2022)
- Embedding-based Retrieval In Multimodal Content Moderation (2025)
- Memory Enhanced Embedding Learning For Cross-modal Video-text Retrieval (2021)