Video-adverb Retrieval With Compositional Adverb-action Embeddings
2023 · Thomas Hummel, Otniel-Bogdan Mercea, A. Sophia Koepke, et al.
Abstract
Retrieving adverbs that describe an action in a video is a crucial step towards fine-grained video understanding. We propose a framework for video-to-adverb retrieval (and vice versa) that aligns video embeddings with their matching compositional adverb-action text embedding in a joint embedding space. The compositional adverb-action text embedding is learned using a residual gating mechanism, along with a novel training objective consisting of triplet losses and a regression target. Our method achieves state-of-the-art performance on five recent benchmarks for video-adverb retrieval. Furthermore, we introduce dataset splits to benchmark video-adverb retrieval for unseen adverb-action compositions on subsets of the MSR-VTT Adverbs and ActivityNet Adverbs datasets. Our proposed framework outperforms all prior works on the generalisation task of retrieving adverbs from videos for unseen adverb-action compositions. Code and dataset splits are available at https://hummelth.github.io/Re
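The abstract names a residual gating mechanism and triplet losses but gives no formulas, so the following is only a minimal sketch of one common form of those ideas, not the authors' implementation. It assumes the composed text embedding is the action embedding plus a gated residual computed from the concatenated action and adverb embeddings, and it assumes cosine distance in the joint space. The names `compose`, `triplet_loss`, `retrieve_adverb`, `W_gate`, and `W_res` are hypothetical, and the regression target is omitted because the abstract does not describe it.

```python
import math

def matvec(W, x):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def cosine_distance(u, v):
    """1 - cosine similarity; 0 for identical directions, 2 for opposite ones."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / norm

def compose(action, adverb, W_gate, W_res):
    """Assumed residual gating: action + gate * residual.

    Both gate and residual are computed from the concatenation [action; adverb].
    """
    joint = action + adverb  # list concatenation
    gate = [sigmoid(z) for z in matvec(W_gate, joint)]
    residual = [math.tanh(z) for z in matvec(W_res, joint)]
    return [a + g * r for a, g, r in zip(action, gate, residual)]

def triplet_loss(anchor, positive, negative, margin=0.5):
    """Hinge-style triplet loss: pull the positive closer than the negative by a margin."""
    return max(0.0, cosine_distance(anchor, positive)
                    - cosine_distance(anchor, negative) + margin)

def retrieve_adverb(video, action, adverbs, W_gate, W_res):
    """Return the adverb whose composed adverb-action embedding is closest to the video."""
    return min(
        adverbs,
        key=lambda name: cosine_distance(
            video, compose(action, adverbs[name], W_gate, W_res)),
    )

# Toy 2-D example with hand-picked (untrained) weights.
W_gate = [[0.0] * 4, [0.0] * 4]                       # gate is 0.5 everywhere
W_res = [[0.0, 0.0, 1.0, 0.0], [0.0, 0.0, 0.0, 1.0]]  # residual = tanh(adverb)
action = [1.0, 0.0]
adverbs = {"quickly": [0.0, 2.0], "slowly": [0.0, -2.0]}
video = [1.0, 0.5]

best = retrieve_adverb(video, action, adverbs, W_gate, W_res)
loss = triplet_loss(
    video,
    compose(action, adverbs["quickly"], W_gate, W_res),
    compose(action, adverbs["slowly"], W_gate, W_res),
)
```

In this toy setup the residual only shifts the action embedding along the adverb direction, which is the intuition behind a residual formulation: the action embedding stays the base representation and the adverb modifies it. In a trained model the weights would be learned and the triplet loss would be minimised over many video, adverb, and action samples.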
Related papers
- Beyond Simple Edits: Composed Video Retrieval With Dense Modifications (2025) · 2.16
- Fine-grained Action Retrieval Through Multiple Parts-of-speech Embeddings (2019) · 15.62
- ICSVR: Investigating Compositional And Syntactic Understanding In Video Retrieval Models (2023) · 8.92
- NAVERO: Unlocking Fine-grained Semantics For Video-language Compositionality (2024) · 0.00
- Domain Adaptation In Multi-view Embedding For Cross-modal Video Retrieval (2021) · 0.00
- Composed Video Retrieval Via Enriched Context And Discriminative Embeddings (2024) · 12.19
- RAVU: Retrieval Augmented Video Understanding With Compositional Reasoning Over Graph (2025) · 0.00
- Verve: Versatile Retrieval For Videos Via Unified Embeddings (2026) · 0.00