SEA: Sentence Encoder Assembly For Video Retrieval By Textual Queries
2020 · Xirong Li, Fangming Zhou, Chaoxi Xu, et al.
Abstract
Retrieving unlabeled videos by textual queries, known as Ad-hoc Video Search (AVS), is a core theme in multimedia data management and retrieval. The success of AVS depends on cross-modal representation learning that encodes both query sentences and videos into common spaces for semantic similarity computation. Inspired by the initial success of a few previous works in combining multiple sentence encoders, this paper takes a step forward by developing a new and general method for effectively exploiting diverse sentence encoders. The novelty of the proposed method, which we term Sentence Encoder Assembly (SEA), is two-fold. First, unlike prior art that uses only a single common space, SEA supports text-video matching in multiple encoder-specific common spaces. This property prevents the matching from being dominated by a specific encoder that produces an encoding vector much longer than those of other encoders. Second, in order to explore complementarities among the individual common spac
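The idea of matching in several encoder-specific common spaces can be illustrated with a minimal sketch. It is not the authors' implementation: the encoder names (`bow`, `w2v`), the dimensions, the random projection matrices standing in for learned ones, and the choice to combine per-space cosine similarities by a plain sum are all assumptions made for illustration.

```python
import numpy as np

def l2_normalize(x):
    """Scale a vector to unit length so that a dot product equals cosine similarity."""
    return x / np.linalg.norm(x, axis=-1, keepdims=True)

def multi_space_similarity(query_encodings, video_feature, text_projs, video_projs):
    """Sum the cosine similarities computed in one common space per sentence encoder.

    Every argument layout here is hypothetical; real projections would be learned.
    """
    total = 0.0
    for name, q in query_encodings.items():
        q_emb = l2_normalize(text_projs[name] @ q)               # query in space `name`
        v_emb = l2_normalize(video_projs[name] @ video_feature)  # video in space `name`
        total += float(q_emb @ v_emb)                            # cosine similarity
    return total

rng = np.random.default_rng(0)
encoder_dims = {"bow": 5000, "w2v": 500}  # hypothetical encoders with very different output sizes
space_dim, video_dim = 64, 128            # assumed sizes

query_encodings = {n: rng.standard_normal(d) for n, d in encoder_dims.items()}
video_feature = rng.standard_normal(video_dim)
text_projs = {n: rng.standard_normal((space_dim, d)) for n, d in encoder_dims.items()}
video_projs = {n: rng.standard_normal((space_dim, video_dim)) for n in encoder_dims}

score = multi_space_similarity(query_encodings, video_feature, text_projs, video_projs)

# Inflating one encoder's output leaves the score unchanged, because each
# space contributes a bounded cosine similarity of its own.
inflated = dict(query_encodings, bow=1000.0 * query_encodings["bow"])
score_inflated = multi_space_similarity(inflated, video_feature, text_projs, video_projs)
```

Because each space contributes a value in [-1, 1], an encoder with a much longer or larger-magnitude output vector cannot dominate the total in this sketch. That is not guaranteed when all encodings are concatenated and projected into a single space.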
Related papers
- Dual Encoding For Video Retrieval By Text (2020) 16.05
- Multiple Visual-semantic Embedding For Video Retrieval From Query Sentence (2020) 2.26
- Tree-augmented Cross-modal Encoding For Complex-query Video Retrieval (2020) 15.57
- Learning Joint Representations Of Videos And Sentences With Web Image Search (2016) 12.93
- Use What You Have: Video Retrieval Using Representations From Collaborative Experts (2019) 0.00
- Marinevrs: Marine Video Retrieval System With Explainability Via Semantic Understanding (2023) 0.00
- The VISIONE Video Search System: Exploiting Off-the-shelf Text Search Engines For Large-scale Video Retrieval (2020) 10.74
- Verve: Versatile Retrieval For Videos Via Unified Embeddings (2026) 0.00