Multiple Visual-semantic Embedding For Video Retrieval From Query Sentence
2020 · Huy Manh Nguyen, Tomo Miyazaki, Yoshihiro Sugaya, et al.
Abstract
Visual-semantic embedding aims to learn a joint embedding space where related video and sentence instances are located close to each other. Most existing methods place instances in a single embedding space. However, they struggle to embed instances because it is difficult to match visual dynamics in videos to textual features in sentences, and a single space is not enough to accommodate the variety of videos and sentences. In this paper, we propose a novel framework that maps instances into multiple individual embedding spaces so that we can capture multiple relationships between instances, leading to compelling video retrieval. We produce a final similarity between instances by fusing the similarities measured in each embedding space using a weighted sum strategy. The weights are determined from the query sentence, so we can flexibly emphasize a particular embedding space for each query. We conducted sentence-to-video retrieval experiments on a benchmark dataset. The proposed method achieved superior perform
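The weighted-sum fusion described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the random linear projections, the softmax used to turn the sentence feature into weights, the cosine similarity, and all names (`fused_similarity`, `video_projs`, `sent_projs`, `weight_proj`) are assumptions made only for illustration. The abstract states only that similarities from several embedding spaces are combined by a weighted sum whose weights depend on the sentence.

```python
import math
import random


def dot(a, b):
    """Inner product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def matvec(matrix, vec):
    """Multiply a matrix (list of rows) by a vector."""
    return [dot(row, vec) for row in matrix]


def cosine(a, b, eps=1e-12):
    """Cosine similarity between two vectors."""
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)) + eps)


def softmax(scores):
    """Numerically stable softmax; the outputs sum to 1."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def fused_similarity(video_feat, sent_feat, video_projs, sent_projs, weight_proj):
    """Fuse per-space similarities with sentence-dependent weights.

    Each (video_projs[k], sent_projs[k]) pair is a hypothetical stand-in for
    one embedding space. weight_proj maps the sentence feature to one score
    per space; softmax over those scores is an assumed choice of weighting.
    """
    sims = [
        cosine(matvec(w_video, video_feat), matvec(w_sent, sent_feat))
        for w_video, w_sent in zip(video_projs, sent_projs)
    ]
    weights = softmax(matvec(weight_proj, sent_feat))
    return dot(weights, sims), sims, weights


def random_matrix(rng, rows, cols):
    """Random Gaussian matrix, used here in place of learned parameters."""
    return [[rng.gauss(0.0, 1.0) for _ in range(cols)] for _ in range(rows)]


# Toy dimensions; real features would come from trained video and text encoders.
rng = random.Random(0)
num_spaces, d_video, d_sent, d_embed = 2, 8, 6, 4
video_projs = [random_matrix(rng, d_embed, d_video) for _ in range(num_spaces)]
sent_projs = [random_matrix(rng, d_embed, d_sent) for _ in range(num_spaces)]
weight_proj = random_matrix(rng, num_spaces, d_sent)
video_feat = [rng.gauss(0.0, 1.0) for _ in range(d_video)]
sent_feat = [rng.gauss(0.0, 1.0) for _ in range(d_sent)]

score, sims, weights = fused_similarity(
    video_feat, sent_feat, video_projs, sent_projs, weight_proj
)
print("per-space similarities:", sims)
print("sentence-dependent weights:", weights)
print("fused similarity:", score)
```

Because the weights are non-negative and sum to 1 in this sketch, the fused score is a convex combination of the per-space similarities, so a query can lean toward whichever space the weighting favors without leaving the range of the individual scores.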
Related papers
- Learning Joint Representations Of Videos And Sentences With Web Image Search (2016) 12.93
- Polysemous Visual-semantic Embedding For Cross-modal Retrieval (2019) 17.70
- Verve: Versatile Retrieval For Videos Via Unified Embeddings (2026) 0.00
- Univse: Robust Visual Semantic Embeddings Via Structured Semantic Representations (2019) 0.00
- SEA: Sentence Encoder Assembly For Video Retrieval By Textual Queries (2020) 12.47
- Tree-augmented Cross-modal Encoding For Complex-query Video Retrieval (2020) 15.57
- Dual Encoding For Video Retrieval By Text (2020) 16.05
- On Semantic Similarity In Video Retrieval (2021) 12.81