Learning Joint Representations Of Videos And Sentences With Web Image Search
2016 · Mayu Otani, Yuta Nakashima, Esa Rahtu, et al.
Abstract
Our objective is video retrieval based on natural language queries. In addition, we consider the analogous problem of retrieving sentences or generating descriptions given an input video. Recent work has addressed the problem by embedding visual and textual inputs into a common space where semantic similarities correlate with distances. We also adopt the embedding approach and make the following contributions: First, we utilize web image search in the sentence embedding process to disambiguate fine-grained visual concepts. Second, we propose embedding models for sentence, image, and video inputs whose parameters are learned simultaneously. Finally, we show how the proposed model can be applied to description generation. Overall, we observe a clear improvement over the state-of-the-art methods in the video and sentence retrieval tasks. In description generation, the performance level is comparable to the current state of the art, although our embeddings were trained for the retrieval tasks.
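The retrieval idea in the abstract, where sentences and videos live in a common space and closer points are more semantically similar, can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: the function names `euclidean` and `rank_videos`, the toy 3-dimensional vectors, and the choice of Euclidean distance are hypothetical, and the sketch does not reproduce the paper's learned embedding models or its use of web image search.

```python
import math

def euclidean(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def rank_videos(query_embedding, video_embeddings):
    """Return video ids ordered from nearest to farthest from the query.

    video_embeddings maps a video id to its vector in the joint space.
    """
    return sorted(
        video_embeddings,
        key=lambda vid: euclidean(query_embedding, video_embeddings[vid]),
    )

# Made-up embeddings; a real system would obtain these from trained encoders.
videos = {
    "dog_running": [0.9, 0.1, 0.0],
    "cat_sleeping": [0.1, 0.9, 0.0],
    "car_driving": [0.0, 0.1, 0.9],
}
# Hypothetical embedding of a query sentence such as "a dog runs in a park".
query = [0.8, 0.2, 0.1]

ranking = rank_videos(query, videos)
print(ranking)
```

Sentence retrieval for a given video works the same way with the roles swapped: the video embedding becomes the query and candidate sentences are ranked by distance.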
Related papers
- Multiple Visual-semantic Embedding For Video Retrieval From Query Sentence (2020)
- Webly Supervised Joint Embedding For Cross-modal Image-text Retrieval (2018)
- Use What You Have: Video Retrieval Using Representations From Collaborative Experts (2019)
- Learning Language-visual Embedding For Movie Understanding With Natural-language (2016)
- Dual Encoding For Video Retrieval By Text (2020)
- SEA: Sentence Encoder Assembly For Video Retrieval By Textual Queries (2020)
- Verve: Versatile Retrieval For Videos Via Unified Embeddings (2026)
- UniVSE: Robust Visual Semantic Embeddings Via Structured Semantic Representations (2019)