Learning Language-visual Embedding For Movie Understanding With Natural-language
2016 · Atousa Torabi, Niket Tandon, Leonid Sigal
Abstract
Learning a joint language-visual embedding has a number of very appealing properties and can result in a variety of practical applications, including natural language image/video annotation and search. In this work, we study three different joint language-visual neural network model architectures. We evaluate our models on the large-scale LSMDC16 movie dataset for two tasks: 1) standard ranking for video annotation and retrieval, and 2) our proposed movie multiple-choice test. This test facilitates automatic evaluation of visual-language models for natural language video annotation based on human activities. In addition to the original Audio Description (AD) captions, provided as part of LSMDC16, we collected and will make available a) manually generated re-phrasings of those captions obtained using Amazon MTurk, and b) automatically generated human activity elements in "Predicate + Object" (PO) phrases based on "Knowlywood", an activity knowledge mining model. Our best model achieves Recall@10 of 19.2% on
Related papers
- Learning Joint Representations Of Videos And Sentences With Web Image Search (2016)
- Towards Holistic Language-video Representation: The Language Model-enhanced MSR-Video To Text Dataset (2024)
- Webly Supervised Joint Embedding For Cross-modal Image-text Retrieval (2018)
- Multilevel Language And Vision Integration For Text-to-clip Retrieval (2018)
- AVLnet: Learning Audio-visual Language Representations From Instructional Videos (2020)
- Learning Robust Visual-semantic Embeddings (2017)
- Show, Translate And Tell (2019)
- Learning A Text-video Embedding From Incomplete And Heterogeneous Data (2018)