Show, Translate And Tell
2019 · Dheeraj Peri, Shagan Sah, Raymond Ptucha
Abstract
Humans have an incredible ability to process and understand information from multiple sources such as images, video, text, and speech. The recent success of deep neural networks has enabled us to develop algorithms that give machines the ability to understand and interpret this information. There is a need both to broaden their applicability and to develop methods that correlate visual information with semantic content. We propose a unified model that trains jointly on images and captions and learns to generate new captions given either an image or a caption query. We evaluate our model on three tasks: cross-modal retrieval, image captioning, and sentence paraphrasing. Our model gains insight into cross-modal vector embeddings, generalizes well across multiple tasks, and is competitive with state-of-the-art methods on retrieval.
Related papers
- Towards Retrieval-augmented Architectures For Image Captioning (2024) 9.41
- Deep Image Representations Using Caption Generators (2017) 0.00
- Learning Language-visual Embedding For Movie Understanding With Natural-language (2016) 0.00
- Retrieval-augmented Image Captioning (2023) 11.29
- Unicoder-vl: A Universal Encoder For Vision And Language By Cross-modal Pre-training (2019) 20.24
- Knowledge Completes The Vision: A Multimodal Entity-aware Retrieval-augmented Generation Framework For News Image Captioning (2025) 0.00
- Cross-modal Retrieval Augmentation For Multi-modal Classification (2021) 9.23
- Look, Imagine And Match: Improving Textual-visual Cross-modal Retrieval With Generative Models (2017) 18.52