NewsStories: Illustrating Articles With Visual Summaries
2022 · Reuben Tan, Bryan A. Plummer, Kate Saenko, et al.
Abstract
Recent self-supervised approaches have used large-scale image-text datasets to learn powerful representations that transfer to many tasks without finetuning. These methods often assume a one-to-one correspondence between images and their (short) captions. However, many tasks require reasoning about multiple images and long text narratives, such as describing news articles with visual summaries. Thus, we explore a novel setting where the goal is to learn a self-supervised visual-language representation that is robust to varying text length and number of images. In addition, unlike prior work, which assumed that captions have a literal relation to the image, we assume that images have only a loose, illustrative correspondence with the text. To explore this problem, we introduce a large-scale multimodal dataset containing over 31M articles, 22M images and 1M videos. We show that state-of-the-art image-text alignment methods are not robust to longer narratives with multiple images.
Related papers
- Self-supervised Visual Representations For Cross-modal Retrieval (2019) 7.50
- Upgrading The Newsroom: An Automated Image Selection System For News Articles (2020) 0.00
- Learning The Visualness Of Text Using Large Vision-language Models (2023) 4.52
- MLLMs-augmented Visual-language Representation Learning (2023) 0.00
- DreamLIP: Language-image Pre-training With Long Captions (2024) 10.61
- Show, Translate And Tell (2019) 4.52
- Deep Image Representations Using Caption Generators (2017) 0.00
- Webly Supervised Joint Embedding For Cross-modal Image-text Retrieval (2018) 13.17