UniVSE: Robust Visual Semantic Embeddings via Structured Semantic Representations
2019 · Hao Wu, Jiayuan Mao, Yufeng Zhang, et al.
Abstract
We propose Unified Visual-Semantic Embeddings (UniVSE) for learning a joint space of visual and textual concepts. The space unifies concepts at different levels, including objects, attributes, relations, and full scenes. We propose a contrastive learning approach that learns fine-grained alignment from only image-caption pairs. We also present an effective approach for enforcing coverage of the semantic components that appear in a sentence. We demonstrate the robustness of UniVSE in defending against text-domain adversarial attacks on cross-modal retrieval tasks. This robustness also enables the use of visual cues to resolve word dependencies in novel sentences.
Related papers
- Learning Robust Visual-semantic Embeddings (2017)
- Multiple Visual-semantic Embedding For Video Retrieval From Query Sentence (2020)
- Dynamic Visual Semantic Sub-embeddings And Fast Re-ranking (2023)
- Unifying Latent And Lexicon Representations For Effective Video-text Retrieval (2024)
- Polysemous Visual-semantic Embedding For Cross-modal Retrieval (2019)
- Universal Vision-language Dense Retrieval: Learning A Unified Representation Space For Multi-modal Retrieval (2022)
- Improving Visual-semantic Embeddings By Learning Semantically-enhanced Hard Negatives For Cross-modal Information Retrieval (2022)
- Beyond Visual Semantics: Exploring The Role Of Scene Text In Image Understanding (2019)