Polysemous Visual-semantic Embedding For Cross-modal Retrieval
2019 · Yale Song, Mohammad Soleymani
Abstract
Visual-semantic embedding aims to find a shared latent space where related visual and textual instances are close to each other. Most current methods learn injective embedding functions that map an instance to a single point in the shared space. Unfortunately, injective embedding cannot effectively handle polysemous instances with multiple possible meanings; at best, it would find an average representation of different meanings. This hinders its use in real-world scenarios where individual instances and their cross-modal associations are often ambiguous. In this work, we introduce Polysemous Instance Embedding Networks (PIE-Nets) that compute multiple and diverse representations of an instance by combining global context with locally-guided features via multi-head self-attention and residual learning. To learn visual-semantic embedding, we tie up two PIE-Nets and optimize them jointly in the multiple instance learning framework. Most existing work on cross-modal retrieval focuses on im
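To make the idea in the abstract concrete, here is a minimal sketch of how one instance could be mapped to several embeddings and how two such sets could be compared. It is an illustration under stated assumptions, not the authors' implementation: the function names (`pie_embed`, `mil_similarity`), the per-head query vectors, the plain dot-product attention, and the choice of taking the best-matching pair as the multiple-instance similarity are all hypothetical simplifications, and learned projection layers are omitted.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def l2_normalize(vec):
    """Scale a vector to unit length (left unchanged if it is all zeros)."""
    norm = math.sqrt(sum(x * x for x in vec)) or 1.0
    return [x / norm for x in vec]


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def pie_embed(global_feat, local_feats, head_queries):
    """Return one embedding per attention head.

    Assumption: local features share the dimensionality of the global
    feature, and each head is a hypothetical query vector scored against
    the local features by dot product.
    """
    embeddings = []
    for query in head_queries:
        # Attention weights of this head over the local features.
        weights = softmax([dot(query, loc) for loc in local_feats])
        # Locally-guided feature: attention-weighted sum of local features.
        attended = [
            sum(w * loc[d] for w, loc in zip(weights, local_feats))
            for d in range(len(global_feat))
        ]
        # Residual connection: global context plus the attended local feature.
        combined = [g + a for g, a in zip(global_feat, attended)]
        embeddings.append(l2_normalize(combined))
    return embeddings


def mil_similarity(image_embs, text_embs):
    """Assumed multiple-instance score: cosine of the best-matching pair."""
    return max(dot(u, v) for u in image_embs for v in text_embs)
```

With two heads attending to different local features, `pie_embed` yields two distinct unit-length vectors for the same instance, and `mil_similarity` scores an image-text pair by its closest pair of embeddings rather than by a single averaged point.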
Related papers
- Multiple Visual-semantic Embedding For Video Retrieval From Query Sentence (2020) 2.26
- UniVSE: Robust Visual Semantic Embeddings Via Structured Semantic Representations (2019) 0.00
- MHSAN: Multi-head Self-attention Network For Visual Semantic Embedding (2020) 10.48
- Learning Robust Visual-semantic Embeddings (2017) 15.22
- Webly Supervised Joint Embedding For Cross-modal Image-text Retrieval (2018) 13.17
- Improving Cross-modal Retrieval With Set Of Diverse Embeddings (2022) 13.55
- Dynamic Visual Semantic Sub-embeddings And Fast Re-ranking (2023) 0.00
- Image Search Using Multilingual Texts: A Cross-modal Learning Approach Between Image And Text (2019) 0.00