cluster #3
50 papers in this cluster (ordered by heat_score)
Papers
- Stacked Cross Attention For Image-text Matching (2018) | Kuang-Huei Lee, Xi Chen, Gang Hua, et al. | 27.77
- Visual Semantic Reasoning For Image-text Matching (2019) | Kunpeng Li, Yulun Zhang, Kai Li, et al. | 25.23
- Imagebind: One Embedding Space To Bind Them All (2023) | Rohit Girdhar, Alaaeldin El-Nouby, Zhuang Liu, et al. | 21.38
- Visual Relationship Detection With Language Priors (2016) | Cewu Lu, Ranjay Krishna, Michael Bernstein, et al. | 20.75
- Unicoder-vl: A Universal Encoder For Vision And Language By Cross-modal Pre-training (2019) | Gen Li, Nan Duan, Yuejian Fang, et al. | 20.24
- Pic2word: Mapping Pictures To Words For Zero-shot Composed Image Retrieval (2023) | Kuniaki Saito, Kihyuk Sohn, Xiang Zhang, et al. | 20.24
- Zero-shot Composed Image Retrieval With Textual Inversion (2023) | Alberto Baldrati, Lorenzo Agnolucci, Marco Bertini, et al. | 19.84
- Matching Images And Text With Multi-modal Tensor Fusion And Re-ranking (2019) | Tan Wang, Xing Xu, Yang Yang, et al. | 19.77
- Fine-grained Visual Textual Alignment For Cross-modal Retrieval Using Transformer Encoders (2020) | Nicola Messina, Giuseppe Amato, Andrea Esuli, et al. | 19.48
- Remote Sensing Cross-modal Text-image Retrieval Based On Global And Local Information (2022) | Zhiqiang Yuan, Wenkai Zhang, Changyuan Tian, et al. | 19.48
- IMRAM: Iterative Matching With Recurrent Attention Memory For Cross-modal Image-text Retrieval (2020) | Hui Chen, Guiguang Ding, Xudong Liu, et al. | 19.22
- Composing Text And Image For Image Retrieval - An Empirical Odyssey (2018) | Nam Vo, Lu Jiang, Chen Sun, et al. | 18.71
- Look, Imagine And Match: Improving Textual-visual Cross-modal Retrieval With Generative Models (2017) | Jiuxiang Gu, Jianfei Cai, Shafiq Joty, et al. | 18.52
- CAMP: Cross-modal Adaptive Message Passing For Text-image Retrieval (2019) | Zihao Wang, Xihui Liu, Hongsheng Li, et al. | 18.38
- Mobileclip: Fast Image-text Models Through Multi-modal Reinforced Training (2023) | Pavan Kumar Anasosalu Vasu, Hadi Pouransari, Fartash Faghri, et al. | 18.12
- 12-in-1: Multi-task Vision And Language Representation Learning (2019) | Jiasen Lu, Vedanuj Goswami, Marcus Rohrbach, et al. | 17.85
- Polysemous Visual-semantic Embedding For Cross-modal Retrieval (2019) | Yale Song, Mohammad Soleymani | 17.70
- CLIP-KD: An Empirical Study Of CLIP Model Distillation (2023) | Chuanguang Yang, Zhulin An, Libo Huang, et al. | 17.57
- Lightningdot: Pre-training Visual-semantic Embeddings For Real-time Image-text Retrieval (2021) | Siqi Sun, Yen-Chun Chen, Linjie Li, et al. | 17.42
- Image Retrieval On Real-life Images With Pre-trained Vision-and-language Models (2021) | Zheyuan Liu, Cristian Rodriguez-Opazo, Damien Teney, et al. | 17.07
- Composed Image Retrieval Using Contrastive Learning And Task-oriented Clip-based Features (2023) | Alberto Baldrati, Marco Bertini, Tiberio Uricchio, et al. | 16.84
- Exploring A Fine-grained Multiscale Method For Cross-modal Remote Sensing Image Retrieval (2022) | Zhiqiang Yuan, Wenkai Zhang, Kun Fu, et al. | 16.73
- VISTA: Visualized Text Embedding For Universal Multi-modal Retrieval (2024) | Junjie Zhou, Zheng Liu, Shitao Xiao, et al. | 16.73
- Multimodal Contrastive Training For Visual Representation Learning (2021) | Xin Yuan, Zhe Lin, Jason Kuen, et al. | 16.32
- Scene Text Retrieval Via Joint Text Detection And Similarity Learning (2021) | Hao Wang, Xiang Bai, Mingkun Yang, et al. | 16.16
- Transformer Reasoning Network For Image-text Matching And Retrieval (2020) | Nicola Messina, Fabrizio Falchi, Andrea Esuli, et al. | 16.15
- Mplug: Effective And Efficient Vision-language Learning By Cross-modal Skip-connections (2022) | Chenliang Li, Haiyang Xu, Junfeng Tian, et al. | 16.14
- Cross-modal And Uni-modal Soft-label Alignment For Image-text Retrieval (2024) | Hailang Huang, Zhijie Nie, Ziqiao Wang, et al. | 15.75
- Bi-directional Training For Composed Image Retrieval Via Text Prompt Learning (2023) | Zheyuan Liu, Weixuan Sun, Yicong Hong, et al. | 15.63
- Context-i2w: Mapping Images To Context-dependent Words For Accurate Zero-shot Composed Image Retrieval (2023) | Yuanmin Tang, Jing Yu, Keke Gai, et al. | 15.41
- Thinking Fast And Slow: Efficient Text-to-visual Retrieval With Transformers (2021) | Antoine Miech, Jean-Baptiste Alayrac, Ivan Laptev, et al. | 15.16
- Learning Aligned Cross-modal Representations From Weakly Aligned Data (2016) | Lluis Castrejon, Yusuf Aytar, Carl Vondrick, et al. | 14.97
- Long-clip: Unlocking The Long-text Capability Of CLIP (2024) | Beichen Zhang, Pan Zhang, Xiaoyi Dong, et al. | 14.90
- Multimodal Prototypical Networks For Few-shot Learning (2020) | Frederik Pahde, Mihai Puscas, Tassilo Klein, et al. | 14.73
- Mixgen: A New Multi-modal Data Augmentation (2022) | Xiaoshuai Hao, Yi Zhu, Srikar Appalaraju, et al. | 14.47
- Cross-modal Scene Networks (2016) | Yusuf Aytar, Lluis Castrejon, Carl Vondrick, et al. | 14.35
- Your Negative May Not Be True Negative: Boosting Image-text Matching With False Negative Elimination (2023) | Haoxuan Li, Yi Bin, Junrong Liao, et al. | 14.32
- Vista: Vision And Scene Text Aggregation For Cross-modal Retrieval (2022) | Mengjun Cheng, Yipeng Sun, Longchao Wang, et al. | 14.31
- ALADIN: Distilling Fine-grained Alignment Scores For Efficient Image-text Matching And Retrieval (2022) | Nicola Messina, Matteo Stefanini, Marcella Cornia, et al. | 14.00
- Image-text Retrieval: A Survey On Recent Research And Development (2022) | Min Cao, Shiping Li, Juntao Li, et al. | 13.93
- Align2ground: Weakly Supervised Phrase Grounding Guided By Image-caption Alignment (2019) | Samyak Datta, Karan Sikka, Anirban Roy, et al. | 13.93
- Unifying Two-stream Encoders With Transformers For Cross-modal Retrieval (2023) | Yi Bin, Haoxuan Li, Yahui Xu, et al. | 13.89
- Decoupling The Role Of Data, Attention, And Losses In Multimodal Transformers (2021) | Lisa Anne Hendricks, John Mellor, Rosalia Schneider, et al. | 13.88
- Finding Beans In Burgers: Deep Semantic-visual Embedding With Localization (2018) | Martin Engilberge, Louis Chevallier, Patrick Pérez, et al. | 13.84
- Unsupervised Contrastive Hashing For Cross-modal Retrieval In Remote Sensing (2022) | Georgii Mikriukov, Mahdyar Ravanbakhsh, Begüm Demir | 13.84
- Equivariant Similarity For Vision-language Foundation Models (2023) | Tan Wang, Kevin Lin, Linjie Li, et al. | 13.78
- COTS: Collaborative Two-stream Vision-language Pre-training Model For Cross-modal Retrieval (2022) | Haoyu Lu, Nanyi Fei, Yuqi Huo, et al. | 13.60
- Safe-clip: Removing NSFW Concepts From Vision-and-language Models (2023) | Samuele Poppi, Tobia Poppi, Federico Cocchi, et al. | 13.41
- CMIR-NET : A Deep Learning Based Model For Cross-modal Retrieval In Remote Sensing (2019) | Ushasi Chaudhuri, Biplab Banerjee, Avik Bhattacharya, et al. | 13.34
- Expressing Objects Just Like Words: Recurrent Visual Embedding For Image-text Matching (2020) | Tianlang Chen, Jiebo Luo | 13.34