Linking Representations With Multimodal Contrastive Learning
2023 · Abhishek Arora, Xinmei Yang, Shao-Yu Jheng, et al.
Abstract
Many applications require linking individuals, firms, or locations across datasets. Most widely used methods, especially in social science, do not employ deep learning; record linkage is commonly approached with string matching techniques. Moreover, existing methods do not exploit the inherently multimodal nature of documents. In historical record linkage applications, documents are typically transcribed noisily by optical character recognition (OCR). Linkage with OCR'ed texts alone may fail because of noise, whereas linkage with image crops alone may fail because vision models lack language understanding (e.g., of abbreviations or other ways of writing firm names). To leverage multimodal learning, this study develops CLIPPINGS (Contrastively LInking Pooled Pre-trained Embeddings). CLIPPINGS aligns symmetric vision and language bi-encoders through contrastive language-image pre-training on document images and their corresponding OCR'ed texts. It then contrastively learns a
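The alignment step the abstract describes, contrastive pre-training of paired image and text encoders, can be illustrated with a minimal NumPy sketch. This is not the authors' implementation. It assumes the embeddings have already been produced by some vision and language encoders, uses a standard symmetric InfoNCE loss with a hypothetical temperature of 0.07, and uses mean pooling of the two modalities as an illustrative choice, since the abstract is cut off before it describes how the pooled representation is built. The function names are likewise hypothetical.

```python
import numpy as np


def l2_normalize(x, axis=-1, eps=1e-12):
    """Scale vectors to unit length so dot products are cosine similarities."""
    norm = np.linalg.norm(x, axis=axis, keepdims=True)
    return x / np.maximum(norm, eps)


def log_softmax(logits, axis):
    """Numerically stable log-softmax along the given axis."""
    shifted = logits - logits.max(axis=axis, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=axis, keepdims=True))


def symmetric_contrastive_loss(image_emb, text_emb, temperature=0.07):
    """Symmetric InfoNCE loss over a batch of paired embeddings.

    Row i of image_emb and row i of text_emb are assumed to come from the
    same document (an image crop and its OCR'ed text). The temperature
    value is an assumption, not taken from the source.
    """
    img = l2_normalize(image_emb)
    txt = l2_normalize(text_emb)
    logits = img @ txt.T / temperature      # (batch, batch) similarity matrix
    idx = np.arange(logits.shape[0])        # matching pairs sit on the diagonal
    loss_i2t = -log_softmax(logits, axis=1)[idx, idx].mean()  # image -> text
    loss_t2i = -log_softmax(logits, axis=0)[idx, idx].mean()  # text -> image
    return 0.5 * (loss_i2t + loss_t2i)


def pool_embeddings(image_emb, text_emb):
    """Mean-pool the two modalities (an assumed pooling rule), then renormalize."""
    pooled = (l2_normalize(image_emb) + l2_normalize(text_emb)) / 2.0
    return l2_normalize(pooled)


def link_records(query_emb, reference_emb):
    """Link each query to its nearest reference record by cosine similarity."""
    sims = query_emb @ reference_emb.T
    return sims.argmax(axis=1)


# Toy batch: four documents whose image and text embeddings agree exactly.
image_emb = np.eye(4)
text_emb = np.eye(4)
aligned_loss = symmetric_contrastive_loss(image_emb, text_emb)
# Rolling the text rows breaks every pairing, so the loss should rise sharply.
shuffled_loss = symmetric_contrastive_loss(image_emb, np.roll(text_emb, 1, axis=0))
pooled = pool_embeddings(image_emb, text_emb)
links = link_records(pooled, pooled)
```

In this toy batch the correctly paired embeddings give a much lower loss than the shuffled ones, and each pooled record links back to itself. In a real linkage setting the reference embeddings would come from a second dataset rather than from the queries.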
Related papers
- Multimodal Representation Learning Conditioned On Semantic Relations (2025)
- Exploring A Unified Vision-centric Contrastive Alternatives On Multi-modal Web Documents (2025)
- Breaking The Modality Barrier: Universal Embedding Learning With Multimodal Llms (2025)
- Advancing Myopia To Holism: Fully Contrastive Language-image Pre-training (2024)
- Cross The Gap: Exposing The Intra-modal Misalignment In CLIP Via Modality Inversion (2025)
- LLM2CLIP: Powerful Language Model Unlocks Richer Cross-modality Representation (2024)
- Optimizing CLIP Models For Image Retrieval With Maintained Joint-embedding Alignment (2024)
- Explaining And Mitigating The Modality Gap In Contrastive Multimodal Learning (2024)