CSA: Data-efficient Mapping Of Unimodal Features To Multimodal Features
2024 · Po-Han Li, Sandeep P. Chinchali, Ufuk Topcu
Abstract
Multimodal encoders like CLIP excel in tasks such as zero-shot image classification and cross-modal retrieval. However, they require excessive training data. We propose canonical similarity analysis (CSA), which uses two unimodal encoders to replicate multimodal encoders using limited data. CSA maps unimodal features into a multimodal space, using a new similarity score to retain only the multimodal information. CSA only involves the inference of unimodal encoders and a cubic-complexity matrix decomposition, eliminating the need for extensive GPU-based model training. Experiments show that CSA outperforms CLIP while requiring \(50,000\times\) fewer multimodal data pairs to bridge the modalities given pre-trained unimodal encoders on ImageNet classification and misinformative news caption detection. CSA surpasses the state-of-the-art method to map unimodal features to multimodal features. We also demonstrate the ability of CSA with modalities beyond image and text, paving the way for fu
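The abstract does not say which matrix decomposition or similarity score CSA uses, so the following is only a minimal sketch under stated assumptions: it assumes a regularized linear canonical correlation analysis (CCA) solved with an SVD, and a cosine-style similarity weighted by the canonical correlations and truncated to the top `s` dimensions. The function names `fit_cca` and `weighted_similarity`, the `reg` parameter, and the choice of `s` are hypothetical, and the inputs `X` and `Y` stand in for features already produced by two frozen unimodal encoders.

```python
import numpy as np

def fit_cca(X, Y, reg=1e-3):
    """Fit regularized linear CCA on paired features X (n, dx) and Y (n, dy).

    Assumption: this stands in for the "cubic-complexity matrix
    decomposition" mentioned in the abstract; it is not the authors' code.
    """
    mx, my = X.mean(axis=0), Y.mean(axis=0)
    Xc, Yc = X - mx, Y - my
    n = X.shape[0]
    # Regularized covariance and cross-covariance estimates.
    Cxx = Xc.T @ Xc / (n - 1) + reg * np.eye(X.shape[1])
    Cyy = Yc.T @ Yc / (n - 1) + reg * np.eye(Y.shape[1])
    Cxy = Xc.T @ Yc / (n - 1)

    def inv_sqrt(C):
        # Inverse square root of a symmetric positive-definite matrix.
        w, V = np.linalg.eigh(C)
        return V @ np.diag(w ** -0.5) @ V.T

    Wx, Wy = inv_sqrt(Cxx), inv_sqrt(Cyy)
    # Singular values of the whitened cross-covariance are the
    # canonical correlations, sorted in descending order.
    U, rho, Vt = np.linalg.svd(Wx @ Cxy @ Wy, full_matrices=False)
    return {"mx": mx, "my": my, "A": Wx @ U, "B": Wy @ Vt.T, "rho": rho}

def weighted_similarity(model, x, y, s):
    """Correlation-weighted cosine similarity over the top-s canonical dims.

    Assumption: a hypothetical stand-in for the similarity score the
    abstract describes as retaining only the multimodal information.
    """
    zx = (x - model["mx"]) @ model["A"][:, :s]
    zy = (y - model["my"]) @ model["B"][:, :s]
    w = model["rho"][:s]
    num = np.sum(w * zx * zy, axis=-1)
    den = np.linalg.norm(zx, axis=-1) * np.linalg.norm(zy, axis=-1) + 1e-12
    return num / den
```

In this sketch the only fitted quantities are two projection matrices and a vector of correlations, which is consistent with the abstract's claim that no GPU-based training is needed. Truncating to `s` dimensions discards weakly correlated directions, which are presumed here to carry modality-specific rather than shared information.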
Related papers
- ASIF: Coupled Data Turns Unimodal Models To Multimodal Without Training (2022)
- Cross The Gap: Exposing The Intra-modal Misalignment In CLIP Via Modality Inversion (2025)
- Breaking The Modality Barrier: Universal Embedding Learning With Multimodal LLMs (2025)
- Isoclip: Decomposing CLIP Projectors For Efficient Intra-modal Alignment (2026)
- Multimodal Representation Learning Conditioned On Semantic Relations (2025)
- Linking Representations With Multimodal Contrastive Learning (2023)
- Adversarial Cross-modal Retrieval Via Learning And Transferring Single-modal Similarities (2019)
- Modality Curation: Building Universal Embeddings For Advanced Multimodal Information Retrieval (2025)