TextME: Bridging Unseen Modalities Through Text Descriptions
2026 · Soyeon Hong, Jinchan Kim, Jaegook You, et al.
Abstract
Expanding multimodal representations to novel modalities is constrained by reliance on large-scale paired datasets (e.g., text-image, text-audio, text-3D, text-molecule), which are costly to collect and often infeasible to obtain in domains requiring expert annotation, such as medical imaging and molecular analysis. We introduce TextME, to the best of our knowledge the first text-only modality expansion framework, which projects diverse modalities into an LLM embedding space that serves as a unified anchor. Our approach exploits the geometric structure of pretrained contrastive encoders, namely their consistent modality gap, to enable zero-shot cross-modal transfer using only text descriptions, without paired supervision. We empirically validate that such consistent modality gaps exist across image, video, audio, 3D, X-ray, and molecular domains, and demonstrate that text-only training can preserve a substantial share of the pretrained encoders' performance. We further show that our framework enables emergent cross-modal retrieval between modality pairs not explicitly aligned d
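The "modality gap" mentioned in the abstract is often summarized in the literature as the offset between the centroids of text embeddings and non-text embeddings produced by the same contrastive encoder. The sketch below illustrates that general idea on synthetic data. It is not taken from the paper: the function names (`modality_gap`, `shift_text`), the synthetic "image" embeddings, and the choice of a single constant offset are all assumptions made for illustration, and the abstract does not say that TextME computes or uses the gap this way.

```python
import numpy as np


def l2_normalize(x):
    """Scale each row to unit Euclidean length."""
    return x / np.linalg.norm(x, axis=-1, keepdims=True)


def modality_gap(text_emb, other_emb):
    """Centroid offset between two sets of unit-normalized embeddings.

    Hypothetical helper: one common way to summarize a modality gap.
    """
    return l2_normalize(other_emb).mean(axis=0) - l2_normalize(text_emb).mean(axis=0)


def shift_text(text_emb, gap):
    """Move text embeddings by the estimated gap, then re-normalize."""
    return l2_normalize(l2_normalize(text_emb) + gap)


# Synthetic stand-ins for encoder outputs (assumption: no real encoder is used).
rng = np.random.default_rng(0)
dim = 16
offset = np.zeros(dim)
offset[0] = 2.0  # a constant displacement playing the role of the gap

text = l2_normalize(rng.normal(size=(200, dim)))
image = l2_normalize(text + offset + 0.05 * rng.normal(size=(200, dim)))

gap = modality_gap(text, image)
shifted = shift_text(text, gap)

# Mean cosine similarity of matched pairs, before and after the shift.
before = float(np.mean(np.sum(text * image, axis=1)))
after = float(np.mean(np.sum(shifted * image, axis=1)))
print(f"before={before:.3f} after={after:.3f}")
```

In this toy setting the shifted text embeddings sit closer to their paired "image" embeddings than the raw ones do, which is the intuition behind training on text alone when the gap is consistent. Real encoders may need a more careful treatment than a single centroid offset.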
Related papers
- Breaking The Modality Barrier: Universal Embedding Learning With Multimodal LLMs (2025)
- Multilingual-to-multimodal (M2M): Unlocking New Languages With Monolingual Text (2026)
- Towards Cross-modal Text-molecule Retrieval With Better Modality Alignment (2024)
- Memory Enhanced Embedding Learning For Cross-modal Video-text Retrieval (2021)
- Modality Curation: Building Universal Embeddings For Advanced Multimodal Information Retrieval (2025)
- Fill The Gap: Quantifying And Reducing The Modality Gap In Image-text Representation Learning (2025)
- Thin Bridges For Drug Text Alignment: Lightweight Contrastive Learning For Target Specific Drug Retrieval (2025)
- MATE: Meet At The Embedding -- Connecting Images With Long Texts (2024)