TriCoLo: Trimodal Contrastive Loss for Text to Shape Retrieval
2022 · Yue Ruan, Han-Hung Lee, Yiming Zhang, et al.
Abstract
Text-to-shape retrieval is an increasingly relevant problem with the growth of 3D shape data. Recent work on contrastive losses for learning joint embeddings over multimodal data has been successful at tasks such as retrieval and classification. Thus far, work on joint representation learning for 3D shapes and text has focused on improving embeddings through modeling of complex attention between representations, or through multi-task learning. We propose a trimodal learning scheme over text, multi-view images, and 3D shape voxels, and show that with large-batch contrastive learning we achieve good performance on text-to-shape retrieval without complex attention mechanisms or losses. Our experiments serve as a foundation for follow-up work on building trimodal embeddings for text-image-shape.
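To make the idea of a trimodal contrastive objective concrete, here is a minimal NumPy sketch. It is not the authors' implementation. It assumes the objective is a sum of symmetric InfoNCE-style terms over the three modality pairs (text-image, text-voxel, image-voxel), that row i of each batch describes the same shape, and that the temperature is 0.1. The abstract specifies none of these details, and all function names are hypothetical.

```python
import numpy as np

def l2_normalize(x, eps=1e-12):
    """Scale each row to unit length so dot products are cosine similarities."""
    return x / (np.linalg.norm(x, axis=1, keepdims=True) + eps)

def log_softmax(logits, axis):
    """Numerically stable log-softmax along the given axis."""
    shifted = logits - logits.max(axis=axis, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=axis, keepdims=True))

def pairwise_contrastive_loss(a, b, temperature=0.1):
    """Symmetric InfoNCE-style loss between two batches of embeddings.

    Row i of `a` and row i of `b` are treated as the positive pair;
    every other row in the batch acts as a negative.
    """
    a, b = l2_normalize(a), l2_normalize(b)
    logits = a @ b.T / temperature              # (batch, batch) similarity matrix
    idx = np.arange(len(a))
    loss_a_to_b = -log_softmax(logits, axis=1)[idx, idx].mean()
    loss_b_to_a = -log_softmax(logits, axis=0)[idx, idx].mean()
    return 0.5 * (loss_a_to_b + loss_b_to_a)

def trimodal_contrastive_loss(text, image, voxel, temperature=0.1):
    """Assumed form: sum of the pairwise losses over all three modality pairs."""
    return (
        pairwise_contrastive_loss(text, image, temperature)
        + pairwise_contrastive_loss(text, voxel, temperature)
        + pairwise_contrastive_loss(image, voxel, temperature)
    )

def retrieve_shapes(text_query, shape_bank, k=5):
    """Rank shape embeddings by cosine similarity to one text embedding."""
    query = l2_normalize(text_query[None, :])[0]
    sims = l2_normalize(shape_bank) @ query
    return np.argsort(-sims)[:k]
```

Because every other item in the batch serves as a negative, a larger batch supplies more negatives per positive pair, which is one plausible reading of why large-batch training helps here. At retrieval time only the text and shape embeddings are needed, so `retrieve_shapes` ranks a bank of shape embeddings against a single text query.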