No Captions, No Problem: Captionless 3D-CLIP Alignment With Hard Negatives Via CLIP Knowledge And LLMs
2024 · Cristian Sbrolli, Matteo Matteucci
Abstract
In this study, we explore an alternative approach to enhance contrastive text-image-3D alignment in the absence of textual descriptions for 3D objects. We introduce two unsupervised methods, \(I2I\) and \((I2L)^2\), which leverage CLIP knowledge about textual and 2D data to compute the neural perceived similarity between two 3D samples. We employ the proposed methods to mine 3D hard negatives, establishing a multimodal contrastive pipeline with hard negative weighting via a custom loss function. We train on different configurations of the proposed hard negative mining approach, and we evaluate the accuracy of our models in 3D classification and on the cross-modal retrieval benchmark, testing image-to-shape and shape-to-image retrieval. Results demonstrate that our approach, even without explicit text alignment, achieves comparable or superior performance on zero-shot and standard 3D classification, while significantly improving both image-to-shape and shape-to-image retrieval compared
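The abstract describes mining hard negatives from a perceived-similarity score and weighting them in a contrastive loss, but it does not give the loss itself. The following is a minimal sketch of that general idea, not the authors' implementation. It assumes plain cosine similarity between precomputed image embeddings as a stand-in for the \(I2I\) score, and an InfoNCE-style loss in which hard negatives are up-weighted in the softmax denominator. The names `mine_hard_negatives`, `weighted_info_nce`, `hard_weight`, and `temperature` are hypothetical.

```python
import numpy as np


def _l2_normalize(x):
    """Scale each row to unit length so dot products are cosine similarities."""
    return x / np.linalg.norm(x, axis=1, keepdims=True)


def mine_hard_negatives(image_emb, k):
    """Return, for each sample, the indices of its k most similar other samples.

    Assumption: cosine similarity between image embeddings stands in for
    the similarity score that the paper derives from CLIP.
    """
    z = _l2_normalize(image_emb)
    sim = z @ z.T
    np.fill_diagonal(sim, -np.inf)  # a sample is never its own negative
    return np.argsort(-sim, axis=1)[:, :k]


def weighted_info_nce(shape_emb, image_emb, hard_idx, hard_weight=2.0, temperature=0.07):
    """Shape-to-image InfoNCE where mined hard negatives count hard_weight times.

    Assumption: this weighting scheme is illustrative; the paper's custom
    loss may differ.
    """
    s = _l2_normalize(shape_emb)
    v = _l2_normalize(image_emb)
    logits = (s @ v.T) / temperature  # (N, N); the diagonal holds the positives

    n = logits.shape[0]
    weights = np.ones((n, n))
    weights[np.arange(n)[:, None], hard_idx] = hard_weight

    # Adding log(w) to a logit multiplies its exp() term by w in the denominator.
    logits = logits + np.log(weights)
    logits = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -np.mean(np.diag(log_prob))


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    image_emb = rng.normal(size=(8, 16))  # placeholder for 2D-view embeddings
    shape_emb = rng.normal(size=(8, 16))  # placeholder for 3D encoder outputs
    hard_idx = mine_hard_negatives(image_emb, k=3)
    print(weighted_info_nce(shape_emb, image_emb, hard_idx))
```

With `hard_weight=1.0` the function reduces to a standard InfoNCE loss, so the weight acts as a single knob for how strongly the mined negatives are penalized.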
Related papers
- TripletCLIP: Improving Compositional Reasoning Of CLIP Via Synthetic Vision-language Negatives (2024)
- FG-CLIP: Fine-grained Visual And Textual Alignment (2025)
- Optimizing CLIP Models For Image Retrieval With Maintained Joint-embedding Alignment (2024)
- ContextCLIP: Contextual Alignment Of Image-text Pairs On CLIP Visual Representations (2022)
- CLIP-Lite: Information Efficient Visual Representation Learning With Language Supervision (2021)
- Contrasting Intra-modal And Ranking Cross-modal Hard Negatives To Enhance Visio-linguistic Compositional Understanding (2023)
- Linear Alignment Of Vision-language Models For Image Captioning (2023)
- MedCLIP: Contrastive Learning From Unpaired Medical Images And Text (2022)