Describe, Adapt And Combine: Empowering CLIP Encoders For Open-set 3D Object Retrieval
2025 · Zhichuan Wang, Yang Zhou, Zhe Liu, et al.
Abstract
Open-set 3D object retrieval (3DOR) is an emerging task aiming to retrieve 3D objects of unseen categories beyond the training set. Existing methods typically utilize all modalities (i.e., voxels, point clouds, multi-view images) and train specific backbones before fusion. However, they still struggle to produce generalized representations due to insufficient 3D training data. Being contrastively pre-trained on web-scale image-text pairs, CLIP inherently produces generalized representations for a wide range of downstream tasks. Building upon it, we present a simple yet effective framework named Describe, Adapt and Combine (DAC) by taking only multi-view images for open-set 3DOR. DAC innovatively synergizes a CLIP model with a multi-modal large language model (MLLM) to learn generalized 3D representations, where the MLLM is used for dual purposes. First, it describes the seen category information to align with CLIP's training objective for adaptation during training. Second, it provides
Related papers
- Distill CLIP (DCLIP): Enhancing Image-text Retrieval Via Cross-modal Transformer Distillation (2025)
- CLIP-MoE: Towards Building Mixture Of Experts For CLIP With Diversified Multiplet Upcycling (2024)
- TeDA: Boosting Vision-language Models For Zero-shot 3D Object Retrieval Via Testing-time Distribution Alignment (2025)
- ScenarioCLIP: Pretrained Transferable Visual Language Models And Action-genome Dataset For Natural Scene Analysis (2025)
- PriorCLIP: Visual Prior Guided Vision-language Model For Remote Sensing Image-text Retrieval (2024)
- Optimizing CLIP Models For Image Retrieval With Maintained Joint-embedding Alignment (2024)
- LLM2CLIP: Powerful Language Model Unlocks Richer Cross-modality Representation (2024)
- OSCAR: Open-set CAD Retrieval From A Language Prompt And A Single Image (2026)