Enhanced Cross-modal 3D Retrieval Via Tri-modal Reconstruction
2025 · Junlong Ren, Hao Wang
Abstract
Cross-modal 3D retrieval is a critical yet challenging task, aiming to achieve bi-directional retrieval between 3D and text modalities. Current methods predominantly rely on a single 3D representation (e.g., point cloud), and few exploit 2D-3D consistency and complementary relationships, which constrains their performance. To bridge this gap, we propose to adopt multi-view images and point clouds to jointly represent 3D shapes, facilitating tri-modal alignment (i.e., image, point, text) for enhanced cross-modal 3D retrieval. Notably, we introduce tri-modal reconstruction to improve the generalization ability of encoders. Given point features, we reconstruct image features under the guidance of text features, and vice versa. With well-aligned point cloud and multi-view image features, we aggregate them as multimodal embeddings through fine-grained 2D-3D fusion to enhance geometric and semantic understanding. Recognizing the significant noise in current datasets where many 3D sh
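The tri-modal alignment mentioned in the abstract (image, point, text) can be illustrated with a small sketch. The abstract does not state which objective the authors use, so the code below assumes a generic symmetric InfoNCE contrastive loss summed over the three modality pairs; the function names, the temperature value of 0.07, and the equal pair weighting are all hypothetical choices made for illustration, not the paper's method.

```python
import math


def l2_normalize(v):
    """Scale a vector to unit length (a zero vector is left unchanged)."""
    norm = math.sqrt(sum(x * x for x in v)) or 1.0
    return [x / norm for x in v]


def info_nce(queries, keys, temperature=0.07):
    """Mean cross-entropy of matching queries[i] to keys[i] among all keys.

    Assumption: row i of both inputs describes the same 3D shape.
    """
    q = [l2_normalize(v) for v in queries]
    k = [l2_normalize(v) for v in keys]
    total = 0.0
    for i, qi in enumerate(q):
        # Cosine similarity of query i to every key, scaled by temperature.
        logits = [sum(a * b for a, b in zip(qi, kj)) / temperature for kj in k]
        # Numerically stable log-sum-exp for the softmax denominator.
        m = max(logits)
        log_denom = m + math.log(sum(math.exp(l - m) for l in logits))
        total += log_denom - logits[i]
    return total / len(q)


def symmetric_info_nce(a, b, temperature=0.07):
    """Average the a-to-b and b-to-a retrieval directions."""
    return 0.5 * (info_nce(a, b, temperature) + info_nce(b, a, temperature))


def tri_modal_alignment_loss(image, point, text, temperature=0.07):
    """Sum of pairwise contrastive losses over the three modalities.

    Hypothetical: equal weights for the three pairs.
    """
    return (
        symmetric_info_nce(image, text, temperature)
        + symmetric_info_nce(point, text, temperature)
        + symmetric_info_nce(image, point, temperature)
    )


if __name__ == "__main__":
    # Toy batch of three shapes with 3-dimensional features per modality.
    aligned = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
    shuffled = [[0.0, 1.0, 0.0], [0.0, 0.0, 1.0], [1.0, 0.0, 0.0]]
    print(tri_modal_alignment_loss(aligned, aligned, aligned))
    print(tri_modal_alignment_loss(aligned, aligned, shuffled))
```

In this toy setup the loss is small when all three modalities agree on which row belongs to which shape and grows when the text features are permuted, which is the behavior an alignment objective of this kind is meant to have.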