FusionBERT: Multi-View Image-3D Retrieval via Cross-Attention Visual Fusion and Normal-Aware 3D Encoder
2026 · Wei Li, Yufan Ren, Hanqing Jiang, et al.
Abstract
We propose FusionBERT, a novel multi-view visual fusion framework for image-3D multimodal retrieval. Existing image-3D representation learning methods predominantly align the features of a single object image with its 3D model, which limits their applicability in realistic scenarios where an object is typically observed and captured from multiple viewpoints. Although multi-view observations naturally provide complementary geometric and appearance cues, existing multimodal large models rarely explore how to fuse such multi-view visual information effectively for better cross-modal retrieval. To address this limitation, FusionBERT uses a cross-attention-based multi-view visual aggregator to adaptively integrate features from multi-view images of an object. The proposed multi-view visual encoder fuses inter-view complementary relationships and selectively emphasizes informative visual cues across multi
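The cross-attention aggregation described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function names (`fuse_views`, `cosine`), the single fixed query vector, and the absence of learned key/value projections are all assumptions made for illustration. It shows only the general idea of a query attending over per-view features so that more relevant views receive larger weights in the fused embedding.

```python
import math


def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def fuse_views(query, view_features):
    """Single-head cross-attention pooling (hypothetical helper).

    `query` stands in for a learned query vector; each entry of
    `view_features` is the feature vector of one view. Keys and values
    are the raw view features (no learned projections, an assumption).
    Returns the fused vector and the per-view attention weights.
    """
    if not view_features:
        raise ValueError("need at least one view")
    d = len(query)
    scale = math.sqrt(d)
    # Scaled dot-product score of the query against every view.
    scores = [dot(query, v) / scale for v in view_features]
    weights = softmax(scores)
    # Weighted sum of the view features, one dimension at a time.
    fused = [sum(w * v[i] for w, v in zip(weights, view_features))
             for i in range(d)]
    return fused, weights


def cosine(a, b):
    """Cosine similarity, a common score for ranking retrieval candidates."""
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))


if __name__ == "__main__":
    views = [[2.0, 0.0], [0.0, 2.0], [1.0, 1.0]]  # toy per-view features
    fused, weights = fuse_views([1.0, 0.0], views)
    print("attention weights:", weights)
    print("fused embedding:", fused)
```

In a retrieval setting, the fused image embedding would then be compared (for example with `cosine`) against embeddings produced by a 3D encoder; that encoder is outside the scope of this sketch.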
Related papers
- Enhanced Cross-modal 3D Retrieval Via Tri-modal Reconstruction (2025) · 0.00
- Joint Fusion And Encoding: Advancing Multimodal Retrieval From The Ground Up (2025) · 0.00
- Retrieval-guided Cross-view Image Synthesis (2024) · 0.00
- Everything At Once -- Multi-modal Fusion Transformer For Video Retrieval (2021) · 15.78
- DAFM: Dynamic Adaptive Fusion For Multi-model Collaboration In Composed Image Retrieval (2025) · 0.00
- Cross-modal Fusion Distillation For Fine-grained Sketch-based Image Retrieval (2022) · 2.68
- Mire: Enhancing Multimodal Queries Representation Via Fusion-free Modality Interaction For Multimodal Retrieval (2024) · 3.81
- Generalized Multi-view Embedding For Visual Recognition And Cross-modal Retrieval (2016) · 14.69