Abstract
arXiv:2602.21636v2 Announce Type: replace Abstract: Abridged: Clinicians commonly interpret 3D medical images by examining multiple anatomical planes rather than relying on volumetric views. In clinical CT workflows, the axial plane often serves as the primary diagnostic reference, while the auxiliary planes provide complementary spatial context. However, many existing 3D deep learning approaches either process volumetric data holistically or assign equal importance to all planes, failing to reflect this asymmetric, axial-centric interpretation strategy. To address this, we propose an axial-centric cross-plane attention architecture for 3D medical image classification that models asymmetric dependencies between anatomical planes. The architecture employs large-scale axial CT images pretrained MedDINOv3 as a frozen feature extractor for axial, coronal, and sagittal planes. RICA blocks and intra-plane transformer encoders capture plane-specific positional and contextual information, while axial-centric cross-plane transformer encoders selectively condition axial representations on complementary auxiliary representations. Experiments on six datasets from the MedMNIST3D benchmark show that the proposed method consistently outperforms existing 3D and multi-plane models in ACC and AUC. A lightweight variant, AC-Tiny, achieves competitive performance with substantially fewer trainable parameters, suggesting that architectural design contributes more to performance gains than increased model scale. Ablation studies further validate the importance of axial-centric querying, QKV allocation, directional cross-plane fusion, residual-free cross-attention, and classification head design. Slice-level Grad-CAM visualizations demonstrate that the model identifies diagnostically relevant regions across all planes. These findings highlight the value of aligning architectural design with clinical interpretation workflows for robust 3D medical image analysis.