Abstract
arXiv:2410.12673v3

Accurate 3D object detection in autonomous driving relies on Bird's Eye View (BEV) perception and effective temporal fusion. However, existing fusion strategies based on convolutional layers or deformable self-attention struggle to model global context in BEV space, leading to reduced accuracy for large objects. To address this limitation, we propose MambaBEV, a novel BEV-based 3D object detection model that leverages Mamba2, an advanced state-space model (SSM) optimized for long-sequence processing. Our key contribution is TemporalMamba, a temporal fusion module that enhances global context modeling through a BEV feature discrete rearrangement mechanism tailored for sequential processing. In addition, we introduce a Mamba-based DETR head to improve multi-object representation. Evaluations on the nuScenes dataset demonstrate that MambaBEV-base achieves 51.7% NDS and 42.7% mAP. Furthermore, evaluation within an end-to-end autonomous driving paradigm validates its effectiveness in motion forecasting and planning. These results highlight the potential of state-space models for improving global context understanding and large-object detection in autonomous driving perception systems.