Adaptive multi-modal feature fusion for far and hard object detection

LI Yang1,2, GE Hongwei1,2


(1. Jiangsu Provincial Engineering Laboratory of Pattern Recognition and Computational Intelligence, Jiangnan University, Wuxi 214122, China; 2. School of Artificial Intelligence and Computer Science, Jiangnan University, Wuxi 214122, China)


Abstract:  To address the difficulty of detecting far and hard objects caused by the sparseness and insufficient semantic information of LiDAR point clouds, a 3D object detection network with adaptive multi-modal data fusion is proposed, which exploits multi-neighborhood voxel information together with image information. Firstly, an improved ResNet is designed that preserves the structural information of far and hard objects in low-resolution feature maps, making it better suited to the detection task; meanwhile, the semantics of each image feature map are enhanced with semantic information from all subsequent feature maps. Secondly, multi-neighborhood context information with different receptive field sizes is extracted to compensate for the sparseness of the point cloud, which improves the ability of voxel features to represent the spatial structure and semantic information of objects. Finally, a multi-modal feature adaptive fusion strategy is proposed, in which learnable weights express the contribution of each modal feature to the detection task, and voxel attention further strengthens the fused feature representation of valid target objects. Experimental results on the KITTI benchmark show that the proposed method outperforms VoxelNet by a remarkable margin, increasing the AP by 8.78% and 5.49% at the moderate and hard difficulty levels, respectively. It also achieves better detection performance than many mainstream multi-modal methods, e.g. exceeding the AP of MVX-Net by 1% at both the moderate and hard difficulty levels.
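To make the fusion idea concrete, the following is a minimal PyTorch-style sketch, not the authors' implementation; the module name, channel width, and the softmax gating of the two modality weights are illustrative assumptions based only on the description above.

import torch
import torch.nn as nn

class AdaptiveFusion(nn.Module):
    """Fuse a voxel feature and an image feature of equal channel width."""

    def __init__(self, channels: int):
        super().__init__()
        # One learnable scalar per modality, expressing its contribution to detection.
        self.modality_logits = nn.Parameter(torch.zeros(2))
        # Voxel attention: squeeze the fused feature to a per-voxel gate in (0, 1).
        self.voxel_attention = nn.Sequential(
            nn.Linear(2 * channels, channels),
            nn.ReLU(inplace=True),
            nn.Linear(channels, 1),
            nn.Sigmoid(),
        )

    def forward(self, voxel_feat: torch.Tensor, image_feat: torch.Tensor) -> torch.Tensor:
        # voxel_feat, image_feat: (N_voxels, C) features from the two branches.
        w = torch.softmax(self.modality_logits, dim=0)                      # adaptive weights, sum to 1
        fused = torch.cat([w[0] * voxel_feat, w[1] * image_feat], dim=-1)   # (N, 2C)
        gate = self.voxel_attention(fused)                                  # (N, 1)
        return gate * fused                                                  # emphasise useful voxels

# Toy usage: 1 000 non-empty voxels with 64-channel features per modality.
fusion = AdaptiveFusion(channels=64)
out = fusion(torch.randn(1000, 64), torch.randn(1000, 64))
print(out.shape)  # torch.Size([1000, 128])

The softmax keeps the two modality weights positive and normalized, so the network can learn to lean more on the image branch for objects whose voxels carry few points, while the sigmoid gate plays the role of the voxel attention that emphasises voxels belonging to valid target objects.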


Key words: 3D object detection; adaptive fusion; multi-modal data fusion; attention mechanism; multi-neighborhood features


References


[1]Chen X, Ma H, Wan J, et al. Multi-view 3D object detection network for autonomous driving. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, 2017: 1907-1915.
[2]Yang B, Luo W, Urtasun R. PIXOR: real-time 3D object detection from point clouds. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, 2018: 7652-7660.
[3]Zhou Y, Tuzel O. VoxelNet: end-to-end learning for point cloud based 3D object detection. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, 2018: 4490-4499.
[4]Yan Y, Mao Y, Li B. SECOND: sparsely embedded convolutional detection. Sensors, 2018, 18(10): 3337-3354.
[5]Lang A H, Vora S, Caesar H, et al. PointPillars: fast encoders for object detection from point clouds. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Long Beach, CA, 2019: 12697-12705.
[6]Shi S, Wang Z, Shi J, et al. From points to parts: 3D object detection from point cloud with part-aware and part-aggregation network. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2020. DOI: 10.1109/TPAMI.2020.2977026.
[7]Shi S, Guo C, Jiang L, et al. PV-RCNN: point-voxel feature set abstraction for 3D object detection. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, 2020: 10529-10538.
[8]Qi C R, Su H, Mo K, et al. PointNet: deep learning on point sets for 3D classification and segmentation. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, 2017: 652-660.
[9]Qi C R, Yi L, Su H, et al. PointNet++: deep hierarchical feature learning on point sets in a metric space. Advances in Neural Information Processing Systems, 2017, 30: 5099-5108.
[10]Shi S, Wang X, Li H. PointRCNN: 3D object proposal generation and detection from point cloud. In: Proceedings of IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Long Beach, CA, 2019: 770-779.
[11]Yang Z, Sun Y, Liu S, et al. STD: sparse-to-dense 3D object detector for point cloud. In: Proceedings of IEEE/CVF International Conference on Computer Vision (ICCV), Seoul, 2019: 1951-1960.
[12]Qi C R, Litany O, He K, et al. Deep hough voting for 3D object detection in point clouds. In: Proceedings of IEEE/CVF International Conference on Computer Vision (ICCV), Seoul, 2019: 9276-9285.
[13]Ku J, Mozifian M, Lee J, et al. Joint 3D proposal generation and object detection from view aggregation. In: Proceedings of IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), Madrid, 2018: 1-8.
[14]Liang M, Yang B, Chen Y, et al. Multi-task multi-sensor fusion for 3D object detection. In: Proceedings of IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Long Beach, CA, 2019: 7337-7345.
[15]Sindagi V A, Zhou Y, Tuzel O. MVX-Net: multimodal VoxelNet for 3D object detection. In: Proceedings of International Conference on Robotics and Automation (ICRA), Montreal, QC, 2019: 7276-7282.
[16]Qi C R, Liu W, Wu C, et al. Frustum pointNets for 3D object detection from RGB-D data. In: Proceedings of IEEE/CVF Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, 2018: 918-927.
[17]He K, Zhang X, Ren S, et al. Deep residual learning for image recognition. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, 2016: 770-778.
[18]Lin T, Dollár P, Girshick R, et al. Feature pyramid networks for object detection. In: Proceedings of IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Honolulu, HI, 2017: 936-944.
[19]Lin T, Goyal P, Girshick R, et al. Focal loss for dense object detection. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2020, 42(2): 318-327.
[20]Geiger A, Lenz P, Urtasun R. Are we ready for autonomous driving? The KITTI vision benchmark suite. In: Proceedings of IEEE Conference on Computer Vision and Pattern Recognition, Providence, RI, 2012: 3354-3361.
[21]Liang M, Yang B, Wang S, et al. Deep continuous fusion for multi-sensor 3D object detection. In: Proceedings of the European Conference on Computer Vision (ECCV), Munich, Germany, 2018: 641-656.



Adaptive multi-modal feature fusion for far and hard object detection


LI Yang1,2, GE Hongwei1,2


(1. Jiangsu Provincial Engineering Laboratory of Pattern Recognition and Computational Intelligence, Jiangnan University, Wuxi 214122, China; 2. School of Artificial Intelligence and Computer Science, Jiangnan University, Wuxi 214122, China)


Abstract:  To address the difficulty of detecting far and hard objects caused by the sparseness and insufficient semantic information of LiDAR point clouds, a 3D object detection network with adaptive multi-modal data fusion is proposed, which fully fuses the multi-neighborhood context information of voxels with the multi-level semantic information of images. Firstly, an improved residual network better suited to the detection task is designed; while extracting multi-level semantic features from the image, it effectively preserves the structural details of far and hard objects in low-resolution feature maps, and each feature map is further enhanced with semantic information from all subsequent feature maps. Secondly, multi-neighborhood context information with different receptive field sizes is extracted to compensate for the insufficient point cloud information of far and hard objects and to strengthen the structural and semantic information of voxel features, thereby improving the ability of voxel features to represent the spatial structure and semantics of objects as well as their robustness. Finally, a multi-modal feature adaptive fusion strategy is proposed, which uses learnable weights to fuse different modal features adaptively according to their contribution to the detection task; in addition, voxel attention further strengthens the feature representation of valid target objects based on the fused features. Experimental results on the KITTI dataset show that the proposed method outperforms VoxelNet by a clear margin, improving the AP by 8.78% and 5.49% at the moderate and hard difficulty levels, respectively. Compared with many mainstream multi-modal methods, the proposed method also achieves better detection performance on far and hard objects, exceeding the AP of MVX-Net by 1% at both the moderate and hard difficulty levels.
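For the multi-neighborhood context described above, the following is a dense illustrative sketch, again an assumption rather than the paper's code: the ball radii and the max-pooling aggregation are placeholders for whatever receptive field sizes and aggregation the network actually uses.

import torch

def multi_neighborhood_context(voxel_centers, points, point_feats, radii=(0.4, 0.8, 1.6)):
    # voxel_centers: (V, 3) voxel centroids; points: (P, 3); point_feats: (P, C).
    # For each radius (receptive field size), max-pool the features of the points
    # falling inside that ball around every voxel center, then concatenate.
    dists = torch.cdist(voxel_centers, points)                           # (V, P)
    neg_inf = point_feats.new_tensor(float("-inf"))
    contexts = []
    for r in radii:
        inside = (dists <= r).unsqueeze(-1)                              # (V, P, 1)
        feats = torch.where(inside, point_feats.unsqueeze(0), neg_inf)   # (V, P, C)
        pooled = feats.max(dim=1).values                                 # (V, C)
        pooled = pooled.masked_fill(pooled.isinf(), 0.0)                 # empty neighborhoods -> 0
        contexts.append(pooled)
    return torch.cat(contexts, dim=-1)                                   # (V, C * len(radii))

# Toy usage: 200 voxels, 2 000 points with 16-channel features.
ctx = multi_neighborhood_context(torch.rand(200, 3) * 40,
                                 torch.rand(2000, 3) * 40,
                                 torch.randn(2000, 16))
print(ctx.shape)  # torch.Size([200, 48])

A real implementation would use sparse neighborhood queries rather than the dense (V, P, C) tensor built here; the sketch only shows how neighborhoods of increasing radius yield progressively larger receptive fields that enrich sparse voxel features.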


Key words:  3D object detection; adaptive fusion; multi-modal data fusion; attention mechanism; multi-neighborhood features


Citation format:  LI Yang, GE Hongwei. Adaptive multi-modal feature fusion for far and hard object detection. Journal of Measurement Science and Instrumentation, 2021, 12(2): 232-241. DOI: 10.3969/j.issn.1674-8042.2021.02.013

