
General and robust voxel feature learning with Transformer for 3D object detection


LI Yang1,2, GE Hongwei1,2


(1. Jiangsu Provincial Engineering Laboratory of Pattern Recognition and Computational Intelligence, Jiangnan University, Wuxi 214122, China; 2. School of Artificial Intelligence and Computer Science, Jiangnan University, Wuxi 214122, China)


Abstract: The selfattention networks and Transformer have dominated machine translation and natural language processing fields, and shown great potential in image vision tasks such as image classification and object detection. Inspired by the great progress of Transformer, we propose a novel general and robust voxel feature encoder for 3D object detection based on the traditional Transformer. We first investigate the permutation invariance of sequence data of the selfattention and apply it to point cloud processing. Then we construct a voxel feature layer based on the selfattention to adaptively learn local and robust context of a voxel according to the spatial relationship and context information exchanging between all points within the voxel. Lastly, we construct a general voxel feature learning framework with the voxel feature layer as the core for 3D object detection. The voxel feature with Transformer (VFT) can be plugged into any other voxelbased 3D object detection framework easily, and serves as the backbone for voxel feature extractor. Experiments results on the KITTI dataset demonstrate that our method achieves the stateoftheart performance on 3D object detection. 



Key words: 3D object detection; self-attention networks; voxel feature with Transformer (VFT); point cloud; encoder-decoder



Object detection based on general and robust voxel feature learning with Transformer


LI Yang1,2, GE Hongwei1,2


(1. Jiangsu Provincial Engineering Laboratory of Pattern Recognition and Computational Intelligence, Jiangnan University, Wuxi 214122, China; 2. School of Artificial Intelligence and Computer Science, Jiangnan University, Wuxi 214122, China)


Abstract: Self-attention networks and the Transformer have dominated the fields of machine translation and natural language processing, and have shown great potential in image vision tasks such as image classification and object detection. Inspired by the great progress of the Transformer in 2D image vision tasks, a novel and robust voxel feature encoder based on the traditional Transformer is proposed. First, the permutation invariance of self-attention over sequence data is investigated and applied to point cloud processing. Second, a voxel feature layer is constructed on the basis of self-attention, which adaptively learns the local and robust context of a voxel from the spatial relationships and context information exchanged among all points within the voxel. Finally, a general 3D object detection framework is built with the voxel feature layer as its core. The voxel feature learning with Transformer (VFT) module is a general voxel feature extractor that can be plugged into any other voxel-based 3D object detection framework. Experimental results on the KITTI dataset show that the proposed method achieves superior performance on 3D object detection.


Key words: 3D object detection; self-attention networks; voxel feature learning with Transformer; point cloud; encoder-decoder


Citation format: LI Yang, GE Hongwei. General and robust voxel feature learning with Transformer for 3D object detection. Journal of Measurement Science and Instrumentation, 2022, 13(1): 51-60. DOI: 10.3969/j.issn.1674-8042.2022.01.006


