SCP: Scene Completion Pre-training for 3D Object Detection

3D object detection using LiDAR point clouds is a fundamental task in computer vision, robotics, and autonomous driving. However, existing 3D detectors rely heavily on annotated datasets, and labeling 3D bounding boxes is both time-consuming and error-prone. In this paper, we propose a Scene Completion Pre-training (SCP) method to enhance the performance of 3D object detectors with less labeled data. SCP offers three key advantages: (1) Improved initialization of the point cloud model. By completing the scene point clouds, SCP effectively captures the spatial and semantic relationships among objects within urban environments. (2) No need for additional datasets. SCP serves as a valuable auxiliary network that imposes no additional effort or data requirements on the 3D detectors. (3) Reduced labeled data for detection. With the help of SCP, existing state-of-the-art 3D detectors achieve comparable performance while relying on only 20% of the labeled data.


INTRODUCTION
3D object detection using LiDAR point clouds is a key task in the domains of computer vision (Mao et al., 2021), robotics (Zhou and Tuzel, 2018), and autonomous driving (Yan et al., 2018). In contrast to 2D images, point clouds obtained through mobile laser scanning (MLS) offer accurate 3D geometric properties and depth insights (Xia et al., 2021c), endowing them with superior resilience for object detection under diverse illumination conditions (Xia et al., 2023b). Over the past few years, numerous learning-based 3D detection techniques have exhibited remarkable performance by leveraging extensive supervised training on large annotated datasets. Nevertheless, the annotation of point clouds presents a substantial challenge due to (1) inherent incompleteness and occlusion, which render the identification of points ambiguous (Li et al., 2023); and (2) the time-consuming and error-prone nature of labeling individual points or delineating 3D bounding boxes (Wang et al., 2021).
A possible solution is to learn a 3D model initialization via unsupervised pre-training and then fine-tune the model with a small amount of labeled data. A recent line of pre-training works based on generative adversarial networks (GANs) has been proposed. (Sauder and Sievers, 2019) learns a rearrangement of the point clouds by predicting the original voxel location of each point. However, this approach cannot handle rotated and translated point clouds effectively due to the permutation variability exhibited in their voxel representations. Furthermore, (Sharma and Kaul, 2020) explores classifying each point into assigned partitions based on cover trees. However, it ignores semantically contiguous regions (e.g., airplane wings, car tires). In addition, PointContrast (Xie et al., 2020) leverages established point-wise correspondences between various views of a complete 3D scene to pre-train weights for point clouds. However, this approach may not be suitable for dynamic urban environments. Recently, a novel approach has been proposed to learn model initialization by completing the 3D shape of single objects (Wang et al., 2021). This method demonstrates notable advancements in various downstream tasks, such as object classification and segmentation, by effectively completing individual objects. However, it overlooks the significance of both spatial and semantic relationships among objects, which are crucial considerations for successful 3D detection tasks in complex urban environments.
To tackle this problem, we propose a novel Scene Completion Pre-training network, named SCP, aiming to learn a robust model initialization for 3D object detection from single LiDAR scans. Our SCP trains a voxel-based scene completion network consisting of a feature encoder and a decoder. The encoder utilizes a Transformer-based 3D backbone (Mao et al., 2021) to efficiently extract informative features from the raw point clouds. Concurrently, the decoder incorporates an anisotropic convolution (AIC) module (Li et al., 2020), which dynamically adapts the receptive fields for different voxels. By completing the scene point cloud, our method learns the spatial and semantic relationships among objects. This enables the pre-trained model to serve as an effective initialization when employed as the 3D backbone of a 3D detection network. Finally, a small amount of labeled data is used to fine-tune the 3D detection network.
To summarize, the main contributions of this work are: • We propose a voxel-based Scene Completion Pre-training network, called SCP, which applies purely to LiDAR point clouds. With the help of the carefully designed encoder and decoder, SCP provides a robust model initialization for the downstream detection network, encoding the spatial and semantic relationships within urban environments.
• We conduct extensive experiments on the KITTI 3D detection benchmark (Geiger et al., 2013) to demonstrate the effectiveness of our SCP. Notably, existing state-of-the-art methods with SCP yield comparable performance while relying on only 20% of the labeled data.

RELATED WORK
In this section, we briefly review the literature on 3D object detection and scene completion.
3D object detection. Early 3D object detection works (Shi et al., 2019, Yang et al., 2020, Mao et al., 2021, Zhou and Tuzel, 2018, Yan et al., 2018, Shi et al., 2020, Xia et al., 2023b) can be broadly categorized into two main families: point-based and voxel-based detectors. The point-based approaches directly capture features and predict 3D bounding boxes from the raw points. PointRCNN (Shi et al., 2019) extracts features from the foreground points and derives the corresponding 3D bounding box. 3DSSD (Yang et al., 2020) removes the FP layer and refinement module to reduce computational complexity and proposes a new fusion sampling strategy that yields improved results using fewer representative points.
VoTr (Mao et al., 2021) introduces a voxel-transformer-based 3D detection backbone, presenting an alternative solution to the task of 3D object detection. The voxel-based approaches transform the large, unstructured point cloud into voxels, which enables efficient feature extraction and saves computation time. VoxelNet (Zhou and Tuzel, 2018) and SECOND (Yan et al., 2018), for instance, partition the points into voxels and utilize 3D sparse convolution to extract features. Subsequently, they employ a Region Proposal Network (RPN) to obtain 3D bounding boxes. On the other hand, PV-RCNN (Shi et al., 2020) combines the strengths of both approaches. It leverages multi-scale techniques to generate high-quality proposals from voxel-based methods while also incorporating fine-grained local information from point-based methods. Recently, DMT (Xia et al., 2023b) explores motion prior knowledge to generate accurate 3D positions and rotations.
3D scene completion. Early works on 3D completion mainly focus on single objects (Xia et al., 2021a, Wang et al., 2022, Xia et al., 2021c). Comparably, completing a whole scene poses greater challenges since the scene point cloud is large-scale and contains many objects with various densities. The pioneering work by Song (Song et al., 2017) explores depth maps for 3D scene completion and leverages the scene information derived from the depth map for semantic segmentation. Scene completion and semantic segmentation are closely intertwined tasks, and jointly processing them can yield mutual performance improvements.

METHOD
The overview of our SCP for 3D object detection can be divided into two stages, as illustrated in Fig. 1. In the first step (scene completion pre-training), we employ an encoder-decoder model to complete the partial scene point clouds. This involves leveraging the available data to predict and generate the missing parts of the point cloud, resulting in a more comprehensive representation of the scene. In the second step (3D object detection), we utilize the learned weights from the scene completion pre-training model as an initialization for the 3D detectors. By transferring the knowledge acquired during scene completion, we establish a strong spatial and semantic relationship of objects, leading to improved detection performance and efficiency, especially in the case of smaller labeled datasets.
Scene completion pre-training. We provide the detailed pipeline of SCP in Fig. 2. It is divided into four stages: voxelization, encoder, decoder, and prediction. Encoder. We use a transformer-based 3D backbone network to extract features from voxels. The architecture of the 3D backbone is the same as VoTr (Mao et al., 2021), as illustrated in Fig. 3. The voxels undergo a sequence of three "VoTr Block" layers. Each block layer consists of one sparse voxel module and two submanifold voxel modules. As the voxels pass through each block layer, their features are effectively extracted, and the voxels are downsampled three times. Both non-empty and empty voxels are present. The submanifold voxel modules handle the non-empty voxels and utilize self-attention mechanisms (Xia et al., 2021b, Xia et al., 2023a) to extract features from them. On the other hand, the sparse voxel modules are specifically designed for empty voxels, allowing feature extraction in these regions. 3D object detection. From the first step, we acquire well-performing pre-training weights, which are then utilized to initialize the 3D detector (Mao et al., 2021). The pipeline of the detector VoTr is shown in Fig. 4. This knowledge transfer enables the utilization of valuable insights gained from scene completion to improve the accuracy and reliability of 3D detection results.
In the initial step, the raw point cloud undergoes voxelization, converting it into a structured voxel representation. The voxelized data are then fed into the 3D backbone, which is the same as in the scene completion network. The extracted features are projected onto a bird's-eye view (BEV) map to generate 3D proposals, followed by a 2D backbone and a detection head for further processing. By leveraging the knowledge learned from scene completion, the pre-trained weights provide a deeper understanding of the scene, resulting in improved accuracy and reliability of the 3D detection results.
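The voxelization step described above can be sketched as a simple occupancy-grid conversion. The snippet below is a minimal, hypothetical illustration; the voxel size and point cloud range are placeholder values, not the paper's actual configuration:

```python
import numpy as np

def voxelize(points, voxel_size, pc_range):
    """Convert an (N, 3) point cloud into a dense boolean occupancy grid.

    points:     (N, 3) array of x, y, z coordinates.
    voxel_size: edge length of a cubic voxel (placeholder value below).
    pc_range:   (xmin, ymin, zmin, xmax, ymax, zmax) crop region.
    """
    lo = np.asarray(pc_range[:3], dtype=np.float64)
    hi = np.asarray(pc_range[3:], dtype=np.float64)
    # Keep only points inside the crop region.
    mask = np.all((points >= lo) & (points < hi), axis=1)
    pts = points[mask]
    # Map each surviving point to integer voxel indices.
    idx = np.floor((pts - lo) / voxel_size).astype(np.int64)
    dims = np.ceil((hi - lo) / voxel_size).astype(np.int64)
    grid = np.zeros(dims, dtype=bool)
    grid[idx[:, 0], idx[:, 1], idx[:, 2]] = True
    return grid

# Two points inside a 4 m cube with 1 m voxels; the third point is cropped out.
pts = np.array([[0.5, 0.5, 0.5], [2.2, 3.1, 0.9], [9.0, 9.0, 9.0]])
grid = voxelize(pts, voxel_size=1.0, pc_range=(0, 0, 0, 4, 4, 4))
```

In practice the detector works on a sparse list of non-empty voxels rather than a dense grid, but the index computation is the same.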

EXPERIMENTS
This section first provides a comprehensive description of the datasets used for both the pre-training and fine-tuning detection stages. Subsequently, we present and discuss the detection results, shedding light on the performance and effectiveness of our SCP.

Datasets
SemanticKITTI dataset. We utilize SemanticKITTI (Behley et al., 2019) for scene completion pre-training. KITTI detection benchmark. To evaluate the effectiveness of our SCP in 3D object detection, we use the KITTI benchmark (Geiger et al., 2013), which comprises 7,481 samples for training and 7,518 for testing. Following VoTr (Mao et al., 2021), the training dataset is further subdivided into 3,712 samples for training and 3,769 samples for validation (Chen et al., 2015).

Evaluation metric
Following (Hossin and Sulaiman, 2015), we first employ the Intersection over Union (IoU) metric for 3D scene completion. For 3D object detection, we then use the Average Precision with 11 recall points (AP11) for performance assessment.

Intersection over union (IoU).
IoU is a metric used to measure the overlap between a model's predictions and the ground truth.
It quantifies the degree of alignment between the two sets. An IoU of 0 indicates no overlap, meaning there is no intersection between the predicted regions and the ground truth. Conversely, an IoU of 1 represents a complete overlap, where the predicted regions perfectly match the ground truth. In practice, a larger IoU indicates better algorithm performance. The IoU is calculated as follows:

IoU = TP / (TP + FP + FN) (1)

where TP (True Positive) denotes the correct prediction of the ground truth, FP (False Positive) denotes incorrectly predicting an object that is not in the ground truth, and FN (False Negative) denotes the ground truth not being predicted.
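As a concrete illustration, the voxel-wise IoU of Eq. (1) can be computed from two boolean occupancy grids. This is a minimal sketch with toy arrays, not the evaluation code used in the experiments:

```python
import numpy as np

def voxel_iou(pred, gt):
    """IoU = TP / (TP + FP + FN) over two boolean voxel grids."""
    pred = np.asarray(pred, dtype=bool)
    gt = np.asarray(gt, dtype=bool)
    tp = np.logical_and(pred, gt).sum()      # predicted and present
    fp = np.logical_and(pred, ~gt).sum()     # predicted but absent
    fn = np.logical_and(~pred, gt).sum()     # present but missed
    denom = tp + fp + fn
    return tp / denom if denom > 0 else 1.0  # empty-vs-empty counts as perfect

pred = np.array([[1, 0], [0, 1]])
gt   = np.array([[1, 1], [1, 0]])
# TP = 1, FP = 1, FN = 2  ->  IoU = 1 / 4
```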
Average precision 11 (AP11). The Average Precision (AP) is computed by integrating the Precision-Recall (PR) curve. To approximate the PR curve, the 11-point interpolation method is commonly employed, sampling the curve at 11 equally spaced recall levels between 0 and 1. The formula for AP11 is as follows:

AP11 = (1/11) Σ_{R∈{0,0.1,...,0.9,1}} P_interp(R) (2)

where P_interp(R) is the maximum precision among points with recall greater than or equal to R.
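The 11-point interpolation of Eq. (2) is simple to implement directly. Below is a minimal sketch that takes an already-computed list of (recall, precision) operating points; the numbers at the end are toy values, not results from the paper:

```python
def ap11(recalls, precisions):
    """Eq. (2): average the interpolated precision at R = 0.0, 0.1, ..., 1.0.

    P_interp(R) is the maximum precision among points with recall >= R
    (0 if no such point exists).
    """
    total = 0.0
    for i in range(11):
        r = i / 10.0
        candidates = [p for rec, p in zip(recalls, precisions) if rec >= r]
        total += max(candidates) if candidates else 0.0
    return total / 11.0

# Toy PR points: P_interp = 1.0 for R <= 0.2, 0.8 for R <= 0.4, 0.6 for R <= 0.6,
# and 0 beyond, so AP11 = (3*1.0 + 2*0.8 + 2*0.6) / 11.
ap = ap11([0.2, 0.4, 0.6], [1.0, 0.8, 0.6])
```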

3D object detection
During this stage, the objective is to train and optimize the components of the detection network that complement the fully trained 3D backbone. The pre-training weights from scene completion provide a solid starting point and enable efficient transfer of prior knowledge. By building upon the already well-performing 3D backbone, we can concentrate on refining and fine-tuning the other components of the detection network.
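In practice this transfer usually amounts to copying the backbone entries of the pre-trained checkpoint into the detector's parameter dictionary. The sketch below illustrates the idea with plain dictionaries; the `backbone.` key prefix is a hypothetical naming convention, not the paper's actual checkpoint layout:

```python
def transfer_backbone_weights(pretrained, detector, prefix="backbone."):
    """Copy all pre-trained entries whose keys start with `prefix` into the
    detector's parameter dict; everything else in the detector is untouched
    and will be fine-tuned from its current (randomly initialized) values."""
    updated = dict(detector)
    for key, value in pretrained.items():
        if key.startswith(prefix) and key in updated:
            updated[key] = value
    return updated

# Hypothetical parameter dicts (strings stand in for weight tensors).
pretrained = {"backbone.block1.w": "pretrained_w1", "decoder.aic.w": "aic_w"}
detector   = {"backbone.block1.w": "random_w1", "head.cls.w": "random_cls"}
merged = transfer_backbone_weights(pretrained, detector)
```

Note that the completion decoder's weights are deliberately left behind: only the shared backbone is reused by the detector.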
To evaluate the effectiveness of our SCP, we employ the entire validation dataset and compare it with the fully trained 3D detectors.

Implementation details
The pre-training scene completion module employs the Adam optimizer with a batch size of 3 and an initial learning rate of 0.001. The learning rate changes dynamically during training based on the number of training rounds. Training uses 20% of the SemanticKITTI dataset for approximately 50 epochs. For the 3D detection module, the Adam optimizer with a one-cycle learning rate schedule is employed, with a batch size of 6 for SCP-SSD and 3 for SCP-TSD and a learning rate of 0.003. The SCP-SSD model is trained for 100 epochs, while the SCP-TSD model is trained for 80 epochs. All experiments are conducted on Quadro RTX 8000 and NVIDIA A40 GPUs, each with 48 GB of memory.
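For reference, a one-cycle schedule ramps the learning rate up to its peak and then anneals it back down over the course of training. The following is a minimal sketch of such a schedule; the warm-up fraction and division factor are illustrative assumptions, not the paper's exact settings:

```python
import math

def one_cycle_lr(step, total_steps, max_lr=0.003, pct_start=0.4, div=10.0):
    """Cosine one-cycle schedule: rise from max_lr/div to max_lr over the
    first pct_start of training, then anneal back down to max_lr/div."""
    peak_step = pct_start * total_steps
    if step <= peak_step:
        t = step / peak_step                                # warm-up, t in [0, 1]
        lo, hi = max_lr / div, max_lr
    else:
        t = (step - peak_step) / (total_steps - peak_step)  # anneal, t in [0, 1]
        lo, hi = max_lr, max_lr / div
    # Cosine interpolation between lo and hi.
    return lo + (hi - lo) * (1 - math.cos(math.pi * t)) / 2

schedule = [one_cycle_lr(s, total_steps=100) for s in range(101)]
```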

Results
We conduct a comparative analysis by combining our SCP with the state-of-the-art 3D detector VoTr (Mao et al., 2021). To ensure a fair comparison, we adopt the same evaluation metrics, following (Mao et al., 2021). Table 2 presents the AP11 achieved by each method on the KITTI validation split for the car category, and Table 1 illustrates the results of scene completion. Despite utilizing a relatively small amount (20%) of training data, our SCP exhibits remarkable performance in object detection. Remarkably, the reduction in training data does not lead to a significant degradation in the detection performance of the VoTr-SCP model: our approach maintains a detection performance that is comparable to, and in some cases even surpasses, the results obtained through full-data training.
Here, we report detection results only for the car category.
Specifically, our SCP demonstrates notable improvements in the easy car class, elevating the performance from 89.04 to 89.10 compared to VoTr-TSD. This finding highlights the effectiveness of our approach in enhancing the detection results. Additionally, we provide a visualization of a scene completion result in Fig. 6.

DISCUSSIONS
Different labeled data volumes. As the volume of labeled data increases, the performance of the models improves. Notably, our experimental results highlight the robustness and efficacy of our model even when trained with very small amounts of data: at data volumes as low as 10% and 20% of the training set, our model consistently achieves commendable results. Different completion effects. To investigate the influence of scene completions on the 3D detection performance, we conducted an in-depth analysis by selecting two completion results with different intersection over union (IoU): one with an IoU of 51.6% and another with an IoU of 54.1%. We then compared the performance of these completion results across different data volumes. Table 5 and Table 6 demonstrate that different scene completion results yield varying effects on 3D detection for the VoTr-SSD-SCP and VoTr-TSD-SCP frameworks, respectively. Notably, when utilizing the completion result with better quality (IoU = 54.1%), the detection results show significant improvements on the 10% and 20% datasets. We thus draw two conclusions: (1) Accurate and reliable scene completions play a critical role in improving the overall detection performance, particularly in scenarios with very limited data.
(2) As the completion achieves better results, it becomes increasingly beneficial for 3D object detection. In this paper, we propose SCP, a novel scene completion pre-training network for 3D object detection. We have demonstrated that scene completion can learn a model initialization that helps 3D detectors trained on only a small amount of data. Experiments indicate that the quality of the scene completion correlates positively with the effectiveness of object detection. In the future, we hope to extend scene completion pre-training to more downstream tasks, for example, 3D semantic segmentation and 3D forecasting.

Figure 1. Overview of our SCP. Step 1: An encoder-decoder scene completion model is designed to complete the raw point cloud. Step 2: Pre-training model weights from step 1 are used for the following 3D detection task.
JS3C (Yan et al., 2021) introduces the Point-Voxel Interaction (PVI) module to enhance knowledge fusion between the semantic segmentation and semantic scene completion tasks. This module facilitates interaction between incomplete local geometries in point clouds and complete global structures in voxels, enabling a more comprehensive understanding of the scene. AICNet (Li et al., 2020) proposes a novel anisotropic convolution, which decomposes a 3D convolution into three consecutive 1D convolutions. S3CNet (Cheng et al., 2021) tackles the challenge of large-scale environments by incorporating sparsity considerations and leveraging a sparse convolution-based neural network. Recently, SCPNet (Xia et al., 2023c) introduces a novel knowledge distillation objective termed Dense-to-Sparse Knowledge Distillation (DSKD).
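The decomposition used by AICNet can be illustrated in a few lines: applying one 1D convolution along each spatial axis in turn approximates a full 3D convolution at a fraction of the cost (for a separable kernel the result is exact). The sketch below demonstrates the idea with a simple box filter; it illustrates the decomposition only, not AICNet's actual learned, dimension-adaptive module:

```python
import numpy as np

def conv1d_along_axis(volume, kernel, axis):
    """'Same'-size 1D convolution along one axis of a 3D volume (zero-padded)."""
    return np.apply_along_axis(
        lambda line: np.convolve(line, kernel, mode="same"), axis, volume
    )

def anisotropic_conv(volume, kx, ky, kz):
    """Three consecutive 1D convolutions, one per spatial axis."""
    out = conv1d_along_axis(volume, kx, axis=0)
    out = conv1d_along_axis(out, ky, axis=1)
    out = conv1d_along_axis(out, kz, axis=2)
    return out

# A 3x3x3 box filter, expressed as three 1D averaging kernels: the three
# passes together touch 3 + 3 + 3 kernel weights instead of 27.
k = np.ones(3) / 3.0
vol = np.ones((5, 5, 5))
out = anisotropic_conv(vol, k, k, k)
```

On a constant volume the interior stays constant while zero-padding attenuates the borders, which makes the separability easy to check by hand.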

Figure 2. The pipeline of the scene completion network. The encoder is a Transformer-based backbone and the decoder includes an AIC module and three convolutional layers.

Figure 3. The architecture of the 3D backbone network. It consists of three VoTr blocks, each containing two different self-attention modules.

Scene completion pre-training. The raw point cloud serves as input and is fed into our SCP. Notably, only 20% of the training data is sampled from the overall training set for training the SCP. Through this step, we obtain the pre-training model for scene completion. As depicted in Fig. 5, after pre-training the completion network, the original data is effectively completed, generating a more holistic representation of the scene.

Figure 5. Comparison before and after scene completion.
3D scene completion. In this study, we utilize only 20% of the SemanticKITTI dataset for training the completion network. The completion IoU obtained in our experiments ranges from 47.39% to 54.58%, reflecting the effectiveness of our completion network in generating accurate and reliable scene completions. Notably, our pre-training model yields the highest completion result of 54.58%. The training results of the scene completion network are shown in Table 1.
3D object detection. In this experiment, we focus on training the VoTr-SCP model using a significantly reduced amount of training data; specifically, we utilize only 20% of the available data. This deliberate reduction allows us to investigate the model's performance under resource-constrained scenarios and assess its ability to leverage limited data effectively. The results are presented in Table 2.

Figure 6. Visualization of example results. Due to the presence of some points similar to cars in the target area, VoTr incorrectly identifies a wall as a car (the green bounding box), whereas our SCP helps to reduce these errors.

Table 1. The IoU results of 3D scene completion at different training epochs.
To comprehensively assess the influence of the pre-training model on detection performance under different amounts of labeled data, we carry out experiments with varying fractions of the training set: 10%, 20%, 30%, and 40%. The results are presented in Table 3 and Table 4, which show that performance improves steadily as the volume of labeled data grows.

Table 2. Comparisons on KITTI with AP11 for the car category (20% training data).

Table 5. Comparisons of different IoU with VoTr-SSD-SCP for the car category.

Table 6. Comparisons of different IoU with VoTr-TSD-SCP for the car category.