Weakly Supervised Learning Method for Semantic Segmentation of Large-Scale 3D Point Cloud Based on Transformers

Abstract: Semantic segmentation results of 3D point clouds are now widely applied in fields such as robotics, autonomous driving, and augmented reality. Thanks to the development of relevant deep learning models (such as PointNet), supervised training methods have become a research hotspot, yet two common limitations persist: inferior feature representation of 3D points and the need for massive annotations. To improve 3D point features, inspired by the idea of the transformer, we employ a so-called LCP network that extracts better features by investigating attention between target 3D points and their corresponding local neighbors via local context propagation. Training a transformer-based network requires a large number of training samples, whose annotation is itself labor-intensive, costly, and error-prone. This work therefore proposes a weakly supervised framework in which pseudo-labels are estimated from the feature distances between unlabeled points and class prototypes, which are computed from labeled data. Extensive experimental results show that the proposed PL-LCP yields considerable results (67.6% mIoU for indoor and 67.3% for outdoor scenes) even when using only 1% real labels, and compared with several state-of-the-art methods using all labels, it achieves superior indoor results in mIoU and OA (65.9%, 89.2%).


Introduction
Semantic segmentation is a key technique that assigns a semantic label to each individual point in a point cloud. This technology is widely used in areas such as autonomous driving, augmented reality, and 3D reconstruction. Traditional semantic segmentation methods such as RANSAC (Jung, 2014) and region growing (Wang, 2015) struggle to adapt to complex scenes. Emerging deep-learning-based methods (Zhao, 2021; Xu, 2020; Milioto, 2019) can process point clouds in multiple scenarios more accurately by learning features from supervised point cloud data. However, the large demand for supervised data and the difficulty of learning local features of point clouds remain unsolved problems.
The Transformer has achieved remarkable success in fields such as natural language processing (NLP) (Vaswani, 2017; Wu, 2019; Devlin, 2018) and 2D image processing (Zhao, 2020; Ramachandran, 2019; Hu, 2019). In the domain of point cloud semantic segmentation, it plays a critical role in leveraging contextual features. The Transformer (Zhao, 2021; Guo, 2021) has shown potential in capturing crucial features, thanks to its fundamental attention mechanism and its ability to capture long-range dependencies. This makes it a reasonable choice for handling unstructured and unordered point cloud data. However, an inherent limitation of the Transformer network is its lack of integration of local information, which has been described as a drawback (Liu, 2021). LCPFormer (Huang, 2023) introduces a simple and effective module called LCP (Local Context Propagation) to facilitate message passing between adjacent local regions. Specifically, it leverages information exchange between neighboring local regions to provide each local region with more informative and discriminative features. The goal of this method is to enhance the capability of the Transformer to integrate local information by incorporating relational information between adjacent local regions, thereby improving the performance of point cloud semantic segmentation networks. Furthermore, most existing large-scale point cloud datasets rely heavily on manually annotating each point, which is a labor-intensive, expensive, and error-prone task. The Transformer architecture requires extensive datasets for training, which incurs significant annotation costs. Weakly supervised training provides an effective way to reduce these costs. Pseudo-labeling (Lee, 2013) is a method for leveraging unlabeled data in weakly supervised training. Initially, the model (Zhang, 2021) is trained using a small amount of labeled data. The trained model is then used to predict the unlabeled data, and these predictions serve as pseudo-labels. Subsequently, the model is trained on a combination of labeled and unlabeled data, incorporating both the real labels and the pseudo-labels, and its performance is evaluated on the test set. In addition, entropy regularization loss (Grandvalet, 2006; Shannon, 1948) and distribution alignment loss (Zhang, 2021; Saito, 2019) have been introduced into 3D segmentation tasks for weakly supervised learning. These techniques aim to utilize the information from all unlabeled points by mitigating the negative impact of pseudo-label noise and addressing distribution discrepancies (Li, 2023). Overall, this paper combines weakly supervised learning with an LCP-augmented Transformer framework to achieve superior results compared with other point cloud semantic segmentation methods such as RandLA-Net (Hu, 2020), even with limited annotations.
The methodology and workflow of our approach are illustrated in Figure 1. We begin by feeding the point cloud into an LCP network to predict its initial semantic information. Next, we employ a momentum-based prototype pseudo-label generation strategy (Xu, 2020) to generate pseudo-labels for unlabeled points. These pseudo-labels, along with the predicted results, are optimized using a loss function. Our main contributions are threefold:
1. We propose a novel PL-LCP framework that combines pseudo-labeling with a Transformer, achieving good performance even with limited training samples.
2. Entropy regularization loss and distribution alignment loss are incorporated into pseudo-label generation, allowing better utilization of information from all unlabeled points.
3. We apply varying degrees of labeling to the original data (1%, 10%, and full) to investigate the performance of the framework with limited labeled 3D points.

Voxel-based methods:
Voxel-based methods first convert the point cloud into a voxel grid; all points within a voxel are then assigned the same semantic label as that voxel. In general, voxelization naturally preserves the neighborhood structure of three-dimensional point clouds, and its regular data format allows the direct application of standard 3D convolutions. However, voxelization inevitably introduces discretization artifacts and information loss: high resolutions lead to high memory and computational costs, while low resolutions result in loss of detail. In practical applications, it is challenging to choose an appropriate grid resolution.

Point-based methods:
Point-based methods operate directly on unstructured and irregular point clouds. These methods interact with points directly, taking individual points as input and outputting per-point labels or a label for the entire point cloud.
PointNet (Qi, 2017) is a pioneering work and breakthrough that opened deep learning to working directly with points, without rendering them into voxels or 2D images. PointNet uses a max-pooling function in each layer of the network, learning an optimization function and aggregating the optimized values into a global descriptor. The final fully connected layers of the network aggregate these learned optimal values into a global descriptor for the entire shape, or are used to predict per-point labels. Hu et al. (2020) propose an efficient lightweight network, RandLA-Net, for large-scale point cloud segmentation. This network utilizes random point sampling and achieves very high efficiency in terms of memory and computation. Furthermore, it introduces a Local Feature Aggregation module to capture and retain geometric features. PointVector (Deng, 2023) proposes a Vector-oriented Point Set Abstraction that aggregates neighboring features through higher-dimensional vectors. To facilitate network optimization, it constructs a transformation from scalar to vector using independent angles based on 3D vector rotations.

Transformers in 3D Point Clouds
Self-attention networks have revolutionized natural language processing and are making impressive strides in image analysis tasks such as image classification and object detection. Transformer models are especially suited to point cloud processing, since the self-attention operator is permutation-invariant and point clouds are unordered sets.

Methodology
This paper aims to leverage pseudo-label generation techniques to implement a Transformer-based point cloud segmentation network trained with a small number of annotated points. Furthermore, to enhance the Transformer's ability to integrate information across adjacent local regions, the LCP module is added to the Transformer. We propose an effective weakly supervised framework based on the Transformer; an overview of the framework is illustrated in Figure 2.
Our approach combines a Transformer network with LCP (Local Context Propagation) modules and pseudo-label generation techniques to achieve better semantic segmentation results with only a small amount of real annotations. In Section 3.1, we provide a brief introduction to the Transformer network. Next, in Section 3.2, we present a detailed description of the LCP module's structure and analyze its principle of integrating overlapping regions. Finally, in Section 3.3, we discuss the process of generating pseudo-labels and the reasons for introducing the entropy regularization loss and distribution alignment loss.

Preliminary
The transformer consists of an encoder and a decoder. The encoder transforms the input sequence into a series of hidden representations, while the decoder generates the target sequence based on the encoder outputs and the previous decoder states. Each encoder and decoder layer consists of multiple identical sub-layers, including a multi-head self-attention mechanism (MHSA) and a feed-forward network (FFN). Before attention is applied, each point feature is augmented with a position encoding, $x_i = \mathrm{PE}(p_i) + f_i$, where $\mathrm{PE}(\cdot)$ denotes the position encoding function.
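As a concrete illustration of the MHSA computation described above, the sketch below implements multi-head scaled dot-product attention over a set of per-point features in NumPy. The random projection matrices stand in for learned weights and are purely illustrative; this is not the authors' implementation.

```python
import numpy as np

def mhsa(x, num_heads, rng):
    """Minimal multi-head self-attention over per-point features.

    x: (N, D) array of point features (position encoding already added).
    The projection matrices are random stand-ins for learned weights.
    """
    n, d = x.shape
    assert d % num_heads == 0
    d_head = d // num_heads
    heads = []
    for _ in range(num_heads):
        w_q, w_k, w_v = (rng.standard_normal((d, d_head)) / np.sqrt(d)
                         for _ in range(3))
        q, k, v = x @ w_q, x @ w_k, x @ w_v       # per-head projections
        scores = q @ k.T / np.sqrt(d_head)        # scaled dot-product
        scores -= scores.max(axis=-1, keepdims=True)
        attn = np.exp(scores)
        attn /= attn.sum(axis=-1, keepdims=True)  # row-wise softmax
        heads.append(attn @ v)                    # attention-weighted values
    w_o = rng.standard_normal((d, d)) / np.sqrt(d)
    return np.concatenate(heads, axis=-1) @ w_o   # concatenate heads + project
```

Each head attends over all N points, which is what gives the transformer its long-range receptive field.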

Local Context Propagation
While transformers primarily emphasize long-range dependencies, local structural information remains crucial in a transformer-based 3D point cloud model. To enable transformers to incorporate such local structural information, we utilize the LCPFormer (Huang, 2023) architecture as the backbone of our weakly supervised semantic segmentation network. LCPFormer is based on a simple observation: when the whole point cloud is divided into different local regions, there is naturally overlap among them. It works by updating point features in the overlapping areas of different regions. Given a point $p_i$, we denote its corresponding local regions as $\{R_1, \ldots, R_n\}$. After the Transformer independently operates on these regions, for each local region $R_j$, point $p_i$ possesses a corresponding feature within it, denoted as $f_i^j$. To obtain a representation of each local region, LCP combines the results of max pooling and mean pooling: max pooling captures the most salient features, while mean pooling captures features from the surrounding area. We assume the whole point cloud contains $N$ points grouped into $C$ local regions. The input is $F_{\mathrm{in}} \in \mathbb{R}^{C \times K \times D}$, where $K$ is the number of points in each local region and $D$ is the feature dimension of each point. We then obtain a representation $F \in \mathbb{R}^{C \times 2D}$ for each local region through max pooling and mean pooling, followed by a $1 \times 1$ convolution that generates the weight $W \in \mathbb{R}^{C \times 1}$ for each region. Finally, we update the feature of $p_i$ using these weights: each point in an overlapping area aggregates its features from all regions that contain it, weighted by the corresponding region weights. This process can be formulated as:

$$\hat{f}_i = \sum_{j : \, p_i \in R_j} W_j \, f_i^j.$$
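To make the propagation step concrete, here is a minimal NumPy sketch under the shapes described above. The fixed sigmoid projection stands in for the learned 1x1 convolution, and the aggregation rule (a normalized weighted combination over the regions containing each point) is our reading of the update; both are assumptions, not the original implementation.

```python
import numpy as np

def lcp_update(f_in, regions, num_points):
    """Sketch of Local Context Propagation (LCP).

    f_in:    (C, K, D) per-region point features from the transformer.
    regions: (C, K) int array mapping each region slot to a global point index;
             overlapping regions share point indices.
    Returns (num_points, D) features: each point aggregates its features from
    every region containing it, weighted by that region's propagation weight.
    """
    c, k, d = f_in.shape
    # Region descriptor: concatenated max- and mean-pooling -> (C, 2D).
    desc = np.concatenate([f_in.max(axis=1), f_in.mean(axis=1)], axis=-1)
    # Stand-in for the learned 1x1 convolution: fixed projection + sigmoid,
    # giving one propagation weight per region, (C, 1).
    proj = np.ones((2 * d, 1)) / (2 * d)
    w = 1.0 / (1.0 + np.exp(-(desc @ proj)))
    out = np.zeros((num_points, d))
    norm = np.zeros((num_points, 1))
    for j in range(c):
        np.add.at(out, regions[j], w[j] * f_in[j])   # weighted features
        np.add.at(norm, regions[j], w[j])            # accumulated weights
    return out / np.maximum(norm, 1e-8)
```

Points that fall in several overlapping regions thus receive messages from all of them, which is the inter-region exchange the module is designed for.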

Pseudo-Label Generation
We employ the momentum-based prototype pseudo-label generation process of Xu et al. (2020). Specifically, prototypes denote the centroids of classes in the feature space, calculated from labeled data, while pseudo-labels are estimated from the feature distances between unlabeled points and these class centroids. To reduce computational costs, we employ a momentum optimization algorithm, supplemented by an MLP-based projection network to aid pseudo-label generation. The specific process is illustrated in Figure 2. We assume an input point cloud $X$, where the labeled points are denoted as $X_l$ and the unlabeled points as $X_u$. The pseudo-label generation process can be described as follows:

$$c_k = \frac{1}{N_k} \sum_{x_i \in X_l,\, y_i = k} (g \circ f)(x_i), \qquad C_k \leftarrow m\, C_k + (1 - m)\, c_k,$$
$$\hat{y}_j = \arg\max_k \, s\big(C_k, (g \circ f)(x_j)\big), \quad x_j \in X_u,$$

where $y$ is the label of a labeled point, $C_k$ represents the global class centroid of the $k$-th class, $N_k$ denotes the number of labeled points of the $k$-th class, $g \circ f = g(f(\cdot))$ signifies the transformation through the backbone network $f$ and the projection network $g$, $m$ is the momentum coefficient, and cosine similarity is employed for $s(\cdot,\cdot)$ to generate the scores. By default, we use 2-layer MLPs for the projection network $g$ and set $m = 0.999$.
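A minimal sketch of this prototype pipeline, assuming the embeddings $(g \circ f)(x)$ are already computed; the function names are ours, not from the original implementation.

```python
import numpy as np

def update_prototypes(prototypes, feats_l, labels_l, momentum=0.999):
    """Momentum update of per-class prototypes from labeled embeddings.

    prototypes: (K, D) global class centroids C_k.
    feats_l:    (M, D) embeddings (g o f)(x) of labeled points.
    labels_l:   (M,) int class ids.
    """
    new = prototypes.copy()
    for k in range(prototypes.shape[0]):
        mask = labels_l == k
        if mask.any():                      # batch centroid c_k of class k
            centroid = feats_l[mask].mean(axis=0)
            new[k] = momentum * new[k] + (1.0 - momentum) * centroid
    return new

def pseudo_labels(prototypes, feats_u):
    """Assign each unlabeled point the class of its most cosine-similar
    prototype, as in the prototype-based generation described above."""
    p = prototypes / np.linalg.norm(prototypes, axis=1, keepdims=True)
    f = feats_u / np.linalg.norm(feats_u, axis=1, keepdims=True)
    return (f @ p.T).argmax(axis=1)
```

The high momentum (0.999) keeps the global centroids stable across batches, which is what makes the per-batch update cheap.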
Existing pseudo-label generation methods often rely on empirical label selection strategies, such as confidence thresholds, to generate pseudo-labels that are beneficial for model training. This approach can waste unlabeled points. In this paper, pseudo-labels are generated for all unlabeled points, and an entropy regularization loss and a distribution alignment loss are introduced to minimize the disparity between pseudo-labels and model predictions. We denote the two loss functions as $L_{ER}$ and $L_{DA}$, respectively (Li, 2023). The overall ERDA loss is then:

$$L_{ERDA} = L_{DA} + \lambda L_{ER},$$

where $\lambda > 0$ modulates the entropy regularization.
For the entropy regularization loss, we posit that when pseudo-labels fail to provide reliable outcomes, they are more susceptible to noise interference, resulting in a high-entropy pseudo-label distribution $p$. To alleviate this issue, we minimize the Shannon entropy of $p$ to reduce its noise level; by minimizing the entropy of the pseudo-labels, we enhance their quality. Therefore, we have:

$$L_{ER} = H(p), \qquad H(p) = -\sum_i p_i \log p_i,$$

where $i$ iterates over the entries of the vector $p$.
While the entropy regularization can mitigate the impact of noise in pseudo-labels, significant disparities between pseudo-labels and predictions from the segmentation network can still confound the learning process, leading to unreliable segmentation results.
To address this issue, we jointly optimize the pseudo-labels and the network to narrow this gap, ensuring that the generated pseudo-labels do not deviate too far from the segmentation predictions. We therefore introduce the distribution alignment loss as the KL divergence between the pseudo-label distribution $p$ and the prediction $q$:

$$L_{DA} = KL(p \,\|\, q).$$

With $L_{ER}$ and $L_{DA}$ formulated as above, and given that $KL(p \| q) = H(p, q) - H(p)$, where $H(p, q)$ is the cross-entropy between $p$ and $q$, we obtain a simplified ERDA formulation:

$$L_{ERDA} = H(p, q) + (\lambda - 1)\, H(p).$$

In particular, when $\lambda = 1$, we obtain the final ERDA loss:

$$L_{ERDA} = H(p, q).$$

This simplified ERDA loss differs from the traditional cross-entropy loss: the latter employs fixed labels and only optimizes the term inside the logarithm, whereas the loss above simultaneously optimizes both $p$ and $q$. Finally, with the simplified ERDA above, the overall loss is given as:

$$L = \frac{1}{N_l} \sum_{i=1}^{N_l} \ell_{ce}(q_i, y_i) + \alpha\, \frac{1}{N_u} \sum_{j=1}^{N_u} L_{ERDA}(q_j, p_j),$$

where $\ell_{ce}(q, y) = L_{ERDA}(q, y) = H(y, q)$ is the typical cross-entropy loss used for point cloud segmentation, $N_l$ and $N_u$ are the numbers of labeled and unlabeled points, and $\alpha$ is the loss weight.
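The simplified objective can be sketched numerically as follows. In an autograd framework both distributions would receive gradients; here we only compute the value, and the softmax/logit handling is our assumption for a self-contained example.

```python
import numpy as np

def erda_loss(q_logits, p_logits, lam=1.0, eps=1e-12):
    """Sketch of the simplified ERDA objective H(p, q) + (lam - 1) * H(p),
    averaged over points. q is the network prediction, p the pseudo-label
    distribution; both are given as logits.
    """
    def softmax(z):
        z = z - z.max(axis=-1, keepdims=True)
        e = np.exp(z)
        return e / e.sum(axis=-1, keepdims=True)
    q, p = softmax(q_logits), softmax(p_logits)
    h_pq = -(p * np.log(q + eps)).sum(axis=-1)   # cross-entropy H(p, q)
    h_p = -(p * np.log(p + eps)).sum(axis=-1)    # entropy H(p)
    return (h_pq + (lam - 1.0) * h_p).mean()
```

With lam = 1 the entropy terms cancel against the KL expansion and the loss reduces to the cross-entropy H(p, q), as derived above.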

Experiments
To demonstrate the efficacy of our proposed PL-LCP, we evaluate 3D semantic segmentation results in both indoor and outdoor scenarios using two large-scale point cloud datasets. First, we conduct two ablation experiments to validate the ability of the LCP module to integrate inter-block information and the effect of pseudo-labels. Then, our method is compared with other relevant approaches, primarily to demonstrate the effectiveness of the PL-LCP network architecture. Our experimental environment is: Intel Core i7-8700 CPU (3.70 GHz), 64 GB RAM, NVIDIA GeForce RTX 4090 24 GB GPU, 64-bit Ubuntu 22.04.3 LTS operating system (kernel 5.4.0-149-generic).
We trained the network for 200 epochs using the Adam optimizer, with momentum, batch size, and weight decay set to 0.9, 4, and 0.0001, respectively. The initial learning rate was set to 0.01 and decreased by a factor of 10 at epoch 120.
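The step schedule above can be expressed as a small helper (a sketch of the stated hyperparameters, not the authors' training code):

```python
def learning_rate(epoch, base_lr=0.01, drop_epoch=120, factor=0.1):
    """Step decay used above: base_lr until epoch 120, then reduced 10x."""
    return base_lr * (factor if epoch >= drop_epoch else 1.0)
```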

Datasets
The S3DIS (Stanford Large-Scale 3D Indoor Spaces) dataset is a vast collection of three-dimensional indoor space data provided by Stanford University (Armeni, 2016). It comprises six distinct indoor areas, each containing three-dimensional reconstruction data. All points are labeled with their semantic ground truth from 13 categories including board, bookcase, chair, ceiling, beam, etc. SensatUrban (Hu, 2022) is an urban-scale photogrammetric point cloud dataset with nearly three billion richly annotated points, five times the number of labeled points of the previously largest point cloud dataset. The dataset covers large areas of two UK cities, spanning about 6 km² of city landscape. In SensatUrban, each 3D point is labeled with one of 13 semantic classes, such as ground, vegetation, car, etc.

LCP Network Architecture
We constructed a UNet-like (Ronneberger, 2015) network for semantic segmentation using 4 LCP Blocks and 4 up-sampling layers, as depicted in Figure 2, since dense prediction requires per-point features. Before entering the first LCP Block, the data passes through a shared MLP. The specific structure of an LCP Block is illustrated in Figure 3. In each up-sampling layer, we first employ the kNN algorithm to find the nearest neighbour of each query point, and then up-sample the point feature set through nearest-neighbour interpolation. Subsequently, the up-sampled feature maps are concatenated with the intermediate feature maps generated by the encoding layers via skip connections, followed by a shared MLP applied to the concatenated feature maps. The dimensions of the four encoder layers are 128, 256, 512, and 1024, respectively, and the input consists of 40960 points.
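The nearest-neighbour interpolation used by the up-sampling layers can be sketched as below; distances are computed by brute force for clarity, whereas a real implementation would use a spatial index.

```python
import numpy as np

def nn_upsample(xyz_dense, xyz_sparse, feat_sparse):
    """Nearest-neighbour feature interpolation for an up-sampling layer:
    each dense query point copies the feature of its nearest coarse point.

    xyz_dense:   (N, 3) query positions.
    xyz_sparse:  (M, 3) coarse positions.
    feat_sparse: (M, D) coarse features.
    Returns (N, D) up-sampled features.
    """
    # Pairwise squared distances (N, M), brute force for clarity.
    d2 = ((xyz_dense[:, None, :] - xyz_sparse[None, :, :]) ** 2).sum(-1)
    return feat_sparse[d2.argmin(axis=1)]
```

The up-sampled features would then be concatenated with the corresponding encoder features via the skip connection, as described above.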

Evaluation Metrics
For simplicity and representativeness, this paper compares and analyzes point cloud semantic segmentation methods using three evaluation metrics: Overall Accuracy (OA), mean accuracy (mAcc), and mean Intersection over Union (mIoU). For ease of description, we assume there are $N$ semantic classes. $p_{ij}$ represents the number of points whose actual semantic class is $i$ and predicted class is $j$, and vice versa for $p_{ji}$; $p_{ii}$ represents the number of points whose actual and predicted semantic class are both $i$.
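Under these definitions, all three metrics follow directly from a confusion matrix; a minimal sketch:

```python
import numpy as np

def segmentation_metrics(conf):
    """Compute OA, mAcc and mIoU from an (N, N) confusion matrix whose entry
    conf[i, j] counts points of true class i predicted as class j."""
    conf = conf.astype(float)
    tp = np.diag(conf)                                   # p_ii
    oa = tp.sum() / conf.sum()                           # overall accuracy
    acc = tp / np.maximum(conf.sum(axis=1), 1e-12)       # per-class accuracy
    iou = tp / np.maximum(conf.sum(axis=1) + conf.sum(axis=0) - tp, 1e-12)
    return oa, acc.mean(), iou.mean()
```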
OA is the ratio of the number of samples correctly predicted by the segmentation algorithm to the total number of samples:

$$OA = \frac{\sum_{i=1}^{N} p_{ii}}{\sum_{i=1}^{N} \sum_{j=1}^{N} p_{ij}}.$$

mAcc refines OA by computing the accuracy of each category individually and averaging the results over the number of categories:

$$mAcc = \frac{1}{N} \sum_{i=1}^{N} \frac{p_{ii}}{\sum_{j=1}^{N} p_{ij}}.$$

mIoU averages, over all classes, the ratio of the intersection to the union of the predicted and ground-truth point sets:

$$mIoU = \frac{1}{N} \sum_{i=1}^{N} \frac{p_{ii}}{\sum_{j=1}^{N} p_{ij} + \sum_{j=1}^{N} p_{ji} - p_{ii}}.$$

Experiment 1: Ablation study
In this section, we conducted extensive ablation experiments to validate our approach, including the effectiveness of the proposed LCP module and the impact of varying degrees of ground-truth annotation on PL-LCP. The results of the ablation experiments are presented in Table 1. We first examined the effectiveness of the LCP module: it yields improvements of 6.6%, 3.2%, and 9.3% in OA, mAcc, and mIoU, respectively, demonstrating the importance of integrating information across different blocks for semantic segmentation. Parts 3 and 4 reduce the quantity of ground-truth labels to 10% and 1%, respectively, to assess the effectiveness of pseudo-labels, resulting in varying degrees of decrease compared with part 1. It is noteworthy that even with only 1% of ground-truth labels, our approach achieves results that match or surpass those of some fully supervised networks. These results demonstrate that the LCP module indeed enhances the network's local aggregation capability, and that the integration of weak supervision enables our network to perform remarkably well even with a limited amount of real annotations.
To visually evaluate the impact of our proposed PL-LCP, we randomly selected several point cloud scenes from the S3DIS dataset and visualized their output results, as shown in Figure 4. We present the detection results using the LCP module and compare them with the results obtained without using the LCP module, along with the ground truth labels.It can be observed that the results without the LCP module show relatively poorer handling of boundary regions, indicating that the separation sampling of local regions leads to the degradation of instance information.In contrast, our proposed LCP module effectively improves the point cloud features, providing richer information and more discriminative representations.

Experiment 2: Comparison with state-of-the-art methods
The results on the S3DIS dataset are presented in Table 3. It is evident that our method outperforms certain non-Transformer architectures, such as RandLA-Net (Hu, 2020), which achieves an OA of 87.2%, mAcc of 71.4%, and mIoU of 62.4%. Our method surpasses LocalTransformer by 2.6%, 3.6%, and 3.5% in OA, mAcc, and mIoU, respectively, demonstrating the efficacy of our LCP module in integrating local information.
As shown in Table 2, we also evaluated our method on the challenging urban-scale segmentation dataset SensatUrban, achieving significant improvements. Compared with the popular KPConv and the well-known RandLA-Net, our method improves mIoU by approximately 9.7% and 14.6%, respectively. Our approach shows particular promise in smaller categories; for instance, we achieve 46.9% IoU in the railway category and 84.2% in the bridge category. Across all categories, our method decisively outperforms the other approaches.

Conclusion
This work explores the combination of Transformers and weak supervision for 3D point cloud semantic segmentation, emphasizing the integration of semantic information across local regions.We introduce a novel and effective weakly supervised network, PL-LCP.In contrast to previous approaches, our method not only exploits the Transformer's capability to process sequential data but also addresses the challenge of the Transformer architecture's reliance on extensive datasets for training.Moreover, by introducing the LCP module, we effectively mitigate the issue of Transformers solely focusing on long-range dependencies while neglecting structural information.Ultimately, our proposed approach demonstrates significant improvements compared to various relevant methods and Transformer-based approaches in the dense prediction task for semantic segmentation on the S3DIS dataset.
In this paper, we combine the Transformer with an LCP module to form a network that achieves promising results even with limited true annotations.Subsequent experiments will involve the use of a custom dataset to further validate the effectiveness of our approach.Additionally, we plan to make some improvements, such as replacing precise nearest neighbour search with efficient serialized neighbour mapping organized according to specific patterns of point clouds.This enhancement will significantly increase the receptive field while accelerating processing speed and runtime efficiency, thereby enhancing its performance on outdoor point cloud datasets.

Figure 1. Flowchart of the proposed method.

Figure 2. Overall architecture of the proposed framework.

First, we review the commonly used MHSA, which enables the model to simultaneously attend to different parts of the input with different representations. By performing multiple attention operations in parallel, each attention head can learn different semantic information. Given a point cloud $X = \{x_i\}$, we denote its positions as $P = \{p_i\}$ and corresponding features as $F = \{f_i\}$. MHSA can be formulated as follows:

$$x_i = \mathrm{PE}(p_i) + f_i \quad (1)$$
$$Q_j = X W_j^Q, \quad K_j = X W_j^K, \quad V_j = X W_j^V \quad (2)$$
$$h_j = \mathrm{softmax}\!\left(\frac{Q_j K_j^{\top}}{\sqrt{d}}\right) V_j \quad (3)$$
$$\mathrm{MHSA}(X) = \mathrm{Concat}(h_1, h_2, \ldots, h_m)\, W^O \quad (4)$$

Following the preceding discussion, the LCP Block is constructed as illustrated in Figure 3. It consists of a grouping layer, two MHSA layers, and an intermediate LCP module. The grouping layer utilizes FPS sampling to select central points; for each central point, its K nearest neighbors (kNN) are gathered within the local neighborhood to construct a local region. The k of kNN is set to 16.

Figure 3. The specific structure of the LCP Block.

Figure 4. Visualization of segmentation results on S3DIS.

Here $d$ is the feature dimension, $W^O$ is the output projection, and $W_j^Q$, $W_j^K$, $W_j^V$ are the projections of the $j$-th head for query, key, and value, respectively. The FFN is a fundamental structure in neural networks; each transformer layer consists of MHSA and an FFN with skip connections.

Table 2. Performance comparisons with existing SOTA methods on the SensatUrban test set. Overall accuracy (OA), mean IoU (mIoU), and per-class IoU scores are reported from the SensatUrban leaderboard.

Table 3. Performance comparisons with previous methods on S3DIS, evaluated on Area 5.

The International Archives of the Photogrammetry, Remote Sensing and Spatial Information Sciences, Volume XLVIII-1-2024, ISPRS TC I Mid-term Symposium "Intelligent Sensing and Remote Sensing Application", 13-17 May 2024, Changsha, China.