An End-to-End Geometric Characterization-aware Semantic Instance Segmentation Network for ALS Point Clouds

Semantic instance segmentation of 3D scenes plays a crucial role in 3D modelling and scene understanding. Existing state-of-the-art methods conduct semantic segmentation before grouping instances. However, without additional refinement, semantic errors propagate fully into the grouping stage, resulting in low overlap with the ground truth instances. Furthermore, these methods focus on indoor-level scenes and are of limited use when directly applied to large-scale outdoor Airborne Laser Scanning (ALS) point clouds, where numerous instances, significant object density, and scale variations make the data distinct from indoor scans. To address these problems, we propose a geometric characterization-aware semantic instance segmentation network that utilizes both semantic and objectness scores to select potential points for grouping. In the point cloud feature learning stage, hand-crafted geometric features are taken as input for geometric characterization awareness. Moreover, to address errors propagated from previous modules after grouping, we design an additional per-instance refinement module. To assess semantic instance segmentation, we conduct experiments on an open-source dataset; we also perform semantic segmentation experiments to evaluate the performance of our proposed point cloud feature learning method.


Introduction
ALS (Airborne Laser Scanning) point clouds are collections of 3D coordinate points obtained with airborne LiDAR (Light Detection and Ranging) technology, which represent the 3D structure of the terrain and of objects on the Earth's surface (Polewski and Yao, 2019). Instance segmentation of ALS point clouds, meanwhile, plays a crucial role in 3D modelling and scene understanding, with a variety of applications such as autonomous driving, augmented reality, and robot navigation.
Instance segmentation has made significant progress in recent years, driven by advancements in deep learning, computer vision algorithms, and sensor technology. 3D instance segmentation of outdoor scenes is a challenging task. Firstly, instance labels have no fixed annotation scheme, unlike semantic classes, making them hard to predict directly. Secondly, each scene contains a different number of instances. Recent state-of-the-art methods such as 3D-SIS (Hou et al., 2019) and SoftGroup (Vu et al., 2022) have yielded significant progress. These methods employ two primary strategies: top-down and bottom-up. The top-down approach is well-suited for rapidly processing scenes, whereas the bottom-up approach excels at high-precision segmentation of complex scenes.
In terms of data, previous works primarily utilized indoor scanning data as input. However, to the best of our knowledge, no study has yet focused on ALS point cloud semantic instance segmentation. ALS point clouds typically encompass a wide range of outdoor scenes with complex instances. Unlike indoor RGB-D data, ALS point clouds often contain obstructed areas with sparse or no points due to the scanning positions. Moreover, the presence of numerous instances, significant object density, and scale variations makes them distinct (Figure 1). The boundaries between different categories in ALS point clouds are also ambiguous and irregular. Finally, in our instance segmentation task, the majority of input points are classified as background, meaning that only a few points should be grouped into instances.
In this study, we present a semantic instance segmentation network that particularly considers the geometric characteristics of ALS point clouds and can be trained in an end-to-end manner. To ensure the quality of our results, we adopt a bottom-up strategy. Semantic predictions are utilized for instance mask proposals, and a separate per-instance refinement module handles background point segmentation within each proposed instance. As a result, we place particular emphasis on the performance of semantic segmentation. The core idea of our architecture is a geometric characterization-aware method for learning input point features, which leads to better semantic segmentation performance in distinguishing instance categories from background. To address the significant variations in object scale, our grouping parameters are specially designed based on the average number of points per instance category. Moreover, thanks to the end-to-end training process, accumulated errors can be avoided. In summary, the key contributions of our work are as follows: (1) We introduce the first semantic instance segmentation network for ALS point clouds, which can be trained in an end-to-end manner. Both semantic and objectness scores are utilized to select potential points for grouping, followed by a per-instance refinement module.
(2) We design a geometric characterization-aware feature learning network, GFLN.

Related work

Earlier methods, e.g. et al. (2017), focused on hand-crafted features based on mathematical principles. Such features can generally express the characteristics of points effectively within a certain domain or condition. However, their fixed formulation makes them heavily reliant on computational parameters, so these methods are difficult to apply in complex and ever-changing environments. Vu et al. (2022) introduced SoftGroup, which performs bottom-up soft grouping followed by top-down refinement; semantic segmentation and instance offset prediction are conducted simultaneously. By performing semantic segmentation before grouping, the method allows each point to be associated with soft predictions for multiple classes, alleviating the propagation of errors to subsequent processing. In summary, for the grouping-based bottom-up strategy, utilizing per-point predictions makes instance predictions more precise and allows them to be refined.

Dynamic convolution-based
Dynamic convolution is a technique in convolutional neural networks that allows the shape and size of the convolutional kernel to change dynamically during the forward pass. In 3D semantic instance segmentation, this strategy allows the point-wise convolutional kernel's shape to be adjusted, making the kernel instance-aware. For instance, techniques such as DyCo3D (He et al., 2021) can effectively address the inevitable variation in instance scales by generating instance-aware dynamic convolution kernels in the point cloud feature learning stage.
Throughout these works, the focus was on indoor scenes, which contain objects of similar size with less occlusion than ALS point clouds. Simply applying the proposed methods to ALS point clouds remains far from satisfactory (Han et al., 2024). Therefore, our method pays more attention to the characteristics of ALS data. On the one hand, in the point cloud feature learning stage, we take some hand-crafted features as input to enhance geometric awareness. On the other hand, we develop a grouping-based network that specifically tailors the grouping parameters to the average number of points per instance category. Moreover, as most of the points are background, we design a semantic segmentation refinement module to enhance background classification for each grouped instance proposal.

Method
The goal of our work is to take ALS point clouds as input and segment instances. Thus, we propose this end-to-end semantic instance segmentation network. Moreover, for ALS point clouds with special geometric features, we also introduce a novel strategy for 3D point feature learning. The overall architecture is illustrated in Figure 3 and consists of three main parts: semantic segmentation, instance center prediction, and per-instance refinement modules.
Specifically, the input is a point cloud P(x|f) with N points, each extended by K-dimensional features. First, point-wise hand-crafted geometric characterizations are calculated for point cloud feature learning. Then, semantic segmentation and instance center prediction are conducted simultaneously to group preliminary instance proposals. Finally, the per-instance refinement module re-segments the background points of the grouped instances.

Geometric characterization learning
Numerous instances, significant object density, and scale variations make the geometric characterization of ALS point clouds distinct. Moreover, compared with indoor data, outdoor ALS point clouds include many more background points, and for our point cloud instance segmentation task the classification of background points becomes essential. For optimal performance, special attention should be paid to the geometric relationships at both local and global scales. Thus, we propose GFLN, a geometric characterization-aware feature learning network (illustrated in Figure 3), inspired by Li et al. (2020). The geometric characterization of a point can describe shape, structure, and topological properties. However, it is generic and low-level, which limits its ability to represent complex scenes. Thus, in GFLN, we take geometric characterizations as prior knowledge and use a weight matrix, implemented as a multi-layer perceptron (MLP), to learn and generate high-level features. We first take each point and its spherical neighborhood for the following analysis. The normal vector N and the first three eigenvalues E (λ1 > λ2 > λ3) of the covariance matrix C of the neighborhood are chosen to define the input low-level feature g_l = [N, E]. Simultaneously, rigid KPConv is adopted as the backbone for feature learning on the original input points, owing to its impressive results on several open datasets. Figure 4 compares KPConv (as baseline) and GFLN on the task of semantic segmentation; GFLN yields a more precise division of boundaries between different categories, which is highly beneficial for the subsequent instance grouping task. Specifically, our point-wise feature learning method is a convolution-based U-Net. To enhance both local and global geometric understanding, the radius of the neighborhood area expands after each skip connection layer. In this work, GFLN is used for initial point cloud feature learning.
Geometric characterization-aware feature learning network (GFLN). For an input point P within a spherical range, the low-level geometric features g_l are first calculated and then passed through an MLP to generate the high-level feature g_h.
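As a concrete illustration, the low-level descriptor g_l = [N, E] described above can be computed per point with plain numpy. This is a minimal sketch under our own assumptions: the function name, the radius handling, and the zero fallback for degenerate neighborhoods are ours, not from the paper.

```python
import numpy as np

def local_geometric_features(points, center, radius):
    """Low-level descriptor g_l = [N, E] for one query point: the unit
    normal vector N and the eigenvalues E (lambda1 >= lambda2 >= lambda3)
    of the covariance matrix C of the spherical neighborhood."""
    # Spherical neighborhood query around the center point.
    mask = np.linalg.norm(points - center, axis=1) <= radius
    neighbors = points[mask]
    if len(neighbors) < 3:
        return np.zeros(6)  # degenerate neighborhood: no stable estimate
    # Covariance matrix of the neighborhood and its eigen-decomposition.
    cov = np.cov(neighbors.T)
    eigvals, eigvecs = np.linalg.eigh(cov)   # ascending eigenvalues
    order = np.argsort(eigvals)[::-1]        # reorder to lambda1 >= lambda2 >= lambda3
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]
    normal = eigvecs[:, 2]                   # direction of least variance
    return np.concatenate([normal, eigvals])
```

For a planar neighborhood, the smallest eigenvalue is close to zero and the normal points along the plane normal, which is exactly the shape cue the MLP can then lift into a high-level feature.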

Semantic prediction branch
For the semantic label prediction of all N input points, we leverage a softmax layer to obtain the score vector S = {s_1, s_2, ..., s_N} ∈ R^{N×C}, where C represents the number of semantic classes. The predicted semantic scores are supervised by a weighted cross entropy loss:

\[ L_{sem} = -\frac{1}{N}\sum_{i=1}^{N}\sum_{j=1}^{C} w_j\, t_{ij} \log y_{ij} \]

where w_j is the weight of class j, t_{ij} represents the true label of class j for sample i, and y_{ij} denotes the probability that the model predicts sample i as class j.
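The weighted cross entropy supervision can be sketched as below; the function and the per-class weight vector are illustrative, and the small epsilon for numerical stability is our own addition.

```python
import numpy as np

def weighted_cross_entropy(scores, labels, class_weights):
    """L_sem = -(1/N) * sum_i w_{c_i} * log y_{i,c_i}, with y = softmax(scores).
    `labels` are integer class ids; `class_weights` counteracts class imbalance
    (e.g. the dominance of background points in ALS scenes)."""
    scores = scores - scores.max(axis=1, keepdims=True)  # numerical stability
    probs = np.exp(scores) / np.exp(scores).sum(axis=1, keepdims=True)
    n = len(labels)
    picked = probs[np.arange(n), labels]   # y_{i, c_i}: prob of the true class
    w = class_weights[labels]              # w_{c_i}: weight of the true class
    return float(-np.mean(w * np.log(picked + 1e-12)))
```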

Center prediction network
Inspired by VoteNet (Qi et al., 2019), we learn a 3D offset to the object center for each point. However, in ALS point clouds the scale of objects from different categories varies significantly. To address this issue, our approach utilizes a 6-layer MLP with a pooling layer to enhance awareness of both local and global context features. The output is the offset vector O = {o_1, o_2, ..., o_N} ∈ R^{N×3}, representing the x, y, z offsets from each point to the geometric center of its corresponding object. Shifted points are obtained according to the predicted offsets. Notably, for background points the ground truth offset is 0. Furthermore, the features of the shifted points are leveraged to obtain objectness scores in the subsequent task. To evaluate the 3D offset o_i, we compare the predicted center y_i = x_i + o_i with the ground truth center g_i = x_i + o_i^{gt} to determine whether the shifted point lies on the object. The 3D offsets are thus supervised by a regression loss:

\[ L_{center} = \frac{1}{M_{sum}} \sum_{i=1}^{N} M_i \, \lVert y_i - g_i \rVert \]

where Y is the vector of predicted instance centers and M is the ground truth mask of instance points: if point p_i belongs to an instance, M_i = 1, otherwise M_i = 0. M_{sum} is the ground truth total number of instance points.
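Since y_i - g_i = o_i - o_i^{gt}, the regression loss reduces to a masked distance between predicted and ground truth offsets. A minimal sketch follows; using the L2 norm per point is our assumption, as the exact norm is not stated here.

```python
import numpy as np

def center_offset_loss(pred_offsets, gt_offsets, instance_mask):
    """Masked center regression: |y_i - g_i| = |o_i - o_i^gt| per point,
    summed over instance points (M_i = 1) and normalized by M_sum.
    Background points (M_i = 0) do not contribute."""
    dist = np.linalg.norm(pred_offsets - gt_offsets, axis=1)
    m_sum = instance_mask.sum()
    return float((dist * instance_mask).sum() / m_sum) if m_sum else 0.0
```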

Grouping instance
After point-wise semantic and center prediction, the results are used in the instance grouping stage. Initially, the features of the shifted points are filtered based on their semantic predictions to obtain subsets in which all points belong to the same class. Then, the features of the shifted points are utilized to generate class-wise objectness scores S_obj = {s_1, s_2, ..., s_M} ∈ R^{M×1}, where M represents the number of filtered subset points. Since the score indicates whether a point belongs to an instance class, we opt for the sigmoid activation function. Points with objectness scores above a certain threshold t are regarded as positive object predictions (potential points), on which DBSCAN grouping is subsequently performed to obtain instance proposals. Selecting potential points improves the semantic precision of the grouped points and thus largely prevents earlier semantic errors from propagating into the grouping stage. Considering the significant variation in object scale, we apply different grouping parameters for each instance category, depending on the mean number of points per instance. The objectness score loss is the mean squared error (MSE) of the instance predictions:

\[ L_{score} = \frac{1}{M} \sum_{i=1}^{M} (y_i - t_i)^2 \]

where y_i and t_i represent the prediction and the ground truth label, respectively.
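The potential-point selection and its MSE supervision can be sketched as below; the threshold t = 0.5 is a placeholder, as its value is not reported at this point in the paper.

```python
import numpy as np

def select_potential_points(logits, t=0.5):
    """Sigmoid turns per-point logits into objectness scores; indices of
    points scoring above the threshold t are kept for DBSCAN grouping."""
    scores = 1.0 / (1.0 + np.exp(-np.asarray(logits, dtype=float)))
    return np.flatnonzero(scores > t), scores

def objectness_loss(scores, targets):
    """MSE between predicted objectness scores and binary ground truth."""
    scores, targets = np.asarray(scores), np.asarray(targets)
    return float(np.mean((scores - targets) ** 2))
```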

Per-instance refinement
The per-instance refinement stage reclassifies and refines the instance proposals from the previous bottom-up grouping stage.
To reduce the errors propagated from previous modules, an additional semantic prediction is conducted. It can be understood as a binary classification separating background from object points: the GFLN output features are fed into 3 MLP layers, producing the output semantic score vector S_refine = {s_1, s_2, ..., s_N} ∈ R^{N×2}. For the loss computation in this stage, we adopt the same approach as for the initial point-wise semantic segmentation in Section 3.2.

Loss and training process
The entire network can be trained end-to-end, with the loss propagated at each stage. The overall loss is computed as:

\[ L = \lambda_1 L_{sem} + \lambda_2 L_{center} + \lambda_3 L_{score} + \lambda_4 L_{refine} \]

where the vector λ denotes the corresponding weights. Specifically, we set λ1 = 10, λ2 = 4, λ3 = 3 and λ4 = 4.

Experiments dataset and preprocessing
In order to verify our work, we conduct experiments on a labeled open source dataset: DALES Object.

DALES Object dataset
The DALES Object dataset is a large-scale aerial LiDAR point cloud dataset designed for semantic and instance segmentation tasks. It provides detailed annotations, both semantic and instance-based, for various natural and man-made objects in urban and suburban environments. The dataset includes over half a billion accurately labeled points covering an area of approximately 10 square kilometers.
We consider originally labeled 7 classes: ground, vegetation, car, power line, fence, pole and building in the experiments.
For the task of semantic instance segmentation, we merge the ground, power line, and fence classes into a background class that is ignored during instance processing. The format of the utilized features is x, y, z.

Point cloud feature learning backbone
In line with the KPConv method, our implementation includes encoder and decoder blocks (Figure 5). To mitigate gradient vanishing, we employ skip connections through feature concatenation. Within each block, as the neighborhood radius of the points increases or decreases, down-sampling or up-sampling of the points occurs to enhance the understanding of local and global knowledge. Additionally, batch normalization is utilized to improve training speed and stability.
Each batch consists of several spherical areas. Specifically, for the hand-crafted point features, we adjust the radius of the neighborhood area based on the radius of the batch spherical areas. Considering the point density, we set the sampling resolution to 0.5 m for the DALES dataset.
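The 0.5 m sampling resolution corresponds to a voxel-grid subsampling step, which might look as follows; replacing each occupied cell by the centroid of its points is a common choice, though the paper does not specify the exact scheme.

```python
import numpy as np

def grid_subsample(points, resolution=0.5):
    """Voxel-grid subsampling: points are bucketed into cubic cells of side
    `resolution` (0.5 m in our DALES setup) and each occupied cell is
    replaced by the centroid of its points."""
    keys = np.floor(points / resolution).astype(np.int64)
    # Map each unique occupied cell to the mean of its points.
    _, inverse, counts = np.unique(keys, axis=0, return_inverse=True,
                                   return_counts=True)
    inverse = inverse.reshape(-1)
    sums = np.zeros((len(counts), points.shape[1]))
    np.add.at(sums, inverse, points)
    return sums / counts[:, None]
```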

Instance grouping
During the grouping stage, we employ DBSCAN. Given the significant variation in the average number of object points, we set the grouping parameters based on the object size, as outlined in Table 1. Here, "eps" denotes the neighborhood radius parameter, i.e. the maximum distance at which two samples are considered neighbors. For filtered subsets with fewer than n_p points, grouping falls back to n_p = 6.
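The class-wise grouping can be sketched with a minimal DBSCAN implementation; the eps/n_p values in `GROUP_PARAMS` below are placeholders, not the actual entries of Table 1.

```python
import numpy as np

# Per-class grouping parameters (eps in metres, minimum points n_p);
# the numbers are illustrative placeholders only.
GROUP_PARAMS = {"car": (1.0, 10), "building": (2.0, 50)}

def dbscan(points, eps, min_pts):
    """Minimal DBSCAN; returns a cluster id per point (-1 = noise)."""
    n = len(points)
    dists = np.linalg.norm(points[:, None, :] - points[None, :, :], axis=2)
    neigh = [np.flatnonzero(dists[i] <= eps) for i in range(n)]
    labels = np.full(n, -1)
    cluster = 0
    for i in range(n):
        if labels[i] != -1 or len(neigh[i]) < min_pts:
            continue  # already assigned, or not a core point
        labels[i] = cluster
        queue = list(neigh[i])
        while queue:  # expand the cluster through core points
            j = queue.pop()
            if labels[j] == -1:
                labels[j] = cluster
                if len(neigh[j]) >= min_pts:
                    queue.extend(neigh[j])
        cluster += 1
    return labels

def group_class(points, sem_class):
    """Group one semantically filtered subset with its class-specific
    parameters; small subsets fall back to n_p = 6 as in the paper."""
    eps, n_p = GROUP_PARAMS[sem_class]
    if len(points) < n_p:
        n_p = 6
    return dbscan(points, eps, n_p)
```

The O(n²) distance matrix is only for readability; a production version would use a spatial index.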

Instance merging
For trained-model validation, a problem arises because the batch outputs consist of spherical regions that may not cover the entire scene, so one instance may be segmented into parts across different spherical regions. To obtain the complete semantic instance segmentation result, these instances need to be merged. Our solution is as follows: assume there is a predicted instance vector p_ins in a spherical region s_i of the batch inputs, and the previously predicted instances are stored in a vector V_ins. We first calculate the intersection of the two. If there is a sufficient intersection with a stored instance, the two instances are treated as one.
Although the grouping strategy for each instance is based on semantic segmentation, the merge operation may lead to different semantic predictions within a single instance.Therefore, for classification consistency in predicted instances, we filter each instance based on the class with the most occurrences.The remaining points will be converted to background points (not part of any instance).
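The merge-and-filter procedure can be sketched as follows; the 0.3 overlap ratio and the larger-fragment class rule are our own assumptions standing in for the paper's "sufficient intersection" test and majority-vote class filtering.

```python
def merge_instances(stored, new_instances, overlap=0.3):
    """Merge instance proposals across overlapping spherical regions.
    Instances are (set_of_point_ids, class_id) pairs; a new proposal with
    sufficient point intersection with a stored instance is unioned into
    it, otherwise it is appended as a new instance.  The kept class
    follows the larger fragment, approximating majority voting."""
    for pts, cls in new_instances:
        for k, (spts, scls) in enumerate(stored):
            if len(pts & spts) >= overlap * min(len(pts), len(spts)):
                keep_cls = scls if len(spts) >= len(pts) else cls
                stored[k] = (spts | pts, keep_cls)
                break
        else:
            stored.append((pts, cls))  # no sufficient overlap anywhere
    return stored
```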

Semantic segmentation evaluation
For the task of point cloud semantic segmentation we adopt overall accuracy (OA) and F1 scores to evaluate the performance of our method (Equ. 5). OA is the proportion of correctly classified points, providing a general assessment of the model's performance across all classes. The F1 score is the harmonic mean of precision and recall; as a single metric accounting for both false positives and false negatives, it is useful for imbalanced class distributions.

\[ OA = \frac{TP + TN}{TP + TN + FP + FN}, \qquad F1 = \frac{2\,TP}{2\,TP + FP + FN} \]

where TP, TN, FP and FN represent the number of true positive, true negative, false positive and false negative predicted points, respectively.
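With the counts defined above, Equ. 5 translates directly into code:

```python
def segmentation_metrics(tp, fp, fn, tn):
    """OA and F1 from per-class counts of true/false positives/negatives."""
    oa = (tp + tn) / (tp + tn + fp + fn)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return oa, f1
```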

Semantic instance segmentation evaluation
For the task of semantic instance segmentation, we evaluate the mean class coverage mCov and the mean class-weighted coverage mwCov (Equ. 6), which represent the average instance-wise intersection over union (IoU). To conduct a comprehensive evaluation, we test the predicted instances whose IoU exceeds thre_iou (in our experiments, we set thre_iou = 0.1). Moreover, the mean precision and recall of the predicted instances are also calculated in our work.

\[ mCov = \frac{1}{M} \sum_{i=1}^{M} \max_j \mathrm{IoU}(p_i, g_j), \qquad mwCov = \sum_{i=1}^{M} \frac{n_i}{\sum_k n_k} \max_j \mathrm{IoU}(p_i, g_j) \]

where IoU(·,·) is the IoU between two point sets, p_i and g_j denote the predicted and ground truth instance point clouds, max_j IoU(p_i, g_j) is the highest IoU achieved with any ground truth instance g_j, M is the number of instance predictions, and n_i is the point count of the ground truth instance matched to prediction i.
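Equ. 6 can be computed from instance point-id sets as follows; representing instances as Python sets is our own simplification.

```python
def coverage_metrics(pred_instances, gt_instances):
    """mCov and mwCov over instance predictions: each p_i is matched to
    the ground truth instance with the highest IoU; mwCov weights each
    term by the matched ground truth instance's point count n_i."""
    def iou(a, b):
        return len(a & b) / len(a | b) if a | b else 0.0
    covs, weights = [], []
    for p in pred_instances:
        ious = [iou(p, g) for g in gt_instances]
        j = max(range(len(gt_instances)), key=lambda k: ious[k])
        covs.append(ious[j])
        weights.append(len(gt_instances[j]))  # n_i of the matched instance
    mcov = sum(covs) / len(covs)
    mwcov = sum(c * w for c, w in zip(covs, weights)) / sum(weights)
    return mcov, mwcov
```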

Results and discussion
5.1 Semantic segmentation results

DALES Object dataset
We compare the performance of our proposed point cloud feature learning backbone GFLN with the baseline method KPConv. Figure 7 illustrates the results of the two methods, and the accuracy evaluation is shown in Table 3.
The results show that our proposed method for semantic segmentation reaches the highest OA of 98.20%. The improvement is even more significant when dealing with limited training and testing samples, such as the car and pole classes in this dataset. Additionally, the results of semantic instance segmentation (GFLN SIS) indicate that the workflow enhances the classification performance for classes that were previously challenging to classify. For instance, for poles with sparse geometric distribution, employing GFLN as the backbone for point feature learning increased the F1 score from 0 to 0.054; with the constraints of the instance segmentation task, the F1 score further increased from 0.054 to 0.14.

5.2 Semantic instance segmentation results

DALES Object dataset
We conduct semantic instance segmentation with our proposed method. For comparison, we choose SoftGroup (Vu et al., 2022) as baseline, which utilizes a bottom-up strategy to generate soft grouping proposals and then refines the results with a top-down per-instance refinement module. The segmentation results are depicted in Figure 6. Table 2 provides the overall accuracy evaluation, and Table 5 shows the class-wise evaluation results.
Upon analyzing the results, our proposed method demonstrates superior overall performance compared to the baseline, particularly for large-scale buildings and vegetation. However, the F1 score for cars was the lowest, despite high precision. This is likely due to the small number of points for each car object, leading to errors in predicting the instance center offset. During the grouping stage, some shifted points were disregarded, while others were grouped into different objects (Figure 8). We attribute this to subsampling, which results in low geometric resolution and ambiguity for small-sized instances. For vegetation objects, high precision but low recall was observed. Upon reviewing the ground truth labels, we believe this is due to the subjective definition of ground truth vegetation objects (Figure 9), resulting in over-segmentation, particularly in low vegetation areas.

Per-instance refinement
We compared the results of two models on the DALES Object dataset, one using per-instance refinement and one without it. Our experiments demonstrate that the per-instance refinement module has a positive impact, increasing mean class coverage and mIoU in the accuracy evaluation. We provide the comparison result in Table 4.

5.4.2 Downstream work challenge: semantic instance completion

Since our work follows a bottom-up grouping-based approach, after grouping the instance points we can proceed with other downstream tasks such as instance completion.
Instance completion refers to predicting missing parts of 3D instances from incomplete or occluded 3D data. For example, Yuan et al. (2018) proposed the first learning-based architecture, PCN, which leverages a global feature from the incomplete input point cloud to generate a coarse result and then predicts the detailed output via a folding operation. Following PCN, we tried to train an end-to-end point cloud semantic instance completion network. During the implementation, however, we found that it is still a challenging task:
(1) When working with outdoor data, it is not feasible to input the entire scene in a single batch. Consequently, objects near the boundary are truncated, resulting in unavoidable structural deficiencies.
(2) During the end-to-end training process, the input points for the completion sub-network consist of the output of the previous module, which contains numerous erroneous predictions that propagate into the subsequent completion network.
(3) Some scenes do not contain target instances that can be fed into the completion module, making it impossible to calculate a loss for that batch and leading to gradient anomalies. To address these issues and allow the training to converge, we propose a potential solution. Inspired by Wang and Yao (2022), a prediction with a high posterior probability is typically more likely to be correct. Therefore, we define a soft instance proposal as an instance proposal whose predicted scores exceed a fixed threshold t. This operation ensures the high precision of the input instance points, enabling the subsequent completion tasks to proceed normally. With the soft instance proposal strategy, training is divided into two steps: Step 1 trains the soft instance proposals fed into the completion network; Step 2 is the completion training process, generating completed instances with semantic labels.

Limitations
Our method focuses on ALS point cloud semantic instance segmentation. While the framework achieves segmentation of different object categories in outdoor scenes, the overall performance is still relatively lower than in indoor scenes; we believe future networks can bring further improvements. Throughout the entire semantic instance training process, semantic segmentation performance increased only for certain classes compared to training the semantic segmentation network alone. Nevertheless, we still believe that the overall semantic segmentation performance can be enhanced by the instance segmentation module. Additionally, the instance grouping stage is sensitive to its parameters: when changing the scene domain, for example from a suburb to an urban area, the grouping parameters (see Table 1) should be reset, as the object attributes change significantly.

Conclusion
In this study, we have introduced an end-to-end geometric characterization-aware semantic instance segmentation network for ALS point clouds. The network incorporates hand-crafted geometric features into the point feature learning stage, resulting in a better understanding of the geometric relationships between points. Point offsets to the corresponding instance centers are learned for the instance segmentation task, and both semantic and offset predictions are utilized to enhance instance grouping. Moreover, a final per-instance refinement is conducted to refine the instance proposals and semantic segmentation results.
For future work, we intend to explore how to improve the overall instance accuracy and to conduct the downstream task of semantic instance completion. We believe that the performance of semantic segmentation will be enhanced through instance completion. Furthermore, completed instances are expected to exhibit improved performance in 3D modeling and scene understanding.

Figure 1 .
Figure 1. Comparison between an outdoor ALS point cloud and an indoor scan. Numerous instances, significant object density, and scale variations make ALS point clouds distinct from indoor data.

Figure 4 .
Figure 4. Semantic segmentation comparison between KPConv (left) and GFLN (right). GFLN performs better at the boundaries between different categories; KPConv makes obvious incorrect predictions within the red circled area.

Figure 5 .
Figure 5. Illustration of our point cloud feature learning backbone, GFLN. In the forward propagation process, the feature dimension of the points is transformed, points in each layer undergo a sampling operation, and skip connections are employed to mitigate vanishing gradients.

Figure 6 .
Figure 6. Point cloud semantic instance segmentation performance on the DALES Object dataset. Ours (left), ground truth (right); different instances are shown in different colors.
Runtime analysis: Training and validation are conducted on the same GTX 1080Ti GPU. Based on our testing on the DALES Object dataset, for the task of point cloud semantic instance segmentation the average time for one step of an epoch is 8 seconds (with the per-instance refinement module) and 5 seconds (without it). For the task of semantic segmentation, the average time decreases to 0.5 seconds.

Figure 7 .
Figure 7.Comparison of point cloud semantic segmentation performance on DALES Object dataset.KPConv base line (left), GFLN (middle) and ground truth (right).

Figure 8 .
Figure 8. Point cloud semantic instance segmentation results on the DALES Object dataset. Ours (left), ground truth (right). For relatively small instances such as the cars within the red circled area, three objects are grouped as one instance.

Figure 9 .
Figure 9. Point cloud semantic instance segmentation results on the DALES Object dataset. Ours (left), ground truth (right). Our results segment individual trees, whereas the ground truth label combines trees that are close to each other into one instance.
Overview of the proposed method. Our network consists of the instance center prediction, semantic segmentation, and per-instance refinement modules. Taking an ALS point cloud P with K extended feature dimensions as input, instance proposals are obtained by the semantic and center prediction modules. The features of each grouped instance are passed to the per-instance refinement module for the final instance outputs.

Table 2 .
Accuracy evaluation of point cloud semantic instance segmentation on DALES Object dataset.

Table 3 .
Accuracy evaluation of point cloud semantic segmentation.

Table 4 .
Ablation study on performance of per-instance refinement module.

Table 5 .
Class-wise accuracy evaluation of point cloud semantic instance segmentation on DALES Object dataset.