JSMNet: Improving Indoor Point Cloud Semantic and Instance Segmentation through Self-Attention and Multiscale Fusion

The semantic understanding of indoor 3D point cloud data is crucial for a range of subsequent applications, including indoor service robots, navigation systems, and digital twin engineering. Global features are crucial for achieving high-quality semantic and instance segmentation of indoor point clouds, as they provide essential long-range context information. To this end, we propose JSMNet, which combines a multi-layer network with a global feature self-attention module to jointly segment three-dimensional point cloud semantics and instances. To better express the characteristics of indoor targets, we have designed a multi-resolution feature adaptive fusion module that takes into account the differences in point cloud density caused by varying scanner distances from the target. Additionally, we propose a framework for joint semantic and instance segmentation by integrating semantic and instance features to achieve superior results. We conduct experiments on S3DIS, which is a large three-dimensional indoor point cloud dataset. Our proposed method is compared against other methods, and the results show that it outperforms existing methods in semantic and instance segmentation and provides better results in target local area segmentation. Specifically, our proposed method outperforms PointNet (Qi et al., 2017a) by 16.0% and 26.3% in terms of semantic segmentation mIoU in S3DIS (Area 5) and instance segmentation mPre, respectively. Additionally, it surpasses ASIS (Wang et al., 2019) by 6.0% and 4.6%, respectively, as well as JSPNet (Chen et al., 2022) by a margin of 3.3% for semantic segmentation mIoU and a slight improvement of 0.3% for instance segmentation mPre.


Introduction
Three-dimensional indoor point cloud scene understanding is crucial for indoor navigation (Diaz-Vilarino et al., 2016; Kim et al., 2018), synchronous positioning, and indoor scene modeling (Poux et al., 2018). Joint semantic and instance segmentation of indoor point clouds (Long et al., 2015; Kirillov et al., 2019) is a critical technology for these applications.
Indoor point cloud semantic segmentation assigns labels to the different areas in the scene based on predefined category labels.
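Instance segmentation, on the other hand, extends semantic segmentation by labeling the different instances within a category. In recent years, point cloud labeling methods that leverage deep learning have developed rapidly (Wang et al., 2018; Yi et al., 2019). With advancements in three-dimensional information collection technology, indoor point clouds can provide rich geometry, shape, and texture information, thereby presenting realistic scenes. However, the unordered and unstructured nature of 3D point cloud data presents significant challenges for efficient semantic expression of indoor point clouds. These challenges include the difficulty of extracting point cloud features and the high computational and memory consumption of models.
The application of the self-attention mechanism in point cloud understanding and recognition (Xie et al., 2018; Feng et al., 2020) has enabled the efficient extraction of point cloud features. By incorporating a self-attention mechanism, semantically rich features can be obtained, which helps to improve the performance of semantic segmentation. In this study, we present a novel deep learning network framework for joint semantic and instance segmentation of indoor point clouds. Our proposed framework combines Transformer and PointConv to create a global feature self-attention coding module for feature extraction, resulting in robust indoor point cloud feature expression. To overcome the loss after information interpolation, we integrate information with different resolutions and adaptively fuse the multi-resolution features of each point to increase feature significance. Finally, semantic and instance segmentation modules are integrated into a unified model, allowing the two branches to promote each other, resulting in a better semantic expression effect for indoor point clouds. The following are the contributions of our study:
• We combine Transformer and PointConv to design a global feature encoding layer that is based on self-attention and create an innovative model for point cloud feature extraction.
• We design a multi-resolution feature adaptive fusion module that is specialized for indoor point clouds so that a fine, multi-scale, significant feature expression can be obtained.
• We propose a new deep learning framework for joint instance and semantic segmentation in which instance segmentation and semantic segmentation promote each other.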
• Our framework achieves state-of-the-art results on 3D instance segmentation and semantic segmentation tasks on the Stanford large 3D indoor space dataset (S3DIS) (Armeni et al., 2016).

Method
The overall structure of JSMNet is depicted in Figure 1. The Transformer encoder-decoder module extracts features at each layer and then outputs the feature matrix via two parallel decoders. Afterward, the multi-resolution feature adaptive fusion module outputs a feature matrix. We use three branches to integrate and supplement information from the two paths, resulting in better indoor scene segmentation.

Transformer encoder-decoder module
In this section, we propose a Transformer encoder-decoder module for initializing features of indoor point clouds. The module consists of two parts: the Transformer encoder and the decoder. The encoder is constructed by successively applying three PointConv (Wu et al., 2019) encoding layers after the SA layer (Qi et al., 2017b), followed by four Transformer modules.

SA layer
To obtain an abstract representation of point sets in indoor point clouds, we introduce the SA layer of PointNet++ (Qi et al., 2017b) as the first layer of the encoder. In addition, to avoid losing useful information, and unlike PointNet++, we use attention pooling (Yang et al., 2020; Hu et al., 2020) instead of max pooling, so that useful information is aggregated through learned weights.
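The difference between max pooling and attention pooling can be illustrated with a minimal NumPy sketch. It assumes neighbor features of shape (N, K, C) and a hypothetical learnable score matrix `w_score`; the actual scoring function in the cited attention pooling works may differ.

```python
import numpy as np

def softmax(x, axis):
    x = x - x.max(axis=axis, keepdims=True)   # subtract the max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def max_pool(neigh_feats):
    """Max pooling over the K neighbors of each point: (N, K, C) -> (N, C)."""
    return neigh_feats.max(axis=1)

def attention_pool(neigh_feats, w_score):
    """Attention pooling over the K neighbors of each point: (N, K, C) -> (N, C).

    w_score is a hypothetical (C, C) weight matrix that yields one score per
    neighbor and channel. A softmax over the K neighbors turns the scores into
    weights, and the output is the weighted sum of the neighbor features.
    """
    scores = neigh_feats @ w_score             # (N, K, C)
    weights = softmax(scores, axis=1)          # normalize across neighbors
    return (weights * neigh_feats).sum(axis=1)

rng = np.random.default_rng(0)
feats = rng.normal(size=(4, 8, 16))            # 4 points, 8 neighbors, 16 channels
w = rng.normal(size=(16, 16)) * 0.1            # stands in for learned weights
pooled = attention_pool(feats, w)              # shape (4, 16)
```

Because the weights form a convex combination, every pooled value lies between the minimum and maximum over the neighbors, whereas max pooling keeps only the single largest response per channel.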

Transformer module
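The irregular embedding of indoor point clouds in metric space, together with the Transformer's insensitivity to the arrangement and cardinality of its input features, makes the Transformer a well-suited architecture for point cloud processing. In our Transformer layer, we subtract the attention vector $\psi(x_j)$ from the attention vector $\varphi(x_i)$ and add the position encoding $\delta$ to the difference and to the attention vector $\alpha(x_j)$, given by:
$$y_i = \sum_{x_j \in X(i)} \rho\big(\gamma(\varphi(x_i) - \psi(x_j) + \delta)\big) \odot \big(\alpha(x_j) + \delta\big)$$
Here, the subset $X(i) \subseteq X$ is the local neighborhood of $x_i$ (the $k$ nearest neighbors of $x_i$). The mapping function $\gamma(\cdot)$ is an MLP with two linear layers and a ReLU nonlinear activation function, and $\rho(\cdot)$ is a softmax function. $\varphi(\cdot)$, $\psi(\cdot)$, and $\alpha(\cdot)$ represent three different attention vectors in the Transformer. $p_i$ and $p_j$ denote the three-dimensional coordinates of points $i$ and $j$, from which the position encoding $\delta$ is computed.
We construct a Transformer module centered on this Transformer layer, comprising a Point Transformer layer, a linear projection, and a residual link to reduce dimensionality and speed up processing.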
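The layer described above can be sketched in NumPy as vector attention over k-nearest-neighbor neighborhoods. This is a minimal illustration, not the authors' implementation: the parameter names (`phi`, `psi`, `alpha`, `theta1`, `theta2`, `gamma1`, `gamma2`) are hypothetical, the weights are random, and the position encoding is assumed to be a two-layer MLP applied to the coordinate difference of each point and its neighbor.

```python
import numpy as np

def knn_indices(points, k):
    """Indices of the k nearest neighbors of every point (the point itself included)."""
    d2 = ((points[:, None, :] - points[None, :, :]) ** 2).sum(-1)   # (N, N)
    return np.argsort(d2, axis=1)[:, :k]

def mlp(x, w1, w2):
    """Two linear layers with a ReLU in between."""
    return np.maximum(x @ w1, 0.0) @ w2

def vector_attention(x, p, params, k):
    """x: (N, C) features, p: (N, 3) coordinates. Returns (N, C) features."""
    idx = knn_indices(p, k)                                  # (N, k)
    query = x @ params["phi"]                                # (N, C)
    key = (x @ params["psi"])[idx]                           # (N, k, C)
    value = (x @ params["alpha"])[idx]                       # (N, k, C)
    rel = p[:, None, :] - p[idx]                             # (N, k, 3) coordinate differences
    delta = mlp(rel, params["theta1"], params["theta2"])     # position encoding, (N, k, C)
    logits = mlp(query[:, None, :] - key + delta, params["gamma1"], params["gamma2"])
    logits = logits - logits.max(axis=1, keepdims=True)
    weights = np.exp(logits)
    weights = weights / weights.sum(axis=1, keepdims=True)   # softmax over the k neighbors
    return (weights * (value + delta)).sum(axis=1)           # (N, C)

rng = np.random.default_rng(0)
N, C, k = 32, 8, 4
x = rng.normal(size=(N, C))
p = rng.normal(size=(N, 3))
params = {name: rng.normal(size=(C, C)) * 0.3
          for name in ("phi", "psi", "alpha", "theta2", "gamma1", "gamma2")}
params["theta1"] = rng.normal(size=(3, C)) * 0.3
y = vector_attention(x, p, params, k)
```

Reordering the input points only reorders the output rows, which reflects the insensitivity to input arrangement mentioned in the text.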

Multi-resolution feature fusion module
After extracting features using the Transformer codec module, we obtain an output feature matrix with dimensions ×512.
The next steps involve upsampling and feature fusion via two separate decoder branches that use the PointConv's depthwise feature decoding layers.
In the semantic branch, we use the same operations as JSNet.
Although downsampling the point cloud during multilevel segmentation, fusion, and aggregation of depth information for point cloud tasks such as instance segmentation can benefit the extraction of discriminating features, the corresponding output features could become implicit and abstract. Therefore, we need to recover the feature map that supplies the original points and fully explains the encoded information for each point in the instance branch. To achieve this, we select and fuse fine-grained representations from multi-resolution feature maps.
First, the feature matrices of four layers with different resolutions are upsampled to obtain full-size feature representations of all N points and reconstructed into full-size feature maps by an MLP. To further improve the integration, we analyze the point-level perceptions $\{\phi_1, \phi_2, \phi_3, \phi_4\}$ of these maps and regress the fusion parameters from them. Finally, we integrate the multi-resolution features of each point into a comprehensive feature map $S_{out}$ for instance segmentation. Here, $\forall S' \in \{S_1', S_2', S_3', S_4'\}$ denotes an upsampled full-size feature map, FC(·) is a fully connected layer whose superscript indicates the number of kernels, and $\phi$ is the point-level information.
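One plausible reading of this adaptive fusion is a per-point softmax over the four resolutions. The sketch below assumes that each full-size map is scored by a single-kernel fully connected layer, that the scores are normalized by a softmax across the maps, and that the output is the weighted sum of the maps. These choices, and the names `adaptive_fusion` and `w_fc`, are assumptions for illustration.

```python
import numpy as np

def adaptive_fusion(feature_maps, w_fc):
    """Fuse M full-size feature maps per point.

    feature_maps: list of M arrays of shape (N, C), already upsampled to N points.
    w_fc: list of M weight vectors of shape (C,), hypothetical single-kernel FC layers.
    Returns the fused map (N, C) and the per-point fusion weights (M, N).
    """
    stacked = np.stack(feature_maps, axis=0)                       # (M, N, C)
    phi = np.stack([f @ w for f, w in zip(feature_maps, w_fc)])    # point-level scores, (M, N)
    phi = phi - phi.max(axis=0, keepdims=True)                     # numerical stability
    weights = np.exp(phi) / np.exp(phi).sum(axis=0, keepdims=True) # softmax over the M maps
    s_out = (weights[:, :, None] * stacked).sum(axis=0)            # (N, C)
    return s_out, weights

rng = np.random.default_rng(0)
maps = [rng.normal(size=(5, 3)) for _ in range(4)]   # four resolutions, 5 points, 3 channels
w_fc = [rng.normal(size=3) for _ in range(4)]
s_out, fusion_weights = adaptive_fusion(maps, w_fc)
```

With this formulation each point chooses its own mix of resolutions, so sparsely scanned regions can lean on coarser maps while dense regions keep fine detail.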

Joint instance and semantic segmentation module
Combining instance segmentation with semantic segmentation offers a distinctive approach to point cloud instance segmentation, because the two tasks can benefit from each other's learned features.
However, previous studies have shown that directly merging instance and semantic information may introduce low-quality semantic information, which can negatively influence the subsequent segmentation tasks (Hou et al., 2019). To address this challenge, we added a branch (l1 in Figure 1) to branches l2 and l3, inspired by the attentional context fusion module (Wen et al., 2020). In the l1 branch, we employ a self-attention mechanism to blend the original semantic features by weighted averaging. This enhances the useful information and masks the irrelevant information, thus avoiding the introduction of low-quality semantic information into instance segmentation.

Instance branch
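In the instance branch, as shown in equation (6), we first pass the feature matrix $F_{sem}$ obtained from the semantic branch through an attention context fusion module and a fusion gate, which merges the features of the semantic branch into the feature matrix $F_{ins}$ of the instance branch to obtain a new feature matrix $F_{ins,2}$. The gating function Gated(·) used in the fusion gate can be found in Wen et al. (2020). Next, as shown in equation (7), we concatenate $F_{ins,2}$ with $F_{ins}$ to obtain the feature matrix $F_{ins,3}$. We then convolve $F_{ins,3}$ using a 1 × 1 convolution and add the result to the feature matrix $F_{ins,2}$ (i.e., the Conv1D operation of l3 in Figure 1).
As shown in equation (8), the instance embedding matrix $F_{ins}'$ is obtained by applying a 1 × 1 convolution as described above.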
Finally, we use the mean-shift clustering algorithm (Comaniciu et al., 2002) to obtain each instance.
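The clustering step can be illustrated with a simplified flat-kernel mean-shift in NumPy. This is a sketch of the general idea only: the bandwidth, the merging rule, and the function name `mean_shift` are assumptions, and a production pipeline would more likely rely on an optimized library implementation.

```python
import numpy as np

def mean_shift(embeddings, bandwidth, n_iter=50, tol=1e-6):
    """Group point embeddings (N, D) into instances with a flat-kernel mean-shift.

    Every point starts as its own mode and is repeatedly moved to the mean of
    the embeddings that lie within `bandwidth` of it. Modes that end up closer
    than `bandwidth` to each other are merged into one instance.
    """
    modes = embeddings.copy()
    for _ in range(n_iter):
        dist = np.linalg.norm(modes[:, None, :] - embeddings[None, :, :], axis=-1)  # (N, N)
        within = (dist <= bandwidth).astype(float)
        new_modes = within @ embeddings / within.sum(axis=1, keepdims=True)
        shift = np.abs(new_modes - modes).max()
        modes = new_modes
        if shift < tol:
            break
    labels = np.full(len(modes), -1, dtype=int)
    centers = []
    for i, mode in enumerate(modes):
        for c_id, center in enumerate(centers):
            if np.linalg.norm(mode - center) < bandwidth:
                labels[i] = c_id            # mode belongs to an existing instance
                break
        else:
            centers.append(mode)            # mode starts a new instance
            labels[i] = len(centers) - 1
    return labels, np.array(centers)

rng = np.random.default_rng(1)
emb = np.vstack([rng.normal(0.0, 0.1, size=(10, 2)),
                 rng.normal(0.0, 0.1, size=(10, 2)) + 10.0])
labels, centers = mean_shift(emb, bandwidth=2.0)
```

The number of instances is not fixed in advance; it follows from how many distinct modes the embeddings converge to, which is why well-separated instance embeddings matter.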

Semantic branch
In the semantic segmentation branch, we follow an approach similar to that of Zhao and Tao (2020), integrating the feature matrix $F_{ins,3}$ generated by the instance branch into the semantic feature space through branch l2.
Specifically, we perform a 1 × 1 convolution on $F_{ins,3}$ and then average the resulting matrix column-wise (taking the mean of the elements in each column). Next, as shown in equation (9), we apply a tiling operation, Tile(·), to replicate the resulting row vector row by row, generating a matrix that we add element-wise to the semantic feature matrix $F_{sem}$, which yields the matrix $F_{sem,2}$. As shown in equation (10), we obtain the matrix $F_{sem,3}$ by concatenating $F_{sem}$ and $F_{sem,2}$. Finally, as in the instance branch and as shown in equation (11), we obtain the semantic feature matrix $F_{sem}'$ and use a learned classifier to derive the final semantic labels.
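The sequence of operations in the semantic branch (1 × 1 convolution, column-wise mean, tiling, element-wise addition, concatenation, and a final 1 × 1 convolution) can be traced with a small NumPy sketch. The channel sizes, the random weights, and the function names are hypothetical; a 1 × 1 convolution over points is modeled as a per-point linear map.

```python
import numpy as np

def conv1x1(x, w, b):
    """A 1x1 convolution over points acts as a per-point linear map: (N, Cin) -> (N, Cout)."""
    return x @ w + b

def semantic_fusion(f_sem, f_ins3, w_a, b_a, w_b, b_b):
    """Fuse instance features into the semantic features.

    f_sem: (N, C) semantic features, f_ins3: (N, Cin) instance-branch features.
    Returns the fused semantic output (N, Cout) and the intermediate matrix (N, C).
    """
    g = conv1x1(f_ins3, w_a, b_a)                      # (N, C)
    g_mean = g.mean(axis=0, keepdims=True)             # (1, C), mean of each column
    g_tiled = np.tile(g_mean, (f_sem.shape[0], 1))     # replicate the row for all N points
    f_sem2 = f_sem + g_tiled                           # element-wise addition
    f_sem3 = np.concatenate([f_sem, f_sem2], axis=1)   # (N, 2C)
    return conv1x1(f_sem3, w_b, b_b), f_sem2

rng = np.random.default_rng(0)
n_points, c, n_classes = 6, 4, 13
f_sem = rng.normal(size=(n_points, c))
f_ins3 = rng.normal(size=(n_points, 2 * c))
w_a, b_a = rng.normal(size=(2 * c, c)), np.zeros(c)
w_b, b_b = rng.normal(size=(2 * c, n_classes)), np.zeros(n_classes)
f_out, f_sem2 = semantic_fusion(f_sem, f_ins3, w_a, b_a, w_b, b_b)
```

Averaging before tiling means every point receives the same instance-derived offset, so the instance branch contributes scene-level context rather than per-point noise.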

Training
During training, the semantic segmentation branch uses the cross-entropy loss function ($L_{sem}$). In the instance segmentation branch, as shown in equation (12), we define the loss function $L_{ins}$ as the sum of three parts, $L_{pull}$, $L_{push}$, and $L_{reg}$:
$$L_{ins} = L_{pull} + L_{push} + \gamma L_{reg} \quad (12)$$
where $L_{pull}$ pulls each embedding toward the average embedding of its instance (i.e., its center), promoting intra-instance similarity and discouraging inter-instance confusion. $L_{push}$ makes the embeddings of different instances mutually exclusive, ensuring that different instances are well separated and preventing overlap. $L_{reg}$ is a regularization term that bounds the embedding values by penalizing embeddings that are too far from the origin, ensuring that the centers of the clusters in the mapping space do not drift too far away. In the experiments, we set γ to 0.001. In the formulas for these terms, $I$ represents the number of ground-truth instances, $N_i$ is the number of points in the $i$-th instance, and $e_j$ is the embedding of the $j$-th point.
$\mu_i$ represents the average embedding of instance $i$, which serves as the instance center. $\delta_v$ and $\delta_d$ represent the margins of $L_{pull}$ and $L_{push}$, respectively. The notation $[x]_+ = \max(0, x)$ refers to the hinge function, while $\|\cdot\|_1$ is the L1 distance.
During training, the total loss function is composed of the semantic branch loss ($L_{sem}$) and the instance branch loss ($L_{ins}$), yielding $L = L_{sem} + L_{ins}$.
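For reference, discriminative losses of this kind (for example, the one used by ASIS) are commonly written with a pulling term $\frac{1}{I}\sum_{i=1}^{I}\frac{1}{N_i}\sum_{j=1}^{N_i}\big[\|\mu_i - e_j\|_1 - \delta_v\big]_+^2$, a pushing term $\frac{1}{I(I-1)}\sum_{i_A \neq i_B}\big[2\delta_d - \|\mu_{i_A} - \mu_{i_B}\|_1\big]_+^2$, and a regularizer $\frac{1}{I}\sum_{i=1}^{I}\|\mu_i\|_1$. This form agrees with the symbol definitions above, but the squared hinge, the factor 2 on the pushing margin, and the margin values 0.5 and 1.5 are assumptions rather than details stated in the text. The NumPy sketch below evaluates this assumed form; the function name `instance_loss` is hypothetical.

```python
import numpy as np

def instance_loss(emb, inst_ids, delta_v=0.5, delta_d=1.5, gamma=0.001):
    """Assumed discriminative instance loss: pull + push + gamma * reg.

    emb: (N, D) point embeddings, inst_ids: (N,) ground-truth instance ids.
    delta_v and delta_d are assumed margin values; gamma = 0.001 follows the text.
    """
    ids = np.unique(inst_ids)
    centers = np.stack([emb[inst_ids == i].mean(axis=0) for i in ids])   # (I, D)

    # Pull: hinge on the L1 distance between each embedding and its instance center.
    pull = 0.0
    for center, i in zip(centers, ids):
        dist = np.abs(emb[inst_ids == i] - center).sum(axis=1)
        pull += np.mean(np.maximum(dist - delta_v, 0.0) ** 2)
    pull /= len(ids)

    # Push: hinge on the L1 distance between every ordered pair of distinct centers.
    push = 0.0
    if len(ids) > 1:
        center_dist = np.abs(centers[:, None, :] - centers[None, :, :]).sum(-1)   # (I, I)
        hinge = np.maximum(2.0 * delta_d - center_dist, 0.0) ** 2
        np.fill_diagonal(hinge, 0.0)
        push = hinge.sum() / (len(ids) * (len(ids) - 1))

    # Regularizer: mean L1 norm of the centers.
    reg = np.abs(centers).sum(axis=1).mean()
    return pull + push + gamma * reg, (pull, push, reg)

emb = np.array([[0.0, 0.0], [0.0, 0.0], [10.0, 0.0], [10.0, 0.0]])
ids = np.array([0, 0, 1, 1])
total, (pull, push, reg) = instance_loss(emb, ids)
```

In this toy input the two instances are tight and far apart, so the pulling and pushing terms vanish and only the small regularizer contributes to the total.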

Datasets and evaluation metrics
We evaluated our method on the S3DIS (Stanford Large-Scale 3D Indoor Spaces) dataset (Armeni et al., 2016). This dataset comprises 3D point clouds of interior spaces, where each point is defined by its spatial coordinates, spectral information (e.g., RGB values), and semantic and instance labels. The dataset consists of six areas (Areas 1-6), encompassing a total of 272 rooms and 13 object categories.
For the S3DIS dataset, we conducted 6-fold cross-validation (6-fold CV), following the k-fold cross-validation used in PointNet (Qi et al., 2017a), to ensure a fair comparison with other methods.
Additionally, we present the results on the fifth fold (Area 5), which is a separate building and exhibits some differences from the other areas, following Tchapmi et al. (2017). For semantic segmentation evaluation, we report overall accuracy (oAcc), mean accuracy (mAcc), and mean IoU (mIoU). For indoor instance segmentation, the indicators include mean precision (mPrec), mean recall (mRec), and mean coverage, with an IoU threshold of 0.5. We also employ the coverage (Cov) and weighted coverage (WCov) metrics proposed by Ren and Zemel (2017) to assess the performance of indoor scene instance segmentation.
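The three semantic segmentation metrics can be computed from a confusion matrix, as in the following NumPy sketch. Averaging only over classes that occur in the ground truth is an assumption made here; some evaluation scripts average over all classes instead. The function name `semantic_metrics` is hypothetical.

```python
import numpy as np

def semantic_metrics(pred, gt, num_classes):
    """Return (oAcc, mAcc, mIoU) for integer label arrays pred and gt."""
    conf = np.zeros((num_classes, num_classes), dtype=np.int64)
    np.add.at(conf, (gt, pred), 1)            # rows: ground truth, columns: prediction
    tp = np.diag(conf).astype(float)          # correctly labeled points per class
    gt_count = conf.sum(axis=1)               # points per ground-truth class
    pred_count = conf.sum(axis=0)             # points per predicted class
    present = gt_count > 0                    # ignore classes absent from the ground truth
    o_acc = tp.sum() / conf.sum()
    m_acc = np.mean(tp[present] / gt_count[present])
    union = gt_count + pred_count - tp
    m_iou = np.mean(tp[present] / union[present])
    return o_acc, m_acc, m_iou

gt = np.array([0, 0, 1, 1, 2, 2])
pred = np.array([0, 1, 1, 1, 2, 0])
o_acc, m_acc, m_iou = semantic_metrics(pred, gt, num_classes=3)
```

For this toy example the per-class IoU values are 1/3, 2/3, and 1/2, which shows how mIoU penalizes both missed points and false positives while oAcc only counts correct labels.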

Evaluation and comparison
In this section, we comprehensively evaluate our method and compare it with some existing semantic and instance segmentation methods.

Conclusions
This study presents a novel deep learning framework for joint semantic and instance segmentation of indoor point clouds.
The framework comprises the Transformer module, the multi-resolution feature fusion module, and the feature channel aggregation module, specifically designed to enable the joint processing of semantic segmentation and instance segmentation of indoor scenes. By integrating and promoting each other, the two branches of semantic segmentation and instance segmentation achieve superior performance compared to other methods, as revealed by the results of testing on the S3DIS dataset. Our method outperformed JSPNet by 3.1%, 1.3%, and 3.3% on the mAcc, oAcc, and mIoU metrics, respectively.
However, indoor scene segmentation remains a challenging task due to high occlusion, clutter, and variability. To obtain more effective and robust results, we plan to consider integrating the multi-level and multi-scale information structure (Tao et al., 2020) and leveraging some fine-grained perceptual models in future work.

References
Armeni, I., Sener, O., Zamir, A.R., Jiang, H., Brilakis, I., Fischer, M., Savarese, S., 2016. 3D semantic parsing of large-scale indoor spaces. In: Proceedings of the IEEE

Figure 1. An overview of JSMNet, which utilizes self-attention and multiscale fusion for joint semantic and instance segmentation of indoor point clouds.

Figure 2. Comparison of segmentation results between our method and JSNet on S3DIS.

Table 2. Instance segmentation results on the S3DIS dataset.