INSTANCE SEGMENTATION OF LIDAR DATA WITH VISION TRANSFORMER MODEL IN SUPPORT INUNDATION MAPPING UNDER FOREST CANOPY ENVIRONMENT

Inundation mapping in forests and densely vegetated areas requires well-defined Digital Terrain Models (DTM) from which floodwater extent, depth, and duration can be derived. Because of the occlusion caused by the overlapping leaves and branch structures of forest canopies, extracting elevation point clouds from UAV and airborne optical imagery through photogrammetry is challenging. LiDAR is an active sensor that acquires direct 3D measurements by transmitting hundreds of thousands of laser pulses per second, producing highly detailed mapping layers of not only the terrain but also forest attributes such as crown diameter, tree density, and height. These layers can support inundation mapping as well as hydrological modelling and flood monitoring. In this research, we propose a methodology to map inundated areas under canopies using a photon-based Geiger Mode LiDAR point cloud dataset and a deep learning model that performs instance segmentation of the tree canopy. The method segments the vegetation from water and determines the gap fraction between trees to quantify canopy penetration for the detection of water pixels in vegetated areas. To conduct the segmentation, the Masked-attention Mask Transformer (Mask2Former) universal segmentation model was implemented and trained to automate the extraction of tree crown segments from the LiDAR data. Furthermore, a semi-automatic experimental approach using a Canopy Height Model and watershed segmentation was applied to develop a rapid tree crown annotation strategy.


Introduction
In the hydrological process, forest areas play a vital role in retaining and delaying water flow into drainage networks. They also absorb excess water and release it back into the atmosphere through transpiration. Detection of flood areas by remote sensing has used Synthetic Aperture Radar (SAR) backscatter properties and multispectral indices such as the Normalized Difference Water Index (NDWI) (Gebrehiwot and Hashemi-Beni 2020). However, reliable mapping of flooded areas in forests and vegetated areas is limited by spatial resolution, the frequency of cloud cover in optical imagery, reflectance properties, and the ability to detect water pixels through the gap fraction of the canopy (Salem and Hashemi-Beni 2021). Airborne LiDAR technology can be used to compute tree characteristics, including crown area, orientation, and height, as well as under-canopy terrain that can be combined with imagery (Hashemi-Beni et al. 2021). Geiger Mode LiDAR data presents an effective solution for forestry applications because of its high-density data collection, high resolution, accuracy, and multi-look diversity of oblique overlapping frame measurements. By integrating Geiger Mode LiDAR data with precise individual tree segmentation algorithms, it becomes feasible to calculate tree attributes accurately on a large scale and to create high-definition terrain for inundation mapping.
In the past, tree crown segmentation primarily relied on watershed segmentation using a Canopy Height Model (CHM) derived from a 3D point cloud (Zhao and Popescu 2007). Watershed segmentation is a region-based method built on mathematical morphology. However, this approach suffers from over-segmentation and vulnerability to noise, which limits its effectiveness in dense forest areas where crown segment boundaries are unclear. Additionally, the need for parameter tuning hinders complete automation. In recent years, deep learning methods have achieved outstanding results in challenging computer vision tasks, including instance segmentation. In the context of forestry applications, Individual Tree Crown (ITC) segmentation is closely related to instance segmentation, which involves the identification and separation of individual objects. Both multispectral images and LiDAR-derived CHMs can be leveraged effectively for this task. In ITC detection and segmentation, Convolutional Neural Network (CNN) based architectures play a dominant role, including YOLO (Jiang et al. 2022) and Mask R-CNN (He et al. 2017).
Transformer models have also shown remarkable success in natural language processing tasks, which has motivated researchers to explore their application to computer vision problems. Through the self-attention mechanism, vision transformers can model the relationships between different patches across the entire image, effectively capturing the global context of the scene. For instance segmentation and object detection, DETR (Carion et al. 2020) was proposed as a transformer-based architecture and has been used for tree crown instance segmentation (Dersch et al. 2023). Although DETR was initially promising, it still fell behind CNNs in performance because it did not fully leverage the potential of transformers for image instance segmentation. To address this limitation, our research introduces a transformer-based network for ITC segmentation, building upon the state-of-the-art Mask2Former architecture (Cheng et al. 2022). In our study, we highlight the potential of applying the Mask2Former vision transformer model to Geiger Mode LiDAR data to obtain accurate forestry analytics in support of flood risk mapping. By harnessing the capabilities of Geiger Mode LiDAR data, we can derive precise and valuable insights for various environmental applications.

Data preparation
The LiDAR data was processed into a calibrated point cloud and projected into NAD_1983_2011_Idaho_West_ft. Then, a classification step was conducted to remove noise and non-vegetation points. This was followed by a data cleaning process and quality review. A 50 cm CHM based on ground and vegetation points was created by normalizing the height to above ground level using the Digital Terrain Model, as shown in Figure 3. Four tiles are used in this study, each of which covers approximately 1,500 ft x 1,500 ft (457 m x 457 m) and contains approximately 3,000 trees with over 31 million LiDAR points.
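The height normalization described above can be illustrated with a minimal sketch. It assumes the surface elevations and the DTM are already rasterized on the same grid; the function name, the clamping of negative heights, and the toy values are illustrative assumptions, not details taken from the study.

```python
import numpy as np

def canopy_height_model(surface, dtm):
    """Normalize surface elevations to height above ground level (surface - DTM)."""
    surface = np.asarray(surface, dtype=float)
    dtm = np.asarray(dtm, dtype=float)
    chm = surface - dtm
    # Assumption: small negative heights are interpolation noise, so clamp them to 0.
    return np.where(chm < 0, 0.0, chm)

# Toy 3 x 3 rasters in metres (hypothetical values, not from the study area).
dtm = np.array([[100.0, 100.5, 101.0],
                [100.0, 100.5, 101.0],
                [100.0, 100.5, 101.0]])
surface = np.array([[100.0, 118.5, 101.0],
                    [112.0, 125.5, 100.9],
                    [100.0, 110.5, 101.0]])
chm = canopy_height_model(surface, dtm)
```

In this toy example the centre pixel has a canopy height of 25 m, and the pixel where the surface falls slightly below the terrain is clamped to ground level.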

Methodology

Overview
The research proposes a methodology based on ITC segmentation from a high-resolution CHM derived from Geiger Mode LiDAR data. Figure 4 shows the workflow of the processing procedures. The method starts by generating a high-resolution CHM from Geiger Mode LiDAR data for the study area. Thereafter, data labeling is conducted in a semi-automated manner: watershed segmentation generates crown segments, which are then refined manually. The CHM and tree crown polygons are used to construct an instance segmentation dataset, and finally the Mask2Former model is implemented for crown instance segmentation.

Data labeling
Supervised deep learning models need labeled or annotated data for training and for the validation of results. However, manually digitizing tree crown segments is time consuming and expensive, especially in dense forest environments. We propose a semi-automated data labeling process incorporating local maxima-based tree detection and watershed-based crown segmentation. This is coupled with a manual refinement process in problematic areas to improve the crown delineation quality. Specifically, we first apply a 5 x 5 Gaussian filter to smooth the CHM and minimize artefact noise. Then, a 7 x 7 local maxima filter is used to detect individual treetops. In this process, treetops below the minimum tree height of 5 m are removed, and two neighbouring treetops are merged when their distance is under 3.5 m. Next, the treetops are used as the markers for watershed segmentation, in which the height difference within a single crown should not exceed 0.5 m, or the segments are merged. Lastly, these tree crown segments are manually refined to obtain the final ground truth (Figure 5).

Table 1. Tree parameters for the training and validation split
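The automated part of the labeling procedure above (smoothing, treetop detection, marker-based watershed) can be sketched with SciPy as follows. This is a rough illustration under stated assumptions, not the authors' implementation: the Gaussian sigma, the 2 m background rule, and the function names are hypothetical, and the 3.5 m treetop merge, the 0.5 m height-difference rule, and the manual refinement are omitted.

```python
import numpy as np
from scipy import ndimage

MIN_TREE_HEIGHT_M = 5.0  # treetops below this height are discarded (from the text)

def detect_treetops(chm, sigma=1.0, window=7):
    """Smooth the CHM and return (smoothed CHM, labelled treetop markers, count)."""
    # sigma=1 with truncate=2 yields a 5 x 5 kernel; the sigma value is an assumption.
    smoothed = ndimage.gaussian_filter(chm, sigma=sigma, truncate=2.0)
    # A pixel is a local maximum if it equals the maximum of its 7 x 7 window.
    is_peak = smoothed == ndimage.maximum_filter(smoothed, size=window)
    is_peak &= smoothed >= MIN_TREE_HEIGHT_M
    markers, n_tops = ndimage.label(is_peak)
    return smoothed, markers, n_tops

def watershed_crowns(smoothed, markers, background_below_m=2.0):
    """Grow treetop markers into crown segments with a marker-based watershed."""
    # Invert the CHM so treetops become minima (the usual watershed convention) and
    # quantize to centimetres, because watershed_ift accepts only uint8/uint16 input.
    cost = np.round((smoothed.max() - smoothed) * 100).astype(np.uint16)
    seeds = markers.astype(np.int32)
    # Hypothetical background rule: low, unmarked pixels are seeded with label -1.
    seeds[(smoothed < background_below_m) & (seeds == 0)] = -1
    return ndimage.watershed_ift(cost, seeds)

# Synthetic CHM with two Gaussian-shaped crowns, 20 m and 15 m tall (toy data).
yy, xx = np.mgrid[0:64, 0:64]
chm = (20.0 * np.exp(-((yy - 20) ** 2 + (xx - 20) ** 2) / 50.0)
       + 15.0 * np.exp(-((yy - 44) ** 2 + (xx - 44) ** 2) / 50.0))
smoothed, markers, n_tops = detect_treetops(chm)
crowns = watershed_crowns(smoothed, markers)
```

On the synthetic raster the sketch finds two treetops and assigns them to different crown labels, while low ground far from both crowns is labelled as background.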

Mask2Former
Mask2Former is a mask classification architecture (Cheng et al. 2021) in which pixels are grouped into N segments by predicting N binary masks and N class labels. Unlike CNN-based segmentation models, where the model learns to predict a class for every pixel, mask classification splits the segmentation task into two steps: partitioning the image into N segments or regions represented by binary masks, and then associating each segment as a whole with a semantic class. This formulation allows for both semantic and instance segmentation. The Mask2Former model consists of three main components: a backbone, a pixel decoder, and a transformer decoder (Figure 6). The backbone extracts low-resolution features from an image. The pixel decoder gradually up-samples these features to generate high-resolution per-pixel embeddings. Finally, the transformer decoder processes these embeddings using learnable object queries to produce binary mask predictions. The Mask2Former model uses a masked attention operation in the transformer decoder, which constrains attention to the foreground region of the predicted mask for each object instead of attending to the full feature map (Cheng et al. 2022). Furthermore, instead of the standard convolution-based ResNet backbones, this study uses the Swin Transformer model (Liu et al. 2021), which is a transformer-based backbone. It builds hierarchical feature maps by merging image patches in deeper layers, and its computational complexity is linear in the input image size because self-attention is computed only within each local, shifted window. This makes it suitable for fine-scale instance segmentation from high-resolution images, such as individual tree crowns.
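The masked attention operation described above can be sketched in NumPy. This is a simplified single-head illustration, assuming the mask from the previous decoder layer is already binarized; the residual connection, learned projections, and multi-head structure of the real model are omitted, and the function and variable names are hypothetical.

```python
import numpy as np

def masked_attention(queries, keys, values, prev_masks):
    """Cross-attention in which each query attends only to its predicted foreground.

    queries:     (N, d) object-query features
    keys/values: (HW, d) flattened image features
    prev_masks:  (N, HW) boolean masks predicted by the previous decoder layer
    """
    d = queries.shape[1]
    logits = queries @ keys.T / np.sqrt(d)
    allowed = prev_masks.copy()
    # If a query's mask is empty, fall back to full attention to avoid an all -inf row.
    allowed[~allowed.any(axis=1)] = True
    # Background positions receive -inf, so their softmax weight is exactly zero.
    logits = np.where(allowed, logits, -np.inf)
    logits = logits - logits.max(axis=1, keepdims=True)
    weights = np.exp(logits)
    weights /= weights.sum(axis=1, keepdims=True)
    return weights @ values, weights

rng = np.random.default_rng(0)
queries = rng.normal(size=(2, 4))   # two object queries
keys = rng.normal(size=(6, 4))      # six image locations
values = rng.normal(size=(6, 4))
prev_masks = np.array([[True, True, False, False, False, False],
                       [False, False, False, False, False, False]])
updated, weights = masked_attention(queries, keys, values, prev_masks)
```

The first query distributes all of its attention over the two foreground locations, while the second query, whose mask is empty, attends to every location.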
To train the Mask2Former model, a matching is necessary between the set of predictions and the set of ground truth segments. This is done using a set prediction loss that enforces a one-to-one correspondence between predicted and ground truth instances. Then, the overall model is trained using a cross-entropy classification loss and a binary mask loss. The latter is a linear combination of focal loss (Lin et al. 2018) and dice loss (Sudre et al. 2017).
We run our experiments on a virtual workstation that has an Intel(R) Xeon(R) Gold 5218 2.3 GHz CPU, 64 GB RAM, and 21 GB of GPU memory provided by an NVIDIA GRID P40-24Q. Due to the limited GPU memory, the batch sizes for training and validation are set to 2. The AdamW optimizer is used in this study with an initial learning rate of 1e-4; it is a stochastic optimization method that modifies the typical implementation of weight decay in Adam by decoupling weight decay from the gradient update. The total number of training iterations is set to 68,750 to reach a stable validation accuracy. Due to the limited amount of training data, and to leverage the capabilities of transfer learning, we initialize the Swin Transformer backbone with pretrained weights obtained from the ImageNet-1k dataset. Figure 7 shows the classification loss and the mask loss throughout the training process. Figure 8 shows the evolution of the segmentation and detection mean average precision (mAP) on the validation set. The mAP is the area under the precision-recall curve averaged over all classes.
Note that a digitized ITC is deemed correctly detected if it intersects any of the detected ITCs. Meanwhile, a detected ITC is deemed correctly segmented if the Intersection over Union (IoU) between the digitized and detected ITCs is over 0.5. For further comparison, the watershed segmentation results are quantified by the same indices. In Table 2, we present different accuracy parameters for both the Mask2Former and watershed segmentation methods, including the number of digitized ITCs, the number of correctly detected ITCs, the number of correctly segmented ITCs, the rate of correct detection, and the rate of correct segmentation. Although the watershed method performs perfectly for the rate of correct detection, the Mask2Former method shows a 10% increase in the rate of correct segmentation. This demonstrates the advantages of a deep learning-based instance segmentation model for ITC segmentation.
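The detection and segmentation criteria above can be expressed as a short sketch. It assumes each crown is available as a boolean raster mask and that both rates are computed relative to the number of digitized ITCs; that denominator, the function names, and the toy masks are assumptions made for illustration.

```python
import numpy as np

def iou(a, b):
    """Intersection over Union of two boolean masks."""
    union = np.logical_or(a, b).sum()
    return float(np.logical_and(a, b).sum() / union) if union else 0.0

def evaluate_itc(digitized, detected, iou_threshold=0.5):
    """Count digitized crowns that are correctly detected and correctly segmented."""
    n_detected = n_segmented = 0
    for ref in digitized:
        best = max((iou(ref, det) for det in detected), default=0.0)
        n_detected += int(best > 0.0)            # intersects at least one detected crown
        n_segmented += int(best > iou_threshold) # best overlap exceeds the IoU threshold
    n_ref = len(digitized)
    return {"n_digitized": n_ref,
            "n_detected": n_detected,
            "n_segmented": n_segmented,
            "detection_rate": n_detected / n_ref,      # assumed denominator
            "segmentation_rate": n_segmented / n_ref}  # assumed denominator

def box(r0, r1, c0, c1, shape=(6, 6)):
    """Helper that builds a rectangular boolean mask (toy data only)."""
    mask = np.zeros(shape, dtype=bool)
    mask[r0:r1, c0:c1] = True
    return mask

digitized = [box(0, 2, 0, 2), box(0, 2, 3, 6), box(4, 6, 0, 2)]
detected = [box(0, 2, 0, 2), box(1, 4, 5, 6)]
result = evaluate_itc(digitized, detected)
```

In the toy example the first crown is matched exactly, the second is only touched by a detection (IoU of 0.125), and the third is missed, so two of three crowns count as detected and one as segmented.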

Discussion
Tree instance segmentation is a challenging task that requires detecting and segmenting individual crown segments. Since the Mask2Former model is a mask classification architecture, it can flexibly generate tree instance masks. Although only limited training has been applied, the results show that the model achieves a correct detection rate of 87% and a correct segmentation rate of 64% on the test set. The correct segmentation rate is higher than that of the watershed baseline, which shows the potential of Mask2Former in instance segmentation for challenging objects like trees. On the other hand, although the rate of correct detection is lower than that of the watershed method, this discrepancy is influenced by the pre-processing step applied to the point clouds, in which non-tree objects are systematically removed to generate a refined canopy height model. Deep learning models like Mask2Former have the potential to be more robust when deployed without this pre-cleaning step. Mask2Former can also consistently handle trees of varying sizes and shapes, as shown in Figure 9. However, some limitations and challenges remain to be addressed in the future. For example, the Mask2Former model requires a large amount of training data to achieve good performance, which may not be available or feasible for some regions or scenarios. Data quality and variety are also crucial to improving the robustness of the model. Additionally, the model may benefit from incorporating auxiliary information, such as multispectral imagery, to enhance its discriminative power.

Conclusion
Based on the results outlined above, the approach using Mask2Former improves the accuracy of ITC segmentation compared with the traditional watershed technique. This demonstrates the promise of state-of-the-art instance segmentation models in forestry contexts, notably for fine-scale attribute extraction facilitated by Geiger Mode LiDAR data. However, the limited availability and quality of crown samples, as well as the dense forest canopies, curtail the full potential of Mask2Former-based ITC segmentation, underscoring the need for further improvements in future work.
Geiger Mode LiDAR data was collected over a portion of the Payette River, near Crouch, Idaho, on June 25, 2022, at an average acquisition height of 3,787 m above ground level. The sensor collected data within a hemispherical perimeter swath with a 27° Field of View using a Palmer Scanner. Data was collected with 50% overlap between flight lines. Elevation measurements are based on laser flashes illuminating a contiguous 2D array Geiger Mode Avalanche Photodiode Detector of 4,096 pixels (Figure 1). The Palmer Scanner rotates the laser light, which flashes at a frequency of 50 kHz, producing overlapping array measurements that collect over 205 million points per second. Based on the Instantaneous Field of View (IFOV) of the individual photodiode detector and the elevation above ground, the measurement resolution, which is analogous to a linear LiDAR footprint, was 12 cm. The internal data processing uses a voxel process and produces a uniform point cloud distribution of 50 points per square meter.

Figure 2. 3D visualization of the study area

Figure 3. CHMs of the study area

Figure 4. ITC segmentation based on the Mask2Former model.

Figure 5. The semi-automated crown segmentation results for data labelling.

The study area was then split into training and validation data. Table 1 summarizes the number of trees for each split.

Figure 7. Visualization of the classification loss and the mask loss on the Training Set.

Figure 8. Validation set performance: segmentation mAP and bounding box mAP.

Evaluation
To validate the accuracy of ITC segmentation, 300 crowns are manually digitized in the upper-right CHM tile shown in Figure 3. By visual interpretation, only clearly distinguishable crowns are digitized in ArcMap (Figure 9). Since the Mask2Former segmentation results are produced as 224 x 224 image tiles, we restore them to the original dimensions of the CHM with the same georeferenced coordinates. To fully evaluate the detection and segmentation accuracy of instance segmentation, we define two indices: the rate of correct detection and the rate of correct segmentation.
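The restoration of 224 x 224 tiles to the original CHM dimensions can be sketched as follows. The tiling scheme is not described in the text, so this sketch assumes non-overlapping tiles with zero padding at the raster edges; the function names and the 914 x 914 raster size (roughly 457 m at 0.5 m resolution) are illustrative assumptions.

```python
import numpy as np

TILE = 224  # tile size stated in the text

def split_into_tiles(raster, tile=TILE):
    """Zero-pad the raster to a multiple of `tile` and cut it into square tiles."""
    h, w = raster.shape
    padded = np.pad(raster, ((0, (-h) % tile), (0, (-w) % tile)), mode="constant")
    tiles = {}
    for r in range(0, padded.shape[0], tile):
        for c in range(0, padded.shape[1], tile):
            # The (row, col) pixel offset is kept so georeferencing can be restored.
            tiles[(r, c)] = padded[r:r + tile, c:c + tile]
    return tiles, (h, w)

def stitch_tiles(tiles, shape, tile=TILE):
    """Place tiles back at their offsets and crop to the original raster shape."""
    h, w = shape
    rows = max(r for r, _ in tiles) + tile
    cols = max(c for _, c in tiles) + tile
    first = next(iter(tiles.values()))
    mosaic = np.zeros((rows, cols), dtype=first.dtype)
    for (r, c), data in tiles.items():
        mosaic[r:r + tile, c:c + tile] = data
    return mosaic[:h, :w]

rng = np.random.default_rng(0)
chm = rng.random((914, 914))  # stand-in for one CHM tile of the study area
tiles, original_shape = split_into_tiles(chm)
restored = stitch_tiles(tiles, original_shape)
```

Because every tile keeps its pixel offset, the stitched mosaic shares the geotransform of the source CHM. Instance labels predicted per tile would additionally need unique identifiers across tiles, which this sketch does not handle.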

Figure 12 demonstrates two examples of mask and bounding box predictions.

Table 2. Accuracy parameters for ITC segmentation using Mask2Former and watershed methods.

Figures 10 and 11 depict an overview of ITC segmentation results by the Mask2Former and watershed methods.