AN END-TO-END DEEP LEARNING WORKFLOW FOR BUILDING SEGMENTATION, BOUNDARY REGULARIZATION AND VECTORIZATION OF BUILDING FOOTPRINTS

ABSTRACT: Automatic building footprint extraction from remote sensing imagery is widely used, with deep learning techniques being particularly effective. However, deep learning approaches still require additional post-processing steps, because their pixel-wise predictions produce occluded and geometrically inaccurate building segments. To address this issue, we propose an end-to-end workflow that combines binary semantic segmentation, regularization and vectorization. We implement and assess the performance of four convolutional neural network architectures, U-Net, U-NetFormer, FT-UNetFormer and DCSwin, on the MapAI: Precision in Building Segmentation competition dataset. To further improve the shape of the predicted buildings, we apply regularization on the predictions and assess whether it improves both the geometrical shape and the prediction accuracy. Our aim is to provide an end-to-end workflow for building segmentation, regularization and vectorization, producing accurate predictions with regularized boundaries that are useful in many cartographic and engineering applications. The regularization and vectorization workflow is further developed into a working QGIS plugin that extends the functionality of QGIS.


INTRODUCTION
With increasing digitalization and automation, there is a need to develop automatic methods to maintain and update public information stored in spatial databases. Public, building-related information is stored in the building register. The building register is the fundamental record for storing building information and other relevant data necessary for taxation, public planning and emergency services. Up-to-date building footprint maps are essential for many geospatial applications, including disaster management, population estimation, monitoring of urban and impervious areas, 3D city modeling, detection of illegal construction cases (Bakirman et al., 2022), updating topographical databases on a country-wide level and assessing the damage after natural disasters (Takhtkeshha et al., 2023). Although machine learning methods have achieved accurate building segmentation results in the past, current trends have moved towards the utilization of deep learning for building footprint extraction, which requires minimal post-processing after segmentation has been performed. One of the ongoing challenges in building footprint extraction is the accurate recreation of the polygonal boundary of the building footprint, either in 2D (Li et al., 2021) or in 3D space (Wang et al., 2021), while at the same time extracting the vectorized building mask as output to be directly used in various GIS software. In the past, different approaches have been developed for building extraction from various data sources, including satellite, aerial or drone images and LiDAR point clouds. Additionally, many challenges and competitions for building segmentation have been organized, and publicly available building datasets have been developed. The most popular ones include DeepGlobe (Demir et al., 2018), the Wuhan building dataset (Ji et al., 2019), SpaceNet (Etten et al., 2019), CrowdAI (Mohanty et al., 2020) and the most recent MapAI building segmentation dataset (Jyhne et al., 2022).
Having different building segmentation competitions with open access to data aids and encourages the development and improvement of methods for accurate building segmentation. However, there is still a demand for better methods that can extract building footprints in an end-to-end fashion, enabling the user to segment, regularize and vectorize the detected building footprints and make the results applicable within the GIS domain.
Building footprint extraction from remote sensing imagery with deep learning techniques can be achieved using either instance segmentation or semantic segmentation, also known as pixel-wise labeling (Neupane et al., 2021). Both of these methods have shown great potential and have boosted the performance of building footprint extraction, but lack the capability to delineate structured building footprints. The extracted features also require further post-processing labour, which hinders the applicability and practical use of the results.
The purpose of our research is to develop an end-to-end workflow for accurate segmentation of building footprints, including three major steps: (1) binary semantic segmentation with a CNN, (2) building boundary regularization and (3) vectorization. The dataset used for building segmentation is the NORA MapAI: Precision in Building Segmentation dataset (Jyhne et al., 2022). We have developed an implementation for building segmentation using open-source software libraries, including Python, PyTorch, the Geospatial Data Abstraction Library (GDAL), QGIS and QtDesigner. Our approach implements the projectRegularization repository (Zorzi and Fraundorfer, 2019; Zorzi et al., 2021) on a semantic segmentation task. The novelty of our approach is applying the regularization task to an entirely new building dataset, while adding our own implementation for the vectorization part. In addition, the entire workflow has been developed in an end-to-end manner and can be applied to different datasets and binary semantic segmentation problems. Our code can be further developed.

Deep learning methods for image segmentation
Deep learning methods for image segmentation can be divided into (1) semantic segmentation and the more sophisticated (2) instance segmentation. Both methods can be multi-class or binary. In multi-class segmentation, different classes of buildings can be segmented, while in binary segmentation the goal is to extract only the building class from the provided image.
Semantic segmentation is a computer vision task that involves dividing an image into distinct regions and assigning a semantic label to each pixel within those regions. In the case of building segmentation, the goal is to distinguish between building and background pixels. Several neural network architectures can be applied for semantic segmentation, including different variations of U-Net, FCN and SegNet. Recently proposed architectures apply advanced vision transformers to semantic segmentation. GeoSeg is one of the open-source semantic segmentation toolboxes for various image segmentation tasks. The repository offers seven different models that can be used for either multi-class or binary semantic segmentation tasks, including four vision transformers, U-NetFormer, FT-UNetFormer, DCSwin and BANet, and three regular CNN models, MANet, ABCNet and A2FPN.
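The pixel-wise nature of binary semantic segmentation can be illustrated with a short PyTorch sketch; the function name and the 0.5 threshold are illustrative choices, not taken from the paper:

```python
import torch

def logits_to_binary_mask(logits: torch.Tensor, threshold: float = 0.5) -> torch.Tensor:
    """Convert raw model logits of shape (N, 1, H, W) to a binary building mask.

    Each pixel is classified independently: sigmoid maps the logit to a
    building probability, which is then thresholded.
    """
    probs = torch.sigmoid(logits)
    return (probs > threshold).to(torch.uint8)

# Toy 1x1x2x2 logit map: positive logits become building pixels (1)
logits = torch.tensor([[[[2.0, -3.0], [0.1, -0.1]]]])
print(logits_to_binary_mask(logits).squeeze().tolist())  # [[1, 0], [1, 0]]
```

Because each pixel is decided independently, nothing in this step enforces straight edges or right angles, which is precisely what motivates the regularization step discussed later.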
The second method that can be applied for building footprint extraction is instance segmentation, which takes segmentation a step further by proposing a bounding box around each detected building and assigning each building instance a class probability score (Šanca et al., 2021). Instance segmentation can be achieved through a wide variety of methods, including region-based approaches such as Mask R-CNN and its predecessors R-CNN, Fast R-CNN and Faster R-CNN. While the implementation of instance segmentation can be more challenging and computationally heavier, the approach can be more effective in densely populated urban areas, where buildings may be close together or overlapping (Zhao et al., 2020).
Both instance and semantic segmentation models are trained in a supervised manner using image and ground truth pairs. The resulting segmentation mask is often highly irregular and is not applicable in cartographic applications before it has been vectorized. In many cases, especially when buildings are occluded by vegetation, shadows or clouds, or appear under different light conditions, the predicted segmentation maps can differ considerably from the real building footprints and need further post-processing to be practically applicable in cartographic and other engineering applications.

Building boundary regularization methods
Previous attempts at building segmentation used textures, lines, shadows, or more sophisticated, empirically designed methods. However, most of them did not succeed in automating and improving the regularization of building boundaries. Boundary regularization is a technique used in various computer vision applications to improve the accuracy of image segmentation. Boundaries between different objects can be ambiguous, making it difficult for deep learning models to segment them accurately. In addition, real-world remote sensing images can be noisy, with shadows and varying light conditions. Furthermore, large amounts of training data are needed to achieve accurate segmentation maps with CNNs (Tang et al., 2018). In machine learning, regularization is defined as a method to reduce the generalization error during training (Goodfellow et al., 2016). In the GIS domain, regularization or shape refinement is understood as a normalization process that improves the geometry of the building footprint in a post-processing manner (Zhao et al., 2020). Applying regularization to building segmentation maps constrains the building footprints to be smoother, with clearly defined, straight edges. This makes the building footprint more even, even if occluded, and visually more appealing. Zhao et al. (2018) applied boundary regularization with Mask R-CNN using Minimum Description Length (MDL) optimization. A CNN-based segmentation with empirical polygonal regularization on the Wuhan building dataset, using the MA-FCN architecture preceded by a boundary extraction algorithm, was proposed by Wei et al. (2020). The Marching Cubes algorithm was used for the boundary extraction step and the Douglas-Peucker algorithm for the regularization. In their study, coarse- and fine-adjustment techniques were applied to improve the geometry of the building footprints.
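The Douglas-Peucker simplification mentioned above can be reproduced with Shapely's `simplify`; the polygon and the tolerance value below are illustrative, not taken from the cited studies:

```python
from shapely.geometry import Polygon

# A jagged outline of the kind a pixel-wise prediction produces
ragged = Polygon([
    (0, 0), (0.02, 1.0), (0, 2), (1.0, 2.03),
    (2, 2), (2.01, 1.0), (2, 0), (1.0, 0.01),
])

# Douglas-Peucker simplification drops vertices that deviate from a
# straight segment by less than the tolerance
simplified = ragged.simplify(tolerance=0.1)
print(len(ragged.exterior.coords), "->", len(simplified.exterior.coords))
```

Note that Douglas-Peucker only thins vertices; it cannot enforce right angles or recover occluded edges, which is what motivates learned regularization approaches.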
To achieve higher prediction accuracy, Zhao et al. (2020) developed a new instance segmentation workflow called Hybrid Task Cascade (HTC) as a baseline model for building detection and segmentation. They integrated the convex hull and Douglas-Peucker algorithms for regularization to obtain accurate building segmentation maps, and tested their method on the CrowdAI dataset. In contrast, Zorzi et al. (2021) approached the problem differently: they trained an unsupervised GAN regularization network using adversarial, Potts and normalized cut losses to ingrain knowledge about building boundaries into the neural network. Their implementation was tested with instance segmentation, applying the Mask R-CNN architecture for building segmentation and comparing it with an R2U-Net semantic segmentation architecture. Their implementation is publicly available as projectRegularization. Because their implementation is openly accessible, straightforward to implement and usable for both semantic and instance segmentation tasks, we have chosen to test it and incorporate it into our end-to-end workflow for the MapAI dataset.

The MapAI dataset
The proposed end-to-end workflow has been tested and evaluated on the MapAI: Precision in Building Segmentation competition dataset. The competition was arranged by the Norwegian Artificial Intelligence Research Consortium (NORA) in collaboration with the Center for Artificial Intelligence Research at the University of Agder (CAIR), the Norwegian Mapping Authority, AI:Hub, Norkart, and the Danish Agency for Data Supply and Infrastructure. The dataset provides data sources for segmentation of buildings using aerial images and LiDAR data. The dataset is split into training, validation and two test sets with image sizes of 500x500 pixels and a resolution of 0.25 m. The training dataset covers several different locations in Denmark, while the test dataset covers seven locations in Norway, including the urban areas Bergen, Kristiansand, Oslo, Stavanger and Tromsø and a rural area, Rana. The dataset includes a wide variety of buildings with different sizes, shapes and complexities, which ensures a diverse dataset with different environments and building types (Jyhne et al., 2022).
There are two test sets, divided into task 1 and task 2, to evaluate the accuracy of the trained models. The test set for task 1 is used for testing the segmentation approach using only aerial images as the data source, while the test set for task 2 is used to test the combined approach using aerial and LiDAR images. In total there are 7000 building instances in the training set, 1500 building instances in the validation set, 1369 images in the task 1 test set and 978 images in the task 2 test set. The dataset can be downloaded from HuggingFace. Figure 1 shows an example from the training dataset.

METHODS
We provide an end-to-end workflow for building extraction that also improves the predicted building footprints by boundary regularization. Our workflow consists of three steps that are merged together end-to-end: 1. First, we train binary semantic segmentation models with four convolutional neural network architectures on the MapAI dataset and make predictions on the task 1 test set.
2. Second, we apply projectRegularization (Zorzi and Fraundorfer, 2019; Zorzi et al., 2021) to regularize the predicted building footprints and improve their geometry.
3. In the final step we perform the vectorization process, converting the regularized building masks to polygons ready to be used in any GIS environment.
Steps (2) and (3) are implemented in our QGIS plugin. Our workflow was developed in Python, using the PyTorch library for the application and development of deep learning models. We used GDAL (Geospatial Data Abstraction Library) to vectorize the predictions in step 3. QtDesigner and QGIS were used to develop and test the plugin. Each step of our workflow is further described in the following subsections. The complete workflow for model training, prediction and regularization is presented in Figure 3.

Dataset preparation
The MapAI dataset was downloaded from HuggingFace (https://huggingface.co/datasets/sjyhne/mapai_dataset) and saved locally as a cached Parquet file, which can be accessed with the PyTorch DataLoader. Since the dataset contains some mislabeled images in the training and validation sets, we removed them following previous work by Kaliyugarasan and Lundervold (2023). The names of the affected images from the training and validation sets are stored in two text files in our repository, and we provide simple bash scripts for their removal from the original dataset.
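The cleanup step can equally be expressed in Python; the function below is a sketch of what the removal scripts do, with an illustrative function name and file layout:

```python
from pathlib import Path

def remove_listed_images(list_file: str, image_dir: str) -> int:
    """Delete every file in image_dir whose name appears in list_file.

    Returns the number of files actually removed, so the cleanup can be
    checked against the published list of mislabeled images.
    """
    removed = 0
    for name in Path(list_file).read_text().split():
        target = Path(image_dir) / name
        if target.exists():
            target.unlink()
            removed += 1
    return removed
```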

Semantic segmentation with CNNs
The initial stage of our methodology involves identifying and delineating the boundaries of buildings depicted in aerial images. We apply the basic U-Net neural network architecture and three vision transformers: U-NetFormer, FT-UNetFormer and DCSwin.
Model training. U-Net, proposed by Ronneberger et al. (2015), has been successfully applied to various image segmentation tasks in both the medical and remote sensing domains. The following three architectures are vision transformers (ViT). In a ViT the input image is divided into a sequence of patches, which are flattened and fed into the transformer encoder network. The network consists of a stack of self-attention layers, which enable the network to attend to different parts of the image when making predictions (Dosovitskiy et al., 2021). The key idea behind a vision transformer is to use a multi-scale hierarchical approach for image segmentation, where low-level transformers process raw images and high-level transformers operate on down-sampled images. This approach makes it possible to capture information at different scales and preserve rich contextual information. In contrast, traditional CNNs gradually decrease the spatial resolution of an image, which leads to loss of detail (Liu et al., 2021). The second network we applied is the U-NetFormer (Petit et al., 2021), a unified network consisting of two architectures: a 3D Swin Transformer based encoder network and a transformer based decoder network, which allows higher accuracy and lower computational cost during training. The architecture integrates skip connections between the encoder and decoder networks. This enables the use of deep supervision, which can help to mitigate the vanishing gradient problem, improve the overall stability of the training process and enable more accurate and efficient learning (Wang et al., 2022). The third applied model is the FT-UNetFormer, a fully transformer-based network architecture without any additional recurrent or convolutional layers, meaning that the model only uses self-attention and feed-forward layers to process the input sequence, making it highly parallelizable and computationally efficient.
The final neural network we applied is DCSwin, a hierarchical vision transformer using the shifted window approach proposed by Liu et al. (2021).
We trained four binary semantic segmentation models on the MapAI dataset using the hyperparameters listed in the accompanying table. We used the Adam optimizer with Binary Cross Entropy Loss with logits during training to measure the difference between the predicted output and the ground truth. The loss function is defined as:

L = -(1/N) * Σ_{i=1}^{N} [ y_i · log σ(z_i) + (1 − y_i) · log(1 − σ(z_i)) ]

where N is the batch size, y_i is the ground truth image for sample i, z_i is the logit output of the model for sample i and σ is the sigmoid function.
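This loss matches PyTorch's built-in `BCEWithLogitsLoss`; the toy tensors below are illustrative and simply check the manual formula against the library implementation:

```python
import torch
import torch.nn.functional as F

# Toy batch: logits z and binary ground-truth masks y
z = torch.tensor([[0.5, -1.2], [2.0, 0.0]])
y = torch.tensor([[1.0, 0.0], [1.0, 1.0]])

# Manual BCE-with-logits, averaged over all elements
sigma = torch.sigmoid(z)
manual = -(y * torch.log(sigma) + (1 - y) * torch.log(1 - sigma)).mean()

# PyTorch's numerically stable implementation (default reduction: mean)
builtin = F.binary_cross_entropy_with_logits(z, y)
assert torch.allclose(manual, builtin)
```

In practice the built-in version is preferred because it fuses the sigmoid and the log into a numerically stable expression.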

Applying regularization on predictions
The steps to calculate the GAN objective function are summarized from (Zorzi and Fraundorfer, 2019; Zorzi et al., 2021). The regularization learning process L(G, R, D):
1. The generator G(x, y) learns the mapping function from the segmented building footprints X to the ideal building footprints of the training set Y.
2. The intensity images Z are exploited from the dataset.
3. Regularization is performed.
4. The regularized building footprints are produced by the encoder E_G and the residual decoder F.
5. The discriminator D estimates whether the regularized images are ideal.

The final and full objective function to jointly train the generator path G and the reconstruction path R is a linear combination of the adversarial, regularization and reconstruction losses:

L(G, R, D) = L_adv + λ_reg · L_reg + λ_rec · L_rec

The regularized outputs are created by connecting the encoders E_R and E_G to the residual decoder F at each iteration. The final, regularized building mask is generated after E_G, E_R and F are jointly updated.
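A hedged sketch of such a joint objective in PyTorch; the function name and the loss weights are illustrative placeholders, not the values used by projectRegularization:

```python
import torch

def joint_regularization_loss(adv_loss: torch.Tensor,
                              reg_loss: torch.Tensor,
                              rec_loss: torch.Tensor,
                              w_adv: float = 1.0,
                              w_reg: float = 1.0,
                              w_rec: float = 1.0) -> torch.Tensor:
    """Linear combination of the adversarial, regularization and
    reconstruction losses used to jointly update the generator path G
    and the reconstruction path R. Weights are illustrative placeholders."""
    return w_adv * adv_loss + w_reg * reg_loss + w_rec * rec_loss

total = joint_regularization_loss(torch.tensor(0.7), torch.tensor(0.2), torch.tensor(0.1))
```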

Evaluation metrics
The performance of our developed workflow applying projectRegularization is evaluated based on the metrics proposed in the MapAI: Precision in Building Segmentation challenge (Jyhne et al., 2022). Intersection-over-Union (IoU), or the Jaccard index, is the ratio of the intersection of the predicted and ground truth masks to their union:

IoU(G, P) = |G ∩ P| / |G ∪ P|

where G is the ground truth mask and P is the prediction. Boundary Intersection-over-Union (BIoU) calculates the IoU of the boundaries of the prediction and the ground truth:

BIoU(G, P) = |G_d ∩ P_d| / |G_d ∪ P_d|

where G_d denotes the edge of the ground truth mask with thickness d and, analogously, P_d the edge of the predicted mask with thickness d. To evaluate the submissions for the MapAI competition, the final score S is a combination of Intersection-over-Union (IoU) and Boundary Intersection-over-Union:

S = (IoU + BIoU) / 2
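Both metrics can be sketched in NumPy. Extracting the d-thick boundary band by binary erosion is one reasonable reading of "edge with thickness d"; the erosion approach and the default d = 1 are our assumptions:

```python
import numpy as np
from scipy.ndimage import binary_erosion

def iou(g: np.ndarray, p: np.ndarray) -> float:
    """Intersection-over-Union of two binary masks."""
    inter = np.logical_and(g, p).sum()
    union = np.logical_or(g, p).sum()
    return inter / union if union else 1.0

def boundary(mask: np.ndarray, d: int = 1) -> np.ndarray:
    """Mask pixels within distance d of the edge (assumption: via erosion)."""
    return mask & ~binary_erosion(mask, iterations=d)

def biou(g: np.ndarray, p: np.ndarray, d: int = 1) -> float:
    """Boundary IoU: IoU restricted to the d-thick boundary bands."""
    return iou(boundary(g, d), boundary(p, d))

# Two offset 4x4 squares on an 8x8 grid
g = np.zeros((8, 8), dtype=bool); g[2:6, 2:6] = True
p = np.zeros((8, 8), dtype=bool); p[3:7, 3:7] = True
print(round(iou(g, p), 3))  # 0.391 (= 9/23)
```

Note how much stricter BIoU is than IoU on the same pair of masks: a one-pixel offset leaves a sizable mask overlap but almost no boundary overlap.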
We provide the metrics for the predictions of our trained models and for the regularized outputs separately, in order to compare the difference and assess whether regularization improves the predicted building footprints. Our lowest performing model was the simple U-Net, achieving 37.93 IoU without and 38.87 IoU with regularization. The extended transformer architectures U-NetFormer and FT-UNetFormer performed better. FT-UNetFormer was slightly better than U-NetFormer, achieving 39.95 IoU without regularization and 40.17 IoU with regularization, which we attribute to its fully transformer-based architecture. The best performing model, as expected, was DCSwin, achieving 45.19 IoU without regularization and 45.64 IoU with regularization. Its advantage comes from the shifted-window approach to hierarchical feature representation: the Swin Transformer divides the input image into smaller patches and processes them hierarchically in a series of stages, each operating at a different spatial resolution, which improves feature extraction and thus the final segmentation accuracy. The results show that applying regularization slightly improved the performance of our models, on average by around 0.5 %, depending on the test image. We applied regularization on a wide variety of predictions. In cases where the prediction is of poor quality, the regularized output is equally poor. The tested regularization method can thus improve the geometry of the buildings, but cannot be used to significantly improve the prediction accuracy. Although we did not use data augmentation techniques, we conclude that data augmentation is a necessary step to further improve the prediction accuracy, especially on the test images for Tromsø, where many of the buildings are in shadow and the images have low contrast.
The next step would be to apply transfer learning to further improve our results and perform the combined aerial-LiDAR segmentation task.

Developed QGIS plugin
Our QGIS plugin, which can be used to regularize any binary semantic segmentation image, is presented in Figure 4. The user can choose between two options: (1) the regularization option, which regularizes and further improves the prediction, and (2) the vectorization option, which converts any predicted or already regularized building footprint from raster to vector format. The graphical user interface of the plugin is simple. At the top, the user provides the path to the raster file to be regularized or vectorized. The loaded raster is shown in the middle. The two checkboxes select which process will be executed. Restore Defaults resets the plugin interface and removes any stored data. Additional instructions on how to use the plugin can be found by clicking the Help button. Both the Regularize building footprint and Vectorize building footprint options automatically save the generated file. The regularization option saves the file in the same folder as the original raster file, adding the prefix reg- and using the same image type as the original. After the regularization, the checkbox automatically changes to the Vectorize building footprint option, which can be used to save the regularized image, or just the prediction, as a vector file. If the user runs the plugin once more, the regularized raster is automatically converted into polygons and saved as a GeoPackage in the corresponding folder. The development of our proposed plugin can be followed online in its GitHub repository. We encourage everyone to test the plugin, provide feedback and new ideas, suggest improvements and contribute to further development.
The International Archives of the Photogrammetry, Remote Sensing and Spatial Information Sciences, Volume XLVIII-4/W7-2023 FOSS4G (Free and Open Source Software for Geospatial) 2023 -Academic Track, 26 June-2 July 2023, Prizren, Kosovo

CONCLUSION
The main purpose of our study was to develop an end-to-end workflow for building footprint segmentation, apply regularization and vectorization on the results in order to provide a GIS-ready solution. We conclude that projectRegularization additionally improves the segmentation accuracy by an average value of 0.55 IoU, 0.35 in BIoU and 0.44 in S metric.
Regularization not only improves the predictions, but also improves the geometrical shape of the building footprints. Furthermore, the vectorization part contributes to the practical aspect of combining deep learning models and open-source GIS software. Our QGIS plugin can be used to regularize buildings from predictions and convert them to vector files, which can help in areas where practical application is of utmost importance. Our workflow is accessible and tested online on GitHub: https://github.com/s1m0nS/mapAI-regularization. We provide Jupyter Notebooks with explanations for easier work management. The development of our QGIS plugin can be followed on GitHub: https://github.com/s1m0nS/QGIS-Regularize-Building-Footprints. We encourage everyone to try out our QGIS plugin and provide feedback, or contribute to the code repository.