Extracting building outlines based on convolutional neural networks using the property of linear connectivity

For years, researchers have been developing automated methods that can replace humans in drawing the outlines of individual buildings in a vector format, which plays an important role in GIS creation, environmental monitoring, urban planning, population density estimation, and energy supply. There is no doubt that this is an extremely difficult task, not only because of the labor required to develop such a highly intelligent algorithm, but also because of the challenges posed by imperfect imaging conditions, varied building structures, and the complexity of the background. One of the current challenges in extracting building outlines is to accurately recreate the polygonal boundaries of buildings while extracting vectorized building masks as output for direct use in various applications. This work provides a comprehensive workflow for building extraction and improves the predicted area of buildings through boundary regularization. First, a convolutional neural network is used to train an instance segmentation model; then regularization and vectorization are performed. The main difference from existing methods is a new regularization method based on the concepts of linear connectivity and convexity of a set of points. This approach can effectively identify and remove points that do not belong to the detected building but were incorrectly segmented by the algorithm. Based on the results of experiments, the algorithm showed a high level of efficiency, comparable to leading building boundary extraction methods such as PolyWorld.


Introduction
For many years, researchers have been developing automated methods that can replace humans in mapping vector-format outlines of individual buildings, which play an important role in GIS production, environmental monitoring, urban planning, population density estimation, and energy supply. Undoubtedly, this is an extremely difficult task, not only due to the laboriousness of developing such a highly intelligent algorithm, but also due to the challenges associated with imperfect imaging conditions, varied building architecture, and background complexity.
Automatic detection of buildings from aerial photographs has been considered an important means of improving the efficiency of vector map generation for decades (Paparoditis et al., 1998, Persson et al., 2005, Yang et al., 2018). In recent years, with the support of extensive training data and sufficient computing power, deep learning methods such as convolutional neural networks (CNN) (LeCun et al., 1989) and fully convolutional networks (FCN) (Long et al., 2014) have significantly improved the accuracy of building detection from remote sensing images (Li et al., 2019, Chen et al., 2020, Šanca et al., 2023). However, automatically generating high-quality vector building maps from aerial photographs is not yet a reality for most communities. This is partly because deep learning-based building detection approaches still face challenges such as low recognition rates for roofs obscured by trees or shadows (Chen et al., 2019) and relatively poor generalization from certain geographic regions to others (Maggiori et al., 2017). One of the current challenges in extracting building outlines is to accurately recreate the polygonal boundary of a building while extracting a vectorized building mask as output for direct use in various applications. This paper proposes an algorithm for automatically extracting building outlines based on instance segmentation, regularization, and vectorization. The main difference from existing methods is a new regularization method based on the concepts of linear connectivity and convexity of a set of points. This approach can effectively identify and remove points that do not belong to the detected building but were incorrectly segmented by the algorithm. In summary, the main contributions of this paper are as follows:
• To analyze the results and compare them with existing methods for extracting building boundaries, one of the most popular datasets for vectorization, CrowdAI (Mohanty et al., 2020), is used.

Related work
Currently, the leading building extraction approaches are semantic segmentation and instance segmentation methods.

Neural network methods
Since building predictions are made at the highest resolution, holes may appear in large-scale building predictions if the global semantic information is insufficient, while small-scale buildings may be omitted without enough local detail. To address these issues, (Wei et al., 2019) introduced a multi-scale aggregation FCN that fuses multi-scale building features to generate the final building predictions. The PolygonCNN proposed by (Chen et al., 2020) first performs segmentation to extract initial building outlines. It then utilizes a modified PointNet to learn the shape prior and predict polygon vertices, generating precise building vector results by encoding the vertices of building polygons and merging image features extracted in the segmentation step. (Šanca et al., 2023) propose an end-to-end workflow that utilizes binary semantic segmentation, regularization, and vectorization. The novelty of their approach is applying the regularization task to an entirely new building dataset, while adding their own implementation for the vectorization part. The study (Knyaz et al., 2020) proposed a masking technique for robust segmentation of repeated structures in images, which improved segmentation performance by 11%.
In (Zhao et al., 2018), the authors corrected the segmentation masks produced with Mask R-CNN by first simplifying the detected boundaries using the Douglas-Peucker algorithm and subsequently refining the resulting polygons using a Minimum Descriptor Length method. To address the problem that detection quality affects the integrity of the mask, (Zhao et al., 2020) proposed an instance segmentation model focused on the accuracy of segmentation contours, which treated detection and segmentation as a multi-stage process to obtain accurate segmentation edges and improve the geometric regularity of the segmentation results.
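As a concrete illustration of the boundary simplification step mentioned above, a minimal Douglas-Peucker implementation might look as follows. This is a generic sketch of the classic algorithm, not the implementation used in the cited works:

```python
import math

def perpendicular_distance(p, a, b):
    """Distance from point p to the line through points a and b."""
    (px, py), (ax, ay), (bx, by) = p, a, b
    dx, dy = bx - ax, by - ay
    norm = math.hypot(dx, dy)
    if norm == 0:
        return math.hypot(px - ax, py - ay)
    return abs(dy * px - dx * py + bx * ay - by * ax) / norm

def douglas_peucker(points, epsilon):
    """Recursively simplify a polyline, keeping only vertices that deviate
    from the chord between the endpoints by more than epsilon."""
    if len(points) < 3:
        return list(points)
    # Find the vertex with the maximum distance from the chord.
    dmax, index = 0.0, 0
    for i in range(1, len(points) - 1):
        d = perpendicular_distance(points[i], points[0], points[-1])
        if d > dmax:
            dmax, index = d, i
    if dmax > epsilon:
        # Significant vertex: simplify both halves and merge.
        left = douglas_peucker(points[:index + 1], epsilon)
        right = douglas_peucker(points[index:], epsilon)
        return left[:-1] + right
    # All intermediate vertices are within tolerance: keep only the endpoints.
    return [points[0], points[-1]]
```

For example, `douglas_peucker([(0, 0), (1, 0.05), (2, 0), (3, 0.05), (4, 0)], 0.1)` collapses the nearly collinear chain to its two endpoints, while a genuine corner such as `(2, 2)` in `[(0, 0), (2, 2), (4, 0)]` survives simplification.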
These methods create segmentation maps at the pixel level, but the building boundaries produced by the algorithms are usually jagged and far removed from manual delineation of objects. In addition, the results require extensive post-processing: semantic segmentation cannot distinguish between individual buildings, and the bounding box predicted by an instance segmentation method may contain elements of other buildings, making mask training difficult. However, geographic and cartographic applications typically require precise vector polygons of the extracted objects instead of rasterized output. (Zorzi et al., 2022) introduced PolyWorld, a neural network that directly extracts building vertices from an image and connects them correctly to create precise polygons.
A few other studies (Ling et al., 2019, Peng et al., 2020, Liu et al., 2021, Wei et al., 2022) have considered the instance segmentation problem as contour regression, i.e., regressing the vertex coordinates of a contour (in other words, a polygon represented by a series of discrete vertices). Contour-based methods are theoretically advantageous in efficiency, since they straightforwardly regress the polygon coordinates compared to the pixelwise operation of semantic/instance segmentation, and have the potential to eliminate the need for post-processing operations such as raster-to-vector conversion and empirical regularization. Considering the size of buildings (> 10 m²), we note some benchmark satellite/aerial imagery datasets (Rottensteiner et al., 2012, Ji et al., 2019, Yang et al., 2022, Tian et al., 2020, Mohanty et al., 2020), most of which have spatial resolutions ranging from the centimeter level to 2 m, with the exception of the relatively coarse resolution of SpaceNet 7 (4 m). In addition to the commonly used RGB channels, some datasets also provide additional useful information to further characterize buildings. In terms of spectral information, the Potsdam and WHU-Satellite datasets have RGB/near-infrared (NIR) bands, and the SpaceNet and SpaceNet 4 datasets consist of eight spectral bands from the WorldView 2/3 sensors. For vertical information, the Potsdam, Vaihingen, Zeebrugge and DFC19-JAX datasets provide airborne LiDAR-derived nDSMs, while the SpaceNet 4 dataset consists of 27 unique images with viewing angles ranging from −32.5° to 54.0° (Weir et al., 2019).

Datasets
Several datasets (e.g., DFC19-JAX) have also attempted to improve deep learning networks by combining planar and stereo remote sensing observations. In terms of temporal properties, the WHU Building Change Detection, SECOND, Hi-UCD, and ZKXT 2021 datasets contain multi-temporal remote sensing observations, building contours for each date, and building change records.

Method
This work provides a comprehensive workflow for building extraction and improves the predicted area of buildings through boundary regularization. First, a convolutional neural network is used to train an instance segmentation model. Second, the property of linear set connectivity is used to organize the predicted contours of buildings and improve their geometry. The final step is the vectorization process, converting the regularized building masks into polygons for use in any application. The scheme of the algorithm is shown in Figure 2.
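The three stages above can be sketched as a simple pipeline. All function names and bodies below are illustrative toy stand-ins, not the actual implementation: the segmentation stand-in treats all nonzero pixels as one instance, and the vectorization stand-in returns bounding-box corners.

```python
def segment_instances(image):
    # Toy stand-in for Mask R-CNN: treat every nonzero pixel as one building instance.
    return [[[1 if v else 0 for v in row] for row in image]]

def regularize(mask):
    # Placeholder for the connectivity-based regularization described in the Method section.
    return mask

def vectorize(mask):
    # Toy vectorization: return the bounding-box corners of the mask as a polygon.
    ys = [i for i, row in enumerate(mask) for v in row if v]
    xs = [j for row in mask for j, v in enumerate(row) if v]
    y0, y1, x0, x1 = min(ys), max(ys), min(xs), max(xs)
    return [(x0, y0), (x1, y0), (x1, y1), (x0, y1)]

def extract_buildings(image):
    """Illustrative pipeline: instance segmentation -> regularization -> vectorization."""
    masks = segment_instances(image)           # stage 1: one binary mask per detected building
    masks = [regularize(m) for m in masks]     # stage 2: clean up each mask
    return [vectorize(m) for m in masks]       # stage 3: convert each mask to a vector polygon
```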

Instance segmentation with Mask R-CNN
The initial stage of our methodology involves identifying and delineating the boundaries of the buildings depicted in aerial photographs. The Mask R-CNN neural network was used to perform this task. The mask branch is a small fully convolutional network (FCN) applied to each RoI, predicting a segmentation mask in a pixel-to-pixel manner. Mask R-CNN is simple to implement and train due to the Faster R-CNN framework, which facilitates a wide range of flexible architecture designs. Additionally, the mask branch adds only a small computational overhead, enabling a fast system and rapid experimentation.
The Adam optimizer with binary cross-entropy loss with logits was used during training to measure the difference between the predicted result and the ground truth. The loss function is defined as:

$$\mathcal{L} = -\frac{1}{N} \sum_{i=1}^{N} \left[ x_i \log \sigma(y_i) + (1 - x_i) \log\left(1 - \sigma(y_i)\right) \right] \quad (1)$$

where N is the batch size, x_i is the ground-truth image for sample i, y_i is the logit output of the model for sample i, and σ is the sigmoid function. A sigmoid function is any mathematical function whose graph has a characteristic S-shaped (sigmoid) curve. For the sigmoid function we use the logistic function, which is defined as:

$$\sigma(z) = \frac{1}{1 + e^{-z}} \quad (2)$$
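A minimal sketch of this loss computation in plain Python, for clarity only; the actual training presumably uses a deep learning framework's built-in, numerically stabilized BCE-with-logits loss:

```python
import math

def sigmoid(z):
    # Logistic function: sigma(z) = 1 / (1 + e^{-z})
    return 1.0 / (1.0 + math.exp(-z))

def bce_with_logits(targets, logits):
    """Binary cross-entropy with logits, averaged over the batch.

    targets: ground-truth values x_i in {0, 1}
    logits:  raw model outputs y_i (before the sigmoid)
    """
    n = len(targets)
    total = 0.0
    for x, y in zip(targets, logits):
        p = sigmoid(y)
        total += -(x * math.log(p) + (1 - x) * math.log(1 - p))
    return total / n
```

For a logit of 0 the predicted probability is 0.5, so the loss for a single positive target is -log(0.5) ≈ 0.693.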

Applying regularization on predictions
Once the predictions are generated using the trained model, a post-processing step applies regularization to further improve the geometry and accuracy of the predicted building masks.
Since pixel-based classification results in rounded corners and closed-edge predictions, regularization is an important step to further improve the predictions. Also, after the segmentation process, the predicted bounding box may contain additional instances, which makes it difficult to train the mask head of the network.
Considering that building boundaries are identified on remote sensing images, we assume that the buildings in the images do not intersect or overlap each other. In this case, each building is a closed, bounded set of pixels. Thus, using the property of linear connectivity, unnecessary points that do not belong to the main object are removed from the bounding rectangles. In topology and related branches of mathematics, a connected space is a topological space that cannot be represented as the union of two or more disjoint non-empty open subsets. A path-connected space satisfies a stronger notion of connectedness, requiring the structure of a path. A path from a point x to a point y in a topological space X is a continuous function f from the unit interval [0, 1] to X with f(0) = x and f(1) = y. A path-component of X is an equivalence class of X under the equivalence relation that makes x equivalent to y if there is a path from x to y. The space X is said to be path-connected if there is exactly one path-component. For non-empty spaces, this is equivalent to the statement that there is a path joining any two points in X. The definition of a path-connected set is similar to the definition for a space.
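On a pixel grid, path-connectedness has a natural discrete analogue: two pixels are path-connected if a chain of adjacent "building" pixels joins them. A generic sketch of such a check via breadth-first search, assuming 4-neighbour adjacency:

```python
from collections import deque

def path_connected(mask, start, goal):
    """Check whether two pixels lie in the same path-component of a binary mask.

    mask:  2D list of 0/1 values (1 = "building")
    start, goal: (row, col) pixel coordinates
    """
    if not (mask[start[0]][start[1]] and mask[goal[0]][goal[1]]):
        return False  # a background pixel belongs to no path-component
    h, w = len(mask), len(mask[0])
    seen = {start}
    queue = deque([start])
    while queue:
        r, c = queue.popleft()
        if (r, c) == goal:
            return True
        # Explore the four adjacent pixels (discrete analogue of a continuous path).
        for dr, dc in ((1, 0), (-1, 0), (0, 1), (0, -1)):
            nr, nc = r + dr, c + dc
            if 0 <= nr < h and 0 <= nc < w and mask[nr][nc] and (nr, nc) not in seen:
                seen.add((nr, nc))
                queue.append((nr, nc))
    return False
```

A diagonally touching pixel is not reachable under 4-neighbour adjacency, so it falls in a different path-component.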
Thus, the developed algorithm has the following structure:
1. Detection of objects of the "building" class in the image.
2. Segmentation of objects of the "building" class within each detected bounding box.
3. Finding and removing points in the bounding box that are segmented as "building" but do not belong to the main building.

Vectorizing the resulting images
The regularization procedure described above can be presented as Algorithm 1.
Figure 5. Using the linear connectivity property to remove unnecessary points from images. In the first image, an object of the "building" class is detected. In the second, objects of the "building" class are segmented in the bounding box. In the third, pixels that do not belong to the main object but are erroneously segmented are highlighted (in red) and deleted. The last image contains the final result of segmentation and regularization.
Algorithm 1: Regularization using properties of a path-connected set of points
Input: an image with a set of points belonging to the class "building", B = {x_i}
Output: an image with a set of points belonging to the class "building", B = {x'_i}
1: search for and remove unnecessary points from the set of points B;
2: Procedure Search(B):

Evaluation metrics
Similarly to (Zorzi et al., 2022), we use the following evaluation metrics.
Intersection-over-Union (IoU), or the Jaccard index, is the ratio of the area of intersection of the predicted mask P and the ground-truth mask G to the area of their union:

$$IoU = \frac{|P \cap G|}{|P \cup G|} \quad (3)$$

Precision and recall were also calculated to determine the average precision (AP) and average recall (AR) values:

$$Precision = \frac{TP}{TP + FP} \quad (4)$$

$$Recall = \frac{TP}{TP + FN} \quad (5)$$

where TP, FP, and FN are the true positives, false positives, and false negatives of the building class.
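These pixelwise metrics can be computed directly from the confusion counts. For binary masks, |P ∩ G| = TP and |P ∪ G| = TP + FP + FN, so IoU reduces to a ratio of counts. A minimal sketch over flat binary masks:

```python
def confusion_counts(pred, truth):
    """Pixelwise TP / FP / FN for two flat binary masks of equal length."""
    tp = sum(1 for p, t in zip(pred, truth) if p and t)
    fp = sum(1 for p, t in zip(pred, truth) if p and not t)
    fn = sum(1 for p, t in zip(pred, truth) if not p and t)
    return tp, fp, fn

def iou(pred, truth):
    # For binary masks: |P ∩ G| = TP and |P ∪ G| = TP + FP + FN.
    tp, fp, fn = confusion_counts(pred, truth)
    return tp / (tp + fp + fn)

def precision_recall(pred, truth):
    tp, fp, fn = confusion_counts(pred, truth)
    return tp / (tp + fp), tp / (tp + fn)
```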

Experiment
The meaning of the calculated metrics is given in Table 1. To assess the efficiency of the algorithm, the table also includes the results of the leading methods on similar data.

Conclusion
The main goal of our study was to develop an end-to-end workflow for extracting building outlines using instance segmentation, linear connectivity-based regularization, and vectorization. We conclude that regularization using the linear connectivity property improves segmentation accuracy by an average of 23.3 in AP (average precision) and 27.3 in AR (average recall). Regularization not only improves the predictions but also improves the geometric shape of the building outlines. Based on the experimental results, the algorithm showed a high level of efficiency, comparable to leading building boundary extraction methods such as PolyWorld (Zorzi et al., 2022).

Figure 1. Example of extracting a building boundary.
Since the Zeebrugge dataset (Campos-Taberner et al., 2016) was published as part of the 2015 IEEE GRSS Data Fusion Contest, dozens of building detection and segmentation datasets have been released. It is worth noting that the datasets used to evaluate traditional methods are usually small in size, and the training and testing sets are collected from the same local region (or image), resulting in poor generalization ability. In the era of deep learning, more advanced datasets can achieve spatial independence of training and test sets, wider spatial coverage, and larger data volume, which better corresponds to real-world conditions.

Figure 2. The structure of the proposed algorithm.

Figure 3. The structure of the Mask R-CNN.

3: for each point x_i of class "building" in B do
4:   construct a straight line L_i passing through the points x_i and x_0, where x_0 is the central point of the bounding box;
5:   if there exists x_extra ∉ B_0 such that x_extra ∈ L_i then
6:     draw a straight line L_i^1 to the last point x_1 belonging to the class "building";
7:     take the points x_1^n from the unit neighbourhood of x_1 that belong to the class "building";
8:     draw straight lines from the points x_1^n to x_i;
9:     repeat until a polygonal chain PC appears connecting x_0 and x_i;

The developed algorithm was trained on the open CrowdAI Mapping Challenge database (Mohanty et al., 2020), which is composed of over 280k satellite images for training and 60k images for testing. The training images were divided into two parts: 80% of the images were used to train the algorithm and 20% for validation. The training was performed locally with CUDA 11.7 on an NVIDIA GeForce RTX 3070 graphics card with 8 GB of memory.
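The net effect of Algorithm 1 is to discard segmented pixels that are not path-connected to the main object. A hedged sketch of that effect, which keeps only the pixels path-connected to the bounding-box centre; the line-drawing details of the original procedure are deliberately omitted here:

```python
from collections import deque

def keep_central_component(mask):
    """Keep only the "building" pixels path-connected to the bounding-box centre.

    This reproduces the overall effect of Algorithm 1 (removing pixels not
    path-connected to the main object) via a flood fill from the centre, not
    the exact line-construction steps of the original procedure.
    """
    h, w = len(mask), len(mask[0])
    start = (h // 2, w // 2)  # central point x_0 of the bounding box
    cleaned = [[0] * w for _ in range(h)]
    if not mask[start[0]][start[1]]:
        return cleaned  # centre pixel is background: nothing to keep
    queue = deque([start])
    cleaned[start[0]][start[1]] = 1
    while queue:
        r, c = queue.popleft()
        for dr, dc in ((1, 0), (-1, 0), (0, 1), (0, -1)):
            nr, nc = r + dr, c + dc
            if 0 <= nr < h and 0 <= nc < w and mask[nr][nc] and not cleaned[nr][nc]:
                cleaned[nr][nc] = 1
                queue.append((nr, nc))
    return cleaned
```

An isolated pixel (for example, a fragment of a neighbouring building caught in the bounding box) is removed because no chain of adjacent "building" pixels connects it to the centre.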

Figure 6. Example images from the CrowdAI Mapping Challenge dataset.

Figure 7. Example of the resulting segmented images before and after the regularization process.

Table 1. Results on the CrowdAI test dataset for all the building extraction and polygonization experiments.