A COMPARISON STUDY ON DEEP LEARNING MODELS FOR BUILDING ROOFTOP CLASSIFICATION

The availability of semantic information about a cityscape is essential for understanding and analysing urban processes. Automatic gathering of such information is important due to the enormous amount of data. A great number of building features can be obtained solely by visual inspection. Therefore, it is meaningful to utilize recent advancements in automatic image recognition technologies to extract these properties automatically. This paper proposes an optimized solution for the classification of rooftops from aerial imagery based on a deep learning model using Convolutional Neural Networks (CNNs). It describes the architecture of the network, the training procedure and important hyperparameters. A model analysis using advanced interpretability and explainability tools is conducted. The model's superiority is demonstrated by comparing its performance against several state-of-the-art image classification architectures, including CNN-based ones such as Xception and EfficientNet, pure Visual Transformer (ViT) based architectures such as BEiT, and hybrid architectures.


INTRODUCTION
Automatic image recognition plays a crucial role in understanding and analysing urban processes, particularly in the context of the built environment. By harnessing recent advancements in deep learning and Convolutional Neural Networks (CNNs), it becomes possible to automatically extract valuable information from urban imagery, facilitating effective decision-making and urban analysis. In this regard, building rooftop classification holds significant importance as it provides essential semantic information about the cityscape.
Accurate modelling of buildings and their rooftops is essential for various applications, including infrastructure and service planning, solar potential estimation, green roof analysis, and social space assessment. 3D city models and City Digital Twins (CDTs) in general replicate the physical environment of a city and enable comprehensive analysis of urban processes. Proper modelling of buildings at different levels of detail (LOD) is crucial for generating detailed 3D city models and functional CDTs. Rooftop modelling, classified at LOD2 or LOD3, enhances the visual perception of 3D city models and facilitates various urban analyses (Biljecki et al., 2015; Julin et al., 2018; Suszanowicz, 2019; Shao et al., 2021; Pomeroy, 2012).
To achieve accurate rooftop modelling, access to high-quality data is vital, including aerial imagery and semantic information. The latter can be extracted through automatic image recognition methods. Recent research has focused on developing optimized deep-learning models for automatic rooftop classification, leveraging the capabilities of CNNs. These models can efficiently extract building features solely through visual inspection, thereby improving the modelling process within CDTs (Spasov, 2021; Castagno, 2018; Cai et al., 2021). This paper proposes an enhanced approach for categorizing rooftops from aerial images, utilizing a CNN deep learning model. It outlines the network's structure, the training process, and significant hyperparameters. The fine-tuned model weights trained on images of Sofia (Bulgaria) are shared on GitHub (2023). Advanced tools are employed for model analysis to ensure model interpretability and explainability. Moreover, the performance of the proposed model is compared against several state-of-the-art image classification architectures, including CNN-based models like Xception and EfficientNet, pure Visual Transformers (ViTs) such as BEiT, and hybrid architectures.
The rest of the paper is organised as follows. Section 2 presents the methodology applied in the study. Section 3 shows the results obtained from the proposed fine-tuned model compared to other state-of-the-art image classification models. Finally, Section 4 concludes the paper and outlines future work.

METHODOLOGY
This section describes the methodology followed for the rooftop classification, including data preparation and labelling, model selection and optimisation, and performance evaluation.

Data Preparation
The classification models utilised in this study have deep learning architectures and are trained using a supervised learning approach. Depending on the problem to be solved, i.e., the object to be identified and classified, supervised models can require a substantial amount of data to achieve high classification performance. For example, high optical variability of objects belonging to the same class (intra-class variability) and high visual similarity of objects belonging to different classes (inter-class similarity) make the classification task more difficult to solve. Other factors are image quality (such as resolution, noise and illumination) and the proportion of the object of interest to the area of the entire image. In addition, the unambiguity of the objects to be classified, as well as the presence of a single object of interest in an image, are prerequisites for a single-class prediction. Considering these factors, the data definition, collection and all preprocessing steps applied in this study were carefully selected and performed.
A dataset consisting of 3,517 rooftop images covering the district "Lozenets" of Sofia was employed for the study (Hristov, 2023). It is derived from a single orthophoto, made available in TIFF format and represented in the RGB colour space. The orthophoto was acquired in 2020 using aerial photography techniques, employing an ultra-wide range digital camera. The acquisition process involved a longitudinal overlap of 60% and a transverse overlap of 30% to ensure comprehensive coverage and accurate representation of the district. Notably, the orthophoto's Ground Sampling Distance (GSD) was 10 cm, which is considered highly detailed for an urban environment like the city of Sofia.
The preparation of the dataset involved a meticulous process executed in multiple steps to support the classification models. First, a QGIS plugin named Mapflow was used to localise the buildings in the orthoimage. This automated procedure helped to identify the approximate outlines of the buildings based on the available data. Second, a manual adjustment was performed to refine the inferred building outlines. Third, the orthoimage was tiled based on the outlines resulting from the previous step, aiming to extract each building into a separate image. Specifically, the applied tiling rule produces images containing the detected building boundaries with an additional outer buffer of 2 m (see Figure 1). This buffer ensures that the extracted images encompass the whole outline of each building, while neighbouring structures are separated into other tiles. By following this procedure, the resulting dataset was prepared for a single-class prediction. The combination of automated techniques and manual refinement allowed for the efficient creation of a comprehensive dataset. This dataset serves as the foundation for training and evaluating the models in the current study.
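The tiling rule can be illustrated with a minimal sketch. It assumes that each building outline has already been reduced to an axis-aligned bounding box in pixel coordinates; the function name `buffered_window` and the bounding-box simplification are hypothetical and are not taken from the study. With a GSD of 10 cm, the 2 m buffer corresponds to 20 pixels.

```python
def buffered_window(bbox_px, image_size, buffer_m=2.0, gsd_m=0.10):
    """Expand a building bounding box by a metric buffer and clamp it to the image.

    bbox_px    -- (col_min, row_min, col_max, row_max) of the outline, in pixels
    image_size -- (width, height) of the orthophoto, in pixels
    buffer_m   -- outer buffer in metres (2 m in the text above)
    gsd_m      -- ground sampling distance in metres per pixel (10 cm in the text above)
    """
    # Convert the metric buffer to whole pixels (2 m / 0.10 m = 20 px).
    pad = int(round(buffer_m / gsd_m))
    col_min, row_min, col_max, row_max = bbox_px
    width, height = image_size
    # Clamp so that tiles near the orthophoto border stay inside the image.
    return (
        max(0, col_min - pad),
        max(0, row_min - pad),
        min(width, col_max + pad),
        min(height, row_max + pad),
    )


# Hypothetical building located well inside a 1000 x 1000 px orthophoto.
window = buffered_window((100, 50, 300, 200), (1000, 1000))
```

In practice the tile would be cut from the georeferenced raster using the refined polygon rather than its bounding box; the sketch only shows the buffer arithmetic.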

Data Labelling
The study area is distinctive for its complex architecture, characterised by various roof shapes. Therefore, a carefully selected labelling strategy is essential to balance intra-class variability and inter-class similarity, which is also advantageous for the purpose of the model. Based on this consideration, the single rooftop images are classified into three main classes, namely "pitched", "flat", and "complex". An additional helper class, "no_roof", is introduced to cover cases where the image does not represent a roof. Figure 2 shows examples from the four classes.
Figure 2. (a) flat roof, (b) pitched roof.
The "flat" roof class encompasses completely flat roofs and roofs with a minimal slope (see Figure 2a). Key identifying features typically include a simple rectangular shape, perpendicular angles, uniformity of pixel values and colours, and the absence of distinct planes. It is important to note that buildings with flat roofs spanning multiple levels are also included in this category, which may lead to potential overlap with the complex roof category.
The "pitched" roof class includes all sloped roof types, including hip and gable roofs, and their various configurations (see Figure 2b). Roofs are considered part of this class regardless of the number of planes they comprise, as long as they possess a sloping structure. The criteria and key features utilised for the classification of a roof as pitched include the presence of hips and ridges, which form clear demarcation lines between the planes. An identifiable diagonal hip line and darker or shaded planes on the opposite side of the ridge serve as indicators for a pitched roof.
The "complex" roof class covers roofs that incorporate a combination of pitched and flat geometry. Additional criteria for inclusion in this category include roofs with multiple levels and terraces, roofs with intricate shapes featuring numerous slopes, and roofs with oval or spherical forms. A roof is also classified as complex when multiple buildings with distinct roof types and varying shapes share walls, giving the appearance of a unified roof area or building.
The "no_roof" class incorporates images that do not show buildings, including construction sites, unclear or blurry images, extremely small sections of roofs, or shapes that are inherently unidentifiable to the human eye.
The above considerations act as annotation rules and aim to minimise ambiguity among the images and overlap between the classes. Following them consistently while annotating is important, since inconsistency would provide contradictory examples for the models to learn from. The extent to which the classification performance is affected depends on the number of such cases relative to the size of the dataset and of the respective classes.
Given the extensive range of roof architectures, however, a certain level of subjectivity and partial overlap are inevitable.

Model Selection and Optimisation
For the current classification task, a widely used ResNet architecture is selected. This type of architecture has been successfully utilised as a feature extractor and serves as a backbone for various image recognition tasks such as classification, detection and segmentation. The ResNet architectures contain residual functions, so-called "identity blocks", which effectively tackle the problem of vanishing gradients. This problem refers to the fact that the gradients of deep neural networks become increasingly small as they propagate backwards. As a result, the network is unable to update its parameters effectively. The ResNet architectures provide an effective solution for this phenomenon by incorporating an additional connection, called a skip connection or residual connection, in the network. It allows the network to "choose" whether to use a learned transformation or to simply propagate its input to the next layer if this is the optimal solution.
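The effect of a skip connection can be shown with a minimal sketch. It operates on plain lists of numbers instead of feature maps, and the helper names (`residual_block`, `zero_transform`, `scale_transform`) are hypothetical. The block outputs F(x) + x, so when the learned transformation F contributes nothing, the input is propagated unchanged.

```python
def residual_block(x, transform):
    """Return transform(x) + x element-wise (the skip connection adds the input back)."""
    fx = transform(x)
    return [f + xi for f, xi in zip(fx, x)]


def zero_transform(x):
    """A transformation that has learned nothing useful: it outputs zeros."""
    return [0.0 for _ in x]


def scale_transform(x):
    """A stand-in for a learned transformation (here: scaling by 0.5)."""
    return [0.5 * v for v in x]


identity_out = residual_block([1.0, -2.0, 3.0], zero_transform)  # input passes through
learned_out = residual_block([2.0, 4.0], scale_transform)        # F(x) + x
```

Because the output is F(x) + x, the derivative with respect to x contains an identity term in addition to the derivative of F, which is why gradients can still flow backwards when the derivative of F becomes very small.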
Several experiments are conducted using different model sizes, namely ResNet18, ResNet50 and ResNet101. All three CNNs consist of similar building blocks, composed of convolutional layers, pooling layers, normalisation layers and Rectified Linear Unit (ReLU) activations. The main difference between the networks is in the number of building blocks, presented in brackets in Table 1, and consequently, the number of learnable parameters (He, 2016). ResNet18 is the smallest network with ca. 11.7 million parameters and ResNet101 the largest with ca. 44.5 million parameters. The output of these networks for an image is a 512- or 2048-dimensional feature vector, which is a dense representation of the image. Based on these feature vectors, a fully connected layer assigns a "score-value" to each of the classes.
The network is initialised with weights pre-trained on ImageNet. This is an instance of transfer learning, where knowledge obtained from one domain is applied to a new domain. The pre-trained parameters can either be kept frozen in the convolutional blocks or be used only as an initialisation. In the latter case, models pre-trained on this dataset show faster convergence and often better performance than models trained without pre-trained weights.
In CNNs, earlier layers of the networks extract more general low-level features, whereas deeper layers extract more domain-specific high-level features. Therefore, finetuning only the deeper layers is often advantageous when there is a specific domain and initial weights trained on a large, diverse dataset are available. This is the case in the current study. Since ImageNet does not contain aerial imagery, finetuning of the deeper convolutional layers should lead to better performance. Experiments are conducted with and without finetuning the ResNet backbone. Finetuning the last convolutional block increased the performance significantly; thus, the next iterations in the hyperparameter tuning process are conducted with finetuning of this block. Figure 3 illustrates the ResNet101 architecture with the finetuned blocks.
An extensive hyperparameter tuning is performed, experimenting with elements such as loss function, regularisation techniques, data augmentation and optimisation strategies. The values selected in the experiments were based on common practices and on the results of the previous iterations, in a semi-manual manner that also utilised grid search. In the following, the main elements of the optimized training design are presented.
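The finetuning strategy described above amounts to selecting which parameters receive gradient updates. The sketch below assumes the parameter naming of the torchvision ResNet implementation, in which the top-level modules are called `conv1`, `bn1`, `layer1` to `layer4` and `fc`; whether the study used this implementation is not stated, and the helper `trainable_parameter_names` is hypothetical.

```python
def trainable_parameter_names(parameter_names, trainable_prefixes=("layer4.", "fc.")):
    """Keep only the parameters of the last convolutional block and the classifier.

    Assumption: names follow the torchvision ResNet convention, where "layer4"
    is the last convolutional block and "fc" is the fully connected layer.
    """
    prefixes = tuple(trainable_prefixes)
    return [name for name in parameter_names if name.startswith(prefixes)]


# A handful of illustrative parameter names (not a complete ResNet101 listing).
names = [
    "conv1.weight",
    "bn1.weight",
    "layer1.0.conv1.weight",
    "layer3.22.conv3.weight",
    "layer4.0.conv1.weight",
    "layer4.2.bn3.bias",
    "fc.weight",
    "fc.bias",
]
to_finetune = trainable_parameter_names(names)
```

In a PyTorch training script, the same selection would typically be applied by setting `requires_grad = False` on every parameter whose name is not in the returned list.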

Optimisation Loss:
The network is optimized using a Negative Log-Likelihood loss in combination with the Log of Softmax; minimising this loss is equivalent to minimising the cross-entropy between the predicted class distribution and the target. The following implementation of the loss is used:
$$ l_n = -w_{y_n} \log \frac{\exp(x_{n,y_n})}{\sum_{c=1}^{C} \exp(x_{n,c})} \qquad (1) $$
where x is the input, y is the target, w is the weight, C is the number of classes, and N spans the minibatch dimension (n = 1, ..., N).
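As a worked illustration of the loss described above, the following sketch evaluates the per-sample loss in plain Python. The function name `weighted_nll_of_log_softmax` and the toy logits are hypothetical; how the per-sample losses are reduced over the minibatch (mean or sum) is not specified in the text and is therefore left out.

```python
import math


def weighted_nll_of_log_softmax(logits, targets, class_weights):
    """Per-sample losses l_n = -w[y_n] * log_softmax(x_n)[y_n].

    logits        -- list of N score vectors, each of length C
    targets       -- list of N class indices y_n
    class_weights -- list of C class weights w
    """
    losses = []
    for x_n, y_n in zip(logits, targets):
        # Log-sum-exp with the maximum subtracted for numerical stability.
        m = max(x_n)
        log_sum_exp = m + math.log(sum(math.exp(v - m) for v in x_n))
        losses.append(-class_weights[y_n] * (x_n[y_n] - log_sum_exp))
    return losses


# Four classes, as in the labelling scheme above; equal class weights are an assumption.
uniform = weighted_nll_of_log_softmax([[0.0, 0.0, 0.0, 0.0]], [2], [1.0] * 4)
confident = weighted_nll_of_log_softmax([[10.0, 0.0, 0.0, 0.0]], [0], [1.0] * 4)
```

With identical scores for all four classes the loss equals log(4), the value for an uninformed prediction; a confident correct prediction drives the loss towards zero.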

Optimiser:
The Stochastic Gradient Descent (SGD) and Adam optimizers are tested with different initial learning rates and learning rate scheduling tactics, including annealing techniques such as stepwise, cosine and warm restart cosine annealing. Since the scheduling had a negligible impact on performance, the final model used "reduce on plateau" scheduling with an initial learning rate of 3e-4.
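The "reduce on plateau" idea can be sketched as follows. Only the initial learning rate of 3e-4 comes from the text; the reduction factor, the patience and the minimum learning rate are illustrative assumptions, and the class `ReduceOnPlateau` is a hypothetical stand-in for the scheduler of a deep learning framework.

```python
class ReduceOnPlateau:
    """Lower the learning rate when the monitored validation loss stops improving."""

    def __init__(self, lr=3e-4, factor=0.1, patience=5, min_lr=1e-7):
        self.lr = lr              # initial learning rate (3e-4 in the text above)
        self.factor = factor      # assumed multiplicative reduction
        self.patience = patience  # assumed number of tolerated epochs without improvement
        self.min_lr = min_lr      # assumed lower bound
        self.best = float("inf")
        self.bad_epochs = 0

    def step(self, val_loss):
        """Update the state with the latest validation loss and return the learning rate."""
        if val_loss < self.best:
            self.best = val_loss
            self.bad_epochs = 0
        else:
            self.bad_epochs += 1
            if self.bad_epochs > self.patience:
                self.lr = max(self.lr * self.factor, self.min_lr)
                self.bad_epochs = 0
        return self.lr


scheduler = ReduceOnPlateau(lr=3e-4, factor=0.5, patience=2)
history = [scheduler.step(loss) for loss in [1.0, 0.9, 0.95, 0.95, 0.95]]
```

In this toy run the loss improves twice and then stalls for three epochs, so the learning rate is halved once, on the last step.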

Data Augmentation:
The network is trained with the stated details for 100 epochs with and without data augmentation. Vertical and horizontal flips are applied and, in addition, random rotation and random changes of the brightness, contrast, saturation and hue of the images are performed. With augmentation, the training loss converged similarly to the case where no augmentation was applied. However, the variance was significantly reduced with the application of the augmentation techniques.
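A minimal sketch of such an augmentation step is given below. It works on a single-channel image stored as a nested list and covers only flips and a brightness change; rotation, contrast, saturation and hue are omitted for brevity. The flip probability and the brightness range are assumptions, and all function names are hypothetical.

```python
import random


def horizontal_flip(image):
    """Mirror every row (left-right flip)."""
    return [row[::-1] for row in image]


def vertical_flip(image):
    """Reverse the order of the rows (top-bottom flip)."""
    return [row[:] for row in image[::-1]]


def adjust_brightness(image, factor):
    """Scale pixel values by a factor and clip them to the 8-bit range."""
    return [[min(255, max(0, round(p * factor))) for p in row] for row in image]


def augment(image, rng, p_flip=0.5, brightness_range=(0.8, 1.2)):
    """Apply random flips and a random brightness change to one image."""
    if rng.random() < p_flip:
        image = horizontal_flip(image)
    if rng.random() < p_flip:
        image = vertical_flip(image)
    factor = rng.uniform(*brightness_range)
    return adjust_brightness(image, factor)


rng = random.Random(0)  # fixed seed so the sketch is reproducible
augmented = augment([[10, 20], [30, 40]], rng)
```

Flips and rotations are label-preserving for orthophotos because a roof class does not depend on the orientation of the building in the image.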

Performance Evaluation
The open-source machine learning platform MLflow is used for tracking model performance during the research and development phase (Gundersen, 2022). The system allows for logging experiments and better comparison of different versions of models and datasets. The overall performance on the dataset, as well as the performance for each class separately, was assessed using class-based precision, recall and F1-score together with dataset-based accuracy.
To interpret the model's predictions, GradCAM++ (Chattopadhay, 2018) is applied. This method generates a saliency map showing which pixels contribute most to a class prediction.
It is based on the gradients of the class score with respect to the feature maps of the last convolutional layer. This method is especially useful to analyse whether a model makes its predictions based on the right features, i.e., those characterising a certain class. In addition, TracInCP (Pruthi, 2020) is applied to find the most influential training images for a given prediction. This algorithm calculates an influence score for a given training example on a specified test image. The score approximates the change in the loss on the test image that is attributable to training on the given example, using the loss gradients of both examples at checkpoints saved during training. In this way, one can find the training images with the most positive scores, the proponents, as well as those with the most negative scores, the opponents.
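The checkpoint-based influence score can be sketched in a few lines. The sketch follows the TracInCP formulation of summing, over saved checkpoints, the learning rate times the dot product of the loss gradients of the training and the test example. The gradients are passed in as plain lists here, whereas in practice they would be computed from the model at each checkpoint; the function name and the toy numbers are hypothetical.

```python
def tracin_cp_score(train_grads, test_grads, learning_rates):
    """Sum over checkpoints of lr_i * <grad_train_i, grad_test_i>.

    train_grads    -- loss gradients of the training example, one vector per checkpoint
    test_grads     -- loss gradients of the test example, one vector per checkpoint
    learning_rates -- learning rate in effect at each checkpoint
    """
    score = 0.0
    for g_train, g_test, lr in zip(train_grads, test_grads, learning_rates):
        score += lr * sum(a * b for a, b in zip(g_train, g_test))
    return score


# Two checkpoints with two-dimensional toy gradients.
proponent = tracin_cp_score(
    [[1.0, 0.0], [0.5, 0.5]], [[2.0, 1.0], [1.0, 1.0]], [0.1, 0.1]
)
opponent = tracin_cp_score(
    [[1.0, 0.0], [0.5, 0.5]], [[-2.0, -1.0], [-1.0, -1.0]], [0.1, 0.1]
)
```

A positive score means that a gradient step on the training example also reduces the loss on the test image (a proponent); a negative score means the step increases it (an opponent).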
The final model is compared to other state-of-the-art models on a randomly selected training-validation split. The models were selected so that different types of deep learning models are covered. Their weights were pretrained on ImageNet-1k or ImageNet-22k. The CNN models tested are Xception (Chollet, 2016) and EfficientNet (Tan, 2019). Xception is reported to show slightly better performance than ResNet on some benchmark datasets. The architecture introduces a modified depthwise separable convolution (a depthwise convolution followed by a pointwise convolution), building on ideas first introduced by the Inception model. EfficientNet, on the other hand, is designed to achieve an optimal trade-off between model size, computational efficiency and model performance. It uses the concept of compound scaling, which systematically scales the network's depth, width and resolution simultaneously. The optimized ResNet model is also compared against pure Visual Transformers, namely ViT (Dosovitskiy, 2020) and BEiT (Bao, 2021), and against hybrid architectures incorporating a Visual Transformer and a ResNet backbone (Steiner, 2021).
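The benefit of a depthwise separable convolution can be made concrete by counting weights. The sketch below compares a standard convolution with a depthwise convolution followed by a pointwise (1 x 1) convolution; bias terms are ignored and the layer sizes are illustrative, not taken from any of the compared models.

```python
def standard_conv_params(kernel_size, channels_in, channels_out):
    """Weights of a standard convolution: one k x k x C_in filter per output channel."""
    return kernel_size * kernel_size * channels_in * channels_out


def separable_conv_params(kernel_size, channels_in, channels_out):
    """Weights of a depthwise (k x k per input channel) plus pointwise (1 x 1) convolution."""
    depthwise = kernel_size * kernel_size * channels_in
    pointwise = channels_in * channels_out
    return depthwise + pointwise


# Illustrative layer: 3 x 3 kernel, 128 input channels, 256 output channels.
standard = standard_conv_params(3, 128, 256)
separable = separable_conv_params(3, 128, 256)
```

For this layer the separable variant needs roughly a ninth of the weights of the standard convolution.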

RESULTS
The final optimized ResNet model shows consistent overall results over the 5-fold cross-validation, with an average accuracy of 85%, an average F1-score of 85%, an average precision of 84% and an average recall of 84%. This observation suggests that the model performance on the dataset is independent of the exact images in the training and validation split. The results of each validation split are depicted in […].
The performance metrics declined slightly once the additional class "bugs" was added to the classification task and dataset. The accuracy dropped to 82.6% and the weighted F1-score to 83.7% from the previous 84.8% and 85.2%, respectively. Out of the 64 validation samples of the class "bugs", 19 were confused with flat roofs and 9 with pitched roofs. This is understandable, considering that bugs were mostly rectangle-like shapes such as shadows, football fields or construction sites.
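The 5-fold cross-validation mentioned above can be sketched as follows. The sketch uses a plain shuffled split; whether the folds in the study were stratified by class is not stated, so this is an assumption, as are the seed and the function name `k_fold_indices`.

```python
import random


def k_fold_indices(n_samples, k=5, seed=0):
    """Yield (train_indices, validation_indices) for each of the k folds."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    # Every k-th shuffled index goes to the same fold, so fold sizes differ by at most one.
    folds = [indices[i::k] for i in range(k)]
    for i in range(k):
        validation = folds[i]
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, validation


# 3,517 images, as in the dataset described in the Data Preparation section.
splits = list(k_fold_indices(3517, k=5))
```

Each image appears in exactly one validation fold, so averaging the metrics over the five folds uses every image once for validation.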
The performance of the optimised ResNet shows differences between the individual classes. The highest accuracy is achieved for the class "pitched", namely 91%. For the "flat" class, 81% of the samples were predicted as annotated, and for the class "complex" this was 77% of the samples.
Further exploration of the CAMs generated by GradCAM++ reveals the regions of each image that contributed most to the inference made. Looking at the CAMs of the correct predictions for the "flat" and "pitched" classes, it seems that these regions are distinctive for the respective class. Figure 4 shows examples of the CAMs for these categories. For the "flat" class, the roof features that contributed most are the flat parts of the roofs, often those less covered with additional roof elements, or the outlines of the roofs. For the "pitched" class, mostly the ridges or the hips of the roofs were highlighted by GradCAM++. For the "complex" class, the regions contributing the most to the correct result often span a larger part of the roof. This is in line with the fact that the complexity of the geometry and architecture of the roofs comes from the combination of different roof styles across the roof (see Figure 5).
Analysis of the GradCAM++ heatmaps of the misclassified samples gives insights into the reasons for the misclassification. In most cases, the reasons lie in the ambiguity of the images. For example, roofs mainly composed of a flat part with a very small gable part could be annotated as complex but predicted by the model as flat, and vice versa. Examples of misclassified roofs are shown in Figure 6. A closer inspection could reveal specific roof types that are misclassified. For example, white pitched roofs are incorrectly recognised as flat roofs (see Figure 8). This might be because there are only two white pitched roofs in the dataset.

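• Class-based Precision: How many of the predictions for a class were correctly predicted
$$ \mathrm{Precision} = \frac{TP}{TP + FP} \qquad (2) $$
• Class-based Recall: How many of the samples annotated with a class were correctly predicted
$$ \mathrm{Recall} = \frac{TP}{TP + FN} \qquad (3) $$
• Class-based F1-score: Harmonic mean of Precision and Recall
$$ F1 = \frac{2\,TP}{2\,TP + FP + FN} \qquad (4) $$
• Dataset-based accuracy: How many of all predictions were correct
Here TP, FP and FN denote the numbers of true positives, false positives and false negatives for the respective class.

Figure 4. Class Activation Maps for flat and pitched roofs.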
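The class-based metrics listed above can be computed directly from a confusion matrix. The sketch below assumes that rows hold the annotated classes and columns the predicted classes; the two-class matrix is a toy example and does not reproduce the results of the study.

```python
def class_metrics(confusion, cls):
    """Return (precision, recall, f1) for one class.

    confusion[i][j] -- number of samples annotated as class i and predicted as class j
    """
    n_classes = len(confusion)
    tp = confusion[cls][cls]
    fp = sum(confusion[i][cls] for i in range(n_classes)) - tp  # predicted cls, annotated otherwise
    fn = sum(confusion[cls]) - tp                               # annotated cls, predicted otherwise
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * tp / (2 * tp + fp + fn) if tp + fp + fn else 0.0
    return precision, recall, f1


def accuracy(confusion):
    """Share of all predictions that match the annotation."""
    total = sum(sum(row) for row in confusion)
    correct = sum(confusion[i][i] for i in range(len(confusion)))
    return correct / total if total else 0.0


toy_confusion = [[8, 2], [1, 9]]  # illustrative counts only
precision_0, recall_0, f1_0 = class_metrics(toy_confusion, 0)
overall_accuracy = accuracy(toy_confusion)
```

A weighted F1-score, as reported in the results, would average the class-based F1 values using the number of annotated samples per class as weights.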

Figure 5. Class Activation Maps for complex roofs.

Figure 6. Class Activation Maps for misclassified roofs.
In addition, observing the individual misclassified samples, it is found that they are the most ambiguous roof types or, in some cases, those that are misannotated. Examples of such annotations are shown in Figure 7.
The network is initialised with parameters (weights) pre-trained on the ImageNet database. ImageNet is a large dataset consisting of images with diverse objects and backgrounds. Its characteristics make it particularly suitable for Transfer Learning.

Table 3. Comparison of models' performance.
Table 4 shows the confusion matrix on the validation data.

Table 4. Confusion matrix on the validation data.