Convolutional Neural Networks for Road Detection: An Unsupervised Domain Adaptation Approach

Because road networks change frequently, keeping road maps up to date is fundamental for several purposes. Currently, models based on Deep Learning (DL), specifically Convolutional Neural Networks (CNNs) of the encoder-decoder type, are the state of the art for this task. The high performance of CNNs, however, rests on two assumptions: a large labeled dataset is available, and the training and test data come from the same probability distribution. In practical applications these assumptions may not hold, since a domain shift effect is often present and labeled data are rarely available. To address these challenges, we propose to adapt the U-Net architecture (encoder-decoder) to Unsupervised Domain Adaptation (UDA), which does not need labeled data in the target domain to minimize the domain shift effect. Our results demonstrate that the proposed method contributes to road segmentation: the model reaches 74.31% (IoU) and 85.04% (F1), against 67.36% (IoU) and 80.02% (F1) for the same model without UDA. This implies that the information that comes from the target domain, even unsupervised, contributes to adversarial learning, improving the generalization capacity of the model and enhancing aspects such as the discrimination of surrounding classes and the geometric delineation of the road network.


INTRODUCTION
Because road networks change frequently, keeping road maps up to date is fundamental for several purposes. The road network is one of the main modes of transport in the world and provides many types of support, such as expansion or limitation among sites, urban planning, economy (e.g., logistics), autonomous navigation systems, smart cities' development, etc. (Wang et al., 2016). Given the wide availability of remotely sensed data, a variety of road network detection methods have been proposed since the 1970s. Essentially, these methods are based on information about geometrical, radiometrical, topological, functional, scale, and contextual characteristics. Because handling all these characteristics with handcrafted features is time-consuming and costly, algorithms based on Deep Learning (DL) exploit them automatically in a single process.
In this context, Convolutional Neural Networks (CNNs) have made fundamental advances in Computer Vision over the last decade for the classification, detection, and segmentation of objects in images. Regarding the specific topic of road detection, CNNs, especially the deeper ones, are the most widely used algorithms (Filho et al., 2023). Such algorithms are flexible enough to handle different data types, which increases the detection accuracy through automatic feature extraction at different levels of abstraction. Since the pioneering work in 2010 (Mnih and Hinton, 2010), different methods have emerged, such as patch CNNs, deep CNNs, and those based on adversarial training.
With the revolution brought to the semantic segmentation field by Fully Convolutional Networks (FCNs) (Long et al., 2015), road detection based on patch CNNs (Li et al., 2016), which demanded heavy computational resources for larger patches, was gradually replaced by networks composed entirely of convolutional layers (Zhong et al., 2016, Henry et al., 2018). Later, encoder-decoder architectures such as the U-Net (Ronneberger et al., 2015) gained attention. This is due to the skip connections between the low-level and high-level layers, in contrast to FCNs, which are not effective in accurately preserving spatial details in the reconstruction of the masks.
Approaches based on different characteristics have been proposed with encoder-decoder networks. Examples include multitasking, involving the surface, centerline, and edges of the roads (Cheng et al., 2017, Lu et al., 2022); residual learning as the depth of the layers increases (Zhang et al., 2018, Bandara et al., 2022); local and global attention blocks (Xu et al., 2018); multimodal fusion (Filho et al., 2023); adversarial networks (Abdollahi et al., 2021), etc. On the other hand, although DL-based methods (i.e., encoder-decoder architectures) are the mainstream approach, the high performance of the existing works rests on two assumptions: the model needs a large labeled dataset, and the training and test data belong to the same probability distribution.
The benchmark sets available in the literature are an example of this assumption. Existing approaches use labeled datasets such as the Massachusetts Dataset (Mnih, 2013), Deep-Globe (Demir et al., 2018), the CasNet dataset (Cheng et al., 2017), a dataset for rural areas (Yang and Wang, 2020), and others. Once these datasets are split into train and test subsets, the high inference accuracy is strongly related to the fact that both subsets come from the same probability distribution. In other words, even though the subsets share the same distribution, DL-based models assume that the training data have enough variability for generalization at inference (testing).
In practical road detection applications, however, accuracy can decrease, and the model may even overfit to its training data, in the presence of a domain shift effect. For instance, a model may be trained with one dataset but applied to another, e.g., Deep Globe and Massachusetts; this characterizes two domains, a source domain and a target domain, respectively. The domain gap is due to the different probability distributions that generate the datasets, caused by different spatial resolutions, locations, acquisition methods, data types, or other factors. Furthermore, since the algorithms rely on fully supervised training, another limitation is the pixel-level labeling in the target domain; in the real world, labeled data are not always available, and producing them is a time-consuming process.
Challenges like these motivate the use of Domain Adaptation (DA) for road detection. Essentially, DA can minimize the difference in distribution between the domains, using labeled data in just one domain, to complete the task in the target domain without labeling. The specific case in which labeled data are available only in the source domain is called Unsupervised Domain Adaptation (UDA) (Wang and Deng, 2018), a particular case of transfer learning and the hardest case of DA. UDA methods can be divided into two categories: representation matching and appearance adaptation networks. While representation matching aims to build a domain-invariant latent space (Wittich and Rottensteiner, 2019), appearance adaptation networks concentrate on transforming images from the target domain so that they look similar to those of the source domain, as generative methods do (Wittich and Rottensteiner, 2021). For both approaches, adversarial training is the most used strategy (Goodfellow et al., 2014).
Despite the recent advances in the Remote Sensing and Photogrammetry (RS&P) area, e.g., in land use and land cover classification by UDA (Xu et al., 2022), several challenges remain for road network detection, which belongs to the semantic segmentation task, and few works have been proposed (Iqbal et al., 2023). Specifically, since the pioneering UDA work (Ganin et al., 2015) in the Computer Vision field, which brought advantages for DL frameworks and training steps, this approach has not been investigated for deeper CNNs, either in the RS&P area or for the road detection topic (Elshamli et al., 2017, Soto et al., 2022). Therefore, in this paper, we propose a new approach to contribute to road network detection: we adapt the U-Net architecture to UDA, specifically through the strategy of a Domain Adversarial Neural Network (DANN) (Ganin et al., 2015). The contributions of our study can be summarized as follows: 1) building an encoder-decoder architecture for UDA, whose method and datasets are state-of-the-art; 2) discussing the challenges involved in adversarial training, concerning the optimization problem; and 3) presenting a viable solution to minimize the domain shift problem, without the need for labeled data at inference, which leads to automatic segmentation of the road network.

Study area and preprocessing stages
A benchmark dataset, the CasNet dataset (Cheng et al., 2017), is used. The set was originally proposed for road and centerline detection with the CasNet architecture. It contains aerial images collected from Google Earth, whose road segmentation reference maps were manually labeled. The road width is about 12-15 pixels, and the set has some occlusions by cars and trees as well as a dense residential area whose roofs have radiometric characteristics similar to those of roads. The images have a Ground Sample Distance (GSD) of 1.2 m and a radiometric resolution of 8 bits; the set has a total of 224 samples, with dimensions of at least 600x600 pixels.
The CasNet dataset is chosen for several reasons: the delineation of the road network is uniform and homogeneous with few occlusions, the data have a lower GSD and require little storage, and it supports metric comparisons among diverse works. For the unsupervised adaptation strategy, the proposed approach takes RGB images converted into grayscale as the source domain and RGB images as the target domain, as shown in Figure 1. The 224 samples of the benchmark set are already split into training (160), validation (20), and testing (44). However, as the images are at least 600x600 px, train and test images are cropped into 512x512 px patches and, for training, the samples are padded in mirror mode. This increases the number of samples to 252, 20, and 81 for training, validation (unchanged), and testing, respectively. Moreover, various data augmentation operations based on geometric transformations are applied; for instance, the images are randomly flipped from left to right and then from up to down, and operations such as image rotation and image transposition are also performed.
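The mirror padding and geometric augmentations described above can be sketched as follows. This is a minimal, framework-independent illustration on nested lists, not the authors' pipeline: the function names, the 0.5 probability for each operation, and padding only on the bottom and right sides are assumptions made for the example.

```python
import random


def flip_left_right(img):
    """Reverse the column order of every row."""
    return [row[::-1] for row in img]


def flip_up_down(img):
    """Reverse the row order."""
    return [row[:] for row in img[::-1]]


def transpose(img):
    """Swap rows and columns."""
    return [list(col) for col in zip(*img)]


def rotate_90(img):
    """Rotate by 90 degrees counter-clockwise (transpose, then reverse rows)."""
    return transpose(img)[::-1]


def _reflect_index(i, n):
    """Map an out-of-range index back into [0, n) by mirroring (needs n >= 2)."""
    period = 2 * (n - 1)
    i %= period
    return i if i < n else period - i


def mirror_pad(img, size):
    """Pad on the bottom and right sides up to size x size in mirror mode."""
    h, w = len(img), len(img[0])
    return [[img[_reflect_index(r, h)][_reflect_index(c, w)]
             for c in range(size)] for r in range(size)]


def augment(image, mask, rng=random):
    """Apply each geometric operation with probability 0.5 (assumed value).

    The same operation is applied to the image and to its reference mask
    so that the pixel-level labels stay aligned.
    """
    for op in (flip_left_right, flip_up_down, transpose, rotate_90):
        if rng.random() < 0.5:
            image, mask = op(image), op(mask)
    return image, mask
```

Because every operation only rearranges pixels, the same functions apply to grayscale (source) and RGB (target) patches when each pixel is stored as a single list element.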

Proposed architecture
The proposed method adapts the U-Net to UDA following the DANN strategy. The U-Net is a network for semantic segmentation. It consists of a contracting path to capture context and a symmetric expanding path that enables precise localization (reconstruction of the masks). These paths are named encoder and decoder, respectively, which yields the u-shaped architecture, and the network does not have fully connected layers. The DANN, a UDA approach, focuses on learning discriminative and domain-invariant features. In particular, DANN can minimize the divergence between two probability distributions (source and target domain) parametrized by the encoder of a deep neural network, which is implemented with two other modules: a decoder (or classifier) and a domain discriminator. The structure of the architecture is presented in detail in Figure 2.
The encoder layers follow the standard U-Net (with ImageNet pretrained weights), and so does the decoder, but with batch normalization layers (Ioffe and Szegedy, 2015) as regularizers. For the domain discriminator, in order to determine which domain a sample belongs to, domain labels are defined as 0 and 1 (for the source and the target domain, respectively); the discriminator also includes a Gradient Reversal Layer (GRL), which involves a domain regularizer hyperparameter.
The adapted network takes two batches as input: one from the source domain and one from the target domain. They are convolved separately by the encoder but concatenated for the domain discriminator, while only the source batch is passed to the decoder for classification. The GRL acts as an identity transformation during the forward pass and multiplies the gradient by a negative scalar (−λ) during backpropagation; the proposed method implements it in the backward step, directly on the gradients.
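The behaviour of the GRL and the domain labels can be illustrated with a small framework-independent sketch. The class and function names below are hypothetical and chosen for the example only; in TensorFlow, the same forward and backward behaviour is commonly obtained with the `tf.custom_gradient` decorator, which may differ from how the authors implemented it.

```python
class GradientReversal:
    """Identity in the forward pass, gradient scaled by -lam in the backward pass."""

    def __init__(self, lam):
        self.lam = lam  # domain regularizer hyperparameter (lambda)

    def forward(self, features):
        # The features reach the domain discriminator unchanged.
        return list(features)

    def backward(self, upstream_grad):
        # The gradient flowing back to the encoder is reversed and scaled.
        return [-self.lam * g for g in upstream_grad]


def domain_labels(n_source, n_target):
    """Labels for a concatenated batch: 0 for source, 1 for target samples."""
    return [0] * n_source + [1] * n_target


grl = GradientReversal(lam=0.5)
features = [1.0, 2.0]
passed = grl.forward(features)            # unchanged features
reversed_grad = grl.backward([2.0, -4.0])  # sign flipped, scaled by 0.5
labels = domain_labels(2, 2)               # two source, two target samples
```

With the sign reversed, a gradient step that helps the discriminator separate the domains pushes the encoder in the opposite direction, towards features that do not reveal the domain.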
In the backward pass, the gradients of the decoder and of the domain discriminator are calculated separately with respect to the same module, the encoder (θ_f weights), as written in Equation 1. First, the gradients of the decoder loss with respect to its own module are calculated, and the decoder is updated together with the encoder module (1st step). Subsequently, the batch with source and target domain samples passes through the encoder and the domain discriminator, and the discriminator module is updated with another optimizer (2nd step).
where L_y is the loss for the decoder (θ_y weights), and L_d is the loss for the domain discriminator (θ_d weights).
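For reference, the update rules of the original DANN formulation (Ganin et al., 2015) can be written with the symbols used here as follows. This is a sketch under assumptions rather than a reproduction of Equation 1: the learning rate μ is a symbol introduced only for this illustration, and the two-step schedule with separate optimizers described above may differ in detail.

```latex
\begin{align*}
\theta_f &\leftarrow \theta_f - \mu \left( \frac{\partial L_y}{\partial \theta_f}
          - \lambda \, \frac{\partial L_d}{\partial \theta_f} \right) \\
\theta_y &\leftarrow \theta_y - \mu \, \frac{\partial L_y}{\partial \theta_y} \\
\theta_d &\leftarrow \theta_d - \mu \, \frac{\partial L_d}{\partial \theta_d}
\end{align*}
```

The minus sign in front of λ is the effect of the GRL: the encoder descends on the decoder loss while ascending on the discriminator loss, and the discriminator itself follows ordinary gradient descent on L_d.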
The whole network was implemented with the TensorFlow framework (version 2.15) in the Python programming language (version 3.10). The network was trained in the Google Colab Pro environment, which uses an NVIDIA Tesla T4 16GB graphics card.

Evaluation of the proposed approach
The three modules of the architecture are used in training, but only the encoder and the decoder are used for the testing step.
Based on this, the results can be evaluated. For both source and target domains, the data are split into training, validation, and testing. For a fair comparison, and in order to check the contribution of UDA to road detection, the results of the proposed method are compared with a baseline, which adopts the same segmentation model (U-Net) and settings but without UDA, and also with fully supervised training (on the target domain).
The evaluation consists of qualitative and quantitative analysis. The metrics Recall (Equation 2), Precision (Equation 3), F1-Score (Equation 4), and Intersection over Union (IoU) (Equation 5) are used for this purpose:
\[ \mathrm{Recall} = \frac{TP}{TP + FN} \tag{2} \]
\[ \mathrm{Precision} = \frac{TP}{TP + FP} \tag{3} \]
\[ \mathrm{F1} = \frac{2 \cdot \mathrm{Precision} \cdot \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}} \tag{4} \]
\[ \mathrm{IoU} = \frac{TP}{TP + FP + FN} \tag{5} \]
where False Negatives (FN) are the number of pixels of "road class" classified as "non-road class"; False Positives (FP) are the number of pixels of "non-road class" classified as "road class"; True Negatives (TN) are the number of pixels correctly classified as "non-road class"; and True Positives (TP) are the number of pixels correctly classified as "road class".
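As an illustration, the four metrics can be computed from flattened binary masks as in the sketch below. The function names and the convention of returning 0.0 for an empty denominator are assumptions made for the example, not part of the paper.

```python
def confusion_counts(pred, ref):
    """Count TP, FP, FN, TN for binary masks (1 = road, 0 = non-road)."""
    tp = fp = fn = tn = 0
    for p, r in zip(pred, ref):
        if p == 1 and r == 1:
            tp += 1
        elif p == 1 and r == 0:
            fp += 1
        elif p == 0 and r == 1:
            fn += 1
        else:
            tn += 1
    return tp, fp, fn, tn


def segmentation_metrics(tp, fp, fn):
    """Return Recall, Precision, F1-Score and IoU from pixel counts."""
    recall = tp / (tp + fn) if tp + fn else 0.0
    precision = tp / (tp + fp) if tp + fp else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    iou = tp / (tp + fp + fn) if tp + fp + fn else 0.0
    return {"recall": recall, "precision": precision, "f1": f1, "iou": iou}


# Toy example with five pixels.
prediction = [1, 1, 0, 0, 1]
reference = [1, 0, 0, 1, 1]
tp, fp, fn, tn = confusion_counts(prediction, reference)
scores = segmentation_metrics(tp, fp, fn)
```

True Negatives do not enter any of the four metrics, which is why they remain informative when the road class covers only a small share of the pixels.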
Due to the class imbalance of the roads, the Focal Tversky Loss (Abraham and Khan, 2018), written in Equation 6, is used for the decoder (prediction of the roads). For the domain discriminator, two loss functions are analyzed: Binary Cross-Entropy (BCE) and the sigmoid with cross-entropy (S-CE).
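A minimal sketch of the Focal Tversky Loss for a single class is given below, following the form commonly attributed to Abraham and Khan (2018). The parameter values (alpha = 0.7, beta = 0.3, gamma = 4/3) and the smoothing term are assumptions for illustration; the values used in this work are not stated in the text.

```python
def tversky_index(probs, targets, alpha=0.7, beta=0.3, eps=1e-7):
    """Tversky index from predicted road probabilities and binary labels.

    alpha weights the false negatives and beta the false positives
    (assumed values); eps avoids division by zero.
    """
    tp = sum(p * t for p, t in zip(probs, targets))
    fn = sum((1 - p) * t for p, t in zip(probs, targets))
    fp = sum(p * (1 - t) for p, t in zip(probs, targets))
    return (tp + eps) / (tp + alpha * fn + beta * fp + eps)


def focal_tversky_loss(probs, targets, alpha=0.7, beta=0.3, gamma=4 / 3):
    """Focal Tversky loss: (1 - TI) ** (1 / gamma), with gamma assumed."""
    ti = tversky_index(probs, targets, alpha, beta)
    return (1 - ti) ** (1 / gamma)


# One missed road pixel costs more than one false alarm when alpha > beta.
loss_missed_road = focal_tversky_loss([1.0, 0.0], [1, 1])
loss_false_alarm = focal_tversky_loss([1.0, 1.0], [1, 0])
```

Setting alpha above beta penalizes missed road pixels more than false alarms, which is one way such a loss can compensate for the small proportion of road pixels.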

RESULTS
For the selected domains from the CasNet dataset, the adaptation setting includes different data types (i.e., grayscale and RGB images), different backgrounds, and different types of road surfaces. The training batch size was fixed at 4, and data augmentation operations were applied, such as random left-right and up-down flips, image transposition, and rotations of 90° and 270°. An extensive inspection of hyperparameters was also carried out, which mainly involved finding an adequate learning rate, the optimizer to use, the total number of iterations (according to the batch size, epochs, and number of samples), and the λ value (domain regularizer hyperparameter).
Table 2. Preliminary results based on and domain regularization hyperparameter.
Firstly, two optimizers are tested: Adam and Stochastic Gradient Descent (SGD) with exponential decay. For the domain discriminator, two loss functions are analyzed: the Binary Cross-Entropy (BCE) and the sigmoid with cross-entropy (S-CE). Table 1 shows the results in terms of metrics for this preliminary analysis, where the subscripts "e" and "d" refer to the encoder-decoder and discriminator modules, respectively.
Table 1 presents the different settings tested for the model. The SGD optimizer with momentum (0.9) and exponential decay, which was chosen to update the encoder-decoder weights, shows that with a lower learning rate the model reaches a plateau and does not converge (model 1a in Table 1). A higher learning rate should therefore be used, such as 10^-2 with decay. Regarding the influence of the discriminator, Table 1 shows that, whether an adaptive optimizer (Adam) or SGD with or without scheduling is used, and for both loss functions, the F1 and IoU values on the test subset are around 82% and 69%, respectively; in the visual analysis, however, more iterations can reduce the FPs in the prediction (models 1c and 2a). Figure 3 presents a comparison between them.
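The learning rate schedule reported with Table 1 (exponential decay from 1e-2 to 1e-3) can be sketched as below. The function name and the choice of reaching the final rate exactly at the last training step are assumptions; the decay rate and decay steps actually used are not given in the text.

```python
def exponential_decay(step, total_steps, lr_start=1e-2, lr_end=1e-3):
    """Learning rate that decays exponentially from lr_start to lr_end.

    The rate is multiplied by a constant factor at every step, so that it
    equals lr_start at step 0 and lr_end at step total_steps (assumed).
    """
    fraction = min(max(step / total_steps, 0.0), 1.0)
    return lr_start * (lr_end / lr_start) ** fraction


# Learning rate at the start, middle and end of a 1000-step run.
schedule = [exponential_decay(s, 1000) for s in (0, 500, 1000)]
```

Halfway through training the rate is the geometric mean of the two endpoints (about 3.2e-3), so most of the reduction happens early in the run.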
As presented in Figure 3, it is evident that more iterations are necessary for the experiment. The predictions have some shortcomings: there is noise on the tiles, buildings are predicted as roads (they are radiometrically similar), and the roads are disconnected due to occlusion by trees or cars. On the other hand, the failures might be related to the optimization problem; in adversarial training, it is essential to balance the weight between the label predictor and discriminator losses. In practice, this means that neither module should dominate: the label predictor should not become so strong that the discriminator is left too weak to predict the domains, and vice versa. In that regard, the UDA architecture has a domain regularization hyperparameter, λ, which must be set so that the gradients are propagated correctly through the network. Thus, different values were analyzed for both the label predictor and the discriminator. Table 2 shows these analyses.
The results presented in the first preliminary analysis (Table 1) use a fixed λ = 1. However, according to Table 2, the regularizer with small values can better match distributions between source and target domains. Empirically, different values were tested, since there is one regularizer for the encoder-decoder model and another for the discriminator module. Using different values for each one, in this case, may decrease the metrics of the model. In contrast, as in model 3b, if the discriminator is not updated, F1 and IoU are affected. When a small, identical value is used for both, the metrics increase, although more iterations are necessary, reaching approximately 85% and 74% for F1 and IoU, respectively (e.g., model 3e). For qualitative analysis, Figure 3 (models 3e and 3f) shows the predictions for this second preliminary analysis.
Table 3. Analysis of the proposed method and comparisons to baseline and training on target.
The last two results presented in Figure 3 indicate that a larger λ may make the model more sensitive to the gradients and even propagate them incorrectly. This means that a small and equal value is more suitable for the road detection problem (grayscale to RGB case). For example, compared to the first analysis, the model now distinguishes the road class with more accuracy, without the noise observed previously, reducing the FPs in the segmentation and improving the geometric delineation of the road network.
After these analyses, it is possible to compare the results to a baseline, in this case the same model: U-Net. Training and prediction in a fully supervised way, i.e., with the RGB images, are also presented, in order to check the contribution of UDA. Table 3 shows the comparison.
As shown in Table 3, the proposed method performs favorably against the baseline, achieving 85.04% and 74.31% for F1-Score and IoU, respectively; this is a relative improvement of around 6% and 10%, respectively (about 5 and 7 percentage points). In terms of training (optimization), this emphasizes the importance of calibrating the hyperparameters and the regularizers (e.g., λ), with a focus on adversarial training.
Compared with fully supervised training on the target domain, the UDA strategy still needs to be improved: the supervised models achieve 89.56% and 81.86% (Cheng et al., 2017), and 86.98.56% and 77.52% (the same settings), for F1-Score and IoU, respectively. Figure 4 presents a visual comparison of the results.
In terms of visual performance (Figure 4), the UDA strategy can significantly contribute to road segmentation. For instance, in the source domain, with only grayscale (spectral) information, the roads are easily confused with the roofs of houses, while with the information from the target domain (RGB), the network can discriminate between these classes and improve the connectivity (topology) of the roads. However, in the presence of trees or cars over the roads, where the network lacks information, the probability of FN increases. Furthermore, the model is still sensitive when detecting roads and intersections, which suggests that the grayscale and RGB domains do not provide enough variability. Objects that are different but radiometrically similar, like crosswalks, central trees, and different road pavements, are some examples of the main shortcomings of the proposed method.
More tests are necessary for road segmentation, including more iterations for training. Although the model still has some limitations in detecting the road network, it should be emphasized that the method improves the segmentation without requiring further training such as transfer learning, and that some problems remain even in the fully supervised training (on target). In this sense, investigating other loss functions, another backbone, and other hyperparameter settings are some suggestions to overcome the current model's limitations and enhance the road segmentation of the proposed method.

CONCLUSION AND FUTURE WORK
This work presented an experiment to check the contribution of UDA to the road segmentation task, based on the strategy that available labeled road data can lead to promising road network detection on unlabeled target images.
For this purpose, the achieved metrics have shown adequate results. The contribution of UDA is emphasized, since the proposed method performs well against a baseline (the approach without UDA). The information that comes from the target domain, even unsupervised, contributes to adversarial learning, improving the generalization capacity of the model. Aspects such as better discrimination of surrounding classes (contextual aspect) refine the geometric delineation of the road network and reduce the probability of false positives and false negatives. Although some improvements still need to be made, such as in cases in which the roads are obstructed or have different pavements, it is worth noting that the proposed approach leads to automatic road segmentation without labeled data in the target domain, since no further training or fine-tuning is required.
To strengthen these perspectives, other approaches are proposed to check the contribution of UDA in a visual analysis, based on techniques for data representation in the DL context, such as t-SNE. Besides, since road detection involves multiscale information, another backbone, different hyperparameter settings, and even methods for multiscale and multilevel feature extraction (attention) are some strategies to be investigated.

Figure 4. Visual comparisons of road detection results: on target, baseline, and UDA. The 6th column: close-up images of the yellow boxes. The last column: FP in red; FN in blue; TP in green.
The International Archives of the Photogrammetry, Remote Sensing and Spatial Information Sciences, Volume XLVIII-2-2024 ISPRS TC II Mid-term Symposium "The Role of Photogrammetry for a Sustainable World", 11-14 June 2024, Las Vegas, Nevada, USA

Table 1. Preliminary results of hyperparameters in the test subset.
*Learning rate schedule: exponential decay from 1e-2 down to 1e-3; **The scalar concerns the encoder and the discriminator module, respectively.
Table 2, the regularizer with small values can better match distributions between source and target domains. Empirically, different values were tested, since there is a regularizer for the encoder-
*U-Net modified.