A Deep Neural Network for Road Extraction with the Capability to Remove Foreign Objects with Similar Spectra

Existing deep-learning-based road extraction methods often struggle to distinguish ground objects that share similar spectral information, such as roads and buildings. This study therefore proposes a dual encoder-decoder deep neural network for road extraction in complex backgrounds. In the feature extraction stage, the first encoder-decoder is designed to extract road features and the second to extract building features. In the feature fusion stage, the road and building feature maps designated for fusion are first input into the convolutional block attention module (CBAM), which amplifies the features of different channels and extracts key information from diverse spatial positions. Feature fusion is then performed by element-by-element subtraction. The resulting road features are constrained by the building features and thus preserve more precise road feature information. Experimental results demonstrate that the model learns road and building features concurrently and effectively distinguishes easily confused roads and buildings with similar spectral information, ultimately improving the accuracy of road extraction.


Introduction
The significance of road extraction lies in its ability to automatically identify and delineate roads from satellite images or maps. This process is essential for various applications such as urban planning, transportation management, infrastructure development, environmental monitoring, and disaster response. Accurate road extraction facilitates navigation systems, helps improve traffic flow analysis, aids in updating maps, and supports various location-based services.
Traditional road extraction methods are labor-intensive, relying on manually designed features based on road texture, shape, edges, and other characteristics. Their extraction accuracy is low, their robustness is poor, and they are not suitable for road extraction in complex scenarios. The support vector machine (SVM) classification method (Simler, 2011) and the Markov random field (MRF) classification method (Li et al., 2017) formulate rules according to the spectral and spatial characteristics of roads, extract road fragments from images, and further refine them. He et al. proposed a color-based road detection algorithm by combining boundary estimation results from grayscale images with road region extraction results from color images (He et al., 2004). Simler et al. proposed an SVM technique using spectral and spatial features to extract roads from aerial images with a spatial resolution of 0.5 m. Yager et al. used SVM to extract roads from aerial images with a spatial resolution of 0.45 m by utilizing important features such as edge length, intensity, and gradient (Yager and Sowmya, 2003). Wegner et al. proposed a high-order conditional random field (CRF) model for road network extraction (Wegner et al., 2013).
Currently, deep learning is highly favored in the field of semantic segmentation, and an increasing number of studies use it to tackle various problems. Among them, the most commonly used architectures are Convolutional Neural Networks (CNNs) and Fully Convolutional Networks (FCNs) (Long et al., 2015). Additionally, with the advancement of deep learning, transformers have also been widely employed. The U-Net architecture adopts cascaded upsampling and combines multiple loss functions for road extraction (Ronneberger et al., 2015). To minimize information loss, LinkNet directly connects the encoder to the decoder (Chaurasia and Culurciello, 2017). D-LinkNet adopts the LinkNet architecture and utilizes shortcut connections in the central part to combine atrous convolution blocks into several parallel branches (Zhou et al., 2018). Occlusions from buildings and shadows cause discontinuities in roads. Therefore, in the CoANet model, a connectivity attention module (CoA) is designed to address the continuity issues in roads (Mei et al., 2021). Zhang et al. proposed a semantic segmentation neural network for road extraction that combines the advantages of residual learning and U-Net (Zhang et al., 2018). Since CNNs struggle to capture global representations, transformers are used to obtain comprehensive contextual information. Accordingly, the Seg-Road (Tao et al., 2023) and DPENet (Chen et al., 2023) models combine local and global information using a dual-encoder structure for road extraction. SemiRoadExNet is a semi-supervised road extraction framework that employs Generative Adversarial Networks (GANs) and utilizes multiple discriminators to ensure consistency in feature distributions between labeled and unlabeled data, enhancing the generalization capability of the model (Chen et al., 2023).
However, road extraction methods based on deep learning still have limitations. These include issues with road connectivity, object occlusion, and difficulty distinguishing between objects with similar spectral characteristics (such as roads and buildings). Therefore, this paper proposes a dual-encoder-decoder structure to simultaneously learn features of roads and buildings, with building features suppressing the learning of road features. Additionally, the Convolutional Block Attention Module (CBAM) is employed to enhance features, reduce semantic information loss, and more effectively utilize extracted information. Our contributions are as follows:
• We propose a model consisting of a dual-encoder-decoder architecture to simultaneously learn features of roads and buildings, which are prone to confusion.
• We employ CBAM to enhance features and to extract and leverage more shared information.
• We adopt an exclusion strategy for feature fusion, using building features to suppress the learning of road features, thereby reducing misclassification during road extraction.

Methods
Our proposed model primarily consists of two encoder-decoder structures and a CBAM module integrated with element-wise subtraction for feature fusion, as illustrated in Fig. 1.
Figure 1. Architecture of the proposed network.

Dual Encoder-Decoder Network Model Structure
The dual encoder-decoder architecture is mainly based on the encoder-decoder structures of the CoANet (Mei et al., 2021) and TransUNet (Chen et al., 2021) models.
The first encoder-decoder structure is dedicated to extracting road features, with the encoder segment utilizing a pre-trained ResNet101. The encoder comprises five modules. The first module contains a convolutional layer, a batch normalization layer, an activation function, and a max-pooling layer. The subsequent four modules consist of convolutional layers, normalization layers, and activation functions, with layer depths of 3, 4, 23, and 3, respectively. The last two modules employ dilated convolutions with dilation rates of 2 and 4 to extract denser features. Together, the five modules yield five feature maps. The deepest feature map, along with the feature map extracted from the second encoder-decoder structure for feature fusion, is input into CBAM, which amplifies features from different channels and extracts crucial information from various spatial positions. The two outputs are fused by element-wise subtraction, and the fused features are fed into Atrous Spatial Pyramid Pooling (ASPP) to increase the receptive field. Finally, the feature map is input into a decoder with four strip convolution modules, each containing strip convolutions in four different directions to capture contextual information. The segmentation branch and the connectivity branch are connected after the decoder. The connectivity branch integrates Squeeze-and-Excitation (SE) for attention-weighted processing of feature maps across different channels.
To distinguish between easily confused road and building features, we employ parallel encoder-decoder structures, which avoids issues such as the information loss caused by the small feature maps that result from deepening a single model. The second encoder adopts ResNetV2, comprising three modules, each composed of convolutional layers, batch normalization layers, and activation functions. The encoder produces three feature maps plus one feature map used to obtain global contextual information. The latter is serialized, positional encoding is added, and the sequence is passed through twelve transformer layers. The resulting sequence is reshaped and then passed through a convolutional layer and an activation function. The resulting feature map is upsampled and fused with the encoder feature map of the same shape. This upsample-and-fuse step is then repeated for each of the two remaining encoder feature maps. Finally, the fused feature map is upsampled, and segmentation is performed to obtain the predicted binary image.

Feature Fusion
The feature fusion structure is shown in Fig. 2. The deepest feature map from ResNet101 and the feature map used for feature fusion from the second encoder-decoder structure are input into CBAM. Each first passes through a channel attention module and then through a spatial attention module. The channel attention module performs max-pooling and average-pooling operations on the input feature map, passes the pooled results through a Multilayer Perceptron (MLP), and combines them by element-wise addition. The result is multiplied element-wise with the input feature map to produce the input for the spatial attention module. The spatial attention module conducts max-pooling and average-pooling operations on this feature map, concatenates the results along the channel dimension, applies a convolution operation, and passes the output through a sigmoid to obtain the final feature map. The road and building feature maps processed by CBAM are fused using element-wise subtraction to yield the final fused feature. The formula is as follows:
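A form consistent with the description above can be written as follows. This is a sketch that assumes the standard CBAM formulation (including the sigmoid $\sigma$ in the channel branch and a $k \times k$ convolution $f^{k \times k}$ in the spatial branch). The symbols $F$, $F'$, $M_c$, $M_s$, $F_{road}$, $F_{build}$, and $F_{fused}$ are notation introduced here for illustration.

```latex
\begin{aligned}
M_c(F) &= \sigma\big(\mathrm{MLP}(\mathrm{AvgPool}(F)) + \mathrm{MLP}(\mathrm{MaxPool}(F))\big) \\
F' &= M_c(F) \otimes F \\
M_s(F') &= \sigma\big(f^{k \times k}([\mathrm{AvgPool}(F');\ \mathrm{MaxPool}(F')])\big) \\
\mathrm{CBAM}(F) &= M_s(F') \otimes F' \\
F_{fused} &= \mathrm{CBAM}(F_{road}) - \mathrm{CBAM}(F_{build})
\end{aligned}
```

Here $\otimes$ denotes element-wise multiplication with broadcasting, and the final line is the element-wise subtraction that constrains the road features by the building features.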

Loss Function
The loss function plays a crucial role in the model training process. After each batch of data is input into the model and predictions are produced through forward propagation, the loss function calculates the difference between the predicted values and the ground truth. Then, through backpropagation, the parameters in the model are updated to minimize this difference, thereby allowing the model to converge and achieve the training objective. Therefore, the choice of loss function is also critical during model training.
This paper employs a combination of BCE (Binary Cross-Entropy) and Dice loss functions to address the foreground-background class imbalance issue in images. Based on the model architecture shown in Figure 1, the loss function L mainly consists of two parts: L1 and L2. L1 represents the loss function of the first encoder-decoder structure, while L2 represents the loss function of the second encoder-decoder structure. Since the first encoder-decoder part includes both linking (connectivity) branches and a segmentation branch, the loss function L1 also comprises two parts: Ls and Lc. The linking branch is used to determine the connectivity between the current pixel and its surrounding eight pixels. Therefore, the loss function is formulated as follows.
where
LB = BCE loss function
LD = Dice loss function
yi = the ground truth
y'i = the prediction of the segmentation branch
Q0 = the number of surrounding pixels
yc = the connectivity of the pixel with its surrounding pixels
y'c = the prediction of the linking branch
α = the weight balancing the BCE and Dice loss functions
α' = the weight balancing the two linking branch loss functions
α'' = the weight balancing the segmentation branch and linking branch loss functions
α''' = the weight balancing the two encoder-decoder structure loss functions
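The way a BCE term and a Dice term can be balanced by a single weight is illustrated in the minimal sketch below. The weighting form `alpha * BCE + (1 - alpha) * Dice`, the smoothing constant, and all function names are assumptions made for illustration, and the sketch covers only one branch rather than the full loss L.

```python
import math


def bce_loss(y_true, y_pred, eps=1e-7):
    """Mean binary cross-entropy over flat lists of labels and probabilities."""
    total = 0.0
    for t, p in zip(y_true, y_pred):
        p = min(max(p, eps), 1.0 - eps)  # clamp to avoid log(0)
        total += -(t * math.log(p) + (1 - t) * math.log(1.0 - p))
    return total / len(y_true)


def dice_loss(y_true, y_pred, smooth=1.0):
    """Soft Dice loss: 1 - (2 * overlap + smooth) / (sum_true + sum_pred + smooth)."""
    overlap = sum(t * p for t, p in zip(y_true, y_pred))
    denom = sum(y_true) + sum(y_pred)
    return 1.0 - (2.0 * overlap + smooth) / (denom + smooth)


def combined_loss(y_true, y_pred, alpha=0.5):
    """Assumed weighting: alpha * BCE + (1 - alpha) * Dice."""
    return alpha * bce_loss(y_true, y_pred) + (1.0 - alpha) * dice_loss(y_true, y_pred)


if __name__ == "__main__":
    truth = [1, 0, 1, 0]
    print(combined_loss(truth, [0.9, 0.1, 0.8, 0.2]))
```

The Dice term depends on the overlap between prediction and ground truth rather than on per-pixel averages, which is why it is commonly paired with BCE when the foreground class is small.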

Datasets
Aerial Image Segmentation Dataset (Kaiser et al., 2017): The dataset pairs aerial remote sensing images from Google Maps with pixel-level building, road, and background labels from OpenStreetMap. It is available at https://zenodo.org/records/1154821#.XH6HtygzbIU and covers Berlin, Chicago, Paris, Potsdam, and Zurich. Images of part of the Zurich area were selected as the data set and cropped to 512×512 pixels, and the road and building parts of the labels were extracted separately as the labels for their respective channel models. Of these, 8070 images were used as the training set, 1020 as the validation set, and 1080 as the test set.
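The cropping step can be sketched as follows. This is a minimal illustration that assumes non-overlapping tiles and discards partial tiles at the image border. Neither the stride nor the border handling is specified above, and `tile_origins` and `crop` are hypothetical helper names.

```python
def tile_origins(height, width, tile=512, stride=512):
    """Return the (top, left) corner of every full tile that fits in the image."""
    rows = range(0, height - tile + 1, stride)
    cols = range(0, width - tile + 1, stride)
    return [(top, left) for top in rows for left in cols]


def crop(image, top, left, tile=512):
    """Crop one tile from an image stored as a list of pixel rows."""
    return [row[left:left + tile] for row in image[top:top + tile]]


if __name__ == "__main__":
    # A 1024 x 1536 image yields a 2 x 3 grid of 512 x 512 tiles.
    print(len(tile_origins(1024, 1536)))
```

Applying the same origins to an image and to its road and building label masks keeps each tile aligned with its labels.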

Implementation Details
Training is conducted under the PyTorch deep learning framework, with training platform parameters as shown in Table 1.

Evaluation Metrics
Road extraction is commonly treated as a binary classification problem. The commonly used model performance evaluation metrics are Overall Accuracy (OA), Precision, Recall, Intersection over Union (IoU), and F1-score. We adopt Precision, Recall, IoU, and F1-score averaged over the foreground and background classes (mPre, mRecall, mIoU, and mF1), along with OA, to evaluate the road extraction performance of our proposed model. The formulas for OA, Precision, Recall, IoU, and F1-score are as follows.
$$\mathrm{OA} = \frac{TP + TN}{TP + TN + FP + FN}$$
$$\mathrm{Precision} = \frac{TP}{TP + FP}$$
$$\mathrm{Recall} = \frac{TP}{TP + FN}$$
$$\mathrm{IoU} = \frac{TP}{TP + FP + FN}$$
$$\mathrm{F1} = \frac{2 \times \mathrm{Precision} \times \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}}$$
where
TP = the number of pixels correctly predicted as roads
FP = the number of pixels incorrectly predicted as roads
TN = the number of pixels correctly predicted as background
FN = the number of road pixels missed by the prediction
OA = the proportion of correctly predicted pixels to the total number of pixels
Precision = the proportion of correctly predicted positive samples among all samples predicted as positive
Recall = the ratio of correctly predicted positive samples to the total number of actual positive samples
F1-score = the harmonic mean of precision and recall
IoU = the ratio of the intersection of the actual region and the predicted region to the union of the actual region and the predicted region
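A minimal sketch of how these metrics could be computed from the confusion counts is given below. The function names are hypothetical, and the class averaging (scoring the background by swapping the roles of the counts, then taking the mean of the two per-class scores) is an assumption about how the averaged metrics are formed.

```python
def binary_metrics(tp, fp, tn, fn):
    """Precision, recall, IoU, and F1 for the class counted as positive."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    iou = tp / (tp + fp + fn) if tp + fp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return {"precision": precision, "recall": recall, "iou": iou, "f1": f1}


def mean_metrics(tp, fp, tn, fn):
    """OA plus foreground/background averages of the per-class metrics."""
    foreground = binary_metrics(tp, fp, tn, fn)
    # Treat background as the positive class: TN becomes TP, FN becomes FP, etc.
    background = binary_metrics(tn, fn, tp, fp)
    total = tp + fp + tn + fn
    result = {"OA": (tp + tn) / total if total else 0.0}
    for key, name in (("precision", "mPre"), ("recall", "mRecall"),
                      ("iou", "mIoU"), ("f1", "mF1")):
        result[name] = (foreground[key] + background[key]) / 2.0
    return result


if __name__ == "__main__":
    print(mean_metrics(tp=80, fp=20, tn=880, fn=20))
```

Because road pixels are usually a small fraction of an image, OA alone can look high even when roads are poorly extracted, which is one reason to report the class-averaged metrics alongside it.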

Comparative Analysis of other Modules
To validate the feasibility of the model proposed in this paper for road extraction, we compared it with five other state-of-the-art models (PSPNet50 (Zhao et al., 2017), TransUNet (Chen et al., 2021), DeepLabV3 (Chen et al., 2017), D-LinkNet (Zhou et al., 2018), and CoANet (Mei et al., 2021)) on the Aerial Image Segmentation Dataset. The accuracy results of the comparison are shown in Table 3. The results in Table 3 indicate that our proposed model outperforms the others on all five metrics: OA, mPre, mRecall, mIoU, and mF1. Among the six networks evaluated, CoANet exhibited the second-best performance in identifying roads, with superior performance in other metrics. DeepLabV3 outperformed others in five metrics. However, TransUNet, PSPNet50, and D-LinkNet show average performance across the OA, mPre, mRecall, mIoU, and mF1 metrics. TransUNet has the lowest mPre, mRecall, and mF1, while D-LinkNet has the lowest OA and mIoU.
The accuracy comparison results in Table 4 demonstrate that our proposed method achieves the highest scores on the metrics OA, mPre, mRecall, mIoU, and mF1. The OA of the proposed model is 0.96% higher than that of Model A, mPre is 0.59% higher, mRecall is 0.67% higher, mIoU is 1.32% higher, and mF1 is 0.63% higher.
To more intuitively assess the feasibility of the integrated feature module, the heat map of the feature map was visualized, with the results displayed in Fig. 3. The heat map in Fig. 3 demonstrates that after applying the feature fusion module, it becomes possible to distinguish between the features of roads and buildings, resulting in more accurate feature extraction. This indicates the effectiveness of the feature fusion module we utilized. However, there are still areas for improvement, particularly in extracting features at the edges and in regions obscured by trees and shadows.
In Fig. 4, the ROC curve plot has the true positive rate on the y-axis and the false positive rate on the x-axis. The closer the curve approaches the upper-left corner, the better the performance of the model. The AUC value represents the area under the ROC curve, with a larger AUC indicating better model performance. It can be observed from the graph that our proposed model performs the best. The comparative results of various models in Fig. 5 demonstrate that our proposed method achieves the best extraction performance in shadow areas, with fewer instances of misextraction and strong performance in road connectivity. However, its performance in edge processing remains suboptimal.
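The ROC and AUC quantities discussed for Fig. 4 can be illustrated with the following minimal sketch. It sweeps a threshold over per-pixel scores and integrates the curve with the trapezoidal rule. The function names and the toy labels and scores are illustrative assumptions, not values from the experiments.

```python
def roc_curve(labels, scores):
    """Return (fpr, tpr) lists; labels are 0/1 and both classes must be present."""
    positives = sum(labels)
    negatives = len(labels) - positives
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    fpr, tpr = [0.0], [0.0]
    tp = fp = 0
    previous = None
    for i in order:
        # Emit a curve point each time the threshold moves to a new score value.
        if previous is not None and scores[i] != previous:
            fpr.append(fp / negatives)
            tpr.append(tp / positives)
        if labels[i] == 1:
            tp += 1
        else:
            fp += 1
        previous = scores[i]
    fpr.append(fp / negatives)
    tpr.append(tp / positives)
    return fpr, tpr


def auc(fpr, tpr):
    """Area under the curve by the trapezoidal rule."""
    area = 0.0
    for k in range(1, len(fpr)):
        area += (fpr[k] - fpr[k - 1]) * (tpr[k] + tpr[k - 1]) / 2.0
    return area


if __name__ == "__main__":
    fpr, tpr = roc_curve([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8])
    print(auc(fpr, tpr))
```

A model that ranks every road pixel above every background pixel reaches an AUC of 1, while random scores give a value near 0.5.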

Conclusion
In this paper, we propose a neural network model with a dual encoder-decoder architecture. By employing a dual-channel framework, we conduct separate learning of road and building features. We incorporate the CBAM attention mechanism with element-wise subtraction to amplify features from different channels, extracting crucial information from distinct spatial locations of roads and buildings. This facilitates feature fusion to distinguish between roads and buildings that are prone to confusion in images. The experimental results indicate that our proposed method shows improvement compared to other state-of-the-art network models in distinguishing between complex roads and buildings, with better road extraction performance.

Figure 3. Heat maps of each layer of model encoder and fusion features.

Figure 4. ROC curves and AUC values of 5 models.

Figure 5. Comparison of model results.
Despite optimizing the model's performance, shortcomings still exist. Results from the Aerial Image Segmentation Dataset indicate subpar performance in extracting road edges. Additionally, the model is relatively large, resulting in longer training times. Therefore, in future research, our objective is to address road edge extraction issues and modify the model into a lightweight version to achieve more efficient, rapid, and accurate road extraction.
Table 1 lists the training platform parameters, and Table 2 lists the model parameter settings for training. The poly learning rate decay strategy used in Table 2 is as shown in the formula.
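For reference, the poly strategy is commonly written as lr = base_lr × (1 − iter / max_iter)^power. The sketch below assumes that form. The default `power=0.9` and the example learning rate are hypothetical placeholders, not the settings from Table 2.

```python
def poly_lr(base_lr, iteration, max_iterations, power=0.9):
    """Assumed poly schedule: base_lr * (1 - iteration / max_iterations) ** power."""
    if not 0 <= iteration <= max_iterations:
        raise ValueError("iteration must lie in [0, max_iterations]")
    return base_lr * (1.0 - iteration / max_iterations) ** power


if __name__ == "__main__":
    # Learning rate at the start, middle, and end of a 100-iteration run.
    for step in (0, 50, 100):
        print(step, poly_lr(0.01, step, 100))
```

The rate starts at `base_lr`, decreases monotonically, and reaches zero at the final iteration. A larger `power` makes the early decay steeper.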

Table 3. Comparison with five other state-of-the-art models.

Table 4. Comparison of CBAM ablation experiments.
To further validate the feasibility of CBAM in road extraction, ablation experiments were conducted, and the accuracy results are shown in Table 4. In Table 4, Model A inputs only the two feature maps used for feature fusion into CBAM, whereas our proposed model inputs the four feature maps obtained from the first encoder and the feature map obtained from the second encoder-decoder used for model fusion into CBAM.