Cascaded framework for earthquake building damage detection combining spatial and frequency domain feature integration

Building collapse is a major cause of casualties after an earthquake, so accurately extracting building damage information is critical for post-earthquake assessment and rescue. Currently, most deep learning methods focus on the end-to-end detection of building collapse. However, in real-world earthquake scenarios, the end-to-end computational process often lacks flexibility and struggles to meet the requirements of rapid emergency response. To address this issue, this paper proposes a cascaded framework that combines pre-earthquake building extraction and post-earthquake building damage classification. The proposed framework includes two sections: (1) Progressive building semantic segmentation model in the joint frequency domain. This model is designed to accurately extract buildings prior to an earthquake, with the goal of minimizing error propagation throughout the cascading process. The model addresses the spatial similarity of buildings under complicated backgrounds, as well as the high internal heterogeneity of buildings, by utilizing frequency domain techniques. It compensates for the shortcomings of traditional models in terms of incomplete information extraction through the effective integration of global and local information. Finally, the model employs edge priors for edge regularization. (2) Rapid building damage classification process. Based on the accurate building extraction results, a fast and efficient classification process is developed. This process uses a simple and lightweight classification network to effectively extract building damage information caused by the earthquake. The superiority of the proposed framework is validated through comparison with traditional cascading architectures and end-to-end models. 
The results show that the cascading framework not only provides accurate pre-earthquake building extraction, but also enables efficient and accurate post-earthquake damage classification, which meets the requirements of rapid post-earthquake emergency response. This balance of accuracy and speed is essential for effective disaster management and recovery.


Introduction
Earthquakes are among the world's most dangerous natural disasters, and building collapse has been identified as one of the most emblematic forms of seismic damage, leading directly to human casualties and significant property loss (Qu et al., 2023). Rapid assessment of earthquake-induced building damage is critical for effective emergency response and pre-rescue operations. The post-earthquake geological environment often presents significant hazards, making on-site investigations impractical. Remote sensing therefore facilitates the rapid, efficient, and safe acquisition of information about post-earthquake building collapse (Xie et al., 2023). Automated and intelligent data mining and analysis increase the speed of disaster response and the efficiency of post-earthquake damage assessment, thereby reducing economic losses (Zhang et al., 2023).
In recent years, many researchers have explored the use of centimeter-level drone data for building damage detection. The ultra-high resolution of these data allows for a more detailed representation of building damage and provides high extraction accuracy. However, drone operators are often unable to reach hard-hit areas immediately after an earthquake, and some locations may be completely inaccessible. In addition, drones are limited in their ability to rapidly cover large areas, making them less effective for extensive data collection in disaster zones. As a result, sub-meter satellite imagery remains critical for rapid assessment of building collapse after an earthquake (Burke et al., 2019).
With the advancement of deep learning in computer vision, extensive applications in remote sensing building damage detection have emerged. Architecturally, building damage detection methods are mainly divided into end-to-end and cascaded frameworks. The end-to-end architecture typically employs Siamese network structures that merge localization and classification tasks while sharing knowledge, and several studies have used Siamese networks to detect building damage (Sun et al., 2022; Chen et al., 2022; Seyed et al., 2024). However, current Siamese networks are difficult to train because of the large amount of data involved, require precise image registration, and lack the flexibility of cascaded networks, which allow for pre-earthquake building localization and rapid post-earthquake classification.
Cascaded architectures predominantly use object-based image analysis (OBIA) for segmentation. Patch-based CNNs integrated with OBIA primarily use superpixel segmentation to generate objects that are non-semantic and have irregular geometric shapes (Zhang et al., 2018). However, building damage assessment operates on semantic, regularly shaped building objects, so these superpixel objects introduce semantic inconsistencies that make traditional OBIA methods poorly suited to the task. The crux is that current OBIA is integrated with deep learning only at the process level and lacks feature-level interaction. Therefore, some researchers have used a fully convolutional network (FCN) for building localization (Gupta et al., 2019) and a patch-based CNN for damage classification (Qing et al., 2022a), but the limited parameterization of these methods fails to accurately represent building features, resulting in suboptimal accuracy in subsequent damage classification. To address these limitations, this paper presents the following innovations:
(1) To address the lack of feature-level knowledge interaction and the multi-target misclassification in OBIA cascade networks, a framework for building collapse detection is proposed, which fuses spatial and frequency domain features. This architecture generates objects with practical significance and refines the minimum unit of collapse detection.
(2) To improve pre-earthquake building extraction and boundary accuracy, an advanced building semantic segmentation model combining spatial and frequency domain features is introduced. It includes the organic integration of global and local features and a building edge regularization module to better align segmentation results with actual building boundaries, thereby reducing error propagation in cascaded structures.
(3) To accurately classify building collapses after earthquakes, a simple and fast extraction method is proposed. The use of buffering strategies effectively reduces classification errors caused by registration issues. In addition, to rapidly and effectively extract inter-channel deformation features, a lightweight, spatial-domain feature-enhanced deformable convolutional neural network is designed.

Method
A cascaded architecture for building collapse detection is proposed, which decouples the task into pre-earthquake building localization and post-earthquake building collapse detection. This method uses domain feature enhancement to facilitate knowledge interaction between the two tasks, enabling more precise detection of building collapse information and making the overall process more flexible. The main workflow is illustrated in Figure 1.
Figure 1. Overview of the building collapse detection framework

Building collapse detection framework
The paper proposes a cascaded building collapse detection framework designed to flexibly extract building damage information after an earthquake. The process includes several key steps:
(1) Collection of pre-earthquake remote sensing images. The framework starts by collecting pre-earthquake remote sensing images of the affected area. To ensure accurate detection of buildings, these images typically require sub-meter spatial resolution.
(2) Effective preprocessing methods. Research suggests that overlapping cropping and sample augmentation are effective preprocessing methods. These techniques prepare images for further processing and analysis.
(3) Building segmentation using the spatial and frequency domain feature-integrated building extraction network (SFFNet). The preprocessed samples are then fed into SFFNet, a neural network designed to segment buildings from remote sensing images, to obtain accurate building segmentation results.
(4) Post-processing through connectivity analysis and regularization (Wei et al., 2020). After segmentation, the framework applies connectivity analysis and regularization to post-process the building detection results. This step refines the segmentation and isolates individual buildings.
(5) Establishment of individual building buffer zones. Using the identified building vector positions, the framework creates buffer zones around individual buildings. These buffer zones are critical for isolating each building and its immediate environment for detailed analysis.
(6) Creation of multi-channel damage detection matrix blocks. The buffer zones are then used to overlay pre-earthquake and post-earthquake images and pre- and post-earthquake Local Binary Pattern (LBP) (Ojala et al., 1994)

The energy percentage method computes the energy of each coefficient in the frequency domain, sums these values to obtain the total energy, and applies a truncation percentage to obtain the target energy. All energy values are then sorted in descending order and their cumulative sum is calculated. When the cumulative sum reaches the target energy, the energy of the coefficient at that point is taken as the threshold for frequency domain truncation. This method is more adaptive than traditional filter design. It is used to truncate high-frequency information by 2%, 4%, 6%, and 8%, as shown in Figure 4.
After feature extraction using the U²-Net unit and the global information unit, the feature information is spread across multiple channels, which makes it difficult to distinguish buildings from background areas. Therefore, convolution operations are used to combine all the channel features into a single channel. For effective fusion of the extracted local and global information, a sigmoid function-guided feature fusion method is proposed. This method can effectively distinguish between buildings and non-buildings, which helps to guide feature selection. The fusion is expressed in Formula (4), where FC represents the fused features, FP represents the local information extracted by U²-Net, FG represents the global features extracted in the frequency domain, and SP represents the probability predicted using the sigmoid function.
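The energy percentage truncation described above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the authors' implementation: the function name `energy_truncate` is hypothetical, the energy of a coefficient is assumed to be its square, and the input is assumed to be a 2D list of frequency coefficients.

```python
def energy_truncate(coeffs, drop_percent):
    """Zero the weakest coefficients holding roughly `drop_percent` % of the energy.

    Assumption: energy of a coefficient = its squared value.
    Returns the truncated coefficient matrix and the energy threshold used.
    """
    # Energy of every coefficient, sorted from strongest to weakest.
    energies = sorted((c * c for row in coeffs for c in row), reverse=True)
    total = sum(energies)
    target = total * (1.0 - drop_percent / 100.0)  # energy to keep

    # Walk the cumulative sum until the target energy is reached.
    cumulative = 0.0
    threshold = 0.0
    for energy in energies:
        cumulative += energy
        if cumulative >= target:
            threshold = energy
            break

    # Keep coefficients at or above the threshold, zero the rest.
    kept = [[c if c * c >= threshold else 0.0 for c in row] for row in coeffs]
    return kept, threshold


if __name__ == "__main__":
    demo = [[10.0, 3.0], [2.0, 1.0]]
    for percent in (2, 4, 6, 8):
        print(percent, energy_truncate(demo, percent))
```

Because the threshold is derived from the sorted energies of each input, the cut-off follows the spectrum of the individual image rather than a fixed filter radius, which is the adaptivity the text refers to.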

Edge control strategy
Considering that buildings, as man-made structures, have distinct geometric features, an edge-prior-based adaptive regularization method for building edges is proposed to address the inaccuracies and irregularities often found in building semantic segmentation results. Deep layers, which focus on abstract semantic features, tend to miss finer details, while the initial shallow layers retain excessive detail, leading to noise. To address this, the planar semantic information of the second layer is converted to edge semantic information, guided by edges extracted from building labels. This improves the network's ability to extract building edge features. When extracting edge information from labels, a broadened label edge strategy is used, because networks are difficult to train with very thin edges. In addition, a weighted loss function is used to balance the samples for edge learning. The edge loss function for the second layer is built from a binary cross-entropy term and a structural similarity term, where Xi represents the result of the i-th layer, Xgt represents the ground truth, Lbce represents the binary cross-entropy loss, and LSSIM represents the structural similarity loss.
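As an illustration of the broadened label edge strategy, the sketch below derives a one-pixel edge map from a binary building label and then widens it. The function names, the 4-neighbour edge definition, and the square dilation window with a `radius` parameter are all assumptions made for this example; the paper does not specify how the edges are broadened.

```python
def label_edges(mask):
    """Mark building pixels that touch background or the image border (4-neighbourhood)."""
    height, width = len(mask), len(mask[0])
    edges = [[0] * width for _ in range(height)]
    for r in range(height):
        for c in range(width):
            if not mask[r][c]:
                continue
            for nr, nc in ((r - 1, c), (r + 1, c), (r, c - 1), (r, c + 1)):
                outside = not (0 <= nr < height and 0 <= nc < width)
                if outside or not mask[nr][nc]:
                    edges[r][c] = 1
                    break
    return edges


def broaden(edges, radius):
    """Widen the edge map by dilating with a (2 * radius + 1) square window."""
    height, width = len(edges), len(edges[0])
    out = [[0] * width for _ in range(height)]
    for r in range(height):
        for c in range(width):
            if edges[r][c]:
                for nr in range(max(0, r - radius), min(height, r + radius + 1)):
                    for nc in range(max(0, c - radius), min(width, c + radius + 1)):
                        out[nr][nc] = 1
    return out


if __name__ == "__main__":
    # A 3 x 3 building centred in a 5 x 5 label image.
    label = [[1 if 1 <= r <= 3 and 1 <= c <= 3 else 0 for c in range(5)] for r in range(5)]
    thin = label_edges(label)
    print(sum(map(sum, thin)), sum(map(sum, broaden(thin, 1))))
```

A wider edge band gives the edge branch more positive pixels to learn from, which is one plausible reason thin edges are harder to train on.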

Post-earthquake building collapse detection
Traditional change detection typically involves three main tasks: (1) detecting the transition of buildings from intact to damaged, (2) assessing whether buildings that were intact before the earthquake remained mostly intact after the earthquake, and (3) detecting irrelevant background. However, the collapse of buildings after an earthquake is often irregular in extent, which may lead to background changes unrelated to building damage, thereby reducing the accuracy of building damage detection.
In response to this problem, this paper takes a novel approach by focusing on individual buildings rather than background changes. By using the pre-earthquake detection results of individual buildings, this paper simplifies the aforementioned tasks into two more specific objectives: (1) identifying buildings that have transitioned from an intact state to a damaged state, and (2) determining which buildings that were intact before the earthquake have remained largely intact. The advantage of this approach is that by focusing on the state changes of individual buildings, it effectively avoids misjudgments caused by changes in the background, thus improving the accuracy of building damage detection. In addition, this method makes the detection tasks more precise and focused, helping to improve the overall performance of change detection.

Figure 5. Schematic diagram of LBP changes
As shown in Figure 5, given the difficulty of perfectly aligning pre-earthquake and post-earthquake images and the distinct contextual features of collapsed buildings, a buffering strategy is used to ensure the integrity of buildings in the samples and to capture more features of collapsed structures. Since the most obvious post-collapse features are building boundaries and texture, Local Binary Pattern (LBP) images of buildings are used as additional bands alongside the pre-earthquake and post-earthquake images to guide the classification network to learn texture features. Notably, the proposed LFnet is simple, fast, and has high classification accuracy. It uses ResNet as the base network, and deformable convolution (Dai et al., 2017) is added to each layer of ResNet to adapt to the changes between channels. The resulting building earthquake damage detection process is shown in Figure 6.
Figure 6. Building collapse damage detection process
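To make the LBP texture band more concrete, the following sketch computes the basic 8-neighbour LBP code for each interior pixel of a grayscale patch. It is an assumption that the basic 3 × 3 operator is the variant used here; the function name `lbp_image`, the clockwise bit order, and the choice to leave border pixels at 0 are illustrative only.

```python
def lbp_image(gray):
    """Basic 3 x 3 Local Binary Pattern codes for a 2D list of gray values.

    Each neighbour that is >= the centre pixel sets one bit of an 8-bit code.
    Border pixels have no full neighbourhood and are left at 0 (an assumption).
    """
    height, width = len(gray), len(gray[0])
    # Clockwise neighbour order starting at the top-left pixel.
    offsets = [(-1, -1), (-1, 0), (-1, 1), (0, 1), (1, 1), (1, 0), (1, -1), (0, -1)]
    codes = [[0] * width for _ in range(height)]
    for y in range(1, height - 1):
        for x in range(1, width - 1):
            centre = gray[y][x]
            code = 0
            for bit, (dy, dx) in enumerate(offsets):
                if gray[y + dy][x + dx] >= centre:
                    code |= 1 << bit
            codes[y][x] = code
    return codes


if __name__ == "__main__":
    flat_roof = [[5, 5, 5], [5, 5, 5], [5, 5, 5]]
    rubble = [[9, 1, 1], [1, 5, 1], [1, 1, 1]]
    print(lbp_image(flat_roof)[1][1], lbp_image(rubble)[1][1])
```

Computing this band for the pre-earthquake and post-earthquake patches of the same building gives the classifier an explicit texture signal in addition to the raw image bands.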

Study area data
The study area is located in Guangjie Town, Yushu City, Qinghai Province, China, as shown in Figure 7. A magnitude 7.1 earthquake occurred here on April 14, 2010, resulting in extensive structural damage, 2,220 deaths, and thousands of injuries. The experimental data are from Google Earth imagery with a spatial resolution of 0.6 meters (panchromatic and multispectral fusion imagery) and an image size of 20,000 pixels × 12,000 pixels. Details of the image data are shown in Table 1.
Table 1. Description of the data used in the study

Evaluation metrics
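The metrics commonly used in semantic segmentation, namely Precision, Recall, F1, and IoU, are adopted:
\mathrm{Precision} = \frac{TP}{TP + FP}
\mathrm{Recall} = \frac{TP}{TP + FN}
F1 = \frac{2 \times \mathrm{Precision} \times \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}}
\mathrm{IoU} = \frac{TP}{TP + FP + FN}
where TP is the number of pixels correctly extracted as buildings, FP is the number of other object pixels extracted as buildings, and FN is the number of building pixels extracted as other objects.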
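The four metrics named above follow directly from the TP, FP, and FN pixel counts. The sketch below uses their standard definitions; the function names and the example counts are hypothetical, and returning 0.0 for an empty denominator is a convention chosen for this example.

```python
def segmentation_metrics(tp, fp, fn):
    """Return (precision, recall, f1, iou) from pixel counts."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = f1_score(precision, recall)
    iou = tp / (tp + fp + fn) if tp + fp + fn else 0.0
    return precision, recall, f1, iou


def f1_score(precision, recall):
    """Harmonic mean of precision and recall."""
    total = precision + recall
    return 2 * precision * recall / total if total else 0.0


if __name__ == "__main__":
    # Hypothetical counts: 80 building pixels found, 20 false alarms, 20 misses.
    print(segmentation_metrics(80, 20, 20))
    # Consistency check on the collapsed-building scores reported in the paper.
    print(round(f1_score(0.9124, 0.9052) * 100, 2))
```

The second call shows that the reported collapsed-building precision of 91.24% and recall of 90.52% are consistent with the reported F1 score of 90.88%.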

Pre-earthquake building extraction result analysis
This section reports experiments conducted on two datasets. To ensure fairness, the experimental datasets were cropped to the same size and the same data augmentation method was used. Our model was thoroughly compared with state-of-the-art methods (Chen et al., 2021; Li et al., 2022; Wang et al., 2022b; Zhou et al., 2022) to demonstrate the segmentation quality and to assess the capabilities of our model.

Experimental detail
In the experiment, the samples were cropped to 384 × 384 pixels. During training, the Adam optimizer was used with an initial learning rate of 1e-4 and otherwise default parameters (betas = (0.9, 0.999), eps = 1e-8, weight decay = 0). The network was trained with a batch size of 4 for about 300 epochs. Training was performed on a platform with an Intel i7-10700 CPU and an NVIDIA RTX 3090 GPU with 24 GB of memory.
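The overlapping cropping used to produce the 384 × 384 samples can be sketched as a computation of tile origins along each image axis. The overlap width is not given in the paper, so the 64-pixel value below is purely an assumption, as are the function names; the last tile is shifted back so that it ends exactly at the image border.

```python
def tile_origins(length, tile, overlap):
    """Start offsets of `tile`-sized windows covering [0, length) with overlap."""
    if overlap >= tile:
        raise ValueError("overlap must be smaller than the tile size")
    if tile >= length:
        return [0]
    stride = tile - overlap
    origins = list(range(0, length - tile + 1, stride))
    # Add a final window flush with the border if the stride did not land there.
    if origins[-1] != length - tile:
        origins.append(length - tile)
    return origins


def tile_boxes(width, height, tile=384, overlap=64):
    """All (x, y, tile, tile) crop windows for an image of the given size."""
    return [(x, y, tile, tile)
            for y in tile_origins(height, tile, overlap)
            for x in tile_origins(width, tile, overlap)]


if __name__ == "__main__":
    # Image size taken from the study area description; overlap is assumed.
    boxes = tile_boxes(20000, 12000)
    print(len(boxes), boxes[0], boxes[-1])
```

Overlap ensures that a building cut by one tile border appears whole in a neighbouring tile, at the cost of processing some pixels more than once.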
In the Yushu dataset, building detection is challenging due to the complex background and dense urban areas with shadows from buildings and trees, as well as significant size variations among buildings, which make small structures difficult to extract. In Figure 8, TP (white) denotes pixels correctly extracted as buildings, FP (blue) denotes pixels of other objects extracted as buildings, and FN (red) denotes building pixels extracted as other objects. The first two rows depict images of urban areas in Yushu City, where buildings are prominent against the city background with orderly arrangements. From the five comparison results, it can be observed that our method achieves higher precision in extracting building edges. The latter three rows show images of rural areas, where building layouts are less organized. This leads to missed and false detections due to minimal differences between buildings and background, as well as difficulty in accurately segmenting tightly spaced buildings. In contrast to methods that rely on spatial-domain deep learning alone, our approach uses the Discrete Cosine Transform (DCT) to transform spatial-domain signals into the frequency domain. This enables the exploration of building features in complex backgrounds, reducing missed and false detections. Furthermore, by selecting high-frequency and low-frequency features, we address the segmentation challenges posed by adjacent buildings.
The accuracy results of the five sets of experiments are shown in Table 2. Although our proposed method did not perform as well as other methods in terms of precision, it achieved a more balanced recall, indicating a better restriction of false extractions and resulting in better F1 and IoU scores. Specifically, compared to MANet, our method improved IoU and F1 by 1.58% and 2.28%, respectively, indicating the effectiveness of our network. Compared to SGCN, the increases were 3.04% and 4.32%, respectively, showing better overall performance. Compared to TransUnet, the increases were 4.01% and 5.66%, and compared to UNetFormer, they were 2.02% and 2.90%, respectively. This suggests that using transformers to find features in the frequency domain is more accurate than direct extraction in the image domain.

Ablation study
In this section, separate ablation studies are performed on two datasets to assess the effectiveness of each critical component of the model. U²-Net is used as the baseline, and additional modules are progressively integrated. We focus on the visualization of the penultimate layer in the decoder, as shown in Figure 9. As the number of components increases, more buildings tend to be identified in the feature map. After integrating the GFIE, but using concatenation fusion instead of the AFF module, there is a noticeable improvement in the detection of small target buildings. However, this approach also results in a higher false positive rate for building detection. The use of the AFF module allows effective control over global and local feature fusion, which helps to reduce some false detections. Finally, the implementation of edge priors helps to refine building boundaries, thereby improving accuracy.
As shown in Table 3, compared to U²-Net, the inclusion of GFIE results in a slight increase in both F1 score and IoU, of 0.76% and 0.32%, respectively. These improvements demonstrate the effectiveness of using the frequency domain to extract global information. With the addition of AFFM, which allows for a more effective fusion of global and local information, the F1 score and IoU increase further, by 1.22% and 1.29%, respectively. Finally, the implementation of edge priors further improves the overall building segmentation accuracy. These metrics show the cumulative benefits of each component in improving the model's performance for building segmentation tasks.

Post-earthquake building extraction result analysis
To validate the effectiveness of the building damage extraction framework proposed in this paper, we selected two cascaded building damage information extraction frameworks (Chen and Liu, 2021; Qing et al., 2022b) and two end-to-end building damage information extraction methods (Caye Daudt et al., 2018; Yan et al., 2022) for comparison.
During training, the Adam optimizer was used with an initial learning rate of 5e-4 and a weight decay of 5e-2. LFnet was trained with a batch size of 32 for about 100 epochs. Training was performed on a platform with an Intel i7-10700 CPU and an NVIDIA RTX 3090 GPU with 24 GB of memory.
Using LFnet to classify building collapses in the Yushu area after the earthquake, the precision for collapsed building detection is 91.24%, the recall is 90.52%, and the F1 score is 90.88%. Figure 10 shows the assessment results of building damage after the earthquake, where red represents collapsed buildings, blue indicates intact buildings, and white represents the background. The first two groups of images are from the Yushu urban area, where relatively few buildings are damaged, while the last three groups are from the urban-rural interface and villages, where more buildings are damaged. The images show that the object-based segmentation performance is poor in the first two sets of experiments. This is mainly because the superpixel segmentation used in these experiments did not incorporate semantic information from the images, resulting in fragmented buildings and inaccurate boundary positioning. As a result, buildings tend to stick together during the post-classification clustering process.
In the first set of experiments, classification was based on representing an area with a single point, and the representativeness of this point is key to classification accuracy. However, a single point lacks semantic relationships with its surroundings and with similarly categorized pixels, and it cannot effectively address issues such as the small variance between background and damaged buildings. Therefore, in areas with more damaged buildings, the number of false detections increases significantly. In the second set of experiments, the classification involved cropping small patches within a spot and using a deep learning network to determine the category of these patches, incorporating some semantic information between categories. However, these small patches also introduce errors, as each patch can contain more than one category (ground, damaged building, intact building). As a result, while the accuracy of this experiment is significantly better than that of the first set, it still does not solve the problems of buildings sticking together and false detections.
The third and fourth sets of experiments use an end-to-end classification method. The main factor affecting the accuracy of end-to-end methods is the accuracy of the building extraction by the main network. If the accuracy of building extraction is low, the results tend to be poor. The accuracy rating, as shown in

Conclusions
In this study, a cascaded architecture for building collapse detection, which integrates spatial and frequency domain target feature knowledge interaction, has been proposed. Based on the experimental results and analysis, the following conclusions can be drawn.
(1) Compared to end-to-end direct damage detection networks, the cascaded framework clarifies the tasks and overcomes the training and transfer difficulties of direct detection networks, resulting in a more flexible overall process. Compared to traditional OBIA cascade networks, our method fully extracts semantic information at each stage, forming feature-level knowledge interaction, generating objects with practical significance, and improving the detection rate of building damage.
(2) Accurate building detection is the foundation of this framework because it reduces error propagation within the framework. The building detection method introduced in this paper shows that the effective combination of spatial and frequency domains can improve the accuracy of building extraction in complex backgrounds. This method addresses the problem of high heterogeneity leading to small inter-class variance and large intra-class variance, and uses adaptive feature selection of global and local information to address the challenge of inferring image content from distant context. Additionally, the edge regularization module further improves the accuracy of boundary detection.
(3) Building on accurate building extraction, a fast and simple method for building collapse classification is proposed. By using a buffering strategy to reduce errors caused by registration, and constructing a multi-channel classification network based on texture priors, rapid detection of building collapses is achieved.
In summary, the newly proposed cascaded building collapse detection framework is a workflow with clear tasks, flexible processes, and high detection accuracy, which can better serve the emergency management domain. Moreover, the building extraction network, as the core method of the workflow, can also be applied to other areas of geoscience, such as the semantic segmentation of other land cover features (roads, farmlands, etc.).
However, due to the limited resolution and vertical field of view of remote sensing, it is not possible to observe the lateral damage of building walls, which leads to the omission of intermediate levels of damage in our damage detection.This is a key issue that requires further research.

Figure 2. In the network, we use U²-Net (Qin et al., 2020) as the backbone and design a progressive space and frequency domain feature fusion block (PSFF Block). Specifically, based on the local detail information provided by spatial features, the Frequency Domain Global Information Extraction Module (FGIE Module) uses a transformer in the frequency domain to obtain global semantic information, and the Adaptive Feature Fusion Module (AFF Module) then performs feature fusion. In particular, by incorporating edge priors in the second layer, we enhance the extraction of regularized edge features of buildings, thereby improving segmentation accuracy and regularizing building boundaries.

Figure 2. Spatial and frequency domain feature-integrated building extraction network
In Formula (3), M and N represent the width and height of the image, F(U,V) is the frequency coefficient in the two-dimensional frequency domain, f(x,y) is the pixel value of the original image in the spatial domain, U and V are the coordinates in the frequency domain, and C(U) and C(V) are the DCT transformation coefficients. After the DCT transformation, the energy percentage method is applied to gradually remove high-frequency information, as shown in Figure 3(c).
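The symbols defined above match the standard two-dimensional DCT, which can be illustrated with a naive implementation. This sketch assumes the orthonormal DCT-II, with C(0) = sqrt(1/size) and C(k) = sqrt(2/size) otherwise; the paper's exact normalization is not recoverable from the text, and the function name `dct2` is hypothetical. The quadruple loop is written for readability, not speed.

```python
import math


def dct2(image):
    """Naive orthonormal 2D DCT-II of a 2D list (N rows = height, M columns = width).

    F(U, V) = C(U) C(V) * sum_x sum_y f(x, y)
              * cos((2x + 1) U pi / 2M) * cos((2y + 1) V pi / 2N)
    """
    n_rows, m_cols = len(image), len(image[0])

    def scale(k, size):
        # Assumed orthonormal scaling factors C(U) and C(V).
        return math.sqrt(1.0 / size) if k == 0 else math.sqrt(2.0 / size)

    coeffs = [[0.0] * m_cols for _ in range(n_rows)]
    for v in range(n_rows):
        for u in range(m_cols):
            total = 0.0
            for y in range(n_rows):
                for x in range(m_cols):
                    total += (image[y][x]
                              * math.cos((2 * x + 1) * u * math.pi / (2 * m_cols))
                              * math.cos((2 * y + 1) * v * math.pi / (2 * n_rows)))
            coeffs[v][u] = scale(u, m_cols) * scale(v, n_rows) * total
    return coeffs


if __name__ == "__main__":
    flat = [[1.0] * 4 for _ in range(4)]
    # A constant patch has all of its energy in the single DC coefficient.
    print(round(dct2(flat)[0][0], 6))
```

With orthonormal scaling the transform preserves total energy, so the sum of squared coefficients equals the sum of squared pixel values, which is what makes an energy-based truncation percentage meaningful.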

Figure 3. Progressive spatial and frequency domain feature fusion module

Figure 7. Study area, Yushu City

Spatial and frequency domain feature-integrated building extraction network
texture features. This overlay creates multi-channel matrix blocks for damage detection, integrating different types of information for each building.
(7) Classification with the lightweight fast network (LFnet). Finally, these multi-channel matrix blocks are fed into LFnet for classification. LFnet classifies the blocks and determines the extent of damage to each building.
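The path from a segmentation mask to per-building multi-channel blocks can be sketched as follows. This is a simplified illustration under several assumptions: buildings are isolated by 4-connected component labelling, the buffer is applied to each building's bounding box rather than to a vector footprint, the margin is an arbitrary number of pixels, and all function names are hypothetical.

```python
from collections import deque


def building_boxes(mask):
    """Bounding boxes (r0, c0, r1, c1) of 4-connected building regions in a binary mask."""
    height, width = len(mask), len(mask[0])
    seen = [[False] * width for _ in range(height)]
    boxes = []
    for r in range(height):
        for c in range(width):
            if not mask[r][c] or seen[r][c]:
                continue
            seen[r][c] = True
            queue = deque([(r, c)])
            r0, c0, r1, c1 = r, c, r, c
            while queue:  # breadth-first flood fill of one building
                y, x = queue.popleft()
                r0, c0, r1, c1 = min(r0, y), min(c0, x), max(r1, y), max(c1, x)
                for ny, nx in ((y - 1, x), (y + 1, x), (y, x - 1), (y, x + 1)):
                    if 0 <= ny < height and 0 <= nx < width and mask[ny][nx] and not seen[ny][nx]:
                        seen[ny][nx] = True
                        queue.append((ny, nx))
            boxes.append((r0, c0, r1, c1))
    return boxes


def buffer_box(box, margin, height, width):
    """Grow a box by `margin` pixels on every side, clipped to the image."""
    r0, c0, r1, c1 = box
    return (max(0, r0 - margin), max(0, c0 - margin),
            min(height - 1, r1 + margin), min(width - 1, c1 + margin))


def damage_block(channels, box):
    """Crop the same buffered window from every channel and stack the crops."""
    r0, c0, r1, c1 = box
    return [[row[c0:c1 + 1] for row in channel[r0:r1 + 1]] for channel in channels]


if __name__ == "__main__":
    mask = [[0, 0, 0, 0, 0, 0],
            [0, 1, 1, 0, 0, 0],
            [0, 1, 1, 0, 0, 1],
            [0, 0, 0, 0, 0, 1],
            [0, 0, 0, 0, 0, 0]]
    for box in building_boxes(mask):
        print(box, buffer_box(box, 1, 5, 6))
```

In practice the `channels` list would hold the pre-earthquake bands, the post-earthquake bands, and the two LBP images, all co-registered to the same pixel grid; the buffer margin then absorbs small registration offsets between the two dates.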
building footprints from complex scenes remains a challenge, mainly due to insufficient feature extraction and inaccurate, irregular building boundary localization in semantic segmentation results. Therefore, in this paper, we propose an edge-prior-based progressive feature fusion network, as shown in

Table 2. Quantitative evaluation of different methods.

Table 4, includes C1 for background, C2 for intact buildings, and C3 for damaged buildings.

Table 4. Quantitative evaluation of building collapse detection results.