Deep Convolutional Network Based on Attention Mechanism for Matching Optical and SAR Images

Complex geometric distortions and nonlinear radiation differences between optical and synthetic aperture radar (SAR) images present challenges for the matching of sufficient and evenly distributed corresponding points. To address this problem, this paper proposes a deep convolutional network based on an attention mechanism for matching optical and SAR images. In order to obtain robust feature points, we employ phase consistency instead of image intensity and gradient information for feature detection. A deep convolutional network (DCN) is designed to extract high-level semantic features between optical and SAR images, providing robustness to geometric distortion and nonlinear radiation changes. Notably, incorporating multiple inverted residual structures in the DCN facilitates efficient extraction of local and global features, promoting feature reuse, and reducing the loss of key features. Furthermore, a dense feature fusion module based on coordinate attention is designed, focusing on the spatial positional information of effective features, integrating key features into deep descriptors to enhance the robustness of deep descriptors to nonlinear radiometric differences. A coarse-to-fine strategy is then employed to enhance accuracy by eliminating mismatches. Experimental results demonstrate that the proposed network performs better than the manually designed descriptors-based methods and the state-of-the-art deep learning networks in both matching effectiveness and accuracy. Specifically, the number of matches achieved is approximately 2 times greater than that of other methods, with a 10% improvement in F-measure.


Introduction
Synthetic Aperture Radar (SAR) is an active imaging sensor that utilizes microwaves to observe Earth targets. It possesses key characteristics such as all-weather capability, wide-area coverage, and strong penetrative ability, making it suitable for high-resolution Earth observation applications. The joint interpretation of optical images and SAR images is widely applied in image fusion (Yan and Kong, 2020), image registration (Quan et al., 2022), change detection (Liu et al., 2019), 3D reconstruction (Zhang et al., 2022a), and other fields, with optical and SAR image matching being one of the key technologies for these applications. However, significant nonlinear radiometric differences and geometric distortions exist between optical images and SAR images, along with speckle noise present in SAR images, making optical and SAR image matching challenging. In recent years, many researchers have proposed various matching methods for optical and SAR images, mainly divided into three categories: area-based matching, feature-based matching, and deep learning-based matching.
Among area-based matching methods, the most commonly used approaches are normalized cross-correlation and mutual information. These two methods directly utilize intensity information from images for computation. Ye et al. (Ye et al., 2017) employed phase consistency information instead of image intensity information for matching, proposing the histogram of oriented phase congruency (HOPC) algorithm, which effectively counters nonlinear radiometric differences and enhances matching performance. Li et al. (Li et al., 2020b) calculated multi-directional phase feature maps based on detected feature points and used them to construct descriptors. Building upon HOPC, Fan et al. (Fan et al., 2021) proposed angle-weighted orientation gradient descriptors, distributing gradient values to the two most correlated directions and employing three-dimensional phase information as a similarity measure, significantly improving matching performance. Despite the high accuracy of area-based matching methods, they suffer from high computational complexity and poor robustness to illumination changes and nonlinear radiometric differences.
Feature-based matching is a commonly used method in the field of image matching. It mainly consists of three steps: keypoint detection, descriptor construction, and feature matching, with the most famous example being the scale-invariant feature transform (SIFT) algorithm (Yoo and Han, 2009). To meet the matching requirements of optical and SAR images, Dellinger et al. (Dellinger et al., 2015) proposed the SAR-SIFT method. This method uses directional gradients to construct descriptors that are robust to speckle noise, and then combines them with SIFT. Xiang et al. (Xiang et al., 2018) improved the method for detecting feature points in SAR images and constructed new descriptors for image registration. Li et al. (Li et al., 2020a) used phase consistency instead of image intensity for feature point detection and proposed the rotation invariant feature transform (RIFT) algorithm for multi-modal image matching. Feature-based matching has been widely applied and has made significant contributions. However, due to significant nonlinear radiometric differences between optical and SAR images, manually designed features have poor robustness, and highly repeatable features are difficult to obtain.
In the past decade, deep learning-based matching methods have garnered widespread attention in multi-modal image matching tasks due to their excellent generalization capabilities. Han et al. (Xufeng et al., 2015) introduced deep learning into image matching and proposed MatchNet, which adopts a dual-branch network structure to extract features from image patches and calculate feature similarity to obtain matching points. Merkle et al. (Merkle et al., 2017) proposed a siamese deep neural network (DNN) for extracting deep features from optical and SAR images for image matching, where expanded convolutional layers are used to increase the receptive field and enhance image features. Li et al. (Li et al., 2022) utilized the feature learning network SARPointNet to obtain feature points and descriptors of images, improving matching performance. Zhang et al. (Zhang et al., 2022b) proposed the optical and SAR Image Matching Network (OSMNet), which adopts a multi-level feature fusion network architecture combined with a channel attention mechanism to extract better features from optical images for feature matching. Existing deep learning methods have three limitations. First, their network layers are shallow, making it difficult to capture higher-level semantic features of images. Second, existing deep convolutional neural networks extract a multitude of features, which often contain noise and outliers and lack robustness when facing significant nonlinear radiometric differences. Third, the network models pay little attention to the spatial positional information of features. To address these issues, this paper proposes a DCN based on an attention mechanism for optical and SAR image matching. The main contributions of this work are as follows: 1.

Method
The method proposed in this paper comprises three parts: utilizing phase consistency (PC) instead of image intensity information for feature detection; constructing deep descriptors using a deep convolutional neural network based on the detected feature points; and obtaining initial matching points through nearest neighbor matching, then employing dynamic adaptive thresholding and the Random Sample Consensus (RANSAC) algorithm to remove mismatched points and obtain the final matching points. The flowchart of the proposed method is illustrated in Figure 1.

PC Detection
Classical image matching methods generally rely on the intensity and gradient information of images, which belong to spatial domain information. Apart from spatial domain information, frequency domain information (such as phase information) can also be employed to describe images. Oppenheim and Lim (Quan et al., 2023) first revealed the importance of phase information in preserving image features, as phase information exhibits robustness to changes in image contrast, illumination, scale, and so on. Therefore, this paper adopts PC instead of image intensity and gradient information (Li et al., 2020a), and then utilizes the FAST algorithm for feature point detection, which ensures a sufficient number of feature points, enhances their repeatability, and provides robustness to nonlinear radiometric differences. The formula for phase consistency calculation is as follows: [equation not preserved in the source]. For the phase deviation function, the formula is: [equation not preserved in the source].

The MobileNetV2 network is an image classification network that cannot be directly used for extracting deep descriptors from images. In this paper, the network model is improved by replacing the last two layers of the backbone network with a dense feature fusion module. The proposed network model is illustrated in Figure 2, consisting of convolutional layers, inverted residual structures, and dense feature fusion modules, with detailed information provided in Table 1.

To integrate the dense features extracted by the backbone network into the deep descriptors, we designed a dense feature fusion module based on coordinate attention (DFFCA), as shown in Figure 3. The coordinate attention (CA) mechanism (Hou et al., 2021), with almost no additional computational cost, effectively integrates spatial positional information into dense features, emphasizing key features and suppressing the contribution of non-significant features. The DFFCA effectively incorporates key features into deep descriptors, thereby improving their matching accuracy and robustness.
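The coordinate attention step described above can be illustrated with a minimal sketch in plain Python, following the general structure of Hou et al. (2021): directional pooling, a shared 1×1 transform, and two sigmoid gates. The function name, the weight arguments `w_reduce`, `w_h`, `w_w`, the use of ReLU, and the omission of batch normalization are all assumptions made for illustration; this is not the paper's DFFCA module.

```python
import math


def coordinate_attention(x, w_reduce, w_h, w_w):
    """Apply a simplified coordinate attention to a feature map.

    x        : feature map as nested lists with shape [C][H][W]
    w_reduce : [M][C] weights of the shared 1x1 transform (hypothetical)
    w_h, w_w : [C][M] weights of the 1x1 transforms that produce the
               row gate and the column gate (hypothetical)
    """
    C, H, W = len(x), len(x[0]), len(x[0][0])
    M = len(w_reduce)

    # Directional average pooling: one value per row and one per column.
    pooled_h = [[sum(x[c][i]) / W for i in range(H)] for c in range(C)]
    pooled_w = [[sum(x[c][i][j] for i in range(H)) / H for j in range(W)]
                for c in range(C)]

    # Concatenate along the spatial axis, giving shape [C][H + W].
    z = [pooled_h[c] + pooled_w[c] for c in range(C)]

    # Shared 1x1 transform with ReLU (batch normalization omitted).
    f = [[max(0.0, sum(w_reduce[m][c] * z[c][k] for c in range(C)))
          for k in range(H + W)] for m in range(M)]
    f_h = [row[:H] for row in f]
    f_w = [row[H:] for row in f]

    def sigmoid(v):
        return 1.0 / (1.0 + math.exp(-v))

    # Per-channel gates along the height and width directions.
    g_h = [[sigmoid(sum(w_h[c][m] * f_h[m][i] for m in range(M)))
            for i in range(H)] for c in range(C)]
    g_w = [[sigmoid(sum(w_w[c][m] * f_w[m][j] for m in range(M)))
            for j in range(W)] for c in range(C)]

    # Reweight every position by its row gate and its column gate.
    return [[[x[c][i][j] * g_h[c][i] * g_w[c][j] for j in range(W)]
             for i in range(H)] for c in range(C)]
```

Because each gate depends on a row index or a column index, the reweighting carries positional information that a purely channel-wise attention would discard, which is the property the text attributes to CA.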

Loss Function
To train the network, the loss function adopts the "hardest example" mining strategy. Mishchuk et al. (Mishchuk et al., 2017) proposed the HardNet loss function, which requires that the distance between each row and each column with the correct match be minimized. For each positive sample, n negative samples are generated.
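The hardest-example mining idea can be sketched in plain Python as follows. The sketch assumes the HardNet formulation of Mishchuk et al. (2017) with a margin of 1 and Euclidean distance; the function names and the list-based descriptor format are hypothetical, and a real training loop would compute this on batched tensors.

```python
import math


def l2(a, b):
    """Euclidean distance between two descriptors given as lists."""
    return math.sqrt(sum((p - q) ** 2 for p, q in zip(a, b)))


def hardest_negative_loss(optical, sar, margin=1.0):
    """Triplet margin loss with hardest-in-batch negatives.

    optical[i] and sar[i] are assumed to be a matching descriptor pair.
    The batch must contain at least two pairs.
    """
    n = len(optical)
    # Pairwise distance matrix: rows are optical, columns are SAR.
    dist = [[l2(o, s) for s in sar] for o in optical]

    total = 0.0
    for i in range(n):
        positive = dist[i][i]
        # Closest non-matching SAR descriptor in row i.
        row_negative = min(dist[i][j] for j in range(n) if j != i)
        # Closest non-matching optical descriptor in column i.
        col_negative = min(dist[k][i] for k in range(n) if k != i)
        hardest = min(row_negative, col_negative)
        total += max(0.0, margin + positive - hardest)
    return total / n
```

The loss is zero once every matching pair is closer than its hardest negative by at least the margin, so gradients come only from pairs that are still confusable.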
Table 1. Details of the proposed network.
In the table, c represents the number of channels of the output feature map, n represents the number of repetitions of the inverted residual module (IRM), and s represents the stride of the first layer of each sequence; all other layers use a stride of 1.

Filter Errors
This paper employs a coarse-to-fine strategy to remove mismatched points. Traditionally, to assess the quality of the i-th pair of matches, a fixed multiplier factor t is commonly used as a threshold. When

$$d_i < t \cdot d'_i$$

it is considered that this pair of matches has good quality. Due to significant differences between optical and SAR images, it is challenging to determine the multiplier factor t. In this paper, an adaptive threshold constraint is used instead of the multiplier factor, eliminating the need for manual threshold adjustment and demonstrating good adaptability. The average difference between the nearest neighbor point and the second nearest neighbor point is calculated as the basis for judgment, with the calculation formula as follows:

$$\mathrm{avgd} = \frac{1}{N}\sum_{i=1}^{N}\left(d'_i - d_i\right)$$

where
N = number of reference image feature points
$d_i$ = distance of the nearest neighbor point
$d'_i$ = distance of the second nearest neighbor point
avgd = mean distance

When the matching points meet the condition $d_i < d'_i - \mathrm{avgd}$, it is considered that the quality of this matching point is good.
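A minimal sketch of this adaptive screening step is given below. It assumes that the mean is taken over the difference between the second-nearest and nearest descriptor distances, and that a match is kept when its nearest distance is smaller than its second-nearest distance minus that mean; the function name and the list-based inputs are hypothetical.

```python
def adaptive_threshold_filter(nearest, second_nearest):
    """Coarse screening of nearest-neighbor matches.

    nearest[i]        : descriptor distance to the nearest neighbor
    second_nearest[i] : descriptor distance to the second nearest neighbor
    Returns the indices of the retained matches and the mean gap.
    """
    n = len(nearest)
    # Mean gap between second-nearest and nearest distances.
    avgd = sum(d2 - d1 for d1, d2 in zip(nearest, second_nearest)) / n
    # Keep matches whose gap is larger than the mean gap.
    keep = [i for i, (d1, d2) in enumerate(zip(nearest, second_nearest))
            if d1 < d2 - avgd]
    return keep, avgd


if __name__ == "__main__":
    kept, mean_gap = adaptive_threshold_filter(
        [0.2, 0.5, 0.9, 0.3], [1.0, 0.6, 1.0, 0.9])
    print(kept, mean_gap)
```

Because the threshold is derived from the distance statistics of the image pair itself, no multiplier has to be tuned per image pair, which is the adaptability the text refers to.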
After coarse screening with the adaptive threshold, numerous mismatched points are removed and the inlier ratio improves significantly, but some mismatched points remain. This paper uses the RANSAC algorithm to refine the coarse-screened matching points. Due to the complex geometric distortion and significant nonlinear radiation differences between optical and SAR images, using a single geometric model as the estimation model cannot eliminate mismatched points. Therefore, this paper adopts an affine transformation model and a homography matrix as the RANSAC algorithm estimation model, which effectively improves the computational efficiency of RANSAC random sampling and geometric consistency verification, and enhances the robustness of the matching results.
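The refinement step can be illustrated with a small RANSAC sketch in plain Python. It covers only the affine model (the homography model mentioned above is not included), fits the model exactly from three sampled correspondences with Cramer's rule, and uses a 3-pixel inlier threshold as an assumed default. All function names, the iteration count, and the fixed random seed are hypothetical choices for illustration.

```python
import random


def fit_affine(src, dst):
    """Exact affine map from three correspondences, or None if degenerate."""
    (x1, y1), (x2, y2), (x3, y3) = src
    det = x1 * (y2 - y3) - y1 * (x2 - x3) + (x2 * y3 - x3 * y2)
    if abs(det) < 1e-9:  # the three source points are collinear
        return None
    rows = []
    for k in (0, 1):  # solve for the x-row, then the y-row, by Cramer's rule
        u1, u2, u3 = dst[0][k], dst[1][k], dst[2][k]
        a = (u1 * (y2 - y3) - y1 * (u2 - u3) + (u2 * y3 - u3 * y2)) / det
        b = (x1 * (u2 - u3) - u1 * (x2 - x3) + (x2 * u3 - x3 * u2)) / det
        c = (x1 * (y2 * u3 - y3 * u2) - y1 * (x2 * u3 - x3 * u2)
             + u1 * (x2 * y3 - x3 * y2)) / det
        rows.append((a, b, c))
    return rows


def ransac_affine(src, dst, threshold=3.0, iterations=200, seed=0):
    """Return indices of matches consistent with the best affine model."""
    rng = random.Random(seed)
    best_inliers = []
    for _ in range(iterations):
        idx = rng.sample(range(len(src)), 3)
        model = fit_affine([src[i] for i in idx], [dst[i] for i in idx])
        if model is None:
            continue
        (a, b, c), (d, e, f) = model
        inliers = []
        for i, ((x, y), (u, v)) in enumerate(zip(src, dst)):
            # Reprojection error of match i under the candidate model.
            err = ((a * x + b * y + c - u) ** 2
                   + (d * x + e * y + f - v) ** 2) ** 0.5
            if err < threshold:
                inliers.append(i)
        if len(inliers) > len(best_inliers):
            best_inliers = inliers
    return best_inliers
```

In practice the final model would usually be re-estimated by least squares over all inliers; that step is left out here to keep the sketch short.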

Experimental Dataset and Implementation Details
To train the network models, this paper utilizes two publicly available datasets containing optical and SAR imagery. The QXS-SROPT dataset was proposed by Huang et al. (Huang et al., 2021) in 2021, comprising 20,000 images obtained from Gaofen-3 synthetic aperture radar satellite imagery and Google Earth imagery. The SEN1-2 dataset was introduced by Schmitt et al. (Schmitt et al., 2018).

The test data consist of real optical and SAR images obtained by different sensors, as shown in Table 2. They vary in resolution, imaging time, and ground coverage. Due to the presence of complex geometric distortions and significant nonlinear radiometric differences among the images, they are particularly suitable for evaluating the proposed method.
During the training process, the MobileNetV2 backbone network is trained using transfer learning. This paper employs the Adam optimizer for training, with an initial learning rate of 0.001 and a batch size of 256. Training is conducted using a single NVIDIA RTX 4060Ti GPU to improve training efficiency.

Table 2. Test image pairs and their characteristics (columns include Pair and Sensor).

Experiment and Discussion
To evaluate the matching performance of the proposed method, this paper adopts the number of correct matching points (NCM), F-measure, and root mean square error (RMSE) as evaluation metrics. A matching point is considered correct if its error is less than 3 pixels. The F-measure represents the matching performance and is defined as the harmonic mean of precision and recall:

$$F\text{-measure} = \frac{2 \cdot MP \cdot Recall}{MP + Recall}$$

where
MP = accuracy of matching points, the ratio of NCM to total matching points
Recall = recall rate, the ratio of NCM to keypoints

Root mean square error (RMSE) reflects the matching accuracy of matching point pairs, defined as: [equation not preserved in the source].

Figure 4. Matching results of six methods on test data.
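The three metrics can be sketched in plain Python as follows. The F-measure follows the harmonic-mean definition given above. The NCM and RMSE functions assume the common convention of measuring the Euclidean distance between each matched point, mapped into the reference image, and its ground-truth position; that convention, the function names, and the argument layout are assumptions, not taken from the paper.

```python
import math


def point_errors(points, reference_points):
    """Euclidean error of each matched point against its reference."""
    return [math.hypot(x - u, y - v)
            for (x, y), (u, v) in zip(points, reference_points)]


def ncm(points, reference_points, threshold=3.0):
    """Number of matches whose error is below the pixel threshold."""
    return sum(1 for e in point_errors(points, reference_points)
               if e < threshold)


def f_measure(n_correct, n_matches, n_keypoints):
    """Harmonic mean of matching precision (MP) and recall."""
    mp = n_correct / n_matches
    recall = n_correct / n_keypoints
    if mp + recall == 0:
        return 0.0
    return 2 * mp * recall / (mp + recall)


def rmse(points, reference_points):
    """Root mean square of the point errors."""
    errors = point_errors(points, reference_points)
    return math.sqrt(sum(e * e for e in errors) / len(errors))
```

For example, 50 correct matches out of 100 matches and 200 keypoints give MP = 0.5 and Recall = 0.25, so the F-measure is one third.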
Qualitative Comparisons:
Figure 4 illustrates the matching results of the proposed method compared to five other matching algorithms: PSO-SIFT (Ma et al., 2017), SAR-SIFT (Dellinger et al., 2015), RIFT (Li et al., 2020a), CMM-Net (Lan et al., 2021), and MatchosNet (Liao et al., 2022). The figure demonstrates that the proposed method obtains a large number of matching points with a uniform distribution of keypoints.
The proposed method fully utilizes the nonlinear modeling capability of deep learning to extract more robust, stable, and reproducible feature points. Compared to the two traditional manually designed descriptors (PSO-SIFT and SAR-SIFT), the deep learning-based method demonstrates superior performance on the test data. The RIFT algorithm uses phase congruency information for matching, transforming images from the spatial domain to the frequency domain. This method achieves successful matching on all test data, but its NCM is much lower than that of the proposed method. The CMM-Net method extracts advanced semantic information from images, resulting in a higher NCM than that obtained by MatchosNet. MatchosNet builds descriptors based on local information in images, leading to matching failure on significantly different images (like pairs B and F). Qualitative results demonstrate that the proposed method not only achieves a higher NCM but also exhibits a more even distribution of matching points. This is attributed to the use of PC for feature point detection in the proposed method, which remains invariant to nonlinear radiometric differences. The network simultaneously considers both low-level and high-level semantic information of images. The designed DFFCA focuses on the spatial positional information of features, integrating key features into deep descriptors and thereby enhancing descriptor performance. Consequently, both the quantity and the distribution of correct matches surpass those of other methods.

Quantitative Comparisons:
In quantitative experiments, this paper employs three evaluation metrics to analyze the matching performance of each method. The threshold for NCM and RMSE is set to 3 pixels. The experimental results are presented in Table 3, where each value is the average of 10 trials and the best results are highlighted in bold.
The results indicate that the matching performance of the proposed method is significantly better than that of other methods. Both the PSO-SIFT and SAR-SIFT algorithms exhibit poor matching performance on the test data. Manually designed descriptors are limited to low-level semantic information of images and lack robustness against significant nonlinear radiometric differences. RIFT utilizes PC to construct descriptors, which have a certain invariance to nonlinear radiometric differences. However, the NCM of RIFT is much lower than that of the method proposed in this paper. Qualitative experiments show that the distribution of matching points produced by the RIFT algorithm is uneven, as seen for instance in image pairs A and C. CMM-Net is a method for matching heterogeneous images. Unlike traditional methods, feature point detection in CMM-Net is conducted after feature description. Both feature point detection and description in CMM-Net are performed on feature maps, extracting features containing high-level semantic information, which are more suitable for matching heterogeneous images. However, its drawback lies in the poor localization accuracy of matching points. MatchosNet is specifically designed for optical and SAR image matching. MatchosNet fails to match in pairs B and D. MatchosNet utilizes DoG for feature point detection, which lacks robustness against nonlinear radiometric differences. Additionally, the construction of descriptors in MatchosNet does not fully consider high-level semantic information of images, resulting in poor matching performance.
As shown in Table 3, the proposed method achieves the highest NCM on the test images. The F-measure of the proposed method is also significantly higher than that of other methods, giving the best overall matching results. This indicates that the proposed method can extract robust matching points from optical and SAR images. Additionally, the RMSE on the test data is less than 2 pixels for the proposed method. The proposed method utilizes DFFCA to integrate key features containing high-level semantic information into deep descriptors. The network constructs deep descriptors with strong robustness against nonlinear radiometric differences. The proposed method achieves good matching accuracy on optical and SAR images with noise interference and significant nonlinear radiometric differences.

Ablation Experiment
To further validate the contribution of the attention mechanism to the matching task, this paper conducts ablation experiments to test the impact of the coordinate attention mechanism on matching performance. The network models are trained on the same dataset, and 125 pairs of images are selected from the SAR2Opt (Zhao et al., 2022) dataset for testing. The image pairs in this dataset are 600×600 pixels and were not used during training. The experimental results are shown in Figure 5.
The proposed method adds a CA module to the network, as shown in Figure 4. After adding CA, both the NCM and the accuracy of the matching points improve further, enhancing the matching performance of the algorithm. The CA module pays more attention to the spatial positional information of effective features, integrating key features into attention maps and suppressing the expression of non-key features. The designed DFFCA integrates attention maps containing key features into deep descriptors, improving the robustness and stability of descriptors to nonlinear radiometric differences.

Conclusion
Due to complex geometric distortions and significant nonlinear radiometric differences between optical and SAR images, traditional matching methods have difficulty obtaining a sufficient and uniformly distributed set of matching points. To tackle this issue, this paper proposes a DCN based on attention mechanisms for matching optical and SAR images.
Experimental results validate the accuracy and robustness of the proposed method. Compared with five other methods, the proposed method achieves accurate and stable matching in different scenarios and outperforms them. Firstly, the paper uses PC instead of the intensity information of images for feature detection, obtaining feature points invariant to nonlinear radiometric differences. Secondly, a DCN is employed to extract both local and global semantic information from images. The network effectively reduces feature loss and achieves feature reuse using an IRM structure. The DFFCA is designed to pay more attention to the spatial positional information of effective features, merging key features from dense features into deep descriptors. The constructed deep descriptors exhibit robustness to nonlinear radiometric differences. Finally, adaptive thresholds and RANSAC are utilized to improve the quantity and accuracy of matching points.
Meanwhile, the ablation experiment results confirm the contribution of the CA structure in the proposed method to optical and SAR image matching. The use of the CA structure improves the NCM and the matching accuracy. Future work will focus on improving the performance of the network model in constructing deep descriptors, further enhancing the model's generalization ability.
In the phase consistency formula, the truncation operator indicates that the quantity contained is equal to itself when its value is positive, otherwise it is zero.
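The truncation described here matches the $\lfloor\cdot\rfloor$ operator in Kovesi's phase congruency measure, which is the formulation used by RIFT (Li et al., 2020a). Assuming the PC Detection step follows that formulation, the measure and its phase deviation function can be written as below. The symbols are taken from that formulation and are an assumption, not a recovery of the paper's own notation.

```latex
PC(x, y) =
  \frac{\sum_{s}\sum_{o} w_{o}(x, y)\,
        \lfloor A_{so}(x, y)\,\Delta\Phi_{so}(x, y) - T \rfloor}
       {\sum_{s}\sum_{o} A_{so}(x, y) + \xi}

\Delta\Phi_{so}(x, y) =
  \cos\bigl(\phi_{so}(x, y) - \bar{\phi}(x, y)\bigr)
  - \bigl|\sin\bigl(\phi_{so}(x, y) - \bar{\phi}(x, y)\bigr)\bigr|
```

Here $s$ and $o$ index the scale and orientation of the log-Gabor filters, $A_{so}$ and $\phi_{so}$ are the local amplitude and phase, $\bar{\phi}$ is the mean phase, $w_o$ is a frequency-spread weighting, $T$ is a noise threshold, and $\xi$ is a small constant that avoids division by zero.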

Figure 1. The framework of the proposed method.
Figure 2. Structure of the proposed network.

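Among the n negative samples generated for each positive sample, the one with the smallest distance to the correct match is selected for optimization of the model. Here $o_i$ represents the i-th deep descriptor of the optical image, and $s_j$ represents the j-th deep descriptor of the SAR image. For each pair of matched deep descriptors $(o_i, s_i)$, the goal of the loss function is to minimize the distance between matching pairs of deep descriptors and to maximize the distance between non-matching deep descriptors. The loss function continuously reduces the distance between matching pairs and increases the distance between non-matching pairs, thus optimizing the network model during the backpropagation process and completing the model training. The formula for the loss function is as follows:

$$L = \frac{1}{n}\sum_{i=1}^{n}\max\left(0,\ 1 + d(o_i, s_i) - \min\left(d(o_i, s_{j_{\min}}),\ d(o_{k_{\min}}, s_i)\right)\right)$$

where $d(\cdot,\cdot)$ is the distance between two descriptors, $s_{j_{\min}}$ is the non-matching SAR descriptor closest to $o_i$, and $o_{k_{\min}}$ is the non-matching optical descriptor closest to $s_i$.

The SEN1-2 dataset (Schmitt et al., 2018) includes 282,384 pairs of SAR and optical image patches from various regions worldwide and all meteorological seasons. For the training process, this paper utilizes the summer subset of the SEN1-2 dataset. The images in both datasets are sized 256×256, and during network training, they are randomly cropped into 224×224 image patches.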