A Transformer and Visual Foundation Model-Based Method for Cross-View Remote Sensing Image Retrieval

Retrieving UAV images that lack POS information with georeferenced satellite orthoimagery is challenging because of the large difference in viewing angles. Most existing methods rely on deep neural networks with a large number of parameters, which demands substantial time and financial investment in network training. Consequently, these methods may not be well suited to downstream tasks with high timeliness requirements. In this work, we propose a cross-view remote sensing image retrieval method based on a transformer and a visual foundation model. We investigated the potential of the visual foundation model for extracting common features from cross-view images. Training is conducted only on a small, self-designed retrieval head, alleviating the burden of network training. Specifically, we designed a CVV module to optimize the features extracted from the visual foundation model, making them better suited to cross-view image retrieval, and an MLP head to perform similarity discrimination. The method is verified on a publicly available dataset containing multiple scenes. It shows excellent results in terms of both efficiency and accuracy on 15 sub-datasets (10 or 50 scene categories each) derived from the public dataset, which gives it practical value in engineering applications with streamlined scene categories and constrained computational resources. Furthermore, we conducted ablation experiments on the network design to validate its efficacy. We also analyzed overfitting in the network and discussed the limitations of our study, proposing potential avenues for future improvement.


Introduction
Oblique UAV images have become increasingly pivotal in applications such as urban modeling and scene understanding (Verykokou et al., 2016; Sheppard and Rahnemoonfar, 2017). Determining the geo-location of oblique UAV images accurately is the fundamental basis of these applications. When oblique UAV images lack POS information, we usually retrieve and subsequently register these images with georeferenced satellite orthoimagery. This is a cross-view remote sensing image retrieval task, and it is challenging because of the substantial difference in viewing angle between oblique UAV images and satellite images. Traditional handcrafted feature-based image retrieval methods struggle to capture common features between such cross-view images. Nowadays, most cross-view image retrieval methods are grounded in deep learning. However, many methods aim to enhance retrieval performance by stacking learning modules. This increases the number of model parameters and makes fine-tuning difficult when the scene changes, thereby limiting the practical applicability of these methods in engineering.
Recently, Meta AI Research introduced DinoV2 (Oquab et al., 2023), a large visual foundation model that demonstrates robust generalization and zero-shot transfer capabilities in downstream tasks such as semantic segmentation and depth estimation. The latent features extracted by DinoV2 exhibit exceptionally strong common feature representation capabilities compared with backbones trained for single tasks such as image matching. DinoV2 can thus provide a solid foundation for initializing image features in cross-view remote sensing image retrieval, thereby facilitating accurate model regression. Furthermore, using DinoV2 for cross-view remote sensing image retrieval obviates the need for weight fine-tuning, which avoids extensive training and reduces computational costs. The key to effectively leveraging DinoV2 lies in the design of an effective downstream task head.
Therefore, in this work, we investigate a novel cross-view remote sensing image retrieval method based on the visual foundation model DinoV2 and a transformer-based retrieval head. In summary, our contributions include:
1. We employed zero-shot transfer learning on the backbone network. We introduced the vision foundation model as the backbone for the cross-view image retrieval task and froze its weights, thus circumventing the additional cost of backbone network training.
2. We designed a retrieval head, a small transformer-based network with only a few parameters. This not only alleviated the burden of network training but also enhanced the features from DinoV2, making them better suited to retrieval tasks.
3. We proposed a novel deep learning approach for cross-view image retrieval, which combines contrastive learning, supervised learning, and transfer learning, integrating these recent techniques in the field of deep learning.

Cross-View Remote Sensing Image Retrieval
Cross-view image retrieval is widely used for rough positioning of query images. Most methods for cross-view image retrieval follow a standard data processing pipeline. First, features of query and gallery images are extracted individually, then the similarities between these images are measured, and finally, the gallery image with the highest similarity score is selected as the retrieval result. Some researchers employed handcrafted features for cross-view image characterization. Cheng et al. (2018) used SIFT descriptors for retrieval between cross-view ground images. Luo and Ye (2023) designed the SDS (Segments Direction Statistics) feature pattern and used it in an oblique-view UAV image-based retrieval task. Owing to the significant advantages of neural networks in feature extraction and the continual development of cross-view image datasets like University-1652 (Zheng et al., 2020), cross-view image retrieval research has predominantly relied on deep learning methodologies over the past decade. Zemene et al. (2019) designed a retrieval method for querying a city-wide reference image database with known absolute coordinates, thereby determining the geo-location of the query image. Similarly, Rodrigues and Tani (2022) performed retrieval between ground images and a large geotagged aerial image database. Some methods enhance the retrieval capabilities of the network by improving training strategies or modifying the framework of the model. Zhang et al. (2022) proposed a deep neural network that introduced a spatial scale attention mechanism for cross-view image feature extraction, strengthening the representation of scene spatial information. Lin et al. (2022) presented a feature learning approach based on joint learning, leveraging a single network to acquire discriminative features. They also introduced a key point detection model to emulate human visual perception, thereby enhancing the features' capability to represent key areas. Zeng et al. (2022) designed a peer learning-based parallel retrieval method incorporating two siamese networks. They utilized UAV images as intermediaries between ground images and satellite images to facilitate retrieval between the two views. Hu et al. (2018) developed a global feature generation module to further optimize the local features extracted by the backbone network. Additionally, they introduced a weighted soft margin ranking loss to accelerate model convergence. Some recent studies have opted for transformer-based backbone networks instead of CNNs. For instance, FSRA (Dai et al., 2021) automatically divided the original image into multiple regions based on the heat distribution of the feature map and achieved feature alignment based on region consistency. Zhuang et al. (2022) introduced semantic constraints based on FSRA to enhance the effectiveness of feature alignment. However, a limitation shared by all the above methods is that the training process still involves the backbone network, resulting in increased computational and time overhead for the retrieval task. Therefore, in this study, we adopt a visual foundation model with robust generalization capabilities as the backbone network of our model.
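The standard pipeline described at the start of this section (extract features, score similarities, pick the best gallery image) can be sketched in a few lines. This is a minimal illustration only: the toy 3-dimensional vectors stand in for real descriptors, and the function names and gallery identifiers (`retrieve`, `sat_0001`, and so on) are hypothetical, not taken from any of the cited methods.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length feature vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def retrieve(query_feature, gallery_features):
    """Rank gallery entries by similarity to the query, best first.

    gallery_features maps a (hypothetical) gallery image id to its feature.
    Returns a list of (image_id, score) pairs sorted by descending score.
    """
    scores = [
        (image_id, cosine_similarity(query_feature, feature))
        for image_id, feature in gallery_features.items()
    ]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)


# Toy features standing in for descriptors of three satellite images.
gallery = {
    "sat_0001": [1.0, 0.0, 0.0],
    "sat_0002": [0.0, 1.0, 0.0],
    "sat_0003": [0.7, 0.7, 0.0],
}
query = [0.9, 0.1, 0.0]  # toy feature of a UAV query image

ranking = retrieve(query, gallery)
best_id, best_score = ranking[0]  # the top-1 retrieval result
```

The methods surveyed above differ mainly in how the feature vectors are produced; the ranking step itself is usually this simple.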

Visual Foundation Model
The effectiveness of a neural network lies in the proper initialization of its parameters. In the field of deep learning, this initialization process necessitates a substantial amount of high-quality training data. However, many downstream tasks lack access to such data. Therefore, a common approach for most tasks is to fine-tune foundation models that have been pretrained on large datasets. Vision foundation models are widely used for transfer learning (Zhou et al., 2023). Initially, these models referred to pre-trained weights of backbone networks obtained by training CNNs (for example, ResNet) on general, labeled datasets (such as ImageNet). These pretrained weights were then transferred to downstream tasks to expedite convergence. However, due to the expense of data annotation, the size of these datasets is limited, and the network's generalization performance cannot be guaranteed. Most subsequent research has concentrated on semi-supervised or self-supervised learning methods, as weakly labeled or unlabeled data is generally more accessible.
The visual foundation model typically consists of an encoder and a decoder. Transfer learning generally focuses on fine-tuning the encoder, while task-specific heads are attached for the various downstream tasks. CNNs were previously utilized to construct visual foundation models. Context (Doersch et al., 2015) is a self-supervised learning method that learns the contextual information in the image from randomly sampled patches, thus enhancing the semantic attributes of features. The vision transformer (ViT) has emerged as a prominent research focus in recent years due to its capability to achieve superior training results on large datasets. BEiT (Bao et al., 2021) introduced random patch masking on top of the classic ViT, forcing the network to strengthen the representation ability of latent features. SAM (Kirillov et al., 2023) is a visual foundation model for segmentation tasks. It has achieved extremely robust semantic segmentation performance by training on the SA-1B dataset, which contains 1 billion masks. Although visual foundation models demonstrate strong generalization capabilities and have been widely employed in various downstream tasks, their application in the field of cross-view image retrieval remains unexplored. DinoV2 (Oquab et al., 2023) presently stands as one of the most widely adopted visual foundation models for downstream tasks. It has acquired robust generalization capabilities through training on the extensive LVD-142M dataset and is capable of zero-shot transfer. Therefore, in this work, we designed a cross-view retrieval method based on DinoV2.

Methodology
The overall framework of our model is shown in Fig. 1. We designed a siamese network comprising three parts: 1) The visual foundation model functions as the backbone network, extracting local and global features from UAV and satellite images. 2) The cross-view ViT (CVV) module serves as the feature adaptation module, enhancing the features extracted from the backbone to align with the requirements of the cross-view image retrieval task. 3) The classification head receives two sets of image features and is responsible for identifying geospatial relations. Since we use both supervised and contrastive learning, we also describe our loss function design.

Visual Foundation Model Backbone
DinoV2 consists of both an encoder and a decoder, and performs discriminative self-supervised pre-training on the large LVD-142M dataset, achieving robust zero-shot transfer generalization capabilities. In this work, we transfer the DinoV2 encoder as the backbone network. The original DinoV2 encoder is a ViT model containing 1 billion adjustable parameters, which places high demands on the hardware even for inference. DinoV2 therefore employs unsupervised distillation, where the original model serves as a teacher model and is compressed into three student models of varying sizes to accommodate different downstream application scenarios. Since the patch size is set to 14 in DinoV2, these four models are called ViT-G/14, ViT-L/14, ViT-B/14, and ViT-S/14 respectively.
The generalization performance of the student models is similar to that of the teacher model, despite the significant reduction in the number of parameters. Considering the limited hardware resources in engineering and the real-time requirements of cross-view retrieval tasks, we opt to utilize ViT-S/14 (21M parameters) as our backbone network. Ibrahimovic et al. (2023) asserted that in ViT, the number of patches significantly impacts the model's performance in downstream tasks. Excessive patches can lead to increased computational costs and potential overfitting on the training set. Therefore, we set the number of patches to 16×16. Since the patch size is 14, our image input size is 224×224. The authors of University-1652 noted that the input size has little impact on cross-view image retrieval performance: the difference in Recall@1 between the 224×224 input size setting and the best setting is less than 3%. We therefore expect that setting the input size to 224×224 will not significantly affect model performance, while substantially reducing the computational burden during model training and inference. To inherit all the prior knowledge from DinoV2 and achieve a comprehensive representation of cross-view images, we simultaneously extracted both the global feature (size 1×384) and the local features (size 256×384) of the image, as illustrated in formula (1):

$$(F_g, F_l) = \mathrm{ViT\text{-}S/14}(I) \qquad (1)$$

where $I$ is the input image, $F_g$ is the global feature, and $F_l$ denotes the local features.
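The feature sizes quoted above follow directly from the patch arithmetic: a 224×224 input with 14×14 patches gives a 16×16 grid, that is 256 local tokens, each of dimension 384 for ViT-S/14. The small helper below (a hypothetical name, `vit_token_shapes`, introduced only for illustration) makes the relationship explicit.

```python
def vit_token_shapes(image_size, patch_size, embed_dim):
    """Return (global_shape, local_shape) for a square image fed to a ViT.

    The global feature is a single token; the local features contain one
    token per non-overlapping patch.
    """
    if image_size % patch_size != 0:
        raise ValueError("image_size must be a multiple of patch_size")
    patches_per_side = image_size // patch_size
    num_patches = patches_per_side * patches_per_side
    return (1, embed_dim), (num_patches, embed_dim)


# Settings stated in the text: 224x224 input, patch size 14, 384-dim features.
global_shape, local_shape = vit_token_shapes(
    image_size=224, patch_size=14, embed_dim=384
)
```

With these settings the helper returns a global shape of (1, 384) and a local shape of (256, 384), matching the sizes given in the text.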

Cross-view Feature Adaptation Module
Although DinoV2 has demonstrated strong zero-shot transfer capabilities, conducting downstream tasks directly on the global features extracted by the DinoV2 encoder presents challenges (Lu et al., 2019). Therefore, the latent features obtained through DinoV2 encoding need to be optimized to adapt to cross-view image retrieval tasks. In some research based on visual foundation models, a feasible approach is to add a feature adaptation module after the backbone network to align features with downstream tasks (Houlsby et al., 2019). We adopt this feature optimization technique. Given the exceptionally strong generalization abilities of the features obtained from DinoV2, we freeze the weights of the backbone network during feature optimization. Consequently, we preserve the zero-shot transfer characteristics of DinoV2. Since the encoder of DinoV2 is built on the ViT architecture, our feature adaptation module is also based on ViT to ensure consistency in feature representation. We name it cross-view ViT (CVV), as shown in Fig. 2.
Because the latent features extracted by DinoV2 already possess strong generalization capabilities, a simplified network suffices to optimize the feature space, aligning it with the requirements of the cross-view retrieval task. Furthermore, this design minimizes the number of trainable parameters in the network, thereby effectively reducing computational and time costs in the subsequent training process. A classic ViT network first divides the input image into patches and then converts each patch into an embedding vector through a linear transformation. In CVV, we directly utilize the global and local features from the backbone network as embedding vectors. These embedding vectors, along with position codes, are concatenated into a sequence to generate a new embedding, which is subsequently input into the transformer block. Then, the embedding vectors containing global features, local features and position codes interact and undergo nonlinear transformations through self-attention and feed-forward neural networks. This process facilitates the capture of global contextual information. Finally, the embedding vector at a specific position in the output sequence is sent to the MLP head for a nonlinear transformation to obtain the optimized new features. The above process can be written as formula (2).
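One way to write the CVV forward pass described above is sketched here. It is an illustrative reconstruction under stated assumptions, not necessarily the authors' exact notation: the symbols $F_g$, $F_l^{(j)}$, $E_{pos}$, $z_0$, $z_1$ and $F'$ are introduced for this sketch, the position codes are assumed to be added to the token sequence as in a standard ViT, the transformer block is assumed to use the common pre-norm layout, and the output token passed to the MLP head is assumed to be the one at the global-feature position (index 0).

```latex
\begin{aligned}
z_0  &= \bigl[\,F_g;\ F_l^{(1)};\ \dots;\ F_l^{(256)}\,\bigr] + E_{pos},
        \qquad z_0 \in \mathbb{R}^{257 \times 384}, \\
z_0' &= z_0 + \mathrm{MSA}\bigl(\mathrm{LN}(z_0)\bigr), \\
z_1  &= z_0' + \mathrm{FFN}\bigl(\mathrm{LN}(z_0')\bigr), \\
F'   &= \mathrm{MLP}\bigl(\mathrm{LN}(z_1^{(0)})\bigr),
        \qquad F' \in \mathbb{R}^{1 \times 384}.
\end{aligned}
```

Here MSA denotes multi-head self-attention, FFN the feed-forward network, and LN layer normalization. The sequence length 257 is the one global token plus the 256 local tokens produced by the backbone.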

 
Because cross-view image retrieval essentially boils down to a classification task, our primary objective is to obtain global features from the CVV module. Given the variation in viewing angles, the feature representations of satellite and UAV images may differ, so the CVV modules in the satellite and UAV branches do not share weights.

Classification Head
After obtaining the global features corresponding to the satellite and UAV images, it is necessary to determine the geospatial relation of the two images based on these features to complete the retrieval task. In many studies, discrimination based on cosine similarity or Euclidean distance is the prevalent method, where the group of images with the highest score is selected as the retrieval result. These methods still require additional similarity evaluation operations after neural network feature extraction, so we design an end-to-end method. We simplify the cross-view image retrieval task to a classification task and employ an MLP head to perform similarity evaluation. As the scenes in the training data may differ from those encountered during actual usage, encoding the geographical location of the scene as a classification label proves challenging. Consequently, implementing a cross-view retrieval model based on multi-class classification becomes difficult. Inspired by Zhou et al. (2023), we designed the network as a binary classification model, as shown in Fig. 1. The positive category signifies that the UAV image and the satellite image describe the same geographical space, while the negative category indicates that they were captured at distinct geospatial locations. To achieve feature interaction, we subtract the features extracted from the UAV and satellite images, thus obtaining a new discriminative feature that represents the differences between the two views. The new feature is then fed into the MLP head for spatial relationship discrimination. We utilize a linear layer to compress the feature into a one-dimensional representation, followed by a sigmoid to express the probability of the positive class. This process can be expressed as formula (3). Due to potential internal covariate shift, the distribution of the discriminative features may be inconsistent, making retrieval difficult. To enhance robustness and accelerate convergence during training, we apply layer normalization to the feature before the linear layer, following the design of the MLP module in CVV.
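The head described above (feature subtraction, layer normalization, a linear layer to one dimension, then a sigmoid) can be sketched in plain Python as follows. This is a minimal sketch under assumptions: the layer normalization here has no learnable scale or shift, the weights and bias are toy values rather than trained parameters, and all names (`match_probability`, `weights`, `bias`) are hypothetical.

```python
import math


def layer_norm(x, eps=1e-5):
    """Normalize a vector to zero mean and unit variance (no learnable affine)."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]


def sigmoid(z):
    """Logistic function; adequate for the small logits used in this sketch."""
    return 1.0 / (1.0 + math.exp(-z))


def match_probability(uav_feature, sat_feature, weights, bias):
    """Probability that the two global features describe the same location."""
    # Feature interaction by subtraction, as described in the text.
    diff = [u - s for u, s in zip(uav_feature, sat_feature)]
    # Layer normalization before the linear layer.
    normed = layer_norm(diff)
    # Linear layer compressing the feature to a single logit.
    logit = sum(w * v for w, v in zip(weights, normed)) + bias
    return sigmoid(logit)


# Toy 4-dimensional features and parameters (the real features are 384-dim).
weights = [0.5, -0.25, 0.1, 0.3]
bias = 0.0
uav = [0.2, -0.1, 0.4, 0.0]
sat = [1.0, 0.5, -0.3, 0.2]

p_same = match_probability(uav, uav, weights, bias)  # identical features
p_diff = match_probability(uav, sat, weights, bias)  # differing features
```

With identical inputs the difference vector is all zeros, so the logit reduces to the bias; with a zero bias the untrained head outputs exactly 0.5, which shows that any discriminative power must come from the learned weights acting on the normalized difference.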

Loss Function
Contrastive learning is a widely applied deep learning technique used in siamese networks. It yields a more discriminative feature representation for downstream tasks by learning the consistencies between positive samples and mining the differences among negative samples. However, recent studies predominantly integrate contrastive learning with supervised learning. This is because geospatial location labels are available in the cross-view image retrieval datasets used for training, and numerous studies have shown the performance gains achievable through supervised learning. Hence, our method also combines contrastive learning and supervised learning, with a corresponding loss function for each. Contrastive loss aims to minimize the Euclidean distance between positive samples and to push the Euclidean distance between negative samples towards a fixed value, thereby increasing the separation between positive and negative samples. Since we have designed a binary classification head, we adopt the classic Binary Cross-Entropy Loss (BCELoss) as the loss function for supervised learning.
Our final loss function incorporates both contrastive loss and BCELoss. Balancing model performance and generalization, we set the weight coefficients of both losses to 1. The loss function is shown in formula (4).
where
N = number of training samples
y_i = label of the i-th sample (1 represents positive)
d_i = Euclidean distance of the i-th sample
p_i = predicted probability of the i-th sample
margin = distance threshold hyperparameter

The contrastive loss is employed to optimize the feature space of the UAV and satellite global features obtained by the CVV module, making these features more discriminative and better adapted to retrieval tasks. The BCELoss optimizes the similarity score output by the classification head. The contrastive loss requires setting the margin hyperparameter. Given that the global feature output by the CVV module has size 1×384 and is also layer-normalized, we set the margin to 10.
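The combined loss can be sketched as below using the symbols defined above. The exact form is an assumption: the contrastive term follows the widely used margin formulation, (1/2N) Σ [y_i d_i² + (1 − y_i) max(margin − d_i, 0)²], and the supervised term is the standard binary cross-entropy, −(1/N) Σ [y_i log p_i + (1 − y_i) log(1 − p_i)]. The function names and the 1/2 factor are illustrative choices, not confirmed by the text.

```python
import math


def contrastive_loss(distances, labels, margin=10.0):
    """Margin-based contrastive loss averaged over the batch.

    Positives (label 1) are pulled together; negatives (label 0) are pushed
    apart until their distance reaches the margin.
    """
    total = 0.0
    for d, y in zip(distances, labels):
        total += y * d ** 2 + (1 - y) * max(margin - d, 0.0) ** 2
    return total / (2 * len(distances))


def bce_loss(probs, labels, eps=1e-12):
    """Binary cross-entropy on predicted positive-class probabilities."""
    total = 0.0
    for p, y in zip(probs, labels):
        p = min(max(p, eps), 1.0 - eps)  # clip to avoid log(0)
        total += -(y * math.log(p) + (1 - y) * math.log(1.0 - p))
    return total / len(probs)


def total_loss(distances, probs, labels, margin=10.0, w_con=1.0, w_bce=1.0):
    """Weighted sum of both terms; both weights are 1 as stated in the text."""
    return (w_con * contrastive_loss(distances, labels, margin)
            + w_bce * bce_loss(probs, labels))


# Toy batch: one positive pair and one negative pair.
distances = [2.0, 4.0]   # d_i between UAV and satellite global features
probs = [0.5, 0.5]       # p_i output by the classification head
labels = [1, 0]          # y_i
loss = total_loss(distances, probs, labels)
```

Under this formulation a negative pair stops contributing to the contrastive term once its distance reaches the margin, which matches the description of optimizing negative distances towards a fixed value.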

Dataset:
We chose the University-1652 dataset (Zheng et al., 2020), which is widely recognized as a benchmark for cross-view image retrieval. The dataset comprises images captured from three different viewpoints (UAV, satellite, and ground), each tagged with a four-digit geospatial label. The UAV images are simulated using Google Earth and collected from various scenes. Our experiments used only the UAV and satellite images, as shown in Fig. 3. In the dataset configuration, both the training set and the test set comprise images of 701 distinct buildings. Additionally, images of another 250 buildings are included in the test set as distractors. For each building, there are 54 UAV images from various view angles and one corresponding satellite image. Some UAV images exhibit large oblique angles, which poses a challenge. Since our model is supervised with positive and negative classes, we transformed the two original geospatial labels of each UAV-satellite image pair into a single discriminative label of 1 or 0, where 1 represents a positive sample and 0 a negative sample.
We randomly selected 15 sub-datasets from the test set to evaluate our method, forming two evaluation sets. Ten sub-datasets (evaluation set 1) each contain 10 scenes, while the remaining five sub-datasets (evaluation set 2) contain 50 scenes each. Evaluation set 2 contains more scene categories than evaluation set 1, and its composition is more complex. The scenes in different sub-datasets do not overlap. We devised this setting with a focus on engineering application scenarios: some applications entail few scene categories, while others involve relatively more. Our setting considers both scenarios (10 and 50 scenes), thereby ensuring a comprehensive evaluation of our method. We employed both Recall@k and AP for evaluation. Recall@k indicates whether at least one of the top k retrieval results, ranked by similarity score, is a positive sample. Average Precision (AP), computed as the area under the Precision-Recall curve, is a more comprehensive metric that considers both precision and recall across various thresholds. We focused on Recall@1 and Recall@3 in this work.
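The two metrics can be illustrated for a single query as follows. This is a sketch under assumptions: AP is computed here as the mean of the precision values at each positive rank, a common discrete approximation of the area under the Precision-Recall curve that may differ slightly from the interpolation used by the official benchmark code. The function names are hypothetical.

```python
def rank_by_score(scores, labels):
    """Return the 0/1 labels reordered by descending similarity score."""
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return [labels[i] for i in order]


def recall_at_k(ranked_labels, k):
    """1.0 if at least one of the top-k results is a positive, else 0.0."""
    return 1.0 if any(ranked_labels[:k]) else 0.0


def average_precision(ranked_labels):
    """Mean of precision@rank over the ranks where a positive occurs."""
    hits = 0
    precision_sum = 0.0
    for rank, label in enumerate(ranked_labels, start=1):
        if label == 1:
            hits += 1
            precision_sum += hits / rank
    return precision_sum / hits if hits else 0.0


# Toy query: four gallery candidates with similarity scores and 0/1 labels.
scores = [0.2, 0.9, 0.6, 0.4]
labels = [1, 0, 1, 0]

ranked = rank_by_score(scores, labels)   # [0, 1, 0, 1]
r1 = recall_at_k(ranked, 1)              # top-1 is a negative
r3 = recall_at_k(ranked, 3)              # a positive appears at rank 2
ap = average_precision(ranked)           # (1/2 + 2/4) / 2
```

Dataset-level figures are obtained by averaging these per-query values over all queries. When each query has exactly one positive gallery image, this AP reduces to the reciprocal of the rank at which that image appears.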

Implementation Details:
To balance model performance and training cost, all input images were resized to 224×224. Given that the original image size is 512×512, we used cubic interpolation to preserve details. We normalized the input to minimize the disparity between samples, enhance model stability and generalization, and expedite convergence. Additionally, to further enhance the model's generalization capability, we applied data augmentation to both UAV and satellite images in the training set, including random cropping, padding, and horizontal and vertical flipping. Since we only train the CVV module and the classification head, we froze the DinoV2 backbone network. To minimize additional computational cost, we used ViT-S/14 to pre-extract the global and local features of each image and stored them in memory. We set the batch size to 256 and used the AdamW optimizer for training. The weight decay was set to 0.0005, and the initial learning rate was set to 0.001. We employed a learning rate scheduler that reduced the learning rate to 10% of its previous value every 50 epochs, and training spanned a total of 200 epochs.
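The step schedule described above can be written out explicitly. The sketch below only reproduces the arithmetic of the schedule (initial rate 0.001, multiplied by 0.1 every 50 epochs, 200 epochs in total); the function name `step_lr` is hypothetical and the training framework is not specified in the text.

```python
def step_lr(epoch, base_lr=0.001, step_size=50, gamma=0.1):
    """Learning rate at a given 0-indexed epoch under a step decay schedule."""
    return base_lr * gamma ** (epoch // step_size)


# Learning rate for each of the 200 training epochs.
schedule = [step_lr(epoch) for epoch in range(200)]

# Epochs at which the rate changes: 0, 50, 100, 150.
milestones = [e for e in range(200) if e == 0 or schedule[e] != schedule[e - 1]]
```

The rate therefore takes four values over training, roughly 1e-3, 1e-4, 1e-5 and 1e-6. If the model were trained in PyTorch (an assumption), the same behaviour would correspond to `torch.optim.AdamW` with `lr=1e-3` and `weight_decay=5e-4` combined with `torch.optim.lr_scheduler.StepLR` using `step_size=50` and `gamma=0.1`.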

Experimental Results
We first analyzed evaluation set 1. As shown in Tab. 1, our method attained a Recall@1 exceeding 80% on all ten sub-datasets within evaluation set 1. Notably, five of these sub-datasets exhibited Recall@1 values surpassing 90%, with the average Recall@1 across all ten sub-datasets reaching 89.29%.
Regarding the AP metric, all sub-datasets demonstrated an AP exceeding 80%, with over half of them surpassing 90%. The average AP across all sub-datasets reached 89.86%. Thus, our method achieved strong retrieval performance for cross-view image retrieval across ten different scenes. Our method also demonstrated robustness, with the highest sub-dataset Recall@1 reaching 94.52% and even the lowest reaching 80.89%. The variance in Recall@1 and AP across sub-datasets can be attributed to variations in the distribution of each subset.
In evaluation set 2, only one sub-dataset's Recall@1 fell below 60%, with sub-dataset 14 achieving a Recall@1 of 74.14% and the average Recall@1 across the five sub-datasets reaching 64.50%. For the AP metric, all sub-datasets exhibited AP values exceeding 60%, with the average AP across all sub-datasets reaching 68.67%. Even more notable is the Recall@3 metric: all sub-datasets exhibited Recall@3 values surpassing 80%, with the highest, 92.24%, achieved in sub-dataset 14. We specifically evaluated Recall@3 in this context because of the complex scene composition in evaluation set 2. A high Recall@3 effectively addresses engineering requirements. The greater number of categories within each sub-dataset in evaluation set 2 accentuates the disparity in sample distribution among the sub-datasets. This discrepancy is also evident in the Recall@1 and AP values. The results reflect the random sampling of our sub-datasets and verify the robustness and effectiveness of our method.

Overall, our method showed excellent retrieval performance, effectively meeting the requirements of engineering applications with fewer than 50 scene categories. We also present the performance of several classic methods for cross-view image retrieval on the University-1652 dataset in Tab. 3. A direct comparison between our method and these classic methods may not be appropriate because the datasets used for evaluation differ. Among the listed methods, FSRA demonstrates significantly better performance than the others. This is attributed to its use of a large backbone network (21 million parameters) and the adoption of ViT instead of CNN. These factors enable FSRA to better capture global features, a crucial aspect of cross-view image retrieval. However, the number of trainable parameters of these methods exceeds 20M, while our method comprises only 2.8M trainable parameters. This significant reduction greatly diminishes both the training time and the inference time of the network. Therefore, our method possesses distinct advantages in engineering.

Enhanced feature discrimination with introduced CVV
The CVV module is the cornerstone of our work, as it inherits latent features from the DinoV2 backbone network and adapts them to cross-view image retrieval. Because we calculate the similarity between satellite and UAV images and determine geospatial relations based on the global features extracted by the CVV module, the effectiveness of the CVV module profoundly influences the performance of the network in cross-view image retrieval. It is therefore necessary to evaluate the effectiveness of CVV in improving feature discrimination. Here we compile statistics on the distribution of Euclidean distances between the global features of positive and negative samples. We randomly selected 10,000 sets of cross-view images from the test set and compared the distance distributions before and after CVV adaptation. The Euclidean distance between the positive and negative classes increased after CVV adaptation. The effectiveness of our CVV module, as evidenced by the distributions of positive and negative classes in the two sets of statistics, can be attributed to our incorporation of contrastive techniques. While some overlap between positive and negative samples persists even after CVV adaptation, it is notably reduced. Moreover, our geospatial relation discrimination relies on a neural network and does not depend solely on the Euclidean distance between samples.
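The kind of statistic discussed above can be computed as in the sketch below: Euclidean distances are collected separately for positive and negative pairs and then summarized. The toy 2-dimensional features and the function name `pair_distance_stats` are hypothetical; the paper's own analysis uses 384-dimensional global features and 10,000 sampled pairs.

```python
import math
from statistics import mean


def pair_distance_stats(pairs):
    """Summarize Euclidean distances for positive and negative pairs.

    pairs: iterable of (uav_feature, sat_feature, label), label 1 = same place.
    Returns the mean distance per class and the gap between the two means.
    """
    pairs = list(pairs)  # allow generators to be traversed twice
    pos = [math.dist(a, b) for a, b, y in pairs if y == 1]
    neg = [math.dist(a, b) for a, b, y in pairs if y == 0]
    pos_mean, neg_mean = mean(pos), mean(neg)
    return {"pos_mean": pos_mean, "neg_mean": neg_mean, "gap": neg_mean - pos_mean}


# Toy pairs standing in for (UAV feature, satellite feature, label) samples.
toy_pairs = [
    ([0.0, 0.0], [3.0, 4.0], 0),   # distance 5
    ([0.0, 0.0], [0.0, 1.0], 1),   # distance 1
    ([1.0, 1.0], [1.0, 1.0], 1),   # distance 0
    ([0.0, 0.0], [6.0, 8.0], 0),   # distance 10
]
stats = pair_distance_stats(toy_pairs)
```

Running the same summary on features taken before and after the adaptation module, and comparing the resulting gaps, gives a simple numerical counterpart to the distribution comparison described in the text.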

Ablation of sharing weights in CVV
Although the backbone of our cross-view retrieval model shares weights across its two branches, the CVV modules in the two branches do not share weights during feature adaptation. We hypothesize that the substantial difference in viewing angles leads to distinct feature representations for cross-view images. On the Recall@1 metric, the group that did not share CVV weights demonstrated a 24.19% improvement over the control group that shared CVV weights, while AP increased by 25.30%. Given this clear gap, we opt not to share CVV weights in our method. Although our experimental results do not elucidate the relationship between feature representation and viewing angle, our design significantly enhances performance, as evidenced by the test outcomes.

Analysis of overfitting in the proposed model
Despite the promising performance on the sub-datasets, we found that our method still exhibits shortcomings when evaluated across the entire test set. During training, we observed that after a certain number of epochs, the recall on the validation and test sets ceased to improve, while the recall on the training set had already converged to a very high level. This suggests that our model may be overfitting. A common and effective strategy for mitigating overfitting is to reduce the complexity of the network and opt for a more lightweight architecture. Therefore, we devised a control experiment in which we removed the transformer block from the CVV module, retaining only the MLP head. This reduced the number of trainable parameters in the model from 2.8 million to 0.3 million. We then retrained the modified model and evaluated its performance on sub-datasets 11-15, calculating the average Recall@1 and AP. The results are shown in Tab. 5.


Limitations of the proposed model
Our experiments validate that the visual foundation model is helpful for cross-view retrieval tasks. However, the representation performance of global features on the test set does not match that on the training set, indicating room for further improvement in model generalization. Compared with some of the latest cross-view image retrieval methods, although supervised learning is integrated, our supervision relies on positive and negative classes and discards absolute position labels. This constitutes relatively weak supervision, which may limit the model's ability to learn optimal features. Additionally, researchers have suggested that in contrastive learning, the generalization of contrastive loss might be slightly weaker than that of loss functions such as triplet loss. Hence, for future research, we propose improvements in three main areas: transfer learning, contrastive learning, and supervised learning. Specifically, we aim to enhance the CVV module to extract more generalizable global features, refine the loss function to further optimize the distribution of positive and negative samples in feature space, and integrate absolute position information of cross-view images into supervised learning to reinforce geospatial supervision.

Conclusions
The retrieval of satellite images based on oblique UAV images poses a significant challenge due to the considerable difference in viewing angles between the two types of imagery. To address this challenge, we propose a deep learning method based on a visual foundation model and ViT. Our method integrates transfer learning, contrastive learning, and supervised learning to tackle cross-view image retrieval tasks.
Most deep learning-based cross-view image retrieval methods use a large and complex architecture, leading to considerable costs in training time and computation. This limitation severely restricts the practical application of these methods in engineering tasks. In our work, we explore the potential of the visual foundation model in cross-view image retrieval tasks. By leveraging the prior knowledge acquired by DinoV2 on large-scale datasets, we achieve effective network initialization and maintain the zero-shot transfer capability of DinoV2. Consequently, we only need to train the lightweight feature adaptation module and classification head, significantly reducing the complexity of the cross-view image retrieval model and enhancing the method's feasibility in engineering tasks.
Additionally, we harness ViT's capabilities in capturing global features, rendering our features more suitable for retrieval tasks.
Experiments on public datasets demonstrate that our method excels when the number of scene categories is under 50 and satisfactorily meets the requirements of cross-view image retrieval in engineering applications with streamlined scene categories and constrained computational resources.
Nevertheless, there is still potential for improvement in the generalization ability of our method to non-training data.For future enhancements, we plan to focus on three aspects: feature adaption, loss function of contrastive learning, and supervised techniques.

Figure 2. Architecture of CVV
Our design draws inspiration from existing work utilizing visual foundation models (Cheng et al., 2022), comprising a ViT module and an MLP head. This configuration facilitates the optimization of both global and local features extracted from DinoV2, resulting in the generation of new global features. The ViT module is composed of only one transformer block. This decision is informed by the fact that the latent feature extracted by DinoV2 already possesses strong generalization capabilities. Therefore, a simplified network suffices to optimize the feature space, aligning it with the requirements of the cross-view retrieval task. Furthermore, this design minimizes the number of trainable parameters in the network, thereby reducing computational and time costs in the subsequent training process. A classic ViT network initially divides the input image into image patches, and then converts each image patch into an embedding vector through a linear transformation. In CVV, we directly utilize global features and local features from the backbone network as embedding vectors. These embedding vectors, along with position codes, are concatenated into a sequence to generate a new embedding, which is subsequently input into the transformer block. Then, the embedding vectors containing global features, local features and position codes interact and undergo nonlinear transformations through self-attention mechanisms and feed-forward neural

Figure 3. An example of University-1652
4.1.2 Evaluation Protocol: Recall@k is the most widely used evaluation metric in cross-view image retrieval tasks. Compared to Recall@k, Average Precision (AP) is a more comprehensive evaluation metric that considers both Precision and Recall across various thresholds. Therefore, we employed Recall@k and AP for evaluation. Recall@k indicates whether at least one of the top k retrieval results, ranked by similarity score, corresponds to a positive sample. AP is computed by calculating the area under the Precision-Recall curve. We focused on Recall@1 and Recall@3 in this work.
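The two metrics can be illustrated for a single query with the minimal sketch below. It is not our evaluation code: the function names are hypothetical, and AP is computed with the common discrete approximation of the area under the Precision-Recall curve (the mean of the precision values at the ranks where positives occur), since the exact integration scheme is not specified here.

```python
from typing import List, Sequence


def rank_by_similarity(scores: Sequence[float], labels: Sequence[int]) -> List[int]:
    """Return the labels reordered from highest to lowest similarity score."""
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return [labels[i] for i in order]


def recall_at_k(ranked_labels: Sequence[int], k: int) -> float:
    """1.0 if at least one of the top-k results is a positive sample, else 0.0."""
    return 1.0 if any(ranked_labels[:k]) else 0.0


def average_precision(ranked_labels: Sequence[int]) -> float:
    """Mean of precision at each rank holding a positive (discrete PR-curve area)."""
    hits = 0
    precision_sum = 0.0
    for rank, label in enumerate(ranked_labels, start=1):
        if label:
            hits += 1
            precision_sum += hits / rank
    return precision_sum / hits if hits else 0.0


# Toy gallery of four candidates for one query (values are illustrative only).
scores = [0.2, 0.9, 0.5, 0.7]   # similarity of each candidate to the query
labels = [1, 0, 0, 1]           # 1 = positive sample, 0 = negative sample
ranked = rank_by_similarity(scores, labels)

print(ranked)
print(recall_at_k(ranked, 1), recall_at_k(ranked, 3))
print(average_precision(ranked))
```

Dataset-level Recall@k and AP are then obtained by averaging these per-query values over all queries.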
comprising 5,000 positive examples and 5,000 negative examples. To validate the enhancement achieved by CVV adaptation in the feature space, we assessed the distribution of global features directly from DinoV2 and the global features after CVV adaptation. Results are shown in Fig. 4 and Fig. 5.
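The kind of analysis described above can be sketched as follows: compute the Euclidean distance between the two global features of each pair, separate the distances by pair label, and bin them into a histogram. The helper names, the bin width, and the two-dimensional toy features are assumptions for illustration; real DinoV2 or CVV features have far more dimensions.

```python
import math
from collections import Counter
from statistics import mean


def euclidean(u, v):
    """Euclidean distance between two feature vectors of equal length."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def split_distances(pairs):
    """Split pair distances into (positive, negative) lists.

    `pairs` is an iterable of (uav_feature, satellite_feature, is_positive).
    """
    pos, neg = [], []
    for uav_feat, sat_feat, is_positive in pairs:
        (pos if is_positive else neg).append(euclidean(uav_feat, sat_feat))
    return pos, neg


def histogram(distances, bin_width):
    """Count distances per bin; bin index i covers [i*bin_width, (i+1)*bin_width)."""
    return Counter(int(d // bin_width) for d in distances)


# Toy feature pairs (illustrative values only).
toy_pairs = [
    ([0.0, 0.0], [0.0, 1.0], True),
    ([1.0, 1.0], [1.0, 2.0], True),
    ([0.0, 0.0], [3.0, 4.0], False),
]
pos, neg = split_distances(toy_pairs)
print(mean(pos), mean(neg))
print(histogram(pos, 0.5), histogram(neg, 0.5))
```

A feature space better suited to retrieval shows less overlap between the positive and negative histograms.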

Figure 4. Distribution of Euclidean distance between global features from DinoV2 backbone

Table 1. Results of evaluation set1

Table 2. Results of evaluation set2

Table 3. Classic methods performance on University-1652
To validate the rationale behind our design, we established a control group where we configured the CVV modules of the two branches in the model to share weights, while keeping other components unchanged. We subsequently retrained the model and conducted tests on sub-datasets 11-15 to calculate the average Recall@1 and AP. The results are shown below.

Table 5. Comparison between complete CVV and CVV removing transformer block (retaining only MLP)
Despite the considerable reduction in model complexity, there was a decline of over 10% in both Recall@1 and AP. During the training process of the control group model, we observed that it could still converge to very high recall on the training set. However, the recall plateaued at a low level on both the validation set and test set, indicating that, for our model, there is no straightforward correlation between overfitting and model complexity.