PCINet: a Prototype- and Concept-based Interpretable Network for Multi-scene Recognition

With the development of remote sensing techniques, a large number of high-resolution aerial images are now available and benefit many applications. Multi-scene recognition, which refers to predicting the multiple scenes that coexist in an aerial image, plays a key role in applying remote sensing images to these applications and has attracted increasing attention. Recently, most researchers have focused on designing deep learning-based recognition models and have achieved great success. However, few efforts have been devoted to explaining the success of deep neural networks in multi-scene recognition. To address this, we introduce the concept bottleneck model (CBM) to interpret model performance and propose a novel network, namely the Prototype- and Concept-based Interpretable Network (PCINet), which projects aerial imagery into a prototype-concept memory bank and encodes their correlations to explain how a network identifies coexisting scenes in an aerial image. Specifically, the proposed network mainly consists of two branches: a prototype matching branch that measures similarity scores between image features and scene prototypes, and a concept bottleneck branch that aligns image features to textual embeddings and computes their relations with concept embeddings. Afterwards, the outputs of the two branches are integrated to infer scene categories. Experimental results show that the model enhances interpretability, providing valuable insights for urban planning and resource management, thereby bridging the gap between deep learning models and practical applications.


Introduction
With the development of remote sensing techniques, a large number of high-resolution aerial images are now available and beneficial to many applications, e.g., urban planning (Marmanis et al., 2018; Fang et al., 2023), traffic monitoring (Mou and Zhu, 2018; Mou and Zhu, 2016), and natural resource management (Du et al., 2022; Qiu et al., 2019; Weng et al., 2018). As a bridge between imagery and applications, multi-scene recognition, which refers to inferring the multiple scenes that coexist in an aerial image, has attracted increasing attention. Recently, most researchers have focused on designing deep learning-based recognition models and have achieved great success (Long et al., 2021; Zheng et al., 2022). However, few efforts have been devoted to explaining the success of deep neural networks in multi-scene recognition. To address this, we introduce the concept bottleneck model (CBM) (Koh et al., 2020) to interpret model performance and propose a novel network that projects aerial imagery into a prototype-concept memory bank and encodes their correlations to explain how a network identifies coexisting scenes in an aerial image. Afterwards, these correlations are fed to a decision layer for scene classification.

Methodology
Our proposed model, called the Prototype- and Concept-based Interpretable Network (PCINet), mainly consists of two branches: a prototype matching branch that measures similarity scores between image features and scene prototypes, and a concept bottleneck branch that aligns image features to textual embeddings and computes their relations with concept embeddings (cf. Figure 2). Afterwards, the outputs of the two branches are integrated to infer scene categories.

Prototype Matching Branch
Given an aerial image, the prototype matching branch first extracts the feature map X using a convolutional neural network (CNN), denoted as f_ϕ. The feature map X is then compared with a set of predefined scene prototypes P = [p_1, p_2, ..., p_N]^T, where N is the number of scene categories and p_i denotes the prototype of the i-th scene. In this work, we follow Hua et al. (2021a) and generate scene prototypes by first training f_ϕ on a single-scene aerial image dataset and then summarizing the features of samples belonging to the i-th scene as its prototype p_i. Thus, p_i is expected to be representative of its corresponding scene (see Figure 5). Afterwards, the similarity score s_i between the feature map and the i-th prototype p_i is computed through a dot product followed by a softmax function (Eq. 1), where Q and K are the query and key mapping functions stemming from the Transformer (Vaswani et al., 2017).
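As a rough illustration of the similarity computation described above, the following sketch scores a feature vector against a set of prototypes with a dot product followed by a softmax. It assumes that the feature map has already been pooled into a single vector and treats Q and K as identity mappings; both are simplifications made for illustration rather than the learned mappings of PCINet, and the function names are hypothetical.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def prototype_similarities(x, prototypes):
    """Dot product between x and each prototype, normalized with a softmax.

    Assumption: Q and K are identity mappings here; in the paper they are
    learned query and key mapping functions.
    """
    logits = [sum(xi * pi for xi, pi in zip(x, p)) for p in prototypes]
    return softmax(logits)

# Toy values, chosen only to exercise the function.
x = [1.0, 0.0, 2.0]
prototypes = [[1.0, 0.0, 1.0], [0.0, 1.0, 0.0], [0.5, 0.5, 0.5]]
scores = prototype_similarities(x, prototypes)
best = scores.index(max(scores))  # index of the most similar prototype
```

The scores sum to one, so they can be read as a distribution over prototypes for a given image.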

Concept Bottleneck Branch
The concept bottleneck branch aims to align the image features with textual embeddings and compute their relations with concept embeddings. One of the crucial steps is to construct the concept bank. To this end, we employ GPT-3.5, known as a powerful large language model, to distill keywords from textual descriptions related to each scene. For example, we send GPT-3.5 the prompt 'You are a remote sensing expert and experienced in interpreting images. Now summarize "what does a soccer field look like from the nadir view?" with 10 keywords or phrases.' and it responds with 'Rectangular shape, Green playing surface, Goalposts, White boundary lines, Central circle, Corner flags, Goalkeeper boxes, Spectator stands, Surrounding facilities, Team markings.' Then we compute word embeddings of these concepts and generate concept embeddings C = [c_1, c_2, ..., c_m]^T, where m is the number of concepts. To align image and language features, we employ a vision-language model pretrained with Contrastive Language-Image Pretraining (CLIP) (Radford et al., 2021). Specifically, an aerial image is first mapped into the textual space, yielding a concept feature map X_text. The concept feature map X_text is then compared with the predefined concept embeddings. The output of the concept bottleneck branch, S_c, is computed with Eq. 1, with X and P replaced by X_text and C, respectively.
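The prompt-and-parse step for building the concept bank could be organized roughly as in the sketch below. The prompt wording and the sample response are taken from the example above; the helper names are hypothetical, the call to the language model is deliberately left out, and the comma-separated response format is an assumption based on that single example.

```python
PROMPT_TEMPLATE = (
    "You are a remote sensing expert and experienced in interpreting images. "
    'Now summarize "what does {scene} look like from the nadir view?" '
    "with {k} keywords or phrases."
)

def build_concept_prompt(scene, k=10):
    """Fill the prompt template for one scene, e.g. 'a soccer field'."""
    return PROMPT_TEMPLATE.format(scene=scene, k=k)

def parse_concepts(response):
    """Split a comma-separated reply into cleaned concept strings."""
    return [c.strip().rstrip(".") for c in response.split(",") if c.strip()]

prompt = build_concept_prompt("a soccer field")

# Response quoted in the text above; no model is queried in this sketch.
response = (
    "Rectangular shape, Green playing surface, Goalposts, White boundary lines, "
    "Central circle, Corner flags, Goalkeeper boxes, Spectator stands, "
    "Surrounding facilities, Team markings."
)
concepts = parse_concepts(response)
```

Each parsed concept would then be passed to a text encoder to obtain one row of C.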

Scene Prediction
Afterwards, the outputs of the prototype matching branch and the concept bottleneck branch are integrated and fed to the final classification layer, denoted as g, to make the final prediction. Here, V_p and V_c denote the value mapping functions of the prototype matching and concept bottleneck branches, respectively. By doing so, we can interpret network decisions by identifying the prototypes and concepts with the highest scores.
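The interpretation step, reading off the prototypes and concepts with the highest scores, can be sketched as a simple ranking. The names and score values below are made up for illustration, and `top_k` is a hypothetical helper rather than part of PCINet.

```python
def top_k(names, scores, k=3):
    """Return the k (name, score) pairs with the highest scores."""
    ranked = sorted(zip(names, scores), key=lambda pair: pair[1], reverse=True)
    return ranked[:k]

# Illustrative values only; real scores come from the two branches.
prototype_names = ["residential", "parking lot", "river", "farmland"]
prototype_scores = [0.55, 0.30, 0.05, 0.10]
concept_names = ["Rooftops", "Parked cars", "Water surface"]
concept_scores = [0.60, 0.35, 0.05]

top_prototypes = top_k(prototype_names, prototype_scores, k=2)
top_concepts = top_k(concept_names, concept_scores, k=2)
```

Listing the top entries of both branches side by side gives a compact explanation of which evidence was most strongly activated for an image.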

Experimental Results
We generate scene prototypes by training CNNs on single-scene aerial image datasets, i.e., the UCM (Yang and Newsam, 2010) and AID (Xia et al., 2017) datasets, and evaluate the performance of our model on the MAI dataset (Hua et al., 2021b), which is specifically designed for multi-scene recognition. Quantitative and qualitative results are reported for analysis and discussion.
... area, desert, forest, parking lot, industrial area, town square, sparse residential area, pond, medium residential area, port, resort, airport, school, stadium, and dense residential area. The number of images varies across categories, ranging from 220 to 420. Similar to the UCM dataset, we adopt a data split in which 20% of the images from each scene category are allocated as test samples, while the remaining images are used for training and validation of the embedding function.
Dataset configuration. To evaluate our method broadly, we use two dataset configurations, MAI-UCM and MAI-AID, based on the scene categories shared by UCM/AID and MAI. Specifically, the MAI-UCM configuration consists of 1600 single-scene aerial images from the UCM dataset and 1649 multi-scene images from our MAI dataset. The 16 aerial scenes common to both datasets are considered in MAI-UCM, and the numbers of their associated images are listed in Table 1. Besides, the MAI-AID configuration is composed of 7050 and 3239 aerial images from the AID and MAI datasets, respectively. Its 20 common scene categories are taken into consideration, and the number of images related to each scene is presented in Table 1.
Although such configurations might limit the number of recognizable scene classes, we believe this limitation can be addressed by crawling OSM data to collect more single-scene images and by producing large-scale multi-scene aerial image datasets. We select only 90 and 120 multi-scene aerial images from MAI-UCM and MAI-AID as training instances, respectively, and test networks on the remaining multi-scene images. For rare scenes (e.g., port and train station), we select all associated training images, while for common scenes, we randomly select several of their training samples. It is noteworthy that we obtain the scene prototype of residential by averaging the high-level representations of aerial images belonging to the medium residential and dense residential scenes. Besides, although the UCM and AID datasets do not contain images for sea, their images for beach often comprise both sea and beach. Therefore, we use training samples labeled as beach to obtain the prototype representation of sea.
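The averaging used to form the residential prototype can be illustrated with a small sketch. It assumes that each image is represented by a fixed-length feature vector and that all images of the two scenes are pooled before averaging; whether the average is taken over pooled images or over per-scene means is not specified here, so this is only one plausible reading. The feature values are toy numbers.

```python
def mean_vector(vectors):
    """Element-wise mean of a non-empty list of equal-length vectors."""
    n = len(vectors)
    dim = len(vectors[0])
    return [sum(v[d] for v in vectors) / n for d in range(dim)]

# Toy high-level features (hypothetical values, 2-D for readability).
medium_residential = [[1.0, 2.0], [3.0, 4.0]]
dense_residential = [[5.0, 6.0]]

# Assumption: pool the images of both scenes, then average.
residential_prototype = mean_vector(medium_residential + dense_residential)
```

The same helper would apply to the sea prototype, with features of images labeled as beach as input.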

Concept Generation
To construct a comprehensive and precise initial set of scene concepts, we adopt a concept generation approach based on large language models (LLMs) and prompt engineering. By designing prompt paradigms, we aim to guide LLMs to draw simultaneously on training corpora covering a wide range of contexts and on up-to-date online expert knowledge bases, thus generating an initial set of concepts that describe scene appearance, compositional structure, functional purposes, adjacent features, and more. In the process of prompt design, we first specify the system role undertaken by the LLM and clarify the purposes, contents, and formats of the questions and answers. Next, we engage in multi-round question-and-answer sessions to establish a chain of thought for the model. To enhance the accuracy and robustness of model outputs, we employ active-prompt techniques: we calculate the uncertainty of the model's multiple responses to the same prompt and, for prompts whose answers have low confidence, supplement the model's chain of thought by manually retrieving relevant corpora, repeating this process until the model produces highly confident results. Finally, we summarize the LLM's responses to prompts that address the same scene from different angles and manually filter out highly irrelevant concepts, yielding an initial set of scene concepts with rich descriptive dimensions and high semantic confidence.
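One simple way to quantify the uncertainty of repeated responses to the same prompt is a disagreement ratio, i.e., the fraction of distinct answers among the responses. The sketch below uses this measure as an assumption; the text does not state which uncertainty measure is used, and the threshold and sample responses are hypothetical.

```python
def disagreement(responses):
    """Fraction of distinct answers among repeated responses to one prompt.

    Each response is a list of concept strings; case and order are ignored.
    """
    distinct = {frozenset(c.strip().lower() for c in r) for r in responses}
    return len(distinct) / len(responses)

# Three hypothetical responses to the same prompt.
responses = [
    ["Goalposts", "Green playing surface"],
    ["green playing surface", "goalposts"],
    ["Goalposts", "Running track"],
]
uncertainty = disagreement(responses)

# Hypothetical threshold: prompts above it would be flagged for manual
# retrieval of supporting corpora before querying the model again.
needs_review = uncertainty > 0.5
```

A value close to 1/len(responses) means the model answered consistently, while a value of 1 means every response differed.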

Training Details
The training process involves two phases: 1) learning the embedding function f_ϕ using a large dataset of single-scene aerial images, and 2) training the entire PCINet using a limited number of multi-scene images in an end-to-end fashion. Different training strategies are applied to each phase, as detailed below.
During the initial training phase, we initialize the feature extraction modules with CNNs pre-trained on ImageNet (Deng et al., 2009). We use cross-entropy as the loss function and employ Nesterov Adam (Dozat, n.d.) as the optimizer, with the recommended parameters: β1 = 0.9, β2 = 0.999, and ϵ = 1e-08. The initial learning rate is set to 2e-04 and decayed by a factor of √0.1 if the validation loss does not decrease for two consecutive epochs.
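The learning rate rule described above can be traced with a small, framework-free sketch. It assumes that "does not decrease" means "does not improve on the best validation loss seen so far" and that the patience counter resets after each decay; both are assumptions, and the validation losses are invented for illustration.

```python
import math

def plateau_schedule(val_losses, initial_lr=2e-4,
                     factor=math.sqrt(0.1), patience=2):
    """Return the learning rate in effect after each epoch's validation check."""
    lr = initial_lr
    best = float("inf")
    stale = 0  # consecutive epochs without improvement
    lrs = []
    for loss in val_losses:
        if loss < best:
            best = loss
            stale = 0
        else:
            stale += 1
            if stale >= patience:
                lr *= factor  # decay by sqrt(0.1)
                stale = 0
        lrs.append(lr)
    return lrs

# Invented validation losses: no improvement in epochs 3 and 4.
lrs = plateau_schedule([1.0, 0.8, 0.9, 0.85, 0.7])
```

In TensorFlow, the Keras `ReduceLROnPlateau` callback offers comparable behavior, although the text does not state whether it was used.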
In the subsequent training phase, we initialize f_ϕ with the parameters learned in the previous phase and use a Glorot uniform initializer to initialize all weights in Q_h, V_h, K_h, and the final fully-connected layer. We set L and U to 256, and the number of heads to 20. All weights are trainable, and the embedding function is fine-tuned during this phase as well. Scene-level labels are encoded as multi-hot vectors, where 0 indicates the absence of a scene and 1 indicates its presence. The loss function is defined as binary cross-entropy. The optimizer remains the same as in the initial phase, but we use a relatively larger learning rate of 5e-04. The network is implemented using TensorFlow and trained on a single NVIDIA Tesla P100 16GB GPU for 100 epochs. We set the training batch size to 32 for both phases.
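The multi-hot label encoding and the binary cross-entropy loss mentioned above can be sketched without any framework as follows. The scene list, the predicted probabilities, the clipping constant, and the averaging over scene categories are illustrative assumptions.

```python
import math

def multi_hot(present, all_scenes):
    """Encode the scenes present in an image as a 0/1 vector."""
    return [1.0 if s in present else 0.0 for s in all_scenes]

def binary_cross_entropy(y_true, y_prob, eps=1e-7):
    """Mean binary cross-entropy over scene categories for one image."""
    total = 0.0
    for t, p in zip(y_true, y_prob):
        p = min(max(p, eps), 1.0 - eps)  # clip to avoid log(0)
        total += -(t * math.log(p) + (1.0 - t) * math.log(1.0 - p))
    return total / len(y_true)

scenes = ["farmland", "residential", "river", "runway"]
labels = multi_hot({"farmland", "residential"}, scenes)

# Hypothetical sigmoid outputs of the final classification layer.
probs = [0.9, 0.8, 0.1, 0.2]
loss = binary_cross_entropy(labels, probs)
```

Because each scene is treated as an independent binary decision, any number of scenes can be predicted as present in the same image.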

Evaluation Metrics
To quantitatively evaluate network performance, we employ example-based F1 (Wu and Zhou, 2016) and F2 (Van Rijsbergen, 1979) scores as evaluation metrics. These scores are calculated using the following equation:

$$F_\beta = \frac{(1 + \beta^2)\, p_e\, r_e}{\beta^2 p_e + r_e}, \quad \beta \in \{1, 2\}, \tag{3}$$

where p_e and r_e represent example-based precision and recall (Tsoumakas and Vlahavas, 2007), which are computed as:

$$p_e = \frac{TP_e}{TP_e + FP_e}, \tag{4}$$

$$r_e = \frac{TP_e}{TP_e + FN_e}, \tag{5}$$

where TP_e, FP_e, and FN_e indicate the numbers of true positives, false positives, and false negatives, respectively, within each example. Each example in our case corresponds to a multi-scene aerial image. By averaging scores across all examples in the test set, we obtain the mean example-based F scores, precision, and recall. Additionally, we calculate label-based precision p_l and recall r_l using Eq. 4 and Eq. 5, respectively, but substituting the counts of true positives, false positives, and false negatives specific to each scene category. The mean p_l and r_l are then computed. It is worth noting that the primary metrics of interest are the mean F1 and F2 scores.
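The example-based scores can be computed per image as in the following sketch, which uses the standard F-beta definition in terms of precision and recall. Returning 0 when a denominator is zero is an assumption of this sketch, and the labels and predictions are toy values.

```python
def example_f_beta(y_true, y_pred, beta):
    """Example-based F-beta for one image from binary label vectors."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0  # assumption: 0 if undefined
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return 0.0
    b2 = beta ** 2
    return (1 + b2) * precision * recall / (b2 * precision + recall)

def mean_f_beta(all_true, all_pred, beta):
    """Average the per-image score over all examples in a test set."""
    scores = [example_f_beta(t, p, beta) for t, p in zip(all_true, all_pred)]
    return sum(scores) / len(scores)

# One image with three true scenes, of which only one is predicted.
y_true = [1, 1, 1, 0]
y_pred = [1, 0, 0, 0]
f1 = example_f_beta(y_true, y_pred, beta=1)  # precision 1, recall 1/3
f2 = example_f_beta(y_true, y_pred, beta=2)
```

In this toy case F2 is lower than F1 because F2 weights recall more heavily, and recall is the weaker of the two quantities.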

Results
We report the results of our experiments in terms of accuracy, precision, recall, and F1-score. Our model achieves better performance, and correlations between images, prototypes, and concepts are visualized to illustrate the decision process.

Conclusion
In conclusion, PCINet, with its dual branches integrating prototypes and concepts, achieves superior performance in unconstrained scene recognition for high-resolution aerial images. The model enhances interpretability, providing valuable insights for urban planning and resource management, thereby bridging the gap between deep learning models and practical applications.

Figure 1. Comparisons between (a) single- and (b) multi-scene recognition. In (a), each aerial image contains one dominant scene, and the task is to classify each image into one scene category. In (b), multiple scenes are present simultaneously in one single image, and they are required to be thoroughly identified. In our case, single-scene images, such as images in (a), are leveraged to learn scene prototypes for inferring scenes in multi-scene images.

Figure 2. Architecture of the proposed PCINet. It mainly consists of two branches: a prototype matching branch that measures similarity scores between image features and scene prototypes, and a concept bottleneck branch that aligns image features to textual embeddings and computes their relations with concept embeddings. Afterwards, outputs are integrated and fed to the final classification layer for scene prediction.

Figure 3. Example images in our MAI dataset. Each image is 512 × 512 pixels, and their spatial resolutions range from 0.3 m/pixel to 0.6 m/pixel. We list their scene-level labels here: (a) farmland and residential; (b) baseball, woodland, parking lot, and tennis court; (c) commercial, parking lot, and residential; (d) woodland, residential, river, and runway; (e) river and storage tanks; (f) beach, woodland, residential, and sea; (g) farmland, woodland, and residential; (h) apron and runway; (i) baseball field, parking lot, residential, bridge, and soccer field.

Figure 4. Sample distributions of all scene categories in the MAI dataset.

Table 1. The Number of Images Associated with Each Scene. * indicates that the number of images is not counted in the total amounts, as the scene prototypes of beach and sea are learned from the same images.