The International Archives of the Photogrammetry, Remote Sensing and Spatial Information Sciences
Articles | Volume XLVIII-M-6-2025
https://doi.org/10.5194/isprs-archives-XLVIII-M-6-2025-23-2025
19 May 2025

Comparative Analysis of Vision Foundation Models for Building Segmentation in Aerial Imagery

Zeynep Akbulut, Samed Özdemir, and Fevzi Karslı

Keywords: Aerial Imagery, Building Segmentation, Grounded-SAM, Segment Anything Model, Vision Foundation Models

Abstract. Vision Foundation Models (VFMs) demonstrate impressive generalization capabilities for image segmentation and classification tasks, leading to their increasing adoption in the remote sensing field. This study investigates the performance of VFMs in zero-shot building segmentation from aerial imagery using two model pipelines: Grounded-SAM and SAM+CLIP. Grounded-SAM integrates the Grounding DINO backbone with the Segment Anything Model (SAM), while SAM+CLIP first employs SAM to generate masks, followed by Contrastive Language-Image Pretraining (CLIP) for classification. The evaluation, performed on the WHU building dataset using Precision, Recall, F1-score, and Intersection over Union (IoU) metrics, revealed that Grounded-SAM achieved an F1-score of 0.83 and an IoU of 0.71, whereas SAM+CLIP achieved an F1-score of 0.65 and an IoU of 0.49. While Grounded-SAM excelled at accurately delineating partially occluded and irregularly shaped buildings, SAM+CLIP was able to segment larger buildings but struggled to delineate smaller ones. Given the impressive performance of VFMs in zero-shot building segmentation, future efforts aimed at refining these models through fine-tuning or few-shot learning could significantly expand their application in remote sensing.
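For reference, the four metrics named in the abstract are standard pixel-wise quantities computed from the confusion between a predicted and a ground-truth binary building mask. The sketch below is illustrative only (the function name and NumPy implementation are not the authors' code); it assumes both masks are boolean arrays of equal shape with True marking building pixels.

import numpy as np

def evaluate_masks(pred: np.ndarray, gt: np.ndarray) -> dict:
    """Pixel-wise Precision, Recall, F1-score, and IoU for binary masks."""
    tp = np.logical_and(pred, gt).sum()    # building predicted as building
    fp = np.logical_and(pred, ~gt).sum()   # background predicted as building
    fn = np.logical_and(~pred, gt).sum()   # building missed by the prediction

    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0
    iou = tp / (tp + fp + fn) if (tp + fp + fn) else 0.0
    return {"Precision": precision, "Recall": recall, "F1": f1, "IoU": iou}

Note that F1 and IoU are monotonically related for binary masks (IoU = F1 / (2 - F1)), which is consistent with the reported pairs: 0.83/0.71 for Grounded-SAM and 0.65/0.49 (approximately) for SAM+CLIP.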
