The International Archives of the Photogrammetry, Remote Sensing and Spatial Information Sciences
Articles | Volume XLVIII-M-6-2025
https://doi.org/10.5194/isprs-archives-XLVIII-M-6-2025-23-2025
19 May 2025

Comparative Analysis of Vision Foundation Models for Building Segmentation in Aerial Imagery

Zeynep Akbulut, Samed Özdemir, and Fevzi Karslı

Keywords: Aerial Imagery, Building Segmentation, Grounded-SAM, Segment Anything Model, Vision Foundation Models

Abstract. Vision Foundation Models (VFMs) demonstrate impressive generalization capabilities for image segmentation and classification tasks, leading to their increasing adoption in the remote sensing field. This study investigates the performance of VFMs in zero-shot building segmentation from aerial imagery using two model pipelines: Grounded-SAM and SAM+CLIP. Grounded-SAM integrates the Grounding DINO backbone with the Segment Anything Model (SAM), while SAM+CLIP first employs SAM to generate masks, followed by Contrastive Language-Image Pretraining (CLIP) for classification. The evaluation, performed on the WHU building dataset using Precision, Recall, F1-score, and Intersection over Union (IoU) metrics, revealed that Grounded-SAM achieved an F1-score of 0.83 and an IoU of 0.71, whereas SAM+CLIP achieved an F1-score of 0.65 and an IoU of 0.49. While Grounded-SAM excelled at accurately delineating partially occluded and irregularly shaped buildings, SAM+CLIP was able to segment larger buildings but struggled to delineate smaller ones. Given the impressive performance of VFMs in zero-shot building segmentation, future efforts aimed at refining these models through fine-tuning or few-shot learning could significantly expand their application in remote sensing.
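For reference, the four metrics named in the abstract are standard pixel-wise quantities computed from the confusion between a predicted and a ground-truth binary building mask. The sketch below is illustrative only (the function name and NumPy implementation are not the authors' code); it assumes both masks are boolean arrays of equal shape with True marking building pixels.

import numpy as np

def evaluate_masks(pred: np.ndarray, gt: np.ndarray) -> dict:
    """Pixel-wise Precision, Recall, F1-score, and IoU for binary masks."""
    tp = np.logical_and(pred, gt).sum()    # building predicted as building
    fp = np.logical_and(pred, ~gt).sum()   # background predicted as building
    fn = np.logical_and(~pred, gt).sum()   # building missed by the prediction

    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0
    iou = tp / (tp + fp + fn) if (tp + fp + fn) else 0.0
    return {"Precision": precision, "Recall": recall, "F1": f1, "IoU": iou}

Note that F1 and IoU are monotonically related for binary masks (IoU = F1 / (2 - F1)), which is consistent with the reported pairs: 0.83/0.71 for Grounded-SAM and 0.65/0.49 (approximately) for SAM+CLIP.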
