The International Archives of the Photogrammetry, Remote Sensing and Spatial Information Sciences
Volume XLVIII-4/W16-2025
https://doi.org/10.5194/isprs-archives-XLVIII-4-W16-2025-39-2025
19 Sep 2025

VLM-Based Building Change Detection with CNN-Transformer

Zeinab Gharibbafghi and Peter Reinartz

Keywords: Building Change Detection, Vision-Language Model, Satellite Imagery, Transformer Model, Grounding DINO

Abstract. Accurate building change detection in high-resolution satellite imagery is critical for urban planning, disaster response, and smart city applications. Existing methods often rely on large labeled datasets or handcrafted features, limiting scalability across diverse geographic regions. In this paper, we propose a hybrid framework that integrates a pretrained Vision-Language Model (Grounding DINO) with a lightweight CNN-Transformer architecture to perform text-guided building change detection. Without any fine-tuning, Grounding DINO generates semantic building masks from bi-temporal image pairs using the text prompt “building,” which are used to amplify structural features in a ResNet18 backbone. A custom Transformer encoder with dual spatial and channel attention refines these features to capture both local details and global context. On the LEVIR-CD dataset, our framework improves Recall by +3.98%, F1-Score by +3.01%, and Intersection over Union (IoU) by +4.70% compared to a CNN-Transformer baseline. These results highlight the potential of vision-language models to enhance remote sensing workflows without extensive domain-specific fine-tuning.
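The core idea of the framework can be illustrated in miniature: a semantic building mask (produced in the paper by Grounding DINO with the prompt "building", here simulated with a toy array) is used to amplify the responses of a CNN feature map inside building regions before further processing. The function name, the broadcasting scheme, and the gain factor `alpha` below are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

def amplify_features(features, mask, alpha=1.0):
    """Scale feature responses inside masked (building) regions.

    features: (C, H, W) feature map from a CNN backbone (e.g. ResNet18).
    mask:     (H, W) binary building mask in {0, 1} from the VLM.
    alpha:    amplification gain (assumed); alpha=0 leaves features unchanged.
    """
    # Broadcast the (H, W) mask across all C channels; masked pixels are
    # scaled by (1 + alpha), unmasked pixels pass through unchanged.
    return features * (1.0 + alpha * mask[None, :, :])

# Toy example: 2-channel 2x2 features, one "building" pixel at (0, 0).
feats = np.ones((2, 2, 2))
mask = np.array([[1.0, 0.0], [0.0, 0.0]])
out = amplify_features(feats, mask, alpha=1.0)
```

In this sketch the building pixel's features double while background features are untouched, mirroring the paper's goal of emphasizing structural (building) regions before the attention-based Transformer encoder refines them.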
