VLM-Based Building Change Detection with CNN-Transformer
Keywords: Building Change Detection, Vision-Language Model, Satellite Imagery, Transformer Model, Grounding DINO
Abstract. Accurate building change detection in high-resolution satellite imagery is critical for urban planning, disaster response, and smart city applications. Existing methods often rely on large labeled datasets or handcrafted features, limiting scalability across diverse geographic regions. In this paper, we propose a hybrid framework that integrates a pretrained Vision-Language Model (Grounding DINO) with a lightweight CNN-Transformer architecture to perform text-guided building change detection. Without any fine-tuning, Grounding DINO generates semantic building masks from bi-temporal image pairs using the text prompt “building,” which are used to amplify structural features in a ResNet18 backbone. A custom Transformer encoder with dual spatial and channel attention refines these features to capture both local details and global context. On the LEVIR-CD dataset, our framework improves Recall by +3.98%, F1-Score by +3.01%, and Intersection over Union (IoU) by +4.70% compared to a CNN-Transformer baseline. These results highlight the potential of vision-language models to enhance remote sensing workflows without extensive domain-specific fine-tuning.