The International Archives of the Photogrammetry, Remote Sensing and Spatial Information Sciences
Publications Copernicus
Articles | Volume XLVIII-4/W14-2025
https://doi.org/10.5194/isprs-archives-XLVIII-4-W14-2025-219-2025
26 Nov 2025

Exploring the Potential of VLMs in Remote Sensing through Prompt Optimization

Weibin Ma, Ruiqian Zhang, Xiaogang Ning, Hanchao Zhang, and Yixin Chen

Keywords: Vision-Language Models (VLMs), Remote Sensing, Prompt Optimization

Abstract. Vision-Language Models (VLMs) have demonstrated impressive capabilities in interpreting natural scene imagery. However, their generalization to domain-specific applications such as remote sensing remains underexplored. We address this gap with a refined methodology centered on language-driven prompt optimization, aimed at improving the adaptability of VLMs to remote sensing tasks. Specifically, we adopt a two-stage evaluation framework comprising Zero-Shot Prompting and Prompt-Informed Supervised Fine-Tuning. In the first stage, we assess the influence of prompt formulation on zero-shot performance. In the second stage, we explore how incorporating optimized prompts during supervised fine-tuning can help reveal the model’s generalization potential. Within this framework, we introduce two prompting strategies tailored for remote sensing: Cognitively-Guided Prompting (CogPrompt), which employs Chain-of-Thought reasoning to elicit structured and interpretable responses; and Knowledge-Injected Prompting (KnowPrompt), which incorporates domain-specific priors through existence assertions. We comprehensively evaluate several open-source VLMs, including Qwen-VL, InternVL, and the LLaVA series, on multiple remote sensing benchmarks covering object detection and captioning. Experimental results show that prompt optimization consistently improves detection and captioning performance across a range of metrics, while also indicating that significant room for improvement remains in the capabilities of VLMs for remote sensing tasks.
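
To make the two prompting strategies more concrete, the following is a minimal Python sketch of how such prompts might be assembled as plain text. The function names `build_cog_prompt` and `build_know_prompt`, the step wording, and the assertion template are all hypothetical illustrations; the abstract does not give the authors' actual prompt text, nor does it say where the list of present objects comes from.

```python
from typing import Sequence


def build_cog_prompt(task: str, categories: Sequence[str]) -> str:
    """Chain-of-Thought style prompt: request stepwise reasoning before the answer.

    The four steps below are illustrative assumptions, not the paper's wording.
    """
    steps = [
        "Describe the overall scene type.",
        "List candidate objects among: " + ", ".join(categories) + ".",
        "For each candidate, explain the visual evidence (shape, size, context).",
        "Give the final answer for the task.",
    ]
    numbered = "\n".join(f"{i}. {s}" for i, s in enumerate(steps, start=1))
    return f"Task: {task}\nThink step by step:\n{numbered}"


def build_know_prompt(task: str, present: Sequence[str]) -> str:
    """Knowledge-injected prompt: prepend an existence assertion to the task.

    `present` is assumed to be supplied by the caller (for example from
    dataset annotations); how it is obtained is outside this sketch.
    """
    if not present:
        # No prior knowledge available: fall back to the bare task.
        return f"Task: {task}"
    assertion = "This remote sensing image contains: " + ", ".join(present) + "."
    return f"{assertion}\nTask: {task}"


if __name__ == "__main__":
    print(build_cog_prompt("Detect all objects.", ["airplane", "ship"]))
    print()
    print(build_know_prompt("Detect all objects.", ["airplane", "storage tank"]))
```

Either string would then be passed, together with the image, to whichever VLM is being evaluated, in the zero-shot stage or as the instruction text of fine-tuning samples.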
