Exploring the Potential of VLMs in Remote Sensing through Prompt Optimization
Keywords: Vision-Language Models (VLMs), Remote Sensing, Prompt Optimization
Abstract. Vision-Language Models (VLMs) have demonstrated impressive capabilities in interpreting natural scene imagery. However, their generalization to domain-specific applications, such as remote sensing, remains underexplored. We address this gap by introducing a refined methodology centered on language-driven prompt optimization, with the aim of enhancing the adaptability of VLMs to remote sensing tasks. Specifically, we adopt a two-stage evaluation framework comprising Zero-Shot Prompting and Prompt-Informed Supervised Fine-Tuning. In the first stage, we assess the influence of prompt formulation on zero-shot performance. In the second stage, we explore how incorporating optimized prompts during supervised fine-tuning can further reveal the model’s generalization potential. Within this framework, we introduce two prompting strategies tailored for remote sensing: Cognitively-Guided Prompting (CogPrompt), which employs Chain-of-Thought reasoning to elicit structured and interpretable responses, and Knowledge-Injected Prompting (KnowPrompt), which incorporates domain-specific priors through existence assertions. We conduct a comprehensive evaluation of several open-source VLMs, including Qwen-VL, InternVL, and the LLaVA series, across multiple remote sensing benchmarks covering object detection and captioning. Extensive experimental results show that prompt optimization consistently improves detection and captioning performance across a range of metrics, and that substantial room for improvement remains in the remote sensing capabilities of VLMs.
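To make the two prompting strategies concrete, the sketch below shows one plausible way such prompts could be assembled for a remote sensing detection query. The paper does not publish its exact templates, so the wording, function names, and example object classes here are illustrative assumptions rather than the authors' prompts.

```python
# Illustrative sketch only: the template text and class lists below are assumed,
# not taken from the paper's released prompts.

def build_cog_prompt(question: str) -> str:
    """Cognitively-Guided Prompting (CogPrompt): a Chain-of-Thought style
    instruction asking the model to reason step by step before answering."""
    return (
        "You are analyzing a remote sensing (overhead) image.\n"
        "Reason step by step: (1) describe the overall scene, "
        "(2) identify the salient ground objects, (3) then answer the question.\n"
        f"Question: {question}"
    )

def build_know_prompt(question: str, candidate_classes: list[str]) -> str:
    """Knowledge-Injected Prompting (KnowPrompt): domain priors expressed as
    existence assertions about object categories that may appear in the image."""
    assertions = "; ".join(f"the image may contain {c}" for c in candidate_classes)
    return (
        f"Domain priors: {assertions}.\n"
        "Using these priors, answer the question about the overhead image.\n"
        f"Question: {question}"
    )

if __name__ == "__main__":
    q = "List every airplane in the image with its bounding box."
    print(build_cog_prompt(q))
    print(build_know_prompt(q, ["airplanes", "runways", "terminal buildings"]))
```

In both cases the optimized prompt simply wraps the original task query, so it can be used unchanged for zero-shot inference or injected into the instruction field of supervised fine-tuning data.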
