Exploring the Potential of VLMs in Remote Sensing through Prompt Optimization
Keywords: Vision-Language Models (VLMs), Remote Sensing, Prompt Optimization
Abstract. Vision-Language Models (VLMs) have demonstrated impressive capabilities in interpreting natural scene imagery. However, their generalization to domain-specific applications, such as remote sensing, remains underexplored. We address this gap by introducing a refined methodology centered on language-driven prompt optimization, with the aim of enhancing the adaptability of VLMs to remote sensing tasks. Specifically, we adopt a two-stage evaluation framework comprising Zero-Shot Prompting and Prompt-Informed Supervised Fine-Tuning. In the first stage, we assess the influence of prompt formulation on zero-shot performance. In the second stage, we explore how incorporating optimized prompts during supervised fine-tuning can further reveal the model’s generalization potential. Within this framework, we introduce two prompting strategies tailored for remote sensing: Cognitively-Guided Prompting (CogPrompt), which employs Chain-of-Thought reasoning to elicit structured and interpretable responses, and Knowledge-Injected Prompting (KnowPrompt), which incorporates domain-specific priors through existence assertions. We conduct a comprehensive evaluation of several open-source VLMs, including Qwen-VL, InternVL, and the LLaVA series, across multiple remote sensing benchmarks covering object detection and captioning. Extensive experimental results show that prompt optimization consistently improves detection and captioning performance across a range of metrics, and that substantial room for improvement remains in the remote sensing capabilities of VLMs.
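To make the two prompting strategies concrete, the sketch below shows one plausible way such prompts could be assembled for a remote sensing detection query. The paper does not publish its exact templates, so the wording, function names, and example object classes here are illustrative assumptions rather than the authors' prompts.

```python
# Illustrative sketch only: the template text and class lists below are assumed,
# not taken from the paper's released prompts.

def build_cog_prompt(question: str) -> str:
    """Cognitively-Guided Prompting (CogPrompt): a Chain-of-Thought style
    instruction asking the model to reason step by step before answering."""
    return (
        "You are analyzing a remote sensing (overhead) image.\n"
        "Reason step by step: (1) describe the overall scene, "
        "(2) identify the salient ground objects, (3) then answer the question.\n"
        f"Question: {question}"
    )

def build_know_prompt(question: str, candidate_classes: list[str]) -> str:
    """Knowledge-Injected Prompting (KnowPrompt): domain priors expressed as
    existence assertions about object categories that may appear in the image."""
    assertions = "; ".join(f"the image may contain {c}" for c in candidate_classes)
    return (
        f"Domain priors: {assertions}.\n"
        "Using these priors, answer the question about the overhead image.\n"
        f"Question: {question}"
    )

if __name__ == "__main__":
    q = "List every airplane in the image with its bounding box."
    print(build_cog_prompt(q))
    print(build_know_prompt(q, ["airplanes", "runways", "terminal buildings"]))
```

In both cases the optimized prompt simply wraps the original task query, so it can be used unchanged for zero-shot inference or injected into the instruction field of supervised fine-tuning data.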
