The International Archives of the Photogrammetry, Remote Sensing and Spatial Information Sciences
Volume XLVIII-G-2025
https://doi.org/10.5194/isprs-archives-XLVIII-G-2025-249-2025
28 Jul 2025

Pushing the Limits of Vision-Language Models in Remote Sensing without Human Annotations

Keumgang Cha, Donggeun Yu, Junghoon Seo, Hyunguk Choi, and Taegyun Jeon

Keywords: Remote Sensing, Foundation Model, Multi Modality, Vision-Language

Abstract. Generalized foundation models for vision-language integration have risen to prominence, given their wide range of applications. In the natural-image domain, the vision-language datasets needed to build such foundation models are readily obtained thanks to their abundance and the ease of web crawling. In the remote sensing domain, by contrast, vision-language datasets exist, but their volume is insufficient for constructing robust foundation models. This study introduces an approach to curating vision-language datasets with an image decoding machine learning model, removing the need for human-annotated labels. Using this methodology, we collected approximately 9.6 million vision-language pairs from very-high-resolution (VHR) imagery. The resulting model outperformed counterparts that did not leverage publicly available vision-language datasets, particularly on downstream tasks such as zero-shot classification, semantic localization, and image-text retrieval. Moreover, on tasks that use only the vision encoder, such as linear probing and k-NN classification, our model demonstrated superior performance compared to those relying on domain-specific vision-language datasets.
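
As a rough illustration of the annotation-free curation idea described in the abstract, the sketch below captions remote-sensing image tiles with an off-the-shelf image captioning model and writes out image-text pairs. The specific captioning checkpoint (Salesforce/blip-image-captioning-base), the tile directory, and the JSONL output format are illustrative assumptions; the abstract does not name the image decoding model or the details of the curation pipeline.

```python
# Minimal sketch of annotation-free vision-language pair curation:
# caption each remote-sensing tile with a pretrained captioning model
# instead of relying on human-written labels.
# Assumptions: the BLIP checkpoint, the tile directory, and the JSONL
# output format are illustrative stand-ins, not the paper's pipeline.
import json
from pathlib import Path

import torch
from PIL import Image
from transformers import BlipProcessor, BlipForConditionalGeneration

device = "cuda" if torch.cuda.is_available() else "cpu"
processor = BlipProcessor.from_pretrained("Salesforce/blip-image-captioning-base")
model = BlipForConditionalGeneration.from_pretrained(
    "Salesforce/blip-image-captioning-base"
).to(device)
model.eval()

tile_dir = Path("vhr_tiles")                 # hypothetical folder of VHR image tiles
output_path = Path("image_text_pairs.jsonl")

with output_path.open("w") as f, torch.no_grad():
    for tile_path in sorted(tile_dir.glob("*.png")):
        image = Image.open(tile_path).convert("RGB")
        inputs = processor(images=image, return_tensors="pt").to(device)
        out = model.generate(**inputs, max_new_tokens=40)
        caption = processor.decode(out[0], skip_special_tokens=True)
        # Each line becomes one vision-language pair for contrastive training.
        f.write(json.dumps({"image": str(tile_path), "text": caption}) + "\n")
```

Pairs curated this way could then be fed to a CLIP-style contrastive objective, with zero-shot classification, image-text retrieval, linear probing, and k-NN evaluation applied to the trained encoders, as outlined in the abstract.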
