The International Archives of the Photogrammetry, Remote Sensing and Spatial Information Sciences
Volume XLVIII-G-2025
https://doi.org/10.5194/isprs-archives-XLVIII-G-2025-249-2025
28 Jul 2025

Pushing the Limits of Vision-Language Models in Remote Sensing without Human Annotations

Keumgang Cha, Donggeun Yu, Junghoon Seo, Hyunguk Choi, and Taegyun Jeon

Keywords: Remote Sensing, Foundation Model, Multi Modality, Vision-Language

Abstract. Generalized foundation models for vision-language integration have risen to prominence, given their wide range of applications. In the natural-image domain, the vision-language datasets needed to build such foundation models are readily obtained thanks to their abundance and the ease of web crawling. In the remote sensing domain, by contrast, vision-language datasets exist, but their volume is insufficient for constructing robust foundation models. This study introduces an approach to curating vision-language datasets with an image decoding machine learning model, removing the need for human-annotated labels. Using this methodology, we collected approximately 9.6 million vision-language pairs from very-high-resolution (VHR) imagery. The resulting model outperformed counterparts that did not leverage publicly available vision-language datasets, particularly on downstream tasks such as zero-shot classification, semantic localization, and image-text retrieval. Moreover, on tasks that use only the vision encoder, such as linear probing and k-NN classification, our model demonstrated superior performance compared to those relying on domain-specific vision-language datasets.
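
As a rough illustration of the annotation-free curation idea described in the abstract, the sketch below captions remote-sensing image tiles with an off-the-shelf image captioning model and writes out image-text pairs. The specific captioning checkpoint (Salesforce/blip-image-captioning-base), the tile directory, and the JSONL output format are illustrative assumptions; the abstract does not name the image decoding model or the details of the curation pipeline.

```python
# Minimal sketch of annotation-free vision-language pair curation:
# caption each remote-sensing tile with a pretrained captioning model
# instead of relying on human-written labels.
# Assumptions: the BLIP checkpoint, the tile directory, and the JSONL
# output format are illustrative stand-ins, not the paper's pipeline.
import json
from pathlib import Path

import torch
from PIL import Image
from transformers import BlipProcessor, BlipForConditionalGeneration

device = "cuda" if torch.cuda.is_available() else "cpu"
processor = BlipProcessor.from_pretrained("Salesforce/blip-image-captioning-base")
model = BlipForConditionalGeneration.from_pretrained(
    "Salesforce/blip-image-captioning-base"
).to(device)
model.eval()

tile_dir = Path("vhr_tiles")                 # hypothetical folder of VHR image tiles
output_path = Path("image_text_pairs.jsonl")

with output_path.open("w") as f, torch.no_grad():
    for tile_path in sorted(tile_dir.glob("*.png")):
        image = Image.open(tile_path).convert("RGB")
        inputs = processor(images=image, return_tensors="pt").to(device)
        out = model.generate(**inputs, max_new_tokens=40)
        caption = processor.decode(out[0], skip_special_tokens=True)
        # Each line becomes one vision-language pair for contrastive training.
        f.write(json.dumps({"image": str(tile_path), "text": caption}) + "\n")
```

Pairs curated this way could then be fed to a CLIP-style contrastive objective, with zero-shot classification, image-text retrieval, linear probing, and k-NN evaluation applied to the trained encoders, as outlined in the abstract.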
