MultiTrans-LC: Multimodal Fusion Transformer for Remote Sensing Land Cover Classification

Wang, Qixuan; Li, Ning; Chen, Yiheng; Zhu, Hainiu

doi:https://doi.org/10.5194/isprs-archives-XLVIII-1-W5-2025-133-2025

Articles | Volume XLVIII-1/W5-2025

© Author(s) 2025. This work is distributed under
the Creative Commons Attribution 4.0 License.

05 Nov 2025

Keywords: Multimodal Fusion, Transformer, Remote Sensing, Land Cover Classification

Abstract. Land cover classification from remote sensing images is crucial for environmental monitoring, urban planning, and sustainable resource management. Despite advances in deep learning, existing methods produce blurred boundaries in complex landscapes and perform poorly on small or overlapping land cover categories. This article introduces MultiTrans-LC, a novel multimodal fusion framework that integrates vision-language interaction and boundary-aware optimization to address these challenges. The proposed architecture uses a hierarchical Transformer encoder to extract global visual features from high-resolution images and aligns them with the semantic embeddings of text prompts through cross-modal attention. A vision-language decoder further refines the multi-scale feature representation through progressive fusion, while an edge-aware loss function jointly optimizes pixel-level classification and boundary localization. Experiments on three benchmark datasets (GID-15, LoveDA, RSSCN7) demonstrate state-of-the-art performance: on GID-15, MultiTrans-LC achieves an overall accuracy (OA) of 90.7% and a Kappa coefficient of 0.901, with the OA 1.6% higher than that of the leading method. Visualizations confirm that MultiTrans-LC compares favorably with CNN and Transformer baselines. By bridging visual and textual semantics, MultiTrans-LC improves the accuracy of large-scale land cover mapping and provides a powerful solution for geospatial intelligence applications. Limitations and future directions, including open-vocabulary classification and edge-device deployment, are also discussed.
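
The two metrics quoted in the abstract, overall accuracy and the Kappa coefficient, are both derived from a class confusion matrix. The sketch below shows the standard definitions (Cohen's kappa) in plain Python. It is an illustration only, not the authors' evaluation code: the function name and the small 2x2 matrix are hypothetical and do not come from the paper.

```python
def overall_accuracy_and_kappa(confusion):
    """Return (overall accuracy, Cohen's kappa) for a square confusion matrix.

    confusion[i][j] is the number of pixels whose true class is i and
    whose predicted class is j. (Hypothetical helper for illustration.)
    """
    n = len(confusion)
    total = sum(sum(row) for row in confusion)
    if total == 0:
        raise ValueError("confusion matrix contains no samples")

    # Observed agreement: fraction of pixels on the diagonal (this is OA).
    observed = sum(confusion[i][i] for i in range(n)) / total

    # Chance agreement: sum over classes of (row total * column total) / total^2.
    row_totals = [sum(row) for row in confusion]
    col_totals = [sum(confusion[i][j] for i in range(n)) for j in range(n)]
    expected = sum(r * c for r, c in zip(row_totals, col_totals)) / (total * total)

    if expected == 1.0:
        # Happens when every pixel has the same true and predicted class.
        raise ValueError("kappa is undefined when chance agreement equals 1")

    kappa = (observed - expected) / (1.0 - expected)
    return observed, kappa


if __name__ == "__main__":
    # Made-up two-class example, not data from the paper.
    example = [[50, 10],
               [5, 35]]
    oa, kappa = overall_accuracy_and_kappa(example)
    print(f"OA = {oa:.3f}, kappa = {kappa:.3f}")
```

Kappa discounts the agreement expected by chance from the class marginals, so it is never higher than OA and falls further below it when a few classes dominate the scene.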
