Adapting Semi-Supervised Segmentation Methods to Multimodal Remote Sensing Data
Keywords: Pseudo-labeling, Consistency Regularization, Contrastive Learning, Multimodal Fusion, Aerial Imagery
Abstract. Remote sensing (RS) imagery is important for applications ranging from land cover and land use (LCLU) mapping to agriculture and forest monitoring. However, high-quality labeled reference data for training supervised learning (SL) models remain scarce. Semi-supervised learning (SSL) frameworks, such as UniMatch (Yang et al., 2023), address this limitation with pseudo-labeling and consistency regularization. Related methods have been adapted to RS: LSST (Lu et al., 2022) refines pseudo-labels with adaptive class-specific thresholds, while RS-DWL (Huang et al., 2024) mitigates noise and class imbalance through decoupled learning and confidence-based weighting. Despite these advances, SSL applications to multimodal RS imagery remain underexplored. We address this gap by adapting the SSL framework UniMatch to incorporate diverse encoders and multimodal remote sensing data for LCLU segmentation. We experimented on FLAIR-2 (Garioud et al., 2023), a dataset that combines very high-resolution aerial imagery (RGB) with near-infrared (NIR) data and elevation measurements (above-ground height). A transformer encoder achieved the best segmentation results in both the SL and SSL scenarios. When comparing RGB-only with multimodal data, we observed that some classes, such as “buildings”, “water”, and “coniferous”, benefited from the inclusion of NIR and elevation information. In the semi-supervised experiments, where only half of the data was labeled and the remaining half was treated as unlabeled (simulating a real-world scenario), the multimodal SSL approach outperformed a fully supervised learning (FSL) baseline trained on the labeled half alone. These results highlight the strong potential of data fusion in RS applications with limited labeled data.
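To make the setup concrete, below is a minimal PyTorch-style sketch of one way such a pipeline can combine the two ingredients named above: multimodal early fusion (RGB, NIR, and elevation stacked channel-wise into a 5-channel input) and a UniMatch-style weak-to-strong consistency loss masked by pseudo-label confidence. The `model` interface, the channel-dropout perturbation, and the 0.95 threshold are illustrative assumptions, not the exact configuration used in the paper.

```python
import torch
import torch.nn.functional as F

def unimatch_style_step(model, x_labeled, y_labeled, x_unlabeled,
                        conf_threshold=0.95):
    """One training step of a UniMatch-style SSL loss on multimodal input.

    model: maps (B, 5, H, W) -> (B, num_classes, H, W), where the 5 channels
    stack RGB (3), NIR (1), and elevation (1), e.g.
        x = torch.cat([rgb, nir, elevation], dim=1)
    """
    # Supervised branch: plain cross-entropy on the labeled half.
    sup_loss = F.cross_entropy(model(x_labeled), y_labeled, ignore_index=255)

    # Weak view: per-pixel pseudo-labels from the (near-)unperturbed
    # unlabeled images, with their softmax confidences.
    with torch.no_grad():
        probs = torch.softmax(model(x_unlabeled), dim=1)
        conf, pseudo = probs.max(dim=1)  # both (B, H, W)

    # Strong view: a random per-channel dropout stands in here for the
    # photometric and feature perturbations used by UniMatch.
    keep = torch.rand(x_unlabeled.shape[:2], device=x_unlabeled.device) > 0.2
    x_strong = x_unlabeled * keep[:, :, None, None]

    # Consistency branch: cross-entropy against the pseudo-labels, keeping
    # only pixels whose weak-view confidence clears the threshold.
    unsup = F.cross_entropy(model(x_strong), pseudo, reduction="none")
    unsup_loss = (unsup * (conf >= conf_threshold)).mean()

    return sup_loss + unsup_loss
```

Early fusion by channel stacking is the simplest way to reuse an existing single-image SSL pipeline for multimodal data: only the encoder's first convolution needs its input channels changed from 3 to 5, and the rest of the pseudo-labeling and consistency machinery is unchanged.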