EFFECT OF DATA QUALITY ON WATER BODY SEGMENTATION WITH DEEPLABV3+ ALGORITHM
Keywords: Data Quality, Image Segmentation, Deep Learning, DeepLab, Satellite Images, Water Body
Abstract. Training Deep Learning (DL) algorithms for segmenting features require hundreds to thousands of input data and corresponding labels. Generating thousands of input images and labels requires considerable resources and time. Hence, it is common practice to use opensource imagery data and labels available online. Most of these open-source data have little or no metadata describing their quality or suitability making it problematic for training or evaluating DL models. This study evaluated the effect of data quality on training DeepLabV3+, using Sentinel 2 A/B RGB images and labels obtained from Kaggle. We generated subsets of 256 × 256 pixels, and 10% of these images (802) were set aside for testing. First, we trained and validated the DeepLabV3+ model with the remaining images. Second, we removed images with incorrect labels and trained another DeepLabV3+ network. Finally, we trained the third DeepLabV3+ network after removing images with turbid water or with floating vegetation. All three trained models were evaluated with test images and then we calculated accuracy metrics. As the quality of the input images improved, accuracy of the predicted masks generated from the first model increased from 92.8% to 94.3% in the second model. The third model’s accuracy was 96.4%, demonstrating the network’s ability to better learn and predict water bodies when the input data had fewer class variations. Based on the results we recommend assessing the quality of open-source data for incorrect labels and variations in the target class prior to training DeepLabV3+ or any other DL network.