DATA IMBALANCE IN LANDSLIDE SUSCEPTIBILITY ZONATION: UNDER-SAMPLING FOR CLASS-IMBALANCE LEARNING
Keywords: Landslide susceptibility zonation (LSZ), Imbalanced learning, Under-sampling, Artificial neural network (ANN), Fisher discriminant analysis (FDA), Logistic regression (LR)
Abstract. Machine learning methods such as artificial neural networks and support vector machines require a large amount of training data; however, the number of landslide occurrences in a study area is limited. The limited number of landslides leads to a small number of positive-class pixels in the training data. In contrast, the number of non-landslide pixels (negative-class pixels) is enormous. This under-representation and severe skew in the class distribution create a data imbalance for learning algorithms and yield suboptimal models that are biased towards the majority class (non-landslide pixels) and perform poorly on the minority class (landslide pixels).
In this work, we use two algorithms, EasyEnsemble and BalanceCascade, to balance the data. The balanced data are used with feature selection methods such as Fisher discriminant analysis (FDA), logistic regression (LR) and artificial neural network (ANN) to generate LSZ maps. The results of the study show that ANN with balanced data yields major improvements in the susceptibility maps over imbalanced data, whereas the LR method is adversely affected by the data balancing algorithms. FDA does not show significant changes between balanced and imbalanced data.
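As an illustration of the under-sampling idea behind EasyEnsemble, the following minimal Python sketch draws several balanced training subsets: each keeps every minority (landslide) sample and pairs it with an equally sized random draw from the majority (non-landslide) class. The function name, the toy pixel IDs and the parameter values are assumptions made for illustration only; they are not taken from this study, and the sketch omits the per-subset classifier training and ensemble combination that complete the algorithm.

```python
import random


def easy_ensemble_subsets(minority, majority, n_subsets=4, seed=0):
    """Return n_subsets balanced (minority, majority_draw) pairs.

    Hypothetical helper: each pair contains all minority samples and a
    random draw, without replacement, of the same size from the majority.
    """
    rng = random.Random(seed)
    k = min(len(minority), len(majority))  # size of each majority draw
    subsets = []
    for _ in range(n_subsets):
        drawn = rng.sample(majority, k)  # independent draw per subset
        subsets.append((list(minority), drawn))
    return subsets


# Toy data (assumed): integer IDs standing in for labelled pixels.
landslide = list(range(5))               # minority class
non_landslide = list(range(100, 200))    # majority class
subsets = easy_ensemble_subsets(landslide, non_landslide, n_subsets=3, seed=42)
```

In a full pipeline, one classifier would be trained on each balanced subset and the individual predictions combined, so that most majority-class samples are still seen by at least one ensemble member.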