VERIFICATION AND RISK ASSESSMENT FOR LANDSLIDES IN THE SHIMEN RESERVOIR WATERSHED OF TAIWAN USING SPATIAL ANALYSIS AND DATA MINING

Spatial information technologies and data can be used effectively to investigate and monitor natural disasters continuously and to support policy- and decision-making for hazard prevention, mitigation and reconstruction. However, in addition to the rapidly growing data volume, spatial data usually come from different sources with different formats and characteristics. It is therefore necessary to extract useful and valuable information that may not be obvious in the original data sets. This paper presents the preliminary results of research on the validation and risk assessment of landslide events induced by heavy torrential rains in the Shimen reservoir watershed of Taiwan using spatial analysis and data mining algorithms. In this study, eleven factors were considered, including elevation (Digital Elevation Model, DEM), slope, aspect, curvature, NDVI (Normalized Difference Vegetation Index), fault, geology, soil, land use, river and road. The experimental results indicate that the overall accuracy and kappa coefficient in verification reach 98.1% and 0.8829, respectively. However, the trained DT model is too over-fitted to perform reliable prediction. To address this issue, a mechanism was developed to filter uncertain data based on the standard deviation of the data distribution. Experimental results demonstrate that after filtering the uncertain data, the kappa coefficient in prediction increased substantially by 29.5%. The results indicate that spatial analysis and data mining algorithms, combined with the mechanism developed in this study, can produce more reliable results for the verification and forecast of landslides in the study site.


INTRODUCTION
Taiwan has complicated geological conditions, a high population density and other latent factors that make it vulnerable to natural hazards. Its geological structures have become highly fractured since the 1999 Chi-Chi earthquake. Moreover, typhoons and other extreme weather events frequently strike the region. Heavy rainfall often triggers serious landslides and debris flows, causing human casualties and property damage. For these reasons, the World Bank has listed Taiwan as one of the countries most vulnerable to natural disasters in the world in terms of the land and population exposed to danger. Preventing and mitigating natural hazards such as landslides has therefore become an important issue in Taiwan.
Spatial information technologies and data such as remotely sensed images, LiDAR point clouds and GIS datasets can be used effectively to investigate and monitor natural disasters continuously and to support policy- and decision-making for hazard prevention, mitigation and reconstruction. In addition, previous studies (e.g. Sarkar & Kanungo, 2004; Metternicht et al., 2005; Nichol & Wong, 2005; Tsai & Chen, 2007; Peduzzi, 2010) have demonstrated that geo-informatics techniques can successfully support investigations in LULC (Land Use/Land Cover) and natural disaster applications. However, in addition to the rapidly growing data volume, these spatial data usually come from different sources with different formats and characteristics. It is therefore necessary to develop effective algorithms that extract useful and valuable information, which may not be obvious in the original datasets, from such complicated data for efficient analysis.
Data Mining (DM) is an important and effective technique in the field of Knowledge Discovery (KD), whose primary objective is to extract knowledge from vast data collections, databases or data warehouses. It may therefore be a viable solution for identifying possible landslide factors from heterogeneous spatial datasets. In addition, Spatial Analysis (SA) can supply DM with advanced information through overlay, buffer and other GIS processes. The Decision Tree (DT) algorithm is a classical, universal and comprehensible method in the DM domain. The outcome of a DT is a set of "if-then" rules, and these rule sequences are helpful in understanding the causes of, and interactions among, the causative factors in the landslide records. Based on these components, this research adopted and developed DT and SA algorithms for the validation and forecast (risk assessment) of landslide events induced by heavy rainfall at a regional scale.
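To make the rule-extraction idea concrete, the minimal sketch below trains a small decision tree on a few per-cell factor records and prints the resulting if-then rules. It uses scikit-learn's CART-style learner only as a stand-in for the C4.5-type model used in this study; all factor values and class labels are illustrative, not data from the study site.

# Illustrative only: scikit-learn's CART-style tree stands in for the
# J48/C4.5 model used in this study; records and labels are hypothetical.
import numpy as np
from sklearn.tree import DecisionTreeClassifier, export_text

# Each record is one grid cell: [elevation, slope, aspect, ndvi, dist_river]
X = np.array([
    [1200.0, 35.2, 180.0, 0.21,  40.0],
    [ 850.0, 12.4,  90.0, 0.78, 300.0],
    [2100.0, 41.7, 225.0, 0.15,  25.0],
    [ 600.0,  8.9,  45.0, 0.82, 450.0],
])
y = np.array([1, 0, 1, 0])  # 1 = landslide, 0 = non-landslide

tree = DecisionTreeClassifier(max_depth=3, min_samples_leaf=1)
tree.fit(X, y)

# The fitted model is a readable set of "if-then" rules over the factors.
print(export_text(tree, feature_names=[
    "elevation", "slope", "aspect", "ndvi", "dist_river"]))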

STUDY SITE AND MATERIALS
The Shimen reservoir watershed (see Figure 1), which covers a region of about 763.4 km² in Taiwan, was selected as the study site of this research. The elevation in the study site ranges from 250 m to 3,500 m. The primary land cover is forest, but there are limited agricultural activities. Landslides are commonly induced by heavy rainfall in the area, and the resulting debris flows are flushed into the reservoir, causing various problems in water supply and resource management. Previous studies related to landslides (e.g. Sidle et al., 1985; Wu and Sidle, 1995; Zhou et al., 2002; Dahal et al., 2008) divided the causative factors of landslides into latent and triggering categories. This study does not focus on different types of landslides; instead, it explores the knowledge of landslides induced by heavy torrential rains using data mining and spatial analysis. Therefore, eleven latent factors were considered, including elevation (Digital Elevation Model, DEM), slope, aspect, curvature, NDVI (Normalized Difference Vegetation Index), fault, geology, soil, land use, river and road. In addition, landslide inventories from 2004 to 2008 were adopted for extracting each causative factor. Detailed information on the selected factors is listed in Table 1.

PROCEDURE AND METHODOLOGY
The objective of this research is to perform validation and risk assessment of landslide events in the study site using SA and DT algorithms. There are four primary steps, i.e. data pre-processing and integration, analytic strategies, kernel computation and results, as illustrated in Figure 2. In the data pre-processing and integration step, eleven factors were considered, including both vector and raster data. Because the algorithm is record- or grid-based, vector data need to be rasterized. Furthermore, all data were pre-processed to remove null values and noise, and resampled to the same cell size; in this case, a 10 m by 10 m pixel size was used. Subsequently, factors that provide advanced information after SA, such as aspect, curvature and slope, were derived from the DEM; NDVI was produced from the original satellite images and normalized with PIFs; and the distance from each pixel to the nearest target was generated from the GIS poly-lines of rivers and roads (see the sketch below). Finally, the pre-processed data were integrated for subsequent analysis. There are two important steps in the DT operator. The first is to grow the tree, i.e. branches are separated by computing and comparing the degree of impurity of each condition attribute. The second is to prune the tree; the purpose of pruning is to prevent the DT model from becoming too over-fitted to perform validation and prediction. However, the computation of the degree of impurity differs between nominal (or discrete) and quantitative (or continuous) data because of their data structures. In general, information gain and the Gini index are the major measures of the degree of impurity; the former is defined in Eq. (1) and the latter in Eq. (2). This study utilized the J48 algorithm in the WEKA software (http://www.cs.waikato.ac.nz/ml/weka/) to build the decision tree because it can handle nominal and quantitative data at the same time.
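As an illustration of the derivation step, the short sketch below computes slope and aspect grids from a DEM array by finite differences, assuming a 10 m cell size. It is a simplified stand-in for the GIS tools actually used; aspect conventions vary between packages.

# A minimal sketch of the raster pre-processing step: derive slope and
# aspect from a DEM grid resampled to a 10 m cell size. Real pipelines
# would use GIS tools (e.g. GDAL); this only shows the arithmetic.
import numpy as np

def slope_aspect(dem: np.ndarray, cell_size: float = 10.0):
    """Return slope (degrees) and aspect (degrees, one common convention)."""
    dz_dy, dz_dx = np.gradient(dem, cell_size)  # finite differences per axis
    slope = np.degrees(np.arctan(np.hypot(dz_dx, dz_dy)))
    aspect = np.degrees(np.arctan2(-dz_dx, dz_dy)) % 360.0
    return slope, aspect

dem = np.array([[250.0, 260.0, 275.0],
                [255.0, 270.0, 290.0],
                [265.0, 285.0, 310.0]])
slope, aspect = slope_aspect(dem)
print(slope.round(1))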

I(p, n) = -\frac{p}{p+n}\log_2\frac{p}{p+n} - \frac{n}{p+n}\log_2\frac{n}{p+n}

E(A) = \sum_{i=1}^{v}\frac{p_i + n_i}{p + n} I(p_i, n_i)

Gain(A) = I(p, n) - E(A)    (1)

where p = number of positive records in the decision attribute; n = number of negative records in the decision attribute; I(p, n) = entropy of all condition attributes; A = one of the condition attributes; v = number of different contents in a specific attribute; E(A) = entropy of a specific condition attribute; Gain(A) = information gain of a specific condition attribute.

Gini_A(D) = \frac{N_1}{N}\Big(1 - \sum f_{A \le D}^2\Big) + \frac{N_2}{N}\Big(1 - \sum f_{A > D}^2\Big)    (2)

where D = segmented point dividing the continuous data into two parts; A = one of the condition attributes; f = relative frequency of the positive and negative records in the A ≤ D or A > D range; N = total number of records in the training data; N_1, N_2 = total numbers of records in the A ≤ D and A > D domains, respectively.
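As a concrete check of these definitions, the following short functions (a sketch, not the WEKA implementation) compute Eq. (1) for a nominal attribute and Eq. (2) for a continuous attribute split at a point D; the example counts at the end are hypothetical.

# Hedged implementation of Eq. (1) and Eq. (2) for a binary decision
# attribute (landslide / non-landslide), following the notation above.
import numpy as np

def entropy(p: int, n: int) -> float:
    """I(p, n): entropy of a set with p positive and n negative records."""
    total = p + n
    if total == 0 or p == 0 or n == 0:
        return 0.0
    fp, fn = p / total, n / total
    return -fp * np.log2(fp) - fn * np.log2(fn)

def information_gain(partitions, p, n):
    """Gain(A) = I(p, n) - E(A); partitions = [(p_i, n_i), ...] for the
    v distinct values of a nominal attribute A."""
    e_a = sum((pi + ni) / (p + n) * entropy(pi, ni) for pi, ni in partitions)
    return entropy(p, n) - e_a

def gini_split(p1, n1, p2, n2):
    """Eq. (2): weighted Gini index for splitting a continuous attribute A
    at point D into A <= D (p1, n1) and A > D (p2, n2)."""
    N1, N2 = p1 + n1, p2 + n2
    N = N1 + N2
    gini = lambda p, n: 1.0 - (p / (p + n)) ** 2 - (n / (p + n)) ** 2
    return N1 / N * gini(p1, n1) + N2 / N * gini(p2, n2)

# Example: 8 landslide and 72 non-landslide cells, split by one attribute.
print(information_gain([(1, 60), (7, 12)], p=8, n=72))
print(gini_split(1, 60, 7, 12))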

Preliminary results
The evaluations of the check and prediction phases are shown in Table 3 (a) and (c), respectively. The check (verification) result is clearly satisfactory. However, the omission (100% - PA) and commission (100% - UA) errors of landslides are too high in the prediction (risk assessment) phase to obtain acceptable results (low kappa values). This may be caused by the interaction between uncertainty and multi-temporal problems in the different data sets obtained from multiple sources; the trained DT model is therefore too over-fitted to perform reliable prediction. To address this issue, a mechanism was developed to filter out uncertain data based on the standard deviation of the data distribution (see the next section and the sketch below).
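This study specifies only that uncertain data are filtered by the standard deviation of the data distribution; the sketch below assumes one plausible reading, keeping only records whose every attribute lies within k standard deviations of the corresponding training-data mean. The threshold k and the per-attribute rule are assumptions for illustration.

# Sketch of the filtering idea, assuming "uncertain" records are those
# whose attribute values fall outside k standard deviations of the
# training distribution; the exact threshold used in the study may differ.
import numpy as np

def filter_uncertain(train: np.ndarray, data: np.ndarray, k: float = 2.0):
    """Keep rows of `data` whose every attribute lies within
    mean +/- k*std of the corresponding training attribute."""
    mean = train.mean(axis=0)
    std = train.std(axis=0)
    mask = np.all(np.abs(data - mean) <= k * std, axis=1)
    return data[mask], mask

train = np.random.default_rng(0).normal(0.0, 1.0, size=(500, 5))
data = np.random.default_rng(1).normal(0.0, 1.5, size=(100, 5))
kept, mask = filter_uncertain(train, data)
print(f"kept {mask.sum()} of {len(data)} records")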

Results after filtering
The results of landslide validation and prediction after filtering are shown in Table 3 (b) and (d). The check (validation) result remains excellent after filtering. In the prediction results, the OA decreased because the PA of the non-landslide class was reduced. Extracting landslide knowledge is usually more complex than extracting non-landslide knowledge, but non-landslide samples greatly outnumber landslide samples and may therefore dominate the overall accuracy (OA), yielding a high apparent precision. However, the non-landslide records in prediction are reduced by the filtering process (see Table 4); this is why the OA and the PA of non-landslides decreased.
In addition, the PA and UA of landslides increased substantially, by 24.5% and 45% respectively, so the kappa coefficient improved significantly. In other words, this test indicates that the mechanism for filtering out uncertain data can produce more reliable verification and risk assessment for landslides.

CONCLUSIONS
The decision tree algorithm and spatial analysis were utilized to extract landslide knowledge for the validation and risk assessment of landslides in a watershed in Taiwan. Moreover, this paper presents a mechanism for filtering uncertain data from the data sets to improve the reliability of landslide predictions. The experimental results indicate that the OA and kappa coefficient in verification reach 98.1% and 0.8829, respectively. However, the trained DT models are too over-fitted to perform reliable prediction. After filtering the uncertain data, the PA and UA of landslides and the kappa coefficient in the prediction task increased substantially, by at least 20%. In conclusion, spatial analysis and data mining algorithms combined with the mechanism for filtering uncertain data can perform verification and forecast of landslides with more reliable results in the study site.

Figure 1. Study site

Figure 2. The procedure of this study

For the analytic strategies, this study used 2/3 of the collected landslide inventory from 2004 to 2007 as the training dataset; the remainder in the same period was used as check data. The trained data mining models were then used to predict landslide events in 2008, and the forecast was verified and evaluated against real landslides in 2008 supplied by the watershed authority as reference data. Non-landslide records were randomly sampled at approximately tenfold the number of landslide records. Landslide records for training, check and prediction are listed in Table 2. For the kernel computation, the DT algorithm was utilized. Finally, OA (Overall Accuracy), PA (Producer's Accuracy), UA (User's Accuracy) and the kappa coefficient were computed from the error (confusion) matrix for both the check and prediction data (see the sketch below).
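For reference, the short function below (a sketch; the class layout and counts are hypothetical) derives OA, PA, UA and the kappa coefficient from a 2x2 error matrix in the standard way.

# Hedged sketch of the assessment step: OA, PA, UA and kappa from a
# 2x2 error (confusion) matrix; the counts below are made up.
import numpy as np

def accuracy_metrics(cm: np.ndarray):
    """cm[i, j] = records of reference class i predicted as class j."""
    total = cm.sum()
    oa = np.trace(cm) / total                      # overall accuracy
    pa = np.diag(cm) / cm.sum(axis=1)              # producer's accuracy
    ua = np.diag(cm) / cm.sum(axis=0)              # user's accuracy
    pe = (cm.sum(axis=1) * cm.sum(axis=0)).sum() / total ** 2
    kappa = (oa - pe) / (1.0 - pe)                 # chance-corrected agreement
    return oa, pa, ua, kappa

cm = np.array([[180,  20],     # reference landslide
               [ 15, 785]])    # reference non-landslide
oa, pa, ua, kappa = accuracy_metrics(cm)
print(f"OA={oa:.3f}, PA={pa.round(3)}, UA={ua.round(3)}, kappa={kappa:.3f}")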

Table 1. All materials in this research

Among the selected factors, NDVI data were derived from original satellite images that were acquired under different radiometric and atmospheric conditions. Consequently, there may be biases if these NDVI values are compared or analyzed directly. Pseudo Invariant Features (PIFs) normalization is a relative radiometric correction method that performs a linear stretch or histogram matching based on PIFs (Schott et al., 1988). It is very convenient and suitable for analyzing multi-temporal NDVIs. Therefore, all the NDVI images were normalized using PIFs in this study, as sketched below.
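The sketch below illustrates one common form of PIF normalization, assuming a linear (gain/offset) fit over the pseudo-invariant pixels; the fitting choice and all arrays are illustrative, not the study's actual procedure.

# Sketch of PIF-based relative radiometric normalization: a linear
# gain/offset is fit over pseudo-invariant pixels and then applied to
# the whole subject image; all band values here are hypothetical.
import numpy as np

def pif_normalize(subject: np.ndarray, reference: np.ndarray,
                  pif_mask: np.ndarray) -> np.ndarray:
    """Least-squares linear fit reference ~ a*subject + b over PIF pixels."""
    x = subject[pif_mask].ravel()
    y = reference[pif_mask].ravel()
    a, b = np.polyfit(x, y, deg=1)
    return a * subject + b

rng = np.random.default_rng(0)
reference = rng.uniform(0.0, 1.0, size=(50, 50))   # reference-date band
subject = 0.8 * reference + 0.05                   # radiometrically biased image
pif_mask = rng.uniform(size=(50, 50)) < 0.1        # pseudo-invariant pixels
normalized = pif_normalize(subject, reference, pif_mask)
print(np.abs(normalized - reference).max())        # near zero for this toy case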

Table 2. Landslide records of training, check and prediction

Table 4. Records of test data