THE EXPLAINABILITY OF GRADIENT-BOOSTED DECISION TREES FOR DIGITAL ELEVATION MODEL (DEM) ERROR PREDICTION

: Gradient boosted decision trees (GBDTs) have repeatedly outperformed several machine learning and deep learning algorithms in competitive data science. However, the explainability of GBDT predictions especially with earth observation data is still an open issue requiring more focus by researchers. In this study, we investigate the explainability of Bayesian-optimised GBDT algorithms for modelling and prediction of the vertical error in Copernicus GLO-30 digital elevation model (DEM). Three GBDT algorithms are investigated (extreme gradient boosting - XGBoost, light boosting machine – LightGBM, and categorical boosting – CatBoost), and SHapley Additive exPlanations (SHAP) are adopted for the explainability analysis. The assessment sites are selected from urban/industrial and mountainous landscapes in Cape Town, South Africa. Training datasets are comprised of eleven predictor variables which are known influencers of elevation error: elevation, slope, aspect, surface roughness, topographic position index, terrain ruggedness index, terrain surface texture, vector roughness measure, forest cover, bare ground cover, and urban footprints. The target variable (elevation error) was calculated with respect to accurate airborne LiDAR. After model training and testing, the GBDTs were applied for predicting the elevation error at model implementation sites. The SHAP plots showed varying levels of emphasis on the parameters depending on the land cover and terrain. For example, in the urban area, the influence of vector ruggedness measure surpassed that of first-order derivatives such as slope and aspect. Thus, it is recommended that machine learning modelling procedures and workflows incorporate model explainability to ensure robust interpretation and understanding of model predictions by both technical and non-technical users.


Background
For decades, global digital elevation models (DEMs) have been gaining wide traction in the remote sensing, geospatial and allied communities. Due to their wide-area coverage and analysis-ready formats, many practitioners and mapping organisations have adopted global DEMs for a wide variety of uses such as the production of topographic and navigation maps, terrain analysis in geomorphology, hydrological analysis and modelling, modelling earth movements, mapping natural hazards, civil engineering and construction, urban planning and modelling, forest ecology, creation of digital twins, rendering three dimensional (3D) visualisations, and the reduction of gravity measurements.
The accuracies of global DEMs are influenced by several factors including land cover and terrain irregularities. This compromises their quality and adequacy for hydrological and environmental applications where precise and accurate terrain information is needed. Thus, researchers have adopted a variety of methods including machine learning (ML) for enhancing their qualities (e.g., Wendi et al., 2016;Kulp and Strauss, 2018;Kim et al., 2020;Hawker et al., 2022;Hu and Ji, 2022). These studies recorded modest achievements in terms of error reduction in the output DEMs. However, there is still a limited understanding of the explainability of the models and their predicted outcomes. This includes the interdependence between terrain and land cover parameters used in DEM correction, including their influences on the predicted targets (elevation error).
Explainability is becoming a major requirement for the deployment of machine learning models. Understanding why a model makes a certain prediction can be as crucial as the prediction's accuracy (Lundberg and Lee, 2017). Thus, despite their high-level performance, many state-of-the-art ML models are viewed as black boxes which are difficult to explain or interpret (Kim, 2021).

Explainable Artificial Intelligence
Explainable Artificial Intelligence (XAI) is concerned with methods that explain and interpret the performance of machine learning algorithms. Explainability is associated with the internal mechanics and logic that are within a machine learning system while interpretability deals with the intuition behind the model outputs (Adadi and Berrada, 2018;Linardatos et al., 2020). Machine learning-based feature importance techniques are used to explain the importance or influence of the input variables, and provide an understanding of the predictions made by complex models (Nordin et al., 2023). The explainability of machine learning models has received so much wide attention in recent years that XAI has become a domain in the AI community (Saleem et al., 2022;Nauta et al., 2023).

Study Area
The study area for this research is Cape Town, South Africa's most south-western city, with a land area of about 400 km 2 (Orimoloye et al., 2019). Cape Town is situated on the southwestern coast of the Western Cape Province. The coastline varies from sandy to rocky, to steep and mountainous in places. The town has a high landscape-level diversity and is biophysically diverse with rivers, wetlands, and coastal areas (Goodness and Anderson, 2013). Two different landscapes are considered for this study (Figure 1), as discussed below.

Urban/industrial
The urban and industrial areas are located within the Cape Town metropolis, a large urban area with a high population density, multiple business districts and industrial areas. The selected areas include the Parow and Elsies River Industrial areas, and several residential and commercial buildings.

Mountain
The high flat-topped Table Mountain forms part of a much larger mountain chain known as the Cape Fold Belt (Meadows and Compton, 2015). It rises steeply from the Atlantic Ocean and has a cliff face that lies at the northern end of the Cape Peninsula.

Datasets 2.2.1
Digital elevation models: Copernicus GLO-30 DEM is derived from the WorldDEM data. The WorldDEM data product is based on the radar satellite data which was acquired during the TanDEM-X Mission. Copernicus GLO-30 is vertically referenced to EGM2008.
The City of Cape Town (CoCT) airborne LiDAR-derived DEM is used as the reference dataset. The height accuracy of the point cloud used for generating the CoCT DEM is 0.15 m. The LiDAR DEM is spatially referenced to the Hartebeesthoek94 horizontal co-ordinate system and vertically referenced to the South Africa (SA) Land Levelling Datum (SAGEOID2010).

Terrain and land cover parameters:
To characterise the influence of the terrain on the elevation error, the following additional input variables were selected: elevation, slope, aspect, surface roughness, topographic position index (TPI), terrain ruggedness index (TRI), terrain surface texture (TST), vector ruggedness measure (VRM), percentage forest canopy, percentage bare ground cover, and urban footprints. The forest and bare ground data were sourced from the global land analysis and discovery (GLAD) database (Hansen et al., 2013). The urban footprints were derived from the global urban footprints (GUF) dataset (Esch et al., 2010(Esch et al., , 2013. The elevation errors or differences (∆ ) between Copernicus DEM and the reference LIDAR were calculated as follows:

Gradient Boosted Decision Trees
Gradient Boosted Decision Trees (GBDTs) are very powerful for classification and regression problems and achieve state-of-theart results in a variety of practical tasks. Specialised algorithms based on gradient boosting include extreme gradient boosting (XGBoost), light boosting machine (LightGBM) and categorical boosting (CatBoost). GBDT algorithms are based on the tree structure used in the decision tree (DT) algorithm (Aksoy and Genc, 2023).

XGBoost:
XGBoost is a scalable end-to-end tree boosting system that is commonly used by data scientists and provides state-of-the-art results on many problems, and has excelled in numerous data mining and machine learning challenges (Chen and Guestrin, 2016).

LightGBM:
LightGBM addresses the efficiency and scalability drawbacks of previous engineering optimizations used in GBDTs.

CatBoost:
CatBoost is a relatively new member of the family of GBDT machine learning ensemble techniques which debuted in 2018, and is well-suited to machine learning problems involving categorical, heterogeneous data (Hancock and Khoshgoftaar, 2020). Lundberg and Lee (2017) proposed the Shapley value for determining feature importance of input variables. The Shapley value is a fair profit allocation among the input features depending on their relative contributions to the outcome (Roth 1988;Nohara et al. 2022). The Shapley value of a feature, n, in each dataset, D, can be computed. The feature value is replaced by another value that is obtained from a different instance of the model. Every possible outcome is considered in the comparison

Machine Learning Explainability with SHAP
Where, f (x+j) is the prediction for target x with a random number of feature values, including jth feature, and f (x−j) is the coalition without the jth feature.
Generally, the Shapley value of the jth feature that is target x is (Gebreyesus et al., 2023): ( 3) The importance of all the features is derived in the same manner and the ranking is done based on their Shapley value. SHapley Additive exPlanations (SHAP) adopts Shapley values for constructing additive explanatory models that improve and facilitate the interpretation of machine learning algorithms (Lu et al. 2023). SHAP is model-agnostic and is based on the cooperative game-theory (Anjum et al. 2022;Nordin et al. 2023;Palar et al. 2023;Lu et al. 2023). Using SHAP, feature interactions and underlying patterns in the data can be revealed (Gebreyesus et al. 2023).

Data Preparation
To

Data Processing
The three GBDT models were trained using the elevation, slope, aspect, surface roughness, topographic position index, terrain ruggedness index, terrain surface texture, vector ruggedness measure, percentage forest canopy, percentage bare ground cover and urban footprints as input parameters, and the elevation error as the target variable (or predictand). For the hyperparameter tuning, Bayesian optimisation was adopted. Nine sets of hyperparameters were selected for each model, with 10 steps of random exploration and 50 steps (iterations) of Bayesian optimisation. After training and testing, the models were saved, loaded and implemented for predicting the DEM errors at separate implementation sites with similar terrain characteristics. The predicted elevation errors were applied for deriving corrected DEMs (i.e.., = − ∆ ) at the implementation sites. The accuracies of the corrected DEMs were assessed using the mean absolute error and the root mean square error (RMSE) metrics. SHAP (Shapley Additive Explanations) is adopted in this research as an excellent measure to explain individual predictions of the tuned models.

Results of DEM Correction
All three GBDT algorithms showed satisfactory prediction performance and a good capability for predicting the DEM error. In the correction of Copernicus DEM at the model implementation sites, several accuracy improvements were recorded (Table 1; Figures 1 -2). For example in the urban/industrial area, the MAE reduced by 53.1% from 1.087 m to 0.509 m (XGBoost and CatBoost) and RMSE reduced by 47.2% from 1.479m to 0.781m (XGBoost). In the mountainous area, there was 15.8% MAE reduction (from 4.848 m to 4.080 m) and 10.1% RMSE reduction (from 10.237 m to 9.206 m). Figures  1 and 2 show a comparison of the absolute height errors of the original Copernicus DEM versus corrected Copernicus DEMs for 500 selected points in the urban/ industrial and mountain landscapes respectively.

Urban/Industrial area
The SHAP plots in Figures 3 and 4 provide the understanding of how the values of the feature variables are impacting the predictability of the model for desired outputs. It shows here for example that higher values of most of the predictors affect the model predictability in a positive direction, while lower values of the feature variables impact the model predictability in the negative direction. The six most influential features in the urban/industrial area are the topographic position index (TPI), terrain surface texture (TST), urban footprints, tree cover, vector ruggedness measure (VRM) and terrain ruggedness index (TRI). It is also obvious from the plots that TPI has a nearly equal extent of SHAP values in both positive and negative directions, which connotes that its feature values could impact the model predictability in positive and negative directions equally and should be dealt with carefully. This makes it more influential to the model than the rest as shown in Figure 4.
Landscapes with relatively very high/very low TPI are associated with increased ruggedness, and higher positive or negative SHAP values. This could explain the exceeding influence of TPI which is shown to be the highest influencer in the urban area. However, Figure 4 -M-3-2023ASPRS 2023Annual Conference, 13-15 February & 12-15 June 2023 most of the other predictors have more positive impact on the model predictability than the negative impact. Interestingly, the popular first order terrain parameters of slope and aspect which were frequently used in previous studies are the least influential based on the SHAP analysis. In many cities, buildings are constructed on land with gentle slope or gradient (Zhou et al., 2021), and possibly the slope variability could be moderate in urban areas. The likely explanation for the lower influence of slope and aspect could also be related to the characteristic cover or draping of buildings over the land surface in the urban area which makes the slope or aspect transition or changes less conspicuous when viewed from above. However, other terrain  For example, the VRM caters for the heterogeneity of both slope and aspect, and incorporates 3D dispersion of vectors normal (orthogonal) to planar facets on a landscape (Sappington et al., 2007). Moreover, VRM can differentiate among terrain types (e.g., differentiating smooth, steep hillsides from irregular terrain) and enables the treatment of terrain components as separate variables when quantifying landscapes (Sappington et al., 2007;Welty et al., 2018).

Mountain landscape
In the mountainous landscape, the impact and reactions of the predictors and their values are slightly different from their impact on the model predictability in the urban/industrial area. For example, in XGBoost and LightGBM, tree cover has more influence on the model predictability than the TPI (Figures 5 and  6). Higher values of tree cover impacted the predictability of the model in the positive direction while TPI maintained almost similar influence it had in the urban/industrial area, impacting the model performance in both positive and negative directions almost equally. It is also worth noting that in the mountainous landscape, the features degree influence on the model is slightly different in CatBoost algorithm as they were in XGBoost and LightGBM. Bare ground appears to have the highest impact on the model predictability in CatBoost. However, tree cover and TPI still maintain quite similar high degree of influence and directions of impact as they had in the other two models.
Tree cover, TPI, bare ground, elevation, slope and aspect are the six most influential features in this mountainous landscape. Table  Mountain has an unusually rich array of trees and shrubland which are interspersed with bare rocky surfaces. This spatial variation in the tree cover on Table Mountain is spatially correlated with the DEM error. A common feature of mountains is their characteristic steepness (slope angle to the horizontal). Thus, slope and aspect which were less influential in the urban area are more influential in the mountainous area. The characteristic of the mountain which extrudes out of the surrounding areas could also explain this pattern.

CONCLUSION
The corrections achieved significant accuracy gains in Copernicus DEM thus proving the potential of GBDTs for DEM correction. We adopted the SHAP method for interpreting the hyperparameter-optimised GBDT models that were configured for predicting the DEM error. The interpretation by SHAP revealed some underlying relationships among the input features. This knowledge on feature explainability is important to inform the feature selection process, and for understanding cause-andeffect relationships between variables for prediction of DEM error with machine learning. Further research can compare the impact of hyperparameter tuning on feature importance. Also, the variations in feature importance between different classes of ML/DL models such as tree-based algorithms versus neural networks can be investigated.