SPATIAL DOWNSCALING OF GPM IMERG V06 GRIDDED PRECIPITATION USING MACHINE LEARNING ALGORITHMS

Recent studies show that remote sensing data play a significant role in filling gaps in sparsely gauged regions, particularly at high elevations and over complex underlying surfaces. High-resolution precipitation estimates for sparsely gauged areas with complex terrain can be obtained by downscaling low-resolution satellite precipitation estimates using various environmental variables. In this paper, we downscale GPM IMERG V06 from a resolution of 0.1° × 0.1° (nearly 10 km) to 1 km × 1 km using four machine learning algorithms, namely Decision Trees, Multiple Linear Regression, Support Vector Regressor and Random Forest. The environmental variables are the Normalized Difference Vegetation Index (NDVI), topography, Land Surface Temperature (LST), and latitude and longitude. The framework downscales the 0.1° GPM IMERG precipitation product to 1 km by determining the importance of features and automatically optimizing the model parameters. Additionally, the downscaled precipitation products were validated against ground observations from rain gauge stations. Spatial downscaling can generally increase the accuracy of GPM IMERG gridded precipitation data, and the results indicate that spatial downscaling is an acceptable way of investigating precipitation over Taiwan.


INTRODUCTION
The global hydrological cycle cannot function without precipitation, which is crucial for maintaining the hydro-climate balance and ecosystem activities. Although rain gauges yield precise point-based measurements, it is difficult to extrapolate them into accurate maps at the basin scale, particularly when the distribution of rain gauges is uneven or for ungauged basins (Luo et al., 2019; Zhang et al., 2019). Rain gauges also serve as an evaluation tool for various precipitation products. Over the last few decades, rapid developments in remote sensing techniques have provided an opportunity to estimate spatially continuous precipitation on a global scale. Satellite-based precipitation products (SPPs) are openly available to the public for understanding precipitation characteristics in hydrometeorological applications, especially over areas with sparse rain gauges (Belabid et al., 2019; Pellet et al., 2019). Various SPPs have become available in recent decades, namely the Tropical Rainfall Measuring Mission (TRMM) Multi-satellite Precipitation Analysis (TMPA), the Climate Prediction Center Morphing technique (CMORPH), Precipitation Estimation from Remotely Sensed Information using Artificial Neural Networks (PERSIANN), and the Integrated Multi-satellitE Retrievals for Global Precipitation Measurement (IMERG). However, these SPPs have relatively low resolutions, varying from 0.1° to 0.5°, which is too coarse for fine-scale analysis. An efficient strategy for closing the gap between low/coarse and high/fine spatial resolution is to use spatial downscaling techniques. Two major downscaling techniques are available for SPPs: statistical downscaling, which constructs an empirical relationship between the object variable and auxiliary variables, and dynamical downscaling, which relies on a mathematical representation of the complex physical phenomena of the atmosphere, ocean and land (Sachindra & Perera, 2016).
A critical step in downscaling precipitation is selecting appropriate environmental variables and methods. The variables can be divided into dynamical variables, which change in space and time, and static variables, which remain constant over time. Numerous downscaling models have been developed in recent decades using Univariate Regression (UR), Multivariate Regression (MR) and Geographically Weighted Regression (GWR), but they fail to reflect the spatial heterogeneity of the relationship between precipitation and land surface characteristics. In this study, we employ machine learning algorithms (Decision Trees, Multiple Linear Regression, Support Vector Regressor and Random Forest) to downscale the IMERG satellite estimates using various environmental variables.

Study Area
The study area is located in East Asia, on the western side of the Pacific Ocean, and consists of the main island of Taiwan as well as small distant islands. The Pacific Ocean, the Bashi Channel, the Taiwan Strait and the East China Sea lie to the east, south, west and north of the main island, respectively. The study area stretches from 120°E to 122°E and from 22°N to 25°N and covers 36,197 km². The main island is dominated by mountain ranges in the east and gently sloping plains in the west, and it is approximately 394 km long from north to south and 144 km wide from west to east.

Datasets
This study utilized satellite-based precipitation estimates (GPM IMERG V06), vegetation indices (NDVI, EVI), a digital elevation model (SRTM), land cover (MCD12Q1), and Land Surface Temperature (LST). Their spatial and temporal resolutions are given in Table 1.

NDVI and Landcover
In this study, we used the 16-day NDVI product (MOD13A2) from the Moderate Resolution Imaging Spectroradiometer (MODIS) aboard the Terra satellite, with a spatial resolution of 1 km by 1 km. It can be downloaded from the NASA Land Processes Distributed Active Archive Center (https://lpdaac.usgs.gov/products/mod13a2v006/). Annual mean NDVI was calculated by averaging the 16-day NDVI data in a given year. Anomalous pixels (snow cover, urban areas and water bodies) should be removed so that the remaining NDVI pixels reflect vegetation growth (Wang et al., 2019). These pixels were removed from the original NDVI data based on the MODIS land cover data (MCD12Q1) at 500 m by 500 m spatial resolution.
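The annual averaging and masking step can be illustrated with a minimal sketch. It assumes the 16-day composites and the land cover map have already been brought onto the same grid; the function name, the toy arrays and the class codes used for the mask are hypothetical and are not taken from the study.

```python
import numpy as np

def annual_mean_ndvi(ndvi_stack, landcover, excluded_classes):
    """Average 16-day NDVI composites over a year and mask anomalous pixels.

    ndvi_stack: array (n_composites, rows, cols); NaN marks missing values.
    landcover: array (rows, cols) of class codes on the same grid.
    excluded_classes: class codes to drop (for example water, urban, snow).
    """
    mean_ndvi = np.nanmean(ndvi_stack, axis=0)
    anomalous = np.isin(landcover, list(excluded_classes))
    # Anomalous pixels become NaN so they are ignored in later steps.
    return np.where(anomalous, np.nan, mean_ndvi)

# Toy example: three composites on a 2 x 2 grid.
ndvi_stack = np.array([
    [[0.2, 0.4], [0.6, 0.8]],
    [[0.4, 0.4], [0.6, 0.6]],
    [[0.6, 0.4], [0.6, 0.4]],
])
# Hypothetical class codes: 13 and 17 stand in for urban and water pixels.
landcover = np.array([[1, 1], [13, 17]])
ndvi_annual = annual_mean_ndvi(ndvi_stack, landcover, excluded_classes={13, 17})
```

Marking excluded pixels as NaN, rather than deleting them, keeps the grid shape intact so that the NDVI layer can still be stacked with the other predictors.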

Elevation and Geographic locations
Elevation data were taken from the Shuttle Radar Topography Mission (SRTM) digital elevation model, which has a spatial resolution of 90 m and can be downloaded from the USGS website (EarthExplorer (usgs.gov)). The 90 m elevation data were resampled to 1 km and 10 km spatial resolution using pixel averaging. The latitude and longitude data used in this study were also derived from the elevation data, and they serve as a common key for combining all datasets for further processing and analysis.
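Pixel averaging can be sketched as block aggregation. The example below assumes the fine grid is aligned with the coarse grid and that the aggregation factor is an integer; for real 90 m data resampled to 1 km, where the ratio is not an integer, a GIS resampling tool would normally be used instead. The function name is hypothetical.

```python
import numpy as np

def block_average(grid, factor):
    """Aggregate a 2-D grid by averaging non-overlapping factor x factor blocks."""
    rows, cols = grid.shape
    # Trim edge rows and columns that do not fill a complete block.
    rows_trim = rows - rows % factor
    cols_trim = cols - cols % factor
    trimmed = grid[:rows_trim, :cols_trim]
    blocks = trimmed.reshape(rows_trim // factor, factor, cols_trim // factor, factor)
    return blocks.mean(axis=(1, 3))

# Toy elevation grid: 4 x 4 fine cells aggregated to 2 x 2 coarse cells.
dem_fine = np.arange(16, dtype=float).reshape(4, 4)
dem_coarse = block_average(dem_fine, 2)
```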

Table 1. Dataset required for this study

The GPM mission is an international constellation of satellites consisting of one core observatory satellite and ten partner satellites that provides next-generation global precipitation measurements. It was initiated by the Japan Aerospace Exploration Agency (JAXA) and the United States National Aeronautics and Space Administration (NASA) (Lu et al., 2018). The GPM Core Observatory carries a multi-channel GPM Microwave Imager (GMI) combined with the first space-borne dual-frequency radar operating in the Ku (13 GHz) and Ka (35 GHz) bands, which can detect light rain (<0.5 mm/hr) (Hou et al., 2014). Of the three IMERG products (Early Run, Late Run and Final Run), the Final Run product was used in this study, as suggested by the data provider. It can be downloaded approximately four months after real-time observation from the https://gpm.nasa.gov/data/ directory. Guo et al. (2016) suggested that PrecipitationCal performs better than PrecipitationUnCal over Taiwan, since it has much less bias relative to rain gauge station observations.

MODIS Land Surface Temperature (LST)
The MOD11A1 version 6 daily LST product from Terra MODIS was used in this study at 1 km spatial resolution.

Machine learning algorithms
In this study, we used four machine learning algorithms from the scikit-learn Python library (Pedregosa et al., 2011), namely Random Forest, Decision Tree, Support Vector Machine and Multiple Linear Regression, along with another model, the Adaptive Network-based Fuzzy Inference System (ANFIS), to model the complex relationship between IMERG precipitation and environmental parameters for downscaling the precipitation products.
The Support Vector Machine maps the input variables into an m-dimensional feature space and separates them with a maximal-margin hyperplane, which is found by solving a quadratic optimization problem (Smola & Schölkopf, 2004). libsvm, developed by Chang & Lin (2011), is one of the most widely used SVM tools; it is freely available online and is adopted in this study. libsvm exposes the important parameters, such as the kernel function and the capacity (cost) parameter.
Random Forest (RF) is an improved technique that integrates a large number of Classification and Regression Trees (CART) into an ensemble. It has been widely used in remote sensing applications, including regression and classification (Rhee et al., 2014). In contrast to a single CART, RF employs bootstrap aggregating to enhance model performance. Since each tree is constructed using a random subset of the training data and a random subset of the predictor variables, aggregating the output of numerous trees smooths the variance between trees and produces more accurate predictions (Breiman, 2001).
According to Zhang et al. (2017), the conventional method of Multiple Linear Regression relies on a regression coefficient matrix that depicts the correlation between the dependent and independent variables. Multiple Linear Regression has frequently been used to predict the values of a dependent variable from a collection of predictor variables. In this study, we build a Multiple Linear Regression function between the satellite precipitation product and the environmental variables.
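As an illustration of how the four scikit-learn regressors named in this study can be fitted side by side, the sketch below uses synthetic data in place of the real predictors. The sample size, the synthetic target and every hyper-parameter value are assumptions made for the example, not settings reported by the study. scikit-learn's `SVR` is built on libsvm.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.svm import SVR
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)

# Synthetic stand-in for coarse pixels: 200 samples x 6 predictors in [0, 1].
X = rng.uniform(0.0, 1.0, size=(200, 6))
# Hypothetical annual precipitation (mm) with a mild non-linear term and noise.
y = (2000.0 + 800.0 * X[:, 0] - 500.0 * X[:, 1]
     + 300.0 * X[:, 2] * X[:, 3] + rng.normal(0.0, 20.0, 200))

# Illustrative settings only; they are not the tuned values of the study.
models = {
    "MLR": LinearRegression(),
    "DecisionTree": DecisionTreeRegressor(max_depth=6, random_state=0),
    "SVR": SVR(kernel="rbf", C=1000.0, epsilon=1.0),
    "RandomForest": RandomForestRegressor(n_estimators=200, random_state=0),
}

X_train, X_test = X[:150], X[150:]
y_train, y_test = y[:150], y[150:]

scores = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    # score() returns the coefficient of determination on held-out samples.
    scores[name] = model.score(X_test, y_test)
```

Keeping the models in one dictionary makes it straightforward to evaluate them with identical training and test splits.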

Adaptive Network-Fuzzy Inference System
Zadeh (1965) first introduced the fuzzy set approach. ANFIS is basically a combination of a neural network (NN) and the concept of fuzzy logic (FL). Sugeno's system is one of the most widely used fuzzy systems for building models from scarce or ambiguous data. Fuzzy logic has a stronger capacity for adjusting conditions during the learning process. Thus, by using an NN, it is possible to reduce the error rate according to the FL rules and to let the membership functions (MFs) vary naturally. The ANFIS model has two main components: (i) during the learning process, the MFs convert the input values to fuzzy values between 0 and 1, and (ii) IF-THEN rules describe the non-linear relationship between the input and output space. The MF is the primary factor influencing the predictive performance of the ANFIS model, hence choosing the right MF is one of the predefined modelling steps. Fig. 2 shows the structure of the ANFIS model with six input variables (x1 to x6) and five layers: fuzzification, product, normalization, de-fuzzification and output (Moosavi et al., 2013). The most common MFs are triangular, trapezoidal, generalized bell-shaped and Gaussian; the Gaussian MF is used in this study. Jang et al. (1997) developed a hybrid learning algorithm for the neuro-fuzzy model that calculates the model parameters more quickly and accurately than the back-propagation method, which is based on gradient descent.
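The five layers can be made concrete with a minimal forward pass of a first-order Sugeno system using Gaussian MFs. This is a sketch under stated assumptions: each rule owns one Gaussian MF per input, the parameter values are arbitrary, and no training (hybrid or otherwise) is shown. The function names are hypothetical and do not come from any ANFIS library.

```python
import numpy as np

def gaussian_mf(x, center, sigma):
    """Gaussian membership function; returns a degree in (0, 1]."""
    return np.exp(-0.5 * ((x - center) / sigma) ** 2)

def anfis_forward(x, centers, sigmas, consequents):
    """Forward pass of a first-order Sugeno fuzzy system.

    x: (n_inputs,) input vector.
    centers, sigmas: (n_rules, n_inputs) Gaussian MF parameters per rule.
    consequents: (n_rules, n_inputs + 1) linear coefficients; last column is the bias.
    """
    memberships = gaussian_mf(x, centers, sigmas)             # layer 1: fuzzification
    firing = memberships.prod(axis=1)                         # layer 2: product
    weights = firing / firing.sum()                           # layer 3: normalization
    rule_out = consequents[:, :-1] @ x + consequents[:, -1]   # linear rule consequents
    weighted = weights * rule_out                             # layer 4: de-fuzzification
    return float(weighted.sum())                              # layer 5: output

# Toy system: two inputs, two rules, arbitrary parameters.
x = np.array([0.5, 0.5])
centers = np.array([[0.0, 0.0], [1.0, 1.0]])
sigmas = np.full((2, 2), 0.5)
consequents = np.array([[1.0, 1.0, 0.0], [0.0, 0.0, 2.0]])
y_hat = anfis_forward(x, centers, sigmas, consequents)
```

In this toy case the input sits midway between the two rule centres, so both rules fire equally and the output is the average of the two rule consequents.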

Downscaling
The main goal of statistical spatial downscaling is to establish an empirical statistical relationship between the object variable and the relevant auxiliary variables at low/coarse spatial resolution, under the assumption that this relationship also holds at high/fine spatial resolution. The downscaled object variable at fine spatial resolution is then produced by applying the established relationship to the auxiliary variables at fine spatial resolution.
The detailed workflow for downscaling satellite precipitation is shown in Fig. 3. In our study, annual IMERG satellite precipitation serves as the object variable, which is downscaled from 0.1° (nearly 10 km) to 1 km resolution. Six auxiliary variables (NDVI, latitude, longitude, elevation, LST and land cover) were selected from the literature. Monthly auxiliary variables were first prepared at spatial resolutions of 10 km and 1 km, converted into annual composites by averaging the monthly values, and resampled to 1 km and 10 km using the nearest neighbour method. Water bodies and urban built-up areas are masked out of the vegetation indices and land cover dataset because of their negative impact on the downscaling model. To remove the effect of differing scales, the variables are standardized using their means and standard deviations with the scikit-learn library. Machine learning regression models (SVR, Decision Trees, MLR, Random Forest and ANFIS) are then built between IMERG precipitation and the six variables at 10 km resolution. Residual errors are calculated as the difference between the original IMERG precipitation and the precipitation predicted at 10 km. The residuals are resampled from 10 km to 1 km using spline interpolation, and the residual-corrected downscaled precipitation at 1 km is obtained by adding the interpolated residuals to the 1 km prediction made without residual correction.
In this study, six environmental variables were prepared at spatial resolutions of 10 km and 1 km: three static variables (elevation, longitude and latitude) and three dynamical variables (either NDVI or EVI, LST and land cover). Annual composite values were first calculated for all environmental variables by averaging the monthly values. The nearest neighbour method was then used to resample the environmental variables to 10 km and 1 km spatial resolution. To prevent a detrimental impact on the downscaling model, negative values in the vegetation indices, as well as urban built-up areas and water bodies, should be eliminated using the land cover dataset.
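The downscaling procedure with residual correction can be sketched end to end on synthetic grids. Several simplifications are assumed here: three synthetic predictors instead of six, block averaging to build the coarse predictors, a Random Forest as a stand-in for whichever regressor is selected, a residual defined as observed minus predicted, and cubic spline interpolation via `scipy.ndimage.zoom`. None of the numbers come from the study.

```python
import numpy as np
from scipy.ndimage import zoom
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)
factor = 10                       # 10 km cell = 10 x 10 cells of 1 km
fine_shape, coarse_shape = (60, 40), (6, 4)

def to_coarse(grid):
    """Average non-overlapping factor x factor blocks (fine -> coarse)."""
    r, c = coarse_shape
    return grid.reshape(r, factor, c, factor).mean(axis=(1, 3))

# Synthetic 1 km auxiliary variables: two smooth gradients and one noisy field.
rows, cols = np.mgrid[0:60, 0:40]
aux_fine = np.stack([rows / 60.0, cols / 40.0, rng.uniform(size=fine_shape)], axis=-1)
aux_coarse = np.stack([to_coarse(aux_fine[..., k]) for k in range(3)], axis=-1)

# Synthetic coarse precipitation standing in for annual IMERG (mm).
precip_coarse = (2000.0 + 1000.0 * aux_coarse[..., 0] - 500.0 * aux_coarse[..., 1]
                 + rng.normal(0.0, 30.0, coarse_shape))

# 1. Fit the regression model at coarse resolution.
model = RandomForestRegressor(n_estimators=100, random_state=0)
model.fit(aux_coarse.reshape(-1, 3), precip_coarse.ravel())

# 2. Residual at coarse resolution: observed minus predicted.
pred_coarse = model.predict(aux_coarse.reshape(-1, 3)).reshape(coarse_shape)
residual_coarse = precip_coarse - pred_coarse

# 3. Interpolate the residual to the fine grid with a cubic spline.
residual_fine = zoom(residual_coarse, factor, order=3)

# 4. Predict at fine resolution and add the interpolated residual.
pred_fine = model.predict(aux_fine.reshape(-1, 3)).reshape(fine_shape)
downscaled = pred_fine + residual_fine
```

The residual term carries the part of the coarse precipitation field that the predictors cannot explain, so adding it back keeps the fine-scale product close to the original coarse values.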

Data Normalization
All input environmental variables are normalized to a common scale to improve evaluation accuracy and speed, since the raw input data were collected from different sources. We adopted min-max normalization in this study to linearly rescale all environmental variables into the interval [0, 1]:
$$x_i = \frac{x - x_{\min}}{x_{\max} - x_{\min}}$$
where x_i is the normalized value, x is the original value, and x_min and x_max are the minimum and maximum values of the data.
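A minimal sketch of min-max rescaling for a single variable follows. The function name and the sample elevations are hypothetical; the handling of a constant input (returning zeros) is an assumption added to avoid division by zero.

```python
import numpy as np

def min_max_normalize(values):
    """Linearly rescale values to [0, 1] using (x - x_min) / (x_max - x_min)."""
    values = np.asarray(values, dtype=float)
    v_min, v_max = np.nanmin(values), np.nanmax(values)
    if v_max == v_min:
        # A constant variable carries no information; map it to zeros.
        return np.zeros_like(values)
    return (values - v_min) / (v_max - v_min)

# Hypothetical elevations in metres.
elevation = np.array([0.0, 500.0, 1500.0, 3000.0])
elevation_scaled = min_max_normalize(elevation)
```

scikit-learn's `MinMaxScaler` applies the same transformation column by column and stores the fitted minimum and maximum, which is convenient when the scaling learned at 10 km must be reused at 1 km.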

Hyper-Parameter Optimization
Hyper-parameter optimization plays an important role in improving the performance of machine learning algorithms. We used the scikit-learn GridSearchCV algorithm with 10-fold cross-validation to identify the best hyper-parameters for constructing the optimal prediction model (
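A grid search with 10-fold cross-validation can be sketched as follows. The estimator, the parameter grid, the scoring choice and the synthetic data are all assumptions chosen to keep the example small; they are not the search space used in the study.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV

rng = np.random.default_rng(0)

# Synthetic stand-in for the six scaled predictors and annual precipitation.
X = rng.uniform(size=(120, 6))
y = 1000.0 * X[:, 0] + 500.0 * X[:, 1] + rng.normal(0.0, 10.0, 120)

# Hypothetical, deliberately small search space.
param_grid = {"n_estimators": [50, 100], "max_depth": [3, None]}

search = GridSearchCV(
    RandomForestRegressor(random_state=0),
    param_grid=param_grid,
    cv=10,                              # 10-fold cross-validation
    scoring="neg_mean_squared_error",   # higher (closer to zero) is better
)
search.fit(X, y)

best_params = search.best_params_
best_model = search.best_estimator_     # refitted on all data with best_params
```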

Statistical indices
Two assessment indices were adopted in this study to compare the performance of the machine learning downscaling models: the coefficient of determination (R²) and the mean squared error (MSE) (Shi et al., 2015).
In the definitions of these indices, S represents the original IMERG precipitation and P the predicted IMERG precipitation. Table 3 shows the results of the various downscaling models adopted in this study to simulate IMERG precipitation data at 1 km spatial resolution.
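Both indices can be computed in a few lines. The sketch assumes the common definitions MSE = mean((S - P)²) and R² = 1 - Σ(S - P)² / Σ(S - mean(S))²; some studies instead report the squared Pearson correlation as R², so the exact form used in this study should be checked against Shi et al. (2015). The sample values are invented.

```python
import numpy as np

def mse(s, p):
    """Mean squared error between original (s) and predicted (p) values."""
    s, p = np.asarray(s, dtype=float), np.asarray(p, dtype=float)
    return float(np.mean((s - p) ** 2))

def r_squared(s, p):
    """Coefficient of determination, assumed here to be 1 - SS_res / SS_tot."""
    s, p = np.asarray(s, dtype=float), np.asarray(p, dtype=float)
    ss_res = np.sum((s - p) ** 2)
    ss_tot = np.sum((s - s.mean()) ** 2)
    return float(1.0 - ss_res / ss_tot)

# Invented annual precipitation values (mm) for four pixels.
S = [1000.0, 1500.0, 2000.0, 2500.0]
P = [1100.0, 1400.0, 2100.0, 2400.0]
mse_value = mse(S, P)
r2_value = r_squared(S, P)
```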

Feature Importance
The ANFIS model was selected for further downscaling analysis of IMERG precipitation. In a conventional regression model, it is hard to identify how strongly each individual variable influences the prediction result; however, this can be accomplished by analysing SHapley Additive exPlanations (SHAP) values. Based on game-theoretic assumptions, SHAP enables evaluation of the extent and nature of each explanatory variable's impact on the outcome of any machine learning or deep learning model (Lundberg & Lee, 2017). The feature importance figure shows that latitude and longitude are the most influential factors, followed by NDVI, LST and elevation, while land cover has the least impact on the downscaling model. Figure 6 shows a waterfall plot of the predicted results, listing the features with the largest SHAP contributions. The x-axis represents the value of the target variable (i.e., precipitation), and E[f(x)] is the expected value of the precipitation prediction. Longitude and latitude have the highest impact, followed by NDVI and LST, and land cover has the least. Figure 7 presents the original IMERG precipitation product and the precipitation predicted from the six input variables using the ANFIS algorithm. The predicted results show that the spatial pattern of the downscaled IMERG is the same as that of the original IMERG precipitation product. The annual mean of the downscaled product is very close to the original IMERG, with slight differences in spatial distribution. These results indirectly indicate that the ANFIS algorithm performs well for downscaling satellite precipitation.
Fig. 7 Annual mean original IMERG precipitation product (a), downscaled annual mean IMERG precipitation (b).
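To show what a SHAP value measures, the sketch below computes exact Shapley values by enumerating all feature coalitions, with absent features set to a baseline input. This is an illustration of the underlying idea rather than the SHAP library, which relies on faster approximations; the function name and the toy prediction function are hypothetical.

```python
import itertools
import math
import numpy as np

def shapley_values(predict, x, baseline):
    """Exact Shapley values of predict at x relative to a baseline input.

    Features outside a coalition keep their baseline value. The cost grows
    as 2^n, which remains manageable for six predictors.
    """
    n = len(x)
    phi = np.zeros(n)
    for i in range(n):
        others = [j for j in range(n) if j != i]
        for size in range(n):
            weight = (math.factorial(size) * math.factorial(n - size - 1)
                      / math.factorial(n))
            for coalition in itertools.combinations(others, size):
                idx = np.array(coalition, dtype=int)
                z = baseline.copy()
                z[idx] = x[idx]
                without_i = predict(z)
                z[i] = x[i]
                with_i = predict(z)
                # Marginal contribution of feature i to this coalition.
                phi[i] += weight * (with_i - without_i)
    return phi

# Toy model with one interaction term between the first two features.
def toy_predict(z):
    return 3.0 * z[0] + 2.0 * z[1] - z[2] + z[0] * z[1]

x = np.array([1.0, 2.0, 3.0])
baseline = np.zeros(3)
phi = shapley_values(toy_predict, x, baseline)
```

The contributions sum to the difference between the prediction at x and the prediction at the baseline, which is the property a waterfall plot visualizes when it starts from E[f(x)] and adds one feature at a time.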

CONCLUSION
The main objective of this framework is to identify a suitable way of downscaling the IMERG precipitation product using static and dynamical variables (elevation, land cover, LST, NDVI, latitude and longitude) and suitable machine learning algorithms (SVR, MLR, Decision Tree, Random Forest and ANFIS). The machine learning algorithms produce smaller residual errors than conventional parametric models and provide sufficient spatial detail for meteorological analysis. Our results show that the Adaptive Network-based Fuzzy Inference System (ANFIS) regression algorithm performed better than the other algorithms in terms of the statistical metrics for downscaling precipitation. Future studies will involve validation of the downscaled precipitation against rain gauge observations.