METHODS AND CHALLENGES IN TIMESERIES ANALYSIS OF VEGETATION IN THE GEOSPATIAL DOMAIN

: The increasing availability of remotely sensed data have offered unprecedented possibilities for monitoring and analysis of environmental variables, including boosting recent studies in the field of ecosystem resilience relying on indicators derived from timeseries analysis, such as the temporal autocorrelation of vegetation indices. A forest ecosystem with decreased resilience will be more susceptible to external drivers and their change and could shift into an alternative system configuration by crossing a tipping point. Nevertheless, remote sensing data quantifying vegetation and forests properties inherently carry information related to the climate as well, which has to be accounted for before performing any modelling exercise. In this paper, we aim to present the general workflow and the challenges encountered in processing and analysing the historical, high-frequency and high-resolution timeseries of vegetation and climatic data. The final aim is training a machine learning model (Random Forest) in order to model and explore the performance and importance of a set of climatic and environmental metrics in predicting an indicator of the resilience of forests. In this case, the resilience of forests is quantified through the temporal autocorrelation (TAC) of the kernel NDVI (kNDVI). Climatic and environmental predictors include 2-meter air temperature, total precipitation, vapour pressure deficit, surface solar radiation, forest cover and soil organic carbon content. Results show a good performance of the Random Forest model and the ranking in the importance of the predicting variables captured in terms of background climate and climate variability. This application allows to separate and identify the main drivers of the temporal autocorrelation of kNDVI.


INTRODUCTION
The increasing availability and ease of access of global, historical, and high-frequency remote sensing data have offered unprecedented possibilities for monitoring and analysis of environmental variables. Recent studies in the field of ecosystem resilience relied on indicators derived from timeseries analysis, such as the temporal autocorrelation and the variance of a system signal (Dakos et al., 2015). The aforementioned availability of global, temporally and spatially dense timeseries of indicators of biomass and greenness of vegetation, such as the normalized difference vegetation index (NDVI), among others, has boosted ecosystem resilience scientific applications to forests as well. The ecological definition of resilience corresponds with the capacity of a system to absorb and recover from a disturbance (Forzieri et al., 2022). When dealing with ecosystems increasingly affected by natural and anthropogenic pressures such as forests, monitoring their health is particularly relevant.
Forest ecosystems play a crucial part in the global carbon cycle and in any climate change mitigation strategy, despite being increasingly affected by natural and anthropogenic pressures. While anthropogenic action on forests is mainly represented by stand replacement, natural perturbations include wind throws and fires, as well as extended insects and disease outbreaks, such as the recent outbreak affecting Central Europe (Bárta et al., 2021;Thonfeld et al., 2022). These natural disturbances are strictly interconnected with climate change. A forest ecosystem with decreased resilience will be more susceptible to external drivers and their change and could shift into an alternative system configuration by crossing a tipping point.
However, remote sensing data quantifying vegetation and forests properties inherently carry information related to the climate as well. If not accounted for, these confounding factors, such as short-term climate fluctuations, may hide the actual vegetation anomalies focus of a study and the importance of other drivers in vegetation itself. In addition, the comparison of the same vegetation property between different geographical areas naturally affected by different climates is hindered.
In order to explore the relationships of a set of environmental and climatic metrics with an indicator of the resilience of forests, a machine learning (ML) model is implemented. In this paper, we aim to present the general workflow and the challenges encountered in processing and analysing the timeseries of vegetation and climatic. The focus of this paper will be on a workflow implemented to analyse the aforementioned timeseries and on the methods and tools implemented to account for the background climate effect on vegetation. A key aspect to assess resilience of ecosystems is the treatment of the signal to remove long-term trends and seasonal cycle of the signal. Being aware of the variety and heterogeneity of methodologies existing in the field of timeseries analysis, in this contribution we will describe tools and methods for deseasonalization, detrending, growing season identification and accounting for short-term climate effects.
The final aim is to present one of the many workflows that can be implemented when dealing with timeseries of vegetationrelated data in the geospatial domain, where climate plays a crucial role. The importance of the availability of open data and open-source tools and platforms in making this big data analysis possible is also strongly highlighted.

OBJECTIVE
The main objective of the presented paper is to illustrate a thorough methodology to pre-process and analyse timeseries of vegetation and climatic data in order to retrieve a vegetation signal and a derived metric of resilience, in this specific case in terms of lag-1 temporal autocorrelation, as close as possible to representing the actual response of the system accounting for short-term climate fluctuation. In order to do so, it has to be considered firstly that the timeseries of vegetation and climatic variables are constituted of three main components: a long-term trend, the seasonality cycles and the remaining part representing the anomaly that deviates from the average conditions. In the following equation representing a decomposed timeseries, the Y represents either the vegetation or the climate variable timeseries (Sun et a;. In order to obtain an accurate estimate of vegetation and its resilience, the vegetation timeseries used in any model needs to be stationary, hence without periodical seasonal cycles and long-term trends (Forzieri et al., 2022).
Secondly, a Random Forest (Breiman, 2001) model is used to explain the observed long-term lag-1 temporal autocorrelation (TAC) of vegetation, by accounting simultaneously for climate fluctuations and the temporal autocorrelation of the climate itself.
In the previous equation the X represents a series of climatic and environmental predictors used to estimate the vegetation long-term TAC and ε are the residuals of such estimation. This allows to explain TAC at different pixels accounting for the differences in climate and environmental conditions.

MATERIALS AND METHODS
All data and most of the tools leveraged for this study are open. In the following sections the main datasets and tools are described.

Tools
The data processing takes place mainly within Google Earth Engine (GEE) and the Joint Research Centre (JRC) Big Data Analytics Platform (BDAP). Google Earth Engine is a cloud-based geospatial analysis platform providing a multi-petabyte catalogue of satellite imagery and geospatial datasets coupled with large analysis capabilities (Gorelick et al., 2017). The JRC Big Data Analytics Platform is a petabyte-scale storage system coupled with a processing cluster. It includes open-source interactive data analysis tools, a remote data science desktop and distributed computing with specialized hardware for machine learning and deep learning tasks (Soille et al., 2018). GEE is mainly used to pre-process MODIS data and secondary datasets. The ERA5 pre-processing and the core timeseries analysis are performed within the JEODPP, where main tools include R (R Core Team, 2022), Climate Data Operator (CDO) and netCDF Operators (NCO). The whole machine learning model is instead trained and run in R. The different platforms and tools implemented in the study highlight the heterogeneity of data as well involved, data availability and data formats, ranging from TIFF, netCDF and R objects.

Datasets
This section is going to illustrate main datasets used in the analysis divided by category and the main filtering and resampling steps applied to each dataset in order to create a coherent ensemble of data to be used in the subsequent model in terms of temporal and spatial resolution. All data have been retrieved over Europe.

Vegetation:
The long-term kNDVI timeseries was retrieved by processing the full timeseries of daily MODIS Terra and Aqua Surface Reflectance at 500m from 2003 to 2021 (Vermote et al., 2021). A simplified version of kNDVI is defined as: The kNDVI is a nonlinear generalization of the NDVI that shows stronger correlations than NDVI and NIRv with forest key parameters. kNDVI is also more resistant to saturation, bias, and complex phenological cycles, and it is more robust to noise and more stable across spatial and temporal scales (Camps-Valls et al., 2021). MODIS daily data have been filtered in order to select only the highest quality data for the bands of interest (1 and 2) and MODIS pixels have been masked for clouds, clouds shadows, water, snow and ice by referring to the data product state QA (Quality Assessment) bit flags. After quality filtering, MODIS derived kNDVI daily data have been reduced as mean into a timeseries with an 8 days' time-step and resampled at 0.005° resolution to facilitate comparison with climatic data. Each of the 0.005° kNDVI pixel in the timeseries was afterward masked with a binary forest mask including only pixel with at least 50% of forest cover. Afterwards, the kNDVI was averagely aggregated to 0.05° resolution.
The forest mask used was obtained from a forest cover percentage layer. The latter was obtained as the percentage of forest covered pixels in a 0.005° cell, accounting as forest covered pixels all the 30m pixels from the Hansen tree cover 2000 layer (Hansen et al., 2013), only where tree cover is higher or equal 30% and retaining only patches of at least 6 connected pixels, accounting for a surface higher than 0.5ha. In addition, any pixel undergoing a forest loss in the time period was removed in order to account for managed or disturbed forest patches, where disturbances may include forest fires or clearcutting. Indeed, such events can artificially boost the lag-1 TAC of the vegetation signal. This pre-processing resulted in a 0.05° resolution timeseries of 8-days averaged kNDVI, derived exclusively from forested pixels.
In order to account for the phenology, MODIS Land Cover Dynamics yearly data at 500m (Friedl et al., 2022) were used.
From these, the circular mean of greenup (as the date when EVI2 first crossed 15% of the segment EVI2 amplitude) and dormancy (as the date when EVI2 last crossed 15% of the segment EVI2 amplitude) was calculated, after yearly dates of greenup and dormancy have been translated in their respective angular measure of their respective day of the year. Average greenup and dormancy were obtained for the time period 2001-2021. Finally, the growing season products were aggregated at 0.05° resolution, accounting again only for pixels cover at least for 50% by forests.

Climate:
Hourly ERA5-Land data at 10km resolution (Copernicus Climate Change Service (C3S), 2017) were used to retrieve the set of climatic and environmental predictors including 2-meter air temperature (t2m), total precipitation (tp), vapour pressure deficit (vpd) and surface solar radiation (ssr). These variables were computed as 8-days averages or sums according to the specific variable and as well resampled at 0.05° resolution in order to be coherent with the vegetation data.

Others:
Additional datasets included in the final RF model are the forest cover (fc) layer previously described in the Vegetation section aggregated at 0.05° resolution and the soil organic carbon content layer at 30cm depth (socc30cm) from OpenLandMap Soil Organic Carbon Content at 250m (Hengl et al., 2018) and aggregated at 0.05° resolution.

Timeseries pre-processing
This section is going to illustrate the main pre-processing steps applied to each timeseries involved in the study, prior to the calculation of the lag-1 temporal and other statistics. Each step is important in timeseries analysis in order to maximise the representativeness of the resulting signal, being either of vegetation or climatic variables.

Growing season selection:
A climatological growing season was identified as the 20 years (2001-2021) circular mean of greenup and dormancy from the MODIS Land Cover Dynamics product. This climatological growing season was used in order to retain only 8 days timesteps that are within the greenup day of the year and the dormancy day of the year. This allows the background environment metrics to not be strongly affected by the dormient months of the vegetation.

Detrending:
As a final step in the pre-processing of the timeseries, a linear regression model has been fitted to each pixel's timeseries and the resulting fitted values have been subtracted from the kNDVI and climatic variables values, in order to remove any linear trends from the timeseries, because these trends result for long-term climate trends (i.e., global warming).

TAC and other statistics calculation:
The previously explained pre-processing steps leaves anomalies from the seasonal cycle in the vegetation and climatic timeseries, as shown in Figure 1. Figure 1 shows the evolution of the timeseries of kNDVI for a representative pixel undergoing each step of the presented pre-processing.
The vegetation and climatic anomalies are finally used to compute the long-term (2003-2021) lag-1 temporal autocorrelation (TAC) of both kNDVI anomalies and each climatic variable anomalies involved in the study.
In addition, the average and coefficient of variation (CoV) of kNDVI and climatic data are computed for the climatic growing season. Figure 1. kNDVI timeseries for a sample pixel at each preprocessing step.

Random Forest (RF) model
In order to predict the long-term kNDVI TAC accounting for the impact of climate and other environmental factors, a RF regression model was implemented using as predicted variable the long-term kNDVI anomaly TAC and as predictors: the longterm TAC of each climatic anomaly, the average and coefficient of variation (CoV) of each climatic variable computed on the climatic growing season, as well as the average and coefficient of variation (CoV) of kNDVI computed on the climatic growing season. The CoV represents of the variability of climate, whilst the average represents the background climate. In addition to these vegetation and climatic variables, additional datasets including the forest cover and soil organic carbon content were included in the RF model. Predicting the long-term lag-1 kNDVI TAC with a series of climatic and environmental predictors allows to separate and identify the drivers of forest resilience in terms of long-term TAC.
The RF model was trained with 500 trees and with a proportion of training equal to 70% of the whole dataset and 30% for the testing.

RESULTS AND DISCUSSION
In this section, the results of the pre-processing applied to the timeseries will be illustrated specifically for the kNDVI timeseries. In addition, the performance and outcome in terms of predictors ranked by variable importance of the trained RF model will be presented. Figure 2 illustrates the long-term lag-1 TAC of kNDVI without any pre-processing applied to the timeseries and the same lag-1 TAC of kNDVI following the timeseries pre-processing, as illustrated in the previous section.
As it can be seen from the two maps, the long-term lag-1 TAC of kNDVI strongly changes whether calculated over a non-preprocessed timeseries and after pre-processing. In the first case, the TAC is strongly dominated by seasonality, while in the second case the TAC shows patterns that are instead more coherent with general climatic conditions.

Figure 2.
Long-term lag-1 TAC of kNDVI computed over an untreated timeseries (above) and over a pre-processed timeseries (below).  In order to quantify and rank the influence of individual environmental and climatic factors on TAC, variable importance metrics have been extracted. These metrics allow to separate and identify main drivers of the kNDVI TAC. Figure 4 shows the environmental and climatic predictors ranked by variable importance in the Random Forest model. The percentage increase in MSE represents the mean decrease of accuracy in predictions when a variable is permuted, meaning the mean increase in MSE contribution by variable divided by its variability. From Figure 4, it is clear how the most important predictor is forest cover, followed by the temporal autocorrelation of the climatic variables (2-meter air temperature, surface solar radiation and total precipitation).
Finally, in Figure 5 a map of the predicted values of long-term lag-1 TAC of kNDVI extended over the whole dataset and a map of the residuals from the Random Forest model are presented. Figure 5. Predicted long-term lag-1 TAC of kNDVI and residuals of the RF model.

CONCLUSION
The study presents an overview of the pre-processing of historical, high-frequency and high-resolution timeseries of vegetation and climatic data with the aim of training a Random Forest model to explain a general resilience indicator, the longterm temporal autocorrelation of kNDVI, with climatic and environmental variables. The overall pre-processing highlighted the importance of retrieving stationary timeseries before the data are input in the machine learning model. The application of the RF model highlighted the strong influence of climatic variables as drivers of vegetation temporal autocorrelation.