ESTIMATION OF SUGARCANE YIELD USING MULTI-TEMPORAL SENTINEL 2 SATELLITE IMAGERY AND RANDOM FOREST REGRESSION



INTRODUCTION
Early prediction of crop yield at local, national, and regional levels is crucial for effective planning and decision-making (Becker-Reshef et al., 2010). Traditionally, crop yield has been estimated using field observation and survey techniques. However, these techniques are often time-consuming, costly, subjective, and prone to large errors (Reynolds et al., 2000). Furthermore, data processing and analysis may be completed several months after the harvest date, limiting the usefulness of these data for timely decision-making and planning (Dempewolf et al., 2014). Similarly, yield estimation using weather data is associated with several problems related to the spatial distribution of weather stations (Becker-Reshef et al., 2010). Alternative approaches that provide accurate crop yield estimates are therefore needed.
Remotely sensed data, owing to their synoptic and repetitive coverage over large areas, have been recognized as an important data source for crop yield estimation (Becker-Reshef et al., 2010; Ban et al., 2016). Because satellite images are acquired repeatedly and can be downloaded as they become available, yield estimates can be updated frequently during the growing season (Reynolds et al., 2000). Using Vegetation Indices (VIs), crop yield forecasts can thus be produced earlier than conventional estimates. Previous studies, such as Franch et al. (2019), used empirical models to estimate crop yield from VIs extracted from satellite imagery. However, the accuracy of the predictions depended on the strength of the correlation between the crop yields estimated from field data and the satellite-based VIs. In this regard, saturation, which reduces estimation accuracy, is the major challenge of using VIs. Studies have demonstrated that VIs based on red-edge bands improve the estimation of above-ground biomass (Franch et al., 2019), leaf area index (LAI) (Herrmann et al., 2011), and plant chlorophyll and nitrogen content (Delegido et al., 2011) by reducing the saturation effect that affects traditional VIs extracted from multispectral bands. Information on such variables is very important for determining crop growth conditions and predicting crop yield.
In this respect, the Sentinel-2 (S2) satellites, which have a high temporal frequency (i.e., a 5-day revisit cycle) and spatial resolution (10 m to 20 m), are promising for crop monitoring and yield estimation. S2 imagery has recently been used for estimating crop primary productivity (Wolanin et al., 2019), crop canopy and nitrogen content (Delloye et al., 2018), LAI and biomass (Punalekar et al., 2018), and for crop mapping and yield estimation (Jin et al., 2019). Therefore, the successful application of VIs extracted from S2 imagery could be useful not only for yield estimation and monitoring at the local level but also for mapping crop productivity at national and regional scales.
In recent years, several Machine Learning (ML) techniques have been employed to improve yield estimation for different crops. Improving crop yield estimation requires a powerful method for selecting the optimal VIs, because some VIs are only weakly correlated with crop yield and others are highly correlated with each other. A data-driven ensemble learning technique called Random Forest (RF) is increasingly being applied in remote sensing for several applications, including land use/land cover classification (Munyati, 2019), crop disease detection (Adam et al., 2017), nitrogen and chlorophyll content estimation (Abdel-Rahman et al., 2012), and LAI and biomass estimation (Pandit et al., 2018). Previous studies have also used RF regression for predicting crop yield (Saeed et al., 2017).
Its robustness, high estimation accuracy, high computational speed, and capacity to rank variables by importance make the RF algorithm one of the best approaches for classification and regression problems in remote sensing. RF can overcome limitations of other ML methods, such as the black-box constraint of artificial neural networks (ANNs); it also helps to select the optimal variables and can reduce the dimensionality of the dataset. The RF algorithm can therefore be used for both model development and variable selection. In addition, the RF algorithm is less sensitive to over-fitting, runs efficiently on large datasets, and has fewer parameters than other ML approaches such as ANNs and Support Vector Machine (SVM) algorithms.
In this study, we evaluated the performance of 22 VIs computed from S2 imagery for estimating sugarcane yield in the Wonji-Shoa and Metehara estates, Ethiopia. A series of VIs involving visible, near-infrared (NIR), red-edge, and short-wave infrared (SWIR) bands was calculated from the S2 imagery, and sugarcane yield was predicted using yield data and the RF regression algorithm. RF regression was used because of its capability to select and rank important variables for sugarcane yield estimation. The objectives of this study were therefore to: (1) investigate the use of VIs derived from multi-temporal S2 data for estimating sugarcane yield (t/ha), and (2) evaluate the performance of RF regression as a variable selection and prediction method. This approach is expected to provide reliable estimates of crop yield that can contribute to improving early estimates of sugarcane yield before harvest.

Study Area
This research was carried out in the Wonji-Shoa and Metehara sugar estates during the 2016/17-2018/19 cropping seasons (Figure 1). The Wonji-Shoa sugarcane plantation, located about 108 km southeast of Addis Ababa, Ethiopia, lies between 8° 21'-8° 29' N and 39° 12'-39° 18' E. It has an average altitude of 1540 meters above sea level and covers an area of 12,000 hectares. Located downstream of the Koka dam, within the upper Awash River Basin of Ethiopia's rift valley, the Wonji-Shoa plantation experiences a semi-arid climate. It records an average annual rainfall of 831.2 mm, with mean annual maximum and minimum temperatures of 27.6℃ and 15.2℃, respectively.
The Metehara sugarcane plantation lies at 8° 35'-8° 54' N and 39° 40'-39° 55' E, approximately 200 km southeast of Addis Ababa in the rift valley region of Ethiopia. It stands at an elevation of around 950 meters above sea level. The Metehara area, which largely comprises level floodplains along the Awash River, is flanked by the Chercher highlands, Mount Fentale, and adjacent undulating plateaus. The climate is similarly semi-arid, with an average annual rainfall of 532 mm. The average maximum and minimum temperatures are 32.7℃ and 17.4℃, respectively. Currently, the estate extends over more than 13,000 hectares and includes eleven principal sugarcane fields.

Sugarcane Yield Data
Sugarcane yield data, measured in tons of stalks per hectare (t/ha), were collected for the ratoon cane crops in the two estates from the Wonji and Metehara sugarcane research and development centres for the 2016/17 to 2018/19 cropping seasons. This study focused on ratoon cane fields harvested 11 to 14 months after ratooning during: 1) the March 2016/17 to 2018/19 cropping seasons, and 2) the May 2016/17 to 2017/18 cropping seasons. The shapefile of the study area was used to extract zonal statistics for the sampled fields, which were identified by their field numbers in the management units of the sugarcane schemes. Table 1 presents a summary of the plot area, age at harvest, and sugarcane yield for the ratoon cane in both the Wonji-Shoa and Metehara estates.

To enable reliable yield estimation using the S2 images, atmospheric correction was essential. Level 1C S2 data were processed to Level 2A Bottom of Atmosphere (BOA) reflectance images using the image-based Dark Object Subtraction (DOS1) algorithm, applied via the Semi-automatic Classification Plugin (SCP) V 6.2.9 (Congedo, 2021) in QGIS 3.6.3. We used the S2 bands with 10 m spatial resolution (Blue, Green, Red, NIR) and with 20 m resolution (Red-edge 1, Red-edge 2, Red-edge 3, NIR narrow 1, SWIR1, SWIR2). These surface reflectance images were then cropped to the study area, and various VIs were computed from the selected spectral bands.

Vegetation Indices and Seasonal Composites
In our research, 22 VIs reported in the existing literature were chosen for the development of empirical sugarcane yield estimation models. These VIs were computed from seasonal S2A images using QGIS 3.6.3 software. We then calculated the seasonal cumulative value of each VI during the grand growth stage, specifically from July to November for the March ratoon date and from September to January for the May ratoon date. This stage is critical because it encompasses key processes such as stem elongation and yield formation. The zonal mean values of these seasonal composites within each farm field (i.e., management unit) were subsequently calculated using the Processing Toolbox in QGIS 3.6.3. These values were later used for modelling sugarcane yield.
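The compositing steps described above can be illustrated with a minimal Python sketch. It is not the QGIS workflow used in the study: the function names, the flat pixel lists, and the sample reflectance values are hypothetical, and a generic normalized-difference form stands in for the individual VI formulas.

```python
from math import nan
from statistics import mean


def normalized_difference(band_a, band_b):
    """Return (a - b) / (a + b); NaN when the denominator is zero."""
    denom = band_a + band_b
    return (band_a - band_b) / denom if denom != 0 else nan


def seasonal_cumulative(vi_images):
    """Sum per-pixel VI values over all acquisition dates of one season.

    vi_images: list of equally sized pixel lists, one list per date.
    """
    return [sum(values) for values in zip(*vi_images)]


def zonal_mean(pixel_values, field_ids):
    """Average pixel values within each field (management unit)."""
    groups = {}
    for value, field in zip(pixel_values, field_ids):
        groups.setdefault(field, []).append(value)
    return {field: mean(values) for field, values in groups.items()}


# Hypothetical reflectances: four pixels, two dates, two bands.
nir = [[0.40, 0.42, 0.38, 0.45], [0.50, 0.52, 0.47, 0.55]]
red_edge = [[0.20, 0.21, 0.19, 0.15], [0.22, 0.20, 0.21, 0.18]]
field_ids = ["F1", "F1", "F2", "F2"]  # hypothetical field numbers

vi_per_date = [
    [normalized_difference(a, b) for a, b in zip(nir_d, re_d)]
    for nir_d, re_d in zip(nir, red_edge)
]
composite = seasonal_cumulative(vi_per_date)
print(zonal_mean(composite, field_ids))
```

Each field ends up with one value per VI and season, which is the form of predictor the yield models take as input.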

METHOD
In our study, we utilized a comprehensive methodology for estimating sugarcane yield, which involved the integration of various statistical techniques and data processing methods.
We employed the RF regression method, a non-parametric, ensemble-learning approach based on decision trees. Optimizing the RF algorithm involved adjusting the number of regression trees (ntree), the number of variables randomly sampled as candidates at each split (mtry), and the minimum size of terminal nodes (node size). The ntree values were varied from 500 to 2000 in increments of 500, while mtry was tested at every value from 1 to 22, the number of predictor variables.
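The parameter search described above amounts to a small grid search, sketched below in Python. The function `select_rf_parameters` and the stand-in scoring function `toy_oob_rmse` are hypothetical; in practice the scoring function would fit a forest with the given settings and return its out-of-bag RMSE.

```python
from itertools import product


def select_rf_parameters(oob_rmse, ntree_grid, mtry_grid):
    """Return the (ntree, mtry) pair with the lowest OOB RMSE, and that RMSE.

    oob_rmse: callable (ntree, mtry) -> float, supplied by the caller.
    """
    scores = {(n, m): oob_rmse(n, m) for n, m in product(ntree_grid, mtry_grid)}
    best = min(scores, key=scores.get)
    return best, scores[best]


ntree_grid = range(500, 2001, 500)  # 500, 1000, 1500, 2000
mtry_grid = range(1, 23)            # 1 to 22 predictors


def toy_oob_rmse(ntree, mtry):
    """Hypothetical error surface, used only to make the sketch runnable."""
    return 15.0 + 0.05 * (mtry - 9) ** 2 + 100.0 / ntree


best_params, best_score = select_rf_parameters(toy_oob_rmse, ntree_grid, mtry_grid)
print(best_params, round(best_score, 2))
```

The grid has 4 x 22 = 88 combinations, so an exhaustive search is feasible even when each evaluation requires fitting a full forest.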
Optimal values for ntree and mtry were determined using the Out-Of-Bag (OOB) Root Mean Square Error (OOB_RMSE) from the training dataset. The default node size was used for computational efficiency, and the percentage increase in mean squared error (%IncMSE) was used to measure variable importance.
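The idea behind %IncMSE can be sketched as a permutation test: shuffle one predictor, re-predict, and record how much the MSE grows. The Python sketch below is a simplified illustration under stated assumptions. It permutes columns of a single evaluation set for an arbitrary `predict` callable rather than working tree by tree on out-of-bag samples, and all names and data are hypothetical.

```python
import random


def mse(y_true, y_pred):
    """Mean squared error between two equally long sequences."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


def permutation_importance(predict, X, y, n_repeats=20, seed=0):
    """Percent increase in MSE after shuffling each predictor column.

    predict: callable taking one row (a list) and returning a prediction.
    X: list of rows (lists); y: list of observed values.
    """
    rng = random.Random(seed)
    baseline = mse(y, [predict(row) for row in X])
    if baseline == 0:
        raise ValueError("baseline MSE is zero; percent increase is undefined")
    importance = []
    for j in range(len(X[0])):
        increases = []
        for _ in range(n_repeats):
            column = [row[j] for row in X]
            rng.shuffle(column)
            X_perm = [row[:j] + [v] + row[j + 1:] for row, v in zip(X, column)]
            increases.append(mse(y, [predict(row) for row in X_perm]) - baseline)
        importance.append(100.0 * sum(increases) / n_repeats / baseline)
    return importance


# Hypothetical data: yield depends on the first predictor only.
X = [[0.1, 0.9], [0.2, 0.3], [0.3, 0.7], [0.4, 0.1], [0.5, 0.5], [0.6, 0.2]]
noise = [1.0, -1.0, 0.5, -0.5, 1.0, -1.0]
y = [50 + 40 * row[0] + e for row, e in zip(X, noise)]


def toy_model(row):
    """Stand-in for a fitted model that ignores the second predictor."""
    return 50 + 40 * row[0]


print(permutation_importance(toy_model, X, y))
```

A predictor the model does not use gets an importance of zero, while shuffling an informative predictor inflates the error; this is the ranking signal used for variable selection.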
Additionally, we used the Recursive Feature Elimination (RFE) algorithm to select the most relevant VIs for yield estimation (Pullanagari et al., 2018). RFE, a wrapper-based variable selection method, is particularly effective for analysing high-dimensional datasets (Degenhardt et al., 2019). To optimize variable selection, 10-fold cross-validation was employed, using Root Mean Square Error (RMSE) to identify the optimal predictor variables. This method was preferred over the OOB error, which can be biased in evaluating the overall error of the selection procedure. RFE was implemented using the caret R package and the RF algorithm (Kuhn, 2008).

The predictive performance of our regression models was evaluated using the Leave-One-Out Cross-Validation (LOOCV) method. This involved sequentially removing one cropping season at a time and fitting a new model on the remaining seasons to predict the yield of the omitted season. Regression analyses were performed on the training dataset, with subsequent cross-validation using an independent validation dataset. Several metrics, including Mean Absolute Error (MAE), Mean Absolute Percentage Error (MAPE), RMSE, Mean percent (Mean %), and the coefficient of determination (R²), were used to assess the models. These assessments established a direct comparison between observed and predicted yield values and allowed the performance of the RF regression to be compared against the SMR.
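The season-wise validation and the accuracy metrics can be sketched in Python as follows. The function names and sample values are hypothetical. R² is computed here as 1 - SS_res/SS_tot, which is an assumption because the text does not state the formula, and Mean % is omitted because its definition is not given.

```python
from math import sqrt


def regression_metrics(observed, predicted):
    """Return MAE, MAPE (%), RMSE and R2 for paired yield values."""
    n = len(observed)
    errors = [p - o for o, p in zip(observed, predicted)]
    mean_obs = sum(observed) / n
    ss_res = sum(e * e for e in errors)
    ss_tot = sum((o - mean_obs) ** 2 for o in observed)
    return {
        "MAE": sum(abs(e) for e in errors) / n,
        "MAPE": 100.0 * sum(abs(e) / o for e, o in zip(errors, observed)) / n,
        "RMSE": sqrt(ss_res / n),
        "R2": 1.0 - ss_res / ss_tot,  # assumed definition
    }


def leave_one_season_out(seasons):
    """Yield (held_out, train_indices, test_indices) for each season."""
    for held_out in sorted(set(seasons)):
        train = [i for i, s in enumerate(seasons) if s != held_out]
        test = [i for i, s in enumerate(seasons) if s == held_out]
        yield held_out, train, test


# Hypothetical field records: one season label per field observation.
seasons = ["2016/17", "2016/17", "2017/18", "2018/19"]
for held_out, train, test in leave_one_season_out(seasons):
    print(held_out, "train:", train, "test:", test)

# Hypothetical observed and predicted yields (t/ha).
print(regression_metrics([100, 120, 80], [110, 110, 90]))
```

Holding out a whole season, rather than a single field, keeps fields from the same season out of the training set, so the error reflects prediction for an unseen year.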

RESULTS AND DISCUSSION
The RF regression parameters were optimized using the training datasets and the Out-Of-Bag Root Mean Square Error (OOB_RMSE), considering ntree values of 500, 1000, 1500, and 2000 and all 22 possible mtry values. This optimization revealed that the ideal ntree and mtry values varied with the prediction date and the growing estate. Interestingly, the models generally favored mtry values above the default setting, except for the 2016/17-I season at Wonji-Shoa.
Across the mtry values tested, all ntree settings from 500 to 2000 produced consistently low prediction errors, with 500 trees often outperforming the others. OOB_RMSE was therefore not very sensitive to ntree, which suggests that model performance is stable across ntree values.
The importance of the predictor variables, specifically the VIs, was evaluated using %IncMSE, which measures the deterioration in the predictive performance of the sugarcane yield estimation models when each predictor is permuted. VIs such as NDVIre1, NDVIre3n, NDVIre2n, NDRE1, and NDRE2 emerged as significant contributors to yield estimation. The RF-RFE algorithm further refined the selection, identifying the optimal VIs that improved the RF algorithm's predictive performance. With this selection, the accuracy of yield estimation improved for both the Wonji-Shoa and Metehara study areas, with a significant proportion of the VIs being selected as important. Notably, a combination of VIs based on red-edge and NIR narrow bands, together with NDMI and MSI, yielded the lowest RMSE.
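The backward-elimination logic behind RF-RFE can be sketched in Python. This is a simplified variant that re-ranks the predictors at every step; it is not the caret implementation used in the study. The `evaluate` callable, the toy scoring rule, and the importance values are hypothetical stand-ins for a cross-validated RF fit.

```python
def recursive_feature_elimination(features, evaluate):
    """Drop the least important feature repeatedly; keep the best subset.

    evaluate: callable taking a list of feature names and returning
    (cv_rmse, {feature: importance}) for a model fitted on that subset.
    """
    current = list(features)
    best_subset, best_rmse = None, float("inf")
    while current:
        rmse, importance = evaluate(current)
        if rmse < best_rmse:
            best_subset, best_rmse = list(current), rmse
        weakest = min(current, key=lambda f: importance[f])
        current.remove(weakest)
    return best_subset, best_rmse


# Hypothetical importances; the two "noise" entries carry no signal.
TOY_IMPORTANCE = {"NDVIre1": 5.0, "NDRE1": 4.0, "NDMI": 3.0,
                  "noise_vi_1": 1.0, "noise_vi_2": 0.5}
USEFUL = {"NDVIre1", "NDRE1", "NDMI"}


def toy_evaluate(subset):
    """Toy CV error: penalize noise features kept and useful ones dropped."""
    kept = set(subset)
    rmse = 10.0 + 2.0 * len(kept - USEFUL) + 3.0 * len(USEFUL - kept)
    return rmse, {f: TOY_IMPORTANCE[f] for f in subset}


subset, rmse = recursive_feature_elimination(list(TOY_IMPORTANCE), toy_evaluate)
print(sorted(subset), rmse)
```

In this toy setting the search keeps the three informative predictors, showing how a reduced subset can give a lower cross-validated RMSE than the full set.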

CONCLUSION
RF regression is a powerful statistical model that has found many applications in remote sensing. It has been applied to a wide variety of crop monitoring and yield prediction problems, such as the estimation of crop biomass, LAI, and yield, and crop disease detection and prediction. One feature that has added to the popularity of the RF algorithm, especially in remote sensing, is that it offers feature selection methods for identifying the variables with the best predictive power, rather than prediction only. In this study, 22 VIs derived from multi-temporal S2A imagery were considered for estimating sugarcane yield in the Wonji-Shoa and Metehara estates, Ethiopia. RF regression was explored to select optimal VIs that could be used to accurately estimate sugarcane yield. The predictive performance of the RF-RFE method was evaluated and compared with that of the RF and the SMR. The results demonstrated that the optimal VIs involving the additional red-edge spectral bands of S2, which include NDVIre1n, NDVIre2n, NDVIre3n, NDRE1, and NDRE2, could improve the predictions of the RF model compared with using the RF with the full datasets.
Overall, the RF-RFE method provided a distinct subset of VIs with reasonable prediction performance. The results showed that RF-RFE produced better sugarcane yield estimates (MAE = 10.9 t/ha, MAPE = 17.38%, RMSE = 15.41 t/ha, Mean % = 12.76 t/ha, R² = 0.7 for Wonji-Shoa; MAE = 9.53 t/ha, MAPE = 18.58%, RMSE = 13.11 t/ha, Mean % = 9 t/ha, R² = 0.72 for Metehara). Thus, we recommend employing the RF-RFE algorithm, which provides reasonable variable selection in crop yield estimation. When RF-RFE is applied using a cross-validation approach, the resulting variable importance measures can be used reliably for optimal variable selection. This study illustrated the use of S2 VIs and RFE-based RF regression for sugarcane yield estimation. In conclusion, the results demonstrated that the RF-RFE method is effective for variable selection and prediction in sugarcane yield estimation and can be applied directly in crop yield monitoring and prediction studies for irrigation management strategies.

Figure 1. Map of the study area

The overall performance of the RF methods was evaluated by comparing the RF-RFE with the RF using the full datasets and with the SMR. The RF-RFE algorithm demonstrated superior predictive accuracy, as evidenced by lower Mean Absolute Error (MAE), Root Mean Square Error (RMSE), and Mean Absolute Percentage Error (MAPE) values in comparison to the RF with full datasets. This improvement was consistent across both the Wonji-Shoa and Metehara estates. The RF-RFE algorithm also outperformed the SMR on all prediction dates. The observed versus estimated sugarcane yield, represented in graphical form, further confirmed the enhanced predictive capability of the RF-RFE algorithm using selected VIs.

Table 1. Descriptive statistics of sugarcane yield (t/ha)

Satellite Data Acquisition and Pre-processing
For this study, Level 1C S2 satellite images representing the grand growth stage of sugarcane were acquired from the USGS Earth Explorer portal. The tile encompassing the study area was T37PEK, with a relative orbit number of 92. The research focused on sugarcane growing seasons from 2016/17 to 2018/19. Five cropping seasons were selected for analysis, constrained by the limited availability of S2 data before 2016.

Table 2. List of Sentinel 2A VIs and their formulas used in this study

Table 2. Comparative analysis of sugarcane yield estimation models utilizing S2 VIs at Wonji-Shoa

Table 3. Comparative analysis of sugarcane yield estimation models utilizing S2 VIs at Metehara