IRRIGATED AGRICULTURE MAPPING IN A SEMI-ARID REGION IN BRAZIL BASED ON THE USE OF SENTINEL-2 DATA AND RANDOM FOREST ALGORITHM

Irrigation is important for agricultural production and is often decisive, especially in arid and semi-arid areas, where precipitation is insufficient. In Brazil, irrigated agriculture is responsible for 46% of withdrawals from water bodies and 67% of the consumption of the total volume of water collected, representing the highest consumptive use in the country. Remote sensing technologies have great potential for developing methods for monitoring irrigated areas. However, mapping irrigated areas is still a challenge, due to the complexity and diversity of irrigation methods and crops, especially in a country of continental dimensions like Brazil. Remote sensing techniques for mapping irrigated areas in Brazil have been applied mainly in areas with center pivot irrigation in the Cerrado, and with paddy rice in the south of Brazil. Few or no applications have mapped crops irrigated by other methods, particularly in the semi-arid region. The objective of this work was to investigate a method for classifying irrigated agriculture in a semi-arid region of Brazil, based on Sentinel 2 imagery and the random forest algorithm. We proposed a novel and robust methodology, showing with preliminary results that it is possible to identify irrigated agriculture in this region with a class F1-score of 74% for complementary irrigation and 95% for center pivots.


INTRODUCTION
Agriculture is the biggest global consumer of water, with irrigated areas constituting 40% of the total area used for agricultural production (FAO, 2014). Irrigated areas increased by approximately 2.6% per year, from 95 million hectares (Mha) in the 1940s to 280 Mha in the early 1990s (Van Schilfgaarde, 1994; Seckler et al., 2000; Siebert et al., 2005a, b, 2006). FAO estimates that 80% of food needs in 2025 will be covered by irrigated agriculture (Schaldach et al., 2012), and more than 324 million hectares are equipped for irrigation in the world (Dubois, 2011).
In Brazil, irrigated agriculture is responsible for 46% of withdrawals from water bodies and 67% of the consumption of the total volume of water collected, representing the highest consumptive use in the country. It is a dynamic activity that has shown persistent growth in recent decades, often against the grain of unstable and negative periods in the Brazilian economy. Growth intensified in the most recent period, between 2012 and 2019, linked to the greater contribution of credit and private investments: it was around 4% per year, with around 216 thousand hectares of irrigated fields incorporated per year. In 2019, the value of irrigated production surpassed the BRL 55 billion mark (ANA, 2021).
In general, the survey carried out by the National Agency for Water and Basic Sanitation (ANA) shows that, with the current availability of water, only 36% of the agricultural area and 15% of the pasture area could be converted into irrigated areas in Brazil. However, the potential for expansion of irrigated areas (total and effective) must be viewed with caution, as local particularities, infrastructure expansion, and water infrastructure works can change the estimate of additional irrigable area, especially when the water supply is increased with transfers from other basins or reduced with the installation of other uses or with the revision of water supply databases. Furthermore, the potentials were estimated only for current agricultural areas (agriculture and pasture already consolidated). The expansion of the irrigated area in the country has occurred, and should continue to occur, along three main lines: public perimeters planned by government agencies; joint private initiatives, organized in the form of cooperatives or associations; and individual private initiatives. In this context, it is important to strengthen planning and organize the State's role as a promoter and partner of this development, especially at the federal level, in conjunction with states, municipalities, and the private sector (ANA, 2021).
Irrigation is important for agricultural production and is often decisive, especially in arid and semi-arid areas, where precipitation is insufficient. The benefits of irrigation include increased productivity, the dissociation of constraints, and greater resistance to extreme weather events. Irrigation also modifies temperature, humidity, and precipitation regimes on local to regional scales, and evapotranspiration (ET) globally. However, irrigation can also have significant social and environmental impacts, including drainage or maintenance of wetlands, disruption of sedimentation, increased soil salinity, changes in river temperatures, changes in water table depth, decreased water flow, changes in peak discharge and baseflow, and conflicts over water use (Ketchum et al., 2020; Pousa et al., 2019). To mitigate or eliminate negative impacts, the expansion of irrigated areas must observe the different dimensions of sustainability.
As seen, information on the spatial distribution of irrigated areas is highly relevant for water management and food security, being essential for decision-makers who are facing the transition to a more efficient and sustainable agriculture. However, the spatial patterns, extent, and intensity of water use by irrigated agriculture are currently not well understood. In part, this is because statistics and reports are costly to produce and can be biased due to over- or under-reporting of water use (Rufin et al., 2021; Deines et al., 2019; Özdoğan et al., 2006). In addition, irrigated area statistics often do not include smaller or informal irrigated areas (e.g., those supplied by groundwater, small reservoirs, and ponds), due to the difficulty in mapping these smaller areas, as well as the inherent underreporting of interview-based survey systems. However, in many countries, these areas are very significant and even exceed the main irrigated areas.
Remote sensing techniques for mapping irrigated areas in Brazil have been applied mainly in areas with center pivot irrigation in the Cerrado (Saraiva et al., 2020; Albuquerque et al., 2021). Recently, some applications involving flood-irrigated rice crop mapping have been carried out (De Bem et al., 2021). However, few or no applications have mapped crops irrigated by other methods, particularly in the semi-arid region. There is still a gap in the scientific literature concerning the development of semiautomatic methods for mapping irrigated agriculture by remote sensing that can be transferred to the Brazilian reality.
The objective of this work was to investigate a method for classifying land use and land cover (LULC) in a semi-arid region of Brazil, focusing on irrigated agriculture, based on Sentinel 2 imagery and the random forest algorithm. This work is part of the project "Irrigated Agriculture Based on Remote Sensing Technologies to Update and Improve ANA's Atlas Irrigation", developed by INPE (National Institute for Space Research) and ANA, which aims at developing a method for automatically mapping irrigated agricultural land and estimating water use in Brazilian irrigated agriculture.

Study areas
The study area of the western Ceará agriculture hub is located around the municipalities of Guaraciaba do Norte, São Benedito, and Ipu, within the Caatinga biome (Figure 1). Agricultural cultivation areas correspond to small plots, usually close to watercourses. There is no center pivot irrigation; irrigation is based on micro-spray and drip systems. Native vegetation predominates in this area and is composed of deciduous shrub vegetation, that is, vegetation that loses its foliage in the dry season. Evergreen vegetation is also present in the humid areas. The predominant climate is Tropical, Equatorial Zone, hot, with an average temperature > 18 °C in all months, and semi-arid, with 7 to 8 dry months.

The study area of the Petrolina/Juazeiro agriculture hub is located around the municipalities of Petrolina, in the state of Pernambuco, and Juazeiro, in the north of the state of Bahia, and is also within the Caatinga biome (Figure 1). Agricultural cultivation areas are made up of large groups of plots with perennial and annual crops, distributed among a matrix of native vegetation. Native vegetation predominates in this area and is composed of deciduous shrub vegetation, with evergreen vegetation on the banks of watercourses. The predominant climate is Tropical, Equatorial Zone, hot, with an average temperature > 18 °C in all months, and semi-arid, with 7 to 8 dry months.

Sample Generation
The protocol for sample generation began with the segmentation of cloud-free Sentinel 2 images obtained during the dry period (T24MTA_20190810T1 and T24LUQ_20191026T1). Using bands 2, 3, 4 and 8 (blue, green, red and NIR), we applied the multi-resolution segmentation algorithm with scale, shape and compactness parameters of 80, 0.2 and 0.5, respectively. We then intersected the segmented images with the corresponding Dynamic World classified image (Brown et al., 2022) to obtain the majority land use and land cover (LULC) class per segment and the probability values of each class. We clipped the segmented images into smaller subsets covering representative areas of the regions. The segments whose majority class had the highest probability were assigned to a specialist, who carried out the visual interpretation considering the LULC classes shown in Table 1. For the visual interpretation, we assumed that agricultural areas showing a stronger color pattern in the infrared during the dry period were irrigated. According to information from ANA specialists, most agriculture in these regions has some level of supplementary irrigation, with rainfed areas basically consisting of subsistence agriculture in small and diffuse areas. Finally, for each segment, sample points carrying the class of their respective polygon were generated, and the set of all points generated in the subsets was used as input data for training.

Several operational sensor systems are currently in orbit, and the development of infrastructure for remote sensing data storage and dissemination makes it possible to derive consistent analysis-ready data (ARD), eliminating the need for pre-processing and storage by the user (Potapov et al., 2020; Frantz, 2019). This allows exploring the full potential of integrated data analysis, in which metrics derived from ARD time series (e.g., phenological or spectral-temporal metrics) are combined with meaningful environmental data for a specific domain. In this work, we used data cubes built from optical images of the MultiSpectral Instrument (MSI) on board the Sentinel 2A and 2B satellites. Surface reflectance images were accessed from the "COPERNICUS/S2_SR" asset in Google Earth Engine (GEE), with atmospheric correction performed with the Sen2Cor method (Main-Knorn et al., 2017). The cloud mask was obtained with the Cloud Displacement Index (CDI) algorithm (Frantz et al., 2018), which makes use of the three highly correlated near-infrared bands that are observed with different viewing angles. Elevated objects such as clouds are thus observed under a parallax and can be reliably separated from other objects on the Earth's surface.

Dense data cubes, that is, cubes with a temporal resolution of 8 days, were generated for the calculation of phenological metrics, which require short intervals between observations that are equally spaced in time. To this end, 8-day temporal mosaics were created in GEE, based on the highest value, through the "qualityMosaic" function. For the dense data cubes, when no image with a cloud cover percentage lower than 50% was found within the 8 days, a synthetic image was generated. Subsequently, in a Python programming environment, the pixels contaminated by clouds and cloud shadows in the data cubes were interpolated using the Radial Basis Function method (Schwieder et al., 2016; Bendini et al., 2019). Data cubes with a temporal resolution of 16 days were also generated to calculate the accumulated sums of vegetation indices during the dry period. To identify the dry period, CHIRPS monthly precipitation data were used: for each CHIRPS pixel (1 km), the three driest months in 2019 were obtained. The dense data cubes were generated for the following vegetation indices: EVI2 (Two-band Enhanced Vegetation Index) (Jiang et al., 2008), NDWI (Normalized Difference Water Index) (McFeeters, 1996), GI (Greenness Index) (Gitelson, 2003), ARVI2 (Atmospherically Resistant Vegetation Index 2) (Kaufman et al., 1992), and LSWI (Land Surface Water Index) (Jügens, 1997).
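As an illustration of the vegetation indices listed above, the sketch below computes EVI2, NDWI and LSWI for a single pixel using their commonly published band formulations. It assumes surface reflectances already scaled to the 0-1 range and a generic SWIR band for LSWI; the function names and the example reflectances are hypothetical and are not taken from the processing chain used in this work. GI and ARVI2 are left out because their exact formulations are not stated in the text.

```python
def normalized_difference(a, b):
    """Return (a - b) / (a + b), or NaN when the denominator is zero."""
    denom = a + b
    return (a - b) / denom if denom != 0 else float("nan")


def evi2(nir, red):
    """Two-band Enhanced Vegetation Index (Jiang et al., 2008)."""
    return 2.5 * (nir - red) / (nir + 2.4 * red + 1.0)


def ndwi(green, nir):
    """Normalized Difference Water Index (McFeeters, 1996)."""
    return normalized_difference(green, nir)


def lswi(nir, swir):
    """Land Surface Water Index from NIR and a shortwave-infrared band."""
    return normalized_difference(nir, swir)


# Hypothetical reflectances (0-1) for one vegetated pixel, not study data.
pixel = {"green": 0.08, "red": 0.05, "nir": 0.40, "swir": 0.20}

indices = {
    "EVI2": evi2(pixel["nir"], pixel["red"]),
    "NDWI": ndwi(pixel["green"], pixel["nir"]),
    "LSWI": lswi(pixel["nir"], pixel["swir"]),
}
```

Applied to every pixel and every 8-day composite, functions of this kind would yield one time series per index, which is the form in which the dense data cubes are consumed by the metrics described in the following subsections.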

Accumulated sum of VI during the dry season:
The cumulative sum of the vegetation index values was computed with a zonal summation operator over the Sentinel 2 images, following the grid of CHIRPS pixels, so that the dry period could vary spatially. Because a temporal resolution of 16 days is sufficient for an accumulated sum, a dense cube, which requires more sophisticated interpolation methods not available in GEE, was not necessary. The whole process was therefore carried out in GEE to reduce computational costs, which would otherwise be amplified by the strategy of using CHIRPS to identify local dry periods. As null values could affect the accumulated sum, an interpolation was still necessary; since the process ran in GEE, a simpler interpolation was used, replacing null values with the average of the values in the period. Negative values, which would also significantly affect the sum, were replaced by zero.
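The logic of this step can be sketched for a single location in plain Python. The sketch assumes twelve monthly precipitation totals for the year and a list of 16-day vegetation index values within the dry period, with missing observations given as None or NaN. The function names and all numbers are hypothetical, the processing in this work was done in GEE rather than with this code, and the order of operations (gap filling before clipping negative values) is an assumption.

```python
import math


def driest_months(monthly_precip, n=3):
    """Return the 1-based numbers of the n driest months, in calendar order."""
    order = sorted(range(len(monthly_precip)), key=lambda m: monthly_precip[m])
    return sorted(m + 1 for m in order[:n])


def is_missing(value):
    """True for observations flagged as None or NaN."""
    return value is None or math.isnan(value)


def accumulated_vi(series):
    """Sum a dry-period VI series after mean gap filling and zero clipping."""
    valid = [v for v in series if not is_missing(v)]
    if not valid:
        return float("nan")  # nothing observed in the period
    period_mean = sum(valid) / len(valid)
    # Missing observations take the mean of the valid values in the period.
    filled = [period_mean if is_missing(v) else v for v in series]
    # Negative index values are replaced by zero before summing.
    return sum(max(v, 0.0) for v in filled)


# Hypothetical monthly precipitation (mm), January to December.
precip_2019 = [120, 150, 180, 160, 90, 40, 20, 5, 3, 8, 30, 80]
dry = driest_months(precip_2019)  # [8, 9, 10]

# Hypothetical 16-day EVI2 values inside the dry period, with one gap.
total = accumulated_vi([0.2, None, -0.1, 0.5])
```

Because the dry months are selected per precipitation pixel, neighbouring cells may accumulate the index over different calendar windows.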

Phenological metrics:
We derived the phenological metrics from the Sentinel 2 EVI2 dense data cube described earlier, for the agricultural year 2019-2020, i.e., August 2019 to October 2020 (CONAB, 2018). A total of 13 phenological metrics were derived using the Python Stmetrics package (Soares, A.; Bendini, H. N.; et al., 2019) (https://github.com/brazil-data-cube/stmetrics, accessed on 20 November 2022). We extracted them for the 3 seasonal cycles observed in the EVI2 time series, totaling 39 metrics. These metrics refer, for example, to the beginning and end of the season, the maximum EVI2 value of the season, or the amplitude, and are inspired by the TIMESAT software, explained in detail in Jönsson and Eklundh (2004). Bendini et al. (2019) used them for mapping cropping systems and crop types in Brazil with good accuracies.
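To make the kind of metric described here concrete, the following sketch derives a few TIMESAT-style quantities from one seasonal cycle of an equally spaced EVI2 series. It is not the Stmetrics API and does not reproduce its 13 metrics: the function name, the 20% amplitude threshold used to mark the start and end of the season, and the example series are all assumptions made for illustration.

```python
def season_metrics(evi2_series, threshold=0.2):
    """Basic phenological metrics for one seasonal cycle.

    Start and end of season are the first and last time steps at which
    EVI2 is at or above base + threshold * amplitude.
    """
    if not evi2_series:
        raise ValueError("the EVI2 series must contain at least one value")
    base = min(evi2_series)
    peak = max(evi2_series)
    amplitude = peak - base
    level = base + threshold * amplitude
    above = [i for i, v in enumerate(evi2_series) if v >= level]
    return {
        "base": base,
        "peak": peak,
        "amplitude": amplitude,
        "peak_index": evi2_series.index(peak),
        "start": above[0],
        "end": above[-1],
        "length": above[-1] - above[0],  # in time steps (8 days each)
    }


# Hypothetical 8-day EVI2 values for a single crop cycle.
cycle = [0.20, 0.20, 0.35, 0.50, 0.70, 0.60, 0.40, 0.25, 0.20]
metrics = season_metrics(cycle)
```

Running a function like this on each of the three seasonal cycles of a pixel would give one set of metrics per cycle, which is how the total number of features grows with the number of cycles.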

2.3.5 Spectral-temporal metrics (STM):
Spectral-temporal metrics (STM) (Griffiths et al., 2013; Rufin et al., 2015) were calculated for the dry and rainy seasons of the year 2019. They consist of statistical values (standard deviation, variance, mean, median, and percentiles) obtained from Sentinel 2 surface reflectance images, free of clouds and cloud shadows, for all spectral bands (B, G, R, NIR, Red-edge, and SWIR). Rufin et al. (2018) used them for mapping cropping systems in Turkey with good accuracies.
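A spectral-temporal metric is simply a statistic taken along the time axis of the cloud-free observations of one band within a season. The sketch below computes the statistics named above for a single pixel and band with the Python standard library. The choice of the 25th and 75th percentiles, the use of population (rather than sample) variance, and the reflectance values are assumptions for illustration; the text does not state which percentiles were used.

```python
import statistics


def percentile(sorted_values, p):
    """Percentile p (0-100) by linear interpolation between order statistics."""
    pos = (len(sorted_values) - 1) * p / 100.0
    lower = int(pos)
    upper = min(lower + 1, len(sorted_values) - 1)
    frac = pos - lower
    return sorted_values[lower] * (1.0 - frac) + sorted_values[upper] * frac


def spectral_temporal_metrics(observations):
    """Season statistics for the clear observations of one band and pixel."""
    values = sorted(observations)
    return {
        "mean": statistics.fmean(values),
        "median": statistics.median(values),
        "std": statistics.pstdev(values),
        "var": statistics.pvariance(values),
        "p25": percentile(values, 25),
        "p75": percentile(values, 75),
    }


# Hypothetical dry-season NIR reflectances of one pixel (clouds removed).
nir_dry = [0.30, 0.10, 0.50, 0.20, 0.40]
stm_nir_dry = spectral_temporal_metrics(nir_dry)
```

Repeating this for each band and for both the dry and rainy seasons produces a fixed-length feature vector per pixel, regardless of how many clear observations each season contains.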

Neighborhood Green Chlorophyll Vegetation Index (NGCVI):
The Neighborhood Green Chlorophyll Vegetation Index (NGCVI) was described by Deines et al. (2019) and consists of a normalization of the Green Chlorophyll Vegetation Index (GCVI) (Gitelson et al., 2005) by a neighborhood index, which is defined as the maximum value of the GCVI divided by the 15th percentile, convolved with a kernel. Deines et al. (2019) used a 50 km radius to map the High Plains aquifer, based on the work of Xu et al. (2019), which explores the autocorrelation of climate variables. Also in Deines et al. (2019), the Landsat image was resampled to 1000 meters due to the computing limitations of the Google Earth Engine platform, a procedure that was replicated in this work for Sentinel 2 images. Fernandes Filho et al. (2022) used the NGCVI for classifying irrigated agriculture in the Brazilian semi-arid region with promising results. However, the authors suggested that incorporating other vegetation indices is necessary to increase the accuracy of the classification.

Classification
The predictor variables were used together with the reference data to train a Random Forest (RF) classifier. RF is a nonparametric machine learning algorithm based on decision trees. As individual decision trees are error-prone, RF uses a set of many decision trees that have been independently trained with random subsets of the input data to overcome this limitation (Breiman, 2001). The implementation of the algorithm in Python (Python, 2022) also allows evaluating the importance of each input variable based on the Gini impurity. Different models were trained for each study area, and the classification was evaluated using metrics derived from the confusion matrix, such as global (overall) accuracy and F1-score. Validation was performed with 30% of the data, in an n-partition scheme, with the remaining 70% used for training. Subsequently, the robustness of these models was evaluated by applying the same model to an area different from the training area, but in the same irrigation hub. The same sample collection protocol was applied in these test areas, so that the validation was performed with data totally independent of the training set.

In the EVI2 time series of the training samples (Figure 2), we can observe that the agricultural areas present a seasonality similar to that of the natural vegetation. In the case of center pivots, however, there is an indication of seasonality at the beginning of the agricultural year, suggesting the presence of two harvests. The similarity is possibly due to the fact that perennial agricultural crops, which behave similarly to natural vegetation, are widespread in this region.
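The evaluation metrics mentioned in this section follow directly from the confusion matrix. The sketch below shows the arithmetic, assuming rows hold the reference classes and columns the predicted classes. The three-class matrix is invented for illustration and does not reproduce the results of this work; the function names are likewise hypothetical.

```python
def overall_accuracy(cm):
    """Share of samples on the diagonal of the confusion matrix."""
    total = sum(sum(row) for row in cm)
    correct = sum(cm[i][i] for i in range(len(cm)))
    return correct / total


def class_f1(cm, k):
    """F1-score of class k: harmonic mean of its precision and recall."""
    tp = cm[k][k]
    fn = sum(cm[k]) - tp                # reference k, predicted as another class
    fp = sum(row[k] for row in cm) - tp  # predicted k, reference is another class
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


# Hypothetical matrix: rows = reference, columns = prediction.
# Class order: irrigated agriculture, shrub vegetation, other.
cm = [
    [50, 10, 0],
    [5, 30, 5],
    [0, 10, 40],
]

accuracy = overall_accuracy(cm)                 # 120 of 150 samples correct
f1_per_class = [class_f1(cm, k) for k in range(len(cm))]
```

A class-level F1-score can exceed the overall accuracy, as reported for irrigated agriculture in this work, because it only reflects the errors involving that class, whereas overall accuracy is lowered by confusion among all classes.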

RESULTS AND DISCUSSIONS
Figures 3 and 4 illustrate the mapping results for the study areas in the Juazeiro/Petrolina and Western Ceará hubs, respectively. Although the overall accuracy of the global model was 72%, the target class, irrigated agriculture, achieved a class F1-score of 74%. The spatial patterns are very consistent, reflecting the quality of the mapping. Figure 5 shows the confusion matrix of the global model. There was confusion between agriculture and the classes referring to natural vegetation. We infer that the more pronounced seasonality of the semi-arid vegetation can lead to confusion with agriculture. A more cautious sampling is therefore suggested, including in the sample collection protocol an evaluation of the EVI time series in these areas, to verify the characteristic pattern of agriculture, as well as an increase in the number of samples, distributed more evenly across the scene.

FINAL CONSIDERATIONS AND FUTURE WORKS
We explored a method for classifying irrigated agriculture in the semi-arid region of Brazil, using Sentinel 2 data and the random forest algorithm. We proposed a global model for classifying irrigated agriculture in this region. Good results were observed for irrigated agriculture; however, there was also a significant inclusion of areas of native vegetation (mainly shrubs). This is associated with the high seasonality of this vegetation, as well as its resilience to periods of drought (in the case of native forest areas close to water bodies). In the next steps, these areas of inclusion will be inspected more carefully and then sampled more representatively.

Figure 2 illustrates the mean time series of all pixels corresponding to the training samples used for the global model.

Figure 2. EVI2 curves, obtained for the points of the samples used in the training model.

Figure 3. False Color composition (NIR, Red, Green), and result of the classification for training area in the region of Juazeiro/Petrolina hub.

Figure 4. False Color composition (NIR, Red, Green), and result of the classification for training area in the region of Western Ceará hub.

Figure 5. Classification confusion matrix for the global model.

Table 1. Number of samples.

Datasets generation
2.3.1 Environmental data:
Terrain data were obtained from the SRTM (Shuttle Radar Topography Mission) (NASA, 2013), from which the elevation and slope images were derived. Precipitation data from CHIRPS (Climate Hazards Group InfraRed Precipitation with Station data), a nearly global rainfall dataset spanning more than 30 years, were also used. CHIRPS incorporates 0.05° resolution satellite imagery with in-situ station data to create gridded precipitation time series for trend analysis and seasonal drought monitoring.

Recent advances in remote sensing technologies offer great opportunities to map land use and land cover over large areas.