LINKING SATELLITE REMOTE SENSING BASED ENVIRONMENTAL PREDICTORS TO DISEASE: AN APPLICATION TO THE SPATIOTEMPORAL MODELLING OF SCHISTOSOMIASIS IN GHANA

Abstract. 90% of the worldwide schistosomiasis burden falls on sub-Saharan Africa. Control efforts are often based on infrequent, small-scale health surveys, which are expensive and logistically difficult to conduct. Use of satellite imagery to predictively model infectious disease transmission has great potential for public health applications. Transmission of schistosomiasis requires specific environmental conditions to sustain freshwater snails, however has unknown seasonality, and is difficult to study due to a long lag between infection and clinical symptoms. To overcome this, we employed a comprehensive 8-year time-series built from remote sensing feeds. The purely environmental predictor variables: accumulated precipitation, land surface temperature, vegetative growth indices, and climate zones created from a novel climate regionalization technique, were regressed against 8 years of national surveillance data in Ghana. All data were aggregated temporally into monthly observations, and spatially at the level of administrative districts. The result of an initial mixed effects model had 41% explained variance overall. Stratification by climate zone brought the R2 as high as 50% for major zones and as high as 59% for minor zones. This can lead to a predictive risk model used to develop a decision support framework to design treatment schemes and direct scarce resources to areas with the highest risk of infection. This framework can be applied to diseases sensitive to climate or to locations where remote sensing would be better suited than health surveys.


INTRODUCTION
Schistosomiasis is a debilitating infection acquired from contact with infested water bodies and caused by parasitic blood flukes of the genus Schistosoma.It infects over 200 million people worldwide, and around 779 million are at risk (Steinmann et al., 2006;Utzinger et al., 2009;Gray et al., 2010).Schistosomiasis is endemic to parts of South America, the Middle East, Asia, and Africa.Sub-Saharan Africa contributes 90% of cases worldwide (Chitsulo et al., 2000;Gryseels et al., 2006;Hurlimann et al., 2011) and Ghana has one of the highest prevalence rates of infections (WHO, 2010).In Ghana 30-35% of the population require preventative chemotherapy, however only 2-8% of the population receives it annually (WHO, 2010).Due to limited disease surveillance, control efforts are often based on infrequent, small-scale health surveys, which are expensive and logistically difficult to conduct.
Favorable snail habitat can serve as a proxy for potential schistosomiasis transmission, because snails must be present for the parasite to develop into a stage infective to humans (Sturrock et al., 2001).For disease to occur there must be interaction between the parasite, snail, and human.The parasite and snail factors are primarily environmental (see Table 1).Thus, environmental variables that dictate snail habitat can be used to predict disease transmission.Furthermore, these parameters can be accessed by remote sensing (RS) technology (Simoonga et al., 2009).Specific RS data are publically available at no or limited cost, and since this technology can cover a large spatial and temporal scale, it is ideal for use in low-income countries lacking the resources to perform expensive ground-based studies (Gryseels et al., 2006).Recent review papers show that the most commonly used environmental variables are temperature, vegetation, rainfall, water chemistry, distance to waterbodies, and elevation (Simonga et al., 2009;Walz et al., 2015a).Aggregation over ecological zones as opposed to administrative units has been recommended in the literature, and is shown to be valuable in predicting the occurrence of schistosomiasis (Simoonga et al., 2008;Grosse, 1993;Walz et al., 2015b;Walz et al., 2015c).Walz et al., 2015a).

Parasite
In order to determine predictive properties of RS-based environmental parameters, we examined monthly records of reported schistosomaisis during 2008-2015 and their associations, with time-series built from remote sensing feeds at the district level in Ghana.

DATA
The environmental variables used in this work came entirely from publically available remote sensing data (Table 2).The environmental variables were selected based on their repeated use in the literature (Bavia et al., 2001;Simoonga et al 2009;Walz et. al 2015a), and for comparison to the predominant climate classification system for the past 100 years.The Köppen-Geiger (KG) climate classification system is based on the assumption that vegetation is the best proxy for climate, and that temperature and precipitation are the best proxies for vegetation (Kotteck et al., 2006).The variables were projected to WGS-84/UTM-30N datum prior to aggregation to mean values per district, using zonal statistics in the ArcGIS 10.3 software.Where resampling was required, the cubic convolution technique was used because it more realistically reflected the smooth transitions of environmental data across terrain., 2016).Rates were calculated per 10,000 people.

Data
The original case counts contained "blanks" (47% of all data points), which might be interpreted as a missing data points or as "true" lack of reporting for a given month in a given location.
In the current study "blanks" were replaced with zeros.To correct for right-skewed distribution, the rates were transformed by natural logarithm and adjusted by the lowest observed disease rate for a non-zero count of 0.00506.

LKN regionalization
LKN (LKNr) regions were created following the methodology of our recently proposed automated climate regionalization method, LKN-regionalization, which are based on k-means clustering algorithm over time-space (Liss et al. 2015).The LKN method consists of the data limiting step (L-step) to reduce dimensionality by applying principal component analysis, a classification step (K-step) to produce hierarchical candidate regions using k-means unsupervised classification algorithm, and a nomination step (N-step) to determine the number of candidate climate regions using cluster validity indexes.Using a comprehensive set of multiple satellite data streams arranged as time series, the method is capable of defining climate regions over large spatial extents.This is essential for large-scale epidemiological studies to account for geographic heterogeneity.
MODIS Normalized Difference Vegetation Index (NDVI) and Land Surface Temperature (LST) for 15 years (2000)(2001)(2002)(2003)(2004)(2005)(2006)(2007)(2008)(2009)(2010)(2011)(2012)(2013)(2014)(2015) were downloaded from the online Data Pool at the NASA Land Processes Distributed Active Archive Center (LP DAAC), USGS/Earth Resources Observation and Science (EROS) Center, Sioux Falls, South Dakota (LPDAAC-NASA 2000-2015).The data was mosaicked and re-projected so that it covered the entire extent of our study region (latitude and longitude of the upper left and lower right corners of the bounding box were: UL (11.625, -3.625), LR (4.375, 1.375)).Each of the two EOS satellites, Aqua and Terra, produced composites on overlapping 16 days schedules that we combined and constructed 8 day temporal resolution composites.The data was aggregated in a layered space-time series.Using a normalized vegetation index allowed us to reduce or eliminate the effects of annually changing lighting conditions, thin clouds, atmospheric and anisotropic distortions.
Water bodies were masked to avoid misclassification, because the water reflectance pattern differs significantly from almost any other land surface material by absorbing most of the incoming radiation.The imagery for each of the 8 days period was pixel-averaged across multiple years resulting in 43 layers.
The time series of 8-days vegetation and temperature rasters naturally exhibited a very high degree of spatial and temporal correlation.Following the original LKN-methodology we reduced dimensionality and orthogonalized this data by applying Principal Component (PC) decomposition to the original time series.We found that the first 4 and 8 principal components retained 90% and 95% of the original information respectively.A composite image created from the first four components showed a high spatial separation and a large signal to noise ratio (Figure 2B).The original methodology employs cluster analysis to define regions with similar climate.It aims to assign a finite set of labels (also known as categories or classes) to a very large number of multidimensional objects (pixels, representing a defined area on the ground in our case) based on their similarity followed by determination of the cluster validity.We perform classification using several parameter sets and algorithms with a varying number of principal components, regions, and applied distance measures resulting in approximately 200 classifications.We assessed the quality of clustering using cluster validity index (CVI) analysis.K-Means using Euclidean distance produced the best clustering, with 3 major zones (Figure 2C) and 9 minor zones (Figure 2D).

Associations between disease and RS-based parameters
Spatiotemporal patterns of schistosomiasis disease rates were explored using descriptive statistics, Spearman's correlation, and a multivariate generalized linear mixed effects regression model.The model was designed to control for temporal variations, such as trend and seasonality with two harmonic terms, and spatial variations by including district-level clustering effects.The model was repeated for major and minor climatic zones and formulated as follows: , where Ytj is the natural log transformed monthly rate per 10,000 people at t-month and j-district; t is the time in months (96 months from 2008 -2015); βL and βM represent the regression coefficients for fixed effects of seasonality, S, and remotely sensed variables, RS, respectively: (2) (3) bp represents the coefficients of the random effects for the remote sensing parameters for each district.Two harmonic terms account for potential multiple peaks, as the north of Ghana is characterized by one wet-dry season, while the south by two seasons.The models were fitted by the restricted maximum likelihood (REML) using the glmer function of the R package [lme4] (version 3.2.2).
Models results were presented as relative risk (RR) estimates with their 95% confidence intervals (95%CI).Model fit was assessed by superimposing the fitted model on a needle plot representation of the outcome's time series.

REULTS
The general summaries of outcome and predictor variables indicate spatial heterogeneity by climate zone (see Table 3).As evident from the spatial distributions of the predictor variables (Figure 2) there are clear climatic divisions in the long term averages of RS parameters.These divisions were well captured by the LKN regions (Figure 1), which provide a more accurate spatial unit of analysis in assessing the relationship between climate and schistosomiasis than administrative boundaries.
LKN regions affected the long term averages of the outcome and predictor variables.When stratified into the 9 minor regions, the rates ranged from 0.18 to 1.09 per 10,000 people (mean ± stdev.= 0.45 ± 1.61), LST ranged from 25.64 to 34.03 degrees Celsius (28.18 ± 4.36), NDVI ranged from 0.39 to 0.68 in health of vegetation between ± 1 (0.58 ± 0.17), and AP ranged from 0.10 to 0.14 in mm/month of rainfall (0.13 ± 0.09).
The long term average of reporting rate was very low, and could reflect underreporting.
All values of LST fall within the ideal range of temperature values suitable for the schistosoma parasite and snail species (Gryseels et al., 2006).Overall, NDVI shows healthy vegetation on average, as would be expected in a tropical environment.The range of rain values was small.The apparent albeit mild, differences in average values per LKN region suggest that studies of this nature require precision in order to flush out substantial spatiotemporal relationships in equatorial regions.Overall, the correlation between the monthly values of rates and the environmental variables was near zero (which is confirmed by a more detailed analysis).There was moderate correlation among the environmental variables (Figure 3), which did not warrant collinearity.The mixed effects models had 41% explained variance overall.

Zone
Stratification by LKN region brought the R 2 as high as 50% for major zones and as high as 59% for minor zones.Trend and seasonal oscillations contributes a major portion of the variability explained (see Figure 4; 37%).In terms of environmental predictor variables, LST and AP showed no significant risk association with disease rates.NDVI showed increased relative risk values in major zone 3 and minor zone 7 (RR = 1.64;CI95%: 1.01,2.67 and RR = 6.76 CI95%: 1.93,23.62,respectively).AP showed increased relative risk overall (RR = 1.47;CI95%: 1.01,2.15).

DISCUSSION
The proposed mixed effects model explained ~ 41% of spatiotemporal variability of the health outcome.When stratified by LKN region the R 2 ranged from 37-50% for the 3 major regions, and 28-59% for the 9 minor regions.RR associated with the trend variable in the models reflects the decline in the reported rates of schistosomiasis over the study period 2008-2015.While the exact nature of this will require further investigation, it is likely that mass drug administration (MDA) campaigns with the deworming drug, praziquantel, are a potential cause.MDA campaigns began in 2008 and have continued through 2015 to present (WHO, 2010;Merck, 2016).Also important to note is that according to the World Bank, Africa currently has one of the highest population growth rates in the world, necessitating the use of accurate population adjustments and growth rates per individual years (Linard et al., 2012).No similar decreasing trends were noted in the predictor variables, which alternatively may show effects of climate change.
Exploratory analyses showed noticeable seasonality in the environmental predictor variables.The documented single wet/dry cycle of northern Ghana was apparent in the major LKN region 1, while the double wet/dry cycles common to the south were seen in the major LKN regions 2 and 3.In addition to showing agreement with known hydrological cycles, the major LKN regions aligned well with the Köppen-Geiger climate classification regions.Thus, remote sensing feeds can effectively be used to capture the seasonal effects and climate divisions.Assessing the nature of schistosomiasis count seasonality needs further assessment; however model fit improved substantially with the incorporation of seasonal harmonics.
A long lag between infection and clinical symptoms may contribute to underreporting (Danso-Appiah et al., 2004;de Vlas et al. 2014), and so obscure the documentation of seasonality.Ghana's equatorial location makes variations in seasonal environmental variables subtle, and so makes the precision of the data in this type of analysis especially important.Of the 600 partitions run on 200 classifications, most agreed that there were only 3 major and 7-10 minor clusters.For these reasons it is difficult to link schistosomiasis rates with purely environmental variables.However the 41% explained variance is promising.
To overcome these challenges, this study has utilized 8 years of overlapping remote sensing data feeds and national health data.
Other studies approaching this length of temporal data have relied upon retrospective or prospective health surveys (Walz et al, 2015a).Studies that have taken place on similar or greater spatial scales have relied upon aggregated retrospective health surveys (Shur et al., 2011;Shur et al., 2013).While these data sources are highly accurate for the time and place in which they were collected, extrapolation of these results have increased uncertainty.
Access to the national health data of schistosomiasis in Ghana was granted based on recent research on schistosomiasis and water in Ghana (Kosinski et al. 2011;Kosinaki et al., 2012;Kulinkina et al., 2016) The last systematic collection of health survey data in Ghana at the national scale and subsequently used to direct control efforts was in 2008 (Soares-Magalhaes et al., 2011).This study used detailed health data collected in 77 schools to direct post study drug distribution by interpolating predicted risk for the rest of the country.A purely environmental approach paired with access to national health data for the reportable disease, schistosomiasis, has the benefit of being a more cost effective spatiotemporal tool.
Increasing spatiotemporal resolutions of environmental and outcome variables would be of benefit to improving model fit (Utzinger et al., 2009).However the MODIS sensor has a spatial resolution limit of 250m unsharpened, and a temporal resolution limit of 4 day adapted composites before alternative satellite sensors would need consideration.ASTER or Landsat would be considered for future publically available satellite data as they provide continuous, high spatiotemporal resolution.However the computational price of this will need careful consideration.
By running the LKN regionalization algorithm hierarchically on each of the major zones separately we can hone in on subtler climatic differences on scales closer to our district level rates.
Alternatively the spatial resolution of the health outcome is available at the rural health unit, provided that the georeferencing of the clinics is undertaken.Bringing the environmental and health data to a comparable scale is one source of improved model fit.Additional improvements would come from increasing the overall spatiotemporal resolution of all variables; adding additional environmental variables, use of fourier transform analysis to guide seasonality analysis, and incorporating adjustments such as collinearity, autocorrelation, over dispersion, and lag.By exploring the potential of purely environmental data to predict schistosomiasis, we are seeing if there might be a more sustainable, cost effective and rapid method for disease surveillance.

Figure 1 :
Figure 1: Climate classification workflow.A) flowchart of the LKN method.B) first four components of the PCA.C) 3 major zones and D) 9 minor zones determined by CVI analysis.

Figure 2 :
Figure 2: Spatial distribution of the 8 year averages of Loge(rate), LST, NDVI, and AP aggregated from 2008-2015 and extracted by the 216 districts of Ghana.Districts with no counts for the entire time period are shown in white, and data values were stretched from high to low.

Figure 3 :
Figure 3: Correlation matrix of outcome and predictors.

Table 3 :
Descriptive Statistics of outcome and predictors.

Table 4 .
Results of mixed effects models per LKN regions.