Retrieval algorithm of chlorophyll-a concentration for coastal waters based on ridge regression

In this paper, we use the multiple-scattering near-infrared aerosol correction model of SeaDAS to perform atmospheric correction of Terra/Aqua MODIS remote sensing data, and we use three standard operational chlorophyll-a concentration inversion algorithms built into SeaDAS (OC2, OC3, and OC4) to retrieve chlorophyll-a concentration in the Yellow Sea and East China Sea. We validate the inversion results against in-situ measured chlorophyll-a concentration data collected in the Yellow Sea and East China Sea in 2003. The results show that the OC3 and OC4 retrievals are significantly larger than the in-situ measured values, while the OC2 retrievals are closer to the measured values. In view of these shortcomings, we propose a chlorophyll-a concentration inversion model based on ridge regression and test it on MODIS images covering the Yellow Sea and East China Sea in 2003. The results indicate that the new inversion model effectively overcomes the deficiencies of the OC2, OC3, and OC4 algorithms, including the multicollinearity that arises when their outputs are combined in a multivariate linear regression model. The model passed the F-test (F=25.893, p=0.000<0.05); the mean absolute percentage error (MAPE) between the retrieved values and the in-situ measured values was 21.8%, the root mean squared error (RMSE) was 0.325 mg/m³, and the coefficient of determination (R²) was 0.847. The accuracy and goodness of fit of the new model were significantly better than those of the OC2, OC3, and OC4 algorithms. Therefore, the chlorophyll-a concentration inversion model based on ridge regression can effectively invert the chlorophyll-a concentration in the offshore


Introduction
Marine remote sensing technology has now become a major means of obtaining indicators of the nutrient status and ecological health of the oceans (Pan Gang, et al. 2007).
Chlorophyll-a is one of the key indicators of water quality; it reflects the growth of phytoplankton in water and thus characterizes the eutrophication level of seawater (Lavigne H, et al., 2021). Because the Yellow Sea and East China Sea are important seas for China, changes in their chlorophyll-a concentration have a wide and far-reaching impact on the marine ecosystem. Domestic and foreign scholars have conducted much research on ocean color remote sensing and made significant progress (Huang Weigong et al. 2002). Ocean color remote sensing data sources include MODIS, OLCI, VIIRS, MERIS, and GOCI.
Zhang et al. (2007) retrieved chlorophyll-a concentration from MODIS images covering the Fujian Coastal Sea using various empirical algorithms, such as OC2 and OC3, and found that the OC2 and OC3 results were larger than the in-situ measured chlorophyll-a concentration. Li et al. (2009) retrieved chlorophyll-a concentration in the Yellow Sea and East China Sea from SeaWiFS images using the OC4 algorithm, and found that the OC4 results were obviously larger than the in-situ measured values. Wang et al. (2018) conducted a similar experiment in the Yellow Sea and East China Sea using the OC2, OC3G and YOC3 algorithms with GOCI data, and the retrieved values were again generally larger than the in-situ measured chlorophyll-a concentration. Zhao et al. (2014) retrieved chlorophyll-a concentration from MODIS images covering the South China Sea with OCI, OC3 and other algorithms, and also found that the retrieved values were higher than the measured values.
Blue/green-ratio algorithms such as OC2, OC3 and OC4 have been very successful in retrieving chlorophyll-a from water reflectance in oceanic "case I" waters. However, studies that retrieve chlorophyll-a concentration with an integrated algorithm in oceanic "case II" waters are still relatively limited. To fill this research gap, we propose a chlorophyll-a concentration inversion model based on ridge regression. The retrieved results were validated using in-situ data measured by the Institute of Oceanography, Chinese Academy of Sciences (IOC) in September 2003. All MODIS data were processed using SeaDAS software. The new model integrates the inversion results of the OC2, OC3, and OC4 algorithms using multivariate linear regression, so as to establish the relationship between these independent variables and the actual chlorophyll-a concentration. We test the new model in the Yellow Sea and East China Sea. The results of this study can provide a valuable reference for water quality monitoring in oceanic "case II" waters.
1 Data acquisition and pre-processing

Acquisition of in-situ measured data
Due to the complex impacts of oceanic circulation patterns and land-based sources, the river inlets of the Yellow Sea and East China Sea are more prone to nutrient aggregation and hydrophobic events, so it is reasonable to select in-situ sampling sites near Jiaozhou Bay and the mouth of the Yangtze River for collecting chlorophyll-a concentration samples. The in-situ measured chlorophyll-a concentration data used in this study were collected by the Oceanographic Institute of the Chinese Academy of Sciences (OIC) in the Yellow Sea and East China Sea in 2003. The distribution of the sampling points is shown in Figure 1. The sampling points were registered to the remote sensing images used in this study to keep the coordinates consistent when processing MODIS data in SeaDAS software.
To ensure that the in-situ measured data and the satellite data are synchronous, we selected MODIS imagery acquired under clear-sky conditions on four different days, i.e., September 11, 24, 25 and 26. As a result, 16 sampling sites can be used for quantitative remote sensing analysis in this study; their latitudes and longitudes are shown in Table 1. All MODIS data used in this study were obtained from the official NASA Ocean Color website (https://oceancolor.gsfc.nasa.gov/), and valid Terra/Aqua scenes were selected online by setting a cloudiness threshold and an acquisition time. Bands 8 to 16 of MODIS imagery can be used for ocean color research, especially for the analysis of suspended sediment and chlorophyll-a concentration. These parameters are commonly used to reflect phytoplankton production and the eutrophication level of marine water bodies.
Detailed information on these bands is shown in Table 2. The data processing in this paper begins with reading the level 0 MODIS_PDS file, which is converted into MODIS_L1A and MODIS_L1B data using OCSSW in SeaDAS software. Atmospheric correction is then performed with the L2gen module, using the near-infrared aerosol correction mode with multiple scattering, to obtain MODIS_L2B data (Li et al., 2018). The atmospheric correction performed by the L2gen module is crucial because it directly affects the accuracy of the retrieved chlorophyll-a concentration. The flow of MODIS data processing in SeaDAS is shown in Figure 2.

Inversion of Chlorophyll-a using OCX algorithms
In this paper, three NASA standard operational chlorophyll-a concentration inversion algorithms, OC2, OC3, and OC4, are used to retrieve chlorophyll-a in the Yellow Sea and East China Sea. All three are blue-green band ratio algorithms, which were designed for open ocean waters.
These algorithms follow the empirical OCX model (X=2-6) and were obtained through a large number of inversion experiments conducted by O'Reilly et al. (1998). They are easy to implement and computationally fast. However, it should be noted that the OC2 algorithm built into SeaDAS software for MODIS data differs from the third-order OC2 algorithm proposed by Zheng et al. (2017): for the OC2, OC3 and OC4 algorithms used here, the relationship between the blue-green band ratio of the remote sensing reflectance Rrs(λ) and the in-situ measured chlorophyll-a concentration is a fourth-order polynomial (Zhu et al., 2013; Dong et al., 2021). The model expression is defined as:

$$\log_{10}(\mathrm{Chl}) = a_0 + \sum_{i=1}^{4} a_i \left[ \log_{10}\!\left( \frac{R_{rs}(\lambda_{blue})}{R_{rs}(\lambda_{green})} \right) \right]^{i} \qquad (1)$$

In Eq. (1), the coefficients a0, a1, a2, a3 and a4 are sensor specific, Rrs(λblue) is the maximum remote sensing reflectance among the MODIS blue bands used by the algorithm, and Rrs(λgreen) is the remote sensing reflectance of the MODIS green band. In the Miscellaneous parameter settings of the L2gen processing module in SeaDAS software, the blue and green bands and the corresponding coefficients are set as shown in Table 3.
To evaluate the retrieval accuracy of these algorithms, three statistical measures are used in this paper: the coefficient of determination (R²), the root-mean-square error (RMSE) and the mean absolute percentage error (MAPE). They are calculated with Eqs. (2), (3) and (4) (Li, 2020).
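As a minimal sketch of the fourth-order band-ratio model of Eq. (1), the calculation could be written in Python as follows. The function name, the array layout and the coefficients in the usage lines are illustrative assumptions; the operational coefficients are those configured in Table 3 and are not reproduced here.

```python
import numpy as np

def ocx_chlorophyll(rrs_blue_bands, rrs_green, coeffs):
    """Estimate chlorophyll-a (mg/m^3) with a fourth-order OCx band-ratio model.

    rrs_blue_bands: array of shape (n_blue_bands, n_pixels)
    rrs_green:      array of shape (n_pixels,)
    coeffs:         (a0, a1, a2, a3, a4), sensor- and algorithm-specific
    """
    rrs_blue_bands = np.asarray(rrs_blue_bands, dtype=float)
    rrs_green = np.asarray(rrs_green, dtype=float)
    # Use the largest blue-band reflectance for each pixel
    rrs_blue = rrs_blue_bands.max(axis=0)
    # R = log10(Rrs(blue) / Rrs(green))
    ratio = np.log10(rrs_blue / rrs_green)
    # log10(Chl) = a0 + a1*R + a2*R^2 + a3*R^3 + a4*R^4
    log_chl = sum(a * ratio**i for i, a in enumerate(coeffs))
    return 10.0 ** log_chl

# Placeholder inputs: NOT the Table 3 coefficients and NOT measured reflectances
demo_coeffs = (0.0, 1.0, 0.0, 0.0, 0.0)
demo_chl = ocx_chlorophyll(
    [[0.004, 0.010], [0.006, 0.008]],  # two blue bands, two pixels
    [0.003, 0.005],                    # green band, two pixels
    demo_coeffs,
)
```

Taking the per-pixel maximum over the blue bands is what distinguishes OC3 and OC4 (several blue bands) from OC2 (a single blue band) in this formulation.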
Table 5 shows the results of the accuracy evaluation for the three algorithms. It can be seen that the OC2 algorithm has the smallest error but the poorest linear regression fit; the OC4 algorithm has a larger error than the OC2 algorithm, but the correlation between its retrieved values and the measured values is better; the OC3 algorithm has the best R² of the three, but its RMSE and MAPE are intermediate. Therefore, no single algorithm is clearly the best.
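The three accuracy measures can be sketched in Python as follows. Because Eqs. (2)-(4) are not reproduced in this text, the definitions below are the common textbook ones and should be read as assumptions; in particular, R² is assumed here to be the squared Pearson correlation, i.e. the R² of a simple linear fit between measured and retrieved values. The function name and the numbers in the usage lines are hypothetical.

```python
import numpy as np

def accuracy_metrics(measured, retrieved):
    """Return R2, RMSE and MAPE (%) for paired measured/retrieved values."""
    measured = np.asarray(measured, dtype=float)
    retrieved = np.asarray(retrieved, dtype=float)
    residual = retrieved - measured
    # Root-mean-square error, in the units of the input (e.g. mg/m^3)
    rmse = np.sqrt(np.mean(residual**2))
    # Mean absolute percentage error relative to the measured values
    mape = 100.0 * np.mean(np.abs(residual) / measured)
    # Assumed definition: squared Pearson correlation coefficient
    r = np.corrcoef(measured, retrieved)[0, 1]
    return {"R2": r**2, "RMSE": rmse, "MAPE": mape}

# Made-up values for illustration only, not the data of Table 4
demo_metrics = accuracy_metrics([1.0, 2.0, 3.0, 4.0], [1.1, 1.9, 3.2, 3.8])
```

Under this definition a retrieval can have a high R² and still a large RMSE if it is systematically biased, which is consistent with one algorithm fitting well while overestimating.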

Retrieval chlorophyll-a concentration using ridge regression model
To establish a more stable and effective algorithm for retrieving chlorophyll-a concentration, we propose a new algorithm based on ridge regression. The new algorithm takes the chlorophyll-a concentration values retrieved by the OC2, OC3, and OC4 algorithms as the independent variables and the in-situ measured chlorophyll-a concentration as the dependent variable. It then applies multivariate linear regression, regularized by ridge regression, to model the relationship between the retrieved values and the measured values.
Before establishing the multiple linear regression model, the predictors (the outputs of the OC2, OC3 and OC4 algorithms) were checked for multicollinearity, using the variance inflation factor (VIF) and the tolerance (TOL) to determine the degree of collinearity among the independent variables. The diagnostic results are shown in Table 6. A VIF greater than 10 or a TOL less than 0.1 indicates that the independent variables have a collinearity problem in the model (Lin et al., 2022). It is clear that all three predictors (OC2, OC3, and OC4) are collinear, so using the traditional least squares method for multiple linear regression would make the prediction results unstable. Ridge regression is a regularized improvement of the least squares method: it controls the magnitude of the regression coefficients at the cost of some information and accuracy, which reduces the effect of collinearity and yields more reliable and realistic regression coefficients. The improved coefficient estimate introduces the ridge parameter k, and an appropriate k can improve the accuracy of the retrieval model. The regression coefficient matrix can be expressed as (Li, 2021):

$$w = (X^{T}X + kI)^{-1}X^{T}y \qquad (5)$$

In Eq. (5), w is the vector of ridge regression coefficients, k is the ridge parameter, X is the feature matrix (independent variables), y is the target vector (actual chlorophyll-a concentration), and I is the identity matrix.
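The collinearity diagnosis and the ridge estimator described above can be sketched in Python with NumPy. This is an illustration under stated assumptions, not the SPSS procedure used in the paper: the function names are hypothetical, VIF is computed as the diagonal of the inverse predictor correlation matrix, and the ridge solution uses the closed form w = (XᵀX + kI)⁻¹Xᵀy without an intercept, so the inputs are assumed to be centred or standardized beforehand.

```python
import numpy as np

def vif_and_tolerance(X):
    """Variance inflation factor and tolerance for each column of X."""
    X = np.asarray(X, dtype=float)
    corr = np.corrcoef(X, rowvar=False)   # predictor correlation matrix
    vif = np.diag(np.linalg.inv(corr))    # VIF_j = 1 / (1 - R_j^2)
    return vif, 1.0 / vif                 # TOL_j = 1 / VIF_j

def ridge_coefficients(X, y, k):
    """Closed-form ridge estimate w = (X^T X + k I)^(-1) X^T y."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    n_features = X.shape[1]
    # Solve the regularized normal equations instead of forming the inverse
    return np.linalg.solve(X.T @ X + k * np.eye(n_features), X.T @ y)

# Synthetic, deliberately collinear predictors standing in for the
# OC2/OC3/OC4 outputs (NOT the data of Table 4)
rng = np.random.default_rng(0)
base = rng.normal(size=50)
X_demo = np.column_stack([base,
                          base + 0.1 * rng.normal(size=50),
                          rng.normal(size=50)])
y_demo = X_demo @ np.array([1.0, 2.0, 3.0]) + 0.1 * rng.normal(size=50)

vif_demo, tol_demo = vif_and_tolerance(X_demo)
w_ols = ridge_coefficients(X_demo, y_demo, k=0.0)    # k = 0 is least squares
w_ridge = ridge_coefficients(X_demo, y_demo, k=10.0)  # shrunken coefficients
```

With k = 0 the function reduces to ordinary least squares, and increasing k shrinks the coefficient vector, which is the trade-off between bias and stability that motivates the choice of k.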
The ridge regression analysis was performed using SPSS software, and the resulting ridge trace is shown in Figure 3. Figure 3 shows that as the ridge parameter tends to infinity, the regression coefficients gradually tend to 0 and fail to capture the relationship between the predicted and measured chlorophyll-a values. When k=0, the model degenerates into the traditional least squares estimate of the regression coefficients. Therefore, k should not be too large; it should be the smallest value that satisfies the experimental needs. In Figure 3, the ridge regression coefficients begin to stabilize for k between 0 and 0.2, so k is chosen within the range 0 < k < 0.2.
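A ridge trace of this kind can be computed with a few lines of Python. The sketch below is an assumption about the scaling rather than a description of the SPSS run: it standardizes the predictors and the response, so that k acts on the correlation matrix, as ridge-trace tools commonly do. The function name and the synthetic data are hypothetical.

```python
import numpy as np

def ridge_trace(X, y, k_values):
    """Standardized ridge coefficients for each k; shape (len(k_values), p)."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    n, p = X.shape
    # Standardize so that X^T X / n is the predictor correlation matrix
    Xs = (X - X.mean(axis=0)) / X.std(axis=0)
    ys = (y - y.mean()) / y.std()
    corr_xx = Xs.T @ Xs / n
    corr_xy = Xs.T @ ys / n
    identity = np.eye(p)
    # One regularized solve per ridge parameter
    return np.array([np.linalg.solve(corr_xx + k * identity, corr_xy)
                     for k in k_values])

# Synthetic collinear predictors, for illustration only
rng = np.random.default_rng(1)
base = rng.normal(size=40)
X_demo = np.column_stack([base,
                          base + 0.1 * rng.normal(size=40),
                          base + 0.1 * rng.normal(size=40)])
y_demo = base + 0.2 * rng.normal(size=40)

k_grid = np.linspace(0.0, 0.2, 21)   # the 0 to 0.2 range read from the trace
trace = ridge_trace(X_demo, y_demo, k_grid)
```

Plotting each column of `trace` against `k_grid` gives the ridge trace; the region where the curves flatten is where a small, stable k is usually picked.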

Figure 3 Ridge trace map
Research has shown that the experimental requirements are satisfied when the maximum variance inflation factor (VIF) of the OC2, OC3 and OC4 output terms is less than 10. The value of k was therefore taken as 0.02 in this paper. The results of the ridge regression analysis for k=0.02 are shown in Table 7. The model passed the F-test (F=25.893, p=0.000<0.05); the MAPE between the values predicted by the new model and the measured chlorophyll-a values is 21.8%, the RMSE is 0.325 mg/m³ and R² is 0.847. The results of the ridge regression analysis are shown in Table 8.
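One way to make the "maximum VIF below 10" criterion concrete is to compute the ridge VIFs over a grid of k and keep the smallest k that meets the limit. The Python sketch below assumes the standard ridge VIF expression, the diagonal of (R + kI)⁻¹ R (R + kI)⁻¹ with R the predictor correlation matrix; whether SPSS applied exactly this expression is not stated in the paper. The function names and the 2×2 correlation matrix are made up for illustration and are not the values of Table 6.

```python
import numpy as np

def ridge_vif(corr, k):
    """Ridge VIFs: diagonal of (R + kI)^(-1) R (R + kI)^(-1)."""
    corr = np.asarray(corr, dtype=float)
    inv = np.linalg.inv(corr + k * np.eye(corr.shape[0]))
    return np.diag(inv @ corr @ inv)

def smallest_k_below_limit(corr, k_grid, vif_limit=10.0):
    """Return the first k in k_grid whose maximum ridge VIF is below the limit."""
    for k in k_grid:
        if ridge_vif(corr, k).max() < vif_limit:
            return float(k)
    return None  # no k in the grid satisfies the criterion

# Hypothetical correlation matrix of two strongly collinear predictors
corr_demo = np.array([[1.00, 0.99],
                      [0.99, 1.00]])
k_grid = np.arange(0.0, 0.21, 0.01)
k_demo = smallest_k_below_limit(corr_demo, k_grid)
```

At k = 0 the expression reduces to the ordinary VIF, so the same function covers both the diagnosis before regularization and the check after it.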
Here, chl_oc2, chl_oc3 and chl_oc4 are the outputs of the OC2, OC3 and OC4 algorithms, and F(chl_oc2, chl_oc3, chl_oc4) is the retrieval value of the ridge regression algorithm.
Numerically, the retrieval values of the OC3 and OC4 algorithms are significantly higher than the measured chlorophyll-a concentration values (as shown in Figure 4), which can be attributed to the influence of reflection from surrounding clouds in the study area (Zhang et al., 2007; Wang et al., 2018). The retrieval values of the OC3 and OC4 algorithms are close to each other and significantly higher than those of the OC2 algorithm. The retrieval values of the OC2 algorithm are slightly larger than the measured chlorophyll-a concentration, but when the chlorophyll-a concentration is too high or too low, the retrieval error increases rapidly. The ridge regression retrieval values are closer to the actual values, indicating that the method pulls the OC2, OC3 and OC4 estimates back toward the measured values when the chlorophyll-a concentration is too high or too low. Through regularization and reasonable control of the regression coefficients, the ridge regression algorithm not only achieves a better coefficient of determination (R²) than the OC2, OC3, and OC4 algorithms but also significantly reduces the errors. Its overall performance is better than that of the OC2, OC3, and OC4 algorithms.
Comparing the four maps in Figure 6, the chlorophyll-a concentration distribution retrieved by the ridge regression algorithm is more consistent with the actual situation. The retrieval results of the OC3 and OC4 algorithms are close to each other, but both overestimate the chlorophyll-a concentration in coastal waters.

Discussion and conclusion
In this paper, a ridge regression algorithm for chlorophyll-a concentration inversion was proposed and tested using MODIS images covering the East China Sea and the Yellow Sea of China.
The new algorithm uses the outputs of the OC2, OC3 and OC4 algorithms embedded in SeaDAS, and its output is more accurate than that of any of the three because it effectively corrects their overestimation of chlorophyll-a concentration. The experimental results indicate that the MAPE, RMSE and R² of the retrieval results are 21.8%, 0.325 mg/m³ and 0.847, respectively, and that the new algorithm is suitable for oceanic "case II" waters.
Ocean color remote sensing is a complex technology. Although specialized software for ocean color remote sensing, such as SeaDAS, is available, the variability of atmospheric and sea surface conditions still poses challenges for ocean color modeling. The preliminary results of the ridge regression algorithm in ocean color remote sensing retrieval are robust and indicate that the new algorithm can improve on the traditional algorithms. However, its accuracy still needs further experimental verification.

Figure 1 Distribution of sampling sites

1.2 Remote Sensing Data Acquisition and Pre-processing

Figure 2 Process flows of MODIS data in SeaDAS

2 Chlorophyll-a concentration inversion methods and results

Figure 4 Comparison of the ridge regression algorithm with OC2, OC3 and OC4 algorithms

Statistical analysis was performed in SeaDAS software, and the frequency histograms of the values retrieved by the four algorithms are shown in Figure 5. High chlorophyll-a concentration values (2-4 mg/m³) clearly account for a smaller proportion of the ridge regression retrievals than of the OC3 and OC4 retrievals; the larger proportion for OC3 and OC4 can be attributed to the reflection effect of neighbouring clouds. Overall, the retrieval values of the ridge regression model are better than those of the OC2, OC3, and OC4 algorithms.

Figure 5 Frequency histogram of retrieval values of four retrieval algorithms

Figure 6

Table 2
Detection band characteristics of MODIS for marine environment monitoring

Table 3
Band selection and coefficients setting for OC2, OC3 and OC4 algorithms

To obtain valid retrieval results, we designed the following time matching principles between the MODIS data and the in-situ observation data: ① The Terra/Aqua MODIS remote sensing data must be synchronous with the in-situ measured chlorophyll-a data; if both the Terra MODIS and Aqua MODIS images acquired on the same day cover a sampling site, two sets of retrieval values are considered to match the in-situ measured value. ② Because the MODIS images covering the Yellow Sea in September were generally cloudy, few images match the sampling sites. We therefore assume that the chlorophyll-a concentration in oceanic water does not change between adjacent days, so the in-situ data measured on September 25, 2003 can be matched with the remote sensing data acquired on September 24, 2003. Accordingly, MODIS imagery acquired on September 11, 24, 25 and 26, 2003 was selected for the chlorophyll-a concentration retrieval experiments in this paper. There are 18 usable images that match the 16 in-situ measurement sites. The retrieval results of the OC2, OC3 and OC4 algorithms and the corresponding actual values are shown in Table 4.
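The two matching principles can be sketched as a small Python function. The function name, the tuple layout and the rule that adjacent-day images are used only when no same-day image exists are assumptions made for illustration; the paper does not state how a site with both same-day and adjacent-day images was handled. The acquisition list is hypothetical.

```python
from datetime import date

def match_acquisitions(sample_date, acquisitions, max_offset_days=1):
    """Return the image acquisitions matched to one in-situ sampling date.

    acquisitions: list of (sensor_name, acquisition_date) tuples
    """
    # Principle 1: every same-day image (Terra and/or Aqua) is a match
    same_day = [a for a in acquisitions if a[1] == sample_date]
    if same_day:
        return same_day
    # Principle 2: otherwise accept images from adjacent days
    return [a for a in acquisitions
            if abs((a[1] - sample_date).days) <= max_offset_days]

# Hypothetical acquisition list, not the actual scene inventory of the study
demo_acquisitions = [
    ("Terra", date(2003, 9, 24)),
    ("Terra", date(2003, 9, 26)),
    ("Aqua", date(2003, 9, 26)),
]
adjacent_match = match_acquisitions(date(2003, 9, 25), [demo_acquisitions[0]])
same_day_match = match_acquisitions(date(2003, 9, 26), demo_acquisitions)
```

Returning a list lets one sampling site contribute two matchups when both sensors cover it on the same day, which is how the number of usable matchups can exceed the number of sites.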

Table 4
Retrieval results of OC2, OC3 and OC4 comparing to in-situ measured chlorophyll-a

Table 6
Diagnostic results of model predictor collinearity

The International Archives of the Photogrammetry, Remote Sensing and Spatial Information Sciences, Volume XLVIII-1-2024 ISPRS TC I Mid-term Symposium "Intelligent Sensing and Remote Sensing Application", 13-17 May 2024, Changsha, China

Table 8
Results of ridge regression analysis when k=0.02