Abstract
A Bayesian regularized back-propagation neural network (BRBPNN) model was developed and used to predict monthly chlorophyll-a concentration dynamics over a period of 15 years in Meiliang Bay, Lake Taihu. The optimal network consisted of seven input neurons, six hidden neurons, and one output neuron, and the coefficient of determination (R2) values for the training, validation, and test sets were 0.77, 0.49, and 0.76, respectively. The respective values of the root mean square error (RMSE) and bias were 17.24 and −1.05 for training, 12.48 and 0.62 for validation, and 11.01 and 2.2 for testing. The BRBPNN model fit the data much better than multiple linear regression models, which indicates that it is a powerful tool for predicting long-term chlorophyll-a concentration dynamics in Meiliang Bay. Furthermore, we found that algae in Meiliang Bay, principally Microcystis, were alkalophilic, and that phytoplankton production was controlled by P inputs from spring to early summer, whereas N played a more dominant controlling role in summer–fall. Therefore, reducing P alone may no longer be adequate for Lake Taihu, and new nutrient reduction strategies should incorporate N-input reduction along with P-input reductions.
Introduction
Lake Taihu, the third largest freshwater lake in China, was oligotrophic in the 1950s, but the increase in anthropogenic nutrient input to the lake has resulted in algal blooms appearing since the 1990s (Chang, 1995). Meiliang Bay, situated at the northern end of Lake Taihu, is one of the most eutrophic bays in China and generally suffers more intense blooms than the open water of the lake (Chen et al., 2003; Liu et al., 2011). Lake Taihu not only plays an extremely important role in the economic and social development of the Yangtze River Delta but also provides irreplaceable ecological services to nearby regions. Thus, the issue of algal blooms in Lake Taihu has become prominent in China (Song et al., 2007; Wu et al., 2010).
To alleviate the harmful impact of algal blooms, it is imperative to determine the key factors governing the algal dynamics and to establish models that can effectively simulate the timing and magnitude of algal blooms (Wei et al., 2001). In recent years, artificial neural networks (ANNs) have been applied to predict algal blooms (Wu et al., 2013; Cho et al., 2014; Coad et al., 2014; Liu et al., 2015).
Using a 1.5-year measured data set of chlorophyll-a (Chla) and environmental variables, Wu et al. (2013) developed ANN models to simulate the daily Chla dynamics in a German lowland river, and the models achieved satisfactory accuracy. Cho et al. (2014) used automatic water quality monitoring data, weather data, and hydrologic data collected in the man-made Lake Juam during 2008–2010 to develop an ANN model that predicts Chla concentration as an indirect measure of algal abundance, and the ANN trained with the time series data successfully predicted the Chla dynamics. Coad et al. (2014) used a high-resolution temporal data set, including Chla, temperature (water and air), salinity, and photosynthetically available radiation, to configure an ANN that predicts Chla concentrations 1, 3, and 7 days in advance, and obtained satisfactory results.
These studies revealed that, compared with other modeling approaches, ANNs exhibit better performance in the prediction of algal blooms and have, therefore, become a popular and useful tool for environmental simulations (Maier and Dandy, 2001). Liu et al. (2015) used water quality and meteorological data from 1999 to 2012 in the Yuqiao Reservoir (Tianjin, China) to build six artificial neural networks (ANNs) to predict the levels of Chla and found that the back-propagation model yielded slightly better results than the other ANNs. Compared with other traditional ANNs, the Bayesian regularized back-propagation neural network (BRBPNN) performs better when the variables have a nonlinear relationship (Xu et al., 2006).
BRBPNNs have an excellent generalization capability, a result of their automated regularization parameter selection. This allows them to obtain the optimal network architecture for the posterior distribution and avoid the over-fitting problem (Mackay, 1992; Foresee and Hagan, 1997; Burden and Winkler, 2000).
In this study, a BRBPNN model is used to predict the chlorophyll-a concentrations in Meiliang Bay, Lake Taihu. The relative importance of environmental factors affecting the chlorophyll-a concentration is evaluated through a sensitivity analysis. We also compared the BRBPNN model with multiple linear regression (MLR) models. Finally, we provide some bases for the effective eutrophication treatment of Meiliang Bay.
Material and Methods
Study area and data
Lake Taihu has an area of 2,338 km2 and an average depth of about 2.0 m. It is located between 30°05′ and 32°08′ N and 119°08′ and 121°55′ E, downstream of the Yangtze River (Fig. 1).

Fig. 1. Location of Lake Taihu, China, and sampling sites in Meiliang Bay (northern Lake Taihu).
Meiliang Bay, situated in the northern part of Lake Taihu, is one of the most eutrophic bays (Chen et al., 2003). The bay has a surface area of 132 km2 and a mean depth of 2.0 m. The Liangxi and the Lujiang rivers discharge wastewater from the cities of Wuxi and Changzhou into Meiliang Bay. Due to the heavy industrial and agricultural pollution, algal blooms have been frequently observed in Meiliang Bay during the past two decades (Liu et al., 2011; Paerl et al., 2011).
The water quality data used in this study were obtained from the Taihu Laboratory for Lake Ecosystem Research (CNERN TaiLLER). We selected monthly data measured at four sites in Meiliang Bay over a period of 15 years (January 1992 to December 2006). The data include eight water quality variables: pH, water temperature (WT, °C), transparency (SD, m), suspended solids (SS, mg/L), electrical conductivity (EC, μS/cm), total nitrogen (TN, mg/L), total phosphorus (TP, mg/L), and the chlorophyll-a concentration (Chla, μg/L). The basic statistics of the measured water quality variables in Meiliang Bay, Lake Taihu, are shown in Table 1. The raw data for the eight water quality variables at the four sampling points in Meiliang Bay are shown in Supplementary Figs. S1–S8.
CV, coefficient of variation; SD, standard deviation.
Bayesian regularized back-propagation neural network modeling
In general, a neural network comprises three layers: the input layer, the hidden layer(s), and the output layer. In this study, we build a three-layer neural network model, as shown in Fig. 2. Seven of the measured environmental factors are selected as the input layer variables. A single hidden layer is used, and the current month chlorophyll-a concentration, which is a well-known integrative indicator of algal biomass, is the output variable. All the computations were performed using MATLAB (MathWorks, Inc.).

Fig. 2. Neural network structure.
Bayesian regularization algorithm
Many algorithms can be used to train neural network models. In this study, a Bayesian regularization algorithm is applied to the training data to calculate the weights between the input and hidden layers and between the hidden and output layers. The transfer functions of the hidden layer and output layer are set to the log–sigmoid function:
$$f(x) = \frac{1}{1 + e^{-x}} \qquad (1)$$
where x is the input vector.
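As a minimal sketch of the transfer function and the resulting forward pass, the Python snippet below evaluates the log-sigmoid, 1/(1 + e^(−x)), for a network with seven inputs, six hidden neurons, and one output (the size reported for the optimal network). The function names, the random weights, and the input values are placeholders chosen for illustration; they are not taken from the study, which was carried out in MATLAB.

```python
import numpy as np

def logsig(x):
    """Log-sigmoid transfer function, applied elementwise."""
    return 1.0 / (1.0 + np.exp(-x))

def forward(x, W1, b1, W2, b2):
    """Forward pass with log-sigmoid hidden and output layers."""
    hidden = logsig(W1 @ x + b1)       # six hidden activations
    return logsig(W2 @ hidden + b2)    # single output in (0, 1)

rng = np.random.default_rng(0)
# Hypothetical weights and biases; a trained network would supply these.
W1, b1 = rng.normal(size=(6, 7)), rng.normal(size=6)
W2, b2 = rng.normal(size=(1, 6)), rng.normal(size=1)
x = rng.uniform(size=7)                # seven normalized inputs (placeholders)
y = forward(x, W1, b1, W2, b2)
```

Because the output layer is also log-sigmoid, the network output is confined to the open interval (0, 1), which is why the target variable has to be normalized before training.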
The BRBPNN uses regularization to improve its generalization ability. The training objective function F is given by the following:
$$F = \alpha E_W + \beta E_D \qquad (2)$$
where $E_W$ is the sum of the squared network weights, $E_D$ is the sum of the squared residuals between the network response values and the target values, and α and β are the objective function parameters (hyperparameters).
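The objective can be illustrated with a short Python sketch, assuming the usual form F = αE_W + βE_D from Foresee and Hagan (1997). The function name and the numbers are hypothetical and only show how the two terms trade off.

```python
import numpy as np

def regularized_objective(weights, predictions, targets, alpha, beta):
    """F = alpha * E_W + beta * E_D (assumed form of the objective)."""
    e_w = float(np.sum(weights ** 2))                  # sum of squared weights
    e_d = float(np.sum((predictions - targets) ** 2))  # sum of squared residuals
    return alpha * e_w + beta * e_d

# Illustrative values only.
w = np.array([0.5, -1.0, 2.0])
pred = np.array([0.2, 0.7])
targ = np.array([0.0, 1.0])
F = regularized_objective(w, pred, targ, alpha=0.01, beta=1.0)
```

A larger α relative to β penalizes large weights more strongly and yields a smoother network response; a smaller ratio lets the network follow the training data more closely.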
In the Bayesian framework, the weights of the network are considered to be random variables. Initially, the weights are assigned a prior distribution. Once the data D have been observed, the posterior distribution of the weights can be updated using Bayes' rule:
$$P(w \mid D, \alpha, \beta, G) = \frac{P(D \mid w, \beta, G)\, P(w \mid \alpha, G)}{P(D \mid \alpha, \beta, G)} \qquad (3)$$
where G is the neural network model, w is the vector of network weights, P(w|α,G) is the prior density, P(D|w,β,G) is the likelihood function, and P(D|α,β,G) is the normalization factor (Mackay, 1992). Thus, Equation (3) can be expressed as follows:
$$\text{Posterior} = \frac{\text{Likelihood} \times \text{Prior}}{\text{Normalization factor}} \qquad (4)$$
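Assuming that the weight and data probability distributions are Gaussian, the likelihood function can be written as follows:
$$P(D \mid w, \beta, G) = \frac{\exp(-\beta E_D)}{Z_D(\beta)} \qquad (5)$$
where $Z_D(\beta)$ is the normalization factor:
$$Z_D(\beta) = \left(\frac{\pi}{\beta}\right)^{n/2} \qquad (6)$$
Similarly, the prior probability can be written as follows:
$$P(w \mid \alpha, G) = \frac{\exp(-\alpha E_W)}{Z_W(\alpha)} \qquad (7)$$
where $Z_W(\alpha)$ is the normalization factor:
$$Z_W(\alpha) = \left(\frac{\pi}{\alpha}\right)^{m/2} \qquad (8)$$
Finally, the posterior probability can be written as follows:
$$P(w \mid D, \alpha, \beta, G) = \frac{\exp\left(-(\alpha E_W + \beta E_D)\right)}{Z_F(\alpha, \beta)} = \frac{\exp(-F(w))}{Z_F(\alpha, \beta)} \qquad (9)$$
where $Z_F(\alpha, \beta)$ is the normalization factor of the posterior.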
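We use Bayes' rule to optimize the objective function parameters α and β. Thus, we have the following:
$$P(\alpha, \beta \mid D, G) = \frac{P(D \mid \alpha, \beta, G)\, P(\alpha, \beta \mid G)}{P(D \mid G)} \qquad (10)$$
where P(α,β|G) is the prior probability for the regularization parameters α and β, and P(D|α,β,G) is the likelihood function, which is called the evidence for α and β (Mackay, 1992). The optimum values for α and β can be inferred as follows (Livingstone, 2009):
$$\alpha = \frac{\gamma}{2 E_W} \qquad (11)$$
$$\beta = \frac{n - \gamma}{2 E_D} \qquad (12)$$
$$\gamma = m - 2\alpha\, \mathrm{tr}\left(A^{-1}\right) \qquad (13)$$
where γ is the effective number of parameters, n is the number of sample sets, m is the total number of parameters in the network, and A is the Hessian matrix of the objective function F(w).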
According to Foresee and Hagan (1997), the iterative procedure is as follows. (1) Initialize values for α, β, and the weights. (2) Employ one step of the Levenberg–Marquardt algorithm to minimize the objective function F(w). (3) Compute γ using the Gauss–Newton approximation to the Hessian matrix in the Levenberg–Marquardt training algorithm. (4) Compute new values for the objective function parameters α and β. (5) Iterate steps (2–4) until convergence.
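The five steps above can be sketched in Python on a toy problem. The sketch below makes several assumptions that are not from the study: it fits a linear model (so the Jacobian of the residuals is simply the input matrix X, whereas a neural network would obtain it by back-propagation), it uses a fixed damping term mu instead of an adaptive Levenberg–Marquardt schedule, and it adopts the update formulas of Foresee and Hagan (1997), α = γ/(2E_W), β = (n − γ)/(2E_D), and γ = m − 2α tr(A⁻¹). All names and data are hypothetical.

```python
import numpy as np

def bayesian_regularized_fit(X, t, alpha=0.01, beta=1.0, mu=1e-3, n_iter=50):
    """Toy Bayesian-regularized fit of the linear model y = X @ w."""
    n, m = X.shape                       # n samples, m parameters
    w = np.zeros(m)                      # step 1: initial weights
    for _ in range(n_iter):              # step 5: iterate steps 2-4
        r = X @ w - t                    # residuals at the current weights
        # Gauss-Newton approximation to the Hessian of F = alpha*E_W + beta*E_D
        A = 2.0 * beta * X.T @ X + 2.0 * alpha * np.eye(m)
        grad = 2.0 * beta * X.T @ r + 2.0 * alpha * w
        # step 2: one damped step (fixed mu is a simplification)
        w = w - np.linalg.solve(A + mu * np.eye(m), grad)
        # step 3: effective number of parameters
        gamma = m - 2.0 * alpha * np.trace(np.linalg.inv(A))
        # step 4: re-estimate the hyperparameters
        e_w = np.sum(w ** 2)
        e_d = np.sum((X @ w - t) ** 2)
        alpha = gamma / (2.0 * e_w)
        beta = (n - gamma) / (2.0 * e_d)
    return w, alpha, beta, gamma

# Synthetic data for illustration only.
rng = np.random.default_rng(0)
X = rng.normal(size=(30, 3))
w_true = np.array([1.0, -2.0, 0.5])
t = X @ w_true + 0.1 * rng.normal(size=30)
w_hat, alpha_hat, beta_hat, gamma_hat = bayesian_regularized_fit(X, t)
```

In this setting γ stays between 0 and m, and a value close to m indicates that nearly all parameters are being used to fit the data rather than being suppressed by the weight penalty.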
BRBPNN training
The data were normalized for the input and output layers using the linear insert-value method, which is expressed as follows:
where x is the original data and
The first 9 years of data (1992–2000) from all four sampling points were used for model training, and the data from 2001, 2003, and 2005 were used as the validation data set. The data from the remaining 3 years (2002, 2004, and 2006) were used for model testing.
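The normalization and the year-based split can be sketched as follows. The scaling shown is ordinary min–max scaling to [0, 1], which is an assumption: the exact normalization formula is not reproduced in the text. The helper names and the example arrays are hypothetical.

```python
import numpy as np

def min_max_scale(x, x_min, x_max):
    """Linear scaling to [0, 1]; assumed form of the normalization."""
    return (x - x_min) / (x_max - x_min)

def split_by_year(years):
    """Boolean masks for the training, validation, and test years."""
    years = np.asarray(years)
    train = years <= 2000                          # 1992-2000
    valid = np.isin(years, [2001, 2003, 2005])
    test = np.isin(years, [2002, 2004, 2006])
    return train, valid, test

# One site with one record per month, January 1992 to December 2006.
years = np.repeat(np.arange(1992, 2007), 12)
train, valid, test = split_by_year(years)

scaled = min_max_scale(np.array([2.0, 4.0, 6.0]), x_min=2.0, x_max=6.0)
```

One design choice worth stating explicitly is where `x_min` and `x_max` come from. Taking them from the training years only keeps information from the validation and test years out of the model inputs, although the study does not say which convention was used.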
Modeling performance criteria
To determine the performance of the developed BRBPNN model, three different criteria were used: the root mean square error (RMSE), the bias, and the coefficient of determination (R2) (Chenard and Caissie, 2008; Singh et al., 2009). The RMSE represents the error associated with the model and can be computed as follows:
$$\mathrm{RMSE} = \sqrt{\frac{1}{N}\sum_{i=1}^{N}\left(P_i - M_i\right)^2} \qquad (15)$$
where $P_i$ and $M_i$ represent the model-computed and measured values of the variable, and N represents the number of observations.
The bias represents the mean of all the individual errors and indicates whether the model overestimates or underestimates the dependent variable. It is calculated as follows:
$$\mathrm{Bias} = \frac{1}{N}\sum_{i=1}^{N}\left(P_i - M_i\right) \qquad (16)$$
The coefficient of determination (R2) represents the percentage of variability that can be explained by the model and is calculated as follows:
where parameters have been defined in Equation (15).
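The three criteria can be computed with a few lines of Python. This is a sketch under stated assumptions: the errors are taken as predicted minus measured, so a positive bias means overestimation, and R2 is computed as one minus the ratio of the residual sum of squares to the total sum of squares, which is one common definition and may differ from the formula used in the study. The function names and sample values are hypothetical.

```python
import numpy as np

def rmse(pred, meas):
    """Root mean square error."""
    return float(np.sqrt(np.mean((pred - meas) ** 2)))

def bias(pred, meas):
    """Mean error; positive values indicate overestimation."""
    return float(np.mean(pred - meas))

def r_squared(pred, meas):
    """1 - SS_res / SS_tot (assumed definition of R2)."""
    ss_res = np.sum((meas - pred) ** 2)
    ss_tot = np.sum((meas - np.mean(meas)) ** 2)
    return float(1.0 - ss_res / ss_tot)

# Illustrative values, not data from the study.
meas = np.array([10.0, 20.0, 30.0, 40.0])
pred = np.array([12.0, 18.0, 33.0, 39.0])
scores = (rmse(pred, meas), bias(pred, meas), r_squared(pred, meas))
```

For these sample values the errors are 2, −2, 3, and −1, giving an RMSE of about 2.12, a bias of 0.5, and an R2 of 0.964.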
Results and Discussion
Determination of BRBPNN structure
Generally, the number of hidden neurons in a traditional BPNN is determined by repeatedly testing the network. However, a BRBPNN can automatically find the optimum value from the posterior distribution (Mackay, 1992; Foresee and Hagan, 1997).
To acquire the optimal structure, the BRBPNNs were independently trained 20 times to eliminate spurious effects caused by the random set of initial weights and 3,000 epochs of maximal stopping (Xu et al., 2006). From Fig. 3, we can see the trend in the number of neurons S in the hidden layer for an optimal network with 15 parallel training runs and 1,500 epochs of maximal stopping. As S increases up to a value of 6, the number of effective parameters increases and the mean squared error (MSE) becomes smaller. When S is greater than 6, the MSE and number of effective parameters remain roughly constant. This gives the minimum number of hidden neurons required to properly represent the objective function. Therefore, we can determine the optimal number of neurons in the input, hidden, and output layers as 7-6-1, respectively, in our BRBPNN model.

Fig. 3. Change in the optimal Bayesian regularized back-propagation neural network (BRBPNN) along with the number of hidden neurons. MSE, mean squared error.
Training, validation, and testing results
Table 2 shows the performance parameters of the BRBPNN model for computation of the monthly chlorophyll-a concentrations at the different sampling points in Meiliang Bay, Lake Taihu. Figures 4–6 show the training, validation, and testing results, respectively, of the (7-6-1) BRBPNN model for chlorophyll-a concentrations at the four sampling points in Meiliang Bay, Lake Taihu.

Fig. 4. Training results of the (7-6-1) BRBPNN model for chlorophyll-a concentrations at four sampling points in Meiliang Bay, Lake Taihu.

Fig. 5. Validation results of the (7-6-1) BRBPNN model for chlorophyll-a concentrations at four sampling points in Meiliang Bay, Lake Taihu.

Fig. 6. Testing results of the (7-6-1) BRBPNN model for chlorophyll-a concentrations at four sampling points in Meiliang Bay, Lake Taihu.
BRBPNN, Bayesian regularized back-propagation neural network; RMSE, root mean square error.
The coefficient of determination (R2) values (p < 0.001) for the training, validation, and test sets were 0.77, 0.49, and 0.76, respectively. The respective values of RMSE and bias for the three data sets were 17.24 and −1.05 for training, 12.48 and 0.62 for validation, and 11.01 and 2.2 for testing. The measured and model-computed chlorophyll-a concentrations in Meiliang Bay, Lake Taihu, follow closely matching patterns of variation (Figs. 4–6), and together with the R2, RMSE, and bias values, this suggests a good fit of the BRBPNN model to the data set. The respective values of RMSE, bias, and R2 for the four sampling points were 19.31, −3.07, and 0.71 for Bay 1; 16.09, 1.61, and 0.86 for Bay 2; 12.62, 0.08, and 0.86 for Bay 3; and 11.98, 1.12, and 0.76 for Bay 4.
Bay 2 and Bay 3 yield slightly better results than the other two sampling points. A possible reason is that, unlike these two sampling points in the middle of the bay, Bay 1 is closer to the shore and Bay 4 is closer to the mouth of the bay, so the chlorophyll-a concentrations at Bay 1 and Bay 4 are more likely to be disturbed.
Furthermore, all of the field data and model output were divided into the following four groups: spring (March–May), summer (June–August), fall (September–November), and winter (December–February).
Then, we calculated the respective values of RMSE, bias, and the coefficient of determination (R2) for the four seasons. Table 3 shows the performance parameters of the BRBPNN model for computation of the monthly chlorophyll-a concentrations of different seasons in Meiliang Bay, Lake Taihu. The respective values of RMSE, bias, and the coefficient of determination (R2) for the four seasons are 14.31, 0.80, and 0.53 for spring, 22.10, −1.63, and 0.79 for summer, 12.35, 0.68, and 0.90 for fall, and 9.41, −0.1, and 0.38 for winter. Although the values of RMSE and bias for summer are slightly higher, caused by the high concentrations of chlorophyll-a in summer, the values of the coefficient of determination (R2) for summer are quite high.
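The seasonal grouping can be sketched in Python as a month-to-season lookup followed by a per-season error calculation. The lookup table, the function name, and the example data are hypothetical, and assigning February to winter is an assumption made here so that every month belongs to a season.

```python
import numpy as np

SEASON_OF_MONTH = {
    3: "spring", 4: "spring", 5: "spring",
    6: "summer", 7: "summer", 8: "summer",
    9: "fall", 10: "fall", 11: "fall",
    12: "winter", 1: "winter", 2: "winter",  # February in winter: assumption
}

def seasonal_rmse(months, predicted, measured):
    """RMSE of predicted vs. measured values within each season."""
    seasons = np.array([SEASON_OF_MONTH[int(m)] for m in months])
    predicted = np.asarray(predicted, dtype=float)
    measured = np.asarray(measured, dtype=float)
    result = {}
    for season in ("spring", "summer", "fall", "winter"):
        mask = seasons == season
        errors = predicted[mask] - measured[mask]
        result[season] = float(np.sqrt(np.mean(errors ** 2)))
    return result

# Illustrative data: the error is 2 in summer months and 1 elsewhere.
months = np.arange(1, 13)
measured = np.zeros(12)
predicted = np.where((months >= 6) & (months <= 8), 2.0, 1.0)
scores = seasonal_rmse(months, predicted, measured)
```

The same mask-based pattern extends to bias and R2 by swapping the expression inside the loop.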
Thus, the BRBPNN model can be regarded as a reliable tool to predict algal blooms and simulate the chlorophyll-a dynamics in Meiliang Bay, Lake Taihu.
Comparison with MLR models
In recent years, MLR-based models have repeatedly been applied for prediction in many fields (Bruder et al., 2014; In Ieong et al., 2014; Tugcu et al., 2014). Bruder et al. (2014) used MLR-based models to predict taste and odor compounds in algal bloom-affected inland water bodies and obtained quite good simulation results. Using MLR-based models, In Ieong et al. (2014) correctly predicted the phytoplankton abundance in Macau Storage Reservoir. Tugcu et al. (2014) presented quantitative structure–toxicity relationship models of the toxicity of 91 organic compounds to Chlorella vulgaris using MLR techniques. Thus, based on our data, we developed MLR models to predict the chlorophyll-a concentrations in Meiliang Bay, Lake Taihu, and compared them with our BRBPNN model.
All the data from the four stations were log transformed, and regression tests were applied to determine the relationships between chlorophyll-a concentration and the other water quality variables. The MLR models and the values of the coefficients are shown in Table 4. Compared with the MLR models, the BRBPNN model fits the data much better. This suggests that the BRBPNN model can capture the complex relationships between chlorophyll-a concentration and the other water quality variables. Thus, the proposed BRBPNN model is a better way than the MLR models to predict the chlorophyll-a concentration in Meiliang Bay, Lake Taihu.
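A log-transformed MLR of this kind can be sketched with an ordinary least-squares fit. The snippet below is a minimal illustration, assuming a base-10 logarithm applied to both predictors and response and strictly positive data; the study does not state the base or the fitting routine. The function names and the synthetic data are hypothetical.

```python
import numpy as np

def fit_log_mlr(X, y):
    """Least-squares fit of log10(y) on log10(X) with an intercept."""
    design = np.column_stack([np.ones(len(y)), np.log10(X)])
    coef, *_ = np.linalg.lstsq(design, np.log10(y), rcond=None)
    return coef                        # [intercept, slope_1, slope_2, ...]

def predict_log_mlr(coef, X):
    """Back-transform the fitted log-log model to the original scale."""
    design = np.column_stack([np.ones(len(X)), np.log10(X)])
    return 10.0 ** (design @ coef)

# Synthetic, noise-free data generated from a known power-law relationship.
rng = np.random.default_rng(1)
X = rng.uniform(0.5, 5.0, size=(40, 2))
y = 10.0 ** 0.5 * X[:, 0] ** 1.2 * X[:, 1] ** -0.3
coef = fit_log_mlr(X, y)
```

A log-log regression can only represent multiplicative power-law relationships between the variables, which is one reason a nonlinear network may fit such data better.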
Sensitivity analyses of the input variables
To identify the sensitivity of chlorophyll-a concentration to minor changes in each input factor, we conducted simulations by increasing each input parameter by 10%. The calculated results are shown in Table 5. The sensitivity analyses indicated that a positive relationship existed between Chla and pH. It is known that Microcystis is the dominant species in Lake Taihu (Zhang et al., 2015). We can conclude that algae in the Meiliang Bay, principally Microcystis, were alkalophilic.
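The perturbation procedure can be sketched as follows: each input is increased by 10% in turn while the others are held at their original values, and the relative change in the mean model output is recorded. How the study aggregated the response is not stated, so the use of the mean here is an assumption, and the stand-in model and variable names are hypothetical.

```python
import numpy as np

def sensitivity(model, X, names, increase=0.10):
    """Percentage change in mean model output per perturbed input."""
    base = np.mean(model(X))
    result = {}
    for j, name in enumerate(names):
        X_perturbed = X.copy()
        X_perturbed[:, j] *= 1.0 + increase   # raise one input by 10%
        changed = np.mean(model(X_perturbed))
        result[name] = float(100.0 * (changed - base) / base)
    return result

# Stand-in for a trained network; any callable on raw inputs works.
def toy_model(X):
    return 2.0 * X[:, 0] + X[:, 1]

X = np.array([[1.0, 2.0], [3.0, 4.0]])
changes = sensitivity(toy_model, X, names=["a", "b"])
```

With a trained network in place of `toy_model`, the perturbation would be applied to the raw inputs before normalization so that a 10% increase keeps its physical meaning.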
Wei et al. (2001) used an ANN to predict algal blooms in Lake Kasumigaura, Japan, and found that a 10% increase in pH value would lead to the increases of 59.1%, 67.4%, 100.3%, and 158.9% in the densities of Microcystis, Oscillatoria, Phormidium, and Synedra, respectively. Based on an 11-year set of environmental monitoring data in Meiliang Bay, Liu et al. (2011) found that pH was significantly correlated with Microcystis in Lake Taihu, Meiliang Bay. By means of evolutionary computation, Zhang et al. (2015) successfully developed forecasting models that provide early warning on cyanobacteria outbreaks in Lake Taihu and found that a positive relationship existed between cyanobacteria and pH.
Increased pH may in fact increase the availability of P and promote algal proliferation (Brewer and Goldman, 1976). However, the algal photosynthetic process removes CO2 from the water and raises the pH (Xu et al., 2010). Therefore, it is difficult to establish whether low CO2/high pH is the cause of, or the result of, increased growth of Microcystis in Meiliang Bay (Liu et al., 2011). Further research is needed on this point.
As shown in Table 5, an increase in the TN or TP value also causes an increase in chlorophyll-a concentration. However, compared with pH, a 10% increase in TN or TP causes a relatively small change in chlorophyll-a concentration in the model output. This is because a 10% increase is large relative to the observed range of pH but not relative to the ranges of TN and TP. In most earlier research on nutrient limitation of phytoplankton, Meiliang Bay was assumed to be P-limited based on annual-scale data, as supported by analyses of TN:TP (Vant et al., 1998). Recent studies, however, have found that phytoplankton production was controlled by P inputs from spring to early summer, whereas N played a more dominant controlling role in summer–fall (Xu et al., 2010; Paerl et al., 2011; Xu et al., 2013; Paerl et al., 2014, 2015; Ye et al., 2015).
Thus, we conducted the following three simulations: (1) an increase of 0.5 mg/L in the input TN value (+N); (2) an increase of 0.02 mg/L in the input TP value (+P); and (3) both increases applied together (+NP) (Paerl et al., 2015). Figures 7 and 8 show the change in chlorophyll-a concentrations from spring to early summer (March–June) and from summer to fall (July–October), respectively, in the three simulations. From spring to early summer, +P and +NP had a statistically significant positive influence on Chla concentrations, whereas +N and +NP were more stimulatory in summer to fall (Figs. 7 and 8). This is because, at the beginning of spring, total dissolved nitrogen concentrations were often very high relative to dissolved inorganic phosphorus concentrations, whereas during summer–fall, when dissolved inorganic nitrogen levels rapidly decreased, the algae tended to show more significant growth responses in the +N simulation (Paerl et al., 2015).
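The three nutrient scenarios can be sketched in Python by adding fixed increments to the TN and TP input columns and comparing the model output with the unperturbed run. The increments of 0.5 mg/L and 0.02 mg/L are taken from the text; the function name, the column positions, and the stand-in model are hypothetical.

```python
import numpy as np

def nutrient_scenarios(model, X, tn_col, tp_col, d_tn=0.5, d_tp=0.02):
    """Change in predicted Chla for the +N, +P, and +NP scenarios."""
    scenarios = {"+N": (d_tn, 0.0), "+P": (0.0, d_tp), "+NP": (d_tn, d_tp)}
    base = model(X)
    changes = {}
    for label, (dn, dp) in scenarios.items():
        X_s = X.copy()
        X_s[:, tn_col] += dn           # TN increment in mg/L
        X_s[:, tp_col] += dp           # TP increment in mg/L
        changes[label] = model(X_s) - base
    return changes

# Stand-in for a trained network: a purely additive response.
def toy_model(X):
    return 10.0 * X[:, 0] + 100.0 * X[:, 1]

X = np.array([[2.0, 0.10], [3.0, 0.15]])   # columns: TN, TP (placeholders)
changes = nutrient_scenarios(toy_model, X, tn_col=0, tp_col=1)
```

Restricting the rows of `X` to March–June or July–October before calling the function gives the two seasonal comparisons. With an additive stand-in the +NP response is simply the sum of the +N and +P responses, whereas a trained nonlinear network can show interactions between the two nutrients.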

Fig. 7. Change of chlorophyll-a concentrations from spring to early summer (March–June) in three simulations: (1) +N, (2) +P, and (3) +NP.

Fig. 8. Change of chlorophyll-a concentrations from summer to fall (July–October) in three simulations: (1) +N, (2) +P, and (3) +NP.
Therefore, reducing P may no longer be adequate for Lake Taihu, and new nutrient reduction strategies should incorporate N-input reduction along with P-input reductions.
Conclusions
In this article, a BRBPNN model was created and used to predict the monthly chlorophyll-a concentration dynamics over a period of 15 years in Meiliang Bay, Lake Taihu. The following conclusions can be stated:
(a) The optimal network was found to consist of seven input neurons, six hidden neurons, and one output neuron. The coefficient of determination (R2) values (p < 0.001) for the training, validation, and test sets were 0.77, 0.49, and 0.76, respectively. The respective values of RMSE and bias for the three data sets were 17.24 and −1.05 for training, 12.48 and 0.62 for validation, and 11.01 and 2.2 for testing. The respective values of RMSE, bias, and R2 for the four seasons were 14.31, 0.80, and 0.53 for spring; 22.10, −1.63, and 0.79 for summer; 12.35, 0.68, and 0.90 for fall; and 9.41, −0.1, and 0.38 for winter. Compared with MLR models, the BRBPNN model fits the data much better. Thus, the BRBPNN model can be seen as a powerful tool for the prediction of long-term chlorophyll-a concentration dynamics in Meiliang Bay, Lake Taihu.
(b) Algae in Meiliang Bay, principally Microcystis, were alkalophilic, and phytoplankton production was controlled by P inputs from spring to early summer, whereas N played a more dominant controlling role in summer–fall. Therefore, reducing P may no longer be adequate for Lake Taihu, and new nutrient reduction strategies should incorporate N-input reduction along with P-input reductions.
Acknowledgments
This work was supported by the State Water Pollution Control and Treatment Technique Program of China (2013ZX07101014-05) and the Natural Science Foundation of Jiangsu Province (BK20140603). The Taihu Laboratory for Lake Ecosystem Research (CNERN TaiLLER), Chinese Academy of Sciences, supplied the monitoring data.
Author Disclosure Statement
No competing financial interests exist.
References
