Abstract
Since increasing food demand and continuous reduction of available farmland, reliable and near-real-time wheat yield forecasts are essential to ensure regional and global food supplies. Although the crop model has been widely used in yield estimation, its applicability in large-scale yield prediction is limited due to the large amount of data required for parameterization. We took the main winter wheat growing areas in China and developed an ensemble learning framework based on seven machine learning algorithms, such as extreme gradient boosting, random forest, and support vector regression. The model used satellite vegetation index time series, climate, soil properties, and elevation data to provide county-level winter wheat yield forecasts from 2001 to 2015. The results showed that the ensemble explained 86% of the yield variability, which outperformed all base learners. By calculating the correlation between the prediction results of the base learners, we believed that the prediction performance of ensemble learning still has the potential for improvement. Soil properties and elevation data effectively improved the performance of the model because they contained information about yield prediction that could not be fully captured by vegetation index and climate data. As the growing season went on, the unique contribution of increasing climate data to yield forecasts was always more than that of vegetation index, especially in the early growing season. Furthermore, we evaluated the model’s ability to perform within-season prediction, and the model achieved satisfactory prediction accuracy 2 months before harvest (R2 = 0.85, RMSE = 480 kg/ha, MAPE = 7.52%). The framework of yield forecast established in this research can be applied to other crop varieties and regions and provide stakeholders with sufficiently accurate yield predictions.
1. Introduction
Wheat is one of the highest yielding cereals in the 21st century (Curtis and Halford, 2014). China is a major wheat producer and consumer (Song et al., 2019), with wheat production making up approximately 18% of global total production (Han et al., 2020). Winter wheat, in particular, makes up about 85% of summer food crop production (Cao et al., 2021a). Therefore, timely and reliable winter wheat yield information plays an important role in ensuring agricultural development and food supplies in China and throughout the world. Specifically, crop producers can adjust agricultural management measures with reliable predicted yields (Khaki and Wang, 2019). For government decision-makers, yield prediction can assist in formulating agricultural product trading strategies and farmers’ subsidy policies (Jiang et al., 2020).
In recent years, most crop yield forecasts have relied on process-based crop models or statistical regression models (Archontoulis et al., 2020; Potgieter et al., 2016). Process-based crop models are based on crop physiology and can accurately simulate crop yield by explaining the relationship between crop growth and climate, soil, agricultural management or other factors (Feng et al., 2020b). However, crop models are highly dependent on a large amount of input information on climatic conditions, soil properties, crop types, and agricultural strategies (Holloway & Mengersen, 2018; Kang and Özdoğan, 2019). Therefore, due to the high spatial heterogeneity of the environment, management measures and crop varieties, the prediction of crop models in large-scale areas has great uncertainty (Sakamoto et al., 2013). In addition, statistical models use regression equations to predict yields by correlating crop yields with selected predictive factors (e.g., satellite-derived vegetation indices (VIs) and/or climate data). Generally speaking, statistical regression models have the characteristics of fewer parameter settings and simple calculations (Qader et al., 2018), but most statistical regression models are based on multiple linear regressions, which cannot capture the non-linear relationship between dependent variables and independent variables. Since the influence of environmental conditions on crop growth and the formation of final yield is complex and non-linear (Wolanin et al., 2020), linear regression models may not perform well in yield prediction.
Machine learning predicts yield by establishing an empirical relationship between the predictors related to yield and recorded yield. It assigns different weights to different features, rather than the possibility or probability of any prediction information (Lee et al., 2018; Ma et al., 2021). Therefore, machine learning has the advantage of effectively capturing the non-linear relationship between predicted factors and yield and does not depend on specific parameters (Wang et al., 2020). Satellite technology can monitor crop growth dynamics in real-time with various spectral bands and can be applied to yield prediction (Azzari et al., 2017; Boken and Shaykewich, 2002;). Research on crop growth monitoring and yield prediction has mainly used various VIs derived from optical bands (i.e., visible and near-infrared bands), such as normalized difference vegetation index (NDVI) and enhanced vegetation index (EVI) (Dong et al., 2015; Holzman et al., 2014). Moreover, climatic conditions and soil characteristics are also key factors affecting crop yield, including non-biological information that cannot be captured by VIs (Cao et al., 2021b). To improve yield estimation, machine learning models using multi-source data have been developed to capture regional yield variability. For example, Kamir et al. (2020) developed machine learning algorithms for Australian wheat prediction using climate and satellite-derived NDVI time series, in which the best-performing support vector regression (SVR) model explained 73% of the yield variability. Furthermore, Cai et al. (2019) evaluated the ability of three machine learning models, including random forest (RF), neural network (NN), support vector machine (SVM), and one regression algorithm in the prediction of Australian wheat yield. The results demonstrated that the machine learning models outperformed the traditional regression algorithm.
Although the single machine learning method has achieved satisfactory accuracy in crop yield prediction, the performance and robustness of the model still could be improved. Generally speaking, the ensemble of multiple machine learning algorithms can improve the predictive performance and generalization ability of the model. The basic principle is that an ensemble can consolidate the advantages of single base learners and make up for their shortcomings (Tumer and Ghosh, 1996). Therefore, the potential of ensemble learning in yield prediction is worth exploring. This study aimed to develop an ensemble of machine learning algorithms that uses dynamic variables (satellite vegetation index time series and climate data) and static variables (soil attributes and elevation) to estimate China’s county-level winter wheat yield from 2001 to 2015. The main objectives are to (1) compare and determine the best winter wheat yield prediction model, explore whether the ensemble of base learners can further improve the accuracy, (2) evaluate the contribution of different variables to the model accuracy, (3) analyze the temporal process of model accuracy with climate and satellite vegetation index as inputs as the growing season goes on, determine the best lead time for yield prediction, and (4) understand the factors that are crucial to yield estimation according to the relative importance of variables.
2. Materials
2.1. Study area
In this research, we focused on the main winter wheat growing areas in China (Figure 1), including most regions of Shandong, Henan, and Jiangsu provinces, southern Hebei province, central Hubei province, and northern Anhui province. The study area (29°25′51″–39°57′09″N, 110°21′38″–121°58′31″E) covers approximately 5.63×105 km2 and consists of 444 main winter wheat cultivation counties, where winter wheat cropping acreage and production make up 72% and 85% of China in 2015 (http://www.stats.gov.cn/). The northern part of the study area is located to the north of the Huaihe River, with a warm temperate semi-humid climate, rotation of winter wheat and summer maize is the main agricultural cultivation mode. The southern part of the study area is situated to the south of the Huaihe River, which belongs to a subtropical monsoon climate. Winter wheat is mostly rotated with rice in this region. Winter and spring are usually rainless and dry, and summer is rainy and humid. Winter wheat in the study area is generally sown in early October and harvested in mid-June of the next year (Gan et al., 2020). Therefore, we determined that the growing season of winter wheat in this study is from October to June of the following year for subsequent modeling and analysis. Distribution of winter wheat cropping areas and main producing counties in the study areas. For interpretation of the references to colours in this figure legend, refer to the online version of this article.
2.2. Datasets and preprocessing
We collected the following data from different sources to conduct this research: winter wheat yield data at county level, cropping areas, satellite (including NDVI and digital elevation model) and environmental data (including climate data and soil properties data) (Table S1). First, we resampled all the raster data to 1 km to match the spatial resolution with crop planting maps. Then, the temporal resolution of NDVI was converted into monthly intervals (as with climate data). Satellite and environmental data were masked by crop cropping maps and aggregated into the average of each county.
2.2.1. Winter wheat yield and cropping areas
The county-level winter wheat yield records of the 444 counties in the study area from 2001 to 2015 were manually collected from the county-level statistical yearbook (http://www.stats.gov.cn/). We performed quality control on yield records according to Zhang et al. (2014): records outside the range of the mean ±2 times standard deviation from 2001 to 2015 were identified as outliers and filtered. Moreover, the yearly winter wheat cropping maps with 1 km resolution from 2001 to 2015 used to mask out non-planting areas are available in Luo et al. (2020).
2.2.2. Satellite data
We obtained NDVI with 16-day temporal resolution and 1 km spatial resolution from NASA Terra-Moderate Resolution Imaging Spectroradiometer (MODIS) MOD13A2 (Collection 6) products (https://earthdata.nasa.gov/). To alleviate the noise caused by the problems related to cloud and snow cover, we aggregated NDVI into monthly intervals by MVC (Maximum Value Composite) (Holben, 2007). Moreover, we used DEM (Digital Elevation Model) with 90 m spatial resolution from Shuttle Radar Topography Mission (SRTM) digital elevation database (https://srtm.csi.cgiar.org/).
2.2.3. Environmental data
The monthly climate data is from the Chinese Meteorological Forcing Dataset (0.1°, ∼10 km) of the Institute of Tibetan Plateau Research, Chinese Academy of Sciences (He et al., 2020; Kun and Jie, 2019; Yang et al., 2010). We sourced two variables of monthly average precipitation rate (Prec) and monthly average temperature (Temp) during the winter wheat growing season from 2001 to 2015. It was assumed that higher resolution was not necessary on account of the low local variability of climate variables. We used the nearest neighbor method to resample the climate data to 1 km so that the spatial resolution matched the planting area maps. Considering that soil properties have an important influence on the final yield formation of crops, we selected six physicochemical properties of topsoil (30 cm) at a spatial resolution of 1 km from the Harmonized World Soil Database (HWSD) (http://www.fao.org/soils-portal/en/), including cation exchange capacity, pH, the percentage content of sand, silt, clay, and organic carbon.
3. Methods
3.1. Normalized difference vegetation index (NDVI) metrics extraction
With the purpose of obtaining more information related to crop growth, based on the literature, we extracted 10 temporal metrics that were expected to associate with crop yields from the satellite-derived NDVI time series and input them into the model as supplementary variables. NDVI usually decreases abnormally due to the contamination of bad atmospheric conditions (e.g., aerosol and dust), cloud and snow. In order to reduce noise, the Savitzky–Golay (S-G) filtering method based on locally adaptive moving window was used to smooth NDVI time series for simulating their specific shape more accurately (Savitzky and Golay, 1964). The general equation of fitting progress can be given as follows Visualization of normalized difference vegetation index (NDVI) time series of a pixel for winter wheat and summer corn rotation in Shandong Province in 2009 (satellite NDVI as dark points and smoothed NDVI as dark line). The 10 NDVI metrics derived from the interpolated NDVI time series are reported. For interpretation of the references to colours in this figure legend, refer to the online version of this article.
Temporal metrics extracted from NDVI time series during winter wheat green-up to maturity and used as supplementary variables for predicting yield.
NDVI: normalized difference vegetation index.
3.2. Machine learning algorithms for predicting yield
Here, we selected and compared seven widely used machine learning algorithms: extreme gradient boosting (XGB), RF, SVR, k-nearest neighbors (kNN), extra trees (ET), gradient boosted decision trees (GB) and Gaussian process regression (GP). To prevent the size of certain features from misleading the prediction of the algorithms, all independent variables were normalized to the range of 0–1 before applied algorithms. Each algorithm has a set of hyperparameters and needs to be optimized before the learning process. The grid search method combined with 10-fold cross-validation determined the best hyperparameters. The performance evaluation of all models was based on 10-fold cross-validation: the full data set was evenly and randomly divided into 10 subsets, nine of which were used as training data to train all models, then tested on the remaining one subset from which performance metrics were derived. The whole process was repeated 10 times, and the average of performance metrics was used to evaluate different models. Details for the construction and mathematical theory of models were given in the following. More information about the seven models is summarized in Table S2.
3.2.1. Extreme gradient boosting
Extreme gradient boosting is a special variety of gradient boosting algorithms proposed by Chen and Guestrin (2016), which implements boosted decision tree ensembles. The model creates multiple trees sequentially, and the new trees are all created based on the residuals of the previous tree, thereby improving the overall prediction. The model aims to combine several weak learners into more powerful ones. Extreme gradient boosting uses more regularized model formalization to reduce the risk of overfitting and can use a random subset of predictive variables at each node.
3.2.2. Random forest
Random forest is an ensemble learning technology based on the regression tree method (CART) (Breiman, 2001). The model combines a series of decision trees constructed by independent random sampling. When building a tree, the optimal trees at each split node are determined from the random subset of all variables, which makes the trees independent of each other. The final forecast is the average of all trees. Furthermore, the bagging method is adopted to reduce variance and over-fitting. The model is usually not affected by noise and overfitting (Teluguntla et al., 2018).
3.2.3. Support vector regression
Support vector regression is an SVM in regression estimation, which is characterized by using kernels and acting on margins (Cortes and Vapnik, 1995). It follows the principle of minimum risk of outcome to fit a function that predicts the target variable from the input features according to the requirements of balancing complexity and prediction error. Support vector regression uses the kernel function to map the input space to the high-dimensional feature space and then builds a linear model in the feature space to strike a balance between minimizing errors and overfitting (Smola and Schölkopf, 2004). In this study, we tested support vector machines with polynomial kernels.
3.2.4. k-nearest neighbors
k-nearest neighbors is a simple and intelligible model for classification and regression, first proposed by Cover and Hart (1967). The main difference between classification and regression is the final decision-making method when making predictions. The input of kNN regression contains the k nearest training samples in the feature space, and the prediction result is the average of k nearest neighbors. kNN uses a unified weight by default, or it can be set by the user whether to weigh the value, and the weight is determined by their distance.
3.2.5. Extra trees
Extra trees is an algorithm very similar to RF proposed by Geurts et al. (2006), which is composed of many decision trees. The difference from RF is that ET generally does not adopt random sampling, but each decision tree uses the original training set. Since the characteristic value used to divide the decision tree is randomly selected, the scale of the generated decision tree is generally larger than the decision tree generated by RF. In other words, the variance of the model is further reduced relative to RF, but the bias is further increased.
3.2.6. Gradient boosted decision trees
GB is a gradient boosting algorithm that uses CART as the basis function and adopts an additive model and forward step algorithm (Friedman, 2002). In each iteration, GB learns the decision tree by fitting negative gradients (also known as residuals). This method is similar to the gradient descent method in weight learning. The key is to use the negative gradient of the loss function as the approximate value of the residual in the boosting tree to fit a regression tree.
3.2.7. Gaussian process regression
Gaussian process regression is a non-parametric probability model that uses the Gaussian process for regression analysis of data (Ounpraseuth, 2008). It assumes that the prior average is constant and zero or the average of the training data. The prior covariance is specified by passing a kernel object. The core idea of GP is to infer unknown functional relationships from the training data set by placing the priors directly on the function space; then these priors are updated according to the training samples to generate a posteriori distribution of the function. In the process of GP fitting, based on the transmitted optimizer, the kernel hyperparameters are optimized by maximizing the logarithmic marginal likelihood.
3.3. Stacking of base learners for ensemble learning
We tried to combine the predictions of several base learners by using an ensemble learning algorithm to improve the robustness and generalization ability of a single learner. Specifically, we adopted the Stacking regression model proposed by (Wolpert, 1992) to generate an ensemble of seven base learners (XGB, RF, SVR, KNN, ET, GB, and GP). The basic principle of the Stacking regression model is shown in Figure 3: The performance evaluation of Stacking was also based on ten-fold cross-validation. There are two levels in the framework of Stacking: level 1 and level 2. At level 1, the training data in each fold was ten-fold cross-validated by all base learners to obtain the training data for the meta-model. In this process, 10 prediction results for the test data were generated by each base learner and then averaged as the test data of the meta-model. At level 2, the predictions of the meta-model for the fold were obtained based on the new data set. To avoid overfitting due to the small amount of data at level 2, linear regression was selected as the meta-model. Similarly, Stacking was tested on other folds to generate the final prediction results. We implemented all machine learning models and Stacking regression algorithm via the “scikit-learn” library coded in Python language. The general framework of Stacking. P(s) are model predictions. For interpretation of the references to colours in this figure legend, refer to the online version of this article.
3.4. Model performance evaluation
In this study, three performance metrics were used to evaluate the respective prediction performance of models, that is, coefficient of determination (R2), root mean square error (RMSE) and mean absolute percentage error (MAPE)
4. Results
4.1. Performance of the models
The performance metrics of base learners and ensemble for yield prediction are shown in Figure 4. The R2, RMSE, and MAPE values of each model were obtained by calculating the average of the ten-fold cross-validation results. XGB (R2 = 0.85, RMSE = 494 kg/ha and MAPE = 7.74%) achieved the best performance among all base learners. SVR (R2 = 0.73, RMSE = 666 kg/ha and MAPE = 10.25%) and kNN (R2 = 0.72, RMSE = 677 kg/ha and MAPE = 10.39%) displayed lower accuracy compared with other models. Across all models, Stacking had the highest prediction accuracy and it explained 86% of the winter wheat yield variability with an RMSE and MAPE of 477 kg/ha and 7.48%, respectively. Its performance was slightly better than that of the optimal base learner, XGB. The scatter plot shows that the predicted yields of Stacking and recorded yields were close to the 1:1 line, and most of the points fell within the error range of ±1000 kg/ha (Figure 4(h)). Besides, it was worth noting that all models underestimated high yields and overestimated low yields to varying degrees, but Stacking had improved in this aspect. Predicted yields versus recorded yields from 2001 to 2015. The solid line indicates that predicted yields are equal to recorded yields, and the dotted line indicates that the absolute estimation error is less than 1000 kg/ha. The predictions were derived from extreme gradient boosting (XGB) (a), random forest (RF) (b), support vector regression (SVR) (c), k-nearest neighbors (kNN) (d), ET (e), GB (f), GP (g), and Stacking (h). For interpretation of the references to colours in this figure legend, refer to the online version of this article.
To further understand why Stacking, which is based on ensemble learning technology, was not rewarded with more obvious accuracy improvement, we calculated correlations among the estimates of the base learners and between the estimates of the base learners and recorded yields (Figure 5). Since ensemble relies on the assumption of improving prediction performance by integrating the estimates of low-correlation and good-performing base learners (Dong et al., 2021; Mashhadi et al., 2019), the high correlations between learners (all correlations above 0.92) explained the limited improvement of Stacking compared to other models. Correlations among the predictions of base learners and between the predictions of base learners and recorded yields. For interpretation of the references to colours in this figure legend, refer to the online version of this article.
4.2. Spatial pattern of yield forecast
The spatial distributions of mean estimated yields of all models and mean recorded yields are shown in Figure 6. The spatial patterns of yields obtained by models with good predictive performance (e.g., Stacking, XGB and GB) were more consistent with the spatial distribution of actual yields (Figure 6(i)). The high-yield counties were mostly distributed in the middle and north of the study area, that is, the east of Henan Province, the west of Shandong Province and the south of Hebei Province. The low-yield counties were located in the south of the study area. Spatial patterns of yields from 2001 to 2015 derived using XGB (a), RF (b), SVR (c), kNN (d), ET (e), GB (f), GP (g), Stacking (h), and recorded yields (i). For interpretation of the references to colours in this figure legend, refer to the online version of this article.
The mean relative error distributions of predicted yields are presented in Figure 7. The number of counties with high positive error (≥20%) was significantly more than that of counties with high negative error (≤20%), and they were mainly located in the south of study area. Subsequently, comparing with different models, Stacking outperformed others with the relative error of most counties was close to zero. kNN performed poorly in all models, and more counties with high error were observed in the study area. According to the spatial pattern of recorded yields (Figure 6(i)), the counties with higher positive and negative error generally corresponded to lower and higher yields, respectively, which reflected the defects of models illustrated in Figure 4 that overestimated low yields and underestimated high yields. The reason could be that the distribution of samples used for training models was uneven, with the sample size of medium yields larger than that of low and high yields. It meant the models had the disposition to predict medium yields so that they were insufficient to get better results at the low and high ends of the yield. Especially for many counties with very low yields in the south of study area, the models produced higher positive error. Mean relative errors of predicted yields from 2001 to 2015 derived using XGB (a), RF (b), SVR (c), kNN (d), ET (e), GB (f), GP (g), and Stacking (h). For interpretation of the references to colours in this figure legend, refer to the online version of this article.
4.3. Impact of different variables on model accuracy
We chose XGB and Stacking for subsequent analysis and comparison due to their good predictive performance. To evaluate the yield prediction performance of models with different data sources, we designed the nine combinations of variables to analyze the accuracy of XGB and Stacking (Figure 8). The results showed that in the case of a single data source as input, climate data outperformed NDVI and NDVI metrics, with higher R2 and lower RMSE and MAPE. The worst performance of NDVI metrics may be due to the coarse resolution of NDVI, and the interference caused by serious cloud contamination or complicated crop rotation patterns on NDVI metrics extraction. Stacking was superior to XGB using various combinations of variables as inputs, especially under the conditions of fewer variables. Furthermore, we noticed that the accuracy of models with NDVI + Climate + NDVI metrics was improved by about 9% due to the addition of soil properties, and was improved by about 14% with the introduction of both soil properties and elevation. It indicated that the addition of two variables enhanced the ability of models to capture yield variability. However, the performances of models with NDVI + Climate and NDVI + Climate + NDVI metrics were nearly equal, which suggested that NDVI metrics did not provide more information for prediction than NDVI + Climate. The performance (estimated R2) of XGB and Stacking using nine combinations of variables. All variables represent a combination of NDVI, climate, NDVI metrics, soil and elevation. For interpretation of the references to colours in this figure legend, refer to the online version of this article.
We further analyzed the temporal pattern of model performance within the season. Figures 9(a) and (b) shows the temporal pattern of model accuracy with different data, that is, NDVI, climate and the combination of climate and NDVI (NDVI + Climate). The prediction performance of XGB and Stacking both showed similar temporal patterns: (1) for the three input sources, along with the growth of winter wheat, the model performance gradually increased as more input data became available, and reached saturation at the late growing season, (2) during the whole growing season, NDVI + Climate achieved the highest accuracy, and Climate had better performance than NDVI, (3) there are obvious differences in the temporal progression of model accuracy between NDVI and the other two input sources. NDVI had poor performance at the beginning of the growth season (∼0.1 in R2), and then its performance was dramatically improved during the growing season (∼0.5 increased in R2). Most of the performance improvement occurred from January to May and reached saturation in May. However, Climate and NDVI + Climate had a relatively high R2 at the beginning of the growing season (∼0.2–0.3 in R2), which increased greatly from October to December (∼0.3–0.4 increased in R2), and R2 increased by about 0.45 in the whole growing season. The presentation of this pattern may be related to the construction of the models. In this study, trained models can predict the spatial and temporal variability of yield since each county-year is taken as an independent sample. Because the spatial pattern of climate factors was maintained to a large extent throughout the growing season, based on the assumption that climate is related to yield, climate factors in the early growing season can capture some spatial patterns of yield. Therefore, the relatively good performance of the model with climate data at the early growth stage was largely attributed to the spatial pattern of yield captured by climate factors. The models with NDVI at the early growth stage had poor performance because NDVI was disturbed by the soil background when crop grows sparsely. As the growing season goes on, NDVI can provide more crop growth information, which further improved the performance of the models. (a)–(b) show the temporal patterns of XGB (a) and Stacking (b) performance throughout the winter wheat growing season. Curves with different colors represent the performance of the model using specific time-sequential data: green for NDVI, blue for Climate, and orange for NDVI combined with Climate. (c)–(d) represent the unique benefits of NDVI and climate and the benefits of their overlap. Blue represents the unique contribution of climate, calculated from the orange line minus the green line. Green represents the unique contribution of NDVI, calculated from the orange line minus the blue line. Gray represents the overlapping contribution of NDVI and climate, calculated from the green line plus the blue line minus the orange line. For interpretation of the references to colours in this figure legend, refer to the online version of this article.
Figures 9(c) and (d) further shows that, as the growing season progresses, the unique contribution of climate data increased first and decreased after peaking in the middle of crop growth, while the unique contribution of NDVI was almost the opposite. The overlapping contribution of climate and NDVI had been increasing throughout the growing season. This result suggested that as the growing season progresses, the information contained in climate data was gradually captured by NDVI.
Summary of model’s accuracy 2 months before winter wheat harvest. NDVI and climate data from October to April, as well as soil and elevation data, were used as model inputs.
XGB: extreme gradient boosting, RF: random forest, SVR: support vector regression, kNN: k-nearest neighbors, ET: extra trees, GB: gradient boosted decision trees, GP: Gaussian process Regression, NDVI: normalized difference vegetation index, RMSE: root mean square error, MAPE: mean absolute percentage error.
4.4. Relative importance of predictor variables
The RF can provide the relative importance of variables according to the impact of each variable on the performance of the models (Were et al., 2015). The importance values of all variables are presented in Figure 10. The six types of variables had different effects on the yield estimation, and their importance ranking was: NDVI, Prec, Temp, NDVI metrics, elevation, and soil properties (Figure 10(a)). The relative importance of NDVI, Prec, and Temp in different months during the growing season is shown in Figure 10(b). NDVI in May had a considerable influence on the prediction, and winter wheat was in the flowering-filling period at this stage. In addition, the temperature in December and January and the precipitation in March were also important predictor variables. To facilitate display and comparison, the importance values of the NDVI metrics were rescaled to 0–1 and shown in Figure 10(c). Among the top three metrics of importance, Mean represents the growth of winter wheat from green-up to maturity, Max and Smin are related to the conditions of crops from heading to filling. Importance ranking of six types of variables for yield estimation (a), the stacked plot of the importance of three time-sequential variables in winter wheat growing season (b), and importance ranking of NDVI metrics (c). The importance values of all variables were scaled to 0–1. To facilitate display and comparison, the importance values of the NDVI metrics were rescaled to 0–1. Prec and Temp refer to monthly precipitation rate and temperature, respectively. For interpretation of the references to colours in this figure legend, refer to the online version of this article.
5. Discussion
This study aimed to develop a machine learning algorithm driven by satellite, climate, and soil properties data to estimate the county-level winter wheat yield in the main winter wheat growing areas of China. The prediction performance of the seven machine learning algorithms and one ensemble learning model had been analyzed above. The results showed that XGB outperformed other base learners and achieved high prediction accuracy, with an R2 of 0.85, an RMSE of 494 kg/ha, and a MAPE of 7.74%. Generally speaking, it is difficult for a single machine learning model to maintain stable and good prediction performance under any environmental conditions. For example, previous research reported that the performance of XGB was worse than that of RF in predicting the maize yield at the city level in China, which is contrary to the result of this study (Chen et al., 2021). It emphasized that the performance of models depends on many factors such as growing conditions, predictor variables, and crop varieties. The different characteristics of variables made it difficult for the models to maintain good predictive performance in the case of different variables as inputs (Feng et al., 2020a; Yoosefzadeh-Najafabadi et al., 2020). It might be attributed to the biased nature of models and comprehensive relationships among variables (Hernandez et al., 2015; Lazaridis et al., 2011). Therefore, it is important to develop a method that can synthesize the advantages of multiple models and compensate for the weaknesses of a single model to improve the robustness of the model.
We successfully combined seven base learners through the Stacking method and achieved a higher prediction accuracy than a single base learner (R2 = 0.86, RMSE = 477 kg/ha, and MAPE = 7.48%). Ensemble also improved the generalization ability of a single model. Stacking showed the best accuracy in all test subsets, which was of great significance to the practical application of the model (Figure S1). With fewer variables as input, the advantage of Stacking’s predictive performance was more obvious. Compared with XGB, its R2 had increased by 0.04 at most, which means that it can more effectively obtain limited information of the data (Figure 8). Similarly, Feng et al. (2020a) used the Stacking regression method to integrate UAV-based hyperspectral images to predict alfalfa yield in the United States, and concluded that Stacking outperformed all base learners. In addition, Figure 5 represents the improvement potential of Stacking ensemble learning: a combination of more heterogeneous base learners is used to increase the diversity of the model and further improve the predictive ability of the ensemble. In general, compared with some other methods in the past, our model had improved in the estimation of winter wheat yield at county level. For example, Li et al. (2017) assimilated the remotely-sensed leaf area index value into the wheat model to calculate the yield at county level and explained 67% of the yield variability. Based on a crop model that assimilated crop phenology and leaf area index, Chen et al. (2018) realized the prediction accuracy of winter wheat yield with an R2 of 0.42 and RMSE of 737 kg/ha. Therefore, machine learning algorithms based on multi-source data seem to be an effective approach to develop models that can be used for large-scale regional yield predictions. Compared with traditional crop models, the advantages of machine learning may be attributed to their powerful mining capability for large amounts of data and lack of assumptions about potential data information (Jiang et al., 2004; Uno et al., 2005). Furthermore, Han et al. (2020) used multi-source data to estimate winter wheat yield at county level in China based on six machine learning algorithms and two ensemble learning methods, with an R2 of 0.81 and an RMSE of 750 kg/ha. However, their ensembles were weaker than the best three base learners. Cai et al. (2019) chose three machine learning algorithms and a linear regression model to predict Australian wheat yield using climate and satellite data and obtained a estimation accuracy of R2 = 0.75. By comparison, we increased the input of soil properties to improve yield prediction.
To study the impact of multi-source data on the model’s accuracy, the combinations of different variables were designed and applied to the model respectively. The prediction performance of nine input combinations was evaluated by XGB and Stacking (Figure 8). Among single input sources, the model with climate variables outperformed the model with NDVI. This result was consistent with the research of Cao et al. (2020). It turned out that NDVI metrics did not provide any additional information beyond the combination of NDVI and climate, and its performance was limited for a variety of reasons. The first is the rough spatial resolution (1 km) of NDVI, which resulted in the interference of data quality by mixed pixel effect. And the time resolution of the 16-days interval may ignore some crop growth information (Kamir et al., 2020). Therefore, it is one of the directions to improve the metrics extraction by relying on the vegetation index with higher resolution. Besides, the winter wheat farmland in the study area usually shows complex growth trajectories (e.g., double seasons), and the noise in the NDVI time series may cause high fitting errors (Cao et al., 2018). We noticed that the addition of soil properties and elevation data effectively improved the prediction performance of the model (Figure 8). It proved the hypothesis of some previous studies (Guan et al., 2017). The potential reason why these factors explained more yield variability may be that they are closely related to the spatial distribution of wheat yield. Ajami et al. (2020) established a regression model to prove that topographic features (e.g., elevation) played an important role in the spatial variation of wheat yield by affecting soil quality. In addition, soil properties can affect crop yield by affecting crop response to climate change. For example, the study of He et al. (2013) showed that soil texture can affect the ability of soil to retain water, such that the yield of wheat growing in clay soil was higher than that in silt loam when water was insufficient. Therefore, soil properties and elevation may provide unique contributions to the model to capture the non-linear relationship between yield and related factors. Time-series variables such as climate and satellite data have always been important data sources for yield estimation (Newlands et al., 2014). We found that, during the entire growth period of winter wheat, the benefit from cumulative climate data was always greater than that of NDVI, especially in the early growing season (Figure 9). The studies of Wang et al. (2014) and Tao et al. (2012) indicated that the impact of climate factors on crop yields had a clear spatial pattern. Since climate factors maintained the spatial gradient to a large extent during the growing season, the climate information in the early season may be able to better explain the spatial variability of yield. In the early stage of crop growth, the relationship between NDVI and yield was weak. With crop growth, cumulative NDVI could provide more information related to crop growth, and there are parts that cannot be absorbed by cumulative climate factors (Figures 9(c) and (d)).
The importance of variables in predicting yields can be quantified by RF. NDVI and two climate variables (precipitation and temperature) were the most important variables for yield prediction, which was also emphasized in previous studies (Johnson et al., 2016; Tack et al., 2015). We found that NDVI and climate had different importance at different periods of the growing season. For example, the importance value of NDVI in May is much higher than that in other months. This result was supported by previous studies, which showed there was a significant correlation between NDVI and yield from wheat heading to filling (Prasad et al., 2007; Royo et al., 2010). In addition, Max and Smin also provided information on the period from heading to filling. Mean reflects the conditions from turning green to maturity, which is related to the total amount of aboveground dry matter production (Kamir et al., 2020). The temperature in December and January are important predictive variables of yield. On the one hand, the winter wheat seedlings at this stage face the stress of cold injury and frost injury; on the other hand, the temperature change may affect the vernalization of winter wheat (Tao et al., 2017). Winter wheat in March was around jointing, which was the critical stage of water demand (Fan et al., 2019). Therefore, the precipitation at this stage was also a relatively important forecasting variable of yield.
There are still some shortcomings in this study, which need to be improved in future research. Some of the limitations come from the uncertainty of data quality, such as the low spatial resolution of multi-source data, which might impair the predictive performance of machine learning (Taylor et al., 2007). The key to solving this problem is to produce more accurate crop classification as a mask for high-resolution data. Monthly NDVI and climate data may ignore some crop growth and weather information, and higher temporal resolution data can more accurately reflect crop productivity (Waldner et al., 2019). Due to the influence of soil background, NDVI tends to be saturated at high canopy density, so it will not further explain the difference in biomass (Schwalbert et al., 2020). Compared with NDVI, EVI is more sensitive to capture changes in high biomass periods, and the combination of the two can improve yield prediction (Zhong et al., 2016). In addition to the lack of data of masking planting areas in the current season, the constraint to achieve within-season yield prediction is that S-G filter requires the NDVI time series of the whole growing season, which hinders the dynamic extraction of NDVI indicators. This problem can be overcome by changing the smoothing and extraction method. Secondly, we realized that more factors related to yield can be considered to provide more important information for model prediction, so as to further improving prediction accuracy, such as irrigation information (Cao et al., 2021b; Qiu et al., 2008; Sun et al., 2006) or extreme climate indices (Feng et al., 2020b; Wu et al., 2020). In addition, because the machine learning method is a “black box,” it does not include the physiological mechanism of crop growth. The crop model represents the inherent mechanism of crop growth and development to a certain extent. Therefore, the combination of machine learning and crop growth model can provide biological testable assumptions and might be a better method for future yield prediction research.
6. Conclusion
Accurate prediction of wheat yield is critical to regional and global food supplies. The main task of this study is to predict winter wheat yield at county level in the main winter wheat growing areas of China from 2001 to 2015 using machine learning based on climate (precipitation and temperature), NDVI time series, soil, and elevation data. We developed an ensemble learning method to integrate multiple base learners, which successfully improved the model’s performance in predicting yield. In the comparison of seven base learners and one ensemble, Stacking outperformed all the base learners and achieved the best prediction accuracy (R2 = 0.86, RMSE = 477 kg/ha, and MAPE = 7.48%). The correlation coefficient between the prediction results of each model indicated that the combination of different base learners with low correlation may further improve the predictive skill of the ensemble. The results suggested that increasing the input of soil properties and elevation data effectively improved the prediction performance of the model and indicated that they provided additional information for predicting yield. As the growing season goes on, the unique contribution of accumulated climate information to wheat yield prediction was always more than that of NDVI, especially in the early growing season. After analyzing the ability of models to make a within-season prediction, we found that satisfactory accuracy can be achieved about 2 months before the harvest. Furthermore, the relative importance of the variables reflected the NDVI in May, the temperatures in December and January, and the precipitation in March are important predictors of wheat yield. In general, we established an ensemble learning framework combined with multi-source data, which can be used for large-scale yield forecasting.
Supplemental Material
Supplemental Material - Prediction of winter wheat yield at county level in China using ensemble learning
Supplemental Material forPrediction of winter wheat yield at county level in China using ensemble learning by Yuefan Zhang, Lunche Wang, Xinxin Chen, Yuting Liu, Shaoqiang Wang, and Lizhe Wang in Progress in Physical Geography: Earth and Environment.
Footnotes
Declaration of Conflicting Interests
The author(s) declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.
Funding
The author(s) received no financial support for the research, authorship, and/or publication of this article.
Supplemental Material
Supplemental material for this article is available online.
References
Supplementary Material
Please find the following supplemental material available below.
For Open Access articles published under a Creative Commons License, all supplemental material carries the same license as the article it is associated with.
For non-Open Access articles published, all supplemental material carries a non-exclusive license, and permission requests for re-use of supplemental material or any part of supplemental material shall be sent directly to the copyright owner as specified in the copyright notice associated with the article.
