Abstract
The potential effects of climate change on the environment and society are many. In order to effectively quantify the uncertainty associated with these effects, highly complex simulation models are run with detailed representations of ecosystem processes. These models are computationally expensive and can involve a computer run of several days. Computationally cheaper models can be obtained from large ensembles of simulations using statistical emulation. The purpose of this article is to construct a cheaper computational model (emulator) from simulations of the Lund-Potsdam-Jena managed Land (LPJmL), which is a dynamic global vegetation and crop model. This article focuses on statistical emulation of potential crop yields from LPJmL and an emulator is constructed using a combination of ordinary least squares, principal component analysis and weighted least squares methods. For five climate models, under cross-validation, the percentage of variance explained ranges from 60 to 88% for the rainfed crops and 62 to 93% for the irrigated crops. The emulator can be used to predict potential crop yield change under any future climate scenarios and management options.
Introduction
The world population is projected to increase by 35% by the middle of this century (UNFPA, 2010). This will cause a rise in demand for major food crops that will require a considerable increase in crop production. Climate change, food insecurity and how to effectively feed over 9 billion people by mid-century are major problems threatening humanity (Smith and Gregory, 2013). Rising temperatures and CO2 levels and changing precipitation patterns will affect crop production (Parry et al., 2004). The IPCC Fifth Assessment Synthesis Report (Section SPM2.3) notes that ‘Global temperature increases of
The relationships between crop yields, weather and climate have attracted considerable attention. Many authors have applied empirical methods. Simulated and historical data were explored to assess the global climatic impact on crop yields in Lobell and Burke (2008, 2010). An earlier study by Kart (1979) examined the relationship between crop and weather using a ridge regression approach. Reddy and Pachepsky (2000) estimated changes in crop yield from monthly weather projections of climate variables. Wallach (2011) extended the Kart (1979) approach by estimating wheat production using multiple regression. Bornn and Zidek (2012) evaluated wheat yields using a Bayesian method, examining the significance of incorporating spatial information in crop-yield modelling. Schlenker and Roberts (2006) investigated the effect of change in average weather on crop yield, focusing especially on the non-linear effect of temperature on growing season. Matis et al. (1989) applied a non-parametric Markov chain approach to crop yield prediction while Kim et al. (2005) derived a Bayesian bootstrap method to derive the posterior distribution of the parameters rather than the distribution of a sample statistic. Lobell et al. (2006) developed a non-linear model to relate crop yields to weather and climate, finding a significant non-linear relationship between temperature and yields. These statistical approaches were all applied at relatively small spatial scales (typically country level or smaller), greatly simplifying the problem compared to the global scale analysis we perform here. Bornn and Zidek (2012) note that forecasting wheat yields across the Canadian prairies is in itself a challenging task due to substantial spatial variability.
The other general approach is the application of process-based models (Bondeau et al., 2007; Fader et al., 2010) that simulate detailed physical and biological processes. Leemans and Solomon (1993) implemented a water balance model within a geographic information system (GIS) that integrates databases, while Fischer et al. (2001) extended the approach with the Global Agro-Ecological Zones (GAEZ) software. Similarly, Rosenzweig et al. (2014) used a process-based approach to evaluate the global consequences of various climate change scenarios on crop productivity. Recently, Muller and Robertson (2014) compared the performance of two global crop models—the LPJmL of Bondeau et al. (2007) and Decision Support System for Agrotechnology Transfer (DSSAT) of Jones et al. (2003).
Future projections encompass many uncertainties. These include uncertain future scenarios for greenhouse gases and other emissions, uncertain climate projections for a given scenario, with different global climate models (GCMs) projecting different climates for the same scenario, and uncertain crop yields in response to a given climate change projection. A significant source of crop yield uncertainty results from the poorly quantified ‘
Here we integrate the empirical and process-based approaches with the statistical emulation of an ensemble of simulations of the process-based LPJmL model. Computational speed is highly problematic for coupling complex models together. Emulation is a tool for simplification of models that leads to reduced-form representations of complex models that are computationally much faster and hence easier to couple to other models. Emulation offers a rapid alternative for the projection of crop productivity under diverse climate scenarios, but at the same time it allows us to capture important relationships between crop yield and climate, enabling realistic prediction of responses to climatic change. Emulation would further facilitate other calculations (e.g., sensitivity analysis) that would not be practical using the LPJmL model directly.
The approach uses a combination of ordinary least squares (OLS), principal component analysis (PCA) and weighted least squares (WLS) regression to model the potential crop yields that are simulated by LPJmL. We use a two-stage method. The first stage applies a least squares regression analysis similar to Holden et al. (2010) to model change in potential crop yields as a function of climate change and other relevant covariates. The second stage uses a novel combination of PCA with a WLS regression for interpolation of residual variation that is left unexplained by the regression equation. This has similarities to work of Higdon et al. (2008), who combine Gaussian process (GP) emulation with a basis representation (such as principal components) for calibration of computer models with high dimensional output. Our approach is also related to the technique suggested in O'Hagan (2006), which involves combining a regression with a GP emulation of the regression's residuals for improving emulator performance.
The emulators provide high-resolution spatial output fields for five major crop types. Emulated projections are provided for both irrigated and rainfed crops, and for variable degrees of crop management intensity. They also address the uncertainty due to the CO2 fertilization effect.
In Section 2 we describe the models and simulation set-up. In Section 3 we describe the methods and implementation of the emulator algorithm. Section 4 reports the results of the emulations. Section 5 discusses the results and gives concluding comments.
Models used for the analysis
We built the crop yield emulators using simulation output from the following models: LPJmL, Model for the Assessment of Greenhouse Gas Induced Climate Change (MAGICC) and Spatial Climate Generator (ClimGen). MAGICC is a simple carbon cycle climate model that simulates greenhouse gas (GHG) cycles, radiative forcing and ice melt. The gas cycle uses standard formulae to convert surface emissions of gases to atmospheric concentrations and these, in turn, are then converted to radiative forcing. The generated radiative forcing is then used to drive a diffusive energy balancing model to estimate global climate change. MAGICC6 is a new version of MAGICC (Meinshausen et al., 2011) and is able to simulate global mean temperature (GMT) trajectories based on the emulation of the 18 Atmospheric Ocean General Circulation Models (AOGCMs) used in Solomon (2007) for the Fourth Intergovernmental Panel on Climate Change (IPCC) assessment report.
ClimGen is a spatial climate scenario generator. Emulated global mean temperature trajectories of a particular GCM from MAGICC6 are used to drive the ClimGen model. ClimGen uses a pattern-scaling method and produces spatial climate change information for a given global mean temperature change. The method is based on the assumption that the pattern of climate change simulated by the coupled AOGCMs is relatively constant but the amplitude changes. These normalized patterns of climate change usually show considerable variation between different AOGCMs, and it is this variation that ClimGen is designed to explore (Osborn, (2009).
The LPJmL model of Bondeau et al. (2007) is a dynamic global vegetation model (Sitch et al., 2003) enhanced to represent global agriculture in addition to natural vegetation. It uses eco-physiological relations and plant trait parameters for the estimation of photosynthesis, plant growth, maintenance and regeneration loss, fire disturbance, soil moisture, runoff, evapo-transpiration, irrigation and vegetation structure. Agricultural productivity is simulated through varieties of crop functional types (CFTs), both rainfed and irrigated. The LPJmL model takes as inputs climate variables such as precipitation, temperature and insolation, which are then disaggregated to quasi-daily values by a weather generator. The monthly input and output data are spatially explicit time series of about 60 000 global
In summary, MAGICC6 generates and provides future trajectories of global mean temperature that are used by ClimGen. ClimGen emulates the spatial response patterns by disaggregating the temperature trajectories to
LPJmL simulation
LPJmL was run on seven climate change patterns, namely, Canadian Centre for Climate Modelling and Analysis Coupled Global Climate Model (CCCMA-CGCM31), Center for Climate System Research and Model for Interdisciplinary Research on Climate (CCSR-MIROC32HI), CCSR-MIROC32MED and Hadley Centre Global Environmental Model, Met Office United Kingdom (UKMO-HADGEM1), Goddard Institute for Space Studies (GISS-MODELEH), GISS-MODELER, and Institut Pierre Simon Laplace (IPSL-CM4) generated using ClimGen, which used trajectories of global mean temperature constructed by MAGICC6. In the MAGICC6 model, the forcing pathways of all four Representative Concentration Pathways (RCPs) were used. These RCPs are widely used in climate research. They describe alternative possible future emission scenarios and are designed to standardize climate model simulations for inter-comparison. The pathways are characterized by either long-term stabilization within the period 2100–50 (RCPs 4.5 and 6.0), progressively increasing forcing by 2100 (RCP 8.5), or decreasing forcing by 2100 and beyond (RCP 2.6, also named RCP-3PD) (Moss et al., 2010).
The simulations involve a spin-up stage that was used to equilibrate the long-term carbon stores (in natural and agricultural ecosystems) by repeating the observed climate (1901–30 period) 33 times, immediately followed by an additional 13 repetitions in order to incorporate land-use change for reconstruction of historical soil carbon pools Fader et al. (2010). Then simulations followed for the period 1931–2000, with transient climate and land-use change data. The scenario period (2001–2100) for the different RCPs and GCM patterns started from year 2001. Land-use change pattern and irrigation were held constant at their year 2000 values. Changes in the potential crop yields were simulated for each of the five crop types under both rainfed and irrigated conditions. Results were calculated on a global
Methods
A procedure for statistical emulation
In addition to the simulations described in 2.1, we also examined seven different crop management levels for each scenario, and performed simulations with and without the CO2 fertilization effect. The CO2 fertilization effect is the rise in crop yield as a result of elevated CO2 in the atmosphere. In calibrating crop management, the Leaf Area Index (LAI) is a key parameter. LAI is the ratio of total upper leaf surface of vegetation divided by the surface area of the land on which it grows. Crop management levels are represented by maximum leaf area index LAI
The LPJmL crop yield data were based on simulations for 59 199 grid cells on
The average decadal yield given by LPJmL from 2005 to 2095 was computed for each crop in each grid cell. We then obtained the change in yields relative to the baseline decadal average in 2005–14. We calculated the change in seasonal climate variables for the 37 input variables listed in Table 1.
The emulator's input variables.
The emulator's input variables.
The emulators were constructed in two stages. The first stage uses an OLS regression method to fit a linear model that explains much of the variation between the response and input variables. The second stage combines PCA with a WLS regression to explain some of the residual variation that is left unexplained by the first stage. We built a single emulator for the two CO2 fertilization levels (‘on’ and ‘off’), but treated irrigated and rainfed crops separately. This allows the emulator to be flexible in predicting yield changes for any level of CO2. Figure 1 shows the average yields of cereal (upper plots) and rice (lower plots) in the baseline period (2005–14) given by LPJmL. The plots are for a moderate pathway (RCP 6), the climate model CCSR-MIROC32HI, with CO2 fertilization ‘on’ and management level 6. The crop is not grown in areas in white, such as most of America, Africa and Australia for irrigated rice. Corresponding plots for maize, oil and groundnut are given in Supplementary Material S1.

The emulators were built from two GCMs, CCCMA-CGCM31 and CCSR-MIROC32HI, four RCPs and two simulation categories (with and without CO2 fertilization effect), giving 16 (
We used all valid grid points for irrigated oil and groundnut emulators because each of these crops has less than 5000 non-zero observations in each scenario. For other crops, in the first stage we used observations from a random sample of 5000 valid grid points because the stepwise algorithm could not fit the whole dataset. The 5000 grid points are fixed across the time slice, RCP, GCM and management levels as well as simulation categories. We sampled about 4.5 million observations on each input variable. We then fitted a single model to the sampled data for each crop yield when crops are rainfed. This procedure was carried out for each of the five crops and repeated for the case where crops are irrigated. For the second stage, all valid grid points were used.
Two climate models were used in the construction of the emulators so as to better span the possible climate input space. We should note though that using more climate models only help span ‘climate model space’, not necessarily ‘true climate input space’.
However, we also used several RCPs, which incorporate information on a broad range of emission pathways, and a minimum of 2386 grid points for each emulator. Hence, the climate values that occur across grid points in our training data cover a very wide range. This helps our emulators predict the change in crop yield given by LPJmL for a range of climate forcing scenarios that are not restricted to the RCPs and climate models (GCMs) used to construct them: the climate states that arise within each grid point should resemble the climate that has arisen somewhere in the training data. In Section 4.1 we use cross-validation to examine emulator performance, testing the emulators on five climate models that played no part in their construction.
An emulator is constructed in two stages. In stage 1 stepwise regression is used to obtain a parsimonious model for predicting crop yield change (relative to baseline) from climate and other explanatory variables. The response variable is the change in yield given by LPJmL and is denoted by
An integer with values between 1 and 7 was used to represent the LAI max parameter and this formed a factor variable in the regression analysis. The other explanatory variables can enter the regression as linear or quadratic terms. All two-way interactions were also considered for inclusion. Thus spre, spre2 and spre.wwet are examples of the potential terms in the regression model.
The emulators were built using the Revolution
Stage 2
In contrast to stage 1, here we formed a separate emulator for each combination of management level and time slice. Also, all grid points with non-zero observations are used, whereas in stage 1 we generally took a sample so as to cap their number at 5000. Rather than having multiple subscripts to indicate crop, irrigation regime, management level and time slice, we will consider just a single crop/irrigation regime/management level/time slice combination and let
Given a new vector of predictions,
The resulting
Our method in this stage is similar to a pattern scaling approach that is commonly used in climate scenario generation. Pattern scaling assumes that, given any particular point in space and time, there exists a linear relationship between climate change pattern and global mean temperature with a constant spatial pattern. Here, we allow for a more general multilinear relationship in the residual (Holden et al., 2014). The residual patterns from the OLS results in stage 1 indicated that the residual patterns are relatively similar across RCP and GCM (see Supplementary Material S2). Hence, for example, if a grid cell has a negative residual in one scenario, then that grid cell is likely to have a negative residual for other scenarios. Stage 2 exploits the similarity between the error patterns across scenarios.
Having obtained the residuals
Several distance metrics were tried: linear, square, cubic, spherical, power exponential, Gaussian and three forms of Matern metric. Details of these metrics are given in Appendix 2 where a variogram is also given that illustrates the need to take account of the distance between scenario points when modelling stage 1 prediction error. Performances of the metrics under validation data were compared. We chose a squared distance method scaled by their eigenvalues because it is amongst the best metrics in terms of the proportion of variance it explained and it is the simplest of the better methods. The weights are chosen to be inversely proportional to the squared distance and the same weighting scheme is applied across all grid points. The weights form the non-zero elements of a
A separate weighted regression is performed for each grid cell. For the
The following is a summary of the calculations in stage 2 (see Appendix 1 for details).
Perform a PC analysis of Denote the The explanatory variables for the WLS regression are constructed from the first four eigenvectors of Weighted least squares gives The estimated error for the In order to avoid unusual values and prevent too much extrapolation from known residuals, we compare
Our methods described above are simple and flexible to apply. Diagnostic plots of the residuals from WLS regressions are reported in Supplementary Material S3.
A Gaussian process model could not be applied directly to our data because of the computational difficulty from the large sample size coupled with the large number of parameters to be estimated. GP scales cubically with the number of observations
We assessed the performance of the emulator using five climate models that had not been used to construct the emulator, taking the proportion of variance (
We compute the squared differences between the actual LPJmL values and
Calculating the total effects of each explanatory variable helps identify the relative importance of variables in a model, that is, the contribution of each variable to the total variance. We use the Sobol global sensitivity method. This computes indices by decomposing the variance up to a specified order. The method we used computes the first order and total indices. Suppose our model is represented by
Emulator predictions
Figure 2 gives density plots over grid cells for the percentage change in crop yield (2084–95) relative to the baseline yield for both CCSR-MIROC32HI and CCMA-CGCM31 GCMs. We consider all management levels and a moderate emission scenario, RCP 6. The right-hand plots show the change in crop yield when there is no CO2 fertilization while the left-hand plots show change in yield with CO2 fertilization. In the right-hand plots, most crops show a preponderance of negative values, indicating a general reduction in yield. The exception is oil, which mainly shows positive values. The distributions each have a single major peak but vary as to whether they have a further minor peak (or even several minor peaks). The skewness of the density function also varies markedly with crop.
Now considering the left-hand plots (with CO2 fertilization), the density plots show a preponderance of positive values and are more diverse in pattern, which could be a result of non-linear interactions between climate and CO2. Comparison of the left and right-hand plots shows that CO2 fertilization has a marked effect on all crops except maize, for which the two plots are strikingly similar. (Maize is less affected by CO2 fertilization because it is a C4 plant and has a mechanism to efficiently transport CO2 to the photosynthetic parts, limiting photorespiration rate thereby reducing water losses.) Overall, the varying patterns in Figure 2 clearly show the diversity of the effects we are emulating. The changes in crop yields are characterized with high variability and there are varying patterns across different scenarios.


Figure 3 gives time series plots showing the percentage change in global crop yield over each decade relative to the baseline period under the CCSR-MIROC32HI, either without CO2 fertilization (right-hand plots) or with CO2 fertilization (left-hand plots). It shows temporal variability of the rainfed crops under the RCP 6 scenario with a separate line for each management level. The sensitivity of change in yield due to the CO2 fertilization effect is apparent, with CO2 fertilization greatly improving the change in yield of most crops. Taking management level 4 and the last decade (2085–94) as an example, for cereal a decline in yield of 6% becomes a growth of 18%; rice and oil show improvements of 37% and 25%, respectively, while groundnut has an increase of 32%. Maize exhibits a weak sensitivity to CO2, with the globally averaged yield increasing by 11%.
When there is no CO2 fertilization, with all management levels there is a fairly steady reduction in yield for cereal, rice, maize and groundnut while oil shows an increase in yield. When there is CO2 fertilization, maize and groundnut yield still change comparatively little over the decades and the effect of random variation is more apparent in their time series plots.
Cross-validated proportion of variance
1Oil=yieldmax[soyabean, rapeseed, sunflower].
Table 2 summarizes the cross-validated performance of the emulators for stage 1 and stage 2 using equation 3.2, for both rainfed and irrigated crops under UKMO-HADGEM1. Maize and oil cross-validated noticeably better than other crops with 74%/73% variance explained when rainfed and 80%/81% explained when irrigated. Generally, the emulator of the irrigated crops performed better than the emulator of the rainfed crops. This could be attributed to the water stress in rainfed locations—difficult to model—that could complicate the predictions. Stage 1 explained less than 50% of the variance except for rainfed oil. However, the second stage of the algorithm improved results for both the rainfed and irrigated crop systems. For the rainfed crops, the values of variance explained increased from 24–51% to 60–74% for all the crops. For the irrigated crops, the first stage explained variance was between 37 and 49% for all the crops and this increased to 64–81%, as shown in Table 2.
The last two columns of Table 2 show the computed RMSE
Cross-validated proportion of variance
1Oil=yield
The cross-validation of stage 2 predictions for four additional GCMs are shown in Table 3. We can see that the emulators again performed well, with results that are typically a little better than in Table 2. The results for CCSR-MIROCMED are better than for other GCMs; this was to be expected because a very similar GCM, CCSR-MIROC32HI, gave part of the training data. Results for irrigated crops are generally better than for rainfed crops, which was also the case in Table 2, though the results with GISS-MODELER for cereal, maize and groundnut are an exception.
We now consider the spatial comparison between LPJmL and the emulators of UKMO-HADGEM1 for temperate cereal. Figure 4 displays the change in yield between the baseline period (2005–14) and the decade 2085–94. The left-hand and right-hand plots show results for LPJmL and the emulator, respectively, with results for rainfed temperate cereal in the upper two plots and those for the irrigated crop in the lower two plots. The emulator under-predicts yield change across the United States for rainfed cereal and over-predicts yield of the irrigated crop in some regions, especially in eastern Asia and Europe. Overall, the emulator reproduces the global patterns well, especially for the irrigated crop (

Figure 5 shows results for the rice emulator. The emulator for the rainfed crop under-predicts the rice almost everywhere except in eastern Asia where it reproduces the pattern quite well. On the other hand, the emulator for the irrigated rice reproduces the yield better than the rainfed (

Figures for other crops are presented in Supplementary Material S4. Overall, rainfed crop patterns are quite different from the irrigated, as expected, because irrigation allows some crops to be grown where they would not have grown naturally (e.g., rice is grown in Europe with irrigation). Also, more negative changes are prominent in both the LPJmL and emulator predictions for rainfed rice than for other crops. We can clearly see that the emulators cross-validated well as indicated by their
Here, we investigate how the uncertainty in the crop yield can be partitioned to the various uncertainties in the input variables. We sampled 20 000 observations directly from the simulation with the CO2 effect for each of the 37 input variables in Table 1. We computed both the first order (results are not shown) and total sensitivity indices, as described in Section 3.3. Bootstrapping was used to compute 95% confidence intervals on the estimated indices. This procedure was applied to all the five crops for both rainfed and irrigated crops. The ‘sensitivity’ package in R (2013) was used for this analysis. Total sensitivity results are shown in Figure 6 for rainfed crops. Results for irrigated crops are given in Supplementary Material S5.

The six most relevant parameters for prediction of temperate cereal yields are initial winter cloud cover, CO2 change, initial spring wetday frequency, initial spring temperature (‘initial’ represents baseline value), latitude and LAI, with indices of 0.43, 0.32, 0.30, 0.27, 0.23 and 0.23, respectively. The parameter with the second highest rank is CO2 change, reflecting the sensitivity of yield to the CO2 fertilization effect. Atmospheric CO2 improves crop productivity by stimulating photosynthesis, thus increasing the number of fruits, seeds etc. The fourth most sensitive parameter is temperature, which generally has a considerable effect on plant growth.
Initial summer temperature, latitude, CO2 change, LAI and initial spring precipitation are the five most important variables for rice. Their indices are 0.86, 0.42, 0.23, 0.18 and 0.18, making it very clear that summer temperature is the most important parameter for rice. It is twice as influential as latitude (although temperature and latitude are obviously correlated) and four times more influential than CO2 change.
Rice is a C
The most relevant variables for maize are initial spring, autumn and summer temperature, as well as initial winter cloud cover and LAI. Their indices are 1.1, 0.25, 0.22, 0.19 and 0.14, respectively. It is unsurprising that, as these results show, seasonal temperatures play a key role in the growth and development of maize plants. Maize is less affected by CO2 fertilization because it is a C
Overall, we can see from Figure 6 that baseline seasonal temperature, spring precipitation and wetday frequency, LAI, latitude, CO2 change and cloud cover are the most important variables across the five crops. Although CO2 change has a negligible effect on maize, groundnut and oil, our results further support the joint interactive effect of elevated CO2 and temperature on crop yield. Some variables are less important (examples are change in seasonal precipitation and cloud cover) and some are unimportant (such as summer precipitation and soil) because their calculated sensitivity indices are very low. Baseline CO2 is of limited importance in this analysis because its level is represented by a single global value and does not vary with time slice, RCP or GCM. Similarly, soil is represented in this analysis by discrete values that range from 1 to 8 across the globe and these are constant across RCP and GCM scenarios.
In general, the sensitivity analysis clearly indicates that temperature is the most important parameter in crop yield projection. This high sensitivity of (growing season) temperature was earlier observed as the major determinant of crop yield change under future climate by Lobell and Burke (2008) and Osborne et al. (2013).
This article has addressed the joint emulation of the impact of climate change, CO2 fertilization effect and crop management levels on global crop yields. We have described a procedure for constructing an emulator for LPJmL simulations of potential crop yield for cereal, rice, maize, groundnut and oil crop functional types. Two emulators were built for each crop, one for the rainfed crop and the other for its yield under irrigation. Each emulator was constructed using a two-stage process. The first stage used OLS to fit crop yield as a smooth function of climate variables under the assumption that each spatial point is an independent sample. The second stage involves interpolation of the spatial residuals of unknown scenarios from known scenarios (an approach similar to pattern scaling) using a combination of WLS and PCA. We made similar assumptions to those of GP emulators (O'Hagan, 2006). In particular, it is assumed that the emulator is a smooth and continuous function of its input variables. The second stage indeed improves the overall predictions.
We used cross-validation to test the accuracy of the emulator and performed sensitivity analysis for each crop response. LPJmL uses daily time-steps as crop yields respond to daily variability. Here, we have chosen to use only seasonal mean climatic variables as input, as these are readily available input data. With these inputs, under cross-validation the emulators explained 62–93% of the variance for irrigated crops and 60–88% of the variance for rainfed crops. Sensitivity analysis indicated that the predicted yield of rainfed crops depends most heavily on baseline seasonal temperature, CO2 change, latitude and LAI. Irrigated crops are dominantly sensitive to temperature and less dependent on precipitation, as expected. We provided spatial plots to visually compare the performance of the emulators with the LPJmL simulations.
LPJmL crop data are characterized by a large proportion of zero observations. Grid points where particular crops are not currently grown are represented as zeros in the simulation. A censored regression approach (Moore et al., 2000) was also used to model these data by treating the zero observations as censored observations (results not shown). However, this method was not helpful because of the large dataset we have, coupled with the fact that the censoring algorithm takes a longer computation time, even after reducing the data size from
Nevertheless, for the first stage we analyzed a small sample of this data with censored regression but the results were not better than OLS. Using either non-linear regression or a Gaussian process emulator would be challenging because of the large number of parameters to be estimated with samples from many scenarios (time slice, RCP, GCM, management levels, irrigation regime).
We might also have used a dynamic emulation by emulating each grid cell individually as a function of time. Rather, our choice of OLS is driven by its simplicity.
Our emulation approach extends the work of Lobell and Burke (2008, 2010), which models temporal and spatial variation to predict future crop yields from climate variables and performs sensitivity analyses to examine the importance of temperature and precipitation on future yields. However, unlike (Lobell and Burke 2008, 2010), our predictions were not based solely on climate variables but also incorporated soil, latitude, crop management levels and other covariates. Our analyses and aims are also broader; Lobell and Burke (2010) work with just 94 crop–region combinations and only examine temperature and precipitation while Lobell and Burke (2008) work solely with the yield of maize in 200 sites in Sub-Saharan Africa. Our emulators provide estimates of projected change in crop yields at any level of CO2 emission on high spatial gridded resolutions for five different crops. We also performed rigorous and extensive cross-validation for several climate models and RCPs, and our global sensitivity analysis measures the individual contribution of a number of different variables to overall uncertainty. In agreement with Lobell and Burke (2008), the clearest result from our sensitivity results is that temperature is the dominant source of uncertainty in future impacts assessment. The sensitivity analysis used a variance-based decomposition technique that handles correlated variables, and the contributions of other variables that it quantifies have not previously been reported.
The emulators reduced the LPJmL model down to a two-stage process that is capable of predicting global crop yields of different crop functional types, and the spatial distribution and temporal dynamics of these yields in response to a changing climate. These emulators are much faster to run compared to the LPJmL model. LPJmL is computationally expensive to run, while these emulators give results almost instantaneously. It takes about 8–10 minutes to simulate eight decadal changes of crop yield data (on
Footnotes
Calculations in stage 2
Predictions given by stage 1 for the 16 training scenarios form the
The PC transformation from
In forming regression equations, we use only the first four PCs; for every crop/irrigation regime/management level/time slice combination, these explained at least 95% of the variation in
Let
Let
If any distance
For non-zero
The weighted linear regression uses
For the
which (using (A.4)) can be written as
Equation (A.5) estimates the prediction error separately for each grid cell. To combine the calculations for all grid cells into a single step, let
To avoid extrapolation, we bound the
Variograms and distance metrics
Distance weighted regression is used to estimate the residual pattern of an unknown scenario from the scenarios with known residual patterns (Section 3.1.2). More weight is assigned to known scenarios that are similar in pattern to the unknown scenario. The distance,
Three simple metrics for converting the
To obtain the empirical variogram, we split the range of
Spherical covariance model:
Power exponential covariance:
Matern covariance model:
The parameters
The plotted points in Figure 1A show that
A covariance model assigns weights to scenarios so that Cross-validated proportion of variance
1 Oil=yield
max
[soyabean, rapeseed, sunflower] 2 ‘irrig’ denotes irrigated crop. 3‘Power expo’ denotes Power exponential.
Function
Cereal
Rice
Maize
Oil
1
Groundnut
Rainfed
Irrig
2
Rainfed
Irrig
Rainfed
Irrig
Rain
Irrig
Rainfed
Irrig
Quadratic
0.69
0.74
0.74
0.90
0.81
0.86
0.80
0.88
0.68
0.79
Matern
0.78
0.82
0.78
0.84
0.84
0.87
0.79
0.88
0.67
0.79
Spherical
0.76
0.82
0.78
0.84
0.83
0.73
0.75
0.84
0.68
0.79
Matern
0.78
0.82
0.80
0.85
0.85
0.88
0.79
0.88
0.68
0.81
Matern
0.79
0.82
0.78
0.83
0.83
0.88
0.79
0.88
0.67
0.79
Power expo
3
0.76
0.81
0.65
0.89
0.79
0.86
0.45
0.82
0.59
0.85
Gaussian
0.76
0.80
0.76
0.68
0.77
0.56
0.70
0.79
0.66
0.75
Acknowledgments
The work was supported by the European Union through FP7 ERMITAGE grant no. 265170.
