Abstract
Climate reconstructions produced using regression, with a proxy as the independent variable, are inevitably biased towards the mean, exhibit reduced variance and underestimate extremes. Scaling the mean and variance to fit those of the target climate data produces a more realistic range of reconstructed values but the cost, in terms of inflated error, is seldom assessed. We provide a simple metric that allows the loss of skill because of scaling to be quantified. It can be calculated retrospectively for published studies, some of which exhibit little or no reconstructive skill. Although scaled reconstructions must have a range that is close to that of the target climate data, there is no guarantee that the correct years are pushed to the extremes. We propose a simple non-parametric test for ‘Extreme Value Capture’ that gives the statistical significance of a given number of the correct years being ‘captured’ beyond the thresholds defined by the upper and lower 10% of the measured climate data. The methods are tested using three annually resolved case studies. A tree-growth-based summer temperature reconstruction for northern Fennoscandia captures cold summers very well, but the capture rate of the warmest summers is no better than might be expected purely because of chance. Such failure to correctly capture the warmest years has important implications for interpretation of the frequency and magnitude of very warm summers in the past.
Introduction
The standard approach to reconstructing the climate of the past, using annually resolved records such as tree rings or historical documents, is to use simple linear regression (National Research Council, 2007). The strength of the relationship between the proxy and the climatic target is quantified using correlation statistics: typically, Pearson’s correlation coefficient (r) and the squared correlation coefficient ($R^2$). The latter provides a measure of the amount of variance in the proxy that is common to the variance in the climate record. Where the climate records available for calibration are of sufficient length, it has become standard practice to test the temporal stability of the relationship by using split-period calibration and verification tests. For these tests, the data are split into two, often equal, periods; the regression is fitted on one period (calibration) and tested on the other (verification). The Reduction of Error (RE) and Coefficient of Efficiency (CE) statistics (Cook and Kairiukstis, 1990) provide a measure of how well the reconstruction fits the measured data over the verification period; in contrast to correlation, both the RE and CE tests are sensitive to offsets between the measured and reconstructed values. RE and CE values of zero occur where the average squared difference between the measured and reconstructed values (the Mean Squared Error: MSE) is the same as that obtained when a horizontal line is used as an estimate for the climate of the verification period; in the case of RE, this line represents the mean measured climate variable over the calibration period, and for CE, it is the mean of the verification period. Often, the statistics are calculated twice by switching the calibration and verification periods.
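The RE and CE statistics can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the procedure of any particular study: the function names and the example numbers are hypothetical, and both statistics are written as one minus the ratio of the squared error of the reconstruction to the squared error of a constant reference value.

```python
from statistics import fmean


def skill_against_reference(observed, reconstructed, reference):
    """1 - SSE(reconstruction) / SSE(constant reference) over the supplied years."""
    sse_rec = sum((o - r) ** 2 for o, r in zip(observed, reconstructed))
    sse_ref = sum((o - reference) ** 2 for o in observed)
    return 1.0 - sse_rec / sse_ref


def reduction_of_error(obs_verif, rec_verif, obs_calib):
    """RE: the reference line is the mean of the calibration period."""
    return skill_against_reference(obs_verif, rec_verif, fmean(obs_calib))


def coefficient_of_efficiency(obs_verif, rec_verif):
    """CE: the reference line is the mean of the verification period."""
    return skill_against_reference(obs_verif, rec_verif, fmean(obs_verif))


# Hypothetical numbers, for illustration only.
obs_calib = [10.0, 11.0, 12.0, 13.0]   # measured climate, calibration period
obs_verif = [12.0, 13.0, 14.0, 15.0]   # measured climate, verification period
rec_verif = [12.5, 12.5, 14.5, 14.5]   # reconstruction over the verification period

re_value = reduction_of_error(obs_verif, rec_verif, obs_calib)
ce_value = coefficient_of_efficiency(obs_verif, rec_verif)
print(f"RE = {re_value:.3f}, CE = {ce_value:.3f}")
```

Because the verification-period mean is the constant that minimises the squared error over the verification period, CE can never exceed RE for the same reconstruction.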
Where the RE and CE values are positive, the relationship between proxy and climate target is considered temporally stable and then the proxy data over the full period of available climate data are used to perform the regression analysis, using the proxy as the independent variable with which to reconstruct past climate (National Research Council, 2007).
Climate reconstructions produced in this way, however, suffer from a serious problem in that they inevitably underestimate the variability of past climate (Esper et al., 2005; Von Storch et al., 2004). This underestimation occurs because climate proxies are never perfectly correlated with the target climate data, leaving a proportion of the variance in the proxy unexplained. Simple linear regression uses the principle of least squares, so that the regression line that defines the relationship between the proxy and the target is fitted to minimise the squared difference between each point and the line on the vertical (climate) axis. If the correlation were perfect between proxy and climate, the regression line would pass through every point and the MSE would be zero. Any subsequent reconstructed climate would have the same variance as the measured climate over the calibration period, and we presume also for the rest of the reconstruction. If the correlation between the proxy and the target were zero, then the line that minimises the MSE would be horizontal and pass through the mean value of the measured climate. In this case, all proxy values would predict the same (mean) value for the climate target and the climate reconstruction would be a horizontal line. When, as is always the case, the correlation between the proxy and the climate target falls between zero and (positive or negative) one, the reconstruction captures some, but not all, of the variance in the target climate variable, and the weaker the correlation, the closer the reconstruction comes to being a horizontal line. The loss of variance in this kind of regression-based reconstruction can, in essence, be viewed as a bias towards the mean. 
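The size of this bias can be written down directly. As a short sketch, using symbols that are not defined in the text ($x$ for the proxy, $y$ for the measured climate, $\hat{y}$ for the regression-based reconstruction, and $s_x$, $s_y$ for their calibration-period standard deviations), the least squares reconstruction and its spread are:

```latex
\hat{y}_i = \bar{y} + r\,\frac{s_y}{s_x}\,(x_i - \bar{x})
\quad\Longrightarrow\quad
s_{\hat{y}} = |r|\,s_y ,
\qquad
\operatorname{var}(\hat{y}) = r^2 \operatorname{var}(y)
```

With $|r| = 0.8$, for instance, the reconstruction retains 80% of the standard deviation, and therefore 64% of the variance, of the target over the calibration period.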
Where the purpose of a climate reconstruction is to compare the reconstructed climate of the past with the measured climate of the recent period, or with climate model output, then a bias towards the mean and the resultant underestimation of both positive and negative extremes can be highly problematic.
To overcome the bias towards the mean, and to produce climate reconstructions that capture the full range of climate variability in the past, many authors now employ variance scaling, or variance matching (e.g. Briffa et al., 2013 and selected examples in Table 1). Rather than using the least squares regression equation to perform the reconstruction, the proxy data are re-scaled to fit the climate target data by simply giving them the same mean and variance over the calibration period. A simple way to achieve this is to first convert the proxy data into z-scores, by taking the difference between each value and the mean (of the period for which climate data are available) and dividing this by the standard deviation (also of the period with climate data). The z-scored data now have a mean of zero and standard deviation of one over the period with climate data. The z-scoring is then reversed, but rather than using the mean and standard deviation of the proxy, the values from the climate target are used. The result is a climate reconstruction that, over the period of overlap, has exactly the same mean and variance as the climate target data. Variance-scaled reconstructions can capture the full range of measured climate and, we presume, the full range of climate variability in the past. Essentially, the same procedure is used in the ‘composite plus scaling’ method used to collate and calibrate multiple proxies for use in reconstructions over very large geographical regions (e.g. Ahmed et al., 2013; Mann et al., 1999; Moberg et al., 2005; Neukom et al., 2011), and a scaling step is also applied in some spatial field reconstructions (Cook et al., 2007). There is, however, a ‘cost’ involved in that variance scaling must also increase the MSE. Regression-based reconstructions, by definition, produce the minimum MSE that is possible, so any change to the reconstruction must inevitably lead to an increased MSE.
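The two-step scaling procedure described above can be sketched as follows. This is an illustrative Python sketch rather than the code used in any of the cited studies: the function name, the `inverse` flag for negatively correlated proxies and the example numbers are all assumptions made here.

```python
from statistics import fmean, pstdev


def variance_scale(proxy_calib, climate_calib, proxy_full, inverse=False):
    """Rescale proxy values to the calibration-period mean and SD of the climate target.

    `inverse=True` flips the z-scores for a proxy that is negatively
    correlated with the target (an assumption of this sketch).
    """
    p_mean, p_sd = fmean(proxy_calib), pstdev(proxy_calib)
    c_mean, c_sd = fmean(climate_calib), pstdev(climate_calib)
    sign = -1.0 if inverse else 1.0
    # Step 1: z-score the proxy using statistics from the period with climate data.
    z_scores = [sign * (p - p_mean) / p_sd for p in proxy_full]
    # Step 2: reverse the z-scoring using the climate mean and SD instead.
    return [c_mean + c_sd * z for z in z_scores]


# Hypothetical numbers, for illustration only.
proxy_calib = [0.8, 1.1, 0.9, 1.4, 1.2, 1.0]
climate_calib = [11.2, 12.5, 11.0, 13.4, 12.9, 12.1]
proxy_full = [0.6, 1.5] + proxy_calib  # two earlier years, then the overlap period

reconstruction = variance_scale(proxy_calib, climate_calib, proxy_full)
overlap = reconstruction[2:]
print(round(fmean(overlap), 3), round(fmean(climate_calib), 3))
print(round(pstdev(overlap), 3), round(pstdev(climate_calib), 3))
```

Over the overlap period the two printed pairs agree, which is the defining property of a variance-scaled reconstruction; years outside the overlap are free to fall beyond the measured range.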
Table 1. A selection of publications that have used variance scaling to reconstruct past climate, with Pearson’s correlation coefficient (r) and squared correlation coefficient ($R^2$), and the equivalents for a variance-scaled reconstruction ($r_{vs}$ and $R^2_{vs}$).
Where variance scaling is applied to climate reconstructions, authors generally acknowledge that there is an increase in error, relative to regression-based reconstructions, but argue that this is outweighed by the much more realistic range of the reconstructed climate. At present, however, there is no simple method available to quantify the magnitude of the increase in error, or to test whether inflating the range of the reconstructed values results in the correct years being pushed to the extremes. Here, we provide a metric, based upon the same logic as the RE and CE statistics, which quantifies the magnitude of increase in error, and therefore relative loss of signal. To complement this, we also provide a simple non-parametric statistical test to determine whether the capture of extremes is significantly better than one would expect to occur simply by chance.
Equivalent variance explained ($R^2_{vs}$)
A useful way to envisage the strength of a climate reconstruction is to consider how much better it is than simply using the mean climate value for every year (the climatology). A simple measure of strength is the mean squared difference between each pair of measured and estimated values, which is the MSE. If a reconstruction produces an MSE that is no smaller than that obtained when a horizontal line (mean climate) is used, then it clearly possesses little or no skill as a tool for reconstructing past climate. The bigger the difference between the MSE based on the mean (climatology) and the MSE based on the reconstruction the better. This is effectively what the squared correlation coefficient measures over the period used for calibration, and it can be expressed as:
$$R^2 = 1 - \frac{MSE_{reg}}{MSE_{clim}}$$
where $MSE_{reg}$ is the MSE of the regression-based reconstruction and $MSE_{clim}$ is the MSE obtained when the mean of the measured climate is used for every year, both calculated over the calibration period.
A climate reconstruction based on least squares regression is automatically scaled so that the MSE is minimised, so any change to the variance of the reconstruction will inevitably increase the MSE. By directly calculating the MSE of the variance-scaled reconstruction, and comparing it with that obtained using the mean climatology, it is possible to derive a metric that can be viewed in the same way as $R^2$, as a measure of the amount of variance explained relative to that explained by just using the average for every year:
$$R^2_{vs} = 1 - \frac{MSE_{vs}}{MSE_{clim}}$$
where $MSE_{vs}$ is the MSE of the variance-scaled reconstruction and $MSE_{clim}$ is the MSE obtained using the mean climatology, both calculated over the calibration period.
Applying this logic, one may conclude that a variance-scaled reconstruction that yields an $R^2_{vs}$ of zero or less performs no better than the mean climatology and so has no skill.
The approach presented here is exactly the same as that used in calculating the RE and CE statistics, and the equations all have the same form, the only differences being the value that is used to define the climatology and the period over which the comparison is made. The RE and CE metrics are used in split-period tests, where the mean values are taken from the calibration and verification periods, respectively, and the comparison is always made over the verification period. For $R^2$ and $R^2_{vs}$, the climatology is the mean of the measured climate over the full calibration period, and the comparison is made over that same period.
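In fact, it is not necessary to calculate $R^2_{vs}$ from the MSE, because it depends only on the correlation coefficient. If the measured climate and the proxy are expressed as z-scores over the calibration period ($z_y$ and $z_x$, with the sign of $z_x$ reversed where the correlation is negative), the MSE of the scaled reconstruction, relative to the MSE of the mean climatology (which is simply the variance of the measured climate), is the mean squared difference between the two sets of z-scores:
$$R^2_{vs} = 1 - \frac{1}{n}\sum_{i=1}^{n}\left(z_{y,i} - z_{x,i}\right)^2$$
Expanding this equation:
$$R^2_{vs} = 1 - \left(\frac{1}{n}\sum_{i=1}^{n} z_{y,i}^2 + \frac{1}{n}\sum_{i=1}^{n} z_{x,i}^2 - \frac{2}{n}\sum_{i=1}^{n} z_{y,i}\,z_{x,i}\right) = 1 - \left(1 + 1 - 2|r|\right)$$
Which simplifies to:
$$R^2_{vs} = 2|r| - 1$$
From this simple equation, it is clear that when the magnitude of the (positive or negative) correlation between the proxy and the climate target falls below 0.5, the $R^2_{vs}$ falls below zero.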
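The equivalent variance explained can also be checked numerically. The Python sketch below computes one minus the ratio of MSE to climatological MSE for both a regression-based and a variance-scaled reconstruction, and compares the results with $r^2$ and with $2|r| - 1$, the value that follows from the z-score algebra when population moments are used throughout. The function name, the dictionary keys and the data are hypothetical and chosen only for illustration.

```python
from statistics import fmean, pstdev


def skill_metrics(proxy, climate):
    """Return r, R2 (regression) and R2_vs (variance scaling) from MSE ratios."""
    n = len(climate)
    p_mean, p_sd = fmean(proxy), pstdev(proxy)
    c_mean, c_sd = fmean(climate), pstdev(climate)
    # Pearson correlation from population moments.
    cov = sum((p - p_mean) * (c - c_mean) for p, c in zip(proxy, climate)) / n
    r = cov / (p_sd * c_sd)
    z = [(p - p_mean) / p_sd for p in proxy]
    sign = 1.0 if r >= 0 else -1.0

    regression = [c_mean + r * c_sd * zi for zi in z]   # least squares estimate
    scaled = [c_mean + sign * c_sd * zi for zi in z]    # variance-scaled estimate
    climatology = [c_mean] * n                          # horizontal line

    def mse(estimate):
        return fmean([(c - e) ** 2 for c, e in zip(climate, estimate)])

    mse_clim = mse(climatology)
    return {
        "r": r,
        "R2": 1.0 - mse(regression) / mse_clim,
        "R2_vs": 1.0 - mse(scaled) / mse_clim,
    }


# Hypothetical numbers, for illustration only.
proxy = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
climate = [2.1, 2.9, 4.2, 3.8, 5.5, 6.1]

m = skill_metrics(proxy, climate)
print(f"r = {m['r']:.3f}, R2 = {m['R2']:.3f}, R2_vs = {m['R2_vs']:.3f}")
print(f"r^2 = {m['r'] ** 2:.3f}, 2|r| - 1 = {2 * abs(m['r']) - 1:.3f}")
```

The two printed lines should agree to rounding, so for a published reconstruction only the correlation coefficient is needed to recover the skill of the scaled version.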
Capture of extremes
The key aims of variance scaling are to reduce the bias towards the mean inherent in regression-based reconstructions and to ensure that the magnitude of extreme values in the past is not underestimated. A logical measure of how well a reconstruction captures the extreme values would be to define a threshold for extreme values and calculate how well a reconstruction captures the values beyond this threshold. A simple definition of the extremes is the thresholds beyond which the upper and lower 10% of the measured climate data fall. Since regression-based reconstructions are biased towards the mean, we would expect them to underestimate the number of years where the target climate fell beyond the thresholds for the highest and lowest 10% and a variance-scaled reconstruction should perform more successfully.
The ‘Extreme Value Capture’ (EVC) of a reconstruction can be calculated by first ranking the climate target values to identify the values beyond which the highest and lowest 10% lie and noting which years lie beyond those thresholds. The probability of a value falling within any 10% band of the data, purely by chance, is 1 in 10 (p = 0.1). Therefore, a reconstruction that has been variance scaled to have approximately the same range of values as the climate target, but has no skill in capturing extremes beyond pure chance, would still be expected to capture some years. The probability of capturing a given number of extreme years by chance can be calculated using the binomial distribution, providing a simple non-parametric test of statistical significance. The logic and assumptions are the same as those used in applying the Sign Test, the only difference being that the probability of a correct result occurring by chance is 1 in 10 (p = 0.1) rather than 1 in 2 (p = 0.5). Probabilities can be calculated using a wide variety of software. One of the assumptions of the Sign Test, and of the EVC test, is that the extremes are independent of each other. In time series data, serial autocorrelation can violate this assumption, although it is only likely to be important where autocorrelation is very strong, in which case the data are not really suitable for climate reconstruction using regression or variance scaling.
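Counting the captures is straightforward to automate. The following Python sketch is one possible reading of the procedure, under stated assumptions: the function name is hypothetical, each band holds a whole number of years (any fractional remainder is ignored), the threshold is taken to be the least extreme measured value inside the band, and a year counts as captured when its reconstructed value lies at or beyond that threshold.

```python
def evc_counts(measured, reconstructed):
    """Count extreme years captured in the lowest and highest 10% bands.

    Returns (band_size, low_captures, high_captures).
    """
    n = len(measured)
    band_size = n // 10  # whole years per 10% band (assumption of this sketch)
    order = sorted(range(n), key=lambda i: measured[i])  # year indices, coldest first
    low_years = order[:band_size]
    high_years = order[n - band_size:]

    # Thresholds: the least extreme measured value inside each band.
    low_threshold = measured[low_years[-1]]
    high_threshold = measured[high_years[0]]

    # A capture is a truly extreme year whose reconstructed value is also extreme.
    low_captures = sum(1 for i in low_years if reconstructed[i] <= low_threshold)
    high_captures = sum(1 for i in high_years if reconstructed[i] >= high_threshold)
    return band_size, low_captures, high_captures


# Hypothetical 20-year example: two years per band.
measured = [float(v) for v in range(1, 21)]
reconstructed = list(measured)
reconstructed[0] = 0.5    # coldest year, reconstructed below the low threshold
reconstructed[1] = 3.0    # second coldest year, missed
reconstructed[18] = 19.5  # second warmest year, captured
reconstructed[19] = 18.0  # warmest year, missed

print(evc_counts(measured, reconstructed))
```

Keeping the low and high counts separate makes it easy to test each tail on its own as well as in combination.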
For example, if a calibration data set is available covering 100 years, then there will be 10 years in each of the upper and lower 10% bands. The probability of any year falling into any 10% band is 1 in 10, so even a reconstruction with no skill at all is likely to capture one extreme year in the top 10% and one in the bottom 10%. Using the binomial distribution, we can calculate that the probability of capturing 3 years from 10 is p = 0.06, and so not statistically significant. However, the probability of capturing 4 years by chance is far more remote (p = 0.011), and the probability of capturing 5 is close to one in a thousand (p = 0.001). This simple test can be applied either to the two outer 10% bands combined, to give an indication of the overall skill of a reconstruction in capturing extremes, or, perhaps more usefully, to the high and low extremes individually, to test for a bias in the ability of a reconstruction to capture either the very high or very low values.
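The probabilities in this example can be reproduced with the Python standard library. In the sketch below the function names are hypothetical. The figures quoted above appear to correspond to the binomial probability of exactly k captures, so that is computed first; the cumulative probability of at least k captures, which is the more usual one-sided p-value, is included for comparison and is slightly larger.

```python
from math import comb


def prob_exactly(k, n, p=0.1):
    """Binomial probability of exactly k captures from n extreme years."""
    return comb(n, k) * p ** k * (1 - p) ** (n - k)


def prob_at_least(k, n, p=0.1):
    """Binomial probability of k or more captures from n extreme years."""
    return sum(prob_exactly(j, n, p) for j in range(k, n + 1))


# 100 years of calibration data: 10 years in each 10% band.
for k in (3, 4, 5):
    print(k, round(prob_exactly(k, 10), 4), round(prob_at_least(k, 10), 4))
```

Whichever form is adopted, it should be applied consistently when critical values for different sample sizes are compared.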
For example, given 150 years of climate data, 10% of which is 15, the critical numbers of correct captures at p = 0.05 and p = 0.01 are 4 and 5, respectively. In practice, however, calculating exact probabilities for the EVC test is more complicated because the number of years with climate data available for calibration will rarely be exactly divisible by 10. In such cases, and as fractional observations are not possible in annually resolved datasets, it is necessary to calculate the significance of the number of captures for the higher and lower integer and combine them using a weighted average. For example, a sample of 154 years requires 15.4 years per 10%, so the significance levels are calculated using the results from both 15 and 16 years using a weighted average where the probability for 16 years is multiplied by 0.4 and the probability for 15 years multiplied by 0.6. To achieve a significance level of p = 0.05 in this case requires at least 4 successful EVCs. The relative weighting means that they must come from the upper (or lower) 15 years only (indicated in Table 2 as 4/15). The year ranked 16 cannot be included in the count of 4 captured extremes because 4/16 values is not significant at p = 0.05. To achieve a significance level of p = 0.01 for 15.4 observations by weighting requires at least 6 correct captures from the upper (or lower) 16 observations, so in this case, the year ranked 16 can be included in the count of 6 (indicated in the table as 6/16). The calculations are included for the examples used below, but for convenience and to avoid this procedure, we provide a table of critical values (Table 2).
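The weighted-average step can be sketched in Python as follows. The function names are hypothetical, and the sketch assumes that the probability being weighted is that of exactly k captures, which is the form that appears to reproduce the figures quoted in the text; substituting a cumulative probability would change the numbers slightly.

```python
from math import comb


def prob_exactly(k, n, p=0.1):
    """Binomial probability of exactly k captures from n extreme years."""
    return comb(n, k) * p ** k * (1 - p) ** (n - k)


def weighted_evc_probability(k, n_years, p=0.1):
    """Weighted probability of k captures when n_years / 10 is not an integer.

    For 154 years the band holds 15.4 years, so the result for 15 years is
    weighted by 0.6 and the result for 16 years by 0.4.
    """
    lower, remainder = divmod(n_years, 10)
    weight_upper = remainder / 10
    if remainder == 0:
        return prob_exactly(k, lower, p)
    return ((1 - weight_upper) * prob_exactly(k, lower, p)
            + weight_upper * prob_exactly(k, lower + 1, p))


# 154 years of calibration data: 15.4 years per 10% band.
for k in (4, 5, 6):
    print(k, round(weighted_evc_probability(k, 154), 4))
```

For 154 years this gives a weighted probability below 0.05 for 4 captures, above 0.01 for 5 and below 0.01 for 6, in line with the critical values described above.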
Table 2. Critical values for the ‘Extreme Value Capture’ test where p is the probability of capturing at least the tabulated number of correct years. N represents the highest or lowest 10% of years or their combination.
It is important to stress that it is only logical to apply the EVC test to reconstructions that are based either on inverse calibration, where the proxy is the independent variable and underestimation of variance is inevitable, or on matching the variance of the proxy to that of the climate target. In the latter case, the total number of years that can possibly fall beyond the upper and lower thresholds is strongly constrained to be very close to 10%, so that the number of ‘captures’ and the number of erroneous extreme years is effectively a closed set. It is not valid to apply it to reconstructions based on classical calibration, where the proxy is the dependent variable. In this case, unexplained variance in the regression model inflates the variance of the reconstruction, so that the number of years that fall beyond the thresholds is not constrained, irrespective of the results of the EVC test.
Examples
Ice-break data and spring temperatures
Among the most powerful sources of information on the climate of the past are historical archives, especially those that refer to events that are strongly constrained by climate, such as phenology (e.g. flowering or harvest dates) or the freezing and/or melting of water bodies (Brázdil et al., 2005, 2010). One such record is that of the date of break-up of the winter ice blocking the flow of the Tornio River, close to the Arctic Circle between northern Sweden and Finland (Kajander, 1993; Klingbjer and Moberg, 2003; Magnuson et al., 2001). Loader et al. (2011) used this record to reconstruct the mean temperature of April–May back to AD 1693, using an overlap with measured climate data of 150 years (AD 1860–2009). The correlation between the Julian day of ice break and spring temperature over that period is −0.82, so 67% of the variance in break-up date is explained by April–May temperature. Loader et al. (2011) made their reconstruction using regression, but despite the very high correlation, the full range of spring temperatures over the calibration period (−3.9°C to 5.5°C = 9.3°C) is underestimated by about 20% (−2.9°C to 4.6°C = 7.4°C). Using the same data to produce a variance-scaled reconstruction increases the MSE, so that the equivalent variance explained drops from 67% to 64% ($R^2$ = 0.67, $R^2_{vs}$ = 0.64).
Table 3. Measures of the skill of palaeoclimate reconstructions based on regression (R) and variance scaling (S) for three case studies, including the Pearson’s correlation coefficient (r) and squared correlation ($R^2$) and their equivalents for a scaled reconstruction ($r_{vs}$ and $R^2_{vs}$).

Figure 1. Comparison of measured (smooth black lines, presented in rank order) and reconstructed climate parameters using regression and scaling. The red lines and dashed boxes show the extent and target range for the lowest and highest 10% of measured values. (a) Spring temperature reconstructed using ice-break dates on the Tornio River, (b) summer temperature reconstructed using tree growth in northern Fennoscandia and (c) summer precipitation reconstructed using stable oxygen isotopes in British oak tree rings.
Given 150 years of data, the two 10% extremes contain 15 years each. The regression-based reconstruction performs well in capturing the extreme years, with 5 and 6 of the 15 years captured in the highest and lowest 10% bands, respectively (Figure 1, Table 4), both of which are significant (p < 0.01, Table 2), and an overall capture rate of 37% (p < 0.001). Variance scaling raises the capture rate by 2 years at either extreme (p < 0.001), giving 50% overall, which is a gain of 36%. These results suggest that for a small cost in terms of increased MSE, variance scaling removes the underestimate of the full range of climate variability and also improves the capture rate of extremes. When the ice-break data are used to reconstruct April–May temperatures, there appears to be very little bias in skill for reconstructing very warm and very cold years.
Table 4. Extreme Value Capture (EVC) statistics and associated probabilities (Prob.) for the highest and lowest 10% of measured values in three case studies using regression and variance scaling. Where the number of years divided by 10 does not yield an integer, values are given for the higher and lower number (n) and combined into a weighted average.
Tree growth and summer temperature
McCarroll et al. (2013) present a pine (Pinus sylvestris L.) tree-growth index for the northern timberline region of Europe based on combining nine tree-growth proxies from four sites. The climatic target used was the June to August (JJA) mean temperature of the same region, calculated as an average of data from several climate stations. Trees growing in this timberline region are very sensitive to summer temperature, and the mean growth index is based on a very large set of data, so the correlation with summer (JJA) mean temperature is among the highest yet reported for any climate reconstruction based on trees (r = 0.81).
The tree-growth index was used to reconstruct JJA temperatures using regression, but as expected, there is an underestimate of the variance of summer temperature, with the true range over the calibration period (5.8°C) being underestimated (4.6°C) by more than 20% (Table 3). To reduce this bias towards the mean, a variance-scaled reconstruction was also produced, where the mean and variance were adjusted to fit the mean and variance of the climatic target data over the whole period for which climate data were available (AD 1890–2005), and McCarroll et al. (2013) argue that this ‘gives a more realistic picture of the magnitude of past climate change at the expense of inflating the error’. However, the magnitude of error inflation was not quantified and the gains, in terms of the capture of extremes, were not investigated. Using the methods described in this paper, we can now say that the reconstruction based on regression explained 66% of the variance ($R^2$ = 0.66) but that variance scaling inflated the MSE, giving a lower equivalent value ($R^2_{vs}$).
Given 116 years of climate data for calibration, the number of years in the highest and lowest 10% is 11.6. The threshold for the upper 10% of years is 12.6°C, and the regression-based reconstruction only captures two of the 11 or 12 years by placing them above the threshold (Figure 1), which is not statistically significant (p = 0.2). Using variance scaling captures one more year, but the probability of capturing 3 of the 11 or 12 years is about 8%, so the result is still not statistically significant (p > 0.05). The critical threshold for correct captures at p = 0.05 for a sample size of 11.6 is 4 from 12 (Table 2).
The regression-based reconstruction performs much better for the coldest 10% of summers, capturing 6 years below the threshold (Figure 1), which is strongly significant (p < 0.001: Table 2), but variance scaling hardly improves this, with only the 12th year added, which weighted at 0.6 gives a capture rate of 6.6 for 11.6 years (57%). Taking both extremes together, covering 20% of the years, the capture rate is 41%, which is strongly significant (n = 23.2, p < 0.001, Table 2).
These results demonstrate that even when the correlation between proxy and target is unusually high, at 0.81, the ability of a regression-based reconstruction to capture extreme years is not guaranteed. In this case, cold summers are captured much more efficiently than hot summers, to the extent that the capture rate of hot summers is not statistically significant (p > 0.05). Variance scaling improves the capture of extremes only very slightly, and the capture of warm summers alone is still not statistically significant (p > 0.05). It is not surprising that the growth of pine trees close to the tree line of northern Europe is more sensitive to cold summers than to warm summers, and the EVC method allows this bias to be identified and quantified. In this example, the ‘cost’ of variance scaling, in terms of loss of signal, is small, but the gain in terms of the capture of extremes is also small.
Oxygen isotopes in tree rings and summer rainfall
It has been argued, from first principles (Barbour, 2007; Danis et al., 2006; Saurer et al., 1997; Treydte et al., 2014), that the oxygen isotope ratios in the latewood cellulose of British oak trees should provide an indication of the amount of summer rainfall. Isotopic data, averaged from several sites across the United Kingdom, were calibrated using the total JJA precipitation for the England and Wales region (Wigley et al., 1984) over the period AD 1850 to 2012. Given the correlation coefficient of −0.69, a regression-based reconstruction underestimates the true range of precipitation totals (343 mm) by more than 28% (245 mm), whereas a variance-scaled reconstruction produces a range of values (353 mm) within 3% of the target range (Table 3).
The regression-based reconstruction is based on a correlation that explains 48% of the variance, but it performs poorly in capturing the extremes with just 3 and 4 from the upper and lower 10% (Figure 1), neither of which is statistically significant (critical threshold for 16.3 years at p = 0.05 is 5 from 17: Table 2). Variance scaling increases the MSE, so that the equivalent explained variance ($R^2_{vs}$) falls below the 48% explained by regression.
Discussion and conclusion
One of the central aims of palaeoclimate research is to improve understanding of contemporary climatic changes through comparison of the climate of the past with that of the present. The regression methods that are commonly used, however, inevitably result in an underestimation of the variability of climate in the past and thus a bias towards the mean, with an attendant underestimation of the magnitude and frequency of past extremes. Variance scaling, where the mean and variance of the reconstruction are adjusted to fit the mean and variance of the climate target over the period for which meteorological data are available, overcomes this problem but at the cost of increasing the MSE. Authors generally acknowledge the loss of signal, but do not quantify it or demonstrate that there is an improvement in estimating the magnitude and frequency of extreme years.
Just as the squared correlation coefficient ($R^2$) provides a measure of the strength of a climate reconstruction based on regression, an equivalent measure ($R^2_{vs}$) provides a measure of the strength of a variance-scaled reconstruction.
When the modulus of the Pearson’s correlation coefficient falls below 0.5 ($R^2$ < 0.25), inflating the variance of a reconstruction to meet the climate target results in such a large inflation of the MSE that it becomes larger than that obtained simply by using the mean value of the climate data for every year and so, by analogy with the RE and CE statistics, the reconstruction can be regarded as having no skill. We conclude therefore that:
Variance scaling should not be applied where the squared correlation between proxy and target falls below 0.25 (|r| < 0.5);
The amount of variance explained by a scaled reconstruction is less than that explained by a reconstruction based on regression and can be conveniently quantified using $R^2_{vs}$;
The loss of skill as a consequence of scaling a reconstruction can be expressed as a percentage.
Variance scaling inevitably expands the range of values in a reconstruction, and this is sometimes quantified by quoting the difference between the total range of measured and estimated values of the climate target value over the period with climate data. Such a comparison is of questionable value, because it is inevitable that variance scaling will perform better than regression in this regard. The critical test is to determine whether the actual years that are pushed to the extremes are the correct years. We propose a simple non-parametric EVC test, based on the binomial distribution, which determines whether the number of years correctly placed above and below the thresholds defined by the upper and lower 10% of measured climate data is significantly more than the number that might be expected purely on the basis of chance.
Three examples demonstrate the utility of the EVC test. The most surprising result is that for summer temperature reconstruction in northern Fennoscandia based on a tree-growth index (McCarroll et al., 2013). Although the reconstruction is based on an exceptionally high correlation between proxy and target (JJA temperature) of r = 0.81, the EVC test reveals a clear asymmetry in the ability to capture extreme years. Cold summers are identified very effectively, but the capture rate for warm summers is no better than might be expected simply by chance (p > 0.05). Such a strong asymmetry in the capture of extremes has important implications for the interpretation of palaeoclimate reconstructions, particularly with regard to comparing the magnitude and frequency of warm extremes in the past with those over the period for which meteorological measurements are available. Missing the warmest years will also result in low-frequency temperature reconstructions of past warm periods being underestimated, since low-frequency curves in calibrated reconstructions should simply represent a smoothing of the high-frequency signal.
Footnotes
Acknowledgements
We thank our many friends in the Millennium project for helpful discussion about the perils and intricacies of regression and scaling.
Funding
This work was supported by C3W, the Leverhulme Trust (RPG-2014-327) and the EU project Millennium (017008).
