Abstract
We compare a Bayesian modelling-based technique with weighted averaging (WA) and weighted averaging-partial least squares (WA-PLS) regression in pollen-based summer temperature transfer function calibration. We test the methods using a new, 113-sample calibration set from Estonia, Lithuania and European Russia, and a Holocene fossil pollen sequence from Lake Kharinei, a previously studied lake in northeast European Russia. We find WA-PLS to outperform WA, probably because of smaller edge-effect biases in the ends of the calibration set gradient. The Bayesian-based calibration models show further improved performance compared with WA-PLS in leave-one-out cross-validation, while additional h-block cross-validation shows the Bayesian method to be little affected by spatial autocorrelation. Comparison with independent climate proxies reveals, however, some clear biases in the Bayesian palaeotemperature reconstructions, likely reflecting in part some specific limitations of our calibration set. As the selected prior parameters can significantly affect both Bayesian cross-validation performance and reconstructions, there is a clear need to further test the Bayesian method in different geographic contexts and over different timescales, with special attention given to the selection of the most realistic priors in each situation. In general, our finding that statistically well-performing transfer functions may produce clearly differing palaeotemperature reconstructions urges caution in transfer function-based inferences. We additionally test a spatially restricted, 58-sample subset of the full 113-sample calibration set. We find some reduced biases with the smaller set, likely because of complex, partially bimodal responses of several taxa along the longer temperature gradient, ill-suited for calibration methods assuming unimodal responses to climate.
Introduction
Transfer function-based studies have yielded many key insights into Quaternary environmental change, but several important challenges remain. First, the palaeoclimate reconstructions need to be validated by comparison with independent records (Birks and Birks, 2003). Second, the consistency of the modern calibration data, including the modern climate data, needs to be improved and tested (Daly, 2006; Kriticos and Leriche, 2010; Peterson and Nakazawa, 2008). Third, the various transfer function calibration methods need to be analysed and compared to assess their strengths and weaknesses in different situations, such as specific spatial or temporal scales (e.g. Heinrichs et al., 2001; Köster et al., 2004; Lotter et al., 2000; Paterson et al., 2002; Peyron et al., 2011). Fourth, the performance of transfer functions over longer timescales, with palaeoclimates and biotic assemblages less analogous with modern calibration data, needs to be critically evaluated (Jackson et al., 2009). Fifth, it is increasingly clear that more rigorous climatological, biogeographical and ecological consideration is needed for identifying the climatic parameters that can be realistically reconstructed in each region, from each particular biological proxy group, and from each climate–proxy calibration model, taking into account the possible effects of spatial autocorrelation when evaluating calibration models (Birks et al., 2010).
Weighted averaging-partial least squares regression (WA-PLS; ter Braak and Juggins, 1993) has been a commonly used transfer function calibration method in recent years, probably because WA-PLS generally (but not always) shows improved performance compared with simple two-way weighted-averaging regression (WA), another popular calibration method (Birks et al., 2010). A more recent alternative is offered by palaeoclimate reconstruction using Bayesian modelling. This approach is based on modelling the joint probability of the environmental variables of interest, the proxy data, and all the model parameters. A key component is a ‘forward’ model that describes the data-generating process, that is, the causal effect of the environment on the climate proxy considered. Using the joint model, statistical inversion allows ‘backward’ inference that produces an environmental reconstruction consistent with the proxy data, together with its associated uncertainty. The first papers to use detailed Bayesian modelling for palaeoclimate reconstruction are Vasko et al. (2000), Toivonen et al. (2001), and Korhola et al. (2002). More recently Haslett et al. (2006), Erästö and Holmström (2006), and Holden et al. (2008) have also used a Bayesian approach. Other Bayesian approaches are discussed in Guiot et al. (2000) and Gachet et al. (2003). Bayesian techniques are also widely applied in many other areas that from the point of view of statistical modelling are similar to environmental reconstruction (e.g., Banerjee et al., 2004).
In pollen-based transfer function studies, a major division can be observed in terms of the spatial extent of the employed calibration data sets. While some studies use large-scale, continental or hemispheric data bases (e.g. Bordon et al., 2009; Cheddadi et al., 1998; Feurdean et al., 2008; Kühl et al., 2010; Sawada et al., 2004) others employ smaller, regional calibration sets (e.g. Finsinger et al., 2007; Goring et al., 2009; Seppä et al., 2004; St Jacques et al., 2008; Xu et al., 2010). While the first approach generally supports the use of the modern-analogue technique (MAT), a general objective of the latter approach is to try and select a regional calibration set with an optimal spatial size. Criteria involved in determining the desired calibration set pattern include the climate parameters which will be reconstructed and the spatial orientation of their gradients, the climatic range expected in the fossil data, and possible limitations of the calibration method used.
In this paper, we compare WA, WA-PLS and Bayesian modelling in pollen-summer temperature transfer function development, using a new 113-sample calibration set from European Russia, Estonia and Lithuania. The WA, WA-PLS and Bayesian transfer functions are compared by reconstructing summer temperature based on a Holocene fossil pollen sequence from Arctic European Russia. We focus on two main problems: First, the overall performance of Bayesian modelling versus WA and WA-PLS regression in transfer function calibration, and the relative vulnerability of these methods to spatial autocorrelation effects, with the presented calibration set; second, the effect of calibration set size (Bjune et al., 2010; Velle et al., 2011) on WA, WA-PLS and Bayesian transfer functions and the performance of these methods at different spatial scales.
Study area
Our study area (Figure 1) covers a large part of northeast Europe, stretching from the Baltic Sea in the west to the Ural Mountains in the east, and from northwestern limit of the Eurasian Steppe in the south to the Barents Sea coast in the north. Most of the study area falls within the East European Plain with elevations mostly below 300 m a.s.l., with significantly higher elevations only found in the Urals. Four major vegetation zones cover the study area in latitudinal bands (Figure 1). In the north, a tundra zone stretches from the Barents Sea coast to the arctic treeline (major species: Picea abies ssp. obovata), located near the Arctic Circle. South on the treeline, a forest-tundra mosaic zone averaging c. 100 km in width is often distinguished before the transition to the boreal forest (‘taiga’; major species: spruce (Picea abies), pine (Pinus sylvestris), birch (Betula pendula and B. pubescens) and larch (Larix sibirica)). Transition to mixed temperate forest (birch, alder (Alnus incana, A. glutinosa), lime (Tilia cordata), oak (Quercus robur) hazel (Corylus avellana) and maple (Acer platanoides)) occurs at c. 57–60°N. South of the mixed forest the steppe reaches up to 55°N in parts of European Russia, showing strong human influence with extensive crop and hay cultivation, and with scattered and often planted deciduous forest stands. Mean annual temperature ranges from +7°C in Lithuania to −8°C in the Arctic Urals. Another major climatic gradient in the area is in the continentality index (Gorczyński, 1920, 1922) and the seasonal variation of temperature which increases considerably towards the east, driven by colder winters in the interior of the Eurasian landmass. Mean annual precipitation is relatively uniform, generally 600–700 mm, declining somewhat towards both the northern tundra and the southern steppe (Figure 1; Hijmans et al., 2005).

Map of the study area showing the location the calibration set surface samples and the Lake Kharinei coring site. The 58-sample subset of the full 113-sample calibration set is encircled. Vegetation types are synthesized from Khotinsky (1984), Rekacewicz (1998) and Olson et al. (2001). Climate parameters (from top left: mean annual temperature, May-to-August mean temperature, mean annual precipitation and the continentality index (Gorczynski, 1920, 1922)) are calculated based on WorldClim modern climate grids (Hijmans et al., 2005)
Materials
Surface pollen samples
The pollen–climate calibration data includes 113 lakes from Estonia, Lithuania and Russia. The 24 samples from Estonia were presented in Seppä et al. (2004), the rest of the samples are described for the first time here. Most sites (72) are located in the mixed forest zone, with 7 sites in the southern semi-steppe, 11 in the taiga, 7 in the forest–tundra transition zone, and 16 in the tundra north of the treeline (Figure 1). Samples were collected from medium-sized lakes in 1998–2009, using a surface sediment sampler from a boat or through the ice. Pollen samples were prepared using standard methods (Fægri and Iversen, 1989) with at least 500 terrestrial pollen and spores counted from each sample. Pollen nomenclature follows Moore et al. (1991).
We constructed the transfer functions using two different calibration sets. The first calibration set (113S) uses all 113 surface samples. Figure 2 shows the variation of the most abundantly occurring taxa in 113S. The second calibration set (58S) is a spatially restricted 58-sample subset (indicated in Figure 1) of 113S. 58S omits samples in the southwestern end of 113S, included all steppe samples and a majority of the mixed-forest samples. The objective of testing the restricted subset is to simplify some of the complex or bimodal taxon responses to Tmjja seen in 113S (Figure 2), and to test the effect of calibration set size on calibration model performance. A total of 104 pollen or spore types were identified in 113S, falling to 72 in 58S. All terrestrial pollen and spore types were included in the calibration sets.

Pollen diagram of the calibration set sites. Samples are ordered from top to bottom according to increasing Tmjja. Coloured zones indicate the vegetation belts shown in the Figure 1 map. Taxa are ordered from left to right according to increasing Tmjja optimum value in WA
Modern climate data
The modern summer mean temperature (May-to-August mean = Tmjja) was determined for each calibration set sample. Tmjja was chosen as the climate parameter as it roughly represents the temperature conditions during the growing season, including late spring, generally the most important single climate factor influencing the distribution, reproduction and growth of the plants in the arctic region (Euskirchen et al., 2009; Hinzman et al., 2004). The Tmjja value for each surface sample site was extracted from a Tmjja temperature grid (Figure 1), calculated in ArcInfo GIS software as the mean of WorldClim (Hijmans et al., 2005) mean temperature grids for May, June, July and August. Tmjja values for the surface sites range from 17.3°C to 5.9°C, decreasing with a steady gradient towards the north (Figure 1).
Fossil pollen data
The performance of our pollen–climate calibration set and the WA, WA-PLS and Bayesian modelling based transfer functions are tested on a Holocene pollen-stratigraphical record from Lake Kharinei (67°22′N 62°45′E), located near the Arctic Ural Mountains in European Russia (Figure 1). The general features of this sediment core, including its dating, and the ecological and palaeoclimatological context and implications are discussed fully in other papers (Jones et al., 2011; Salonen et al., 2011; Väliranta et al., 2011). The Lake Kharinei fossil pollen sequence (Figure 3) includes 57 taxa of which 53 are found in 113S and 47 in 58S.

Lake Kharinei biostratigraphy. Most abundantly occurring taxa are shown. Pollen and spore abundances are shown with black silhouette curves. Macrofossils counted in numbers (such as seeds) and conifer stomata are indicated by bars, whereas relative abundance of vegetative remains is indicated by a plus sign
Methods
WA and WA-PLS transfer function calibration
Pollen–climate transfer functions (cf. Seppä and Birks, 2001; Seppä et al., 2004) were calculated using two-way weighted averaging (WA) with inverse deshrinking, as well as 2-, 3-, and 4-component weighted averaging-partial least squares (WA-PLS) regression (Birks, 1998; ter Braak and Juggins, 1993), separately for 58S and 113S. Calibration set species data values were square-root transformed for WA and WA-PLS regression to reduce noise in the data (Prentice, 1980). Calculation of WA and WA-PLS transfer functions was performed in the C2 computer programme (Juggins, 2007). The performance of all transfer functions was evaluated by leave-one-out cross-validation (Birks et al., 1990), and performance statistics were computed for each transfer function. These include the coefficient of determination (r2), root-mean-square error of prediction (RMSEP) and maximum bias.
Bayesian transfer function calibration
The Bayesian reconstruction method used here is Bummer, a Bayesian hierarchical multinomial regression model based on the classical (direct) approach to calibration (Vasko et al., 2000). We describe here only the main idea; for further details and additional discussion we refer to the original paper, the application described in Korhola et al. (2002), as well as to the related work discussed in Erästö and Holmström (2006) and Haslett et al. (2006).
Suppose that the modern data are obtained from n calibration lakes and that the core data consist of N past pollen assemblages. We assume that from all sediment samples m different taxa are counted. Let x = (x1,…, x
n
) denote the modern temperatures and let the m × n matrix Y contain the associated pollen taxon abundances. Similarly, the m × N matrix of the past taxon abundances and the corresponding sequence of unknown past temperatures is denoted by Y* and x* = (
The reconstruction of the past temperature time series x* is described by its conditional probability density f(x* | Y*,Y,x), given the observed counts Y*,Y and the modern temperatures x. This is the so-called posterior density of x* and it can be shown to be proportional to the integral ∫ f(x*, θ) f(Y*,Y|x,x*,θ)dθ. Here the quantity f(x*, θ) is the prior probability density of x* and all the model parameters θ and it is chosen by the modeller to reflect uncertainty about their values, prior to considering the data. The density f(Y*,Y|x,x*,θ) is the likelihood of the abundance data, given the temperatures and the parameters. It represents the forward causal part of the model. The backward step of Bayesian inference is represented by the computation of the posterior density of x*. In practice, the particular model used is too complex to allow an analytical solution and inference is based on large sample from the posterior density obtained using Monte Carlo methods. For a general introduction to Bayesian modeling, see Gelman et al. (2004). A thorough account of Monte Carlo methods for Bayesian analysis is provided by Robert and Casella (2004).
The forward model assumes that for every pollen and spore type there is a temperature at which the corresponding host plant fares particularly well and that the pollen and spore relative abundances decay roughly symmetrically around this optimum temperature. In the Bummer model, the pollen specific response curve for taxon k is assumed to follow a unimodal Gaussian shape controlled by three parameters. This model assumption is coded in the quantities λ ik defined as:
where α k is a scaling factor, β k the taxon-specific optimum temperature and γ k the tolerance that represents sensitivity of the dependence of taxon abundance on the temperature. The relative magnitudes of the quantities λ ik are used to specify the probabilities p ik that random pollen grain or spore from site i belongs to taxon k. These probabilities are treated as random variables, which leaves room for variability between sites. Specifically, one assumes that the probabilities follow Dirichlet distribution,
given the parameters λi1,…,λ im .
The causal forward model for the taxon abundances at site i is a multinomial distribution with the probabilities p
ik
, k = 11,…, m. The corresponding models involving the core relative abundances Y*, the past temperatures
Vasko et al. (2000) and Korhola et al. (2002) both use the same gamma and uniform prior distributions for the tolerances γ k and the scaling factors α k , respectively, and we make the same choices here:
For the past temperatures
In leave-one-out cross-validation the prior was constructed similarly, except that one site in turn is left out from the calibration set.
In cross-validation, choice of the prior for
Here Tmjja(i) is the observed modern May-to-August mean temperature at lake i. The Bayes 1 model follows the original Bummer settings of Korhola et al. (2002) and Bayes 2 relaxes the prior of both the optimal temperatures β
k
and the predicted temperatures by 50%. Both Bayes 3 and Bayes 4 use the species-specific priors for the optimum temperatures β
k
but, where Bayes 3 adopts for
While we report cross-validation results on all these four models, Bayes 3 seems like a reasonable compromise model whose cross-validation result probably is not overly optimistic but that could still be used in reconstruction. As noted above, the priors of β
k
in the Bayes 1 and 2 models may be questionable for the calibration sets considered and the uniform priors of
Leave-one-out cross-validation was used to calculate r2, RMSEP and maximum bias for the Bayesian transfer function. In the model description outlined above, the ‘core’ in this cross-validation setting can be thought of consisting of just one sediment layer (N = 1) with abundance data corresponding to the particular lake left out.
Predicting the temperature of just one lake took about 3.6 h of CPU time on Intel Core DuoE6750 computer, leading to a 17-day cross-validation run for 113S. In Monte Carlo simulation, for each lake i we generated a sample of 20 000 realizations from the posterior distribution of
All computations for the Bayesian model were performed using the Matlab programming environment except that the highest posterior density intervals needed for the Bayesian error bars were calculated using the function emp.hpd from the R statistical software (R Development Core Team, 2009) package TeachingDemos (Snow, 2008).
Spatial autocorrelation tests
The effect of spatial autocorrelation in the performance statistics measured in leave-one-out cross-validation (Telford and Birks, 2005, 2009) was evaluated by additionally performing h-block cross-validation (Telford and Birks, 2009) for each calibration method using 113S. In h-block cross-validation, Tmjja is estimated for each calibration set site using a calibration set which leaves out the site being predicted for, as well as all sites within the h-block radius r. The decline of r2 and the increase of RMSEP in h-block cross-validation relative to leave-one-out cross-validation figures is then used as a measure of the role of spatial autocorrelation in biasing leave-one-out cross-validation performance. For a h-block radius we chose r=100 km, as a compromise that removes a moderate number of neighbouring sites, but leaves some analogues from each vegetation type in the calibration set. An average of 4.85 sites (min=0, max=15) were omitted in the h-block runs, with site density generally highest in the mixed-forest zone. The Bayesian model tested for spatial autocorrelation was Bayes 3 and we used for the taxon-specific parameters α k , and γ k the prior distributions defined above (cf. the section on Bayesian transfer function calibration).
Palaeoclimate reconstruction
The statistical significance of the Tmjja reconstruction was tested using the methods developed by Telford and Birks (2011a). We use redundancy analysis to determine if the proportion of the variance of the fossil data explained by the 113S-based WA and WA-PLS Tmjja reconstructions was larger than the proportion explained by most of 999 reconstructions of random environmental data. However, the WA- and WA-PLS-based tests are prone to Type I (false negative) error when the effective number of species is low (Telford and Birks, 2011a) which is the case with our pollen data (Hill’s N2 = 3.45). We therefore also test a MAT-based reconstruction (using five closest analogues, with squared chord distance as dissimilarity measure), as suggested by Telford and Birks (2011a). Significance test calculations were done using the R statistical software (R Development Core Team, 2009) package palaeoSig (Telford, 2011). Statistical significance of the Bayesian reconstruction was not tested due to the prohibitive computing demands of calculating separate models based on 999 sets of random environmental data.
With WA and WA-PLS, the reconstructed Tmjja values and their sample-specific standard errors were estimated using a 1000-cycle bootstrapping procedure (Birks, 2003) in the C2 software (Juggins, 2007). In Bayesian reconstruction we used for α k , and γ k the same priors as in cross validations and
where 7.2°C is the observed modern mean for Lake Kharinei. Thus, the Bayesian reconstruction transfer function corresponds to the Bayes 3 model tested in cross-validation. We generated a sample of 40 000 posterior realizations from the temperature vector x* using again Metropolis-within-Gibbs sampling. From those samples, the first 20 000 were used for burn-in and from the rest every 10th sample was kept for analysis. For each fossil sample we calculated the pointwise posterior expectation, and the pointwise 95% highest posterior density interval which can be thought to correspond to a classical confidence interval of ±2 standard deviations.
Results
We selected six calibration models for closer evaluation. First, we compare the WA model, the best-performing WA-PLS model, and the Bayes 3-based Bayesian model, all calibrated using 113S. Second, to analyse the effect of calibration set spatial extent on each method, we also test the WA, WA-PLS and the Bayesian models based on 58S. The models are evaluated based on their cross-validation performance and based on palaeotemperature reconstructions prepared with each model.
Transfer function performance
Leave-one-out cross-validation performance statistics (r2, RMSEP and maximum bias) for the developed transfer functions are shown in Table 1. WA-PLS shows improved performance over WA in all statistics. We use the van der Voet (1994) t-test to select the appropriate WA-PLS models for 58S and 113S, with WA-PLS components added as long as they provide a statistically significant improvement in predictive performance. With 113S the three-component WA-PLS model is selected and with 58S the two-component model is chosen. As expected, a tighter prior around the predicted temperature generally improves the cross-validation performance of the Bayesian models. When measured with r2 and RMSEP, the Bayesian transfer functions appear to be better than the non-Bayesian approaches, the only exception being a two-component WA-PLS model that marginally outperforms the uniform prior-based Bayesian transfer function on the smaller calibration set. However, as noted above, the results based on a uniform prior are too pessimistic, as one would not use such a model in practice. Except for the maximum bias, the effect of calibration set size on the Bayesian model performance is minor. The overall best model appears to be Bayes 3 which in the following will simply be referred to as the ‘Bayesian model’.
Performance statistics for WA, WA-PLS and Bayesian pollen–climate models for Tmjja based on leave-one-out cross-validation. Reported statistics are root mean square error of prediction (RMSEP), coefficient of determination (r2), and maximum bias. Shown in italics are the WA, WA-PLS and Bayesian models selected for further analysis with each calibration set size
Figure 4 shows the plots of predicted Tmjja versus the observed modern Tmjja and residuals of predicted Tmjja in leave-one-out cross-validation with the selected models. Generally, the Bayesian models show smaller scatter of temperatures within each vegetation type, compared with WA and WA-PLS. In tundra samples (Tmjja < 8°C) WA and WA-PLS models show their worst performance with the 113S WA model overestimating temperatures by up to 4°C and the 113S WA-PLS model by up to 3°C, while the residuals are also mostly positive but somewhat smaller with 58S. The Bayesian models perform significantly better in the tundra. In taiga samples (Tmjja 10–14°C) WA-PLS and Bayesian models show widely scattered residuals between −3 and +3°C but without systematic bias, while the WA models tend to underestimate temperatures by up to 3°C. All models perform well in mixed forest (Tmjja14–16.7°C). In the steppe samples (Tmjja > 16.7°C) only included in 113S, all models underestimate temperatures, with greater negative bias in WA compared with WA-PLS and Bayesian models.

Plots of predicted versus observed Tmjja, and residuals of predicted versus observed Tmjja for six transfer functions in leave-one-out cross-validation. Transfer functions are prepared with weighted averaging (WA) and weighted averaging-partial least squares (WA-PLS) regression and Bayesian modelling. Two different calibration sets are used with each calibration method: 113S uses all 113 surface samples, 58S is a 58-sample subset of 113S. Coloured zones indicate the vegetation belts shown in Figure 1. LOESS smoothers (span 0.25, one robustifying iteration) are added to the residual plots. Reported performance statistics for each transfer function are coefficient of determination (r2), root mean square error of prediction (RMSEP), and maximum bias
The h-block cross-validation performance statistics for the 113S-based WA, WA-PLS and Bayesian models are shown in Table 2, with the respective leave-one-out cross-validation figures. The performance of WA and WA-PLS worsens moderately in h-block cross-validation, with some rise in RMSEP and maximum bias, and the decline of r2 by c. 0.1 with both models. At least a part of this decline is due to the removal of environmentally similar, spatially close analogues. The Bayesian model is less affected compared with WA and WA-PLS, with an r2 drop of just 0.04 in h-block cross-validation.
Comparison leave-one-out and h-block (h=100 km) cross-validation performance statistics for selected WA-, WA-PLS- and Bayesian-based calibration models for Tmjja, using all 113 calibration set sites. Reported statistics are root mean square error of prediction (RMSEP), coefficient of determination (r2), and maximum bias
Palaeotemperature reconstructions
The WA and WA-PLS T mjja reconstructions test as marginally non-significant (for WA, p = 0.069; for WA-PLS, p = 0.076). The MAT reconstruction, however, tests as significant (p = 0.004). The main features of the MAT reconstruction (supplementary data, available online) closely resemble those of the WA and especially the WA-PLS reconstructions, suggesting that the non-significance of the WA and WA-PLS reconstructions may be due to Type I error (Telford and Birks, 2011a).
Lake Kharinei Tmjja reconstructions based on the WA, WA-PLS and Bayesian models calibrated with 113S and 58S are presented in Figure 5. Both WA-PLS based reconstructions show a prominent Holocene thermal maximum (HTM) during the mid Holocene, at c. 8000–3000 cal. yr BP, with Tmjja around 10°C, c. 3°C above the present-day value. The differences in the 113S and 58S WA-PLS reconstructions are minor, the major difference being that the 113S-based reconstruction shows slightly flattened curve due to higher reconstructed early- and late-Holocene temperatures. WA shows similar temperatures compared with WA-PLS during the mid Holocene, but during both the early and late Holocene temperatures are significantly higher with WA, leading to a much reduced Holocene temperature range (c. 1.5°C). As with WA-PLS, switching from 113S to 58S in WA decreases early- and late-Holocene temperatures slightly. The Bayesian reconstructions have some major differences compared with the WA-PLS reconstructions. They share with the WA-PLS reconstructions the major feature that the mid Holocene (c. 7000–3000 cal. yr BP) is indicated as the warmest period. However, the Tmjja values during the HTM are significantly lower in the Bayesian reconstructions at c. 8.5–9°C, and the cooling indicated since HTM is c. 1°C in the Bayesian reconstructions compared with c. 3°C in the WA-PLS based reconstructions. The effect of calibration set size on the Bayesian reconstruction is negligible.

Lake Kharinei Tmjja reconstructions using weighted averaging (WA) and weighted averaging-partial least squares (WA-PLS) regression and Bayesian modelling. Reconstructions are prepared using two different calibration sets with each method: 113S uses all 113 surface samples, 58S is a 58-sample subset of 113S. Errors bars show the bootstrap-estimated standard errors for the WA and WA-PLS reconstructions, and the sample-specific 95 % error for the Bayesian reconstruction. LOESS smoothers (span 0.25, one robustifying iteration) are added to the reconstructions
LOESS smoothers (span 0.25, one robustifying iteration) were fitted through the WA, WA-PLS and Bayesian based reconstructions. Smoothing is here viewed as an exploratory tool that can be used to reveal the salient features of reconstructions. Note, however, that since the result of a Bayesian reconstruction is in fact the full probability density of possible past temperatures that are consistent with the data, the posterior mean and its LOESS smooth represent only simple summary aspects of the full reconstruction. It can be argued that it is actually more meaningful to explore several different smooths of the whole posterior distribution since then one can make inferences about the salient features of past temperature variation and their statistical significance in many different time scales (cf. Erästö and Holmström, 2006).
Discussion
The effect of calibration set size
Examination of the calibration set pollen curves (Figure 2) reveals several taxa with complex responses to Tmjja. For example, some non-arboreal pollen (NAP) taxa show responses to Tmjja with a clear bimodal component. The most important of these taxa is Gramineae, which shows a clear rise in the north, moving from taiga forests into the tundra. However, the greatest values of Gramineae are found in the steppe, south of the mixed forest zone. Similar behaviour is shown by Carex type and Equisetum (Figure 2) with strong maxima in the tundra but also rising in the steppe. All of these are broad taxa that include species typical of widely different environments, resulting in such complex behavior when observed at family or genus level (Seppä et al., 2004). These complicated responses to the environmental variable being reconstructed present a potentially significant problem for transfer function calibration. A pollen curve such as that of Gramineae (Figure 2) intuitively has significant indicator value – in the north near the Arctic treeline the family clearly indicates tundra conditions instead of taiga, and in the south it is strongly associated with steppe as opposed to mixed forest. However WA and WA-PLS, as well as the Bayesian method model taxon responses unimodally, and are thus unable to capture this information. For example, WA calculates for Gramineae a single temperature optimum of 13.7°C associated with the southern taiga, an environment not typical of the taxon. Through such unrepresentantive optima, bimodal taxon responses can cause biases in temperature estimates for pollen assemblages rich in these taxa. While WA-PLS is generally able to reduce biases compared with WA by adjusting the optima of the taxa abundant in high-residual samples (Birks, 2003; ter Braak and Juggins, 1993), this process still results in loss of information as bimodal real-world responses are modelled unimodally. For Bayesian modeling, we also examined the relationship between the prior and the posterior distributions of the optimal temperatures β k and the tolerances γ k for taxa that appeared only in few calibration lakes and fossil samples. Contrary to what one might expect, the prior did not dominate the posterior for such taxa showing that the prior distributions were sufficiently vague to allow even a relatively small number of data to have an effect on the posterior.
One way to mitigate the biases stemming from bimodal responses is to limit the spatial extent of the calibration set (cf. also Velle et al., 2011). Our smaller calibration set 58S omits the steppe samples and some of the mixed forest samples. The omission of steppe samples gives the aforementioned NAP taxa unimodal responses with maxima in the tundra. In both WA and WA-PLS, 58S gives improved performance over 113S in tundra temperatures in terms of smaller positive residuals, with a greater change towards smaller bias in WA than in WA-PLS. Significant positive residuals remain in the tundra in 58S, however, likely partially due to the so-called edge effects (cf. Birks, 2003; ter Braak and Juggins, 1993) which cause overestimation of optima for the coldest taxa and underestimation for the warmest taxa. With WA-PLS, overall performance is similar with 58S compared to 113S (Table 1) although a simpler model was used (two components versus three). The differences in the Bayesian models based on 113S and 58S are small, with a slight increase in r2 with 58S compared with 113S. In the WA- and WA-PLS-based Lake Kharinei palaeotemperature reconstructions, the smaller tundra residuals with 58S lead to slightly lower temperatures before c. 10 000 cal. yr BP and after c. 2000 cal. yr BP (Figure 5), the periods with most NAP-rich fossil assemblages (Figure 3). The removal of the steppe samples in 58S improves the indicator value of the tundra taxa showing complex responses over the longer climatic gradient of 113S, leading to lower bias in the transfer function, and probably a more credible palaeotemperature reconstruction from fossil tundra assemblages.
Bayesian transfer functions
Bayesian modelling offers some notable advantages over more classical methods. First, ecological knowledge can be explicitly embedded in the models. Second, the structure of the model formulated in terms of probabilities is transparent and the use of probability distributions instead of point estimates provides more information about the quantities of interest, including a rigorous assessment of the uncertainties associated with them. Third, the interpretation of the results is straightforward in the Bayesian approach. For example, Bayesian error intervals have direct interpretation in terms of the posterior probability. The corresponding concept in non-Bayesian statistics is based on the idea of repeated experiments, a view that can have serious difficulties in ecological applications where the number of replications is small (Ellison, 1996; Vasko et al., 2000). In our experiments, the calibration set prediction performance of the Bayesian model appeared to be somewhat better than that of the WA-PLS reconstruction method tested. Possible explanations for the better performance include taking into consideration the uncertainty concerning site specific latent variables (the probabilities p
ik
and
In Bayesian modelling, one should always check the sensitivity of the results to the selection of particular priors for the parameters. Both in Korhola et al. (2002) and Erästö and Holmström (2006) the prior distribution for the
Lake Kharinei reconstructions were also made with various prior standard deviations for the past temperature prior, including the values 1.0°C and 1.5°C tried in cross-validation, as well as with different prior means. The results obtained were not very sensitive to prior selection.
Despite its great potential, Bayesian modelling has not been extensively used in palaeoclimate reconstructions. One reason for its modest popularity may have been the computing demands which can be substantial when large proxy data sets are considered. However, with rapidly increasing computing power offered by modern PCs the situation has changed considerably during the past decade making the Bayesian approach more attractive. Given the total effort put into collecting and analysing the proxy data, it could be argued that if the Bayesian reconstruction model works well, it is worth the computation times needed. Still, better simulation techniques and suitable model approximations could be helpful in making the Bayesian approach more accessible (cf. Bhattacharya, 2004; Haslett et al., 2006).
Evaluation of Bayesian modelling versus WA and WA-PLS
While the WA-, WA-PLS- and Bayesian-based palaeotemperature reconstructions show broadly similar trends, there are major differences in the absolute reconstructed temperatures during different time periods. The major difference between the WA- and WA-PLS-based reconstructions is that WA indicates significantly higher temperatures from early- and late-Holocene tundra conditions compared with WA-PLS, whereas reconstructed temperatures from the mid-Holocene taiga assemblages are similar with both methods. These differences are likely explained by the large, positive cross-validation residuals of the WA-based models for tundra samples, leading to overestimation of temperatures from fossil tundra assemblages. While the WA-based reconstructions can thus be diagnosed as likely less reliable than those based on WA-PLS, it is striking how large the differences between the WA-PLS- and Bayesian-based reconstructions are, considering that both methods have excellent nominal performance in cross-validation, comparing favourably with pollen-based summer temperature (Tjul) transfer functions prepared from other geographic regions (cf. Birks and Seppä, 2004; Birks et al., 2010; Seppä and Bennett, 2003). Clearly at least one (or possibly both) of the ‘excellent’ transfer functions is in fact giving heavily biased results.
Validation of palaeoclimate inferences is a major concern for quantitative palaeoclimatology (e.g. Barber and Langdon, 2007; Bennion et al., 1995; Birks and Birks, 2003; Telford and Birks, 2011a). A key method is to compare the reconstruction with another, independent climate proxy (Birks, 2003). For the Lake Kharinei record plant macrofossils provide one such way to validate the WA-PLS- and Bayesian-based reconstructions (cf. Birks and Birks, 2003). Based on plant macrofossils (Salonen et al., 2011) as well as pollen accumulation rates (Väliranta et al., 2011), the northern limit of the taiga forest is suggested to have reached Lake Kharinei in the mid Holocene, c. 8000–2500 cal. yr BP. The northern limit of the taiga forest is today associated with a Tmjja value of c. 9°C, suggesting that the WA-PLS-based HTM value may be realistic whereas the Bayesian-based HTM temperatures may be too low. In other studies from this part of Northern Russia, several lines of evidence have been invoked to suggest a cooling of 2–3°C or more from the HTM to the present day. These include the geographical shift of vegetation and permafrost zones (Kultti et al., 2003, 2004; MacDonald et al., 2000; Oksanen et al., 2001; Väliranta et al., 2003), indicator species macrofossils (Kultti et al., 2004), and quantitative reconstructions based on pollen (Andreev and Klimanov, 2000; Andreev et al., 2005) and chironomids (Andreev et al., 2005). The slight cooling of c. 1°C in the Bayesian reconstructions is unlikely to account for these environmental changes, including a treeline withdrawal of 150 km or more (MacDonald et al., 2000; Salonen et al., 2011) in the Pechora Basin, whereas the cooling of 3°C in the WA-PLS reconstruction may be more realistic. We consider the amplitude of late-Holocene cooling in the Bayesian model to be too small, possibly explained by too low reconstructed HTM temperatures.
We also computed the posterior probabilities of the HTM (8000–3000 cal. yr BP) temperatures reconstructed by the WA and WA-PLS models. The results (supplementary data, available online) confirm that these temperatures indeed are significantly higher than the reconstructions produced by the Bayesian model. Assuming that the Bayesian model is correct, the posterior probability of observing temperatures at least as high as the WA temperatures is less than 5% for all fossil samples. For WA-PLS, only 5 out of 42 fossil samples produce reconstructions that could be exceeded by the Bayesian model with posterior probability higher than 5%.
Spatial autocorrelation (Legendre, 1993) in the smooth Tmjja gradient of our study area likely accounts for some of the good performance of our calibration models (cf. Telford and Birks, 2005, 2009), especially in the densely spaced mixed-forest samples, and thus inflates the r2 and RMSEP figures to some degree. However, based on our tests none of the three methods (WA, WA-PLS and Bayesian modelling) is greatly affected by spatial autocorrelation with this calibration set (Table 2), with the Bayesian model in particular showing only a slight decrease in performance.
To shed light on the reasons for differences between the reconstructions obtained with WA-PLS and Bayesian modelling, we compared the pollen compositions in the core samples and in the calibration set. As a measure of analogue quality we used the squared chord distance (Overpeck et al., 1985) between the vectors of relative abundances of the fossil and calibration set data. As was to be expected on the basis of the northern location of Lake Kharinei (Figure 1), the closest analogues with the core samples were found among the calibration set lakes in the colder end of the temperature gradient (see supplementary data, available online, for the list of analogues). Further exploration suggested that, for the Bayes 1 and Bayes 2 models that use priors similar to those in Vasko et al. (2000) and Korhola et al. (2002), the reconstruction relies rather heavily on the few closest analogues only, overlooking warmer lakes that still are relatively close in the Euclidean metric. As this may result in overfitting that produces too low reconstructed temperatures, we examined also the effects of prior choices of the parameters αk, β k and γ k , and concluded that only the prior of β k has a noticeable influence on the reconstruction. One motivation for using a different prior for β k in the Bayes 3 and Bayes 4 models was to avoid this potential bias towards colder temperatures. Compared with the Bayes 2 setup, the reconstructed temperatures with species-specific β k priors were on average approximately 0.25°C higher but the salient features of the temperature time series did not change.
Good figures for statistics like r2 and RMSEP may hide significant biases in specific environment types. In our calibration models, a majority of surface samples is from the mixed forest zone (Figures 1 and 2). The excellent prediction accuracy for the mixed forest samples weighs heavily in the calculated r2 and RMSEP, belying the performance in tundra and taiga which, while satisfactory, is significantly worse than in mixed forest (Figure 4; cf. Telford and Birks, 2011b). As the largest mismatch between WA-PLS- and Bayesian-based reconstructions is during the mid-Holocene taiga stage as opposed to the early- and late-Holocene forest–tundra/tundra stages, it is possible that the main discrepancies stem from underestimation of palaeotemperatures specifically from taiga assemblages by the Bayesian method. While the Bayesian models perform well along much of the climatic gradient, the Bayesian cross-validation residuals in taiga are among the largest seen in this study, especially with 113S (Figure 4). The Bayesian taiga residuals range from −3.0°C to +3.1°C, and while there is no systematic negative bias for the modern samples, it seems likely that temperatures from the specific fossil taiga assemblages encountered in the Lake Kharinei core are underestimated by the Bayesian method. The decrease of performance by the Bayesian transfer functions in taiga environments may reflect in part a specific shortcoming of our calibration set. The taiga samples are separated from the rest of the calibration set by large spatial and environmental gaps (Figure 1), which likely contributes to biases in calculated optima for taiga taxa and in reconstructed temperatures from fossil taiga assemblages. The overall performance of the Bayesian models is highly promising, and this reconstruction method should be further evaluated, including the assessment of the prior parameter selection principles, and including testing with new calibration sets in different geographic areas.
Conclusions
We developed pollen-based calibration models for summer temperature using WA and WA-PLS regression, as well as the little-used Bayesian modelling-based method. The prepared transfer functions were tested in palaeoclimate reconstruction based on a Holocene fossil pollen sequence from northeast European Russia. The Bayesian models show significantly improved performance in leave-one-out cross-validation over WA and WA-PLS, while based on h-block-cross-validation the Bayesian performance is only slightly affected by spatial autocorrelation. However, comparison of the reconstructions with several independent records reveals some clear biases in the Bayesian palaeotemperatures. The biases in the Bayesian reconstructions likely reflect in part some specific deficiencies of our calibration data, emphasizing the need to further test the Bayesian method using other calibration sets.
As the priors can significantly influence both the Bayesian cross-validation performance and the final reconstruction, a key challenge in future Bayesian work is the identification of ecologically and climatologically realistic prior parameters, in terms of the timescale and the geographic context of each reconstruction scenario.
We found calibration models based on a geographically limited subset of the full calibration set to show some smaller biases in cross-validation. This is probably due to the simpler ecological response of several taxa on the smaller spatial scale, easier modelled on methods which assume unimodal taxon responses.
While the WA-PLS and Bayesian reconstructions presented in this study show similar trends, there are major differences in the absolute reconstructed temperatures, suggesting that caution should be exercised with apparently well-performing transfer functions, as they may produce significantly biased palaeoclimate reconstructions, even when statistical performance has not been significantly inflated by spatial autocorrelation effects. This suggests that far-reaching inferences should be avoided when working with any single reconstruction method, and underlines the importance of validation of quantitative reconstructions by comparison with independent proxies.
Footnotes
Acknowledgements
Vivienne J. Jones, Nikolay Letuka, Olga Malozemova, Vasily Ponomarev, and Angela Self are thanked for help during fieldwork, Maija Heikkilä for help with the pollen analysis, and Minna Väliranta for providing the macrofossil data.
This study was financed by the CARBO-North project (EU Sixth Framework Programme, Global Change and Ecosystems sub-programme, project number 036993), the Academy of Finland project 1132588 (Quantitative Vegetation Reconstructions), the Finnish Graduate School of Geology, and the Finnish Doctoral Programme in Stochastics and Statistics.
References
Supplementary Material
Please find the following supplemental material available below.
For Open Access articles published under a Creative Commons License, all supplemental material carries the same license as the article it is associated with.
For non-Open Access articles published, all supplemental material carries a non-exclusive license, and permission requests for re-use of supplemental material or any part of supplemental material shall be sent directly to the copyright owner as specified in the copyright notice associated with the article.
