Abstract
Transfer functions are widely used in palaeoecology to infer past environmental conditions from fossil remains of many groups of organisms. In contrast to traditional training-set design with one observation per site, some training-sets, including those for peatland testate amoeba-hydrology transfer functions, have a clustered structure with many observations from each site. Here we show that this clustered design causes standard performance statistics to be overly optimistic. Model performance when applied to independent data sets is considerably weaker than suggested by statistical cross-validation. We discuss the reasons for these problems and describe leave-one-site-out cross-validation and the cluster bootstrap as appropriate methods for clustered training-sets. Using these methods we show that the performance of most testate amoeba-hydrology transfer functions is worse than previously assumed and reconstructions are more uncertain.
Introduction
Transfer functions are widely used to generate quantitative environmental reconstructions in palaeoecology. Traditional training-set design (e.g. Birks et al., 1990) has one observation per site. An alternative design with many observations at each site is used for some training-sets, including those for chironomid-lake depth (Kurek and Cwynar, 2009); coastal diatom-water chemistry (Saunders et al., 2008); diatom- and foraminifera-sea level (Leorri et al., 2008; Massey et al., 2006; Zong and Horton, 1999); and testate amoeba-hydrology transfer functions (Charman, 2001; Mitchell et al., 2008). Although the implications of, and methods for, such clustered data are well known in other branches of statistics (Walsh, 1947), the implications of this design have been neglected for transfer functions.
One motivation for developing clustered training-sets is the presence within each site of substantial environmental gradients, which may be large relative to the differences between sites. This contrasts with the traditional one observation per site training-set where typically the environmental variable (e.g. lake-pH) is assumed to be spatially homogeneous at each site. Standard methods for assessing the performance of transfer functions assume that the observations are independent and are thus inappropriate for clustered data. Lack of independence between observations, either because of spatial autocorrelation or a clustered design, will cause performance statistics to be over-optimistic (Telford and Birks, 2005). Telford and Birks (2009) have developed cross-validation methods appropriate for spatially autocorrelated training-sets; here we consider the problem of clustered training-sets and develop appropriate cross-validation methods. We focus on testate amoeba-hydrology transfer functions from peatlands, which have become increasingly important in shaping our understanding of Holocene climatic change (Charman et al., 2004, 2006).
Indications that standard tools are misleading
Training sets for peatland testate amoeba transfer functions have a highly uneven spatial structure, with samples from individual sites often only separated by a few metres, while sites may be separated by tens or hundreds of kilometres. Ordinations of testate amoeba data frequently show distinct clustering of observations from the same bog (e.g. Charman et al., 2007; Swindles et al., 2009) and site identity typically explains a large proportion of variance in constrained ordinations (Figure 1).

Figure 1. Variance partitioning, using constrained correspondence analysis, of the inertia in the different data-sets into components explained by water-table depth (light grey), site (dark grey), and covariance between site and water-table depth (black). Unexplained inertia is shown in white. See Table 2 for data sources. Site is a statistically significant predictor for all training-sets except Poland 2005.
To provide an independent estimate of transfer function performance, we apply five transfer functions to all comparable independent data sets, with appropriate corrections for taxonomic and methodological differences (Appendix 1). Table 1 shows that most transfer functions perform worse than suggested by leave-one-out (LOO) cross-validation when applied to independent data. Methodological explanations for the poor model performance can largely be excluded. Differences in time-discrete water-table measurements cannot explain the differences in rank-order shown by Spearman’s ρ. Differences in sample preparation and analysis, or residual taxonomic biases, cannot explain poor performance where these are closely harmonised (e.g. Polish data). Performance is particularly poor for two data sets from Scotland (Payne, 2010; Potts and Blackford, unpublished data, 2010); in the case of the Moss of Achnacree, this is likely to be due to the limited water-table depth (WTD) range at a site that has experienced hydrological modification. Previously presented tests with transfer functions from different regions have frequently (Booth et al., 2008; Charman et al., 2007; Payne, 2011), but not universally (e.g. Swindles et al., 2009), shown poorer performance than LOO cross-validation suggests. We therefore conclude that model performance in practice is weaker than conventional cross-validation suggests.
Table 1. Transfer function performance for five training-sets tested by leave-one-out (LOO) cross-validation and application to independent test-sets, showing transfer function method used, number of samples (n), root mean squared error of prediction (RMSEP), R², and Spearman’s ρ. Some values differ from previously published values because of minor variation in sample selection and taxonomic harmonisation. Values in round brackets show performance when small taxa are excluded to account for differences in the use of back-sieving (Appendix 1). R² and ρ values in square brackets denote negative correlations.
Back-sieving not used so small taxa excluded.
Lower counts of around 100 tests.
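The three statistics reported in Table 1 respond differently to a systematic offset between predicted and observed water-table depths. The following Python sketch illustrates this with made-up values. The function names are illustrative assumptions, R² is taken to be the squared Pearson correlation, and none of this is the code used for the original analysis, which was run in R.

```python
import math

def rmsep(observed, predicted):
    """Root mean squared error of prediction."""
    return math.sqrt(sum((p - o) ** 2 for o, p in zip(observed, predicted)) / len(observed))

def pearson(a, b):
    """Pearson correlation coefficient."""
    ma, mb = sum(a) / len(a), sum(b) / len(b)
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    return cov / math.sqrt(sum((x - ma) ** 2 for x in a) * sum((y - mb) ** 2 for y in b))

def ranks(values):
    """Ranks starting at 1, with tied values sharing their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    out = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        for k in range(start, end + 1):
            out[order[k]] = (start + end) / 2 + 1
        start = end + 1
    return out

def spearman(a, b):
    """Spearman's rho: Pearson correlation of the ranks."""
    return pearson(ranks(a), ranks(b))

# Toy example (made-up values): every prediction is offset by a constant 10 cm.
observed = [1.0, 2.0, 3.0, 4.0]
predicted = [11.0, 12.0, 13.0, 14.0]
error = rmsep(observed, predicted)              # penalises the offset
r_squared = pearson(observed, predicted) ** 2   # unaffected by the offset
rho = spearman(observed, predicted)             # unaffected by the offset
```

In this toy case the RMSEP equals the offset while R² and ρ remain at 1, which is why a poor rank-order correlation cannot be attributed to a constant measurement offset alone.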
Appropriate cross-validation methods for clustered data
Typically, transfer function model performance is assessed by either leave-one-out (LOO) or bootstrap cross-validation. In LOO, one observation at a time is omitted from the training-set of size n and its environmental value is predicted using the remaining n−1 observations. For clustered data with m sites, this can be extended to leave-one-site-out cross-validation (LOSO), where all observations from one site are omitted from the training-set and predicted using the observations from the remaining m−1 sites. LOSO is also known as leave-one-cluster-out cross-validation and sometimes as leave-one-group-out cross-validation (confusingly, the latter term is also used for k-fold cross-validation in which k groups are created at random).
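The difference between the two schemes lies only in which observations are held out together. The sketch below, in Python, makes this explicit. It is a minimal illustration under stated assumptions: the nearest-neighbour predictor is a deliberately simple stand-in for a transfer function, and the data, function names and site labels are invented for the example.

```python
import math
from collections import defaultdict

def cv_folds(site, scheme):
    """Test-index lists: one per observation (LOO) or one per site (LOSO)."""
    if scheme == "LOO":
        return [[i] for i in range(len(site))]
    if scheme == "LOSO":
        by_site = defaultdict(list)
        for i, s in enumerate(site):
            by_site[s].append(i)
        return list(by_site.values())
    raise ValueError(f"unknown scheme: {scheme}")

def cv_rmsep(x, y, site, scheme, fit_predict):
    """Cross-validated RMSEP; fit_predict(train_x, train_y, test_x) returns predictions."""
    squared_error = 0.0
    for test in cv_folds(site, scheme):
        held_out = set(test)
        train = [i for i in range(len(y)) if i not in held_out]
        preds = fit_predict([x[i] for i in train], [y[i] for i in train],
                            [x[i] for i in test])
        squared_error += sum((p - y[i]) ** 2 for p, i in zip(preds, test))
    return math.sqrt(squared_error / len(y))

def nearest_neighbour(train_x, train_y, test_x):
    """Stand-in model (an assumption): predict the y of the closest training x."""
    return [train_y[min(range(len(train_x)), key=lambda j: abs(train_x[j] - tx))]
            for tx in test_x]

# Toy data (made up): two sites whose observations resemble their site-mates.
x = [0.0, 0.1, 0.2, 5.0, 5.1, 5.2]
y = [10.0, 10.0, 10.0, 20.0, 20.0, 20.0]
site = ["A", "A", "A", "B", "B", "B"]
loo = cv_rmsep(x, y, site, "LOO", nearest_neighbour)
loso = cv_rmsep(x, y, site, "LOSO", nearest_neighbour)
```

In this toy case LOO always finds a site-mate of the omitted observation in the training-set and reports no error, whereas LOSO must predict each site from the other and reports an RMSEP of 10.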
In standard bootstrap cross-validation, n observations are selected from the training set with replacement, and used to predict the remaining observations and new observations. There are several possible bootstrap schemes available for clustered data including the cluster bootstrap, where m clusters are selected at random with replacement, and the two-level bootstrap where m clusters are selected at random and observations are selected at random from within each cluster (Field and Welsh, 2007). Here we use the cluster bootstrap following the findings of Field and Welsh (2007) that the two-level bootstrap and the related reverse-two-level bootstrap generate excessive variability.
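The two resampling schemes can be contrasted in a few lines of Python. This is a sketch rather than the implementation used here: the function names and site labels are hypothetical, and treating every observation from an undrawn site as out-of-bag is an assumption about how the cluster bootstrap test set is formed.

```python
import random
from collections import defaultdict

def standard_bootstrap(site, rng):
    """Draw n observations with replacement; undrawn observations are out-of-bag."""
    n = len(site)
    train = [rng.randrange(n) for _ in range(n)]
    drawn = set(train)
    out_of_bag = [i for i in range(n) if i not in drawn]
    return train, out_of_bag

def cluster_bootstrap(site, rng):
    """Draw m sites with replacement, keeping every observation of each drawn site."""
    by_site = defaultdict(list)
    for i, s in enumerate(site):
        by_site[s].append(i)
    names = list(by_site)
    drawn = [rng.choice(names) for _ in names]
    drawn_set = set(drawn)
    train = [i for s in drawn for i in by_site[s]]
    # Assumption: observations from sites that were never drawn form the test set.
    out_of_bag = [i for i, s in enumerate(site) if s not in drawn_set]
    return train, out_of_bag

# Toy design (made up): four sites with three observations each.
site = ["A"] * 3 + ["B"] * 3 + ["C"] * 3 + ["D"] * 3
rng = random.Random(1)
train, out_of_bag = cluster_bootstrap(site, rng)
train_sites = {site[i] for i in train}
oob_sites = {site[i] for i in out_of_bag}
```

Under the cluster bootstrap no site contributes to both the training and the out-of-bag sets, whereas the standard bootstrap will usually leave site-mates of the out-of-bag observations in the training set.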
Application to testate amoeba training-sets
We determine the performance of 14 published testate amoeba transfer functions for water-table depth (WTD) using both robust cross-validation methods and standard methods. In the case of the Jura training-set (Mitchell et al., 1999) we omit samples with estimated rather than measured water-table depths. For all training-sets, we use weighted averaging with inverse deshrinking as this transfer function method is fairly robust to spatial autocorrelation (Telford and Birks, 2005) and so should also be fairly robust to clustered data. Assemblage data were square root transformed prior to analysis. All analyses were carried out in R (R Development Core Team, 2010) with the rioja library (Juggins, 2011).
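For readers unfamiliar with the method, weighted averaging with inverse deshrinking on square root transformed data can be sketched as follows. This is an independent Python illustration, not the rioja implementation: the function names and the three-sample data set are assumptions made for the example, and taxa absent from the training-set are simply ignored at prediction time.

```python
import math

def _initial_estimate(row, optima):
    """Abundance-weighted mean of the optima of the taxa present in one sample."""
    pairs = [(a, u) for a, u in zip(row, optima) if u is not None]
    return sum(a * u for a, u in pairs) / sum(a for a, _ in pairs)

def wa_fit(abundance, env):
    """abundance[i][k]: abundance of taxon k in sample i; env[i]: observed value."""
    y = [[math.sqrt(a) for a in row] for row in abundance]
    optima = []
    for k in range(len(y[0])):
        weight = sum(row[k] for row in y)
        # A taxon's optimum is the abundance-weighted mean of the environment.
        optima.append(sum(row[k] * e for row, e in zip(y, env)) / weight
                      if weight > 0 else None)
    initial = [_initial_estimate(row, optima) for row in y]
    # Inverse deshrinking: regress observed values on the initial estimates.
    mi, me = sum(initial) / len(initial), sum(env) / len(env)
    slope = (sum((i - mi) * (e - me) for i, e in zip(initial, env))
             / sum((i - mi) ** 2 for i in initial))
    intercept = me - slope * mi
    return optima, intercept, slope

def wa_predict(model, abundance):
    """Apply the fitted optima and deshrinking regression to new samples."""
    optima, intercept, slope = model
    return [intercept + slope * _initial_estimate([math.sqrt(a) for a in row], optima)
            for row in abundance]

# Toy training-set (made up): two taxa, three samples, WTD in cm.
train_abundance = [[4, 0], [0, 4], [1, 1]]
train_wtd = [0.0, 10.0, 5.0]
model = wa_fit(train_abundance, train_wtd)
fitted = wa_predict(model, train_abundance)
new_prediction = wa_predict(model, [[9, 1]])[0]
```

The initial estimates span a narrower range than the observed values because averaging is applied twice, and the deshrinking slope (1.5 in this toy case) restores the range.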
While differences are not always great, all transfer functions except for one exhibit worse performance with LOSO than LOO cross-validation (Table 2). One transfer function has a LOSO RMSEP greater than the standard deviation of WTD. There are several possible reasons for this deterioration in performance. It could be simply an artefact because the estimates are based on fewer observations as more observations are omitted during LOSO than LOO. We tested for the importance of this factor by running a modified cross-validation scheme termed leave-many-out (LMO) that omits as many observations as LOSO when making each prediction but with the observations chosen at random rather than being from the same site. We repeated this analysis 100 times to get a distribution of performance statistics and tested if the observed LOSO RMSEP is worse than the 95th percentile of the leave-many-out RMSEP. Only the Poland (Lamentowicz and Mitchell, 2005) training set had a LOSO performance that was not statistically significantly worse than expected from leaving out so many observations during cross-validation.
Table 2. Root mean squared error of prediction for 14 published training-sets calculated with leave-one-out (LOO), leave-one-site-out (LOSO), and leave-many-out (LMO) cross-validation. The 95th percentile of the LMO distribution is shown. Results are based on weighted averaging with inverse deshrinking on square root transformed data. Also shown are the WTD range (cm), number of sites (m) and observations (n), and the standard deviation of WTD (sd).
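The leave-many-out comparison described above can be sketched in Python as follows. Several details are assumptions rather than the procedure actually used: the random folds are given the same sizes as the sites, a nearest-neighbour predictor stands in for the transfer function, and the data are invented.

```python
import math
import random
import statistics
from collections import Counter, defaultdict

def nearest_neighbour(train_x, train_y, test_x):
    """Stand-in model (an assumption): predict the y of the closest training x."""
    return [train_y[min(range(len(train_x)), key=lambda j: abs(train_x[j] - tx))]
            for tx in test_x]

def cv_rmsep(x, y, folds, fit_predict):
    """RMSEP when each fold in turn is omitted and then predicted."""
    squared_error = 0.0
    for test in folds:
        held_out = set(test)
        train = [i for i in range(len(y)) if i not in held_out]
        preds = fit_predict([x[i] for i in train], [y[i] for i in train],
                            [x[i] for i in test])
        squared_error += sum((p - y[i]) ** 2 for p, i in zip(preds, test))
    return math.sqrt(squared_error / len(y))

def loso_folds(site):
    """One fold per site."""
    by_site = defaultdict(list)
    for i, s in enumerate(site):
        by_site[s].append(i)
    return list(by_site.values())

def lmo_folds(site, rng):
    """Random folds with the same sizes as the sites (assumed LMO design)."""
    indices = list(range(len(site)))
    rng.shuffle(indices)
    folds, start = [], 0
    for size in Counter(site).values():
        folds.append(indices[start:start + size])
        start += size
    return folds

# Toy data (made up): four sites, three observations each.
x = [centre + step for centre in (0.0, 5.0, 10.0, 15.0) for step in (0.0, 0.1, 0.2)]
y = [value for value in (10.0, 20.0, 30.0, 40.0) for _ in range(3)]
site = [name for name in "ABCD" for _ in range(3)]

rng = random.Random(42)
loso = cv_rmsep(x, y, loso_folds(site), nearest_neighbour)
lmo = [cv_rmsep(x, y, lmo_folds(site, rng), nearest_neighbour) for _ in range(100)]
lmo_p95 = statistics.quantiles(lmo, n=20, method="inclusive")[-1]  # 95th percentile
worse_than_expected = loso > lmo_p95
```

If the LOSO RMSEP exceeds the 95th percentile of the LMO distribution, the loss of performance cannot be attributed solely to the smaller training-set used for each prediction.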
LOSO performance would be worse than LOO performance if each site only covered part of the environmental gradient. This factor is likely to be of minor importance, except in the Greece training-set as all the other training-sets have replication along the WTD gradient and variance partitioning shows only a small covariance between WTD and site for most of the training-sets (Figure 1).
Because the WTD measurements for most training-sets are one-time spot measurements, there may be site-specific errors in WTD if heavy rainfall or prolonged drought occurs between sampling the first and last bog. Most training-sets were collected within a short period, so major changes in WTD are unlikely to have occurred; however, a few training-sets were acquired over a longer period, and this may be an important factor (Charman et al., 2007; Lamentowicz M et al., 2008).
There are likely to be important non-hydrological controls on amoebae that differ between sites. One is pollutant loading: recent studies show sulphur (Payne et al., 2010), reactive nitrogen (Mitchell, 2004; Nguyen-Viet et al., 2004), heavy metals (Nguyen-Viet et al., 2007, 2008) and particulate matter (Meyer et al., 2010) to be important. Many transfer function studies have included sites of differing pH and trophic status, and there is evidence for differences in amoeba communities and their hydrological responses between fens and bogs (Jassey et al., 2011; Payne, 2011). Plant communities, which differ between sites in many studies, shape both the physical and biotic environment of amoebae through processes such as root exudation and allelopathy, particularly the production of phenolic compounds (Jassey et al., 2011). The fundamental hydrological controls on amoeba communities are poorly understood. Although water-table depth consistently explains the largest proportion of variance in gradient studies, it is clearly not water-table depth per se that is important to amoebae, which usually live well above the water-table. Water-table depth is simply a robust measurement that serves as a proxy for the hydrological variables which do affect amoebae, such as water film thickness and variability in the top few centimetres of moss where amoebae live (Sullivan and Booth, 2011). These variables may be controlled by fine-scale structural details of the peat and plant communities.
Predictors of LOSO relative performance
In an attempt to understand the attributes of training-sets that have a large decrease in performance with LOSO cross-validation, we regress the decrease in performance, standardised by dividing by the standard deviation of WTD, against the number of sites and observations, the proportion of variance explained by WTD, site, and the covariance between WTD and site (Figure 2). Of these predictors, only the proportion of variance explained by WTD is a statistically significant predictor of the deterioration in performance. Although the regression is not statistically significant, there appears to be an increased risk of a large reduction in performance for training sets with few sites.

Figure 2. Scatter plots of the relative decrease in performance against different predictors: (a) number of sites; and proportion of variance explained by (b) site, (c) water-table depth and (d) covariance between water-table depth and site in a CCA.
Error decomposition
The magnitude of the RMSEP is not necessarily a good guide to the utility of a transfer function. If, as is usually the case in testate amoeba palaeoecology, one is interested only in identifying relatively wet and dry phases, then the absolute value of the reconstruction is not very important. Thus, even transfer functions with a large RMSEP could potentially have utility.
For each site in the clustered training-set, we can decompose the total sum of squares of residuals into the proportion explained by site-specific offsets or biases and the residual variation. Table 3 shows that when LOSO is used instead of LOO, the site specific offset increases much more than the residual variation in both absolute and relative terms. This suggests that the absolute values of reconstructions are much more uncertain, but the relative values are only slightly more uncertain than LOO suggests.
Table 3. Decomposition of the mean total sum of squares of the transfer function residuals into the portion explained by site-specific offsets and the residual variation for both LOO and LOSO cross-validation, and the ratio of the LOSO and LOO results.
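The decomposition rests on the identity that, within a site, the sum of squared residuals equals the number of observations times the squared mean residual (the site-specific offset) plus the sum of squared deviations from that mean. A minimal Python sketch follows; the function name and residual values are invented for illustration, and how the per-site terms are averaged across sites is left open.

```python
from collections import defaultdict

def decompose_by_site(residuals, site):
    """Split each site's sum of squared residuals into offset and within-site parts."""
    by_site = defaultdict(list)
    for r, s in zip(residuals, site):
        by_site[s].append(r)
    parts = {}
    for s, values in by_site.items():
        mean_residual = sum(values) / len(values)      # site-specific offset (bias)
        offset_ss = len(values) * mean_residual ** 2
        within_ss = sum((v - mean_residual) ** 2 for v in values)
        parts[s] = (offset_ss, within_ss)
    return parts

# Toy residuals (made up): site A is biased, site B is not.
residuals = [1.0, 3.0, -1.0, 1.0]
site = ["A", "A", "B", "B"]
parts = decompose_by_site(residuals, site)
```

A site with a large offset term but a small within-site term would give reconstructions with uncertain absolute values but reliable relative changes.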
Reconstruction errors
Sample-specific (s1; Birks, 1995; Birks et al., 1990) bootstrap errors for the cluster bootstrap will always be larger than those from the standard bootstrap. Figure 3 shows the WTD reconstruction for Jelenia Wyspa, Poland (Lamentowicz M et al., 2007) using the Poland 2008 training set, with sample-specific bootstrap errors using both bootstrap techniques. Bootstrap errors vary by sample but are in all cases greater when using the cluster bootstrap and for some samples the errors are more than double.

Figure 3. Water-table reconstruction from Jelenia Wyspa, Poland (Lamentowicz M et al., 2007) calculated using weighted averaging with inverse deshrinking on square root transformed data with the expanded Polish training set (Lamentowicz M et al., 2008). Reconstructions (black) are based on 1000 bootstrap predictions (50 of which are shown in grey) for (a) conventional bootstrap and (b) cluster bootstrap. The standard deviation of the bootstrap predictions (error component s1) is shown with vertical black lines.
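Given a set of bootstrap predictions for the fossil samples, the error component s1 is the standard deviation of those predictions for each sample. The Python sketch below shows the calculation on invented numbers. Taking the mean of the bootstrap predictions as the reconstruction and using the n−1 denominator for the standard deviation are assumptions, as is the function name.

```python
import statistics

def summarise_bootstrap(boot_predictions):
    """boot_predictions[b][j]: prediction for fossil sample j in bootstrap cycle b."""
    per_sample = list(zip(*boot_predictions))
    reconstruction = [statistics.mean(p) for p in per_sample]   # assumed summary
    s1 = [statistics.stdev(p) for p in per_sample]              # spread across cycles
    return reconstruction, s1

# Toy predictions (made up): three bootstrap cycles, two fossil samples.
boot_predictions = [[1.0, 10.0], [3.0, 10.0], [5.0, 10.0]]
reconstruction, s1 = summarise_bootstrap(boot_predictions)
```

The same summary applies to both bootstrap schemes; only the way the training-set is resampled in each cycle differs.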
Recommendations
Given our results, improvements can be made in both the generation and application of clustered training-sets. We make four recommendations for generating new training-sets. These should be followed where practical, although it may not be possible to satisfy all of them simultaneously. First, efforts should be made to sample the full environmental gradient at each site, or at least to ensure that all parts of the gradient are replicated in several sites. Ideally, the gradients should be uniformly sampled at each site (Telford and Birks, 2011). Second, approximately the same number of observations should be made at each site, so that in LOSO cross-validation the number of observations omitted is close to constant. Third, a large number of sites should be sampled, as the cluster bootstrap is not appropriate for data sets with few clusters. Finally, the sites should be similar to each other with respect to, for example, vegetation and climate, with the proviso that care is taken to include sufficient diversity of sites to ensure that all fossil samples have good analogues in the training-set.
We recommend that the robust cross-validation methods developed here are used when testing the performance of clustered training-sets. We anticipate that the performance statistics of transfer function methods robust to autocorrelation (e.g. WA) will deteriorate less with robust cross-validation than methods more sensitive to autocorrelation (e.g. WAPLS with several components). If there is a choice of training-set that could be applied to the fossil data, we recommend, all else being equal, using the training-set with the smallest loss of performance when robust cross-validation is used. Single-site training-sets (e.g. Booth et al., 2008; Payne et al., 2008) will be immune to cluster problems but this may be offset by poor reconstructive ability. As always in quantitative palaeoecology, caution should be used in interpreting small changes in reconstructions and replication using multicore, multiproxy and multisite records is desirable.
Conclusions
Published performance statistics of testate amoeba transfer functions are over-optimistic because of the clustered design of the training-sets. LOO cross-validation is biased by the lack of independence of the observations. As amoeba communities in a sample tend to be more similar to other samples from the same site than to samples from different sites, if samples from the same site remain in the training-set during cross-validation, then the model will generate unrealistically accurate predictions of water-table depth in the training-set.
Appendix 1
Details of taxonomic harmonisation showing groupings and nomenclatural changes made to the original data. In addition to these changes small taxa (Corythion spp., Trinema spp., Euglypha rotunda type, Euglypha cristata, Cryptodifflugia oviformis, Difflugia pulex type and Pseudodifflugia fulva type) were eliminated where there was a difference in preparation method between training- and test-sets (Payne, 2009).
Acknowledgements
We thank HJB Birks for his comments on this manuscript. Leave-one-site-out cross-validation is implemented in the rioja library in R, and code for the cluster bootstrap is available from RJT. This is publication no. A358 from the Bjerknes Centre for Climate Research. RJP conceived and coordinated the project, compiled the data and carried out the tests with independent data-sets. RJT devised and implemented the cross-validation procedures. RJP and RJT wrote the paper. Other authors contributed data, discussed the taxonomic harmonisation issues and commented on the interpretation of the results and manuscript.
RJP was supported by a Humanities Research Fellowship from the University of Manchester and a Study Grant from the British Institute at Ankara. Norwegian Research Council projects ARCTREC and PES helped support RJT.
