Abstract
Empirical data in the form of many chironomid-based temperature reconstructions give an excellent opportunity to assess the chironomid approach to temperature reconstruction by testing its reproducibility. Brooks et al. (The Holocene 22(12) 2012 (this issue)) offer a critique of points discussed in Velle et al. (The Holocene 20 (2010) 989–1002), but fail to explain the poor reproducibility found when Holocene chironomid-based temperature reconstructions are compared. We discuss the issues raised by Brooks et al. (2012) and cite studies that demonstrate the complexity involved. We are grateful to Brooks et al. (2012) for contributing to the discussion. However, they rely uncritically on transfer functions and the resulting reconstructions as representatives of true patterns in nature. A major source of bias involved when chironomids are used as a palaeoenvironmental proxy is the response to confounding gradients. Many of the challenges discussed in the Forum Article, in the comment, and in the reply are also valid for other research fields within palaeoecology. The challenges should still be properly addressed in chironomid research.
Validation of results
Inconsistent temperature reconstructions
As Brooks et al. (2012) note, Velle et al. (2010a) have re-opened a 20-year-old debate on biological indicators. Twenty years ago there was an active discussion about whether chironomid assemblages are reliable indicators of past temperatures. There is, however, one important difference between the situation today and 20 years ago. Now, there are empirical data in the form of many chironomid-based temperature reconstructions. So far, more than 20 Holocene chironomid-based temperature reconstructions have been published from Norway, Sweden, and Finland (Antonsson et al., 2006; Bigler et al., 2002, 2003; Heider, 2004; Heinrichs et al., 2005; Korhola et al., 2002; Larocque and Bigler, 2004; Lüder, 2007; Luoto et al., 2010; Paus et al., 2011; Seppä et al., 2002; Velle et al., 2005a, 2005b, 2010b, 2011). Many more have been published worldwide. The Velle et al. (2010a) Forum Article was written to shed light on a striking problem that empirical data and reconstructions have revealed: when Holocene temperature reconstructions from different sampling localities are compared there are many instances of strongly mismatching curves (Figure 1). We feel that many investigators have overlooked this problem and we hope that our Forum Article and the ensuing discussion and research will contribute towards resolving it. Similarities with other proxies or multiple sites are needed to confirm results, but the results are too uncertain to assess whether discordances are real or are caused by confounding factors.

Figure 1. Chironomid-inferred mean July air temperatures adjusted for glacioisostatic rebound. BJO: Bjørnfjelltjønn (Brooks, 2006); BRU: Brurskardet (Velle et al., 2010b); FIN: Finse Stasjonsdam (Velle et al., 2005a); GIL: Gilltjärnen (Antonsson et al., 2006); HOL: Holebudalen (Velle et al., 2005a); ISB: Isbenttjønn (Lüder, 2007); L850: Lake 850 (Larocque and Bigler, 2004); NJU: Njulla (Bigler et al., 2003); OYK: Vestre Økjamyrtjønn (Velle et al., 2005a); RAT: Råtåsjøen (Velle et al., 2005b); REI: Reiardalsvatnet (Lüder, 2007); SKR: Stora Kroksjön (Heider, 2004); SPA: Spåime (Hammarlund et al., 2004); TOR: Lilla Torkelsjön (Heider, 2004); TOS: Toskaljavri (Seppä et al., 2002); TSU: Tsuolbmajavri (Korhola et al., 2002); VUO: Vuoskkujávri (Bigler et al., 2002).
Velle et al. (2010a) show ten reconstructions and argue that these are so different that it seems unlikely that the differences can be attributed solely to local site differences in microclimate. Brooks et al. (2012) caution against assuming that decadal-scale to centennial-scale temperature variability in Scandinavia should correlate. However, studies of the instrumental record of the last 100–250 years (Casty et al., 2007; Dobrovolny et al., 2010; Jones and Moberg, 2003; Luterbacher et al., 2004; Meier et al., 2007; Moberg et al., 2005; Nordli et al., 2003) and of reconstructions from documentary proxy evidence during the last 500 years (Brazdil et al., 2010; Casty et al., 2005; Dobrovolny et al., 2010; Meier et al., 2007) suggest that temperatures do correlate, not only in Scandinavia, but also across Europe. Brooks et al. (2012) are rightly concerned that chronological uncertainties can obscure a comparison among sites. In a study on six Holocene chironomid-stratigraphies, Velle et al. (2005a) tested the numerical relationship among unsmoothed and smoothed inferred temperature curves. Since only two of 15 comparisons were statistically significantly positively correlated, uncertainties associated with chronologies were tested. The correlation analysis indicated no correlation, whereas by disregarding the dating model, there was a series of matching temperature events, but only when the original age–depth models were displaced by more than 1000 years (Velle et al., 2005a: figure 11). In this context, there is little point in searching for discordances that are, or are not, within the prediction errors of the temperature inferences, as Brooks et al. (2012) do. At many sites, the chironomid-inferred temperatures are too erratic to be climatically useful. For most sites, it is of little value to interpret temperature fluctuations that are inside the prediction errors estimated by statistical cross-validation, as many chironomid workers continue to do.
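The kind of comparison described above, correlating two inferred temperature curves and then asking how far one chronology must be displaced before the curves match, can be sketched in a few lines. The sketch below is only an illustration under stated assumptions: the two records are invented, the function names (`interpolate`, `shifted_correlation`) are hypothetical, and the procedure is not claimed to be the one used by Velle et al. (2005a). Both records are linearly interpolated onto a common age grid and the Pearson correlation is computed for a range of displacements of the second chronology.

```python
import math
from bisect import bisect_right


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def interpolate(ages, values, t):
    """Linear interpolation at age t; ages must be strictly increasing and span t."""
    j = min(max(bisect_right(ages, t), 1), len(ages) - 1)
    a0, a1 = ages[j - 1], ages[j]
    w = (t - a0) / (a1 - a0)
    return values[j - 1] + w * (values[j] - values[j - 1])


def shifted_correlation(ages_a, vals_a, ages_b, vals_b, shift, step=100):
    """Correlate record A with record B after adding `shift` years to B's ages."""
    moved = [a + shift for a in ages_b]
    start = max(ages_a[0], moved[0])
    end = min(ages_a[-1], moved[-1])
    grid = []
    t = start
    while t <= end:  # common age grid over the overlapping interval
        grid.append(t)
        t += step
    a = [interpolate(ages_a, vals_a, t) for t in grid]
    b = [interpolate(moved, vals_b, t) for t in grid]
    return pearson(a, b)


# Invented example: record B is record A with a chronology that is 500 years too old.
ages_a = list(range(0, 10001, 100))
vals_a = [12.0 + math.sin(age / 800.0) for age in ages_a]
ages_b = [age + 500 for age in ages_a]
vals_b = list(vals_a)

shifts = range(-1000, 1001, 250)
scan = {s: shifted_correlation(ages_a, vals_a, ages_b, vals_b, s) for s in shifts}
best_shift = max(scan, key=scan.get)
print("correlation with the chronologies as given:", round(scan[0], 2))
print("best-matching displacement (years):", best_shift)
```

Picking the displacement with the highest correlation is a search over many candidates, so the maximum will overstate the agreement between real records. A match that appears only at displacements larger than the dating uncertainty is therefore weak evidence for a shared climate signal.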
Strong differences prevail when more sites than those presented in Velle et al. (2010a) are included in the comparison (i.e. three sites from western Norway (Velle et al., 2005a), two from southern Norway (Lüder, 2007), and two from southern Sweden (Heider, 2004); Figure 1). Chironomids have produced encouraging results (Larocque et al., 2009), but there is no doubt that chironomids have also produced discouraging results (Velle et al., 2005a). In all instances, quantitative proxies will only produce reliable estimates of past environmental conditions with robust errors if the assumptions of the reconstruction method are met (Birks et al., 2010). While Brooks et al. (2012) make some valuable comments, they fail to acknowledge the underlying concern of Velle et al. (2010a): Why are the chironomid-inferred Holocene temperatures from Fennoscandia so different? We have initiated a re-analysis of most chironomid data sets from Norway, Sweden, Finland, Iceland, and the UK. This is a joint project, including some of the Comment authors and others, which hopefully will help to identify at which sites and during which periods the chironomid approach is reliable (e.g. Figure 2).

Figure 2. (a) Non-metric multidimensional scaling (NMDS) of fossil chironomids from Råtåsjøen (red line) added passively into the NMDS space of samples in the modern Norwegian calibration data set (grey dots). The contours (blue) represent temperature change (°C) of samples in the Norwegian calibration data set (Brooks and Birks, 2001, unpublished data, 2001–2010). The fossil samples change along some unknown secondary gradient, NMDS2. (b) Proportion of variance in the Råtåsjøen fossil data explained by environmental variables in the modern calibration data set. The only statistically significant environmental variable is lake depth, suggesting that the secondary gradient found from the NMDS is lake depth. The dashed line to the far right shows the proportion explained by the first axis of a PCA, while the red dashed line shows the threshold above which an environmental variable explains more of the variance than 95% of 999 random reconstructions. For details on the method, see Telford and Birks (2011).
Validation with instrumental records
As Brooks et al. (2012) point out, chironomids have been assessed as a palaeoclimatic proxy by comparing the reconstructed temperatures with instrumental records for the last 100–150 years. This validation has given promising results, but studies that failed to find a correlation between chironomids and the instrumental temperature record should not be overlooked (e.g. Axford et al., 2009; Cameron et al., 2002; Lotter et al., 2002). In addition, it is important to consider potential confounding factors when down-core chironomids and the instrumental temperature record are compared: (1) Any comparison and numerical correlation should be corrected for temporal auto-correlation, as there is a lack of statistical independence in the fossil record and in the instrumental record (Tian et al., 2011). The auto-correlation violates the assumptions of many statistical tests and can cause overoptimistic estimates of the significance of the correlation coefficient. A correction for auto-correlation is usually not performed in chironomid-temperature validation, but as Tian et al. (2011) show, such a correction can easily be achieved using a block bootstrap similar to h-block cross-validation (Burman et al., 1994). (2) A correlation during the instrumental record for the last 100–150 years does not imply that the chironomid record will provide reliable temperatures for the entire Holocene. This was very evident in Lake 850 in Sweden. Here, there was a good match between the instrumental record and inferred temperatures for the last 100 years (Larocque and Hall, 2003). However, based on inconsistencies in a Holocene multiproxy study, Larocque and Bigler (2004) concluded that temperature was not the most important factor to explain the distribution and abundance of chironomids prior to 2500 cal. yr BP.
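To make point (1) concrete, the following is a minimal sketch of one simple variant of a block bootstrap: a moving-block bootstrap that resamples contiguous blocks of paired values and returns a percentile confidence interval for the Pearson correlation. It is not necessarily the exact procedure of Tian et al. (2011). The function names, the block length of 12, and the two simulated first-order autoregressive series standing in for an inferred and an instrumental record are all assumptions made for illustration.

```python
import math
import random


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def block_bootstrap_ci(x, y, block_len, n_boot=2000, alpha=0.05, seed=1):
    """Percentile CI for Pearson r from a moving-block bootstrap of paired series.

    Contiguous blocks are resampled with replacement so that the short-range
    auto-correlation within each block is retained in every bootstrap sample.
    """
    rng = random.Random(seed)
    n = len(x)
    starts = range(n - block_len + 1)      # admissible block start positions
    n_blocks = math.ceil(n / block_len)
    stats = []
    for _ in range(n_boot):
        idx = []
        for _ in range(n_blocks):
            s = rng.choice(starts)
            idx.extend(range(s, s + block_len))
        idx = idx[:n]                       # trim to the original length
        stats.append(pearson([x[i] for i in idx], [y[i] for i in idx]))
    stats.sort()
    low = stats[int((alpha / 2) * n_boot)]
    high = stats[int((1 - alpha / 2) * n_boot) - 1]
    return low, high


def ar1(n, phi, rng):
    """Simulate a first-order autoregressive series (illustrative data only)."""
    x = [rng.gauss(0.0, 1.0)]
    for _ in range(n - 1):
        x.append(phi * x[-1] + rng.gauss(0.0, 1.0))
    return x


# Two independent but strongly auto-correlated series (invented data).
rng = random.Random(42)
inferred = ar1(120, 0.8, rng)
observed = ar1(120, 0.8, rng)

r = pearson(inferred, observed)
low, high = block_bootstrap_ci(inferred, observed, block_len=12, n_boot=1000)
print("r =", round(r, 2), " block-bootstrap 95% interval:", round(low, 2), "to", round(high, 2))
```

If the interval includes zero, the correlation cannot be distinguished from no relationship once auto-correlation is taken into account. The block length is a tuning choice: it should be long relative to the auto-correlation in the records, and reporting results for more than one block length is a reasonable precaution.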
Validation with independent proxies
Brooks et al. (2012) cite studies where results obtained from chironomids correspond well with the results obtained from other proxies or with independent records, such as the Greenland ice-cores. We agree that a comparison with independent records is a principal means of assessing results. However, we note the following: (1) It is unclear to us why Brooks et al. (2012) assume that temperatures throughout Scandinavia should not necessarily correlate, while Scandinavian chironomid temperature reconstructions should correlate with ice-core records from Greenland. (2) We caution against the risk of circularity of argument. In their conclusions, Brooks et al. (2012) write that chironomid-based temperature reconstructions are supported by results from vegetation modelling. A correlation between inferred temperatures and vegetation modelling is not surprising given that the vegetation model they cite (Heiri et al., 2006) was driven by chironomid-inferred temperatures. Our interpretation is that Brooks et al. (2012) find the chironomid-inferred temperatures supported since the chironomid temperature-driven vegetation model produced vegetation dynamics that mimic an independent pollen record. (3) A general visual similarity with independent climate proxies should also be quantified and assessed for statistical significance (e.g. Dobrovolny et al., 2010: figure 6). According to Bennett (2002), testable hypotheses are needed or it becomes difficult or impossible to disentangle what is based on data and what is based on opinions. (4) There are many multiproxy studies that have pointed out inconsistencies between temperatures obtained from chironomids and other proxies, or studies that have suggested temperature was not the main driver for the full or parts of the down-core chironomid distribution (Bigler et al., 2002; Dalton et al., 2005; Heinrichs et al., 2005; Heiri and Lotter, 2003; Heiri et al., 2003; Korhola et al., 2002; Larocque and Bigler, 2004; Lüder, 2007; Nyman et al., 2008; Velle et al., 2010b).
Gradient length and response to confounding variables
Gradient length in training-sets
We do not think that chironomids are a useful proxy for any environmental variable provided the gradient is long enough, as Brooks et al. (2012) give the impression that we do. However, based on training-set statistics, any environmental variable will appear to be reconstructable if the gradient is long enough and other gradients are short. This is demonstrated by the many environmental variables that chironomids appear to respond to in training-sets (Table 1).
Table 1. Examples of chironomid training-sets developed to infer diverse environmental variables.
For reference on model performance, the average r²jack for 23 published chironomid–air temperature training-sets is 0.75. WA: weighted averaging; PLS: partial least squares; WA-PLS: weighted averaging partial least squares; inv: inverse deshrinking; classical: classical deshrinking; tol: tolerance downweighting. The number after WA-PLS refers to the number of WA-PLS components considered.
Brooks et al. (2012) state that the performance statistics of chironomid-based temperature inference models exceed by far the performance statistics of other chironomid-based models. We agree that the numerical performance of chironomid-based temperature-inference models is good (Table 1), but we find no data in Brooks et al. (2012) that confirm their statement. To our knowledge it has not been tested whether temperature-inference models out-perform the numerical performance of training-sets based on other environmental variables. The coefficient of determination (r²) measures the strength of the relationship between observed and predicted values and will increase with gradient length, while RMSEP and bias statistics are not dependent on the range of the observed environmental gradient (Birks, 1998). Variables measured in different units (e.g. chlorophyll a (μg/l), water depth (m) or temperature (°C)) are not comparable unless standardised. Hence, a comparison among training-sets based on dissimilar environmental variables is valid only if the gradient lengths are similar and the environmental data are standardised to a comparable unitless scale.
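The dependence of r² on gradient length can be illustrated with a small simulation. The sketch below is a toy example, not an analysis of any published training-set: observed values are spread evenly along gradients of different lengths, the "predictions" are the observed values plus the same random error in every case, and the error statistic is an apparent RMSE rather than a cross-validated RMSEP. The function names and the error standard deviation of 1.0 are assumptions.

```python
import math
import random


def r_squared(obs, pred):
    """Squared Pearson correlation between observed and predicted values."""
    n = len(obs)
    mo, mp = sum(obs) / n, sum(pred) / n
    sop = sum((o - mo) * (p - mp) for o, p in zip(obs, pred))
    soo = sum((o - mo) ** 2 for o in obs)
    spp = sum((p - mp) ** 2 for p in pred)
    return sop * sop / (soo * spp)


def rmse(obs, pred):
    """Root mean squared difference between predicted and observed values."""
    return math.sqrt(sum((p - o) ** 2 for o, p in zip(obs, pred)) / len(obs))


def simulate(gradient_length, n=200, error_sd=1.0, seed=0):
    """Observed values evenly spaced on [0, gradient_length]; predictions add noise.

    The same seed gives the same errors for every gradient length, so only
    the length of the gradient differs between runs.
    """
    rng = random.Random(seed)
    obs = [gradient_length * i / (n - 1) for i in range(n)]
    pred = [o + rng.gauss(0.0, error_sd) for o in obs]
    return obs, pred


results = {}
for length in (4.0, 8.0, 16.0):
    obs, pred = simulate(length)
    results[length] = (r_squared(obs, pred), rmse(obs, pred))
    r2, err = results[length]
    # err / length expresses the error as a unitless fraction of the gradient.
    print("gradient", length, " r2", round(r2, 2), " RMSE", round(err, 2),
          " RMSE/gradient", round(err / length, 3))
```

With identical prediction errors, r² rises as the gradient lengthens while the RMSE stays the same. Dividing the error by the gradient length (or by the standard deviation of the observed values) is one way to put models for variables with different units on a comparable scale.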
Relationship to temperature
Most chironomid-temperature training-sets show significant responses to temperature and to secondary variables, such as organic carbon, alkalinity, conductivity, solar radiation, magnesium, precipitation, altitude, lake depth, and lake productivity (e.g. Barley et al., 2006; Larocque et al., 2001, 2006; Lotter et al., 1997, 1998; Olander et al., 1999; Rees et al., 2008; Velle et al., 2005a). In surprisingly many training-sets designed for long temperature gradients, the response to environmental variables other than temperature overrides the response to temperature. Such variables include pH (Porinchu et al., 2009; Rees et al., 2008), loss-on-ignition (LOI) (Larocque et al., 2001; Olander et al., 1999), lake depth (Porinchu et al., 2009), total carbon (Langdon et al., 2008), or total nitrogen (Porinchu et al., 2009). Chironomids in data sets that are not designed for training-set purposes will respond most significantly to one of several environmental variables, such as tropho-dynamic status, DOC, sediment organic content, chlorophyll a, bottom oxygen content, lake size, location of the lake, water chemistry, or altitude (Bigler et al., 2006; Brodersen and Lindegaard, 1999; Catalan et al., 2009; Fjellheim et al., 2009; Kernan et al., 2009; Nyman et al., 2005; Real and Prat, 1992). Many of these environmental variables co-vary with temperature or with one or more of the other variables listed above.
There is a lack of understanding of the relationship between chironomids and temperature, and it seems that indirect effects of temperature can play an important role (Eggermont and Heiri, 2011). When this is the case, it is wise to interpret down-core reconstructions cautiously. All training-sets will produce results when applied down-core (Birks et al., 2010). Can chironomids provide reliable estimates for temperature change or for change along any of the other environmental variables from training-sets (Table 1) backwards in time? In principle yes, but only if the environmental variable of interest is the dominating gradient at the time of interest (Birks et al., 2010). Many chironomid researchers have been concerned about issues of confounding gradients. Brooks (2006) stressed that soil development and the resulting changes in pH, nutrients, dissolved oxygen (DO), and dissolved organic carbon (DOC) can have a greater influence than temperature on the composition of some midge assemblages. Larocque et al. (2006) suggested that it was hard to dissociate the combined effects of temperature, DOC, LOI, and depth when performing down-core temperature reconstructions. According to Langdon et al. (2008) it is a major challenge to separate the effects of temperature, LOI, and lake depth on subfossil chironomid sequences.
Correlation between temperature optima and trophic optima
Brooks et al. (2012) agree that taxa characteristic of warm waters are also often characteristic of eutrophic waters, and taxa characteristic of cold waters are also often characteristic of oligotrophic waters. However, based on optima in a truncated data set they argue the correlation between trophic optima and temperature is not as universal as we suggested. First, we point out that a relationship between trophic optima and temperature optima should not be expected for all taxa in a data set since the optima for rare taxa are inevitably poorly defined. Only common taxa are relevant and performing a more robust correlation with outliers removed would drastically improve the correlation (see Brooks et al., 2012: figure 1). Second and most important, concerning the relationship between temperature and trophic status (nutrients, chlorophyll, Secchi-depth, DOC), it is not a matter of our opinion, or of transfer-function performance and statistical prediction errors, but a question of well-described ecological and limnological phenomena in nature (Brodersen and Anderson, 2002; Brodersen and Lindegaard, 1999; Brooks et al., 2001; Brundin, 1949, 1956; Lenz, 1925; Lotter et al., 1997; Sæther, 1979; Thienemann, 1928, 1954; Walker et al., 1991; Wiederholm and Erikson, 1979). Several examples of significant positive correlations between temperature and trophic variables are reviewed by Eggermont and Heiri (2011). These relationships are responsible for the evolutionary outcome that warm-water taxa, with a higher metabolic activity, are also adapted to productive lakes (and streams, see Velle et al., 2010a: figure 6) rich in available food for growth. This should not be surprising and can be shown for many data sets (Figure 3a). Comparison of independent data will, in many cases, give the same rather convincing result (Figure 3b). However, even if we could succeed in selecting lakes to produce the ideal training-set with no correlations among the variables, as Brooks et al. (2012) attempt, nature will still expose the same relationship based on all the thousands of lakes that were not sampled for the optimised data set. Modern data sets aimed and designed for high precision and accuracy for single variables may perform well for that purpose, but will not necessarily reflect the true complexity in nature in space and time because they are optimised for a single variable at one point in time. If this ecological knowledge is acknowledged, it can be used positively to understand palaeoecological records and contradictions in inferred environmental variables.
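A rank correlation between temperature optima and trophic optima, restricted to common taxa, can be computed as in the following minimal sketch. Optima are estimated by weighted averaging (the abundance-weighted mean of the environmental variable over the lakes where a taxon occurs), and the two sets of optima are compared with Spearman's rank correlation. The lake data, taxon abundances, function names, and the occurrence threshold are all invented for illustration and are not taken from the Swiss, North American, or any other data set cited here.

```python
import math


def wa_optima(abundances, env, min_occurrences=3):
    """Weighted-averaging optimum per taxon for one environmental variable.

    abundances: one row per lake, one column per taxon.
    Taxa found in fewer than `min_occurrences` lakes are skipped, because
    optima for rare taxa are poorly defined.
    """
    optima = {}
    for k in range(len(abundances[0])):
        column = [row[k] for row in abundances]
        if sum(1 for y in column if y > 0) < min_occurrences:
            continue
        optima[k] = sum(y * x for y, x in zip(column, env)) / sum(column)
    return optima


def ranks(values):
    """Ranks starting at 1, with tied values given their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        for k in range(i, j + 1):
            result[order[k]] = (i + j) / 2 + 1
        i = j + 1
    return result


def spearman(a, b):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    ra, rb = ranks(a), ranks(b)
    n = len(ra)
    ma, mb = sum(ra) / n, sum(rb) / n
    sab = sum((x - ma) * (y - mb) for x, y in zip(ra, rb))
    saa = sum((x - ma) ** 2 for x in ra)
    sbb = sum((y - mb) ** 2 for y in rb)
    return sab / math.sqrt(saa * sbb)


# Invented data: six lakes, four taxa (columns), July temperature and total phosphorus.
temperature = [6.0, 8.0, 10.0, 12.0, 14.0, 16.0]   # deg C
total_p = [4.0, 6.0, 5.0, 12.0, 20.0, 30.0]        # ug/l
counts = [
    [50, 30, 5, 0],
    [40, 35, 10, 0],
    [20, 40, 25, 5],
    [5, 30, 40, 20],
    [0, 10, 40, 45],
    [0, 5, 30, 60],
]

temp_opt = wa_optima(counts, temperature)
tp_opt = wa_optima(counts, total_p)
common = sorted(set(temp_opt) & set(tp_opt))
rho = spearman([temp_opt[k] for k in common], [tp_opt[k] for k in common])
print("taxa compared:", len(common), " Spearman rho:", round(rho, 2))
```

In this toy data set temperature and total phosphorus co-vary among lakes, so the two sets of optima rank the taxa in the same order. The sketch says nothing about whether the taxa respond to temperature, to nutrients, or to both, which is the ecological question at issue.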

Figure 3. (a) Rank correlation between chironomid altitudinal optima in the Swiss Alps (Lotter et al., 1997: figure 6) and trophic optima (Lotter et al., 1998: figure 6). The correlation between temperature and total phosphorus (TP) in the Swiss Alps is r = 0.57, p < 0.01 in the non-truncated data set. (b) Rank correlation between chironomid temperature optima for northwest North America (Barley et al., 2006) and trophic rank (Sæther, 1979). Average rank numbers for trophy were used where adjustment to subfossil genus/type was necessary.
Trophic optima in a Greenland data set
Velle et al. (2010a) cite a study from West Greenland where Brodersen and Anderson (2002) demonstrate there is a strong correlation between temperature optima and trophic optima. Brooks et al. (2012) argue that the West Greenland data set maximises the statistical impact of lake catchment characteristics (nutrients) and minimises the impact of temperature. If Brooks et al. (2012) see this as a problem, we point out that such a sampling strategy with single environmental gradients maximised is common in chironomid temperature training-sets and other proxy-based training-sets (e.g. Brooks and Birks, 2001). Brodersen and Anderson (2002) realised that the nutrient gradient in the Greenland data was long and that temperature significantly explained 20% of the variation in the chironomid data. However, instead of publishing a seemingly good temperature transfer function, they attempted to interpret ecologically the multivariate complexity in their data set. Theoretically, Brodersen and Anderson (2002) could have maximised the temperature gradient simply by including an artificial climate gradient uphill. This would have been a classic example of a data set that would have obscured the strong local influence of catchment characteristics, and have ignored the true limnological processes that probably also occurred in the lakes (down-core) over the Holocene in that region.
Trophic influence down-core
Brooks et al. (2012) provide examples of sites where trophic influence is thought not to obscure the response of chironomids to climate change during the early Holocene, and argue that trophic influence is therefore not a problem. We agree that sites with small confounding gradients are ideal candidates for quantitative palaeoecology (see Velle et al., 2010a: figure 5). However, it is important to be cautious with samples from the early Holocene and from sites formed in recently deglaciated terrain given (1) the relationship between chironomid temperature optima and trophic optima (Figure 3) (Brodersen and Anderson, 2002; Velle et al., 2010a), and (2) that leaching and lake-catchment nutrient-ontogeny is a natural process following deglaciation (Boyle, 2007; Engstrom and Fritz, 2006; Engstrom et al., 2000; Norton et al., 2011; Reuss et al., 2010; Saros et al., 2010). A response of chironomids to the input of phosphorus in mineral colloids from glaciers was described more than 50 years ago (Brundin, 1956, 1958). Of the ten sites presented by Velle et al. (2010a), multiproxy studies indicate that four of the chironomid-based temperature inferences, and not two as stated by Brooks et al. (2012), could be obscured by an early-Holocene increase in productivity (SPA: Hammarlund et al., 2004; L850: Larocque and Bigler, 2004; RAT: Velle et al., 2005b; BRU: Velle et al., 2010b). Furthermore, an increase in rainfall may result in enhanced in-wash of nutrients into water bodies (Chang et al., 2001; Kundzewicz et al., 2007). If this happened in the past, the chironomid-inferred temperatures could accordingly be overestimated. A response to nutrient-input was illustrated in a study on experimental fertilization in Alaska. During the six study years, the dominating chironomid genus in the fertilised side of the lake changed from Heterotrissocladius to Phaenopsectra (Hershey, 1992). There were no corresponding changes in the test side of the lake. Phaenopsectra has higher temperature optima than Heterotrissocladius (Eggermont and Heiri, 2011), and any temperature inference at this site would accordingly be overestimated.
Down-core influence of human impact, pH, and depth
Brooks et al. (2012) recognise that human impact can influence the temperature reconstructions, but argue that the influence from humans can be discerned and distinguished from climate responses in chironomid records. We agree that potential human impact can be detected at sites where background information exists on the timing and extent of human influence and from multiproxy studies, such as those cited in Brooks et al. (2012). For many sites, however, information on human impact or impact along other confounding gradients is missing. At such sites, the confounding impact would potentially be interpreted as a temperature signal. Because of biases in the inferred temperatures associated with human impact in the Alpine region, Heiri and Lotter (2005) recommended a multiproxy approach to palaeoenvironmental reconstruction. According to Heiri and Lotter (2005), it is clearly essential to keep a close control on changes in local human activity during the late Holocene, even at high elevations and in remote mountain lakes.
When it comes to the influence of lake depth and pH, Brooks et al. (2012) agree that there is a possible problem of confounding gradients that can cause unreliable chironomid-inferred temperature estimates.
Training-sets as representatives of true pattern in nature
The complexity and multidimensionality in chironomid responses to environmental variables in modern data sets and in down-core sequences, as demonstrated above and in Velle et al. (2010a), suggest that transfer functions should not be interpreted as representatives of true ecological patterns in nature. Training-sets can help build ecological hypotheses, but should be seen as empirical models that mimic a small fraction of the biological response mechanisms. It is important to separate responses in static and optimised training-sets in space, from responses along gradual changes within single lakes in time. As an example, Axford et al. (2009) found that some taxa appeared to exhibit different temperature preferences in a core from Iceland and in the Iceland calibration data set. We concur with Huntley (2012), who cautions against using biological proxies to reconstruct variables in isolation since most organisms respond concurrently to several variables. A strong implication is that chironomids can be used as a proxy for combinations of variables that often occur together, such as tropho-dynamic status sensu Catalan et al. (2009). Tropho-dynamic status is a combination of variables, including productivity (DOC, total phosphorus), thermal conditions, and littoral habitat features.
Concluding remarks
All scientific results should be subject to rigorous testing. This is not straightforward when we use proxies to infer some unknown environmental variable of the past (Birks et al., 2010). In this context, we do not see how Brooks et al. (2012) have provided evidence that Holocene chironomid-based temperature reconstructions are reliable, as they state. Similarities with other proxies or multiple sites are needed to confirm results and the similarities should be tested for statistical significance. We do not believe the challenges involved are unique to chironomids as a palaeoenvironmental proxy (e.g. Huntley, 2012). This was highlighted in the title of the Velle et al. Forum Article as ‘lessons for palaeoecology’. As Brooks et al. (2012) noted ‘the requirement for multiproxy and multisite studies to separate signal from noise is equally true for other climate proxies, including lacustrine proxies and proxies from other archives such as tree rings, peats, speleothems and ice cores’. These are general challenges that we hope experts in their respective fields of palaeoenvironmental sciences will take seriously. Recognising that other sciences have challenges is no excuse not to take these challenges seriously in chironomid research. We do not wish palaeoecologists to dismiss their biological proxies, but rather to use more resources to understand the underlying ecological processes and to refine their palaeoecological methods. Then, the resulting palaeoenvironmental inferences will hopefully be more robust.
Funding
This work has been supported by the Norwegian Research Council through grant 178653/S30 to GV.
