Abstract
There is growing concern about the response of contemporary ecosystems to increasing and novel anthropogenic pressures and environmental conditions. Palaeoecology is crucial to understanding how ecosystems have responded to past environmental changes and can inform management of contemporary ecosystems and contribute to forecasts of ecosystem responses to change. However, palaeoecological data are subject to uncertainties that arise from environmental processes, field and laboratory methods, and data processing, and that affects inferences drawn from them. Understanding how different sources of uncertainty affect the analyses of proxy records remains limited, and records are often interpreted solely qualitatively. We present a virtual ecology approach for assessing how uncertainties inherent in empirical proxy data influence statistical analyses and the inferences drawn from them. In the virtual ecology approach, both the data and the observational process are recreated in simulation to assess sampling and analytical methods. We demonstrate results from a new model for simulating core-type samples of pseudoproxies comparable to empirical proxy data but not subject to the same sources of proxy and chronological uncertainties. These ‘error-free’ pseudoproxies generated under known driving conditions have uncertainties (e.g. core mixing, sub-sampling, and proxy quantification) systematically introduced to them to assess how individual and combined sources of uncertainty influence analytical methods. Results indicate that inferences drawn from statistical analysis, such as the stability of a system, or the rate of ecological turnover, can change substantially between the ‘error-free’ pseudoproxies, and degraded and sub-sampled data. We show how our approach can advance understanding of uncertainties in palaeoecological data and how it can help shape research questions by quantifying of their influence on proxy data.
Introduction
The problem: Inference in the face of uncertainty
Proxy data, such as pollen records, incorporate multiple uncontrollable sources of uncertainty that arise from environmental processes (e.g. bioturbation, taphonomy and variable sedimentation rates) before sample collection (Smerdon, 2012). Environmental uncertainties are compounded by the process and observer errors that occur during field collection of a core sample, laboratory preparation and analysis of sub-samples and data processing methods (Carstensen et al., 2013; Stegner et al., 2019; Taranu et al., 2018). However, in a palaeoecological context, our understanding of the contribution of different sources of uncertainty, and their influence on inferences made from palaeoecological data, remains poor. Thus, if insights from palaeoecological data are to be used to inform the management of contemporary ecosystems or to forecast ecosystem change, it is crucial to understand how the uncertainties that surround them may influence the inferences drawn from them.
Key concepts: Virtual ecology, pseudoproxies and proxy system models
Here, we describe and implement a framework for virtually assessing the effects of uncertainty in a palaeoecological context. Three concept underpin our framework: virtual ecology (VE; Zurell et al., 2010); pseudoproxy experiments (Mann and Rutherford, 2002); and proxy system modelling (PSM; Evans et al., 2013). Empirical data typically lack an error-free control and even high-quality empirical data (e.g. highly resolved proxy data from a laminated lake sediment core) integrate multiple uncontrollable sources of uncertainty. In the VE approach, both the data and the observational process are simulated. The synthetic data act as an ‘error-free’ benchmark against which to assess simulated observational processes and analytical methods (Zurell et al., 2010). The experimentation embodied in the VE approach allows the effects of different sources of uncertainty to be evaluated in a systematic and controlled way (Smerdon, 2012). Here, we use the term ‘virtual ecology’ to describe the approach by which synthetic data are generated, degraded, sampled, and analysed in ways comparable to empirical data (Zurell et al., 2010).
Pseudoproxy experiments are similar to the VE approach (Mann and Rutherford, 2002) in that they involve the analysis of modified observational data or simulated proxy data (‘pseudoproxies’) in place of empirical observations, which are then analysed in the same way as empirical measurements. The concept underlying pseudoproxy experimentation is that the pseudoproxy data mimic the statistical properties of empirical data without being subject to the same issues of, for example, limited grain or extent (Jaume-Santero et al., 2020; Smerdon, 2012). Thus, virtually assessing sampling strategies and statistical methods in a systematic way can inform empirical studies. If a method fails in simulation, it is unlikely to perform better on empirical data (Simpkins and Perry, 2017; Zurell et al., 2010). While simulations cannot substitute for reality, even simple simulations have been shown to recreate similar patterns to those observed in empirical proxy data (such as abrupt shifts, quasi-cyclic behaviour and long-term trends) via random walk processes (Blaauw et al., 2010). Here, we use the term ‘pseudoproxy’ to refer to the simulated proxy data themselves as it is associated with a body of climatological literature (Bothe et al., 2019; Christiansen et al., 2009; Mann and Rutherford, 2002), and the term ‘virtual ecology’ (VE) is used to describe the approach by which synthetic data are generated, degraded, sampled and analysed.
Proxy system modelling (PSM) is a framework that describes how changes in the environment are recorded as observable traces in a proxy record (Evans et al., 2013). The generalised form of a proxy system model comprises: (i) environmental drivers (e.g. climatic variability); (ii) a sensor (a physical, biological or chemical component of the system that responds to environmental drivers); (iii) an archive (the medium in which the sensor’s response is recorded, such as a lake sediment); and (iv) observations of the archive (Figure 1). In a palaeoecological context, environmental variability causes a response from a sensor (e.g. terrestrial vegetation or diatoms) in the form of, for example, changes in abundance or growth rate. The response is recorded as a proxy (e.g. pollen grains or diatom frustules) in an archive such as a lake sediment. Observations are then drawn from the archive as core samples from which the proxies are quantified.

Proxy system modelling (PSM) framework of how environmental variability is recorded in an archive.
Background
Simulation-based efforts to assess uncertainties in palaeoecological data have considered: the influence of different accumulation rates, and core mixing and compression, on detecting critical transitions (Stegner et al., 2019; Taranu et al., 2018); the influence of sub-sampling densities on the chronological placement of ecological events (Liu et al., 2012); and how accumulation rates, sub-sampling intervals, and species traits affect patterns of ecological memory (Benito et al., 2020). This research has shown that uncertainty arising from environmental processes, field and laboratory methods can decrease both the detectability of a signal in time-series data and the placement of that signal in time. However, while providing valuable insights, previous models have been limited to evaluations of single time-series (e.g. a state variable), using single-driver environments, assessing individual (or few) sources of uncertainty, or having a narrow parameter space for species co-occurrence (i.e. a community that collapses if conditions become too unfavourable). Here, we advance previous research by:
• representing a multi-driver environment that allows for driver interactions and synergies (and thus address questions of equifinality), which are fundamental to understanding ecosystem change (Dutta et al., 2018; Kefi et al., 2016; Paine et al., 1998).
• simulating ecological dynamics (e.g. growth rates and carrying capacities) to generate a multi-species pseudoproxy archive that can undergo community turnover. Thus, pseudoproxies are both relevant to long-term ecological change, and assessing multivariate statistics.
• recreating core formation processes related to accumulation rates and time-span, and virtually recreating the observational processes.
• systematically applying multiple sources of both environmental- and observer- introduced uncertainty individually and combined.
By leveraging different driving conditions in combination with ecological dynamics in the model, we are able to capture patterns of slow steady regime shifts and more abrupt ones (described in Williams et al., 2011). We implement a simulation framework for generating palaeoecologically relevant pseudoproxy data following the PSM framework. We then demonstrate a VE approach to systematically introduce uncertainty to the synthetic data to assess the influence of uncertainty on statistical methods. Our aim is to implement a framework in which different sources of environmental and observer-introduced uncertainty can be assessed for their influence on inferences drawn from the data. Such information can be used to inform for example study design (field and laboratory sampling strategies) and make more robust inferences from the data.
Methods
Generating pseudoproxies
To demonstrate how a simulation framework can improve our understanding of empirical proxy records, we implemented a model that generates a simulated sediment core-type sample. Here we provide a high-level description of the model, but a detailed description following the overview, design concepts, and details, or ODD protocol (Grimm et al., 2020) is provided as Supplementary Information. In summary, the model consists of four components: (i) a representation of the formation and age of the core (Figure 2a); (ii) a suite of environmental drivers that change over time (Figure 2b); (iii) species’ niches (the response of a sensor to the driving environment; Figure 2c); and (iv) pseudoproxy abundances of a ‘sensor’ (Evans et al., 2013) recording the response to driver change in the archive (Figure 2d). The model generates an archive of pseudoproxies consisting of S species representing a palaeoecological record free from process and observer error. The model is written in R (R Core Team, 2020, version 4.0.1), and model code is publicly available on GitHub (Asena, 2023).

Representation of each component of the simulation: the accumulation rate, time-span and length of the core (a); driver conditions over time (b); the niche of each species with respect to the driving environment (c); and the abundances of pseudoproxies in the archive (d). Each species niche comprises an optima and tolerance for each driver, the optima and tolerance curves in (c) are colour-coded to match each driver in (b). In the context of the proxy system model framework, environmental drivers (b) act on a sensor (c; in this case the response of the sensor is the populations’ growth rate changing in response to the environmental drivers) and records the response of the sensor in an archive (d). The simulation runs from past to present, the first time-step being the oldest.
Each simulated core has an age-span and an accumulation rate, which together determine the length of the core and the time captured in each unit depth of sediment. Accumulation rates are modelled with linear, exponential, variable (smoothed random walk motion) or combined (e.g. linear with variability) functions to represent different environments/processes. While we do not evaluate chronological uncertainty here, the age-depth relationship is implicit in the sub-samples and is described in the following section on simulated age-depth models.
A suite of environmental drivers represents the environment in which the proxy archive forms. The drivers can be simulated using functions that represent environmental processes such as pulse/press pressures, seasonal cycling and autocorrelated/non-autocorrelated temporal noise. Each driver is assigned a weighting that determines its effect on the species pool relative to the other drivers. Deterministic drivers have a white Gaussian noise parameter that generates temporally uncorrelated variability around the overall signal, dampening the signal to noise ratio.
Pseudoproxy abundances are simulated as a function of a suite of extrinsic environmental drivers with added variability from disturbance events, introduction to the population via dispersal, and variation in carrying capacity over time. Disturbance events occur randomly during the simulation with a pre-defined mean frequency and reduce species abundances. Reductions in species abundances from disturbance vary across species. Dispersal causes small increases to a species abundance and occurs randomly throughout the simulation. Dispersal gives locally absent species a chance to establish from the regional pool if driver conditions become favourable.
Each species has an optima and tolerance per environmental driver that, across drivers, defines the species niche and determines its population growth rate at a given time-step. Each species niche is a multiplicative function of the environmental drivers, and the limiting factor to growth is the driver least favourable to the species growth (i.e. following Liebig’s law of the minimum). If environmental conditions fall outside a species tolerance, its population will have a negative growth rate and eventually become locally extinct if unfavourable conditions persist for long enough (thus only a subset of the regional species pool will be present in the community at any given time). Species that can tolerate the environmental conditions can be introduced via dispersal; thus, the species present in the local community will turnover as conditions change. A burn-in period is used to allow species to stabilise with respect to the driving environment. The simulated data represent a core-type sample from which subsamples are taken, proxies quantified, and data analysed, similar to real-world core samples. The simulated core provides an ‘error-free’ benchmark against which methodological and statistical processes are assessed, and it represents the complete (and un-degraded) absolute abundance of proxy data.
Simulated age-depth models
In the model, time-steps are annual, producing known ages down-core. However, to better represent an empirical core sample, each replicate simulation has an estimated age per time-step generated by an age-depth model. To generate an age model, the simulated core is sub-sampled at user-defined depth intervals. An age-depth model is constructed using the sampled depths to produce an estimated age distribution for each centimetre of core. Chronological uncertainties and their influence on statistical analyses have been well documented elsewhere (Blaauw et al., 2018; Parnell et al., 2008; Telford et al., 2004).
Degrading and sampling pseudoproxies
Each model realisation generates the abundance of each species at each time-step; for example, a number of pollen grains or diatoms per year of sediment accumulation. The abundance per time-step represents the ‘error-free’ benchmark, which is subject to three processes that introduce increasing uncertainty to the data: sediment mixing (e.g. bioturbation), sub-sampling, and proxy counting. The fundamental structure of the VE framework is: generate data -> sample -> analyse -> assess results. We modify this form to represent the degradation of information as relevant to palaeo-type data: generate data -> degrade -> sample -> analyse -> assess results.
Degradation represents the environmental processes that affect a core before extraction, such as vertical mixing and taphonomic processes. In this case, a simulated mixing process is applied using a centrally weighted rolling average, smoothing the signal in the pseudoproxies. Sub-sampling pseudoproxies represents sampling decisions such as contiguous and non-contiguous sub-sampling. The degraded data are sub-sampled at a user-defined interval and thicknesses, analogous to taking slices from a core sample. The number of time-steps included in each sub-sample of core is determined by the accumulation rate. Proxy counting represents the process of counting the elements (e.g. individual pollen grains or diatoms) in a sub-sample. Pseudoproxies are quantified from the sub-samples using a simulated counting process with the probability of a given species being drawn from the sub-sample weighted by its abundance (with less abundant species being less likely to occur in a counted sub-sample; i.e. a multinomial draw). Finally, abundances are converted to relative abundances.
Figure 3 shows a visualisation of the simulated degradation and observational data for two species that are part of a more diverse local community. The ‘error-free’ benchmark (Figure 3a) is mixed over 10 time-steps (Figure 3b) sub-sampled at 10 cm intervals (Figure 3c), and proxies counted from each sub-sample at a resolution of 400 individuals per sub-sample (Figure 3d).

Sample model output showing two species, visualising data from the error-free reference (a) and one treatment level from each: mixing (b); mixing combined with sub-sampling (c); and mixing combined with sub-sampling and proxy counting (d). Mixing over 10 time-steps is shown as a magnified section of species 56 displaying the error-free reference transitioning to the mixed data (b). The mixed data are then sub-sampled every 10 cm, with a thickness of 1 cm (c) and the sub-sampled data are counted at a resolution of 400 individuals per sub-sample (d). Red dashed lines indicate disturbance events that occur randomly throughout the simulation.
Scenarios and visualisation
In the remainder of this paper, we demonstrate the degradation, observation and analysis of pseudoproxy records using a single realisation of each of two scenarios. In both scenarios, accumulation rate is modelled as a gradual decrease in accumulation rate down-core (with deeper sections of core containing more time-steps per centimetre), representing core compression, superimposed on which is a smoothed random walk, introducing variability in accumulation through time. The two scenarios differ only in the nature of their driving environment; however, parameters such as species’ tolerance and optima, and disturbance events (see SI for details) are randomised for each replicate. These two scenarios were used because they demonstrate key features of the model (e.g. multiple drivers, driver weighting and compositional turnover), and are palaeoecologically relevant examples of a gradually changing ecological gradient and an abrupt shift in conditions. Although we only show examples of proxy uncertainties here (mixing, sub-sampling and proxy counting), exploration of chronological uncertainties is also possible using our model. Additional visualisations are available in Supplemental Materials.
Scenario 1 has two equally weighted environmental drivers: a linearly increasing driver that causes a turnover in species composition by the end of the simulation, and a random walk driver that introduces higher resolution variability in population abundances. Scenario 2 has two unequally weighted drivers: a primary driver that switches between constant conditions, weighted to 0.85 of the overall driver effect, and a random walk driver weighted to 0.15 of the overall driver effect. In both scenarios, a regional pool of 200 species are simulated over 5000 time-steps. The simulation runs from the oldest section of the core (time-step 1, age ~5000 years) to the top of the core (time-step 5000, age 0). The first 500 time-steps are removed as a burn-in period to allow the 200 species to stabilise; during the burn-in most of the 200 species become locally extinct.
We apply mixing, sub-sampling, and proxy counting to the pseudoproxy archive at 10 levels individually and combined, resulting in 1210 datasets spanning a gradient from the ‘error-free’ reference core, and the most intensively sub-sampled to the most degraded and least intensively sub-sampled. In these scenarios, pseudoproxies are mixed over 10 time-steps, sub-sampled at regular 10 cm intervals, with a thickness of 1 cm, and counted at a resolution of 400 individuals per sub-sample.
Quantitative analyses
The complete (200 species) pseudoproxy records of the ‘error-free’ benchmark, and the degraded and sub-sampled examples were described using Fisher Information (FI; Cabezas and Fath, 2002; Eason et al., 2016; Fisher, 1922; Mayer et al., 2006) and principal curves (PrCs; De’ath, 1999; Hastie and Stuetzle, 1989; Simpson and Birks, 2012). Fisher Information and PrCs are both multivariate statistical methods of dimensional reduction, and have been suggested as appropriate descriptive tools where there are (potentially) many input variables of any data type that do not require a priori knowledge of a system’s driving conditions (Roberts et al., 2018). In a palaeoecological context, FI and PrCs have been applied to diatom compositions to assess resilience (Spanbauer et al., 2014) and pigment data to identify important trends in the sequences (Beck et al., 2018).
Fisher Information is versatile in that any type of input data can be used with few statistical assumptions (Ahmad et al., 2016). FI reflects how the ‘stability’ of a system changes through time (see Cabezas and Fath, 2002; Fath et al., 2003), with periods where FI is relatively stable (
Principal curves can be used to identify the latent variables (e.g. an ecological gradient) that describe a system characterised by multiple state variables, and can be considered a form of non-linear principal components analysis (De’ath, 1999; Hastie and Stuetzle, 1989; Simpson and Birks, 2012). We use PrCs as a form of indirect gradient analysis, assuming no a priori knowledge of the driving environment. We use PrCs to characterise the trajectory of the system using cubic smoothing splines to fit the PrC to the data with a penalty of 1.4 on the degrees of freedom. PrC analyses were conducted with the analogue package (Simpson and Oksanen, 2020; version 0.17-4) in R version (R Core Team, 2020).
Results
Scenario 1 pseudoproxy visualisation
In Scenario 1, a gradual species turnover is driven by the linearly increasing environmental driver. The change in the primary driver is small relative to the tolerance of the species in the species pool, allowing generalist species (i.e. those with a broad niche) to persist while creating a gradual turnover across the pool of 200 species. Generalist species, such as sp. 1 (Figure 4) persist for the whole simulation, others, such as sp. 134, represent species tolerant of the central range of driver conditions, while species, including sp. 164, prefer the final simulation conditions. Species such as sp. 109 are more sensitive to change in the random walk driver. As the pseudoproxies are degraded (mixing) and observational models applied (sub-sampling and proxy counting), the record becomes more visually similar to an empirical record (Figure 4).

Visualisation of the environmental drivers, the ‘error-free’ benchmark pseudoproxy archive, and pseudoproxies mixed over 10 time-steps, sub-sampled at 10 cm intervals and counted at a resolution of 400 individuals per sub-sample. A subset of 10 species from the pool of 200 is shown. The ‘error-free’ record shows the absolute abundances per time-step and represents ‘perfect’ knowledge of the system through time, the degraded and sampled record is converted to relative abundances and represents observations comparable to empirical proxy records. Accumulation rate and depth are not pictured in favour of a more complete species record. Note that the y-axis is in simulation time, the most recent section of core is time-step 5000.
Scenario 2 pseudoproxy visualisation
In Scenario 2, the two extrinsic drivers yield an abrupt shift between stable conditions at time-step 3000, with a random walk driver accounting for 0.15 of the overall driver effect on population growth rates (Figure 5). The magnitude of the environmental change is sufficiently small for generalist species to persist, such as species 154, 166, 174 and 191 (Figure 5). However, another pool of species, such as species 134, 158 and 169, are only tolerant of conditions on one side of the driver switch. Species such as species 123 and 129 are sensitive to the random walk driver as well as the primary driver and exist either ephemerally, introduced via dispersal, and experience a brief period of favourable conditions, or show variability throughout the simulation.

Visualisation of Scenario 2. Environmental drivers, the ‘error-free’ pseudoproxy archive, and the degraded and sampled pseudoproxy record are shown. Accumulation rate and core length are not pictured.
Scenario 1 analyses
For the error-free data in Scenario 1, Fisher Information slightly increases through time, suggesting a decrease in the system’s variability as it is forced by the two drivers (Figure 6a). The overall rise in FI is primarily due to the species turnover caused by the gradually increasing driver. However, there is substantial variability in FI as species population growth-rates (and thus abundances) change due to the random walk driver, in addition to the internal variability from disturbances and carrying capacity. With prior knowledge of the underlying drivers, we can identify the cause of the rise in FI (even if it is visually subtle); however, there is no such rise in the FI when calculated for the degraded data (Figure 6a). Comparing FI of the error-free data and the degraded/sub-sampled pseudoproxies, shows there is a decrease in the sampling resolution of the FI time-series and the directionality is different in some periods, for example, between 3750 and 4450 timesteps (Figure 6a blue highlight)

Fisher Information (a) and PrCs (b) of both the ‘error-free’ pseudoproxies, and the degraded and sub-sampled data for Scenario 1. The blue highlighted region in the FI indicates a ~700-year period of reversed directionality. The highlighted regions in the PrCs indicate, in the analyses of the degraded and sub-sampled data, two periods of different rates of change in the PrC. The same regions are highlighted in the ‘error-free’ analysis where no apparent change is visible. Red dashed lines indicate disturbance events for both FI and PrCs.
The distance along the PrC of the error-free and degraded/sub-sampled pseudoproxy data reflect the overall system trajectory driven by the gradually increasing driver (Figure 6b). Some variability due to the random walk driver is visible in the distance along the PrC of the error-free’ pseudoproxies, with small changes in trajectory arising from larger disturbance events. Short-term variability in species’ abundances become less obvious in the analysis of the degraded/sub-sampled data as the points along the curve are further apart in time. Qualitatively, the PrC of the uncertain data shows two zones of different rates of change along the curve (Figure 7b highlighted regions), which are not evident in the error-free data.

Fisher Information (a) and PrCs (b) of both the ‘error-free’ pseudoproxies, and the degraded and sub-sampled data for Scenario 2. The steep drop, and subsequent rise in FI ~3000–3500 time-steps is due to the abrupt shift in the environmental driver causing a change in species composition. The steep change in the PrCs at time-step 3000 is also caused by the change in species assemblage following the abrupt shift in driving conditions. Vertical red dashed lines indicate disturbance events in both A and B.
Scenario 2 analyses
The abrupt shift in species composition in Scenario 2 is evident in the FI for the error-free and the uncertain data. The rate and magnitude of the shift in species composition depends on the sensitivity of the community to the drivers, relative to the rate and magnitude of driver change. In this model realisation the change in species composition is clearly visible in the rapid drop and subsequent rapid rise in FI between c. 3000–3500 time-steps (Figure 7a). The steep increase in FI towards the end of the ‘error-free’ time-series is due to a rapid loss of species, as the system adjusts to the new conditions. The decrease in average richness after the shift is a chance occurrence in this randomly selected model replicate. The rapid rise in FI does not occur in the uncertain record as the sub-sampling and counting processes obscure this pattern in the data. Low abundance species are less likely be present in the sub-sampled and counted data; thus, the loss of such species is not detected in the FI of the uncertain data.
The shift in species composition around time-step 3000 is clearly visible in the PrC (Figure 7b). The starting point of the distances along the PrC is arbitrary, and the x-axis of the degraded and sub-sampled PrC has been reversed for visual comparison. In this model realisation, the distances along the PrCs appear robust to uncertainty.
Discussion
With the application of sophisticated new statistical methods to make inferences from palaeoecological data (Simpson, 2018; Smol et al., 2012), it is important to develop methods to assess how those inferences are affected by the numerous sources of uncertainty inherent in such data. Thus, models designed to assess uncertainty in a palaeoecological context must capture the uncertainties that arise from environmental conditions, field and laboratory procedures, and data processing. Here, we have demonstrated and implemented a method for generating synthetic core-type samples, and a virtual ecology framework to degrade and sample simulated ‘error-free’ proxy data allowing introduction of uncertainties to better understand their influence on inferences we might make from the data.
The two modelled scenarios demonstrate how the inferences made from data can be affected by uncertainty. Following Fath et al. (2003), a decrease in FI is interpreted as a system becoming less predictable (less stable). Thus, in Scenario 1, while some macroscopic patterns remain the same between the FI time-series of the ‘error-free’ core and the uncertain data, a reverse in directionality for 700 time-steps (the estimated age-span depending on the accuracy of the age-depth model) could change our inference from the data. We might interpret this as a period of a loss of stability in the degraded and sub-sampled data or the reverse in the ‘error-free’ data. Similarly, the apparent logistic shape of the PrC of the degraded and sub-sampled data compared to the more linear change in the PrC of the ‘error-free’ data could lead us to infer two zones of change, whereas the environmental change is gradual in our benchmark data.
Analysis of Scenario 2 suggest that the FI is sensitive to the presence of low abundance species, but it detected the abrupt shift in species compositions in both the ‘error-free’ and degraded and sub-sampled data. Likewise, the PrC appears to be robust to uncertainty when there is a primary abrupt driver carrying most of the influence on the community and driving a distinct shift in the ecological community. The exact results vary across replicate model runs, but our intention here is to demonstrate the VE framework applied to pseudoproxies generated following a PSM rather than to scrutinise replicate-level variability.
After sub-sampling, proxy counting and conversion to relative abundances, even species that are highly generalist and remain at their carrying capacity throughout most of the simulation (e.g. Scenario 1, sp. 1 and 147; Scenario 2, sp. 166 and 174) show the characteristic saw-tooth pattern of empirical records, with apparent rises/falls in abundance. Such patterns may qualitatively be interpreted as changes or trends in species’ abundances where they are, in fact, absent from the ‘error-free’ benchmark. Thus, the implication is that the interpretation of micro-features in proxy records must be treated with caution.
Conceptual advancements
We demonstrate a multi–driver and multi-species model that represents both underlying ecological dynamics and core formation processes. In a proxy system model, the environmental drivers are the conditions under which the archive forms; thus, it is important to understand how they affect the sensor, archive, and subsequent observations. Terrestrial ecosystems are not usually governed by a single dominant ecological gradient/environmental process; instead, multiple drivers are likely to interact, with synergistic or opposing effects on the system (Blonder et al., 2017; Williams, 2020). By modelling multiple drivers, we can better understand how similar patterns can arise from different processes (i.e. equifinality), and how drivers operating at different space-time scales interact and, ultimately, manifest as a signal in a proxy record.
Empirical proxy data typically comprise multiple species (e.g. diatom or vegetation communities) and are often analysed using multivariate methods such as ordination or cluster analyses (Legendre and Birks, 2012; Simpson and Birks, 2012). Thus, virtual assessment of such methods requires models that generate multi-species archives (e.g. Benito et al., 2020; Blaauw et al., 2010; Mottl et al., 2021). Furthermore, to assess multivariate methods in a palaeoecologically relevant context, adequately representing core formation processes, such as accumulation rates and age, is crucial if we are to explore the implications of issues such as uneven spacing of samples in time. We address these issues by generating core-type samples containing multiple species, allowing for multivariate statistical methods to be assessed under various core formation processes. Additionally, by simulating a wide pool of species with varying niche spaces, compositional turnover can occur (as opposed to total extinction), allowing for analyses for shifts between species assemblages to be used.
By representing population growth rates as a function of the combined environmental drivers, our model includes dynamics of community response rates, which allows us to test different community-level responses to ecological drivers. Although patterns can be recreated using phenomenological models (Blaauw et al., 2010), models that represent fundamental ecological dynamics, such as population growth and dispersal, are required to better understand ecosystem change. The relationship between the rate of driver change and the rate of community response to that change is key to understanding long-term ecosystem change (Williams et al., 2020). Understanding how ecological dynamics affect the response of a sensor to environmental drivers, and consequently leaves a signal in a proxy archive, is important if we are to disentangle extrinsic driver effects from intrinsic ecosystem dynamics (Williams et al., 2011).
Advantage of simulation experiments
Simulation experiments offer the advantage of perfect knowledge of an imperfectly understood world. However, models are, by design, abstractions of reality and are limited in their representation of real-world processes. Empirical data are imprinted with information from the past, but our ability to infer past change from them is limited by a range of uncertainties. The advantage of using simulation as an experimental platform is that the effects of a range of known driving conditions, and uncertainties, on inferences drawn from data can be tested. Scenario 2 is an example of the introduction of an abrupt extrinsic shift with a known magnitude and timing. Such an example can be assessed for the likelihood of detecting the extrinsic shift in the response of the sensor (i.e. the pseudoproxy record) under different rates and magnitudes of driver change and different levels of driver noise. In addition to being able to test the effects of various driving environments on a sensor, a range of uncertainties can also be assessed for their individual and relative effects on the data. In the simulation framework we present, different sources of environmental and observer-introduced uncertainty can be explored by systematically introducing them to the synthetic data. This systematic degradation of data from an ‘error-free’ benchmark is the fundamental advantage of the virtual approach. This approach can help empirical palaeoecologists understand the effects from sources of uncertainty they have control over and can mitigate, as opposed to those that are uncontrollable. Finally, virtual experimentation allows us to generate thousands of replicate cores (limited only by available computing power) and sampling strategies that would be impractical to achieve in the laboratory.
Bridging the empirical and the virtual
Ultimately, the aim of developing more complex models in a virtual ecology framework is to enable us to assess the outcomes of quantitative analyses in a way that reflects palaeoecological data, the uncertainties inherent in them, and the quantitative methods used to make inferences from them. Simulation experiments provide a powerful way to test underlying dynamics at the environment, sensor, and archive levels (such as driver interactions) that are unobservable at the point of empirical data extraction. Simulation methods can be integrated with empirical studies to:
• a priori help shape field sampling methods: for example, number of core samples (across a region or local replication) required for a given research question.
• understand the sub-sampling and count resolution required to increase the likelihood of detecting a hypothesised signal in the data.
• test hypotheses about the underlying dynamics that may cause an observed pattern in the data (e.g. Liu et al., 2012)
• assess whether inferences made from the data are robust to uncertainty.
While we advocate the use of simulation experiments to advance understanding of uncertainties in empirical data, we certainly do not undervalue field and laboratory experiments. For example, Payne and Gehrels (2010) conducted a six-year experimental study to investigate tephra taphonomy in peat bogs. Although few and far between, such studies provide crucial direct empirical measurements of sources of uncertainty in different environments. Simulation can only help inform questions such as whether to focus effort on core replication or sub-sampling resolution as such issues depend on contextual (e.g. site) knowledge and the specific research question of interest. It is challenging to parameterise environmental processes, such as core mixing, appropriately for a given empirical context. However, it is possible to simulate a range of possibilities and build a picture of which sources of uncertainty (and combinations thereof) may most influence the results, and design studies accordingly. The relative importance of different sources of uncertainty will vary between systems and may vary in the same system over time. The nature of the research question will also influence the importance attached to different sources of uncertainty. For example, rapid species turnover may require more dense sub-sampling intervals to accurately reconstruct ecosystem change than to detect slower, long-term trends. The density of sub-sampling intervals that yields new information may be limited by the amount of mixing the core sample has experienced. By subjecting pseudoproxies to increasing levels of mixing, the likelihood of detecting a signal in the underlying drivers can be assessed under a range of conditions. It may prove more valuable to focus effort on core replication, rather than sub-sampling interval, in the absence of a ‘well-behaved’ core sample.
Outlook
The model we implement could be extended to include: (i) a spatial dimension; (ii) explicit representation of intrinsic ecological dynamics, such as competition and facilitation; and (iii) multiple taxonomic groups (e.g. vegetation and diatoms as sensors in the same model). Models need sufficient structural complexity to address a question of interest while remaining as simple as possible (O’Sullivan and Perry, 2013). However, there are some ways the model could be extended that would enable it to explore palaeoecologically relevant questions. To explore regional variability among core samples, a spatial dimension needs to be included; for example, the model could be run on a network of sites. Non-linear interactions between the species and the environment are required to introduce hysteresis effects and its implications for ecosystem resilience (Scheffer et al., 2001). Finally, empirical studies are more commonly using multiple proxies to characterise environmental change (e.g. combinations of pollen, diatoms, isotopes, or pigments). Multiple sensor models can be integrated into the same model to represent archives of different proxies simultaneously from the same driving environment. Of course, such exploration may be limited by computing power and may make the model outcomes more difficult to interpret.
There are many levels of complexity that can be incorporated into models and in the virtual ecological approach it is important to be pragmatic about the objective of developing a model. The method we describe is phenomenological in that it mimics patterns that are observed in empirical data in order to assess statistical methods of detecting them and evaluate the influence of uncertainties surrounding the statistical methods (Morris et al., 2017). If the objective were to identify underlying mechanisms or predict the dynamics of a specific context then a process based approach might be more useful (Topping et al., 2015). The problem of equifinality is that there are many ways in which a model may produce similar patterns to those observed in empirical data-sets (Beven, 2002); thus, interpretation for their empirical implications must be considered cautiously. Since ‘all models are wrong but some are useful’ (Box, 1979), multiple modelling and empirical approaches are required to address questions of ecological dynamics and uncertainties in palaeoecological data.
Supplemental Material
sj-docx-1-hol-10.1177_09596836241247304 – Supplemental material for Is the past recoverable from the data? Pseudoproxy modelling of uncertainties in palaeoecological data
Supplemental material, sj-docx-1-hol-10.1177_09596836241247304 for Is the past recoverable from the data? Pseudoproxy modelling of uncertainties in palaeoecological data by Quinn Asena, George LW Perry and Janet M Wilmshurst in The Holocene
Supplemental Material
sj-docx-2-hol-10.1177_09596836241247304 – Supplemental material for Is the past recoverable from the data? Pseudoproxy modelling of uncertainties in palaeoecological data
Supplemental material, sj-docx-2-hol-10.1177_09596836241247304 for Is the past recoverable from the data? Pseudoproxy modelling of uncertainties in palaeoecological data by Quinn Asena, George LW Perry and Janet M Wilmshurst in The Holocene
Footnotes
Acknowledgements
The authors are extremely grateful to the Centre for eResearch, especially Nick Young and Noel Zeng, for their digital resources support. We are very grateful to the New Zealand eScience Infrastructure for computer resources and expertise, particularly Anthony Shaw and Callum Walley. The authors would also like to acknowledge the support of the Perry lab group during early paper drafts.
Funding
The author(s) disclosed receipt of the following financial support for the research, authorship, and/or publication of this article: This work was conducted through the New Zealand’s Biological Heritage National Science Challenge, which is funded by the New Zealand Ministry of Business, Innovation and Employment.
Supplemental material
Supplemental material for this article is available online.
References
Supplementary Material
Please find the following supplemental material available below.
For Open Access articles published under a Creative Commons License, all supplemental material carries the same license as the article it is associated with.
For non-Open Access articles published, all supplemental material carries a non-exclusive license, and permission requests for re-use of supplemental material or any part of supplemental material shall be sent directly to the copyright owner as specified in the copyright notice associated with the article.
