A Service-Oriented Platform for Approximate Bayesian Computation in Population Genetics

Abstract

Approximate Bayesian computation (ABC) is a technique for performing Bayesian inference without explicitly requiring a likelihood function. In population genetics, it is widely used to extract information about the evolutionary history of genetic data. ABC compares summary statistics computed on simulated and observed data sets. Typically, a forward-in-time approach is used to simulate the genetic material of a population, starting from an initial ancestral population and following the evolution of the individuals generation by generation under various demographic and genetic forces. This approach is computationally expensive, which makes high-performance computing crucial for decreasing the overall response times. In this work, we propose a fully distributed web service-oriented platform for ABC that is based on forward-in-time simulations. Our proposal is based on a client-server approach. The client enables users to define simulation scenarios. The server enables efficient and scalable population simulations and can be deployed on a distributed cluster of processors or even in the cloud. It is composed of four services: a workload generator, a simulation controller, a simulation results analyzer, and a results builder. The server performs multithreaded simulations by executing a simulation kernel encapsulated in a proposed libgdrift library. We present and evaluate three different libgdrift library approaches whose algorithms aim to reduce execution times and memory consumption.

1. Introduction

Approximate Bayesian computation (ABC) is a technique developed for performing Bayesian inference without explicitly requiring a likelihood function (Beaumont et al., 2002; Turner and Zandt, 2012). It is widely used by evolutionary geneticists to infer parameters of distributions (such as mutation rates, population scenarios, and demographic events) that govern events hypothesized to have happened in the past of a population when reconstructing its genetic history. The core of Bayesian inference is the likelihood function, which expresses the probability of the observed data under a particular statistical model. However, when using simulation models, the likelihood function is difficult to define analytically and tends to be computationally expensive to evaluate. In ABC, simulation is therefore used as a tool for approximating the evaluation of the likelihood function.

In this context, two main approaches have been used for simulation: backward-in-time and forward-in-time. The backward-in-time simulation also known as coalescent theory (Kingman, 1982) identifies the most recent common ancestor of two individuals of a population. Then, a stochastic process coalesces these individuals. Thus, this approach starts from the observed sample of the population in the present and works backward to infer the genetic history of the population. The forward-in-time simulation focuses on individuals and usually uses a discrete generation scheme where different genetic or demographic events such as mutation and selection may occur at every generation. It starts from an ancestor population and follows its evolution by simulating the life cycle of each individual: birth, selection, mating, reproduction, mutation, migration, and death.

In this work, we focus on forward-in-time simulations driven by genetic drift algorithms because they allow genetic samples to be simulated under complex, realistic demographic scenarios. Forward-in-time simulations are restricted only by the assumptions of the model used to describe the genetic drift. Genetic drift is one of the basic mechanisms of evolution. It describes the situation in which some individuals may, by chance, leave behind a few more descendants than other individuals.

Currently, there are a number of simulators proposed in the technical literature for performing forward-in-time genetic simulations, such as the simuPOP tool (Peng and Kimmel, 2005), which is a general-purpose forward-in-time population genetics simulator. It uses Python scripts to manipulate populations. Messer (2013) presented the SLiM simulator. It was designed to study the effects of linkage and selection on a chromosome-wide scale. Later, Haller and Messer (2016) presented the SLiM 2 simulator, which includes a graphical user interface (GUI) for simulation construction, interactive runtime control, and dynamic visualization of simulation output. Thornton (2014) proposed a C++ library of routines named Fwdpp, which is intended to facilitate the implementation of forward-in-time population genetics simulations by abstracting basic operations required for simulating custom models. However, most of the current simulators are developed either for desktop computers or for high-performance computing (HPC) clusters. The simulators for desktop computers provide a GUI but are restricted to small models because of hardware limitations, and their simulations take a long time to complete. Simulators designed for HPC clusters require advanced knowledge of shell scripting and compilation but allow several (and larger) models to be simulated.

In a previous work (Sepulveda et al., 2017), we presented a simulation library named libgdrift 1.0. It was designed to execute forward-in-time simulations driven by genetic drift algorithms. The library optimizes the memory access and uses a two-phase compression technique based on a quaternary conversion to reduce the amount of memory and the execution time of the simulations. We showed that our library can outperform other state-of-the-art simulators.

In this work, we present a fully distributed web service-based platform, named gdrift++ (available at http://200.9.100.196:31080/#/home). Our genetic drift simulation platform aims to meet the following requirements (Liu et al., 2008): (1) speed (perform as many simulations as possible in a short time), (2) scalability (because of the large number of simulations executed and their computational costs), and (3) flexibility (to properly simulate different population dynamics).

The proposed platform is designed with a client-server approach to perform ABC (Beaumont, 2010; Csilléry et al., 2010) for parameter inference and model selection. It provides a user-friendly GUI and takes full advantage of HPC cluster resource capabilities. On the client side of our platform, the user can define different simulation scenarios and set parameters such as the mutation rate. The server, which uses the REpresentational State Transfer (REST) architecture (Fielding, 2000), can be deployed in the cloud. REST defines an architecture for building large-scale distributed systems by delivering an alternative method for remote procedure calls across the Internet. REST provides an easy way to publish and consume web services (Vinoski, 2008). The server is composed of four services: (1) a workload generator (WG) used to build the simulation scenarios and prepare the workload, (2) a simulation controller (SC) used to execute the simulations with different parameters in parallel by triggering the execution of the libgdrift library, (3) a simulation results analyzer (SRA) that evaluates partial results of the simulations and may adjust certain parameters, and (4) a results builder that builds graphics to present the statistical results.

The libgdrift 1.0 library stores, for each individual of every population, all the information regarding the variations of its genetic markers during the simulation, including all applied mutations. Although this feature simplifies the conceptual model of the simulator, it implies a great amount of replicated information in all individuals that share the same version of a marker. Moreover, many variations are only valid for a certain amount of time during the simulation and may be discarded before reaching the final generation, so there is also a certain amount of wasted computation. Therefore, in this work, we also introduce two additional simulation libraries, named mutation-tree and mutation-vector. The former is based on a tree data structure and the latter uses a vector data structure to represent the mutations produced in each generation. Both libraries aim to reduce the amount of memory, improve CPU-cache memory access, and reduce the execution time of the simulations. To evaluate our proposal, we use the Wright–Fisher model (Wright, 1931; Fisher, 1999), where each individual has one gene and, at every generation, the population dies and another one is born at the beginning of the next generation, so the population size remains stable.

The remainder of this article is organized as follows. In Section 2, we present related work. In Section 3, we present our proposed platform. Section 4 describes the libgdrift simulation library. Section 5 shows the experimental results, and we conclude in Section 6.

2. Related Work

There are several approaches presented in the technical literature to perform ABC. Bouckaert et al. (2014) presented BEAST2, which is an open source, extensible, and flexible software platform devised to analyze Bayesian evolution. It implements a structured coalescent model that allows inference of subpopulation sizes and migration rates together with location-annotated genealogies (structured trees) from genetic data. Huang et al. (2011) proposed MTML-msBayes (multi-taxa multi-locus msBayes). It is a software that implements a comparative phylogeographic analysis of multiple codistributed taxon-pairs using a hierarchical ABC model. The DIYABC (Cornuet et al., 2014) is a software package for a comprehensive analysis of the history of populations using ABC on DNA polymorphism data. It can be used to compare evolutionary scenarios and quantify their relative support and estimate parameters for one or more scenarios. It is based on a backward-in-time approach.

Liepe et al. (2010) presented a framework named ABC-SysBio for parameter estimation and model selection from experimental data in biological systems using ABC. ABC-SysBio implements likelihood-free parameter inference and model selection in dynamical systems. It is designed to work with both stochastic and deterministic models written in Systems Biology Markup Language. Arenas et al. (2015) presented a computer framework named CodABC to co-estimate recombination, substitution, and molecular adaptation rates by ABC from aligned coding sequence data. The tool is based on a special version of the coalescent simulator CoalEvol that implements recombination, a variety of migration models, demographics, and user-defined populations/species trees.

De Mita and Siol (2012) presented an Evolutionary Genetics and Genomics Library (EggLib) software package written in C++/Python. It provides a set of tools for processing biological sequence data, analyzing nucleotide alignments, performing coalescent simulations allowing rarely featured mutation models, mutational bias, as well as explicit selfing and estimating demographic parameters through ABC. Sandoval-Castellanos et al. (2014) presented a Bayesian Statistical Inference of Coalescent Simulations (BaySICS) program that provides an integrated and user-friendly platform to perform coalescent simulations for DNA sequence data and ABC analysis including the estimation of posterior densities for population parameters and Bayes factors to compare models.

Dutta et al. (2017) proposed ABCpy, which is a modular scientific library for ABC written in Python. It provides an interface to run large-scale parallel simulations without the need for users to have knowledge of parallel programming. Wegmann et al. (2010) proposed a tool named ABCtoolbox. It is formed by a set of open source programs to perform ABC written in C++. The user can perform all the necessary steps of a full ABC analysis, including parameter sampling and prior distributions, data simulations, computation of summary statistics, estimation of posterior distributions, model choice, validation of the estimation procedure, and visualization of the results.

Most previous works are based on the backward-in-time approach, which is faster because only the genomic samples that survived until the present are simulated backward in time. The main limitation of the forward-in-time approach is its high computational cost (execution times tend to be at least linear with respect to the number of individuals multiplied by the number of generations). In a previous work (Sepulveda et al., 2017), we presented a simulation library based on a forward-in-time approach driven by genetic drift algorithms. Results showed that our library reduces the running time and the amount of memory storage required by other well-known simulation software such as SLiM2 (Haller and Messer, 2016) and fwdpp (Kessner and Novembre, 2014). In this work, we propose a web service-based platform to perform ABC, which executes an optimized simulation library for population genetic drift.

3. Platform Overview

The gdrift++ platform is a fully distributed system designed for performing ABC for parameter inference and model selection in population genetic models. It is composed of a set of RESTful services, that is, web services that follow the REST architecture over HTTP. The services communicate with each other by exchanging JavaScript Object Notation (JSON) documents via HTTP requests. Figure 1 shows the interaction between the client side (front-end) and the server side (backend). The server side consists of four main services: WG service, SRA service, results builder service, and SC service.

FIG. 1.

System overview: The client defines the simulation scenarios and visualizes the statistics. The server is composed of four services: the workload generator, the simulation controller, the simulation results analyzer, and the results builder.

3.1. Client side

The client side (front-end) consists of a Web-based service that allows the user to define different simulation scenarios. This service provides a user interface (Fig. 2) that helps to set general simulation parameters, such as the user identifier, simulation identifier, and maximum number of simulations, to define individuals (ploidy, chromosomes, genes, mutation rates, etc.) and to configure scenarios (event lists). With these settings, the service produces a user settings document for each experiment, which is sent to the server. A second file with the data sample of the target population is also sent to the server.

FIG. 2.

Screenshot of the web page deployed on the client side of the gdrift++ platform. The menu at the right of the screen allows the user to run a new simulation, check previous results, and open recent simulations. At the bottom, it shows the steps required to create a new simulation.

One of the main features of the client service is the way it allows the user to define simulation scenarios (third step in Fig. 2). To this end, the user has to set parameters for the individuals of the population as well as the events that occur during the simulation. The scenarios are defined using Sankey diagrams, where nodes (rectangles) represent events and links (gray areas) represent populations. A Sankey diagram is a specific type of flow diagram in which the width of the links is proportional to the flow quantity. An example can be observed in Figure 3. At the bottom of the figure, the values t0 … t4 represent the times at which the events should have occurred. At the left, we show the list of events, such as an increase/decrease of the number of individuals, the migration of a subset of the population, a merge of different populations, or even the extinction of the population.

FIG. 3.

Sankey diagram used in the client side to define the scenarios of simulation.

3.2. Server side

The server (backend) is composed of four services running on a distributed platform. Each service can be deployed on different clusters of multi-core computers or even in a cloud system. In the following, we describe each service.

3.2.1. WG service

The WG service creates batches of simulation specification documents based on the user settings document. A simulation specification document is an instance of the user settings document. The batches of simulation specification documents are distributed among the SC services. To this end, the user settings document is recursively traversed to find nested JSON objects with the element "type". The element "type" has two possible values: "random" or "fixed". If "type" = "random", another element identified with the keyword "distribution" is used to define the probability distribution and its parameters, from which a value is drawn. Otherwise, the WG service uses the fixed value specified by the user in the settings document.
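The recursive traversal described above can be sketched as follows. This is a minimal illustration, not the platform's actual code: the source only names the elements "type" and "distribution", so the field names "value", "name", "a", "b", "mean", and "stddev", as well as the functions instantiate and make_batch, are assumptions introduced here.

```python
import random


def instantiate(node, rng):
    """Return a copy of `node` in which every typed parameter object is
    replaced by a concrete value (one simulation specification)."""
    if isinstance(node, dict):
        if node.get("type") == "fixed":
            # Hypothetical field name: the user-provided constant.
            return node["value"]
        if node.get("type") == "random":
            dist = node["distribution"]
            # Hypothetical distribution encoding.
            if dist["name"] == "uniform":
                return rng.uniform(dist["a"], dist["b"])
            if dist["name"] == "normal":
                return rng.gauss(dist["mean"], dist["stddev"])
            raise ValueError("unsupported distribution: %r" % dist["name"])
        # Any other object (including events, whose own "type" is neither
        # "fixed" nor "random") is traversed recursively.
        return {key: instantiate(value, rng) for key, value in node.items()}
    if isinstance(node, list):
        return [instantiate(item, rng) for item in node]
    return node


def make_batch(user_settings, batch_size, seed=0):
    """Build a batch of simulation specification documents."""
    rng = random.Random(seed)
    return [instantiate(user_settings, rng) for _ in range(batch_size)]
```

Each call to make_batch yields independent instances of the same settings document, so parameters marked "random" differ between simulations while "fixed" parameters stay constant.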

3.2.2. SC service

The SC service is in charge of executing simulation threads based on the genetic drift simulation library called libgdrift 1.0 (https://github.com/robertosolargallardo/libgdrift). When a simulation specification document arrives from the WG service, the SC service checks whether there are computing resources (i.e., CPUs) available for executing the simulation. If so, the SC service creates a thread instance of libgdrift 1.0 and launches the simulation in the background. Otherwise, the simulation specification document is queued until a computing resource is released. At the end of the simulation, statistics over the populations are computed and sent to the SRA service together with the simulation specification document. Furthermore, the SC service deployment is designed to be elastic; that is, we can dynamically increase the number of SC services if we require simulation results as fast as possible, or dynamically decrease the number of SC services if there is no constraint on the response time.
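The run-or-queue behavior of the SC service can be illustrated with a fixed-size thread pool, which runs a job immediately when a worker is free and queues it otherwise. This is only a sketch under assumptions: run_simulation is a hypothetical stand-in for the libgdrift kernel, and the class name, the on_result callback, and the fields "id" and "events" are not taken from the platform.

```python
import os
from concurrent.futures import ThreadPoolExecutor


def run_simulation(spec):
    """Hypothetical stand-in for the simulation kernel: returns the
    statistics that would be sent to the SRA service."""
    return {"id": spec["id"], "n_events": len(spec.get("events", []))}


class SimulationController:
    def __init__(self, n_workers=None, on_result=None):
        # One worker per available CPU unless stated otherwise.
        self._pool = ThreadPoolExecutor(max_workers=n_workers or os.cpu_count() or 1)
        self._on_result = on_result or (lambda spec, stats: None)

    def submit(self, spec):
        """Run the simulation now if a worker is free; otherwise the
        executor keeps the job queued until a worker is released."""
        future = self._pool.submit(run_simulation, spec)
        # Forward the statistics together with the specification document.
        future.add_done_callback(lambda f: self._on_result(spec, f.result()))
        return future

    def shutdown(self):
        """Wait for all queued and running simulations to finish."""
        self._pool.shutdown(wait=True)
```

Elasticity, in the sense described above, would correspond to starting or stopping additional controller instances rather than resizing a single pool.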

3.2.3. SRA service

The SRA service compares the statistics of the sampled data obtained from the target population (data_sampled) and the data obtained from the simulations (data_simulated) to determine the fitness level between both of them. The sampled data are received from the client side in a text-based format, which represents nucleotide sequences of populations. The statistics of the sampled data are computed and stored in main memory until the end of the experiment. Typically, statistics computed over samples correspond to different metrics, such as number of distinct haplotypes, number of segregating sites, mean and variance of pairwise differences, Tajima's D statistics, mean and variance of the numbers of the rarest nucleotide at segregating sites, and number of private segregating sites.

When a message with the results of a simulation arrives from an SC service, the fitness level d of the data obtained from the simulations is computed as follows: $d = \rho(\mathrm{statistics}(\mathrm{data}_{\mathrm{sampled}}), \mathrm{statistics}(\mathrm{data}_{\mathrm{simulated}}))$, where ρ is the weighted Euclidean distance. If $d \le \varepsilon$ (where ɛ is a tolerance threshold), the data obtained from the simulations are stored in secondary memory so that they can be analyzed later. Otherwise, the results obtained by the simulation are discarded.
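As an illustration of the acceptance rule, the following sketch computes a weighted Euclidean distance between two sets of summary statistics and compares it with the tolerance. The source does not specify how the weights are defined, so the form sqrt(sum_k w_k (x_k - y_k)^2), the dictionary layout, and the function names are assumptions.

```python
import math


def weighted_euclidean(stats_sampled, stats_simulated, weights):
    """Distance between two dictionaries of summary statistics.
    Assumed form: sqrt(sum_k w_k * (x_k - y_k)^2)."""
    return math.sqrt(sum(
        weights[name] * (stats_sampled[name] - stats_simulated[name]) ** 2
        for name in stats_sampled
    ))


def accept(stats_sampled, stats_simulated, weights, epsilon):
    """A simulation is kept only if its distance is within the tolerance."""
    return weighted_euclidean(stats_sampled, stats_simulated, weights) <= epsilon
```

For example, with unit weights, statistics (1, 2) and (4, 6) are at distance 5, so the simulation is accepted for a tolerance of 5 and rejected for a tolerance of 4.9.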

Furthermore, the SRA service keeps two control counters: (1) counter_batch, which represents the number of runs completed within the current batch and is increased when a message containing the results of a simulation (data_simulated) arrives at the SRA service, and (2) counter_accepted, which represents the number of accepted results and is increased when the simulation results are within the tolerance threshold ɛ. If counter_batch is equal to the size of the batch of simulations and counter_accepted is less than the maximum number of runs, the SRA service sends a message to the WG service to notify it that the experiment has to continue. Then, the WG service sends another batch of simulation specification documents to the SC services. Otherwise, a message is sent to the WG service to notify it that the experiment has to finish. Both the batch size and the maximum number of runs are parameters of our platform.
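The counter logic can be sketched as a small state machine. The names counter_batch and counter_accepted follow the text; the class itself, the returned strings, the "wait" state for an incomplete batch, and the reset of counter_batch between batches are assumptions made for this illustration.

```python
class ResultsAnalyzer:
    def __init__(self, batch_size, max_runs, epsilon):
        self.batch_size = batch_size
        self.max_runs = max_runs
        self.epsilon = epsilon
        self.counter_batch = 0      # results received in the current batch
        self.counter_accepted = 0   # results within the tolerance so far

    def on_result(self, d):
        """Register the distance d of one finished simulation and return
        the message for the WG service: 'wait', 'continue', or 'finish'."""
        self.counter_batch += 1
        if d <= self.epsilon:
            self.counter_accepted += 1
        if self.counter_batch < self.batch_size:
            return "wait"           # assumption: batch not complete yet
        self.counter_batch = 0      # assumption: counter restarts per batch
        if self.counter_accepted < self.max_runs:
            return "continue"       # WG must send another batch
        return "finish"             # enough accepted runs were collected
```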

3.2.4. Results builder

The results builder service is in charge of plotting the simulation results stored in secondary memory. It creates the figures and displays them on the client side. Figure 4 shows a screenshot of the results displayed by the client service.

FIG. 4.

Screenshot of the results presented in the client side of our proposed platform.

4. Libgdrift Simulation Library

The libgdrift simulation library aims to simulate forward-in-time genetic drift models. Libgdrift supports the following events: create, to generate an initial fixed-size population; split, to emulate the partitioning of a population into different isolated populations (genetic divergence); merge, to combine a set of populations into a new unique population; decrease and increase, to emulate the variation in the population size (a decrease followed by an increase replicates a bottleneck effect); migration, to emulate the physical movement of individuals from one area to another (founder effect); and extinction, to emulate the death of all individuals of a population. Furthermore, libgdrift supports several mutation models, such as JC69, K80, HKY85, and TN93 for single-nucleotide polymorphisms (SNPs), and the stepwise mutation model for short tandem repeats (STRs).

The individuals of a population are characterized by their ploidy (number of sets of chromosomes) and a set of chromosomes, each of which is composed of a set of genes. Genes may be either STRs (also known as microsatellites), which are tandem repeats of short DNA motifs (between 2 and 5 base pairs that are repeated several times), or SNPs, which are variations in a single nucleotide that occur at a specific position in the gene. We choose these types of genetic markers because they are the most frequently used for generating genetic data sets in population genetics simulations.

As we explained before, a disadvantage of forward-in-time simulations is the way populations are handled during generation-to-generation transitions. Classical approaches create copies of each population to perform each transition. This is highly expensive in terms of memory utilization as the population size increases. In this section, we present three different approaches of the libgdrift simulation library devised to improve the running time of the simulations as well as the amount of memory (number of copies of each population) required to execute those simulations.

4.1. Libgdrift 1.0

Our first version of the libgdrift simulation library (Sepulveda et al., 2017) processes the simulation specification documents. To this end, the library creates a data structure named GenePool using the description of the individuals. A second data structure named EventList keeps the events that describe the scenario. Events are composed of a type (e.g., create, merge, split), a timestamp, and a parameters list. Events are stored in the EventList sorted by their timestamp in chronological order.
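A chronologically sorted event list can be sketched as below. The names Event and EventList mirror the text, but the fields, the method add, and the use of sorted insertion are assumptions; the actual library is written in C++ and may organize this differently.

```python
import bisect
from dataclasses import dataclass, field


@dataclass(order=True)
class Event:
    timestamp: int                                   # only field used for ordering
    type: str = field(compare=False)                 # e.g., "create", "split", "merge"
    params: dict = field(compare=False, default_factory=dict)


class EventList:
    def __init__(self):
        self._events = []

    def add(self, event):
        """Insert the event so that the list stays sorted by timestamp;
        events with equal timestamps keep their insertion order."""
        bisect.insort(self._events, event)

    def __iter__(self):
        return iter(self._events)
```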

The code of the library is optimized so that all internal objects are stored in array-like data structures (std::vector or native C++ arrays) and are contiguously allocated in memory during the execution of the simulation. Furthermore, we adapt the simulator code to avoid unpredictable branches, which reduces the execution time. Additionally, to reduce the amount of memory, we keep only two copies of each simulated population during the simulation: a source population pop_src and a destination population pop_dst. While random sampling is performed over pop_src, offspring individuals are stored in pop_dst. At the end of the generation and right before the next generation, populations pop_src and pop_dst are swapped.
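The two-buffer scheme can be sketched for a Wright–Fisher population of constant size as follows. The names pop_src and pop_dst follow the text; the function, the representation of individuals as plain values, and uniform parent sampling with replacement are assumptions of this illustration.

```python
import random


def simulate(pop_src, generations, rng):
    """Evolve a constant-size population using only two buffers.
    Note: the list passed as pop_src is reused as a buffer and overwritten."""
    pop_dst = [None] * len(pop_src)
    for _ in range(generations):
        for i in range(len(pop_dst)):
            # Each offspring copies a parent drawn at random (with
            # replacement) from the current generation.
            pop_dst[i] = rng.choice(pop_src)
        # Swap roles instead of allocating a new population.
        pop_src, pop_dst = pop_dst, pop_src
    return pop_src
```

Because the buffers are swapped rather than reallocated, memory use stays at two populations regardless of the number of generations.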

We also reduce the number of point mutation computations. Point mutations occur during DNA replication and may involve a single base pair substitution, insertion, or deletion. Our approach consists of performing point mutations at the end of each generation instead of evaluating each gene per individual. To this end, we draw a random number from a binomial distribution $B_i = \mathrm{binomial}(n, p)$, with $n = N \times L(g_i)$ and $p = \mu_i$, where $L(g_i)$ is the number of nucleotides of gene g_i, N is the population size, and μ_i is the mutation rate of gene g_i. The value of B_i indicates the total number of point mutations to be performed at the gene g_i. We use a binomial distribution because it models the number of times that a particular event occurs in a sequence of independent trials. In particular, this approach significantly reduces the algorithmic complexity of the simulations from $O(N \times nG \times L(g_i))$ to $O(N) + O(B_i \times nG \times L(g_i))$, where nG is the number of genes, taking into account that $B_i \ll N$.
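The idea of drawing the number of mutations once per gene and generation can be sketched as follows. Several simplifications are assumptions of this example rather than features of libgdrift: only substitutions are applied, the new base is chosen uniformly among the other three, target sites are drawn with replacement, and the binomial draw uses a geometric waiting-time helper written here to keep the example within the standard library.

```python
import math
import random


def binomial(n, p, rng):
    """Draw from Binomial(n, p) by jumping between successes with
    geometric gaps (expected cost proportional to n * p)."""
    if p <= 0.0:
        return 0
    if p >= 1.0:
        return n
    log_q = math.log1p(-p)
    count, position = 0, 0
    while True:
        u = 1.0 - rng.random()                        # uniform in (0, 1]
        position += int(math.log(u) / log_q) + 1      # trials up to next success
        if position > n:
            return count
        count += 1


def mutate_gene(population, gene_length, mu, rng):
    """Apply B_i ~ binomial(N * L(g_i), mu_i) substitutions to gene g_i.
    `population` is a list of N sequences (lists of digits 0..3)."""
    n_individuals = len(population)
    b = binomial(n_individuals * gene_length, mu, rng)
    for _ in range(b):
        individual = rng.randrange(n_individuals)
        site = rng.randrange(gene_length)
        old = population[individual][site]
        population[individual][site] = rng.choice([x for x in range(4) if x != old])
    return b
```

With N = 200, L(g_i) = 50, and μ_i = 0.001, the expected number of mutations per generation is 10, compared with 10,000 per-site evaluations in the naive scheme.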

Finally, our first approach uses a two-phase compression technique. In the first phase, a quaternary base conversion translates the four abbreviated DNA nucleotides in alphabetical order (A, C, G, T) into quaternary digits in numerical order (0, 1, 2, 3) by using a simple mapping function. Since there are only four digits, they can be represented by two binary digits. In this way, we can store four nucleotides per byte.
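The first phase can be illustrated with a short packing routine. The mapping A, C, G, T to 0, 1, 2, 3 follows the text; the bit order within each byte (first nucleotide in the two least significant bits) and the function names are assumptions of this sketch.

```python
CODE = {"A": 0, "C": 1, "G": 2, "T": 3}
BASES = "ACGT"


def pack(sequence):
    """Store four nucleotides per byte, two bits each."""
    packed = bytearray((len(sequence) + 3) // 4)
    for i, base in enumerate(sequence):
        packed[i // 4] |= CODE[base] << (2 * (i % 4))
    return bytes(packed)


def unpack(packed, length):
    """Recover the first `length` nucleotides from a packed sequence."""
    return "".join(
        BASES[(packed[i // 4] >> (2 * (i % 4))) & 3] for i in range(length)
    )
```

The sequence length has to be stored separately, since the last byte may be only partially filled.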

In the second phase, at the beginning of the simulation, the libgdrift 1.0 creates a GenePool, which stores all genes and their variants. Each variant points to a random generated gene of reference. Since individuals share most of their genetic material, they do not explicitly store their genes as attributes (no copies of existing instances), instead individuals point to GenePool entries. When a mutation is triggered for a variant $e_{(i,j)}$ of gene g_i, a new GenePool variant $e_{(i,k)}$ ($k > j$) is created (or reused from the recycling bin). Each variant $e_{(i,k)}$ has a pointer to the gene of reference g_i, an internal control counter, which indicates the number of individuals pointing to it, and a list of mutations $\langle p, t, bp \rangle$, where p is the position where the mutation is performed, t is the type of point mutation (substitution, insertion, or deletion), and bp is the new base pair (when a deletion occurs). At the beginning of each generation, all control counters are decreased by 1. When an individual points to an entry, its counter is increased by 1. At the end of the simulation, all entries with counter equal to 0 are stored into a fixed size recycling bin for further utilization.

4.2. Mutation-tree

The mutation-tree approach is a tree-based implementation of a gene where each node represents an allele. That is, the nodes represent variations for which at least one individual contains that version or a mutation derived from it. An individual is composed of a set of pointer alleles, where each one refers to a tree node indicating a relationship. When a mutation is triggered, a new leaf node is created, and the mutation counter is increased by one. This leaf node inherits all mutations from its parent node and so on recursively until the root node. New nodes are always leaf nodes.

Mutations are not actually applied until the end of the simulation since the nodes of the tree may be removed during the simulation process. When a leaf node is not referenced anymore, this node is removed and its mutations are discarded. When an internal node is not referenced by any individual and it has only one child, the node is removed and the tree is contracted, that is, both nodes (parent and child) are merged and their mutation counters are accumulated. When the simulation ends, the mutations are applied by traversing the tree.
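The removal and contraction rules described above can be sketched as follows. This is an illustrative Python sketch, not the mutation-tree library itself: the names Node, mutate, prune, and mutations_to_root are hypothetical, each node stores the number of mutations on the link to its parent, and we assume that the root node is never removed.

```python
class Node:
    def __init__(self, parent=None, mutations=0):
        self.parent = parent
        self.children = []
        self.refs = 0               # individuals currently pointing at this allele
        self.mutations = mutations  # mutations on the link from the parent


def mutate(node):
    """Create a new leaf below node carrying one additional mutation."""
    leaf = Node(parent=node, mutations=1)
    node.children.append(leaf)
    return leaf


def prune(node):
    """Remove or contract node (and its ancestors) once it is unreferenced."""
    while node is not None and node.refs == 0 and node.parent is not None:
        parent = node.parent
        if not node.children:
            # Unreferenced leaf: drop it and discard its mutations.
            parent.children.remove(node)
        elif len(node.children) == 1:
            # Unreferenced internal node with one child: merge it into the child
            # and accumulate the mutation counters.
            child = node.children[0]
            child.mutations += node.mutations
            child.parent = parent
            parent.children[parent.children.index(node)] = child
        else:
            return  # still a branching point, keep it
        node = parent


def mutations_to_root(node):
    """Count the mutations accumulated along the path from node to the root."""
    total = 0
    while node is not None:
        total += node.mutations
        node = node.parent
    return total
```

Calling prune on a node whose reference count has just dropped to zero keeps the tree limited to alleles that are still carried by some individual or that are needed as branching points.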

Figure 5 shows an example of the mutation tree data structure obtained with the mutation-tree library. The number inside each node represents the number of references, that is, the number of individuals that have the allele represented by the node. The number on each link represents the number of mutations. The data structure used to implement the mutation-tree is kept in main memory. The arrays of colored “insects” represent the individuals pointing to the leaf nodes of the tree.

FIG. 5.

Example of the mutation-tree data structure. The numbers inside each node indicate the number of individuals that have an allele represented by the node. The number in the links represents the number of mutations.

4.3. Mutation-vector

The main difference between the mutation-vector library and the previous library versions is the new mutation storage system. In the original model of libgdrift 1.0, each individual in the simulation stores all the information regarding its alleles for each marker, including all applied mutations. Although this feature simplifies the conceptual model of the simulator, it implies a large amount of replicated information in all individuals that share the same allele. Moreover, the alleles of many individuals are only valid for a certain period of time during the simulation and may be discarded before reaching the final generation, so there is also a certain amount of wasted computation.

In the mutation-vector library, each individual stores only the identifier of its current allele for each genetic marker. Additionally, the simulator keeps a mutation vector data structure storing the identifier of the direct ancestor of each allele. During the simulation process, no mutation is applied. We only keep track of the mutations that should be applied to each allele, by means of the references to their ancestors in the mutation vector; the mutations are applied when the sequences for the last generation are needed after the simulation finishes. We assume that there is an allele with identifier 0, without a parent (represented with the value −1 in the mutation vector). That allele represents the original sequence at the beginning of the simulation. When an individual receives a mutation, we create a new identifier new_id as a descendant of the old allele_id for the relevant genetic marker. Then, we add a new entry in the mutation vector data structure with allele_id as the parent of new_id, and we assign new_id as the current allele of the mutated individual. The new identifiers are generated using an incremental counter.
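As an illustration, the bookkeeping for a single genetic marker can be sketched in a few lines of Python. The class name MutationVector and its methods are hypothetical and only mirror the description above: a list of parent identifiers, the value −1 for allele 0, and an incremental counter for new identifiers.

```python
class MutationVector:
    def __init__(self):
        # Entry k holds the parent of allele k; allele 0 has no parent (-1).
        self.parent = [-1]

    def mutate(self, allele_id):
        """Register a new allele descending from allele_id and return its id."""
        new_id = len(self.parent)  # incremental counter
        self.parent.append(allele_id)
        return new_id

    def lineage(self, allele_id):
        """Return the chain of identifiers from allele_id back to allele 0."""
        path = []
        while allele_id != -1:
            path.append(allele_id)
            allele_id = self.parent[allele_id]
        return path
```

For example, with three individuals that all start at allele 0, mutating the second individual and then an offspring that inherited its allele gives the identifiers [0, 1, 2] and the mutation vector [-1, 0, 1]. No sequence data is touched during these updates.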

At the end of the simulation, when it is time to compute the genetic statistics of the resulting population, we apply the mutations to the individuals. For this purpose, we use a temporary data structure named applied_mutations that associates the identifiers of the relevant alleles with the modified sequences obtained by recursively applying all the corresponding mutations. This process is shown in Algorithm 1 (getAllele), which makes use of the function createInititalAllele to create the initial sequence for a particular mutation_model, the function applyMutation, which explicitly applies a mutation to a particular sequence given a mutation_model, and a recursive call.

Algorithm 1

The algorithm receives the identifier of a single allele whose sequence is needed, the mutation vector data structure that stores the identifier of the parent of each allele, the temporary structure named applied_mutations that stores the generated sequences, and the mutation model for the particular genetic marker. It returns the generated sequence, which is used to compute the genetic statistics.

getAllele(allele_id, mutation_vector, applied_mutations, mutation_model)
    if applied_mutations contains allele_id then
        return applied_mutations[allele_id]
    else if allele_id = 0 then
        sequence = createInititalAllele(mutation_model)
        applied_mutations[allele_id] = sequence
        return sequence
    else
        parent_id = mutation_vector[allele_id]
        parent_sequence = getAllele(parent_id, mutation_vector, applied_mutations, mutation_model)
        sequence = applyMutation(parent_sequence, mutation_model)
        applied_mutations[allele_id] = sequence
        return sequence
    end if
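A runnable Python rendering of Algorithm 1 is sketched below. Only get_allele follows the pseudocode; ToyModel, create_initial_allele, and apply_mutation are hypothetical stand-ins (a fixed-length DNA string with random point substitutions), since the paper does not specify the mutation models at this level of detail.

```python
import random


class ToyModel:
    """Hypothetical mutation model: fixed-length DNA with point substitutions."""

    def __init__(self, length=8, seed=0):
        self.length = length
        self.rng = random.Random(seed)


def create_initial_allele(model):
    # Assumption: the ancestral sequence is a run of 'A' bases.
    return "A" * model.length


def apply_mutation(sequence, model):
    # Replace one randomly chosen position with a different base.
    pos = model.rng.randrange(len(sequence))
    base = model.rng.choice([b for b in "ACGT" if b != sequence[pos]])
    return sequence[:pos] + base + sequence[pos + 1:]


def get_allele(allele_id, mutation_vector, applied_mutations, model):
    """Return the sequence of allele_id, caching every sequence it builds."""
    if allele_id in applied_mutations:
        return applied_mutations[allele_id]
    if allele_id == 0:
        sequence = create_initial_allele(model)
    else:
        parent_id = mutation_vector[allele_id]
        parent_sequence = get_allele(parent_id, mutation_vector,
                                     applied_mutations, model)
        sequence = apply_mutation(parent_sequence, model)
    applied_mutations[allele_id] = sequence
    return sequence
```

Because applied_mutations caches every intermediate sequence, alleles that share ancestors reuse the ancestor sequences instead of rebuilding them. For very long lineages, a Python version would probably need an iterative walk to stay within the interpreter's recursion limit.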

5. Experimental Results

In this section, we present the evaluation of the gdrift++ platform running the three proposed libgdrift libraries. We simulate the Wright–Fisher model (Wright, 1931; Fisher, 1999). All the C++ code was compiled using gcc version 5.3.1. Peak memory usage was measured using Massif, a memory profiler that is part of the Valgrind tools. Experiments were executed on two 32-core AMD Opteron processors with 128 GB of RAM and an L1 cache of 2048 KB. The WG service, the results builder service, and the SRA service are executed on one processor. The SC service is executed on the other processor, so the simulations can be executed with up to 32 cores.
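For readers unfamiliar with the Wright–Fisher model, one generation can be sketched as follows. This is a textbook-style illustration under simplifying assumptions (a haploid population of constant size and a per-offspring mutation probability mu), not the simulation kernel of libgdrift; the function name and the mutate callback are hypothetical.

```python
import random


def wright_fisher_step(population, mu, mutate, rng):
    """Return the next generation of a haploid Wright-Fisher population.

    population: list of allele identifiers, one per individual
    mu:         probability that an offspring receives a mutation
    mutate:     callback mapping an allele identifier to a new identifier
    rng:        random.Random instance used for all random draws
    """
    offspring = []
    for _ in range(len(population)):
        allele = rng.choice(population)  # sample a parent with replacement
        if rng.random() < mu:            # mutation event for this offspring
            allele = mutate(allele)
        offspring.append(allele)
    return offspring
```

The mutate callback is where a structure such as the GenePool, the mutation tree, or the mutation vector would register the new allele.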

5.1. Performance evaluation

The simulation parameters used in the following experiments are (1) population size, (2) mutation rate, and (3) locus length. We use a population size of 1000 individuals, 1000 generations, mutation rates of 10⁻⁶, 10⁻⁷, 10⁻⁸, and 10⁻⁹, and locus lengths of 10 KB, 100 KB, and 1 MB. The performance metrics were obtained as the average of 100 executions.

Figures 6a, b and 7 show the running time obtained with 32 cores with a locus length of 10 KB, 100 KB, and 1 MB, respectively. The x-axis shows the mutation rate ranging from 10⁻⁹ to 10⁻⁶. In general, both the mutation-vector and the mutation-tree libraries present lower running times than their predecessor, the libgdrift 1.0 library. For low mutation rates, the mutation-tree library tends to present the best performance. However, when the mutation rates are higher, the mutation-vector library tends to drastically outperform the other simulation libraries. This behavior is mainly due to the storage optimization implemented by the mutation-vector library. As we increase the amount of data simulated, the mutation-vector library optimizes access to the CPU cache memories (Fig. 9).

FIG. 6.

Running time obtained with 32 cores in nanoseconds, a mutation rate from 10⁻⁹ to 10⁻⁶, and for a locus length of (a) 10 KB and (b) 100 KB.

FIG. 7.

Running time obtained with 32 cores in nanoseconds, a mutation rate from 10⁻⁹ to 10⁻⁶, and for a locus length of 1 MB.

In Figure 8, we show how our proposed gdrift++ platform scales when executing all the simulation libraries. The y-axis shows the execution time in nanoseconds and the x-axis shows the number of cores. In Figure 8a, we use a mutation rate of 10⁻⁹, and in Figure 8b, we use a mutation rate of 10⁻⁶. In both cases, the mutation-tree and the mutation-vector present almost constant execution times as we increase the number of cores. However, the original library, libgdrift 1.0, presents good results for a low number of cores (1–8), but it cannot scale as the execution time tends to increase with a larger number of cores (16–32). This behavior is also explained in Figure 9.

FIG. 8.

Scalability: Execution time in nanoseconds obtained with different number of cores, a locus length of 1 MB for (a) a mutation rate of 10⁻⁹ and (b) a mutation rate of 10⁻⁶.

FIG. 9.

Number of cache misses obtained with different number of cores, a locus length of 1 MB for (a) a mutation rate of 10⁻⁹ and (b) a mutation rate of 10⁻⁶.

Figure 9 shows the number of CPU-cache misses reported by all three simulation libraries. The x-axis shows the number of cores. In Figure 9a, we use a mutation rate of 10⁻⁹, and in Figure 9b, we use a higher mutation rate of 10⁻⁶. With more cores, the number of cache misses tends to increase as the cores compete to use the same memory resources, and more data are transferred from the RAM memory to the cache memory of the cores. Also notice that both mutation-vector and mutation-tree libraries drastically reduce the number of cache misses. That is mainly because both libraries are optimized to reduce the amount of data required to simulate the population evolution in different scenarios. With a larger mutation rate (Fig. 9b), the mutation-vector reduces by almost 30% the number of cache misses reported by the mutation-tree.

5.2. Effectiveness evaluation

In this section, we compare the parameter distributions inferred by our proposed gdrift++ and by the DIYABC simulator. We evaluate two scenarios with different sequences of events such as create, increase, split, decrease, and end simulation. For this set of events, we infer the following parameters: mutation rate, create size, increase rate, increase time, split time, decrease rate, decrease time, and simulation time. To this end, we first show that all the proposed simulation libraries present similar distributions of values for all parameters. In Figure 10, we show the probability density function of the values for the parameters mutation rate and create size (number of individuals created). We show the results for these two parameters because the results achieved for the other parameters are similar.

FIG. 10.

Density of the distribution of values for the inferred parameters: (a) mutation rate and (b) create size (number of individuals created).

DIYABC uses a backward-in-time approach to run the simulations, whereas gdrift++ uses a forward-in-time approach. Thus, to compare the distributions of the parameters, we invert the sequence of events executed in each simulator. In the first scenario, the sequence of events executed by gdrift++ is create, increase, split, decrease, and end simulation. Therefore, DIYABC executes the same events but in reverse. In the second scenario, the sequence of events executed by gdrift++ is create, split, decrease, and end simulation (again, DIYABC executes the same events in reverse order).

We compute the relative error, which measures the differences between the values obtained from the parameter distributions inferred by DIYABC and by gdrift++. It is defined as $e_r = \sum \left( \left( \sqrt{(x_i - \overline{x})^2} / \overline{x} \right) / n \right)$. In Table 1, we present the relative error (e_r) for the mean and standard deviation of each parameter of the simulation. We show the results for Scenario 1 and Scenario 2 defined above. Notice that the increase rate and increase time parameters are not estimated in the second scenario (represented with X).
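The relative error can be computed directly from its definition: the sum over all values of the square root of the squared deviation, divided by the reference value and by n. The sketch below is only an illustration. The text does not define x_i, the barred x, or n explicitly, so we assume that the x_i are the n values being compared and that the barred x is their arithmetic mean; the function name is hypothetical.

```python
import math


def relative_error(values):
    """Relative error e_r of a list of values with respect to their mean."""
    n = len(values)
    mean = sum(values) / n  # assumption: the barred x is the arithmetic mean
    return sum(math.sqrt((x - mean) ** 2) / mean / n for x in values)
```

For the values 1.0 and 3.0 the mean is 2.0, each term contributes 0.25, and the function returns 0.5. Identical values give a relative error of 0.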

Table 1.

Relative Error for the Mean and Standard Deviation Reported by the Parameter Distributions Achieved by DIYABC and Gdrift++ for Two Scenarios with Different Sequences of Events

	Scenario 1		Scenario 2
Parameters	Mean	STD	Mean	STD
Mutation rate	0.216	2.211	0.196	5.224
Create size	0.202	2.700	0.151	5.033
Increase rate	0.055	0.413	X	X
Increase time	0.356	0.261	X	X
Split time	0.079	0.0151	0.042	0.033
Decrease rate	0.0007	0.018	0.083	0.122
Decrease time	0.104	0.109	0.044	0.011
Simulation time	0.101	0.128	0.023	0.028

STD, standard deviation.

Results show that the error reported by both metrics (the mean and the standard deviation) for the different parameters is very low. Only the standard deviations of the mutation rate and create size parameters report larger errors, close to 2% in Scenario 1 and 5% in Scenario 2. These results show that our proposal can obtain parameter distributions similar to the ones reported by DIYABC.

6. Conclusions

We have proposed a web service-based platform for ABC for genetic drift simulations named gdrift++. The platform is based on a client-server approach. The client executes a web service, which shows the user interface used to define the scenario to be simulated and displays figures summarizing the statistics obtained from the simulations. The server is composed of four services in charge of creating batches of simulations, controlling their execution, analyzing the results, and building the results.

For the server side, we have also proposed two new population genetics simulation libraries named mutation-tree and mutation-vector. Both aim to reduce the running time and the amount of memory required to execute the simulations. Instead of maintaining information about the status of each individual, we store and process the mutations that occur in the entire population. This has a direct impact on cache access and therefore on the performance of the platform. The experimental results show that the mutation-vector library drastically reduces the number of cache misses and therefore presents lower running times than alternative libraries reported in the literature. Finally, we have shown that the proposed client-server platform achieves parameter distributions similar to those of alternative software packages proposed so far, such as DIYABC (Cornuet et al., 2014). This makes our proposal a suitable alternative in terms of both quality of results and computational performance.

Footnotes

Acknowledgment

This research was supported by the supercomputing infrastructure of the NLHPC Chile, partially funded by CONICYT Basal funds FB0001, Fondef ID15I10560.

Author Disclosure Statement

The authors declare there are no relationships with any people or organizations that could inappropriately influence (bias) this work.

References

Arenas, Lopes J.S., Beaumont M.A., et al. 2015. Codabc: A computational framework to coestimate recombination, substitution, and molecular adaptation rates by approximate Bayesian computation. Mol. Biol. Evol. 32, 1109–1112.

Beaumont M.A. 2010. Approximate Bayesian computation in evolution and ecology. Annu. Rev. Ecol. Evol. Syst. 41, 379–406.

Beaumont M.A., Zhang, and Balding D.J. 2002. Approximate Bayesian computation in population genetics. Genetics, 162, 2025–2035.

Bouckaert, Heled, Kühnert, et al. 2014. Beast 2: A software platform for Bayesian evolutionary analysis. PLoS Comput. Biol. 10, 1–6.

Cornuet J.M., Pudlo, Veyssier, et al. 2014. Diyabc v2.0: A software to make approximate Bayesian computation inferences about population history using single nucleotide polymorphism, DNA sequence and microsatellite data. Bioinformatics, 30, 1187–1189.

Csilléry, Blum M.G., Gaggiotti O.E., et al. 2010. Approximate Bayesian computation (abc) in practice. Trends Ecol. Evol. 25, 410–418.

De Mita, and Siol 2012. Egglib: Processing, analysis and simulation tools for population genetics and genomics. BMC Genet. 13, 27.

Dutta, Schoengens, Onnela J.P., et al. 2017. Abcpy: A user-friendly, extensible, and parallel library for approximate Bayesian computation, 8:1–8:9. In Proceedings of the Platform for Advanced Scientific Computing Conference (PASC).

Fielding R.T. 2000. Architectural Styles and the Design of Network-based Software Architectures. Ph.D. Thesis.

Fisher R.A. 1999. The Genetical Theory of Natural Selection: A Complete Variorum Edition. Oxford University Press, New York, NY, USA.

Haller B.C., and Messer P.W. 2016. Slim 2: Flexible, interactive forward genetic simulations. Mol. Biol. Evol. 34, 230–240.

Huang, Takebayashi, Qi, et al. 2011. Mtml-msbayes: Approximate Bayesian comparative phylogeographic inference from multiple taxa and multiple loci with rate heterogeneity. BMC Bioinformatics, 12, 1.

Kessner, and Novembre 2014. Forqs: Forward-in-time simulation of recombination, quantitative traits and selection. Bioinformatics, 30, 576–577.

Kingman 1982. The coalescent. Stoch. Process. Appl. 13, 235–248.

Liepe, Barnes, Cule, et al. 2010. Abc-sysbio: approximate Bayesian computation in python with gpu support. Bioinformatics, 26, 1797–1799.

Liu, Athanasiadis, and Weale M.E. 2008. A survey of genetic simulation software for population and epidemiological studies. Hum. Genomics, 3, 79–86.

Messer P.W. 2013. Slim: Simulating evolution with selection and linkage. Genetics, 194, 1037–1039.

Peng, and Kimmel 2005. Simupop: A forward-time population genetics simulation environment. Bioinformatics, 21, 3686–3687.

Sandoval-Castellanos, Palkopoulou, and Dalén 2014. Back to baysics: A user-friendly program for Bayesian statistical inference from coalescent simulations. PLoS One, 9, 1–9.

Sepulveda, Roberto, Inostrosa-Psijas, et al. 2017. Towards rapid population genetics forward-in-time simulations, 2672–2683. In Winter Simulation Conference (WSC).

Thornton K.R. 2014. A c++ template library for efficient forward-time population genetic simulation of large populations. Genetics, 198, 157–166.

Turner B.M., and Zandt T.V. 2012. A tutorial on approximate Bayesian computation. J. Math. Psychol. 56, 69–85.

Vinoski 2008. Serendipitous reuse. IEEE Internet Comput. 12, 84–87.

Wegmann, Leuenberger, Neuenschwander, et al. 2010. Abctoolbox: A versatile toolkit for approximate Bayesian computations. BMC Bioinformatics, 11, 116.

Wright 1931. Evolution in Mendelian populations. Genetics, 16, 97.