1. Introduction
Many problems in computational biology suffer from long processing times and heavy memory and storage consumption, owing to the large size of biological data sets and the inherent complexity of the problems themselves. As a result, computational biology is among the most exciting research areas in which parallel computing finds application. Successful examples include mpiBLAST, RAxML-HPC, and ClustalW-MPI, among many others. The field allows and encourages the application of many different parallel computing approaches: multicore computing, cluster computing, supercomputing, cloud computing, grid computing, green computing, and hardware accelerators such as graphics processing units (GPUs) and field-programmable gate arrays (FPGAs).
This special issue brings together high-quality, state-of-the-art contributions on parallel computing in computational biology from different technological perspectives. It collects the best articles accepted and presented at PBio 2017 (the 5th International Workshop on Parallelism in Bioinformatics, part of ICA3PP 2017). The authors have considerably extended and improved these articles from their original conference versions.
2. Important Data
At present, the application of parallel computing in computational biology is a very popular research topic. As an example of the current interest in this field, we handled a total of 17 high-quality submissions for this special issue from different countries, including the United States, Argentina, Germany, Switzerland, Poland, Sweden, Portugal, and Spain. All the articles included in this special issue were reviewed by at least four expert reviewers and went through a minimum of three review rounds. Finally, six high-quality articles in emerging research areas were accepted for inclusion (acceptance rate = 6/17 = 35.29%). We believe these articles offer an international sampling of significant work.
The domains and topics covered in these six articles are timely and important, and the authors have done an excellent job of presenting the material. We are confident that this special issue will be very useful for all the readers who are engaged in the many issues surrounding the application of parallel computing in the computational biology domain.
3. Detailed Content
Our first article is “Precise and Parallel Pairwise Metagenomic Comparisons,” by Pérez-Wohlfeil and Trelles (2018). The comparison and assessment of similarity across metagenomes is still an open problem. Uncultivated samples suffer from high variability, which makes it difficult for heuristic sequence comparison methods to find precise matches in reference databases. Finer methods are required to provide higher accuracy and certainty, although these come at the expense of longer computation times. This article presents software for the highly parallel, fine-grained pairwise alignment of metagenomes. The parallel implementation uses POSIX threads. In the reported tests, the parallel scheme reaches up to 98% efficiency while using up to 64 cores.
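To put the efficiency figure in context, the following minimal sketch assumes the standard definition of parallel efficiency (speedup divided by the number of cores); the summary above does not state which definition the article uses, and the function names are hypothetical. Under that assumption, 98% efficiency on 64 cores corresponds to a speedup of roughly 62.7 over a single core.

```python
def parallel_efficiency(serial_time, parallel_time, cores):
    """Fraction of ideal linear speedup achieved (assumed standard definition)."""
    speedup = serial_time / parallel_time
    return speedup / cores


def implied_speedup(efficiency, cores):
    """Speedup implied by a given efficiency on a given number of cores."""
    return efficiency * cores


# Figures quoted in the summary: up to 98% efficiency on up to 64 cores.
speedup_at_64 = implied_speedup(0.98, 64)
```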
The second article, “Large-Scale Simulations of Bacterial Populations over Complex Networks,” by Teixeira et al. (2018), focuses on understanding bacterial population genetics and evolution. This is a crucial task in epidemic outbreak studies and pathogen surveillance. However, all epidemiological studies are limited by their sampling capacities, which are usually biased or restricted by economic constraints and can therefore hamper real knowledge of the bacterial population structure of a given species. This article addresses the large-scale simulation of the genetic evolution of bacterial populations, using the Wright–Fisher model, in the presence of complex host contact networks. The approach uses MapReduce on top of Apache Spark and the GraphX API. Furthermore, the article evaluates the relationship between cluster computing power and simulation speedup.
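As a rough illustration of the Wright–Fisher model mentioned above, the sketch below simulates a single well-mixed population in plain Python. It is not the article's implementation: it ignores host contact networks, does not use Spark or GraphX, and the infinite-alleles mutation scheme, function names, and parameters are all assumptions made for this example.

```python
import random


def wright_fisher_step(population, mutation_rate, rng, next_allele):
    """One non-overlapping generation: each offspring copies a random parent.

    With probability mutation_rate the offspring instead receives a brand-new
    allele label (an infinite-alleles assumption made for this sketch).
    """
    offspring = []
    for _ in range(len(population)):
        allele = rng.choice(population)
        if rng.random() < mutation_rate:
            allele = next_allele
            next_allele += 1
        offspring.append(allele)
    return offspring, next_allele


def simulate(pop_size, generations, mutation_rate, seed=0):
    """Evolve a constant-size population that starts with a single allele (0)."""
    rng = random.Random(seed)
    population = [0] * pop_size
    next_allele = 1
    for _ in range(generations):
        population, next_allele = wright_fisher_step(
            population, mutation_rate, rng, next_allele
        )
    return population
```

Because each generation depends only on the previous one, many independent populations (for example, one per host) can be advanced in parallel, which is the kind of workload a MapReduce-style framework handles well.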
Our third article, “FaST-LMM for Two-Way Epistasis Tests on High Performance Clusters,” authored by Martínez et al. (2018), introduces a version of the epistasis test in factored spectrally transformed linear mixed models (FaST-LMM) for clusters of multithreaded processors. This new software maintains the sensitivity of the original FaST-LMM while delivering an acceleration that is close to linear on 12–16 nodes of two recent parallel platforms. This efficiency is attained through several enhancements on the original single-node version of FaST-LMM, together with the development of a message passing interface (MPI)-based version that ensures a balanced distribution of the workload as well as a multi-GPU module that can exploit the presence of multiple GPUs per node.
The fourth article, “Massively Parallel Implementation of Sequence Alignment with BLAST Using PCJ Library” by Nowicki et al. (2018), addresses the massively parallel execution of the Basic Local Alignment Search Tool (BLAST) algorithm on high performance computing (HPC) clusters and supercomputers using thousands of processors. BLAST is an essential algorithm for sequence alignment analysis. The NCBI-BLAST application is the most popular implementation of the BLAST algorithm. It can run on a single multithreaded node. This article proposes a solution to execute the multithreaded NCBI-BLAST on multiple nodes. The parallel computing in Java (PCJ) library is used to implement the optimal splitting of the input queries, the work distribution, and the search management.
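The idea of splitting input queries across nodes can be sketched as follows. This is only an illustration in Python, not the PCJ-based method of the article: the greedy longest-first balancing heuristic, the function names, and the use of sequence length as a cost estimate are all assumptions.

```python
def read_fasta(text):
    """Parse FASTA text into a list of (header, sequence) tuples."""
    records = []
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


def split_queries(records, workers):
    """Assign queries to workers, longest first, always to the lightest worker.

    Sequence length is used as a stand-in for search cost (an assumption).
    """
    bins = [[] for _ in range(workers)]
    loads = [0] * workers
    for record in sorted(records, key=lambda r: len(r[1]), reverse=True):
        lightest = loads.index(min(loads))
        bins[lightest].append(record)
        loads[lightest] += len(record[1])
    return bins
```

Each resulting bin could then be written to its own query file and searched by a separate multithreaded BLAST process on one node, with the per-node results merged afterwards.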
The fifth article, entitled “A Power-Performance Perspective to Multi-Objective EEG Feature Selection on Heterogeneous Parallel Platforms” by Escobar et al. (2018), analyzes the power–performance behavior of an evolutionary parallel multiobjective electroencephalogram feature selection procedure. This procedure is executed on parallel CPU-GPU platforms. It is implemented in OpenMP to dynamically distribute the potential solutions among devices, and uses OpenCL to evaluate the quality of these solutions. The article concludes that parallel processing reduces not only the runtime but also the energy consumed by the application, despite a higher instantaneous power.
The sixth article, “Scalable Consistency in T-Coffee through Apache Spark and Cassandra Database,” is authored by Lladós et al. (2018). T-Coffee is a highly rated multiple sequence alignment tool. It uses probabilistic consistency as a prior step to the progressive alignment stage to improve the final accuracy. However, it is severely limited by the memory required to store the consistency information. This article presents a novel approach named Big Data T-Coffee (BDT-Coffee). BDT-Coffee generates the consistency information with the MapReduce processing paradigm and integrates it through a Cassandra database, so that large data sets can be processed, with the aim of improving the performance and scalability of the original algorithm.
4. Conclusion
We sincerely hope that you enjoy this special issue. We also hope that the article collection as a whole offers readers a pleasant introduction to the composite and challenging area of parallel computing in computational biology, giving a fresh view of several state-of-the-art solutions from diverse technological points of view. Before concluding, we want to express our sincere gratitude to the people who have helped us in this challenge. First of all, we would like to thank Prof. Dr. Sorin Istrail (Editor-in-Chief of the Journal of Computational Biology) for trusting us. We also extend our sincere thanks to all the authors who submitted articles for this special issue and to the many reviewers whose dedicated efforts made it possible.
Acknowledgments
This work was partially funded by the AEI (State Research Agency, Spain) and the ERDF (European Regional Development Fund, EU), under the contract TIN2016-76259-P (PROTEIN project).
Miguel A. Vega-Rodríguez (corresponding author)
Department of Computer and Communications Technologies, University of Extremadura, Escuela Politecnica, Campus Universitario s/n, Caceres 10003, Spain
José M. Granado-Criado
Department of Computer and Communications Technologies, University of Extremadura, Escuela Politecnica, Campus Universitario s/n, Caceres 10003, Spain
