Preface to the Special Issue: Parallel Computing in Computational Biology: A Technological Point of View

Abstract

Computational biology allows and encourages the application of many different parallel computing approaches. This special issue brings together high-quality, state-of-the-art contributions on parallel computing in computational biology from different technological points of view. It collects considerably extended and improved versions of the best articles accepted and presented at PBio 2017 (5th International Workshop on Parallelism in Bioinformatics, part of ICA3PP 2017). The domains and topics covered in these six articles are timely and important, and the authors have done an excellent job of presenting the material.

1. Introduction

Computational biology poses a variety of problems that demand huge processing times and large amounts of memory and storage, owing to the size of biological data sets and the inherent complexity of biological problems. In fact, computational biology is among the most exciting research areas in which parallel computing finds application. Successful examples include mpiBLAST, RAxML-HPC, and ClustalW-MPI, among many others. Computational biology therefore allows and encourages the application of many different parallel computing approaches: multicore computing, cluster computing, supercomputing, cloud computing, grid computing, green computing, and hardware accelerators such as graphics processing units (GPUs) and field-programmable gate arrays (FPGAs).

This special issue brings together high-quality, state-of-the-art contributions on parallel computing in computational biology from different technological points of view. It collects the best articles accepted and presented at PBio (2017) (5th International Workshop on Parallelism in Bioinformatics, part of ICA3PP 2017). The authors have considerably extended and improved these articles with respect to their original conference versions.

2. Important Data

At present, the application of parallel computing in computational biology is a very popular research topic. As an example of the current interest in this field, we handled a total of 17 high-quality submissions for this special issue, from countries including the United States, Argentina, Germany, Switzerland, Poland, Sweden, Portugal, and Spain. All the articles included in this special issue were reviewed by at least four expert reviewers and went through at least three review rounds. Finally, six high-quality articles in emerging research areas were accepted for inclusion (acceptance rate = 6/17 ≈ 35.29%). In conclusion, we think these articles offer an international sampling of significant work.

The domains and topics covered in these six articles are timely and important, and the authors have done an excellent job of presenting the material. We are confident that this special issue will be very useful for all the readers who are engaged in the many issues surrounding the application of parallel computing in the computational biology domain.

3. Detailed Content

The title of our first article is “Precise and Parallel Pairwise Metagenomic Comparisons,” by Pérez-Wohlfeil and Trelles (2018). The comparison and assessment of similarity across metagenomes remain an open problem. Uncultivated samples suffer from high variability, which makes it difficult for heuristic sequence comparison methods to find precise matches in reference databases. Finer methods are required to provide higher accuracy and certainty, although these come at the expense of longer computation times. This article presents software for the highly parallel, fine-grained pairwise alignment of metagenomes. The parallel implementation is based on POSIX threads. Tests of the adopted parallel scheme show a parallel efficiency of up to 98% on up to 64 cores.
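For readers less familiar with the metric, the following minimal sketch shows how a parallel efficiency figure relates to speedup, assuming the standard definition (efficiency = speedup / number of cores). The function names are hypothetical and are not taken from the article; under this assumed definition, 98% efficiency on 64 cores would correspond to a speedup of roughly 62.7.

```python
def parallel_efficiency(t_serial: float, t_parallel: float, cores: int) -> float:
    """Standard definition (assumed here): efficiency = (t_serial / t_parallel) / cores."""
    if cores <= 0 or t_parallel <= 0:
        raise ValueError("cores and t_parallel must be positive")
    speedup = t_serial / t_parallel
    return speedup / cores


def implied_speedup(efficiency: float, cores: int) -> float:
    """Invert the definition: speedup = efficiency * cores."""
    return efficiency * cores


if __name__ == "__main__":
    # Speedup implied by 98% efficiency on 64 cores under the assumed definition.
    print(round(implied_speedup(0.98, 64), 2))
```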

The second article, “Large-Scale Simulations of Bacterial Populations over Complex Networks,” by Teixeira et al. (2018), focuses on understanding bacterial population genetics and evolution. This is a crucial task in epidemic outbreak studies and pathogen surveillance. However, all epidemiological studies are limited by their sampling capacity; because sampling is usually biased or restricted by economic constraints, it can obscure the real population structure of a given bacterial species. This article addresses the large-scale simulation of the genetic evolution of bacterial populations, using the Wright–Fisher model, in the presence of complex host contact networks. The approach uses MapReduce on top of Apache Spark and its GraphX API. Furthermore, the article evaluates the relationship between cluster computing power and simulation speedup.
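As a rough illustration of the Wright–Fisher model mentioned above, the sketch below simulates a single well-mixed population in plain Python. It is only a textbook-style simplification under stated assumptions: it ignores the host contact networks and the Spark/GraphX implementation described in the article, uses an infinite-alleles mutation scheme, and all names (`wright_fisher_step`, `simulate`) and parameters are hypothetical.

```python
import random
from collections import Counter


def wright_fisher_step(population, mutation_rate, rng, next_allele):
    """One non-overlapping generation of a neutral Wright-Fisher model.

    Each offspring copies the allele of a uniformly chosen parent and, with
    probability mutation_rate, receives a brand-new allele (assumed
    infinite-alleles mutation). Population size stays constant.
    """
    offspring = []
    for _ in range(len(population)):
        allele = rng.choice(population)
        if rng.random() < mutation_rate:
            allele = next_allele
            next_allele += 1
        offspring.append(allele)
    return offspring, next_allele


def simulate(pop_size, generations, mutation_rate, seed=0):
    """Run the model from a monomorphic population; return final allele counts."""
    rng = random.Random(seed)
    population = [0] * pop_size
    next_allele = 1
    for _ in range(generations):
        population, next_allele = wright_fisher_step(
            population, mutation_rate, rng, next_allele
        )
    return Counter(population)


if __name__ == "__main__":
    print(simulate(pop_size=100, generations=50, mutation_rate=0.01, seed=1))
```

Each generation depends only on the previous one, while independent replicates (or independent host populations) can be simulated separately, which is what makes this kind of model a natural candidate for data-parallel frameworks.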

Our third article, “FaST-LMM for Two-Way Epistasis Tests on High Performance Clusters,” authored by Martínez et al. (2018), introduces a version of the epistasis test in factored spectrally transformed linear mixed models (FaST-LMM) for clusters of multithreaded processors. This new software maintains the sensitivity of the original FaST-LMM while delivering an acceleration that is close to linear on 12–16 nodes of two recent parallel platforms. This efficiency is attained through several enhancements on the original single-node version of FaST-LMM, together with the development of a message passing interface (MPI)-based version that ensures a balanced distribution of the workload as well as a multi-GPU module that can exploit the presence of multiple GPUs per node.

The fourth article, “Massively Parallel Implementation of Sequence Alignment with BLAST Using PCJ Library” by Nowicki et al. (2018), addresses the massively parallel execution of the Basic Local Alignment Search Tool (BLAST) algorithm on high performance computing (HPC) clusters and supercomputers using thousands of processors. BLAST is an essential algorithm for sequence alignment analysis, and the NCBI-BLAST application is its most popular implementation. NCBI-BLAST can run multithreaded on a single node. This article proposes a solution for executing the multithreaded NCBI-BLAST on multiple nodes. The Parallel Computing in Java (PCJ) library is used to implement the optimal splitting of the input queries, the work distribution, and the search management.
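To make the idea of splitting input queries across nodes concrete, here is a minimal sketch of one generic load-balancing heuristic (longest query first, assigned to the currently least-loaded node). This is an assumption for illustration only: it is written in Python rather than with the PCJ library, it is not the splitting strategy of the article, and the name `split_queries` and the use of sequence length as a cost estimate are hypothetical.

```python
import heapq


def split_queries(queries, num_nodes):
    """Distribute (query_id, sequence) pairs over num_nodes batches.

    Greedy heuristic: process queries from longest to shortest and give each
    one to the node with the smallest total sequence length so far. Sequence
    length is used as a crude stand-in for search cost (an assumption).
    """
    if num_nodes <= 0:
        raise ValueError("num_nodes must be positive")
    heap = [(0, node) for node in range(num_nodes)]  # (current load, node index)
    heapq.heapify(heap)
    batches = [[] for _ in range(num_nodes)]
    for query_id, seq in sorted(queries, key=lambda q: len(q[1]), reverse=True):
        load, node = heapq.heappop(heap)
        batches[node].append((query_id, seq))
        heapq.heappush(heap, (load + len(seq), node))
    return batches


if __name__ == "__main__":
    demo = [("a", "A" * 10), ("b", "A" * 7), ("c", "A" * 5), ("d", "A" * 4), ("e", "A" * 2)]
    for node, batch in enumerate(split_queries(demo, 2)):
        print(node, [query_id for query_id, _ in batch])
```

In a multi-node setting, each batch would then be handed to one node, which could run its own multithreaded search over that batch.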

The fifth article, entitled “A Power-Performance Perspective to Multi-Objective EEG Feature Selection on Heterogeneous Parallel Platforms” by Escobar et al. (2018), analyzes the power–performance behavior of an evolutionary parallel multiobjective procedure for electroencephalogram (EEG) feature selection. This procedure is executed on parallel CPU-GPU platforms. It uses OpenMP to dynamically distribute the potential solutions among devices and OpenCL to evaluate the quality of these solutions. The article concludes that parallel processing reduces not only the runtime but also the energy consumed by the application, despite a higher instantaneous power.
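The conclusion that energy can fall even though instantaneous power rises follows from the basic relation between energy, average power, and runtime. The short derivation below is a generic sketch, not taken from the article; the symbols $\bar{P}$ (average power), $T$ (runtime), and the subscripts "seq" and "par" (sequential and parallel runs) are assumptions introduced here.

```latex
E = \bar{P}\,T,
\qquad
\frac{E_{\mathrm{par}}}{E_{\mathrm{seq}}}
  = \frac{\bar{P}_{\mathrm{par}}}{\bar{P}_{\mathrm{seq}}}
    \cdot \frac{T_{\mathrm{par}}}{T_{\mathrm{seq}}}
  < 1
\iff
\frac{\bar{P}_{\mathrm{par}}}{\bar{P}_{\mathrm{seq}}}
  < \frac{T_{\mathrm{seq}}}{T_{\mathrm{par}}}.
```

In words: the parallel run consumes less energy whenever its increase in average power is smaller than the speedup it achieves.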

The sixth article, “Scalable Consistency in T-Coffee through Apache Spark and Cassandra Database,” is authored by Lladós et al. (2018). T-Coffee is a highly rated multiple sequence alignment tool. It uses probabilistic consistency as a step prior to the progressive alignment stage to improve the final accuracy. However, it is severely limited by the memory required to store the consistency information. This article presents a novel approach named Big Data T-Coffee (BDT-Coffee). BDT-Coffee integrates the consistency information, previously generated with the MapReduce processing paradigm, through a Cassandra database. This enables large data sets to be processed, with the aim of improving the performance and scalability of the original algorithm.

4. Conclusion

We sincerely hope that you enjoy this special issue. We also hope that the article collection as a whole can pleasantly introduce readers to the composite and challenging area of parallel computing applied to computational biology, giving a fresh view of several state-of-the-art solutions from diverse technological points of view. Before concluding, we want to express our sincere gratitude to the people who have helped us in this challenge. First of all, we would like to thank Prof. Dr. Sorin Istrail (Editor-in-Chief of the Journal of Computational Biology) for trusting us. We also extend our sincere thanks to all the authors who submitted articles for this special issue and to the many reviewers whose dedicated efforts made this special issue possible.

Acknowledgments

This work was partially funded by the AEI (State Research Agency, Spain) and the ERDF (European Regional Development Fund, EU), under the contract TIN2016-76259-P (PROTEIN project).

Miguel A. Vega-Rodríguez received his PhD in computer engineering from the University of Extremadura, Spain, in 2003. He is currently an associate professor (accredited as full professor) of computer architecture in the Department of Computer and Communications Technologies, University of Extremadura. He has authored or coauthored >640 publications, including journal articles (>120 JCR-indexed journal articles), book chapters, and peer-reviewed conference proceedings, for which he has received several awards. Dr. Vega-Rodríguez has contributed to the organization of several international conferences and workshops, namely as general chair or cochair. He has edited >10 special issues of international JCR-indexed journals. In addition, he is an editor and a reviewer of diverse international JCR-indexed journals. His main research interests include parallel and distributed computing, bioinformatics, reconfigurable and embedded computing, and evolutionary computation. A more detailed biography (including contact details) can be found at (http://arco.unex.es/mavega).

Miguel A. Vega-Rodríguez (corresponding author)

Department of Computer and Communications Technologies, University of Extremadura, Escuela Politecnica, Campus Universitario s/n, Caceres 10003, Spain

mavega@unex.es

José M. Granado-Criado is a professor of computer architecture in the Department of Computer and Communications Technologies, University of Extremadura, Spain. He received his PhD in computer science from the University of Extremadura in 2009. Dr. Granado-Criado's main research interests are in the field of parallel processing and, particularly, in the use of reconfigurable hardware (FPGAs), GPUs, and embedded systems (SoC, MPSoC, etc.) in custom-computing applications, such as cryptography, evolutionary computation, and bioinformatics. A more detailed biography (including contact details) can be found at ().

José M. Granado-Criado

Department of Computer and Communications Technologies, University of Extremadura, Escuela Politecnica, Campus Universitario s/n, Caceres 10003, Spain

granado@unex.es

References

Escobar, J.J., Ortega, Díaz, A.F., et al. 2018. A power-performance perspective to multi-objective EEG feature selection on heterogeneous parallel platforms. J Comput Biol [this issue, pg. 882].

Lladós, Cores, and Guirado. 2018. Scalable consistency in T-Coffee through Apache Spark and Cassandra database. J Comput Biol [this issue, pg. 894].

Martínez, Barrachina, Castillo, et al. 2018. FaST-LMM for two-way epistasis tests on high performance clusters. J Comput Biol [this issue, pg. 862].

Nowicki, Bzhalava, and Bala. 2018. Massively parallel implementation of sequence alignment with BLAST using PCJ library. J Comput Biol [this issue, pg. 871].

PBio. 2017. 5th International Workshop on Parallelism in Bioinformatics. Available at: http://arco.unex.es/mavega/pbio/2017/ Accessed May 7, 2018.

Pérez-Wohlfeil, and Trelles. 2018. Precise and parallel pairwise metagenomic comparisons. J Comput Biol [this issue, pg. 841].

Teixeira, A.S., Monteiro, P.T., Carriço, J.A., et al. 2018. Large-scale simulations of bacterial populations over complex networks. J Comput Biol [this issue, pg. 850].