Abstract
The amount of available data is continuously growing. This phenomenon promotes a new concept, named big data. The highlight technologies related to big data are cloud computing (infrastructure) and Not Only SQL (NoSQL; data storage). In addition, for data analysis, machine learning algorithms such as decision trees, support vector machines, artificial neural networks, and clustering techniques present promising results. In a biological context, big data has many applications due to the large number of biological databases available. Some limitations of biological big data are related to the inherent features of these data, such as high degrees of complexity and heterogeneity, since biological systems provide information from an atomic level to interactions between organisms or their environment. Such characteristics make most bioinformatic-based applications difficult to build, configure, and maintain. Although the rise of big data is relatively recent, it has contributed to a better understanding of the underlying mechanisms of life. The main goal of this article is to provide a concise and reliable survey of the application of big data-related technologies in biology. As such, some fundamental concepts of information technology, including storage resources, analysis, and data sharing, are described along with their relation to biological data.
Introduction
The initial concern of computational science was the creation and execution of mathematical models originated or applied in natural and artificial processes. 1 Improvements in computer performance and data storage capacity are changing how every field of knowledge can be represented by data. 2 This scenario broadens the initial aims of computational science to not only compute but also carry out data management and analysis. 1 The amount of data has recently become an “avalanche,” which is steadily accelerating and amplifying. From this phenomenon, a new concept emerges, named big data. 3
From a scientific perspective, the definition of big data is related to the collection, storage, and analysis of data sets that present a high level of complexity or size. In addition to data volume, its variety, velocity and value are also embedded in the big data definition. 4 For this reason, the management and analysis of data using traditional tools and techniques have become a hard task. 1 Information technology (IT) has a determinant role in data availability and analysis. Data can be considered big when their reuse contributes to the construction of new insights. Considering this, IT has been supporting new models of science and newly founded multidisciplinary fields of study. 3
In this respect, the advent of high-throughput genomic techniques, for example, has led biology (particularly genomics and proteomics) to become an information science.2,5 A single sequenced human genome, in fact, is ∼140 gigabytes. 5 Furthermore, the 1000 Genomes project, 6 which involves sequencing and cataloging of human genetic variation, deposited twice as much raw data into GenBank 7 during its first 6 months as all the sequences deposited in the previous 30 years combined. 8 The European Bioinformatics Institute, one of the world's largest biological data repositories, stores 20 petabytes of data and backups, covering genes, proteins, and small molecules. 5 Taking this into account, it is clear that life scientists deal with massive and complex data sets.
This background has provided the context for the emergence of bioinformatics, an interdisciplinary field that makes use of computational tools to exploit biological big data. From this perspective, the flow of biological information can be traced from postulating hypotheses up to uncovering answers. In this scenario, bioinformatics can be divided into four primary focal points: (1) computing and storage infrastructure (i.e., hardware); (2) data management and planning; (3) databases and computational tools (i.e., software); and (4) data analysis and research (Fig. 1).

Fig. 1. Flowchart of biological big data. Big data (top left corner) serves as input for the bioinformatic field (right), resulting in an output of knowledge and applications that go back into academic research, the industry, hospitals, and other sources (bottom left corner). Adapted from: Swiss Institute of Bioinformatics.
Accordingly, this new information science has brought forth new viewpoints to areas such as comparative biology, molecular biology, taxonomy, biochemistry, and genetics. There are numerous in silico approaches to study biological data on the multiple ranges of complexity, starting, for instance, from single cells up to pathogen/organism relationships.
A few examples of in silico approaches are as follows: (1) simulations of molecular dynamics (serving as a large-scale example); (2) sequence alignment; (3) molecular docking (protein/protein complexes); and (4) prediction and search of molecular and genetic structures and elements. 9 In this context, in silico approaches can be applied toward several issues, for instance, discovering new drugs, understanding the Earth's climate, and examining the evolution of species.3,8
However, the complexity displayed by biological data emerges as an issue, mainly due to the natural heterogeneity of biological systems. Biological data range from single chemical or molecular structures to complex systems in a genome-wide scale or even complete metabolic networks. The complexity is driven upward (toward being more complex) when a broader data context is taken into account (e.g., scaling single atoms to DNA or RNA and then to whole genomes). By going through different scales, different approaches are required for data integration and, consequently, the inference of new correlations and hypotheses.9,10
The required IT resources for storing, analyzing, and sharing biological data are not simple and remain an important matter.3,8 While some of the challenges involving biological big data analysis are related to the sheer scale and breadth of new data sets, others relate to the increase in complexity. Considering this, the data present a high degree of complexity due to interactive elements, an inherent feature of biological systems.3,5,8
In this context, the purpose of this article is to provide a brief survey of biological data and their relation to big data computational approaches. To achieve this goal, the article is organized into three main sections following this Introduction.
The features of biological data are presented in The Biological Data section, together with examples of currently available databases and research carried out using preexistent available data. The technology associated with the storage, management, and analysis is described in the IT Resources for Big Data Applications section. This section summarizes fundamental concepts of cloud computing, databases, and data mining (DM). Finally, the Computational Approaches Applied to Big Data Analysis section supplies background information regarding the computational techniques applied to data analysis. Above all, the discussion of big data potential as well as its limitations in biology is presented throughout the article.
The Biological Data
Biology is gradually shifting toward being an informational science, especially with the rise of bioinformatics. The omics (genomics, proteomics, and metabolomics), in particular, are considered big data sciences.3,11 This section describes the features of biological data, their sources, and examples of how they are used in research.
There is a wide landscape of biological data available for research. They are varied in complexity, format, and scale, including the following: (1) sequences (DNA, RNA and proteins, usually in text format); (2) structures (biological and chemical molecules, such as structural proteins and enzymes involved in pathways, in image format); (3) gene expression profiles (measurement of gene activity, in numeric and image formats); (4) biochemical pathways (in text or image formats); (5) chromosomal mapping (in text or image formats); (6) single-nucleotide polymorphisms (in text format); and (7) phylogenetic data (in text or image formats).9,10
Whereas sequences consist of highly specific information, biochemical or metabolic pathways contain a larger number of elements and variables to be taken into account, such as atoms, bonds, entropy, free energy, activation, or inhibition actions, among others. 12 In sequence analysis, for instance, the biological heterogeneity is reflected by multiple nucleotide content profiles in sequences that should present a unique signal. 13
It is also important to point out that biological data present some interferences that come from the wide variety of equipment and protocols used in a given experiment. This is known as the batch effect, which is the result of variables from laboratory origins, such as reagents, machines, and human error, among other factors that may interfere with an experiment. When dealing with large amounts of data and high-throughput data, batch effects can be the source of misleading results. Computational approaches for data analysis (see the IT Resources for Big Data Applications section) are often sensitive to this characteristic, and therefore, it can hamper their accuracy. 10
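The batch effect described above can be pictured with a deliberately naive correction. The sketch below is a minimal illustration, not a method taken from the cited work: it assumes that each measurement carries a batch label and simply subtracts the mean of its batch. The function name `center_by_batch`, the laboratory labels, and the numeric values are all hypothetical.

```python
from collections import defaultdict
from statistics import mean


def center_by_batch(values, batches):
    """Subtract each batch's mean from its own measurements (naive correction)."""
    grouped = defaultdict(list)
    for value, batch in zip(values, batches):
        grouped[batch].append(value)
    batch_means = {batch: mean(vals) for batch, vals in grouped.items()}
    return [value - batch_means[batch] for value, batch in zip(values, batches)]


# Hypothetical expression values for one gene measured in two laboratories.
values = [10.0, 12.0, 20.0, 22.0]
batches = ["lab_A", "lab_A", "lab_B", "lab_B"]

# After centering, the systematic offset between the two laboratories is gone.
centered = center_by_batch(values, batches)
```

A caveat of this simple approach is that it also removes genuine biological differences whenever the condition of interest is confounded with the batch, which is one reason batch effects can mislead downstream analyses.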
In the field of medical research, the recent advances on omics data and multimodal imaging data are providing improvements in diagnosis and treatment. 14 As pointed out by He et al., 14 neuroimaging data are being widely used to tackle central challenges that emerge from cognitive neuroscience, such as how to effectively interpret the results of complex data generated by modern experiments. One example is the in vivo probing of neuronal activity in the human brain.15,16
The use of DM and integrative interfaces between imaging data and computational techniques assists in the understanding of complex topics, and shifts them toward precision medicine, which in turn allows the development of more specific treatments and therapies for diseases such as Alzheimer's, 17 cancer, 18 and Parkinson's. 19
Taking the examples described, it is necessary to link the biological data with a computational definition of structured data and unstructured data. When data are displayed in columns and rows—which can be easily ordered and processed—they are defined as structured data (e.g., patient electronic health records 20 ). On the contrary, unstructured data are related to formats such as audio, video, digital photographs, and social media postings (e.g., radiology images).
Taking this into account, biological data can be divided into different subsets, since the data may come from several sources. 21 In this sense, the best way to organize the data of a given project is an important question in big data projects.10,20,21 The computational approaches for this issue are presented in the IT Resources for Big Data Applications section, and some examples of biological databases are also presented in the following subsection.
Biological databases
Experiments in biology can generate a considerable amount of data. 11 For this reason, online databases have become an essential tool and source of data. The growth of biological databases is reflected in the number of journals that include this topic in their scope. However, the exact number of biological databases currently available could not be found in the literature.
One of the largest genomic databases is GenBank, 7 which is built and distributed by the National Center for Biotechnology Information (NCBI). GenBank is part of the International Nucleotide Sequence Database Collaboration, which also includes the DNA DataBank of Japan (DDBJ) and the European Molecular Biology Laboratory. GenBank contains DNA sequences in the form of expressed sequence tags, sequence tagged sites, whole-genome shotgun sequences, and the complete genomes of several species. The principal source of data is submissions from individual laboratories that carried out genome sequencing projects. 7 In addition to GenBank, NCBI maintains 52 other databases, which provide different types of data, such as protein-related, gene expression, and disease data. 22
Another large database is the Kyoto Encyclopedia of Genes and Genomes (KEGG). The original concept of KEGG was to create a reference knowledge base of metabolism and other cellular processes. It currently presents an integration of the information about systems, genomics, proteins, and chemical compounds related to a given molecular interaction network. KEGG pathway maps are widely used for inferring higher level functions from genome sequences and other high-throughput data. 23
The Protein Data Bank (PDB) is a repository of macromolecular structure information. 24 The PDB was initially established as a database for crystallographic protein data. Currently, it comprises a portal with several tools for visualization, download, structure comparison, and deposition of protein information. 24 Another well-known protein database is UniProt, which provides protein sequence and functional annotation data. UniProt offers full-text and field-based text search, sequence similarity search, multiple sequence alignment, batch retrieval, and database identifier mapping. 25
An example of a biological big data project incorporated as a database is the Encyclopedia of DNA Elements (ENCODE). It is a foundational data set for understanding the roles of functional elements of the human genome. 26 This project, up to now, has produced 15 terabytes of data generated from 1600 experiments on 147 types of cells. ENCODE and other similar projects provide new insights into the mechanisms of human biology and diseases, enhancing the knowledge of common diseases with a genetic component, rare genetic diseases, and cancer. 26
In addition to these databases, there are other examples dedicated to specific data. It is possible to find databases for the following: (1) organism specific or group of related organisms,27–33 (2) molecules and/or cells,28,34 (3) specific genomic sequences,35,36 (4) diseases,27,37 (5) gene expression, proteins, and metabolism,30,35,38 (6) nomenclature and definitions, 39 and (7) ecology, 29 as some examples are shown in Table 1.
Table 1. Examples of available biological databases
ENCODE, Encyclopedia of DNA Elements; EST, expressed sequence tag; GBIF, Global Biodiversity Information Facility; miRNA, microRNA; GO, Gene Ontology; PDB, Protein Data Bank; STS, sequence tagged sites; WGS, whole-genome shotgun.
Even though some of these examples may not be large enough to be classified as big data, the major objective in building a new database is to organize data obtained from many heterogeneous sources. The broader principle is to transform data into useful information by executing user-friendly queries,11,42 which is also the goal of big data analyses. 2
Considering the number of databases available, it is possible to obtain a wide perspective of a given biological task by comparing or combining the information available from databases all over the world.11,42 Since databases usually present their information in different formats, big data analyses are especially challenging in biology.5,42,43 Some efforts have been made to develop tools for database integration, such as INDIGO, 44 metabolicMine, 45 Dis2PPI, 46 BioExtract Server, 47 GenMAPP, 48 and GeneCodis, 49 among others.
INDIGO is a data warehouse (DW) for three microbial genomes isolated from the Red Sea. It allows the integration of annotations for the purpose of exploration and analysis of those genomes. INDIGO enables users to combine information from multiple sources (genomic sequence, protein domain, gene ontology, and pathways) for further specific or general analysis. 44
Another example of a biological DW is metabolicMine, which is specific for common metabolic diseases. 45 The respective information covers genes, proteins, orthologues, interactions, gene expression, pathways, ontologies, diseases, genome-wide association studies, and single-nucleotide polymorphisms. Its data come from Ensembl, NCBI, UniProt, and KEGG, and its goal is to help users from all levels of informatic expertise to carry out their own studies. 45
The Dis2PPI tool provides an integration of two databases: OMIM and STRING. 46 It is a desktop tool that allows an easy solution to generate diseasome networks, offering a platform to explore diseases by indicating their common genetic origin. For this, Dis2PPI establishes the association of protein information with a given disease. The results of Dis2PPI can be loaded in software such as Cytoscape 50 and Medusa. 51 GenMAPP was developed for biologists who have their research focused on pathway visualization. 48 This tool allows the organization, analysis, and sharing of eukaryote genomic data by integration of selected data from KEGG and Reactome, among other academic laboratories. Moreover, GenMAPP can combine proteomic and gene expression data. 48
The BioExtract Server is a data integration application for many biomolecular databases, analytic tools, and workflows. 47 This system allows the user to reduce the number of sites visited for a given query associated with DNA or protein sequences. Moreover, the user does not need any knowledge of database management systems (DBMS) or query languages. GeneCodis 49 is a web server that integrates Gene Ontology (GO) and KEGG databases for life scientists to carry out gene annotation in their genomic research. The organisms supported comprise eukaryotic and prokaryotic species, such as Homo sapiens, Mus musculus, Caenorhabditis elegans, Vibrio cholerae, and Escherichia coli. 49
As described so far, it is clear that a well-designed database supports biological research by providing integration of large volumes of complex data, as well as allowing faster and more powerful searches. 43 Some limitations of biological databases are related to data structure and access. In biology, a large number of mechanisms and phenomena are not yet fully understood, leaving room for divergent interpretations. Therefore, biological databases differ from each other due to a degree of uncertainty in the structure of data.11,42 In this scenario, biological data present not only high complexity, heterogeneity, and peculiarity, but also pose an additional challenge in terms of data accuracy and standardization. 42
Another limitation arises from the fact that biological big data is not accessible in a conventional manner, which means data analysis often involves downloading data from public sites (e.g., NCBI and Ensembl), installing software tools locally, and running them. 2 In this regard, cloud access makes data easier to import, export, compare, combine, and understand. 1
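The core idea shared by the integration tools described above, combining records about the same entity from several sources, can be sketched as follows. This is a minimal illustration under stated assumptions and does not reproduce the schema or API of any of the tools mentioned: the source names, gene identifiers, and fields are hypothetical, and all sources are assumed to already share the same identifiers.

```python
def integrate(sources):
    """Merge per-gene records from several sources into one record per gene.

    sources maps a source name to a dict of {gene_id: {field: value}}.
    """
    merged = {}
    for source_name, records in sources.items():
        for gene_id, annotation in records.items():
            entry = merged.setdefault(gene_id, {})
            for field, value in annotation.items():
                # Prefix each field with its source so that fields with the
                # same name in different sources remain distinguishable.
                entry[f"{source_name}.{field}"] = value
    return merged


# Hypothetical miniature "databases" keyed by a shared gene identifier.
sources = {
    "pathways": {"geneA": {"pathway": "glycolysis"}},
    "ontology": {
        "geneA": {"term": "kinase activity"},
        "geneB": {"term": "transport"},
    },
}
merged = integrate(sources)
```

In practice the harder part is usually the step assumed away here: different databases rarely use the same identifiers or formats, so an identifier-mapping stage is typically needed before records can be merged.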
Biological big data applications
There are many kinds of data in biology, such as genome sequence, gene expression values, protein sequence and structure, and metabolite concentrations and fluxes. 42 The reuse of information increases biological knowledge and contributes to enhancing research projects.11,42
Next-generation sequencing (NGS) technologies make it cheaper and faster to determine the nucleotide content of a given DNA sequence or a whole genome. 52 For this reason, biological big data sets are now more expensive to store, process, and analyze than to generate. 5 Given the significance of obtaining a sequence, it is also important to identify its biological context (structure, function, and role). The annotation process has theoretical and practical implications since it requires the combination of experimental and computational methods, making it a nontrivial task for a life scientist.53–56
Under this consideration, some tools have been developed (Table 2) aimed at providing a convenient way for the sequencing process and sequence analysis, such as homology searches, genome variant analysis, and genome-wide analyses.49,53,54,57,58 Despite some particularities of these tools, they are devoted to analyzing coding region sequences. Further significant insights can be provided by the determination of when and how genes are “turned on and off.” Some approaches for accomplishing this goal perform the analysis of gene expression level by microarray technology, protein sequence and its structure, and gene expression regulation.
Table 2. Examples of available biological applications
DDBJ, DNA DataBank of Japan; NGS, next-generation sequencing; NoSQL, Not Only SQL.
The level of gene transcription is determined by microarray technology. Besides gene expression levels, this technology also allows for quantifying alternative splicing and sequence variation. 48 An example of the results achieved by gene expression experiments integrated with bioinformatic tools and database is the identification of genomic biomarkers in cancer. As a result of the identification of differentially expressed genes, new insights into disease diagnosis and treatment are provided.11,59
Ying et al. 60 searched for genes with a distinctive expression in ovarian cancer cells. They used microarray data available in a specific database from NCBI, the Gene Expression Omnibus (GEO) database. The data were analyzed with clustering algorithms and several biological tools (limma package for R, GenMAPP, and GeneCodis). Through this approach, it was possible to identify 1229 differentially expressed genes. These genes are related to cell cycle, lipid metabolic pathways, cytoskeleton changes, and some signal transduction pathways. As they are involved in the establishment and development of ovarian cancer, they may be treatment targets for ovarian cancer.
A similar approach related to pancreatic cancer is presented by Sartor et al. 61 The authors analyzed gene expression data available in the GEO and ArrayExpress databases by using R environment packages devoted to microarray data analysis. 61 The results indicated that a high expression of Tubby-like protein 3 (TULP3) gene may play a critical role in pancreatic cancer progression. The low–high expression levels have not been associated with prognostic value for any other type of cancer, such as breast, ovarian, and lung cancer. For this reason, the TULP3 gene could be explored as a prognostic biomarker in patients with pancreatic adenocarcinoma. 61
By using genome sequence analysis, Jung et al. 62 applied three machine learning (ML) techniques to classify mutations into two classes: loss- or gain-of-function. The first step was data collection by the literature text-mining process, looking for descriptions of mutations related to both classes. Next, the data were analyzed, and the features of gain- or loss-of-function were selected. Lastly, support vector machines (SVM), random forest, and linear logistic regression were implemented with the aim of classifying the mutations.
The results of this study indicate that the reference allele, substitute allele, mutation type, mutation impact, subcellular location, and protein domain are discriminative parameters for the gain- or loss-of-function identification. The accuracy obtained for the simulations was 72.23% for random forest, 71.28% for SVM, and 70.19% for logistic regression classifiers. 62
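Performance figures such as the accuracies quoted above are derived from the counts of correct and incorrect predictions. The following sketch shows the standard definitions of accuracy, sensitivity, specificity, and precision for a two-class problem. It is an illustration only: the labels and predictions are hypothetical and are not data from the cited study, and the function names are chosen for this example.

```python
def confusion_counts(true_labels, predicted_labels, positive):
    """Count true/false positives and negatives for a two-class problem."""
    tp = fp = tn = fn = 0
    for truth, predicted in zip(true_labels, predicted_labels):
        if predicted == positive:
            if truth == positive:
                tp += 1
            else:
                fp += 1
        else:
            if truth == positive:
                fn += 1
            else:
                tn += 1
    return tp, fp, tn, fn


def _ratio(numerator, denominator):
    """Return numerator / denominator, or 0.0 when the denominator is zero."""
    return numerator / denominator if denominator else 0.0


def classification_scores(tp, fp, tn, fn):
    """Standard performance measures computed from confusion-matrix counts."""
    return {
        "accuracy": _ratio(tp + tn, tp + fp + tn + fn),
        "sensitivity": _ratio(tp, tp + fn),  # true-positive rate (recall)
        "specificity": _ratio(tn, tn + fp),  # true-negative rate
        "precision": _ratio(tp, tp + fp),    # positive predictive value
    }


# Hypothetical labels for five mutations; "gain" is treated as the positive class.
true_labels = ["gain", "gain", "loss", "loss", "loss"]
predicted = ["gain", "loss", "loss", "loss", "gain"]

counts = confusion_counts(true_labels, predicted, positive="gain")
scores = classification_scores(*counts)
```

Reporting several of these measures together is useful because accuracy alone can look high on imbalanced data sets even when one class is rarely predicted correctly.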
An approach for protein structure prediction is illustrated by Tunyasuvunakool et al. 63 In this study, the authors predicted the human proteome by using the AlphaFold application. 64 The authors describe that their results cover almost the entire human proteome (98.5% of human proteins). The resulting data set covers 58% of residues with a confident prediction, of which a subset (36% of all residues) has a very high confidence. Jumper et al. 64 advocate that AlphaFold presents accuracy competitive with experimental structures due to the neural network approach that incorporates physical and biological knowledge about protein structure.
Tripathi and Gupta 65 used an artificial neural network (ANN)-based approach named the layer recurrent network to predict lysosomal-associated membrane protein type. The training data were composed of the following protein features: amino acid composition, sequence length, hydrophobicity, electronic group, sum of hydrophobicity, R-group, and dipeptide composition. The overall accuracy for this simulation was 93.2%, which leads the authors to conclude that this approach is efficient in the discrimination of lysosomal-associated membrane proteins from other membrane proteins. In both articles, the authors make their own data set by collecting, cleaning, and transforming the data from previous available databases.
In the field of microorganisms, the regulation of gene expression is also an important task. There are several articles devoted to the prediction and recognition of regulatory elements by using data provided by the RegulonDB, EcoGene, and EcoCyc databases, for instance. Gordon et al. 66 carried out SVM with alignment kernels in two different data sets: promoter and coding regions, and promoter and nonpromoter intergenic regions. The average errors achieved were 16.5% and 18.6%, respectively, for the data sets used.
Rani et al. 67 used n-grams as features for a neural network classifier for promoter prediction in E. coli and Drosophila melanogaster. The authors show that the n-gram size that gave the best results for E. coli was n = 3, against a negative example set consisting of gene and nonpromoter intergenic segments. The performance measures presented were a sensitivity of 67.75%, a specificity of 86.10%, and a precision of 80.0%.
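To make the notion of n-gram features concrete, the sketch below turns a DNA sequence into a fixed-length vector of n-gram frequencies that a classifier could take as input. It is a minimal illustration and not the feature extraction of the cited study: the use of overlapping windows and of relative (rather than absolute) frequencies are assumptions, and the function name and example sequence are hypothetical.

```python
from collections import Counter
from itertools import product


def ngram_features(sequence, n=3, alphabet="ACGT"):
    """Return relative frequencies of all overlapping n-grams over the alphabet.

    The vector has len(alphabet) ** n entries, ordered as produced by
    itertools.product (AAA, AAC, AAG, ... for n = 3).
    """
    seq = sequence.upper()
    counts = Counter(seq[i:i + n] for i in range(len(seq) - n + 1))
    vocabulary = ["".join(letters) for letters in product(alphabet, repeat=n)]
    # n-grams containing symbols outside the alphabet (e.g. N) are ignored.
    total = sum(counts[gram] for gram in vocabulary)
    if total == 0:
        return [0.0] * len(vocabulary)
    return [counts[gram] / total for gram in vocabulary]


# Hypothetical short sequence: its 3-grams are ACG, CGT, GTA, TAC, ACG.
features = ngram_features("ACGTACG", n=3)
```

With n = 3 and the four-letter DNA alphabet, every sequence maps to a vector of 64 values regardless of its length, which is what allows sequences of different sizes to be fed to the same classifier.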
An ANN-based approach was used by de Avila e Silva et al. 68 for promoter prediction according to the σ factor that recognizes the sequence. This bioinformatic tool, denoted as BacPP, was developed by weighting rules extracted from ANNs trained with promoter sequences known to respond to a specific σ factor. The information obtained from the rules was weighted to optimize promoter prediction and classification of the sequences according to the σ factor, which recognizes them. The accuracy results for E. coli were 86.9%, 92.8%, 91.5%, 89.3%, 97.0%, and 83.6% for σ24-, σ28-, σ32-, σ38-, σ54-, and σ70-dependent promoter sequences, respectively.
In contrast to tools previously reported in the literature, BacPP is not only able to identify bacterial promoters in background genome sequence, but it is also designed to provide pragmatic classification according to the σ factor. Moreover, when applied to a set of promoters from diverse Enterobacteria, the accuracy of BacPP was 76%, indicating that this tool can be reliably extended beyond the E. coli model. 68
Details of the principles and organization of the transcriptional process are helpful for understanding the complexity of biological systems involved in, for instance, cellular responses to environmental changes or the molecular basis of many diseases caused by microbes. 69
Iraola et al. 69 analyzed 814 different virulence-related genes of more than 600 finished bacterial genomes aiming at identifying patterns in those genes. Both human pathogenic and nonpathogenic bacterial strain genomes were collected from the NCBI. To achieve their goal, the authors used SVM to build a classification tool, named BacFier. As a result, the SVM model classifies bacterial genomes into human pathogens and nonpathogens with 95.4% average accuracy. BacFier may be a useful tool for clinical or industrial purposes, for example, to determine whether a newly sequenced strain could be pathogenic for humans. 69
Microorganisms are the most abundant and diverse organisms on Earth. 41 From this perspective, Selama et al. 41 described global bacterial biogeography and biodiversity in terms of abundance by using available data from NCBI and Global Biodiversity Information Facility (GBIF). In their results, Proteobacteria is the most abundant phylum in both databases followed by Firmicutes, Actinobacteria, Bacteroidetes, Cyanobacteria, and Planctomycetes. In the last position, Chrysiogenetes and Dictyoglomi phyla were found.
The study also reveals that bacterial biodiversity data come from developed countries and the United States in particular. This kind of research is an effort to contribute to the knowledge about bacterial dispersal limitations, habitat differentiation, competition, and adaptive radiation. Despite microorganisms' ubiquity, information about their distribution patterns and control is still an open topic. 41
Another approach is the representation of the interactions between biological elements (genes, proteins, or metabolites) in a given cellular function. This approach provides an integrated perspective about complex biological systems. 11 The usual way to illustrate them is the graph diagram, which provides a wide cell biological context from the biochemical functions of individual molecules.50,51 An example of this kind of analysis was carried out by Feltes et al. 70
The authors present possible molecular pathways associated with the effects of tobacco smoke components during embryonic development in pregnant female smokers. They used available databases (e.g., STRING) and tools (e.g., Cytoscape) to obtain the interactome networks on this subject. By analyzing the resulting networks, the authors detected that tobacco constituents act in many bioprocesses, such as cell communication and signaling, hormone synthesis and signaling, DNA metabolism, DNA repair, and inflammation. Such processes present wide effects on cellular and embryonic physiology. 70
The combination of microarray data and network analysis is presented by Guo et al. 71 In this article, the authors analyzed the interaction network of 126 genes, which presented differential expression in microarray experiments. Based on their results, it was possible to identify 23 genes involved in multiple signaling pathways related to tumorigenesis, which were considered potential biomarkers for bladder cancer. Analysis of the urine of patients was carried out, and the combination of the in silico and in vivo results shows that the expression of BLCA-4 and HOXA13 could distinguish between low- and high-grade tumors. Moreover, IGF-1 and hTERT were closely related to highly invasive and high-grade tumors. 71
Network analysis of interactions among genes, proteins, or other kinds of elements can provide new insights about a given issue. By seeing the place of an element in a pathway, it is possible to obtain insights about the physiological significance and offer clues about functions of similar-looking proteins.48,49
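A very simple form of the network analysis discussed above is to represent interactions as an undirected graph and rank the elements by their number of interaction partners (their degree). The sketch below is illustrative only: the protein names and interactions are hypothetical, and degree is just one of many measures that network tools can compute.

```python
from collections import defaultdict


def build_network(interactions):
    """Build an undirected interaction network as {node: set of neighbours}."""
    network = defaultdict(set)
    for a, b in interactions:
        if a == b:
            continue  # ignore self-interactions in this simple sketch
        network[a].add(b)
        network[b].add(a)
    return dict(network)


def rank_by_degree(network):
    """Sort nodes from most to fewest partners, breaking ties by name."""
    return sorted(network, key=lambda node: (-len(network[node]), node))


# Hypothetical pairwise interactions among four proteins.
interactions = [("P1", "P2"), ("P1", "P3"), ("P1", "P4"), ("P2", "P3")]
network = build_network(interactions)
ranking = rank_by_degree(network)
```

Highly connected nodes found this way are natural starting points for closer inspection, although a high degree by itself does not establish biological importance.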
ML approaches are also applied in medical contexts. The related research can contribute to fundamental insights into early diagnosis, prognostics, therapy, and a better understanding of disease processes. Therefore, they would allow a more efficient clinical treatment. 72 By using the data of three available data sets of the UCI database, Yilmaz et al. 73 propose the use of SVM for the diagnosis of diabetes and heart diseases. The accuracy achieved was more than 96% for all the different data sets tested. In addition to the ML technique for data classification, the authors propose a modified version of the K-means algorithm for the data preparation step. 73
A similar approach was carried out by Zheng et al. 72 for breast cancer diagnosis also using the UCI database. At first, the K-means algorithm extracted useful information from the data set (data preparation). Afterward, an SVM was tested as a diagnostic classifier, presenting an accuracy result of 97.38%. 72
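Both studies above rely on K-means in the data preparation step. The sketch below implements the plain K-means procedure (Lloyd's algorithm) on a handful of points to show how it groups similar samples. It is neither the modified version proposed by Yilmaz et al. nor the exact procedure of Zheng et al., and the subsequent SVM step is omitted. Initialising the centroids with the first k points is an assumption made here to keep the example deterministic, and the data points are hypothetical.

```python
from math import dist


def kmeans(points, k, iterations=100):
    """Plain K-means (Lloyd's algorithm).

    Centroids are initialised with the first k points (an assumption made
    for reproducibility). Returns the final centroids and, for each point,
    the index of the cluster it was assigned to.
    """
    centroids = [tuple(p) for p in points[:k]]
    assignments = []
    for _ in range(iterations):
        # Assignment step: attach each point to its nearest centroid.
        new_assignments = [
            min(range(k), key=lambda c: dist(p, centroids[c])) for p in points
        ]
        if new_assignments == assignments:
            break  # converged: no point changed cluster
        assignments = new_assignments
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, a in zip(points, assignments) if a == c]
            if members:
                centroids[c] = tuple(
                    sum(coords) / len(members) for coords in zip(*members)
                )
    return centroids, assignments


# Hypothetical two-feature samples forming two visibly separate groups.
points = [(1.0, 1.0), (9.0, 9.0), (1.0, 2.0), (8.0, 9.0), (2.0, 1.0), (9.0, 8.0)]
centroids, assignments = kmeans(points, k=2)
```

In a preparation step of this kind, the cluster assignments or centroids can then be used to summarise or filter the data before a classifier is trained.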
Cancer prognosis using available data was carried out by Sun et al. 74 The authors used the following data sets: the Wisconsin Prognosis Breast Cancer Database, a diffuse large B-cell lymphoma data set, and a nonsmall-cell lung cancer data set. 74 They performed two steps: feature selection and prognosis analysis using the SVM approach. The results achieved are consistent and comparable with similar approaches. This kind of research can contribute to a better understanding of the different types of tumor and the properties related to them. Moreover, it provides information for other cancer research. 72
IT Resources for Big Data Applications
The improvement of IT techniques provides a reliable mechanism for data management, analysis, and accessibility.3,75 In big data applications, data establish deep connections with the tools applied in them. 5 The database organization is essential for providing a way to share the data since it improves the organization and standardization. 76
In addition, the database is the source for researchers to manipulate data. Currently, cloud computing presents a promising and up-to-date solution for addressing some of the big data challenges.1,2,8,10 This approach exploits the use of multiple computers to store resources dynamically on the internet. Beyond that, it eliminates local installation of software, making maintenance and updates easier. 2 In what follows, IT resources applied to big data approaches are described.
Cloud computing
Cloud computing is a significant emergent approach of computational science that exerts influence on both academia and industry.1,8 Cloud computing provides a very elastic, scalable, portable, and cost-efficient solution since it makes the best use of multiple computers to provide on-demand access to hosted resources.10,76,77
The name cloud computing was inspired by flowcharts that use a cloud as a representative symbol of the internet. A standard definition for cloud computing is not available in the literature.4,78 However, the National Institute of Standards and Technology provides the following definition 79 : “Cloud computing is a model for enabling ubiquitous, convenient, on-demand network access to a shared pool of configurable computing resources (e.g., networks, servers, storage, applications, and services) that can be rapidly provisioned and released with minimal management effort or service provider interaction. This cloud model is composed of five essential characteristics, three service models, and four deployment models.”
The distinct definitions of cloud computing8,76–78,80 present common features such as on-demand self-service, broad network access, and virtualization. This last one is considered the key enabler of cloud computing. 78 Virtualization is important because it allows a single physical machine to run multiple operating systems and associated applications by using multiple virtual machines installed. This procedure maximizes the utilization of the hardware and reduces the capital investment. 4 Additional concepts behind cloud computing are high-performance computing (HPC), distributed systems, grid computing, and parallelized programming.1,76,78,81
HPC, also named a computer cluster, typically comprises a set of homogeneous processing elements connected locally.78,81 HPC has been built to provide higher computing power than that offered by a desktop workstation. It manages computing resources efficiently for the analysis of complex flows, slicing them into smaller parallel tasks.81,82 In contrast, in grid computing architecture, the computational resource available is composed of heterogeneous processing elements, geographically dispersed and connected by a network.81,83,84
The aim of both HPC and grid computing is to run tasks in a parallelized and distributed way.76,81 Despite the similarities among HPC and grid and cloud computing, they cannot be considered the same. Technically, grid computing is different from HPC because it must share common software to carry out the communication between the computers. 81 In addition, grid is different from cloud computing since the latter presents hardware virtualization.78,81,84
A comparison of four attributes of HPC and grid and cloud computing is provided in Table 3. The attributes shown in this table are related to the following: (1) the use of local or remote computers to process data (resource ownership), (2) the possibility of sharing resources and scaling them using the private or public networks available (resource sharing and sizing), and (3) the way the infrastructure is provided (application portability). 2 More information about the technical attributes of these technologies can be found in Refs. 3,8,81
Comparison of computational architectures
In the context of life sciences projects with large-size data sets, the choice of IT architecture (Table 3) is more related to technical aspects than the nature of the biological data.9,85 Considering the biological data, there are a variety of problems that require huge processing and consumption of IT resources due to the size of the data set or the number of variables, such as genome assembly and molecular dynamics simulation, among others. For these cases, parallel computing applications are being developed (see the Big Data Platform Model for Data Processing section) to improve the processing time and memory/storage consumption. 86
The choice of which IT architecture is the most appropriate needs to take into account aspects that are related to the following: (1) the maintenance costs of local HPC infrastructure or cloud, (2) the trustworthiness and availability of remote computers in grid structure and privacy, and (3) espionage, international legal conflicts, and internet connection for cloud architecture. 9
The services supported by cloud computing are usually classified into four business model types: software as a service (SaaS), platform as a service (PaaS), infrastructure as a service (IaaS), and data as a service (DaaS). 2
SaaS offers a software application that runs in cloud infrastructure but is accessible on the internet. IaaS usually includes storage, processing units, and memory of high-performance computing. PaaS allows the creation of applications on resources available by the cloud provider.2,8 DaaS comprises biological databases such as GenBank and 1000 Genomes among others. Cloud-based approaches can decrease the time and cost involved in biological big data projects. 2 Cloud computing has been used in genome annotation and NGS projects,53–56 proteomics, and drug discovery,87–90 among others.
Some commercial vendors of cloud services are Amazon (http://aws.amazon.com), Windows Azure (http://azure.microsoft.com), and Google (https://cloud.google.com). These commercial clouds are improving their services for life sciences research projects. 91 Regardless of the vendor, features such as scalability, reliability, and cost are useful for big data projects. 92
The advantages of cloud computing rely on the fact that the computational resources can be allocated and updated according to the demand in real time. In addition to that, the administration (data backup and recovery) and maintenance of the devices are under the provider's responsibility. Finally, the payment is made according to the amount used (pay-as-you-go model).76,78 Although cloud computing solves the problems of hardware and software cost, it alone does not address all the challenges of big data analysis. The most unresolved issue is related to the electronic transfer and sharing of a huge amount of data at the same time. 93 Another disadvantage of cloud computing is in regard to the privacy of hosting data sets.5,8,76
Database models
Currently, the capability to collect and generate data has been vastly expanded. In this context, data storage plays a fundamental role since it is the backbone of DM. 4 Incomplete or low-quality data may lead to inconclusive results or, more important, erroneous conclusions. 43 Database systems are an efficient mechanism for large data sets77,94 as a well-designed database provides the necessary support for discovering novel correlations in data. For this reason, a computational database can be defined as a set of organized data that aim to assist a group of users. The data access is made by a program to answer the user's queries. That is, a database has data sources and users interested in its content. 94
A database project involves the definition of three different models, conceptual, logical, and physical, to ensure data management. 94 The conceptual model is a generic representation of a specific domain. It presents a collection of concepts that provide the necessary means to achieve database abstraction by the relational, XML, or Not Only SQL (NoSQL) data model. 94 The conceptual model is computationally expressed as a logical model with the purpose of representing data structure. In other words, this model comprises the data types of each attribute of each table, and the relationship among the tables can constitute a new attribute or a new associative table. 94
The logical model is implemented as a physical model through the application of a DBMS. A DBMS is a general-purpose software system that facilitates the processes of defining, constructing, manipulating, and sharing databases among various users and applications. 94 Some examples are Apache Cassandra 95 and MongoDB 96 for the NoSQL models and PostgreSQL, MySQL, Oracle, SQL Server, and IBM DB2 for the relational database models.
The relational database management system (RDBMS) model is an established approach that is widely used in the design of conceptual models for a wide variety of applications. 97 Its main concept is the creation of tables that consist of rows (tuples) and columns (attributes). 42 With this model, it is possible to use relational algebra to build the best query execution plan. However, a well-designed relational database requires a regular and complete data set, which means a structured data set. Moreover, the relationship among the tables should be unambiguous as required in the process of database normalization. 97 These requisites are hard to achieve with biological data.4,42,97
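As a minimal sketch of the table-based model described above, the following uses Python's built-in sqlite3 module. The table names, column names, gene identifiers, and expression values are hypothetical and serve only to show rows (tuples), columns (attributes), and an unambiguous key relationship between two tables.

```python
import sqlite3

# In-memory database; the schema and all values below are illustrative assumptions.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE gene (
        gene_id INTEGER PRIMARY KEY,
        symbol  TEXT NOT NULL UNIQUE
    );
    CREATE TABLE expression (
        gene_id INTEGER NOT NULL REFERENCES gene(gene_id),
        sample  TEXT NOT NULL,
        level   REAL NOT NULL,
        PRIMARY KEY (gene_id, sample)
    );
""")
conn.executemany("INSERT INTO gene VALUES (?, ?)", [(1, "GENE_A"), (2, "GENE_B")])
conn.executemany(
    "INSERT INTO expression VALUES (?, ?, ?)",
    [(1, "s1", 2.0), (1, "s2", 4.0), (2, "s1", 1.0)],
)


def mean_expression(connection):
    """Join the two tables on their key and average the level per gene."""
    rows = connection.execute(
        """
        SELECT g.symbol, AVG(e.level)
        FROM gene AS g JOIN expression AS e ON e.gene_id = g.gene_id
        GROUP BY g.symbol
        ORDER BY g.symbol
        """
    )
    return rows.fetchall()


result = mean_expression(conn)
```

The schema must be declared before any row is inserted, which is the rigidity that makes irregular or incomplete biological data hard to fit.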
An alternative to the rigidness of the RDBMS is the XML schema model that can be used for modeling semistructured data. 97 In this approach, data comprise a series of labels and associated values. It can be represented graphically as nodes (objects) connected by edges to values. 42 A key limitation of XML relies on its difficulty in modeling complex relationships. For example, there is no obvious way to represent many-to-many relationships, which are required to model complex pathways.42,94
Big data platforms usually adopt NoSQL as an alternative to data management and database design. 4 In contrast to relational approaches, NoSQL does not require predefined schema, relationships, and keys, and data storage and management are separated into two independent parts.4,95 In general, NoSQL uses key-value stores, column family implementation, document store, and graph databases, which are uncommon in traditional RDBMS design. For these reasons, NoSQL assists developers in the management of voluminous and heterogeneous data as those treated by big data applications. 97
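The schema-free idea can be illustrated with a toy document store written in plain Python. This is a sketch under stated assumptions and not the API of Cassandra, MongoDB, or any other NoSQL system; the class name DocumentStore, its methods, and the sample records are hypothetical.

```python
class DocumentStore:
    """Toy schema-less store: each record is a dict addressed by a key."""

    def __init__(self):
        self._docs = {}

    def put(self, key, doc):
        # No predefined schema: any set of fields is accepted.
        self._docs[key] = dict(doc)

    def get(self, key):
        return self._docs.get(key)

    def find(self, field, value):
        """Return the keys of documents whose `field` equals `value`.

        Documents that lack the field are simply skipped.
        """
        return sorted(k for k, d in self._docs.items() if d.get(field) == value)


store = DocumentStore()
# Heterogeneous records: each one carries a different set of fields.
store.put("run-1", {"assay": "RNA-seq", "reads": 1200})
store.put("run-2", {"assay": "ChIP-seq", "antibody": "H3K4me3"})
store.put("run-3", {"assay": "RNA-seq", "tissue": "liver", "replicates": [1, 2]})
rna_runs = store.find("assay", "RNA-seq")
```

Records with different fields coexist without any change to a table definition, at the cost of leaving consistency checks to the application.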
The NoSQL database system presents some advantages, such as (1) scalability of data storage with high-performance, (2) flexibility for data modeling, (3) easy way to update application development and deployment, and (4) low-level access mechanism in which data management tasks can be implemented.4,97 All these features contribute to make NoSQL significant as a database approach for cloud-based systems. 97
A comparison of relational and NoSQL DBMS is presented in Table 4, which summarizes the features described in this section.
Comparison of the data models of RDBMS and NoSQL architectures
ACID, Atomicity, Consistency, Isolation, Durability; DBMS, Database Management System; RDBMS, relational database management system; SQL, Structured Query Language.
As presented in The Biological Data section, the landscape of biological data makes the definition of the database model an important question for biological data organization and storage. The traditional databases, such as RDBMS, are well suited for structured data, since they provide good levels of storage efficiency, data integration, and good data retrieval speeds compared with file-based storage. 98
In contrast, NoSQL deals better with large data sets of unstructured data, but it is not suitable for highly transactional systems. 20 In this context, the main goal of a biological repository should be considered. For instance, genomic ChIP-seq, RNA-seq, and DNase-seq projects usually use NoSQL approaches, whereas Electronic Health Records and Clinical Trial Records use RDBMS approaches.21,98,99
Regardless of the DBMS architecture, the proposition of new inferences usually requires the combination of data located in different or heterogeneous sources. 43 Therefore, database integration applications are necessary and remain an important topic in big data approaches. One way to integrate data is the DW structure. Despite the divergent definitions for DW, it can be described as an information repository where the data from different sources are stored. 94
The process of data input into a DW is formed by the following three main steps: data extraction, transformation, and loading (ETL process). The first step is the extraction of data from heterogeneous sources. The second step is the transformation, which is carried out to clean the data and improve their accuracy. In the third step, data are loaded into an integrated multidimensional schema of the DW. This process can enhance data access and analysis. For this reason, a DW provides consistent, timely, subject-oriented information at the required level of detail, enabling the user to make better and faster decisions. 100 The ETL steps are suitable for a DW organized in an SQL or NoSQL schema.101,102
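The three ETL steps can be sketched with the Python standard library alone. The two sources, their field names, the gene symbols, and the target table fact_expression are hypothetical. A real DW would use a multidimensional schema rather than the single table shown here.

```python
import csv
import io
import sqlite3

# Two hypothetical sources with different layouts and quality problems.
SOURCE_A = "symbol,level\nabc1,2.5\nXYZ2,\n"  # CSV text; one value is missing
SOURCE_B = [{"gene": " XYZ2 ", "expr": "1.5"}, {"gene": "QRS3", "expr": "0.5"}]


def extract():
    """Step 1: read raw (symbol, level) pairs from the heterogeneous sources."""
    rows = [(r["symbol"], r["level"]) for r in csv.DictReader(io.StringIO(SOURCE_A))]
    rows += [(r["gene"], r["expr"]) for r in SOURCE_B]
    return rows


def transform(rows):
    """Step 2: clean the data (normalize symbols, drop rows without a value)."""
    cleaned = []
    for symbol, level in rows:
        if level is None or level.strip() == "":
            continue
        cleaned.append((symbol.strip().upper(), float(level)))
    return cleaned


def load(rows, connection):
    """Step 3: load the cleaned rows into the integrated target table."""
    connection.execute("CREATE TABLE fact_expression (symbol TEXT, level REAL)")
    connection.executemany("INSERT INTO fact_expression VALUES (?, ?)", rows)
    connection.commit()


warehouse = sqlite3.connect(":memory:")
load(transform(extract()), warehouse)
loaded = warehouse.execute(
    "SELECT symbol, level FROM fact_expression ORDER BY symbol"
).fetchall()
```

Keeping each step in its own function makes it possible to replace the loader with a NoSQL target without touching extraction or cleaning.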
The main difference between the data stored in an operational database and the one in a DW is the structure, that is, data knowledge and representation (Table 5). In an operational database, data are in a structured form, whereas in a DW, the data may either be presented in the structured format or not. 94 Moreover, in a database, redundancy should be eliminated to enhance business process, whereas in a DW, the existence of redundancy is required to improve its performance. 100 Further information about the data model and the query processing attributes of these database models can be found in Refs.43,94,100
Comparison between operational database and data warehouse
In the context of big data approaches, the variety (heterogeneity) of formats, content, and sources of the data sets adds a degree of difficulty to the ETL process of a DW implementation. There are some efforts to address this challenge,101,102 as reviewed by Diouf et al. 102 The DW is an important computational approach that integrates operational databases. It aims to provide life science researchers with the analyses they need to infer hypotheses and make scientific contributions to a given field, as in the applications presented by INDIGO, 44 metabolicMine, 45 BioExtract Server, 47 GenMAPP, 48 and GeneCodis, 49 among others.
Big data platform model for data processing
To capture value from big data, it is necessary to use specific platforms. One of the most widely used big data tools is Apache Hadoop. 4 It is an implementation of the computational paradigm known as MapReduce that was developed by Google, Inc. 103 In a few words, this platform allows data-intensive analyses by distributing tasks over multiple nodes.8,103 Hadoop has several central modules: Hadoop Kernel, MapReduce, and Hadoop Distributed File System. The MapReduce paradigm divides a computational problem into many small subproblems (map step), and Hadoop provides a distributed file system that stores data in nodes. Finally, the “reduce” step merges all the smaller outputs to generate the whole result.2,8,76
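The map and reduce steps can be illustrated with a single-process sketch in plain Python. It does not use Hadoop or its API. The function names and the k-mer counting task are assumptions chosen for illustration, and in a real cluster each chunk would be mapped on a different node.

```python
from collections import defaultdict


def map_step(sequence, k=2):
    """Map: emit a (k-mer, 1) pair for every k-mer in one chunk of input."""
    return [(sequence[i:i + k], 1) for i in range(len(sequence) - k + 1)]


def shuffle(pairs):
    """Group emitted values by key, as a framework does between map and reduce."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_step(key, values):
    """Reduce: merge all partial counts of one key into a single result."""
    return key, sum(values)


def map_reduce(chunks, k=2):
    emitted = [pair for chunk in chunks for pair in map_step(chunk, k)]
    return dict(reduce_step(key, values) for key, values in shuffle(emitted).items())


# Two chunks stand in for data blocks stored on two different nodes.
counts = map_reduce(["ACGT", "CGCG"])
```

Because every call to map_step depends only on its own chunk, the map phase can run in parallel without coordination, which is what makes the paradigm scale.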
Hadoop deals with several challenges of data access and management. It hides from the user some computer abstractions, such as distributed file systems, distributed query language, and distributed databases. Despite this feature, the Apache Hadoop programming environment is not suitable for people with limited programming experience, such as life scientists.2,8,76 Moreover, Hadoop is not designed for real-time applications since its implementation presents high latency. 4
Dryad is another popular computing framework for implementing parallel and distributed programs. 104 This platform builds a data flow graph application that can be configured into an arbitrary directed acyclic graph applying a set of computational vertices and communication channels. 1 The application executes the vertices of the graph on a set of computers that can exchange data by using shared memory queues and Transmission Control Protocol pipes. The generalization capacity of Dryad is better than MapReduce as it can process data from a very small cluster to a large one.4,104 For real-time big data applications, Storm 105 is an open-source computational system. It is designed for processing fast and large streams of data as they arrive.4,105
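The idea of executing a program as a directed acyclic graph of vertices can be sketched with the standard-library graphlib module (Python 3.9 or later). This is not the Dryad API. The vertex names and functions are hypothetical, and the channels between vertices are reduced to in-memory values passed from one function to the next.

```python
from graphlib import TopologicalSorter


# Hypothetical computational vertices; each consumes its predecessors' outputs.
def read_a():
    return [3, 1]


def read_b():
    return [2]


def merge(a, b):
    return a + b


def sort_all(merged):
    return sorted(merged)


# vertex name -> (function, ordered list of input vertices)
GRAPH = {
    "read_a": (read_a, []),
    "read_b": (read_b, []),
    "merge": (merge, ["read_a", "read_b"]),
    "sort": (sort_all, ["merge"]),
}


def run(graph):
    """Execute every vertex after all of its inputs have been produced."""
    order = TopologicalSorter({name: set(deps) for name, (_, deps) in graph.items()})
    outputs = {}
    for name in order.static_order():
        func, deps = graph[name]
        outputs[name] = func(*(outputs[d] for d in deps))
    return outputs


outputs = run(GRAPH)
```

Vertices without a path between them, such as read_a and read_b, have no mutual dependency and could be scheduled on different machines.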
Some examples of related projects devoted to biological big data are CloudBioLinux, 57 SeqHBase, 52 and CloudDOE, 58 among others. CloudBioLinux is an initiative of the J. Craig Venter Institute that provides access to more than a hundred bioinformatic tools through a friendly graphical user interface. 57 SeqHBase is a cloud-based toolset for analyzing mutations in sequencing data, 52 and CloudDOE is a platform to encapsulate technical details of Hadoop implementation. These applications illustrate the potential of the association between cloud computing and life sciences. 8 A limitation of these approaches is not related to the technology itself but lies in the noisy and inconsistent nature of biological data.5,9,76,105
Computational Approaches Applied to Big Data Analysis
The value of big data is not concerned with the data, but with how supportive they are for decisions and assumptions. 43 This can be accomplished by the application of DM techniques. DM is an interdisciplinary field that combines some aspects of artificial intelligence, database management, and ML among other fields. 106
DM consists of three principal steps executed in the following order: (1) data preprocessing, (2) data modeling, and (3) data postprocessing. 107 In data preprocessing, the raw data are transformed to “clean” data. This first step is crucial since it is related to the success of all the following steps of data analysis. 107 Subsequently, the application of modeling techniques, such as decision tree, ANNs, and clustering, is carried out. This is the data modeling or DM step. After this, the visualization and evaluation of the extracted knowledge are carried out (data postprocessing).4,107
DM algorithms extract useful regularities for a given purpose, and ML provides the technical basis for DM. ML approaches can give promising results even if the relationships are unknown or hard to describe. In addition, ML can recognize complex patterns in an automatic way or distinguish examples based on these patterns.108,109 These algorithms usually split the data set into training and test groups. They learn from examples (training data) and from them build the classification model, which will be tested in a set of examples not exposed to the classifier in the training process.
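The training and test protocol can be sketched in plain Python as follows. The nearest-centroid classifier is used only because it is short. It is an assumption of this sketch and not a method discussed in this article, and the one-feature data set is made up.

```python
def split(examples, test_every=3):
    """Hold out every `test_every`-th example for testing.

    A deterministic split keeps the sketch reproducible; real studies
    usually shuffle or stratify the data first.
    """
    train = [ex for i, ex in enumerate(examples) if i % test_every != 0]
    test = [ex for i, ex in enumerate(examples) if i % test_every == 0]
    return train, test


def fit_centroids(train):
    """Learn from the training examples: one mean value per class label."""
    sums, counts = {}, {}
    for x, label in train:
        sums[label] = sums.get(label, 0.0) + x
        counts[label] = counts.get(label, 0) + 1
    return {label: sums[label] / counts[label] for label in sums}


def predict(centroids, x):
    """Assign the class whose centroid is closest to x."""
    return min(centroids, key=lambda label: abs(x - centroids[label]))


def accuracy(centroids, test):
    """Fraction of held-out examples that are classified correctly."""
    hits = sum(predict(centroids, x) == label for x, label in test)
    return hits / len(test)


# Toy data set of (feature value, class label) pairs; values are illustrative.
DATA = [(1.0, "low"), (9.0, "high"), (1.5, "low"), (8.5, "high"),
        (2.0, "low"), (9.5, "high"), (0.5, "low"), (8.0, "high")]
train, test = split(DATA)
model = fit_centroids(train)
score = accuracy(model, test)
```

The score is computed only on examples that the classifier never saw during training, which is what makes it an estimate of generalization.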
In a big data context, data analysis requires comprehension of the data as well as the ML algorithms to solve an issue of interest. A critical factor that needs to be considered is whether the ML algorithm can be efficiently parallelized. 4 This is important because big data problems deal with the distribution of tasks over many computers to reach a solution.
Among all ML techniques, the decision tree, SVM, ANN, and clustering have several applications in the life sciences, either alone or in combination.110–112 For this reason, the purpose of this section is to provide an explanation of the basic ideas of these ML approaches and their application in biological contexts. All these examples are related to the building of a classification model, which means that the data were previously acquired, transformed, and then supplied to the algorithm.
Decision tree
The decision tree is an ML technique widely used in biological tasks, such as soil systems, 113 genetics and proteomics,108,114 biomanufacturing, 115 and medicine,110,116 among others.
A decision tree is a set of rules about a given subject represented in a flowchart-like structure. Briefly, this algorithm recursively splits a set of independent instances, providing a representation as classification rules. The tree structure is an efficient and easy-to-understand representation of the information captured from the analyzed data set. Each internal node denotes a query test on the attribute, and the branches are an outcome of the test. The terminal nodes represent the class label. 109
Some examples of algorithms to construct a decision tree are CART and C4.5. The difference between them is the evaluation function used for the classification. The C4.5 algorithm uses an entropy measure to choose the splits, and it estimates the error rate of the nodes to prune the tree into a more efficient subtree. In addition, it can construct a multiway tree structure. In contrast, CART usually builds a binary tree by using the Gini index as the evaluation function. 109
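The two evaluation functions mentioned above can be written in a few lines of Python. This sketch shows only the impurity measures and how a candidate split is scored. It is not an implementation of C4.5 or CART, and the labels are illustrative.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of the class distribution in a node."""
    total = len(labels)
    return -sum((n / total) * math.log2(n / total) for n in Counter(labels).values())


def gini(labels):
    """Gini index of the class distribution in a node."""
    total = len(labels)
    return 1.0 - sum((n / total) ** 2 for n in Counter(labels).values())


def weighted_impurity(groups, impurity):
    """Impurity of a split: child impurities weighted by child size."""
    total = sum(len(g) for g in groups)
    return sum(len(g) / total * impurity(g) for g in groups)


# A node with two classes and a candidate split that separates them perfectly.
parent = ["tumor", "tumor", "normal", "normal"]
split_groups = [["tumor", "tumor"], ["normal", "normal"]]

information_gain = entropy(parent) - weighted_impurity(split_groups, entropy)
gini_decrease = gini(parent) - weighted_impurity(split_groups, gini)
```

A tree-building algorithm evaluates such a score for every candidate test on every attribute and keeps the split with the largest reduction in impurity.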
This technique has a low level of requirement for the data preparation process. Moreover, it presents good performance on large data sets. When compared with the ANN approach, the decision tree is more suitable for non-numeric data. In contrast, the decision tree presents some limitations, especially in the determination of an appropriate size of the tree. From the user's view, a large decision tree may be hard to understand. From a computational perspective, constructing an optimal decision tree is known to be an NP-complete problem. For this reason, learning algorithms are usually based on heuristics, which do not guarantee the optimal decision tree structure. 109
Support vector machine
An SVM is a learning method usually implemented as a binary classifier. SVM makes the classification by drawing a straight line that separates, as widely as possible, the positive examples from the negative examples. 117 This classification model is given by the selection of a small number of critical boundary instances (called support vectors) from both classes and their separation by a high-dimensional hyperplane. To obtain the best hyperplane, it is necessary to apply supervised learning algorithms known as kernel machines.117,118
The kernel function is crucial for SVM, since the knowledge captured from the data set is dependent on the definition of a suitable kernel. 117 Further information and mathematical background of SVM can be found in Refs. 117,118 SVM is a technique used in different biological domains as presented by Refs. 62,63,69,72,119–121 for instance.
The SVM algorithms present many advantages in their use when compared with other ML methods. First, SVM produces a unique solution because its training is a convex optimization problem. Second, it is able to deal with very large amounts of dissimilar information. Finally, the discriminant function is characterized by a comparatively small subset of the entire training data set, which makes the computation faster.117,118 In contrast, a problem of SVM is its slow training, as its learning process is carried out by solving a quadratic programming problem with the number of variables equal to the number of training data. 118
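For readers who want the optimization problem behind these statements, a standard textbook formulation of the soft-margin SVM is sketched below. The notation is an assumption introduced here and is not taken from the cited references: x_i and y_i in {-1, +1} are the n training examples and their labels, w and b define the hyperplane, xi_i are slack variables, C is a penalty parameter, alpha_i are the dual variables, and K is the kernel function. The primal problem is written for the linear kernel K(x_i, x_j) = x_i^T x_j.

```latex
% Primal problem: maximize the margin 2/||w|| while penalizing violations
\min_{w,\,b,\,\xi}\; \frac{1}{2}\lVert w\rVert^{2} + C\sum_{i=1}^{n}\xi_i
\quad\text{s.t.}\quad y_i\,(w^{\top}x_i + b) \ge 1 - \xi_i,\;\; \xi_i \ge 0

% Dual problem: one variable \alpha_i per training example
\max_{\alpha}\; \sum_{i=1}^{n}\alpha_i
  - \frac{1}{2}\sum_{i=1}^{n}\sum_{j=1}^{n}\alpha_i\,\alpha_j\,y_i\,y_j\,K(x_i,x_j)
\quad\text{s.t.}\quad 0 \le \alpha_i \le C,\;\; \sum_{i=1}^{n}\alpha_i\,y_i = 0

% Decision function: only examples with \alpha_i > 0 (support vectors) contribute
f(x) = \operatorname{sign}\!\Big(\sum_{i=1}^{n}\alpha_i\,y_i\,K(x_i,x) + b\Big)
```

The dual is a quadratic program with one variable per training example, which explains the training cost, whereas the decision function depends only on the support vectors, which explains the fast evaluation.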
Artificial neural network
The ANN is an artificial intelligence approach used for classification and the prediction process.109,122 In its simplest form, ANN can be viewed as a graphical model consisting of interconnected units. The connection from a unit j to a unit i usually has a weight denoted by Wij. The weights represent information used by the net to solve a problem. During the training process, the weights are adjusted to minimize the difference between the network output and the desired output. The highly popular algorithm for the learning process is back-propagation.64,122,123
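The weight adjustment described above can be illustrated with a single sigmoid unit trained by batch gradient descent. This is a deliberately reduced sketch and not multilayer back-propagation. The cross-entropy error, the learning rate, the number of epochs, and the OR task are all assumptions chosen to keep the example short and deterministic.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def train_unit(examples, learning_rate=0.5, epochs=1000):
    """Batch gradient descent on the cross-entropy error of one sigmoid unit."""
    n_inputs = len(examples[0][0])
    weights = [0.0] * n_inputs
    bias = 0.0
    for _ in range(epochs):
        grad_w = [0.0] * n_inputs
        grad_b = 0.0
        for inputs, target in examples:
            output = sigmoid(sum(w * x for w, x in zip(weights, inputs)) + bias)
            error = output - target  # network output minus desired output
            for j, x in enumerate(inputs):
                grad_w[j] += error * x
            grad_b += error
        # Move every weight against its gradient to reduce the error.
        weights = [w - learning_rate * g for w, g in zip(weights, grad_w)]
        bias -= learning_rate * grad_b
    return weights, bias


def predict(weights, bias, inputs):
    activation = sigmoid(sum(w * x for w, x in zip(weights, inputs)) + bias)
    return 1 if activation >= 0.5 else 0


# Logical OR as a toy, linearly separable training set.
OR_EXAMPLES = [((0, 0), 0), ((0, 1), 1), ((1, 0), 1), ((1, 1), 1)]
weights, bias = train_unit(OR_EXAMPLES)
predictions = [predict(weights, bias, x) for x, _ in OR_EXAMPLES]
```

Back-propagation applies the same update rule to every layer by propagating the error term backward through the network with the chain rule.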
The ANN architecture is defined by the way the neurons are interconnected. For this reason, it is possible to design many kinds of architecture. The most widely applied ANN architecture is the multilayer perceptron, since this architecture is able to capture and discover high-order correlations and/or relationships inside the input data.110,123
The three-layer ANN is known as a universal classifier because it is able to classify any labeled data correctly if there are no identical data in different classes. 122 The layers present different roles in the learning process. The neurons of the input layer receive the information from external sources and pass this information to the hidden layer. The use of hidden neurons increases the capacity of the net to decide how input features should be represented. The output layer contains neurons that receive processed information and send output signals out of the system.110,122,123 Due to the ANN capability of rapid fitting of nonlinear data, it can capture imprecise and incomplete patterns.64,122
The ANN is often applied in biological areas such as (1) gene structure and regulation,68,114,124,125 (2) ecology,126,127 (3) cancer,128,129 (4) degenerative diseases, 130 and (5) proteomics.64,65 Despite its advantages, the ANN presents some difficulties. Many decisions related to the choice of ANN structure and parameters are often completely subjective. The final ANN solution may be influenced by a number of factors (e.g., starting weights, number of cases, and number of training cycles). Moreover, overtraining should be monitored to prevent an ANN from memorizing the data instead of generalizing from them. 122
Clustering
Clustering is an unsupervised classification technique suitable in cases when there is no class to be predicted, but the instances can be divided into natural groups. This approach is divided into two methods: hierarchical and nonhierarchical.72,109,131
K-means is a classic nonhierarchical clustering algorithm. It separates the instances into a certain number of sets according to a predefined distance criterion. The number of clusters must be defined in advance (the k parameter). After this, the initial cluster centroids are either defined randomly or prespecified by the researcher. The distances are calculated based on the current arrangement for cluster centroid updating. The most used distance criterion is the Euclidean distance measurement. The K-means algorithm iterates over the whole data set until convergence, which means that each instance is assigned to the cluster whose centroid is closest to it.13,109
This clustering method is simple and effective, and the K-means algorithm allows each instance to belong to a single set.109,131 However, the final clusters are quite sensitive to the initial cluster centers. Completely different arrangements can arise from small changes in the initial random choice.13,109
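A minimal K-means sketch in plain Python is given below (math.dist requires Python 3.8 or later). The points are made up, and the initial centroids are passed explicitly, which makes the run reproducible and makes the dependence on the starting centers visible.

```python
import math


def assign(points, centroids):
    """Index of the nearest centroid (Euclidean distance) for each point."""
    return [
        min(range(len(centroids)), key=lambda c: math.dist(p, centroids[c]))
        for p in points
    ]


def update(points, labels, centroids):
    """Move each centroid to the mean of the points assigned to it."""
    new_centroids = []
    for c, old in enumerate(centroids):
        members = [p for p, label in zip(points, labels) if label == c]
        if not members:
            new_centroids.append(old)  # an empty cluster keeps its centroid
            continue
        new_centroids.append(tuple(sum(axis) / len(members) for axis in zip(*members)))
    return new_centroids


def k_means(points, centroids, max_iter=100):
    labels = assign(points, centroids)
    for _ in range(max_iter):
        centroids = update(points, labels, centroids)
        new_labels = assign(points, centroids)
        if new_labels == labels:  # converged: no instance changed cluster
            break
        labels = new_labels
    return labels, centroids


POINTS = [(1.0, 1.0), (1.0, 2.0), (8.0, 8.0), (9.0, 8.0)]
labels, centroids = k_means(POINTS, centroids=[(0.0, 0.0), (10.0, 10.0)])
```

Running the function with several different starting centroids and comparing the resulting partitions is a common way to check how stable a clustering is.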
Unlike partitional algorithms (such as K-means), hierarchical clustering generates a set of clusters with different granularities in a hierarchical representation. The graphical representation of the results is a dendrogram.131,132 These algorithms can be subdivided into agglomerative and divisive according to the linkage method: bottom-up or top-down. The agglomerative technique follows the bottom-up method, through which it defines the similarity between the objects before the cluster similarity measure. Conversely, the top-down approach first calculates the similarity measure between the clusters. Divisive algorithms follow this linkage method. 132
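The bottom-up procedure can be sketched in plain Python as follows. Single linkage and one-dimensional values are assumptions made to keep the example short, and the input values are illustrative. The returned merge history contains the information that a dendrogram displays.

```python
def single_linkage(a, b):
    """Cluster distance: the smallest distance between any two members."""
    return min(abs(x - y) for x in a for y in b)


def agglomerate(values):
    """Bottom-up clustering: start with singletons, merge the closest pair."""
    clusters = [[v] for v in values]
    merges = []
    while len(clusters) > 1:
        # Find the pair of clusters with the smallest linkage distance.
        i, j = min(
            ((i, j) for i in range(len(clusters)) for j in range(i + 1, len(clusters))),
            key=lambda pair: single_linkage(clusters[pair[0]], clusters[pair[1]]),
        )
        distance = single_linkage(clusters[i], clusters[j])
        merges.append((sorted(clusters[i]), sorted(clusters[j]), distance))
        merged = clusters[i] + clusters[j]
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)] + [merged]
    return merges


history = agglomerate([1.0, 2.0, 6.0, 7.5])
```

Cutting the merge history at a chosen distance threshold yields a flat partition, so the number of clusters does not have to be fixed before the algorithm runs.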
Hierarchical clustering does not require that the number of clusters or input parameters be known in advance. However, the choice of the similarity/dissimilarity measure is mandatory. Another advantage lies in the good result visualizations integrated into the methods. Nonetheless, this approach may not scale well due to the runtime of standard methods and the difficulty in discovering “optimal clusters” automatically.110,132
The applications of clustering in life sciences are mostly concerned with gene expression and regulation as shown by Refs. 13,59,72,131–135 among other related articles.
The amount of biological data available has expanded the means for carrying out research. As a consequence, biology has been adopting ML approaches to analyze its data.77,81 Each one is appropriate for a given purpose. Table 6 summarizes the information described in this section about decision tree, SVM, ANN, and clustering. Through the development and application of data analytical, mathematical modeling, and computational techniques, both expert-guided and machine-guided searches for novel correlations in data become possible. 3
Data mining methods
ML, machine learning.
From a general perspective, the application of biological big data is related to advances in fields that require massive and complex data analysis. The newest and emergent application of biological big data is the COVID-19 pandemic scenario. 106 The applications include virus biology,136,137 diagnosis, 138 vaccine development,139,140 and fake news regarding COVID-19 reviews.141,142
Conclusions
The convergence between IT and biology has emerged due to the large amount of data produced by experiments. This scenario creates an opportunity to apply big data as a means for life scientists to achieve their goals and propose relevant inferences. 3 Considering that the efficient management of resources (money, power, space, and people) is required to solve an application of interest, a multidisciplinary team for in silico experiments is appropriate to resolve complex challenges. 103 Biology experts need assistance to understand how to correctly use IT tools. Conversely, computer scientists alone are not able to figure out the biological concepts behind in silico outputs.
Despite technology advances, it is still challenging to transform biological data into information that can provide insights in a given context. 23 In light of this, this article presents a perspective of the current role of big data in the field of biology. An overview of big data can motivate its application and decrease the gap between the generation of data and the understanding of them. 116 The examples of research based on the reuse of information presented in this article show the potential of big data in biological domains.
Footnotes
Acknowledgment
The authors wish to thank the University of Caxias do Sul for the support for this article.
Authors' Contributions
All authors listed have made a substantial, direct, and intellectual contribution to the work, and approved it for publication.
Author Disclosure Statement
No competing financial interests exist.
Funding Information
No funding was received for the development of this research.
