A Survey of Biological Data in a Big Data Perspective

Abstract

The amount of available data is continuously growing. This phenomenon promotes a new concept, named big data. The key technologies related to big data are cloud computing (infrastructure) and Not Only SQL (NoSQL; data storage). In addition, for data analysis, machine learning algorithms such as decision trees, support vector machines, artificial neural networks, and clustering techniques present promising results. In a biological context, big data has many applications due to the large number of biological databases available. Some limitations of biological big data are related to the inherent features of these data, such as high degrees of complexity and heterogeneity, since biological systems provide information from an atomic level to interactions between organisms or their environment. Such characteristics make most bioinformatic-based applications difficult to build, configure, and maintain. Although the rise of big data is relatively recent, it has contributed to a better understanding of the underlying mechanisms of life. The main goal of this article is to provide a concise and reliable survey of the application of big data-related technologies in biology. As such, some fundamental concepts of information technology, including storage resources, analysis, and data sharing, are described along with their relation to biological data.

Introduction

The initial concern of computational science was the creation and execution of mathematical models originated or applied in natural and artificial processes.¹ Improvements in computer performance and data storage capacity are changing how every field of knowledge can be represented by data.² This scenario broadens the initial aims of computational science to not only compute but also carry out data management and analysis.¹ The amount of data has recently become an “avalanche,” which is steadily accelerating and amplifying. From this phenomenon, a new concept emerges, named big data.³

From a scientific perspective, the definition of big data is related to the collection, storage, and analysis of data sets that present a high level of complexity or size. In addition to data volume, its variety, velocity and value are also embedded in the big data definition.⁴ For this reason, the management and analysis of data using traditional tools and techniques have become a hard task.¹ Information technology (IT) has a determinant role in data availability and analysis. Data can be considered big when their reuse contributes to the construction of new insights. Considering this, IT has been supporting new models of science and newly founded multidisciplinary fields of study.³

In this respect, the advent of high-throughput genomic techniques, for example, has led biology (particularly genomics and proteomics) to become an information science.^2,5 A single sequenced human genome, in fact, is ∼140 gigabytes.⁵ Furthermore, the 1000 Genomes project,⁶ which involves sequencing and cataloging of human genetic variation, deposited twice as much raw data into GenBank⁷ during its first 6 months as all the sequences deposited in the preceding 30 years combined.⁸ The European Bioinformatics Institute, one of the world's largest biological data repositories, stores 20 petabytes of data and backups, from genes to proteins and even other small molecules.⁵ Taking this into account, it is clear that life scientists deal with massive and complex data sets.

This background has provided the context for the emergence of bioinformatics, an interdisciplinary field that uses computational tools to exploit biological big data. From this perspective, the flow of biological information can be traced from postulating hypotheses up to uncovering answers. In this scenario, bioinformatics can be divided into four primary focal points: (1) computing and storage infrastructure (i.e., hardware); (2) data management and planning; (3) databases and computational tools (i.e., software); and (4) data analysis and research (Fig. 1).

FIG. 1.

Flowchart of biological big data. Big data (top left corner) serves as input for the bioinformatic field (right), resulting in an output of knowledge and applications that go back into academic research, the industry, hospitals, and other sources (bottom left corner). Adapted from: Swiss Institute of Bioinformatics.

Accordingly, this new information science has brought forth new viewpoints to areas such as comparative biology, molecular biology, taxonomy, biochemistry, and genetics. There are numerous in silico approaches to study biological data across multiple levels of complexity, ranging, for instance, from single cells up to pathogen/organism relationships.

A few examples of in silico approaches are as follows: (1) simulations of molecular dynamics (serving as a large-scale example); (2) sequence alignment; (3) molecular docking (protein/protein complexes); and (4) prediction and search of molecular and genetic structures and elements.⁹ In this context, in silico approaches can be applied toward several issues, for instance, discovering new drugs, understanding the Earth's climate, and examining the evolution of species.^3,8

However, the complexity displayed by biological data emerges as an issue, mainly due to the natural heterogeneity of biological systems. Biological data range from single chemical or molecular structures to complex systems in a genome-wide scale or even complete metabolic networks. The complexity is driven upward (toward being more complex) when a broader data context is taken into account (e.g., scaling single atoms to DNA or RNA and then to whole genomes). By going through different scales, different approaches are required for data integration and, consequently, the inference of new correlations and hypotheses.^9,10

The required IT resources for storing, analyzing, and sharing biological data are not simple and remain an important matter.^3,8 While some of the challenges involving biological big data analysis are related to the sheer scale and breadth of new data sets, others relate to the increase in complexity. Considering this, the data present a high degree of complexity due to interactive elements, an inherent feature of biological systems.^3,5,8

In this context, the purpose of this article is to provide a brief survey of biological data and their relation to big data computational approaches. To achieve this goal, the present study is organized into three main sections following this Introduction.

The features of biological data are presented in The Biological Data section, together with examples of currently available databases and research carried out using preexistent available data. The technology associated with the storage, management, and analysis is described in the IT Resources for Big Data Applications section. This section summarizes fundamental concepts of cloud computing, databases, and data mining (DM). Finally, the Computational Approaches Applied to Big Data Analysis section supplies background information regarding the computational techniques applied to data analysis. Above all, the discussion of big data potential as well as its limitations in biology is presented throughout the article.

The Biological Data

Biology is gradually shifting toward being an informational science, especially with the rise of bioinformatics. The omics (genomics, proteomics, and metabolomics), in particular, are considered big data sciences.^3,11 This section describes the features of biological data, their sources, and examples of how they are used in research.

There is a wide landscape of biological data available for research. They are varied in complexity, format, and scale, including the following: (1) sequences (DNA, RNA and proteins, usually in text format); (2) structures (biological and chemical molecules, such as structural proteins and enzymes involved in pathways, in image format); (3) gene expression profiles (measurement of gene activity, in numeric and image formats); (4) biochemical pathways (in text or image formats); (5) chromosomal mapping (in text or image formats); (6) single-nucleotide polymorphisms (in text format); and (7) phylogenetic data (in text or image formats).^9,10
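As an illustration of the text formats mentioned above, nucleotide and protein sequences are commonly exchanged as FASTA files, in which a header line starting with ">" is followed by one or more lines of sequence. The following minimal Python sketch parses such text into a dictionary. The function name `parse_fasta` and the example records are hypothetical and were invented for illustration; they are not taken from any database.

```python
def parse_fasta(text):
    """Parse FASTA-formatted text into a {identifier: sequence} dictionary."""
    records = {}
    current_id = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            # Header line: the identifier is the first word after '>'.
            words = line[1:].split()
            current_id = words[0] if words else f"unnamed_{len(records) + 1}"
            records[current_id] = []
        elif current_id is not None:
            # Sequence line: may be wrapped over several lines.
            records[current_id].append(line.upper())
    return {seq_id: "".join(parts) for seq_id, parts in records.items()}


# Hypothetical toy records (not real database entries).
example = """>seq1 toy DNA fragment
ATGCGT
ACGT
>seq2 toy protein fragment
MKTAYIAK
"""
sequences = parse_fasta(example)
```

Even in this simple format, a sequence may be wrapped over several lines, which is why the lines are collected per record and joined at the end.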

Whereas sequences consist of highly specific information, biochemical or metabolic pathways contain a larger number of elements and variables to be taken into account, such as atoms, bonds, entropy, free energy, activation, or inhibition actions, among others.¹² In sequence analysis, for instance, the biological heterogeneity is reflected by multiple nucleotide content profiles in sequences that should present a unique signal.¹³

It is also important to point out that biological data present some interferences that come from the wide variety of equipment and protocols used in a given experiment. This is known as the batch effect, which is the result of variables from laboratory origins, such as reagents, machines, and human error, among other factors that may interfere with an experiment. When dealing with large amounts of data and high-throughput data, batch effects can be the source of misleading results. Computational approaches for data analysis (see the IT Resources for Big Data Applications section) are often sensitive to this characteristic, and therefore, it can hamper their accuracy.¹⁰
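To make the batch effect concrete, the sketch below removes a constant per-batch offset by centering each batch on its own mean. This is a deliberately simple correction chosen for illustration, not a method attributed to the cited work; the function name `center_by_batch` and all measurement values are hypothetical.

```python
from collections import defaultdict
from statistics import mean


def center_by_batch(values, batches):
    """Subtract each batch's mean from the measurements in that batch."""
    grouped = defaultdict(list)
    for value, batch in zip(values, batches):
        grouped[batch].append(value)
    batch_means = {batch: mean(vals) for batch, vals in grouped.items()}
    return [value - batch_means[batch] for value, batch in zip(values, batches)]


# Hypothetical expression values for one gene measured in two laboratories.
# Batch "B" carries a constant offset of +2.0 (e.g., a different reagent lot).
values = [5.0, 6.0, 7.0, 7.0, 8.0, 9.0]
batches = ["A", "A", "A", "B", "B", "B"]
corrected = center_by_batch(values, batches)
```

Note that this kind of centering would also remove a true biological difference if the batches coincide with the experimental conditions, which is one reason why batch effects are difficult to handle in practice.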

In the field of medical research, the recent advances in omics data and multimodal imaging data are providing improvements in diagnosis and treatment.¹⁴ As pointed out by He et al.,¹⁴ neuroimaging data are being widely used to tackle central challenges that emerge from cognitive neuroscience, such as how to effectively interpret the results of complex data generated by modern experiments. One example is the in vivo probing of neuronal activity in the human brain.^15,16

The use of DM and integrative interfaces between imaging data and computational techniques assists in the understanding of complex topics, and shifts them toward precision medicine, which in turn allows the development of more specific treatments and therapies for diseases such as Alzheimer's,¹⁷ cancer,¹⁸ and Parkinson's.¹⁹

Taking the examples described, it is necessary to link the biological data with a computational definition of structured data and unstructured data. When data are displayed in columns and rows, which can be easily ordered and processed, they are defined as structured data (e.g., patient electronic health records²⁰). In contrast, unstructured data are related to formats such as audio, video, digital photographs, and social media postings (e.g., radiology images).

Taking this into account, biological data can be divided into different subsets, since the data may come from several sources.²¹ In this sense, the best way to organize the data of a given project is an important question in big data projects.^10,20,21 The computational approaches for this issue are presented in the IT Resources for Big Data Applications section, and some examples of biological databases are also presented in the following subsection.

Biological databases

Experiments in biology can generate a considerable amount of data.¹¹ For this reason, online databases have become an essential tool and source of data. The growth of biological databases is reflected in the number of journals that include them in their scope. However, it was not possible to find in the literature the exact number of biological databases available up to now.

One of the largest genomic databases is GenBank,⁷ which is built and distributed by the National Center for Biotechnology Information (NCBI). GenBank is part of the International Nucleotide Sequence Database Collaboration, which also includes the DNA DataBank of Japan (DDBJ) and the European Molecular Biology Laboratory. GenBank contains DNA sequences in the form of expressed sequence tags, sequence tagged sites, whole-genome shotgun sequences, and the complete genomes of several species. The principal source of data is submissions from individual laboratories that carry out genome sequencing projects.⁷ In addition to GenBank, NCBI hosts 52 other databases, which provide different types of data, such as protein-related, gene expression, and disease data.²²

Another large database is the Kyoto Encyclopedia of Genes and Genomes (KEGG). The original concept of KEGG was to create a reference knowledge base of metabolism and other cellular processes. It currently presents an integration of the information about systems, genomics, proteins, and chemical compounds related to a given molecular interaction network. KEGG pathway maps are widely used for inferring higher level functions from genome sequences and other high-throughput data.²³

The Protein Data Bank (PDB) is a repository of macromolecular structure information.²⁴ The PDB was initially established as a database for crystallographic protein data. Currently, it comprises a portal with several tools for visualization, download, structure comparison, and deposition of protein information.²⁴ Another well-known protein database is UniProt, which provides protein sequence and functional annotation data. UniProt offers full-text and field-based text search, sequence similarity search, multiple sequence alignment, batch retrieval, and database identifier mapping.²⁵

An example of a biological big data project incorporated as a database is the Encyclopedia of DNA Elements (ENCODE). It is a foundational data set for understanding the roles of functional elements of the human genome.²⁶ This project, up to now, has produced 15 terabytes of data generated from 1600 experiments on 147 types of cells. ENCODE and other similar projects provide new insights into the mechanisms of human biology and diseases, enhancing the knowledge of common diseases with a genetic component, rare genetic diseases, and cancer.²⁶

In addition to these databases, there are other examples dedicated to specific data. It is possible to find databases for the following: (1) a specific organism or group of related organisms,^27–33 (2) molecules and/or cells,^28,34 (3) specific genomic sequences,^35,36 (4) diseases,^27,37 (5) gene expression, proteins, and metabolism,^30,35,38 (6) nomenclature and definitions,³⁹ and (7) ecology.²⁹ Some examples are shown in Table 1.

Table 1.

Examples of available biological databases

Database	Organism	Main kind of information	References
GenBank	Not organism specific	EST, STS, WGS, complete genomes, gene and protein sequences	⁷
KEGG	Not organism specific	Metabolism pathway	²³
PDB	Not organism specific	Macromolecular structure information	²⁴
ENCODE	Human	Functional human genome elements	²⁶
RegulonDB	Escherichia coli K-12	Promoters, genes, transcription factors, binding sites, among other regulatory information	⁴⁰
EcoGene	E. coli K-12	Genome map features, proteome-wide indexing with GO terms, among others	³¹
StreptomeDB	Genus Streptomyces	Producing strains, synthesized compounds, their biological activity, and the synthesis route	²⁸
FlyAtlas	Drosophila melanogaster	Gene expression	³⁰
NONATObase	Polychaeta (Annelida)	Macroecological and taxonomic	²⁹
OMIM	Human	Human genes and genetic phenotypes	²⁷
EcoCyc	E. coli K-12	Genome, transcriptional regulation, transporters, and metabolic pathways	³³
Ensembl	Human and farm animals	Genome annotations	³²
CellFinder	Mammalian	Mammalian cells in different tissues and development stages	³⁴
STRING	Not organism specific	Protein interactions	³⁵
IntergenicDB	Gram-negative bacteria	Intergenic regions	³⁶
The Cancer Genome Atlas	Homo sapiens	Gene expression, single-nucleotide polymorphism, miRNA, among other cancer data.	³⁷
Reactome	Not organism specific	Pathway database	³⁸
GO	Not organism specific	Genomic ontology	³⁹
GBIF	Not organism specific	Earth's biological diversity	⁴¹

ENCODE, Encyclopedia of DNA Elements; EST, expressed sequence tag; GBIF, Global Biodiversity Information Facility; miRNA, microRNA; GO, Gene Ontology; PDB, Protein Data Bank; STS, sequence tagged sites; WGS, whole-genome shotgun.

Even though some of these examples may not be large enough to be classified as big data, the major objective in building a new database is to organize data obtained from many heterogeneous sources. The overarching principle is to transform data into useful information through user-friendly queries,^11,42 which is also the goal of big data analyses.²

Considering the number of databases available, it is possible to obtain a wide perspective of a given biological task by comparing or combining the information available from databases all over the world.^11,42 Since databases usually present their information in different formats, big data analyses are especially challenging in biology.^5,42,43 Some efforts have been made to develop database integration, such as INDIGO,⁴⁴ metabolicMine,⁴⁵ Dis2PPI,⁴⁶ BioExtract Server,⁴⁷ GenMAPP,⁴⁸ and GeneCodis,⁴⁹ among others.

INDIGO is a data warehouse (DW) for three microbial genomes isolated from the Red Sea. It allows the integration of annotations for the purpose of exploration and analysis of those genomes. INDIGO enables users to combine information from multiple sources (genomic sequence, protein domain, gene ontology, and pathways) for further specific or general analysis.⁴⁴

Another example of a biological DW is metabolicMine, which is specific for common metabolic diseases.⁴⁵ The respective information covers genes, proteins, orthologues, interactions, gene expression, pathways, ontologies, diseases, genome-wide association studies, and single-nucleotide polymorphisms. Its data come from Ensembl, NCBI, UniProt, and KEGG, and its goal is to help users from all levels of informatic expertise to carry out their own studies.⁴⁵
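The integration step performed by data warehouses such as those described above can be sketched as merging records keyed on a shared identifier. The source names, field names, and records below are hypothetical placeholders and do not reflect the actual schemas of Ensembl, NCBI, UniProt, or KEGG, nor the implementation of INDIGO or metabolicMine.

```python
def integrate(sources):
    """Merge per-source annotation tables into one record per gene.

    `sources` maps a source name to {gene_id: {field: value}}.
    Fields are prefixed with the source name so that provenance is kept
    and same-named fields from different sources do not overwrite each other.
    """
    merged = {}
    for source_name, table in sources.items():
        for gene_id, fields in table.items():
            record = merged.setdefault(gene_id, {})
            for field, value in fields.items():
                record[f"{source_name}.{field}"] = value
    return merged


# Hypothetical toy tables standing in for three external sources.
sources = {
    "genome_db": {"geneA": {"chromosome": "1"}, "geneB": {"chromosome": "7"}},
    "protein_db": {"geneA": {"protein_length": 310}},
    "pathway_db": {"geneA": {"pathway": "glycolysis"}, "geneB": {"pathway": "TCA cycle"}},
}
warehouse = integrate(sources)
```

In this sketch, a gene missing from one source (here `geneB` in `protein_db`) simply lacks the corresponding fields. Real integration projects must additionally reconcile identifiers that differ between databases, which is a large part of the difficulty discussed in this section.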

The Dis2PPI tool provides an integration of two databases: OMIM and STRING.⁴⁶ It is a desktop tool that offers an easy way to generate diseasome networks, providing a platform to explore diseases by indicating their common genetic origin. For this, Dis2PPI establishes the association of protein information with a given disease. The results of Dis2PPI can be loaded in software such as Cytoscape⁵⁰ and Medusa.⁵¹ GenMAPP was developed for biologists whose research is focused on pathway visualization.⁴⁸ This tool allows the organization, analysis, and sharing of eukaryote genomic data by integrating selected data from KEGG and Reactome, as well as from academic laboratories. Moreover, GenMAPP can combine proteomic and gene expression data.⁴⁸

The BioExtract Server is a data integration application for many biomolecular databases, analytic tools, and workflows.⁴⁷ This system allows the user to reduce the number of sites visited for a given query associated with DNA or protein sequences. Moreover, the user needs no knowledge of database management systems (DBMS) or query languages. GeneCodis⁴⁹ is a web server that integrates the Gene Ontology (GO) and KEGG databases so that life scientists can carry out gene annotation in their genomic research. The organisms supported include eukaryotic and prokaryotic species, such as Homo sapiens, Mus musculus, Caenorhabditis elegans, Vibrio cholerae, and Escherichia coli.⁴⁹

As described so far, it is clear that a well-designed database supports biological research by integrating large volumes of complex data, as well as allowing faster and more powerful searches.⁴³ Some limitations of biological databases are related to data structure and access. In biology, a large number of mechanisms and phenomena are not yet fully understood, leaving room for divergent interpretations. Therefore, biological databases differ from each other due to a degree of uncertainty about the structure of the data.^11,42 In this scenario, biological data present not only high complexity, heterogeneity, and peculiarity, but also an additional challenge in terms of data accuracy and standardization.⁴²

Another limitation arises from the fact that biological big data is not accessible in a conventional manner, which means data analysis often involves downloading data from public sites (e.g., NCBI and Ensembl), installing software tools locally, and running them.² In this regard, cloud access makes data easier to import, export, compare, combine, and understand.¹

Biological big data applications

There are many kinds of data in biology, such as genome sequence, gene expression values, protein sequence and structure, and metabolite concentrations and fluxes.⁴² The reuse of information increases biological knowledge and contributes to enhancing research projects.^11,42

Next-generation sequencing (NGS) technologies make it cheaper and faster to obtain the nucleotide content of a given DNA sequence or a whole genome.⁵² For this reason, biological big data sets are now more expensive to store, process, and analyze than to generate.⁵ Beyond obtaining a sequence, it is also important to identify its biological context (structure, function, and role). The annotation process has theoretical and practical implications since it requires the combination of experimental and computational methods, making it a nontrivial task for a life scientist.^53–56

Under this consideration, some tools have been developed (Table 2) to provide a convenient way to carry out the sequencing process and sequence analysis, such as homology searches, genome variant analysis, and genome-wide analyses.^49,53,54,57,58 Despite their particularities, these tools are devoted to analyzing coding region sequences. Further significant insights can be provided by the determination of when and how genes are “turned on and off.” Some approaches for accomplishing this goal perform the analysis of gene expression level by microarray technology, protein sequence and its structure, and gene expression regulation.

Table 2.

Examples of available biological applications

Tool	Goal	Organism specific	Implementation details		Architecture supported		References
Tool	Goal	Organism specific	Programming language	Data format	Cloud computing	Local installation	References
BG7	NGS data analysis	Bacteria and archaea	Java	XML	Amazon WS	Yes	⁵⁵
VAT	Genome variant analysis	Human, mouse, Arabidopsis thaliana	C	JSON	Amazon S3	No	⁵³
GBrowse	NGS data analysis	No	Server: Perl and C; Client: JavaScript	NoSQL schema	Amazon	Yes	⁵⁶
DDBJ Pipeline	NGS data analysis	No	Perl and Java	Relational schema	Grid Approach	No	⁵⁴
Cloud BioLinux	NGS data analysis	No	Python	Not described	Amazon EC2, Eucalyptus private	Desktop remote access	⁵⁷
SeqHBase	NGS data and Genome variant analysis	No	Java	MapReduce	Hadoop Cloud	Yes	⁵²
CloudDOE	NGS data analysis	No	Java	MapReduce/Hadoop	In-house or public cloud computing	Yes	⁵⁸
INDIGO	NGS data analysis	Red Sea Extremophiles	Perl	XML	No	Yes	⁴⁴
metabolicMine	Metabolic diseases analysis	Human	Java	Relational schema	No	Yes	⁴⁵
Dis2PPI	Protein/Protein Interactions Network	Human	Java	XML	No	Yes	⁴⁶
BioExtract	Database integration	No		Relational schema	No	Web	⁴⁷
GeneCodis	Database integration	No	Ruby, Perl and PHP	XML	No	Yes	⁴⁹
Cytoscape	Protein/Protein Interactions Network	No	Java	XML	No	Yes	⁵⁰
Medusa	Protein/Protein Interactions Network	No	Java	XML	No	Yes	⁵¹

DDBJ, DNA DataBank of Japan; NGS, next-generation sequencing; NoSQL, Not Only SQL.

The level of gene transcription can be measured by microarray technology. Besides gene expression levels, this technology also allows for quantifying alternative splicing and sequence variation.⁴⁸ An example of the results achieved by gene expression experiments integrated with bioinformatic tools and databases is the identification of genomic biomarkers in cancer. As a result of the identification of differentially expressed genes, new insights into disease diagnosis and treatment are provided.^11,59

Ying et al.⁶⁰ searched for genes with a distinctive expression in ovarian cancer cells. They used microarray data available in a specific database from NCBI, the Gene Expression Omnibus (GEO) database. The data were analyzed with clustering algorithms and several biological tools (the limma package for R, GenMAPP, and GeneCodis). Through this approach, it was possible to identify 1229 differentially expressed genes. These genes are related to cell cycle, lipid metabolic pathways, cytoskeleton changes, and some signal transduction pathways. As they are involved in the establishment and development of ovarian cancer, they may be treatment targets for ovarian cancer.
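The notion of a differentially expressed gene can be illustrated with a log2 fold change between two conditions. The sketch below is a simplified stand-in and not the statistical procedure used by Ying et al., who relied on dedicated R packages; the threshold, the pseudocount, the gene names, and all expression values are hypothetical.

```python
from math import log2
from statistics import mean


def log2_fold_change(tumor, normal, pseudocount=1.0):
    """Log2 ratio of mean expression; the pseudocount avoids division by zero."""
    return log2((mean(tumor) + pseudocount) / (mean(normal) + pseudocount))


def differentially_expressed(expression, threshold=1.0):
    """Return {gene: log2 fold change} for genes with |log2 FC| >= threshold."""
    hits = {}
    for gene, (tumor, normal) in expression.items():
        lfc = log2_fold_change(tumor, normal)
        if abs(lfc) >= threshold:
            hits[gene] = lfc
    return hits


# Hypothetical expression values: gene -> (tumor replicates, normal replicates).
expression = {
    "gene1": ([15.0, 15.0], [3.0, 3.0]),  # (15 + 1) / (3 + 1) = 4    -> log2 = 2
    "gene2": ([7.0, 7.0], [7.0, 7.0]),    # (7 + 1) / (7 + 1) = 1     -> log2 = 0
    "gene3": ([1.0, 1.0], [7.0, 7.0]),    # (1 + 1) / (7 + 1) = 0.25  -> log2 = -2
}
hits = differentially_expressed(expression)
```

A fold-change cutoff alone ignores the variability between replicates; real analyses combine it with a statistical test and a correction for multiple testing.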

A similar approach related to pancreatic cancer is presented by Sartor et al.⁶¹ The authors analyzed gene expression data available in the GEO and ArrayExpress databases by using R environment packages devoted to microarray data analysis.⁶¹ The results indicated that high expression of the Tubby-like protein 3 (TULP3) gene may play a critical role in pancreatic cancer progression. TULP3 expression levels have not been associated with prognostic value in other types of cancer, such as breast, ovarian, and lung cancer. For this reason, the TULP3 gene could be explored as a prognostic biomarker in patients with pancreatic adenocarcinoma.⁶¹

By using genome sequence analysis, Jung et al.⁶² applied three machine learning (ML) techniques to classify mutations into two classes: loss- or gain-of-function. The first step was data collection through literature text mining, looking for descriptions of mutations related to both classes. Next, the data were analyzed, and the features of gain- or loss-of-function were selected. Lastly, support vector machines (SVM), random forest, and linear logistic regression were implemented with the aim of classifying the mutations.

The results of this study indicate that the reference allele, substitute allele, mutation type, mutation impact, subcellular location, and protein domain are discriminative parameters for the gain- or loss-of-function identification. The accuracy obtained for the simulations was 72.23% for random forest, 71.28% for SVM, and 70.19% for logistic regression classifiers.⁶²

An approach for protein structure prediction is illustrated by Tunyasuvunakool et al.⁶³ In this study, the authors predicted structures for the human proteome by using the AlphaFold application.⁶⁴ The authors describe that their results cover almost the entire human proteome (98.5% of human proteins). The resulting data set covers 58% of residues with a confident prediction, of which a subset (36% of all residues) has a very high confidence. Jumper et al.⁶⁴ advocate that AlphaFold presents accuracy competitive with experimental structures due to the neural network approach that incorporates physical and biological knowledge about protein structure.

Tripathi and Gupta⁶⁵ used an artificial neural network (ANN)-based approach named the layer recurrent network to predict lysosomal-associated membrane protein type. The training data were composed of the following protein features: amino acid composition, sequence length, hydrophobicity, electronic group, sum of hydrophobicity, R-group, and dipeptide composition. The overall accuracy for this simulation was 93.2%, which leads the authors to conclude that this approach is efficient in the discrimination of lysosomal-associated membrane proteins from other membrane proteins. In both articles, the authors built their own data set by collecting, cleaning, and transforming the data from previously available databases.

In the field of microorganisms, the regulation of gene expression is also an important topic. There are several articles devoted to the prediction and recognition of regulatory elements by using data provided by the RegulonDB, EcoGene, and EcoCyc databases, for instance. Gordon et al.⁶⁶ applied SVMs with alignment kernels to two different data sets: promoter and coding regions, and promoter and nonpromoter intergenic regions. The average errors achieved were 16.5% and 18.6%, respectively, for the data sets used.

Rani et al.⁶⁷ used n-grams as features for a neural network classifier for promoter prediction in E. coli and Drosophila melanogaster. The authors show that the n-gram size that gave the best results for E. coli was n = 3 against a negative example set consisting of gene and nonpromoter intergenic segments. The performance measures presented were a sensitivity of 67.75%, a specificity of 86.10%, and a precision of 80.0%.
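Two ideas from this paragraph can be sketched in a few lines of Python: counting overlapping n-grams in a nucleotide sequence to obtain features, and computing sensitivity, specificity, and precision from confusion-matrix counts. The function names, the example sequence, and the counts are hypothetical; this is not the feature pipeline or the data of Rani et al.

```python
from collections import Counter


def ngram_counts(sequence, n=3):
    """Count overlapping n-grams in a nucleotide sequence."""
    sequence = sequence.upper()
    return Counter(sequence[i:i + n] for i in range(len(sequence) - n + 1))


def classification_metrics(tp, fp, tn, fn):
    """Sensitivity, specificity, and precision from confusion-matrix counts."""
    return {
        "sensitivity": tp / (tp + fn),  # true positives among actual positives
        "specificity": tn / (tn + fp),  # true negatives among actual negatives
        "precision": tp / (tp + fp),    # true positives among predicted positives
    }


# Hypothetical sequence: its 3-grams are TAT, ATA, TAT, ATA, TAA, AAT.
counts = ngram_counts("TATATAAT", n=3)

# Hypothetical confusion-matrix counts for a promoter/nonpromoter classifier.
metrics = classification_metrics(tp=8, fp=2, tn=18, fn=2)
```

With n = 3 there are at most 4³ = 64 distinct n-grams over the DNA alphabet, so each sequence can be represented as a fixed-length vector of counts regardless of its length.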

An ANN-based approach was used by de Avila e Silva et al.⁶⁸ for promoter prediction according to the σ factor that recognizes the sequence. This bioinformatic tool, denoted as BacPP, was developed by weighting rules extracted from ANNs trained with promoter sequences known to respond to a specific σ factor. The information obtained from the rules was weighted to optimize promoter prediction and classification of the sequences according to the σ factor, which recognizes them. The accuracy results for E. coli were 86.9%, 92.8%, 91.5%, 89.3%, 97.0%, and 83.6% for σ24-, σ28-, σ32-, σ38-, σ54-, and σ70-dependent promoter sequences, respectively.

In contrast to tools previously reported in the literature, BacPP is not only able to identify bacterial promoters in background genome sequence, but it is also designed to provide pragmatic classification according to the σ factor. Moreover, when applied to a set of promoters from diverse Enterobacteria, the accuracy of BacPP was 76%, indicating that this tool can be reliably extended beyond the E. coli model.⁶⁸

Details of the principles and organization of the transcriptional process are helpful for understanding the complexity of biological systems involved in, for instance, cellular responses to environmental changes or the molecular basis of many diseases caused by microbes.⁶⁹

Iraola et al.⁶⁹ analyzed 814 different virulence-related genes of more than 600 finished bacterial genomes, aiming at identifying patterns in those genes. Both human pathogenic and nonpathogenic bacterial strain genomes were collected from the NCBI. To achieve their goal, the authors used SVM to build a classification tool, named BacFier. As a result, the SVM model classifies bacterial genomes into human pathogens and nonpathogens with 95.4% average accuracy. BacFier may be a useful tool for clinical or industrial purposes, for example, to determine if a newly sequenced strain could be pathogenic for humans.⁶⁹

Microorganisms are the most abundant and diverse organisms on Earth.⁴¹ From this perspective, Selama et al.⁴¹ described global bacterial biogeography and biodiversity in terms of abundance by using available data from NCBI and Global Biodiversity Information Facility (GBIF). In their results, Proteobacteria is the most abundant phylum in both databases followed by Firmicutes, Actinobacteria, Bacteroidetes, Cyanobacteria, and Planctomycetes. In the last position, Chrysiogenetes and Dictyoglomi phyla were found.

The study also reveals that bacterial biodiversity data come from developed countries and the United States in particular. This kind of research is an effort to contribute to the knowledge about bacterial dispersal limitations, habitat differentiation, competition, and adaptive radiation. Despite microorganisms' ubiquity, information about their distribution patterns and control is still an open topic.⁴¹

Another approach is the representation of the interactions between biological elements (genes, proteins, or metabolites) in a given cellular function. This approach provides an integrated perspective about complex biological systems.¹¹ The usual way to illustrate them is the graph diagram, which provides a wide cell biological context from the biochemical functions of individual molecules.^50,51 An example of this kind of analysis was carried out by Feltes et al.⁷⁰

The authors present possible molecular pathways associated with the effects of tobacco smoke components during embryonic development in pregnant female smokers. They used available databases (e.g., STRING) and tools (e.g., Cytoscape) to obtain the interaction networks for this subject. By analyzing the resulting networks, the authors detected that tobacco constituents act in many bioprocesses, such as cell communication and signaling, hormone synthesis and signaling, DNA metabolism, DNA repair, and inflammation. Such processes present wide effects on cellular and embryonic physiology.⁷⁰

The combination of microarray data and network analysis is presented by Guo et al.⁷¹ In this article, the authors analyzed the interaction network of 126 genes, which presented differential expression in microarray experiments. Based on their results, it was possible to identify 23 genes involved in multiple signaling pathways related to tumorigenesis, which were considered potential biomarkers for bladder cancer. Analysis of the urine of patients was carried out, and the combination of the in silico and in vivo results shows that the expression of BLCA-4 and HOXA13 could distinguish between low- and high-grade tumors. Moreover, IGF-1 and hTERT were closely related to highly invasive and high-grade tumors.⁷¹

Network analysis of interactions among genes, proteins, or other kinds of elements can provide new insights about a given issue. By seeing the place of an element in a pathway, it is possible to obtain insights about the physiological significance and offer clues about functions of similar-looking proteins.^48,49

ML approaches are also applied in medical contexts. The related research can contribute to fundamental insights into early diagnosis, prognostics, therapy, and a better understanding of disease processes. Therefore, they would allow a more efficient clinical treatment.⁷² By using the data of three available data sets of the UCI database, Yilmaz et al.⁷³ propose the use of SVM for the diagnosis of diabetes and heart diseases. The accuracy achieved was more than 96% for all the different data sets tested. In addition to the ML technique for data classification, the authors propose a modified version of the K-means algorithm for the data preparation step.⁷³

A similar approach was carried out by Zheng et al.⁷² for breast cancer diagnosis also using the UCI database. At first, the K-means algorithm extracted useful information from the data set (data preparation). Afterward, an SVM was tested as a diagnostic classifier, presenting an accuracy result of 97.38%.⁷²

Cancer prognosis using available data was carried out by Sun et al.⁷⁴ The authors used the following data sets: the Wisconsin Prognosis Breast Cancer Database, a diffuse large B-cell lymphoma data set, and a non-small-cell lung cancer data set.⁷⁴ They performed the following two steps: feature selection and prognosis analysis by using the SVM approach. The results achieved are consistent and comparable with similar approaches. These kinds of research can contribute to a better understanding of the different types of tumor and the properties related to them. Moreover, they provide information for other cancer research.⁷²

IT Resources for Big Data Applications

The improvement of IT techniques provides a reliable mechanism for data management, analysis, and accessibility.^3,75 In big data applications, data establish deep connections with the tools applied in them.⁵ The database organization is essential for providing a way to share the data since it improves the organization and standardization.⁷⁶

In addition, the database is the source for researchers to manipulate data. Currently, cloud computing presents a promising and up-to-date solution for figuring out some of the big data challenges.^1,2,8,10 This approach exploits the use of multiple computers to store resources dynamically on the internet. Beyond that, it eliminates local installation of software, making maintenance and updates easier.² In what follows, IT resources applied to big data approaches are described.

Cloud computing

Cloud computing is a significant emergent approach of computational science that exerts influence on both academia and industry.^1,8 Cloud computing provides a very elastic, scalable, portable, and cost-efficient solution since it makes the best use of multiple computers to provide on-demand access to hosted resources.^10,76,77

The term cloud computing was inspired by flowcharts that use a cloud as a representative symbol of the internet. A standard definition for cloud computing is not available in the literature.^4,78 However, the National Institute of Standards and Technology provides the following definition⁷⁹: “Cloud computing is a model for enabling ubiquitous, convenient, on-demand network access to a shared pool of configurable computing resources (e.g., networks, servers, storage, applications, and services) that can be rapidly provisioned and released with minimal management effort or service provider interaction. This cloud model is composed of five essential characteristics, three service models, and four deployment models.”

The distinct definitions of cloud computing^8,76–78,80 present common features such as on-demand self-service, broad network access, and virtualization. This last one is considered the key enabler of cloud computing.⁷⁸ Virtualization is important because it allows a single physical machine to run multiple operating systems and associated applications by using multiple virtual machines installed. This procedure maximizes the utilization of the hardware and reduces the capital investment.⁴ Additional concepts behind cloud computing are high-performance computing (HPC), distributed systems, grid computing, and parallelized programming.^1,76,78,81

HPC, also named a computer cluster, typically comprises a set of homogeneous processing elements connected locally.^78,81 HPC has been built to provide higher computing power than that offered by a desktop workstation. It manages computing resources efficiently for the analysis of complex flows, slicing them into smaller parallel tasks.^81,82 On the contrary, in grid computing architecture, the computational resource available is composed of heterogeneous processing elements, geographically dispersed and connected by a network.^81,83,84

The aim of both HPC and grid computing is to run tasks in a parallelized and distributed way.^76,81 Despite the similarities among HPC and grid and cloud computing, they cannot be considered the same. Technically, grid computing is different from HPC because it must share common software to carry out the communication between the computers.⁸¹ In addition, grid is different from cloud computing since the latter presents hardware virtualization.^78,81,84

A comparison of four attributes of HPC and grid and cloud computing is provided in Table 3. The attributes shown in this table are related to the following: (1) the use of local or remote computers to process data (resource ownership), (2) the possibility of sharing resources and scalability using the private or public networks available (resource sharing and sizing), and (3) the way the infrastructure is provided (application portability).² More information about the technical attributes of these technologies can be found in Refs.^3,8,81

Table 3.

Comparison of computational architectures

	Computational architecture
	HPC	Grid	Cloud
Resource ownership	Locally	Locally or externally	Locally or externally
Resource sharing	Limited	High	Limited
Resource sizing	Quasistatic size	Dynamic	Dynamic
Platform type	Specific	Mixture of portable and specific	Portable

In the context of life sciences projects with large-size data sets, the choice of IT architecture (Table 3) is more related to technical aspects than the nature of the biological data.^9,85 Considering the biological data, there are a variety of problems that require huge processing and consumption of IT resources due to the size of the data set or the number of variables, such as genome assembly and molecular dynamics simulation, among others. For these cases, parallel computing applications are being developed (see the Big Data Platform Model for Data Processing section) to improve the processing time and memory/storage consumption.⁸⁶

The choice of which IT architecture is the most appropriate needs to take into account aspects that are related to the following: (1) the maintenance costs of local HPC infrastructure or cloud, (2) the trustworthiness and availability of remote computers in grid structure and privacy, and (3) espionage, international legal conflicts, and internet connection for cloud architecture.⁹

The services supported by cloud computing are usually classified into four business model types: software as a service (SaaS), platform as a service (PaaS), infrastructure as a service (IaaS), and data as a service (DaaS).²

SaaS offers a software application that runs in cloud infrastructure but is accessible on the internet. IaaS usually includes storage, processing units, and memory of high-performance computing. PaaS allows the creation of applications on resources available by the cloud provider.^2,8 DaaS comprises biological databases such as GenBank and 1000 Genomes among others. Cloud-based approaches can decrease the time and cost involved in biological big data projects.² Cloud computing has been used in genome annotation and NGS projects,^53–56 proteomics, and drug discovery,^87–90 among others.

Some commercial vendors of cloud services are Amazon (http://aws.amazon.com), Windows Azure (http://azure.microsoft.com), and Google (https://cloud.google.com). These commercial clouds are improving their services for life sciences research projects.⁹¹ Regardless of the vendor, features such as scalability, reliability, and cost are useful for big data projects.⁹²

The advantages of cloud computing rely on the fact that the computational resources can be allocated and updated according to the demand in real time. In addition to that, the administration (data backup and recovery) and maintenance of the devices are under the provider's responsibility. Finally, the payment is made according to the amount used (pay-as-you-go model).^76,78 Although cloud computing solves the problems of hardware and software cost, it alone does not address all the challenges of big data analysis. The most unresolved issue is related to the electronic transfer and sharing of a huge amount of data at the same time.⁹³ Another disadvantage of cloud computing is in regard to the privacy of hosting data sets.^5,8,76

Database models

Currently, the capability to collect and generate data has been vastly expanded. In this context, data storage plays a fundamental role since it is the backbone of DM.⁴ Incomplete or low-quality data may lead to inconclusive results or, more important, erroneous conclusions.⁴³ Database systems are an efficient mechanism for large data sets^77,94 as a well-designed database provides the necessary support for discovering novel correlations in data. For this reason, a computational database can be defined as a set of organized data that aim to assist a group of users. The data access is made by a program to answer the user's queries. That is, a database has data sources and users interested in its content.⁹⁴

A database project involves the definition of three different models, conceptual, logical, and physical, to ensure data management.⁹⁴ The conceptual model is a generic representation of a specific domain. It presents a collection of concepts that provide the necessary means to achieve database abstraction by the relational, XML, or Not Only SQL (NoSQL) data model.⁹⁴ The conceptual model is computationally expressed as a logical model with the purpose of representing data structure. In other words, this model comprises the data types of each attribute of each table, and the relationship among the tables can constitute a new attribute or a new associative table.⁹⁴

The logical model is implemented as a physical model through the application of a DBMS. A DBMS is a general-purpose software system that facilitates the processes of defining, constructing, manipulating, and sharing databases among various users and applications.⁹⁴ Some examples are Apache Cassandra⁹⁵ and MongoDB⁹⁶ for the NoSQL models and PostgreSQL, MySQL, Oracle, SQL Server, and IBM DB2 for the relational database models.

The relational database management system (RDBMS) model is an established approach and is widely used in the design of conceptual models for a wide variety of applications.⁹⁷ Its main concept is the creation of tables that consist of rows (tuples) and columns (attributes).⁴² With this model, it is possible to use relational algebra to build the best query execution plan. However, a well-designed relational database requires a regular and complete data set, which means a structured data set. Moreover, the relationship among the tables should be unambiguous as required in the process of database normalization.⁹⁷ These requisites are hard to achieve with biological data.^4,42,97
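The table-based model can be illustrated with a minimal sketch that uses Python's built-in sqlite3 module. The gene, pathway, and gene_pathway tables, as well as their contents, are hypothetical names chosen only for illustration and are not taken from any real biological database. The sketch shows rows and columns, and an associative table that resolves a many-to-many relationship.

```python
import sqlite3

# Hypothetical schema for illustration only; all names and rows are assumptions.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE gene (
        gene_id INTEGER PRIMARY KEY,
        symbol  TEXT NOT NULL UNIQUE
    );
    CREATE TABLE pathway (
        pathway_id INTEGER PRIMARY KEY,
        name       TEXT NOT NULL UNIQUE
    );
    -- Associative table: one row per (gene, pathway) membership.
    CREATE TABLE gene_pathway (
        gene_id    INTEGER NOT NULL REFERENCES gene(gene_id),
        pathway_id INTEGER NOT NULL REFERENCES pathway(pathway_id),
        PRIMARY KEY (gene_id, pathway_id)
    );
""")
conn.executemany("INSERT INTO gene VALUES (?, ?)", [(1, "geneA"), (2, "geneB")])
conn.executemany("INSERT INTO pathway VALUES (?, ?)", [(1, "pathwayX"), (2, "pathwayY")])
conn.executemany("INSERT INTO gene_pathway VALUES (?, ?)", [(1, 1), (1, 2), (2, 1)])


def genes_in_pathway(connection, pathway_name):
    """Return the symbols of all genes linked to the named pathway."""
    rows = connection.execute(
        """
        SELECT g.symbol
        FROM gene AS g
        JOIN gene_pathway AS gp ON gp.gene_id = g.gene_id
        JOIN pathway AS p ON p.pathway_id = gp.pathway_id
        WHERE p.name = ?
        ORDER BY g.symbol
        """,
        (pathway_name,),
    ).fetchall()
    return [symbol for (symbol,) in rows]
```

Every row must fit the declared columns, which is the regularity requirement that heterogeneous biological data often fail to meet.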

An alternative to the rigidness of the RDBMS is the XML schema model that can be used for modeling semistructured data.⁹⁷ In this approach, data comprise a series of labels and associated values. It can be represented graphically as nodes (objects) connected by edges to values.⁴² A key limitation of XML relies on its difficulty in modeling complex relationships. For example, there is no obvious way to represent many-to-many relationships, which are required to model complex pathways.^42,94

Big data platforms usually adopt NoSQL as an alternative to data management and database design.⁴ In contrast to relational approaches, NoSQL does not require predefined schema, relationships, and keys, and data storage and management are separated into two independent parts.^4,95 In general, NoSQL uses key-value stores, column family implementation, document store, and graph databases, which are uncommon in traditional RDBMS design. For these reasons, NoSQL assists developers in the management of voluminous and heterogeneous data as those treated by big data applications.⁹⁷
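The absence of a predefined schema can be sketched with a toy in-memory document store. This is not the API of MongoDB, Cassandra, or any other real NoSQL system; the put and find helpers, the keys, and the sample records are all hypothetical. The point is only that records under different keys may carry different fields.

```python
# Toy key-value/document store; an illustrative assumption, not a real NoSQL API.
store = {}


def put(key, document):
    """Store a copy of the document under the given key; no schema is enforced."""
    store[key] = dict(document)


def find(predicate):
    """Return the sorted keys of all documents for which predicate(doc) is true."""
    return sorted(key for key, doc in store.items() if predicate(doc))


# Hypothetical records with heterogeneous fields.
put("sample:1", {"organism": "E. coli", "reads": 1200})
put("sample:2", {"organism": "H. sapiens", "tissue": "liver", "variants": ["v1", "v2"]})
put("sample:3", {"organism": "E. coli", "plasmids": 2})

ecoli_keys = find(lambda doc: doc.get("organism") == "E. coli")
```

Because no field is mandatory, queries must tolerate missing attributes, which is why the predicate uses doc.get rather than direct indexing.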

The NoSQL database system presents some advantages, such as (1) scalability of data storage with high-performance, (2) flexibility for data modeling, (3) easy way to update application development and deployment, and (4) low-level access mechanism in which data management tasks can be implemented.^4,97 All these features contribute to make NoSQL significant as a database approach for cloud-based systems.⁹⁷

A comparison of relational and NoSQL DBMS is presented in Table 4, which summarizes the features described in this section.

Table 4.

Comparison of RDBMS and NoSQL database management system architectures

	DBMS architecture
	RDBMS	NoSQL
Data model	Fixed ER model	No model or flexible model
Scalability	Limited to the database server	Natural feature from cloud model
Consistency	ACID model	Eventual, in undetermined time all nodes will be synchronized
Query language	SQL	Each NoSQL database provides a particular API to write and execute queries. Some databases support SQL
High availability	Depends on the local infrastructure (local network, database server, data center)	Cloud computing configuration
Data replication	RDBMS available and configuration	Cloud computing configuration for the specific number of nodes

ACID, Atomicity, Consistency, Isolation, Durability; DBMS, Database Management System; RDBMS, relational database management system; SQL, Structured Query Language.

As presented in The Biological Data section, the landscape of biological data makes the definition of the database model an important question for biological data organization and storage. The traditional databases, such as RDBMS, are well suited for structured data, since they provide good levels of storage efficiency, data integration, and good data retrieval speeds compared with file-based storage.⁹⁸

On the contrary, NoSQL deals better with large data sets of unstructured data, but it is not suitable for highly transactional systems.²⁰ In this context, the main goal of a biological repository should be considered. For instance, genomic ChIP-seq, RNA-seq, and DNase-seq projects usually use NoSQL approaches, whereas Electronic Health Records and Clinical Trial Records use RDBMS approaches.^21,98,99

Regardless of the DBMS architecture, the proposition of new inferences usually requires the combination of data located in different or heterogeneous sources.⁴³ Therefore, database integration applications are necessary and remain an important topic in big data approaches. An example of how to integrate data is the DW structure. Despite the divergent definitions for DW, it can be described as an information repository where the data from different sources are stored.⁹⁴

The process of loading data into a DW comprises the following three main steps: data extraction, transformation, and loading (ETL process). The first step is the extraction of data from heterogeneous sources. The second step is the transformation, which is carried out to clean the data and improve their accuracy. In the third step, data are loaded into an integrated multidimensional schema of the DW. This process can enhance data access and analysis. For this reason, a DW provides consistent, timely, subject-oriented information at the required level of detail, enabling the user to make better and faster decisions.¹⁰⁰ The ETL steps are suitable for a DW organized in an SQL or NoSQL schema.^101,102
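The three ETL steps can be sketched as follows. This is a minimal illustration under stated assumptions: the two sources, their field names, and their values are hypothetical, and a single flat table in an in-memory SQLite database stands in for the multidimensional schema of a real DW.

```python
import sqlite3

# Two hypothetical sources with inconsistent conventions (assumed for illustration).
source_a = [{"gene": "geneA", "expr": "12.5"}, {"gene": "geneB", "expr": ""}]
source_b = [{"symbol": " GENEA ", "level": 7.5}, {"symbol": "geneC", "level": 3.0}]


def extract():
    """Extraction: read records from each heterogeneous source."""
    for rec in source_a:
        yield rec["gene"], rec["expr"], "source_a"
    for rec in source_b:
        yield rec["symbol"], rec["level"], "source_b"


def transform(rows):
    """Transformation: normalize identifiers and drop records without a numeric value."""
    for gene, value, source in rows:
        gene = gene.strip().lower()
        try:
            value = float(value)
        except (TypeError, ValueError):
            continue  # cleaning step: skip missing or malformed measurements
        yield gene, value, source


def load(rows, connection):
    """Loading: write the cleaned rows into the warehouse table."""
    connection.execute(
        "CREATE TABLE IF NOT EXISTS fact_expression (gene TEXT, value REAL, source TEXT)"
    )
    connection.executemany("INSERT INTO fact_expression VALUES (?, ?, ?)", list(rows))
    connection.commit()


warehouse = sqlite3.connect(":memory:")
load(transform(extract()), warehouse)

# An analytical query over the integrated data.
summary = warehouse.execute(
    "SELECT gene, COUNT(*), AVG(value) FROM fact_expression GROUP BY gene ORDER BY gene"
).fetchall()
```

After loading, records for the same gene from both sources can be analyzed together, which is the purpose of the integration.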

The main difference between the data stored in an operational database and the one in a DW is the structure, that is, data knowledge and representation (Table 5). In an operational database, data are in a structured form, whereas in a DW, the data may either be presented in the structured format or not.⁹⁴ Moreover, in a database, redundancy should be eliminated to enhance business process, whereas in a DW, the existence of redundancy is required to improve its performance.¹⁰⁰ Further information about the data model and the query processing attributes of these database models can be found in Refs.^43,94,100

Table 5.

Comparison between operational database and data warehouse

	Database architecture
	Operational database	Data warehouse
Data model	ER model (ER diagram)	Multidimensional data model (star, snowflake, multistar, and cube) based on fact table and dimension table.
Data redundancy	No	Yes
Data type	Current data	Historical data
Data volume and type	High volume of transaction data	High volume of analytical data
Query performance	Low performance for analytical queries	High performance for analytical queries

In the context of big data approaches, the variety (heterogeneity) of formats, content, and sources of the data sets adds a degree of difficulty to the ETL process of a DW implementation. There are some efforts to address this challenge,^101,102 as reviewed by Diouf et al.¹⁰² The DW is an important computational approach that integrates operational databases. It aims to give life science researchers the analyses they need to infer hypotheses and make scientific contributions to a given field, as in the applications presented by INDIGO,⁴⁴ metabolicMine,⁴⁵ BioExtract Server,⁴⁷ GenMAPP,⁴⁸ and GeneCodis,⁴⁹ among others.

Big data platform model for data processing

To capture value from big data, it is necessary to use specific platforms. One of the most widely used big data tools is Apache Hadoop.⁴ It is an implementation of the computational paradigm known as Map/Reduce that was developed by Google, Inc.¹⁰³ In a few words, this platform allows data-intensive analyses by distributing tasks over multiple nodes.^8,103 Hadoop has several central modules: Hadoop Kernel, MapReduce, and Hadoop Distributed File System. The MapReduce paradigm divides a computational program into many small subproblems (map step), and Hadoop provides a distributed file system that stores data in nodes. Finally, the “reduce” step merges all the smaller outputs to generate the whole result.^2,8,76
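The map and reduce steps can be sketched on a single machine. The example below is not Hadoop code and does not use any Hadoop API; the function names and the k-mer counting task are assumptions chosen for illustration. In a real deployment, the map calls would run on different nodes and the grouping step would move data across the network.

```python
from collections import defaultdict


def map_step(sequence, k=2):
    """Map: emit a (k-mer, 1) pair for every k-mer in one sequence."""
    return [(sequence[i:i + k], 1) for i in range(len(sequence) - k + 1)]


def shuffle(pairs):
    """Group the emitted values by key, as the framework does between map and reduce."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_step(key, values):
    """Reduce: merge all partial counts for one key."""
    return key, sum(values)


def map_reduce(sequences, k=2):
    """Run map over every sequence, group by key, and reduce each group."""
    mapped = [pair for seq in sequences for pair in map_step(seq, k)]
    return dict(reduce_step(key, values) for key, values in shuffle(mapped).items())


counts = map_reduce(["ACGT", "CGTA"])
```

Each map call depends only on its own sequence, which is what allows the work to be distributed over multiple nodes.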

Hadoop deals with several challenges of data access and management. It hides from the user some computer abstractions, such as distributed file systems, distributed query language, and distributed databases. Despite this feature, the Apache Hadoop programming environment is not suitable for people who present limited programming experience, such as life scientists.^2,8,76 Moreover, Hadoop is not designed for real-time applications since its implementation presents high-throughput latency.⁴

Dryad is another popular computing framework for implementing parallel and distributed programs.¹⁰⁴ This platform builds a data flow graph application that can be configured into an arbitrary directed acyclic graph applying a set of computational vertices and communication channels.¹ The application executes the vertices of the graph on a set of computers that can exchange data by using shared memory queues and Transmission Control Protocol pipes. The generalization capacity of Dryad is better than MapReduce as it can process data from a very small cluster to a large one.^4,104 For real-time big data applications, Storm¹⁰⁵ is an open-source computational system. It is designed for processing fast and large streams of data as they arrive.^4,105

Some examples of related projects devoted to biological big data are CloudBioLinux,⁵⁷ SeqHBase,⁵² and CloudDOE,⁵⁸ among others. CloudBioLinux is an initiative of the J. Craig Venter Institute that provides access to hundreds of bioinformatic tools through a friendly graphical user interface.⁵⁷ SeqHBase is a cloud-based toolset for analyzing mutations in sequencing data,⁵² and CloudDOE is a platform that encapsulates the technical details of Hadoop implementation. These applications illustrate the potential of the association between cloud computing and life sciences.⁸ A limitation of these approaches is not related to the technology itself but to the noisy and inconsistent nature of biological data.^5,9,76,105

Computational Approaches Applied to Big Data Analysis

The value of big data is not concerned with the data, but with how supportive they are for decisions and assumptions.⁴³ This can be accomplished by the application of DM techniques. DM is an interdisciplinary field that combines some aspects of artificial intelligence, database management, and ML among other fields.¹⁰⁶

DM consists of three principal steps executed in the following order: (1) data preprocessing, (2) data modeling, and (3) data postprocessing.¹⁰⁷ In data preprocessing, the raw data are transformed to “clean” data. This first step is crucial since it is related to the success of all the following steps of data analysis.¹⁰⁷ Subsequently, the application of modeling techniques, such as decision tree, ANNs, and clustering, is carried out. This is the data modeling or DM step. After this, the visualization and evaluation of extracted knowledge are carried out (data postprocessing).^4,107

DM algorithms extract useful regularities for a given purpose, and ML provides the technical basis for DM. ML approaches can give promising results even if the relationships are unknown or hard to describe. In addition, ML can recognize complex patterns in an automatic way or distinguish examples based on these patterns.^108,109 These algorithms usually split the data set into training and test groups. They learn from examples (training data) and from them build the classification model, which will be tested in a set of examples not exposed to the classifier in the training process.
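The separation into training and test groups can be sketched as follows. The function names, the default test fraction of 0.25, and the fixed random seed are assumptions made for illustration; they are not values prescribed by the cited works.

```python
import random


def train_test_split(examples, test_fraction=0.25, seed=0):
    """Shuffle the examples reproducibly and split them into (train, test) lists."""
    shuffled = list(examples)
    random.Random(seed).shuffle(shuffled)
    n_test = int(round(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]


def accuracy(classifier, test_set):
    """Fraction of held-out (features, label) pairs that the classifier labels correctly."""
    correct = sum(1 for features, label in test_set if classifier(features) == label)
    return correct / len(test_set)
```

The classifier is built from the training list only, and accuracy is computed on the test list, so the reported performance reflects examples not seen during training.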

In a big data context, data analysis requires comprehension of the data as well as the ML algorithms to solve an issue of interest. A critical factor that needs to be considered is whether the ML algorithm can be efficiently parallelized.⁴ This is important because big data problems deal with the distribution of tasks over many computers to reach a solution.

Among all ML techniques, the decision tree, SVM, ANN, and clustering are widely used in the life sciences, either alone or in combination.^110–112 For this reason, the purpose of this section is to provide an explanation of the basic ideas of these ML approaches and their application in biological contexts. All these examples are related to the building of a classification model, which means that the data were previously acquired and transformed before being used to build the model.

Decision tree

The decision tree is a ML technique widely used in biological tasks, such as soil systems,¹¹³ genetics and proteomics,^108,114 biomanufacturing,¹¹⁵ and medicine,^110,116 among others.

A decision tree is a set of rules about a given subject represented in a flowchart-like structure. Briefly, this algorithm recursively splits a set of independent instances, providing a representation as classification rules. The tree structure is an efficient and easy-to-understand representation of the information captured from the analyzed data set. Each internal node denotes a query test on the attribute, and the branches are an outcome of the test. The terminal nodes represent the class label.¹⁰⁹

Some examples of algorithms to construct a decision tree are CART and C4.5. The difference between them is the evaluation function used for the classification. The C4.5 algorithm uses an entropy measure, estimates the error rate of the nodes, and prunes the tree to obtain a more efficient subtree. In addition, it can construct a multiway tree structure. On the contrary, CART usually builds a binary tree by using the Gini index as the evaluation function.¹⁰⁹
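The two evaluation functions can be written down directly. The sketch below computes the entropy and the Gini index of a set of class labels and the weighted impurity of a candidate split. It illustrates only the measures themselves, not the full C4.5 or CART procedures, and the function names are assumptions.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of the class distribution in labels."""
    total = len(labels)
    return -sum((n / total) * math.log2(n / total) for n in Counter(labels).values())


def gini(labels):
    """Gini index of the class distribution in labels."""
    total = len(labels)
    return 1.0 - sum((n / total) ** 2 for n in Counter(labels).values())


def weighted_impurity(groups, impurity):
    """Impurity of a split: size-weighted average over the non-empty child groups."""
    total = sum(len(g) for g in groups)
    return sum(len(g) / total * impurity(g) for g in groups if g)
```

A split is preferred when it lowers the weighted impurity of the children relative to the parent node; both measures are zero for a node that contains a single class.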

This technique has a low level of requirement for the data preparation process. Moreover, it presents good performance on large data sets. When compared with the ANN approach, the decision tree is more suitable for non-numeric data. In contrast, the decision tree presents some limitations, especially in the determination of an appropriate size of the tree. From the user's view, a large decision tree may be hard to understand. From a computational perspective, the construction of an optimal decision tree is known to be an NP-complete problem. For this reason, learning algorithms are usually based on heuristics, which do not guarantee the optimal decision tree structure.¹⁰⁹

Support vector machine

An SVM is a learning method usually implemented as a binary classifier. SVM makes the classification by drawing a straight line that separates, as widely as possible, the positive examples from the negative examples.¹¹⁷ This classification model is given by the selection of a small number of critical boundary instances (called support vectors) from both classes and their separation by a high-dimensional hyperplane. To obtain the best hyperplane, it is necessary to apply supervised learning algorithms known as kernel machines.^117,118

The kernel function is crucial for SVM, since the knowledge captured from the data set is dependent on the definition of a suitable kernel.¹¹⁷ Further information and mathematical background of SVM can be found in Refs.^117,118 SVM is a technique used in different biological domains as presented by Refs.,^{62,63,69,72,119–121} for instance.
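As an illustration of what a kernel computes, the sketch below implements the radial basis function (RBF) kernel and the decision value of an already trained SVM as a weighted sum of kernel evaluations against the support vectors. Training (finding the support vectors and coefficients) is not shown; the function names, the gamma parameter default, and any coefficient values passed in are assumptions.

```python
import math


def rbf_kernel(x, y, gamma=1.0):
    """RBF kernel: exp(-gamma * squared Euclidean distance between x and y)."""
    squared_distance = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-gamma * squared_distance)


def decision_value(x, support_vectors, coefficients, bias, gamma=1.0):
    """Sum of coefficient * kernel(support vector, x) plus bias.

    Each coefficient is assumed to already include the class sign of its
    support vector; the sign of the result gives the predicted class.
    """
    return sum(
        c * rbf_kernel(sv, x, gamma) for sv, c in zip(support_vectors, coefficients)
    ) + bias
```

Only the support vectors enter the sum, which is why the discriminant function depends on a small subset of the training data.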

The SVM algorithms present many advantages in their use when compared with other ML methods. First, SVM produces a unique solution because it is basically a linear problem. Second, it is able to deal with very large amounts of dissimilar information. Finally, the discriminant function is characterized by a comparatively small subset of the entire training data set, which makes the computation faster.^117,118 On the contrary, a problem of SVM is its slow training as its learning process is carried out by solving a quadratic programming problem with the number of variables equal to the number of training data.¹¹⁸

Artificial neural network

The ANN is an artificial intelligence approach used for classification and the prediction process.^109,122 In its simplest form, ANN can be viewed as a graphical model consisting of interconnected units. The connection from a unit j to a unit i usually has a weight denoted by Wij. The weights represent information used by the net to solve a problem. During the training process, the weights are adjusted to minimize the difference between the network output and the desired output. The highly popular algorithm for the learning process is back-propagation.^64,122,123
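The weight adjustment can be sketched for the simplest case of a single sigmoid unit trained by gradient descent on the squared error. This is a reduced illustration and not a full multilayer back-propagation implementation; the learning rate, the number of epochs, and the function names are assumptions.

```python
import math


def sigmoid(z):
    """Logistic activation function."""
    return 1.0 / (1.0 + math.exp(-z))


def predict(weights, bias, inputs):
    """Output of the unit: sigmoid of the weighted sum of the inputs plus bias."""
    return sigmoid(sum(w * x for w, x in zip(weights, inputs)) + bias)


def train_unit(examples, learning_rate=0.5, epochs=5000):
    """Adjust the weights to reduce the squared difference between output and target."""
    weights = [0.0] * len(examples[0][0])
    bias = 0.0
    for _ in range(epochs):
        for inputs, target in examples:
            output = predict(weights, bias, inputs)
            # Derivative of 0.5 * (output - target)^2 with respect to the weighted sum.
            delta = (output - target) * output * (1.0 - output)
            weights = [w - learning_rate * delta * x for w, x in zip(weights, inputs)]
            bias -= learning_rate * delta
    return weights, bias
```

In a multilayer network, back-propagation applies the same kind of update to every weight, with the error term of each hidden unit computed from the error terms of the units it feeds.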

The ANN architecture is defined by the way the neurons are interconnected; for this reason, it is possible to design many kinds of architecture. The most widely applied ANN architecture is the multilayer perceptron, since this architecture presents the capability to capture and discover high-order correlation and/or relationships inside the input data.^110,123

The three-layer ANN is known as a universal classifier because it is able to classify any labeled data correctly if there are no identical data in different classes.¹²² The layers present different roles in the learning process. The neurons of the input layer receive the information from external sources and pass this information to the hidden layer. The use of hidden neurons increases the capacity of the net to decide how input features should be represented. The output layer contains neurons that receive processed information and send output signals out of the system.^110,122,123 Due to the ANN capability of rapid fitting of nonlinear data, it can capture imprecise and incomplete patterns.^64,122

The ANN is often applied in biological areas such as (1) gene structure and regulation,^{68,114,124,125} (2) ecology,^126,127 (3) cancer,^128,129 (4) degenerative diseases,¹³⁰ and (5) proteomics.^64,65 Despite its advantages, the ANN presents some difficulties. Many decisions related to the choice of ANN structure and parameters are often completely subjective. The final ANN solution may be influenced by a number of factors (e.g., starting weights, number of cases, and number of training cycles). Moreover, overtraining should be analyzed to prevent an ANN from memorizing the data instead of generalizing from them.¹²²

Clustering

Clustering is an unsupervised classification technique suitable in cases when there is no class to be predicted, but the instances can be divided into natural groups. This approach is divided into two methods: hierarchical and nonhierarchical.^72,109,131

K-means is a classic nonhierarchical clustering algorithm. It separates the instances into a certain number of sets according to a predefined distance criterion. The definition of the number of clusters is required in advance (the k parameter). After this, the initial cluster centroids are either defined randomly or prespecified by the researcher. The distances are calculated based on the current arrangement for cluster centroid updating. The most used distance criterion is the Euclidean distance measurement. The K-means algorithm iterates over the whole data set until its convergence, which means each instance in a given cluster is closest to the centroid of that cluster.^13,109
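The iteration can be sketched in a few lines of plain Python. In this minimal version, the initial centroids are passed in by the caller (the prespecified option), the distance criterion is the Euclidean distance, and the loop stops when the assignments no longer change; the function name and the iteration cap are assumptions.

```python
import math


def k_means(points, initial_centroids, max_iterations=100):
    """Return (centroids, assignments) after iterating until assignments are stable."""
    centroids = [tuple(c) for c in initial_centroids]
    assignments = None
    for _ in range(max_iterations):
        # Assignment step: each point goes to its nearest centroid.
        new_assignments = [
            min(range(len(centroids)), key=lambda j: math.dist(p, centroids[j]))
            for p in points
        ]
        if new_assignments == assignments:
            break  # convergence: no point changed cluster
        assignments = new_assignments
        # Update step: move each centroid to the mean of its members.
        for j in range(len(centroids)):
            members = [p for p, a in zip(points, assignments) if a == j]
            if members:  # an empty cluster keeps its previous centroid
                centroids[j] = tuple(sum(coord) / len(members) for coord in zip(*members))
    return centroids, assignments
```

Calling the function with different initial centroids on the same points is a simple way to observe how the result depends on the starting configuration.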

This clustering method is simple and effective, and the K-means algorithm allows each instance to belong to a single set.^109,131 However, the final clusters are quite sensitive to the initial cluster centers. Completely different arrangements can arise from small changes in the initial random choice.^13,109

Differently from partitional algorithms (such as K-means), hierarchical clustering generates a set of clusters with different granularities in a hierarchical representation. The graphical representation of the results is a dendrogram.^131,132 These algorithms can be subdivided into agglomerative and divisive according to the linkage method: bottom-up or top-down. Agglomerative algorithms follow the bottom-up method, through which the similarity between the objects is defined before the cluster similarity measure. Conversely, the top-down approach first calculates the similarity measure between the clusters. Divisive algorithms follow this linkage method.¹³²
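The bottom-up method can be sketched as follows. Each instance starts as its own cluster, and the two closest clusters are merged repeatedly. Single linkage (the minimum pairwise Euclidean distance) is used here as one possible cluster similarity measure; this choice, the function name, and the n_clusters stopping parameter are assumptions made for illustration.

```python
import math


def agglomerative(points, n_clusters=1):
    """Merge the closest clusters until n_clusters remain.

    Returns (clusters, merges): clusters holds lists of point indices, and
    merges records each merge as (cluster_a, cluster_b, distance), which is
    the information a dendrogram displays.
    """
    clusters = [[i] for i in range(len(points))]
    merges = []

    def linkage(a, b):
        # Single linkage: distance between the closest pair of members.
        return min(math.dist(points[i], points[j]) for i in a for j in b)

    while len(clusters) > n_clusters:
        pairs = [
            (linkage(clusters[a], clusters[b]), a, b)
            for a in range(len(clusters))
            for b in range(a + 1, len(clusters))
        ]
        distance, a, b = min(pairs)
        merges.append((sorted(clusters[a]), sorted(clusters[b]), distance))
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]
    return clusters, merges
```

Recomputing all pairwise linkages at every merge keeps the sketch short but is costly for large data sets.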

Hierarchical clustering does not require that the number of clusters or input parameters be known in advance. However, the choice of the similarity/dissimilarity measure is mandatory. Another advantage relies on the good result visualizations integrated into the methods. Nonetheless, this approach may not scale well due to the runtime for standard methods and the difficulty in discovering “optimal clusters” automatically.^110,132

The applications of clustering in the life sciences mostly concern gene expression and regulation, as shown in refs.^{13,59,72,131–135} and other related articles.

The growing amount of available biological data has expanded the means for carrying out research. As a consequence, biology has been adopting ML approaches to analyze its data,^77,81 each of which is appropriate for a given purpose. Table 6 summarizes the information described in this section about decision trees, SVMs, ANNs, and clustering. Through the development and application of data-analytical, mathematical modeling, and computational techniques, both expert-guided and machine-guided searches can uncover novel correlations in data.³

Table 6. Data mining methods

ML methods	Learning process	Learning model	Example algorithms/kernels	Learning tasks	References
Decision tree	Supervised	Predictive model	C4.5, CART	Classification, regression	^{106,111,115–119}
Support vector machine	Supervised	Predictive model	RBF kernel, string kernel	Classification, regression	^{62,63,69,72,117–120}
Artificial neural network	Supervised	Predictive model	Back-propagation	Classification, regression	^{64,65,68,124–130}
Clustering	Unsupervised	Predictive model	K-means, hierarchical clustering	Clustering	^{13,59,72,131–135}

ML, machine learning.

From a general perspective, the application of biological big data is related to advances in fields that require massive and complex data analysis. The most recent emerging application of biological big data is the COVID-19 pandemic scenario.¹⁰⁶ The applications include virus biology,^136,137 diagnosis,¹³⁸ vaccine development,^139,140 and the study of COVID-19-related fake news.^141,142

Conclusions

The convergence between IT and biology has emerged due to the large amount of data produced by experiments. This scenario creates an opportunity to apply big data as a means for life scientists to achieve their goals and propose relevant inferences.³ Considering that the efficient management of resources (money, power, space, and people) is required to solve an application of interest, a multidisciplinary team for in silico experiments is appropriate to resolve complex challenges.¹⁰³ Biology experts need assistance to understand how to use IT tools correctly. Conversely, computer scientists alone cannot interpret the biological meaning of in silico outputs.

Despite technological advances, it is still challenging to transform biological data into information that can provide insights in a given context.²³ In light of this, this article presents a perspective on the current role of big data in the field of biology. An overview of big data can motivate its application and narrow the gap between the generation of data and their understanding.¹¹⁶ The examples of research based on the reuse of information presented in this article show the potential of big data in biological domains.

Footnotes

Acknowledgment

The authors wish to thank the University of Caxias do Sul for the support for this article.

Authors' Contributions

All authors listed have made a substantial, direct, and intellectual contribution to the work, and approved it for publication.

Author Disclosure Statement

No competing financial interests exist.

Funding Information

No funding was received for the development of this research.

References

1. Gannon, Reed. Parallelism and the cloud. In: Hey AJ, Tansley S, Tolle KM (Eds.): The fourth paradigm: Data intensive scientific discovery, Redmond, WA: Microsoft Research, 2009. pp. 131–135.

2. Dai, Gao, Guo, et al. Bioinformatics clouds for big data manipulation. Biol Direct. 2012; 7:43.

3. Callebaut. Scientific perspectivism: A philosopher of science's response to the challenge of big data biology. Stud Hist Philos Biol Biomed Sci. 2012; 43:69–80.

4. Philip Chen, Zhang C-Y. Data-intensive applications, challenges, techniques and technologies: A survey on Big Data. Inform Sci. 2014; 275:314–347.

5. Marx. The big challenges of big data. Nature. 2013; 498:255–260.

6. Abecasis, Altshuler, Auton, et al. A map of human genome variation from population-scale sequencing. Nature. 2010; 467:1061–1073.

7. Sayers, Cavanaugh, Clark, et al. GenBank. Nucleic Acids Res. 2021; 49:D92–D96.

8. O'Driscoll, Daugelaite, Sleator. “Big data,” Hadoop and cloud computing in genomics. J Biomed Inform. 2013; 46:774–781.

9. Nobile, Cazzaniga, Tangherloni, et al. Graphics processing units in bioinformatics, computational biology and systems biology. Brief Bioinform. 2017; 18:870–885.

10. Grossman RL. Data lakes, clouds, and commons: A review of platforms for analyzing and sharing genomic data. Trends Genet. 2019; 35:223–234.

11. Hood. Systems biology and p4 medicine: Past, present, and future. Rambam Maimonides Med J. 2013; 4:e0012.

12. Mirza, Wang, et al. Machine learning and integrative analysis of biomedical big data. Genes (Basel). 2019; 10:87.

13. Dall'Alba, Casa, Notari, et al. Analysis of the nucleotide content of Escherichia coli promoter sequences related to the alternative sigma factors. J Mol Recognit. 2019; 32:e2770.

14. […], Ge, He. Big data analytics for genomic medicine. Int J Mol Sci. 2017; 18:412.

15. Packer, Russell, Dalgleish HWP, et al. Simultaneous all-optical manipulation and recording of neural circuit activity with cellular resolution in vivo. Nat Methods. 2015; 12:140–146.

16. Lerman, Gill, Rinberg, et al. Spatially and temporally precise optical probing of neural activity readout. In: Biophotonics Congress: Biomedical Optics Congress 2018, Hollywood, FL, April 3–6, Optical Society of America, 2018. pp. BTu2C.3.

17. Glasser, Coalson, Robinson, et al. A multi-modal parcellation of human cerebral cortex. Nature. 2016; 536:171–178.

18. […], Yang, Xue, et al. Translating cancer genomics into precision medicine with artificial intelligence: Applications, challenges and future perspectives. Hum Genet. 2019; 138:109–124.

19. Dinov, Petrosyan, Liu, et al. High-throughput neuroimaging-genetics computational infrastructure. Front Neuroinform. 2014; 8:41.

20. Leyens, Reumann, Malats, et al. Use of big data for drug development and for public and personal health and care. Genet Epidemiol. 2017; 41:51–60.

21. Lawlor, Lynch, Mac Aogáin, et al. Field of genes: Using Apache Kafka as a bioinformatic data repository. Gigascience. 2018; 7:giy036.

22. Sayers, Beck, Brister, et al. Database resources of the National Center for Biotechnology Information. Nucleic Acids Res. 2020; 48:D9–D16.

23. Kanehisa, Goto, Sato, et al. Data, information, knowledge and principle: Back to metabolism in KEGG. Nucleic Acids Res. 2014; 42:D199–D205.

24. Berman, Westbrook, Feng, et al. The Protein Data Bank. Nucleic Acids Res. 2000; 28:235–242.

25. UniProt Consortium. Update on activities at the Universal Protein Resource (UniProt) in 2013. Nucleic Acids Res. 2013; 41:D43–D47.

26. ENCODE Project Consortium. An integrated encyclopedia of DNA elements in the human genome. Nature. 2012; 489:57–74.

27. Amberger, Bocchini, Hamosh. A new face and new challenges for Online Mendelian Inheritance in Man (OMIM®). Hum Mutat. 2011; 32:564–567.

28. Lucas, Senger, Erxleben, et al. StreptomeDB: A resource for natural compounds isolated from Streptomyces species. Nucleic Acids Res. 2013; 41:D1130–D1136.

29. Pagliosa, Doria, Misturini, et al. NONATObase: A database for Polychaeta (Annelida) from the Southwestern Atlantic Ocean. Database (Oxford). 2014; 2014:bau002.

30. Robinson, Herzyk, Dow JAT, et al. FlyAtlas: Database of gene expression in the tissues of Drosophila melanogaster. Nucleic Acids Res. 2013; 41:D744–D750.

31. Zhou, Rudd. EcoGene 3.0. Nucleic Acids Res. 2013; 41:D613–D624.

32. Flicek, Amode, Barrell, et al. Ensembl 2014. Nucleic Acids Res. 2014; 42:D749–D755.

33. Keseler, Mackie, Peralta-Gil, et al. EcoCyc: Fusing model organism databases with systems biology. Nucleic Acids Res. 2013; 41:D605–D612.

34. Stachelscheid, Seltmann, Lekschas, et al. CellFinder: A cell data repository. Nucleic Acids Res. 2014; 42:D950–D958.

35. Franceschini, Szklarczyk, Frankild, et al. STRING v9.1: Protein-protein interaction networks, with increased coverage and integration. Nucleic Acids Res. 2013; 41:D808–D815.

36. Notari, Molin, Davanzo, et al. IntergenicDB: A database for intergenic sequences. Bioinformation. 2014; 10:381–383.

37. Robbins, Grüneberg, Deus, et al. A self-updating road map of The Cancer Genome Atlas. Bioinformatics. 2013; 29:1333–1340.

38. Croft, Mundo, Haw, et al. The Reactome pathway knowledgebase. Nucleic Acids Res. 2014; 42:D472–D477.

39. Gene Ontology Consortium. Gene Ontology Consortium: Going forward. Nucleic Acids Res. 2015; 43:D1049–D1056.

40. Santos-Zavaleta, Salgado, Gama-Castro, et al. RegulonDB v 10.5: Tackling challenges to unify classic and high throughput knowledge of gene regulation in E. coli K-12. Nucleic Acids Res. 2019; 47:D212–D220.

41. Selama, James, Nateche, et al. The world bacterial biogeography and biodiversity through databases: A case study of NCBI Nucleotide Database and GBIF Database. Biomed Res Int. 2013; 2013:240175.

42. Louie, Mork, Martin-Sanchez, et al. Data integration and genomic medicine. J Biomed Inform. 2007; 40:5–16.

43. Cole, Newman, Foertter, et al. Breeding and Genetics Symposium: Really big data: Processing and analysis of very large data sets. J Anim Sci. 2012; 90:723–733.

44. Alam, Antunes, Kamau, et al. INDIGO—INtegrated data warehouse of microbial genomes with examples from the red sea extremophiles. PLoS One. 2013; 8:e82210.

45. Lyne, Smith, Lyne, et al. metabolicMine: An integrated genomics, genetics and proteomics data warehouse for common metabolic disease research. Database (Oxford). 2013; 2013:bat060.

46. Notari, Oldra, Mariani, et al. Dis2PPI: A workflow designed to integrate proteomic and genetic disease data. Int J Knowl Disc Bioinfo. 2012; 3:67–85.

47. Lushbough, Bergman, Lawrence, et al. BioExtract server—An integrated workflow-enabling system to access and analyze heterogeneous, distributed biomolecular data. IEEE/ACM Trans Comput Biol Bioinform. 2010; 7:12–24.

48. Salomonis, Hanspers, Zambon, et al. GenMAPP 2: New features and resources for pathway analysis. BMC Bioinformatics. 2007; 8:217.

49. Nogales-Cadenas, Carmona-Saez, Vazquez, et al. GeneCodis: Interpreting gene lists through enrichment analysis and integration of diverse biological information. Nucleic Acids Res. 2009; 37:W317–W322.

50. Shannon, Markiel, Ozier, et al. Cytoscape: A software environment for integrated models of biomolecular interaction networks. Genome Res. 2003; 13:2498–2504.

51. Pavlopoulos, Hooper, Sifrim, et al. Medusa: A tool for exploring and clustering biological networks. BMC Res Notes. 2011; 4:384.

52. […], Person, Hebbring, et al. SeqHBase: A big data toolset for family based sequencing data analysis. J Med Genet. 2015; 52:282–288.

53. Habegger, Balasubramanian, Chen, et al. VAT: A computational framework to functionally annotate variants in personal genomes within a cloud-computing environment. Bioinformatics. 2012; 28:2267–2269.

54. Nagasaki, Mochizuki, Kodama, et al. DDBJ read annotation pipeline: A cloud computing-based pipeline for high-throughput analysis of next-generation sequencing data. DNA Res. 2013; 20:383–390.

55. Pareja-Tobes, Manrique, Pareja-Tobes, et al. BG7: A new approach for bacterial genome annotation designed for next generation sequencing data. PLoS One. 2012; 7:e49239.

56. Stein LD. Using GBrowse 2.0 to visualize and share next-generation sequence data. Brief Bioinform. 2013; 14:162–171.

57. Krampis, Booth, Chapman, et al. Cloud BioLinux: Pre-configured and on-demand bioinformatics computing for the genomics community. BMC Bioinformatics. 2012; 13:42.

58. Chung W-C, Chen C-C, Ho J-M, et al. CloudDOE: A user-friendly tool for deploying Hadoop clouds and analyzing high-throughput sequencing data with MapReduce. PLoS One. 2014; 9:e98146.

59. Boeva. Clustering approaches for dealing with multiple DNA microarray datasets. J Comput Sci. 2014; 5:368–376.

60. Ying, Lv, Ying, et al. Screening of feature genes of the ovarian cancer epithelia with DNA microarray. J Ovarian Res. 2013; 6:39.

61. Sartor ITS, Zeidán-Chuliá, Albanus, et al. Computational analyses reveal a prognostic impact of TULP3 as a transcriptional master regulator in pancreatic ductal adenocarcinoma. Mol Biosyst. 2014; 10:1461–1468.

62. Jung, Lee, Kim, et al. Identification of genomic features in the classification of loss- and gain-of-function mutation. BMC Med Inform Decis Mak. 2015; 15(Suppl 1):S6.

63. Tunyasuvunakool, Adler, Wu, et al. Highly accurate protein structure prediction for the human proteome. Nature. 2021; 596:590–596.

64. Jumper, Evans, Pritzel, et al. Highly accurate protein structure prediction with AlphaFold. Nature. 2021; 596:583–589.

65. Tripathi, Gupta. Discriminating lysosomal membrane protein types using dynamic neural network. J Biomol Struct Dyn. 2014; 32:1575–1582.

66. Gordon, Chervonenkis, Gammerman, et al. Sequence alignment kernel for recognition of promoter regions. Bioinformatics. 2003; 19:1964–1971.

67. Rani, Bhavani, Bapi. Analysis of E. coli promoter recognition problem in dinucleotide feature space. Bioinformatics. 2007; 23:582–588.

68. de Avila e Silva, Echeverrigaray, Gerhardt GJL. BacPP: Bacterial promoter prediction—A tool for accurate sigma-factor specific assignment in enterobacteria. J Theor Biol. 2011; 287:92–99.

69. Iraola, Vazquez, Spangenberg, et al. Reduced set of virulence genes allows high accuracy prediction of bacterial pathogenicity in humans. PLoS One. 2012; 7:e42144.

70. Feltes, de Faria Poloni, Notari, et al. Toxicological effects of the different substances in tobacco smoke on human embryonic development by a systems chemo-biology approach. PLoS One. 2013; 8:e61743.

71. Guo, Che, Shi, et al. Interaction network analysis of differentially expressed genes and screening of cancer marker in the urine of patients with invasive bladder cancer. Int J Clin Exp Med. 2015; 8:3619–3628.

72. Zheng, Yoon, Lam. Breast cancer diagnosis based on feature extraction using a hybrid of K-means and support vector machine algorithms. Exp Syst Appl. 2014; 41(Part 1):1476–1482.

73. Yilmaz, Inan, Uzer. A new data preparation method based on clustering algorithms for diagnosis systems of heart and diabetes diseases. J Med Syst. 2014; 38:48.

74. Sun B-Y, Zhu Z-H, Li, et al. Combined feature selection and cancer prognosis using support vector machine regression. IEEE/ACM Trans Comput Biol Bioinform. 2011; 8:1671–1677.

75. Navale, Bourne. Cloud computing applications for biomedical science: A perspective. PLoS Comput Biol. 2018; 14:e1006144.

76. Schadt, Linderman, Sorenson, et al. Computational solutions to large-scale data management and analysis. Nat Rev Genet. 2010; 11:647–657.

77. Garg, Arora, Gupta. Cloud computing approaches to accelerate drug discovery value chain. Comb Chem High Throughput Screen. 2011; 14:861–871.

78. Ebejer J-P, Fulle, Morris, et al. The emerging role of cloud computing in molecular modelling. J Mol Graph Model. 2013; 44:177–187.

79. Mell, Grance. The NIST definition of cloud computing. NIST Spec Publ. 2011; 800:7.

80. Vaquero, Rodero-Merino, Caceres, et al. A break in the clouds: Towards a cloud definition. SIGCOMM Comput Commun Rev. 2009; 39:50–55.

81. Mateescu, Gentzsch, Ribbens. Hybrid computing—Where HPC meets grid and cloud computing. Future Gener Comp Syst. 2011; 27:440–453.

82. Brown, Dinu. High performance computing methods for the integration and analysis of biomedical data using SAS. Comput Methods Programs Biomed. 2013; 112:553–562.

83. Konagaya. Trends in life science grid: From computing grid to knowledge grid. BMC Bioinformatics. 2006; 7(Suppl 5):S10.

84. Psomopoulos, Mitkas. Bioinformatics algorithm development for Grid environments. J Syst Softw. 2010; 83:1249–1257.

85. Yin, Lan, Tan, et al. Computing platforms for big biological data analytics: Perspectives and challenges. Comput Struct Biotechnol J. 2017; 15:403–411.

86. Vega-Rodríguez, Granado-Criado. Preface to the special issue: Parallel computing in computational biology: A technological point of view. J Comput Biol. 2018; 25:837–840.

87. D'Agostino, Clematis, Quarati, et al. Cloud infrastructures for in silico drug discovery: Economic and practical aspects. Biomed Res Int. 2013; 2013:138012.

88. De Paris, Frantz, de Souza, et al. wFReDoW: A cloud-based web environment to handle molecular docking simulations of a fully flexible receptor model. Biomed Res Int. 2013; 2013:469363.

89. Kaján, Yachdav, Vicedo, et al. Cloud prediction of protein structure and function with PredictProtein for Debian. Biomed Res Int. 2013; 2013:398968.

90. Kang, Guo, Wang. A hierarchical method for molecular docking using cloud computing. Bioorg Med Chem Lett. 2012; 22:6568–6572.

91. Yazar, Gooden GEC, Mackey, et al. Benchmarking undedicated cloud computing providers for analysis of genomic datasets. PLoS One. 2014; 9:e108490.

92. Fusaro, Patil, Gafni, et al. Biomedical cloud computing with Amazon Web Services. PLoS Comput Biol. 2011; 7:e1002147.

93. […], Kim P-G, Yoon, et al. Closha: Bioinformatics workflow system for the analysis of massive sequencing data. BMC Bioinformatics. 2018; 19(Suppl 1):43.

94. Elmasri, Navathe. Fundamentals of database systems. Reading, MA: Addison Wesley, 2010.

95. Brown. Learning Apache Cassandra. Birmingham, UK: Packt Publishing, 2015.

96. Membrey, Plugge, Hawkins. The definitive guide to MongoDB: The NoSQL database for cloud and desktop computing. Berkeley, CA: Apress, 2010.

97. Lee KK-Y, Tang W-C, Choi K-S. Alternatives to relational database: Comparison of NoSQL and XML approaches for clinical data storage. Comput Methods Programs Biomed. 2013; 110:99–109.

98. Schulz, Nelson, Felker, et al. Evaluation of relational and NoSQL database architectures to manage genomic annotations. J Biomed Inform. 2016; 64:288–295.

99. Sun, Pittard, Xu, et al. Omicseq: A web-based search engine for exploring omics datasets. Nucleic Acids Res. 2017; 45:W445–W452.

100. Abai NHZ, Yahaya, Deraman. User requirement analysis in data warehouse design: A review. Proc Technol. 2013; 11:801–806.

101. Yangui, Nabli, Gargouri. ETL based framework for NoSQL warehousing. In: Themistocleous M, Morabito V (Eds.): Information systems. Coimbra, Portugal, September 7–8, Springer International Publishing, 2017. pp. 40–53.

102. Diouf, Boly, Ndiaye. Variety of data in the ETL processes in the cloud: State of the art. In: 2018 IEEE International Conference on Innovative Research and Development, Bangkok, Thailand, May 11–12, IEEE, 2018. pp. 1–5.

103. Dean, Ghemawat. MapReduce: Simplified data processing on large clusters. Commun ACM. 2008; 51:107–113.

104. Isard, Budiu, Yu, et al. Dryad: Distributed data-parallel programs from sequential building blocks. In: Proceedings of the 2nd European Conference on Computer Systems. Lisbon, Portugal, March 21–23, Association for Computing Machinery, 2007. pp. 59–72.

105. Agneeswaran VS. Big data analytics beyond Hadoop: Real-time applications with storm, spark, and more Hadoop alternatives. Upper Saddle River, NJ: FT Press Analytics, 2014.

106. Almars, Gad, Atlam. Applications of AI and IoT in COVID-19 vaccine and its impact on social life. In: Hassanien, Bhatnagar, Snášel, Yasin Shams (Eds.): Medical informatics and bioimaging using artificial intelligence. Studies in Computational Intelligence, Cham: Springer, 2022. p. 1005.

107. Fayyad, Piatetsky-Shapiro, Smyth. From data mining to knowledge discovery in databases. AIMag. 1996; 17:37.

108. Rao, Mitra, Bhatt, et al. The big data system, components, tools, and technologies: A survey. Knowl Inform Syst. 2019; 60:1165–1245.

109. Witten, Frank. Data mining: Practical machine learning tools and techniques with Java implementations. San Francisco: Morgan Kaufmann Publishers, 2011.

110. Mohammadzadeh, Noorkojuri, Pourhoseingholi, et al. Predicting the probability of mortality of gastric cancer patients using decision tree. Ir J Med Sci. 2015; 184:277–284.

111. Esfandiari, Babavalian, Moghadam A-ME, et al. Knowledge discovery in medicine: Current issue and future trend. Exp Syst Appl. 2014; 41:4434–4463.

112. Liao S-H, Chu P-H, Hsiao P-Y. Data mining techniques and applications—A decade review from 2000 to 2011. Exp Syst Appl. 2012; 39:11303–11311.

113. Jia, O'Connor, Shi, et al. VIRS based detection in combination with machine learning for mapping soil pollution. Environ Pollut. 2021; 268(Pt A):115845.

114. de Avila e Silva, Gerhardt GJL, Echeverrigaray. Rules extraction from neural networks applied to the prediction and recognition of prokaryotic promoters. Genet Mol Biol. 2011; 34:353–360.

115. Tellechea-Luzardo, Otero-Muras, Goñi-Moreno, et al. Fast biofoundries: Coping with the challenges of biomanufacturing. Trends Biotechnol. 2022. [Epub ahead of print]; DOI: 10.1016/j.tibtech.2021.12.006.

116. Ngiam, Khor. Big data and machine learning algorithms for health-care delivery. Lancet Oncol. 2019; 20:e262–e273. Erratum in: Lancet Oncol. 2019; 20:293.

117. Ben-Hur, Ong, Sonnenburg, et al. Support vector machines and kernels for computational biology. PLoS Comput Biol. 2008; 4:e1000173.

118. Abe. Support vector machines for pattern classification. London: Springer-Verlag, 2010.

119. Huang, Cai, Pacheco, et al. Applications of support vector machine (SVM) learning in cancer genomics. Cancer Genomics Proteomics. 2018; 15:41–51.

120. Ghannam, Techtmann. Machine learning applications in microbial ecology, human microbiome studies, and environmental monitoring. Comput Struct Biotechnol J. 2021; 19:1092–1107.

121. Zhang, Su, Lu, et al. Application of machine learning approaches for protein-protein interactions prediction. Med Chem. 2017; 13:506–514.

122. […] CH. Artificial neural networks for molecular sequence analysis. Comput Chem. 1997; 21:237–256.

123. de Avila e Silva, Echeverrigaray. Bacterial promoter features description and their application on E. coli in silico prediction and recognition approaches. In: Pérez-Sánchez H (Ed.): Bioinformatics, Croatia: Intech, 2012. pp. 241–260.

124. de Avila e Silva, Forte, Sartor ITS, et al. DNA duplex stability as discriminative characteristic for Escherichia coli σ(54)- and σ(28)-dependent promoter sequences. Biologicals. 2014; 42:22–28.

125. Schonfeld, Vendrow, et al. On the relation of gene essentiality to intron structure: A computational and deep learning approach. Life Sci Alliance. 2021; 4:e202000951.

126. Simon, Bakunowski, Reyes-Vasques, et al. Acoustic traits of bat-pollinated flowers compared to flowers of other pollination syndromes and their echo-based classification using convolutional neural networks. PLoS Comput Biol. 2021; 17:e1009706.

127. Cordova, Portocarrero MNL, Salas, et al. Air quality assessment and pollution forecasting using artificial neural networks in Metropolitan Lima-Peru. Sci Rep. 2021; 11:24232.

128. […], Li, Cui, et al. Deep multimodal learning for lymph node metastasis prediction of primary thyroid cancer. Phys Med Biol. 2022; 67:035008.

129. Figueroa, Song, Sunny, et al. Interpretable deep learning approach for oral cancer classification using guided attention inference network. J Biomed Opt. 2022; 27:015001.

130. Zou, Park, Johnson, et al. Alzheimer's Disease Neuroimaging Initiative. Deep learning improves utility of tau PET in the study of Alzheimer's disease. Alzheimers Dement (Amst). 2021; 13:e12264.

131. Wei, Jiang, Wei, et al. A novel hierarchical clustering algorithm for gene sequences. BMC Bioinformatics. 2012; 13:174.

132. Takumi, Miyamoto. Top-down vs bottom-up methods of linkage for asymmetric agglomerative hierarchical clustering. In: IEEE International Conference on Granular Computing, Hangzhou, China, August 11–13, IEEE, 2012. pp. 459–464.

133. Guo, Yan, Ling, et al. Screening and identification of key biomarkers in lower grade glioma via bioinformatical analysis. Appl Bionics Biomech. 2022; 2022:6959237.

134. Zhao, Qi, Cui, et al. Transcriptomic and physiological analysis identifies a gene network module highly associated with brassinosteroid regulation in hybrid sweetgum tissues differing in the capability of somatic embryogenesis. Hortic Res. 2022; 9:uhab047.

135. Sato, Sato, Shintani, et al. Clinical significance of metabolism-related genes and FAK activity in ovarian high-grade serous carcinoma. BMC Cancer. 2022; 22:59.

136. Pepe, Guarracino, Ballesio, et al. Evaluation of potential sponge effects of SARS genomes in human. Noncoding RNA Res. 2022; 7:48–53.

137. Carcereny, Garcia-Pedemonte, Martínez-Velázquez, et al. Dynamics of SARS-CoV-2 Alpha (B.1.1.7) variant spread: The wastewater surveillance approach. Environ Res. 2022; 208:112720.

138. Lee, Kim, Lee, et al. The application of a deep learning system developed to reduce the time for RT-PCR in COVID-19 detection. Sci Rep. 2022; 12:1234.

139. Huang, Yang, Pan, et al. Correlation between vaccine coverage and the COVID-19 pandemic throughout the world: Based on real-world data. J Med Virol. 2022; 94:2181–2187.

140. El-Shabasy, Nayel, Taher, et al. Three wave changes, new variant strains, and vaccination effect against COVID-19 pandemic. Int J Biol Macromol. 2022; 204:161–168.

141. Khan, Hakak, Deepa, et al. Detecting COVID-19-related fake news using feature extraction. Front Public Health. 2022; 9:788074.

142. Isaakidou, Diomidous. The contribution of informatics to overcoming the Covid-19 fake news outbreak by learning to navigate the infodemic. Stud Health Technol Inform. 2022; 289:456–459.