Abstract
Single-nucleotide polymorphisms (SNPs) are single-base variants that contribute to human biological variation and pathogenesis of many human diseases. Among all SNP types, nonsynonymous single-nucleotide polymorphisms (nsSNPs) can alter many structural, biochemical, and functional features of a protein such as folding characteristics, charge distribution, stability, dynamics, and interactions with other proteins/nucleotides. These modifications in the protein structure can lead nsSNPs to be closely associated with many multifactorial diseases such as cancer, diabetes, and neurodegenerative diseases. Predicting structural and functional effects of nsSNPs with experimental approaches can be time-consuming and costly; hence, computational prediction tools and algorithms are being widely and increasingly utilized in biology and medical research. This expert review examines the in silico tools and algorithms for the prediction of functional or structural effects of SNP variants, in addition to the description of the phenotypic effects of nsSNPs on protein structure, association between pathogenicity of variants, and functional or structural features of disease-associated variants. Finally, case studies investigating the functional and structural effects of nsSNPs on selected protein structures are highlighted. We conclude that creating a consistent workflow with a combination of in silico approaches or tools should be considered to increase the performance, accuracy, and precision of the biological and clinical predictions made in silico.
Introduction
The genomes of two different human beings are almost 99% similar, yet the remaining differences are the cause of observed diseases most of the time (Cooper et al., 1985; Karczewski et al., 2020; Kwok and Chen, 2003). According to deep sequencing analysis, a typical genome differs from the reference genome at 4.1 million to 5.0 million sites (1000 Genomes Project Consortium, 2015). These differences include variations and mutations that alter protein structures (Lek et al., 2016). Analysis of protein-coding variants is important for medicine and biology; thus, clinical and functional interpretation of variants is a critical resource for understanding human diseases (Cassa et al., 2017).
There exist many variation types, including single-nucleotide polymorphisms (SNPs) or single-nucleotide variants (SNVs), insertions and deletions, structural variants, repeat variations, and copy number variations (Alzu'bi et al., 2019). The terms point mutation and SNP are sometimes used interchangeably, which leads to confusion in the field. Mutations can be defined as “any heritable change to the DNA sequence,” while variations can be defined as “the differences from the reference genome or sequence” in modern human genetics (Jackson et al., 2018).
From another point of view, a mutation is referred to as a rarer change in the nucleotide sequence than an SNP, while an SNP is referred to as a variation in the DNA sequence that is observed in a population at a frequency of 1% or more (1000 Genomes Project Consortium, 2010, 2015; Brookes, 1999; Condit et al., 2002; Telenti et al., 2016).
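The frequency-based distinction above can be illustrated with a minimal sketch. The function names and the allele counts are hypothetical; only the 1% threshold comes from the text, and the minor allele frequency is assumed to be estimated from simple allele counts at a biallelic site.

```python
def minor_allele_frequency(ref_count: int, alt_count: int) -> float:
    """Return the frequency of the less common allele at a biallelic site."""
    total = ref_count + alt_count
    if total == 0:
        raise ValueError("no alleles observed")
    return min(ref_count, alt_count) / total


def label_by_frequency(ref_count: int, alt_count: int, threshold: float = 0.01) -> str:
    """Label a site 'SNP' if the minor allele reaches the threshold, else 'rare variant'."""
    maf = minor_allele_frequency(ref_count, alt_count)
    return "SNP" if maf >= threshold else "rare variant"


# Hypothetical allele counts from 2,000 sampled chromosomes.
print(label_by_frequency(1900, 100))  # minor allele observed on 5% of chromosomes
print(label_by_frequency(1998, 2))    # minor allele observed on 0.1% of chromosomes
```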
SNPs are the main cause of the above-mentioned 1% difference among the human population, and most of them contribute to human diversity without any biological effect (Collins et al., 1997; Guo et al., 2019). On the other hand, some SNPs have crucial biological, functional, and structural effects, for example, on gene expression, drug response, and disease susceptibility (Chakravarti, 2001; Zou et al., 2020). In general, SNPs are categorized as synonymous single-nucleotide polymorphisms (sSNPs) and nonsynonymous single-nucleotide polymorphisms (nsSNPs).
In an nsSNP, a single-base substitution alters the encoded amino acid, whereas in an sSNP, the single-base substitution does not change the encoded amino acid. Both nsSNPs and sSNPs can exhibit neutral or negative effects on the phenotype, which would be interpreted as a neutral or a damaging mutation in the protein structure, respectively (Wang and Moult, 2001). Missense point mutations, which like nsSNPs change the encoded amino acid, can occasionally alter general protein characteristics, especially biochemical properties such as stability, interaction, and dynamics (Kucukkal et al., 2015). Disease-causing mutations or nsSNPs can also cause crucial alterations in the physicochemical features of amino acids such as hydrophobicity, charge, and geometry (Chaturvedi and Mahalakshmi, 2013; Lori et al., 2013).
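The distinction between synonymous and nonsynonymous substitutions can be made concrete with a short sketch that translates a codon before and after a single-base change using the standard genetic code. The function name and the labels it returns are illustrative choices, not terminology from the cited works; substitutions that create a stop codon are reported separately as "nonsense".

```python
from itertools import product

# Standard genetic code, codons enumerated in T, C, A, G order at each position.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def classify_substitution(codon: str, position: int, new_base: str):
    """Classify a single-base change in a codon (position is 0, 1, or 2).

    Returns (label, reference_residue, altered_residue); '*' denotes a stop codon.
    """
    codon = codon.upper()
    mutated = codon[:position] + new_base.upper() + codon[position + 1:]
    ref_aa, alt_aa = CODON_TABLE[codon], CODON_TABLE[mutated]
    if ref_aa == alt_aa:
        label = "synonymous"
    elif alt_aa == "*":
        label = "nonsense"
    elif ref_aa == "*":
        label = "stop-loss"
    else:
        label = "missense"
    return label, ref_aa, alt_aa


print(classify_substitution("GAG", 1, "T"))  # GAG -> GTG, Glu -> Val
print(classify_substitution("GAA", 2, "G"))  # GAA -> GAG, Glu -> Glu
```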
Another important concept for prediction of functional and structural effects of variations on protein is private mutation. Private mutation is a rare gene mutation that usually occurs in a single family or a small population (Gao and Keinan, 2014). Rare genetic variants like private mutations are the result of rapid population growth and increase the burden of spontaneous mutations (Gazave et al., 2014).
Functional and molecular studies have been performed to assess the functional and structural effects of private mutations, but difficulties in mapping, increased heterogeneity, and the abundance of mutations have motivated improvements in computational methods and in silico prediction tools (Berger et al., 2016; Kim et al., 2016; Pangallo et al., 2020; Starita et al., 2018). Therefore, functional and structural effect prediction through in silico tools is one of the best strategies to assess the phenotypic effects of private mutations.
This expert review consists of an examination and synthesis of the literature on the functional and structural effects of nsSNPs and pathogenicity of disease-related variants, followed by computational tools and algorithms for predicting these effects of variants (Fig. 1). Finally, we summarized and classified some case studies related to the investigation of nsSNPs on protein structures.

Fig. 1. A graphical representation for the assessment and prediction of functional and structural effects of SNPs using in silico tools. SNP, single-nucleotide polymorphism.
Structural and Functional Effects of nsSNPs on Proteins
nsSNPs exhibit structural impacts on proteins due to the alteration of amino acids with smaller or larger ones that lead to the formation of voids and clashes (Yue et al., 2005), resulting in possible structural and thermodynamic destabilizations in the structure (Stitziel et al., 2003). These disturbances in the buried and core regions of a protein can cause harmful effects on the residue packing level (Vitkup et al., 2003).
In one study, researchers found that substitution of hydrophobic residues with charged amino acids in the protein core cannot be tolerated and can lead to destabilization of protein structures, while mutating small residues to larger ones can cause steric clashes (Yue et al., 2005). In another work, nsSNPs resulted in alterations in the proteins' charge distributions, causing crucial changes in pH dependence and catalysis, especially in enzymes (Stefl et al., 2013).
To understand the genotype-phenotype relationship in terms of the effect of SNPs, computational and experimental approaches are widely used (Chasman and Adams, 2001; Yates and Sternberg, 2013). Computational studies are gaining more popularity as experimental studies are laborious, expensive, and time-consuming (Shen et al., 2006). However, even in the presence of high-quality 3D protein structures, predicting the effects and phenotypes of nsSNPs can be challenging for computational biophysicists and bioinformaticians too (Ittisoponpisan et al., 2019; Kucukkal et al., 2014).
Categorizing SNPs and/or mutations in terms of them being harmless or disease causing is not straightforward by means of protein dynamics (Nussinov and Tsai, 2013). On the other hand, consequences of these mutations are a reflection of the changes in protein dynamics that affect the function as well as the level of alterations in protein motion (Motlagh et al., 2014). Nevertheless, disease-causing or functional variants can exhibit harmful or neutral effects on the structure of a protein (Capriotti et al., 2009) such as protein structure destabilization, gene regulation, and alteration (Barroso et al., 1999), which influence protein charge (Petukh et al., 2015), geometry (Petukh et al., 2015), hydrophobicity (Petukh et al., 2015), stability (Chasman and Adams, 2001), dynamics (Kucukkal et al., 2015), and interprotein/intraprotein interactions (Zhao et al., 2014).
One of the metrics of disease-causing SNPs is the type of amino acid that is substituted. In their studies, Vitkup et al. (2003) and David and Sternberg (2015) indicated that variations from tryptophan and cysteine (Cys) residues have a higher chance of leading to a disease, while variations from arginine and glycine increase the genetic disease tendency by 30%. The reason for these observations lies in the nature of the amino acids. These amino acids have critical roles in various structural events of a protein, such as protein flexibility and formation of disulfide bonds, hydrogen bonds, salt bridges, and the hydrophobic core. Thus, structural integrity and biological functions of a protein are mainly disrupted upon a change in these critical residues.
Among amino acid variations, Cys variation is one of the most disease-causing cases since this amino acid is considered to be both hydrophilic and hydrophobic at the same time (Betts and Russell, 2003). In addition, disulfide bond formation ability and metal binding capacity (mainly Zn2+) of Cys emphasize its importance in protein folding, stability, and function (Pace and Weerapana, 2013, 2014).
Another important phenomenon regarding the effect of nsSNPs is reflected on the differences in the binding free-energy (ΔΔG) of a wild-type and mutant protein structure (Yates and Sternberg, 2013). SKEMPI (Structural Database of Kinetics and Energetics of Mutant Protein Interactions) is a web-based database consisting of binding free energy difference values of 85 wild-type and mutant protein–protein complexes (Moal and Fernández-Recio, 2012). In the case of missing binding energy values of the variants, FoldX can also be used to evaluate free energy of an interaction (Schymkowitz et al., 2005). This technique can also be utilized for a quantitative estimation of SNP effect in terms of protein–protein interactions (Guerois et al., 2002).
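As a worked illustration of the quantity discussed above, the sketch below computes ΔΔG as the difference between the mutant and wild-type binding free energies. The sign convention (ΔΔG = ΔG_mutant − ΔG_wild-type, so that positive values indicate weaker binding), the 1.0 kcal/mol tolerance, and the example energies are assumptions made for illustration; they are not taken from SKEMPI or FoldX.

```python
def delta_delta_g(dg_wild_type: float, dg_mutant: float) -> float:
    """Return ddG = dG(mutant) - dG(wild type), in the same units as the inputs."""
    return dg_mutant - dg_wild_type


def interpret_ddg(ddg: float, tolerance: float = 1.0) -> str:
    """Coarse label for a binding ddG; the tolerance is an illustrative assumption."""
    if ddg > tolerance:
        return "destabilizing"   # binding becomes less favorable
    if ddg < -tolerance:
        return "stabilizing"     # binding becomes more favorable
    return "near-neutral"


# Hypothetical binding free energies in kcal/mol (more negative = tighter binding).
ddg = delta_delta_g(dg_wild_type=-10.0, dg_mutant=-8.5)
print(ddg, interpret_ddg(ddg))
```

Different tools report ΔΔG with opposite sign conventions, so the convention should be checked before values from several sources are compared.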
Structural unity of a protein was also evaluated through measurement of stability changes upon variations (Kucukkal et al., 2015). The free energy of folding quantifies the thermodynamic stability of a protein, which results from the cumulative contributions of many structural parameters, including H-bonds and salt bridges. Variations like nsSNPs generally alter the energy landscape and the number of conformations in both folded and unfolded states of a protein (Bartlett and Radford, 2009).
Besides SNPs, structural variations of base triplets, including deletions, insertions, and duplications, can affect protein structure and function by addition or deletion of amino acids. The term structural variant describes a part of DNA that shows alterations in copy number (deletions, insertions, and duplications), orientation (inversions), or chromosomal location (translocations) (Escaramís et al., 2015). In many human diseases, structural variants exhibit important functional consequences; thus, human disease studies have focused on these variants to gain valuable insights into diseases (Weischenfeldt et al., 2013; Yokoyama and Kasahara, 2020).
Human Variation Databases
Databases are one of the important aspects of bioinformatics for the analysis of variations. They are utilized for collection, curation, organization, and analysis of biological data that are available as online sources (Ganesan et al., 2019; Savas, 2010). Initially, mutations and variations were reported only in the published literature; however, it was realized that online variation databases provide accessibility and reduce the ambiguity and complexity of biological variation data (Higasa et al., 2016; Küntzer et al., 2010).
In particular, an enormous amount of sequencing data has been produced through next-generation sequencing (NGS); thus, many variation databases have been developed to collect and organize biological data from NGS (Brown and Tastan Bishop, 2017). Many human variation databases are available online, and they are summarized in Table 1.
Table 1. Human Variation Databases Accessible Online
COSMIC, Catalog of Somatic Mutations In Cancer; EVA, European Variation Archive; HGMD, Human Gene Mutation Database; HGVD, Human Genetic Variation Database; LOVD, Leiden Open-source Variation Database; NHGRI-EBI GWAS, National Human Genome Research Institute-European Bioinformatics Institute Genome-Wide Association Studies; OMIM, Online Mendelian Inheritance in Man; SNV, single-nucleotide variant; TCGA, The Cancer Genome Atlas.
Among all human variation databases, ClinVar (Landrum et al., 2018) is the best-known database that has been created and curated by the National Center for Biotechnology Information (NCBI). ClinVar is a freely accessible human variation database that reports the clinical significance of variations and mainly focuses on the association between disease and genotype. Another important human variation database is the Catalog Of Somatic Mutations In Cancer (COSMIC) (Tate et al., 2019). COSMIC is a freely available online database that consists of somatic mutations and their effects on human cancer. dbSNP (Sherry et al., 2001) is also an NCBI-curated, free online database that serves as the main collection of all known short variations.
In addition to these, there are also specialized databases for specific areas. dbNSFP (Liu et al., 2016) is a database that contains human nonsynonymous single-nucleotide variants (nsSNVs) and their functional predictions and annotations. dbVar (Lappalainen et al., 2013) is a specialized NCBI-curated database that consists of structural variations, including deletions and insertions. dbGaP (Mailman et al., 2007) is another NCBI-curated database that collects the associations between genotype and phenotype in diseases.
The European Bioinformatics Institute (EBI) also had various variation databases in the past, including the Database of Genomic Variants archive (DGVa) (Lappalainen et al., 2013) and the European Genome-phenome Archive (EGA) (Lappalainen et al., 2015). Nowadays, these databases have been integrated into one database known as the European Variation Archive (EVA) (Cook et al., 2016). EVA is an open-access human variation archive that also collaborates with different variation databases and platforms such as dbSNP, dbVar, and Ensembl. EBI, along with the National Human Genome Research Institute (NHGRI), has a manually curated catalog of published genome-wide association studies known as the NHGRI-EBI GWAS Catalog (Buniello et al., 2019).
HGMD (Human Gene Mutation Database) (Stenson et al., 2020) is an online collection of germline mutations in nuclear genes that are closely associated with human inherited diseases in published cases. For specific populations, HGVD (Human Genetic Variation Database) (Higasa et al., 2016) is an online Japanese human variation database and a collection of associations between transcriptomics and variations. Some databases or platforms, such as Online Mendelian Inheritance in Man (OMIM) (Scott et al., 2014) and Ensembl, contain human variations while also serving other purposes. OMIM is an online database and platform that catalogs associations between variations and distinct phenotypes. Ensembl (Hubbard et al., 2002) is an online platform and database that stores human variation data and incorporates various human variation databases, including dbSNP, ClinVar, COSMIC, and OMIM.
Another important platform and database is The Cancer Genome Atlas (TCGA) (McLendon et al., 2008). TCGA mainly focuses on variations in different cancer types and is a cancer genomics platform that contains genomic, epigenomic, transcriptomic, and proteomic data. Leiden Open-source Variation Database (LOVD) (Fokkema et al., 2011) is a web-based platform and locus-specific variation database, which contains gene variation sequence data from patients. The 1000 Genomes Project (1000 Genomes Project Consortium, 2015) is also an online platform that consists of human variation datasets derived from whole-genome sequencing methods. HuVarBase (Ganesan et al., 2019) is an online comprehensive human variant database that integrates gene- and protein-level information with sequence and structure properties of the variations.
Tools for Investigating Functional Effects of nsSNPs
Four main bioinformatics methodologies have been used to evaluate and understand the functional effects of nsSNPs: sequence homology-based, supervised learning-based, sequence-structure-based, and consensus-based tools (Table 2). However, each technique has some limitations in terms of defining the effect of the variants on protein dynamics. Hence, molecular dynamics (MD) simulations, which enable a much more detailed structural investigation, have gained attention for evaluating the effects of these changes in terms of motion, protein flexibility, and secondary structure elements (Marcolino et al., 2016).
Table 2. In Silico Tools for the Prediction and Evaluation of Functional Effects of Nonsynonymous Single Nucleotide Polymorphisms on Proteins
GO term, gene ontology term; PDB, Protein Data Bank; rs ID, reference SNP cluster ID; VAPOR, Variant Analysis Portal; VCF, variant call format.
Most of the time, structural analysis of disease-causing variants reveals alterations in salt bridge formation and the hydrogen bonding network (Petukh et al., 2015). Thermodynamic analysis from computational and experimental studies has also demonstrated that nsSNPs may destabilize protein structure, function, and interactions (Brock et al., 2007). Decreased stability in mutant proteins can be analyzed by searching energy alteration databases to predict and detect probable effects of disease-related mutations.
Sequence-structure-based tools
Sequence-structure-based tools employ sequence-based properties as well as structural information to determine functional pathogenicities of nsSNPs. Sequence features such as evolutionary conservation score and homologous sequence score are combined mainly with structural properties, secondary structure information, accessible surface area of mutated residue, and protein stability (Kulshreshtha et al., 2016). There are many sequence-structure-based tools such as PolyPhen-2 (Adzhubei et al., 2010), MuD (Wainreb et al., 2010), FATHMM (Shihab et al., 2013), SNPs3D (Yue et al., 2006), CADD (Rentzsch et al., 2019), and SNPeffect (De Baets et al., 2012).
PolyPhen-2 (Adzhubei et al., 2010) was developed to classify disease-related variations from three structural features and eight sequence-based predictive features using a naive Bayes classifier. MuD (Wainreb et al., 2010) is a web-based sequence-structure prediction tool that can be used for separating functionally neutral and non-neutral variants and has a total of 14 novel or traditional sequence-structure features such as secondary structure assignment, number of sequences in the alignment, stability prediction change, solvent accessibility, and oligomerization interface.
FATHMM (Shihab et al., 2013) is a hidden Markov model-based prediction software and server that utilizes parameters, including conservation score (sequence) along with the conserved protein domain families (structure), to measure the pathogenicity weight score of SNPs. SNPs3D (Yue et al., 2006) modules were developed for the prediction of functional effects of SNPs and utilize protein folding state changes upon amino acid variations and amino acid sequence conservation scores.
CADD (Rentzsch et al., 2019) is another prediction tool that creates a scoring algorithm from evolutionary constraints, gene model annotations, epigenetic measurements, surrounding sequence context, and functional predictions. SNPeffect (De Baets et al., 2012) database consists of sequence- and structure-based prediction tools that use amyloid prediction (WALTZ) (Maurer-Stroh et al., 2010), aggregation prediction (TANGO) (Fernandez-Escamilla et al., 2004), chaperone-binding prediction (LIMBO) (Van Durme et al., 2009), and protein stability analysis (FoldX) (Schymkowitz et al., 2005).
Sequence homology-based tools
The underlying idea of sequence homology-based tools is the usage of sequence homology in terms of sequence conservation scores to define the deleterious effects of nsSNPs or mutations. This phenomenon depends on the concept that highly conserved parts of genomes are more crucial in protein function (Reva et al., 2011). SIFT (Sim et al., 2012; Vaser et al., 2016), PROVEAN (Choi and Chan, 2015), Mutation Assessor (Reva et al., 2007), and PANTHER (Mi et al., 2019; Tang and Thomas, 2016) are among the widely used sequence homology-based tools.
SIFT algorithm (Sim et al., 2012) specifically defines the relationship between variations and protein function by predicting the effect of SNPs through sequence-homology methodology. In this algorithm, sequence from a query protein is searched for homologous sequences (Vaser et al., 2016). Conservation scores are calculated and normalized according to composition of amino acids, while variations are predicted for their functional effects on protein pathogenicity. SIFT and PROVEAN have similar principles based on computing conservation scores of homologous amino acid sequences. PROVEAN (Choi and Chan, 2015) can also be used for the prediction of in-frame insertion and deletion effects, besides variation.
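The idea of scoring a substitution by how well a position is conserved among homologs can be sketched as follows. This is not the SIFT or PROVEAN algorithm: it is a simplified stand-in that scores one alignment column with a normalized Shannon entropy and reports how often the substituted residue is seen at that position. The function names and the toy alignment columns are hypothetical.

```python
import math
from collections import Counter


def column_conservation(column: str) -> float:
    """Return 1 - (Shannon entropy / log2(20)) for one alignment column.

    Gaps ('-') are ignored. 1.0 means a single residue type occupies the column.
    """
    residues = [r for r in column.upper() if r != "-"]
    if not residues:
        raise ValueError("column contains only gaps")
    counts = Counter(residues)
    n = len(residues)
    entropy = -sum((c / n) * math.log2(c / n) for c in counts.values())
    return 1.0 - entropy / math.log2(20)


def substitution_frequency(column: str, new_residue: str) -> float:
    """Fraction of homologs that already carry the substituted residue at this column."""
    residues = [r for r in column.upper() if r != "-"]
    if not residues:
        raise ValueError("column contains only gaps")
    return residues.count(new_residue.upper()) / len(residues)


# Toy columns taken from a hypothetical multiple sequence alignment.
conserved_column = "CCCCCC"
variable_column = "LLLLIV"
print(column_conservation(conserved_column), substitution_frequency(conserved_column, "S"))
print(column_conservation(variable_column), substitution_frequency(variable_column, "I"))
```

Under this simplification, a substitution at a highly conserved column that introduces a residue never seen among homologs would be flagged as more likely to affect function.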
In PANTHER (Mi et al., 2019; Tang and Thomas, 2016), evolutionary preservation metric is used for the prediction of deleterious effects instead of evolutionary conservation score. Evolutionary preservation metric is based on the manifestation of negatively selected variants in avoiding evolutionary change at a particular region of a protein (Tang and Thomas, 2016). Mutation Assessor (Reva et al., 2007) is another tool in this category that utilizes information-based assessment of evolutionary conservation motifs in multiple sequence alignments instead of conservation score measurements.
Supervised learning-based tools
Supervised learning methodology or machine learning techniques, such as neural networks (NNs), random forests (RF), and support vector machines (SVM), can be used for the prediction of functional effects of SNPs (Zhao et al., 2014). These methods are very practical as they can be employed in the analysis of large datasets. Supervised learning prediction methods generally utilize training datasets that consist of known effects to train the algorithm before prediction (Mishra et al., 2019).
In training datasets, labels associated with the input data are created, and these labels are used to identify the predictive patterns present in the data (Camacho et al., 2018). Because the models are built from the training datasets, the results are directly related to and dependent on these datasets. There are many supervised learning-based tools such as SNAP (Bromberg and Rost, 2007), PhD-SNP (Capriotti et al., 2006), SuSPect (Yates et al., 2014), MutPred2 (Pejaver et al., 2017), ParePro (Tian et al., 2007), EFIN (Zeng et al., 2014), SNPs&GO (Calabrese et al., 2009), PON-P2 (Niroula et al., 2015), REVEL (Ioannidis et al., 2016), ClinPred (Alirezaie et al., 2018), and CRAVAT (Masica et al., 2017).
NNs are a combination of algorithms inspired by the human neural system that aim to recognize specific motifs. In NNs, units or neurons are interconnected with each other, and each neuron transmits information to the next connected one (Chen and Siu, 2020; Nicholls et al., 2020). Nodes, neurons, or units in NNs receive many input signals and produce a response by applying a nonlinear activation function to the weighted sum of the inputs. Each neuron or unit then conveys its output signal to the next connected neuron or unit (Lo et al., 2018). Units can be organized into different levels, forming multilayer network structures. This organization of nonlinear units allows NNs to learn from complex input data (Baskin et al., 2016).
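The weighted-sum-plus-activation behavior of a single unit, and the stacking of units into layers, can be sketched in a few lines. The sigmoid activation, the layer layout, and all weights and input values are arbitrary illustrative assumptions; they do not correspond to any of the tools discussed here.

```python
import math


def sigmoid(z: float) -> float:
    """Nonlinear activation that squashes any real value into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def neuron(inputs, weights, bias: float) -> float:
    """Output of one unit: activation applied to the weighted sum of its inputs."""
    z = sum(x * w for x, w in zip(inputs, weights)) + bias
    return sigmoid(z)


def forward(inputs, layers):
    """Propagate inputs through layers; each layer is a list of (weights, bias) pairs."""
    activations = list(inputs)
    for layer in layers:
        activations = [neuron(activations, weights, bias) for weights, bias in layer]
    return activations


# Hypothetical network: 3 input features -> 2 hidden units -> 1 output unit.
hidden_layer = [([0.5, -1.0, 0.25], 0.1), ([-0.3, 0.8, 0.6], -0.2)]
output_layer = [([1.2, -0.7], 0.05)]
features = [0.9, 0.1, 0.4]  # for example, scaled conservation or accessibility values
print(forward(features, [hidden_layer, output_layer]))
```

In a trained predictor the weights and biases would be fitted to labeled variants rather than set by hand.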
NNs have wide applications, including recognizing handwritten numbers (Cohen et al., 2017), quantum chemistry (Balabin and Lomakina, 2009), 3D object reconstruction (Choy et al., 2016), and medical diagnosis (Lyons et al., 2016). Among supervised learning-based tools, SNAP (Bromberg and Rost, 2007) and MutPred2 (Pejaver et al., 2017) are NN-derived tools. SNAP (Bromberg and Rost, 2007) can predict functional consequences of SNPs through a set of NN algorithms by integrating information from residue conservation score, protein structure elements, and other significant parameters. On the other hand, MutPred2 (Pejaver et al., 2017) combines genetic and molecular information parameters for the prediction of functional pathogenicity of SNPs.
Another supervised learning method that is used for functional effect prediction of nsSNPs on proteins is RF or random decision forests. RF is a nonparametric machine learning methodology that combines many decision trees and is related to the concept of nearest neighbors, enabling efficient data analysis (Breiman, 2019). EFIN (Zeng et al., 2014), PON-P2 (Niroula et al., 2015), REVEL (Ioannidis et al., 2016), ClinPred (Alirezaie et al., 2018), Rhapsody (Ponzoni et al., 2020b), and CRAVAT (Masica et al., 2017) are RF-based supervised learning tools for pathogenicity prediction of nsSNPs. Among these tools, EFIN (Zeng et al., 2014) and PON-P2 (Niroula et al., 2015) both use evolutionary conservation scores.
The EFIN algorithm is designed by covering many homologous protein sequence clusters according to evolutionary gap, whereas PON-P2 utilizes biochemical and physical features of amino acids and gene ontology (GO) terms alongside the conservation score (Niroula et al., 2015; Zeng et al., 2014). ClinPred (Alirezaie et al., 2018) uses not only RF but also a gradient boosting model, integrating scores from individual prediction tools with allele frequencies of SNPs and mutations in a population from the gnomAD database (Karczewski et al., 2020).
CRAVAT (Masica et al., 2017) is a combination of two methodologies, CHASM (Cancer-specific High-throughput Annotation of Somatic Mutations) (Carter et al., 2009) and VEST (Variant Effect Scoring Tool) (Carter et al., 2013). The CHASM algorithm utilizes an RF classifier trained on cancer driver variants and some passenger mutations, whereas the RF classifier of VEST is trained on disease-related germline mutations. Rhapsody (Ponzoni et al., 2020b) is a new web-based RF classifier algorithm that predicts and evaluates pathogenicity from sequence conservation scores and structural properties.
SVM is a supervised machine learning methodology used for the arrangement and analysis of datasets. Data points in SVM datasets are separated by an imaginary boundary, a hyperplane. Classification and clustering are performed by optimizing this hyperplane in particular sides to obtain the highest margin between data points (Ozer et al., 2020). PhD-SNP (Capriotti et al., 2006), SuSPect (Yates et al., 2014), and SNPs&GO (Calabrese et al., 2009) are SVM-based supervised learning tools.
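The role of the hyperplane can be illustrated with a linear decision function. The sketch below assumes a linear SVM that has already been trained, with a hypothetical weight vector w and offset b scaled so that the closest points satisfy |w·x + b| = 1; under that scaling the margin width is 2/||w||. Training itself (finding w and b) is not shown.

```python
import math


def decision_value(x, w, b: float) -> float:
    """Signed score w.x + b; its sign tells which side of the hyperplane x lies on."""
    return sum(xi * wi for xi, wi in zip(x, w)) + b


def classify(x, w, b: float) -> int:
    """Return +1 or -1 depending on the side of the separating hyperplane."""
    return 1 if decision_value(x, w, b) >= 0 else -1


def margin_width(w) -> float:
    """Width of the margin, 2 / ||w||, assuming the canonical scaling described above."""
    return 2.0 / math.sqrt(sum(wi * wi for wi in w))


# Hypothetical trained parameters and two feature vectors for two variants.
w, b = [3.0, 4.0], -5.0
print(classify([2.0, 1.0], w, b))   # 3*2 + 4*1 - 5 = 5, positive side
print(classify([0.5, 0.5], w, b))   # 1.5 + 2 - 5 = -1.5, negative side
print(margin_width(w))              # 2 / 5
```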
PhD-SNP (Capriotti et al., 2006) can predict the new phenotype of disease-related SNPs starting from the protein sequence. SuSPect (Yates et al., 2014) is a prediction tool that consists of a trained SVM integrating sequence and structure properties of the protein to determine disease-associated variants. SNPs&GO (Calabrese et al., 2009) uses GO terms and protein sequence for prediction. ParePro (Tian et al., 2007) integrates evolutionary features with amino acid properties to identify the differences between the wild-type and mutant residues.
Consensus-based tools
Consensus-based prediction tools or consensus classifier tools are generally a combination of different tools that analyze homologous sequences through multiple sequence alignments. After analysis, consensus sequences are created and compared with actual protein sequences to make predictions. The most common consensus-based prediction tools are Variant Analysis Portal (VAPOR) (Brown and Tastan Bishop, 2018), Meta-SNP (Capriotti et al., 2013) and PredictSNP (Bendl et al., 2014).
VAPOR (Brown and Tastan Bishop, 2018) is actually integrated in the HUMA webserver with eight different tools, including PROVEAN, PolyPhen-2, PhD-SNP, PANTHER-PSEP, and FATHMM. Meta-SNP (Capriotti et al., 2013) is basically a meta-predictor consensus-based prediction tool that is developed for gathering information of disease-related nsSNPs from four different tools, PANTHER, PhD-SNP, SIFT, and SNAP. PredictSNP (Bendl et al., 2014) is also another meta-predictor consensus-based prediction tool that utilizes trained data from eight prediction tools and integrates the results into consensus classifier scores.
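The general idea of combining several predictors can be sketched as an unweighted majority vote. This is only an illustration of the consensus concept: the tools above use their own trained or weighted integration schemes, and the function name, the "uncertain" label for ties, and the example calls are hypothetical.

```python
from collections import Counter


def consensus_call(predictions: dict) -> tuple:
    """Combine per-tool labels by majority vote.

    Returns (label, agreement), where agreement is the fraction of tools that
    voted for the most common label; ties are reported as 'uncertain'.
    """
    if not predictions:
        raise ValueError("no predictions supplied")
    ranked = Counter(predictions.values()).most_common()
    top_label, top_count = ranked[0]
    agreement = top_count / len(predictions)
    if len(ranked) > 1 and ranked[1][1] == top_count:
        return "uncertain", agreement
    return top_label, agreement


# Hypothetical calls for one variant from four tools named in this section.
calls = {
    "PANTHER": "deleterious",
    "PhD-SNP": "deleterious",
    "SIFT": "neutral",
    "SNAP": "deleterious",
}
print(consensus_call(calls))
```

Reporting the agreement fraction alongside the label makes it easier to see when the individual tools disagree.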
Tools for Investigating Stability and Structural Effects of nsSNPs
Tools that predict the structural impacts of nsSNPs on proteins are an exclusive area of structural bioinformatics. Due to nsSNPs, internal energy of a protein changes, leading to alterations in the protein structure. Free energy difference of a wild-type and a mutated form of a protein is a crucial parameter for protein stability (Kulshreshtha et al., 2016; Yue et al., 2005). Furthermore, long-range order, hydrophobicity of a residue, contact map matrix, and stabilization center of residues are exclusive parameters to determine the structural effects of SNPs. To this end, many tools have been developed for the prediction of nsSNPs' stabilities and structural effects on proteins (Table 3).
Table 3. In Silico Tools for the Prediction and Evaluation of Stability and Structural Effects of Nonsynonymous Single Nucleotide Polymorphisms on Proteins
mCSM, mutation Cutoff Scanning Matrix; SDM, Site-directed mutator.
ENCoM (Frappier et al., 2015) is a structural effect predictor web-server that can make predictions on protein dynamics features and thermostability through coarse-grained normal mode analysis. Furthermore, DynaMut (Rodrigues et al., 2018) is another web-based server that uses normal mode approach. DynaMut algorithm combines normal mode analysis with graph-based signature and utilizes them as a consensus predictor for protein stability.
MuPro (Cheng et al., 2006) is an SVM-based machine learning algorithm that predicts the free energy change (ΔΔG) between a wild-type and a mutated form of a protein. I-Mutant2.0 (Capriotti et al., 2005) is also a web-based SVM tool for direct prediction of protein stability by ΔΔG values. Eris (Yin et al., 2007) also uses ΔΔG analysis for structural effect prediction of SNPs through side-chain packing and backbone relaxation algorithms.
CUPSAT (Cologne University Protein Stability Analysis Tool) (Parthiban et al., 2006) is a web-based server that predicts ΔΔG differences through structural condition-specific atom and torsion angle potentials. Another tool in this category is PoPMuSiC (Prediction of Protein Mutant Stability Changes) (Dehouck et al., 2011). This tool mainly focuses on stability of a mutant protein with the help of sequence-based approaches.
Another approach for the prediction of structural effects of SNPs is based on residue interaction networks (RINs). RINs are graph-based representations of protein structures, where nodes represent amino acids and edges represent physicochemical interactions between them. Interactions among residues are important for internal folding energy and hence protein stability; therefore, mutant effects on protein structures can alternatively be analyzed through RINs (Cheng et al., 2008). NeEMO (Giollo et al., 2014) is a web-based tool for the evaluation of stability changes and structural effects using RINs. Another tool using graph-based signatures for structural effect predictions is mutation Cutoff Scanning Matrix (mCSM) (Pires et al., 2014).
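A residue interaction network can be sketched as an adjacency map built from residue coordinates. The example below is a simplification: it uses a plain distance cutoff between one representative atom per residue as a proxy for physicochemical interactions, whereas RIN tools typically distinguish interaction types. The 8.0 Å cutoff, the function names, and the toy coordinates are assumptions for illustration.

```python
import math
from itertools import combinations


def build_rin(coords: dict, cutoff: float = 8.0) -> dict:
    """Build an undirected residue interaction network.

    coords maps a residue label to an (x, y, z) position, e.g. of its C-alpha atom.
    Two residues are connected when their distance is at most `cutoff` angstroms.
    """
    network = {residue: set() for residue in coords}
    for a, b in combinations(coords, 2):
        if math.dist(coords[a], coords[b]) <= cutoff:
            network[a].add(b)
            network[b].add(a)
    return network


def degree(network: dict, residue: str) -> int:
    """Number of residues in contact with the given residue."""
    return len(network[residue])


# Toy coordinates (angstroms) for four residues of a hypothetical protein.
toy_coords = {
    "ALA1": (0.0, 0.0, 0.0),
    "LEU2": (3.8, 0.0, 0.0),
    "GLY3": (7.6, 0.0, 0.0),
    "LYS4": (30.0, 0.0, 0.0),
}
rin = build_rin(toy_coords)
print({residue: sorted(neighbors) for residue, neighbors in rin.items()})
print(degree(rin, "LEU2"))
```

Comparing the contacts of a residue in wild-type and mutant structures is one simple way such a network could be used to flag positions whose substitution perturbs many interactions.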
Pmut (López-Ferrando et al., 2017) is an open-source web portal that has several predictor approaches such as protein domain families, amino acid conservation scores, protein interactome information, and physicochemical features of the protein. Protherm (Gromiha et al., 1999) utilizes differences in thermodynamic parameters between wild-type and mutant proteins. Auto-Mute 2.0 (Masso and Vaisman, 2014) is a stand-alone software package for predicting structural effects of nsSNPs using structure-based features with trained statistical learning models. Site-directed mutator (Pandurangan et al., 2017) is a knowledge-based approach that predicts the structural stability of a mutant protein through a statistical potential energy function.
Case Studies Investigating the Functional and Structural Effects of nsSNPs
To date, many studies have been conducted (Supplementary Table 1) on the prediction of functional and structural effects of disease-related nsSNPs on proteins. Most of these studies focused on variants of one or more genes belonging to several genetic disorders.
While some studies aimed to investigate variants in multiple disorders and a single gene (Arshad et al., 2018; Doss et al., 2012; Islam et al., 2019; Porto et al., 2015; Shen et al., 2006) or multiple genes and a single disorder (Masoodi et al., 2013; Pandey et al., 2019), others aimed to analyze variants in a single gene and disorder (Abdul Samad et al., 2016; Chitrala and Yeguvapalli, 2014; Kandakatla et al., 2014; Khan et al., 2013; Kumar et al., 2013; Nagarajan et al., 2020; Naveed et al., 2016, 2017; Owji et al., 2020; Ponzoni et al., 2020a; Sang et al., 2017; Yadegari and Majidzadeh, 2019). Interpretations from these studies are generally related to individual genes, although some general consequences can also be derived from them. In this section, recent case studies related to the investigation of structural and functional effects of missense variations on proteins are summarized.
Predicting the effect of nsSNPs is critical for the determination of genetic characterization of a disorder, discovery of molecular therapeutic targets, and understanding evolutionary susceptibility to disease(s). Furthermore, comparing the predictions and experimental findings in terms of functional or structural effects provides valuable insights on pathogenicity and genetic basis of a disease.
Alterations in thermal activity, thermodynamic stability, and structural dynamics must be confirmed with experimental evaluations of variants, while traces of these effects should be detected through computational prediction methods. In an experimental study, Lori et al. (2013) studied Pim-1 kinases and the structural effects of natural variants on these proteins' stabilities and activities. They expressed and purified recombinant soluble mutant proteins and characterized the thermal and thermodynamic stability together with enzyme activity. As expected, their results indicated that mutant proteins display a significant decrease in thermodynamic and thermal stability and activation energy for kinase activity.
For the evaluation of the effects of nsSNPs, structural parameters can also be combined with sequence and evolutionary information, and this combination can be a valuable methodology for characterizing novel variants. One recent study on a Cys loop receptor gene investigated pathogenic variants of the GABRA2 gene using a combination of structure, sequence, and evolutionary information and compared the variants to the positions of formerly reported pathogenic variants in other Cys loop receptors (Sanchis-Juan et al., 2020). This study revealed that one of seven variants in the GABRA2 gene results in a decreased score in structural, evolutionary, and sequence parameters.
Structural analysis of the variants can also be utilized as a diagnostic classifier when it is associated with genetic information. There is an integrative machine learning approach that produces an in silico model of CACNA1F gene to separate disease-related SNPs from benign variations (Sallah et al., 2020). This approach specifically contains sequence and homology modeling data along with structural parameters of amino acids such as charge, hydrophobicity, position, and size.
Comparison of the predictions of nsSNP effects using in silico tools and clinical methodology is another option to characterize disease-related variants. Cohort studies with in silico methodologies provide strong insights on disease-related variants. In their recent study, Cheng et al. (2020) investigated a large group of TAF1/MRXS33 intellectual disability syndrome cases by combination of clinical and in silico approaches. They performed computational analysis with modeling approaches to identify variants' pathogenicity scores and compared them with clinical phenotypes.
Clinically, multifactorial diseases such as cancer, neurodegenerative diseases, or neurodevelopmental diseases are difficult to investigate since many genes are involved and variations in those genes result in a rather complex picture. Therefore, computational tools are useful for supporting the results of clinical methodology. Post et al. (2020) investigated the PTEN gene and its variants in Autism Spectrum Disorder. They created a deep phenotypic profiling approach to evaluate the effects of missense variants. This model resulted in a strong validation for the effects of variants in such a diverse multifactorial disorder.
An example study on the experimental approaches for the determination of functional effects of nsSNPs was carried out by Marín-Martín et al. (2014). They investigated the effects of nsSNPs that can change ATP-binding cassette transporter (ABCA1) gene expression in Tangier disease and the allelic disorder familial hypoalphalipoproteinemia. Their results showed that most of the nsSNPs were correctly predicted by the MutPred and PolyPhen2 tools and that the predictions correlated with experimental studies.
In their recent study on the comparison of gene expression and functional effect prediction of nsSNPs, Russell et al. (2020) assessed Na+-taurocholate co-transporting polypeptide (NTCP, SLC10A1) gene, uptake, transport, and cellular localization of substrate taurocholic acid in patients with mutation. They also compared computational scores of in silico prediction tools with observed in vitro functional effects of nsSNPs to assess the efficiency of seven different algorithms. Decreased NTCP gene expression and reduced substrate uptake were observed in some rare variants. Interestingly, comparison of computational scores of in silico prediction tools with observed in vitro functional effects indicated that in silico prediction tools are not as powerful as experimental in vitro studies.
Reducing ambiguities in computational prediction tools and improving the consensus among in silico algorithms are another aspect of case studies for the prediction of structural and functional effects of nsSNPs. Proper variant classification and determination of variant pathogenicity depend on the reliability, accuracy, and precision of a prediction algorithm. Especially in machine learning-based prediction tools, the quality of training datasets is also important for reproducibility, reliability, and improved consensus prediction scores of variants.
In the literature, there exist three recent studies that compare different in silico prediction tools and algorithms. First of all, Orioli and Vihinen (2019) have computationally assessed the performance of 22 variant pathogenicity predictor tools along with 7 subcellular localization predictors on membrane proteins. PON-P2 (Niroula et al., 2015) was demonstrated to have the best performance followed by REVEL (Ioannidis et al., 2016) and VEST3 (Carter et al., 2013). They also concluded that in silico predictors are more successful in prediction of multipass proteins than single-pass proteins.
Second, Accetturo et al. (2020) evaluated the performance of three in silico meta-predictor algorithms: VEST3 (Carter et al., 2013), REVEL (Ioannidis et al., 2016), and ClinPred (Alirezaie et al., 2018) in NF1 gene variants from ClinVar (Chitipiralla et al., 2015). Among all three meta-predictors, there was no significant difference in the scores of variants in “benign,” “likely benign,” “likely pathogenic,” and “pathogenic” categories of ClinVar. Finally, Gyulkhandanyan et al. (2020) also assessed the performance and reliability of 22 in silico pathogenicity prediction tools and algorithms in missense variants. Even though some conflicting results were found from some variants in this study, they concluded that a combination of several tools can be used to define potential effects of variants.
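Benchmarking studies of this kind rest on standard classification metrics computed against reference labels. The following sketch shows how accuracy, precision, sensitivity, and the Matthews correlation coefficient (MCC) can be derived from a confusion matrix. The label vectors are invented for illustration and do not reproduce data from any of the cited studies; the function names are likewise assumptions.

```python
import math


def confusion_counts(truth, predicted):
    """Count TP, FP, TN, FN, where True means 'pathogenic'."""
    pairs = list(zip(truth, predicted))
    tp = sum(1 for t, p in pairs if t and p)
    fp = sum(1 for t, p in pairs if not t and p)
    tn = sum(1 for t, p in pairs if not t and not p)
    fn = sum(1 for t, p in pairs if t and not p)
    return tp, fp, tn, fn


def performance(truth, predicted):
    """Return common benchmark metrics for a binary pathogenicity predictor."""
    tp, fp, tn, fn = confusion_counts(truth, predicted)
    mcc_denominator = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return {
        "accuracy": (tp + tn) / (tp + fp + tn + fn),
        "precision": tp / (tp + fp) if tp + fp else 0.0,
        "sensitivity": tp / (tp + fn) if tp + fn else 0.0,
        "mcc": (tp * tn - fp * fn) / mcc_denominator if mcc_denominator else 0.0,
    }


# Hypothetical reference labels and predictor calls for eight variants.
reference = [True, True, True, False, False, False, True, False]
calls = [True, True, False, False, False, True, True, False]
metrics = performance(reference, calls)
```

MCC is included because it remains informative when benign and pathogenic variants are unevenly represented, a common situation in variant datasets.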
Case studies investigating effects of nsSNPs by using MD simulations
MD simulations are an important methodology for assessing protein dynamic motions by predicting how atoms of a protein or a biomolecular system move over time. With the development of recent computational methodologies, MD simulations have been used in many studies that investigate the stability and specificity of a protein (Sneha and Priya Doss, 2016).
MD simulations are particularly important for investigating stability and structural effects of nsSNPs on proteins. nsSNPs and missense mutations seem to alter protein dynamics as global perturbations (Haliloglu and Bahar, 2015). MD simulations have also been used for investigating the structural properties of nsSNPs in protein interaction interfaces (Kamburov et al., 2015). Variations are also enriched at dynamically and functionally important regions such as cofactor binding region, DNA-binding region, and hinge region; thus, MD simulations have been useful in detecting the effect of the variations (Stehr et al., 2011).
In the literature, there exist many MD studies (as shown in Supplementary Table 1) for the identification of nsSNPs' effects. Some of these studies mainly focus on the MD simulation analysis methods for phenotypic consequences of nsSNPs, such as root mean square deviation (RMSD), root mean square fluctuation (RMSF), hydrogen-bonds (H-bonds), and solvent-accessible surface area (SASA) analysis. Variations in the binding region of a protein can inhibit protein–protein, protein–ligand, or protein-DNA interactions; therefore, such deleterious variations can alter protein stability and dynamics.
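Two of the trajectory metrics named above, RMSD and RMSF, can be illustrated with a short sketch. It assumes that all frames have already been superposed onto the reference structure, so no fitting step is performed, and it uses invented two-atom coordinates. Dedicated trajectory analysis packages would normally be used in practice; the function names here are hypothetical.

```python
import math


def rmsd(frame, reference):
    """Root mean square deviation of one frame from a reference structure."""
    squared = sum(math.dist(a, b) ** 2 for a, b in zip(frame, reference))
    return math.sqrt(squared / len(reference))


def rmsf(trajectory):
    """Per-atom root mean square fluctuation about the time-averaged position."""
    n_frames = len(trajectory)
    fluctuations = []
    for atom in range(len(trajectory[0])):
        mean = tuple(
            sum(frame[atom][axis] for frame in trajectory) / n_frames
            for axis in range(3)
        )
        mean_square = sum(
            math.dist(frame[atom], mean) ** 2 for frame in trajectory
        ) / n_frames
        fluctuations.append(math.sqrt(mean_square))
    return fluctuations


# Hypothetical pre-aligned trajectory: two atoms, two frames.
reference_structure = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]
trajectory = [
    [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)],
    [(0.0, 0.0, 0.0), (3.0, 0.0, 0.0)],
]
last_frame_rmsd = rmsd(trajectory[-1], reference_structure)
per_atom_rmsf = rmsf(trajectory)
```

RMSD summarizes how far a whole structure drifts from its starting conformation, whereas RMSF localizes flexibility to individual residues, which is why wild-type and mutant simulations are usually compared on both.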
In their study, Doss and NagaSundaram (2012) aimed to investigate the pathogenic variations in A-purinic endonuclease-1 (APE1) gene that disturb the binding surface and protein-DNA interactions. A practical methodology, which overlaps the scores from two different in silico prediction tools and analysis from MD simulations, including RMSF, RMSD, H-bonding, salt bridge, and SASA, was developed to assess the APE1 gene variants.
In another study, MD simulations were utilized as supporting data for functional and structural pathogenicity prediction of nsSNPs (Kumar and Purohit, 2014). MD simulations were conducted, and their results were combined with in silico prediction tools to interpret cancer-related mutations in the Aurora-A kinase gene. As a result, atomic rearrangements and structural conformational changes were observed in a mutant protein, and these observations correlated with the computational predictions.
Since extended MD simulations provide more insights on protein structure, Marcolino et al. (2016) performed 1 μs MD simulations to evaluate structural impacts of variations in the uroguanylin gene. In addition, dynamic cross-correlation (DCC) and dynamic residue network (DRN) analyses have been applied to variant protein structures and are supportive techniques for MD simulations in variant effect prediction. Sanyanga and Tastan Bishop (2020) have investigated different pathogenic variants in the Carbonic Anhydrase VIII gene and have performed DCC and DRN analysis. According to the DRN analysis, a change in binding surface structure and its accessibility could be a discriminating factor between benign and pathogenic variants.
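As an illustration of DCC analysis, the sketch below computes the normalized cross-correlation of atomic displacements, C(i, j) = ⟨Δr_i · Δr_j⟩ / sqrt(⟨Δr_i²⟩ ⟨Δr_j²⟩), where Δr is the displacement of an atom from its mean position and the averages run over frames. It assumes pre-aligned frames and uses invented coordinates; it is not the implementation used in the cited study, and the function name is hypothetical.

```python
import math


def dcc_matrix(trajectory):
    """Return the dynamic cross-correlation matrix for a pre-aligned trajectory."""
    n_frames = len(trajectory)
    n_atoms = len(trajectory[0])
    means = [
        tuple(sum(frame[i][axis] for frame in trajectory) / n_frames for axis in range(3))
        for i in range(n_atoms)
    ]
    # Displacement of every atom from its mean position, frame by frame.
    displacements = [
        [tuple(frame[i][axis] - means[i][axis] for axis in range(3)) for i in range(n_atoms)]
        for frame in trajectory
    ]

    def mean_dot(i, j):
        total = sum(
            sum(a * b for a, b in zip(frame[i], frame[j])) for frame in displacements
        )
        return total / n_frames

    matrix = [[0.0] * n_atoms for _ in range(n_atoms)]
    for i in range(n_atoms):
        for j in range(n_atoms):
            denominator = math.sqrt(mean_dot(i, i) * mean_dot(j, j))
            matrix[i][j] = mean_dot(i, j) / denominator if denominator else 0.0
    return matrix


# Hypothetical trajectory: atoms 0 and 1 move together, atom 2 moves oppositely.
toy_trajectory = [
    [(0.0, 0.0, 0.0), (5.0, 0.0, 0.0), (10.0, 0.0, 0.0)],
    [(1.0, 0.0, 0.0), (6.0, 0.0, 0.0), (9.0, 0.0, 0.0)],
]
correlations = dcc_matrix(toy_trajectory)
```

Values near +1 indicate residues moving in the same direction and values near −1 indicate anticorrelated motion; differences between wild-type and mutant matrices can then point to altered dynamic coupling.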
Conclusions and Outlook
This expert review offers a synthesis of the in silico tools and algorithms for the prediction of functional or structural effects of SNP variants, in addition to the description of the phenotypic effects of nsSNPs on protein structure, association between pathogenicity of variants, and functional or structural features of disease-associated variants. Finally, case studies investigating the functional and structural effects of nsSNPs on selected protein structures are highlighted.
Through recent developments in computational technologies, a diversity of approaches and tools has been produced to assess the functional and structural effects of nsSNPs or missense variants. In addition, structural variants, including deletions, inversions, and duplications, have major roles in protein function and structure since these variants can lead to addition or deletion of amino acids and cause perturbations in the system. Since SNPs have exclusive functional and structural effects involving drug response, gene expression, and disease susceptibility, computational predictions of these effects provide enormous insights, especially in the field of medical science.
Phenotypic effects of nsSNPs on protein structure can be neutral or deleterious; these variations can therefore alter functional features, especially biochemical parameters such as stability, interaction, and dynamics. Disease-related nsSNPs can also result in changes in physicochemical features of amino acids, energy landscapes, free energy differences between folded and unfolded states of proteins, and the number of conformations in different folding states. MD simulations should be utilized for the proper assessment of these structural effects of nsSNPs on proteins.
Furthermore, a consistent workflow should be established for the reliability and reproducibility of in silico approaches, and a combination of several in silico prediction algorithms or tools must be considered to increase the performance, accuracy, and precision. A combination of in silico tools should comprise two or more different methodologies such as sequence homology-based, supervised learning-based, sequence-structure based, and consensus-based tools. In this way, artifacts or disadvantages from each can be offset to increase the performance of the assessment. It should also be noted that there does not exist a single dataset that matches the input requirements of all available tools/programs. Hence, a major improvement to the field can be achieved when such a dataset is created.
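One simple way to combine several predictors is a majority vote over binarized scores, sketched below. Because tools report scores on different scales and in different directions, each is first reduced to a damaging or tolerated call using its own cutoff. The tool names, thresholds, and scores are hypothetical placeholders, and majority voting is only one of several possible combination schemes.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PredictorRule:
    """How to binarize one tool's score (all values here are illustrative)."""
    threshold: float
    higher_is_damaging: bool


def consensus_call(scores, rules, min_fraction=0.5):
    """Return (label, fraction of tools voting 'damaging') for one variant."""
    votes = []
    for tool, score in scores.items():
        rule = rules[tool]
        if rule.higher_is_damaging:
            votes.append(score >= rule.threshold)
        else:
            votes.append(score <= rule.threshold)
    fraction = sum(votes) / len(votes)
    label = "damaging" if fraction > min_fraction else "tolerated"
    return label, fraction


# Hypothetical tools from different methodological families.
rules = {
    "homology_tool": PredictorRule(threshold=0.05, higher_is_damaging=False),
    "learning_tool": PredictorRule(threshold=0.85, higher_is_damaging=True),
    "structure_tool": PredictorRule(threshold=0.50, higher_is_damaging=True),
}
variant_scores = {"homology_tool": 0.01, "learning_tool": 0.92, "structure_tool": 0.30}
label, fraction = consensus_call(variant_scores, rules)
```

Reporting the vote fraction alongside the label keeps disagreement between tools visible, which helps flag variants that merit experimental follow-up.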
Another important aspect for improving the assessment of the effects of SNPs is experimental validation. Experimental validation of the effects of variations should focus on genotype and phenotype associations to increase performance. Indeed, adding an experimental verification step to the workflow for assessing the effects of nsSNPs on proteins provides precision, reliability, and reproducibility.
We conclude that creating a consistent workflow with a combination of in silico approaches or tools should be considered to increase the performance, accuracy, and precision of the biological and clinical predictions made in silico.