Abstract
Protein interaction and the prediction of membrane proteins have long been pivotal research areas. Adenosine triphosphate (ATP)-binding cassette (ABC) genes play a significant role in both prokaryotes and eukaryotes. In this analysis we concentrate on the human ABC genes. The transport of specific molecules across lipid membranes is vital for living organisms, and large transporters are required to carry these molecules. ABC transporter families have evolved to transport specific molecules such as sugars, amino acids, peptides, proteins and ions within the plasma membrane. Cholesterol is another important component in humans: it is a major constituent of the cell membrane, where its main functions are to maintain integrity and mechanical stability. Membrane cholesterol interacts with membrane proteins at both the N and C termini and targets valid sequence(s) that are relevant to human diseases. In this manuscript we apply Fuzzy C-Means (FCM) combined with a Support Vector Machine (SVM) to predict cellular cholesterol with ABC genes. Our experiments performed well on the ABC data set.
Introduction
Cell membranes play a major part in human pathogenesis. Many proteins with unique functions are found in cell membranes. Among them, ABC proteins form the largest superfamily in mammalian membranes. To date, 49 well-characterized ABC genes have been reported. ABC transporters use the energy of ATP hydrolysis to translocate solutes across cellular membranes, and these proteins are responsible for numerous phenomena such as the resistance of cancers and pathogenic microbes to drugs. Mammalian ABC transporter families have evolved to transport specific molecules such as sugars, amino acids, peptides, proteins, ions and bile acids within the plasma membrane. On the basis of amino acid homology and other parameters, this gene family is divided into seven distinct subfamilies, ABCA to ABCG. Cholesterol is a waxy, fat-like substance found in every cell of the human body. Cellular cholesterol plays a significant role in the regulation of membrane proteins [1–10].
Membrane cholesterol binds to proteins, which are sequences of amino acids. For human proteins, the 20 standard amino acids are considered. Two motifs have been used to describe the binding of cellular cholesterol to membrane proteins: the cholesterol recognition/interaction amino acid consensus (CRAC) and its reverse (CARC). The CRAC motif is L/V-X(1–5)-Y-X(1–5)-R/K and the CARC motif is R/K-X(1–5)-Y/F-X(1–5)-L/V, where X represents any amino acid [11–15].
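The two consensus patterns translate directly into regular expressions. The following is a minimal sketch, assuming sequences are given as upper-case one-letter amino acid codes; the function name `find_motifs` and the example sequences are hypothetical and are not taken from the data set used in this study.

```python
import re

# CRAC: (L or V) - 1..5 of any residue - Y - 1..5 of any residue - (R or K).
# The pattern sits inside a lookahead so that overlapping start positions
# are all visited by finditer.
CRAC = re.compile(r"(?=([LV][A-Z]{1,5}Y[A-Z]{1,5}[RK]))")

# CARC: (R or K) - 1..5 of any residue - (Y or F) - 1..5 of any residue - (L or V).
CARC = re.compile(r"(?=([RK][A-Z]{1,5}[YF][A-Z]{1,5}[LV]))")


def find_motifs(sequence, pattern):
    """Return (start, matched_text) for each start position where the pattern matches.

    Only one match is reported per start position: the first one the regex
    engine finds, not every possible gap combination.
    """
    return [(m.start(), m.group(1)) for m in pattern.finditer(sequence)]
```

Calling `find_motifs(sequence, CRAC)` on a transmembrane sequence gives the forward candidates and `find_motifs(sequence, CARC)` the backward ones; a start position that admits several gap combinations is reported once.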
These signature motifs have been reported in various subfamilies of ABC genes. Because target binding sites differ with the orientation of the protein sequence, predicting the relevant motif sequences, which are useful to drug researchers, is a tedious task. For this purpose researchers have employed numerous soft computing and data mining techniques, such as FCM, SVM, K-Nearest Neighbor (KNN), K-means and Principal Component Analysis (PCA) [11–18].
The main contribution of this paper is the prediction of cellular cholesterol from ABC genes based on FCM and SVM, using protein data and cholesterol motif sequences, which should be useful to drug discovery researchers. The manuscript is structured as follows. Part 2 discusses the related literature. Part 3 elaborates the materials and proposed models. Part 4 gives a detailed description of our methodology and experimental procedure. Part 5 concludes the manuscript.
Literature works
In cell biology, ABC transporters are well known as a superfamily of integral membrane proteins that play an important part in regulating growth and resistance processes in the human cell membrane. ABC genes are responsible for various human diseases, so numerous researchers are interested in predicting ABC proteins that are useful for clinical pathology. Several prediction techniques are therefore available at present.
Hazai et al. [19] used a support vector machine approach to predict breast cancer resistance protein (BCRP) substrates from data on 164 BCRP substrates and 99 non-substrates. The classification consisted of four phases. In the first phase, all substrates and non-substrates of wild-type BCRP were used for training and testing. In the second phase, the compounds in the training set were represented as points in a high-dimensional space and separated into substrate and non-substrate groups by a hyperplane. In the third phase, the prediction accuracy of the model was computed, and finally the models were validated on an independent external data set. The experimental results indicate that the SVM model achieved the highest prediction accuracy, 73%, among the models compared. Wang et al. [20] studied P-glycoprotein (P-gp), an ABC protein implicated in numerous essential processes such as lipid and steroid transport across plasma membranes. The authors applied a support-vector-machine-based model to P-gp data comprising 131 substrates and 81 non-substrates. The study showed that the proposed model gave better results in separating substrates from non-substrates of P-glycoprotein. Molinski et al. [21] proposed biophysical approaches for the ABC superfamily; their study points to drug development for diseases caused by dysfunctional ABC proteins.
Vishwakarma et al. [22] carried out a phylogenetic analysis of human ATP-binding cassette (ABC) transporters. Their model explored the gene superfamily, which consists of 49 member proteins in humans, and the phylogenetic analysis classified the human transporters correctly. Chandra [23] established a prediction model using a support vector machine to classify substrates and non-substrates of P-glycoprotein. The study used seven different types of features with eight different threshold values for each feature type; the features included unweighted, weighted, Euclidean distance, all pairwise centroid distances among substructures in a compound, and all feature combinations. The model achieved an accuracy of 93%. Furthermore, Bhavani et al. [3] proposed a general classifier framework based on support vector machines to predict adverse effects in diverse classes of drugs. The model gave accurate predictions overall and performed best in toxicity prediction.
Materials and model description
Data source description
The protein sequences of human ABC transporters were retrieved from the UniProt database [24]. A total of 49 genes have been reported in the human ABC superfamily to date. The family includes subfamilies with members such as ABCA1–ABCA13, ABCB1–ABCB11, ABCC1–ABCC12, ABCG1–ABCG5 and ABCG8. Each gene record contains the protein ID, the name of each helix, the total length of the transmembrane region, the exact position in the transmembrane sequence and the amino acid sequence, whose size varies by gene. The number of helices differs between genes: for example, protein ID Q5T3U5 contains helices 1–17 whereas Q96J66 contains helices 1–10. The maximum number of helices in an ABC transporter is 17. A sample of the data retrieved from UniProt is shown in Table 1. A second data set, the cholesterol dictionary constructed with the CRAC and CARC algorithms, is shown in Table 2.
Sample dataset details containing ABCA1 of ABC superfamily
Sample dataset of cholesterol dictionary using algorithms CARC and CRAC
Table 1 shows a sample from the ABC transporter family, with six columns for the gene ABCA1. Many ABC genes are available in the UniProt database, and we ran our proposed algorithm on them. Because not all genes of the superfamily can be shown, we present one sample gene, ABCA1. Each gene has a unique protein ID; for ABCA1 it is O95477, as shown in Table 1. ABCA1 has 15 helices, each containing 21 amino acids. The amino acids may lie anywhere in the transmembrane regions. The length of each helix in the transmembrane domain was obtained using different approaches.
Table 2 shows the cholesterol dictionary constructed from the CRAC (Leucine/Valine-X(1–5)-Tyrosine-X(1–5)-Arginine/Lysine) and CARC (Arginine/Lysine-X(1–5)-Tyrosine/Phenylalanine-X(1–5)-Leucine/Valine) motifs. For the forward direction, suppose the motif is LXXYXXK. The first position is leucine, the second and third positions are any amino acid, the fourth position is tyrosine, the fifth and sixth positions are any amino acid, and the last position is lysine. The motif type is therefore L2Y2K (22).
Another column gives the length of the cholesterol motif, which ranges from 5 to 13.
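A dictionary of this kind can be enumerated mechanically from the two consensus patterns. The sketch below is one possible construction, not necessarily the one used to produce Table 2: the function name `build_dictionary` and the field names `template`, `type`, `gaps` and `length` are hypothetical, chosen to mirror the motif type (e.g. L2Y2K), the gap code (e.g. 22) and the length column described above.

```python
from itertools import product


def build_dictionary(first, central, last):
    """Enumerate templates first-X(g1)-central-X(g2)-last with gaps g1, g2 in 1..5."""
    entries = []
    for a, g1, c, g2, b in product(first, range(1, 6), central, range(1, 6), last):
        entries.append({
            "template": a + "X" * g1 + c + "X" * g2 + b,  # e.g. LXXYXXK
            "type": f"{a}{g1}{c}{g2}{b}",                 # e.g. L2Y2K
            "gaps": f"{g1}{g2}",                          # e.g. 22
            "length": 3 + g1 + g2,                        # 5..13
        })
    return entries


# Forward (CRAC) and backward (CARC) dictionaries.
crac_dictionary = build_dictionary("LV", "Y", "RK")
carc_dictionary = build_dictionary("RK", "YF", "LV")
```

Under these assumptions the three fixed residues plus two gaps of one to five residues each give lengths from 5 to 13, which agrees with the window sizes d5 to d13.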
In mammalian membranes, proteins are modulated by cellular cholesterol. Because cholesterol plays a pivotal role in the cell membrane, many researchers are interested in predicting cholesterol interactions with membrane proteins. Numerous types of membrane proteins have been reported to date; among them, adenosine triphosphate-binding cassette (ABC) transporters have a significant role in clinical pathogenesis and are therefore of interest for drug discovery in humans.
Figure 1 shows the detailed flow of our proposed model. In phase 1 we created a mammalian cholesterol dictionary using the CRAC and CARC algorithms with window sizes d5 to d13, and retrieved the ABC gene data set from the UniProt database. In phase 2 the two data sets are mapped using a sliding window. In phase 3 we extracted both forward and backward motifs. In phase 4 we applied a novel hybrid approach, FCM with SVM. FCM is used to mine data in which a point may belong to more than one cluster, which makes it well suited to our data set. SVM is then applied to the FCM output for exact classification of the cluster points. Finally, in phase 5 we identified valid motif sequences of cellular cholesterol that have clinical relevance.

Schematic outline of proposed model.
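The sliding-window mapping of phase 2 and the forward/backward extraction of phase 3 can be sketched as follows. This is an illustrative sketch only: the function name `sliding_window_hits` and the test sequences are hypothetical, and matching a whole window against the consensus pattern is an assumption about how the mapping is performed.

```python
import re

# Whole-window patterns for the forward (CRAC) and backward (CARC) motifs.
CRAC = re.compile(r"[LV][A-Z]{1,5}Y[A-Z]{1,5}[RK]")
CARC = re.compile(r"[RK][A-Z]{1,5}[YF][A-Z]{1,5}[LV]")


def sliding_window_hits(helix, pattern, sizes=range(5, 14)):
    """Yield (window_size, start, window) for every window that matches in full.

    For each window size d5..d13 the window is moved one residue at a time
    along the helix sequence.
    """
    for d in sizes:
        for start in range(len(helix) - d + 1):
            window = helix[start:start + d]
            if pattern.fullmatch(window):
                yield d, start, window
```

Running the generator once with `CRAC` and once with `CARC` separates the hits into forward and backward motifs for the later phases.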
Fuzzy C Means
Cluster analysis can be described as the assignment of data points to clusters. Items within the same cluster form one group, while items in different clusters are treated as different groups. Clusters are identified by similarity measures. Many clustering algorithms are available; among them, fuzzy C-means is one of the most important and widely used. The algorithm assigns each data point a membership in every cluster, based on the distance between the data point and the cluster center. The membership is calculated according to the closeness of the data point to each cluster center [25–27].
FCM: Fuzzy C-Means
The support vector machine is a well-known supervised machine learning approach typically used for classification and regression problems. It is straightforward to apply to classification problems. The SVM concept is based on decision planes that define decision boundaries. A decision plane separates sets of objects with different class memberships. The main aim of the technique is to generate a line or hyperplane that divides the data into classes [28–33].
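Consider a training set of labeled pairs $(u_i, v_i)$, $i = 1, \ldots, m$.
The decision function is specified in Equation (3).
The training vectors are compared through a kernel function; the linear, polynomial and radial basis function (RBF) kernels are considered. The RBF kernel is
$$K(u_i, u_j) = \exp\left(-\gamma \lVert u_i - u_j \rVert^2\right), \quad \gamma > 0 \qquad (6)$$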
All kernel arguments, such as C, γ, r and e, are initialized using the data set. All kernel parameters depend on the size of the training data [29–33].
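For reference, the three kernels can be written out in a few lines. The RBF function follows Equation (6); the linear and polynomial forms are the standard textbook definitions and are an assumption here, since their formulas are not legible in this text. The parameter names `gamma`, `r` and `e` mirror the symbols γ, r and e mentioned above, and the default values are arbitrary.

```python
import math


def linear_kernel(u_i, u_j):
    """Inner product of the two vectors (standard form, assumed)."""
    return sum(a * b for a, b in zip(u_i, u_j))


def polynomial_kernel(u_i, u_j, gamma=1.0, r=0.0, e=3):
    """(gamma * <u_i, u_j> + r) ** e (standard form, assumed)."""
    return (gamma * linear_kernel(u_i, u_j) + r) ** e


def rbf_kernel(u_i, u_j, gamma=1.0):
    """exp(-gamma * ||u_i - u_j||^2), as in Equation (6); gamma must be > 0."""
    squared_distance = sum((a - b) ** 2 for a, b in zip(u_i, u_j))
    return math.exp(-gamma * squared_distance)
```

The penalty parameter C does not appear in any kernel; it belongs to the SVM training objective and is set separately.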
The proposed algorithm was run on an Intel i3 processor with a 4 GB hard disk under the Windows 7 operating system, and all code was written in Python 3. To evaluate the performance of the proposed method we used the ABC gene data set together with the cholesterol dictionary. In this paper FCM and SVM are employed to predict mammalian cholesterol from ABC proteins. The proposed work is explained in the following phases.
Phase 1: In the first stage we retrieved the ABC transporter protein data from the UniProt database: gene, protein ID, helix name (which varies from 1 to 17), protein length, position within the transmembrane region and protein sequence. The cholesterol dictionary was constructed using the CRAC/CARC algorithm, as described in the data source section.
Phase 2: After preparing both data sets we used a sliding window to map between them. We took window sizes D = d5, d6, d7, d8, d9, d10, d11, d12, d13. The cholesterol dictionary is constructed according to the window size and the CRAC/CARC algorithm. The CRAC motif runs in the forward direction, from the N to the C terminus, and can be represented as Leucine/Valine-X(1–5)-Tyrosine-X(1–5)-Arginine/Lysine. That is, Leu or Val takes the first position, followed by a segment of one to five residues of any amino acid, then the fixed residue Y, then another segment of one to five residues, and finally a basic Lys or Arg. The backward motif, CARC, is computed likewise using Arginine/Lysine-X(1–5)-Tyrosine/Phenylalanine-X(1–5)-Leucine/Valine.
Phase 3: Once phase 2 is finished, we extract the motif sequences according to their CRAC/CARC formulas. The motifs are divided into two parts: forward-direction motifs and backward-direction motifs.
Phase 4: The data are now ready for processing. Here we applied our proposed FCM-SVM algorithm for valid motif prediction. FCM assigns each data point a membership in every cluster based on the distance between the data point and the cluster center; memberships are calculated according to the closeness of the data points to the cluster centers.
The algorithm is based on minimization and is expressed by Equations (7) and (8).
After the cluster points and their membership values are obtained, they are fed to the SVM algorithm. For every set of motif pairs with different positions of 'Y' and 'F', we found the cluster points. The aim of the SVM is to train a model that assigns new, unseen objects to a particular group. Using the cluster points, the SVM classifies the cholesterol motif sequences. Both forward and backward sequences, which have biological relevance in drug design, can be identified with these algorithms.
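The clustering half of phase 4 can be illustrated with a plain-Python fuzzy C-means routine. This is a sketch of the standard textbook algorithm (alternating center and membership updates with fuzzifier m), not a reproduction of Equations (7) and (8), which are not legible in this text. The function name `fcm`, its default parameters and the numeric encoding of motifs as points are all assumptions.

```python
import random


def fcm(points, c=2, m=2.0, iters=100, tol=1e-6, seed=0):
    """Standard fuzzy C-means on a list of numeric vectors.

    Returns (centers, memberships); memberships[i][j] is the degree to which
    point i belongs to cluster j, and each row sums to 1.
    """
    rng = random.Random(seed)
    n, dim = len(points), len(points[0])
    # Random initial memberships, each row normalised to sum to 1.
    u = []
    for _ in range(n):
        row = [rng.random() + 1e-9 for _ in range(c)]
        total = sum(row)
        u.append([value / total for value in row])
    centers = []
    for _ in range(iters):
        # Center update: mean of the points weighted by membership ** m.
        centers = []
        for j in range(c):
            w = [u[i][j] ** m for i in range(n)]
            total = sum(w)
            centers.append([sum(w[i] * points[i][k] for i in range(n)) / total
                            for k in range(dim)])
        # Membership update: closer centers receive higher membership.
        new_u = []
        for p in points:
            d = [max(sum((a - b) ** 2 for a, b in zip(p, ctr)) ** 0.5, 1e-12)
                 for ctr in centers]
            new_u.append([1.0 / sum((d[j] / d[k]) ** (2.0 / (m - 1.0))
                                    for k in range(c))
                          for j in range(c)])
        shift = max(abs(new_u[i][j] - u[i][j])
                    for i in range(n) for j in range(c))
        u = new_u
        if shift < tol:  # memberships have stopped changing
            break
    return centers, u
```

In a pipeline of the kind described here, the membership rows (or the index of the largest membership) would then be passed to an SVM as features or labels; that step is omitted because the exact coupling between FCM and SVM is not specified in enough detail to reproduce.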
Valid cholesterol motif sequences predicted from ABC genes are shown in Tables 3 and 4. Forward and backward cholesterol motifs binding to membrane protein regions may produce numerous possibilities. The prediction of the cholesterol signature motif was therefore verified for each helix according to the sub-motifs it contains. The CRAC/CARC motifs in a given ABC protein are not distributed homogeneously across the membrane, so target motif sequences vary from protein to protein. Most of the proteins shown in Tables 3 and 4 are involved in functions such as sterol transport, accumulation or association with rafts. From the analysis, we conclude that backward motifs have more targets on proteins than forward motifs.
Cholesterol motifs derived from ABC genes (forward region)
Cholesterol motifs derived from ABC genes (backward region)
We can assume that the more sub-motifs found in the investigation, the better the interaction with membrane cholesterol. Overall, the study indicates that a larger number of sub-motif target regions for membrane cholesterol increases the predictability of the cellular cholesterol consensus motif.
In human physiology research, new problems and findings emerge daily, and researchers have implemented many soft computing algorithms to address them. Several membrane proteins, such as ABC proteins and GPCR proteins, are reported to be modulated by membrane cholesterol. The transport of specific molecules across lipid membranes is vital for living organisms, and large transporters are required to carry these molecules. Our study therefore focused on ABC transporters and cellular cholesterol, extracting the transmembrane sequence matches of forward and backward cholesterol motifs. We applied Fuzzy C-Means with a Support Vector Machine to predict cellular cholesterol with ABC genes. The protein data for ABC transporters were retrieved from the UniProt database, and the cholesterol dictionary was constructed using the CRAC/CARC algorithm as described in the data source section. Our experiments performed well on the given data set, and many valid motifs of different amino acid sequences were found that have clinical relevance for multi-drug discovery in humans.
