The gene ontology (GO) database contains GO terms that describe biological functions of genes. Previous methods for comparing GO terms have relied on the fact that GO terms are organized into a tree structure. Under this paradigm, the locations of two GO terms in the tree dictate their similarity score. In this article, we introduce two new solutions for this problem by focusing instead on the definitions of the GO terms. We apply neural network-based techniques from the natural language processing (NLP) domain. The first method does not rely on the GO tree, whereas the second indirectly depends on the GO tree. In our first approach, we compare two GO definitions by treating them as two unordered sets of words. The word similarity is estimated by a word embedding model that maps words into an N-dimensional space. In our second approach, we account for the word-ordering within a sentence. We use a sentence encoder to embed GO definitions into vectors and estimate how likely one definition entails another. We validate our methods in two ways. In the first experiment, we test the model's ability to differentiate a true protein–protein network from a randomly generated network. In the second experiment, we test the model in identifying orthologs from randomly matched genes in human, mouse, and fly. In both experiments, a hybrid of NLP and GO tree-based method achieves the best classification accuracy.
1. Introduction
The gene ontology (GO) project founded in 1998 is a collaborative effort that has been providing consistent descriptions of genes and proteins across different data sources and species (Gene Ontology Consortium, 2017). The GO database is similar to a dictionary: it contains terms referred to as GO terms. Each GO term has a definition describing some biological event.
The GO database is divided into three categories: cellular components (CCs), molecular functions (MFs), and biological processes (BPs). The CC ontology contains terms describing the components of the cell and can be used to locate a protein. The MF category contains terms describing chemical reactions such as catalytic activity or receptor binding. These terms do not specify the genes or proteins involved in the reactions or the locations of the events. The BP category contains terms describing a series of biological events. For example, the BP term GO:0006874 has the definition “Any process involved in the maintenance of an internal steady state of calcium ions at the level of a cell.” In each category, the GO terms are organized into a tree where there is only one root node (Gene Ontology Consortium, 2017). In this tree of GO terms (or GO tree), a more generic term (i.e., lyase activity) is closer to the root, whereas a more specific term (i.e., carboxy-lyase activity) is closer to a leaf node.
Because there are three GO categories in the database, there are three GO trees. Interestingly, there are edges connecting terms in different GO trees (Fig. 1). The GO database can be represented as three connected GO trees.
Terms shown are ancestors of GO:0009055 (yellow). GO:0003674 and GO:0008150 are root nodes for the MF and BP trees, respectively. Colors denote part of (blue) and is a (black) relationship. Snapshot is downloaded from ebi.ac.uk/QuickGO. BPs, biological processes; GO, gene ontology; MFs, molecular functions.
One application of the GO database is the comparison of two genes by first comparing the similarity of the GO terms that annotate them (Gene Ontology Consortium, 2017). To this end, we need a good metric for comparing GO terms. To solve this problem, we need to focus on the GO trees and the definitions of GO terms. Because of the GO trees, GO terms with a direct ancestor (i.e., sibling nodes) are deemed to be more related than GO terms with a distal ancestor. Moreover, because of this design, existing methods to measure the similarity of two GO terms mostly rely on the GO trees (Mazandu et al., 2017). Very few studies have yet to directly compare the definitions of GO terms (Pesaranghader et al., 2015).
In this article, we introduce two new solutions to measure the semantic similarity of two GO terms by focusing on the definitions of the GO terms. Our approach is most similar to Pesaranghader et al. (2015); however, we apply neural network-based techniques from the natural language processing (NLP) domain.
First, we compare words by converting them into word embeddings. We train the Word2vec model using open access articles on PubMed, so that we can represent a word as an N-dimensional vector (Mikolov et al., 2013). Cosine similarity is used to compare two words. To compare two GO terms, we treat their definitions as two unordered sets of words, and use the weighted modified Hausdorff distance (MHD) to measure the distance between two sets (Dubuisson and Jain, 1994). We name this metric \documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\usepackage{upgreek}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6}\begin{document}
$${ \rm{w}}2{ \rm{vGO}}{ \rm{.}}$$
\end{document}\documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\usepackage{upgreek}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6}\begin{document}
$${ \rm{w}}2{ \rm{vGO}}$$
\end{document} is entirely independent of the GO trees.
Second, we consider the word-ordering within the GO definitions. We note that entailment relationships exist in the GO tree (i.e., two GO terms are linked by a directed edge) (Fig. 1). We train the sentence encoder InferSent using the definitions of child–parent and randomly matched GO terms (Conneau et al., 2017). InferSent embeds sentences into an N-dimensional vector space and computes the probability that one sentence entails another. We name this approach InferSentGO. InferSentGO needs the GO trees for the training phase. Once the model is trained, only the definitions of GO terms are required for calculating their semantic similarities. The original authors of InferSent did not name their method; hence, we use their GitHub name. The NLP community refers to it as bidirectional long short-term memory with max pooling.
We compare W2vGO and InferSentGO against simDEF by Pesaranghader et al. (2015), and the following tree-based methods: \documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\usepackage{upgreek}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6}\begin{document}
$${ \rm{Resnik}}{ \kern 1pt} { \rm{ , }}$$
\end{document} graph-based similarity measure (GraSM), aggregate information content (AIC) (Resnik, 1999; Couto et al., 2007; Song et al., 2014). For Resnik and GraSM, we include the random walk contribution (RWC) to their scores by using the software GOssTo (Yang et al., 2012; Caniza et al., 2014). We discuss these competing methods in the Methods and Appendix sections. We further consider an ensemble approach AicInferSentGO by averaging the AIC and InferSentGO scores.
To compare these metrics, we conduct two experiments. In the first experiment, we test these metrics in differentiating the true protein–protein interaction (PPI) network from a random network for both human and yeast. In the second experiment, we test the metrics in identifying orthologs against randomly matched gene pairs for human, mouse, and fly. Our results show that the hybrid AicInferSentGO attains the best classification in terms of the area under the receiver operating characteristic (ROC) curve. Our software, data, and results are available at our GitHub github.com/datduong/NLPMethods2CompareGOterms.
2. Methods
2.1. Methods to measure similarity between two GO terms
Broadly speaking, existing tree-based methods are divided into two types: node based or edge based (Mazandu and Mulder, 2014). The key focus of node-based methods is the evaluation of the information content (IC) of the common ancestors for two GO terms. In brief, the IC of a GO term measures the usefulness of the term by evaluating how often the term is used to annotate a gene. Terms that are used sparingly have high IC because they are specific at distinguishing genes. Node-based methods have been shown to work well (Mazandu and Mulder, 2012, 2014; Song et al., 2014). However, we show in section 2.4 that we can improve the accuracy by including the semantic similarities of the GO definitions.
In this article, we choose the node-based methods Resnik and AIC as the baselines. Resnik is a classical approach for quantifying the similarity between two GO terms (Resnik, 1999). Despite being simple, Resnik has been shown to outperform some of its extensions on several test data sets (Pesquita et al., 2009; Mazandu and Mulder, 2014). AIC is recent. In their article, Song et al. (2014) showed that AIC outperforms popular node-based methods by Jiang and Conrath (1997), Lin et al. (1998), Resnik (1999), and Wang et al. (2007).
Unlike node-based methods, edge-based approaches analyze the paths between two GO terms. For example, these approaches measure the distance (or the average distance when more than one path exists). We include the edge-based GraSM as the baseline (Couto et al., 2007). We also apply Yang et al.'s (2012) approach (through GOssTo software) to add extra topological information from the GO tree for both GraSM and Resnik. Yang et al. (2012) referred to this extra information as the RWC. In the appendix, we describe Resnik, AIC, GraSM, and RWC in detail.
Recently, Pesaranghader et al. (2015) introduced simDEF, a text mining approach to compare the definitions of two GO terms. Their work is most similar to ours; however, they do not use neural network techniques. We briefly describe the five steps in simDEF. First, simDEF counts the co-occurrences for words in the MEDLINE corpus to construct a symmetric first-order co-occurrence matrix. Values in this matrix represent how many times the word in its row appears with the word in its column. Second, simDEF extends the definition of a GO term by concatenating its definition with the definitions of its direct parents and children. Third, simDEF builds a second-order matrix from the first-order matrix (Islam and Inkpen, 2006). Loosely speaking, each row in second-order matrix is an extended GO definition. Each column is the word count for a word in the definition. Fourth, simDEF applies the Pointwise Mutual Information (PMI) function on the second-order matrix (Islam and Inkpen, 2006). In their article, Pesaranghader et al. (2015) refer to this outcome as the PMI-on-second-order matrix. Here, each row represents an extended GO definition, and the entry for each row is the transformed word count for a word in the definition. Fifth, cosine distance is used to measure the similarity between two GO terms (i.e., two rows in the PMI-on-second-order matrix). We now introduce our methods and outline their differences from simDEF.
2.2. Word2vec model
Here, we describe our first metric W2vGO. We use W2vGO as pure NLP technique; that is, W2vGO focuses strictly on the GO definitions and is completely independent of the GO trees. For this reason, unlike simDEF, we do not concatenate a GO definition with the definitions of its parents and children. This extension of a GO definition requires information from the GO tree.
To compare two GO terms, W2vGO compares their definitions by treating the definitions as two unordered sets of words. To solve this problem, we want to first be able to compare two words. To this end, we use the word embedding model Word2vec (Mikolov et al., 2013). Word2vec is a distributional linguistic model. Loosely speaking, Word2vec analyzes how often words co-occur. However, Word2vec is very different from simDEF co-occurrences matrices.
The Word2vec model converts a word into an N-dimensional vector. The user specifies N; in practice, people often set \documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\usepackage{upgreek}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6}\begin{document}
$$N \; = \;300.$$
\end{document} These vectors are known as word embeddings. Word2vec transforms similar words into similar vectors, thus enabling us to quantify the similarity between two words by computing the Euclidean distance or cosine similarity. At the heart of the Word2vec is a neural network model with one input layer, one hidden layer, and one output layer (Mikolov et al., 2013).
Loosely speaking, one can view the Word2vec model as a prediction problem (Rong, 2014). First, a word w from the input layer is mapped into an N-dimensional vector at the hidden layer. Word2vec learns the values for the hidden layer based on the coconcurrences between w and its neighboring words. The key idea is to predict the vectors for the surrounding context words based on the vector for w.
The purpose of this article is not to dissect the Word2vec model; we are interested in adopting this model to measure the similarity of GO terms and compare it with other methods. Interested readers are encouraged to read the original article by Mikolov et al. (2013), and the introduction by Rong (2014).
2.2.1. Measure similarity of two words using Word2vec
The training data influence the application of word embedding model. Existing pretrained Word2vec models are often made by corpora collected from news, books, or the Internet. To obtain suitable word vectors, in this work, we train the Word2vec to recognize biological words. We set the dimension \documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\usepackage{upgreek}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6}\begin{document}
$$N \; = \;300$$
\end{document} and use 20GB of data from open access articles on PubMed. The raw count of unique and repeated words is 14,526,527,855. We remove words that appear <25 times in the whole training data, thus reducing the final number of unique words to 986,615. We keep stop words and symbols such as \documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\usepackage{upgreek}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6}\begin{document}
$$+$$
\end{document} and \documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\usepackage{upgreek}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6}\begin{document}
$$-$$
\end{document} in the data because they may have important biological meanings. We use the Python library gensim to train the Word2vec model (Rehurek and Sojka, 2011). A simple Python user interface is available at our GitHub.
There are two important details here. First, the training data do not contain definitions of GO terms found in the GO database. This helps us avoid data reuse. Second, theoretically speaking, Word2vec model can be trained on the GO terms in the PubMed data, so that one can convert a GO term into a vector. Unfortunately, the IDs of the GO terms are not used too often in published articles, and detecting definitions of GO terms in articles is a different type of research problem (Tuan et al., 2013). For these reasons, we use the Word2vec model as a metric to compare two biological words.
To compare two vector representations of two words, we use the cosine similarity, because it is bounded, whereas Euclidean distance is not. We define the function \documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\usepackage{upgreek}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6}\begin{document}
$${ \rm{w}}2{ \rm{v}} \left( {z , v} \right)$$
\end{document} as the similarity score of two words \documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\usepackage{upgreek}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6}\begin{document}
$$z , v.$$
\end{document}
2.2.2. Measuring similarity of two GO terms using Word2vec
A GO term comes with a definition that is usually one or two sentences describing some biological feature. For example, GO:0003700 has the following definition: “Interacting selectively and noncovalently with a specific DNA sequence in order to modulate transcription. The transcription factor may or may not also interact selectively with a protein or macromolecular complex.”
When a GO term definition has more than one sentence, we concatenate these sentences into the same sentence by ignoring the period symbol. For example, the two sentences for GO:0003700 are considered as one long sentence.
Thus, the task to compare two GO terms reduces to the problem of comparing their definitions that are two sentences. Suppose that GO terms a and b have sentences \documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\usepackage{upgreek}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6}\begin{document}
$$Z , V$$
\end{document} as their definitions, respectively. We treat two sentences \documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\usepackage{upgreek}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6}\begin{document}
$$Z =$$
\end{document}“\documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\usepackage{upgreek}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6}\begin{document}
$${z_1} \;{z_2} \;{z_3} \; \ldots \;{z_N}$$
\end{document}” and \documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\usepackage{upgreek}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6}\begin{document}
$$V \; =$$
\end{document}“\documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\usepackage{upgreek}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6}\begin{document}
$${v_1} \;{v_2} \;{v_3} \; \ldots \;{v_M}$$
\end{document}” as two unordered sets of words \documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\usepackage{upgreek}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6}\begin{document}
$$Z \; = \; \left\{ {{z_1} , \;{z_2} \; \ldots \;{z_N}} \right\} $$
\end{document} and \documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\usepackage{upgreek}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6}\begin{document}
$$V \; = \; \left\{ {{v_1} , \;{v_2} \; \ldots \;{v_M}} \right\} .$$
\end{document}
To measure the similarity of sentences Z and V (or in other words, terms a and b), we use the metric
\documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\usepackage{upgreek}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6}\begin{document}
\begin{align*}
\begin{split} { \rm{w}}2{ \rm{vGO}} \left( {a , \;b} \right) \; = \;{ \rm{mean}} \; \left\{ { \; \mathop \sum \limits_{i = 1 \ldots N} \;{ \rm{content }}} \right. \; ( {z_i} ) \; \mathop { \max } \limits_{i = 1 \ldots M} \;{\rm w}2{\rm v}{ \kern 1pt} ( {z_i} ,{v_j} ) , & \\ \left. { \mathop \sum \limits_{i = 1 \ldots N} \;{ \rm{content}} \; ( {v_j} ) \; \mathop { \max } \limits_{i = 1 \ldots N} \;{\rm w}2{\rm v}{ \kern 1pt} ( {z_i} , {v_j} ) } \right\} , & \\\end{split}
\tag{1}
\end{align*}
\end{document}
where \documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\usepackage{upgreek}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6}\begin{document}
$${ \rm{content}} \left( w \right)$$
\end{document} is the weight of the word w and is often used to distinguish common words from rare words. The weights of words can help avoid the influence of hub words (i.e., words such as \documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\usepackage{upgreek}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6}\begin{document}
$$cell , \;DNA , \;activity$$
\end{document}) that are ubiquitously associated with many other words (Levy et al., 2014). \documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\usepackage{upgreek}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6}\begin{document}
$${ \rm{Content}} \, \left( w \right)$$
\end{document} is very similar to the IC function (Li et al., 2006)
\documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\usepackage{upgreek}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6}\begin{document}
\begin{align*}
{ \rm { Content } } \, \left( w \right) \; = \; - \; { \rm { log } } \; \left( { { \frac { { \rm { frequency } } \; { \rm { word } } \;w \; { \rm { in \;training \;data } } } { { \rm { training } } \; { \rm { data } } \; { \rm { size } } } } } \right). \tag { 2 }
\end{align*}
\end{document}
Because GO definitions are often short, we hope that the accuracy of \documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\usepackage{upgreek}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6}\begin{document}
$${ \rm{w}}2{ \rm{vGO}}$$
\end{document} does not suffer too much from the removal of word-ordering. Surprisingly, this assumption holds true in several instances (Table 2). There are more sophisticated models that consider the word-ordering in the sentences; we will consider one of these methods in the next section.
In any case, we have defined a metric to measure the two GO terms a and b with definitions Z and V under the Word2vec paradigm. \documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\usepackage{upgreek}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6}\begin{document}
$${ \rm{w}}2{ \rm{vGO}} \, \left( {a , \;b} \right)$$
\end{document} ranges from \documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\usepackage{upgreek}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6}\begin{document}
$$- 1$$
\end{document} to 1 because \documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\usepackage{upgreek}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6}\begin{document}
$${ \rm{w}}2{ \rm{v}} \left( {z , \;v} \right)$$
\end{document} ranges from \documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\usepackage{upgreek}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6}\begin{document}
$$- 1$$
\end{document} to 1.
2.3. InferSent model
In the previous section, we have ignored the word-ordering in the sentences, treating them as sets of words. In this section, we briefly explain InferSent, a model that focuses not only on the word embeddings but also on the word-ordering in the sentences (Conneau et al., 2017). InferSent is thus different from simDEF that also ignores the word-ordering.
Loosely speaking, InferSent is similar in spirit to Word2vec; instead of words, InferSent identifies the relationship between two sentences. One training sample for InferSent consists of two sentences having some relationship R. Consider a simple classification where \documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\usepackage{upgreek}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6}\begin{document}
$$R \; = \;entailment$$
\end{document} or \documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\usepackage{upgreek}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6}\begin{document}
$$neutral.$$
\end{document}. In this case, InferSent expects two classes of input. The first class entailment will have two sentences where the first entails the second; for example, “cat is napping on the mat” entails “cat is not running”. Entailment is a one-directional relationship, because “cat is not running” does not entail “cat is napping on the mat”. The second class neutral will have two unrelated sentences such as “cat is napping on the mat” and “boy is watching the cat.” Unlike entailment, neutral is a bidirectional relationship.
InferSent is a classification model based on the neural network architecture; its full description is given by Conneau et al. (2017). Here, we briefly mention the first layer of this network. InferSent's first layer is the word vectors for words in the entire training data set. Conneau et al. (2017) use the GloVe word vectors (Pennington et al., 2014). However, to obtain the best result, these word vectors should be specific to biology. In this article, we use the Word2vec vectors in section 2.2.1.
2.3.1. Measuring similarity of two GO terms using InferSent
InferSent takes two sentences as one training sample. In this article, the two sentences will be the definitions of two GO terms a and b. We will define two categories entailment and neutral, and estimate the probability \documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\usepackage{upgreek}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6}\begin{document}
$$\mathbb{P} \left( {a \;{ \rm{entails}} \;b} \right)$$
\end{document}. This metric allows us to gauge the semantic similarity for a and b. We choose this option because entailment relationship exists in the GO tree. Child–parent GO terms are linked by a one-directional relationship such as “is a,” “part of,” “regulates,” “negatively regulates,” and “positively regulates.” For example, the term GO:1900237 “positive regulation of induction of conjugation with cellular fusion” entails the term GO:0010514 “induction of conjugation with cellular fusion.”
To prepare the entailment data set, for all three ontology categories, we randomly pair each GO term with one of its parents. To ensure that these child–parent GO terms are indeed similar in meaning, we compute the median AIC score for each category and retain pairs having scores above the median. Our final data set contains 17,226 pairs. We treat the three ontology categories as one single data set when training the InferSent model.
To create the neutral data set, we make two types of unrelated pairs. For the first type, we randomly pick about half the number of GO terms in the entailment data set. For each term c in this set, we pair it with a randomly chosen GO term d in the same ontology category. For the second type, we pair the same term d with another randomly chosen term e. This sampling scheme improves the training by allowing some GO terms to be seen more than once under different circumstances.
Two unrelated GO terms should have \documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\usepackage{upgreek}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6}\begin{document}
$$\mathbb{P} \, \left( {{ \rm{ter}}{{ \rm{m}}_1} \;{ \rm{entails}} \;{ \rm{ter}}{{ \rm{m}}_2}} \right)$$
\end{document} and \documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\usepackage{upgreek}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6}\begin{document}
$$\mathbb{P} \left( {{ \rm{ter}}{{ \rm{m}}_2} \;{ \rm{entails}} \;{ \rm{ter}}{{ \rm{m}}_1}} \right)$$
\end{document} near zero. For each neutral pair, we create two different samples for InferSent. The first sample will be the pair (definition of term1, definition of term2), and the second sample will be the pair (definition of term2, definition of term1). Our final neutral data set contains 35,044 pairs of sentences. Since the neutral data set is nearly twice as large as the entailment data set, when training InferSent, we weigh ratio 1:2 for the class neutral and \documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\usepackage{upgreek}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6}\begin{document}
$$entailment.$$
\end{document}
The code to train InferSent is available at our GitHub. We attained 96.93% accuracy in the validation set. We emphasize that InferSent relies on the GO tree only for its training phase. Once the model is trained, it requires only the GO definitions as the input for prediction.
When terms a and b are unrelated, we expect \documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\usepackage{upgreek}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6}\begin{document}
$$\mathbb{P} \left( {a \;{ \rm{entails}} \;b} \right) \approx \mathbb{P} \left( {b \;{ \rm{entails}} \;a} \right) \; \approx \;0$$
\end{document}. When GO term a is the child of term b, we expect \documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\usepackage{upgreek}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6}\begin{document}
$$\mathbb{P} \left( {a \;{ \rm{entails}} \;b} \right)$$
\end{document} to be high, whereas \documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\usepackage{upgreek}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6}\begin{document}
$$\mathbb{P} \left( {b \;{ \rm{entails}} \;a} \right)$$
\end{document} may be low. However, in this case, we still want the similarity between \documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\usepackage{upgreek}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6}\begin{document}
$$a \;{ \rm{and}} \;b$$
\end{document} to be high. For this reason, to measure the semantic similarity for two GO terms, we use the metric
\documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\usepackage{upgreek}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6}\begin{document}
\begin{align*}
{ \rm{InferSentGO}} \left( {a , \;b} \right) \; = \;{ \rm{max}} \; \left\{ { \mathbb{P} \left( {a \;{ \rm{entails}} \;b} \right) , \; \mathbb{P} \left( {b \;{ \rm{entails}} \;a} \right) } \right\} . \tag{3}
\end{align*}
\end{document}
\documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\usepackage{upgreek}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6}\begin{document}
$${ \rm{InferSentGO}} \left( {a , \;b} \right)$$
\end{document} ranges from 0 to 1.
2.4. Combining node-based and NLP methods
In this section, we discuss a few examples and motivate the need for combining node-based and NLP methods. Here, we choose the node-based Resnik and AIC, because GraSM and RWC do not yield good results (Table 2).
In essence, both Resnik and AIC focus on the fraction of shared ancestors and weigh this ratio by the IC values. Sometimes, even when two terms are very similar in meaning, they can have a low score. For example, consider the terms GO:0005887 and GO:0016021 in the CC ontology (Table 1 Row 1). This is a child–parent pair, having almost identical definition. However, the Resnik and AIC scores are not high enough. In a sample of 3244 child–parent pairs in the CC ontology, we found the median score to be 5.5933 and 0.9278 for Resnik and AIC, respectively.
A Few Examples to Compare Gene Ontology Similarity Scores
GO:0016021. The component of a membrane consisting of the gene products and protein complexes having at least some part of their peptide sequence embedded in the hydrophobic region of the membrane.
GO:0005887. The component of the plasma membrane consisting of the gene products and protein complexes having at least some part of their peptide sequence embedded in the hydrophobic region of the membrane.
5/9
2.493
0.588
0.982
0.999
GO:0006814. The directed movement of sodium ions (Na+) into, out of or within a cell, or between cells, by means of some agent such as a transporter or pore.
GO:0006874. Any process involved in the maintenance of an internal steady state of calcium ions at the level of a cell.
0
0
0.107
0.590
0.920
GO:0005829. The part of the cytoplasm that does not contain organelles but that does contain other particulate matter, such as protein complexes.
GO:0005615. That part of a multicellular organism outside the cells proper, usually taken to be outside the plasma membranes, and occupied by fluid.
0
0
0.212
0.449
0.845
GO:0004620. Catalysis of the hydrolysis of a glycerophospholipid.
GO:0019905. Interacting selectively and noncovalently with a syntaxin, a SNAP receptor involved in the docking of synaptic vesicles at the presynaptic zone of a synapse.
0
0
0.264
0.333
0.001
Fraction of shared ancestors, with 0 indicates terms that share only the root node.
In the second example (Table 1 Row 2), GO:0006814 and GO:0006874 are unrelated, sharing only the root node. Interestingly, one can argue that both terms are not entirely distinct because they both mention the regulation of ions. Similarly, in the third example (Table 1 Row 3), GO:0005829 and GO:0005615 share only the root node, but both mention fluid containing protein complexes. In both examples, W2vGO on its own or an average of AIC and InferSentGO may give a more satisfying score. InferSentGO and AIC on their own may underestimate and overestimate the similarity, respectively. Resnik on its own is not the best because when terms share only the root node, the similarity score is the IC of the root that is 0.
In the final example (Table 1 Row 4), GO:0004620 and GO:0019905 share only the root node and truly are different in meaning. Here, Resnik and InferSentGO give more reasonable scores than AIC and W2vGO do.
These examples suggest that no one method is always the best, and that we need to combine node-based and NLP approaches. To this end, taking an average of AIC and InferSentGO is reasonable. Empirically, from the examples, we have seen that this average produces a reasonable value. Theoretically, only AIC and InferSentGO scores are in the same \documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\usepackage{upgreek}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6}\begin{document}
$$\left[ {0 , \;1} \right]$$
\end{document} range; whereas Resnik and W2vGO scores are in the range \documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\usepackage{upgreek}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6}\begin{document}
$$\left[ {0 , \; \infty } \right]$$
\end{document} and \documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\usepackage{upgreek}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6}\begin{document}
$$\left[ { - 1 , \;1} \right]$$
\end{document}, respectively (see Appendix). We note that simDEF score range is \documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\usepackage{upgreek}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6}\begin{document}
$$\left[ { - 1 , \;1} \right]$$
\end{document}, making it incompatible with Resnik and AIC.
For two GO terms a and b, we introduce the metric
\documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\usepackage{upgreek}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6}\begin{document}
\begin{align*}
{ \rm { AicInferSentGO } } \left( { a , \;b } \right) \; = \; \frac { { { \rm { AIC } } \left( { a , \;b } \right) \; + \; { \rm { InferSentGO } } \left( { a , b } \right) } } { 2 } . \tag { 4 }
\end{align*}
\end{document}
From AIC's perspective, AicInferSentGO improves the distinction between child–parent and randomly matched GO terms, especially in the CC and MF ontologies (Fig. 2). From InferSentGO's perspective, AicInferSentGO gives a more continuous score, allowing for a better resolution when comparing terms. InferSentGO on its own tends to give a stiff 0/1 score.
Similarity scores for 33,020 child–parent GO terms, and 40,419 randomly matched GO terms. The number of pairs are 51,931, 4986, and 16,522 for BP, CC, and MF ontology, respectively. From AIC's perspective, the hybrid method AicInferSentGO improves the distinction between child–parent and randomly matched GO terms. From InferSentGO's perspective, AicInferSentGO gives a more continuous score. AIC, aggregate information content; CCs, cellular components.
2.5. Measuring similarity of two genes
To assess the performance of the different GO metrics, we will use them to measure the similarity between two genes. A gene is annotated with several GO terms from the three GO categories. For example, the gene HOXD4, which is important for morphogenesis, is annotated by these GO terms GO:0003677, GO:0003700, and GO:0006355. Thus, we can view any gene A as a set of GO terms. A GO term a is in the set A (i.e., \documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\usepackage{upgreek}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6}\begin{document}
$$a \in A$$
\end{document}) if a is used to annotate A.
To assess the similarity between two genes A and B, we must compare two sets of GO terms. There are many metrics for this task (Mazandu et al., 2017). Here, we use the MHD and the best max average distance (BMA). MHD is a traditional metric for comparing two sets, and was often used in image processing (i.e., to compare two sets of pixels) (Dubuisson and Jain, 1994). BMA has been shown to be better than taking the maximum or minimum of all pairwise distances for the elements in the two sets (Pesquita et al., 2008).
\documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\usepackage{upgreek}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6}\begin{document}
\begin{align*}
{ \rm { MHD } } \left( { A , \;B } \right) \; = \; { \rm { min } } \, \left\{ { \; \frac { 1 } { { \left\vert A \right\vert } } \, \mathop \sum \limits_ { a \in A } \, \mathop { { \rm { max } } } \limits_ { b \in B } \,s \left( { a , \;b } \right) , \; \frac { 1 } { { \left\vert B \right\vert } } \; \mathop \sum \limits_ { b \in B } \; \mathop { { \rm { max } } } \limits_ { a \in A } \;s \left( { a , \;b } \right) \; } \right\} . \tag { 5 }
\end{align*}
\end{document}\documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\usepackage{upgreek}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6}\begin{document}
\begin{align*}
{ \rm { BMA } } \; \left( { A , \;B } \right) \; = \; { \rm { mean } } \, \left\{ { \; \frac { 1 } { { \left\vert A \right\vert } } \; \mathop \sum \limits_ { a \in A } \; \mathop { \max } \limits_ { b \in B } \;s ( a , \;b ) , \; \frac { 1 } { { \left\vert B \right\vert } } \; \mathop \sum \limits_ { b \in B } \; \mathop { \max } \limits_ { a \in A } \;s ( a , \;b ) \; } \right\} \; \;. \tag { 6 }
\end{align*}
\end{document}
In the above, the function s(a, b) is a generic placeholder for measuring the similarity of GO terms a and b. For example, if one uses Resnik, AIC, or W2vGO metric, then \documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\usepackage{upgreek}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6}\begin{document}
$$s \left( {a , \;b} \right) \; = \;{ \rm{Resnik}} \left( {a , \;b} \right) , \;{ \rm{AIC}} \left( {a , \;b} \right)$$
\end{document}, or \documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\usepackage{upgreek}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6}\begin{document}
$${ \rm{w}}2{ \rm{vGO}} \left( {a , \;b} \right)$$
\end{document}, respectively.
3. Results
We compare the GO metrics. Because genes are annotated by GO terms, good GO metrics should differentiate similar genes from unrelated genes well. Hence, we conduct two experiments. First, we test the GO metrics in identifying true PPIs. Second, we test the metrics in identifying orthologs in human, mouse, and fly. We download the GO term definitions and GO annotations from the Gene Ontology (geneontology.org). We download the orthologs from Ensembl (ensembl.org/biomart). The source code, data, and results in this section are available at our GitHub.
3.1. Human PPI network
We use the 6031 PPI data prepared by Mazandu and Mulder (2014). We trim these data further, keeping only human proteins that can be mapped to some genes through UniProt (uniprot.org). Next, to avoid data reuse, we remove electronically inferred GO terms (removing terms with tags IEA, NAS, NA, and NR) (Pesquita et al., 2008). To keep only genes that are well studied, we retain only genes with at least one GO term in each ontology (BPs, CCs, and MFs). The final data have 2593 pairs.
Like in Mazandu and Mulder (2014), we want to compare how well each metric differentiates a true PPI network (positive set) from a randomly made PPI network (negative set). We follow the procedure by Mazandu and Mulder (2014). We make the positive and negative sets to have the same number of edges. For the negative set, we randomly assign edges between proteins that do not interact in the real PPI network. The real and random PPI networks have the same proteins; we only require that they have different interacting partners. For each PPI network, to compute the similarity scores of the edges (i.e., pairs of proteins), we use Equations (5) and (6) with \documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\usepackage{upgreek}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6}\begin{document}
$$s \left( {a , \;b} \right)$$
\end{document} being one of the GO metrics in Table 2.
To compare the performance of the metrics, we find the area under the curve (AUC) of the ROC curve. The real and random PPI networks serve as a basis to calculate the true positive and false negative rate, respectively. The AUC is computed by plotting the true positive versus false negative rates at different thresholds and estimating the area under this curve. An AUC value goes from 0 to 1, with 1 being the best prediction power.
We judge the GO metrics based on their AUC values. From Figure 2, because node-based methods adequately distinguish related BP terms from unrelated BP terms, the NLP methods' improvement is best seen in the CC and MF ontologies (Table 2).
Area Under the Curves for Classifying Human Protein–Protein Interactions
Bold font indicates the best value in each column.
Joint analysis: When comparing genes, we keep their entire GO annotations, effectively treating the BP, CC, and MF ontologies as connected GO trees.
p value of the Hanley–McNeil test to compare AUCs of other methods against BMA+AicInferSentGO.
AUCs, area under the curves; BMA, best max average distance; BPs, biological processes; CCs, cellular components; GraSM, graph-based similarity measure; MFs, molecular functions; MHD, modified Hausdorff distance; NA, not available; RWC, random walk contribution.
On average, each gene in this experiment is annotated by 20.66 GO terms, with the composition of 46.25% BP, 24.07% CC, and 29.68% MF terms. Because of these fractions, when using only the BP ontology to compare GO metrics, on average, we are using only 46.25% of the full description for a gene. The same argument can be made for using only the CC or MF ontology to compare GO metrics. For this reason, we conduct a joint analysis. Here, when comparing two genes, we use all the GO terms in their annotations, allowing for comparison of GO terms across different ontologies. This approach aligns with the observation that GO terms in different categories are connected (Fig. 1). Tables 2 and 3 indeed show that the joint analyses yield the highest AUC for all GO metrics. We have two explanations for this outcome.
Bold font indicates the best value in each column.
Joint analysis: When comparing genes, we keep their entire GO annotations, effectively treating the BP, CC, and MF ontologies as connected GO trees.
p value of the Hanley–McNeil test to compare AUCs of other methods against BMA+AicInferSentGO.
First, intuitively, when using all BP, CC, and MF terms in the gene annotation, one can better understand the genes' functionality. For example, when looking at the CC ontology terms alone, arguably proteins in the same part of the cell do not necessarily interact. However, when we consider not only the locations but also the biological and molecular events taking place, then we can accurately compare the two genes.
Second, empirically, in 46,967 randomly chosen child–parent pairs, we count 2060 pairs (4.38%) having terms in different ontologies. For example, a few terms having parents in different ontologies are GO:0009055, GO:0035514, GO:0102496, GO:1903198, and GO:1903934. The fraction 4.38%, despite being small, has a nontrivial repercussion.
This effect is especially true for Resnik and AIC whose key ideas rely on the number of common ancestors. For example, consider the term GO:0009055 in the MF ontology with its parent GO:0022900 in the BP category (Fig. 1). When treating the BP and MF trees separately, GO:0009055 and GO:0022900 have AIC score 0 because they will not have any shared ancestors. When treating the trees jointly, the AIC score is 0.7197. Thus, we can better estimate the similarity between genes containing not only these terms but also their descendant terms.
For reasons already explained, in this article, we select the metric with the highest AUC in the joint analysis to be the best method. Here, Table 2 shows that BMA+AicInferSentGO does best. We note that BMA+W2vGO, despite not using information from the GO trees, works quite well on its own (second rank).
We use the Hanley–McNeil test to compare AUCs of the other methods against BMA+ AicInferSentGO (Hanley and McNeil, 1983). The p values in column 6 of Table 2 show that BMA+AicInferSentGO is slightly better than BMA+AIC and BMA+W2vGO, and is statistically above the other approaches. We provide the ROC plots for the joint analyses at our GitHub.
Our joint analyses did not include Resnik\documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\usepackage{upgreek}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6}\begin{document}
$${{ \rm{ \;}}_{{ \rm{RWC}}}}$$
\end{document}, GraSM, and GraSM\documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\usepackage{upgreek}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6}\begin{document}
$${{ \rm{ \;}}_{{ \rm{RWC}}}}$$
\end{document} because the software GOssTo does not allow the option to combine the three ontologies. In the experiment, we reimplemented Resnik and AIC ourselves, and changed the source code for simDEF (see GitHub).
3.2. Orthologs
Like in section 3.1, we remove electronically inferred GO terms from the gene annotation, and use genes with at least one GO term in each ontology. We test the following species: human/mouse and human/fly. For each pair, the positive set contains orthologs from the two species, whereas the negative set contains randomly matched genes. We set the sizes of the positive set and negative set to be equal. For human/mouse data set, we have 10,235 pairs for each set; for the human/fly data set, we have 4880 pairs for each set. We exclude Resnik\documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\usepackage{upgreek}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6}\begin{document}
$${{ \rm{ \;}}_{{ \rm{RWC}}}}$$
\end{document}, GraSM, and GraSM\documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\usepackage{upgreek}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6}\begin{document}
$${{ \rm{ \;}}_{{ \rm{RWC}}}}$$
\end{document} in this experiment because they did not perform well for human protein network.
Like before, we consider the best GO metric to be that with the highest AUC for the joint analysis. BMA+AicInferSentGO again does best in this experiment (Table 3). The Hanley–McNeil p values for comparing AUCs indicate that BMA+AicInferSentGO is statistically above all other methods.
We also conduct the classification for mouse/fly orthologs. In the joint analysis, BMA+AicInferSentGO metric gives the highest AUC score of 91.52% (full Table not shown).
3.3. Yeast PPI network
Pesaranghader et al. (2015) provided the yeast PPI data without electronically inferred GO terms. We keep only proteins annotated with at least one term in each ontology. The final data set contains 3938 true interactions and 3938 random pairs. We consider the best GO metric to be that with the highest AUC for the joint analysis (Table 4). Here, Resnik outperforms the other GO metrics. We note that AicInferSentGO is better than both AIC and InferSentGO by themselves. This observation along with the previous results suggests that ensemble of node-based and NLP methods improves performance.
Area Under the Curves for Classifying Yeast Protein–Protein Interactions
Bold font indicates the best value in each column.
Joint analysis: When comparing genes, we keep their entire GO annotations, effectively treating the BP, CC, and MF ontologies as connected GO trees.
p value of the Hanley–McNeil test to compare AUCs of other methods against BMA+AicInferSentGO.
We hypothesize that text-based approaches (both ours and simDEF) do worst in part because PubMed data contain mostly articles on human biology. Evaluating the effect of training the model on different data sources is beyond the scope of this article; we reserve this topic for future research.
4. Discussion
In our results, we do not aim to attain perfect classification, rather, we use the classification to rank the GO metrics. Other articles have used sequence similarity and coexpression data to evaluate GO metrics. However, sequence similarity has a stronger correlation with MF terms than BP and CC terms (Pesquita et al., 2008). Also because of alternative splicing, similar sequences can produce proteins with different functionality (Saha et al., 2017). Coexpression data work best with BP and CC ontologies (Song et al., 2014, Mazandu and Mulder, 2014), but genes are expressed nonuniformly across different tissues (Duong et al., 2017). Depending on the data source, experiments using coexpression data can give highly varying outcomes.
The Word2vec has an extension Sentence2vec that converts a sentence into a vector (Le and Mikolov, 2014). Theoretically, one can convert GO definitions into vectors. However, our Word2vec result contains 986,615 words; so, the number of sentences in the training data set is larger than this number. We encountered computer memory problem in training Sentence2vec on a 64GB RAM computer. Therefore, we opted for the InferSent model instead, training the model in 2 hours with GeForce GTX 1080 Ti 11GB graphic card.
Arguably, the entailment relation in InferSent does not necessarily equate to a perfect similarity measurement. For example, one can argue that every term in the BP ontology entails the root node BPs. Moreover, the NLP approaches in this article are yet to fully recognize chemical equations. An expression like \documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\usepackage{upgreek}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6}\begin{document}
$$2{H_2} + {O_2}$$
\end{document} may not be seen as strictly equal to \documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\usepackage{upgreek}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6}\begin{document}
$$2{H_2}O$$
\end{document} or the word water. For these reasons, we view NLP methods as ways to refine existing node-based GO metrics. In this article, we have seen that InferSentGO improves the AIC scores. Moreover, InferSentGO does not need to be paired with AIC; it can work with any GO similarity metric that gives scores in the range \documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\usepackage{upgreek}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6}\begin{document}
$$\left[ {0 , 1} \right]$$
\end{document}.
We acknowledge that the literature contains many other methods for measuring GO terms' semantic similarity that were not tested here. There has been much debate regarding which measure should be preferred over the others. However, no clear consensus has been reached (Yang et al., 2012). In this article, as a proof of concept, we show that NLP methods can work together with node-based models to achieve higher accuracy.
To our knowledge, this article is the first to apply neural network-based NLP techniques to compare the semantic meaning of GO terms. Our application suggests that there are great promises in developing NLP methods for this research area.
5. Appendix
5.1. Resnik method
The most basic node-based method introduced by Resnik in 1999 relies on the IC of a GO term. The IC of a GO term t is computed as IC(t) = – log(p(t)), where \documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\usepackage{upgreek}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6}\begin{document}
$$p \left( t \right)$$
\end{document} is the probability of observing a term t in the ontology. \documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\usepackage{upgreek}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6}\begin{document}
$$p \left( t \right)$$
\end{document} is computed as \documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\usepackage{upgreek}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6}\begin{document}
$$\displaystyle p \left( t \right) \; = \; { \frac { { \rm { freq } } \left( t \right) } { { \rm { freq } } \left( \rm { root } \right) } } .$$
\end{document}\documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\usepackage{upgreek}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6}\begin{document}
$${ \rm{freq}} \left( t \right)$$
\end{document} is defined as the cumulative count of term t and its descendants, where \documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\usepackage{upgreek}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6}\begin{document}
$${ \rm{freq}} \left( t \right) \; = \;{ \rm{count}} \left( t \right) \; + \; \sum {_{c \in { \rm{child}} \left( t \right) }} \,{\rm freq} ( c ).$$
\end{document} where \documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\usepackage{upgreek}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6}\begin{document}
$${ \rm{count}} \left( t \right)$$
\end{document} is the number of genes annotated with the term t and \documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\usepackage{upgreek}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6}\begin{document}
$${ \rm{child}} \left( t \right)$$
\end{document} are the children of t. Based on this definition, \documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\usepackage{upgreek}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6}\begin{document}
$${ \rm{IC}} \left( \rm{root} \right) \; = \;0 ,$$
\end{document} and a node near the leaves has higher IC than nodes at upper levels. To compute a similarity score of the GO terms a and b, we find the most informative common ancestor of these two terms.
\documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\usepackage{upgreek}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6}\begin{document}
\begin{align*}
{ \rm{Resnik}} \left( {a , \;b} \right) \; = \; \mathop {{ \rm{max}}} \limits_{p \in \left\{ {{ \rm{par}} \left( a \right) \cap { \rm{par}} \left( b \right) } \right\} } \;{ \rm{IC}} \left( p \right) , \tag{7}
\end{align*}
\end{document}
where \documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\usepackage{upgreek}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6}\begin{document}
$${ \rm{par}} \left( t \right)$$
\end{document} denotes all the ancestors of term t. \documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\usepackage{upgreek}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6}\begin{document}
$${ \rm{Resnik}} \left( {a , \;b} \right)$$
\end{document} ranges from 0 to \documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\usepackage{upgreek}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6}\begin{document}
$$\infty$$
\end{document} because the probability \documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\usepackage{upgreek}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6}\begin{document}
$$p \left( t \right)$$
\end{document} ranges from 0 to 1.
In this model, the similarity score between a GO term t and itself is not 1. Second, when a and b have only the root as a common ancestor, then \documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\usepackage{upgreek}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6}\begin{document}
$${ \rm{Resnik}} \left( {a , \;b} \right) \; = \;0.$$
\end{document} This is problematic because leaf nodes are more informative than other types of nodes. Consider the example in Song et al. (2014). Here, the root is the only common ancestor of the pair a and b and the pair c and d. Next, suppose that a and b are leaf nodes, c is the parent of a, d is the parent of b, and root is the parent of both c and d. One would then expect that \documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\usepackage{upgreek}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6}\begin{document}
$${ \rm{Resnik}} \, \left( {a , \;b} \right) < Resnik \, \left( {c , \;d} \right);$$
\end{document} however, one would obtain \documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\usepackage{upgreek}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6}\begin{document}
$${ \rm{Resnik}} \left( {a , \;b} \right) \; = \;{ \rm{Resnik}} \left( {c , \;d} \right) \; = \;0.$$
\end{document}
5.2. AIC method
The AIC method by Song et al. (2014) amends the two problems in the Resnik method. To encode the fact that leaf nodes are more informative, AIC defines a knowledge function of term t as \documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\usepackage{upgreek}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6}\begin{document}
$$k \left( t \right) \; = \;1 / { \rm{IC}} \left( t \right)$$
\end{document}, which is used to measure its semantic weight \documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\usepackage{upgreek}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6}\begin{document}
$$sw \left( t \right) \; = \;1{ \kern 1pt} / { \kern 1pt} \left( {1 \, \; + \;{ \rm{exp}} \left( { - k \left( t \right) } \right) } \right)$$
\end{document}. Here \documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\usepackage{upgreek}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6}\begin{document}
$$sw \left( \rm{root} \right) \; = \;1.$$
\end{document} Semantic value \documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\usepackage{upgreek}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6}\begin{document}
$$sv \left( t \right)$$
\end{document} of t is then defined as \documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\usepackage{upgreek}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6}\begin{document}
$$sv \left( t \right) \; = \; \sum {_{p \in { \rm{path}} \left( t \right) } \;sw ( p ).}$$
\end{document} Function \documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\usepackage{upgreek}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6}\begin{document}
$${ \rm{path}} \left( t \right)$$
\end{document} contains every ancestor of t and the term t itself. Usually, \documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\usepackage{upgreek}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6}\begin{document}
$$sv \left( a \right) < sv \left( b \right)$$
\end{document} when term a is nearer to the root than b. Song et al. (2014) define their similarity score of two GO terms a and b as
\documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\usepackage{upgreek}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6}\begin{document}
\begin{align*}
{ \rm { AIC } } \left( { a , \;b } \right) \; = \; { \frac { 2 \sum \nolimits_ { p \in \left\{ { { \rm { path } } \left( a \right) \cap { \rm { path } } \left( b \right) } \right\} } { \rm { } } \;sw \left( p \right) } { sv \left( a \right) \; + \;sv \left( b \right) } } . \tag { 8 }
\end{align*}
\end{document}
\documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\usepackage{upgreek}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6}\begin{document}
$${ \rm{AIC}} \left( {a , \;b} \right)$$
\end{document} ranges from 0 to 1. In this model, \documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\usepackage{upgreek}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6}\begin{document}
$${ \rm{AIC}} \left( {a , \;a} \right) \; = \;1.$$
\end{document} When a and b have only the root as the common ancestor, then \documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\usepackage{upgreek}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6}\begin{document}
$${ \rm{AIC}} \left( {a , \;b} \right) \; = \;2 / \left( {sv \left( a \right) \; + \;sv \left( b \right) } \right)$$
\end{document}, which depends on where a and b are on the GO tree.
5.3. Graph-based similarity measure
GraSM is an extension of Resnik, and can be classified as an edge-based method. GraSM analyzes more than just the most informative common ancestor of two GO terms, by looking at their disjunctive common ancestors (Couto et al., 2007). For one GO term a, two of its ancestors are disjunctive if there are different paths from both ancestors to the GO term.
\documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\usepackage{upgreek}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6}\begin{document}
\begin{align*}
{ \rm{DisjAnc}} \left( a \right) \; = \; \{ {c_1} , \;{c_2} \; \vert \; \exists { \rm{path}} \left( {a , \;{c_1}} \right) { \rm{not \ containing }} \ {c_2} \ { \rm{ AND }} \ \exists { \rm{path}} \; \left( {a , \;{c_2}} \right) \;{ \rm{not \ containing}} \;{c_1} \} . \tag{9}
\end{align*}
\end{document}
For two GO terms a and b, suppose the term c1 is in the union set U = DisjAnc(a) \documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\usepackage{upgreek}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6}\begin{document}
$$\cup$$
\end{document} DisjAnc(b). c1 is a common disjunctive ancestor of a and b if for each common ancestor c2 of a and b where \documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\usepackage{upgreek}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6}\begin{document}
$${ \rm{IC}} \left( {{c_1}} \right) < IC \left( {{c_2}} \right)$$
\end{document} we have both \documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\usepackage{upgreek}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6}\begin{document}
$${c_1} , \;{c_2} \in U.$$
\end{document} The similarity measurement for a and b is
\documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\usepackage{upgreek}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6}\begin{document}
\begin{align*}
{ \rm{GraSM}} \left( {a , \;b} \right) \; = \;{ \rm{mean}} \;{ \rm{IC}} \left( c \right) \;{ \rm{where }} \ c \ { \rm{ is \ common \ disjunctive \ ancestor \ of }} \ a , \;b. \tag{10}
\end{align*}
\end{document}
We use the software GOssTo to implement GraSM (Caniza et al., 2014).
5.4. Random walk contribution
We briefly describe RWC's key idea. Unlike many other methods that inspect the ancestors of two given GO terms, RWC is an edge-based approach that analyzes the shared children of two GO terms a and b (Yang et al., 2012). In brief, in the RWC paradigm, GO terms with more common children are more deemed to be more similar.
Define Nc as the number of genes annotated by term c. In RWC, the random walker moves from the parent node p to its direct child c with probability \documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\usepackage{upgreek}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6}\begin{document}
$$\mathbb{P} \left( {p \to c} \right) \; = \;{{{N_c}} \mathord{ \left/ { \vphantom {{{N_c}} { \sum {_{u: \exists p \to u}{ \kern 1pt} } {N_u}.}}} \right. \kern- \nulldelimiterspace} { \sum {_{u: \exists p \to u}{ \kern 1pt} } {N_u}.}}$$
\end{document} As the random walker moves for a very long time, we can denote the probability of ending at a node i from a to be \documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\usepackage{upgreek}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6}\begin{document}
$$W_ \infty ^a \left( i \right)$$
\end{document}. Let L be the set of all leaf nodes in the GO tree. The RWC for two terms a and b is
\documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\usepackage{upgreek}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6}\begin{document}
\begin{align*}
{ \rm{RWC}} \; \left( {a , \;b} \right) \; = \; \mathop \sum \limits_{i , j \in L} {W_ \infty ^a} ( i ) W_ \infty ^b ( j ) {\rm score} ( i , \,j ) , \tag{11}
\end{align*}
\end{document}
where score (i, j) is a generic place holder. For example, score(i, j) can be Resnik (i, j). (Yang et al., 2012) use their RWC to improve score(a, b) by taking the average
\documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\usepackage{upgreek}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6}\begin{document}
\begin{align*}
{ \rm { scor } } { { \rm { e } } _ { { \rm { RWC } } } } \left( { a , \;b } \right) \; = \; \frac { 1 } { 2 } { \kern 1pt } \left( { { \rm { RWC } } \left( { a , \;b } \right) \; + \; { \rm { score } } \left( { a , \;b } \right) } \right). \tag { 12 }
\end{align*}
\end{document}
In this article, we use the software GOssTo to implement RWC with score(a, b) being Resnik and GraSM (Caniza et al., 2014). Currently, GOssTo is unable to take any generic score function as its argument.
Footnotes
Acknowledgments
D.D. is supported by NIH-NLM National Cancer Institute T32LM012424. D.D. and E.E. are supported by National Science Foundation grants 0513612, 0731455, 0729049, 0916676, 1065276, 1302448, 1320589, and 1331176, and National Institutes of Health grants K25-HL080079, U01-DA024417, P01-HL30568, P01-HL28481, R01-GM083198, R01-ES021801, R01-MH101782, and R01-ES022282. W.U.A. and K.-W.C. are supported by National Science Foundation Grant IIS-1657193. J.J.L. is supported by NIH/NIGMS R01GM120507, NSF DMS-1613338, the Sloan Research Fellowship, and the Johnson & Johnson WiSTEM2D Award.
Contributions
D.D. and J.J.L. came up with the problem. D.D. and J.J.L. decided on the data to be used. D.D. downloaded and cleaned the data. D.D., J.J.L., W.U.A., and K.-W.C. came up with the solution. D.D. and W.U.A. coded the solution. D.D. wrote the article. D.D. and J.J.L. discussed the results. D.D., J.J.L., W.U.A., K.-W.C. and E.E. read the article.
Author Disclosure Statement
The authors declare that no competing financial interests exist.
References
1.
CanizaH., RomeroA.E., HeronS., et al.2014. GOssTo: A stand-alone application and a web tool for calculating semantic similarities on the gene ontology. Bioinformatics, 30, 2235–2236.
2.
ConneauA., KielaD., SchwenkH., et al.2017. Supervised learning of universal sentence representations from natural language inference data. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, pgs. 670–680. Copenhagen, Denmark.
3.
CoutoF.M., SilvaM.J., and CoutinhoP.M.2007. Measuring semantic similarity between gene ontology terms. Data Knowl. Eng. 61, 137–152.
4.
DubuissonM.-P., and JainA.K.1994. A modified Hausdorff distance for object matching, 566–568. In Pattern Recognition, 1994. Vol1-Conference A: Computer Vision & Image Processing, Proceedings of the 12th IAPR International Conference on, volume 1. IEEE. Jerusalem, Israel.
5.
DuongD., GaiL., SnirS., et al.2017. Applying meta-analysis to genotype-tissue expression data from multiple tissues to identify eQTLs and increase the number of egenes. Bioinformatics, 33, i67–i74.
6.
Gene Ontology Consortium. 2017. Expansion of the gene ontology knowledgebase and resources. Nucleic Acids Res. 45, D331–D338.
7.
HanleyJ.A., and McNeilB.J.1983. A method of comparing the areas under receiver operating characteristic curves derived from the same cases. Radiology, 148, 839–843.
8.
IslamA., and InkpenD.2006. Second order co-occurrence PMI for determining the semantic similarity of words, 1033–1038. In Proceedings of the International Conference on Language Resources and Evaluation, Genoa, Italy. Citeseer.
9.
JiangJ.J., and ConrathD.W.1997. Semantic similarity based on corpus statistics and lexical taxonomy. In Proceedings of the International Conference Research on Computational Linguistics, 1997. Taiwan.
10.
LeQ., and MikolovT.2014. Distributed representations of sentences and documents, 1188–1196. In Proceedings of the 31st International Conference on Machine Learning (ICML-14).
11.
LevyO., GoldbergY., and Ramat-GanI.2014. Linguistic regularities in sparse and explicit word representations, 171–180. In Proceedings of the 18th Conference on Computational Natural Language Learning, 2014. Baltimore, MD, USA.
12.
LiY., McLeanD., BandarZ., et al.2006. Sentence similarity based on semantic nets and corpus statistics. IEEE Trans. Knowl. Data Eng. 18, 1138–1150.
13.
LinD.1998. An information-theoretic definition of similarity, 296–304. In Proceedings of the 15th International Conference on Machine Learning. Madison, WI, USA.
14.
MazanduG.K., ChimusaE.R., and MulderN.J.2017. Gene ontology semantic similarity tools: Survey on features and challenges for biological knowledge discovery. Brief. Bioinform. 18, 886–901.
15.
MazanduG.K., and MulderN.J.2012. A topology-based metric for measuring term similarity in the gene ontology. Adv. Bioinform. 2012, 975783.
16.
MazanduG.K., MulderN.J.2014. Information content-based gene ontology functional similarity measures: Which one to use for a given biological data type?. PLoS One, 9, e113859.
17.
MikolovT., SutskeverI., ChenK., et al.2013. Distributed representations of words and phrases and their compositionality, 3111–3119. In Advances in Neural Information Processing Systems. Harrahs and Harveys, Lake Tahoe, USA.
18.
PenningtonJ., SocherR., and ManningC.2014. Glove: Global vectors for word representation, 1532–1543. In Proceedings of the 2014 conference on Empirical Methods in Natural Language Processing (EMNLP). Doha, Qatar.
19.
PesaranghaderA., MatwinS., SokolovaM., et al.2015. simDEF: Definition-based semantic similarity measure of gene ontology terms for functional similarity analysis of genes. Bioinformatics, 32, 1380–1387.
20.
PesquitaC., FariaD., BastosH., et al.2008. Metrics for GO based protein semantic similarity: A systematic evaluation. BMC Bioinform. 9, S4.
21.
PesquitaC., FariaD., FalcãoA.O., et al.2009. Semantic similarity in biomedical ontologies. PLoS Comput. Biol. 5, e1000443.
22.
RehurekR., and SojkaP.2011. Gensim–python framework for vector space modelling. NLP Centre, Faculty of Informatics, Masaryk University, Brno, Czech Republic.
23.
ResnikP.1999. Semantic similarity in a taxonomy: An information-based measure and its application to problems of ambiguity in natural language. J. Artif. Intell. Res. 11, 95–130.
SahaA., KimY., GewirtzA.D., et al.2017. Co-expression networks reveal the tissue-specific regulation of transcription and splicing. Genome Res. 27, 1843–1858.
26.
SongX., LiL., SrimaniP.K., et al.2014. Measure the semantic similarity of GO terms using aggregate information content. IEEE/ACM Trans. Comput. Biol. Bioinform. 11, 468–476.
27.
TuanL.A., KimJ.-j., and NgS.-K.2013. Gene ontology concept recognition using cross-products and statistical methods, 174. In BioCreative Challenge Evaluation Workshop vol. Bethesda, MD, USA.
28.
WangJ.Z., DuZ., PayattakoolR., et al.2007. A new method to measure the semantic similarity of go terms. Bioinformatics, 23, 1274–1281.
29.
YangH., NepuszT., and PaccanaroA.2012. Improving go semantic similarity measures by exploring the ontology beneath the terms and modelling uncertainty. Bioinformatics, 28, 1383–1389.