Abstract
Knowledge Graph (KG) is the network which contains some topic-based entities, called nodes, and the associated information among the entities. Here, the concept in the knowledge graph is denoted by the tuple relationship, such that the
Introduction
Link prediction in the knowledge graphs is considered an important process for the prediction of the non-existing relationships among the entities. Several existing works based on the link prediction concentrated on shallow models, which can extend up to the larger knowledge graphs. But, these models learn minimum expressive features from the multi-layer approaches, thereby limits the performance. Knowledge graphs (KG) are the directed graphs comprising the entities and relations as the nodes and edges. They gather the factual information as the triplet form, where the triplets are stored in the form of head and tail. The head and the tails are the entities and the relations, which represent the relationships from head to tail. Even though, the KGs consist of a huge amount of triplets, they are considered the pivotal parameter to determine the presence of the KGs [29]. In general, link prediction is a problem for finding the non-existing connections between the interconnected entities, whereas the links and the entities are denoted as the graph If the symmetric relationships are the edges, then the asymmetric relationship is the arcs. This prediction issue has been defined in the social network analysis community [1]. This type of analysis is considered a major issue in other domains and huge-scale knowledge bases [2]. Link prediction is used for finding the missing data and discovering new facts. The semantic information is encoded as the knowledge graph when dealing with the link prediction issues [3, 53, 54].
Categorization of link prediction approaches in the KG.
Knowledge graphs are considered both inappropriate and imperfect. Various link prediction approaches and error detection techniques have been introduced in recent times. However, some of the techniques focus on the error correction process, whereas some other techniques focus on the issues of selecting the non-existing facts to add to the KG. The knowledge graph contains various types of links, and it is employed for the closed-world assumption. Here, the existing links are considered positive instances. On the other hand, the unknown links are considered the negative instances [9] as it is extracted through the knowledge graph completion. Since the KGs are typically imperfect, a primary problem for the knowledge graph is the estimation of the missing links. In recent times, widespread studies have been performed based on the learning performance [55] of the low-dimensional representations of entities and relations in case of missed link predictions. These approaches have been defined as scalable and efficient [51, 52]. The wide-ranging perception of these techniques is to design and understand the connectivity patterns in knowledge graphs based on the observed knowledge facts. It is important to identify the ways for designing and inferring these patterns, such as symmetry, anti-symmetry, composition, and inversion from the observed facts for predicting the missing links [33].
The main contribution of this survey is to review existing link prediction in knowledge graph mechanisms. The existing methods are classified into embedding-based methods, deep learning methods, knowledge acquisition methods, ranking methods, and representation learning methods. The survey is done regarding the evaluation metrics, publication year, dataset, toolset and so on. Furthermore, the Hits @ N was reviewed for the performance evaluation of the link prediction in the knowledge graph.
The organization of the paper is as follows: Section 2 reviews the existing link prediction in knowledge graph methods, Section 3 explains the research gaps and issues, Section 4 elaborates the analysis of the methods based on performance metrics, year of publication, dataset, and toolset, and finally Section 5 concludes the paper.
This section explains the various methodologies, used in the effective link prediction in knowledge graphs. Accordingly, research papers are analyzed, and the link prediction in the knowledge graph schemes practised are generally classified into five categories, embedding-based methods, deep learning methods, knowledge acquisition methods, ranking methods, and representation learning methods. Figure 1 depicts the categorization of the several approaches related to the link prediction in the knowledge graphs. The existing research works related to the classified link prediction in the knowledge graphs are discussed below as follows.
Embedding-based methods
This subsection elucidates the Embedding-based methods adopted in the different research works as follows.
Chen et al. [4] introduced a Meta relational learning framework for performing the few-shot link prediction in the knowledge graphs. Here, the triples about the relations were predicted-based on the transferring relation-specific Meta information. This type of the Meta learning method achieved enhanced few-shot link prediction in the knowledge graphs. Tay et al. [5] developed a translational embedding approach for the link prediction on the knowledge graphs. Here, the translational approaches were more sensitive to hyper parameters, namely the learning rate and the margin. In addition, the translational principle allocates one spot for the golden triplet in every vector space. The collaboration of the relations and the entities in the vector space may decrease precision. The proposed technique is more simple, robust, and parallelizable to effectively perform the link prediction in the knowledge graphs. Dettmers et al. [6] devised a ConvE, a multi-layer convolution network model for link prediction. This approach identifies the missing relationships among the entities. This approach utilized the 2D convolutions over the embeddings for effective prediction. This technique was the simple multi-layer convolutional framework for link prediction. It was represented by the single convolutional layer, a projection layer with respect to the inner product layer, and the embedding dimensions.
Poole [7] developed a simple enhancement method, called SimplE for learning the embedding relationships in each and every entity. Here, the SimplE complexities were increased with respect to the embedding sizes. This approach derived a bound on the embedding sizes in order to achieve full expressivity. This approach performs effective link prediction as it learns the embedding relationships. Sun et al. [8] introduced an edge-centric embedding approach called the TransEdge in order to interpret the translations among the entity embeddings. Here, the TransEdge achieved effective prediction over the embedding tasks, and the linear and the bilinear mapping functions were operated on the entity embeddings. Thus, this approach achieved effective link prediction. Zhang et al. [11] introduced a KG embedding model called TimE for the link prediction. In this approach, the translational relationship among the entities and the diagonal projection matrix was exploited in the time domain space. In addition, this approach also employed a cross-operation for designing symmetric and inverse relations. This approach effectively captured the diversity distribution among the embeddings of the entity for the different relation patterns. Nathani et al. [18] developed an attention-based feature embedding approach for the link prediction in the knowledge graphs. This approach captured the embedding relationships among the relations and entities, such that this method performed well on the relation prediction. This method achieved effective relation prediction in the knowledge graphs.
Melo and Paulheim [39] introduced a knowledge graph prediction method, called CoCKG (Correction of Confusions in Knowledge Graphs), for fixing the incorrect facts related to the entities. This approach generated the candidate subjects and the objects for fixing the triples. This approach depends on various error detection strategies to ensure the corrected facts, and finally, effective performance was achieved. Trouillon et al. [40] devised a latent factorization approach for the effective link prediction. This approach handled a variety of binary relations. Here, the Neural Tensor Network was employed for understanding the structure of large knowledge bases. In addition, complex embeddings are utilized for the link prediction among the real vectors in the knowledge graphs. Ballandies and Pournaras [32] introduced a pervasive knowledge graph builder for mobile link prediction. Here, the knowledge graphs are enhanced with respect to the automated link predictions based on genetic programming, and are validated to enhance the performance accuracy. The knowledge graph builder was devised for pervasive devices, like smart phones and preserved privacy by localizing all computations. This method achieved effective computations on smart phones and the feasibility of the pervasive human supervision process with respect to the interactions throughput. Sun et al. [33] developed an approach for the knowledge graph embedding, called the RotatE for the link predictions. This approach consists of rotation relations from the source entity to the target entity in the complex vector space. In addition, this method employed a self-adversarial negative sampling approach for the link predictions in the complex vector spaces.
Su et al. [22] introduced a Knowledge Graph Embedding (KGE) based approach for the link prediction among the group of entities among the heterogeneous networks, called GTransA. GTransA constructs the hierarchies among the individual entities and the group entities within the group entities. Here, the individual links with the group links were represented by adding a weight function into the score function, and hence this approach achieved effective link prediction among the knowledge graphs. Guo et al. [23] developed a jointly embedding knowledge graph approach and logical rules for effective link prediction. Here, the t-norm fuzzy logic was designed for minimizing the global loss over the atomic and complex tasks. The embeddings were learned with the triples and the rule for enhancing the prediction with respect to the knowledge interference, and acquisition. Omran et al. [24] introduced an embedding-based approach for the rule-learning in the knowledge graphs for learning the rules and policies from the KG streams. Here, a target-oriented sampling approach was handling the large KGS. A temporal rule-based learning approach was employed for the efficient KG-enabled link prediction results. Zeb et al. [47] introduced an end-to-end learning framework for the knowledge graphs, called the KGEL for the effective link prediction in the knowledge graphs. This approach employed a weighted graph convolutional network for generating high-quality embedding vectors. This method obtained effective link prediction performance among the knowledge graphs.
Deep learning-based methods
In this subsection, the deep learning approaches practised for the link prediction in the knowledge graphs are portrayed below:
Ostapuk et al. [2] introduced a deep learning approach, called the ActiveLink for the effective link prediction in the knowledge graphs. In this approach, ActiveLink utilizes the Bayesian view on the neural link predictors, thereby enhanced the uncertainty sampling for deep active learning. This approach also utilized the link among the entities for enhancing the sampling efficiency. Trivedi et al. [9] developed a deep convolutional knowledge network, called the Know-Evolve for learning the entity representations over time. Here, the occurrence of the facts was modelled by the multivariate point representations for capturing the temporal dependencies among the facts. This approach created an unexplored connection among the relational process and the temporal point process, thereby facilitating the effective link prediction performance. Agibetov and Samwald [19] devised a benchmarking neural embedding approach for the effective link prediction in the knowledge graphs. These predictions were performed with respect to the structural and semantic changes. In addition, a relation-centric connectivity measure was employed for associating the link prediction capability to the KG structure. Moreover, an open source pipleline was utilized for enhancing the embedding accuracy in link prediction.
Neil et al. [25] developed a Graph Convolutional Neural Networks (GCNNs) for the link prediction on graph data. This approach addressed the issues-based on the biomedical knowledge graphs (KGs). Here, a regularized attention mechanism was employed for improving the performance on the datasets. The representations based on the learning model were employed for the automated dataset denoising, and finally, this method achieved effective prediction results. Tay et al. [31] introduced a multi-task neural approach for predicting and encoding the non-discrete attribute information with respect to the relational settings. Here, the neural network was employed in separate heterogeneous networks for the triplet prediction. A multi-task learning approach was utilized for learning the representations related to relations, links, and entities. Hence, this method achieved effective prediction results with respect to triplet classification, and attribute value prediction. Zhang and Chen [36] introduced a heuristic learning technique for link prediction. This technique unifies several heuristics in a single framework, such that all these heuristics were considered to construct a local sub-graph. This approach resolved various link prediction issues, and finally this method achieved improved prediction outcomes.
Kang et al. [37] developed a representation learning model, called the TDN model for the knowledge graphs. This approach integrated the information related to triples, network structure, and text descriptions in the low-dimensional vector space. The evaluation process was performed to achieve effective prediction results on real-world datasets. Hu et al. [38] developed a knowledge selective adversarial network, named KSGAN for the effective link prediction tasks. This approach contains collective semantic information for effective link prediction. This type of embedding model generated a high-quality negative sampling for avoiding the zero-loss problem. Finally, the developed technique enhanced the link prediction performance. Li et al. [45] devised a Recalibration Convolutional network for effective link prediction. This approach was formed by integrating the recalibration concept with the convolutional networks for indicating the strength of the convolutional networks. This approach offered enhanced generalization abilities for designing complex relations. Xian et al. [46] introduced a deep architecture-based adversarial attack method, called Deep Ensemble Coding for the link prediction in the knowledge graphs. Here, a deep linear coding-enabled structure scheme was employed for generating the adversarial examples. This approach was more robust against malicious adversarial attacks; thereby this method achieved effective link prediction tasks. Huang et al. [48] developed a CoRelatE framework for learning the relations among the relations, facts, and entities. In this approach, the relations based on the entities were modelled by utilizing the graph convolutional neural network. Finally, the results were integrated to perform the effective link prediction among the knowledge graphs.
Li et al. [49] introduced a deep reinforcement learning approach for link prediction in the knowledge graphs. This approach also utilized the Long Short Term memory (LSTM) for effective prediction results. Hu et al. [56] developed bi-directional relation aware network (BDRAN) to represent learning, and data extraction in conventional knowledge graphs. Here, the entities’ features are extracted in numerous patterns, relations, and semantic representations. This method obtained effective link prediction performance among the knowledge graphs, but the prediction time was high. Jagvaral [57] implemented a model by combining CNN and bidirectional long short-term memory (BiLSTM) with an attention mechanism. It was used for capturing the correlation among the paths between the entities. This method offered better performance results. However, multistep reasoning was not possible.
Representation learning-based methods
This subsection elaborates on the different researchs employed with the representation learning-based methods.
Li et al. [12] developed an approach, called the Dual Graph Embedding (DGE) to model the first-order and the high-order proximities in the kG. In this technique, auto encoding architecture was employed for facilitating better object-tag relation interference. Here, the dual graph consists of a tag graph and the object graph for determining the high-order object-object and the tag proximities in the KG. Finally, the embeddings derived by the encoder were refined for better link prediction. Chen et al. [13] introduced a multi-hop link prediction approach using reinforcement learning (RL) for the knowledge graphs. In this technique, RL was employed for the accurate prediction results. This method achieved effective prediction results by considering the maximum entropy and the expected return. Nayyeri et al. [15] developed a Rodrigues formula method based on the axis-angle representation for the knowledge graphs. Here, the similarities of the nodes were captured in the embedding space such that the consumption of the memory and the cost of computations were optimized based on the minimal parameters. This method achieved effective link prediction based on reasoning-based learning.
Ji et al. [29] developed a TranSparse approach for the link prediction tasks. In this approach, the structured and the unstructured sparse patterns were evaluated for classifying the triplets and the link prediction tasks. Here, the KGs were modelled for encoding the entities and the relation in order to achieve effective link prediction tasks. Zhang et al. [30] introduced a approach called the Representation learning approach, called the Trans Relation Hierarchical Structure (RHS) for improving knowledge representation learning. This approach utilized the relative positions between the spheres and the vectors for performing two types of tasks, namely predicting the links and classifying the triples. Finally, this method efficiently fused the RHS into the knowledge graph embeddings. Wang et al. [34] developed a representation learning approach, called TransE by the integration of the multimodal encoder with the TransE model. This method was considered an effective method for the knowledge graphs. This method encoded structural knowledge, multimodal knowledge, visual knowledge and structural knowledge. This method effectively improved the link prediction tasks, and the triplet classification tasks. Kong et al. [35] designed a new objective functions, called the DeCom for boosting the link prediction performance. Initially, the entities, and the relationships are decompressed, and then trained in the feature space. Here, the expressive features were extracted by adding extra parameters. This approach learns the common representations with respect to the decompressing network by incorporating various link predictors, and finally achieved enhanced results. Sadeghian et al. [41] developed a differentiable rule mining approach, called DRUM model for learning logical rules, and the scores of confidence. This approach employed a gradient-based optimization approach for the inductive learning tasks based on programming. This approach provides a connection among the evaluation of the confidence scores and the tensor completion. This approach utilized the bidirectional Recurrent Neural Networks for sharing the required information over the learning rules. Saxena [42] introduced an embedded knowledge graph approach for enhancing the multihop connections. This approach employed a text corpus for addressing the incompleteness problems. This approach trained the entity embeddings during evaluation. Here, the link prediction properties were utilized for reducing the knowledge graph incompleteness problem in the multi-hop connections. Wang et al. [50] devised a new representation learning approach, called TransAE, by integrating the multimodal autoencoder with TransE model, where TransE was an efficient learning-based representations for the KGs. Here, the hidden layer from the autoencoder was utilized as a representation of the entities in the TransE approach. This approach encoded both the structural and the multimodal knowledge with respect to the visual and the textual knowledge into the final representation.
Ranking-based methods
In this subsection, the various researchs utilizing the ranking-based techniques for the link prediction in the knowledge graphs are elucidated.
Zhang et al. [3] developed an embedding-enabled link prediction approach for performing the prediction tasks for all the predicates separately. This approach utilized the Bayesian personalized ranking enabled optimization techniques. Here, the properties-based on the topologies of the knowledge graphs were considered for every predicate, thereby effectively performing the link prediction in the knowledge graphs.
Knowledge acquisition methods
The different researchs adopting the knowledge acquisition-based techniques for the link prediction in the knowledge graphs are elucidated in this subsection. Minervini et al. [16] developed an energy-based embedding approach for the knowledge graphs. Here, the learning procedures were employed for computing the various models. In addition, this developed approach was embedding between the relations and the entities of the vector spaces to generate the optimal solution. Ferré [20] designed a concept of nearest neighbours (CNN) for identifying similar entities with respect to the common graph patterns. This approach does not consist of a training phase, but this approach offers interpretable relations for each interference. It is a symbolic interference which provides a symbolic explanation for each interference in such a way that this approach achieved effective link prediction.
Zhang et al. [43] developed a graph embedding approach, called the Hierarchy-Aware Knowledge Graph Embedding (HAKE) for mapping the entities among the polar coordinate system. HAKE differentiated the entities in two categories, such as different level hierarchy, and the same level hierarchy. By integrating the modulus and the phase information, HAKE maps the entities among the polar coordinate system where the radial coordinate represents the modular information, whereas, the angular coordinate represents the phase information. Thus, this method significantly outperformed the existing methods. Li et al. [44] introduced a hierarchical constrained approach for the link prediction in the knowledge graphs. This approach contains several interference patterns for enhancing the link prediction performance. This approach evaluates the optimal margin by identifying the single step and the multistep hierarchical structures. Moreover, this method effectively achieved the link prediction performance.
Other link prediction methods
In this section, the different researchs adopting the other techniques for the link prediction in the knowledge graphs are elucidated. Zhang et al. [10] devised a novel approach, called the ternary relation link prediction on the knowledge graphs. This approach is a path-based model, which utilizes the n-ary relational data. This TRFR framework integrated the hierarchical attention mechanism and the reinforcement learning approach, thereby learning the non-existence relationship among the entities for effective performance. de Assis Costa and de Oliveira [14] developed an approach for the effective data linking. Here, the blocking scheme was employed for improving the effectiveness of the solution. In this approach, the entity embeddings were mapped with the literal information for achieving effective performance on data linking. This method utilizes various literal elements from the triple for improving the effectiveness of the solution. Liu et al. [17] introduced a Multimodal Knowledge Graphs (MMKG) for link prediction. MMKG was an accumulation of three knowledge graphs with respect to the links and the numerical features for all entities and alignments among the KG pairs. This method performed the multi-relational link prediction and the entity matching communities from the resource. Li et al. [21] developed a Hierarchy-based approach for the link prediction in the knowledge graphs. This approach employed an adaptive KG embedding-based link prediction approach for predicting the missing relationship among the entities, thereby achieved effective prediction results.
Analysis using publication year
Analysis using publication year
Analysis in terms of evaluation metrics
Analysis using the classification of link prediction methods.
Balaževic et al. [26] developed a Tensor factorization for the knowledge graph completion, called the TuckER. This technique was based on the TuckER decomposition of a binary sensor of the learned concepts. Here, the TuckER’s number of parameters grows linearly with respect to the relations or entities in the knowledge graph, and finally, this method achieved effective link prediction on the knowledge graphs. Krompaß and Tresp [27] developed an ensemble solution for the link prediction in the knowledge graphs. Here, the ensembles were constructed and then analyzed with respect to certain domain and range constraints. This approach achieved substantially maximized prediction quality with respect to the higher dimensional latent embedding space, thereby enhanced the link prediction results. Ayala Hernández et al. [28] introduced a knowledge graph prediction technique for the evaluation of the workflow. Here, an evaluation tool was employed in order to perform the graph pre-processing, training, and testing operations. This approach offered a visual summary of the graph for enhancing the link prediction in the knowledge graphs. Kotnis and Nastase [1] developed joint embedding approaches for the link prediction in the knowledge graphs. Here, the representations were learned with respect to the positive instances and the negative ones. In addition, the prediction result relies on the low-dimensional vector spaces. However, this method suffers from computational issues; this approach encoded the knowledge through the entities and the relations. Liu, et al. [58] implemented a Relation Aware Graph ATtention network (RAGAT) for identifying the heterogeneous characteristics of KG. In this method, the relation of the parameters was determined for effective performance. This method offered better performance results for large datasets, but the computational time was high.
Analysis in terms of Hits @ N
Analysis using dataset.
This section explains the research gaps and issues of various categorized methods. The challenges faced by the link prediction methods in KGS are given below: In [4] it is necessary to remove the constraints in the entities for enhancing the results of embedding approaches. In [5], the challenge lies in utilizing the employed method in a large community to make performance improvements by narrowing the break among the entity domain and the general-domain entity recognition. In [6], the challenge lies in integrating the results of among the entities for effective link prediction in KGs. In [11], the devised method failed to utilize the word representation features to improve the prediction performance. The devised Model does not enlarge the test set and training set to enhance the scope and efficiency of the model [15]. In [49], the developed approach failed to use multi-task learning for learning the reasoning paths for various query relations in a simultaneous manner. In [48], the major challenge lies in utilizing the embedding, and the representation approaches for the increasingly difficult data in the knowledge graph. In addition, this approach failed to learn the binary relations for the multi-fold embedding relationships.
Analysis and discussion
The analysis and discussion of link prediction in the knowledge using various research papers with respect to the categorization of methods, publication year, performance evaluation metrics, utilized toolset, and the dataset are elaborated in this section.
Analysis using published year
The review for the link prediction in the knowledge graphs using the publication year of various 50 research papers is illustrated in this section. The analysis with respect to the published year is depicted in Table 1. Out of the 50 papers surveyed, more number of research papers were published in 2020.
Analysis using methods
This section illustrates the review with respect to various link prediction methods in knowledge graphs. The various methods used for the link prediction are depicted in Fig. 2. Based on Fig. 2, it is noted that 37% of the research papers used embedding-based approaches, the deep learning approaches were used in 28% of the researches, 23% of the researches utilized representation learning-based methods, raking-based approaches were used in 3% of the researchers. Therefore, from the analysis embedding-based methods are widely developed techniques for the link prediction in knowledge graphs.
Analysis using evaluation metrics
The evaluation metrics are analyzed and discussed in this section. Mean Rank (MR), Mean Reciprocal Rank (MRR), Hits @ N, Area under Curve (AUC), Recall, Median, Mean Absolute Error (MAE), Average Reciprocal Hit Rank (ARHR), Accuracy, Active Relations Relation (ARR), AO Relation (AOR), Precision, Mean Selection Rate (MSR), Mean Alternative Rate (MAR) are the evaluation metrics considered for link prediction in the knowledge graphs. From Table 2, it is clearly shown that Hits @ N is the most frequently used performance metric.
Analysis using values of performance metrics
The analysis using performance metrics value is illustrated in this section. The analysis using Hits @ N is explained in this section.
Evaluation based on Hits @ N
In this section, the evaluation based on Hits @ N is explained. Table 3 shows the review based on Hits @ N specified by five ranges as, 50–60%, 60–70%, 70–80%, 80–90%, and 90–99%. From Table 3, it is shown that the research article [4, 7, 19] achieved enhanced Hits @ N and [13, 24] research papers had low Hits @ N value.
Analysis using dataset
This section explains the analysis carried out based on various datasets used by existing research works. Figure 3 shows the various datasets used for the link prediction in the knowledge graphs. The commonly utilized datasets are Freebase, Word Net, ICEWS, YAGO, DBpedia, BIO-KG, NELL, Kinship, MetaQA. From Fig. 3, it is clear that, Freebase is the most commonly used dataset.
Conclusion
In this study, a survey is made on several link prediction techniques in knowledge graphs. The reviews are performed with respect to 50 research papers, and the collected papers are categorized in terms of several approaches, like embedding-based methods, deep learning methods, knowledge acquisition methods, ranking methods, and representation learning methods. In addition, several sources, such as Google scholar, IEEE, science direct and so on are used for gathering the research papers in this survey. Here, the collected research papers are analysed, and challenges faced by the existing research papers are described. In addition, the analysis of the survey is illustrated in terms of categorization techniques, utilized toolset, datasets used, and evaluation metrics. From this analysis, it is clearly reviewed that embedding-based approach is the most commonly used technique in the research papers. Similarly, Hits @ N is the most frequently utilized metrics in link prediction and also the Freebase dataset is the most commonly used dataset in the existing papers. Various research gaps and problems in the existing methods are considered in future by developing a novel optimization algorithm for the enhanced link prediction performance. Also, an embedding model will be developed to improve the performance of the link prediction method for the large datasets.
