Neighborhood aggregation based graph attention networks for open-world knowledge graph reasoning

Abstract

Knowledge graph reasoning, or completion, aims at inferring missing facts from the existing ones in a knowledge graph (KG). In this work, we focus on open-world knowledge graph reasoning, a task that reasons about entities which are absent from the KG at training time (unseen entities). Unfortunately, the performance of most existing reasoning methods on this problem is unsatisfactory. Recently, some works have used graph convolutional networks to obtain the embeddings of unseen entities for prediction tasks. Graph convolutional networks gather information from an entity’s neighborhood; however, they neglect the fact that neighboring nodes are not equally important. To resolve this issue, we present an attention-based method named NAKGR, which leverages neighborhood information to generate entity and relation representations. The proposed model follows an encoder-decoder architecture. Specifically, the encoder uses a graph attention mechanism to aggregate neighboring nodes’ information as a weighted combination. The decoder employs an energy function to predict the plausibility of each triplet. Benchmark experiments show that NAKGR achieves significant improvements on open-world reasoning tasks. In addition, our model also performs well on closed-world reasoning tasks.

Keywords

Open-world knowledge graph reasoning; neighborhood information; graph attention networks; knowledge representation learning

1 Introduction

Recently, numerous knowledge graphs (KGs) have been constructed, such as Freebase [3], WordNet [25], DBpedia [21], and YAGO [35]. They have been applied in various applications including relation extraction [29], recommender systems [5], and intelligent question answering systems [48]. A typical KG is composed of numerous triplets (head, relation, tail), e.g., (Washington D.C., isCapitalOf, America). However, most KGs currently in use are incomplete in the sense that they do not contain all true triplets, and some contain false facts. For instance, in Freebase, 75% of persons have unknown nationality [43]. Therefore, how to complete KGs or detect false triplets through reasoning methods is an important and challenging task.

While KGs are widely adopted in intelligent applications, a major bottleneck hindering their usage is the incompleteness of manually curated facts, which has led to extensive studies on knowledge graph reasoning (KGR). In recent years, many machine learning models for KGR have been explored and have achieved promising performance. These models predict missing facts by learning entity and relation embeddings. A key problem with most existing methods is that the plausibility of links can be determined for known entities only. However, new entities and relations arise over time. For instance, in the DBpedia 2016-04 release, about 200 new entities emerge every day [33]. If a triplet is not contained in the KG, this does not imply that the corresponding fact is false; it is merely unknown (open-world assumption). Therefore, because KGs evolve, reasoning models need to infer knowledge about entities not observed in the KG.

This research explores the task of open-world KGR, which is a critical but relatively unexplored problem. One major challenge of this task is how to represent unseen entities, since these entities are not available during training. A promising approach is to embed such an entity by aggregating the vectors of its neighbors. For example, in Figure 1, suppose there is an unseen entity “Carles Puyol” (marked gray) which is connected to existing entities but is not included in the KG. We want to infer more facts from existing triplets and answer questions like “What’s Carles Puyol’s nationality?”. By aggregating information from its neighborhood, we can build a representation for Carles Puyol and then predict his nationality.

Fig. 1

A motivating example of an unseen entity. Solid-lined circles and arrows represent entities and relations that already exist in the KG, while dashed ones represent the unseen entity, which is not observed in the KG, and some of its known relations to existing entities.

So far, only a few pieces of research have studied the open-world KGR task [31, 45]. Although these methods achieve competitive results, they require external resources, such as an entity’s name or description, to generate embeddings for unseen entities. Hamaguchi et al. [13] apply graph neural networks on neighbor nodes to get the representations of unseen entities without using external information. However, they treat each node equally without considering their different importance, which is inconsistent with practical scenarios. As shown in Figure 1, if we intend to infer Carles Puyol’s nationality, the triplet (Carles Puyol, Born_in, Catalunya) is more informative than (Carles Puyol, career, Soccer_player). Therefore, we propose an attention-based aggregator that assigns different importance to neighboring nodes in the process of feature aggregation. Our model can be interpreted as an encoder-decoder architecture in which the attention-based aggregator and TransE act as the encoder and decoder, respectively. The encoder employs a graph attention network (GAT) to build representations for unseen entities by aggregating their neighboring nodes. The decoder is used to define the task-oriented objective function. In this study, we choose the TransE [4] model as the decoder, but other KGR models can also be adopted.

The contributions of our research are outlined as follows: (1) We draw attention to an important but relatively unexplored problem of open-world KGR. In particular, we present an attention-based aggregator which differs from previous work and is more suitable for practical scenarios. (2) In contrast to previous reasoning methods, our proposed method captures both entity and relation features in the aggregation steps. (3) Experiments demonstrate the effectiveness of assigning different weights to neighbors.

The rest of this paper is structured as follows. We first review the related work in Section 2. Then, the notations and background are presented in Section 3. The architecture and details of our model are introduced in Section 4. Datasets and experimental results are summarized in Section 5. Finally, we conclude the paper in Section 6.

2 Related work

Our method is related to embedding-based methods, approaches to reasoning concerning unseen entities, recent advancements in applying graph convolution networks to graph-structured data and temporal knowledge graph (TKG) reasoning methods.

2.1 Embedding-based methods

For knowledge representation learning and reasoning, embedding-based methods have attracted a lot of interest in the past few years. Chen et al. [6] provide a comprehensive review of these methods. Embedding-based models usually embed entities as well as relations into a low-dimensional semantic vector space, and then measure the plausibility of each triplet in that space. Among them, TransE [4] is a classical model. TransE interprets a relation as a translation vector between head entity h and tail entity t for each triplet (h, r, t). As shown in Figure 2, it requires the embeddings to satisfy h + r ≈ t for a given triplet in the KG if the fact holds. TransE is inspired by [24], in which linguistic patterns can be represented as linear translations such as J.K. Rowling - Harry Potter ≈ William Shakespeare - Hamlet.

Fig. 2

A simple illustration of TransE.

Despite its simplicity, TransE has flaws in dealing with complex relations, such as one-to-many, many-to-one, and many-to-many relations. To deal with this problem, different variants of TransE have been derived. TransH [41] proposes that entities should have distinct representations when involved in different relations. It projects head and tail entities onto the hyperplane of one specific relation. However, TransH represents entities and relations in the same space, which prevents it from modeling entities and relations precisely. TransR [22] observes that an entity may have multiple aspects depending on specific relations and projects entities and relations into different vector spaces. Although TransR improves significantly over TransE and TransH, it also has several flaws. First, both head and tail entities share the same mapping matrix, which ignores the different attributes of entities. Second, TransR has higher complexity than TransE and TransH due to matrix-vector multiplication. To resolve these problems, TransD [15] constructs dynamic mapping matrices by defining two vectors for each entity-relation pair and replaces matrix-vector multiplication with vector operations to reduce model complexity.

Recently, some works have tried to incorporate external information into embedding-based models. TKRL [47] encodes hierarchical type information into KG representations. IKRL [46] combines images with knowledge graphs for KGR. Its promising performance indicates the significance of visual information for KGR. An et al. [1] propose an accurate text-enhanced KGR method to handle the semantic variety of entities and relations in distinct triplets by exploiting entity descriptions and triple-specific relation mentions.

However, these embedding-based algorithms can only handle entities that are present in the KG at training time. This limitation prevents them from building representations for unseen entities.

2.2 Open-world KGR

To relieve the issue of emerging entities, several methods that generalize to reasoning about unseen entities have been proposed. DKRL [45] utilizes an entity’s description to build a representation for an entity that is absent from the KG. It encodes the semantics of entity descriptions using CBOW and CNN models. ConMask [33] uses relationship-dependent content masking, fully convolutional neural networks, and semantic averaging to extract relationship-dependent embeddings for unseen entities from the textual features of entities and relationships in KGs. Shah et al. [31] propose the OWE model, which maps the embeddings of an entity’s name and description to the graph-based embedding space, by which OWE can perform open-world KGR. Hamaguchi et al. [13] present a Graph-NN model that builds embeddings for unseen entities by exploiting existing elements, without using external resources. However, they assume all local neighbors contribute equally to the entity embedding, whereas heterogeneous neighbors could have different influence. Therefore, it is crucial to design a model that effectively captures the differing impact of local neighbors.

2.3 Graph neural network-based methods

During the past few years, different variants of graph neural networks (GNNs) have been developed, with graph convolutional networks (GCN) [19] being one of them. GNNs [7] can build representations for entities by encoding local graph structures. GCN learns features by conducting convolution on neighboring nodes for node classification. A variant of GCN, named R-GCN, is proposed by [30] to model multi-relational data. R-GCN performs reasoning by reconstructing an edge with an autoencoder architecture and using a parameterized score function. However, the above methods cannot generalize to unseen nodes. In contrast, Hamilton et al. [14] propose GraphSAGE, which can generate representations for previously unseen data by leveraging node attribute information. However, it cannot be directly applied to KGs with multi-relational edges. Shang et al. [32] propose an end-to-end structure-aware convolutional network (SACN), which introduces a weighted GCN to capture the structural information in KGs by utilizing KG node structure, node attributes, and relation types. However, these models are not applicable to open-world KGR.

2.4 TKG reasoning methods

Temporal KGR is another line of work related to our study, as new entities and relations arise over time. Recent studies have demonstrated that incorporating time information into the embeddings can boost the performance of KGR tasks. t-TransE [17] and TAE [16] learn time-aware embeddings by imposing temporal order constraints based on a translation-based score function. In order to encode temporal information directly in the learned embeddings, HyTE [8] associates each timestamp with a corresponding hyperplane. Unlike existing models, which are usually restricted to one time granularity, TA-TransE [11] can deal with temporal facts of varying time granularities by using an LSTM to encode time digits and relations. Recently, TPmod [2] infers missing events by utilizing the Gated Recurrent Unit (GRU) to model the temporal dependency, and achieves state-of-the-art (SOTA) results.

3 Background

This section describes the preliminaries of KGR and GATs. Table 1 lists the key symbols used in this paper.

Table 1
List of notations used in this paper

Symbol	Definition
E	Set of entities
R	Set of relations
G	Set of triplets
$U$	Set of unseen entities
e_h, r, e_t	Head entity, relation, and tail entity
$\mathbf{e}_{h}, \mathbf{r}, \mathbf{e}_{t}$	Embeddings of head entity, relation, and tail entity
$N_{i}$	Neighborhood of node e_i
$W_{1}, W_{2}$	Weight matrices
K	Number of attention heads
c_ij	Attention coefficient between entities e_i and e_j
α_ij	Attention value between entities e_i and e_j

3.1 Preliminaries

A KG can be represented by a collection of triplets G = { (e_h, r, e_t) | e_h, e_t ∈ E, r ∈ R }, where E and R are the entity set and relation set, and e_h, r, e_t represent the head entity, relation, and tail entity, respectively. Note that, unlike in previous methods, in the open-world KGR task e_h (or e_t) may be a previously unseen entity.

For an entity e_i, we denote its neighborhood by $N_{i}$, i.e., all related entities together with the involved relations. Formally,

$N_{i} = \{(r, e_{j}) \mid (e_{i}, r, e_{j}) \in G \land r \in R_{ij}\}$ (1)

where R_ij represents the set of relations between e_i and e_j. We use bold lowercase letters and bold uppercase letters to denote vectors and matrices, respectively.

Given a KG, we would like to learn a neighborhood aggregator A that acts as follows:
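As an illustration of Eq. (1), the neighborhoods can be collected in a single pass over the triplets. The following is a minimal sketch, assuming triplets are stored as plain (head, relation, tail) tuples; the function name `build_neighborhoods` and the toy facts are hypothetical and not part of the original method description. Only outgoing edges are collected, exactly as Eq. (1) is written.

```python
from collections import defaultdict

def build_neighborhoods(triplets):
    """Map each head entity e_i to its list of (r, e_j) pairs, as in Eq. (1)."""
    neighborhoods = defaultdict(list)
    for head, relation, tail in triplets:
        neighborhoods[head].append((relation, tail))
    return dict(neighborhoods)

# Toy facts loosely based on the Figure 1 example (illustrative only).
toy_graph = [
    ("Carles_Puyol", "born_in", "Catalunya"),
    ("Carles_Puyol", "career", "Soccer_player"),
    ("Catalunya", "part_of", "Spain"),
]
neighbors = build_neighborhoods(toy_graph)
```

Here `neighbors["Carles_Puyol"]` holds the two (relation, entity) pairs that an aggregator would consume for that entity.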

For an entity e_i, A aggregates features from its neighborhood to build representations for e_i.

For a triplet (e_i, r, e_j) containing an emerging entity, aggregator A is used to generate an embedding for the previously unseen entity in order to predict the plausibility of the triplet.

When an unseen entity connected to existing entities and relations emerges, we can apply A to its newly established neighborhood to obtain its representation and infer new facts about it.

3.2 GATs

GCN simply assumes that all neighboring nodes contribute equally when aggregating features from a node’s neighborhood. To overcome this disadvantage, Veličković et al. [39] introduce GAT. GAT weighs “important” neighbors more, rather than assigning equal importance to each neighboring node.

The input to a single graph attentional layer is a set of node features, $h = \{{\vec{h}}_{1}, {\vec{h}}_{2}, \dots, {\vec{h}}_{N}\}$, where N denotes the number of nodes. The layer then computes a set of new node features, $h^{'} = \{{\vec{h}}_{1}^{'}, {\vec{h}}_{2}^{'}, \dots, {\vec{h}}_{N}^{'}\}$, based on the input features as well as the graph structure.

To achieve a higher-level representation, GAT applies a shared node-wise feature transformation, specified by a weight matrix W , to every node. After this transformation, it performs self-attention on nodes to compute attention coefficients:

$e_{ij} = a (W {\vec{h}}_{i}, W {\vec{h}}_{j})$ (2)

where a is an attention function. To make coefficients easily comparable across different nodes, GAT normalizes them across all the values in the neighborhood using the softmax function:

$α_{ij} = \operatorname{softmax}_{j} (e_{ij}) = \frac{\exp (e_{ij})}{\sum_{k \in N_{i}} \exp (e_{ik})}$ (3)

where $N_{i}$ is the neighborhood of node i (typically consisting of all i’s first-order neighbors, including i). The output features of node i are defined as:

${\vec{h}}_{i}^{'} = σ (\sum_{j \in N_{i}} α_{ij} W {\vec{h}}_{j})$ (4)

in which σ is an activation function, and α_ij specifies the weighting factor (importance) of node j to i. To make the learning process stable, GAT applies a multi-head attention mechanism to concatenate K attention heads, which is defined as:

${\vec{h}}_{i}^{'} = ∥_{k = 1}^{K} σ (\sum_{j \in N_{i}} α_{ij}^{k} W^{k} {\vec{h}}_{j})$ (5)

where || represents the concatenation operation, $α_{ij}^{k}$ are the attention coefficients derived by the k-th replica, and W^k is the weight matrix specifying the linear transformation of the k-th replica. In the final layer, GAT averages the attention heads instead of concatenating them to output the final embedding, which is defined as:

${\vec{h}}_{i}^{'} = σ (\frac{1}{K} \sum_{k = 1}^{K} \sum_{j \in N_{i}} α_{ij}^{k} W^{k} {\vec{h}}_{j})$ (6)
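To make Eqs. (2)-(6) concrete, the following is a minimal pure-Python sketch of one attention head and of the averaged multi-head output. It is an illustration under stated assumptions rather than the reference GAT implementation: the attention function is passed in as `score_fn` (a dot product is used in the toy call, whereas GAT uses a learned feedforward layer), tanh stands in for σ, and all names and toy values are hypothetical.

```python
import math

def softmax(scores):
    """Numerically stable softmax over raw attention coefficients (Eq. 3)."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [x / total for x in exps]

def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]

def attention_head(h_i, neighbor_feats, W, score_fn):
    """One head: Eq. (2) scores, Eq. (3) weights, and the weighted sum inside Eq. (4)."""
    wh_i = matvec(W, h_i)
    wh_js = [matvec(W, h_j) for h_j in neighbor_feats]
    alphas = softmax([score_fn(wh_i, wh_j) for wh_j in wh_js])
    return [sum(a * wh[d] for a, wh in zip(alphas, wh_js)) for d in range(len(wh_i))]

def averaged_multi_head(h_i, neighbor_feats, weight_matrices, score_fn,
                        activation=math.tanh):
    """Eq. (6): average the K head outputs, then apply the nonlinearity."""
    heads = [attention_head(h_i, neighbor_feats, W, score_fn) for W in weight_matrices]
    K = len(heads)
    return [activation(sum(head[d] for head in heads) / K)
            for d in range(len(heads[0]))]

# Toy call: two neighbors, two identical heads, dot-product scoring (an assumption).
dot = lambda u, v: sum(a * b for a, b in zip(u, v))
identity = [[1.0, 0.0], [0.0, 1.0]]
out = averaged_multi_head([1.0, 0.0], [[1.0, 0.0], [0.0, 1.0]],
                          [identity, identity], dot)
```

In the toy call the first neighbor is more similar to the center node, so it receives the larger attention weight and dominates the output.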

4 Model

4.1 Framework

As illustrated in Figure 3, NAKGR follows an encoder-decoder architecture. Given a triplet (e_h, r, e_t), the encoder embeds entities and relations into a low-dimensional vector space and outputs their embeddings. The decoder measures the plausibility of the triplet and can be substituted by a number of existing KGR models. This setting guarantees the flexibility and extensibility of NAKGR.

Fig. 3

The architecture of NAKGR model.

4.2 Incorporating neighborhood attention

Hamaguchi et al. [13] demonstrate the significance of encoding neighborhood information for KGR. Their encoder takes the average of the feature representations of neighboring nodes as the embedding of an unseen entity. Despite its desirable performance, it neglects the different importance of neighboring entities. In light of this issue, we introduce GATs to aggregate information from the neighborhood while assigning different weights to neighbors. Although GAT has proven useful in many applications, one deficiency is that it ignores relation features when computing node embeddings. As KGs provide semantic relations between entities, it is natural to incorporate the semantics of relations into fact modeling.

In this paper, we enhance GAT by capturing both entity and relation features, as relations are an important part of the KG. As shown in Figure 3, the aggregator takes entity e_h with its neighborhood $N_{e_{h}} = \{(r_{h1}, e_{h1}), (r_{h2}, e_{h2}), \dots, (r_{hn}, e_{hn})\}$ as input and outputs the aggregated representation Ne (e_h) of the entity e_h, where e_hj is the j-th neighbor of entity e_h and r_hj is the relation connecting them.

Specifically, we first perform linear transformations, parameterized by two weight matrices $W_{1} \in ℝ^{m \times m}$ and $W_{2} \in ℝ^{m \times 2 m}$, over the input features. Then we adopt self-attention on entities to compute attention coefficients between entity e_h and its neighboring entity e_hj with relation r_hj, i.e.,

$c_{h, j} = a (W_{1} e_{h}, W_{2} [r_{hj}, e_{hj}])$ (7)

where c_h,j represents the attention coefficient of the entity pair (e_h, e_hj), [,] is the concatenation operation, and a is an attention function. In the experiments, we implement a as a feedforward neural network, parametrized by a weight vector $a_{h} \in ℝ^{2 m}$, followed by the LeakyReLU nonlinearity.

To get the relative attention values, we also apply a softmax function to c_h,j, which is defined in Eq.(8). The calculation process of the relative attention values α_h,j is shown in Figure 4.

Fig. 4

Illustration of attention mechanism.

$α_{h, j} = \operatorname{softmax} (c_{h, j}) = \frac{\exp (\mathrm{LeakyReLU} (a_{h}^{T} [W_{1} e_{h}, W_{2} [r_{hj}, e_{hj}]]))}{\sum_{i = 1}^{n} \exp (\mathrm{LeakyReLU} (a_{h}^{T} [W_{1} e_{h}, W_{2} [r_{hi}, e_{hi}]]))}$ (8)

Finally, we obtain Ne (e_h), the output representation of entity e_h, which aggregates information from its neighboring nodes. To encapsulate more information about the neighborhood and stabilize the learning process, we also apply multi-head attention, i.e.,

$Ne (e_{h}) = σ (\frac{1}{K} \sum_{k = 1}^{K} \sum_{j = 1}^{n} α_{h, j}^{k} W_{2}^{k} [r_{hj}, e_{hj}])$ (9)
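The relation-aware aggregation of Eqs. (7)-(9) can be sketched in a few lines of pure Python. This is a minimal illustration, not the authors' implementation: it assumes entity and relation embeddings share the dimension m, treats the pair (W1 e_h, W2 [r_hj, e_hj]) as a concatenation of length 2m before applying a_h, uses a LeakyReLU slope of 0.2 and tanh for σ (neither value is stated in the text), and all function names and toy parameters are hypothetical.

```python
import math

def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]

def leaky_relu(x, slope=0.2):
    # The slope is an assumption; the text does not specify it.
    return x if x > 0 else slope * x

def relation_aware_weights(e_h, neighborhood, W1, W2, a_h):
    """Eqs. (7)-(8): attention values over (relation, entity) neighbor pairs."""
    query = matvec(W1, e_h)                    # W1 e_h, length m
    scores, messages = [], []
    for r_hj, e_hj in neighborhood:
        message = matvec(W2, r_hj + e_hj)      # W2 [r_hj, e_hj], length m
        joint = query + message                # list concatenation, length 2m
        scores.append(leaky_relu(sum(a * x for a, x in zip(a_h, joint))))
        messages.append(message)
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [x / total for x in exps], messages

def aggregate(e_h, neighborhood, heads, activation=math.tanh):
    """Eq. (9): average the K heads' weighted sums, then apply the nonlinearity."""
    m = len(e_h)
    summed = [0.0] * m
    for W1, W2, a_h in heads:
        alphas, messages = relation_aware_weights(e_h, neighborhood, W1, W2, a_h)
        for alpha, message in zip(alphas, messages):
            for d in range(m):
                summed[d] += alpha * message[d]
    return [activation(value / len(heads)) for value in summed]

# Toy parameters with m = 2 (illustrative values only).
W1 = [[1.0, 0.0], [0.0, 1.0]]
W2 = [[0.0, 0.0, 1.0, 0.0], [0.0, 0.0, 0.0, 1.0]]
a_h = [1.0, 1.0, 1.0, 0.0]
e_h = [1.0, 0.0]
neighborhood = [([0.0, 0.0], [1.0, 0.0]), ([0.0, 0.0], [0.0, 1.0])]
alphas, _ = relation_aware_weights(e_h, neighborhood, W1, W2, a_h)
Ne_eh = aggregate(e_h, neighborhood, [(W1, W2, a_h)])
```

With these toy values the first neighbor obtains the larger attention value, so the aggregated vector `Ne_eh` leans toward its message; an unseen entity would be embedded the same way from its auxiliary triplets.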

4.3 Training objective

We train our model using the following margin-based loss function:

$L = \sum_{(e_{h}, r, e_{t}) \in S} \sum_{(e_{h}^{'}, r, e_{t}^{'}) \in S^{'}} \max (γ + f (e_{h}, r, e_{t}) - f (e_{h}^{'}, r, e_{t}^{'}), 0)$ (10)

where f (·) represents the energy function and γ is the margin hyperparameter. S is the set of positive triplets, while S′ denotes the set of negative triplets, obtained by randomly replacing the head entity e_h or the tail entity e_t in (e_h, r, e_t):

$S^{'} = \{(e_{h}^{'}, r, e_{t}) \mid e_{h}^{'} \in E\} \cup \{(e_{h}, r, e_{t}^{'}) \mid e_{t}^{'} \in E\}$ (11)
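The objective in Eq. (10) and the corruption scheme in Eq. (11) can be sketched as follows. This is a simplified illustration under stated assumptions: each positive triplet is paired with a single sampled negative (a common practical choice, whereas Eq. (10) is written as a double sum), the energy function is supplied by the caller, and the helper names and toy energies are hypothetical. The sampler does not check whether a corrupted triplet happens to be a true fact.

```python
import random

def corrupt(triplet, entities, rng):
    """Eq. (11): replace the head or the tail with a randomly drawn entity."""
    head, relation, tail = triplet
    replacement = rng.choice(entities)
    if rng.random() < 0.5:
        return (replacement, relation, tail)
    return (head, relation, replacement)

def margin_loss(positives, negatives, energy, margin):
    """Eq. (10) over paired samples: sum of max(margin + f(pos) - f(neg), 0)."""
    return sum(max(margin + energy(p) - energy(n), 0.0)
               for p, n in zip(positives, negatives))

# Toy energies (illustrative values only): lower means more plausible.
toy_energy = {("a", "r", "b"): 0.5, ("c", "r", "b"): 3.0, ("a", "r", "c"): 1.0}
loss = margin_loss(
    [("a", "r", "b"), ("a", "r", "b")],
    [("c", "r", "b"), ("a", "r", "c")],
    toy_energy.get,
    margin=2.0,
)
negative = corrupt(("a", "r", "b"), ["a", "b", "c"], random.Random(0))
```

In the toy call the first negative already lies more than the margin above the positive and contributes nothing, while the second contributes 2.0 + 0.5 - 1.0 = 1.5.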

4.4 Decoder

The decoder predicts the plausibility of a training triplet based on the embeddings of the head entity and tail entity output by the encoder. We select the TransE model as the decoder. TransE is one of the most typical embedding-based reasoning models, and we adopt it for its simplicity and ease of training. The decoder measures the plausibility of a training triplet (e_h, r, e_t) with an energy function f (e_h, r, e_t). The energy function of TransE is defined as

$f (e_{h}, r, e_{t}) = ∥ e_{h} + r - e_{t} ∥_{L_{1} / L_{2}}$ (12)

where ∥·∥ is the norm of the vector and r represents the embedding of the relation in the training triplet. Note that other embedding-based reasoning models can also be used as the decoder. To test whether the proposed aggregator can generalize to different scoring functions, we also consider several alternatives in experiments.
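For reference, Eq. (12) amounts to a norm of the translation residual. The snippet below is a minimal sketch with embeddings given as plain Python lists; the function name and the toy vectors are hypothetical.

```python
def transe_energy(e_h, r, e_t, norm=1):
    """Eq. (12): ||e_h + r - e_t|| under the L1 (norm=1) or L2 (norm=2) norm."""
    diff = [h + rel - t for h, rel, t in zip(e_h, r, e_t)]
    if norm == 1:
        return sum(abs(x) for x in diff)
    if norm == 2:
        return sum(x * x for x in diff) ** 0.5
    raise ValueError("norm must be 1 or 2")

# A triplet that satisfies e_h + r = e_t exactly has zero energy.
perfect = transe_energy([1.0, 0.0], [0.0, 1.0], [1.0, 1.0])
# A mismatched tail yields a larger energy, i.e., a less plausible triplet.
mismatch_l1 = transe_energy([1.0, 0.0], [0.0, 1.0], [4.0, 5.0], norm=1)
mismatch_l2 = transe_energy([1.0, 0.0], [0.0, 1.0], [4.0, 5.0], norm=2)
```

Lower energy indicates a more plausible triplet, which is why candidates are later ranked in ascending order of this score.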

5 Experiments

KGR can be roughly categorized into the following two kinds of tasks: first, predicting the truth value of triplets (triplet classification), and second, inference of missing entity (entity prediction). We evaluate NAKGR on both tasks under open-world settings and closed-world settings.

5.1 Datasets

For the open-world KGR tasks, we need datasets whose test sets contain entities that are unseen during training. For the task of triplet classification, we conduct experiments on the datasets released by [13], which are constructed based on WN11. Table 2 shows the details of these datasets. For the entity prediction task, experiments are conducted on a new dataset constructed from FB15k, following a procedure similar to that used in [13]:

Table 2
Entities and triples statistics of datasets released by [13]. The numbers of triples include negative triples

Datasets	Head-1000	Head-3000	Head-5000	Tail-1000	Tail-3000	Tail-5000	Both-1000	Both-3000	Both-5000
Train triplets	108,197	99,963	92,309	96,968	78,763	67,774	93,364	71,097	57,601
Valid triplets	4,613	4,184	3,845	3,999	3,122	2,601	3,799	2,759	2,166
Unseen entities	348	1,034	1,744	942	2,627	4,011	1,238	3,319	4,963
Test triplets	994	2,969	4,919	986	2,880	4,603	960	2,708	4,196
Aux entities	2,474	6,791	10,784	8,191	16,193	20,345	9,899	19,218	23,792
Aux triplets	4,352	12,376	19,625	15,277	31,770	40,584	18,638	38,285	48,425

1. Sampling unseen entities. We first select different ratios (5%, 10%, 20%) of the original test triplets as a new test set $T$ for our inductive scenario ([13] samples N = {1000, 3000, 5000} test triplets). For each test set, we use two strategies, Head and Tail, to construct the candidate unseen entities $U^{'}$. In the Head setting, entities appearing as head entities in $T$ are added to $U^{'}$. In the Tail setting, only entities appearing as tail entities in $T$ are added to $U^{'}$. An entity $e \in U^{'}$ is removed if it has no neighboring nodes, yielding the final unseen entity set $U$. A triplet $(e_{h}, r, e_{t}) \in T$ is also removed from $T$ if $e_{h} \in U \land e_{t} \in U$ or e_h ∈ E ∧ e_t ∈ E, i.e., if both or neither of its entities are unseen.

2. Filtering and splitting datasets. After the first step, we need to make sure that unseen entities do not appear in the final training set and validation set. The original training dataset is split into a new training dataset and an auxiliary dataset. For a triplet (e_h, r, e_t), if e_h ∈ E ∧ e_t ∈ E, we add it to the new training dataset. If $e_{h} \in U \land e_{t} \in E$ or $e_{h} \in E \land e_{t} \in U$, we add it to the auxiliary dataset. Auxiliary triplets are required to obtain the local neighborhood of a source entity at inference time. For the validation triplets, we simply remove the triplets that contain an unseen entity. Table 3 lists the statistics of this dataset.
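The two construction steps above can be sketched as follows. This is an illustrative reading of the procedure, not the script used to build the dataset: it assumes that an entity "has neighboring nodes" when it occurs in at least one original training triplet, that E in the conditions refers to entities outside $U$, and that training triplets with both entities unseen are discarded; the function names and toy triplets are hypothetical.

```python
def sample_unseen_entities(test_triplets, train_triplets, setting):
    """Step 1: choose unseen entities and keep test triplets with exactly one of them."""
    position = 0 if setting == "head" else 2
    candidates = {t[position] for t in test_triplets}
    # Assumption: having neighbors means occurring in some training triplet.
    in_train = {t[0] for t in train_triplets} | {t[2] for t in train_triplets}
    unseen = candidates & in_train
    kept = [t for t in test_triplets if (t[0] in unseen) != (t[2] in unseen)]
    return unseen, kept

def split_training_set(train_triplets, unseen):
    """Step 2: separate the new training triplets from the auxiliary triplets."""
    new_train, auxiliary = [], []
    for t in train_triplets:
        touched = (t[0] in unseen) + (t[2] in unseen)
        if touched == 0:
            new_train.append(t)
        elif touched == 1:
            auxiliary.append(t)
        # Assumption: triplets with both entities unseen are dropped.
    return new_train, auxiliary

def filter_validation(valid_triplets, unseen):
    """Remove validation triplets that contain an unseen entity."""
    return [t for t in valid_triplets if t[0] not in unseen and t[2] not in unseen]

# Toy data (illustrative only).
train = [("a", "r", "b"), ("c", "r", "a"), ("b", "r", "d")]
test = [("c", "r", "b"), ("e", "r", "a")]
unseen, kept_test = sample_unseen_entities(test, train, "head")
new_train, auxiliary = split_training_set(train, unseen)
valid = filter_validation([("c", "r", "d"), ("a", "r", "d")], unseen)
```

In the toy data, entity "c" becomes unseen, its training triplet moves to the auxiliary set, and the test triplet headed by "e" is dropped because "e" has no neighbors from which to aggregate.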

Table 3

Numbers of entities and relations of our processed FB15k dataset

Dataset	#Relation	#Entity	#Unseen entity
Head-5	1,250	12,187	1,460
Tail-5	1,182	12,269	1,330
Head-10	1,170	10,336	2,082
Tail-10	1,126	10,603	1,934
Head-20	990	7,765	2,544
Tail-20	984	8,219	2,351

For the closed-world KGR tasks, we evaluate the NAKGR model on the benchmark WN11 [34], FB13 [34], FB15k [4], and WN18 [4] datasets. WN11 and WN18 are extracted from WordNet [25], which provides semantic relations among words. In WordNet, each entity is a synset consisting of several words and expressing a distinct concept. The semantic relations among synsets include hypernym, hyponym, meronym, and holonym. FB13 and FB15k are two subsets of Freebase [3], which represents general world knowledge. For example, the triplet (Mark Twain, writer_of, The Adventures of Tom Sawyer) denotes the fact that Mark Twain is the writer of The Adventures of Tom Sawyer. Table 4 lists the statistics of the four datasets.

Table 4

Statistics of datasets for closed-world tasks

Dataset	#Rel	#Ent	#Train	#Valid	#Test
WN11	11	38,696	112,581	2,609	10,544
FB13	13	75,043	316,232	5,908	23,733
WN18	18	40,943	141,442	5,000	5,000
FB15k	1,345	14,951	483,142	50,000	59,071

5.2 Baselines

For open-world KGR tasks, we compare our NAKGR model against the following baselines:

• DKRL [45] proposes a reasoning method for KGs under the zero-shot setting that takes advantage of entity descriptions.

• ConMask [33] learns embeddings of an entity’s name and parts of its text description to connect unseen entities to the KG.

• TransE-OWE [31] extends embedding-based KGR models to predict unseen entities by mapping the embeddings of an entity’s name and description to the graph-based embedding space.

• Graph-NNs [13] use a GNN to build representations for unseen entities by exploiting existing triplets in the KG, without relying on external resources.

For closed-world KGR tasks, our model is compared with the following 5 baseline methods:

• TransE [4] is a well-known embedding-based model for reasoning that captures the features of entities and relations.

• DistMult [49] proposes a framework for the knowledge reasoning task, which models relation composition using a simple bilinear formulation.

• ComplEx [37] employs complex-valued embeddings in conjunction with eigenvalue decomposition. It has been shown to achieve SOTA performance on both FB15k and WN18.

• R-GCN [30] is one of the earliest approaches to use GNNs for the KGR task. It introduces a relational graph convolutional network that produces locality-sensitive embeddings, which are then passed to a decoder that predicts missing links in the KG.

• TransE-NMM [27] is a mixture model which encodes an entity as a weighted hybrid representation of its neighborhood.

5.3 Open-world KGR

5.3.1 Triplet classification

Triplet classification task aims to judge whether a given triplet (e_h, r, e_t) is correct or not, which can be viewed as a binary classification task.

Evaluation protocol For the triplet classification task, we set a threshold δ_r for each relation. δ_r is obtained by maximizing the classification accuracy on the validation set. For a given triplet (e_h, r, e_t), if its score is smaller than δ_r, it will be judged as positive, otherwise negative.
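The threshold selection described above can be sketched as a simple search over candidate values on the validation set. This is a minimal illustration that assumes validation energies are grouped by relation and that candidate thresholds are taken from the observed scores; the function names and toy scores are hypothetical.

```python
def best_threshold(scores, labels):
    """Return the threshold maximizing accuracy when score < threshold means positive."""
    candidates = sorted(set(scores))
    # Also try a threshold above every score so that "all positive" is considered.
    candidates.append(candidates[-1] + 1.0)
    best_delta, best_correct = candidates[0], -1
    for delta in candidates:
        correct = sum((s < delta) == y for s, y in zip(scores, labels))
        if correct > best_correct:
            best_delta, best_correct = delta, correct
    return best_delta

def fit_thresholds(validation):
    """validation maps each relation to a list of (energy, is_true) pairs."""
    return {
        rel: best_threshold([s for s, _ in pairs], [y for _, y in pairs])
        for rel, pairs in validation.items()
    }

def classify(relation, energy, thresholds):
    """A triplet is judged positive if its energy is below the relation's threshold."""
    return energy < thresholds[relation]

# Toy validation energies for one relation (illustrative values only).
validation = {"born_in": [(0.4, True), (0.9, True), (1.6, False), (2.3, False)]}
thresholds = fit_thresholds(validation)
```

For the toy relation the selected threshold separates the two true triplets from the two false ones, so a test triplet with energy 1.0 would be classified as positive.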

Experiment settings In this experiment, we select the learning rate α from {0.001, 0.01, 0.05, 0.1}, the margin γ from {0.5, 1, 2, 4}, the embedding dimension d from {50, 100, 150, 200}, and the batch size B from {128, 256, 512, 1024}. The optimal model parameters are estimated according to the performance on the validation set. The best configuration is α=0.01, γ=2, d=100, and B=1024 for all datasets. We randomly sample 64 neighboring entities for each entity.

Evaluation results Table 5 presents the triplet classification results. The results show that: (1) NAKGR outperforms all baselines on the triplet classification task, which indicates the effectiveness of assigning different weights to neighbors. (2) Our model not only outperforms graph neural network-based models, but also performs better than methods using external resources, such as DKRL and TransE-OWE.

Table 5
Evaluation results on open-world triplet classification. Bold indicates the best scores for each dataset

Model	Head-1000	Head-3000	Head-5000	Tail-1000	Tail-3000	Tail-5000	Both-1000	Both-3000	Both-5000
DKRL	63.4	60.5	61.2	64.2	63.9	63.1	64.9	64.0	64.2
ConMask	80.2	73.5	68.6	78.5	73.9	71.4	74.1	69.7	67.8
TransE-OWE	78.2	71.4	66.5	77.1	73.0	69.6	73.7	68.3	67.0
Graph-NNs	84.1	82.2	80.1	81.5	77.3	70.6	81.3	76.0	65.9
NAKGR	85.2	82.5	81.1	82.9	78.7	73.2	82.0	78.6	68.4

5.3.2 Entity prediction

Entity prediction under open-world assumption aims to infer the missing head entity e_h or tail entity e_t for a triplet (e_h, r, e_t) where e_h or e_t is absent from the KG. To tackle this task, we first hide head entity (tail entity) of each testing triplet in Head-R (Tail-R) to produce a missing part. Then we replace the missing part with each entity in the KG, and calculate the score function value according to Eq.(12). Finally, we rank these entities in ascending order, and obtain the rank of the original correct triplet.

Evaluation protocol Following common methods, we use three evaluation metrics: (1) The average rank of correct entities (MR); (2) the mean of inverse ranks (MRR), and (3) the proportion of correct entities in top-10 ranked entities (Hits@10). Considering the fact that there are a large number of one-to-many, many-to-one, and many-to-many relations in the KG, there may be some problems for predictions that rely solely on ranking results. For example, there are multiple correct answers when we predict the tail entity of (Carles Puyol, plays _ for, ?), such as FC _ Barcelona and Spain national football team. However, this will degrade the performance of the model. Hence, corrupted triplets that already exist in KG should be filtered out before ranking. Corresponding, we name the unfiltered setting as “Raw”, and the filtered one as “Filter”. All the results are reported in the “filtered” setting [4]. In both setting, lower MR, higher MRR and Hits@10 are expected.
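The three metrics and the effect of filtering can be sketched as follows. This is a minimal illustration that assumes candidate entities are scored with an energy where lower is better and that ties do not count against the target; the function names and toy energies are hypothetical.

```python
def filtered_rank(target, candidate_energies, known_answers):
    """Rank of `target` by ascending energy, ignoring other known-correct candidates."""
    target_energy = candidate_energies[target]
    better = sum(
        1
        for entity, energy in candidate_energies.items()
        if entity != target
        and entity not in known_answers
        and energy < target_energy
    )
    return better + 1

def ranking_metrics(ranks, k=10):
    """Return (MR, MRR, Hits@k as a fraction) for a list of 1-based ranks."""
    n = len(ranks)
    mean_rank = sum(ranks) / n
    mrr = sum(1.0 / r for r in ranks) / n
    hits = sum(r <= k for r in ranks) / n
    return mean_rank, mrr, hits

# Toy energies for the query (Carles Puyol, plays_for, ?); values are illustrative.
energies = {
    "FC_Barcelona": 0.2,
    "Spain_national_football_team": 0.3,
    "Catalunya": 0.9,
}
target = "Spain_national_football_team"
raw = filtered_rank(target, energies, known_answers=set())
filtered = filtered_rank(target, energies, known_answers={"FC_Barcelona"})
mr, mrr, hits_at_10 = ranking_metrics([1, 2, 20])
```

In the toy query the target is ranked second in the raw setting because another correct answer scores better, and first once that answer is filtered out.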

Experiment settings The optimal model parameters are decided according to the averaged predictive performance on validation set. The best configurations are α=0.001, γ=1, d=100 and B=1024.

Evaluation results Table 6 presents the entity prediction results on Head-10 and Tail-10; the results on the other datasets are similar. NAKGR outperforms all baseline models on all three metrics. Specifically, compared with Graph-NNs, NAKGR achieves approximately 5.5% and 3.2% improvement in the filtered MR and Hits@10 on Head-10, while on Tail-10, NAKGR gains about 7% and 1.5% improvement. The experimental results again verify the significance of treating the neighbors differently. A possible reason is that GAT can filter out noisy neighbors.

Table 6
Evaluation results on open-world entity prediction

Model	Head-10			Tail-10
	MR	MRR	Hits@10	MR	MRR	Hits@10
DKRL	636	0.230	35.7	659	0.201	31.3
ConMask	379	0.299	45.8	510	0.274	39.5
TransE-OWE	434	0.287	41.0	578	0.229	37.8
Graph-NNs	292	0.316	47.6	359	0.292	40.6
NAKGR	276	0.355	49.1	313	0.339	41.2

5.4 Closed-world KGR

Because all entities are observed at training time under the closed-world assumption, the NAKGR model can also perform closed-world KGR tasks. Therefore, we compare NAKGR with TransE, DistMult, ComplEx, R-GCN, and TransE-NMM on the triplet classification and entity prediction tasks. For the triplet classification task, WN11 and FB13 are used for evaluation. For the entity prediction task, FB15k and WN18 are the benchmark datasets.

Experiment settings We test some classical baseline models with the OpenKE toolkit (footnote 1), and the other baseline results are taken from the original papers. We assessed several settings on the validation set to determine the best configuration. The optimal configurations are: α=0.01, γ=2, d=50 and B=512 on WN11; α=0.001, γ=2, d=100 and B=128 on FB13; α=0.001, γ=0.5, d=100 and B=1024 on FB15k; α=0.01, γ=1, d=100 and B=256 on WN18.
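For reproduction purposes, the four configurations above could be kept in a small lookup table such as the one below. The values are copied from the text. The readings of the symbols (α as learning rate, γ as margin, d as embedding dimension, B as batch size) are assumptions, since this section does not restate them, and the names `BEST_CONFIG` and `describe` are hypothetical.

```python
# Best closed-world settings as reported in the text, keyed by dataset.
# Assumed meanings: alpha = learning rate, gamma = margin,
# d = embedding dimension, B = batch size.
BEST_CONFIG = {
    "WN11":  {"alpha": 0.01,  "gamma": 2,   "d": 50,  "B": 512},
    "FB13":  {"alpha": 0.001, "gamma": 2,   "d": 100, "B": 128},
    "FB15k": {"alpha": 0.001, "gamma": 0.5, "d": 100, "B": 1024},
    "WN18":  {"alpha": 0.01,  "gamma": 1,   "d": 100, "B": 256},
}

def describe(dataset):
    """Return a one-line summary of the reported setting for a dataset."""
    cfg = BEST_CONFIG[dataset]
    return (f"{dataset}: alpha={cfg['alpha']}, gamma={cfg['gamma']}, "
            f"d={cfg['d']}, B={cfg['B']}")

print(describe("FB15k"))
```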

Results Table 7 presents the triplet classification results of the NAKGR model and previously published results on the FB13 and WN11 datasets. On FB13, NAKGR achieves an accuracy of 91.1%, which outperforms all other models. On WN11, NAKGR obtains the second-highest accuracy of 88.2%, which is 0.1% below ComplEx. Overall, our proposed model yields the best performance averaged over these two benchmark datasets. Table 8 presents the results of the NAKGR model on the closed-world entity prediction task. NAKGR obtains the best MRR and Hits@10 on FB15k and the best MR and Hits@10 on WN18, whereas ComplEx and DistMult achieve a lower MR on FB15k and ComplEx a higher MRR on WN18. NAKGR outperforms the baseline models more significantly on FB15k than on WN18, which suggests that in a denser dataset, neighbors are more informative and provide more helpful information. The experimental results show that NAKGR also achieves good performance on the closed-world task.

Table 7
Closed-world triplet classification accuracy on WN11 and FB13

Method	WN11	FB13	Avg.
TransE	75.9	81.5	78.7
DistMult	87.1	86.2	86.7
ComplEx	88.3	89.3	88.8
R-GCN	87.5	88.9	88.2
TransE-NMM	86.8	88.6	87.7
NAKGR	88.2	91.1	89.7

Table 8

Closed-world entity prediction results on FB15k and WN18

Method	FB15k			WN18
	MRR	MR	Hits@10	MRR	MR	Hits@10
TransE	0.463	125	47.1	0.495	251	89.2
DistMult	0.654	42	82.4	0.822	655	93.6
ComplEx	0.692	37	84.0	0.941	207	94.7
R-GCN	0.651	80	82.5	0.814	238	95.5
TransE-NMM	0.617	101	65.7	0.786	249	95.0
NAKGR	0.752	74	89.2	0.842	196	95.8

5.5 Discussion

Necessity of incorporating relations In this experiment, we confirm that it is necessary for the aggregator to model the semantics of relations. Specifically, we carry out an ablation study in which we evaluate NAKGR on the test set with the relation information omitted. Table 9 shows that Hits@10 and MRR drop by 2.2% and 6.8%, respectively, on the Head-10 dataset when we remove the relations. Removing the relations from NAKGR thus has a negative impact on the results, which suggests that the relation embeddings play an important role in the prediction task.

Table 9
Effectiveness of relations on Head-10

Model	MRR	Hits@10	Hits@3	Hits@1
NAKGR	0.355	49.1	37.7	24.9
w/o R	0.331	48.0	35.8	23.4
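The 2.2% and 6.8% figures quoted for the ablation appear to be relative drops with respect to the full model. Under that reading (an assumption, since the text does not state it), they follow directly from the Hits@10 and MRR columns of Table 9:

```latex
\[
\frac{49.1 - 48.0}{49.1} \approx 0.022 \;(2.2\%),
\qquad
\frac{0.355 - 0.331}{0.355} \approx 0.068 \;(6.8\%).
\]
```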

Generalization to other scoring functions We demonstrate the usefulness of our aggregator by adopting the TransE model as the decoder. However, our framework is agnostic to the particular choice of decoder. Hence, we replace TransE with ComplEx [37], Analogy [23], and DistMult [49] as the decoder. As presented in Table 10, NAKGR consistently outperforms Graph-NNs on all evaluation metrics with every decoder. Note that TransE leads to the best results for both NAKGR and Graph-NNs.

Table 10

Different scoring functions on Head-10

Encoder	Decoder	MRR	Hits@10	Hits@3	Hits@1
Graph-NNs	ComplEx	0.187	30.648	20.7	12.4
Graph-NNs	Analogy	0.193	31.4	21.8	13.4
Graph-NNs	DistMult	0.227	35.555	25.632	16.0
Graph-NNs	TransE	0.316	48.993	35.2	22.959
NAKGR	ComplEx	0.222	35.262	24.204	15.672
NAKGR	Analogy	0.225	35.555	24.753	15.699
NAKGR	DistMult	0.234	37.166	26.2	16.477
NAKGR	TransE	0.355	49.1	37.7	24.9
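To illustrate what a decoder-agnostic framework means in practice, the sketch below shows three interchangeable scoring functions that consume the same encoder outputs. The formulas are the standard published forms of TransE, DistMult, and ComplEx. The `DECODERS` registry and the `decode` helper are hypothetical names, and Analogy is omitted for brevity. The sketch uses the convention that a higher score is more plausible, so the TransE energy is negated. It is not the authors' implementation.

```python
import numpy as np

def transe_score(h, r, t):
    """Negative L1 translation distance -||h + r - t||_1."""
    return -np.abs(h + r - t).sum(axis=-1)

def distmult_score(h, r, t):
    """Trilinear product sum_i h_i * r_i * t_i."""
    return (h * r * t).sum(axis=-1)

def complex_score(h, r, t):
    """Re(sum_i h_i * r_i * conj(t_i)) for complex-valued embeddings."""
    return np.real((h * r * np.conj(t)).sum(axis=-1))

# Hypothetical registry: the encoder output is passed to any decoder unchanged.
DECODERS = {
    "TransE": transe_score,
    "DistMult": distmult_score,
    "ComplEx": complex_score,
}

def decode(name, h, r, t):
    """Score a triplet (or a batch of triplets) with the chosen decoder."""
    return DECODERS[name](h, r, t)

# Toy real-valued embeddings for TransE and DistMult.
h = np.array([1.0, 2.0])
r = np.array([1.0, 1.0])
t = np.array([2.0, 3.0])
s_transe = decode("TransE", h, r, t)
s_distmult = decode("DistMult", h, r, t)

# Toy complex-valued embeddings for ComplEx.
hc = np.array([1 + 1j])
rc = np.array([1j])
tc = np.array([1 - 1j])
s_complex = decode("ComplEx", hc, rc, tc)
```

Swapping the decoder changes only the entry looked up in the registry; the aggregation step that produces `h`, `r`, and `t` is untouched.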

Influence of the proportion of unseen entities It is reasonable to assume that performance decreases as the ratio of unseen entities increases. We conduct entity prediction experiments on datasets with different sample rates. The results are shown in Figure 5. We can see that an increasing proportion of unseen entities has a negative impact on the model.

Fig. 5

Results on Head-R and Tail-R with different proportions of unseen entities.

6 Conclusion and future work

In this paper, we introduce a novel method for open-world KGR. We present NAKGR, an attention-based aggregator that leverages neighborhood information to efficiently build representations for previously unseen entities. Additionally, NAKGR captures both entity and relation features in a given entity’s neighborhood. Further analysis shows that our encoder can easily be extended to existing models such as ComplEx and Analogy without introducing extra parameters. This makes it possible for our encoder to serve as a component of other KGR models. Experimental results on benchmark datasets demonstrate that NAKGR achieves competitive performance on the open-world KGR task and performs well on the closed-world reasoning task.

There are two major limitations in this study that could be addressed in future research. First, the NAKGR model aggregates information only from immediate neighbors, although multi-hop neighbors could help the model iteratively accumulate knowledge. Second, NAKGR considers only neighborhood information in the feature aggregation steps, although rich additional information, such as visual and textual information, could be integrated into our model.

For future work, we will investigate a more expressive architecture which updates an entity’s representation by aggregating information not only from its direct neighbors, but also from its multi-hop neighborhood. Furthermore, we intend to incorporate side information of entities in other modalities into the graph structures.

Footnotes

1. OpenKE: github.com/thunlp/OpenKE

Acknowledgments

This work is supported by the National Natural Science Foundation of China under Grant No.72071145 and the National Key R&D Program of China under Grant No.2019YFB1704402.

References

1. An, Chen, Han and Sun, Accurate text-enhanced knowledge graph representation learning, In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers) (2018), pages 745–755.

2. Bai, Ma, Zhang and Yu, Tpmod: A tendency-guided prediction model for temporal knowledge graph completion, ACM Transactions on Knowledge Discovery from Data 15(3) (2021), 1–17.

3. Bollacker, Evans, Paritosh, Sturge and Taylor, Freebase: a collaboratively created graph database for structuring human knowledge, In Proceedings of the 2008 ACM SIGMOD international conference on Management of data (2008), pages 1247–1250.

4. Bordes, Usunier, Garcia-Duran, Weston and Yakhnenko, Translating embeddings for modeling multi-relational data, In Advances in neural information processing systems (2013), pages 2787–2795.

5. Catherine and Cohen, Personalized recommendations using knowledge graphs: A probabilistic logic programming approach, In Proceedings of the 10th ACM Conference on Recommender Systems (2016), pages 325–332.

6. Chen, Jia and Xiang, A review: Knowledge reasoning over knowledge graph, Expert Systems with Applications 141 (2020), 112948.

7. Dai, Dai and Song, Discriminative embeddings of latent variable models for structured data, In International conference on machine learning (2016), pages 2702–2711.

8. Dasgupta S.S., Ray S.N. and Talukdar, Hyte: Hyperplane-based temporally aware knowledge graph embedding, In Proceedings of the 2018 conference on empirical methods in natural language processing (2018), pages 2001–2011.

9. Defferrard, Bresson and Vandergheynst, Convolutional neural networks on graphs with fast localized spectral filtering, In Advances in neural information processing systems (2016), pages 3844–3852.

10. Dettmers, Minervini, Stenetorp and Riedel, Convolutional 2d knowledge graph embeddings, In Thirty-Second AAAI Conference on Artificial Intelligence (2018).

11. Garcia-Duran, Dumančić and Niepert, Learning sequence encoders for temporal knowledge graph completion, In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing (2018), pages 4816–4821.

12. Guu, Miller and Liang, Traversing knowledge graphs in vector space, In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing (2015), pages 318–327.

13. Hamaguchi, Oiwa, Shimbo and Matsumoto, Knowledge transfer for out-of-knowledge-base entities: a graph neural network approach, In Proceedings of the 26th International Joint Conference on Artificial Intelligence (2017), pages 1802–1808.

14. Hamilton, Ying and Leskovec, Inductive representation learning on large graphs, In Advances in neural information processing systems (2017), pages 1024–1034.

15. Ji, He, Xu, Liu and Zhao, Knowledge graph embedding via dynamic mapping matrix, In Proceedings of the 53rd annual meeting of the association for computational linguistics and the 7th international joint conference on natural language processing (volume 1: Long papers) (2015), pages 687–696.

16. Jiang, Liu, Ge, Sha, Chang, Li and Sui, Towards time-aware knowledge graph completion, In Proceedings of COLING 2016, the 26th International Conference on Computational Linguistics: Technical Papers (2016), pages 1715–1724.

17. Jiang, Liu, Ge, Sha, Li, Chang and Sui, Encoding temporal information for time-aware link prediction, In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing (2016), pages 2350–2354.

18. Kingma D.P. and Ba, Adam: A method for stochastic optimization, arXiv preprint arXiv:1412.6980 (2014).

19. Kipf T.N. and Welling, Semi-supervised classification with graph convolutional networks, arXiv preprint arXiv:1609.02907 (2016).

20. Lao, Mitchell and Cohen, Random walk inference and learning in a large scale knowledge base, In Proceedings of the 2011 conference on empirical methods in natural language processing (2011), pages 529–539.

21. Lehmann, Isele, Jakob, Jentzsch, Kontokostas, Mendes P.N., Hellmann, Morsey, Van Kleef, Auer, et al., Dbpedia–a large-scale, multilingual knowledge base extracted from wikipedia, Semantic Web 6(2) (2015), 167–195.

22. Lin, Liu, Sun, Liu and Zhu, Learning entity and relation embeddings for knowledge graph completion, In Twenty-ninth AAAI conference on artificial intelligence (2015).

23. Liu, Wu and Yang, Analogical inference for multi-relational embeddings, In International Conference on Machine Learning (2017), pages 2168–2178.

24. Mikolov, Yih W.-t. and Zweig, Linguistic regularities in continuous space word representations, In Proceedings of the 2013 conference of the north american chapter of the association for computational linguistics: Human language technologies (2013), pages 746–751.

25. Miller G.A., Wordnet: a lexical database for english, Communications of the ACM 38(11) (1995), 39–41.

26. Neelakantan, Roth and McCallum, Compositional vector space models for knowledge base inference, In AAAI Spring Symposia (2015).

27. Nguyen D.Q., Sirts, Qu and Johnson, Neighborhood mixture model for knowledge base completion, In Proceedings of The 20th SIGNLL Conference on Computational Natural Language Learning (2016), pages 40–50.

28. Nguyen T.D., Nguyen D.Q., Phung, et al., A novel embedding model for knowledge base completion based on convolutional neural network, In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 2 (Short Papers) (2018), pages 327–333.

29. Riedel, Yao, McCallum and Marlin B.M., Relation extraction with matrix factorization and universal schemas, In Proceedings of the 2013 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (2013), pages 74–84.

30. Schlichtkrull, Kipf T.N., Bloem, Van Den Berg, Titov and Welling, Modeling relational data with graph convolutional networks, In European Semantic Web Conference, Springer (2018), pages 593–607.

31. Shah, Villmow, Ulges, Schwanecke and Shafait, An open-world extension to knowledge graph completion models, In Proceedings of the AAAI Conference on Artificial Intelligence, volume 33 (2019), pages 3044–3051.

32. Shang, Tang, Huang, Bi, He and Zhou, End-to-end structure-aware convolutional networks for knowledge base completion, In Proceedings of the AAAI Conference on Artificial Intelligence, volume 33 (2019), pages 3060–3067.

33. Shi and Weninger, Open-world knowledge graph completion, In Thirty-Second AAAI Conference on Artificial Intelligence (2018).

34. Socher, Chen, Manning C.D. and Ng, Reasoning with neural tensor networks for knowledge base completion, In Advances in neural information processing systems (2013), pages 926–934.

35. Suchanek F.M., Kasneci and Weikum, Yago: a core of semantic knowledge, In Proceedings of the 16th international conference on World Wide Web (2007), pages 697–706.

36. Toutanova, Lin X.V., Yih W.-t., Poon and Quirk, Compositional learning of embeddings for relation paths in knowledge base and text, In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers) (2016), pages 1434–1444.

37. Trouillon, Welbl, Riedel, Gaussier É. and Bouchard, Complex embeddings for simple link prediction, In International Conference on Machine Learning (ICML) (2016).

38. Vaswani, Shazeer, Parmar, Uszkoreit, Jones, Gomez A.N., Kaiser Ł. and Polosukhin, Attention is all you need, In Advances in neural information processing systems (2017), pages 5998–6008.

39. Veličković, Cucurull, Casanova, Romero, Liò and Bengio, Graph attention networks, In International Conference on Learning Representations (2018).

40. Wang, Kulkarni and Wang W.Y., Dolores: Deep contextualized knowledge graph embeddings, In Automated Knowledge Base Construction (2020).

41. Wang, Zhang, Feng and Chen, Knowledge graph embedding by translating on hyperplanes, In AAAI, volume 14 (2014), pages 1112–1119.

42. Wang and Li, Text-enhanced representation learning for knowledge graph, In Proceedings of the Twenty-Fifth International Joint Conference on Artificial Intelligence (2016), pages 1293–1299.

43. West, Gabrilovich, Murphy, Sun, Gupta and Lin, Knowledge base completion via search-based question answering, In Proceedings of the 23rd international conference on World wide web (2014), pages 515–526.

44. Xiao, Huang, Meng and Zhu, Ssp: semantic space projection for knowledge graph embedding with text descriptions, In Thirty-First AAAI conference on artificial intelligence (2017).

45. Xie, Liu, Jia, Luan and Sun, Representation learning of knowledge graphs with entity descriptions, In Thirtieth AAAI Conference on Artificial Intelligence (2016).

46. Xie, Liu, Luan and Sun, Image-embodied knowledge representation learning, In Proceedings of the 26th International Joint Conference on Artificial Intelligence (2017), pages 3140–3146.

47. Xie, Liu and Sun, Representation learning of knowledge graphs with hierarchical types, In IJCAI (2016), pages 2965–2971.

48. Yang, Wang, Liu, Lyu, Wu, She and Li, Enhancing pre-trained language representations with rich knowledge for machine reading comprehension, In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics (2019), pages 2346–2357.

49. Yang, Yih S.W.-t., He, Gao and Deng, Embedding entities and relations for learning and inference in knowledge bases, In Proceedings of the International Conference on Learning Representations (ICLR) 2015 (2015).

50. Zhong, Zhang, Wang, Wan and Chen, Aligning knowledge and text embeddings by entity descriptions, In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing (2015), pages 267–272.
