Learning neighborhood-based embedding sequence for link prediction in temporal knowledge graphs

Abstract

Many real-world knowledge graphs are complex and keep evolving over time. Inferring missing facts in temporal knowledge graphs is a fundamental and challenging task. Previous studies focus on link prediction in static knowledge graphs and therefore extract temporal features poorly. In this paper, we propose a novel deep learning model, KBGAT-BiLSTM, which can handle long-term prediction problems and is suitable for temporal knowledge graphs with complex structures. First, we adapt the Graph Attention Network (GAT) to learn the structural features of the knowledge graph. Then we use a Bidirectional Long Short-Term Memory network (BiLSTM) to learn the temporal features and obtain low-dimensional embeddings of entities and relations. Finally, we employ a scoring function for link prediction in temporal knowledge graphs. Through extensive experiments on the YAGO, WIKI, and ICEWS18 datasets, we demonstrate the effectiveness of our model, compare its performance with several state-of-the-art methods, and further analyze the properties of the proposed method.

Keywords

Knowledge graph; Link prediction; Graph attention network; Temporal

1 Introduction

Link prediction has been widely used in various areas, such as social networks [1, 2], question answering [3], and recommender systems [4]. Real-world knowledge graphs (KGs) keep evolving: entities and relations may appear or disappear. Previous works focus on link prediction in static knowledge graphs; their methods are designed around the structure of the knowledge graph, which makes it difficult to use temporal information effectively. In recent years, link prediction for temporal knowledge graphs (TKGs) has therefore attracted widespread attention. It learns both the structural features and the historical information of TKGs in order to better understand their evolution process and topological structure.

Link prediction for TKGs attempts to find missing relations or entities. For example, in Fig. 1, the quadruple (Trump, visit, China, 11/8/17) consists of two entities (Trump, China), a relation (visit), and a time (11/8/17). Many real-world TKGs lack entities or relations and may even contain wrong quadruples, so we need to correct these wrong quadruples or predict the missing relations or entities. KG embeddings are generally recognized as a fundamental tool for this because they are used extensively in link prediction.

Fig. 1

A subgraph of a temporal knowledge graph, containing relations between entities and temporal information. Entities include places (blue) and people (purple). The dashed lines represent predicted quadruples.

State-of-the-art link prediction methods for TKGs can be broadly classified into three groups: 1) methods based on temporal point processes [5, 6], 2) methods based on time steps [7, 8], and 3) methods based on dynamic networks [9, 10]. The first group models the occurrence of an event as a multidimensional temporal point process through conditional intensity functions and relation scores. It can predict whether an event will occur and when it will occur. However, it cannot model events that occur at the same time step, even though in the real world the same subject entity and relation pair often involves many different object entities simultaneously. The second group predicts events based on time steps, so link prediction cannot be made when the time step of an event is unknown. This is a serious limitation for TKG prediction, because in practice many quadruples lack time information. Moreover, these models are relatively simple, rely on scoring functions, and cannot effectively capture the structural information of the knowledge graph. The last group ignores the importance of relations. Dynamic networks can learn entity embeddings and capture the structural information of TKGs effectively through neural networks and attention mechanisms. Still, they have difficulty capturing the relation features and latent semantic information of neighborhood entities. Because they cannot learn relation embeddings, they neglect the role of relations, which reduces link prediction accuracy.

To address the aforementioned defects, we propose a novel deep learning model, KBGAT-BiLSTM, for modeling dynamic multi-relational directed graphs. Our model builds on Graph Attention Networks to capture the multi-hop neighborhood features of a given entity. Compared with models based on temporal point processes, our model can handle events that occur at the same time step, learn both the structural and the temporal characteristics of TKGs effectively, and predict quadruples that will be added or will disappear in the future. Our model treats TKG data as a dynamic scenario in which each triple, consisting of two entities and their relation, is observed sequentially over time. It can therefore capture the latent temporal features of entities and relations. We conducted experiments on three public temporal knowledge graph datasets to demonstrate the effectiveness and superiority of the proposed model for link prediction in temporal knowledge graphs. Our contributions are summarized as follows:

We propose a novel deep learning model to perform embedding on temporal knowledge graphs. The model is capable of capturing both the structural and the temporal features of entities and relations for link prediction. To the best of our knowledge, we are the first to apply GAT to link prediction of temporal knowledge graphs.

The proposed deep architecture is able to handle long-term prediction tasks. We apply KBGAT at each time step and then employ a BiLSTM to capture the temporal features of entities and relations.

The structure of the rest of the paper is as follows. We first introduce related work in Section 2 and then introduce some concepts and our method in Section 3. Section 4 describes the datasets and experimental results. Finally, we conclude this paper and outline future work in Section 5.

2 Related work

Recently, researchers have proposed different embedding methods for link prediction. We divide these methods into link prediction for static KGs and link prediction for temporal KGs.

2.1 Static KGs link prediction

Link prediction in static knowledge graphs includes reasoning based on translation, Convolutional Neural Networks (CNN), graphs, and hybrid approaches. TransE [11], DISTMULT [12] and ComplEx [13] are based on the translation model. TransE learns embeddings of entities and relations and then uses them to predict triples. Many variants were later derived from TransE, such as TransH [14], TransG [15] and TransR [16]. DISTMULT expresses each relation as a matrix, so that the head entity vector can be transformed into the tail entity vector through a linear transformation by the relation matrix. ComplEx uses complex-valued embeddings to handle various binary relations, including symmetric and antisymmetric relations. ConvKB [17] and ConvE [18] are based on CNN reasoning and use a CNN to analyze the features of triples for link prediction. ConvKB models the relations among entries at the same dimension of the embeddings, which implies that ConvKB generalizes the transitional characteristics of transition-based embedding models. ConvE is a competitive 2D convolutional model that focuses on the local relationships among entries at different dimensions. We summarize the score functions of the translation-based and CNN-based models in Table 1. Graph-based reasoning, such as R-GCN [19], extends the Graph Convolutional Network (GCN) and predicts links through convolution operations on the neighborhood of entities. Hybrid reasoning includes DeepPath [20] and MINERVA [21], which use reinforcement learning to learn path selection strategies in the multi-step reasoning process of knowledge graphs.

Table 1
The score functions of previous translation and CNN models. ||e_i||_p denotes the p-norm of e_i. ⟨·, ·, ·⟩ denotes a tri-linear dot product. Re(e_i) denotes the real part of e_i, and $\bar{e}_{j}$ denotes the complex conjugate of e_j. concat denotes a concatenation operator. f denotes a non-linear function. ∗ denotes a convolution operator. · denotes a dot product. W denotes a set of weight matrices. Ω denotes a set of filters. $\hat{e}_{i}$ denotes a 2D reshaping of e_i

Model	Scoring Function
TransE [11]	||e_i + r_r − e_j||_p
DISTMULT [12]	⟨e_i, r_r, e_j⟩
ComplEx [13]	$\mathrm{Re}(\langle e_{i}, r_{r}, \bar{e}_{j} \rangle)$
ConvKB [17]	concat(f([e_i, r_r, e_j] ∗ Ω)) · W
ConvE [18]	$f(\mathrm{vec}(f([\hat{e}_{i}; \hat{r}_{r}] * \Omega)) W)\, e_{j}$

2.2 Temporal KGs link prediction

Recently, some researchers have attempted to incorporate temporal information into KG link prediction. DyRep [6] captures the interaction process between nodes through two simultaneous temporal point process models and combines the attention mechanism to give neighborhood nodes different weights. DySAT [22] obtains the structural and temporal characteristics of entities at each time step and then performs vector addition to obtain the embedding of entities. Besides, some representation learning techniques [23, 24] model temporal information for link prediction. But they cannot effectively capture structural information and predict future events.

Some studies are based on recurrent graph neural models for temporal graph-structured data. EvolveGCN [25] adapts the Graph Convolutional Network (GCN) [26] model along the temporal dimension without resorting to entity embeddings. It captures the temporal features by using a Recurrent Neural Network (RNN) to evolve the GCN parameters. GCRN [27] aims to learn the structural and temporal features of TKGs through a CNN and an RNN. GN [28] and RRN [29] update entity embeddings through message passing at each time step. DDNE [30] uses a Gated Recurrent Unit (GRU) to learn entity embeddings and obtains the final embeddings based on the interaction of neighborhood entities. However, all of these methods ignore the relation features of graphs, which reduces model performance.

3 Problem definitions and prediction framework

In this section, we introduce the KBGAT-BiLSTM model for link prediction in temporal knowledge graphs. It consists of two parts, the Knowledge Base Graph Attention Network (KBGAT) and a BiLSTM, as shown in Fig. 2. KBGAT is an attention-based feature embedding method that can capture the rich semantic information and latent relations inherent in the neighborhood of a given entity. The BiLSTM can effectively capture the temporal characteristics of TKGs. Therefore, our framework can capture the structural and temporal features of temporal knowledge graphs to predict links that will be added or removed in the future.

Fig. 2

KBGAT-BiLSTM is an end-to-end TKG link prediction model. Given a TKG sequence of length N, G_t = f (G_t-N, G_t-N+1, ⋯ , G_t-1), the model feeds the entity matrix H and the relation matrix G of each TKG snapshot into KBGAT to learn the structural features of the KG. Next, the entity matrix sequence and the relation matrix sequence over the time steps are each fed into a BiLSTM to obtain the final embeddings. Finally, the entity matrix and the relation matrix are mapped back to the original space through a scoring function to obtain the predicted result G_t.

We denote a sequence of length N in a temporal knowledge graph as {G_t-N, G_t-N+1, ⋯ , G_t-1 }. First, we use GAT to capture the structural features of the knowledge graph at each time step and obtain $\{G_{t - N}^{'}, G_{t - N + 1}^{'}, \dots, G_{t - 1}^{'}\}$. Next, we input $\{G_{t - N}^{'}, G_{t - N + 1}^{'}, \dots, G_{t - 1}^{'}\}$ into the BiLSTM to get G_t. Finally, we use a scoring function to map the extracted features back to the original space and obtain the final prediction. The overall procedure is shown in Algorithm 1.

Algorithm 1: Inferring the future network structure through our model

Input: An observed TKG sequence of length N, {G_t-N, G_t-N+1, ⋯ , G_t-1 };

Output: The inferred TKG at the next time step, G_t;

1: t′ ← t − N

2: while t′ < t do

3: Train the KBGAT model on G_t′ to obtain $G_{t^{'}}^{'}$ ;

4: Input $G_{t^{'}}^{'}$ into the BiLSTM model;

5: t′ ← t′ + 1

6: end while

7: BiLSTM model outputs G_t, and predicts the structure of the network through the score function;
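The control flow of Algorithm 1 can be sketched in Python as follows. This is a minimal illustration of the data flow, not the authors' implementation: `encode_structure`, `encode_sequence`, and `decode` are hypothetical callables standing in for KBGAT, the BiLSTM, and the scoring function, and the toy stand-ins in the usage lines are assumptions chosen only to make the example runnable.

```python
from typing import Any, Callable, List, Sequence


def infer_next_snapshot(
    snapshots: Sequence[Any],
    encode_structure: Callable[[Any], Any],
    encode_sequence: Callable[[List[Any]], Any],
    decode: Callable[[Any], Any],
) -> Any:
    """Mirror Algorithm 1: encode each snapshot, run the sequence model, decode."""
    if not snapshots:
        raise ValueError("at least one observed snapshot is required")
    # Steps 1-6: structural encoding of G_{t-N}, ..., G_{t-1}, in time order.
    encoded = [encode_structure(g) for g in snapshots]
    # Step 7: the sequence model summarizes the encoded snapshots ...
    summary = encode_sequence(encoded)
    # ... and the decoder (scoring function) maps the summary back to a graph.
    return decode(summary)


# Toy stand-ins (assumptions): snapshots are sets of triples, the structural
# encoder is the identity, the sequence model keeps every triple seen in the
# last two snapshots, and the decoder returns those triples sorted.
history = [
    {("a", "r", "b")},
    {("a", "r", "b"), ("b", "r", "c")},
    {("b", "r", "c")},
]
predicted = infer_next_snapshot(
    history,
    encode_structure=lambda g: g,
    encode_sequence=lambda seq: set().union(*seq[-2:]),
    decode=sorted,
)
```

Passing the three stages as callables keeps the loop of Algorithm 1 separate from the choice of encoder and decoder, so either stage can be swapped without touching the control flow.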

3.1 Problem definitions

3.1.1 Temporal knowledge graphs

We consider a temporal knowledge graph as a sequence of multi-relational directed graph snapshots, {G_t-N, G_t-N+1, ⋯ , G_t }, where G_t = (V, R_t) represents a directed and unweighted graph at time t. Let V be the set of all entities and R_t be the set of temporal relations within the fixed timespan [t_k-1, t_k]. The number of entities is denoted as |V|, and the number of relations as |E|. An event between two entities e_i and e_j at time t is represented by a triplet $t_{ij}^{k} = (e_{i}, r_{k}, e_{j})$.

In a static knowledge graph, the link prediction problem generally aims to predict entities or relations from the current KG, and it mainly focuses on the structural features of the KG. Link prediction in a TKG must capture not only these structural features but also the temporal features that follow from the dynamic evolution of previous snapshots, so as to predict the future status of the KG. Our goal is to learn the structural and evolutionary properties of knowledge graphs.

3.1.2 Link prediction in temporal knowledge graphs

In temporal knowledge graphs, we denote a series of snapshots as {G_t-N, G_t-N+1, ⋯ , G_t-1 }, where N is the number of time steps. The goal of the link prediction task is to predict the structure at the next time step t, which can be formally described as: $G_{t} = f (G_{t - N}, G_{t - N + 1}, \dots, G_{t - 1})$ (1) where f (·) is the model that we construct in this paper and G_t represents the predicted result. Real-world KGs keep evolving over time. In this paper, we record the change process of KGs in the form of snapshots. If the time granularity is a year, Fig. 1 includes five network snapshots.
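To make Eq. (1) concrete, the sketch below cuts a snapshot sequence into (history, target) pairs, where each history holds N consecutive snapshots and the target is the snapshot that follows. The function name `sliding_windows` and the string placeholders for snapshots are assumptions for illustration; the paper does not state how training windows are formed.

```python
from typing import List, Sequence, Tuple, TypeVar

G = TypeVar("G")  # any snapshot representation


def sliding_windows(snapshots: Sequence[G], n: int) -> List[Tuple[List[G], G]]:
    """Pair every run of n consecutive snapshots with the snapshot after it."""
    if n < 1:
        raise ValueError("window length n must be at least 1")
    return [
        (list(snapshots[i - n:i]), snapshots[i])  # (G_{t-N..t-1}, G_t)
        for i in range(n, len(snapshots))
    ]


# Five yearly snapshots and a history length of N = 2.
windows = sliding_windows(["G1", "G2", "G3", "G4", "G5"], n=2)
```

Each pair corresponds to one application of f in Eq. (1): the first element is the input sequence and the second is the structure to be predicted.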

3.1.3 Neighborhood-based embeddings sequence

Given an entity h ∈ V in a temporal knowledge graph, we capture the features of entities and relations in a multi-hop neighborhood of h at each time step. We denote the output representations of the entity h as {h₁, h₂, ⋯ , h_t }, which are fed into an RNN to model long-term dependencies among the entities. {h₁, h₂, ⋯ , h_t } is defined as the neighborhood-based embedding sequence.

With the defined sequence, the changes of entity embeddings over time are made explicit, and we can infer the hidden structure of entities from the sequence. It is worth mentioning that the historical neighborhood of an entity can influence its current neighborhood. Meanwhile, the features of relations, which are an integral part of TKGs, are easily overlooked. We therefore also capture the features of relations at each time step and denote the output representations of a relation as {r₁, r₂, ⋯ , r_t }.

Before introducing the model in detail, we will give some terms and notations that will be used in this section in Table 2.

Table 2
Terms and notations used in KBGAT-BiLSTM

Symbol	Definition
h_i, h_j	embeddings of entities i, j
g_k	embedding of relation k
H, G	embedding matrices of entities and relations
N_e, N_r	the number of entities and relations
T, P	feature dimensions of entities and relations
C_t, h_t	cell state and hidden state of BiLSTM
W_x, b_x, x ∈ {f, i, o, c}	weights and biases of the three gates and the cell update in BiLSTM
M	the number of heads in multi-head attention
σ, tanh	sigmoid and tanh activation functions
W_y, y ∈ {1, 2, E, R}	weights of the KBGAT model

3.2 KBGAT model

GAT improves on the Graph Convolutional Network (GCN) [26] by introducing an attention mechanism. Whereas GCN aggregates information from all neighbors equally, GAT uses attention to assign different weights to neighbor entities. Since existing link prediction models process triples independently and cannot capture the multi-hop information and latent relations of neighbor entities, Nathani et al. [31] extended GAT and proposed the KBGAT model. First, we calculate the attention coefficients of the neighbor entities. For a particular triplet $t_{ij}^{k} = (e_{i}, r_{k}, e_{j})$ , we learn a new embedding of the triplet through a linear transformation by the weight matrix W₁: $c_{ijk} = W_{1} [h_{i} \,\|\, h_{j} \,\|\, g_{k}]$ (2) where ∥ is the concatenation operation. We then apply the LeakyReLU non-linearity to obtain the attention value: $b_{ijk} = \mathrm{LeakyReLU}(W_{2} c_{ijk})$ (3) Finally, we normalize b_ijk to get the relative attention value: $\alpha_{ijk} = \mathrm{softmax}_{jk}(b_{ijk}) = \frac{\exp(b_{ijk})}{\sum_{n \in N_{i}} \sum_{r \in R_{in}} \exp(b_{inr})}$ (4)

where N_i represents the neighborhood of entity e_i, R_in represents the set of relations between entities e_i and e_n, and α_ijk represents the attention coefficient of entity e_i to neighbor entity e_j under relation k. Figure 3 shows how the attention coefficient of a triple is calculated. To capture multiple kinds of semantic information in the neighborhood, we introduce a multi-head attention mechanism to obtain a new entity embedding vector: $h_{i}^{'} = \sigma\left(\frac{1}{M} \sum_{m = 1}^{M} \sum_{j \in N_{i}} \sum_{k \in R_{ij}} \alpha_{ijk}^{m} c_{ijk}^{m}\right)$ (5) where $h_{i}^{'}$ is the new embedding of entity e_i. We apply GAT to KGs and learn new entity and relation embeddings through the above formula. We denote the resulting entity matrix by H′ and the relation matrix by G, and apply linear transformations with the weight matrices W_E and W_R, respectively. However, after a series of linear transformations, the new embedding may lose the original embedding information. To solve this problem, we add the original entity matrix H to the linearly transformed entity matrix: $G^{'} = W_{R} G, \qquad H^{''} = W_{E} H^{'} + H$ (6)

Fig. 3

Calculation of attention value between entities.

The new entity and relation embedding matrices H″ and G′ capture the structural features of the KG at the current time step. Each entity captures its neighborhood information according to the attention coefficients. After the layers are stacked, the model can effectively capture multi-hop neighborhood information.
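The attention computation of Eqs. (2) to (4), followed by the weighted sum that appears inside Eq. (5), can be illustrated for a single head in plain Python. This is a sketch under stated assumptions rather than the authors' code: W₂ is taken to be a single row vector `w2` so that b_ijk is a scalar, the LeakyReLU slope of 0.2 is an assumed default, and the activation σ and the averaging over M heads in Eq. (5) are left out. The helper names are hypothetical.

```python
import math
from typing import List, Sequence, Tuple

Vector = List[float]


def matvec(W: Sequence[Sequence[float]], x: Sequence[float]) -> Vector:
    """Multiply matrix W (list of rows) by vector x."""
    return [sum(w * v for w, v in zip(row, x)) for row in W]


def leaky_relu(x: float, slope: float = 0.2) -> float:
    """LeakyReLU with an assumed negative slope of 0.2."""
    return x if x > 0 else slope * x


def triple_attention(
    h_i: Vector,
    neighbors: Sequence[Tuple[Vector, Vector]],
    W1: Sequence[Sequence[float]],
    w2: Sequence[float],
) -> Tuple[List[float], Vector]:
    """Single-head attention over the (h_j, g_k) pairs around entity i."""
    # Eq. (2): c_ijk = W1 [h_i || h_j || g_k]
    cs = [matvec(W1, h_i + h_j + g_k) for h_j, g_k in neighbors]
    # Eq. (3): b_ijk = LeakyReLU(w2 . c_ijk)
    bs = [leaky_relu(sum(a * b for a, b in zip(w2, c))) for c in cs]
    # Eq. (4): softmax over all neighbor/relation pairs (max-shifted for stability)
    shift = max(bs)
    exps = [math.exp(b - shift) for b in bs]
    total = sum(exps)
    alphas = [e / total for e in exps]
    # Inner sum of Eq. (5): attention-weighted combination of the c_ijk
    dim = len(cs[0])
    aggregated = [sum(a * c[d] for a, c in zip(alphas, cs)) for d in range(dim)]
    return alphas, aggregated


# One-dimensional toy embeddings; W1 maps the 3-dim concatenation to 2 dims.
W1 = [[1.0, 0.0, 0.0], [0.0, 1.0, 1.0]]
w2 = [1.0, 1.0]
alphas, agg = triple_attention([1.0], [([2.0], [3.0]), ([0.0], [0.0])], W1, w2)
```

In the toy call the first neighbor receives the larger attention weight because its transformed triple embedding has the larger score, which is the behavior that distinguishes attention-based aggregation from the equal weighting of GCN.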

3.3 BiLSTM network

RNNs are mainly used to process sequential data. As the training time and the number of network layers increase, vanishing and exploding gradients are prone to occur, which makes it impossible to process longer sequences. In 1997, Hochreiter and Schmidhuber proposed the LSTM network [32], which addresses the vanishing and exploding gradient problem and the long-term dependence problem by improving the hidden structure.

A unidirectional LSTM cannot encode information from back to front. Sometimes the output at the current moment is related not only to the previous state but also to the future state. The bidirectional LSTM network (BiLSTM) was proposed to solve this problem. The BiLSTM network contains two LSTM networks running in opposite directions, as shown in Fig. 4. The embedding sequences of entities and relations are used as input, and the final embedding vectors of entities and relations are obtained by extracting temporal features through the BiLSTM. Hence, the model can capture the context information of a sequence precisely. The hidden state of the LSTM propagating from front to back is denoted as $\overrightarrow{h}_{t} = \mathrm{LSTM}(x_{t}, \overrightarrow{h}_{t - 1})$ , the hidden state propagating from back to front is denoted as $\overleftarrow{h}_{t} = \mathrm{LSTM}(x_{t}, \overleftarrow{h}_{t + 1})$ , and the hidden state of the bidirectional LSTM is denoted as $h_{t} = [\overrightarrow{h}_{t}, \overleftarrow{h}_{t}]$ .
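The way the two directions are combined into h_t can be shown with a small bidirectional wrapper. The sketch below is deliberately not an LSTM: the recurrent cell is a pluggable function, and the usage lines use a running sum as a toy cell (an assumption made so that the states are easy to follow by hand). Only the forward pass, the backward pass, and the pairing of states are illustrated.

```python
from typing import Callable, List, Sequence, Tuple, TypeVar

X = TypeVar("X")  # input at one time step
H = TypeVar("H")  # hidden state


def bidirectional(
    xs: Sequence[X],
    cell_fwd: Callable[[X, H], H],
    cell_bwd: Callable[[X, H], H],
    h0: H,
) -> List[Tuple[H, H]]:
    """Return [(forward h_t, backward h_t)] for every time step t."""
    # Forward direction: h_t depends on x_t and h_{t-1}.
    forward: List[H] = []
    h = h0
    for x in xs:
        h = cell_fwd(x, h)
        forward.append(h)
    # Backward direction: h_t depends on x_t and h_{t+1}.
    backward: List[H] = []
    h = h0
    for x in reversed(xs):
        h = cell_bwd(x, h)
        backward.append(h)
    backward.reverse()  # realign with time order
    # The bidirectional state pairs both directions at each step.
    return list(zip(forward, backward))


# Toy cell (assumption): a running sum instead of real LSTM gates.
states = bidirectional([1, 2, 3], lambda x, h: x + h, lambda x, h: x + h, 0)
```

At the middle step the forward state has seen inputs 1 and 2 while the backward state has seen 3 and 2, which is the sense in which each h_t carries context from both the past and the future of the sequence.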

Fig. 4

The BiLSTM network consists of two LSTM networks in opposite directions. The input of the network (yellow) represents the embedding sequence of entities and relations, and the output of the network (green) represents the final embedding vector of entities and relations.

3.4 KG prediction

Our model borrows the idea of a scoring function from [23]. Given the learned embeddings of neighbor entities e_i and e_j, we input their relation sequence {r₁, r₂, ⋯ , r_k } into the LSTM and obtain a new relation r_seq. The new relation contains temporal information, which is used to determine the probability that the triple $t_{ij}^{k} = (e_{i}, r_{seq}, e_{j})$ is valid at a certain time step in the future. The score function is defined as: $d_{t_{ij}} = (h_{i} \odot h_{j}) \, g_{seq}^{T}$ (7) where h_i and h_j are the embeddings of entities e_i and e_j, and g_seq is the embedding of r_seq. Given a set of valid triplets S, we construct a set of invalid triplets S′ by corrupting either the head or the tail entity: $S^{'} = \{(e_{i}^{'}, r_{k}, e_{j}) \mid e_{i}^{'} \in V, (e_{i}^{'}, r_{k}, e_{j}) \notin S\} \cup \{(e_{i}, r_{k}, e_{j}^{'}) \mid e_{j}^{'} \in V, (e_{i}, r_{k}, e_{j}^{'}) \notin S\}$ . The training objective is to minimize the margin-based ranking loss: $L (\Omega) = \sum_{t_{ij} \in S} \sum_{t_{ij}^{'} \in S^{'}} \max \{d_{t_{ij}^{'}} - d_{t_{ij}} + 1, 0\}$ (8) Minimizing this objective learns the neural network parameters of the model discussed above and encourages the scores of positive triplets to be higher than the scores of negative triplets.
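Equations (7) and (8) and the construction of S′ translate directly into a few lines of Python. The sketch below assumes embeddings are plain lists of floats and that the loss pairs every valid triplet with every invalid one, exactly as the double sum in Eq. (8) is written; the function names are hypothetical, and the margin defaults to the constant 1 that appears in the equation.

```python
from typing import List, Sequence, Set, Tuple

Triple = Tuple[str, str, str]


def score(h_i: Sequence[float], h_j: Sequence[float], g_seq: Sequence[float]) -> float:
    """Eq. (7): (h_i ⊙ h_j) g_seq^T, i.e. a sum of element-wise triple products."""
    return sum(a * b * c for a, b, c in zip(h_i, h_j, g_seq))


def corrupt(triple: Triple, entities: Sequence[str], valid: Set[Triple]) -> List[Triple]:
    """Build S' for one triple by replacing its head or its tail entity."""
    e_i, r_k, e_j = triple
    heads = [(e, r_k, e_j) for e in entities if (e, r_k, e_j) not in valid]
    tails = [(e_i, r_k, e) for e in entities if (e_i, r_k, e) not in valid]
    return heads + tails


def margin_ranking_loss(
    pos_scores: Sequence[float], neg_scores: Sequence[float], margin: float = 1.0
) -> float:
    """Eq. (8): sum of max(d_neg - d_pos + margin, 0) over all pairs."""
    return sum(max(n - p + margin, 0.0) for p in pos_scores for n in neg_scores)


d = score([1.0, 2.0], [3.0, 4.0], [5.0, 6.0])
negatives = corrupt(("a", "r", "b"), ["a", "b", "c"], {("a", "r", "b")})
loss = margin_ranking_loss([2.0], [0.5, 3.0])
```

In the loss example only the negative that scores 3.0 contributes, because the negative scoring 0.5 is already more than one margin below the positive score of 2.0.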

4 Experiments

4.1 Datasets

We compare our proposed method with other methods on three real-world TKG datasets that are widely used for link prediction. The entities in these datasets represent people, places, institutions, etc., and the relations represent the connections between entities. The ICEWS18 [33] dataset is an event-based temporal knowledge graph with facts of the form (e_i, r_k, e_j, t), whereas the WIKI [7] and YAGO [34] datasets have time spans of the form (e_i, r_k, e_j, [t_s, t_e]), where t_s is the starting time point and t_e is the ending time point. To facilitate comparative experiments, we converted the WIKI and YAGO datasets to the format of the ICEWS18 dataset. The details of these datasets are as follows.

YAGO: It is a semantic knowledge base that aggregates data from various sources, including Wikipedia. YAGO contains many famous places and people. These places and people naturally contain many attribute values, such as height, weight, age, population size, etc. The YAGO dataset is collected from 2013 to 2017.

WIKI: Wikidata originated from Wikipedia. It is a multilingual encyclopedic knowledge base that can be edited collaboratively, and it is intended to hold structured knowledge extracted from Wikipedia, Wikisource, and Wiki guide. Entities in Wikidata carry language tags, aliases, expressions, statements, etc. The WIKI dataset is collected from 2008 to 2017.

ICEWS18: The Integrated Crisis Early Warning System (ICEWS) dataset is generated by the BBN ACCENT event coder, which automatically extracts events from news articles and adds temporal information. The data is usually generated once a day. The ICEWS18 dataset is collected from 1/1/2018 to 10/31/2018.

The statistics of the three datasets are summarized in Table 3. Before training, we preprocess the datasets by removing repeated quadruples and sorting the remaining quadruples in ascending order of time step.
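The preprocessing described above can be sketched as follows. The deduplication and sorting follow the text directly. The conversion from time spans to event-style quadruples is an assumption: the paper does not say how the spans were converted, so `spans_to_events` simply emits one quadruple per integer time step in [t_s, t_e]. Both function names are hypothetical.

```python
from typing import Iterable, List, Tuple

Quad = Tuple[str, str, str, int]             # (e_i, r_k, e_j, t)
SpanFact = Tuple[str, str, str, int, int]    # (e_i, r_k, e_j, t_s, t_e)


def spans_to_events(facts: Iterable[SpanFact]) -> List[Quad]:
    """Assumed conversion: one quadruple per time step inside each span."""
    events: List[Quad] = []
    for e_i, r_k, e_j, t_s, t_e in facts:
        for t in range(t_s, t_e + 1):
            events.append((e_i, r_k, e_j, t))
    return events


def preprocess(quadruples: Iterable[Quad]) -> List[Quad]:
    """Remove repeated quadruples and sort by time step (ties broken by content)."""
    return sorted(set(quadruples), key=lambda q: (q[3], q[0], q[1], q[2]))


events = spans_to_events([("a", "r", "b", 2014, 2015), ("c", "r", "d", 2013, 2013)])
cleaned = preprocess(events + [("a", "r", "b", 2014)])  # one duplicate added
```

Breaking ties on the entity and relation fields is not required by the text; it is included only so that the output order is deterministic.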

Table 3
Dataset statistics

Dataset	Entities	Relations	Training	Validation	Test	Snapshot length
YAGO	10623	10	161540	19523	20026	189
WIKI	12554	24	539286	67538	63110	232
ICEWS18	23033	256	373018	45995	49545	304

4.2 Evaluation metrics and competitors

4.2.1 Evaluation metrics

In this paper, we use three evaluation metrics: Mean Rank (MR), Mean Reciprocal Rank (MRR) and Hits@1/3/10. When predicting the missing head or tail entity of a quadruple, we use the score function to rank all entities and obtain the rank of the correct entity. MR is the average rank of the correct entity over the predicted results. MRR is the average of the reciprocal rank of the correct entity. Hits@1/3/10 is the proportion of predictions in which the correct entity is ranked in the top 1/3/10. To reduce the generalization error, we divide each dataset into three subsets by time step: train (80%), valid (10%), and test (10%). In the link prediction task, we remove the head entity or the tail entity to obtain the quadruples (? , r_k, e_j, t) and (e_i, r_k, ? , t). Our model generates two entity lists of length N, which are the prediction results for the head entity and the tail entity, respectively. We then evaluate the results with the three metrics above.
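The three metrics can be computed from the rank of the correct entity in each prediction, as in the sketch below. Two details are assumptions because the text does not specify them: a higher score is treated as better, and ties are resolved optimistically (the rank is one plus the number of strictly higher-scoring candidates). The function names are hypothetical.

```python
from typing import Dict, Sequence


def rank_of(scores: Dict[str, float], correct: str) -> int:
    """1-based rank of the correct entity; higher scores rank first (assumed)."""
    target = scores[correct]
    return 1 + sum(1 for s in scores.values() if s > target)


def mean_rank(ranks: Sequence[int]) -> float:
    """MR: average rank of the correct entity."""
    return sum(ranks) / len(ranks)


def mean_reciprocal_rank(ranks: Sequence[int]) -> float:
    """MRR: average of 1 / rank."""
    return sum(1.0 / r for r in ranks) / len(ranks)


def hits_at(ranks: Sequence[int], k: int) -> float:
    """Hits@k: fraction of predictions whose correct entity is in the top k."""
    return sum(1 for r in ranks if r <= k) / len(ranks)


example_rank = rank_of({"a": 0.9, "b": 0.5, "c": 0.7}, correct="c")
ranks = [1, 2, 10]
mr, mrr = mean_rank(ranks), mean_reciprocal_rank(ranks)
hits3, hits10 = hits_at(ranks, 3), hits_at(ranks, 10)
```

MR is sensitive to a few very poor ranks, whereas MRR and Hits@k are dominated by how often the correct entity appears near the top, which is why the tables in this section report the latter two.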

4.2.2 Static methods

We compare our model with static methods by ignoring time steps. The static methods include TransE [11], DistMult [12], R-GCN [19] and ComplEx [13]. R-GCN predicts triplets through convolution operations on the neighborhood of entities. TransE, DistMult and ComplEx are translation-based models, which learn the embeddings of entities and relations through scoring functions. They usually require fewer parameters and are relatively easy to train.

4.2.3 Temporal reasoning methods

To verify the effectiveness of the model, we also compare our model with the state-of-the-art temporal knowledge graph models, which are capable of capturing the structural feature and handling long-term dependencies, including Know-Evolve [5], DyRep [6], HyTE [8], TTransE [35], R-GCRN [36]. The details of these models are as follows.

Table 4
Temporal knowledge graph link prediction results in terms of MRR and Hits@3/10. Hits@3/10 values are percentages, and MRR values are multiplied by 100. The best scores are marked in bold and the second-best scores are underlined

Category	Method	YAGO MRR	YAGO Hits@3	YAGO Hits@10	WIKI MRR	WIKI Hits@3	WIKI Hits@10	ICEWS18 MRR	ICEWS18 Hits@3	ICEWS18 Hits@10
Static	R-GCN	41.30	44.44	52.68	37.57	39.66	41.90	23.19	25.34	36.48
	TransE	48.97	62.45	66.05	46.68	49.71	51.71	17.56	26.95	43.87
	DistMult	59.47	60.91	65.26	46.12	49.81	51.38	22.16	26.00	42.18
	ComplEx	61.29	62.28	66.82	47.84	50.08	51.39	30.09	34.15	45.96
Temporal	Know-Evolve	6.19	6.59	11.48	12.64	14.33	21.57	9.29	9.62	17.18
	DyRep	5.87	6.54	11.98	11.60	12.74	21.65	9.86	10.66	18.66
	HyTE	23.16	45.74	51.94	43.02	45.12	49.49	7.31	7.50	14.95
	TTransE	32.57	43.39	53.37	31.74	36.25	43.45	8.36	8.71	21.93
	R-GCRN	53.89	56.06	61.19	47.71	48.14	49.66	35.12	38.26	50.49
	EvolveRGCN	59.74	61.03	61.69	46.49	47.83	49.23	16.59	18.32	34.01
	RE-NET	65.16	65.63	68.08	51.97	52.07	53.91	42.93	45.47	55.80
	KBGAT-BiLSTM	65.48	66.66	68.24	49.32	50.50	52.12	34.87	37.98	50.60

Know-Evolve: It uses the temporal point process to model the occurrence of events. The model uses bilinear relation scoring to capture multi-relational interactions between entities and learn dynamic entity representations. It is one of the earliest models for link prediction of TKGs.

DyRep: It is a representation learning method for TKGs that absorbs the information related to entity events over time and introduces an attention mechanism to continuously update the weights of node neighbors, thereby forming a continuously updated embedding.

HyTE: It is a temporally aware KG embedding method that explicitly incorporates time into the entity and relation space by associating each time step with a corresponding hyperplane. The model not only performs KG completion using temporal information but also predicts temporal scopes for relational facts with a missing time step.

TTransE: It is an extension model of TransE by using temporal information among facts. TTransE is based on Integer Linear Programming (ILP) using temporal consistency information as constraints to incorporate the valid time of facts.

R-GCRN: GCRN [27] combines a CNN and an RNN to learn spatial structures and temporal features, but it is not capable of link prediction for TKGs. Jin et al. [36] modified the model to work on TKGs by using RGCN instead of the CNN.

EvolveGCN: EvolveGCN [25] adapts the GCN to learn entity and relation embeddings. It captures the dynamism of the graph sequence by using RNN to evolve the GCN parameters.

For a fairer comparison, Know-Evolve, DyRep and R-GCRN use the Multi-Layer Perceptron (MLP) decoder proposed by Jin et al. [36]. The three models achieve better performance with the MLP decoder.

4.3 Experimental results

4.3.1 Experimental parameters

In this section, we introduce the parameter settings. We obtained the best parameters through repeated training. When training on the knowledge graph at each time step, we set the embedding dimension of entities and relations to 100, do not pre-train entity or relation embeddings, use a learning rate of 0.001, and let entity embeddings aggregate 2-hop neighborhood information. For multi-head attention, we use two heads. We set the batch size to 128 and the margin to 5.0, and we use the Adam optimizer. After obtaining the entity and relation embeddings at each time step, we use a stacked BiLSTM with 2 layers to capture the temporal information. In the predictive model, the batch size and margin differ from those of the KBGAT model: they are 512 and 1.0, respectively, and the momentum of the optimizer is 0.9.
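For reference, the settings listed above can be collected in one place as a plain configuration. The values are the ones stated in the text; the grouping into three dictionaries and all key names are assumptions made for illustration, and settings that the text does not mention for a given stage (for example, the learning rate of the predictive model) are intentionally left out.

```python
# Hyperparameters as reported in the text; key names are illustrative only.
KBGAT_CONFIG = {
    "embedding_dim": 100,           # entity and relation vector dimension
    "pretrained_embeddings": False, # no pre-training of entities or relations
    "learning_rate": 0.001,
    "neighborhood_hops": 2,         # 2-hop neighborhood information
    "attention_heads": 2,
    "batch_size": 128,
    "margin": 5.0,
    "optimizer": "adam",
}

BILSTM_CONFIG = {
    "num_layers": 2,                # stacked BiLSTM, stack value 2
    "bidirectional": True,
}

PREDICTOR_CONFIG = {
    "batch_size": 512,
    "margin": 1.0,
    "momentum": 0.9,                # optimizer momentum as stated in the text
}
```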

4.3.2 Experimental analysis

We compare the KBGAT-BiLSTM model with five temporal benchmark models and four static benchmark models on three different datasets. The experimental results demonstrate the effectiveness of our model, especially on the YAGO and WIKI datasets. The results of all models are summarized in Table 4. Figure 5 shows the detailed prediction results of our model on the different datasets at each time step. Specifically, Fig. 5(a) presents the detailed prediction results of our model on the YAGO dataset from 2013 to 2017. Fig. 5(b) shows the predictions of our model on the WIKI dataset from 2008 to 2017. Fig. 5(c) presents the detailed prediction results of changes in our model over 33 minutes on the ICEWS18 dataset. Hits@1/3/10 values are percentages, and MRR values are multiplied by 100. The best results are marked in bold. The experimental results of the baseline models are from Jin et al. [36].

Fig. 5

The prediction results of the KBGAT-BiLSTM model at each time step on YAGO, WIKI and ICEWS18 datasets.
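To make the reporting convention concrete (Hits@k as a percentage, MRR multiplied by 100), the following is a minimal sketch of how both metrics can be computed from the rank of the correct entity in each test query. The function names and the example ranks are illustrative assumptions, not values from the experiments.

```python
from typing import Sequence


def mrr_times_100(ranks: Sequence[int]) -> float:
    """Mean reciprocal rank of the correct entity, scaled by 100."""
    if not ranks:
        raise ValueError("ranks must not be empty")
    return 100.0 * sum(1.0 / r for r in ranks) / len(ranks)


def hits_at_k(ranks: Sequence[int], k: int) -> float:
    """Percentage of queries whose correct entity is ranked in the top k."""
    if not ranks:
        raise ValueError("ranks must not be empty")
    return 100.0 * sum(1 for r in ranks if r <= k) / len(ranks)


if __name__ == "__main__":
    # Illustrative ranks (1 = best) for five hypothetical test queries.
    example_ranks = [1, 2, 4, 10, 50]
    print(round(mrr_times_100(example_ranks), 2))
    for k in (1, 3, 10):
        print(k, hits_at_k(example_ranks, k))
```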

We find that all models achieve better prediction results on the WIKI and YAGO datasets than on the ICEWS18 dataset. This is due to the characteristics of the datasets themselves. The ICEWS18 dataset is event-based with a time granularity of days, so its events are close to real-time. We call such datasets interactive knowledge graphs. Events in interactive knowledge graphs are repetitive and unstable: they can be interactions that last for a while (phone calls, face-to-face communication, etc.) or discontinuous interactions (emails, text messages, etc.). The events in the WIKI and YAGO datasets are usually long-term and stable, with a time granularity of years. We call such datasets relational knowledge graphs. Relations in relational knowledge graphs are more stable, such as those between friends or colleagues. For relational knowledge graphs, we can therefore combine static link prediction models with temporal link prediction to achieve better results. For interactive knowledge graphs, it is necessary to focus on the time window or some form of aggregation to infer future knowledge graphs.
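One simple form of the aggregation mentioned for interactive knowledge graphs is to bucket day-level events into fixed-size time windows, so that repeated interactions inside a window collapse into a single fact. The sketch below assumes events are given as (subject, relation, object, day index) tuples; the function name and the example events are hypothetical, and this is not the procedure used in the experiments.

```python
from collections import defaultdict
from typing import Dict, Iterable, Set, Tuple

Quad = Tuple[str, str, str, int]    # (subject, relation, object, day index)
Triple = Tuple[str, str, str]


def aggregate_by_window(quads: Iterable[Quad],
                        window_days: int) -> Dict[int, Set[Triple]]:
    """Group day-level events into snapshots of `window_days` days each."""
    if window_days <= 0:
        raise ValueError("window_days must be positive")
    snapshots: Dict[int, Set[Triple]] = defaultdict(set)
    for subj, rel, obj, day in quads:
        # Events repeated inside one window are stored only once.
        snapshots[day // window_days].add((subj, rel, obj))
    return dict(snapshots)


if __name__ == "__main__":
    events = [
        ("A", "meet", "B", 0),
        ("A", "meet", "B", 1),      # repeated interaction, same window
        ("A", "call", "C", 6),
        ("B", "visit", "C", 7),     # falls into the next weekly window
    ]
    print(aggregate_by_window(events, window_days=7))
```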

In terms of MRR, our model is 4.19 higher than ComplEx on YAGO and 1.48 higher than ComplEx on WIKI, but 0.25 lower than R-GCRN on ICEWS18. In terms of Hits@3, our model is 4.21 higher than TransE on YAGO and 0.42 higher than ComplEx on WIKI, but 0.28 lower than R-GCRN on ICEWS18. In terms of Hits@10, our model outperforms all competitors on these datasets. Our model focuses on training relational embeddings and outperforms state-of-the-art models on relational knowledge graph datasets, where relations are stable. On interactive knowledge graph datasets, our model and R-GCRN achieve similar results. These results also indicate that our model is better suited to prediction on relational knowledge graphs.

GCRN studies the dynamic patterns of sequential data such as video by combining convolution operations with LSTM. R-GCRN builds on GCRN by replacing the CNN with RGCN and uses a new decoder. Therefore, R-GCRN shows outstanding performance on interactive knowledge graph datasets. Know-Evolve and DyRep achieve better results on the ICEWS18 dataset because they adapt temporal point processes to model events that change over time. Know-Evolve and DyRep pay more attention to learning the temporal features of knowledge graphs, which lets them better capture multi-relational exchange information between entities. HyTE and TTransE focus on modeling relations and are suitable for prediction on relational knowledge graphs.

Since the formation of a network is a complex process affected by many factors, it is impossible to design a single method that outperforms all others on every dataset. The results of static and temporal models differ considerably across datasets with different characteristics. Static methods achieve good results on the YAGO and WIKI datasets but perform poorly on ICEWS18, which is due to the characteristics of the dataset itself. The experimental results also suggest that static methods are more stable than temporal reasoning methods, because link prediction in static knowledge graphs has been studied extensively and has already reached strong results. The study of temporal reasoning is still in its infancy and requires further exploration.

4.4 Ablation study

To study the importance of the various components of our model, we performed an ablation study. Specifically, we removed the relational embedding of our model (— Relation) and replaced it with relational embedding without training. We also removed the multi-hop neighborhood (— MN) information of a given node, which means we only get the one-hop neighborhood information of the node. Fig. 6 shows that our model is better than these two ablation models. Removing the relational embedding of our model will have a substantial impact on the performance of the model.

Fig. 6

Mean Reciprocal Ranks of our model and two ablated models on YAGO. — MN (blue) represents the model after removing multi-hop neighborhood information. — Relation (green) represents the model without relational information. Our model (red) represents the entire model.
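The difference between the full model and the (MN) ablation is the depth of the neighborhood each entity can see. The following is a minimal sketch of collecting the entities reachable within a given number of hops from a set of triples. Treating edges as undirected is a simplifying assumption made here, and the function name is hypothetical; it does not reproduce the attention-based aggregation of the model.

```python
from collections import defaultdict
from typing import Dict, Iterable, Set, Tuple

Triple = Tuple[str, str, str]   # (subject, relation, object)


def n_hop_neighbors(triples: Iterable[Triple], node: str,
                    n_hops: int) -> Set[str]:
    """Return all entities within `n_hops` edges of `node` (excluding it)."""
    adjacency: Dict[str, Set[str]] = defaultdict(set)
    for subj, _rel, obj in triples:
        # Assumption: edges are treated as undirected for simplicity.
        adjacency[subj].add(obj)
        adjacency[obj].add(subj)

    seen = {node}
    frontier = {node}
    for _ in range(n_hops):
        # Expand one hop and keep only entities not visited before.
        frontier = {nb for cur in frontier
                    for nb in adjacency.get(cur, set())} - seen
        seen |= frontier
    return seen - {node}


if __name__ == "__main__":
    kg = [("a", "r1", "b"), ("b", "r2", "c"), ("c", "r3", "d")]
    print(n_hop_neighbors(kg, "a", 1))   # one-hop setting of the ablation
    print(n_hop_neighbors(kg, "a", 2))   # two-hop setting of the full model
```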

4.5 Parameter sensitivity

In this section, we study the parameter sensitivity of our model with respect to the historical length of knowledge graphs and the number of BiLSTM layers. Fig. 7 records how changes in these hyperparameters affect the performance of our model. Specifically, Fig. 7(a) shows the prediction results of our model under different numbers of historical time steps, and Fig. 7(b) shows the influence of the number of BiLSTM layers on model performance. We discuss their impact on model performance separately below.

Fig. 7

Parameter sensitivity of our model on YAGO.

4.5.1 Length of past knowledge graphs

In temporal knowledge graphs, we perform link prediction based on the historical information of the knowledge graphs. The longer the history of knowledge graphs, the more temporal information can be obtained. Conversely, with a short history we cannot effectively capture the temporal features of the knowledge graphs, which leads to worse prediction results. Once the history length reaches a certain value, link prediction no longer improves significantly, while the computational cost keeps increasing. Therefore, we need to find a suitable history length for prediction.
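The history length can be read as the number of past snapshots passed to the temporal model when predicting time step t. The following is a minimal sketch of that selection, assuming the snapshots are stored in chronological order in a list; the function name and the placeholder snapshot labels are hypothetical.

```python
from typing import List, Sequence, TypeVar

Snapshot = TypeVar("Snapshot")


def history_window(snapshots: Sequence[Snapshot], t: int,
                   history_len: int) -> List[Snapshot]:
    """Return up to `history_len` snapshots immediately preceding step `t`."""
    if history_len <= 0:
        raise ValueError("history_len must be positive")
    if not 0 <= t <= len(snapshots):
        raise ValueError("t is out of range")
    start = max(0, t - history_len)   # fewer snapshots exist near the start
    return list(snapshots[start:t])


if __name__ == "__main__":
    # Placeholder snapshots; in practice each would be a graph or its embeddings.
    graphs = ["G0", "G1", "G2", "G3", "G4", "G5"]
    print(history_window(graphs, t=5, history_len=3))
    print(history_window(graphs, t=2, history_len=5))
```

Sweeping `history_len` over a small range and recording MRR at each value is one way to locate the point after which longer histories add cost without improving prediction.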

4.5.2 Layers of BiLSTM

We evaluated different numbers of BiLSTM layers to obtain better temporal features. Fig. 7(b) shows that a 2-layer BiLSTM performs better than a 1-layer BiLSTM, because the 2-layer BiLSTM can capture more temporal information. Therefore, in the experiments, we set the number of BiLSTM layers to 2.
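The gain from a second layer comes at a cost in model size. As a rough illustration, the sketch below counts the trainable weights of a stacked BiLSTM using the textbook LSTM formulation (four gates, each with input weights, recurrent weights and one bias vector). The hidden size of 100 is an assumption made for this example, and some frameworks, such as PyTorch, keep two bias vectors per gate, so their counts are slightly higher.

```python
def bilstm_parameter_count(input_dim: int, hidden_dim: int,
                           num_layers: int) -> int:
    """Count the weights of a stacked bidirectional LSTM (one bias per gate)."""
    if min(input_dim, hidden_dim, num_layers) <= 0:
        raise ValueError("all arguments must be positive")
    total = 0
    layer_input = input_dim
    for _ in range(num_layers):
        # Four gates, each with input weights, recurrent weights and a bias.
        per_direction = 4 * (layer_input * hidden_dim
                             + hidden_dim * hidden_dim
                             + hidden_dim)
        total += 2 * per_direction        # forward and backward directions
        layer_input = 2 * hidden_dim      # next layer sees both directions
    return total


if __name__ == "__main__":
    # Input size 100 matches the embedding dimension; hidden size is assumed.
    for layers in (1, 2):
        print(layers, bilstm_parameter_count(100, 100, layers))
```

Under these assumptions the second layer is larger than the first, because its input concatenates the forward and backward states of the layer below.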

5 Conclusions

In this paper, we propose a new deep learning model, KBGAT-BiLSTM, which combines the KBGAT and BiLSTM models. KBGAT encapsulates both the rich semantic information and the potential relations of entities to learn the structural features of knowledge graphs. BiLSTM captures the temporal features of knowledge graphs by learning the embedding sequence of entities and relations. To our knowledge, we are the first to apply GAT to temporal knowledge graphs. Our model captures the neighborhood information of entities effectively by continually updating the attention values. The experimental results show that our model is effective and achieves state-of-the-art performance.

The network structure of a knowledge graph is complex and is affected by many factors as it changes. Therefore, it is difficult to propose a model that performs well on datasets with different characteristics, and this work is a step toward that goal. In the future, we will work to reduce the complexity of the KBGAT-BiLSTM model and make it suitable for link prediction in large-scale knowledge graphs. Additionally, we will study the transferability of our model to make it suitable for tasks such as dynamic networks and node classification.

Acknowledgments

This research is funded by the Key Program of Tianjin Natural Science Foundation (19JCZDJC40000), Science and Technology Program of Tianjin (18YFCZZC00060, 18ZXZNGX00100) and Natural Science Foundation of Hebei Province (F2020202008).

References

1. Dai, Li, Tang et al., Adversarial network embedding, Proceedings of the AAAI Conference on Artificial Intelligence 32(1), 2018.

2. Hamidi, Smarandache, Single-valued Neutrosophic Directed (Hyper)graphs and Applications in Networks, Journal of Intelligent & Fuzzy Systems (2019), 2869–2885.

3. Cai, Zheng V.W. and Chang K.C.C., A comprehensive survey of graph embedding: Problems, techniques, and applications, IEEE Transactions on Knowledge and Data Engineering 30(9) (2018), 1616–1637.

4. Cen, Zou, Zhang et al., Representation learning for attributed multiplex heterogeneous network, Proceedings of the 25th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining (2019), 1358–1368.

5. Trivedi, Dai, Wang et al., Know-evolve: Deep temporal reasoning for dynamic knowledge graphs, International Conference on Machine Learning, PMLR (2017), 3462–3471.

6. Trivedi, Farajtabar, Biswal et al., Dyrep: Learning representations over dynamic graphs, International Conference on Learning Representations, 2019.

7. Leblay, Chekol M.W., Deriving validity time in knowledge graph, Companion Proceedings of the The Web Conference (2018), 1771–1776.

8. Dasgupta S.S., Ray S.N., Talukdar, Hyte: Hyperplane-based temporally aware knowledge graph embedding, Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing (2018), 2001–2011.

9. Sankar, Wu, Gou et al., Dynamic graph representation learning via self-attention networks, arXiv preprint arXiv:1812.09430, 2018.

10. Goyal, Kamra, He et al., Dyngem: Deep embedding method for dynamic graphs, arXiv preprint arXiv:1805.11273, 2018.

11. Bordes, Usunier, Garcia-Duran et al., Translating embeddings for modeling multi-relational data, Advances in Neural Information Processing Systems (2013), 26.

12. Yang, Yih, He et al., Embedding entities and relations for learning and inference in knowledge bases, arXiv preprint arXiv:1412.6575, 2014.

13. Trouillon, Welbl, Riedel et al., Complex embeddings for simple link prediction, International Conference on Machine Learning, PMLR (2016), 2071–2080.

14. Wang, Zhang, Feng et al., Knowledge graph embedding by translating on hyperplanes, Proceedings of the AAAI Conference on Artificial Intelligence 28(1) (2014).

15. Xiao, Huang, Hao et al., TransG: A generative mixture model for knowledge graph embedding, arXiv preprint arXiv:1509.05488, 2015.

16. Lin, Liu, Sun et al., Learning entity and relation embeddings for knowledge graph completion, Twenty-Ninth AAAI Conference on Artificial Intelligence, 2015.

17. Nguyen D.Q., Nguyen T.D., Nguyen D.Q. et al., A novel embedding model for knowledge base completion based on convolutional neural network, arXiv preprint arXiv:1712.02121, 2017.

18. Dettmers, Minervini, Stenetorp et al., Convolutional 2d knowledge graph embeddings, Thirty-Second AAAI Conference on Artificial Intelligence, 2018.

19. Schlichtkrull et al., Convolutional 2d knowledge graph embeddings, Thirty-Second AAAI Conference on Artificial Intelligence, 2018.

20. Das, Dhuliawala, Zaheer et al., Modeling relational data with graph convolutional networks, European Semantic Web Conference, Springer, Cham, 2018.

21. Guo, Wang et al., Knowledge graph embedding with iterative guidance from soft rules, Proceedings of the AAAI Conference on Artificial Intelligence 32(1), 2018.

22. Sankar, Wu, Gou et al., Dysat: Deep neural representation learning on dynamic graphs via self-attention networks, Proceedings of the 13th International Conference on Web Search and Data Mining (2020), 519–527.

23. García-Durán, Dumancić, Niepert, Learning sequence encoders for temporal knowledge graph completion, arXiv preprint arXiv:1809.03202, 2018.

24. Tang, Yuan, Li et al., Timespan-aware dynamic knowledge graph embedding by incorporating temporal evolution, IEEE Access 8 (2020), 6849–6860.

25. Pareja, Domeniconi, Chen et al., Evolvegcn: Evolving graph convolutional networks for dynamic graphs, Proceedings of the AAAI Conference on Artificial Intelligence 34(04) (2020), 5363–5370.

26. Kipf T.N., Welling, Semi-supervised classification with graph convolutional networks, arXiv preprint arXiv:1609.02907, 2016.

27. Seo, Defferrard, Vandergheynst et al., Structured sequence modeling with graph convolutional recurrent networks, International Conference on Neural Information Processing, Springer, Cham (2018), 362–373.

28. Sanchez-Gonzalez, Heess, Springenberg J.T. et al., Graph networks as learnable physics engines for inference and control, International Conference on Machine Learning, PMLR (2018), 4470–4479.

29. Palm R.B., Paquet, Winther, Recurrent relational networks, arXiv preprint arXiv:1711.08028, 2017.

30. […], Zhang, Philip S.Y. et al., Deep dynamic network embedding for link prediction, IEEE Access 6 (2018), 29219–29230.

31. Nathani, Chauhan, Sharma et al., Learning attention-based embeddings for relation prediction in knowledge graphs, arXiv preprint arXiv:1906.01195, 2019.

32. Hochreiter and Schmidhuber, Long short-term memory, Neural Computation 9(8) (1997), 1735–1780.

33. Boschee, Lautenschlager, O’Brien et al., Icews coded event data, Harvard Dataverse, 2015.

34. Mahdisoltani, Biega, Suchanek F.M., A knowledge base from multilingual Wikipedias–yago3, Technical report, Telecom ParisTech, 2014.

35. Jiang, Liu, Ge et al., Towards time-aware knowledge graph completion, Proceedings of COLING 2016, the 26th International Conference on Computational Linguistics: Technical Papers (2016), 1715–1724.

36. Jin, Qu, Jin et al., Recurrent event network: Autoregressive structure inference over temporal knowledge graphs, arXiv preprint arXiv:1904.05530, 2019.

Table 3

Dataset statistics

| Dataset | Entities | Relations | Training | Validation | Test | Snapshot length |
|---|---|---|---|---|---|---|
| YAGO | 10623 | 10 | 161540 | 19523 | 20026 | 189 |
| WIKI | 12554 | 24 | 539286 | 67538 | 63110 | 232 |
| ICEWS18 | 23033 | 256 | 373018 | 45995 | 49545 | 304 |