Knowledge-aware hierarchical attention network for recommendation

Abstract

Knowledge graphs have been introduced into recommender systems due to the rich connectivity information. Many knowledge-aware recommendation methods use graph neural networks (GNNs) to capture the high-order structural and semantic information of knowledge graphs. However, previous GNN-based methods have the following limitations: (1) they fail to make full use of the neighborhood information of entities and (2) they ignore the importance of user interaction sequences on reflecting user preferences. As such, these models are insufficient for generating accurate representations of users and items. In this study, we propose a Knowledge-aware Hierarchical Attention Network (KHAN) to provide better recommendation. Specifically, the proposed model mainly consists of an item encoder and a user encoder. The item encoder is equipped with a hierarchical attention network, which is used to generate entity (item) representations by carefully aggregating neighborhood information of entities. The user encoder is also designed to learn more informative user representations from user interaction sequences using multi-head self-attention. The learned user representations are then combined with user representations introduced in the item encoder through a gating mechanism to generate the final user representations. Extensive experiments on two real-world datasets about movie and restaurant recommendation demonstrate the effectiveness of our model.

Keywords

Recommender system knowledge graph graph neural network hierarchical attention network

1 Introduction

Users are inundated with many choices in the age of information explosion. To tackle the problem of information overload, electronic retailers and content providers apply recommender systems to help users find items of interest. A recommender system is an information filtering system, which can mine users’ interests and match them with the most appropriate items. Due to the importance in practice, recommender systems have been playing an increasingly important role in modern society.

One of the most prevalent recommendation techniques is collaborative filtering (CF) [1 –3], which models user preferences based on user similarity or item similarity from the interaction data. Although CF-based methods have many prominent characteristics such as generality and implementation simplicity, they suffer from the cold start problem and perform poorly in sparse scenarios where users have a few interactions with items. To alleviate these problems, many works attempt to incorporate side information into the learning process [4], such as user profiles [5], item attributes [6], etc. Nevertheless, these methods assume that relations among items are independent and overlook the connectivity between them.

Fig. 1

An example of KG-aware recommender system. The dashed line denotes that the user interacts with the item, while the solid line denotes that the item has the property in the knowledge graph.

Based on the above observations, a few recent studies propose to incorporate knowledge graphs (KGs) [7 –9] into recommendation. A KG is a directed graph, where nodes correspond to entities and edges correspond to relations between entities. The information of entities and relations is usually organized in the form of triplets (h, r, t), which represents the fact that head entity h has a relation r with tail entity t. For example, the triplet (Will Smith, Star, The Pursuit of Happyness) means that Will Smith stars in the movie The Pursuit of Happyness. KG associates items with their attributes via various relations, which has shown great potential to improve the accuracy of recommender systems. Taking movie recommendation as an example (Fig. 1), a user interacts with I Robot, and The Pursuit of Happyness share the same actor Will Smith with I Robot. Therefore, we can reason that the user may also be interested in the movie The Pursuit of Happyness. Such rich semantic relatedness among items is helpful for reasoning about unseen user-item interactions. Meanwhile, KG also compensates for the data sparsity due to the integration of side information.

There have been several knowledge-aware recommendation methods. (1) Embedding-based methods [7, 10] leverage knowledge graph embedding (KGE) methods, such as TransE [11] and TransD [12], to map entities and relations to low-rank representations. These methods only consider one-hop relations between entities rather than multi-hop relation paths, resulting in a lack of reasoning ability and interpretability. (2) Path-based methods [13 –15] introduce meta-paths to match the similarity of the items or users. By exploring the connectivity information of users and items in KGs, these methods can better provide the ability of interpretability for recommendation. However, commonly-used path-based methods rely heavily on manually designed meta-paths, which leads to the inability to reason on unseen connectivity patterns.

Considering the limitations of existing methods, several methods take inspiration from the development of graph neural networks (GNNs) [16] for recommendation. GNN-based methods [17, 18] can leverage both the semantic representations used in embedding-based methods and the semantic connectivity information used in path-based methods, and have proven to be one of the best performing architectures for various graph learning tasks. The key idea of GNN is to update a target node representation by recursively aggregating the information from its multi-hop neighbors. Since different neighbor nodes contribute unequally to the central node, it is necessary to discriminate the importance of neighbor nodes carefully. However, most GNN-based methods either treat the neighbors of each entity equally or roughly discriminate the importance of neighbors, leading to a failure to fully utilize neighborhood information.

Moreover, existing GNN-based methods completely neglect the importance of user interaction sequences on reflecting user preferences. For instance, as shown in Fig. 1, a user interacts with the movie I Robot, Titanic, and Flipped. Without considering the user interaction sequence, we can recommend the movie The Pursuit of Happyness to the user since it has the same leading actor as I Robot. However, once we take the user interaction sequence into account, we can infer that the user prefers romantic movies as most of the movies the user interacts with belong to romantic movies. Therefore, the movie Roman Holiday is a more suitable recommendation item than The Pursuit of Happyness.

Fig. 2

(a) The entity Titanic and its neighbors. (b) The hierarchical structure of Titanic’s neighbors. N₁ denotes the set of neighbor entities under relation Starred. N₂ denotes the set of neighbor entities under relation Genre.

In this paper, we propose a knowledge-aware hierarchical attention network based on GNN, dubbed KHAN. Our design objective is to take full advantage of the local neighborhood information of each entity to obtain better entity representations and incorporate user interaction sequences to capture user preferences more accurately. The core of KHAN is an item encoder and a user encoder. In the item encoder, we design a hierarchical attention network, which can generate item (entity) representations by aggregating neighborhood information of items. More specifically, we first treat the connections between a central entity and its neighbors as a hierarchical structure [19]. A central entity connects to different relations and each relation links the central entity to one or more neighbor entities. For instance, we select Titanic and its neighbors from Fig. 1 as Fig. 2(a), and Fig. 2(b) can be viewed as the hierarchical structure of neighbors. Next, based on the above hierarchical structure, we elaborately devise a hierarchical attention mechanism to discriminate the importance of different neighbors. The hierarchical attention consists of user-relation attention and user-entity attention. Wherein, user-relation attention represents the importance of relations to users while user-entity attention characterizes the importance of different neighbor entities under each relation. Then the user-relation attention and the user-entity attention are combined as the final scores of neighbors when aggregating neighborhood information. In the user encoder, we employ a self-attention mechanism to capture the relatedness of user interaction sequences to fully extract user interests. The learned user representations are then combined with user representations used in the item encoder through a gating mechanism to generate the final user representations. We conduct extensive experiments on two datasets to verify the effectiveness of our model.

To sum up, the contributions of this paper are as follows:

We highlight the importance of taking full advantage of neighborhood information for enhancing the recommendation task and propose an end-to-end model KHAN, which aggregates neighborhood information of each entity via a hierarchical attention network.

We consider the importance of user interaction sequences and design a new user encoder to learn more accurate user representations from user interaction sequences.

Experimental results on real-world datasets demonstrate that KHAN significantly outperforms state-of-the-art KG-based methods for recommendation.

The rest of this paper is organized as follows. The related works are reviewed in Section 2. Section 3 presents the problem formulation. In Section 4, we describe our proposed model KHAN in detail. Thereafter, we provide an extensive evaluation of our model in Section 5 to demonstrate its effectiveness. Finally, Section 6 concludes the paper.

2 Related work

In this section, we briefly review KG-based recommender systems, which can be roughly divided into three classes: embedding-based methods, path-based methods, and unified methods.

Embedding-based methods exploit the structured information of KG to learn low-rank embeddings of entities and relations, such as TransE [11], TransD [12], TransR [20], etc. For example, DKN [7] fused the knowledge-level representations of entities learned via TransD and word-level representations of titles learned with CNN to model news for news recommendation. Instead of explicitly leveraging high-order information, these methods only take into account the direct relations between entities. Therefore, embedding-based methods lack reasoning ability and interpretability, and are not very suitable for recommendation.

In the meantime, path-based methods were proposed. Path-based methods take advantage of the connectivity patterns of entities to match the similarity between users and items. To get the paths between selected nodes, they define meta-path patterns [14, 15] or automatically select the most salient paths [13]. For example, MCRec [14] considered the mutual effect between the meta-paths and the involved user-item pair in an interaction, and employed an attention mechanism to mutually improve the representations for meta-path based context, users, and items. KARN [21] combined the users’ clicked history sequences and the path connectivity between users and items for recommendation. However, the methods of defining meta-path patterns require domain knowledge to define meta-path manually. They usually cannot capture all connectivity patterns, which leads to suboptimal recommendation results.

Recently, unified methods [17 , 22–24] have become one of the most popular methods for recommendation. Unified methods are based on the idea of embedding propagation, which updates the target node representation guided by the connective structure in the KG. As one of the unified methods, GNN-based methods have received growing attention due to their excellence in integrating node information and topological structure. For example, KGAT [17] refined an entity embedding based on the embeddings of its neighbors and applies an attention mechanism to learn the weight of each neighbor on the user-item knowledge graph. Nevertheless, the method ignores the correlation between the target user behavior and the item in the embedding learning process. KGCN [18] proposed to use scores relying on the connecting relation and specific user to weight neighbors on the item knowledge graph. As a result, the users’ personalized interests in relations can be characterized. However, the method does not consider the fact that users’ interests in different entities (items) are various even under the same relation. More recently, KGIN [25] explored the hidden user intent by modeling each intent as an attentive combination of KG relations. Different from all these methods, in this work, we treat the connections between a central entity and its neighbors as a hierarchical structure and design a hierarchical attention mechanism, including user-relation attention and user-entity attention, to refine the aggregation process of neighborhood information. The methods of organizing information into hierarchies have been verified to yield better results. For example, RGHAT [19] proposed a hierarchical mechanism to utilize the neighborhood information of entities and successfully applied it to the task of knowledge graph completion. The difference between RGHAT and our work is that RGHAT takes the semantic relevance of head and tail entities as the main factor to determine the attention scores of triplets, while our work mainly focuses on the users’ interest in triplets. By introducing users’ interests into triplets, an item can adaptively learn different representations when facing different users. Therefore, RGHAT is more suitable for knowledge graph completion while our work is more suitable for recommendation. In addition, existing GNN-based methods fail to consider the importance of user interaction sequences on reflecting user preferences. In our work, this problem is also effectively addressed.

3 Problem formulation

In the study of knowledge-aware recommendation scenario, we have user set $U = {u_{1}, u_{2}, . . ., u_{| U |}}$ , item set $V = {v_{1}, v_{2}, . . ., v_{| V |}}$ and user-item interaction data $Y = {y_{uv} | u \in U, v \in V}$ , where $y_{uv} = {\begin{matrix} 1, if interaction (u, v) exists \\ 0, otherwise \end{matrix}$ (1) Based on the user-item interaction data Y, for a given user u, we can extract the user interaction sequence S (u) ={ v₁, v₂, …, v_t }. In addition to the interaction data, we also map items and item attributes to KG $G$ , thus each node of the KG corresponds to a real-world item or a item property, and each edge represents the relation between entities. More formally, we denote KG as $G = {(h, r, t) | h, t \in E, r \in R}$ in which each triplet (h, r, t) represents the fact that head entity h has a relation r with tail entity t, $E$ and $R$ denote the entity set and relation set, respectively.

Therefore, we can formulate the recommendation task as follows: given a user u, a candidate item v, the interaction sequence S (u) of user u and KG $G$ , the holistic goal is to predict whether the user u will engage with the candidate item v. Thus, the prediction function can be defined as: ${\hat{y}}_{uv} = F (u, v | S (u), G, Θ)$ (2) where ${\hat{y}}_{uv}$ is the predicted score for user-item interaction, $F$ is the model with parameters Θ.

Fig. 3

The overall architecture of the KHAN model. The symbols ⊕ and ⊗ denote the element-wise addition and multiplication, respectively.

4 Knowledge-aware hierarchical attention network

In this section, we present the proposed KHAN model. As is illustrated in Fig. 3, the model framework consists of two components, i.e., item encoder and user encoder. Specifically, the item encoder regards the neighborhood of an entity in KG as a hierarchical structure and designs a hierarchical attention network accordingly. Then the neighbors of central entity are weighted by hierarchical attention scores and further form the neighborhood representation. Finally, to get the final entity representation, the neighborhood representation is aggregated to the central entity. The user encoder employs multi-head self-attention to draw global dependencies of user interaction sequences and further learns user representations. The learned user representations are then combined with user representations introduced in the item encoder through a gating mechanism to generate the final user representations.

4.1 Item encoder

The item encoder aims to learn item representations based on the idea of the GNN for recommendation. Following the general process of GNN, we design two components for the item encoder: hierarchical attention-based information propagation and information aggregation.

4.1.1 Hierarchical attention-based information propagation

The information propagation operation tries to model information propagated from neighbors for the central entity. Given a central entity v, a common way to obtain the neighborhood representation of v is to compute the linear combination of v’ tail entity. The process is given as: ${agg}_{v} = \sum_{t \in N_{v}} γ_{t} t$ (3) where t is the embedding of the tail entity t connected to the central entity v, N_v is the set of tail entities connected to entity v, γ_t represents the weight of triplet (v, r, t). In the following, we will focus on the calculation of γ_t.

To learn the weight γ_t more accurately, we design a hierarchical attention mechanism, which consists of user-relation attention and user-entity attention. The hierarchical attention architecture is inspired by RGHAT [19], where the goal is to learn a better representation for the target node by making full use of the neighborhood information for enhancing the knowledge graph completion task.

User-Relation Attention. There are various relations in the KG. Different users may focus on different relations. For instance, for a movie dataset, a user may pay more attention to the movie’s actor, while another user may have more interests in the movie’s director. In addition, the same user may also show different interests in different relations. Thus, we propose to calculate user-relation attention, which reveals the importance of the relation r to user u. Denote the user-relation attention score as ${att}_{r_{v}}^{u}$ , which is formulated as follows: $Con (u, r_{v}) = W_{u} u + W_{r} r_{v}$ (4) ${att}_{r_{v}}^{u} = g_{1}^{T} \tanh (Con (u, r_{v}))$ (5) where $u \in ℝ^{d}$ is the embedding of user u, $r_{v} \in ℝ^{d}$ is the embedding of relation r connected to the central entity v, d is the dimension of the embedding. $W_{u} \in ℝ^{d \times d}$ , $W_{r} \in ℝ^{d \times d}$ , $g_{1} \in ℝ^{d}$ are training parameters. To easily compare the importance of different relations to the entity v, we normalize the scores of user-relation attention across all relations connected to v using the softmax function: $\tilde{{att}_{r_{v}}^{u}} = \frac{exp ({att}_{r_{v}}^{u})}{\sum_{r_{v}^{'} \in N_{v}} exp ({att}_{r_{v}^{'}}^{u})}$ (6)

User-Entity Attention. On the one hand, users usually have different interests in different relations. On the other hand, even for entities under the same relation, users’ attention may also vary greatly. Taking movie recommendation as an example, the movie Titanic is linked by the relation Starred to different actors, Leonardo DiCaprio and Kate Winslet. A user may show different interests in Leonardo DiCaprio and Kate Winslet. Thus, we propose to calculate the user-entity attention ${att}_{r_{v}, t}^{u}$ , which is formulated as follows: $Con (u, r_{v}, t) = W_{a} (Con (u, r_{v})) + W_{t} t$ (7) ${att}_{r_{v}, t}^{u} = g_{2}^{T} \tanh (Con (u, r_{v}, t))$ (8) where $t \in ℝ^{d}$ is the embedding of the tail entity t under relation r_v, $W_{a} \in ℝ^{d \times d}$ , $W_{t} \in ℝ^{d \times d}$ , $g_{2} \in ℝ^{d}$ represent training parameters. Subsequently, we normalize the scores of user-entity attention across all tail entities under relation r_v with softmax function: $\tilde{{att}_{r_{v}, t}^{u}} = \frac{exp ({att}_{r_{v}, t}^{u})}{\sum_{t^{'} \in N_{r_{v}}} exp ({att}_{r_{v}, t^{'}}^{u})}$ (9)

Combining user-relation attention with user-entity attention, we can get the final attention score $γ_{r_{v}, t}^{u}$ of triplet (v, r, t). In KHAN, we use the product of user-relation attention and user-entity attention as the final attention score of the triplet.

In RGHAT, attention scores are also calculated for triplets. However, the focus of our model and RGHAT is different when calculating attention scores. RGHAT determines the attention scores of triplets based on the semantic relevance of the triplets. If the semantic relevance of a triplet is strong, then the triplet can contribute more information to the central (head) entity. Such a strategy is reasonable for knowledge graph completion. However, for recommendation task, we argue that user preferences are the key factor in determining how much information is propagated to the central entity by a triplet. For example, there is a movie Titanic, a triplet (Titanic, Starred, Leonardo DiCaprio), and two users u₁ and u₂. We want to predict whether to recommend the movie Titanic to the user u₁ and u₂. Assuming that by calculation we find that the user u₁ is highly interested in Leonardo DiCaprio while user u₂ is not concerned about the leading actor, then we can infer that Leonardo DiCaprio is more important for u₁ compared to u₂. Accordingly, for u₁, we should increase the amount of information propagated from the above triplet to the central entity, while for u₂, we can appropriately reduce the information passed to the central entity. In this way, we can learn different representations for movie Titanic adapted to u₁ and u₂, hence improving the accuracy of the recommendation. To demonstrate that our method is more suitable for recommender systems, we compare the performance of RGHAT and our model in the experiments section.

Unlike previous approaches described in Equation (3) that simply use the linear combination of v’s tail entity as the final neighborhood representation, we aggregate Con (u, r_v, t) obtained from the above calculation as the neighborhood representation. Con (u, r_v, t) combines the information of relation r, tail entity v_t, and user u. It is a natural way to aggregate the representations of both relations and tail entities as neighborhood representation. We incorporate user representation into neighborhood representation to tackle the problem that the item knowledge graph cannot capture collaborative signal due to the lack of user-item interaction information. By modeling user information into the entity representation, we can close the spatial distance of the entities preferred by similar users. The neighborhood representation of v is computed as follows: ${agg}_{v} = \sum_{t \in N_{r_{v}}} s_{t}$ (10) $s_{t} = γ_{r_{v}, t}^{u} (Con (u, r_{v}, t))$ (11) where s_t represents the information propagated from the triplet (v, r, t) to the central entity v. Moreover, updating an entity with all of its neighbors can be expensive and the neighbors of each entity vary significantly in size. Hence, we randomly sample a fixed-size neighborhood for each entity as the final neighborhood.

4.1.2 Information aggregation

Having obtained the neighborhood representation of v as agg_v, we aggregate the entity representation v and its neighborhood representation agg_v into the final representation of v: $v^{*} = σ (W_{s} (v + {agg}_{v}) + b)$ (12) where $W_{s} \in ℝ^{d \times d}$ and $b \in ℝ^{d}$ are training parameters, σ is ReLU function.

Finally, to mine user interest in a deeper way, we further stack multiple propagation layers to aggregate information from multi-hop neighbors.

4.2 User encoder

In the previous steps, to learn user-personalized item representation, we introduce user representation u which is learned from the embedding of user ID. Although the corresponding user embedding can be obtained, we argue that user interaction sequences more clearly reflect the user preferences. Considering that user interaction records may have some relatedness, we propose to use multi-head self-attention to capture the correlations between items to learn their contextual characteristics. Based on the entity look-up table, for the interaction sequence {v₁, v₂, . . . , v_t} of user u, we can obtain their representation $H = {v_{1}^{*}, v_{2}^{*}, . . ., v_{t}^{*}} \in ℝ^{t \times d}$ . Then we employ Scaled Dot-Product Attention [26] to learn the representations of user interaction sequence. The representations of user interaction sequence learned by the i_th attention head are computed as: ${\tilde{H}}_{i} = softmax (\frac{(H W_{i}^{Q}) {(H W_{i}^{K})}^{T}}{{\sqrt{d}}_{k}}) (H W_{i}^{V})$ (13) where $W_{i}^{Q}, W_{i}^{K}, W_{i}^{V} \in ℝ^{d \times d_{k}}$ are training parameters, k represents the num of self-attention heads, d_k = d/k. We then concatenate the representations produced by k self-attention heads to get the final representations of the user interaction sequence: $\hat{H} = {\tilde{H}}_{1} | | \dots | | {\tilde{H}}_{k}$ (14) where || represents the concatenation operation. After obtaining the representations of the user interaction sequence, a typical operation is to calculate the average of the vectors of user interaction items as user representation. However, a user may have different preferences for different items, the mean operation could not attach different importance to different items. Therefore, an additive mechanism is proposed to discriminate the importance of the items to users. We compute the linear combination of user interaction records as user representation, which is formulated as: $u_{h} = \sum_{i = 1}^{t} α_{i} {\hat{H}}_{i}$ (15) where ${\hat{H}}_{i}$ is the representation of the ith interaction item of the user u, α_i is the attention weight of ${\hat{H}}_{i}$ , which is defined as: $α_{i} = \frac{exp (σ (q \cdot {\hat{H}}_{i}))}{\sum_{j = 1}^{t} exp (σ (q \cdot {\hat{H}}_{j}))}$ (16) where q ∈ R^d is training parameter, σ represents the LeakyReLU function with negative input slope as 0.2.

To get the final user representation u^*, we incorporate a gating network to combine user representation u introduced in the item encoder and user representation u_h learned from user interaction sequence adaptively. $u^{*} = G_{n} ⊙ u + (1 - G_{n}) ⊙ u_{h}$ (17) $G_{n} = σ (W_{g} \cdot [u, u_{h}]) + b_{g}$ (18) where W_g ∈ R^d×d, b_g ∈ R^d are training parameters, σ represents the sigmoid function, ⊙ denotes element-wise multiplication, and $G_{n} \in ℝ^{d}$ is the learnable gate.

4.3 Model training

After obtaining user representation u^* and item representation v^*, we use the simple inner product to predict matching score: ${\hat{y}}_{uv} = {u^{*}}^{T} v^{*}$ (19)

We employ a uniform negative sampling strategy to learn the parameters of our model. For each item that a user interacts with, we randomly sample an item that is not selected by this user. The loss function of the KHAN model is defined as follows: $L = - {\sum_{X \in O^{+}} y log (\hat{y}) + \sum_{X^{'} \in^{O}} (1 - y) log (1 - \hat{y})}$ (20) where X and X′ represent the positive and negative samples from the positive set $O^{+}$ and negative set , respectively.

5 Experiments and evaluation

In this section, we perform extensive experiments on real-world datasets to evaluate the proposed KHAN model.

5.1 Dataset description

To verify the effectiveness of KHAN in different scenarios, we conduct experiments on two benchmark datasets: MovieLens-20M and Dianping-Food. The two datasets have been extended by [27]. We use the extended datasets to perform experiments and, to ensure the objectivity of the results, they are identical to the datasets used by [27].

MovieLens-20M 1 is a widely used dataset for movie recommendation, which includes approximately 20 million explicit ratings and each rating is an integer between 1 to 5.

Dianping-Food is collected from Dianping.com 2 . Wherein the restaurants are viewed as items.

Table 1
Statistical details of the datasets

MovieLens-20M Dianping-Food

User-Item #User 138,159 2,298,698

Interaction #Items 16,954 1,362

#Interactions 13,501,622 23,416,418

Knowledge #Entities 102,569 28,115

Graph #Relations 32 7

#Triplets 499,474 160,519

		MovieLens-20M	Dianping-Food
User-Item	#User	138,159	2,298,698
Interaction	#Items	16,954	1,362
	#Interactions	13,501,622	23,416,418
Knowledge	#Entities	102,569	28,115
Graph	#Relations	32	7
	#Triplets	499,474	160,519

The statistical details of the two datasets are presented in Table

5.2 Experiment settings

5.2.1 Baselines

To evaluate the effectiveness of our approach, we compare it with the following baselines. SVD and LibFM are KG-free methods, and the rest of the models are KG-aware methods.

SVD [28] transforms both items and users to the same latent factor space using inner product to model user-item interactions.

LibFM [29] is a factorization model, which considers second-order feature combination. We concatenate user id, item id as input.

CKE [10] integrates text, image, and structure information of KG for recommendation, which employs TransR to learn structural knowledge, guiding the representation learning of items.

PER [30] is a path-based method, which extracts meta-path based features to represent the connectivity between users and items.

KGAT [17] constructs a new heterogeneous graph by combining user-item interactions and KG, and then uses attention-based propagation to generate user and item representations on it.

KGCN [18] is an approach based on GNN, which combines neighborhood information with bias to learn item representations on KG.

KGNN-LS [27] is an improvement over KGCN, which introduces label smoothness assumption to regularize the model.

RGHAT [19] is a method designed for knowledge graph completion, which mainly consists of an encoder and a decoder. The encoder is used to learn the entity representation based on the KG. In this paper, we use the encoder of RGHAT for recommendation.

KGIN [25] is a state-of-the-art GNN-based method. It considers user intents and models them as an attentive combination of KG relations.

5.2.2 Evaluation metrics

For each dataset, we divide it into three parts: training set (60%), validation set (20%), and test set (20%). To demonstrate the effectiveness of KHAN in different scenarios, we evaluate our model in click-through rate (CTR) prediction and top-K recommendation: (1) In CTR prediction, we use the trained model to predict user-item matching scores in the test set. If user-item matching score $\hat{y} > = 0.5$ , we consider that there is an interaction between the user and the item. AUC and F1 are used as the evaluation metric for CTR prediction. (2) In top-K recommendation, we use the trained model to learn the user preferences over all the items except the positive ones in the training set, and choose the top k items with the highest user-item matching scores as the top k item set. We adopt recall@K to evaluate the recommendation performance of our model and the baselines.

5.2.3 Parameter settings

For fair comparison, we do not use any pretrained parameters, but learn all models from scratch. We use Tensorflow in the experiment, and optimized all models with Adam optimizer. Grid search is used to find out the best settings of hyper-parameters. The dimension of embeddings D is tuned in {4, 8, 16, 32, 64}, the number of neighbors N for each entity is searched amongst {2, 4, 8, 16, 32}, the number of propagation layers L is tuned amongst {1, 2, 3, 4}, the number of self-attention heads K is searched in {2, 4, 8}, the L2 regularization weight λ is searched amongst {10^-8, 10^-7, 10^-6, 10^-5, 10^-4}, and the learning rate η is tuned in {10^-5, 2 × 10^-5, ⋯ , 2 × 10^-2, 10^-1}. We also set the maximal length of the user interaction sequence T for users and determine its value by calculating the median of the user interaction sequence length across all users. The hyper-parameter setting of our model is given in Table 2, which is mainly determined by optimizing Recall@10 on the validation set. Moreover, we repeat the whole experimental process five times and report the average values as results to keep the reliability of the model.

Table 2
Hyper-parameter settings of KHAN. D is the dimension of embeddings; N is the number of neighbors fcor each entity; H is the number of propagation layers; T denotes the maximal length of user interaction sequence; K is the number of self-attention heads; λ is L2 regularization weight; η is learning rate.

D N H T K λ η

MovieLens-20M 32 16 1 50 4 10^-7 2 × 10^-2

Dianping-Food 16 4 2 25 2 10^-6 10^-4

	D	N	H	T	K	λ	η
MovieLens-20M	32	16	1	50	4	10^-7	2 × 10^-2
Dianping-Food	16	4	2	25	2	10^-6	10^-4

Fig. 4

The performance of different methods in top-K recommendation.

Table 3

The performance of different methods in CTR prediction. “RI” denotes the average relative improvements of KHAN over the corresponding baseline.

	MovieLens-20M				Dianping-Food
	AUC	RI	F1	RI	AUC	RI	F1	RI
SVD	0.963	+2.08%	0.919	+2.50%	0.838	+4.18%	0.763	+4.98%
LibFM	0.959	+2.50%	0.906	+3.97%	0.837	+4.30%	0.761	+5.26%
PER	0.832	+18.15%	0.788	+19.54%	0.746	+17.02%	0.697	+14.92%
CKE	0.924	+6.39%	0.871	+8.15%	0.802	+8.85%	0.741	+8.10%
RGHAT	0.977	+0.61%	0.930	+1.29%	0.843	+3.56%	0.766	+4.57%
KGAT	0.976	+0.72%	0.928	+1.51%	0.846	+3.19%	0.785	+2.04 %
KGCN	0.978	+0.51%	0.932	+1.07%	0.845	+3.31%	0.777	+3.10%
KGNN-LS	0.979	+0.41%	0.934	+0.86%	0.850	+2.71%	0.778	+2.96%
KGIN	0.978	+0.51%	0.935	+0.75%	0.860	+1.51%	0.791	+1.26 %
KHAN	0.983	–	0.942	–	0.873	–	0.801	–

5.3 Results and analysis

The performance comparison results are shown in and Fig. 4 and Table 3. From Fig. 4 and Table 3, we have the following findings:

PER has the worst performance on both datasets. It is mainly because PER relies heavily on the predefined meta-paths, and defining optimal meta-paths is difficult. CKE performs poorly compared to other baselines, which is probably because CKE cannot fully utilize the KG through TransR regularization.

Knowledge-free methods (i.e., SVD and LibFM) outperform PER and CKE baselines but are weaker than KGCN and KGNN-LS methods, which indicates the importance of making full use of the KG.

RGHAT is slightly worse than KGCN and KGNN-LS, which shows that it is necessary to learn different representations for items when facing different users.

Compared with KGCN, the performance of KGNN-LS is significantly improved by introducing LS regularization, which indicates the LS part helps learn the edge weights in KG and has better generalization ability.

KGAT performs better on Dianping-Food compared to most of the baselines. One possible reason is that there are a large number of user-item interactions in Dianping-Food, and KGAT better captures the similarity of user interaction behavior by introducing user-item interaction information into knowledge graph.

KGIN shows strong performance compared with other baselines. This is mainly because KGIN preserves the semantics of long-range connectivity using relation dependencies. Note that we also incorporate information of relations in KHAN model.

Our model consistently achieves the best performance on both datasets, which proves the effectiveness of our model. KHAN not only learns more effective item representations by a hierarchical attention network, but also captures user preferences more accurately by combining user representations introduced in item encoder and user representations based on user interaction sequences in a unified framework.

Fig. 5

The performance of variants of KHAN methods. The best performing methods, KGCN and KGNN-LS, are also added as baselines, which are denoted as two dashed lines, respectively.

5.4 Effectiveness of item encoder and user encoder

In this section, to demonstrate the effectiveness of item encoder and user encoder, we consider the following two variants:

KHAN-h, in this model, we do not consider user interests based on user interaction sequences. Specifically, we use the user representation u introduced in item encoder as final user representation u^* in Equation (17).

KHAN-i, the variant model simply employs the embedding of the item ID as item representation without learning from KG.

Figure 5 illustrates the results of KHAN-h, KHAN-i, and KHAN in terms of Recall@10 and Recall@50. We also add KGCN and KGNN-LS as baselines, where KGCN only distinguishes user-relation attention, and KGNN-LS performs best w.r.t Recall@10 compared with other baselines. According to Fig. 5, we have the following findings:

KHAN outperforms KHAN-h and KHAN-i on both datasets, which validates that the user encoder and the item encoder are both beneficial for recommendation.

KHAN-i performs better than KHAN-h on both datasets. One possible reason is that user interaction sequences can better reflect user interests and excessive reliance on KG may introduce noise to user preferences.

Even without considering user interests based on user interaction sequences, KHAN-h still outperforms KGCN and KGCN-LS in most cases, which indicates the effectiveness of the hierarchical attention network. Note that KGNN-LS performs better than KHAN-h in Recall@10 on Dianping-Food, indicating that mitigating overfitting is also a significant part to be considered in the use of KG.

5.5 Results in data sparsity

One major motivation to exploit KG in recommender systems is to alleviate the sparsity issue. In order to investigate whether exploiting connectivity information of KG is helpful to alleviate this issue, we conduct experiments in cold-start scenarios where user-item interactions are sparse. With the validation and test set being kept fixed, we reduce the size of the training set from r = 100% to r = 20%. Figure 6 shows the results w.r.t. AUC on MovieLens-20M. From the results, we have the following observations:

When the training set is reduced from 100% to 20%, the performance of methods using only user-item interactions decreases significantly compared with KG-based methods, indicating that the information learned from KG can compensate for the lack of use-item interactions. Our model consistently performs well compared with other KG-based methods, which validates that KHAN can effectively mitigate the data sparsity issue.

Fig. 6

The performance of different methods w.r.t. the ratio of training set.

5.6 Analysis of hierarchical attention mechanism

In this section, we attempt to understand how the hierarchical attention mechanism facilitates item representation learning. Towards this end, we randomly selected two users u₇₃ and u₂₀₅ from MovieLens-20M, and one item v₅₄₈ that they both interact with. Figure 7 presents how the hierarchical attention mechanism distinguishes the contribution of neighboring entities to the central entity. There are two findings as follows:

KHAN can explicitly distinguish the importance of different relations and different entities under the same relation to a user. As we can see, for user u₇₃, he pays most attention to the movie director, and focuses secondly on the movie actor. For different movie actors, he also has different preferences. In addition, combining the attention values in the knowledge graph endows KHAN the ability of interpretability. For example, u₇₃ chooses v₅₄₈ because v₅₄₈ matches his preference for a certain director.

Jointly analyzing the hierarchical attention values of item v₅₄₈ when facing user u₇₃ and user u₂₀₅, we find that the movie director contributes significantly to the item v₅₄₈ for user u₇₃, while the movie actor achieves a higher attention score for user u₂₀₅. It verifies that KHAN can adaptively learn different representations for an item when facing different users.

Fig. 7

Hierarchical attention values of item i₅₄₈ when facing user u₇₃ and user u₂₀₅. Relations connected to only one entity omit their entity since the entity weight is fixed to 1.

6 Conclusion

In this work, we have proposed a knowledge-aware hierarchical attention network KHAN for personalized recommendation, which is comprised of an item encoder and a use encoder. In the item encoder, to learn effective entity representations, we have designed a hierarchical attention network that makes full use of the local neighborhood information from surrounding an entity by carefully discriminating the importance of the neighbors. In the user encoder, we have learned user representations based on user interaction sequences and combined them with user representations introduced in the item encoder through a gating mechanism to capture complex user interests. We have conducted extensive experiments on real-world datasets. The results have shown that KHAN improves over the strongest baselines in terms of AUC and F1 by 0.51%, 0.75%,1.51% and 1.26% on MovieLens-20M and Dianping-Food, demonstrating the significant advantage and rationality of the proposed model.

Footnotes

Acknowledgments

This work is supported by the National Natural Science Foundation of China under grant No.61872163 and 61806084, Jilin Province Key Scientific and Technological Research and Development Project under grant No.20210201131GX, and Jilin Provincial Education Department project under grant No.JJKH20190-160KJ.

https://grouplens.org/datasets/movielens/

https://www.dianping.com/

References

Gao

, Yao

, Jin

and Li

, Efficient Data-specific Model Search for Collaborative Filtering, KDD 2021, 415–425.

Xue

, Dai

, Zhang

, Huang

and Chen

, Deep matrix factorization models for recommender systems, IJCAI 2017, 3203–3209.

Zhu

, Topic recommendation system using personalized fuzzy logic interest set, Journal of Intelligent & Fuzzy Systems 40(2) (2021), 2891–2901.

Sun

, Guo

, Yang

, Fang

, Guo

, Zhang

and Burke

, Research commentary on recommendations with side information: A survey and research directions, Electronic Commerce Research and Applications 37 (2019), 100879.

Wang

, He

, Nie

and Chua

, Item silk road: Recommending items from information domains to social users, SIGIR 2017, 185–194.

Wang

, Zhang

, Hou

, Xie

, Guo

and Liu

, SHINE: signed heterogeneous information network embedding for sentiment link prediction, WSDM 2018, 592–600.

Wang

, Zhang

, Xie

and Guo

, DKN: deep knowledge-aware network for news recommendation, WWW 2018, 1835–1844.

, Liang

, Bin

and Chang

, Combining user-end and item-end knowledge graph learning for personalized recommendation, Journal of Intelligent & Fuzzy Systems 40(5) (2021), 9213–9225.

AyalaGómez

, Daróczy

and Benczúr

, M. Mathioudakis and A. Gionis, Global citation recommendation using knowledge graphs, Journal of Intelligent & Fuzzy Systems 34(5) (2018), 3089–3100.

10.

Zhang

, Yuan

N.J.

, Lian

, Xie

and Ma

, Collaborative knowledge base embedding for recommender systems, SIGKDD 2016, 353–362.

11.

Wang

, Zhang

, Feng

and Chen

, Knowledge graph embedding by translating on hyperplanes, AAAI 2014, 1112–1119.

12.

, He

, Xu

, Liu

and Zhao

, Knowledge graph embedding via dynamic mapping matrix, ACL 2015, 687–696.

13.

Wang

, Wang

, Xu

, He

, Cao

and Chua

, Explainable reasoning over knowledge graphs for recommendation, AAAI 2019, 5329–5336.

14.

, Shi

, Zhao

W.X.

and Yu

P.S.

, Leveraging meta-path based context for top- N recommendation with A neural co-attention model, SIGKDD 2018, 1531–1540.

15.

Sun

, Han

, Yan

, Yu

P.S.

and Wu

, Pathsim: Meta path-based top-k similarity search in heterogeneous information networks, VLDB 2011, 992–1003.

16.

, Pan

, Chen

, Long

, Zhang

and Yu

P.S.

, A comprehensive survey on graph neural networks, IEEE Transactions on Neural Networks and Learning Systems 32(1) (2021), 4–24.

17.

Wang

, He

, Cao

, Liu

and Chua

, KGAT: knowledge graph attention network for recommendation, KDD 2019, 950–958.

18.

Wang

, Zhao

, Xie

, Li

and Guo

, Knowledge graph convolutional networks for recommender systems, WWW 2019, 3307–3313.

19.

Zhang

, Zhuang

, Zhu

, Shi

, Xiong

and He

, Relational graph neural network with hierarchical attention for knowledge graph completion, AAAI 2020, 9612–9619.

20.

Lin

, Liu

, Sun

, Liu

and Zhu

, Learning entity and relation embeddings for knowledge graph completion, AAAI 2015, 2181–2187.

21.

Zhu

, Zhou

, Wu

, Tan

and Guo

, A knowledge-aware attentional reasoning network for recommendation, AAAI 2020, 6999–7006.

22.

Guo

, Zhuang

, Qin

, Zhu

, Xie

, Xiong

and He

, A survey on knowledge graph-based recommender systems, IEEE Transactions on Knowledge and Data Engineering 2020.

23.

Wang

, Zhang

, Wang

, Zhao

, Li

, Xie

and Guo

, Ripplenet: Propagating user preferences on the knowledge graph for recommender systems, CIKM 2018, 417–426.

24.

Tang

, Wang

, Yang

and Song

, AKUPM: attention-enhanced knowledge-aware user preference model for recommendation, KDD 2019, 1891–1899.

25.

Wang

, Huang

, Wang

, Yuan

, Liu

, He

and Chua

, Learning intents behind interactions with knowledge graph for recommendation, WWW 2021, 878–887.

26.

Vaswani

, Shazeer

, Parmar

, Uszkoreit

, Jones

, Gomez

A.N.

, Kaiser

and Polosukhin

, Attention is all you need, NIPS 2017, 5998–6008.

27.

Wang

, Zhang

, Leskovec

, Zhao

, Li

and Wang

, Knowledge-aware graph neural networks with label smoothness regularization for recommender systems, KDD 2019, 968–977.

28.

Koren

, Factorization meets the neighborhood: a multi-faceted collaborative filtering model, KDD 2008, 426–434.

29.

Rendle

, Factorization machines with libfm, ACM Transactions on Intelligent Systems and Technology 3(3) (2012), 1–22.

30.

, Ren

, Sun

, Gu

, Sturt

, Khandelwal

, Norick

and Han

, Personalized entity recommendation: a heterogeneous information network approach, WSDM 2014, 283–292.

Knowledge-aware hierarchical attention network for recommendation

Abstract

Keywords

1 Introduction

3 Problem formulation

4.1 Item encoder

4.1.1 Hierarchical attention-based information propagation

5.1 Dataset description

Table 1 Statistical details of the datasets MovieLens-20M Dianping-Food User-Item #User 138,159 2,298,698 Interaction #Items 16,954 1,362 #Interactions 13,501,622 23,416,418 Knowledge #Entities 102,569 28,115 Graph #Relations 32 7 #Triplets 499,474 160,519

5.2.1 Baselines

5.2.2 Evaluation metrics

5.2.3 Parameter settings

5.5 Results in data sparsity

Footnotes

Acknowledgments

References

Table 1
Statistical details of the datasets

MovieLens-20M Dianping-Food

User-Item #User 138,159 2,298,698

Interaction #Items 16,954 1,362

#Interactions 13,501,622 23,416,418

Knowledge #Entities 102,569 28,115

Graph #Relations 32 7

#Triplets 499,474 160,519