Abstract
Modeling user’s fine-grained preferences and dynamic preference evolution from their chronological behaviors are challenging and crucial for sequential recommendation. In this paper, we develop a Hierarchical Self-Attention Incorporating Knowledge Graph for Sequential Recommendation (HSRec). HSRec models not only the user’s intrinsic preferences but also the user’s external potential interests to capture the user’s fine-grained preferences. Specifically, the intrinsic interest module and potential interest module are designed to capture these two preferences respectively. In the intrinsic interest module, user’s sequential patterns are characterized from their behaviors via the self-attention mechanism. As for the potential interest module, high-order paths can be generated with the help of the knowledge graph. Therefore, a hierarchical self-attention mechanism is designed to aggregate the semantic information of user interaction from these paths. Specifically, an entity-level self-attention mechanism is applied to capture the sequential patterns contained in the high-order paths while an interaction-level self-attention mechanism is designed to further capture the semantic information from user interactions. Moreover, according to the high-order semantic relevance, HSRec can explore the user’s dynamic preferences at each time, thus describing the user’s dynamic preference evolution. Finally, experiments conducted on three real world datasets demonstrate the state-of-the-art performance of the HSRec.
Keywords
Introduction
Recommender system is used to alleviate the information overload problem, thus meeting user’s preferences accurately from his/her interaction histories. Unlike the general recommendation [12, 26], sequential recommendation [25, 29–31] aims at predicting the next item or next few items successively that target user would interact with by modeling user’s sequential interaction history. Since the user’s preferences are complex and changeable in practice, sequential recommendation needs to capture user’s fine-grained preferences and dynamic preference evolution.
In the literature, to explore user’s dynamic interactions, Markov Chain-based method [3, 14] is a classic work, which assumes that target user’s next interaction depends on the preceding few actions. Although these models can characterize short-range item transitions for recommendation [8], it is difficult for them to explore dynamic sequential signs in complex scenarios with noisy data. Recently, neural network-based methods have grabbed people’s attention successfully, which can encode user’s sequential interactions into latent vectors. Several studies [16, 24] apply convolutional neural networks (CNNs) to extract important features from user’s sequential interactions by utilizing convolutional operations. Besides, recurrent neural networks (RNNs) also have been proven powerful in sequential recommendation [5, 11]. Transformer [17] is a useful sequential model with state-of-the-art performance and efficiency, which is purely based on self-attention. Inspired by it, self-attentive sequential recommendation (SASRec) [8] and time interval aware self-attention based sequential recommendation (TiSASRec) [10] apply self-attention to model user’s sequential interactions without any recurrent or convolutional operations, and achieve better performances than CNN/RNN based models. However, most of those approaches cannot capture the user’s dynamic fine-grained preference well. For example, as shown in Fig. 1, if there is only a user-item sequence, i.e., user →Raiders of the Lost Ark → The Green Mile → Ice Pawn, it is difficult to understand why user interacted with The Green Mile at t2 or with Ice Pawn at t3. Introducing external knowledge is a promising approach for this problem. In recent years, knowledge graph (KG) has received a lot of attention due to its abundant facts and entity connections. With the help of KG, we can find that the The Green Mile and Ice Pawn are Drama films. This film type is the same as the films that he has been watched, i.e., The Shawshank Redemption and Fight Club. So we can intuitively guess that user selected The Green MileandIce Pawn and Ice Pawn during the t2-t3 time step since both of them are Drama films.

An example of user interactions with high-order paths.
Various studies have applied KG into recommendation, such as deep knowledge-aware network (DKN) [20], RippleNet [19], knowledge graph convolutional networks (KGCN) [21]. However, these methods assume user’s sequential interactions are static without taking the evolution of relational items into account. For example, as shown in Fig. 1, these methods only know that user interacted with The Green Mileand since its type is Drama, interacted with The Shawshank Redemption since its type is Adventure and interacted with Ice Pawn since its type is Drama. These preferences are treated independently and statically without considering their sequence orders. The Adventure and Drama films would be equally viewed when the recommendation provide a result to user at t4. However, we find that the user’s preferences experienced an evolution from Adventure to Drama. Drama would play a larger influence on the user’s preference at t4 since user’s subsequence interaction is usually related to his/her last click. He/She would prefer the Drama film at t4 rather than the Adventure film. That is to say, Adventure and Drama film should not be equally treated at t4. We call this evolution as “preference drift”. Capturing the “preference drift” can further help to accurately predict user’s next action.
As for the KG-based sequential model, knowledge-enhanced sequential recommender (KSR) [7], knowledge-guided reinforcement learning model (KERL) [22] and Chorus [18] take the advantage of KG to get better item representation, thus provide better recommendation results. However, these models adopt the idea of translation function to embed the entities and relations into low-dimensional continuous vector spaces, which focus more on the 1-order connectivity in the KG. They cannot explore the high-order connectivity between user and item well, which will limit the ability of model to exploit how different high-order semantic relevance will influence temporal evolution in the sequence.
To tackle these challenges of the above-mentioned work, we propose Hierarchical Self-Attention Incorporating Knowledge Graph for Sequential Recommendation (HSRec). HSRec aims at exploring user’s fine-grained interests and the effects of higher-order semantic relationship between user and item on the sequential temporal evolution. On the one hand, the user’s dynamic fine-grained interests are supposed to include two aspects, termed as inherent interest and potential interest. Therefore, the inherent interest module and potential interest module are designed to capture these two different interests respectively. Specifically, an inherent interest module is developed to capture user’s inherent interest from his/her sequential actions via the self-attention mechanism. Besides, the user’s external potential preference can be explored by the potential interest module. In the potential interest module, each user-item pair can generate several high order paths with the help of the KG. In this part, we design a two-level self-attention mechanism to adaptively select the relevant semantic information for the user’s potential preference, called entity-level self-attention and interaction-level self-attention. Such exploration will be conducted under the guidance of the KG-aware path. On the other hand, to exploit the temporal evolution of sequential interactions, we investigate the utility of high-order connectivity between user and item to capture user’s temporal interests. This strategy can better capture the contribution of each item in different contexts, thus help our model to learn how different relations will influence temporal evolution.
To summarize, the contributions of this paper are as follows: To infer user’s intents from their sequential behaviors better, we propose a Hierarchical Self-Attention Incorporating Knowledge Graph for Sequential Recommendation model, termed as HSRec, to capture user’s dynamic fine-grained interests and their dynamic preference evolution. The user’s fine-grained interests are modeled as inherent preference and potential preference, which are further explored by the inherent interest module and potential interest module respectively. Moreover, a hierarchical self-attention mechanism is developed to aggregate the semantic user interaction information from the high order paths of each user-item pair, thus exploring their dynamic preference evolution. Experiments conducted on three real-world datasets demonstrate the state-of-the-art performance of HSRec.
In this section, we will introduce serval previous work that related to our model, which mainly includes the sequential recommendation and KG-based recommendations.
Sequential recommendation
Rendle et al. [14] proposed Factorizing Personalized Markov Chains (FPMC) model to model user’s sequential interests by integrating the power of matrix factorization and first-order Markov Chain. Moreover, He et al. [3] proposed Fossil to fuse similarity-based models with high-order Markov Chain. Recently, neural network has been widely applied into sequential recommendation. Tang et al. [16] proposed convolutional sequence embedding recommendation model (Caser) model to capture important sequential features by utilizing horizontal and vertical convolutional filters. However, Caser is a 1D-CNN framework. To extend this idea, Yan et al. [24] proposed convolutional network for sequential recommendation (CosRec) model, a 2D-CNN framework that introduced a pairwise encoding module. RNNs are also utilized for sequential recommendation. For instance, [6] and [5] applied Gate Recurrent Unit (GRU) to model user’s short-term preferences by encoding the historical sequence into a latent vector. Li et al. [11] proposed neural attentive recommendation machine (NARM) to captures user’s global and local preferences respectively by an encoder-decoder architecture. To better mine user’s various preferences, Guo et al. [30] proposed a Hierarchical Leaping Network (HLN) to distinguish different preferences. Furthermore, Zhang et al. [31]. focused on user’s hierarchical intents by designing a model, called Hierarchical Intents and Selective Sequential Interests (HISSI). HISSI can form fine-to-coarse grained intents and accurately predict user’s next behavior that is related to his/her last interaction. Inspired by the Transformer [17], SASRec [8] applied self-attention to model user’s sequential interactions without any recurrent or convolutional operations, and achieve strong performance. However, SASRec did not consider the relative time intervals between two items. To extend this idea, Li et al. [10] proposed a time-aware self-attention mechanism for the sequential recommendation, termed as TiSASRec.
KG-based recommendation
DKN [20] adopted CNN-based framework to extract word embedding and entity embedding. To better explore the semantic information from KG. Wang et al. [19] proposed a KG-aware recommendation that can automatically capture user’s hierarchical interests from high-order connectivity, namely RippleNet. For another perspective, Wang et al. [21] employed graph convolutional networks (GCNs) for the KG-based recommendation. However, all of these models cannot model the “preference drift” well. As for the KG-based sequential model, Huang et al. [7] proposed a GRU-based sequential model with a knowledge-enhanced key-value memory network (KV-MN), which can leverage the power of KG to improve the representation of items. Wang et al. [22] further employed the KG to guide reinforcement learning-based model, aiming at capturing user’s long-term preference. Wang et al. [18] proposed a knowledge-aware and time-aware sequential model, Chorus, which can model the item relations and corresponding temporal dynamics simultaneously.
Primary formulation
Task definition
As for the sequential recommendation, we used U and I to represent user and item set respectively, where u ∈ U is a user and i ∈ I is an item. And each user’s sequential behaviors sorted by the timestamp, which are denoted as
Knowledge graph and entity alignment
KG can be regarded as a heterogeneous graph with abundant nodes (i.e., entities) and edges (i.e., relations). We use a triplet (h, r, t) to represent such a fact, i.e., there is a relation r between head entity h and tail entity t. Formally, the KG can be defined as G o = {(h, r, t) |h, t ∈ ɛ o , r ∈ R o }, where ɛ o and R o are the entity set and relation set respectively. Furthermore, we use “entity alignment” operation to share the information between user-item interactions and KG data. Therefore, we maintain an item-entity mapping set for the alignment between item and its corresponding entity in the KG. Formally, let H = {(i, e) |i ∈ I, e ∈ ɛ o } as an alignment mapping set. After finishing it, the new integrated knowledge graph can be denoted as G = {(h, r, t) |h, t ∈ ɛ, r ∈ R}, where ɛ = ɛ o ∪ U and R = R o ∪ interact.
High-order path
A high-order path extended by the KG can be denoted as
Our approach
Inherent interest module
The overall framework is shown in Fig. 2. In the inherent interest module, the training sequence

The overall framework of HSRec.
Besides, we argue that the user’s preferences include not only inherent interests but also potential interests. In the potential interest module, each user-item can generate several high order paths with the help of the KG. These paths contain the semantic information of user interaction. It is necessary to effectively capture the semantic relevance through an appropriate strategy. To achieve this goal, a two-level self-attention mechanism is designed to adaptively select the relevant semantic information, termed as entity-level self-attention and interaction-level self-attention. Specifically, the entity-level self-attention is designed to model the high-order path. Since the high-order path is made up of entities and connected relations, all of these elements can be viewed as a new sequence. Hence, it is a promising strategy to capture the high-order relevance from such structure by the self-attention mechanism. Taking the
Datasets
The datasets we adopted are the Movielens-100k 1 , Movielens-1M 2 and Amazon Book [27]. Movielens-100 k and Movielens-1M are two widely used movie datasets, while Amazon Book is a larger dataset obtained from the Amazon website. It contains the user ratings on book products. As described above, it is necessary to integrate the recommendation datasets with KG. IMDB 3 is a useful KG tool, containing abundant movie auxiliary information. Hence, we utilize it to execute the entity alignment operation between Movielens-100k/Movielens-1M. Specifically, we employ the title and released year of the movie as a link between the Movielens-100k/Movielens-1M and IMDB. As for the Amazon Book, we apply the KB4Rec dataset [28] to link the entities from Freebase with items. Considering that the sequential recommendation puts more emphasis on predicting if the user will interact with the recommended item, we convert these explicit feedback data to implicit feedback data, i.e., all the observed interactions are regarded as positive feedback. Then we remove the cold-start user/item whose interaction length is less than 5. As for the Amazon Book, entities with fewer than 3 KG triples are further deleted. Finally, the detailed basic statistics of the three datasets are shown in Table 1. As for the data splitting, we use the leave-one-out method, i.e., for each user, we choose the most recent item for testing, the item before the test item for validation, and the remaining items for training. The evaluation metrics utilized are Hit Rate (i.e., HR), Normalized Discounted Cumulative Gain (i.e., NDCG). And we also randomly sample 100 negative items for each user and rank them with the ground-truth item, i.e., there are 101 items for ranking. Finally, the 7 state-of-the-art baselines used for comparison are described as follows:
Basic statistics of the three datasets
Basic statistics of the three datasets
For the implementations of BPR, NCF, Tensor, GRU4Rec, NARM, we use the code from ReChorus 4 . As for SASRec and TiSASRec, the codes are provided by the authors. Specifically, in the proposed model, we adopt 2 self-attention blocks. And the optimizer adopted in our model is Adam, the batch size is set as 128 and the learning rate is set as 0.001, the maximum sequence length is set as 200. The results of the above baselines and HASRec are shown in Table 2, Tables 3 and 4. After comparing the results in term of four evaluation metrics, we have the following observations:
Overall performance comparison on the Movielens-100k dataset
Overall performance comparison on the Movielens-100k dataset
Overall performance comparison on the Movielens-1M dataset
Overall performance comparison on the Amazon Book dataset
HSRec performs best on three datasets in the sequential recommendation scenario. The results reveal that the strong ability of HSRec to capture the user’s sequential patterns and better investigate the utility of KG for the sequential recommendation. SASRec and TiSASRec significantly outperform other baselines in both datasets. This shows that the effectiveness of self-attention for the sequential recommendation, which can adaptively select the relevant sequential signs for prediction. On the one hand, although all of SASRec, TiSASRec and HSRec adopt the self-attention to explore user’s sequential preferences, HSRec extends it by utilizing a hierarchical self-attention mechanism to model user’s sequential-aware and knowledge-aware interests. We argue such a strategy can better capture more comprehensive and fine-grained user preferences. On the other hand, both of SASRec and TiSASRec lack the guidance of external knowledge for sequence modeling, it also proves that high-order knowledge path can improve the model performance.
Impact of the hidden dimensionality
Note that the SASRec and TiSASRec are more strong than other baselines, so we make comparisons with them. We further investigate the impact of hidden dimensionality d on the model performance. The results with the hidden dimensionality d from 10–90 are shown in Fig. 3. It is obvious that too small hidden dimensionality is not conducive to the model’s performance. As the hidden dimensionality increases, the model performance also increases until it tends to converge. Note that the HSRec performs better than the others regardless of the hidden dimensionality.

Results of different hidden dimensionalities.
In practice, different users have different sequence lengths. The sequence length is an important parameter for the sequential task, which determines that how much information the model can learn from. The effects of different maximum sequence lengths on the model’s performance are shown in Fig. 4. From the figure, we have observed that the model performance would benefit from the larger sequence length. One likely reason is that the longer the sequence length, the more useful sequence information it contains. However, the larger sequence length is not adopted due to the memory limitation of our GPU.

Results of different maximum sequence lengths.
Since the hierarchical self-attention mechanism is designed to aggregate the semantic user interaction information, thus exploring their dynamic preference evolution, it is necessary to explore effects of this mechanism on the model performance. Therefore, in this experiment, we perform an analysis on the different important components of hierarchical self-attention mechanism, i.e., entity-level self-attention and interaction-level self-attention. The results are shown in Tables 5 and 6. From these tables, we can perceive that entity-level self-attention and interaction-level self-attention are helpful to improve the model performance. Furthermore, HSRec performs worse without entity-level self-attention than without interaction-level self-attention, which demonstrates that entity-level self-attention is more important for aggregating high-order semantic information. The reason seems that the entity-level self-attention can learn more semantic information from different entities and paths than from the interaction-level information.
Overall performance comparison on the Movielens-100k dataset
Overall performance comparison on the Movielens-100k dataset
Overall performance comparison on the Movielens-1M dataset
In this paper, we propose a Hierarchical Self-Attention Incorporating Knowledge Graph for Sequential Recommendation (HSRec) to capture user’s fine-grained preferences and dynamic preference evolution. Specifically, the fine-grained preferences contain two parts, i.e., inherent interest and potential interest. Both of these two interests can be further captured by the inherent interest module and potential interest module. In the inherent interest model, self-attention is applied to capture the sequential patterns from the user’s behaviors. Moreover, in the potential interest module, each user-item pair can generate several high-order paths, which can contain semantic information about the user interaction. Therefore, a hierarchical self-attention mechanism, i.e., entity-level self-attention and interaction-level self-attention, is designed to adaptively aggregate the semantic relevance from these paths. Besides, according to the character of the high-order paths, we can further explore the user’s dynamic preference evolution. Finally, extensive experiments on three datasets show the state-of-the-art performance of the proposed model.
In the future, we plan to extend this work in two ways. Recently, graph neural networks (GNNs) have proven their strong ability in complex graph. Hence it is intuitive to utilize GNN-based model to learn better representations of entity and relation in the KG. Besides, since the KG used in our model is incomplete, jointly modeling KG completion and sequential recommendation task is a promising method to improve the model performance.
