Abstract
Existing recommender systems usually make recommendations by exploiting the binary relationship between users and items, and assume that users have only flat preferences for items. They ignore users' intentions as the origin and driving force of users' behavior. Cognitive science tells us that a user's preference comes from an explicit intention: users first form an intention to possess a particular (type of) item, and then their preferences emerge when they face multiple available options. Most of the data used in recommender systems are composed of heterogeneous information contained in a complicated network structure. Learning effective representations from these heterogeneous information networks (HINs) can help capture the user's intention and preferences, thereby improving recommendation performance. We propose hierarchical user intention and preference modeling for sequential recommendation based on relation-aware HIN embedding (HIP-RHINE). We first construct a multirelational semantic space of HINs to learn node embeddings based on specific relations. We then model the user's intention and preferences using hierarchical trees. Finally, we leverage the structured decision patterns to learn the user's preferences and then make recommendations. To demonstrate the effectiveness of our proposed model, we conduct experiments on three real data sets. The results demonstrate that our model achieves significant improvements in Recall and Mean Reciprocal Rank metrics compared with the baselines.
Introduction
One of the critical tasks of a recommender system is to help users find the items they are interested in among many items, which improves the user experience. Traditional recommendation algorithms usually use a binary relation between users and items to learn users' preferences. Examples include collaborative filtering,1,2 which recommends items to users based on user or item similarity, and matrix factorization,1,3 which decomposes the scoring matrix into latent feature representations of users and items and then recommends items of interest to each user. However, these methods suffer from problems such as matrix sparsity, cold start, and flattened preferences, which limit the model's performance.
There is a natural interaction process in actual user purchase behavior 4 : first, the user intends to buy a specific type of item (e.g., a jacket), and then, driven by this intention, he/she selects a particular item (a jacket of a specific brand or a specific color) based on his/her preference and availability. This purchase behavior coincides with cognitive studies,5,6 wherein preference emerges only once one has an intention and that intention can be fulfilled by multiple options. Traditional recommendation algorithms use the user–item binary interaction relationship and ignore the origin and driving force of the preference, namely the user's intention. This is because modeling the user's intention and preference is challenging.
Existing recommender systems contain a wealth of different types of information, which constitutes a heterogeneous information network (HIN). 7 HINs generally have multiple types of nodes and links, which reflect different semantic perspectives on user preference. 8 The model in Sun et al. 9 uses matrix decomposition and a factorization machine to learn the feature representations of users and items along different meta paths. However, its learning ability varies with the meta path, so it learns well only for specific meta paths.
Chang et al. 10 construct a HIN model, for example, defining a network model on the Yelp data set through node types such as user, review, and word. The semantic association between two nodes located on two different meta paths is then defined using the PathSim algorithm. Prabhu et al. 11 propose a method to learn the feature representation of various types of nodes by deep heterogeneous network embedding. The model uses a convolutional neural network and a fully connected layer to learn the embeddings of images and text.
However, the mentioned methods have four shortcomings:
When modeling user preferences using a binary relationship between a user and an item, the assumption is that the user's preferences are flat, ignoring the hierarchical relationship between user intention and preference.
Identifying semantic heterogeneity between various types of nodes and relationships is difficult when modeling them in a shared feature space.
Fine-grained learning of node representations based on particular relationships is largely absent.
Distinct link relationships may correlate with different features of node properties.
In this article, based on relation-aware HIN embedding, we propose hierarchical intention and preference modeling for sequential recommendation. We make the nodes that hold the relationship close to each other and the nodes that weakly hold or do not hold the relationship far away by projecting each relationship and corresponding node in the HIN into the relationship-specific semantic space rather than the public space. To integrate disparate information, we create a relation-aware attention layer that personalizes the influence of different connections on node representation learning.
We model hierarchical user intention and preference based on multirelational node embedding learned in a HIN. We adopt high-level user–category decision making to understand user's category intention and specific preferences within the intention. The model ranks and recommends items depending on their learned preference degree that is explainable.
Our contributions mainly include the following four aspects:
We apply relation-aware HIN embedding to generate distinct node embedding that has diverse relationships among user–item–category.
We propose a relation-aware attention mechanism to learn the varied effects of different relationships on the representation of distinct node features.
We construct a hierarchical tree of user intention and infer the possible user intentions and preferences.
We evaluate our method on three real-world data sets, and the results demonstrate that the proposed model outperforms the baseline methods.
This article is organized as follows: Related Work section reviews related studies that lead to our proposed model in Methodology section; Experiments section details experiments and discussions followed by conclusions.
Related Work
HIN Embedding-based recommendation
As opposed to homogeneous networks, HINs have multiple types of nodes and edges. Several attempts with HIN embedding have yielded promising results in various tasks.12–15 The recommender system based on HIN successfully solves the problem of how to model different kinds of heterogeneous auxiliary information and user interaction behavior. It effectively alleviates the problem of data sparsity and cold start in the recommendation system and can significantly improve the interpretability of the recommender system.
The fundamental of a recommender system based on a HIN is to model the user–item interaction and all auxiliary information into the HIN, and then design a recommendation model suitable for the HIN. 16 SemRec 17 takes into account the attribute values of links, learns the weight mechanism of different meta paths, combines these similarities, and approximates the scoring matrix.
HeteRec 18 uses a meta path to calculate the item–item similarity, then takes an inner product with the user scoring matrix to generate a user preference diffusion matrix, and applies non-negative matrix factorization to the diffusion matrix to learn latent features of users and items. HIN2Vec 13 learns HIN embeddings by performing several prediction training tasks concurrently. HERec 15 filters node sequences with type restrictions, capturing the semantics of HINs.
Sequential recommendation
In contrast to traditional recommendation approaches such as collaborative filtering,19–21 or matrix factorization,3,22 sequential recommendation aims to capture the temporal shifting patterns of user preferences. The majority of classical approaches are based on Markov Chains (MCs), which explore how to extract sequential patterns to learn users' following preferences using probabilistic decision-tree models.23–26 Nevertheless, MC-based approaches can only represent local sequential patterns between neighboring interactions and cannot address the whole series. Then successive recommendation algorithms based on factorization machines are applied.
For instance, Rendle et al. 23 present FPMC, which combines matrix factorization and the Markov model to simulate individualized transition probability. Cheng et al. 27 extend FPMC to FPMC-LR and use a Markov model to impose geographical limits on the user's movement range. The enormous success of deep neural networks has also spurred the use of deep models in sequential recommendation.25,28,29 For example, Wang et al. 30 integrate auxiliary and identity information to develop e-commerce recommendations that mitigate the recommender system's cold start. Wang et al. 31 introduce the hierarchical representation model (HRM), which can extract interest representations more effectively from user behavior sequences.
Recently, Recurrent Neural Networks have been devised to model variable-length sequential data with the goal of encoding previous user behaviors into latent representations. Hidasi et al., 32 particularly, use gated recurrent units to collect user behavior sequences for session-based recommendations, and they subsequently suggest an enhanced version 33 with a different loss function. Liu et al. 34 and others35,36 investigate the challenge of sequential recommendation given contextual information. Furthermore, unidirectional 28 and bidirectional 29 self-attention techniques are used to collect sequential patterns of user activities, resulting in state-of-the-art performance.
Nevertheless, these approaches focus only on modeling the relationships between the target user's prior behaviors and their upcoming behavior, leaving out the capacity to capture user intents buried in the behaviors. As a result, conventional techniques are unable to explain why the target user takes her next action.
Intention-aware recommendation
In recent years, diverse intention-aware recommendation has drawn great attention. It takes into account users' intents in behavior modeling. Zhu et al. 37 propose a key-array memory network (KA-MemNN) that portrays intents directly using items' categories in users' behaviors. This approach is straightforward and provides an obvious way to define user intents. Chen et al. 38 employ an attention mechanism to capture users' category-wise intentions, represented by a pair of action types and item categories. Wang et al. 39 propose a neural intention-driven method for modeling the heterogeneous intentions underlying users' complex behaviors.
Li et al. 40 present an intention-aware method to capture each user's underlying intentions that may lead to her next consumption behavior, thereby improving recommendation performance. Wang et al. 41 aggregate the history sequence into relation-specific embeddings to model dynamic impacts of historical relational interactions on user intention. However, these methods give less attention to simulating user intentions, particularly when users' behaviors are melting. They also disregard organized user intent transition, resulting in a solid inductive bias for sequential recommendation.
Attention mechanism-based recommendation
Deep learning's attention process 31 is comparable with humans' selective visual attention mechanism. Its purpose is to swiftly find more relevant information to the task goal among a significant volume of information. It is frequently used in text translation, sequence modeling, image recognition, video description, etc. Hidasi et al. 32 pioneered the attention mechanism for machine translation within the encoder–decoder architecture. It can discover the shortest path between any two points, regardless of their distance or order. Deep Interest Network (DIN) 42 model calculates the correlation between users' previous shopping histories and potential items using the attention mechanism.
However, the DIN model does not take into account the time of user behavior and assumes that user behaviors are independent of each other. Deep Interest Evolution Network (DIEN) 43 holds that user interest is dynamic and shifts over time. Building on DIN, it introduces a user interest extraction layer and a user interest evolution layer. Local activation is incorporated in each stage of the Gated Recurrent Unit to boost the representation of relevant interests and mimic the movement of interests indicated by users in the behavior sequence.
Deep Session Interest Network (DSIN) 44 argues that the user behavior sequence has a hierarchical structure. User behavior in a single session is similar, and user behavior in subsequent sessions is considerably different. With high interpretability, the attention mechanism may distinguish the value of user behavior and screen out behaviors that are strongly related but irrelevant to objectives.
As we can see from the above, the drawbacks of the traditional recommendation approach have stimulated various efforts in different directions. HIN embedding-based recommendation tries to overcome problems with homogeneous networks; sequential recommendation aims to capture the temporal shifting patterns of user preferences. Recognizing that preference is rooted in the user's intention, many efforts have been made to capture that intention. Intention-aware recommendation simply tries to link user intents directly with behavior, which ignores conflicts between behaviors and transitions between intents. Recent developments in machine learning and deep learning shed new light on the problem, and attention mechanism-based recommendation is a bold attempt.
DIEN and DSIN are examples. However, neither explicitly represents users' intention and preference in a hierarchical structure. To address the issues identified, we propose a hierarchical user intention and preference framework for sequential recommendation based on relation-aware HIN embedding as described in the following section.
Methodology
In this section, we first introduce the problem formalization. Then we describe the proposed model framework in detail. After that, we talk about the different modules of our model. Finally, we discuss the model training.
Problem definition
Definition 1: Heterogeneous information network
A HIN is defined as a graph
Definition 2: Node and relation
We defined three types of nodes in HIN as follows: user nodes
Definition 3: HIN embedding
Given a HIN
Model framework
The framework of our approach is shown in Figure 1. It consists of three modules as follows:

Figure 1. Model framework.
Relation-aware node embedding: We generate distinct node embedding in HINs that have diverse relationships among the user–item–category. The user–item relationship represents the interaction between the user and item. Meanwhile, the item–category relationship represents which category the item belongs to. Relation-aware node embedding is to develop mapping functions that project nodes of diverse relationships to low-dimensional vectors.
Relation-aware attention layer: As the core of the attention model, the relational attention layer can capture the dependencies between nodes. To capture the effects of different relations on different node embeddings, we create the user-specific representation of categories as a weighted sum of the node embeddings.
Hierarchical user intention and preference for sequential recommendation: We construct a hierarchical tree of user intention and infer the possible user intentions and preferences at the next time step. We extract information about user intent from the relational attention layer and represent its hierarchical structure from fine to coarse. The users' intentions are learned to anticipate the interactions between users and items. We elaborate on the details of the three modules in the following subsections.
Relation-aware node embedding
The observable node
The correlation of two nodes is measured by Euclidean distance. Euclidean distance satisfies the triangular inequality, naturally maintaining the first-order and second-order correlation. This specific relation projection can keep the related nodes closely connected with each other or keep the unconnected nodes away. The distance between node
where
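As a rough illustration of the idea described in this subsection, the sketch below projects two node vectors into a relation-specific space and measures their Euclidean distance there. The projection matrix `M_user_item`, the function names, and the toy vectors are assumptions made for this example; they are not taken from the model's own definitions.

```python
import math

def project(matrix, vec):
    """Multiply a (k x d) matrix, given as a list of rows, by a d-dim vector."""
    return [sum(m * v for m, v in zip(row, vec)) for row in matrix]

def relation_distance(u, v, relation_matrix):
    """Euclidean distance between two nodes after projecting both into the
    space of one relation (relation_matrix is a hypothetical projection)."""
    pu = project(relation_matrix, u)
    pv = project(relation_matrix, v)
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(pu, pv)))

# Toy example: 3-dim node embeddings, 2-dim relation-specific space.
user = [1.0, 0.0, 2.0]
item = [1.0, 1.0, 0.0]
M_user_item = [[1.0, 0.0, 0.0],
               [0.0, 0.0, 1.0]]  # assumed projection: keeps dimensions 0 and 2

d = relation_distance(user, item, M_user_item)
```

A smaller distance in the relation-specific space would indicate that the two nodes hold the relation more strongly; in a trained model the projection would be learned rather than fixed by hand.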
Relation-aware attention layer
Different relations carry different semantic information; that is, they represent different aspects of nodes. In this section, we aim to capture the effects of different relations on different node embeddings. We propose a relation-aware attention layer that learns to assign different attention weights to capture the relationships among the nodes. We input node embedding
where
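One common way to realize such a layer is to score each relation-specific embedding, normalize the scores with a softmax, and take the weighted sum. The sketch below follows that pattern under stated assumptions: the dot-product scoring and the `query` vector (standing in for learnable attention parameters) are illustrative choices, not the exact formulation used in this article.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def relation_attention(embeddings, query):
    """Fuse one node's relation-specific embeddings into a single vector.

    embeddings: dict mapping relation name -> embedding in that relation space
    query: hypothetical attention vector (would be learned in practice)
    """
    relations = list(embeddings)
    scores = [sum(e * q for e, q in zip(embeddings[r], query)) for r in relations]
    weights = softmax(scores)
    fused = [sum(w * embeddings[r][i] for w, r in zip(weights, relations))
             for i in range(len(query))]
    return dict(zip(relations, weights)), fused

node_embeddings = {"user-item": [1.0, 0.0], "item-category": [0.0, 1.0]}
weights, fused = relation_attention(node_embeddings, query=[1.0, 1.0])
```

In this toy case both relations receive the same score, so the fused vector is the plain average; different scores would shift the weight toward the more relevant relation.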
Hierarchical user intention and preference for sequential recommendation
Inspired by Prabhu et al. 11 and Zhu et al., 45 we build a hierarchical tree according to the characteristic that the category–item relation has a hierarchical index in the recommender system. The retrieval process of each hierarchy is called hierarchical user intention and preference. To facilitate construction, at each hierarchy of nonleaf nodes in the tree, we first randomly sort the category information and place the items together that belong to the same category. If an item belongs to multiple categories, it will be randomly assigned to one of them.
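The initial construction described above can be sketched as follows. This is a minimal illustration assuming a two-level tree stored as a dictionary from category to item list; the function name, the `seed` parameter, and the toy data are hypothetical, and every item is assumed to have at least one category.

```python
import random

def build_initial_tree(item_categories, seed=0):
    """Group items under category nodes.

    Categories are placed in random order, and an item that belongs to
    several categories is randomly assigned to exactly one of them.
    """
    rng = random.Random(seed)
    categories = sorted({c for cats in item_categories.values() for c in cats})
    rng.shuffle(categories)  # random ordering of the nonleaf (category) nodes
    tree = {c: [] for c in categories}
    for item in sorted(item_categories):
        candidates = sorted(item_categories[item])
        tree[rng.choice(candidates)].append(item)  # one category per item
    return tree

item_categories = {
    "i1": {"jacket"},
    "i2": {"jacket", "shoes"},  # multi-category item
    "i3": {"shoes"},
}
tree = build_initial_tree(item_categories)
```

Each item ends up as a leaf under exactly one category node, which gives a valid starting tree before any reclustering with learned embeddings.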
Then we use the learned node embedding vector to recluster into a new tree. The nonleaf node is a coarse-grained category concept used as the index of items in the tree. The leaf node is the items in the corpus, which finely represents users' specific preferences under their intention. We predict the user's category intention and preference as follows:
where
We take the items
where
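The two-stage decision (intention first, then preference within the intention) can be illustrated with the sketch below. Inner-product scoring, the function name `recommend`, and the toy vectors are assumptions made for this example and may differ from the scoring functions defined in this section.

```python
def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def recommend(user_vec, category_vecs, item_vecs, tree, top_categories=1, top_items=2):
    """Rank categories first, then rank only the items under the chosen categories."""
    cat_scores = {c: dot(user_vec, v) for c, v in category_vecs.items()}
    intents = sorted(cat_scores, key=cat_scores.get, reverse=True)[:top_categories]
    candidates = [i for c in intents for i in tree[c]]
    ranked = sorted(candidates, key=lambda i: dot(user_vec, item_vecs[i]), reverse=True)
    return intents, ranked[:top_items]

tree = {"jacket": ["j1", "j2"], "shoes": ["s1"]}
category_vecs = {"jacket": [1.0, 0.0], "shoes": [0.0, 1.0]}
item_vecs = {"j1": [0.9, 0.0], "j2": [0.5, 0.0], "s1": [2.0, 0.0]}
user_vec = [1.0, 0.0]

intents, items = recommend(user_vec, category_vecs, item_vecs, tree)
```

Note that `s1` has the highest raw item score but is never considered, because it lies outside the inferred intention; a flat ranking over all items would have placed it first.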
Model training
We use the Bayesian personalized ranking objective 46 to optimize our model. The key idea of Bayesian personalized ranking optimization is to rank the items that users are really interested in ahead of the items that users are not interested in, that is, the positive sample probability is greater than the negative sample probability. So, we take a negative sample
where
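For reference, the standard pairwise form of the Bayesian personalized ranking loss for one (positive, negative) pair is the negative log-sigmoid of the score difference. The sketch below shows only that term; any regularization term and the way scores are produced by the model are omitted, and the function name is hypothetical.

```python
import math

def bpr_loss(pos_score, neg_score):
    """Pairwise BPR loss: -log(sigmoid(pos_score - neg_score)).

    Uses the identity -log(sigmoid(x)) = log(1 + exp(-x)), written in a
    numerically stable form for both signs of x.
    """
    diff = pos_score - neg_score
    if diff >= 0:
        return math.log1p(math.exp(-diff))
    return -diff + math.log1p(math.exp(diff))

well_ranked = bpr_loss(2.0, 0.0)   # positive item scored above the negative one
mis_ranked = bpr_loss(0.0, 2.0)    # positive item scored below the negative one
```

The loss shrinks toward zero as the positive item's score exceeds the negative item's score by a larger margin, which is what pushes interesting items ahead in the ranking.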
Experiments
We provide empirical results to demonstrate the effectiveness of our proposed model. The experiments are designed to answer the following research questions:
RQ1: How does our proposed model perform compared with other state-of-the-art sequential recommendation models and user intention modeling-based methods?
RQ2: How does each module (i.e., multirelation HIN embedding, relation-aware attention layer, and hierarchical user intention) affect the performance of our model?
RQ3: How do the influences of different parameters affect our proposed model?
Experiments settings
To answer the first research question (RQ1), we use three actual and available data sets and make comparisons with existing models on Recall and Mean Reciprocal Rank (MRR).
Data sets
To evaluate our proposed model, we conducted extensive experiments on the three real data sets. The statistics of the data sets are summarized in Table 1.
Table 1. Statistics of the data sets
MovieLens
This data set is about movie ratings and has been widely used to evaluate recommendation algorithms. We use MovieLens-1M, which contains 1 million rating records. We extract interaction records from rating data, items from “movie name,” and users from “user id.”
Douban-Book
This data set is about book ratings collected from the Douban website. We use the friend relationship, rating data, and genres of books in the data set, with genres serving as the category. It is worth noting that although our model illustrates only three types of nodes, it can be extended to more types of nodes and correspondingly more types of relationships.
Last-FM
This data set is about music that users listen to on the online music website Last.fm. The data set includes friend relationship, user listening to artist, user label to artist, and artist label. To unify category nodes, we take the artist's label as category.
Evaluation metrics
To evaluate the recommendation performance of our proposed model, we use two evaluation metrics: Recall@K and Mean Reciprocal Rank at K (MRR@K). The first metric is the fraction of ground-truth items that are retrieved over the total number of ground-truth items, whereas the second metric is the mean of the reciprocal of the rank at which the ground-truth item is retrieved. The larger the values of both Recall and MRR, the better the performance.
where
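The verbal definitions of the two metrics can be written directly as code. The sketch below is a minimal illustration; the function names are hypothetical, and a ground-truth item that does not appear in the top K is assumed to contribute a reciprocal rank of zero.

```python
def recall_at_k(ranked, ground_truth, k):
    """Fraction of ground-truth items that appear in the top-k of the ranking."""
    hits = len(set(ranked[:k]) & set(ground_truth))
    return hits / len(ground_truth)

def mrr_at_k(ranked_lists, targets, k):
    """Mean reciprocal rank of each target item within its top-k list.

    A target that is missing from the top-k contributes 0.
    """
    total = 0.0
    for ranked, target in zip(ranked_lists, targets):
        top = ranked[:k]
        if target in top:
            total += 1.0 / (top.index(target) + 1)  # ranks start at 1
    return total / len(targets)

recall = recall_at_k(["a", "b", "c"], ["b", "d"], k=2)     # one of two found
mrr = mrr_at_k([["a", "b"], ["c", "d"]], ["b", "x"], k=2)  # ranks: 2 and miss
```

Here `recall` is 0.5 because only "b" is retrieved, and `mrr` is 0.25 because the first target is found at rank 2 and the second is not retrieved.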
Baselines
We compare our model with the following baseline algorithms, including HIN embedding methods, session-based recommendation, and hierarchical representation approaches.
Deep heterogeneous autoencoders
This method uses a deep heterogeneous autoencoder to model heterogeneous auxiliary information and thus address the data sparsity problem of the collaborative filtering algorithm. 48 We set the number of hidden layers of the deep heterogeneous autoencoder (DHA) to L = 4. We also arrange the input data of DHA according to the data format required by that work. The input data include user, item, category, and interaction.
BPR-MF + TransE
This method combines BPR-MF and TransE. BPR-MF combines Bayesian personalized ranking with matrix factorization model and learns personalized ranking from implicit feedback. 49 TransE models the node embedding of HIN. Because we do not use image data, we remove the image (visual knowledge) processing module in BPR-MF + TransE.
FPMC
This method models user preferences by combining MF, which captures users' general preferences and a first-order MC to predict the user's next action. 23
PageRank with Priors
This method integrates the user–item relationship and other heterogeneous auxiliary information into a unified homogeneous graph. 50 PageRank outputs a personalized initial probability distribution. Similarly, we remove the image (visual knowledge) processing module in PageRank with Priors (PRP).
FOSSIL
This method integrates factored item similarity with MC to model a user's long- and short-term preferences. 26
We set
Hierarchical representation model
This method generates a hierarchical user representation to capture sequential information and general tastes. 31 We use max pooling as the aggregation operation because this achieves the best result.
SHAN
This model employs two attention networks to mine users' long- and short-term preferences. 51
Key-array memory network
This method uses a key-array memory network (KA-MemNN) to model user intention and preference hierarchically for sequential recommendation, based on the ternary relationship of user–intention–item. 37
Parameter settings
To facilitate the experiments, we filter out users and items with fewer than five interactions. For each user, we randomly select 80% of the interaction data as the training set
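A minimal sketch of this preprocessing is given below, assuming interactions are stored as (user, item) pairs. Whether the filter is applied once or repeated until stable is not stated here, so the sketch applies it once; the function names and the `seed` parameter are hypothetical, and the remaining 20% is simply returned as held-out data.

```python
import random
from collections import Counter, defaultdict

def filter_sparse(interactions, min_count=5):
    """Drop users and items that have fewer than min_count interactions (single pass)."""
    user_counts = Counter(u for u, _ in interactions)
    item_counts = Counter(i for _, i in interactions)
    return [(u, i) for u, i in interactions
            if user_counts[u] >= min_count and item_counts[i] >= min_count]

def split_per_user(interactions, train_ratio=0.8, seed=0):
    """Randomly assign train_ratio of each user's interactions to the training set."""
    rng = random.Random(seed)
    by_user = defaultdict(list)
    for u, i in interactions:
        by_user[u].append(i)
    train, held_out = [], []
    for u, items in by_user.items():
        shuffled = items[:]
        rng.shuffle(shuffled)
        cut = int(len(shuffled) * train_ratio)
        train += [(u, i) for i in shuffled[:cut]]
        held_out += [(u, i) for i in shuffled[cut:]]
    return train, held_out

# Toy data: 5 users x 5 items, plus one user with a single interaction.
data = [(f"u{u}", f"i{i}") for u in range(5) for i in range(5)]
data.append(("u_sparse", "i0"))

kept = filter_sparse(data)
train, held_out = split_per_user(kept)
```

Splitting per user rather than globally keeps every remaining user represented in the training set.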
Performance comparison
We begin with the comparison with respect to Recall@20, Recall@50, MRR@20, and MRR@50. Table 2 gives the empirical results, with %Imp. denoting the relative improvement of the top performing technique (bold) over the strongest baseline (underlined). We find the following:
Table 2. Overall performance comparison
The top performing technique (bold), the strongest baselines (underlined).
%Imp, denoting the relative improvements of the top performing technique over the strongest baselines; DHA, deep heterogeneous autoencoders; HRM, hierarchical representation model; KA-MemNN, key-array memory network; PRP, PageRank with Priors.
Our model consistently outperforms all baselines across the three data sets in terms of all measures. More specifically, it achieves significant improvements over the strongest baselines with respect to MRR@20 by 7.25%, 25.7%, and 15.87% in MovieLens, Douban-Book, and Last-FM, respectively. This demonstrates the soundness and efficacy of our model. These gains can be attributed to our model's relational modeling: (1) by investigating user intentions, we can better define the links between users and items, resulting in more effective user and item representations, whereas some baselines ignore hidden user intents; (2) our model learns node embeddings in HINs based on user–intention–item relationships; (3) our model fuses node feature representations in multirelational semantic spaces using relation-aware attention layers.
We can see that the sequential methods (e.g., FPMC, HRM, and KA-MemNN) outperform the nonsequential methods (e.g., BPR-MF, PRP, and FOSSIL) in general. The methods that consider user actions without their sequential order do not make full use of the sequence information and report worse performance. Specifically, compared with BPR-MF, the main advantage of FPMC comes from modeling historical user actions with first-order Markov chains, namely considering the sequence order, so FPMC reports better results than BPR-MF. This verifies that sequential patterns are essential for improving the predictive ability of sequential recommendation.
BPR-MF + TransE and PRP outperform DHA, indicating that HIN embedding captures the semantic features of heterogeneous information more reasonably, and so improves recommendation quality more, than directly encoding structural information in a feature engineering manner. KA-MemNN outperforms both BPR-MF + TransE and FPMC on all the data sets, indicating that learning hierarchical user intent and preference is better than learning flat user preference. Compared with BPR-MF + TransE and KA-MemNN, we model the heterogeneity of relationships in HINs based on specific relation semantics and personalize the fusion of node feature representations in each semantic space. In addition, we model hierarchical user intentions and preferences according to the natural user interaction process.
As the data show, there is a discrepancy in performance between HRM and KA-MemNN. The disparity, we believe, is caused by the various degrees of user intentions. When compared with single-level user intentions, two-level intents may be thought of as an extension that separates user intents into particular and broad categories.
Impact of components
In this section, we answer RQ2 by examining how each component of our proposed model affects the overall performance, relative to embedding in a public feature space. We also want to verify that our hierarchical user intent and preferences outperform flat user preferences on recommendation. We adopt three simplified versions of HIP-RHINE as follows.
HIP-RHINE-1: Remove the relation-aware heterogeneous information embedding module, and the replacement operation is to integrate heterogeneous relations and structured data into a unified homogeneous graph.
HIP-RHINE-2: Remove the relation-aware attention layer module, and the replacement operation is to directly add the feature expressions of nodes in each relation semantic embedding space point by point.
HIP-RHINE-3: Remove the hierarchical tree module, and the replacement operation is to directly calculate
We also apply Recall@N and MRR@N to evaluate the performance of these models. We show the results under the metrics of Recall@20, Recall@50, MRR@20, and MRR@50. In addition, we evaluate the score of each category as an average of the scores of its items. This way the intention-based MRR can also reflect the performance of item recommendations.
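The category-level scoring mentioned above (a category's score is the average of the scores of its items) can be sketched as follows. The dictionary layout and the function name are assumptions made for this example; categories without items are skipped.

```python
def category_scores(item_scores, tree):
    """Score each category as the mean of the scores of the items under it."""
    return {
        category: sum(item_scores[i] for i in items) / len(items)
        for category, items in tree.items()
        if items  # skip empty categories to avoid division by zero
    }

tree = {"jacket": ["j1", "j2"], "shoes": ["s1"], "hats": []}
item_scores = {"j1": 0.9, "j2": 0.5, "s1": 0.2}

scores = category_scores(item_scores, tree)
ranking = sorted(scores, key=scores.get, reverse=True)
```

Ranking categories by these averaged scores yields the category list over which an intention-based MRR can be computed.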
The results in Table 3 show that our method performs well on all the data sets compared with HIP-RHINE-1 because we consider the heterogeneity of relations for node embedding. Besides, our method performs well on all the data sets compared with HIP-RHINE-2 because our method captures the degree of influence of different relations on the final node embedding. The experiments demonstrate the effectiveness of our multirelational semantic embedding and relation-aware attention layer. Compared with HIP-RHINE-3, our method performs well on all data sets because our method hierarchizes user intents and predicts user preferences for items based on specific intents. The experiments show the effectiveness of hierarchical user intents and preferences.
Table 3. Performance evaluation of variant models
The complete HIP-RHINE is the top performing (bold).
Parameter analysis
Beyond the individual components, the model's performance is also affected by its parameters.
To further investigate the influence of different parameters in our model, we calculate the values of Recall@20, Recall@50, MRR@20, and MRR@50 for HIP-RHINE across different embedding dimensions d, and also explore the sensitivity to the number of negative samples.
As shown in Figure 2a–d, the model's performance gradually improves as dimension d increases. However, the model performance decreases a little on the Last-FM data set when

Furthermore, we study the effect of the sampling number k on the overall performance. Because the item sizes differ across the three data sets, we experiment with various k ranges. Specifically, we try

The performance gain between two successive trials, in contrast, diminishes as the sampling number k grows. It suggests that if we continue to sample more negative samples, we will see less performance progress but more computational complexity.
Case study
To investigate whether our proposed model is effective and explainable, we choose one user at random from Douban-Book and visualize the hierarchical tree of user intention and preference.37,52 We extract attention between a single category and the observed objects that correspond to that category for each user.
As shown in Figure 4, there are three types of nodes. A category node is a broad term that encompasses a wide range of concepts. A concept is a collection of items that share some common attributes. Concepts, as opposed to coarse-grained categories and fine-grained entities, can assist in better representing users' interests at a semantic granularity that is appropriate. An entity is a unique item that belongs to one or more concepts. There are three sorts of edges between nodes as well. The IsA relationship denotes that the destination node is a child of the source node. The involved relationship indicates that the destination node is involved in a source node-described item.

Figure 4. An example showing the hierarchical tree of user intention and preference given by our approach.
The color scale of entity nodes (items) shows the value of the attention weights, with darker signifying a more considerable weight and lighter representing a lower weight, as illustrated in Figure 4. When generating category embeddings, we can see that the frequently visited objects are generally given a higher weight. This phenomenon might be explained because category-specific users' preferences are reflected in the most frequently viewed items in that category.
Conclusions
In this article, we propose a model for sequential recommendation based on hierarchical intentions and preferences with relation-aware HIN embedding, which can learn node representation in the HIN at a fine-grained level based on the particular relationships. To customize the merging of heterogeneous information, we adopt a relation-aware attention layer. Furthermore, we employ hierarchical trees to represent user intents and preferences hierarchically, and we use structured choice patterns of users for user preference learning to improve recommendation performance.
Extensive experiments on three real data sets are carried out to evaluate the performance of our proposed approach. In terms of Recall and MRR metrics, the findings show that our model outperforms state-of-the-art approaches by a significant margin. In the future, we will investigate multiple and variable intents or knowledge graph information combined with user intention modeling.
Footnotes
Authors' Contributions
F.Y. contributed to methodology (lead), writing—original draft (lead), formal analysis (lead), and writing—review and editing (equal). G.L. was involved in evaluation (lead) and writing—review and editing (equal). Y.Y. carried out conceptualization (lead) and writing—review and editing (equal).
Author Disclosure Statement
No competing financial interests exist.
Funding Information
This study was supported by Basic Public Welfare Research Project of Zhejiang, China (LGF20G020001), Key Lab of Film and TV Media Technology of Zhejiang Province (No. 2020E10015), and the AI University Research Centre (AI-URC) through the XJTLU Key Program Special Fund (KSF-A-17).
