Abstract
Recommender systems provide personalized recommendation services for users in online shopping, entertainment, and other activities. Unlike a traditional recommender system, an interpretable recommender system presents the reasons for a recommendation together with the result, which raises the probability that users accept the recommendation. In this paper, an interpretable recommendation model based on XGBoost trees is proposed to obtain comprehensible and effective cross features from side information. The cross features are fed into an attention-based embedding model to capture the implicit interactions among user IDs, item IDs and cross features. The captured interactions are used to predict the match score between the user and the recommended item. The attention scores of the cross features are used to generate different recommendation reasons for different user-item pairs. Experimental results show that the proposed algorithm maintains recommendation quality, while the recommendation reasons it provides improve the transparency and readability of the recommendation process. The method can help users better understand the recommendation behavior of the system and offers some guidance for making recommender systems more personalized and intelligent.
Introduction
The development of information technology and the Internet industry has resulted in “information overload”. Users need to quickly and accurately find what they need among massive resources, while merchants can increase sales and profits by displaying suitable items to users. However, Google, Baidu and other keyword-based search engines can only meet the simple requirements of the public rather than their personalized demands. The emergence of the recommender system has largely solved this problem [15]. A recommender system is an information filtering system that predicts a user’s rating of, or preference for, a given item from the user’s basic information and historical behaviour records. It changes the way businesses and users communicate and increases the frequency of their interaction. Compared with traditional recommender systems, explainable recommender systems give the reason together with the recommendation result. The explanation helps users understand how the model works and can elicit user feedback [10]. It also increases the probability that users will accept the recommended item. Explainable recommendations are generally divided into three categories: item-based, user-based, and content-based. Item-based and user-based explainable recommendations usually rely on implicit or explicit feedback from users. In user-based recommendation, the reason given is that users similar to the current user like the item [14]. Nevertheless, users know nothing about the users who closely resemble them, so the reason may not be convincing. In addition, disclosing other users’ information may raise privacy issues in commercial systems. In item-based collaborative filtering, the reason given is that the recommended item is similar to items the user has purchased or browsed [27]. It is often displayed as a list of recommended items similar to those that have been purchased or viewed. However, users may not see how an item they have never used relates to the items they have experienced.
Research shows that content-based recommendations are more popular with users than user-based and item-based recommendations. Content-based recommendation needs to determine the features of users and items and how interested users are in different features. It can identify the most suitable feature for interpretation and supports finer-grained modelling. For example, if a movie is recommended to a user, the recommendation reason may be that the movie belongs to a romantic, love, and literary type of movie that female users over the age of 18 enjoy. The feature information of the user and the item can indicate what recommendation is suitable for the user. Therefore, the recommender system should (1) find effective cross features from the rich side information of users and items, and (2) predict the matching scores of users and items in an interpretable manner. Existing recommender systems cannot satisfy these two conditions at the same time.
Gradient Boosting Decision Tree (GBDT) [16] and other tree-based methods make predictions by iteratively reducing the loss function, and the decision rules generated in this process are interpretable. However, such models can explore only a small number of possible feature interactions, and their exploration ability is weaker than that of embedding-based models [20]. Embedding-based models such as the neural factorization machine (NFM) [11] and Wide & Deep [7] embed the user ID, item ID, and cross features into a shared embedding space and then transform the embedding vectors non-linearly. Invisible interactions between user IDs, item IDs, and cross features are captured by the strong representation capability of the non-linear hidden layers.
Embedding-based methods have strong generalization ability when capturing implicit features of user IDs and item IDs. However, when the captured object is side information, the cross features they capture are implicit because of the non-linear learning process, and the model does not explain the high-order cross features. Tree-based models, on the other hand, generate clear decision rules for prediction and are well suited to learning from side information. Since each leaf corresponds to a significant cross feature, tree-based models are self-explanatory in nature. However, they cannot model attributes related to user IDs and item IDs. Therefore, to build a general and explainable recommender system, a better solution is to combine the two kinds of model. We use the active leaf nodes as cross features of the side information and embed them in a shared embedding space together with the user IDs and item IDs, capturing the complex interactions among the three. At the same time, an attention mechanism is introduced that learns the importance of each interaction feature from the data and uses it as a weight to adjust the model, rather than relying on a global weighting mechanism. This improves the performance of the model and helps the recommender system choose more appropriate cross features as explanations for different user-item pairs.
The structure of this paper is as follows. Section 2 introduces the related work. Section 3 details the model structure used in this paper. First, we build eXtreme Gradient Boosting (XGBoost) trees [5] on the side information of users and items to obtain effective cross features. Then we feed user IDs, item IDs and cross features into an embedding-based model. The model is a neural attention network that re-weights the cross features according to the current prediction. The attention scores select the most predictive cross features, while the model captures unseen interactions between IDs and cross features. The entire prediction process is transparent and self-explanatory. Section 4 introduces the experimental design process and results. Section 5 draws conclusions and proposes future research directions.
Related work
Content-based recommendations use established user preference profiles and select items with similar features. They are usually explained intuitively from the perspective of different item features. Much recent work has approached the problem from the perspectives of item tags, user comments, visual images and social information.
From the perspective of item tags, the system provides suggestions by matching user profiles with candidate item tags. Vig et al. [31] used movie tags to generate suggestions and explanations. The system shows the movie tags to users and tells them which tags are related to them.
Comment-based recommendations apply sentiment analysis and opinion mining to user comments to derive recommendation reasons. Zhang et al. [36] proposed and open-sourced a phrase-level sentiment analysis tool, Sentires. The tool can extract the “feature-view-emotion” triad from product reviews to express users’ positive or negative emotions about the product characteristics. Based on this toolkit, researchers have developed different models to interpret recommendations. Natural language processing (NLP) tools are used to process comments. Some papers [9, 32] obtain the most important features of each item by mining user comments. Other studies [3, 35] obtain the user’s sentiment towards the item by analysing the content of user comments. For example, Pappas et al. [24] analysed the sentiment of each comment, constructed the user-item sentiment matrix and processed it with collaborative filtering. The paper [17] proposed a capsule-network-based model for rating prediction with user reviews, named CARP. CARP extracts informative logic units from the user and item review documents and infers their corresponding sentiments for rating prediction. In addition, researchers have tried to extract user opinions from comments. The Integrated Topic and Latent Factor Model (ITLFM) [34] and Rating-Boosted Latent Topics (RBLT) [29] linearly combined comments and ratings to establish a model of user preferences and item features. The results were put into a matrix factorization model for analysis. Chen et al. [6] designed an encoder-selector-decoder architecture for multi-task learning. The model uses a hierarchical co-attentive selector to identify the reviews and concepts that are important for the user-item pair.
Researchers have also tried to use item images to provide users with image-based explanations. Lin et al. [19] studied the interpretable clothing recommendation problem. A convolutional neural network with an attention mechanism was proposed to extract visual features of clothing. The extracted features are fed into a neural prediction network to predict the recommendation score. McAuley et al. [22] developed a recommender system that models users’ visual preferences. These preferences are used to analyse the image information of the item (such as clothing patterns, styles and fabrics). Research on visually interpretable recommendation is still at an early stage, and not all visual features are useful in recommendation prediction [8].
Some researchers have used social information, such as the social relationships of target users, to generate explanations. These methods take into account the friendship and trust relationships between users. Park et al. [25] proposed a unified graph structure that uses friends of the target user with similar preferences as the reason for recommendation. Ratings and social information are used to generate interpretable product recommendations. Chen et al. [4] proposed a SEcure SOcial RECommendation (SeSoRec) framework, which collaboratively mines knowledge from a social platform to improve the recommendation performance of the rating platform. The correctness and security of SeSoRec were proved theoretically.
These recommendation methods can provide some specific reasons, and they mainly rely on the content of each item rather than the ratings of other users [1]. However, most current work extracts features and treats them as independent factors, in the same way as the user ID and item ID. Few studies consider the impact of the interactions between features.
On the other hand, currently popular recommendation methods, for instance NFM [11], the Wide & Deep model [7] and the Memory Augmented Graph Neural Network (MA-GNN) [21], provide state-of-the-art recommendation performance. However, because they are black boxes, they cannot clearly reflect the reason for a recommendation. The eXtreme Deep Factorization Machine (xDeepFM) [18] generates higher-order interaction features in an explicit manner by stacking multiple interaction networks. However, xDeepFM has high complexity and is prone to overfitting [30]. Therefore, this paper proposes a new solution that combines XGBoost [5] with the attention mechanism. It mines cross-feature information to improve recommendation performance while also ensuring the interpretability of the recommender system.
Interpretable recommendation algorithm based on XGBoost and attention mechanism
XGBoost tree construction
XGBoost [5] is an efficient implementation of gradient boosting and a further improvement on GBDT. Boosting is a very effective ensemble learning approach in which several weak classifiers with lower accuracy are combined into a strong classifier with higher accuracy. The gradient boosting machine, as an improvement of boosting, generates each new tree so that the loss function decreases along its gradient in every iteration. XGBoost supports both Classification And Regression Trees (CART) and linear models (GBlinear) as base learners. It offers high accuracy, resistance to overfitting and scalability, and it can process high-dimensional sparse features in a distributed manner. Consider a given dataset $D = \{(x_i, y_i)\}$ with $|D| = N$, $x_i \in \mathbb{R}^M$ and $y_i \in \mathbb{R}$, where N represents the number of samples and M is the number of features in each sample. The XGBoost classification and regression tree space Γ is expressed as:
$$\Gamma = \{ f(x) = w_{t(x)} \}, \quad t: \mathbb{R}^M \to \{1, \dots, L\}, \quad w \in \mathbb{R}^L \tag{1}$$
In the equation, L is the number of leaf nodes in each tree, the structure t of each tree maps each sample to the corresponding leaf node, and w is the vector of leaf node weights. With S denoting the number of trees and $x_i$ the i-th feature vector, the prediction is given by Equation (2):
$$\hat{y}_i = \sum_{s=1}^{S} f_s(x_i), \quad f_s \in \Gamma \tag{2}$$
Each $f_s$ corresponds to an independent tree structure t and leaf node weights w. An instance is mapped by the decision rules of each tree to one leaf node of that tree, and the weights of these leaf nodes are summed to calculate the final predicted value.
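The above tree model cannot be optimized using traditional optimization methods. Instead, the trees are learned one at a time through additive training (boosting), i.e. the predicted value at iteration p is $\hat{y}_i^{(p)} = \hat{y}_i^{(p-1)} + f_p(x_i)$.
Thus, the objective function can be written as Equation (3):
$$Obj^{(p)} = \sum_{i=1}^{N} l\left(y_i, \hat{y}_i^{(p-1)} + f_p(x_i)\right) + \Omega(f_p) \tag{3}$$
Here l represents the loss function, and $\Omega(f_p)$ is used as a regularization term. The objective function is expanded by the second-order Taylor formula at $\hat{y}_i^{(p-1)}$:
$$Obj^{(p)} \simeq \sum_{i=1}^{N} \left[ l\left(y_i, \hat{y}_i^{(p-1)}\right) + g_i f_p(x_i) + \frac{1}{2} h_i f_p^2(x_i) \right] + \Omega(f_p) \tag{4}$$
The first derivative and the second derivative of the loss function l with respect to $\hat{y}_i^{(p-1)}$ are $g_i$ and $h_i$ respectively. By grouping the samples that fall into the same leaf node, the objective function can be transformed into a sum over leaf nodes. GBDT only uses first-order derivative information when optimizing, whereas XGBoost carries out a second-order Taylor expansion, which uses first-order and second-order derivative information at the same time. In addition, the objective function of XGBoost adds the regular term $\Omega(f_p)$, which in the standard XGBoost formulation [5] is $\gamma L + \frac{1}{2} \lambda \lVert w \rVert^2$ and penalizes model complexity.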
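To make the roles of the first and second derivatives concrete, the following minimal sketch computes them for the logistic loss used in binary classification, where the derivatives with respect to the raw score are p − y and p(1 − p) with p the sigmoid of the score. It is an illustration rather than the paper's implementation: the function names and the coefficient `reg_lambda` are assumptions, and the leaf weight −G/(H + λ) is the standard XGBoost result [5] for an L2 penalty λ on leaf weights.

```python
import math

def sigmoid(z):
    """Logistic function mapping a raw score to a probability."""
    return 1.0 / (1.0 + math.exp(-z))

def logistic_grad_hess(y_true, raw_score):
    """First (g_i) and second (h_i) derivative of the log loss w.r.t. the raw score."""
    p = sigmoid(raw_score)
    g = p - y_true
    h = p * (1.0 - p)
    return g, h

def optimal_leaf_weight(grads, hessians, reg_lambda=1.0):
    """Leaf weight minimizing the second-order objective: -G / (H + lambda).

    reg_lambda is an assumed L2 regularization coefficient on leaf weights.
    """
    G = sum(grads)
    H = sum(hessians)
    return -G / (H + reg_lambda)

# Three samples that fall into the same leaf, all starting from raw score 0.
labels = [1, 0, 1]
scores = [0.0, 0.0, 0.0]
pairs = [logistic_grad_hess(y, s) for y, s in zip(labels, scores)]
grads = [g for g, _ in pairs]
hess = [h for _, h in pairs]
leaf_weight = optimal_leaf_weight(grads, hess, reg_lambda=1.0)
```

With two positive samples and one negative sample in the leaf, the resulting weight is positive, which pushes the prediction for that leaf toward the positive class.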
Recommender systems in industry use a large number of content features to improve the accuracy of the model. These features describe the data along various dimensions, and the cross features formed by combining them are also highly informative. Traditional second-order cross features are often designed by hand and then fed into an interpretable method. For high-order cross features, the complexity increases exponentially, and designing them requires a great deal of domain knowledge. This kind of labour-intensive work is costly, limited in scope, and scales poorly. To avoid these shortcomings, scholars began to study the use of deep models to learn high-order cross features automatically. However, due to the black-box nature of deep models, the cross features obtained are implicit and difficult to explain.
Therefore, the paper chooses to use XGBoost to capture effective cross features. Although XGBoost is not specifically designed for extracting cross features, each leaf node represents a cross feature. Because XGBoost optimizes its predictions over these cross features in each iteration, the leaf nodes can be regarded as useful and the corresponding cross features as reasonable.
Formally, the paper expresses XGBoost as a set of decision trees $T = \{T_1, T_2, \dots, T_S\}$, where each tree maps a feature vector x to a leaf node with a weight w. The number of leaf nodes in the s-th tree is expressed as $L_s$. The original XGBoost sums the weights of the active leaf nodes to obtain the predicted value. In this paper, by contrast, the active leaf nodes are extracted as effective cross features and provided to the subsequent model for processing. The embedding_lookup function is used to convert the set of all leaf nodes obtained in the XGBoost trees into a multi-hot vector $t = \mathrm{XGBoost}(x \mid T) = [T_1(x), \dots, T_S(x)]$. The vector is the concatenation of the leaf-node indicators of all the decision trees. Each tree has only one active leaf node, so the number of elements with value one in t is equal to the number of trees in XGBoost, and a single vector of effective cross features is obtained for each sample.
There are two trees T1 and T2 in Figure 1. The depth of the trees is two, and they have three and four leaf nodes, respectively. The feature vector of an input sample is expressed as x, with four features x0, x1, x2, x3. After traversing the two trees, the feature vector x of the sample is mapped to the second leaf node of T1 and the second leaf node of T2. Each leaf node corresponds to a cross feature. By traversing the leaf nodes of all trees, all the cross features of the sample are obtained. The multi-hot vector t is therefore [0, 1, 0, 0, 1, 0, 0]. In this way, t extracts two effective cross features from the feature vector x = [x0, x1, x2, x3] as input to the subsequent model. Moreover, as the depth of the trees increases, XGBoost can capture higher-order cross features. If concrete semantics are substituted into the above example, the obtained cross features can be displayed vividly:
vL2 : [Age > 18] & [Occupation = teacher]
vL5 : [Age < 35] & [Gender ≠ Man]
The first cross feature indicates that the user is a person older than eighteen and whose occupation is a teacher. The second cross feature indicates that the user is a female younger than 35.
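The construction of the multi-hot vector in the Figure 1 example can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions: the function name `multi_hot` and its arguments are hypothetical, and the active leaf of each tree is given as a 0-based position within that tree. In practice the leaf assignments could come from a trained model, for example via XGBoost's `pred_leaf=True` prediction option, whose node ids would first need to be remapped to consecutive leaf positions.

```python
def multi_hot(active_leaves, leaves_per_tree):
    """Concatenate one one-hot block per tree into a multi-hot vector.

    active_leaves[s]   -- 0-based position of the active leaf in tree s
    leaves_per_tree[s] -- number of leaf nodes L_s in tree s
    """
    if len(active_leaves) != len(leaves_per_tree):
        raise ValueError("one active leaf is required per tree")
    vector = []
    for leaf, n_leaves in zip(active_leaves, leaves_per_tree):
        if not 0 <= leaf < n_leaves:
            raise ValueError("leaf position out of range")
        block = [0] * n_leaves
        block[leaf] = 1  # exactly one active leaf per tree
        vector.extend(block)
    return vector

# T1 has three leaves, T2 has four; the sample lands in the second leaf of each.
t = multi_hot(active_leaves=[1, 1], leaves_per_tree=[3, 4])
print(t)  # [0, 1, 0, 0, 1, 0, 0]
```

The number of ones in the result equals the number of trees, matching the property stated above.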
Cross features weighting based on attention mechanism
The paper uses a multi-hot vector t to indicate which cross features are active. The cross features whose value is 1 are selected as candidate explanations of the prediction results. The GBDT with logistic regression (GBDT+LR) model by Facebook [12] has demonstrated the effectiveness of this solution. However, it assigns each cross feature the same weight for all user-item pairs. This limits the prediction performance, because different cross features differ in importance, and some useless feature interactions may even introduce noise. To address this, the Attentional Factorization Machines (AFM) [33] model introduces an attention mechanism, which learns the importance of each cross feature from the data and uses it as a weight to adjust the model. This mechanism improves the performance of the model, and the attention weights improve its explainability.
To capture the correlation of cross features in sparse multi-hot vectors, each valid cross feature is first associated with an embedding vector. Then the cross features represented by each embedding vector are weighted individually. Inspired by the cross-feature weighting method of the AFM model, an attention mechanism is designed to weight the cross features. The modelling process of personalized weights on cross features is shown in Equation (5):
The hybrid model structure based on XGBoost and attention mechanism is shown in Figure 2.
After obtaining the weights of the cross features, the paper aggregates the embedding vectors of all cross features in each sample by max pooling, so that each sample obtains a unified representation of its cross features. This approach retains the main features while reducing the number of parameters and computations, which helps prevent overfitting and improves generalization ability.
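The weighting and pooling steps can be illustrated with a small pure-Python sketch. It is a hedged example, not the paper's Equation (5): the scoring form (a one-layer ReLU network over the user-item interaction concatenated with each cross-feature embedding, followed by a softmax) is an assumption in the spirit of AFM-style attention, and all names (`attention_weights`, `max_pool`, `W`, `b`, `h`) and values are hypothetical.

```python
import math

def relu(vector):
    return [max(0.0, x) for x in vector]

def matvec(matrix, vector):
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]

def attention_weights(user_emb, item_emb, cross_embs, W, b, h):
    """Softmax-normalized attention weight for each active cross feature.

    Assumed scoring form: score_l = h . ReLU(W [user*item ; v_l] + b).
    """
    interaction = [p * q for p, q in zip(user_emb, item_emb)]  # element-wise product
    scores = []
    for v in cross_embs:
        pre_activation = [z + bias for z, bias in zip(matvec(W, interaction + v), b)]
        hidden = relu(pre_activation)
        scores.append(sum(hj * zj for hj, zj in zip(h, hidden)))
    top = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def max_pool(weights, cross_embs):
    """Element-wise maximum over the attention-weighted cross-feature embeddings."""
    weighted = [[w * x for x in v] for w, v in zip(weights, cross_embs)]
    return [max(column) for column in zip(*weighted)]

# Toy example: embedding size 2, attention size 2, two active cross features.
user_emb = [1.0, 0.5]
item_emb = [0.5, 1.0]
cross_embs = [[1.0, 0.0], [0.0, 1.0]]
W = [[0.1, 0.2, 0.5, 0.1],
     [0.4, 0.3, 0.3, 0.1]]
b = [0.0, 0.0]
h = [1.0, 1.0]

weights = attention_weights(user_emb, item_emb, cross_embs, W, b, h)
pooled = max_pool(weights, cross_embs)
```

Because the weights depend on the user and item embeddings, the same cross feature can receive different weights for different user-item pairs, which is what allows personalized explanations.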
Previously, only the cross features about the side information of users and items were considered. Therefore, the embedding vectors x, u, e, V of the side information, user ID, item ID and cross feature set are integrated together. A linear regression layer on these unified embedding vectors makes the final prediction, with $P \in \mathbb{R}^k$ and $Q \in \mathbb{R}^k$ as its weights. The sigmoid function is used as the activation to obtain the probability that the recommender system recommends a certain item to a user.
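A minimal sketch of the final prediction step is given below. The exact combination used in the paper is not recoverable from the text, so the form shown here (a bias plus P applied to the element-wise user-item interaction and Q applied to the pooled cross-feature vector, followed by a sigmoid) is an assumption, and the function and argument names are hypothetical.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def predict_score(user_emb, item_emb, pooled_cross, P, Q, bias=0.0):
    """Probability that the item is recommended to the user.

    Assumed form: sigmoid(bias + P . (user * item) + Q . pooled_cross).
    """
    interaction = [u * e for u, e in zip(user_emb, item_emb)]
    logit = bias + dot(P, interaction) + dot(Q, pooled_cross)
    return 1.0 / (1.0 + math.exp(-logit))

score = predict_score(
    user_emb=[1.0, 0.5],
    item_emb=[0.5, 1.0],
    pooled_cross=[0.6, 0.4],
    P=[0.2, -0.1],
    Q=[0.5, 0.3],
)
```

Since the output layer is linear before the sigmoid, the contribution of the ID interaction and of the cross features to the logit can be read off separately.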
In this paper, whether the recommender system recommends a certain item to a user is regarded as a binary classification problem. Consequently, the paper uses Logloss as the loss function and L2 regularization to prevent overfitting. Since the model is composed of two models in series, the two are trained separately. XGBoost is first trained to minimize the loss value, and the resulting cross features are then fed into the attention-based model for training. The model is optimized using mini-batch Adagrad.
This model is only a shallow model. Although it has no fully connected hidden layer, the embedding mechanism and the attention mechanism still give it strong representation ability and effectiveness. Because a shallow model is additive, the contribution of each component can be evaluated easily, which makes interpretation straightforward.
Experimental environment and data sources
The experiment used two public datasets: MovieLens 100k [13] and the Myers-Briggs Personality Type Dataset [23]. MovieLens is one of the most commonly used datasets for verifying the performance of recommendation algorithms. It contains a total of 100,000 ratings given by 943 users to 1682 movies. Each user has at least 20 ratings, and each rating is an integer from one to five. The rating is converted into binary implicit feedback indicating whether the user recommends the movie. A rating from one to three is converted to zero, which means the movie is not worth recommending. A rating of four or five is converted to one, which means the movie is worth recommending. The information of each user includes gender, age, occupation and region, and the movie attributes consist of 20 movie genre tags. The second dataset comes from the Kaggle public database, with a total of 1,028,752 ratings given by 1,820 users to 35,196 movies. The score is a number from zero to five with a step size of 0.5 and is likewise converted into binary implicit feedback: a score from zero to three means not recommended, and a score from 3.5 to five means recommended. The detailed information is shown in Table 1. Table 2 and Table 3 respectively show some samples of these datasets.
The description of some representative features
A sample of the detailed records of MovieLens 100k
A sample of the detailed records of the anime recommendation database
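The conversion of ratings into binary implicit feedback described for the two datasets amounts to a simple threshold. The sketch below illustrates it; the function name `binarize` and the sample ratings are hypothetical.

```python
def binarize(rating, threshold):
    """Return 1 (worth recommending) if the rating reaches the threshold, else 0."""
    return 1 if rating >= threshold else 0

# First dataset: integer ratings 1-5, with four or five mapped to one.
first_labels = [binarize(r, threshold=4) for r in [1, 2, 3, 4, 5]]

# Second dataset: scores 0-5 in steps of 0.5, with 3.5 and above mapped to one.
second_labels = [binarize(r, threshold=3.5) for r in [0.0, 3.0, 3.5, 5.0]]
```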
For each dataset, this paper randomly selects 20% of the data as the test set and uses k-fold cross-validation to randomly divide the remaining data into a training set (70% of the full data) and a validation set (10%). The validation set is mainly used to tune the hyperparameters, and the test set is finally used to compare performance.
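One way to realize such a 70/10/20 partition is sketched below. It shows a single random partition only; under k-fold cross-validation the training and validation portions would be rotated across folds. The function name, the fixed seed and the use of index lists are assumptions made for illustration.

```python
import random

def split_indices(n_samples, test_frac=0.2, valid_frac=0.1, seed=42):
    """Randomly partition sample indices into train, validation and test sets."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)  # seeded for reproducibility
    n_test = int(round(n_samples * test_frac))
    n_valid = int(round(n_samples * valid_frac))
    test = indices[:n_test]
    valid = indices[n_test:n_test + n_valid]
    train = indices[n_test + n_valid:]
    return train, valid, test

# 100,000 ratings, as in MovieLens 100k.
train, valid, test = split_indices(100_000)
```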
The simulation environment of this paper is as follows. Hardware: Intel Core i7 processor, 500 GB hard disk. Software: PyCharm Community, Anaconda, and the Windows 10 operating system. Anaconda is used to preprocess the data, and PyCharm is used to build the models and implement the algorithms.
To evaluate the prediction score, two indicators are used: Logloss and Area Under Curve (AUC). We report the average score over all test instances, and the same settings apply to hyperparameter tuning on the validation set. Logloss [28] indicates the generalization ability of each model: it measures how far the predicted probability of a user-item interaction deviates from the ground truth and thus focuses on the accuracy of the model. AUC indicates the ability of the model to distinguish positive from negative samples and focuses on the ranking produced by the model. When the classes are imbalanced, the model may be biased toward predicting negative samples; the Logloss result may then still look acceptable while the AUC result is poor. Therefore, both are selected for a comprehensive evaluation.
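When there are N samples, for the i-th sample, the true value is $y_i$ and the predicted value is $\hat{y}_i$. Logloss is then computed as:
$$Logloss = -\frac{1}{N} \sum_{i=1}^{N} \left[ y_i \log \hat{y}_i + (1 - y_i) \log (1 - \hat{y}_i) \right]$$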
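For binary labels, Logloss is the mean negative log-likelihood of the true labels under the predicted probabilities. A minimal Python sketch follows; the clipping constant `eps`, which keeps the logarithm finite when a prediction is exactly 0 or 1, is an assumption and not taken from the paper.

```python
import math

def logloss(y_true, y_pred, eps=1e-15):
    """Mean binary cross-entropy between labels (0/1) and predicted probabilities."""
    total = 0.0
    for y, p in zip(y_true, y_pred):
        p = min(max(p, eps), 1.0 - eps)  # clip to avoid log(0)
        total += y * math.log(p) + (1 - y) * math.log(1.0 - p)
    return -total / len(y_true)

uninformative = logloss([1, 0], [0.5, 0.5])   # equals ln 2
confident = logloss([1, 0], [0.9, 0.1])       # lower is better
```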
TP: true positive (a positive sample predicted as positive)
FP: false positive (a negative sample predicted as positive)
TN: true negative (a negative sample predicted as negative)
FN: false negative (a positive sample predicted as negative)
AUC is the area under the receiver operating characteristic (ROC) curve. The ordinate of the curve is the true positive rate, $TPR = TP / (TP + FN)$, and the abscissa is the false positive rate, $FPR = FP / (FP + TN)$.
The model takes different probability thresholds to obtain different (FPR, TPR) points, which are connected to form the ROC curve. A larger TPR and a smaller FPR are better, so a higher AUC value indicates a better model.
Equivalently, when a positive sample and a negative sample are randomly selected, the AUC value is the probability that the model ranks the positive sample ahead of the negative sample according to their predicted values. The larger the AUC value, the more likely the classification algorithm is to rank positive samples ahead of negative samples, and the better it separates the two classes.
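This ranking interpretation gives a direct way to compute AUC: count, over all positive-negative pairs, how often the positive sample receives the higher score. The sketch below is a simple quadratic-time illustration in which ties count as half a win, a common convention that is assumed here; the function name is hypothetical.

```python
def auc(y_true, y_score):
    """AUC as the fraction of positive-negative pairs ranked correctly."""
    positives = [s for y, s in zip(y_true, y_score) if y == 1]
    negatives = [s for y, s in zip(y_true, y_score) if y == 0]
    if not positives or not negatives:
        raise ValueError("AUC needs at least one positive and one negative sample")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5  # ties count as half (assumed convention)
    return wins / (len(positives) * len(negatives))

value = auc([1, 0, 1, 0], [0.9, 0.8, 0.4, 0.1])
print(value)  # 0.75
```

In the example, three of the four positive-negative pairs are ordered correctly, which gives 0.75.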
Comparative experiment with other models
To demonstrate the recommendation quality of the model, the model based on XGBoost and the attention mechanism proposed in this paper is compared with three existing models: matrix factorization (MF), GBDT+LR and AFM. To make the comparison fair, all models are tuned and optimized in this paper.
MF [26]: A general model-based collaborative filtering method that uses embedding vectors to encode user ID and item ID. It implicitly models cross features through the inner product of feature embedding.
GBDT+LR [12]: This method feeds the cross features extracted by GBDT into logistic regression to refine the weight of each cross feature.
AFM [33]: An FM model with an added attention mechanism, in which a neural network learns the importance of each interaction feature from the data to adjust the model.
It can be seen from Table 4 that the method in this paper is slightly better than the comparison methods on both the Logloss and AUC indicators. The model achieves higher accuracy and better ranking results while ensuring interpretability.
Performance comparison of each model
The values of the hyperparameters affect the overall performance of the model. For the two datasets, the number of trees is selected from {50, 100, 150, 200, 250} and {25, 50, 75, 100, 125} respectively. The maximum depth of a tree is selected from {2, 4, 6, 8, 10}. Based on experience, the attention size is set equal to the embedding size, with values taken from {5, 10, 15, 20}. The learning rate is selected from {0.005, 0.01, 0.03, 0.05, 0.1}. All embedding-based methods are optimized using mini-batch Adagrad. The number of trees and the maximum depth are analysed below as examples.
The effect of tree number
The number of trees is related to the coverage of cross features, reflecting how much information is derived from the datasets. In Figure 3(a)-3(d), as the number of trees increases, the Logloss value generally decreases and the AUC generally increases. As shown in Figure 3(a), performance on the first dataset is best when the number of trees is 200. When the number of trees reaches 250, Logloss increases and AUC decreases, which suggests that the model may be overfitting.
The effect of maximum tree depth
The depth of the tree determines the highest order of the cross features; deeper trees can dig out more interaction information between features. Models whose depth is too small will underfit, while models whose depth is too large will overfit. The experimental results are shown in Figure 3(e)-3(h): both datasets perform best when the maximum depth of the tree is eight.

XGBoost tree model example.

Structure diagram of hybrid model based on XGBoost and attention mechanism.

The influence of tree number and maximum tree depth on model effect.
The model makes predictions by generating explicit decision rules through the XGBoost trees, which makes the cross features directly interpretable. The experiment uses the Graphviz package to visualize the decision trees in XGBoost. Part of a decision tree obtained is shown in Figure 4. In this process, the effective cross features can be captured. The sample labels in the figure are binary features whose value is zero or one. A value greater than 0.5 indicates that the item carries the label; otherwise, the item does not carry it.

Partial decision process of XGBoost tree.
After the cross features are obtained, the weight that the attention mechanism assigns to each effective cross feature represents its importance in the prediction. Cross features with large weights can be presented to users as explanations.
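Selecting explanations from the attention weights reduces to ranking the active cross features. The sketch below illustrates the idea with the two example cross features given earlier; the function name, the value of `k` and the weights are hypothetical and do not come from the experiments.

```python
def top_explanations(cross_features, attention_weights, k=1):
    """Return the k cross features with the largest attention weights."""
    ranked = sorted(
        zip(cross_features, attention_weights),
        key=lambda pair: pair[1],
        reverse=True,
    )
    return ranked[:k]

features = [
    "[Age > 18] & [Occupation = teacher]",
    "[Age < 35] & [Gender ≠ Man]",
]
weights = [0.3, 0.7]  # illustrative attention weights, not measured values
explanation = top_explanations(features, weights, k=1)
```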
In addition to making the cross-features visible, our model allows users to correct the process to refresh the recommendations as they desire. This property of adjusting recommendation is called scrutability and is the gateway to control the recommendation process [2]. It is illustrated by a sample user in Table 5.
Scrutable recommendation for a sampled user on the model
The sample user is a 42-year-old female entertainer, and most of the films in her historical interactions carry the tag Drama. As a result, the model detected these frequent cross features and recommended films such as Fargo and East of Eden to her. Suppose that the user wants to recommend movies to a scientist who loves horror movies. The model can assign a higher attention weight to the cross feature that includes [Label = Horror] and [Evidence = scientist], and then recompute the predictions to update the recommendations. Alien, Glow, Scream, Jaws and Candyman receive the top prediction scores in the adjusted recommendation list.
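One simple way to picture this adjustment is to scale up the attention weight of the chosen cross feature and renormalize, after which the prediction scores would be recomputed with the new weights. The sketch below is only an illustration of that idea: the paper does not specify how the weight is raised, so the multiplicative `factor`, the function name and the example weights are all assumptions.

```python
def boost_cross_feature(weights, index, factor=3.0):
    """Scale one attention weight by `factor` and renormalize so the weights sum to one."""
    if factor <= 0:
        raise ValueError("factor must be positive")
    scaled = list(weights)
    scaled[index] *= factor
    total = sum(scaled)
    return [w / total for w in scaled]

# Illustrative weights for three cross features; the third is the one the user emphasizes.
original = [0.5, 0.3, 0.2]
adjusted = boost_cross_feature(original, index=2, factor=4.0)
```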
This paper first describes the explainability problem of recommender systems. On this basis, an interpretable recommendation model based on XGBoost and an attention mechanism is proposed, which makes the model’s predictions self-explanatory. The effective cross features are extracted by the XGBoost tree component. Through the introduction of the attention mechanism, the matching degree between users and items can be predicted in an interpretable manner. According to the experimental results, the proposed method maintains recommendation performance and improves the readability of the recommendation reasons. In future work, we plan to use knowledge graphs to enhance the interpretation capability of the algorithm. As a highly readable carrier of external knowledge, the knowledge graph offers great potential for improving algorithm interpretation. Common media for recommendation explanations are users, items, and features. The model will explore the associations between these media and select the most appropriate one to recommend and explain to users according to the situation, so that both recommendation results and interpretation capability are optimized. Another future direction is to use contextual information such as text comments and pictures. The item features that users care about most can be extracted from comments and added to the recommender system for modelling. It is important to find a way to understand user preferences from different aspects.
