Abstract
The key to sequential recommendation is capturing users’ dynamic interests. Existing sequential recommendation methods (e.g., those based on the self-attention mechanism) have achieved remarkable success in modeling users’ interests. However, these models ignore that users have different levels of preference for different aspects of items, and therefore fail to capture the aspects that users care about most. In addition, they depend heavily on the quality of the training data, which may lead to overfitting when the training data is insufficient. To address these issues, we propose a novel sequence-aware model (Multi-Aspect Features of Items for Time-Ordered Sequential Recommendation, MFITSRec), which combines the features of items with user behavior sequences to learn more complex item-item and item-attribute relationships. Moreover, the model uses a self-attention network based on an absolute time relationship, which better represents the changes in users’ interests and captures users’ preferences for particular aspects of items. Extensive experiments on five datasets demonstrate that our model outperforms various baseline models. In particular, the model’s prediction accuracy improves significantly on sparse datasets.
Introduction
With the growth of information on the Internet, personalized recommendation has become vital. It is a constant challenge to capture users’ dynamically changing interests effectively and to make accurate recommendations for their next interaction. Since the historical interaction behaviors of users on each platform are inherently sequential, a large body of research captures users’ dynamic interest changes from these sequences and predicts the next items that users are likely to interact with [1, 2].
To model the dynamic interests of users, researchers have proposed many types of sequential recommendation methods [3–5]. Markov chains (MCs) [3, 6], a classical approach, model the user behavior sequence and predict the next interaction through transition probabilities. FPMC [7] combines matrix factorization and MCs to model personalized transition probabilities. Both approaches usually consider only low-order interactions (i.e., first-order and second-order) and ignore higher-order interaction information. Deep learning-based approaches have since become the dominant methods for modeling the user behavior sequence [8–10]. In addition, methods based on the self-attention mechanism [5, 11–13] adaptively adjust the weights, resulting in a substantial improvement in the performance of sequential recommendation models. It should be noted that these deep learning methods generally require a large amount of high-quality data to perform well.
Traditional sequential recommendation models rely on sequential relationships and emphasize the order of interaction actions. However, some users do not interact frequently in real life, which tends to make historical data sparse. Additionally, users have different preferences for the various aspects of items. For example, suppose two users interact with the same type of items: one of them interacts with items from a different brand each time, while the other interacts only with items from the same brand. We can infer that the latter user has a strong preference for the brand, i.e., the brand is a specific factor in this user’s choice of items. Ignoring the multi-aspect features of items and relying only on the user’s historical interaction sequence for prediction cannot accurately capture the dynamic changes in users’ interests or their preferences for a particular aspect of items.
In this paper, we propose MFITSRec, a time-ordered sequential recommendation model with multi-aspect features of items, which combines the unique features of items with user interaction sequences. This enables a finite-length sequence to carry helpful information about various aspects of items. When the interaction data is sparse, MFITSRec can better capture users’ preferences for a particular aspect of an item and recommend items that match that preference. In addition, we consider the time interval as in TiSASRec [11] and introduce the absolute time relationship and the relative position of the interaction sequence into the self-attention mechanism. In summary, our contributions are as follows:
We propose embedding the multi-aspect features of items in the user interaction sequence to model users’ preferences for multiple features, so as to explore the degree of preference of different users for different items.
We combine unique features of items and absolute temporal relationships to enhance the user interaction sequence and demonstrate that the model can improve recommendation performance on sparse datasets.
We conduct extensive experiments on five datasets from three platforms to demonstrate the effectiveness of our model. The results show that our model significantly outperforms competitive baselines in recommendation performance on both sparse and dense datasets and effectively captures users’ preferences for various aspects of items.
Related Work
General recommendation methods [14–17] focus on modeling users’ interests using vast amounts of implicit [18] interaction data to obtain the most relevant items for the user. However, these methods ignore the sequential order of user interactions and have difficulty capturing the dynamic changes in users’ preferences. As a result, sequential recommendation methods that model users’ dynamic preferences have received much attention. Early sequential recommendation methods relied heavily on MCs [3, 7], which model the user-item interaction data as a sequence and predict the next interaction through transition probabilities. Although MC-based models take into account the dependencies between historical interactions, they cannot capture associations between behaviors that are far apart in a relatively long sequence [5, 19]. Most existing models use deep learning to capture the long-term preferences of users. Convolutional neural networks (CNNs) [4, 9, 20] treat historical interactions as "images" and capture local features of sequences. Recurrent neural networks (RNNs) [8, 21–23] use gated recurrent units (GRUs) to model user behavior sequences. Graph neural networks (GNNs) [10, 24, 25] model sequences as graph-structured data and use GNNs to extract the embedding vectors of items. Recently, the sequential recommendation model FMLP-Rec [26], based on the MLP architecture, encoded user sequences with learnable filters and achieved the best performance on sparse datasets. However, these sequential recommendation methods still have limitations when dealing with long sequences, may suffer from vanishing or exploding gradients, and are less interpretable.

The overall framework of MFITSRec. The model mainly consists of embedding layers, an absolute time relationship layer, self-attention network layers, and a prediction layer. First, the model obtains the historical interaction sequence of a target user and enhances the interaction representation with the features of each aspect of the corresponding items and the timestamps of the interactions; at the same time, it derives the absolute time relationship between the interactions. Second, the absolute time relationship and the interaction sequence are fed into a multi-layer self-attention sequential recommendation model based on the absolute time relationship to learn a multi-aspect preference representation of the user. Finally, the user’s score for each item is predicted using this representation and the combined embeddings of items.
The attention mechanism was first applied to machine translation tasks, where it makes the output depend on the relevant parts of the input, enhancing the model’s ability to handle long sequences while improving interpretability. It has therefore been naturally applied to sequential recommendation as well [5, 27–29]. Li et al. [27] combined RNNs with the attention mechanism to capture users’ primary goals in the sequence. Wang et al. [30] integrated attention mechanisms into a shallow network to construct contextual representations without strict sequential relationships. RepeatNet [31] incorporated a repeat-explore mechanism with the attention mechanism into an RNN. It is worth noting that these models essentially introduce attention into existing architectures (e.g., CNNs and RNNs) and still rely on those architectures. Since the Transformer [32], a purely attention-based model, achieved state-of-the-art performance in machine translation tasks, the self-attention mechanism has been widely used in sequential recommendation [5, 11–13]. One of its advantages is that it captures long-term dependencies by computing the attention weights between each pair of items in a sequence and is easily parallelized. SASRec [5] was the first to use self-attention for sequential recommendation, adaptively assigning weights to items. Since a self-attention model is unaware of the position of items in a sequence, some approaches embed relative position relationships as inputs to the model [5, 12, 13]. However, relative position information alone is insufficient for determining the impact of items with different time intervals on the next item. To overcome this problem, some researchers embed timestamp information into the model to obtain the temporal relationship between each pair of items [11, 33, 34]. TiSASRec [11] modeled interaction sequences as sequences with time intervals and enhanced SASRec by adding timestamp information. As available information proliferated, Stai et al. [35] extended the recommendation process to the accompanying information beyond the item itself and introduced computationally efficient solutions. Recently, Trans2D [36], a watchlist recommendation (WLR) model, used items’ attributes to make recommendations from eBay users’ watchlists. Most existing models fail to capture users’ preferences for specific aspects of items and suffer from overfitting due to data sparsity. Our work embeds multi-aspect features of items into the user’s historical sequence to capture multi-aspect preferences of users and enhance interaction representations, while combining relative position and absolute time relationships to model the timestamp sequence.
We propose a sequential recommendation model MFITSRec that combines multi-aspect features of items with timestamps, mainly including the embedding layer, the absolute temporal relationship layer, the self-attention network layers, and the prediction layer. This section describes the specific construction of the model, and Figure 1 shows the architecture of MFITSRec.
In the sequential recommendation task of this section, the set of users and items are denoted as
Notation
Since the number of interactions varies from user to user, we fix the length of each user interaction sequence to a specific value
We build an item embedding matrix
For the items in
Where
Where
Where
Where
We still use the learnable relative position embedding as in [5] because multiple user interactions may occur in the same absolute timestamp, and it may not be possible to accurately model the sequentiality in the input sequence with only the absolute timestamp. For the key and value in the self-attention network layer, we use
We introduce an absolute time relationship, as in [11], to model the effect of user interactions at different intervals on predicting the next item. Different users interact at different frequencies, so the intervals between their interactions differ in magnitude, which makes modeling them difficult. We therefore express the absolute time relationship between two interactions as their relative time interval. For the time sequence t = (t_1, t_2, ⋯, t_n) of fixed length, the absolute time relationship between item i and item j is:
Where
Where the main diagonal elements are all 0, similar to the relative position embedding, we use
Where
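The absolute time relationship described above can be illustrated with a minimal sketch. It assumes the TiSASRec-style construction from [11]: each pairwise gap is divided by the user's smallest non-zero gap and clipped at a maximum interval. The function name `time_interval_matrix`, the choice of normalizer, and the sample timestamps are illustrative assumptions, not taken from the paper.

```python
def time_interval_matrix(timestamps, max_interval):
    """Pairwise time gaps, scaled per user and clipped (assumed TiSASRec-style)."""
    n = len(timestamps)
    gaps = [abs(a - b) for a in timestamps for b in timestamps if a != b]
    # Assumed normalizer: the user's smallest non-zero gap (1 if all timestamps coincide).
    unit = min(gaps) if gaps else 1
    return [
        [min(abs(timestamps[i] - timestamps[j]) // unit, max_interval) for j in range(n)]
        for i in range(n)
    ]

# Hypothetical timestamps; two interactions share the same timestamp (160).
R = time_interval_matrix([100, 160, 160, 400], max_interval=4)
```

In this sketch the main diagonal is zero, and interactions sharing a timestamp also get a zero interval, which is one reason the relative position embedding is still useful.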
Since the self-attention mechanism can capture sequential relationships in a sequence, we use it to model the historical interaction behavior of users. When computing the scaled dot-product attention [32], inspired by [11], we consider both the relative position in the input sequence and the absolute temporal relationship between each pair of items in the sequence.
Where
Where
To avoid information leakage, i.e., the output at time step t contains the effect of subsequent items on predicting the next item, we use masks, as in [5, 11], to disable the linking of subsequent keys to the current query at time step t.
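The masking step can be sketched as follows. This is a minimal illustration in plain Python rather than the authors' PyTorch code; the function name `causal_attention_weights` and the toy score matrix are assumptions. It takes raw attention scores and returns softmax weights in which the query at step i sees only keys at positions up to i.

```python
import math

def causal_attention_weights(scores):
    """Row-wise softmax over scores[i][j] with future keys (j > i) masked out."""
    weights = []
    for i, row in enumerate(scores):
        visible = row[: i + 1]               # keys at positions <= i only
        peak = max(visible)                  # subtract the max for numerical stability
        exps = [math.exp(s - peak) for s in visible]
        total = sum(exps)
        # Masked (future) positions receive exactly zero weight.
        weights.append([e / total for e in exps] + [0.0] * (len(row) - i - 1))
    return weights

# Toy scores: large values above the diagonal must not influence the result.
W = causal_attention_weights([[0.0, 5.0, 5.0], [1.0, 1.0, 9.0], [0.0, 0.0, 0.0]])
```

Tensor implementations usually get the same effect by setting masked scores to negative infinity before the softmax.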
Where
Where g(x) denotes a self-attention layer or a feed-forward network layer. That is, for each layer g, we first apply layer normalization to the input x, then apply dropout to the output of g, and finally add the original input x to the result. According to [37], layer normalization is defined as:
Where ⊙ is the element-wise product, μ and σ are the mean and variance of x, and α and β are the learned scale factor and bias term, respectively.
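A minimal sketch of the normalization and residual steps described above, assuming the standard layer normalization form from [37] (subtract the mean, divide by the square root of the variance plus a small constant, then scale and shift). The small constant `eps`, the function names, and the toy inputs are assumptions; dropout is omitted, as it would be at inference time.

```python
import math

def layer_norm(x, alpha, beta, eps=1e-8):
    """alpha * (x - mean) / sqrt(variance + eps) + beta, element-wise."""
    mu = sum(x) / len(x)
    var = sum((v - mu) ** 2 for v in x) / len(x)
    return [a * (v - mu) / math.sqrt(var + eps) + b for v, a, b in zip(x, alpha, beta)]

def residual_block(x, g, alpha, beta):
    """x + g(LayerNorm(x)); dropout on the output of g is omitted in this sketch."""
    out = g(layer_norm(x, alpha, beta))
    return [xi + oi for xi, oi in zip(x, out)]

x = [1.0, 2.0, 3.0, 4.0]
normed = layer_norm(x, alpha=[1.0] * 4, beta=[0.0] * 4)
# A stand-in for a self-attention or feed-forward layer: doubles its input.
y = residual_block(x, lambda h: [2.0 * v for v in h], [1.0] * 4, [0.0] * 4)
```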
After b self-attention layers and feed-forward network layers, we obtain a multi-aspect preference representation of users that incorporates items’ attributes and absolute temporal relationships, and we calculate the user’s prediction score for item i at time step t:
Where s_{i,t} is the probability that the next item is i given the first t items (l_1, l_2, ⋯, l_t).
For the user interaction sequence l = (l_1, l_2, ⋯, l_n) after fixing its length, our desired output sequence is O = (o_1, o_2, ⋯, o_n), where the element o_t at time step t is defined as:
Where <pad> indicates the filler item. In the training process, we draw a negative sample for each element of the desired output sequence O = (o_1, o_2, ⋯, o_n) to obtain the negative sample sequence N = (n_1, n_2, ⋯, n_n), where n_i ∉ L_u, and we use binary cross-entropy as the objective function:
Where σ(·) is the sigmoid function. We optimize the model using the Adam optimizer [38], a stochastic gradient descent (SGD) variant with adaptive moment estimation.
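The training objective can be sketched as below. This assumes the SASRec-style form of the binary cross-entropy used in [5, 11]: at each non-padding step, the score of the desired item is pushed up and the score of the sampled negative item is pushed down. The padding id `PAD = 0`, the function names, the unnormalized sum, and the toy scores are illustrative assumptions.

```python
import math

PAD = 0  # hypothetical id for the <pad> filler item

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def bce_loss(targets, pos_scores, neg_scores):
    """Sum of -[log sigmoid(s_pos) + log(1 - sigmoid(s_neg))] over non-padding steps."""
    loss = 0.0
    for o_t, s_pos, s_neg in zip(targets, pos_scores, neg_scores):
        if o_t == PAD:
            continue  # padded steps contribute nothing to the loss
        loss -= math.log(sigmoid(s_pos)) + math.log(1.0 - sigmoid(s_neg))
    return loss

# Toy example: the first step is padding, the last step is already well separated.
loss = bce_loss([PAD, 7, 3], [0.0, 0.0, 2.0], [0.0, 0.0, -2.0])
```

The well-separated step adds far less to the loss than the step where both scores are zero, which is the behavior the objective is meant to reward.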
This section evaluates our model on datasets from different domains. Our experiments aim to answer the following research questions:
Experimental Settings
We preprocessed these data in advance. As in [5, 11, 13], for all datasets, we treated reviews or ratings as implicit feedback (i.e., the user interacted with the item) and sorted the interactions by timestamp. For the highly sparse Amazon and Steam datasets, we discarded users with fewer than five interactions together with the corresponding items’ features. Table 2 provides detailed statistics for the datasets: MovieLens is the densest, while Amazon and Steam have a lower average number of interactions per user and higher sparsity. On all datasets, we used the last item in each user interaction sequence as the test set, the item immediately before it as the validation set, and all remaining items as the training set.
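The split described above can be sketched for a single user as follows. The function name `leave_one_out_split` and the handling of sequences with fewer than three items are assumptions; the input is assumed to be already sorted by timestamp.

```python
def leave_one_out_split(sequence):
    """Last item -> test, second-to-last -> validation, the rest -> training."""
    if len(sequence) < 3:
        # Assumption: sequences too short to split are kept for training only.
        return list(sequence), [], []
    return list(sequence[:-2]), [sequence[-2]], [sequence[-1]]

# Hypothetical item ids for one user, ordered by interaction time.
train, valid, test = leave_one_out_split([4, 9, 2, 7, 5])
```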
Dataset statistics
Where M is the number of users and hits(i) indicates whether the item that the i-th user interacted with is in the recommendation list of length N: hits(i) = 1 if it is, and hits(i) = 0 otherwise. In addition, p_i denotes the position of that item in the recommendation list; if the item is not in the list, p_i → ∞. We set N to 10 in the experiments. For each user, we ranked the positive item together with 100 randomly selected negative samples and calculated HR@10 and NDCG@10 based on the ranking of these 101 items [40]; the higher these two metrics are (ideally as close to 1 as possible), the better the performance.
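The evaluation protocol can be sketched as below, assuming the commonly used definitions: HR@N is the average of hits(i) over users, and NDCG@N is the average of 1 / log2(p_i + 1), where an item outside the top N contributes 0. The function names, the tie-breaking rule, and the sample ranks are illustrative assumptions.

```python
import math

def rank_of_positive(pos_score, neg_scores):
    """1-based position of the positive item among the sampled candidates.

    Assumption: ties are resolved in favor of the positive item.
    """
    return 1 + sum(1 for s in neg_scores if s > pos_score)

def hr_and_ndcg(ranks, n=10):
    """Average HR@n and NDCG@n given each user's 1-based rank of the positive item."""
    hits = [1 if p <= n else 0 for p in ranks]
    gains = [1.0 / math.log2(p + 1) if p <= n else 0.0 for p in ranks]
    m = len(ranks)
    return sum(hits) / m, sum(gains) / m

# Three hypothetical users: ranked 1st, ranked 3rd, and outside the top 10.
hr, ndcg = hr_and_ndcg([1, 3, 11])
```

With a single positive item per user, the ideal DCG is 1, so no further normalization is needed in this sketch.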

Visualization of the attention value assigned by the model in the Games dataset to the various features of items in a user’s historical sequence. The last row is the recommended item, and the darker the color, the higher the attention weight.
We implemented MFITSRec using PyTorch and optimized the model using the Adam optimizer, with the learning rate set to 0.001 and the batch size set to 128. We used two self-attention network layers and point-wise feed-forward network layers. For MovieLens, we set the maximum sequence length to 200; for Amazon, the maximum sequence length was 50; and for Steam, the maximum sequence length was 10. For all datasets, we set the latent dimensions of category and brand to 50; the impact of each attribute is discussed in Section 4.3. The maximum time interval was set to 512 for MovieLens and 256 for the other datasets. The dropout rate was 0.5 for all datasets.
Table 3 and Table 4 show the recommendation performance of all baseline methods and our model on the five benchmark datasets. Bold indicates the best result in each column, and underline indicates the second-best result. Our proposed MFITSRec outperforms all previous methods on all datasets. Among the baselines, Caser outperforms TransRec on denser datasets, and the opposite is true for sparse datasets. This is because the neural network-based approach captures long sequences using higher-order transitions, while the MC-based approach focuses on the dynamic transitions between items. However, the orders in high-order models tend to be set very small and cannot be dynamically adjusted, so Caser performs worse than the self-attention-based SASRec. On dense datasets, TiSASRec performs better than FMLP-Rec, while FMLP-Rec performs better on sparse datasets because it replaces the Transformer with an MLP layer in the frequency domain, which mitigates the overfitting that heavily parameterized models suffer on sparse data. TARN combines users’ dynamic and static preferences and outperforms SASRec, which considers only one of them, on all datasets. BAR outperforms its backbone model SASRec on all datasets, indicating that splitting a user’s historical interaction sequence into an item sequence and an action sequence is an effective modeling approach. TLSRec is the best baseline on four datasets but is slightly worse than TiSASRec on Steam, where it nevertheless remains competitive. Presumably, this is because, although TLSRec can learn the user’s long-term preferences by capturing the dependencies between sessions, it cannot divide an overly sparse dataset like Steam into enough effective sessions. Compared with TiSASRec, MFITSRec improves NDCG@10 by more than 2%, and by more than 8% on the sparser datasets. Compared with TLSRec, MFITSRec improves NDCG@10 by more than 2%.
Performance Comparison for sparse datasets
Performance Comparison for dense datasets
In real life, the length of the recommendation list differs across platforms. Therefore, we varied K (i.e., the number of items recommended to users) for a more thorough comparison. Table 5 shows the recommendation performance of some state-of-the-art methods and our model for K ∈ {5, 20}. As with NDCG@10 and Hit@10, MFITSRec performs best on all datasets. On average, MFITSRec improves on TiSASRec by 10%, 5%, 7%, and 4% in terms of NDCG@5, Hit@5, NDCG@20, and Hit@20, respectively. At the same time, MFITSRec improves on TLSRec, the current best baseline, by 4%, 2%, 2%, and 1% in terms of NDCG@5, Hit@5, NDCG@20, and Hit@20, respectively. Presumably, this is because adding multi-aspect features of items has two effects. On the one hand, it improves the accuracy of recommendations by capturing users’ preferences for specific aspects of items and making personalized recommendations for each user. On the other hand, it mitigates overfitting by enhancing the representation of users’ historical sequences, which allows MFITSRec to learn more useful information from sparse data. We discuss the impact of each feature in Section 4.3. In addition, we retained the absolute temporal relationship between interactions and adjusted the weights using time intervals. Based on the experimental results, we argue that distinguishing the degree of users’ preferences for different aspects and considering the absolute time interval between interactions are essential for personalized recommendation.
Performance comparison with different lengths of recommendation list
It is well known that users weigh different aspects of items differently when interacting with them (e.g., a user may favor a certain brand of electronics while preferring suspense movies). Our proposed MFITSRec can capture users’ preferences for specific aspects of items, which improves the model’s performance. We visualize the attention values assigned by the model to the various features of items in a user’s historical sequence. Figure 2 shows the interaction sequence of a user in the Games dataset and the corresponding multi-attribute list of items, where the color shades represent the magnitude of the attention weights. We used the last item that the user interacted with, together with its attributes, as the item to be predicted by the model (i.e., s_true = 1). During testing, we observed that this item obtained the highest prediction score (s_predict = 0.8627) among all positive and negative samples. The user interacted most frequently with items of brand = 14 and categories = 183, which suggests that the user had a specific preference for brand = 14 or categories = 183. MFITSRec gives larger weights to these two attributes and infers that the user was more inclined to focus on items’ brands in Games.

Impact of item’s latent dimension d i on model’s performance (NDCG@10).
In reality, not all features of items correlate strongly with users’ interests (e.g., almost no users pay attention to where an item is produced), so such features cannot improve recommendation performance significantly. Meanwhile, choosing many features inevitably introduces many parameters into the model, making the time complexity too high and the experiments difficult to perform. We discuss the training efficiency of MFITSRec in Section 4.5. To avoid these issues, we conducted separate experiments on the features of each aspect of the items. We found that, in all datasets, users paid more attention to the category and brand of items than to other features. Therefore, as introduced in Section 3.1, we discarded the features that are less attractive to users and considered only the category and brand of items. To verify the effect of each feature on the model’s performance, we removed the brand attribute and the category attribute separately to obtain two new models. We compared these two models with TiSASRec and MFITSRec on all datasets. Table 6 shows the results of the four models on the Games dataset. We found that the model with only category features and the model with only brand features both outperformed TiSASRec, indicating that including features of items enhances the model’s representation. The main reason limiting the performance of TiSASRec is that there are many identical timestamps (i.e., identical time intervals) in the user sequence, making it impossible to model the absolute time relationships well [11]. Including features of items alleviates this problem from the perspective of information enhancement. Between the category and brand features, the model using only brand features achieved better results, meaning that users pay more attention to the items’ brand during the interaction. Adding features was beneficial on sparse datasets, and considering the two features at the same time led to better performance.
Our results show that capturing multi-aspect preferences of users using multi-aspect features of items is crucial on sparse datasets.
The comparison of each attribute

Impact of maximum sequence length n on model’s performance (NDCG@10).

Impact of the number of stacked self-attention network layers b on model’s performance.

Impact of maximum time interval t max on model’s performance (NDCG@10).
Influence of latent dimensions of features d g , d b on model’s performance
Influence of absolute time relationship and relative position on model’s performance
Figure 7 compares the training efficiency of MFITSRec with different numbers of items’ features on the Beauty dataset. We vary the number of features k used by the model over {1, 2, 3, 4, 5}. Increasing the number of features leads to an exponential increase in the number of parameters of the model, so the training speed gradually decreases. When k ≤ 2, the performance of MFITSRec improves as the number of features increases. The performance of MFITSRec is comparable for k = 2 and k = 3, but the training speed drops significantly when k = 3. When k > 3, the performance of MFITSRec decreases slightly while the training time increases, which may be due to overfitting caused by too many parameters. In general, training efficiency greatly affects the time cost of deploying a recommender model. Therefore, weighing both the performance and the time cost of the model, we input two features into MFITSRec.
Figure 8 compares the training efficiency of SASRec, TiSASRec, and MFITSRec on the Beauty dataset, where k = 2. Since MFITSRec and TiSASRec have more parameters than SASRec, their training time is slightly longer than that of SASRec, but it remains at a comparable level. Within the same amount of training time, our model achieves the best performance.

Training efficiency of MFITSRec with different numbers of item’s features on the Beauty dataset.

Training efficiency of SASRec, TiSASRec, and MFITSRec on the Beauty dataset.
In this work, we proposed a new sequence-aware model (MFITSRec), which combines multi-aspect features of items with user behavior sequences to capture users’ preferences for a particular aspect of items and augment the input interaction sequence. This design is motivated by the observation that users may place different levels of importance on different aspects of the same item. Our experiments on sparse and dense datasets showed that MFITSRec outperformed other competitive baselines. In addition, we investigated the relationship between each feature and the absolute time of items. We demonstrated that embedding multi-aspect features of items can effectively capture users’ preferences for various aspects of items and alleviate the data sparsity problem.
Acknowledgments
This work is supported by the Science and Technology Research Program of Chongqing Municipal Education Commission (No. KJZD-K202101105, KJQN202001136), the Humanities and Social Sciences Research Program of Chongqing Municipal Education Commission (No. 22SKGH302), the National Natural Science Foundation of China (No. 61702063), and the Action Plan for High-Quality Development of Graduate Education of Chongqing University of Technology (No. gzlcx20232104).
