TLSTSRec: Time-aware long short-term attention neural network for sequential recommendation

Abstract

In recent years, sequential recommendation has received widespread attention for its role in enhancing user experience and driving personalized content recommendations. However, it also encounters challenges, including the limitations of modeling information and the variability of user preferences. A novel time-aware Long-Short Term Transformer (TLSTSRec) for sequential recommendation is introduced in this paper to address these challenges. TLSTSRec has two major innovative features. (1) Accurate modeling of users is achieved by fully leveraging temporal information. Time information is modeled by creating a trainable timestamp matrix from both the perspectives of time duration and time spectrum. (2) A novel time-aware Transformer model is proposed. To address the inherent variability of user preferences over time, the model combines long-term and short-term temporal information and adjusts the personalized trade-offs between long-term and short-term sequences using adaptive fusion layers. Subsequently, newly designed encoders and decoders are employed to model timestamps and interaction items. Finally, extensive experiments substantiate the effectiveness of TLSTSRec relative to various state-of-the-art sequential recommendation models based on MC/RNN/GNN/SA across a spectrum of widely used metrics. Furthermore, experiments are conducted to validate the rationality of the TLSTSRec structure.

Keywords

Sequential recommendation; Transformer; Self-attention; Time-aware model

1. Introduction

Sequential recommendation has become a focal point of research in the field of recommendation systems, introducing novel methodologies to meet users’ dynamic needs. Traditional recommendation methods, such as content-based recommendation [1] and collaborative filtering [2], can provide valuable suggestions to users to some extent, but they are often static and struggle to capture users’ changing behaviors and preferences. In contrast, sequential recommendation models dynamically model user-item interactions and capture sequential patterns [3,4]. These models can therefore comprehend users’ interests and needs more accurately and deliver more personalized and precise recommendations. Consequently, sequential recommendation models have received extensive research attention in recent years and have been effectively applied in a number of fields, including film and music recommendation, e-commerce, tourism, and other applications [5,6].

The perpetually dynamic nature of user preferences presents a substantial challenge for user modeling, particularly in large-scale recommendation systems. On a daily basis, millions of new interaction data are integrated into the existing candidate pool [7]. Due to the periodic emergence of new trends and popular topics, users’ attention undergoes continuous fluctuations. Hence, it is imperative for recommendation systems to accord sufficient emphasis to short-term preferences. Simultaneously, users’ interests exhibit stability and long-term characteristics, requiring sequential recommendation models to proficiently apprehend the dynamic characteristics of short-term preferences and the enduring constancy of long-term preferences.

Figure 1.

An example of a user-item interaction sequence for a user.

Recent studies have revealed a common drawback in most existing self-attention (SA)-based sequential recommendation models: they inadequately utilize the time information (timestamps) associated with user-item interactions [8,9,10]. This observation motivates our work on enhancing sequential recommendation models. Most prevailing SA-based sequential recommendation models depend on positional embeddings to describe the order of elements. While positional embeddings serve this purpose, they do not fully account for the temporal dimension. This paper addresses the problem by capturing both long-term and short-term preferences, along with timestamp information. From the perspective of temporal length, a novel computation of time weights distinguishes interactions by the time elapsed since the last interaction, allocates information with different time spans to different attention layers, and finally consolidates them. This approach is justified by the time sensitivity of next-item recommendation, where recent user behavior (short-term) typically holds more significance than actions or records from a long time ago (long-term) [11,12]. From the viewpoint of the time spectrum, a newly designed window function yields smoother weight calculations for time information and minimizes attention to time information outside the window, thereby enhancing precision. Figure 1 illustrates two distinct interaction patterns for two users, Andy and Mary. For Andy, the next item relies on the local features of his historical item sequence and thus depends on his short-term interactions. In contrast, Mary’s next item is more influenced by her global interests, which requires SA-based models to incorporate both long-term and short-term information, as well as temporal information, to better capture user behavior patterns.

In this paper, a novel Time-Aware Long and Short-Term Transformer for Sequential Recommendation (TLSTSRec) is proposed. The model introduces a redesigned treatment of time information, a unique feature among self-attentive sequential recommendation models. In recent studies, TAT4SRec [13] and TiSASRec [14] have utilized time information in SA-based sequential recommendation models. However, TiSASRec does not fully exploit the inherent continuity dependencies in timestamp information, and TAT4SRec lacks a clear distinction between long-term and short-term information, which limits its ability to capture time information of different spans.

Our main contributions are as follows:

− A novel time-aware Transformer model is proposed, which combines long-term persistent preferences with short-term immediate interests. It adjusts the personalized trade-offs between long-term and short-term sequences through adaptive fusion layers. This configuration enhances the model’s ability to comprehensively and effectively capture information.

− For timestamp information, a trainable timestamp matrix is utilized to model temporal information by incorporating both time duration and spectrum. This innovative multidimensional temporal information modeling approach allows for more effective adaptation to the continuously evolving patterns of user behavior over time.

− We conducted extensive experiments, and the results demonstrate that our model outperforms other state-of-the-art models in prediction accuracy. Furthermore, in Section 4.6, the effectiveness of our model is further validated through the visualization of the distribution of time embedding vectors.

2. Related work

2.1. Sequential recommendation

Traditional recommendation methods, such as collaborative filtering and content-based recommendation [15,16], have been widely researched and applied, primarily to address the issue of information overload. These methods typically employ a static modeling approach to capture users’ general preferences for items and make recommendations based on them. However, this static modeling approach may not be entirely applicable in many real-world scenarios, where user interests and the popularity of items often change constantly [11]. For example, over time, users may develop an interest in new movies, books, or music, while old preferences may gradually fade. Similarly, certain items may suddenly become very popular due to events or marketing activities. More importantly, interactions between users and items often exhibit sequential dependencies, meaning a user’s current choices may be influenced by their past behavior. Therefore, to capture this dynamism accurately, recommendation systems need to model sequential dependencies explicitly.

In the literature, early works primarily utilized Markov chain models [17] to produce personalized next-step recommendations based on previous user actions. However, these methods only capture adjacent positions and struggle to handle long-term dependencies. Subsequently, neural network-based models advanced rapidly. Recurrent Neural Networks (RNNs), through sequential dependency relationships and memory mechanisms, enhanced recommendation performance across various tasks [18,19]. However, RNNs also face challenges in capturing long-term dependencies. Meanwhile, Convolutional Neural Networks (CNNs) were used to capture local features within sequences, with a particular emphasis on the influence of recent user behavior [20]. To cope with issues like gradient vanishing or exploding, GRU4Rec+ emerged as a solution for capturing complex dependency relationships. It is a powerful yet computationally intensive sequential recommendation model suited to scenarios that require capturing intricate dependency relationships. Graph Neural Networks (GNNs) [21] treat users and items as graph nodes, mapping each user’s interaction sequence to graph paths, and model the complex user-item and item-item relationships through multiple rounds of graph learning. Graph Convolutional Networks (GCNs) [22] are a type of GNN that extract graph features through convolution operations. Sequential recommendation algorithms based on GCNs excel at modeling the complex contextual relationships between items and sequences. The CatGCN [23] model further enhances initial node representations by modeling feature interactions before graph convolution. The CTGN [24] model integrates item category and interaction time information using a multi-layer graph convolution network and a temporal self-attention network to create multi-dimensional, fine-grained item representations.

In the Transformer model proposed by Vaswani et al. [25], a self-attention mechanism is employed to capture dependencies between input and output sequences. Firstly, the self-attention mechanism can establish direct dependencies between any two points in a sequence, effectively capturing long-distance dependencies [14,20]. Secondly, unlike models such as RNNs and LSTMs that require iterative steps over time, the self-attention mechanism allows parallel computation over the entire sequence, thus improving computational efficiency. Additionally, multi-head attention enables simultaneous focus on various segments of the sequence, facilitating the capture of more diverse contextual information. Due to these advantages, recent models in sequential recommendation are predominantly based on self-attention mechanisms with various architectures, such as TiSASRec [14] and FDSA [10]. The Transformer, incorporating multiple modules (including attention layers, feedforward networks, and positional embeddings), has demonstrated excellent performance in sequential recommendation tasks.

2.2. Attention mechanism

Attention mechanisms are widely popular in various tasks such as image/video captioning [26], machine translation [27], and recommendation [28]. Specifically, models based on self-attention (SA), like Transformer [25], have achieved remarkable performance in the NLP and CV domains. The key to the SA mechanism is its ability to effectively capture long-distance dependencies between different parts of the sequence [29], which is crucial for sequential recommendation. Multi-head attention, an integral part of the SA mechanism, allows the model to simultaneously focus on different parts of the input sequence, thereby better capturing its internal structure. Recent research has proposed improved models, such as SASRec [9], which is a groundbreaking model based on self-attention (SA) that leverages the encoder component of the Transformer. Its enhanced version, FDSA [10], independently models both items and features. Models like BERT4SessRec [30] apply the SA mechanism to model user-item interaction sequences. These models leverage the advantages of the SA mechanism to more effectively handle long-term dependencies, thereby enhancing the performance of recommendation systems. Additionally, SA models are computationally efficient and easily parallelizable.

While most existing SA-based models have demonstrated commendable performance, they tend to overlook timestamp information. Only TiSASRec [14] and TAT4SRec [13] incorporate timestamp information, with TiSASRec treating time intervals as relative positional embeddings and outperforming SASRec. However, TiSASRec overlooks the temporal order and dependencies of timestamps. On the other hand, TAT4SRec does not distinguish between long-term and short-term information, which limits its ability to capture time information with varying spans. In this paper, we address these limitations by modeling time information from two perspectives, time length and time spectrum, through the creation of a trainable timestamp matrix. Our approach uses a long-term feature attention layer to capture users’ enduring preferences and a short-term layer to emphasize users’ immediate interests. This enhancement aims to improve the overall ability of the model to capture information effectively.

2.3. Window function

The window function, alternatively known as a windowing or weighting function, is a mathematical tool extensively employed in information processing and spectrum analysis. In the context of time series processing, its application involves the weighting or smoothing of the time series to alleviate the impact of noise, abrupt changes, and irregularities. This process facilitates a more comprehensive understanding of the periodicity, trends, and features inherent in the sequence. The schematic diagram is shown in Fig. 2.

The window function used here comprises two components, each serving a distinct purpose. Eq. (1) represents the rectangular pulse function, which selects temporal information within a specific range. This function is therefore useful for local feature extraction and analysis within a designated time window. Eq. (2), on the other hand, mirrors the Hamming window function. In time series data with irregular timestamps, this function serves as a time-smoothing mechanism, effectively reducing irregular noise in temporal information. Furthermore, it assigns weights to time information, contributing to a more precise capture of temporal details.

\begin{aligned} F_{trans} (x) & = \mathrm{sign} (x + \frac{w}{2}) - \mathrm{sign} (x - \frac{w}{2}) \end{aligned}

(1)

\begin{aligned} F_{win} (x) & = (α - β \cdot \cos (\frac{2 π x}{w})) \times \frac{1}{2} F_{trans} (x) \end{aligned}

(2)

Figure 2.

User sequential interactions correspond to examples of relative timestamps, with anomalies highlighted in yellow for emphasis.

3. Methodology

3.1. Problem definition

Prediction Problem Description: Similar to related work such as [14], we represent the item space with a set V of size N and the user space with a set U of size I. For each user $u \in U$, the interaction sequence is $(S_{1}^{u}, t_{1}^{u}), (S_{2}^{u}, t_{2}^{u}), \dots, (S_{| S^{u} |}^{u}, t_{| S^{u} |}^{u})$, where $t_{i}^{u}$ represents the timestamp at which user u interacts with item $S_{i}^{u}$. The sequential recommendation task involves predicting the next potential interaction item $S_{| S^{u} | + 1}^{u}$ based on the input item sequence $(S_{1}^{u}, S_{2}^{u}, \dots, S_{| S^{u} |}^{u})$ and the timestamp sequence $(t_{1}^{u}, t_{2}^{u}, \dots, t_{| S^{u} |}^{u})$. The interaction sequence $C^{u} = [(S_{1}^{u}, t_{1}^{u}), (S_{2}^{u}, t_{2}^{u}), \dots, (S_{| S^{u} | - 1}^{u}, t_{| S^{u} | - 1}^{u})]$ of each user is first converted into a fixed-length item sequence $S = (S_{1}, S_{2}, \dots, S_{L})$ and a timestamp sequence $T = (t_{1}, t_{2}, \dots, t_{L})$, where L represents the maximum interaction sequence length accepted by the model. If the length of the user interaction sequence exceeds L, only the most recent L interactions are considered. If the interaction sequence is shorter than L, zero-padding is applied to the left side of the item sequence until its length reaches L.
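The truncation and left-padding step described above can be sketched as follows. This is a minimal illustration, not the authors' preprocessing code: the names `PAD_ITEM` and `to_fixed_length` are hypothetical, item ids are assumed to start at 1, and padding the timestamp sequence with the first real timestamp is an assumption, since the text only specifies zero-padding for the item sequence.

```python
PAD_ITEM = 0  # assumed padding id; real item ids are assumed to start at 1


def to_fixed_length(items, timestamps, max_len):
    """Keep the most recent max_len interactions and left-pad shorter sequences."""
    assert len(items) == len(timestamps)
    items = list(items)[-max_len:]
    timestamps = list(timestamps)[-max_len:]
    n_pad = max_len - len(items)
    # Assumption: padded positions reuse the first real timestamp so that
    # relative time intervals at padded positions stay well defined.
    pad_time = timestamps[0] if timestamps else 0
    return [PAD_ITEM] * n_pad + items, [pad_time] * n_pad + timestamps


# A user with three interactions and L = 5: two padding slots on the left.
S, T = to_fixed_length([5, 9, 2], [100, 130, 400], max_len=5)
```

For a sequence longer than L, only the last L interactions survive, which matches the rule of keeping the most recent interactions.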

Figure 3.

The overall framework of TLSTSRec.

Overview of the Model: The proposed Time-Aware Long and Short-Term Transformer for Sequential Recommendation (TLSTSRec), as illustrated in Fig. 3, comprises three primary components. The first segment is the embedding layer, which transforms temporal and item interaction information into an embedding matrix. The second segment constitutes the core interaction layer. In the upper section, the personalized temporal embedding layer introduces time information to the encoder layer. The global attention blocks in the encoder layer capture long-term and short-term information independently, subsequently dynamically merging them. In the lower section, the initial use of the global attention block captures item-related information. Subsequently, the multi-head attention combines the temporal information from the encoder block with the item information from the preceding global attention block. The third segment involves the output prediction layer, utilizing the amalgamated embedding vectors of timestamps and items for predictive modeling. Specific details of each component will be elaborated upon in the subsequent sections.

3.2. Item embedding layer

Embedding items into vectors is a common practice in sequential recommendation models because a unique embedding vector for each item can represent the rich information behind each item in the sequence [3,10,14]. We create a learnable embedding matrix $M \in R^{N \times d}$ for the items, where d is the latent dimension. This embedding gives the model flexibility and expressive power for item features. Looking up the items of the sequence yields the item embedding matrix $E_{I} \in R^{L \times d}$, where $E_{I_{i}} = M_{s_{i}}$ and L represents the maximum sequence length. Additionally, to incorporate the order of the items, a trainable position embedding $P \in R^{L \times d}$ is added to the item embedding matrix. The final item embedding matrix is given by:

\begin{aligned} {Emb}_{I} = [\begin{matrix} M_{s_{1}} + P_{1} \\ M_{s_{2}} + P_{2} \\ ⋮ \\ M_{s_{L}} + P_{L} \end{matrix}] \end{aligned}

It is worth noting that the position embedding used here differs from the fixed position embeddings in the original Transformer: employing trainable position embeddings provides more flexible position information and enhances performance [13,14].
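As a small illustration of the lookup-and-add step, the sketch below builds the item embedding matrix row by row as the item vector plus the position vector. The tables are plain Python lists with random values standing in for trained parameters; reserving row 0 of the item table for the padding item and fixing it to zeros is an assumption, and `build_item_embedding` is a hypothetical helper name.

```python
import random


def build_item_embedding(seq, item_table, pos_table):
    """Return Emb_I, whose i-th row is item_table[s_i] + pos_table[i]."""
    return [
        [m + p for m, p in zip(item_table[s], pos_table[i])]
        for i, s in enumerate(seq)
    ]


random.seed(0)
N, L, d = 6, 4, 3  # toy sizes: number of items, sequence length, latent dimension
# Row 0 is reserved for the padding item (assumption) and kept at zero.
M = [[random.gauss(0.0, 0.1) for _ in range(d)] for _ in range(N + 1)]
M[0] = [0.0] * d
P = [[random.gauss(0.0, 0.1) for _ in range(d)] for _ in range(L)]

emb_i = build_item_embedding([0, 3, 1, 5], M, P)  # left-padded toy sequence
```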

3.3. Personalized timestamp embedding layer

Specifically, given the fixed-length time sequence $T = (t_{1}, t_{2}, \dots, t_{L})$ of user u, the time interval between two items i and j is $| t_{i} - t_{j} |$, and $R^{U}$ denotes the set of all users’ relative time intervals. To reduce the memory cost of modeling time information, we segment time intervals into different categories: recent interactions are allocated short-term time embeddings, while longer time spans are assigned long-term time embeddings. This strategy plays a significant role in extracting both long-term and short-term features effectively.

Let $T_{max} = max (R^{U})$ denote the maximum relative time interval, and let the scaling factor $ϕ$ control the maximum time interval. All relative time intervals are then clipped and scaled to a fixed interval $[0, k]$, where k represents the number of bins, generating the final time interval sequence $T_{s}$ as shown below:

\begin{aligned} T_{s} = (t_{1}^{s}, t_{2}^{s}, \dots, t_{L}^{s}) = \frac{k \times min (T_{r}, T_{max}) \times ϕ}{max (R^{U})} \end{aligned}

(3)

We employ a learnable embedding matrix $M_{T} = [\begin{matrix} e_{1} \\ ⋮ \\ e_{k} \end{matrix}] \in R^{k \times d}$ to learn temporal information, where d represents the dimension size. To mitigate the impact on the time embedding matrix of time intervals with extremely large spans, which lack distinctive temporal features, we use an embedding method based on a window function to transform the scaled relative time interval sequence $T_{s}$ into a time embedding matrix. Each row is a weighted sum of the bin embeddings: $e_{t_{i}}^{'} = \sum_{j \in (0, k)} w_{t_{i}, j} e_{j}$. The final time embedding matrix is obtained as $E_{T} = [\begin{matrix} e_{t_{1}}^{'} \\ ⋮ \\ e_{t_{L}}^{'} \end{matrix}] \in R^{L \times d}$, where the weight $w_{t_{i}, j}$ is computed using the window function:

\begin{aligned} w_{t_{i}, j} & = F_{win} (t_{i} - j) \end{aligned}

(4)

\begin{aligned} F_{win} (x) & = (α - β \cdot \cos (\frac{2 π x}{w})) \cdot \frac{1}{2} (\mathrm{sign} (x + \frac{w}{2}) - \mathrm{sign} (x - \frac{w}{2})) \end{aligned}

(5)

In this context, the function $\mathrm{sign} (x) = \begin{cases} 1, & x > 0 \\ 0, & x = 0 \\ - 1, & x < 0 \end{cases}$. The parameters $α = 0.51$ and $β = 0.46$ serve as the weights of the window function, and w is the size of the window.

It can be observed that this window function is a compound function that combines a rectangular window with a Hamming-style cosine weighting (Section 2.3). In the transition region at the window boundary, a smooth transition is employed, gradually changing the values from the window boundary to the transition area. This approach allows for more accurate calculation of the relative time interval sequence, thereby more precisely revealing potential implicit information in preceding and succeeding behaviors. The specific weight calculation is illustrated in the accompanying figure.
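To make the weight computation concrete, the following sketch evaluates the window function exactly as printed in Eq. (5) and uses it to form one row of the time embedding matrix as a weighted sum of bin embeddings. It is an illustration under stated assumptions rather than the reference implementation: the helper names are hypothetical, bins are indexed from 0, and the toy bin table stands in for the trainable matrix $M_{T}$.

```python
import math

ALPHA, BETA = 0.51, 0.46  # window weights as stated in the text


def sign(x):
    """Three-valued sign function: 1, 0 or -1."""
    return (x > 0) - (x < 0)


def f_win(x, w):
    """Window function of Eq. (5), evaluated as printed."""
    rect = 0.5 * (sign(x + w / 2) - sign(x - w / 2))  # 1 inside, 0.5 on the edge, 0 outside
    return (ALPHA - BETA * math.cos(2 * math.pi * x / w)) * rect


def time_embedding_row(t_scaled, bin_table, w):
    """Weighted sum of bin embeddings e_j with weights F_win(t - j).

    Assumption: bins are indexed j = 0, ..., k - 1.
    """
    d = len(bin_table[0])
    row = [0.0] * d
    for j, e_j in enumerate(bin_table):
        weight = f_win(t_scaled - j, w)
        if weight != 0.0:  # bins outside the window contribute nothing
            for c in range(d):
                row[c] += weight * e_j[c]
    return row


bins = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # toy stand-in for M_T with k = 3, d = 2
row = time_embedding_row(1.0, bins, w=2)
```

Only the bins whose index lies within half a window of the scaled interval receive a non-zero weight, which is how the window limits attention to distant time information.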

3.4. Global attention block

As both the encoder and decoder utilize multi-head attention, the introduction of Global Attention Block becomes essential. Global Attention Block can be regarded as a neural network module based on the self-attention mechanism, effectively modeling sequential patterns. Here, we employ scaled dot-product attention, with the attention input consisting of three matrices: Q, K, and V. The specific definitions are as follows:

\begin{aligned} SDPA (Q, K, V) = \mathrm{softmax} (\frac{Q K^{T}}{\sqrt{d}}) V \end{aligned}

(6)

In the context of sequence recommendation models, SDPA is commonly used with the same objects as query, key, value [3,10,14], and the scaling factor helps avoid issues like gradient vanishing or exploding, especially in high dimensions. To enhance sensitivity to local information and capture fine-grained structures in the sequence, we modify SDPA by prohibiting connections between $Q_{i}$ and $K_{j}$ (where $j > i$ ). This restriction not only improves the model’s performance but also reduces computational time and resource consumption. The resulting attention weight matrix may become sparser, enhancing interpretability.
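The masking rule can be illustrated with a small scaled dot-product attention in which position i only scores keys j with j ⩽ i. This is a plain-Python sketch for readability, not an efficient or official implementation, and `causal_sdpa` is a hypothetical name.

```python
import math


def causal_sdpa(Q, K, V):
    """Scaled dot-product attention where query i ignores keys j > i."""
    d = len(Q[0])
    out = []
    for i, q in enumerate(Q):
        # Scores are computed only for the allowed keys 0..i.
        scores = [
            sum(a * b for a, b in zip(q, K[j])) / math.sqrt(d)
            for j in range(i + 1)
        ]
        m = max(scores)  # subtract the max for a numerically stable softmax
        exps = [math.exp(s - m) for s in scores]
        z = sum(exps)
        weights = [e / z for e in exps]
        out.append([
            sum(wgt * V[j][c] for j, wgt in enumerate(weights))
            for c in range(len(V[0]))
        ])
    return out


X = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
out = causal_sdpa(X, X, X)  # self-attention: the same object is query, key and value

# Changing a later position must not affect earlier outputs.
X_changed = X[:2] + [[9.0, -9.0]]
out_changed = causal_sdpa(X_changed, X_changed, X_changed)
```

Because the first position can only attend to itself, its output equals its own value vector, and outputs at earlier positions are unaffected by later items.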

In addition, multi-head attention (MH) is employed with the aim of directing the model’s focus to various aspects of information. The definition of MH is as follows:

\begin{aligned} H & = MH (Q, K, V) = \mathrm{Concat} (head_{1}, head_{2}, \dots, head_{n_{h}}) W^{O} \end{aligned}

(7)

\begin{aligned} head_{i} & = SDPA (Q W_{i}^{Q}, K W_{i}^{K}, V W_{i}^{V}) \end{aligned}

(8)

In this context, the matrices $W^{O}, W_{i}^{Q}, W_{i}^{K}, W_{i}^{V}$ are learnable parameters, each belonging to the space $R^{d \times d}$. Additionally, $n_{h}$ represents the number of heads. To tackle the issue of degradation in deep neural networks and preserve the original information of items, residual connections are employed. The mechanism is defined as follows:

\begin{aligned} MHSA (Q, K, V) = \mathrm{LayerNorm} (\mathrm{Dropout} (MH (Q, K, V)) + Q) \end{aligned}

(9)

Finally, using the symbol O to represent the output of the entire process of the Multi-Head Self-Attention Block:

\begin{aligned} O = MHSA (Q, K, V) \end{aligned}

(10)

In future research, we will explore integrating item attributes and item popularity into the self-attention mechanism, which promises to be an interesting and meaningful endeavor.

3.5. Encoder and decoder layers

3.5.1. Encoder layers

Through the personalized time embedding layer, we acquire ${Emb}_{T}^{l}$ and ${Emb}_{T}^{s}$. Subsequently, these embeddings are individually input into the multi-head attention block MHSA of the encoder, allowing for an exploration of the temporal relationships associated with each interaction within diverse contexts.

\begin{aligned} E_{T}^{l} = MHSA ({Emb}_{T}^{l}, {Emb}_{T}^{l}, {Emb}_{T}^{l}) \end{aligned}

(11)

The multi-head attention utilizes the same object, ${Emb}_{T}^{l}$, as query, key, and value. It can distinguish useful time information in different contexts. The time embedding matrix for short-term interactions, which are of more significance to the user, is obtained using the same calculation method and is denoted as $E_{T}^{s}$.

While MHSA is capable of aggregating all time embedding vectors using adaptive weights, it is still a linear transformation. Therefore, we adopt a variant of the Feedforward Neural Network based on a Gated Linear Unit, which facilitates learning more intricate representations. The Transformer with a Gated Linear Unit has been reported to outperform the version with a standard Feedforward Neural Network [31].

\begin{aligned} {FFN}_{ReGLU} (E_{T}^{l}) & = (\mathrm{ReLU} (E_{T}^{l} W_{1}) \otimes E_{T}^{l} V) W_{2} \end{aligned}

(12)

\begin{aligned} O_{E}^{l} & = E_{T}^{l} + \mathrm{Dropout} ({FFN}_{ReGLU} (E_{T}^{l})) \end{aligned}

(13)
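The gated feed-forward step of Eqs. (12) and (13) can be sketched in a few lines. The weights below are toy values chosen for readability, dropout is omitted (as at inference time), and the helper names are hypothetical.

```python
def matmul(A, B):
    """Plain-Python matrix product of two lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]


def ffn_reglu(E, W1, V, W2):
    """(ReLU(E W1) * (E V)) W2, with * the element-wise product."""
    gate = [[max(0.0, x) for x in row] for row in matmul(E, W1)]
    value = matmul(E, V)
    hidden = [[g * v for g, v in zip(g_row, v_row)] for g_row, v_row in zip(gate, value)]
    return matmul(hidden, W2)


def residual_ffn(E, W1, V, W2):
    """O = E + FFN_ReGLU(E); dropout is omitted in this sketch."""
    F = ffn_reglu(E, W1, V, W2)
    return [[e + f for e, f in zip(e_row, f_row)] for e_row, f_row in zip(E, F)]


identity = [[1.0, 0.0], [0.0, 1.0]]
doubling = [[2.0, 0.0], [0.0, 2.0]]
O = residual_ffn([[1.0, -2.0]], identity, doubling, identity)
```

In this toy case the ReLU gate passes the first feature and closes the second, so only the first feature is changed by the feed-forward branch before the residual is added.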

Since the short-term and long-term timestamp sequences capture information at different time scales, the short-term sequence may reflect the user’s recent interests and behaviors, while the long-term sequence may contain habits and preferences over a longer time range. By separately processing these two types of timestamp sequences, the model can better capture user behavior patterns at different time scales. The short-term and long-term information is then weighted to obtain the final time information. Therefore, we adopt an Adaptive Mixture Layer to adjust the personalized trade-off between long-term and short-term sequences. Specifically, this is achieved by employing average pooling followed by a linear layer with a sigmoid activation function to obtain coefficients for the time sequences. $Ad ({Emb}_{T}^{l})$ represents personalized weights for different time sequences, and $σ$ denotes the sigmoid activation function.

\begin{aligned} Ad ({Emb}_{T}^{l}) & = σ (\mathrm{Linear} (\frac{1}{n} \sum_{i = 1}^{n} {Emb}_{T, i}^{l})) \end{aligned}

(14)

\begin{aligned} O_{E} & = Ad ({Emb}_{T}^{l}) \otimes O_{E}^{l} + Ad ({Emb}_{T}^{s}) \otimes O_{E}^{s} \end{aligned}

(15)
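A minimal sketch of the adaptive fusion in Eqs. (14) and (15) follows. Several details are assumptions because the text does not fix them: the linear layer is taken to map the pooled d-dimensional vector to a d-dimensional gate, the same linear layer is shared by the long-term and short-term branches, and the gate is broadcast over all positions. The names `gate` and `adaptive_fusion` are hypothetical.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def gate(emb, W, b):
    """Average-pool over positions, apply a linear layer, then a sigmoid."""
    n, d = len(emb), len(emb[0])
    mean = [sum(row[c] for row in emb) / n for c in range(d)]
    return [
        sigmoid(sum(mean[r] * W[r][c] for r in range(d)) + b[c])
        for c in range(len(b))
    ]


def adaptive_fusion(emb_l, emb_s, o_l, o_s, W, b):
    """O_E = Ad(Emb_l) * O_l + Ad(Emb_s) * O_s, gates broadcast over positions."""
    g_l, g_s = gate(emb_l, W, b), gate(emb_s, W, b)
    return [
        [gl * x + gs * y for gl, gs, x, y in zip(g_l, g_s, row_l, row_s)]
        for row_l, row_s in zip(o_l, o_s)
    ]


# With zero weights and bias both gates equal 0.5, so the fusion is a plain average.
W0, b0 = [[0.0, 0.0], [0.0, 0.0]], [0.0, 0.0]
fused = adaptive_fusion([[1.0, 2.0]], [[3.0, 4.0]], [[2.0, 4.0]], [[0.0, 2.0]], W0, b0)
```

Because each gate comes from its own sigmoid, the two coefficients are not forced to sum to one; a softmax over the two branches would be a different design choice from the one written in Eq. (15).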

To simplify the description of the entire process, the procedure of the encoder is defined as follows:

\begin{aligned} O_{E} = EB ({Emb}_{T}), O_{E} \in R^{L \times d} \end{aligned}

(16)

3.5.2. Decoder layer

In the decoder layer, a combination of global attention and multi-head attention is utilized. The initial global attention incorporates information from the item embedding layer, while the subsequent multi-head attention on the right receives input from the output $O_{E}$ of the encoder layer and the output of the preceding global attention. This approach is employed to integrate both time and interaction information.

\begin{aligned} {Emb}_{I}^{'} = MHSA ({Emb}_{I}, {Emb}_{I}, {Emb}_{I}) \end{aligned}

(17)

The multi-head attention on the right is used to establish connections between the encoder and decoder, integrating time and item information. Here, $Q = {Emb}_{I}^{'}$, $K = O_{E}$, and $V = O_{E}$. By setting time information as K and V, the model can more easily distinguish and integrate relevant information at different time points. $E_{I}$ is used to represent the combination of item and time information:

\begin{aligned} E_{I} = MHSA ({Emb}_{I}^{'}, O_{E}, O_{E}) \end{aligned}

(18)

To capture key features in the sequence and enhance the model’s performance and expressiveness, a method similar to Eqs. (12) and (13) is employed:

\begin{aligned} {FFN}_{ReGLU} (E_{I}) & = (\mathrm{ReLU} (E_{I} W_{1}) \otimes E_{I} V) W_{2} \end{aligned}

(19)

\begin{aligned} O_{D} & = {Emb}_{I}^{'} + \mathrm{Dropout} ({FFN}_{ReGLU} (E_{I})) \end{aligned}

(20)

To simplify the description of the entire process, the procedure of the decoder is defined as follows:

\begin{aligned} O_{D} = DB ({Emb}_{I}, O_{E}), O_{D} \in R^{L \times d} \end{aligned}

(21)

3.6. Multilayer block

To better capture the complex relationships between input and output sequences, a multi-layered encoder and decoder structure is employed. The detailed structure is illustrated in Fig. 3. Their role is to comprehensively capture the intricate relationships between input and output sequences, enhancing the model’s representational capacity, especially when dealing with long sequences and data with hierarchical features. Additionally, the multi-layered structure improves the model’s robustness, aids in generalization, mitigates overfitting issues, and enhances both training and inference efficiency.

Therefore, a stack of N encoders and M decoders is used, with each decoder block’s input coming from the output of the last encoder. As a result, the $n^{th}$ layer of the encoder and the $m^{th}$ layer of the decoder are defined as follows:

\begin{aligned} O_{E}^{(n)} & = EB (O_{E}^{(n - 1)}) \end{aligned}

(22)

\begin{aligned} O_{D}^{(m)} & = DB (O_{D}^{(m - 1)}, O_{E}^{(N)}) \end{aligned}

(23)

where $O_{E}^{(0)} = E_{T}$ and $O_{D}^{(0)} = {Emb}_{I}$. Here, $n \in [1, N]$ and $m \in [1, M]$.
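The data flow through the stacked blocks can be summarised in a short sketch: the encoder blocks are applied in sequence to the time embedding, and every decoder block receives the previous decoder output together with the output of the last encoder. The blocks are passed in as plain callables here; `run_stack` and the toy blocks are hypothetical stand-ins for the real encoder and decoder blocks.

```python
def run_stack(e_t, emb_i, encoder_blocks, decoder_blocks):
    """Apply N encoder blocks, then M decoder blocks fed by the final encoder output."""
    o_e = e_t
    for eb in encoder_blocks:
        o_e = eb(o_e)
    o_d = emb_i
    for db in decoder_blocks:
        # Every decoder layer sees the output of the last encoder layer.
        o_d = db(o_d, o_e)
    return o_e, o_d


# Toy blocks on scalars, only to make the wiring visible.
toy_encoders = [lambda x: x + 1] * 3
toy_decoders = [lambda d, e: d + e] * 2
o_e, o_d = run_stack(0, 10, toy_encoders, toy_decoders)
```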

3.7. Prediction layer

After the multilayer block, the preference scores of users for each item are computed as follows:

\begin{aligned} R_{i, t} = O_{D_{t}} M_{i}^{T} \end{aligned}

(24)

Here, $O_{D_{t}}$ represents the $t^{th}$ row of $O_{D}$, and $M \in R^{N \times d}$ denotes the matrix of item embeddings. To avoid overfitting and enhance model performance, the same matrix M as in the item embedding layer is used [19,26]. $R_{i, t}$ represents the preference score of item i given the history of t interactions with items ( $s_{1}, s_{2}, s_{3}, \dots, s_{t}$ ) and their corresponding timestamps ( $t_{1}, t_{2}, t_{3}, \dots, t_{t}$ ). A higher preference score $R_{i, t}$ implies a higher likelihood of item i being interacted with. Given an input interaction sequence $[(s_{1}, t_{1}), (s_{2}, t_{2}), \dots, (s_{n - 1}, t_{n - 1})]$, the model’s expected output is the shifted sequence of next interactions ( $s_{2}, s_{3}, s_{4}, \dots, s_{n}$ ), and the last row of $O_{D}$ is used for prediction.
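The scoring step of Eq. (24) amounts to a dot product between one row of the decoder output and each item embedding, followed by ranking. The sketch below uses toy values; `score_items` and `top_k` are hypothetical helpers, and excluding the padding id 0 from the ranking is an assumption.

```python
def score_items(o_d_row, item_table):
    """Preference score of every item: dot product with the decoder output row."""
    return [sum(a * b for a, b in zip(o_d_row, m_i)) for m_i in item_table]


def top_k(scores, k, exclude=()):
    """Indices of the k highest-scoring items, skipping excluded ids."""
    candidates = [i for i in range(len(scores)) if i not in exclude]
    return sorted(candidates, key=lambda i: scores[i], reverse=True)[:k]


M_items = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # row 0: padding item (assumption)
last_row = [2.0, 1.0]  # toy stand-in for the last row of O_D
scores = score_items(last_row, M_items)
recommended = top_k(scores, k=2, exclude={0})
```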

3.8. Model training

As described above, the user’s interaction sequence with items, $C^{u} = [(S_{1}^{u}, t_{1}^{u}), (S_{2}^{u}, t_{2}^{u}), \dots, (S_{| S^{u} | - 1}^{u}, t_{| S^{u} | - 1}^{u})]$, is converted into a fixed-length item sequence $S = (s_{1}, s_{2}, s_{3}, \dots, s_{L})$ and a timestamp sequence $T = (t_{1}, t_{2}, t_{3}, \dots, t_{L})$. Let $o = (o_{1}, o_{2}, o_{3}, \dots, o_{L})$ be the expected output, where the $l^{th}$ element $o_{l}$ in the sequence is defined as follows:

\begin{aligned} o_{l} = \begin{cases} ⟨pad⟩ & \text{if } s_{l} \text{ is a padding item} \\ s_{l + 1} & \text{if } 1 ⩽ l < L \\ s_{| S^{u} |}^{u} & \text{if } l = L \end{cases} \end{aligned}

(25)

Here, $⟨ p a d ⟩$ denotes the padding item. The task of sequential recommendation is to rank the items a user is most likely to interact with based on the user’s historical interaction sequence. However, user interaction data is mostly implicit and does not directly indicate the user’s true preferences, so optimizing the preference scores $R_{i, t}$ directly is not feasible. Therefore, negative sampling is employed: for each $o_{i}$ , an item $o_{i}^{^{'}}$ with $o_{i}^{^{'}} \notin o$ is randomly selected to serve as a negative sample.
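The construction of the expected output and the negative samples can be sketched as follows. This is only an illustration under stated assumptions: padding is applied on the left with item id 0, real item ids start at 1, and negatives are drawn uniformly from items outside a caller-supplied `seen` set. The helper names `build_example` and `sample_negatives` are hypothetical.

```python
import random
from typing import List, Sequence, Set, Tuple

PAD = 0  # assumed id of the padding item; real item ids are assumed to start at 1


def build_example(items: Sequence[int], max_len: int) -> Tuple[List[int], List[int]]:
    """Turn a user's item sequence into (input, target) lists of length max_len.

    The target at every non-padding position is the next item in the sequence;
    padding positions keep PAD as their target.
    """
    inputs = list(items[:-1])[-max_len:]
    targets = list(items[1:])[-max_len:]
    pad = [PAD] * (max_len - len(inputs))
    return pad + inputs, pad + targets


def sample_negatives(targets: Sequence[int], seen: Set[int], num_items: int,
                     rng: random.Random) -> List[int]:
    """Draw one random item outside `seen` for every non-padding target.

    Assumes `seen` does not cover all item ids, otherwise the loop cannot end.
    """
    negatives = []
    for target in targets:
        if target == PAD:
            negatives.append(PAD)
            continue
        candidate = rng.randint(1, num_items)
        while candidate in seen:
            candidate = rng.randint(1, num_items)
        negatives.append(candidate)
    return negatives


inputs, targets = build_example([3, 7, 2, 9], max_len=5)
negatives = sample_negatives(targets, seen={3, 7, 2, 9}, num_items=20, rng=random.Random(0))
```

Here `inputs` and `targets` are the same sequence shifted by one position, so every non-padding input position has the following item as its label.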

The binary cross-entropy loss is employed as the objective function:

\begin{aligned} L = - \sum_{S^{u} \in S} \sum_{l \in [1, 2, \dots, L]} [\log (σ (R_{o_{l}, l})) + \log (1 - σ (R_{o_{l}^{^{'}}, l}))] \end{aligned}

(26)

The sigmoid function is defined as $σ (x) = \frac{1}{1 + e^{- x}}$ . Note that the losses of padding items are masked. The proposed model is optimized with the Adam optimizer [32] and, for improved training efficiency, uses T-Fixup [33], a Transformer weight initialization approach.
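A minimal sketch of the masked binary cross-entropy objective in Eq. (26) for a single sequence is given below. It assumes that positive and negative scores have already been computed per position and that padding targets carry id 0; the function name `masked_bce_loss` and the small `eps` guard are assumptions added for illustration.

```python
import math
from typing import Sequence

PAD = 0  # assumed id of the padding item


def sigmoid(x: float) -> float:
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)
    return z / (1.0 + z)


def masked_bce_loss(pos_scores: Sequence[float], neg_scores: Sequence[float],
                    targets: Sequence[int], eps: float = 1e-12) -> float:
    """Sum of -[log s(r_pos) + log(1 - s(r_neg))] over non-padding positions."""
    loss = 0.0
    for r_pos, r_neg, target in zip(pos_scores, neg_scores, targets):
        if target == PAD:
            continue  # padding positions contribute nothing to the loss
        loss -= math.log(sigmoid(r_pos) + eps) + math.log(1.0 - sigmoid(r_neg) + eps)
    return loss


# One padding position and one real position with uninformative scores.
example_loss = masked_bce_loss([5.0, 0.0], [5.0, 0.0], targets=[PAD, 4])
```

With both scores equal to zero at the only real position, the loss reduces to 2 ln 2, which is the value of an uninformed prediction.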

4. Experiments

In this section, TLSTSRec is empirically evaluated on real-world datasets from two different domains to answer the following questions:

RQ1 Can the proposed method surpass the current state-of-the-art models in sequential recommendation tasks?

RQ2 How do the time window function and the merging of long-term and short-term time information affect the performance of the model?

RQ3 How do the hyperparameters affect TLSTSRec?

RQ4 How does the computational efficiency of TLSTSRec compare to other models?

RQ5 What visualization methods are employed to validate the effectiveness and interpretability of personalized timestamps with window functions?

4.1. Experimental settings

4.1.1. Datasets

The two datasets utilized for experimental evaluation are sourced from the real world and widely employed in relevant research [9,10,14,34], with detailed data descriptions available in Table 1.

Steam [9]: This benchmark dataset is derived from the prominent online video game distribution platform, Steam, covering the time span approximately from 2010 to 2017.

Userbehavior [35]: Provided by Alibaba, this dataset encompasses user behavior data from the online shopping platform Taobao. The included user data comprises clicks, purchases, items added to the shopping cart, and product preferences.

For both datasets, actions such as reviews, clicks, or adding items to the cart are considered implicit feedback and are sorted by timestamp. A similar preprocessing procedure is applied to Steam, wherein items with fewer than 5 interactions lacking meaningful features are discarded. In the case of Userbehavior, users are sorted by their user IDs, and the interaction sequences of the top 100,000 users are extracted. Users with more than 300 interactions or fewer than 20 interactions are then excluded.
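The filtering steps above can be sketched as follows, assuming the raw log is a list of `(user_id, item_id, timestamp)` tuples. The function names and default thresholds mirror the description in the text but are otherwise hypothetical; the authors' actual preprocessing scripts may differ.

```python
from collections import Counter
from typing import Dict, List, Tuple

Row = Tuple[int, int, int]  # (user_id, item_id, timestamp)


def filter_rare_items(log: List[Row], min_count: int = 5) -> List[Row]:
    """Drop interactions whose item appears fewer than min_count times (Steam-style)."""
    counts = Counter(item for _, item, _ in log)
    return [row for row in log if counts[row[1]] >= min_count]


def select_users(log: List[Row], max_users: int = 100_000,
                 min_len: int = 20, max_len: int = 300) -> Dict[int, List[Tuple[int, int]]]:
    """Keep the first max_users users by id, then drop too-short or too-long sequences."""
    by_user: Dict[int, List[Tuple[int, int]]] = {}
    for user, item, ts in log:
        by_user.setdefault(user, []).append((item, ts))
    kept = {}
    for user in sorted(by_user)[:max_users]:
        sequence = sorted(by_user[user], key=lambda pair: pair[1])  # sort by timestamp
        if min_len <= len(sequence) <= max_len:
            kept[user] = sequence
    return kept


# Tiny illustrative log with relaxed thresholds.
toy_log = [(2, 10, 5), (1, 10, 3), (1, 11, 1), (3, 10, 2)]
frequent = filter_rare_items(toy_log, min_count=3)
users = select_users(toy_log, max_users=2, min_len=2, max_len=300)
```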

Concerning the evaluation, the leave-one-out evaluation method is adopted to partition these two datasets [9,10,13,14]. This method is extensively used in the evaluation of sequential recommendation. Specifically, for each user u and their interaction sequence $C^{u}$ , the most recent interaction $C_{| C^{u} |}^{u}$ is reserved for testing, the second most recent interaction $C_{| C^{u} | - 1}^{u}$ is used for validation, and the remaining interactions are employed for training.
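The leave-one-out partition can be illustrated with a short sketch. It assumes each user's history is a list of `(item_id, timestamp)` pairs; the handling of users with fewer than three interactions (everything goes to training) is an assumption for the example and is not specified in the text.

```python
from typing import List, Optional, Tuple

Interaction = Tuple[int, int]  # (item_id, timestamp)


def leave_one_out(
    interactions: List[Interaction],
) -> Tuple[List[Interaction], Optional[Interaction], Optional[Interaction]]:
    """Split one user's history into (train, validation, test).

    The latest interaction is held out for testing and the second latest
    for validation; everything earlier is used for training.
    """
    ordered = sorted(interactions, key=lambda pair: pair[1])
    if len(ordered) < 3:
        # Assumption: too-short histories are kept for training only.
        return ordered, None, None
    return ordered[:-2], ordered[-2], ordered[-1]


train, valid, test = leave_one_out([(5, 30), (1, 10), (9, 40), (4, 20)])
```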

Table 1
Basic dataset statistics.

Dataset	Steam	Userbehavior
Users	334,700	100,000
Items	13,047	677,456
Actions per user	10.59	76.43
Actions per item	4.2M	7.8M
Time span	7 years	7 days

4.1.2. Evaluation metrics

Following common evaluation standards [9,10,13,14], we employed five evaluation metrics: NDCG@5, NDCG@10, Hit Rate@5, Hit Rate@10, and MRR. Hit Rate (HR) measures whether the ground-truth item appears in the top-k recommendations. NDCG is position-aware and rewards placing the ground-truth item higher in the recommended list. MRR averages the reciprocal of the rank at which the ground-truth item appears, so higher positions also yield higher scores.

To reduce the computational cost of ranking all items by preference score on large datasets, for each user we randomly sampled 100 items that the user had not previously interacted with and ranked these negative items together with the ground-truth item [10,13,14]. The evaluation metrics were then computed on these 101 items.
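For a single test case with one ground-truth item and 100 sampled negatives, the three metrics reduce to simple functions of the ground-truth item's rank. The sketch below assumes scores are already available and breaks ties in favor of the ground-truth item; the function names are illustrative. Reported values would be the averages of these per-user quantities.

```python
import math
from typing import Sequence


def rank_of_target(target_score: float, negative_scores: Sequence[float]) -> int:
    """0-based rank of the ground-truth item among the sampled candidates."""
    return sum(1 for score in negative_scores if score > target_score)


def hit_at_k(rank: int, k: int) -> float:
    """1 if the ground-truth item is within the top k, else 0."""
    return 1.0 if rank < k else 0.0


def ndcg_at_k(rank: int, k: int) -> float:
    """With one relevant item the ideal DCG is 1, so NDCG = 1 / log2(rank + 2)."""
    return 1.0 / math.log2(rank + 2) if rank < k else 0.0


def reciprocal_rank(rank: int) -> float:
    """Reciprocal of the 1-based rank of the ground-truth item."""
    return 1.0 / (rank + 1)


# Two negatives outscore the target, so its 0-based rank is 2.
rank = rank_of_target(0.7, [0.9, 0.1, 0.8, 0.3])
metrics = (hit_at_k(rank, 5), ndcg_at_k(rank, 5), reciprocal_rank(rank))
```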

4.1.3. Baseline models

To demonstrate the effectiveness of our proposed TLSTSRec, we compare it with various categories of recommendation models, including a classical general recommendation model that does not consider sequential patterns (PopRec), Markov chain (MC) based models (TransRec, Caser), Recurrent Neural Network (RNN) based models (GRU4Rec, GRU4Rec $+$ ), Graph Neural Network (GNN) based models (CatGCN, CTGNN), and self-attention (SA) based models (SASRec, FDSA, TiSASRec, TAT4SRec).

PopRec: The popularity-based recommendation model recommends items based on their frequency of occurrence.

TransRec [36]: A translation-based model that embeds items in a shared space and represents each user as a translation vector from the previously consumed item to the next one.

Caser [20]: Caser embeds a sequence of items into an “image” and captures higher-order Markov chains by applying convolution operations to the “image”.

GRU4Rec [37]: A sequential recommendation model that employs stacked Gated Recurrent Units (GRUs) to capture and leverage sequential patterns in user-item interactions.

GRU4Rec $+$ [8]: Based on GRU4Rec, a new ranking loss function is introduced.

SASRec [9]: SASRec is a groundbreaking model based on self-attention (SA) that leverages the encoder component of the Transformer. It uses the dot product between the latent representation at the most recent position and the embedding of the target item as the scoring function.

FDSA [10]: FDSA utilizes independent self-attention modules to model item transition patterns and feature transition patterns.

TiSASRec [14]: Based on self-attention, absolute positional information and relative time interval information are incorporated.

CatGCN [23]: A state-of-the-art model that optimizes initial node representations by modeling feature interactions before graph convolution.

CTGNN [24]: A state-of-the-art model that integrates item category and interaction time information, using a multi-layer graph convolutional network and a temporal self-attention network.

TAT4SRec [13]: A state-of-the-art model that leverages the Transformer to model timestamps and items separately.

For a fair comparison, we used the code provided in the corresponding papers for Caser, GRU4Rec, GRU4Rec $+$ , SASRec, TiSASRec, CTGNN, CatGCN, and TAT4SRec. We searched embedding dimensions in 20, 30, 40, 50, and 100, regularization hyperparameters in 0.1, 0.01, 0.001, and 0.0001, and learning rates in 0.1, 0.01, and 0.001. Other settings followed those reported in the respective original papers. Training was terminated if there was no improvement for 40 epochs.

4.1.4. Implementation details

We implemented TLSTSRec using PyTorch. The default configuration includes two encoder blocks and two decoder blocks, with a maximum sequence length $(L)$ set to 50 for all datasets. The batch size is 128, the learning rate is 0.001, and the dropout rate is 0.4 for Userbehavior and 0.2 for Steam. The default window function size $(w)$ is set to 20. We tuned the bin number $(k)$ in $\{256, 512, 1024, 2048\}$ and the scale factor in $\{2, 4, 6, 8, 10\}$ . The optimizer used is the Adam optimizer with momentum decay rates $β_{1} = 0.9$ and $β_{2} = 0.98$ . All experiments were conducted on a server equipped with 14 vCPU Intel(R) Xeon(R) Gold 6330 CPU @ 2.00 GHz and Nvidia GTX 3070 GPU.

Table 2
Recommendation performance on the two datasets. The best result in each column is highlighted in bold, and the second-best result is underlined.

	Userbehavior					Steam
Model	NDCG@5	Hit@5	NDCG@10	Hit@10	MRR	NDCG@5	Hit@5	NDCG@10	Hit@10	MRR
PopRec	0.2679	0.3658	0.3182	0.4856	0.2887	0.4235	0.5775	0.4727	0.7493	0.4061
TransRec	0.3298	0.4342	0.3673	0.5314	0.3276	0.4742	0.6414	0.5222	0.7891	0.4495
Caser	0.2856	0.3413	0.3071	0.4763	0.2610	0.4883	0.6517	0.5347	0.7945	0.4637
GRU4Rec	0.5168	0.6173	0.5415	0.6629	0.3582	0.2287	0.3169	0.2703	0.4464	0.2393
GRU4Rec $+$	0.6189	0.7210	0.6306	0.7768	0.6018	0.4533	0.6327	0.5488	0.7984	0.4692
SASRec	0.6176	0.7195	0.6286	0.7543	0.6050	0.6081	0.7617	0.6427	0.8680	0.5783
FDSA	0.6197	0.7186	0.6260	0.7672	0.5958	0.6102	0.7611	0.6427	0.8704	0.5794
TiSASRec	0.6403	0.7285	0.6472	0.7834	0.6208	0.6069	0.7599	0.6422	0.8503	0.5767
CatGCN	0.6242	0.7135	0.6359	0.7650	0.6081	0.6081	0.7546	0.6435	0.8523	0.577
CTGNN	0.6502	0.7394	0.6556	0.7928	0.6276	0.6136	0.7659	0.6473	0.8579	0.5807
TAT4SRec	0.6569	0.7475	0.6742	0.8004	0.6400	0.6153	0.7654	0.6508	0.8709	0.5883
TLSTSRec	0.6978	0.7693	0.7080	0.8006	0.6829	0.6199	0.7707	0.6531	0.8733	0.5930

4.2. Overall performance (RQ1)

4.2.1. Performance comparison

Table 2 shows the recommendation performance of our model and the baselines on the two datasets. Firstly, FPMC and TransRec are methods based on item state transitions. It can be observed that their performance on the relatively sparse Steam dataset is better than on Userbehavior. Models based on neural networks (i.e., Caser, GRU4Rec, GRU4Rec $+$ ) show significantly better performance than FPMC and TransRec, owing to their ability to better capture complex sequential behaviors. Clearly, models based on self-attention (i.e., SASRec, FDSA, TiSASRec, TAT4SRec, TLSTSRec) outperform both MC and RNN-based models on both datasets. This is attributed to the effectiveness of SA mechanisms in capturing short-term and long-term relationships. Although the CatGCN model optimizes node representations through feature interaction modeling, it does not effectively utilize temporal nodes. CTGNN employs a simplistic temporal decay activation function, which inadequately leverages long-term and short-term temporal information. The experimental results for TAT4SRec were reproduced using the code provided by the original authors.

We can observe that our proposed TLSTSRec outperforms the eleven competing models across the three standard ranking metrics, with particularly strong performance on the Userbehavior dataset. This superiority stems from the specially designed encoder and decoder structure, as well as the incorporation of personalized long-term and short-term time information. Additionally, for dense datasets, the window function-based embedding module enables precise time modeling. The limited improvement observed on the Steam dataset may be attributed to the excessively sparse nature of the temporal data, which constrains the effectiveness of the personalized time embedding module. This sparsity remains a significant challenge in the field of sequential recommendation.

Secondly, Fig. 4 illustrates the impact of a key hyperparameter – the dimension size. We can see TLSTSRec consistently outperforms other models on the Userbehavior dataset as the dimension varies. When the dimension exceeds 30, the TLSTSRec model surpasses all other models on the Steam dataset. The impact of dimensions on SA/RNN-based models is more pronounced in the Userbehavior dataset, suggesting that neural network models may be more susceptible to dimension changes in datasets with dense user behavior.

Figure 4.

Models’ performance (NDCG@10) under different dimension size.

4.3. Ablation study (RQ2)

The first major contribution of TLSTSRec is a novel window function-based embedding module, designed to transform timestamps into time embedding matrices. The second contribution is a personalized long-short-term time embedding structure for integrating time information. Consequently, we conducted a detailed ablation study to understand the impact of these two crucial components in our model.

4.3.1. Influence of window function-based time module

Our proposed model, TLSTSRec, employs a novel personalized time embedding module based on a window function to handle timestamps. This ensures that similar time intervals are transformed into similar embedding matrices. To assess the influence of the window function-based time module, we conducted experiments by removing the window function from the time embedding layer. Removing the window function reduces the accuracy of the computed time embedding vectors, as similar timestamps can no longer be transformed into similar embedding vectors. As shown in Fig. 5, without the window function, the performance consistently decreases, indicating that maintaining the continuity of timestamps contributes to capturing temporal information.

Figure 5.

Impact of window function-based embedding.

4.3.2. Influence of merging timestamp information

To understand the influence of the proposed merging timestamp information in the encoder-decoder structure, we replaced the output of the encoder block $O_{E} \in R^{L \times d}$ with a new matrix filled with a constant value of 1. This new matrix served as queries and keys for the decoder part. Therefore, the decoder layer simply integrates item information.

As depicted in Fig. 6, the model’s performance declines significantly when the merged timestamp information is removed, indicating that timestamp information contains crucial user interaction features. It helps the model represent users more accurately, especially given the proposed encoder-decoder structure, which effectively integrates time and interaction information.

Figure 6.

Impact of merging long short-term time information.

4.4. Hyperparameter study (RQ3)

4.4.1. Number of bins

Table 3 illustrates the impact of the number of bins for timestamps on both datasets. A larger number of bins implies that the temporal embedding layer can employ a more complex trainable embedding matrix $M_{T} \in R^{k \times d}$ to capture the temporal information inherent in timestamps.

From the data presented in Table 3, it can be observed that the number of bins for timestamps has a minimal impact on the performance of TLSTSRec on Steam, consistently exhibiting good performance. However, in the Userbehavior dataset, a smaller number of bins leads to a decrease in performance. This decrease may be attributed to the more frequent item interactions in the Userbehavior dataset, necessitating a larger embedding matrix $M_{T}$ to capture complex user behavior patterns. A larger number of bins enables the utilization of a more complex embedding matrix $M_{T} \in R^{k \times d}$ for timestamp modeling.

Table 3
The impact of the number of bins is depicted, with the optimal outcomes highlighted in bold, while the second-best outcomes are underscored.

	Userbehavior		Steam
Number of Bins	NDCG@5	Hit@5	NDCG@5	Hit@5
256	0.6824	0.7604	0.6192	0.7703
512	0.6901	0.7587	0.6221	0.6980
1024	0.6978	0.7693	0.6199	0.7707
2048	0.6910	0.7623	0.6213	0.7715

4.4.2. Dropout rate

Dropout [38] has proven to be a successful method in mitigating overfitting issues in deep learning models. Hence, we employ varying dropout rates to observe their impact on experimental performance. As depicted in Table 4, we observe a more pronounced variation in model performance with changes in the dropout rate for Userbehavior. This observation may be attributed to the fact that more intensive user behavior is more susceptible to the effects of dropout compared to Steam.

Table 4
The impact of dropout is depicted, with the optimal outcomes highlighted in bold, while the second-best outcomes are underscored.

	Userbehavior		Steam
Dropout Rate	NDCG@5	Hit@5	NDCG@5	Hit@5
0.2	0.6724	0.7404	0.6199	0.7705
0.3	0.6865	0.7587	0.6156	0.7685
0.4	0.6978	0.7693	0.6180	0.7660

4.5. Efficiency analysis (RQ4)

Because TLSTSRec uses an encoder-decoder structure to capture user preferences from timestamps and items, a natural question is whether it remains computationally efficient. Therefore, we compared the training time per epoch and the total training time to convergence on the Steam dataset for TLSTSRec and the SA-based models SASRec, FDSA, TiSASRec, and TAT4SRec. We set the dimension of these five models to 50, with a learning rate of 0.001. From Fig. 7a, SASRec achieves the fastest training speed with its simplest model structure. It is known that SASRec exhibits significant improvements in efficiency compared to existing RNN/MC-based models, primarily due to the parallelizability of the self-attention mechanism. From Fig. 7b, it can be observed that TLSTSRec converges only slightly slower than SASRec and is marginally faster than TAT4SRec and TiSASRec. The comparable convergence speed between TLSTSRec and TAT4SRec is attributed to their similar utilization of temporal information. From the above results, it can be observed that our proposed TLSTSRec has the second fastest training speed, which makes TLSTSRec feasible for practical application scenarios.

Figure 7.

The comparison of training speeds per epoch (a) and total training time to convergence (b) among SA-based models.

Figure 8.

Comparison of personalized timestamp embedding layer without and with window by visualization.

4.6. Visualization (RQ5)

As described in Section 3.3, the personalized timestamp embedding layer can transform similar timestamps into similar time embedding vectors using an embedding method based on a window function, which is beneficial for mining the potential similarity in time information. To evaluate the effectiveness of the personalized timestamp embedding layer and to observe how the embeddings change with and without the window function, we selected five timestamps: 1 minute, 15 minutes, 30 minutes, 10 hours, and 72 hours. Then, we conducted experiments using these two different embedding modules and transformed these five timestamps into time embedding vectors. Subsequently, we used t-SNE [39] to project the time embedding vectors into a two-dimensional space. Figure 8 illustrates the distribution of the five timestamp vectors under the two embedding methods. The five timestamps exhibit an irregular distribution when the window function is removed. However, when the window function is applied, the closely related timestamps (1 minute, 15 minutes, and 30 minutes) exhibit a clustered distribution, indicating their similarity in time information, while timestamps with large time spans (10 hours and 72 hours) are positioned far apart, suggesting that they represent distinct time information.

5. Conclusion

To address the challenges posed by the limitations of modeling information and the variability of user preferences in sequential recommendation systems, we propose a novel time-aware Transformer model. In the real world, user interests and item popularity frequently change [11]. Our model combines long-term persistent preferences with short-term immediate interests, adjusting the personalized trade-offs between long-term and short-term sequences through adaptive fusion layers. This configuration enhances the model’s ability to comprehensively and effectively capture information. As demonstrated in Section 4.2.1 and Table 2, our model outperforms eleven competing models across three standard ranking metrics. We also conducted experiments to assess the impact of the critical hyperparameter dimension size on recommendation performance. Figure 4 demonstrates that our model exhibits distinct advantages across all dimensions on the Userbehavior dataset, which is characterized by its rich temporal information. This finding further validates the adaptive fusion layer’s capability to model temporal information effectively. Additionally, we leverage temporal information to address the limitations of modeling information by employing a trainable timestamp matrix. This matrix models temporal information by incorporating both time duration and spectrum. This multidimensional temporal information modeling approach allows for more effective adaptation to the continuously evolving patterns of user behavior over time. The effectiveness and interpretability of our model are further confirmed through ablation studies in Section 4.3 and visualization experiments in Section 4.6, where we analyze the visual distribution of time-embedding vectors.

Despite the excellent performance of our proposed time-aware Transformer model across various aspects, it still has certain limitations. Through a comprehensive analysis of these limitations, we aim to provide insights for future research improvements. The effectiveness of the TLSTSRec model is influenced by the density of the temporal data. In datasets with sparse temporal interactions, such as the Steam dataset, the performance improvements offered by TLSTSRec are constrained. Sparse datasets fail to provide sufficient information, thereby impeding the module’s ability to accurately capture and model user interactions.

In future work, we aim to explore the integration of more comprehensive information, including item categories and user personal information such as age and gender. This auxiliary information can further enhance prediction performance and reduce the model’s excessive reliance on temporal data. Furthermore, as demonstrated in Section 4.5, our model exhibits commendable computational efficiency. Consequently, future research will investigate the application of TLSTSRec in various online scenarios. This includes exploring music recommendation by combining user interaction timestamps with music category information and conducting e-commerce recommendations (such as those on Taobao and Amazon) by integrating implicit user interactions, item categories, and user personal information including age and gender. This exploration promises to be both interesting and meaningful.

Appendix

See Table 5.

References

1. Lin, Yang, Zeng, Liu, Context-aware reinforcement learning for course recommendation, Applied Soft Computing 125 (2022), 109189.

2. Yang, Chen, Kang, Memory-aware gated factorization machine for top-N recommendation, Knowledge-Based Systems 201 (2020), 106048.

3. Wang, Zhang, Wang, Aggarwal, Sequential/Session-based Recommendations: Challenges, Approaches, Applications and Opportunities, in: Proceedings of the 45th International ACM SIGIR Conference on Research and Development in Information Retrieval, 2022, pp. 3425–3428.

4. Zhang, Wang, GCRec: Graph-augmented capsule network for next-item recommendation, IEEE Transactions on Neural Networks and Learning Systems, 2022.

5. Fang, Zhang, Shu, Guo, Deep learning for sequential recommendation: Algorithms, influential factors, and evaluations, ACM Transactions on Information Systems (TOIS) 39(1) (2020), 1–42.

6. Wan, Wang, Visual content-enhanced sequential recommendation with feature-level attention, Neurocomputing 443 (2021), 262–271.

7. Zhang, Kim, A Survey on Incremental Update for Neural Recommender Systems, arXiv preprint arXiv:2303.02851, 2023.

8. Hidasi, Karatzoglou, Recurrent neural networks with top-k gains for session-based recommendations, in: Proceedings of the 27th ACM International Conference on Information and Knowledge Management, 2018, pp. 843–852.

9. Kang W.-C., McAuley, Self-attentive sequential recommendation, in: 2018 IEEE International Conference on Data Mining (ICDM), IEEE, 2018, pp. 197–206.

10. Zhang, Zhao, Liu, Sheng V.S., Wang, Liu, Zhou et al., Feature-level Deeper Self-Attention Network for Sequential Recommendation, in: IJCAI, 2019, pp. 4320–4326.

11. Xie, Wang, Zou, Xia, Lin, Long short-term temporal meta-learning in online recommendation, in: Proceedings of the Fifteenth ACM International Conference on Web Search and Data Mining, 2022, pp. 1168–1176.

12. Chen, Huang, Zhang, Xing, Dai, Joint modeling of local and global behavior dynamics for session-based recommendation, in: ECAI 2020: 24th European Conference on Artificial Intelligence 29 August–8 September 2020, Santiago de Compostela, Spain, IOS Press, 2020.

13. Zhang, Yang, Liu, A time-aware self-attention based neural network model for sequential recommendation, Applied Soft Computing 133 (2023), 109894.

14. Wang, McAuley, Time interval aware self-attention for sequential recommendation, in: Proceedings of the 13th International Conference on Web Search and Data Mining, 2020, pp. 322–330.

15. Yang, Zhou, Liu, King, Hicf: Hyperbolic informative collaborative filtering, in: Proceedings of the 28th ACM SIGKDD Conference on Knowledge Discovery and Data Mining, 2022, pp. 2212–2221.

16. Wang, Zhang, Chen, Liu, Towards representation alignment and uniformity in collaborative filtering, in: Proceedings of the 28th ACM SIGKDD Conference on Knowledge Discovery and Data Mining, 2022, pp. 1816–1825.

17. McAuley, Fusing similarity models with markov chains for sparse sequential recommendation, in: 2016 IEEE 16th International Conference on Data Mining (ICDM), IEEE, 2016, pp. 191–200.

18. Cui, Liu, Zhong, Wang, MV-RNN: A multi-view recurrent neural network for sequential recommendation, IEEE Transactions on Knowledge and Data Engineering 32 (2018), 317–331.

19. Luo, Chen, Cheng, Dong, Feng, Metaselector: Meta-learning for recommendation with user-level adaptive model selection, in: Proceedings of The Web Conference 2020, 2020, pp. 2507–2513.

20. Tang, Wang, Personalized top-n sequential recommendation via convolutional sequence embedding, in: Proceedings of the Eleventh ACM International Conference on Web Search and Data Mining, 2018, pp. 565–573.

21. Rong, Cheng, Meng, Huang, Semi-supervised graph classification: A hierarchical graph perspective, in: The World Wide Web Conference, 2019, pp. 972–982.

22. Cao, Lin, Guo, Liu, Wang, Bipartite graph embedding via mutual information maximization, in: Proceedings of the 14th ACM International Conference on Web Search and Data Mining, 2021, pp. 635–643.

23. Chen, Feng, Wang, Song, Ling, Zhang, Catgcn: Graph convolutional networks with categorical node features, IEEE Transactions on Knowledge and Data Engineering 35(4) (2021), 3500–3511.

24. Hao, Zhao, Liu, Xian, Zhao, Sheng V.S., Multi-dimensional graph neural network for sequential recommendation, Pattern Recognition 139 (2023), 109504.

25. Vaswani, Shazeer, Parmar, Uszkoreit, Jones, Gomez A.N., Kaiser Ł., Polosukhin, Attention is all you need, Advances in Neural Information Processing Systems 30 (2017).

26. Guo M.-H., T.-X., Liu J.-J., Liu Z.-N., Jiang P.-T., T.-J., Zhang S.-H., Martin R.R., Cheng M.-M., S.-M., Attention mechanisms in computer vision: A survey, Computational Visual Media, 2022, 1–38.

27. Lupo, Dinarelli, Besacier, Divide and rule: Effective pre-training for context-aware multi-encoder translation models, arXiv preprint arXiv:2103.17151, 2021.

28. Jiang, Zhang, Luo, Kim J.B., Zhang, Wang, Xie, Kim, AdaMCT: adaptive mixture of CNN-transformer for sequential recommendation, in: Proceedings of the 32nd ACM International Conference on Information and Knowledge Management, 2023, pp. 976–986.

29. Lin, Pan, Ming, FISSA: Fusing item similarity models with self-attention networks for sequential recommendation, in: Proceedings of the 14th ACM Conference on Recommender Systems, 2020, pp. 130–139.

30. Chen, Liu, Lei, Zha Z.-J., Xiong, Bert4sessrec: Content-based video relevance prediction with bidirectional encoder representations from transformer, in: Proceedings of the 27th ACM International Conference on Multimedia, 2019, pp. 2597–2601.

31. Shazeer, Glu variants improve transformer, arXiv preprint arXiv:2002.05202, 2020.

32. Kingma D.P., Adam: A method for stochastic optimization, arXiv preprint arXiv:1412.6980, 2014.

33. Huang X.S., Perez, Volkovs, Improving transformer optimization through better initialization, in: International Conference on Machine Learning, PMLR, 2020, pp. 4475–4483.

34. Cai, Wang, Déjà vu: A contextualized temporal attention mechanism for sequential recommendation, in: Proceedings of The Web Conference 2020, 2020, pp. 2199–2209.

35. Zhu, Zhang, Gai, Learning tree-based deep model for recommender systems, in: Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, 2018, pp. 1079–1088.

36. Kang W.-C., McAuley, Translation-based recommendation, in: Proceedings of the Eleventh ACM Conference on Recommender Systems, 2017, pp. 161–169.

37. Hidasi, Karatzoglou, Baltrunas, Tikk, Session-based recommendations with recurrent neural networks, arXiv preprint arXiv:1511.06939, 2015.

38. Srivastava, Hinton, Krizhevsky, Sutskever, Salakhutdinov, Dropout: A simple way to prevent neural networks from overfitting, The Journal of Machine Learning Research 15 (2014), 1929–1958.

39. Rauber P.E., Falcao A.X., Telea A.C. et al., Visualizing Time-Dependent Data Using Dynamic t-SNE, 2016.
