Personalized online shopping recommendation algorithm and practice based on deep Q-network

Abstract

With the rapid development of e-commerce and the increasing diversity of user demands, personalized recommendations on online shopping platforms have become an important direction of research in the field of e-commerce, which is extremely important for promoting the intelligent transformation of the e-commerce industry. Personalized recommendations on online shopping platforms directly promote the growth of e-commerce sales by improving user experience and satisfaction; at the same time, through the application of intelligent recommendation technology, it accelerates the technological innovation of the e-commerce industry and the changes in the competitive landscape. However, current personalized recommendation algorithms still have problems such as poor algorithm performance, long iteration time, and low recommendation precision. To better achieve personalized recommendations for online shopping, this article applies deep Q-network (DQN) to explore personalized recommendation algorithms in depth. In the article, data is first collected and preprocessed through data augmentation, and based on the deep Q-network, a user–product interaction matrix is constructed. Afterward, based on user behavior characteristic data, user behavior modeling is implemented. Then a DQN-based learning architecture is constructed by combining long-term interests, short-term interests, and prediction modules, and a small batch gradient descent method is used to train the function. Finally, this article evaluates the performance of the algorithm. The research results show that when the length of the recommendation list is 1, the accuracy, recall, and mean average precision of the DQN algorithm are 0.924, 0.875, and 0.901, respectively. The recommendation algorithm based on DQN achieves good results in indicators of recommendation accuracy, recall, and mean average precision. 
In this article, DQN is applied, and the online shopping recommendation system is optimized by combining deep learning and reinforcement learning, significantly improving the quality and efficiency of recommendations and promoting the intelligent development of the e-commerce industry.

Keywords

personalized recommendation algorithm, deep Q-network, online shopping recommendation, user behavior modeling, data augmentation

Introduction

In today’s digital age, personalized recommendation systems are crucial for online shopping platforms, greatly enhancing user experience and commercial value.^1,2 However, traditional recommendation algorithms face many challenges. Issues such as data sparsity, cold start, and user preference dynamics limit the recommendation effect and system response speed. To address these challenges, this article uses an innovative solution: a personalized recommendation algorithm based on deep Q-network (DQN). The DQN algorithm can make reasonable recommendations in sparse data and cold start situations by simulating the environment and reward mechanism.^3,4 It utilizes its self-learning and exploratory abilities to provide precise recommendations for new users and products even with little or no historical interaction data.^5,6 In addition, the DQN algorithm responds to changes in user behavior in real-time through continuous interaction and feedback loops, improving the timeliness of recommendations and personalized matching. With the powerful expressive power of deep learning, DQN can deeply explore the deep patterns of user behavior, construct complex user preference models, and provide users with a more personalized shopping experience. Meanwhile, the DQN algorithm utilizes an efficient parallel computing framework to significantly enhance big data processing capabilities, ensuring fast response and low latency of recommendation systems. This article hopes that through in-depth research on personalized recommendation algorithms, it can provide direction for the future development of personalized recommendation technology and promote the intelligent transformation of the e-commerce industry.

This article optimizes the personalized recommendation algorithm for online shopping by applying DQN. This algorithm utilizes deep learning techniques to mine user behavior patterns, construct complex preference models, and combines parallel computing to improve data processing efficiency and ensure fast response of recommendation systems. The contribution of this article lies in the successful application of DQN to e-commerce recommendation systems, overcoming multiple limitations of traditional algorithms, especially in dynamic environment learning and large-scale data processing. Compared with traditional algorithms, DQN still iterates efficiently in scenarios where the length of the recommendation list changes, surpassing other recommendation algorithms. The empirical analysis verifies the superiority of DQN on the JD dataset, reducing training difficulty and overfitting risk, and significantly improving recommendation performance.

The research contributions of this article are:

- The deep Q-network is successfully applied to the e-commerce recommendation system, which addresses the long iteration time and low recommendation accuracy of traditional recommendation algorithms and significantly improves the quality and efficiency of recommendations.

- Using DQN’s strong self-learning and exploration capabilities, it can still provide accurate recommendations even in the case of sparse data and cold start, effectively meeting the recommendation challenges of new users and new products when they join.

- By combining deep learning and reinforcement learning, a recommendation model that can capture users’ long-term and short-term interests at the same time is constructed, and the personalization level and real-time responsiveness of recommendations are improved.

- It realizes an efficient parallel computing framework, optimizes the big data processing process, and ensures the fast response and low latency of the recommendation system, providing strong technical support for the intelligent transformation of the e-commerce industry.

Related work

Personalized recommendation systems analyze user behavior and preferences to provide customized recommendations, enhance user experience and satisfaction, and improve enterprise efficiency and profitability, making them a key technology for precise marketing.^7,8 In recent years, both academia and industry have made significant progress in research on personalized recommendation systems. Personalized recommendation systems are one of the important methods for addressing big data overload, but traditional recommendation services face severe challenges. Cao Bin constructed a multi-objective recommendation model and proposed a distributed parallel evolutionary algorithm using non-dominated sorting and crowding distance to optimize the model. Compared with advanced algorithms, this algorithm performed well and was very efficient.⁹ Recommendation systems should also respond accurately and rapidly to changes over time. Cui Zhihua proposed a new recommendation model based on temporal correlation coefficient. Through experiments on two real datasets, MovieLens and Douban, it was found that the proposed model improved precision and was effective for application in Internet of Things (IoT) scenarios.¹⁰ Traditional personalized recommendation systems based on collaborative filtering are prone to time complexity issues. Zhang Jiang proposed a simple and efficient recommendation algorithm that could significantly reduce time complexity while achieving good recommendation performance. The feasibility and accuracy of the system were evaluated through real data.¹¹ Unlike most personalized recommendation systems, the model proposed by Nitu Paromita considered users’ latest interests by incorporating time-sensitive recency weights into the model. The performance of the proposed model was superior to existing personalized interest location recommendation models, with overall high accuracy.¹² The methods discussed by these scholars have improved recommendation effectiveness to some extent, but still face challenges such as insufficient dynamic response capability and limited personalization.

Deep Q-networks have gradually gained attention by simulating user system interaction, dynamically adjusting recommendation strategies, and improving the real-time response capability of recommendation systems.^13,14 The goal of a recommendation system is to accurately and actively provide users with potentially interesting items. Lei Yu developed a social attention deep Q-network based on individual user and social neighbor preferences, and used the social attention layer to model the influence between the two to optimize the recommendation system. The proposed social attention deep Q-network significantly outperformed the advanced deep reinforcement learning agents at reasonable computational costs.¹⁵ Personalized recommendation systems have been widely applied in the field of e-learning to adapt to each learner’s own learning speed. Tan Chunxi proposed a new adaptive recommendation strategy under reinforcement learning framework, which was implemented by deep Q-learning algorithm. It could handle lost data properly and handle more individual specific functions to obtain better suggestions.¹⁶ The research methods of the above-mentioned scholars still have shortcomings in dealing with large-scale data and complex user behavior patterns.

Personalized recommendation systems have achieved significant results in improving recommendation accuracy and efficiency, but deep learning still has limitations in capturing dynamic user preferences and mining user behavior patterns, and large-scale data processing capabilities need to be strengthened. Although the research on deep Q-networks and recommendation systems has improved the personalization and adaptability of recommendations, it still faces challenges in terms of data processing efficiency and model interpretability. This article aims to integrate existing advantages and utilize a personalized online shopping recommendation algorithm based on deep Q-networks, emphasizing rapid learning of user preference changes, and combining distributed computing to optimize large-scale data processing capabilities, so as to achieve more efficient and personalized recommendations. The model interpretability is also explored to optimize precision marketing and services in the e-commerce field.

Implementation of personalized recommendation algorithm with deep Q-network

Data preprocessing

Data collection

Online shopping is the most widely used field for personalized recommendation systems.^17,18 In e-commerce, due to the large variety and quantity of products, consumers often have to browse through a lot of useless information before finding the products they want, which can easily affect their purchasing enthusiasm. At the same time, when the purpose of online shopping for netizens is unclear, recommendation systems can capture different types of user needs and take corresponding actions to motivate users to make corresponding choices, thereby achieving the goal of selling products.^19,20 This means that the recommendation system can accurately match user interests and product information, effectively reduce users’ troubles with a large amount of irrelevant information, and improve the shopping experience; at the same time, it also stimulates users’ desire to buy through personalized recommendations and promotes commodity sales.

The collection and processing of user behavior information is an important component of the entire recommendation system and directly affects the depth of subsequent work and the accuracy of results.^21,22 This article takes a large e-commerce platform as the research object and selects the browsing, adding to favorites, adding to shopping cart, payment, and other behavior records of more than 10,000 random users on the platform in 2023 as the raw data. Each raw user behavior record includes the user ID, product ID, behavior type, timestamp, and a random code generated by a hash operation. For user privacy protection and compliance with data usage standards, personal information of users is anonymized before feature extraction is performed. The product ID is retained so that each record can be associated with product attribute information. The basic information about the collected data is shown in Table 1:

Table 1.

Basic information of collected data.

Serial number	User ID	Product ID	Behavior type	Time stamp	Product category	Product price (yuan)
1	U001	P001	Browsing	2023-07-01 08:00:00	Electronic product	9999.00
2	U001	P002	Adding to favorites	2023-07-02 14:30:00	Articles for daily use	299.00
3	U002	P003	Adding to shopping cart	2023-07-03 11:15:00	Clothes & accessories	499.50
4	U003	P004	Payment	2023-07-04 01:00:00	Food	59.99
5	U001	P005	Payment	2023-07-05 22:00:00	Electronic product	1299.00
6	U004	P006	Browsing	2023-07-06 10:00:00	Home furnishing	799.99
7	U002	P007	Browsing	2023-07-07 13:45:00	Books	99.99
8	U005	P008	Adding to shopping cart	2023-07-08 09:15:00	Beauty makeup	199.00
9	U001	P009	Browsing	2023-07-09 21:00:00	Sports equipment	149.99
10	U003	P010	Payment	2023-07-10 07:30:00	Electronic accessories	39.99
11	U002	P003	Browsing	2023-07-11 15:15:00	Clothes & accessories	499.50
12	U004	P006	Adding to favorites	2023-07-12 20:00:00	Home furnishing	799.99
13	U005	P008	Payment	2023-07-13 12:45:00	Beauty makeup	199.00
14	U006	P001	Browsing	2023-07-14 02:15:00	Electronic product	9999.00

After the raw log data is collected, it is cleaned, filled, sorted, and converted to a consistent format. Records with unreasonable timestamps, missing product IDs, or invalid behavior types are excluded. On this basis, the product information in the log is supplemented from the product attribute database, and a user–product interaction matrix is established. The data set is sorted in chronological order to keep the time intervals consistent for model training and evaluation. The processed data is saved in a database and backed up for use during modeling. After collection and processing, this article obtains interaction records covering over 10,000 users and nearly 30,000 products across 365 days. These seemingly simple behavioral records reflect deep-seated behavioral trends such as users’ consumption interests, preferences, and social intentions. The processed data provides reliable support for in-depth mining of user behavior characteristics and for establishing behavior prediction models.

Data augmentation processing

In practical recommendations, a small sample size often causes network overfitting, which in turn affects the precision of the recommendation results. Therefore, this article studies data augmentation strategies from two aspects, information “exploration” and information “utilization,” and on this basis explores a spatiotemporal attribute enhancement method based on the bidirectional DTW (Dynamic Time Warping) algorithm, which can serve users lacking historical data and prevent overfitting caused by data factors. The bidirectional DTW algorithm is an improved time series similarity calculation method. It not only finds the shortest matching path between the two sequences but also calculates the maximum distance of each path, so as to standardize the cumulative distance, enhance the spatiotemporal properties of the data, and reduce overfitting.

Although the conventional DTW algorithm can measure the similarity between two different time series in two dimensions, the accumulated distance it produces is usually strongly affected by large amplitudes.^23,24 Therefore, this article uses a bidirectional DTW algorithm to calculate the similarity between different users. Compared with the traditional DTW algorithm, the bidirectional DTW algorithm calculates not only the shortest path but also the maximum distance of each warping path, which standardizes the cumulative distance. The shortest warping distance can be calculated according to Formula (1)

h(d, g) = h(m_{g}, n_{g}) = h(m, n) = \min \begin{cases} h(m - 1, n) + c(m, n) \\ h(m - 1, n - 1) + 2 c(m, n) \\ h(m, n - 1) + c(m, n) \end{cases}

(1)

Formula (2) can be obtained from Formula (1)

H(d, g) = H(m_{g}, n_{g}) = H(m, n) = \max \begin{cases} h(m - 1, n) + c(m, n) \\ h(m - 1, n - 1) + 2 c(m, n) \\ h(m, n - 1) + c(m, n) \end{cases}

(2)

Using the above two methods, the cumulative distance between a set of template sequences and a set of test sequences is calculated, and the formulas are

c(m, n) = \mathrm{dist}(a, b) + \min(c(m - 1, n - 1), c(m - 1, n), c(m, n - 1))

(3)

c'(m, n) = \mathrm{dist}(a, b) + \max(c(m - 1, n - 1), c(m - 1, n), c(m, n - 1))

(4)

Among them, $c(m, n)$ is the distance between the template sequence and the test sequence, and $\mathrm{dist}(a, b)$ is the distance between two elements.

The formula for calculating the similarity between the template sequence and the test sequence can be expressed as

similarity = \frac{c (m, n)}{c^{'} (m, n)}

(5)
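As an illustration of Formulas (3) to (5), the following minimal Python sketch fills the cumulative table $c$ and its max-based counterpart $c'$ and returns their ratio at the final cell. It is a sketch under stated assumptions, not the implementation used in this article: the sequences are one-dimensional and non-empty, $\mathrm{dist}(a, b)$ is taken to be the absolute difference, Formula (4) is read literally (the maximum is taken over neighbouring entries of $c$), and the function name `bidirectional_dtw_ratio` is hypothetical.

```python
import numpy as np

def bidirectional_dtw_ratio(a, b):
    """Ratio c(m, n) / c'(m, n) from Formulas (3)-(5) for two 1-D sequences."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    m, n = len(a), len(b)
    c = np.zeros((m, n))        # Formula (3): min-based cumulative distance
    c_prime = np.zeros((m, n))  # Formula (4): max-based counterpart
    for i in range(m):
        for j in range(n):
            dist = abs(a[i] - b[j])  # assumed element distance
            # Neighbouring cells of c that exist at this grid position.
            prev = [c[p, q]
                    for p, q in ((i - 1, j - 1), (i - 1, j), (i, j - 1))
                    if p >= 0 and q >= 0]
            c[i, j] = dist + (min(prev) if prev else 0.0)
            c_prime[i, j] = dist + (max(prev) if prev else 0.0)
    if c_prime[-1, -1] == 0.0:
        return 0.0  # both tables are zero, e.g. identical constant sequences
    return c[-1, -1] / c_prime[-1, -1]

ratio_same = bidirectional_dtw_ratio([1, 2, 3], [1, 2, 3])
ratio_reversed = bidirectional_dtw_ratio([0, 1, 2], [2, 1, 0])
```

Under these assumptions the ratio lies between 0 and 1, and two identical sequences give 0, so in this sketch smaller values indicate closer sequences.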

The bidirectional DTW algorithm can reflect the similarity between two users as a whole, but relying only on the bidirectional DTW algorithm to supplement data inevitably introduces excessive noise and reduces recommendation accuracy when user data is sparse. Therefore, from the perspective of information utilization, users with sparse features are supplemented using the idea of the longest common point sequence algorithm. However, calculating the similarity between users with the longest common point sequence often consumes a lot of time and resources when the data scale is large. Therefore, this article incorporates the spatiotemporal characteristics of users and preserves the temporal characteristics of the raw data as much as possible. Finally, this article compares the effectiveness of three data augmentation methods: information “exploration,” information “utilization,” and the bidirectional DTW algorithm. The specific results are shown in Figure 1:

Figure 1.

Comparison of the effectiveness of three data augmentation methods.

Figure 1 presents the relationship between time steps and data averages, comparing the effectiveness of different data augmentation strategies in processing raw data. In the information “exploration” strategy, the random noise is added to the augmented data, and the overall waveform remains sinusoidal, with peak and valley values of approximately 1 and -1, respectively. This method improves the generalization ability of the model by increasing data diversity. In the information “utilization” strategy, the augmented data is adjusted for amplitude by multiplying it with a random factor, while the overall waveform remains unchanged. This preserves the basic pattern of the data while introducing new information, which makes the data more representative. The augmentation effect of the bidirectional DTW algorithm shows that the raw data fluctuates greatly, but the augmented data matched by DTW aligns better in spatiotemporal attributes, reduces fluctuations, and is smoother. This indicates that the DTW algorithm can better preserve the spatiotemporal correlation of data, improve temporal consistency and stability, reflect the continuity and correlation of user behavior, and enhance recommendation accuracy and user satisfaction. In general, these data enhancement strategies significantly improve the effectiveness of personalized recommendation algorithms by optimizing training data.
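The two simpler strategies described for Figure 1 can be sketched in a few lines of Python. This is only an illustration under assumptions: the raw series is a synthetic sine wave, the noise is Gaussian, the amplitude factor is a single scalar drawn uniformly from a small range, and the function names and default parameters are hypothetical rather than taken from this article.

```python
import numpy as np

def augment_exploration(series, noise_std=0.1, seed=0):
    """Information "exploration": add random noise to every time step."""
    rng = np.random.default_rng(seed)
    return series + rng.normal(0.0, noise_std, size=series.shape)

def augment_utilization(series, low=0.8, high=1.2, seed=0):
    """Information "utilization": rescale the amplitude by one random factor."""
    rng = np.random.default_rng(seed)
    return series * rng.uniform(low, high)

steps = np.linspace(0.0, 2.0 * np.pi, 100)
raw = np.sin(steps)                 # synthetic stand-in for the raw data
noisy = augment_exploration(raw)    # same waveform with added noise
scaled = augment_utilization(raw)   # same waveform with a changed amplitude
```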

Construction of a user-product interaction matrix

This article is based on a deep Q-network and establishes an interaction matrix over a set $V = \{v_{1}, v_{2}, v_{3}, \dots, v_{10281}\}$ of 10,281 users and a set $W = \{w_{1}, w_{2}, w_{3}, \dots, w_{29827}\}$ of 29,827 products. The user–product interaction matrix is determined based on the user’s like and dislike tags for the items mentioned above. For this purpose, a user–product interaction matrix $B \in S^{10281 \times 29827}$ can be defined

b_{vw} = \begin{cases} 1, & (v, w) \text{ has interaction} \\ 0, & (v, w) \text{ no interaction} \end{cases}

(6)

Among them, $b_{vw} = 1$ indicates that a like or dislike relationship exists between the user and the item; otherwise $b_{vw} = 0$. The user’s historical interaction records are mapped to the user–product interaction matrix, providing structured data for model training. The specific content related to the user–product interaction matrix is shown in Table 2:

Table 2.

Relevant content of the user–product interaction matrix.

User ID	Product ID	Type of interaction	Interaction value
U001	P001	Clicking	1
U001	P002	Purchase	3
U002	P001	Browsing	2
U002	P003	Adding to favorites	1
U003	P004	Clicking	3
U003	P002	Adding to shopping cart	1
U004	P003	Purchase	4
U004	P005	Clicking	2
U005	P004	Browsing	1
U005	P005	Evaluating	5
U006	P002	Purchase	1
U006	P006	Adding to favorites	1
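The mapping from interaction records to the binary matrix of Formula (6) can be sketched as follows. This is a minimal illustration, assuming records are available as (user ID, product ID) pairs; the function name and the dense `numpy` array are choices made for the example only, and at the scale of 10,281 by 29,827 a sparse representation would likely be preferable.

```python
import numpy as np

def build_interaction_matrix(records):
    """Build B with b_vw = 1 when user v has interacted with product w."""
    users = sorted({u for u, _ in records})
    items = sorted({w for _, w in records})
    user_index = {u: i for i, u in enumerate(users)}
    item_index = {w: i for i, w in enumerate(items)}
    B = np.zeros((len(users), len(items)), dtype=np.int8)
    for u, w in records:
        B[user_index[u], item_index[w]] = 1  # any interaction sets the entry
    return B, users, items

# Hypothetical records in the style of Table 2.
records = [("U001", "P001"), ("U001", "P002"),
           ("U002", "P001"), ("U002", "P003")]
B, users, items = build_interaction_matrix(records)
```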

User behavior modeling

User behavior characteristics mainly include five types of data: clicking (type_c), adding to shopping cart (type_b), purchasing (type_p), adding to favorites (type_s), and browsing (type_g). Among them, a purchase most clearly indicates that the user likes the product, so it carries the largest weight in the preference, which is 1. Adding to the shopping cart and adding to favorites carry slightly lower weights than purchasing but higher weights than the other behaviors, so these two behaviors are given increased weight in the expression. Clicking and browsing carry the smallest weights. User v’s preference level $lo_{vj}$ for product j is constructed, and the formulas are

lo_{vj} = (\mathrm{type\_c}_{vj} + \mathrm{type\_g}_{vj}) \cdot \exp\left(\frac{\mathrm{type\_b}_{vj} + \mathrm{type\_s}_{vj}}{2} + \mathrm{type\_p}_{vj}\right)

(7)

Lo = \begin{bmatrix} lo_{11} & \dots & lo_{1j} \\ \vdots & \ddots & \vdots \\ lo_{i1} & \dots & lo_{ij} \end{bmatrix}

(8)

Among them, the elements of matrix $Lo$ are $lo_{vj}$; $i$ is the number of users; $j$ is the number of items; and $\mathrm{type\_p}_{vj}$ is the weight of the user’s purchase of goods.

Formula (9) is used to transform the user preference to 0 to 5

t_{v j} = \frac{{l o}_{v j}}{\max (L o)} \times 5

(9)

Among them, $\max(Lo)$ is the maximum value of matrix $Lo$, and $t_{vj}$ is the user’s rating of the item.
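Formulas (7) and (9) can be illustrated with a short Python sketch. It assumes each behavior type is supplied as a non-negative user-by-item array of the same shape; the function names and the toy values are hypothetical and are not data from this article.

```python
import numpy as np

def preference(type_c, type_g, type_b, type_s, type_p):
    """Formula (7): preference level lo_vj from the five behavior arrays."""
    return (type_c + type_g) * np.exp((type_b + type_s) / 2.0 + type_p)

def to_ratings(Lo):
    """Formula (9): rescale preferences to the range 0 to 5."""
    peak = Lo.max()
    return Lo / peak * 5.0 if peak > 0 else np.zeros_like(Lo)

# Toy example: 2 users x 2 items (illustrative values only).
type_c = np.array([[1.0, 0.0], [2.0, 1.0]])   # clicking
type_g = np.array([[1.0, 1.0], [0.0, 0.0]])   # browsing
type_b = np.array([[0.0, 0.0], [1.0, 0.0]])   # adding to shopping cart
type_s = np.zeros((2, 2))                     # adding to favorites
type_p = np.array([[0.0, 0.0], [1.0, 0.0]])   # purchasing

Lo = preference(type_c, type_g, type_b, type_s, type_p)
ratings = to_ratings(Lo)
```

In this toy example the second user’s first item combines a cart addition and a purchase, so it receives the maximum rating of 5 and the remaining entries are scaled relative to it.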

The formula for the user rating matrix is

T = \begin{bmatrix} t_{11} & \dots & t_{1j} \\ \vdots & \ddots & \vdots \\ t_{i1} & \dots & t_{ij} \end{bmatrix}

(10)

The non-negative matrix factorization (NMF) algorithm^25,26 decomposes the matrix T into a user feature matrix Q and a product feature matrix O whose product reproduces it, that is, T = QO. Q and O are computed by minimizing the difference between the rating matrix T and QO. Compared with other collaborative filtering methods, the results obtained by the NMF method are all non-negative.

The objective of non-negative matrix factorization can be expressed as

l = \min \frac{1}{2} \sum_{v, j \in T} {(t_{v, j} - q_{v} \cdot o_{j})}^{2}

(11)

Among them, $t_{v,j}$ is the user’s rating of the item, and $q_{v}$ and $o_{j}$ are the feature vectors of user v and item j, respectively.

When the user feature matrix and product feature matrix are initialized as non-negative, using Formula (11) cannot guarantee that they are always non-negative during iteration, so a learning rate needs to be defined. The formulas are

\begin{cases} \tau_{q} = \frac{q_{vf}}{\sum_{j \in J_{v}} o_{fj} \cdot \hat{t}_{v,j}} \\ \tau_{o} = \frac{o_{fj}}{\sum_{v \in V_{j}} q_{vf} \cdot \hat{t}_{v,j}} \end{cases}

(12)

The learning rate from Formula (12) is substituted into Formula (11) to obtain the iterative expressions for non-negative matrix factorization, which are expressed as

\begin{cases} q_{vf} = q_{vf} \frac{\sum_{j \in J_{v}} o_{fj} \cdot t_{v,j}}{\sum_{j \in J_{v}} o_{fj} \cdot \hat{t}_{v,j}} \\ o_{fj} = o_{fj} \frac{\sum_{v \in V_{j}} q_{vf} \cdot t_{v,j}}{\sum_{v \in V_{j}} q_{vf} \cdot \hat{t}_{v,j}} \end{cases}

(13)

Among them, $q_{vf}$ and $o_{fj}$ are the components of user v and item j on feature f, respectively, and $\hat{t}_{v,j}$ is the estimated rating.
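The multiplicative updates in Formula (13) can be sketched in Python as follows. This is an illustrative sketch rather than this article’s code: the sets $J_{v}$ and $V_{j}$ are represented by a 0/1 mask over observed ratings, a small constant `eps` is added to the denominators to avoid division by zero, and the function names, the number of features, and the toy rating matrix are assumptions.

```python
import numpy as np

def masked_nmf(T, mask, n_features=2, n_iter=200, eps=1e-9, seed=0):
    """Multiplicative NMF updates (Formula 13) restricted to observed ratings."""
    rng = np.random.default_rng(seed)
    Q = rng.random((T.shape[0], n_features)) + 0.1   # non-negative init
    O = rng.random((n_features, T.shape[1])) + 0.1
    observed = mask * T
    for _ in range(n_iter):
        T_hat = mask * (Q @ O)                       # estimated ratings
        Q *= (observed @ O.T) / (T_hat @ O.T + eps)  # update q_vf
        T_hat = mask * (Q @ O)                       # refresh after Q changes
        O *= (Q.T @ observed) / (Q.T @ T_hat + eps)  # update o_fj
    return Q, O

def masked_error(T, mask, Q, O):
    """Objective of Formula (11), summed over observed entries only."""
    return 0.5 * np.sum((mask * (T - Q @ O)) ** 2)

# Toy rating matrix; zeros stand for unobserved ratings (illustrative only).
T_toy = np.array([[5, 3, 0, 1],
                  [4, 0, 0, 1],
                  [1, 1, 0, 5],
                  [1, 0, 0, 4],
                  [0, 1, 5, 4]], dtype=float)
mask_toy = (T_toy > 0).astype(float)
Q_fit, O_fit = masked_nmf(T_toy, mask_toy)
```

Because each factor is only ever multiplied by a ratio of non-negative quantities, Q and O stay non-negative throughout, which is the property the text attributes to Formula (13).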

By iteratively applying Formula (13), both the user feature matrix and the product feature matrix remain non-negative after decomposition. On top of non-negative matrix factorization, bias terms and regularization are added to obtain the objective function. The formula is

l = \min \frac{1}{2} \cdot \sum_{v, j \in T} (t_{v,j} - \varphi - y_{v} - y_{j} - q_{v} \cdot o_{j})^{2} + \gamma_{q} \| Q \|_{G}^{2} + \gamma_{o} \| O \|_{G}^{2}

(14)

Among them, $\gamma$ is the regularization parameter, $y_{v}$ is the bias of user v, and $y_{j}$ is the bias of item j.

The rating matrix of recommendation systems often has sparsity. Although the NMF algorithm can handle this problem, its indicator matrix is both storage intensive and time-consuming, as it needs to be multiplied with other matrices every iteration. To reduce this additional overhead and improve algorithm precision, DQN is applied in this article.

First, the gradient method is used to solve for the user bias and the product bias, and the partial derivatives of the objective function are obtained

\begin{array}{l} \frac{\partial l}{\partial y_{v}} = - \sum_{j \in J_{v}} (t_{v,j} - \varphi - y_{v} - y_{j} - \sum_{f = 1}^{F} q_{vf} \cdot o_{fj}) \\ \frac{\partial l}{\partial y_{j}} = - \sum_{v \in V_{j}} (t_{v,j} - \varphi - y_{v} - y_{j} - \sum_{f = 1}^{F} q_{vf} \cdot o_{fj}) \end{array}

(15)

The gradient descent method is then applied with step size $\tau_{y}$, which gives the iterative formulas for the user and product biases

\begin{array}{l} y_{v} = y_{v} + \tau_{y} \sum_{j \in J_{v}} (t_{v,j} - \varphi - y_{v} - y_{j} - \sum_{f = 1}^{F} q_{vf} \cdot o_{fj}) \\ y_{j} = y_{j} + \tau_{y} \sum_{v \in V_{j}} (t_{v,j} - \varphi - y_{v} - y_{j} - \sum_{f = 1}^{F} q_{vf} \cdot o_{fj}) \end{array}

(16)

Formula (16) is the final iterative formula. After the above iterative processing, the missing items in the user rating matrix T can be efficiently restored. By ranking the ratings of each user in T, their levels of liking for each product can be obtained based on their behavioral information, and their favorite products can be actively recommended to them.
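The bias iteration can be sketched in Python as below. This is a minimal illustration under assumptions: the residual includes both bias terms, following the objective in Formula (14); the constant term in Formula (14) is passed in as `phi`; observed ratings are marked by a 0/1 mask; and the function names, step size, and toy data are hypothetical.

```python
import numpy as np

def bias_step(T, mask, Q, O, phi, y_user, y_item, lr=0.05):
    """One gradient descent step on the user and item biases."""
    residual = mask * (T - phi - y_user[:, None] - y_item[None, :] - Q @ O)
    return (y_user + lr * residual.sum(axis=1),   # sum over items j in J_v
            y_item + lr * residual.sum(axis=0))   # sum over users v in V_j

def fit_biases(T, mask, Q, O, phi, n_iter=500, lr=0.05):
    """Repeat the bias update starting from zero biases."""
    y_user = np.zeros(T.shape[0])
    y_item = np.zeros(T.shape[1])
    for _ in range(n_iter):
        y_user, y_item = bias_step(T, mask, Q, O, phi, y_user, y_item, lr)
    return y_user, y_item

# Toy check: ratings generated purely from a constant plus two bias vectors.
phi_toy = 3.0
a = np.array([0.5, -0.5, 0.0])
b = np.array([1.0, 0.0, -1.0])
T_bias = phi_toy + a[:, None] + b[None, :]
mask_bias = np.ones_like(T_bias)
Q0, O0 = np.zeros((3, 1)), np.zeros((1, 3))       # latent factors switched off
y_user, y_item = fit_biases(T_bias, mask_bias, Q0, O0, phi_toy)
fit_residual = T_bias - phi_toy - y_user[:, None] - y_item[None, :]
```

Once the biases and factors are fitted, a missing entry of T would be estimated as the constant term plus the two biases plus the product of the feature vectors, which is how the ranking described above could be produced.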

Construction and training optimization of DQN

This article models online shopping recommendation as an interaction sequence in which users select and evaluate products, with the goal of maximizing the long-term cumulative reward of recommendations. The model is a five-tuple of state space, action space, reward space, state transition, and decay factor.

State space Z: the state $z_{r} \in Z$ at time r is defined as two sequences consisting of the user’s last M positive and last M negative feedback interaction records before time r. If fewer than M records exist, the sequence is padded with virtual items.

Reward space S: the reward $s_{r} \in S$ at time r is defined as the immediate reward obtained from the user’s feedback at time r.

The personalized recommendation algorithm in this article is based on DQN, a deep reinforcement learning method based on value optimization.^27,28 The learned state-action value function estimates the long-term cumulative reward obtained by the recommendation system for recommending item x in state z. The value function corresponding to the optimal strategy is the optimal value function, which satisfies the Bellman equation

P^{*}(z_{r}, x_{r}) = E_{z_{r+1}} \left[ s_{r} + \beta \max_{x' \in X'} P^{*}(z_{r+1}, x') \mid z_{r}, x_{r} \right]

(17)

Among them, $X'$ is the candidate set for the next recommended item, and $\beta$ is the decay factor.
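For a single observed transition, the quantity inside the expectation of Formula (17) is the usual one-step target. The sketch below shows that computation only; the function name, the default value of the decay factor, and the `terminal` flag for the end of an interaction sequence are assumptions added for illustration.

```python
import numpy as np

def bellman_target(reward, next_q_values, beta=0.9, terminal=False):
    """One-step target s_r + beta * max over candidates of P(z_{r+1}, x')."""
    if terminal or len(next_q_values) == 0:
        return float(reward)  # no future reward to bootstrap from
    return float(reward + beta * np.max(next_q_values))

# Immediate reward 1.0 and three candidate items with estimated values.
target = bellman_target(1.0, [0.2, 0.5, 0.1], beta=0.9)
```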

Because traversing all states and actions is computationally expensive, the value function is approximated with a deep Q-network. The DQN training process is shown in Figure 2:

Figure 2.

Training process of DQN.

The learning architecture of the personalized recommendation algorithm based on DQN in this article consists of three parts: a long-term interest module, a short-term interest module, and a prediction module. The long-term interests of users are learned using collaborative filtering methods. Because the vector representation of a user draws on all interaction history during learning, it can represent the user’s long-term interest preferences. User v and product j obtain a user vector and a product vector through the user embedding layer and the product embedding layer, respectively. Unlike traditional collaborative filtering algorithms, the personalized recommendation algorithm based on DQN concatenates the two vectors and adds multiple hidden layers on top of them to better capture the interaction between user vectors and item vectors. In the short-term interest module, the positive and negative feedback interaction history sequences are processed in the same way. The user’s most recent M positive feedback interaction records $\{j_{1}, j_{2}, \dots, j_{M}\}$ form a positive feedback interaction sequence. To reduce the complexity of the model and the required number of parameters, this article shares one embedding layer between the positive and negative feedback interaction history sequences and the recommended products. The product embedding layer maps the positive feedback interaction sequence to a sequence of product representation vectors.

The DQN architecture in this article uses the last hidden layer vector of GRU (Gated Recurrent Unit) as the vector expression for the positive feedback interaction sequence. The DQN architecture reconstructs the structure of the Q-network to achieve synchronous learning of users’ short-term and long-term interests. DQN uses deep neural networks as the main body of the Q-network, extracting deep features of user behavior data through a multi-layer network structure to more accurately represent user interests.
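The three-module structure can be sketched as a forward pass in plain NumPy. This is an illustrative sketch only, with many assumptions not specified in this article: the layer sizes, the ReLU activations, the use of a separate GRU for the negative sequence, the way the module outputs are concatenated, and all names (`gru_last_hidden`, `init_network`, `q_value`) are hypothetical, and the weights are random rather than trained.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gru_last_hidden(seq, params):
    """Run a GRU over seq (M x d) and return the last hidden state."""
    Wz, Uz, Wr, Ur, Wh, Uh = params
    h = np.zeros(Uz.shape[0])
    for x in seq:
        z = sigmoid(Wz @ x + Uz @ h)              # update gate
        r = sigmoid(Wr @ x + Ur @ h)              # reset gate
        h_tilde = np.tanh(Wh @ x + Uh @ (r * h))  # candidate state
        h = (1.0 - z) * h + z * h_tilde
    return h

def init_gru(rng, d_in, d_hid):
    shapes = [(d_hid, d_in), (d_hid, d_hid)] * 3  # Wz, Uz, Wr, Ur, Wh, Uh
    return [rng.normal(scale=0.1, size=s) for s in shapes]

def init_network(rng, d=8, hid=16, long_dim=16, pred_dim=32):
    return {
        "W_long": rng.normal(scale=0.1, size=(long_dim, 2 * d)),
        "gru_pos": init_gru(rng, d, hid),
        "gru_neg": init_gru(rng, d, hid),
        "W_pred": rng.normal(scale=0.1, size=(pred_dim, long_dim + 2 * hid + d)),
        "w_out": rng.normal(scale=0.1, size=pred_dim),
    }

def q_value(user_vec, item_vec, pos_seq, neg_seq, net):
    """Estimated value of recommending item_vec given the user state."""
    # Long-term interest: concatenated user and item vectors, one hidden layer.
    long_term = np.maximum(0.0, net["W_long"] @ np.concatenate([user_vec, item_vec]))
    # Short-term interest: last GRU hidden state of each feedback sequence.
    pos_h = gru_last_hidden(pos_seq, net["gru_pos"])
    neg_h = gru_last_hidden(neg_seq, net["gru_neg"])
    # Prediction module: combine all features into a single value.
    features = np.concatenate([long_term, pos_h, neg_h, item_vec])
    hidden = np.maximum(0.0, net["W_pred"] @ features)
    return float(net["w_out"] @ hidden)

rng = np.random.default_rng(0)
net = init_network(rng)
item_emb = rng.normal(size=(10, 8))   # one embedding table shared by sequences and candidates
user_vec = rng.normal(size=8)
q = q_value(user_vec, item_emb[5], item_emb[[1, 2, 3]], item_emb[[4, 0, 7]], net)
```

The single `item_emb` table is used both for the feedback sequences and for the candidate item, which mirrors the shared embedding layer described above.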

However, in practice the state and action spaces of a system are often vast, making it difficult to compute the expectation directly. The commonly used method is mini-batch gradient descent: a small batch of state transitions is sampled at random, the expectation is estimated on that batch, and the parameters are optimized using the Adam (Adaptive Moment Estimation) optimizer.^29,30 DQN uses deep neural networks to fit the value function, which can easily lead to model instability. To simplify model training, the user and item embedding layers are initialized with pre-trained vectors and kept fixed, so they are not updated during training. This article uses the probabilistic matrix factorization method and the binary cross-entropy loss function to train the embedding layer weights. Meanwhile, the DQN framework introduces an L2 regularization term in the loss function to limit the model’s expressive power, stabilize training, and prevent overfitting. During training, a recall process is also included to generate a high-quality candidate set of items.
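The combination of mini-batch sampling, an L2-regularized squared loss, and Adam updates can be sketched as follows. To keep the example self-contained it makes strong simplifying assumptions: the value function is linear in precomputed features, the targets are fixed numbers standing in for Bellman targets, and the function names, hyperparameters, and synthetic data are hypothetical.

```python
import numpy as np

def minibatch_loss_and_grad(theta, feats, targets, l2):
    """Mean squared error on a batch plus an L2 penalty, with its gradient."""
    err = feats @ theta - targets
    loss = 0.5 * np.mean(err ** 2) + 0.5 * l2 * np.sum(theta ** 2)
    grad = feats.T @ err / len(targets) + l2 * theta
    return loss, grad

def adam_step(theta, grad, state, lr=1e-2, b1=0.9, b2=0.999, eps=1e-8):
    """One Adam update with bias-corrected moment estimates."""
    state["t"] += 1
    state["m"] = b1 * state["m"] + (1 - b1) * grad
    state["v"] = b2 * state["v"] + (1 - b2) * grad ** 2
    m_hat = state["m"] / (1 - b1 ** state["t"])
    v_hat = state["v"] / (1 - b2 ** state["t"])
    return theta - lr * m_hat / (np.sqrt(v_hat) + eps)

def train(feats, targets, batch_size=32, steps=500, l2=1e-4, seed=0):
    rng = np.random.default_rng(seed)
    theta = np.zeros(feats.shape[1])
    state = {"t": 0, "m": np.zeros_like(theta), "v": np.zeros_like(theta)}
    for _ in range(steps):
        # Randomly sample a small batch of transitions.
        idx = rng.choice(len(targets), size=batch_size, replace=False)
        _, grad = minibatch_loss_and_grad(theta, feats[idx], targets[idx], l2)
        theta = adam_step(theta, grad, state)
    return theta

# Synthetic data standing in for (state-action features, target value) pairs.
data_rng = np.random.default_rng(1)
feats = data_rng.normal(size=(256, 4))
targets = feats @ np.array([1.0, -2.0, 0.5, 3.0])
theta = train(feats, targets)
loss_before, _ = minibatch_loss_and_grad(np.zeros(4), feats, targets, 0.0)
loss_after, _ = minibatch_loss_and_grad(theta, feats, targets, 0.0)
```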

Evaluation of the effectiveness of personalized recommendation algorithms

Experimental design

To verify the effectiveness of the personalized recommendation algorithm based on deep Q-network in this article, the JD dataset with 130 million purchase records is used as the experimental dataset. The dataset is first filtered following the data preprocessing in section 3. The specific process is as follows: first, purchase records are grouped by user to obtain each user’s purchase history. Then, users with fewer than 300 purchase records and products with fewer than 200 purchase records are deleted. Afterward, a user-product matrix is constructed, and if the user purchases a certain product, the corresponding element of the matrix is set to 1. Next, users and products with fewer than 150 purchases are iteratively deleted until no more can be deleted. The final filtered dataset has a user-product matrix sparsity rate of 0.541%.
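The iterative deletion step can be sketched as a loop that repeats until nothing more is removed. The sketch below covers only that final step, with a single threshold applied to both users and products on a small binary matrix; the function name and toy data are hypothetical, and the earlier fixed thresholds would be applied before calling it.

```python
import numpy as np

def iterative_filter(B, min_count):
    """Drop users and products with fewer than min_count purchases until stable."""
    users = np.arange(B.shape[0])
    items = np.arange(B.shape[1])
    while True:
        keep_u = B.sum(axis=1) >= min_count
        keep_i = B.sum(axis=0) >= min_count
        if keep_u.all() and keep_i.all():
            break  # nothing left to delete
        B = B[keep_u][:, keep_i]
        users, items = users[keep_u], items[keep_i]
    return B, users, items

# Toy matrix: the third user and third product each have a single purchase.
B_toy = np.array([[1, 1, 0],
                  [1, 1, 0],
                  [0, 0, 1]])
B_kept, users_kept, items_kept = iterative_filter(B_toy, min_count=2)
```

Repetition matters because deleting a product lowers the purchase counts of the users who bought it, which can push those users below the threshold on the next pass.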

To simulate real online shopping recommendations, this article divides the data into training and testing sets based on user purchase time. Each user’s purchase records are sorted by time, with the earliest 80% forming the training set and the latest 20% forming the testing set. The model is then trained on the training set, and its recommendation performance in a simulated real-world environment is evaluated on the testing set. Summary statistics of the JD dataset are shown in Table 3:

Table 3.

Summary statistics of JD dataset.

Serial number	Statistical indicators	Description
1	Dataset size	Containing 1 million users and nearly 100 million interactive logs
2	Time span	365 days
3	Data preprocessing	Completing data cleaning, completion, sorting, and format conversion
4	User-product interaction type	Browsing, adding to favorites, adding to shopping cart, purchase, and reviewing
5	Data augmentation strategy	Combining information “exploration” and “utilization” strategies, using bidirectional DTW algorithm and the idea of longest common point sequence
6	Training and testing sets partitioning	Sorted by user purchase time, with 80% of the earliest records as the training set and 20% of the latest records as the testing set
7	Feature construction	Constructing a user-product interaction matrix, considering the weight allocation of user behavior
8	Target	Improving the accuracy and real-time performance of personalized recommendations, alleviating data sparsity and cold start issues
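
The time-based partitioning in row 6 of Table 3 can be sketched as below. The function name `chronological_split` and the `(user_id, product_id, timestamp)` record layout are assumptions for illustration; the article does not specify how ties or users with very few records are handled, so this sketch simply truncates the cut point toward the training side.

```python
from collections import defaultdict

def chronological_split(records, train_ratio=0.8):
    """Split each user's records by time: earliest share to train, rest to test.

    `records` is an iterable of (user_id, product_id, timestamp) tuples.
    """
    by_user = defaultdict(list)
    for user, product, ts in records:
        by_user[user].append((ts, product))
    train, test = [], []
    for user, events in by_user.items():
        events.sort(key=lambda e: e[0])  # oldest first
        cut = int(len(events) * train_ratio)
        train += [(user, p, ts) for ts, p in events[:cut]]
        test += [(user, p, ts) for ts, p in events[cut:]]
    return train, test
```

Splitting per user rather than at a single global date keeps every user represented in both sets, at the cost of letting one user's training records postdate another user's test records.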

In the experiment, the DQN algorithm proposed in this article is compared with several baseline methods: the BPR (Bayesian Personalized Ranking) algorithm, the ML (machine learning) algorithm, and the Deep Walk algorithm. This article first compares the recommendation performance of the four algorithms in terms of accuracy, recall, and mean average precision. Then, using root mean squared error as the criterion, parameter sensitivity is evaluated and cross-validation is conducted. Finally, the time required for each iteration is tested over different numbers of iterations.

Because the product catalog is very large, scoring every product for every user is very time-consuming. Therefore, in the testing set, for each product a user purchased, this article randomly selects 50 products that the user did not purchase as negative examples and combines them with the purchased product to form a candidate set for testing. This reduces the computational cost while still allowing the performance of the recommendation system to be evaluated.
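
The candidate-set construction described above can be sketched as follows. The function name `build_candidate_set` and its parameters are hypothetical, and sampling uniformly without replacement from the unpurchased products is an assumption, since the article does not state the sampling distribution.

```python
import random

def build_candidate_set(positive, purchased, catalog, n_negatives=50, rng=None):
    """Combine one held-out purchase with randomly drawn unpurchased products."""
    rng = rng or random.Random()
    # Sort the pool so that a seeded generator gives reproducible samples.
    pool = sorted(set(catalog) - set(purchased) - {positive})
    if len(pool) < n_negatives:
        raise ValueError("not enough unpurchased products to sample from")
    negatives = rng.sample(pool, n_negatives)
    return [positive] + negatives
```

The model then only has to rank 51 items per test purchase instead of the full catalog, which is where the reduction in computation comes from.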

Experimental results

Recommendation effect

Accuracy, recall, and mean average precision (MAP) are key indicators for evaluating recommendation algorithms. Accuracy reflects the proportion of correct items in the recommendation list, which affects user satisfaction. Recall reflects how well the recommendations cover the user’s interests. Mean average precision measures the quality of the recommendation ranking. Together, these three indicators evaluate the performance of a recommendation system, helping researchers and engineers understand the effectiveness of algorithms, optimize models, and improve recommendation quality and user experience. This article tests the algorithms’ accuracy, recall, and mean average precision for recommendation list lengths (k) of 1, 5, and 10. Table 4 shows the results:

Table 4.

Comparison of the accuracy, recall, and mean average precision of the algorithms.

Performance	DQN algorithm	BPR algorithm	ML algorithm	Deep Walk algorithm
Accuracy (k = 1)	0.924	0.704	0.801	0.743
Recall (k = 1)	0.875	0.657	0.779	0.698
MAP (k = 1)	0.901	0.692	0.793	0.712
Accuracy (k = 5)	0.895	0.671	0.743	0.729
Recall (k = 5)	0.842	0.602	0.702	0.694
MAP (k = 5)	0.856	0.635	0.713	0.706
Accuracy (k = 10)	0.863	0.618	0.724	0.678
Recall (k = 10)	0.801	0.569	0.688	0.625
MAP (k = 10)	0.834	0.588	0.696	0.636

According to Table 4, the DQN algorithm proposed in this article outperforms BPR, ML, and Deep Walk on every metric. When the recommendation list length is 1, the DQN algorithm reaches an accuracy of 0.924, while the accuracies of the BPR, ML, and Deep Walk algorithms are 0.704, 0.801, and 0.743, respectively, so DQN leads the best baseline (ML) by 0.123. Its recommendations are therefore highly accurate, which can effectively improve users’ immediate satisfaction. Its recall of 0.875 and mean average precision of 0.901 are also clearly ahead of the baselines, indicating that a single recommendation is not only precise but also covers users’ actual interests well. As the length of the recommendation list increases to 5 and 10, the indicators of the DQN algorithm decrease slightly (at a list length of 10, accuracy drops to 0.863, recall to 0.801, and MAP to 0.834), but it still maintains its leading position. This reflects the stability and effectiveness of the algorithm under different recommendation scales. In contrast, none of the other algorithms matches DQN on any indicator in Table 4, which supports the potential and application value of deep Q-network in the field of personalized recommendation.
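
For reference, the three indicators can be computed per user as in the sketch below. It assumes that "accuracy" means precision at k (the share of correct items in the top-k list, as defined in the text) and that average precision is normalized by min(number of relevant items, k). The article does not state which normalization it uses, and the function names are illustrative.

```python
def precision_at_k(ranked, relevant, k):
    """Fraction of the top-k recommended items that the user actually bought."""
    return sum(1 for item in ranked[:k] if item in relevant) / k

def recall_at_k(ranked, relevant, k):
    """Fraction of the user's relevant items that appear in the top-k list."""
    if not relevant:
        return 0.0
    return sum(1 for item in ranked[:k] if item in relevant) / len(relevant)

def average_precision_at_k(ranked, relevant, k):
    """Sum of precision@i over the ranks i <= k holding a relevant item,
    divided by min(len(relevant), k)."""
    hits, total = 0, 0.0
    for i, item in enumerate(ranked[:k], start=1):
        if item in relevant:
            hits += 1
            total += hits / i
    denom = min(len(relevant), k)
    return total / denom if denom else 0.0

def mean_average_precision(all_ranked, all_relevant, k):
    """MAP@k: the average of AP@k over all users."""
    scores = [average_precision_at_k(r, rel, k)
              for r, rel in zip(all_ranked, all_relevant)]
    return sum(scores) / len(scores)
```

Here `ranked` is a user's recommendation list in descending score order and `relevant` is the set of products that user purchased in the testing set.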

Root mean squared error

Root mean squared error (RMSE) quantifies the difference between recommendation results and user feedback, reflecting the magnitude of prediction error. By monitoring RMSE, the prediction precision can be evaluated to ensure that recommendations meet user needs. RMSE can also serve as an optimization objective, helping to adjust the model to reduce bias and improve recommendation accuracy and satisfaction. This article tests how RMSE changes under different values of the regularization parameter γ to evaluate parameter sensitivity. The results are shown in Figure 3:

Figure 3.

Root mean squared error of the four algorithms under different values of the regularization parameter γ.

Figure 3 shows the RMSE trends of the four algorithms, DQN, BPR, ML, and Deep Walk, as the regularization parameter γ varies from 0 to 0.2. DQN has the lowest RMSE throughout, decreasing gradually from 1.05 at γ = 0 to below 0.2. The RMSE of BPR starts at about 1.1; although it decreases as γ increases, it remains higher than that of DQN, indicating poorer effectiveness. The RMSE of the ML algorithm starts at about 1.15 and changes little, so it performs poorly and is insensitive to parameter adjustment. The RMSE of Deep Walk is about 1.1 at γ = 0, decreases as γ increases, and is about 0.58 at γ = 0.2. Overall, DQN responds most strongly to the regularization parameter and achieves the lowest error, which supports stable and efficient operation of the recommendation system. Tuning this parameter can therefore improve the accuracy, recall, and user satisfaction of the model.
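
For reference, RMSE in its standard form is given below. The notation is an assumption, since the article does not define its symbols: N is the number of evaluated user-product pairs, \hat{y}_i is the score predicted for pair i, and y_i is the observed user feedback for that pair.

```latex
\mathrm{RMSE} = \sqrt{\frac{1}{N}\sum_{i=1}^{N}\left(\hat{y}_i - y_i\right)^{2}}
```

Because the errors are squared before averaging, a few large mispredictions raise RMSE more than many small ones, so a lower curve in Figure 3 indicates fewer large errors as well as a smaller average error.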

In addition to comparing the root mean squared error under different regularization parameters, this article also tests the root mean squared error at different cross-validation times for recommendation list lengths (k) of 1, 5, and 10. Figure 4 shows the results:

Figure 4.

Comparison of root mean squared error at different cross-validation times. (a): Root mean squared error results for a recommendation list length of 1; (b): Root mean squared error results for a recommendation list length of 5; (c): Root mean squared error results for a recommendation list length of 10.

Figures 4(a)–(c) show that, as the number of cross-validations increases from 1 to 15, the RMSE of the DQN algorithm fluctuates for every recommendation list length, reflecting how sensitive the algorithm is to the cross-validation setting. When the length of the recommendation list is 1, the RMSE of the DQN algorithm ranges from 0.214 to 0.318, that of the BPR algorithm from 0.418 to 0.577, that of the ML algorithm from 0.331 to 0.504, and that of the Deep Walk algorithm from 0.426 to 0.594. The prediction precision of the DQN algorithm therefore varies across validation times, but its RMSE remains the lowest overall, outperforming BPR, ML, and Deep Walk. When the length of the recommendation list increases to 5, the highest RMSE values for the DQN, BPR, ML, and Deep Walk algorithms are 0.597, 0.855, 0.816, and 0.948, respectively, and the lowest values are 0.432, 0.638, 0.561, and 0.745, respectively. When the length of the recommendation list increases to 10, the highest RMSE values are 0.698, 0.913, 0.885, and 0.992, respectively, and the lowest values are 0.539, 0.723, 0.694, and 0.816, respectively. Compared with the other three algorithms, DQN still demonstrates good robustness and prediction precision. These data underline the importance of careful parameter tuning under different settings, as well as the robustness of the DQN algorithm in personalized recommendation tasks, and provide an empirical basis for improving users’ personalized experience. The DQN algorithm adapts better to different recommendation list lengths and cross-validation times, showing high stability and prediction accuracy. At the same time, it combines deep learning with reinforcement learning to optimize the recommendation strategy, so that it can make more accurate predictions across different recommendation scenarios.

Iteration time

Iteration time is crucial for algorithm efficiency and real-time performance. A short iteration time means fast convergence, which allows the model to adapt to changes in user behavior and provide fresh recommendations. Reducing iteration time also improves response speed and user experience, which is important for big data processing and high-performance online services. Finally, this article tests the time required for each iteration for recommendation list lengths (k) of 1, 5, and 10. The results are shown in Figure 5:

Figure 5.

Comparison of time required for each iteration. (a): Time required for a single iteration when the recommendation list length is 1; (b): Time required for a single iteration when the recommendation list length is 5; (c): Time required for a single iteration when the recommendation list length is 10.

As shown in Figures 5(a)–(c), the DQN algorithm iterates efficiently under all recommendation list lengths. When the length of the recommendation list is 1, its iteration time stays between 1.370 seconds and 2.141 seconds, with an average of 1.647 seconds. The iteration time of the BPR algorithm stays between 2.364 seconds and 4.170 seconds, with an average of 3.253 seconds. The iteration time of the ML algorithm stays between 2.175 seconds and 3.678 seconds, with an average of 2.999 seconds. The iteration time of the Deep Walk algorithm stays between 3.377 seconds and 4.659 seconds, with an average of 3.993 seconds. The iteration time of the DQN algorithm is thus significantly lower than that of the BPR, ML, and Deep Walk algorithms. As the length of the recommendation list increases, the iteration time of all algorithms generally increases, but that of DQN remains the lowest. When the recommendation list length is 5 and 10, its average iteration time is 3.024 seconds and 3.548 seconds, respectively. For the same lengths, the average iteration time of the BPR algorithm is 6.461 seconds and 8.073 seconds, that of the ML algorithm is 4.321 seconds and 5.720 seconds, and that of the Deep Walk algorithm is 5.319 seconds and 7.375 seconds, respectively. This again confirms the efficiency of the DQN algorithm in more complex recommendation scenarios. These data highlight the advantage of the DQN algorithm in iteration efficiency, which not only shortens the model training cycle but also provides technical support for the rapid deployment and response of real-time recommendation systems, ensuring that users receive updated, personalized shopping suggestions in a timely manner. Optimizing the iteration time is therefore not only a reflection of algorithm efficiency but also a key factor in enhancing the practical value and market competitiveness of recommendation systems.

Conclusion

In the digital age, personalized recommendations are crucial for e-commerce, but traditional recommendation algorithms have many limitations. This article combines deep learning and reinforcement learning by applying deep Q-networks to recommendation algorithms, in order to overcome these problems and bring new breakthroughs to e-commerce platform recommendation systems. It preprocesses massive data, simulates user-system interaction using DQN, learns recommendations through a reward mechanism, and addresses the cold start problem. The approach can also adapt to changes in user preferences, ensuring timely and personalized recommendations. Deep learning is used to construct complex user preference models that support personalized experiences, and parallel computing improves the capability to process big data. The experiments show that the DQN algorithm outperforms the BPR, ML, and Deep Walk algorithms and is more efficient in handling complex scenarios. In the JD dataset testing, the algorithm performs well while reducing training difficulty and overfitting risk. Despite these achievements, there is still room for improvement. Future research should focus on how to integrate users’ long-term and short-term interests more effectively, improve the efficiency of models in high-dimensional data processing, and explore more advanced model interpretation techniques to make the recommendation decision-making process more transparent. In addition, with the increasing importance of social networks and multimodal data in e-commerce recommendations, how to integrate this information into the DQN framework to further improve the accuracy and coverage of personalized recommendations is an important topic for future research. Through this technology, this article aims to bring greater convenience and value to users and businesses and to support a comprehensive upgrade of e-commerce recommendation systems.

Footnotes

Funding

The author received no financial support for the research, authorship, and/or publication of this article.

Declaration of conflicting interests

The author declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.

References

1. Yuqi, Chen, Hongfei, et al. Personalized product recommendation based on network representation learning. Chin J Comput 2019; 42(8): 1767–1778.

2. Zou. Personalized product recommendation method for analyzing user behavior using DeepFM. Journal of Information Processing Systems 2021; 17(2): 369–384.

3. Oroojlooyjadid, Nazari, Snyder, Takac. A deep q-network for the beer game: deep reinforcement learning for inventory optimization. Manuf Serv Oper Manag 2022; 24(1): 285–304.

4. Yang, Juntao, Peng. Multi-robot path planning based on a deep reinforcement learning DQN algorithm. CAAI Trans Intell Technol 2020; 5(3): 177–183.

5. Liang. Commodity recommendation model based on improved deep Q network structure. J Comput Appl 2020; 40(9): 2613.

6. Video recommendation algorithm based on the combination of FM and DQN. Computer & Digital Engineering 2021; 49(9): 1771–1776.

7. Vijayakumar, Vairavasundaram, Logesh, et al. Effective knowledge based recommender system for tailored multiple point of interest recommendation. Int J Web Portals (IJWP) 2019; 11(1): 1–18.

8. Shokeen, Rana. A study on features of social recommender systems. Artif Intell Rev 2020; 53(2): 965–988.

9. Cao, Zhao, et al. Diversified personalized recommendation optimization based on mobile data. IEEE Trans Intell Transport Syst 2020; 22(4): 2133–2139. DOI: 10.1109/TITS.2020.3040909.

10. Cui, Xue, et al. Personalized recommendation system based on collaborative filtering for IoT scenarios. IEEE Trans Serv Comput 2020; 13(4): 685–695.

11. Zhang, Wang, Yuan, et al. Personalized real-time movie recommendation system: practical prototype and evaluation. Tsinghua Sci Technol 2019; 25(2): 180–191.

12. Nitu, Coelho, Madiraju. Improvising personalized travel recommendation system with recency effects. Big Data Min Anal 2021; 4: 139–154.

13. Liang, Jin, Ren, et al. An improved dual-channel deep Q-network model for tourism recommendation. Big Data 2023; 11(4): 268–281.

14. Zheng, Ding. Behavior-aware English reading article recommendation system using online distilled deep q-learning. J Cases Inf Technol 2023; 25(1): 1–21.

15. Lei, Wang, et al. Social attentive deep Q-networks for recommender systems. IEEE Trans Knowl Data Eng 2020; 34(5): 2443–2457.

16. Tan, Han, et al. Adaptive learning recommendation strategy based on deep Q-learning. Appl Psychol Meas 2020; 44(4): 251–266.

17. Wang, Zhu, Zhang, et al. Design of hybrid recommendation algorithm in online shopping system. Journal of New Media 2021; 3(4): 119–128.

18. Gao, Liang, Sun. Dynamic network intelligent hybrid recommendation algorithm and its application in online shopping platform. J Intell Fuzzy Syst 2021; 40(5): 9173–9185.

19. Roy, Niculescu. A dual process model of the influence of recommender systems on purchase intentions in online shopping environments. J Internet Commer 2023; 22(3): 432–453.

20. Shankar, Perumal, Subramanian, et al. An intelligent recommendation system in e-commerce using ensemble learning. Multimed Tool Appl 2024; 83: 48521–48537.

21. Wei. Online education recommendation model based on user behavior data analysis. J Intell Fuzzy Syst 2019; 37(4): 4725–4733. DOI: 10.3233/JIFS-179307.

22. Liu, et al. A personalized video recommendation strategy based on user playback behavior sequence. Chin J Comput 2020; 43(1): 123–135.

23. Deriso, Boyd. A general optimization framework for dynamic time warping. Optim Eng 2023; 24(2): 1411–1432.

24. Zhang, Hadi F-T, Thoresen. Feature extraction from unequal length heterogeneous EHR time series via dynamic time warping and tensor decomposition. Data Min Knowl Discov 2021; 35(4): 1760–1784.

25. Gan, Liu, et al. Non-negative matrix factorization: a survey. Comput J 2021; 64(7): 1080–1092.

26. Hafshejani, Moaberfard. Initialization for non-negative matrix factorization: a comprehensive review. Int J Data Sci Anal 2023; 16(1): 119–134.

27. Y-C, Quang Dinh, et al. A hybrid DQN and optimization approach for strategy and resource allocation in MEC networks. IEEE Trans Wireless Commun 2021; 20(7): 4282–4295.

28. Lin C-C, Deng D-J, Chih Y-L, et al. Smart manufacturing scheduling with edge computing using multiclass deep Q network. IEEE Trans Ind Inf 2019; 15(7): 4276–4284.

29. Messaoud, Bradai, Moulay. Online GMM clustering and mini-batch gradient descent based optimization for industrial IoT 4.0. IEEE Trans Ind Inf 2019; 16(2): 1427–1435.

30. Peng, et al. FIR filter based on small batch gradient descent method. Journal of Guangxi Normal University (Natural Science Edition) 2021; 39(4): 9–20.