E-commerce return data based on frequent itemset mining and time series symbolization clustering

Abstract

The rapid development of the Internet economy has made the e-commerce sales model more mature, but e-commerce returns are increasingly frequent and generate a large amount of return data. Traditional data mining methods do not take into account the impact of time series on the data, which makes it difficult to analyze e-commerce return data accurately and effectively. To handle the time series characteristics of e-commerce return data, this study symbolizes the raw time series, converting continuous numerical time series data into discrete symbol series. To optimize the linear segmentation of the time series, an incremental error segmentation method is introduced to replace the traditional sliding window segmentation method; the segmentation points are adjusted dynamically by calculating the fitting error of each time period step by step. The K-means clustering algorithm is then used to cluster the symbol sequences, and a frequent pattern growth algorithm is introduced to mine frequent itemsets from the clustered symbol sequences. The results showed that the incremental error segmentation method used in the study reduced the fitting error by an average of 1.382 when the compression rate exceeded 90%. Under the same support rate, the proposed algorithm consumed only about 200 MB of memory and ran for only 60 seconds, demonstrating the effectiveness and accuracy of the research method. The analysis of the example also showed that an increase in merchant sales, a decrease in logistics anomalies, and a decline in store reputation all had a significant impact on product returns. These results can help businesses understand return trends, optimize sales and logistics, and thereby reduce return rates.

Keywords

K-means clustering, FP-growth, frequent itemsets, time series, data mining, internet

Introduction

With the rapid advancement of internet technology, more data have been recorded on the network, with time series data accounting for the largest proportion. According to statistics, the average annual growth rate of global e-commerce has exceeded 20%. Driven by the mobile Internet, online shopping has grown explosively. E-commerce not only provides consumers with a convenient shopping experience, but also brings huge economic benefits to businesses. However, as transaction volume increases, e-commerce returns have become an increasingly prominent problem that urgently needs to be solved. An e-commerce return occurs when a consumer, for whatever reason, sends purchased goods back to the merchant. According to some studies and reports, the return rate of e-commerce is much higher than that of traditional retail. Taking the US market as an example, the e-commerce return rate reached about 20% in 2022, and in some product categories this proportion is close to 30% or higher. Returns in the Chinese market cannot be ignored either, especially during large-scale promotional events such as “Double 11” or “Double 12”, when the issue is more prominent. E-commerce return data is an important source of information in the e-commerce field and has important reference value for businesses and consumers.^1,2 By studying and mining these data, businesses can better understand the reasons for and patterns of returns, and take corresponding measures to reduce the return rate. Consumers, in turn, can be offered better return services and experiences. Therefore, studying e-commerce return data is both necessary and practically significant. Traditional data mining methods have some limitations when processing time series data.
Due to the high-dimensional nature of time series data, directly using raw data for mining may result in high computational complexity and inaccurate results.^3,4 At the same time, current clustering methods, including frequency domain-based and pattern-based time series symbolization (TSS) clustering, suffer from difficulties in feature extraction, poor stability, poor interpretability of clustering results, and difficulty in processing large-scale data. Therefore, in order to perform dimensionality reduction, compression and other processing on time series data, improve the efficiency of large-scale data processing, and better mine e-commerce return data, a frequent itemset mining (FIM) algorithm based on TSS clustering was studied and designed to study e-commerce return data. Incremental error segmentation (IES) method was used to optimize linear segmentation performance, and k-means clustering algorithm was used to cluster symbol sequences. To improve the efficiency of the algorithm, the frequent pattern growth (FP-growth) algorithm for clustering symbol sequence FIMs was introduced. In addition, efficient data mining on large-scale datasets was achieved through the FP-growth algorithm. This algorithm has high efficiency and accuracy in mining e-commerce return data, effectively capturing changes in return patterns and providing substantive conclusions about return behavior. The contribution of the research mainly includes two aspects. One is to provide a more accurate and effective clustering method for large-scale data. The second is to improve the accuracy and speed of e-commerce return data processing, which can help e-commerce enterprises better understand and analyze return data, and thus develop more effective return strategies and management measures.

This study mainly focuses on the following four aspects. The first part introduces the current research status of internet data mining methods. The second part is the specific design of FIM algorithms based on TSS. The third part is to analyze the data mining results through experiments, and introduce other algorithms to verify the performance of this algorithm. The fourth part is a summary and analysis of the entire text. This study promotes the study of e-commerce return data. By optimizing the linear segmentation method and introducing the clustering and FIM algorithm, the study provides an effective analysis method, which can help e-commerce enterprises to deeply understand the patterns and rules in the return data, so as to optimize the return strategy, improve user experience, and increase enterprise profits. In addition, the study also promotes the further development and application in the field of time series analysis and data mining.

Related works

With the rapid growth of the Internet, big data, and artificial intelligence, all kinds of network data can be recorded and saved. Many scholars have studied the mining and analysis of Internet data. Among them, significant results have been achieved in improving data mining algorithms, which can provide improvement ideas for this study. Kim et al. proposed a symbol-centered data structure for the analysis of periodic data in time series databases and verified its feasibility through experiments on real data. The experiments showed that this method improved memory usage and runtime, and was more efficient.⁵ Li proposed a dynamic data mining algorithm based on time weight to address the effect of time weight on the effectiveness of data mining. This method formed an important row matrix by matching corresponding data with dynamic time features, and its feasibility was verified experimentally. The results showed that, compared with existing methods, the information mined by this method was more comprehensive.⁶ Qu et al. observed that high-utility itemsets can reveal combinations of items with high profit, cost, or importance. They therefore proposed Hamm, a high-performance algorithm for mining high-utility itemsets. This algorithm adopts a novel prefix tree and utility vector structure to mine high-utility itemsets in one stage without generating candidates. The results showed that the proposed optimizations significantly reduced the search space and improved the speed of the Hamm algorithm.⁷ Li et al. proposed a method based on feature fusion and machine learning to detect malicious mining code. Feature vectors were extracted with an n-gram model and TF-IDF, and the best feature vector was selected through a classifier. The results showed that the recognition accuracy of this method reached 98.0%, the F1 value reached 0.969, and the area under the ROC curve reached 0.973.⁸ Alyasiri and Ali analyzed the quantity, speed, diversity, accuracy, and value of big data and proposed a powerful framework for handling data complexity, which helps to explore more deeply how big data principles can optimize e-commerce return data analysis.⁹

The application of data mining algorithms in other fields such as medicine can also provide research ideas for this study. The Guo team applied data mining technology to traditional Chinese medicine treatment, using the Apriori algorithm and the hclust function to perform association and clustering analysis on a traditional Chinese medicine database. A total of 26 association rules were obtained, and four types of traditional Chinese medicine were identified through clustering analysis. The conclusion of the clustering analysis was consistent with the actual situation, and the mining effect was good.¹⁰ The Peng team proposed a time series data mining method with a facilities cost model to predict haze pollution effectively, linking the formation of haze with the multidimensional nature of time series. Other algorithms were introduced for comparative experiments to demonstrate the effectiveness of this data mining method.¹¹ The Li team considered the low efficiency of traditional data mining algorithms and applied association rules to the data mining algorithm. They proposed a MapReduce-based association rule mining algorithm, which improved the efficiency of the algorithm and reduced memory requirements. In mining experiments on datasets as large as 40 million, the method still performed well.¹² Diamond and Happawana applied the FIM method to clinical medicine, using the FP-growth algorithm to establish a database related to Alzheimer’s disease, and evaluated the feasibility of the algorithm through experiments. The experiments showed that the algorithm performed well and could be used for medical data analysis.¹³ Considering the impact of customer satisfaction on manufacturers, the Kang team proposed an FIM algorithm based on fuzzy association. This algorithm used the fuzzy Delphi method to calculate the data weights and determined 14 association rules to analyze customer satisfaction. Experiments showed that this method was feasible and reduced the risk of product failure to a certain extent.¹⁴ Ahmed and Nath applied data mining to disease detection. To improve algorithmic efficiency and reduce economic costs, they proposed an improved FP-growth algorithm that generates itemsets with both top-down and bottom-up methods. Experiments verified that the algorithm was more efficient than traditional methods and was feasible.¹⁵ Samal and other researchers proposed an enhanced differential protection scheme for microgrid fault detection and classification based on data mining models. The results showed that the proposed intelligent differential protection scheme based on deep neural networks can provide reliable protection for microgrids.¹⁶

In summary, compared with traditional data mining algorithms, the TSS based FIM algorithm performs better in processing data with time series features. Traditional algorithms often overlook the temporal relationships in time series data, resulting in insufficient effectiveness in processing e-commerce sales data. TSS can effectively convert continuous time series into discrete symbol sequences, capturing time patterns and periodic behavior, making subsequent FIMs more accurate. The FP-growth algorithm has high performance in frequent itemset mining, significantly improving mining efficiency by reducing the generation of candidate itemsets. Based on this advantage, a new data mining framework has been designed. Firstly, TSS processing is applied to e-commerce sales data, and then FP-growth algorithm is applied for efficient frequent itemset mining, thereby revealing key patterns and consumer behavior patterns in sales data and promoting in-depth research on e-commerce data analysis.

Methods

To analyze e-commerce return data effectively, this section first introduces the symbolization of time series, which converts continuous time series curves into discrete symbol sequences. It then presents the frequent pattern mining algorithm for time series, in which the FP-growth algorithm sets the frequently occurring data in the time series as the threshold itemset in advance to improve the efficiency of data analysis.

Design of time series symbolization clustering

Because the original time data are large in volume, high in dimensionality, and noisy, clustering and classifying them with traditional data mining algorithms may not give accurate results. This study therefore symbolizes the time series data, expressing complex time series data in simple symbols, and conducts data mining analysis on the symbolized data to raise the efficiency of the algorithm.^17,18 TSS consists of three processes: selecting important points, linearizing segmentation, and segment clustering. The purpose of screening important points is to select the support points of the time series contour, namely the extreme points of the time series data, the endpoints, and the points where the slope difference between adjacent line segments is greater than the threshold. The mathematical expression for the slope difference of line segments is shown in equation (1).

k = \mathrm{abs}((p_{i + 1} - p_{i}) - (p_{i} - p_{i - 1}))

(1)

In equation (1), $k$ denotes the slope difference; $p_{i}$ indicates the slope of the $i$th point; $p_{i + 1}$ and $p_{i - 1}$ indicate the slopes of the $(i + 1)$th and $(i - 1)$th points, respectively; $\mathrm{abs}$ is the absolute value function. Next, the filtered important point sequence is segmented and linearized. The main segmentation methods are bottom-up, top-down, and sliding window segmentation. This study uses a method based on sliding window segmentation. Because that method does not take into account the fitting error caused by newly added windows, the study introduces an incremental error on top of the cumulative fitting error to optimize the algorithm. With the incremental error, the improved algorithm focuses on the change in fitting error caused by newly added points. When a newly added point causes a significant change in the fitting error of the sliding window, the window as it stood before that point was added is fitted separately. Introducing incremental error into the traditional sliding window segmentation compression algorithm therefore identifies important points with significant short-term trend changes, improving the accuracy of the segmentation algorithm while keeping the compression rate essentially unchanged. The approximate process of linearization segmentation is shown in Figure 1.¹⁹
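The screening rule around equation (1) can be illustrated with a minimal Python sketch. It is not the authors' implementation. It assumes unit-spaced samples, so that differences of consecutive values play the role of the slopes in equation (1), and it assumes that endpoints, local extrema, and points whose slope difference exceeds the threshold are all kept. The function name `important_points` and the sample values are hypothetical.

```python
def important_points(series, threshold):
    """Return indices of points kept as the contour of the series.

    Assumption: samples are unit-spaced, so the difference between
    consecutive values is used as the slope on each side of a point.
    """
    n = len(series)
    if n <= 2:
        return list(range(n))
    keep = [0]  # the first endpoint is always kept
    for i in range(1, n - 1):
        left = series[i] - series[i - 1]    # slope arriving at point i
        right = series[i + 1] - series[i]   # slope leaving point i
        k = abs(right - left)               # slope difference, equation (1)
        is_extreme = left * right < 0       # slope sign change: local extremum
        if is_extreme or k > threshold:
            keep.append(i)
    keep.append(n - 1)  # the last endpoint is always kept
    return keep


sample = [0, 1, 2, 3, 2, 1, 1, 1, 5]
print(important_points(sample, threshold=0.5))
```

Raising the threshold keeps fewer turning points, which is the lever that trades contour fidelity against compression in the later steps.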

Figure 1.

Linearized segment flow chart.

As shown in Figure 1, linearization segmentation first initializes the sliding window. Line segments are then fitted to the important-point time series, and the incremental error and cumulative fitting error are evaluated until all important points have been traversed, which ends the linearization step.² Let the time series be $T$, the line segment series be $K$, and the approximate series obtained from the line segments $k_{i}$ by linear interpolation be $X$. The expressions of $T$, $K$, and $X$ are shown in equation (2), and the cumulative fitting error and compression rate of the linearization segmentation are shown in equations (3) and (4).

\begin{cases} T = \{t_{1}, t_{2}, \ldots, t_{n}\} \\ K = \{k_{1}, k_{2}, \ldots, k_{m}\} \\ X = \{x_{1}, x_{2}, \ldots, x_{n}\} \end{cases}

(2)

In equation (2), $t$ represents a time series of important points, $n$ is the number of important points in the time series, and $m$ is the number of linear segments.

E = \sum_{i = 1}^{n} {e_{i}}^{2} = \sum_{i = 1}^{n} {(t_{i} - x_{i})}^{2}

(3)

In equation (3), $E$ refers to the cumulative linear fitting error, while $e_{i}$, $x_{i}$, and $t_{i}$ represent the fitting error, the approximated value, and the original value at important point $i$, respectively.

p = (1 - \frac{m}{n}) \times 100\%

(4)

In equation (4), $p$ denotes the compression rate achieved by compressing the time series $T$ into the segment sequence $K$.
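The interplay between the cumulative fitting error, the incremental error, and the compression rate can be sketched as follows. This is an illustrative sketch under stated assumptions rather than the authors' code: each window is approximated by the straight line joining its two endpoints (linear interpolation), a window is closed when either its cumulative error or the error added by the newest point exceeds a hypothetical threshold (`max_error`, `max_increment`), and the compression rate is read as the share of points removed, (1 − m/n) × 100%, with m segments and n points.

```python
def window_error(values, start, end):
    """Squared error of the line joining values[start] and values[end]
    (equation (3) restricted to a single window)."""
    span = end - start
    err = 0.0
    for i in range(start, end + 1):
        x = values[start] + (values[end] - values[start]) * (i - start) / span
        err += (values[i] - x) ** 2
    return err


def segment(values, max_error, max_increment):
    """Sliding-window segmentation with an incremental-error check.

    Returns the indices of the breakpoints, including both endpoints.
    """
    n = len(values)
    if n < 2:
        raise ValueError("need at least two points")
    breakpoints = [0]
    start, end, prev_err = 0, 1, 0.0
    while end < n:
        err = window_error(values, start, end)
        grows_too_fast = err - prev_err > max_increment  # incremental error
        if end - start > 1 and (err > max_error or grows_too_fast):
            # Close the window as it stood before the newest point was added.
            breakpoints.append(end - 1)
            start, prev_err = end - 1, 0.0
        else:
            prev_err = err
            end += 1
    breakpoints.append(n - 1)
    return breakpoints


def total_error(values, breakpoints):
    """Cumulative fitting error E summed over all segments."""
    return sum(window_error(values, a, b)
               for a, b in zip(breakpoints, breakpoints[1:]))


def compression_rate(n_points, n_segments):
    """Assumed reading of equation (4): share of points removed, in percent."""
    return (1 - n_segments / n_points) * 100


points = [0, 1, 2, 3, 2, 1, 0]
cuts = segment(points, max_error=0.5, max_increment=0.5)
print(cuts, total_error(points, cuts), compression_rate(len(points), len(cuts) - 1))
```

On this toy series the window is cut at the peak, because adding the first descending point makes the fitting error jump. That jump is the short-term trend change the incremental error is meant to catch.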

The last and most crucial step in TSS is the clustering symbolization of the linearized segments. The specific steps are as follows. First, geometric features such as the slope and length of the fitted line segment of each subsequence obtained in the previous step are extracted and used as the attributes of the sample points in clustering. Then, the number of patterns in the original time series is used as both the number of clusters and the size of the corresponding symbol table. Finally, each line segment in the original time series is replaced with the symbol of its cluster center to obtain the symbol sequence. Because trend changes in time series receive the most attention in practice, the slope of a line segment is used as its feature, and three common trend patterns (rising, falling, and basically unchanged) are often set. The clustering symbolization process is shown in Figure 2.

Figure 2.

Cluster symbolization process diagram.

The K-means algorithm can cluster symbolized time series data quickly and effectively, finds clusters with similar patterns through a simple iterative process, and is easy to interpret. It was therefore selected as the clustering algorithm for this study. The method calculates the Euclidean distance between a line segment and the center of each class, as shown in equation (5).

d = \sqrt{{(x_{2} - x_{1})}^{2} + {(y_{2} - y_{1})}^{2}}

(5)

As shown in equation (5), $d$ stands for the distance from the line segment to the class center; $(x_{1}, y_{1})$ refers to the position of the line segment center; $(x_{2}, y_{2})$ expresses the position of the class center. The update expression for the cluster center is shown in equation (6).

C = \frac{1}{| C_{f} |} \sum_{x_{i} \in C_{f}} x_{i}

(6)

In equation (6), $C$ refers to the updated cluster center; $| C_{f} |$ is the number of objects in the $f$th cluster; $x_{i}$ is the position of the $i$th segment. During clustering, the number of clusters should not be set too high; otherwise the time series will contain too many pattern types, and similar patterns will be represented by different symbols. Conversely, if the number of clusters is set too low, dissimilar subsequences in the time series will share the same symbol. In practical applications, an appropriate number of clusters should therefore be set according to the purpose of the data mining task and the background knowledge of domain experts. This study sets the number of clusters to 4. After important point selection, linearization segmentation, and segment clustering, the original time series data are transformed into symbolic segments. The effect of the TSS is shown in Figure 3.²⁰
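A small, dependency-free sketch of the clustering symbolization step is given below. It is an illustration under assumptions, not the study's code: only the slope is used as the feature, so the Euclidean distance of equation (5) reduces to an absolute difference; the initial centers are spread deterministically over the sorted slopes (a choice made here for reproducibility); and symbols are assigned in order of increasing cluster center. The names `kmeans_1d` and `symbolize` and the sample slopes are hypothetical.

```python
def kmeans_1d(values, k, iters=100):
    """K-means on scalar features; returns the k cluster centers."""
    ordered = sorted(values)
    # Deterministic initialization: spread the centers over the sorted values.
    centers = [ordered[(2 * j + 1) * len(ordered) // (2 * k)] for j in range(k)]
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for v in values:
            nearest = min(range(k), key=lambda j: abs(v - centers[j]))
            clusters[nearest].append(v)
        # Center update of equation (6): the mean of each cluster's members.
        updated = [sum(c) / len(c) if c else centers[j]
                   for j, c in enumerate(clusters)]
        if updated == centers:
            break
        centers = updated
    return centers


def symbolize(slopes, k):
    """Replace each segment slope with the letter of its nearest center."""
    centers = sorted(kmeans_1d(slopes, k))
    alphabet = "abcdefghijklmnopqrstuvwxyz"
    return "".join(
        alphabet[min(range(k), key=lambda j: abs(s - centers[j]))]
        for s in slopes
    )


slopes = [1.0, 1.2, -0.9, -1.1, 0.05, -0.05, 0.9, 0.0]
print(symbolize(slopes, k=3))
```

With three clusters the letters correspond to falling, basically unchanged, and rising segments. Adding length as a second feature would restore the two-dimensional distance of equation (5).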

Figure 3.

Time series symbolization flow chart.

In Figure 3, the blue part denotes the original time series curve, and the selection result of important points indicates the approximate contour represented by the orange line segment. From the figure, the contour is roughly the same as the original curve. The result of linearization segmentation is displayed in the orange line segment in the figure, dividing the important point data curve into several straight line segments. The clustering symbolization results are shown in the letters a-f in the figure, assigning symbol representations to line segments with the same features, with each symbol representing an independent trend. The curve data shown in the figure can be represented by discrete symbols (abcbcdefebcbgddabdf) after symbolic processing.

Analysis of return data based on time series frequent itemset mining

After TSS, the continuous time series curves have been transformed into discrete symbol sequences. The next step is FIM on the time series, which first sets the frequent data in the time series as the threshold itemset to raise the efficiency of data analysis. Because the number of discrete time series symbols is large, the time series are aligned before further data mining. Assume that the symbol sequences of $i$ time series after symbolization are given by equation (7).

K = \{k_{1}, k_{2}, \ldots, k_{i}\}

(7)

In equation (7), $K$ represents the symbol sequences of the time series, that is, the set of all $k_{i}$; $k_{i}$ stands for the $i$th symbol sequence. Each symbol sequence is then written as the subsequences delimited by every two adjacent breakpoints together with their corresponding symbols, as shown in equation (8).

k_{i} = \{[t_{1}, t_{2}] : a_{1}, [t_{2}, t_{3}] : a_{2}, \ldots, [t_{m}, t_{m + 1}] : a_{m}\}, \quad a_{i} \in K

(8)

In equation (8), $[t_{m}, t_{m + 1}]$ expresses the start and end breakpoints of a subsequence of the time series, and $a_{m}$ denotes the symbol of the $m$th subsequence. According to the time series alignment formula, four line segment symbol sequences from $k_{1}$ to $k_{4}$ are selected for the alignment operation. The schematic diagram of the time series alignment operation is shown in Figure 4.²¹

Figure 4.

Alignment operation diagram.

In Figure 4, the segmentation points chosen for the three symbol sequences formed from three different time series may differ after the linearization segmentation step. The segmentation points of the three symbol sequences therefore need to be aligned before FIM so that every symbol sequence uses the same segmentation points, as the figure shows after the alignment operation. As this study focuses on the frequent patterns of trend changes in the time series, only the slope of the line segment is considered as the feature. Splitting a line segment does not change its slope, so in the alignment operation each new line segment keeps the symbol of the line segment from which it was split. To align the symbol sequences, the breakpoints of all symbol sequences $k_{i}$ in equation (8) are first merged in chronological order into a sequence $X = \{t_{1}, t_{2}, \ldots, t_{m}\}$ containing $m$ breakpoints. Each breakpoint $t_{i}$ in $X$ is then compared in turn with the breakpoints of the symbol sequence $k_{i}$. If the breakpoint is already present, the next breakpoint $t_{i + 1}$ is compared. If it is not, the subsequence that contains $t_{i}$ is split at $t_{i}$ into two adjacent subsequences. As shown in Figure 4, after all breakpoints have been compared and the necessary splits made, every $k_{i}$ contains the same segmentation points. Symbols that share the same pair of segmentation points are combined into an itemset $I$. After all $m$ breakpoints have been traversed, the alignment operation is complete, and all itemsets are combined into an itemset dataset, as expressed in equation (9).

\begin{cases} I_{k} = [a_{1}, a_{2}, \ldots, a_{n}] \\ E = [I_{1}, I_{2}, \ldots, I_{m}] \end{cases}

(9)

As shown in equation (9), after symbolization and alignment the original time series data are transformed into a dataset $E$ composed of several itemsets $I$, on which FIM is then performed. Three symbol sequences are used below to describe the overall time series frequent mining algorithm. Assume that the symbol table of each sequence is as shown in equation (10).

\begin{cases} A = \{a_{0}, a_{1}, a_{2}\} \\ B = \{b_{0}, b_{1}, b_{2}\} \\ C = \{c_{0}, c_{1}, c_{2}\} \end{cases}

(10)

Equation (10) gives the symbol tables of the three sequences $A$, $B$, and $C$. After the TSS algorithm, the symbolized sequences are as shown in equation (11), and they are drawn as the $k_{1}$, $k_{2}$, and $k_{3}$ sequences in Figure 4.

\begin{cases} k_{1} = \{a_{2} : [0, 2], a_{1} : [2, 3], a_{0} : [3, 5]\} \\ k_{2} = \{b_{1} : [0, 1], b_{2} : [1, 2], b_{0} : [2, 3], b_{2} : [3, 4], b_{0} : [4, 5]\} \\ k_{3} = \{c_{0} : [0, 2], c_{2} : [2, 4], c_{1} : [4, 5]\} \end{cases}

(11)

In equation (11) and Figure 4, the numbers in the square brackets are the start and end breakpoints of each subsequence. The three symbolized sequences are aligned as shown in equation (12).

\begin{cases} k_{1}' = \{a_{2} : [0, 1], a_{2} : [1, 2], a_{1} : [2, 3], a_{0} : [3, 4], a_{0} : [4, 5]\} \\ k_{2}' = \{b_{1} : [0, 1], b_{2} : [1, 2], b_{0} : [2, 3], b_{2} : [3, 4], b_{0} : [4, 5]\} \\ k_{3}' = \{c_{0} : [0, 1], c_{0} : [1, 2], c_{2} : [2, 3], c_{2} : [3, 4], c_{1} : [4, 5]\} \end{cases}

(12)

As shown in equation (12), after the alignment operation the symbols that cover the same subsequence can be combined into an itemset. For example, the itemset of $[0, 1]$ is $\{a_{2}, b_{1}, c_{0}\}$, and all itemsets can be combined into an itemset dataset. The FIM method used in this study is the FP-growth algorithm. To mine frequent itemsets, FP-growth first constructs a frequent pattern tree (FP-tree) and then mines the frequent itemsets from that tree. The number of frequent itemsets is shown in Figure 5.²²
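The alignment of equations (11) and (12) can be reproduced with a short sketch. It is a simplified illustration, not the authors' procedure: instead of comparing breakpoints one by one, it merges all breakpoints and looks up, for each resulting interval, the symbol of the original segment that covers it. Because a split segment keeps its symbol, the result is the same. The tuple layout `(symbol, start, end)` and the function name `align` are assumptions.

```python
def align(sequences):
    """Align symbol sequences on the union of their breakpoints.

    sequences maps a sequence name to a list of (symbol, start, end).
    Returns a list of ((start, end), [symbols]) pairs, one per interval.
    """
    cuts = sorted({t for seq in sequences.values()
                   for _, start, end in seq
                   for t in (start, end)})
    itemsets = []
    for left, right in zip(cuts, cuts[1:]):
        items = []
        for name in sorted(sequences):
            for symbol, start, end in sequences[name]:
                # A split segment inherits the symbol of its parent segment.
                if start <= left and right <= end:
                    items.append(symbol)
                    break
        itemsets.append(((left, right), items))
    return itemsets


# The three symbolized sequences of equation (11).
sequences = {
    "k1": [("a2", 0, 2), ("a1", 2, 3), ("a0", 3, 5)],
    "k2": [("b1", 0, 1), ("b2", 1, 2), ("b0", 2, 3), ("b2", 3, 4), ("b0", 4, 5)],
    "k3": [("c0", 0, 2), ("c2", 2, 4), ("c1", 4, 5)],
}
for interval, items in align(sequences):
    print(interval, items)
```

Each printed row is one itemset $I$ of equation (9); the first row corresponds to the interval $[0, 1]$ with symbols a2, b1, and c0.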

Figure 5.

Number of frequent itemsets.

As shown in Figure 5, the closer a node's symbol is to the root of the tree, the higher its frequency. Null represents the original sequence, and data mining searches from a to the last item g. For example, the FP-tree of the last item g includes three branches, namely <a: 4, b: 3, c: 3, e: 1, g: 1>, <a: 4, b: 3, c: 3, f: 1, g: 1>, and <a: 4, d: 1, e: 1, g: 1>. Although a appears four times in total, it appears only once on each path that it shares with g. The prefix paths counted for g are therefore <abceg: 1>, <abcfg: 1>, and <adeg: 1>. The relevant indicator of the FP-growth algorithm is support, and the mathematical expression for support is shown in equation (13).

\mathrm{support}(a \to b) = \frac{P(a \cup b)}{m}

(13)

In equation (13), $a$ and $b$ indicate two random segmentation points; $\mathrm{support}(a \to b)$ denotes the support of the segmentation point $a$ relative to the segmentation point $b$; $P$ refers to the probability. The confidence expression for frequent itemsets is shown in equation (14).

\mathrm{confidence}(a \to b) = \frac{P(a \cup b)}{P(a)} = P(b \mid a)

(14)

In equation (14), $\mathrm{confidence}(a \to b)$ expresses the confidence of the segmentation point $a$ relative to the segmentation point $b$. Starting from the time series and symbol sequence of a certain type of e-commerce return data, and setting an appropriate window size, a set of symbol sequences is obtained in which every item is a character sequence of the same time length. After this partitioning, the frequent sequence mining algorithm is used to mine frequent itemsets from the set of symbol sequences. The frequent sequences mined are sorted by support count and reflect the trends and patterns of different product returns. This makes it possible to judge how effective time series clustering is for a given type of e-commerce return data.
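To make the mining step concrete, the following is a compact FP-growth sketch applied to the five aligned itemsets of the worked example. It is a teaching sketch rather than the implementation evaluated in this paper. It builds an FP-tree, follows parent links to collect the prefix paths of each item, and recurses on the resulting conditional pattern bases. Support is computed as the usual relative frequency (itemset count divided by the number of itemsets) and confidence as the count ratio; both are assumptions about how equations (13) and (14) are applied. All names are hypothetical.

```python
from collections import defaultdict


class Node:
    """One FP-tree node: an item, its count, and links to parent and children."""

    def __init__(self, item, parent):
        self.item, self.parent = item, parent
        self.count = 0
        self.children = {}


def build_tree(weighted, min_count):
    """Build an FP-tree from (items, count) pairs.

    Returns the header table (item -> its nodes) and the frequent item counts.
    """
    freq = defaultdict(int)
    for items, count in weighted:
        for item in set(items):
            freq[item] += count
    freq = {item: c for item, c in freq.items() if c >= min_count}
    root, header = Node(None, None), defaultdict(list)
    for items, count in weighted:
        # Keep frequent items only, most frequent first (ties broken by name).
        ordered = sorted((i for i in set(items) if i in freq),
                         key=lambda i: (-freq[i], i))
        node = root
        for item in ordered:
            if item not in node.children:
                node.children[item] = Node(item, node)
                header[item].append(node.children[item])
            node = node.children[item]
            node.count += count
    return header, freq


def fp_growth(weighted, min_count, suffix=frozenset()):
    """Return {frozenset(itemset): count} for all frequent itemsets."""
    header, freq = build_tree(weighted, min_count)
    found = {}
    for item, nodes in header.items():
        itemset = suffix | {item}
        found[itemset] = freq[item]
        # Conditional pattern base: the prefix path above each node of `item`.
        base = []
        for node in nodes:
            path, parent = [], node.parent
            while parent.item is not None:
                path.append(parent.item)
                parent = parent.parent
            if path:
                base.append((path, node.count))
        found.update(fp_growth(base, min_count, itemset))
    return found


# Itemsets obtained by aligning the three example sequences.
transactions = [
    ["a2", "b1", "c0"], ["a2", "b2", "c0"], ["a1", "b0", "c2"],
    ["a0", "b2", "c2"], ["a0", "b0", "c1"],
]
counts = fp_growth([(t, 1) for t in transactions], min_count=2)


def support(itemset):
    """Relative frequency of a frequent itemset."""
    return counts[frozenset(itemset)] / len(transactions)


def confidence(antecedent, consequent):
    """Share of itemsets containing the antecedent that also contain the consequent."""
    both = frozenset(antecedent) | frozenset(consequent)
    return counts[both] / counts[frozenset(antecedent)]


for itemset, count in sorted(counts.items(), key=lambda kv: -kv[1]):
    print(sorted(itemset), count)
```

In this toy dataset the only frequent pair is {a2, c0}: both sequences hold those symbols over the intervals $[0, 1]$ and $[1, 2]$, so the rule a2 → c0 has full confidence.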

When using the proposed FIM algorithm to analyze e-commerce return data, considering the privacy and security issues of user data, the research will anonymize user data before data mining, including de-identification, generalization and perturbation technology, so as to avoid the disclosure of personal identity information to the greatest extent. In the transmission and storage of data related to e-commerce returns, secure transmission protocols and encryption algorithms are mainly used to ensure that the data is not accessed and stolen by malicious parties. In addition, for particularly sensitive e-commerce return data, differential privacy technology is mainly used to protect user privacy. Differential privacy is a relatively advanced privacy protection technology, which can enhance the protection of original data and has great advantages in data security. This method can solve the privacy and security problems well.

Results and discussion

This study used e-commerce return related data as experimental data for experiments. The TSS, symbol pairing, and FIM results were divided to understand which indicators have the highest support rate in e-commerce return data. Other relevant data mining algorithms were introduced to compare with the algorithms in this study, and the superiority of this research algorithm was analyzed.²³

Algorithm performance testing

The performance of the proposed algorithm was tested, assuming w = 10, k = 7, T = 20, with a minimum support of 0.02 and a minimum confidence of 0.3. The algorithm proposed in this paper was used for rule mining. The proposed method was compared with the FP-growth algorithm based on a two-dimensional vector table (TFP-growth), Apriori-Hybrid algorithm, Hierarchical Mining (H-Mine) algorithm, and classical FP-growth algorithm. Among them, Apriori-Hybrid is an improved version based on the traditional Apriori algorithm. The H-Mine algorithm reduces the generation of candidate sets through a hierarchical search method, which has stronger scalability. Both are currently advanced algorithms. The hardware and software environments for this experiment are shown in Table 1.

Table 1.

Experimental software and hardware environment.

Environment	Configuration
Experimental platform	Python 3.8+scikit-learn+R-Studio
Processor	Intel Core i7-8750H
Operating system	Windows 11
Hard disk space	256 GB solid-state drive
Running memory	16 GB
Graphics card	Graphics card

To verify the robustness of the proposed method on different datasets, two datasets were selected for testing. One was the T20.I6.450K dataset, which contains 450000 transactions with an average of 20 items per transaction. The study selected one of them. The other was the Mushroom dataset from the University of California Irvine Machine Learning Repository. This dataset has a total sample size of 8124, its item set contains 119 items, and its maximum transaction length is 23. In addition, a t-test was conducted to check whether the differences among the results of the five algorithms were significant, with p < 0.05 indicating a significant difference. The comparison results of the five algorithms on the two selected datasets are shown in Figure 6.

Figure 6.

Comparison results of five algorithms on two selected datasets. (a) Comparison of runtime in T20.I6.450K dataset. (b) Comparison of memory usage in T20.I6.450K dataset. (c) Comparison of runtime in Mushroom dataset. (d) Comparison of memory usage in Mushroom dataset.

Figure 6(a) and (b) show the running time and memory usage of the five methods on the T20.I6.450K dataset, respectively, and Figure 6(c) and (d) show the same measures on the Mushroom dataset. As Figure 6(a) and (b) show, on the T20.I6.450K dataset the running time of the proposed method changed more slowly, peaking at only 65 s, and its maximum memory usage was 350 MB, which was significantly better than the other four methods (p < 0.05). As Figure 6(c) and (d) show, on the Mushroom dataset the running time of the proposed method fluctuated between 10 s and 20 s while trending upward, and its highest memory usage was only 260 MB, a statistically significant advantage (p < 0.05). Overall, the proposed method exhibited better performance and robustness across the different datasets.
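The text does not state how running time and peak memory were recorded. As a minimal sketch, assuming Python's standard time and tracemalloc modules are acceptable stand-ins, one run of a mining routine could be profiled as follows. The names profile_call and count_items are hypothetical, and tracemalloc only tracks memory allocated by the Python interpreter, so the figures will differ from process-level measurements.

```python
import time
import tracemalloc


def profile_call(func, *args, **kwargs):
    """Run func once and return (result, elapsed_seconds, peak_bytes)."""
    tracemalloc.start()
    start = time.perf_counter()
    try:
        result = func(*args, **kwargs)
    finally:
        # Always read the counters and stop tracing, even if func raises.
        elapsed = time.perf_counter() - start
        _, peak = tracemalloc.get_traced_memory()
        tracemalloc.stop()
    return result, elapsed, peak


def count_items(transactions):
    """Hypothetical stand-in for a mining routine: count item occurrences."""
    counts = {}
    for transaction in transactions:
        for item in transaction:
            counts[item] = counts.get(item, 0) + 1
    return counts


# Toy transactions (illustrative only, not the paper's data).
toy = [["a0", "c2"], ["a0", "d0"], ["c2", "d0"]] * 1000
_, seconds, peak = profile_call(count_items, toy)
print(f"runtime: {seconds:.4f} s, peak memory: {peak / 1e6:.3f} MB")
```

Repeating such a measurement over several support thresholds would give curves comparable in shape to those plotted in the figures, although absolute values depend on the machine and the implementation.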

E-commerce return data mining testing

The experimental data consisted of all the return data of one company from 2017 to 2022, provided by the Alibaba Tianchi database. The company operates in multiple categories such as clothing, food, and household goods, with daily sales exceeding 40000. It ranks among the top three in sales on the Taobao platform and can serve as a representative case for analyzing e-commerce return data. The experimental data included 13 e-commerce related elements: store reputation, product rating, product specifications, whether shipping was included, list price, transaction price, sales volume, number of successful transactions, number of comments, logistics anomalies, number of collections, service commitment, and payment method.²⁴ Among them, sales volume is an important indicator of a product's popularity in the market. Products with high sales generally indicate higher quality and customer satisfaction and therefore a lower likelihood of returns, and vice versa. Service commitment covers the return policy, refund process, and similar terms. It is the commitment made by e-commerce platforms or sellers regarding after-sales service. The logistics situation directly affects when and in what condition customers receive products. These three indicators directly affect the shopping experience and satisfaction of customers and are key factors affecting the e-commerce return rate. Therefore, to reflect the return data in a focused way and reduce interference from less relevant factors, the study selected sales volume, service commitment, and logistics situation, together with the relevant time data, as the experimental data for presenting the data mining results. The symbols for the original e-commerce return related elements are displayed in Table 2.

Table 2.

Original element symbol table.

Element	Rising mode	Unchanged mode	Falling mode
Sales volume	a0	a1	a2
Service commitment	b0	b1	b2
Logistics situation	c0	c1	c2
Store reputation	d0	d1	d2
Product rating	e0	e1	e2
Product specifications	f0	f1	f2
Free shipping	g0	g1	g2
List price	h0	h1	h2
Transaction price	i0	i1	i2
Number of successful transactions	j0	j1	j2
Number of comments	k0	k1	k2
Number of collections	m0	m1	m2
Payment method	n0	n1	n2

As shown in Table 2, the 13 original indicators were represented by letter symbols, such as a for sales, b for service commitment, and c for logistics. Each time series curve was represented by three modes: rising, falling, and basically unchanged. Taking the sales element in the table as an example, a0 represents an increase in product sales, a1 indicates basically unchanged sales, and a2 denotes a decrease in sales. For service commitment, b0 refers to additional service commitments, such as freight insurance; b1 denotes basic service commitments, such as 7-day no-reason return or exchange; and b2 denotes no service commitments. Using the first three elements in Table 2 as experimental data, the incremental error segmentation (IES) algorithm designed in this study was compared with the classic sliding window segmentation algorithm. The results were compared and analyzed in terms of compression rate and fitting error, as shown in Figure 7.
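The symbol scheme for trend elements can be illustrated with a short sketch. It assumes that each linear segment has already been reduced to a slope and that a tolerance band around zero marks the basically unchanged mode; the tolerance value and the function names are assumptions made for illustration, not values taken from the study. Service commitment is left out because its codes are described categorically rather than as trends.

```python
# Letter prefixes follow Table 2; only trend-type elements are shown here.
ELEMENT_LETTERS = {"sales volume": "a", "logistics situation": "c"}


def symbolize_segment(element, slope, flat_tolerance=0.05):
    """Map one fitted segment slope to a Table 2 style symbol such as 'a0'.

    flat_tolerance is an assumed threshold, not a value from the paper.
    """
    letter = ELEMENT_LETTERS[element]
    if slope > flat_tolerance:
        return letter + "0"  # rising mode
    if slope < -flat_tolerance:
        return letter + "2"  # falling mode
    return letter + "1"      # basically unchanged mode


def symbolize_series(element, slopes, flat_tolerance=0.05):
    """Symbolize every segment slope of one element, in time order."""
    return [symbolize_segment(element, s, flat_tolerance) for s in slopes]


print(symbolize_series("sales volume", [0.8, 0.01, -0.4]))  # ['a0', 'a1', 'a2']
```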

Figure 7.

Comparison chart of linear segmentation results. (a) Compression ratio comparison chart. (b) Comparison of fitting errors.

As shown in Figure 7(a), the compression rate of the incremental error segmentation method in this study was basically the same as that of the sliding window segmentation algorithm, with both above 90%. The service commitment element had the highest compression rate, 93.2%, and the overall compression and segmentation effect was excellent. In Figure 7(b), the fitting error of the incremental error segmentation method was significantly smaller than that of the sliding window segmentation method. The sales element had the smallest fitting error, 2.467, which was 2.074 lower than with the sliding window method, and the fitting error decreased by 1.382 on average. Overall, the method used in this study significantly reduced fitting errors, giving smaller linear segmentation errors and higher performance, with little change in compression rate compared to the traditional method. To further assess the effectiveness of the method designed in this study, it was compared with the traditional frequent itemset algorithm Apriori and the unimproved FP-growth algorithm on the same dataset. Peak memory usage and running time were used to reflect the algorithms' performance. The comparison is shown in Figure 8.
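To make the two evaluation measures concrete, the sketch below segments a series greedily by growing each segment one point at a time and cutting when an error bound would be exceeded. It is not the IES algorithm of this study. The error measure (sum of squared deviations from the line joining the segment endpoints), the compression rate definition (one minus the fraction of retained points), the threshold, and all function names are assumptions chosen for illustration.

```python
def interpolation_error(series, start, end):
    """Sum of squared deviations from the straight line joining two samples."""
    if end - start < 2:
        return 0.0
    slope = (series[end] - series[start]) / (end - start)
    return sum(
        (series[i] - (series[start] + slope * (i - start))) ** 2
        for i in range(start + 1, end)
    )


def incremental_segmentation(series, max_error):
    """Grow each segment point by point; cut when the error bound is exceeded.

    Returns the indices of the retained breakpoints, including both ends.
    """
    if len(series) < 2:
        return list(range(len(series)))
    breakpoints = [0]
    start, end = 0, 1
    while end < len(series) - 1:
        # Would extending the current segment by one point break the bound?
        if interpolation_error(series, start, end + 1) > max_error:
            breakpoints.append(end)
            start = end
        end += 1
    breakpoints.append(len(series) - 1)
    return breakpoints


def compression_rate(series, breakpoints):
    """Assumed definition: share of original points that are not retained."""
    return 1.0 - len(breakpoints) / len(series)


def total_fitting_error(series, breakpoints):
    """Sum the per-segment interpolation errors over all segments."""
    return sum(
        interpolation_error(series, a, b)
        for a, b in zip(breakpoints, breakpoints[1:])
    )


toy = [0, 1, 2, 3, 2, 1, 0]  # illustrative series, not the paper's data
cuts = incremental_segmentation(toy, max_error=0.1)
print(cuts)  # [0, 3, 6]
print(compression_rate(toy, cuts), total_fitting_error(toy, cuts))
```

A lower error bound retains more breakpoints, which lowers both the fitting error and the compression rate, so the two measures have to be read together when comparing segmentation methods.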

Figure 8.

Comparison chart of algorithm performance. (a) Memory usage comparison chart. (b) Mining time comparison chart.

As shown in Figure 8(a), the peak memory of all three algorithms decreased as the support increased. With higher support, the number of data segmentation points decreased, fewer itemsets satisfied the condition, and the peak memory fell accordingly. At a support of 20%, the peak memory of the Apriori algorithm, the unimproved FP-growth algorithm, and the improved FP-growth algorithm was 284 MB, 263 MB, and 241 MB, respectively. When the support increased to 60%, the peak memory of the three algorithms decreased to 238 MB, 228 MB, and 204 MB, respectively. The peak memory of the algorithm proposed in this study was always the lowest, reaching only 187 MB at a support of 70%, which was significantly lower than that of the other algorithms. As shown in Figure 8(b), the running time of each algorithm also decreased as the support increased and stabilized at about 70% support. The running time of the proposed algorithm was the lowest overall, only 63.54 s, which was 18.54 s less than that of the unimproved FP-growth algorithm. The improved FP-growth algorithm proposed in this study therefore had significant advantages in memory consumption, running time, and stability.

To further validate the effectiveness of the proposed method in e-commerce return data mining, three indicators were selected for evaluation: accuracy, recall, and F1 value. Accuracy evaluates how accurately the model predicts return behavior, while recall reflects the model's ability to identify return behavior. The F1 value is a composite indicator that takes both into account. The study also compared the methods in references 24 and 25, both of which are improved data mining algorithms that perform well in data processing in different fields. The test results of the five methods on this e-commerce return data are shown in Table 3.

Table 3.

Test results of five methods in the e-commerce return data (%).

Method type	Accuracy	Recall	F1 value
Apriori	85.74	80.55	82.74
FP-growth	86.33	85.41	85.29
Reference 24	88.07	82.95	85.61
Reference 25	89.01	85.32	87.45
This study	96.63	91.59	93.22

As Table 3 shows, on the e-commerce return data the method proposed in this study achieved an accuracy, recall, and F1 value of 96.63%, 91.59%, and 93.22%, respectively. All three indicators exceeded 90% and were clearly better than those of the four compared methods. This shows that the proposed method can mine return data accurately and has significant overall advantages.
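For reference, the standard definitions of these indicators can be written out from confusion counts. The sketch below is an assumption about how such scores are usually computed, since the text does not give the exact formulas used for Table 3. In particular, the standard F1 value combines precision and recall, so precision is reported alongside accuracy. The function name and the toy counts are hypothetical.

```python
def classification_metrics(tp, fp, fn, tn):
    """Return accuracy, precision, recall and F1 from confusion counts.

    tp/fp/fn/tn are true positives, false positives, false negatives and
    true negatives, with "positive" meaning an order predicted as returned.
    """
    total = tp + fp + fn + tn
    accuracy = (tp + tn) / total if total else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}


# Toy counts for illustration only; they are not taken from Table 3.
scores = classification_metrics(tp=8, fp=2, fn=2, tn=8)
print({name: round(value, 3) for name, value in scores.items()})
```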

Empirical analysis of e-commerce return data

Next, an empirical analysis was conducted on the return data of the merchant selected for the study, and data mining experiments were run on the store's 13 original time series. The time series alignment, symbolization, and final data mining results were analyzed separately. The time series alignment results for the 13 indicators are displayed in Table 4.

Table 4.

Time series alignment results.

Return element	Number of original time series points	Number of important points	Number of segments before alignment	Number of segments after alignment
Sales volume	600	573	103	163
Service commitment	600	454	102	163
Logistics situation	600	498	79	163
Store reputation	600	562	83	163
Product rating	600	475	94	163
Product specifications	600	465	63	163
Free shipping	600	457	163	163
List price	600	437	127	163
Transaction price	600	478	130	163
Number of successful transactions	600	472	109	163
Number of comments	600	458	98	163
Number of collections	600	468	134	163
Payment method	600	467	115	163

As shown in Table 4, 600 original time series points were selected for each indicator, and most indicators had between 400 and 500 important points. The sales and store reputation indicators had more important points (573 and 562, respectively), indicating that these time series were relatively complex and might have a high support rate. After the sequence alignment operation, every indicator had 163 segments. Next, the aligned subsequences were symbolized using the symbols listed in Table 2. Part of the symbolized data for the first three indicators in Table 2 was selected for analysis, and the results are shown in Figure 9.

Figure 9.

Symbolized schematic diagram. (a) Symbolization diagram of sales elements. (b) Symbolic diagram of service commitment elements. (c) Symbolization diagram of logistics abnormal elements.

As shown in Figure 9, the blue curve represents the raw data, green represents the rising mode, red represents the falling mode, and purple represents the basically unchanged mode. The trend of the sales indicator was relatively flat, with similar numbers of upward and downward line segments, indicating multiple small fluctuations and no significant overall increase or decrease. The sales indicator also had the fewest basically unchanged subsequences, and the sales and service commitment sequences had more symbolic subsequences. In addition, the trend of the symbolized line segments was basically consistent with the original data, indicating that the symbolization algorithm based on incremental error in this study can adequately represent the original data for the experiments. FIM was then performed on the symbol itemsets obtained after symbolization, and the eight itemsets with the highest support were output, as shown in Figure 10.

Figure 10.

FIM Results. (a) All element data mining results. (b) Sales element data mining results.

Figure 10(a) shows the data mining results for all 13 time series. The itemsets [a2, d0] and [c2, d0] had the highest support rates, indicating high support for sales (a), logistics anomalies (c), and store reputation (d). Logistics anomalies had the greatest impact on return data, with a strong correlation between the rise and fall of logistics anomalies and the decline of store reputation. Figure 10(b) shows the data mining results for the sales series, in which the support rates of the individual sales symbols did not differ significantly. The support rates of the rising and falling modes were consistent, and this high correlation was in line with the actual situation. This indirectly verified the accuracy of the FIM results for all time series, indicating that the algorithm performed well.
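The ranking of itemsets by support can be illustrated with a small sketch. It treats each aligned segment as one transaction holding one symbol per element and counts itemsets by exhaustive enumeration, which is not the improved FP-growth algorithm used in the study. The function name, the toy transactions, and the choice of itemset size are assumptions; the minimum support of 0.02 and the top eight itemsets mirror the settings mentioned in the text.

```python
from collections import Counter
from itertools import combinations


def top_itemsets(transactions, size=2, min_support=0.02, top_k=8):
    """Return up to top_k itemsets of a given size, ranked by support."""
    n = len(transactions)
    if n == 0:
        return []
    counts = Counter()
    for transaction in transactions:
        # Sort so the same itemset always yields the same tuple key.
        for itemset in combinations(sorted(set(transaction)), size):
            counts[itemset] += 1
    ranked = [
        (itemset, count / n)
        for itemset, count in counts.items()
        if count / n >= min_support
    ]
    # Highest support first; ties broken alphabetically for stable output.
    ranked.sort(key=lambda pair: (-pair[1], pair[0]))
    return ranked[:top_k]


# Toy symbol transactions (hypothetical, not the merchant's data).
segments = [
    ["a2", "c2", "d0"],
    ["a2", "d0"],
    ["c2", "d0"],
    ["a0", "c0", "d1"],
]
for itemset, support in top_itemsets(segments, top_k=2):
    print(itemset, support)
```

Exhaustive enumeration grows quickly with itemset size, which is the cost that tree-based methods such as FP-growth are designed to avoid on larger symbol sets.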

Conclusion

Traditional FIM is a static data mining approach that does not consider the impact of time series on the data. This study used FIM algorithms based on TSS as the data clustering method. It used IES instead of traditional sliding window segmentation to optimize the linear segmentation effect, selected the improved FP-growth algorithm for the data mining experiments, and introduced other related algorithms for comparison. The experimental findings were as follows. Firstly, across the different element sequences, the compression rate of the IES method was basically the same as that of the traditional sliding window method, with both above 90%. The fitting error of the incremental method was significantly reduced, with the lowest value, 2.467 for the sales element, being 2.074 lower than with the traditional method. Secondly, at the same support rate, the algorithm in this study had a memory peak of about 200 MB and a running time of about 60 s, both lower than those of the traditional frequent itemset algorithm Apriori and the unimproved FP-growth algorithm, indicating that it was the most efficient of the compared algorithms. Finally, an analysis was conducted on the merchant that provided the experimental data. The mining results for the original time series of the 13 elements showed that the increase in sales, the decrease in logistics anomalies, and the decline in store reputation had high support rates and the greatest impact on product returns. Merchants should focus on these three aspects to reduce product return rates. Future research can further improve the time series symbolization and error segmentation strategies and design more flexible and refined processing methods for elements that change little over time, so that key data are not missed. In addition, cross-platform data fusion analysis can be introduced, combining multi-source data such as social media and user feedback, to improve the accuracy of return prediction and the effectiveness of merchants' optimization decisions.

Statements and declarations

Conflicting interest

The author(s) declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.

Funding

The authors received no financial support for the research, authorship, and/or publication of this article.

References

1. Cheng, Shan, Zhuang, et al. When virtual network operator meets e-commerce platform: advertising via data reward. IEEE Trans Mobile Comput 2022; 22(12): 7370–7386.

2. Wang, Liu, et al. A multiperspective fraud detection method for multiparticipant e-commerce transactions. IEEE TCSS 2023; 11(2): 1564–1576.

3. Ebrahimi, Chai, Zhang, et al. Heterogeneous domain adaptation with adversarial neural representation learning: experiments on e-commerce and cybersecurity. IEEE T Pattern Anal 2022; 45(2): 1862–1875.

4. Zhang, Zheng, Lin, et al. Vehicle trajectory data mining for artificial intelligence and real-time traffic information extraction. IEEE T Intell Transp 2023; 24(11): 13088–13098.

5. Kim, Yun, et al. Periodicity-oriented data analytics on time-series data for intelligence system. IEEE Syst J 2020; 99: 1–12.

6. Time works well: dynamic time warping based on time weighting for time series data mining. Inform Sciences 2021; 547: 592–608.

7. Fournier-Viger, Liu, et al. Mining high utility itemsets using prefix trees and utility vectors. IEEE T Knowl Data En 2023; 35(10): 10224–10236.

8. Jiang, Zhang, et al. A malicious mining code detection method based on multi-features fusion. IEEE T Netw Sci Eng 2022; 10(5): 2731–2739.

9. Alyasiri, Ali. Exploring GPT-4’s characteristics through the 5Vs of big data: a brief perspective. J Artif Intell Res 2023; 2023: 5–9.

10. Guo, Wang, Peng, et al. A data mining-based study on medication rules of Chinese herbs to treat heart failure with preserved ejection fraction. Chin J Integr Med 2022; 28(9): 847–854.

11. Peng, Liu. Haze pollution causality mining and prediction based on multi-dimensional time series with PS-FCM. Inf Sci 2020; 523: 307–317.

12. Feng, Sun. DCE-miner: an association rule mining algorithm for multimedia based on the MapReduce framework. Multimed Tools Appl 2020; 79(23): 16771–16793.

13. Diamond, Happawana. Association rule learning in neuropsychological data analysis for Alzheimer's disease. J Neuropsychol 2022; 16(1): 116–130.

14. Kang, Porter, Bohemia. Using the fuzzy weighted association rule mining approach to develop a customer satisfaction product form. J Intell Fuzzy Syst 2020; 38(2): 1–15.

15. Ahmed, Nath. Identification of adverse disease agents and risk analysis using frequent pattern mining. Inf Sci 2021; 576(2): 609–641.

16. Samal, Samantaray, Sharma. Data-mining model-based enhanced differential relaying scheme for microgrids. IEEE Syst J 2022; 17(3): 3623–3634.

17. Wang, Min, et al. Spatial-temporal cellular traffic prediction for 5G and beyond: a graph neural networks-based approach. IEEE T Ind Inform 2022; 19(4): 5722–5731.

18. Sheng, Xue, et al. Graph-based spatial-temporal convolutional network for vehicle trajectory prediction in autonomous driving. IEEE T Intell Transp 2022; 23(10): 17654–17665.

19. Liu, Zhang, Wang, et al. Spatial-temporal conv-sequence learning with accident encoding for traffic flow prediction. IEEE T Netw Sci Eng 2022; 9(3): 1765–1775.

20. Kahraman, Tunga, Ayvaz, et al. Understanding the purchase behaviour of Turkish consumers in B2C e-commerce. IJISAE 2019; 7(1): 52–59.

21. Lin, Wang. Dstgcn: dynamic spatial-temporal graph convolutional network for traffic prediction. IEEE Sens J 2022; 22(13): 13116–13124.

22. Azadani, Boukerche. A novel multimodal vehicle path prediction method based on temporal convolutional networks. IEEE T Intell Transp 2022; 23(12): 25384–25395.

23. Huang, et al. On understanding of spatiotemporal prediction model. IEEE T Circ Syst Vid 2022; 33(7): 3087–3103.

24. Zhan, Gong, et al. A spatial-temporal transformer network for city-level cellular traffic analysis and prediction. IEEE T Wirel Commun 2023; 22(12): 9412–9423.

25. Duan, Chen, Shen, et al. FDSA-STG: fully dynamic self-attention Spatio-temporal graph networks for intelligent traffic flow prediction. IEEE T Veh Technol 2022; 71(9): 9250–9260.