Abstract
With the emergence of e-commerce, more and more people conduct transactions through the Internet, which generates a large volume of transaction data. Data mining decomposes large amounts of data according to data rules and analyzes network transaction data, so as to give companies the data support they need to analyze the market and develop their business. Although time series data mining is a narrower field than other types of data mining, it is still an important topic. In the real world, correlation between data and time is very common, so the study of time series models plays a very important role in data mining. Analyses of Taobao data differ according to their purposes. Beyond descriptive statistics, in-depth research and analysis of Taobao data are currently insufficient, and analysis of Taobao transaction data based on time series is rare. In order to improve the accuracy of Taobao transaction data mining and better formulate Taobao marketing strategy, this paper used time series data mining technology to mine Taobao transaction data. This paper first introduced the role of Taobao transaction data mining, and then described the calculation methods of time series data mining, including the re-description of time series and the similarity measurement of time series. Finally, through data collection, data processing and data feature extraction, a data mining model for Taobao transactions was established, and two evaluation indicators for data prediction, namely prediction accuracy and entropy, were proposed. The experimental part verified the effect of Taobao transaction data mining. The experimental results showed that the data mining model performed well on both prediction accuracy and entropy. The average data prediction accuracy was 94.26%, indicating strong data mining ability.
Introduction
In recent years, China’s e-commerce industry has been booming and has become a powerful force for the country’s rapid and sustainable economic development. However, with the wide range of products available on e-commerce platforms, consumers can easily suffer from “shopping fatigue”, also known as “information overload”. In the case of Taobao, in order to alleviate the problem of “information overload” and to cope with the increasingly competitive market, the platform has spent huge amounts of money and resources on promotional activities. However, the inability to accurately target users has led to undifferentiated advertising and product recommendations, which has had a significant negative impact on users. The key question is therefore how to provide consumers with a more personalised and lasting service experience and, on this basis, how to design a precise marketing plan for merchants so that both consumers and platforms benefit.
Taobao is a leading online trading platform. The online market not only represents the most active frontier of the market, but also reflects the entire offline market. Therefore, for traditional enterprises that want to develop e-commerce brands, studying Taobao data and future trends is of great reference value for creating and improving an e-commerce platform. Taobao’s core data are generated by the transactions between buyers and sellers, which in turn extend to a large amount of related data and information. This information forms data of great commercial value, thus laying the foundation for Taobao to become a core service provider and carry out data development.
At present, there is a considerable body of research on Taobao transactions. Li Guangqin used the continuous difference analysis method to study the impact of Taobao Village on the income of rural residents and its mechanism [1]. Tang Ke examined the impact of industrial clusters represented by “Taobao Village/Town” on the per capita income of residents [2]. Wei Yehua Dennis investigated the development track and organizational structure of Taobao Village to better understand rural e-commerce and regional development [3]. Zhang Zhuyao took the Taobao e-commerce platform as an example to explore the impact of online customers’ perceived usefulness, perceived ease of use, perceived cost, trust and satisfaction on repurchase intention [4]. Hu Jingjing took Taobao transactions as an example to conduct research on e-commerce seller credit rating [5]. Wang Cassandra C used original interview data from the Taobao Village phenomenon to study the multifaceted nature of rural transformation caused by the disintermediation of e-commerce [6]. Although there are many studies on Taobao transactions, how to conduct data mining on Taobao transaction data needs further study.
Many scholars have studied the big data analysis of the trading market. Ye Xinyue conducted big data analysis on electronic market transactions based on online transaction data of Taobao platform [7]. Aloysius John A studied the relationship between service process and shopping results based on massive data generated by customer transactions in retail stores [8]. Fu Hanliang analyzed consumers’ consumption decisions based on users’ online comment search behavior [9]. Phucharoen Chayanon used big data analysis technology to investigate the shopping experience of tourists in department stores and street markets [10]. Boone Tonya used big data technology to analyze the supply chain sales forecast [11]. Schlegel Alexander used big data analysis to conduct comprehensive business planning for sales and operation [12]. Although there were many related studies on big data analysis in the trading market, how to mine Taobao trading data based on time series analysis needed further research.
In order to better analyze the behavior characteristics of Taobao users and achieve more accurate commodity marketing, this paper used time series data mining to predict and analyze Taobao transaction data. In e-commerce, data mining serves to mine potential customers, infer customer needs, maintain existing customers, and cluster customers. This paper introduced the four steps of the data mining process: clarifying the problem, preparing the data, establishing the model and checking the model. It then introduced the time series re-description and time series similarity measurement used in time series data mining, and finally established the Taobao transaction data mining model. The experimental part collected Taobao transaction data to verify the prediction and analysis performance of the data mining model.
Application of data mining in E-commerce
E-commerce is characterised by low initial investment cost, a wide variety of product information, broad target groups, and few geographical limitations. It is gradually becoming a rapidly developing new form of business. Customers’ history, purchase preferences and purchasing power are the most important information that e-commerce platform managers and sales personnel pay attention to. E-commerce not only greatly reduces the input cost of platform sales personnel, but also allows extensive publicity. Because of the convenient shopping experience, people are gradually changing their shopping habits, which brings new development opportunities to most small and medium-sized businesses. When data mining technology is applied to e-commerce, different data mining algorithms are selected according to the different goals of users, so as to meet their needs and achieve the expected goals. Sometimes multiple data mining algorithms are combined to obtain better results [13]. The role of data mining in e-commerce is shown in Fig. 1.
The role of data mining in e-commerce.
The main task of e-commerce is to attract customers. The prominent role of data mining technology is to attract potential customers through data analysis [14]. The server log keeps a complete record of user access. With the help of data mining technology, e-commerce platform administrators can accurately identify user preferences and potential needs, and can aggregate customers with similar characteristics from the Web lookup information produced during the e-commerce process. Customer segmentation can help companies develop and implement their marketing strategies and provide customized packaging services. Nowadays it is not only the Taobao e-commerce platform that attracts potential customers: advertising on public transport, in subways and in elevators can also attract customers’ attention to achieve sales objectives.
Speculation of customer needs
The understanding of customers’ real and potential needs has become the central theme of e-commerce research. Through data mining, customers’ interests and preferences are calculated. Personalized suggestions can accurately identify the needs of potential users and provide personalized services. The most important thing is to predict the potential purchase opportunities according to users’ purchase habits and preferences, and customize services for each customer to promote users’ impulse consumption. The information about customer attributes is collected and summarized, including gender, occupation, age, delivery address, etc., and classified according to different standards. The common characteristics of customers are marked in each classification, such as consumption capacity, purchase demand, etc. According to its characteristics, appropriate strategies are adopted so as to maximize the customer’s purchase needs and create advantages for the e-commerce platform.
Maintenance of existing customers
According to the purchasing cycle and purchasing power, customers are grouped. On this basis, according to the actual situation of existing customers, suggestions and services are provided to meet customers’ needs. The satisfaction and recognition of purchasing goods and services on the e-commerce platform is improved, and existing customers are transformed into loyal customers. Data mining can be used to determine the cause of potential customer loss and propose appropriate measures to improve or change the company’s advertising and marketing strategies.
Data anomalies or extremes, including irregular data, anomalous instances and deviations of observations from expectations, are described and analysed, mainly in order to analyse abnormal customer behaviour. From the records of customer search times, similar products and sales, customer purchase rules are extracted to evaluate and calculate the customer’s consumption cycle and interest in products. In combination with seasonal changes and consumption trends, corresponding promotion strategies and discounts for various products are introduced. The daily purchasing activities are limited, and the products sold are mainly products with relatively large quantities. For some existing customers, Taobao uses its mobile application or website to send push notifications based on their previous purchases. Some Taobao stores send new discount information at any time to the phone number that customers left when purchasing, so as to reduce customer loss [15].
Clustering of existing customers
Customer clustering analysis is also widely used in e-commerce. The data are analyzed, and customers with similar browsing histories and similar product preferences are grouped into the same user group. Cluster analysis shows that users with similar browsing trajectories may have the same interests and preferences when purchasing. The purchase preferences of one user can therefore be recommended to other users in the same group, in order to determine the potential demand of that user category and predict user behavior as far as possible. The list on Taobao is sorted according to what customers last viewed and their purchase preferences. A column displayed at the bottom of the shopping cart recommends similar products according to the products the customer last viewed.
The objectives of data mining in e-commerce are as follows. (1) Helping companies determine the marketing mechanism: in e-commerce, business information comes from various channels. Once this information is processed with data mining techniques, it yields decision-making information that can be used for marketing targeted at specific consumer groups or individuals, which determines the marketing mechanism of e-commerce. E-commerce marketing based on data mining can, for example, send consumers marketing materials related to their previous consumer behaviour. (2) Increasing customer loyalty: the content and hierarchy of a web site, the wording used, headings, incentive programmes, services, etc. can all be factors in attracting or losing customers. E-commerce sites can have millions of online transactions every day, which generate a large number of log records and registration forms. It therefore becomes imperative to analyse and mine these data, to fully understand customers’ preferences and buying patterns, and to design a personalised website that meets the needs of different customer groups and thus increases competitiveness.
Data mining technology based on time series
Data mining
In recent years, data mining has been a hot topic in the field of database applications. It analyses processed data using proven techniques and methods from artificial intelligence, and combining several techniques and methods can make the information obtained more accurate. Because its results serve as reference information for decision-making, data mining can be regarded as intelligent knowledge discovery in databases. The data mining process is shown in Fig. 2.
Data mining process.
The first step of data mining is to determine the problems, objectives and requirements to be solved, and study the relevant context, so as to determine the tasks to be completed. For example, in order to know what potential customers like, the sales of each product must be confirmed and customer data must be searched. By clearly defining the objectives of the survey, the key issues are translated into quantitative objectives and data analysis.
Data preparation
After determining the problem and research task, it is necessary to collect data for the quantitative purpose of the research problem, and to describe the baseline data and data attributes in detail to help control the availability and applicability of the data. The purpose of data collection is to identify the object on which the discovery task operates, that is, the target data: a set of data extracted from the original database according to the user’s needs. Pre-processing is also required before analyzing the data. The data format must be standardized, incorrect data must be deleted, and missing data must be filled in. Data preprocessing generally includes removing noise, deriving and calculating missing data, eliminating duplicate records, and completing data type conversions, such as converting continuous data to discrete data for symbolic summarisation or discrete data to continuous data for neural network calculations. It also includes dimensionality reduction, i.e., identifying the genuinely useful features among the initial ones in order to reduce the number of variables that data mining has to take into account. Some variables cannot be used directly and must be processed. This is troublesome, but it is a very important factor that affects the modelling effect.
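The preprocessing steps listed above (eliminating duplicate records, filling in missing data, and converting continuous data to discrete data) can be illustrated with a minimal Python sketch. The record layout `(record_id, value)`, the mean-based fill rule and the bin boundaries are assumptions made for illustration only; the paper does not specify them.

```python
from statistics import mean


def preprocess(records, bins):
    """Deduplicate, fill missing values, then discretize.

    records: list of (record_id, value) pairs; value may be None (missing).
    bins: ascending upper bounds that map a continuous value to a bin index.
    Both the layout and the fill rule are illustrative assumptions.
    """
    # 1. Eliminate duplicate records, keeping the first occurrence of each id.
    seen, unique = set(), []
    for rec_id, value in records:
        if rec_id not in seen:
            seen.add(rec_id)
            unique.append((rec_id, value))

    # 2. Fill in missing values with the mean of the observed values.
    observed = [v for _, v in unique if v is not None]
    fill = mean(observed)
    filled = [(rec_id, fill if v is None else v) for rec_id, v in unique]

    # 3. Convert each continuous value to a discrete bin index.
    def to_bin(v):
        for idx, upper in enumerate(bins):
            if v <= upper:
                return idx
        return len(bins)

    return [(rec_id, v, to_bin(v)) for rec_id, v in filled]


raw = [("o1", 20.0), ("o2", None), ("o1", 20.0), ("o3", 100.0)]
clean = preprocess(raw, bins=[50.0, 150.0])
print(clean)
```

Other fill rules (median, interpolation along the time axis) would fit the same structure; the mean is used here only because it is the simplest choice.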
Model establishment
Modeling is the core work of data mining: it addresses the clearly defined problem after that problem has been converted into quantitative objectives. This step includes algorithm and parameter selection. Different problems involve different types and characteristics of data, so the algorithm selected before modeling also differs. The model should be described according to the problem to be solved and the data. After the algorithm is selected, the model is run on the preprocessed dataset and its parameters are described. Modeling is a dynamic process that needs constant correction and adjustment. An initial model is built on a data set, and the test results are then used to evaluate alternative models through modification and adjustment, in order to find the most useful model. In addition, each specific problem needs to be analyzed in detail.
Model inspection
The construction of the model is a dynamic process, which needs to track the market and continuously test the relevance of the model. This step covers the interpretation and evaluation of the results. The knowledge found in the data mining phase may turn out to be redundant or irrelevant, in which case it needs to be eliminated. It is also possible that the knowledge does not satisfy the user’s requirements, in which case the mining process needs to be repeated. Whether the data have changed abnormally or the model itself has not met expectations, adjustments may be required, and these need to be recorded in detail.
The above four steps are not single chain operation but form a cycle. The fourth step is not the end point. This step focuses on tracking the execution data of the model. With the development of the market, new problems that need to be further solved are found in the monitoring process. This would return to the first step and repeat the data preparation, model construction and model validation to build the most suitable model for the current problem.
Mining of time series data
A time series is a sequence of data points recorded in time order, usually at equal intervals. Time series mining examines a database of time-stamped events in chronological order, identifies one or more similar time series events in it, and searches the time series for patterns with a high probability of recurrence. Discovering sequential patterns helps e-commerce organisations to predict customer lookup patterns and thus target services to customers. Time series can be divided into two types: discrete and continuous. Research usually focuses on discrete time series. Therefore, a time series given by time points and their data values can be written in the following format:

$$X = \{(t_1, x_1), (t_2, x_2), \ldots, (t_n, x_n)\}$$

where $x_i$ is the value observed at time $t_i$. If the time series has the same sampling interval 1, that is, $t_{i+1} - t_i = 1$ for all $i$, it can be abbreviated as:

$$X = \{x_1, x_2, \ldots, x_n\}$$
Because time series data are large in volume and updated quickly, it is difficult to perform mining operations directly on the original time series. Doing so is not only inefficient, but also unreliable. The re-description method that is chosen greatly affects the complexity and effectiveness of time series data analysis, so a time series re-description method is introduced here.
Symbolic aggregate approximation (SAX) is to transform the original time series into a string representation through discretization. Each character in the string represents specific information in a specific sub-segment. The SAX method is the first symbolic representation that reduces the dimension of time series and has a tight lower bound on Euclidean distance. In order to represent a time series of length n as a string of length w, the process is as follows.
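First, the original time series $X = \{x_1, x_2, \ldots, x_n\}$ is normalized by the following formula to obtain a series that follows a normal distribution with an average value of 0 and a standard deviation of 1:

$$x'_i = \frac{x_i - \mu}{\sigma}, \quad i = 1, 2, \ldots, n$$

Among them, $\mu$ is the mean of $X$ and $\sigma$ is its standard deviation.

Second, the piecewise aggregate approximation (PAA) method is used to process the normalized time series, which is divided into $w$ sub-segments of equal length, and each sub-segment is represented by its mean:

$$\bar{x}_j = \frac{w}{n} \sum_{i = \frac{n}{w}(j-1) + 1}^{\frac{n}{w} j} x'_i, \quad j = 1, 2, \ldots, w$$

Third, the number of equal-probability intervals, that is, the number of letters, is set to $a$. The ordered set of breakpoints $\beta_1 < \beta_2 < \cdots < \beta_{a-1}$ divides the standard normal distribution into $a$ intervals of equal probability. Each mean $\bar{x}_j$ is mapped to the letter of the interval it falls into: if $\beta_{k-1} \le \bar{x}_j < \beta_k$ (with $\beta_0 = -\infty$ and $\beta_a = +\infty$), then $\bar{x}_j$ is represented by the $k$-th letter of the alphabet.

The result is the symbolic time series $\hat{X} = \{\hat{x}_1, \hat{x}_2, \ldots, \hat{x}_w\}$, a string of length $w$.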
SAX not only accelerates the analysis of time series data, but also preserves the quality of the results. It is widely used in various applications, but the SAX description method also has an important limitation. SAX assigns symbols based on the average value of each sub-segment, without considering the trend within each sub-segment. Sub-segments with similar average values are therefore represented by the same character, and the distance between two such sub-segments is 0 even if their trends differ.
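The SAX procedure and its limitation can be illustrated with a minimal Python sketch. It assumes that the series length is a multiple of w, uses the population standard deviation for normalization, and takes the breakpoints from the standard normal quantiles via `statistics.NormalDist`; the function name `sax` and these choices are illustrative rather than the exact implementation used in this paper.

```python
from bisect import bisect_right
from statistics import NormalDist, mean, pstdev


def sax(series, w, a):
    """Convert a numeric series into a SAX word of length w over a letters."""
    n = len(series)
    if n % w != 0:
        raise ValueError("this sketch assumes len(series) is a multiple of w")
    if not 2 <= a <= 26:
        raise ValueError("alphabet size a must be between 2 and 26")

    # Step 1: z-normalize to mean 0 and standard deviation 1.
    mu, sigma = mean(series), pstdev(series)
    if sigma == 0:
        raise ValueError("a constant series cannot be normalized")
    z = [(x - mu) / sigma for x in series]

    # Step 2: piecewise aggregate approximation, one mean per sub-segment.
    seg = n // w
    paa = [mean(z[j * seg:(j + 1) * seg]) for j in range(w)]

    # Step 3: breakpoints that cut the standard normal into a equal-probability
    # intervals; each segment mean is mapped to the letter of its interval.
    breakpoints = [NormalDist().inv_cdf(k / a) for k in range(1, a)]
    return "".join(chr(ord("a") + bisect_right(breakpoints, v)) for v in paa)


word = sax([1, 2, 3, 4, 5, 6, 7, 8], w=4, a=4)

# Limitation: the two series below have opposite trends inside each
# sub-segment but identical sub-segment means, so their SAX words coincide.
rising = sax([1, 3, 6, 8], w=2, a=4)
falling = sax([3, 1, 8, 6], w=2, a=4)
print(word, rising, falling)
```

In the last two calls each sub-segment of the first series rises while each sub-segment of the second falls, yet both series receive the same word, which is the loss of trend information described above.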
The measurement of time series similarity is the core of many computer systems. In particular, similarity measurement is the most important part of time series data analysis. A similarity measure expresses how similar two time series are, and the distance between them is usually used for this purpose. The greater the distance, the greater the difference between the two sequences and the smaller the similarity. The smaller the distance, the smaller the difference between the two sequences and the greater the similarity. For example, if the time series $X = \{x_1, x_2, \ldots, x_n\}$ and $Y = \{y_1, y_2, \ldots, y_n\}$ have the same length $n$, their Euclidean distance is

$$D(X, Y) = \sqrt{\sum_{i=1}^{n} (x_i - y_i)^2}$$
As research on time series data has developed, the number of methods for calculating time series similarity has also grown, with different distance calculations suited to data in different formats. Dynamic Time Warping (DTW) is introduced here.
DTW is a similarity measurement method suitable for complex curves. Unlike the Euclidean distance, DTW allows time series to be stretched and bent along the time axis, so the distance does not need to match all data points on the two time series curves one to one.
To compute the DTW distance, it is necessary to find the mapping path that brings the two curves closest together on the time scale. One point may be mapped to multiple points, so the two curves may differ in length, and data points are in effect inserted where needed. The method for calculating the DTW distance is described as follows.
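It is assumed that the time series of the two curves are $X = \{x_1, x_2, \ldots, x_m\}$ and $Y = \{y_1, y_2, \ldots, y_n\}$. An $m \times n$ matrix is constructed in which element $(i, j)$ holds the distance $d(x_i, y_j)$ between the points $x_i$ and $y_j$. A curve path $W = \{w_1, w_2, \ldots, w_K\}$, where each $w_k = (i_k, j_k)$ is an element of the matrix, must satisfy the following conditions.
Boundary: the first element of the path is $w_1 = (1, 1)$ and the last element is $w_K = (m, n)$. Continuity: all adjacent points on the curve path are adjacent elements in the matrix. Monotonicity: the elements in the curve path increase monotonically over time. For example, if $w_k = (i, j)$ and $w_{k+1} = (i', j')$, then $0 \le i' - i \le 1$ and $0 \le j' - j \le 1$.
Many paths can meet these conditions. The main problem is how to choose the next step of the path. The formula for calculating the shortest distance of DTW is as follows:

$$\gamma(i, j) = d(x_i, y_j) + \min\{\gamma(i-1, j-1),\ \gamma(i-1, j),\ \gamma(i, j-1)\}$$

Among them, $\gamma(i, j)$ is the cumulative distance of the shortest path from $(1, 1)$ to $(i, j)$, and the DTW distance between $X$ and $Y$ is $\gamma(m, n)$.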
The DTW distance has good applicability compared to the Euclidean distance in that it can be used without considering whether the lengths of the sequences are consistent with each other.
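The DTW calculation can be sketched in Python with the usual dynamic programming table. The point distance is taken here as the absolute difference between values, which is an assumption for illustration; any other point distance can be substituted.

```python
import math


def dtw_distance(x, y):
    """Dynamic Time Warping distance between two numeric sequences.

    gamma[i][j] is the cumulative cost of the cheapest warping path that
    aligns the first i points of x with the first j points of y.
    """
    m, n = len(x), len(y)
    gamma = [[math.inf] * (n + 1) for _ in range(m + 1)]
    gamma[0][0] = 0.0  # boundary condition: the path starts at the first pair

    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = abs(x[i - 1] - y[j - 1])  # point distance (assumption)
            # Continuity and monotonicity: extend the path from one of the
            # three neighbouring cells that precede (i, j).
            gamma[i][j] = cost + min(
                gamma[i - 1][j - 1],  # advance in both sequences
                gamma[i - 1][j],      # advance in x only
                gamma[i][j - 1],      # advance in y only
            )
    return gamma[m][n]


same_shape = dtw_distance([1, 2, 3, 4], [1, 2, 2, 3, 4])
shifted = dtw_distance([1, 2, 3], [2, 3, 4])
print(same_shape, shifted)
```

The first call compares two sequences of different lengths that trace the same shape, so the warping path absorbs the repeated point. This version uses O(mn) time and memory; windowing constraints are commonly added for long series.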
Data source
All research data are from the Alibaba data platform, and the purchase records from November 1 to 30, 2020 are the focus of the research. The data were randomly sampled from the entire Taobao sales platform, with a sampling proportion of 20%. The main fields include the seller’s ID, the seller’s credit value, the buyer’s ID, the buyer’s credit value, the goods ID, the detailed price of the goods, the purchase quantity, the transaction end time, the goods category, the type of goods issued by the seller, and the logistics addresses for delivery and receipt of the goods.
Data processing
Data cleaning found that some users only clicked, and did so many times, without ever adding items to their collection (favorites) or paying. These users were assumed to be crawlers, and their records were treated as noise data that needed to be deleted. In particular, users with more than 500 clicks but no collection and no payment were removed.
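The cleaning rule above can be sketched in Python as follows. The event layout `(user_id, behaviour)` and the behaviour labels `"click"`, `"favorite"`, `"cart"` and `"pay"` are hypothetical names chosen for illustration; only the threshold of 500 clicks comes from the text.

```python
from collections import Counter

CLICK_THRESHOLD = 500  # from the text: more than 500 clicks


def remove_crawlers(events, threshold=CLICK_THRESHOLD):
    """Drop all events of users who click heavily but never collect or pay.

    events: list of (user_id, behaviour) pairs (hypothetical layout).
    """
    clicks = Counter()
    engaged = set()  # users with at least one favorite or payment
    for user, behaviour in events:
        if behaviour == "click":
            clicks[user] += 1
        elif behaviour in ("favorite", "pay"):
            engaged.add(user)

    crawlers = {
        user for user, count in clicks.items()
        if count > threshold and user not in engaged
    }
    return [event for event in events if event[0] not in crawlers]


events = (
    [("bot", "click")] * 501                      # heavy clicker, nothing else
    + [("fan", "click")] * 501 + [("fan", "pay")]  # heavy clicker who pays
    + [("u1", "click")] * 3                        # ordinary user
)
kept = remove_crawlers(events)
print(sorted({user for user, _ in kept}))
```

A user with exactly 500 clicks is kept, because the rule in the text is "more than 500".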
Data feature extraction
In addition to the raw data, the time stamps of the different behaviors are used to retrieve the times of clicks, collections, shopping cart additions and payments, and a cumulative data processing method is used to calculate new user values according to different indexes.
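One possible reading of the cumulative processing described above is to count, for each user, how many clicks, collections, cart additions and payments occurred up to a given time stamp. The Python sketch below follows that reading; the event layout `(user_id, behaviour, timestamp)`, the behaviour labels and the function name are assumptions for illustration and do not reproduce the paper's index definitions.

```python
from collections import defaultdict

BEHAVIOURS = ("click", "favorite", "cart", "pay")  # hypothetical labels


def cumulative_features(events, cutoff):
    """Count each behaviour per user for events with timestamp <= cutoff.

    events: iterable of (user_id, behaviour, timestamp) tuples, where the
    timestamp is any comparable value (for example, seconds since epoch).
    """
    features = defaultdict(lambda: dict.fromkeys(BEHAVIOURS, 0))
    for user, behaviour, timestamp in events:
        if timestamp <= cutoff and behaviour in BEHAVIOURS:
            features[user][behaviour] += 1
    return dict(features)


events = [
    ("u1", "click", 1),
    ("u1", "click", 2),
    ("u1", "pay", 3),
    ("u2", "cart", 2),
    ("u1", "click", 9),  # after the cutoff, so not counted
]
features = cumulative_features(events, cutoff=5)
print(features)
```

Calling the function with increasing cutoffs gives a cumulative time series of each behaviour count per user, which can then serve as input to clustering or prediction.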
Model establishment and data prediction
For a data set $\{x_1, x_2, \ldots, x_N\}$, the K-means algorithm defines $k$ cluster centres $\mu_1, \mu_2, \ldots, \mu_k$ and assigns each data point to its nearest centre, so as to minimise the within-cluster sum of squared distances

$$J = \sum_{j=1}^{k} \sum_{x_i \in C_j} \lVert x_i - \mu_j \rVert^2$$

where $C_j$ is the set of points assigned to centre $\mu_j$.
Taobao user clicks and payments are selected as clustering indicators, and the Weka data mining software is used for K-means cluster analysis. Based on the statistical analysis results, the psychology of purchase intention is analyzed from the perspective of clicks and payments, and the types of user labels are summarized. According to time rules, users are grouped into four label types, namely “daytime purchase type”, “night purchase type”, “weekend purchase type”, and “holiday purchase type”, to produce user labels. With the Taobao user portrait, the behavior characteristics of Taobao users on different overlapping pages can be studied, and the user portrait results can be applied to the marketing of the Taobao platform to provide users with personalized product services under multiple different labels and so achieve accurate marketing.
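To make the clustering step concrete, the following is a minimal Python sketch of the standard K-means iteration (Lloyd's algorithm) on (clicks, payments) pairs. It is not Weka's implementation: the deterministic initialisation with the first k points, the iteration limit and the sample values are assumptions for illustration.

```python
def kmeans(points, k, iterations=100):
    """Cluster points with plain K-means; returns (assignment, centroids)."""

    def squared_distance(p, q):
        return sum((a - b) ** 2 for a, b in zip(p, q))

    # Assumption: initialise the centroids with the first k points.
    centroids = [tuple(float(v) for v in p) for p in points[:k]]
    assignment = None

    for _ in range(iterations):
        # Assignment step: attach every point to its nearest centroid.
        new_assignment = [
            min(range(k), key=lambda c: squared_distance(p, centroids[c]))
            for p in points
        ]
        if new_assignment == assignment:
            break  # converged: no point changed cluster
        assignment = new_assignment

        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:
                centroids[c] = tuple(
                    sum(dim) / len(members) for dim in zip(*members)
                )
    return assignment, centroids


# Hypothetical (clicks, payments) pairs for six users.
users = [(10, 1), (12, 2), (11, 1), (200, 30), (210, 28), (190, 32)]
assignment, centroids = kmeans(users, k=2)
print(assignment, centroids)
```

Because clicks and payments are on very different scales, it is usually advisable to standardise both indicators before clustering; that step is omitted here to keep the sketch short.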
Taobao transaction data mining model.
The prediction model and the prediction function of the R software are used to predict the seller’s subsequent sales and to calculate the prediction results.
The Taobao transaction data mining model in this paper is shown in Fig. 3.
To judge the prediction effect on time series, the prediction results obtained by a given prediction method are generally compared with the known results, and the similarity between them is taken as the accuracy of the sequence prediction. Likewise, the grouping results obtained by a specific grouping method are compared with the known categories, and their proximity is taken as the accuracy of the sequence grouping. Prediction accuracy and entropy are used to measure the quality of the prediction method.
Prediction accuracy: The prediction accuracy is used as an evaluation value. It is the proportion of prediction results that agree with the known results.
Entropy: Entropy is taken as the evaluation value.
Entropy represents the degree to which a prediction class contains a real object and is a measure of purity. The total entropy of the prediction class is equal to the weighted sum of the entropy of each prediction class.
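The two indicators can be sketched in Python as follows. Accuracy is computed as the share of predictions that match the known labels, and entropy as the size-weighted sum of the entropy of the true labels inside each predicted class, which matches the verbal description above. The base-2 logarithm and the function names are assumptions, since the exact formulas are not reproduced here.

```python
import math
from collections import Counter


def prediction_accuracy(predicted, actual):
    """Share of predictions that agree with the known results."""
    correct = sum(p == a for p, a in zip(predicted, actual))
    return correct / len(actual)


def total_entropy(predicted, actual):
    """Weighted sum of the entropy of each predicted class.

    A predicted class that contains objects of a single true class is pure
    and contributes 0; a more mixed class contributes more.
    """
    n = len(actual)
    true_counts = {}  # predicted class -> Counter of true classes inside it
    for p, a in zip(predicted, actual):
        true_counts.setdefault(p, Counter())[a] += 1

    total = 0.0
    for counts in true_counts.values():
        size = sum(counts.values())
        class_entropy = -sum(
            (c / size) * math.log2(c / size) for c in counts.values()
        )
        total += (size / n) * class_entropy
    return total


predicted = ["a", "a", "a", "a", "b", "b", "b", "b"]
actual = ["a", "a", "a", "b", "b", "b", "b", "b"]
accuracy = prediction_accuracy(predicted, actual)
entropy = total_entropy(predicted, actual)
print(accuracy, round(entropy, 4))
```

Under this definition a lower entropy means purer prediction classes, so high accuracy and low entropy together indicate a good prediction.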
The transaction data of Alibaba data platform in September, October, November and December were collected. The data of the last day of each month were predicted, and the prediction accuracy and entropy value of the prediction results were calculated.
Forecast of transaction data in September
Taobao September transaction data forecast results. (a) Accuracy of forecast results of transaction data in September. (b) Entropy of forecast results of September trading data.
The predicted results of Taobao’s September trading data were shown in Fig. 4.
Figure 4a showed the accuracy rate of the predicted results of the September trading data, and Fig. 4b showed the entropy value of the predicted results of the September trading data.
The accuracy of the predicted results of Taobao’s September trading data was 92.35% for food data and 91.46% for household data. The accuracy rate of prediction results of clothing data was 93.57%, and that of sports data was 94.33%. The accuracy rate of prediction results of office data was 92.87%. The average accuracy rate of the predicted results of Taobao’s September trading data was 92.92%, which was relatively high.
According to the prediction of Taobao’s September transaction data, the entropy value of the prediction result of food data was 0.122, and the entropy value of the prediction result of household data was 0.124. The predicted entropy of clothing data was 0.132, and that of sports data was 0.125. The predicted entropy of office data was 0.126. The average entropy of the predicted results of Taobao’s September trading data was 0.126. The prediction entropy was low, which indicated that the prediction classes were pure and the prediction was effective.
Forecast results of trading data of Taobao in October. (a) Accuracy of forecast results of trading data in October. (b) Entropy value of forecast results of trading data in October.
The predicted results of Taobao’s trading data in October were shown in Fig. 5.
Figure 5a showed the accuracy rate of the predicted results of the trading data in October, and Fig. 5b showed the entropy value of the predicted results of the trading data in October.
According to the prediction of the transaction data of Taobao in October, the accuracy of the prediction results of the data of food, household, clothing, sports and office was 93.56%, 94.08%, 93.74%, 92.88% and 94.69% respectively. The average accuracy rate of the predicted results of Taobao’s trading data in October was 93.79%, which was relatively high.
According to the prediction of the transaction data of Taobao in October, the entropy values of the predicted results of the data of food, household, clothing, sports and office were 0.089, 0.097, 0.103, 0.084 and 0.092, respectively. The average entropy of the predicted results of Taobao’s October trading data was 0.093. The prediction entropy was low, which indicated that the prediction classes were pure and the prediction was effective.
Taobao November transaction data forecast results. (a) Accuracy of forecast results of transaction data in November. (b) Entropy of forecast results of November trading data.
The forecast results of Taobao’s November trading data were shown in Fig. 6.
Figure 6a showed the accuracy rate of the forecast results of November trading data, and Fig. 6b showed the entropy value of the forecast results of November trading data.
The accuracy of the predicted results of Taobao’s November trading data was 95.84%, 96.73%, 94.88%, 95.42% and 96.15% respectively for food, household, clothing, sports and office. The average accuracy rate of the predicted results of Taobao’s November trading data was 95.80%, which was relatively high.
According to the prediction of the transaction data of Taobao in November, the entropy values of the prediction results of the data of food, household, clothing, sports and office were 0.112, 0.127, 0.129, 0.132 and 0.137 respectively. The average entropy of the predicted results of Taobao’s November trading data was 0.127. The prediction entropy was low, which indicated that the prediction classes were pure and the prediction was effective.
Comprehensive results of prediction and analysis of Taobao transaction data
Taobao December transaction data forecast results. (a) Accuracy of forecast results of December trading data. (b) Entropy of forecast results of December trading data.
The predicted results of Taobao’s December trading data were shown in Fig. 7.
Figure 7a showed the accuracy rate of the predicted results of December trading data, and Fig. 7b showed the entropy value of the predicted results of December trading data.
According to the prediction of the transaction data of Taobao in December, the accuracy of the prediction results of the data of food, household, clothing, sports and office was 93.82%, 95.73%, 92.88%, 95.19% and 94.93% respectively. The average accuracy rate of the predicted results of Taobao’s December trading data was 94.51%, which was relatively high.
According to the prediction of the transaction data of Taobao in December, the entropy values of the prediction results of the data of food, household, clothing, sports and office were 0.136, 0.129, 0.133, 0.118 and 0.124 respectively. The average entropy of the predicted results of Taobao’s December trading data was 0.128. The prediction entropy was low, which indicated that the prediction classes were pure and the prediction was effective.
The comprehensive results of prediction and analysis of Taobao transaction data were shown in Table 1.
As could be seen from Table 1, the average accuracy of the prediction and analysis results of Taobao trading data was 94.26%, and the average entropy of the prediction and analysis results was 0.119. The prediction and analysis accuracy was high, and the entropy value was low. This showed that the data mining technology based on time series analysis had good data prediction ability and could accurately predict Taobao transaction data, which makes it possible to analyze consumers’ behavior characteristics and formulate more accurate and complete marketing strategies based on user characteristics.
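The reported averages can be re-derived from the per-category values given in the text. The short Python check below uses `decimal` with half-up rounding so that values such as 94.255 round the same way as in the paper; the dictionary layout and the helper name `average` are illustrative.

```python
from decimal import Decimal, ROUND_HALF_UP

# Per-category values (food, household, clothing, sports, office) as reported.
ACCURACY = {
    "September": ["92.35", "91.46", "93.57", "94.33", "92.87"],
    "October": ["93.56", "94.08", "93.74", "92.88", "94.69"],
    "November": ["95.84", "96.73", "94.88", "95.42", "96.15"],
    "December": ["93.82", "95.73", "92.88", "95.19", "94.93"],
}
ENTROPY = {
    "September": ["0.122", "0.124", "0.132", "0.125", "0.126"],
    "October": ["0.089", "0.097", "0.103", "0.084", "0.092"],
    "November": ["0.112", "0.127", "0.129", "0.132", "0.137"],
    "December": ["0.136", "0.129", "0.133", "0.118", "0.124"],
}


def average(values, places):
    """Exact mean of decimal values, rounded half-up to the given precision."""
    numbers = [Decimal(v) for v in values]
    return (sum(numbers) / len(numbers)).quantize(
        Decimal(places), rounding=ROUND_HALF_UP
    )


monthly_accuracy = {m: average(v, "0.01") for m, v in ACCURACY.items()}
monthly_entropy = {m: average(v, "0.001") for m, v in ENTROPY.items()}

# Overall figures are the averages of the four rounded monthly averages.
overall_accuracy = average(monthly_accuracy.values(), "0.01")
overall_entropy = average(monthly_entropy.values(), "0.001")
print(monthly_accuracy, monthly_entropy, overall_accuracy, overall_entropy)
```

The monthly averages and the overall figures of 94.26% and 0.119 are consistent with the per-category values reported above.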
Conclusions
This paper used time series data mining technology to mine Taobao transaction data. It described the process of data mining and studied the calculation methods of time series data mining, namely time series re-description and similarity measurement. In the process of Taobao transaction data mining, this paper collected Taobao transaction data, processed the collected data by deleting meaningless records, and extracted features from the processed data, so as to build the model and define the data prediction evaluation indicators. In the experimental part, a data mining experiment on Taobao transactions was carried out. The experimental results showed that the data mining model had good data prediction accuracy and could be used for mining Taobao transaction data.
