Abstract
In the process of e-commerce transactions, a large amount of data will be generated, whose effective classification is one of current research hotspots. An improved feature selection method was proposed based on the characteristics of Bayesian classification algorithm. Due to the long training and testing time of modern large-scale data classification on a single computer, a data classification algorithm based on Naive Bayes was designed and implemented on the Hadoop distributed platform. The experimental results showed that the improved algorithm could effectively improve the accuracy of classification, and the designed parallel Bayesian data classification algorithm had high efficiency, which was suitable for the processing and analysis of massive data.
Introduction
As the business mode of the information society, e-commerce is developing rapidly and vigorously. With the continuous expansion of e-commerce websites, the amount of information in commodity databases also saw a rapid growth [1]. In order to help users to find necessary goods quickly and accurately, it was very necessary to classify e-commerce data by entities [2, 3]. Regarding e-commerce data classification, “manual setting” and “machine entity recognition” methods could be adopted [3–5], which however have their own drawbacks: The fixed commodity classification library set manually can be used to realize the basic classification of commodities and obtain high classification accuracy. However, a lot of manpower and material resources are required to adapt to continuous data updating. Machine entity recognition has low cost and high automation level. However, the entity recognition technology which is commonly used at present cannot get ideal results because of the difference between e-commerce data and commonly-used structured relational data. Compared with the traditional entity recognition technology, the entity recognition of commodity data faced the following challenges [6, 7]: The information of goods is incomplete and word order is disordered. Different businesses have different descriptions of the same commodity entity, which seriously affects the quality of entity recognition. The descriptive information of commodities is cluttered with a number of useless advertisements.
Data classification was an important form of data analysis in the fields of data mining, machine learning and pattern recognition [2, 9]. Currently, mainstream classifiers included Naive Bayes Classification (NBC), Decision Tree Regression, Artificial Neural Network, Support Vector Machine and Classification Based on Association Rules [10–12]. The NBC algorithm assumes that sample attributes are independent of each other and effective in reducing the amount of computation required by the construction of the classification model, which is not true in massive real data. In order to effectively improve the deficiencies caused by this hypothesis, the NBC method could still exhibit high accuracy and speed when used in large databases by introducing an improved feature selection algorithm [12–14].
With the development of database technology and the rapid spread of the Internet, e-commerce provided increasing convenience for people’s lives [1, 4]. Computer back ends would process a great amount of data in practical applications, and many new problems and challenges would occur in the research of Naive Bayesian data classification, like massive data and new computing environments [12, 15]. Proposed by Google Lab, MapReduce parallel distributed computing model was mainly used for processing massive data [16, 17], which could organize clusters to process large-scale data sets and become the mainstream parallel data processing model of cloud computing platforms. The classification principle of Naive Bayes was analyzed to find the parallel point in the algorithm, apply it to the cloud computing environment and realize the parallelization algorithm of Naive Bayes on data classification through MapReduce computational model programming. Current large-scale data sets are too long to be calculated through the training and testing processes of a single computer. Numerous experimental results showed that the proposed parallelization algorithm had obvious advantages in the data classification of massive data sets with high execution efficiency and certain scalability and expansion performance [17].
Bayesian classification algorithm is a statistical classification method and a class of algorithms that use probability and statistical knowledge for classification. In many cases, the NBC algorithm can be comparable to decision tree and neural network classification algorithms, which is applicable to large databases and characterized by simpleness, high classification accuracy and fast speed.
Bayes’ theorem may see a decline in its classification accuracy because of making a false assumption that the influence of an attribute value on a given class is independent of other attribute values. For this reason, a number of Bayesian classification algorithms reducing the assumption of independence have been derived, like the tree augmented Bayes network (TAN) algorithm.
Related works
MapReduce programming model
The Hadoop platform consisted of Hadoop Distributed File System (HDFS) and MapReduce programming model [18–20]. HDFS is not only effective in storing and managing data files on all storage nodes in a Hadoop cluster, but also provides high-throughput data access, making it ideal for applications in large-scale data sets. In addition, it is a highly fault-tolerant system suitable for deployment on inexpensive and even heterogeneous machines, contributing to the wider availability of cloud computing environments and Hadoop applications [21–24]. As a simplified parallel programming mode realizing the automatic distribution of programs to a large cluster of ordinary machines for concurrent execution, MapReduce abstracts the parallel computation process into two functions, namely Map and Reduce functions. Both functions required input, output and intermediate data to be in the form of a set of < key, value > key-value pairs, the Map function of which parsed small data sets into a batch of < k1, v1 > pairs, and generated intermediate results < k2, v2 > after calculation.
MapReduce was responsible for passing all v2 sets with the same intermediate k2 value to the Reduce function that received an input in the form of < k2, (list of values)>and then processed the value set. The final output form of Reduce was < k3, v3 > and returned to the application program. In specific applications, it is only necessary for programmers to pay attention to the specific tasks of Map and Reduce functions. Complex problems such as distributed file system, task scheduling, fault tolerance and inter-machine communication in other parallel computations are handled by the background of MapReduce operating system.
Data classification based on naive bayes cluster with e-commerce
Bayesian theorem is a very important formula theorem in probability theory and a statistical principle combining the prior knowledge of data samples belonging to a certain class with the new knowledge collected from new sample data. The application of Bayesian theorem to the classification model can generate a Bayesian classification algorithm, which is a statistical classification method of using probability and statistical knowledge for classification. Let the set of training samples and all of its classifications be T, a class c was given and c ∈ C. The sample data set had n attributes A1, A2, A3, ⋯ , A
n
and the test sample set was x = (a1, a2, ⋯ , a
n
) ∈ X. According to the given sample set containing multiple attributes, the computational complexity of P (x|c) was directly calculated. To obtain a valid estimate of P (x|c), Naive Bayes classifiers usually assumed that attributes were independent of each other. For a particular category and its properties independent of each other, there would be:
Bayesian formula was combined to get:
Therefore, the NBC classifier could be expressed by the following formula:
Based on the MapReduce programming model, the top priority was to design and implement Map and Reduce functions. According to the principle of Naive Bayes Cluster algorithm, the MapReduce parallelization of the NBC algorithm was used to classify data into four stages, including data preprocessing, feature selection, model training and data testing when users conducted online transactions.
Data preprocessing stage
When making a purchase on an e-commerce site, a user changes from a potential customer to a value customer of the website. E-commerce websites generally store user transaction information, including purchase time, price, quantity, amount and other information in their own databases. Regarding these customers, therefore, their transaction behaviors can be analyzed based on the operation data of websites to estimate the value of each customer and the likelihood of expanding markets for them.
The stage of data preprocessing aimed to prepare NBC, determine feature attributes according to the data generated by user behavior, and appropriately divide each feature attribute to discretize continuous feature attributes to get the data required by the system. For example, user data included name, age, gender, profession, etc., while behavioral data involved browsing time, linking, saving, printing, search, and the like. The discrete process would not be performed if some feature attributes were discrete. The inputs at this stage were all training and test data sets with continuous feature attributes, while the output was a discrete attribute value, which would have an important impact on the entire user purchase process. The quality of the classifier was largely determined by feature attributes and features, attribute partitioning and training sample quality decisions. Continuous attribute discretization was realized since the optimal partitioning interval between n attributes of users’ sample data sets could be calculated in parallel. Therefore, MapReduce parallelization could be achieved at this stage.
Feature selection stage
Normally, NBC data is classified by featuring metadata directly. In fact, the weight of different data is different for a sample. That is, the degree of distinction between different metadata is different. This paper improved feature selection. The improved formula was as follows:
Wherein, tf
ik
and f
ik
were the frequency of and sample xi with k metadata and metadata appearing in the sample respectively.
Wherein, N denoted the sample point; n
k
referred to the number of samples with metadata. The final weight formula was:
Metadata frequency plus 1 was to ensure that metadata with a frequency of 1 had non-zero weights.
At this stage, the previous sample data set needed to be trained to get the required static values so as to prepare for the calculation of the probability that a test sample belonged to a certain class at the testing stage. According to the description of the NBC algorithm, training samples were randomly allocated to several Map task nodes for parallelization, which had no influence on the final results. Therefore, MapReduce parallelization could be realized at this stage.
Map function input < key, value > was the default input format of the MapReduce framework, wherein key was the offset of current sample data relative to the starting position of the file and value was sample data in the form of x = (c, a1, a2, a3, ⋯ , a i , ⋯ , a n ). After the processing of the Map function, the intermediate result of the key-value pair was < key1, value1>, wherein key1 had two types, namely “category” and “category, attrIndex, attrValue”. The former was used to count the total number of samples in each category, and the latter was used to count the number of occurrences of each attribute value under a specific category. Value1 was adopted to count the number.
Map (<key, value>,<key1, value1>)
{
Training sample data x was parsed from value;
Initial value1 was 1;
The class label category of sample data x was read as key1;
A string containing the label category of the sample class, the value of the sample attribute and the column number of values was constructed as key1;
<key1, value1 > was output;
}
Different key1 s of the Map output were mapped to different Reduce task nodes in the shuffling phase. After corresponding calculation, key1 remained unchanged, and value1 belonging to the same key1 value was accumulated, thereby obtaining the total number of samples in each category and the number of occurrences of each attribute value under different columns in a specific category.
Data testing stage
In the testing step, static data from the model training stage was required to calculate the probability of a test sample belonging to a certain class. Therefore, the results generated in the model training step needed to be read before parallel algorithm was written for the sample testing process. The path of the results was saved as follows: / home/output/part-r-00000, and put into HashMap, Array-List, which was easy to read in the testing step.
The input key pairs of the Map function in the testing phase were the same with that in the model training phase. The testing phase of samples was parallel because the static data completed in the training phase could be read independently and completely between different test samples. Firstly, the probability of all attributes of test sample data in a particular class was multiplied, and then the final probability of catePro was calculated by formula (4). For each test sample, all class Cs were traversed and the probability of belonging to a particular class C was calculated.
Map(<key, value>,<key1, value1>)
{
//Test data x was parsed from value; variable 1value was initialized to record the attribute value x of each component; variable catePro was initialized to record attributes for the probability of a particular class C. The initial value 1.0 was set.
For a particular class C;
while (value.hasMoreTokens())
{
1value=value.nextTokens; catePro = catePro*num;
} catePro = catePro/1.0/(sample data in current class C);
The specific class C was used as key1;
The calculated probability catePro was taken as value1;
<key1, value1 > was output;
}
According to the output of the Map function, the probability that test samples belonged to each class was obtained. When the output results were sorted from large to small, the category with the greatest probability value was the category of sample data.
Analysis of results
Lab environment
In this experiment, a total of 21 nodes in the Hadoop cluster were used, whose configuration information and running environments are exactly the same. The operating system version is Ubuntu 10.12, Hadoop version is 0.20.2, and Java version is 1.7.0. The cluster based on Hadoop0.20.2 was configured with the method introduced by the official website of Handoop Project. Among 21 nodes, one is a master node, and the remaining 20 are all slave nodes.
The basic experimental data used to test the classification effect in the cluster was from the data set in the UCI dataset. In addition to isolet, covtype, census_income and other data sets representing the real world, synthetic data sets data1 to data4 at MB and GB levels were constructed manually, as shown in Tables 1 and 2.
Experimental data
Experimental data
Experimental data
Regression test was conducted to evaluate the accuracy of the classifier. At the beginning of construction, all data sets in Table 1 were randomly divided into two parts, with one part used for constructing the classification model and the other part for detecting the accuracy of the classifier. Finally, the prepared data was run 10 times on a single machine, 10-node and 20-node clusters respectively, and the results were averaged to get the final correct rate, as shown in Tables 3 and 4.
Classification accuracy of different data sets in different environments (Corresponding to Table 1)
Classification accuracy of different data sets in different environments (Corresponding to Table 1)
Classification accuracy of different data sets in different environments(Corresponding to Table 2)
Tables 3 and 4 show that the improved feature selection algorithm was higher in classification accuracy compared with the unimproved one for most data sets, which thus had higher enforceability. In addition, it could be seen that the accuracy of different data sets running on a single machine, 10-node and 20-node clusters was not very different. This error was inevitable due to the randomness of samples selected during training and testing. Therefore, it could be said that the effects of classification on a single machine, 10-node and 20-node clusters were the same, suggesting that the NBC algorithm based on MapReduce programming model designed in this paper had no impact on its classification effect.
In the case of testing the performance of the algorithm, abalone, covtype and four different data sets of 512 MB, 1 GB, 1.5GB,2 GB,2.5GB, 4 GB and 4.5GB were used in the experiment. These data sets were tested on clusters of 2, 4, 6, 8, 10, 12, 14 and 16 different nodes respectively. The experimental results are shown in Fig. 1.

Running time of files of different sizes in the cluster.
Fig. 1 shows that the running time of data files abalone and ecotype changed little and even tended to be flat in a distributed cluster environment with different nodes, while that of data files data1 to data4 saw increasingly great changes. The experimental results indicated that the MapReduce parallelization implementation of the NBC algorithm was more suitable for large-scale data files, but had little effect on small-scale ones.
Moreover, Fig. 1 demonstrates that the data set abalone ran almost unchanged in a clustered environment with different nodes. This was because abalone was a data file smaller than 64 MB (the default size of the HDFS block) and taken by the system as a block of data, making it necessary to process only one node. MapReduce itself would need data transmission if too many nodes were in the cluster, which would increase additional system overhead. In a manner of speaking, the actual size of data files must exceed the default block size of HDFS so as to reflect the advantages of Hadoop parallel computing.
The Naive Bayesian algorithm with 2GB and 4.5GB data files on clusters of 2, 4, 6, 8 and 10 nodes were trained and tested, and the acceleration ratio was calculated. The definition of acceleration ratio was: Acceleration ratio = running time on a single machine/in a cluster, and the broken line of acceleration ratio was drawn, as shown in Figs. 2 and 3.

Line chart of acceleration ratio with 2GB data.

Line chart of acceleration ratio 4.5GB data.
It could be seen from Fig. 2 and 3 that the acceleration ratio of the classification process of the Naive Bayesian algorithm based on MapReduce tended to be linear. That is to say, increasing the number of nodes could significantly improve the efficiency of the algorithm for the same large-scale data sets. However, it did not mean that the larger the number of nodes was, the higher the efficiency of the algorithm would be. The running time of the algorithm would decrease when nodes reached a certain number. Through the analysis of experimental data results, it could be concluded that this phenomenon was caused for two reasons: Firstly, the file was processed in chunks, and the default size of the data block was 64 MB. Research showed that the performance of a cluster was the best when the number of data blocks was the same as that of nodes in the cluster. Second, the amount of communication between nodes became larger and larger with the increasing number of nodes, resulting in growing system overhead and eventually a slower computing rate. This problem could be solved by using a network of higher rate.
In this paper, the data generated in the process of e-commerce transaction was then studied in depth based on the parallel NBC algorithm of the Hadoop cloud computing platform. Firstly, a brief introduction was given to the Hadoop architecture and NBC algorithm. Then, an improved feature selection method was proposed to design and implement the parallel NBC algorithm based on MapReduce programming model. Finally, a lot of experimental results and analysis showed that the Naive Bayes data classifier based on the MapReduce framework had better training and test set scales while improving the accuracy of data classification. Speedup ratio greatly improved the performance of the NBC algorithm in data classification.
Declarations
Availability of data and materials: The datasets used and/or analyzed in the current study were available from the corresponding author on reasonable request.
Competing interests
This paper has no potential competing interests. All authors have seen the manuscript and approved of its submission to your journal. We confirm that the content of the manuscript has not been published or submitted for publication elsewhere.
Funding
This paper was supported by Scientific and Technological Research Program of Chongqing Municipal Education Commission (KJQN201900739), Dynamic Option Pricing Mechanism Study on Highway Revenue Rights from Heterogeneity Perspective.
Author’s contributions
The author took part in the discussion of the work described in this paper.
Footnotes
Acknowledgments
The authors thank the editor and anonymous reviewers for their helpful comments and valuable suggestions. I would like to acknowledge all our team members.
