Abstract
To provide early warning of product quality risk on e-commerce platforms, this paper studies machine learning algorithms for product quality risk assessment. It proposes the Fuzzy C-Means clustering algorithm for feature extraction and the Cost Sensitive Learning (CSL)-Naive Bayesian algorithm to construct an assessment model of e-commerce product quality risk from massive, unbalanced data. The experimental results show that the machine learning algorithm based on Spark scales better and performs better in a large-scale data environment, and can accurately identify e-commerce product quality risk.
Keywords
Introduction
Owing to information asymmetry in e-commerce transactions, customers shopping online face uncertain information and are often dissatisfied with product quality, which reduces the transaction efficiency of e-commerce. George Akerlof (1970), the Nobel laureate in economics, described the typical “lemon problem” of a market for inferior products; without scientific supervision technology, the e-commerce market risks declining into such a market [1]. Pan Yong (2009) pointed out that digital products in e-commerce have become personalized and customized, which makes product quality risk assessment difficult [2].
With the development of the Internet, more and more scholars have studied advanced supervision technologies and mechanisms for e-commerce product quality. Wilcock (2004) proposed that consumer benefit is an important factor and that the government should base regulatory decisions on consumers’ reviews of the e-commerce market [3]. John R Lupien (2007) argued that the government, producers and third parties should all take part in supervision in order to reduce food quality risk [4]. Hoffman (2011) argued that the supervision movement emphasized the use of public and private resources, which shifted the FDA’s regulation from inspection to risk prevention [5]. Yu Caixia (2015) proposed that product quality is the lifeline of the healthy development of the e-commerce market, and that legal guarantees, supervision, quality traceability and information sharing systems should be strengthened [6]. Zheng Liwei and Liu Xiaofei (2015), drawing on the product quality supervision systems of Europe and the United States, proposed that supervision authorities should use information technology to focus on the mechanisms for collecting and sharing product regulatory information [7]. Wang NingJiang (2015) proposed building a supervision platform for e-commerce product quality in the big data environment to achieve cooperation between online and offline supervision [8].
Barnes and Vidgen (2002) proposed the WebQual model, which evaluates e-commerce quality from five aspects: website usability, design, information, trust and user-website interaction quality [9]. Loiacono (2005) established the WebQual™ model, which assesses e-commerce quality from twelve quality aspects of websites [10]. Wolfinbarger and Gilly (2003) proposed a qualitative scale for evaluating the quality of B2C e-commerce services, covering website design, reliability, privacy/security and customer service/care [11]. Zeithaml and Parasuraman (2005) put forward the E-Servqual model for e-commerce service quality evaluation, which has eleven quality dimensions such as efficiency, technical reliability, compliance, security and privacy [12]. Thomas L. Ngo-Ye (2005) used regression analysis to study online comments on Yelp and Amazon, combining the features of the comments with those of the reviewers through content analysis and the RFM theory from marketing [13]. Liang T. P. and Li X. (2015) used sentiment analysis to show that consumer reviews affect online shopping behavior through perceived benefit, perceived website convenience and perceived risk [14]. Zuo Wenming and Chen Huaqiong (2013) proposed a B2C e-commerce service quality model based on online word-of-mouth, using it to find deficiencies in the service process and reveal problems in enterprise service quality [15]. Nie Hui (2014) put forward a content quality model of online reviews, which uses data mining and regression methods to analyze Chinese reviews [16]. Dai Yu-xin and Yuan Meng (2016) studied the risk prediction of e-commerce product quality using the big data technologies of data warehousing and intelligent information analysis [17]. Guo Xianchao (2016) proposed using MapReduce technology and an index model to assess product quality risk [18].
The traditional product quality risk assessment methods of e-commerce suffer from insufficient data utilization and a branching computing process [19]. This paper proposes a machine learning algorithm on the Spark platform to analyze the product quality risk of e-commerce in the big data environment. The Fuzzy C-Means clustering algorithm is adopted to extract features from the massive quality data of e-commerce products. According to the risk evaluation rules of e-commerce product quality, this paper establishes a risk classification model with the Cost Sensitive Learning Naive Bayesian algorithm, which can intelligently evaluate the product quality risk of e-commerce.
Risk assessment procedure of E-Commerce products quality
(1) Analyze the factors that affect the quality risk of e-commerce products. According to the research of Zeithaml, Parasuraman and Malhotra (2000) [20], the quality risk of e-commerce products is composed of product quality factors, logistics factors, commodity environmental factors and customer service factors. The product quality risk factors include product material certification, national inspection product information and so on. The logistics risk factors are the logistics-related information obtained from the logistics number. The customer service risk factors are obtained from the e-commerce platform. The commodity environmental risk information can be obtained from the quality inspection organizations for e-commerce products.
(2) This paper estimates the quality risk of e-commerce products based on the e-commerce quality assessment model of Wolfinbarger and Gilly (2003) [21] and establishes the following risk assessment indicators: customer review (x1), comment star (x2), national inspection product (x3), commodity brand (x4), commodity material (x5), sales channel (x6), logistics company (x7), customer membership (x8), commodity type (x9), environment (x10), etc. The product quality risk assessment model of e-commerce is shown in Fig. 1.

The product quality risk assessment model of e-commerce.
(3) Assess the risk level of e-commerce product quality according to the risk management principles and guidelines of ISO 31000. The risk level, from low to high, is divided into four grades according to customer review and product attributes [1, 4]. Table 1 describes the product quality risk levels of e-commerce.
The product quality risk level of e-commerce
(4) Assess the product quality risk and calculate the loss by multiplying the weight values by the risk indexes of the product. Assign weights (w1, w2, . . . , wn) to the factors (x1, x2, . . . , xn), and then calculate the risk indicators (k1, k2, . . . , kn) of the factors according to the inspection requirements and standards. The risk index of the whole product is then the weighted average k of the risk indexes of the n indicators.
(5) Supervise and audit the quality risk factors of e-commerce products. The specific risk assessment process is shown in Fig. 2.

The procedure flow of e-commerce product quality risk assessment.
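Step (4) of the procedure can be illustrated with a minimal sketch. It assumes that the weighted average is normalized by the sum of the weights, which the text does not state explicitly; the function name and the example weights and indicator scores are hypothetical.

```python
def weighted_risk_index(weights, indicators):
    """Return the weighted average k of the per-factor risk indicators."""
    if len(weights) != len(indicators):
        raise ValueError("weights and indicators must have the same length")
    total_weight = sum(weights)
    if total_weight <= 0:
        raise ValueError("weights must sum to a positive value")
    # Assumption: normalize by the total weight so k stays on the indicator scale.
    return sum(w * k for w, k in zip(weights, indicators)) / total_weight


# Hypothetical weights (w1, w2, w3) and risk indicators (k1, k2, k3).
example_weights = [0.5, 0.3, 0.2]
example_indicators = [4.0, 2.0, 1.0]
print(round(weighted_risk_index(example_weights, example_indicators), 2))
```

Normalizing by the total weight keeps k within the range of the individual indicators, so it can be compared directly with the four risk grades.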
Naive Bayes algorithm
The theoretical basis of the Bayesian network is to collect, mine and quantify prior information and to form a prior probability distribution that improves the quality of statistical inference. The Naive Bayesian (NB) algorithm is a classification algorithm based on the Bayesian network. It calculates the prior probability and the conditional probabilities of each category for the item to be classified, and assigns the item to the category with the greatest probability.
Suppose D = {X1, X2, . . . , Xn} is the input data set, A1, A2, . . . , An denote the n attributes, and the set {C1, C2, . . . , Cm} represents the m categories. X = {x1, x2, . . . , xn} is a sample of unknown category, where xi is the value of attribute Ai. The task is to calculate the probability that X belongs to category Ck. According to Bayes' theorem:
$$P(C_k \mid X) = \frac{P(X \mid C_k)\, P(C_k)}{P(X)}$$
Since the calculation of P (X|Ck) is very complex, it is generally assumed that the n attribute variables are independent of each other:
$$P(X \mid C_k) = \prod_{i=1}^{n} P(x_i \mid C_k)$$
P (X) is constant for all categories in the formula, so P (Ck|X) is maximal when the product P (X|Ck) P (Ck) is maximal. The Naive Bayesian classification model is:
$$C(X) = \arg\max_{C_k} \; P(C_k) \prod_{i=1}^{n} P(x_i \mid C_k)$$
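The classification rule above can be sketched for categorical attributes as follows. This is an illustrative sketch, not the paper's implementation: the function names are hypothetical, probabilities are estimated from frequency counts, and Laplace (add-one) smoothing is an added assumption to avoid zero probabilities.

```python
import math
from collections import Counter, defaultdict


def train_naive_bayes(samples, labels):
    """Count class frequencies and per-class attribute-value frequencies."""
    class_counts = Counter(labels)
    # value_counts[(class, attribute index)][value] -> count
    value_counts = defaultdict(Counter)
    # Distinct values seen for each attribute, used for smoothing.
    attribute_values = defaultdict(set)
    for x, c in zip(samples, labels):
        for i, value in enumerate(x):
            value_counts[(c, i)][value] += 1
            attribute_values[i].add(value)
    return class_counts, value_counts, attribute_values


def predict_naive_bayes(model, x):
    """Return the class maximizing log P(C) + sum_i log P(x_i | C)."""
    class_counts, value_counts, attribute_values = model
    total = sum(class_counts.values())
    best_class, best_score = None, -math.inf
    for c, n_c in class_counts.items():
        score = math.log(n_c / total)
        for i, value in enumerate(x):
            # Laplace (add-one) smoothing: an assumption, not from the text.
            numerator = value_counts[(c, i)][value] + 1
            denominator = n_c + len(attribute_values[i])
            score += math.log(numerator / denominator)
        if score > best_score:
            best_class, best_score = c, score
    return best_class


# Hypothetical toy data: (review level, comment star) -> risk category.
toy_samples = [("high", "good"), ("high", "good"), ("low", "bad"),
               ("low", "good"), ("low", "bad")]
toy_labels = ["low_risk", "low_risk", "high_risk", "high_risk", "high_risk"]
toy_model = train_naive_bayes(toy_samples, toy_labels)
print(predict_naive_bayes(toy_model, ("low", "bad")))
```

Working in log space avoids numerical underflow when the product runs over many attributes.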
Apache Spark is an open-source, general-purpose parallel computing framework. It retains the advantages of Hadoop MapReduce, but the intermediate results of a Spark job can be kept in memory instead of being written to and read from HDFS. Spark is therefore a memory-based parallel computing framework that supports iterative data flows and computes quickly. Its core abstraction is the RDD (Resilient Distributed Dataset). Spark exposes RDD operations through its programming-language APIs: each data set is represented as an RDD object, so operations on a data set become operations on an RDD object. Spark supports a wide range of algorithms, such as query and classification, in interactive computing and big data processing. The operation model of Spark is shown in Fig. 3.

Operation mode of spark platform.
The Naive Bayesian algorithm on the Spark platform reads data from disk and represents the training text as an RDD of the form (Key, (Count, Property)), in which Key is the category label, Count is the count number and Property is the attribute value of the feature. A reduceByKey operation then groups the records by Key and aggregates the data within each group, adding up the Count values of each Key to give (Key, (N, Property-Sum)). N denotes the number of occurrences of Key, Property-Sum is the sum of the same attribute, and the number of distinct Keys is the number of categories. The training model is formed by obtaining the feature attribute frequencies of each category. In this paper, the dimension variables D (D1, D2, D3, D4) are randomly generated in the range [0, 10]. Each observation index variable x takes a high or low state according to its numerical value. The product quality Q has high, moderate and low states. The dynamic classification model for e-commerce product quality based on the NB algorithm is shown in Fig. 4.

The dynamic classification model of Naive Bayesian algorithm.
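The (Key, (Count, Property)) aggregation described above can be emulated in plain Python to show what the reduce step computes. This sketch runs without a Spark cluster; `reduce_by_key` and `merge` are hypothetical helpers, and the rows are made-up data. In PySpark the same pattern would typically be written as an `rdd.map(...)` followed by `reduceByKey(merge)`.

```python
from functools import reduce
from itertools import groupby


def reduce_by_key(pairs, combine):
    """Group (key, value) pairs by key and fold each group with combine."""
    ordered = sorted(pairs, key=lambda kv: kv[0])
    return {
        key: reduce(combine, (value for _, value in group))
        for key, group in groupby(ordered, key=lambda kv: kv[0])
    }


def merge(a, b):
    """Add the counts and add the feature vectors element-wise."""
    return (a[0] + b[0], [p + q for p, q in zip(a[1], b[1])])


# Hypothetical labelled rows: (category label, feature vector).
rows = [("high", [1, 0, 2]), ("low", [0, 1, 1]), ("high", [2, 1, 0])]
# Map step: emit (Key, (Count, Property)) with Count = 1.
mapped = [(label, (1, features)) for label, features in rows]
# Reduce step: obtain (Key, (N, Property-Sum)) for each category.
aggregated = reduce_by_key(mapped, merge)
print(aggregated)
```

From each (N, Property-Sum) pair, the prior of a category follows from N and the per-feature frequencies follow from Property-Sum.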
Feature extraction of unbalanced data by Fuzzy C-means clustering algorithm
The key point of e-commerce product quality risk assessment is the extraction of feature groups. This paper uses the Fuzzy C-Means (FCM) clustering algorithm to remove irrelevant features and extract features from massive e-commerce product data sets. Suppose that the incomplete sample data set has n evaluation objects S = { s1, s2, . . . , sn }, each evaluation object has m attributes B = { b1, b2, . . . , bm }, and X = { xij } (i = 1, 2, . . . , n, j = 1, 2, . . . , m), where xij is the evaluation value of the jth evaluation index for the ith evaluation object. If the evaluation objects are divided into c different categories, the category centers are C = {c1, c2, . . . , cc}, and dij denotes the distance between the evaluated object sj and the category center ci.
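Suppose the membership degree μij denotes the degree to which the element xj of the evaluation object sj belongs to category ci, which satisfies
$$\sum_{i=1}^{c} \mu_{ij} = 1, \quad \mu_{ij} \in [0, 1], \quad j = 1, 2, \ldots, n$$
The objective function J (U, C) of the FCM algorithm is Formula (7), which in the standard FCM formulation reads
$$J(U, C) = \sum_{i=1}^{c} \sum_{j=1}^{n} \mu_{ij}^{m}\, d_{ij}^{2} \qquad (7)$$
The fuzzy factor m is set to 2.0. The derivatives of formula (7) with respect to the membership degree μij and the cluster center ci are obtained with the Lagrange multiplier method.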
Formulas (7) and (8) show that the sample clustering centers are adjusted by the factor λj, and that the clustering iteration is adjusted indirectly by the membership degree μij. The specific procedure of the fuzzy clustering algorithm is as follows:
(1) Set the number of clusters c, the fuzzy weight m, the iteration termination threshold ɛ and the maximum number of iterations T.
(2) Update the membership function.
(3) Update the clustering centers.
(4) Check the termination conditions. If ∥J(t+1) - J(t) ∥ < ɛ or the maximum number of iterations T is reached, stop and output the results; otherwise, let t = t+1 and return to step (2).
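The four steps can be sketched with the textbook FCM update rules. This is a minimal sketch under stated assumptions: it uses Euclidean distance, complete data (no special handling of missing values) and centers initialized from randomly chosen samples, none of which is confirmed by the text. The function name and parameters are hypothetical, and the returned memberships are indexed as [sample][cluster].

```python
import math
import random


def fuzzy_c_means(points, c, m=2.0, eps=1e-6, max_iter=100, seed=0):
    """Textbook FCM on a list of equal-length numeric tuples."""
    rng = random.Random(seed)
    n, dim = len(points), len(points[0])
    # Step 1: initialize the centers from c distinct samples (an assumption).
    centers = [list(p) for p in rng.sample(points, c)]
    previous_j = math.inf
    memberships = []
    for _ in range(max_iter):
        # Step 2: update memberships from distances to the current centers.
        memberships = []
        for p in points:
            dists = [max(math.dist(p, ctr), 1e-12) for ctr in centers]
            row = [
                1.0 / sum((d_i / d_k) ** (2.0 / (m - 1.0)) for d_k in dists)
                for d_i in dists
            ]
            memberships.append(row)
        # Step 3: update each center as a membership-weighted mean.
        for i in range(c):
            weights = [memberships[j][i] ** m for j in range(n)]
            total = sum(weights)
            centers[i] = [
                sum(w * points[j][d] for j, w in enumerate(weights)) / total
                for d in range(dim)
            ]
        # Step 4: stop when the objective J changes by less than eps.
        j_value = sum(
            memberships[j][i] ** m * math.dist(points[j], centers[i]) ** 2
            for j in range(n)
            for i in range(c)
        )
        if abs(previous_j - j_value) < eps:
            break
        previous_j = j_value
    return centers, memberships


# Hypothetical two-dimensional data with two well-separated groups.
demo_points = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1),
               (5.0, 5.0), (5.1, 5.0), (5.0, 5.1)]
demo_centers, demo_memberships = fuzzy_c_means(demo_points, c=2)
```

Each membership row sums to 1, matching the constraint on μij, and the largest entry of a row indicates the cluster a sample most strongly belongs to.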
Through information extraction, the massive data can be organized into a logical structure of keyword combinations. The quality information of e-commerce products can thus be expressed in the structure “product brand + fault performance/injury situation”, as shown in Fig. 5:

The information structure of e-commerce product quality.
The product quality risk data of e-commerce form an unbalanced data set with interdependent features, and for some features the sample information is insufficient for stable analysis. The Naive Bayesian algorithm has shortcomings, such as assuming that attributes are independent of each other and requiring the prior probability to be known, which affect classification accuracy. The cost-sensitive learning (CSL) algorithm can be used to deal with the classification of imbalanced data. Therefore, this paper proposes the CSL-Naive Bayesian algorithm for the intelligent assessment of e-commerce product quality risk with unbalanced data.
Step 1 Data preprocessing: after cleaning and consolidation, the sample data are divided into training and test data sets. The original training data set is expressed as V (xi) = (xi1, xi2, . . . , xin, ci), where n is the number of features of the data set, xin is the weight of the feature word and ci is the category of the data set.
The cost matrix F (i, j) represents the misjudgement cost of the classifier. Cmin is the minority category and Cmax is the majority category, as shown in Table 2.
The cost matrix of cost-sensitive learning algorithm
Step 2 Determine the type of the classification task: a training task goes to Step 3, a test task goes to Step 4, and a classification task goes to Step 7.
Step 3 Train the Naive Bayesian classification model: according to the training data set, calculate the prior probability of each category, P (Ci) = Ci/n, and the conditional probability P (xi|Ci) of each feature attribute given the category Ci.
Once the cost matrix is determined, the risk function is constructed from the Naive Bayesian algorithm, as expressed in Equation (10).
P (Cj|x) is the posterior probability, and c = arg min {R (ci|x)}, 1 ⩽ i ⩽ l, selects the category for which the risk of assigning sample x is smallest.
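A minimal sketch of the minimum-risk decision follows. It assumes the usual conditional-risk form R(ci|x) = Σj F(i, j) P(Cj|x), with F(i, j) read as the cost of predicting category i when the true category is j; the function names, cost values and posteriors are hypothetical, since Equation (10) and Table 2 are not reproduced here.

```python
def conditional_risk(posteriors, cost, predicted):
    """R(predicted | x) = sum_j cost[predicted][j] * P(C_j | x)."""
    return sum(cost[predicted][j] * p for j, p in enumerate(posteriors))


def min_risk_class(posteriors, cost):
    """Return the index of the class with the smallest conditional risk."""
    risks = [conditional_risk(posteriors, cost, i) for i in range(len(cost))]
    return min(range(len(risks)), key=risks.__getitem__)


# Index 0 = minority class C_min, index 1 = majority class C_max.
# cost[i][j]: hypothetical cost of predicting class i when the truth is j.
cost_matrix = [
    [0.0, 1.0],  # predict minority: wrong only if the truth is majority
    [5.0, 0.0],  # predict majority: missing a minority sample costs 5
]
posterior = [0.3, 0.7]  # hypothetical P(C_min | x), P(C_max | x)
print(min_risk_class(posterior, cost_matrix))
```

With these numbers the minority class is chosen even though its posterior is lower, because misclassifying a minority sample is assigned a higher cost. This is the effect cost-sensitive learning relies on for unbalanced data.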
Step 4 Calculate the weight wk of the Naive Bayesian algorithm: (1) Select the hash function. (2) Calculate the hash value of each sample with the hash function. (3) Calculate the correlation wk between the feature attributes and the category attributes, which is the similarity probability of sets A and B obtained from their hash values by formula (11).
Step 5 Establish the CSL-Naive Bayesian classification model based on the conditional probability weights, which calculates the correlation P (X|Ci) (1 ⩽ i ⩽ m) between variables after dimensionality reduction of the sample data set, as in Equation (12).
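The hash-based similarity in Step 4 resembles MinHash, in which the probability that two sets share the same minimum hash value equals their Jaccard similarity. The sketch below illustrates that technique under this assumption. It is not confirmed to be the paper's formula (11), and the seeded SHA-256 hash family and function names are hypothetical choices.

```python
import hashlib


def _hash(value, seed):
    """Deterministic 64-bit hash of a value under a given seed."""
    digest = hashlib.sha256(f"{seed}:{value}".encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")


def minhash_signature(items, num_hashes=128):
    """Keep the minimum hash value of the (non-empty) set per seeded hash."""
    return [min(_hash(item, seed) for item in items) for seed in range(num_hashes)]


def estimated_similarity(set_a, set_b, num_hashes=128):
    """Fraction of matching signature positions, estimating Jaccard similarity."""
    sig_a = minhash_signature(set_a, num_hashes)
    sig_b = minhash_signature(set_b, num_hashes)
    matches = sum(1 for a, b in zip(sig_a, sig_b) if a == b)
    return matches / num_hashes


# Hypothetical sets: samples containing a feature vs. samples in a category.
feature_samples = {"s1", "s2", "s3", "s4"}
category_samples = {"s2", "s3", "s4", "s5"}
w_k = estimated_similarity(feature_samples, category_samples)
```

The estimate always lies in [0, 1], which matches the stated range of the weight wk. More hash functions reduce the variance of the estimate at the cost of more computation.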
Y is the category label of the training sample and X is the feature attribute set of the training sample, which includes customer review, review star level, logistics company, membership, receiving place, product number, environment, national inspection products, product types, product brands, sales channel, product material, etc. The weight wk ∈ [0, 1] expresses the weighted value between the feature attributes X and the category attribute Y of the sample data set. Suppose the category attribute Y can be expressed as a linear function of the feature attribute X, that is, Y* = θ + γX.
Taking the partial derivatives of r with respect to θ and γ gives:
The values of θ and γ can then be calculated and used in formula (16).
A larger correlation coefficient indicates a good correlation between the decision attribute and the feature attribute; a smaller one indicates a poor correlation.
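The linear fit Y* = θ + γX and the correlation coefficient can be sketched with ordinary least squares. This assumes that r denotes the sum of squared residuals Σ (y − θ − γx)² and that the correlation coefficient is Pearson's; both are assumptions, since the intermediate formulas are not reproduced here. The function names and data are hypothetical.

```python
import math


def fit_line(xs, ys):
    """Least-squares estimates (theta, gamma) for y ~ theta + gamma * x."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    # Setting the partial derivatives of the squared error to zero gives:
    gamma = sxy / sxx
    theta = mean_y - gamma * mean_x
    return theta, gamma


def correlation(xs, ys):
    """Pearson correlation coefficient between xs and ys."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return sxy / math.sqrt(sxx * syy)


# Hypothetical feature values and numerically encoded category labels.
feature_values = [1.0, 2.0, 3.0, 4.0]
category_values = [3.0, 5.0, 7.0, 9.0]
theta, gamma = fit_line(feature_values, category_values)
rho = correlation(feature_values, category_values)
```

A coefficient close to 1 in absolute value indicates that the feature is strongly related to the category attribute; both functions require the feature values not to be all identical.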
Step 6 Test the product quality risk assessment model of e-commerce with the training data set.
Step 7 For classification tasks, use the data sets of the classification task to predict categories. When P (Ci|X) < P (Cj|X) for all i ≠ j, the sample X = (x1, x2, . . . , xn) ∈ Cj. Compute the category distinguish degree of each feature in the feature cluster, as shown in formula (17).
The category distinguish degree is the difference in the mutual information of a feature between the majority category and the minority category, where the mutual information represents the correlation of the feature with a certain category. The features with the highest distinguishing ability are retained according to the category distinguish degree. Features strongly correlated with the minority category have lower mutual information in the majority category, and features strongly correlated with the majority category have lower mutual information in the minority category. The strongly correlated features of the majority and minority categories are therefore selected, and their category distinguish degree is calculated by formula (17).
tk denotes a feature attribute and ci denotes a category. The distinguish degree of a feature across categories is defined as the difference between its mutual information in the majority category and in the minority category. The larger the difference, the stronger the discrimination ability.
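The distinguish degree can be sketched from sample counts. The sketch assumes the pointwise form MI(t, c) = log(P(t, c) / (P(t) P(c))) and takes the absolute difference between the two categories; the exact form of formula (17) is not reproduced here, so both choices, along with the function names and counts, are assumptions.

```python
import math


def mutual_information(n_tc, n_t, n_c, n_total):
    """Pointwise MI: log(P(t, c) / (P(t) * P(c))) estimated from counts."""
    if n_tc == 0:
        return -math.inf  # the feature never occurs in this category
    return math.log((n_tc * n_total) / (n_t * n_c))


def distinguish_degree(n_t_min, n_t_max, n_min, n_max):
    """|MI(t, C_min) - MI(t, C_max)| for one feature t.

    n_t_min / n_t_max: samples containing t in the minority / majority class.
    n_min / n_max: total samples in the minority / majority class.
    """
    n_total = n_min + n_max
    n_t = n_t_min + n_t_max
    if n_t == 0:
        return 0.0  # a feature that never occurs cannot discriminate
    mi_min = mutual_information(n_t_min, n_t, n_min, n_total)
    mi_max = mutual_information(n_t_max, n_t, n_max, n_total)
    return abs(mi_min - mi_max)


# Hypothetical counts: 100 minority and 900 majority samples.
skewed = distinguish_degree(40, 40, 100, 900)   # common in the minority class
neutral = distinguish_degree(10, 90, 100, 900)  # equally frequent in both
```

A feature that occurs at the same rate in both categories scores 0, while one concentrated in a single category scores high and would be retained.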
This paper creates a machine learning workflow and implements intelligent optimization models based on the DataFrame interface, which provides a large number of spark.ml algorithms in the Apache Spark library. PySpark, the Python API and shell of Spark, allows Spark programs to be written in Python, for example:
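from pyspark import SparkContext
# Create a local Spark context and ship the listed files to the workers.
sc = SparkContext("local", "Job Name", pyFiles=["MyFile.py", "lib.zip", "app.egg"])
# Load a text file as an RDD and return the first 5 lines starting with "spar".
words = sc.textFile("/usr/example/words")
words.filter(lambda w: w.startswith("spar")).take(5)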
Based on the Apache Spark computing framework, the CSL-Naive Bayesian algorithm is used for the product quality risk assessment of e-commerce. The flow of the CSL-Naive Bayesian classification algorithm is shown in Fig. 6:

The flow of Naive Bayesian algorithm.
1) Use sc.textFile("hdfs://n1:8020/user/data/input") to load the data from HDFS and generate an RDD.
2) Apply the hash operation to each row of data to generate (key, List[Int]) records and calculate the minimum value of List[Int]. Repeat the operation k times to generate an n×k data set, and calculate the weights.
3) For the RDD from step 1), use the map function to generate data of the form (key, (1, value)), where key is the label and value is the feature.
4) Apply reduceByKey to records with the same key (label) and value (feature); after summation this gives (key, (num, SumValue)), where key is the label, num is the number of categories and SumValue is the feature count of the same category.
5) Combine the weights from step 2) with the conditional probabilities from step 4) to calculate the weighted posterior probability of the NB algorithm.
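Step 5) can be sketched as a weighted Naive Bayes posterior. The sketch assumes that each weight wk enters as an exponent on the corresponding conditional probability, P(C|X) ∝ P(C) Πk P(xk|C)^wk, a common weighting scheme that the text does not confirm. The function name and all probabilities are hypothetical.

```python
import math


def weighted_posteriors(priors, conditionals, weights):
    """Normalized P(C | x) proportional to P(C) * prod_k P(x_k | C) ** w_k."""
    log_scores = {}
    for label, prior in priors.items():
        log_score = math.log(prior)
        for p, w in zip(conditionals[label], weights):
            # A weight of 0 ignores the feature; a weight of 1 is plain NB.
            log_score += w * math.log(p)
        log_scores[label] = log_score
    # Subtract the maximum before exponentiating for numerical stability.
    peak = max(log_scores.values())
    unnormalized = {c: math.exp(s - peak) for c, s in log_scores.items()}
    total = sum(unnormalized.values())
    return {c: v / total for c, v in unnormalized.items()}


# Hypothetical priors and per-feature conditional probabilities P(x_k | C).
example_priors = {"high": 0.5, "low": 0.5}
example_conditionals = {"high": [0.8, 0.1], "low": [0.2, 0.9]}
print(weighted_posteriors(example_priors, example_conditionals, [1.0, 0.0]))
```

Setting all weights to 1 recovers the unweighted Naive Bayesian posterior, so the weights act as a per-feature relaxation of the independence assumption.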
Experimental data and environment
The experimental environment consists of Hadoop 2.7, Python 3.5, Maven and a Spark 2.3 cluster running on Windows, in which Spark serves as the computing framework and Hadoop’s HDFS serves as the file storage system. The environment configuration is as follows:
This experiment uses the UCI database provided by the University of California, Irvine, which contains 335 common standard test data sets. The data sets Car Evaluation, Iris, Libras Movement, Monks and Pneumonia are selected as training and test data for the machine learning algorithm. The Car Evaluation data set has 1728 samples and 6 feature attributes; the Iris data set has 150 samples and 4 feature attributes; the Libras Movement data set has 360 samples and 91 feature attributes; the Pneumonia data set has 9,976 samples and 76 feature attributes; and the Monks data set has 248 samples and 7 feature attributes.
In order to deal with unbalanced data sets, the minority category samples are defined as positive samples, whereas the majority category samples are defined as negative samples. The classifier performance evaluation indicators are Precision, Recall, F-measure and G-mean. The above evaluation indicators are calculated based on the confusion matrix of the classification problem, as shown in Table 4.
The configuration of big data environment
Confusion matrix
Precision is the classification accuracy of the classifier on predicted positives: the ratio of the number of positive samples correctly classified by the classifier to the number of all samples classified as positive. With TP, FP, FN and TN denoting the true positive, false positive, false negative and true negative counts of the confusion matrix, its mathematical definition is as follows:
$$\text{Precision} = \frac{TP}{TP + FP}$$
Recall measures the classification completeness of the classifier: the ratio of the number of positive samples correctly classified by the classifier to the number of all actual positive samples. Its mathematical definition is as follows:
$$\text{Recall} = \frac{TP}{TP + FN}$$
F-measure is an evaluation index of the classification algorithm that combines Recall with Precision. The calculation formula is
$$\text{F-measure} = \frac{2 \times \text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}}$$
G-mean measures how balanced the classification performance of a classifier is between positive and negative samples in unbalanced data sets. Its mathematical definition is as follows:
$$\text{G-mean} = \sqrt{\frac{TP}{TP + FN} \times \frac{TN}{TN + FP}}$$
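The four indicators can be computed directly from confusion-matrix counts. The sketch below uses the standard definitions, with the minority category as the positive class; the function name and the example counts are hypothetical, and zero denominators are mapped to 0.0 as a convention.

```python
import math


def classification_metrics(tp, fn, fp, tn):
    """Precision, recall, F-measure and G-mean from confusion-matrix counts."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    specificity = tn / (tn + fp) if tn + fp else 0.0
    if precision + recall:
        f_measure = 2 * precision * recall / (precision + recall)
    else:
        f_measure = 0.0
    # G-mean balances accuracy on the positive and the negative class.
    g_mean = math.sqrt(recall * specificity)
    return {
        "precision": precision,
        "recall": recall,
        "f_measure": f_measure,
        "g_mean": g_mean,
    }


# Hypothetical counts for an unbalanced test set (100 positive, 900 negative).
example_metrics = classification_metrics(tp=80, fn=20, fp=10, tn=890)
```

On unbalanced data, G-mean drops sharply when the classifier neglects the minority class, which plain accuracy would hide.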
The performance (Precision, Recall and G-mean) of the NB and CSL-NB algorithms on the UCI data sets Car Evaluation, Iris, Libras Movement, Pneumonia and Monks is shown in Table 5. The experiments show that the accuracy of the FCM&CSL-NB algorithm is between 3% and 7% higher than that of the NB algorithm, and that it converges faster than the NB algorithm.
Experimental performance of NB and CSL-NB
To evaluate the product quality risk of the e-commerce platform, this paper uses an experimental data set that includes 1 million tagged training records in train.csv and 500,000 untagged test records in test.csv. The risk attributes of the sample data are shown in Table 6.
The risk attributes of training data
The experiment analyzes the product quality risk of the e-commerce platform with the NB algorithm and the CSL-Naive Bayesian algorithm. After the results are cross-validated against the risk assessment of the original data set, the classifier performance is as shown in Table 7.
Experiment results of product risk assessment for e-commerce
According to the experimental results, the CSL-Naive Bayesian algorithm reaches its highest precision at 181,500 dimensions of the data set after feature selection and dimensionality reduction. The precision curve of the CSL-NB algorithm on Spark is shown in Fig. 7(a). The precision of the CSL-NB algorithm is 5.31% higher than that of NB and its recall is 7.1% higher; the F-measure of the CSL-NB algorithm is 0.8831, compared with 0.8146 for NB. Fig. 7(b) shows the F-measure histogram.

The precision curve and F-measure histogram of NB and CSL-NB Algorithm, (a) The precision curve, (b) F-measure histogram.
To forecast e-commerce product quality risk, this paper studies machine learning algorithms for the risk assessment of e-commerce product quality. It uses the Fuzzy C-Means clustering algorithm to select the risk attributes and proposes the CSL-Naive Bayesian algorithm to improve classification accuracy on the imbalanced data of e-commerce product quality. The FCM&CSL-NB algorithm on Spark achieves better accuracy on the data sets Car Evaluation, Iris, Libras Movement, Monks and Pneumonia in a large-scale data environment. The experimental results demonstrate its efficiency, accuracy and robustness: it outperforms the other algorithms and can be applied to the product quality risk assessment of e-commerce.
Acknowledgments
This work is supported by the Natural Science Foundation of Zhejiang Province in China (Grant No. LY18G020009) and the National Natural Science Foundation of China (Grant No. 71831006).
