Improved C4.5 algorithm based on k-means

Abstract

When the traditional C4.5 algorithm is applied to big data with a large number of multidimensional continuous attribute values, its discretization method may lead to low classification accuracy. This paper proposes a novel method to discretize continuous data based on the k-means algorithm. The method generates data clusters by combining continuous attribute values, which carry no class features on their own, with their corresponding class labels, and then takes the approximate boundary points between clusters as the candidate splitting-points of the continuous attribute. The information gain ratio is then calculated at these points. Experimental results show that the proposed K-C4.5 algorithm improves the classification accuracy of the decision tree in comparison with the traditional algorithm.

Keywords

C4.5, K-means, continuous attribute, discretization

1. Introduction

With the rapid development of big data and artificial intelligence technology, more and more industries, such as smart healthcare, need machine learning techniques for data mining in order to obtain valuable information from big data. Data complexity is increasing, and data volumes keep growing. The data involved in analysis are no longer just discrete data with clear feature information and a limited number of sample types. On the contrary, a large share of them are continuous data, which makes classification difficult. Moreover, the attributes are measured in many different units, and the attribute values are not countable.

Many kinds of machine learning algorithms exist for different application scenarios. Among them, the C4.5 algorithm has been widely used in clinical diagnosis [1], behavior analysis [2] and meteorological prediction [3], thanks to its simple classification rules and its clear, traceable classification process. The C4.5 algorithm is a decision tree algorithm proposed by Quinlan [4] in 1993 as an improvement and extension of his ID3 algorithm [5]. C4.5 uses the information gain ratio instead of the information gain used in ID3, which mitigates the bias of ID3 toward multi-valued attributes and improves the classification accuracy of the decision tree. In addition, C4.5 transforms continuous attributes into discrete attributes, which makes up for the insufficiency of ID3 in processing continuous data.

However, the C4.5 algorithm suffers from two performance issues on large-scale data sets with continuous data. First, the traditional discretization method in C4.5 may require a long execution time, so it is difficult to obtain the expected classification result quickly and the analysis loses its timeliness. Second, the discretization method may lose information because it treats continuous values as values without class features, which makes the final classification accuracy insufficient.

This paper contributes to the current discretization method by proposing two novel continuous-attribute discretization mechanisms, K-C4.5 and TD-C4.5, to address the issues mentioned above.

The main contribution, K-C4.5, consists of a new discretization method based on the k-means clustering algorithm [6]. The discretization method of K-C4.5 combines the continuous attribute with the category information, which makes the selection of candidate splitting-points more reasonable and thereby improves the classification accuracy. K-C4.5 generates a new data set that contains the continuous attribute and the related class labels from the original data set. Then, several clusters are constructed with the k-means clustering algorithm. After that, the information gain ratio is calculated at the approximate boundary points between clusters, and the optimal splitting-point is selected to construct the decision tree.

TD-C4.5 (short for Ten-Decile-group-based C4.5) decreases the execution time and improves the classification efficiency in comparison with the traditional C4.5 algorithm. TD-C4.5 calculates the information gain ratio at 10 equally spaced points after sorting the continuous attribute values.

Extensive experiments have been conducted. The experimental results show that the accuracy of the proposed K-C4.5 algorithm is better than that of both the traditional C4.5 algorithm and the TD-C4.5 algorithm in data mining scenarios that contain plentiful continuous attributes. Moreover, when classifying data with high-dimensional continuous condition attributes, K-C4.5 improves the accuracy stably.

The rest of this paper is organized as follows. The related literature is reviewed in Section 2, and the traditional C4.5 algorithm and the k-means algorithm are introduced in Section 3. The improved C4.5 discretization methods are then presented in Section 4, followed by experimental results and analysis in Section 5. Conclusions are drawn in the last section.

2. Related work

To address the inefficiency of the C4.5 algorithm in processing continuous attribute values, many scholars have improved its discretization method for continuous data.

The researchers in [7] proposed a discretization method based on the CAIM criterion to deal with continuous attributes. In this way, the information loss of numerical data in the process of discretization is reduced, which improves the classification accuracy of C4.5.

The authors [8] utilized the boundary value theorem to obtain the possible splitting-points; the optimal splitting-point was then selected by calculating the information gain ratio and Bayes probability of each point, which improves the efficiency of continuous attribute discretization.

To solve the problems of low efficiency and low classification accuracy of the C4.5 algorithm in processing continuous data, the authors of [9] proposed an improved C4.5 decision tree algorithm based on splitting criteria. Several criteria are formed by repeatedly sampling the training data with replacement, and the optimal classification features of the C4.5 decision tree are extracted from the multiple splitting criteria, which improves the execution speed of the algorithm on large-scale training data.

However, the optimal classification rule in [9] comes from a locally optimal solution, which aggravates the overfitting of the C4.5 algorithm. To solve this problem, the authors of [10], following Occam’s razor, use substitution estimation to reduce the scale of the C4.5 decision tree, thus reducing how closely the decision tree fits the training data. They also discretize the continuous data by using the boundary theorem and replace the logarithm calculation of information entropy in the original C4.5 algorithm with the Gini index, which improves the efficiency of the algorithm.

From the perspective of the relevance among conditional attributes, the authors of [11] optimize the C4.5 decision tree algorithm by calculating the average information entropy between one attribute and the other attributes to represent the redundancy of that attribute. This redundancy is combined with the information gain ratio as the split information of the decision tree, which improves the classification accuracy of the C4.5 decision tree algorithm.

In [12], conditional attributes are merged according to the correlation between them, and an improved C4.5 decision tree algorithm based on cosine similarity is proposed. The method calculates the cosine similarity of two attributes with a small difference in information entropy, merges the two attributes if their similarity is within the threshold range, and recalculates the information gain ratio of the merged “new attribute”, which then participates in the next split calculation. This effectively reduces the scale of the decision tree and improves the execution efficiency and classification accuracy of the C4.5 algorithm.

The authors of [13] used the Kendall coefficient of concordance to estimate the correlation of conditional attributes, classified conditional attributes according to the rating method, and introduced a coefficient to improve the calculation formula of information gain. Experimental results show that the improved algorithm has higher execution efficiency and classification accuracy.

Table 1 summarizes the literature reviewed above in terms of the key way (core idea), technical methods and capabilities.

Table 1
A summary of literature

Ref. no	Key way	Technical methods	Capabilities
[7]	Rough set theory	Discretization method based on CAIM criterion	Reduced the information loss, improved the classification accuracy
[8]	Boundary value theorem	Optimal threshold method combining information gain rate with bayesian probability	Improved the efficiency of continuous attribute discretization
[9]	Classification rules	Selection method based on the optimal classification rules and selected optimal feature	Improved the execution speed of the algorithm
[10]	Fayyad and Irani boundary theory	Simplified method with Gini index instead of information entropy	Reduced the fit of the decision tree to training data, improved the efficiency of the algorithm
[11]	Redundancy of attributes	Information gain, segmentation entropy and redundancy combination method	Improved the classification accuracy
[12]	Cosine similarity	Discretization method with combining attribute values	Reduced the scale of the decision tree, improved the execution efficiency and classification accuracy
[13]	Kendall harmony coefficient	Selecting attributes by the correlation between conditional attributes	Improved the execution efficiency and classification accuracy

3. Introduction to traditional C4.5 and k-means algorithm

3.1 Traditional C4.5 algorithm

C4.5 is a decision tree algorithm commonly used in data mining. A decision tree algorithm [14] is a classification algorithm that builds a tree-structured classifier by learning from a group of training data with class labels. Each non-leaf node of the tree represents an attribute, and each branch represents a split path on that attribute. Each leaf node stores a class label of the training data set, which represents the final classification result after layer-by-layer splitting. By recursively traversing all attributes of the training data set, a decision tree model with classification ability can be established. Figure 1 shows a simple application of the decision tree model in the medical field.

Figure 1.

Simple decision tree model [15].

In the decision tree model, the attribute selection measure (splitting criterion) is used to decide which attribute and which attribute value to split on. The C4.5 algorithm uses the information gain ratio as the splitting criterion to evaluate each attribute of the training data set.

Let $D_{C}$ be the training data set with a class label attribute, and let its size be $n$. The class label attribute has a total of $m$ different values, which define $m$ different classes in $D_{C}$. Suppose $R_{i}$ ($i=1,2,\ldots,m$) is the set of data tuples in $D_{C}$ that belong to class $i$; then

$\displaystyle D_{C}=\sum\limits_{i=1}^{m}{R_{i}}$ (1)

Let $p_{i}$ be the non-zero probability that an arbitrary data tuple in $D_{C}$ belongs to $R_{i}$, that is, the proportion of class $i$ among all tuples:

$\displaystyle p_{i}=\frac{\left|{R_{i}}\right|}{\left|{D_{C}}\right|}$ (2)

Then, the information entropy of data set $D_{C}$, that is, the expected information required to classify a data tuple in $D_{C}$, is:

$\displaystyle\textit{Info}(D_{C})=-\sum\limits_{i=1}^{m}{p_{i}\log_{2}(p_{i})}$ (3)
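As a small illustration of Eqs (2) and (3), the entropy of a list of class labels can be computed as follows. This is a minimal Python sketch rather than the implementation used in the experiments; the function name `info` and the toy labels are assumptions made for illustration.

```python
import math
from collections import Counter


def info(labels):
    """Information entropy Info(D_C) of a list of class labels, as in Eq. (3)."""
    n = len(labels)
    counts = Counter(labels)
    # p_i = |R_i| / |D_C| as in Eq. (2); only classes that actually occur are summed,
    # so every p_i is non-zero and log2 is always defined.
    return -sum((c / n) * math.log2(c / n) for c in counts.values())


# Two equally frequent classes need one bit of information per tuple.
print(info(["yes", "yes", "no", "no"]))  # 1.0
```

A pure data set, in which all tuples share one class label, has entropy zero, which is the stopping condition of the recursive splitting.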

Let this training data set have a total of $x$ attributes, one of which is $A$, and let the observed data of attribute $A$ take $t$ different values $\{a_{1},a_{2},\ldots,a_{t}\}$. Suppose we use attribute $A$ as the partition attribute of the data set $D_{C}$. If $A$ is discrete, the data set $D_{C}$ can be divided into $t$ data subsets $\{D_{1},D_{2},\ldots,D_{t}\}$ corresponding to the $t$ different attribute values $\{a_{1},a_{2},\ldots,a_{t}\}$, where $D_{j}$ is the subset whose value of attribute $A$ is $a_{j}$. In the ideal case, each subset $D_{1},D_{2},\ldots,D_{t}$ is a pure data set, but in practice these subsets often contain multiple classes. How much information is still needed after this division to obtain an accurate classification? The information entropy under the split on attribute $A$ is calculated by the following formula:

$\displaystyle\textit{Info}_{A}(D_{C})=\sum\limits_{j=1}^{t}\frac{\left|{D_{j}}\right|}{\left|{D_{C}}\right|}\times\textit{Info}(D_{j})$ (4)

The smaller $\textit{Info}_{A}(D_{C})$ is, the higher the purity of the partition of $D_{C}$ based on $A$.

The split information of the $t$ partitions $\{D_{1},D_{2},\ldots,D_{t}\}$ is:

$\displaystyle\textit{SplitInfo}_{A}(D_{C})=-\sum\limits_{j=1}^{t}\frac{\left|{D_{j}}\right|}{\left|{D_{C}}\right|}\times\log_{2}\left(\frac{\left|{D_{j}}\right|}{\left|{D_{C}}\right|}\right)$ (5)

Let $\textit{Info}(D_{C})$ be the information entropy of the initial data set, and $\textit{Info}_{A}(D_{C})$ be the information entropy after partitioning on attribute $A$. Then, the information gain $\textit{Gain}(A)$ of partitioning the training data set on $A$ is calculated as follows; it indicates how much the total amount of information needed for classification is reduced by the division on attribute $A$.

$\displaystyle\textit{Gain}(A)=\textit{Info}(D_{C})-\textit{Info}_{A}(D_{C})$ (6)

The information gain ratio of the partition on attribute $A$ is expressed as:

$\displaystyle\textit{GainRate}(A)=\frac{\textit{Gain}(A)}{\textit{SplitInfo}_{A}(D_{C})}$ (7)
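The chain from entropy to gain ratio for a discrete attribute can be sketched in a few lines of Python. This is an illustrative sketch under stated assumptions, not the authors' code: the function names `info` and `gain_ratio` and the toy data are hypothetical, the partition is indexed by the $t$ distinct values of $A$, and the split information uses the standard C4.5 sign convention in which it is non-negative.

```python
import math
from collections import Counter, defaultdict


def info(labels):
    """Entropy of a list of class labels (Eq. (3))."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())


def gain_ratio(values, labels):
    """Gain ratio of a discrete attribute A, following Eqs (4)-(7)."""
    n = len(labels)
    subsets = defaultdict(list)  # a_j -> class labels of the tuples in D_j
    for v, y in zip(values, labels):
        subsets[v].append(y)
    weights = [len(s) / n for s in subsets.values()]
    # Entropy remaining after the split on A (Eq. (4)).
    info_a = sum(w * info(s) for w, s in zip(weights, subsets.values()))
    # Split information of the partition, taken as non-negative (Eq. (5)).
    split_info = -sum(w * math.log2(w) for w in weights)
    gain = info(labels) - info_a  # Eq. (6)
    # A single-valued attribute has zero split information; return 0 by convention.
    return gain / split_info if split_info > 0 else 0.0


outlook = ["sunny", "sunny", "rain", "rain"]
play = ["no", "no", "yes", "yes"]
print(gain_ratio(outlook, play))  # 1.0
```

In the toy example the attribute separates the two classes perfectly, so the gain equals the initial entropy and the ratio reaches its maximum of 1.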

At this point, we choose the attribute with the highest information gain ratio as the split attribute of the decision tree node, take the corresponding value as the splitting-point, and set it as the split node $N$ of the decision tree. This process is applied recursively: each splitting-point divides the data set into smaller subsets, and the attributes and their values participating in the calculation become fewer and fewer. Splitting stops when all data tuples of each subset belong to the same category, at which point the decision tree is established.

3.2 K-means clustering algorithm

The k-means clustering algorithm is one of the classical clustering algorithms. Its core idea is to assign data objects that are close in Euclidean distance to the same cluster by iterative calculation, so that the points within a cluster are as close as possible and the clusters are as separate as possible from each other. The k-means algorithm first randomly selects $k$ objects as the initial cluster centers, then calculates the Euclidean distances from every other object to those $k$ centers and assigns each object to the cluster with the nearest center. Once all objects have been assigned, the data set is divided into $k$ clusters. The algorithm then recalculates the center of each cluster, recomputes the distances from all data objects to the new centers, and reassigns the objects. This is repeated until no data object is assigned to a new cluster or no cluster center changes, at which point the calculation terminates.

Let the original training data set $D_{K}$ contain $n$ multi-dimensional data objects, from which $k$ objects are randomly selected as the initial cluster centers. The $n$ data objects are assigned to $k$ clusters $C_{1},C_{2},\ldots,C_{k}$, where the center point of each cluster $C_{i}$ is defined as the mean of the objects in the cluster. For a point $p\in C_{i}$, $\textit{dist}(p,C_{i})$ represents the Euclidean distance from $p$ to the center of cluster $C_{i}$. The Euclidean distance between two $v$-dimensional objects $i$ and $j$ is:

$\displaystyle\textit{dist}(i,j)=\sqrt{(x_{i1}-x_{j1})^{2}+(x_{i2}-x_{j2})^{2}+\cdots+(x_{iv}-x_{jv})^{2}}$ (8)

The sum of squared errors $E$ between all objects and the centers of their clusters represents the quality of the clustering. The smaller $E$ is, the more compact the clusters are and the higher the clustering quality.

$\displaystyle E=\sum\limits_{i=1}^{k}\sum\limits_{p\in C_{i}}{\textit{dist}(p,C_{i})^{2}}$ (9)
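The iteration described above, together with Eqs (8) and (9), can be sketched in plain Python as follows. This is a minimal sketch for illustration only: the function names `dist` and `kmeans`, the `max_iter` cap, the fixed `seed` and the handling of empty clusters are assumptions that are not specified in the text.

```python
import math
import random


def dist(p, q):
    """Euclidean distance between two points of equal dimension (Eq. (8))."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))


def kmeans(points, k, max_iter=100, seed=0):
    """Plain k-means: returns (centers, clusters, sum of squared errors E)."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)  # k randomly chosen objects as initial centers
    for _ in range(max_iter):
        # Assignment step: every object joins the cluster with the nearest center.
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: dist(p, centers[i]))
            clusters[nearest].append(p)
        # Update step: each center becomes the mean of its cluster
        # (an empty cluster keeps its previous center, an assumption of this sketch).
        new_centers = [
            tuple(sum(coord) / len(cl) for coord in zip(*cl)) if cl else centers[i]
            for i, cl in enumerate(clusters)
        ]
        if new_centers == centers:  # no center changed: converged
            break
        centers = new_centers
    # Sum of squared errors over all clusters (Eq. (9)).
    sse = sum(dist(p, centers[i]) ** 2 for i, cl in enumerate(clusters) for p in cl)
    return centers, clusters, sse


points = [(0.0, 0.0), (0.0, 1.0), (10.0, 10.0), (10.0, 11.0)]
centers, clusters, sse = kmeans(points, k=2)
print(sorted(centers), sse)
```

Because the initial centers are drawn at random, different seeds may in general lead to different final clusters and different values of $E$.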

4. Improved C4.5 discretization method

4.1 Discretization of continuous attributes of C4.5 algorithm

In the calculation of attribute information entropy, if the attribute to be split is discrete, the calculation follows the number of distinct values of the attribute itself. If the attribute $A(a_{1},a_{2},\ldots,a_{n})$ to be split is continuous, the discretization method [4] of the C4.5 algorithm proceeds as follows. First, the values of attribute $A$ are sorted in ascending order, and the midpoint of every two adjacent values is obtained. Each midpoint divides the values of $A$ into two parts. These midpoints are used as candidate splitting-points for calculating the information gain ratio, and the point with the highest gain ratio is selected as the optimal splitting-point of the attribute.
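The midpoint-based discretization described above can be sketched as follows. This is an illustrative Python sketch, not Quinlan's original code: the helper names `info`, `split_gain_ratio` and `best_midpoint_split` are hypothetical, and each candidate is scored by the gain ratio of the binary split "value <= threshold" versus "value > threshold".

```python
import math
from collections import Counter


def info(labels):
    """Entropy of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())


def split_gain_ratio(values, labels, threshold):
    """Gain ratio of the binary split A <= threshold versus A > threshold."""
    left = [y for v, y in zip(values, labels) if v <= threshold]
    right = [y for v, y in zip(values, labels) if v > threshold]
    if not left or not right:
        return 0.0  # the threshold does not separate anything
    w_l, w_r = len(left) / len(labels), len(right) / len(labels)
    gain = info(labels) - (w_l * info(left) + w_r * info(right))
    split_info = -(w_l * math.log2(w_l) + w_r * math.log2(w_r))
    return gain / split_info


def best_midpoint_split(values, labels):
    """Score every midpoint of adjacent distinct values; return the best one."""
    distinct = sorted(set(values))
    candidates = [(a + b) / 2 for a, b in zip(distinct, distinct[1:])]
    if not candidates:
        return None  # a single distinct value cannot be split
    return max(candidates, key=lambda t: split_gain_ratio(values, labels, t))


values = [1.0, 2.0, 3.0, 10.0, 11.0, 12.0]
labels = ["a", "a", "a", "b", "b", "b"]
print(best_midpoint_split(values, labels))  # 6.5
```

With $N$ distinct values this evaluates $N-1$ candidates per attribute, which is the cost that the methods in this section try to reduce or redirect.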

Table 2
Effect of different critical values in TD-C4.5

Critical value	Classification accuracy (%)	Time (s)
50	92.75	36
100	92.78	37
150	92.77	56
200	92.76	60

4.2 TD-C4.5 method

To address the issues of the continuous attribute discretization method in the traditional C4.5 algorithm, this paper proposes an improved C4.5 algorithm, TD-C4.5, which is based on the decile group method. TD-C4.5 is presented in Algorithm 1.

Algorithm 1. TD-C4.5
Input:
$D_{\textit{train}}$ : A multidimensional training data set with continuous attributes and corresponding class labels.
Output:
C4.5 decision tree.
1 Sort the values of continuous attribute $A$ in ascending order.
2 Compute the number $N$ of distinct values of attribute $A$.
3 If $N$ is less than 100, proceed with the traditional C4.5 algorithm.
4 If $N$ is greater than or equal to 100, obtain the minimum and maximum values $a_{0}$, $a_{n+1}$ of attribute $A$, and insert 10 equally spaced division points in the interval $[a_{0},a_{n+1}]$.
5 Calculate the information gain ratio at each division point, and choose the point with the highest information gain ratio as the splitting-point of the decision tree.
6 The selected splitting-point divides the data set into two subsets; go back to Step 2 for each subset until the decision tree has been established.
7 Apply post-pruning to the generated decision tree to eliminate the influence of noise and isolated points on the classifier, and to reduce over-fitting of the decision tree to the training data.
8 End.

TD-C4.5 takes 100 as the critical value to determine whether an attribute is treated as continuous. The basic idea is to decrease the execution time while preserving the classification accuracy. Experiments were performed on the large-scale data set cod-rna [16], which has 59535 instances in 2 categories, to check the impact of the critical value on accuracy and run time. The results in Table 2 show no significant difference in classification accuracy, whereas the run time grows with the critical value. Intuitively, if the critical value is too small, discrete attributes may be mistaken for continuous ones. If it is too large, discretization loses its significance for data sets with a small sample size, and the execution time of the algorithm may be prolonged. Thus, 100 is selected as the critical value in TD-C4.5.
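The candidate selection of Steps 2 to 4 of Algorithm 1 can be sketched as follows. This is a hedged Python sketch rather than the authors' implementation: the function name `td_candidates` and its parameters are hypothetical, and the 10 division points are assumed to be interior points that split the interval between the minimum and maximum into 11 equal parts, which is one possible reading of the algorithm.

```python
def td_candidates(values, critical_value=100, n_points=10):
    """Candidate splitting-points for one attribute in the spirit of TD-C4.5."""
    distinct = sorted(set(values))
    if len(distinct) < critical_value:
        # Few distinct values: fall back to the traditional midpoints
        # of adjacent distinct values.
        return [(a + b) / 2 for a, b in zip(distinct, distinct[1:])]
    lo, hi = distinct[0], distinct[-1]
    # Assumption: n_points equally spaced interior points between min and max.
    step = (hi - lo) / (n_points + 1)
    return [lo + i * step for i in range(1, n_points + 1)]


print(td_candidates([3, 1, 2]))              # [1.5, 2.5]
print(len(td_candidates(list(range(111)))))  # 10
```

Each returned candidate would then be scored by the information gain ratio as in Step 5, so an attribute with at least 100 distinct values costs 10 evaluations regardless of its size.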

Figure 2.

K-C4.5 discretization flow chart.

Algorithm 2. K-C4.5
Input:
$D_{\textit{train}}$ : A multidimensional training data set with continuous attributes and corresponding class labels.
Output:
C4.5 decision tree.
1 Pre-process the input data.
2 Determine whether the attribute $C$ in $D_{\textit{train}}$ has continuous values. If yes, go to Step 3; otherwise go to Step 8.
3 Extract the values of $C$ and the corresponding class labels to generate the array $D_{K}$.
4 In $D_{K}$, randomly select $k$ objects as the initial cluster centers and perform clustering until $k$ stable clusters are generated.
5 Calculate the midpoint between the center points of every two clusters, and take these points as the candidate splitting-points.
6 Calculate the information entropy and gain ratio of each classification (or candidate splitting-point).
7 Add the optimal splitting-point to the decision tree, which divides the data set into two subsets.
8 Go to Step 2 for each subset until the decision tree has been established.
9 Post-prune the generated decision tree if necessary.
10 End.

4.3 K-C4.5 algorithm

In the traditional C4.5 algorithm, when the information gain ratio of a continuous attribute is calculated, numerical attribute values without classification features are substituted into Eq. (2) as discrete values for probability calculation. As a result, features of the continuous data are lost and the selected points with the best information gain ratio are not accurate enough. The k-means algorithm, in contrast, is suitable for processing continuous data. By “characterizing” the data, that is, attaching class information to each value, the k-means step improves the rationality of the selection of splitting-points and the accuracy of the final classification. The K-C4.5 discretization process is shown in Fig. 2.

The idea of the K-C4.5 discretization method is as follows. After the training data set is input, if an attribute is continuous, a new two-dimensional data set $D_{KC}$ is created, which contains the continuous attribute value and the corresponding class label as its two attributes: $D_{KC}$ (data, feature). Clustering is performed using Eqs (8) and (9), which yields a number of clusters with center points. The midpoint between any two cluster centers is taken as the approximate boundary point of the two clusters. The information gain ratio of the attribute is calculated at each boundary point, and the point with the highest information gain ratio is chosen as the splitting-point $N$ of the decision tree. The clustering operation is then performed on the next continuous attribute, until the decision tree is established. The K-C4.5 algorithm is described in Algorithm 2.
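The generation of candidate splitting-points in K-C4.5 can be sketched as follows. This is a hedged Python sketch, not the authors' implementation, and several details are assumptions: class labels are encoded as integers 0, 1, ..., the two coordinates are clustered without rescaling, the candidates are the midpoints of the attribute coordinates of every pair of cluster centers, and the names `kmeans_centers` and `k_candidates` are hypothetical.

```python
import random
from itertools import combinations


def kmeans_centers(points, k, max_iter=100, seed=0):
    """Plain k-means on 2-D points; returns the final cluster centers."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)
    for _ in range(max_iter):
        clusters = [[] for _ in range(k)]
        for p in points:
            # Squared Euclidean distance is enough to find the nearest center.
            nearest = min(
                range(k),
                key=lambda i: sum((a - b) ** 2 for a, b in zip(p, centers[i])),
            )
            clusters[nearest].append(p)
        new_centers = [
            tuple(sum(coord) / len(cl) for coord in zip(*cl)) if cl else centers[i]
            for i, cl in enumerate(clusters)
        ]
        if new_centers == centers:
            break
        centers = new_centers
    return centers


def k_candidates(values, labels, k, seed=0):
    """Candidate splitting-points of one continuous attribute, K-C4.5 style."""
    # Assumed encoding: class labels become the numeric codes 0.0, 1.0, ...
    codes = {c: float(i) for i, c in enumerate(sorted(set(labels)))}
    d_kc = [(float(v), codes[y]) for v, y in zip(values, labels)]  # D_KC(data, feature)
    centers = kmeans_centers(d_kc, k, seed=seed)
    # Only the attribute coordinate of each center is used for the boundary points.
    xs = sorted({c[0] for c in centers})
    return sorted({(a + b) / 2 for a, b in combinations(xs, 2)})


values = [1.0, 2.0, 3.0, 10.0, 11.0, 12.0]
labels = ["a", "a", "a", "b", "b", "b"]
print(k_candidates(values, labels, k=2))  # [6.5]
```

With $k$ clusters this yields at most $k(k-1)/2$ candidates per attribute; restricting the pairs to neighboring centers would be an alternative reading of the method that yields at most $k-1$ candidates.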

5. Experimental results and analysis

5.1 Experimental platform and data sets

The experiments were run on a computer with an Intel Core i7 processor, 8 GB RAM and the Windows 10 operating system. The proposed algorithms K-C4.5 and TD-C4.5 and the traditional C4.5 are all implemented in Python.

Since this paper focuses on the discretization of large amounts of continuous data, we control for other variables by selecting two-class data sets with multi-dimensional continuous attributes for the performance comparison. The test data include Svmguide3 [17] and three data sets from the UCI Machine Learning Repository [18]: sensor_reading1 (a subset of the Wall-Following-Robot-Navigation data), HIGGS2 (a subset of the HIGGS data set) and ionosphere.

In the experiment, 70% of each candidate data set was selected as the training data set $D_{\textit{train}}$ and 30% was used as the validation data set $D_{\textit{validate}}$, with $D_{\textit{train}}\cap D_{\textit{validate}}=\emptyset$. After consolidation and correction of the data sets, the detailed information of the four data sets is shown in Table 3.
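A disjoint 70/30 split of this kind can be sketched as follows. This is a minimal Python sketch under assumptions: the text does not state how the tuples were selected, so the random shuffle, the fixed `seed` and the function name `split_train_validate` are hypothetical.

```python
import random


def split_train_validate(rows, train_fraction=0.7, seed=0):
    """Split rows into disjoint training and validation lists."""
    rng = random.Random(seed)
    shuffled = list(rows)  # copy, so the caller's data is left untouched
    rng.shuffle(shuffled)
    cut = round(len(shuffled) * train_fraction)
    # Slicing one shuffled list guarantees that the two parts do not overlap.
    return shuffled[:cut], shuffled[cut:]


train, validate = split_train_validate(list(range(10)))
print(len(train), len(validate))  # 7 3
```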

Table 3
Data sets

Data sets	Number of classes	Number of continuous attributes	Number of instances (training set)	Number of instances (validation set)
Svmguide3	2	20	900	384
Sensor_reading1	2	24	800	355
HIGGS2	2	24	700	300
Ionosphere	2	33	236	100

5.2 Classification accuracy

Classification accuracy is an important indicator for evaluating the performance of classification algorithms. Its calculation is described below.

To check the classification results, the data set validation method [19] is used. As described above, the validation set $D_{\textit{validate}}$ contains the exact category information. After classification by the C4.5 decision tree, the classification result of each data tuple in the validation set is compared with its own category information, and the comparison result is recorded. Let $TR$ be the number of tuples correctly classified and $FS$ the number of tuples incorrectly classified. When all data of the validation set have been classified, the classification accuracy of the data set under a given classification algorithm is calculated as follows.

$\displaystyle\textit{AccuracyRate}=\frac{TR}{TR+FS}$ (10)
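Eq. (10) amounts to counting matches between predicted and true labels on the validation set, as in the short Python sketch below. The function name `accuracy_rate` and the toy labels are assumptions made for illustration.

```python
def accuracy_rate(predicted, actual):
    """Classification accuracy TR / (TR + FS) as in Eq. (10)."""
    tr = sum(p == a for p, a in zip(predicted, actual))  # correctly classified
    fs = len(actual) - tr                                # incorrectly classified
    return tr / (tr + fs)


print(accuracy_rate(["a", "b", "a", "b"], ["a", "b", "b", "b"]))  # 0.75
```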

5.3 Comparisons of classification accuracy

This section compares the classification accuracy of the proposed algorithms K-C4.5 and TD-C4.5 with that of the traditional C4.5 algorithm. To test the performance of K-C4.5 under different values of $k$, the three cases $k=5$, $k=10$ and $k=15$ were selected for the experiment. Figures 3–5 show the classification performance of K-C4.5 on the data sets svmguide3, sensor_reading1, HIGGS2 and ionosphere for these values of $k$.

Figure 3.

Comparison of classification accuracy of three algorithms when $k=5$.

It can be seen from Fig. 3 that when $k=5$, the classification accuracy of K-C4.5 on all four data sets is higher than that of the traditional C4.5 and TD-C4.5 algorithms. Compared with the traditional C4.5 algorithm, the classification accuracy of K-C4.5 is 0.29% to 6.34% higher; compared with TD-C4.5, it is 0.85% to 12.33% higher.

Figure 4.

Comparison of classification accuracy of three algorithms when $k=10$.

It can be observed in Fig. 4 that when $k=10$, the classification accuracy of K-C4.5 on all four data sets is higher than that of the traditional C4.5 and TD-C4.5 algorithms. Compared with the traditional C4.5 algorithm, the classification accuracy of K-C4.5 is 0.29% to 5.47% higher; compared with TD-C4.5, it is 0.85% to 11.66% higher.

Figure 5 shows that when $k=15$, the classification accuracy of K-C4.5 on all four data sets is higher than that of the traditional C4.5 and TD-C4.5 algorithms. Compared with the traditional C4.5 algorithm, the classification accuracy of K-C4.5 is 0.29% to 7.34% higher; compared with TD-C4.5, it is 0.85% to 9.3% higher.

Figures 3–5 also allow a comparison between TD-C4.5 and the traditional C4.5 algorithm. When the conditional attribute dimension of the data set is small, the accuracy of TD-C4.5 is about 4.18% higher than that of the traditional C4.5 algorithm. When the conditional attribute dimensions of the data sets are similar, the two algorithms differ little, with TD-C4.5 improving by only 0.56% to 1.34%. When the conditional attribute dimension is high, the accuracy of TD-C4.5 is 8% lower than that of the traditional C4.5 algorithm. Thus, the relative classification accuracy of TD-C4.5 and C4.5 is related to the dimension of the data set.

In summary, after TD-C4.5 discretizes a single continuous attribute column, the calculation scale of the information gain ratio is reduced, and its classification accuracy depends on the dimension of the data set. K-C4.5 has higher classification accuracy than the traditional C4.5 and TD-C4.5 for all three values of $k$. When the condition attribute dimension of the data set increases, its accuracy still improves steadily. The reasons are described as follows.

The traditional C4.5 discretization method directly calculates the gain ratio without taking other related attributes into account. In contrast, K-C4.5 first extracts the continuous attribute values and their corresponding class labels to form a new temporary array, so that every attribute value carries its own category information and has its own position in a two-dimensional coordinate system. K-C4.5 clusters the scattered data into several stable groups. The approximate boundary points between every two groups are then found and used as candidate splitting-points in the computation of the information gain ratio. Because data points that are close to each other in space are grouped together, their internal relations can be exploited and better splitting-points can be obtained. As a result, K-C4.5 outperforms the traditional C4.5 in classification accuracy.

5.4 Comparisons of algorithm execution time

The execution time is another important indicator of the performance of an algorithm. This paper further compares the execution time of K-C4.5, TD-C4.5 and the traditional C4.5. The experimental results are shown in Table 4.

Table 4
Comparison of execution time of three algorithms

Data sets	C4.5 time (s)	TD-C4.5 time (s)	K-C4.5 time (s)
Svmguide3	8	1.232	42
Sensor_reading1	2.045	0.169	1.748
HIGGS2	8.428	1.229	44.955
Ionosphere	1	1	12.3

Figure 5.

Comparison of classification accuracy of three algorithms when $k=15$.

Table 4 shows that the execution time of TD-C4.5 on svmguide3 is about 7 seconds less than that of the traditional C4.5 algorithm. On sensor_reading1, TD-C4.5 is 1.876 seconds faster than C4.5; on HIGGS2, it is 7.199 seconds faster; on ionosphere, the execution times of the two algorithms are about the same. Thus, TD-C4.5 reduces the calculation scale for continuous attributes and improves the execution efficiency of the traditional C4.5 algorithm. The main reasons for this result are described below.

The traditional C4.5 algorithm needs to calculate the information gain ratio at about $n/2$ points, so it is inefficient when the continuous attribute data size is large. TD-C4.5 only calculates the information gain ratio at 10 equally spaced points, so its execution time is reduced.

However, K-C4.5 needs to repeat clustering to obtain the candidate splitting-points before calculating the information gain ratio, so its execution time tends to increase compared with the other two algorithms. Table 4 shows that on svmguide3 the execution time of K-C4.5 is 34 to 41 seconds longer than that of the other two algorithms. On sensor_reading1, K-C4.5 is 0.297 seconds faster than the traditional C4.5 algorithm and 1.579 seconds slower than TD-C4.5. On HIGGS2, K-C4.5 is 36.527 to 43.726 seconds slower than the other two algorithms. On ionosphere, K-C4.5 is 11.3 seconds slower than the other two algorithms. The main reason is that the initial cluster centers of K-C4.5 are selected randomly: the quality of the initial centers is uncertain and the number of center adjustments varies, so the time cost of K-C4.5 differs across data sets. Although the execution time of K-C4.5 is relatively long, it performs better in classification accuracy, as mentioned above.

5.5 Selection of parameter $k$ in K-C4.5 algorithm

The k-means algorithm is employed in K-C4.5 to discretize large high-dimensional data with continuous attributes. To understand how the choice of the parameter $k$ affects classification accuracy, experiments have been conducted on different data sets. The experimental results are shown in Fig. 6.

Figure 6.

Classification accuracy comparison of different $k$ on experimental data sets.

First, on the svmguide3 data set, as shown in Fig. 6, when $k=5$, the classification accuracy of the K-C4.5 algorithm is 86.8%. When $k=10$, the classification accuracy is 85.93%, a decrease of 0.87 percentage points. When $k=15$, the classification accuracy is 87.8%, which is 1 percentage point higher than when $k=5$ and 1.87 percentage points higher than when $k=10$.

There are two possible reasons why the classification accuracy is not linearly correlated with increasing $k$. First, because the $k$-means algorithm selects its initial cluster centers at random, the classification accuracy of the decision tree fluctuates across different values of $k$. The random initialization directly affects the execution of the algorithm: whenever “good” initial centers are selected, the classification accuracy of the algorithm is high. Second, owing to the particular structure of the data set, $k=15$ may simply suit the svmguide3 data set better and therefore yield better experimental results.

On the sensor_reading1 data set, as shown in Fig. 6, the classification accuracy of the K-C4.5 algorithm is 100% for $k=5$, $k=10$, and $k=15$. All three values of $k$ achieve good experimental results on sensor_reading1, and the classification accuracy is higher than that of the traditional C4.5 and TD-C4.5 algorithms. Therefore $k=5$ can be selected as the number of clusters, which reduces the running time of the algorithm.

On the HIGGS2 data set, as shown in Fig. 6, when $k=5$, the classification accuracy of the K-C4.5 algorithm is 73.3%. When $k=10$, the classification accuracy is 72.67%, a decrease of 0.63 percentage points. When $k=15$, the classification accuracy is 70.3%, which is 3 percentage points lower than when $k=5$ and 2.37 percentage points lower than when $k=10$. Here the classification accuracy decreases steadily as $k$ increases. Selecting $k=5$ therefore gives good experimental results on HIGGS2.

On the ionosphere data set, as shown in Fig. 6, when $k=5$, the classification accuracy of the K-C4.5 algorithm is 94.33%. When $k=10$, the classification accuracy is 93.66%, a decrease of 0.67 percentage points. When $k=15$, the classification accuracy is 91.3%, which is 3.03 percentage points lower than when $k=5$ and 2.36 percentage points lower than when $k=10$. Here too the classification accuracy decreases steadily as $k$ increases. Selecting $k=5$ therefore gives good experimental results on ionosphere.

The classification accuracy of the K-C4.5 algorithm on each data set depends on the choice of $k$. Owing to the random selection of the initial cluster centers and the differences in data set structure, there is no clear pattern in how different values of $k$ perform across data sets. In future research on the parameter $k$, multiple tests can be conducted on each training data set with a wider range of $k$ values. Moreover, improved variants of the $k$-means algorithm can be used to discretize continuous data, so that the clustering step behaves more stably.
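The repeated-test protocol suggested above could be organized as in the following sketch. It is only an outline: `evaluate_k_grid` and `toy_run` are hypothetical names, and `toy_run` is a random stand-in for one seeded train-and-test run of K-C4.5, so the numbers it produces are not experimental results.

```python
import random
import statistics


def evaluate_k_grid(run_once, ks, repeats=10):
    """Summarize accuracy per k over several seeded runs.

    `run_once(k, seed)` must return one accuracy value. The result maps
    each k to (mean accuracy, population standard deviation).
    """
    summary = {}
    for k in ks:
        scores = [run_once(k, seed) for seed in range(repeats)]
        summary[k] = (statistics.mean(scores), statistics.pstdev(scores))
    return summary


def toy_run(k, seed):
    """Stand-in for one K-C4.5 run; returns synthetic values, NOT real results."""
    rng = random.Random(seed * 1000 + k)
    return 0.9 + rng.uniform(-0.02, 0.02)


for k, (mean_acc, spread) in evaluate_k_grid(toy_run, [5, 10, 15, 20]).items():
    print(k, round(mean_acc, 4), round(spread, 4))
```

Reporting the spread alongside the mean makes it possible to tell whether a difference between two values of $k$ exceeds the run-to-run variation caused by random initialization.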

6. Conclusions

The traditional C4.5 algorithm suffers from low classification efficiency and low classification accuracy when dealing with data sets that contain a large number of multi-dimensional continuous values. To address this issue, this paper proposes two continuous-attribute-discretization mechanisms, K-C4.5 and TD-C4.5. In comparison with the traditional C4.5 algorithm, K-C4.5 greatly improves the classification accuracy by combining it with the $k$-means algorithm, i.e., by adding a clustering process for attribute values. K-C4.5 is superior to the traditional C4.5 algorithm in classification accuracy when the data size stays constant or decreases and the number of continuous condition attributes increases. The discretization in TD-C4.5 reduces the number of candidate splitting-points and hence the number of information gain ratio calculations. Compared with the traditional C4.5 algorithm, TD-C4.5 reduces the execution time and improves the classification efficiency.

Acknowledgments

This work was financially supported in part by the Natural Science Foundation of China (grant no. 61363016) and the Natural Science Foundation of Inner Mongolia Autonomous Region (grant nos. 2015MS0605 and 2015MS0626).

References

1. Ashwin Kumar U.M. and Ananda Kumar K.R., Data Preparation by CFS, An Essential Approach for Decision Making Using C4.5 for Medical Data Mining, in: Proceedings of the 2013 Third International Conference on Advanced Computing & Communication Technologies, IEEE, 2013.

2. Haddadi, Runkel, Zincir-Heywood A.N. and Heywood M.I., On botnet behaviour analysis using GP and C4.5, in: Companion Publication of the Conference on Genetic & Evolutionary Computation, ACM, 2014.

3. Han et al., Analysis on meteorological conditions and health factors based on C4.5 algorithm, in: IEEE International Conference on Cloud Computing & Intelligence Systems, IEEE, 2015.

4. Salzberg S.L., C4.5: Programs for Machine Learning by J. Ross Quinlan. Morgan Kaufmann Publishers, Inc., 1993, Machine Learning 16(3) (Sep. 1994), 235–240.

5. Quinlan J.R., Induction of Decision Trees, Machine Learning 1 (Mar. 1986), 81–106.

6. MacQueen J.B., Some Methods for classification and Analysis of Multivariate Observations, in: Proceedings of 5th Berkeley Symposium on Mathematical Statistics and Probability, University of California Press, 1967, pp. 281–297, MR 0214227, Zbl 0214.46201.

7. H.T. and Jia Y.B., C4.5 improved algorithm based on rough set theory and CAIM criterion, Computer Systems Applications 27(7) (Jun. 2018), 139–144.

8. Chen and Wu C.X., Improvement and application of decision tree C4.5 algorithm, Software Guide 17(10) (Aug. 2018), 88–92.

9. X.W., Chen F.C. and Li S.M., Improved C4.5 decision tree algorithm based on classification rules, Computer Engineering and Design 34(12) (Dec. 2013), 4321–4325+4330.

10. Miao Y.F. and Zhang X.H., Improvement and application of C4.5 decision tree algorithm, Computer Engineering and Applications 51(13) (Jul. 2015), 255–258+270.

11. Wei and Ding Y.J., An improved algorithm of C4.5 decision tree based on attributes correlation, Journal of North University of China (Natural Science Edition) 35(4) (Aug. 2014), 402–406.

12. Xia X.C. and Wang X.Y., Improved C4.5 decision tree algorithm based on cosine similarity, Computer Engineering and Design 39(1) (Jan. 2018), 120–125.

13. W.P. and Shang J.Z., Improvement and analysis of C4.5 decision tree algorithm, Computer Engineering and Applications, [Online]. Available: https://kns-cnki-net.web.bisu.edu.cn/kcms/detail/11.2127.TP.20181025.1716.033.html, 29th Oct. 2018.

14. Quinlan J.R., Simplifying decision trees, International Journal of Man-Machine Studies 27(3) (Sep. 1987), 221–234.

15. Effect of decision trees for nursing intervention in orthopedic patients with selective operation, Modern Medical Journal 43(6) (Jun. 2015), 700–705.

16. Uzilov A.V., Keegan J.M. and Mathews D.H., Detection of non-coding RNAs on the basis of predicted secondary structure formation free energy change, BMC Bioinformatics 7(173) (2006).

17. Hsu C.W., Chang C.C. and Lin C.J., A practical guide to support vector classification, Technical report, Department of Computer Science, National Taiwan University, 2003.

18. UCI Machine Learning Repository, [Online]. Available: http://archive.ics.uci.edu/ml/index.php.

19. Zhou J.F., Yang A.M. and Liu, Traffic classification approach based on improved C4.5 algorithm, Computer Engineering and Applications 48(5) (Mar. 2012), 71–74.