Abstract
An accurate and efficient intelligent fault diagnosis method plays a key role in reducing production downtime caused by impending faults in modern industrial machines, increasing the safety of plant operations, and optimizing manufacturing costs. Recently, a new approach to hierarchical clustering based on the data field was put forward and obtained good results. Inspired by this principle, this article proposes a new efficient and intelligent fault diagnosis method called the Mass Optimizing Group Identification Classification Algorithm (MOGICA). In this classifier, the classification rate and the number of used objects fluctuate with changes in its only parameter, δ. Thus, to make the data field distribution more reasonable and to increase the classification accuracy, entropy is introduced to determine δ. The performance of the method was tested in two kinds of experiments. In the first, four benchmark data sets were used to evaluate the algorithm. In the second, the algorithm was used to diagnose ball bearing faults. Compared with other classification techniques in the two experiments, our method is competitive.
Introduction
During the past twenty years, many effective methods have been proposed for classification and fault diagnosis tasks, such as the support vector machine (SVM) [8, 13], relevance vector machine (RVM) [14, 27], artificial neural network (ANN) [16, 21] and artificial immune recognition system (AIRS) [26, 31]. In these problems, a mapping between objects represented by feature vectors and their class labels is inferred from a set of training examples. The core problem is to describe each class of training objects and to determine which class a new object belongs to.
Among these methods, SVM has an advantage in handling small-sample problems. However, SVM still has some weaknesses [27]: 1) it cannot produce probabilistic predictions; 2) the kernel function must satisfy the Mercer condition; 3) the penalty factor C and the variance parameter δ of the RBF kernel function need to be estimated.
Tipping [22, 23] proposed a statistical learning algorithm based on Bayesian theory, the relevance vector machine (RVM). Compared with SVM, it has the following advantages: 1) it can output the posterior probability distribution of the predicted value; 2) the kernel function does not need to satisfy the Mercer condition; 3) the relatively simple decision function used in RVM is more suitable for online diagnosis. In recent years, RVM has also been studied and applied in the field of mechanical fault diagnosis [12, 25]. Nevertheless, the classification results of RVM are still sensitive to the kernel parameter value [3].
ANN is a network formed by a large number of widely connected simple processing units and processes information in a distributed manner. A variety of ANN variants have been developed, for instance BPNN-KN [6], Hopfield-BP [19] and ART-KNN [29]. Nevertheless, ANN still has shortcomings in fault diagnosis: it needs many training samples and cannot explain the result of neural network training. Hence, ANN is severely limited when fault samples are scarce.
Employing the metaphor of immune network theory, Watkins et al. [26] put forward a supervised learning algorithm based on the immune mechanism, named the Artificial Immune Recognition System (AIRS). Since then, AIRS has been applied to a series of classification problems and many improved algorithms have emerged. In [16], a new resource allocation mechanism based on fuzzy logic was introduced in Fuzzy-AIRS. As further research, another nonlinear recognition system combining AIS and ANN (ANN-aided AIS-response, AaA-response) was proposed [15]. In MAIRS2, a modified AIRS2 replaced the k-nearest neighbor algorithm with the fuzzy k-nearest neighbor algorithm to improve the diagnostic accuracy for diabetes [5]. Inspired by the theory of the data field, Zhang et al. [31] presented variable weight artificial immune recognizers (V-AIR), which use the data field to describe the detectability of memory cells. Nevertheless, although the population control mechanisms used in AIRS were borrowed from the immune mechanism, it should be stressed that AIRS is in no way a mathematical model of computation.
The data field used here is a virtual field that depicts the interaction between the objects associated with each data point in the space, modeled on the physical field [28]. Gan et al. [9] used it in hierarchical clustering and found clusters of varying shape, size and density. Li and Du [10] studied the relationship between the data field and non-parametric density estimation and found that the potential function can reflect the intensity of the data distribution.
In this paper, building on the concept of the data field, a new classification approach is put forward, called the Mass Optimizing Group Identification Classification Algorithm (MOGICA), which obtains a small set of mass-optimized objects to classify the target samples. We show how the masses of the training samples can be optimized by quadratic programming and how the binary classification interface can be formed from the mass-optimized objects. In the testing phase, a test sample is assigned to the class on whose side of the binary classification interface it falls.
The paper is organized as follows. Section 2 presents the concept and definitions of the data field used in the proposed algorithm. Section 3 introduces the Mass Optimizing Group Identification Classification Algorithm (MOGICA) in detail. Experiments on benchmark data sets and bearing data sets are presented in Section 4, and conclusions are given in Section 5.
The data field theory
In physics, the potential function of a stable active field can be taken as a single-valued, isotropic function of spatial position. The potential at any position is proportional to the parameter representing the strength of the source and inversely proportional to the distance between the position and the source of the field. Hence, for the data field defined in feature space, the basic rules for the form of the potential function can be given as follows [9]:
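Given an object x in the space Ω, the potential φ_x(y) that it produces at any point y ∈ Ω must satisfy the following rules: (1) φ_x(y) is a continuous, smooth and finite function defined on Ω; (2) φ_x(y) is isotropic; (3) φ_x(y) is a single-valued, decreasing function of the distance ‖x − y‖.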
In principle, all equations satisfying the above conditions may be used to define the data field potential equation.
Based on the nuclear field potential equation, we could give a potential equation form:
By the superposition principle of the potential function, if the masses of the data objects are equal, densely populated areas have high potential values, and the area near the maximum of the potential function contains the densest concentration of objects. As shown in Fig. 2, the potential function can reflect the intensity of the data distribution and can be used as an estimate of the overall distribution.
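The superposition described above can be illustrated with a short sketch. Equation (1) is not reproduced in this text, so the Gaussian-like form φ(x) = Σ m_i · exp(−(‖x − x_i‖/δ)²), which is common in the data field literature, is assumed here; the function name `potential` and the sample points are hypothetical.

```python
import math

def potential(x, objects, masses, delta):
    """Superposed potential at point x (assumed Gaussian-like form)."""
    total = 0.0
    for xi, mi in zip(objects, masses):
        dist_sq = sum((a - b) ** 2 for a, b in zip(x, xi))
        total += mi * math.exp(-dist_sq / delta ** 2)
    return total

# Three objects clustered near the origin and one isolated object.
objects = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1), (3.0, 3.0)]
masses = [0.25] * 4  # equal masses

dense = potential((0.0, 0.0), objects, masses, delta=1.0)
sparse = potential((3.0, 3.0), objects, masses, delta=1.0)
```

With equal masses, `dense` exceeds `sparse`, which is the sense in which the potential tracks the local density of objects.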
In particular, when the masses of the data objects are equal, the proposed nuclear field potential function and the kernel density estimate are very similar. Assuming that the objects
From the properties of the probability density function, we can prove that the potential function and the probability density function differ only by a constant. The proof is as follows:
In Equation (2), the non-negativity condition of φ (
Setting , we could get:
For is the Gaussian distribution density function with h = ω, Equation (7) could be changed into:
Thus, φ (
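The equations of this proof are not reproduced above. As a hedged illustration of why the two quantities differ only by a constant, assume the Gaussian-like potential with equal masses m_i = 1/n in d dimensions and a Gaussian kernel density estimate with bandwidth h. The symbols d, h and \hat{p} are notation introduced for this sketch and need not match the original equations or their choice of bandwidth.

```latex
\varphi(x) = \frac{1}{n}\sum_{i=1}^{n}
  \exp\!\left(-\frac{\lVert x - x_i\rVert^{2}}{\delta^{2}}\right),
\qquad
\hat{p}(x) = \frac{1}{n\,(2\pi h^{2})^{d/2}}\sum_{i=1}^{n}
  \exp\!\left(-\frac{\lVert x - x_i\rVert^{2}}{2h^{2}}\right).

\text{With } h = \delta/\sqrt{2}:\qquad
\hat{p}(x) = \frac{1}{(\sqrt{\pi}\,\delta)^{d}}\;\varphi(x).
```

Under these assumptions the constant of proportionality is (√π δ)^d, which depends on δ and the dimension but not on the position x.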
The algorithm description
As shown in Fig. 2, the data field can describe the distribution of the samples. However, it needs all of the samples to describe the structure of the data. If the masses of the objects need not be equal, the masses m_1, m_2, …, m_n can be seen as a set of functions related to the spatial location. When the overall distribution of the samples is known, we can minimize the discrepancy between the potential function and the overall distribution density function. Specifically, suppose that
The optimization method above was originally designed for one-class data mining problems. In this section, we describe a variant of it, called the Mass Optimizing Group Identification Classification Algorithm (MOGICA), which can handle two-class classification problems.
Suppose there are two sets of N-dimensional data points:
Obviously, and have nothing to do with a_1, a_2, …, a_n and b_1, b_2, …, b_l. Thus, the objective function can be converted to:
Analysis of the formal equation: and are the mathematical expectation of φ (
Substituting Equation (1) into Equation (13) gives:
Obviously, this is a typical constrained quadratic programming problem, which satisfies the following linear constraints:
Equation (15) is the equality constraint of the constrained quadratic programming problem; it states that the masses of the objects of each class satisfy the normalization condition. Equation (16), which requires all training samples to be assigned to the correct category, is the inequality constraint and plays a significant role in improving classification accuracy. Equations (17) and (18) give the value range and the initial value of the masses of each category, respectively. The result of the optimization is that a few objects located in compact regions have a large mass, while most other objects have a very small mass. In other words, the distribution of the field depends mainly on the large-mass objects, and most other objects, with small mass, can be ignored.
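The kind of optimization described here can be sketched for the simpler one-class case. This is an illustration under stated assumptions, not the authors' formulation: the objective J(m) = mᵀGm − 2cᵀm is a least-squares-style stand-in with normalizing constants dropped, the Gaussian-like potential is assumed, the classification constraint of Equation (16) is omitted, and projected gradient descent is swapped in for a dedicated quadratic programming solver. All function names are hypothetical.

```python
import math

def gauss(p, q, scale):
    """exp(-||p - q||^2 / scale)."""
    return math.exp(-sum((a - b) ** 2 for a, b in zip(p, q)) / scale)

def project_to_simplex(v):
    """Euclidean projection of v onto {m : m_i >= 0, sum(m) = 1}."""
    u = sorted(v, reverse=True)
    cumulative, theta = 0.0, 0.0
    for j, uj in enumerate(u, start=1):
        cumulative += uj
        candidate = (cumulative - 1.0) / j
        if uj - candidate > 0:
            theta = candidate
    return [max(vi - theta, 0.0) for vi in v]

def optimize_masses(points, delta, iterations=2000):
    """Minimize m^T G m - 2 c^T m subject to normalization and m >= 0."""
    n = len(points)
    # G[i][j]: overlap of the unit potentials of objects i and j (constant dropped).
    G = [[gauss(p, q, 2.0 * delta ** 2) for q in points] for p in points]
    # c[i]: sample mean of the unit potential of object i over all points.
    c = [sum(gauss(p, q, delta ** 2) for q in points) / n for p in points]
    m = [1.0 / n] * n        # equal initial masses
    step = 1.0 / (2.0 * n)   # safe step size: every entry of G is at most 1
    for _ in range(iterations):
        grad = [2.0 * sum(G[i][j] * m[j] for j in range(n)) - 2.0 * c[i]
                for i in range(n)]
        m = project_to_simplex([mi - step * gi for mi, gi in zip(m, grad)])
    return m

points = [(0.0, 0.0), (0.2, 0.1), (0.1, 0.3), (0.3, 0.2), (2.0, 2.0), (2.1, 1.9)]
masses = optimize_masses(points, delta=0.5)
```

The returned masses are non-negative and sum to one, mirroring the normalization and range constraints; objects whose mass falls below a chosen threshold would then be discarded.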
For convenience, the objects with sufficiently large mass are used as the representative objects and form the data field used for classification. Let the optimal solutions of the quadratic program be and respectively. As with the classification interface of SVM [24], the optimal classification hyperplane is denoted H0:
The classification results for the training set are:
The classification function is given as:
The state of any sample z can be judged by:
Fig. 3 shows the potential field distribution of the data objects in two-dimensional space, with 40 samples for each class (+1 and −1). The figure shows that the potential field reflects the data distribution well; however, misclassification still occurs near the H0 potential field classification hyperplane (in two-dimensional space, a classification line between class +1 and class −1). Given a mass threshold, we can assign a mass to each object by the preceding quadratic programming method and select the objects with large mass as representatives to form the data field. As shown in Fig. 4, although the data field formed by the few used objects (the large-mass objects) differs greatly from the original potential field, it separates the two classes without misclassification. The quadratic programming results also show that the few used objects of class +1 (marked ○) and class −1 (marked ◊) are mainly distributed on both sides of the boundary H0, which corrects the distribution of the original data field and directly changes the H0 potential field classification line. Thus, the proposed method fulfills its role: it reduces the number of used objects and corrects the H0 potential field classification line. Although the other objects are discarded, their contribution to the clustering is embodied in the used objects.
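The role of the mass threshold and the used objects can be sketched as follows. The decision equations are not reproduced in this text, so the rule below, which assigns a sample to the class whose retained objects produce the stronger potential at that point, is an assumption, as are the Gaussian-like potential, the function names, the threshold value and the toy masses.

```python
import math

def field(z, objects, masses, delta):
    """Potential at z produced by weighted objects (assumed Gaussian-like form)."""
    return sum(
        m * math.exp(-sum((a - b) ** 2 for a, b in zip(z, x)) / delta ** 2)
        for x, m in zip(objects, masses)
    )

def select_used(objects, masses, threshold):
    """Keep only the objects whose optimized mass exceeds the threshold."""
    kept = [(x, m) for x, m in zip(objects, masses) if m > threshold]
    return [x for x, _ in kept], [m for _, m in kept]

def classify(z, pos, pos_mass, neg, neg_mass, delta, threshold=1e-3):
    """Return +1 or -1 according to which class field is stronger at z."""
    p_obj, p_m = select_used(pos, pos_mass, threshold)
    n_obj, n_m = select_used(neg, neg_mass, threshold)
    score = field(z, p_obj, p_m, delta) - field(z, n_obj, n_m, delta)
    return 1 if score >= 0 else -1

# Toy masses standing in for a quadratic programming result.
pos = [(1.0, 1.0), (1.2, 0.9), (2.0, 2.0)]
pos_mass = [0.6, 0.4, 0.0]
neg = [(-1.0, -1.0), (-0.8, -1.1), (-2.0, -2.0)]
neg_mass = [0.7, 0.3, 0.0]

label_a = classify((0.9, 1.1), pos, pos_mass, neg, neg_mass, delta=1.0)
label_b = classify((-1.1, -0.9), pos, pos_mass, neg, neg_mass, delta=1.0)
```

Because only the retained objects enter `field`, the cost of classifying one sample grows with the number of used objects rather than with the size of the training set.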
Like SVM, MOGICA is limited to two-class problems, so further work is needed for multi-class problems. At present, there are two main methods for constructing a multi-class SVM: one-against-one and one-against-all [20]. In this paper, we use both methods to form the classification model of MOGICA. In one-against-one classification, every pair of classes constitutes a binary classification model, so a total of K(K − 1)/2 models are constructed for K classes of training samples. During testing, each sample is classified by all of the binary models, each of which casts one vote; the category with the most votes is the predicted category of the test sample. In one-against-all classification, there are K models in total: the ith (i = 1, 2, …, K) binary model labels the training samples of the ith class as +1 and all remaining training samples as −1. A test sample is classified by all of the binary models, and its final predicted category is the one whose model outputs +1. Figure 5 gives a detailed description of the off-line learning and on-line testing of this algorithm.
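The one-against-one voting scheme can be sketched as below. The binary models here are hypothetical stand-ins (nearest class center on a one-dimensional feature) used only to exercise the voting logic; in MOGICA each would be a trained binary data field classifier.

```python
from collections import Counter
from itertools import combinations

def one_against_one_predict(z, classes, binary_models):
    """binary_models[(a, b)](z) returns +1 for class a and -1 for class b."""
    votes = Counter()
    for a, b in combinations(classes, 2):
        winner = a if binary_models[(a, b)](z) > 0 else b
        votes[winner] += 1
    return votes.most_common(1)[0][0]

# Hypothetical stand-in models: vote for the nearer class center.
centers = {"normal": 0.0, "ball": 1.0, "inner": 2.0, "outer": 3.0}
classes = list(centers)
models = {
    (a, b): (lambda z, a=a, b=b:
             1 if abs(z - centers[a]) <= abs(z - centers[b]) else -1)
    for a, b in combinations(classes, 2)
}

predicted = one_against_one_predict(2.1, classes, models)
```

For K = 4 classes the dictionary holds K(K − 1)/2 = 6 binary models, matching the count given in the text.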
Experiments
In this part, the classification performance of MOGICA was tested through two case studies. The algorithm was first tested on four UCI data sets that are available from the machine learning repository [2]. The selection method of parameter δ and its robustness were also discussed in detail. In the second case, the algorithm was applied to the ball bearing fault diagnosis as a real world problem with the data sets from Case Western Reserve University [11].
Classification accuracy
In this study, the classification accuracies for the data sets were measured according to Equation (23) [16, 17]:
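Equation (23) is not reproduced in this text. A minimal sketch, assuming the usual definition of classification accuracy as the percentage of test samples whose predicted label equals the true label; the function name `accuracy` is hypothetical.

```python
def accuracy(predicted, actual):
    """Percentage of samples whose predicted label matches the true label."""
    if len(predicted) != len(actual) or not actual:
        raise ValueError("label lists must be non-empty and of equal length")
    correct = sum(1 for p, a in zip(predicted, actual) if p == a)
    return 100.0 * correct / len(actual)

example = accuracy([1, -1, 1, 1], [1, -1, -1, 1])  # 3 of 4 correct
```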
In this part, the classification performance of MOGICA was tested on four benchmark data sets, which can be obtained from the machine learning repository of the University of California, Irvine [2]. First, the influence of the parameter δ and its selection method are discussed in detail. Then, MOGICA is compared with two classifiers based on artificial immune systems and with some other well-known intelligent classifiers. The immune classifiers are AIRS [26] and AICSL [1]. The results for the other well-known intelligent classifiers are quoted from [26] and [1].
Effect of parameter δ
MOGICA is a parameter-controlled classification system, so it is important to establish how its behavior changes with respect to the user-defined parameter δ. To this end, we first investigated what effect altering δ has on the classification accuracy (Fig. 6). We then explored the effect of δ on the number of objects used for classification (Fig. 7).
Figure 6 shows how altering δ affects the overall classification performance of MOGICA. On average across the four data sets, δ has a certain influence on classification accuracy, and this influence is more obvious when δ is small. The Ionosphere, Sonar and Iris data sets show the same behavior: when the impact factor δ is small, the classification rate increases to some extent as δ grows; once δ reaches a certain level (about δ = 1.00), the classification rate stabilizes. For the Diabetes data set, however, the classification rate keeps fluctuating as δ changes.
Fig. 7 shows the influence of the impact factor δ on the number of objects used for classification, which affects the potential field distribution in the state space and the recognition speed. These are also important factors for the classification model. On average across the Ionosphere, Sonar and Iris data sets, δ has little effect on the number of used objects when it is greater than 0.8; when δ is less than 0.8, the number of used objects changes considerably for some data sets. The Diabetes data set does not follow this rule: its number of used objects fluctuates at some individual points.
In summary, the results above show that δ has a certain effect on classification accuracy and on the number of used objects, though the effect is limited for some data sets. It is therefore necessary to optimize δ to improve the classification accuracy.
Optimizing impact factor δ
To make the data field distribution more reasonable and to improve the classification accuracy, entropy is introduced to measure the uncertainty of the potential distribution [9, 27]. Shannon entropy is commonly used to measure the uncertainty of a system: the greater the entropy, the greater the uncertainty. For the data field generated by
The potential entropy of all the objects can be defined as [28]:
Since H is a nonlinear function of the single variable δ, optimizing δ becomes the problem of finding the minimum of the potential entropy H(δ). Various standard algorithms, such as random search and genetic algorithms, are suitable for this problem. In practice, we adopted the random sampling method to reduce the time required to optimize δ.
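The search for δ can be sketched as follows. The entropy equation is not reproduced in this text, so the form used here, H = −Σ (ψ_i/Z) log(ψ_i/Z) with ψ_i the potential at object i and Z = Σ ψ_i, is assumed from the data field literature, together with the Gaussian-like potential and equal masses. The search shown is a plain random search over δ, which may differ from the authors' random sampling procedure; the function names, search bounds and seed are hypothetical.

```python
import math
import random

def potential_entropy(points, delta):
    """Shannon entropy of the normalized potentials at the data objects."""
    n = len(points)
    psi = [
        sum(math.exp(-sum((a - b) ** 2 for a, b in zip(p, q)) / delta ** 2)
            for q in points) / n
        for p in points
    ]
    z = sum(psi)
    return -sum((v / z) * math.log(v / z) for v in psi)

def select_delta(points, low, high, trials=200, seed=0):
    """Random search for the delta in [low, high] with minimum entropy."""
    rng = random.Random(seed)
    candidates = [rng.uniform(low, high) for _ in range(trials)]
    return min(candidates, key=lambda d: potential_entropy(points, d))

points = [(0.0, 0.0), (0.1, 0.2), (0.2, 0.1), (0.3, 0.3),
          (2.0, 2.0), (2.2, 1.9)]
best_delta = select_delta(points, low=0.05, high=2.0)
```

The entropy reaches its maximum, log n, when every object has the same potential, so a minimizing δ is one that makes the potential distribution as uneven, and hence as informative, as possible.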
Since MOGICA and artificial immune classifiers all classify test samples using a set of objects generated by cloning, mutation or quadratic programming, their classification cost is proportional to the number of memory cells or used objects. We therefore first compared our method with two artificial immune methods, AIRS and AICSL, on two important features: classification accuracy and the number of memory cells or used objects. As in AIRS [26] and AICSL [1], the results in Table 1 were obtained by averaging multiple runs of MOGICA, typically three or more runs with five-fold or greater cross validation. More specifically, for the Iris data set, a five-fold cross validation scheme was employed, with each result representing an average of three runs across the five divisions. For the Ionosphere data set, we kept the division detailed in AIRS [26]: 200 instances, carefully split into almost 50% positive and 50% negative, are used for training, and the remaining 151 are used as test instances, consisting of 125 “good” and only 26 “bad” instances. The results reported here also represent an average of three runs. For the Diabetes data set, a ten-fold cross validation scheme is used, again with each of the 10 test sets disjoint from the others, and the results are averaged over three runs. The Sonar data set is divided randomly into 13 disjoint sets of 16 cases each; 12 of these sets are used as training data and 1 as test data. The impact factors used in MOGICA, optimized by Equation (24), are shown in Table 2.
In Table 1, the accuracy of MOGICA on Iris and Sonar is higher than that of AIRS and AICSL, whereas on Ionosphere it is lower than AIRS and on Diabetes it is lower than AICSL; in both cases the difference is not significant. In terms of the number of used objects, MOGICA requires fewer objects for Iris and Sonar, which means higher computational efficiency. For Ionosphere, MOGICA converges to the optimum solution with more used objects than AICSL. For Diabetes, our method uses the most objects but is slightly more accurate than AIRS. Overall, against AIRS we obtain a 3:1 result in both accuracy and used objects; against AICSL we obtain 3:1 in accuracy and 2:2 in used objects. Hence, our method retains some advantage in classification.
The performance of MOGICA was also compared with well-known classification techniques such as support vector machines, neural networks, fuzzy neural networks and C4.5. Table 3 shows in detail where the proposed method ranks among these techniques. The results for the other classifiers were obtained from [26] and [7], under the same conditions as described in [26]. To produce a fair comparison, three trials were performed for each data set and the average is reported in Table 3. As Table 3 shows, MOGICA compares well with some of the best general-purpose classifiers available and ranks among the top 10. The only task on which it was not among the top 10 was the Diabetes set, which is a difficult problem for all classifiers. We also have a number of refinements in mind for MOGICA and hope to improve the results further.
Application of MOGICA in Bearing fault diagnosis
As an industrial case, MOGICA was applied to the ball bearing data set of Case Western Reserve University [11] to validate the feasibility of our fault diagnosis method. There are eight vibration waveforms: 2 normal working conditions and 6 faulty working conditions, with the bearing type, fault size, motor speed and motor load listed in Table 4. The first bearing test case used one normal and three faulty conditions under a motor load of 3 hp. Data were collected at 12,000 samples/second for the drive end bearing experiments [11].
To demonstrate the validity of this method on test samples, we compared MOGICA with the artificial immune classifier V-AIR [31]. As in V-AIR, the normal and fault-related features are obtained through a seven-level ‘db3’ wavelet transform and high-frequency wavelet energy feature extraction [30], with each sample 2048 points long. This yields 7-dimensional energy feature vectors representing the normal and fault conditions of the bearing. For comparison purposes, we obtained a 180×7 matrix for each working condition. In each working condition, 80 samples were randomly selected as training data and the remaining 100 samples were used as testing data. Because there are four classes of samples, a total of 6 binary classification models are needed for one-against-one classification. Figure 8 shows the used objects and their data field distribution in 2-dimensional space, with the training samples reduced by PCA [4]. Based on the δ optimizing method, δ was set to 0.19 during the experiment. In Fig. 8(a–c), the normal samples are linearly separable from each class of fault samples, and therefore only one object per class is used for classification. Thus, the classification line H0 in each of these cases is a straight line. For Fig. 8(d), the situation is similar: out of the 80 samples of each class, one sample is selected for classification. In Fig. 8(e), the two classes are weakly nonlinearly separable and overlap at the edges of their distribution areas, so they cannot be divided by a straight line. However, through quadratic programming, an optimized curve H0 is formed that separates the two classes using only 4 samples in total. In Fig. 8(f), the inner race and outer race fault samples are strongly nonlinearly separable at the edge, but MOGICA can still generate an H0 curve that fits the data distribution and separates the two classes. Another salient characteristic of this figure is that more samples are selected as used objects and most of them lie near the border, which makes the H0 curve adaptive. Thus, MOGICA classifies all kinds of training samples well.
Table 5 shows the classification performance of the two methods together with their standard deviations. The diagnosis accuracy of MOGICA is higher than that of V-AIR for the normal bearing condition and equal to that of V-AIR for ball and inner race faults. For the outer race fault, however, MOGICA is 1.7% lower than V-AIR. The two methods are therefore comparable in diagnosis accuracy, while MOGICA retains an advantage in the number of training samples required.
To further assess MOGICA’s performance, the second bearing test case was compared with SVM and RVM [27]. These data sets come from one normal and three faulty conditions under a motor load of 0 hp. As in [27], the normal and fault-related features were extracted by the ‘db16’ wavelet packet transform, and eight frequency bands were obtained at decomposition level i = 3. The results of MOGICA with δ = 0.23 are compared with those of SVM and RVM in Table 6. Here ‘No.RV’, ‘No.SV’, ‘No.UB’, ‘Acc.Tr’, ‘Acc.Te’, ‘Time.Tr (s)’ and ‘Time.Te (s)’ denote, respectively, the average number of relevance vectors for RVM, the average number of support vectors for SVM, the average number of used objects for MOGICA, the average accuracy on training samples, the average accuracy on testing samples, the average training time and the average testing time. As listed in Table 6, all three methods achieve high and comparable classification accuracy, and our method has no outstanding advantage over the other two in this respect. However, the average number of UBs is smaller than the numbers of RVs and SVs: 2.83 versus 7.13 and 31.27 respectively. Theoretically, fewer UBs lead to less computational time because the decision function of MOGICA, Equation (21), is much simpler. Fewer UBs imply a significant reduction in the computational complexity of the classifier, which is confirmed by the comparison of Time.Te (s). The comparison of Time.Tr (s) shows that the proposed method needs more training time than the other methods (2.599 s), but this does not affect its use in a system with offline training and online fault diagnosis.
The results of the two bearing test cases indicate that, on average, the performance of MOGICA is comparable with that of other well-known classifiers, while it needs fewer training samples or used objects to classify under the same conditions. It also has the highest detection speed, which makes it more suitable for online diagnosis.
Conclusions
In this paper, a new supervised learning system based on the concept of the data field and the principles of quadratic programming has been presented. The parameter δ has a certain effect on classification accuracy and on the number of used objects, and it can be determined by optimizing the potential entropy. The data field used in this algorithm differs from the kernel density estimate only by a constant, which links the physical and statistical views. Therefore, the MOGICA classification method gives a clearer description of the data. The optimization procedure of the algorithm is based on minimizing the difference between the potential function and the overall distribution density function. A few objects are then generated by quadratic programming to distinguish any two classes of samples. Finally, the performance of the proposed method was tested in two kinds of experiments. In the first, the method was tested on four benchmark data sets, and the comparison shows that it is promising among general-purpose algorithms. In the second, the classifier was applied to ball bearing diagnosis. For comparison with other methods, two feature extraction methods were used on the ball bearing data sets: the wavelet transform and the wavelet packet transform. Compared with other intelligent classifiers, MOGICA produced competitive results for ball bearing diagnosis.
Acknowledgments
This work was supported by Natural Science Foundation of China (51175316 and 51575331).
