Abstract
A machine learning approach to perform automatic detection and diagnosis of faults of electrical submersible pump systems is presented. Several thousand vibration patterns were acquired from vertically distributed accelerometers along the string of motors, pumps and protectors. Intermediate features are extracted from the raw vibration signals originating from the set of accelerometers. Each pattern was labelled by a human expert to provide ground truth with respect to the different operation classes (normal, sensor fault, rubbing, unbalance or misalignment). A software framework is used to compare several classifier architectures (K-Nearest-Neighbor, Random Forest, Support Vector Machine, Naïve Bayes and Decision Trees) in a bias aware performance evaluation. In order to boost the classification performance, an ensemble of different versions of a classifier architecture is constructed using the Decision Templates fusion function. The robustness of the system with respect to the emergence of new faults (i.e., untreated faults so far) is corroborated by a systematic analysis methodology.
Introduction
Off-shore oil exploration is an expensive activity that relies on sophisticated and costly equipment. The core subsystem that helps raise crude petroleum and gas to the surface is known as the electrical submersible pump (ESP), shown in Fig. 1. Such subsystems operate in deep water and are therefore difficult to access, which hinders real-time supervision. Additionally, production interruptions are very expensive, which reinforces the need for reliable operation of the equipment once deployed. To avoid the cost of removing and replacing defective equipment that is already in operation, each unit is carefully examined in a special testing environment [1, 2] prior to acquisition. Such tests aim to highlight potential faults of the equipment.
Illustration of a submersible pump string. This figure shows the six parts of the string: two pumps (in light gray), two protectors (in gray) and two motors (in dark gray). At least one pair of accelerometer sensors is attached in the x and y directions (x-y plane is orthogonal to the main string axis) of each part.
Most of the potential faults are due to mechanical problems and appear as a drift of the vibration patterns compared to the healthy state. Vibration patterns can be measured with accelerometer sensors and then transformed into a more descriptive set of features that helps distinguish between different operation patterns. The pump subsystems are tested in an artificial environment under various operating conditions while vibration patterns are collected by accelerometer sensors. Two accelerometer sensors are placed orthogonally (X-axis and Y-axis) at several equally distributed points along the six principal components of the equipment. Such a sampling distribution results in a total of 6
Example of data acquired by an accelerometer. The upper plot shows the raw vibration signal in the time domain whereas the lower plot shows the corresponding transformation to frequency domain via Fourier transform.
This work focuses on three types of motor pump faults: shaft misalignment, pump blade unbalance, and mechanical rubbing. Additionally, it considers faulty accelerometer sensors, generating abnormal vibration behaviour as a faulty pattern (the abnormal behaviour is not necessarily related to the pump). Considering the previous faults and the healthy operating condition, a total of five mutually exclusive categories are used to describe a motor pump: misalignment, unbalance, rubbing, sensor, normal. Although a few pumps show signals with evidence of two or more categories, they are rare and were disregarded in this work due to the lack of examples.
Usually, a human expert visually inspects a large number of vibration frequency spectra looking for an error pattern, but this work is time consuming because of the many different curves that have to be analyzed. Moreover, this task requires a lot of tacit knowledge acquired by the human expert over many years of experience. Visually inspecting the vibration spectra and deciding whether the equipment should be rejected is not an easy task. In addition, this knowledge cannot be transferred to or learned by other engineers without years of study and practice. Therefore, the possible loss of a professional with this experience and knowledge significantly affects the decision-making ability of the company and has a considerable financial impact on its production processes. The expert system presented in this work performs such tasks automatically and systematically, and enables the work to be executed by non-specialized personnel. More importantly, it can retain this type of corporate knowledge.
A frequent concern presented by the company is the need of using a technology that can be adjusted to identify new types of defects. In particular, the inclusion of patterns of new defects in the training dataset should not impair the discriminative power of the previously existing defects. Since the existing knowledge about the diagnosis of electrical submersible pump faults is still incomplete, it is expected that in the future new types of defects will be identified and incorporated into the training dataset when a significant number of examples are accumulated. Indeed, a fact observed throughout this research project is that the human expert eventually pointed to a pattern suspected of being a fault, but without enough conviction to reject the equipment as faulty or characterize the new fault. As a result, the technology must be capable of adapting to the discovery of new defects. Potentially, the technology developed in this work should have this behavior, but it was important to formally show this feature.
Since the popularization of computers, machine learning has been explored as a means of aiding or replacing daily human tasks, or even specialized human tasks. Classification is one of the key tasks studied in machine learning. Classification tasks are handled in two steps: training and testing. In the first, the classifier is presented with examples of labeled data in order to learn a classification model. In the second, the trained model is presented with an unlabeled example in order to predict one of the possible labels. There are many types of classifiers that can be used, each one being more suitable for specific tasks. Given the fault diagnosis literature presented below, this work focuses on K-Nearest Neighbor (KNN), Support Vector Machines (SVM), J48 Decision Tree (DTree), Naïve Bayes (NB) and Random Forests (RF).
The KNN [3] is a mandatory comparative classifier, since it has no hyperparameters, except the number
Classification is the key tool in model-free fault diagnosis when knowledge is exclusively based on labeled machine condition patterns. Fault diagnosis has been addressed in many computer applications and a set of examples are presented in the following. Surveys of machine fault diagnosis using support vector machine are presented in [17, 18]. A fuzzy approach is applied to fault diagnosis of power systems [19]. Dimensionality reduction and multilayer perceptrons are used to detect surface roughness for face milling operations in [20]. The work in [21] applied qualitative reasoning to monitor and diagnose electrical power quality. Classifier ensembles are used for the detection and isolation in sensor networks in [22]. The Naïve Bayes classifier is compared with a Bayesian net and the J48 algorithm in the context of fault diagnosis of roller bearing using sound signals [23].
The focus of this study is on fault diagnosis based on vibration pattern analysis. Wavelets and local discriminant bases selection algorithm are applied to vibration signals in the fault diagnosis of a single-cylinder spark ignition engine [24]. RFs are applied to vibration signals based fault diagnosis of an induction motor in [25] and to spur gears in [26]. The authors in [27] use locally linear embedding and SVMs for submersible plunger pump fault diagnosis. A systematic use of the Case Western Reserve University (CWRU) data set as a benchmark is proposed in [28]. The data set describes the specific problem of bearing defects and was also explored with a multiscale permutation entropy approach to select wavelet features in [29]. In the same context, the KNN is used as the classifier, together with statistical feature models [30]. Two other works propose to diagnose bearings using wavelet features [31, 32]. An ensemble of SVM classifiers is used to diagnose faults based on vibration signals of oil rig horizontal centrifugal motor pumps [33].
More specifically in the context of this work, a comparative study of KNN, SVM, Decision Trees and RF is performed for ESP fault diagnosis based on vibration signals [34]. Results showed that the evaluated classifiers have equivalent performance and also indicated that the standardization procedure can improve their performance. In a subsequent work [35], the group investigated the use of Extreme Learning Machine (ELM) considering two different versions (based on random and kernel hidden units). Although those works addressed the fault diagnosis problem very well, they only explored the use of single classifiers to detect the faults, leaving room for investigating combinations of classifiers. Additionally, they performed a limited evaluation of the system performance without considering the possible appearance of new faults over time.
Following along the line of the research presented in [34], this work presents two new main contributions to the problem of automatic fault diagnosis of electrical submersible pumps.
The first contribution is the use of classifier ensembles with the inclusion of a powerful technique, called Decision Templates (DT) [36, 37], to combine results of more than one classifier constructed via bagging and improve the final performance. A comparative study is performed using real data acquired in tests accomplished before acquisition of electrical submersible pumps. The dataset comprises thousands of entries of accelerometer sensor data labelled by a human expert as one of the considered scenarios (normal, faulty sensor, faulty pump with rubbing, misalignment or unbalance). Differently from [34], results showed higher differences between the classifiers when considering the macro-averaged F-measure performance criterion and showed that the Decision Templates can improve the Random Forest classifier performance (best evaluated performance).
The second contribution is a systematic analysis of the faults by removal and substitution of each of the fault classes in order to evaluate the impact of the emergence of new defect patterns on the performance of the classifiers. New fault patterns might be discovered by the experts and would have to be included in the classifier as a new class. Such new faults might degrade the current performance of the classifier and therefore deserved investigation. It should be highlighted that no similar systematic analysis on the emergence of new defects has been reported in the fault diagnosis literature. Therefore, the proposed methodology may be applied in different contexts of fault diagnosis in which the same type of problem is present. The results of the study showed that the chosen classifier, Random Forest with Decision Templates, is robust with respect to the appearance of new fault patterns.
The remainder of this paper is organized as follows: Section 2 describes the proposed solution for submersible motor pump fault diagnosis, the feature extraction procedure, the classifier ensemble model and the Decision Templates technique. Section 3 describes the experimental methodology used to realize the comparative study. Section 4 presents the experimental results. Section 5 concludes the paper and presents perspectives for future work.
Expensive electrical submersible pump systems must be tested prior to deployment. This is necessary to avoid unnecessary costs, since on-site, sub-sea correction is unfeasible. This work proposes a machine learning approach to address an important oil industry engineering problem. The complete workflow of the diagnosis system is composed of the following stages: i) acquiring raw vibration signals from accelerometer sensors attached to multiple positions of the pump system and converting them to the frequency domain, ii) extracting custom-made features from the spectrum, iii) training a classifier offline and iv) using the trained classifier to diagnose faults. This workflow is used to determine the most appropriate classifier model combined with DT bagging ensembles for the fault diagnosis task. The feature model, the classifier ensembles and the DT are described in more detail in the following subsections.
Feature extraction
The raw vibration data constitute a huge amount of information, and the conventional Fourier transformation from the time domain to the frequency domain provides a spectrum that still carries too much information to be fed directly into a classifier. Although there are several works in the literature that explore feature extraction by dimensionality reduction [20, 24], in this work the hand-crafted features proposed in [34] were used to emulate the tacit knowledge of the domain expert when diagnosing a motor pump. They were designed to jointly capture information about the peaks and shape of the spectrum in the range of significant frequencies in order to identify the considered faults. High peaks at the first and second harmonics are related to unbalance and misalignment faults, whereas an exponential decrease at low frequencies is relevant to detect rubbing and sensor faults.
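To make the role of the harmonic peaks concrete, the following minimal sketch computes a one-sided amplitude spectrum and reads the peak amplitude around the first and second harmonics of the rotation frequency. It is only an illustration under stated assumptions: the function names `amplitude_spectrum` and `peak_near`, the synthetic signal, the sampling rate and the band width are all hypothetical, and these two values are not claimed to be among the eight features of [34].

```python
import cmath
import math


def amplitude_spectrum(signal, sample_rate):
    """Return (frequencies, amplitudes) of the one-sided DFT of a real signal."""
    n = len(signal)
    freqs, amps = [], []
    for k in range(n // 2 + 1):
        coeff = sum(x * cmath.exp(-2j * math.pi * k * t / n)
                    for t, x in enumerate(signal))
        # DC and Nyquist bins appear once; all other bins are doubled.
        scale = 1.0 if k == 0 or (n % 2 == 0 and k == n // 2) else 2.0
        freqs.append(k * sample_rate / n)
        amps.append(scale * abs(coeff) / n)
    return freqs, amps


def peak_near(freqs, amps, target_hz, tolerance_hz):
    """Largest amplitude within target_hz +/- tolerance_hz (0.0 if band is empty)."""
    band = [a for f, a in zip(freqs, amps) if abs(f - target_hz) <= tolerance_hz]
    return max(band, default=0.0)


# Synthetic signal (assumption): 60 Hz rotation plus a weaker 120 Hz component.
sample_rate = 1000
rotation_hz = 60.0
signal = [1.0 * math.sin(2 * math.pi * rotation_hz * t / sample_rate)
          + 0.4 * math.sin(2 * math.pi * 2 * rotation_hz * t / sample_rate)
          for t in range(500)]

freqs, amps = amplitude_spectrum(signal, sample_rate)
first_harmonic = peak_near(freqs, amps, rotation_hz, 2.0)
second_harmonic = peak_near(freqs, amps, 2 * rotation_hz, 2.0)
```

The direct DFT is used here only to keep the sketch free of dependencies; a fast Fourier transform would be the natural choice for signals of realistic length.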
Eight statistical features have been extracted from specific bands of the vibration spectrum of each sensor to be used as input of the classifiers. Each of these features is described below.
This feature set was designed to optimize the identification of the considered operation conditions of the process taking into consideration the analysis performed by the human expert. This first feature set
Each of the five types of classifiers (Naïve Bayes, KNN, SVM, J48 Decision Tree, Random Forest) motivated in Section 1 is combined in ensembles to raise the performance scores relative to a single classifier. As both the SVM and the decision tree used in this study are binary classifiers, for the multi-class case a set of one-against-one classifiers was trained for all possible class pairs and combined in an error-correcting output code multi-class model [38]. It should be stressed that classifier ensembles are homogeneous in this work. Thus, a classifier ensemble is not composed of different types of classifiers, but of different classifiers of the same type. For example, there is an ensemble composed of different SVM classifiers, but there is no ensemble mixing SVM and KNN classifiers.
Given a number
In order to classify an unseen instance, the output of each member of the ensemble is collected and then an ensemble information fusion function determines the final classification result. The choice of this function is of fundamental importance for the operation of the ensemble and directly influences its performance. In this work, three ensemble information fusion functions were considered: majority voting, weighted voting and decision templates. In majority voting, each classifier votes for one class (typically the one it scores highest), and the most frequently voted class is chosen. In weighted voting, the mean support of all classifiers is calculated for each class, and the final class is the one with the highest mean support. Decision Templates, the focus of this work, are described in the next subsection.
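The two voting rules can be sketched in a few lines. This is a minimal illustration, assuming each ensemble member returns a list of per-class support values; the function names and the example values are hypothetical, and ties are resolved arbitrarily (first class encountered).

```python
from collections import Counter


def majority_vote(supports):
    """supports: one list of per-class scores per ensemble member."""
    # Each member votes for the class it scores highest.
    votes = [max(range(len(s)), key=s.__getitem__) for s in supports]
    return Counter(votes).most_common(1)[0][0]


def weighted_vote(supports):
    """Pick the class with the highest mean support over all members."""
    n_classes = len(supports[0])
    mean_support = [sum(s[j] for s in supports) / len(supports)
                    for j in range(n_classes)]
    return max(range(n_classes), key=mean_support.__getitem__)


# Three members, three classes (illustrative values only).
supports = [
    [0.51, 0.49, 0.00],
    [0.52, 0.48, 0.00],
    [0.05, 0.95, 0.00],
]
majority_class = majority_vote(supports)   # two of three members prefer class 0
weighted_class = weighted_vote(supports)   # class 1 has the highest mean support
```

The example is chosen so that the two rules disagree: two members narrowly prefer class 0, while one confident member shifts the mean support towards class 1.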
Decision Templates
Decision Templates [36, 37] is a method for deciding the final classification for a single-label classifier ensemble. Given a number
Ideally these scores are probabilities based on the true class conditional distributions. Estimated probabilities, likelihoods, fuzzy set membership degrees or distances are also plausible as scores. Hence the entry
Consider the same training set, composed of
It is a matrix of the same dimension and structure as a DP, and can be considered as the expected DP of class
where
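As a minimal sketch of the idea, the code below builds one template per class as the mean decision profile (DP) of the training samples of that class, and assigns an unseen sample to the class whose template is closest to its DP. Several similarity measures are possible; the squared Euclidean distance used here is an assumption for illustration, as are the two toy ensemble members and all function names.

```python
def decision_profile(members, x):
    """Stack the class-support vectors of all members: one row per member."""
    return [member(x) for member in members]


def fit_templates(members, X, y, n_classes):
    """Template of class j = mean decision profile of training samples of class j."""
    n_members = len(members)
    sums = [[[0.0] * n_classes for _ in range(n_members)] for _ in range(n_classes)]
    counts = [0] * n_classes
    for x, label in zip(X, y):
        dp = decision_profile(members, x)
        counts[label] += 1
        for i in range(n_members):
            for k in range(n_classes):
                sums[label][i][k] += dp[i][k]
    # max(..., 1) avoids a division by zero for classes without samples.
    return [[[v / max(counts[j], 1) for v in row] for row in sums[j]]
            for j in range(n_classes)]


def predict(members, templates, x):
    """Return the class whose template is closest to the decision profile of x."""
    dp = decision_profile(members, x)

    def sq_dist(template):
        return sum((a - b) ** 2
                   for row_dp, row_t in zip(dp, template)
                   for a, b in zip(row_dp, row_t))

    return min(range(len(templates)), key=lambda j: sq_dist(templates[j]))


# Two hypothetical soft classifiers on a one-dimensional feature.
def member_a(x):
    p = min(max(x, 0.0), 1.0)
    return [1.0 - p, p]


def member_b(x):  # biased member that over-supports class 1
    p = min(max(x + 0.3, 0.0), 1.0)
    return [1.0 - p, p]


members = [member_a, member_b]
X = [0.1, 0.2, 0.8, 0.9]
y = [0, 0, 1, 1]
templates = fit_templates(members, X, y, n_classes=2)
label_low = predict(members, templates, 0.3)
label_high = predict(members, templates, 0.7)
```

Because the templates are learned from the members' actual outputs, a systematic bias of one member (here `member_b`) is absorbed into the templates instead of distorting the final decision.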
This section describes the methodology designed to evaluate the performance of the considered classifiers. It starts by explaining how data from the motor pumps were collected and how each instance was labelled. With the dataset in hand, the next subsection describes the evaluation metrics, the cross-validation procedure and the statistical framework used to compare the performance of the methods. The final subsection describes the methodology used for systematically analyzing the possible emergence of new faults.
Benchmark data set
The dataset is composed of vibration signals collected from nine different motor pumps operating at different frequencies (40, 45, 50, 55 and 60 Hz) and flow rates, following the strict test procedures described in the standard API RP 11S8 for ESP vibration analysis. All procedures are carried out in test wells using electrically isolated waterproof (100 m of depth) accelerometer sensors with a 0.1 V/g sensitivity and operating in the frequency range of 0.6 to 10000 Hz (maximum tolerance
For most of the data used in this study, a total of 36 accelerometer sensors were attached to the components of each equipment (each motor pump is composed of pumps, protectors and motors), resulting in the acquisition of a total of 4570 vibration signals. The vibration spectra of each signal were analyzed by a domain expert in order to define the labels. The expert assigned each spectrum to one out of the five considered categories resulting in the following percentage of occurrence: normal (3706 samples representing 78.02%), sensor (294 samples representing 6.19%), unbalance (485 samples representing 10.21%), misalignment (50 samples representing 1.05%), or rubbing (35 samples representing 0.74%). As expected, the dataset is imbalanced with the normal condition being the majority category.
Performance evaluation
In order to ensure a fair and unbiased evaluation of each classifier, a common method was created and used to test each classifier type. In the context of machine learning it is an accepted fact that simple training-test splits of the data are not enough to reliably obtain a realistic statement about the performance of a classifier. In fault diagnosis applications a sophisticated methodology to estimate performance scores is rarely encountered. The best hyperparameters of a certain classifier are often obtained by submitting the same data over and over again, until a good set is found. For instance, the constraint parameter
The performance evaluation framework is divided into three nested layers. The outer layer executes
As an example for the search of the best hyperparameter in the inner CV loop for tuning, take the K-Nearest Neighbor classifier.1 Supposing that the Euclidean distance metric is not modified, the only hyperparameter is the number
This strategy implies that
Hyperparameters used for each classifier architecture
The performance evaluation framework is depicted in Algorithm 1. The following symbols are used:
Algorithm 1: Performance evaluation framework
    for ... do                          // all rounds
        ...                             // test set of ...
        initialize best criterion
        for ... do                      // grid search for best hyperparameter set
            for ... do                  // all folds
                ...                     // training set of ...
                ...                     // test set of ...
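The nested structure of the framework (repeated rounds, an outer cross-validation for testing and an inner cross-validation for hyperparameter tuning) can be sketched as follows. This is a simplified illustration and not the WEKA-based implementation used in this work: it uses a small hand-written KNN, plain accuracy instead of the macro-averaged F-measure as the tuning criterion, and synthetic two-class data. The numbers of rounds and folds, the grid of neighbours and all function names are assumptions.

```python
import random


def knn_predict(train, x, k):
    """train: list of (features, label). Majority label among the k nearest samples."""
    nearest = sorted(train,
                     key=lambda s: sum((a - b) ** 2 for a, b in zip(s[0], x)))[:k]
    labels = [label for _, label in nearest]
    return max(set(labels), key=labels.count)


def accuracy(train, test, k):
    return sum(knn_predict(train, x, k) == y for x, y in test) / len(test)


def folds(data, n_folds):
    """Yield (training part, held-out part) for each fold."""
    for f in range(n_folds):
        held_out = data[f::n_folds]
        training = [s for i, s in enumerate(data) if i % n_folds != f]
        yield training, held_out


def nested_cv(data, k_grid, n_rounds=2, n_outer=3, n_inner=3, seed=0):
    rng = random.Random(seed)
    scores = []
    for _ in range(n_rounds):                                     # all rounds
        shuffled = data[:]
        rng.shuffle(shuffled)
        for outer_train, outer_test in folds(shuffled, n_outer):  # outer folds
            best_k, best_score = None, -1.0                       # best criterion
            for k in k_grid:                                      # grid search
                inner = [accuracy(tr, va, k)
                         for tr, va in folds(outer_train, n_inner)]
                mean_inner = sum(inner) / len(inner)
                if mean_inner > best_score:
                    best_k, best_score = k, mean_inner
            # The outer test fold is used only once, with the tuned hyperparameter.
            scores.append(accuracy(outer_train, outer_test, best_k))
    return scores


# Synthetic, well-separated two-class data (illustrative only).
data_rng = random.Random(1)
data = ([([data_rng.gauss(0, 0.3), data_rng.gauss(0, 0.3)], 0) for _ in range(30)]
        + [([data_rng.gauss(3, 0.3), data_rng.gauss(3, 0.3)], 1) for _ in range(30)])
scores = nested_cv(data, k_grid=[1, 3, 5])
```

The key property is that the hyperparameter is chosen using only the training part of each outer fold, so the score collected on the outer test fold is not biased by the tuning.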
The main performance criterion used in this work is the macro-averaged F-measure [42], which relates the positive labels in the data to those assigned by a classifier. This choice is motivated by the simultaneous consideration, in one single value, of precision and recall, both derived from the confusion matrix. Macro-averaging is chosen since it treats all classes equally, while micro-averaging favors classes with more examples. In a fault diagnosis task, the normal class is usually over-represented, since many more patterns come from a normal machine operating condition than from faulty ones. Consider a multi-class classification problem with
and the macro-averaged recall over
Finally the macro-averaged F-measure is
The parameter is
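The macro-averaged scores discussed above can be sketched directly from a confusion matrix. The code assumes the usual definitions: per-class precision and recall are averaged with equal weight, and the F-measure is computed from the two macro averages with a weighting parameter `beta` (set to 1 here as an assumption). The function name and the example matrix are hypothetical.

```python
def macro_scores(confusion, beta=1.0):
    """confusion[i][j]: number of samples of true class i predicted as class j."""
    n = len(confusion)
    precisions, recalls = [], []
    for i in range(n):
        tp = confusion[i][i]
        predicted = sum(confusion[r][i] for r in range(n))  # tp + fp
        actual = sum(confusion[i])                          # tp + fn
        precisions.append(tp / predicted if predicted else 0.0)
        recalls.append(tp / actual if actual else 0.0)
    precision = sum(precisions) / n   # every class has the same weight
    recall = sum(recalls) / n
    if precision + recall == 0:
        return precision, recall, 0.0
    f_measure = ((beta ** 2 + 1) * precision * recall
                 / (beta ** 2 * precision + recall))
    return precision, recall, f_measure


# Imbalanced two-class example: 100 "normal" samples and 10 "fault" samples.
confusion = [[90, 10],
             [5, 5]]
precision, recall, f_measure = macro_scores(confusion)
```

In this example 95 of the 110 samples are classified correctly (an accuracy of about 0.86), yet the macro-averaged recall is only 0.70 because half of the rare class is missed, which is the kind of effect macro-averaging is meant to expose.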
Data standardization is applied, i.e. substituting the original feature value
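A minimal sketch of such a standardization step is given below, assuming the common z-score form in which each feature is centred by its mean and divided by its standard deviation. Estimating both statistics on the training data only and using the population standard deviation are assumptions of this sketch, as are the function names.

```python
import math


def fit_standardizer(rows):
    """Per-feature mean and (population) standard deviation of the training rows."""
    n = len(rows)
    columns = list(zip(*rows))
    means = [sum(col) / n for col in columns]
    stds = [math.sqrt(sum((v - m) ** 2 for v in col) / n)
            for col, m in zip(columns, means)]
    return means, stds


def standardize(rows, means, stds):
    """Apply (value - mean) / std per feature; constant features map to 0.0."""
    return [[(v - m) / s if s else 0.0 for v, m, s in zip(row, means, stds)]
            for row in rows]


train = [[1.0, 10.0], [2.0, 20.0], [3.0, 30.0]]
means, stds = fit_standardizer(train)
z_train = standardize(train, means, stds)
z_test = standardize([[2.0, 40.0]], means, stds)  # reuses the training statistics
```

Reusing the training statistics for the test data keeps information from the test set out of the preprocessing step.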
In addition to the macro-averaged F-measure (
In this section, the statistical procedure to compare the performance of the evaluated classifiers is described. There is a total of twenty different classifiers: five different architectures (RF, SVM, KNN, J48, NB), each without and with the application of bagging ensembles using the three information fusion functions. The classifiers were compared pairwise in order to select the most suitable one for the proposed classification problem. The same split of the dataset over the rounds and folds was used for every classifier.
One of the assumptions of the well-known paired
Large values of
In order to control the family-wise error, i.e. the cumulative Type I error, Holm’s step-down procedure [44] was employed. It starts by increasingly ordering the
Finally, it rejects the hypotheses of similarity in performance of tests with the following
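Holm's step-down procedure can be sketched as follows, assuming its standard form: the p-values are sorted in increasing order, the i-th smallest is compared against alpha / (m - i + 1), and testing stops at the first p-value that exceeds its threshold. The function name, the significance level and the example p-values are illustrative assumptions.

```python
def holm_reject(p_values, alpha=0.05):
    """Return one boolean per hypothesis: True if it is rejected by Holm's procedure."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    reject = [False] * m
    for rank, idx in enumerate(order):      # rank 0 is the smallest p-value
        if p_values[idx] > alpha / (m - rank):
            break                           # this and all larger p-values are kept
        reject[idx] = True
    return reject


p_values = [0.01, 0.04, 0.03, 0.005]
decisions = holm_reject(p_values)
```

With these values the two smallest p-values (0.005 and 0.01) are rejected, while 0.03 fails its threshold of 0.025 and stops the procedure, so 0.04 is also retained even though it is below 0.05.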
The objective of this methodology is the analysis of the diagnosis system with respect to the emergence of new faults. For instance, it may happen that at the operational stage of the system, a novel fault class arises that has never been seen before, and it should be classified. Alternatively, a new fault already present in the data might be discovered and has to be classified. The system should be able to be retrained with examples of such cases while exhibiting a certain amount of robustness for these situations, i.e. it should be able to handle the new class without degrading the performance of the old classes. In an ideal scenario, the system performance would not be affected by inclusion of a new fault class.
In order to test this robustness, an existing fault is either omitted or is declared as a normal situation. In this way, the omitted existing fault can simulate a new fault that was never seen before and the existing fault declared as normal can simulate an already seen pattern that was not yet considered as a fault. For each of the four considered fault classes, three different scenarios are assembled, cf. Table 2. In the case of “Removal”, the patterns associated with an existing fault are erased from both the training and the test set, and the performance evaluation of Algorithm 1 is rerun to obtain a new set of
Depending on how hard the particular fault is to diagnose, the F-measure could drop or rise with respect to the case when the fault is not removed. In order to permit a comparison with the latter, the “Original” case is defined. From the original training and test evaluation which produces a 5
Experimental methodology for robustness analysis
The “Substitution” case is when the labelled patterns of a certain fault are re-labelled as normal patterns, thus subtracting the fault patterns from the fault class in the training and test sets and adding them to the normal class. Again, the consequence for the confusion matrix is the reduction of its dimension to four, since the fault no longer appears as a separate class.
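The “Removal” and “Substitution” scenarios amount to two simple transformations of the labelled dataset, applied before the evaluation is rerun. The sketch below assumes samples stored as (features, label) pairs with string labels; the function names and the toy samples are hypothetical.

```python
def removal(samples, fault):
    """Erase every pattern of the given fault from the dataset."""
    return [(x, label) for x, label in samples if label != fault]


def substitution(samples, fault, normal="normal"):
    """Re-label every pattern of the given fault as a normal pattern."""
    return [(x, normal if label == fault else label) for x, label in samples]


samples = [
    ([0.1], "normal"),
    ([0.9], "rubbing"),
    ([0.5], "unbalance"),
    ([0.8], "rubbing"),
]
removed = removal(samples, "rubbing")
substituted = substitution(samples, "rubbing")
```

In both cases the treated fault disappears as a separate class, but only “Substitution” keeps its patterns in the data, now counted as normal.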
This section presents the results obtained with the experiments described in the previous section. The performance evaluation framework was developed in the WEKA workbench (version 3.4.3) [45]. The classifiers (Naïve Bayes, KNN, SVM, Decision Tree, Random Forest) were tested using their respective WEKA implementation (NaiveBayes, IBk, LibSVM – wrapper class for the libsvm library, J48 and RandomForest).
The experiments are divided into two parts. The first part is a comparative study of all classifier architectures, evaluating each type of classifier with and without the use of the ensemble information fusion functions (DT, majority voting, weighted voting). The objective is to find out whether the use of ensembles improves the performance scores. The classifier that achieves the highest score is retained for the second part of the experiments. This part analyzes the effects of removing a certain fault from the classes (i.e. removing its examples), or considering a fault as a normal class (i.e. changing the label of its examples to the normal class). The objective of this experiment is to evaluate the performance of the classification framework in more realistic situations. During the operation of the fault classification, it is plausible that new situations arise. For instance, broken rolling bearings manifest themselves in changed vibration patterns. The diagnosis system is expected to remain able to discern the learned faults with a certain amount of robustness. That is, the classifier performance should not degrade significantly when it is trained considering new fault classes.
Comparison of standalone classifiers and bagging ensembles
The aim of this experimental stage is to determine which classifier performs best for the given database of the machine conditions. The experiments were performed with the total amount of
Boxplot of resulting macro F-measures for each classifier model. The superscript indicates the ensemble information fusion function (DT 
Summary of the macro F-measures performance criteria boxplot statistics for each classifier method. The superscript indicates the ensemble information fusion function (DT
Analyzing the results, one may see that the RF classifier using DT as the ensemble information fusion function achieved the best performance among all classifiers, since it has the highest median performance and high values of the other statistical metrics. Moreover, DTs outperformed the other ensemble information fusion functions for all classifiers, except for SVM, where majority voting was slightly better. Therefore, the classifiers with and without ensembles are compared pairwise considering DT as the ensemble information fusion function.
Figure 4 presents the results of the corrected
Matrix plot of results of the pairwise comparison between methods. The upper triangle presents corresponding 
The results suggest that the use of Decision Templates has a considerable impact on performance, which depends on the particular classifier. In the case of KNN and SVM, the F-measure value decreases after DTs are used. However, when DT is used, an improvement can be observed, considerable for RF and J48 and slight for NB. This difference is probably due to the stability of the learning algorithm. Stable algorithms do not change their predictions considerably when the training data is slightly modified. According to [39], bagging ensembles do not perform well with stable algorithms. KNN and SVM are stable, while RF, J48 and NB are less stable.
Based on the extensive quantity of cross-validation experiments with
The macro-averaged F-measure might give an incomplete impression of the classifier performance. Thus, for improving the interpretability of the performance of the winning classifier RF
The results in Tables 4 and 5 show that the patterns of some classes are more informative than the others. Indeed, rubbing and misalignment are more often wrongly labeled, whereas normal, accelerometer faults and unbalance are more precisely identified. The inferior performance of the classifier for identifying rubbing and misalignment is probably due to the low numbers of labeled examples for those classes. Therefore, an increase in performance is expected when more data of those faults becomes available.
Confusion matrix of the RF
Precision, Recall, and F-measure metrics for each class. These metrics were calculated using the information in Table 4
Boxplot of macro F-measures for each of the three scenarios, “Original”, “Removal” and “Substitution”, described in Table 2, separated for each of the four considered fault classes that were removed or substituted.
This subsection describes the behavior of the RF
Summary of the macro F-measures performance criteria boxplot statistics for each fault class, comparing the three scenarios of Table 2. 1
The input parameters of Holm’s step-down procedure are
This result supports the hypothesis that the system is robust relative to the inclusion of new faults, since the inclusion of a new defect did not significantly change the performance for the other classes. Therefore, it indicates that the system is able to handle new faults in case it has to be retrained to cope with additional fault patterns. This is important from the practical point of view because a new fault pattern might be perceived by an expert over time. In this case, the system would have to be able to handle the new fault without degrading the performance on the old patterns. The “Removal” results indicate that the system performance will not degrade in case the oil company discovers a new fault pattern that was not yet present in the training examples. As an example, consider the case in which the research department of the oil company discovers new fault patterns through simulation and decides to train the system to look for those patterns. The “Substitution” results indicate that the system performance will not degrade in case the oil company expert realizes that a pattern, apparently normal, is actually a new fault pattern that should be considered by the system.
When looking at each fault individually in Table 5, it is possible to see that rubbing is one of the most difficult faults to identify. This is confirmed by the result presented in Fig. 5, which shows a performance increase when rubbing is removed from the problem (i.e., comparing only the “Original” experiments among the different faults). Following the same logic, the second most difficult fault is misalignment, which can also be confirmed by the results presented in Table 5 and Fig. 5. The unbalance fault shows a peculiar behavior in Fig. 5. When compared to the original experiment, performance slightly decreases with “Substitution” and increases with “Removal”. This could indicate that unbalance lies somewhere in between the normal and fault classes in the feature space, which brings the other classes closer to normal during “Substitution” and further apart during “Removal”. This would cause, respectively, a decrease in performance (due to the harder problem the classifier had to solve) and an increase in performance (due to the simpler problem the classifier had to solve).
Results for Holm’s step-down procedure
This paper investigated the use of bagging ensembles with Decision Templates together with different classifier architectures in order to define the most appropriate model for the proposed diagnosis system of submersible pumps. Bias aware extensive cross validation was first used to determine that the Random Forest together with Decision Templates is the system with the highest scores. The chosen RF
From the engineering point of view, the results suggest that the system can help non-experts perform, with less supervision, the work of highly trained experts. This yields savings for the oil company in the hiring process, since a smaller number of expensive, highly trained experts would be necessary. In addition, it would reduce the loss of knowledge caused by the absence of the highly trained expert. Today, this kind of expert is in high demand because they have to supervise the whole pump testing procedure that is performed at the pump manufacturer, including all expenses (travel and personnel). One single test costs around 500,000 dollars; therefore, it must be performed with precision and the analysis must be conclusive. The results showed that the system can perform systematically well. Proper fault detection avoids the high intervention costs (ranging from 20 to 70 million dollars) of pulling an ESP system out of sub-sea satellite wells.
In future work, the causes of the performance variation of the classifiers should be investigated further and additional fault classes should be considered. Furthermore, data from the same motor pump was collected multiple times; however, any dependencies between sensors were ignored in this work. An analysis considering the correlations between the various vibration signals might be envisioned as future work when data from more motor pumps become available. Moreover, it is worth mentioning that the proposed system suffers from the general drawbacks of every model-free fault diagnosis system. Enough representative data samples have to be provided in order to obtain a good generalization of the classifier. This is especially critical if some machine conditions are underrepresented because they are very rare. Lastly, deep learning techniques have shown strong performance in various areas of research. Thus, some unidimensional convolution approaches, such as [46, 47], should be investigated as a possible replacement for the handcrafted features designed for this problem.
Footnotes
The letter ‘K’ is generally used both for the number of neighbors of the KNN classifier and for the number of folds of a K-fold cross validation. The intended meaning should be clear from the context.
Acknowledgments
This work was supported by CENPES-Petrobras under Grant Termo de Cooperação 0050.00070332.11.9 Petrobras-UFES.
