Abstract
The emergence of the Industry 4.0 trend brings automation and data exchange to industrial manufacturing. Using computational systems and IoT devices allows businesses to collect and process vast volumes of sensor and business process data. The growth and proliferation of big data and machine learning technologies enable strategic decisions based on the analyzed data. This study proposes a data-driven predictive maintenance framework for the air production unit (APU) system of a train of Metro do Porto. The proposed method assists in detecting failures and errors in machinery before they reach critical stages. We present an anomaly detection model following an unsupervised approach, combining the Half-Space-Trees method with One Class K Nearest Neighbor, adapted to deal with data streams. We evaluate and compare our approach with the Half-Space-Trees method applied without the One Class K Nearest Neighbor combination. Our model produced few type-I errors, significantly increasing precision when compared to the Half-Space-Trees model. Our proposal achieved high anomaly detection performance, predicting most of the catastrophic failures of the APU train system.
Introduction
Predictive maintenance (PdM) is a method that uses real-time analytic tools to assess data collected from various parts of an industrial machine [1]. The goal is to detect malfunctions as quickly as possible and fix them before they lead to a catastrophic failure. Anomaly detection lies at the core of PdM, with the primary focus on finding anomalies in the working components of machines at early stages and alerting supervisors to carry out maintenance activities [2].
This work describes a data-driven predictive maintenance system to detect anomalies on an Air Production Unit (APU) installed on trains of Metro do Porto. The goal is to identify potential failures as early as possible and notify the maintenance team of an anomaly (undetectable with traditional maintenance criteria), avoiding the inconvenience of removing a train from operation and saving time and money for the company.
The data is collected from the APU using a set of analog sensors and by reading, directly from the APU control system, some digital signals that control the state of the APU. We receive the data at regular time intervals, and the learning process extracts information in near real-time to build a predictive model. The model can send an alarm to the maintenance teams, allowing timely intervention on the train.
In this work, we propose an online predictive model with adaptive learning properties, capable of dealing with incoming stream data. Since the data coming from the sensors is unbounded and received as a continuous flow, we chose to explore data stream mining, where the computational resources available to the methods (memory, computational power, processing time) are limited. These methods are based on incremental learning, as the model is induced incrementally from the data, and include a forgetting mechanism to deal with limited memory. They differ from batch learning models such as Deep Neural Networks, which are static, usually require substantial computational power to fit the data well, and are trained offline.
Furthermore, we followed a semi-supervised learning approach since we did not know when train failures occurred at the beginning of the project. Therefore, we have combined two methods, the Half Space Trees (HS-Trees) algorithm for one-class anomaly detection in evolving streams [3] and an adaptation of the K-Nearest Neighbour [4, 5] capable of doing one-class classification in streaming data.
The main idea of our proposal is to use HS-Trees as the primary anomaly detector method to filter the incoming data. HS-Trees sends the observations detected as anomalies to the One-Class K-Nearest Neighbour method to reduce false positives. Our model presented high-performance results, detecting most of the catastrophic failures and producing fewer false positives compared to the HS-Trees method.
The paper is organized as follows: we provide an overview of the related work in the context of anomaly detection in Section 2. Section 3 describes the algorithms implemented in our proposal for fault detection using a semi-supervised learning approach. Section 4 describes the data used, the problem definition and the detailed description of our proposal. Section 5 presents the anomaly detection results of our model. Finally, Section 6 points out the conclusions, remarks, and future work.
Related work
Using sensors to monitor industrial equipment combined with the emergence of high-speed networks like 5G and computational systems allowed the development and adaptation of machine learning techniques to anomaly detection and predictive maintenance. In this section, we will present some studies regarding these two topics.
Maintenance and repair procedures for industrial equipment are typically a reaction to an unpredicted issue. Since malfunctions in equipment affect safety, availability, and the environment, the authors in [6] proposed a real-time monitor to schedule monitoring tasks. These tasks obtain sensor information, measure the state and condition of several components, and determine the most appropriate moment to apply a preventive maintenance action (predictive maintenance) on the equipment. The predictive maintenance topic has been attracting growing interest over the last years, with several proposals exploring different machine learning methods for predictive maintenance or anomaly detection [6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18]. More recently, the survey in [19] analysed the related work on the usage of machine learning techniques for predictive maintenance in the railway industry.
Industrial equipment often lacks sufficient and diverse anomalous data to build a binary classification system. Thus many of the predictive maintenance models rely on unsupervised anomaly detection algorithms, which are responsible for determining whether an observation of the sensor deviates from the normal state of the equipment [20, 21]. Detecting the presence of anomalies in real-time provides valuable insights and knowledge about the equipment to make a rigorous assessment of possible maintenance interventions. There are several works in the literature related to the topic of predictive maintenance in railway systems, and they can be organized into supervised or unsupervised learning approaches:
Supervised learning
Rabatel et al. [7] explored the application of sequential patterns to correctly identify normal and abnormal data generated by a set of sensors installed in three key train components.
Li et al. [8] proposed a five-step predictive maintenance framework. The first step is the feature extraction of the dataset containing information about bearings on the train. The second step is reducing dimensional space using the Principal Component Analysis. The model adopted was the Support Vector Machine. Finally, a confidence level for alarm prediction was defined, and a rule simplification divides the feature space into non-overlapping small grids.
In terms of predicting failures on train doors, Manco et al. [11] developed an application to predict and explain door failures using an outlier detection method. Pereira et al. [9] developed a failure detection system for classifying irregular open/close cycles within trains based on the difference between the inlet and outlet pressure in specific intervals of the cycle. More recently, Ribeiro et al. [10] explored data-driven PdM based on anomaly and novelty detection implemented to predict failure in the automatic door system. The results showed that a low-pass filter could significantly reduce the number of false alarms.
Fumeo et al. [22] described a condition-based maintenance algorithm that explores the online support vector regression algorithm to predict the remaining useful life of the railway vehicle. In particular, the authors aim to detect failures on the axle bearings as soon as possible.
Wan-Jui Lee [12] used the Linear Regression model to describe two different compressor operations (idle and running time). The authors used logistic functions to define the boundaries of the two classes or compressor operations modes. The system is used for air leakage detection by anomaly detection in a train’s braking pipes. They used a density-based clustering method with a dynamic threshold to distinguish anomalies.
Bukhsh et al. [15] explored the usage of tree-based models like Random Forest, Decision trees, or XGBoost to predict the status of railway switches. Additionally, the authors explored the Local Interpretable Model-Agnostic Explanations (LIME) to explain the possible reasons for the malfunction. Kalathas and Papoutsidakis [18] applied two well-known classification algorithms, the J48 and M5P, to monitor the health state of traction and braking subsystems of the Greek Railway. Adopting tree algorithms helps the maintenance teams understand the reason for the malfunction.
Kang et al. [13] described a system that uses a Bayesian statistical learning model to represent the expected behaviour of the train in terms of speed. The study’s main objective was to capture changes and anomalies in the trains’ speed to detect some malfunctions as early as possible.
Barros et al. [16] proposed adopting a rule-based system to detect anomalies on a train compressor unit. This system monitored several analog and digital variables and then used a low-pass filter to smooth the analog signals and count the number of peaks in a time window. The rules were designed based on the maintenance teams' expertise to define the compressor units' normal state.
Unsupervised learning
Salierno et al. [17] proposed an architecture for predictive maintenance in the railway domain. The proposed architecture aims to predict failures in the interlocking railway system of the Italian Railway. The authors adopted a Long Short-Term Memory model to capture abnormal patterns of the interlocking system.
Davari et al. [23] describe a sparse autoencoder (SAE) network for predictive maintenance on a metro railway domain. The proposed autoencoder is designed to predict failures on the air compressor subsystem to remove the train from circulation safely.
Chen et al. [14] presented a predictive system for the compressor air unit. The authors used a recurrent neural network using Long Short-Term Memory architecture for failure prediction. The authors compared their method with the random forest method, and the results showed that the neural network proposal was more stable when compared with the Random Forest.
All the described related works (summarized in Table 1) rely on identifying the normal state of the system/component, considering as possible anomalies the observations that do not have the same familiar patterns. Different machine learning models or techniques were applied depending on the context and characteristics of the equipment.
Our approach differs from the state of the art because it relies on machine learning techniques to identify abnormal patterns correctly. The supervised approaches presented in this section cannot be applied in real time in our setting because the ground truth is not known. The unsupervised learning approaches, where some authors use high values of the autoencoder reconstruction error to signal an anomaly, suffer from a high number of false positive alarms. Our method relies on a semi-supervised learning algorithm, HS-Trees, which learns a single class and classifies everything else as an anomaly. If the output for an observation is positive for an anomaly, we use a kNN algorithm to check whether the observation is distant from known normal observations of the air compressor unit. The ablation study in this manuscript shows a significant improvement in the evaluation metrics.
Related work comparison
This section presents the algorithms employed to build the proposed model and detect the train system’s catastrophic failures.
Half space trees
Half-Space-Trees (HS-Trees) is a one-class anomaly detection algorithm built to deal with data streaming environments [3]. HS-Trees is an ensemble that learns incrementally as the data arrives and is capable of performing unsupervised learning on evolving data. The basic concept of this algorithm is to create binary trees by partitioning the data into sub-spaces or regions. When generating a tree, the algorithm selects a random dimension and splits it into disjoint, equal-volume halves, thus creating the left and right child nodes. This process is repeated until the tree reaches a maximum depth (a user-specified parameter). Therefore, any data point in the domain travels a single path from the tree's root to a leaf, going through the different sub-spaces. Although the data is split into density sub-spaces, this algorithm differs from clustering algorithms such as SOM (self-organizing map) [24]: HS-Trees outputs anomaly scores estimated from the density of the regions, whereas SOM computes cluster regions through distance-based techniques. HS-Trees records the mass of each node, that is, the number of data points falling in it, and uses it to estimate how anomalous each data point is.
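The construction described above can be illustrated with a minimal sketch of a single tree. It assumes every feature has been scaled to a known range, and it covers only the random halving and the mass counting. The names `Node`, `build_tree` and `update_mass` are hypothetical and are not taken from [3] or from any library.

```python
import random
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Node:
    depth: int
    mass: int = 0                        # number of points routed through this node
    split_dim: Optional[int] = None      # dimension halved at this node (None for a leaf)
    split_value: Optional[float] = None  # mid-point of that dimension's current range
    left: Optional["Node"] = None
    right: Optional["Node"] = None


def build_tree(mins: List[float], maxs: List[float], depth: int,
               max_depth: int, rng: random.Random) -> Node:
    """Recursively halve a randomly chosen dimension until max_depth is reached."""
    node = Node(depth=depth)
    if depth == max_depth:
        return node
    dim = rng.randrange(len(mins))
    mid = (mins[dim] + maxs[dim]) / 2.0
    node.split_dim, node.split_value = dim, mid
    # The left child keeps the lower half of the range, the right child the upper half.
    left_maxs = maxs.copy()
    left_maxs[dim] = mid
    right_mins = mins.copy()
    right_mins[dim] = mid
    node.left = build_tree(mins, left_maxs, depth + 1, max_depth, rng)
    node.right = build_tree(right_mins, maxs, depth + 1, max_depth, rng)
    return node


def update_mass(root: Node, x: List[float]) -> Node:
    """Route x from the root to a leaf, incrementing the mass of every node visited."""
    node = root
    while True:
        node.mass += 1
        if node.left is None:
            return node
        node = node.left if x[node.split_dim] < node.split_value else node.right


rng = random.Random(0)  # fixed seed only to make the sketch reproducible
tree = build_tree([0.0, 0.0], [1.0, 1.0], depth=0, max_depth=3, rng=rng)
leaves = [update_mass(tree, p) for p in ([0.1, 0.2], [0.15, 0.25], [0.9, 0.8])]
```

Each point follows exactly one root-to-leaf path, so the root mass equals the number of points seen, and sparse regions end up with low mass.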
The data-stream is partitioned into windows of equal sizes, named reference window and latest window. In reference window the algorithm learns the mass profile (
HS-trees example by [25] and a recorded latest mass profile. The left image represents the data partitioned, and the right image the HS-tree generated.
Figure 1 shows a representation of a HS-tree. The HS-tree partitions the domain dimensions with range values [0, 1] on the left side, where dots represent data points. The right side of Fig. 1 shows an HS-tree with a three-depth level. The inner nodes contain information regarding the splitting values and which dimension was partitioned, leaves contain mass profile values of the latest window. Also, it is possible to observe the mass profile value stored in the root’s right child (
To make the algorithm more robust,
Each tree in the ensemble computes a score for each data point independently regarding the anomaly score. The computed score allows the new arrival data point to traverse through the tree’s nodes until it reaches the leaves or a node with a mass profile equal to or less than a user-determined value
The One-Class K Nearest Neighbour (OCKNN) [5] is an adaptation of the original K-Nearest Neighbour algorithm [4] for supervised learning using distances between neighbours to classify the data. The
The mean distance value of each data point to its
OCKNN illustration example in two dimensional space, where 
It can be observed in Fig. 2 how the OCKNN algorithm works for one neighbour (
Problem definition
The Air Production Unit (APU) is part of a compressed air system, which produces pressurized air using an electric motor. The electrical current consumed by the motor is converted into kinetic energy. The compressed air system is a crucial component of the train and delivers essential pressurized air to several clients, such as the pneumatic suspension, the injection of oil on the rail to reduce friction and noise on curves, the injection of sand to gain traction on the rails, and, finally, the connection to other trains. Applying predictive maintenance here is essential to predict equipment failure before it happens, decreasing costs and optimizing the service.
Trains data
The data acquisition system collects information from several analog sensors and digital signals generated by the APU control system. Based on the failure history of the train fleet, it is possible to identify the critical components of the system that generate the majority of the failures. These critical components are: (i) electrical valve; (ii) pressure valve; (iii) oil leaks; (iv) electrical motor; (v) pressure switches; and (vi) drying towers. The sensors and the places to install them were strategically defined, considering the output of the failure history study. Figure 3 shows an overview of the train system.
Train System: dark arrows represent the pneumatic system, dashed arrows the control system and the thin black arrows the sensors.
The data acquisition system communicates with a cloud server that receives the data from the sensors with 1 Hz of sampling frequency. The system stores the data collected from the sensors and respective timestamps to a data logger file, and every five minutes, the file is sent to the server using the TCP/IP protocol application.
The analog sensors considered were the following.
TP2 – Measures the pressure on the compressor.
TP3 – Measures the pressure generated at the pneumatic panel.
H1 – This valve is activated when the pressure read by the pressure switch of the command is above the operating pressure of 10.2 bar.
DV pressure – Measures the pressure exerted due to the pressure drop generated at the air dryer towers; when it is equal to zero, the compressor is working under load.
Motor Current – Measures the current of one phase of the three-phase motor, which should present values close to 0 A when the compressor turns off, close to 4 A when the compressor is working offloaded and close to 7 A when the compressor is working under load. When the compressor starts to work, the motor current presents values close to 9 A.
Oil Temperature – Measures the temperature of the oil present on the compressor.
Flowmeter – Measures the airflow that leaves the APU for the reservoirs.
The considered digital sensors were the following.
COMP – The electrical signal of the air intake valve on the compressor. It is active when there is no admission of air on the compressor, meaning that the compressor is off or working offloaded.
DV electric – The electrical signal that commands the compressor outlet valve. When it is active, the compressor is working under load; when it is not active, the compressor is off or offloaded.
TOWERS – Defines which tower is drying the air and which tower is draining the humidity removed from the air. When it is not active, tower one is working; when it is active, tower two is working.
MPG – Is responsible for activating the intake valve to start the compressor under load when the pressure in the APU is below 8.2 bar. Consequently, it will activate the sensor COMP, which assumes the same behaviour as the MPG sensor.
LPS – Is activated when the pressure is lower than 7 bar.
Oil Level – Detects the oil level on the compressor and is active (equal to one) when the oil is below the expected values.
For our proposal, we considered only the analog sensor data, which arrives in the stream at one record per second. Figure 4 illustrates our anomaly detection model for predicting catastrophic failures.
Proposed methodology.
Before feeding the HS-Trees algorithm, we aggregated the data by minute using the timestamp feature. This operation extracted each sensor's mean, median, standard deviation, and variance. Our experiments showed that the information extracted for each minute was sufficient to prevent the HS-Trees algorithm from losing performance, while reducing the data processing time, as fewer records are processed.
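As an illustration of this aggregation step, the sketch below groups the one-second readings of a single sensor by minute and computes the four statistics. The function name `aggregate_per_minute` and the record layout are hypothetical, and the use of the population standard deviation and variance is an assumption, since the text does not say which estimator was used.

```python
from collections import defaultdict
from datetime import datetime
from statistics import mean, median, pstdev, pvariance


def aggregate_per_minute(records):
    """records: iterable of (timestamp: datetime, value: float) for one sensor."""
    buckets = defaultdict(list)
    for ts, value in records:
        # Truncate the timestamp to the minute so readings from the same minute share a key.
        buckets[ts.replace(second=0, microsecond=0)].append(value)
    return {
        minute: {
            "mean": mean(values),
            "median": median(values),
            "std": pstdev(values),      # population estimator (assumption)
            "var": pvariance(values),   # population estimator (assumption)
        }
        for minute, values in sorted(buckets.items())
    }


readings = [
    (datetime(2020, 3, 1, 10, 0, 0), 1.0),
    (datetime(2020, 3, 1, 10, 0, 1), 2.0),
    (datetime(2020, 3, 1, 10, 0, 2), 3.0),
    (datetime(2020, 3, 1, 10, 1, 0), 5.0),
]
features = aggregate_per_minute(readings)
```

With 1 Hz data, each minute reduces up to 60 raw records to one feature vector per sensor.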
After running several experiments with HS-Trees, we noticed that this method was generating a large number of false positives: only 4% of the data was reported as failures, while HS-Trees flagged around 25% of the data as failures. To tackle this problem, we adopted the OCKNN algorithm, adapted to deal with data arriving continuously. The idea is that OCKNN evaluates each observation flagged as an anomaly by HS-Trees to check whether it was detected correctly.
Data points are added to the OCKNN training set only if HS-Trees inferred them as normal data. The update process considers the maximum and minimum distances to neighbours observed during the stream: normal points that are distant from their neighbours are added to the training set, while points with the lowest neighbour distance are removed. This update mechanism performed well in our experiments, as retaining points with high distances makes the method more sensitive when detecting anomalous data.
When HS-Trees infers a point as anomalous, the OCKNN method calculates the distance from the arriving data point to its closest neighbour to verify whether it lies at an abnormal distance. To better understand our model, Algorithm 4 presents its pseudo-code implementation.
Before starting the data stream, initial parameters and data structures were defined: A set of initial training data with 1400 records for the OCKNN method, which represents a whole day stack (24 h) from a period that we know there was no anomaly in the train system; a
The data stream cycle starts in Line 2, where variable
In case HS-Trees infers the arrival data point as normal behaviour(line 12) the algorithm assigns
As anomalous events are rare, we define a threshold value representing 1% of the arriving data with the highest distance values to its nearest neighbour. This threshold parameter value allowed our method to detect most anomalous periods generating few type I and II errors. Figure 5 shows the anomalous data points detected by our method in one of the performed experiments. Distance values equal to zero represent data points classified as normal behaviour, while distance values greater than zero are the anomalies detected by our method. The colours represent the real meaning of the data. In red are the data points that correspond to the real anomalies, and in blue, the data points that correspond to the real normal behaviour of the train system.
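The second-stage check can be sketched as follows, assuming a Euclidean distance, a single nearest neighbour, and a threshold taken as the smallest distance among the top 1% of observed distances. All function names are hypothetical, and this is an illustration of the idea rather than the authors' implementation.

```python
import math


def nn_distance(x, training_set):
    """Euclidean distance from x to its closest point in the normal training set."""
    return min(math.dist(x, t) for t in training_set)


def distance_threshold(distances, top_percent=1):
    """Smallest distance among the top_percent% largest observed distances."""
    ordered = sorted(distances)
    k = max(1, len(ordered) * top_percent // 100)
    return ordered[-k]


def confirm_anomaly(x, training_set, threshold):
    """Keep an HS-Trees alarm only if x is abnormally far from known normal data."""
    return nn_distance(x, training_set) >= threshold


normal_points = [(0.0, 0.0), (1.0, 0.0)]
observed = [float(d) for d in range(1, 201)]   # toy distances, not real data
threshold = distance_threshold(observed)        # top 1% of 200 values -> 2 largest
```

In this cascade, `confirm_anomaly` would be called only for points that the first-stage detector has already flagged, so most of the stream never reaches the distance computation.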
Anomalies detected by our method.
Figure 5 shows a set of normal values detected as anomalies, probably due to the initial fit of the HS-Trees model to the data distribution. From mid-March onwards, the model classifies fewer observations as anomalous, identifying part of practically all anomalous periods, with only a few anomalous examples classified as normal activity (false negatives). This model was developed in Python, using the implementation of the HS-Trees algorithm from the scikit-multiflow library [26] and the implementation of the K Nearest-Neighbour algorithm from the scikit-learn library [27], adapted to work as a one-class classifier with online data.
In this section, we evaluate our model and report the results of our experiments. We evaluated the model's effectiveness using data from a train in operation over five months of 2020, with some catastrophic failures reported during that period. The data contains 21 periods reported as anomalous. Some last a few minutes, others a couple of hours.
Evaluation procedure
In order to evaluate the performance of our approach, five experiments were carried out with state-of-the-art anomaly detection algorithms in the context of data streams, using the data from the analog sensors present in the APU system. The features used were the mean, median, standard deviation and variance of DV_pressure, TP2, TP3, H1, Oil_temperature and Motor_current, together with the mode feature. This last feature concerns the status of the train and has three states: in progress, stopped, and under maintenance. Data with maintenance status was discarded, as tests are performed on the trains during maintenance, causing the APU system to generate anomalous values that mislead the model's predictions. It is also important to mention that all data were normalized using the standard window scaling technique, which standardizes features by removing the mean and scaling to unit variance. The mean and standard deviation are computed on a given window frame.
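A minimal sketch of window-based standardization for a single feature is shown below. The class name `WindowStandardScaler`, the fixed-size sliding window, and the choice to include the current value in the window statistics are assumptions for illustration; the text does not specify these details.

```python
from collections import deque
from statistics import mean, pstdev


class WindowStandardScaler:
    """Standardize a stream of values using the mean and std of a sliding window."""

    def __init__(self, window_size):
        # deque with maxlen discards the oldest value automatically (forgetting).
        self.window = deque(maxlen=window_size)

    def transform(self, x):
        self.window.append(x)
        mu = mean(self.window)
        sigma = pstdev(self.window)
        # A constant window has zero spread; return 0.0 instead of dividing by zero.
        return 0.0 if sigma == 0 else (x - mu) / sigma


scaler = WindowStandardScaler(window_size=3)
scaled = [scaler.transform(v) for v in (1.0, 3.0, 5.0, 7.0)]
```

Because the statistics come from recent data only, the scaled values adapt when the sensor's operating range shifts over time.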
Regarding the algorithms, we tested our approach (
To assess the models, we verified whether the detected anomalies fell within the reported anomalous periods, as shown in Fig. 6. If, for a given model, the output overlaps the ground truth (that is, more than one anomaly is detected within an anomalous period), then all observations from that period are counted as anomalous (True Positive) in the model's output. Note that the results of our methodology were validated by experts at Metro do Porto.
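The period-level crediting rule can be sketched as below. The function name, the index-based period representation, and the default of two detections (reading "more than one anomaly" literally) are assumptions for illustration.

```python
def credit_detected_periods(y_pred, periods, min_hits=2):
    """Mark every observation of a reported period as detected when the model
    raised at least min_hits alarms inside it.

    y_pred:  list of 0/1 model outputs, one per observation.
    periods: list of (start, end) index pairs, both inclusive.
    """
    adjusted = list(y_pred)
    for start, end in periods:
        hits = sum(y_pred[start:end + 1])
        if hits >= min_hits:
            for i in range(start, end + 1):
                adjusted[i] = 1
    return adjusted


y_pred = [0, 1, 0, 1, 0, 0, 1, 0]
periods = [(1, 4), (6, 7)]
adjusted = credit_detected_periods(y_pred, periods)
```

Here the first period contains two alarms and is credited in full, while the second contains only one and is left unchanged.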
We performed several experiments to adjust the hyperparameters reaching the settings in Table 2.
Selected hyperparameters
Models validation approach.
We used the Accuracy, Precision, Recall, and F1 metrics for model evaluation, which give the necessary information to analyze the type I and type II errors. Accuracy tells us the proportion of observations from both classes (normal and abnormal) that were correctly identified. Equation (1) shows its formulation, where:
True Positive (TP) – The number of observations correctly identified as an anomaly.
False Positive (FP or Type I error) – The number of observations classified as an anomaly but corresponding to normal activity.
True Negative (TN) – The number of observations correctly identified as normal activity.
False Negative (FN or Type II error) – The number of observations classified as normal activity but corresponding to anomalies.
The precision and recall metrics in Eqs (2) and (3) reflect the rates of FP and FN, respectively. For instance, a high recall value means few FN, while a low precision indicates many FP. To analyze the balance of these two metrics, we compute the F1 score as the harmonic mean of Precision and Recall. While it is possible to take a simple average of the two scores, the harmonic mean is dominated by the smaller of the two values, so a model cannot reach a high F1 score unless both precision and recall are high. Thus the F1 score in Eq. (4) is a balanced metric that appropriately quantifies the correctness of models.
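For reference, the standard definitions of these four metrics in terms of TP, FP, TN and FN, which the equations referenced above are assumed to follow, are:

```latex
\begin{align*}
\text{Accuracy}  &= \frac{TP + TN}{TP + TN + FP + FN} \\
\text{Precision} &= \frac{TP}{TP + FP} \\
\text{Recall}    &= \frac{TP}{TP + FN} \\
F_1              &= 2 \cdot \frac{\text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}}
\end{align*}
```

As a purely illustrative example with made-up numbers, a precision of 0.5 and a recall of 1.0 give an arithmetic mean of 0.75 but an F1 score of about 0.67, which shows how the harmonic mean penalizes the weaker of the two values.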
Performance results of the methods using the metrics Accuracy (a), Precision (b), Recall (c) and F1 Score (d).
First, we start by analyzing the results of the models in Fig. 7a where metric accuracy was used. We can observe that our model was the best, reaching an accuracy of around 98%, followed by the
Analyzing the remaining metrics, it can be seen in Fig. 7b the percentage of type I errors generated by the models. In this case, the
Analyzing Fig. 7c concerning recall metric, the results presented by the models are much better (except for the
To analyze the balance between precision and recall metrics, we can observe Fig. 7d, which presents the model’s performance evaluated by the F1 score metric. Therefore, our approach was the best, achieving 87% F1 score, followed by
Conclusions
Predictive Maintenance enables more efficient, longer-term planning for maintenance operations and makes it easier to allocate maintenance resources and define operational maintenance goals. One of the most promising aspects of the railway industry’s transformation is Predictive Maintenance through data collected on the equipment during operation to identify failures in real-time. Therefore, repairs can be adequately planned without unexpectedly taking trains out of service for emergencies or unnecessary routine Maintenance.
This paper presents a data-driven predictive maintenance framework for the APU train system of Metro do Porto. We used the HS-Trees method combined with OCKNN to build a predictive model capable of detecting catastrophic anomalies and dealing with streaming data.
Our empirical study shows that HS-Trees provided significant performance improvements when used in conjunction with OCKNN. The proposed predictive model obtained high anomaly detection performance while producing fewer false positives and false negatives than state-of-the-art methods. Distances to neighbours are a viable way to reduce false positives for this problem. For future work, we intend to test the robustness of our model when drift occurs in the data. This phenomenon is represented by a significant change in the data distribution, e.g., degradation as the train components age. We will also evaluate this methodology in other real-case scenarios related to Predictive Maintenance.
Acknowledgments
This research has been financially supported in part by the Spanish Ministerio de Economia y Competitividad (research project PID2019-109238GB-C22), and by the Xunta de Galicia (Grants ED431C 2018/34 and ED431G 2019/01) with the European Union ERDF funds. CITIC, as Research Center accredited by Galician University System, is funded by Consellería de Cultura, Educación e Universidades from Xunta de Galicia, supported in an 80% through ERDF Funds, ERDF Operational Programme Galicia 2014–2020, and the remaining 20% by Secretaría Xeral de Universidades (Grant ED431G 2019/01). This work was also supported by National Funds through the FCT – Fundação para a Ciência e a Tecnologia (Portuguese Foundation for Science and Technology) within the Project UIDB/00760/2020 and UIDP/00760/2020.
