Abstract
This paper analyzes the trade-offs between the performance of Parametric (P) and Non-Parametric (NP) classification algorithms when applied to classify faults occurring in pneumatic actuators. Owing to the criticality of actuator failures, classifying faults accurately may lead to robust fault-tolerant models. In most machine learning applications, the classifier for a given task is chosen arbitrarily. This work addresses the issue and quantitatively supports the selection of an appropriate algorithm for non-parametric datasets. For the case study, popular parametric classification algorithms, namely Naïve Bayes (NB), Logistic Regression (LR), Linear Discriminant Analysis (LDA) and Perceptron (PER), and non-parametric algorithms, namely Multi-Layer Perceptron (MLP), k Nearest Neighbor (kNN), Support Vector Machine (SVM), Decision Tree (DT) and Random Forest (RF), are implemented on a non-parametric, imbalanced synthetic dataset of a benchmark actuator process. With parametric classifiers, the results are severely distorted, which misleads the interpretation of model accuracy. Experimentally, an improvement of about 20% in accuracy is obtained with non-parametric classifiers over the parametric ones. The robustness of the models is evaluated by inducing label noise varying between 5% and 20%. Triptych analysis is applied to discuss the interpretability of each machine learning model. The trade-offs in the choice and performance of algorithms and the evaluation metrics for each estimator are analyzed both quantitatively and qualitatively. For validation, the results obtained for the synthetic dataset are compared against the industrial dataset of the pneumatic actuator of a sugar refinery, Development and Application of Methods for Actuator Diagnosis in Industrial Control Systems (DAMADICS).
The results demonstrate the efficiency of non-parametric classifiers for the pneumatic actuator dataset.
Keywords
Nomenclature
Bias
Number of classes in the datasets
Centre terms in rbf
Rod displacement in synthetic dataset, %
Dataset
Development and Application of Methods for Actuator Diagnosis in Industrial Control Systems
Decision Tree
Different faults in synthetic and benchmark datasets
Fisher Discriminant Analysis
Flow rate of medium in industrial dataset, t/h
Flow rate of medium in synthetic dataset, t/h
Interpretability
Number of nearest neighbours in kNN
k Nearest Neighbours
Regularization coefficient
Linear Discriminant Analysis
Logistic Regression
Servomotor
Number of features in datasets
Multi-Layer Perceptron
Number of nodes in decision trees
Number of samples in dataset
Naïve Bayes
Non-Parametric
Positioner
Parametric
Perceptron
Inlet valve pressure, in MPa
Outlet valve pressure, in MPa
Predictivity
Robustness
Radial Basis Function
Random Forest
Simplicity
Synthetic Minority Oversampling Technique
Controller Set point
Stability
Singular value decomposition
Support Vector Machine
Time taken by the fault to fully develop
Temperature of medium, in °C
Start time of the fault
Control Valve
Rod displacement in industrial dataset, %
Samples in dataset
Target variables in dataset
Learning rate of models
Regression values
Distance based hyperparameter for rbf
Discriminant function
Label noise
Percentage of label noise induced
Parameters
Class means
Standard deviation
Updated weights
Introduction
In the context of predictive analytics, the objective of classification models is to predict the correct class labels for the given input data. Real-world scenarios such as sentiment analysis, text and speech analysis, remote sensing, medical signal analysis, fault classification and diagnosis [1] in industrial applications, and quality assessment [9] demand classification of data under different situations. Classification models can be supervised, unsupervised or semi-supervised [18], depending on their learning mode. They can be binary, multiclass or multi-label, depending on the number of target classes assigned. Building good classifier models requires an understanding of the structure of the dataset, such as the distribution of the data, the correlation between the features, outlier information and the consistency of the class distribution. It is observed that algorithms that work well with a normally distributed dataset act differently when applied to a dataset following binomial or multinomial distributions. Algorithms suited to the former are called parametric algorithms, and those suited to the latter are called non-parametric algorithms. In this work, some popular parametric algorithms, namely Naïve Bayes (NB), Perceptron (PER), Logistic Regression (LR) and Linear Discriminant Analysis (LDA), and some popular non-parametric algorithms, namely Multi-Layer Perceptron (MLP), k Nearest Neighbor (kNN), Support Vector Machines (SVM), Decision Tree (DT) and Random Forest (RF), are chosen for analysis [15].
Parametric algorithms are assumption-based models with a fixed, finite number of parameters. These parameters are independent of the sample size [15], such that any prediction ‘y’ on the given data ‘D’ is independent of the data size, provided that the parameters ‘θ’ are given. This is given in Equation (1).
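A compact way to express this property, assuming the usual conditional-probability notation (the symbols follow the text, while the exact form used for Equation (1) is an assumption), is:

```latex
P(y \mid \theta, D) = P(y \mid \theta) \tag{1}
```

Read this way, once θ is known the data D carries no further information about the prediction y, which is why the number of parameters does not grow with the sample size.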
The parameter ‘θ’ carries the information the algorithm needs to perform its task. Statistically, the typical assumptions made by a parametric algorithm include normality, homogeneity of variance, linearity and feature independence. The Naïve Bayes classifier is a simple, fast, generative classifier that relies on the assumptions of normality and independence between features [13]. It handles categorical inputs and multiclass classification efficiently. A slightly evolved form of Naïve Bayes is the Perceptron classifier, which relaxes the independence assumption. The Perceptron classifier is a straightforward, discriminative parametric algorithm. The perceptron model in neural networks [14] has since evolved in different directions according to the requirements of the applications. The linear threshold function of the Perceptron allows the algorithm to work with ease for linear models. The Logistic Regression classifier is a discriminative classifier [28] with a strong linearity assumption between the features. It works with a sigmoidal function and fits well for non-linear models. The Logistic Regression model specifies the conditional distribution of the model. When used for multiclass classification, both the Perceptron and Logistic Regression classifiers binarize the classes and convert the task into a pairwise [29] problem to formulate the classification strategy. Linear Discriminant Analysis follows the normality and homogeneity of variance assumptions [2] and provides the joint distribution information about the model. All the above-mentioned algorithms have a strong and systematic mathematical formulation.
Non-parametric classification algorithms do not assume normality of the data. They are quasi assumption-free models that handle non-normal data. They can have an unbounded number of parameters, which varies according to the variations in the data [15]. The Multi-Layer Perceptron (MLP) classifier is a type of artificial neural network that consists of multiple layers of interconnected artificial neurons. MLPs are inherently parametric; however, with proper regularisation techniques, they can act as non-parametric classifiers [4]. The k Nearest Neighbors classifier is a memory-based algorithm in which the model depends on the dataset only. The ‘lazy’ character [19] of kNN amounts to instance-based learning, in which the training phase only stores the data. It is well suited to identifying more complex patterns in the data. Support vector machines have no probabilistic model, nor do they depend on normality assumptions [10]. They are model free and work by applying generalized dot products to determine the correlation between data points. They are efficient for high-dimensional data and have low computational complexity. Decision tree algorithms are widely used discriminative tree classifier models [31]. The trees are built in a baseline or cost-sensitive manner with the aim of decreasing the impurity at each split. The decision split points and the decision rules grow in number as the complexity of the data increases. Random forest is an extension of the decision tree method. It is an ensemble of trees, which are generally weak learners, whose outputs are averaged [3]. A well-organized year-wise survey of random forest methods is given in [24]. Almost all of the aforementioned classifiers can also perform regression tasks. From the literature it is inferred that non-parametric algorithms are not as mathematically rigorous as parametric algorithms.
The motivation behind this work is that the compatibility of parametric (P) and non-parametric (NP) algorithms with parametric and non-parametric datasets remains an open question when choosing a classifier. In this work, the performances of various parametric and non-parametric classifier models for a non-parametric dataset, DAMADICS (Development and Application of Methods for Actuator Diagnosis in Industrial Control Systems), are evaluated. Many works on fault detection (classification) and diagnosis have been reported in the literature, including pneumatic actuator fault detection [8], where classification is treated as a one-class problem.
The performance of the classifiers is evaluated using performance metrics. The choice of classification metric can become demanding when it must cover different kinds of datasets, such as balanced and imbalanced ones. Using regular metrics to evaluate the classifier under all categories may lead to suboptimal [7] classification models.
The main contributions of this work are fivefold: (i) a synthetic dataset is acquired by modeling the pneumatic actuator using the DABLib library under normal and faulty conditions; (ii) the imbalance in the normal and faulty class distribution of the pneumatic actuator dataset is addressed using an oversampling technique; (iii) both binary and multiclass classification are performed using both parametric and non-parametric classifiers; (iv) the trade-offs between the performance of the classification algorithms are analyzed under different criteria, namely flexibility, accuracy, interpretability, robustness and complexity; and (v) the analysis results are quantified using suitable performance metrics.
The highlight of the work includes the validation of the proposed classifier models on a benchmark industrial dataset (DAMADICS) of the pneumatic actuator in a sugar refinery located in Poland.
Accordingly, this paper consists of five main sections. Section 1 begins with an introduction to classification and the motivation behind the contribution of this work towards the analysis of trade-offs using definite metrics. The description of the synthetic dataset, the machine learning models used and the characteristics of these models are discussed in Section 2. The performance analysis of each classifier and the trade-offs between the models built on the synthetic dataset are presented in Section 3 under Results and Discussion. The results obtained from the synthetic dataset are validated against the industrial dataset of the DAMADICS pneumatic actuator in Section 4. Finally, the concluding remarks are presented in Section 5.
An understanding of the synthetically generated dataset, the algorithms employed and the characteristics analyzed is essential for the reader. Hence, Section 2 describes the dataset, the algorithms chosen and the performance indices considered to evaluate the algorithms.
Synthetic dataset
The dataset for testing the algorithms is a synthetic dataset derived by modelling the pneumatic actuator system using the DAMADICS Simulink library [16]. The schematic diagram of the pneumatic actuator is shown in Fig. 1.

Fig. 1. Representation of the DAMADICS actuator scheme [16].
From Fig. 1, it is noted that the actuator system consists of three parts, namely the Positioner (P), the Servomotor (M) and the Control Valve (V). The measurement variables considered include the controller set point (SP), water flow rate (F_r), inlet pressure of the control valve (P_i), outlet pressure of the control valve (P_o), temperature of the water (T_m) and the displacement (D) of the actuator rod. The set point (SP) and the rod displacement (D) both have a transducer range of 0-100%. The pressure transducer range is 0-4 MPa, the flow transducer range for the water flow rate (F_r) is 0-40 t/h and the temperature transducer range is 0-200 °C. However, during data generation, all data are normalized to the range 0-1.
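The normalization step can be illustrated with a short sketch. It assumes plain min-max scaling over the transducer ranges quoted above and that the same 0-4 MPa range applies to both pressures; the dictionary keys and the function name are illustrative only.

```python
# Transducer ranges quoted in the text; keys are illustrative variable names.
TRANSDUCER_RANGES = {
    "SP": (0.0, 100.0),   # controller set point, %
    "D": (0.0, 100.0),    # rod displacement, %
    "P_i": (0.0, 4.0),    # inlet pressure, MPa (range assumed shared with P_o)
    "P_o": (0.0, 4.0),    # outlet pressure, MPa
    "F_r": (0.0, 40.0),   # flow rate, t/h
    "T_m": (0.0, 200.0),  # medium temperature, degrees C
}


def normalize(variable: str, value: float) -> float:
    """Map a raw reading onto [0, 1] using the transducer range of its variable."""
    low, high = TRANSDUCER_RANGES[variable]
    return (value - low) / (high - low)


# Example: a raw inlet pressure of 2.6 MPa and a set point of 20 %
inlet_pressure_norm = normalize("P_i", 2.6)
set_point_norm = normalize("SP", 20.0)
```

Under this assumption, a normalized inlet pressure of 0.65 corresponds to a raw reading of 2.6 MPa.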
The data was generated by building the actuator model and invoking different fault conditions namely: clogging, sedimentation and erosion with varying fault intensities and also in normal mode. The parameters used during the simulation of various faults are tabulated in Table 1.
Table 1. Parameters used during the simulation of different faults [16]
From Table 1, the parameters, namely the fault start time (t_start), the time taken by the fault to develop completely (t_fdev), the minimum fault intensity f_i(min) and the maximum fault intensity f_i(max), are observed. The variations of the actuator physical variables according to the parameters in Table 1 are studied and the synthetic data is thus generated. The generated dataset, showing the effect of the physical variables during normal and faulty modes, is shown in Fig. 2.

Fig. 2. Representation of the generated dataset showing the effect of physical variables during (a) normal mode (b) f1 - clogging fault (c) f2 - sedimentation fault (d) f3 - erosion fault.
In Fig. 2a, it can be observed that the actuator is working without any induced fault. Here, the rod displacement (D) tracks the input control variable (SP) between 0.2 and 0.9, i.e., 20% and 90%. Also, the inlet pressure (P_i) is maintained at 0.65, the outlet pressure (P_o) varies around 0.9 and the temperature of the water (T_m) is maintained at 0.22. Here, the normalized values are used. When the valve clogging fault f1 is induced abruptly at time 900 s with a medium intensity of 0.5, as in Fig. 2b, the rod displacement (D) is inhibited, leading to a disturbed medium flow, F_r.
In Fig. 2c, the sedimentation fault f2 is induced gradually from time 900 s and reaches its peak intensity of 1 at 4500 s. During this fault, it can be observed that the flow of the medium (F_r) decreases due to decreased valve plug travel.
The change in physical variables for the erosion fault f3 is observed in Fig. 2d. The fault is induced at time 900 s and reaches its maximum intensity as early as 1500 s. Hence, the erosion fault is a short, gradual fault. During the fault period, the flow of the medium increases because erosion increases the valve plug movement.
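The three fault timings described above can be summarised by a simple intensity profile. The sketch below is an illustration only: it assumes a linear ramp from the start time to full development and a maximum erosion intensity of 1, neither of which is stated in the text, and the function names are hypothetical rather than part of the DAMADICS library.

```python
def fault_intensity(t: float, t_start: float, t_fdev: float, f_max: float) -> float:
    """Fault intensity at time t (seconds).

    Assumption: the intensity ramps linearly from 0 at t_start to f_max at
    t_start + t_fdev; t_fdev = 0 models an abrupt fault.
    """
    if t < t_start:
        return 0.0
    if t_fdev <= 0 or t >= t_start + t_fdev:
        return f_max
    return f_max * (t - t_start) / t_fdev


def clogging(t: float) -> float:
    # f1: abrupt at 900 s with medium intensity 0.5
    return fault_intensity(t, t_start=900, t_fdev=0, f_max=0.5)


def sedimentation(t: float) -> float:
    # f2: starts at 900 s and reaches intensity 1 at 4500 s
    return fault_intensity(t, t_start=900, t_fdev=3600, f_max=1.0)


def erosion(t: float) -> float:
    # f3: starts at 900 s and is fully developed at 1500 s (f_max = 1 assumed)
    return fault_intensity(t, t_start=900, t_fdev=600, f_max=1.0)
```

With these profiles, the sedimentation fault is only half developed at 2700 s, whereas the erosion fault reaches the same level by 1200 s, which matches the description of erosion as the shorter fault.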
Thus, the dataset for normal condition and faulty condition was generated using the DAMADICS library. Also, an extensive exploratory data analysis was performed on this synthetic dataset and presented in [22]. The metadata of the generated dataset after performing exploratory data analysis is presented in Table 2.
Table 2. The metadata of the synthetic dataset showing the non-parametric data distribution and imbalance in class distribution
From Table 2, it can be inferred that the data distribution throughout the dataset is non-normal and that the class distribution is imbalanced. In the case study, to maintain uniformity among the classifier models, class imbalance is treated using a sampling technique, namely the Synthetic Minority Over-sampling Technique (SMOTE) [20], and no special algorithm or optimization is applied to individual algorithms to increase their performance. However, baseline hyperparameter tuning is carried out to achieve optimal performance from all classifier models.
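The idea behind SMOTE can be sketched in a few lines: each synthetic sample is placed at a random position on the segment joining a minority sample and one of its nearest minority neighbours. The sketch below is a simplified illustration under that description, not the implementation used in the study; the function name and the parameters `k` and `seed` are hypothetical.

```python
import math
import random


def smote_sketch(minority, n_new, k=5, seed=0):
    """Create n_new synthetic minority samples by interpolating between a
    randomly chosen minority sample and one of its k nearest minority
    neighbours. Requires at least two minority samples."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # k nearest neighbours of `base` among the other minority samples
        others = [p for p in minority if p is not base]
        neighbours = sorted(others, key=lambda p: math.dist(base, p))[:k]
        neighbour = rng.choice(neighbours)
        gap = rng.random()  # position along the segment base -> neighbour
        synthetic.append([b + gap * (n - b) for b, n in zip(base, neighbour)])
    return synthetic


minority_samples = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]]
new_samples = smote_sketch(minority_samples, n_new=10, k=2)
```

Because every synthetic point is an interpolation between two existing minority samples, the oversampled class stays within the region already occupied by that class.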
Machine learning based classifier models automate the recognition of patterns and anomalies in data with minimal time expenditure and high accuracy. In recent times, machine learning based models have gained interest for detection and prediction [17] tasks due to their high accuracy rates. The advantages of selecting machine learning models over conventional models are highlighted in [23], where the authors statistically support machine learning algorithms over conventional parametric algorithms for a different case study. In this section, the classifier models are built, and the parameters and hyperparameters tuned to obtain optimal performance for each chosen algorithm are presented. Since the algorithms compared in this work are well known, their pseudocode is provided in the Appendix and not included in the article body. However, a short description of each algorithm is provided for better understanding and continuity.
Naïve Bayes (NB) classifier
The Naïve Bayes classifier is a vanilla model of the Bayes theorem, as given in Equation (2).
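For reference, Bayes' theorem for a class label Y and a feature vector X, together with the feature-independence factorisation that makes the classifier "naïve", can be written as follows. The feature index j and the feature count m are notation assumed here for illustration and may differ from that of Equation (2).

```latex
P(Y \mid X) = \frac{P(X \mid Y)\, P(Y)}{P(X)},
\qquad
P(X \mid Y) = \prod_{j=1}^{m} P(x_j \mid Y)
```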
The parameters of the NB algorithm are the prior probability P(Y) and the likelihood P(X|Y). It can be noted that 2 prior values during binary classification and 4 prior values during multiclass classification, along with the likelihood score, are returned by the classifier, as given in Table 4. There are no hyperparameters as such for tuning a Naïve Bayes classifier. Hence, it is simple and swift in performing classification. However, due to its generative property, only the conditional density between X and Y is obtained during classification. To get the overall density function, discriminative methods, namely Perceptron, Logistic Regression and Linear Discriminant Analysis, are explored.
Here, a simple single layer perceptron model as given in Equation (3) is used for classification.
It can be seen that ω is the only parameter that has to be learnt by the algorithm. Hence, the algorithm returns a weight and bias vector of size [c × (m + 1)], as in Table 4. The hyperparameter α (learning rate), given in Table 3, is obtained by searching a grid of values ranging from 0.001 to 1, and the number of epochs is fixed at the value giving the minimum validation error. The accuracy of the Perceptron classifier depends on the feature space being linearly separable, owing to the linear nature of its threshold function. The Logistic Regression classifier is chosen to overcome this limitation.
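A minimal sketch of a single-layer perceptron with a linear threshold is given below to make the learnt quantity concrete: for one binary output it holds m + 1 values, namely the bias and one weight per feature. For reproducibility the sketch starts from zero weights instead of random ones, and the function names and the toy data are illustrative assumptions.

```python
def train_perceptron(X, y, alpha=0.5, epochs=20):
    """Binary perceptron with a linear threshold; labels y are 0 or 1.

    Returns m + 1 values: the bias followed by one weight per feature.
    """
    w = [0.0] * (len(X[0]) + 1)  # w[0] is the bias
    for _ in range(epochs):
        for x, target in zip(X, y):
            activation = w[0] + sum(wi * xi for wi, xi in zip(w[1:], x))
            error = target - (1 if activation >= 0 else 0)
            # Perceptron rule: move the weights only when the sample is misclassified
            w[0] += alpha * error
            for j, xj in enumerate(x):
                w[j + 1] += alpha * error * xj
    return w


def predict(w, x):
    """Apply the linear threshold to a single sample."""
    return 1 if w[0] + sum(wi * xi for wi, xi in zip(w[1:], x)) >= 0 else 0


# Linearly separable toy problem (logical AND)
X_toy = [[0, 0], [0, 1], [1, 0], [1, 1]]
y_toy = [0, 0, 0, 1]
weights = train_perceptron(X_toy, y_toy, alpha=0.5, epochs=20)
```

On linearly separable data such as this toy problem the update rule converges; when the classes are not linearly separable, it does not, which is the limitation noted above.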
Table 3. Parameters and tuned hyper parameters of parametric algorithms
Table 4. Parameters learnt by the parametric algorithms
The logistic regression model used to predict the target y is given in Equation (4).
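In its standard binary form, logistic regression passes a linear combination of the m features through the sigmoid function. The expression below is that standard form; the coefficient symbols β_0, ..., β_m are assumed notation and may differ from the symbols used in Equation (4).

```latex
P(y = 1 \mid x) = \frac{1}{1 + e^{-\left(\beta_0 + \sum_{j=1}^{m} \beta_j x_j\right)}}
```

The intercept β_0 together with the m feature coefficients gives the (m + 1) regression coefficients learnt by the model.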
For ‘m’ features, (m + 1) regression coefficients are learnt by the algorithm. Therefore, in this case study, 8 regression coefficients are learnt by the model, as given in Table 4. The hyperparameter ‘C’, a positive value used for regularization, is tuned to its optimal value (C = 10 for binary, C = 100 for multiclass) with ‘l1’ regularization, along with the learning rate α, to achieve improved performance. The LR classifier is very sensitive to outliers and is comparatively slow. Hence, another discriminative algorithm, LDA, is considered for the case study.
Generalised discriminant analysis is used to find the linear combination of features that separates two or more classes, by finding a linear projection that maximizes the separation between classes while minimizing the within-class scatter. The terms Linear Discriminant Analysis (LDA) and Fisher Discriminant Analysis (FDA) are often used interchangeably to describe this technique. Fisher’s Discriminant Analysis (FDA) [26] coincides with Linear Discriminant Analysis (LDA) when there are only two classes, and LDA is the direct extension of FDA to more than two classes. In practice, FDA is not a standalone classifier, since it is a dimensionality reduction technique. The transformed features obtained from FDA are fed into classifiers to make class predictions. Hence, in this work, LDA is used for classification tasks.
The function F(x) followed by the Linear Discriminant classifier for classification of c classes is given in Equation (5).
The parameters and hyperparameters tuned for the case study are listed in Table 3.
The parameters presented in Table 4 for each machine learning model justify the statement that ‘Parametric algorithms are models with unmodifiable and finite parameters.’
Owing to the constrained working mode of parametric algorithms, more flexible and relaxed models (non-parametric algorithms) with few or no assumptions are considered for classification, as discussed in the ensuing sections.
Neural network based MLPs are powerful and flexible classifiers, yet they are generally considered parametric. However, on introducing non-parametric elements into the MLP model, non-parametric MLPs are obtained. One way to achieve this is through regularisation techniques such as dropout, hybrid dropout [27] and weight decay [4].
In this work, the weight decay approach is adopted. As in the perceptron, the weights and bias values are initialised randomly. Here, an ‘l2’ penalty term is added to the loss function during training. This regularization technique can be seen as a form of non-parametric adjustment, as it promotes simpler weight configurations and reduces the model’s reliance on specific parameter values. The number of hidden layers, a hyperparameter, is fixed at 2. Also, the optimal numbers of epochs are found to be 7 and 28 for the binary and multiclass tasks, respectively. The activation function of the MLP is a radial basis function, and the hyperparameters γ and C are found using the grid search method. Here, γ = 1 for both cases, and C = 10 and 100 for binary and multiclass, respectively. The advantages of using MLP as a non-parametric classifier are its flexibility and its ability to capture non-linear relationships effectively. However, it is computationally complex and has many hyperparameters to tune.
k Nearest Neighbor (kNN) classifier
The kNN algorithm works by finding the ‘k’ nearest neighbors of a data point, computing the distances between data points using local or global distance metrics, and assigning the label by majority vote of the neighbors. Here, the distance metric is the parameter that the algorithm learns. As mentioned in Section 2, the dataset has a total of 266,100 rows; hence, the size of the parameter varies with the size of the samples drawn in the testing phase.
The hyperparameter is the ‘k’ factor. Using the grid search method, k is tuned to 3 for the binary classification task and 5 for the multiclass classification task to obtain optimal model performance. Though kNN is flexible and powerful for classification tasks, it is limited by its speed and memory consumption. This can be overcome by the SVM classifier.
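The voting rule described above can be sketched as follows. The sketch assumes the Euclidean distance (the text does not fix a particular metric), and the function name and toy data are illustrative.

```python
import math
from collections import Counter


def knn_predict(train_X, train_y, query, k=3):
    """Label `query` by majority vote among its k nearest training samples
    (Euclidean distance assumed)."""
    order = sorted(range(len(train_X)), key=lambda i: math.dist(train_X[i], query))
    votes = Counter(train_y[i] for i in order[:k])
    return votes.most_common(1)[0][0]


# Two well-separated toy clusters
train_X = [[0, 0], [0, 1], [1, 0], [5, 5], [5, 6], [6, 5]]
train_y = [0, 0, 0, 1, 1, 1]
label_near_origin = knn_predict(train_X, train_y, [0.2, 0.1], k=3)
label_far = knn_predict(train_X, train_y, [5.5, 5.5], k=3)
```

Every prediction scans the whole stored training set, which illustrates why speed and memory become the limiting factors on a dataset of this size.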
Support Vector Machines (SVM) Classifier
In SVM, an initial hyperplane is declared and the objective is to maximize the minimum distance (margin) between the hyperplane and the data points. Here, the ‘Radial Basis Function (RBF) kernel’ [6] is chosen since the classes in the dataset are not linearly separable. The distance ‘d’ to the hyperplane is calculated for every data point in order to determine this margin.
Here, the hyperparameters were tuned using the grid search method. The value of C is varied from 0.1 to 1000 and γ from 0.0001 to 1. The optimal values were chosen based on the validation error and model accuracy. SVM becomes slow on large datasets, where the hyperplane has to be decided over all samples. Also, multicollinearity between features is handled more easily by decision trees than by SVM.
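The RBF kernel and the search grid can be sketched as below. The kernel is written in its common form exp(-γ‖x - z‖²); the logarithmic spacing of the candidate values is an assumption, since only the end points of the ranges are given in the text.

```python
import math


def rbf_kernel(x, z, gamma):
    """RBF kernel K(x, z) = exp(-gamma * ||x - z||^2)."""
    return math.exp(-gamma * sum((a - b) ** 2 for a, b in zip(x, z)))


# Candidate values spanning the quoted ranges (log spacing is an assumption)
C_grid = [10.0 ** p for p in range(-1, 4)]      # 0.1 ... 1000
gamma_grid = [10.0 ** p for p in range(-4, 1)]  # 0.0001 ... 1
param_grid = [(C, gamma) for C in C_grid for gamma in gamma_grid]

similarity_same = rbf_kernel([0.0, 0.0], [0.0, 0.0], gamma=1.0)
similarity_unit = rbf_kernel([0.0, 0.0], [1.0, 0.0], gamma=1.0)
```

Each (C, γ) pair in `param_grid` would be scored on validation data and the best-performing pair retained; a larger γ makes the kernel more local, so the decision boundary can follow the data more closely.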
Decision Tree (DT) Classifier
In this classifier, the Classification and Regression Tree (CART) model is used to build the decision trees. The Gini index is used to measure the impurity of each node. The tree is split until every leaf contains samples of a single class, i.e., until the Gini index becomes 0.
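The Gini index of a node and the size-weighted impurity of a candidate split can be computed as in the following sketch, which uses the standard definition 1 - Σ p_k²; the function names are illustrative.

```python
from collections import Counter


def gini(labels):
    """Gini impurity 1 - sum_k p_k^2 of a collection of class labels."""
    n = len(labels)
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())


def split_gini(left, right):
    """Size-weighted Gini impurity of a candidate split into two child nodes."""
    n = len(left) + len(right)
    return (len(left) * gini(left) + len(right) * gini(right)) / n


pure_node = gini([0, 0, 0])            # a single class: impurity 0
mixed_node = gini([0, 1])              # two equally likely classes
perfect_split = split_gini([0, 0], [1, 1])
```

A pure node has an impurity of 0, which is the stopping condition described above, and CART chooses at each node the split with the lowest weighted impurity.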
Decision trees are prone to overfitting. Overfitting elimination and the design of the best-fit decision tree model for the DAMADICS dataset were carried out systematically by the same authors in their previous work [21] on the synthetic dataset, where a resampling technique was applied to determine the best-fit model. In this work, the maximum depth of the tree is tuned; the binary decision tree depth is chosen as 5 and the multiclass decision tree depth as 32. The problem of overfitting can be eliminated by using the RF classifier. Also, RF is used to avoid the problem of forcing that occurs in decision trees.
Random Forest (RF) Classifier
In this algorithm, random samples are drawn and each tree is split according to the best split criterion. As in the DT classifier, this splitting varies with changes in the dataset. The procedure is repeated for every tree in the forest, and the aggregated value of all trees gives the predicted outcome.
In the present case study, the number of trees formed is 10 in binary classification and 73 in multiclass problem. The parameters and hyper parameters used in the non-parametric algorithms are listed in Table 5.
Table 5. Parameters and tuned hyper parameters of non-parametric algorithms
The parameter values of the non-parametric algorithms cannot be listed explicitly as in Table 4 because they grow with the size of the data and are therefore too numerous to tabulate.
The performance of parametric and non-parametric algorithms is evaluated using the performance indices namely: Model fitness, accuracy, interpretability, robustness and complexity. The trade-offs in the performance indices for Parametric and Non-Parametric algorithms are shown in Fig. 3.

Fig. 3. Trade-off curve between parametric and non-parametric model performance characteristics.
From the flexibility point of view, it can be inferred from Fig. 3 that, flexibility is directly proportional to accuracy, data complexity, overfitting, and inversely proportional to speed and interpretability for both the models due to the influence of data and parameters.
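Statistically, flexibility implies model fitness. In machine learning, bias-variance decomposition analysis is performed to measure flexibility. The bias-variance decomposition for a training dataset with targets $y_i$ associated with inputs $x_i$, having a true function $f(x)$ estimated by $\hat{f}(x)$, is given in Equation (6):
Bias = $E[\hat{f}(x)] - f(x)$
Variance = $E\big[(\hat{f}(x) - E[\hat{f}(x)])^2\big]$
Mean Square Error = $\mathrm{Bias}^2 + \mathrm{Variance} + \sigma^2$ (6)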
The trade-off between bias and variance should be optimal [25] to design an accurate classifier. A good prediction model keeps both bias and variance low. The results of the bias-variance decomposition method are presented in Table 6. Here, kNN exhibits the minimum bias for binary classification and DT for multiclass classification. Variance is highest for RF in the binary case and SVM in the multiclass case. Such a comparison across models is rudimentary and may mislead. Instead, the bias and variance of the same model are compared to analyze the flexibility of that model.
Table 6. Results obtained through bias-variance decomposition analysis showing flexibility of model
From Table 6 it is observed that the bias of the NB and PER models is greater than their respective variance values. Hence, they act as underfit models for the case study. All other models exhibit some flexibility in their behavior, having lower bias values. However, the deviation between bias and variance in the DT and RF models is nominal and hence, they can be selected as best-fit models for the case study. The Mean Square Error (MSE) is calculated using Equation (6) with σ = 0. The average bias and variance were obtained on 20% of test data for 5 models.
Accuracy is a simple metric to calculate the correctness of a model. It is the proportion of correct predictions over the entire sample set, as given in Equation (7).
Misclassification error can be calculated from accuracy metric as given in Equation (8).
For imbalanced dataset, the balanced accuracy metric as given in Equation (9) is to be considered since it is a weighted average score.
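The three metrics can be sketched as follows. Balanced accuracy is implemented here as the mean of the per-class recalls, which is one common definition; whether Equation (9) uses exactly this weighting is an assumption, and the toy labels are illustrative.

```python
def accuracy(y_true, y_pred):
    """Fraction of samples whose predicted label equals the true label."""
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)


def misclassification_error(y_true, y_pred):
    """Complement of accuracy."""
    return 1.0 - accuracy(y_true, y_pred)


def balanced_accuracy(y_true, y_pred):
    """Mean of the per-class recalls, so that each class counts equally."""
    recalls = []
    for c in sorted(set(y_true)):
        idx = [i for i, t in enumerate(y_true) if t == c]
        recalls.append(sum(y_pred[i] == c for i in idx) / len(idx))
    return sum(recalls) / len(recalls)


# Imbalanced toy case: a classifier that always predicts the majority class
y_true = [0] * 8 + [1] * 2
y_pred = [0] * 10
acc = accuracy(y_true, y_pred)
bal_acc = balanced_accuracy(y_true, y_pred)
```

In this toy case the plain accuracy is 0.8 although the minority class is never detected, whereas the balanced accuracy drops to 0.5, which is why the balanced score is preferred for imbalanced data.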
Table 7. Accuracy and error metrics of P and NP models
From Table 7, it can be observed that, the accuracy of the parametric algorithms for binary classification is satisfactory. However, the accuracy of the multiclass parametric classifiers is poor. High misclassification errors are seen in parametric models. The performance of non-parametric algorithms for both binary and multiclass classification is satisfactory. In this case study, the balanced score is not very significant since, the imbalance in dataset is treated using the minority oversampling method as mentioned in Section 2.
The Area Under the Receiver Operating Characteristic curve (AU-ROC) value summarizes the model efficiency. It is also used to confirm the prediction ability of the model. The ROC curve is plotted with the false positive rate against the true positive rate, as in Fig. 4. The AU-ROC score can range from 0 to 1.
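For a binary problem, the area under the ROC curve equals the probability that a randomly chosen positive sample receives a higher score than a randomly chosen negative one. The sketch below computes the score through this pairwise interpretation; it is an illustration with toy scores, not the plotting routine used for the figures.

```python
def roc_auc(y_true, scores):
    """AUC as the fraction of (positive, negative) pairs in which the positive
    sample has the higher score; ties count one half. Labels are 0 or 1."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


auc_partial = roc_auc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8])  # one pair mis-ordered
auc_perfect = roc_auc([0, 0, 1, 1], [0.1, 0.2, 0.7, 0.9])   # classes fully separated
```

A score of 1 means that every positive sample is ranked above every negative one, while a score of 0.5 corresponds to random ranking.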

Fig. 4. ROC curve showing the performance of parametric and non-parametric classifiers for binary classification problem.
In Fig. 4, it can be seen that the non-parametric algorithms provide a score of 1. This makes them more skillful classifiers than the parametric algorithms. The ROC score of the parametric algorithms is less than 1, but not poor. Hence, in this case study, both parametric and non-parametric algorithms show decent fitness for the binary problem. The performance of the parametric classifiers for the multiclass classification problem is shown in Fig. 5.

Fig. 5. ROC curves of parametric classifiers showing class-wise performance for multiclass classification problem (a) Naive Bayes classifier (b) Logistic regression classifier (c) Linear discriminant analysis classifier (d) Perceptron classifier.
From Fig. 5, it can be seen that the ROC score of parametric algorithms is in the range of 0.5 and below. The Parametric algorithms provide poor prediction ability for all classes ranging from class 0 to class 4. Figure 6 depicts the performance of the non-parametric algorithm for multiclass problem.

Fig. 6. ROC curves of non-parametric classifiers showing class-wise performance for multiclass classification problem (a) Multi-Layer Perceptron classifier (b) k Nearest neighbor classifier (c) Support vector classifier (d) Decision tree classifier (e) Random Forest classifier.
From Fig. 6, it can be seen that the ROC score is close to 1 for all classes. It is noted that the kNN and decision tree classifiers are more skilled than the support vector machine and random forest classifiers, since the number of features is smaller than the number of training samples and because of the automatic feature interaction of decision trees. Nevertheless, the other algorithms have also shown reasonable performance. Thus, classification is accomplished better by the non-parametric algorithms than by the parametric algorithms.
According to [12], interpretability is defined as “the ability to explain the model in an understandable form”. Here, Triptych analysis [32] is employed, where the absolute score of interpretability (I) is given by Equation (10).
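Taking the absolute score as the sum of the three Triptych components, predictivity (P_r), stability (S_t) and simplicity (S_i), the relation can be written as below. The additive form follows the description of the analysis in this section; the subscripted symbols are the notation assumed here.

```latex
I = P_r + S_t + S_i
```

Since each component lies between 0 and 1, the absolute interpretability score lies between 0 and 3.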
All the metrics range from 0 to 1. A model with high P_r, S_t and S_i scores is declared to be an interpretable model.
On applying the Triptych analysis, the absolute value of interpretability is obtained as a summation of predictivity, stability and simplicity scores of each algorithm and it is listed in Table 8. Here, the predictivity value is the accuracy score of the model and the F1 score is the stability score.
Table 8. Results from the Triptych analysis showing absolute values of interpretability scores of models
In Table 8, as discussed earlier, the predictivity which is given by the accuracy metric, is nominal for both algorithms when performing binary classification and reduced for parametric algorithms when performing multiclass classification. Also, the stability of non-parametric models is consistently satisfactory for both the cases. The simplicity score is naively chosen from a scale of 0-1 depending on the understanding ability of the humans about the algorithm structure. However, the absolute value of ‘I’ is calculated using the Equation (10) and the interpretability scores are plotted in Fig. 7.

Interpretability score vs. models.
In Fig. 7, there is a trade-off in the interpretability score between the parametric and non-parametric algorithms. The parametric algorithms, which show a good interpretability score for binary classification, fail in the multiclass case owing to the increase in data complexity.
Robustness describes how well an algorithm performs a given task in the presence of noise, with minimal or no deviation from the accuracy of the model at zero noise. In machine learning models, noise can be induced in the generated data or in the labels [5]. Let ‘ɛ’ be the label noise induced in the dataset D, having X inputs and Y targets such that (X, Y, ɛ)< (X, Y), then noise robustness
The robustness of the machine learning classification models is analyzed by inducing 5-20% label noise: about η% of the negative labels are replaced with positive labels before classification is performed. The accuracy metric shows that the built models are susceptible to this noise. The robustness of the models against 5-20% induced label noise is listed in Table 9, where the intermodel and intramodel stability of the parametric and non-parametric models is analyzed using accuracy as the metric.
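The label-noise procedure described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function name `flip_negative_labels`, the encoding of negative as 0 and positive as 1, and the fixed random seed are all assumptions.

```python
import random


def flip_negative_labels(labels, noise_rate, negative=0, positive=1, seed=0):
    """Return a copy of labels with a fraction noise_rate of the negative
    labels replaced by the positive label (assumed encoding: 0 and 1)."""
    rng = random.Random(seed)
    negative_idx = [i for i, y in enumerate(labels) if y == negative]
    n_flip = round(noise_rate * len(negative_idx))
    flipped = set(rng.sample(negative_idx, n_flip))
    return [positive if i in flipped else y for i, y in enumerate(labels)]


# Placeholder labels for illustration: 100 negative and 20 positive instances.
clean_labels = [0] * 100 + [1] * 20
noisy_by_rate = {rate: flip_negative_labels(clean_labels, rate)
                 for rate in (0.05, 0.10, 0.15, 0.20)}
```

Each noisy label set would then be used to retrain the classifiers with unchanged hyperparameters, and the resulting accuracy compared against the zero-noise accuracy.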
The accuracy of machine learning models against induced label noise of 5-20% showing robustness of models
From the results, it can be inferred that the random forest algorithm showed noticeable robustness compared to the other models. The sensitivity of machine learning models to noise can be reduced by tuning the hyperparameters of the model. However, to explain the trade-offs between the algorithms, the optimal hyperparameter values obtained during model building are maintained throughout the robustness analysis.
Complexity on the whole denotes the data, time and space complexity of the model. The analysis applied is asymptotic analysis. Time complexity is the rate at which the model’s run time increases with growth in input data size. Space complexity is the rate at which the memory consumed by the model increases with growth in input data size. Here, for the asymptotic analysis, the Big O notation [30] is used to define the efficiency of the models.
The time and space complexity values of parametric and non-parametric algorithms are tabulated in Table 10, with reference to their general Big O notation for the corresponding model.
Time and space complexity values of each model calculated w.r.t Big O notation
From Table 10, it can be seen that the simpler parametric models, namely the perceptron and logistic regression classifiers, have lower time and space complexity. However, this reduction comes at a high cost in model accuracy. For the data complexity, the growth in training time and training memory for the kNN classifier is plotted in Fig. 8.

Training time and training space complexity of kNN for binary classification.
In Fig. 8, the increase in execution time and memory consumed by the algorithm for varying sizes of the training dataset is observed. For a small input size, i.e., when only 20% of the training dataset is given (n = 53,220 instances), with k = 3 for binary classification and m = 9 features, the training time complexity of kNN is O(knm) and the space complexity is O(nm). A linear increase in time and memory usage is observed for kNN. The parameters and hyperparameters remain the same throughout the binary and multiclass classification.
For the purpose of performance comparison, the models were run on different platforms. The execution times of each algorithm on the CPU and GPU platforms are shown in Fig. 9.
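As a worked illustration of the linear growth, the cost model O(knm) for time and O(nm) for space can be evaluated for growing training sizes. This sketch only evaluates the asymptotic expressions with the values quoted above (n = 53,220 at 20%, k = 3, m = 9); the helper name `knn_cost` is an assumption, and the counts are abstract operation and storage units, not measured seconds or bytes.

```python
def knn_cost(n, m, k):
    """Return (time_units, space_units) under the O(knm) and O(nm) cost models."""
    return k * n * m, n * m


N_AT_20_PERCENT = 53_220  # instances in 20% of the training set
M_FEATURES = 9
K_NEIGHBORS = 3

# Training fractions 20%, 40%, ..., 100% correspond to multiples 1..5.
growth = [(20 * multiple,
           *knn_cost(N_AT_20_PERCENT * multiple, M_FEATURES, K_NEIGHBORS))
          for multiple in range(1, 6)]

for percent, time_units, space_units in growth:
    print(f"{percent:3d}% of data: time ~ {time_units:,} units, "
          f"space ~ {space_units:,} units")
```

At 20% of the data this gives 1,436,940 time units and 478,980 space units; doubling n doubles both, which matches the linear trend reported for Fig. 8.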

Performance comparison of execution time in CPU and GPU platforms.
The CPU performance was evaluated by running the models in a Python Jupyter notebook on a 64-bit Windows OS with an Intel Core i5 processor (∼2.4 GHz). For the GPU, the execution time of the models in Google Colab was used. Consistent with the time and space complexity discussed above, the parametric algorithms required less execution time. Moreover, the models ran faster on the GPU than on the CPU for this dataset.
DAMADICS, a real-time industrial dataset [11] of the pneumatic actuator in a Polish sugar refinery, used for water level control of the steam boiler in the evaporation stage (stage III) of the plant, is used for validation against the synthetic model. The dataset has about 2,159,974 data points with 7 attributes. The working range of the variables remains the same as given in Section 2. The faults recorded include: f16 - positioner supply pressure fault, f18 - bypass valve fault and f19 - flowrate sensor fault [16]. The faults recorded in the real-time actuator dataset are shown in Fig. 10. The normal mode of the actuator is shown in Fig. 10a. In Fig. 10b, the positioner pressure drop fault (f16) occurs from 57475s (pressure drops) to 57530s (pressure ok) and again from 57675s (pressure drops) to 57800s (pressure ok). During this transition, the rod displacement (X) drops to zero and the flow (F) is disturbed. The bypass valve partially or fully open fault (f18) occurs from 57340s (valve opens) to 57890s (valve closes), as shown in Fig. 10c. When the bypass valve (V3 in Fig. 1) is partially opened, the medium flow (F) drops to zero. In Fig. 10d, from 58150s (sensor fault ON) to 58325s (sensor fault OFF), the flow rate sensor fails to read the flow data and shows a zero reading, although the other process variables behave normally. The specifications of the faults in the industrial dataset are listed in Table 11.
List of Parameters of faults in the industrial data

Representation of the industrial DAMADICS dataset showing the effect of physical variables during (a) normal mode (b) f16 - positioner supply pressure fault (c) f18 - bypass valve fault (d) f19 - flowrate sensor fault.
The parameters and hyperparameters of the parametric and non-parametric machine learning models are listed in Table 12.
Parameters and hyperparameters of the Machine Learning models used
Initially, the industrial dataset is pre-processed: the values are normalized and outliers are removed using the sigma method. On this non-parametric dataset, both binary classification (normal, faulty) and multiclass classification are performed using parametric and non-parametric algorithms. The performance of the algorithms on the dataset is studied and tabulated for different metrics as shown in Table 13.
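The pre-processing step can be sketched for a single attribute as below. The text does not state which normalization or which sigma threshold was used, so min-max scaling and a 3-sigma cut-off are assumptions here, as are the function names.

```python
from statistics import mean, pstdev


def min_max_normalize(values):
    """Scale values linearly to [0, 1] (assumed normalization scheme)."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


def remove_sigma_outliers(values, n_sigma=3.0):
    """Keep values within n_sigma standard deviations of the mean
    (n_sigma=3 is an assumed threshold)."""
    mu, sigma = mean(values), pstdev(values)
    if sigma == 0:
        return list(values)
    return [v for v in values if abs(v - mu) <= n_sigma * sigma]


# Placeholder readings for illustration: 20 nominal values and one spike.
readings = [10.0] * 20 + [1000.0]
cleaned = remove_sigma_outliers(min_max_normalize(readings))
```

For a multi-attribute dataset, the same filter would be applied per attribute and any row flagged in at least one attribute would be dropped together with its label.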
Evaluation metrics showing the trade-offs between the performance of parametric and non-parametric classifiers in the benchmark industrial dataset
*Significant values.
In view of the exhaustive information provided in Section 3 on the analysis, the summary of the trade-off analysis for the industrial dataset is drawn directly from Table 13.
Comparing the bias, kNN and DT have low bias for binary and multiclass classification, respectively; across both cases, however, DT performed with the minimum bias. kNN showed greater variance. DT shows a nominal bias/variance trade-off and is the best-fit model. Here, RF and SVM show overfitting (SVM >> RF) and high MSE values. The average bias and average variance are taken over 10 models.
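One way to obtain an average bias and an average variance from several trained models is the 0-1 loss decomposition, sketched below. The text does not state which decomposition was used, so this choice is an assumption, as is the function name `bias_variance_01`. Each inner list holds the predictions of one model (for example, one of 10 models trained on resampled data) for the same test instances.

```python
from collections import Counter


def bias_variance_01(predictions, y_true):
    """Average bias and variance under 0-1 loss.

    predictions: list of per-model prediction lists, each as long as y_true.
    Bias is 1 when the majority ("main") prediction is wrong; variance is the
    fraction of models that disagree with the main prediction.
    """
    n_models = len(predictions)
    biases, variances = [], []
    for j, target in enumerate(y_true):
        votes = [model_pred[j] for model_pred in predictions]
        main_prediction = Counter(votes).most_common(1)[0][0]
        biases.append(1.0 if main_prediction != target else 0.0)
        variances.append(sum(v != main_prediction for v in votes) / n_models)
    return sum(biases) / len(y_true), sum(variances) / len(y_true)


# Placeholder predictions from three hypothetical models on three instances.
avg_bias, avg_variance = bias_variance_01(
    predictions=[[0, 1, 1], [0, 1, 0], [0, 0, 0]],
    y_true=[0, 1, 1],
)
```

A model with low average bias but high average variance, as reported for kNN, would show correct main predictions with frequent disagreement among the individual models.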
In terms of accuracy, LDA has a high accuracy score in binary classification. However, the parametric algorithms did not work well for multiclass classification. Consistent performance is shown by the non-parametric algorithms except MLP, which failed to achieve optimal performance for the industrial dataset. DT and RF are remarkable for both classification tasks.
Undoubtedly, due to increased accuracy, the predictivity and stability of the non-parametric algorithms are better; their simplicity score, however, is not. The absolute value of interpretability is high for DT in binary and RF in multiclass classification. Surprisingly, the performance of LDA is the same as that of the non-parametric algorithms in this analysis.
RF proved to be more robust, with a minimum accuracy deviation of about 2.6% from 0 to 20% noise levels in binary classification and only 0.5% in multiclass classification. As expected, the space and time complexity strongly favored the parametric algorithms over the non-parametric ones, in particular LR.
The ranking of each algorithm based on the best performance indices is illustrated as a colormap in Fig. 11.

Colormap showing the ranks of algorithms for different metrics while performing classification on (a) binary (b) multiclass.
From Fig. 11, for the binary case, the non-parametric DT algorithm is ranked 1 among all models owing to its good accuracy, minimum MCE values and high robustness. kNN also showed minimum MSE. Among the parametric algorithms, the LDA classifier ranked first for interpretability and the LR classifier for complexity. Similarly, it can be seen that the last rank is obtained by the parametric models. Though the parametric algorithms satisfy characteristics like flexibility, simplicity, and space and time complexity, the trade-offs with the non-parametric algorithms on other important characteristics are an important consideration for model selection. Notable trade-offs are also witnessed among the non-parametric models themselves. Thus, the results obtained from the classifiers of the industrial dataset corroborate the better performance of non-parametric models. They also show that the ensemble/tree models outperform the kNN and SVM models for both binary and multiclass classification tasks.
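The per-metric ranking behind such a colormap can be sketched as a simple sort. The helper `rank_models` and the scores below are hypothetical placeholders, not values from Table 13; how ties were handled in the original ranking is not stated, so this sketch simply keeps input order for equal scores.

```python
def rank_models(scores, higher_is_better=True):
    """Map each model name to its rank (1 = best) for a single metric."""
    ordered = sorted(scores, key=scores.get, reverse=higher_is_better)
    return {model: rank for rank, model in enumerate(ordered, start=1)}


# Placeholder scores for illustration only.
accuracy_ranks = rank_models({"DT": 0.95, "kNN": 0.90, "LR": 0.80})
error_ranks = rank_models({"DT": 0.05, "kNN": 0.04, "LR": 0.20},
                          higher_is_better=False)
```

Metrics where larger is better (accuracy, interpretability) use the default ordering, while error and complexity measures are ranked with `higher_is_better=False`.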
This paper presented a systematic framework to analyze the trade-offs between parametric and non-parametric algorithms when applied to an imbalanced pneumatic actuator dataset that follows a non-normal data distribution. Initially, the structure of the DAMADICS actuator dataset was thoroughly analyzed and found to have a non-normal data distribution and an imbalanced distribution in both binary and multiclass targets. Classifier models were built using different parametric and non-parametric algorithms for binary (normal, faulty) classification and multiclass classification (i.e., normal, clogging fault, sedimentation fault and erosion fault in the synthetic dataset; normal, positioner supply pressure drop fault, bypass valve fault and flowrate sensor fault in the industrial benchmark dataset). In comparing the parametric and non-parametric models, the advantages of the parametric models in interpretability and speed are outweighed by the accuracy, robustness and flexibility of the non-parametric algorithms. Within the non-parametric classifiers, the k nearest neighbor and decision tree algorithms performed well and were consistent in their outputs for both binary and multiclass problems on the synthetic dataset, whereas the random forest classifier appeared to work ideally owing to overfitting of the model. However, for the industrial dataset, the random forest classifier was found to outperform the others and satisfy most of the required statistical characteristics. Since this work establishes the suitability of non-parametric classifiers for the DAMADICS dataset, extending the best-fit model to the identification and prediction of faults of the pneumatic actuator with rare category analysis is an important direction for future research.
Acknowledgment
This work is supported by the Centre for Research, Anna University, Chennai, India by providing financial assistance through the award of Anna Centenary Research Fellowship (ACRF).
Appendix
The pseudocode for the parametric and non-parametric algorithms discussed in Section 2 is given here.
