Abstract
Traditionally, performance measures such as accuracy, recall, precision, specificity, and negative predictive value (NPV) have been used to evaluate a classification model’s performance. However, these measures often fall short of capturing different classification scenarios, such as binary or multi-class, balanced or imbalanced, and noisy or noiseless data. Therefore, there is a need for a robust evaluation metric that can assist business decision-makers in selecting the most suitable model for a given scenario. Recently, a general performance score (GPS) comprising different combinations of traditional performance measures (TPMs) was proposed. However, it indiscriminately assigns equal importance to each measure, often leading to inconsistencies. To overcome the shortcomings of GPS, we introduce an enhanced metric called the Weighted General Performance Score (W-GPS) that considers each measure’s coefficient of variation (CV) and assigns a weight to that measure based on its CV value. Using consistency as the criterion, we found that W-GPS outperformed GPS in the above-mentioned classification scenarios. Further, when W-GPS was computed with different weighted combinations of TPMs, no single combination emerged as the best for every scenario. Thus, W-GPS offers the user the flexibility to choose the most suitable combination for a given scenario.
Introduction
Given a dataset
The performance of the classifier is assessed in terms of traditional performance measures such as accuracy, recall (sensitivity), precision, specificity, and F1-score [4, 5, 6, 7, 8]. These performance measures are computed using the confusion matrix [9]. Table 1 depicts the confusion matrix of a binary classifier, where the rows represent the predicted classes and the columns represent the actual classes.
In Table 1, the diagonal elements of the confusion matrix, true positive (TP) and true negative (TN), correspond to the correct predictions made by the classifier. On the other hand, the off-diagonal elements false positive (FP) and false negative (FN) represent incorrect predictions. If a test instance’s true class is negative and the classifier predicts it as negative, it contributes to TN; if the model predicts it as positive, it is counted as FP, commonly known as a Type I error [10]. Similarly, if a test instance’s true class is positive and the classifier predicts it as positive, it contributes to TP; if the model predicts it as negative, it is counted as FN, commonly known as a Type II error [10].
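The four confusion-matrix cells described above can be counted directly from paired actual and predicted labels. The following is a minimal sketch; the function name `confusion_counts` and the toy labels are illustrative assumptions, not part of the original study.

```python
from typing import Sequence, Tuple


def confusion_counts(
    y_true: Sequence[int], y_pred: Sequence[int], positive: int = 1
) -> Tuple[int, int, int, int]:
    """Return (TP, FP, FN, TN) for a binary problem."""
    tp = fp = fn = tn = 0
    for actual, predicted in zip(y_true, y_pred):
        if predicted == positive and actual == positive:
            tp += 1  # correct positive prediction
        elif predicted == positive:
            fp += 1  # Type I error: negative instance predicted as positive
        elif actual == positive:
            fn += 1  # Type II error: positive instance predicted as negative
        else:
            tn += 1  # correct negative prediction
    return tp, fp, fn, tn


# Hypothetical labels, used only to exercise the function.
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 1, 0, 1, 0]
tp, fp, fn, tn = confusion_counts(y_true, y_pred)
print(tp, fp, fn, tn)
```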
In multi-class classification scenarios, TP, TN, FP, and FN are computed for each class, and the traditional performance measures are then computed using the micro and macro approaches. The micro approach considers all classes and calculates a single measure by aggregating true positives, true negatives, false positives, and false negatives across all classes [11, 12]. In the macro approach, the measures are computed independently for each class, and then the average is taken. The strategy used for multi-class classification is one vs. the rest (or one vs. all). This strategy trains a separate binary classifier for each class, treating that class as the positive class and the remaining classes as the negative class [13, 14]. Sections 3 and 4.1 present the mathematical formulations of the performance measures utilized in binary and multi-class classification problems.
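The difference between the micro and macro approaches can be made concrete with precision as the example measure. This is a sketch under stated assumptions: the helper names, the zero-denominator convention (returning 0.0), and the toy labels are all illustrative and do not come from the original study.

```python
from typing import Dict, Hashable, Sequence


def one_vs_rest_counts(
    y_true: Sequence[Hashable], y_pred: Sequence[Hashable], labels: Sequence[Hashable]
) -> Dict[Hashable, Dict[str, int]]:
    """Per-class TP/FP/FN/TN, treating each class in turn as the positive class."""
    counts = {}
    for c in labels:
        tp = sum(1 for a, p in zip(y_true, y_pred) if a == c and p == c)
        fp = sum(1 for a, p in zip(y_true, y_pred) if a != c and p == c)
        fn = sum(1 for a, p in zip(y_true, y_pred) if a == c and p != c)
        tn = len(y_true) - tp - fp - fn
        counts[c] = {"tp": tp, "fp": fp, "fn": fn, "tn": tn}
    return counts


def macro_precision(counts: Dict[Hashable, Dict[str, int]]) -> float:
    """Compute precision per class, then average the per-class values."""
    per_class = [
        c["tp"] / (c["tp"] + c["fp"]) if (c["tp"] + c["fp"]) else 0.0
        for c in counts.values()
    ]
    return sum(per_class) / len(per_class)


def micro_precision(counts: Dict[Hashable, Dict[str, int]]) -> float:
    """Aggregate TP and FP over all classes, then compute a single precision."""
    tp = sum(c["tp"] for c in counts.values())
    fp = sum(c["fp"] for c in counts.values())
    return tp / (tp + fp) if (tp + fp) else 0.0


# Hypothetical three-class labels, for illustration only.
y_true = [0, 0, 0, 0, 1, 1, 2, 2]
y_pred = [0, 0, 0, 1, 1, 2, 2, 2]
counts = one_vs_rest_counts(y_true, y_pred, labels=[0, 1, 2])
macro = macro_precision(counts)
micro = micro_precision(counts)
print(round(macro, 4), round(micro, 4))
```

Because the macro average gives every class the same weight regardless of its size, it reacts more strongly to poor performance on small classes than the micro average does.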
Numerous studies have been conducted to assess the performance of classifiers by utilising diverse measures [15, 16, 17]. Each measure captures distinct aspects of classification performance. For instance, precision evaluates the classifier’s ability to identify positive instances correctly, recall measures the classifier’s ability to capture all positive instances, and the F1-score balances the trade-off between precision and recall. Indeed, it is crucial to consider specific requirements when choosing the appropriate measure, as no single measure can consider all aspects of classification [4].
Confusion matrix
In this work, a recently proposed General Performance Score (GPS) is evaluated in different classification scenarios, such as binary or multi-class, balanced or imbalanced, and noisy or noiseless data [18]. GPS indiscriminately assigns equal importance to each performance measure, often leading to inconsistencies. To overcome the shortcomings of GPS, we propose an enhanced metric called the Weighted General Performance Score (W-GPS) that considers each measure’s coefficient of variation (CV) and assigns a weight to that measure based on its CV value. A measure with a higher CV varies more across multiple data subsets and therefore receives a lower weight. Conversely, a measure with a lower CV receives a higher weight. W-GPS computes the weighted harmonic mean of the combined performance measures, taking into account their variability across multiple data subsets obtained using the holdout method [19, 20, 21]. To assess the consistency of W-GPS, an extensive empirical study was conducted involving thirteen classifiers, namely, Logistic Regression [22], Support Vector Machines [23], Linear Discriminant Analysis [24], Quadratic Models [25], probability-based Naive Bayes [26], instance-based K-nearest Neighbors [27], tree-based Decision Trees [28], ensemble-based classifiers such as Random Forest [29], Extra Tree [30], Gradient Boosting [31], Light Gradient Boosting [32], and Extreme Gradient Boosting [33], and the neural-network-based Multilayer Perceptron [34]. Considering consistency as a criterion, W-GPS outperformed GPS in all the above-mentioned classification scenarios.
The paper is organized as follows: Section 2 explores related work; Section 3 discusses the existing measures; and Section 4 defines the proposed methodology used in the work. The results of the experiments are detailed in Section 5. Section 6 discusses the implications of the results. Finally, Section 7 includes the conclusions and future scope.
Performance measures are essential in evaluating the effectiveness of classifiers in machine learning [35]. They provide quantitative assessments of a classifier’s predictive performance. These measures also assist in comparing different classifiers, optimizing their parameters, and selecting the most suitable model for a specific problem. The performance measures for a machine learning model are of great importance for various reasons stated below:
Model Assessment: Performance measures offer a quantitative means to measure the performance of a machine learning model.
Model Selection: The use of appropriate performance measures may assist in selecting suitable models from the possible set of models. Model selection is the process of choosing the best machine-learning model from a group of possible machine-learning models for a training dataset.
Optimization: Once a model is selected, these measures guide in fine-tuning the model parameters to optimize its performance.
Assessing generalizability: When applied to a distinct test set, performance measures help determine how well the model performs on unseen data.
Each performance measure captures a different aspect of a model’s performance, and different measures may yield conflicting outcomes. Selecting performance measures is therefore complex for decision-makers, as it entails understanding the contextual nuances of the problem, considering the costs associated with various types of errors (false positives versus false negatives), analyzing the characteristics of the data (balanced versus imbalanced), and accounting for the requirements of the business or application. These factors are elucidated as follows:
Understand the Business Impact: Different types of errors often have different business impacts. For example, in a medical diagnosis setting, a false negative (a sick person being declared healthy) could have much more serious consequences than a false positive (a healthy person being declared sick). In this case, a measure like recall (which focuses on minimizing false negatives) might be more important [1, 36].
Data Distribution: If the class-wise distribution in the dataset is heavily imbalanced, accuracy might not be the best measure, as it can be driven by the majority class [37]. Several studies have encountered issues with performance measures when dealing with imbalanced data [38, 39].
Balance trade-offs: There is often a trade-off between precision and recall [40]. When maximizing the inclusion of positive instances is crucial, prioritize recall. On the other hand, if it is more important to be sure about the instances the model classifies as positive, then precision should be prioritized. To strike a balance between precision and recall, consider using the F1 score.
Multiple Measures: No single standard measure captures all aspects of a model’s performance. Therefore, a set of measures must be considered to provide a more comprehensive evaluation of the model.
Several research studies have been carried out to assess the effectiveness of different performance measures for evaluating ML classifiers. These studies aim to understand the strengths and limitations of different measures and their significance in various contexts [15, 41, 42, 43, 44]. Some of the recent studies are listed as follows: Huang et al. [45] conducted a comparison of machine learning algorithms, specifically Naive Bayes, Decision Trees, and Support Vector Machines (SVM), focusing on their predictive accuracy and the area under the ROC curve (AUC). Through experimentation with diverse datasets, they found that Naive Bayes, C4.4, and SVM exhibited similar levels of predictive accuracy. However, in terms of AUC, Naive Bayes, C4.4, and SVM outperformed C4.5. Huang et al. [46] used the same set of classifiers as used by Huang et al. [45] to compare the accuracy and AUC (Area Under the Curve) measures. The authors compared the consistency of accuracy and AUC, and their experimental study found AUC to be effective for measuring and comparing classifiers. Further, they showed that Naive Bayes outperforms pruned decision trees significantly when evaluated based on the area under the receiver operating characteristic curve (AU-ROC). They found that multiple evaluation measures evaluate different aspects of model performance. Sokolova et al. [47] explored different evaluation measures for ML classifiers, challenging the sufficiency of commonly used measures like accuracy, F1-score, and AUC. They also emphasized the drawbacks associated with these measures, particularly when dealing with situations where multiple classes hold equal significance. Additionally, they computed alternative measures such as Youden’s index, likelihood, and discriminant power. These alternative measures are widely employed in medical diagnosis and are shown to be interconnected. 
Their study further explored the applications of these measures to a case study centred around the classification of electronic negotiations, which occurred in diverse domains such as labour and business.
Several studies have extensively compared performance measures for classification tasks. For instance, Zhou et al. [48] focused on the correlation analysis of performance measures for classifiers and categorized the commonly used measures into threshold, rank, and probability measures. Luque et al. [49] examined classifiers’ performance measures and their symmetries, focusing on binary classifiers by exposing cross-symmetries and statistical symmetries among the measures. This study showed three types of symmetries between the measures: labelling inversion, scoring inversion, and the combination of these two inversions. In another study, Ferri et al. [50] identified clusters and relationships among the behaviour of 18 performance measures. Through experimentation, they comprehensively analyzed the relationships between measures, which can be useful for choosing the most adequate measures for a specific task. In conclusion, these studies enhanced practitioners’ understanding of measure relationships and demonstrated varying degrees of correlation among different sets of measures. These findings underscore the importance of understanding measure symmetries for effective metric selection and classification process enhancement.
In addition to exploring the relationship between the individual performance measures, the following studies proposed combining these independent measures into a unified measure. Nandi et al. [51] presented aggregated performance measures using statistical aggregation, such as harmonic mean (HM), geometric mean (GM), and arithmetic mean (AM), and two other objective functions, resulting in two novel performance measures: the distance from the origin (DO) and the distance from the ideal position (DIP). They also discussed the need for a robust assessment of algorithms and the challenges of comparing multiple measures. They observed that HM performs best at the lower performance end, while DIP excels at the higher performance end. Furthermore, their analysis provides valuable insights into algorithm properties and rankings based on these measures. In another work, Redondo et al. [52] focused on evaluating the performance of the classification model by proposing an alternative measure, the Unified Performance Measure (UPM), which combines multiple measures, such as recall, precision, specificity, and negative predictive value, using the harmonic mean. Comparative evaluations on simulated and real datasets demonstrated the superiority of UPM over other performance measures. Similarly, Diego et al. [18] proposed the General Performance Score (GPS) as a novel measure to combine performance measures for binary and multi-class classification problems. The GPS was defined using the harmonic mean to combine multiple performance measures. Their approach penalized low-value measures and was less sensitive to extreme values. These approaches provide enhanced tools for comprehensive performance evaluation in classification tasks.
At the same time, some recent studies emphasized the significance of incorporating weighted combinations and specialized measures to evaluate classifier performance [53, 54]. By assigning weights based on dataset characteristics and exploring alternative measures, more accurate and informative assessments of classifier performance can be obtained, and enhanced weighted performance measures improve the integrity of individual performance measurements [53]. These approaches contribute to a better understanding of evaluation challenges in real-world scenarios and facilitate the development of more robust performance measures. For instance, Jadhav et al. [54] explored various performance measures, such as sensitivity, specificity, and accuracy, to assess ML classifiers. To address the limitations of existing measures, particularly in the context of imbalanced datasets, they proposed a weighted measure called TPR-TNR that considers dataset imbalance and the costs associated with misclassification.
Nevertheless, developing a comprehensive weighted performance function that incorporates the measures relevant to a specific context remains important. Devising a universal score is a difficult endeavour, and it may oversimplify the evaluation of a model’s performance. Moreover, determining the appropriate weights for each measure is often context-dependent. Taking cues from this, our study introduces a novel approach called W-GPS, which incorporates weight parameters to assign relative importance to different measures when combining them into a single score. These weights are determined from each performance measure’s CV. This approach assumes that measures with a lower CV are more stable and should therefore contribute more to the overall performance measure.
In this work, various performance measures, including TPM, GPS, and W-GPS, have been investigated for the binary and multi-class classification tasks. These measures are examined across different data characteristics, such as the number of target variable classes (binary or multi-class), the distribution of target variable classes (balanced or unbalanced), and the quality of the data (noisy or clean). To determine the most appropriate performance measure for evaluating classifier performance, a wide variety of ML classifiers, such as decision trees, naive Bayes, k-nearest neighbours, logistic regression, extra trees, and others, are evaluated. The findings of this research contribute to the advancement of evaluation practices in classification tasks, enabling more informed choices in measure selection and encouraging the adoption of standardized evaluation methodologies.
This section discusses several existing performance measures, such as TPMs and GPS [18] to evaluate the classifier’s performance. These measures are determined through a
Traditional performance measures (TPMs)
TPMs such as accuracy, precision, recall, and F1-score capture different aspects of classifier performance. A confusion matrix is a fundamental tool for evaluating a classifier by comparing actual and predicted labels. It comprises essential elements such as TP, TN, FP, and FN, which are crucial in defining performance measures.
In binary classification scenarios with two classes 1 (positive) and 0 (negative), performance measures such as accuracy, precision, recall, specificity, negative predictive value (NPV), and F1 score are determined from the
Accuracy: Accuracy is a common measure for evaluating a classifier. It is the ratio of correct predictions to total predictions. High accuracy indicates successful classification. However, for imbalanced datasets, where one class is more frequent, accuracy can be dominated by the majority class and may give little insight into classifier performance.
Precision: Precision is the number of true positive predictions divided by the total number of positive predictions made by the model.
Recall: Recall measures a classification model’s ability to find all the positive observations in the dataset. It is the ratio of correctly classified positive instances to the total number of positive instances.
Specificity: Specificity is the number of true negative predictions divided by the total number of negative observations. It measures the ability of the model to find all the negative observations in the dataset.
Negative Predictive Value (NPV): NPV is the number of true negative predictions divided by the total number of negative predictions made by the model.
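The five verbal definitions above translate directly into ratios of confusion-matrix counts. The sketch below follows those definitions; the function name, the convention of returning 0.0 when a denominator is zero, and the example counts are assumptions made for illustration.

```python
from typing import Dict


def traditional_measures(tp: int, fp: int, fn: int, tn: int) -> Dict[str, float]:
    """Compute the binary TPMs from confusion-matrix counts."""

    def ratio(numerator: int, denominator: int) -> float:
        # Assumed convention: an undefined ratio is reported as 0.0.
        return numerator / denominator if denominator else 0.0

    return {
        "accuracy": ratio(tp + tn, tp + fp + fn + tn),  # correct / all predictions
        "precision": ratio(tp, tp + fp),                # TP / predicted positives
        "recall": ratio(tp, tp + fn),                   # TP / actual positives
        "specificity": ratio(tn, tn + fp),              # TN / actual negatives
        "npv": ratio(tn, tn + fn),                      # TN / predicted negatives
    }


# Hypothetical counts, for illustration only.
measures = traditional_measures(tp=40, fp=10, fn=20, tn=30)
for name, value in measures.items():
    print(f"{name}: {value:.3f}")
```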
For multi-class classification problems with
where
The preceding section introduced several performance measures for assessing classification tasks. Each measure often emphasizes distinct facets of performance evaluation. In response to this diversity, Diego et al. [18] defined new performance measures by combining a set of measures into a unified measure employing the Harmonic Mean (HM) and imposing penalties for low metric values.
Let
Based on Eq. (7), we considered different combinations, such as GPS (Recall, Precision, Specificity, and NPV), GPS (Recall, Precision), GPS (Recall, Specificity), GPS (Precision, NPV), and GPS (Specificity, NPV), in the binary classification scenario. UPM is equivalent to GPS when the Recall, Precision, Specificity, and NPV measures are combined.
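The binary GPS combinations listed above can be sketched in code. Since Eq. (7) is not reproduced here, the sketch assumes that GPS for a chosen combination reduces to the plain harmonic mean of the selected measures, as the text describes; the function name `gps`, the zero-handling rule, and the measure values are illustrative assumptions.

```python
from typing import Dict, Sequence


def gps(measures: Dict[str, float], names: Sequence[str]) -> float:
    """Harmonic mean of the selected measures (assumed form of GPS)."""
    values = [measures[name] for name in names]
    if any(v == 0 for v in values):
        return 0.0  # assumed convention: a zero measure drives the score to zero
    return len(values) / sum(1.0 / v for v in values)


# Hypothetical measure values, for illustration only.
measures = {"recall": 0.8, "precision": 0.6, "specificity": 0.9, "npv": 0.75}

gps_rp = gps(measures, ["recall", "precision"])
gps_rs = gps(measures, ["recall", "specificity"])
gps_rpsn = gps(measures, ["recall", "precision", "specificity", "npv"])
print(round(gps_rp, 4), round(gps_rs, 4), round(gps_rpsn, 4))
```

Under this assumption, every measure in a combination enters the score with the same importance, which is the property the weighted variant is designed to relax.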
Further, for multi-class classification scenarios, several instances of GPS, such as GPS(UPM), GPS(Recall), GPS(Precision), GPS(NPV), and GPS(Recall, Precision) are considered. The One-vs-Rest technique is applied to handle multiple classes, resulting in
Likewise, other instances of GPS are calculated following a similar approach.
In this work, along with HM, AM and GM have also been used to combine measures. The combined measures based on GM and AM are defined as follows (Eqs. 10 and 11):
For positive values, HM is always less than or equal to GM, and GM is always less than or equal to AM [18].
When combining multiple measures, HM is preferred over GM and AM because it is more sensitive to smaller values in the aggregation process. It emphasizes the contribution of smaller values and avoids the distortion caused by larger ones.
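The ordering of the three means, and the stronger penalty that HM places on a single low measure, can be checked numerically. This is a small illustration with made-up values; the function names are not taken from the original work.

```python
import math
from typing import Sequence


def arithmetic_mean(values: Sequence[float]) -> float:
    return sum(values) / len(values)


def geometric_mean(values: Sequence[float]) -> float:
    return math.prod(values) ** (1.0 / len(values))


def harmonic_mean(values: Sequence[float]) -> float:
    return len(values) / sum(1.0 / v for v in values)


# Hypothetical measure pairs: one balanced, one with a single weak measure.
balanced = [0.8, 0.8]
skewed = [0.95, 0.2]

for label, values in (("balanced", balanced), ("skewed", skewed)):
    am = arithmetic_mean(values)
    gm = geometric_mean(values)
    hm = harmonic_mean(values)
    print(f"{label}: AM={am:.3f} GM={gm:.3f} HM={hm:.3f}")
```

For the balanced pair the three means coincide, whereas for the skewed pair HM falls furthest below AM, which is why a classifier cannot obtain a high combined score while performing poorly on one measure.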
This section provides details of the proposed performance measure (please see Section 4.1). Section 4.2 presents the experimental design followed in the present study.
Weighted general performance score (W-GPS)
GPS establishes a standardized measure by combining various performance measures using the harmonic mean but lacks a clear understanding of how each performance measure contributes to the combined measure, making it difficult to interpret the influence of each performance measure. To overcome this limitation, in this paper, the W-GPS is proposed to effectively combine measures according to their relative variability using weighted HM, where weights are determined based on the coefficient of variation (CV) value. These weights consider the CV of each measure across 100 iterations of model training, assigning lower weights to measures with higher variability and vice versa.
For the
To ensure values of
Subsequently, the weights of each measure are determined as the average of the CVs of the other measures, ensuring that higher variability corresponds to lower weights and vice versa. The weight of the
Let
Based on Eq. (15),
The computation of the proposed W-GPS can be illustrated as follows:
As an example, the calculation procedure for the proposed W-GPS is derived as the following steps:
So, the computed W-GPS value is approximately 0.787.
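The weighting scheme described in this section can be sketched as follows. This is not the authors' implementation: the equations are not reproduced here, so the sketch assumes that the CV is the sample standard deviation divided by the mean, that each weight is the average of the other measures' CVs rescaled so the weights sum to one, and that the weighted harmonic mean is applied to each measure's mean score across runs. All names and the score histories are hypothetical, and the example does not reproduce the worked value reported above.

```python
import statistics
from typing import Dict, List


def coefficient_of_variation(scores: List[float]) -> float:
    """CV = standard deviation / mean (sample standard deviation: an assumption)."""
    mean = statistics.fmean(scores)
    return statistics.stdev(scores) / mean if mean else 0.0


def cv_weights(history: Dict[str, List[float]]) -> Dict[str, float]:
    """Weight of a measure = average CV of the *other* measures, rescaled to sum to 1.

    A measure with a high CV therefore receives a low weight, and vice versa.
    """
    cvs = {name: coefficient_of_variation(s) for name, s in history.items()}
    n, total = len(cvs), sum(cvs.values())
    if n < 2 or total == 0:
        return {name: 1.0 / n for name in cvs}  # fall back to equal weights
    raw = {name: (total - cv) / (n - 1) for name, cv in cvs.items()}
    norm = sum(raw.values())
    return {name: w / norm for name, w in raw.items()}


def weighted_harmonic_mean(values: Dict[str, float], weights: Dict[str, float]) -> float:
    if any(values[name] == 0 for name in weights):
        return 0.0  # assumed convention for a zero-valued measure
    return sum(weights.values()) / sum(weights[name] / values[name] for name in weights)


# Hypothetical per-iteration scores; recall is stable, precision fluctuates.
history = {
    "recall": [0.80, 0.82, 0.78, 0.81],
    "precision": [0.60, 0.75, 0.55, 0.70],
}
weights = cv_weights(history)
mean_scores = {name: statistics.fmean(s) for name, s in history.items()}
w_gps = weighted_harmonic_mean(mean_scores, weights)
print(weights, round(w_gps, 4))
```

With equal weights the function reduces to the ordinary harmonic mean, so this sketch contains the unweighted score as a special case.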
Based on Eq. (16), we have considered different combinations, such as W-GPS (Recall, Precision, Specificity, and NPV), W-GPS (Recall, Precision), W-GPS (Recall, Specificity), W-GPS (Precision, NPV), and W-GPS (Specificity, NPV), in the binary classification scenario. Further, for multi-class classification scenarios, several instances of W-GPS, such as W-GPS(UPM), W-GPS(Recall), W-GPS(Precision), W-GPS(NPV), and W-GPS(Recall, Precision) are considered. The One-vs-Rest technique is applied to handle multiple classes, resulting in
Likewise, other instances of W-GPS are calculated following a similar approach.
The proposed W-GPS satisfies the following properties:
Hence, in this case, W-GPS
It is noted that the computation of weights to be assigned to the metrics
This section discusses the experimental design followed by research objectives and workflow. Table 3 outlines the research objectives (ROs) related to TPMs, GPS, and W-GPS and the corresponding formulated research questions (RQ) for the study. Figure 1 depicts the overall research flow.
Research flow.
The first step is the selection of relevant datasets for the classification task. For this purpose, several datasets from the UCI repository are selected based on their balanced and unbalanced nature. Among these datasets, nine have binary classes, and 22 are for multi-class classification. Table 2 summarises these datasets based on the number of records, features, classes, and percentage of the majority class. Furthermore, to examine the impact of performance measures in challenging conditions, different proportions of noise, which include 20% and 30%, are added to the raw data. Subsequently, datasets listed in Table 2 undergo preprocessing involving data cleaning, normalization, feature selection, and dimensionality reduction. Following that, a diverse range of classifiers is chosen for experimental purposes, including Support Vector Machines (SVM), Neural Networks, Logistic Regression, Naive Bayes, K-nearest Neighbors, Random Forests, Decision Trees, Bagged Trees, and Boosted Trees. Following classifier selection, the next step is model construction. This entails dividing the dataset into training and test subsets, allocating 80% of instances for training and 20% for testing. Each classifier undergoes training on various splits for 100 iterations. Afterwards, the classifiers’ performance is evaluated using a range of performance assessment methods, which encompass TPMs (Traditional Performance Measures), GPS (General Performance Score), and W-GPS (Weighted General Performance Score), all of which are applied to the test dataset. As the final step of the process, a descriptive analysis of the proposed research objectives and questions is carried out.
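The repeated holdout procedure described above (an 80/20 split repeated for 100 iterations, with measures recorded on each test subset) can be sketched with the standard library alone. The one-dimensional threshold classifier and the synthetic data are placeholders chosen to keep the example self-contained; they are assumptions and stand in for the classifiers and UCI datasets used in the study.

```python
import random
import statistics
from typing import Dict, List, Sequence, Tuple


def holdout_split(
    n_samples: int, test_fraction: float, rng: random.Random
) -> Tuple[List[int], List[int]]:
    """Shuffle the indices and return (train_indices, test_indices)."""
    indices = list(range(n_samples))
    rng.shuffle(indices)
    n_test = max(1, round(n_samples * test_fraction))
    return indices[n_test:], indices[:n_test]


def fit_threshold(xs: Sequence[float], ys: Sequence[int]) -> float:
    """Placeholder classifier: threshold halfway between the two class means."""
    pos = [x for x, y in zip(xs, ys) if y == 1]
    neg = [x for x, y in zip(xs, ys) if y == 0]
    return (statistics.fmean(pos) + statistics.fmean(neg)) / 2


def repeated_holdout(
    xs: Sequence[float],
    ys: Sequence[int],
    n_iterations: int = 100,
    test_fraction: float = 0.2,
    seed: int = 0,
) -> Dict[str, List[float]]:
    """Train on a fresh split each iteration and record test-set measures."""
    rng = random.Random(seed)
    history: Dict[str, List[float]] = {"recall": [], "precision": []}
    for _ in range(n_iterations):
        train_idx, test_idx = holdout_split(len(xs), test_fraction, rng)
        threshold = fit_threshold([xs[i] for i in train_idx], [ys[i] for i in train_idx])
        tp = fp = fn = 0
        for i in test_idx:
            predicted = 1 if xs[i] > threshold else 0
            if predicted == 1 and ys[i] == 1:
                tp += 1
            elif predicted == 1:
                fp += 1
            elif ys[i] == 1:
                fn += 1
        history["recall"].append(tp / (tp + fn) if (tp + fn) else 0.0)
        history["precision"].append(tp / (tp + fp) if (tp + fp) else 0.0)
    return history


# Synthetic, balanced two-class data (illustrative only).
data_rng = random.Random(42)
ys = [1] * 100 + [0] * 100
xs = [data_rng.gauss(1.0 if y == 1 else -1.0, 1.0) for y in ys]
history = repeated_holdout(xs, ys)
print(len(history["recall"]), round(statistics.fmean(history["recall"]), 3))
```

The per-iteration score lists collected this way are the inputs from which a variability statistic such as the CV of each measure can be computed.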
All experiments are conducted using Python 3.9.12 in the Jupyter Notebook environment (with an Intel Core i5 processor and 8 GB of RAM on the Windows 10 operating system). The experiments involve assessing performance measures on diverse datasets using various ML classifiers. Section 5.1 examines the consistency of different TPMs, GPS, and proposed W-GPS-based measures. Section 5.2 runs a stability test on GPS and W-GPS-based measures. Finally, a comparison of W-GPS with the performance measures found in the literature is provided in Section 5.3.
Consistency of TPMs, GPS, and proposed W-GPS-based measures
The experimental results presented in this section are aligned with the research objectives listed in Section 4.2. To meet these objectives, we assessed TPMs, GPS, and W-GPS across the classification scenarios (binary or multi-class, balanced or imbalanced, and noisy or noiseless data). The experiments are conducted on the datasets listed in Table 2. For brevity, boxplots are shown only for the following selected datasets:
Diabetes Dataset (binary)
Contraceptive Dataset (multi-class)
Variability in traditional performance measures (TPMs) across datasets: (a) Diabetes Dataset (Binary), (d) Contraceptive Dataset (Multi-class); (b) Balanced Diabetes Dataset (Binary), (e) Balanced Contraceptive Dataset (Multi-class); (c) Diabetes Dataset (Binary, 20% Noise), and (f) Contraceptive Dataset (Multi-class, 20% Noise).
Figure 2 depicts boxplots of the TPM scores for the above-selected datasets across 100 iterations of 13 classifiers. Figure 2(a) illustrates boxplots for the diabetes dataset, which has a class distribution ratio of 65:35. The interquartile range (lower bound, upper bound) differs across the performance metrics: accuracy (0.67, 0.79), recall (0.70, 0.80), precision (0.75, 0.89), specificity (0.55, 0.65), and NPV (0.50, 0.60). These ranges indicate limited variation within each TPM. A similar trend is observed for the balanced scenario, as illustrated in Fig. 2(b). Figure 2(c) depicts the boxplots for the noisy scenario of the diabetes dataset, where a noticeable increase in variance within the interquartile range of the metrics is observed. This indicates that noise in the dataset leads to higher variability in the performance metrics. Figure 2(d) illustrates boxplots for the contraceptive dataset, which has a class distribution ratio of 43:34:23. Here too, the interquartile ranges of accuracy (0.62, 0.72), recall (0.46, 0.53), precision (0.46, 0.55), specificity (0.70, 0.80), and NPV (0.74, 0.80) show limited variation. This trend is also noted in the balanced scenario, depicted in Fig. 2(e). Figure 2(f) depicts the boxplots for the noisy scenario of the contraceptive dataset, where the variance within the interquartile range of the metrics again increases noticeably. Noise therefore also affects the consistency of the TPMs. For the diabetes dataset, specificity and NPV scores are also lower than accuracy, recall, and precision. These visualizations highlight significant dissimilarities in performance metrics across different classifiers on the same dataset, as noticeable from box size and positioning differences.
This demonstrates clearly that different metrics exhibit varying behaviours.
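The interquartile ranges quoted for each boxplot are the first and third quartiles of a measure's scores across runs. As a sketch, they can be obtained as shown below; the helper name is hypothetical, and the `inclusive` quantile method is an assumption chosen because it uses linear interpolation between data points, which may differ from the convention used to draw the original figures.

```python
import statistics
from typing import List, Tuple


def iqr_bounds(scores: List[float]) -> Tuple[float, float]:
    """Return (Q1, Q3), the lower and upper edges of the boxplot box."""
    q1, _median, q3 = statistics.quantiles(scores, n=4, method="inclusive")
    return q1, q3


# Hypothetical accuracy scores from repeated runs, for illustration only.
accuracy_scores = [0.66, 0.70, 0.71, 0.73, 0.75, 0.78, 0.79, 0.81]
lower, upper = iqr_bounds(accuracy_scores)
print(f"IQR: ({lower:.3f}, {upper:.3f}), width = {upper - lower:.3f}")
```

A wider box, that is a larger difference between the upper and lower bounds, corresponds to the higher variability reported for the noisy scenarios.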
Variability in several instances of General Performance Score (GPS) across datasets: (a) Diabetes Dataset (Binary), (d) Contraceptive Dataset (Multi-class), (b) Balanced Diabetes Dataset (Binary), (e) Balanced Contraceptive Dataset (Multi-class); (c) Diabetes Dataset (Binary, 20% Noise), and (f) Contraceptive Dataset (Multi-class, 20% Noise).
Figure 3 depicts boxplots of the scores of different GPS instances for the above-selected datasets across 100 iterations of 13 classifiers. Figure 3(a) illustrates boxplots for the diabetes dataset. The interquartile range (lower bound, upper bound) differs across instances: GPS_RPSN (0.56, 0.70), GPS_RP (0.75, 0.82), GPS_RS (0.60, 0.75), GPS_PN (0.62, 0.70), and GPS_SN (0.55, 0.62). These ranges indicate limited variation within each GPS instance. A similar trend is observed for the balanced scenario, as illustrated in Fig. 3(b). Figure 3(c) depicts the boxplots for the noisy scenario of the diabetes dataset. There is a noticeable increase in variance within the interquartile range of the metrics. This indicates that noise in the dataset leads to higher variability in the performance metrics. Figure 3(d) illustrates boxplots for the contraceptive dataset. Here too, the interquartile ranges of GPS_UPM (0.55, 0.62), GPS_Recall (0.42, 0.52), GPS_Precision (0.75, 0.80), GPS_NPV (0.75, 0.78), and GPS_Recall_Precision (0.40, 0.50) show limited variation. This trend is also noted in the balanced scenario, depicted in Fig. 3(e). Figure 3(f) depicts the boxplots for the noisy scenario of the contraceptive dataset. Again, there is a noticeable increase in variance within the interquartile range of the metrics, so noise also affects the consistency of GPS. These visualizations highlight significant dissimilarities in performance metrics across different classifiers on the same dataset, as noticeable from box size and positioning differences.
Variability in several instances of Weighted General Performance Score (W-GPS) across different classification scenarios: (a) Diabetes Dataset (Binary), (b) Balanced Diabetes Dataset (Binary), (c) Diabetes Dataset (Binary, 20% Noise), (d) Contraceptive Dataset (Multi-class), (e) Balanced Contraceptive Dataset (Multi-class), and (f) Contraceptive Dataset (Multi-class, 20% Noise).
Figure 4 depicts boxplots of the scores of different W-GPS instances for the above-selected datasets across 100 iterations of 13 classifiers. Figure 4(a) illustrates boxplots for the diabetes dataset. The interquartile range (lower bound, upper bound) differs across instances: W-GPS_RPSN (0.62, 0.74), W-GPS_RP (0.85, 0.91), W-GPS_RS (0.88, 0.92), W-GPS_PN (0.70, 0.75), and W-GPS_SN (0.72, 0.79). These ranges indicate limited variation within each W-GPS instance. A similar trend is observed for the balanced scenario, as illustrated in Fig. 4(b). Figure 4(c) depicts the boxplots for the noisy scenario of the diabetes dataset. There is a noticeable increase in variance within the interquartile range of the metrics. This indicates that noise in the dataset leads to higher variability in the performance metrics. Figure 4(d) illustrates boxplots for the contraceptive dataset. The interquartile ranges of W-GPS_UPM (0.70, 0.85), W-GPS_Recall (0.50, 0.80), W-GPS_Precision (0.52, 0.82), W-GPS_NPV (0.90, 0.95), and W-GPS_Recall_Precision (0.10, 0.60) vary in width, from narrow for W-GPS_NPV to wide for W-GPS_Recall_Precision. This trend is also noted in the balanced scenario, depicted in Fig. 4(e). Figure 4(f) depicts the boxplots for the noisy scenario of the contraceptive dataset. Again, there is a noticeable increase in variance within the interquartile range of the metrics, so noise also affects the consistency of W-GPS. These visualizations highlight significant dissimilarities in performance metrics across different classifiers on the same dataset, as noticeable from box size and positioning differences.
It is noted that different approaches to performance measures exhibit varying behaviours, and different performance measures are inconsistent across all scenarios. This inconsistency in performance measures across classifiers makes it challenging to determine the best model based on a single measure alone. This demonstrates that different metrics exhibit varying behaviours. Thus, it suggests that a thorough evaluation involving multiple measures and possibly ensemble methods might be needed to make a well-informed decision about model selection. Similarly, we observed the same observation for the rest of the examined datasets.
Datasets used for the experiments
Research objectives and questions
This section examines the variability of GPS and W-GPS across different classifiers for different scenarios, such as raw, balanced, and noisy datasets. Table 4 shows the mean and standard deviation (SD) for GPS_RPSN, the general performance score of the combination of recall, precision, specificity, and NPV, and W-GPS_RPSN, the weighted general performance score of the same combination, for binary datasets. Table 7 shows the mean and SD values for GPS_UPM, the general performance score of the combination of the UPM of each class, and W-GPS_UPM, the weighted general performance score of the same combination, for multi-class datasets. It is observed that W-GPS_RPSN has less variability than GPS_RPSN. For instance, in the adult dataset (Table 4), GPS has a mean of 0.59, whereas W-GPS has a notably higher mean of 0.684. Additionally, GPS has an SD of 0.099, while W-GPS exhibits lower variability with an SD of 0.058.
The same behaviour was observed for multi-class datasets. For instance, in the balance scale dataset (Table 7), GPS_UPM has a mean of 0.317, whereas W-GPS_UPM has a notably higher mean of 0.776. Additionally, GPS_UPM has an SD of 0.392, while W-GPS_UPM exhibits lower variability with an SD of 0.162. This lower SD suggests that W-GPS offers more consistent performance across different classifiers than GPS, so the proposed weighted approach is more reliable and less sensitive to fluctuations. Thus, integrating weights into individual measures enhances the model’s decision-making, resulting in more dependable and consistent outcomes.
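Because the means of GPS and W-GPS differ, the SDs above can also be put on a common scale through the coefficient of variation (CV = SD / mean). The sketch below derives the CV from the means and SDs quoted in the text for the adult and balance scale datasets. The CV values themselves are not reported in this section; they are computed here only as an illustration, and the function and variable names are hypothetical.

```python
def coefficient_of_variation(mean, sd):
    """Relative variability: standard deviation divided by the mean."""
    return sd / mean


# (mean, SD) pairs as quoted in the text for Tables 4 and 7.
reported = {
    "adult": {"GPS": (0.59, 0.099), "W-GPS": (0.684, 0.058)},
    "balance scale": {"GPS": (0.317, 0.392), "W-GPS": (0.776, 0.162)},
}

cv_table = {
    dataset: {metric: coefficient_of_variation(mean, sd)
              for metric, (mean, sd) in scores.items()}
    for dataset, scores in reported.items()
}

for dataset, cvs in cv_table.items():
    for metric, cv in cvs.items():
        print(f"{dataset:>13} {metric:>5}: CV = {cv:.3f}")
```

For both datasets the derived CV is lower for W-GPS than for GPS, in line with the lower SDs reported above.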
Comparative analysis of W-GPS with alternative measures
In this section, W-GPS is compared against alternative evaluation measures, including accuracy, recall, precision, specificity, NPV, F1 measure, MCC, GPS [18], and a composite metric using Geometric Mean (GM) and Arithmetic Mean (AM) components [51]. Table 5 displays the scores of the different metrics for binary datasets. It is noted that metrics such as accuracy, recall, precision, specificity, and NPV may not be reliable assessments of model performance, because high values of these metrics are sometimes biased towards the majority class. Therefore, in some cases, individual metrics do not capture all aspects of model performance. To overcome this issue, GPS [18] is defined as the harmonic mean of a set of metrics, which penalizes metrics with inferior values. However, GPS fails to consider the relative importance of these metrics. When users are presented with distinct metrics, it is unclear how much weight to give to each score. The proposed metric, W-GPS, provides a data-driven approach to assigning suitable weights to these metrics based on their variability across multiple runs. Thus, the W-GPS measure provides a more refined evaluation of the classifier’s performance. For example, in the case of the diabetes dataset (please see Table 5), the mean GPS_RPSN value is 0.717, while W-GPS_RPSN surpasses it with a mean value of 0.725. Likewise, for the multi-class connect-4 dataset (please see Table 6), the mean GPS value is 0.605, while the W-GPS measure surpasses it with a mean value of 0.762.
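The contrast between equal weighting and variability-based weighting can be sketched as follows. This is a minimal illustration under stated assumptions, not the paper's formula: the weighting rule (weights proportional to 1/CV, normalized to sum to one), the use of a weighted harmonic mean for aggregation, the function names, and the per-run metric values are all hypothetical choices made here for illustration.

```python
import statistics


def gps(values):
    """Harmonic mean of the metric values (every metric counts equally)."""
    return statistics.harmonic_mean(list(values))


def cv_weights(runs_per_metric):
    """ASSUMPTION: weight each metric by 1/CV, then normalize to sum to 1.

    A metric whose value never changes (CV == 0) would need special
    handling; this sketch does not cover that case.
    """
    inverse_cv = {}
    for name, runs in runs_per_metric.items():
        cv = statistics.stdev(runs) / statistics.mean(runs)
        inverse_cv[name] = 1.0 / cv
    total = sum(inverse_cv.values())
    return {name: value / total for name, value in inverse_cv.items()}


def weighted_harmonic_mean(values, weights):
    """ASSUMPTION: aggregate the metrics with a weighted harmonic mean."""
    return sum(weights.values()) / sum(weights[n] / values[n] for n in values)


# Hypothetical per-run values of each measure (illustrative only).
runs = {
    "recall": [0.60, 0.70, 0.80],
    "precision": [0.78, 0.80, 0.82],
    "specificity": [0.88, 0.90, 0.92],
    "npv": [0.70, 0.75, 0.80],
}
means = {name: statistics.mean(values) for name, values in runs.items()}
weights = cv_weights(runs)

equal_score = gps(means.values())
weighted_score = weighted_harmonic_mean(means, weights)
print(f"equal weights:    {equal_score:.3f}")
print(f"CV-based weights: {weighted_score:.3f}")
```

Under this assumed rule, the measure that fluctuates most across runs (recall in the toy data) receives the smallest weight, so it influences the aggregate less than it does under equal weighting.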
Mean and standard deviation (SD) of GPS_RPSN and W-GPS_RPSN for binary datasets in various scenarios
Computed TPMs, GPS, and W-GPS in case of binary datasets
Computed TPMs, GPS, and W-GPS in case of multi-class datasets
Mean and standard deviation (SD) of GPS_UPM and W-GPS_UPM for multi-class datasets
Mean and standard deviation (SD) of GPS_UPM and W-GPS_UPM for noisy environment
Traditional performance measures (TPMs) such as recall, precision, specificity, and negative predictive value (NPV) are highly sensitive to minor data fluctuations and exhibit substantial variability across iterations. The experiments demonstrated (please see Section 5.1) that a performance measure that is robust for one classifier may not show a similar trend when used to evaluate another classifier across diverse dataset characteristics. Although the General Performance Score (GPS) introduced by [18] attempted to reduce this variability by combining the measures with high variability, it failed to address the issue adequately because it assigned equal importance to all the measures. In this paper, we have tried to address this sensitivity to minor data fluctuations and the resulting variability across iterations. The proposed weighted general performance score (W-GPS) considers each measure’s coefficient of variation (CV) to assign a suitable weight. As detailed in Section 5.2, the proposed W-GPS exhibited significantly reduced variation compared to GPS. We have compared TPMs, GPS, and the proposed W-GPS for different classification scenarios.
Domain-wise average score of GPS and W-GPS
The following subsections summarize the W-GPS results with respect to different factors: variability in data distribution and its associated cost, the presence of noise, and domain-specific requirements.
In real-life applications, one often encounters imbalanced datasets. A classifier’s performance metrics may be greatly affected by the distribution of classes within a dataset. Indeed, if fluctuations in the data cause significant variability in the outcome of an evaluation metric, the cost is a loss of interpretability. [81] noted that finding an unbiased estimator for an imbalanced dataset is challenging. For example, for the Thoracic Surgery dataset (Table 4; class ratio 85:15), the mean GPS score on raw data is 0.111 with a variability of 0.066. After balancing the data using SMOTE, the GPS score is 0.239 with a variability of 0.121, showing that while the average GPS improves, its variability also increases. In contrast, the mean W-GPS score on raw data for the same dataset is 0.175, with a variability of 0.110, and after balancing with SMOTE, the W-GPS score is 0.328, with a variability of 0.088. Thus, balancing reduces the variability of W-GPS while increasing that of GPS, and on the balanced data W-GPS varies less than GPS (0.088 versus 0.121), which makes it more interpretable. Experimentation on the other datasets also exhibited reduced variability (please see Table 7). Thus, we conclude that W-GPS provides a more robust and stable evaluation of classification models, making it a more reliable metric for practical applications.
Impact of noise
Introducing noise into datasets generally increases the standard deviation of performance metrics, reflecting higher variability and instability in model performance. This occurs because noise adds random fluctuations and errors to the data, making it more challenging for the model to identify patterns and make accurate predictions. The W-GPS provided a more balanced and reliable assessment of the model’s effectiveness. In Table 8, the Contraceptive dataset with 20% noise shows that GPS has a mean of 0.583 and a standard deviation (SD) of 0.141, while W-GPS has a mean of 0.603 and an SD of 0.104. This represents a 3.4% increase in the mean and a 26.2% decrease in the SD for W-GPS compared to GPS. When the noise level is increased to 30%, GPS has a mean of 0.590 and an SD of 0.153, while W-GPS has a mean of 0.617 and an SD of 0.126. This indicates a 4.6% increase in the mean and a 17.6% decrease in the SD for W-GPS compared to GPS. Likewise, for the Diabetes dataset with 20% noise, GPS has a mean of 0.593 and an SD of 0.379, while W-GPS has a mean of 0.643 and an SD of 0.110. This represents an 8.4% increase in the mean and a 71.0% decrease in the SD for W-GPS compared to GPS. With the noise level increased to 30%, GPS shows a mean of 0.570 and an SD of 0.314, whereas W-GPS maintains a higher mean of 0.619 and a lower SD of 0.101. This demonstrates an 8.6% increase in the mean and a 67.8% decrease in the SD for W-GPS compared to GPS. The above results demonstrate that W-GPS maintains higher mean performance and lower variability in a noisy environment than GPS, as its weighting mechanism enables it to handle noisy data more effectively.
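The percentages quoted above follow from the relative change between the GPS and W-GPS values. The short sketch below reproduces them from the means and SDs stated in the text for Table 8; the function name and the data layout are hypothetical, chosen only for illustration.

```python
def relative_change_pct(baseline, new):
    """Percentage change of `new` relative to `baseline`."""
    return (new - baseline) / baseline * 100.0


# (dataset, noise %, (GPS mean, GPS SD), (W-GPS mean, W-GPS SD)) as quoted in the text.
table8 = [
    ("Contraceptive", 20, (0.583, 0.141), (0.603, 0.104)),
    ("Contraceptive", 30, (0.590, 0.153), (0.617, 0.126)),
    ("Diabetes", 20, (0.593, 0.379), (0.643, 0.110)),
    ("Diabetes", 30, (0.570, 0.314), (0.619, 0.101)),
]

changes = []
for dataset, noise, (gps_mean, gps_sd), (wgps_mean, wgps_sd) in table8:
    mean_gain = relative_change_pct(gps_mean, wgps_mean)
    sd_drop = -relative_change_pct(gps_sd, wgps_sd)  # positive = SD decreased
    changes.append((round(mean_gain, 1), round(sd_drop, 1)))
    print(f"{dataset} ({noise}% noise): mean +{mean_gain:.1f}%, SD -{sd_drop:.1f}%")
```

Rounded to one decimal place, the computed changes agree with the figures stated in the paragraph above.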
Domain-specific requirements
W-GPS consistently outperforms GPS in terms of mean performance scores across various datasets from different domains, including medical, environmental, and general classification tasks (please see Table 9). For example, this holds for the set of 11 datasets from the medical domain, and experimentation on the datasets from the other domains exhibited a similar trend.
Conclusion and future scope
Traditionally, performance measures such as accuracy, precision, recall, and specificity have been used to evaluate the performance of classifiers. However, relying on a single measure overlooks other important indicators, leading to inconsistent results. Because the traditional performance measures (TPMs) exhibit significant variability, an aggregate performance metric, GPS, was recently proposed to provide a single reliable measure based on an aggregation of TPMs. In this paper, we have proposed W-GPS, a robust performance measure for evaluating classification models. The proposed measure combines several traditional performance measures such as accuracy, precision, recall, specificity, and NPV, and assigns suitable weights to them based on their coefficient of variation. The proposed W-GPS is stable for a wide variety of classification scenarios. It also turns out to be consistently superior to GPS in terms of reduced variability across the datasets. Moreover, W-GPS offers data analysts the flexibility to develop different measures by choosing a suitable combination of TPMs, depending on the specific scenario. Future work will focus on developing a data-driven model selection procedure.
Funding
As mentioned above.
CRediT authorship contribution statements
Gaurav Pandey: conducted the experiments. Rashika Bagri: wrote the original draft and interpreted the data. Rajan Gupta: developed the research objectives and methodologies. Ankit Rajpal and Naveen Kumar: led the analysis, review, and editing. Manoj Agarwal: led the analysis and interpreted the data.
Declaration of competing interests
The authors declare that they have no conflict of interest. This article contains no studies performed by the authors with human participants or animals.
Ethical and informed consent
This article does not contain any studies with human participants or animals.
Data availability and access
All the datasets used in the paper are publicly available. The links to access these datasets are provided in this paper.
Acknowledgments
Rashika Bagri would like to express gratitude to the University Grants Commission, New Delhi, India, for granting the Junior Research Fellowship (Reference No. 200510452974).
