Robust weighted general performance score for various classification scenarios

Abstract

Traditionally, performance measures such as accuracy, recall, precision, specificity, and negative predictive value (NPV) have been used to evaluate a classification model’s performance. However, these measures often fall short across different classification scenarios, such as binary or multi-class, balanced or imbalanced, and noisy or noiseless data. Therefore, there is a need for a robust evaluation metric that can assist business decision-makers in selecting the most suitable model for a given scenario. Recently, a general performance score (GPS) comprising different combinations of traditional performance measures (TPMs) was proposed. However, it indiscriminately assigns equal importance to each measure, often leading to inconsistencies. To overcome the shortcomings of GPS, we introduce an enhanced metric called the Weighted General Performance Score (W-GPS) that considers each measure’s coefficient of variation (CV) and assigns a weight to that measure based on its CV value. Using consistency as the criterion, we found that W-GPS outperformed GPS in the above-mentioned classification scenarios. Further, when W-GPS was computed with different weighted combinations of TPMs, no single combination emerged as best for a given scenario. Thus, W-GPS offers the user the flexibility to choose the most suitable combination for a given scenario.

Keywords

Performance measures, classification, imbalanced data modelling, noisy data modelling, coefficient of variation

1. Introduction

Given a dataset $D=\{t_{1},t_{2},\ldots,t_{n}\}$ comprising $n$ records with each record belonging to one of the predefined classes $C=\{C_{1},C_{2},\ldots,C_{m}\}$ , the task of classification is to determine a mapping function $f:D\rightarrow C$ [1, 2, 3].

The performance of the classifier is assessed in terms of traditional performance measures such as accuracy, recall (sensitivity), precision, specificity, and F1-score [4, 5, 6, 7, 8]. These measures are computed from the confusion matrix [9]. Table 1 depicts the confusion matrix of a binary classifier, where the rows represent the predicted classes and the columns represent the actual classes.

In Table 1, the diagonal elements of the confusion matrix, true positive (TP) and true negative (TN), correspond to the correct predictions made by the classifier. On the other hand, the off-diagonal elements false positive (FP) and false negative (FN) represent incorrect predictions. If a test instance’s true class is negative and the classifier predicts it as negative, it contributes to TN; if the model predicts it as positive, it is counted as FP, commonly known as a Type I error [10]. Similarly, if a test instance’s true class is positive and the classifier predicts it as positive, it contributes to TP; if the model predicts it as negative, it is counted as FN, commonly known as a Type II error [10].
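The tallying described above can be sketched in a few lines. This is a minimal illustration, assuming labels are encoded as 1 (positive) and 0 (negative); the function name `confusion_counts` and the toy labels are hypothetical and not taken from the study.

```python
def confusion_counts(y_true, y_pred, positive=1):
    """Tally TP, TN, FP, FN for a binary problem."""
    tp = tn = fp = fn = 0
    for actual, predicted in zip(y_true, y_pred):
        if predicted == positive:
            if actual == positive:
                tp += 1
            else:
                fp += 1  # Type I error: negative instance predicted positive
        else:
            if actual == positive:
                fn += 1  # Type II error: positive instance predicted negative
            else:
                tn += 1
    return tp, tn, fp, fn


# Toy labels, for illustration only.
y_true = [1, 1, 1, 0, 0, 0, 0, 1]
y_pred = [1, 0, 1, 0, 1, 0, 0, 1]
tp, tn, fp, fn = confusion_counts(y_true, y_pred)
print(tp, tn, fp, fn)  # prints: 3 3 1 1
```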

In multi-class classification scenarios, TP, TN, FP, and FN are computed for each class, and the traditional performance measures are computed using the micro and macro approaches. The micro approach considers all classes and calculates a single measure by aggregating true positives, true negatives, false positives, and false negatives across all classes [11, 12]. In the macro approach, the measures are computed independently for each class and then averaged. The strategy used for multi-class classification is one vs. the rest (or one vs. all). This strategy trains a separate binary classifier for each class, treating that class as the positive class and the remaining classes as the negative class [13, 14]. Sections 3 and 4.1 present the mathematical formulations of the performance measures used in binary and multi-class classification problems.

Numerous studies have been conducted to assess the performance of classifiers by utilising diverse measures [15, 16, 17]. Each measure captures distinct aspects of classification performance. For instance, precision evaluates the classifier’s ability to identify positive instances correctly, recall measures the classifier’s ability to capture all positive instances, and the F1-score balances the trade-off between precision and recall. Indeed, it is crucial to consider specific requirements when choosing the appropriate measure, as no single measure can consider all aspects of classification [4].

Table 1
Confusion matrix

		True class
		Positive	Negative
Predicted class	Positive	TP	FP
	Negative	FN	TN

In this work, a recently proposed General Performance Score (GPS) is evaluated in different classification scenarios, such as binary or multi-class, balanced or imbalanced, and noisy or noiseless data [18]. GPS indiscriminately assigns equal importance to each performance measure, often leading to inconsistencies. To overcome the shortcomings of GPS, we propose an enhanced metric called the Weighted General Performance Score (W-GPS) that considers each measure’s coefficient of variation (CV) and assigns a weight to that measure based on its CV value. A measure with a higher CV shows greater variability across multiple data subsets and therefore receives a lower weight. Conversely, a measure with a lower CV receives a higher weight. W-GPS computes the weighted harmonic mean of the combined performance measures, accounting for their variability across multiple data subsets obtained with the holdout method [19, 20, 21]. To assess the consistency of W-GPS, an extensive empirical study was conducted involving thirteen classifiers, namely, Logistic Regression [22], Support Vector Machines [23], Linear Discriminant Analysis [24], Quadratic Models [25], probability-based Naive Bayes [26], instance-based K-nearest Neighbors [27], tree-based Decision Trees [28], ensemble-based classifiers such as Random Forest [29], Extra Tree [30], Gradient Boosting [31], Light Gradient Boosting [32], and Extreme Gradient Boosting [33], and the neural-network-based Multilayer Perceptron [34]. Considering consistency as a criterion, W-GPS outperformed GPS in all the above-mentioned classification scenarios.

The paper is organized as follows: Section 2 explores related work; Section 3 discusses the existing measures; and Section 4 defines the proposed methodology used in the work. The results of the experiments are detailed in Section 5. Section 6 discusses the implications of the results. Finally, Section 7 includes the conclusions and future scope.

2. Related work

Performance measures are essential in evaluating the effectiveness of classifiers in machine learning [35]. They provide quantitative assessments of a classifier’s predictive performance. These measures also assist in comparing different classifiers, optimizing their parameters, and selecting the most suitable model for a specific problem. The performance measures for a machine learning model are of great importance for various reasons stated below:

1.
Model Assessment: Performance measures offer a quantitative means to measure the performance of a machine learning model.
2.
Model Selection: The use of appropriate performance measures may assist in selecting suitable models from the possible set of models. Model selection is the process of choosing the best machine-learning model from a group of possible machine-learning models for a training dataset.
3.
Optimization: Once a model is selected, these measures guide in fine-tuning the model parameters to optimize its performance.
4.
Assessing generalizability: When applied to a distinct test set, performance measures help determine how well the model performs on unseen data.

Each performance measure captures a different aspect of a model’s performance, and different measures may yield conflicting outcomes. Selecting performance measures therefore becomes complex for decision-makers, as it entails understanding the contextual nuances of the problem, considering the costs associated with various types of errors (false positives versus false negatives), analyzing the characteristics of the data (balanced versus imbalanced), and accounting for the requirements of the business or application. These factors are elucidated as follows:

1.
Understand the Business Impact: Different types of errors often have different business impacts. For example, in a medical diagnosis setting, a false negative (a sick person being declared healthy) could have much more serious consequences than a false positive (a healthy person being declared sick). In this case, a measure like recall (which focuses on minimizing false negatives) might be more important [1, 36].
2.
Data Distribution: If the class-wise distribution in the dataset is heavily imbalanced, accuracy might not be the best measure as it can be driven by the majority class [37]. Several studies have encountered issues with performance measures when dealing with imbalanced data [38, 39].
3.
Balance trade-offs: There is often a trade-off between precision and recall [40]. When maximizing the inclusion of positive instances is crucial, prioritize recall. On the other hand, if it is more important to be sure about the instances the model classifies as positive, then precision should be prioritized. To strike a balance between precision and recall, consider using the F1 score.
4.
Multiple Measures: To capture all the aspects of a model’s performance, no standard measure exists. Therefore, a set of measures must be considered to provide a more comprehensive evaluation of the model.

Several research studies have been carried out to assess the effectiveness of different performance measures for evaluating ML classifiers. These studies aim to understand the strengths and limitations of different measures and their significance in various contexts [15, 41, 42, 43, 44]. Some of the recent studies are listed as follows: Huang et al. [45] conducted a comparison of machine learning algorithms, specifically Naive Bayes, Decision Trees, and Support Vector Machines (SVM), focusing on their predictive accuracy and the area under the ROC curve (AUC). Through experimentation with diverse datasets, they found that Naive Bayes, C4.4, and SVM exhibited similar levels of predictive accuracy. However, in terms of AUC, Naive Bayes, C4.4, and SVM outperformed C4.5. Huang et al. [46] used the same set of classifiers as used by Huang et al. [45] to compare the accuracy and AUC (Area Under the Curve) measures. The authors compared the consistency of accuracy and AUC, and their experimental study found AUC to be effective for measuring and comparing classifiers. Further, they showed that Naive Bayes outperforms pruned decision trees significantly when evaluated based on the area under the receiver operating characteristic curve (AU-ROC). They found that multiple evaluation measures evaluate different aspects of model performance. Sokolova et al. [47] explored different evaluation measures for ML classifiers, challenging the sufficiency of commonly used measures like accuracy, F1-score, and AUC. They also emphasized the drawbacks associated with these measures, particularly when dealing with situations where multiple classes hold equal significance. Additionally, they computed alternative measures such as Youden’s index, likelihood, and discriminant power. These alternative measures are widely employed in medical diagnosis and are shown to be interconnected. 
Their study further explored the applications of these measures to a case study centred around the classification of electronic negotiations, which occurred in diverse domains such as labour and business.

Several studies have extensively compared performance measures for classification tasks. For instance, Zhou et al. [48] focused on the correlation analysis of performance measures for classifiers and categorized the commonly used measures into threshold, rank, and probability measures. Luque et al. [49] examined classifiers’ performance measures and their symmetries, focusing on binary classifiers and exposing cross-symmetries and statistical symmetries among the measures. Their study showed three types of symmetries between the measures: labelling inversion, scoring inversion, and the combination of these two inversions. In another study, Ferri et al. [50] identified clusters and relationships among the behaviour of 18 performance measures. Through experimentation, they comprehensively analyzed the relationships between measures, which can be useful for choosing the most adequate measures for a specific task. In conclusion, these studies enhanced practitioners’ understanding of measure relationships and demonstrated varying degrees of correlation among different sets of measures. These findings underscore the importance of understanding measure symmetries for effective metric selection and classification process enhancement.

In addition to exploring the relationships between individual performance measures, the following studies proposed combining these independent measures into a unified measure. Nandi et al. [51] presented aggregated performance measures using statistical aggregation, such as harmonic mean (HM), geometric mean (GM), and arithmetic mean (AM), together with two other objective functions that yield two novel performance measures: the distance from the origin (DO) and the distance from the ideal position (DIP). They also discussed the need for a robust assessment of algorithms and the challenges of comparing multiple measures. They observed that HM performs best at the lower performance end, while DIP excels at the higher performance end. Furthermore, their study provides valuable insights into algorithm properties and rankings based on these measures. In another work, Redondo et al. [52] focused on evaluating the performance of the classification model by proposing an alternative measure, the Unified Performance Measure (UPM), which combines multiple measures, such as recall, precision, specificity, and negative predictive value, using the harmonic mean. Comparative evaluations on simulated and real datasets demonstrated the superiority of UPM over other performance measures. Similarly, Diego et al. [18] proposed the General Performance Score (GPS) as a novel measure to combine performance measures for binary and multi-class classification problems. The GPS was defined using the harmonic mean to combine multiple performance measures. Their approach penalized low-value measures and was less sensitive to extreme values. These approaches provide enhanced tools for comprehensive performance evaluation in classification tasks.

At the same time, some recent studies emphasized the significance of incorporating weighted combinations and specialized measures to evaluate classifier performance [53, 54]. By assigning weights based on dataset characteristics and exploring alternative measures, more accurate and informative assessments of classifier performance can be obtained, and enhanced weighted performance measures improve the integrity of individual performance measurements [53]. These approaches contribute to a better understanding of evaluation challenges in real-world scenarios and facilitate the development of more robust performance measures. For instance, Jadhav et al. [54] explored various performance measures, such as sensitivity, specificity, and accuracy, to assess ML classifiers and proposed a weighted measure called TPR-TNR that considers dataset imbalance and the costs associated with misclassification to address the limitations of existing measures, particularly in the context of imbalanced datasets.

Nevertheless, developing a comprehensive weighted performance function that incorporates the measures relevant to a specific context remains crucial. Devising a universal score is a difficult endeavour, and such a score may oversimplify the evaluation of a model’s performance. Moreover, determining the appropriate weights for each measure is often context-dependent. Taking cues from this, our study introduces a novel approach called W-GPS, which incorporates weight parameters to assign relative importance to different measures when combining them into a single score. These weights are determined from each performance measure’s CV. This approach assumes that measures with a lower CV are more stable and should have a greater impact on the overall performance measure.

In this work, various performance measures, including TPM, GPS, and W-GPS, have been investigated for the binary and multi-class classification tasks. These measures are examined across different data characteristics, such as the number of target variable classes (binary or multi-class), the distribution of target variable classes (balanced or unbalanced), and the quality of the data (noisy or clean). To determine the most appropriate performance measure for evaluating classifier performance, a wide variety of ML classifiers, such as decision trees, naive Bayes, k-nearest neighbours, logistic regression, extra trees, and others, are evaluated. The findings of this research contribute to the advancement of evaluation practices in classification tasks, enabling more informed choices in measure selection and encouraging the adoption of standardized evaluation methodologies.

3. Preliminaries

This section discusses several existing performance measures, such as TPMs and GPS [18], used to evaluate the classifier’s performance. These measures are determined through a $K\times K$ confusion matrix, where $K\geqslant 2$ covers binary and multi-class problems.

3.1 Traditional performance measures (TPMs)

TPMs, such as accuracy, precision, recall, and F1-score, capture different aspects of classifier performance. A confusion matrix is a fundamental tool for evaluating a classifier by comparing actual and predicted labels. It comprises essential elements such as TP, TN, FP, and FN, which are crucial in defining performance measures.

In binary classification scenarios with two classes 1 (positive) and 0 (negative), performance measures such as accuracy, precision, recall, specificity, negative predictive value (NPV), and F1 score are determined from the $2\times 2$ confusion matrix to evaluate the effectiveness of the model.

1.
Accuracy: Accuracy is a common measure to evaluate a classifier. It is the ratio of correct predictions to total predictions. High accuracy indicates successful classification. However, for imbalanced datasets, where one class is more frequent, accuracy might not give the best insight into classifier performance.

$\displaystyle\textit{Accuracy}=\frac{\textit{TP}+\textit{TN}}{\textit{TP}+\textit{TN}+\textit{FP}+\textit{FN}}$ (1)
2.
Precision: This measure is the number of true positive predictions divided by the total number of positive predictions made by the model.

$\displaystyle\textit{Precision}=\frac{\textit{TP}}{\textit{TP}+\textit{FP}}$ (2)
3.
Recall: Recall measures a classification model’s ability to find all the positive observations in the dataset. It is the ratio of correctly classified positive instances to the total positive instances.

$\displaystyle\textit{Recall}=\frac{\textit{TP}}{\textit{TP}+\textit{FN}}$ (3)
4.
Specificity: This measure is the number of true negative predictions divided by the total number of negative observations. It measures the ability of the model to find all the negative observations in the dataset.

$\displaystyle\textit{Specificity}=\frac{\textit{TN}}{\textit{TN}+\textit{FP}}$ (4)
5.
Negative Predictive Value (NPV): This measure is the number of true negative predictions divided by the total number of negative predictions made by the model.

$\displaystyle\textit{NPV}=\frac{\textit{TN}}{\textit{TN}+\textit{FN}}$ (5)
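Eqs. (1) to (5) translate directly into code. The sketch below is illustrative only: the function name `traditional_measures`, the example counts, and the convention of returning 0.0 when a denominator is zero are assumptions, not part of the original study.

```python
def traditional_measures(tp, tn, fp, fn):
    """Compute Eqs. (1)-(5) from binary confusion-matrix counts."""

    def ratio(numerator, denominator):
        # Assumption: a measure with an empty denominator is reported as 0.0.
        return numerator / denominator if denominator else 0.0

    return {
        "accuracy": ratio(tp + tn, tp + tn + fp + fn),  # Eq. (1)
        "precision": ratio(tp, tp + fp),                # Eq. (2)
        "recall": ratio(tp, tp + fn),                   # Eq. (3)
        "specificity": ratio(tn, tn + fp),              # Eq. (4)
        "npv": ratio(tn, tn + fn),                      # Eq. (5)
    }


# Hypothetical counts, for illustration only.
scores = traditional_measures(tp=40, tn=45, fp=5, fn=10)
for name, value in scores.items():
    print(f"{name}: {value:.3f}")
```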

For multi-class classification problems with $K$ classes, the One vs. Rest technique is used. This approach splits the multi-class dataset into multiple binary classification problems, so that $K$ binary classification models are trained for $K$ classes. To evaluate the performance of the multi-class classification model, the average of the performance measure scores of all $K$ models is computed, known as the macro average. For instance, the macro average of accuracy is defined in Eq. (6):

$\displaystyle\textit{macro-accuracy}=\frac{\sum_{i=1}^{K}\textit{Accuracy}_{i}}{K}$ (6)

where $\textit{Accuracy}_{i}$ denotes the accuracy value for the $i^{\text{th}}$ model. Likewise, macro-recall, macro-precision, macro-specificity, macro-NPV, and macro-F1-score can be computed.
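As a rough sketch of the macro average in Eq. (6), the code below derives one-vs-rest counts for each class from a single $K\times K$ confusion matrix rather than training $K$ separate models, which is a simplification. It assumes rows are predicted classes and columns are actual classes, as in Table 1; the helper names and the example matrix are hypothetical.

```python
def one_vs_rest_counts(cm, k):
    """Return TP, TN, FP, FN for class k from a K x K confusion matrix.

    Assumes cm[predicted][actual], matching the layout of Table 1.
    """
    total = sum(sum(row) for row in cm)
    tp = cm[k][k]
    fp = sum(cm[k]) - tp                 # predicted k, actually another class
    fn = sum(row[k] for row in cm) - tp  # actually k, predicted another class
    tn = total - tp - fp - fn
    return tp, tn, fp, fn


def macro_accuracy(cm):
    """Average the per-class one-vs-rest accuracies, as in Eq. (6)."""
    per_class = []
    for k in range(len(cm)):
        tp, tn, fp, fn = one_vs_rest_counts(cm, k)
        per_class.append((tp + tn) / (tp + tn + fp + fn))
    return sum(per_class) / len(per_class)


# Hypothetical three-class confusion matrix, for illustration only.
cm = [
    [50, 3, 2],
    [4, 40, 6],
    [1, 7, 37],
]
print(round(macro_accuracy(cm), 4))
```

The same loop yields macro-recall, macro-precision, and the other macro measures by swapping the per-class formula.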

3.2 General Performance Score (GPS)

The preceding section introduced several performance measures for assessing classification tasks. Each measure often emphasizes distinct facets of performance evaluation. In response to this diversity, Diego et al. [18] defined new performance measures by combining a set of measures into a unified measure employing the Harmonic Mean (HM) and imposing penalties for low metric values.

Let $p_{1},p_{2},p_{3},\ldots,p_{n}$ represent $n$ different performance measures that evaluate the model for a classification problem. The GPS [18] is defined as follows:

$\displaystyle\textit{GPS}(p_{1},p_{2},p_{3},\ldots,p_{n})=\frac{n}{\sum_{i=1}^{n}1/p_{i}}$ (7)

Based on Eq. (7), we considered different combinations such as GPS (Recall, Precision, Specificity, and NPV), GPS (Recall, Precision), GPS (Recall, Specificity), GPS (Precision, NPV), and GPS (Specificity, NPV) in binary classification scenario. UPM is equivalent to GPS when considering the combined Recall, Precision, Specificity, and NPV measures.

$\displaystyle\textit{GPS(Recall, Precision, Specificity, NPV)}=\frac{4}{\frac{1}{\textit{Recall}}+\frac{1}{\textit{Precision}}+\frac{1}{\textit{Specificity}}+\frac{1}{\textit{NPV}}}$ (8)
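A minimal sketch of Eqs. (7) and (8) follows. The function name `gps` and the example values are hypothetical, and returning 0.0 when any measure is zero is an assumption (it is the limit of the harmonic mean as a measure approaches zero).

```python
def gps(*measures):
    """Harmonic mean of the given performance measures, Eq. (7)."""
    if any(p == 0 for p in measures):
        # Assumption: a zero measure yields a zero score (limit of Eq. (7)).
        return 0.0
    return len(measures) / sum(1.0 / p for p in measures)


# Hypothetical measure values, for illustration only.
recall, precision, specificity, npv = 0.80, 0.90, 0.90, 0.80

upm_like = gps(recall, precision, specificity, npv)  # Eq. (8)
pairwise = gps(recall, precision)                    # GPS(Recall, Precision)
print(round(upm_like, 3), round(pairwise, 3))
```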

Further, for multi-class classification scenarios, several instances of GPS, such as GPS(UPM), GPS(Recall), GPS(Precision), GPS(NPV), and GPS(Recall, Precision) are considered. The One-vs-Rest technique is applied to handle multiple classes, resulting in $K$ binary confusion matrices, where $K$ is the total number of classes. For instance, GPS(UPM) is specifically defined as the harmonic mean of the Unified Performance Measures (UPM) [52] across all individual classes, denoted as $(\textit{UPM}[1],\textit{UPM}[2],\cdots,\textit{UPM}[K])$. This approach combines the UPM values associated with each class into a unified metric, as shown in Eq. (9):

$\displaystyle\textit{GPS}(\textit{UPM})=\frac{K}{\sum_{i=1}^{K}1/\textit{UPM}[i]}$ (9)

Likewise, other instances of GPS are calculated following a similar approach.
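The two-level aggregation behind GPS(UPM) can be sketched with the standard library's `statistics.harmonic_mean`. The per-class scores below are hypothetical values for a three-class problem and are not results from the study.

```python
from statistics import harmonic_mean

# Hypothetical one-vs-rest scores for three classes (illustrative only).
per_class = [
    {"recall": 0.90, "precision": 0.85, "specificity": 0.95, "npv": 0.96},
    {"recall": 0.70, "precision": 0.75, "specificity": 0.88, "npv": 0.86},
    {"recall": 0.60, "precision": 0.55, "specificity": 0.80, "npv": 0.83},
]

# UPM[i]: harmonic mean of the four measures of class i (Eq. (8) per class).
upm = [harmonic_mean(list(scores.values())) for scores in per_class]

# GPS(UPM): harmonic mean of the per-class UPM values, Eq. (9).
gps_upm = harmonic_mean(upm)
print([round(u, 3) for u in upm], round(gps_upm, 3))
```

Because the harmonic mean is pulled towards its smallest argument, the weakest class has the largest influence on GPS(UPM).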

In this work, along with HM, AM and GM have also been used to combine measures. GM and AM are defined in Eqs (10) and (11):

$\displaystyle\textit{GM}(p_{1},p_{2},\ldots,p_{n})=\sqrt[n]{\prod_{i=1}^{n}p_{i}}$ (10)

$\displaystyle\textit{AM}(p_{1},p_{2},\ldots,p_{n})=\frac{\sum_{i=1}^{n}p_{i}}{n}$ (11)

HM is always less than or equal to the geometric mean, and the geometric mean is always less than or equal to the arithmetic mean [18].

$\displaystyle\textit{HM}\leqslant\textit{GM}\leqslant\textit{AM}$ (12)

When combining multiple measures, HM is preferred over GM and AM because it is more sensitive to smaller values in the aggregation. It emphasizes the contribution of smaller values and avoids the distortion caused by larger ones.
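The ordering in Eq. (12), and the way HM reacts to one weak measure, can be checked numerically. The values below are hypothetical; the three mean functions come from Python's standard `statistics` module (Python 3.8 or later).

```python
from statistics import fmean, geometric_mean, harmonic_mean

# Two strong measures and one weak measure (illustrative values only).
measures = [0.95, 0.90, 0.30]

hm = harmonic_mean(measures)   # HM, Eq. (7)
gm = geometric_mean(measures)  # GM, Eq. (10)
am = fmean(measures)           # AM, Eq. (11)

print(f"HM={hm:.3f}  GM={gm:.3f}  AM={am:.3f}")
```

With these inputs HM sits closest to the weak value of 0.30, while AM is dominated by the two strong values.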

4. Proposed methodology

This section provides details of the proposed performance measure (please see Section 4.1). Section 4.2 presents the experimental design followed in the present study.

4.1 Weighted general performance score (W-GPS)

GPS establishes a standardized measure by combining various performance measures using the harmonic mean, but it offers no clear account of how each performance measure contributes to the combined measure, making it difficult to interpret the influence of each performance measure. To overcome this limitation, in this paper, the W-GPS is proposed to combine measures according to their relative variability using weighted HM, where the weights are determined from the coefficient of variation (CV). These weights use the CV of each measure across 100 iterations of model training, assigning lower weights to measures with higher variability and vice versa.

For the $i^{\text{th}}$ performance measure, the coefficient of variation ($\textit{CV}_{i}$) is computed as in Eq. (13):

$\displaystyle\textit{CV}_{i}=\frac{\textit{Standard}\;\textit{Deviation}\;\textit{of}\;\textit{the}\;i^{\text{th}}\;\textit{measure}}{\textit{Mean}\;\textit{of}\;\textit{the}\;i^{\text{th}}\;\textit{measure}}$ (13)

To scale the values of $\textit{CV}_{i}$ to the range 0 to 1, each $\textit{CV}_{i}$ is divided by the sum of all CVs, giving the normalized value $\textit{nCV}_{i}$ in Eq. (14):

$\displaystyle\textit{nCV}_{i}=\frac{\textit{CV}_{i}}{\textit{CV}_{1}+\textit{CV}_{2}+\cdots+\textit{CV}_{n}}$ (14)

Subsequently, the weight of each measure is determined as the average of the normalized CVs of the other measures, so that higher variability corresponds to a lower weight and vice versa. The weight of the $i^{\text{th}}$ measure, denoted by $w_{i}$, is computed from the normalized values of Eq. (14), written $\textit{nCV}_{j}$, as in Eq. (15):

$\displaystyle w_{i}=\frac{1}{n-1}\sum_{\begin{subarray}{c}j=1\\ j\neq i\end{subarray}}^{n}\textit{nCV}_{j}$ (15)
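The chain from per-iteration scores to weights (Eqs. (13) to (15)) might be sketched as follows. Several points are assumptions: the weights are built from the normalized CVs, as in the worked example; the population standard deviation is used, since the text does not say which form applies; and the input format, function name, and scores are hypothetical. The sketch requires at least two measures and a non-zero total CV.

```python
from statistics import mean, pstdev


def cv_weights(scores_by_measure):
    """Map per-iteration scores of each measure to W-GPS weights.

    scores_by_measure: dict of measure name -> list of scores, one per
    holdout iteration (a hypothetical input format).
    """
    # Eq. (13): CV = standard deviation / mean.
    # Assumption: population standard deviation (pstdev).
    cv = {m: pstdev(s) / mean(s) for m, s in scores_by_measure.items()}

    # Eq. (14): normalise so that the CVs sum to 1.
    total = sum(cv.values())
    ncv = {m: value / total for m, value in cv.items()}

    # Eq. (15): average normalised CV of the *other* measures. Since the
    # normalised CVs sum to 1, the sum over j != i equals 1 - nCV_i.
    n = len(ncv)
    return {m: (1.0 - ncv[m]) / (n - 1) for m in ncv}


# Hypothetical scores over four iterations, for illustration only.
scores = {
    "recall": [0.80, 0.82, 0.78, 0.81],
    "precision": [0.70, 0.90, 0.60, 0.85],
    "specificity": [0.88, 0.87, 0.89, 0.88],
}
weights = cv_weights(scores)
print({m: round(w, 3) for m, w in weights.items()})
```

In this toy input, precision varies the most across iterations and therefore receives the smallest weight.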

Let $p_{1},p_{2},\ldots,p_{n}$ be $n$ different performance measures and $w_{1},w_{2},\ldots,w_{n}$ be the weights of each performance measure. The W-GPS is defined as follows:

$\displaystyle\text{W-GPS}(p_{1},p_{2},\ldots,p_{n})=\frac{W}{\sum_{i=1}^{n}w_{i}/p_{i}}$ (16)

Based on Eq. (15), with the weights built from the normalized CVs of Eq. (14), $W=\sum_{i=1}^{n}w_{i}=1$ can be inferred, i.e., the weights of the underlying performance measures always sum to 1. Therefore, Eq. (16) can be rewritten as Eq. (17):

$\displaystyle\text{W-GPS}(p_{1},p_{2},\ldots,p_{n})=\frac{1}{\sum_{i=1}^{n}w_{i}/p_{i}}$ (17)
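The claim that the weights sum to one can be checked directly, assuming the weights in Eq. (15) are formed from the normalized CVs of Eq. (14), written $\textit{nCV}_{j}$ here as in the worked example:

$\displaystyle\sum_{i=1}^{n}w_{i}=\frac{1}{n-1}\sum_{i=1}^{n}\sum_{\begin{subarray}{c}j=1\\ j\neq i\end{subarray}}^{n}\textit{nCV}_{j}=\frac{1}{n-1}\sum_{j=1}^{n}(n-1)\,\textit{nCV}_{j}=\sum_{j=1}^{n}\textit{nCV}_{j}=1$

Each $\textit{nCV}_{j}$ appears in the weight of every measure except the $j^{\text{th}}$, that is, $n-1$ times, which cancels the factor $1/(n-1)$; the final equality holds because the normalized CVs sum to 1 by construction.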

As an example, the computation of the proposed W-GPS proceeds in the following steps:

Step 1

Assume three performance measure values, $p_{1}=0.9$, $p_{2}=0.8$, and $p_{3}=0.7$, and their corresponding coefficient of variation (CV) values, $\textit{CV}_{1}=0.4$, $\textit{CV}_{2}=0.7$, and $\textit{CV}_{3}=0.3$.

Step 2

Compute normalized CV values using Eq. (14):

$\displaystyle\textit{nCV}_{1}=\frac{\textit{CV}_{1}}{\textit{CV}_{1}+\textit{CV}_{2}+\textit{CV}_{3}}=0.286$
$\displaystyle\textit{nCV}_{2}=\frac{\textit{CV}_{2}}{\textit{CV}_{1}+\textit{CV}_{2}+\textit{CV}_{3}}=0.5$
$\displaystyle\textit{nCV}_{3}=\frac{\textit{CV}_{3}}{\textit{CV}_{1}+\textit{CV}_{2}+\textit{CV}_{3}}=0.214$

Step 3

Compute weights using Eq. (15):

$\displaystyle w_{1}=\frac{\textit{nCV}_{2}+\textit{nCV}_{3}}{2}=0.36$
$\displaystyle w_{2}=\frac{\textit{nCV}_{1}+\textit{nCV}_{3}}{2}=0.25$
$\displaystyle w_{3}=\frac{\textit{nCV}_{1}+\textit{nCV}_{2}}{2}=0.39$

Step 4

Compute the W-GPS using Eq. (16):

$\displaystyle\text{W-GPS}=\frac{0.36+0.25+0.39}{\frac{w_{1}}{p_{1}}+\frac{w_{2}}{p_{2}}+\frac{w_{3}}{p_{3}}}=\frac{1}{\frac{0.36}{0.9}+\frac{0.25}{0.8}+\frac{0.39}{0.7}}=\frac{1}{0.4+0.312+0.557}=0.787$

So, the computed W-GPS value is approximately 0.787.
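The four steps can be replayed in code as a check. This is a sketch: the function name `w_gps` is hypothetical, and returning 0.0 for a zero-valued measure is an assumed convention. Intermediate values are kept unrounded, so individual terms may differ in the last digit from the rounded figures in the steps.

```python
def w_gps(measures, weights):
    """Weighted harmonic mean of performance measures, Eq. (16)."""
    if any(p == 0 for p in measures):
        # Assumed convention: a zero measure yields a zero score.
        return 0.0
    return sum(weights) / sum(w / p for w, p in zip(weights, measures))


# Step 1: measure values and their CVs from the worked example.
p = [0.9, 0.8, 0.7]
cv = [0.4, 0.7, 0.3]

# Step 2: normalised CVs, Eq. (14).
ncv = [c / sum(cv) for c in cv]

# Step 3: each weight is the average normalised CV of the other measures.
n = len(ncv)
w = [(sum(ncv) - ncv[i]) / (n - 1) for i in range(n)]

# Step 4: W-GPS, Eq. (16).
score = w_gps(p, w)
print([round(x, 3) for x in ncv], [round(x, 2) for x in w], round(score, 3))
```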

Based on Eq. (16), we have considered different combinations such as W-GPS (Recall, Precision, Specificity, and NPV), W-GPS (Recall, Precision), W-GPS (Recall, Specificity), W-GPS (Precision, NPV), and W-GPS (Specificity, NPV) in the binary classification scenario. Further, for multi-class classification scenarios, several instances of W-GPS, such as W-GPS(UPM), W-GPS(Recall), W-GPS(Precision), W-GPS(NPV), and W-GPS(Recall, Precision) are considered. The One-vs-Rest technique is applied to handle multiple classes, resulting in $K$ binary confusion matrices, where $K$ is the total number of classes. For instance, W-GPS(UPM) is specifically defined as the weighted harmonic mean of the Unified Performance Measures (UPM) [52] across all individual classes, denoted as $(\textit{UPM}[1],\textit{UPM}[2],\cdots,\textit{UPM}[K])$. This approach combines the UPM values associated with each class into a unified measure, as shown in Eq. (18):

$\displaystyle\text{W-GPS(UPM)}=\frac{1}{\sum_{i=1}^{K}w_{i}/\textit{UPM}[i]}$ (18)

Likewise, other instances of W-GPS are calculated following a similar approach.

The proposed W-GPS satisfies the following properties:

Property 1: When all involved measures’ scores are maximum, the W-GPS is also maximum.

Proof: Suppose there are $n$ measures, each with a score of 1, i.e., $p_{1}=p_{2}=\cdots=p_{n}=1$. In this scenario, the W-GPS formula simplifies as follows:

$\displaystyle\text{W-GPS}(p_{1},p_{2},\ldots,p_{n})=\frac{W}{w_{1}/1+w_{2}/1+\cdots+w_{n}/1}=\frac{W}{\sum_{i=1}^{n}w_{i}},$ $\displaystyle\text{where }W=\sum_{i=1}^{n}w_{i}.$

Hence, in this case, W-GPS $=$ 1.

Property 2: If any performance measure value is 0, the W-GPS is also 0.

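$\displaystyle\text{W-GPS}(p_{1},p_{2},\ldots,p_{n})=0,\text{ if }\exists\,p_{k}=0,$ $\displaystyle\text{where }k\in\{1,2,\ldots,n\}$

This follows from Eq. (16): as $p_{k}\to 0$ with $w_{k}>0$, the term $w_{k}/p_{k}$ grows without bound, so the score tends to 0.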

It is noted that the computation of weights to be assigned to the metrics $p_{1},p_{2},\ldots,p_{n}$ is linear in the dataset size (please see Eq. (15)). Hence, the computation of W-GPS remains computationally feasible even for large datasets (please see Eq. (18)).

4.2 Experimental design

This section discusses the experimental design followed by research objectives and workflow. Table 3 outlines the research objectives (ROs) related to TPMs, GPS, and W-GPS and the corresponding formulated research questions (RQ) for the study. Figure 1 depicts the overall research flow.

Figure 1.

Research flow.

The first step is the selection of relevant datasets for the classification task. For this purpose, several datasets from the UCI repository are selected based on their balanced and unbalanced nature. Among these datasets, nine have binary classes, and 22 are for multi-class classification. Table 2 summarises these datasets based on the number of records, features, classes, and percentage of the majority class. Furthermore, to examine the impact of performance measures in challenging conditions, different proportions of noise, which include 20% and 30%, are added to the raw data. Subsequently, datasets listed in Table 2 undergo preprocessing involving data cleaning, normalization, feature selection, and dimensionality reduction. Following that, a diverse range of classifiers is chosen for experimental purposes, including Support Vector Machines (SVM), Neural Networks, Logistic Regression, Naive Bayes, K-nearest Neighbors, Random Forests, Decision Trees, Bagged Trees, and Boosted Trees. Following classifier selection, the next step is model construction. This entails dividing the dataset into training and test subsets, allocating 80% of instances for training and 20% for testing. Each classifier undergoes training on various splits for 100 iterations. Afterwards, the classifiers’ performance is evaluated using a range of performance assessment methods, which encompass TPMs (Traditional Performance Measures), GPS (General Performance Score), and W-GPS (Weighted General Performance Score), all of which are applied to the test dataset. As the final step of the process, a descriptive analysis of the proposed research objectives and questions is carried out.
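The repeated 80/20 holdout described above can be sketched with the standard library alone. The function name, the fixed seed, the record count, and the use of plain (non-stratified) random shuffling are all assumptions; the text does not specify how the splits are drawn.

```python
import random


def repeated_holdout(n_records, iterations=100, test_fraction=0.2, seed=0):
    """Yield (train_idx, test_idx) index lists for repeated random splits."""
    rng = random.Random(seed)  # fixed seed so the splits are reproducible
    n_test = round(n_records * test_fraction)
    indices = list(range(n_records))
    for _ in range(iterations):
        shuffled = indices[:]
        rng.shuffle(shuffled)
        # First n_test shuffled indices form the test set, the rest train.
        yield shuffled[n_test:], shuffled[:n_test]


# Hypothetical dataset of 500 records, for illustration only.
splits = list(repeated_holdout(n_records=500))
train_idx, test_idx = splits[0]
print(len(splits), len(train_idx), len(test_idx))  # prints: 100 400 100
```

Each classifier would be fitted on `train_idx` and scored on `test_idx` in every iteration, giving the per-measure score lists from which the CVs are computed.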

5. Experimental details

All experiments are conducted using Python 3.9.12 in the Jupyter Notebook environment (with an Intel Core i5 processor and 8 GB of RAM on the Windows 10 operating system). The experiments involve assessing performance measures on diverse datasets using various ML classifiers. Section 5.1 examines the consistency of different TPMs, GPS, and proposed W-GPS-based measures. Section 5.2 runs a stability test on GPS and W-GPS-based measures. Finally, a comparison of W-GPS with the performance measures found in the literature is provided in Section 5.3.

5.1 Consistency of TPMs, GPS, and proposed W-GPS-based measures

The experimental results presented in this section are aligned with the research objectives listed in Section 4.2. To meet these objectives, we assessed TPMs, GPS, and W-GPS in each classification scenario (binary or multi-class, balanced or imbalanced, and noisy or noiseless data). The experiments are conducted on all the datasets listed in Table 2; however, we depict the boxplots only for the following selected datasets:

1. Diabetes Dataset (binary, binary $+$ balanced, binary $+$ 20% noise, and binary $+$ 30% noise)
2. Contraceptive Dataset (multi-class, multi-class $+$ balanced, multi-class $+$ 20% noise, and multi-class $+$ 30% noise)

Figure 2.
Variability in traditional performance measures (TPMs) across datasets: (a) Diabetes Dataset (Binary), (b) Balanced Diabetes Dataset (Binary), (c) Diabetes Dataset (Binary, 20% Noise), (d) Contraceptive Dataset (Multi-class), (e) Balanced Contraceptive Dataset (Multi-class), and (f) Contraceptive Dataset (Multi-class, 20% Noise).

Figure 2 depicts boxplots of the TPM scores for the above-selected datasets across 100 iterations of 13 classifiers. Figure 2(a) shows the boxplots for the diabetes dataset, which has a class distribution ratio of 65:35. The interquartile range (lower bound, upper bound) differs across the performance metrics: accuracy (0.67, 0.79), recall (0.70, 0.80), precision (0.75, 0.89), specificity (0.55, 0.65), and NPV (0.50, 0.60). These ranges indicate less variation of the values within each TPM. A similar trend is observed for the balanced scenario, as illustrated in Fig. 2(b). Figure 2(c) depicts the boxplots for the noisy diabetes dataset, in which the interquartile ranges widen noticeably. This indicates that noise in the dataset leads to higher variability in the performance metrics. Figure 2(d) shows the boxplots for the contraceptive dataset, which has a class distribution ratio of 43:34:23. Here, too, the interquartile ranges of accuracy (0.62, 0.72), recall (0.46, 0.53), precision (0.46, 0.55), specificity (0.70, 0.80), and NPV (0.74, 0.80) show less variation from lower to higher values. This trend is also noted in the balanced scenario, depicted in Fig. 2(e). Figure 2(f) depicts the boxplots for the noisy contraceptive dataset, in which the interquartile ranges again widen noticeably. Thus, noise also affects the consistency of the TPMs. It is also noted that, for the diabetes dataset, specificity and NPV scores are lower than accuracy, recall, and precision. These visualizations highlight significant dissimilarities in performance metrics across different classifiers on the same dataset, as is evident from the differences in box size and position. This demonstrates clearly that different metrics exhibit varying behaviours.
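The interquartile ranges quoted for the boxplots can be obtained from the per-iteration scores of a measure. The following is a minimal sketch using only the Python standard library; the score values are hypothetical placeholders rather than results from the paper, and the choice of the "inclusive" quantile method is an assumption, since plotting libraries may use slightly different conventions.

```python
import statistics


def interquartile_bounds(scores):
    """Return (Q1, Q3), the lower and upper edges of the box in a boxplot."""
    q1, _median, q3 = statistics.quantiles(scores, n=4, method="inclusive")
    return q1, q3


# Hypothetical accuracy values from five runs (illustration only).
accuracy_runs = [0.67, 0.70, 0.73, 0.76, 0.79]
q1, q3 = interquartile_bounds(accuracy_runs)
box_width = q3 - q1  # a wider box means higher variability across runs
```

Comparing `box_width` between the raw and the noisy variant of a dataset gives a simple numerical counterpart to the visual comparison of box sizes.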

Figure 3.
Variability in several instances of General Performance Score (GPS) across datasets: (a) Diabetes Dataset (Binary), (b) Balanced Diabetes Dataset (Binary), (c) Diabetes Dataset (Binary, 20% Noise), (d) Contraceptive Dataset (Multi-class), (e) Balanced Contraceptive Dataset (Multi-class), and (f) Contraceptive Dataset (Multi-class, 20% Noise).

Figure 3 depicts boxplots of the scores of different GPS instances for the above-selected datasets across 100 iterations of 13 classifiers. Figure 3(a) shows the boxplots for the diabetes dataset. The interquartile range (lower bound, upper bound) differs across instances: GPS_RPSN (0.56, 0.70), GPS_RP (0.75, 0.82), GPS_RS (0.60, 0.75), GPS_PN (0.62, 0.70), and GPS_SN (0.55, 0.62). These ranges indicate less variation of the values within each GPS instance. A similar trend is observed for the balanced scenario, as illustrated in Fig. 3(b). Figure 3(c) depicts the boxplots for the noisy diabetes dataset, in which the interquartile ranges widen noticeably. This indicates that noise in the dataset leads to higher variability in the performance metrics. Figure 3(d) shows the boxplots for the contraceptive dataset. Here, too, the interquartile ranges of GPS_UPM (0.55, 0.62), GPS_Recall (0.42, 0.52), GPS_Precision (0.75, 0.80), GPS_NPV (0.75, 0.78), and GPS_Recall_Precision (0.40, 0.50) show less variation from lower to higher values. This trend is also noted in the balanced scenario, depicted in Fig. 3(e). Figure 3(f) depicts the boxplots for the noisy contraceptive dataset, in which the interquartile ranges again widen noticeably. Thus, noise also affects the consistency of GPS. These visualizations highlight significant dissimilarities in performance metrics across different classifiers on the same dataset, as is evident from the differences in box size and position.

Figure 4.
Variability in several instances of Weighted General Performance Score (W-GPS) across different classification scenarios: (a) Diabetes Dataset (Binary), (b) Balanced Diabetes Dataset (Binary), (c) Diabetes Dataset (Binary, 20% Noise), (d) Contraceptive Dataset (Multi-class), (e) Balanced Contraceptive Dataset (Multi-class), and (f) Contraceptive Dataset (Multi-class, 20% Noise).

Figure 4 depicts boxplots of the scores of different W-GPS instances for the above-selected datasets across 100 iterations of 13 classifiers. Figure 4(a) shows the boxplots for the diabetes dataset. The interquartile range (lower bound, upper bound) differs across instances: W-GPS_RPSN (0.62, 0.74), W-GPS_RP (0.85, 0.91), W-GPS_RS (0.88, 0.92), W-GPS_PN (0.70, 0.75), and W-GPS_SN (0.72, 0.79). These ranges indicate less variation of the values within each W-GPS instance. A similar trend is observed for the balanced scenario, as illustrated in Fig. 4(b). Figure 4(c) depicts the boxplots for the noisy diabetes dataset, in which the interquartile ranges widen noticeably. This indicates that noise in the dataset leads to higher variability in the performance metrics. Figure 4(d) shows the boxplots for the contraceptive dataset. Here, too, the interquartile ranges of W-GPS_UPM (0.70, 0.85), W-GPS_Recall (0.50, 0.80), W-GPS_Precision (0.52, 0.82), W-GPS_NPV (0.90, 0.95), and W-GPS_Recall_Precision (0.10, 0.60) are reported to show less variation from lower to higher values. This trend is also noted in the balanced scenario, depicted in Fig. 4(e). Figure 4(f) depicts the boxplots for the noisy contraceptive dataset, in which the interquartile ranges again widen noticeably. Thus, noise also affects the consistency of W-GPS. These visualizations highlight significant dissimilarities in performance metrics across different classifiers on the same dataset, as is evident from the differences in box size and position.

It is noted that different approaches to performance measurement exhibit varying behaviours, and that no single performance measure is consistent across all scenarios. This inconsistency across classifiers makes it challenging to determine the best model based on a single measure alone. It suggests that a thorough evaluation involving multiple measures, and possibly ensemble methods, might be needed to make a well-informed decision about model selection. The same behaviour was observed for the rest of the examined datasets.

Table 2
Datasets used for the experiments

Datasets	Instances	Features	Classes	Class ratio (%)
Adult dataset [55]	48842	14	2	70:30
Blood transfusion dataset [56]	748	5	2	75:25
Diabetes dataset [57]	768	9	2	65:35
Fertility dataset [58]	100	10	2	90:10
Haberman survival dataset [59]	306	5	2	75:25
Ionosphere dataset [60]	351	34	2	65:35
Parkinson dataset [61]	197	23	2	75:25
Raisin dataset [62]	900	8	2	50:50
Thoracic surgery dataset [63]	470	17	2	85:15
Admission dataset	85	2	3	36:34:30
QSAR bioconcentration dataset [64]	779	14	3	60:30:10
Vertebral column dataset [65]	310	5	3	50:30:20
Contraceptive dataset [66]	1473	9	3	43:34:23
Thyroid dataset [67]	7200	21	3	95:4:1
Wine dataset [68]	178	13	3	40:33:27
Iris dataset [69]	150	4	3	33:33:33
Wheat seeds dataset [70]	196	7	3	33:33:33
Balance scale dataset [71]	625	4	3	53:45:8
Connect-4 dataset [72]	67557	42	3	65:25:10
Car evaluation dataset [73]	1728	6	4	70:22:4:4
Vehicle dataset [74]	946	18	4	25:25:25:25
HCV dataset [75]	615	14	5	90:4:3:2:1
Dermatology dataset [76]	366	33	6	31:20:17:13:13:6
Glass dataset [77]	214	10	6	35:33:13:8:6:4
Wine quality dataset [78]	4898	12	7	45:30:19:4:3:1:1
ZOO dataset [79]	101	16	7	40:20:13:9:9:5:4
Mice protein dataset [80]	1080	82	8	16:13:13:13:13:11:11:9

Table 3
Research objectives and questions

RO1: Examine the influence of TPMs on diverse datasets using widely adopted classifiers.	RO2: Examine the influence of several instances of GPS on diverse datasets using widely adopted classifiers.	RO3: Examine the influence of several instances of W-GPS on diverse datasets using widely adopted classifiers.
RQ1: Are all TPMs consistent for evaluating the given problem?	RQ1: Are all GPS consistent for evaluating the given problem?	RQ1: Are all W-GPS consistent for evaluating the given problem?
RQ2: What is the impact of balanced class distribution on TPMs?	RQ2: What is the impact of balanced class distribution on GPS?	RQ2: What is the impact of balanced class distribution on W-GPS?
RQ3: What is the impact of noise variants of datasets on TPMs?	RQ3: What is the impact of noise variants of datasets on GPS?	RQ3: What is the impact of noise variants of datasets on W-GPS?

5.2 Variability of GPS and W-GPS

This section examines the variability of GPS and W-GPS across different classifiers for different scenarios, namely raw, balanced, and noisy datasets. Table 4 shows the mean and standard deviation (SD) of GPS_RPSN, the general performance score of the combination of recall, precision, specificity, and NPV, and of W-GPS_RPSN, the weighted general performance score of the same combination, for binary datasets. Table 7 shows the mean and SD of GPS_UPM, the general performance score of the combination of the UPM of each class, and of W-GPS_UPM, the weighted general performance score of the same combination, for multi-class datasets. It is observed that W-GPS_RPSN has less variability than GPS_RPSN. For instance, for the adult dataset (Table 4), GPS has a mean of 0.599, whereas W-GPS has a notably higher mean of 0.684. Additionally, GPS has an SD of 0.099, while W-GPS exhibits lower variability with an SD of 0.058.

The same behaviour was observed for multi-class datasets. For instance, for the balance scale dataset (Table 7), GPS_UPM has a mean of 0.317, whereas W-GPS_UPM has a notably higher mean of 0.776. Additionally, GPS_UPM has an SD of 0.392, while W-GPS_UPM exhibits lower variability with an SD of 0.162. This lower SD suggests that W-GPS offers a more consistent assessment across different classifiers than GPS; that is, the proposed weighted approach is more reliable and less sensitive to fluctuations. Thus, integrating weights into individual measures supports model selection with more dependable and consistent outcomes.

5.3 Comparative analysis of W-GPS with alternative measures

In this section, W-GPS is compared against alternative evaluation measures, including accuracy, recall, precision, specificity, NPV, F1 measure, MCC, GPS [18], and composite metrics based on the Geometric Mean (GM) and Arithmetic Mean (AM) [51]. Table 5 displays the scores of the different metrics for binary datasets. It is noted that metrics such as accuracy, recall, precision, specificity, and NPV may not be reliable assessments of model performance, because high values of these metrics are sometimes biased towards the majority class. Therefore, in some cases, individual metrics do not reflect the overall performance of the model. To overcome this issue, GPS [18] is defined as the harmonic mean of a set of metrics, which penalizes the metrics exhibiting inferior values. However, GPS fails to consider the relative importance of these metrics: when users are presented with distinct metrics, it is unclear how much weight to give to each score. The proposed metric, W-GPS, provides a data-driven approach to assigning suitable weights to these metrics based on their variability across multiple runs. Thus, the W-GPS measure provides a more refined evaluation of the classifier’s performance. For example, in the case of the diabetes dataset (please see Table 5), the mean GPS_RPSN value is 0.717, while the W-GPS_RPSN surpasses it with a mean value of 0.725. Likewise, for the multi-class connect-4 dataset (please see Table 6), the mean GPS value is 0.605, while the W-GPS measure surpasses it with a mean value of 0.762.
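To make the difference between equal and variability-based weighting concrete, the sketch below computes GPS as an unweighted harmonic mean and a W-GPS-style score as a weighted harmonic mean. The exact W-GPS weighting rule is defined earlier in the paper and is not restated in this section, so the inverse-CV weights used here, the function names, and the per-run values are illustrative assumptions only. The mean recall, precision, specificity, and NPV are taken from the diabetes column of Table 5.

```python
import statistics


def weighted_harmonic_mean(values, weights=None):
    """Harmonic mean of the measures; equal weights give a GPS-style score."""
    if weights is None:
        weights = [1.0] * len(values)
    if any(v <= 0 for v in values):
        return 0.0  # a zero-valued measure drives the harmonic mean to zero
    return sum(weights) / sum(w / v for w, v in zip(weights, values))


def inverse_cv_weights(runs_per_measure, eps=1e-12):
    """ASSUMPTION: weight each measure by 1/CV, normalised to sum to one.

    CV = SD / mean over repeated runs; this is a stand-in for the paper's
    CV-based weighting, whose exact form is not given in this section.
    """
    inverse_cv = []
    for runs in runs_per_measure:
        cv = statistics.stdev(runs) / statistics.mean(runs)
        inverse_cv.append(1.0 / (cv + eps))
    total = sum(inverse_cv)
    return [w / total for w in inverse_cv]


# Recall, precision, specificity, NPV for the diabetes dataset (Table 5).
diabetes_means = [0.776, 0.819, 0.641, 0.558]
gps = weighted_harmonic_mean(diabetes_means)  # approximately 0.682

# Hypothetical per-run values for the same four measures (illustration only).
hypothetical_runs = [
    [0.76, 0.78, 0.79],  # recall: low variability
    [0.80, 0.82, 0.84],  # precision: low variability
    [0.55, 0.64, 0.73],  # specificity: high variability
    [0.45, 0.56, 0.66],  # NPV: high variability
]
weights = inverse_cv_weights(hypothetical_runs)
wgps_sketch = weighted_harmonic_mean(diabetes_means, weights)
```

Under this assumed scheme, the stable measures (recall and precision) receive larger weights than the volatile ones, so the weighted score depends less on measures that fluctuate strongly between runs.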

Table 4
Mean and standard deviation (SD) of GPS_RPSN and W-GPS_RPSN for binary datasets in various scenarios

Dataset	Balanced/unbalanced	Raw dataset				Balanced dataset
		GPS_RPSN		W-GPS_RPSN		GPS_RPSN		W-GPS_RPSN
		Mean	SD	Mean	SD	Mean	SD	Mean	SD
Adult dataset	Unbalanced	0.599	0.099	0.684	0.058	0.613	0.090	0.660	0.060
Blood transfusion	Unbalanced	0.453	0.076	0.542	0.061	0.539	0.058	0.582	0.041
Diabetes	Unbalanced	0.663	0.095	0.699	0.048	0.677	0.092	0.701	0.049
Fertility	Unbalanced	0.188	0.139	0.258	0.185	0.249	0.076	0.358	0.102
Haberman survival	Unbalanced	0.440	0.042	0.535	0.031	0.528	0.053	0.580	0.044
Ionosphere	Unbalanced	0.878	0.042	0.894	0.034	0.879	0.037	0.894	0.034
Parkinson	Unbalanced	0.763	0.142	0.805	0.111	0.748	0.158	0.785	0.130
Thoracic surgery	Unbalanced	0.111	0.066	0.175	0.110	0.239	0.121	0.328	0.088
Raisin	Balanced	0.795	0.134	0.817	0.081	–	–	–	–
Diabetes (20% noise)	Unbalanced	0.593	0.379	0.643	0.110	–	–	–	–
Diabetes (30% noise)	Unbalanced	0.570	0.314	0.619	0.101	–	–	–	–

Table 5

Computed TPMs, GPS, and W-GPS in case of binary datasets

Measures	Datasets
	Adult	Blood transfusion	Diabetes	Ionosphere	Parkinson
Accuracy	0.798	0.720	0.727	0.890	0.821
Recall	0.827	0.805	0.776	0.895	0.719
Precision	0.928	0.843	0.819	0.895	0.719
Specificity	0.671	0.474	0.641	0.894	0.881
NPV	0.384	0.330	0.558	0.945	0.871
F1 measure	0.877	0.837	0.820	0.901	0.784
MCC	0.443	0.217	0.452	0.854	0.734
GPS_RPSN [18, 52]	0.612	0.482	0.681	0.874	0.764
W-GPS_RPSN (proposed)	0.680	0.548	0.888	0.929	0.790

Table 6

Computed TPMs, GPS, and W-GPS in case of multi-class datasets

Measures	Datasets
	Connect-4	Iris	Vertebral column	Wheat seeds	Car evaluation	Vehicle
Recall_1	0.157	1.000	0.638	0.883	0.989	0.982
Recall_2	0.963	0.923	0.977	0.981	0.974	0.565
Recall_3	0.745	0.925	0.783	0.945	0.891	0.985
Recall_4	NA	NA	NA	NA	0.908	0.493
Precision_1	0.566	1.000	0.713	0.929	0.995	0.878
Precision_2	0.848	0.932	0.956	0.971	0.943	0.569
Precision_3	0.813	0.930	0.772	0.923	0.958	0.951
Precision_4	NA	NA	NA	NA	0.996	0.842
NPV_1	0.917	1.000	0.916	0.946	0.974	0.994
NPV_2	0.904	0.963	0.979	0.990	0.992	0.849
NPV_3	0.919	0.964	0.897	0.975	0.996	0.995
NPV_4	NA	NA	NA	NA	0.996	0.995
Accuracy_macro	0.888	0.966	0.899	0.959	0.989	0.877
Recall_macro	0.622	0.949	0.799	0.937	0.940	0.756
Precision_macro	0.742	0.954	0.814	0.941	0.957	0.745
Specificity_macro	0.866	0.975	0.926	0.969	0.992	0.918
NPV_macro	0.913	0.976	0.931	0.970	0.990	0.920
F1+ Score	0.642	0.949	0.802	0.936	0.946	0.747
F1- Score	0.883	0.975	0.928	0.969	0.991	0.919
GM_UPM [51]	0.650	0.961	0.850	0.952	0.967	0.800
AM_UPM [51]	0.689	0.961	0.855	0.952	0.967	0.815
GPS_UPM [18, 52]	0.606	0.960	0.845	0.951	0.967	0.786
GPS_recall [18]	0.342	0.944	0.763	0.931	0.934	0.678
GPS_precision [18]	0.718	0.950	0.797	0.938	0.955	0.703
GPS_NPV [18]	0.913	0.975	0.929	0.969	0.990	0.914
W-GPS_UPM (proposed)	0.763	0.970	0.869	0.955	0.969	0.818
W-GPS_recall (proposed)	0.629	0.960	0.821	0.941	0.949	0.743
W-GPS_precision (proposed)	0.789	0.964	0.825	0.943	0.960	0.724
W-GPS_NPV (proposed)	0.916	0.982	0.933	0.972	0.992	0.928

Table 7

Mean and standard deviation (SD) of GPS_UPM and W-GPS_UPM for multi-class datasets

Dataset	Balanced/unbalanced	Raw dataset				Balanced dataset
		GPS_UPM		W-GPS_UPM		GPS_UPM		W-GPS_UPM
		Mean	SD	Mean	SD	Mean	SD	Mean	SD
Balance scale dataset	Unbalanced	0.317	0.392	0.776	0.162	0.506	0.355	0.756	0.178
Bioconcentration dataset	Unbalanced	0.644	0.121	0.686	0.053	0.648	0.093	0.670	0.069
Car_evaluation dataset	Unbalanced	0.786	0.260	0.818	0.206	0.823	0.177	0.833	0.174
Dermatology dataset	Unbalanced	0.937	0.110	0.950	0.054	0.955	0.053	0.952	0.055
Glass dataset	Unbalanced	0.952	0.070	0.923	0.132	0.952	0.077	0.937	0.090
HCV dataset	Unbalanced	0.317	0.371	0.635	0.071	0.381	0.367	0.654	0.076
Thyroid dataset	Unbalanced	0.316	0.395	0.697	0.246	0.939	0.120	0.954	0.081
Vertebral column dataset	Unbalanced	0.811	0.075	0.843	0.023	0.817	0.066	0.842	0.025
Wine dataset	Unbalanced	0.903	0.136	0.928	0.111	0.923	0.167	0.885	0.199
Wine quality dataset	Unbalanced	0.010	0.067	0.186	0.068	0.058	0.117	0.229	0.059
ZOO dataset	Unbalanced	0.954	0.151	0.965	0.011	0.950	0.170	0.964	0.012
Contraceptive dataset	Unbalanced	0.572	0.080	0.578	0.057	0.577	0.074	0.582	0.053
Iris dataset	Balanced	0.960	0.055	0.967	0.034	0.960	0.055	0.967	0.034
Connect-4 dataset	Balanced	0.394	0.261	0.547	0.216	0.394	0.261	0.547	0.216
Admission dataset	Balanced	0.902	0.101	0.910	0.075	0.902	0.101	0.910	0.075
Wheat_seeds dataset	Balanced	0.937	0.062	0.942	0.032	0.937	0.062	0.942	0.032
Vehicle dataset	Balanced	0.742	0.129	0.768	0.111	0.742	0.129	0.768	0.111
Mice protein expression dataset	Balanced	0.940	0.102	0.814	0.042	0.940	0.102	0.814	0.042

Table 8

Mean and standard deviation (SD) of GPS_UPM and W-GPS_UPM for noisy environment

Dataset	GPS_UPM		W-GPS_UPM
	Mean	SD	Mean	SD
Contraceptive dataset (20% noise)	0.583	0.141	0.603	0.104
Contraceptive dataset (30% noise)	0.590	0.153	0.617	0.126
Balance scale dataset (20% noise)	0.734	0.039	0.771	0.042
Balance scale dataset (30% noise)	0.701	0.058	0.736	0.063
Bioconcentration dataset (20% noise)	0.633	0.093	0.606	0.085
Bioconcentration dataset (30% noise)	0.606	0.087	0.619	0.087
Admission dataset (20% noise)	0.740	0.060	0.752	0.058
Admission dataset (30% noise)	0.758	0.064	0.771	0.064
Iris dataset (20% noise)	0.943	0.043	0.952	0.038
Iris dataset (30% noise)	0.946	0.040	0.953	0.031
Wine dataset (20% noise)	0.820	0.096	0.828	0.091
Wine dataset (30% noise)	0.863	0.070	0.866	0.070

6. Discussion

Traditional performance measures (TPMs) such as recall, precision, specificity, and negative predictive value (NPV) are highly sensitive to minor data fluctuations and exhibit substantial variability across iterations. The experiments (please see Section 5.1) demonstrated that a performance measure that is robust for one classifier may not exhibit a similar trend when applied to another classifier across diverse dataset characteristics. Although the General Performance Score (GPS) introduced by [18] attempted to reduce this variability by combining the highly variable measures, it failed to address the issue adequately because it assigned equal importance to all the measures. In this paper, we have tried to address this sensitivity to minor data fluctuations and the resulting variability across iterations. The proposed Weighted General Performance Score (W-GPS) considers each measure’s coefficient of variation (CV) to assign a suitable weight. We have compared TPMs, GPS, and the proposed W-GPS for different classification scenarios. As detailed in Section 5.2, the proposed W-GPS exhibited significantly reduced variation compared to GPS.

Table 9
Domain-wise average score of GPS and W-GPS

Domain	Number of datasets	Datasets	GPS	W-GPS	% change	Average GPS	Average W-GPS	% Change
Medical/healthcare	11	Adult dataset	0.599	0.684	14.19	0.509	0.621	21.907
		Blood transfusion dataset	0.453	0.542	19.647
		Diabetes dataset	0.663	0.699	5.43
		Fertility dataset	0.188	0.258	37.234
		Haberman survival dataset	0.44	0.535	21.591
		Parkinson dataset	0.763	0.805	5.505
		Thoracic surgery dataset	0.111	0.175	57.658
		HCV dataset	0.317	0.776	144.795
		Thyroid dataset	0.316	0.697	120.57
		Vertebral column dataset	0.811	0.843	3.946
		Mice protein expression dataset	0.94	0.814	$-13.404$
Biological/ecological	2	Ionosphere dataset	0.878	0.894	1.822	0.761	0.79	3.811
		Bioconcentration dataset	0.644	0.686	6.522
Agricultural/food	4	Raisin dataset	0.795	0.817	2.767	0.661	0.712	7.637
		Wine dataset	0.903	0.902	$-0.111$
		Wine quality dataset	0.01	0.186	1760
		Wheat seeds dataset	0.937	0.942	0.534
Engineering/automotive	2	Car evaluation dataset	0.786	0.818	4.071	0.764	0.793	3.796
		Vehicle dataset	0.742	0.768	3.504
Synthetic/noise	4	Diabetes (20% noise)	0.593	0.643	8.432	0.584	0.621	6.25
		Diabetes (30% noise)	0.57	0.619	8.596
		Contraceptive dataset (20% noise)	0.583	0.603	3.431
		Contraceptive dataset (30% noise)	0.59	0.617	4.576

The following subsections summarize the W-GPS results with respect to different factors: variability in data distribution and its associated cost, presence of noise, and domain-specific requirements.

6.1 Variability in data distribution and associated cost

In real-life applications, one often encounters imbalanced datasets, and the distribution of classes within a dataset may greatly affect a classifier’s performance metrics. If fluctuations in the data cause significant variability in the outcome of an evaluation metric, the cost is a loss of interpretability. [81] noted that finding an unbiased estimator for an imbalanced dataset is challenging. For example, for the Thoracic Surgery dataset (class ratio 85:15; please see Table 4), the mean GPS score on raw data is 0.111 with an SD of 0.066. After balancing the data using SMOTE, the mean GPS score is 0.239 with an SD of 0.121; that is, while the average GPS improves, its variability also increases. In contrast, the mean W-GPS score on raw data for the same dataset is 0.175 with an SD of 0.110, and after balancing the dataset with SMOTE, the mean W-GPS is 0.328 with an SD of 0.088. The reduced variability of W-GPS after balancing makes it more interpretable. Experimentation on the other datasets also exhibited reduced variability (please see Table 7). Thus, we conclude that W-GPS provides a more robust and stable evaluation of classification models, making it a more reliable metric for practical applications.

6.2 Impact of noise

Introducing noise into datasets generally increases the standard deviation of performance metrics, reflecting higher variability and instability in model performance. This occurs because noise adds random fluctuations and errors to the data, making it more challenging for the model to identify patterns and make accurate predictions. The W-GPS provided a more balanced and reliable assessment of the model’s effectiveness. In Table 8, the Contraceptive dataset with 20% noise shows that GPS has a mean of 0.583 and a standard deviation (SD) of 0.141, while W-GPS has a mean of 0.603 and an SD of 0.104. This represents a 3.4% increase in the mean and a 26.2% decrease in the SD for W-GPS compared to GPS. When the noise level is increased to 30%, GPS has a mean of 0.590 and an SD of 0.153, while W-GPS has a mean of 0.617 and an SD of 0.126. This indicates a 4.6% increase in the mean and a 17.6% decrease in the SD for W-GPS compared to GPS. Likewise, for the Diabetes dataset with 20% noise, GPS has a mean of 0.593 and an SD of 0.379, while W-GPS has a mean of 0.643 and an SD of 0.110. This represents an 8.4% increase in the mean and a 71.0% decrease in the SD for W-GPS compared to GPS. With the noise level increased to 30%, GPS shows a mean of 0.570 and an SD of 0.314, whereas W-GPS maintains a higher mean of 0.619 and a lower SD of 0.101. This demonstrates an 8.6% increase in the mean and a 67.8% decrease in the SD for W-GPS compared to GPS. The above results demonstrate that W-GPS maintains higher mean performance and lower variability in a noisy environment than GPS, as its weighting mechanism enables it to handle noisy data more effectively.
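The percentage changes quoted above follow from the relative difference between the W-GPS and GPS values. As a small worked check, the sketch below recomputes them from the means and SDs reported in Table 8 (contraceptive dataset) and Table 4 (diabetes dataset); the helper name `percent_change` and the dictionary layout are illustrative choices, not part of the paper.

```python
def percent_change(baseline, new):
    """Relative change of `new` with respect to `baseline`, in percent."""
    return (new - baseline) / baseline * 100.0


# (mean, SD) pairs as reported in Table 8 and Table 4.
noisy_results = {
    "Contraceptive (20% noise)": {"gps": (0.583, 0.141), "wgps": (0.603, 0.104)},
    "Contraceptive (30% noise)": {"gps": (0.590, 0.153), "wgps": (0.617, 0.126)},
    "Diabetes (20% noise)": {"gps": (0.593, 0.379), "wgps": (0.643, 0.110)},
    "Diabetes (30% noise)": {"gps": (0.570, 0.314), "wgps": (0.619, 0.101)},
}

summary = {}
for name, scores in noisy_results.items():
    gps_mean, gps_sd = scores["gps"]
    wgps_mean, wgps_sd = scores["wgps"]
    summary[name] = {
        "mean_gain_pct": round(percent_change(gps_mean, wgps_mean), 1),
        # Reported as a positive "decrease", hence the sign flip.
        "sd_reduction_pct": round(-percent_change(gps_sd, wgps_sd), 1),
    }

for name, row in summary.items():
    print(f"{name}: mean +{row['mean_gain_pct']}%, SD -{row['sd_reduction_pct']}%")
```

The recomputed figures agree with the percentages stated in the text for all four noisy scenarios.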

6.3 Domain-specific requirements

Across datasets from different domains, including medical, environmental, and general classification tasks, W-GPS attains a higher domain-wise average score than GPS (please see Table 9). For example, for the set of 11 datasets from the medical domain, the average score rises from 0.509 for GPS to 0.621 for W-GPS. The remaining domains exhibit a similar trend, although a few individual datasets, such as the Mice protein expression dataset, score lower under W-GPS.

7. Conclusion and future scope

Traditionally, performance measures such as accuracy, precision, recall, and specificity have been used to evaluate the performance of classifiers. However, relying on a single measure overlooks other important indicators, leading to inconsistent results. As the traditional performance measures (TPMs) exhibit significant variability, an aggregate performance metric, GPS, was recently proposed to provide a single reliable measure based on the aggregation of TPMs. In this paper, we have proposed W-GPS, a robust performance measure for evaluating classification models. The proposed measure combines several traditional performance measures such as accuracy, precision, recall, specificity, and NPV, and assigns suitable weights to them based on their coefficient of variation. The proposed W-GPS is stable for a wide variety of classification scenarios. It also turns out to be superior to GPS in terms of reduced variability across the datasets. Indeed, W-GPS offers data analysts the flexibility to develop different measures by choosing a suitable combination of TPMs, depending on the specific scenario. Future work will focus on developing a data-driven model selection procedure.

Funding

As mentioned above.

CRediT authorship contribution statements

Gaurav Pandey: Conducted the experiments. Rashika Bagri: Wrote the original draft and interpreted the data. Rajan Gupta: Developed the research objectives and methodologies. Ankit Rajpal and Naveen Kumar: Led the analysis, review, and editing. Manoj Agarwal: Led the analysis and interpreted the data.

Declaration of competing interests

The authors declare that they have no conflict of interest. This article contains no studies performed by the authors with human participants or animals.

Ethical and informed consent

This article does not contain any studies with human participants or animals.

Data availability and access

All the datasets used in the paper are publicly available. The links to access these datasets are provided in this paper.

Acknowledgments

Rashika Bagri would like to express gratitude to the University Grants Commission, New Delhi, India, for granting the Junior Research Fellowship (Reference No. 200510452974).

References

1. Domingos. A few useful things to know about machine learning. Communications of the ACM. 2012; 55(10): 78-87.

2. Duda, Hart, et al. Pattern classification. John Wiley and Sons; 2006.

3. Butcher, Smith. Feature engineering and selection: A practical approach for predictive models. The American Statistician. 2020; 74(3): 308-309.

4. Powers. Evaluation: from precision, recall and F-measure to ROC, informedness, markedness and correlation. arXiv preprint arXiv:2010.16061; 2020.

5. Vapnik. The nature of statistical learning theory. Information Science and Statistics. Springer New York; 2013.

6. Sokolova, Lapalme. A systematic analysis of performance measures for classification tasks. Information Processing and Management. 2009; 45(4): 427-437.

7. Provost, Fawcett. Robust classification for imprecise environments. Machine Learning. 2001; 42: 203-231.

8. Raschka. Model evaluation, model selection, and algorithm selection in machine learning. arXiv preprint arXiv:1811.12808. 2018.

9. Lunetta, Matthew. The significance of a two-by-two contingency table. Journal of Educational Statistics. 1979; 4(2): 123-137.

10. Akobeng. Understanding type I and type II errors, statistical power and sample size. Acta Paediatrica. 2016; 105(6): 605-609.

11. Takahashi, Yamamoto, Kuchiba, Koyama. Confidence interval for micro-averaged F1 and macro-averaged F1 scores. Applied Intelligence. 2022; 52(5): 4961-4972.

12. Suhaimi, Othman, Yaakub. Comparative Analysis Between Macro and Micro-Accuracy in Imbalance Dataset for Movie Review Classification. In: Yang, Sherratt, Dey, Joshi, editors. Proceedings of Seventh International Congress on Information and Communication Technology. Singapore: Springer Nature Singapore. 2023; 83-93.

13. Hastie, Tibshirani, Friedman. The elements of statistical learning: Data mining, inference, and prediction. 2009; 2. Springer.

14. Rifkin, Klautau. In Defense of One-Vs-All Classification. Journal of Machine Learning Research. 2004; 5: 101-141.

15. Tharwat. Classification assessment methods. Applied Computing and Informatics. 2020; 17(1): 168-192.

16. de Amorim, Cavalcanti, Cruz. The choice of scaling technique matters for classification performance. Applied Soft Computing. 2023; 133: 109924.

17. García, Herrera. An extension on “statistical comparisons of classifiers over multiple data sets” for all pairwise comparisons. Journal of Machine Learning Research. 2008; 9(Oct): 2677-2694.

18. De Diego, Redondo, Fernández, Navarro, Moguerza. General performance score for classification problems. Applied Intelligence. 2022; 52(10): 12049-12063.

19. Bedeian, Mossholder. On the use of the coefficient of variation as a measure of diversity. Organizational Research Methods. 2000; 3(3): 285-297.

20. Montgomery, Runger. Applied Statistics and Probability for Engineers. 6th ed. John Wiley and Sons; 2014.

21. Ruiz, Bandera. Analysis of uncertainty indices used for building envelope calibration. Applied Energy. 2017; 185: 82-94.

22.

Albert

Anderson

. On the existence of maximum likelihood estimates in logistic regression models. Biometrika. 1984; 71(1): 1-10.

23. Cortes, Vapnik. Support-vector networks. Machine Learning. 1995; 20: 273-297.

24. Fisher. The use of multiple measurements in taxonomic problems. Annals of Eugenics. 1936; 7(2): 179-188.

25. Srivastava, Gupta, Frigyik. Bayesian quadratic discriminant analysis. Journal of Machine Learning Research. 2007; 8(6).

26. Mukherjee, Sharma. Intrusion detection using naive Bayes classifier with feature reduction. Procedia Technology. 2012; 4: 119-128.

27. Cover, Hart. Nearest neighbor pattern classification. IEEE Transactions on Information Theory. 1967; 13(1): 21-27.

28. Loh, Shih. Split selection methods for classification trees. Statistica Sinica. 1997; 815-840.

29. Breiman. Random forests. Machine Learning. 2001; 45: 5-32.

30. Geurts, Ernst, Wehenkel. Extremely randomized trees. Machine Learning. 2006; 63: 3-42.

31. Friedman. Greedy function approximation: a gradient boosting machine. Annals of Statistics. 2001; 1189-1232.

32. Meng, Finley, Wang, Chen, et al. LightGBM: A highly efficient gradient boosting decision tree. Advances in Neural Information Processing Systems. 2017; 30.

33. Chen, Guestrin. XGBoost: A scalable tree boosting system. In: Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. 2016; 785-794.

34. McCulloch, Pitts. A logical calculus of the ideas immanent in nervous activity. The Bulletin of Mathematical Biophysics. 1943; 5: 115-133.

35. Smith, Johnson, Lee. Performance measures in machine learning: A comprehensive review. Journal of Artificial Intelligence Research. 2022; 20(3): 112-129.

36. Zhong, Tian, Thilak, Anbarasan. Machine learning-based multimedia services for business model evaluation. Computers and Electrical Engineering. 2022; 97: 107605.

37. Garcia. Learning from imbalanced data. IEEE Transactions on Knowledge and Data Engineering. 2009; 21(9): 1263-1284.

38. Mullick, Datta, Dhekane, Das. Appropriateness of performance indices for imbalanced data classification: An analysis. Pattern Recognition. 2020; 102: 107197.

39. Luque, Carrasco, Martín, de Las Heras. The impact of class imbalance in classification performance metrics based on the binary confusion matrix. Pattern Recognition. 2019; 91: 216-231.

40. Johnston, Brignall, Fitzgerald. Good enough performance measurement: A trade-off between activity and action. Journal of the Operational Research Society. 2002; 53(3): 256-262.

41. Gösgens, Zhiyanov, Tikhonov, Prokhorenkova. Good classification measures and how to find them. Advances in Neural Information Processing Systems. 2021; 34: 17136-17147.

42. Labatut, Cherifi. Evaluation of performance measures for classifiers comparison. arXiv preprint arXiv:1112.4133. 2011.

43. Davis, Goadrich. The relationship between Precision-Recall and ROC curves. In: Proceedings of the 23rd International Conference on Machine Learning. 2006; 233-240.

44. Cortes, Mohri. AUC optimization vs. error rate minimization. Advances in Neural Information Processing Systems. 2003; 16.

45. Huang, Ling. Comparing naive Bayes, decision trees, and SVM with AUC and accuracy. In: Third IEEE International Conference on Data Mining. IEEE. 2003; 553-556.

46. Huang, Ling. Using AUC and accuracy in evaluating learning algorithms. IEEE Transactions on Knowledge and Data Engineering. 2005; 17(3): 299-310.

47. Sokolova, Japkowicz, Szpakowicz. Beyond accuracy, F-score and ROC: A family of discriminant measures for performance evaluation. In: AI 2006: Advances in Artificial Intelligence: 19th Australian Joint Conference on Artificial Intelligence, Hobart, Australia, December 4–8, 2006. Proceedings 19. Springer. 2006; 1015-1021.

48. Zhou, Liu. Correlation analysis of performance metrics for classifier. In: Decision Making and Soft Computing: Proceedings of the 11th International FLINS Conference. World Scientific. 2014; 487-492.

49. Luque, Carrasco, Martín, Lama. Exploring symmetry of binary classification performance metrics. Symmetry. 2019; 11(1): 47.

50. Ferri, Hernandez-Orallo, Modroiu. An Experimental Comparison of Performance Measures for Classification. Pattern Recognition Letters. 2009; 30(1): 27-38.

51. Nandi. From Multiple Independent Metrics to Single Performance Measure Based on Objective Function. IEEE Access. 2023.

52. Redondo, Navarro, Fernández, de Diego, Moguerza, Fernández-Muñoz. Unified performance measure for binary classification problems. Intelligent Data Engineering and Automated Learning-IDEAL 2020: 21st International Conference, Guimaraes, Portugal, November 4–6, 2020, Proceedings, Part II. Springer. 2020; 104-112.

53. Uddin. Addressing accuracy paradox using enhanched weighted performance metric in machine learning. In: 2019 Sixth HCT Information Technology Trends (ITT). IEEE. 2019; 319-324.

54. Jadhav. A novel weighted TPR-TNR measure to assess performance of the classifiers. Expert Systems with Applications. 2020; 152: 113391.

55. Becker, Kohavi. UCI Machine Learning Repository. https://archive.ics.uci.edu/dataset/2/adult; 1996. (Accessed on 06/21/2023).

56. Yeh. Blood Transfusion Service Center – UCI Machine Learning Repository. https://archive.ics.uci.edu/dataset/176/blood+transfusion+service+center; 2008. (Accessed on 06/21/2023).

57. Kahn. Diabetes – UCI Machine Learning Repository. https://archive.ics.uci.edu/dataset/34/diabetes; 2014. (Accessed on 06/21/2023).

58. Gil, Girela. Fertility – UCI Machine Learning Repository. https://archive.ics.uci.edu/dataset/244/fertility; 2013. (Accessed on 06/21/2023).

59. Haberman. Haberman’s Survival – UCI Machine Learning Repository. https://archive.ics.uci.edu/dataset/43/haberman+s+survival; 1999. (Accessed on 06/21/2023).

60. Sigillito, VHL, Wing, K B. Ionosphere – UCI Machine Learning Repository. https://archive.ics.uci.edu/dataset/52/ionosphere; 1989. (Accessed on 06/21/2023).

61. Little. Parkinsons – UCI Machine Learning Repository. https://archive.ics.uci.edu/dataset/174/parkinsons; 2008. (Accessed on 06/21/2023).

62. Cinar aK Ilkay. Raisin – UCI Machine Learning Repository. https://archive.ics.uci.edu/dataset/850/raisin; 2023. (Accessed on 06/21/2023).

63. Kalousis, Prados, Hilario, Pilar, Robles, Valenzuela. On the representation and learning of real-world relations in medical domains. Artificial Intelligence in Medicine. 2003; 27(1): 35-61.

64. QSAR Bioconcentration classes dataset – UCI Machine Learning Repository. https://archive.ics.uci.edu/dataset/510/qsar+bioconcentration+classes+dataset; 2019. (Accessed on 06/21/2023).

65. Barreto, Neto. Vertebral Column – UCI Machine Learning Repository. https://archive.ics.uci.edu/dataset/212/vertebral+column; 2011. (Accessed on 06/21/2023).

66. Lim. Contraceptive Method Choice – UCI Machine Learning Repository. https://archive.ics.uci.edu/dataset/30/contraceptive+method+choice; 1997. (Accessed on 06/21/2023).

67. Quinlan. Thyroid Disease – UCI Machine Learning Repository. https://archive.ics.uci.edu/dataset/102/thyroid+disease; 1987. (Accessed on 06/21/2023).

68. Aeberhard, Forina. Wine – UCI Machine Learning Repository. https://archive.ics.uci.edu/dataset/109/wine; 1991. (Accessed on 06/21/2023).

69. Fisher. Iris – UCI Machine Learning Repository. https://archive.ics.uci.edu/dataset/53/iris; 1988. (Accessed on 06/21/2023).

70. Charytanowicz, Szymon. Seeds – UCI Machine Learning Repository. https://archive.ics.uci.edu/dataset/236/seeds; 2012. (Accessed on 06/21/2023).

71. Siegler. Balance Scale – UCI Machine Learning Repository. https://archive.ics.uci.edu/dataset/12/balance+scale; 1994. (Accessed on 06/21/2023).

72. Martin, Hirst, Kilby. Connect-4 – A step to Connect-T-Generation. Tech Rep CSD-TR-98-12. 1999. Available from: https://archive.ics.uci.edu/ml/datasets/Connect-4.

73. Bohanec. Car Evaluation – UCI Machine Learning Repository. https://archive.ics.uci.edu/dataset/19/car+evaluation; 1997. (Accessed on 06/21/2023).

74. Bennett, Mangasarian. StatLog (Vehicle Silhouettes). Tech Rep 917. 1992. Available from: http://archive.ics.uci.edu/ml/datasets/Statlog+(Vehicle+Silhouettes).

75. Lichtinghagen, Ralf. HCV data – UCI Machine Learning Repository. https://archive.ics.uci.edu/dataset/571/hcv+data; 2020. (Accessed on 06/21/2023).

76. Ilter, Guvenir. Dermatology – UCI Machine Learning Repository. https://archive.ics.uci.edu/dataset/33/dermatology; 1998. (Accessed on 06/21/2023).

77. German. Glass Identification – UCI Machine Learning Repository. https://archive.ics.uci.edu/dataset/42/glass+identification; 1987. (Accessed on 06/21/2023).

78. Cortez, Paulo Reis. Wine Quality – UCI Machine Learning Repository. https://archive.ics.uci.edu/dataset/186/wine+quality; 2009. (Accessed on 06/21/2023).

79. Forsyth. Zoo – UCI Machine Learning Repository. https://archive.ics.uci.edu/dataset/111/zoo; 1990. (Accessed on 06/21/2023).

80. Higuera, Krzysztof. Mice Protein Expression – UCI Machine Learning Repository. https://archive.ics.uci.edu/dataset/342/mice+protein+expression; 2015. (Accessed on 06/21/2023).

81.

Liu

Zhang

Xiang