A clinical decision support system for heart disease prediction with ensemble two-fold classification framework

Abstract

Cardiovascular disease (CVD) is a severe public health concern globally. Early and accurate CVD diagnosis is difficult, yet it is necessary to prevent further damage and protect patients’ lives. Machine Learning (ML)-based Clinical Decision Support Systems (CDSS) have the potential to assist healthcare providers in making accurate CVD diagnoses and treatment decisions. Clinical data usually contain missing values (MVs); hence, the choice of imputation technique has become a critical consideration when applying ML to real-world medical datasets. Furthermore, removing instances with MVs leads to the loss of essential data and can produce incorrect results. To overcome these issues, this paper proposes an efficient and reliable CDSS with an Ensemble Two-Fold Classification (ETC) framework for classifying heart diseases. The effectiveness of the proposed ETC framework using different supervised ML algorithms is evaluated with four distinct imputation methods for handling MVs on a standard benchmark dataset from the University of California, Irvine (UCI) repository. Experimental results show that the proposed ETC framework with the k-Nearest Neighbors (k-NN) imputation method achieves a better classification accuracy of 0.9999 and a lower error rate of 0.0989 than the other imputation methods and classifiers, with similar execution times.

Keywords

Clinical dataset; classification; data pre-processing; decision support system; heart disease prediction; imputation; machine learning algorithms; missing values

1 Introduction

One of the primary causes of death worldwide is cardiovascular disease (CVD) [1]. Globally, 17.9 million people died from CVD in 2019, accounting for 32% of all deaths. Food habits [62], lack of physical activity, high blood pressure, tobacco usage, obesity, cholesterol, alcohol abuse, pulse rate, diabetes, and hereditary risk factors are all associated with heart diseases [3–6]. Most CVD mortality occurs in low- and middle-income countries. According to [2], cardiovascular diseases accounted for 40% of the 19.5 million deaths related to non-communicable conditions in 2020. This makes it imperative to enhance cardiovascular disease diagnosis and treatment.

Machine Learning (ML)-based Clinical Decision Support Systems (CDSS) and other disease-specific decision support systems are becoming more popular, allowing healthcare professionals to improve the accuracy of diagnosis and provide more precise treatment with better outcomes. CDSS is one of the most frequent applications of automated analytical model building in ML. It is used to find typical patterns in observed clinical data, develop a classification model, and support accurate decisions for several diseases [14, 15]. Several scientific investigations have addressed the early diagnosis of disease. Although more robust and more precise classification models have been proposed, some factors still reduce classification accuracy.

Real-world data are often incomplete, inconsistent, and inaccurate, and lack precise attribute patterns. In large clinical datasets, missing values (MVs), high complexity, and unbalanced classes are common problems in the design of CDSS [7, 63]. Researchers find it hard to analyse and design CDSS when working with medical data that contain many MVs. Imputation is a typical solution for dealing with MVs [59, 60]. The effectiveness of an imputation method may differ depending on the dataset characteristics, including the sample size, the percentage of missing data, and the missingness mechanism [9–13]. It is difficult to provide a generic answer for the various scenarios requiring data imputation techniques.

A generic framework is required for dealing with MVs in datasets. This research aims to create an efficient and reliable CDSS with Ensemble Two-Fold Classification (ETC) framework for classifying CVD to address numerous problems in CDSS design. Further, this ETC framework aims to compare and contrast the outcomes of different imputation strategies. The proposed methodology is tested over the UCI benchmark heart disease dataset to evaluate the effectiveness of the suggested measures using four alternative imputation methods for handling missing data.

The rest of the paper is organised as follows: A brief overview of related articles and a description of the dataset are included in section 2. The methodology and implementation of the suggested framework using different classifiers with four distinct imputation methods are defined in section 3. Section 4 contains the result findings and a discussion of the experiments carried out. Section 5 concludes with the recommendation of further work directions.

2 Related works

Over the last few decades, several researchers have presented various CDSS for predicting diseases using different ML algorithms. Researchers find it hard to analyse and design CDSS when working with medical data that contain many MVs. There are several approaches to dealing with datasets with missing cases. The list-wise deletion approach is the simplest: it removes from the dataset any instance that has even one missing value in its variables. This strategy results in data loss and decreased classification accuracy [17, 18].

Another weakness of missing-data approaches in classification is that many of these techniques only deal with MVs in the training phase and cannot classify new data with MVs unless independent imputations are used [19, 20]. In this approach, the MVs are estimated first, and the model is then fitted in the training step. The resulting predictive model, however, cannot be applied to data that contain MVs. As a result, additional imputation is required to classify these new data; nevertheless, imputation challenges such as selecting the appropriate sample size or the proper imputation method resurface. Table 1 summarizes multiple solutions for managing MVs.

Table 1
Different strategies used in the literature to manage MVs

SI.No	Study	Objective	Methods	Summary	Datasets	Limitations
1	Pooja Rani et al. [21]	Proposed a systematic technique for identifying MVs using the mean, mode, KNN, and Multivariate Imputation by Chained Equations (MICE) with four classifiers: LR, NB, RF, and SVM.		The study’s findings demonstrated that MICE imputation performance is better when compared to other methods.	UCI - Cleveland Heart disease	Simple imputation approaches, mean, and mode, which are biased and provide impractical outcomes, were compared to state-of-the-art methods.
2	Saravana Kumar, K. et al. [22]	Developed a multiple imputation method for MVs. To show how performance varies depending on the missing value mechanism and imputation method used.	Deletion, Mean, Lower cell value, Upper cell value.	The experiment’s findings demonstrated that Mean imputation outperformed other imputation methods used in the study.	UCI - Cleveland Heart disease	Compared to simple imputation methods, which are biased and produce unrealistic results.
3	Nishith Kumar et al. [23]	To create a new kernel weight function-based restoration method for dealing with MVs and outliers.	Missing at random (MAR)	Existing data restoration algorithms found the suggested kernel weight-based solution extremely useful.	Artificial and real metabolomics	The approach was only tested on one dataset model; therefore, it might not work well with different data types.
4	Heru Nugroho et al. [24]	During the restoration phase, extract missing data using a centre-based class adaptive technique and the firefly algorithm, considering attribute connection.	Missing completely at random (MCAR)	The experiment’s findings showed that dealing with MVs with the centre-based class firefly method was a good option.	Iris, Ecoli, Wine, and Sonar datasets	Imputation was tested using only one incomplete data method.
5	Che-Yu Hung et al. [25]	To test the consistency of missing value handling when values aren’t missing at random.	MAR	In comparison to the other six ways, the strategy performed satisfactorily in resolving the lower unfinished challenge.	Abalone and Boston Housing	The strategy did not consider the rate of missing data, which could have skewed the results.
6	Cedric Beaulac et al. [26]	To create a novel decision tree strategy for dealing with missing data.	MCAR, MAR, Missing not at random (MNAR)	Compared to other missing value handling techniques, the method exhibited a greater accuracy and a more interpretable classifier.	Grades	The method had a fault when the gating factor had little estimative power.
7	Marcelo B. A. Veras et al. [27]	By modelling MVs as random parameters with a Gaussian distribution, the approach proposes a variation of the forward stagewise regression algorithm for input restoration.		The method was helpful compared to conventional techniques that combined standard missing input methodologies with the basic FSR methodology.	Glass, Housing, Iris, Miles Per Gallon (MPG), Motion view (Mv), Stocks, Wine, Forest fire	There was no mention of the experiment’s missing value mechanisms.
8	Raymond Houe Ngouna et al. [28]	Developed a multiple imputation method for MVs in groundwater datasets with a high rate of MVs	MAR	The strategy for dealing with missing data was chosen for its ability to consider the relationships among the variables of attention.	Groundwater	The absence of prior information about the missing data label may have made imputation more difficult.
9	Ralph C.Ward et al. [29]	To demonstrate how performance changes are based on incomplete data and how MNAR can help ensure those accurate findings are achieved when multiple imputations case analyses are performed.	MCAR, MAR, MNAR	The study found that both complete case studies and various imputations can give accurate data under more conditions.	Traumatic Brain Injury and Diabetes	The lack of nonlinear components in the core models hampered the method.
10	Neil Y. Yen et al. [30]	To use three ML methods to forecast MVs in a time series and choose the optimal strategy.		According to the study, deep learning improved performance with vast data, while ML models performed better when limited data.	Air quality data	When adopting their most effective strategy, they incur high expenses in terms of time and computer power (deep learning).
11	Taeyoung Kim et al. [31]	Before classification, this approach employed four missing data handling algorithms for the training data (Similarly, Multiple imputations, KNN, MICE)		The KNN method was the most effective incomplete information restoration technique for photovoltaic forecasts among the imputation methods used.	Weather dataset	There was no mention of the experiment’s MVs mechanisms.
12	P. S. Raja et al. [32]	A new hybrid was created. The rough parameter missing value imputation approach is referred to as Fuzzy C.		This technique dealt with the dataset’s ambiguity and coarseness, resulting in better imputation outcomes.	Yeast, Dukes’ B colon cancer, and Mice Protein Expression	There was no mention of the experiment’s MVs mechanisms.
13	Mohamad Faiz Dzulkalnine et al. [33]	To test an MVs technique that considers feature importance.		The technique’s results showed that the hybrid algorithm outperformed previous approaches in terms of accuracy, RMSE, and MAE	Pima Indian Diabetes	The mechanism for MVs was not considered.
14	Chih-Fong Tsai et al. [34]	Presented an iterative KNN that takes class labels into account.	MCAR, MAR	The strategy took into account class labels and outperformed previous imputation methods.	Iris, Voting, Hepatitis	Though empirically demonstrated, the technique has not been proven to converge conceptually.
15	Bashir, S et al. [61]	We provide a multi-layer ensemble framework with increased bagging and optimal weighting.	HM-Bag Moov (Hierarchical Multi- level classifiers Bagging with Multi-objective optimized voting)	The results indicate that the HM-BagMoov ensemble framework achieved the highest accuracy, sensitivity, and F-Measure when compared with individual classifiers for all the diseases.	Five heart disease, four breast cancer, two diabetes, two liver disease, and one hepatitis	There was no mention of the experiment’s MVs mechanisms.

Imputation is a method for handling MVs that replaces them with plausible or estimated values. The literature has suggested a number of conventional statistical and machine learning imputation techniques, including mean, regression, k-nearest neighbor, and ensemble-based methods, to handle MVs. This paper introduces an efficient and reliable CDSS with an Ensemble Two-Fold Classification (ETC) framework for identifying heart disease with improved prediction accuracy and a lower error rate. The performance of the proposed ETC framework is evaluated using four distinct imputation approaches for managing MVs on a standard benchmark dataset from the UCI repository.

2.1 Dataset description

In this research, the proposed ETC framework is evaluated using the Cleveland heart disease dataset, obtained from the UCI online ML repository [35]. In the literature, only 14 of the 76 attributes are considered for the design of the classification model. Table 2 lists the details of the dataset used in the experiment.

Table 2
Attributes (Variables/ Features) of heart disease in the Cleveland dataset; the response variable is the last row (heart disease status)

SI.No	Attribute Number	Attribute Name	Description	Value Range	Data Type
1	3	age	age in completed years	29 - 79	integer
2	4	sex	male = 1, female = 0	0 - 1	binary
3	9	cp	Types of chest pain: 1 = typical angina, 2 = atypical angina, 3 = non-angina discomfort, 4 = asymptomatic angina	1 - 4	categorical
4	10	trestbps	Blood pressure at rest (in mm Hg at the time of admission to the hospital)	94 - 200	continuous
5	12	chol	cholesterol levels in the blood in milligrams per decilitre	126 - 564	continuous
6	16	FBS	Fasting blood sugar > 120 mg/dl (1 = true, 0 = false)	0 - 1	binary
7	19	restecg	Electrocardiographic results at rest: 0 = normal, 1 = ST-T wave abnormality, 2 = probable or definite left ventricular hypertrophy by Estes’ criteria	0 - 2	categorical
8	32	thalach	Maximum heart rate achieved	71 - 202	continuous
9	38	exang	Exercise-induced angina: 1 = yes, 0 = no	0 - 1	binary
10	40	oldpeak	ST depression induced by exercise relative to rest. 2.55>terrible, 1.5-4.2 risk level, two low level	0 - 6.20	continuous
11	41	slope	The slope of the peak exercise ST segment: 1 = upward slope, 2 = flat, 3 = downward slope	1 - 3	categorical
12	44	ca	Number of major vessels coloured by fluoroscopy	0 - 3	integer
13	51	thal	Three distinct numerical values: 3 = normal, 6 = fixed defect, 7 = reversible defect	3, 6, 7	categorical
14	58	num	Class attribute (diagnosis of heart disease). A score of 1-4 indicates the presence of heart disease, while a score of 0 indicates its absence.	0 - 4	integer

3 Proposed ensemble two-fold classification (ETC) framework

Inspired by the several CDSS previously proposed, this paper introduces an efficient and reliable CDSS with Ensemble Two-Fold Classification (ETC) framework for identifying heart disease with improved accuracy. The proposed ETC framework is depicted in Fig. 1.

Fig. 1

Proposed ensemble two-fold classification (ETC) framework

This framework has three main phases: data gathering, pre-processing, and ETC application. In the pre-processing stage, feature selection and scaling are performed, class balancing is done, and MVs are replaced by imputation methods. Using a standard scaler, all features are brought to the same scale, ensuring that each feature has a mean of 0 and a standard deviation of 1. The different imputation methods replace MVs with values approximated from the observed values in the dataset. The pre-processed datasets are then converted into binary form based on the attribute levels. The proposed ETC framework algorithm is presented in Algorithm 1. Further, classification is performed using Decision Trees (DT), Logistic Regression (LR), Naive Bayes (NB), Neural Networks (NN), Random Forest (RF), and Support Vector Machine (SVM) classifiers. Finally, a hybrid classifier model is constructed using the input of knowledge-based systems and clinical guideline standards. The details of the proposed ETC framework are discussed in the following sections.

3.1 Preliminaries

Let $Z_i \in Z \subseteq T^{n}$, $i = 1, \dotsc, n$, be the clinical dataset, where n represents the total number of samples (records/tuples/rows) and m represents the total number of features (attributes/variables). Let $z_{ij} \in T$, $i = 1, \dotsc, n$ and $j = 1, \dotsc, m$, be the entry in the i-th row and j-th column of the dataset under consideration; that is, $z_{ij}$ is the value of the j-th attribute for the i-th patient.

3.2 Select the significant features

A feature selection process reduces the data size by selecting the most significant features. It reduces the time complexity of analysing and designing the classification model without affecting its performance [8, 16]. Clinical datasets typically suffer from high dimensionality, partial or missing values, and clinical characteristics with widely varying magnitudes. The high-dimensional space must be mapped into a lower-dimensional space; for example: $v : \{Z \to Z;\ Z \in T^{k};\ k \ll n\}$ (1) Retaining the original attributes of Z is the main requirement for feature selection, since it is of primary importance to preserve the labels based on the attributes. This is not necessary for feature extraction, which uses latent variables instead. Dimensionality reduction approaches pick a subset of selected attributes [64], resulting in the matrix ${\bar{Z}}_{(n \times \bar{b})}$: ${\bar{Z}}_{(n \times \bar{b})} \subset {\bar{Z}}_{(n \times b)}$ (2) where $b \gg \bar{b}$, b denotes the number of actual attributes, $\bar{b}$ denotes the number of chosen attributes, and ${\bar{Z}}_{(n \times \bar{b})}$ is the data matrix containing the relevant attributes. To reduce the dimension, one must determine a projection from the high-dimensional to the low-dimensional space. Since local projections are commonly used in projection mapping, $X_{data}$ cannot contain missing elements. As a result, identifying the missing data is critical before constructing a suitable imputation approach. $X_{data} = \begin{bmatrix} x_{11} & \dots & x_{i1} \\ \vdots & \ddots & \vdots \\ x_{1j} & \dots & x_{ij} \end{bmatrix}$ (3)

3.3 Apply standard scalar deviation

Standard scaling is a data pre-processing technique that is fitted on the training data and rescales the value of each cell once the dataset has been transformed into a usable format. It standardises features by removing the mean and scaling to unit variance. The standard score of a sample x is calculated as: $z = \frac{x - u}{s}$ (4) where x denotes the cell value, u denotes the mean of the training data (or 0 if with_mean = FALSE), and s denotes the standard deviation of the training data (or 1 if with_std = FALSE). The standard deviation measures the average amount of variation in the dataset; it shows how far each value lies from the mean. We have used with_mean = TRUE and with_std = TRUE in this work. The standard deviation is computed as follows: $\sigma = \sqrt{\frac{\sum {(x_{i} - \mu)}^{2}}{N}}$ (5)

where $\sigma$ is the standard deviation of the samples, N is the total number of samples in the dataset, $x_{i}$ is the value of the current instance, and $\mu$ is the mean of all samples.
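As a worked illustration of Eqs. (4) and (5), the following Python sketch standardises a single feature column using the population standard deviation (division by N). The function name `standardise` and the sample readings are hypothetical and are not taken from the paper's implementation.

```python
import math

def standardise(values):
    """Return z-scores (x - u) / s for a list of numbers, per Eqs. (4)-(5)."""
    n = len(values)
    mean = sum(values) / n
    # Population standard deviation: divide by N, as in Eq. (5).
    std = math.sqrt(sum((x - mean) ** 2 for x in values) / n)
    if std == 0:
        # A constant column has no spread; map every value to 0 (assumption).
        return [0.0 for _ in values]
    return [(x - mean) / std for x in values]

# Hypothetical resting blood pressure readings (mm Hg).
trestbps = [120.0, 130.0, 140.0, 150.0, 160.0]
z_scores = standardise(trestbps)
```

After scaling, the column has mean 0 and unit variance, which is the property the pre-processing stage relies on.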

3.4 Missing value analysis and imputation techniques

Handling the MVs in medical datasets is one of the most challenging tasks that analysts face, because a robust data model depends on making the correct decision about how to process them. There is no single rule for handling MVs; the preferred method is the one that yields a robust model with the best performance [36–38]. Domain knowledge about the dataset is important for guiding data pre-processing and the management of MVs. For an attribute, nullity values are MVs that are not recorded or not present. The entries $z_{ij}$ form the data matrix Z, and the nullity set collects the entries $z_{ij}$ that are absent: $nullity = \{z_{ij} \in Z : z_{ij} \in \varphi\}$ (6) Find the number of MVs for each attribute (column) [A₁, A₂, A₃, …, A_m].

$[A_{1}, A_{2}, A_{3}, \dots, A_{m}] = \operatorname{count}_{j = 1}^{m} \left(\operatorname{nullity}(Z)_{1, \dots, n,\, j}\right)$ (7) ${\bar{Z}}_{(n \times b)} = \operatorname{find}_{i = 1}^{n} \left(\operatorname{nullity}(Z_{(n \times b)})\right)$ (8) ${\bar{Z}}_{(n \times b)} = \begin{cases} 1, & \text{missing value} \\ 0, & \text{non-missing value} \end{cases}$ (9) where $\bar{Z}$ is the data matrix that shows the MVs. The imputation method is used to correct incomplete, incorrect, or ambiguous data. In the matrix of the clinical dataset Ψ_(n×m), n denotes the number of patient samples and m denotes the number of attributes. There are different ways of imputation, and each has its advantages and disadvantages [39, 40]. In data mining-based methodologies, ML is the most used method for estimating MVs. Many data mining algorithms have been proposed [41], such as k-Nearest Neighbor (k-NN), NN, DT, RF, kernel-based, and NB imputation. This study uses four missing value imputation methods: deletion, mean, k-NN, and NB. The details of the different imputation techniques are discussed in the following sections.
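The counting and indicator steps of Eqs. (7) and (9) can be sketched in a few lines of Python. This is a minimal illustration that assumes missing entries are encoded as `None`; the helper names and the sample records are hypothetical.

```python
def missing_mask(rows):
    """Binary indicator matrix as in Eq. (9): 1 = missing, 0 = observed."""
    return [[1 if value is None else 0 for value in row] for row in rows]

def missing_counts(rows):
    """Per-attribute missing-value counts [A_1, ..., A_m], as in Eq. (7)."""
    mask = missing_mask(rows)
    # zip(*mask) transposes the mask so that each tuple is one column.
    return [sum(column) for column in zip(*mask)]

# Hypothetical records (age, chol, ca, thal); None marks a missing value.
records = [
    [63, 233.0, 0, 6],
    [67, None, 3, 3],
    [41, 204.0, None, None],
    [56, 236.0, 0, None],
]
mask = missing_mask(records)
counts = missing_counts(records)
```

The per-column counts give a quick view of which attributes need imputation before any classifier is trained.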

3.4.1 Imputation using listwise deletion method

This approach removes from the dataset any case that has even one missing variable value. Deletion is the easiest method because it eliminates the need to estimate values. Because it removes all MVs before training, one of its main advantages is that the resulting model is fitted only on fully observed records. The main drawback is the loss of useful information, which reduces classification accuracy [39, 42]; the method works poorly when the proportion of MVs is large relative to the size of the dataset. The results may also become biased when data are removed from an experiment [43].
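The data loss caused by listwise deletion is easy to see on a toy example. The sketch below assumes missing entries are encoded as `None`; the function name and the records are hypothetical.

```python
def listwise_delete(rows):
    """Keep only the records in which every attribute is observed."""
    return [row for row in rows if all(value is not None for value in row)]

# Hypothetical records (age, chol, ca, thal); None marks a missing value.
records = [
    [63, 233.0, 0, 6],
    [67, None, 3, 3],
    [41, 204.0, None, None],
    [56, 236.0, 0, None],
]
complete = listwise_delete(records)
# Share of the original records that survives deletion.
retained_fraction = len(complete) / len(records)
```

Here only four cells out of sixteen are missing, yet three of the four records are discarded, which illustrates why deletion can remove far more information than the raw proportion of MVs suggests.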

3.4.2 Imputation using mean values

This method is relatively straightforward and widely utilised. Missing data are replaced by the average of all known values of an attribute, with each column treated independently. It can only be used with numeric data. In addition, mean imputation has a trivial effect on the correlation coefficient and does not affect the regression coefficient [44, 45]. In mean-based imputation (Little and Rubin, 2002), a single value is substituted for all missing instances of a feature, regardless of the input data distribution. The mean is computed by dividing the sum of the observed values by the number of observed values. It is mathematically represented as follows:

$\bar{x} = \frac{1}{n} \sum x_{i}$ (10) In this equation, $\bar{x}$ is the mean, $x_{i}$ are the observed values of the attribute, n is the number of observed values, and $\sum$ denotes the sum over all of these data points.
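A minimal Python sketch of column-wise mean imputation following Eq. (10) is given here. It assumes numeric attributes with missing entries encoded as `None`; the function name and the sample values are hypothetical.

```python
def mean_impute(rows):
    """Replace each None with the mean of the observed values in its column."""
    means = []
    for column in zip(*rows):
        observed = [v for v in column if v is not None]
        # Assumption: a column with no observed values is left unfilled.
        means.append(sum(observed) / len(observed) if observed else None)
    return [
        [means[j] if value is None else value for j, value in enumerate(row)]
        for row in rows
    ]

# Hypothetical records (age, chol); None marks a missing value.
records = [
    [63.0, 233.0],
    [67.0, None],
    [41.0, 204.0],
    [None, 250.0],
]
imputed = mean_impute(records)
```

Every missing cell in a column receives the same value, so the column mean is preserved while its variance shrinks.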

3.4.3 Imputation using the k-Nearest Neighbors algorithm

Cover and Hart first proposed the k-NN algorithm in 1967 [46]. k-NN is a widely used instance-based, lazy-learning algorithm (Wu et al. 2008). Batista and Monard [47] were the first to apply k-NN imputation to MVs: for each record with MVs, the k nearest neighbours are found, and the missing entries are imputed from the observed values of those neighbours. Zhang [48] presented a grey k-NN imputation approach that iteratively estimates the MVs in heterogeneous data. An imputation method based on k-NN has also been applied to several missing data cases using different mechanisms and missing data models [53]. In the k-NN algorithm, the nearest neighbours of a record with MVs are identified using a distance measure, and their values are used to impute the MVs [49]. For k-NN imputation, several distance measures can be used, including the Minkowski, Cosine, Manhattan, Hamming, Jaccard, and Euclidean distances; the Euclidean distance is the most widely used because of its efficiency and effectiveness [50, 51]. It determines the similarity between records by measuring the distance between them. This method is adaptable: it can be used with both discrete and continuous data and with datasets containing multiple MVs [49, 52].
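The neighbour search and averaging described above can be sketched in plain Python. This is an illustrative sketch, not the paper's implementation: the rescaling of the distance by the share of usable attributes and the unweighted average over neighbours are assumptions, and the function names and records are hypothetical.

```python
import math

def partial_distance(a, b):
    """Euclidean distance over the attributes observed in both records.

    The squared sum is rescaled by the share of usable attributes so that
    records with few shared attributes are not favoured (an assumption).
    """
    shared = [(x, y) for x, y in zip(a, b) if x is not None and y is not None]
    if not shared:
        return math.inf
    squared = sum((x - y) ** 2 for x, y in shared)
    return math.sqrt(squared * len(a) / len(shared))

def knn_impute(rows, k=2):
    """Fill each None with the mean of that attribute over the k nearest donors."""
    imputed = [list(row) for row in rows]
    for i, row in enumerate(rows):
        for j, value in enumerate(row):
            if value is not None:
                continue
            # Donors are other records in which attribute j is observed.
            donors = [r for idx, r in enumerate(rows) if idx != i and r[j] is not None]
            # Distances use the original rows, never already-imputed values.
            donors.sort(key=lambda r: partial_distance(row, r))
            nearest = donors[:k]
            if nearest:
                imputed[i][j] = sum(r[j] for r in nearest) / len(nearest)
    return imputed

# Hypothetical, already-scaled records; None marks a missing value.
records = [
    [0.0, 0.0, 1.0],
    [0.1, 0.0, 1.2],
    [5.0, 5.0, 9.0],
    [0.0, 0.1, None],
]
filled = knn_impute(records, k=2)
```

In this example the last record is closest to the first two, so its missing third attribute is filled with their average rather than being pulled towards the distant third record, as a global mean would be.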

3.4.4 Imputation using Naive Bayes algorithm

Naive Bayes Imputation (NBI) fills MVs by substituting a probability estimate for the missing attribute information. The NBI method divides the data into two groups: complete records and records with missing values. The complete records are used to impute the missing values, and the procedure is repeated for each missing attribute to produce complete data for classification. The dataset is expressed in vector form with m attributes, zⁱ = [z_i1, z_i2, z_i3, …, z_im], and the class t_j belongs to T = {t₁, t₂, t₃, …, t_j}. A record with a missing attribute is described by the probability P(Z₁ = z₁, Z₂ = z₂ … Z_j = ? … Z_d = z_d|y) [54]. NBI predicts the value of a variable that is missing in a partial record by altering the probability calculation [55]. The following probability equation describes the missing attributes:

$P (z_{1}, z_{2} \dots Z_{j} \dots z_{d} | y) = \prod_{i \neq j}^{d} P (z_{i} | y)$ (11)

To fill in the MVs, the Naive Bayes imputation method determines each missing attribute using the following equation:

$P (T_{misj} | z_{1} \dots z_{t} \dots z_{i}) = \frac{P (T_{misj}) P (z_{1} \dots z_{t} \dots z_{i} | T_{misj})}{P (z_{1} \dots z_{t} \dots z_{i})}$ (12)
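A minimal sketch of how Equation (12) could be applied to a categorical attribute is given below. It treats the missing attribute as the quantity to predict and picks the value with the largest numerator, since the denominator is the same for every candidate. The sketch assumes `None` marks a missing value, estimates probabilities from the complete records only, and uses Laplace smoothing, which is an assumption and is not stated in the source. Function names and data are hypothetical.

```python
from collections import Counter, defaultdict


def nb_impute_column(rows, target):
    """Fill None in column `target` with the value t that maximises
    P(t) * prod_i P(z_i | t), estimated from the complete rows."""
    complete = [r for r in rows if None not in r]
    others = [j for j in range(len(rows[0])) if j != target]
    prior = Counter(r[target] for r in complete)
    cond = defaultdict(Counter)   # cond[j][(z, t)]: count of z in column j given t
    levels = defaultdict(set)     # distinct values seen in each column
    for r in complete:
        for j in others:
            cond[j][(r[j], r[target])] += 1
            levels[j].add(r[j])

    def score(r, t):
        p = prior[t] / len(complete)
        for j in others:
            if r[j] is None:
                continue  # skip attributes that are also missing
            # Laplace smoothing (an assumption, not part of the source).
            p *= (cond[j][(r[j], t)] + 1) / (prior[t] + len(levels[j]))
        return p

    out = []
    for r in rows:
        r = list(r)
        if r[target] is None:
            r[target] = max(prior, key=lambda t: score(r, t))
        out.append(r)
    return out


# Hypothetical categorical toy data; the last record is missing column 2.
data = [
    ["typical", "up", "normal"],
    ["typical", "up", "normal"],
    ["atypical", "flat", "fixed"],
    ["atypical", "flat", "fixed"],
    ["typical", "up", None],
]
imputed = nb_impute_column(data, target=2)
```

Repeating the call once per incomplete attribute corresponds to the per-attribute repetition described above.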

Algorithm 1 Proposed Ensemble Two-Fold Classification (ETC) Framework Algorithm

Input: Data matrix Z_(n×m), where z_ij ∈ T (i = 1, …, n; j = 1, …, m) is the entry in row i and column j of the dataset under consideration.

Output: Prediction model

Procedure

Step 1: Select the significant features.

Step 2: Apply standard scaling.

Step 3: Missing value analysis

Step 3.1 Find the missing values (MVs): let matrix A_(1×m) = [A₁, A₂, A₃, …, A_m], where each A_j is the number of null (absent) values in column j.

Step 3.2 Find the numbers of MVs for each column (variable) [A₁, A₂, A₃, …, A_m] by using (8)

Step 4: Apply missing value imputation techniques:

Step 4.1: Calculate the imputation values by using listwise deletion, mean imputation, k-NN imputation, and Naive Bayes imputation. Let matrix IV_(1×b) hold the imputation values.

Step 4.2: Find the MVs for each attribute and impute the MVs by using (9).

Step 4.3: Replace the MVs with imputation values IV_(1×b)

Step 5: Apply the proposed Ensemble Two-Fold classification (ETC) algorithm.

Step 5.1

Procedure ETC Algorithm

Input: Dataset D (attributes with class label)

Output: Binary dataset d₁, d₂, d₃, …, d_n

categorical_evaluation_attributes = {cp, thal, slope, restecg}

threshold_evaluation_attributes = {age, trestbps, chol, thalach, ca}

maximum_threshold_values = {age: 55, trestbps: 140, chol: 240, thalach: 165, ca: 1 and below}

For each current_col in categorical_evaluation_attributes do

current_col_levels = D.split(current_col.getLevel())

For each level in current_col_levels do

For each row value j in current_col do

If j == level then

put 1 in that row of the column corresponding to level

else

put 0 in that row of the column corresponding to level

end if

end for

end for

end for

For each attribute i in threshold_evaluation_attributes do

For each each_cell_values in column i do

If each_cell_values > maximum_threshold_values[i] then

each_cell_values = 1

else

each_cell_values = 0

end if

end for

end for

Step 6: Validation schemes

Divide the dataset into two partitions (Training data 70% and Testing data 30%) using sampling techniques

Step 7: Construct an ensemble model.

end procedure

Procedure Ensemble Model

Input: A dataset D, a collection of classification algorithms N, and the number of classifiers n; C_i denotes the classification algorithm chosen at iteration i, and S_i the corresponding set of data samples.

Output: An ensemble G

Step 7.1: For i = 1 to n

Step 7.2: Draw a bootstrap sample S_i from D, the same size as D.

Step 7.3: Select the ((i mod |N|) + 1)-th element of N as C_i.

Step 7.4: Train G_i by applying C_i to S_i.

Step 7.5: end for

Step 7.6: Return $G = \bigcup_{i = 1}^{n} G_{i}$

Step 8: Validate the constructed model using testing data.

Step 8.1: Classification: Supervised learning technique (target variable required)

Step 8.2: Apply the classification techniques using DT, LR, NB, NN, RF, and SVM.

Step 8.3: Get the output from classification methods classified into the classes of the target.

Step 9: Validate the performance of the classification models using the confusion matrix and various performance measures (Accuracy (A), Error Rate (ER), F1-Score (F), Precision (P), Recall (R), Sensitivity (SS), Specificity (SC), Receiver operating characteristic curve - Area under the curve (ROC-AUC Score), True Negative (TN), False Positive (FP), False Negative (FN), and True Positive (TP)).

Step 10: Construct the final prediction model.

end procedure
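Steps 2 and 3 of Algorithm 1 can be sketched as below. The sketch reads Step 2 as z-score standardisation, which is an interpretation, and implements Step 3 as a plain per-column count of null entries, since the counting formula referenced in Step 3.2 is not reproduced in this section. Missing values are assumed to be encoded as `None`, and each column is assumed to have a non-zero standard deviation; the function names are hypothetical.

```python
from statistics import mean, pstdev


def count_missing(rows):
    """Return A_j, the number of missing (None) entries in each column j."""
    return [sum(1 for r in rows if r[j] is None) for j in range(len(rows[0]))]


def standardise_column(values):
    """Z-score the known values of one column, leaving None entries untouched."""
    known = [v for v in values if v is not None]
    mu, sigma = mean(known), pstdev(known)
    return [None if v is None else (v - mu) / sigma for v in values]


# Hypothetical toy data.
missing_counts = count_missing([[1, None], [None, None], [3, 4]])
scaled = standardise_column([1.0, None, 3.0])
```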

3.5 Proposed ensemble two-fold classification (ETC) framework

The benchmark dataset used in this work contains four types of values: integer, binary, categorical (discrete), and continuous. Each type has different characteristics. Discrete attributes take a fixed set of levels that depends on the attribute, whereas continuous and integer attributes take values within a specific range. We therefore use two different approaches for these two categories. Compared with discrete and continuous values, binary values make rule creation simple because there are only two states, 1 or 0. A critical phase of this framework is therefore converting the dataset into a binary format, which helps the ML algorithms in their rule-generation phases. The significant advantage of the framework is that it suits all types of ML algorithms. The ETC framework is based on categorical binary conversion and threshold weighting: the categorical variable concept applies only to discrete attributes, and threshold evaluation is used for continuous and integer attributes.

The categorical variable concept applies to the following attributes: cp, restecg, slope, and thal. A single discrete attribute column is converted into multiple columns, one per level of the attribute. In each row, the column corresponding to the row's level receives the value 1, and the remaining columns receive 0. For example, the chest pain (cp) attribute has four levels, so after applying the categorical variable concept the cp column is split into four columns: cp_1, cp_2, cp_3, and cp_4. If row 1 has cp = 1, then cp_1 receives the value 1 and the remaining columns (cp_2, cp_3, cp_4) receive 0. The same applies to all samples. After this step, all discrete-valued attributes have been converted into binary values.
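The categorical binary conversion described above can be sketched as follows. The function name and the sample values are hypothetical; only the column naming pattern (cp_1 to cp_4) follows the text.

```python
def one_hot_column(values, name):
    """Split one discrete column into one binary column per level."""
    levels = sorted(set(values))
    return {
        f"{name}_{lvl}": [1 if v == lvl else 0 for v in values]
        for lvl in levels
    }


# Hypothetical chest pain (cp) values for five patients.
cp = [1, 3, 4, 2, 1]
binary_cp = one_hot_column(cp, "cp")
```

In every row exactly one of the generated columns holds 1, so the original level can always be recovered from the binary columns.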

Continuous and integer attributes are handled by threshold evaluation, which requires a maximum boundary value for each attribute. When an instance value exceeds the maximum boundary, the patient may be at risk of the disease; otherwise, the patient is considered to be in a normal (healthy) state. For example, a normal blood pressure level is below 120/80 mm Hg and above 90/60 mm Hg, and boundary values of this kind are what threshold evaluation requires. The remaining attributes are treated similarly. Accordingly, the following maximum threshold values were set inside the algorithm based on medical references to generate a fully binary dataset: age: 55, trestbps: 140, chol: 240, thalach: 165; for ca, the values 0 and 1 are treated as critical states and 2 and 3 as normal.
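A minimal sketch of threshold evaluation, using the threshold values listed above, might look like this. The handling of ca (1 for the values 0 and 1, 0 for the values 2 and 3) is one reading of the text and should be treated as an assumption; the function name and the patient record are hypothetical.

```python
# Maximum threshold values taken from the text.
MAX_THRESHOLDS = {"age": 55, "trestbps": 140, "chol": 240, "thalach": 165}


def binarise_record(record):
    """Encode each attribute as 1 when it exceeds its maximum threshold, else 0."""
    out = {
        name: 1 if record[name] > limit else 0
        for name, limit in MAX_THRESHOLDS.items()
    }
    # Assumed reading of the text: ca values 0 and 1 are critical, 2 and 3 normal.
    out["ca"] = 1 if record["ca"] <= 1 else 0
    return out


# Hypothetical patient record.
patient = {"age": 63, "trestbps": 130, "chol": 250, "thalach": 150, "ca": 0}
encoded = binarise_record(patient)
```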

3.6 Validation schemes

One of the most critical processes in machine learning is model validation. The train-test hold-out validation is a data partitioning strategy for evaluating the produced model’s performance. The dataset is split into two sections: training data (70%) and testing data (30%) [56].
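The 70/30 hold-out split can be illustrated with the short sketch below. Shuffling with a fixed seed is an assumption made for reproducibility; the source does not state which sampling technique or seed was used, and the function name is hypothetical.

```python
import random


def holdout_split(rows, test_fraction=0.3, seed=0):
    """Shuffle the records and split them into training and testing partitions."""
    rng = random.Random(seed)
    shuffled = list(rows)
    rng.shuffle(shuffled)
    n_test = round(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]


# 303 placeholder record indices, matching the size of the Cleveland dataset.
train, test = holdout_split(list(range(303)))
```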

3.7 Ensemble model

Ensemble modelling is a technique in which several diverse models are used to predict a result [61]. In predictive modelling, ensembles are generally more accurate than individual models, and using them reduces the generalisation error of a prediction. The ensemble approach decreases prediction error when the underlying base models are diverse and independent.

To build a hybrid ensemble G of n classifiers from a dataset D and a collection of classification algorithms N, each classifier is formed by applying an algorithm, selected from N in alternation, to a bootstrap sample of D [57]. If algorithms are chosen randomly rather than in alternation, each classifier in the hybrid ensemble is trained using one of the algorithms in N with equal probability. Alternatively, unequal probabilities can be assigned to the different algorithms according to prior knowledge. This process is illustrated in Step 7 of Algorithm 1.

Bootstrap sampling is used to train diverse classifiers by constructing the ensemble from different datasets. The input of the bootstrap sampling method is a dataset D, and the output is a dataset D_a of samples drawn from D with replacement, with |D_a| = |D|. When the classifiers in an ensemble differ only in the datasets used to train them, bootstrap sampling is the only source of diversity. Here, in addition to training the classifiers on different samples, different classification algorithms are used, which offers a second source of diversity [58].
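The construction in Step 7 of Algorithm 1 can be sketched as follows. To keep the example self-contained, two deliberately simple toy learners stand in for the real classifiers (DT, LR, NB, NN, RF, SVM), and predictions are combined by majority vote; both choices are assumptions, since this section does not state how the ensemble members are combined. All class and function names are hypothetical.

```python
import random
from collections import Counter


class MajorityClass:
    """Toy learner: always predicts the most frequent training label."""
    def fit(self, X, y):
        self.label = Counter(y).most_common(1)[0][0]
        return self

    def predict(self, x):
        return self.label


class NearestNeighbour:
    """Toy learner: predicts the label of the closest training point."""
    def fit(self, X, y):
        self.X, self.y = X, y
        return self

    def predict(self, x):
        d = [sum((a - b) ** 2 for a, b in zip(x, row)) for row in self.X]
        return self.y[d.index(min(d))]


def build_hybrid_ensemble(X, y, algorithms, n, seed=0):
    """Train n classifiers, alternating algorithms, each on a bootstrap sample."""
    rng = random.Random(seed)
    ensemble = []
    for i in range(1, n + 1):
        # Bootstrap sample S_i: |D| draws with replacement.
        idx = [rng.randrange(len(X)) for _ in range(len(X))]
        # The ((i mod |N|) + 1)-th element of N, written with 0-based indexing.
        algo = algorithms[i % len(algorithms)]
        ensemble.append(algo().fit([X[k] for k in idx], [y[k] for k in idx]))
    return ensemble


def vote(ensemble, x):
    """Combine member predictions by majority vote (an assumed combination rule)."""
    return Counter(m.predict(x) for m in ensemble).most_common(1)[0][0]


# Hypothetical toy data with two well-separated classes.
X = [[0, 0], [0, 1], [1, 0], [1, 1], [5, 5], [5, 6], [6, 5], [6, 6]]
y = [0, 0, 0, 0, 1, 1, 1, 1]
ensemble = build_hybrid_ensemble(X, y, [MajorityClass, NearestNeighbour], n=5)
prediction = vote(ensemble, [6, 6])
```

Alternating through the algorithm list and resampling the data at each iteration supplies the two sources of diversity discussed above.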

This trained model can be used to predict heart disease and help detect it in patients. As a consequence, the number of tests required can be limited, and the condition can be treated at the right time, potentially saving the lives of lakhs of people.

4 Experimental results and discussion

In this section, the experimental results of the proposed ETC framework are discussed. Four experiments are conducted on the UCI Cleveland dataset to evaluate the efficiency of the proposed ETC framework using six classification algorithms: DT, LR, NB, NN, RF, and SVM. The classifiers' performance is assessed with several evaluation metrics (accuracy, Mean Squared Error (MSE), precision, recall, f1-score, sensitivity, specificity, and ROC_AUC score). The results of the four experiments (E) before and after applying the ETC framework are discussed in the following sections and presented in Tables 3–8. The overall comparison of the four imputation techniques is shown in Fig. 6.
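Several of the reported metrics follow directly from the confusion-matrix counts. The sketch below uses the standard definitions and the counts reported in Table 3 for E1 (Decision Tree without ETC): TN = 37, FP = 10, FN = 14, TP = 29. The function name is hypothetical. For this row, the computed error rate, sensitivity, and specificity agree with the ER, SS, and SC columns, and the mean of sensitivity and specificity agrees with the reported ROC AUC score.

```python
def confusion_metrics(tn, fp, fn, tp):
    """Derive common classification metrics from confusion-matrix counts."""
    total = tn + fp + fn + tp
    sensitivity = tp / (tp + fn)          # identical to recall
    specificity = tn / (tn + fp)
    precision = tp / (tp + fp)
    f1 = 2 * precision * sensitivity / (precision + sensitivity)
    return {
        "error_rate": (fp + fn) / total,
        "precision": precision,
        "recall": sensitivity,
        "f1": f1,
        "sensitivity": sensitivity,
        "specificity": specificity,
        "balanced_accuracy": (sensitivity + specificity) / 2,
    }


# Counts from Table 3, row E1, Decision Tree without ETC.
m = confusion_metrics(tn=37, fp=10, fn=14, tp=29)
```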

Table 3
Different imputation methods for DT classifiers with and without ETC framework

Decision Tree without ETC algorithm
E	A	ER(MSE)	P	R	F	SS	SC	ROC AUC score	TN	FP	FN	TP
E1	0.9999	0.2666	0.74	0.67	0.71	0.6744	0.7872	0.7308	37	10	14	29
E2	0.9999	0.2417	0.81	0.62	0.7	0.619	0.8775	0.7482	43	6	16	26
E3	0.9999	0.3076	0.65	0.65	0.65	0.65	0.7254	0.6877	37	14	14	26
E4	0.9999	0.2637	0.63	0.74	0.68	0.7428	0.7321	0.7375	41	15	9	26
Decision Tree with ETC algorithm
E	A	ER(MSE)	P	R	F	SS	SC	ROC AUC score	TN	FP	FN	TP
E1	0.9999	0.3222	0.6	0.65	0.62	0.6486	0.6981	0.6733	38	16	13	24
E2	0.9999	0.2197	0.69	0.84	0.76	0.8378	0.7407	0.7892	40	14	6	31
E3	0.9999	0.2637	0.65	0.69	0.67	0.6857	0.7678	0.7267	43	13	11	24
E4	0.9999	0.2417	0.7	0.7	0.66	0.7027	0.7962	0.7494	43	11	11	26

Table 4

Different imputation methods for LR classifiers with and without ETC framework

Logistic Regression without ETC algorithm
E	A	ER(MSE)	P	R	F	SS	SC	ROC AUC score	TN	FP	FN	TP
E1	0.855	0.2111	0.78	0.83	0.8	0.8297	0.7441	0.7869	32	11	8	39
E2	0.849	0.1758	0.78	0.82	0.79	0.8157	0.8301	0.8229	44	9	7	31
E3	0.8679	0.1648	0.88	0.72	0.79	0.725	0.9215	0.8232	47	4	11	29
E4	0.8679	0.1648	0.88	0.78	0.82	0.7777	0.8913	0.8345	41	5	10	35
Logistic Regression with ETC algorithm
E	A	ER(MSE)	P	R	F	SS	SC	ROC AUC score	TN	FP	FN	TP
E1	0.8888	0.1777	0.75	0.83	0.85	0.8333	0.8148	0.824	44	10	6	30
E2	0.8679	0.1428	0.89	0.78	0.83	0.775	0.9215	0.8482	47	4	9	31
E3	0.8679	0.1208	0.9	0.84	0.87	0.8444	0.913	0.8787	42	4	7	38
E4	0.9009	0.1648	0.91	0.73	0.74	0.7272	0.9361	0.8317	44	3	12	32

Table 5

Different imputation methods for NB classifiers with and without ETC framework

Naive Bayes without ETC algorithm
E	A	ER(MSE)	P	R	F	SS	SC	ROC AUC score	TN	FP	FN	TP
E1	0.8696	0.2777	0.79	0.54	0.77	0.5365	0.8775	0.707	43	6	19	22
E2	0.8349	0.2087	0.86	0.68	0.76	0.6818	0.8936	0.7877	42	5	14	30
E3	0.8349	0.1318	0.86	0.81	0.83	0.8108	0.9074	0.8591	49	5	7	30
E4	0.8632	0.1758	0.88	0.77	0.82	0.7659	0.8863	0.8261	39	5	11	36
Naive Bayes with ETC algorithm
E	A	ER(MSE)	P	R	F	SS	SC	ROC AUC score	TN	FP	FN	TP
E1	0.744	0.1888	0.77	0.83	0.8	0.8292	0.7959	0.8125	39	10	7	34
E2	0.8443	0.1318	0.83	0.91	0.87	0.909	0.8297	0.8694	39	8	4	40
E3	0.8302	0.1318	0.89	0.81	0.85	0.8095	0.9183	0.8639	45	4	8	34
E4	0.7594	0.1538	0.82	0.78	0.6	0.7777	0.8909	0.8343	49	6	8	28

Table 6

Different imputation methods for NN classifiers with and without ETC framework

Neural Network without ETC algorithm
E	A	ER(MSE)	P	R	F	SS	SC	ROC AUC score	TN	FP	FN	TP
E1	0.9468	0.2333	0.78	0.73	0.75	0.7272	0.8043	0.7658	37	9	12	32
E2	0.9575	0.1538	0.87	0.79	0.82	0.7857	0.8979	0.8418	44	5	9	33
E3	0.9528	0.2087	0.83	0.74	0.79	0.7446	0.8409	0.7927	37	7	12	35
E4	0.9622	0.2087	0.76	0.78	0.77	0.775	0.8039	0.7894	41	10	9	31
Neural Network with ETC algorithm
E	A	ER(MSE)	P	R	F	SS	SC	ROC AUC score	TN	FP	FN	TP
E1	0.9903	0.2222	0.74	0.78	0.76	0.775	0.78	0.7775	39	11	9	31
E2	0.9952	0.2637	0.67	0.67	0.67	0.6666	0.7818	0.7242	43	12	12	24
E3	0.9999	0.2857	0.64	0.6	0.62	0.6	0.7857	0.6928	44	12	14	21
E4	0.9811	0.1978	0.84	0.72	0.77	0.7209	0.875	0.7979	42	6	12	31

Table 7

Different imputation methods for RF classifiers with and without ETC framework

Random Forest without ETC algorithm
E	A	ER(MSE)	P	R	F	SS	SC	ROC AUC score	TN	FP	FN	TP
E1	0.9951	0.2444	0.72	0.64	0.68	0.6388	0.8333	0.7361	45	9	13	23
E2	0.9999	0.2417	0.84	0.6	0.7	0.6046	0.8958	0.7502	43	5	17	26
E3	0.9952	0.2197	0.82	0.75	0.78	0.75	0.8139	0.7819	35	8	12	36
E4	0.9952	0.1978	0.83	0.71	0.76	0.7073	0.88	0.7936	44	6	12	29
Random Forest with ETC algorithm
E	A	ER(MSE)	P	R	F	SS	SC	ROC AUC score	TN	FP	FN	TP
E1	0.9999	0.1	0.89	0.87	0.88	0.8717	0.9215	0.8966	47	4	5	34
E2	0.9999	0.0989	0.89	0.87	0.88	0.875	0.9215	0.8975	47	4	5	35
E3	0.9952	0.1648	0.85	0.79	0.81	0.7857	0.8775	0.8316	43	6	9	33
E4	0.9999	0.1648	0.82	0.8	0.81	0.8048	0.86	0.8324	43	7	8	33

Table 8

Different imputation methods for SVM classifiers with and without ETC framework

Support Vector Machine without ETC algorithm
E	A	ER(MSE)	P	R	F	SS	SC	ROC AUC score	TN	FP	FN	TP
E1	0.9323	0.1444	0.88	0.81	0.84	0.8139	0.8936	0.8537	42	5	8	35
E2	0.8962	0.2197	0.82	0.7	0.76	0.7045	0.851	0.7778	40	7	13	31
E3	0.9056	0.1428	0.84	0.76	0.8	0.7647	0.9122	0.8384	52	5	8	26
E4	0.8962	0.1978	0.83	0.76	0.8	0.7608	0.8444	0.8026	38	7	11	35
Support Vector Machine with ETC algorithm
E	A	ER(MSE)	P	R	F	SS	SC	ROC AUC score	TN	FP	FN	TP
E1	0.942	0.2111	0.83	0.7	0.76	0.6976	0.8723	0.785	41	6	13	30
E2	0.9622	0.1208	0.9	0.84	0.87	0.8409	0.9148	0.8779	43	4	7	37
E3	0.9292	0.1538	0.83	0.83	0.83	0.8333	0.8571	0.8452	42	7	7	35
E4	0.9339	0.1538	0.83	0.83	0.83	0.8292	0.86	0.8446	43	7	7	34

Fig. 6

Graphical representation of overall comparative analysis of accuracy and error rate of four different imputation methods.

4.1 Experiment 1: Imputation using listwise deletion method with and without ETC framework (E1)

Experiment 1 evaluates listwise deletion with and without the ETC framework. The Cleveland dataset has 76 attributes and 303 records. In this experiment, the records containing MVs are removed, and six classification models (DT, LR, NB, NN, RF, and SVM) are constructed with and without the ETC framework. The performance with and without the ETC framework for Experiment 1 is presented in Fig. 2. Without the ETC framework, the DT classifier achieves the highest classification accuracy of 0.9999, with an error rate of 0.2666, compared with other classification models with similar execution times.

Fig. 2

Graphical representation of imputation using listwise deletion method with and without ETC framework.

With the ETC framework, the RF algorithm achieved the best accuracy of 0.9999 and an error rate of 0.1, with precision, recall, f1-score, sensitivity, specificity, and ROC-AUC score of 0.89, 0.87, 0.88, 0.8717, 0.9215, and 0.8966, respectively. For the most part, the ETC framework improved prediction accuracy and reduced the error rate. It is also inferred that the proposed ETC framework performs well on most classification models when compared with models built without it.

4.2 Experiment 2: Imputation using k-nearest neighbors (k-NN) method with and without ETC framework (E2)

Experiment 2 evaluates the k-Nearest Neighbors (k-NN) imputation method with and without the ETC framework. The performance with and without the ETC framework for Experiment 2 is presented in Fig. 3. In this experiment, the MVs are replaced using the k-NN imputation method, and six different classification models are constructed with and without the ETC framework. Without the ETC framework, the DT and RF classifiers are comparatively robust, both obtaining a classification accuracy of 0.9999 with an error rate of 0.2417, compared with other classification models with comparable execution times.

Fig. 3

Graphical representation of imputation using k-Nearest Neighbors (k-NN) method with and without ETC framework.

With the ETC framework, the DT and RF algorithms are again relatively strong. When error rate and accuracy are both taken into account as assessment measures, RF outperforms DT. The RF algorithm achieved an accuracy of 0.9999 and an error rate of 0.0989, with precision, recall, f1-score, sensitivity, specificity, and ROC-AUC score of 0.89, 0.87, 0.88, 0.875, 0.9215, and 0.8975, respectively. Compared with models built without the suggested ETC framework, the framework improves results for most classification methods.

4.3 Experiment 3: Imputation using naive bayes method with and without ETC framework (E3)

Figure 4 displays the performance of Experiment 3 with and without the ETC framework. In this experiment, the MVs are replaced using the Naive Bayes imputation approach, and six classification models are built with and without the ETC framework. When accuracy is the only evaluation parameter, the results without the ETC framework show that the DT classifier achieves the highest classification accuracy of 0.9999. When the error rate is the evaluation parameter, the NB classifier is superior to the other algorithms (0.1318). When accuracy and error rate are considered together, the RF classifier is superior to the other classification algorithms.

Fig. 4

Graphical representation of imputation using naive bayes method with and without ETC framework

With the ETC framework, the DT algorithm achieved an accuracy of 0.9999 and an error rate of 0.2637, with precision, recall, f1-score, sensitivity, specificity, and ROC-AUC score of 0.65, 0.69, 0.67, 0.6857, 0.7678, and 0.7267, respectively. For the most part, the ETC framework improved prediction accuracy and reduced the error rate. Additionally, it can be deduced that most classification models perform well compared with the corresponding models built without the suggested ETC framework.

4.4 Experiment 4: Imputation using mean values with and without ETC framework (E4)

The performance with and without the ETC framework for Experiment 4 is presented in Fig. 5. In this experiment, the MVs are replaced using the mean imputation method, and six different classification models are constructed with and without the ETC framework. The experimental findings without the ETC framework show that the DT classifier obtains the highest classification accuracy of 0.9999 when accuracy is used as the single evaluation metric. When the error rate is considered as the evaluation parameter, the LR classifier performs better (0.1648) than the other algorithms.

Fig. 5

Graphical representation of imputation using mean values with and without ETC framework

With the ETC framework, the DT and RF algorithms perform relatively well. When error rate and accuracy are both considered as evaluation metrics, RF surpasses DT. The RF algorithm achieved an accuracy of 0.9999 and an error rate of 0.1648, with precision, recall, f1-score, sensitivity, specificity, and ROC-AUC score of 0.82, 0.8, 0.81, 0.8048, 0.86, and 0.8324, respectively. Additionally, it can be inferred that the classification models perform well when compared with the corresponding models built without the suggested ETC framework.

From Table 3, the experimental results without the ETC framework using the DT classifier show that the k-NN imputation method performs well in terms of accuracy and error rate compared to the other imputation methods. The experimental results with the ETC framework using the DT classifier are similar for all imputation methods. Still, the k-NN imputation approach is superior when the error rate is considered as an extra parameter. From these observations, the evaluation metrics of the DT classifier using the ETC framework are good for all imputation methods.

Table 4’s experimental results utilizing the LR classifier without the ETC framework demonstrate that the NB imputation method outperforms other imputation techniques in terms of accuracy and error rate. All imputation approaches’ experimental outcomes using the ETC framework and LR classifier are comparatively similar. Still, the NB imputation strategy is superior when the error rate is considered as an additional parameter. According to the observation, the performance of the evaluation metrics for the LR classifier utilizing the ETC framework and all imputation methods is good.

The NB imputation method outperforms other imputation strategies in terms of accuracy and error rate, according to the experimental results in Table 5, utilising the NB classifier without the ETC framework. While the experimental results of all imputation strategies using the ETC framework and NB classifier are generally comparable, the k-NN imputation technique is superior when the error rate is considered an extra parameter. The performance of the evaluation metrics for the NB classifier using the ETC framework and all imputation methods, according to the observation, is good.

The experimental results in Table 6 using the NN classifier without the ETC framework show that the k-NN imputation approach beats alternative imputation strategies in terms of accuracy and error rate. The mean imputation strategy is superior when the error rate is considered as an additional parameter, even if the experimental results of all imputation strategies employing the ETC framework and NN classifier are generally equivalent. According to the observation, the performance of the assessment metrics for the NN classifier utilizing the ETC framework and all imputation techniques is good.

According to the experimental results in Table 7 using the RF classifier without the ETC framework, the mean imputation method outperforms other imputation strategies in terms of accuracy and error rate. Even though the experimental results of all imputation strategies using the ETC framework and RF classifier are largely similar, the k-NN imputation strategy is preferred when the error rate is considered as an additional parameter. According to the observations, the performance of the assessment metrics for the RF classifier using the ETC framework and all imputation procedures is good.

From Table 8, the experimental results without the ETC framework using the SVM classifier show that the listwise deletion method performs well in terms of accuracy and error rate compared with the other imputation methods. The experimental results with the ETC framework using the SVM classifier are similar for all imputation methods. Still, the k-NN imputation approach is superior when the error rate is considered as an extra parameter. From these observations, the evaluation metrics of the SVM classifier using the ETC framework are good for all imputation methods.

From Fig. 6 and Tables 3–8, the experimental results of the four imputation experiments and six classification models with and without the ETC framework show that the k-NN imputation method with RF performs well when compared with the other imputation methods and classifiers. Across the four imputation methods, the ETC framework improved prediction accuracy and reduced the error rate. The results also suggest that the ETC framework has considerable potential and could serve as a model for CDSS architecture in the healthcare industry.

5 Conclusion and future work

A novel CDSS with an Ensemble Two-Fold Classification (ETC) framework for classifying cardiovascular diseases is proposed in this paper. This framework addresses the missing values in the clinical dataset. Further, the effectiveness of the proposed ETC framework using six classification algorithms (DT, LR, NB, NN, RF, and SVM) is evaluated with four distinct imputation methods for handling MVs over the standard benchmark dataset, viz., UCI. Compared with the other imputation methods with similar execution times, the proposed ETC framework with the k-NN imputation technique and an RF classifier achieves a better classification accuracy of 0.9999 and a lower error rate of 0.0989. According to the analysis of the results, the proposed ETC framework outperformed individual classifiers for all imputation approaches in terms of accuracy, F1-score, precision, recall, sensitivity, and other assessment metrics. In the future, the performance of the proposed ETC framework in diagnosing chronic diseases such as kidney disease, diabetes, breast cancer, liver disease, hepatitis, and all types of cancer will be evaluated using available datasets. In addition, the proposed framework can be expanded by utilising IoT devices to collect clinical parameters in real time. Moreover, a user-friendly application based on the suggested framework might be developed, allowing users to access it online and execute any query quickly and effectively.

References

1. Coronary Artery Disease: MedlinePlus: 2022. https://medlineplus.gov/coronaryarterydisease.html.

2. World Health Organization. Available from: https://www.who.int/news-room/fact-sheets/detail/cardiovascular-diseases-(cvds).

3. Das, Turkoglu and Sengur, Effective diagnosis of heart disease through neural networks ensembles, Expert Systems with Applications 36(4) (2009), 7675–7680.

4. Lee H.G., Noh K.Y., Ryu K.H., Mining biosignal data: coronary artery disease diagnosis using linear and nonlinear features of HRV, In Pacific-Asia Conference on Knowledge Discovery and Data Mining (pp. 218–228), Springer, Berlin, Heidelberg, 2007.

5. Masethe H.D., Masethe M.A., Prediction of heart disease using classification algorithms, In Proceedings of the World Congress on Engineering and Computer Science (Vol. 2, pp. 22–24) (2014).

6. Nahar, Imam, Tickle K.S. and Chen Y.P.P., Computational intelligence for heart disease diagnosis: A medical knowledge-driven approach, Expert Systems with Applications 40(1) (2013), 96–104.

7. Ayilara O.F., Zhang, Sajobi T.T., Sawatzky, Bohm and Lix L.M., Impact of missing data on bias and precision when estimating change in patient-reported outcomes from a clinical registry, Health Qual Life Outcomes 17(1) (2019), 106.

8. Devaraj, Paulraj, An efficient feature subset selection algorithm for classification of multidimensional dataset, The Scientific World Journal, 2015.

9. Langkamp D.L., Lehman and Lemeshow, Techniques for handling missing data in secondary analyses of large surveys, Acad Pediatr 10(3) (2010), 205–210.

10.

Donders

A.R.T.

, Van Der Heijden

G.J.

, Stijnen

and Moons

K.G.

, A gentle introduction to imputation of missing values, J Clin Epidemiol 59(10) (2006), 1087–1091.

11.

Graham

J.W.

, Missing data analysis: making it work in the real world, Annu Rev Psychol 60 (2009), 549–576.

12.

Baraldi

A.N.

and Enders

C.K.

, An introduction to modern missing data analyses, J Sch Psychol 48(1) (2010), 5–37.

13.

Kang

, The prevention and handling of the missing data, Korean J Anesthesiol 64(5) (2013), 402.

14.

Kumar

D.S.

, Sathyadevi

and Sivanesh

, Decision support system for medical diagnosis using data mining, International Journal of Computer Science Issues (IJCSI) 8(3) (2011), 147.

15.

Kumar

D.S.

, Sukanya

and BIT-Campus

, Feature selection using multivariate adaptive regression splines, International Journal of Research and Reviews in Applied Sciences and Engineering (IJRRASE) 8(1) (2016), 17–24.

16.

Senthilkumar

and Paulraj

, Diabetes disease diagnosis using multivariate adaptive regression splines, AGE 768 (2013), 52.

17.

Chipman

H.A.

, George

E.I.

and McCulloch

R.E.

, BART: Bayesian additive regression trees, The Annals of Applied Statistics 4(1) (2010), 266–298.

18.

Hernandez

, Raftery

A.E.

, Pennington

S.R.

and Parnell

A.C.

, Bayesian additive regression trees using Bayesian model averaging, Statistics and Computing 28(4) (2018), 869–890.

19.

Hill

, Linero

and Murray

, Bayesian additive regression trees: a review and look forward, Annual Review of Statistics and Its Application 7(1) (2020), 251–278.

20.

Fukuma

, Prasath

V.S.

, Kawanaka

, Aronow

B.J.

, Takase

Eds.,Astudy on feature extraction and disease stage classification for glioma pathology images, in 2016 IEEE International Conference on Fuzzy Systems (FUZZ-IEEE), Vancouver, BC, Canada, 2016.

21.

Rani

, Kumar

, Jain

Multistage model for accurate prediction of missing values using imputation methods in heart disease dataset, In Innovative data communication technologies and application (pp. 637–653), Springer, Singapore, 2021.

22.

Saravana Kumar

Shenbagavadivu

Minimized Error Rate with Improved Prediction Accuracy Using Preprocessing Models, In Ubiquitous Intelligent Systems (pp. 597–610), Springer, Singapore, 2022.

23.

Kumar

, Hoque

and Sugimoto

, Kernel weighted least square approach for imputing missing values of metabolomics data, Scientific Reports 11(1) (2021), 1–12.

24.

Nugroho

, Utama

N.P.

and Surendro

, Class center-based firefly algorithm for handling missing data, Journal of Big Data 8(1) (2021), 1–14.

25.

Hung

C.Y.

, Jiang

B.C.

and Wang

C.C.

, Evaluating machine learning classification using sorted missing percentage technique based on missing data, Applied Sciences 10(14) (2020), 4920.

26.

Beaulac

and Rosenthal

J.S.

, BEST: A decision tree algorithm that handles missing values, Computational Statistics 35(3) (2020), 1001–1026.

27. Veras, Mesquita D.P., Mattos C.L. and Gomes J.P., A sparse linear regression model for incomplete datasets, Pattern Analysis and Applications 23(3) (2020), 1293–1303.

28. Ngouna R.H., Ratolojanahary, Medjaher, Dauriac, Sebilo and Junca-Bourie, A data-driven method for detecting and diagnosing causes of water quality contamination in a dataset with a high rate of missing values, Engineering Applications of Artificial Intelligence 95 (2020), 103822.

29. Ward R.C., Axon R.N. and Gebregziabher, Approaches for missing covariate data in logistic regression with MNAR sensitivity analyses, Biometrical Journal 62(4) (2020), 1025–1037.

30. Yen N.Y., Chang J.-W., Liao J.-Y. and Yong Y.-M., Analysis of interpolation algorithms for the missing values in IoT time series: a case of air quality in Taiwan, The Journal of Supercomputing 76(8) (2019), 6475–6500.

31. Kim, Ko and Kim, Analysis and impact evaluation of missing data imputation in day-ahead PV generation forecasting, Applied Sciences 9(1) (2019), 204.

32. Raja P.S., Sasirekha and Thangavel, A novel fuzzy rough clustering parameter-based missing value imputation, Neural Computing and Applications 32(14) (2020), 10033–10050.

33. Dzulkalnine M.F. and Sallehuddin, Missing data imputation with fuzzy feature selection for diabetes dataset, SN Applied Sciences 1(4) (2019), 1–12.

34. Tsai C.F., Li M.L. and Lin W.C., A class center based approach for missing value imputation, Knowledge-Based Systems 151 (2018), 124–135.

35. UCI heart disease dataset, available from: http://archive.ics.uci.edu/ml/datasets/Heart+Disease.

36. Little R.J., Rubin D.B., Statistical analysis with missing data (Vol. 793), John Wiley & Sons, (2019).

37. De Leeuw E.D., Hox J. and Huisman M., Prevention and treatment of item nonresponse, Journal of Official Statistics-Stockholm 19(2) (2003), 153–176.

38. Berglund, Heeringa S.G., Multiple imputations of missing data using SAS, SAS Institute, (2014).

39. Chipman H.A., George E.I. and McCulloch R.E., BART: Bayesian additive regression trees, The Annals of Applied Statistics 4(1) (2010), 266–298.

40. Hill, Linero and Murray, Bayesian additive regression trees: A review and look forward, Annual Review of Statistics and Its Application 7 (2020), 251–278.

41. Lin W.C. and Tsai C.F., Missing value imputation: a review and analysis of the literature (2006–2017), Artificial Intelligence Review 53(2) (2020), 1487–1509.

42. Hernandez, Raftery A.E., Pennington S.R. and Parnell A.C., Bayesian additive regression trees using Bayesian model averaging, Statistics and Computing 28(4) (2018), 869–890.

43. Cheliotis, Gkerekos, Lazakis and Theotokatos, A novel data condition and performance hybrid imputation method for energy efficient operations of marine systems, Ocean Engineering 188 (2019), 106220.

44. Poolsawad, Moore, Kambhampati, Cleland J.G., Handling missing values in data mining: A case study of heart failure dataset, In 2012 9th International Conference on Fuzzy Systems and Knowledge Discovery (pp. 2934–2938), IEEE, (2012).

45. Frawley W.J., Piatetsky-Shapiro and Matheus C.J., Knowledge discovery in databases: An overview, AI Magazine 13(3) (1992), 57–57.

46. Cover and Hart, Nearest neighbor pattern classification, IEEE Transactions on Information Theory 13(1) (1967), 21–27.

47. Batista G.E. and Monard M.C., An analysis of four missing data treatment methods for supervised learning, Applied Artificial Intelligence 17(5-6) (2003), 519–533.

48. Zhang, Nearest neighbor selection for iteratively kNN imputation, Journal of Systems and Software 85(11) (2012), 2541–2552.

49. Maillo, Ramirez, Triguero and Herrera, kNN-IS: An Iterative Spark-based design of the k-Nearest Neighbors classifier for big data, Knowledge-Based Systems 117 (2017), 3–15.

50. Amirteimoori and Kordrostami, A Euclidean distance-based measure of efficiency in data envelopment analysis, Optimization 59(7) (2010), 985–996.

51. Gimpy M.D.R.V., Missing value imputation in multi-attribute data set, Int J Comput Sci Inf Technol 5(4) (2014), 1–7.

52. Suthar, Patel and Goswami, A survey: classification of imputation methods in data mining, International Journal of Emerging Technology and Advanced Engineering 2(1) (2012), 309–312.

53. Pujianto, Wibawa A.P., Akbar M.I., K-nearest neighbor (k-NN) based missing data imputation, In 2019 5th International Conference on Science in Information Technology (ICSITech) (pp. 83–88), IEEE, 2019.

54. Garcia A.J., Hruschka E.R., Naive Bayes as an imputation tool for classification problems, In Fifth International Conference on Hybrid Intelligent Systems (HIS’05) (pp. 3-pp), IEEE, 2005.

55. Leng, Wang, Learning naive Bayes classifiers with incomplete data, In 2009 International Conference on Artificial Intelligence and Computational Intelligence (Vol. 4, pp. 350–353), IEEE, 2009.

56. Das, Turkoglu and Sengur, Effective diagnosis of heart disease through neural networks ensembles, Expert Systems with Applications 36(4) (2009), 7675–7680.

57. Senthilkumar and Paulraj, Ensemble Deep Learning for Multi Label Classification in the Design of Clinical Decision Support System, Asian Journal of Information Technology 15(15) (2016), 2632–2637. DOI: 10.3923/ajit.2016.2632.2637.

58. Hsu K.W., A theoretical analysis of why hybrid ensembles work, Computational Intelligence and Neuroscience 2017 (2017).

59. Acuna, Rodriguez, The treatment of missing values and its effect on classifier accuracy, In Classification, clustering, and data mining applications (pp. 639–647), Springer, Berlin, Heidelberg, 2004.

60. Myers W.R., Handling missing data in clinical trials: an overview, Drug Information Journal: DIJ/Drug Information Association 34(2) (2000), 525–533.

61. Bashir, Qamar and Khan F.H., IntelliHealth: a medical decision support application using a novel weighted multi-layer classifier ensemble framework, Journal of Biomedical Informatics 59 (2016), 185–200.

62. Abad-Segura, Gonzalez-Zamar M.D., Gomez-Galan and Bernal-Bravo, Management accounting for healthy nutrition education: meta-analysis, Nutrients 12(12) (2020), 3715.

63. Onan, Consensus clustering-based undersampling approach to imbalanced learning, Scientific Programming 2019 (2019).

64. Onan, Korukoglu and Bulut, Ensemble of keyword extraction methods and classifiers in text classification, Expert Systems with Applications 57 (2016), 232–247.