Abstract
Parkinson’s disease (PD) is a neurodegenerative condition that affects the neurological, behavioral, and physiological systems of the brain. According to the most recent WHO data, 0.51 percent of all fatalities in India are caused by PD. About one million people in the United States suffer from PD, out of nearly five million people worldwide. Approximately 90% of Parkinson’s patients have speech difficulties. It is therefore crucial to identify PD early so that appropriate treatment can be determined. For the early diagnosis of PD, we propose a Bagging-based hybrid (B-HPD) approach in this study. Seven classifiers, namely Random Forest (RF), Decision Tree (DT), Logistic Regression (LR), Naïve Bayes (NB), K-nearest neighbor (KNN), Random Under-sampling Boost (RUSBoost) and Support Vector Machine (SVM), are considered as base estimators for the Bagging ensemble method, and three oversampling techniques, namely Synthetic Minority Oversampling Technique (SMOTE), Adaptive Synthetic (ADASYN) and SVMSmote, are implemented. Feature Selection (FS) is also used for data preprocessing and further performance enhancement. We obtain the (imbalanced) Parkinson’s Disease classification dataset from the Kaggle repository. Finally, using two performance measures, Accuracy and Area under the curve (AUC), we compare the performance of the models with ALL features and with selected features. Our study suggests that Bagging with RF as the base classifier shows the best performance in all cases (with ALL features: 754, with FS: 500, and with all three oversampling techniques) and may be used for PD diagnosis in the healthcare industry.
Introduction
Parkinson’s disease is a slowly developing brain illness caused by neurodegeneration, i.e., loss of brain cells. There are two categories of PD symptoms: motor symptoms and non-motor symptoms. Common motor symptoms include trembling, stiffness, and difficulty speaking and walking. These symptoms appear after 60–80% of dopamine-producing cells are destroyed. A neurotransmitter called dopamine is responsible for passing messages from the substantia nigra to other areas of the brain that control movement in the body. Dopamine enables humans to move with ease and balance. Typically, the symptoms start mildly and worsen over time. Non-motor symptoms, which include mental and behavioral issues, memory issues, sleep issues, sadness, exhaustion, and so forth, may appear as the disease progresses [1]. Most people with PD begin to experience symptoms at age 60 or older; however, between 5 and 10 percent of cases start earlier. While the progression of symptoms varies from person to person, the most common adverse effects of dopaminergic neuron loss are balance issues and tremors. Unfortunately, there is no cure for PD, so patients depend on early detection and personalized therapy to limit the disease’s progression. PD comprises five stages of progression, with 90% of patients exhibiting symptoms associated with vocal cord damage in stage 0. Vocal dysfunction is not only easy to diagnose, but it also lends itself to telemedicine or remote medicine: instead of physically visiting a doctor, patients can record their voice on their phones and conduct a basic test at home. Dysphonia and dysarthria are two common signs of impaired voice modulation [2].
Early diagnosis of Parkinson’s disease may be challenging for a variety of reasons, including the fact that neurologists and movement disorder specialists take a long time to diagnose this condition, since they must meticulously assess the patient’s whole medical history and conduct many scans [1]. The ability of clinicians to effectively diagnose PD depends on their domain of competence, particularly when reviewing patient data and symptoms. Unfortunately, there are not enough qualified medical specialists in developing nations like Argentina, India, and so on, which makes diagnosing Parkinson’s disease a challenging task. This led us to develop a decision-support system to assist clinicians in diagnosing PD.
Recently, Machine Learning algorithms (MLAs) have become a popular tool for diagnosing diseases in the medical field due to their easy implementation and high accuracy. In the literature, ML has also been utilized for the treatment of PD [3].
The majority of MLAs assume that training datasets are balanced [4]. However, medical datasets are frequently imbalanced, which occurs when one class’s proportion is higher than that of the other class. The majority class is the one with the higher number of samples, and the minority class is the one with the lower number of samples [4]. The learning processes of traditional MLAs are governed by global assessment criteria such as overall accuracy, which may result in the minority class being overlooked and the majority class being favored.
The literature reports various strategies to deal with the Class imbalance problem (CIP). Data level (DL), algorithm level (AL), and hybrid level (HL) approaches are the three categories into which these strategies are divided [4, 5]. The DL approaches are considered external approaches. To balance the dataset, either the overrepresented class’ instances are under-sampled or the underrepresented class’ instances are oversampled. By doing this, ML classifiers are kept from becoming biased towards the majority class. However, DL techniques could result in information loss in under-sampling or overfitting due to oversampling. AL approaches regarded as internal approaches either modify the existing algorithms or create entirely novel ones. AL approaches demand profound familiarity and comprehension of the algorithm. Finally, HL strategies combine the DL and AL methodologies. This category includes cost-sensitive methods as well as ensemble methods [5, 6]. Every instance in the dataset is given a weight in cost-sensitive techniques, and these weights are adjusted depending on the performance of the classifier. The likelihood that an instance will be chosen for training by the classifier in the following cycle is represented by weights. Using the ensemble method, training is performed using various base classifiers and their forecasts are then combined to give a final hypothesis. Furthermore, researchers have delved into the use of FS techniques for enhancing the performance of ML techniques. FS techniques seek to reduce the complexity and dimension of the existing dataset under study by focusing on important or relevant features. This aids in the elimination of noise, prevents overfitting, and enables the ML models to work faster [7, 8].
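As a minimal sketch of the two data-level strategies described above (not the implementation used in this study), random oversampling and random under-sampling can be written in a few lines of plain Python. The function names and the toy feature values are illustrative assumptions; only the class counts (564 vs. 192) are taken from the dataset used in this paper.

```python
import random
from collections import Counter


def random_oversample(X, y, seed=0):
    """Duplicate randomly chosen samples until every class matches the largest class."""
    rng = random.Random(seed)
    counts = Counter(y)
    target = max(counts.values())
    X_out, y_out = list(X), list(y)
    for label, n in counts.items():
        pool = [x for x, lab in zip(X, y) if lab == label]
        for _ in range(target - n):
            X_out.append(rng.choice(pool))  # exact copy of an existing sample
            y_out.append(label)
    return X_out, y_out


def random_undersample(X, y, seed=0):
    """Keep a random subset of every class, sized to the smallest class."""
    rng = random.Random(seed)
    counts = Counter(y)
    target = min(counts.values())
    X_out, y_out = [], []
    for label in counts:
        pool = [x for x, lab in zip(X, y) if lab == label]
        for x in rng.sample(pool, target):  # sampling without replacement
            X_out.append(x)
            y_out.append(label)
    return X_out, y_out


# Toy one-feature data with the same class counts as the PD dataset (564 vs. 192).
# Which class is labelled 1 is an arbitrary choice made for this illustration.
X = [[float(i)] for i in range(756)]
y = [1] * 564 + [0] * 192

X_over, y_over = random_oversample(X, y)
X_under, y_under = random_undersample(X, y)
print(Counter(y_over), Counter(y_under))
```

Oversampling keeps all 564 majority samples and repeats minority samples (risking overfitting), whereas under-sampling discards 372 majority samples (risking information loss), which is the trade-off noted above.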
For the early diagnosis of PD, we propose B-HPD, a Bagging-based Hybrid approach. In this approach, we use SMOTE, ADASYN and SVMSmote as the oversampling techniques and RF as the base classifier with the Bagging method. RF is an ensemble learning approach in which multiple DTs are trained and their forecasts are combined to produce the final outcome. We also compare our proposed approach, B-HPD, with other hybrid models built on six base classifiers: DT, LR, KNN, NB, RUSBoost, and SVM. These classifiers cover a broad range of traditional models and may therefore provide complementary perspectives.
We obtain the PD classification dataset from the Kaggle repository having 756 instances and 755 attributes. The dataset is imbalanced in nature, having majority class instances: 564 and minority class instances: 192. SelectKBest FS technique for selecting the relevant features is used for improving the model’s performance. Finally, the performance of the models with ALL features and with FS is compared.
The primary contributions of this study are as follows:
For the early diagnosis of PD, we propose a hybrid approach, B-HPD, using SMOTE, ADASYN and SVMSmote as the oversampling techniques and RF as the base classifier with the Bagging method. We also explore our proposed approach with other hybrid models using DT, LR, NB, KNN, RUSBoost, and SVM as the base classifiers. Further, we apply the filter method SelectKBest for FS and compare the performance of the hybrid models with ALL features and with selected features. The proposed approach yields better results than the other hybrid models for PD classification.
The paper is organized as follows: Literature Survey is discussed in Section II. Section III describes the Materials and Methods. The proposed methodology is shown in Section IV. Experimental setup and results are discussed in Section V, whereas Section VI concludes this work.
Literature survey
This section provides a concise summary of previous work in the area of handling imbalanced data and detection of PD.
Previous studies for handling class imbalanced data
Tarid et al. [9] in 2023 addressed the various ratios of the CIP (i.e., moderately or severely imbalanced classifications) and compared a number of sampling strategies. The authors employed RUS, random oversampling (ROS), and a hybrid resampling technique that combined RUS with SMOTE-NC, a synthetic minority oversampling strategy for nominal and continuous data. To assess the outcomes of each sampling approach, an RF classifier was implemented. According to the findings, hybrid resampling performed best for severely imbalanced data, whereas ROS performed best for moderately imbalanced datasets. Laith et al. [10] presented a review of the related data augmentation techniques for medical image processing. The work summarized most data augmentation strategies, both traditional and artificial, along with the benefits of employing the strategies to address data scarcity in medical imaging. Aayushi et al. [11] provided a detailed survey of the data scarcity problem and discussed the solutions along with the applications in various domains. The work mainly focused on different augmentation techniques for improving the model’s performance. Harsurinder et al. [12] provided a comprehensive explanation of learning issues caused by imbalanced data distribution. In addition, qualities, challenges, and solutions were also examined. Fredy et al. [13] in 2022 proposed Fast-SMOTE, an over-sampling technique which, on large datasets with different class imbalance ratios (CIR), ran at least twice as fast as the other techniques suggested in the literature. Feng et al. [14] in 2022 proposed an oversampling method for imbalanced data called MeanRadius-SMOTE, which avoids the creation of undesired and noisy samples. The authors compared the performance of the suggested approach with SMOTE and LR-SMOTE. The proposed method outperformed the others, and the authors suggested that it may be of great significance for engineering applications.
Amit et al. [15] in 2022 performed the comparative analysis of diverse class imbalance sampling strategies and conducted a thorough empirical study to assess the efficacy and efficiency of various class imbalance procedures in conjunction with leading-edge classification approaches. During this study, they discovered that ensembles such as AdaBoost, Extreme Gradient Boosting (XGBoost), and RF did well when oversampling was succeeded by under-sampling. Kovacs [16], in 2019, presented and discussed the detailed empirical comparison of 85 variations of minority oversampling approaches with 104 imbalanced datasets. The work’s objectives were to establish a new benchmark in the area, identify the oversampling principles that produce the greatest outcomes in general situations, and instruct practitioners on the best strategies to employ when working with certain types of datasets.
Previous studies for the detection of Parkinson’s disease
Aditi and Sushila [2] in 2023 carried out an experiment for the early identification of PD using MLAs such as SVM, RF, KNN, and LR. It was found that RF exhibited the best accuracy. Senjuti et al. [17] in 2023 studied the use of deep learning models (DLM) and MLAs for the prognosis of PD. Techniques such as XGBoost, AdaBoost, DT, etc., were studied and implemented along with deep neural networks. Among the MLAs, XGBoost had the best accuracy of 92.18%, and among the DLMs, a three-layer deep neural network (DNN3) achieved the best accuracy. To provide support to the medical practitioner in PD diagnosis, Meenakshi and D. Kishore [18] in 2023 investigated MLAs such as XGBoost, Bagging, RF, AdaBoost and KNN. It was found that XGBoost surpassed the others and achieved the highest AUC value. Ahmed et al. [19] in 2023 presented an advanced model of Bayesian Optimization-SVM (BO-SVM) for the classification of people with PD. The authors conducted the study with six MLAs, including SVM, RF, LR, DT and Ridge Classifier. The outcomes suggested that the SVM model showed the best accuracy when BO was used to optimize the hyperparameters. Nader et al. [20] in 2022 performed a thorough examination of MLAs for the detection of PD. The study demonstrated that RF, SVM and LR have the best accuracy and may be used for the diagnosis of PD. Mehrbakhsh et al. [21] in 2022 compared the approaches used in PD diagnosis that were created using machine learning techniques. To conduct the comparison analysis, clustering and prediction learning methodologies were used. The authors employed support vector regression (SVR) ensembles and several clustering approaches for PD data clustering to forecast Motor-UPDRS (Unified Parkinson’s Disease Rating Scale) and Total-UPDRS.
The results of data analysis for the prediction of Motor-UPDRS and Total-UPDRS in patients with PD showed that expectation-maximization with the help of SVR ensembles could offer better prediction accuracy in comparison to DTs, deep belief networks, neurofuzzy, and SVRs combined with other clustering techniques. Dhruv and Ishaan [22] in 2022 compared the performance of six MLAs, namely SVM, LR, NB, RF, DT and KNN, using different metrics. The study demonstrated that KNN achieved the highest accuracy. Arti et al. [23] in 2022 put forth research on the efficacy of using supervised classification algorithms, namely SVM, NB, KNN and artificial neural networks (ANN), with subjective diseases, where the suggested diagnosis method entails FS. The ANN achieved the highest accuracy. When the results of the study were compared with those of earlier ones, the suggested work was found to provide equivalent or superior outcomes.
Previous studies for feature selection techniques
Hela et al. [7] in 2023 applied different FS methods with multiple classifiers for the medical diagnosis of Polycystic Ovary Syndrome (PCOS). Ibomoiye and Yanxia [8] proposed a hybrid feature-selection technique in which features are ranked using the information gain (IG) technique, and the top-ranked features are passed to a genetic algorithm (GA) wrapper that employs the extreme learning machine (ELM) as the learning algorithm. The proposed method fared better than earlier baseline procedures and methodologies in the literature. Atreya et al. [24] presented an experimental study with three classifiers, namely AdaBoost, Bagging and KNN, using the FS technique. The study showed that the FS technique improved the performance of Bagging, with an accuracy of 99.74%. G. Saranya and A. Pravin [25] surveyed and provided a comprehensive study for different diseases using FS and optimization methods. This study clearly demonstrated the need for an efficient unified framework that provides FS without partial data, minimal computational complexity, and maximum precision for any dataset size. Girish and Ferat [26] provided an introduction to FS strategies. The study suggested that using FS techniques can provide insights into data, improve classifier models, promote generalization, and identify unimportant factors. Ritika et al. [27] suggested a hybrid paradigm for early detection of PCOS. The study concluded that the model’s performance was improved by using the FS technique Backward Feature Elimination. Chongsheng et al. [28] examined how FS before and after data resampling improves two-class imbalance learning. The study suggested that both pipelines (FS before data resampling and FS after data resampling) should be evaluated when searching for the optimum imbalance classification model.
Materials and methods
This section discusses three oversampling techniques that have been widely used in the literature: SMOTE, ADASYN, and SVMSmote; one FS technique: SelectKBest, Bagging ensemble method with seven base classifiers: DT, RF, LR, KNN, NB, RUSBoost and SVM along with two performance measures: Accuracy and AUC.
Oversampling techniques
In the literature, researchers have employed various oversampling techniques to balance the dataset. For this work, we use three oversampling strategies: SMOTE, ADASYN and SVMSmote.
SMOTE
SMOTE generates new synthetic instances of the under-represented class based on the distance between each instance and its nearest neighbors [29, 30]. The SMOTE algorithm is also discussed in this section.
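The core interpolation idea of SMOTE can be sketched as follows. This is a simplified illustration under stated assumptions, not the algorithm listing of this paper: the base sample is drawn at random (the original method iterates over every minority sample), and the function name `smote` and its parameters `n_new`, `k` and `seed` are hypothetical.

```python
import math
import random


def smote(minority, n_new, k=5, seed=0):
    """Create n_new synthetic points by interpolating between a minority
    sample and one of its k nearest minority-class neighbours."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # k nearest minority neighbours of the base sample (excluding itself)
        neighbours = sorted(
            (p for p in minority if p is not base),
            key=lambda p: math.dist(base, p),
        )[:k]
        neighbour = rng.choice(neighbours)
        gap = rng.random()  # random position on the segment base -> neighbour
        synthetic.append([b + gap * (n - b) for b, n in zip(base, neighbour)])
    return synthetic


minority = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
new_points = smote(minority, n_new=10, k=2)
print(len(new_points))
```

Because every synthetic point lies on a line segment between two existing minority samples, the new points stay inside the region already occupied by the minority class.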
ADASYN
Similar to SMOTE, ADASYN is another oversampling method. This method employs the distribution of instances from the under-represented class. Minority samples that are more difficult to learn are given more consideration than minority samples that are simpler to learn. Based on the density distribution, the number of synthetic instances required for a specific minority sample is determined [31]. The ADASYN algorithm is explained in this section [32].
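The density-based allocation step can be illustrated with a short sketch: each minority sample receives a share of the synthetic instances proportional to the fraction of majority-class points among its k nearest neighbours. The function name `adasyn_allocation`, the parameter `beta` (desired balance level) and the toy data are assumptions made for illustration; the generation of the synthetic points themselves then proceeds by interpolation as in SMOTE.

```python
import math


def adasyn_allocation(minority, majority, k=5, beta=1.0):
    """Return how many synthetic samples to generate for each minority sample."""
    # Total number of synthetic samples needed to reach the desired balance
    total_needed = (len(majority) - len(minority)) * beta
    labelled = [(p, 0) for p in minority] + [(p, 1) for p in majority]
    ratios = []
    for x in minority:
        neighbours = sorted(
            (item for item in labelled if item[0] is not x),
            key=lambda item: math.dist(x, item[0]),
        )[:k]
        # Fraction of majority-class points among the k nearest neighbours
        ratios.append(sum(label for _, label in neighbours) / k)
    total = sum(ratios)
    if total == 0:  # no minority sample has any majority neighbour
        return [0] * len(minority)
    return [round(r / total * total_needed) for r in ratios]


# Two minority points far from the majority class and one surrounded by it
minority = [[0.0], [0.1], [5.0]]
majority = [[4.8], [4.9], [5.1], [5.2], [9.0]]
print(adasyn_allocation(minority, majority, k=1))
```

In this toy example, all synthetic samples are allocated to the minority point at 5.0, the only one whose nearest neighbour belongs to the majority class, which is what "harder to learn" means in the description above.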
SVMSmote
Once an SVM classifier has been trained on the initial training data set, SVMSmote uses its support vectors to approximate the boundary area. Synthetic data are then generated at random along the lines linking each minority class support vector to its nearest neighbors [33]. The SVMSmote algorithm is also presented in this section [34].
Feature selection techniques
In the literature, there exist three categories of FS: filter methods, wrapper methods and embedded methods.
Filter methods select attributes on the basis of their statistical relevance to the output variable, independently of the underlying MLA. They serve as a preprocessing stage by choosing highly ranked characteristics and using them with ML techniques. They are computationally fast and resistant to overfitting, but they ignore feature dependency. The correlation and the distance between features and the output variable are two examples of statistical performance measurements used in filter methods [35, 36].
Wrapper techniques wrap the search for features around an ML algorithm: they look for the subset of features with the best prediction performance, relying on the predictor’s accuracy as their objective function. Because obtaining the crucial feature subset and dealing with the overfitting issue necessitates numerous calculations (multiple training rounds), wrapper approaches are well known for being computationally expensive [35, 36].
Embedded approaches combine the benefits of filter and wrapper methods and incorporate feature selection as part of the training process by integrating algorithm modeling and feature selection at the same time. As a result, they are more computationally efficient and less susceptible to overfitting than the wrapper method [35, 36].
In this study, we use the filter method, specifically the SelectKBest function, for its resistance to overfitting and fast computation when selecting the relevant features. The SelectKBest function in the scikit-learn library is utilized and different values of k are evaluated. The k features with the highest scores are then selected, where the score is computed using the f_classif univariate statistical test (ANOVA F-value).
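To make the scoring step concrete, the following plain-Python sketch computes a one-way ANOVA F-value per feature and keeps the k highest-scoring features, which is the kind of ranking that SelectKBest with f_classif performs. It is an illustrative re-implementation, not the scikit-learn code; the helper names `anova_f` and `select_k_best` and the toy data are assumptions.

```python
from statistics import mean


def anova_f(values, labels):
    """One-way ANOVA F-value of a single feature with respect to the class labels."""
    classes = sorted(set(labels))
    groups = {c: [v for v, lab in zip(values, labels) if lab == c] for c in classes}
    grand_mean = mean(values)
    # Variation of the class means around the grand mean
    between = sum(len(g) * (mean(g) - grand_mean) ** 2 for g in groups.values())
    # Variation of the samples around their own class mean
    within = sum((v - mean(g)) ** 2 for g in groups.values() for v in g)
    df_between = len(classes) - 1
    df_within = len(values) - len(classes)
    if within == 0:
        return float("inf")
    return (between / df_between) / (within / df_within)


def select_k_best(X, y, k):
    """Return the (sorted) indices of the k features with the highest F-values."""
    n_features = len(X[0])
    scores = [anova_f([row[j] for row in X], y) for j in range(n_features)]
    ranked = sorted(range(n_features), key=lambda j: scores[j], reverse=True)
    return sorted(ranked[:k]), scores


# Feature 0 separates the two classes; feature 1 carries no class information
X = [[1.0, 5.0], [2.0, 4.0], [9.0, 5.0], [10.0, 4.0]]
y = [0, 0, 1, 1]
kept, scores = select_k_best(X, y, k=1)
print(kept, scores)  # → [0] [128.0, 0.0]
```

A feature whose class means are far apart relative to its within-class spread receives a high score, so the filter keeps it regardless of which classifier is trained afterwards.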
Ensemble method
Ensemble approach [37] aims to enhance the performance of a single classifier by inducing and combining a variety of classifiers to produce a new classifier that outperforms them all. The main idea is to train multiple classifiers and combine their predictions to make the final prediction.
We consider the Bagging ensemble method with seven base classifiers in this study.
Bagging
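Bootstrap aggregating, or bagging, is an ensemble learning method that trains several base classifiers in parallel, each on a bootstrap sample (a random sample drawn with replacement) of the training data, and then aggregates their predictions. It has the advantage of being able to reduce variance while maintaining minimal bias in the base classifiers [38].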
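A minimal sketch of the bagging procedure is given below, assuming a deliberately simple nearest-centroid base learner so that the example stays self-contained. The base learner, the function names and the toy data are illustrative assumptions and do not correspond to the seven base classifiers or the parameter settings used in this study.

```python
import random
from collections import Counter


def fit_centroid(X, y):
    """Toy base learner: store the mean feature vector of each class."""
    centroids = {}
    for label in set(y):
        rows = [x for x, lab in zip(X, y) if lab == label]
        centroids[label] = [sum(col) / len(rows) for col in zip(*rows)]
    return centroids


def predict_centroid(centroids, x):
    """Predict the class whose centroid is closest to x (squared Euclidean distance)."""
    return min(
        centroids,
        key=lambda lab: sum((a - b) ** 2 for a, b in zip(x, centroids[lab])),
    )


def bagging_fit(X, y, n_estimators=10, seed=0):
    """Train one base learner per bootstrap sample of the training data."""
    rng = random.Random(seed)
    n = len(X)
    models = []
    for _ in range(n_estimators):
        idx = [rng.randrange(n) for _ in range(n)]  # sample n indices with replacement
        models.append(fit_centroid([X[i] for i in idx], [y[i] for i in idx]))
    return models


def bagging_predict(models, x):
    """Aggregate the base learners' predictions by majority vote."""
    votes = Counter(predict_centroid(m, x) for m in models)
    return votes.most_common(1)[0][0]


X = [[0.0], [0.2], [0.4], [5.0], [5.2], [5.4]]
y = [0, 0, 0, 1, 1, 1]
models = bagging_fit(X, y, n_estimators=25)
print(bagging_predict(models, [0.1]), bagging_predict(models, [5.1]))
```

Each base learner sees a slightly different resampled training set, and the majority vote averages out their individual errors, which is the source of the variance reduction mentioned above.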
Base classifiers
Decision tree
DT generates a tree structure that resembles a flowchart, consisting of internal nodes, branches and leaf nodes. Each internal node represents a decision, each branch displays the decision’s outcome, and each leaf node depicts a class label. Since DT classifiers mimic how people solve problems, they are simple to understand. However, selecting the attribute for branching is a complicated task, and DT is inefficient in the case of a limited dataset. Additionally, a minor change in the data can disrupt the entire tree structure [39, 40].
Random forest
RF is an ensemble learning method which is used for solving problems involving classification and regression. It creates a forest of DTs, considers their outputs, and then makes the forecast based on the outcomes of majority vote [39]. RF addresses overfitting by averaging predictions. It can also manage multidimensional datasets. In contrast to DT, it creates many trees, which increases the model training time [40, 41].
Logistic regression
LR is a machine learning classifier for binary classification problems [39]. To forecast probabilities between 0 and 1, LR employs the sigmoid function. The LR classifier has the benefit of high computational efficiency: it requires little time and memory to use. However, LR is vulnerable to overfitting and is not suitable for data that are not linearly separable [40, 41].
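As a brief illustration of how the sigmoid function maps a linear score to a probability, consider the following sketch. The weights, bias and inputs are arbitrary values chosen for illustration, not fitted parameters from this study.

```python
import math


def sigmoid(z):
    """Numerically stable logistic function mapping any real z into (0, 1)."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def predict_proba(weights, bias, x):
    """Probability of the positive class for feature vector x."""
    return sigmoid(bias + sum(w * xi for w, xi in zip(weights, x)))


weights, bias = [1.0, -1.0], 0.0  # hypothetical, hand-picked parameters
print(predict_proba(weights, bias, [2.0, 2.0]))  # → 0.5
print(predict_proba(weights, bias, [3.0, 1.0]) > 0.5)  # → True
```

The decision boundary is the set of points where the linear score is zero, which is why LR struggles when the classes cannot be separated by a hyperplane.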
Naïve bayes classifier
NB is a probabilistic classifier based on Bayes’ theorem [23, 40]. NB assumes that the features are independent of one another given the class. The probabilities of each class attribute are taken into account separately while classifying unknown data into defined classes. NB is easy to use and performs well with small datasets. The approach becomes challenging to use when dealing with massive data sets [40].
K-nearest neighbor
KNN is a supervised classification method that classifies a data sample according to the classes of its k closest neighbors, where k is a user-chosen value. The method predicts the class of a sample from the majority class among those neighbors. The distance between a data point and its neighbors is computed using distance metrics like the Euclidean distance and the Manhattan distance [40, 41].
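The voting rule and the two distance metrics mentioned above can be sketched as follows. The function names and the toy data are illustrative assumptions.

```python
import math
from collections import Counter


def manhattan(a, b):
    """Manhattan (city-block) distance between two points."""
    return sum(abs(x - y) for x, y in zip(a, b))


def knn_predict(X, y, query, k=3, distance=math.dist):
    """Majority vote among the k training samples closest to the query.
    The default distance, math.dist, is the Euclidean distance."""
    nearest = sorted(zip(X, y), key=lambda pair: distance(pair[0], query))[:k]
    return Counter(label for _, label in nearest).most_common(1)[0][0]


X = [[0, 0], [0, 1], [1, 0], [5, 5], [6, 5]]
y = [0, 0, 0, 1, 1]
print(knn_predict(X, y, [0.5, 0.5], k=3))  # → 0
print(knn_predict(X, y, [5.5, 5], k=3, distance=manhattan))  # → 1
```

In the second query, two of the three nearest neighbors belong to class 1, so the vote returns class 1 even though the third neighbor belongs to class 0.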
Random under-sampling boost
Seiffert et al. [42] proposed the random under-sampling boosting method, which is a variant of the SmoteBoost algorithm. RUSBoost merges boosting and random under-sampling. RUSBoost is faster and simpler than SmoteBoost. RUSBoost, on the other hand, does not use an intelligent technique for under-sampling, which could result in the loss of important data.
Support vector machine
SVM is a statistical learning classifier that can handle both classification and regression issues [40, 41]. To separate the data points, SVM determines the hyperplane with the largest margin between the two classes. Because the kernel function changes low-dimensional data to high-dimensional data, SVM with kernel function can manage nonlinear data. However, selecting the appropriate kernel function might be challenging. Multiclass classification issues can be handled by SVM by breaking them down into separate binary class problems. SVM needs more training time and expensive computing resources.
Performance measures
Different performance metrics such as Accuracy, Precision, Recall, F1 Score, AUC, ROC curves and so on have been utilized in the literature to examine the classifiers’ performance. Accuracy is widely used by researchers, and the AUC metric remains unaffected by imbalanced datasets [42]. Therefore, we consider these two performance measures for performance comparison: Accuracy and AUC, as discussed below [43, 44, 45, 46].
Accuracy
Accuracy is defined as the proportion of correctly predicted events among all observed events. Accuracy is calculated using Eq. (1).
AUC
The receiver operating characteristic (ROC) curve plots the true positive rate (TPrate), the percentage of correctly categorized positive cases, on the y-axis against the false positive rate (FPrate = 1 − TNrate, where the true negative rate TNrate is the percentage of correctly categorized negative cases) on the x-axis. The performance of the classifier improves as the area under the ROC curve increases [2]. AUC is evaluated using Eq. (2):
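Independently of the exact form of Eqs. (1) and (2), the two measures can be sketched in plain Python. Accuracy is the fraction of correct predictions, and AUC is computed here through its standard rank interpretation: the probability that a randomly chosen positive case receives a higher score than a randomly chosen negative case, with ties counted as one half. The function names are illustrative assumptions, and labelling the majority class as 1 is an arbitrary choice for this example; only the class counts (564 vs. 192) come from the dataset used in this paper.

```python
def accuracy(y_true, y_pred):
    """Proportion of correctly predicted cases among all cases."""
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)


def roc_auc(y_true, scores):
    """Area under the ROC curve via pairwise comparison of positive and negative scores."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in pos
        for n in neg
    )
    return wins / (len(pos) * len(neg))


# A trivial model that always predicts the majority class (564 of 756 samples)
y_true = [1] * 564 + [0] * 192
always_majority = [1] * 756
constant_scores = [0.5] * 756

print(round(accuracy(y_true, always_majority), 3))  # → 0.746
print(roc_auc(y_true, constant_scores))  # → 0.5
```

The trivial model reaches an accuracy of about 74.6% without learning anything, while its AUC stays at the chance level of 0.5, which illustrates why AUC is reported alongside Accuracy for an imbalanced dataset.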
Proposed methodology
For the early diagnosis of PD, we propose a Bagging-based hybrid approach: B-HPD using SMOTE, ADASYN & SVMSmote as the oversampling techniques and RF as the base classifier with the Bagging method. We also compare our proposed approach: B-HPD with other hybrid models with six base classifiers: DT, LR, KNN, NB, RUSBoost and SVM.
We acquire the PD classification dataset from the Kaggle Repository and use the FS technique, i.e., the SelectKBest method, to select the relevant attributes from the feature space. Finally, we compare the performance of these hybrid models with ALL features and with FS using Accuracy and AUC. The flowchart for the proposed approach is shown in Fig. 1.
The algorithm for the B-HPD approach is also illustrated in this section.
Table 1. Properties of datasets
Fig. 1. Proposed methodology.
We use the PD classification dataset taken from the Kaggle Repository [47]. It has 756 instances and 755 attributes, including the ‘class’ variable. The dataset is imbalanced in nature, with 192 minority class instances and 564 majority class instances. The dataset description is shown in Table 1.
Initially, the id attribute is dropped because it does not seem to affect performance. We select the relevant features using the FS technique SelectKBest and evaluate different k values in the range from 100 to 700. We select the 500 features with the highest scores for this study.
Experiment and results
Table 2. Parameter settings of the base classifiers
Table 3. Python libraries with their versions
We conduct the experiment considering seven base classifiers: RF, DT, LR, NB, KNN, RUSBoost and SVM with the Bagging ensemble method and three oversampling techniques, i.e., SMOTE, ADASYN and SVMSmote. Two performance metrics, i.e., Accuracy and AUC are used for the performance evaluation. We compare the performance of these hybrid models with ALL features and with selected features. Repeated Stratified K Fold cross-validation with n_splits
Table 4. Performance of Bagging (different base classifiers) SMOTE with ALL features selected
Table 5. Performance of Bagging (different base classifiers) ADASYN with ALL features selected
Table 6. Performance of Bagging (different base classifiers) SVMSmote with ALL features selected
Tables 4–6 depict the performance of the Bagging approach using all features (754 features) and oversampling techniques like SMOTE, ADASYN, and SVMSmote.
Fig. 2. Performance of Bagging (different base classifiers) SMOTE with ALL features selected.
Fig. 3. Performance of Bagging (different base classifiers) ADASYN with ALL features selected.
Figures 2–4 show a graphical illustration of the bagging method’s performance.
From the visual representations in Figs 2–4, the performance of Bagging
Table 7. Performance of Bagging (different base classifiers) SMOTE with FS
Table 8. Performance of Bagging (different base classifiers) ADASYN with FS
Table 9. Performance of Bagging (different base classifiers) SVMSmote with FS
Fig. 4. Performance of Bagging (different base classifiers) SVMSmote with ALL features selected.
Tables 7–9 show the performance of the Bagging method with FS (500 features) using the SMOTE, ADASYN and SVMSmote oversampling techniques.
Fig. 5. Performance of Bagging (different base classifiers) SMOTE with FS.
Fig. 6. Performance of Bagging (different base classifiers) ADASYN with FS.
Fig. 7. Performance of Bagging (different base classifiers) SVMSmote with FS.
Figures 5–7 show the performance of the bagging method graphically.
From the visual representations in Figs 5–7, the performance of Bagging
From this study, the following observations are made:
Bagging with base classifier: RF is showing the best performance in all the cases, i.e., with (1) ALL features (754): Accuracy From point 1, we observe that FS improves the model’s performance. Further, it is noticed that irrespective of the oversampling technique used, Bagging
RF outperforms the other classifiers, which may be attributed to its robustness to overfitting and better generalization capability, enabling it to classify unseen data accurately. This consistent performance of RF also suggests that the classifier is well suited to the underlying data characteristics and seems to be a reliable model for this dataset.
PD is a neurological ailment that causes symptoms like rigidity in the body, shaking, trouble walking and lack of coordination. Therefore, it is essential to detect PD in its early stages to improve the patient’s life. Through this work, we have identified a few key pointers that could serve as primary indicators for PD diagnosis, assist medical professionals in promptly diagnosing the disease, and make the diagnosis of PD cost-effective. We propose a Bagging-based hybrid method, B-HPD, for the early diagnosis of PD in this paper. For balancing the dataset, we use three oversampling techniques: SMOTE, ADASYN and SVMSmote. As base classifiers for the Bagging ensemble method, we consider seven classifiers: DT, RF, NB, LR, KNN, RUSBoost and SVM. Further, in order to preprocess data and improve performance, the FS approach (filter method: SelectKBest) is also used. We assess the model’s performance with ALL features and with selected features, and we compare our proposed approach, B-HPD, with other hybrid models. Our study indicates that, regardless of the oversampling approach utilized, Bagging with RF as the base classifier displays the best performance with (1) ALL features (754): Accuracy
Acknowledgments
The authors are grateful to USICT, GGSIPU, New Delhi, India, for providing them with the opportunity to carry out this research.
