Feature selection and extraction for class prediction in dysphonia measures analysis: A case study on Parkinson’s disease speech rehabilitation

Abstract

BACKGROUND:

Speech disorders such as dysphonia and dysarthria are an early and common manifestation of Parkinson’s disease. Class prediction is an essential task in automatic speech treatment, particularly for Parkinson’s disease. Many classification experiments have focused on automatically distinguishing Parkinson’s disease patients from healthy speakers, but the results remain unsatisfactory. A major obstacle to this task is the high dimensionality of speech data.

OBJECTIVE:

In this work, the potential of Principal Component Analysis (PCA) based modeling in dimensionality reduction is taken into consideration as the data smoothening tool with multiclass target expression data.

METHODS:

Using the proposed PCA-based modeling, we investigate the class-prediction power of logistic regression (LR) and C5.0 on numeric data from the publicly available Parkinson’s disease Lee Silverman Voice Treatment (LSVT) dataset, with the aim of developing an advanced classification model.

RESULTS:

The main advantage of our model is the effective reduction of the number of factors from $p=$ 309 to $k=$ 32 for LSVT Voice Rehabilitation dataset, with a fine classification accuracy of 100% and 99.92% for PCA-LR and PCA-C5.0 respectively. In addition, using only 9 dysphonia features, classification accuracy was (99.20%) and (99.11%) for PCA-LR, and PCA-C5.0 respectively.

CONCLUSIONS:

Our combined dimension reduction and data smoothening approaches have significant potential to minimize the number of features and increase the classification accuracy and then automatically classify subjects into Parkinson’s disease patients or healthy speakers.

Keywords

Dimension reduction, classification, machine learning, dysphonia features, Parkinson’s disease

1. Introduction

Parkinson’s disease (PD) is a neurodegenerative disorder characterized by muscular rigidity, tremor, bradykinesia and postural instability. During the course of the disease, patients with PD will develop voice and speech impairments including dysarthric speech, reduced loudness and loss of articulation which leads to a negative impact on functional communication and quality of life [1, 2, 3, 4]. These difficulties processing speech include deficits estimating the time intervals of an acoustic speech signal, altered emotional prosody and difficulty perceiving the individual’s own loudness. In other words, individuals with PD articulate imprecise consonants, particularly those used phonetically for the closure of phrases, have impaired prosody or natural variations in pitch, as well as impaired intensity and rhythm during spontaneous speech production [5]. According to Herd et al. [6] patients with Parkinson’s disease also suffer from difficulties in language selection, language understanding, coordination and dual tasks (talking and walking) as well as emotional intent and understanding. However, the neural mechanisms underlying these voice and speech disorders remain unclear. Recently, Dias and colleagues [7] reported that global motor disability and speech articulation impairment do not correlate with age at onset of PD symptoms or age of the patients at evaluation. According to Orozco-Arroyave et al. [8] it is possible to automatically classify between speech of people with PD and healthy controls (HC) with accuracies ranging from 84% to 99% in three languages: German, Spanish and Czech. In English native speakers the reported accuracy is 91% [9]. Bocklet et al. [10] considered a set of 176 German speakers (88 patients with PD and 88 HC) and performed the automatic detection of PD including acoustic, prosodic and voice related features. The accuracy reported in this work was 94%. 
Significant improvements in the motor functions of patients with PD have been shown with medical therapies and surgical procedures; however, data on their effects on speech production performance are inconsistent and overall success remains unclear [11]. The Lee Silverman Voice Treatment (LSVT®) has been shown to be the best behavioral therapy program for speech treatment in the short and long term for patients with PD [12, 13, 14, 15]. The LSVT® program improves phonatory effort and vocal characteristics (loudness, pitch variability, vocal quality), as well as speech articulation [16, 17].

In recent years, scientific data have tended to grow in both size and complexity. The high dimensionality of modern massive datasets poses a considerable challenge to the design of efficient algorithmic solutions [18]. Dimension reduction is a subject of study in several research areas including high-dimensional data analysis, pattern classification, medical data processing, machine learning, and data mining applications. It is the process of reducing the number of random variables under consideration. Dimension reduction techniques aim to find the meaningful low-dimensional structures hidden in high-dimensional observations, which allows the user to analyze complex data sets in an interpretable way while preserving most of the information in the data [19, 20]. Feature extraction and feature selection are two popular approaches to dimension reduction that play an important role in the interpretation and application of medical data.

The feature selection process can be considered a problem of global combinatorial optimization in machine learning: it reduces the number of features, removes irrelevant, noisy and redundant data according to a certain criterion, and can result in higher classification accuracy [21]. A good feature selection method is therefore needed to speed up processing, improve predictive accuracy, and keep the resulting model comprehensible. Feature selection algorithms are separated into three categories: filter (extract features from the data without any learning involved), wrapper (use learning techniques to evaluate which features are useful) and hybrid (combine the feature selection step and the classifier construction) approaches [22, 23].

Feature extraction is a process that derives a set of new features from the original features through some functional mapping. Its main goal is to reduce dimensionality by linear methods [e.g. Principal Component Analysis (PCA), linear discriminant analysis (LDA)] or non-linear methods [e.g. Kernel PCA and Kernel LDA], and hence to improve the quality of the data and the performance of data mining algorithms. LDA and PCA are the two popular independent feature extraction methods. Both extract features by projecting the original parameter vectors into a new feature space through a linear transformation matrix, but they optimize the transformation matrix with different intentions [24]. In recent years, machine learning techniques have been of great interest for disease classification, detection of irregularities and increasing the objectivity of medical decision-making. Interestingly, various research papers have attempted to build predictive telediagnosis and telemonitoring models for diagnosing Parkinson’s disease (PD) [25], especially for speech disorders, considered one of the most common problems in patients with PD [26, 27, 28, 29, 9]. Artificial intelligence continues to play an important role in healthcare informatics while researchers work to solve many new problems in the big data era of healthcare [30].

The aim of this paper is to study the potential of dimensionality reduction in speech pattern analysis of the publicly available LSVT (Lee Silverman Voice Treatment) Voice Rehabilitation dataset, in order to investigate prediction accuracy in the context of class prediction using PCA. To this end, a supervised feature selection process identifying the most significant features was performed using Pearson’s correlation, followed by an independent stratified 10-fold cross validation and linear PCA. After dimension reduction, class prediction was performed with Logistic Regression and the C5.0 algorithm.

The applied features extraction and the modeling approaches may be sensitive to gender. For this reason, it is important to try and understand the possible relationships between gender and dimension reduction for speech treatment issues.

Table 1
LSVT voice rehabilitation data set information

Data set characteristics	Multivariate
Attribute characteristics	Real
Associated tasks	Classification
Number of instances	126
Number of attributes	309
Missing values	N/A
Area	Life
Date donated	2014-02-19

2. Dimension reduction: Background and related works

2.1 Dataset

The dataset consists of the outputs of a range of biomedical speech signal processing algorithms applied to recordings from 14 people diagnosed with Parkinson’s disease who were undergoing the LSVT voice rehabilitation program [31]. Table 1 shows the main characteristics of the dataset. The 14 PD subjects (8 males and 6 females) had an age range of 51 to 69 (mean $\pm$ standard deviation: 61.9 $\pm$ 6.5) years and produced sustained vowel phonations. Each subject was originally instructed to produce 27 phonations (samples), each phonation being one of the nine possible combinations of pitch and amplitude. Tsanas et al. [31] measured a total of 309 dysphonia features to assess whether a sustained phonation is “acceptable” (positive sample) or “unacceptable” (negative sample) according to the clinical criteria of experts. The data set was originally collected to determine the most parsimonious feature subset for predicting the binary response “acceptable” or “unacceptable”. The system was evaluated on 126 phonations split into two subsets: a training subset consisting of 90% of the data samples (113 phonations), and a testing subset consisting of 10% of the data samples (13 phonations). The LOGO (fit locally and think globally) feature selection algorithm was applied to find the most discriminant subset of features. The subset was selected following a 10-fold cross validation strategy. The feature selection process was repeated 100 times on the training sets, with a voting scheme, to avoid overfitting. Two statistical machine learning algorithms, random forests (RF) and support vector machines (SVM), were used to discriminate between “acceptable” and “unacceptable” phonations. The authors reported a classification accuracy of around 90% using a subset of 8 dysphonia measures with RF and SVM [31].

Figure 1.

A dimensionality reduction framework for speech treatment in PD.

2.2 Data preprocessing

The computational experiment in our work consists of the cross-validation of the linear PCA classifier over different feature vectors extracted from the LSVT data following diverse feature extraction processes and class prediction. Figure 1 shows a general framework summarizing the steps that lead to the performance evaluation. Applying feature selection with the Pearson Correlation Coefficient, we obtain the significant features, from which data are extracted to build the feature vectors employed in the cross-validation experiments of the PCA classifier. After dimension reduction, the class prediction step is carried out with Logistic Regression and C5.0.

2.3 Dimension reduction

In many situations, the number of variables largely exceeds the number of observations. Dimensionality reduction is then the most intuitive way to address this problem in machine learning. It proceeds by applying either feature selection or feature extraction, which removes redundant and useless information in order to obtain a better representation of the data. The principal objectives of dimension reduction, as described by Guérif [32], are to improve classification, to facilitate visualization and comprehension of the data by identifying the relevant features, and to reduce the required storage space and computation (CPU) time. However, eliminating certain redundant or weakly relevant information can increase the classification error, since such information may prove informative when used jointly [33].

Figure 2.

Feature selection process.

Figure 3.

Feature extraction process.

Dimensionality reduction remains a complex problem. It is divided into two main categories, feature selection and feature extraction (transformation of the features), as shown in Figs 2 and 3.

Some of the data are not very relevant to the classification process: certain variables correspond to noise, or are weakly informative, weakly correlated with the class, redundant, or even useless for class prediction. The problem of dimension reduction can be defined as follows. Assume that we have a dataset represented by an $n\times p$ matrix ‘ $Y$ ’ consisting of $n$ data vectors $y_{i}$ ( $i\in\{1,2,\ldots,n\}$ ) with dimensionality ‘ $p$ ’. Assume further that this dataset has intrinsic dimensionality ‘ $k$ ’ (where $k<p$ , and often $k\ll p$ ). Here, in mathematical terms, intrinsic dimensionality means that the points in dataset ‘ $Y$ ’ lie on or near a manifold with dimensionality ‘ $k$ ’ that is embedded in the $p$ -dimensional space. Dimensionality reduction techniques transform dataset ‘ $Y$ ’ with dimensionality ‘ $p$ ’ into a new dataset ‘ $X$ ’ with dimensionality ‘ $k$ ’, while retaining the geometry of the data as much as possible [34, 35, 36].

In general, neither the geometry of the data manifold, nor the intrinsic dimensionality ‘ $k$ ’ of the dataset ‘ $Y$ ’ are known. Therefore, dimensionality reduction is an attention-demanding problem that can only be solved by assuming certain properties of the data (such as its intrinsic dimensionality) [37, 38].

2.3.1 Feature selection

The motivation for applying feature selection techniques in bioinformatics has shifted from being an illustrative example to becoming a real prerequisite for high dimensional model building [21].

Feature selection is suitable when the measurements acquisition is expensive. Its objective is to reduce the number of necessary measurements and to choose those most informative. For the feature selection phase, the Pearson Correlation Coefficient developed by Karl Pearson was used to find highly correlated features to the class label [39]. The Pearson’s correlation coefficient, typically denoted by $r$ , is a measure of the correlation (linear dependence) between two random variables $x$ and $y$ , taking values in the interval $r\in[-1,1]$ . A value of $r=1$ means that the two variables are in complete agreement. A value $r=-1$ means that the two variables take opposite values. Pearson’s correlation coefficient between two variables is defined as the covariance of the two variables normalized by the product of their standard deviations.

We suppose that each feature consists of $Y=\{y_{1},y_{2},\ldots,y_{n}\}$ values for samples 1 through $n$ in vector $Y$ and the corresponding class labels are $\{c_{1},c_{2},\ldots,c_{n}\}$ stored in vector $C$ . So, the Pearson Correlation Coefficient of each feature can be calculated as:

$\displaystyle r\left(Y,C\right)=\frac{\sum\nolimits_{i=1}^{n}(y_{i}-\bar{y})(c_{i}-\bar{c})}{\sqrt{\sum\nolimits_{i=1}^{n}(y_{i}-\bar{y})^{2}}\sqrt{\sum\nolimits_{i=1}^{n}(c_{i}-\bar{c})^{2}}}$ (1)

where

$\displaystyle\bar{y}=\frac{1}{n}\sum\limits_{i=1}^{n}y_{i}$ (2)

and similarly

$\displaystyle\bar{c}=\frac{1}{n}\sum\limits_{i=1}^{n}c_{i}$ (3)

This equation gives a value between $-1$ and $+1$ , where $+1$ is a maximum positive correlation, 0 is no correlation, and $-1$ is the strongest negative correlation. The ${p}$ values were calculated using Student’s $t$ -distribution for a transformation of the correlation. Those features in the correlation coefficient matrix with ${p}$ values less than 0.05 were selected.
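As an illustration of Eqs (1) to (3), the following minimal sketch computes the correlation of a single feature with the class labels and the usual $t$ transformation of $r$ with $n-2$ degrees of freedom. The toy data and the function names `pearson_r` and `t_statistic` are hypothetical; the final conversion of $t$ into a ${p}$ value is left to a statistics library and is not shown.

```python
import math

def pearson_r(y, c):
    """Pearson correlation between feature values y and class labels c (Eq. 1)."""
    n = len(y)
    y_bar = sum(y) / n                      # Eq. (2)
    c_bar = sum(c) / n                      # Eq. (3)
    num = sum((yi - y_bar) * (ci - c_bar) for yi, ci in zip(y, c))
    den = math.sqrt(sum((yi - y_bar) ** 2 for yi in y)) * math.sqrt(
        sum((ci - c_bar) ** 2 for ci in c))
    return num / den

def t_statistic(r, n):
    """Transform r into t = r * sqrt((n - 2) / (1 - r^2)); undefined for |r| = 1."""
    return r * math.sqrt((n - 2) / (1.0 - r * r))

# Hypothetical toy data: six samples, binary class labels.
labels = [0, 0, 0, 1, 1, 1]
feature_a = [0.10, 0.20, 0.15, 0.90, 0.80, 0.95]  # follows the label closely
feature_b = [0.50, 0.10, 0.90, 0.40, 0.60, 0.50]  # unrelated to the label

r_a = pearson_r(feature_a, labels)
r_b = pearson_r(feature_b, labels)
print(f"r_a = {r_a:.3f}, t_a = {t_statistic(r_a, len(labels)):.2f}")
print(f"r_b = {r_b:.3f}")
```

In a filter of this kind, a feature such as `feature_a` would be retained and `feature_b` discarded once the ${p}$ value of each $t$ statistic is compared with the 0.05 threshold.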

2.3.2 Feature extraction

Feature extraction methods construct, from the ‘ $n$ ’ original variables, a new set of ‘ $d\ll n$ ’ features. Several alternative methods exist for feature extraction. Some of the best known are linear methods such as Principal Component Analysis (PCA) [40], Multidimensional Scaling (MDS) [41], and Linear Discriminant Analysis (LDA) [42].

Nonlinear methods have also been developed, such as Independent Component Analysis (ICA), ISOMAP [43], LLE [44], and nonlinear versions of PCA and LDA such as Kernel PCA [45] and Kernel LDA [46].

2.3.2.1. Principal component analysis (PCA)

PCA is one of the oldest and most widely used unsupervised techniques for dimensionality reduction [47]. It performs dimensionality reduction by embedding the data into a linear subspace of lower dimensionality. The basic idea behind PCA is to reduce the dimensionality of a dataset while retaining as much as possible of the variation in the original variables [48]. This is done by finding a linear basis of reduced dimensionality for the data, in which the amount of variance in the data is maximal. The basic procedure of PCA is presented below.

$\displaystyle\begin{bmatrix}y_{11}&y_{12}&\ldots&y_{1n}\\ y_{21}&y_{22}&\ldots&y_{2n}\\ \vdots&\vdots&\ddots&\vdots\\ y_{m1}&y_{m2}&\ldots&y_{mn}\\ \end{bmatrix}=\left[y_{1},y_{2},\ldots,y_{m}\right]^{T},\quad y_{i}\in\mathbb{R}^{n\times 1}$ (4)

Step 1

Mean value $\bar{y}$ is calculated using the equation:

$\displaystyle\bar{y}=\frac{1}{m}\sum\limits_{i=1}^{m}y_{i}$ (5)

Step 2

The mean value is subtracted from each feature:

$\displaystyle{\Phi}_{i}=y_{i}-\bar{y}$ (6)

Step 3

Matrix $A=[\Phi_{1},\Phi_{2},\ldots,\Phi_{m}]_{n\times m}$ is generated and covariance matrix $\Sigma$ is computed as follows:

$\displaystyle\Sigma=\frac{1}{m}\sum\limits_{i=1}^{m}\Phi_{i}\Phi_{i}^{T}=\frac{1}{m}\sum\limits_{i=1}^{m}(y_{i}-\bar{y})(y_{i}-\bar{y})^{T}=\frac{1}{m}\left[AA^{T}\right]_{n\times n}$ (7)

The covariance matrix characterizes the distribution of the data.

Step 4

Eigenvalues are computed as:

$\displaystyle\lambda_{1}>\lambda_{2}>\ldots>\lambda_{n}$ (8)

Step 5

Eigenvectors are computed as:

$\displaystyle v_{\lambda_{1}},v_{\lambda_{2}},\ldots,v_{\lambda_{n}}$ (9)

Since $\Sigma$ is symmetric, the eigenvectors $v_{\lambda_{1}},v_{\lambda_{2}},\ldots,v_{\lambda_{n}}$ form a basis, so $(y_{i}-\bar{y})$ can be written as a linear combination of the eigenvectors:

$\displaystyle\left(y_{i}-\bar{y}\right)=a_{1}v_{\lambda_{1}}+a_{2}v_{\lambda_{2}}+\ldots+a_{n}v_{\lambda_{n}}=\sum\nolimits_{j=1}^{n}a_{j}v_{\lambda_{j}}$ (10)

where $a_{1},a_{2},\ldots,a_{n}$ are scalars.

Step 6

For dimensionality reduction, only the terms corresponding to the $k$ largest eigenvalues are kept:

$\displaystyle\hat{y}_{i}-\bar{y}=\sum\limits_{j=1}^{k}a_{j}v_{\lambda_{j}}$ (11)

where $k\ll n$ .

The representation of $\hat{y}_{i}-\bar{y}$ in the basis $v_{\lambda_{1}},\ldots,v_{\lambda_{k}}$ is thus ${[a_{1},a_{2},\ldots,a_{k}]}^{T}$ . The linear transformation $\mathbb{R}^{n}\rightarrow\mathbb{R}^{k}$ (with $k\ll n$ ) by which PCA performs the dimensionality reduction is given by:

$\displaystyle\begin{bmatrix}a_{1}\\ a_{2}\\ \vdots\\ a_{k}\\ \end{bmatrix}=\begin{bmatrix}v_{\lambda_{1}}^{T}\\ v_{\lambda_{2}}^{T}\\ \vdots\\ v_{\lambda_{k}}^{T}\\ \end{bmatrix}\left(y_{i}-\bar{y}\right)=U^{T}\left(y_{i}-\bar{y}\right)$ (12)

The new variables (i.e. $a_{i}$ ’s) are uncorrelated. The covariance matrix for the $a_{i}$ ’s is presented in Equation:

$\displaystyle U^{T}\Sigma U=\begin{bmatrix}\lambda_{1}&0&\ldots&0\\ 0&\lambda_{2}&\ldots&0\\ \vdots&\vdots&\ddots&\vdots\\ 0&0&\ldots&\lambda_{n}\\ \end{bmatrix}$ (13)

The covariance matrix represents only second order statistics among the vector values.
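Steps 1 to 6 can be sketched with NumPy as follows. This is a minimal illustration under stated assumptions, not the implementation used in the study: samples are stored as rows of an $m\times n$ array (the transpose of the column-vector convention used in the equations), the data are random and hypothetical, and the function name `pca_project` is introduced here only for illustration.

```python
import numpy as np

def pca_project(Y, k):
    """Project an (m, n) data matrix onto its first k principal components."""
    m = Y.shape[0]
    y_bar = Y.mean(axis=0)                    # Step 1: mean vector
    Phi = Y - y_bar                           # Step 2: centre every sample
    Sigma = Phi.T @ Phi / m                   # Step 3: (n, n) covariance matrix
    eigvals, eigvecs = np.linalg.eigh(Sigma)  # Steps 4-5: eigh returns ascending order
    order = np.argsort(eigvals)[::-1]         # reorder so that lambda_1 is the largest
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]
    U = eigvecs[:, :k]                        # Step 6: keep the k leading eigenvectors
    scores = Phi @ U                          # each row holds a = U^T (y_i - y_bar)
    return scores, eigvals, U

# Hypothetical data: 50 samples with 5 correlated features.
rng = np.random.default_rng(0)
Y = rng.normal(size=(50, 5)) @ rng.normal(size=(5, 5))

scores, eigvals, U = pca_project(Y, k=2)
# Covariance of the new variables: diagonal, with the leading eigenvalues on the diagonal.
cov_scores = scores.T @ scores / Y.shape[0]
print(np.round(cov_scores, 6))
print(np.round(eigvals[:2], 6))
```

The printed covariance of the projected data is diagonal, which is the numerical counterpart of the statement that the new variables are uncorrelated.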

Let $n$ be the dimensionality of the data. Diagonalizing the covariance matrix $\Sigma$ yields a diagonal matrix of eigenvalues. The eigenvalues are sorted in the form $\lambda_{1}>\lambda_{2}>\ldots>\lambda_{n}$ so that the transformed data exhibit maximum variance along the first component, the next largest variance along the second, and so on, with minimum variance along the $n$ -th.

Since $\Sigma$ is a symmetric positive semi-definite matrix, its eigenvalues are all real and non-negative and its eigenvectors are mutually orthogonal. In addition, each eigenvalue ( $\lambda_{1}>\lambda_{2}>\ldots>\lambda_{n}$ ) is proportional to the share of the total variance carried by the associated principal component, which is used to select the axes forming the projection space. Indeed, in PCA most methods for choosing the number of features are based on the eigenvalues of the covariance matrix, because the eigenvalues of $\Sigma$ represent the variance along the corresponding eigenvectors. The empirical scree test of Cattell [49] is commonly used. It is based on the analysis of the differences between consecutive eigenvalues and detects an “elbow” in the descent of the eigenvalues. The dimension selected by the method is the one beyond which the differences between successive eigenvalues are all smaller than a certain threshold. Figure 4 illustrates this technique: (a) shows the eigenvalues of $\Sigma$ in decreasing order, and (b) shows the differences between eigenvalues.

Figure 4.

Selection of the principal components of the PCA using the scree-test of Cattell [49, 50].

In this example, the threshold was set at 10% of the largest difference and the scree test identifies an elbow at the 4th dimension [50].
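The thresholded scree test described above can be sketched as follows. The eigenvalues are hypothetical (they are not those of Fig. 4), and the function name `scree_dimension` is introduced only for this illustration; the rule implemented is the one stated in the text, with the threshold expressed as a fraction of the largest difference.

```python
def scree_dimension(eigvals, fraction=0.10):
    """Return the dimension after which all successive eigenvalue
    differences fall below `fraction` times the largest difference."""
    lam = sorted(eigvals, reverse=True)
    if len(lam) < 2:
        return len(lam)
    # diffs[i] is the drop between the (i+1)-th and (i+2)-th eigenvalues.
    diffs = [lam[i] - lam[i + 1] for i in range(len(lam) - 1)]
    threshold = fraction * max(diffs)
    k = len(lam)
    # Walk back from the tail while the drops stay below the threshold.
    for i in range(len(diffs) - 1, -1, -1):
        if diffs[i] < threshold:
            k = i + 1
        else:
            break
    return k

# Hypothetical eigenvalues in decreasing order.
example = [10.0, 6.0, 3.0, 1.0, 0.9, 0.8, 0.75]
print(scree_dimension(example))
```

With these values the drops after the fourth eigenvalue are all below 10% of the largest drop, so the sketch selects four dimensions.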

2.4 Class prediction

After dimension reduction, the high dimension ‘ $n$ ’ is reduced to a lower dimension ‘ $k$ ’. The original data matrix is replaced by a matrix of features ( $n\times k$ , where $k<n$ ) constructed by PCA, as described in the previous section. Once the $k$ features are constructed, the response classes are predicted using the C5.0 and LR algorithms.

2.4.1 C5.0 Decision tree algorithm

The C5.0 algorithm, developed by Quinlan on the basis of C4.5 [51], builds a decision tree consisting of one root, a number of branches, a number of nodes and a number of leaves. A branch is a chain of nodes from the root to a leaf, and each node involves one attribute. Classification is done through the decision tree, with its leaves representing the different classes (here, “acceptable” and “unacceptable” phonations). This algorithm is known to have many advantages, such as higher accuracy and the possibility of using boosting, pruning, weighting and windowing [52].

In this method, the root node at the top of the tree considers all samples and passes them on to a second type of node, called a “branch node”. The branch node generates rules for a group of samples based on an entropy measure. At this stage, C5.0 constructs a very large tree by considering all attribute values, and finalizes the decision rules by pruning. It uses a heuristic approach to pruning based on the statistical significance of splits. After fixing the best rule, the branch nodes send the final class value to the last node, called the “leaf node” [53].

In this study, the attribute with the largest gain ratio is chosen at each node, and the decision tree is built recursively on the basis of information entropy. Entropy provides an information-theoretic measure of the goodness of a split: it measures the amount of information in an attribute. Taking the evaluation of attribute $A$ as an example, the information gain ratio $\textit{GainRatio}(A)$ is calculated as follows. Let $S$ denote a set of samples and $p_{i}$ the probability that a random sample belongs to class $C_{i}$ , expressed as [54]:

$\displaystyle p_{i}=\frac{S_{i}}{S}$ (14)

Assume that the category attribute has $n$ different values, defining $n$ different classes $C_{i}$ $(i=1,\ldots,n)$ . The information entropy of the current sample set is calculated as follows:

$\displaystyle\textit{Info}\left(S\right)=-\sum\limits_{i=1}^{n}p_{i}\log_{2}\left(p_{i}\right)$ (15)

We suppose that attribute $A$ has $n$ different values $\left\{A_{1},A_{2},\ldots,A_{n}\right\}$ , so that $A$ divides $S$ into $n$ subsets $\left\{S_{1},S_{2},\ldots,S_{n}\right\}$ , where $S_{j}$ contains the samples for which attribute $A$ takes the value $A_{j}$ , and $S_{ij}$ is the number of samples of class $C_{i}$ in the subset $S_{j}$ . $\textit{Info}\left(S,A\right)$ denotes the information entropy that remains after $S$ is divided by attribute $A$ , calculated as follows:

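$\displaystyle\textit{Info}\left(S,A\right)=\sum\limits_{j=1}^{n}\frac{\left|S_{j}\right|}{|S|}\times\textit{Info}\left(S_{j}\right)$ (16)

where $\textit{Info}\left(S_{j}\right)$ is the entropy of the subset $S_{j}$ , computed as in Eq. (15) from the class proportions $S_{ij}/\left|S_{j}\right|$ .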

Split information $\textit{SplitInfo}(A)$ is the entropy of the partition induced by the values of attribute $A$ . It is used to eliminate the bias toward attributes with a large number of values, and is calculated as follows:

$\displaystyle\textit{SplitInfo}(A)=-\sum\limits_{j=1}^{n}\left(\frac{\left|S_{j}\right|}{|S|}\right)\log_{2}\left(\frac{\left|S_{j}\right|}{|S|}\right)$ (17)

Gain is computed to estimate the gain produced by a split over an attribute.

$\displaystyle\textit{Gain}(A)=\textit{Info}\left(S\right)-\textit{Info}\left(S,A\right)$ (18)

then:

$\displaystyle\textit{GainRatio}(A)=\frac{\textit{Gain}(A)}{\textit{SplitInfo}\left(A\right)}$ (19)

The smaller the entropy, the purer the dataset.
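The quantities in Eqs (14) to (19) can be illustrated with a minimal sketch for a single categorical attribute. This is not the C5.0 implementation: the toy labels and attribute values are hypothetical, the function names `info` and `gain_ratio` are introduced only here, and continuous attributes (which would first need a threshold) are left out for simplicity.

```python
import math
from collections import Counter

def info(labels):
    """Entropy of a list of class labels, with the usual negative sign."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def gain_ratio(values, labels):
    """Gain ratio of a categorical attribute given its values and the class labels."""
    n = len(labels)
    subsets = {}
    for v, c in zip(values, labels):
        subsets.setdefault(v, []).append(c)   # S_j: labels of samples with A = A_j
    info_s_a = sum(len(s) / n * info(s) for s in subsets.values())
    split_info = -sum(len(s) / n * math.log2(len(s) / n) for s in subsets.values())
    gain = info(labels) - info_s_a
    # An attribute with a single value carries no split information.
    return gain / split_info if split_info > 0 else 0.0

# Hypothetical toy sample: two acceptable and two unacceptable phonations.
labels = ["acceptable", "acceptable", "unacceptable", "unacceptable"]
separating = ["low", "low", "high", "high"]      # splits the classes perfectly
uninformative = ["low", "high", "low", "high"]   # each subset is still mixed

print(info(labels))
print(gain_ratio(separating, labels))
print(gain_ratio(uninformative, labels))
```

The attribute that separates the two classes obtains the larger gain ratio and would therefore be preferred at the node.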

2.4.2 Logistic regression

Logistic regression (LR) is one of the most common models for prediction and classification [55]. It is a type of linear predictive model in which the output variable is binary, such as healthy or unhealthy, dead or alive, win or loss. Logistic regression, widely applied in the medical sciences, assumes that the target follows a Bernoulli distribution [56]. The binary output variable can take one of two possible values, denoted by 1 and 0 (for example, $x=1$ if a disease is present; $x=0$ otherwise). The input variables, denoted by $y=(y_{1},y_{2},\ldots,y_{n})$ , are the features used to predict the probability of the desired event ( $x=1$ ).

Logistic regression method models the relations between these variables through the following equation:

$\displaystyle\log\left\{\frac{\mathcal{P}(x=1)}{1-\mathcal{P}(x=1)}\right\}=b_{0}+b_{1}y_{1}+b_{2}y_{2}+\ldots+b_{n}y_{n}$ (20)

where $\mathcal{P}$ stands for probability, $b_{0}$ is called the “intercept” and $(b_{1},b_{2},\ldots,b_{n})$ are called the “regression coefficients” of $(y_{1},y_{2},\ldots,y_{n})$ respectively. Each regression coefficient describes the importance of the corresponding input attribute for the output.
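To make the link between the log-odds in Eq. (20) and a predicted probability explicit, the following minimal sketch inverts the logit for one sample. The coefficient and feature values are hypothetical and are not fitted to the LSVT data; the function names are introduced only for this illustration.

```python
import math

def linear_predictor(b0, b, y):
    """Right-hand side of Eq. (20): b0 + b1*y1 + ... + bn*yn."""
    return b0 + sum(bi * yi for bi, yi in zip(b, y))

def probability(b0, b, y):
    """Probability of the positive class, obtained by inverting the logit."""
    return 1.0 / (1.0 + math.exp(-linear_predictor(b0, b, y)))

# Hypothetical intercept, coefficients and feature vector (two features).
b0, b = -1.0, [2.0, -0.5]
y = [1.5, 1.0]

p = probability(b0, b, y)
log_odds = math.log(p / (1.0 - p))   # recovers the linear predictor
predicted_class = 1 if p >= 0.5 else 0
print(round(p, 4), round(log_odds, 4), predicted_class)
```

Here a cut-off of 0.5 on the probability is assumed for assigning the class; other cut-offs are possible.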

2.5 Performance evaluation

The performance of each model was measured in terms of average accuracy, that is, the number of correctly classified cases divided by the total number of cases in a testing set. The dataset is divided into a training set and a testing set. The classification performance of the two models (PCA-C5.0 and PCA-LR) can be compared by their accuracy rate, which is the most direct criterion for evaluating classification models. It is computed as:

$\displaystyle\textit{Accuracy}=\frac{\textit{The number of correctly classified cases}}{\textit{The total number of cases}}$ (21)

After data preprocessing, the proposed performance evaluation procedure is applied to the dataset. The experiments were performed with separate training and test data, whose sizes were chosen according to the available dataset in order to provide a reliable estimate and to validate the developed models.
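The accuracy of Eq. (21) and a stratified split into ten folds can be sketched as follows. This is an illustration under stated assumptions rather than the exact protocol of the study: the class proportions in the example are hypothetical, and the helper `stratified_folds` is introduced only here.

```python
import random

def accuracy(y_true, y_pred):
    """Eq. (21): correctly classified cases divided by the total number of cases."""
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)

def stratified_folds(labels, n_folds=10, seed=0):
    """Assign sample indices to folds so that each class is spread evenly."""
    rng = random.Random(seed)
    by_class = {}
    for idx, lab in enumerate(labels):
        by_class.setdefault(lab, []).append(idx)
    folds = [[] for _ in range(n_folds)]
    position = 0
    for lab in sorted(by_class):
        indices = by_class[lab]
        rng.shuffle(indices)
        for idx in indices:
            # Deal indices round-robin, continuing across classes,
            # so that fold sizes differ by at most one.
            folds[position % n_folds].append(idx)
            position += 1
    return folds

# Hypothetical labels: 126 samples with an assumed 42 / 84 class split.
labels = [1] * 42 + [0] * 84
folds = stratified_folds(labels, n_folds=10)
print([len(f) for f in folds])
print(accuracy([1, 0, 1, 1], [1, 0, 0, 1]))
```

Each fold serves once as the testing set (about 10% of the samples) while the remaining folds form the training set, and the accuracies are averaged over folds and repetitions.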

3. Results and discussion

This section illustrates the value of dimension reduction for class prediction on speech data.

3.1 Application to LSVT voice rehabilitation dataset

After data preprocessing, the proposed performance evaluation procedure was applied to the LSVT Voice Rehabilitation dataset. We randomly chose 90% of the instances to train the model and the remaining 10% to test it. After 10-fold cross validation (CV) with 100 repetitions, we report the accuracy for only the first 30 steps of the feature selection algorithms. As feature selection is performed in each cross-validation fold, the standard deviation of the number of features is provided.

Figure 5.

Mean $\pm$ standard deviation performance results of the first 30 steps of the feature selection algorithms using PCA-LR.

Figure 6.

Mean $\pm$ standard deviation performance results of the first 30 steps of the feature selection algorithms using PCA-C5.0.

Using PCA-LR model, the mean performance results reached 90% with thirteen features and 100% with twenty-three features (Fig. 5).

With the PCA-C5.0 model, we noted that when at least four features are used, the performance of the classifier reached 91.27% (Fig. 6).

The comparison of PCA-LR and PCA-C5.0 mean performance results (Fig. 7) shows that the PCA-C5.0 model is more efficient in minimizing the number of features and maximizing the accuracy. In fact, PCA-C5.0 provided an accuracy of more than 90% with only four features.

Table 2

Number of common features ‘ $k$ ’ and classification accuracy of the PCA-LR and PCA-C5.0 models after supervised feature selection and PCA reduction, using different criteria, on the LSVT Voice Rehabilitation dataset

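Dataset	Features ($p$)	Feature selection ($p^{*}$)	Reduction model	Classification model	Features ($k$)	Classification accuracy (%)
LSVT Voice Rehabilitation	309	114	PCA	PCA-LR	6	85.7
LSVT Voice Rehabilitation	309	114	PCA	PCA-LR	13	90.5
LSVT Voice Rehabilitation	309	114	PCA	PCA-LR	23	100
LSVT Voice Rehabilitation	309	114	PCA	PCA-C5.0	6	96.03
LSVT Voice Rehabilitation	309	114	PCA	PCA-C5.0	13	99.21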

Figure 7.

Comparison of mean performance results of the first 30 steps of the feature selection algorithms using PCA-LR, and PCA-C5.0.

The results obtained by applying feature selection and extraction with the PCA-LR and PCA-C5.0 models on the experimental dataset, using different criteria to estimate the number of common features, are presented in Table 2.

First, it should be mentioned that the feature selection method reduced the number of features from $p=$ 309 to $p^{*}=$ 114.

The feature extraction step then used two criteria. The first, the Kaiser criterion (eigenvalue $>$ 1) [57], yielded 6 common features; the second, a cumulative variance of 80%, yielded 13 common features. After dimension reduction, the results of class prediction show that for $k=$ 13 the best accuracy was obtained by the PCA-C5.0 model (99.21%), followed by the PCA-LR model (90.5%). For $k=$ 6 the highest accuracy was again obtained by the PCA-C5.0 model, followed by the PCA-LR model, with values of 96.03% and 85.7% respectively (Table 2).
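The two criteria for choosing the number of components can be sketched as follows. The eigenvalues are hypothetical and do not reproduce those of the LSVT data; the function names are introduced only for this illustration, and the Kaiser rule is assumed here to be applied to eigenvalues of standardized data, as is customary.

```python
def kaiser_k(eigvals):
    """Kaiser criterion: number of components with eigenvalue greater than 1."""
    return sum(1 for lam in eigvals if lam > 1.0)

def cumulative_variance_k(eigvals, target=0.80):
    """Smallest k whose leading eigenvalues explain at least `target` of the variance."""
    lam = sorted(eigvals, reverse=True)
    total = sum(lam)
    running = 0.0
    for k, value in enumerate(lam, start=1):
        running += value
        if running / total >= target:
            return k
    return len(lam)

# Hypothetical eigenvalues of a correlation matrix with eight variables.
eigvals = [4.0, 2.0, 1.5, 0.8, 0.7, 0.5, 0.3, 0.2]
print(kaiser_k(eigvals))
print(cumulative_variance_k(eigvals, target=0.80))
```

As in the results above, the two criteria generally return different values of $k$ , with the cumulative-variance rule often retaining more components.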

Our result of 96.03% with the PCA-C5.0 model for $k=$ 6 exceeds the accuracy reported in the original study by Tsanas et al. [31], who found an approximate classification score of 90% with $k=$ 8 using SVM and RF. The maximum classification accuracy (100%) was provided by the PCA-LR model with $k=$ 23 common features.

Using a novel feature selection method based on Network of Canonical Correlation Analysis (NCCA) with two classifiers, neural network (NN) and support vector machine (SVM), Hossain et al. [58] showed that NCCA is very robust, reaching an accuracy of 100% for $k=$ 10, compared with 88.1% and 85.71% for the information gain (IG) method at $k=$ 10 with the NN and SVM classifiers respectively. Interestingly, in order to improve the performance of dysphonia measures selection for Parkinson speech rehabilitation, the authors of [59] used the Diversity Regularized Ensemble Feature Weighting (DREFW) algorithm with SVM as classifier and showed that the proposed ensemble feature weighting algorithm can obtain high stability and better classification performance for speech assessment than the original study. The top ten dysphonia measures selected by DREFW are $\{\textit{GNE}_{\textit{NSR, TKEO}},$ $\textit{VFER}_{\textit{SNR, SEO}},$ $\textit{VFER}_{\textit{NSR, TKEO1}},$ $\textit{VFER}_{\textit{NSR, TKEO}},$ $\textit{VFER}_{\textit{NSR, SEO}},$ $\textit{IM}_{\textit{NSR, TKEO}},$ $\textit{Log energy},$ $0^{\rm th}\ \textit{MFCC},$ $2^{\rm nd}\ \textit{MFCC},$ $5^{\rm th}\ \textit{MFCC}\}$ .

Figure 8.

Performance of female and male participants in the PCA-LR and PCA-C5.0 models.

3.2 Effect of gender on LSVT voice rehabilitation

We tested for gender effects on performance because previous evidence suggests gender-related differences in speech abilities. To examine the effect of gender on dimension reduction performance, we applied the dimension reduction methods separately to male ($n=$ 8) and female ($n=$ 6) participants.

Figure 8 shows the results of training the C5.0 and LR classifiers on the 14 subjects, split by gender. Classifier accuracy fluctuates around 90%, depending on the number of features presented to the classifiers.

For female participants, the PCA-LR model exceeded 90% accuracy from $k=$ 13 and reached its maximum of 100% from $k=$ 14. Male participants had a higher average performance, around 90% from $k=$ 5 and reaching 100% from $k=$ 10.

In the PCA-C5.0 model, accuracy for female participants exceeded 90% and reached 100% from $k=$ 9, while accuracy for male participants exceeded 90% with only two features and reached 100% from $k=$ 6.
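Statements of the form "reached 100% from $k=$ 9" can be read off an accuracy-versus-$k$ curve mechanically. The following Python sketch shows one way to do so, assuming that "from $k$" means the accuracy stays at or above the target for every larger $k$ tested. The function name and the toy curve are hypothetical and do not reproduce the values plotted in Figure 8.

```python
def first_k_sustaining(accuracy_by_k, target):
    """Smallest k from which accuracy stays at or above `target` for all larger k.

    Returns None if the accuracy at the largest tested k is below the target.
    """
    answer = None
    # Walk from the largest k downwards and stop at the first dip below target.
    for k in sorted(accuracy_by_k, reverse=True):
        if accuracy_by_k[k] >= target:
            answer = k
        else:
            break
    return answer


# Hypothetical accuracy curve (number of components -> accuracy).
# Illustrative values only, NOT the results reported for the LSVT dataset.
toy_curve = {1: 0.71, 2: 0.93, 3: 0.86, 4: 0.93, 5: 0.93, 6: 1.0, 7: 1.0}

k90 = first_k_sustaining(toy_curve, 0.90)
k100 = first_k_sustaining(toy_curve, 1.0)
print(k90, k100)
```

In the toy curve the accuracy first touches 93% at $k=$ 2 but dips again at $k=$ 3, so the sustained threshold is reported at $k=$ 4. This distinction matters when accuracy fluctuates with the number of features, as it does here.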

The differences in performance between male and female participants may reflect the well-documented gender-related differences in speech performance and intelligibility in PD [60, 61, 62]. However, the small sample size (14 subjects) limits the generalizability of these findings.

4. Conclusion

The current study addressed the problem of speech performance in Parkinson’s disease using dimension reduction and data smoothening approaches, focusing on data preprocessing, feature extraction, dimensionality reduction and classification. We proposed a new method for dysphonia measures selection for Parkinson speech rehabilitation that uses PCA to project the features onto a low-dimensional feature space and uses C5.0 and LR as decision functions for classification. The main advantage of our approach is that, for the LSVT Voice Rehabilitation dataset, the number of factors can be effectively reduced from $p=$ 309 to $p=$ 114 by feature selection and to $k=$ 6 by feature reduction.

Extensive testing on the LSVT Voice Rehabilitation dataset reveals the advantages of the proposed approach: the PCA-C5.0 model achieved a classification accuracy of 99.21% with only 13 features and 96.03% with 6 features.

In addition, the differences in performance observed between male and female participants suggest that the applied feature extraction and modeling approaches may be sensitive to gender. Further research is therefore needed to understand the possible relationships between gender and dimensionality reduction in voice rehabilitation.

Conflict of interest

The authors have no conflict of interest to report.

References

1. Iansek, Marigliani, Bradshaw, Gates. Speech impairment in a large sample of patients with Parkinson’s disease. Behav Neurol. 1998; 11: 131-137.

2. Logemann, Fisher, Boshes, Blonsky. Frequency and co-occurrence of vocal tract dysfunctions in the speech of a large sample of Parkinson patients. J Speech Hear Disord. 1978; 43: 47-57.

3. Miller, Noble, Jones, Burn. Life with communication changes in Parkinson’s disease. Age Ageing. 2006; 35(3): 235-9.

4. Skodda, Grönheit, Schlegel. Impairment of vowel articulation as a possible marker of disease progression in Parkinson’s disease. PLoS One. 2012; 7(2): e32132.

5. Kwan, Whitehill. Perception of speech by individuals with Parkinson’s disease: a review. Parkinsons Dis. 2011; 2011: 389767.

6. Herd, Tomlinson, Deane, Brady, Smith, Sackley, Clarke. Speech and language therapy versus placebo or no intervention for speech problems in Parkinson’s disease. Cochrane Database Syst Rev. 2012; 8: CD002812.

7. Dias, Barbosa, Limongi, Barbosa. Speech disorders did not correlate with age at onset of Parkinson’s disease. Arq Neuropsiquiatr. 2016; 74(2): 117-21.

8. Orozco-Arroyave, Hönig, Arias-Londoño, Vargas-Bonilla, Daqrouq, Skodda, Rusz, Nöth. Automatic detection of Parkinson’s disease in running speech spoken in three different languages. J Acoust Soc Am. 2016; 139(1): 481-500.

9. Little, McSharry, Hunter, Spielman, Ramig. Suitability of dysphonia measurements for telemonitoring of Parkinson’s disease. IEEE Trans Biomed Eng. 2009; 56: 1015-1022.

10. Bocklet, Steidl, Nöth, Skodda. Automatic evaluation of Parkinson’s speech – acoustic, prosodic and voice related cues. In: Proceedings of the 14th Annual Conference of the International Speech Communication Association (INTERSPEECH). 2013; 1149-1153.

11. Skodda. Effect of deep brain stimulation on speech performance in Parkinson’s disease. Parkinsons Dis. 2012; 850596.

12. Ramig, Countryman, O’Brien, Hoehn, Thompson. Intensive speech treatment for patients with Parkinson’s disease: Short and long term comparison of two techniques. Neurology. 1996; 47: 1496-1504.

13. Ramig, Sapir, Countryman, et al. Intensive voice treatment (LSVT) for patients with Parkinson’s disease: A 2 year follow up. J Neurol Neurosurg Psychiatry. 2001; 71: 493-498.

14. Ramig, Fox, Sapir. Parkinson’s disease: Speech and voice disorders and their treatment with Lee Silverman Voice Treatment. Semin Speech Lang. 2004; 25: 169-180.

15. Baumgartner, Sapir, Ramig. Voice quality changes following phonatory respiratory effort treatment (LSVT) versus respiratory effort treatment in individuals with Parkinson disease. J Voice. 2001; 15: 105-114.

16. Dromey. Articulatory kinematics in patients with Parkinson’s disease using different speech treatment approaches. J Med Speech Lang. 2001; 8: 155-161.

17. Sapir, Spielman, Ramig, et al. Effects of intensive voice treatment (LSVT®) on vowel articulation in dysarthric individuals with idiopathic Parkinson disease: Acoustic and perceptual findings. J Speech Lang Hear Res. 2007; 50(4): 899-912.

18. Zhu. A review on dimension reduction. Int Stat Rev. 2013; 81(1): 134-150.

19. Fan, Han, Liu. Challenges of big data analysis. Natl Sci Rev. 2014; 1(2): 293-314.

20. Dimension reduction for high-dimensional data. Methods Mol Biol. 2010; 620: 417-34.

21. Saeys, Inza, Larrañaga. A review of feature selection techniques in bioinformatics. Bioinformatics. 2007; 23(19): 2507-2517.

22. Hira, Gillies. A review of feature selection and feature extraction methods applied on microarray data. Adv Bioinformatics. 2015; 2015: 198363.

23. Drotár, Smékal. Comparison of stability measures for feature selection. SAMI 2015, IEEE 13th International Symposium on Applied Machine Intelligence and Informatics, Herlany, Slovakia. 2015; 71-75.

24. Wang, Paliwal. Feature extraction and dimensionality reduction algorithms and their applications in vowel recognition. Pattern Recognit. 2003; 36(10): 2429-2439.

25. Oung, Muthusamy, Lee, Basah, Yaacob, Sarillee, Lee. Technologies for assessment of motor disorders in Parkinson’s disease: A review. Sensors (Basel). 2015; 15(9): 21710-45.

26. Behroozi, Sami. A multiple-classifier framework for Parkinson’s disease detection based on various vocal tests. Int J Telemed Appl. 2016; 2016: 6837498.

27. Yang, Zheng, Luo, Cai, Liu, et al. Effective dysphonia detection using feature dimension reduction and kernel density estimation for patients with Parkinson’s disease. PLoS ONE. 2014; 9(2): e88825.

28. Tsanas. Accurate telemonitoring of Parkinson’s disease symptom severity using nonlinear speech signal processing and statistical machine learning. D.Phil. (Ph.D.) thesis, University of Oxford, UK. 2012; p. 261.

29. Tsanas, Little, McSharry, Ramig. Nonlinear speech analysis algorithms mapped to a standard metric achieve clinically useful quantification of average Parkinson’s disease symptom severity. J R Soc Interface. 2011; 8: 842-855.

30. Yang, Veltri. Intelligent healthcare informatics in big data era. Artif Intell Med. 2015; 65(2): 75-7.

31. Tsanas, Little, Fox, Ramig. Objective automatic assessment of rehabilitative speech treatment in Parkinson’s disease. IEEE T Neur Sys Reh. 2014; 22: 181-190.

32. Guérif. Réduction de dimension en apprentissage numérique non supervisée. PhD thesis, Université Paris 13. 2006; p. 148.

33. Ferchichi, Zidi, Laabidi, Maouche. Feature selection using an SVM learning machines. In: Proceedings of the 3rd International Conference on Signals, Circuits and Systems (SCS 2009). 2009; 1-6.

34. El Moudden, ElBernoussi, Benyacoub. A dimensionality reduction framework for automatic speech recognition. Proceedings of the 26th International Business Information Management Association Conference – Innovation Management and Sustainable Economic Competitive Advantage: From Regional Development to Global Growth, IBIMA. 2015; 2602-2608.

35. Elmoudden, ElBernoussi, Benyacoub. Modeling human activity recognition by dimensionality reduction approach. Proceedings of the 27th International Business Information Management Association Conference – Innovation Management and Education Excellence Vision 2020: From Regional Development Sustainability to Global Economic Growth, IBIMA. 2016; 1800-1805.

36. El Moudden, Ouzir, Benyacoub, ElBernoussi. Mining human activity using dimensionality reduction and pattern recognition. Contemporary Engineering Sciences. 2016; 9(21): 1031-1041.

37. Wang, Zhu. Sparse sufficient dimension reduction using optimal scoring. Comput Stat Data Anal. 2013; 57: 223-232.

38. Linear dimensionality reduction for multi-label classification. In: Proceedings of the 21st International Conference on Artificial Intelligence, Pasadena, CA. 2009; 1077-1082.

39. Gibbons. Nonparametric Statistical Inference. 5th ed. Chapman & Hall/CRC. 2010.

40. Pearson. On lines and planes of closest fit to systems of points in space. Philos Mag. 1901; 2: 559-572.

41. Hastie, Tibshirani, Friedman. The Elements of Statistical Learning. Springer. 2001; p. 745.

42. Belhumeur, Hespanha, Kriegman. Eigenfaces vs. fisherfaces: Recognition using class specific linear projection. IEEE Trans Pattern Anal Mach Intell. 1997; 19(7): 711-720.

43. Lee, Verleysen. Non Linear Dimensionality Reduction. Springer. 2007; p. 309.

44. Saul, Roweis. An introduction to locally linear embedding. 2000. Available from http://www.cs.toronto.edu/~roweis/lle/.

45. Schölkopf, Smola, Müller. Nonlinear component analysis as a kernel eigenvalue problem. Neural Comput. 1998; 10: 1299-1319.

46. Mika, Rätsch, Weston, Schölkopf, Müller. Fisher discriminant analysis with kernels. In: Proceedings of the IEEE Signal Processing Society Workshop. 1999; 41-48.

47. Jolliffe. Principal Component Analysis. Springer, New York. 2002; p. 488.

48. Jolliffe, Cadima. Principal component analysis: A review and recent developments. Phil Trans R Soc. 374(2065): 20150202.

49. Cattell. The scree test for the number of factors. Multivariate Behav Res. 2016; 1: 245-276.

50. Bouveyron. Modélisation et classification des données de grande dimension: Application à l’analyse d’images. PhD thesis, Université Joseph Fourier Grenoble 1. 2006; p. 183.

51. Quinlan. Data mining tools: See5 and C5.0. RuleQuest Research. 2007. Available from http://www.rulequest.com/see5-info.html.

52. Galathiya, Ganatra, Bhensdadia. Improved decision tree induction algorithm with feature selection, cross validation, model complexity and reduced error pruning. IJITCS. 2012; 3(2): 3427-3431.

53. Bujlow, Riaz, Pedersen. A method for classification of network traffic based on C5.0 machine learning algorithm. In: ICNC’12: International Conference on Computing, Networking and Communications (ICNC): Workshop on Computing, Networking and Communications. 2012; 237-241.

54. Niu, Zong, Yan, Zhao. Auto-recognizing DBMS workload based on C5.0 algorithm. Proceedings – 2009 2nd International Workshop on Knowledge Discovery and Data Mining, WKDD 2009. 2009; 4772051: 777-780.

55. Dreiseitl, Ohno-Machado. Logistic regression and artificial neural network classification models: a methodology review. J Biomed Inform. 2002; 35(5-6): 352-9.

56. Srivatsa. Evaluation of logistic regression and neural network model with sensitivity analysis on medical datasets. IJCSS. 2011; 5(5): 503-511.

57. Kaiser. The application of electronic computers to factor analysis. Educ Psychol Meas. 1960; 20: 141-151.

58. Hossain, Kabir, Shahjahan. A robust feature selection system with Colin’s CCA network. Neurocomputing. 2016.

59. Stable dysphonia measures selection for Parkinson speech rehabilitation via diversity regularized ensemble. IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). 2016; 2264-2268.

60. Skodda, Grönheit, Mancinelli, Schlegel. Progression of voice and speech impairment in the course of Parkinson’s disease: A longitudinal study. Parkinsons Dis. 2013; 20420080: 1-8.

61. Skodda, Visser, Schlegel. Gender-related patterns of dysprosody in Parkinson disease and correlation between speech variables and motor symptoms. J Voice. 2011; 25(1): 76-82.

62. Weismer. Speech intelligibility. In: Ball, Perkins, Müller, Howard, Eds. The Handbook of Clinical Linguistics. 2008; p. 568-582.
