Fusion of WPT and MFCC feature extraction in Parkinson’s disease diagnosis

Abstract

BACKGROUND:

Parkinson’s disease (PD) is a neurological disorder, progressive in nature. In order to provide customized patient care, diagnosis and monitoring using smart gadgets, smartphones, and smartwatches, there is a need for a system that works in natural as well as controlled environments.

OBJECTIVE AND METHODS:

The primary purpose is to record a speech signal and identify whether it indicates Parkinson’s disease. For this work, three feature extraction methods were compared, i.e. Wavelet Packet Transforms (WPT), MFCC, and a fusion of MFCC and WPT. In addition to feature extraction, two classifiers were used, i.e. HMM and SVM.

RESULTS:

In this study, the fusion of MFCC and WPT with HMM shows the best performance parameters.

CONCLUSION:

The best of the three feature extraction and classifier results are described in this paper.

Keywords

Classifier, feature extraction, speech signal, MFCC, Wavelet Packet Transforms

1. Introduction

Research has indicated that 70% to 90% of patients with Parkinson’s disease (PD) show speech disorders, and changes in voice and prosody are cited among the earliest prodromal signs of PD [1, 2, 3]. In these stages, PD signs may range from severe to mild, to the extent of being hardly noticeable. PD diagnosis therefore remains a considerable challenge, especially in the earlier stages [4]. Clinical features of pathological speech include: obsessive discourse, lack of resonance in speech, simultaneous varying speech tones, short phrases, pitch breaks, monotones, reduced loudness, uncontrollable speech loudness, unconditioned breaks, low pitch, diminished rate, spikes in speech and reduced stress, decay, unending involuntary facial twitching, occasional nasal emission, swallowing difficulties, and involuntary drooling [5]. Speech disorders can be classified according to prosodic features, excitation source features and vocal tract features [2, 3, 5, 8]. Confirming the diagnosis has depended heavily on the ability of the physician, who judges on the basis of the subject’s history and observable signs and symptoms. In this respect, technology-based systems offer the potential for improved health care, mainly for the underprivileged and where there are critical shortages of PD specialists [4]. The use of validated methods and target-directed data collected with technology-based tools in the home setting has prospects that extend to effective patient management. Furthermore, prolonged use of technology-based tools allows direct data access and easier interpretation over time. This empowers the patient through education, compliance, custom measures in emergency cases and, most importantly, real-time or near-real-time patient monitoring [4].
Some speech-related diseases such as PD and Alzheimer’s are, apart from speech impairment, also associated with an aging population [4, 6, 9, 10]. Considerable completed and ongoing research addresses the use of biomarkers in detecting, evaluating and monitoring neurodegenerative diseases (ND) [2]. With this in mind, it can be argued that speech can be used to distinguish between pathological and non-pathological samples [5], since it is regarded as one of the early biomarkers for PD [3, 11, 12, 13, 14, 15, 16]. Continuous wavelet transforms are used as a signal processing tool for investigating the time-varying frequency spectrum characteristics of transient signals [17]. Wavelet packets, which introduce signal filtering, segment the speech signal into two bands, i.e. high and low frequencies [18]. A concise spectral analysis is achievable by dividing the speech signal into individual time and frequency regions using wavelets [18]. Wavelets break down a composite signal into multiple cascaded signals of varying resolution in the frequency and time domains. A wavelet tree provides a spectral estimate for the critical bands audible to the human ear [18, 19]. With the rapid development of smart technologies and societal progress, efficient early diagnosis of pathological conditions is needed for many diseases, e.g. high blood pressure, heart failure, Alzheimer’s, cancer, etc. Smart technologies aid early disease diagnosis because people interact continuously with gadgets such as smartphones, watches, and computers, which enables continuous monitoring of pathological conditions.

2. Methods

In the proposed methodology, we studied the MFCC (Mel-Frequency Cepstral Coefficient), DWT (Discrete Wavelet Transform), and a fusion of MFCC $+$ DWT as feature extraction methods and classified them using HMM (Hidden Markov Model) and SVM (Support Vector Machines). The block diagram of our proposed system is given in Fig. 1.

Figure 1.

System block diagram.

Wavelet analysis is regarded as one of the most reliable methods for spectral feature extraction from non-stationary signals, since it provides multi-resolution measures in the frequency and time domains [21]. At high frequencies smaller window sizes are used, on the assumption that abrupt dynamic changes in the signal of interest occur at high frequencies and are captured by smaller windows, whereas at lower frequencies larger window sizes are used, on the assumption that the speech signal is less dynamic and every necessary feature can still be captured [20]. Wavelets are the preferable choice when working with non-stationary signals [22, 23]. The Wavelet Packet Transform (WPT) overcomes some of the limitations faced by other features, as it can analyse speech signals through a tailored wavelet packet structure that follows the critical bands defined by psychoacoustics [21]. WPT is an algorithm that meets several requirements in the signal-processing field, giving analysis in both the frequency domain and the time domain [15]. A Multi-Resolution Wavelet Analysis provides frequency and time data through spectral parameter variation. A wavelet also serves as a set of basis functions used for sub-band localization [22]. The Continuous Wavelet Transform of a signal $f(t)$ is defined as:

$\displaystyle\textit{WT}_{f}=\frac{1}{\sqrt{x}}\int_{R}f\left(t\right)\Psi^{\ast}\left(\frac{t-\tau}{x}\right)dt=\langle f\left(t\right),\Psi_{x,\tau}\left(t\right)\rangle$ (1)

where $\Psi$ represents the mother (basic) wavelet, $x$ is the scale factor, $\tau$ is the translation factor, and $\Psi_{x,\tau}$ is the scaled and translated wavelet.

Unlike other peak detection algorithms, continuous wavelets isolate peaks based on amplitude, disregarding less significant spectrum peaks; this makes it a powerful tool for identifying and separating an interest signal from spikes and other noises. The introduction of the wavelet-based transform improves the classifier performance [21].
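As an illustration of Eq. (1), the following is a minimal sketch that approximates the integral by a Riemann sum for a single scale and translation. The Mexican-hat mother wavelet, the toy signal, the sampling step and the function names are assumptions made for this example only; the paper does not state which mother wavelet is used for the continuous transform.

```python
import math

def mexican_hat(t):
    """Real-valued Mexican-hat mother wavelet (illustrative choice, not from the paper)."""
    return (1.0 - t * t) * math.exp(-0.5 * t * t)

def cwt_coefficient(signal, dt, scale, tau, wavelet=mexican_hat):
    """Riemann-sum approximation of Eq. (1) at one (scale, tau) pair.

    The wavelet is real, so its complex conjugate equals itself.
    """
    total = 0.0
    for n, value in enumerate(signal):
        t = n * dt
        total += value * wavelet((t - tau) / scale)
    return total * dt / math.sqrt(scale)

# Toy signal: one narrow bump centred at t = 0.5 s, sampled every 1 ms.
dt = 0.001
signal = [math.exp(-0.5 * ((n * dt - 0.5) / 0.01) ** 2) for n in range(1000)]

# Scan the translation factor at a fixed scale; the response is largest at the bump.
taus = [k * 0.05 for k in range(21)]
responses = [cwt_coefficient(signal, dt, scale=0.01, tau=tau) for tau in taus]
best_tau = taus[max(range(len(taus)), key=lambda k: responses[k])]
print(best_tau)
```

Scanning several scales in the same way would give the full time-scale map; the single-scale scan is kept here only to show how the translation factor localizes a transient.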

2.1 Database and feature set

The data sets used were obtained from the University of California (UC) Irvine Machine Learning Repository. The training data cover 40 people, 20 with pathological and 20 with non-pathological voice samples. Multiple voice samples were obtained from the 40 subjects, consisting of 26 voice samples of sustained vowels, numbers, words, and short sentences. The test phase follows the training phase, and for the test set the voice samples obtained from [26] were used. The test samples came from 28 patients (diseased and non-diseased). The test set was obtained by having the subjects articulate the vowels “a”, “o”, and “u”. The recording frequency range was between 50 Hz and 13 kHz [26].

2.2 Feature extraction and classification

Several feature extraction algorithms are already in use. For speech, these include the Wavelet Packet Transform (WPT), Linear Predictive Coding (LPC), the Short-Time Fourier Transform (STFT), the Pulse Coupled Neural Network (PCNN), and Mel-Frequency Cepstral Coefficients (MFCC). Wavelet theory is based on Multi-Resolution Analysis (MRA), which embeds signal filtering at different frequencies [18].

2.3 Discrete Wavelet Packet Transforms (DWPT)

Feature extraction is based on the multi-resolution properties of the WPT, generally with successive separations of high and low frequencies. Multi-resolution wavelet analysis gives frequency as well as time information by varying the resolution across the spectrum. The decomposed signal consists of wavelets enabling sub-band localization. WPT can be used even in degraded environments [23]. Wavelet packets eliminate noise in speech signals through decomposition to obtain energy and entropy characteristics. Energy varies with the level of packet decomposition and is calculated as shown in Eq. (2):

$\displaystyle E_{i}=\int^{\infty}_{-\infty}\left|x^{i}_{j}(t)\right|^{2}dt$ (2)

The calculation of the total energy ( $E_{\textit{Total}}$ ) is shown in Eq. (3):

$\displaystyle E_{\textit{Total}}=\sum^{2^{j}}_{i=1}E_{i}$ (3)

Figure 2.

Wavelet tree.

In Eqs (2) and (3), $E_{i}$ is the energy of the $i$-th sub-band. Entropy measures the disorder of the system. A hybrid combination of MFCCs and WPTs is used for the pathological tests in this research. A 4-level wavelet decomposition using fourth-order Daubechies wavelets is shown in Fig. 2. It extracts the approximation coefficients and the detail coefficients obtained from the wavelet decomposition. The primary intent is to have a cheaper, more efficient system to diagnose PD without the need for specialized equipment or environment. Figures 3 and 4 show the WPT outputs for the original speech and the de-noised speech, respectively.
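The sub-band energies and the total energy of Eq. (3) can be sketched as follows. This is a minimal illustration under stated assumptions: it uses the Haar filter pair instead of the fourth-order Daubechies wavelet named in the text (to keep the filters short), it takes the discrete sub-band energy to be the sum of squared coefficients, and the function names and toy signal are hypothetical.

```python
import math

def haar_split(x):
    """One Haar analysis step: return the (low-pass, high-pass) halves of x."""
    s = 1.0 / math.sqrt(2.0)
    low = [s * (x[i] + x[i + 1]) for i in range(0, len(x) - 1, 2)]
    high = [s * (x[i] - x[i + 1]) for i in range(0, len(x) - 1, 2)]
    return low, high

def wavelet_packet_bands(x, levels):
    """Full wavelet packet tree: every node is split at every level."""
    bands = [list(x)]
    for _ in range(levels):
        next_bands = []
        for band in bands:
            low, high = haar_split(band)
            next_bands.extend([low, high])
        bands = next_bands
    return bands  # 2**levels terminal sub-bands

def band_energies(bands):
    """Discrete sub-band energy: sum of squared coefficients per band."""
    return [sum(c * c for c in band) for band in bands]

# Toy signal: 64 samples of a sinusoid (5 cycles over the window).
signal = [math.sin(2 * math.pi * 5 * n / 64) for n in range(64)]
bands = wavelet_packet_bands(signal, levels=4)
energies = band_energies(bands)
total_energy = sum(energies)  # Eq. (3): sum over the 2**4 = 16 sub-bands
print(len(bands), round(total_energy, 6))
```

Because the Haar filters are orthonormal, the total over the 16 sub-bands equals the energy of the input signal, which gives a quick sanity check on any implementation.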

Figure 3.

Analyzed signal F0.

Figure 4.

F0 de-noised signal.

2.4 Mel-Frequency Cepstral Coefficients (MFCC)

MFCC is used to extract compact vectors representing the short-time stationary parameters of speech signals. To obtain the frequency spectrum, the Discrete Fourier Transform is used to transform the speech signal into the frequency domain. MFCCs alone are considered, being noise insensitive. The combined features of WPT-based feature-warped MFCCs and feature-warped MFCCs of the enhanced speech signals are used for the feature extraction, as shown in Fig. 1. The MFCCs are computed over Hamming-windowed frames of the enhanced speech signals, with a 30 ms frame size and 10 ms overlap. The MFCC features are obtained using a 32-channel Mel filter bank, followed by a transformation to the cepstral domain, keeping 13 coefficients. The first- and second-order cepstral coefficients are then appended to the MFCC features. Feature warping with a 301-frame window is applied to the obtained MFCC features. Each frame of the enhanced speech signal is decomposed into two frequency sub-bands: approximation coefficients (low-frequency sub-band) and detail coefficients (high-frequency sub-band). The two sets of coefficients are then merged into one vector. The decomposition can be repeated by applying the WPT to the low-frequency sub-band. To capture all the essential characteristics of the vocal tract, feature-warped MFCCs are computed from the concatenated WPT vector. Finally, the WPT and feature-warping approaches are combined by concatenating the feature-warped MFCCs from the spectrum of the enhanced speech signal with the feature-warped MFCCs extracted from the WPT into a single feature vector [24]. Speech recordings for pathological purposes are often carried out in controlled, noise-free environments. Such settings are often expensive, and in patient-customized telehealth solutions various types of environmental noise are often present. This, in turn, corrupts the speech traces provided to the system [5].
As a result, the performance of custom-made PD diagnostic systems tends to diminish significantly in uncontrolled natural environments because of high noise levels. In particular, the performance of MFCC features degrades substantially in the presence of noise and reverberation. Multiband feature methods combine the MFCC features of the noisy speech signals and the MFCCs extracted from the Wavelet Packet Transform (WPT) into a single feature vector [25]. The WPT can be used to extract more features from the low-frequency sub-bands. These features add important information to the MFCC features of the whole spectrum.
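The MFCC steps described in this section (Hamming-windowed 30 ms frames with 10 ms overlap, a 32-channel Mel filter bank, and 13 cepstral coefficients) can be sketched as below. This is a dependency-free illustration, not the HTK or MELFCC routine: the 8 kHz sampling rate, the Mel-scale formula, the unnormalised DCT-II, the naive DFT and all function names are assumptions, and the derivative coefficients, feature warping and WPT fusion are left out.

```python
import cmath
import math

def hz_to_mel(f):
    return 2595.0 * math.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def power_spectrum(frame):
    """Naive DFT power spectrum for bins 0..N/2 (slow but dependency-free)."""
    n = len(frame)
    spec = []
    for k in range(n // 2 + 1):
        acc = sum(frame[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n))
        spec.append(abs(acc) ** 2 / n)
    return spec

def mel_filterbank(n_filters, n_bins, fs):
    """Triangular filters whose edges are equally spaced on the Mel scale."""
    top = hz_to_mel(fs / 2.0)
    edges = [mel_to_hz(top * i / (n_filters + 1)) for i in range(n_filters + 2)]
    bin_hz = (fs / 2.0) / (n_bins - 1)
    bank = []
    for m in range(1, n_filters + 1):
        lo, mid, hi = edges[m - 1], edges[m], edges[m + 1]
        row = []
        for k in range(n_bins):
            f = k * bin_hz
            if lo < f <= mid:
                row.append((f - lo) / (mid - lo))
            elif mid < f < hi:
                row.append((hi - f) / (hi - mid))
            else:
                row.append(0.0)
        bank.append(row)
    return bank

def mfcc(signal, fs, frame_ms=30, overlap_ms=10, n_filters=32, n_ceps=13):
    """Return one list of n_ceps cepstral coefficients per frame."""
    size = int(fs * frame_ms / 1000)
    hop = size - int(fs * overlap_ms / 1000)  # overlap read literally as 10 ms
    window = [0.54 - 0.46 * math.cos(2 * math.pi * n / (size - 1)) for n in range(size)]
    bank = mel_filterbank(n_filters, size // 2 + 1, fs)
    features = []
    for start in range(0, len(signal) - size + 1, hop):
        frame = [signal[start + n] * window[n] for n in range(size)]
        spec = power_spectrum(frame)
        log_e = [math.log(sum(w * p for w, p in zip(row, spec)) + 1e-10) for row in bank]
        ceps = [sum(log_e[m] * math.cos(math.pi * c * (m + 0.5) / n_filters)
                    for m in range(n_filters)) for c in range(n_ceps)]
        features.append(ceps)
    return features

fs = 8000  # assumed sampling rate for the toy example
tone = [math.sin(2 * math.pi * 440 * n / fs) for n in range(fs // 5)]  # 0.2 s tone
feats = mfcc(tone, fs)
print(len(feats), len(feats[0]))
```

Under the fusion scheme described above, a vector of this kind computed from the full-band signal would be concatenated with one computed from the WPT coefficients; that step is omitted because the paper does not give its exact form.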

Table 1
Comparison of selected MFCC feature extraction tools and the mean square error values

MFCC feature extraction tools	MSE values
MFCC MSE (MELFCC, THIS)	0.00
MFCC MSE (HTK, MELFCC)	0.07
MFCC MSE (HTK, THIS)	0.07
HTK MFCC (variance)	260.23

Figure 5.

Step-by-step MFCC output.

Figure 5 shows a stepwise explanation of the MFCC computation. Figure 6 shows the Mel-frequency cepstral coefficients calculated from a speech signal.

Figure 6 compares the cepstral features extracted using the MFCC routine used here (labelled THIS) against those extracted using the HTK and MELFCC tools. The HTK file format input/output is obtainable from the included sample HTKREAD_LITE and HTKWRITE_LITE routines. Further functionality can be achieved by installing the VOICEBOX toolbox. MELFCC is used for calculating PLP (Perceptual Linear Prediction) and MFCC features from sound waveforms. PLP warps the spectra to reduce speaker diversity while preserving important speech information. The MSEs (mean square errors) are compared, as shown in Table 1.

Figure 6.

Mel-Frequency spectrum: HTK, MELFCC, THIS.

2.5 Evaluation

Parameters such as jitter, shimmer, harmonic-to-noise ratio, pitch period entropy (PPE), mean, median, and energy are calculated, together with the accuracy, sensitivity and specificity defined in Eqs (4) to (6).

$\displaystyle\text{Accuracy (ACC)}=\frac{\textit{TP}+\textit{TN}}{\textit{TP}+\textit{FP}+\textit{FN}+\textit{TN}}\times 100\%$ (4)

$\displaystyle\text{Sensitivity}=\frac{\textit{TP}}{\textit{TP}+\textit{FN}}\times 100\%$ (5)

$\displaystyle\text{Specificity}=\frac{\textit{TN}}{\textit{FP}+\textit{TN}}\times 100\%$ (6)

TP (true positive) – No. of PD patients accurately classified as PD

TN (true negative) – No. of non-PD patients accurately classified as non-PD

FN (false negative) – No. of PD patients inaccurately classified as non-PD

FP (false positive) – No. of non-PD patients inaccurately classified as PD [25].
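The three measures in Eqs (4) to (6) can be computed directly from the four counts defined above. The snippet below is a minimal sketch; the function name and the counts are hypothetical and are not taken from the experiments reported in this paper.

```python
def classification_metrics(tp, tn, fp, fn):
    """Return accuracy, sensitivity and specificity as percentages (Eqs 4 to 6)."""
    accuracy = 100.0 * (tp + tn) / (tp + fp + fn + tn)
    sensitivity = 100.0 * tp / (tp + fn)
    specificity = 100.0 * tn / (fp + tn)
    return accuracy, sensitivity, specificity

# Hypothetical counts for 50 PD and 50 non-PD test samples (illustration only).
acc, sens, spec = classification_metrics(tp=45, tn=40, fp=10, fn=5)
print(f"{acc:.2f}% {sens:.2f}% {spec:.2f}%")  # 85.00% 90.00% 80.00%
```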

Table 2

Evaluation of the HMM classifier for various feature extraction methods

Component	WPT with HMM	MFCC with HMM	Fusion MFCC $+$ WPT with HMM
Accuracy	79.03%	93.55%	95.16%
Sensitivity	58.06%	90.32%	93.55%
Specificity	99.99%	96.77%	91.67%

Table 3

Evaluation of the SVM classifier for various feature extraction methods

Component	WPT with SVM	MFCC with SVM	Fusion MFCC $+$ WPT with SVM
Accuracy	72.56%	87.09%	85.48%
Sensitivity	99.99%	77.42%	80.65%
Specificity	45.16%	96.77%	90.32%

Figure 7.

Identification of normal and pathological voice signal.

2.6 Hidden Markov Models (HMM) and Support Vector Machines (SVM)

Two classifiers are used, i.e. HMM and SVM, and their performance parameters (sensitivity, accuracy and specificity) are compared.

Training: Training speech samples are used in the training phases for both classifiers (HMM and SVM) and a classification model is developed.

Testing: Test speech samples are used in this phase. The HMM or SVM classifier assigns each sample to a category (normal or abnormal) using the model obtained in the training phase.
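The training and testing phases can be illustrated with a small stand-in classifier. The sketch below trains a linear SVM by sub-gradient descent on the regularised hinge loss and then labels unseen samples; it is not the classifier implementation used in this work, whose kernel, toolbox and hyperparameters are not stated. The two-dimensional feature vectors, the learning rate, the regularisation constant and the function names are all made up for the example.

```python
def train_linear_svm(samples, labels, epochs=200, lr=0.1, reg=0.01):
    """Fit weights w and bias b by sub-gradient descent on the hinge loss.

    Labels must be +1 (pathological) or -1 (normal).
    """
    dim = len(samples[0])
    w, b = [0.0] * dim, 0.0
    for _ in range(epochs):
        for x, y in zip(samples, labels):
            margin = y * (sum(wi * xi for wi, xi in zip(w, x)) + b)
            if margin < 1.0:
                # Sample violates the margin: move the boundary towards it.
                w = [wi + lr * (y * xi - reg * wi) for wi, xi in zip(w, x)]
                b += lr * y
            else:
                # Margin satisfied: apply only the regularisation shrinkage.
                w = [wi - lr * reg * wi for wi in w]
    return w, b

def predict(model, x):
    """Return +1 or -1 according to the side of the decision boundary."""
    w, b = model
    return 1 if sum(wi * xi for wi, xi in zip(w, x)) + b >= 0 else -1

# Training phase: made-up feature vectors standing in for extracted features.
train_x = [[2.0, 2.5], [2.5, 2.0], [3.0, 3.0], [-2.0, -2.5], [-2.5, -2.0], [-3.0, -3.0]]
train_y = [1, 1, 1, -1, -1, -1]
model = train_linear_svm(train_x, train_y)

# Testing phase: classify samples that were not seen during training.
test_x = [[2.2, 2.8], [-2.7, -2.1]]
predictions = [predict(model, x) for x in test_x]
print(predictions)
```

In practice the feature vectors would be the MFCC, WPT or fused features, and the predictions on the test set would feed the accuracy, sensitivity and specificity calculations of Section 2.5.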

3. Results and discussions

The GUI reports whether the input speech signal is a “normal” or a “Parkinson detected” voice signal. The features are extracted using MFCC, DWT, and the fusion of MFCC and DWT; the HMM (Hidden Markov Model) classifier is applied, and the accuracy, sensitivity and specificity are calculated for each of the three feature extraction techniques.

4. Conclusions

The future points towards real-time smart diagnosis and monitoring of pathological and non-pathological clinical features of PD patients. It is possible to envision a future in which smaller-scale information is used to improve diagnostics, measure the effectiveness of interventions, support early PD diagnosis, progress monitoring, and PD risk prediction, and evaluate the efficiency of control mechanisms. Larger-scale information helps to provide the bigger picture of PD and to measure therapy effectiveness. These sources of information underpin the envisioned patient care and treatment systems.

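In this work, the fusion of MFCC and WPT with the HMM classifier shows the best overall performance, with the highest accuracy of all configurations (95.16%) and the highest sensitivity among the HMM results (93.55%), as indicated in Table 2, when compared with the SVM classifier results listed in Table 3. Its specificity (91.67%) is, however, lower than that of MFCC with HMM (96.77%) and WPT with HMM (99.99%).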

For future work, we can focus on achieving higher efficiency and on using other feature extraction and classification methods for voice recordings made in noisy (natural) as well as controlled environments. The aim is to produce highly accessible, low-cost PD diagnosis and monitoring tools.

Footnotes

Conflict of interest

None to report.

References

1. Bocklet, Nöth, Stemmer, Ruzickova, Rusz. Detection of persons with Parkinson’s disease by acoustic, vocal, and prosodic analysis, in 2011 IEEE Workshop on Automatic Speech Recognition and Understanding, ASRU 2011, Proceedings, 2011.

2. Tsanas, Little, McSharry, Spielman, Ramig. Novel speech signal processing algorithms for high-accuracy classification of Parkinsons disease. IEEE Trans. Biomed. Eng. 2012; 59(5): 1264-1271.

3. Harel, Cannizzaro, Snyder. Variability in fundamental frequency during speech in prodromal and incipient Parkinson’s disease: A longitudinal case study. Brain Cogn. 2004; 56(1): 24-29.

4. Maetzler, Klucken, Horne. A clinical view on the development of technology-based tools in managing Parkinson’s disease. Mov. Disord. 2016; 31(9): 1263-1271.

5. Farouk. Clinical Diagnosis and Assessment of Speech Pathology, in Application of Wavelets in Speech Processing, 2018, pp. 77-80.

6. Birkholz, Martin, Scherbaum, Neuschaefer-Rube. Manipulation of the prosodic features of vocal tract length, nasality and articulatory precision using articulatory synthesis, Comput Speech Lang, 2017.

7. Meier, Borsky, Magnusdottir, Johannsdottir, Gudnason. Vocal tract and voice source features for monitoring cognitive workload, in 7th IEEE International Conference on Cognitive Infocommunications, CogInfoCom 2016 – Proceedings, 2017.

8. Koolagudi, Rao. Emotion recognition from speech: A review, Int. J. Speech Technol. 2012; 15(2): 99-117.

9. Benba, Jilbab, Hammouch. Discriminating between patients with Parkinson’s and neurological diseases using cepstral analysis, IEEE Trans. Neural Syst. Rehabil. Eng., 2016.

10. Chandrayan, Agarwal, Arif, Sahu. Selection of dominant voice features for accurate detection of Parkinson’s disease, Proc. 3rd Int. Conf. Biosignals, Images Instrumentation, ICBSII 2017, no. March, 2017, pp. 16-18.

11. Cummins, Scherer, Krajewski, Schnieder, Epps, Quatieri. A review of depression and suicide risk assessment using speech analysis. Speech Commun. 2015; 71: 10-49.

12. Jankovic. Parkinson’s disease: clinical features and diagnosis, J. Neurol. Neurosurg. Psychiatry, 2008; 79(4): 368-376.

13. Goetz et al., Movement Disorder Society-Sponsored Revision of the Unified Parkinson’s Disease Rating Scale (MDS-UPDRS): Scale presentation and clinimetric testing results. Mov. Disord. 2008; 23(15): 2129-2170.

14. Bazazeh, Shubair, Malik. Biomarker discovery and validation for Parkinson’s disease: A machine learning approach. Proc IEEE Int. Conf. Bio-engineering Smart Technol, 2016.

15. Wang, Hoekstra, Zuo, Cook, Zhang. Biomarkers of Parkinson’s disease: Current status and future perspectives. Drug Discov. Today, 2013.

16. Singh, Samavedham. Unsupervised learning based feature extraction for differential diagnosis of neurodegenerative diseases: A case study on early-stage diagnosis of Parkinson disease, J. Neurosci. Methods. 2015; 256: 30-40.

17. Daubechies. The continuous wavelet transform, Ten Lect Wavelets. 1992; 15(C): 17-52.

18. Farouk. Spectral analysis of speech signal and pitch estimation, in Application of Wavelets in Speech Processing, Springer, 2018, pp. 23-28.

19. Farouk. Speech quality assessment, in Application of Wavelets in Speech Processing, Springer, 2018, pp. 501-506.

20. Rufiner, Nacional, Nos, Goddard, Elcktrica. A method of wavelet selection in phone recognition, Computer Standards & Interfaces. 1997; 20(6-7): 889-891.

21. Farouk. Speech recognition, in Application of Wavelets in Speech Processing, 2018, pp. 41-46.

22. Zhao, Wang. Voice activity detection based on distance entropy in noisy environment, 2009 Fifth Int. Jt. Conf. INC, IMS IDC, 2009; 1: 1364-1367.

23. Cao, Guan, Gao. Voice activity detection algorithm based on entropy in noisy environment, in 2016 Chinese Control and Decision Conference (CCDC), 2016, pp. 3799-3803.

24. Al-Ali AKH, Dean, Senadji, Baktashmotlagh, Chandran. Speaker Verification with Multi-Run ICA Based Speech Enhancement, 2017.

25. Al-Ali AKH, Dean, Senadji, Chandran, Naik. Enhanced Forensic Speaker Verification Using a Combination of DWT and MFCC Feature Warping in the Presence of Noise and Reverberation Conditions, IEEE Access. 2017; 5: 15400-15413.

26. Sakar et al., Collection and analysis of a Parkinson speech dataset with multiple types of sound recordings, IEEE J. Biomed. Heal. Informatics. 2013; 17(4): 828-834.
