Abstract
BACKGROUND:
Parkinson’s disease (PD) is a progressive neurological disorder. To provide customized patient care, diagnosis, and monitoring through smart gadgets such as smartphones and smartwatches, a system is needed that works in natural as well as controlled environments.
OBJECTIVE AND METHODS:
The primary purpose is to record a speech signal and identify whether it indicates Parkinson’s disease. For this work, three feature extraction methods were compared: Wavelet Packet Transform (WPT), Mel-Frequency Cepstral Coefficients (MFCC), and a fusion of MFCC and WPT. In addition to feature extraction, two classifiers were used: the Hidden Markov Model (HMM) and the Support Vector Machine (SVM).
RESULTS:
In this study, the fusion of MFCC and WPT with the HMM classifier shows the best performance parameters.
CONCLUSION:
This paper describes the best-performing results among the three feature extraction methods and the two classifiers.
Introduction
Research has indicated that 70% to 90% of patients with Parkinson’s disease (PD) show speech disorders, and that changes in voice and prosody are cited among the earliest prodromal signs of PD [1, 2, 3]. In these stages, PD signs may range from severe to mild, to the extent of being hardly noticeable. PD diagnosis therefore remains a considerable challenge, especially in the earlier stages [4]. Clinical features of pathological speech include: obsessive discourse, lack of resonance in speech, simultaneous varying speech tones, short phrases, pitch breaks, monotone speech, reduced loudness, uncontrollable speech loudness, unconditioned breaks, low pitch, reduced speech rate, spikes in speech and reduced stress, decay, unending involuntary facial twitching, infrequent nasal regurgitation, swallowing difficulties, and involuntary drooling [5]. Speech disorders can be classified according to prosodic features, excitation source features and vocal tract features [2, 3, 5, 8]. A confirmed diagnosis has depended heavily on the ability of the physician, who judges on the basis of the subject’s history and his/her observable signs and symptoms. In this respect, technology-based systems offer the potential for improved health care, mainly for the underprivileged and in instances where there are critical shortages of PD specialists [4]. Validated methods and target-directed data collected with technology-based tools in the home setting open prospects for effective patient management. Furthermore, the prolonged use of technology-based tools allows direct data access and easier interpretation over time. This empowers the patient through education, compliance, custom measures in emergency cases and, most importantly, real-time or near real-time patient monitoring [4].
Some speech-related diseases such as PD and Alzheimer’s, apart from causing speech impairment, are also associated with an aging population [4, 6, 9, 10]. A significant body of completed and ongoing research addresses the use of biomarkers in detecting, evaluating and monitoring neurodegenerative diseases (ND) [2]. With this in mind, it can be argued that speech can be used to distinguish between pathological and non-pathological samples [5], since it is regarded as one of the early biomarkers for PD [3, 11, 12, 13, 14, 15, 16]. Continuous wavelet transforms are used as a signal processing tool for investigating the time-varying frequency spectrum characteristics of transitory signals [17]. Wavelet packets introduce signal filtering that splits the speech signal into two bands, i.e. high and low frequencies [18]. A concise spectral analysis is achievable by dividing the speech signal into individual time and frequency spectral fields using wavelets [18]. Wavelets break down a composite signal into multiple cascaded signals of varying resolutions with respect to the frequency and time domains. A Wavelet Tree provides a spectral estimation for the critical bands audible to the human ear [18, 19]. With the rapid development of smart technologies and societal progress, efficient early diagnosis of pathological conditions is needed for many diseases, e.g. high blood pressure, heart failure, Alzheimer’s, cancer, etc. The adoption of smart technologies aids early disease diagnosis, given the interaction between persons and gadgets like smartphones, watches, and computers in the continuous monitoring of pathological conditions.
Methods
In the proposed methodology, we studied the MFCC (Mel-Frequency Cepstral Coefficient), DWT (Discrete Wavelet Transform), and a fusion of MFCC and DWT features.
System block diagram.
Wavelet analysis is regarded as one of the most reliable methods for spectral feature extraction from non-stationary signals, since it provides multi-resolution measures in the frequency and time domains [21]. At high frequencies, smaller window sizes are used, on the assumption that abrupt dynamic changes in the signal of interest occur there; at lower frequencies, larger window sizes are used, on the assumption that the speech signal is less dynamic and every necessary feature can still be captured [20]. Wavelets are the preferable choice when working with non-stationary signals [22, 23]. The Wavelet Packet Transform (WPT) overcomes some of the limitations faced by other features, as it can also analyze speech signals directly through a customized wavelet packet structure matched to the critical bands defined by psychoacoustics [21]. WPT is an algorithm that meets several requirements in the signal-processing field, giving an analysis of the signal in both the frequency domain and the time domain [15]. A Multi-Resolution Wavelet Analysis provides frequency and time data through spectral parameter variation. A wavelet is also a set of signal basis functions used to obtain sub-band localization [22]. The Continuous Wavelet Transform of a signal $x(t)$ is, in its standard form,

$$W(a, b) = \frac{1}{\sqrt{|a|}} \int_{-\infty}^{\infty} x(t)\, \psi^{*}\!\left(\frac{t - b}{a}\right) dt \qquad (1)$$

where $a$ is the scale parameter, $b$ is the translation (time shift), $\psi$ is the mother wavelet and $\psi^{*}$ denotes its complex conjugate.
Unlike other peak detection algorithms, the continuous wavelet transform isolates peaks based on amplitude, disregarding less significant spectrum peaks; this makes it a powerful tool for identifying and separating a signal of interest from spikes and other noise. The introduction of the wavelet-based transform improves the classifier performance [21].
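The peak-isolating behaviour of the continuous wavelet transform described above can be illustrated with a minimal sketch. It assumes a Mexican-hat (Ricker) mother wavelet, a single scale and a synthetic bump as input; none of these choices is taken from the paper, and the function names are hypothetical.

```python
import math

def ricker(t, a):
    """Mexican-hat (Ricker) wavelet at offset t and scale a (assumed wavelet)."""
    x = t / a
    norm = 2.0 / (math.sqrt(3.0 * a) * math.pi ** 0.25)
    return norm * (1.0 - x * x) * math.exp(-0.5 * x * x)

def cwt(signal, scales):
    """Discrete approximation of the CWT: coeffs[i][b] for scale scales[i]."""
    n = len(signal)
    coeffs = []
    for a in scales:
        row = [sum(signal[k] * ricker(k - b, a) for k in range(n))
               for b in range(n)]
        coeffs.append(row)
    return coeffs

# Synthetic test signal: one smooth bump centred on sample 20 (illustrative only).
sig = [math.exp(-0.5 * ((k - 20) / 3.0) ** 2) for k in range(41)]
row = cwt(sig, [3.0])[0]
# The largest coefficient marks the location of the dominant peak.
peak_index = max(range(len(row)), key=row.__getitem__)
```

In practice several scales would be scanned and only peaks that persist across scales would be kept, which is what suppresses narrow spikes.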
The data sets used were obtained from the University of California, Irvine (UCI) Machine Learning Repository. The training set covers 40 people, providing 20 pathological and 20 non-pathological sets of voice samples for the training phase. Multiple voice samples were obtained from the 40 subjects, consisting of 26 voice samples of sustained vowels, numbers, words, and short sentences. The test phase follows the training phase; for the test set, voice samples obtained from [26] were used. The test samples consisted of 28 patients (diseased and non-diseased). The voice test set was obtained by having the subjects articulate the vowels “a”, “o” and “u”. The recording frequency range was between 50 Hz and 13 kHz [26].
Feature extraction and classification
Several feature extraction algorithms are already in use. For speech, these include the Wavelet Packet Transform (WPT), Linear Predictive Coding (LPC), the Short-Time Fourier Transform (STFT), the Pulse Coupled Neural Network (PCNN), and Mel-Frequency Cepstral Coefficients (MFCC). Wavelet theory is based on Multi-Resolution Analysis (MRA), which embeds signal filtering at different frequencies [18].
Discrete Wavelet Packet Transforms (DWPT)
The feature extraction approach is based on the multi-resolution properties of the WPT, generally with successive separations of high and low frequencies. The multi-resolution wavelet analysis provides frequency as well as time information by varying the resolution across a wider spectrum. The decomposed signal consists of wavelets that enable sub-band localization. The WPT can be used even in degraded environments [23]. Wavelet packets remove noise from speech signals through decomposition, from which energy and entropy characteristics are obtained. Energy varies with the level of packet decomposition and is calculated as shown in Eq. (2):
The calculation of the total energy (
Wavelet tree.
Where
Analyzed signal F0.
F0 de-noised signal.
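The wavelet packet energy and entropy features described in this section can be sketched as follows. This is an illustration under stated assumptions, not the paper's implementation: it uses the Haar wavelet, a full two-level packet tree, sub-band energy defined as the sum of squared coefficients, and Shannon entropy of the normalized energies. The function names are hypothetical, and the leaf sub-bands are returned in natural (filter-bank) order rather than strict frequency order.

```python
import math

def haar_split(x):
    """One Haar analysis step: (approximation, detail) for an even-length list."""
    s = math.sqrt(2.0)
    approx = [(x[i] + x[i + 1]) / s for i in range(0, len(x) - 1, 2)]
    detail = [(x[i] - x[i + 1]) / s for i in range(0, len(x) - 1, 2)]
    return approx, detail

def wavelet_packet(x, levels):
    """Full wavelet packet tree: every node is split again at every level."""
    nodes = [list(x)]
    for _ in range(levels):
        next_nodes = []
        for node in nodes:
            approx, detail = haar_split(node)
            next_nodes.extend([approx, detail])
        nodes = next_nodes
    return nodes  # 2**levels leaf sub-bands

def subband_energies(nodes):
    """Energy of each sub-band as the sum of squared coefficients."""
    return [sum(c * c for c in node) for node in nodes]

def energy_entropy(energies):
    """Shannon entropy of the energy distribution over the sub-bands."""
    total = sum(energies)
    if total == 0:
        return 0.0
    probs = [e / total for e in energies if e > 0]
    return -sum(p * math.log(p) for p in probs)

# Illustrative input; a real system would pass one speech frame here.
frame = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
energies = subband_energies(wavelet_packet(frame, levels=2))
total_energy = sum(energies)      # equals the energy of the input frame
entropy = energy_entropy(energies)
```

Because the Haar transform is orthonormal, the sub-band energies add up to the energy of the input frame, which gives a quick sanity check on any decomposition depth.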
MFCC is used to extract compact vectors representing stationary parameters of speech signals. To obtain the frequency spectrum, the Discrete Fourier Transform is used to transform the speech signal into the frequency domain. MFCCs alone are considered, being noise insensitive. The combined features of WPT-based feature-warped MFCCs and feature-warped MFCCs of the enhanced speech signals are used for the feature extraction, as shown in Fig. 1. The MFCCs are computed over Hamming-windowed frames of the enhanced speech signals with a 30 ms size and 10 ms overlap. The MFCC features are obtained using 32-channel Mel filter banks, followed by a transformation to the cepstral domain, keeping 13 coefficients. The first- and second-order derivatives of the cepstral coefficients are then appended to the MFCC features. Feature warping with a 301-frame window is applied to the obtained MFCC features. For the WPT branch, the frame of the enhanced speech signal is decomposed into two frequency sub-bands: approximation coefficients (low-frequency sub-band) and detail coefficients (high-frequency sub-band). The two sets of coefficients are then concatenated into one vector. The decomposition process can be repeated by applying the WPT to the low-frequency sub-band. In order to gather all the essential characteristics of the vocal tract, feature-warped MFCCs are computed from the concatenated WPT vector. Finally, the WPT and feature-warping approaches are combined by joining the feature-warped MFCCs from the spectrum of the enhanced speech signal with the feature-warped MFCCs extracted from the WPT in a single feature vector [24].

Speech recordings for pathological purposes are often carried out in controlled, noise-free environments. Such a setup is often expensive, and when patient-customized telehealth solutions are provided instead, various types of environmental noise are often present. This, in turn, corrupts the speech traces provided to the system [5]. As a result, the performance of custom-made PD diagnostic systems tends to diminish significantly in uncontrolled natural environments because of the high noise levels. The performance of MFCC features in particular degrades substantially in the presence of noise and reverberation. Multiband feature methods depend on joining MFCC features of the noisy speech signals and MFCCs extracted from the Wavelet Packet Transform (WPT) into a single feature vector [25]. The WPT can be used to extract more features from the low-frequency sub-bands. These features add several important characteristics to those of the full-spectrum MFCCs.
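The MFCC front end described above (Hamming-windowed 30 ms frames, a 32-channel Mel filter bank, 13 cepstral coefficients, and appended first- and second-order differences) can be sketched in plain Python. Several details are assumptions rather than the paper's settings: the hop is taken as frame length minus the stated 10 ms overlap, the Mel scale uses the common 2595·log10(1 + f/700) formula, the DFT is a naive one, the DCT is unnormalized, and the deltas are simple central differences. Feature warping is not included.

```python
import cmath
import math

def hz_to_mel(f):
    return 2595.0 * math.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def power_spectrum(frame):
    """Hamming-window one frame and return its one-sided power spectrum."""
    n = len(frame)
    x = [s * (0.54 - 0.46 * math.cos(2 * math.pi * k / (n - 1)))
         for k, s in enumerate(frame)]
    return [abs(sum(x[k] * cmath.exp(-2j * math.pi * b * k / n)
                    for k in range(n))) ** 2 / n
            for b in range(n // 2 + 1)]

def mel_filterbank(n_filters, frame_len, fs):
    """Triangular filters with centres equally spaced on the Mel scale."""
    top = hz_to_mel(fs / 2.0)
    edges = [mel_to_hz(top * i / (n_filters + 1)) for i in range(n_filters + 2)]
    bank = []
    for m in range(1, n_filters + 1):
        lo, mid, hi = edges[m - 1], edges[m], edges[m + 1]
        row = []
        for b in range(frame_len // 2 + 1):
            f = b * fs / frame_len
            if lo < f <= mid:
                row.append((f - lo) / (mid - lo))
            elif mid < f < hi:
                row.append((hi - f) / (hi - mid))
            else:
                row.append(0.0)
        bank.append(row)
    return bank

def mfcc(signal, fs, frame_ms=30, overlap_ms=10, n_filters=32, n_ceps=13):
    frame_len = int(fs * frame_ms / 1000)
    hop = frame_len - int(fs * overlap_ms / 1000)  # assumed reading of "overlap"
    bank = mel_filterbank(n_filters, frame_len, fs)
    feats = []
    for start in range(0, len(signal) - frame_len + 1, hop):
        spec = power_spectrum(signal[start:start + frame_len])
        log_e = [math.log(sum(w * p for w, p in zip(row, spec)) + 1e-10)
                 for row in bank]
        # DCT-II of the log filter-bank energies, keeping the first n_ceps terms.
        feats.append([sum(e * math.cos(math.pi * c * (j + 0.5) / n_filters)
                          for j, e in enumerate(log_e))
                      for c in range(n_ceps)])
    return feats

def delta(feats):
    """Central difference across frames; edge frames reuse their neighbour."""
    n = len(feats)
    return [[(feats[min(t + 1, n - 1)][c] - feats[max(t - 1, 0)][c]) / 2.0
             for c in range(len(feats[0]))] for t in range(n)]

# Illustrative input: 0.1 s of a 440 Hz tone sampled at 8 kHz.
fs = 8000
tone = [math.sin(2 * math.pi * 440 * k / fs) for k in range(800)]
static = mfcc(tone, fs)
d1 = delta(static)
d2 = delta(d1)
full = [s + a + b for s, a, b in zip(static, d1, d2)]  # 39 values per frame
```

The fused feature described in the text would concatenate such a vector with the MFCCs computed from the WPT coefficients of the same frame.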
Comparison of selected MFCC feature extraction tools and the mean square error values
Step-by-step MFCC output.
Figure 5 shows a stepwise explanation of the MFCC computation. Figure 6 shows the calculation of the Mel-frequency cepstral coefficients from a speech signal.
Figure 6 compares the cepstral features extracted using the MFCC routine against those extracted using the HTK and MELFCC tools. HTK file format input/output is provided by the included sample HTKREAD_LITE and HTKWRITE_LITE routines. Further functionality can be achieved by installing the VOICEBOX toolbox. MELFCC calculates PLP (Perceptual Linear Prediction) and MFCC features from sound waveforms. PLP warps the spectra to reduce speaker diversity while preserving important speech information. The MSEs (mean square errors) are compared, as shown in Table 1.
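The mean square error used for this comparison can be computed as in the short sketch below. It assumes that the two tools produce feature matrices of identical shape (frames by coefficients); the function name and the sample values are hypothetical.

```python
def mean_square_error(a, b):
    """Mean of the squared element-wise differences of two feature matrices."""
    if len(a) != len(b) or any(len(r) != len(s) for r, s in zip(a, b)):
        raise ValueError("feature matrices must have the same shape")
    count = sum(len(r) for r in a)
    return sum((x - y) ** 2 for r, s in zip(a, b) for x, y in zip(r, s)) / count

# Two frames of two coefficients each (made-up numbers).
reference = [[1.0, 2.0], [3.0, 4.0]]
candidate = [[1.0, 2.0], [3.0, 6.0]]
mse = mean_square_error(reference, candidate)
```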
Mel-Frequency spectrum: HTK, MELFCC, THIS.
Parameters such as jitter, shimmer, harmonic-to-noise ratio, pitch period entropy (PPE), mean, median and energy are calculated, together with the accuracy, sensitivity and specificity, which are based on the following counts:
TP (true positive) – No. of PD patients accurately classified as PD
TN (true negative) – No. of non-PD patients accurately classified as non-PD
FN (false negative) – No. of PD patients inaccurately classified as non-PD
FP (false positive) – No. of non-PD patients inaccurately classified as PD [25].
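Given these four counts, the performance parameters follow from their standard definitions: sensitivity = TP / (TP + FN), specificity = TN / (TN + FP), and accuracy = (TP + TN) / (TP + TN + FP + FN). A minimal sketch, with a hypothetical function name and made-up counts that are not results from this study:

```python
def classification_metrics(tp, tn, fp, fn):
    """Sensitivity, specificity and accuracy from confusion-matrix counts."""
    if tp + fn == 0 or tn + fp == 0:
        raise ValueError("need at least one PD and one non-PD sample")
    return {
        "sensitivity": tp / (tp + fn),   # share of PD cases detected
        "specificity": tn / (tn + fp),   # share of non-PD cases cleared
        "accuracy": (tp + tn) / (tp + tn + fp + fn),
    }

# Illustrative counts only.
metrics = classification_metrics(tp=18, tn=16, fp=4, fn=2)
```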
Evaluation of the HMM classifier for various feature extraction methods
Evaluation of the SVM classifier for various feature extraction methods
Identification of normal and pathological voice signal.
Two classifiers are used, i.e. HMM and SVM, and their performance parameters (sensitivity, accuracy and specificity) are compared.
Training: Training speech samples are used in the training phases for both classifiers (HMM and SVM) and a classification model is developed.
Testing: Test speech samples are used in this process. The classifier (either HMM or SVM) assigns each sample to a category, normal or abnormal, using the model obtained through the training process.
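The train-then-test workflow can be illustrated for the SVM case with a small self-contained sketch. It is a stand-in, not the classifier used in the study: it assumes a linear SVM trained by sub-gradient descent on the regularized hinge loss, with hypothetical one-dimensional feature values and labels of -1 (normal) and +1 (PD). The kernel, solver and features of the actual system are not specified in the text.

```python
def train_linear_svm(X, y, lam=0.01, eta=0.1, epochs=50):
    """Sub-gradient descent on the L2-regularized hinge loss; y in {-1, +1}."""
    w = [0.0] * len(X[0])
    b = 0.0
    for _ in range(epochs):
        for xi, yi in zip(X, y):
            margin = yi * (sum(wj * xj for wj, xj in zip(w, xi)) + b)
            w = [wj * (1.0 - eta * lam) for wj in w]  # regularization shrink
            if margin < 1.0:                          # sample violates the margin
                w = [wj + eta * yi * xj for wj, xj in zip(w, xi)]
                b += eta * yi
    return w, b

def predict(w, b, x):
    """Return +1 (PD) or -1 (normal) for one feature vector."""
    score = sum(wj * xj for wj, xj in zip(w, x)) + b
    return 1 if score >= 0 else -1

# Training phase: hypothetical feature vectors and labels.
train_X = [[-2.0], [-1.0], [1.0], [2.0]]
train_y = [-1, -1, 1, 1]
w, b = train_linear_svm(train_X, train_y)

# Testing phase: classify unseen samples with the trained model.
test_predictions = [predict(w, b, x) for x in [[-3.0], [3.0]]]
```

An HMM-based classifier would follow the same two phases, training one model per class and assigning a test sample to the class whose model gives the higher likelihood.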
Results and discussions
The GUI reports whether the input speech signal is a “normal” or a “Parkinson detected voice signal”. The features are extracted using MFCC, DWT, and the fusion of MFCC and DWT; the HMM (Hidden Markov Model) classifier is then applied, and the efficiency, sensitivity and specificity are calculated for each of the three feature extraction techniques.
Conclusions
The future points towards real-time smart diagnosis and monitoring of the pathological and non-pathological clinical features of PD patients. It is possible to envision a future in which fine-grained data are applied to improve diagnostics, measure the effectiveness of interventions, support early PD diagnosis, progress monitoring and PD risk prediction, and evaluate the efficiency of control mechanisms. Larger-scale data are useful for providing the bigger picture of PD and for measuring therapy effectiveness. These data sources underpin the envisioned patient care and treatment systems.
In this work, the fusion of MFCC and WPT with the HMM classifier shows the best performance in terms of accuracy, sensitivity and specificity, as indicated in Table 2, when compared with the SVM classifier results listed in Table 3.
Future work can focus on achieving higher efficiency and on using other feature extraction and classification methods for voice recordings made in noisy (natural) as well as controlled environments. The aim is to produce highly accessible, low-cost PD diagnosis and monitoring tools.
Footnotes
Conflict of interest
None to report.
