Abstract
Speech deformation originating in the glottis and vocal tract is an early biomarker for Parkinson’s disease. This paper proposes a novel approach in which Line Spectral Frequency trajectory spectrum images of the subjects’ speech signals are classified by a Deep Convolution Neural Network; the convolution layers learn the features automatically from the input images, so no separate feature calculation stage is required. The human vocal tract that produces a short phonetic unit is modelled as an all-pole Infinite Impulse Response system, and the Line Spectral Frequency trajectory spectrum images represent the poles of this system and reflect the voice defects due to Parkinson’s disease. The proposed method is shown to outperform existing state-of-the-art work on two utterance tasks, sustained phonation and natural running speech. The Deep Convolution Neural Network reaches a training accuracy of 92.5% on the sustained phonation dataset and 99.18% on the King’s College running speech dataset, and the validation accuracy is 100% for both datasets. The proposed work also performs much better than a recent benchmark in which Mel Frequency Cepstral Coefficient parameters are used with machine learning for Parkinson’s disease detection in running speech. The high performance on the King’s College running speech dataset is notable because that dataset was collected through mobile device voice recordings. A rigorous performance analysis on the running speech dataset, using a separate isolated test set over 50 repeated trials, gives an F1 score of 99.37%, sensitivity of 100%, precision of 98.75% and specificity of 99.27%.
Keywords
Introduction
Parkinson’s disease (PD) is a progressive neurological disease with gradual onset of symptoms. Speech analysis for early detection of PD has been demonstrated in several works [9, 20, 25, 31, 51, 56]: [9] discusses using vocal features for PD tele-diagnosis, [25] combines voice analysis with rapid eye movement to classify PD, and [20, 31, 51, 56] apply voice classification methods based on machine learning algorithms to classify PD. Speech is an important biomarker for PD because PD affects human speech at a very early stage [18, 25], and tremor in the speech could be the first symptom of PD [43]. PD can affect speech in a variety of ways, including soundlessness, rapid utterance, monotone speech, slurred words and breathy sentence endings [27]. Speech analysis is therefore an important tool for detecting PD at an early stage. Recent research uses machine learning methods to classify PD by observing and learning certain parameters in human subjects [18, 23, 41, 47, 52, 53].
A recent robust approach to classifying PD patients and healthy controls (HC) is to use prosodic and statistical features of speech in a machine learning classification framework [2, 3, 8, 16, 17, 20, 24, 32, 34, 39]: [2] uses the Oxford PD speech dataset in a deep neural network (DNN), [3] demonstrates PD classification from natural speech, [8] uses a genetic algorithm to select speech features for PD classification, [16] extracts 30 features from the data provided by the UCI machine learning repository for PD classification, [17] presents a comparative study of existing machine learning approaches for PD classification, [20] analyzes PD classification for English and German, [24] conducts a performance study of PD classification through ensemble machine learning, [32] applies dimension reduction to PD data using linear discriminant analysis, [34] uses acoustic measures of speech for PD classification, and [39] reviews PD diagnosis using machine learning. This feature-based approach has some demerits: (i) estimating the features is computationally intensive [31], (ii) selecting the right features is a challenging task [19], and (iii) the feature-based method introduces bias [40]. Representing the speech as an image and feeding it to a Deep Convolution Neural Network (DCNN) removes the need for a feature estimation stage, because a Convolution Neural Network (CNN) learns the features automatically from the input images. The simplest choice is to use the Short Time Fourier Transform (STFT) representation of the speech as the input image for the DCNN. The analysis of voice tremor in PD contributes to a better understanding of phonatory dysfunction and its relationship to disease symptoms [43]. The rate of tremor and the periodicity of amplitude tremor are the two most useful acoustic measures for distinguishing PD from HC. Tremor in the human voice increases in magnitude as the time since PD diagnosis increases.
The rate of amplitude tremor decreases as the severity of PD motor symptoms increases. The oscillatory movement in the vocal tract appears to influence the amplitude of the cycle more than the length of the cycle in this group of people with PD [43]. Analyzing the vocal tract system in the form of a spectrum image may therefore lead to a more practical way of training the DCNN, in which spectrums that represent the vocal tract system are used to train the model. Linear Prediction Coefficient (LPC) based spectrums can represent the variations of the vocal tract: the LPC vector of one speech frame is arranged as a column of the spectrum image, and the LPC vectors of consecutive speech frames together form a complete spectrum image that represents the speech. The LPC values can be both positive and negative, and this must be handled properly when the image is formed.
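The column-wise image formation described above can be sketched as follows. This is a minimal illustration rather than the authors' implementation: the autocorrelation method with a Levinson-Durbin recursion, the frame length, hop and order, and the global min-max mapping of signed coefficients to 0..255 are all assumptions chosen for the example, and the helper names `lpc_autocorr` and `lpc_image` are hypothetical.

```python
import math
import random


def lpc_autocorr(frame, order):
    """Estimate LPC with the autocorrelation method (Levinson-Durbin).

    Returns [a_1, ..., a_order] for A(z) = 1 + sum_k a_k z^-k.
    """
    n = len(frame)
    r = [sum(frame[i] * frame[i + lag] for i in range(n - lag))
         for lag in range(order + 1)]
    a = [1.0] + [0.0] * order
    err = r[0]
    for i in range(1, order + 1):
        if err <= 0.0:  # silent or degenerate frame: stop updating
            break
        acc = r[i] + sum(a[j] * r[i - j] for j in range(1, i))
        k = -acc / err  # reflection coefficient
        prev = a[:]
        for j in range(1, i):
            a[j] = prev[j] + k * prev[i - j]
        a[i] = k
        err *= 1.0 - k * k
    return a[1:]


def lpc_image(signal, frame_len, hop, order):
    """Stack per-frame LPC vectors as columns and map them to 0..255."""
    columns = [lpc_autocorr(signal[s:s + frame_len], order)
               for s in range(0, len(signal) - frame_len + 1, hop)]
    lo = min(min(c) for c in columns)
    hi = max(max(c) for c in columns)
    span = (hi - lo) or 1.0
    # Row k holds coefficient a_(k+1); column t holds frame t.
    return [[round(255 * (c[k] - lo) / span) for c in columns]
            for k in range(order)]


# Synthetic stand-in for a speech recording (two tones plus light noise).
rng = random.Random(0)
signal = [math.sin(0.3 * n) + 0.5 * math.sin(1.1 * n) + 0.1 * rng.gauss(0, 1)
          for n in range(800)]
image = lpc_image(signal, frame_len=200, hop=100, order=8)
print(len(image), "rows x", len(image[0]), "columns")
```

Min-max scaling is only one way to handle the negative coefficients; a fixed offset or a signed colour map would serve the same purpose.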
The LPC vector obtained for a speech frame that contains a phonetic utterance represents the parameters of the system that produces the speech. This system corresponds to the vocal tract of the person uttering the speech. Thus, the LPC based spectrum of the speech should serve as a suitable image form of the speech for classifying it as PD or HC. A Line Spectral Frequency (LSF) trajectory can be derived from the LPC and used in the spectrum in place of the LPC. Plotting the LSF trajectories of speech frames reveals patterns that are more distinct and therefore better suited to PD classification. The LSF values are positive, so the image representation of the speech follows the vocal tract variations more naturally. The LSF are derived from the system coefficients (LPC) and basically represent the poles of the vocal system, so any deformation in the vocal tract is reflected in the LSF. Moreover, LSF has been applied successfully in the literature for speech compression and speech/speaker recognition [7, 22, 30, 44, 46, 48]. This paper brings out the merits of the LSF based spectrum representation of speech for accurate PD classification and presents a detailed analysis showing that the LSF trajectories discriminate PD and HC efficiently.
To carry out speech analysis, the subjects under test have to perform utterance tasks. A few of the popular utterance tasks used in PD classification are: (a) the subjects repeat the syllables /pa/-/ta/-/ka/ rapidly and steadily, as constantly as possible, for at least 5 seconds in one breath; (b) the subjects read a phonetically unbalanced text of more than 130 words and sustain the vowels /a/, /i/ and /u/ for approximately 5 seconds each at a comfortable pitch and loudness in one breath; (c) the subjects deliver a monologue of about 90 seconds on what they did today or last week, or on their interests, jobs or family; (d) the subjects read a text in a comfortable voice, rendering an emotionally neutral sentence with specific emotions such as excitement, sadness, confusion, fear, boredom, anger, bitterness, disappointment, wonder and enjoyment; (e) the subjects read a text in a rhythmic manner.
The sustained phonation task is an ideal method because it involves the utterance of a single vowel, so the signal corresponds to a voiced segment. Sustained phonation is a component of the vocal capability battery in which the patient is asked to maintain a sung tone as consistently as possible in order to detect tremor or other types of vocal instability. The maximum phonation time test determines how long an individual can sustain a sung tone after fully filling the lungs. It is a simple and easy-to-implement glottic efficiency test that requires only a timer and an audio recorder. Several works [9, 10, 32] have demonstrated PD classification using sustained phonation datasets.
In this paper, an experiment is conducted to train a DCNN using LSF spectrum images of the sustained phonation dataset, and it results in a training accuracy of 99.18% and a validation accuracy of 100%. On the other hand, classifying PD from running speech, in which the subjects speak naturally, is also gaining importance and priority. PD classification from running speech is less clinical and more natural, and several works [3, 11, 49] have demonstrated it on running speech datasets. A recent work [11] reported a classification model with a maximum area under the receiver operating characteristic (ROC) curve (AUC) of 93% using Mel Frequency Cepstral Coefficient (MFCC) parameters in the classification algorithm. Running speech has two merits over sustained phonation for analyzing PD in speech: (i) it is less clinical and more natural, and (ii) natural speech can be much longer than sustained phonation, so more samples can be used in training, which helps the model generalize better. One demerit of running speech is that it is not as ideal as sustained phonation: sustained phonation consists of purely voiced segments, whereas running speech is a mixture of voiced, unvoiced and silent segments. This affects the classification accuracy of machine learning models trained on running speech, which tends to be lower than that obtained on sustained phonation datasets [28]. In this paper, the LSF trajectory images, which reflect the vocal cord parameters, are learned by a DCNN to give high training and validation accuracy.
In this paper, a numerical experiment is conducted to implement and analyze a DCNN based PD classification framework using LSF spectrum images on the King’s College running speech dataset [21], and a validation accuracy of 100% is achieved. The King’s College Parkinson speech dataset was collected through voice recordings on a mobile device and involves two utterance tasks, a spontaneous dialogue and a read text. It is therefore a good representative of practical natural speech, which allows the final model to be deployed in a telemedicine framework. The previously cited work [11] is used as the benchmark for the proposed method. In [11], MFCC features of the running speech are used in two different classifiers and a maximum AUC of 93% is achieved. In this paper, the validation accuracy of the proposed DCNN based PD classification using the LSF based spectrum is 100%, which compares favourably with this benchmark. The classification accuracy of other recent works is also compared with the proposed work, and the proposed work performs better for both the ideal sustained phonation dataset and the running speech dataset; this is attributed to the robust classification capability of the DCNN and to the LSF trajectory spectrum. The 100% validation accuracy is demonstrated with 285 PD images and 450 HC images for training and 109 PD images and 165 HC images for testing. These testing images are also used for fine-tuning during the training phase. To demonstrate the robustness of the proposed algorithm, the test set is split into two sets: one is used to tune the trained model and the other is isolated purely for testing. The isolated set comprises 49 PD images and 85 HC images, and training and testing with the isolated set are repeated for 50 trials to analyze the learning accuracy statistically across trials.
It results in precision of 98.75%, F1 score of 99.37%, sensitivity of 100% and specificity of 99.27%. This result brings out the robustness of the proposed method.
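The four reported metrics follow from the counts of a two-class confusion matrix, with PD taken as the positive class. The sketch below gives the standard definitions and checks that the reported F1 score is consistent with the reported precision and sensitivity. The per-trial counts are hypothetical, since the individual confusion matrices of the 50 trials are not listed here, and the function names are illustrative.

```python
def binary_metrics(tp, fp, tn, fn):
    """Standard metrics from a 2x2 confusion matrix (PD is the positive class)."""
    sensitivity = tp / (tp + fn)   # recall on PD images
    specificity = tn / (tn + fp)   # recall on HC images
    precision = tp / (tp + fp)
    f1 = 2 * precision * sensitivity / (precision + sensitivity)
    return {"sensitivity": sensitivity, "specificity": specificity,
            "precision": precision, "f1": f1}


def f1_from(precision, sensitivity):
    """F1 is the harmonic mean of precision and sensitivity."""
    return 2 * precision * sensitivity / (precision + sensitivity)


# Consistency check on the reported averages: precision 98.75%, sensitivity 100%.
print(round(100 * f1_from(0.9875, 1.0), 2))  # prints 99.37

# Hypothetical single trial on the isolated set (49 PD, 85 HC images):
# every PD image detected, one HC image flagged as PD. Counts are invented.
trial = binary_metrics(tp=49, fp=1, tn=84, fn=0)
print({name: round(100 * value, 2) for name, value in trial.items()})
```

Because the published figures are averages over 50 trials, the F1 score computed from the averaged precision and sensitivity need not match the average of the per-trial F1 scores exactly; here the two agree to two decimal places.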
The points discussed so far are manifold, and the aim of this research, the research questions and the research gap addressed in this paper are stated below. The fundamental question raised in this research is: “Will training a DCNN on the LSF trajectory image form of speech discriminate PD speech from normal speech better?” It is expected to perform better, as LSF reflects the vocal cord that produces the speech. As the DCNN learns the features automatically from the input images by tuning the filter coefficients of the convolution layers, a separate feature extraction stage is not required. Will it perform well in the more practical scenario of natural speech, in the form of a running speech dataset collected from PD and HC subjects through spontaneous dialogue and reading of a given text? Such practical speech is more complex: the King’s College running speech dataset, for example, was not recorded under ideal clinical conditions but was collected through telephone networks. Moreover, the speech contains all types of segments, such as voiced, unvoiced and silent segments. Nevertheless, as the LSF representation reflects the vocal cord, it is expected to capture the variations in the vocal cord irrespective of the type of speech frame. Will the results be comparable with state-of-the-art methods such as machine learning models using the MFCC of PD speech? Unlike the MFCC method, the proposed LSF representation works directly on raw speech and does not require much preprocessing of the speech frames, such as pre-emphasis, windowing, filter bank operations, non-linear logarithmic operations and energy computations [11]. Even if the proposed method only performs as well as the MFCC method, it retains the merit of simpler and more straightforward computation. The following point is identified as the research gap in the previous literature: though the LSF trajectory represents the speech and the vocal cord that produces it, none of the previous works has analyzed it for PD classification.
This is the first attempt to use the LSF representation in image form with a DCNN, training the machine learning model by learning the features automatically in the convolution layers.
The uniqueness and novelty of this paper can be summarized as follows. As already discussed, the oscillatory movement of the vocal tract influences the amplitude of the cycle more than its length in PD subjects. So far, in the previous literature, vocal cord information has not been used extensively with machine learning algorithms for PD classification. In this paper, for the first time, the LPC that represent the vocal cord, modelled as an all-pole Infinite Impulse Response (IIR) filter, are obtained, converted to LSF trajectory images and used in a DCNN for image classification. The other unique contribution of this paper is the use of running speech samples for PD classification with LSF images, with high learning accuracy. This contribution indicates that the proposed method is suitable for practical natural speech in a telemedicine framework for PD classification. The LSF trajectory is a smooth model that represents natural speech very well [35, 54], and this work shows that the LSF trajectory images discriminate the natural speech of PD and normal subjects in the King’s College running speech dataset at almost 100% validation accuracy. Rigorous statistical analysis shows that the model has a very high F1 score of 99.37%, sensitivity of 100%, precision of 98.75% and specificity of 99.27%. As the King’s College dataset was obtained using mobile device voice recordings, the result supports the application of the method in a telemedicine framework. The King’s College dataset is gaining attention in PD voice analysis research [14, 41, 42] and is publicly available. A further contribution of the paper is the analysis in Section 3 of how well the LSF images discriminate PD and HC, carried out by plotting the LSF trajectories.
This paper is organized as follows. Section 2 discusses LSF spectrums for running PD speech classification and is divided into three subsections: (2.1) Significance of LSF representations, (2.2) LSF versus MFCC representations of PD speech in Machine Learning, and (2.3) Other related works. Section 3 introduces the speech production mechanism and LSF. Section 4 discusses the proposed work and is divided into four subsections: (4.1) Sustained phonation speech database, (4.2) Running speech database, (4.3) Speech based spectrum image representation, and (4.4) Deep learning Convolution neural network experiment, which is further divided into (4.4.1) CNN parameter setting for proposed work and (4.4.2) Automatic feature learning of CNN. Section 5 presents the results and discussion. Finally, Section 6 concludes the work.
LSF Spectrums for Running PD Speech classification
As discussed in the introduction, the aim of this research is to show that LSF trajectory images of PD speech can be used in DCNN learning for PD classification with better accuracy. This section discusses the mathematical foundation and the significance of the LSF representation of PD speech, the difference between applying the LSF and MFCC representations of PD speech in machine learning algorithms, and other related work on PD classification from speech using machine learning algorithms.
Significance of LSF representations
Dysfunction of the vocal tract is responsible for the speech degeneration of PD patients, so the LPC, which represent the denominator polynomial of the vocal tract transfer function [37], can be used to classify PD patients. LSF is another set of parameters, derived from the LPC, that represents the poles of the vocal tract transfer function. LSF is used in many areas of speech analysis and speech synthesis because it is a fully reversible representation of the linear prediction coefficients (LPCs). To discuss the mathematical foundation of LSF, let us consider the polynomial function in equation (1).
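For reference, the LSF construction is commonly written in the following way. The sign convention for the prediction polynomial and the symbols $p$ (model order), $P(z)$, $Q(z)$ and $\omega_i$ are assumptions of this sketch and may differ from the notation used for equation (1).

```latex
\begin{align*}
A(z) &= 1 + \sum_{k=1}^{p} a_k z^{-k} \\
P(z) &= A(z) + z^{-(p+1)} A(z^{-1}) \\
Q(z) &= A(z) - z^{-(p+1)} A(z^{-1}) \\
A(z) &= \tfrac{1}{2}\bigl[P(z) + Q(z)\bigr] \\
0 &< \omega_1 < \omega_2 < \dots < \omega_p < \pi
\end{align*}
```

When $A(z)$ has all its roots inside the unit circle, the roots of $P(z)$ and $Q(z)$ lie on the unit circle and interlace. Their angles $\omega_i$ in $(0, \pi)$ are the $p$ line spectral frequencies, which explains why the LSF values are positive and ordered. The fourth line shows why the representation is reversible.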
The problem considered in this research is to classify a subject as PD or HC from naturally uttered speech. This is close to a typical text-independent speaker recognition problem, and a quick survey reveals that LSF is used extensively in speaker recognition to improve accuracy. The key reason why LSF is a better representation of the vocal tract is that it represents the poles of the mathematical model of the vocal tract system, and the behaviour of a system is characterized by the location of its poles.
Sorin Dusan et al. (2007) devised a method for compressing speech using polynomial approximations of the LSF trajectories along the time axis. This compression method is also appropriate for frame-based speech coders, can be used to compress features, and is demonstrated in an excellent speech coder [48]. Pujita Raman et al. (2015) developed an LSF-based speaker verification system and tested it in noisy environments. They investigated the ability of transition, vowel and consonant zones to discriminate between speakers: transition regions are the most speaker-discriminative in high signal-to-noise ratio (SNR) conditions, whereas vowel regions are the most speaker-discriminative in low SNR conditions. The study proposes a speaker verification system that combines information from static and dynamic LSF-based classifiers at the score level and, during the verification phase, scores using a combination of the vowel and transition zones of speech [44]. Kawthar Yashmine Zergat et al. (2013) developed an automatic speaker verification system that recognizes voice transmitted over Internet protocol (VoIP) with a G.729 coder, using LSF features with excellent performance [30]. Himadri Mukherjee et al. (2018) proposed a system called LSF-RG (LSF-Ratio Grade) for differentiating songs, speech and instrumentals. Several classification methods were used in the experiments, with the multi-layer perceptron producing the best results [22]. Anand D Subramaniam et al. (2002) proposed a method based on a parametric probability density function (PDF) in which the speech line spectral frequencies are modelled. They used an error control scheme to quantify the difference between consecutive LSFs, and the variable rate quantizer outperforms the fixed rate quantizer [7]. In the work of Rajesh Kumar Dubey et al. (2013), MFCC, perceptual linear prediction (PLP) coefficients and LSF were computed as speech features from active speech, and the feature vectors were combined with their subjective mean opinion score (MOS) for Gaussian Mixture Model (GMM) training. Over three databases, the experimental results show that the combination of feature vector sets outperforms ITU-T Recommendation P.563 in terms of correlation and root mean square error (RMSE), using combination averaged MOS and unconditioned MOS [46].
LSF versus MFCC representations of PD speech in Machine Learning
From subsection 2.1, it emerges that LSF trajectory images are good candidates for speaker recognition. As the problem of classifying a subject as PD or HC is close to speaker recognition, it is proposed to use an LSF based speech representation for PD classification from speech. Like LPC/LSF, MFCC is also one of the best parameters for distinguishing PD from HC and is used in both speaker identification and recognition. Many studies have used MFCC as features to classify PD. Iqra Nissar et al. (2019) observed 95.39% accuracy in classifying PD versus HC [24]. Achraf Benba et al. (2015) found that MFCC gave the highest classification accuracy, 91.17% [1]. Atiqur Rehman et al. (2021) found that using MFCC to differentiate PD from HC yielded an accuracy of 97.50% [4]. Similarly, LPC is a speech signal processing parameter: LPC analysis models the speech signal by estimating the formants, removing their effects, and estimating the frequency and intensity of what remains. J. Rusz et al. (2011) reported 78% accuracy in differentiating PD from HC on the basis of LPC analysis, which they verified by comparing the results of the classification algorithm with the diagnoses made by physicians [26]. Qi Wei Oung et al. (2018) used LPC-based features, with which the Probabilistic Neural Network (PNN) classifier achieved the highest mean accuracy of 93.97% [45].
The fundamental difference between MFCC and LPC/LSF analysis is that MFCC mimics the auditory characteristics of human hearing and manipulates the signal accordingly, whereas LPC/LSF reflects the system that produces the speech and thus represents the vocal cord. It is therefore intuitive that any deformation of the vocal tract due to PD is reflected more strongly in LPC/LSF. Though LPC has been used in PD classification, LSF trajectory-based PD classification has not been demonstrated or discussed in the previous literature, and this paper applies it to running PD speech in a DCNN framework. PD analysis using running speech data is gaining attention; in this paper, the sustained phonation dataset is used for a quick test of the method and the running speech dataset is considered for detailed analysis. Running speech is vital in modern non-clinical telemedicine PD classification. The work considered for comparison [11] is an MFCC based, privacy-conscious method for classifying PD patients and HC on the basis of speech impairment in running speech. Voice features were extracted from running speech signals in recordings of voice calls that were passively captured. Language-aware training of multiple-instance and single-instance learning classifiers was used to fuse and predict on voice features and demographic data from a multilingual cohort of 498 subjects (392 self-reported HC and 106 PD patients). Voice call signals are first captured passively by the smartphone’s microphone through the i-Prognosis app, which records the first 15-75 seconds of the subject’s call. There was no need for speech-source separation. The voice-related features are then extracted on each subject’s smartphone. The total length of the feature vector is given as 2×4×13 (MFCC) + 2×4×22 (Bark Band Energies, BBE) = 282; the two products themselves sum to 104 + 176 = 280.
Lower order MFCCs contain most of the information about the overall spectral shape of the transfer function, whereas PD influences articulation over time, and higher order MFCCs capture this variability. Two different types of classifiers were used in [11], and the maximum performance achieved is about 93% AUC. The methodology presented in this paper uses LSF trajectory images in a DCNN framework, and it is demonstrated that it discriminates PD running speech very accurately, at almost 100% accuracy, which is better than the benchmark work [11]. Little work has been done to demonstrate the use of the MFCC spectrum in a DCNN framework. Mel spectrogram images have been used in a DCNN framework to achieve an accuracy of 81.60% [12].
Other related works
Karthikeyan Harimoorthy et al. (2021) developed a cloud-based Parkinson’s disease identification system for predictive telediagnosis and telemonitoring in smart healthcare applications. They proposed a cloud-assisted PD identification system with patient-centric and cost-effective features based on non-clinical parameters. The proposed system diagnoses the remote patient by examining symptoms of PD such as dysphonia; PD has been identified as the world’s most severe neurodegenerative disorder. They tested the proposed system with an adaptive linear kernel Support Vector Machine (k-SVM) on the benchmark voice dataset from the University of California Irvine (UCI) repository, and the results show that it significantly improves detection compared with existing classifiers [29]. Kaya D. (2022) used the human voice for gender classification and PD identification, with an mRMR feature selection method that improves the classifier’s performance to 98.90% [15]. That paper [15] discusses the need for proper feature engineering for a successful neural network classifier. Feature engineering is an additional block that involves computation and applies to feature-based classification; in the proposed DCNN method, the separate feature engineering block is removed.
Some key points that emerge from the discussion are summarized here before the proposed work is elaborated. Dysfunction of the vocal tract is responsible for speech degradation in PD patients, and LSF trajectories reflect the vocal cord characteristics. The LSF representation of speech is already widely used in speech recognition, speaker recognition and song recognition. The LSF trajectory image can be used in a DCNN for learning to classify PD. MFCC is a very popular and widely used feature in PD classification: a maximum accuracy of 97.50% [4] is demonstrated with SVM for a sustained phonation dataset, and for running speech the learning model is reported to give an AUC of 93% [11]. MFCC involves considerable preprocessing of the speech frames, such as windowing and pre-emphasis, and a logarithmic operation in its computation [11]. LSF computation involves finding the roots of polynomials derived from the LPC, and the LPC are determined by linear prediction. LSF trajectory image creation is done on the raw speech signal and does not involve preprocessing stages such as windowing and pre-emphasis. LSF trajectory images are therefore suitable candidates for use with a DCNN. As MFCC are fully computed features, they are used directly with machine learning algorithms such as SVM, and little work has been done on using the MFCC spectrum in a DCNN. One reason may be that MFCC are not raw in nature but an already computed feature set, from which further visual parameters [5] cannot be extracted. When Mel spectrograms are used in a DCNN in recent work, a validation accuracy of 81.60% [12] is achieved. In this paper, it is proposed to use the LSF trajectory spectrum obtained from the LPC calculated on raw speech, with no preprocessing of the speech frames such as windowing. Unlike the previously discussed methods [1, 4, 24, 26, 45], it is proposed not to use any feature calculation stage but to use the LSF trajectory spectrum directly with the DCNN and let the CNN learn the features from the visual spectrum.
This numerical classification experiment, using a DCNN and the LSF spectrum on the sustained phonation and running speech datasets, gives a validation accuracy of 100%; the remaining sections describe the complete process and further statistical analysis.
Speech production mechanism and LSF
LPC analysis of a short speech frame yields the coefficients of a digital linear time-invariant (LTI) system. The analysis operates on the raw discrete speech samples, and no windowing or preprocessing is required. In this section, the merits of the LSF spectrum are discussed using PD/HC speech samples. The speech production mechanism can be modelled as a signal processing block, as shown in Fig. 1. The human vocal tract is a time-varying system that produces speech, but it is considered time invariant during the production of a single phonetic unit, either voiced or unvoiced [27]. Voiced phonetics are produced by a periodic glottal pulse train as the input to the vocal tract, whereas unvoiced phonetics are produced by a noise-like input to the vocal tract, which can be modelled as white noise. The signal processing model of the vocal tract for the production of a phonetic unit is an all-pole IIR system of order P [37].

Speech production mechanism for speech signal processing.
The transfer function $H(z)$ of this all-pole system is given by $H(z) = 1 / A(z)$.
The polynomial $A(z)$ is given in equation (1). The $a_k$ in equation (1) are the LPC, and the roots of $A(z)$ are the poles of the system. The poles can be determined from the LPC, the resonant frequencies can be determined from the poles, and the line spectral frequencies can then be obtained. The LPC are the system parameters of the vocal tract, and as the LSF are derived from the LPC, they reflect abnormal conditions of the vocal tract.
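The conversion from LPC to LSF can be sketched in a few lines. The sketch assumes the sign convention A(z) = 1 + sum_k a_k z^-k and locates the unit-circle roots of the sum and difference polynomials with a grid search refined by bisection, instead of a general polynomial root finder. The names `lpc_to_lsf` and `grid` are hypothetical, and very closely spaced LSF pairs may need a denser grid than the default used here.

```python
import math


def lpc_to_lsf(a, grid=4096):
    """Convert LPC [a_1..a_p] of A(z) = 1 + sum_k a_k z^-k to LSF in radians."""
    p = len(a)
    c = [1.0] + list(a) + [0.0]              # coefficients of z^0 .. z^-(p+1)
    sym = [c[m] + c[p + 1 - m] for m in range(p + 2)]    # sum polynomial
    anti = [c[m] - c[p + 1 - m] for m in range(p + 2)]   # difference polynomial
    half = (p + 1) / 2.0

    # On the unit circle both polynomials reduce to real functions of the
    # angle w once the common phase factor exp(-j*w*half) is removed.
    def sym_real(w):
        return sum(v * math.cos(w * (half - m)) for m, v in enumerate(sym))

    def anti_real(w):
        return sum(v * math.sin(w * (half - m)) for m, v in enumerate(anti))

    def zeros_in_open_interval(f):
        found = []
        ws = [math.pi * i / grid for i in range(1, grid)]
        for lo, hi in zip(ws, ws[1:]):
            f_lo, f_hi = f(lo), f(hi)
            if f_lo == 0.0:
                found.append(lo)
            elif f_lo * f_hi < 0.0:          # sign change: refine by bisection
                for _ in range(60):
                    mid = 0.5 * (lo + hi)
                    f_mid = f(mid)
                    if f_lo * f_mid <= 0.0:
                        hi = mid
                    else:
                        lo, f_lo = mid, f_mid
                found.append(0.5 * (lo + hi))
        return found

    return sorted(zeros_in_open_interval(sym_real)
                  + zeros_in_open_interval(anti_real))


# Example: a single resonance (pole pair at radius 0.9 and angle 1.0 rad).
a = [-2 * 0.9 * math.cos(1.0), 0.9 ** 2]
print(lpc_to_lsf(a))
```

For this single resonance, the two LSF values fall on either side of the pole angle of 1.0 rad. This is the sense in which the LSF track the pole locations, although they are not the poles themselves. The trivial roots at angles 0 and pi are excluded because the search covers only the open interval.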
LPC analysis is a technique for estimating the vocal tract transfer function, from which the formant frequencies can be calculated analytically. LPC is frequently used to transmit spectral envelope data. Figures 2 and 3 show the trajectories of 16-point LPC vectors obtained over time from the speech signals of an HC subject and a PD subject, respectively, uttering the phonetic /aaa/. The trajectories vary very rapidly with time, with values ranging from negative to positive, and they overlap where the values approach zero. Comparing Figs. 2 and 3, both vary in a seemingly random manner and show no observable pattern that separates PD from HC. LSF is a more advanced representation of LPC. LSF decomposition has grown in popularity because it ensures predictor stability and keeps spectral errors local for small coefficient deviations, which makes LSF better suited to spectrum classification. The LSF trajectory obtained for a speech frame of the phonetic /aaa/ uttered by an HC subject is given in Fig. 4.

Trajectory of LPC vectors for HC.

Trajectory of LPC vectors for PD.

Trajectory of LSF vectors for HC.
It can be observed that the 16 lines, which vary slowly with time, denote the vocal tract dynamics across time. Note also that the values are positive and that the trajectory of a given center frequency does not overlap with the trajectory of a neighbouring frequency. The LSF trajectory obtained for a speech frame of the phonetic /aaa/ uttered by a PD subject is given in Fig. 5. In Fig. 4, the first three lines counting from the bottom are smooth, whereas in Fig. 5 the lower frequency lines vary to a greater extent for the PD voice.

Trajectory of LSF vectors for PD.
Comparing Figs. 4 and 5, the HC trajectories show fewer variations than the PD trajectories, which are more varied and complex.
Figure 6 is a close-up of Figs. 4 and 5 that highlights the lower-frequency lines 0 to 3. It is drawn for three PD voices and three healthy control voices uttering the sustained phonation /aaa/. The three graphs in the first row of Fig. 6 show the line spectral frequency variations in PD voice, and the three graphs in the second row show those in healthy voice. Fig. 6 shows that the frequency lines of PD voice vary more than those of healthy voice. The corresponding image spectrum representation of lines 0 to 3 is given in Fig. 7. This suggests the LSF spectrum as a representation for classifying healthy and PD speech using CNN.

Consolidated graphs of LSF Trajectories for PD and HC.

Consolidated LSF spectrums for PD and HC.
It is proposed to examine PD classification using the LSF spectrum image representation of sustained phonation and running speech. DCNN is employed because it learns the features automatically through its convolution filter layers and hidden layers, which eliminates the time-consuming stage of determining speech parameters. This section discusses the speech datasets, the DCNN parameter settings, and the concept of automatic feature learning by DCNN.
Sustained phonation speech database
For sustained phonation, the dataset from Istanbul University [10] is used. In its current state, this dataset is divided into two parts, a training dataset and a testing dataset. The publicly available testing dataset consists of 168 samples of two sustained phonations, /aaa/ and /ooo/. The healthy control sustained phonation samples are a user-defined dataset recorded from people aged 20 to 34, with a mean age of 27. Table 1 gives the dataset details, including the number of audio files used for training and testing in each prediction class. In this experiment, there are 278 audio files in total, 134 for PD and 144 for HC. In this paper, this dataset is used as a quick check of the performance of the LSF-based spectrum representation of speech in DCNN.
Sustained phonation database for training and testing
This is the primary dataset of the proposed work and is used to demonstrate PD classification from the natural running speech of the subjects using the LSF spectrum-based speech representation in DCNN. The data were collected at King’s College London (KCL) Hospital in Denmark Hill, Brixton, on September 26-29, 2017 [21]. The dataset is open source, recent, and covers both PD and HC subjects. It has been used for PD classification and analysis in very recent work [14, 41, 42]. The mobile device voice recordings fall into two categories, “Spontaneous Dialogue” and “Read Text”, each with PD and HC classes. “Spontaneous Dialogue” contains 15 PD and 21 HC cases, whereas “Read Text” contains 16 PD and 21 HC cases. The key merits of this dataset are that it is public and that it represents the natural form of speech of PD and healthy subjects. Few such public datasets are available.
The number of PD samples is 31 and the number of HC samples is 42. The voices of both subject groups are in their natural language form. The recordings were made in a 10 m² examination room with a reverberation time of approximately 500 ms. The voice recording was done via voice call. The sampling rate was 44.1 kHz, with a bit depth of 16. Table 2 contains the database details, including the number of audio files used for training and testing. The total number of audio files is 1009, with 735 for training and 274 for testing. The primary aim is to study the classification performance for PD using the LSF-based spectrum of the speech samples.
Running speech database for training and testing
In this work, it is proposed to employ CNN, which produces promising results on images, and to represent speech as 2D images with information spread across the frequency-time plane. These images are then classified. Sliding windows are used to slice the speech signals, which were recorded at various sampling rates. The time-domain frames are converted to the frequency domain with the help of LSFs. The formant frequencies of the sustained phonation and running speech are represented by line spectral frequencies. The LSF representation of speech is known to be simple, neat and compact, and it is used in many speech compression methods for faithful reconstruction of speech. Based on this concept, it is proposed in this paper to use an LSF-based representation [7, 22, 30, 44, 46, 48]. The speech signals are decimated to a sampling rate of 8 kHz and split into frames of size 256, and 16 LPC coefficients are determined for each frame, as shown in Fig. 8.
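The framing and LPC step can be sketched as follows. This is an illustrative sketch, not the implementation used in this work: it assumes the signal has already been resampled to 8 kHz, uses non-overlapping frames of 256 samples with a Hamming window, and estimates the 16 LPC coefficients by the autocorrelation method with the Levinson-Durbin recursion, none of which is specified above. The function names are hypothetical.

```python
import numpy as np

def frame_signal(x, frame_len=256, hop=256):
    """Slice a 1-D signal into frames of frame_len samples (the hop is an assumption)."""
    n_frames = (len(x) - frame_len) // hop + 1
    return np.stack([x[i * hop:i * hop + frame_len] for i in range(n_frames)])

def lpc_autocorr(frame, order=16):
    """Return [1, a_1, ..., a_order] for A(z) = 1 + sum_k a_k z^-k (Levinson-Durbin)."""
    frame = np.asarray(frame, dtype=float)
    n = len(frame)
    # Autocorrelation at lags 0..order
    r = np.array([np.dot(frame[:n - k], frame[k:]) for k in range(order + 1)])
    a = np.zeros(order + 1)
    a[0] = 1.0
    err = r[0]                   # prediction error energy
    for i in range(1, order + 1):
        if err <= 0.0:           # silent or perfectly predictable frame: stop early
            break
        acc = r[i] + np.dot(a[1:i], r[i - 1:0:-1])
        k = -acc / err           # reflection coefficient
        prev = a.copy()
        a[1:i] = prev[1:i] + k * prev[i - 1:0:-1]
        a[i] = k
        err *= 1.0 - k * k
    return a

def frames_to_lpc(x, frame_len=256, order=16):
    """Window each frame (Hamming, assumed) and compute its LPC vector."""
    frames = frame_signal(np.asarray(x, dtype=float), frame_len, frame_len)
    window = np.hamming(frame_len)
    return np.array([lpc_autocorr(f * window, order) for f in frames])

rng = np.random.default_rng(0)
speech = rng.standard_normal(8000)      # stand-in for 1 s of speech at 8 kHz
lpc = frames_to_lpc(speech)             # shape (31, 17): one LPC vector per frame
```

Each row of the result holds the leading 1 followed by the 16 LPC coefficients of one frame, which is the input needed for the subsequent LSF computation.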

Image generation for LSF based Spectrum.
According to the speech production mechanism, the vocal system is modelled as an all-pole IIR system. Here, the LPC coefficients are the filter coefficients, and the LSFs represent the poles of the IIR system. To generate LSF spectrum images, 16 LSF parameters are determined for each speech frame and arranged in columns. Figures 9 and 10 show the LSF images obtained for a sample HC file and a sample PD file, respectively. The procedure is repeated for all audio files in the datasets, yielding the 278 sustained phonation images and 1009 running speech images listed in Tables 1 and 2. LSF images have a dimension of N x 16.
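To make the image generation step concrete, the sketch below turns per-frame LSF vectors into an N x 16 grayscale array. It assumes that the LSFs are expressed in radians between 0 and π and are mapped linearly to 8-bit gray levels. The scaling, the orientation (one frame per row) and the function name lsf_to_image are assumptions, not details given above.

```python
import numpy as np

def lsf_to_image(lsf_frames):
    """Map an (N, 16) array of LSFs in radians to an 8-bit grayscale image."""
    lsf = np.asarray(lsf_frames, dtype=float)
    scaled = np.clip(lsf / np.pi, 0.0, 1.0) * 255.0   # assumed linear gray-level mapping
    return np.round(scaled).astype(np.uint8)

# Stand-in data: 5 frames whose 16 LSFs are evenly spaced (hypothetical values).
lsf_frames = np.tile(np.linspace(0.1, 3.0, 16), (5, 1))
image = lsf_to_image(lsf_frames)     # shape (5, 16), dtype uint8
```

Because the LSFs of a frame are ordered, the gray level increases along each row, and changes of a given LSF over time appear as intensity changes down the corresponding column.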

LSF image for running speech of HC.

LSF image for running speech of PD.
Deep learning is a subfield of artificial intelligence, which is broadly defined as a machine’s ability to mimic intelligent human behavior. Artificial intelligence systems are used to accomplish complex tasks in a manner similar to how humans solve problems. A deep learning model is composed of multiple processing layers and can use both structured and unstructured data for training. Practical applications of deep learning include pharmaceuticals, virtual assistants, chatbots, entertainment, image coloring, robotics and many more. CNN is chosen in this paper for spectrum image classification because it works well for image classification. Many machine learning-based models, such as the support vector machine, have recently proven useful for accurate and efficient PD recognition [50].
CNN parameter setting for proposed work
In CNN, spectrum images are classified through a sequence of steps [6]. In the proposed work, the model type is sequential, and the Keras library is used. In the first step, 64 filters are used with ReLU as the activation function. Following that, the number of filters is increased from 64 to 128, with max pooling and the ReLU activation function. During image preprocessing, rescaling, shear range, zoom range, and horizontal flip are used. The shear and zoom ranges are both 0.2, the horizontal flip is enabled, and the class mode is binary. After the pooling stage comes the flattening stage, followed by the fully connected layers, as shown in Fig. 11. Two-class classification requires a single output, so the sigmoid activation function is used because its values range from 0 to 1.

General CNN Architecture.
The actual implementation of CNN in the proposed work is depicted in Fig. 12, which also illustrates the automatic learning of features by CNN. Several works have shown that CNN eliminates the need for handcrafted feature and parameter selection [13, 33, 36, 38, 55]. When a one-dimensional speech signal is represented as a time-frequency spectrum image, the frequency content of a given time frame varies along the y-axis, and successive time frames are laid out along the x-axis. This time-frequency distribution therefore represents the complete speech signal, and the characteristics of the speech are captured in the characteristics of the image. CNN can learn these speech images. The convolution layers of CNN automatically learn filters that discriminate the classes, using feedback from the output, and the flattened output of the convolution layers can be regarded as the features extracted from the images. These features can then be passed to further discriminative layers, which replaces a separate parameter computation stage. The pseudocode of the actual implementation is given in Algorithm 1.
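The idea that the flattened convolution output acts as a feature vector can be illustrated with a small NumPy forward pass. This is only a sketch: the kernels here are fixed example filters rather than learned ones, and the 3 x 3 kernel size and 2 x 2 pooling are assumptions that are not stated for the proposed model.

```python
import numpy as np

def conv2d_valid(img, kernel):
    """2-D cross-correlation without padding, as used in CNN convolution layers."""
    kh, kw = kernel.shape
    out = np.empty((img.shape[0] - kh + 1, img.shape[1] - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(img[i:i + kh, j:j + kw] * kernel)
    return out

def max_pool_2x2(x):
    """Non-overlapping 2 x 2 max pooling; an odd trailing row or column is dropped."""
    h, w = x.shape[0] // 2 * 2, x.shape[1] // 2 * 2
    return x[:h, :w].reshape(h // 2, 2, w // 2, 2).max(axis=(1, 3))

def extract_features(img, kernels):
    """Convolution -> ReLU -> max pooling -> flatten, concatenated over all kernels."""
    maps = [max_pool_2x2(np.maximum(conv2d_valid(img, k), 0.0)) for k in kernels]
    return np.concatenate([m.ravel() for m in maps])

# Two hand-written example kernels; a trained CNN would learn these from the data.
kernels = [np.array([[-1, 0, 1]] * 3, dtype=float),     # responds to change along x
           np.array([[-1, 0, 1]] * 3, dtype=float).T]   # responds to change along y
img = np.random.default_rng(0).random((8, 16))          # stand-in for an LSF image
features = extract_features(img, kernels)               # length 2 * 3 * 7 = 42
```

In a trained network, the kernel values are adjusted by backpropagation so that the resulting feature vector separates the PD and HC classes; the dense layers then operate on this vector.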

DCNN learning Flow for PD classification.
As depicted in Figs. 11 and 12 and Algorithm 1, the CNN is implemented using the Keras library, and it is trained and validated on the proposed LSF image representations. The sustained phonation dataset [10] contains a total of 278 speech samples (PD-134 & HC-144). To train the DCNN, 200 speech samples (PD-100 & HC-100) are taken, and 78 speech samples (PD-34 & HC-44) are taken for testing and validation. The King’s College running speech dataset [21] contains a total of 1009 speech samples (PD-394 & HC-615). To train the CNN, 735 speech samples (PD-285 & HC-450) are taken, and 274 speech samples (PD-109 & HC-165) are taken for testing and validation. The training and validation accuracies achieved with the LSF method at the final (100th) epoch for both utterance tasks, sustained phonation and running speech, are tabulated in Table 3.
DCNN Training and Validation Accuracy for 100 Epochs
It can be inferred that the LSF-based method performs well for both running speech and sustained phonation, with 100% validation accuracy. The focus of this work is on running speech, so a detailed analysis is carried out for the running speech experiment. For the running speech samples, the model’s accuracy is plotted against the number of epochs. Figures 13–16 show the plots of training accuracy, validation accuracy, combined model accuracy, and combined model loss for LSF of running speech.

CNN model Training accuracy (LSF) for Running Speech dataset.

CNN model Validation accuracy (LSF) for Running Speech dataset.

Combined CNN model accuracy (LSF) for Running Speech dataset.

Combined CNN model Loss (LSF) for Running Speech dataset.
It can be concluded from the plots that the representation performs well. The 100% validation accuracy means that there are no False Positives (FPs) or False Negatives (FNs); the only outcomes in the validation experiment are True Positives (TPs) and True Negatives (TNs). From this performance, it can be inferred that the selected method, CNN training on LSF spectrums, is superior for PD classification on the running speech dataset. Table 4 shows that the proposed method outperforms other existing works. Table 4 also shows that, for the existing works, the training accuracy achieved for running speech is higher than that for sustained phonation.
Comparison of Existing Works with Proposed work
It can also be observed that the proposed method yields 100% classification accuracy for PD speech. This supports the high performance of DCNN with the LSF trajectory-based spectrum representation of PD speech.
As already discussed, the focus of the research is to demonstrate the efficacy of the proposed method on the running speech dataset. To test the model’s robustness for running speech, the test set (PD-109 & HC-165) is divided into two subsets, test set-1 (PD-60 & HC-80) and test set-2 (PD-49 & HC-85). Only the test set-1 audio files are used in the training phase to tune the training accuracy, and test set-2 is kept isolated purely for testing; that is, its files are not used in the training phase. The training and validation accuracies obtained during the training phase using test set-1 are given in Table 5. The validation accuracy obtained using test set-2 is given in Table 6.
DCNN Training and Validation Accuracy for Test set-1
Validation Accuracy obtained using Test set-2
It can be observed that the validation accuracy remains high (100%) even when the test set is isolated from the training set. For further analysis, the experiment of training and testing with test set-2 is repeated for 50 trials, and the numbers of misclassified PD and HC samples are determined. This is equivalent to determining the confusion matrix pooled over all trials. The standard machine learning performance metrics computed from the total tests in all 50 trials are tabulated in Table 7. The CNN model has an F1 score of 99.37%, a sensitivity of 100%, a precision of 98.75% and a specificity of 99.27%.
Performance analysis of CNN model for Running speech of Test set-2 (50 Trials)
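The reported metrics follow from the pooled confusion matrix in the usual way. The sketch below uses counts that are inferred, not quoted from Table 7: pooling 50 trials of test set-2 gives 50 x 49 = 2450 PD and 50 x 85 = 4250 HC decisions, and assuming that 31 HC decisions are misclassified as PD (with no PD misclassified) reproduces the stated percentages.

```python
def binary_metrics(tp, fp, tn, fn):
    """Standard metrics with PD as the positive class."""
    sensitivity = tp / (tp + fn)          # fraction of PD samples detected
    specificity = tn / (tn + fp)          # fraction of HC samples recognised as HC
    precision = tp / (tp + fp)            # fraction of PD predictions that are correct
    f1 = 2 * precision * sensitivity / (precision + sensitivity)
    return {"sensitivity": sensitivity, "specificity": specificity,
            "precision": precision, "f1": f1}

# Assumed pooled counts over 50 trials (inferred from the reported percentages).
metrics = binary_metrics(tp=2450, fp=31, tn=4219, fn=0)
percent = {name: round(100 * value, 2) for name, value in metrics.items()}
# percent == {"sensitivity": 100.0, "specificity": 99.27, "precision": 98.75, "f1": 99.37}
```

Under these assumed counts, the 31 false positives correspond to fewer than one misclassified HC sample per trial on average.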
The numerical experiments conducted in this paper demonstrate that training the DCNN on LSF spectrum images of the speech of PD and HC subjects results in a very high validation accuracy, close to 100%, for both the sustained phonation and the King’s College running speech datasets. This method eliminates the need for a separate feature extraction stage because the DCNN automatically learns the features from the input images. The merits of LSF trajectory spectrums over LPC spectrums are discussed, and it is demonstrated that the LSF spectral speech representation classifies PD voice with 100% validation accuracy. The sustained phonation dataset is more clinical and has fewer samples, whereas in a natural running speech scenario many sample files can be created, and the increased sample size leads to better training accuracy. In the proposed work, PD classification by training the DCNN on LSF trajectory spectrums of speech yields a training accuracy of 92.50% for sustained phonation and 99.18% for running speech. Although running speech comprises various types of speech frames, such as voiced, unvoiced and silent frames, in a naturally random order, a validation accuracy of 100% is achieved for the running speech dataset. This is attributed to the fact that the LSF trajectory images model the vocal tract, so the deformation due to PD is captured irrespective of the frame type. The achieved 100% validation accuracy is better than the performance of another MFCC parameter-based running speech PD classification [11], which reports an AUC of 93%. The performance of the proposed method is much better than that of other recent work for both sustained phonation and running speech.
The 100% validation accuracy of the proposed method is attributed to the promising performance of the selected DCNN on visual image input and to the novel LSF trajectory spectrum representation of PD and HC speech. To determine the robustness of the proposed method, the experiment is repeated for 50 trials on the isolated test set, and a performance analysis is carried out. It shows a sensitivity of 100%, which implies that across all samples in all trials PD is never misclassified as HC, and a specificity of 99.27%, which implies that a few HC samples are misclassified as PD. The novelty of the work is the idea of using smooth LSF trajectories in grayscale image format for DCNN training, which achieves a training accuracy of 99.18% and a validation accuracy of 100% on a more common, less clinical, natural running speech dataset. This paves the way for using LSF vectors over time to classify PD and to analyze other voice defects. At present, a fully connected neural network layer follows the convolution layers as the discriminating layer; the future scope of this work is to use other types of discriminating layers, such as SVM, after the convolution layers and to analyze the ROC curves and AUC.
