Abstract
The study of neonatal cry signals remains an active topic, and researchers continue to work on models that predict the actual reason for a baby's cry, which is hard to determine. The main focus of this paper is to develop a Dense Convolution Neural Network (DCNN) to predict the cry class. The target cry signal is categorized into five classes based on sound: “Eair”, “Eh”, “Neh”, “Heh” and “Owh”. Predicting these classes helps in detecting the reason for an infant's cry. The audio and speech features (AS features) were extracted from the cry signal spectrogram using the Mel-Bark frequency cepstral coefficient and fed into the DCNN. The DCNN architecture is modelled with a modified activation layer to classify the cry signal. The cry signals were collected in different growth phases of the infants and tested on the proposed DCNN architecture. The performance of the system is evaluated through accuracy, specificity and sensitivity. The proposed system yielded a balanced accuracy of 92.31%. The highest accuracy (95.31%), highest specificity (94.58%) and highest sensitivity (93%) were attained with the proposed technique. From this study, it is concluded that the proposed technique is more efficient in detecting cry signals than the existing techniques.
Keywords
Introduction
Unlike other living species, humans have a formal language to express their emotions. Infants, however, cannot express their thoughts in formal language and instead rely on their cries to convey their needs. It is hard to predict the reason for their cry. To address this, researchers conducted a series of studies on the neonatal cry signal [1]. They categorized the cry signal into five classes, each reflecting a specific need: “Neh” for hunger, “Heh” for physical discomfort, “Eh” when a burp is needed, “Eair” for cramps and “Owh” for fatigue. This technique is commonly known as Dunstan baby language (DBL) [2]. It helps parents and caretakers not only to predict the reason for the infant's cry but also to detect certain aspects of conditions such as asphyxia and deafness [3]. Analysing the acoustic waves of infant cries helps to detect the physical and health condition of babies [5, 6]. Acoustic sounds can be classified with machine learning networks [8], which process and segment the sound wave and define the distinct features of the infant cry signal [24]. Figure 1 shows the spectrogram of infant cries with different emotions.

Spectrogram of infant cry signal: (a) “Eair”, (b) “Eh”, (c) “Neh”, (d) “Heh” and (e) “Owh”.
In general, audio features fall into four domains: wavelet, cepstral, time and prosodic [12]. The Mel Frequency Cepstral Coefficient (MFCC), a familiar technique for sound recognition, is commonly used to convert sound into a voice signal vector [16, 25]. It provides a short-term spectral representation of a signal and acts similarly to human hearing [13, 28], but it is often affected by the shape, size and number of filters used [17]. The Gaussian Mixture Model (GMM) is a probabilistic graph model that predicts a variable from the given dataset [19]. In cry signal processing, the GMM effectively identifies the relevant inspiration and expiration dilation [21, 22]. Prior work on convolution neural networks (CNN) [9, 23], such as P-ResNet [11], AlexNet [30] and Inception Net [20], has reported efficient outcomes in speech and audio signal processing. In the current work, the feature extraction step extracts both audio features and speech features, which efficiently captures the acoustic signal under diverse conditions. In addition, a novel Dense Convolution Neural Network (DCNN) is developed to classify and categorise the cry reason based on the frequency of the sound waves. The features employed in this system are used to classify the five target classes of the cry signal.
The main objectives of the proposed system are the selection of appropriate features from the neonatal cry signal, the extraction of AS (Audio and Speech) features and the classification of five target classes. The Mel-Bark Frequency Cepstral Coefficient (MBFCC) is implemented to convert the sound waves into acoustic signal vectors. A Gaussian Mixture Model (GMM) is used to analyse and capture all the possible features of the cry signal in adverse conditions. A modified Dense Convolution Neural Network (DCNN) is developed to train on and classify the target classes of the cry signal.
The paper is organized as follows: section 2 (related studies) explains the existing works on infant cry signals and their drawbacks, section 3 gives a comprehensive study of the proposed work, section 4 narrates the experimental analysis and results of the proposed work, and section 5 holds the conclusion of this paper.
Research on computer-based techniques has produced multiple methods to evaluate neonatal health status based on the classification of infant cry signals. Several machine learning and deep learning algorithms have been developed to classify the cry signals. Some of those techniques are reviewed briefly in this section.
In 2015, Alaie, H.F., Abou-Abbas, L. and Tadj, C. developed a pathological classification technique for infant cry signals using the Gaussian Mixture Model (GMM). Acoustic analysis of the noisy infant cry signal was used to measure the quantitative characteristics of healthy and sick infant cries. Static and dynamic Mel-Frequency Cepstral Coefficients (MFCC) were extracted from both the inspiratory and expiratory vocalizations of the cry to form discriminative feature vectors. A Boosting Mixture Learning technique was then developed to detect normal and abnormal cry signals. The data in this technique do not contain sufficient variability to train the GMM for all possible health conditions [10].
In 2016, Chang et al. put forward an automatic infant cry detection model using deep learning techniques. The audio files were stored in the waveform audio format, and frames below 0.1 dB were removed to reduce noise. The cry signal was converted into a spectrogram using the fast Fourier transform, and a convolution neural network was then trained to recognize the signal. Dropout was added to reduce overfitting in the CNN architecture. However, the system has limited features for detecting the cry reason and low classification accuracy [4].
In 2016, Lavner et al. designed a deep learning technique to detect baby cries in the domestic environment. First, a low-complexity logistic regression classifier was used as a reference; MFCC, pitch and formants were extracted to train this classifier. Second, a more complex CNN was designed to operate on the Mel filter bank of the recording. This system detects the cry sounds of neonates aged 0–6 months and can also detect talking sounds and door opening. However, it cannot predict the reason for the baby's cry [29].
In 2018, Naithani et al. developed a Hidden Markov Model (HMM) for segmenting the acoustic parts of the infant cry signal. The cry signals were obtained from different environments with various disturbances. Audio features such as frequency and aperiodicity were measured to detect and optimize the performance of the system. There are three HMM states in each class, and each state was modelled with 10 Gaussian components. A two-step adaptation method with feature normalization and semi-supervised learning was then developed, which yielded 80.7% accuracy. However, the inspiration phase was observed to have poor performance [7].
In 2019, Le et al. developed a classification technique to predict the cry reason of neonates using spectrogram images. Transfer learning with an SVM and a pre-trained ResNet50 CNN was used for the classification process. Spectrogram images were chosen to classify the audio signals that use MFCC features, and deep learning models were combined to improve the result of the technique. The ResNet and SVM models were chosen for their simplicity and efficient performance. This technique mainly focuses on reducing false positives. It does not classify the reason for the infant cry [15].
In 2019, Severini et al. developed a deep neural network approach with single-channel and multi-channel networks. They conducted an experimental evaluation on a synthetic dataset built from the acoustic scene of a NICU and on a real dataset, which revealed a few concerns about microphone array orientation and position. Log-Mel coefficients are calculated to extract the features, and the observed spectral audio signal is used to predict the cry signal. The evaluation shows that the SE-DNN system performs better on the same dataset, but it does not classify the cry signal [18].
In 2019, Dewi, S.P., Prasasti, A.L. and Irawan, B. developed a detection method for infant cry signals with a high fundamental frequency. The audio features are extracted from input signal frames of 10–40 ms using the LFCC and KNN algorithms. This uses the MFCC, SNN and VQ algorithms for classification. The only difference between MFCC and LFCC is the filter bank used. In stage 1 the sound signals are classified, and in the next stage the results are analyzed for all the classification samples. However, the accuracy decreases as the number of test samples increases [26].
From the literature survey, we observed that most of the existing systems focus only on the detection of cry sounds and do not address the classification of the cry signal. Some also face difficulties in training and classifying the signals. To overcome these difficulties, a modified DCNN is designed in the proposed system.
Design and methods
The infant cry signal prediction system focuses on predicting the five target classes of the infant cry signal: “Eair”, “Eh”, “Neh”, “Heh” and “Owh”. The principal concept is to represent the input signal as a spectrogram over the time and frequency axes, which eases processing in the CNN. The spectrogram is subjected to Mel-Bark frequency cepstral analysis to aggregate the frequency bands according to perceptual weight. The basic processing steps are data acquisition, pre-processing, feature extraction, feature selection and classification, as shown in Fig. 2.

Basic structure of the AS Feature and DCNN system.

Flow diagram representation of the AS features work system.
In signal processing, denoising is an initial and essential step that converts the input signal into a form suitable for processing. It removes unwanted noise such as environmental noise, speech interference and other artifacts from the input signal. Initially, the collected audio recordings are fragmented into sequential overlapping slices of 1256 samples, about 15–40 ms in length. These slices are further sectioned into frames of 18 ms with a step size of 9 ms. Each frame is clipped to calculate the pitch frequency. The audio signals are clipped with the 3-level method; the mathematical calculation of 3-level clipping is given in Equation (1);
The clipping level $x_f$ normally ranges from 60% to 70% of the maximum amplitude. The calculation of the pitch frequency of the frame can be accelerated by 10 times after the clipping.
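The clipping step can be sketched as follows. This is a minimal illustration that assumes the common centre-clipping form, in which samples above the threshold map to +1, samples below the negative threshold map to −1 and all others map to 0. The function name `three_level_clip` and the default ratio of 0.65 (the midpoint of the 60–70% range) are illustrative assumptions, not the authors' exact Equation (1).

```python
import numpy as np

def three_level_clip(frame, ratio=0.65):
    """Three-level centre clipping (assumed form): each sample becomes -1, 0 or +1."""
    frame = np.asarray(frame, dtype=float)
    # Threshold as a fraction of the maximum absolute amplitude in the frame.
    threshold = ratio * np.max(np.abs(frame))
    clipped = np.zeros(frame.shape, dtype=int)
    clipped[frame > threshold] = 1
    clipped[frame < -threshold] = -1
    return clipped
```

Because the clipped frame only holds the values −1, 0 and +1, later correlation-based pitch calculations reduce to additions and subtractions, which is one plausible source of the speed-up mentioned above.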
After clipping, the zero-crossing rate (ZCR) and short-time energy (STE) values are calculated to separate the voiced and unvoiced parts of the cry signal. The zero-crossing rate is an indication of the frequency at which the energy is concentrated in the signal spectrum. The energy of the voiced part is high because of its periodicity, and the unvoiced part has low energy. Short-time energy (STE) is defined as the mean of the squares of the sample values in an appropriate window. The STE is described mathematically by Equation (2);
where x(k) is the sample value within the appropriate window, w(n − k) represents the window function and N denotes the length of the window. The Hanning window is selected as it removes the aliasing effects that occur due to frame blocking.
ZCR is the rate at which the cry signal crosses zero. A voiced frame has a lower ZCR than an unvoiced frame. It is expressed mathematically in Equation (3).
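The two frame-level measures can be sketched as below. This is a minimal sketch that assumes the textbook definitions: STE as the mean squared value of the Hanning-windowed frame, and ZCR as the number of sign changes divided by the frame length. The function names are illustrative, and the exact normalisation in the paper's Equations (2) and (3) may differ.

```python
import numpy as np

def short_time_energy(frame):
    """Mean of the squared, Hanning-windowed sample values of one frame."""
    frame = np.asarray(frame, dtype=float)
    window = np.hanning(len(frame))
    return float(np.mean((frame * window) ** 2))

def zero_crossing_rate(frame):
    """Sign changes per sample: sum(|sgn(x[n]) - sgn(x[n-1])|) / (2N)."""
    frame = np.asarray(frame, dtype=float)
    signs = np.where(frame >= 0, 1, -1)  # assumption: zero counts as positive
    return float(np.sum(np.abs(np.diff(signs))) / (2 * len(frame)))
```

A frame would then be treated as voiced when its STE is high and its ZCR is low; the decision thresholds are not stated in the text and would have to be chosen empirically.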
The pre-processing step reduces the unvoiced signal in the infant cry and enhances the mother wave signal. The next significant step is feature extraction. In the proposed work, both audio and speech features are extracted to distinguish each signal. The cepstral and time-domain features are used to measure the audio features, and the power of the cry signal is calculated to extract the speech features.
Speech features
The frequency attributes of a new-born's cry vary across the cessation phase, the transitory phase and the inspiration phase. Intensity variation, fundamental frequency (F0), formants and duration are the most common auditory cues that carry prosodic information about the infant cry.
A. Pitch information
Pitch is an important aspect of any kind of sound signal, as it carries the essential frequency information of the cry signal. Initially, the loaded input cry signal is checked and the fundamental frequency range of 200 to 550 Hz is selected, with a time window of 15 to 40 ms. The peaks in the cepstrum domain are used to obtain an approximate estimate, and cross-correlation in the time domain is used to determine the initial pitch value. The cepstral peak is calculated from Equation (4);
where c(k) is the cepstral pitch and x(k) is the window coefficient.
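The cepstral part of this estimate can be sketched as follows. This is a minimal sketch that assumes the real cepstrum (inverse FFT of the log magnitude spectrum) and a peak search restricted to the quefrencies that correspond to 200–550 Hz. The function name, the Hanning window and the small constant added before the logarithm are illustrative assumptions, and the time-domain cross-correlation refinement is not shown.

```python
import numpy as np

def cepstral_pitch(frame, sample_rate, f_min=200.0, f_max=550.0):
    """Approximate pitch (Hz) from the largest cepstral peak in [f_min, f_max]."""
    frame = np.asarray(frame, dtype=float) * np.hanning(len(frame))
    spectrum = np.abs(np.fft.fft(frame))
    # Real cepstrum: inverse FFT of the log magnitude spectrum.
    cepstrum = np.fft.ifft(np.log(spectrum + 1e-12)).real
    # Quefrency (in samples) is the pitch period, so high pitch = low quefrency.
    q_min = int(sample_rate / f_max)
    q_max = int(sample_rate / f_min)
    peak = q_min + int(np.argmax(cepstrum[q_min:q_max + 1]))
    return float(sample_rate / peak)
```

For example, a 40 ms frame sampled at 8 kHz that contains an impulse train with a period of 32 samples yields an estimate of 250 Hz.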
For audio feature extraction, the Mel-Bark frequency cepstral domain is measured from the acoustic signal frames.
A. Mel-Frequency cepstral coefficients (MFCC)
The MFCC converts sound signals into a signal vector of Mel-frequency coefficients. This technique works similarly to human hearing below 1000 Hz. The Mel cepstrum is a short-term power spectrum of the signal, obtained through a linear cosine transformation of the log spectrum on a non-linear Mel frequency scale. MFCC starts by dividing the signal into frames of 15 to 40 milliseconds, known as frame blocking. The aliasing effects that occur due to frame blocking are removed by the Hanning window. Equation (1) shows the windowing process of the signal:
Here $W_h(t)$ is the Hanning window function and t is the number of samples in a single frame. The result of the windowing is followed by a Fourier transform (FT) that converts the time signal to the frequency domain. A frequency-domain filter bank is applied to the signal so that it is mapped to the Mel frequency scale. The equation for the Mel frequency is given below
MFCC uses a Mel-scale filter bank (logarithmically spaced triangular band-pass filters), so a higher-frequency filter has a larger bandwidth. The final step of MFCC is the Discrete Cosine Transformation (DCT), a tool used to measure audio signal similarity. After this processing step, the DCT coefficients are retained to generate a series of sound vectors called the Mel-frequency cepstral coefficients.
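The frequency mapping used by the Mel filter bank can be sketched as below. This sketch assumes the widely used form mel = 2595 · log10(1 + f / 700); the paper's own Mel equation is not reproduced in this text, so the constants and function names are assumptions.

```python
import math

def hz_to_mel(f_hz):
    """Map a frequency in hertz to the Mel scale (assumed common formula)."""
    return 2595.0 * math.log10(1.0 + f_hz / 700.0)

def mel_to_hz(mel):
    """Inverse mapping, used to place filter centre frequencies."""
    return 700.0 * (10.0 ** (mel / 2595.0) - 1.0)
```

With these constants 1000 Hz maps to roughly 1000 mel, which matches the statement that the scale follows human hearing closely below 1000 Hz and compresses higher frequencies.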
B. Bark Frequency Cepstral coefficient
Similar to MFCC, the BFCC warps the power cepstrum so that it corresponds with human loudness perception. The BFCC approach is identical to the MFCC technique except for two terms. The general equation to convert frequencies to the Bark scale is given below;
where b refers to the Bark frequency and f is the frequency in hertz. The mapped Bark frequency is processed through 18 filters. The centre frequencies of these filters are the same as the first 18 of the 24 important listening frequencies. The BFCC is obtained using the DCT of the Bark-frequency cepstrum, and the 10 DCT coefficients define the cepstrum amplitudes.
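The hertz-to-Bark conversion can be sketched as follows. This sketch assumes the Traunmüller approximation, b = 26.81 · f / (1960 + f) − 0.53, which the paper names later when comparing MFCC and BFCC; the function name is illustrative and the paper's exact equation is not reproduced in this text.

```python
def hz_to_bark(f_hz):
    """Map a frequency in hertz to the Bark scale (Traunmüller approximation)."""
    return 26.81 * f_hz / (1960.0 + f_hz) - 0.53
```

Unlike the Mel mapping, this form needs only a division and a subtraction rather than a logarithm.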
GMM modelling is a simple and effective statistical model that can create smooth approximations of any arbitrary data distribution. GMM applications show remarkable results in the detection and identification of speech and cry signals because of their capacity to represent the underlying data classes. In a GMM, the likelihood function for a feature vector x of dimension D is a weighted sum of K Gaussian densities $f_k(x)$, each parameterized by a D × 1 mean vector ($\mu_k$) and a D × D covariance matrix ($\Sigma_k$). The mathematical expression is given below;

$$p(x \mid \lambda) = \sum_{k=1}^{K} c_k f_k(x), \qquad f_k(x) = \frac{1}{(2\pi)^{D/2} |\Sigma_k|^{1/2}} \exp\left(-\frac{1}{2}(x - \mu_k)^{Tr} \Sigma_k^{-1} (x - \mu_k)\right)$$

where $\theta_k = (\mu_k, \Sigma_k)$ are the parameters of the k-th Gaussian density, and $A^{Tr}$ denotes the transpose of matrix A. Collectively, a GMM can be represented by its parameters as $\lambda = (c_k, \theta_k)$, $k = 1, \ldots, K$.
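The mixture likelihood can be evaluated with a few lines of NumPy. This is a minimal sketch of the standard GMM likelihood, a weighted sum of multivariate Gaussian densities; the function names are illustrative, and parameter estimation (for example with expectation-maximisation) is not shown.

```python
import numpy as np

def gaussian_density(x, mean, cov):
    """Multivariate Gaussian density with a D-vector mean and a D x D covariance."""
    d = len(mean)
    diff = x - mean
    norm = np.sqrt((2.0 * np.pi) ** d * np.linalg.det(cov))
    # solve(cov, diff) computes cov^{-1} diff without forming the inverse.
    return float(np.exp(-0.5 * diff @ np.linalg.solve(cov, diff)) / norm)

def gmm_likelihood(x, weights, means, covs):
    """Weighted sum of the component densities for one feature vector x."""
    return sum(c * gaussian_density(x, m, s)
               for c, m, s in zip(weights, means, covs))
```

In a classification setting, one such model would typically be fitted per class and a feature vector assigned to the class whose model gives the highest likelihood.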
The CNN is a deep learning architecture that is widely recommended in medical imaging for its remarkable classification accuracy. It has three main types of layers: input, hidden and output. The spectrogram obtained from the Mel-Bark frequency cepstral analysis has a fixed and equalized length for an effective training process. In this work, we developed a novel DCNN architecture with an additional modified feature map and a modified activation layer to classify the cry signal. The target of the DCNN is to classify the five classes of cry signals (“Eair”, “Eh”, “Neh”, “Heh” and “Owh”) from the spectrogram input. The detailed description of the proposed work, along with its architecture, is shown in Fig. 4.

DCNN architecture layers to classify the cry signal classes of infants from spectrogram signal.
Figure 4 illustrates the basic functional steps of the dense convolution neural network. The working of the DCNN architecture implemented in the proposed work is described below. Initially, the Flatten layer converts the 28×28 pixel input spectrogram into a vector, resulting in a feature space with a width of 784 pixels. The first layer of this DCNN has 512 neurons that are each connected to the 784 inputs, resulting in 784 × 512 + 512 = 401 920 weights to compute, including the biases. There are 407 050 coefficients in all. The Tanh function was originally used for activation. There are two convolutional layers based on 3×3 filters with average pooling, so the feature space is reduced from 32×32×3 down to 6×6×16. They are followed by two hidden dense layers of 120 and 84 neurons, and finally the same 10-neuron softmax layer to compute the probabilities. The total number of coefficients of the LeNet-5 is 101 770, a quarter of that of the Dense CNN. The convolution operation isolates background features at multiple scales through a dense convolution path. The output of the last fully connected layer is fed into the softmax layer, which distributes the probabilities over the class labels. The DCNN is a feed-forward network that efficiently detects the cry signals. Every layer is connected in a feed-forward manner, which reduces feature repetition. A transition block is provided to perform the layer transformation. All cry signal spectrograms were standardized, with the statistics computed on the training set. The hyperparameter settings of the proposed work are given in Table 1.
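The dense-layer coefficient counts quoted above can be checked with a short calculation. The sketch assumes a fully connected layer with one bias per neuron and a 10-neuron softmax output directly after the 512-neuron layer, as the description suggests; the helper name `dense_params` is illustrative.

```python
def dense_params(n_inputs, n_neurons):
    """Weights plus biases of one fully connected layer."""
    return n_inputs * n_neurons + n_neurons

hidden = dense_params(28 * 28, 512)  # flattened 28x28 input -> 512 neurons
output = dense_params(512, 10)       # 512 neurons -> 10-neuron softmax (assumed)
total = hidden + output

print(hidden, output, total)
```

The first layer accounts for 401 920 coefficients and the output layer for 5 130, which together give the stated 407 050.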
Hyper parameter setting
Initially, the audio segments are transformed into feature maps for CNN training and framed into sub-segments, following the process explained previously. Each frame is labeled with the target class of its audio segment. This database is used to train the CNN with supervised learning. Each variable is labeled as mentioned in the architecture, with a fixed length (τ = 22 ms). The input is an audio signal frame (x) from a sample of length (l), and the output target is a probability vector over the target classes. There are 30 filters of size 3×3 in the DCNN, with stride = 2. The Rectified Linear Unit (ReLU) is used to map features from a linear to a non-linear space. The size of its output is unaffected by the size of its input. It may be thought of as a threshold function whose output is zero when the input is negative. A major part of the training process is the hyperparameter setting and regularization of the DCNN. In the proposed work, weight decay and dropout are used for regularization to address the overfitting problem. Other regularization techniques, such as implicit regularization [31], data augmentation [32] and drop path [33], can be interpreted at the loss layer as regularization functions that show better regularization. The proposed method adopted dropout, which is effective on hidden nodes and controls the overfitting problem.
In the DCNN, the inputs given to the nodes are activated, batch normalization is applied after the first hidden layer, and neurons are dropped by the dropout layer at a rate of 0.2 before the values move into the next layer of neurons. Finally, there is a neuron with the probability value in the output layer. The DCNN is trained using MATLAB 2019b.
Table 2 gives a brief description of the five cry signal target classes that are classified using the DCNN technique.
Application of different regularization technique and its functional unit.
Description of Five Target Cry Signal Classes
A. Feature map transformation:
Initially, the input signal (x) is converted into a spectrogram with respect to the time and frequency axes as $P[f] = |L[f]|^2$ through the Fast Walsh-Fourier Transform (FWFT)
where L is the linear spectrogram signal, N is the FWFT length and [n, k] are the frequency variables. The frequency bands of the power spectrogram are combined with a Mel-Bark filter bank, using m filters in the Mel-frequency banks:
where MB is the Mel-Bark frequency band spectrogram. The MB value is then log-transformed through decibel conversion to compress the dynamic range of the power as;
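The three steps of the feature map transformation (power spectrogram, filter-bank aggregation and decibel conversion) can be sketched as below. The sketch assumes a precomputed filter-bank matrix of shape (m, number of frequency bins) and the usual 10 · log10 decibel conversion; the function name, the `floor` parameter and the matrix layout are assumptions, and the construction of the Mel-Bark filters themselves is not shown.

```python
import numpy as np

def log_band_energies(linear_spec, filter_bank, floor=1e-10):
    """Linear spectrogram (bins x frames) -> log band energies (m x frames) in dB."""
    power = np.abs(linear_spec) ** 2   # P[f] = |L[f]|^2
    bands = filter_bank @ power        # combine frequency bins into m bands
    # The floor avoids log(0) for silent bands (an implementation assumption).
    return 10.0 * np.log10(np.maximum(bands, floor))
```

The decibel conversion compresses the dynamic range, so that weak and strong bands contribute on a comparable scale to the CNN input.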
B. Classification of Sub segments:
The feature map $MB_{\log}$ is sectioned into x overlapping sub-segments with time frames (T), where
Here $i \in \{0, \ldots, x - 1\}$ is the sub-segment index of the time frame. Thus, the sub-segments are spread equally throughout the input length. Each sub-segment is processed individually in the CNN through the output softmax layer. The final target class output is derived from the probabilities that the CNN outputs for the complete input spectrogram.
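The sub-segment scheme can be sketched as follows. Two assumptions are made that the text does not confirm: the sub-segment start positions are spaced evenly between the first and last possible offsets, and the final class is taken from the mean of the per-segment softmax outputs. All names are illustrative.

```python
import numpy as np

def split_subsegments(feature_map, seg_len, n_seg):
    """Cut a (bands x frames) map into n_seg equally spread, overlapping segments."""
    total = feature_map.shape[1]
    if n_seg == 1:
        starts = [0]
    else:
        # Evenly spaced start frames from 0 to total - seg_len (assumed spacing).
        starts = [round(i * (total - seg_len) / (n_seg - 1)) for i in range(n_seg)]
    return [feature_map[:, s:s + seg_len] for s in starts]

def aggregate_predictions(segment_probs):
    """Average the per-segment class probabilities and pick the top class."""
    mean_prob = np.mean(np.asarray(segment_probs, dtype=float), axis=0)
    return int(np.argmax(mean_prob)), mean_prob
```

Averaging is only one way to combine segment outputs; majority voting over the per-segment classes would be an equally plausible reading of the description.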
According to certain estimates, this study requires a spatially-alternating activation function, which alternates across the space coordinates and operates as a space-domain activation function for the content in the first half and as a Wavelet-domain activation for the other half. The volume of the convolution feature maps is reduced using traditional 2D max-pooling.
In the proposed work, a deep convolution neural network with a modified activation function is designed to classify the target classes in the infant cry signal, with the intention of predicting the actual cry signal. The main challenge is the accurate prediction of the five different cry sounds, as the infant cry is very strong and the sounds are easily misheard in a manual process. The CNN, however, can fragment these signals into limited frames and measure the fine details to predict the sound class. The main objective of the MBFCC-DCNN model is to differentiate and classify the five target classes of the infant cry signal. The experimental outcome of the MBFCC-DCNN model is analysed in section 4.
In the proposed work, the DCNN architecture is trained using MATLAB 2019b. The audio data collected from the babies are clipped and processed into frames suitable for processing. The spectrogram signals obtained from the mother wavelet signals are isolated, and the features are extracted by the Mel-Bark frequency cepstral coefficient and classified in the DCNN. The performance on the test samples is measured through accuracy, sensitivity, specificity, true positive rate and false positive rate.
Data description
The database was collected from the NICUs of nearby hospitals and from neighbouring colonies. The activity of the babies was recorded continuously in the domestic environment and in hospital. The data were collected with the permission of the parents and hospital staff. The database consists of 27 sound signal streams, 12 from different infants and the rest from the same infant. There are seven female and male voices. The test takes were performed in a loud location, which results in a high MFCC failure rate, i.e. it failed to detect nearly 20% of the male and female cry sounds. The error rate can then be reduced by adding BFCC to the MFCC method. Table 4 shows the change in the error value of the MFCC feature extraction process for the 27 cry signals (16 male and 11 female) before and after the addition of the BFCC technique. Row one is the EER with 10 frequent expressions, and row two is for single-expression testing.
Equation error rate before and after the addition of BFCC
The performance comparison shows that the performance of MFCC improves when it is combined with BFCC in capturing the spectral signals in the high-frequency region. The MFCC effectively measures pitch perception, and the BFCC reflects loudness perception.
Both MFCC and BFCC perform similarly in the detection of the cry signal. The combination has some advantages as an extractor. BFCC can be measured using the Traunmüller equation, while the MFCC calculation requires a logarithm operation. It can recognize sound characteristics, so it can determine the sound pattern. Mel-Bark functions in a similar way to a human listener. The output vector does not remove sounds in the extraction process but has a small data size.
Figure 5 shows the experimental analysis of the spectrogram for different emotions of infant cry. The spectrogram indicates the intensity of the cry signal in different frequency bands. The spectrogram in Fig. 5b and 5c is sharp and clean, as the baby does not feel any pain during this expression and the frequency of the signal is clear. In Fig. 5d and 5e the signal is scattered and vague because the baby feels discomfort and tiredness in this condition.

Experimental analysis of spectrogram signal in different emotion modules of infant cry using Mel-Bark FCC and DCNN, a) shows the input cry signal of infants, b) shows spectrogram signal of “Eair/ Eh”, c) shows spectrogram signal of “Neh”, d) shows spectrogram signal of “Heh”, e) shows spectrogram signal of “owh”.
In the figure, the yellow bands are thick in the pain spectrogram because infants cry harder when they are in pain. The red and yellow bands are thin in a normal cry, and the spectrogram bands are distorted when the baby is sick or asphyxiated, because infants sound feeble when they are sick. From this analysis, the proposed Mel-Bark FCC method produces effective results in predicting the acoustic cry signal of crying babies.
Further, we analyzed the spectrogram signals of infants in different growth phases. We divided the growth phase of the infant from 0–1 year into four groups, i.e. 0–2 months, 2–5 months, 5–8 months and 8–12 months. The cry signal is collected from all four groups of babies, and the different sounds (Eair/Eh, Neh, Heh and Owh) are measured. Figure 6 shows the prediction efficiency of the cry signal in the different growth phases. From our analysis, we observed that the cry signal of babies in the initial growth phase (0–2 months) is thin and mild due to the immature vocal tract, so the features are difficult to predict at this stage. The prediction rate increases gradually as the voice gets sharper, and in the fourth phase (8–12 months) the prediction is more accurate than in the first growth phase.

Graph for prediction of cry sound in the different growth phases of babies, a) shows the Eair/ Eh sound prediction of babies in four growth phase, b) shows the Neh sound prediction of babies in four growth phase, c) shows the Heh sound prediction of babies in four growth phase, d) shows the Owh sound prediction of babies in four growth phase.
The experimental results were evaluated with accuracy, specificity, sensitivity and DSI. The statistical evaluation of the parameters is given below,
Accuracy:
Accuracy is the ability to classify the test samples correctly. The true positive and true negative values are calculated to estimate the accuracy of the test. The mathematical expression of the accuracy is;
Sensitivity:
Sensitivity is the ratio of correctly identified positive samples to all positive samples. It is estimated by analysing the proportion of true positives among the positive cases. The mathematical expression of the sensitivity is;
Specificity:
Specificity is the ability to identify the negative (healthy) cases correctly. It is estimated by analysing the proportion of true negatives among the negative cases. The mathematical expression of the specificity is;
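The three measures can be computed from the confusion counts as sketched below, assuming the standard definitions in terms of true positives (TP), true negatives (TN), false positives (FP) and false negatives (FN). The counts in the example are hypothetical and are not taken from the paper's results.

```python
def accuracy(tp, tn, fp, fn):
    """(TP + TN) / (TP + TN + FP + FN)."""
    return (tp + tn) / (tp + tn + fp + fn)

def sensitivity(tp, fn):
    """TP / (TP + FN): share of positive cases that are detected."""
    return tp / (tp + fn)

def specificity(tn, fp):
    """TN / (TN + FP): share of negative cases that are rejected."""
    return tn / (tn + fp)

# Hypothetical counts for one cry class treated as "positive" against the rest.
tp, tn, fp, fn = 45, 40, 10, 5
print(accuracy(tp, tn, fp, fn), sensitivity(tp, fn), specificity(tn, fp))
```

For a five-class problem, these values would be computed once per class in a one-versus-rest manner and then averaged.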
Tables 5, 6 and 7 show the comparative analysis of the proposed Mel-Bark DCNN technique against existing machine learning classifiers: K-Nearest Neighbour (KNN), Support Vector Machine (SVM) and Artificial Neural Network (ANN). Averaged over the five infant cry signal classes, the accuracy is 71.586% for KNN, 80% for SVM and 84% for ANN, while the proposed method achieves 93.038%. The average specificity is 72.478% for KNN, 81.726% for SVM and 86.88% for ANN, while the proposed method achieves 91.962%. The average sensitivity is 72.584% for KNN, 80.8% for SVM and 86.062% for ANN, while the proposed method achieves 91.894%. From the analysis, it is evident that the proposed technique has better efficiency in accuracy, specificity and sensitivity than basic classifiers such as KNN, SVM and ANN.
Comparative analysis of existing techniques with proposed DCNN for Accuracy
Comparative analysis of existing techniques with proposed DCNN for specificity
Comparative analysis of existing techniques with proposed DCNN for Sensitivity
Table 8 shows the comparative analysis of the proposed DCNN technique with state-of-the-art methods. From this comparison, it is evident that the proposed DCNN method has better efficiency in the classification of the target sound classes.
Comparison of Proposed DCNN technique with state-of-the-art methods
This paper focuses on infant cry signals and the prediction of the cry class, and shows that spectrograms are effective for audio classification. Initially, the input audio signal is converted into a spectrogram and clipped to select the specific frequency signal. Both the audio features and the speech features are then measured using Mel-Bark frequency cepstral coefficients, and the processed spectrogram is fed into the dense convolution neural network. In the proposed method, the target baby cry signal is divided into five classes: “Eair”, “Eh”, “Neh”, “Heh” and “Owh”. The DCNN is trained to classify these target classes. The output is evaluated with accuracy, specificity and sensitivity, and the output values of the DCNN classifier are compared with those of different ML classifiers to check the improvement offered by the proposed method. The proposed method gives better results than the other classifiers, attaining an average accuracy, specificity and sensitivity of 93.038%, 91.962% and 91.894% respectively using the Mel-Bark DCNN technique. The results show that incorporating Mel-Bark FCC into the DCNN gives superior results compared with competing techniques. In future, the work can be extended with a larger dataset and a modified deep learning network to achieve maximum efficiency. Various regularization techniques can also be implemented to improve the robustness of the network.
