Neonatal cry signal prediction and classification via dense convolution neural network

Abstract

The study of neonatal cry signals remains an active research topic, and researchers continue to work on systems that predict the actual reason for a baby's cry, which is hard to determine. The main focus of this paper is to develop a Dense Convolution Neural Network (DCNN) to predict the cry. The target cry signal is categorized into five classes based on sound: “Eair”, “Eh”, “Neh”, “Heh” and “Owh”. Predicting these signals helps to detect the reason for the infant's cry. The audio and speech features (AS features) were extracted from the cry signal spectrogram using Mel-Bark frequency cepstral coefficients and fed into the DCNN. The DCNN architecture is modelled with a modified activation layer to classify the cry signal. The cry signals were collected in different growth phases of the infants and tested on the proposed DCNN architecture. The performance of the system is measured through accuracy, specificity and sensitivity. The proposed system yielded a balanced accuracy of 92.31%. The highest accuracy of 95.31%, highest specificity of 94.58% and highest sensitivity of 93% were attained with the proposed technique. From this study, it is concluded that the proposed technique is more efficient in detecting cry signals than the existing techniques.

Keywords

Infant cry signal, spectrogram images, audio and speech features, Mel-Bark frequency cepstral domain, dense convolution neural network

1 Introduction

Unlike any other living species, humans have a unique and formal language to express their emotions. Infants, however, cannot express their thoughts in formal language and instead rely on their cries to convey their needs, so it is hard to predict the reason for a cry. To address this, researchers conducted a series of studies on the neonatal cry signal [1]. They categorized the cry signal into five classes, each reflecting a specific need: “Neh” for hunger, “Heh” for physical discomfort, “Eh” when a burp is needed, “Eair” for cramps and “Owh” for fatigue. This technique is commonly known as Dunstan baby language (DBL) [2]. It helps parents and caretakers not only to predict the reason for the infant cry but also to detect certain conditions such as asphyxia and deafness [3]. Analysing the acoustic waves of infant cries helps to detect the physical and health condition of babies [5, 6]. Acoustic sounds can be classified with machine learning networks [8], which process and segment the sound wave and define the distinct features of the infant cry signal [24]. Figure 1 shows the spectrogram signal of an infant cry with different emotions.

Fig. 1

Spectrogram of infant cry signal: (a) “Eair”, (b) “Eh”, (c) “Neh”, (d) “Heh” and (e) “Owh”.

In general, audio features fall into four domains: wavelet, cepstral, time and prosodic [12]. The Mel Frequency Cepstral Coefficient (MFCC), a familiar technique for sound recognition, is commonly used to convert sound into a voice signal vector [16, 25]. This technique provides a short-term spectral representation of a signal and acts similarly to human hearing [13, 28], but it is often affected by the shape, size and number of filters used [17]. The Gaussian Mixture Model (GMM) is a probabilistic graph model used to predict variables from a given dataset [19]. In cry signal processing, the GMM effectively identifies the relevant inspiration and expiration dilation [21, 22]. Prior work on Convolution Neural Networks (CNN) [9, 23], such as P-Resnet [11], AlexNet [30] and Inception Net [20], has produced efficient outcomes in speech and audio signal processing. In the current work, the feature extraction step extracts both audio features and speech features. This technique efficiently captures the acoustic signal under diverse conditions. In addition, a novel Dense Convolution Neural Network (DCNN) is developed to classify and categorize the cry reason based on the frequency of the sound waves. The features employed in this system are used to classify the five target classes of the cry signal.

The main objectives of the proposed system are to select appropriate features from the neonatal cry signal, to extract AS (audio and speech) features and to classify five target classes. The Mel-Bark Frequency Cepstral Coefficient (MBFCC) is implemented to convert the sound waves into acoustic signal vectors. A Gaussian Mixture Model (GMM) is used to analyse and capture all the possible features of the cry signal in adverse conditions. A modified Dense Convolution Neural Network (DCNN) is developed to train on and classify the target classes of the cry signal.

The paper is organized as follows: Section 2 (related studies) explains the existing works on infant cry signals and their drawbacks, Section 3 gives a comprehensive description of the proposed work, Section 4 presents the experimental analysis and results, and Section 5 concludes the paper.

2 Related studies

Computer-based research has produced multiple techniques to evaluate neonatal health status based on the classification of infant cry signals. Several machine learning and deep learning algorithms have been developed to classify cry signals. Some of these techniques are reviewed briefly in this section.

In 2015, Alaie, H.F., Abou-Abbas, L. and Tadj, C. developed a pathological classification technique for infant cry signals using a Gaussian Mixture Model (GMM). Acoustic analysis of noisy infant cry signals was used to measure the quantitative characteristics of healthy and sick infant cries. Static and dynamic Mel-Frequency Cepstral Coefficients (MFCC) were extracted from both the inspiratory and expiratory vocalizations of the cry to form a discriminative feature vector. A Boosting Mixture Learning technique was then developed to detect normal and abnormal cry signals. This technique does not contain sufficient variability to train the GMM for all possible health conditions [10].

In 2016, Chang et al. put forward an automatic infant cry detection model using deep learning techniques. The audio files were stored in the waveform audio format, and frames below 0.1 dB were removed to reduce noise. The cry signal was converted into a spectrogram using the Fast Fourier Transform, and a convolution neural network was then trained to recognize the signal. Dropout was added to reduce overfitting in the CNN architecture. However, the system has limited features for detecting the cry reason and a lower classification accuracy [4].

In 2016, Lavner et al. designed a deep learning technique to detect baby cries in the domestic environment. First, a low-complexity logistic regression classifier was used as a reference, with MFCC, pitch and formants extracted to train the classifier. Second, a more complex CNN was designed to operate on the Mel filter bank of the recording. This system detects the cry sound of neonates between 0 and 6 months and can also detect talking sounds and door opening. However, it cannot predict the reason for the baby cry [29].

In 2018, Naithani et al. developed a Hidden Markov Model (HMM) for segmenting the acoustic parts of the infant cry signal. The cry signals were obtained from different environments with various disturbances. Audio features such as frequency and aperiodicity were measured to detect and optimize the performance of the system. Each class had three HMM states, and each state was modelled with 10 Gaussian components. A two-step adaptation method with feature normalization and semi-supervised learning was then developed, which yielded 80.7% accuracy. However, the inspiration phase showed poor performance [7].

In 2019, Le et al. developed a classification technique to predict the cry reason of neonates using spectrogram images. Transfer learning with a pre-trained ResNet50 CNN, together with an SVM, was used for the classification process. Spectrogram images were chosen to classify the audio signals that use MFCC features. The technique combined deep learning models to improve the result. The ResNet and SVM models were chosen for their simplicity and efficient performance. This technique mainly focuses on reducing false positives. It does not classify the reason for the infant cry [15].

In 2019, Severini et al. developed a deep neural network approach with single-channel and multi-channel neural networks. They conducted an experimental evaluation on a synthetic dataset built from the acoustic scene of a NICU and on a real dataset. The evaluation reveals a few concerns about microphone array orientation and position. Log-Mel coefficients are calculated to extract the features, and the observed spectral audio signal is used to predict the cry signal. The evaluation shows that the SE-DNN system performs better on the same dataset, but it does not classify the cry signal [18].

In 2019, Dewi, S.P., Prasasti, A.L. and Irawan, B. developed an infant cry signal detection technique for signals with a high fundamental frequency. The audio features are extracted from input signal frames of 10 to 40 ms using LFCC and KNN algorithms. The technique also uses MFCC, SNN and VQ algorithms for classification. The only difference between MFCC and LFCC is the filter bank used. In the first stage the sound signals are classified, and in the next stage the results are analysed for all the classification samples. However, the accuracy decreases as the number of test samples increases [26].

From the literature survey, we found that most existing systems focus only on the detection of the cry sound and do not address the classification of the cry signal. Some approaches face difficulties in training and classifying the signals. To overcome these training difficulties and classify the cry signal, a modified DCNN system is designed in the proposed work.

3 Design and methods

The infant cry signal prediction system focuses on predicting the five target classes of the infant cry signal: “Eair”, “Eh”, “Neh”, “Heh” and “Owh”. The principal concept is to represent the input as a spectrogram over the time and frequency axes, which simplifies processing in the CNN. The spectrogram signal is subjected to Mel-Bark frequency cepstral analysis to aggregate the frequency bands according to perceptual weight. The basic processing steps are data acquisition, pre-processing, feature extraction, feature selection and classification, as shown in Fig. 2.

Fig. 2

Basic structure of the AS Feature and DCNN system.

Fig. 3

Flow diagram representation of the AS features work system.

3.1 Pre-processing

In signal processing, denoising is an initial and essential step that converts the input signal into a form suitable for processing. It removes unwanted noise such as environmental noise, speech interference and other artifacts from the input signal. Initially, the collected audio recordings are fragmented into sequential overlapping slices of 1256 samples, about 15 to 40 ms in length. These slices are further sectioned into frames of 18 ms with a step size of 9 ms. Each frame is clipped to calculate the pitch frequency. The audio signals are clipped with a 3-level method, given in Equation (1); $f(x) = \begin{cases} 1, & x > x_{f} \\ 0, & -x_{f} \leqslant x \leqslant x_{f} \\ -1, & x < -x_{f} \end{cases}$ (1)

where x_f is the clipping threshold, which normally ranges from 60 to 70% of the maximum amplitude. The calculation of the frame's pitch frequency can be accelerated by about 10 times after clipping.
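As an illustration of Equation (1), the three-level clipping can be sketched in Python. This is a minimal sketch and not the implementation used in this work: the function name `three_level_clip` and the default ratio of 0.65 (one value inside the 60 to 70% range) are assumptions, and the threshold is taken to be symmetric about zero.

```python
import numpy as np


def three_level_clip(frame, ratio=0.65):
    """Three-level clipping of one frame, following Equation (1).

    `ratio` is an assumed value inside the 60-70% range quoted in the text.
    """
    frame = np.asarray(frame, dtype=float)
    x_f = ratio * np.max(np.abs(frame))   # clipping threshold
    clipped = np.zeros_like(frame)        # samples inside [-x_f, x_f] map to 0
    clipped[frame > x_f] = 1.0            # samples above the threshold map to +1
    clipped[frame < -x_f] = -1.0          # samples below the negative threshold map to -1
    return clipped


# Hypothetical frame used only to exercise the function.
example_frame = np.array([0.1, 0.9, -0.2, -1.0, 0.5])
example_clipped = three_level_clip(example_frame)
```

Only the samples whose magnitude exceeds the threshold survive as ±1, which is why the subsequent pitch search operates on a much sparser signal.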

3.1.1 Zero crossing rate and Short-time energy

After clipping, the zero-crossing rate (ZCR) and short-time energy (STE) are calculated to separate the voiced and unvoiced parts of the cry signal. The zero-crossing rate is an indication of the frequency at which the energy is concentrated in the signal spectrum. The energy of the voiced part is high because of its periodicity, whereas the unvoiced part has low energy. Short-time energy is defined as the mean of the squared sample values in an appropriate window, as described by Equation (2); $E_{st} = \frac{1}{N} \sum_{k = 0}^{N - 1} {[x (k) w (n - k)]}^{2}$ (2)

where x (k) is the k-th sample of the cry signal, w (n - k) represents the window function and N denotes the length of the window. The Hanning window is selected as it removes the aliasing effects that occur due to frame blocking.
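Equation (2) can be illustrated with a short Python sketch. It assumes that the frame has already been cut out of the recording and that NumPy's `np.hanning` is an acceptable stand-in for the Hanning window; the function name `short_time_energy` is hypothetical.

```python
import numpy as np


def short_time_energy(frame):
    """Mean of the squared, Hanning-windowed samples of one frame (Equation 2)."""
    frame = np.asarray(frame, dtype=float)
    window = np.hanning(len(frame))        # window function w
    return float(np.mean((frame * window) ** 2))


# Doubling the amplitude of a frame multiplies its short-time energy by four.
quiet = short_time_energy(np.ones(16))
loud = short_time_energy(2.0 * np.ones(16))
```

A simple voiced/unvoiced decision would then compare this value against a threshold, whose choice is outside the scope of this sketch.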

3.1.2 Zero crossing rate

ZCR is the rate of zero crossings in the cry signal. A voiced frame has a low ZCR compared to an unvoiced frame. It is mathematically expressed in Equation (3); $Z_{c} = \frac{1}{N} \sum_{k = 0}^{N - 1} | sign [x (k)] - sign [x (k - 1)] | w (n - k)$ (3)

where $sign [x (k)] = \begin{cases} 1, & x (k) \geqslant 0 \\ -1, & x (k) < 0 \end{cases}$
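A minimal Python sketch of Equation (3) is given below. Two simplifications are assumed: a rectangular window (w = 1), and a sum that starts at the second sample because the first sample of a frame has no predecessor. The function name `zero_crossing_rate` is hypothetical.

```python
import numpy as np


def zero_crossing_rate(frame):
    """Zero-crossing rate of one frame following Equation (3), with w = 1."""
    frame = np.asarray(frame, dtype=float)
    signs = np.where(frame >= 0, 1, -1)    # sign[x(k)] as defined in the text
    # Each sign change contributes |1 - (-1)| = 2 to the sum.
    return float(np.sum(np.abs(np.diff(signs))) / len(frame))


constant_zcr = zero_crossing_rate([1.0, 1.0, 1.0, 1.0])      # no crossings
alternating_zcr = zero_crossing_rate([1.0, -1.0, 1.0, -1.0])  # three crossings
```

With the equation as written, every crossing counts twice, so the alternating frame of four samples gives 6/4 = 1.5 rather than 0.75.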

3.2 Feature extraction

The pre-processing step reduces the unvoiced signal in the infant cry and enhances the mother wave signal. The next significant step is feature extraction. In the proposed work, both audio and speech features are extracted to distinguish each signal. Cepstral and time-domain features are used as audio features, while the power of the cry signal is calculated to extract the speech features.

3.2.1 Speech features

The frequency attributes of newborn cry sounds vary across the cessation phase, the transitory phase and the inspiration phase. Intensity variation, fundamental frequency (F0), formants and duration are the most common auditory cues that carry prosodic information about the infant cry.

A. Pitch information

Pitch is an important aspect of any kind of sound signal, as it carries the essential frequency information of the cry signal. The loaded input cry signal is first checked, and the fundamental frequency range between 200 and 550 Hz is selected with a time window of 15 to 40 ms. Peaks in the cepstrum domain give an approximate estimate, and cross-correlation in the time domain is then used to determine the initial pitch value. The cepstrum is calculated from Equation (4); $c (k) = IDFT (\log | DFT (x (k)) |)$ (4)

where c (k) is the cepstrum used for pitch estimation and x (k) is the windowed signal frame.
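The cepstral stage of the pitch estimate can be sketched in Python as follows. This is an illustrative sketch only: it covers the approximate cepstral peak search of Equation (4), not the cross-correlation refinement, and the function name `cepstral_pitch`, the small constant added before the logarithm and the synthetic test tone are all assumptions. The 200 to 550 Hz search range is taken from the text.

```python
import numpy as np


def cepstral_pitch(frame, sample_rate, f_min=200.0, f_max=550.0):
    """Approximate pitch (Hz) from the peak of the real cepstrum (Equation 4)."""
    frame = np.asarray(frame, dtype=float) * np.hanning(len(frame))
    log_magnitude = np.log(np.abs(np.fft.rfft(frame)) + 1e-12)  # log|DFT(x)|
    cepstrum = np.fft.irfft(log_magnitude)                      # IDFT of the log spectrum
    q_min = int(sample_rate / f_max)   # shortest pitch period in samples
    q_max = int(sample_rate / f_min)   # longest pitch period in samples
    peak = q_min + int(np.argmax(cepstrum[q_min:q_max + 1]))
    return sample_rate / peak


# Synthetic 40 ms harmonic tone at 400 Hz (hypothetical test signal).
sample_rate = 16000
t = np.arange(640) / sample_rate
tone = sum(np.sin(2 * np.pi * 400.0 * k * t) / k for k in range(1, 11))
estimated_pitch = cepstral_pitch(tone, sample_rate)
```

The search is restricted to quefrencies that correspond to the expected pitch range, which keeps the low-quefrency spectral envelope from being mistaken for the pitch peak.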

3.2.2 Audio features

For audio feature extraction, the Mel-Bark frequency cepstral domain is measured from the acoustic signal frames.

A. Mel-Frequency cepstral coefficients (MFCC)

MFCC converts sound signals into a vector of Mel-frequency coefficients. This technique works similarly to human hearing below 1000 Hz. The Mel-frequency cepstrum represents the short-term power spectrum of the signal, based on a linear cosine transform of the log spectrum on a non-linear Mel frequency scale. MFCC starts by dividing the signal into frames of 15 to 40 milliseconds, a step known as frame blocking. The aliasing effects caused by frame blocking are removed by the Hanning window. Equation (5) shows the windowing function: $W_{h} (t) = (1 - \cos (2 π t f_{0})) / 2$ (5)

Here W_h (t) is the Hanning window function and t indexes the samples in a single frame. The windowed result is passed through a Fourier transform (FT) that converts the time signal to the frequency domain. A frequency-domain filter bank is then applied to the signal to map it onto the Mel frequency scale. The conversion to Mel frequency is given by $F_{Mel} = 2595 \log_{10} (1 + \frac{f}{700})$ (6)

MFCC uses a Mel-scale filter bank (triangular band-pass filters spaced logarithmically), so the higher-frequency filters have larger bandwidths. The final step of MFCC is the Discrete Cosine Transform (DCT), which decorrelates the log filter-bank energies. The DCT coefficients are retained to generate a series of sound vectors called the Mel-frequency cepstral coefficients.
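The Mel mapping of Equation (6) is simple enough to check numerically. The following Python sketch uses a hypothetical function name `hz_to_mel` and only restates the formula given above.

```python
import math


def hz_to_mel(frequency_hz):
    """Convert a frequency in hertz to the Mel scale (Equation 6)."""
    return 2595.0 * math.log10(1.0 + frequency_hz / 700.0)


mel_at_1000 = hz_to_mel(1000.0)   # close to 1000 Mel by construction of the scale
mel_at_4000 = hz_to_mel(4000.0)
```

The scale is compressive: the step from 1000 Hz to 4000 Hz covers far fewer Mel units than the 3000 Hz difference would suggest, which is why the higher-frequency filters are wider in hertz.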

B. Bark Frequency Cepstral coefficient

Similar to MFCC, BFCC warps the power cepstrum so that it corresponds to human loudness perception. The BFCC approach is identical to the MFCC technique except for two terms. The general equation to convert frequencies to the Bark scale is given below; $F_{bark} = 13 \arctan (0.00076 f) + 3.5 \arctan ({(\frac{f}{7500})}^{2})$ (7)

where F_bark is the Bark frequency and f is the frequency in hertz. The Bark-mapped frequency is processed through 18 filters. The centre frequencies of these filters are the same as the first 18 of the 24 critical bands of hearing. The BFCC is obtained by applying the DCT to the Bark-frequency cepstrum, and the 10 DCT coefficients define the cepstrum amplitudes.
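Equation (7) can be evaluated in the same way. The Python sketch below uses a hypothetical function name `hz_to_bark` and only restates the conversion formula; it does not construct the 18-filter bank.

```python
import math


def hz_to_bark(frequency_hz):
    """Convert a frequency in hertz to the Bark scale (Equation 7)."""
    return (13.0 * math.atan(0.00076 * frequency_hz)
            + 3.5 * math.atan((frequency_hz / 7500.0) ** 2))


bark_at_1000 = hz_to_bark(1000.0)
bark_values = [hz_to_bark(f) for f in (100.0, 500.0, 1000.0, 4000.0, 8000.0)]
```

The mapping increases monotonically with frequency, so equally spaced Bark bands translate into progressively wider bands in hertz.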

3.3 Gaussian mixture model

The GMM is a simple and effective statistical model that can create smooth approximations of any arbitrary data distribution. GMMs show remarkable results in the detection and identification of speech and cry signals because of their capacity to represent the underlying data classes. In a GMM, the likelihood function is a weighted sum of G component densities f_k (x) over the D-dimensional feature vector x, each parameterized by a D × 1 mean vector (μ_k) and a D × D covariance matrix (Σ_k). The mathematical expression is given below; $F (x | λ_{g}) = \sum_{k = 1}^{G} c_{k} f_{k} (x) = \sum_{k = 1}^{G} c_{k} ℵ (x | Φ_{k}) = \sum_{k = 1}^{G} c_{k} ℵ (x | μ_{k}, Σ_{k})$ (8)

where λ_g signifies the parameters of the Gaussian mixture, which consists of G components whose mixture weights must satisfy the following two constraints: c_k ≥ 0 for k = 1, ..., G and $\sum_{k = 1}^{G} c_{k} = 1$. The k-th component can be written in the following notation: $f_{k} (x) = ℵ (x | Φ_{k}) = ℵ (x | μ_{k}, Σ_{k}) = \frac{1}{{(2 π)}^{\frac{D}{2}} {| Σ_{k} |}^{\frac{1}{2}}} \exp (- \frac{1}{2} {(x - μ_{k})}^{Tr} Σ_{k}^{- 1} (x - μ_{k}))$ (9)

where Φ_k = (μ_k, Σ_k) are the parameters of the k-th Gaussian density, and A^Tr denotes the transpose of matrix A. Altogether, a GMM can be represented by its parameters as λ_g = (c_k, Φ_k), k = 1, ..., G.
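To make the mixture likelihood concrete, a Gaussian component density and its weighted sum can be evaluated with the short Python sketch below. The function names `gaussian_density` and `gmm_likelihood` are hypothetical, the parameters are assumed to be already estimated, and no fitting procedure is shown.

```python
import numpy as np


def gaussian_density(x, mean, cov):
    """Multivariate Gaussian density with mean vector `mean` and covariance `cov`."""
    x, mean, cov = (np.asarray(a, dtype=float) for a in (x, mean, cov))
    d = mean.shape[0]
    diff = x - mean
    norm = (2.0 * np.pi) ** (d / 2.0) * np.sqrt(np.linalg.det(cov))
    # solve(cov, diff) computes cov^{-1} diff without forming the inverse explicitly
    return float(np.exp(-0.5 * diff @ np.linalg.solve(cov, diff)) / norm)


def gmm_likelihood(x, weights, means, covs):
    """Weighted sum of component densities; weights must be >= 0 and sum to 1."""
    return sum(c * gaussian_density(x, m, s)
               for c, m, s in zip(weights, means, covs))


# A single standard-normal component in two dimensions, evaluated at the origin.
single = gmm_likelihood(np.zeros(2), [1.0], [np.zeros(2)], [np.eye(2)])
```

For this single component the value at the mean is 1/(2π), which matches the normalizing constant of a two-dimensional Gaussian with identity covariance.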

3.4 Dense convolution neural network (DCNN)

The CNN architecture is a deep learning network that is widely recommended in medical imaging for its remarkable classification accuracy. It has three main types of layers: input, hidden and output. The spectrogram signal obtained from the Mel-Bark frequency cepstral analysis has a fixed and equalized length for an effective training process. In this work, we developed a novel DCNN architecture with a modified additional feature map and a modified activation layer to classify the cry signal. The target of the DCNN is to classify the five classes of cry signals (“Eair”, “Eh”, “Neh”, “Heh” and “Owh”) from the spectrogram input signal. The detailed description of the proposed work along with its architecture is shown in Fig. 4.

Fig. 4

DCNN architecture layers to classify the cry signal classes of infants from spectrogram signal.

3.4.1 Architecture design of DCNN

Figure 4 illustrates the basic functional steps of the dense convolution neural network. The working of the DCNN architecture implemented in the proposed work is described below. Initially, the Flatten layer converts the 28×28 pixel input spectrogram into a vector, resulting in a feature space of 784 values. The first layer of this DCNN has 512 neurons, each connected to all 784 inputs, resulting in 784 × 512 + 512 = 401 920 weights to compute, including the biases. There are 407 050 coefficients in all. The Tanh function was originally used for activation. There are two convolutional layers based on 3×3 filters with average pooling. The feature space is thus reduced from 32×32×3 down to 6×6×16. They are followed by two hidden dense layers of 120 and 84 neurons, and finally the same 10-neuron softmax layer to compute the probabilities. The total number of coefficients of the LeNet-5 is 101 770, a quarter of the Dense CNN. The convolution operation isolates background features at multiple scales through a dense convolution path. The output of the last fully connected layer is passed to the softmax layer, which distributes the class label probabilities. The DCNN is a feed-forward network that efficiently detects the cry signals. Every layer is connected in a feed-forward manner, which reduces feature repetition. A transition block is provided to perform the layer transformation. All spectrogram cry signals were standardized, with the statistics computed on the training set. The hyperparameter settings of the proposed work are given in Table 1.

Table 1
Hyperparameter settings

Parameter	Value
Number of neurons in each layer	512
Learning rate	0.01
Activation function	ReLU
Number of epochs	41
Batch size	1000
Dropout	0.2
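The parameter counts quoted in Section 3.4.1 can be reproduced with a few lines of arithmetic. The Python sketch below assumes fully connected layers with one bias per neuron for the dense counts and, for the spatial size trace, unpadded 3×3 convolutions with stride 1 followed by 2×2 pooling; these padding and pooling settings are assumptions, since the text does not state them, and the helper names are hypothetical.

```python
def dense_params(n_inputs, n_outputs):
    """Weights plus biases of a fully connected layer."""
    return n_inputs * n_outputs + n_outputs


first_layer = dense_params(28 * 28, 512)    # 784 inputs -> 512 neurons
output_layer = dense_params(512, 10)        # 512 neurons -> 10-neuron softmax
total_dense = first_layer + output_layer


def conv_output(size, kernel=3, stride=1):
    """Spatial size after an unpadded convolution (assumed setting)."""
    return (size - kernel) // stride + 1


def pool_output(size, pool=2):
    """Spatial size after non-overlapping pooling (assumed 2x2)."""
    return size // pool


size = 32
for _ in range(2):                          # two convolution + pooling stages
    size = pool_output(conv_output(size))
```

Under these assumptions the first layer has 401 920 parameters, the dense network has 407 050 in total, and the spatial size shrinks from 32 to 6, in agreement with the 6×6×16 feature space stated in the text.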

3.4.2 Classification procedure of DCNN

Initially, the audio segments are transformed into feature maps for CNN training and framed into sub-segments following the process explained previously. Each frame is labeled with the target class of its audio segment. This database is used to train the CNN with supervised learning. Each sub-segment has the fixed length mentioned in the architecture (τ = 22 ms). The input is an audio signal frame (x) of sample length (l), and the output is a vector of probabilities over the target classes. There are 30 filters of size 3×3 in the DCNN, with stride = 2. The Rectified Linear Unit (ReLU) is used to map the features from a linear to a non-linear space. The size of its output is unaffected by the size of its input. It can be thought of as a threshold function whose output is zero when the input is negative. A major part of the training process is the hyperparameter setting and regularization of the DCNN. In the proposed work, weight decay and dropout are used for regularization to address the overfitting problem. Other regularization techniques such as implicit regularization [31], data augmentation [32] and drop path [33] can be interpreted as regularization functions at the loss layer and also provide effective regularization. The proposed method adopts dropout, which is effective on the hidden nodes and controls the overfitting problem.

In the DCNN, the inputs given to the nodes are activated, batch normalization is applied after the first hidden layer, and neurons are dropped by the dropout layer at a rate of 0.2 before the values move into the next layer. Finally, each neuron in the output layer holds a probability value. The DCNN is trained using MATLAB 2019b.

Table 2 summarizes the regularization techniques and the units they act on, and Table 3 gives a brief description of the five cry signal target classes that are classified using the DCNN technique.

Table 2
Application of different regularization technique and its functional unit.

Regularization technique	Unit and working
Weight decay	Weight (training process)
DropConnect	Sample training
Batch normalization	Feature map extraction
Dropout	Hidden layers (testing and training)
Implicit regularization	Loss function (training process)

Table 3

Description of Five Target Cry Signal Classes

Target class sound	Qualitative description
Eair	Infant needs to pass stool or feels gassy.
Eh	Infant needs to burp.
Neh	Infant is hungry or thirsty.
Heh	Infant is physically uncomfortable (hot, cold or wet).
Owh	Infant needs rest or feels sleepy.

A. Feature map transformation:

Initially, the input signal x is converted into a spectrogram over the time and frequency axes as $P [f] = | L [f] |^{2}$ through the Fast Walsh-Fourier Transform (FWFT); $L [f] = \sum_{k = 0}^{N - 1} x [k] \cdot walsh [f, k]$ (10)

where L is the linear spectrogram signal, N is the FWFT length, and f and k are the frequency and sample indices. The frequency bands of the power spectrogram P are combined with a Mel-Bark filter bank of m filters H_b, b = 1, ..., m: $MB [b] = \sum_{f = 0}^{N / 2} H_{b} [f] \cdot P [f]$ (11)

where MB is the Mel-Bark band spectrogram. The MB value is then log-transformed through decibel conversion to compress the dynamic range of the power; ${MB}_{\log} [b] = 10 \cdot \log_{10} (MB [b])$ dB
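The filter-bank summation of Equation (11) and the decibel conversion can be illustrated with a small Python sketch. The filter matrix below is a toy example and not the Mel-Bark filter bank of this work; the function name `band_log_energies` and the small floor added before the logarithm are assumptions.

```python
import numpy as np


def band_log_energies(power_spectrum, filter_bank, floor=1e-12):
    """Apply a filter bank to a power spectrum and convert the result to dB.

    `filter_bank` has one row per filter and one column per frequency bin.
    `floor` is an assumed guard against taking the logarithm of zero.
    """
    power_spectrum = np.asarray(power_spectrum, dtype=float)
    filter_bank = np.asarray(filter_bank, dtype=float)
    band_energy = filter_bank @ power_spectrum      # sum over frequency bins per filter
    return 10.0 * np.log10(band_energy + floor)     # decibel conversion


# Toy power spectrum with five bins and two hypothetical filters.
toy_power = np.ones(5)
toy_filters = np.array([[1.0, 0.0, 0.0, 0.0, 0.0],
                        [2.0, 2.0, 2.0, 2.0, 2.0]])
toy_log_energies = band_log_energies(toy_power, toy_filters)
```

In this toy case the two band energies are 1 and 10, so the log-compressed values are approximately 0 dB and 10 dB, which shows how the conversion compresses a tenfold power ratio into a 10 dB step.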

B. Classification of Sub segments:

The feature map MB_log is sectioned into x overlapping sub-segments of T time frames each, where $T = \frac{τ}{h}$; $S_{i} (b) = {MB}_{\log} [{Sh}_{i} + b]$ (12)

Here $i \in \{0, \ldots, x - 1\}$ is the sub-segment index. Thus, the sub-segments are spread equally throughout the input length. Each sub-segment is processed individually by the CNN through the output softmax layer. The final target class is derived from the CNN output probabilities over the complete input spectrogram signal.
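One possible reading of this procedure is sketched below in Python. The even spacing of the sub-segments follows the text, but averaging the per-segment softmax outputs to obtain the final class is an assumption made for illustration, since the text does not specify how the probabilities are combined. The function names and the toy inputs are hypothetical.

```python
import numpy as np


def split_subsegments(feature_map, frames_per_segment, n_segments):
    """Cut a (bands x frames) feature map into evenly spread, overlapping sub-segments."""
    n_frames = feature_map.shape[1]
    starts = np.linspace(0, n_frames - frames_per_segment, n_segments)
    starts = starts.round().astype(int)
    return [feature_map[:, s:s + frames_per_segment] for s in starts]


def aggregate_predictions(segment_probabilities):
    """Average per-segment class probabilities (assumed rule) and pick the top class."""
    mean_probabilities = np.mean(np.asarray(segment_probabilities, dtype=float), axis=0)
    return int(np.argmax(mean_probabilities)), mean_probabilities


# Toy feature map with 4 bands and 10 time frames, cut into 3 sub-segments of 4 frames.
toy_map = np.arange(40, dtype=float).reshape(4, 10)
toy_segments = split_subsegments(toy_map, frames_per_segment=4, n_segments=3)

# Hypothetical softmax outputs of three sub-segments over two classes.
toy_class, toy_mean = aggregate_predictions([[0.6, 0.4], [0.2, 0.8], [0.3, 0.7]])
```

Other combination rules, such as majority voting over the sub-segment decisions, would fit the same description equally well.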

According to certain estimates, this study requires a spatially-alternating activation function that alternates across the space coordinates, operating as a space-domain activation for the first half of the content and as a wavelet-domain activation for the other half. The volume of the convolution feature maps is reduced using traditional 2D max-pooling.

In the proposed work, a deep convolution neural network with a modified activation function is designed to classify the target classes of the infant cry signal, with the intention of predicting the actual cry reason. The main challenge is the accurate prediction of five different cry sounds, since the infant cry is very loud and the sounds are easily misheard in a manual process. The CNN, however, can fragment these signals into limited frames and measure the fine details to predict the sound class. The main objective of the MBFCC-DCNN model is to differentiate and classify the five target classes of the infant cry signal. The experimental outcome of the MBFCC-DCNN model is analysed in Section 4.

4 Result and discussion

In the proposed work, the DCNN architecture is trained using MATLAB 2019b. Audio data collected from the babies are clipped and split into frames suitable for processing. The spectrogram signals obtained from the mother wavelet signals are isolated, and the features are extracted with the Mel-Bark frequency cepstral coefficient and classified by the DCNN. The performance on the test samples is measured through accuracy, sensitivity, specificity, true-positive rate and false-positive rate.

4.1 Data description

The database was collected from the NICUs of nearby hospitals and from neighbouring colonies. The activity of the babies was recorded continuously in the domestic environment and in the hospital. The data were collected with the permission of the parents and the hospital staff. The database consists of 27 sound signal streams, 12 from different infants and the rest from the same infant. There are seven female and male voices. The test takes were performed in a noisy location, which resulted in a high MFCC failure rate, i.e. it failed to detect nearly 20% of the male and female cry sound signals. The error rate can be reduced by adding BFCC to the MFCC method. Table 4 shows the change in the error values of the MFCC feature extraction process for the 27 cry signals (16 male and 11 female) before and after the addition of the BFCC technique. Row one is the EER with 10 frequent expressions, and row two is single-expression testing.

Table 4
Equal error rate (EER) before and after the addition of BFCC

Before (Male)	Before (Female)	After (Male)	After (Female)
3.61%	8.03%	3.02%	4.12%
3.45%	6.7%	2.43%	3.95%

The performance comparison shows that MFCC improves when it is combined with BFCC in capturing the spectral signals in the high-frequency region. MFCC effectively measures pitch perception, and BFCC reflects loudness perception.

4.2 Analysis using Mel-Bark frequency cepstral coefficient (Mel-Bark FCC)

MFCC and BFCC perform similarly in the detection of the cry signal. The combined extractor has some advantages, such as:

BFCC can be measured using the Traunmüller equation, while the MFCC calculation requires a logarithm operation.

It can recognize sound characters so it can determine the sound pattern.

Mel-bark functions in a similar way to how a human listener performs.

The output vector does not remove sounds in the extraction process but has a small data size.

Figure 5 shows the experimental analysis of the spectrogram signal for different emotions of the infant cry. The spectrogram indicates the intensity of the cry signal in different bands. The spectrogram in Fig. 5b and 5c is sharp and clean, as the baby does not feel any pain during these expressions and the frequency of the signal is clear. In Fig. 5d and 5e the signal is scattered and vague because the baby feels discomfort and tiredness in these conditions.

Fig. 5

Experimental analysis of the spectrogram signal in different emotion modules of infant cry using Mel-Bark FCC and DCNN: (a) input cry signal of infants, (b) spectrogram signal of “Eair/Eh”, (c) spectrogram signal of “Neh”, (d) spectrogram signal of “Heh” and (e) spectrogram signal of “Owh”.

In the figure, the yellow bands are thick in the pain spectrogram because infants cry harder when they are in pain. The red and yellow bands are thin in a normal cry, and the spectrogram bands are distorted when the baby is sick or asphyxiated because the infant sounds feeble when sick. This analysis indicates that the proposed Mel-Bark FCC method is effective in predicting the acoustic cry signal of infants.

Further, we analysed the spectrogram signals of infants in different growth phases. We divided the growth phase from 0 to 1 year into four modules: 0 to 2 months, 2 to 5 months, 5 to 8 months and 8 to 12 months. The cry signal was collected from all four groups of babies, and the different sounds (Eair/Eh, Neh, Heh and Owh) were measured. Figure 6 shows the prediction efficiency for the cry signal in the different growth phases of babies. We observed that the cry signal of babies in the initial growth phase (0 to 2 months) is thin and mild due to the immature vocal tract, so the features are difficult to predict at this stage. The prediction rate increases gradually as the voice gets sharper, and in the fourth phase (8 to 12 months) the prediction is more accurate than in the first growth phase.

Fig. 6

Prediction of cry sounds in the four growth phases of babies: (a) Eair/Eh sound prediction, (b) Neh sound prediction, (c) Heh sound prediction and (d) Owh sound prediction.

4.3 Performance measure analysis

The experimental results were evaluated with accuracy, specificity, sensitivity and DSI. The statistical evaluation of the parameters is given below,

Accuracy:

Accuracy is the proportion of samples that are classified correctly. The true positive (TP) and true negative (TN) counts must be calculated to estimate the accuracy of the test. The mathematical expression of the accuracy is; $Accuracy = \frac{\text{true positives} + \text{true negatives}}{\text{all samples}}$ (13)

Sensitivity:

The sensitivity is the ratio of correctly identified positive cases to all actual positive cases. It is estimated by analysing the proportion of true positives (TP) among the positive cases, where FN denotes the false negatives. The mathematical expression of the sensitivity is; $Sensitivity = \frac{TP}{TP + FN}$ (14)

Specificity:

Specificity is the ability to identify healthy (negative) cases correctly. It is estimated as the proportion of true negatives (TN) among all healthy cases, i.e. true negatives plus false positives (FP). The mathematical expression of the specificity is: $Specificity = \frac{TN}{TN + FP}$ (15)
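As an illustration, the three measures can be computed directly from confusion-matrix counts using the standard definitions of accuracy, sensitivity and specificity. This is a minimal sketch rather than the evaluation code used in the study: the function name `binary_metrics` is hypothetical, and for the five-class cry problem it assumes that the counts are taken one class at a time (one-vs-rest), which the text does not state.

```python
def binary_metrics(tp: int, tn: int, fp: int, fn: int) -> dict:
    """Accuracy, sensitivity and specificity from confusion-matrix counts.

    tp/fn count the positive cases that were detected/missed;
    tn/fp count the negative cases that were rejected/falsely flagged.
    """
    total = tp + tn + fp + fn
    if total == 0:
        raise ValueError("at least one sample is required")
    return {
        # Share of all samples that were classified correctly.
        "accuracy": (tp + tn) / total,
        # Share of actual positives that were detected: TP / (TP + FN).
        "sensitivity": tp / (tp + fn) if (tp + fn) else float("nan"),
        # Share of actual negatives that were rejected: TN / (TN + FP).
        "specificity": tn / (tn + fp) if (tn + fp) else float("nan"),
    }


# Illustrative counts only; these are not values from the study.
example = binary_metrics(tp=45, tn=40, fp=10, fn=5)
```

Here 85 of the 100 illustrative samples are classified correctly, 45 of the 50 positive cases are detected, and 40 of the 50 negative cases are rejected.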

Tables 5, 6 and 7 show the comparative analysis of existing machine learning classifiers, namely K-Nearest Neighbour (KNN), Support Vector Machine (SVM) and Artificial Neural Network (ANN), against the proposed Mel-Bark DCNN technique. Averaged over the five infant cry signals, the accuracy is 71.586% for KNN, 79.528% for SVM and 84.182% for ANN, while the proposed method reaches 93.038%. The average specificity is 72.478% for KNN, 81.726% for SVM and 86.88% for ANN, while the proposed method reaches 91.962%. The average sensitivity is 72.584% for KNN, 80.8% for SVM and 86.062% for ANN, while the proposed method reaches 91.894%. From this analysis, it is evident that the proposed technique performs better in accuracy, specificity and sensitivity than the basic classifiers KNN, SVM and ANN.

Table 5

Comparative analysis of existing techniques with proposed DCNN for Accuracy

Sample spectrogram signals of infant cry	KNN	SVM	ANN	Mel-bark DCNN (proposed)
Signal 1	75.18	80.12	83.45	93.45
Signal 2	76.25	83.12	86.12	95.31
Signal 3	72.14	80.14	85.12	92.14
Signal 4	69.15	75.01	80.14	93.21
Signal 5	65.21	79.25	86.08	91.08
Average	71.586	79.528	84.182	93.038

Table 6

Comparative analysis of existing techniques with proposed DCNN for specificity

Sample spectrogram signals of infant cry	KNN	SVM	ANN	Mel-bark DCNN (proposed)
Signal 1	76.89	80.69	85.45	90.32
Signal 2	74.78	81.65	89.47	94.58
Signal 3	73.24	85.21	83.45	91.65
Signal 4	69.03	79.54	87.26	92.81
Signal 5	68.45	81.54	88.77	90.45
Average	72.478	81.726	86.88	91.962

Table 7

Comparative analysis of existing techniques with proposed DCNN for Sensitivity

Sample spectrogram signals of infant cry	KNN	SVM	ANN	Mel-bark DCNN (proposed)
Signal 1	73.19	79.31	85.14	90.25
Signal 2	76.34	80.54	86.21	92.65
Signal 3	79.15	81.32	86.47	92.54
Signal 4	69.03	80.69	88.24	91.02
Signal 5	65.21	82.14	84.25	93.01
Average	72.584	80.8	86.062	91.894
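The averages reported in Tables 5, 6 and 7 can be reproduced from the per-signal values. The following sketch transcribes those values and recomputes the mean for each classifier; the names `RESULTS` and `column_means` are illustrative, not part of the original study.

```python
from statistics import mean

# Per-signal values (%) for Signals 1-5, transcribed from Tables 5, 6 and 7.
RESULTS = {
    "accuracy": {
        "KNN": [75.18, 76.25, 72.14, 69.15, 65.21],
        "SVM": [80.12, 83.12, 80.14, 75.01, 79.25],
        "ANN": [83.45, 86.12, 85.12, 80.14, 86.08],
        "Mel-bark DCNN": [93.45, 95.31, 92.14, 93.21, 91.08],
    },
    "specificity": {
        "KNN": [76.89, 74.78, 73.24, 69.03, 68.45],
        "SVM": [80.69, 81.65, 85.21, 79.54, 81.54],
        "ANN": [85.45, 89.47, 83.45, 87.26, 88.77],
        "Mel-bark DCNN": [90.32, 94.58, 91.65, 92.81, 90.45],
    },
    "sensitivity": {
        "KNN": [73.19, 76.34, 79.15, 69.03, 65.21],
        "SVM": [79.31, 80.54, 81.32, 80.69, 82.14],
        "ANN": [85.14, 86.21, 86.47, 88.24, 84.25],
        "Mel-bark DCNN": [90.25, 92.65, 92.54, 91.02, 93.01],
    },
}


def column_means(table: dict) -> dict:
    """Mean over the five signals for each classifier, rounded to 3 decimals."""
    return {clf: round(mean(values), 3) for clf, values in table.items()}


if __name__ == "__main__":
    for metric, table in RESULTS.items():
        print(metric, column_means(table))
```

Each recomputed mean matches the corresponding "Average" row, which confirms that the averages are simple arithmetic means over the five sample signals.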

Table 8 shows the comparative analysis of the proposed DCNN technique with state-of-the-art methods. From this comparison, it is evident that the proposed DCNN method achieves better efficiency than the compared methods in every target sound class.

Table 8

Comparison of Proposed DCNN technique with state-of-the-art methods

Prediction of sound cry signals	SVM [4]	KNN [2]	CNN [22]	CNN-RNN [27]	Proposed DCNN
Eair	86.36	88.45	89.56	90.14	93.45
Eh	85.14	82	90.54	92.12	94.58
Neh	76.48	75.14	80.45	86.12	91.45
Heh	86.47	84.5	89.47	91.57	93.45
Owh	95.14	85	90.18	91.54	96.89
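To make the per-class comparison in Table 8 concrete, the sketch below finds the strongest compared method for each sound class and the margin by which the proposed DCNN exceeds it. The values are transcribed from Table 8; the names `TABLE_8` and `margin_over_best_baseline` are illustrative.

```python
# Per-class values (%) transcribed from Table 8.
TABLE_8 = {
    "Eair": {"SVM": 86.36, "KNN": 88.45, "CNN": 89.56, "CNN-RNN": 90.14, "Proposed DCNN": 93.45},
    "Eh": {"SVM": 85.14, "KNN": 82.0, "CNN": 90.54, "CNN-RNN": 92.12, "Proposed DCNN": 94.58},
    "Neh": {"SVM": 76.48, "KNN": 75.14, "CNN": 80.45, "CNN-RNN": 86.12, "Proposed DCNN": 91.45},
    "Heh": {"SVM": 86.47, "KNN": 84.5, "CNN": 89.47, "CNN-RNN": 91.57, "Proposed DCNN": 93.45},
    "Owh": {"SVM": 95.14, "KNN": 85.0, "CNN": 90.18, "CNN-RNN": 91.54, "Proposed DCNN": 96.89},
}


def margin_over_best_baseline(row: dict, proposed: str = "Proposed DCNN") -> tuple:
    """Return (best baseline name, proposed minus best baseline) for one class."""
    baselines = {name: value for name, value in row.items() if name != proposed}
    best = max(baselines, key=baselines.get)
    return best, round(row[proposed] - baselines[best], 2)


if __name__ == "__main__":
    for sound, row in TABLE_8.items():
        best, margin = margin_over_best_baseline(row)
        print(f"{sound}: best baseline {best}, margin {margin} percentage points")
```

The margin is positive for all five classes. The strongest compared method is CNN-RNN for Eair, Eh, Neh and Heh, and SVM for Owh.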

5 Conclusion

This paper focuses on infant cry signals and their prediction, and shows that spectrograms are effective for audio classification. Initially, the input audio signal is converted into a spectrogram signal and clipped to select the specific frequency range. Then both the audio features and the speech features are measured using Mel-Bark frequency cepstral coefficients, and the processed spectrogram signal is fed into the dense convolution neural network. In the proposed method, the target baby cry signal is divided into five classes: “Eair”, “Eh”, “Neh”, “Heh” and “Owh”. The DCNN system is trained to classify these target classes. The output is evaluated with accuracy, specificity and sensitivity. The output values of the DCNN classifier are compared with those of different ML classifiers to assess the proposed method. The proposed method yields better results than the other classifiers, attaining an average accuracy, specificity and sensitivity of 93.038%, 91.962% and 91.894% respectively using the Mel-Bark DCNN technique. The results show that incorporating Mel-Bark FCC into the DCNN gives superior results over competing techniques. In future, the work can be extended with a larger dataset and a modified deep learning network to improve efficiency further. Also, various regularization techniques can be implemented to improve the robustness of the network.

References

1. Chittora and Patil H.A., Classification of normal and pathological infant cries using bispectrum features, In 2015 23rd European Signal Processing Conference (EUSIPCO) (pp. 639–643). IEEE.

2. Mahmoud A.M., Swilem S.M., Alqarni A.S. and Haron, December. Infant Cry Classification Using Semi-supervised K-Nearest Neighbor Approach, In 2020 13th International Conference on Developments in eSystems Engineering (DeSE) (pp. 305–310). IEEE.

3. Savareh B.A., Hosseinkhani and Jafari, Infant Crying Classification by Using Genetic Algorithm and Artificial Neural Network, Acta Medica Iranica (2020), 531–539.

4. Chang C.Y. and Li J.J., Application of deep learning for recognizing infant cries. In 2016 IEEE International Conference on Consumer Electronics-Taiwan (ICCE-TW) (pp. 1–2). IEEE. (2016).

5. Chang C.Y., Chang C.W., Kathiravan, Lin and Chen S.T., DAG-SVM based infant cry classification system using sequential forward floating feature selection, Multidimensional Systems and Signal Processing 28(3) (2017), 961–976.

6. Anders, Hlawitschka and Fuchs, Automatic classification of infant vocalization sequences with convolutional neural networks, Speech Communication 119 (2020), 36–45.

7. Naithani, Kivinummi, Virtanen, Tammela, Peltola M.J. and Leppänen J.M., Automatic segmentation of infant cry signals using Gaussian Mixture Models, EURASIP Journal on Audio, Speech, and Music Processing 2018(1) (2018), 1–14.

8. Sharma, Umapathy and Krishnan, Trends in audio signal feature extraction methods, Applied Acoustics 158 (2020), 107020.

9. Park and Yoo C.D., CNN-based learnable gammatone filterbank and equal-loudness normalization for environmental sound classification, IEEE Signal Processing Letters 27 (2020), 411–415.

10. Alaie H.F., Abou-Abbas and Tadj, Cry-based infant pathology classification using GMMs, Speech Communication 77 (2016), 28–52.

11. Llombart, Ribas, Miguel, Vicente, Ortega and Lleida, Progressive loss functions for speech enhancement with deep neural networks, EURASIP Journal on Audio, Speech, and Music Processing 2021(1) (2021), 1–16.

12. Saraswathy, Hariharan, Nadarajaw, Khairunizam and Yaacob, Optimal selection of mother wavelet for accurate infant cry classification, Australasian Physical & Engineering Sciences in Medicine 37(2) (2014), 439–456.

13. Wermke, Robb M.P. and Schluter P.J., Melody complexity of infants’ cry and non-cry vocalisations increases across the first six months, Scientific Reports 11(1) (2021), 1–11.

14. Abou-Abbas, Tadj and Fersaie H.A., A fully automated approach for baby cry signal segmentation and boundary detection of expiratory and inspiratory episodes, The Journal of the Acoustical Society of America 142(3) (2017), 1318–1331.

15. , Kabir A.N.M., Ji, Basodi and Pan, November. Using transfer learning, SVM, and ensemble classification to classify baby cries based on their spectrogram images, In 2019 IEEE 16th International Conference on Mobile Ad Hoc and Sensor Systems Workshops (MASSW) (pp. 106–110). IEEE. (2019).

16. Liu, Li and Kuo, March. Infant cry signal detection, pattern extraction and recognition, In 2018 International Conference on Information and Computer Technologies (ICICT) (pp. 159–163). IEEE. (2018).

17. Novamizanti, Prasasti A.L. and Utama B.S., December. Study of Linear Discriminant Analysis to Identify Baby Cry Based on DWT and MFCC. In IOP Conference Series: Materials Science and Engineering (Vol. 982, No. 1, p. 012009). IOP Publishing (2020).

18. Severini, Ferretti, Principi and Squartini, Automatic detection of cry sounds in neonatal intensive care units by using deep learning and acoustic scene simulation, IEEE Access 7 (2019), 51982–51993.

19. Thaine and Penn, September. Extracting Mel-Frequency and Bark-Frequency Cepstral Coefficients from Encrypted Signals, In INTERSPEECH (pp. 3715–3719). (2019).

20. Ting P.J., Ruan S.J. and Li L.P.H., Environmental Noise Classification with Inception-Dense Blocks for Hearing Aids, Sensors 21(16) (2021), 5406.

21. Cohen, Ruinskiy, Zickfeld, IJzerman and Lavner, Baby cry detection: deep learning and classical approaches. In Development and Analysis of Deep Learning Architectures (pp. 171–196). Springer, Cham. (2020).

22. Hershey, Chaudhuri, Ellis, Gemmeke J.F., Jansen, Moore R.C., Plakal, Platt, Saurous R.A., Seybold and Slaney, CNN architectures for large-scale audio classification, In 2017 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) (pp. 131–135). IEEE. (2017).

23. Kwon, A CNN-assisted enhanced audio signal processing for speech emotion recognition, Sensors 20(1) (2020), 183.

24. Nagarajan, Rengarajan, Manoharan and Baskaran K.D., Infant cry analysis for emotion detection by using feature extraction methods, In Proceedings of WRFER International Conference (pp. 66–69). (2017).

25. Sharma, Asthana and Mittal V.K., A database of infant cry sounds to study the likely cause of cry, In Proceedings of the 12th International Conference on Natural Language Processing (pp. 112–117). (2015).

26. Dewi S.P., Prasasti A.L. and Irawan, The study of baby crying analysis using MFCC and LFCC in different classification methods, In 2019 IEEE International Conference on Signals and Systems (ICSigSys) (pp. 18–23). IEEE. (2019).

27. Maghfira T.N., Basaruddin and Krisnadhi, April. Infant cry classification using CNN–RNN. In Journal of Physics: Conference Series (Vol. 1528, No. 1, p. 012019). IOP Publishing. (2020).

28. Yao, Plötz, Johnson and Barbaro K.D., Automated detection of infant holding using wearable sensing: Implications for developmental science and intervention, Proceedings of the ACM on Interactive, Mobile, Wearable and Ubiquitous Technologies 3(2) (2019), 1–17.

29. Lavner, Cohen, Ruinskiy and IJzerman, Baby cry detection in domestic environment using deep learning, In 2016 IEEE International Conference on the Science of Electrical Engineering (ICSEE) (pp. 1–5). IEEE. (2016).

30. Zhang, Du, Wang, Zhang and Tu, Attention based fully convolutional network for speech emotion recognition. In 2018 Asia-Pacific Signal and Information Processing Association Annual Summit and Conference (APSIPA ASC) (pp. 1771–1775). IEEE. (2018).

31. Zheng, Yang, Zhang and Zhang, Improvement of generalization ability of deep CNN via implicit regularization in two-stage training process, IEEE Access 6 (2018), 15844–15869.

32. Zheng, Yang, Tian, Jiang and Wang, A full stage data augmentation method in deep convolutional neural network for natural image classification, Discrete Dynamics in Nature and Society (2020).

33. Zheng, Tian, Yang, Wu and Su, PAC-Bayesian framework based drop-path method for 2D discriminative convolutional network pruning, Multidimensional Systems and Signal Processing 31(3) (2020), 793–827.

