Neural network-based speech fuzzy enhancement algorithm for smart home interaction

Abstract

With the rapid development of artificial intelligence and machine learning, speech recognition technology is advancing quickly and its accuracy is improving to meet people's growing expectations of smart home devices; combining smart homes with speech recognition is therefore an inevitable trend. This study proposes a neural network-based speech fuzzy enhancement algorithm for smart home interactive speech recognition. It combines a fuzzy neural network (FNN) with a stacked autoencoder (SAE) to form the SAE-FNN algorithm, which has better non-linear characteristics and achieves better feature learning, thus improving the performance of the whole system. The results show that, with the SAE-FNN algorithm, the absolute value of the maximum relative error, the average relative error and the root mean square error are 0.355, 0.063 and 0.978, respectively, which are significantly lower than those of the other two individual algorithms, and that noise in the sound signal has little effect on the SAE-FNN algorithm. The proposed SAE-FNN algorithm therefore shows excellent noise immunity, and the neural network-based speech fuzzy enhancement algorithm for smart home interaction is highly feasible.

Keywords

Smart home; fuzzy neural network algorithm; stacked self-encoder; speech emotion recognition; speech recognition; feature extraction

1. Introduction

A smart home uses the home as a carrier to centralise the management of household facilities [1]. It is convenient, comfortable, safe and environmentally friendly, and it can bring intelligent and stylish living to families and improve the quality of life [2]. In recent years, the development of machine learning and artificial intelligence, together with the rapid growth of Internet media and the widespread use of mobile phone applications [3], has greatly expanded speech recognition corpora and provided a large number of training examples for acoustic and language modelling. Combined with the strong learning capability of these models [4], this allows model parameters to be optimised and tuned to improve the recognition rate. The introduction of intelligent speech recognition into the smart home frees people from manual operation and thus creates a new kind of smart home [5]. Based on this, the research proposes a neural network-based speech fuzzy enhancement algorithm for smart home interaction. The rest of the paper is organised as follows: the second part reviews the research status, at home and abroad, of neural network-based speech fuzzy enhancement for smart home interaction; the third part describes the sound signal pre-processing and the construction of the model; and the fourth part evaluates the performance of the model.

2. Related work

Smart homes did not reach their present level of development overnight; in recent years a number of scholars at home and abroad have explored the field in depth and promoted its development.

The multiple kernel learning (MKL) algorithm proposed by Chang and Zhao [6] not only maintained good performance on the original classes but also improved the recognition of new classes from only a few training samples. To solve the optimal scheduling problem of smart home energy management systems, Zhao and Keerthisinghe [7] proposed a fast real-time control strategy under uncertainty that uses the Bellman optimality condition and long short-term memory recurrent neural networks (LSTM-RNN), with errors in the range of 0.8% and a saving of at least 20% in computation time. Observing that high response time and network load are the limiting factors for IoT deployment, Pustokhina et al. [8] proposed an approach based on fuzzy logic controllers to improve the reliability of IoT gateways. Kulkarni and Kulkarni [9] proposed a fuzzy neural network (FNN) for pattern classification which, unlike the clustering algorithms otherwise used to build the hidden layer of an RBFNN, is fast to train and to recall and is reported to guarantee 100% accuracy on any training set. To solve the control problem of complex robotic systems with uncertainties and disturbances, Zheng et al. [10] proposed a fuzzy system-fuzzy neural network-backstepping control scheme that guarantees accurate, stable and efficient control; their experimental results confirmed the stability of the algorithm.

Kagan et al. [11] built on the widely accepted model of biological systems based on Tsetlin automata acting in a stochastic environment. Their neurons implement multi-valued non-heterogeneous operators applied to aggregated inputs and internal states and are then used to build networks; the study reported that the resulting algorithm is stable. Ai et al. [12] addressed the problems of repeated observations and the lack of a full graph in ensemble learning by using attention-integrated convolutional recurrent neural networks (ACRNN) for imbalanced speech emotion recognition; extensive experiments on the IEMOCAP and Emo-DB corpora demonstrated the superiority of their method. Song et al. [13] proposed a speech-based method for automatically detecting frustration during game interactions; building on continuous improvements in the use of convolutional neural networks (CNN) for speech recognition tasks, the method raised the unweighted average recall (UAR) from 58.8% to 93.1%. Alam et al. [14] surveyed state-of-the-art deep neural network architectures and algorithms for speech and vision applications and systems. Their study gives a comprehensive survey of recent developments in intelligent speech and vision applications from the perspective of both software and hardware systems, and suggests that these developments will revolutionise future research and development of speech and vision systems. Saadi and Belhadef [15] presented a deep neural network-based system for extracting specific entities from natural language text with an accuracy of 96.81%, better than the previous state-of-the-art results.

The research of domestic and international scholars shows that there are many studies on the fuzzy neural network (FNN) and on the stacked self-encoder (SAE) individually, but relatively few on fusing and optimising SAE and FNN to achieve interactive speech recognition for smart homes. This research therefore studies smart home interactive speech recognition by integrating the SAE and FNN algorithms. Compared with a single algorithm, the proposed algorithm improves the recognition rate to a greater extent, and its feasibility is verified experimentally.

3. Construction of a neural network-based fuzzy enhancement algorithm model for smart home interaction speech

3.1 Preprocessing method of sound signal

The usual pre-processing steps for smart home interactive speech are pre-emphasis, framing and windowing [16]. Pre-emphasis applies a digital filter to the voice signal to improve its high-frequency performance. The pre-emphasis filter is usually a first-order digital filter with the expression shown in Eq. (1).

$\displaystyle H(z)=1-\alpha z^{-1}$ (1)

In Eq. (1), $\alpha$ is the pre-emphasis coefficient, with $\alpha\in(0.9,1)$; its value is usually chosen from experience, and in this study $\alpha$ is 0.9375. Most speech signal processing and analysis must be carried out on a stationary signal, but real speech is inevitably mixed with disturbances such as random noise, which make it non-stationary. Although a long speech signal is non-stationary, it can be regarded as stationary over a short time interval. Therefore, to make full use of the short-time characteristics of the voice signal, the signal is usually framed and windowed: it is divided into a number of short-time signals, each of which is called a frame. The frame length is usually between 20 ms and 30 ms. Framing is carried out with overlap, that is, adjacent frames share some samples; the offset between the starts of successive frames is known as the frame shift and is usually 1/3 to 2/3 of the frame length. Overlapping frames are used to ensure smooth transitions between adjacent frames, because successive frames of the voice signal are correlated [17]. Fig. 1 shows a schematic diagram of framing.
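The pre-emphasis filter of Eq. (1) and the overlapping framing described above can be sketched as follows. This is a minimal illustration rather than the study's implementation: the function names, the 16 kHz sampling rate, the 25 ms frame length and the half-frame shift are all assumptions chosen for the example; only $\alpha=0.9375$ comes from the text.

```python
def pre_emphasis(signal, alpha=0.9375):
    """Apply y[n] = x[n] - alpha * x[n-1], the time-domain form of H(z) = 1 - alpha z^-1."""
    if not signal:
        return []
    out = [signal[0]]  # x[-1] is taken as 0, so the first sample passes through unchanged
    for n in range(1, len(signal)):
        out.append(signal[n] - alpha * signal[n - 1])
    return out


def split_frames(signal, frame_len, frame_shift):
    """Cut the signal into overlapping frames; a trailing partial frame is dropped."""
    frames = []
    start = 0
    while start + frame_len <= len(signal):
        frames.append(signal[start:start + frame_len])
        start += frame_shift
    return frames


# Assumed numbers for illustration: 16 kHz sampling, 25 ms frames, shift of half a frame.
sample_rate = 16000
frame_len = sample_rate * 25 // 1000   # 400 samples
frame_shift = frame_len // 2           # 200 samples
signal = [float(i % 50) for i in range(1000)]   # synthetic test signal
emphasised = pre_emphasis(signal)
frames = split_frames(emphasised, frame_len, frame_shift)
```

With a shift of half a frame, the second half of each frame is repeated as the first half of the next one, which is what provides the smooth transition between frames.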

Figure 1.

Framing diagram.

After framing, each segment of the speech signal is a very short, finite signal, so extracting features from it involves truncation: spectral leakage occurs when the Fourier transform is applied to the finite signal, resulting in anomalous boundary values. To reduce this truncation effect, a window function must be applied to each frame. The rectangular window and the Hamming window are the two most common forms [18]; their functional forms are shown in Eqs (2) and (3), respectively.

$\displaystyle w(n)=\left\{\begin{array}{ll}1,&0\leqslant n\leqslant N-1\\ 0,&\text{otherwise}\end{array}\right.$ (2)

$\displaystyle w(n,\alpha)=\left\{\begin{array}{ll}(1-\alpha)-\alpha\cos\left(\frac{2\pi n}{N-1}\right),&0\leqslant n\leqslant N-1\\ 0,&\text{otherwise}\end{array}\right.$ (3)

In Eqs (2) and (3), $N$ denotes the window length and $\alpha$ denotes the window coefficient, which is usually taken as 0.46. This study uses the Hamming window for the framing and windowing of speech signals. In general, the louder the sound, the more energy that part of the signal carries; conversely, a quieter sound carries less energy. In terms of speech emotion, emotions such as anger and surprise generally have higher energy, while emotions such as sadness and fear have lower energy [19]. The short-time energy of a speech signal can therefore reflect emotion to some extent. It is defined as the sum of squares of the windowed sample points in each frame of speech, as shown in Eq. (4).

$\displaystyle E_{n}=\sum\limits_{m=0}^{N-1}{x_{n}^{2}}(m)$ (4)

In Eq. (4), $E_{n}$ denotes the short-time energy of the $n$-th frame of the voice, $x_{n}(m)$ denotes the sound signal after windowing, and $N$ is the window length. The short-time zero-crossing rate is the rate at which the waveform crosses the zero level, counted over the sampling points of a frame; it is used to distinguish between unvoiced and voiced sounds and between the presence and absence of sound. The short-time zero-crossing rate is defined as shown in Eq. (5).

$\displaystyle Z_{n}=\frac{1}{2}\sum\limits_{m=1}^{N-1}{\left|{\text{sgn}(x_{n}(m))-\text{sgn}(x_{n}(m-1))}\right|}$ (5)

In Eq. (5), $Z_{n}$ represents the short-time zero-crossing rate of the $n$-th frame of the voice, $x_{n}(m)$ represents the sound signal after windowing, $N$ is the window length and $\text{sgn}(x)$ is the sign function, which is defined as shown in Eq. (6).

$\displaystyle\text{sgn}(x)=\left\{{\begin{array}[]{l}1,x>0\\ 0,x=0\\ -1,x<0\\ \end{array}}\right.$ (6)
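The Hamming window of Eq. (3) and the two short-time features of Eqs (4) to (6) can be written out directly. The sketch below is only an illustration under simple assumptions: the function names are hypothetical, the eight-sample alternating frame is synthetic, and the window length is assumed to be at least 2.

```python
import math


def hamming(N, alpha=0.46):
    """Eq. (3): w(n) = (1 - alpha) - alpha * cos(2*pi*n / (N - 1)), for N >= 2."""
    return [(1 - alpha) - alpha * math.cos(2 * math.pi * n / (N - 1)) for n in range(N)]


def sgn(x):
    """Eq. (6): sign function."""
    return 1 if x > 0 else (-1 if x < 0 else 0)


def short_time_energy(frame):
    """Eq. (4): sum of squared samples of one windowed frame."""
    return sum(x * x for x in frame)


def zero_crossing_rate(frame):
    """Eq. (5): half the sum of |sgn(x(m)) - sgn(x(m-1))| for m = 1..N-1."""
    return 0.5 * sum(abs(sgn(frame[m]) - sgn(frame[m - 1])) for m in range(1, len(frame)))


# Synthetic frame that changes sign at every sample.
frame = [1.0, -1.0, 1.0, -1.0, 1.0, -1.0, 1.0, -1.0]
window = hamming(len(frame))
windowed = [x * w for x, w in zip(frame, window)]
energy = short_time_energy(windowed)
zcr = zero_crossing_rate(windowed)
```

Because the Hamming window is strictly positive, windowing changes the energy of a frame but not the signs of its samples, so the zero-crossing count is the same before and after windowing.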

It has been found that the human ear perceives sound differently in different frequency bands. At low frequencies, the relationship between perceived pitch and physical frequency is approximately linear, while at higher frequencies it is logarithmic. To accommodate these auditory characteristics of the human ear, Davis and Mermelstein proposed the Mel-frequency cepstral coefficients (MFCC). The Mel frequency used by the MFCC is related logarithmically to the conventional frequency, as shown in Eq. (7).

$\displaystyle\textit{Mel}(f)=2595\times\log_{10}(1+f/700)$ (7)

In Eq. (7), $f$ is the conventional frequency in Hz, and $\textit{Mel}(f)$ denotes the Mel frequency in units of Mel. After the correlation between the individual dimensions of the voice signal is eliminated, the MFCC features are obtained as shown in Eq. (8).

$\displaystyle\textit{MFCC}_{n}=\sqrt{\frac{2}{N}}\sum\limits_{k=1}^{M}\ln\theta(k)\cos\left[(k-0.5)\frac{n\pi}{M}\right],\quad n=1,2,\ldots,L$ (8)

In Eq. (8), $\theta(k)$ denotes the energy of the $k$-th triangular filter, $L$ denotes the number of MFCC features, and $M$ denotes the number of filters. The MFCC describes the static characteristics of speech; the dynamic characteristics can be obtained by first- or second-order differencing of the MFCC. The extraction process is shown in Fig. 2.

Figure 2.

MFCC extraction schematic.
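The Mel mapping of Eq. (7) and the cosine transform of Eq. (8) can be illustrated with a short sketch. Several points are assumptions made for the example and are not taken from the study: the function names, the filter-bank energies (which are made-up numbers), the use of $\sqrt{2/M}$ as the normalising constant, and the plain frame-to-frame difference used as a stand-in for the first-order dynamic feature.

```python
import math


def hz_to_mel(f):
    """Eq. (7): Mel(f) = 2595 * log10(1 + f / 700)."""
    return 2595.0 * math.log10(1.0 + f / 700.0)


def mel_to_hz(mel):
    """Inverse of Eq. (7), useful for placing filter centre frequencies."""
    return 700.0 * (10.0 ** (mel / 2595.0) - 1.0)


def mfcc_from_filter_energies(theta, L):
    """Eq. (8): cosine transform of the log filter-bank energies theta(1..M)."""
    M = len(theta)
    scale = math.sqrt(2.0 / M)  # assumption: normalise by the number of filters
    coeffs = []
    for n in range(1, L + 1):
        total = 0.0
        for k in range(1, M + 1):
            total += math.log(theta[k - 1]) * math.cos((k - 0.5) * n * math.pi / M)
        coeffs.append(scale * total)
    return coeffs


def first_order_difference(per_frame_coeffs):
    """Frame-to-frame difference, a simple stand-in for the dynamic features."""
    return [
        [b - a for a, b in zip(prev, cur)]
        for prev, cur in zip(per_frame_coeffs, per_frame_coeffs[1:])
    ]


# Made-up filter-bank energies for two frames (M = 4 filters, L = 3 coefficients).
frame_a = mfcc_from_filter_energies([1.0, 2.0, 4.0, 8.0], L=3)
frame_b = mfcc_from_filter_energies([2.0, 2.0, 2.0, 2.0], L=3)
delta = first_order_difference([frame_a, frame_b])
```

A frame whose filter energies are all equal yields coefficients that are numerically zero for $n\geq 1$, which reflects how the cosine transform removes the common level and keeps only the shape of the spectrum.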

When emotional features are extracted from speech signals, each feature is expressed in a different way and the ranges of values vary greatly. To compare the various types of features effectively, the extracted features must therefore be analysed and normalised, mainly in one of the following two ways.

The first method is min-max normalisation, which maps the original data to the range [0, 1] by the transformation in Eq. (9).

$\displaystyle\widehat{{f}_{i}}=\frac{f_{i}-\min_{i}}{\max_{i}-\min_{i}}$ (9)

In Eq. (9), $f_{i}$ represents the value of the $i$-th dimensional feature of the sample and $\widehat{{f}_{i}}$ its normalised value, $\min_{i}$ represents the minimum value of the $i$-th dimensional feature across all samples, and $\max_{i}$ represents the maximum value of the $i$-th dimensional feature across all samples. The second method is Z-score standardisation, in which the mean and standard deviation of all the raw data are computed first and the data are then normalised by them, so that the result has zero mean and unit standard deviation; the transformation function is shown in Eq. (10).

$\displaystyle\widehat{{f}_{i}}=\frac{f_{i}-\mu_{i}}{\sigma_{i}}$ (10)

In Eq. (10), $\mu_{i}$ represents the mean value of the $i$-th dimensional feature across all samples, $f_{i}$ represents the value of the $i$-th dimensional feature of the sample, and $\sigma_{i}$ represents the standard deviation of the $i$-th dimensional feature across all samples. In this study, the minimum and maximum values, as well as the mean and standard deviation, are computed from the training samples, and the features of the training samples are normalised using these statistics.
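The two normalisation schemes of Eqs (9) and (10), with statistics taken from the training samples, might look like the sketch below. The function names and the toy data are illustrative assumptions; the handling of constant features (mapped to 0.0) is also a choice made here, since the text does not discuss that case.

```python
import math


def fit_stats(train_rows):
    """Per-dimension min, max, mean and standard deviation of the training samples."""
    stats = []
    for col in zip(*train_rows):
        mean = sum(col) / len(col)
        var = sum((v - mean) ** 2 for v in col) / len(col)  # population variance
        stats.append({"min": min(col), "max": max(col), "mean": mean, "std": math.sqrt(var)})
    return stats


def min_max(row, stats):
    """Eq. (9); a constant feature (max == min) is mapped to 0.0 to avoid dividing by zero."""
    return [
        (v - s["min"]) / (s["max"] - s["min"]) if s["max"] > s["min"] else 0.0
        for v, s in zip(row, stats)
    ]


def z_score(row, stats):
    """Eq. (10); a constant feature (std == 0) is mapped to 0.0."""
    return [(v - s["mean"]) / s["std"] if s["std"] > 0 else 0.0 for v, s in zip(row, stats)]


# Toy training set: three samples, two feature dimensions with very different ranges.
train = [[1.0, 10.0], [2.0, 20.0], [3.0, 30.0]]
stats = fit_stats(train)
scaled = [min_max(r, stats) for r in train]
standardised = [z_score(r, stats) for r in train]
```

Keeping the statistics in a separate structure means that the same training-set values can later be applied to unseen samples, rather than recomputing them on test data.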

3.2 Evaluation model construction of a neural network-based fuzzy enhancement algorithm for smart home interactive speech

Many machine learning algorithms are available for smart home interactive speech recognition, and artificial neural networks (ANNs) are one of them. An ANN is essentially a simplified computational simulation of biological neurons, and it has four main features. First, it is adaptive and self-learning: it can adjust the connections between the neurons in the network according to the desired outcome and feedback, so that the network structure improves continuously. Second, it has associative memory: neurons store information in their internal connections, and in a feedback neural network each neuron exchanges information with the neurons before and after it, which enables association. Third, it has excellent optimisation properties: a three-layer network can in principle approximate any non-linear function, which provides a powerful tool for solving certain complex problems. Fourth, it has good fault tolerance: damage to local neurons does not have a significant impact on the operation of the network as a whole [20]. Figure 3 shows a simple three-layer structure consisting of an input layer, a hidden layer and an output layer.

Figure 3.

Three-layer neural network structure.

A neuron is the most fundamental unit of operation in a neural network; it is usually a non-linear system with feedback inputs and threshold parameters, as shown in Fig. 4.

Figure 4.

Single neuron structure.

For the $j$-th neuron in Fig. 4, the relationship between the $N$ inputs $x_{1},x_{2},\cdots,x_{N}$ and the net input $X_{j}$, from which the output $x^{\prime}_{j}$ is computed, can be expressed as shown in Eq. (11).

$\displaystyle X_{j}=\sum\limits_{i=1}^{N}w_{i}\cdot x_{i}+S_{j}$ (11)

In Eq. (11), $w_{i}$ is the weight of the $i$-th input and $S_{j}$ is the feedback signal; the output is $x^{\prime}_{j}=f(X_{j})$, where the feature function $f$ can be chosen from a number of alternatives according to the characteristics of the inputs and outputs. A defining feature of neural networks is that they learn from the environment, acquire knowledge and improve themselves. Performance is usually improved by adjusting the network's own parameters, such as weights and thresholds, which are difficult to determine when the network contains a large number of hidden layers. The back-propagation (BP) algorithm solves this problem well and laid the foundation for the further development of neural networks. A BP network consists of an input layer, hidden layers and an output layer, and in essence it uses the sum of squared network errors as its loss function, with the weights adjusted by gradient descent. To improve the non-linear performance of the network, an activation function is introduced; the study takes the Sigmoid activation function as an example, and its graph is shown in Fig. 5.

Figure 5.

Sigmoid function image.
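To make the neuron of Eq. (11) and the shape of the Sigmoid concrete, the following sketch computes the output of a single neuron and the slope of the Sigmoid at two points. The function names, inputs and weights are hypothetical, and the threshold parameter mentioned in the text is omitted because Eq. (11) does not include it.

```python
import math


def sigmoid(z):
    """Logistic function 1 / (1 + exp(-z)); intended for moderate values of z."""
    return 1.0 / (1.0 + math.exp(-z))


def sigmoid_derivative(z):
    """Slope of the Sigmoid: largest at z = 0 and close to zero for large |z|."""
    s = sigmoid(z)
    return s * (1.0 - s)


def neuron_output(inputs, weights, feedback=0.0):
    """Eq. (11) followed by the feature function: f(sum_i w_i * x_i + S_j)."""
    net = sum(w * x for w, x in zip(weights, inputs)) + feedback
    return sigmoid(net)


# Hypothetical inputs and weights; the net input is approximately 0.1.
y = neuron_output([0.5, -1.0, 2.0], [0.4, 0.3, 0.1])
slope_near_origin = sigmoid_derivative(0.0)
slope_far_away = sigmoid_derivative(10.0)
```

The slope at the origin is 0.25, while at an input of 10 it is several orders of magnitude smaller; gradient-based weight updates that pass through such a saturated neuron become correspondingly small.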

As the graph in Fig. 5 shows, the closer the input of the Sigmoid function is to the origin, the faster its output changes, whereas far from the origin the curve flattens. This flattening is known as soft saturation; it tends to lead to undertraining of the neural network and makes it prone to problems such as vanishing and exploding gradients. Deep learning methods handle these problems well and play an important role in computer vision, speech processing, emotion recognition and other fields. Studies on large amounts of data have verified that feature extraction and recognition with deep learning networks can greatly improve classification accuracy. The stacked self-encoder (SAE) used in this study is analysed in detail below. An autoencoder has to capture the key elements that represent the characteristics of the input signal. In its basic form it has only one hidden layer and is trained without supervision so that its output reconstructs its input. With the Sigmoid as the activation function, a neuron is regarded as active when its output is close to 1 and as inhibited when its output is close to 0. Since most neurons should be in the inhibited state most of the time, a sparsity penalty (the “penalty factor”) is introduced to limit the activity of the neurons, as shown in Eq. (12).

$\displaystyle\sum\limits_{j=1}^{S_{2}}\left[\rho\log\frac{\rho}{\rho^{\prime}_{j}}+(1-\rho)\log\frac{1-\rho}{1-\rho^{\prime}_{j}}\right]$ (12)

In Eq. (12), $S_{2}$ is the number of hidden neurons, $j$ indexes a specific hidden neuron, $\rho$ is the sparsity parameter, a relatively small number, and $\rho^{\prime}_{j}$ is the average activation of hidden neuron $j$. The penalty increases as the difference between the values of $\rho$ and $\rho^{\prime}_{j}$ increases. This sparse penalty term is added to the original cost function of the self-encoder to generate a new total cost function, as shown in Eq. (13).

$\displaystyle J_{\textit{sparse}}(W,b)=J(W,b)+\beta\sum\limits_{j=1}^{S_{2}}\textit{KL}\left(\rho\,\|\,\rho^{\prime}_{j}\right)$ (13)

In Eq. (13), $J(W,b)$ is the original cost function, each $\textit{KL}$ term is the corresponding summand of Eq. (12), and $\beta$ controls the weight of the sparse penalty term. Because the penalty also depends on $(W,b)$, suitable parameters $(W,b)$ can be obtained from this cost function by an iterative method such as gradient descent. The denoising autoencoder is designed for situations in which the input signal may be defective, for example a missing pixel in an image or noise in a sound signal; it is trained on artificially corrupted inputs so that the learned features are more robust. As shown in Fig. 6, noise is randomly added to the original input data with a specific probability, and the network is trained to reconstruct the clean input from the corrupted one through the hidden layer, which removes the effect of the interference and makes the extracted features more robust.

Figure 6.

Denoising autoencoder network structure.
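The sparsity penalty of Eqs (12) and (13) and the input corruption used by a denoising autoencoder can be sketched as below. This is an illustration only: the function names, the sparsity target of 0.05 and the value of $\beta$ are assumptions, and masking noise (zeroing each input with a fixed probability) is one common corruption scheme that is swapped in here because the text does not specify which kind of noise is added.

```python
import math
import random


def kl_sparsity_penalty(rho, rho_hat):
    """Eq. (12): sum over hidden neurons of KL(rho || rho_hat_j); all values must lie in (0, 1)."""
    total = 0.0
    for r in rho_hat:
        total += rho * math.log(rho / r) + (1 - rho) * math.log((1 - rho) / (1 - r))
    return total


def sparse_cost(reconstruction_cost, rho, rho_hat, beta):
    """Eq. (13): J_sparse = J + beta * sparsity penalty."""
    return reconstruction_cost + beta * kl_sparsity_penalty(rho, rho_hat)


def corrupt(x, drop_prob, rng):
    """Masking noise (an assumed scheme): zero each input with probability drop_prob."""
    return [0.0 if rng.random() < drop_prob else v for v in x]


rho = 0.05  # assumed sparsity target
penalty_on_target = kl_sparsity_penalty(rho, [0.05, 0.05])
penalty_off_target = kl_sparsity_penalty(rho, [0.5, 0.3])
total_cost = sparse_cost(1.0, rho, [0.5, 0.3], beta=3.0)

clean = [0.2, 0.7, 0.1, 0.9]
noisy = corrupt(clean, drop_prob=0.3, rng=random.Random(0))
```

The penalty vanishes when every average activation equals the target and grows as the activations move away from it. During training, the corrupted vector would be fed to the encoder while the reconstruction error is measured against the clean vector.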

Because sound signals are complex and uncertain, the study adopts a fuzzy neural network (FNN), which can describe and express fuzziness and adapts itself through parameter learning. The first step is to introduce the inference logic of the adaptive neuro-fuzzy inference system (ANFIS). For two inputs $x$ and $y$ and one output $f$, the consequent of an “if-then” rule under Takagi-Sugeno (T-S) inference takes the form shown in Eq. (14).

$\displaystyle f_{1}=p_{1}x+q_{1}y+r_{1}$ (14)

In Eq. (14), $\{{p_{1},q_{1},r_{1}}\}$ are the consequent parameters of the fuzzy rule. The study introduces a fuzzy neural network that uses a self-encoder as the membership function, denoted SAE-FNN; its layers are explained below. The first layer is the input layer, whose inputs are the emotional features extracted from the sound. The second layer is the fuzzification layer, in which the SAE network serves as the membership function; it uses a simple neural network with only two layers, input and output, so that there is only one weight matrix $W$, with the form shown in Eq. (15).

$\displaystyle W=\left[\begin{array}{ccc}w_{11}&w_{12}&w_{13}\\ w_{21}&w_{22}&w_{23}\end{array}\right]$ (15)

The third layer is the fuzzy inference layer, in which each neuron represents one fuzzy rule.

The fourth layer is the normalisation layer, which normalises all inferred conclusions so that the normalised values sum to 1. An advantage of normalisation is that it eases the subsequent processing of the data. Normalisation also bounds the magnitude of the parameter corrections, which effectively prevents the network from oscillating violently and thus improves its convergence. The fifth layer is the output layer, where the normalised outputs are defuzzified to obtain the classification of the input sample. This completes the forward inference process of the SAE-FNN. Back-propagation is then used to adjust the parameters: the weights of the output layer, the weights between the fuzzification layer and the fuzzy inference layer, and the membership parameters are updated in turn.
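The five-layer forward pass described above can be illustrated with a small two-input T-S fuzzy system. This sketch is not the SAE-FNN itself: a Gaussian membership function is swapped in where the study uses an SAE network, the firing strength of a rule is assumed to be the product of its memberships, and the rule parameters are made-up numbers.

```python
import math


def gaussian_membership(value, centre, width):
    """Stand-in membership function; the study uses an SAE network here instead."""
    return math.exp(-((value - centre) ** 2) / (2.0 * width ** 2))


def ts_inference(x, y, rules):
    """Forward pass of a two-input, first-order T-S fuzzy system."""
    # Layers 2 and 3: fuzzification, then the firing strength of each rule
    # (assumed here to be the product of the two memberships).
    strengths = [
        gaussian_membership(x, *rule["mu_x"]) * gaussian_membership(y, *rule["mu_y"])
        for rule in rules
    ]
    # Layer 4: normalisation, so that the normalised strengths sum to 1.
    total = sum(strengths)
    normalised = [s / total for s in strengths]
    # Layer 5: weighted sum of the rule consequents f_i = p_i*x + q_i*y + r_i, as in Eq. (14).
    consequents = [rule["p"] * x + rule["q"] * y + rule["r"] for rule in rules]
    return sum(w * f for w, f in zip(normalised, consequents)), normalised


# Two hypothetical rules; "mu_x" and "mu_y" hold (centre, width) pairs.
rules = [
    {"mu_x": (0.0, 1.0), "mu_y": (0.0, 1.0), "p": 1.0, "q": 1.0, "r": 0.0},
    {"mu_x": (2.0, 1.0), "mu_y": (2.0, 1.0), "p": 0.0, "q": 0.0, "r": 5.0},
]
output, weights = ts_inference(1.0, 1.0, rules)
output_near_rule_1, _ = ts_inference(0.0, 0.0, rules)
```

At the point (1, 1), which is equally far from both rule centres, the two rules fire equally and the output is the average of their consequents; closer to the first rule's centre, that rule dominates the output.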

4. Performance evaluation of a neural network-based fuzzy enhancement algorithm model for smart home interaction speech

The vocabulary used in this study ranged from 10 to 50 words. Because the words differ from one another, the recognition performance also differed from word to word.

Figure 7.

Comparison diagram before and after sound signal processing.

As shown in Fig. 7a, a longer segment of the original voice signal contains interruptions, that is, parts that carry little or no information, such as silent segments. Such segments increase the computational load and reduce recognition accuracy, and they make the voice discontinuous and fragment the meaningful emotional content. The voice signal is therefore pre-processed: after endpoint detection and removal of the invalid segments, the waveform is stable and free of noise, and the speech signal is then fed into the model. Considering the actual speech recognition process, it is necessary to analyse the performance of the method chosen for this study. In addition to training the fuzzy neural network (FNN) on its own, the study combines the FNN with the stacked self-encoder (SAE) to form the SAE-FNN algorithm. For comparison, the basic BPNN algorithm, the FNN algorithm and the SAE-FNN algorithm are each used to train the network, and the training curves and recognition rates are obtained.

Figure 8.

Training result graph.

As can be seen in Fig. 8, the method reaches the pre-set error target of $10^{-4}$ after 22 rounds of training. In the denoising autoencoder model, a restricted Boltzmann machine (RBM) is used to initialise the network parameters, and the loss curves of three different types of autoencoder are compared. The 5-layer deep autoencoder has a better recognition rate than the 3-layer autoencoder, and the deep denoising autoencoder performs 2.02% better than the autoencoder without denoising. This indicates that the multi-layer coding network has better non-linear characteristics and achieves better feature learning, and that the noise reduction technique further improves recognition and hence the performance of the whole system, as shown in Fig. 9.

Figure 9.

Training result graph.

The speech signals used so far were recorded in a laboratory, where signals are relatively stable and uncontaminated, so the recognition rate is high. In practice, however, the quality of the observed speech signal varies considerably, and it is therefore essential to analyse the noise immunity of the speech recognition method under study. Taking the spoken word “3” as an example, the BPNN, FNN and SAE-FNN algorithms are compared: pre-processing, feature extraction and parameter settings similar to those used before are applied to train each network, and the recognition rate is obtained as a function of the noise amplitude. The result is shown in Fig. 10.

As can be seen from Fig. 10, the recognition accuracy of the FNN and BPNN algorithms drops significantly under additive noise; the effect is especially pronounced for broadband Gaussian white noise, with a reduction of about 20%. For the SAE-FNN algorithm, by contrast, noise in the sound signal has minimal effect, which shows that the SAE-FNN algorithm proposed in the study has strong noise immunity. To evaluate the effectiveness of each model, the accuracy and stability of the three algorithms were assessed with three performance indicators: the absolute value of the maximum relative error $E_{\max}$, the average relative error $E_{\textit{ave}}$ and the root mean square error RMSE. The details are shown in Table 1.

Table 1

The calculation results of each model evaluation index

Evaluation index	BPNN	FNN	SAE-FNN
Absolute value of the maximum relative error $E_{\max}$ (%)	45.855	4.876	0.355
Mean relative error $E_{\textit{ave}}$ (%)	7.886	0.680	0.063
Root mean square error RMSE ($10^{6}$)	65.456	1.513	0.978
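The three indicators reported in Table 1 can be computed along the following lines. The paper does not give their formulas, so the definitions below are common ones assumed for illustration (relative errors taken with respect to the actual values, and the mean taken over absolute relative errors); the function names and the sample values are hypothetical and are not data from the study.

```python
import math


def relative_errors(predicted, actual):
    """Relative error of each sample; actual values are assumed to be non-zero."""
    return [(p - a) / a for p, a in zip(predicted, actual)]


def max_abs_relative_error(predicted, actual):
    """E_max in percent."""
    return 100.0 * max(abs(e) for e in relative_errors(predicted, actual))


def mean_relative_error(predicted, actual):
    """E_ave in percent, taken here as the mean of the absolute relative errors."""
    errs = relative_errors(predicted, actual)
    return 100.0 * sum(abs(e) for e in errs) / len(errs)


def rmse(predicted, actual):
    """Root mean square error of the raw (not relative) differences."""
    return math.sqrt(sum((p - a) ** 2 for p, a in zip(predicted, actual)) / len(actual))


# Hypothetical values, not data from the study.
actual = [1.0, 2.0, 4.0]
predicted = [1.1, 1.9, 4.0]
e_max = max_abs_relative_error(predicted, actual)
e_ave = mean_relative_error(predicted, actual)
error_rmse = rmse(predicted, actual)
```

For all three indicators a smaller value means a more accurate model, which is how the columns of Table 1 should be read.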

Figure 10.

Influence of noise amplitude on recognition rate.

As can be seen from Table 1, the BPNN model is simple and easy to build, but its accuracy is much lower and it adapts poorly; the FNN model trains faster, but its accuracy is still insufficient and the model is not stable enough. The combined SAE-FNN algorithm used in this research effectively compensates for the weaknesses of both algorithms: all three of its error indices are the lowest of the three models. The smart home interactive speech fuzzy enhancement model is therefore reliable and highly feasible, and it provides the necessary conditions for smart home interactive speech recognition.

5. Conclusion

Speech emotion recognition for smart home interaction makes human-computer interaction more intelligent and more natural, and it has good application prospects. This study combines the fuzzy neural network (FNN) with the stacked autoencoder (SAE) to form the SAE-FNN algorithm. The results show that the algorithm reaches the pre-set error target of $10^{-4}$ after 22 rounds of training. A comparison of the loss values of three different types of autoencoder shows that the 5-layer deep autoencoder has a better recognition rate than the 3-layer autoencoder, while the deep denoising autoencoder improves performance by 2.02% over the autoencoder without denoising, indicating that the multi-layer coding network has better non-linear characteristics and achieves better feature learning. The model was then tested on non-ideal sound signals. The experiments show that noise in the sound signal has little effect on the SAE-FNN algorithm, so the algorithm proposed in the study has strong anti-noise performance. This indicates that the neural network-based speech fuzzy enhancement algorithm proposed in this study has strong practical significance. The research is only a preliminary attempt to introduce neural networks into speech recognition, and many theoretical and practical problems remain to be discussed. The following work is suggested as a next step: first, to study neural network structures suited to the characteristics and requirements of speech recognition in order to find a more reasonable and reliable form of network model; second, to look for other effective features that are more conducive to target classification. In particular, feature fusion deserves further study.

References

1. Huu, Thu HNT, Minh. Proposing a recognition system of gestures using MobilenetV2 combining single shot detector network for smart-home applications. J Electr Comput Eng. 2021; 2021: 6610461.

2. Bossung, Kast. Smart sensorik in der schwangerschaft: narratives review über die verlagerung der schwangerschaftsvorsorge in den smart home bereich. Zeitschrift für Evidenz, Fortbildung und Qualität im Gesundheitswesen. 2021; 164: 35-43.

3. Khan, Abbas, Rehman, Saeed, Zeb, Uddin, Nasser, Ali. A machine learning approach for blockchain-based smart home networks security. IEEE Network. 2021; 35(3): 223-229.

4. Andrade SHMS, Contente, Rodrigues, Lima, Vijaykumar, Francês CRL. A smart home architecture for smart energy consumption in a residence with multiple users. IEEE Access. 2021; 9: 16807-16824.

5. Zhang, Tian, Chen. Exploring the cognitive process for service task in smart home: A robot service mechanism. Future Gener Comput Syst. 2020; 102: 588-602.

6. Chang, Zhao. Multiple kernel based transfer learning for the few-shot recognition task in smart home scene. IFAC-PapersOnLine. 2020; 53(2): 17101-17106.

7. Zhao, Keerthisinghe. A fast and optimal smart home energy management system: State-space approximate dynamic programming. IEEE Access. 2020; 8: 184151-184159.

8. Pustokhina, Pustokhin, Rodrigues JJPC, Gupta, Khanna, Shankar, et al. Automatic vehicle license plate recognition using optimal k-means with convolutional neural network for intelligent transportation systems. IEEE Access. 2020; 8: 92907-92917.

9. Kulkarni, Kulkarni. Fuzzy neural network for pattern classification. Procedia Comput Sci. 2020; 167: 2606-2616.

10. Zheng, Zhang. Design of fuzzy system-fuzzy neural network-backstepping control for complex robot system. Inf Sci. 2020; 546: 1230-1255.

11. Kagan, Rybalov, Yager. Sum of certainties with the product of reasons: Neural network with fuzzy aggregators. Int J Uncertainty, Fuzziness Knowl-Based Syst. 2022; 30(1): 1-18.

12. Ai, Sheng, Fang, Ling. Ensemble learning with attention-integrated convolutional recurrent neural network for imbalanced speech emotion recognition. IEEE Access. 2020; 8: 199909-199919.

13. Song, Mallol-Ragolta, Parada-Cabaleiro, Yang, Liu, Ren, et al. Frustration recognition from speech during game interaction using wide residual networks. Virtual Reality Intell Hardware. 2021; 3(1): 76-86.

14. Alam, Samad, Vidyaratne, Glandon, Iftekharuddin. Survey on deep neural networks in speech and vision systems. Neurocomput. 2020; 417: 302-321.

15. Saadi, Belhadef. Deep neural networks for Arabic information extraction. Smart Sustainable Built Environ. 2020; 9(4): 467-482.

16. Liang, Nie, Liu, Yang. Deep neural network-based generalized sidelobe canceller for dual-channel far-field speech recognition. Neural Networks. 2021; 141: 225-237.

17. Boloukian, Safi-Esfahani. Recognition of words from brain-generated signals of speech-impaired people: application of autoencoders as a neural Turing machine controller in deep neural networks. Neural Networks. 2020; 121: 186-207.

18. Zhao, Bao, Zhang, Cummins, Sun, Wang, et al. Self-attention transfer networks for speech emotion recognition. Virtual Reality Intell Hardware. 2021; 3(1): 43-54.

19. Becerra, De La Rosa, González, Pedroza, Escalante, Santos. A comparative case study of neural network training by using frame-level cost functions for automatic speech recognition purposes in Spanish. Multimedia Tools Appl. 2020; 79(27): 19669-19715.

20. Ukita. Causal importance of low-level feature selectivity for generalization in image recognition. Neural Networks. 2020; 125: 185-193.