Abstract
Recognition of human speech has long been a topic of interest among artificial intelligence and processing researchers. Speech is the most essential method of communication among human beings. Considerable research effort has been devoted to speech recognition in previous decades. Accordingly, this study presents a survey of speech recognition strategies suitable for human identification. The main motivation of this survey is to explore the existing speech recognition strategies so that researchers can include all the necessary metrics in their work in this domain and overcome the limitations of existing approaches. In this review, the diverse issues involved in speech recognition methodologies are identified, and distinct speech recognition procedures are studied to discover which qualities are addressed in a given system and which are disregarded. We offer a detailed survey of 50 methods from standard publishers, published from 2000 to 2015. We categorize the research from three perspectives: techniques utilized, applications, and parameter measures. In addition, this study gives a detailed overview of speech recognition techniques.
Keywords
Introduction
Speech engineering
Engineering deals with the application of mathematical, economic, scientific, and practical knowledge to innovate, design, assemble, research, and improve structures, systems, machines, materials, solutions, and organizations. There is a wide range of engineering disciplines, each focusing on specific areas of applied science and technology. Acoustical (or acoustic) engineering is the branch of engineering that deals with sound and vibration and their applications. One of the main objectives of acoustical engineering is noise control. Acoustic science also covers studies of speech and hearing, the masking effect of noise, and several other physical and psychological research areas. It involves various subdisciplines, as organized in the PACS (Physics and Astronomy Classification Scheme) coding. A major area of research in acoustical engineering is speech, which includes the processing, production, and perception of speech. This area also draws on physics, psychology, audio signal processing, physiology, and linguistics. Speech engineering exploits a few new algorithms in machine listening, focusing mainly on the identification of continuous speech features, such as speech impediments and regional accents, in addition to the recognition of individual phrases and words.
Speech recognition
Speech is one of the most powerful means of human communication. Compared with parallel forms of communication, such as writing, body language, and gesture, speech is regarded as the most natural and direct form [51, 52, 53]. Because we are comfortable with speech, we would naturally like to interact with computers through speech, without resorting to traditional interfaces such as keyboards or pointing devices. This is made possible by Automatic Speech Recognition (ASR), which is the process of converting a speech signal into a sequence of words by means of an algorithm implemented as a computer program. ASR has the potential to become a significant means of interface between humans and computers [54]. During the last few decades, ASR has achieved considerable success, with its performance markedly improved across a wide spectrum of real applications, from simple digit recognition to large-vocabulary broadcast news transcription, and from read-style voice dictation to spontaneous dialogue systems. These advances are largely due to a number of powerful statistical modeling methods that have been established in ASR for characterizing real data, such as speech signals and spoken language documents gathered from realistic applications [55]. The fundamental goal of the speech recognition field is to develop new methods and mechanisms for speech input to the machine. Research on ASR by machines has attracted growing interest for about sixty years [56].
As a result, ASR is now extensively employed in tasks that need a human-machine interface, such as automatic call processing [57], and in computers capable of speaking and recognizing speech in the local language [58]. Applications include automatic call processing in telephone networks and data systems that furnish up-to-date travel information, stock price quotations, and weather reports, as well as data entry, voice dictation, access to data, travel, banking, voice commands, automobile portals, speech transcription, aids for people with disabilities (such as blind people), supermarkets, and railway reservations. Speech recognition technology has been extensively employed in telephone networks to automate and enhance operator services. In addition, speech understanding systems are now able to comprehend speech input for vocabularies of thousands of words in operational settings. The speech signal conveys two vital kinds of information: (a) the speech content and (b) the speaker identity. Speech recognition extracts the lexical information from the speech signal independently of the speaker by reducing inter-speaker variability. In speaker recognition, the major aim is to extract the identity of the individual [59]. Figure 1 shows the basic automatic speech recognition system.
Figure 1. Automatic speech recognition system.
Nowadays, speech signals are extensively employed in biometric recognition methods and for interfacing with computers. The primary goal of speech recognition is to develop suitable strategies and structures for conveying speech data to the computer. Speech has always been an effective means of interaction between human beings. For several reasons, including interest in the mechanisms for automatic realization of human speech capabilities and the need for human-machine interfaces, research on automatic speech recognition by machines has attracted significant attention for six decades [60]. A speech recognition system has two fundamental stages: feature extraction and classification. Feature extraction plays a significant role in speech recognition. There are two leading techniques for acoustic measurement. The first is a temporal-domain, or parametric, technique such as linear prediction [61], which is designed to closely match the resonant structure of the human vocal tract that generates the corresponding sound. The linear prediction coefficients (LPC) method is not recommended for representing speech because it assumes the signal to be stationary within a given frame and hence cannot analyze localized events precisely. Moreover, it fails to capture unvoiced and nasalized sounds accurately [62]. The second is a non-parametric frequency-domain technique based on the human auditory perception system, known as Mel-frequency cepstral coefficients (MFCC) [63]. In addition, a host of other feature extraction methods are in use in automatic speech recognition (ASR) systems.
Feature extraction methods include Linear Predictive Analysis (LPC), Linear Predictive Cepstral Coefficients (LPCC), Perceptual Linear Predictive Coefficients (PLP), Mel-Frequency Cepstral Coefficients (MFCC), Power spectral analysis (FFT), Mel scale cepstral analysis (MEL), Relative spectra filtering of log domain coefficients (RASTA), and first order derivative (DELTA) features. The classification phase is also one of the vital functions in speech recognition. A multitude of classifier approaches are currently employed to recognize speech. Figure 2 illustrates the general architecture of the speech recognition system.
Figure 2. General architecture of the speech recognition system.
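As an illustration of the first order derivative (DELTA) features listed among the feature extraction methods, the following is a minimal sketch in Python. It assumes the commonly used regression formula over a window of neighboring frames and repeats the edge frames at the boundaries. The function name `delta` and the parameter `width` are hypothetical and do not come from any of the surveyed systems.

```python
def delta(frames, width=2):
    """First-order (delta) features for a sequence of feature vectors.

    Assumed regression formula (a common choice, not taken from the survey):
        d_t = sum_{n=1..width} n * (c_{t+n} - c_{t-n}) / (2 * sum_{n=1..width} n^2)
    Frames beyond the edges are replaced by the first or last frame.
    """
    if not frames:
        return []
    denom = 2 * sum(n * n for n in range(1, width + 1))
    last = len(frames) - 1
    out = []
    for t in range(len(frames)):
        row = []
        for k in range(len(frames[t])):
            acc = 0.0
            for n in range(1, width + 1):
                ahead = frames[min(t + n, last)][k]   # clamp at the final frame
                behind = frames[max(t - n, 0)][k]     # clamp at the first frame
                acc += n * (ahead - behind)
            row.append(acc / denom)
        out.append(row)
    return out


# Illustrative input: one coefficient per frame, rising by 1 each frame.
ramp = [[0.0], [1.0], [2.0], [3.0], [4.0]]
ramp_delta = delta(ramp, width=2)
```

For a feature that rises by a constant amount per frame, the delta in the interior of the sequence equals that slope, while a constant feature yields zero deltas. In practice the delta vector is usually appended to the static features such as MFCCs.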
The problem of speech recognition under adverse conditions has attracted growing interest from researchers. Four factors have been found to affect the speech entering a typical recognition system under adverse conditions.
First, background noise degrades the speech signal. The loss of the original speech signal depends on the strength of the background noise: the stronger the background noise, the more severely the original speech signal is affected. Second, because the speaker can hear the background noise, he or she is likely to modify his or her speech characteristics in an effort to improve communication over the noisy medium. Noise is the factor that most frequently disrupts communication. It is an interference that occurs between the communicators, that is, the sender of the message and the receiver. A noisy place typically puts a strain on oral communication, as both the sender and the receiver need to make additional effort to communicate. The effectiveness of communication is determined by the amount of noise present in the communication channel. Third, performing a secondary task, such as driving a car, is also likely to have an adverse effect on the operator’s speech production. A number of factors have been identified as affecting speech production while driving. The surrounding traffic is sometimes a topic of discussion between driver and passenger, which may assist the driver’s situational awareness of the roadway environment, and the speech production rate of both passenger and driver falls as the demands of the surrounding traffic increase. Fourth, channel effects also tend to degrade the speech signal. The unpredictability of the channel is one of the most significant factors affecting the performance of speech recognition systems. The channel effect is considered one of the more complicated sources of degradation because of the information mixing it introduces.
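To make the first factor concrete, the degradation caused by additive background noise is often quantified by the signal-to-noise ratio (SNR). The sketch below is only an illustration under stated assumptions: the noise is additive, both signals are equal-length lists of samples, and the helper names `power`, `snr_db`, and `mix_at_snr` are hypothetical.

```python
import math


def power(samples):
    """Average power (mean of squared samples)."""
    return sum(v * v for v in samples) / len(samples)


def snr_db(speech, noise):
    """Signal-to-noise ratio in decibels."""
    return 10.0 * math.log10(power(speech) / power(noise))


def mix_at_snr(speech, noise, target_db):
    """Scale the noise so that speech + noise has the requested SNR.

    Returns the noisy signal and the scaled noise. Assumes additive noise
    and equal-length inputs (an assumption of this sketch).
    """
    if len(speech) != len(noise):
        raise ValueError("speech and noise must have the same length")
    if power(noise) == 0.0:
        raise ValueError("noise must not be silent")
    gain = math.sqrt(power(speech) / (power(noise) * 10.0 ** (target_db / 10.0)))
    scaled = [gain * n for n in noise]
    noisy = [s + n for s, n in zip(speech, scaled)]
    return noisy, scaled


# Illustrative toy signals, not real speech.
speech = [1.0, -1.0, 1.0, -1.0]
noise = [0.5, 0.5, -0.5, -0.5]
noisy, scaled = mix_at_snr(speech, noise, target_db=10.0)
```

Lowering `target_db` raises the noise gain, which matches the observation that stronger background noise affects the original speech signal more severely.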
Classification of speech recognition system.
Speech recognition systems can be categorized into several groups based on the kinds of utterances they are able to recognize, as described below.
Isolated speech
Isolated word recognizers require each utterance to have quiet, that is, the lack of an audio signal, on both sides of the sample window. They accept single words or utterances at a time. These systems have “Listen/Not-Listen” states, where the user is required to wait between utterances, with processing usually carried out during the pauses. This class may be better termed isolated utterance.
Connected words
Connected-word systems need only a minimal pause between utterances, which lets speech flow more easily. They are otherwise more or less identical to isolated-word systems.
Continuous speech
Continuous speech recognizers permit the users to speak more or less naturally, while the computer decides the content. Fundamentally, it represents the
Spontaneous speech
At a fundamental level, spontaneous speech can be regarded as speech that is natural sounding and not rehearsed. An ASR system with spontaneous speech capability has to be able to handle a host of natural speech characteristics, such as words that run together, “ums” and “ahs”, and slight stutters.
Classification of speech recognition system
As a rule, the speech recognition task has to cope with the variability of speech and is responsible for learning the association between particular sounds and the related word or words [64]. Over the last few years, there has been remarkable progress in the field of speech recognition, following two trends [65]. The first is the academic approach, and the second is the pragmatic approach, which involves technology that provides a simple low-level interface with the machine, replacing buttons and switches. In total, there are three different approaches to speech recognition [66, 67, 68]: (i) the acoustic-phonetic approach [69, 70, 71, 72], (ii) the artificial intelligence approach, and (iii) the pattern recognition approach. Figure 3 shows the classification of speech recognition systems.
Acoustic-phonetic approach
The acoustic-phonetic approach to speech recognition depends on locating speech sounds and assigning proper labels to them. It rests on the premise that a spoken language contains a finite number of distinctive phonemes, which can be broadly characterized by a set of acoustic properties exhibited in the speech signal over time. Because of speaker and co-articulation effects, the acoustic properties of phonetic units are extremely variable. The underlying assumption of this approach is that the rules governing this variability are straightforward and can be readily learned by a machine [73]. However, the acoustic-phonetic approach is employed in only a limited number of commercial applications [74]. As for its important features, the acoustic-phonetic approach specifies the features needed to recognize the sounds or classes of sounds of interest. It helps to identify word-initial and word-final consonant clusters in an isolated word recognition system. No new segmentation or marker placement is required. Features are mapped into acoustic properties on the basis of relative measures, so that they are relatively insensitive to such variability.
Pattern recognition approach
The pattern recognition technique does not require any explicit knowledge about speech. It proceeds in two phases: training of speech patterns using a common set of spectral parameters, and recognition of patterns by means of pattern analysis. The pattern recognition approach is used to reduce the dimension of the data. It gives high accuracy at low computational cost. Many simple image coding techniques are available within the pattern recognition approach. It also provides an effective way of combining the contributions of a number of speech measurements. Its vital feature is that it employs a well-formulated mathematical framework and creates consistent speech pattern representations for reliable pattern comparison. The best-known pattern recognition methods are the template-based technique and the stochastic technique [75]. The stochastic model is the most appropriate for speech recognition because it employs probabilistic techniques to deal with uncertain or incomplete data [76]. Numerous techniques follow this approach, including HMM, SVM, DTW, and VQ. Outstanding among them is the Hidden Markov Model, which has emerged as the best-known stochastic approach currently available.
Template based approach
The core concept of the template-based approach [77] is straightforward. A collection of prototypical speech patterns is stored as reference patterns representing the dictionary of candidate words. Recognition is then performed by matching an unknown spoken word against each of the reference templates and choosing the category of the best-matching pattern. Templates are constructed for all the words. One key idea in the template approach is to derive a typical sequence of speech frames for a pattern (a word) through some averaging procedure and to rely on local spectral distance measures to compare patterns. Another key idea is to use some form of dynamic programming to temporally align the patterns, accounting for differences in speaking rate across talkers as well as across repetitions of the word by the same talker. The template-based approach handles the changing shape of an intonational contour depending on context. It posits only a reasonable number of contexts. Templates are also much stronger long-range predictors. The use of templates is intended to capture and reproduce the most important syllable-level features of the contour without over-smoothing. The approach uses a flexible expression format to describe the template condition. The template-based approach has the benefit that errors due to the classification or segmentation of smaller, acoustically more variable units, such as phonemes, can be avoided.
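The template matching with dynamic programming described above can be sketched with dynamic time warping (DTW). This is a minimal illustration rather than the method of any particular surveyed system: it assumes Euclidean distance as the local spectral distance, the basic three-way step pattern without slope constraints, and toy one-dimensional frames. The names `dtw_distance` and `recognize` are hypothetical.

```python
import math


def frame_distance(x, y):
    """Local distance between two feature vectors (Euclidean, by assumption)."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))


def dtw_distance(seq_a, seq_b):
    """Accumulated cost of the best temporal alignment of two frame sequences."""
    inf = float("inf")
    n, m = len(seq_a), len(seq_b)
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = frame_distance(seq_a[i - 1], seq_b[j - 1])
            # Extend the cheapest of: insertion, deletion, or diagonal match.
            cost[i][j] = d + min(cost[i - 1][j], cost[i][j - 1], cost[i - 1][j - 1])
    return cost[n][m]


def recognize(utterance, templates):
    """Return the word whose reference template aligns best with the utterance."""
    return min(templates, key=lambda word: dtw_distance(utterance, templates[word]))


# Toy reference templates with one feature per frame (illustrative values only).
templates = {
    "yes": [[0.0], [1.0], [2.0]],
    "no": [[5.0], [5.0], [5.0]],
}
# The same "yes" pattern spoken more slowly: frames are repeated.
slow_yes = [[0.0], [0.0], [1.0], [1.0], [2.0]]
best = recognize(slow_yes, templates)
```

Because the alignment may stretch or compress either sequence, the slower utterance still matches its template with zero accumulated cost, which is how the approach accounts for differences in speaking rate.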
Stochastic approach
Stochastic modeling [77] involves the use of probabilistic techniques to deal with uncertain or incomplete information. In speech recognition, uncertainty and incompleteness arise from a variety of sources, such as confusable sounds, speaker variability, contextual effects, and homophones. The stochastic approach provides more information about potential future outcomes, thereby leading to improved decision-making. In addition, it allows easier integration of knowledge sources into a compiled architecture than the knowledge-based approach. Stochastic models are therefore particularly suitable for speech recognition. The most popular stochastic technique today is hidden Markov modeling, in which a model consists of a finite-state Markov chain and a set of output distributions. The transition parameters of the Markov chain model temporal variability, while the parameters of the output distributions model spectral variability. These two kinds of variability are at the core of speech recognition. Compared with the template-based technique, hidden Markov modeling is more general and has a firmer mathematical foundation. A template-based model corresponds to a continuous density HMM with identity covariance matrices and a slope-constrained topology. Although templates can be trained on a few instances, they lack the probabilistic formulation of full HMMs and typically perform worse than HMMs. Compared with knowledge-based techniques, HMMs [78, 79, 80, 81, 82, 83] allow easy integration of knowledge sources into a compiled architecture. A drawback of this is that HMMs do not provide much insight into the recognition process.
As a result, it is often difficult to analyze the errors of an HMM system in an attempt to improve its performance. Nevertheless, careful incorporation of knowledge has significantly improved HMM-based systems.
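As a rough illustration of how a hidden Markov model combines a finite-state Markov chain with a set of output distributions, the sketch below computes the likelihood of an observation sequence with the forward algorithm. It assumes a discrete-output HMM with made-up parameters and a non-empty observation sequence, whereas practical recognizers typically use continuous densities. The name `forward_likelihood` and all numbers are hypothetical.

```python
def forward_likelihood(obs, start, trans, emit):
    """P(obs | model) for a discrete-output HMM via the forward algorithm.

    start[s]    : probability of starting in state s
    trans[p][s] : probability of moving from state p to state s
    emit[s][o]  : probability that state s emits symbol o
    Assumes obs is non-empty (an assumption of this sketch).
    """
    n = len(start)
    alpha = [start[s] * emit[s][obs[0]] for s in range(n)]
    for symbol in obs[1:]:
        alpha = [
            sum(alpha[p] * trans[p][s] for p in range(n)) * emit[s][symbol]
            for s in range(n)
        ]
    return sum(alpha)


# Illustrative two-state, two-symbol model (all values are assumptions).
start = [0.6, 0.4]
trans = [[0.7, 0.3],
         [0.4, 0.6]]
emit = [[0.9, 0.1],
        [0.2, 0.8]]

likelihood = forward_likelihood([0, 1, 0], start, trans, emit)
```

In an isolated word recognizer built this way, one model would be trained per word, and the word whose model gives the highest likelihood for the observed sequence would be chosen.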
Artificial intelligence approach (knowledge based approach)
The Artificial Intelligence approach [77] is a hybrid of the acoustic-phonetic technique and pattern recognition method. This innovative technique skillfully blends the notions and conceptions of the Acoustic-
Survey on speech recognition
All articles published between January 2000 and May 2015 that had the term “speech recognition” in the title or as a keyword were first selected from scientific publishers such as IEEE, Elsevier, and Springer, and from some international journals. During this period, a large number of studies applied machine learning and other techniques to the problems of speech recognition. In this section, we survey and categorize the techniques based on computer vision that have been developed so far for speech recognition. The speech recognition techniques can be divided into seven broad categories: neural network based speech recognition, fuzzy logic based speech recognition, wavelet based speech recognition, optimization algorithm based speech recognition, DTW algorithm based speech recognition, sub-band based speech recognition, and other approaches to speech recognition.
Neural network based speech recognition
In this part, we describe some earlier works on neural network based speech recognition that motivated this research. The parallel implementation of artificial neural network (ANN) training for speech recognition was described by Scanzio et al. [22]. They explained the implementation of a complete ANN training procedure using the block-mode back-propagation learning algorithm for sequential patterns, such as the observation feature vectors of a speech recognition system, exploiting the high-performance SIMD architecture of GPUs through CUDA and its C-like language interface. Their implementation took into account all the peculiarities of training on large-scale sequential patterns, in particular the re-segmentation of the training sentences, the block size for the feed-forward and back-propagation steps, and the transfer of large amounts of data from host memory to the GPU card. The approach was tested by training acoustic models for large-vocabulary speech recognition tasks, showing a sixfold reduction in the time needed to train real-world large networks compared with an already optimized implementation based on the Intel MKL libraries. In addition, Gevaert et al. [25] described neural network based speech recognition. Classification was carried out using two standard neural network structures as the classifier: a feed-forward neural network (NN) trained with the back-propagation algorithm and a radial basis function neural network. Going further, deep neural networks for acoustic modeling in speech recognition were presented in [29].
Deep neural networks with several hidden layers, trained using new methods, were shown to outperform Gaussian mixture models on a variety of speech recognition benchmarks, sometimes by a large margin. Similarly, Siniscalchi et al. [36] described the use of deep neural networks (DNNs) for detection-based speech recognition. They demonstrated that DNNs can be employed to improve the classification accuracy of basic speech units, such as phonetic attributes (phonological features) and phonemes. This improvement leads to greater flexibility and has the potential to integrate both top-down and bottom-up knowledge into the Automatic Speech Attribute Transcription (ASAT) framework. ASAT is a new family of lattice-based speech recognition systems grounded in accurate detection of speech attributes. The improved phoneme prediction accuracy, when incorporated into a standard large vocabulary continuous speech recognition (LVCSR) system through a word lattice rescoring framework, results in improved word recognition accuracy.
Gender classification is currently one of the most significant tasks in speech processing. Gender classification is frequently based on pitch as a feature, as presented by Meena et al. [41]. They used fuzzy logic and a neural network to recognize the gender of the speaker. A training dataset was produced using three features to train the fuzzy logic system and the neural network. Next, the mean value of the results obtained from fuzzy logic and the neural network was computed. The method recognizes the gender of the speaker using this value as a threshold. In addition, Siniscalchi et al. [42] described the artificial neural network approach to automatic speech processing. In that paper, numerous ANN-based applications for speech processing were presented, ranging from speech attribute extraction to phoneme estimation and classification. Moreover, it was shown that ANNs play a key role in important speech applications, such as large vocabulary continuous speech recognition (LVCSR) and automatic language recognition. The objective of the paper was to summarize the main ANN approaches to speech processing based on the experience gathered over the previous seven years in their laboratories. Automatic Speech Recognition (ASR) is a technology for recognizing uttered words represented as an acoustic signal. One of the significant features of a noise-robust ASR system is its ability to recognize speech accurately in noisy conditions, as described by Shahamiri and Salim [43]. They studied the application of Multi-Nets Artificial Neural Networks (M-N ANNs), a realization of the multiple-views multiple-learners approach, as Multi-Networks Speech Recognizers (M-NSRs) to provide a real-time, frequency-based, noise-robust ASR model.
M-NSRs treat the speech features associated with each word as a separate view and apply a standalone ANN as one of the learners to approximate that view; in contrast, multiple-views single-learner (MVSL) ANN-based speech recognizers use only one ANN to memorize the features of the whole vocabulary. In this research, the M-NSR was presented and evaluated using unseen test data affected by white, brown, and pink noise; more specifically, 27 experiments were performed on noisy speech to measure the accuracy and recognition rate of the model. Additionally, a novel recurrent neural network-based Kalman filter for speech enhancement, based on a noise-constrained estimate, was introduced in [48]. The proposed recurrent neural network is globally asymptotically stable to the noise-constrained estimate. Because the noise-constrained estimate performs robustly against non-Gaussian noise, the recurrent neural network-based speech enhancement algorithm minimizes the estimation error of the Kalman filtering parameters in non-Gaussian noise. Moreover, because it has a low-dimensional model, the proposed speech enhancement algorithm is much faster than existing recurrent neural network-based speech enhancement algorithms.
Fuzzy logic based speech recognition
In this section, we describe speech recognition techniques that use fuzzy logic. Among these, Sun et al. [3] presented fuzzy logic-based natural language processing and its application to speech recognition. Their intention was to build a system that learns from a linguistic corpus the fuzzy semantic relations among the concepts represented by words and uses these relations to process the word sequences produced by speech recognition systems. In particular, the system was able to predict words that a speech recognition system had failed to recognize, thereby raising the accuracy of the speech recognition system. This serves as the initial stage of deep semantic processing of speech recognition results by providing “semantic relatedness” between the recognized words. They describe the proposed fuzzy inference rule learning system and report experimental results obtained with it. In addition, Halavati et al. [12] described speech recognition by means of fuzzy modeling and decision making that ignores noise instead of detecting and removing it. To this end, the speech spectrogram was converted into a fuzzy linguistic description, and this description was used instead of exact acoustic features. During the training period, a genetic algorithm finds suitable definitions for phonemes, and once these definitions are fixed, a simple new operator consisting of low-cost functions such as Max, Min, and Average performs the recognition.
Hoseinkhani et al. [30] further introduced a speech recognition technique based on fuzzy logic with the firefly algorithm to overcome the difficulties in [3]. The speech samples were first given as input to the fuzzy circuit to examine the signals in the fuzzy framework, and a pattern of signals was generated for each signal cluster. This reduces the dimension of the signal data and provides better and more reliable recognition results. They employed the firefly classification method and considered a particular class for each input to improve the speech recognition rate. Classifying the fuzzy signal was the reason for the increase in recognition accuracy. In addition, a speech retrieval system based on fuzzy logic and knowledge base filtering was presented by Singh et al. [38]. The system recognizes the audio query and retrieves the audio file(s) in the corpus using an information retrieval (IR) engine. They employed the CMUSphinx4 library for automatic speech recognition after adapting it to the Indian accent and the lab environment. The IR engine in the back end employs fuzzy logic based reasoning and knowledge-based filtering to retrieve related (transcribed) sentences. Similarly, Nereveettil et al. [44] presented a feature selection algorithm for automatic speech recognition based on fuzzy logic. That paper examines the feasibility of using the Mel Frequency Cepstral Coefficient (MFCC) algorithm to extract features and a Fuzzy Inference System model for feature selection, reducing the dimensionality of the extracted features. There was a growing need for a new feature selection method to raise the processing rate and recognition accuracy of the classifier by choosing discriminative features. Therefore, a Fuzzy Inference System model was applied to choose the optimal features from the speech vectors extracted by means of MFCC.
Additionally, automatic recognition of children’s speech is a demanding task in computer-based speech recognition systems, as explained by Mirhassani and Ting [45]. Their system reduces the dimensionality without compromising the speech recognition rate. To evaluate the method, they examined six Malay vowels from recordings of 360 children aged 7 to 12. The fuzzy-based feature selection allowed flexible selection of the MFCCs with the best discriminative capacity to improve the separation among the vowel classes.
Wavelet based speech recognition
In this section, we describe some of the research papers related to wavelet based speech recognition. Wavelet decomposition is applied to improve the time efficiency of the system. Mel-scaled discrete wavelet coefficients for speech recognition were presented by Gowdy and Tufekci [1]. The feature vector contains Mel-frequency discrete wavelet coefficients (MFDWC). The MFDWC were obtained by applying the discrete wavelet transform (DWT) to the Mel-scaled log filter bank energies of a speech frame. The purpose of using the DWT was to benefit from its localization property in the time and frequency domains. MFDWC are similar to sub-band-based (SUB) features and multi-resolution (MULT) features in that all of them attempt to achieve good time and frequency localization. To remove noise from speech, the Parallel Model Compensation (PMC) technique, one of the most effective techniques for dealing with such noise, was introduced. In [9], Tufekci and Gurbuz presented noise-robust verification by means of Mel-frequency discrete wavelet coefficients and parallel model compensation. The experimental results show significant performance improvements of MFDWCs over MFCCs after compensating the Gaussian Mixture Models (GMMs) by means of the PMC technique. To overcome the remaining difficulties, Mel-frequency discrete wavelet coefficients and parallel model compensation were additionally applied to noise-robust speech recognition in [11]. In that paper, the authors examined the use of PMC and MFDWC features to benefit from both noise compensation and local features (MFDWCs) and so reduce the effect of noise on recognition performance. They also introduced a practical weighting technique based on the noise level of each coefficient. Moreover, Wu and Lin [19] presented a speaker identification technique using the discrete wavelet packet transform with irregular decomposition.
The system combines signal pre-processing, feature extraction by means of the wavelet packet transform (WPT), and speaker identification by means of an artificial neural network. To validate the identification results, a general regression neural network (GRNN) model was employed and compared in the experimental investigation. To enhance the performance of speaker identification systems, Abdalla and Ali [23] proposed an effective and robust method based on wavelet-based Mel-frequency cepstral coefficients for speaker identification using hidden Markov models. Exploiting the time-frequency multi-resolution property of the wavelet transform, the input speech signal was decomposed into different frequency channels. To capture the characteristics of the signal, the Mel-frequency cepstral coefficients (MFCCs) of the wavelet channels were computed. In the recognition stage, hidden Markov models (HMMs) were employed because they gave better recognition of the speaker's features than dynamic time warping (DTW).
In addition, Shao and Chang [27] described Bayesian separation with sparsity promotion in the perceptual wavelet domain for speech enhancement and hybrid speech recognition. To reduce the mismatch between the training and testing conditions of the classifier, a Bayesian scheme was used in the wavelet domain to separate the speech and noise components within the proposed iterative speech enhancement algorithm. Non-phonetic information was removed, while the more important speech features were extracted and represented by the wavelet coefficients. The method considerably improves recognition performance at a low signal-to-noise ratio (SNR) without degrading performance at a high SNR. Likewise, Nehe and Holambe [31] presented DWT and LPC based feature extraction methods for isolated word recognition. They proposed novel feature extraction methods that use wavelet decomposition and reduced-order linear predictive coding (LPC) coefficients. The coefficients were obtained from speech frames decomposed with the discrete wavelet transform. LPC coefficients obtained from the sub-band decomposition of a speech frame (abbreviated as WLPC) offer a better representation than modeling the frame directly. The WLPC coefficients were further normalized in the cepstrum domain to obtain a new set of features, termed wavelet sub-band cepstral mean normalized features. These approaches offer effective (better recognition rate), efficient (reduced feature vector dimension), and noise-robust features. In [32], Tohidypour et al. presented speech frame recognition based on redundant wavelet filter banks (RWFB). The RWFB parameters were much less shift-sensitive than those of the critically sampled discrete wavelet transform (DWT). In that work, several kinds of wavelet representations were introduced, including a combination of the critically sampled DWT and several different multi-channel redundant filter banks down-sampled by 2.
To find suitable filter values for the multi-channel filter banks, the effects of changing the zero moments of the wavelet were discussed. The most widely applied speech representation is based on the Mel-frequency cepstral coefficients, which incorporate biologically motivated features into artificial recognizers. However, recognition performance with these features can still be improved, especially in difficult conditions. Vignolo et al. [40] described genetic wavelet packets for speech recognition. In this work, a genetic algorithm was presented that evolves a speech representation, based on a non-orthogonal wavelet decomposition, for phoneme classification. The results, obtained for a set of Spanish phonemes, showed that the proposed genetic algorithm was able to find a representation that improves speech recognition results. In addition, the optimized representation was evaluated in noisy conditions.
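The MFDWC idea described in this section, a DWT applied to the Mel-scaled log filter bank energies of one frame, can be illustrated with a minimal sketch. It assumes the Haar wavelet purely for simplicity (the surveyed papers do not necessarily use it), and the function names and the eight energy values are hypothetical.

```python
import math

def haar_dwt_step(signal):
    """One level of the orthonormal Haar DWT; len(signal) must be even."""
    s = 1.0 / math.sqrt(2.0)
    approx = [(signal[i] + signal[i + 1]) * s for i in range(0, len(signal), 2)]
    detail = [(signal[i] - signal[i + 1]) * s for i in range(0, len(signal), 2)]
    return approx, detail

def haar_dwt(signal, levels):
    """Multi-level Haar DWT, returned as [approx_L, detail_L, ..., detail_1]."""
    details = []
    approx = list(signal)
    for _ in range(levels):
        approx, detail = haar_dwt_step(approx)
        details = detail + details  # coarser-level details go in front
    return approx + details

# Hypothetical log filter bank energies for one speech frame (8 Mel bands).
log_energies = [2.0, 2.0, 4.0, 4.0, 1.0, 3.0, 5.0, 5.0]
features = haar_dwt(log_energies, levels=3)
```

Because this transform is orthonormal, the feature vector keeps the energy of the input frame, while each detail coefficient responds only to a local group of neighboring bands. That locality is the property the MFDWC papers rely on.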
Optimization algorithm based speech recognition
Several earlier works in the literature apply optimization algorithms to speech recognition, and some of the recent ones are briefly described in this section. Lin et al. [2] presented GA-based noisy speech recognition using a modified two-dimensional cepstrum (MTDC). This method enhanced the representativeness and robustness of the selected TDC coefficients in noisy environments. The MTDC differs from the standard TDC in its use of filters to remove the noise components. In addition, the GA-based MTDC method uses genetic algorithms (GAs) to find the robust coefficients in the MTDC matrix. Likewise, Ongkowijaya and Zhu [4] presented a GA-based weighted feature approach for speech recognition. Using a genetic algorithm, feature weighting was introduced to improve the recognition accuracy of an existing recognition system simply by applying a weight factor to the feature vector. Furthermore, the Gaussian mixture model (GMM) has been widely employed for modeling speakers. In speaker identification, one main problem is how to produce a set of GMMs for recognition from the training data. In practice, a random estimate of the initial model parameters often leads to a sub-optimal model, owing to the hill-climbing nature of the maximum likelihood (ML) method. To resolve this problem, Hong and Kwong [10] proposed a hybrid training method based on a genetic algorithm (GA), which combines the global search capability of the GA with the efficiency of the ML method.
In addition, Najkar et al. [26] presented HMM-based speech recognition using particle swarm optimization. The main idea was to generate an initial population of segmentation vectors in the solution search space and to improve the location of the segments with an updating algorithm. Several methods were introduced and assessed for the representation of particles and their associated movement structures. Additionally, two segmentation approaches were investigated. The first was standard segmentation, which attempts to maximize the likelihood function for each competing acoustic model independently. In the second, a global segmentation is tied between several models, and the system attempts to optimize the likelihood using this commonly tied segmentation. To improve speaker recognition performance, Yadav and Mandal [28] proposed speaker recognition using particle swarm optimization (PSO), in which the particle swarm discovers optimal regions of a complex search space through the interaction of individuals in the population. To overcome the drawbacks of the PSO algorithm, Luo et al. [33] proposed an improved PSO (IPSO) algorithm for speaker recognition. The IPSO algorithm not only considerably improves the convergence speed of the evolutionary optimization but also properly adjusts the balance between global and local exploration. A speaker recognition approach that uses this improved algorithm to train a support vector machine (SVM) was then presented. The experimental results show that the SVM optimized by IPSO attains higher classification accuracy than the standard SVM and improves speaker identification speed and precision. Similarly, Pan et al. [34] applied a genetic algorithm to speech recognition using a DHMM. The GA-trained DHMM was employed to raise the recognition rate for Mandarin speech.
Vector quantization based on a codebook is a basic step in recognizing the speech signal with a DHMM. A codebook was first trained by genetic algorithms on Mandarin speech features. The speech features were then quantized using the trained codebook, and the quantized features were used to train the DHMM statistically for speech recognition. In addition, Poonkuzhali et al. [37] presented ant colony optimization based automatic speech recognition. The large number of extracted features may contain noise and other unnecessary features. Therefore, an evolutionary algorithm called Ant Colony Optimization (ACO) was employed as an efficient feature selection method. With ACO, the unnecessary features were eliminated and only the best feature subset was retained. The total number of extracted features was found to decrease significantly.
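The particle swarm idea that recurs in this section, particles moving under the pull of their own best position and the swarm's best position, can be sketched as follows. This is a generic textbook PSO minimizing a toy cost function, not the segmentation or SVM training procedures of the cited works. The inertia and acceleration values, the search bounds, and all names are assumptions chosen for illustration.

```python
import random

def pso_minimize(cost, dim, n_particles=20, iters=200, seed=0,
                 w=0.7, c1=1.5, c2=1.5, lo=-5.0, hi=5.0):
    """Basic global-best PSO; returns (best_position, best_cost)."""
    rng = random.Random(seed)
    pos = [[rng.uniform(lo, hi) for _ in range(dim)] for _ in range(n_particles)]
    vel = [[0.0] * dim for _ in range(n_particles)]
    pbest = [p[:] for p in pos]                 # personal best positions
    pbest_cost = [cost(p) for p in pos]
    g = min(range(n_particles), key=lambda i: pbest_cost[i])
    gbest, gbest_cost = pbest[g][:], pbest_cost[g]  # swarm (global) best
    for _ in range(iters):
        for i in range(n_particles):
            for d in range(dim):
                r1, r2 = rng.random(), rng.random()
                # inertia + pull toward personal best + pull toward global best
                vel[i][d] = (w * vel[i][d]
                             + c1 * r1 * (pbest[i][d] - pos[i][d])
                             + c2 * r2 * (gbest[d] - pos[i][d]))
                pos[i][d] += vel[i][d]
            c = cost(pos[i])
            if c < pbest_cost[i]:
                pbest[i], pbest_cost[i] = pos[i][:], c
                if c < gbest_cost:
                    gbest, gbest_cost = pos[i][:], c
    return gbest, gbest_cost

def sphere(x):
    """Toy cost with its minimum at the origin (stand-in for a real objective)."""
    return sum(v * v for v in x)

best, best_cost = pso_minimize(sphere, dim=3)
```

In a recognition setting the cost function would be replaced by a task-specific objective, for example a negative log-likelihood or a validation error, and the inertia weight `w` is the parameter that governs the balance between global and local exploration.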
DTW algorithm based speech recognition
For over two decades, hidden Markov models (HMMs) have been the leading technique for acoustic modeling in speech recognition. Yet advances in the HMM framework have not solved its main problems: it discards information about time dependencies and is prone to overgeneralization. To overcome these problems, De Wachter et al. [13] presented template based continuous speech recognition using the well-known dynamic time warping (DTW) algorithm. The traditional top-down search was complemented with a data-driven selection of candidates for DTW alignment. They also extended the DTW framework with a flexible sub-word unit mechanism and a class-sensitive distance measure, two elements suggested by state-of-the-art HMM systems. Likewise, Furtuna [18] described dynamic programming algorithms in speech recognition. The dynamic time warping algorithm solves the problem efficiently with a dynamic comparison algorithm whose objective is to bring the temporal scales of the two words into optimal correspondence. In addition, Al-Haddad et al. [20] presented noise cancellation in speech by means of recursive least squares (RLS), with pattern recognition performed by a fusion of dynamic time warping (DTW) and hidden Markov models (HMM). Speech signals are frequently distorted by background noise, and the signal characteristics can change quickly. These issues are particularly significant for robust speech recognition, where robustness is the main concern. The algorithm was tested on speech samples from a Malay corpus. The fusion technique was used to combine the pattern recognition outputs of DTW and HMM.
The voice signal carries a vast amount of information. Muda et al. [21] presented an approach in which feature extraction and feature matching were used to represent the voice signal by means of Mel-frequency cepstral coefficients (MFCC) and dynamic time warping (DTW). MFCCs were employed as the feature extraction technique, and the non-linear sequence alignment known as DTW was employed as the feature matching technique. Because the voice signal tends to have a varying temporal rate, the alignment is important for achieving better performance. In addition, Bala [24] described a voice command recognition system based on MFCC and DTW. The work is divided into two modules. In the first module, features of the speech signal were extracted in the form of MFCC coefficients; in the second, the non-linear sequence alignment known as dynamic time warping (DTW), introduced by Sakoe and Chiba, was employed as the feature matching technique. In addition, Priyadarshini et al. [35] presented dynamic time warping based speech recognition for isolated Sinhala words. Their approach recognizes Sinhala speech based on DTW and MFCC. Likewise, Nandyala and Kumar [46] described hybrid HMM/DTW based speech recognition with a kernel adaptive filtering method. Kernel methods have shown good results in speech processing applications. MFCC features were used in the recognition process. An HMM system was employed to train the speech features, and the DTW method was employed for classification. Experimental results showed a relative improvement in recognition rate compared with traditional methods.
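The non-linear alignment that DTW performs can be shown with a minimal sketch. It assumes scalar features and an absolute-difference local cost for brevity, whereas the systems surveyed here compare MFCC vectors frame by frame. It also omits the band constraints often used to limit the warping path, and the function name is illustrative.

```python
def dtw_distance(a, b):
    """Classic dynamic-programming DTW between two 1-D sequences."""
    inf = float("inf")
    n, m = len(a), len(b)
    # D[i][j] = minimum cumulative cost of aligning a[:i] with b[:j]
    D = [[inf] * (m + 1) for _ in range(n + 1)]
    D[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            local = abs(a[i - 1] - b[j - 1])
            D[i][j] = local + min(D[i - 1][j],      # a advances, b repeats
                                  D[i][j - 1],      # b advances, a repeats
                                  D[i - 1][j - 1])  # both advance
    return D[n][m]

# The same pattern spoken at two rates: the middle value is held longer in the second.
d = dtw_distance([1.0, 2.0, 3.0], [1.0, 2.0, 2.0, 3.0])
```

Here the two sequences differ only in duration, so the warping path absorbs the repeated value and the distance is zero. A frame-by-frame Euclidean comparison could not even be applied, because the lengths differ.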
Sub-band based speech recognition
Among sub-band based approaches, Zhu et al. [5] proposed maximum likelihood sub-band adaptation for robust speech recognition. In this sub-band approach, frequency sub-bands were multiplied by weighting factors and then combined and converted to cepstra, which proved more robust than both full-band and conventional sub-band cepstra in their experiments. In addition, the weighting factors were estimated using maximum likelihood adaptation approaches in order to reduce the mismatch between the trained models and the observed features. They evaluated their methods on the AURORA2 and Resource Management tasks and attained consistent performance improvements on both. Joshi et al. [49] introduced a technique to overcome the difficulties of the above approach. Their equalization framework was a two-step process. In the first step, conventional histogram equalization (HEQ) was performed. By examining the histograms of the equalized cepstra, they showed that this first stage does not compensate for the sub-band specific noise distortion, even though the overall histogram is normalized. Therefore, in the second stage, sub-band specific histogram equalization was performed. Every frame of cepstral coefficients was divided into low-frequency (LF) cepstra and high-frequency (HF) cepstra, and separate equalization was applied to each to compensate for LF- and HF-specific noise distortion. The cepstra for the LF and HF bands were obtained by applying simple averaging and differencing filters to the cepstral components within a frame. This strategy was referred to as Sub-band Histogram Equalization (S-HEQ). Through histogram analysis, they showed that the S-HEQ approach was able to compensate for the sub-band specific noise distortion.
The S-HEQ approach showed a consistent improvement over the conventional HEQ approach, with relative WER improvements of 12% and 22.10% on the Aurora-2 and Aurora-4 databases, respectively. This equalization approach was also applied to deep neural network based systems and showed a consistent improvement in recognition accuracy over conventional HEQ.
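The conventional histogram equalization step that S-HEQ builds on can be sketched as a rank-based mapping onto a reference distribution. The sketch below assumes a standard normal reference and a single cepstral dimension tracked over a handful of frames. These choices, the function name, and the sample values are assumptions for illustration, not details taken from [49].

```python
from statistics import NormalDist

def histogram_equalize(values):
    """Map values onto standard-normal quantiles according to their rank."""
    n = len(values)
    order = sorted(range(n), key=lambda i: values[i])  # indices in ascending order
    reference = NormalDist(0.0, 1.0)
    out = [0.0] * n
    for rank, idx in enumerate(order):
        # The empirical CDF at this sample is (rank + 0.5) / n; invert the reference CDF.
        out[idx] = reference.inv_cdf((rank + 0.5) / n)
    return out

# Hypothetical values of one cepstral coefficient across five frames.
frame_cepstra = [3.0, -1.0, 10.0, 0.5, 2.0]
equalized = histogram_equalize(frame_cepstra)
```

The mapping preserves the ordering of the samples but forces their distribution toward the reference, which removes shifts and scalings introduced by noise. In the sub-band variant described above, the same mapping would be applied separately to the LF and HF cepstra rather than once to the full cepstrum.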
Other approaches based on speech recognition
Several other methods have also been developed for speech recognition in recent years. In [6], Erdogan et al. used semantic analysis to improve speech recognition performance. They described three novel language modeling techniques that employ semantic analysis for spoken dialog systems: concept sequence modeling, two-level semantic-lexical modeling, and joint semantic-lexical modeling. These models combine lexical information with varying amounts of semantic information, using annotation supplied by either a shallow semantic parser or a full hierarchical parser. Interpolating the proposed models with class N-gram language models provided additional improvement in the air travel reservation domain. They showed that as the amount of semantic information used and the tightness of integration between lexical and semantic items increased, they obtained better performance when interpolating with class language models, indicating that the two kinds of models become more complementary in nature.
Tan et al. [7] presented automatic speech recognition over error-prone wireless networks, offering a model of network degradations and robustness techniques. These techniques were grouped into three categories: error detection, error recovery, and error concealment (EC). A one-frame error detection scheme was described and compared with a frame-pair scheme. In contrast to vector-level techniques, a technique for error detection and EC at the sub-vector level was proposed.
Yao et al. [8] presented the generative factor analyzed HMM (GFA-HMM) for automatic speech recognition. In a standard HMM, observation vectors are represented by a mixture of Gaussians (MoG) that depends on a discrete-valued hidden state sequence. The GFA-HMM introduces a hierarchy of continuous-valued latent representations of the observation vectors. An expectation maximization (EM) algorithm was derived for maximum likelihood estimation of the model. In the experiments, by varying the latent dimension and the number of mixture components in the latent spaces, the GFA-HMM achieved a more compact representation than the standard HMM.
Mesot and Barber [14] presented switching linear dynamical systems for noise-robust speech recognition. They jointly modeled the dynamics of both the raw speech signal and the noise by means of a Switching Linear Dynamical System (SLDS). This model was tested on isolated digit utterances corrupted by Gaussian noise. In contrast to the autoregressive HMM and its derivatives, which offer a model of uncorrupted raw speech, the SLDS was comparatively noise robust and also considerably outperformed the state-of-the-art feature-based HMM. The computational complexity of the SLDS scales exponentially with the length of the time series. To address this, they employed Expectation Correction, which offers a stable and accurate linear-time approximation for this important class of models, aiding their further application in acoustic modeling.
Skowronski and Harris [17] described noise-robust automatic speech recognition using a predictive echo state network (ESN). They built the predictive ESN classifier by combining the ESN with a state machine framework. In small-
Smaragdis and Shashanka [15] presented secure speech recognition. They proposed a process that enables privacy-preserving speech recognition transactions between two parties. They assume one party holds private speech data and the other holds private speech recognition models. Their objective was to enable these parties to carry out a speech recognition task on their data without revealing their private information to each other. They showed how to apply secure multiparty computation principles and argued that the resulting system is both computationally correct and secure.
Srinivasan and Wang [16] described transforming binary uncertainties for robust speech recognition. The cepstral transformation spreads the information from the noise-dominant time-frequency regions across all the cepstral features. They described a supervised approach that uses regression trees to learn the nonlinear transformation of the uncertainty from the linear spectral domain to the cepstral domain. This uncertainty was used by a decoder that exploits the variance associated with the enhanced cepstral features to improve robust speech recognition.
Dai et al. [50] described 2-D psychoacoustic modeling of equivalent masking for automatic speech recognition. Only those sounds that fall beneath the masking threshold were adopted, which better replicates physical masking effects. They reported detailed experimental results showing the relationships between the subtractive and additive approaches. Because all the parameters of the proposed filters were positive or zero, they were named 2-D psychoacoustic P-filters. A theoretical analysis was provided to demonstrate the noise removal ability of these filters. Experiments were performed on the AURORA2 database.
Categorization based on techniques from 2000 to 2009
Categorization based on techniques from 2009 to 2012
Liu et al. [47] described homogeneous ensemble phonotactic language recognition based on SVM supervector reconstruction (SSR). In this architecture, the subsystems share the same feature extraction, decoding, and N-gram counting preprocessing steps, but model in different vector spaces by employing the SSR algorithm without significant additional computation. They named this the homogeneous ensemble phonotactic language recognition (HEPLR) system. The system incorporates three different SVM supervector reconstruction algorithms: comparative, functional, and perturbing SVM supervector reconstruction. All of these algorithms were integrated through a linear discriminant analysis-maximum mutual information (LDA-MMI) backend to improve language recognition evaluation (LRE) accuracy. Evaluated on the National Institute of Standards and Technology (NIST) LRE 2009 task, the HEPLR system attained better performance than a baseline Phone Recognition-Vector Space Modeling (PR-VSM) system with minimal extra computational cost.
Categorization based on techniques from 2013 to 2015
Categorization based on application from 2000 to 2015
Rashno et al. [39] described highly efficient dimension reduction for text-independent speaker verification based on the ReliefF algorithm and support vector machines. They presented a feature selection approach based on the ReliefF algorithm for ASV systems that use support vector machine (SVM) classifiers. The method is wrapper-based but uses ReliefF weights to keep its complexity low. Hence, it has lower complexity than other wrapper-based methods, can achieve a 69% reduction in feature dimension, and reaches an Equal Error Rate (EER) of 1.25% in the best case, which occurred with the RBF kernel of the SVM. The proposed method was compared with Genetic Algorithm (GA) and Ant Colony Optimization (ACO) methods on the feature selection task. The results showed that its EER, number of selected features, and time complexity were lower than those of the GA and ACO methods for different SVM kernels.
Categorization based on parameter measure
All the articles considered in this survey are categorized according to three factors: technique, application, and parameter measure.
Categorization based on techniques
Table 1 (2000 to 2009) includes 20 research papers, of which two works used fuzzy logic, three used genetic algorithms (GA), four were based on wavelet transforms, three used dynamic time warping (DTW), and the remaining works used echo state networks, SLDS, HMM, error-prone wireless network techniques, semantic analysis, etc. Table 2 (2009 to 2012) presents 15 research papers, of which three works used the particle swarm optimization (PSO) algorithm, three used neural network classifiers, three used dynamic time warping (DTW), four used wavelet transforms, and the remaining works used fuzzy logic, genetic algorithms, etc. Table 3 presents the 15 research articles from 2013 to 2015. In this period, five works used neural networks, four used fuzzy logic, two used support vector machines (SVM), and the others used ACO, DTW, sub-band methods, etc.
Categorization based on application
Here, the 50 research articles on speech recognition are grouped by application, as listed in Table 4. Of these, 37 works addressed speech recognition, five addressed speaker verification or recognition, two addressed speech enhancement, and six addressed voice recognition, gender recognition, language recognition, etc.
Categorization based on parameter measure
In this section, we explain most of the parameters used to measure the effectiveness of speech recognition techniques. Different measures, such as recognition rate, accuracy, word error rate, and bit error rate, were used. Table 5 shows the parameter measures covered in this survey.
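Of the measures listed above, word error rate (WER) is the least self-explanatory. It is the minimum number of word substitutions, deletions, and insertions needed to turn the recognizer output into the reference transcript, divided by the number of reference words. A minimal sketch, assuming whitespace tokenization and a non-empty reference (the function name and sentences are illustrative):

```python
def word_error_rate(reference, hypothesis):
    """WER = (substitutions + deletions + insertions) / number of reference words."""
    ref, hyp = reference.split(), hypothesis.split()
    # d[i][j] = word-level edit distance between ref[:i] and hyp[:j]
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i  # delete all i reference words
    for j in range(len(hyp) + 1):
        d[0][j] = j  # insert all j hypothesis words
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            substitution = d[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])
            deletion = d[i - 1][j] + 1
            insertion = d[i][j - 1] + 1
            d[i][j] = min(substitution, deletion, insertion)
    return d[len(ref)][len(hyp)] / len(ref)

# One substitution ("sat" -> "sit") and one deletion ("the") over six reference words.
wer = word_error_rate("the cat sat on the mat", "the cat sit on mat")
```

Word-level recognition rate is commonly reported as one minus the WER. Because insertions are counted, the WER can exceed 100% when the hypothesis is much longer than the reference.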
Summary of the analysis by type of speech recognition technique.
Based on the literature review, we identified several important challenges that require more advanced procedures to handle them. The challenges fall into two categories, listed below.
Practical challenges
Although advances have been made recently in automatic speech recognition, robust and accurate speech recognition is still a challenging problem because of complex factors such as variations in speakers and content and environmental distortion.
A major challenge in speech recognition is overcoming speaker variability. Spoken sentences and continuous dialogue change with the emotional state of the speaker and are also affected by the speaker's individual characteristics. Hence, recognizing speech in a speaker-independent mode is essential, and achieving a speaker-independent system is regarded as a breakthrough [7].
A further practical challenge is to take emotional expression into account in speech recognition; the system should also handle gender variability explicitly.
Recent methods have also not been proven robust against the various noise and reverberation conditions encountered in real-life applications.
Algorithmic challenges
In speech recognition, one of the central research issues is how to extract discriminative, affect-salient features from speech signals. Several speech features have been proposed in the literature for this purpose. The challenge is to appropriately include four categories of features: 1) acoustic features, 2) linguistic features (words and discourse), 3) context information (e.g., subject, gender, and turn-level features representing local and global aspects of the dialogue), and 4) hybrid features that combine acoustic features with other information.
Two major issues need to be handled when developing recognition systems: (i) the selection of the right features, which must capture the unique information that makes the signal easily recognizable by any classification model, and (ii) the right selection of samples for training a classification model.
Speech analysis methods also need to handle the following signal characteristics, which directly affect speech recognition systems: (iii) the huge variability of utterances that can be pronounced, with no evidence about the best choices, and (iv) the inter-speaker variance of the speech features, which must be considered for optimal training of the classification model.
MFCC is a traditional feature that has recently been used for speaker recognition. However, MFCC fails to extract significant feature values from the speech signal when more noise is present.
Speech recognition under noisy conditions poses a further challenge for researchers: developing a noise-adaptive classification algorithm.
In the literature, most speech processing methods use HMM and GMM for the classification of emotions, but the major problem with these methods is that they require detailed assumptions about the data distribution and model parameters.
GMM-UBM has been used for classification, but it is not well suited to noisy environmental conditions. GMM-UBM also requires consistent training with more data samples.
Neural network-based classification models also require large amounts of training data for good classification, and low-level features, local order, and intrinsic characteristics are difficult to handle with these acoustic models.
Future directions
As a future direction for speech recognition, the transition rates (TRs) of a Markov process can readily be incorporated to determine the performance and behavior of speech recognition systems. Results for Markovian jump linear systems (MJLSs) have been obtained under the assumption that the mode transition information is fully known [84]. MJLSs are described by a set of standard differential (or difference) equations together with a Markov stochastic process [85]. MJLSs are well suited to modeling practical systems such as power systems, networked control systems, and manufacturing systems, in which random failures, repairs, and sudden environmental changes may occur [86]. Markov chains can likewise be applied effectively to noise modeling and prediction for the speech signal.
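To make the MJLS description concrete, the following toy simulation pairs a scalar difference equation with a two-state Markov chain that selects the active mode at each step. It is only an illustrative sketch: the mode dynamics, the transition matrix, and all names are hypothetical and are not taken from [84-86].

```python
import random

def simulate_mjls(a, P, x0, r0, steps, seed=0):
    """Simulate x[k+1] = a[r[k]] * x[k], where the mode r[k] follows a Markov
    chain with row-stochastic transition matrix P."""
    rng = random.Random(seed)
    x, r = x0, r0
    xs, rs = [x], [r]
    for _ in range(steps):
        x = a[r] * x  # state update under the currently active mode
        # Sample the next mode from row r of the transition matrix.
        u, acc, nxt = rng.random(), 0.0, len(P) - 1
        for j, p in enumerate(P[r]):
            acc += p
            if u < acc:
                nxt = j
                break
        r = nxt
        xs.append(x)
        rs.append(r)
    return xs, rs

a = [0.5, 1.1]                 # hypothetical modes: 0 contracts, 1 expands
P = [[0.9, 0.1], [0.3, 0.7]]   # hypothetical mode transition probabilities
xs, rs = simulate_mjls(a, P, x0=1.0, r0=0, steps=50)
```

For noise modeling, the modes could stand for different acoustic conditions (for example quiet and noisy), with the transition matrix describing how often the environment switches between them.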
On the other hand, the T-S fuzzy affine model with parametric uncertainties has been used to model nonlinear plants [87]. Fuzzy logic theory provides an important approach to the analysis and design problems of complex nonlinear systems. For synthesizing highly complex nonlinear systems, the dynamic T-S fuzzy model has attracted much attention in model-based fuzzy analysis approaches, since it is conceptually simple and rigorously valid [88]. T-S fuzzy models are well suited to approximating any smooth nonlinear function on any compact set to any degree of precision [89]. Fuzzy controller design for a class of semi-linear hyperbolic partial differential equation (PDE) systems on the basis of T-S fuzzy PDE models has shown that the controller gains can be obtained efficiently [90]. Fuzzy logic systems have been used to estimate unknown nonlinear functions, and a fuzzy estimator model has been developed to approximate the states [91]. With a fuzzy estimator, continuous control signals can be obtained both at the sampling instants and during the inter-sampling period [92, 93]. The fuzzy systems discussed above could be used effectively in speech recognition to make decisions in a flexible way.
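The core mechanism of a T-S fuzzy affine model is that each rule contributes a local affine model, and the output is the membership-weighted average of the rule outputs. A minimal single-input sketch follows, assuming Gaussian membership functions and two hypothetical rules. None of the parameter values come from [87-93].

```python
import math

def gaussian_membership(x, center, width):
    """Degree to which x belongs to a fuzzy set centered at `center`."""
    return math.exp(-0.5 * ((x - center) / width) ** 2)

def ts_fuzzy_output(x, rules):
    """Each rule is (center, width, slope, intercept): IF x is near `center`
    THEN y = slope * x + intercept. Output is the normalized weighted sum."""
    weights = [gaussian_membership(x, c, w) for c, w, _, _ in rules]
    total = sum(weights)
    return sum(wt * (a * x + b)
               for wt, (_, _, a, b) in zip(weights, rules)) / total

# Two hypothetical rules: output -1 near x = -1, output +1 near x = +1.
rules = [(-1.0, 1.0, 0.0, -1.0), (1.0, 1.0, 0.0, 1.0)]
y_mid = ts_fuzzy_output(0.0, rules)   # both rules fire equally here
y_far = ts_fuzzy_output(5.0, rules)   # the second rule dominates here
```

The smooth blending between local models is what lets such a system approximate a nonlinear function. Inputs very far from every rule center would drive all weights toward zero, so a practical implementation would need to guard the normalization.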
Conclusion
Speech recognition is one of the most fascinating areas of machine intelligence, since humans perform speech recognition as an everyday activity. In this survey, we have reviewed various speech recognition techniques and tabulated the different applications and parameter measures used for speech signals. Fifty articles related to speech recognition, published from 2000 to 2015, were selected. These 50 articles were categorized in three ways, and we also studied their limitations and time complexity. Three factors, namely technique, application, and parameter measure, were considered when comparing and reviewing the existing works. The detailed review in this paper summarizes the progress achieved in speech recognition and should help researchers formulate ideas for surpassing the current benchmark results. Finally, some open research issues were identified to guide further research in this direction, and recent intelligent algorithms were discussed with a view to incorporating them into speech recognition in the future.
