Abstract
Nowadays, the speech recognition applications can be found in several activities, and their existence as a field of study and research lasts for a long time. Although, many studies deal with different problems, in security-related areas, biometric identification, access to the Smartphone… Etc. In automatic speech recognition (ASR) systems, hidden Markov models (HMMs) have widely used for modeling the temporal speech signal. In order to optimize HMM parameters (i.e., observation and transition probabilities), iterative algorithms commonly used such as Forward-Backward or Baum-Welch. In this article, we propose to use the bacterial foraging optimization algorithm (BFOA) to enhance HMM with Gaussian mixture densities. As a global optimization algorithm of current interest, BFOA has proven itself for distributed optimization and control. Our experimental results show that the proposed approach yields a significant improvement of the transcription accuracy at signal/noise ratios greater than 15 dB.
Keywords
Introduction
Speech recognition is the capacity to listen to talked words and distinguish different sounds exhibit in it, recognize them as words of some known language. In the computer area, the Speech recognition is might be characterized as the capacity of the computer framework to acknowledge talked words in the sound configuration like wav and afterward create its substance in the content arrangement. Automatic speech recognition (ASR) [1] simulates human listening, it transforms speech into text.
This sort of conversion should be independent of vocabulary size, accent, speaker characteristics such as male or female etc. A more technical definition is given by Jurafsky, where he defines ASR as the building of system for mapping acoustic signals to a string of words. He continues by defining automatic speech understanding (ASU) as extending the goal to producing some sort of understanding of the sentence. Speech recognition is basically a pattern recognition problem. This involves extracting features from the input signal waves and classifying them to classes using pattern matching model. Performance of ASR system is measured on the basis of recognition accuracy, complexity, and robustness. Advantages of automatic speech recognition are accessibility for the deaf, cost reduction through automation, searchable text capability.
The problem of ASR is to program a computer to take digitized speech samples and print the words that a human would recognize when listening to the same sound. ASR has grown roughly in proportion to other areas of pattern recognition because of the desire to invent the machine capable of making complex decisions, and, practically, one that could function as swiftly as humans.
Although we have learned a great deal about how to build practical and useful speech recognition systems, there remain a number of fundamental questions about the technology to which we have no definitive answers. Unmistakably the speech signal is a standout amongst the most complex signals that we have to manage. It is created by a human’s vocal system and in this way difficult to be described by a straightforward 2-dimensional model of sound spread. While there exist various advanced numerical models which endeavor to simulate the speech production system, their demonstrating capacity is as yet restricted.
In this work, we investigate the effectiveness of using the Bacteria Foraging Optimization Algorithm (BFOA) for optimizing the structure and parameter learning of HMM. The BFOA, proposed by Passino [2], is a newcomer to the family of nature-inspired optimization algorithms. For over the last five decades, optimization algorithms like Genetic Algorithms (GAs) [3, 4], Evolutionary Programming (EP) [5], Evolutionary Strategies (ES) [6], which are inspired by evolution and natural genetics, dominated optimization algorithms field. Recently natural swarm inspired algorithms like Particle Swarm Optimization (PSO) [7], Ant Colony Optimization (ACO) [8] have found their way in this research area and proved their effectiveness. Following the same trend of swarm-based algorithms, Passino proposed the BFOA [2]. Furthermore, it is possible to design operators that favor biologically plausible changes to the structure of an HMM. That is, to ensure that modules of the states are kept intact. The objective of this work is to propose algorithms that improve this quality. The criterion used to quantify the quality of HMM is the probability that a given model generates a given observation. To solve this problem, we use as we have already mentioned a BFOA hybridization with HMM.
This paper is organized as follows, the first section is an introduction, the second, in which work motivations are exposed, is an overview of previous related works. The third section presents the proposed approach, the ASR system based BFOA/HMM and details about the used method. The experimental results of the proposed method are discussed in section four. Finally, section five concludes the paper with some perspectives.
Related work
To our best learning, the point of the machine and human interaction is to utilize the most regular method of communicating, through our speech. The execution of the programmed ASR systems depends on traditional HMMs which depend on maximum likelihood estimation (MLE) systems. Different models have been investigated by specialists like neural and Bayesian systems, discriminative training strategies, state duration modeling and the use of support vector machines with HMM.
The most punctual endeavors to devise systems for ASR by machine were made in the 1950’s. In 1952’s, at Bell Laboratories, Biddulph, and Balashek [9] fabricated a system depended heavily on measuring spectral resonances during the vowel region of each digit. In 1970’s speech recognition research achieved various huge turning points. To start with, the isolated word or discrete utterance recognition became a viable and usable technology based on the fundamental studies by Velichko and Zagoruyko in Russia [10, 11, 12].
In recognizing syllables or isolated words, the human auditory systems perform above chance level already at
Norris in 2008’s [14] presented a Bayesian model of continuous speech recognition, which is based on shortlist and shared many of its key assumptions: parallel competitive evaluation of multiple lexical hypotheses, phonologically abstract pre-lexical and lexical representations, feed-forward architecture with no online feedback, and a lexical segmentation algorithm based on the viability of chunks of the input as possible words.
Neural network models are powerful speech recognition engines [15]. Their ability to classify data and ability in parallel processing pave the way for speech recognition. A typical neural network consists of the input layer, hidden layer, output layer. Input layer receives the input signal and transfers the data to the hidden layer. Hidden layer computes the action function and all the necessary calculations are done in this layer. After computations output is transferred to the output layer. Graves in 2013’s [16] investigated deep recurrent neural networks (RNNs), which combined the multiple levels of representation, that have proved so effective in deep networks with the flexible use of long-range context that empowers RNNs. When they trained end-to-end with suitable regularization, they found that deep Long Short-term Memory of RNNs achieve a test set error of 17.7% on the TIMIT phoneme recognition benchmark.
Support Vector Machines (SVM) supervised learning models with associated learning algorithms. They are used for classification and regression analysis. They analyze the data and recognize patterns. Text-independent speaker recognition uses as their features a compact representation of a spoken utterance, known as i-vector [17]. Rather than estimating an SVM model per speaker, according to the one versus all discriminative paradigms, the Pairwise Support Vector Machine (PSVM) approach classifies a trial, consisting of a pair of i-vectors, as belonging or not to the same speaker class. Training a PSVM with a large amount of data, however, is a memory and computationally expensive task.
Generalized variable Hidden Markov Model (GVP-HMM) is used for speech recognition in a noisy environment [10]. A crucial task of automatic speech recognition systems is to robustly handle the mismatch against a target environment introduced by external factors such as environmental noise. When these factors are of time-varying nature, this problem becomes seven more challenging.
The used language in this study is Arabic. It is well known that Arabic is the fifth most widely spoken language in the world with approximately 300 million native speakers, cutting across a wide geographical area from North Africa to the Middle East. It is also one of the six official languages adopted in the United Nations and represents the official language in some twenty-two countries, whereas there are substantial Arabic-speaking communities in many countries. Arabic is also the liturgical and worship language for more than one billion and a half Muslims worldwide [18].
Moreover, many challenges are faced by Arabic speech recognition. For example, Arabic has short vowels which are usually ignored in the text, which add more confusion to the ASR decoder. Additionally, Arabic has many dialects where words are pronounced differently. Elmahdy and Gruhn in [19]summarized the main problems in Arabic speech recognition, which include Arabic phonetics, discretization problem, grapheme-to-phoneme relation, and morphological complexity. Bourouba in 2006’s [20] presented a new HMM/support vectors machine (SVM) (k-nearest neighbor) for the recognition of isolated spoken words. Sagheer in 2005’s [21] proposed a novel visual speech features representation system. They used it to comprise a complete lip-reading system. While Muhammad in 2011’s [22] evaluated conventional ASR system for six different types of voice disorder patients speaking Arabic digits. Mel-frequency cepstral coefficients (MFCCs) and Gaussian mixture model GMM/HMM are used as features and classifier, respectively. The recognition result is analyzed for types of diseases.
Asr system based BFOA/HMM
The structure of an ASR system is divided into three modules according to their role. The first module is feature extraction to capture the most relevant and discriminate characteristics of the signal to recognize. The second subsystem is the training module, whose function is to create the knowledge about the speech and language to be used in the system. The final module is the recognition one, whose function is tried to figure out the meaning of the input speech given in the testing phase.
Figure 1 shows the different stages in the processes of learning and recognition of the proposed system. Each element in this figure will be described further.
An overall architecture of an ASR system.
Steps of RASTA-PLP analysis.
The audio features extraction consists of using pre-processing techniques such as signal enhancement and environment sniffing to help preparation of the incoming audio stream for the features extraction step [23]. Generally, the audio is broken down into sliding window “frames” and features are extracted from each frame. In many cases, the frame size is on the order of 10 ms. In the literature, many results are reported about the audio features extraction for clean and noisy speech conditions. The MFCCs and linear prediction coefficients (LPCs) are the most commonly used acoustic features and additional research is ongoing in the field of noise robust acoustic features.
After the acoustic feature extraction step, the first and second derivatives of the data are usually concatenated with the original data to form the final features vector. The original data is also known as the static coefficients while the first and second derivatives are known as delta and delta-delta or acceleration coefficients.
The frequency response of the communication channel influences easily most speech parameter estimation techniques. For this reason, Hermansky in 1991’s [24] developed a technique that is more robust to such steady-state spectral factors in speech. This approach is conceptually simple and computationally efficient.
The proposed method for speech analysis is based on Relative-Spectral Analysis-Perceptual Linear Predictive method (RASTA-PLP) for feature extraction. This technique is an improvement of the traditional PLP method, which consists in a special filtering of the different frequency channels of a PLP analyzer. The previous filtering is done to make speech analysis less sensitive to the slowly changing or steady-state factors in speech. The RASTA method replaces the conventional critical-band short-term spectrum in PLP and introduces a less sensitive spectral estimation. For the RASTA-PLP features, an additional filtering is applied after decomposition of the spectrum into critical bands. This RASTA filter suppresses the low modulation frequencies which are supposed to stem from channel effects rather than from speech characteristics.
The process of computing RASTA-PLP coefficients is showed in Fig. 2.
Concerning the filter, the transfer function is denoted by Eq. (1):
The low cut-off frequency of the filter determines the fastest spectral change of the log spectrum. The high cut of frequency indicates the fastest spectral change which is preserved in the output parameters. The higher values of the band-pass filter attenuate the convolution noise. The RASTA filter has a long time constant of the integration that is of 500 ms for the particular filter of Eq. (1). It means that the current analysis result depends on previous outputs stored in the memory of the recursive RASTA filter. Thus, the analysis results depend on the time in where the analysis starts. The PLP and RASTA-PLP streams were augmented by their delta features [24].
In this work, the sampled speech data are segmented into 0.025 seconds frames with a 0.010 seconds overlapping of two consecutive windows. After application of a Hamming window, each of these frames is analyzed using the RASTA-PLP speech analysis technique, in order to make speech analysis more robust to spectral distortions.
The well-known Baum-Welch algorithm can be used to simultaneously estimate the state transition probabilities and the observation probabilities in a maximum likelihood framework from the sequence of observations. The states in HMM are assumed as discrete values. However, the observations can have either discrete (limited) values, modeled with probabilities in matrix B, or continuous (or continuous discretized), modeled by conditional probability density functions.
We applied BFOA to optimize the Baum-Welch algorithm in left-to-right HMM. The results between the HMM system and hybrid BFOA/HMM were compared to analyze how BFOA can improve the recognition rate in hybrid GA/HMM.
Hidden Markov Model (HMM)
Speech recognition systems generally assume that the speech signal is a realization of some message encoded as a sequence of one or more symbols. To perform the reverse operation of recognizing the underlying symbol sequence given a spoken utterance.
In this study, we used the Hidden Markov Models (HMM) [25], which have emerged as the predominant technology in speech recognition filed in recent years. These model shave proved to be the most appropriate to the speech recognition problems.
The HMMs are a class of models in which the distribution that generates an observation depends on the state of an underlying but unobserved Markov process [26]. Thereby, HMM is a combination of two processes, namely a Markov chain which determines the state at time
The HMM outputs are modeled as continuous density output probabilities by representing each output probability as a Gaussian mixture. Thus, the output probability for an HMM in state
Here, the sum is over the mixtures (
As the probability distribution for the output probabilities must integrate to one,
There are three (03) fundamental problems related to the HMM; the evaluation problem, the problem of determination of the path of states and the problem of learning (supervised or unsupervised). There are other problems related to HMM such as the overflow of representation of numbers in the machine, insufficient data for learning, the update of models when processes vary in time, the choice of HMM architecture the best adapted to data, the choice of a good initial estimate of the probabilities of the HMM. The Forward algorithm allows the calculation of the likelihood of the observations, the Viterbi algorithm finds the optimal path which is the most likely to follow and the Baum-Welch algorithm performs the supervised learning (re-estimate the parameters of HMM) [24].
Bacteria Foraging Optimization Algorithm (BFOA) developed by Liu and Passino [27] is a generally new populace based calculation; it is a nature-inspired meta-heuristic calculation, which emulates the foraging behavior of E. coli found in a human digestive tract.
As indicated by scrounging hypothesis, the bacteria need to search for the nutrients and achieve the nutrients in such a manner that energy consumption per unit time is minimized.
Description of parameters
Description of parameters
A gathering of bacteria forages for the nutrient region and flag others about the area of the supplements with the assistance of chemical attractants. Likewise, they help other creature to keep away from the poisonous locales by generating chemical repellents. The way of finding the regions of high nutrients by the bacteria can be modeled as an optimization process. BFOA has been applied in various optimization problems associated with the real-world and therefore already gained the attention of the researchers in the domain.
The swarm of microorganisms “Bacteria” S acts as takes after [2]:
Bacteria are discretionarily spread in the map of nutrients. Bacteria travel towards high-nutrient regions on the map. Those get their nourishment inadequate, expanded long and will split into two equal parts at a reasonable temperature. Besides those situated in a low nutrient district will scatter and the individuals who located in a toxic area will die. Bacteria located in the promising districts of their environment will attempt to pull in others microorganisms by producing chemical attractants. Bacteria are situated in the most elevated nutrient districts. Bacteria scatter as to search for new nutrient districts in their environment.
Bacteria foraging behavior searching conduct incorporates four primary advances: Chemo-taxis (tumble and swimming), swarming, Reproduction and Elimination-dispersal.
In this study, we explore the application of BFOA in order to optimize the learning of the HMMs. As other swarm intelligence algorithms, BFOA is inspired from the social and cooperative actions found in nature. In fact, the optimization process here consists in the way bacteria rummage around for the high nutrient regions. The application of BFOA is various in real-world optimization problems and therefore gained the attention of the researchers in the field.
Suppose that we want to find the minimum of
The definition of a chemotaxis step to be a tumble followed by a tumble or a tumble followed by a run is as follows:
Let
Let
Here,
Chemo-taxis
Biologically the movement of E. Coli bacterium is restricted into two different ways, either it can swim for a period of time in the same direction or it may tumble for the entire lifetime. In the BFOA, a unit step in random direction represents a “tumble” and a unit step with the same direction in the last step indicates a “run”, which simulates the chemo-taxis procedure is used in Eq. (6).
In each chemotaxis step, the bacterium generated a tumble direction firstly. Then the bacterium moves in the direction using Eq. (6). If the nutrient concentration in the new position is higher than the last position, it will run one more step in the same direction. This procedure continues until the nutrient get worse or the maximum run step is reached. The maximum run step is controlled by a parameter called Ns.
Swarming
Interesting group behavior has been observed for several motile species of bacteria including E. coli and S. Typhimurium, where intricate and stable spatiotemporal patterns (swarms) are formed in the semisolid nutrient medium. A group of E. Coli cells arranges themselves in a travelling ring by moving up the nutrient gradient when placed amidst a semisolid matrix with a single nutrient chemo-effecter. The cells when stimulated by a high level of succinate, release an attractant aspartate, which helps them to aggregate into groups and thus move as concentric patterns of swarms with high bacterial density.
BFOA simulates this social behavior by representing the combined cell-to-cell attraction and repelling effect. Now combined cell-to-cell attraction and repelling effects can be modeled as:
where
Reproduction
For every
Elimination and dispersal
In nature, the changes of environment where population lives may affect the behaviors of the population. For example, the sudden change of temperature, nutrient concentration and the flow of water. All these may cause bacteria in the population to die or move to another place. To simulate this phenomenon, eliminate-dispersal is added in the BFOA. After every
Generally, in the problem of HMM learning, the individuals are HMM. It is necessary to encode the HMM tore present the bacterium at
The population of size
A discriminated model is used to do Recognition, where learning will be associated with each word learned in HMM. Recognition will be done by calculating for each known HMM its probability of generating the word to recognize. The word will be recognized that the associated HMM obtained a maximum score.
The HMM who has the highest probability of generating the input data is selected in decision stage. In our experiment, the decision is performed using the Viterbi algorithm.
The Viterbi algorithm is used to find the best state sequence
It can be induced that:
The best state sequence can be retrieved by keeping track of the argument that maximizes the previous equation for each
The performance analysis of various speech recognition systems is evaluated. Speech recognition has a big potential in becoming an important factor of interaction between human and computer. A successful speech recognition system has to determine features not only present in the input pattern at one point in time but also features of input pattern changing over time. The most common performance measure is word error rate. This measure is computed by comparing a reference transcription output by the speech recognizer. From this comparison it is possible to compute the number of errors, which typically belongs to three categories:
Insertions I (when in the output of ASR it is present a word not present in reference), deletions D (a word is missed in ASR output) and substitutions S (a word is confused with another word).
Database
In order to test the proposed system, a multi-speakers database was used for a speech recognition task. Therefore, we used the AVARB database [26]. It contains pronunciations of Arabic words isolated, taken at a sampling frequency of 16 KHz.
Video data were captured using a digital camera with a 690
Experimental results
As a standard procedure in evaluating machine learning techniques, the dataset is split into training and test sets. Therefore, we have used 2/3 of the data for the learning stage and there maining 2/3 to test the effectiveness of our ASR system.
The proposed ASR system that uses RASTA-PLP method as audio features, and a BFOA/HMM for the speech modeling, was implemented as described in the previous sections. The whole programming was implemented in MATLAB. As a result, we obtain a matrix of 27 parameters by integrating the first and second derivative of the parameters. The BFOA/HMMs recognizers were built using the Hidden Markov Model Toolkit (HTK) [29].
Various kinds of the instance with different BFOA control parameters have been solved with our algorithm, in order to evaluate the performance of the proposed system, we have compare it with the classical training by the baum-Welch method. Each instance is run 15 times with a different number of the mixture; also the used maximum number of iteration for EM algorithm is 40. In the tables below, the obtained maximum average
BFOA parameters training HMM for AVARB database
BFOA parameters training HMM for AVARB database
We can clearly notice from Table 2 that the results are varied according to the parameters training of the BFOA, also to the number of stats and the number of mixture. However, with
To estimate the reliability of this system we must be calculated recognition rate that will allow us to evaluate the performance of each algorithm in order to compare them. For that, we choose an additional model which is the multilayered perceptron (MLP) neural networks. We use an MLP with a single hidden layer, and the input layer consists of X neurons, such that X represents the resulting classes of clustering.
Influence of the number of chemo-tactic steps on the recognition rates of the ASR system.
The net is trained by at first choosing little random weights and interior limits, and displaying all training data repeatedly. Weights are adjusted after every trial until converge and the cost function is reduced to an acceptable value.
The activation functions used were a hyperbolic tangent for the hidden layer and sigmoid for the output layer. The hyperbolic tangent function is given by:
The sigmoid function is given by:
After several tests over the training and test data, we find that the fitness function in BFOA is better with 21 iterations, which yields better results for optimizing HMM parameters. In Fig. 3, we can notice also that the BFOA fitness function increase faster, substantially at the beginning iterations.
In the following, we present our experimental results of the proposed ASR system over a range of noise levels using these three models. In order to simulate various noise conditions, the artificial white Gaussian noise was added.
The experiment was conducted under a mismatched condition, where the recognizers were trained at 20 dB SNR, and acoustic white Gaussian noise, ranging from
Recognition rates of ASR system for varying SNR levels.
We performed evaluation and comparison of the performance of the system by calculating the average recognition rate for each SNR levels, overall utterances for each word. Moreover, the HMM model is trained by using the traditional method (Baum-Welch algorithm). The Fig. 4 shows that in most of the cases, the recognition rates obtained with the proposed BFOA/HMM system provides better results in compare is onto those obtained by the HMM-based system, and the percentage increase amount from 1.4% to 15.3%, along with the increase in the size of the population.
Automatic Speech Recognition is viewed as an essential piece of human-computer interfaces that are conceived to utilize discourse, among different means, to accomplish regular, inescapable figuring. The state of ASR lacks robustness, to channel and environment noise continues to be a major impediment.
ASR system takes a human discourse as an information and requires a series of words as yield. The issue of consequently perceiving discourse with the assistance of a computer is a troublesome issue, and the purpose behind this is the unpredictability of the human dialect. Lack of linguistic corpora for dialect language models seems to be a difficulty in implementing an ASR. This stems from trouble in gathering sentences with vernaculars. The extent of the issue ought to be widened into larger vocabularies, continuous speech.
We presented in this paper an approach for an ASR system using BFOA/HMM with a Gaussian mixture density for modeling the Arabic speech. In order to further optimize the solution found by the Baum-Welch algorithm.
Based on the obtained results, we can conclude that the system modeled by HMM and trained by our BFOA/HMM have higher rates of recognition than the HMM trained by the Baum-Welch algorithm. Moreover, the classification results of Baum-Welch are greatly dependent on the initialization of the parameters in contrast to the BFOA, which gives stable results independently of the initial values.
As perspective, we are planning to cover more issues about improving the performance of the system, by enhancing the size of our database and increasing the number of speakers. In addition, we intend to test our system with other alternatives methods of fusion.
