Abstract
This paper presents a methodology for training small domain-specific language models. The methodology was applied to create a language model for a demonstration version of an in-vehicle infotainment speech interface. It is based on constructing an initial n-gram language model from the vocabulary alone. Such a “random” model can be easily adapted to the user’s speaking style. The methodology is well suited to situations in which constructing an accurate deterministic grammar is not a trivial task or in which higher flexibility is required. In the case of in-vehicle infotainment, the key idea is to decrease the cognitive load on the driver while controlling the system and to make the design process less time-consuming. The proposed approach was evaluated on the INCARSCOM corpus (Slovak In-car Command Database). The results favour the proposed method over constructing an appropriate deterministic grammar.
Introduction
Using speech to control in-vehicle infotainment systems is becoming increasingly popular because of its indisputable advantages. Besides being one of the most natural ways for people to communicate, speech can also improve safety and comfort of use and can decrease the cognitive load on the driver [4]. Speech interfaces in vehicles become especially important when a touchscreen is built in to control onboard devices.
For voice control of many industrial systems, simpler recognition systems that recognize isolated words and phrases are often used successfully instead of a large vocabulary continuous speech recognition (LVCSR) system. Because of their limited complexity, these systems are more accurate and more robust, but they often lack flexibility. The user is forced to adapt to the style of interaction required by the system, which increases the user’s cognitive load and can be dangerous in in-vehicle applications. The importance of cognitive load, or the level of cognitive distraction, lies in its relation to safety, as noted in several studies, e.g. [3, 13] or [14].
Moreover, choosing a command-based speech interface makes it necessary to define all possible speech commands and phrases that can be used to control the system. An appropriate deterministic grammar that allows the desired commands has to be constructed. For systems of medium to high complexity, it can be difficult to define such a deterministic grammar. Another very important issue is that such a grammar usually allows only pre-defined formulations of commands, which forces users to learn and memorize all commands in order to interact with the system.
Therefore, applying a small domain-specific stochastic language model (n-gram) instead of a deterministic speech grammar seems more appropriate. Unfortunately, at the beginning there are often no training data for such a stochastic model. To solve this problem, a new methodology for training small domain-specific stochastic language models was designed and is proposed in this paper. The basic idea was to develop a methodology for creating small domain-specific language models for systems of medium complexity that does not require any training corpus. A dictionary containing a complete list of devices, actions, and attributes is considered the only resource. Our work was inspired by a similar approach presented by Harris in [5], who generated a bigram language model from Phoenix grammars. That approach tries to learn the language grammar from a set of accurately designed deterministic grammars and therefore requires pre-defined deterministic grammars that were created manually. The methodology proposed in this paper goes a step further: it starts with a simple and general deterministic grammar that can be generated automatically from the dictionary. Random sentences generated from such a grammar can be used to train a basic stochastic n-gram model, which can be further personalized to the user. There is one more difference in comparison with Harris’s approach. While Harris tried to learn the language grammar, our goal is simpler: to estimate the domain-specific part of the language so that the speech recognition system can be adapted to the user’s usage style. Such adaptation can be done easily by extending the training corpus with the user’s new inputs or by interpolating between the basic model and a new one trained from the user’s utterances. Language model adaptation techniques are well known and often used; e.g. Rudnicky in [10] applied adaptation to a new domain with a small training corpus.
The methodology proposed in this paper was applied to train a small domain-specific language model for controlling an in-vehicle infotainment speech interface in the Slovak language. The main reason for designing a Slovak version of such a speech interface was that no infotainment system supporting Slovak speech existed. The work presented here offers new resources and solutions that can be used for designing one. The performance of the proposed methodology for training and adapting domain-specific language models was tested and evaluated on a newly recorded initial corpus of speech commands recorded in a vehicle environment.
The work described in this paper falls within the area of human-machine interfaces and communication, which can be considered a part of Cognitive Infocommunications because of the interests they share, as described in [1] or [2].
The paper is organized as follows. Section 2 describes the small Slovak speech corpus recorded in the vehicle. Section 3 introduces the principles of designing and constructing small domain-specific language models. Section 4 describes the application of the proposed methodology to designing a pilot speech recognition interface for an in-vehicle infotainment system and presents the results of an early evaluation.
Methodology of training domain-specific language model and its adaptation
Deterministic speech grammars are usually used for speech recognition in domain-oriented interactive applications. However, as the complexity of possible user input grows, it becomes difficult to manually write grammars that cover all possible inputs. This problem is usually solved by training small domain-specific language models on small domain-specific databases collected from real conversations. Unfortunately, there are application areas where such training data cannot be collected, because doing so requires building a complete system or applying the Wizard-of-Oz technique [6] (an exception may be the situation in which domain-specific data can be obtained from general data by a clustering method, as described in [15]). The proposed methodology offers an approach that can be used when:
Writing a deterministic grammar is not possible due to its complexity, or;
Writing a deterministic grammar is not possible due to its restrictive character and there is no database for training a stochastic language model.
The proposed methodology is an improved version of the methodology that we designed for controlling the SCORPIO robot by speech commands [9]. In the original version, the stochastic language models were trained on a database of all possible words and word pairs. In the improved version, the best results were obtained with a significantly smaller number of randomly generated words and word pairs (according to a Gaussian distribution). The improved version proposed in this paper can therefore be used for larger domains, where the number of all possible words and word pairs can be very large.
The presented methodology consists of two basic phases: construction of a so-called “random” bigram model (Phase I) and adaptation of this model (Phase II). Both are described in the following subsections.
Table 1. Infotainment systems with integrated speech interface and their support for different devices
Phase I consists of the following steps:
Collecting the domain vocabulary. The domain vocabulary can be constructed by listing all devices, functions, and parameters of the system. The vocabulary should be extended with all possible variants of each word’s basic form (e.g. its inflected forms).
Constructing the deterministic grammar, which can generate random words (unigrams) and word pairs (bigrams). A grammar that generates single words and pairs of words should look like the one drawn in Fig. 2.
Generating unigrams and bigrams. In this step, the constructed deterministic grammar is used to generate random sentences. As our previous experience has shown, it is not necessary to generate all possible bigrams to obtain a usable language model. As shown in Table 2, the best results were obtained when only 7% (1000 sentences) of all possible unigrams and bigrams were selected.
Training the stochastic language model. The basic stochastic model is trained on the set of generated unigrams and bigrams. The best results have always been achieved with bigram models (in comparison with unigram or trigram models). We named the result of this process the “random bigram model”, because the model was trained on a database of randomly generated sentences.
The result of the first phase of the designed methodology is the random bigram model, which can be used directly for automatic speech recognition. Such a model can then be personalized to the user’s speaking style.
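The steps of Phase I can be illustrated with a minimal Python sketch. It is not the toolchain used in this work (the grammar and model were prepared with Julius and SRILM); the toy English vocabulary, the function names, the uniform sampling of words, and the add-one smoothing are all assumptions made only for illustration.

```python
import random
from collections import Counter


def generate_sentences(vocab, n_sentences, seed=0):
    """Draw random one- or two-word sentences (unigrams and bigrams)."""
    rng = random.Random(seed)
    sentences = []
    for _ in range(n_sentences):
        length = rng.choice((1, 2))  # single word or word pair
        sentences.append([rng.choice(vocab) for _ in range(length)])
    return sentences


def train_bigram(sentences, vocab):
    """Return a function giving add-one smoothed P(word | prev)."""
    tokens = list(vocab) + ["</s>"]  # everything that may follow a word
    pair_counts, prev_counts = Counter(), Counter()
    for sent in sentences:
        padded = ["<s>"] + sent + ["</s>"]
        for prev, word in zip(padded, padded[1:]):
            pair_counts[(prev, word)] += 1
            prev_counts[prev] += 1

    def prob(word, prev):
        return (pair_counts[(prev, word)] + 1) / (prev_counts[prev] + len(tokens))

    return prob


# Hypothetical toy vocabulary (English stand-ins for Slovak commands).
vocab = ["radio", "navigation", "call", "volume", "up", "down"]
sentences = generate_sentences(vocab, 1000)
prob = train_bigram(sentences, vocab)
```

Because the words are sampled without any preference, the resulting probabilities are close to uniform, which is the property that makes such a model a neutral starting point for later adaptation.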
Phase II focuses on the adaptation of the prepared random n-gram model. It is performed during the initial use of the target system. The adaptation can be done directly with speech commands spoken by the operator without any post-processing (unsupervised learning), but better results can be obtained using manually corrected commands (supervised learning).
In the proposed methodology, unsupervised learning means that commands spoken while using the system are used for adaptation as they were recognized by the system, without any post-processing. Supervised learning means that commands used for adaptation are manually corrected before the model is re-trained.
Clearly, with unsupervised learning the accuracy of the system will converge to better results more slowly than with supervised learning. In many applications, however, only unsupervised learning is possible.
Two techniques for adapting a stochastic language model can be used:
Merging the source database with the adaptation database and training a new model by the same procedure that was used to train the basic model;
Adapting the original language model by interpolation with a model trained on the adaptation data.
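The interpolation technique can be sketched in Python as a linear mixture of two bigram probability tables. This is only an illustration under stated assumptions: the dictionary representation, the function name `interpolate`, the parameter name `lam`, and the choice to apply `lam` to the basic model are hypothetical, and the example probabilities are invented.

```python
def interpolate(p_base, p_adapt, lam=0.5):
    """Mix two bigram tables: lam * P_base + (1 - lam) * P_adapt.

    Both tables map (prev, word) pairs to probabilities; a pair missing
    from one table is treated as having probability 0 in that table.
    """
    if not 0.0 <= lam <= 1.0:
        raise ValueError("lam must lie in [0, 1]")
    keys = set(p_base) | set(p_adapt)
    return {
        key: lam * p_base.get(key, 0.0) + (1.0 - lam) * p_adapt.get(key, 0.0)
        for key in keys
    }


# Invented example: the basic model is uniform, the user prefers "on".
base = {("radio", "on"): 0.5, ("radio", "off"): 0.5}
adapt = {("radio", "on"): 0.9, ("radio", "off"): 0.1}
mixed = interpolate(base, adapt, lam=0.5)
```

If both input tables are normalized for a given history, the mixture is normalized as well, so the adapted model remains a valid probability distribution.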
INCARSCOM corpus
There were several reasons for recording a new speech corpus in the in-vehicle environment. One of the most important was the absence of such a corpus in the Slovak language. Experiments with speech enhancement and acoustic modelling for noisy environments also motivated us to record the corpus.
At the beginning, a set of speech commands was specified based on an analysis of the seven in-vehicle infotainment systems listed in Table 1.
Fig. 1. Recording device installation inside the vehicle.
Fig. 2. Deterministic grammar for generating unigrams and bigrams.
Slovak speech commands for controlling four basic areas (GPS navigation, call management, SMS support and audio playback) were specified. The set of basic speech commands consists of 71 commands, which enable basic navigation in the infotainment menu and selection of items.
Recordings were made by three speakers (2 males, 1 female) in a Fiat 500L vehicle. The infotainment system of the selected car contains the Blue&Me speech technology developed by Microsoft, but unfortunately we were not able to connect to and acquire the speech signal from the embedded microphones, which are located above the interior mirror. Instead of the original microphones, a high-quality Olympus LS-10 audio recorder was fixed at the same place, as shown in Fig. 1.
Recordings were collected in the following scenarios:
Engine idling, vehicle stationary;
While driving, recorded from the driver’s position;
While driving, recorded from the passenger’s position.
Fig. 3. ASR system.
Regarding road type, both types were covered: roads inside towns and roads between towns.
Each recording contains all speech commands (with three-second pauses between them) and spoken sequences of commands that realize pre-defined usage scenarios. The overall length of the database is around 36 minutes. Two recordings were made with the engine idling; the next three recordings were made during driving.
The described methodology was applied to train a small adaptive language model for a Slovak speech interface that can be used in an in-vehicle infotainment system. The main motivation was the absence of an in-vehicle infotainment system in the Slovak language.
At the beginning, seven in-vehicle infotainment systems were analysed, and a new set of speech commands in the Slovak language was defined together with the initial vocabulary. The deterministic speech grammar was designed to produce all possible unigrams and bigrams. The grammar was written in the Julian format. The Julius toolkit [4] was used to generate random sentences from the grammar. Alternatively, our own tool, JSoft, was used to generate all possible unigrams and bigrams, i.e. all combinations of vocabulary words.
Three bigram language models were trained using the SRILM toolkit [12]: two on randomly selected sentences (1000 and 3000 sentences) and one on all possible sentences that match the designed deterministic grammar (14400 sentences). An example of the final bigram model trained on 1000 training sentences can be seen in Fig. 4.
The models were then adapted with speech commands collected during use of the system. For adaptation, the interpolation technique was used with the weighting parameter lambda equal to 0.5.
Evaluation
The evaluation was carried out on an automatic speech recognition system based on the Julius decoder [7, 8] (see Fig. 3). Acoustic models were adopted from our previous work [11]. Because of the small size of the corpus, cross-validation was applied.
Word Error Rate (WER) was selected to measure the performance of the recognition system. The results can be found in Tables 2 and 3.
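For reference, WER is conventionally computed as the word-level edit distance (substitutions, deletions and insertions) between the reference and the recognized transcription, divided by the number of reference words. The Python sketch below follows this standard definition; the function name and the English example commands are hypothetical, and the scoring tool actually used in the experiments is not specified here.

```python
def word_error_rate(reference, hypothesis):
    """Word-level Levenshtein distance divided by the reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] = edit distance between the processed reference prefix
    # and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return prev[-1] / len(ref)


# One deleted word and one substituted word out of four reference words.
wer = word_error_rate("turn the radio on", "turn radio off")
```

Note that WER can exceed 100% when the recognizer inserts many extra words, because the denominator is the length of the reference only.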
Table 2. Results of the random models trained on different amounts of training data, tested on recordings with the engine idling
Table 3. Results of the best random model and its adapted version on recordings obtained during driving
Fig. 4. Example of the bigram language model.
Table 2 shows the results of the basic language models without adaptation, which were tested on the recordings made with the engine idling. Recognized speech commands were saved for later adaptation (unsupervised learning). Bigram language models trained on sentences generated from the general grammar give significantly better results (approx. 32% vs. approx. 15%) than the general grammar itself.
Table 3 demonstrates the impact of the adaptation process, in which the unsupervised learning principle was applied. This means that recognised utterances were used directly to train a model, which was then interpolated with the basic model. The bigram language model trained on 1000 random sentences was selected as the best model for subsequent adaptation. For the adaptation process, the utterances spoken in the quieter environment (engine idling) were selected. Adapted models were tested on recordings made during driving. It can be observed that such conditions do not allow the general grammar to be used (WER approx. 56%). A significantly lower error rate was obtained with the basic bigram model trained on 1000 randomly generated sentences (WER
It should be noted that the absolute WER values are relatively high, but this is not relevant to this experiment. The higher WER is caused both by the noisy in-vehicle conditions and by the use of acoustic models trained in a different environment and under different conditions (telephone speech).
This paper introduced a methodology for training small domain-specific language models from the vocabulary alone. The methodology does not require any training corpus for constructing the n-gram model. Language models prepared in this way are well suited to adaptation to the user’s usage style, because they model a random or uniform distribution of the observation probabilities. The proposed approach is intended for training small domain-specific language models, especially in situations where writing a deterministic grammar is not possible due to its complexity or its restrictive character, or where no corpus exists for training a stochastic language model. Moreover, it can outperform deterministic grammars in flexibility, because deterministic grammars cannot accept utterances whose syntax differs from what is pre-defined in the grammar.
The designed methodology was applied to prepare a stochastic language model for an in-vehicle infotainment system. The initial evaluation was done on the newly recorded INCARSCOM corpus, with promising results.
Acknowledgments
The work presented in this paper was supported by the Ministry of Education, Science, Research and Sport of the Slovak Republic under research project KEGA-055TUKE-4/2016 and by the Slovak Research and Development Agency under the contracts No. APPV-15-0731.
