Abstract
The Internet of things (IoT) plays a significant role in the fourth industrial revolution and attracts increasing interest due to the rapid development of smart devices. IoT comprises two factors. The first is a set of things (e.g., appliances, devices, vehicles, etc.) connected together via a network. The second is the human-device interaction used to communicate with these things. Speech is the most natural means of interaction and can enrich the user experience. In this paper, we propose a novel and effective approach for building customized voice interaction for controlling smart devices in IoT environments (e.g., smart homes). The proposed approach is based on extracting a customized tiny decoding graph from a large graph constructed using weighted finite state transducers. Experimental results showed that tiny decoding graphs are very efficient in terms of computational resources and recognition accuracy in clean and noisy conditions. To emphasize the effectiveness of the proposed approach, the standard Resource Management (RM1) dataset was employed, and promising results were achieved when compared with four competitive approaches.
Introduction
The Internet of things (IoT) plays a crucial role in the fourth industrial revolution and is expected to affect the way we interact with the surrounding things in our daily life. IoT is composed of physical things that are interconnected using a communication network which is connected to the internet. These things are smart devices and objects that form the basic building blocks of IoT. In smart homes, these smart objects are home appliances (e.g., lamp, TV, washing machine, refrigerator, air conditioner, etc.). The communication between the objects in an IoT environment is performed using a standard protocol through which, for instance, a smartphone can receive the shopping list that is generated automatically by the smart refrigerator [1, 2].
IoT utilizes a wide variety of technologies such as machine learning, data analytics, digital communication, embedded processors, cloud computing, etc. The foundation of IoT is composed of two aspects: the first is the network created to interconnect the smart objects, and the second is the interaction between humans and these smart objects. This human interaction can be performed in one of two ways. Firstly, the interaction can be embedded within the objects so that a human can interact with each object separately. Secondly, a human can interact with several objects, connected together on the same network, through a centralized interface. In both approaches, the interaction between humans and IoT objects is commonly achieved by programming these objects with natural means of communication [3, 4].
In IoT environments, most smart objects are controlled using a common modality, the graphical user interface (GUI). However, using a GUI for this purpose might be confusing for humans and is sometimes difficult to use. On the other hand, voice is considered the most natural and convenient modality of interaction between humans. Therefore, using voice for interacting with IoT can effectively improve user experience, because users prefer to interact with smart devices using natural voice rather than pressing a button on a touch screen or clicking somewhere on a GUI. Due to the significant advancement in natural language understanding and voice recognition, virtual assistants have emerged to improve human-computer interaction. Consequently, increased attention is paid every day to voice recognition as a crucial part in the development of interactive voice interfaces for IoT environments [5, 6].
The domain of smart environments includes important applications such as, smart homes, offices, museums, gyms, etc. The most popular application is smart home, in which a set of appliances with sensors and gadgets are connected together to manage and control all aspects of a household such as, lighting, heating, security, etc. In these applications, when voice recognition is employed as an interaction modality, it allows end-users to communicate more naturally with these environments and thus improves their user experience [7–9].
Human-device interaction in IoT becomes more natural if the voice interaction scenarios can be personalized. For instance, personalized names can be assigned to every object in the IoT environment, so that these names can be used in interaction scenarios to issue voice commands to control these devices. Voice recognition is typically based on large acoustic and language models, which are located in the cloud, to transcribe the spoken commands into text. In most cases, a large corpus is used to train the language model to make it generic. However, this comes at the cost of decreased recognition accuracy. For instance, it is necessary to recognize the device name when recognizing the spoken command. However, when a user uses a customized label to issue a command for a certain device, the recognition system might fail to recognize it correctly when using the generic language model [10, 11].
A misunderstanding between device users and IoT manufacturers may happen when using the traditional solutions of voice recognition. The manufacturers of IoT devices usually specify the words and sentences that users must know to use the devices properly. In this case, simple words and sentences are commonly chosen by manufacturers so that they are easy for users to remember. However, if we consider cultural traits and the number of existing languages, it becomes obvious that fixing a set of words and sentences to control IoT devices may lead to misunderstanding and unexpected device behaviors due to the mismatch between users’ natural language expressions and the expected words that are predefined to control these devices [7, 12].
Motivation & proposed solution
In day-to-day voice interaction with smart environments, it is important to have a customized voice recognition system that allows end-users to set their preferred commands and the corresponding intents. In addition, to make the control of smart environments available all the time, regardless of the internet connection, it is necessary to run the full process of voice recognition offline (i.e., locally on a smartphone). This requires the development of a light voice recognition system that consumes low computing resources and runs in real time. These requirements form the following research questions of this paper. Firstly, how to design a customizable voice recognition system that fits all aspects of a household? Secondly, how to make this system light in terms of memory and processing power requirements so that it can be embedded on a handheld device such as a smartphone? Thirdly, how to make this system run in real time and achieve high recognition accuracy?
To answer these research questions, we propose a novel approach based on weighted finite state transducers (WFSTs). The approach starts with building a large WFST-based voice decoding graph and hosting it in the cloud. This large graph can be built and maintained by IoT manufacturers. The large decoding graph includes all the words and sentences that may appear in the human interaction. In addition, we propose an approach for extracting a tiny and customized decoding graph from the large decoding graph. The resulting tiny graph can be stored offline on the user’s smartphone to control IoT devices and can be easily modified whenever the user decides to change the words or sentences used to control IoT devices. There are many advantages to using these tiny decoding graphs. Firstly, a tiny graph requires low resources and thus can be stored and run locally on a smartphone even if no internet connection is available. Secondly, it can achieve higher recognition accuracy when compared with other competitive approaches. Thirdly, it can be easily customized, and thus the user can modify this tiny graph to match his/her preferred way of interaction with IoT devices.
Paper structure
The structure of this paper is as follows. The literature review is presented in section 2, followed by a description of the proposed approach along with the system architecture in section 4. Experimental results are then presented and discussed. Finally, the conclusion and future perspectives are given.
Literature review
Situation analysis
The development of human-device interaction (HDI) is usually based on the analysis of ways of interaction between humans and devices. This analysis is used to guide the design and implementation of these ways to be evaluated before deploying them on devices. This results in interfaces that are efficient and simple for users to deal with. Therefore, the main objectives of HDI are to identify and recognize how users normally interact with devices then engage this interaction into HDI scenarios to boost the overall user experience [13, 14].
On the other hand, HDI usually faces a critical problem in which the specific commands used to activate device actions should meet users’ expectations and they should be comfortable with that. For users that are not familiar with HDI, such as elder people, this could be annoying for them and they usually take much time to get ready to use these devices smoothly and comfortably. In addition, some complaints are issued by novice users about the strangeness of interaction with their new devices, which affects their learning curve [15, 16].
Authors in [17] presented the challenges in voice interfaces and showed the importance of focusing the voice commands to specific IoT devices especially when the number of devices and services increases. This requires a collaboration between the voice-operated devices. In addition, authors suggested that a voice coding methodology is required for ad-hoc IoT networks. From the privacy perspectives, they emphasized the importance of privacy and protection in voice interfaces to generate the necessary trust.
A personalized voice recognition system is presented in [18]. In that system, the authors used dynamic hierarchical language models in a combination of voice recognition and natural language understanding to customize the interaction with IoT devices. However, in this system users can customize only the names of devices, and the other command words must still be known to the users in advance.
These studies point to an important observation: there are large variations in the commands used to interact with IoT devices and services, and these commands do not necessarily match the way users might prefer to interact with these devices and services. This observation inspired the work presented in this paper and motivated us to develop a voice user interface that can be customized by users to meet their expectations, thus easing their learning curve and improving their user experience.
IoT devices and services
The emergence of IoT is expected to radically change the traditional environments and gives a great opportunity to innovate new services and applications. Based on IoT advancements, it is expected to have a global connection between people and devices which links the physical activities in the real life with the virtual world. The things in IoT are typically described using metadata stored on electronic devices that are used to augment these things as part of a cyber space [19, 20].
The integration of IoT physical things into the cyber space has some limitations, such as the constant change in information received from these things, which makes it difficult to expand this information properly. Ontologies, which are defined as explicit and formal specification of real concepts, represent a way to cope with this limitation. Although ontologies can expand the information, that represent a certain knowledge, based on a standard specification, they have a constraint that IoT devices should follow a specific standardization to fit the standard ontology [21, 22].
There are several types of ontologies employed in the domain of IoT based on what information they are representing. There are ontologies for enabling plug-and-play integration of devices, ontologies that emphasize the human-home interaction, ontologies for space modeling, ontologies that deal with service properties, and ontologies that deal with the context modeling. Despite the power of these ontologies in enabling reasoning and describing knowledge, they consume large processing resources when describing the vast amount of actions [23, 24].
An alternative to ontologies is the use of tags for describing IoT devices. Tagging is based on the idea of adding more information about a resource by attaching metadata to it. To remain understandable to humans, these metadata usually contain keywords in the natural language used for interacting with these devices. The concept of tagging is commonly used in software development and search engines, where it links the content with the people who use it [25, 26].
There are three actors in the IoT paradigm namely, human, devices, and services, where services and devices can be viewed as resources that human wants to get. The description of these devices and services can be done using tags, as they do not need specific standardization and more intuitive and natural results can be achieved. The work presented in this paper adopts this approach and allows users to customize the natural interaction with devices and services based on tags to augment the description of them.
Voice user interface
Natural voice is reshaping human-computer interaction. This results in the voice user interface (VUI), which is expected to become highly prevalent in the coming few years. VUI is based on two significant technologies, namely automated speech recognition (ASR) and natural-language understanding (NLU). The task of ASR is to produce written text from the speech audio signal, whereas understanding the written text and performing some action accordingly is the task of NLU. This action can take the form of a spoken/visual response or a physical action [27].
The core of VUI is the speech recognition engine [28, 29], which analyzes the spoken commands in a continuous manner. From the perspective of speech recognition, there are several potential events such as "recognized response", "response received but not recognized", "no response", etc. VUI events usually result from a computationally intensive and complex process, which is error prone. On the other hand, wireless user interfaces (WUIs) and GUIs are based on events that are unequivocal, low-level incidents [30].
A recent attempt to apply voice recognition to IoT device interaction is presented in [31], where the authors introduced the application of dynamic time warping (DTW) along with support vector machines (SVM) for recognizing commands used in controlling IoT devices. In that paper, the authors showed improved performance when compared with SVM only. However, the system was evaluated using a very small dataset, so the results need to be verified more thoroughly on a larger one.
Another study on the application of voice recognition to interact with IoT devices is presented in [32]. In this paper, authors focused mainly on the voice separation problem in which interleaved voices are separated to improve the voice recognition accuracy. However, in terms of voice interaction methodology, they follow the traditional approach in which users have to know and get familiar with the set of predefined commands used to control IoT devices.
A study on the effects of noise on sub-phonemic evaluation for voice recognition was held by authors in [33]. In that work, authors employed some features, such as place and manner along with voice error patterns, such as distinctive feature distances and gray scale confusion matrices to improve the recognition accuracy. The study concluded that the misperception of voice in white noise is affected by place, manner, and voicing features.
Authors in [34] presented a system for voice recognition based on principal component analysis (PCA) for extracting images from the input voice and mel-frequency cepstral coefficients (MFCCs) for extracting the acoustic features. Both of these features are used in a DTW framework for voice matching using the Euclidean distance criterion. Although this approach is simple, it requires high processing resources and makes it complex to add or customize sequences of words that form an interaction command for controlling IoT devices and services.
A tool for recognizing voice to help visually impaired people is presented in [27]. Although achieving high accuracy when tested on certain users, it is expected to record less recognition accuracy if tested on large number of users as it was developed as a speaker dependent system.
In terms of controlling smart homes in an IoT environment, authors in [35] presented a system for voice recognition to control IoT devices in a low-cost and easy-to-install manner. To achieve better performance, the authors suggested the use of wireless technology for voice interaction. Unfortunately, that approach suffers from voice attenuation when the distance between the speaker and the voice-capturing devices increases, in addition to the authentication lockage problem.
Authors in [36] proposed the SoundCity platform, which enables smooth interoperability between smartphones and heterogeneous IoT devices. That framework was based on data streaming from smartphones to these devices via a set of methods and protocols that enable this communication. One important advantage of using smartphones to control IoT devices and services is the independence from the distance to these devices, and thus there is no coverage limit and no restriction on the location of the user.
Based on this review, we deduced the necessity of a study on a potential methodology for handling the voice recognition application for IoT interaction in customizable interaction dialogues to match user preferences along with low computing resources requirements. In addition, due to the advancements in processing capabilities of current smartphones, it becomes intuitive to perform some operations, such as voice recognition, locally instead of doing it remotely in a cloud, which may suffer from network latency or security breaches. Therefore, this paper proposed a new approach to realize that in an integrated framework.
System architecture
In this section, we propose a system for controlling IoT devices and services using spoken commands. The spoken commands are captured using a smartphone where the command is processed and recognized using its local computing resources then the recognized command is sent to IoT controller to perform the intended action. The proposed architecture of voice interaction with IoT devices and services is depicted in Fig. 1. In this architecture, a large weighted finite state transducer (WFST) decoding graph is constructed and stored remotely on a cloud server. This large decoding graph is built based on well-trained acoustic and language models along with a large vocabulary. In addition, it is used for extracting tiny decoding graph for recognizing the customized spoken commands that fit user preferences.

Proposed architecture for voice interaction with IoT.
The large decoding graph contains a full set of potential words that are commonly used in natural human-device interactions and is suggested to be maintained by manufacturers of IoT devices. In contrast, the tiny decoding graph contains only the commands and words predetermined by the user when s/he customizes IoT devices and services. The extraction of a tiny decoding graph can be performed by the user in a simple sequence of steps, as will be explained later in this section. The idea of customizing the spoken commands used in controlling IoT actions gives end-users the flexibility to change the spoken commands accepted by each device and service. All that is needed is to extract a new tiny decoding graph from the large decoding graph hosted on the cloud based on these changes.
Once the spoken command is captured by the smartphone, it uses the local tiny decoding graph to recognize the spoken command and outputs an action identifier (ID). This ID, which corresponds to the recognized command, is sent to the IoT controller to perform the requested action. The IoT controller shown in Fig. 1 retrieves the set of primitive events based on the received action ID. Each device and service has a list of potential actions (for controlling sensors and actuators) that can be performed based on user requests.
The description of IoT devices and services attracted many researchers during the past decade. One of the suggested methods to describe these resources is the use of ontologies. Although this method is powerful in describing knowledge and enabling reasoning, it suffers from large resource consumption, which makes it less preferred for describing IoT devices due to the numerous types of these devices. In addition, defining new devices in an IoT environment along with their ontologies is not a trivial task and sometimes requires expensive computations. On the other hand, IoT devices and services can simply be described in terms of a set of keywords that should be selected carefully and attached to each device to achieve accurate deduction of the intended actions requested by users. In this latter method, the definition of new devices becomes simpler and requires reasonable computations when compared with ontologies.
Fig. 2 shows the contents of the description files of IoT devices and services. In the figure, each device/service has a set of actions, and each action is described by a list of commands that is mapped to a unique ID along with a list of primitive events. The contents of the description file are written in extensible markup language (XML) format.

Contents of IoT devices and services description file.
A sample description file is presented in Fig. 3. In this figure, only one type of devices, living room lights, is listed for illustration. The devices included in this file are ceiling lights and lightshades, which are assumed to work together using the same command and based on the same sequence of events. As this file describes one action which is “light on”, it contains the potential spoken commands that user may use to trigger this action. Users can add or modify these commands, in the proposed architecture, to meet their preferences. Following the command list, the events part of the XML presents the list of primitive events that should occur to execute the intended action (i.e., light on).

Sample description of lighting device in a living room.
These primitive events are predefined by the device manufacturer, and users are not allowed to change them because their operation depends on the power connections from the power source to the device. If each device has a different set of primitive events, there will be a different <Device> tag for each of them to contain the exact list of operations that should be applied to perform the requested action. Similarly, for the services provided by the IoT devices, such as displaying the humidity, there is a <Services> tag in the description file showing the sequence of actions included in each service.
The description files are stored close to each device and can be read by a workflow manager. In addition, these files are modifiable, so that users can modify the corresponding commands of each device and service with the help of IoT controller and a dedicated mobile application.
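As a minimal sketch of how a workflow manager might read such a description file, the following example parses a small XML document into two lookup tables: spoken command to action ID, and action ID to primitive events. Only the <Device> tag is taken from the text; every other tag name, the attribute names, the action ID "A001", and the event names are hypothetical and chosen purely for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical description file. Apart from <Device>, all tag names,
# attributes, IDs, and event names are assumptions for illustration.
DESCRIPTION_XML = """
<Description>
  <Device name="living room lights">
    <Action id="A001" name="light on">
      <Commands>
        <Command>turn on the living room lights</Command>
        <Command>lights on</Command>
      </Commands>
      <Events>
        <Event>power_relay_close</Event>
        <Event>report_status</Event>
      </Events>
    </Action>
  </Device>
</Description>
"""


def load_actions(xml_text):
    """Return (command -> action ID, action ID -> list of primitive events)."""
    root = ET.fromstring(xml_text)
    command_to_id = {}
    id_to_events = {}
    for action in root.iter("Action"):
        action_id = action.get("id")
        # Every listed command triggers the same action ID.
        for command in action.iter("Command"):
            command_to_id[command.text.strip().lower()] = action_id
        # The primitive events are kept in document order.
        id_to_events[action_id] = [e.text.strip() for e in action.iter("Event")]
    return command_to_id, id_to_events


command_to_id, id_to_events = load_actions(DESCRIPTION_XML)
print(command_to_id["lights on"])
print(id_to_events["A001"])
```

Keeping the commands and the events in separate tables mirrors the split described in the text: users may edit the command list, while the event list stays fixed by the manufacturer.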
The controller system starts with receiving an action ID recognized by the smartphone in response to the spoken command issued by user. This controller operates in terms of two modules, as shown in Fig. 4. Firstly, the workflow manager module, which is responsible for extracting the sequence of primitive events associated with the action ID from the description files. These extracted primitive events will be triggered to realize the intended action. Secondly, the message builder module, which is responsible for converting each event in the set of primitive events extracted by the workflow manager into a format suitable for IoT devices or services to behave accordingly. In addition, the message builder monitors IoT devices’ status to keep the workflow manager updated with that.

Architecture of IoT Controller.
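The two controller modules can be sketched as follows. This is an illustrative outline only: the class and method names, the action ID, the event names, and the use of JSON as the device message format are all assumptions, since the text states only that events are converted into "a format suitable for IoT devices or services".

```python
import json

# Hypothetical lookup that the workflow manager would load from description files.
ID_TO_EVENTS = {"A001": ["power_relay_close", "report_status"]}


class WorkflowManager:
    """Maps an action ID to its primitive events and tracks device status."""

    def __init__(self, id_to_events):
        self.id_to_events = id_to_events
        self.device_status = {}

    def events_for(self, action_id):
        # Unknown IDs yield no events, so a misrecognized command triggers nothing.
        return list(self.id_to_events.get(action_id, []))

    def update_status(self, device, status):
        self.device_status[device] = status


class MessageBuilder:
    """Converts primitive events to device messages (JSON is an assumption)."""

    def __init__(self, workflow_manager):
        self.workflow_manager = workflow_manager

    def build(self, device, event):
        return json.dumps({"device": device, "event": event}, sort_keys=True)

    def report_status(self, device, status):
        # Keeps the workflow manager updated with the monitored device status.
        self.workflow_manager.update_status(device, status)


def handle_action(action_id, device, manager, builder):
    """Turn a recognized action ID into the list of messages to send."""
    return [builder.build(device, e) for e in manager.events_for(action_id)]


manager = WorkflowManager(ID_TO_EVENTS)
builder = MessageBuilder(manager)
messages = handle_action("A001", "living_room_lights", manager, builder)
builder.report_status("living_room_lights", "on")
print(messages)
```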
The goal of voice recognition is to convert spoken utterances into text. To achieve this goal, the captured voice signal is converted to a sequence of feature vectors called observation vectors.
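A common way to state this goal is the maximum a posteriori decoding rule, given here as a sketch with assumed notation ($O$ for the sequence of observation vectors, $W$ for a candidate word sequence) that may differ from the authors' own symbols:

$$\hat{W} = \arg\max_{W} P(W \mid O) = \arg\max_{W} P(O \mid W)\, P(W)$$

Here $P(O \mid W)$ is scored by the acoustic model and $P(W)$ by the language model; the denominator $P(O)$ is dropped because it does not depend on $W$.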
In order to realize the target of equation (1), the observation vectors are used to train hidden Markov models (HMMs) as acoustic models. A language model is then built using a large text corpus. Both acoustic and language models are used to construct a large decoding graph by applying a series of WFST operations to merge acoustic and language models along with tri-phone context dependent models and phonetic dictionary into a single decoding graph using equation (2).
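This construction is commonly written as a cascade of WFST compositions. The following is a sketch under assumed notation, where $H$, $C$, $L$, and $G$ denote the transducers for the HMM acoustic models, the tri-phone context dependency, the phonetic dictionary, and the language model, respectively, and $T_1$ is the large decoding graph. Determinization ($\det$) and minimization ($\min$) are the optimizations typically applied in this recipe and are assumptions here:

$$T_1 = \min\big(\det(H \circ C \circ L \circ G)\big)$$

The resulting graph maps HMM-level inputs to word-level outputs, which is what allows a word-level acceptor to be composed with it afterwards.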
The result from equation (2) is a large WFST-based decoding graph which usually contains millions of states and transitions when built using large resources i.e., lexicon, acoustic, and language models. Therefore, this large decoding graph is proposed to be hosted on a cloud server. The main purpose of this large graph is to be used in extracting a custom tiny decoding graph that will be used in the actual voice recognition process running on smartphone.
The process of extracting the tiny decoding graph starts with collecting the customized spoken commands from the description files of the IoT devices and services. Then, these commands are merged together to form a single word network which is transformed to a weighted finite state acceptor (WFSA) using HParse tool [37]. The resulting WFSA is then composed with the large decoding graph to produce a tiny decoding graph, which is used for voice recognition/decoding on the smartphone. The size of the resulting tiny graph is quite small in terms of memory requirements and processing resources, as will be discussed later in the following sections.
The process of extracting the tiny decoding graph is shown in Fig. 5. In the figure, the description files of the IoT devices are parsed and the commands are retrieved and converted to a finite state grammar (FSG). For instance, the description file depicted in Fig. 3 is parsed to generate the FSG shown in Fig. 6. This FSG is processed using the HTK toolkit to generate a word network, which is then converted to a WFSA. Fig. 7 shows a sample WFSA generated for a small set of commands. The resulting WFSA is sent to the cloud server to be composed with the large decoding graph. The composition is performed based on equation (3).
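Using the names from the extraction algorithm (large graph $T_1$, acceptor $A$, tiny graph $T_2$), the composition step can be sketched as follows. The operand order is an assumption: it presumes that $T_1$ emits words on its output side, so that the word-level acceptor constrains those outputs.

$$T_2 = T_1 \circ A$$

Only the paths of $T_1$ whose output word sequences are accepted by $A$ survive in $T_2$, which is why the result is far smaller than $T_1$.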

Extraction of tiny decoding graph.

Sample finite state grammar contains living lights commands.

Sample WFSA for the control commands of living lights.
The commands presented in Fig. 6 are reduced to focus on the proposed idea. However, in real scenarios, there are more commands with multiple words in each command, which results in a larger grammar and WFSA. In addition, user may optionally add "please" word for making the interaction with IoT devices and services more friendly.
The resulting WFSA, shown in Fig. 7, contains arcs labeled with the words as inputs and outputs of the acceptor. In addition, each state has a self-loop labeled with the epsilon (ε) word, which is necessary for the composition process. <S> and </S> denote the start and end of the spoken command, respectively.
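To illustrate the shape of such an acceptor, the sketch below builds a simple prefix-tree acceptor from a list of commands, wrapping each path in <S> and </S> and adding an optional trailing "please" variant. It is a simplification and not the HParse/HTK pipeline used in the paper: weights and epsilon self-loops are omitted, placing "please" only at the end of a command is an assumption, and the function names and sample commands are hypothetical.

```python
def expand_optional_please(commands):
    """Return each command as a word list, with and without a trailing 'please'."""
    expanded = []
    for command in commands:
        words = command.lower().split()
        expanded.append(words)
        expanded.append(words + ["please"])
    return expanded


def build_acceptor(word_sequences):
    """Build an unweighted prefix-tree acceptor.

    Returns (arcs, final_states), where arcs maps (state, word) -> next state
    and state 0 is the start state. Every path is wrapped in <S> ... </S>.
    """
    arcs = {}
    final_states = set()
    next_state = 1
    for words in word_sequences:
        state = 0
        for word in ["<S>"] + words + ["</S>"]:
            key = (state, word)
            if key not in arcs:
                arcs[key] = next_state
                next_state += 1
            state = arcs[key]
        final_states.add(state)
    return arcs, final_states


def accepts(arcs, final_states, words):
    """Check whether the acceptor accepts the given word list."""
    state = 0
    for word in ["<S>"] + words + ["</S>"]:
        if (state, word) not in arcs:
            return False
        state = arcs[(state, word)]
    return state in final_states


commands = ["lights on", "turn on the lights"]  # hypothetical user commands
arcs, final_states = build_acceptor(expand_optional_please(commands))
print(accepts(arcs, final_states, "lights on please".split()))
print(accepts(arcs, final_states, "lights off".split()))
```

Because commands share prefixes in the tree, adding a new command usually adds only a few arcs, which keeps the acceptor small.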
Once the WFSA is sent to the cloud server, it is composed with the large decoding graph to generate the tiny decoding graph. Fig. 8 shows the composition algorithm given the aforementioned inputs. The resulting tiny decoding graph contains hundreds of states, as it contains context-dependent information of the words included in the spoken commands represented by the WFSA. This tiny decoding graph is sent from the cloud server to the smartphone to be used for voice recognition. In addition, the cloud server returns the acoustic models corresponding to the tri-phones that appear in the context-dependent information of the resulting tiny decoding graph. The extraction of the tiny decoding graph is performed either at the installation of the IoT environment or when the user decides to modify or add new commands for controlling devices and/or services.

Algorithm for extracting tiny decoding graph (T2) from a large decoding graph (T1) based on a finite state acceptor (A).
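The idea behind this extraction can be illustrated with a toy product construction. The sketch below is not the authors' algorithm: it ignores weights and epsilon transitions, does not trim dead-end states, and uses a deliberately tiny stand-in for $T_1$ (a single looping state over a five-word vocabulary with made-up input labels). It only shows how composing with an acceptor keeps the paths whose output words the acceptor allows.

```python
from collections import deque


def compose(t1_arcs, t1_start, t1_finals, a_arcs, a_start, a_finals):
    """Compose transducer T1 with acceptor A on T1's output labels.

    t1_arcs: dict state -> list of (in_label, out_label, next_state)
    a_arcs:  dict (state, label) -> next_state
    Returns (arcs, start, finals) of T2, whose states are (t1_state, a_state).
    """
    start = (t1_start, a_start)
    arcs = {}
    finals = set()
    queue = deque([start])
    seen = {start}
    while queue:
        q1, qa = queue.popleft()
        if q1 in t1_finals and qa in a_finals:
            finals.add((q1, qa))
        for in_label, out_label, n1 in t1_arcs.get(q1, []):
            na = a_arcs.get((qa, out_label))
            if na is None:
                continue  # A rejects this word here, so the path is dropped
            nxt = (n1, na)
            arcs.setdefault((q1, qa), []).append((in_label, out_label, nxt))
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return arcs, start, finals


# Toy "large" graph: one state that loops over the whole vocabulary.
vocabulary = ["lights", "on", "off", "tv", "door"]
t1_arcs = {0: [("p_" + w, w, 0) for w in vocabulary]}

# Toy acceptor for the single command "lights on".
a_arcs = {(0, "lights"): 1, (1, "on"): 2}

t2_arcs, t2_start, t2_finals = compose(t1_arcs, 0, {0}, a_arcs, 0, {2})
print(sum(len(v) for v in t2_arcs.values()), "arcs kept out of", len(vocabulary))
```

In this toy case only the two arcs needed for "lights on" are retained, while the arcs for "off", "tv", and "door" never enter the result.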
The voice recognition process starts with receiving a spoken command from the user through the smartphone. The command is then preprocessed to extract a set of mel-frequency cepstral coefficient (MFCC) feature vectors. These MFCC vectors are fed to the decoder, which calculates the acoustic probabilities while traversing the tiny decoding graph until it reaches the final state of the graph and outputs the recognized text corresponding to the uttered command. The recognized command is sent to the IoT controller to retrieve the sequence of primitive events that should be sent to the sensors and actuators to behave accordingly.
The communication between the IoT controller and the device sensors and actuators is performed over a WiFi connection. In addition, this WiFi connection enables users to interact with devices and services in a smart home remotely, without needing to be inside the smart home. From the perspective of interconnections between the IoT controller and devices, Fig. 9 shows the proposed layout. In this figure, the user speaks a command, which is recognized and sent to a Raspberry Pi processing unit. This unit retrieves the primitive events, formats them using a standard protocol, and sends them to the corresponding relay drivers to act physically.

Proposed architecture with Raspberry Pi and relay drivers.
To evaluate the effectiveness of the proposed approach, the resource management (RM1) dataset has been employed. Voiced utterances of this dataset are sampled at 16 kHz with 16 bits/sample and framed with a frame length of 30 msec and 75% overlap between successive frames. Each frame was represented using a 39-dimensional feature vector with 13 static MFCCs including frame log energy, and 26 dynamic coefficients (13 Δ and 13 ΔΔ). Fig. 10 shows the first 13 MFCC static features for a set of frames in a spoken command. These features are used in training the acoustic hidden Markov models (HMMs) as one of the main building blocks of the decoding graphs used in voice recognition. All experiments are conducted on a machine with a central processing unit (CPU) running at 2.4 GHz with 8 GB of main memory and Ubuntu 10.04. To demonstrate the efficiency of the proposed approach, other competitive approaches were considered, namely the large decoding graph, finite state graph (FSG), support vector machines (SVM), and a hybrid of SVM and dynamic time warping (SVM-DTW) [31].
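The framing parameters translate into sample counts as in the short sketch below. It reads the 30 msec figure as the frame length (an assumption) and shows only the framing arithmetic, not the MFCC computation itself; the function and constant names are hypothetical.

```python
SAMPLE_RATE_HZ = 16000
FRAME_LENGTH_MS = 30   # read as the frame length (assumption)
OVERLAP = 0.75         # fraction shared by successive frames

# 30 ms at 16 kHz is 480 samples; 75% overlap leaves a 120-sample (7.5 ms) shift.
frame_length = SAMPLE_RATE_HZ * FRAME_LENGTH_MS // 1000
frame_shift = int(frame_length * (1 - OVERLAP))

# 13 static MFCCs plus 13 delta and 13 delta-delta coefficients per frame.
FEATURE_DIM = 13 + 13 + 13


def split_into_frames(samples):
    """Split a list of samples into overlapping frames, dropping any partial tail."""
    last_start = len(samples) - frame_length
    return [samples[i:i + frame_length] for i in range(0, last_start + 1, frame_shift)]


one_second = [0] * SAMPLE_RATE_HZ
frames = split_into_frames(one_second)
print(frame_length, frame_shift, len(frames), FEATURE_DIM)
```

Under these assumptions, one second of audio yields 130 frames, each of which would be reduced to a 39-dimensional feature vector.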

Static MFCC features corresponding to a set of voice frames.
The proposed approach is based on extracting a tiny decoding graph from a large decoding graph to achieve high recognition accuracy. The large decoding graph employed in this paper is constructed using a vocabulary size of 64K words and a tri-gram language model, in addition to context-dependent triphone hidden Markov models (HMMs) [37]. The construction is based on WFST operations using the OpenFST toolkit [38]. The acoustic HMMs consist of 8000 states, 32 Gaussian mixtures/state, and 256K mean vectors as shown in Fig. 1. These acoustic models are estimated from the Wall Street Journal (WSJ) speech corpus based on the maximum likelihood estimation (MLE) criterion and using the hidden Markov model toolkit (HTK) [37, 39]. On the other hand, the language model consists of 64K uni-grams, 594K bi-grams, 238K tri-grams, and a perplexity of 145, as shown in Fig. 1. The large decoding graph is created only once and can be hosted in the cloud to be accessed for tiny decoding graph extraction or for further training to improve the overall recognition accuracy. Once the large graph is constructed, it is ready to be used in extracting the tiny decoding graph corresponding to the customized spoken commands specified by users to control their IoT devices and services. The specifications of the constructed large decoding graph (large WFST), tiny decoding graph, and finite state graph are shown in Table 3.
Properties of the voice decoding graphs
Acoustic hidden Markov models description
Language N-gram model description
The Resource Management (RM1) command corpus comprises speaker-independent and speaker-dependent sets of spoken commands and queries about naval resources. Only the speaker-independent set is considered in our experiments. This set consists of 3990 training utterances from 109 speakers and 1200 test utterances from 12 speakers. Similarly, based on the vocabulary associated with the language models used in building the large decoding graph, the RM1 utterances are filtered to exclude those containing out-of-vocabulary (OOV) words. Consequently, the RM1 evaluation set contains 668 utterances. A tiny decoding graph is constructed that corresponds to this set of utterances. In addition, this set of utterances is used to construct an FSG as a voice user interface methodology for comparison with the proposed approach.
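The OOV filtering step can be illustrated with a short sketch. The function name and the toy data are hypothetical, and case-insensitive matching on whitespace-separated words is an assumption; the paper does not state how transcripts are normalized.

```python
def filter_oov(utterances, vocabulary):
    """Keep only utterances in which every word belongs to the vocabulary.

    Assumption: words are whitespace-separated and compared case-insensitively.
    """
    vocab = {word.lower() for word in vocabulary}
    return [u for u in utterances
            if all(word.lower() in vocab for word in u.split())]


# Toy data for illustration only; these are not RM1 transcripts.
toy_vocabulary = ["show", "the", "ships", "in", "port"]
toy_utterances = ["show the ships", "show the frigates in port", "Ships in port"]
kept = filter_oov(toy_utterances, toy_vocabulary)
```

Here the second toy utterance is dropped because "frigates" is not in the toy vocabulary, which mirrors how the evaluation set shrinks once utterances with OOV words are excluded.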
Voice decoding results
Three criteria were employed to measure the performance of voice command recognition, namely true positive ($T_p$), false positive ($F_p$), and false negative ($F_n$). These criteria are used to calculate the following measurements:
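The equations themselves are not recoverable from the text at this point. Since the results are reported as precision, recall, and F1, the standard definitions in terms of $T_p$, $F_p$, and $F_n$ are presumably the ones intended; they are given here without the original equation numbers, which are an unknown.

```latex
\begin{align*}
\text{Precision} &= \frac{T_p}{T_p + F_p}, \\
\text{Recall}    &= \frac{T_p}{T_p + F_n}, \\
F_1 &= \frac{2 \cdot \text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}}
     = \frac{2\,T_p}{2\,T_p + F_p + F_n}.
\end{align*}
```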
In addition, the accuracy of voice recognition is measured in terms of word-error rate (WER) according to equation (7)
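Equation (7) is not reproduced in the text. As a hedged illustration, WER is conventionally computed as the number of word substitutions, deletions, and insertions needed to turn the recognized word sequence into the reference, divided by the number of reference words. The sketch below assumes that conventional definition and whitespace tokenization; the function name is hypothetical.

```python
def word_error_rate(reference, hypothesis):
    """Word-level edit distance divided by the number of reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] holds the edit distance between ref[:i-1] and hyp[:j].
    prev = list(range(len(hyp) + 1))
    for i in range(1, len(ref) + 1):
        curr = [i] + [0] * len(hyp)
        for j in range(1, len(hyp) + 1):
            substitution = prev[j - 1] + (0 if ref[i - 1] == hyp[j - 1] else 1)
            deletion = prev[j] + 1       # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr[j] = min(substitution, deletion, insertion)
        prev = curr
    return prev[len(hyp)] / len(ref)
```

For a whole test set, the usual practice is to sum the edit distances over all utterances and divide by the total number of reference words, rather than averaging per-utterance rates.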
Fig. 11 presents the precision, recall, and F1 values of the proposed tiny decoding graph in the clean condition, alongside those of the other approaches. As shown in this figure, the proposed approach outperforms the other approaches because it is fitted to the specified recognition task. The figure also shows that the performance of SVM-DTW is close to that of the proposed approach. However, the advantage of the proposed approach is that the tiny decoding graph is simple to customize: the spoken commands used to control IoT devices and services are amended, and the composition operation is then applied with the large decoding graph to generate a new tiny graph that fits these changes. In addition, SVM-DTW performs better than SVM alone, which agrees with the results presented in [31]. Voice decoding using FSG also performs better than using the large WFST because its grammar rules include only the words of the recognition task. The large decoding graph performs worst because it contains many more words than the recognition task requires and because the acoustic models used in building it are trained on the WSJ dataset only. This can be viewed as another advantage of the proposed system: it can efficiently recognize utterances from a dataset that is different from the dataset used in training its acoustic models. This can be interpreted as an outcome of the good distribution of the voice knowledge sources (acoustic and language models) over the extracted tiny decoding graph.

Recognition results in clean condition.
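The customization step, in which amended commands are composed with the large graph to obtain a new tiny graph, can be illustrated with a toy sketch. This is not the OpenFST-based implementation used in the paper: it is a plain-Python, epsilon-free composition in the tropical semiring (weights add along a path), and the "large" graph is a hypothetical single looping state over five words whose input and output labels are both words. In a real decoding graph the input side carries acoustic units and the graph is far larger.

```python
from collections import deque


def compose(a, b):
    """Epsilon-free composition: match output labels of `a` with input labels of `b`.

    An FST is a dict with "start", "finals" {state: weight}, and
    "arcs" {state: [(ilabel, olabel, weight, next_state), ...]}.
    """
    start = (a["start"], b["start"])
    arcs, finals = {}, {}
    queue, seen = deque([start]), {start}
    while queue:
        state = queue.popleft()
        qa, qb = state
        arcs[state] = []
        if qa in a["finals"] and qb in b["finals"]:
            finals[state] = a["finals"][qa] + b["finals"][qb]
        for ia, oa, wa, na in a["arcs"].get(qa, []):
            for ib, ob, wb, nb in b["arcs"].get(qb, []):
                if oa != ib:
                    continue
                nxt = (na, nb)
                arcs[state].append((ia, ob, wa + wb, nxt))
                if nxt not in seen:
                    seen.add(nxt)
                    queue.append(nxt)
    return {"start": start, "finals": finals, "arcs": arcs}


def command_acceptor(commands):
    """Build a prefix-tree acceptor (zero weights) for a list of spoken commands."""
    arcs, finals, next_id = {0: []}, {}, 1
    for command in commands:
        state = 0
        for word in command.split():
            for ilabel, _, _, target in arcs[state]:
                if ilabel == word:
                    state = target
                    break
            else:  # no existing arc for this word: add one and a new state
                arcs[state].append((word, word, 0.0, next_id))
                arcs[next_id] = []
                state, next_id = next_id, next_id + 1
        finals[state] = 0.0
    return {"start": 0, "finals": finals, "arcs": arcs}


def path_cost(fst, words):
    """Cost of the path labelled `words` (deterministic inputs), or None if rejected."""
    state, total = fst["start"], 0.0
    for word in words:
        for ilabel, _, weight, target in fst["arcs"][state]:
            if ilabel == word:
                state, total = target, total + weight
                break
        else:
            return None
    return total + fst["finals"][state] if state in fst["finals"] else None


# Hypothetical stand-in for the large graph: any word sequence, with per-word costs.
LARGE = {
    "start": 0,
    "finals": {0: 0.0},
    "arcs": {0: [(w, w, cost, 0) for w, cost in
                 [("turn", 1.0), ("on", 1.0), ("off", 2.0), ("lamp", 3.0), ("tv", 3.0)]]},
}

tiny = compose(LARGE, command_acceptor(["turn on lamp", "turn off tv"]))
```

In this toy setting the composed graph keeps only the paths that spell one of the two commands while inheriting the weights of the larger graph. Changing the command list and composing again yields a new small graph without touching the large one, which is the property the paragraph above relies on.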
From the run-time perspective, Fig. 12 shows the real-time factor (RTF) of voice recognition for the five approaches included in our experiments. As shown in the figure, feature extraction takes the same time in all approaches because only the MFCC method was used to extract the acoustic features. In addition, the proposed approach achieved a better RTF than the large WFST, SVM, and SVM-DTW because it relies on a tiny decoding graph. On the other hand, FSG achieved a faster recognition time than the other approaches because it involves a simple grammar, but with low recognition accuracy, as mentioned previously.
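For reference, the real-time factor is conventionally the processing time divided by the duration of the audio being processed, so values below 1 indicate faster-than-real-time recognition. The sketch below assumes that definition; `measure_rtf` and the `decode` callable are hypothetical stand-ins for whichever recognizer is being timed, and the paper does not describe its timing code.

```python
import time


def real_time_factor(processing_seconds, audio_seconds):
    """RTF = processing time / audio duration (lower is faster)."""
    if audio_seconds <= 0:
        raise ValueError("audio duration must be positive")
    return processing_seconds / audio_seconds


def measure_rtf(decode, samples, sample_rate_hz=16000):
    """Time one call of `decode(samples)` and relate it to the audio length."""
    start = time.perf_counter()
    decode(samples)
    elapsed = time.perf_counter() - start
    return real_time_factor(elapsed, len(samples) / sample_rate_hz)
```

For example, a recognizer that needs 0.5 s to decode a 2 s command has an RTF of 0.25.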

Timing profile of voice recognition.
To further investigate the gains from the proposed approach, four other experiments were conducted to verify its effectiveness in noisy conditions. In these experiments, each of the noise types listed in Table 4 was added to the clean spoken commands at four signal-to-noise ratios (SNRs), namely 20 dB, 15 dB, 10 dB, and 5 dB [40]. The proposed approach and the four other approaches were included in these experiments.
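Mixing noise into clean speech at a target SNR amounts to scaling the noise so that the ratio of speech power to noise power matches the requested value in decibels. The sketch below is an assumption about how such mixing is commonly done, not the procedure of [40]: it measures power over the whole utterance, truncates the noise to the speech length, and uses hypothetical function names.

```python
import math


def mean_power(signal):
    """Average squared amplitude of a sequence of samples."""
    return sum(v * v for v in signal) / len(signal)


def mix_at_snr(speech, noise, snr_db):
    """Add `noise` to `speech` so that 10*log10(P_speech / P_noise) == snr_db."""
    if len(noise) < len(speech):
        raise ValueError("noise must be at least as long as the speech")
    noise = noise[:len(speech)]
    p_speech = mean_power(speech)
    p_noise = mean_power(noise)
    if p_noise == 0:
        raise ValueError("noise must not be silent")
    # Gain that brings the noise power down (or up) to the target level.
    gain = math.sqrt(p_speech / (p_noise * 10 ** (snr_db / 10.0)))
    return [s + gain * n for s, n in zip(speech, noise)]
```

Calling `mix_at_snr(speech, noise, 5)` produces a noisier signal than `mix_at_snr(speech, noise, 20)`, since a lower SNR leaves the noise at a higher power relative to the speech.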
Noise types of the NOISEX-92 dataset
Tables 5 and 6 present the average performance of the five decoding approaches in noisy conditions. Noises are added at the 20 dB and 15 dB SNR levels and the average evaluation results are recorded in Table 5. In addition, noises are added at the 10 dB and 5 dB SNR levels and the results are presented in Table 6. As shown in these tables, the proposed approach significantly outperforms the other approaches for all noise types at the different SNR levels. These results underline the suitability of the proposed approach under adverse conditions.
Speech recognition results in noisy condition (20dB and 15dB)
Speech recognition results in noisy condition (10dB and 5dB)
The recognition accuracy is measured in terms of WER and the results are reported in Fig. 13. The results presented in this figure emphasize the effectiveness of the proposed approach in recognizing the spoken commands in clean as well as noisy conditions when compared with the four other approaches, namely large WFST, FSG, SVM, and SVM-DTW. Although the recognition accuracy of SVM-DTW is also high, it suffers from a critical problem: adding new commands may require re-training the whole system. Moreover, as the number of commands increases, the performance of that approach decreases.

WER of voice recognition in clean and noisy conditions.
This paper proposed a new approach for developing a customized voice recognition system for controlling IoT devices and services. The proposed approach is based on extracting a tiny decoding graph from a large decoding graph to achieve high recognition accuracy in real time and thereby improve user experience. The large decoding graph is proposed to be hosted on a cloud server and accessed only when new devices are added to the IoT environment or when a user decides to customize the commands used for controlling IoT devices. The tiny decoding graph, on the other hand, is proposed to be hosted on the smartphone, so that the voice recognition process can run entirely locally on the smartphone. The evaluation emphasized the effectiveness of the proposed approach when compared with four other competitive approaches in terms of the accuracy of voice recognition, the real-time factor, and the processing resources required to run the proposed system. An important advantage of the proposed system, when compared with the others, is that it can be easily customized without re-training the models. An extension of this work would be to develop a hybrid approach that combines deep learning techniques with the proposed approach to improve the overall recognition accuracy. In addition, an investigation of the effect of multiple types of noise on the effectiveness of the proposed approach will be considered in our future directions.
