Abstract
Smart mobile devices have fostered new interaction scenarios for Ambient Intelligence that demand sophisticated interfaces. The main developers of operating systems for such devices provide APIs with which developers can implement their own applications, including different solutions for graphical interfaces, sensor control and voice interaction. Despite the usefulness of such resources, no strategies have been defined for coupling the multimodal interface with the possibilities that the devices offer to identify and adapt to user needs. As a result, current apps are usually developed ad hoc and the spoken interface is conceived as an input channel for simple commands. In this paper we present a practical mobile application that integrates features of the Android APIs in a modular architecture that emphasizes multimodal conversational interaction and context-awareness to foster user-adaptivity, robustness, and maintainability.
Introduction
Devices such as smartphones and tablets are becoming widespread, and the fact that more and more individuals always carry a device with numerous displays, sensors and connectivity possibilities opens new interaction scenarios for Ambient Intelligence (AmI) systems and Smart Environments (SmE) that demand more sophisticated interfaces [3,4,31,33].
Multimodal conversational interfaces go a step beyond traditional graphical user interfaces (GUIs) by adding the possibility to communicate with these devices through other interaction modes, such as speech, tactile, and visual interaction [14,34,38,39]. They can be defined as computer programs designed to emulate communication capabilities of a human being by including different communication modalities to build a more natural human-machine interaction.
However, conversational interfaces are usually designed ad hoc for their specific domain using rule-based models and standards in which developers must specify the steps to be followed by the system for each user response [49]. Consequently, adapting these hand-crafted systems to specific user requirements or to new tasks in a dynamic scenario is a time-consuming process that requires considerable effort [7,34].
In addition, mobile devices have led to a new paradigm in which they can collect information from the user pervasively. This can help build more complex user models to be employed not only to provide the system functionality, but also to boost its performance. However, this information is not usually considered when designing the dialogue model for the conversational interface [8,25,45]. For this reason, in most dialogue applications, the dialogue specification is the same for all cases: users typically have no control over the content or presentation of the service provided.
To solve this problem, it is necessary to perform careful user modelling and effective dialogue management. Statistical approaches are suited for this purpose, as the application of such techniques to user modelling and dialogue management makes it possible to consider a wider space of dialogue strategies in comparison to engineered rules [11,21]. The main reason is that statistical approaches can be trained from real dialogues, modelling the variability in user behaviours. Although the parameterization of the model depends on expert knowledge of the task, the resulting conversational interfaces have a more robust behaviour, better portability, and are easier to adapt to different user profiles or tasks [43].
In this paper, we propose the development of context-aware multimodal conversational interfaces by means of a domain-independent architecture. Our proposal is based on the definition of a statistical methodology for user modelling that estimates the user intention during the dialogue. The term user intention expresses the information that the user has to convey to the system to achieve their goals, such as extracting some particular information from the system. It is a very useful and compact representation of human-computer interaction that specifies the next steps to be carried out by the user as a counterpart in the human-machine conversation.
In our contribution, the information provided by the user model is also enriched by considering information related to the external context of the interaction. This information is acquired by means of sensors supported by mobile devices. At the end of 2016, 86.8% of smartphones and tablets operated with Android OS [9]. Also, there is an active community of developers who use the Android Open Source Project and have made more than one million applications available at the official Play Store, many of them completely free. For these reasons, our proposal makes use of different facilities integrated in Android-based devices.
The internal and external context, respectively related to the prediction of the user intention by the user model and the information provided by the sensors, make it possible to adapt the system dynamically, taking into account these valuable information sources.
To do this, a statistical dialogue model based on neural networks is generated taking into account the contextual information and the history of the dialogue up to the current moment. The next response of the conversational interface is selected by means of this model. The codification of the information and the definition of a data structure that takes into account the data supplied by the user throughout the dialogue make the estimation of the dialogue model from training data manageable in practical domains.
The remainder of the paper is organized as follows. Section 2 describes the motivation of our proposal and related work. Section 3 describes the proposed framework to develop adaptive multimodal conversational interfaces for mobile devices. Section 4 shows how this practical framework has been employed to develop a multimodal city street and entertainment guide for Android-based mobile devices. Section 5 shows the results of the evaluation of this system using a set of defined measures. Finally, Section 6 presents some conclusions and future research lines.
State of the art
Conversational interfaces are computer programs that engage the user in a dialogue that aims to be similar to that between humans [34]. Different processes must be completed to achieve this complex objective, such as speech and feedback recognition, natural language processing, dialogue management or speech synthesis.
Recent advances in conversational interfaces have been propelled by the convergence of three enabling technologies. First, the Web emerged as a universal communications channel. Web-based dialogue systems are scalable enterprise systems that leverage the Internet to simultaneously deliver dialogue services to large populations of users. Second, the development of mobile technologies and intelligent devices, such as smartphones and tablets, has made it possible to deploy a large number of sensors and to integrate them into dialogue systems that provide multimodal interaction capabilities (i.e., use of different modalities for the input and/or output of the system) and can be accessed almost anywhere and at any time. Third, computational linguistics, the field of artificial intelligence that focuses on natural language software, has significantly improved speech recognition, natural language understanding and speech synthesis capabilities [34,39].
The design and development of a comprehensive adaptive conversational interface can be conceptually divided into two interconnected components: the user modelling and the corresponding adaptation, which in our proposal is implemented in the dialogue manager.
User modelling
Research in techniques for user modelling has a long history within the fields of language processing and speech technologies. According to [43], very early examples of user modelling in these fields are dominated by knowledge-based formalisms and various types of logic aimed at modelling the complex beliefs and intentions of agents.
However, traditionally conversational interfaces have tended to focus on cooperative, task-oriented rather than conversational forms of dialogue, so that user models have been typically less complex. It is possible to classify the different approaches with regard to the level of abstraction at which they model dialogue. This can be either at the acoustic level, the word level or the intention-level. The latter is a particularly useful and compact representation of human-computer interaction. Intentions cannot be observed, but they can be described using the speech-act and dialogue-act theories [46].
In recent years, simulation on the intention-level has been most popular [43]. This approach was first used by Levin et al. [28] and has been adopted in later work on user simulation by most research groups [6,10]. Modelling interaction on the intention-level avoids the need to reproduce the enormous variety of human language on the level of speech signals or word sequences [15,30].
The main purpose of a user intention model in this field is to improve the usability of a conversational interface through the generation of corpora with interactions between the system and the user model [20,30], reducing the time and effort required for collecting large samples of interactions with real users. The user model can be employed to evaluate different aspects of a conversational interface, particularly at the earlier stages of development, or to determine the effects of changes to the system’s functionalities (e.g., evaluate confirmation strategies or introduce errors or unpredicted answers in order to evaluate the capacity of the dialogue manager to react to unexpected situations).
Two main approaches can be distinguished in the creation of user intention models: rule-based and data-based (or corpus-based). In a rule-based user model, different rules determine the behaviour of the system [34]. In this approach the researcher has complete control over the design of the evaluation study. However, these proposals are usually designed ad hoc for their specific domain using models and standards in which developers must specify each step to be followed by the user model. Consequently, adapting these hand-crafted models to new tasks is a time-consuming process that requires considerable effort.
Corpus-based approaches use probabilistic methods to select the next user input, with the advantage that this uncertainty can better reflect the unexpected behaviours of users interacting with the system. Statistical models of user intention have been suggested as the solution to the lack of data required for training and evaluating dialogue strategies. Using this approach, the conversational interface can explore the space of possible dialogue situations and learn enhanced strategies [43].
Georgila et al. [10] proposed the use of HMMs, defining a more detailed description of the user states and considering an extended representation of the history of the dialogue. A dialogue is represented as a sequence of Information States [37,52]. Two different methodologies are described to select the next user action given a history of information states. The first method uses n-grams, whereas the second is based on the use of a linear combination of 290 characteristics to calculate the probability of every action for a specific state.
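To make the n-gram idea concrete, the following is a minimal sketch of a bigram user model, simplified to dialogue-act labels rather than full information states. It is not the implementation of [10]; the function names and the toy corpus are assumptions introduced only for illustration.

```python
from collections import Counter, defaultdict

def train_bigram(dialogues):
    """Count how often each user act follows each preceding act."""
    counts = defaultdict(Counter)
    for acts in dialogues:
        for prev, nxt in zip(acts, acts[1:]):
            counts[prev][nxt] += 1
    return counts

def next_act_distribution(counts, prev):
    """Relative-frequency estimate of P(next act | previous act)."""
    total = sum(counts[prev].values())
    if total == 0:
        return {}  # unseen history: no estimate available
    return {act: n / total for act, n in counts[prev].items()}

# Hypothetical toy corpus of user dialogue-act sequences.
corpus = [
    ["greet", "ask_place", "confirm", "bye"],
    ["greet", "ask_place", "ask_route", "bye"],
    ["greet", "ask_route", "confirm", "bye"],
]
model = train_bigram(corpus)
dist = next_act_distribution(model, "greet")
```

A simulator would then sample the next user act from `dist`; longer histories follow the same pattern with tuples of acts as keys.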
A technique for user modelling based on explicit representations of the user goal and the user agenda is presented in [41]. The user agenda is a structure that contains the pending user dialogue acts that are needed to elicit the information specified in the goal. The agenda-based simulator is used in [42] to train a statistical Partially Observable MDP (POMDP)-based dialogue manager [5]. The main drawback of this approach is that the large state space of practical conversational interfaces makes its direct representation intractable [52]. Another disadvantage of the POMDP methodology is that the optimization process is free to choose any action at any time.
A data-driven user intention simulation method is presented in [19] that integrates diverse user discourse knowledge (cooperative, corrective, and self-directing). User intention is modelled based on logistic regression and a Markov logic framework. Human dialogue knowledge is organized into two layers, domain knowledge and discourse knowledge, and integrated with the data-driven model at generation time.
Dialogue management
The dialogue management process of a conversational interface relies on the fundamental task of deciding the next action of the system, interpreting the incoming semantic representation of the user input in the context of the dialogue. In addition, it resolves ellipsis and anaphora, evaluates the relevance and completeness of user requests, identifies and recovers from recognition and understanding errors, retrieves information from data repositories, and decides the system's next response.
The design of the dialogue manager has been traditionally carried out by hand-crafting dialogue strategies tightly coupled to the application domain in order to optimize the behaviour of the dialogue system in that context. This way, the simplest dialogue management strategy is programmatic dialogue management, in which a generic program implements the application with an interaction model based on finite-state machines [2].
Frame-based dialogue managers do not have a predefined dialogue path but use a frame structure comprised of one slot per piece of information that the system can gather from the user [32]. The core idea is that humans communicate to achieve goals and during the interaction the mental state of the speakers may change. This strategy is suitable for form-filling tasks in which the system asks the user a series of questions to gather information, and then consults an external knowledge source, such as the ones that can be developed with VoiceXML.
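The frame-based strategy can be sketched as follows. This is a minimal illustration under assumed slot names (`place_type`, `area`) and action labels, none of which come from a specific system: the manager fills whichever slots the user provides, in any order, and asks for the first one still missing.

```python
from dataclasses import dataclass, field

@dataclass
class Frame:
    # One slot per piece of information to gather (hypothetical slot names).
    slots: dict = field(
        default_factory=lambda: {"place_type": None, "area": None}
    )

    def update(self, values):
        """Store any slot values found in the user's turn, in any order."""
        for name, value in values.items():
            if name in self.slots:
                self.slots[name] = value

    def next_action(self):
        """Ask for the first empty slot; query once the frame is complete."""
        for name, value in self.slots.items():
            if value is None:
                return f"ask_{name}"
        return "query_knowledge_source"

frame = Frame()
first = frame.next_action()
frame.update({"area": "city centre"})
second = frame.next_action()
frame.update({"place_type": "restaurant"})
third = frame.next_action()
```

Because the dialogue path is derived from the frame contents rather than fixed in advance, the user may supply the area before the place type without breaking the interaction.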
A related approach is the so-called “information state” dialogue theory [47,52]. The information state of a dialogue represents the information needed to uniquely distinguish it from all others. It comprises the accumulated user interventions and previous dialogue actions on which the next system response can be based. The information state is also sometimes known as the conversation store, discourse context, or mental state. Following this theory, the main tasks of the dialogue manager are to update the information state based on the observed user actions, and select the next system action.
In the agent-based paradigm for dialogue management, dialogue is viewed as interaction between two agents, each of which is capable of reasoning about its own actions and beliefs. Agent-based dialogue management is suitable for the design of more natural dialogue systems in which the system and the user can share the initiative of the dialogue [24]. Thus, the dialogue manager takes the preceding context into account and the dialogue evolves dynamically as a sequence of related steps that build on top of each other. Automating the learning of agent-based dialogue managers by using statistical models trained with real conversations also allows us to model the variability in user behaviours and explore a wider range of strategies and dialogue movements, also reducing the time and effort required to develop the dialogue manager [8].
Statistical approaches for dialogue management present several important advantages. Rather than maintaining a single hypothesis for the dialogue state, they maintain a distribution over many hypotheses for the correct dialogue state. Statistical dialogue models can be trained with corpora of human-computer dialogues with the goal of explicitly modelling the variance in user behaviour that can be difficult to address by means of hand-written rules [43].
The most widespread methodology for machine-learning of dialogue strategies consists of modelling human-computer interaction as an optimization problem using Markov Decision Processes (MDP) and reinforcement methods [28]. The main drawback of this approach is that the large state space of practical spoken dialogue systems makes its direct representation intractable [52]. Partially Observable MDPs (POMDPs) outperform MDP-based dialogue strategies since they provide an explicit representation of uncertainty [5,8]. This enables the dialogue manager to avoid and recover from recognition errors by sharing and shifting probability mass between multiple hypotheses of the current dialogue state.
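The idea of shifting probability mass between hypotheses can be illustrated with a simplified Bayesian belief update. The sketch below omits the state-transition and action terms of a full POMDP update, and the state names and likelihood values are assumptions chosen for illustration.

```python
def update_belief(belief, likelihood):
    """Reweight each state hypothesis by how well it explains the observation.

    belief:     dict mapping state -> prior probability
    likelihood: dict mapping state -> P(observation | state)
    """
    unnormalised = {s: p * likelihood.get(s, 0.0) for s, p in belief.items()}
    total = sum(unnormalised.values())
    if total == 0:
        return dict(belief)  # observation explains nothing: keep the prior
    return {s: p / total for s, p in unnormalised.items()}

# Hypothetical example: two competing hypotheses about the user's goal.
prior = {"wants_restaurant": 0.5, "wants_bank": 0.5}
# A noisy recognition result that fits "restaurant" better than "bank".
observation_likelihood = {"wants_restaurant": 0.8, "wants_bank": 0.2}
posterior = update_belief(prior, observation_likelihood)
```

The less likely hypothesis is kept with reduced probability rather than discarded, so a later observation can still recover it after a recognition error.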
Other interesting approaches for statistical dialogue management are based on modelling the system by means of Hidden Markov Models [6], stochastic Finite-State Transducers [18], or using Bayesian Networks [35]. Also [26] proposed a different hybrid approach to dialogue modelling in which n-best recognition hypotheses are weighted using a mixture of expert knowledge and data-driven measures by using an agenda and an example-based machine translation approach respectively.
Modular architecture for the development of practical mobile conversational applications
Figure 1 shows the architecture that we propose for the development of adaptive multimodal conversational interfaces for mobile devices. A multimodal conversational interface providing spoken dialogue integrates five main tasks to deal with user’s spoken utterances: automatic speech recognition (ASR), natural language understanding (NLU), dialogue management (DM), natural language generation (NLG), and text-to-speech synthesis (TTS).

Fig. 1. Proposed framework for the generation of multimodal conversational interfaces in Android-based mobile devices.
Discarding the simplest case, these systems require a sequence of interactions between the user and the system to achieve their final purpose. Therefore, the user’s goal is gradually reached during several dialogue turns. To do this, it is necessary to endow the system with the abilities to reference information that has appeared previously during the dialogue, take the initiative to recover the dialogue after a failure, request information that is necessary to fulfil the objective, or require clarification if it is not confident about the information provided by the user.
During the communication process, the system initially generates a message to welcome the user and inform them about the features and functionalities of the system. Then, the system must perform a basic set of actions that are cyclically repeated after each user utterance: recognize the sequence of words uttered by the user; extract the meaning of these words (i.e., understand the information that is useful for the system domain); access web services and databases to extract the information required by the user; adapt the interaction to the context features described above; decide what action or actions should be performed after each user request; and play a spoken message to provide a response to the user.
Besides the recognition capabilities that are implemented within the Android operating system, it is possible to build Android apps with speech input and output using the Google Speech API (package android.speech). With this API, speech recognition can be carried out by means of a RecognizerIntent, or by creating an instance of SpeechRecognizer. The former starts the intent and processes its results to complete the recognition, providing feedback to inform the user that the ASR is ready or that there were errors during the recognition process. The latter provides developers with different recognition-related events, thus allowing more fine-grained control of the speech recognition process. In both cases, the results are presented in the form of an N-best list with confidence scores.
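One way an application might consume such an N-best list is sketched below in plain Python, independent of the Android classes. The two thresholds and the action labels are assumptions for illustration, not values defined by the API.

```python
def choose_hypothesis(n_best, accept=0.7, confirm=0.4):
    """Pick the top-scoring hypothesis and decide how to treat it.

    n_best: list of (text, confidence) pairs with confidence in [0, 1].
    Returns (text or None, action), where action is one of
    "accept", "confirm" or "reprompt" (hypothetical labels).
    """
    if not n_best:
        return None, "reprompt"
    text, score = max(n_best, key=lambda hyp: hyp[1])
    if score >= accept:
        return text, "accept"
    if score >= confirm:
        return text, "confirm"  # ask the user to confirm before acting
    return None, "reprompt"

result = choose_hypothesis([("find a bank", 0.55), ("find a band", 0.30)])
```

In this example the best hypothesis falls between the two thresholds, so the system would ask for confirmation instead of acting on it directly.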
Speech recognition is the process of obtaining the text string corresponding to an acoustic input. Our proposal integrates the Google Speech API to include the speech recognition functionality in a multimodal system. Speech recognition capabilities on Android devices have evolved rapidly with each new version since Android 2.1. Android 4.4 KitKat incorporated the option of always listening, so that voice recognition is activated when the user says "OK, Google". In any case, these are usually one-shot dialogues in which the user requests a functionality and the system performs the associated action (e.g. web search or navigation).
Once the conversational interface has recognized what the user uttered, it is necessary to understand what they said. Natural language processing generally involves morphological, lexical, syntactical, semantic, discourse and pragmatic knowledge [51]. Lexical and morphological knowledge make it possible to divide words into their constituents, distinguishing lexemes and morphemes. We propose the use of grammars in order to perform the semantic interpretation of the user inputs [22,48].
In our contribution, we model the context of the interaction as an additional valuable information source to be considered along with the semantic representation of the user input. In order to do so, we consider external and internal contextual information sources.
We propose the acquisition of external context by using sensors currently supported by Android devices. Most Android-powered devices have built-in sensors that measure motion, orientation, and various environmental conditions. Most of these sensors provide raw data with high precision and accuracy, and are useful to monitor three-dimensional device movement or positioning, or monitor changes in the ambient environment near a device.
The Android platform supports three main categories of sensors. Motion sensors measure acceleration forces and rotational forces along three axes. This category includes accelerometers, gravity sensors, gyroscopes, and rotational vector sensors. Environmental sensors measure various environmental parameters, such as ambient air temperature and pressure, illumination, and humidity. This category includes barometers, photometers, and thermometers. Finally, position sensors measure the physical position of a device. This category includes orientation sensors and magnetometers.
The Android sensor framework (android.hardware package) provides access to these sensors and to their raw data. Android also allows applications to access location services using the classes in the android.location package. The central component of the location framework is the LocationManager system service. In addition, the Google Maps Android API makes it possible to add maps, based on Google Maps data, to the application.
Regarding internal context, our proposal is based on the traditional view of the dialogue act theory, in which communicative acts are defined as intentions or goals. Our technique is based on a statistical model to predict the user's intention during the dialogue, which is automatically learned from a dialogue corpus. This model is used by the system to anticipate the user's needs by dynamically adopting their goals and also providing them with unsolicited comments and suggestions, as well as responding immediately to interruptions and providing clarification questions. The model takes into account the complete history of the interaction and also the information stored in user profiles.
Our proposed technique for user modelling predicts the next user dialogue act in the same representation defined for the spoken language understanding module. We represent dialogues as a sequence of pairs (
A data structure, that we call User Register (
The information contained in
General user information: User's name and machine identifier, gender, preferred language, pathologies or speech disorders, age, current location, date, and time.
Skill level: This level is estimated by taking into account variables like the number of previous sessions, dialogues and dialogue turns, their durations, time that was necessary to access a specific web service, the date of the last interaction with the system, etc. A low, medium, high or expert level is assigned using these measures.
Usage statistics: They store the count of each action that a user performs on the system. Users' preferences are automatically evaluated considering the services most frequently requested by the user during previous dialogues, the date and hour of previous interactions, the most frequent objectives and restrictions, and the preferred output modality.
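The skill level and usage statistics described above could be maintained along the following lines. This is only a sketch: the thresholds, the use of the session count as the sole measure, and the action names are all assumptions, since the text lists the variables involved but not how they are combined.

```python
from collections import Counter

# Hypothetical upper bounds on the number of previous sessions per level.
LEVELS = [(3, "low"), (10, "medium"), (30, "high")]

def skill_level(previous_sessions):
    """Map the number of previous sessions to one of four skill levels."""
    for upper_bound, label in LEVELS:
        if previous_sessions < upper_bound:
            return label
    return "expert"

# Usage statistics: one counter entry per action performed on the system.
usage = Counter()
for action in ["find_restaurant", "show_route", "find_restaurant"]:
    usage[action] += 1

# The most frequently requested service is a simple preference estimate.
preferred_service = usage.most_common(1)[0][0]
```

A fuller version would fold in the other measures mentioned in the text, such as dialogue durations and the date of the last interaction.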
We propose to solve the previous equation by means of a classification process, which takes the current state of the dialogue (represented by means of the set
As previously described, the dialogue manager decides the next action of the system [47,50], interpreting the incoming semantic representation of the user input in the context of the dialogue. This module deals with different sources of information such as the NLU results, database query results, application domain knowledge, and knowledge about the users and the previous dialogue history to select the next system action. We propose a statistical methodology that combines multimodal fusion and dialogue management functionalities. To do this, a data structure is introduced to store the information provided by the user's inputs, the user's intention model, and the context of the interaction.
The process that is carried out by the dialogue manager can lead the system to ask the user for additional information, to require a confirmation for information already provided by the user, or to generate a response once the user has provided all the required information. In the last case, the Web Query Manager receives this information and provides the dialogue manager with the result of the query (e.g., the list of recommended hotels that fulfil the user requirements, films showing in theatres near the user's location, etc.).
The methodology that we propose for the multimodal data fusion and dialogue management processes considers the set of input information sources (spoken interaction, visual interaction, external context, and user intention modelling) by means of a machine-learning technique. As in our previous work on statistical dialogue management [11], we propose the definition of a data structure similar to the User Register to store the values for the dialogue acts provided by means of the different input modalities along the dialogue history, which we called Interaction Register (
The information contained in the
As in our previous work on dialogue management [11], we propose the use of a MLP-based classification process to determine the next system response given the single input that is provided by the interaction register after the fusion of the input modalities and also considering the previous system response. This way, the current state of the dialogue is represented by the term
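The MLP-based selection of the next system response can be illustrated with a tiny forward pass in plain Python. The input encoding, layer sizes, weights and response labels below are all hypothetical; a real model would learn its weights from a dialogue corpus.

```python
import math

def dense(weights, bias, x):
    """Affine layer: one output per weight row."""
    return [sum(w * v for w, v in zip(row, x)) + b
            for row, b in zip(weights, bias)]

def softmax(z):
    """Numerically stable softmax over a list of scores."""
    m = max(z)
    exps = [math.exp(v - m) for v in z]
    total = sum(exps)
    return [e / total for e in exps]

def select_response(x, params, responses):
    """One hidden tanh layer followed by a softmax over system responses."""
    hidden = [math.tanh(v) for v in dense(params["W1"], params["b1"], x)]
    probs = softmax(dense(params["W2"], params["b2"], hidden))
    best = max(range(len(probs)), key=probs.__getitem__)
    return responses[best], probs

# Hypothetical hand-set parameters: 3 inputs, 2 hidden units, 3 responses.
PARAMS = {
    "W1": [[1.0, 0.0, 0.0], [0.0, 0.0, 1.0]],
    "b1": [0.0, 0.0],
    "W2": [[1.0, 0.0], [0.0, 1.0], [-1.0, -1.0]],
    "b2": [0.0, 0.0, 0.0],
}
RESPONSES = ["ask_place_type", "confirm_area", "provide_results"]

# Hypothetical numeric encoding of the current dialogue state.
response, probs = select_response([1, 0, 2], PARAMS, RESPONSES)
```

The classifier returns a distribution over responses, so the system can pick the most probable one or fall back to a safer action when no response is clearly preferred.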
Multimodal output generation
The modality fission module receives abstract, modality independent presentation goals from the dialogue manager. The multimodal output depends on several constraints for the specific domain of the system, e.g., the current scenario, the display size, and user preferences like the currently applicable modality mix. This module applies presentation strategies that decompose the complex presentation goal into presentation tasks. It also decides whether an object description is to be uttered verbally or graphically. The result is a presentation script that is passed to the Visual Information and Natural Language generation modules.
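A presentation strategy of this kind might look like the sketch below, which splits a list of results into a visual task and a spoken task. The rules (show what fits on screen, speak names only for short lists) and all names are assumptions for illustration, not the rules used by the system.

```python
def plan_presentation(sites, max_rows, speech_enabled=True):
    """Decompose a presentation goal into a visual and a spoken task."""
    visual = sites[:max_rows]  # only what fits on the display
    if not speech_enabled:
        spoken = None
    elif not sites:
        spoken = "I found no places."
    elif len(sites) <= 2:
        # Short lists are read out in full.
        spoken = "I found " + " and ".join(sites) + "."
    else:
        # Long lists are summarised verbally and detailed graphically.
        spoken = (f"I found {len(sites)} places; "
                  f"the first {len(visual)} are on screen.")
    return {"visual": visual, "spoken": spoken}

script = plan_presentation(["Bank A", "Bank B", "Bank C", "Bank D"], max_rows=3)
```

The returned dictionary plays the role of the presentation script that is handed to the visual and natural language generation modules.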
The visual generation module creates the visual arrangement of the content using dynamically created and filled graphical layout elements. Since many objects can be shown at the same time on the display, the manager re-arranges the objects on the screen and removes objects, if necessary. The visual structure of the user interface (UI) is defined in an Android-based multimodal application by means of layouts. Layouts can be defined by declaring UI elements in XML or by instantiating layout elements at runtime. Both alternatives can be combined in order to declare the application's default layouts in XML and add code that modifies the state of the screen objects at run time. Declaring the UI in XML makes it easier to separate the presentation of the application from the code that controls its behaviour.
UI layouts can be quickly designed in the same way a web page is generated. Android provides a wide variety of controls that can be incorporated into the UI, such as buttons, text fields, checkboxes, radio buttons, toggle buttons, spinners, and pickers. The View class provides the means to capture the events from the specific control that the user interacts with. The user interactions with the UI are captured by means of event listeners. The default event behaviours for the different controls can also be extended using the class event handlers.
Natural language generation is the process of obtaining texts in natural language from the non-linguistic internal representation of information handled by the dialogue system [27,29]. The simplest approach consists in using predefined text messages (e.g., error messages and warnings). Finally, a text-to-speech synthesizer is used to generate the voice signal that will be transmitted to the user. We propose the use of the Google TTS API to include the TTS functionality in an application.
The text-to-speech functionality has been available on Android devices since Android 1.6 (API Level 4). The android.speech.tts package includes the classes and interfaces required to integrate text-to-speech synthesis in an Android application. They allow the initialization of the TTS engine, a callback to return speech data synthesized by a TTS engine, and control the events related to completing and starting the synthesis of an utterance, among other functionalities.
Every Android device incorporates a default TTS engine. In addition, Android allows the installation and personalization of several engines, like Pico TTS (British and American English, French, Italian, German, and Spanish), IVONA TTS HQ (dialects of the English language, Polish, German, French, Italian, Spanish, Romanian, and Icelandic), SVOX Classic TTS (40 male and female voices for more than 25 languages), Samsung TTS (Korean, English, Chinese, and Spanish), CereProc Ltd (dialects of the English language), eSpeak TTS (60 different languages), EasyTTS (27 different languages), Flite TTS (dialects of the English language), Ekho TTS (Cantonese and Mandarin), or Vaja TTS (English and Thai).
A multimodal city street and entertainment guide for Android-based mobile devices
As a proof of concept to show the integration of conversational interfaces in mobile applications to provide a range of functionalities covering the provision of personalized information and services, we have used the described architecture to develop a practical multimodal city street and entertainment guide for Android-based mobile devices. The app can be operated visually and orally.
The developed multimodal conversational interface uses Google Maps, Google Directions and Google Places. The Google Maps Android API makes it possible to show an interactive map in response to a certain query. It is possible to add markers or zoom to a particular area, and also to include images such as icons, highlighted areas and routes. Google Directions is a service that computes routes to reach a certain spot on foot, by public transport or by bicycle, and it is possible to specify the origin and destination as well as specific intermediate spots. Google Places shows detailed information about sites corresponding to a number of categories, currently including 80 million businesses and other interesting sites. Each of them includes information verified by the owners and moderated contributors.
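As an illustration of how a route request with an origin, a destination, a travel mode and intermediate spots can be expressed, the sketch below builds a request URL for the Google Directions web service. The helper name, the place names and the placeholder key are assumptions; the snippet only constructs the URL and does not contact the service or reflect how the app itself issues its queries.

```python
from urllib.parse import urlencode, urlparse, parse_qs

BASE_URL = "https://maps.googleapis.com/maps/api/directions/json"

def directions_url(origin, destination, mode="walking",
                   waypoints=(), api_key="YOUR_API_KEY"):
    """Build a Directions request URL (hypothetical helper).

    mode is one of the service's travel modes, e.g. "walking",
    "transit" or "bicycling"; waypoints are joined with "|".
    """
    params = {
        "origin": origin,
        "destination": destination,
        "mode": mode,
        "key": api_key,  # placeholder, not a real key
    }
    if waypoints:
        params["waypoints"] = "|".join(waypoints)
    return f"{BASE_URL}?{urlencode(params)}"

url = directions_url("Puerta del Sol, Madrid", "Prado Museum, Madrid",
                     mode="bicycling", waypoints=["Retiro Park", "Atocha"])
query = parse_qs(urlparse(url).query)
```

`urlencode` takes care of escaping spaces, commas and the `|` separator, so place names can be passed as plain strings.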
As a city guide, it is able to locate interesting sites near the current position of the user or a different starting point indicated by the user. It is able to locate sites such as banks, libraries or restaurants and to retrieve and display information about these sites, visualize their position in different maps, show routes and information, and navigate (Fig. 2).

Use of the conversational application to locate specific places.

Information provided by the conversational application for a specific place.
The search can be performed by touching the screen, using the graphical interface or orally. Once the required sites are retrieved, e.g. restaurants in an area of 1 km around the user’s current position, the user can obtain further information about them. When a store is selected, the view is centred on it and an information box appears indicating the name of the store and its address. A new screen contains an HTML block composed of an image representing the type of required site, its name, geographic coordinates, complete address, rating in the Google+ social network, telephone, website, and its profile in Google+ (Fig. 3).
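The radius criterion used in such searches (e.g. sites within 1 km of the user) can be sketched as follows. In the app this filtering is delegated to Google Places; the code below only illustrates the underlying distance test with the haversine formula, and the function names, the dictionary keys `lat` and `lng`, and the sample data are assumptions.

```python
import math

EARTH_RADIUS_M = 6_371_000  # mean Earth radius in metres


def haversine_m(lat1, lon1, lat2, lon2):
    """Great-circle distance in metres between two (lat, lon) points."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2)
    return 2 * EARTH_RADIUS_M * math.asin(math.sqrt(a))


def sites_within(user_pos, sites, radius_m=1000):
    """Keep the sites whose distance to user_pos is at most radius_m."""
    lat, lon = user_pos
    return [s for s in sites
            if haversine_m(lat, lon, s["lat"], s["lng"]) <= radius_m]
```

For instance, `sites_within((40.4168, -3.7038), restaurants)` would keep only the entries of a hypothetical `restaurants` list that lie within 1 km of that position.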
Also, users can enter and describe their own points of interest, which can be either permanent (e.g. a preferred pub) or temporary (e.g. the place where they parked or a meeting point) (Fig. 4). As an entertainment guide, it also provides information about films, theatres and other cultural activities. The information is provided in Spanish.

Incorporation of user’s points of interest and configuration options for the conversational application.

An example of a dialogue for the city street and entertainment app.
In order to provide the functionalities described, the system engages in a dialogue with the user to retrieve different pieces of information that are complemented with the context-awareness capabilities of the system. This way, the system response is adapted taking into account the specific preferences and suggestions selected by the users, as well as the context in which the interaction takes place.
The application also allows users to complete a profile corresponding to their preferences on the location of the initial maps, preferred travel facilities, preferred types of stores, and specific details for each one of them. With regard to cultural activities, the application collects user preferences (sports, films, music, and museums) and film affinities (genre, country, year range, categories to be excluded, or favorite theatres). For illustrative purposes, Fig. 5 shows a dialogue corresponding to one of the scenarios described. Turns tagged with S refer to system turns, and turns tagged with U refer to user turns.
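One possible way to represent such a profile is sketched below. The class and field names, the default values, and the `film_matches` helper are hypothetical; they only mirror the preference categories listed above and do not reproduce the data structure actually used in the app.

```python
from dataclasses import dataclass, field


@dataclass
class FilmAffinities:
    genres: set = field(default_factory=set)
    countries: set = field(default_factory=set)
    year_range: tuple = (1900, 2100)  # inclusive (first, last) year
    excluded_categories: set = field(default_factory=set)
    favorite_theatres: set = field(default_factory=set)


@dataclass
class UserProfile:
    initial_map_location: str = ""
    travel_modes: list = field(default_factory=list)
    store_types: list = field(default_factory=list)
    activities: set = field(default_factory=set)  # e.g. {"films", "music"}
    films: FilmAffinities = field(default_factory=FilmAffinities)


def film_matches(profile, film):
    """Check a film (dict with genre, year, category) against the profile."""
    aff = profile.films
    first, last = aff.year_range
    if film["category"] in aff.excluded_categories:
        return False
    if not first <= film["year"] <= last:
        return False
    # An empty genre set means that the user stated no genre preference.
    return not aff.genres or film["genre"] in aff.genres
```

A dialogue manager could consult such a structure to skip questions whose answers are already stored in the profile.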
It is very difficult to find procedures and measures unanimously accepted by the scientific community for the evaluation of multimodal systems [16,36]. We have completed an evaluation of the developed multimodal conversational interface that is focused on two main aspects. The first one is usability, measuring the benefits of multimodal interaction compared to unimodal interfaces (only spoken or tactile interaction). The second main aspect is the assessment of context-awareness.
Usability assessment of the different input and output modalities
The methodology used to evaluate the multimodal interaction with the conversational interface is based on the work presented in [36], which points out that the usability assessment of a multimodal system requires the evaluation of the different interaction modalities. To do this, we have developed the assessment questionnaire shown in Table 1, which is based on standard questionnaires like AttrakDiff [13] and SASSI [17]. A total of 25 native Spanish users (12 men, 13 women, aged 22 to 54, avg. age 34.6) participated in the evaluation.
Questionnaire designed for the usability assessment of the developed conversational interface
As Table 1 shows, the questionnaire consists of 9 questions. Each question has 5 possible answers, from which only one is selected. The main aspects evaluated are the users’ previous experience using multimodal interfaces, the degree to which users feel that the system understood them and that they understood the system, the perceived interaction rate, the perceived difficulty level of the interaction with the system, the presence of errors, the certainty of the user about what to do at each time, and the global level of satisfaction with the system.
Results of the usability assessment of the city guide system (For the mean value M: 1 = worst, 5 = best evaluation)
Table 2 shows the average results of the subjective evaluation using the described questionnaire. We can observe that the participants’ previous experience using multimodal interfaces is very varied, as our objective was to evaluate the system with users with different degrees of familiarity with these systems.
With respect to the extent to which the users felt that the system understood them, the recognizer performed very well. As expected, the tactile mode was perceived as more accurate than the oral mode, but the oral mode was also rated very highly by the users (4.64 out of 5). As can be observed, users felt that the system understood them better with the multimodal mode than with the tactile mode.
The users also felt that the system responses were comprehensible, especially with the tactile and multimodal modes. The results of the oral mode were lower, probably due to the quality of the synthesized voice. We used the standard Spanish voice provided by Android. The results were higher when using better synthesizers such as the voices of IVONA TTS. Also, with the film functionality, the titles of the films were sometimes in languages that differed from the one selected for TTS, which reduced the intelligibility of the results in these particular cases.
Regarding the interaction rate, it was found adequate in most cases, though in some cases the participants reported that they were expecting barge-in mechanisms in the oral mode. The multimodal mode was found very useful for the users to interact with the system at a pace adequate for their needs, as they could switch between interaction modes. This is consistent with the fact that the multimodal mode obtained the highest perceived ease of use.
Generally, the users did not perceive errors during their interactions in the tactile mode, while in the oral and multimodal modes more errors were detected, although they did not imply a failure in the interaction and in every case the users could complete the task. In particular, the participants reported that the multimodal mode was more useful than the oral mode for reporting errors to the users.
Regarding the certainty of what to do at each moment of the interaction, participants felt more confident in the multimodal mode, which also received the best score in overall satisfaction, as it brings more flexibility to the user.
In [44] and [43], the authors propose a set of statistical measures to evaluate the quality of a conversational interface. Three dimensions are defined: high-level features (dialogue and turn lengths), dialogue style (dialogue-act frequency, proportion of goal-directed actions, grounding, formalities, and unrecognised actions, proportion of information provided, reprovided, requested and rerequested), and dialogue efficiency (goal completion rates and times). Additionally, the simulation presented in [1,40,41] is evaluated by testing the similarity between real and simulated data by means of statistical measures (dialogue length, task completion rate and dialogue performance).
For comparison purposes, we have developed two versions of the multimodal application described in the previous section. The baseline system does not carry out any adaptation to the user, while the context-aware system adapts the dialogue considering the context information provided by the user intention recognizer and the external context provided by the sensors.
To assess the benefits of our proposal, we have evaluated the context-aware system and compared it to the baseline system. To do so, 30 students and lecturers recruited in our department participated in the evaluation (aged 18 to 47, mean 32.2 years old, 67% male). A total of 300 dialogues were recorded from the interactions of the recruited users: 15 users employed the real context-aware system and 15 users employed the baseline version of the system. The users were provided with a brochure describing the scenarios that they were asked to complete and the main functionalities of the system.
A total of 40 scenarios was defined to consider the different queries that may be performed by users. Each scenario specified a set of objectives that had to be fulfilled by the user at the end of the dialogue and they were designed to include and combine the complete set of functionalities previously described for the system. An example of the defined scenarios is as follows:
The set of defined scenarios can be grouped into the following categories:
Basic scenarios with free interaction (e.g., obtain information about a certain kind of shops in an area of a city). In these scenarios the user has to fulfil a single dialogue goal selecting the different values that he has to provide in order to achieve it. There are 10 scenarios of this type.
Advanced scenarios with free interaction (e.g., obtain the film listings of a city on a specific date, and obtain further information about one of the films). In these scenarios the user must fulfil multiple goals. They were defined following a logical combination of the different functionalities described for the system. A dialogue is considered successful only if the complete set of goals has been achieved. There are 15 scenarios of this type.
Basic and Advanced scenarios with guided interaction. In this case, the particular values that the user must ask for are provided in the description of the scenario. In the previous cases, the user was free to choose them before the interaction. There are 25 scenarios of this type.
We have adapted the previously described measures considering the information that is available in the definition of the multimodal conversational interface and the context information included in the user profile. Our proposed measures can be classified into three groups: task success/efficiency measures, high-level dialogue features, and dialogue style/cooperativeness measures. These measures evaluate the overall quality of the acquired dialogues and provided services as a whole.
To compare the baseline and context-aware versions of the conversational interface, we computed the mean value for the evaluation measures, which we extracted from different studies [1,40]. We then used two-tailed t-tests to verify that the means are significantly different across the different types of scenarios and users, as described in [1]. The significance of the results was computed using the SPSS software [23] at the 95% confidence level (i.e., a significance level of 0.05).
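The comparison of means can be illustrated with a minimal sketch of a two-sample t statistic. The text does not state which t-test variant was run in SPSS, so the unequal-variance (Welch) form used here is an assumption, as are the function name and the sample values, which are made up and are not data from the study. The two-tailed p-value would then be read from a Student t distribution with the returned degrees of freedom, for example with a statistics package.

```python
import math
from statistics import mean, variance


def welch_t(sample_a, sample_b):
    """Return (t statistic, degrees of freedom) for two independent samples.

    Welch's unequal-variance form; the choice of variant is an assumption.
    """
    na, nb = len(sample_a), len(sample_b)
    # Squared standard error of each sample mean.
    va = variance(sample_a) / na
    vb = variance(sample_b) / nb
    t = (mean(sample_a) - mean(sample_b)) / math.sqrt(va + vb)
    # Welch-Satterthwaite approximation of the degrees of freedom.
    df = (va + vb) ** 2 / (va ** 2 / (na - 1) + vb ** 2 / (nb - 1))
    return t, df


# Illustrative values only (turns per dialogue), not taken from the paper.
context_aware_turns = [8, 9, 7, 8]
baseline_turns = [13, 14, 12, 13]
t_stat, dof = welch_t(context_aware_turns, baseline_turns)
```

The difference is declared significant when the p-value associated with `t_stat` and `dof` falls below 0.05.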
By means of high-level dialogue features, we evaluate the duration of the interactions, how much information is transmitted in individual turns, and how active the participants are. These features cover the following statistical properties: (i) dialogue length, measured as the number of turns per task, the number of turns of the shortest dialogue, the number of turns of the longest dialogue, and the number of turns of the most frequent dialogue; (ii) percentage of different dialogues in each corpus and the number of repetitions of the most frequent dialogue; (iii) turn length, measured by the number of actions per turn; and (iv) participant activity as a ratio of system and user actions per dialogue. Table 3 shows the comparison of the different high-level measures for the context-aware and context-unaware systems.
Results of the high-level features defined for the comparison of the context-aware and context-unaware apps
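The high-level features (i) to (iv) can be computed from a dialogue corpus along the following lines. The corpus representation (each dialogue as a list of turns, each turn as a speaker label "S" or "U" plus a list of dialogue acts), the function name, and the dictionary keys are assumptions made for this sketch. Participant activity is computed here over the whole corpus as user actions divided by system actions, which is one possible reading of the ratio described in the text.

```python
from collections import Counter


def high_level_features(corpus):
    """corpus: list of dialogues; a dialogue is a list of (speaker, acts)."""
    # Hashable form of each dialogue, so identical dialogues can be counted.
    keys = [tuple((spk, tuple(acts)) for spk, acts in d) for d in corpus]
    lengths = [len(d) for d in corpus]
    counts = Counter(keys)
    top_dialogue, top_reps = counts.most_common(1)[0]
    n_actions = sum(len(acts) for d in corpus for _, acts in d)
    n_user = sum(len(acts) for d in corpus for spk, acts in d if spk == "U")
    n_system = n_actions - n_user  # assumes at least one system action
    return {
        "avg_turns": sum(lengths) / len(corpus),
        "shortest": min(lengths),
        "longest": max(lengths),
        "most_frequent_turns": len(top_dialogue),
        "pct_different": 100 * len(counts) / len(corpus),
        "most_frequent_reps": top_reps,
        "actions_per_turn": n_actions / sum(lengths),
        "user_system_ratio": n_user / n_system,
    }
```

Running the function separately on the context-aware and baseline corpora would yield the two columns of a comparison such as the one in Table 3.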
Regarding the number of dialogue turns, the context-aware system produced shorter dialogues (8.3 turns on average) than the baseline system (13.4). As shown in Table 3, this reduction in the number of turns also holds for the longest, shortest and most frequent dialogues of the context-aware system. This might be because users have to explicitly provide and confirm more information using the baseline system, whereas the context-aware system automatically adapted the dialogue to the user and the dialogue history. As a result, users of the baseline app showed more variability in the ways they provided the information needed to access the different services.
Dialogue style and cooperativeness measures analyse the frequency of different dialogue acts and reflect the proportion of actions that are goal-directed (i.e., not devoted to dialogue formalities). For dialogue style features, we defined and counted a set of system/user dialogue acts. On the system side, we measured the confirmation of concepts and attributes, questions to request information, and system answers generated after a database query. On the user side, we measured the percentage of turns in which the user makes a request to the system, provides information, confirms a concept or attribute, gives a yes/no answer, or gives other answers not included in the previous categories. Finally, we have measured the proportion of goal-directed actions (request and provide information) versus the grounding actions (confirmations) and the rest of the actions.
Tables 4 and 5 respectively show the frequency of the most dominant user and system dialogue acts in the context-aware and context-unaware dialogues. On the system side, S_request, S_confirm, and S_inform indicate actions through which the system respectively requests, confirms, or provides information. S_other stands for other types of system prompts (e.g., Waiting and Not-Understood dialogue acts). On the user side, U_provide, U_query, U_confirm, and U_yesno respectively identify actions by which the user provides, requests, or confirms information, or gives a yes/no answer, while U_other represents all other user actions (e.g., dialogue formalities or out-of-task information).
In both cases, it can be observed that there are significant differences in the distribution of dialogue acts. On the one hand, users need to provide less information using the context-aware architecture. This explains the higher proportion of the rest of the user actions in the context-aware system. A higher proportion of yes/no actions can also be observed for the context-aware dialogues. These actions are mainly used to confirm that the specific services have been correctly provided using context information.
Percentages of user dialogue acts using the context-aware and baseline apps
On the other hand, Table 5 shows that there is a reduction in the system requests when the context-aware architecture is used. This explains a higher proportion of the inform and confirmation system actions when this application is used.
Percentages of system dialogue acts using the context-aware and baseline apps
Finally, we grouped all user and system actions into three categories: “goal directed” (actions to provide or request information), “grounding” (confirmations and negations), and “rest”. Table 6 shows a comparison between these categories. As can be observed, the interactions with the context-aware application have a better quality, as the proportion of goal-directed actions is higher.
Percentages of goal directed and grounding actions using the context-aware and baseline apps
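The grouping of dialogue acts into the three categories can be sketched as follows. The act labels are those introduced for Tables 4 and 5, but the exact assignment of each label to a category (in particular, treating yes/no answers as grounding) is an assumption of this example, as are the function and constant names.

```python
from collections import Counter

# Assumed mapping from dialogue-act labels to the three categories.
CATEGORY = {
    "U_provide": "goal_directed",
    "U_query": "goal_directed",
    "S_request": "goal_directed",
    "S_inform": "goal_directed",
    "U_confirm": "grounding",
    "U_yesno": "grounding",
    "S_confirm": "grounding",
    "U_other": "rest",
    "S_other": "rest",
}


def category_percentages(acts):
    """Percentage of goal-directed, grounding and rest actions in acts."""
    counts = Counter(CATEGORY.get(act, "rest") for act in acts)
    total = sum(counts.values())
    return {name: 100 * counts[name] / total
            for name in ("goal_directed", "grounding", "rest")}
```

Applied to the flattened list of acts of each corpus, the function yields one row per system of a comparison such as the one in Table 6.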
Finally, we evaluated the global operation of the applications considering the following interaction parameters: (i) question success rate (SR), the percentage of successfully completed questions (system asks, user answers, system provides appropriate feedback about the answer); (ii) confirmation rate (CR), computed as the ratio between the number of explicit confirmation turns and the total number of turns; and (iii) error correction rate (ECR), the percentage of corrected errors.
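These three parameters reduce to simple ratios over counts extracted from the dialogues. In the sketch below, the function and argument names are hypothetical, all three rates are expressed as percentages for uniformity (an assumption, since CR is defined above as a ratio), and the counts in the usage note are invented for illustration.

```python
def interaction_rates(questions_ok, questions_total,
                      confirmation_turns, total_turns,
                      errors_corrected, errors_total):
    """Return SR, CR and ECR as percentages (None if undefined)."""
    def pct(part, whole):
        # A rate is undefined when there is nothing to measure it against.
        return 100 * part / whole if whole else None

    return {
        "SR": pct(questions_ok, questions_total),
        "CR": pct(confirmation_turns, total_turns),
        "ECR": pct(errors_corrected, errors_total),
    }
```

For example, 45 successful questions out of 50, 10 explicit confirmation turns out of 80 turns, and 9 corrected errors out of 10 give SR = 90.0, CR = 12.5 and ECR = 90.0.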
The results of this evaluation for the described interactions (Table 7) show that both apps could interact correctly with the users in most cases, achieving success rates of 96.73% and 91.48% respectively. The fact that the possible user responses are restricted in most cases made it possible to achieve a very high success rate in speech recognition. Additionally, the approaches for error correction by means of confirming or re-asking for data were successful in 94.15% of the cases in which the speech recognizer did not provide the correct input.
The confirmation and error correction rates were also improved by the context-aware app, given that less information is required from the user, which reduces the probability of introducing errors. The main problem detected was related to spoken user inputs that were misrecognized with a very high ASR confidence, so that erroneous information was forwarded to the dialogue manager. However, as the success rate shows, this did not have a considerable impact on the system’s operation.
Results of the evaluation of the global operation of the context-aware and baseline systems
In this paper we have contributed a framework to develop context-aware multimodal conversational interfaces that can be easily integrated in hand-held Android mobile devices. The framework consists of an architecture in which different systems and modules cooperate to provide adapted services, and a representation mode for knowledge sharing between the components of the architecture.
Within our framework, the interaction is managed dynamically using a classification process to select the best response for the system by taking into account the previous history of the interaction. We have adapted this methodology to develop a context-aware dialogue manager that uses a data structure that stores the information provided by the user regarding the task, context information provided by a statistical user intention recognizer, and also external context that can be provided by the sensors in the mobile device. To store and share context information we have defined a data structure that also manages user profiles.
To show the pertinence of our proposal, we have implemented and evaluated an Android conversational interface that uses geographical context in order to provide different location services to its users. To develop this system we have defined the complete requirements for the task, the different modules in the system, and the user profiles.
We have completed an evaluation of the developed application to assess the benefits of the multimodal interaction and context-awareness. Regarding the multimodal interaction, the users employed the system with visual only, speech only and multimodal modes. The results show that the maximum satisfaction rates were achieved by the multimodal mode, as the users were able to switch between modalities and found this flexibility very useful.
To compare the dialogues acquired using a context-aware and a context-unaware version of the system, we have defined a set of measures adapted to the main characteristics of our proposed architecture. Using these measures, we have evaluated the success of the dialogues and services provided as well as their efficiency and variability with regard to the different objectives specified in a set of dialogue scenarios. The overall results show that the users were satisfied with the interaction with the system, which achieved high performance rates.
We are currently using the framework to build applications in other increasingly complex domains implying different web services and web services mashups. We are also interested in applying our proposal to multi-domain tasks in order to measure the capability of our methodology to adapt efficiently to the requirements of Ambient Intelligence contexts that vary dynamically.
