Abstract
In spite of having been investigated for over fifty years, developing a robust spoken dialog management system remains an open research issue in robotics and natural language processing. In this paper, we present a language-independent spoken dialog management module integrated into a human-robot interaction system. We adopt an algorithmic approach to dialog modeling. A mobile robot functioning as a shopping assistant exemplifies the proposed approach. The dialog module is composed of a state transition network, in which state switches are conditioned by both visual and communicative factors. We use the formalism of a finite state automaton, where the robot changes its state by performing a speech act or a non-verbal action from the set of specified act/action types.
Keywords
Introduction
In recent years, a large amount of research has been conducted on human-robot spoken interaction. Various models have been proposed to develop a robust human-robot interaction system. Usually, these models are based on algorithmic or machine learning approaches and exploit knowledge from the fields of natural language processing, computational linguistics, and machine learning. However, despite all the efforts of the research community, no standard model for dialog management and interpretation of speech acts has been developed to date. Developing a robust spoken dialog management system for human-robot interaction remains an open research issue and attracts a growing number of works from various areas of computer and language sciences.
Linguistic support of a human-mobile robot dialog in natural language is one of the most complex issues in a more general research field of natural language dialog in human-machine systems. Much effort has been dedicated to this field [1]. The complexity of this task arises from the multi-modal nature of information which a robot has to “understand”, as well as from significant language-dependent differences between spatial relation representation in the human mind and the robotic intelligent system. Another difficulty is caused by the problem of natural language understanding at a level that allows co-reference resolution and a correct interpretation of interlocutor intentions [2].
Research on human-robot communication modeling has been mostly exploratory in recent years and has focused on a deep analysis of various facets of the complex issue stated above. For example, some work has been done on the influence of the robot's bodily actions and gestures on the success of communication. There has also been research on the significance of gaze tracking and self-localization capability for effective human-robot collaboration [3]. On the other hand, considerable effort has been put into developing operational mobile robots that perform specific functions while conversing with a human operator in a way that is adequate with respect to two language aspects, naturalness and reliability; see, for example, [4–7].
The main goal of this work is to develop a dialog manager for a human-robot interaction system. We propose an algorithmic approach to tackle the task of natural language dialog modeling. It is necessary to mention that an alternative approach to dialog automatic processing consists in applying machine learning techniques. We describe both approaches and explain why we have chosen the algorithmic approach for this task.
Our approach is based on a state transition network, where the states depend on a set of conditions for transitions, which, in turn, are specified by both visual and communicative factors. In order to illustrate the proposed model, we implement it in the dialog management module of a mobile shopping assistant robot, the Pioneer 3DX. To provide a real-world setting for the robot functioning as a shopping assistant, we consider the Antara Fashion Hall, a shopping mall located in Mexico City, Mexico. We provide several examples of a human-robot dialog in English; however, the proposed dialog model can be implemented in a human-robot communication system developed for any natural language.
The rest of this paper is structured as follows. Section 2 presents existing approaches to dialog modeling. Section 3 describes the shopping assistant robot. Section 4 provides some characteristics of customer-shopping assistant dialog. Sections 5 and 6 describe robot and user speech acts, respectively. Section 7 presents the proposed dialog model. Finally, Section 8 draws the conclusions and points to possible directions of future work.
Existing approaches to dialog modeling: Machine learning vs. algorithmic methods
One of the most commonly used approaches to dialog modeling is based on machine learning (ML) techniques. For example, the authors of [8] used a machine learning method for detecting speech acts, which is a crucial task in automatic dialog processing. In fact, the machine learning strategy coupled with annotated corpora has been considered the most efficient method known up to date for solving many natural language processing (NLP) tasks. Machine learning methods take advantage of a vector space model, which allows for computing a similarity metric for pairs of any objects, including such objects as texts. In the vector space model, each object is represented by a set of features and their values, that is, by a vector. Features are interpreted as axes or dimensions in such space. It is obvious that a vector space may have a very large number of dimensions. The similarity metric commonly applied to vectors is the cosine of the angle between them: the more acute the angle between the vectors, the more similar the respective objects are. Recently, soft cosine similarity was introduced in [9], which allows for taking into account the similarity between features in a vector space model. The vector space model was proposed in [10], and its intelligible and detailed description can be found, for instance, in [11].
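The cosine measure described above can be illustrated with a minimal sketch. It assumes that each object is stored as a sparse vector, i.e., a dictionary mapping feature names to values; the function name and the sample vectors are hypothetical and serve only as an illustration.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two sparse vectors (dicts: feature -> value)."""
    # Only features present in both vectors contribute to the dot product.
    dot = sum(value * b.get(feature, 0.0) for feature, value in a.items())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # convention chosen for this sketch: empty vectors match nothing
    return dot / (norm_a * norm_b)


# Hypothetical feature vectors for three short texts.
text_1 = {"red": 1, "dress": 2}
text_2 = {"red": 1, "dress": 2}
text_3 = {"blue": 1, "shoes": 1}

same = cosine_similarity(text_1, text_2)       # identical direction
disjoint = cosine_similarity(text_1, text_3)   # no shared features
```

Vectors with the same direction yield a value close to 1, and vectors without shared features yield 0. The soft cosine of [9] is not covered by this sketch.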
To construct a vector space for texts, it is common to use various types of n-grams as features, where the value of n indicates the number of elements in an n-gram. These elements can be words, characters, part-of-speech (POS) tags, and other entities. There is also an option of using syntactic dependency-based n-grams [12] obtained by following syntactic relations in syntactic trees. Application of syntactic dependency-based n-grams allows for introducing syntactic information into statistic machine learning methods.
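As a small illustration of word and character n-grams, the following sketch assumes whitespace tokenization and lowercasing, which is a simplification; syntactic dependency-based n-grams [12] require a parser and are not shown. The function names are hypothetical.

```python
def word_ngrams(text, n):
    """Return word n-grams of a text as tuples (whitespace tokenization assumed)."""
    tokens = text.lower().split()
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def char_ngrams(text, n):
    """Return character n-grams of a text as strings."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]


bigrams = word_ngrams("Please follow me", 2)
trigrams = char_ngrams("robot", 3)
```

Each distinct n-gram can then serve as one dimension of the vector space.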
In order to assign a numerical value to each element in the vector space, one of the following metrics is used. (1) tf (term frequency): $\mathrm{tf}_{t,d}$ is the frequency of the term $t$ in the document $d$, that is, the number of times the term appears in the document. The idea behind this value is obvious: the more times a term appears in a document, the more important this term is for the document. (2) idf (inverse document frequency): $\mathrm{idf}_{t}$ weights a term $t$ according to the number of documents in the corpus in which it appears, i.e., this metric shows how common the term $t$ is in the collection of documents. The idea behind this value is that if a term appears in fewer documents, it is more valuable since it makes it easier to distinguish among the documents. (3) Finally, $\mathrm{tf\text{-}idf}_{t,d} = \mathrm{tf}_{t,d} \times \mathrm{idf}_{t}$ is used. This value is high when a term appears many times in a small number of documents, and it is low when a term appears only a few times in a document or when it appears in many documents, which does not allow us to use this term to clearly distinguish among the documents.
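The three metrics can be sketched as follows. The text does not give an explicit formula for idf, so this sketch assumes the common variant log(N / df), where N is the number of documents and df is the number of documents containing the term; the function name and the toy corpus are hypothetical.

```python
import math
from collections import Counter


def tf_idf(corpus):
    """corpus: list of token lists. Returns one dict (term -> tf-idf) per document."""
    n_docs = len(corpus)
    # Document frequency: in how many documents each term occurs.
    df = Counter()
    for doc in corpus:
        df.update(set(doc))
    weights = []
    for doc in corpus:
        tf = Counter(doc)  # raw term frequency in this document
        # Assumed idf variant: log(N / df).
        weights.append({t: tf[t] * math.log(n_docs / df[t]) for t in tf})
    return weights


# Toy corpus: "dress" occurs in every document, "shoes" only in the last one.
corpus = [
    ["dress", "red", "dress"],
    ["dress", "blue"],
    ["dress", "shoes", "shoes"],
]
weights = tf_idf(corpus)
```

Under this assumption, a term that occurs in every document receives the weight 0, whereas a term that is frequent in a single document receives the highest weight.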
Machine learning based processing of natural language data is usually performed by applying the following basic steps: task definition, corpus preparation, corpus annotation (often done manually in order to obtain a gold standard), feature selection and calculation of feature values (vector space construction), dimensionality reduction (e.g., applying latent semantic analysis [13]), selection of one or more suitable machine learning methods (e.g., Naive Bayes, Support Vector Machine) and their parameters, application of the methods, selection of an evaluation method (baseline), result evaluation (commonly by using 10-fold cross-validation), and interpretation of results; see Fig. 1.
However, the machine learning approach is not applicable to all tasks: for instance, it is hardly efficient in determining the grammatical meaning of words, in POS-tagging, machine translation, speech recognition, and the like. To resolve such issues, more traditional algorithmic approaches and models are to be used. For example, in [8] referenced above, no convincing results were obtained to prove that machine learning allows for efficient dialog modeling. On the other hand, speech acts appear to be data well suited to machine learning techniques. However, the problem has its roots in the very nature of speech acts: their features reflect semantic and pragmatic aspects of communication, which are very complex to detect automatically at the current level of text processing research.

Machine learning basic steps.
Taking into consideration the limitations of the machine learning approach described above, in this paper we focus on the algorithmic approach (finite state automaton) to tackle the task of natural language dialog modeling which is described in detail in the following sections.
Not all robotic systems include the natural language dialog functionality, since this capacity is not required in some operational fields. However, there are types of robots, such as robot-guide, robot-transporter, robot-waiter, robot-shopping assistant, among others, whose services are performed in a way that makes it necessary or highly desirable to communicate with a human operator in natural language.
Each robotic function or behavior imposes certain requirements or constraints on the dialog structure and the inventory of communicative acts (spoken utterances and/or non-verbal responses). For example, depending on the function, a dialog can be initiated and managed either by a user or by a robot, or the initiative and control may pass from the robot to the user depending on the dialog states or progress. A robot functioning as a guide initiates and manages the dialog; however, the dialog with robotic systems performing other functions is managed by the user or by both participants in turns. The more functions the robot has, the bigger and more diverse its inventory of communicative acts should be, and the more significant the requirement to include the Conceptual-Intentional Module (CIM) into robotic system architecture becomes. CIM includes a knowledge database and a reasoning sub-system.
As mentioned above, our work is based on the algorithmic approach to describe the dialog management module for a human-mobile shopping assistant robot interaction system within the frame of the general approach described in [15]. To illustrate the proposed model, we implement it in the dialog management module of a mobile shopping assistant robot, the Pioneer 3DX. This robot is equipped with the speech recognition module Dragon NaturallySpeaking and the syntactic analysis system FreeLing [16]. It is important to note that while the set of syntactic constraints on the word order and element displacement implemented in the system is language-specific according to the accumulated typological data [17], the inventory of elementary speech acts and communication strategies is universal, i.e., the dialog management module is language-independent. The language independence of the model is one of the main advantages of our approach, i.e., the proposed dialog model can be implemented in a human-robot communication system developed for any natural language.
The Pioneer 3DX robot with a high-precision laser sensor for distance measuring is a de facto standard for developing real-world autonomous mobile robot applications; its dimensions are comparable to those of a human being. The robot moves according to the operation field map constructed in advance; objects or locations of interest are labeled on the map according to the customer’s queries.
To provide a real-world scenario, we consider the Antara Fashion Hall, a shopping mall located in Mexico City, Mexico. It consists of three floors and hosts more than 140 stores. In this paper, we consider only two stores (objects) labeled on the robot’s imaginary map. Let us call them the Carolina Herrera boutique (CH) and the Bershka store (BS) located on the first floor of the Antara Fashion Hall. In the rest of the paper, these names are used in the description of the dialog model. However, in any other model implementation, these particular objects can be substituted by objects in a specific environment.
The robot is able to move safely in a space even in the case when new objects (absent in its map) appear; such objects are detected and avoided. The robot can also identify the moment when a particular object becomes visible to the user: this fact is detected by the absence of other objects between the given object and the user as well as by estimating the distance between the object and the user as sufficiently short; we suggest a threshold of up to three meters. In our model, if an object X appears in the robot vision field (recall that it is the field not of a camera but of the sensor), this is represented as visual_act(X).
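The visibility test described above can be sketched under simplifying assumptions that are not part of the original system description: positions are 2D map coordinates in meters, and each detected obstacle is approximated by a circle (center, radius). All names below are hypothetical.

```python
import math

VISIBILITY_THRESHOLD_M = 3.0  # threshold suggested in the text


def _distance_point_to_segment(p, a, b):
    """Shortest distance from point p to the segment a-b (2D tuples)."""
    (ax, ay), (bx, by), (px, py) = a, b, p
    dx, dy = bx - ax, by - ay
    seg_len_sq = dx * dx + dy * dy
    if seg_len_sq == 0.0:
        return math.hypot(px - ax, py - ay)
    # Projection of p onto the segment, clamped to its end points.
    t = max(0.0, min(1.0, ((px - ax) * dx + (py - ay) * dy) / seg_len_sq))
    return math.hypot(px - (ax + t * dx), py - (ay + t * dy))


def is_visible(user, obj, obstacles, threshold=VISIBILITY_THRESHOLD_M):
    """True if obj is close enough to the user and no obstacle lies in between."""
    if math.hypot(obj[0] - user[0], obj[1] - user[1]) > threshold:
        return False
    # An obstacle blocks the view if the line of sight passes through its circle.
    return all(_distance_point_to_segment(center, user, obj) > radius
               for center, radius in obstacles)


near_and_free = is_visible((0.0, 0.0), (2.0, 0.0), [])
too_far = is_visible((0.0, 0.0), (4.0, 0.0), [])
blocked = is_visible((0.0, 0.0), (2.0, 0.0), [((1.0, 0.0), 0.3)])
```

When such a check succeeds for an object X, the model would register the event visual_act(X).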
Thus, we develop a dialog management model for a system, which does not have a big repository of knowledge. Instead, the shopping assistant robot will have access to the database of items on sale in stores. The item description consists of a list of its characteristics (size, color, fabric type, and price as the basic ones) and its image can be shown to the customer upon his/her request. Although the database may be quite big, information is restricted to a limited number of types. Therefore, we chose the function of a shopping assistant, as it is simple with respect to the size and diversity of its knowledge database. In our case, dialog is managed mostly by the robot, which narrows the inventory of required communicative acts (CA).
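A record in such a database might look like the following sketch. The field names follow the characteristics listed above (size, color, fabric type, price); the class, the helper function, and the sample records are hypothetical and do not describe the actual database.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Item:
    """One item on sale, described by a small fixed set of characteristics."""
    name: str
    store: str    # store label, e.g. "CH" or "BS"
    size: str
    color: str
    fabric: str
    price: float


def find_items(items, **criteria):
    """Return the items whose fields equal all the given criteria."""
    return [item for item in items
            if all(getattr(item, field) == value
                   for field, value in criteria.items())]


# Invented sample records, for illustration only.
catalog = [
    Item("evening dress", "CH", "M", "red", "silk", 100.0),
    Item("summer dress", "BS", "M", "red", "cotton", 20.0),
    Item("summer dress", "BS", "S", "blue", "cotton", 20.0),
]
red_in_bs = find_items(catalog, store="BS", color="red")
```

Because every record has the same few fields, the knowledge the robot needs stays limited in type even when the number of records is large.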
Characteristics of customer-shopping assistant dialog
While e-commerce and e-shopping have been gaining popularity, there are still a lot of customers that prefer to look for their desired products in stores. At the same time, stores have been implementing various mobile and stationary shopping applications to provide better services to their customers. For example, nowadays, customers can take advantage of an intelligent shopping cart with a computer mounted on it [18]. A customer’s shopping list can be downloaded and used to remind the customer of the items to buy; the computer also provides information on the location and availability of goods.
However, customers often need the help of a human shopping assistant to find what they would like to purchase. A shopping assistant communicates and interacts with the customer, guiding him/her in the product search and purchase. Big stores and shopping malls may lack trained personnel to provide high-quality individual assistance, as it can be time-consuming. Therefore, there is a need to develop shopping assistant robots able to dialog with the customer, understand his/her speech, and provide appropriate attendance and support.
The state of the art includes a number of works on the design of shopping assistant or shopping guide robotic systems including such robots as Shopbot [19], TOOMAS [20], and ISSR [21]. However, the majority of works on shopping service robots have been mainly focused on the issues of position estimation, obstacle detection, vision-based self-localization, autonomous navigation, customer identification, and customer tracking. Much less work has been done on developing a dialog module for such systems due to the high computational complexity of natural language processing in general and automatic speech recognition and generation in particular. We address this issue by proposing a dialog model based on a finite state automaton.
The objective of a shopping assistant is to help a customer (user) to purchase the product/s he/she desires. The robot that provides such service may initiate the dialog by greeting the customer and asking whether its services are required, or the customer may initiate communication with the robot, for example, by asking if a certain item is available in stores.
In order to fulfill its objective, the shopping assistant robot should be able to perform a number of tasks, the following being the most basic and common ones: (1) contact the customer; (2) ask about the customer’s interests and interact with him/her; (3) guide the customer to the location of the desired items; (4) offer and/or show items according to the customer’s preferences; (5) assist in purchasing the selected product.
While performing the tasks described above, the robot may respond to the user questions or requests not related directly to the shopping procedure, i.e., it may guide the user to a cafeteria or a rest room, or offer other services depending on what is available in a particular store or shopping mall (it may also be a cinema, gym, medical service, exhibition, etc.).
To fulfill each of the tasks described above, the robot performs communicative acts (verbal and/or non-verbal) and moves guiding the user to the object of interest along the paths represented in the map of the robot.
Robot acts in the dialog
We develop a dialog model for a shopping assistant robot, which attends customers on one of the floors of a shopping mall. In our example presented further in Section 7, we refer to the Antara Fashion Hall in Mexico City, mentioned in the Introduction. In this implementation, the inventory of the robot acts includes the following three main types: speech acts, speech blocks, and multimodal blocks. The detailed inventory of the robot acts is presented in Fig. 2 and explained further in this section.
We consider Speech Acts (SA) as they are defined in speech act theory, i.e., as acts of producing a linguistic unit that corresponds to a single simple sentence and possesses a certain illocutionary force, that is, expresses (directly or indirectly) the speaker’s intention along with other pragmatic components of meaning. In our case, the inventory of the robot speech acts consists of the following: (1) a question about accepting the shopping assistant service offer (Assistant-Question); (2) a question about selecting the product of the customer’s interest (Product-Question); (3) a suggestion to choose the first object (store) from a list of alternatives (Question_Altern(CH, BS)); in our example, the list includes two options, the Carolina Herrera boutique (CH) and the Bershka store (BS), both located on the first floor of the Antara Fashion Hall; (4) dialog termination in the case of offer refusal (Closure).
Each speech act is mapped to a set of various utterances to prevent multiple productions of the same utterance. We give examples of such utterances further in this section.
Speech Blocks (SB) are defined as complex speech acts considered as sequences of speech acts of one interlocutor. In speech blocks, SAs are interconnected by means of the overall speech intention, thus a speech block is viewed as a single linguistic unit. This concept implies that the other interlocutor will respond to the unit as a whole rather than to its elementary components, i.e., speech acts. In our model, we use three types of speech blocks: Borderline Blocks (BB), Narrative Blocks (NB), and Secondary Blocks (SB).

Robot acts inventory.
Borderline Blocks (BB) include the Introductory Block (IB) and the Terminal Block (TB). Each block consists of a sequence of elementary speech acts (SAs). The Introductory Block (IB) is a sequence that includes the following components (in a fixed order): Greeting + Introduction + Survey-Proposal (an offer of a guided survey of a store and products available in it). IB is launched in the initial phase, which is one of the speech phase types (we discuss phase types further in the paper). The Terminal Block (TB) is a sequence of the following elements: Announcement of service conclusion (Announce_End) + Self-evaluation (Evaluation) + Farewell. TB is initiated in the terminal stage, which the robot determines based on the dialog history. In the dialog formal representation, the transition to the TB production phase is conditioned by the context constraint NB(X) ∈ HD, where HD is the history of the dialog and X takes one of the two values, CH or BS, i.e., the system identifies the fact that both survey locations have been visited.
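The context constraint for launching the Terminal Block can be sketched as a check over the dialog history. The sketch reads the constraint as "NB(X) ∈ HD holds for every X ∈ {CH, BS}", which is our interpretation of the statement that both survey locations have been visited; the representation of HD as a list of strings is an assumption.

```python
SURVEY_OBJECTS = ("CH", "BS")  # the two stores of the example implementation


def terminal_block_enabled(history):
    """True if a narrative block NB(X) is in the dialog history for every store X."""
    return all(f"NB({x})" in history for x in SURVEY_OBJECTS)


# Hypothetical dialog histories (HD) recorded as lists of executed blocks.
after_first_store = ["IB", "MMB1(BS)", "MMB2(BS)", "NB(BS)"]
after_both_stores = after_first_store + ["MMB1(CH)", "MMB2(CH)", "NB(CH)"]
```

With these histories, the Terminal Block would be launched only after the second narrative block has been recorded.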
Narrative Blocks (NB) are texts or presentations which contain information about the objects to be shown to the user (in our implementation, there are two objects: the Carolina Herrera boutique and the Bershka store). NBs are executed when the robot has reached the object to be surveyed by the user, or more precisely, when the robot has finished an MMB2 type multimodal block (discussed further). Relevant information is loaded in the narrative block in advance.
A Secondary Block (SB) is pronounced by the robot if the user desires to use the utilities available in the store (in our implementation, these are a cooler and a WC). SB is executed together with a Secondary Multimodal Block (SMB) considered further in this section. SB is a deviation from the main goal: it is a statement of the intention to wait while the customer uses the utility (Promise_Wait) + a request for the user to inform of his/her return (Request_Return). This block is launched by the user’s positive response to the robot’s offer to use the utility.
Multimodal Blocks (MB) are sequences of speech acts (SAs) and non-verbal actions performed in a fixed order. There are two types of MBs: main multimodal blocks (MMB) and secondary multimodal blocks (SMB). There are two types of MMB in the system: MMB1 and MMB2. MMB1 is a request (command) of the robot to follow it (Imper_Move) + the robot’s movement towards the goal (Move(X)), where X ∈ {CH, BS}. Note that we assume that the user agrees to follow the robot, that is, the user obeys the command (Imper_Move), as he/she accepts the robot’s proposal to use its shopping assistant services. During the robot’s operation, it is necessary to check from time to time whether the user is still following the robot, so an additional confirmation can be included in the model: the robot asks for a survey continuation confirmation every 5–10 minutes and, if there is no response or a negative response, terminates the services.
It is also assumed that the shopping assistant robot starts providing its services from a pre-defined initial location L (in our application, at the entrance to the first floor of the shopping mall). Then, depending on the customer’s preference, the robot moves to the first goal along the shortest path, then to the second goal along the path which includes L; remember that the robot is capable of detecting and avoiding obstacles absent in its map.
The first block of this illocutionary type is executed after the user responds to the robot’s SA Question_Altern(CH, BS) (this SA type was considered above). The second block of this type is triggered by the user’s negative response to the robot’s question if the user desires to use a certain utility, and it is generated after the first narrative block is completed.
Concerning MMB2, the robot stops when it reaches the goal (Stop_X) + issues a command to stop (Imper_Stop). This block is executed as a response to a visual act represented in the model as visual_act(X), where X ∈ {CH, BS}.
The Secondary Multimodal Block (SMB) is a sequence of actions related to secondary objects encountered during the survey (in our example, they are the utilities Cooler and WC, represented by the same words in the model). In the SMB, the robot stops at a secondary object (Stop(X)) + attracts attention to the secondary object (Announce_Utility(X)) + asks whether the customer desires to use the utility X (Question_Utility(X)), where X ∈ {Cooler, WC}. SMB, like MMB2, is started by a visual act represented in the model as visual_act(X), where X ∈ {Cooler, WC}, i.e., the robot “sees” this object labeled on its map when the distance between the object and the robot becomes short enough.
Thus, the robot can perform illocutionary-independent actions as well as illocutionary-dependent ones. These actions are defined according to [22]. Illocutionary-independent blocks, i.e., the blocks initiated by the robot, are BB, NB, MMB2 (Stop_X + Imper_Stop), and SMB. Illocutionary-dependent blocks, i.e., the blocks executed if the user performs a particular speech act (SA), are the speech block (SB) and the speech acts Question_Altern and Ask_Confirm, as well as MMB1 (Imper_Move + Move_X).
Each illocutionary type of both speech acts and speech blocks, as well as of speech components in multimodal blocks, is mapped to a set of speech patterns (utterances). The robot chooses an utterance from the relevant set and saves it in its memory. For the next iteration, the robot views the set and the memory, and chooses another utterance, saving it in the memory as well, etc. This strategy helps to avoid repetition and makes the robot verbal behavior more natural. Here are possible realizations of some SA types considered above:
Greeting: Good morning (afternoon, evening)! Hi, how are you? Good to see you! Welcome to the Antara Fashion Hall, etc.
Tour-Proposal: Would you like to have a tour along this floor? I can show you the stores located on this floor and give information of garments they offer. I can show you the way to the Carolina Herrera boutique where you can find a dress you are looking for, etc.
Evaluation: Hopefully I have coped with this task. I hope the tour was not boring. I did my best to help you find the dress of your preference, etc.
Promise_Wait: I will wait for you here. I will be here when you return. Waiting for you here. We will meet here and continue viewing the garments, etc.
Imper_Move: Please follow me. Let’s go. Let’s begin our walk. Our search begins, please accompany me, etc.
Tour-Question: Would you like … ? Are you interested in visiting the Bershka store? Will you accept this choice?
Question_Altern(CH, BS): What shall we start with: visiting the Carolina Herrera boutique or the Bershka store? What would you prefer: to first see the garments in the Carolina Herrera boutique or in the Bershka store?
Closure: OK, good-bye then. Thank you, see you later! Sorry, maybe some other day, etc.
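The utterance selection strategy with memory can be sketched as follows. The text does not state what happens once every utterance of a set has been used; this sketch assumes that the memory for that act type is then cleared. The class name is hypothetical.

```python
import random


class UtterancePicker:
    """Chooses utterances for an act type while avoiding those already used."""

    def __init__(self, patterns, rng=None):
        self.patterns = patterns                      # act type -> list of utterances
        self.memory = {act: [] for act in patterns}   # utterances already produced
        self.rng = rng or random.Random()

    def pick(self, act):
        unused = [u for u in self.patterns[act] if u not in self.memory[act]]
        if not unused:
            # Assumption of this sketch: start over when the set is exhausted.
            self.memory[act].clear()
            unused = list(self.patterns[act])
        choice = self.rng.choice(unused)
        self.memory[act].append(choice)
        return choice


patterns = {
    "Imper_Move": [
        "Please follow me.",
        "Let's go.",
        "Let's begin our walk.",
        "Our search begins, please accompany me.",
    ],
}
picker = UtterancePicker(patterns)
first_four = [picker.pick("Imper_Move") for _ in range(4)]
fifth = picker.pick("Imper_Move")
```

The first four calls return four different realizations of Imper_Move; repetition becomes possible only after the whole set has been used.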
In our dialog model, each illocutionary-independent speech act and speech block of the robot imposes a certain communicative obligation on the user under the condition that the latter in general fulfills Grice’s cooperative principle and conversational maxims [23], i.e., the user accepts the general purpose and direction of the dialog and does not attempt to mislead and/or misinform the robot. All user speech acts in this dialog type are illocutionary-dependent, i.e., due to the constraints mentioned above, the model does not include an option for the user to take initiative in the dialog, for example, by asking the robot a question, even if the question refers to the store being viewed at the moment. The user has to respond according to the illocutionary purpose of the robot speech acts, but if the user produces an unexpected utterance, the robot interprets such an utterance as an error and then tries to “get” a response of the required type (see details in the next section). In future work, we plan to remove this constraint from the dialog.
Taking into account the constraints mentioned in the previous section, the inventory of speech acts expected from the user (a user SA is denoted as uSA) consists of the following: (1) Consent, uSA(Yes), or Rejection, uSA(No): responses to an offer of surveying one store or both stores (this offer is included in the Introductory Block, IB) or to an offer of using a utility included in the SB1 block. (2) Answer to a multiple-option question concerning the selection of the object with which to begin a survey (the previously considered question Question_Altern(X), where X ∈ {CH, BS} in our particular implementation): selection of the Carolina Herrera boutique, uSA(CH), or of the Bershka store, uSA(BS). (3) Utterance communicating the user’s return to the agreed location after the utility usage, uSA(Ready). (4) Any uSA not envisaged by the system in this dialog model is interpreted as an error, i.e., as an unspecified (non-interpretable) user speech act, uSA(Error).
If the system cannot recognize a current user speech act as one of the five possible illocutionary types, the speech act is treated as uSA(Error): the robot returns to the state it was in before the unrecognized speech act was produced and generates again the same speech act after which the user uttered uSA(Error).
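This error-recovery rule can be sketched as a single step function. The table below contains only one transition taken from the example dialog of this paper (perc2 with uSA(BS)); the function name and the table layout are assumptions.

```python
def step(state, last_robot_act, usa, transitions):
    """Return (robot_act, new_state) for a user speech act in the given state."""
    key = (state, usa)
    if key not in transitions:
        # uSA(Error): stay in the same state and repeat the previous robot act.
        return last_robot_act, state
    return transitions[key]


# (state, user speech act) -> (robot act, new state)
transitions = {
    ("perc2", "uSA(BS)"): ("MMB1(BS)", "move2"),
}

repeated = step("perc2", "Question_Altern(CH, BS)", "uSA(Error)", transitions)
accepted = step("perc2", "Question_Altern(CH, BS)", "uSA(BS)", transitions)
```

On an unrecognized input the robot asks the alternative question again and remains in perc2; on uSA(BS) it executes MMB1(BS) and moves to move2.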
While the user is limited, with respect to the illocutionary types of utterances, by the communicative obligations imposed on him/her by the robot speech acts, the user has certain freedom in wording, structuring, and generating speech acts of the specified types. The speech analyzer module recognizes the units tagged in the system dictionary according to their propositional content or illocutionary function. The dialog management module assigns the user’s utterance, at a given moment in the dialog, to the illocutionary type (from the set of possible types) which corresponds to the recognized unit tags. For example, the lexeme the Carolina Herrera boutique or the Bershka store (in our examples, these lexemes are multiword named entities) generated by the user in response to the robot speech act of the type “Offer to select the first object from the set of options Question_Altern(CH, BS)” serves as the tag of the illocutionary type “Selection of the Carolina Herrera boutique uSA(CH)” or “Selection of the Bershka store uSA(BS)”, respectively. In future research, a more complex analysis will be realized, since the syntactic analyzer (parser) allows for working with the parse tree of each utterance in the dialog.
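The tag-based assignment of an utterance to an illocutionary type can be sketched as a dictionary lookup. The dictionary entries and the rule that ambiguous or unmatched input yields Error are hypothetical simplifications; the actual system relies on the speech analyzer and the FreeLing parser, which are not modeled here.

```python
import re

# Hypothetical system dictionary: tagged unit -> illocutionary type of the user SA.
TAGGED_UNITS = {
    "carolina herrera": "CH",
    "bershka": "BS",
    "yes": "Yes",
    "no": "No",
    "back": "Ready",
}


def classify(utterance, allowed_types):
    """Map an utterance to one of the allowed uSA types, or to 'Error'."""
    text = utterance.lower()
    found = {
        usa_type
        for unit, usa_type in TAGGED_UNITS.items()
        if usa_type in allowed_types
        and re.search(r"\b" + re.escape(unit) + r"\b", text)
    }
    # Exactly one recognized type is required; anything else is uSA(Error).
    return found.pop() if len(found) == 1 else "Error"


store_choice = classify("Bershka.", {"CH", "BS"})
consent = classify("Yes, thank you.", {"Yes", "No"})
unexpected = classify("What time is it?", {"Yes", "No"})
```

The set of allowed types depends on the robot speech act that precedes the user’s utterance, which is why it is passed as a parameter.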
The dialog model
Our dialog model is represented as a state transition network. Transitions are conditioned by both verbal and visual information received by the robot, as well as by the dialog history and user intentions. We represent the network as a finite automaton; other representations applied in state of the art are based on rules [24, 25].
Our model includes the following types of dialog states: the talking state (talk), the perception state (perc), the moving state (move), and the final state (final). Transitions from one state to another are performed by means of speech or movement acts of the robot. The acts are executed in response to a user speech act uSA(X), where X can have values from the set of the user illocutionary speech act types, or as a reaction to the robot’s visual perception act of an object Y, visual_act(Y), where Y can take values from the set of the objects tagged in the robot internal map. Accordingly, a robot act conditioned by its current state is given as a pair (input SA of the user or visual act) ⟶ output act of the robot.
Therefore, the dialog model can be described as a set of utterances and acts that trigger transitions from one dialog state to another. It can be implemented with recursion, i.e., when a function is called within the body of the same function, but in order to avoid infinite loops, the number of recursive calls must be limited.
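A recursive implementation with a bounded number of calls can be sketched as follows. The toy transition table with a self-loop, the function name, and the max_depth parameter are hypothetical; they only demonstrate how the limit stops an otherwise endless repetition.

```python
def run_dialog(state, inputs, transitions, max_depth=20, depth=0):
    """Recursively consume inputs; returns (last_state, list_of_robot_acts)."""
    # Stop on the depth limit, on exhausted input, or in the final state.
    if depth >= max_depth or not inputs or state == "final":
        return state, []
    key = (state, inputs[0])
    if key not in transitions:
        return state, []
    act, new_state = transitions[key]
    last_state, later_acts = run_dialog(
        new_state, inputs[1:], transitions, max_depth, depth + 1
    )
    return last_state, [act] + later_acts


# Toy table: an erroneous input keeps the robot in the same state.
toy_transitions = {("talk", "uSA(Error)"): ("Repeat", "talk")}

last_state, acts = run_dialog("talk", ["uSA(Error)"] * 100, toy_transitions, max_depth=5)
```

Although one hundred erroneous inputs are supplied, the robot repeats its act only five times before the recursion is cut off.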
Next, we present the conditions for transitions in our model. The conditions have the following pattern: current state of the robot & “condition (if True): act (do this)” ⟶ new state of the robot.
In order to make a transition, the robot must be in the corresponding state with the specified condition satisfied. Then the robot will perform a relevant speech act and update the state. We mark the condition and the corresponding state and add them to the dialog history (HD) after each application to avoid repetitions. For some transitions there may be no condition: it means that the condition is true, i.e., the transition will be performed in any case. Sometimes a transition is not necessary, that is, there is no need for the robot to perform any actions. Such cases are denoted by the symbol ∅. Note that after a transition is performed, the robot changes its state to a new one. We provide an example of the finite automaton of our dialog model in Fig. 3, where speech acts are above or to the left of the arrow; robot acts are below or to the right of the arrow; * means that i is assigned and Obj i is removed from the list of objects; ** means that the list of objects is not empty; *** means that the list of objects is empty. The first two states are interpreted as follows: while the robot is in the initial state, it performs two self-initiated speech acts, i.e., SAs are not caused by external factors: first, the robot produces the Introductory Block (IB), then the question concerning the user desire to survey a store Tour-Question is generated, after which the robot moves to perc1 which is the state of perception (of input information).

Fig. 3. Finite automaton of the dialog model.
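The transition pattern and the dialog history can be sketched in a few lines of Python. This is an illustrative sketch under stated assumptions, not the system's code: the Rule class, the step function, and the use of None for the symbol ∅ are hypothetical choices, and only the three rules for talk1, talk2, and perc1 are taken from the text.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass(frozen=True)
class Rule:
    state: str                # current state of the robot
    condition: Optional[str]  # e.g. "uSA(Yes)"; None stands for an empty condition
    act: Optional[str]        # robot act, e.g. "IB"; None stands for an empty act
    new_state: str


# Rules for the first states only; the full automaton has more.
RULES = [
    Rule("talk1", None, "IB", "talk2"),
    Rule("talk2", None, "Tour-Question", "perc1"),
    Rule("perc1", "uSA(Yes)", "Question_Altern(CH, BS)", "perc2"),
]

History = List[Tuple[str, Optional[str]]]


def step(state: str, event: Optional[str],
         history: History) -> Tuple[Optional[str], str]:
    """Apply the first rule whose state matches and whose condition holds."""
    for rule in RULES:
        # An empty condition (None) is always true.
        if rule.state == state and rule.condition in (None, event):
            history.append((rule.state, rule.condition))  # record in the dialog history
            return rule.act, rule.new_state
    return None, state  # no rule applies: stay in the current state


history: History = []
state = "talk1"
for event in (None, None, "uSA(Yes)"):
    act, state = step(state, event, history)
    print(act, "->", state)
```

Keeping the rules as data rather than as branches in code means the same step function serves any automaton of this form; only the rule list changes.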
Let us consider an example of a human-robot dialog. R is a shopping assistant robot, U is a user; each utterance is described using the formalism of the model developed in this work. Note that a transition can be performed before or after an utterance; in such cases its description is split across two lines, marked by ellipses, with the opening and closing quotation marks on different lines.
R: Good afternoon. I am a shopping assistant robot. I can guide you in a walk along this floor and give you information about the stores located here. Would you like to follow me?
talk1 & “ ∅: IB” ⟶ talk2 ;
talk2 & “ ∅: Tour-Question” ⟶ perc1
U: Yes, thank you.
perc1 & “uSA(Yes) …
R: Now we can see dresses in the Carolina Herrera boutique or in the Bershka store. What store do you prefer to visit first?
… : Question_Altern(CH, BS)” ⟶ perc2
U: Bershka.
perc2 & “uSA(BS) …
R: Follow me, please.
… : MMB1(BS)” ⟶ move2
The robot and the user go to the Bershka store. On approaching the ladies’ room, the robot stops and says:
R: Here is the ladies room. Would you like to use it?
move2 & “visual_act(WC): SMB(WC)” ⟶ perc5
U: No, thank you.
perc5 & “uSA(No) …
R: Let’s go.
… : MMB1(BS)” ⟶ move2
The robot and the user continue toward the Bershka store. Having reached the goal, the robot stops and says:
R: Here is the Bershka store.
move2 & “visual_act(BS): MMB2(BS)” ⟶ talk6
The robot provides a brief presentation of the items available in the store. Having finished and updated its state (talk6 ⟶ talk7), the robot says:
R: Follow me, please!
talk7 & “ ∅: MMB1(CH)” ⟶ move1
While moving to the second object to be surveyed, the robot approaches a drinking water cooler, stops, and says:
R: Here is a cooler. Would you like some water?
move1 & “visual_act(Cooler): SMB(Cooler)” ⟶ perc3
U: I would love to.
perc3 & “uSA(Yes) …
R: I will be waiting for you here. When you return please tell me that you are back.
… : SB” ⟶ perc4
The user goes to the cooler, takes some water, comes back, and says:
U: I am back. Ready to go on.
perc4 & “uSA(Ready) …
R: Let’s go.
… : MMB1(CH)” ⟶ move1
The robot and the user move to the Carolina Herrera boutique. Having reached the boutique, the robot stops and says:
R: Here is the boutique.
move1 & “visual_act(CH): MMB2(CH)” ⟶ talk4
Then the robot produces a discourse or a presentation describing the items on sale in the Carolina Herrera boutique. Having finished the Narrative Block and updated its state (talk4 ⟶ talk5), the robot generates the utterance:
R: Our walk has come to an end. I hope it was not too tiring. Goodbye.
talk5 & “ ∅: TB” ⟶ final
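As a consistency check, the transitions of the dialog above can be collected into a table and replayed. The following Python sketch is only an illustration: the TABLE layout, the replay function, and the use of None for ∅ are assumptions, and the act label "Narrative Block" for talk6 ⟶ talk7 and talk4 ⟶ talk5 is a placeholder, since the text names the block but not its abbreviation.

```python
from typing import Dict, List, Optional, Tuple

Key = Tuple[str, Optional[str]]  # (state, condition); None stands for an empty condition

TABLE: Dict[Key, Tuple[str, str]] = {
    ("talk1", None): ("IB", "talk2"),
    ("talk2", None): ("Tour-Question", "perc1"),
    ("perc1", "uSA(Yes)"): ("Question_Altern(CH, BS)", "perc2"),
    ("perc2", "uSA(BS)"): ("MMB1(BS)", "move2"),
    ("move2", "visual_act(WC)"): ("SMB(WC)", "perc5"),
    ("perc5", "uSA(No)"): ("MMB1(BS)", "move2"),
    ("move2", "visual_act(BS)"): ("MMB2(BS)", "talk6"),
    ("talk6", None): ("Narrative Block", "talk7"),  # placeholder label
    ("talk7", None): ("MMB1(CH)", "move1"),
    ("move1", "visual_act(Cooler)"): ("SMB(Cooler)", "perc3"),
    ("perc3", "uSA(Yes)"): ("SB", "perc4"),
    ("perc4", "uSA(Ready)"): ("MMB1(CH)", "move1"),
    ("move1", "visual_act(CH)"): ("MMB2(CH)", "talk4"),
    ("talk4", None): ("Narrative Block", "talk5"),  # placeholder label
    ("talk5", None): ("TB", "final"),
}


def replay(events: List[str], start: str = "talk1",
           max_steps: int = 100) -> Tuple[str, List[str]]:
    """Run the automaton over external events; return the last state and the acts."""
    state, pending, acts = start, list(events), []
    for _ in range(max_steps):  # bounded loop guards against cycles
        if state == "final":
            break
        if (state, None) in TABLE:  # self-initiated act, no input needed
            act, state = TABLE[(state, None)]
        elif pending and (state, pending[0]) in TABLE:
            act, state = TABLE[(state, pending.pop(0))]
        else:
            break  # no applicable transition
        acts.append(act)
    return state, acts


events = ["uSA(Yes)", "uSA(BS)", "visual_act(WC)", "uSA(No)", "visual_act(BS)",
          "visual_act(Cooler)", "uSA(Yes)", "uSA(Ready)", "visual_act(CH)"]
last_state, acts = replay(events)
print(last_state, len(acts))
```

Replaying the nine user and visual events from the example drives the automaton from talk1 to final, which confirms that the listed transitions chain together without gaps.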
Here is another segment of the dialog:
…
R: What store shall we visit first: the Carolina Herrera boutique or the Bershka store?
[perc1 & “uSA(Yes):] Question_Altern(CH, BS)” ⟶ perc2
U: I don’t know.
perc2 & “uSA(Error): ∅” ⟶ talk3
R: What store shall we visit first: the Carolina Herrera boutique or the Bershka store?
talk3 & “∅: Question_Altern(CH, BS)” ⟶ perc2
U: Well, let’s go to the Bershka store.
perc2 & “uSA(BS) …
…
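The repair loop in this segment (perc2 ⟶ talk3 ⟶ perc2) can be sketched as follows. This Python sketch rests on assumptions that are not part of the paper: the keyword lookup in classify is a hypothetical stand-in for the actual recognition of user speech act types, and MAX_REASKS is an assumed limit on how often the question is repeated.

```python
from typing import Iterable, Optional, Tuple

MAX_REASKS = 3  # assumed limit on repeated questions

# Hypothetical keyword lookup; real speech act recognition is more involved.
KEYWORDS = {"bershka": "BS", "herrera": "CH"}


def classify(utterance: str) -> str:
    """Map a reply to a store label, or to 'Error' if it is missing or ambiguous."""
    text = utterance.lower()
    hits = {label for word, label in KEYWORDS.items() if word in text}
    return hits.pop() if len(hits) == 1 else "Error"


def choose_store(replies: Iterable[str]) -> Tuple[Optional[str], int]:
    """Return the chosen label and the number of times the question was repeated."""
    reasks = 0
    for reply in replies:
        label = classify(reply)
        if label != "Error":
            return label, reasks  # perc2 & "uSA(label) ..." continues the dialog
        if reasks == MAX_REASKS:
            break  # give up instead of looping forever
        # perc2 & "uSA(Error): ∅" -> talk3, then talk3 repeats the question -> perc2
        reasks += 1
    return None, reasks


print(choose_store(["I don't know.", "Well, let's go to the Bershka store."]))
```

Bounding the number of repetitions plays the same role here as limiting the number of recursive calls in the dialog model: it keeps an uncooperative or misrecognized user from trapping the robot in the perc2/talk3 cycle.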
It can be observed that the robot’s utterances do not always sound natural. To generate more natural speech, the dialog model has to be improved: for instance, the propositional model could be replaced by a predicate model that takes into account not only the illocutionary semantics of basic speech acts but also their predicate-argument structure, and templates of relationships between utterances could be developed.
We have described a dialog manager for a human-robot interaction system implemented in a mobile shopping assistant robot. The dialog manager includes the dialog model and the description of speech acts and speech blocks, together with utterance templates for each speech act. Speech acts and blocks are used to build all model components. Our model, with the exception of the templates, is language-independent.
Although this paper describes a specific application, the architecture of the proposed model is universal: it provides a means for building a formal dialog model in an autonomous mobile robotic system for any natural language, because the inventory of speech acts and communicative strategies, unlike language-specific constraints on syntactic structure, is common to all natural languages.
In future work, we intend to remove the illocutionary-dependent constraints on human speech acts, that is, we will allow the user to take the initiative in the dialog. Furthermore, we will perform a more complex analysis of each utterance in the dialog based on the parse tree obtained by a syntactic parser. Deep learning can also be considered for this task, for example, graph-based deep learning methods [27] or more complex methods such as Hybrid Attention Networks [28]. Finally, we will generate more natural robotic utterances by implementing a predicate model instead of the propositional model used in this work.
