Deep Learning and Hierarchical Reinforcement Learning for modeling a Conversational Recommender System

Abstract

In this paper, we propose a framework based on Hierarchical Reinforcement Learning for dialogue management in a Conversational Recommender System scenario. The framework splits the dialogue into more manageable tasks, each of which corresponds to a goal of the dialogue with the user. It consists of a meta-controller, which receives the user utterance and determines which goal to pursue, and a controller, which exploits a goal-specific representation to generate an answer composed of a sequence of tokens. The modules are trained using a two-stage strategy consisting of a preliminary Supervised Learning stage followed by a Reinforcement Learning stage.

Keywords

Conversational Recommender Systems, Deep Learning, Hierarchical Reinforcement Learning, Conversational Agents

1 Introduction

Humans can solve tasks of varying complexity that require satisfying different types of information needs, ranging from simple questions about common facts (whose answers can be found in encyclopedias) to more sophisticated ones, such as which movie to watch during a romantic evening or what the recipe for a good lasagne is. These tasks can be solved by an intelligent agent able to answer properly formulated questions, possibly taking into account user context and preferences. Conversational Recommender Systems (CRS) assist online users in their information-seeking and decision-making tasks by supporting an interactive process [20] which can be goal-oriented. The goal consists in starting from general requests and, through a series of interaction cycles, narrowing down the user interests until the desired item is obtained [31].

Generally speaking, a dialogue can be defined as a sequence of turns, each one consisting of an action performed by a speaker or by a hearer. A CRS can be considered a goal-driven dialogue system whose main goal, due to its complexity, can be solved effectively by dividing it into simpler goals. Indeed, a dialogue with this kind of system can include different phases, such as chatting, answering questions about specific facts and providing suggestions enriched by explanations, with the aim of satisfying the user information needs during the whole dialogue. The Hierarchical Reinforcement Learning (HRL) literature has consistently shown that, given the right decomposition, problems can be learned and solved more efficiently, that is to say in less time and with fewer resources [22]. The most popular HRL framework, called the Options framework [37], explicitly uses subgoals to build temporal abstractions, which allow faster learning and planning. An option can be conceptualized as a sort of macro-action which includes a list of starting conditions, a policy and a termination condition. A hierarchy is defined between the options to be solved and the primitive actions executed for each option.

Deep Learning (DL) architectures are widely used in dialogue systems [30, 38] and are able to achieve good performance in generating meaningful dialogue. However, in this paper we are interested in CRS where the dialogue should support the user in a decision-making task. The system should be able to produce both meaningful dialogue and relevant suggestions for the users. In order to integrate these two aspects in our DL model, we design an end-to-end architecture, called Converse-Et-Impera (CEI), in which dialogue generation and recommendations are learnt using a unified model. HRL is used to model the different goals required by the decision-making task. Moreover, in order to enrich the descriptions of the items suggested to users, we leverage information coming from the Linked Open Data (LOD) cloud.

The main contributions of our paper are the following:

a framework based on HRL in which each CRS goal is modeled as a goal-specific representation module which learns a useful representation for the given goal;

an answer generation module leveraging the learned goal-specific representations and the user preferences to generate appropriate answers.

The goal of our research is to prove that an end-to-end architecture is able to both generate dialogue and provide relevant recommendations to the user.

The paper is structured as follows: Section 2 contains details about our methodology and summarizes the DL architecture. The evaluation is described in Section 3, while Section 4 provides related work. Final remarks are reported in Section 5.

2 Methodology

2.1 Overview

A dialogue can be considered as a temporal process because the assessment of how “good” an action is depends on the options and opportunities available while the dialogue progresses further. For this reason, action choice requires foresight and long-term planning to complete a satisfying dialogue for the user. We employ the mathematical framework of Markov Decision Processes (MDP) [2] represented by states s ∈ S, actions a ∈ A and a transition function T : (s, a) → s′. An agent operating in this framework receives a state s from the environment and can take an action a, which results in a new state s′. We define the reward function as $F : S \to ℝ$ . The objective of the agent is to maximize F over long periods of time.

A CRS can be considered a goal-driven dialogue system whose main goal, due to its complexity, can be solved effectively by dividing it into simpler goals. Indeed, a dialogue with this kind of system can include different phases, such as chatting, answering questions about specific facts and providing suggestions. In all these phases, the agent should use the words of the language to generate appropriate responses for the user conditioned on the goal to be solved. Our model, whose architecture is depicted in Fig. 1, takes inspiration from [15] to design a framework able to manage the different phases of a goal-driven dialogue by learning a stochastic policy π_g which defines a probability distribution over a finite set of agent goals g ∈ G given a state s ∈ S. The agent goals can be considered as high-level actions, thus they can be consistently modeled by the Options framework [37]. In fact, a goal g is completed by a temporally extended course of actions, starting at a given timestep t and ending after some number of steps k. A module of the framework called meta-controller is responsible for selecting a goal g_t in a given state s_t. An additional module called controller selects an action a_t given the state s_t and the current goal g_t, following a goal-specific policy π_{g_t} which defines a probability distribution over the actions that the agent can execute to satisfy the goal g_t. Given the two modules, an external critic evaluates the reward signal f_t which the environment generates for the meta-controller, while an internal critic is responsible for evaluating whether a goal has been reached and for providing an appropriate reward r_t(g) to the controller. The objective of the meta-controller is therefore to maximize the cumulative external reward $F_{t} = \sum_{t' = t}^{\infty} γ^{t' - t} f_{t'}$, where γ is the reward discount factor. Similarly, the objective of the controller is to maximize the cumulative intrinsic reward $R_{t}(g) = \sum_{t' = t}^{\infty} γ^{t' - t} r_{t'}(g)$.
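As a small illustration of the cumulative discounted reward used in both objectives, the following sketch computes F_t for every turn of a finite dialogue. The function name `discounted_returns` and the reward values are hypothetical, and the infinite sum is truncated at the end of the dialogue, which is an assumption made only for this example.

```python
from typing import List


def discounted_returns(rewards: List[float], gamma: float) -> List[float]:
    """Return [F_0, ..., F_{T-1}] with F_t = sum_{t' >= t} gamma^(t' - t) * f_{t'}."""
    returns = [0.0] * len(rewards)
    running = 0.0
    # Walk backwards so that each return reuses the one computed for t + 1.
    for t in range(len(rewards) - 1, -1, -1):
        running = rewards[t] + gamma * running
        returns[t] = running
    return returns


# Hypothetical external rewards f_t for a three-turn dialogue:
# only the last turn (e.g. an accepted suggestion) is rewarded.
external_rewards = [0.0, 0.0, 1.0]
returns = discounted_returns(external_rewards, gamma=0.5)
print(returns)
```

The same function applies unchanged to the intrinsic rewards r_t(g) received by the controller.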

Fig. 1

CEI architecture: A bidirectional RNN generates a representation for the input utterance x. The learned representation is first exploited by the meta-controller, which learns to predict the goal g associated with a specific turn. The controller learns to condition its responses on the goal representation for the goal g, the user embedding u and the latent representation of the current user utterance x.

2.2 State representation module

During a dialogue, at a given timestep t, the system receives a user utterance x = 〈x₁, x₂, …, x_m〉 and the user identifier u ∈ U. Each word x_i is mapped to a vector representation (embedding) $\mathbf{x}_{i}$ by a lookup operation on the word embedding matrix $W_{w} \in ℝ^{|V| \times d_{w}}$, and the user identifier u is mapped to an embedding $\mathbf{u}$ by a lookup operation on the user embedding matrix $W_{u} \in ℝ^{|U| \times d_{u}}$, where d_w is the word embedding size and d_u is the user embedding size. The sequence of word embeddings is encoded using a bidirectional Recurrent Neural Network (RNN) encoder with Gated Recurrent Units (GRU) as in [11], which represents each word x_i as the concatenation of a forward encoding $\overrightarrow{h_{i}} \in ℝ^{h}$ and a backward encoding $\overleftarrow{h_{i}} \in ℝ^{h}$. From now on, we denote the contextual representation of the word x_i by ${\tilde{x}}_{i} \in ℝ^{2h}$. In order to allow the system to keep track of the relevant information up to the dialogue turn t, we equip it with a second RNN with GRU units, which reads the contextual representation ${\tilde{x}}_{m}$ of the last word of each utterance up to the current one and generates a representation of the dialogue turn called dialogue state $d \in ℝ^{d_{d}}$. In this way, the generated representation is influenced by the previous turns of the conversation.

2.3 Meta-controller module

We define the meta-controller policy π_MC as a feedforward neural network which receives as input the dialogue state d and predicts the goal g ∈ G. Formally, the meta-controller is defined by the following equation: $π_{MC} = P(g | x, u) = softmax(d W_{MC}),$ (1) where softmax(y) is the softmax activation function applied to the vector y to obtain a probability distribution over the set of possible goals G and $W_{MC} \in ℝ^{d_{d} \times |G|}$ is a weight matrix.
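Equation 1 can be sketched in a few lines of plain Python. The dialogue state `d` and the weight matrix `W_mc` below are made-up values chosen only to make the example concrete; in the model both are learned.

```python
import math
from typing import List

GOALS = ["chitchat", "recommendation"]  # the two goals supported by the framework


def softmax(logits: List[float]) -> List[float]:
    """Numerically stable softmax over a list of scores."""
    shift = max(logits)
    exps = [math.exp(v - shift) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def meta_controller(d: List[float], W_mc: List[List[float]]) -> List[float]:
    """Compute softmax(d W_MC), where W_mc has d_d rows and |G| columns."""
    n_goals = len(W_mc[0])
    logits = [sum(d[r] * W_mc[r][c] for r in range(len(d))) for c in range(n_goals)]
    return softmax(logits)


# Hypothetical dialogue state (d_d = 3) and weight matrix (3 x 2).
d = [1.0, -1.0, 0.5]
W_mc = [[0.2, -0.2],
        [0.1, 0.3],
        [0.0, 0.4]]

probs = meta_controller(d, W_mc)
predicted_goal = GOALS[probs.index(max(probs))]
print(probs, predicted_goal)
```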

2.4 Goal-specific representation module

Due to the different requirements of each goal, and differently from [15], we design a goal-specific representation module which captures the aspects of the current state that the agent needs to complete the current goal. Given the current goal g, the system encodes it as an embedding $\mathbf{g}$ by using a lookup operation on the goal embedding matrix $W_{g} \in ℝ^{|G| \times d_{g}}$, where d_g is the goal embedding size. After that, the system asks the goal-specific representation module φ_g for a score vector $z \in ℝ^{|V|}$ which represents the agent attention towards specific tokens in the vocabulary V. The current implementation of the framework supports two goals in the set G, namely chitchat and recommendation. Therefore, we assume that each conversation can be divided into turns, each associated with one of the two available goals. For each goal, we provide a goal-specific representation module which generates the score vector z according to a defined strategy. The chitchat module is intended to support general conversation utterances (e.g., “Hey”, “My name is John”, etc.), so it simply returns a vector of zeros, reflecting the fact that it does not focus its attention on any token. In order to complete the recommendation goal, we designed a module called IMNAMAP, presented in [11] and inspired by [34], which generates a score vector by applying an attention mechanism over multiple documents retrieved from a knowledge base. The network learns to uncover a possible inference chain that starts by considering the most relevant features in the query and in the documents and refines a vector of attention scores over V. In particular, the documents are retrieved using a search engine that exploits the user utterance as a query.

2.5 Intra-dialogue attention mechanism

In order to fully support the user during the conversation, the system should be able to exploit information gathered during previous turns that can be useful in subsequent turns. For instance, during a given turn the agent may learn the user's preferred movie director, and in a later turn it needs to leverage this information in order to suggest relevant items. For this reason, the system applies an intra-dialogue attention mechanism to refine the score vector z_t according to the score vectors z_k with k = 1, …, t. In particular, each z_k is concatenated with the goal embedding g_t and the dialogue state d_t and given as input to a feedforward neural network in order to estimate a relevance score r_t,k. Then, a probability distribution over the score vectors is generated using ${\tilde{r}}_{t} = softmax([r_{t,1}, \dots, r_{t,t}])$. Finally, the refined score vector is computed as ${\tilde{z}}_{t} = \sum_{k=1}^{t} {\tilde{r}}_{t,k} z_{k}$. In this way, we give the agent the capability to identify the most relevant information extracted during the conversation.
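The last two steps of the intra-dialogue attention (normalising the relevance scores and mixing the score vectors) can be sketched as follows. The relevance scores are passed in directly here, whereas in the model they come from the feedforward network; the names `refine_scores`, `z_history` and `relevance` are hypothetical.

```python
import math
from typing import List


def softmax(scores: List[float]) -> List[float]:
    """Numerically stable softmax over a list of scores."""
    shift = max(scores)
    exps = [math.exp(s - shift) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def refine_scores(relevance: List[float], score_vectors: List[List[float]]) -> List[float]:
    """Return the weighted sum of z_1..z_t with weights softmax(r_{t,1..t})."""
    weights = softmax(relevance)
    vocab_size = len(score_vectors[0])
    return [
        sum(w * z[i] for w, z in zip(weights, score_vectors))
        for i in range(vocab_size)
    ]


# Three turns over a toy vocabulary of three tokens.
# The first turn is a chitchat turn, hence its all-zero score vector.
z_history = [[0.0, 0.0, 0.0],
             [0.0, 2.0, 0.0],
             [1.0, 0.0, 0.0]]
relevance = [0.0, 0.0, 0.0]  # assumed equal relevance, so each weight is 1/3

refined = refine_scores(relevance, z_history)
print(refined)
```

With equal relevance scores the refined vector is simply the average of the turn-level score vectors; a learned relevance network would shift the weights towards the turns that matter for the current goal.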

2.6 Controller module

The controller exploits the refined score vector ${\tilde{z}}_{t}$ to generate a sequence of tokens a = 〈a₁, …, a_n〉 leveraging the Sequence-to-sequence framework [36] (also called Encoder-Decoder). The Sequence-to-sequence framework consists of two modules: an RNN encoder which represents the input sequence and an RNN decoder which decodes the output sequence using a context vector c. In this work, the context vector c is the final state of the encoded user utterance, ${\tilde{x}}_{m}$. The RNN decoder generates latent representations $〈{\tilde{a}}_{1}, \dots, {\tilde{a}}_{n}〉$ for the system response using Persona-based GRU units [16] to exploit the user preferences in the text generation task. For each token a_i, a 2-layer feedforward neural network receives as input ${\tilde{a}}_{i}$ and ${\tilde{z}}_{t}$ and generates a probability distribution over the tokens of the vocabulary V. Formally, the feedforward neural network is defined by the following equation: ${\tilde{a}}_{i} = softmax([a_{i}, ({\tilde{z}}_{t} W_{ih})] W_{ho}),$ (2) where $W_{ih} \in ℝ^{|V| \times d_{ih}}$ and $W_{ho} \in ℝ^{(h + d_{ih}) \times |V|}$ are weight matrices, d_ih is the input-to-hidden layer size and [·, ·] is the concatenation operator.
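To make the shapes in Equation 2 concrete, the sketch below projects the refined score vector with W_ih, concatenates it with the decoder-side vector and applies W_ho followed by a softmax. All numbers are invented toy values (|V| = 3, h = 2, d_ih = 1), and `decoder_state` is a hypothetical name for the h-dimensional decoder input of the equation.

```python
import math
from typing import List


def softmax(scores: List[float]) -> List[float]:
    shift = max(scores)
    exps = [math.exp(s - shift) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def vecmat(v: List[float], M: List[List[float]]) -> List[float]:
    """Row vector times matrix: v has len(M) entries, result has len(M[0])."""
    return [sum(v[r] * M[r][c] for r in range(len(v))) for c in range(len(M[0]))]


def token_distribution(decoder_state, z_refined, W_ih, W_ho):
    projected = vecmat(z_refined, W_ih)   # |V| -> d_ih
    hidden = decoder_state + projected    # list concatenation, size h + d_ih
    return softmax(vecmat(hidden, W_ho))  # (h + d_ih) -> |V|


W_ih = [[0.0], [1.0], [0.0]]              # toy 3 x 1 matrix
W_ho = [[0.5, 0.0, -0.5],
        [0.0, 0.5, 0.0],
        [0.0, 2.0, 0.0]]                  # toy 3 x 3 matrix
decoder_state = [0.2, 0.4]

# Chitchat turns carry an all-zero score vector, so only the decoder speaks.
p_chitchat = token_distribution(decoder_state, [0.0, 0.0, 0.0], W_ih, W_ho)
# A recommendation turn that attends to token 1 raises its probability.
p_recommend = token_distribution(decoder_state, [0.0, 1.5, 0.0], W_ih, W_ho)
print(p_chitchat, p_recommend)
```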

2.7 Training procedure

Inspired by [33, 43], we develop a two-stage training strategy for our conversational agent, consisting of a preliminary Supervised Learning (SL) stage followed by a Reinforcement Learning (RL) stage. The motivation behind this strategy is to take into account historical data collected from previous interactions between a user and a system: these data can be used to learn an effective initial policy for the meta-controller and the controller, which is then further improved by the RL stage.

2.7.1 SL training procedure

In the SL stage, the agent learns to replicate the conversations of a given dataset D consisting of N dialogs, each of them composed of T turns (x_i,j, u_i, g_i,j, (a_i,j,1, …, a_i,j,n)), by minimizing a loss function which takes into account both the meta-controller and the controller errors with regard to the training data. Given a subset D_MC consisting of the N dialogs belonging to D, each of them composed of T turns (x_i,j, u_i, g_i,j), we define the supervised loss function for the meta-controller policy π_MC as follows: $L_{MC}(D_{MC}) = \frac{1}{N} \sum_{i=1}^{N} \sum_{j=1}^{T} CE(g_{i,j}, π_{MC}(x_{i,j}, u_{i})),$ (3) where CE is the cross-entropy loss function, used to evaluate the error of the meta-controller prediction π_MC(x_i,j, u_i) with regard to the target goal g_i,j. Given a subset D_C consisting of the N dialogs belonging to D, each of them composed of T turns (x_i,j, u_i, g_i,j, (a_i,j,1, …, a_i,j,n)), we define the supervised loss function for the controller policy π_C as follows:

$L_{C}(D_{C}) = \frac{1}{N} \sum_{i=1}^{N} \sum_{j=1}^{T} \sum_{k=1}^{n} ω(a_{i,j,k}) \, CE(a_{i,j,k}, π_{C}(x_{i,j}, u_{i}, g_{i,j})),$ (4)

where CE is the cross-entropy loss function and ω(a_i,j,k) is a function which assigns to the token a_i,j,k a weight equal to 2 if the position j refers to the last utterance (position T − 1) and a_i,j,k ∈ E, where E is the set of entities defined in the dataset (e.g., movies, actors), and a weight equal to 1 otherwise. In this way, ω gives more weight to errors made on the generated suggestions. The meta-controller and the controller are jointly trained by minimizing a loss function L_a which linearly combines the two loss functions and applies L₂-regularization as follows: $L_{a}(D_{MC}, D_{C}) = L_{MC}(D_{MC}) + L_{C}(D_{C}) + α L_{2}(W_{w}, W_{u}, W_{g}).$ (5) Joint training has the benefit of refining the shared representations employed by the agent (the user utterance encoding, the user embedding and the goal embedding) in order to obtain good performance in both tasks.
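The effect of the weighting function ω on the controller loss for a single turn can be illustrated as follows. This is a sketch under stated assumptions: turns are indexed from zero, so the last turn has index T − 1; the entity set and the predicted probabilities are invented; and the 1/N averaging over dialogs is left out.

```python
import math
from typing import Dict, List

ENTITIES = {"Dune", "Videodrome"}  # hypothetical entity set E


def token_weight(token: str, turn_index: int, num_turns: int) -> float:
    """omega: 2 for entity tokens in the last turn, 1 otherwise."""
    is_last_turn = turn_index == num_turns - 1
    return 2.0 if is_last_turn and token in ENTITIES else 1.0


def weighted_cross_entropy(targets: List[str],
                           predictions: List[Dict[str, float]],
                           turn_index: int,
                           num_turns: int) -> float:
    """Sum over the tokens of one turn of omega(a_k) * CE(a_k, pi_C)."""
    loss = 0.0
    for token, probs in zip(targets, predictions):
        # Cross-entropy against a one-hot target is -log p(target token).
        loss += token_weight(token, turn_index, num_turns) * -math.log(probs[token])
    return loss


targets = ["I", "suggest", "Dune"]
# Assumed model output: probability 0.5 on each correct token.
predictions = [{"I": 0.5}, {"suggest": 0.5}, {"Dune": 0.5}]

last_turn_loss = weighted_cross_entropy(targets, predictions, turn_index=3, num_turns=4)
other_turn_loss = weighted_cross_entropy(targets, predictions, turn_index=1, num_turns=4)
print(last_turn_loss, other_turn_loss)
```

The same three tokens cost more in the last turn because the error on the entity token Dune is counted twice.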

2.7.2 RL training procedure

Given a set of experiences D_MC which consists of N dialogs, each of them composed of T turns (x_i,j, u_i, g_i,j, R_i,j), we define a loss function based on REINFORCE [41] for the meta-controller policy π_MC as follows: $λ(i,j) = ε_{MC} E(π_{MC}(g_{i,j} | x_{i,j}, u_{i}))$, $L_{MC}(D_{MC}) = - \frac{1}{N} \sum_{i=1}^{N} \sum_{j=1}^{T} \log(π_{MC}(g_{i,j} | x_{i,j}, u_{i})) \cdot (R_{i,j} - b(x_{i,j}, u_{i})) + λ(i,j) + α L_{2}(W_{w}, W_{u}, W_{g}),$ (6) where R_i,j is the discounted cumulative reward for the current state, ε_MC is a weight associated with the entropy regularizer E(π_MC) and b(x_i,j, u_i) is the baseline, implemented as a feedforward neural network which estimates the expected future return from the current state received by the agent. Given a set of experiences D_C which consists of N dialogs, each of them composed of T turns (x_i,j, u_i, g_i,j, (R_i,j,1, …, R_i,j,n), (a_i,j,1, …, a_i,j,n)), we adopt the loss function proposed in [44] for the controller policy as follows:

$ρ(i,j,k) = \log(π_{C}(y_{i,j,k} | (y_{i,j,k-1}, \dots, y_{i,j,1}), x_{i,j}, u_{i}, g_{i,j}))$, $σ(i,j,k) = R_{i,j,k} - b(y_{i,j,k}, x_{i,j}, u_{i}, g_{i,j})$, $λ(i,j,k) = ε_{C} E(π_{C}(y_{i,j,k} | (y_{i,j,k-1}, \dots, y_{i,j,1}), x_{i,j}, u_{i}, g_{i,j}))$, $L_{C}(D_{C}) = - \frac{1}{N} \sum_{i=1}^{N} \sum_{j=1}^{T} \sum_{k=1}^{n} ρ(i,j,k) σ(i,j,k) + λ(i,j,k) + α L_{2}(W_{w}, W_{u}, W_{g}),$ (7)

where R_i,j,k is the cumulative reward for the current state, ε_C is a weight associated with the entropy regularizer E(π_C) and b(y_i,j,k, x_i,j, u_i, g_i,j) is the baseline, implemented as a feedforward neural network which estimates the expected future return from the current state received by the agent. Entropy regularization is considered in the RL community as another relevant optimization trick for policy-based methods, applied in order to prevent premature convergence to sub-optimal policies and to improve exploration [42].
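The per-turn contribution to the REINFORCE loss can be sketched as below, leaving out the 1/N averaging and the L₂ term. One point is an assumption of this sketch rather than something stated in the text: the regularizer E is implemented as the negative entropy of the policy, so that minimising the loss pushes the policy towards higher entropy, in line with the stated purpose of improving exploration. The baseline is passed in as a plain number instead of a neural network output.

```python
import math
from typing import List


def negative_entropy(probs: List[float]) -> float:
    """Sum of p * log p; assumed form of the regularizer E (see lead-in)."""
    return sum(p * math.log(p) for p in probs if p > 0.0)


def reinforce_turn_loss(probs: List[float], chosen: int, ret: float,
                        baseline: float, eps: float) -> float:
    """-log pi(chosen) * (R - b) + eps * E(pi) for a single decision."""
    advantage = ret - baseline
    return -math.log(probs[chosen]) * advantage + eps * negative_entropy(probs)


policy = [0.5, 0.5]  # hypothetical distribution over the two goals

# Return above the baseline: the chosen goal is reinforced.
loss_positive = reinforce_turn_loss(policy, chosen=0, ret=1.0, baseline=0.0, eps=0.0)
# Return equal to the baseline: the policy-gradient term vanishes.
loss_neutral = reinforce_turn_loss(policy, chosen=0, ret=1.0, baseline=1.0, eps=0.0)
# Adding the entropy regularizer lowers the loss of a high-entropy policy.
loss_regularized = reinforce_turn_loss(policy, chosen=0, ret=1.0, baseline=0.0, eps=0.1)
print(loss_positive, loss_neutral, loss_regularized)
```

The same function covers the controller case by treating `probs` as the distribution over the vocabulary for the k-th token and `ret` as R_i,j,k.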

3 Experimental evaluation

The aim of the experimental evaluation is to assess the performance of the developed agent in generating accurate suggestions and appropriate responses to the user utterances. To this end, we defined a general procedure that can be applied to classical recommender system datasets to generate goal-oriented dialogues exploiting user preferences.

We applied the dialogue generation procedure (see Sub-section 3.1) to two well-known Recommender System (RS) datasets, Movielens 1M [12] and Movie Tweetings [9]. For Movielens we leveraged the available demographic information associated with the user in order to extend the generated conversation with useful contextual questions about the user name, age and occupation. For each dataset, we retrieved from the Wikidata knowledge base the values of the properties director (wdt:P57), cast member (wdt:P161) and genre (wdt:P136) associated with the items. The procedure generates two datasets: ML1, generated from Movielens, is composed of 157,135 dialogues containing 19,039 tokens, with a mean length of 14.78, while MT, generated from Movie Tweetings, contains 48,933 dialogues with a mean length of 6.64, for a total of 53,988 tokens.

In order to assess the effectiveness of the two-stage training strategy, we designed an evaluation procedure in which we first evaluated the CEI model after the SL training and then we evaluated the trained model after the successive RL training to demonstrate the improvement of the model performance.

3.1 Dialogue dataset generation procedure

Starting from a large-scale dataset typically used in RS, we designed an automatic procedure [35] to generate a set of dialogues for a given user, making sure that the procedure introduces different linguistic variants of the user utterances and that the dialogue flow is not completely predefined. In this way we prevent the agent from simply “memorizing” the dialogue generation procedure. The idea behind our strategy is similar to the one proposed in [28]: to understand which is the “best” set of questions leading to a given set of suggestions for the current user, the authors apply their procedure to user preference data automatically generated by an NLP pipeline able to extract from item reviews the factors that are relevant for the user. Our datasets do not contain item reviews, but we have access to user feedback. For this reason, we exploit data from large knowledge bases to describe items using a set of attributes extracted from the Linked Open Data (LOD) cloud [13].

The defined dialogue generation procedure is general enough to be applied to any RS dataset which contains binary preferences for each user. In addition, each item of the dataset should have a unique identifier in a knowledge base of the LOD cloud. This is not a restrictive requirement, because such mappings are often available on the Web thanks to previous authors who have defined them [24], and the correct identifier of a given item can otherwise be found manually or by means of automatic algorithms.

Consider a classical RS dataset composed of triples (u, i, r), each representing a binary preference r ∈ {0, 1} that the user u ∈ U has expressed for the item i ∈ I, and suppose that each item has a URI which references a specific entity of the LOD cloud. It is then possible to build, for each user u, a dataset R(u) which consists of the LOD-based feature vectors associated with each item i rated by u, together with a class value equal to r. Given a set of properties P associated with the type of item to be suggested (e.g., film, music band), where each property p ∈ P has a set of associated values O(p), a LOD-based feature vector is the feature vector obtained by considering all the dummy variables [10] that can be generated from the set of properties P. For example, suppose that we want to represent the entity Pulp Fiction¹ as a LOD-based feature vector using the set of properties P = {p136, p57} and suppose that each property can assume 2 values. We need to create 4 binary features, each indicating whether the item has a specific value for a given property.
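The Pulp Fiction example can be written out as a short sketch. The two candidate values chosen for each property are illustrative assumptions (the text only says that each property can assume 2 values), and the function name `lod_feature_vector` is hypothetical.

```python
from typing import Dict, List, Set, Tuple


def lod_feature_vector(item_props: Dict[str, Set[str]],
                       property_values: Dict[str, List[str]]) -> Tuple[List[str], List[int]]:
    """Build one dummy variable per (property, value) pair in a fixed order."""
    names: List[str] = []
    vector: List[int] = []
    for prop in sorted(property_values):
        for value in property_values[prop]:
            names.append(f"{prop}={value}")
            vector.append(1 if value in item_props.get(prop, set()) else 0)
    return names, vector


# O(p) for each property: illustrative values, two per property.
PROPERTY_VALUES = {
    "p136": ["crime_film", "comedy"],              # genre
    "p57": ["Quentin_Tarantino", "David_Lynch"],   # director
}

pulp_fiction = {"p136": {"crime_film"}, "p57": {"Quentin_Tarantino"}}

names, vector = lod_feature_vector(pulp_fiction, PROPERTY_VALUES)
print(names)
print(vector)
```

Stacking these vectors for all the items rated by a user, together with the class value r, gives the dataset R(u).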

Given a dataset R(u) for each user u built using the above-mentioned procedure, Decision Trees [27]² are used to learn a model of the user preferences. As shown in Fig. 3, the learned model can be considered as a specification of the possible decision steps that the user can take to choose the items that he/she likes or does not like. In fact, a path from the root to a given leaf node represents the set of properties, and their values, that the model considers to classify the item. For this reason, a decision tree is a suitable choice to derive all the alternative paths leading to specific items rated by the user, and it can be exploited to generate questions about the attributes found in the selected paths.
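The way root-to-leaf paths translate into askable attributes can be illustrated on a tiny hand-written tree. The tree below is invented for the example and is not learned from data; the nested-dictionary encoding and the name `decision_paths` are assumptions of this sketch.

```python
from typing import Dict, List, Tuple

Condition = Tuple[str, bool]  # (binary LOD feature, required truth value)


def decision_paths(node: Dict, prefix: Tuple[Condition, ...] = ()) -> List[Tuple[List[Condition], int]]:
    """Enumerate every root-to-leaf path with the class label of its leaf."""
    if "label" in node:  # leaf node
        return [(list(prefix), node["label"])]
    feature = node["feature"]
    paths = decision_paths(node["no"], prefix + ((feature, False),))
    paths += decision_paths(node["yes"], prefix + ((feature, True),))
    return paths


# Hypothetical tree for one user: likes sci-fi films directed by David Lynch.
tree = {
    "feature": "genre=sci-fi",
    "no": {"label": 0},
    "yes": {
        "feature": "director=David_Lynch",
        "no": {"label": 0},
        "yes": {"label": 1},
    },
}

all_paths = decision_paths(tree)
liked_paths = [conditions for conditions, label in all_paths if label == 1]
print(liked_paths)
```

Each condition on a path leading to a liked item (here the genre and the director) is a candidate topic for a question such as "What genres do you like?".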

The dialogue generation procedure is based on a random dialogue state graph whose nodes represent the different states of the dialogue and whose edges represent the possible state transitions, each taken according to a given probability distribution. In particular, the procedure starts from an initial state and traverses the graph following edges sampled according to a predefined probability distribution. The random traversal continues until a final state is reached, which ends the dialogue generation procedure. The procedure is applied multiple times for each user in the dataset, until all the user preferences have been considered, in order to generate multiple conversations for him/her. As stated before, the random nature of the graph allows nodes to be visited in a random fashion, which is required to generate a realistic dataset for an in-vitro evaluation of a conversational agent.
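A random traversal of this kind can be sketched as follows. The edges and transition probabilities in `GRAPH` are invented for illustration and do not reproduce the graph of Fig. 2; the refine state is omitted to keep the example acyclic.

```python
import random
from typing import Dict, List, Tuple

# Hypothetical dialogue state graph: state -> [(next state, probability), ...].
GRAPH: Dict[str, List[Tuple[str, float]]] = {
    "user starts": [("ask name", 0.5), ("ask preferences", 0.5)],
    "ask name": [("ask age", 0.5), ("ask preferences", 0.5)],
    "ask age": [("ask occupation", 0.5), ("ask preferences", 0.5)],
    "ask occupation": [("ask preferences", 1.0)],
    "ask preferences": [("ask liked items", 0.5), ("recommend", 0.5)],
    "ask liked items": [("recommend", 1.0)],
    "recommend": [("user ends", 1.0)],
    "user ends": [],  # final state: no outgoing edges
}


def generate_state_sequence(graph: Dict[str, List[Tuple[str, float]]],
                            start: str,
                            rng: random.Random) -> List[str]:
    """Random walk from `start` until a state without successors is reached."""
    state = start
    visited = [state]
    while graph[state]:
        successors, weights = zip(*graph[state])
        state = rng.choices(successors, weights=weights, k=1)[0]
        visited.append(state)
    return visited


rng = random.Random(0)  # seeded only to make the run repeatable
sequence = generate_state_sequence(GRAPH, "user starts", rng)
print(sequence)
```

Calling the function repeatedly with the same graph yields dialogues that differ in which demographic and preference questions are asked, which is what keeps the flow from being fully predefined.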

Fig. 2

Dialogue state graph used in the dialogue generation procedure. For each user, the dialogue state graph is exploited in order to generate plausible conversations between the user and the system based on the user preferences present in the dataset. Starting from the user starts state, a random walk procedure visits the states, each of which adds specific information to the conversation.

Fig. 3

Decision tree generated from the preferences of a specific user included in a recommender system dataset. Each branch in the tree represents a binary LOD feature (e.g., genre, director, etc.). By traversing the tree from the root to one of the leaf nodes, it is possible to obtain a set of decisions that can be exploited to generate questions regarding the user preferences. The leaf node represents the item satisfying the binary features traversed during the tree visit.

The designed dialogue state graph, depicted in Fig. 2, contains nodes related to demographic information and nodes related to the elicitation of the user preferences, mainly preferences about specific values of the selected item properties or preferences of the user towards specific items. By traversing the graph, it is possible to generate a dialogue which resembles a plausible use case for a CRS. Usually, in a goal-oriented conversation, the user tells the agent his/her demographic information (i.e., name, age, occupation); after that, the agent may ask for some preferences related to items or properties, which are fundamental for the suggestion step. Finally, a recommendation state is reached in which the elicited user preferences are exploited to generate a list of suggestions for the user. Given the user profile, a reference list of recommendations for the current dialogue is generated. The list is composed of a randomly selected percentage of positive items, while the remaining ones are randomly sampled from the negative items. A description of the possible states contained in the dialogue state graph follows:

user starts: the initial state of the dialogue, in which the user typically greets the system;

ask name: if the user name is available, the procedure uses it to generate a question for the user and a user answer containing his/her name;

ask age: if the user age is available, the procedure uses it to generate a question for the user and a user answer containing his/her age;

ask occupation: if the user occupation is available, the procedure uses it to generate a question for the user and a user answer containing his/her occupation;

ask preferences: a list of k suggestions is sampled from the user ratings. A randomly chosen percentage of these are positives and the remaining ones are negatives. Using the decision tree learned for the current user u, decision paths are generated for the consistently classified positive examples in the list. For each property p ∈ P found in the considered paths, all the values associated with it are merged together. From each merged set, a given number of values (min(5, |values|+1)) is randomly sampled and associated with the related property. After that, the set of properties to be considered in the dialogue is selected according to a binomial distribution, and the corresponding system utterance and user utterance are generated using the values associated with each property.

ask liked items: a list of randomly sampled positive examples is selected from the set of user preferences and used to generate a question related to the items liked by the user.

recommend: represents the final stage of the dialogue in which the system generates a list of suggestions and composes them in a meaningful response that is presented to the user. According to the percentage of positive items in the list, a refine step is triggered.

refine: when the percentage of positive items in the list of suggestions is under 100%, a refine step is triggered. It consists of additional questions about the user's preferences, formulated so as to lead to additional positive items that replace the negative ones in the current list of suggestions. An example of a dialogue containing a refine step is shown in Table 2. Convergence is guaranteed because the refine step is applied only once, replacing the remaining negative items with positive ones.

user ends: represents the end of the dialogue generation procedure which extends the conversation with an additional response by the system providing a greeting message to the user.
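The interplay between the reference list built in the recommend state and the single refine step can be sketched as below. The item names are taken from the example tables only as placeholders, and the function names, the list size k and the positive ratio are hypothetical.

```python
import random
from typing import List


def reference_list(positives: List[str], negatives: List[str], k: int,
                   positive_ratio: float, rng: random.Random) -> List[str]:
    """Sample k suggestions: a share of positives, the rest from the negatives."""
    n_pos = min(len(positives), round(k * positive_ratio))
    n_neg = min(len(negatives), k - n_pos)
    return rng.sample(positives, n_pos) + rng.sample(negatives, n_neg)


def needs_refine(suggestions: List[str], positives: List[str]) -> bool:
    """A refine step is triggered when fewer than 100% of the items are positive."""
    return any(item not in positives for item in suggestions)


def refine(suggestions: List[str], positives: List[str], rng: random.Random) -> List[str]:
    """Apply the refine step once: swap each negative item for an unused positive."""
    kept = [item for item in suggestions if item in positives]
    unused = [item for item in positives if item not in kept]
    n_missing = len(suggestions) - len(kept)
    return kept + rng.sample(unused, min(n_missing, len(unused)))


rng = random.Random(0)  # seeded only to make the run repeatable
positives = ["Dune", "Videodrome", "Speed", "Hairspray"]
negatives = ["Pocahontas", "Witness"]

suggestions = reference_list(positives, negatives, k=4, positive_ratio=0.5, rng=rng)
refined = refine(suggestions, positives, rng) if needs_refine(suggestions, positives) else suggestions
print(suggestions, refined)
```

Because the refined list contains only positive items, no further refine step is needed, which mirrors the convergence argument given for the refine state.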

As previously stated, the designed dialogue generation procedure is general: starting from classical recommender system datasets typically used in well-known challenges, augmented with supplementary information retrieved from the LOD cloud, it is possible to generate coherent and meaningful conversations on which a conversational agent can be trained. Due to the hierarchical nature of the developed model, we annotate each step of the dialogue with a specific goal, which allows the system to understand which goal-specific module should be used to complete a given task. In particular, the states associated with the goal chitchat are: user starts, ask name, ask age, ask occupation and user ends. The states associated with the goal recommendation are: recommend, refine, ask preferences and ask liked items. An example of a conversation in the movie domain generated by the designed automatic procedure is reported in Table 1. The whole dataset is freely available on GitHub³. We associated with each entity in the dataset a compact representation generated from its Wikidata URI, so that each entity corresponds to a specific token included in the vocabulary V.

Table 1

Conversation between the bot and the user generated from the decision tree trained using user preferences

Agent	Utterance	Goal
user	Hello	chitchat
bot	What is your name?	chitchat
user	My name is John	chitchat
bot	How old are you?	chitchat
user	I am 25	chitchat
bot	What is your occupation?	chitchat
user	sales/marketing	chitchat
bot	What genres do you like?	qarecs
user	I like sci-fi. I hate animation movies	qarecs
bot	What are your favourite directors?	qarecs
user	I like David_Lynch and David_Cronenberg	qarecs
bot	What are your favourite actors?	qarecs
user	I like James_Woods.	qarecs
bot	I suggest you Dune and Videodrome. Do you like them?	qarecs
user	I love them	chitchat
bot	I am glad that you like them	chitchat

Table 2

Conversation between the bot and the user containing a refine step

Agent	Utterance	Goal
user	Hey	chitchat
bot	What is your name?	chitchat
user	I am Sofia	chitchat
bot	How old are you?	chitchat
user	I am 25	chitchat
bot	What is your occupation?	chitchat
user	sales/marketing	chitchat
bot	Which are your favourite movies?	recommendation
user	I love Speed	recommendation
bot	I suggest you Erin_Brockovich Witness Pocahontas Four_Weddings_and_a_Funeral Anna_and_the_King. Do you like them?	recommendation
user	I hate them	chitchat
bot	What are your favourite directors?	recommendation
user	I like John_Waters	recommendation
bot	What are your favourite actors?	recommendation
user	I like Josef_Sommer Wade_Williams Marg_Helgenberger Jeroen_Krabbè	recommendation
bot	What genres do you like?	recommendation
user	My favourite genres are teen_film romance_film biographical_film	recommendation
bot	I suggest you Ever_After Hairspray. Do you like them?	recommendation
user	I like them	chitchat
bot	I am glad that you like them	chitchat

3.2 Supervised learning setting

In the SL stage, the CEI model was trained on a dataset obtained by removing the utterances corresponding to the “refine” step from the generated dialogues, because the model should replicate the dialogues while ignoring the additional “refine” steps. Intuitively, we want the model to avoid learning from incorrect dialogue turns.

The effectiveness of the model was evaluated against two baselines: Random, a hierarchical agent which generates random goals and utterances composed of random tokens 4, and Seq2seq, an agent which generates a response using the popular Sequence-to-sequence framework trained through SL on the generated dialogues. The comparison with Random was performed in order to verify that the designed agent is able to learn to replicate the training dialogues, while the comparison with Seq2seq was designed to assess the effectiveness of the proposed architecture with respect to a classical Sequence-to-sequence model, which has been adopted in different conversational agents [38].

Implementation details. In the current implementation of the Seq2seq model, all the weights were initialized from a normal distribution N(0, 0.05). The model employed a bidirectional RNN with GRU units to encode the user utterance and applied a dropout of 0.5 on the RNN encoder input and output connections [45]. The word embedding size d_w was fixed to 50 and the GRU output size was fixed to 256. The model was trained using the Adam optimizer [14] with a learning rate of 0.001, applying gradient clipping with a maximum L₂ norm value of 5, as suggested in [26]. An L₂ regularization term weighted by a constant value of 0.0001 was applied to the word embedding matrix W_w. The training procedure used mini-batches of size 32 for both datasets, and we applied an early stopping procedure based on the loss function value on the validation set [3]. In particular, we stopped the training of the model when the validation loss was higher than the lowest value obtained so far for 5 consecutive times; otherwise, we interrupted the training at epoch 30. This work used the official TensorFlow implementation of the Seq2seq model 5.
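The early stopping rule described above can be sketched as follows. This is a minimal illustration, not the authors' training loop: the function name `epochs_to_run` is hypothetical, the validation losses are assumed to be available as a list with one value per epoch, and a loss equal to the best value is assumed to reset the counter, since the text only mentions losses that are higher than the lowest one.

```python
def epochs_to_run(validation_losses, patience=5, max_epochs=30):
    """Return the epoch at which training stops under the early stopping rule.

    Training stops when the validation loss has been higher than the lowest
    value seen so far for `patience` consecutive epochs, or at `max_epochs`.
    """
    best = float("inf")
    bad_epochs = 0
    for epoch, loss in enumerate(validation_losses[:max_epochs], start=1):
        if loss > best:
            bad_epochs += 1          # no improvement over the best loss
            if bad_epochs >= patience:
                return epoch         # patience exhausted: stop here
        else:
            best = loss              # new (or equal) best value
            bad_epochs = 0
    # Patience never exhausted: stop at the epoch cap (or when the data ends).
    return min(len(validation_losses), max_epochs)
```

In an actual training loop the same counter would be updated after each validation pass, typically restoring the parameters saved at the best epoch.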

As regards the CEI model parameters, we fixed the embedding sizes d_w, d_u and d_g to 50. In addition, due to the complexity of the model, we applied a dropout of 0.5 on all the GRU cell input and output connections, a dropout of 0.2 on the IMNAMAP search gates and a dropout of 0.5 on all the hidden layers of the fully connected neural networks employed in the architecture. The batch size was fixed to 32 and we applied the same early stopping procedure employed for the Seq2seq model. The training procedure used the Adam optimizer with a learning rate of 0.0001 and we decided, due to the different dataset sizes, to apply a lower L₂ regularization weight α equal to 0.0001 on the ML1M dataset, whereas we fixed α to 0.0001 on MT. To stabilize the training procedure and prevent the model from converging to poor local minima, we applied gradient clipping with a maximum L₂ norm value of 5. We used the same procedure adopted in [11] to manage all the facts related to the given user query. In particular, the search engine returns at most the top 20 relevant facts for the user query. All the tokens which compose the knowledge base facts are stored in the vocabulary V, together with all the other tokens which belong to the dataset. The conversations in the dataset were tokenized using the NLTK default tokenizer 6. The model was implemented using the TensorFlow framework [1].

Results. In order to assess the appropriateness of the meta-controller and controller implementation, we evaluated both the ability of the meta-controller to select specific goals and the ability of the controller to generate personalized responses. To assess the effectiveness of the meta-controller, we evaluated the Precision of the trained model on the test set, using the goal associated with each turn of the conversation. For the controller, we used two different kinds of metrics to assess the effectiveness of the suggestions proposed to a specific user and the quality of the generated system responses from a “linguistic” perspective. Specifically, we used the F₁-measure, which quantifies the quality of the list of suggestions with respect to the ground truth suggestions in the corresponding test set conversation. Moreover, a per-user F₁-measure was computed by comparing all the unique suggestions proposed by the trained model in the test set dialogues with all the positively rated items of a given user in the test conversations. The system responses were evaluated using the BLEU [25] measure, by comparing the generated sequences with the corresponding ones in the test set. In particular, we used the sentence-level BLEU with the smoothing function presented in [17], whose effectiveness has been confirmed in [6].
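The two suggestion-oriented measures can be illustrated with a short sketch. It assumes that suggestions and ground truth items are compared as sets of entity tokens; the function names `f1_score` and `per_user_f1` are hypothetical and the exact matching rules of the original evaluation scripts may differ.

```python
def f1_score(suggested, relevant):
    """F1 between a list of suggested items and the ground truth items."""
    suggested, relevant = set(suggested), set(relevant)
    hits = len(suggested & relevant)
    if hits == 0:
        return 0.0
    precision = hits / len(suggested)
    recall = hits / len(relevant)
    return 2 * precision * recall / (precision + recall)


def per_user_f1(dialogue_suggestions, liked_items):
    """F1 over the unique suggestions made across all test dialogues of a user."""
    unique = set()
    for suggestions in dialogue_suggestions:
        unique.update(suggestions)
    return f1_score(unique, liked_items)


# One dialogue: two of three suggestions are in the ground truth.
single = f1_score(["Dune", "Videodrome", "Alien"], ["Dune", "Videodrome"])
# Two dialogues of the same user, compared against the items the user liked.
per_user = per_user_f1([["Dune"], ["Dune", "Alien"]], ["Dune", "Videodrome"])
```

The per-user variant pools suggestions before scoring, so an item proposed in several dialogues is counted once.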

Firstly, we evaluated different configurations of the CEI model in order to find the best configuration on each dataset. In this evaluation, we tried different values for a limited set of parameters while all the others were fixed to the above-mentioned values. In particular, the GRU output size takes a value in the set {128, 256}, the inference GRU output size s in the IMNAMAP model takes a value in the set {128, 256}, the output representation size q of the dialogue state RNN takes a value in the set {256, 512} and the hidden layer size r in the controller RNN decoder takes a value in the set {1024, 2048}.
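The search space described above can be enumerated as in the following sketch. It assumes an exhaustive grid over the four parameters, which the text does not state explicitly, and the dictionary keys are illustrative names rather than identifiers from the original implementation.

```python
from itertools import product

# Candidate values for the four tuned parameters (all others stay fixed).
search_space = {
    "gru_output_size": [128, 256],
    "inference_gru_output_size_s": [128, 256],
    "dialogue_state_size_q": [256, 512],
    "decoder_hidden_size_r": [1024, 2048],
}

# Cartesian product of the candidate values: one dict per configuration.
configurations = [
    dict(zip(search_space, values))
    for values in product(*search_space.values())
]
```

Under this assumption the grid contains 2 × 2 × 2 × 2 = 16 configurations per dataset.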

Thanks to the preliminary comparisons, we selected the best configuration of the CEI model on the two datasets, that is, the one with the highest average score across all the evaluated measures (in the tables of results the best configurations are highlighted in bold). After that, we compared the best CEI model with the proposed baselines. Table 3 shows the experimental comparison between the best-performing configuration and the baselines according to the evaluation measures. An NA value in the table is associated with a configuration for which, due to its implementation constraints, it is not possible to evaluate the measure (e.g., the Seq2seq model is not equipped with a meta-controller, so its MC precision cannot be evaluated). Table 4 shows a real conversation between the CEI agent and a simulated user whose preferences are taken from the test set.

Table 3

Evaluation between the best performing CEI model and all the baselines on MT and ML1M (MC: meta-controller, C: controller)

Configuration	MC precision (MT)	MC precision (ML1M)	C BLEU (MT)	C BLEU (ML1M)	C F1 (MT)	C F1 (ML1M)	C per-user F1 (MT)	C per-user F1 (ML1M)
Random	0.332	0.333	0	0.001	0	0	0	0.002
Seq2seq	NA	NA	0.791	0.839	0.022	0.01	0.044	0.029
CEI SL	1.0	1.0	0.784	0.851	0.110	0.07	0.170	0.108

Table 4

Real conversation between the CEI agent and a simulated user whose preferences are taken from the test set. The user greets the bot and the bot asks for some demographic information, such as name and age. After that, the agent correctly generates questions about preferences that are relevant for proposing appealing suggestions to the user. Finally, exploiting the collected information about the user preferences, it generates a list of suggestions containing “The Shawshank Redemption” (wd_q172241) and “Dead Calm” (wd_q845077), both of which have “film adaptation” as genre (see the corresponding Wikidata pages for more information). Interestingly, the preferred cast member “Kathleen Turner” is not among the cast members of the suggested movies. We hypothesize that the model learned a latent relationship between the current user preferences and the similarity between the cast members of the suggested movies and the actress liked by the user.

Agent	Utterance	Goal
user	Hey	chitchat
bot	What is your name?	chitchat
user	My name is Gabriele	chitchat
bot	How old are you?	chitchat
user	I am 38	chitchat
bot	What are your favourite actors?	chitchat
user	I like Kathleen_Turner	chitchat
bot	What genres do you like?	recommendation
user	My favourite genre is film_adaptation	recommendation
bot	I suggest you The_Shawshank_Redemption and Dead_Calm. Do you like them?	recommendation
user	I like them	chitchat
bot	I am glad that you like them	chitchat

In the final experimental evaluation, we observe that all the measures related to the effectiveness of the suggestions are considerably higher for CEI than for Random and Seq2seq. In particular, on both the MT and the ML1M dataset, there is a marked difference in terms of F₁ and per-user F₁ between CEI and the baselines. We attribute this difference to the ability of the intra-dialogue attention mechanism to propagate the attention scores collected during the conversation to the controller, which exploits them in the response generation procedure. It is worth noting that on the MT dataset the BLEU score of the Seq2seq model is higher than that of our model. A possible explanation is that the baseline focuses on the generation of language and completely ignores the generation of suggestions, as shown by its lower F₁ measure. In addition, an RNN easily overfits on the simple templates adopted in the generated system responses. The results show that our architecture provides better recommendations than the baselines and produces meaningful dialogues comparable to those of the Seq2seq approach.

3.3 Reinforcement learning setting

In this section we discuss the evaluation of the Reinforcement Learning (RL) stage.

Implementation details. For the RL stage, we designed an OpenAI Gym [5] environment called Recommender System Environment (RSEnv), which lets a hierarchical agent interact with a simulated user according to his/her preferences extracted from the dialogue datasets. For each user, we built the user profile by composing the information collected during the dialogue generation procedure concerning his/her name, age, occupation and preferences. For each simulated dialogue, the environment selects a random user that will interact with the bot. The environment checks for the presence of particular keywords in the agent utterance and generates the appropriate answer according to a template by exploiting the user profile. When the agent generates a sequence of tokens, it receives a reward computed with the BLEU metric so that, in the long term, it learns which tokens should be generated in specific states. The agent keeps replying to the user until it proposes a list of suggestions; at that point, according to the current user preferences, a reward representing the user satisfaction for the proposed suggestions is computed as the mean of the F₁-measure and the BLEU score. Depending on the value of this reward signal, the user answers with a positive utterance (e.g., “I like them”) or a negative utterance (e.g., “I hate them”). At this moment the system also receives a supplementary reward signal for the meta-controller (the extrinsic reward), which is equal to 50 if the user satisfaction score is over 0.5 and to -5 otherwise. There are some limit cases that the environment must manage in order to support an effective training procedure. In particular, if the system generates a sequence of tokens which does not contain a recognized keyword, a special tag “unknown” is used to notify the agent that the environment is not able to process its utterance. In this case, a reward signal of -5 is returned to the meta-controller and a reward equal to 0 is returned to the controller. Another case that the environment must manage is when the agent asks for information which is not available for the current user. In this case, the environment notifies the agent that it does not know the answer by returning a “nothing” message. Furthermore, to prevent the agent from being trapped in a loop in which the environment always answers with the same utterance, we interrupt the interaction between the agent and the environment after 10 turns. If the system is not able to complete the dialogue within this threshold, it is penalized with a reward of -5 for the meta-controller and receives 0 for the controller.
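The reward scheme of the environment can be summarized with the following sketch. It is not the RSEnv implementation: the function names are hypothetical, and using the same 0.5 threshold to choose between the positive and the negative user reply is an assumption, since the text only states that the reply depends on the reward value.

```python
MAX_TURNS = 10  # the interaction is interrupted after this many turns


def suggestion_rewards(f1, bleu, threshold=0.5):
    """Rewards issued when the agent proposes a list of suggestions.

    Returns (controller_reward, meta_controller_reward, user_reply).
    """
    # User satisfaction: mean of the F1-measure and the BLEU score.
    satisfaction = (f1 + bleu) / 2.0
    satisfied = satisfaction > threshold
    # Extrinsic reward for the meta-controller: 50 if satisfied, -5 otherwise.
    extrinsic = 50.0 if satisfied else -5.0
    # Assumption: the reply template follows the same threshold.
    reply = "I like them" if satisfied else "I hate them"
    return satisfaction, extrinsic, reply


def failure_rewards():
    """Rewards for an "unknown" utterance or a dialogue exceeding MAX_TURNS.

    Returns (controller_reward, meta_controller_reward).
    """
    return 0.0, -5.0
```

The large gap between 50 and -5 makes a successful recommendation dominate the meta-controller return, while the penalties discourage unparseable utterances and endless dialogues.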

For each dataset, we selected only the best configuration from the SL training phase for the subsequent RL training stage. In this way, we can assess quantitatively whether the system is able to improve its performance by learning from its own experiences with users. We used the REINFORCE algorithm for both the meta-controller and the controller in order to fine-tune their policies directly from the experiences collected by the agent. The model trained with the SL procedure is used in the RL training procedure, during which it interacts with the environment by exploiting the meta-controller and the controller policies to collect experiences from which it can learn.

The adopted training procedure starts by loading the pre-trained model obtained from the SL stage and uses it in an RL scenario, in which it selects actions according to a state given by the environment and observes a reward reflecting their quality. The model parameters are the same as the ones used in the SL training procedure, so they are used as they are in the RL setting. The only difference lies in the optimization method employed and in the loss function used to optimize the model parameters. We ran the training process in the RL scenario for a fixed number of experiences e and we used mini-batches of them to update the model parameters. The advantages of this strategy are twofold: first, we obtain a more accurate gradient approximation and, second, we exploit the GPU used in our experiments to compute multiple matrix operations simultaneously. The batch size is fixed to 8 on the MT dataset and to 32 on the ML1M dataset. The entropy regularization weights ɛ_MC and ɛ_C are fixed to 0.01, the number of experiences is fixed to 10,000 for each dataset and we apply a discount factor equal to 0.99 on the meta-controller rewards. In the RL scenario we do not design a joint training procedure because the loss functions of the meta-controller and the controller are updated at different moments, as described in [15]. Moreover, they represent different behaviour strategies that cannot be easily mixed together as they are. We also changed our optimization algorithm from Adam to vanilla SGD because, during our experiments with Adam, we observed catastrophic effects on the effectiveness of the policies due to the aggressive behaviour of the algorithm in the early stages of training. Indeed, in our experience, it disrupts the learned policies, making them useless for our tasks.
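To make the role of the discount factor and of the entropy regularization weight concrete, the following sketch computes a REINFORCE loss for a single episode in plain Python. It is an illustration under stated assumptions rather than the loss used in the experiments: no baseline is subtracted, the loss is averaged over the steps of the episode, and the function names are hypothetical.

```python
import math


def discounted_returns(rewards, gamma=0.99):
    """Return G_t = r_t + gamma * G_{t+1} for every step of an episode."""
    returns, running = [], 0.0
    for reward in reversed(rewards):
        running = reward + gamma * running
        returns.append(running)
    return returns[::-1]


def entropy(probs):
    """Shannon entropy of a discrete distribution (natural logarithm)."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def reinforce_loss(action_probs, actions, rewards, gamma=0.99, entropy_weight=0.01):
    """Episode loss: -log pi(a_t) * G_t minus an entropy bonus, averaged over steps.

    action_probs[t] is the policy distribution at step t and actions[t] is the
    index of the action sampled at that step.
    """
    returns = discounted_returns(rewards, gamma)
    loss = 0.0
    for probs, action, ret in zip(action_probs, actions, returns):
        loss -= math.log(probs[action]) * ret    # policy-gradient term
        loss -= entropy_weight * entropy(probs)  # keeps the policy from collapsing
    return loss / len(actions)


# Meta-controller episode: the extrinsic reward of 50 arrives at the last step.
mc_returns = discounted_returns([0.0, 0.0, 50.0])
```

With a discount factor of 0.99 the final reward is propagated almost unchanged to the earlier goal selections, so the first step of the example still receives a return of about 49.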

Results. As described before, when the RL training stage is completed, we evaluate the performance of the learned model on a test environment which contains user data that the agent has not seen in the SL and RL training stages. The learned model is evaluated with a procedure similar to the one used during the training stage. In particular, we use the profile information of the users in the test set of each dataset and we monitor how the system behaves in terms of two measures collected during the interaction with the environment: MC rewards, which is the mean meta-controller reward obtained in a mini-batch, and C rewards, which is the mean controller reward obtained in a mini-batch. Analyzing the graphs in Fig. 5 and Fig. 7, related to the ML1M dataset, we can observe how the meta-controller and the controller interact with the environment. The performance is stable across batches of examples, suggesting that the model generalizes quite well beyond the training experiences. The same trend is observed in the charts reported in Figs. 4 and 6 for the MT dataset, showing that the agent performance remains reliable. The trends are promising because the agent obtains stable rewards over time despite the different conditions observed in the test environment.

Fig. 4

MC reward on MT test set: each data point represents the average reward for the MC among a given set of experiences that the agent is exposed to during test time in the RL training phase.

Fig. 5

MC reward on ML1M test set: each data point represents the average reward for the MC among a given set of experiences that the agent is exposed to during test time in the RL training phase.

Fig. 6

C reward on MT test set: each data point represents the average reward for the C among a given set of experiences that the agent is exposed to during test time in the RL training phase.

Fig. 7

C reward on ML1M test set: each data point represents the average reward for the C among a given set of experiences that the agent is exposed to during test time in the RL training phase.

4 Related Work

DL models have been applied in different modules of a dialogue system [30]. First of all, they have been used as feature extractors in order to learn a representation of the user utterance which is effective for the final task. In the work presented in [32], Convolutional Neural Networks are exploited to generate a representation of the user utterance, obtained as a composition of multiple low-level word representations (word embeddings) for each supported language. This is a way to incorporate prior knowledge learned from huge textual corpora, so as to obtain meaningful word representations and then fine-tune them for the task at hand. An interesting hybrid dialogue system that combines DL models with rule-based components is described in [39]. Its architecture is composed of four main components: 1) an RNN, 2) domain-specific software, 3) domain-specific action templates, and 4) an entity extraction module. The RNN is used to extract a representation of the user utterance. The entity extraction module is responsible for extracting mentions of entities using string matching with database keys. An entity tracker returns entity-specific information hand-crafted by the user. Action templates are scored using a NN and the resulting probability score vector is multiplied by an action mask that prevents the execution of specific actions. If the selected action contains placeholders, they are replaced by an entity output module; if it is an API call, it is directly invoked. The action taken is given as input to the next step because it may be a relevant feature for the dialogue. The training procedure is composed of two SL stages and one final RL phase in which the REINFORCE algorithm [41] is used to optimise the return of the given dialogue. Its effectiveness has been validated in a real-world scenario and an interactive tool has been developed for it [40].

Dialogue response generation can be framed as a Structured Prediction problem in which, given the user input utterance (source), the system learns to generate a response (target). However, as argued in [38], treating a dialogue as a simple input/output mapping, as is done in Machine Translation, is a limitation because it completely disregards the dialogue structure.
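The action-mask step of the hybrid architecture of [39] discussed above can be illustrated with a few lines of Python. The element-wise multiplication follows the description in the text; the renormalization of the masked vector and the function name `apply_action_mask` are assumptions made for this sketch.

```python
def apply_action_mask(probabilities, mask):
    """Zero out forbidden actions and renormalize the remaining probabilities."""
    # mask[i] is 1 when action i is allowed in the current state, 0 otherwise.
    masked = [p * m for p, m in zip(probabilities, mask)]
    total = sum(masked)
    if total == 0.0:
        raise ValueError("the mask forbids every action")
    return [p / total for p in masked]


# Three action templates; the second one is not executable in this state.
scores = apply_action_mask([0.5, 0.3, 0.2], [1, 0, 1])
best_action = max(range(len(scores)), key=scores.__getitem__)
```

Masking encodes simple domain rules outside the network, so the learned policy never has to spend probability mass on actions that cannot be executed.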

In the literature, CRSs have been classified under different names according to the strategy adopted to extract relevant information about the user and provide recommendations to him/her. Case-based Critique systems find cases similar to the user profile and elicit a critique for refining the user’s interests [23, 29] through an iterative process in which the system generates recommendations in a ranked list and allows the user to critique them. The critique forces the system to re-evaluate its recommendations according to the specified constraints.

A limitation of classical CRSs is that their strategy is typically hard-coded in advance: at each stage, the system executes a fixed, pre-determined action, even though other actions could also be available for execution. This design choice negatively affects the flexibility of the system in supporting conversation scenarios which were not expected by the designer. In fact, despite the effectiveness of these systems in different complex scenarios, as reported in [23], they always follow a predefined sequence of actions without adapting to the user requirements. Moreover, they typically use a hand-crafted representation of the items, which is highly labour-intensive for complex domains like music, movies and travel. For instance, the work presented in [4] exploits a first-order logic representation of the items, together with an attribute closure operator able to refine the attribute set of the items considered in the current conversation.

To overcome these limitations, the work first presented in [18] and then extended in [19–21] proposes a new type of CRS that, by interacting with users and applying RL techniques, is able to autonomously improve an initial default strategy and eventually learn and employ a better one. It was first validated in off-line experiments to understand which state variables were required by the system, and then it was evaluated in an online setting in which the system had the task of supporting travellers. Another relevant work is the one presented in [8], which exploits a probabilistic latent factor model where the observed likes/dislikes of users on items are generated on the basis of latent variables. The model variables are learned so that the model can explain the observed training data. In this way, the model is able to provide suggestions to the user by asking each question with the goal of eliminating or confirming strong candidates.

5 Conclusions and Future Work

CEI is a framework for a goal-oriented conversational agent whose objective is to provide a list of suggestions according to the user preferences. Our intuition is that a dialogue can be subdivided into fine-grained goals whose achievement allows the agent to successfully complete the conversation with the user. The framework exploits a combination of Deep Learning and Hierarchical Reinforcement Learning techniques and is trained using a two-stage procedure. Moreover, we enrich the item descriptions by leveraging information coming from the LOD cloud. Experimental results show that our approach is able both to provide relevant recommendations and to produce meaningful dialogues.

With regard to the answer generation module, we plan to investigate whether the current implementation is well suited to take into account the score vector weights generated by the intra-dialogue attention mechanism, in order to exploit them effectively in the recommendation phase. In addition, we think that allowing the model to leverage multiple information sources can enable it to capture different views of the items, which can be used to provide more accurate suggestions according to the user preferences.

Footnotes

Wikidata identifier:

Implementation provided by the sklearn toolkit:

Sampled according to a uniform distribution.

NLTK word tokenizer documentation:

References

1. Abadi, Agarwal, Barham, Brevdo, Chen, Citro, Corrado G.S., Davis, Dean, Devin, et al., TensorFlow: Large-scale machine learning on heterogeneous distributed systems, arXiv preprint arXiv:1603.04467 (2016).

2. Bellman, A Markovian decision process, Technical report, DTIC Document, 1957.

3. Bengio, Practical recommendations for gradient-based training of deep architectures, In Neural Networks: Tricks of the Trade, Springer, 2012, pp. 437–478.

4. Benito-Picazo, Enciso, Rossi and Guevara, Conversational recommendation to avoid the cold-start problem, In Proceedings of the 16th International Conference on Computational and Mathematical Methods in Science and Engineering, CMMSE 2016, 2016.

5. Brockman, Cheung, Pettersson, Schneider, Schulman, Tang and Zaremba, OpenAI Gym, 2016.

6. Chen and Cherry, A systematic comparison of smoothing techniques for sentence-level BLEU, ACL 2014 (2014), 362.

7. Cho, Merriënboer B.V., Gulcehre, Bahdanau, Bougares, Schwenk and Bengio, Learning phrase representations using RNN encoder-decoder for statistical machine translation, arXiv preprint arXiv:1406.1078 (2014).

8. Christakopoulou, Radlinski and Hofmann, Towards conversational recommender systems, In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD ’16, New York, NY, USA, 2016, ACM, pp. 815–824.

9. Dooms, Pessemier T.D. and Martens, MovieTweetings: a movie rating dataset collected from Twitter, In Workshop on Crowdsourcing and Human Computation for Recommender Systems, CrowdRec at RecSys, volume 2013, 2013, p. 43.

10. Garavaglia and Sharma, A smart guide to dummy variables: Four applications and a macro, In Proceedings of the Northeast SAS Users Group Conference, 1998, p. 43.

11. Greco, Suglia, Basile, Rossiello and Semeraro, Iterative multi-document neural attention for multiple answer prediction, In Proceedings of the AI*IA Workshop on Deep Understanding and Reasoning: A Challenge for Next-generation Intelligent Agents 2016 co-located with 15th International Conference of the Italian Association for Artificial Intelligence (AIxIA 2016), Genova, Italy, 2016, pp. 19–29.

12. Harper F.M. and Konstan J.A., The MovieLens datasets: History and context, ACM Transactions on Interactive Intelligent Systems (TiiS) 5(4) (2016), 19.

13. Heath and Bizer, Linked data: Evolving the web into a global data space, Synthesis Lectures on the Semantic Web: Theory and Technology 1(1) (2011), 1–136.

14. Kingma and Ba, Adam: A method for stochastic optimization, arXiv preprint arXiv:1412.6980 (2014).

15. Kulkarni T.D., Narasimhan, Saeedi and Tenenbaum, Hierarchical deep reinforcement learning: Integrating temporal abstraction and intrinsic motivation, In Advances in Neural Information Processing Systems, 2016, pp. 3675–3683.

16. Li, Galley, Brockett, Spithourakis G.P., Gao and Dolan, A persona-based neural conversation model, arXiv preprint arXiv:1603.06155 (2016).

17. Lin C.-Y. and Och F.J., Automatic evaluation of machine translation quality using longest common subsequence and skip-bigram statistics, In Proceedings of the 42nd Annual Meeting on Association for Computational Linguistics, Association for Computational Linguistics, 2004, p. 605.

18. Mahmood and Ricci, Learning and adaptivity in interactive recommender systems, In Proceedings of the Ninth International Conference on Electronic Commerce, ACM, 2007, pp. 75–84.

19. Mahmood and Ricci, Adapting the interaction state model in conversational recommender systems, In Proceedings of the 10th International Conference on Electronic Commerce, ACM, 2008, p. 33.

20. Mahmood and Ricci, Improving recommender systems with adaptive conversational strategies, In Proceedings of the 20th ACM Conference on Hypertext and Hypermedia, ACM, 2009, pp. 73–82.

21. Mahmood, Ricci, Venturini and Höpken, Adaptive recommender systems for travel planning, Information and Communication Technologies in Tourism 2008, 2008, pp. 1–11.

22. Maisto, Donnarumma and Pezzulo, Divide et impera: subgoaling reduces the complexity of probabilistic inference and problem solving, Journal of The Royal Society Interface 12(104) (2015), 20141335.

23. McGinty L. and Smyth, Deep dialogue vs casual conversation in recommender systems, (2002).

24. Ostuni V.C., Noia T.D., Sciascio E.D. and Mirizzi, Top-n recommendations from implicit feedback leveraging linked open data, In Proceedings of the 7th ACM Conference on Recommender Systems, RecSys ’13, New York, NY, USA, 2013, ACM, pp. 85–92.

25. Papineni, Roukos, Ward and Zhu W.-J., BLEU: A method for automatic evaluation of machine translation, In Proceedings of the 40th Annual Meeting on Association for Computational Linguistics, Association for Computational Linguistics, 2002, pp. 311–318.

26. Pascanu, Mikolov and Bengio, On the difficulty of training recurrent neural networks, ICML (3) 28 (2013), 1310–1318.

27. Quinlan J.R., Induction of decision trees, Machine Learning 1(1) (1986), 81–106.

28. Reschke, Vogel and Jurafsky, Generating recommendation dialogs by extracting information from user reviews, In Proceedings of the 51st Annual Meeting of the Association for Computational Linguistics, ACL 2013, Sofia, Bulgaria, Short Papers, Volume 2, 2013, pp. 499–504.

29. Ricci and Nguyen Q.N., Acquiring and revising preferences in a critique-based mobile recommender system, IEEE Intelligent Systems 22(3) (2007), 22–29.

30. Rieser and Lemon, Reinforcement learning for adaptive dialogue systems: A data-driven methodology for dialogue management and natural language generation, Springer Science & Business Media (2011).

31. Rubens, Kaplan and Sugiyama, Active learning in recommender systems, In Recommender Systems Handbook, Springer, 2015, pp. 809–846.

32. Shi, Ushio, Endo, Yamagami and Horii, A multichannel convolutional neural network for cross-language dialog state tracking, In 2016 IEEE Workshop on Spoken Language Technology, SLT 2016 - Proceedings, 2017, pp. 559–564.

33. Silver, Huang, Maddison C.J., Guez, Sifre, Driessche G.V.D., Schrittwieser, Antonoglou, Panneershelvam, Lanctot, et al., Mastering the game of Go with deep neural networks and tree search, Nature 529(7587) (2016), 484–489.

34. Sordoni, Bachman, Trischler and Bengio, Iterative alternating neural attention for machine reading, arXiv preprint arXiv:1606.02245 (2016).

35. Suglia, Greco, Basile, Semeraro and Caputo, An automatic procedure for generating datasets for conversational recommender systems, In Proceedings of Dynamic Search for Complex Tasks - 8th International Conference of the CLEF Association, CLEF, 2017.

36. Sutskever, Vinyals and Le Q.V., Sequence to sequence learning with neural networks, In Advances in Neural Information Processing Systems, 2014, pp. 3104–3112.

37. Sutton R.S., Precup and Singh, Between MDPs and semi-MDPs: A framework for temporal abstraction in reinforcement learning, Artificial Intelligence 112(1) (1999), 181–211.

38. Vinyals and Le, A neural conversational model, arXiv preprint arXiv:1506.05869 (2015).

39. Williams J.D., Asadi and Zweig, Hybrid code networks: practical and efficient end-to-end dialog control with supervised and reinforcement learning, In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), volume 1, 2017, pp. 665–677.

40. Williams J.D. and Liden, Demonstration of interactive teaching for end-to-end dialog control with hybrid code networks, In Proceedings of the 18th Annual SIGdial Meeting on Discourse and Dialogue, 2017, pp. 82–85.

41. Williams R.J., Simple statistical gradient-following algorithms for connectionist reinforcement learning, Machine Learning 8(3-4) (1992), 229–256.

42. Williams R.J. and Peng, Function optimization using connectionist reinforcement learning algorithms, Connection Science 3(3) (1991), 241–268.

43. Wu, Schuster, Chen, Le Q.V., Norouzi, Macherey, Krikun, Cao, Gao, Macherey, et al., Google’s neural machine translation system: Bridging the gap between human and machine translation, arXiv preprint arXiv:1609.08144 (2016).

44. Zaremba and Sutskever, Reinforcement learning neural Turing machines - revised, arXiv preprint arXiv:1505.00521 (2015).

45. Zaremba, Sutskever and Vinyals, Recurrent neural network regularization, arXiv preprint arXiv:1409.2329 (2014).