Critic learning in multi agent credit assignment problem

Abstract

Multi-agent systems can be seen as an apparatus for testing the performance of real distributed systems. One problem encountered in multi-agent systems with the learning capability is credit assignment. This paper presents two methods for solving this problem. The first method assigns credit to the agents according to the history of the interaction while the second method assigns credit to the agents according to the knowledge of agents, and thus the shares of the agents are extracted from the feedback of the environment. The computer experiments show that critic learning has a positive impact in credit assignment problem.

Keywords

Multi-agent systems credit assignment reinforcement learning interaction history knowledge

1 Introduction

The idea of interactive learning first comes from psychological research, in which Skinner tries to condition a living entity with six means: reward (positive or negative), punishment (positive or negative), shaping, extinction, discrimination, and generalization [1]. In artificial intelligence, the term reinforcement learning (as a type of interactive learning) was used as an indication of a group of machine learning methods which use evaluative feedback (or reinforcement) to guide agents through the environment [2]. As the sole guide is the feedback, it is very important that when and how much feedback is given and which part of the agent’s knowledge will be updated based on the feedback. Single agent reinforcement learning can be used in multi-agent systems, the feedback received by the group is divided to some parts, and each agent performs reinforcement learning with its own share. The one who distributes the feedback is called the critic.

The credit assignment problem (CA) is investigated in four main areas: structural, temporal, multi-agent, and social credit assignment. The structural CA in single-agent domains looks for which part of the agent’s knowledge should be updated by the received feedback [3]. The temporal CA in single-agent domains builds on top of the problem called delayed reward [4, 5]. Delayed reward problem occurs when a feedback is given to an action of the agent after the time of performing that action, and how we understand which action (or set of actions) deserves that feedback. The multi-agent CA as its name implies, refers to the distribution of credit among a group of cooperating agents, a team of agents, but want to have individual learning during teamwork (and may also have individual learning of the task before joining the team) [6]. In the social CA [7], the attribution theory [8] from psychology is used to consider social concerns in determining the amount of the credit/blame to be assigned to the agents. In each of the challenges noted above, the purpose is to guide the agents in achieving the goal. Other methods such as actor-critic [9], Q-learning [5], and SARSA [10] are also proposed for the problem of temporal credit assignment.

Several studies have been carried out in multi-agent credit assignment. In knowledge-based CA [11], some criteria are proposed to evaluate the knowledge of agents, and based on the quantification of the knowledge; agents with knowledge below a certain threshold are punished, and above the thresholds are rewarded. The method proposed in [11], categorizes the agents into three groups: those who rewarded, those who neither rewarded nor punished, and those who punished. In structural credit assignment, some solutions are proposed based on evolutionary computing methods like neural networks and genetic algorithms [12 , 14], and others try to map structural CA to temporal CA and vice versa [3]. In this paper, we propose two methods for solving the problem of multi-agent credit assignment. The first method saves the history of interaction of the agents with the environment and based on this interaction concludes which action in the state deserves feedback, while in the second one; the critic agent tries to evaluate agent’s knowledge and gives their share based on their knowledge. These methods are compared together according to different criteria and the reasons for strengths and weaknesses in different experiments are investigated.

The rest of this paper is organized as follows: in Section 2, the definition of the multi-agent CA problem is given and in Section 3, we briefly review the related work. In Section 4, the history-based credit assignment method is presented and the knowledge-based method is introduced in Section 5. Section 6 presents the experimental results and Section 7 concludes the paper.

2 Multi-agent credit assignment problem

In this section, we give the notations that will be used throughout the paper, and also the formal definition of multi-agent CA problem. In the rest of the paper, we use the following notations. The number of agents will be specified by N, n shows the current agent, and S denotes the number of states. Each agent will select its action from the allowable action set A. The state of agent i at time t denoted by $s_{t}^{i}$ and the current action selected by this agent at state $s_{t}^{i}$ shown by $a_{t}^{i}$ . The feedback to the selected action $a_{t}^{i}$ by the agent i at time t represented by $f_{t}^{i}$ and f_t is the total feedback received at time t by the team of the agents.

2.1 Problem definition

Figure 1 shows the interaction of two main actors of the problem: an environment and a group of agents in the presence of a special agent, named critic. Critic can be either external or internal. By external critic, we mean that a distinct agent (in the group of agents) exists who is equipped with the wisdom of credit distribution. In the internal critic model, every agent has a kind of knowledge in his mind that helps him to distribute the received feedback to some part of his knowledge, or gain his share from the feedback received by a group of agents.

There exists a group of agents cooperating with each other in order to reach a goal. In the reinforcement learning model, a feedback is returned from the environment to give a clue of the rate of satisfaction of the environment from the action selected by the agent group. The feedback can be the consequence of the previous actions carried out by the agents meaning that there is a time interval between the action that is performed and its consequence in the form of a feedback. If each individual in the group of agents wants to use his own share of the reinforcement received by the environment, there is a need for someone to convert the scalar feedback (reinforcement) f_t to a vector of feedbacks $[f_{t}^{1}, f_{t}^{2}, . . . f_{t}^{N}]$ , where N is the number of agents and $f_{t}^{i}$ is intended to be given to the i^th agent, and with that, each member of the team can pursue his own individual learning, independent of the others. The critic is responsible for feedback distribution and generally has no information about the task assigned to the group of agents, and also about the duty of each agent in the team; the only thing available to him is the vector of actions selected by agents $[a_{t}^{1}, a_{t}^{2}, . . . a_{t}^{N}]$ . The critic can improve the quality of credit assignment over time, this is what we mean as “critic learning”. In the process of learning, either the critic learn the task of the agents, or it can learn the assignment, without learning the task.

The team of agents may be homogeneous or heterogeneous. The vector $[a_{t}^{1}, a_{t}^{2}, . . . a_{t}^{N}]$ may not be the same as the one received by the environment. There may be some kind of noise which converts the vector to the $[α_{t}^{1}, α_{t}^{2}, . . ., α_{t}^{N}]$ , so the environment does not receive the real set of actions and the feedback is the result of a noisy set.

3 Related works

The CA problem was first seen in the notes from Minsky [15], who defined it as: “The credit assignment problem concerns determining how the success of a system’s overall performance is due to the various contributions of system’s components.” It was simply a method for credit/blame sharing [16]. Weis classified the CA problem into two main categories: inter-agent CA and intra-agent credit assignment [17]. Inter-agent is the assignment of credit for an overall performance change to the external actions of the agents. Intra-agent is the assignment of credit for a particular external action of an agent to its underlying internal inference and decisions. Most of the methods proposed in the literature for solving the CA problem can be classified into four main groups: temporal, multi-agent, structural, and social credit assignment. Figure 2 shows the classification of these methods. There are also some CA methods that do not belong to the above four groups. These methods are classified as other methods.

In temporal credit assignment, components in the Minsky definition are time and action. The issue is accompanied by a famous problem of delayed reward [5], means that an agent acts in an environment and does not immediately receive the consequence of its actions. When the feedback is received, the agent typically does not know which of his actions deserves that. Several methods were proposed to solve the above mentioned problem. The first idea was given in [4]. The algorithms were based on learning automata and mathematical learning theory, and the tasks were selected to include, first in separation and then in group, the issue of misleading generalizations, delayed reinforcement, and so on.

Three reinforcement learning models that address temporal CA are based on actor-critic [9], Q-learning [5], and SARSA [10]. Eligibility traces are one of the basic methods of reinforcement learning [10, 4] that keep the history of the occurrence of an event, such as the visiting a state or choosing an action. The trace marks the memory parameters associated with an event as eligible for undergoing updates in the learning algorithm. When a TD error occurs, only the eligible states or actions are used to generate the feedback for the error. Thus, eligibility traces help to fill the gap between events and training information. The eligibility traces versions of SARSA (0) and one-step Q-Learning are called SARSA (λ) and Q (λ), respectively. Besides, first visit profit sharing was proposed as a solution of CA problem in the partially observable environments [18].

In [19], a method was given for solving temporal CA problem when the feedback is a sequence of decisions. To explore this problem, the event-related potentials (ERPs) were recorded as participants performed a sequential decision task.

Structural CA puts forth the question of which part of the agent’s knowledge deserves credit. In nonstructural methods, a kind of hierarchy was discovered in a task, and the problem was solved with that hierarchy [20, 21]. Attention-gated reinforcement learning (AGREL) resolved both the all-knowing teacher problem and the structural CA problem [22]. For learning in large state spaces, the learning problem can be decomposed to a set of subproblems. One approach to the decomposition is to represent the knowledge as a collection of smaller, more individually manageable pieces. Then, externally verifiable decomposition was used to learn more complex knowledge structures over which structural CA performed during learning [23]. In [24], statistical mechanical analysis was applied to the learning process in a simple neural networks to clarify the effect of temporal and structural credit assignments and their interactions. In [25], progress estimators were proposed to measure the progress toward the goal of a given behavior during his execution. In [26], the authors proposed a reinforcement learning mechanism as a model for recurrent choice and extended it for skill learning.

CA problem also exists in methods inspired from nature such as genetic algorithms and neural networks trained with backpropagation method [27]. Some methods tried to assign credit to genes in chromosomes using a multi-agent system such that each agent is responsible for discovering a suitable gene with the help of his own evaluation function [12]. CA issue was also raised in several applications such as learning a context free grammar from a sample of sentences using genetic algorithms [28], and some methods try to use a variation of Q-learning for multi-agent systems [29]. The CA problem was also discussed in political issues [30], and reviewed in the fields of economics, high energy physics, and information science [31]. Nonlinear CA in reinforcement learning is also used for music generation [32].

In addition, in neural network applications, different CA strategies were investigated in a two-level co-evolutionary model which involves a population of Gaussian neurons and a population of radial basis function networks [13, 14]. In evolutionary algorithms [33], a new way was proposed to assign credit to search operators, and thus, an adaptive evolutionary algorithm was developed.

A fuzzy CMAC (cerebellar model articulation controller) neural network model based on the CA concept was given in [34, 35]. In [36], the structure of the CMAC network was incorporated into self-organizing map (SOM) to construct a self-organizing CMAC (SOCMAC) network, and based on gray relational analysis; a CA technique for SOCMAC learning was introduced to hasten the overall learning process.

In social credit assignment, the CA was viewed socially. In [7], a computational model was introduced based on the psychological attribution theory. With this computational model the social inference in human-centric societies was modelled and the human behavior was predicted. With the aid of this model, attribution variables, causality, foreseeability, intention, and coercion are also modelled. Causality is the connection between actions and effects; foreseeability is the ability of prediction of the consequence of agents; intention is the commitment to work toward a certain act or outcome; and in coercion, the agent may be coerced to act but not coerced to achieve any outcome of the action. In [7], the authors discussed that the proper assignment of the credit in a social setting must not only consider actions and the knowledge state of different agents, but also utilize information available to reason about key attributions that contribute to the judgement process.

In [37] a method was proposed in which in addition to local rewards, social reinforcement was also added. This method considered the case when an agent observes other agents and attempts to imitate their behavior, and vicarious rewards were given when other agents received rewards.

Multi agent systems are another area where CA problem appears. In these systems, the components in Minsky definition are agents working with each other and the result is portrayed to the environment. The received feedback is based on the collective behavior of the team, not individual behavior.

One of the methods proposed in the multi-agent domain was knowledge-based credit assignment (KEBCA) [11], distributing the feedback according to the knowledge of agents. The agent’s knowledge was quantified in such a way that they can be compared and sorted. This process was carried out through some measures such as expertness, and certainty. If the knowledge of the agent was below of a threshold, the agent would be penalized, otherwise it would be rewarded. Another suggestion was that the agents would be partitioned into three groups: the first group is penalized, the second neither penalized nor rewarded, and the third rewarded. The author of [38] suggested a novel multi-agent active noise control (ANC) formulation via a slight modification of KEBCA.

Some researches tried to unify the temporal and multi-agent CA with the notion of a time-extended single-agent problem as a multi-agent single-time-step problem [3]. Another term for multi-agent credit assignment is collective intelligence (COIN) [39], the two are almost equivalent. The main results obtained from COIN only concern situations where agents can be organized in sub-groups working on totally independent problems.

In the existence of a large number of agents, the authors of [40] suggested the use of wonderful life utility and claimed that it is better than both local and global rewards. Tangamchit tried to compare averaging rewards and discounting rewards (that was emphasizing immediate rewards over long-time rewards), and concluded that averaging is better than discounting [41]. Fu tried to solve the problem with the aid of policy and action learning using state information [42].

The topic was also discussed in hierarchical reinforcement learning. This was done in the form of a modular reward, which was calculated from the temporal difference of the module gating signal and the value of the succeeding module. Modular reward was implemented for multiple model-based reinforcement learning (MMRL) architecture [43].

In [44], an online optimal control method is proposed for nonzero sum games of continuous-time, which is to some extent similar to the credit assignment problem, and in [45] a similar problem is solved in e-commerce application. Other recent works like methods proposed in [46 , 48] explored credit assignment in domains like deep learning. Some other methods like action abstraction [49] may be used to tackle the CA problem.

In [50], a multi-agent CA is proposed. For the sake of simplicity in presentation, we call this method dynamic critic method. The responsibility of the critic agent is to distribute f_t among the agents such that each agent receives his own part $f_{t}^{i}$ at time slot t and to update the share of agents. In this method, the total reward can be decomposed as a weighted sum of rewards of the agents as given by the following rule (Equation (1)). $f_{t} = \sum_{i = 1}^{N} c_{t}^{i} \times f_{t}^{i},$ (1) with the following constraint (Equation (2)), $\sum_{i = 1}^{N} c_{t}^{i} = 1 .$ (2)

The critic agent updates each coefficient by the following rule (Equation (3)). $c_{t + 1}^{i} = \frac{(1 - α) c_{t}^{i} + α \times f_{t}^{i}}{z},$ (3) where α is the learning rate, $c_{t}^{i}$ is the coefficient used at time slot t for agent i to update his knowledge, and $c_{t + 1}^{i}$ is the coefficient used at time t + 1. In order to satisfy the constraint given by Equation (3), the normalization factor z is used.

4 History-based credit assignment

One problem that a system might encounter is the use of the previous knowledge about the last situations it encounters to the present situation. To the best of our knowledge, there is no method that uses such information. In what follows, we propose a history-based method to utilize the previous knowledge in order for better distribution of credit. In other words, we use an undirected graph, which models the environment, to describe the idea. We define a set of unknown variables to model the environment and when the value of all of them is specified, the model of the environment is complete, and critic can respond to the agent’s actions instead of the interacting with the environment.

First we define interaction between the agents and the environment (Equation (4)), ${Int}_{t} ≜ {[a_{t}^{1}, a_{t}^{2}, . . . a_{t}^{N}], [f_{t}^{1}, f_{t}^{2}, . . . f_{t}^{N}], f_{t}} .$ (4) In interaction at time t denoted by Int_t, $a_{t}^{i}$ is the action selected by agent i, $f_{t}^{i}$ is its corresponding feedback and f_t is the total feedback received from the environment. In this interaction, typically the values of $[f_{t}^{1}, f_{t}^{2}, . . . f_{t}^{N}]$ is unknown, and the purpose of the critic is to estimate their values. We model Int_t as a node in the graph. The number of nodes is finite, since we suppose that the environment is deterministic, i.e. when a set of actions are applied to the environment, the same rewards are given to the agents. Formally, an environment env is deterministic by the following Equation (5),

$\begin{matrix} \det (env) ≜ {\forall t_{1}, t_{2}, [a_{t_{1}}^{1}, a_{t_{1}}^{2}, . . . a_{t_{1}}^{N}] \\ = [a_{t_{2}}^{1}, a_{t_{2}}^{2}, . . . a_{t_{2}}^{N}] \\ \Rightarrow f_{t_{1}} = f_{t_{2}} \land [f_{t_{1}}^{1}, f_{t_{1}}^{2}, . . . f_{t_{1}}^{N}] \\ = [f_{t_{2}}^{1}, f_{t_{2}}^{2}, . . . f_{t_{2}}^{N}]} . \end{matrix}$ (5)

Two nodes in the graph are adjacent if the corresponding actions list differs in only one, i.e. (Equation (6))

$\begin{matrix} adj (In t_{i}, In t_{j}) ≜ {[a_{i}^{1}, a_{i}^{2}, . . . a_{i}^{N}], \\ [a_{j}^{1}, a_{j}^{2}, . . . a_{j}^{N}], \\ \exists k s . t . a_{i}^{k} \neq a_{j}^{k} \land a_{i}^{∖ k} = a_{j}^{∖ k}}, \end{matrix}$ (6) where ∖k means the actions set except the k^th action. When f_{t
₁}, f_{t
₂} are determined, the feedback of actions that cause the adjacency of two nodes specified. The share of agent i is determined using Equation (7), $f_{t}^{i} = {\begin{matrix} f_{t_{1}} - f_{t_{2}} & if f_{t_{2}} < f_{t_{1}} \\ 0 & otherwise \end{matrix}$ (7)

In this way, we try to calculate the feedback of the i^th agent at time t. If the tasks for different agents are not similar, the agents should only use their own previous feedbacks, thus in the above equations, the f_{t
₁} changed to $f_{t_{1}}^{i}$ . Otherwise, other agents’ feedbacks can only be used. This means that when we find two sets of actions (and their corresponding feedback) with only one different action, the difference between two feedbacks, one of them setting to zero, is the feedback for the corresponding action. This way, the two $f_{t}^{i}$ s are determined (one of them is set to zero, and the other one to |f_{t
₁} - f_{t
₂} |).

One reason for setting the smaller value to zero is to preserve the nature of the reinforcement signal. The other reason is the importance of relative values of the feedbacks not their absolute value [10].

In the Fig. 3, we illustrate three main steps of the algorithm (for a two-action problem). In the first step (the one on the left), the two actions of two different states are determined, and since only one difference between actions exists, there is an edge between states. In the second step (the middle of the figure), the feedback for these sets of action are determined, and in the last step (the right of the figure), the values of two feedbacks are determined by the critic (although the feedback of the other action is not determined yet, and needs another possible adjacent interaction with the environment).

The algorithm continues until the values of all $f_{t}^{i}$ s are determined.

The history-based CA tries to keep the track of interactions between the agents and the environment. The information stored for each interaction is the list of actions selected by the agents and the list of thetotal feedbacks received by the team. In order to study the effect of each action on the total feedback, the list of interactions is traversed until two interactions with a single difference (in the same state) appears in the list of feedbacks. In the case of this single action, an approximation of the reward (difference of the two feedbacks) can be used to provide a situation for the agent to learn the value of this action. If the difference of the feedbacks is greater than zero the reward will be given to the first action, otherwise to the second action.

Suppose we maintain a matrix for the the feedbacks that are given by the environment. When the agents are heterogeneous, the feedback matrix is three dimensional, and its size will be N × |S| × |A|, and when the agents are homogeneous, the feedback matrix is two dimensional and the size will be decreased to |S| × |A|. In the homogeneous case, a single Q-table can be maintained for all agents and every update is performed on this table. This will decrease the amount of required memory and increase the speed of convergence of the learning. In this algorithm, we consider the general case of heterogeneous agents. Softmax policy given in Equation (8)), is used to select actions, τ is a parameter that tunes the randomness of the policy, in which high values of it show that exploration is considered, and low values indicate that exploitation is followed. $p (a_{t}^{i} | s_{t}^{i}) = \frac{e^{Q (s_{t}^{i}, a_{t}^{i}) / τ}}{\sum_{a_{.}^{i} \in A} e^{Q (s_{t}^{i}, a_{.}^{i}) / τ}},$ (8) where $Q (s_{t}^{i}, a_{t}^{i})$ are the entries of Q-table.

5 Knowledge measurement with ranking method

The second proposed method is based on the evaluation of the agents’ knowledge, which requires a criterion to quantify knowledge. The next step includes sorting according to that criterion, and the next one is to distribute the feedback according to the sorted list.

In this method, the CA process is divided into two phases, as shown in Fig. 4. The first phase called knowledge extraction in which the knowledge is extracted based on some predefined criteria. In this paper, we used the following criteria to extract knowledge: confidence (introduced in this paper), and six criteria including expertness, certainty, correctness, efficiency, group support ratio, and learning ratio given in [51, 11]. Each criterion will be explained later.

The second phase called the knowledge evaluation. In this phase, an attempt is made to evaluate and compare the knowledge of different agents, and decide how much feedback is given to each agent. In this phase, when a feedback is received, the agents are sorted ascendingly according to their knowledge. The feedback of agent i is calculated using the following Equation (9): $f_{t}^{i_{j}} = \frac{j}{N} f_{t},$ (9) where j is the index of the agent in the sorted list ([agnt₁, agnt₂, . . . , agnt_N]). These two phases are independent of each other, such that the method which is discussed in the next section has the same knowledge extraction phase with this method, and the knowledge evaluation phase is different.

6 Experimental results

In order to evaluate the proposed algorithms, computer experiments are conducted and the results are compared with the results of the algorithm given in [50]. In the next section we will explain the test domain.

6.1 The test domain

History-based and knowledge-based methods are tested in a multi-agent domain with 5 agents, borrowed from [11]. Every agent has the responsibility of adding two digits (0, 1, 2, 3, 4); and the agents cooperate to add two five-digit numbers. So the result of adding two digits is at most 8, and does not result in carry. The action set of each agent is set of nine actions (0, …, 8). The termination condition of the episode is the simultaneous correct answer of all of the agents to their question. The reward is such that every agent who gives the correct answer is given +4 rewards, otherwise they receive zero feedback. For example, the result of addition of two numbers 40321 and 30421 equals to 70742. If the agents declare the answer 77742, the team of agents will receive +16 from the environment.

6.2 Comparison of the three methods

The results of assessment are given in the following subsections using seven tests, in terms of a new criterion introduced as confidence as well as six other criteria (learning ratio, expertness, certainty, efficiency, correctness, group support ratio) introduced in [11, 51].

6.2.1 Test I: Comparison of the methods in terms of learning ratio

Group learning ratio (LR) is defined as the average of the individual learning ratio of the agents given in Equation (12). Equation (11) defines the term from an agent perspective and Equation (12) defines it from a group viewpoint. Individual learning ratio is defined as the ratio of learnt states (defined by Equation (10)) to the number of states; in other words the term |learnt s| is the number of states that are learned; and learning is defined as when the Q-values of a state have a single maximum, and in that state, the corresponding action is selected with the greedy policy. Parameter S is the set of states of the environment and |S| is the number of states that exists for each agent. By averaging it on the agents, Equation (12) gives the learning ratio for the team of agents.

$\begin{matrix} learnt (s_{t}^{i}) : = {\forall a_{t}^{i} s . t . feasible (a_{t}^{i}, s_{t}^{i}) \\ \Rightarrow f_{a_{i}}^{suggested} : = f_{a_{i}}^{real}}, \end{matrix}$ (10) ${LR}_{t}^{i} = \frac{| learnt (s_{t}^{i}) |}{| S |},$ (11) $L R_{t} = (1 / N) \sum_{i = 1}^{N} {LR}_{t}^{i},$ (12) where N is the number of agents. By $feasible (a_{t}^{i}, s_{t}^{i})$ we mean that action $a_{t}^{i}$ is plausible in state $s_{t}^{i}$ . Figure 5 shows the result of comparison of the three methods in terms of learning ratio. This figure shows that the history-based method which starts from the LR value of zero, reached the 95% at around episode 10, and the other two methods reached the value of 100% at around the 100^th episode.

6.2.2 Test II: Comparison of the methods in terms of confidence

The next criterion to be tested is the degree of confidence that the agent has when he is completing his Q-table. When the difference between the largest and the second largest value of the Q-table becomes larger, it seems the environment further insists more on the correctness of the action a. In terms of order statistics, if [q₍₁₎, q₍₂₎, q₍₃₎, . . . , q_(|A|)] be the sorted Q-values (ascending) according to different actions, confidence measure will be quantified as given in Equation (13): $Cnf (s_{t}^{i}) = q_{(| A |)} - q_{(| A | - 1)},$ (13)Figure 6 shows the results of comparison of three methods in terms of confidence. First the one with the ranking method seems to gain the best confidence with the pass of episode. The next one is a history based credit assignment discussed in Section 4, and the last with least confidence is the dynamic critic method explained in [50].

6.2.3 Test III: Comparison of the methods in terms of expertness

Expertness is another criterion to be used for comparison between three CA methods. This term is defined as the number of times the agent receives a reward and thus has selected correct actions ( $N_{r} (s_{t}^{i}, a_{t}^{i})$ ) minus the number of punishments received (wrong experiences of agent) ( $N_{p} (s_{t}^{i}, a_{t}^{i})$ ). Expertness is a history-based criterion, as it deals with a history of agent interactions with the environment (Equation (14)). $Exp (s_{t}^{i}, a_{t}^{i}) = N_{r} (s_{t}^{i}, a_{t}^{i}) - N_{p} (s_{t}^{i}, a_{t}^{i}) .$ (14)

Figure 7 compares three methods in terms of expertness. As it can be observed, after 100 episodes the Ranking method obtains higher expertness. Thus the ranking method acts better with expertness criterion. The dynamic critic method [50] is ranked next, and the last is the history-based method.

6.2.4 Test IV: Comparison of the methods in terms of certainty

The next experiment (Fig. 8) is a comparison of three different CA methods based on certainty criterion. Certainty is defined by Equation (15); and it compares the Q-value of an action in a state with the other Q-values of that state. The comparison is done in an exponential manner so that it exaggerates the difference of different Q-values of the state. This criterion needs a parameter, T, which is determined with Equation (16). When T→ ∞, different actions of state S have nearly the same probability 1/|A| where |A| is the number of actions. When T has a small value, the value of certainty is sensitive to the Q-values. Thus at the beginning of the experiments when the values of Q are not determined yet and thus unreliable, big values of T are used, and as the experiment proceeds, T will be reduced, and the effect of Q becomes more definite. $Cert (s_{t}^{i}, a_{t}^{i}) = \frac{exp (Q (s_{t}^{i}, a_{t}^{i}) / T)}{\sum_{a_{t}^{i} \in A} exp (Q (s_{t}^{i}, a_{t}^{i}) / T)} .$ (15)

Equation (16) determines the value of T for each episode. T₀ is a parameter that is set to 10, T_min is set to 1 in our experiments, and the variable episode determines the episode number which starts from 1. $T = max {\frac{T_{0}}{1 + log (episode)}, T_{min}} .$ (16)

From the certainty viewpoint and with the help of certainty in calculating coefficients of the dynamic critic method, the highest value is for the ranking method which, ignoring spikes, has good progress from 0 to 2. The other two methods do not have good performance.

6.2.5 Test V: Comparison of the methods in terms of efficiency

Efficiency is defined in Equation (17) and intends to count the number of non-zero credit assignments by the critic. The process of CA should be designed in a conservative way, since every reward/punishment to each agent, is directing him toward a goal, and thus misguiding the agent finally leads to missing the goal or taking a longer route towards the goal. Thus, every non-zero CA means that the critic accepts a risk and has a judgement about the action selected by the agents. $Eff = \sum_{i = 1}^{F} I (f_{t}^{i} \neq 0),$ (17) where I (x) is the indicator function that returns 1 if boolean x is true, and 0 when x is false. F is the number of feedbacks.

Figure 9 shows the results of a comparison CA methods from an efficiency viewpoint. The ranking and also dynamic critic method improves efficiency with the same rate, but history-based CA does not have much efficiency. The reason is that the number of exact-zero assignments to the agent is high due to the nature of the algorithm.

6.2.6 Test VI: Comparison of the methods in terms of correctness

Another criterion for a comparison of the CA methods is correctness. Correct assignments can be defined in several ways. A strict definition is based on the sign and magnitude of the feedback. This means that if the sign and magnitude of the feedback are the same as what the environment really wants to give to him, this shows the success of the critic. A more flexible way to determine the correct assignment is to specify a threshold for the difference between the given feedback and real feedback. If the difference is less than the threshold, it will be counted as correct, otherwise as wrong.

The other definition of correct assignment is only the sign (and not magnitude) of the real and given feedbacks. If they have the same sign, we can say that a correct assignment has been accomplished by the critic (Equation (18)). $C o r r = \sum_{i = 1}^{F} I (f_{t}^{i} = f_{t c o r}^{i}),$ (18) where $I (.), F, f_{t}^{i}$ and are the indicator function, the number of feedbacks, and distributed feedback, and the correct feedback, respectively.

Among these three definitions of correctness (Fig. 10), the modest one, that is, the one with the threshold, is selected for these experiments. The most successful one is the ranking method which has the most correct assignments, the next is the dynamic critic method, and the last one is the history-based method.

6.2.7 Test VII: Comparison of the methods in terms of group support ratio

The next measure is group support ratio (GSR) given in Equation (19). In contrast to other criteria that are calculated for single agent, this is a group-based measure. It will consider a window of w last interactions with the environment and computes the number of interactions which lead to the reward. Giving a reward to the team of agents is an indication of successful interaction, which at last ends with a reward, meaning that a goal or subgoal is being reached. For this experiment, w is set to 100. ${GSR}_{t} = \frac{\sum_{t^{'} = t - w}^{t} I (0 < f_{t^{'}})}{w} .$ (19)

The results of experiment is shown in Fig. 11. The ranking method gains the most GSR during different episodes, and finally it reaches the GSR value of 1, at episode 100. The next method according to GSR is dynamic critic, and also reaches the GSR value of 1 at the 100^th episode, but in episodes less than 100 it is lower than the ranking method. The next method, which is the lowest according to GSR, is the history-based method.

6.3 Comparison of the methods in terms of noise tolerance

In this section, we study the effect of the noise on the three methods. As illustrated in last section, we have used learning ratio, confidence, expertness, certainty, efficiency, correctness, and group support ratio, in the next experiments.

The noise is such that 20% of the Q-tables are changed 80% . As the main contribution of this paper is ranking method (Section 5), we brought both the noisy and noiseless results in the corresponding figures, and the noisy versions of the results of two methods are also reported for comparison.

The first comparison performed based on learning ratio. As can be seen, there is no major change on the ranking method, with and without noise. But history-based method reaches faster than other methods to a high degree and stays there; also dynamic critic method moves approximately the same (but a bit higher) than ranking method (Fig. 12).

The next measure is confidence (Fig. 13). Like learning ratio, ranking method tests either with or without noise, moves together, and increases continuously. In history-based method the average correctness does not change after 10^th episode; and, dynamic critic method has a slight increase until 70^th episode.

For the next criterion (Fig. 14), expertness, we can see that ranking method, even in the existence of noise moves in the same rate with the dynamic critic method; and history-based method moves much slower than the two gradient-based methods.

Certainty diagram (Fig. 15) is one of the diagrams with a lot of rise and falls. But a trend can be discovered in the behavior of ranking method. A continuous increase can be observed, in the noisy and noiseless results of ranking method, while the other two method are approximately near zero.

In efficiency criterion (Fig. 16), like expertness criterion, gradient based methods move together and much higher than history based method.

In the correctness criterion (Fig. 17), in contrast to other criteria, no method dominates the other constantly. The more questions the agents ask, the more correctness values they gain.

Group support ratio values are another measure to compare the CA methods (Fig. 18). As can be observed no distinct change can be seen in the ranking method with or without noise.

6.4 Comparison of the methods in terms of dependency to individual learning

In these set of experiments, we want to evaluate the effect of individual learning on the ranking method. Every diagram consists of four curves. Two of them are for ranking method, with and without individual learning, and the next two, are for the dynamic critic method and history-based method, both without individual learning. In the next set of legends for the diagrams, w/o means without, and w/ means with.

The first comparison was performed based on the learning ratio (Fig. 19). The best method, according to learning ratio criterion, is the history based method. The next best method, in the early episodes is the ranking method with individual learning, of course after the passage of some episodes (about 50), the ranking method with individual learning reaches to the other version. As can be observed, the only method with the non-zero value on the zero^th episode determines the method with individual learning.

Confidence is another measure of comparison, and as it is clear, the ranking method is not sensitive to the starting individual learning, and the two curves move together toward the value 7 (Fig. 20).

For the expertness criterion (Fig. 21), on the early episodes, the effect of individual learning on the ranking method, is noticeable; but after about 25^th episode, the two curves move together; and after 65^th episode, even the learning curve reaches to the ranking method.

The certainty measure (Fig. 22) also does not distinguish any difference between the two versions of ranking method, with and without the individual learning; the two other methods does not have any improvement according to certainty criterion.

As it is noticeable, with the efficiency criterion (Fig. 23), the three gradient based methods move together toward the value one; but the behavior of history based method is different, requiring much more episodes to have an improvement similar to the ranking and learning methods.

The behaviors of the three methods are not consistent along episodes. During last episodes, the effect of the individual learning in the early episodes becomes clear, and the ranking method with individual learning advanced more than other methods (Fig. 24).

Group support ratio is another criterion which can clarify the effect of individual learning on the ranking method. From the GSR viewpoint, the learning and ranking methods (without individual learning) move together and the ranking method equipped with individual learning gains much more GSR values in different episodes (Fig. 25).

7 Conclusions and future works

In this paper, two methods for solving the credit assignment problem have been proposed. In the first method, the critic tries to solve the problem without learning the underlying domain. This method uses a temporal difference method and learns credit assignment from the feedback of the environment, not the agents’ tasks. Different criteria were used to guide the critic agent. However, in the second method, the critic tries to evaluate the knowledge of the team, and decides how much reward is intended to be given to the agent. The dynamic critic method tries to learn the task of credit assignment, it is useful when there is lack of access to the domain information, and this deficiency leads to more time for learning, so the critic should not try to learn the domain, but to learn the credit assignment directly.

The critics in a multi-agent problem can be categorized as the internal critics and the external critics. The basis for this division is that, if the critic is someone apart from the other agents, we have external critic, and if we have multiple critics each one in the mind of one agent, it is internal (group of) critic(s). Experimental results show that the dynamic (external) critic methods are better than the internal critic methods. The reason is that the internal critic is the same as the agent itself and the responsibility of the agent is to learn the task but in the case of external critic, it is better for him not to interfere with the task, and so the history-based method is better.

The methods were evaluated with seven criteria, and the reasons for strengths and weaknesses in different experiments were investigated.

Future works include trying to define more criteria for the problem. Testing the solution in new domains, like decision support systems, are also among the future works; additionally we will try to apply it to problems like hierarchical learning, and finally to try to find new general clues for finding the rules of credit assignment.

References

Skinner

B.F.

, Operant behavior, American Psychologist18(8) (1963), 503.

Kaelbling

L.P.

, Littman

M.L.

and Moore

A.W.

, Reinforcement learning: A survey, Journal of Artificial Intelligence Research (1996), 237–285.

Agogino

A.K.

and Tumer

, Unifying temporal and structural credit assignment problems, in Proceedings of the third International Joint Conference on Autonomous Agents and Multiagent Systems-Volume22004, pp. 980–987.

Sutton

R.S.

, Temporal credit assignment in reinforcement learning, Ph.D. dissertation, University of Massachusetts –Amherst, 1984.

Watkins

C.J.

and Dayan

, Q-learning, Machine Learning8(3-4) (1992), 279–292.

Panait

and Luke

, Cooperative multi-agent learning: The state of the art, Autonomous Agents and Multi-Agent Systems11(3) (2005), 387–434.

Mao

and Gratch

, The social credit assignment problem, in Proceedings of the Fourth International Working Conference on Intelligent Virtual Agents, Springer, 2003, pp. 39–47.

Kelley

H.H.

, Attribution theory in social psychology, in Nebraska symposium on motivation, University of Nebraska Press, 1967.

Kimura

and Kobayashi

, An analysis of actor/critic algorithms using eligibility traces: Reinforcement learning with imperfect value function, in International Conference on Machine Learning, 1998, pp. 278–286.

10.

Sutton

R.S.

and Barto

A.G.

, Reinforcement learning: An introduction, MIT Press Cambridge1(1) (1998).

11.

Harati

, Ahmadabadi

M.N.

and Araabi

B.N.

, Knowledgebased multiagent credit assignment: A study on task type and critic information, IEEE Systems Journal1(1) (2007), 55–67.

12.

Agogino

, Tumer

and Miikkulainen

, Efficient credit assignment through evaluation function decomposition, in Proceedings of the 7th Annual Conference on Genetic and Evolutionary Computation, 2005, pp. 1309–1316.

13.

Khare

V.R.

, Yao

and Sendhoff

, Credit assignment among neurons in co-evolving populations, in Parallel Problem Solving from Nature-PPSN VIII, Springer, 2004, pp. 882–891.

14.

Friedrich

, Urbanczik

and Senn

, Spatio-temporal credit assignment in neuronal population learning, PLoS Comput Biol7(6) (2011), e1002092.

15.

Minsky

, Steps toward artificial intelligence, Artificial Intelligence: Critical Concepts49(1) (1961), 8–30.

16.

Mataric

M.J.

, Reward functions for accelerated learning, in Proceedings of the Eleventh International Conference on Machine Learning, 1994, pp. 181–189.

17.

Weiss

, Multiagent systems: A modern approach to distributed artificial intelligence, MIT Press, 1999.

18.

Arai

and Sycara

, Credit assignment method for learning effective stochastic policies in uncertain domains, 2001, pp. 815–822.

19.

Walsh

M.M.

and Anderson

J.R.

, Neural correlates of temporal credit assignment, in Proceedings of the 10th International Conference on Cognitive Modeling, pp. 2010, 256–270.

20.

Jones

and Goel

A.K.

, Structural credit assignment in hierarchical classification, in Proceedings of International Conference on Artificial Intelligence, pp. 2007, 378–384.

21.

Jones

and Goel

, Hierarchical judgement composition: Revisiting the structural credit assignment problem, in Proceedings of the AAAIWorkshop on Challenges in Game AISan Jose, CA, USA, 2004, pp. 67–71.

22.

Roelfsema

P.R.

and van Ooyen

, Attention-gated reinforcement learning of internal representations for classification, Neural Computation17(10) (2005), 2176–2214.

23.

Jones

and Goel

, Knowledge organization and structural credit assignment, Reasoning, Representation, and Learning in Computer Games, 2005, p. 37.

24.

Saito

, Katahira

, Okanoya

and Okada

, Statistical mechanics of structural and temporal credit assignment effects on learning in neural networks, Physical Review E83(5) (2011), 051125.

25.

Matarić

M.J.

, Learning in behavior-based multi-robot systems: Policies, models, and other agents, Cognitive Systems Research2(1) (2001), 81–93.

26.

W.-T.

and Anderson

J.R.

, From recurrent choice to skill learning: A reinforcement-learning model, Journal of Experimental Psychology: General135(2) (2006), 184.

27.

Pashaei

, Taghiyareh

and Badie

, A negotiation-based genetic framework for multi-agent credit assignment, in Multiagent System Technologies, Springer, 2014, pp. 72–89.

28.

Bianchi

, Learning grammatical rules from examples using a credit assignment algorithm, in Proceedings of the First Online Workshop on Soft Computing (WSC1), 1996, pp. 113–118.

29.

Agogino

A.K.

and Tumer

, Quicr-learning for multi-agent coordination, in Proceedings of the National Conference on Artificial Intelligence, vol. 21, no. 2. Menlo Park, CA; Cambridge, MA; London; AAAI Press; MIT Press; 1999, 2006, p. 1438.

30.

Bednar

, Credit assignment and federal encroachment, Supreme Court Economic Review15(1) (2007), 285–308.

31.

Frandsen

T.F.

and Nicolaisen

, What is in a name? credit assignment practices in different disciplines, Journal of Informetrics4(4) (2010), 608–617.

32.

Franklin

J.A.

and Manfredi

V.U.

, Nonlinear credit assignment for musical sequences, in Second International Workshop on Intelligent Systems Design and Application, 2002, pp. 245–250.

33.

Whitacre

J.M.

, Pham

T.Q.

and Sarker

R.A.

, Credit assignment in adaptive evolutionary algorithms, in Proceedings of the 8th Annual Conference on Genetic and Evolutionary Computation, 2006, pp. 1353–1360.

34.

Zhu

and Kong

, A fuzzy cmac neural network model based on credit assignment, International Journal of Information Technology12(6) (2006), 1–8.

35.

S.-F.

, Lee

Z.-J.

and Wang

Y.-P.

, Robust and fast learning for fuzzy cerebellar model articulation controllers, Systems, Man, and Cybernetics, Part B: Cybernetics, IEEE Transactions on36(1) (2006), 203–208.

36.

Yeh

M.-F.

and Chang

K.-C.

, A self-organizing cmac network with gray credit assignment, IEEE Transactions on Systems, Man, and Cybernetics, Part B: Cybernetics36(3) (2005), 623–635.

37.

Mataric

M.J.

, Learning to behave socially, in Proceedings of Third International Conference on Simulation of Adaptive Behavior617 (1994), 453–462.

38.

Raeisy

, Haghighi

S.G.

and Safavi

A.A.

, Active noise control system via multi-agent credit assignment, Journal of Intelligent and Fuzzy Systems26(2) (2014), 1051–1063.

39.

Wolpert

D.H.

, Wheeler

K.R.

and Tumer

, General principles of learning-based multi-agent systems, in Proceedings of the Third Annual Conference on Autonomous Agents, 1999, pp. 77–83.

40.

Ontañón

and Plaza

, A bartering approach to improve multiagent learning, pp, in Proceedings of the First International Joint Conference on Autonomous Agents and Multiagent Systems: Part 1 (2002), 386–393.

41.

Tangamchit

, Dolan

J.M.

and Khosla

P.K.

, The necessity of average rewards in cooperative multirobot learning, in IEEE International Conference on Robotics and Automation, 2002, Proceedings ICRA’02, vol. 2, (2002), pp. 1296–1301.

42.

W.-T.

and Anderson

J.R.

, Solving the credit assignment problem: Explicit and implicit learning of action sequences with probabilistic outcomes, Psychological Research72(3) (2008), 321–330.

43.

Samejima

, Doya

and Kawato

, Inter-module credit assignment in modular reinforcement learning, Neural Networks16(7) (2003), 985–994.

44.

Yasini

, Sitani

M.B.N.

and Kirampor

, Reinforcement learning and neural networks for multi-agent nonzerosum games of nonlinear constrained-input systems, International Journal of Machine Learning and Cybernetics, 2014, pp. 1–14.

45.

Shojaiemehr

and Rafsanjani

M.K.

, A multi-agent based model for collective purchasing in electronic commerce, International Journal of Machine Learning and Cybernetics4(4) (2013), 409–417.

46.

Lang

and Jiang

, Artificial intelligence based profitsharing algorithm in multi-agent systems, Journal of Networks9(1) (2014), 48–55.

47.

Yliniemi

and Tumer

, Multi-objective multiagent credit assignment through difference rewards in reinforcement learning, in Simulated Evolution and Learning, Springer, 2014, pp. 407–418.

48.

Bengio

, How auto-encoders could provide credit assignment in deep networks via target propagation, arXiv preprint arXiv:1407.7906, 2014.

49.

Taghizadeh

and Beigy

, A novel graphical approach to automatic abstraction in reinforcement learning, Robotics and Autonomous Systems61(8) (2013), 821–835.

50.

Rahaie

and Beigy

, Expertness framework in multi-agent systems and its application in credit assignment problem, Intelligent Data Analysis18(3) (2014), 511–528.

51.

Araabi

B.N.

, Mastoureshgh

and Ahmadabadi

M.N.

, A study on expertise of agents and its effects on cooperativelearning, IEEE Transactions on Systems, Man, and Cybernetics, Part B: Cybernetics37(2) (2007), 398–409.