Abstract
Multi-agent systems can be seen as an apparatus for testing the performance of real distributed systems. One problem encountered in multi-agent systems with the learning capability is credit assignment. This paper presents two methods for solving this problem. The first method assigns credit to the agents according to the history of the interaction while the second method assigns credit to the agents according to the knowledge of agents, and thus the shares of the agents are extracted from the feedback of the environment. The computer experiments show that critic learning has a positive impact in credit assignment problem.
Introduction
The idea of interactive learning first comes from psychological research, in which Skinner tries to condition a living entity with six means: reward (positive or negative), punishment (positive or negative), shaping, extinction, discrimination, and generalization [1]. In artificial intelligence, the term reinforcement learning (as a type of interactive learning) was used as an indication of a group of machine learning methods which use evaluative feedback (or reinforcement) to guide agents through the environment [2]. As the sole guide is the feedback, it is very important that when and how much feedback is given and which part of the agent’s knowledge will be updated based on the feedback. Single agent reinforcement learning can be used in multi-agent systems, the feedback received by the group is divided to some parts, and each agent performs reinforcement learning with its own share. The one who distributes the feedback is called the critic.
The credit assignment problem (CA) is investigated in four main areas: structural, temporal, multi-agent, and social credit assignment. The structural CA in single-agent domains looks for which part of the agent’s knowledge should be updated by the received feedback [3]. The temporal CA in single-agent domains builds on top of the problem called delayed reward [4, 5]. Delayed reward problem occurs when a feedback is given to an action of the agent after the time of performing that action, and how we understand which action (or set of actions) deserves that feedback. The multi-agent CA as its name implies, refers to the distribution of credit among a group of cooperating agents, a team of agents, but want to have individual learning during teamwork (and may also have individual learning of the task before joining the team) [6]. In the social CA [7], the attribution theory [8] from psychology is used to consider social concerns in determining the amount of the credit/blame to be assigned to the agents. In each of the challenges noted above, the purpose is to guide the agents in achieving the goal. Other methods such as actor-critic [9], Q-learning [5], and SARSA [10] are also proposed for the problem of temporal credit assignment.
Several studies have been carried out in multi-agent credit assignment. In knowledge-based CA [11], some criteria are proposed to evaluate the knowledge of agents, and based on the quantification of the knowledge; agents with knowledge below a certain threshold are punished, and above the thresholds are rewarded. The method proposed in [11], categorizes the agents into three groups: those who rewarded, those who neither rewarded nor punished, and those who punished. In structural credit assignment, some solutions are proposed based on evolutionary computing methods like neural networks and genetic algorithms [12, 14], and others try to map structural CA to temporal CA and vice versa [3]. In this paper, we propose two methods for solving the problem of multi-agent credit assignment. The first method saves the history of interaction of the agents with the environment and based on this interaction concludes which action in the state deserves feedback, while in the second one; the critic agent tries to evaluate agent’s knowledge and gives their share based on their knowledge. These methods are compared together according to different criteria and the reasons for strengths and weaknesses in different experiments are investigated.
The rest of this paper is organized as follows: in Section 2, the definition of the multi-agent CA problem is given and in Section 3, we briefly review the related work. In Section 4, the history-based credit assignment method is presented and the knowledge-based method is introduced in Section 5. Section 6 presents the experimental results and Section 7 concludes the paper.
Multi-agent credit assignment problem
In this section, we give the notations that will be used throughout the paper, and also the formal definition of multi-agent CA problem. In the rest of the paper, we use the following notations. The number of agents will be specified by N, n shows the current agent, and S denotes the number of states. Each agent will select its action from the allowable action set A. The state of agent i at time t denoted by and the current action selected by this agent at state shown by . The feedback to the selected action by the agent i at time t represented by and f t is the total feedback received at time t by the team of the agents.
Problem definition
Figure 1 shows the interaction of two main actors of the problem: an environment and a group of agents in the presence of a special agent, named critic. Critic can be either external or internal. By external critic, we mean that a distinct agent (in the group of agents) exists who is equipped with the wisdom of credit distribution. In the internal critic model, every agent has a kind of knowledge in his mind that helps him to distribute the received feedback to some part of his knowledge, or gain his share from the feedback received by a group of agents.
There exists a group of agents cooperating with each other in order to reach a goal. In the reinforcement learning model, a feedback is returned from the environment to give a clue of the rate of satisfaction of the environment from the action selected by the agent group. The feedback can be the consequence of the previous actions carried out by the agents meaning that there is a time interval between the action that is performed and its consequence in the form of a feedback. If each individual in the group of agents wants to use his own share of the reinforcement received by the environment, there is a need for someone to convert the scalar feedback (reinforcement) f t to a vector of feedbacks , where N is the number of agents and is intended to be given to the i th agent, and with that, each member of the team can pursue his own individual learning, independent of the others. The critic is responsible for feedback distribution and generally has no information about the task assigned to the group of agents, and also about the duty of each agent in the team; the only thing available to him is the vector of actions selected by agents . The critic can improve the quality of credit assignment over time, this is what we mean as “critic learning”. In the process of learning, either the critic learn the task of the agents, or it can learn the assignment, without learning the task.
The team of agents may be homogeneous or heterogeneous. The vector may not be the same as the one received by the environment. There may be some kind of noise which converts the vector to the , so the environment does not receive the real set of actions and the feedback is the result of a noisy set.
Related works
The CA problem was first seen in the notes from Minsky [15], who defined it as: “The credit assignment problem concerns determining how the success of a system’s overall performance is due to the various contributions of system’s components.” It was simply a method for credit/blame sharing [16]. Weis classified the CA problem into two main categories: inter-agent CA and intra-agent credit assignment [17]. Inter-agent is the assignment of credit for an overall performance change to the external actions of the agents. Intra-agent is the assignment of credit for a particular external action of an agent to its underlying internal inference and decisions. Most of the methods proposed in the literature for solving the CA problem can be classified into four main groups: temporal, multi-agent, structural, and social credit assignment. Figure 2 shows the classification of these methods. There are also some CA methods that do not belong to the above four groups. These methods are classified as other methods.
In temporal credit assignment, components in the Minsky definition are time and action. The issue is accompanied by a famous problem of delayed reward [5], means that an agent acts in an environment and does not immediately receive the consequence of its actions. When the feedback is received, the agent typically does not know which of his actions deserves that. Several methods were proposed to solve the above mentioned problem. The first idea was given in [4]. The algorithms were based on learning automata and mathematical learning theory, and the tasks were selected to include, first in separation and then in group, the issue of misleading generalizations, delayed reinforcement, and so on.
Three reinforcement learning models that address temporal CA are based on actor-critic [9], Q-learning [5], and SARSA [10]. Eligibility traces are one of the basic methods of reinforcement learning [10, 4] that keep the history of the occurrence of an event, such as the visiting a state or choosing an action. The trace marks the memory parameters associated with an event as eligible for undergoing updates in the learning algorithm. When a TD error occurs, only the eligible states or actions are used to generate the feedback for the error. Thus, eligibility traces help to fill the gap between events and training information. The eligibility traces versions of SARSA (0) and one-step Q-Learning are called SARSA (λ) and Q (λ), respectively. Besides, first visit profit sharing was proposed as a solution of CA problem in the partially observable environments [18].
In [19], a method was given for solving temporal CA problem when the feedback is a sequence of decisions. To explore this problem, the event-related potentials (ERPs) were recorded as participants performed a sequential decision task.
Structural CA puts forth the question of which part of the agent’s knowledge deserves credit. In nonstructural methods, a kind of hierarchy was discovered in a task, and the problem was solved with that hierarchy [20, 21]. Attention-gated reinforcement learning (AGREL) resolved both the all-knowing teacher problem and the structural CA problem [22]. For learning in large state spaces, the learning problem can be decomposed to a set of subproblems. One approach to the decomposition is to represent the knowledge as a collection of smaller, more individually manageable pieces. Then, externally verifiable decomposition was used to learn more complex knowledge structures over which structural CA performed during learning [23]. In [24], statistical mechanical analysis was applied to the learning process in a simple neural networks to clarify the effect of temporal and structural credit assignments and their interactions. In [25], progress estimators were proposed to measure the progress toward the goal of a given behavior during his execution. In [26], the authors proposed a reinforcement learning mechanism as a model for recurrent choice and extended it for skill learning.
CA problem also exists in methods inspired from nature such as genetic algorithms and neural networks trained with backpropagation method [27]. Some methods tried to assign credit to genes in chromosomes using a multi-agent system such that each agent is responsible for discovering a suitable gene with the help of his own evaluation function [12]. CA issue was also raised in several applications such as learning a context free grammar from a sample of sentences using genetic algorithms [28], and some methods try to use a variation of Q-learning for multi-agent systems [29]. The CA problem was also discussed in political issues [30], and reviewed in the fields of economics, high energy physics, and information science [31]. Nonlinear CA in reinforcement learning is also used for music generation [32].
In addition, in neural network applications, different CA strategies were investigated in a two-level co-evolutionary model which involves a population of Gaussian neurons and a population of radial basis function networks [13, 14]. In evolutionary algorithms [33], a new way was proposed to assign credit to search operators, and thus, an adaptive evolutionary algorithm was developed.
A fuzzy CMAC (cerebellar model articulation controller) neural network model based on the CA concept was given in [34, 35]. In [36], the structure of the CMAC network was incorporated into self-organizing map (SOM) to construct a self-organizing CMAC (SOCMAC) network, and based on gray relational analysis; a CA technique for SOCMAC learning was introduced to hasten the overall learning process.
In social credit assignment, the CA was viewed socially. In [7], a computational model was introduced based on the psychological attribution theory. With this computational model the social inference in human-centric societies was modelled and the human behavior was predicted. With the aid of this model, attribution variables, causality, foreseeability, intention, and coercion are also modelled. Causality is the connection between actions and effects; foreseeability is the ability of prediction of the consequence of agents; intention is the commitment to work toward a certain act or outcome; and in coercion, the agent may be coerced to act but not coerced to achieve any outcome of the action. In [7], the authors discussed that the proper assignment of the credit in a social setting must not only consider actions and the knowledge state of different agents, but also utilize information available to reason about key attributions that contribute to the judgement process.
In [37] a method was proposed in which in addition to local rewards, social reinforcement was also added. This method considered the case when an agent observes other agents and attempts to imitate their behavior, and vicarious rewards were given when other agents received rewards.
Multi agent systems are another area where CA problem appears. In these systems, the components in Minsky definition are agents working with each other and the result is portrayed to the environment. The received feedback is based on the collective behavior of the team, not individual behavior.
One of the methods proposed in the multi-agent domain was knowledge-based credit assignment (KEBCA) [11], distributing the feedback according to the knowledge of agents. The agent’s knowledge was quantified in such a way that they can be compared and sorted. This process was carried out through some measures such as expertness, and certainty. If the knowledge of the agent was below of a threshold, the agent would be penalized, otherwise it would be rewarded. Another suggestion was that the agents would be partitioned into three groups: the first group is penalized, the second neither penalized nor rewarded, and the third rewarded. The author of [38] suggested a novel multi-agent active noise control (ANC) formulation via a slight modification of KEBCA.
Some researches tried to unify the temporal and multi-agent CA with the notion of a time-extended single-agent problem as a multi-agent single-time-step problem [3]. Another term for multi-agent credit assignment is collective intelligence (COIN) [39], the two are almost equivalent. The main results obtained from COIN only concern situations where agents can be organized in sub-groups working on totally independent problems.
In the existence of a large number of agents, the authors of [40] suggested the use of wonderful life utility and claimed that it is better than both local and global rewards. Tangamchit tried to compare averaging rewards and discounting rewards (that was emphasizing immediate rewards over long-time rewards), and concluded that averaging is better than discounting [41]. Fu tried to solve the problem with the aid of policy and action learning using state information [42].
The topic was also discussed in hierarchical reinforcement learning. This was done in the form of a modular reward, which was calculated from the temporal difference of the module gating signal and the value of the succeeding module. Modular reward was implemented for multiple model-based reinforcement learning (MMRL) architecture [43].
In [44], an online optimal control method is proposed for nonzero sum games of continuous-time, which is to some extent similar to the credit assignment problem, and in [45] a similar problem is solved in e-commerce application. Other recent works like methods proposed in [46, 48] explored credit assignment in domains like deep learning. Some other methods like action abstraction [49] may be used to tackle the CA problem.
In [50], a multi-agent CA is proposed. For the sake of simplicity in presentation, we call this method dynamic critic method. The responsibility of the critic agent is to distribute f
t
among the agents such that each agent receives his own part at time slot t and to update the share of agents. In this method, the total reward can be decomposed as a weighted sum of rewards of the agents as given by the following rule (Equation (1)).
The critic agent updates each coefficient by the following rule (Equation (3)).
One problem that a system might encounter is the use of the previous knowledge about the last situations it encounters to the present situation. To the best of our knowledge, there is no method that uses such information. In what follows, we propose a history-based method to utilize the previous knowledge in order for better distribution of credit. In other words, we use an undirected graph, which models the environment, to describe the idea. We define a set of unknown variables to model the environment and when the value of all of them is specified, the model of the environment is complete, and critic can respond to the agent’s actions instead of the interacting with the environment.
First we define interaction between the agents and the environment (Equation (4)),
Two nodes in the graph are adjacent if the corresponding actions list differs in only one, i.e. (Equation (6))
In this way, we try to calculate the feedback of the i th agent at time t. If the tasks for different agents are not similar, the agents should only use their own previous feedbacks, thus in the above equations, the f t 1 changed to . Otherwise, other agents’ feedbacks can only be used. This means that when we find two sets of actions (and their corresponding feedback) with only one different action, the difference between two feedbacks, one of them setting to zero, is the feedback for the corresponding action. This way, the two s are determined (one of them is set to zero, and the other one to |f t 1 - f t 2 |).
One reason for setting the smaller value to zero is to preserve the nature of the reinforcement signal. The other reason is the importance of relative values of the feedbacks not their absolute value [10].
In the Fig. 3, we illustrate three main steps of the algorithm (for a two-action problem). In the first step (the one on the left), the two actions of two different states are determined, and since only one difference between actions exists, there is an edge between states. In the second step (the middle of the figure), the feedback for these sets of action are determined, and in the last step (the right of the figure), the values of two feedbacks are determined by the critic (although the feedback of the other action is not determined yet, and needs another possible adjacent interaction with the environment).
The algorithm continues until the values of all s are determined.
The history-based CA tries to keep the track of interactions between the agents and the environment. The information stored for each interaction is the list of actions selected by the agents and the list of thetotal feedbacks received by the team. In order to study the effect of each action on the total feedback, the list of interactions is traversed until two interactions with a single difference (in the same state) appears in the list of feedbacks. In the case of this single action, an approximation of the reward (difference of the two feedbacks) can be used to provide a situation for the agent to learn the value of this action. If the difference of the feedbacks is greater than zero the reward will be given to the first action, otherwise to the second action.
Suppose we maintain a matrix for the the feedbacks that are given by the environment. When the agents are heterogeneous, the feedback matrix is three dimensional, and its size will be N × |S| × |A|, and when the agents are homogeneous, the feedback matrix is two dimensional and the size will be decreased to |S| × |A|. In the homogeneous case, a single Q-table can be maintained for all agents and every update is performed on this table. This will decrease the amount of required memory and increase the speed of convergence of the learning. In this algorithm, we consider the general case of heterogeneous agents. Softmax policy given in Equation (8)), is used to select actions, τ is a parameter that tunes the randomness of the policy, in which high values of it show that exploration is considered, and low values indicate that exploitation is followed.
The second proposed method is based on the evaluation of the agents’ knowledge, which requires a criterion to quantify knowledge. The next step includes sorting according to that criterion, and the next one is to distribute the feedback according to the sorted list.
In this method, the CA process is divided into two phases, as shown in Fig. 4. The first phase called knowledge extraction in which the knowledge is extracted based on some predefined criteria. In this paper, we used the following criteria to extract knowledge: confidence (introduced in this paper), and six criteria including expertness, certainty, correctness, efficiency, group support ratio, and learning ratio given in [51, 11]. Each criterion will be explained later.
The second phase called the knowledge evaluation. In this phase, an attempt is made to evaluate and compare the knowledge of different agents, and decide how much feedback is given to each agent. In this phase, when a feedback is received, the agents are sorted ascendingly according to their knowledge. The feedback of agent i is calculated using the following Equation (9):
In order to evaluate the proposed algorithms, computer experiments are conducted and the results are compared with the results of the algorithm given in [50]. In the next section we will explain the test domain.
The test domain
History-based and knowledge-based methods are tested in a multi-agent domain with 5 agents, borrowed from [11]. Every agent has the responsibility of adding two digits (0, 1, 2, 3, 4); and the agents cooperate to add two five-digit numbers. So the result of adding two digits is at most 8, and does not result in carry. The action set of each agent is set of nine actions (0, …, 8). The termination condition of the episode is the simultaneous correct answer of all of the agents to their question. The reward is such that every agent who gives the correct answer is given +4 rewards, otherwise they receive zero feedback. For example, the result of addition of two numbers 40321 and 30421 equals to 70742. If the agents declare the answer 77742, the team of agents will receive +16 from the environment.
Comparison of the three methods
The results of assessment are given in the following subsections using seven tests, in terms of a new criterion introduced as confidence as well as six other criteria (learning ratio, expertness, certainty, efficiency, correctness, group support ratio) introduced in [11, 51].
Test I: Comparison of the methods in terms of learning ratio
Group learning ratio (LR) is defined as the average of the individual learning ratio of the agents given in Equation (12). Equation (11) defines the term from an agent perspective and Equation (12) defines it from a group viewpoint. Individual learning ratio is defined as the ratio of learnt states (defined by Equation (10)) to the number of states; in other words the term |learnt s| is the number of states that are learned; and learning is defined as when the Q-values of a state have a single maximum, and in that state, the corresponding action is selected with the greedy policy. Parameter S is the set of states of the environment and |S| is the number of states that exists for each agent. By averaging it on the agents, Equation (12) gives the learning ratio for the team of agents.
The next criterion to be tested is the degree of confidence that the agent has when he is completing his Q-table. When the difference between the largest and the second largest value of the Q-table becomes larger, it seems the environment further insists more on the correctness of the action a. In terms of order statistics, if [q(1), q(2), q(3), . . . , q(|A|)] be the sorted Q-values (ascending) according to different actions, confidence measure will be quantified as given in Equation (13):
Expertness is another criterion to be used for comparison between three CA methods. This term is defined as the number of times the agent receives a reward and thus has selected correct actions () minus the number of punishments received (wrong experiences of agent) (). Expertness is a history-based criterion, as it deals with a history of agent interactions with the environment (Equation (14)).
Figure 7 compares three methods in terms of expertness. As it can be observed, after 100 episodes the Ranking method obtains higher expertness. Thus the ranking method acts better with expertness criterion. The dynamic critic method [50] is ranked next, and the last is the history-based method.
The next experiment (Fig. 8) is a comparison of three different CA methods based on certainty criterion. Certainty is defined by Equation (15); and it compares the Q-value of an action in a state with the other Q-values of that state. The comparison is done in an exponential manner so that it exaggerates the difference of different Q-values of the state. This criterion needs a parameter, T, which is determined with Equation (16). When T→ ∞, different actions of state S have nearly the same probability 1/|A| where |A| is the number of actions. When T has a small value, the value of certainty is sensitive to the Q-values. Thus at the beginning of the experiments when the values of Q are not determined yet and thus unreliable, big values of T are used, and as the experiment proceeds, T will be reduced, and the effect of Q becomes more definite.
Equation (16) determines the value of T for each episode. T0 is a parameter that is set to 10, T
min
is set to 1 in our experiments, and the variable episode determines the episode number which starts from 1.
From the certainty viewpoint and with the help of certainty in calculating coefficients of the dynamic critic method, the highest value is for the ranking method which, ignoring spikes, has good progress from 0 to 2. The other two methods do not have good performance.
Efficiency is defined in Equation (17) and intends to count the number of non-zero credit assignments by the critic. The process of CA should be designed in a conservative way, since every reward/punishment to each agent, is directing him toward a goal, and thus misguiding the agent finally leads to missing the goal or taking a longer route towards the goal. Thus, every non-zero CA means that the critic accepts a risk and has a judgement about the action selected by the agents.
Figure 9 shows the results of a comparison CA methods from an efficiency viewpoint. The ranking and also dynamic critic method improves efficiency with the same rate, but history-based CA does not have much efficiency. The reason is that the number of exact-zero assignments to the agent is high due to the nature of the algorithm.
Another criterion for a comparison of the CA methods is correctness. Correct assignments can be defined in several ways. A strict definition is based on the sign and magnitude of the feedback. This means that if the sign and magnitude of the feedback are the same as what the environment really wants to give to him, this shows the success of the critic. A more flexible way to determine the correct assignment is to specify a threshold for the difference between the given feedback and real feedback. If the difference is less than the threshold, it will be counted as correct, otherwise as wrong.
The other definition of correct assignment is only the sign (and not magnitude) of the real and given feedbacks. If they have the same sign, we can say that a correct assignment has been accomplished by the critic (Equation (18)).
Among these three definitions of correctness (Fig. 10), the modest one, that is, the one with the threshold, is selected for these experiments. The most successful one is the ranking method which has the most correct assignments, the next is the dynamic critic method, and the last one is the history-based method.
The next measure is group support ratio (GSR) given in Equation (19). In contrast to other criteria that are calculated for single agent, this is a group-based measure. It will consider a window of w last interactions with the environment and computes the number of interactions which lead to the reward. Giving a reward to the team of agents is an indication of successful interaction, which at last ends with a reward, meaning that a goal or subgoal is being reached. For this experiment, w is set to 100.
The results of experiment is shown in Fig. 11. The ranking method gains the most GSR during different episodes, and finally it reaches the GSR value of 1, at episode 100. The next method according to GSR is dynamic critic, and also reaches the GSR value of 1 at the 100 th episode, but in episodes less than 100 it is lower than the ranking method. The next method, which is the lowest according to GSR, is the history-based method.
In this section, we study the effect of the noise on the three methods. As illustrated in last section, we have used learning ratio, confidence, expertness, certainty, efficiency, correctness, and group support ratio, in the next experiments.
The noise is such that 20% of the Q-tables are changed 80% . As the main contribution of this paper is ranking method (Section 5), we brought both the noisy and noiseless results in the corresponding figures, and the noisy versions of the results of two methods are also reported for comparison.
The first comparison performed based on learning ratio. As can be seen, there is no major change on the ranking method, with and without noise. But history-based method reaches faster than other methods to a high degree and stays there; also dynamic critic method moves approximately the same (but a bit higher) than ranking method (Fig. 12).
The next measure is confidence (Fig. 13). Like learning ratio, ranking method tests either with or without noise, moves together, and increases continuously. In history-based method the average correctness does not change after 10 th episode; and, dynamic critic method has a slight increase until 70 th episode.
For the next criterion (Fig. 14), expertness, we can see that ranking method, even in the existence of noise moves in the same rate with the dynamic critic method; and history-based method moves much slower than the two gradient-based methods.
Certainty diagram (Fig. 15) is one of the diagrams with a lot of rise and falls. But a trend can be discovered in the behavior of ranking method. A continuous increase can be observed, in the noisy and noiseless results of ranking method, while the other two method are approximately near zero.
In efficiency criterion (Fig. 16), like expertness criterion, gradient based methods move together and much higher than history based method.
In the correctness criterion (Fig. 17), in contrast to other criteria, no method dominates the other constantly. The more questions the agents ask, the more correctness values they gain.
Group support ratio values are another measure to compare the CA methods (Fig. 18). As can be observed no distinct change can be seen in the ranking method with or without noise.
Comparison of the methods in terms of dependency to individual learning
In these set of experiments, we want to evaluate the effect of individual learning on the ranking method. Every diagram consists of four curves. Two of them are for ranking method, with and without individual learning, and the next two, are for the dynamic critic method and history-based method, both without individual learning. In the next set of legends for the diagrams, w/o means without, and w/ means with.
The first comparison was performed based on the learning ratio (Fig. 19). The best method, according to learning ratio criterion, is the history based method. The next best method, in the early episodes is the ranking method with individual learning, of course after the passage of some episodes (about 50), the ranking method with individual learning reaches to the other version. As can be observed, the only method with the non-zero value on the zero th episode determines the method with individual learning.
Confidence is another measure of comparison, and as it is clear, the ranking method is not sensitive to the starting individual learning, and the two curves move together toward the value 7 (Fig. 20).
For the expertness criterion (Fig. 21), on the early episodes, the effect of individual learning on the ranking method, is noticeable; but after about 25 th episode, the two curves move together; and after 65 th episode, even the learning curve reaches to the ranking method.
The certainty measure (Fig. 22) also does not distinguish any difference between the two versions of ranking method, with and without the individual learning; the two other methods does not have any improvement according to certainty criterion.
As it is noticeable, with the efficiency criterion (Fig. 23), the three gradient based methods move together toward the value one; but the behavior of history based method is different, requiring much more episodes to have an improvement similar to the ranking and learning methods.
The behaviors of the three methods are not consistent along episodes. During last episodes, the effect of the individual learning in the early episodes becomes clear, and the ranking method with individual learning advanced more than other methods (Fig. 24).
Group support ratio is another criterion which can clarify the effect of individual learning on the ranking method. From the GSR viewpoint, the learning and ranking methods (without individual learning) move together and the ranking method equipped with individual learning gains much more GSR values in different episodes (Fig. 25).
Conclusions and future works
In this paper, two methods for solving the credit assignment problem have been proposed. In the first method, the critic tries to solve the problem without learning the underlying domain. This method uses a temporal difference method and learns credit assignment from the feedback of the environment, not the agents’ tasks. Different criteria were used to guide the critic agent. However, in the second method, the critic tries to evaluate the knowledge of the team, and decides how much reward is intended to be given to the agent. The dynamic critic method tries to learn the task of credit assignment, it is useful when there is lack of access to the domain information, and this deficiency leads to more time for learning, so the critic should not try to learn the domain, but to learn the credit assignment directly.
The critics in a multi-agent problem can be categorized as the internal critics and the external critics. The basis for this division is that, if the critic is someone apart from the other agents, we have external critic, and if we have multiple critics each one in the mind of one agent, it is internal (group of) critic(s). Experimental results show that the dynamic (external) critic methods are better than the internal critic methods. The reason is that the internal critic is the same as the agent itself and the responsibility of the agent is to learn the task but in the case of external critic, it is better for him not to interfere with the task, and so the history-based method is better.
The methods were evaluated with seven criteria, and the reasons for strengths and weaknesses in different experiments were investigated.
Future works include trying to define more criteria for the problem. Testing the solution in new domains, like decision support systems, are also among the future works; additionally we will try to apply it to problems like hierarchical learning, and finally to try to find new general clues for finding the rules of credit assignment.
