Abstract
We propose a method of autonomous learning of target decision strategies for coordination in the continuous cleaning domain. With ongoing advances in computer and sensor technologies, we can expect robot applications for covering large areas that often require coordinated/cooperative activities by multiple robots. We focus on the cleaning tasks by multiple robots or by agents which are programs to control the robots in this paper. We assumed situations where agents did not directly exchange deep and complicated internal information and reasoning results such as plans, strategies and long-term targets for their sophisticated coordinated activities, but rather exchanged superficial information such as the locations of other agents (using the equipment deployed) for their shallow coordination and individually learned appropriate strategies by observing how much dirt/dust had been vacuumed up in multi-agent system environments. We will first discuss the preliminary method of improving the coordinated activities by autonomously learning to select cleaning strategies to determine which targets to move to clear them. Although we could have improved the efficiency of cleaning, we observed a phenomenon where performance degraded if agents continued to learn strategies. This is because so many agents overly selected the same strategy (over-selection) by using autonomous learning. In addition, the preliminary method assumed information given about which regions in the environment easily became dirty. Thus, we propose a method that was extended by incorporating the preliminary method with (1) environmental learning to identify which places were likely to be dirty and (2) autonomous relearning through self-monitoring the amount of vacuumed dirt to avoid strategies from being over-selected. We experimentally evaluated the proposed method by comparing its performance with those obtained by the regimes of agents with a single strategy and obtained with the preliminary method. The experimental results revealed that the proposed method enabled agents to select target decision strategies and, if necessary, to abandon the current strategies from their own perspectives, resulting in appropriate combinations of multiple strategies. We also found that environmental learning on dirt accumulation was effectively learned.
Introduction
Recent advances in robotics and computer technology have enabled many robot applications in a variety of real-world environments. Some of them have worked in places and situations that are difficult for humans to work in, such as those in disaster relief, demining, nuclear plant searches, and interplanetary exploration (e.g., [8,20]), while others have fulfilled daily or commonplace functions such as cleaning, security patrols and egg collection robotics (e.g. [15,22,23]). Robots require different capabilities depending on their tasks. For example, robots for demining must be capable of searching large areas without omissions, and those for cleaning or security must maintain cleanliness/safety by continuously moving around the specified areas. Cooperation by multiple robots is required in such applications because a single robot can only cover a limited area due to speed and movement restrictions and battery capacity. Even if one or a few robots stopped due to failure, multi-robots could continue to operate by compensating for the stopped robots. We aimed at cleaning and security patrolling by multiple autonomous robots in this study.
Robots have to visit environments at different frequencies in these applications. For example, cleaning (security) robots have to frequently visit regions that easily accumulate dust (regions that should be guarded at high level of security). Furthermore, robots need to be recharged periodically to ensure continuous performance due to finite battery capacity. The optimal method for coordinated activities by multiple robots for cleaning/patrolling is not trivial in such situations. Therefore, studies on agents, which are control software residing in individual robots to navigate them in environments with appropriate strategies by taking into account the amount of remaining battery life and the locations of other robots, are crucial to maximize their efforts in multiple robot contexts.
There have been a number of studies on agents in patrolling problems using single or multiple robots. Kurabayashi et al. [15] proposed a centralized off-line method in which a single server generated the entire route for a sweeping task. However, centralized control often incurs high computational costs according to the number of robots. Chevaleyre [6] formalized the patrolling problem as the traveling salesman problem by using multiple agents, but did not take into account the fact that agents visit the locations at different frequencies. Elor and Bruckstein [10] proposed a segmentation method of covering a region by integrating ant pheromone and balloon models. Ahmadi and Stone [3] also proposed a method that segmented the region of responsibility for individual agents while exchanging boundary information and this accounted for different event generation cycles. However, segmentation by only interacting with neighboring agents results in slow convergence, especially in large systems.
Agents move around in a shared region without segmentation in our proposed method, unlike those in these studies but with different target decision strategies to determine the next places that should be cleaned, thereby coordinating activities. Locations where dirt tends to accumulate differ in individual environments and are affected by the shape of the environments, the places and the numbers of people working during the day, and the number of obstacles. This makes it impossible to determine the appropriate combination of agent strategies in the design phase of the system. Agents in our approach try out different strategies and individually identify the most effective ones by reinforcement learning.
Another feature of our approach is that we do not assume deep coordination in which agents acquire (or estimate) internal reasoning information like long- and short-term plans of actions in other agents, analyze them, or determine appropriate local plans and actions, but assume shallow coordination meaning that agents within our context only receive or exchange superficial data such as their past and current locations and do not perform complicated reasoning about others. Since they only have a few resources including CPU with limited performance, limited amounts of memory and limited battery capabilities, we should avoid sophisticated coordination. Furthermore, it is often quite difficult to maintain wireless communication accuracy, and there is a trade-off between accuracy and power consumption. Small interruptions and noise (resulting in incorrect information) in wireless communications make sophisticated coordination impossible. This assumption of shallow coordination also fits environments where agents cannot directly communicate with one another but can obtain other’s locations via broadcasting with on-site equipment (e.g., infrared emitters on the ceiling as will be explained in the next section). This enables agents to significantly conserve battery life.
The main contributions of this paper are three-fold. First, we propose and discuss a method of reinforcement learning to autonomously identify appropriate strategies from their local viewpoints. Although we already discussed and proposed strategy learning in our previous paper [25] along the lines of the purpose discussed above, we carried out more thorough additional experiments to obtain more detailed results. This will be called the preliminary method after this. During the experiments using the preliminary method for evaluation, we found that agents in certain environments overly selected the same strategy (over-selection), resulting in the gradual reduction of the entire performance over time. Second, we therefore added self-monitoring functions with which agents could individually find degraded performance. Then, if they found the degradation, they could discard the current learning results to escape from situations of over-selection. Finally, we incorporated environmental learning to identify which places were easy to make dirty into the preliminary method for a wider range of applications.
The rest of this paper is organized as follows. Section 3 describes the model of agents and the problem domain and then clarifies the issues mentioned above. We explain a number of planning stages in which they select targets and generate the paths to them in Section 4. Section 5 first presents the preliminary method in which agents individually select one from given target selection strategies to coordinate them based on the learned results. Then, we propose a method that incorporates self-monitoring and environmental learning with the proposed method of learning. Finally, we report three experimental results. The first experiment demonstrated that agents using our preliminary method could achieve better performance in terms of minimizing the cumulative existence of dirt in an environment by a comparison with uniform regimes in which all agents adopted the same strategies. The second experiment revealed that agents with the proposed method could avoid degraded performance due to over-selection. Agents in the third experiment could adaptively select appropriate target selection strategies by adding autonomous learning of their environments even if they did not have prior knowledge about which places were easy to make dirty.
Related work
There have been a number of studies on cleaning and patrolling problems using single or multiple agents (robots). Ahmadi and Stone [2] formalized the problem of continuous area sweeping as a general form for a number of applications such as demining, floor cleaning, and security patrols by robots. They then also proposed a method in which an agent moved around in search of events that occurred with different probabilities. However, that study did not address collaborative movements with multiple agents. Kurabayashi et al. [15] proposed a centralized off-line method in which a single server generated the entire route for a sweeping task. The route was then divided into fragments and allocated to individual agents to minimize the working time. Stranders et al. [22] also developed a single-agent divide-and-conquer method for a continuous patrol by a team of agents. Although these approaches could generate near-optimal solutions, such centralized approaches incur high computational costs if the number of agents and/or the size of areas are large. More importantly, the failure of central components results in the standstill of all agents.
The patrolling problem can be formalized from a more theoretical perspective. For example, Chevaleyre [6] formalized the patrolling problem as a traveling salesman problem with multiple agents and then compared the number of cyclic routes and route division methods from the viewpoint of a minimal route length. Elmaliach et al. [9] proposed an algorithm that found the shortest Hamiltoniam cycle in grids that were used to patrol areas and to evenly spread agents there. However, it is costly to calculate appropriate paths in large and real-world environments. In addition, unlike ours, these studies did not consider the case of agents visiting locations at required and different frequencies.
We can consider two approaches to achieving coordinated and collaborative tasks from the multi-agent perspective. The first approach is to partition the area into subareas so that agents can divide labor. Ahmadi and Stone [3] extended their previous work [2] to cases with multiple patrolling agents. Their extended method segmented the region of responsibility for individual agents, which exchanged boundary information, and agents visiting a boundary region more frequently tended to take charge of the region. Voronoi-based techniques are also other methods that do not require graphic descriptions of environments, such as [5,7,21].
Bio-inspired computation models are also used to divide areas. For example, Elor and Bruckstein [10] proposed a method of segmentation to cover a regions by integrating ant-pheromone and balloon models; a region was segmented into subregions that were individually assigned to different agents. Each subregion was considered to be a balloon, the pressure of which represented the size of the subregion. The pressure values were then indirectly exchanged using the pheromone communication model. Ranjbar-Sahraei et al. [18] introduced indirect communication using pheromone-based stigmergic communications to identify the regions that should be covered. McCaffery [16] proposed a graph partitioning algorithm using foraging behaviors. The resulting subgraphs were allocated to agents so that they covered the whole environment. However, the implementation of pheromone communications in decentralized multiple-robot applications is not trivial. Furthermore, this method did not take into account differences in agent performance and environmental characteristics. Kato and Sugawara [14] proposed a method of area segmentation that reflected these differences, e.g., high-performance agents were allocated wider subareas as responsible regions, although it was not based on a bio-inspired model.
Another approach is where agents traverse a shared area along different/identical routes or with different exploration algorithms. For example, the theoretical approaches mentioned above [6,9] were based on this approach. Mead et al. [17] proposed team formation by local interactions with neighboring robots for search-and-rescue applications. Agmon and Peleg [1] developed an algorithm for robots to gather at a certain point. Fierro et al. [11] proposed a framework in which robot vehicles followed a prescribed trajectory while maintaining a required formation without explicit communication between them. The formation of robots is significant in a number of application domains; e.g., it can reduce omissions in search-and-rescue applications. In contrast, our approach focuses on autonomously but appropriately selected independent algorithms for their coordinated task, because robot formation is not necessarily required in our domain (continues cleaning). Sampaio et al. [19] proposed a gravity-based model in which locals that had not been visited for a long time had the stronger gravity, and thus, agents tended to visit such locations in uniform patrolling. However, our method is used to visit locations with different frequencies according to (priori given or learned) data about how easily locations become dirty.
Model and problem description
Application environments
Agents are on portable cleaner robots capable of autonomously deciding their actions and receiving messages in our applications. We assumed that agents knew their own and others’ locations. We consider an example system that consists of a number of agents equipped with indicators (such as infrared emission/reflecting devices) on the top and a receiver installed on the ceiling that identifies their locations and periodically broadcasts these data to all agents. Agents are periodically informed of the locations of the others so that they can identify which locations have recently been cleaned. Another example system is where the cleaner agents have communication devices but only exchange superficial data (including current and past locations). However, agents do not acquire deep knowledge such as the others’ plans and intentions because the use of these sophisticated data for reasoning requires considerable computational power and additional communications for coordination [4]. In addition, such sophisticated capabilities are often restricted due to limited CPU power and battery capacity. We call coordination using only superficial data shallow coordination.
We also assumed that agents would have a map (graph) of the environment, G. Its structure may often be unknown in real applications, but there are many algorithms to create the map (e.g., [12,24]). We introduced these assumptions because we focused on the autonomous learning of target decision strategies.
Model of agent and environment
Let
We probabilistically describe the ease of the accumulation of dirt, and
We consider two cases where agents know or do not know
Batteries in agents
A battery in agent i is denoted by
Performance measure
We used the cumulative existence duration of dirt at certain intervals of time as the performance measure to evaluate our proposed method. This is defined as
Planning in agents
Agents create plans for their behaviors in two stages: target decision and path generation. The agent determines in the former stage which target node
Target decision strategies
Agents determine their target nodes by taking into account (1) which node is expected to accumulate the largest amount of dirt in the environment and (2) which node is unlikely to be shortly visited by other agents. Agent i uses one of the following strategies to select the target node,
Random selection
Agent i selects the next target node from V randomly.
Probabilistic greedy selection (PGS)
For positive integer
Repulsive selection (RS)
Let
Balanced neighbor-preferential selection (BNPS)
The basic idea behind the balanced neighbor-preferential selection strategy is that, if there are dirty nodes in the neighborhood, the agent selects nearby nodes to clean, but otherwise selects farther dirtier nodes. Note that agents identify node v as dirty when
For agent i and time t, we define
In BNPS, i will select the next target node
Path generation strategies
Avoiding running out of batteries
After the target decision stage, agent i generates path to the target
For any node
Path generation
We considered two path generation strategies. The first is where agents select the shortest path from the current position to the selected targets (if there are multiple shortest paths, agents randomly select one). The second strategy, called the gradual path generation (GPG) method, is based on the idea that agents move along the shortest path, but if there are dirty places near the shortest path, they stop there and clean them. Agents in the GPG method generate the route as follows. Suppose that i has target
We found after our experiments that GPG always outperformed the simple shortest path strategy; thus, we only used GPG in the experiments described below. Note that we only assumed shallow coordination so that agents did not take into account the plans (paths and targets) of other agents. Hence, the target of a local agent may be cleaned by other agents before it arrives there.
Proposed method
Autonomous strategy selection
We propose a preliminary method in which agents identify the appropriate target decision strategies for coordinated cleaning tasks from the amount of dirt vacuumed up in the past by using reinforcement learning [25]. We called the proposed method the adaptive meta target decision strategy (AMTDS). Suppose that agent i selects next target
The values
The Q-value of s is updated as
While we were conducting the experiments to evaluate AMTDS, we found a phenomenon where overall performance gradually decreased over time in a certain environment. We will explain this in detail by referring to experimental results in Section 6.3.1. It was caused by many agents selecting the same strategy and/or strategies that had similar characteristics. We called this situation over-selection. Thus, we added the ability to self-monitor their local performance to agents. Then, if they found gradually degraded performance, agents discarded current learning results to escape from the situations and start to learn again (relearning).
We introduce moving average values, which are often used to identify raising/falling tendency in stock markets, to find the gradual changes in the amount of vacuumed dirt. When agent i returns to the base at time
Learning of dirt accumulation
Agents learn
Experiments
Experimental environments
We evaluated AMTDS in a simulation environment. G is defined as a
We prepared four environments for the experiments (Fig. 1) because different but interesting coordination regimes emerged. Dirt is accumulated uniformly in Env. (a). We define
We set
We assumed that one tick would approximately be 4 seconds, velocity would be 0.25 m/s, the maximum operation time would be 1 hour, and the maximum battery charging time would be 3 hours.
Parameters for target decision strategies

Experimental environments.
The agents in the experiments were assumed to be homogeneous: they always used the GPG method as a path generation strategy and used one of seven target decision strategies (R (random selection), PGS, RS, BNPS, AMTDS, AMTDS/RM, and AMTDS/RMLD). Agents in the case of R, PGS, RS, and BNPS always adopted a single target decision strategy, so we called these situations single strategy regimes. However, when they used AMTDS AMTDS/RM or AMTDS/RMLD, they independently selected one from R, PGS, RS, or BNPS based on local learning. We then compared the values of
Performance comparison
Improvement ratios with 10 agents,
Improvement ratios with 10 agents,
Improvement ratios with 20 agents,
Effects of agent populations in Env. (a),
Effects of agent populations in Env. (b),
Effects of agent populations in Env. (c),
Effects of agent populations in Env. (d),
Tables 4–7 summarize the effects of agent populations on overall performance. Values

Values of cumulative existence times of dirt in Env. (a).

Values of cumulative existence times of dirt in Env. (b).
We also investigated how performance improved through learning for coordination. Figure 6(1)–(4), which is the graphs of

Values of cumulative existence times of dirt in Env. (c).

Values of cumulative existence times of dirt in Env. (d).

Improvement in

Number of agents with each target decision strategy (
We investigated which target decision strategy agents with AMTDS selected through learning. Figures 7(1)–(4) plot the number of agents that adopted R, PGS, RS, or BNPS strategies over time when
Fig. 7(1) that plots the strategies agents with AMTDS adopted in Env. (a) indicates that RS was selected more than the others. Because dirt was located uniformly over the entire region, agents with the RS strategy could move around separately and vacuumed up more. In contrast, the agents with PGS and BNPS decreased; because they, especially those with BNPS, could not find the dirty nodes, they could not effectively vacuumed dirt up. However, as shown in Fig. 2, the single regime with the RS strategy did not exhibited better performance. This suggested that a small number of PGS strategy improved the performance and thus, the characteristics of the single and multiple strategy regimes are different.
More agents adopted the RS strategy in Env. (b), where dirt was accumulated near the walls, as shown in Fig. 7(2). The near-the-wall dirty area is relatively wide in Env. (b), and because agents with RS tended to move around separately in this area, they could achieve better coordination than that in Env. (a). After RS, they were likely to select PGS because these agents could move toward dirty nodes where were left without cleaning by the RS agents. Instead, the agents with the R and BNPS are few. In this environment, this result is consistent with that where the RS single regime exhibited the best performance (see Fig. 3). On the other hand, BNPS was the second best in the single regime structure (Fig. 3). The combination of the RS and PGS performed better and eliminated the benefit of BNPS.
The curves in Fig. 7(3) indicate that PGS and BNPS were selected more in Env. (c) than in Env. (b). Since Env. (c) has the cingular dirty region near the walls, like Env. (b), agents with the RS strategy performed well by moving around separately. Furthermore, it also has two block-shaped, small dirty regions but the RS strategy prevented multiple agents from intensively cleaning them. We think that agents with PGS and BNPS compensated for this situation. Particularly, BNPS can clean the local dirty region and is suitable to clean the block regions.
PGS and BNPS were likely to be selected as the target decision strategies by agents with AMTDS in Env. (d) as is shown in Fig. 7(4). Because this environment has more block-shaped dirty regions, more agents with PGS and BNPS that can effectively clean the block regions were required. Other nodes seemed to be cleaned by the agents with RS separately. Such a strategy regime structure is effective to work in Env. (d).
The discussion so far indicates that the appropriate strategy when all the agents adopted the same strategy was not always identical to the one that was most used by agents with AMTDS. For example, with the single regime of PGS or BNPS, all agents congregated around the dirty rectangular-shaped regions, which may have resulted in inefficient concentrations in Env. (d). However, with the AMTDS strategy, many agents adopted PGS and BNPS and a few selected RS. In Env. (b), many agents with RS clean there in whole and a few agents with PGS covered the nodes that were left without cleaning. These combinations seemed to outperform the cleaning with the single strategy regime. Agents with AMTDS autonomously selected the strategies which they thought that were the best from their viewpoints, but such viewpoints were affected by other agent’s decisions. However, such an autonomous decision sometimes lead to a concentration to a certain strategy, as discussed in the next section.
Experiment 2: Avoidance of degradation by over-selection.
Over-selection of strategies
If we look at Fig. 6(4) again, we can confirm that performance in Env. (d) improved until 60000 ticks but gradually worsened after that. Thus, let us look at Tables 2 and 3 carefully. They indicate that
Such an over-selection was caused by the block-shaped dirtier regions; in such an environment agents should gathered such relatively small regions to clean and thus, strategies PGS and BNPS were appropriate. On the other hand, agents with RS did not approach to these dirty regions if another agent was there, but this is not enough to clean the dirty regions. Hence, agents tended to use PGS and BNPS over time, but this also means other nodes were left for a long time without cleaning.

Improvement in

Strategies selected with AMTDS/RM (
We introduced the AMTDS/RM to ease such situations of over-selection. Thus, the purpose of Exp. 2 was to investigate whether or not it really could avoid such degraded performance. Figure 8(1) and (2) plots improvements to performance (
First, Fig. 8(2) indicates that no degraded performance was observed; performance improved around 50000 ticks as it did in the case of AMTDS, and after that, performance was still continuing to improve until 150000 ticks and changed. Finally, the value of
We also examined how the strategy regimes changed over time in these environments. The results are provided in Fig. 9(1) and (2). By comparing these with Fig. 7(3) and (4), the orders of selected strategies were identical but AMTDS/RM could ease the bias in the strategy regimes of both environments. For example, the numbers of AMTDS/RM agents adopting PGS in Env. (c) and those adopting PGS or BNPS in Env. (d) were fewer than AMTDS agents adopting these strategies. Hence, the self-monitoring in AMTDS/RM avoided degradation by easing over-selection. Note that the R strategy was selected slightly more often with AMTDS/RM than with AMTDS in both environments because a few agents discarded the learning results in turn and the chances of selecting the R strategy increased.

Improvement in

Strategies selected with AMTDS/RMLD (
We conducted Exp. 3 to examine how AMTDS/RMLD, which is our proposed method in this paper, could contribute to improving performance by learning dirt accumulation probabilities from the amount of vacuumed dirt. We set
First, we can find that initially (before 10,000 ticks), performance temporarily worsened, but soon recovered in all environments. Then, performance,
Before discussing the regimes in individual environments, we can see common features in these figures: agents were initially likely to select the RS strategy. While dirt accumulation learning was insufficient, agents did not identify easy-to-dirty locations (this knowledge was important in PGS and BNPS) and assumed that the environment was uniform. Thus, the RS strategy, which made agents operate separately, could vacuum more dirt. Of course, this regime would gradually change according to progress in learning except for Env. (a), which was a really uniform environment. Hence, agents continued to adopt the RS strategy in the regime in Env. (a) (see Fig. 11(1)). This regime structure was quite different from that with AMTDS (see Fig. 7(1)) where dirt accumulation probabilities were assumed to have been given. This means that the initial increase of RS and the relearning by self-monitoring made the strategy regimes converge to other structures although their final performance was comparable.
We can also see that in the other environments, the orders of the numbers of adopted strategies with AMTDS/RMLD were identical to those with AMTDS (and AMTDS/RM), but their biased selections were avoided by relearning through self-monitoring. However, a negative effect by this relearning on their performance in Env. (c) disappeared, as seen in Fig. 10(3). Furthermore, AMTDS/RMLD outperformed the other methods in Env. (d) as shown in Fig. 10(4). We think that these improvements were caused by the side-effect of learning dirt accumulation probabilities: Dirt accumulated more in locally isolated regions in these environments and agents independently found (but not all) a number of these dirty local regions, and tended to clean there selectively. Therefore, agents unintentionally entered divisions of labor to improve task efficiency. We believe that such a division of labor occurred in other environments, but it appeared prominently in Env. (d) since it has a number of locally dirty regions. Thus, they could effectively clean although their strategy regime structure were slightly different from those by agents with AMTDS. We think that this is quite an interesting phenomenon and would like to utilize this feature; however, we need more analysis and study to control it for this purpose.
Finally, our experiments were done in the simulated environments because we want to clarify the effectiveness of our proposed methods. However, actual environments have more issues. For example, it is difficult to navigate robots exactly between adjacent grid cells without other helper equipment such as a localizer that identifying the locations of robots as mentioned in Section 3.1. Communications among agents are not always possible and packets sometimes dropped. We have to address these issues to achieve the autonomous robots that can work in the real world environments in the future.
Conclusion
We proposed a learning method of target decision strategies for coordination in cleaning tasks by multiple agents without direct communication. We first modeled the problem domain of continuous cleaning by multiple agents in which the locations of all agents were broadcast in the environment but they could not directly communicate with one another about their plans/targets for coordination. Then, we defined the evaluation measure as the cumulative existence duration of dirt per tick. We introduced three learning methods. First, AMTDS made agents individually identify the appropriate target decision strategies. AMTDS/RM, which is combined AMTDS and relearning with self-monitoring of local performance, could avoid the biased selection of strategies (over-selection). Finally, AMTDS/RMLD additionally used environmental learning to identify which locations were easy to make dirty. The experimental results demonstrated that the proposed methods could outperform other regimes in which agents adopted a single strategy and could generate an effective regime with appropriate combinations of multiple strategies for coordination. They also indicated that agents could avoid over-selection and could successfully learn environments.
We intend to modify the conditions of communication in future work. For example, agents can directly communicate with one another but within a limited distance and packets may sometimes drop. Such communications issues are necessary to apply the proposed method to other sweeping and patrolling applications.
