Agent–based monitoring of the task scheduling in computational clouds

Abstract

The monitoring of the computational processes in highly distributed environments remains challenging in today’s High Performance Computing. In this paper, we define the agent-based cloud monitoring system for supporting the computational tasks scheduling and resource allocation. The system consists of two types of agents, which may decide about the initialization of the schedule execution and monitor the work of the cloud computational nodes. The decision about running the new scheduling process is based on the expected number of available computational units in the specified time window. The efficiency of the proposed MAS-based model was justified through 40 empirical tests, where clouds without and within the MAS support were compared. The multiagent system (MAS) effectiveness has been expressed in the average number of floating point operations completed at the cloud resources in one second. The obtained results show the importance of setting the optimal initial time for execution of the new schedule. Our experiments show that for running the new schedule, at least 25% of the computing units in the clouds should be in the idle mode. Also the batches of tasks should not be too large, cause the waiting time for new schedule for execution should be short and not greater than 10% of expected batch execution time.

Keywords

multiagent systems monitoring computational cloud autonomous agent batch scheduling

1 Introduction

Scheduling of the complex numerical workloads in computational clouds and efficient management of the cloud resources remain the challenging research and engineering tasks over the past decade. The maximization of the resource utilization and benefits of the cloud service providers and the resource owners are the key objectives of the scheduling. However, such aspects may be opposite to the expectations and specific end users’ requirements, such as privacy and data protection, security, system reliability, responsiveness of the service software applications, possible minimal cost of using the cloud services. The generation and successful execution of the optimal schedules should guarantee the trade–off among the needs and requirements of the cloud end users and benefits and expectations of the cloud managers and resource owners. The intelligent scheduling monitoring tools may efficiently support the executions of tasks, but also play the role of the recommendation systems for the improvement of the scheduling methods and algorithms.

Multiagent Systems (MAS) allow to provide the autonomous and distributed decisions of intelligent agents and their adaptation to the changes in the environment. The agents are located in the shared environment, perform the autonomous actions, may interact with each other and change the environment [7, 33]. The first MAS models were developed about 30 years ago as the the tools for solving the complex problems in massive open distributed systems. Collective intelligence is helpful in the specification of agent–to–agent and agent–to–environment relations. The MAS are also successfully applied for parallel implementations of the complex meta–tasks in computational grids and clouds [6 , 28].

The autonomous decisions and actions of the agents are the basic properties of agents in classical and today’s modern MAS models [7 , 31]. The intelligent agent, through his autonomous decisions, has a full control on his behaviour. It allows to specify the full characteristics of the agent, which is called the ”internal state” of the intelligent agent [22].

In the formal model of MAS developed by Wooldridge [31], the agent interacts with the environment and may change its status. The dynamic changes of the environment may have the impacts on the agent’s behaviour and his future decisions. Such influence of the environment may be positive or negative. Also the agent may take the wrong decisions. Therefore, the parameters of the MAS should be relevant to the considered problems in order to fulfill the agent’s individual goals and realize the tasks defined for the agents.

The main characteristics of the intelligent agents can be defined in the following way: proactivity – intelligent agents are able to make conscious decision in order to help each other to achieve the common goals without being instructed by the external authorities and users; reactivity – intelligent agents are able to take near–real–time decisions based on the notified changes of the environment or events which happened in their social neighbourhood; social capability – intelligent agents are able to communicate with the other agents for the realization of the common (and individual) tasks.

In this paper, we develop the agent-based monitoring system of the execution of schedules of computational tasks in the cloud environment. The tasks are grouped into batches and can be run independently of each other. There are two types of agents defined in our model denoted by Ag1 and Ag2. Each Ag1 agent monitors the progress in the execution of the tasks allocated at the cloud computational units and detect the idle units. Ag2 agent sends the queries to the agents Ag1 about the status of the run schedule. Based on the information collected from Ag1-s, Ag2 decides about the initialization of the execution of the new schedule and reports the idle units.

The main contributions of this paper are the following:

definition of the MAS model integrated with the cloud environment for monitoring the execution times of schedules of batches of tasks;

definition of the agents’ decision mechanism for setting the proper sizes of the batches of tasks and schedules and identification of the proper time points for starting the schedule execution;

definition of the method of tuning the parameters of the MAS model to the considered scheduling problem;

and finally, performing excessive numerical analysis for the validation the MAS-based cloud scheduling model.

The developed model is the second step in our research on the monitoring and recommendation systems supporting the cloud task and resource management. In our previous work, we implemented the Artificial Neural Network as the recommendation system for the resource management in computational grids [16, 17]. The developed MAS model is a simple tool for the improvement of the effectiveness of the cloud small clusters in the calculation of the batch schedules. We would like to extend this model for the online schedulers.

The rest of the paper is organized as follows. In Section 2 we present a short state–of–the–art critical analysis of the existing monitoring models based on the MAS paradigm, which are used for monitoring of the processes in the distributed computational environments. We justify the originality of our model. The basic concept of the MAS and definition of the considered scheduling problem are presented in Section 3. The new MAS–based model for monitoring of the execution of schedules in computational clouds is defined in Section 4. Developed model was experimentally evaluated in Section 5. We summarize our work and draw simple conclusions in Section 6.

2 Related work

The efficient monitoring of the schedule execution in the cloud system requires salable intelligent tools. Although there are many examples of such tools in the literature [4 , 30], the very few of them are based on the MAS model. In this section, we provide a simple survey of the selected MAS-based monitoring techniques and address pros and cons of their applications in concrete scenarios.

MonALISA monitoring agent system [11] is defined as a Large Integrated Services Architecture. The main idea of this model is based on the generation of a set of dispersed communicating agents that collect and analyze data necessary for optimizing task processing in the distributed computational environments. The agents’ actions are mainly gathering information from the other agents and environment, but the agents are not capable of making any complex decisions.

Zubok et al. present in [10] the agent-based system for monitoring dynamic cloud environments. There are three types of agents defined in their model. The agents are able to monitor the cloud system performance, record the incoming tasks, and monitor the status of the computing resources, which includes the times of tasks execution, memory usage and available disk space at the cloud servers. The presented agent system does not implement any intelligent methods for decision making processes.

In [8], M. Dhingry developed the monitoring models with a set of virtual machine agents in order to collect the information about the resource usage for single virtual machine (VM) implemented in a given physical server. The model has also a bare metal supervisor agent, which monitors non virtual cloud infrastructure. The third agent plays the role of the central authority in the whole monitoring systems. This agent collects information and data from the rest of the agents and make the decisions about mapping the tasks to the cloud servers.

Meera and Swamynathan in [27] define the model with a set of agents monitoring each VM in the cloud. The data collected from that monitoring is sent to the central agent. Agents do not make intelligent decisions in this model, just collect and deliver the monitoring data to the system administrators.

Another model is New Relic Digital Intelligence Platform defined in [2], where the implemented agents monitor the efficiency of the execution of the software applications run in the cloud. The system can be easily integrated with Amazon AWS, Azure, OpenStack or HP Cloud Services.

IDERA Uptime Cloud Monitor [18] can be used for cloud monitoring and the prediction of the potential system misbehaviour and disturbances.

All above models are agent–based models and may be used for monitoring of the execution of computational tasks processed in distributed computing systems. Together, with other popular monitoring systems such as Ganglia [26], Nagios [5], Opsview [32], CloudHarmony [1] or SPAE Server Monitoring [3], they may support the environmental operator decisions. However, in the surveyed models, the agents do not have to make any complex (intelligent decisions). Their behaviours do not necessary have an impact on the states of the environment, in which they are situated. Such agents focus rather on the monitoring of the utilization of the physical nodes in the the cloud’s IaaS (Infrastructure as a service) layer. This is just one aspect of the complex monitoring of the schedule execution. The results of the simple analysis of the existing models inspired us for the research on the new MAS monitoring system, where collective intelligence methods can be used for the characteristics of agent-environment and agent-agent interactions.

3 Backgrounds of the monitoring model

3.1 Multiagent systems

The model of intelligent software agent developed by Wooldridge in [33] and updated by Weiss [31, 33] is one of the most popular formal models of the multiagent systems. The model in its recent form defines the agents in the dynamic environment, which is the integral part of the whole system. The agents and the environments can interact and change each other.

Based on the model of Wooldridge, let us denote by $E = {e_{t_{i}}}, i = 0, 1, \dots, n - 1; n \in ℕ,$ (1)

the finite set of n states of the environment. The parameter e_{t
_i} defines the state of the environment in the time interval t_i, i = 0, …, n - 1. The set of all possible finite subsets of the environment E, which may be achieved in at most n time intervals is defined in the following way: $\tilde{E} = 2^{E}$ (2)

The agent’s environment is defined as 3–tuple: $E n v = 〈 E, e_{t_{0}}, τ 〉$ (3) where E – is the set of the states defined in (1), e_{t
₀} – the initial state of the environment, and τ – is the state transformation function.

An intelligent agent Ag is located in the environment Env, and can perform actions from the set Ac, where: $Ac = {α_{t_{0}}, α_{t_{1}}, \dots, α_{t_{n - 1}}} .$ (4)

The set of all possible finite subsets of Ac is the set of all possible finite sets of agent’s actions performed in the at most n time intervals and it is defined as follows: $\tilde{Ac} = 2^{Ac}$ (5)

The agent’s actions may change the states of the environment. Formally, the agent Ag may be represented by its action function in the following way: $Ag : R^{E} \to Ac,$ (6) where $R^{E}$ is the set of all finite sequences of the agent’s moves in the environment Env with the state of the environment in the tail (of sequence).

Let us denote by $R^{Ac}$ the set of all the sequences of the agent’s moves with the action in the tail. The state transformation function defined the mapping of the agent’s action on the state of the environment, i.e.: $τ : R^{Ac} \to 2^{E}$ (7)

The agent–environment interactions can be defined by the following procedure:

Step 0: Setting the initial state of the environment e_{t
₀}.

Step 1: The agent analyses the current state of the environment in time t₀ and makes the action α_{t
₀}.

Step 2: The result of the agent’s action is changing the state of the environment from e_{t
₀} to e_{t
₁}, which becomes the updated current state of the environment in time t₁.

Step 3: The agent performs his action α_{t
₁} in the time t₁, based on the updated state of the environment e_{t
₁}.

The steps 1–3 are repeated for the next time intervals t₂, t₃, until t_n-1.

The whole process of agent–environment interactions can be defined as the chain r of environmental states and agent’s actions in the sequence t₀, t₁, . . . , t_n-1:

$r : e_{t_{0}} \overset{α_{t_{0}}}{\to} e_{t_{1}} \overset{α_{t_{1}}}{\to} e_{t_{2}} \overset{α_{t_{2}}}{\to} \dots e_{t_{n - 1}} \overset{α_{t_{n - 1}}}{\to} e_{t_{n}}$ (8)

The following pair 〈Ag, Env〉 defines a multiagent system.

The agents have their own internal structure, which is used for recording the information about the changes in the environment. Let us denote by I_Ag the set of all internal states of the agent Ag. The function of the monitoring of the environment Env by the agent Ag is defined as follows: $see : E \to Per,$ (9) where Per is the nonempty set of the agent’s perceptions (interpretations of the environmental states – see: [34]).

Based on the agent’s perception of the environment the agent’s action now is defined as follows: $action : I_{Ag} \to Ac .$ (10) The action function maps the internal states of the agent into the set of his actions.

The function next, which updates the internal state of the agent is defined in the following way: $next : I_{Ag} \times Per \to I_{Ag} .$ (11)

3.2 Batch scheduling problem in computational cloud

Problem of task scheduling can be formally defined using the following notation: [29]: $A | B | C,$ (12) where

A – represents the available resources and system architecture,

B – the mode of tasks processing and its constraints,

C – the scheduling criteria.

Using the generic scheme, we may define the batch scheduling of idependent task in the following way [24]):

\begin{matrix} Rm, CM | {(batch / on-line), \\ (independent / dependent / workflow), \\ (static / dynamic)} | (objectives), \end{matrix}

(13)

where

Rm – the heterogeneous environment of hardware resources allowing parallel task processing;

CM – central management;

batch/on - line – tasks processing modes;

independent/dependent/workflow – tasks having or not having dependencies between them or the additional data flow necessary during tasks processing;

static/dynamic static environment with constant processing unit characteristics or dynamic environment that may change during scheduling process;

objectives – the main criterion for particular task – resource mapping.

In the paper we are considering independent batch scheduling mode [17 , 24]. Considered tasks are collected in batches. The numbers of tasks in the batches may change in the different time intervals. The tasks from a given batch are processed independently on available Virtual Machines (VMs). The number of VMs may vary in the system but remain constant during the execution of task: $\begin{matrix} Rm, CM | \\ {batch, independent, (static / dynamic)} \\ | (objectives) . \end{matrix}$ (14)

Therefore, each single scheduling problem may have different number of possible schedules to chose from.

To estimate the time that computing unit will be calculating particular task, the computing capacity vector for VMs in Giga Floating point Operations Per Second (GFLOPS) was introduced: $CC = [{cc}_{1}, . . ., {cc}_{m}]$ (15)

The tasks are characterized by workload vector expressed in Giga Floating point Operations (GFLO): $WL = [{wl}_{1}, . . ., {wl}_{n}]$ (16)

To calculate the schedules, the Expected Time to Compute (ETC)[sec.] matrix was defined in the following way: $ETC [j] [i] = {wl}_{j} / {cc}_{i}$ (17) that results to $ETC = [{wl}_{j} / {cc}_{i}]_{i = 1, 2 . . ., m}^{j = 1, 2 . . ., n}$ (18) for each of ith worker and jth task.

Task execution process may be summarized as [25]:

monitoring the number m of computing units, and their computing capacity vector;

collecting tasks and formulating the batch;

monitoring the number n of tasks in the batch and workload vector for formulated batch;

calculating schedule;

allocating tasks at computing units;

monitoring the execution of the schedule.

To find the schedule, the time difference between the start and finish of a batch of tasks (makespan) was considered: $C_{max} = min_{S \in Schedules} {max_{i \in Machines, j \in {Tasks}_{i}} \sum C_{j}},$ (19) where C_j is the completion time of j-th task, Tasks_i indicates the set of all tasks in particular batch that should be processed by processing unit number i, Machines is the set of all processing units and Schedules is the set of off possible schedules. ETC matrix is used with the current time indicator to calculate C_j.

The particular schedule is selected for the execution that minimizes the makespan of tasks from the formulated batch assuming given processing units characteristics, see [20, 24]. The makespan is one of the most common optimizing criterion for batch scheduling problem.

4 MAS–based monitoring system for the execution of the cloud schedules

In this section, we define our developed MAS–based model for monitoring the execution of the schedules of batches of tasks in the cloud environment. The main idea is based on the MAS model developed by Dobrowolski in [9]. The analysis of the MAS system performance and structure is provided through the decomposition of the whole system into the set of the local coordinated units (in many cases – single agents). Such decomposition allows to focus on the definition of the strategies and actions of the agents and integrate those components into the global model, instead of the global analysis of the whole complex system.

4.1 Generic model of the MAS monitoring system

The model of MAS used for monitoring of the batch scheduling problem is defined as the following structure: $MAS = {AG, ID, TP, LOC, ES, ACT, ST},$ (20) where:

AG – set of agents in MAS;

ID – set of individual agents’ ID parameters;

TP – set of agents’ types parameters;

LOC – set of agents’ location parameters;

ES – set of all possible states of the environment;

ACT – set of the agents’ actions taken in the environment;

ST – set of the agents’ strategies.

Each agent Ag is characterized by the following parameters: $AG ∋ Ag = {id, tp, k, γ, β, δ, α},$ (21) where:

id ∈ ID – individual agent’s ID parameter;

tp ∈ TP – type of the agent;

k ∈ K – actual agent’s knowledge;

γ – agent’s perception of the environment function – based on it, each agent may improve of his knowledge; $γ : ES \to M (K),$ (22) where: K is the set of all possible knowledge of the agent and $M$ is the space of the probabilistic measures defined in the set K.

β – strategy selection function: $β : TP \times K \to Z;$ (23) where Z is the set of the strategy ID parameters;

δ – agent’s decision function, which defines his action based on the agent’s strategy: $δ : ST \to M (ACT);$ (24)α – agent’s action function, which changes the states of the environment: $α : ACT \to M (ES) .$ (25)

Two types of agents are specified in the developed model, namely TP = {Ag1, Ag2}, where:

Ag1 – monitors the execution of tasks according to the defined schedule and number of idle computational units, in which the new tasks may be allocated;

Ag2 – receives the reports from the agents Ag1 and defines the start time of the execution of the next batch of tasks (another new schedule).

We assume in the model presented in this paper, that just one agent Ag2 is generated.

The main characteristics of the above MAS model are the restrictions of the continuous monitoring of the agents’ locations¹. There are two such locations defined in the system: $LOC = {compunit, envmgmt},$ (26) where compunit denotes the location parameter of the cloud computational server (unit) in the physical layer of the cloud, and envmgmt – the cloud management unit. In this unit, the realization of the schedulers is initialized. Agent Ag2 is located in envmgmt, while the Ag1 agents in compunit.

The set of the agents’ strategies ST contains the strategies defined for the agent Ag2 and several Ag1 agents, i.e.: $ST = {{st}_{1}, {st}_{2}} .$ (27) where st₁ – is the general strategy of the agents Ag1, which monitor the execution of the schedule and send the information about it to the agent Ag2, and st₂ – is the strategy of Ag2, which generate the start time of the execution of the new schedule based on the specified parameters θ₁ and θ₂ related to the computational units working in the specified time points in the scheduling process. Both parameters are defined in details in 4.1.1.

Each strategy generates the set of possible actions of the agent of a given type. Formally, the set of the agents’ actions in the proposed MAS model can be defined in the following way: $ACT = {ACT}_{{st}_{1}} \cup {ACT}_{{st}_{2}}$ (28) where ${ACT}_{{st}_{1}} = {getstage, informstage};$ (29) and ${ACT}_{{st}_{2}} = {gentimepoint, startsched} .$ (30)

The interpretation of the elements of both sets ACT_{st
₁} and ACT_{st
₂} is as follows:

getstage – the update of the Ag1 agent’s knowledge about the execution of the schedules based on the values of the environment monitoring function;

informstage – Ag1 reports to the agent Ag2 the progress in the realization of the tasks in the schedule;

gentimepoint – Ag2 initialize the start time of the execution of the schedule based on the data received from all agents Ag1;

startsched – Ag2 starts the execution of the generated schedule.

4.1.1 Monitoring of the states of cloud computational units

The monitoring of the work of the cloud computational units (VMs) during the scheduling process is based on the agents’ strategies of the generation of the time points in the process of execution of the schedules t_k (k ≥ 1). Each time point parameter defines the start time of the execution of tasks from the schedule at the cloud servers. Agents Ag1 are defined for each cloud unit (VM) and monitors the execution of tasks assigned to that unit. Ag2 quests all agents Ag1 and asks for the information about the execution times of particular tasks from the batch at the units associated with the concrete Ag1 agents. Then, agent Ag2, based on the collected data from the agents Ag1, defines the start time of the execution of the schedule.

Let us denote by the ordered set of times of completing the tasks in the schedule at the time points t_k. The procedure of the generation of the time points by MAS has the following two main parameters:

θ₁, θ₁ ∈ {1, 2, …, m} – this parameter defines the number of the computational units, which must be in the idle mode in order to generate of the new time point;

θ₂, θ₂ ≥ 0 – this parameter defines the maximal waiting time for finishing all tasks in the schedule.

MAS generate new time point t_k based on the following relation:

$\begin{matrix} t_{k} = t_{k - 1} + \\ max {v_{θ_{1}, k - 1}^{sort}, v_{θ_{1} + 1, k - 1}^{sort} 1 \\ {v_{θ_{1} + 1, k - 1}^{sort} - v_{θ_{1}, k - 1}^{sort} \leq θ_{2}}, v_{m, k - 1}^{sort} 1 \\ {v_{m, k - 1}^{sort} - v_{θ_{1}, k - 1}^{sort} \leq θ_{2}}}, k \geq 1 . \end{matrix}$ (31)

In the case of finishing work by the the computational units approximately in the same time, θ_k coordinates of the vector $V_{k}^{sort}$ are 0, and the

$\sum_{j = θ_{1} + 1}^{m} 1 {v_{j, k - 1}^{sort} - v_{θ_{1}, k - 1}^{sort} \leq θ_{2}}$ (32) indicates number of coordinates of the same vector is in the range of $(0, θ_{2} + v_{θ_{1}, k - 1}^{sort}]$ . Let us denote by $I_{k}$ the set of computational units, for which the delays in the tasks execution are not greater than $θ_{2} + v_{θ_{1}, k - 1}^{sort}$ at the time t_k. The set $I_{k}$ can be defined in the following way:

$I_{k} = {i : v_{ki} \leq θ_{2} + v_{θ_{1}, k - 1}^{sort}, 1 \leq i \leq m}$ (33) The cardinal of the set $I_{k}$ is even and is estimated as follows: $(\sum_{j = 1}^{m} 1 {v_{j, 0}^{sort} - v_{θ_{1}, 0}^{sort} \leq θ_{2}}) \in 2 N$ .

Only the units from the set $I_{k}$ are active in the execution of the schedule at the time point t_k (all of them are idle in t_k).

The values of the parameter θ₁ are the fractions of the active computational units working in the agent environment, i.e.: $θ_{1} = Ω_{1} % \cdot m,$ (34) where Ω₁ ∈ (0, 100].

The parameter θ₂ was calculated based on the percentage of the average mean loading of 1 GFLOPS computing capacity. This calculation was made assuming the requested workload from currently processed batch and available computing capacity of the idle resources: $θ_{2} = Ω_{2} \cdot \frac{\sum_{j = 1}^{n} {wl}_{j}}{\sum_{i = 1}^{m} {cc}_{i}},$ (35) where ${wl}_{j}^{s}$ is the number of floating-point operations that must be performed to accomplish the task, cc_i is the computing capacity of unit number i, and Ω₂ ∈ (0, 100] is the v percentage of the mean load of 1 GFLOPS computing capacity.

5 Evaluation of developed model

The aim of the presented numerical experiment is to examine the environment behaviour considering the different number of computing units, diversified load and several sets of parameters. The basic design of the system monitoring depends on the parameters described by Equations (34)–(35). Presented numerical tests show the influence of the parameter setting on the environmental performance that is measured by the for the number of floating-point operations performed in one second.

The evaluation was conducted under the dedicated simulator of a computational cloud. The implementation is based on C++11 programming language, FastFlow framework and POSIX library. The simulator was build in Microsoft Visual Studio 2017 environment, The source code was optimized and compiled using Microsoft C/C++ compiler for 64-bit platforms.

Specification of the simulated environment is presented in Table 1. For presented experiments, we have used 2 characteristics of tasks presented in Table 2. The schedule was generated by the evolutionary algorithm specifically described in [15 , 21].

Table 1
Characteristics of computational units

Set I Set II

Number of units 40 200

Measure cc GFLOPS

Probability distribution cc $N (6; 4)$

Mean value 6

Standard deviation 2

Minimal value 2

Maximal value 10

	Set I	Set II
Number of units	40	200
Measure cc	GFLOPS
Probability distribution cc	$N (6; 4)$
Mean value	6
Standard deviation	2
Minimal value	2
Maximal value	10

Table 2

Characteristics of tasks

	Batch I	Batch II
Maximum number of tasks	4 000	20 000
Measure	GFLO
Probability distribution wl	Exp (0, 0002)
Mean Value wl	3 000
Standard deviation wl	3 000
Minimal value wl	2 000
Maximal value wl	28 000

First scheduling starts with all available computational units and with the whole batch of tasks ready to process. Characteristics of the tasks are presented in the Table 2. The next time points are generated based on agents’ decisions. The number of tasks in one batch is strictly proportional to the number of available computational units. It can be defined by: $n = Ω_{1} \cdot n_{\max},$ (36) where n_max is a maximum number of tasks presented in the Table 2.

The experiments were conducted under 40 different tests with the parameters Ω₁ and Ω₂, different characteristics of tasks batches and sets of computational units. Each test consists of 4 schedule execution processes in a sequence. Each experiment was repeated 100 times under the same configuration of parameters. The considered test scenarios are presented in Table 3 and Table 4. The value of θ₂ was determined by using eq. (35).

Table 3

Characteristics of scenarios for Ag2

Ω₁:	25%, 50%, 75%, 100%
Ω₂:	0%, 5%, 10%
Sets:	I and II
Batches:	I and II

Table 4

Values of θ₁ parameter

Values of Ω₁	Set I	Set II
parameter
25%	10	50
50%	20	100
75%	30	150
100%	40	200

The results of conducted experiments are presented in Figures 2–4. Each graph is described by a scenario, and values of Ω₁ and Ω₂. The results obtained for Ω₁ = 100 are achieved in the case, that cloud system was not monitored by the agents Ag1 and Ag2.

Fig. 1

Work flow in distributed computational environment supported by MAS–based monitoring system.

Fig. 2

Cloud units performance for each agent strategy and scenario.

Fig. 3

Box plots presenting average idle times of computational units for each agent strategy and scenario.

Fig. 4

Average idle times of computational units divided by a time spent for processing all batches of tasks for each agent strategy and scenario.

First, the generated schedules were compared in terms of the number of floating-point operations performed in one second within the entire available infrastructure (i.e. the average efficiency of the entire infrastructure). The main goal of the agents’ strategies was to determine such moments in the execution of the tasks that would allow better use of the available (i.e. idle) infrastructure.

Best results for the number of floating-point operations performed in one second were achieved for the third strategy (see: Fig. 2), when at least 25% of the computing units were idle and waiting time was equal 10% of average time for execution whole batch (see: eq. 35). In comparison to the not monitored environment, the third strategy allowed the improvement of the order of 26.14% for the scenario I, 25.14% for scenario II, 64.3% for scenario III, 63.61% for scenario IV.

A larger dispersion of values for cases supported by agents should also be noticed. On the one hand, this indicates less predictability of the generated schedules, and on the other, it can provide an opportunity to develop solutions (scheduling algorithms) that allow for better results.

The presented results also indicate a better use of computing power for smaller infrastructures – environments consisting of 40 computing units were characterized by approximately 20% better efficiency.

Two next Figures (3, 4) shows average idle times of computational units. Fig. 3 presents the distribution of idle values of individual computational units. The highest values are characterized by the unsupervised strategy, while for all supported scenarios the third strategy obtains the best results. The third strategy allows for the maximal reduction of idle time.

Fig. 4 shows the most important results of idle times savings. The figure presents average idle times divided by a time spent for processing all batches of tasks. It shows the profit of using agents strategies. In the case of scenarios number I and II, each unsupported computational unit was idle for about 40% of the running time. The best strategy minimized this time by about 13–14%, the rest from about 2.6% to 11%.

For large–scale environment (scenario III and IV), the profit resulting from the use of agent supervision was even greater. For unsupported strategy, the average idle time reached nearly 60% of the running time. As in the previous scenarios, the best results were achieved for the third strategy (Ω₁: 25, Ω₂: 10) – idle time equal to 36.86% and 37.18% for strategy III and IV respectively. An improvement of 22–22.5% was achieved in this way. The remaining strategies have reduced the average idle time from 13.2% to 20.84%.

Presented MAS model does not depends on the methods of generation of the schedules. The application of strategies of Ag1 and Ag2 agents gives good results in the case of high idle time of computational units.

Comparing to the systems mentioned in Section 2, the actions of the agents are more complex. They are taking intelligent decisions and they are able to change the environment. Moreover, the presented MAS is not monitoring all the processes passively. The collective intelligence of them supports the decisions to be taken during task scheduling and processing.

6 Conclusions

In this paper, we defined the multiagent system consisting of two types of agents. The proposed model supports the efficient execution of the schedules generated in distributed computing environments such as computational clouds. Those agents are making decisions about the starting time of batch execution. Additionally, they are monitoring the progress of the task calculations.

The agents Ag1 and Ag2 supports resource management by generating the optimal time points when the execution of the new schedules should be started. The decision is based on the number of actually available computational units and the earliest time when the new units may be available.

The proposed model was evaluated in 40 different numerical tests. The results were compared with the cloud environment without the MAS monitoring system. The best results were obtained for the strategy in which the next scheduling process was started when at least 25% of the computing units were idle, and the waiting time for new units was less or equal 10% of expected batch execution time. Similarly, the best results in terms of using the environmental infrastructure were obtained for the same strategy. The obtained results showed the significant impact of choosing the moment of the scheduling process on the infrastructure performance.

We plan to extend the model to the more complex scheduling scenarios and wider cloud infrastructure with many Ag2 agents. For that, we need to define the novel ontologies for better modelling the agents’ social behaviours and interactions.

A very interesting development of the presented research may be changing the uniqueness of agent number two. Introducing several agents of that type may decrease the probability of system failure. Such decentralization of responsibility may increase the resiliency of the system.

Footnotes

Acknowledgment

The contribution of Dr Daniel Grzonka was supported by the Foundation for Polish Science (FNP).

In general MAS model, the locations of agents do not have to be controlled by the other agents or the MAS monitoring unit.

References

CloudHarmony. https://cloudharmony.com.

New Relic Digital Intelligence Platform. https://newrelic.com.

Security Performance Availability Engine Server Monitoring. https://shalb.com/en/spae/.

Aceto

, Botta

, De Donato

and Pescapè

, Survey cloud monitoring: A survey, Comput Netw 57(9) (2013), 2093–2115.

Barth

, Nagios: System and network monitoring. No Starch Press, 2008.

Cetnarowicz

, Koncepcja strategii rozdzialu zadan w zdecentralizowanych strukturach wieloprocesorowych, Zeszyty Naukowe AGH, Elektrotechnika 8(3-4).

Cetnarowicz

, Paradygmat agentowy w Informatyce: Koncepcje, podstawy i zastosowania. Akademicka Oficyna Wydawnicza EXIT, 2012.

Dhingra

, Lakshmi

and Nandy

S.K.

, Resource usage monitoring in clouds, In 2012 ACM/IEEE 13th International Conference on Grid Computing, 2012, pp. 184–191.

Dobrowolski

, Technologie agentowe w zdecentralizowanych systemach informacyjno-decyzyjnych, volume 107 of Rozprawy. Monografie, AGH Uczelniane Wydawnictwa Naukowo-Dydaktyczne 2002.

10.

Zubok

D.A.

et al., A multi-agent approach to the monitoring of cloud computing system with dynamically changing configuration, In 2016 18th Conference of Open Innovations Association and Seminar on Information Security and Protection of Information Technology (FRUCT-ISPIT) (2016), pp. 410–416.

11.

Legrand

et al., Monalisa: An agent based, dynamic service system to monitor, control and optimize distributed systems, Computer Physics Communications 180(12) (2009), 2472–2498.

12.

Fatema

et al., A survey of cloud monitoring tools: Taxonomy, capabilities and objectives, Journal of Parallel and Distributed Computing 74(10) (2014), 2918–2933.

13.

Alhamazani

et al., An overview of the commercial cloud monitoring tools: Research dimensions, design issues, and state-of-the-art, Computing 97(4) (2015), 357–377.

14.

Franklin

and Graesser

Is it an agent, or just a program? A taxonomy for autonomous agents. In Proceedings of the Workshop on Intelligent Agents III, Agent Theories, Architectures, and Languages, ECAI’96, (1997), pp. 21–35. Springer-Verlag.

15.

Grzonka

, Jakobik

, Kolodziej

and Pllana

, Using a multi-agent system and artificial intelligence for monitoring and improving the cloud performance and security, Future Generation Computer Systems 86 (2018), 1106–1117.

16.

Grzonka

, Kolodziej

and Tao

Using artificial neural network for monitoring and supporting the grid scheduler performance. In 28th European Conference on Modelling and Simulation, ECMS 2014, Brescia, Italy, May 27-30, (2014), pp. 515–522.

17.

Grzonka

, Kołodziej

, Tao

and Ullah

, Khan, Artificial neural network support to monitoring of the evolutionary driven security aware scheduling in computational distributed environments, Future Generation Computer Systems 51 (2015), 72–86.

18.

IDERA. Uptime Cloud Monitor. https://www.idera.com/infrastructure-monitoring-as-a-service.

19.

Jakóbik

, Grzonka

and Kołodziej.

Security supportive energy aware scheduling and scaling for cloud environments. In European Conference on Modelling and Simulation, ECMS 2017, Budapest, Hungary, May 23-26, 2017, Proceedings., pp. 583–590.

20.

Jakóbik

, Grzonka

, Kołodziej

, Gonzalez-Velez

Towards secure non-deterministic meta-scheduling for clouds. In 30th European Conference on Modelling and Simulation, ECMS 2016, Regensburg, Germany, May 31 - June 03, 2016. Proceedings., pp. 596–602.

21.

Jakobik

, Grzonka

and Palmieri

, Non-deterministic security driven meta scheduler for distributed cloud organizations, Simulation Modelling Practice and Theory, 76 (2017), 67–81. High-Performance Modelling and Simulation for Big Data Applications.

22.

Kisiel-Dorohinicki

, Agentowe architektury populacyjnych systemow inteligencji obliczeniowej, volume 269 of Rozprawy. Monografie. AGH Uczelniane Wydawnictwa Naukowo-Dydaktyczne, 2013.

23.

Kołodziej

, Khan

S.U.

and Xhafa

Genetic algorithms for energy-aware scheduling in computational grids. In 2011 International Conference on P2P, Parallel, Grid, Cloud and Internet Computing, (2011), pp. 17–24.

24.

Kołodziej

, Evolutionary Hierarchical Multi-Criteria Metaheuristics for Scheduling in Large-Scale Grid Systems. Springer Publishing Company, Incorporated, 2012.

25.

Kołodziej

, Ullah Khan

, Wang

, Byrski

, Min-Allah

and Ahmad Madani

, Hierarchical genetic-based grid scheduling with energy optimization, Cluster Computing 16(3) (2013), 591–609.

26.

Massie

Matthew L.

, Chun

Brent N.

and Culler

David E.

, The ganglia distributed monitoring system: Design, implementation, and experience, Parallel Computing 30(7) (2004), 817–840.

27.

Meera

and Swamynathan

, Agent based resource monitoring system in iaas cloud environment, Procedia Technology 10 (2013), 200–207. First International Conference on Computational Intelligence: Modeling Techniques and Applications (CIMTA) 2013.

28.

Nawarecki

and Cetnarowicz

, A concept of the decentralized multi-agent RT system, In Proceedings of the International Conference Real TimeâĂŹ, (1995), pp. 167–171.

29.

Pinedo

Michael L.

, Scheduling: Theory, Algorithms, and Systems. Springer Publishing Company, Incorporated, 3rd edition, 2008.

30.

StuartWard

and Barker

, Observing the clouds: A survey and taxonomy of cloud monitoring, Journal of Cloud Computing 3(1) (2014), 24.

31.

Weiss

Multiagent Systems, The MIT Press, 2013.

32.

Wijntje

Monitoring with Opsview, Packt Publishing Ltd, 2013.

33.

Wooldridge

An Introduction to MultiAgent Systems. Wiley, 2009.

34.

Wooldridge

and Dunne

Paul E.

, The Computational Complexity of Agent Verification, (2002), pp. 115–127. Springer Berlin Heidelberg.

Agent–based monitoring of the task scheduling in computational clouds

Abstract

Keywords

1 Introduction

2 Related work

3 Backgrounds of the monitoring model

3.1 Multiagent systems

4.1 Generic model of the MAS monitoring system

Table 1 Characteristics of computational units Set I Set II Number of units 40 200 Measure cc GFLOPS Probability distribution cc N ( 6 ; 4 ) Mean value 6 Standard deviation 2 Minimal value 2 Maximal value 10

Footnotes

Acknowledgment

References

Table 1
Characteristics of computational units

Set I Set II

Number of units 40 200

Measure cc GFLOPS

Probability distribution cc $N (6; 4)$

Mean value 6

Standard deviation 2

Minimal value 2

Maximal value 10