Abstract
Regarding the ever-increasing development of data and computational centers due to the contribution of high-performance computing systems in such sectors, energy consumption has always been of great importance due to CO2 emissions that can result in adverse effects on the environment. In recent years, the notions such as “energy” and also “Green Computing” have played crucial roles when scheduling parallel tasks in datacenters. The duplication and clustering strategies, as well as Dynamic Voltage and Frequency Scaling (DVFS) techniques, have focused on the reduction of the energy consumption and the optimization of the performance parameters. Concerning scheduling Directed Acyclic Graph (DAG) of a datacenter processors equipped with the technique of DVFS, this paper proposes an energy- and time-aware algorithm based on dual-phase scheduling, called EATSDCDD, to apply the combination of the strategies for duplication and clustering along with the distribution of slack-time among the tasks of a cluster. DVFS and control procedures in the proposed green system are mapped into Petri net-based models, which contribute to designing a multiple decision process. In the first phase, we use an intelligent combined approach of the duplication and clustering strategies to run the immediate tasks of DAG along with monitoring the throughput by concentrating on the reduction of makespan and the energy consumed in the processors. The main idea of the proposed algorithm involves the achievement of a maximum reduction in energy consumption in the second phase. To this end, the slack time was distributed among non-critical dependent tasks. Additionally, we cover the issues of negotiation between consumers and service providers at the rate of μ based on Green Service Level Agreement (GSLA) to achieve a higher saving of the energy. Eventually, a set of data established for conducting the examinations and also different parameters of the constructed random DAG are assessed to examine the efficiency of our proposed algorithm. The obtained results confirms that our algorithm outperforms compared the other algorithms considered in this study.
Keywords
Introduction
The growth of applications requiring a high amount of energy has urged the academic community to investigate them concerning several issues such as economic aspects, environmental challenges, low reliability and availability [1]. Estimations indicate that the total power consumption in the information and communication technology (ICT) industry is over 868 billion kWh, which accounts for 5.3% of the total electricity used across the globe in 2008 [2]. Due to the ever-increasing technological advancement, it is expected that the total power demand will increase up to four times by 2025. Carbon emissions related to the energy consumption may potentially increase the cooling cost. Any boosts in the usage of efficient computing systems will end in a very hot temperature caused by the increased generated power for fueling them. It has been proven that some failures can be observed up to two times with the heat rising of 10°C; this affects the systems’ reliability and availability, which may result in severe damages [3].
Reliability refers to the ability of a system to perform its intended function successfully, for a specified period of time, under predetermined conditions. Accordingly, the power consumption for running and scheduling parallel tasks has heightened concerns in the last decade. Two specific approaches to energy conservation are dynamic power management (DPM) [4] and dynamic voltage frequency scaling (DVFS) [5]. The former usually aims to deactivate idle nodes due to their long durations. The DPM approach shows no participation in short idle durations events due to a quick runtime between each task. On the other hand, in the DVFS approach, due to the reduction of power consumption, the tasks receive a specified frequency. Due to the power consumption and other effective parameters, scheduling parallel tasks with limited prioritizing, i.e., directed acyclic graph, is of high concern in both homogenous and heterogeneous computing environments (e.g., cloud data centers). This issue raises one of the hard optimization problems [6] for which many heuristic algorithms have been put forward in the literature. Duplication and clustering strategies, as well as DVFS, have been found capable of reducing energy consumption. They focus on the optimization of performance parameters, including throughput and makespan. The propagation of several specific tasks and running the cluster ones occur on multi-processors to reduce the communication overheads among the processors equipped with DVFS.
The main contributions of this paper are summarized as follows: (1) We develop formal models for parallel tasks and an energy aware cloud and we also define the task scheduling issue, (2) We develop a novel energy-aware scheduling heuristics algorithm for parallel tasks called EATSDCDD, which can be adapted for a wide range of DAG applications, (3) We present the green SLA use scenarios and propose a new scheduling heuristics for energy aware parallel task scheduling, which makes a study of the tradeoff between the energy consumption and task execution time, (4) Unlike the most existing studies for DAG applications, we try to allocate a slack to tasks belonging to the maximum independent set of a task.
The remainder of this paper is organized as follows. Section 2 introduces background and related work. Section 3 introduces the system model and formulates the estimation scheduling problem model and Petri Net model. In Section 4, the proposed EATSDCDD algorithm is described and its performance is evaluated. Section 5 gives the Tracing algorithms on example DAG. Section 6 gives the experimental setup and simulation results comparing them with those of other algorithms, in both homogeneous and heterogeneous environments. Section 7 presents the Green Service Level Agreement. Finally, Section 8 concludes the whole paper.
Related work
The recent advancement in ICT has resulted in continuous growth of cloud data centers and computing centers, hence increasing the energy consumption and, consequently, negative impact on the environment through generation of greenhouse gases and excessive emission of CO2 [7]. In recent years, great efforts have been made focusing on developing parallel task scheduling algorithms for distributed platforms such as clusters, girds, and clouds. These studies have achieved good results in energy and performance improvements using List scheduling [8], DVFS techniques [1, 9-11], Clustering-based scheduling [1, 8], Heuristic algorithms [12-18] and Duplication-based scheduling [1]. This section gives a brief review of various existing task scheduling algorithms that mainly consider the improvement of energy efficiency and optimization of the performance parameters in distributed computing.
Energy reduction via dynamic voltage frequency scaling
The DVFS technique is one of the most efficient ways of providing significant energy saving for processors through simultaneous minimization of frequency and supply voltage for slack time slot of tasks as well as communication and idle phases [19, 20].
Wang et al. [8] employed an energy-aware scheduling heuristic algorithm called PALS and PATC to simultaneously reduce makespan and energy consumption for scheduling parallel tasks in a cluster using the DVFS technique. After determining the critical and non-critical paths, the algorithm assigns jobs in the critical paths to processors with the highest voltage/frequency. Then, the slack time of each job is calculated in the non-critical paths, and the voltage/frequency of the assigned processors is scaled down to process the non-critical jobs. This strategy mitigates energy consumption without increasing makespan.
Another approach to scheduling tasks was proposed by Wu et al. [21] to reduce energy consumption using the DVFS technique. This technique was adopted to dynamically control the frequency and voltage of cloud computing servers. The scheduling algorithm takes into account the maximum job (Fmax) and minimum job (Fmin) frequencies given to each job and multiple server Si running at maximum Si (Fmax) and minimum Si (Fmin) frequencies. For specific jobs, the scheduling algorithm efficiently assigns proper servers that run between (Fmin, Fmax) to jobs based on the requirements of job frequencies.
Juarez et al. [22] proposed a real-time dynamic scheduling method, called Multi-heuristic Resource Allocation (MHRA), for efficient execution of task-based applications on a distributed computing platform of cloud computing. This served to mitigate energy consumption and makespan. This method involved a polynomial time algorithm combining a set of heuristic rules and resource allocation techniques. To balance the two-objective function, a weight factor, α, was introduced 0 ≤ a ≤ 1 by which the user can specify the significance of each objective.
Hu et al. [3] developed the EASLA algorithm to schedule parallel applications through the DVFS technique, while maintaining the SLA on a cluster platform. The main idea behind the EASLA algorithm is to allocate each slack to a maximum set of independent tasks for each task using a compatible task matrix and scale frequencies down in order to minimize energy consumption within certain extension rate of makespan mutually accepted by user and service provider. Some researchers [23-25] proposed new task scheduling in distributed computing systems using the DVFS, clustering and duplication technique. Moreover, some authors [26-28] proposed new task slack time algorithm for task scheduling in distributed computing systems using the DVFS technique. Vydyanathan [29] proposed a multi objective task scheduling algorithm using the clustering and duplication technique.
Energy reduction via duplication, clustering, and list scheduling
A list scheduling algorithm maintains a list of all tasks of a given DAG according to their priorities. List scheduling has two phases: selecting the tasks with the highest priorities and selecting suitable processors. Heterogeneous Earliest Finish Time (HEFT) [16] is a well-known list-scheduling algorithm. HEFT operates in two phases. Prioritizing tasks and assigning tasks to processors. Mohamed et al. [30] proposed a modified non-square direct matrix converters (MC) for renewable microgrid applications using generative adversarial networks (GAN). They also provided a new application of the non-square direct MC in the microgrid system, which was able to provide a balanced output with any desired amplitude and frequency under unbalanced conditions, specifically in the case of using renewable energy sources such as photovoltaics (PVs) and wind turbines.
Clustering heuristics lower considerably the communication costs of data transferring to almost zero by propagating the tasks to the specified processors. Furthermore, task clustering helps to schedule based on the reduction of the active power of processors and power consumption [8]. Yang and Apostolos [31] proposed the DSC (Dominant Sequence Clustering) algorithm for Critical path task scheduling in homogeneous distributed computing systems using EST.
Mohamed et al. [32] proposed an energy management scheme considering five different agents, including an energy hub, a networked multi-microgrid with three agents, and a transportation system. Additionally, they developed an effective transportation system model within the smart island to save energy and minimize the operation costs.
The current increase in the chip density and improper reduction of CPU voltage in cloud data centers will certainly increase the frequency of errors during the execution of the workflow. As a result, this problem will occur in the timely completion of workflow programs, which raises serious concerns about the use and maintenance of data in cloud data centers. In this regard, Wu et al. [33] an error-conscious energy-saving work planning approach for workflow applications in DVFS-enabled cloud data centers. This approach resulted more than 30% energy savings while providing both reliability and completion time.
Li et al. [34] use reverse fault recovery, running a program with uniform frequency or high and low of target frequency, in the absence of desired frequency. This not only consumed less energy, but also led to the highest capability. They compared the strategy of running the program in real time with those proposed in recently-published articles. Experimental results proved that their proposed method can save up to 12% more energy than other methods without compromising the reliability of the program and the implementation deadlineb.
Safari and Khorsand [35] proposed an energy-aware scheduling algorithm for time-limited workflows by means of the DVFS method, in which the host reduces the operating frequency using different voltage levels. The purpose was to reduce energy consumption and customer service contract (SLA) violations and to improve resource utilization. The simulation results showed that their proposed method performs better when evaluating criteria such as energy use, average execution time, use of average resources, and relative breach of customer service contract.
Today by minimizing the energy of applications with a circular graphical structure in a heterogeneous cloud system, energy consumption can be saved. In this regard, Safari and Khoshnod [46] combined a power-based list-based programming algorithm with dynamic voltage and frequency-scaling technique (DVFS) for real-time work (PL-DVFS) in a way to maintain service quality by setting deadlines. The purpose was to improve performance and reduce overall energy consumption, including the3 CPU energy (busy and idle) and communication energy [36]. DVFS is becoming an industry standard due to its location in the CPU hardware; proper software approaches are essential for calibration. Hence, Amulu and Ramraj [37] proposed an approach to minimize the right time and cost, increase user satisfaction, and reduce energy consumption.
In multi-core systems, depending on the computational demand of an application, the cores either operate separately at different voltage and frequency levels or are placed on several voltage frequency islands (VFI) to reduce the system power consumption. In this regard, Hajiamini et al. [38] addressed the issue of VFI scheduling and partitioning with the aim of optimizing the execution time of a set of tasks for a given amount of energye.
Nejat et al. [39] considered a QOS-based synchronized resource management (RMA) algorithm for the first time in a multi-core system, which dynamically adjusts the size of the end-level cash partitions and the voltage frequency settings of each core. It was done in order to save energy while complying with QoS. Moreover, each program run on multi-core processors during multi-program workflow by the configuration space within the LLC partition size range and the dynamic voltage-frequency scaling (DVFS) settings at low overhead at runtime.
Despite the implementation, design, modeling, computing, communication, and several architectural challenges to the Internet of Things (IOT), the use of the cloud IOT is synchronized with energy efficiency, which has not been considered by researchers yet. To fill this gap, Toor et al. [40] considered the concept of renewable energy in the cloud environment. In order to add authentication as well as a layer of security to this sensitive data, block chain offers an ideal solution for providing a central office for data usage. The deployment of block chain technology and the cross-platform nature makes it an ideal proposition for exploiting the full potential of IOT. Thus, they presented a compact and energy-conscious design for the cloud computing environment of the Internet. Moreover some researchers [41-43] increased the reliability by providing a solution to reduce energy consumption.
System model
This section suggests the formal models applicable to systems architecture, tasks dependency graph, DVFS, the energy consumption of processors and communications, makespan, and throughput considering some limitations and hypothesis for the problems formulation. Table 1 summarizes notations on the mathematical formulas.
Definition of notations for problem formulation
Definition of notations for problem formulation
In this study, a model with the three layers (i.e., User layer, COMP superscaler layer, and Cloud Resource layer) is proposed for the aim of scheduling parallel tasks on a cloud data center. To run their efficient applications including main codes, users need to employ the processors of a cloud data center.
Tasks dependency graph
The task dependency analyzer in COMPSs layers converts the user’s sequential program to DAGs. A directed acyclic graph (or DAG) is a digraph that has no cycles. A rooted tree is a special kind of DAG and a DAG is a special kind of directed graph [44]. The formed DAGs, called tasks dependency graph, are represented as G(T,D), where:
• T: It consists of a set of tasks in G, as given in (1). All tasks ∀t
i
∈ T are the components of the application code (nodes in a DAG):
et (t
i
, p
j
) is the execution/computation time of task t
i
on processor p
j
. The task execution time t
i
is calculated as formulation (2):
CPI denotes the number of clock cycle per instruction for a task by processor. The ideal CPI is 1.
ct (d ij ) is the communication time/cost of an edge d ij , for transfer message d ij . This time/cost is incurred if t i and t j are scheduled on different processors, while it is considered to be zero if t i and t j are scheduled on the same processor.
Processors currently utilize the DVFS technique to reduce the power consumption in a computing data center. It allows processors to operate using a discrete set of voltage and frequency pairs, which is (vj, fj). The system enables us to deal with the heavy load of tasks with a useful performance by adjusting the voltage and frequency at the same time. Due to a specifically-required increase in both the voltage and frequency or, on the other hand, a possible lowest rate in each one in terms of saving energy, Equation (4) can be formulated as follows:
Furthermore, the execution time of task t
i
on processor p
j
with the set of working frequencies from f1j to f
kj
can be calculated through formulation (5). In fact, greater frequency level leads to a shorter makespan.
In this section, we present the models for the performance evaluation of DAGs of different sizes and the determination of the energy consumption, makespan, and throughput.
Estimating makespan
Makespan (C
max
) is defined as the amount of time (from start to end) required to complete a set of sequences. The aim of scheduling algorithms is the minimization of makespan. Equation (6) indicates how to obtain makespan. To calculate makespan, the largest task end time is subtracted from the smallest task start time.
The critical path in a DAG constitutes the longest path from an entry to exit point of a node in a graph. In the case of dependency of tasks, directed edges of a parent task will pursuit the other tasks. It then should determine all possible paths. Against the others, the path that includes a longer duration will indicate an appropriate run time of the efficient application. Accordingly, the tasks will indeed stick to the defined execution time. Furthermore, it can be assumed that the tasks of the critical path may assign to the processing resources with high voltage and frequency (higher computational capability); on the other hand, the processing resources with lower voltage and frequency may choose to allocate to those that exist in the DAG G, but not the critical path. This will decrease the power computation in the processors of a data center. Accordingly, it can be expressed as follow:
This adopted model [25] works with the evaluation of throughput. Two rates of CommR(G) and CompR(G) must be taken into account in evaluating the throughput associated with a DAG G. The data elements are then consecutively processed regarding the fact that all the available tasks in a cluster, for instance Ci, are accomplished in only one processor. The rate of computation for such a cluster is equal to
Likewise, the rate of communication for dij edges is equal to
As processors are capable of handling both communication and computation at the same time, the overall throughput can be calculated by formulation (10):
Scheduling of dependency tasks in computational systems needs the consumption of energy equivalent to the total energy consumption of the processors in both tasks execution and data transmission to another processor in a communication network.
3.4.3.1 Energy for computation Nowadays, CMOS technology design is commonly used to build processors in which power is consumed in one way; the static or dynamic consumption of energy is obtained by formulation (11):
Static power consumption, i.e., the main source of static current, is leakage current and reverse based PN junction when there is no circuit activity, whereas dynamic power consumption involves charging and discharging of capacitances when inputs are active.
Pstatic can be eliminated from the equation, regarding the energy consumption for the execution of the parallel tasks based on the energy for computation in processors and the energy in communications among processors. Thus, formulation (12) is used to calculate the dynamic energy consumption in a processor:
DVFS-based processors meet the maximization of power consumption (Pproc.highest) when the operation is carried out with the highest possible voltage (vhighest) and frequency (fhighest). Therefore, the active power consumption of a processor can be calculated based on a set of voltage and frequency (vj, fj) using formulation (13):
Since the proposed algorithm adopts the task duplication strategy for scheduling a DAG with n tasks on DVFS-enabled processors, the total energy consumption can be calculated by formulation (14):
Formulation (15) presents the obtaining of the energy consumption of the processors in idle durations, as follows:
Finally, the total energy consumed by processors to execute the task dependency graph can be obtained by the sum of (14) and (15) as follows:
3.4.3.2 Energy for communications. Since the processors in each datacenter have been assumed to be homogeneous, the data transfer speed and power consumption are identical (see formulation (17)):
3.4.3.3 Total energy for datacenter. This can be expressed by formulation (20) for a cloud data center that can be obtained through sum of (16) and (18), thus:
The energy-performance tradeoff scheduling problem stated before is formally defined as follows. If an application consists of n parallel tasks, the required number of PEs p, the schedule length of the best effort schedule makespanbest, and the negotiated makespan extension rate μ, after an initial schedule, allocate the slacks to the appropriate tasks for downscaling their frequencies to try to minimize energy consumption. More formally, the problem can be formulated as follows:
min En = En active + En idle + En com , C max (G) , MaxTh cur (G)
Subject To:
Makespan ⩽ Makespan best × (1 + μ)
En cur (G) ⩽ En c (G) , Th cur (G) ⩾ Th c (G)
Where
The timing constraint is enforced by formulation (20). Since the available frequency range after an initial schedule is finite and discrete, we use boolean
Petri Nets-based proposed model
Petri Nets (PNs) is a language for the modelling and validation of systems in which concurrency, communication, and synchronization play a major role. Petri Nets is a discrete-event modelling language with the functional programming language Standard ML and provides the foundation of the graphical notation such as token, place, arc, and transitions for concurrency, communication, and synchronization [45, 46].
Traditionally, Petri Nets are classified under two categories: low-level representation and high-level representation. The latter (e.g., Colored Petri nets (CPNs), Time Petri nets (TPNs), and Hierarchical Petri nets (HPNs)) is considered in applied objectives, regarding the possibility that allows for the compact modeling with quantitative capabilities [47]. In this section, the High-level Petri Nets provide a model of energy consumption in terms of an improvement for the rates of the throughput and makespan. Figures 1 and 2 show the optimized model of the proposed objectives, where the system states are modeled by places and the controlled and decidable operators are modeled by transitions.

The Petri Nets-based model for energy consumption.

Dynamic voltage and frequency scaling structured by Petri Nets.
In the following, the places and transitions of the proposed Petri Nets-based model of energy consumption are described in Tables 2 and 3, respectively.
Descriptions for the places of the proposed petri nets-based model of energy consumption
Descriptions for the transitions of the proposed petri nets-based model of energy consumption
This section presents energy aware task scheduling with combination of Duplication, Clustering, and DVFS slack allocation distribution algorithm, called EATSDCDD, in cloud datacenters, which is aimed at achieving high throughput and reducing the schedule length (makespan) and energy consumptions. Our proposed algorithm contains two main phases: EATSDCA and EADVFSDA. The former operates with a scheduling algorithm to achieve reduced energy consumption in communications and increase the throughput. On the other hand, the latter utilizes the DVFS technique on each processor to reduce the energy consumed for computing in processors and to run a DAG using the energy-aware dynamic and voltage frequency scaling (EADVFSD) algorithm. It will be discussed later in detail. The steps of the proposed method, which consists of three phases, are shown in Fig. 3.

The proposed EATSDCDD flowchart.
The first phase presents the Energy-Aware Task Duplication-Clustering Algorithm (EATSDC) for parallel task scheduling. Our EATSDC attempts to satisfy makespan, throughput, and energy constraints using duplication and clustering strategy. Task clustering reduces makespan by zeroing edges of high communication time and proper adoption of the strategy. Task duplication decreases the communication overhead by reducing the allocation of certain tasks to multiple processors and, consequently, lowering energy consumption. The pseudo-code of EATSDC is shown in Algorithm 1 (Fig. 4).

Algorithm 1.
Since the purpose of scheduling is the reduction of makespan, it can be carried out in a shortened term on the critical path using the b-level descending order (it offers prioritizing of a task). Inside each cluster, tasks are executed in the order of their b-level. b_level is a normal priority assignment for tasks. The b-level of a node is bounded from above by the length of a critical path (CP). b-level is calculated with Algorithm 2 (Fig. 5).

Algorithm 2.
Algorithm 1 executes the clustering and duplication strategies on the input graph. Lines 3 and 4 require |T|, while Lines 5 to 8 require |D| operation times to calculate CompR(G) and CommR(G). Sorting in Line 9 can be done at time |D|log|D| based on quick sort. Lines 10 to 16 require 2×(|T|+|D|)2 operations; the calculation of CP in Line 17 requires (|T|+|D|)2 operations, and Lines 19 to 25 require |D|2 operations to satisfy the energy consumption. Moreover, the calculation of b-level for all tasks in the integrated cluster requires (|T|+|D|+|D|) operations. As a result, the time complexity of EATDCA is equal to O(|D|×(|T|+|D|)2). While the complexity of the PATC algorithm is O(|E|×|E|Lg|E|+CT×A3).
After applying the duplication and clustering strategy to the input DAG, which leads to lower energy consumption and makespan and higher throughput, it is intended to mitigate the energy consumed by processors through determining the critical path and non-critical paths. In addition, the slack time of non-critical tasks is specified, and the voltage and frequency of processors allocated to processing of tasks in non-critical paths as well as idle and communication phases through scaling down DVFS are calculated. For this reason, it is essential to first explore the important parameters used in applying DVFS techniques to reduce energy consumption.
Calculation of slack time for tasks
Calculation of the Earliest Start Time (EST) is a top-down method, which starts with the first task and ends with the last one, calculated by formulation (21) starting from the entry task and terminating at the exit task. For the entry task, this parameter is 0:
After calculating EST, the Earliest Finish Time (EFT) can be calculated for task t
i
by formulation (22), where et (t
i
, p
m
) is execution time of task ith on processor jth:
Moreover, the calculation of Last Finish Time (LFT) is a bottom-up method, which starts with the last task and ends with the first task, calculated by formulation (23); for the exit task, this parameter equals to its earliest completion time:
The Latest Start Time (LST) for task t
i
is also calculated by formulation (24):
The non-critical tasks in a DAG are distinguished by the presence of slack. Critical tasks have zero slack, while non-critical tasks have slack value that is known as slack time. For each task ith, the slack time is calculated using formulation (25):
Algorithm 3 indicates the pseudo-code for computing the slack time; in case of critical and non-critical paths (Fig. 6).

Algorithm 3.
It indicates a list of tasks with the possibility of sharing and distributing the slack-time to each other. The compatible matrix of a directed acyclic graph with a specified number of tasks (n) is a form of an n*n matrix that is shown in Fig. 7. A graph of a task generally structures a symmetric compatible matrix. Compatible matrix elements, e.g., mij, show whether ti and tj tasks are capable of sharing the slack time. In case the element mij equals 1, it can potentially be expected the share and distribution of the slack-time between ti and tj tasks, regarding their independent execution. Moreover, the distributed slack time will neither disturb the specified time for the tasks, nor increase the total time of the execution. The results indicate that the distribution of the time among the dependent tasks of a cluster has the chance to save more percentage of the energy rather in the case of an only one task in a cluster.

Compatible matrix.
This section discusses how the dedicated voltage and frequency of processors to non-critical tasks, idle, and communication phases can be reduced based on the knowledge DVFS technique. In addition, it demonstrates the case of an increase in the dedicated voltage and frequency of processors to critical tasks, which will end in a diminished consumption of the energy. The critical path in a graph of scheduled tasks located in the Gantt chart contains a set of specified timeslots for the execution of the tasks as well as the data communication between the earliest task to the latest, in which makespan is the sum of the time for the computation and communications. The best effort scheduling algorithm fails to expand makespan and modify the voltage and frequency of the processors associated with the timeslots for the execution of tasks in the critical path. However, the other timeslots voltage and frequency of the Gantt chart can be decreased. To obtain the operating frequency of a specified processor (k), formulation (26) should be followed: Assume taskt
i
is a non-critical task and is executed on Processor k. Then, the taskt
i
’s execution time can be extended to slak time for task t
i
without violating precedence constraints (without changing the finish time of its precursors and the start time of its successors). Processor k operating frequency can be scaled to Processor k′.freqop:
Algorithm 4 shows the proposed pseudo-code algorithm for the scaling of voltage and frequency (Fig. 8).

Algorithm 4.
In the present study, a simple tracing example is used to illustrate our proposed algorithms application to the input DAG shown in Fig. 9. The number of processors |m| is 3, and throughput constraint is assumed to be 0.04 according to formulation (28). In the first step, we execute the EATSDC algorithm on the input DAG. This algorithm serves to satisfy the throughput, makespan, and somewhat energy consumption.

Running EATSDC algorithm: Throughput, makespan, and energy constraint satisfaction.
Initially, each task was assigned to an independent cluster (Line 3), then we start to compute the number of replicas for each cluster as numr(C1)= 0.8, numr(C2)=2, numr(C3)=0.4, numr(C4)=0.5, numr(C5)=0.4, and numr(C6)=0.2 (line 4). In this procedure, CompR(G) satisfied the throughput constraint.
Then, it was time to specify edges weighing greater than 25 and violating the throughput constraint (Lines 7 to 10). These edges included d12, d34 and d35. An edge with the highest weight (d12) was select, and then two tasks t1 and t2 were clustered. Edge d34was selected, and task t3 was duplicated. At the next stage, the first copy of t3 with t4 and another copy of t3 with t5 were clustered. As a result of this operation, the required throughput was met (Lines 11 to 18) (Fig. 4.b and 4.c). Given that Encur was larger than Enc, Lines 21 through 23 of the algorithm selected the edge with the largest communication time, while the tasks clustered the beginning and end of the selected edge. This process continued until the energy constraint was achieved. Fig. 4.d shows the result of EATDCA on the input graph, which includes clusters with tasks C1={t1, t2 }, C2={t3, t4 }, and C3={t′3, t5, t6 }, where each cluster was assigned to a processor. Then, the b-level of each task was calculated by Algorithm 2, which is equal to bl(t1)=52, bl(t5)=12, bl(t4)=20, bl(t’3)=24, bl(t3)=32, bl(t2)=42, and bl(t6)=4. The sequence of task execution in each cluster is based on their b-level. Then, Line 2 of Algorithm 3 specified the topological order of each task in DAG.
After the calculation of EST, EFT, LST, and LFT for each task t i (Lines 3 to 5), it determines the value of slack time for each task. Since this value is zero for the tasks t1, t3 and t6, these are considered critical tasks located in a critical path, whereas tasks t3, t’3, t4 and t5 are considered non-critical (Fig. 10).

The use of EATSDCDD algorithm to do the operation using the DVFS technique and the distribution of the slack-time on the early sample of DAG.
Figure 11 shows the results after applying the duplication & clustering technique to the Gantt chart. To further satisfy the energy constraints, there is a need to apply a voltage/frequency scaling algorithm (Algorithm 4) to the graph derived from the output of Algorithm 3. For each time slot associated with each processor, if the processor is executing critical tasks, the processor frequency reaches its highest value (Lines 4 to 6). If these are processing non-critical tasks, the processor frequency is calculated by Equation 26 (Lines 7 to 9). And if they are in the idle and communication phases, the processor frequency changes to the lowest level (Lines 10 to 12). In this algorithm, the slack time of non-critical tasks in each cluster is distributed according to the execution time of each task. This leads to greater saving of energy consumed by processors to execute tasks in DAG (Fig. 12).

Gantt chart after employ duplication & clustering technique.

Energy Gantt chart after stack time distribution (employ DVFS distribution).
This section presents the experimental evaluation of our proposed heuristic algorithm, i.e., Energy-Aware Task Scheduling with Duplication Clustering Dynamic Voltage/Frequency Distribution (EATSDCDD). Moreover, the results of the proposed EATSDCDD algorithm is compared with those of some previously-introduced algorithms, i.e., Power-Aware Task Clustering (PATC) [8], Power-Aware List-based Scheduling (PALS) [8], Energy Aware Duplication Scheduling (EADUS) & TEBUS [24], as well as EATSDCD, in terms of saving and consuming energy. Moreover, we highlight the contrast of EATSDCDD algorithm with HEFT [16] and RASD [23], underlying the reduction on the execution time or makespan. The mentioned algorithm is compared to the Throughput Constrained Latency Optimization heuristic (TCLO), Throughput Constrained Latency Optimization Pipelined (TCLO-P) [25], and EATSDCD [1] in both homogeneous and heterogeneous environments, considering the value of throughput and energy consumption. The simulation results confirmed the superiority of our proposed method to Energy Aware Service Level Agreement (EASLA) [3] and PALS [8] in terms of energy saving, based on the Green Service Level Agreement and the negotiation between users and service providers at the μ rate.
Experimental setup
First of all, the settings required for our experiments are presented, including the information of target platform and the energy, throughput, and makespan constraint settings, which are used for the following performance evaluation.
Settings for constraints of energy, throughput, and makespan
To avoid some unreasonable constraint settings that are impossible for an input DAG to meet, a lower bound of the energy constraint (denoted as Enmin), an upper bound of the throughput constraint (denoted as Thmax), and a lower bound of the makespan or critical path constraint (denoted as CPmin) were defined in this study.The maximum amount of reduction in energy consumption by processors to execute tasks can be achieved by eliminating all communication overheads and allowing no duplication (numr(t
i
)=1). To that end, it is essential to execute all tasks on a single processor. Therefore, the minimum energy consumption can be calculated as shown by formulation (27).
Moreover, the maximum throughput can be achieved when we have |m| processors in the system and by clustering all tasks and replicating this cluster to all processors, as indicated by formulation (28).
The minimum makespan or critical path is represented as CPmin(G) of which the realization depends on the execution of all tasks on one processor. It causes the makespan constraint due to neglecting the communication costs. One can obtain this with formulation (29) as follows:
A generally-accepted fact is that it is impossible to achieve both the minimal energy consumption and the maximal throughput simultaneously, since they are naturally conflicting with each other. Therefore, a coefficient should be established for Enmin (G) and Thmax (G) that is, four main constraints of En min (G): 1.1En min (G) , 1.5En min (G) , 1.2En min (G), and 2En min (G), and three throughput constraints Th max (G) are set: 0.25Th max (G) , 0.5Th max (G) and0.75Th max (G).
There are many random graph generator tools that generate weighted application DAG, e.g., STG (standard task graph) [48]. STG is a kind of benchmark for evaluation of proposed scheduling algorithms. In our simulation experiments, Graphs are generated for all combinations of the above parameters with the number of tasks ranging between 40 and 200, with incremental steps of 40. In addition, the tasks of 100, 200, 400, and 800 in graphs are randomly selected as the costs of nodes and edges. The CCR values are considered as 0.1, 1, 5, and 10. Equation (30) indicates how CCR is calculated in a DAG. The number of levels is determined by the parallelism factor λ(0.5, 1.0, 2.0, 5.0). Every set of the above parameters is used to generate 200 random task graphs in order to avoid scattering effects and then the average result is computed.
The platform of simulation environment used to evaluate our work is CloudSim toolkit [49] that operates based on Java. The CloudSim was installed in an Asus Notebook with Intel core i7-A540UP CPU 2.4 GHz with 8 cores and 4 GB memory. We created five datacenters in our simulation, and set 200 virtual machines, each involving three processors. Table 4 shows the details of the three processor types.
Power consumption for different voltage/frequency of processors [1]
Power consumption for different voltage/frequency of processors [1]
The first set of experiments compared the energy saving of algorithms with respect to various graph size. This section presents the simulation results obtained from our EATSDCDD scheme. The performance of the EATSDCDD algorithm was compared with that of five other algorithms, i.e., the PALS algorithm, PATC algorithm [8], Energy Aware Duplication Scheduling (EADUS) algorithm [24], TEBUS algorithm [24], and Energy Aware Task Scheduling Duplication Clustering DVFS (EATSDCD) algorithm [1]. PALS and PATC are critical path-based scheduling algorithms that attempt to achieve the shortest schedule length through clustering tasks in critical path and employ DVFS technique to decrease the consumption of energy and makespan. On the other hand, EADUS and TEBUS use the tasks duplication strategies to schedule DAG-based parallel tasks in a cluster computing in a way to reduce power consumption. The EATSDCD algorithm allocates the slack to one task in each cluster to reduce energy consumption by combining DVFS technique with duplication and clustering scheduling algorithm. EATSDCDD can achieve up to 58.3% energy saving respectively in the simulation. The EATSDCDD algorithm distributes the time of slack to all non-critical dependent tasks of a cluster along with the mentioned techniques; thus, it results in a more efficient saving of the energy. Results show the high impact of the parameters involved with the size of DAG and CCR on saving energy. The DAG highly focuses on computations and communications in the condition of CCR = 10 and CCR = 0.1. EATSDCDD achieved better energy saving among the others. The reason was that in our proposed method, a slack time was distributed and allocated to the tasks belonging to a maximum independent set of tasks in cluster. Table 5 compares the proposed method with the other energy-aware scheduling algorithms regarding the capability of saving the energy more efficiently.
Comparing the energy saving capacity in EATSDCDD and five other algorithms
The second set of experiments compares the makespan of algorithms with respect to different numbers of tasks and CCRs. The first phase of the proposed method (EATSDC) was compared with two other algorithms: Reliability Aware Scheduling (RASD) [23] and Heterogeneous Earliest Finish Time (HEFT) [16]. Figure 13 shows the makespan when running the EATSDC algorithm in different scenarios of numbers of tasks ranging from 40 to 200, with incremental step of 40, and CCR (0.1, 1, 5). The comparative results confirmed that the proposed EATSDC algorithm outperformed the others in terms of the items examined in this paper. For CCR = 0.1, CCR = 1, and CCR = 5, however, the impact of the makespan on scheduling decisions among the three algorithms is clear, that is, the incorporation of makespan overhead results in a significantly improvement of the performance. For CCR = 1, as the communication increases, Fig. 8.b reveals that EATSDC performs better, with respect to the quality of makespan. Specifically, as CCR = 5, when the communication dominates the application, Fig. 8.c shows that EATSDC significantly outperforms HEFT and RASD by 20.4% and 3.4%, respectively, in term of the average makespan.

Makespan of the EATSDC, RASD, and HEFT algorithms for different CCRs.
In the third group of experiments, the number of feasible solutions of the proposed method (EATSDCDD) was compared with that of the Energy Aware Task Scheduling Duplication Clustering DVFS (EATSDCD) [1] algorithm, Parallel Pipeline Latency Optimization) PaPilo) [29] algorithm, Throughput Constrained Latency Optimization heuristic (TCLO), and Throughput Constrained Latency Optimization Pipelined (TCLO-P) [25] algorithm. It was done among the set of 100 random task graphs from the STG benchmark [45], with respect to CCR = 1, under different Energy and throughput constraints (more specifically, four energy constraints En min (G): 1.1En min (G) , 1.5En min (G) , 1.2En min (G), and 2En min (G), and three throughput constraints Th max (G): 0.25Thmax (G) , 0.5Thmax (G) and 0.75Thmax (G)). Therefore, there are a total of seven constraint settings. Figure 14 shows the number of feasible solutions with three throughput constraints in CCR = 1 and Encur = 2Enmin (G), Encur = 1.5Enmin (G) .

The number of feasible solutions for 100 random task graph with n = 200, with different throughput and energy constraints.
From these results, the first observation indicates that EATSDCDD, PaPiLo, and EATSDCD can always find more solutions than TCLO and TCLO-P, without any violation.
Figure 15 shows the number of feasible solutions of 100 random task graph with three throughput constraints in CCR = 1 and Encur = 1.2Enmin (G). Accordingly, only EATSDCDD and EATSDCD algorithms observed the determined constraints and successfully met them, while PaPilo algorithm was not able to adapt to the specified constraints in some of the graphs. This is because in our proposed algorithm and EATSDCD, we attempt to allocate a slack to the tasks and use DVFS technique.

n = 200,
Moreover, when the energy constraint is relatively loose, EATSDCDD can get better performance than our previous work. On the contrary, EATSDCD can get lower feasible solution than EATSDCDD under more strict energy constraint. From the feasible solutions shown in Figure 16, under the strict energy consumption (Encur = 1.1Enmin (G)), EATSDCDD can get around 10% improvement in comparison with EATSDCD. The main reason is that the proposed algorithm distributes each slack to a set of tasks in a cluster, instead of assigning it to a task, and scales frequencies down to minimize the energy consumption.

n = 200,
Table 6 compares our algorithm with other energy aware DAG scheduling algorithms in term of max energy saving for DAG with large task size (7200 –14400) MI, Numerous nodes among (1000–2000).
Comparing the saving of energy between EATSDCDD and two other Algorithms for large test cases
The proposed method for the large test cases saves more energy, while the execution time of the proposed algorithm is longer than other algorithms.
In the previous section, we proposed the method to decrease the energy consumption without increasing the task execution time. This section presents a tradeoff between energy and makespan. Accordingly, a scenario must take place in which consumers and service providers negotiate at μ rate to realize possible scheduling regarding GSLA with respect to environment. The agreement is due to a reduction of energy consumption along with an increase in the task execution length. To obtain this, formulation (31) is used:
To calculate non-critical tasks and their slack-time, we use formulation (33) as follows:
Also, Equation (34) expresses the operating frequency of the processor (k) in running non-critical tasks:
The energy & makespan trade off algorithm is shown in Algorithm 5. Algorithm 6 may reduce the voltage and frequency of dedicated processors to run critical and non-critical tasks, idle, and communications based on GSLA and the agreed rate of μ.
In the fourth group of experiments, the energy saving capacity of the EATSDCDD and EASLA algorithms are compared with respect to various makespan extension rates. To this end, experimental results were provided for makespan extension rates (μ) equal to 5%, 10%, 20%, 30%, and 40% and 100, 200, 400, 800 for the number of tasks (n), and the number of processors as 10, 20, 30, and 40, while the value of CCR was set to 0.2. Figure 17 shows that in terms of energy saving, EATSDCDD outperformed EASLA for various numbers of tasks. For instance, when the makespan extension rate was 40%, the ESR of EATSDCDD algorithm was greater than the EASLA algorithm by (37%, 45%), (35%, 43%), (32%, 40%), and (33%, 35%) for the number of tasks of 100, 200, 400 and 800, respectively, with 10 processors.

Comparison between EATSDCDD and EASLA algorithms in terms of the energy saving in percentages, in the condition of CCR, m, n.
In the simulation for energy and makespan trade off algorithm, when the makespan extension rate was 40%, the ESR of the EATSDCDD algorithm was greater than the EASLA algorithm by (32%, 39%), (34%, 40%), (39%, 43%), and (45%, 49%) for the number of processors of 10, 20, 30, and 40, respectively, with the tasks number set to 400 as shown in Fig. 18. We can see that when the makespan extension rate increases, the energy saving increases, too.

Energy savings vs. makespan extension in the condition of CCR, m, n.
Comparing the energy saving capability of the two algorithms, EATSDCDD and PALS, the fifth set of simulation results are considered. In this regard, the makespan extension rates (μ) are 10%, 20%, and 30%. Furthermore, the numbers of tasks, CCR, and the number of processors were 200, 1, 50, respectively. Figure 19 shows the results of the conducted simulation in which expanding the makespan rate will result in energy saving; if the former increases up to 30%, the latter shows the significant amount of 77%.

Energy savings vs. makespan extension (CCR = 1, m = 50, n = 200).
A good energy-aware scheduling algorithm on DVFS-enabled datacenters can reduce energy consumption, which decreases the operational cost, reduces carbon footprint and CO2 emissions in high performance distributed computing systems (HPDCSs), and increases the system reliability and availability. To this end, in the present paper, the energy-performance tradeoff scheduling algorithm (EATSDCDD) was proposed. The proposed algorithm employed the clustering & duplication technique and distributed the slack time for set of non critical tasks in each cluster under a lower voltage and frequency. We also developed a Green SLA attempting to save more amount of energy. Eventually, a set of data was established for conducting the examinations, and also different parameters of the constructed random DAG were assessed to identify the efficiency of the proposed algorithm in comparison with a cluster and duplication-based algorithm as well as those algorithms that use the DVFS technique. The results indicated that the proposed algorithm succeeded by 8.3%, 20%, and 10% more than PALS, PATC, and EATSDCD algorithms, respectively, in terms of the energy consumption and makespan. The results also demonstrated up to 16% improvement of the proposed algorithm compared with the PaPilo algorithm where Encur = 1.2 Enmin(G). In addition, the proposed algorithm was shown almost better than the EATSDCD algorithm where Encur = 1.1 Enmin(G). Furthermore, with the increase of the μ rate up to 30% and 40%, the proposed algorithm showed 7% and 6% more advances in energy saving than the PALS and EASLA algorithm, respectively.
It is worth mentioning that the proposed method with the simultaneous use of three techniques of Duplication, Clustering and dynamic voltage and frequency scaling has a longer execution time compared to those ones use only one of these three techniques. Moreover, the proposed algorithm only reduces the power consumption of processors and does not pay attention to the power consumption of other parts of a data center such as memory, etc. In the future work, we are planning to test the algorithm on various degrees of heterogeneity among processors in datacenters and subtasks. Furthermore, we will consider individual components such as disks and memory to reduce energy consumption, and we will exam the application of the energy-aware scheduling algorithm to some real applications such as sparse Cholesky decomposition. Finally, the proposed algorithm-based approach does not guarantee the converging to the global optima. Thus, the development of the proposed method for converging to the global optima; is left to the next study.
Footnotes
Acknowledgement
The authors would like to thank the anonymous reviewers and the associate editor for their insightful comments and suggestions.
