Abstract
Virtualization design management and adaptive resource migration techniques are typically used in cloud computing systems to save energy and improve system reliability. This paper presents a resource migration technique based on node performance and failure rule (NPFR-RMT). It uses Brokers to obtain node performance and identify abnormal nodes and failed nodes. Failure rule detection based on a failure model is performed on all nodes in the idle queue, and fast and slow nodes are determined by node performance threshold analysis. NPFR-RMT adds slow nodes and failed nodes into the idle queue, and a virtual machine (VM) migration method is adopted to migrate tasks to normal nodes in the idle queue for re-execution. Comparison experiments show that NPFR-RMT achieves the expected aim of reducing the energy consumption of data centers while ensuring service performance.
Introduction
Cloud computing services typically deploy the service requests of one user (programs, data, etc.) on the same virtual node. The problem of allocating user-submitted tasks in a timely manner is therefore translated into the management and deployment of virtual machines on the same physical host. After a user pays, the system allocates a virtual node containing the operating environment required by the user's applications, and the user runs these applications on the virtual machine to complete the work. The system administrator configures and deploys multiple virtual machines according to the node's hardware resources, and each user is unaware of the other users.
Virtual machine technology integrates large amounts of resources for users. The various services provided by cloud infrastructure correspond to different implementation approaches, but the common goal is high-level management of large-scale resources, namely scalability, reliability, cost saving and improved efficiency [1–6]. Resource management in the infrastructure layer includes resource distribution, resource forecasting, resource allocation and dynamic resource migration. Because a cloud computing system is made up of hundreds of nodes, node failures are common. To adapt resource management dynamically, resource migration is usually used to shield system operation from the effects of node failures and thus ensure system reliability.
This paper proposes NPFR-RMT. It uses Brokers to obtain the performance of system nodes, determines a performance threshold, and classifies nodes as fast or slow; these nodes are then added into a queue for failure rule detection so that migration can be performed rapidly. Experiments are carried out to verify the performance of the algorithm. The rest of the paper is organized as follows: Section 2 discusses related work; Section 3 presents the node failure processing model; Section 4 proposes NPFR-RMT; Sections 5 and 6 present the experimental results and the conclusions.
Related works
Cloud computing system resources include common parameters such as memory, CPU and bandwidth. The system administrator responds to the service requests of users at different levels with a priority scheduling method. In virtualization, the system hardware resources are allocated and managed by the Virtual Machine Manager (VMM).
In existing research on cloud computing resource management, the concepts of domain and role security have been introduced into the IaaS service model to manage virtual resources [7]. For single-node scheduling, a dynamic CPU resource scheduling method [8] and a memory load balancing method among VMs [9] have been proposed. Most of these methods concentrate on how the VMM manages memory pages, CPU time slot allocation and memory space allocation for VMs. For multi-node scheduling, the VMware product DRS achieves load balancing among multiple nodes through VM resource migration [10]. A multi-node CPU, memory and storage resource scheduling method for saving energy has also been put forward [11], and resource division with advance reservation has been applied to virtual resource allocation to maximize virtual resource utilization [12].
Although many methods have been presented to improve resource management efficiency and node reliability, node failures remain common. The failure rules of node resources have been studied [13, 14], and stochastic processes have been used to analyze system failure logs [14, 15]. These studies found that system failure intervals can be regarded as a stochastic process following a Weibull(scale, shape) distribution with shape < 1. The mean time between device failures can be computed from this distribution, and measures can then be taken to improve system reliability.
To adapt resource management dynamically, resource migration techniques are used. A distributed structure has been established based on virtualization and node brokers [16]. Each broker extracts information from multiple configuration files using a multi-criteria decision analysis method in order to maintain VM distribution, host monitoring, resource migration and VM migration. For virtualized cloud data centers, an energy-efficient resource management system has been proposed to reduce operating costs while meeting service quality requirements, with optimizations of the system, bandwidth and hotspots. A process-level live migration mechanism that lets applications continue executing during most of the process migration has also been presented [17]; it is integrated into an MPI execution environment to transparently sustain health-inflicted node failures. To address the complexity of resource management, an adaptive organization model of cloud computing resources has been put forward [18]; compared with existing methods, it emphasizes adaptive organizational behavior and dynamic resource optimization.
Virtualization promotes energy saving in resource management systems [23–25]. As the basis of cloud computing, virtualization provides a strong driving force for its development: it facilitates the management, allocation and maintenance of computing resources, reduces capital expenditure in data centers and improves the utilization of hardware resources. It is mainly applied to match reliable services in the data center with virtual resources on physical nodes.
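The Weibull model in [14, 15] can be made concrete with a short sketch. The Java class below is an illustration under stated assumptions, not code from the cited studies: it computes the mean time between failures E[T] = scale · Γ(1 + 1/shape) using a Lanczos approximation of the Gamma function, and draws failure intervals by inverse-transform sampling. The class name and the example parameters (scale = 100, shape = 0.5) are hypothetical.

```java
import java.util.Random;

/**
 * Illustrative sketch (not from the cited studies): failure intervals
 * modeled as Weibull(scale, shape) with shape < 1.
 */
final class WeibullFailureModel {
    // Lanczos coefficients (g = 7, n = 9) for approximating the Gamma function.
    private static final double[] LANCZOS = {
        0.99999999999980993, 676.5203681218851, -1259.1392167224028,
        771.32342877765313, -176.61502916214059, 12.507343278686905,
        -0.13857109526572012, 9.9843695780195716e-6, 1.5056327351493116e-7
    };

    /** Gamma function via the Lanczos approximation. */
    static double gamma(double x) {
        if (x < 0.5) {
            // Reflection formula for small arguments.
            return Math.PI / (Math.sin(Math.PI * x) * gamma(1 - x));
        }
        x -= 1;
        double a = LANCZOS[0];
        double t = x + 7.5;
        for (int i = 1; i < LANCZOS.length; i++) {
            a += LANCZOS[i] / (x + i);
        }
        return Math.sqrt(2 * Math.PI) * Math.pow(t, x + 0.5) * Math.exp(-t) * a;
    }

    /** Mean time between failures: E[T] = scale * Gamma(1 + 1/shape). */
    static double meanTimeBetweenFailures(double scale, double shape) {
        return scale * gamma(1.0 + 1.0 / shape);
    }

    /** Draws one failure interval by inverse-transform sampling. */
    static double sampleInterval(double scale, double shape, Random rng) {
        double u = 1.0 - rng.nextDouble(); // u in (0, 1], avoids log(0)
        return scale * Math.pow(-Math.log(u), 1.0 / shape);
    }

    public static void main(String[] args) {
        double scale = 100.0, shape = 0.5; // hypothetical example values
        System.out.println(meanTimeBetweenFailures(scale, shape)); // approximately 200
        Random rng = new Random(42);
        double sum = 0;
        int n = 1_000_000;
        for (int i = 0; i < n; i++) {
            sum += sampleInterval(scale, shape, rng);
        }
        System.out.println(sum / n); // close to the analytic mean
    }
}
```

With shape < 1 the intervals are heavy-tailed: most failures follow each other closely, while occasional long gaps pull the mean up, so the empirical average converges slowly.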
Node failure processing model
Related definition
In cloud computing, node performance reflects the computing or storage capability of a node when processing tasks. The contributions to node performance include computation ability, storage ability, network transmission ability, etc. The paper defines a scale to describe this contribution capacity, namely node performance.
The failure rule is considered mainly from three aspects: temporal and spatial locality, unplanned node reboot failures, and software aging. Statistics show that about 75% of failure events take place in 20% of the time periods, and about 80% of error events occur on 15% of the nodes. Failure arrival times follow a Weibull(scale, shape) distribution, with the shape parameter in the range 0.025 to 0.05. Software aging means that system reliability degrades and performance decreases as running time increases; imperfect design and internal faults may cause software aging.
In a large-scale open cloud computing environment, system failures are widely distributed in both time and space. To facilitate research on dynamic resource migration, a generic failure model framework is constructed. It contains information in three dimensions: failure time information, failure spatial information and the restore overhead caused by the failure. The time information describes the period distribution as well as the temporal correlation of failures. In addition, the paper sets the failure overhead to a constant value w to simplify the handling of unplanned reboot failures.
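The text does not give a formula for this scale. As one hypothetical way to combine the three abilities, the sketch below normalizes each capacity by the cluster maximum and takes a weighted sum, so a node that matches the cluster maximum in every dimension scores 1. The weights (0.5, 0.25, 0.25), the class name and the parameter names are assumptions for illustration only.

```java
/**
 * Hypothetical node performance score: weighted sum of computation,
 * storage and network capacity, each normalized by the cluster maximum.
 * The weights are illustrative assumptions, not values from the paper.
 */
final class NodePerformance {
    static final double W_CPU = 0.5;
    static final double W_STORAGE = 0.25;
    static final double W_NET = 0.25;

    static double score(double mips, double storageGb, double bandwidthMbps,
                        double maxMips, double maxStorageGb, double maxBandwidthMbps) {
        return W_CPU * (mips / maxMips)
             + W_STORAGE * (storageGb / maxStorageGb)
             + W_NET * (bandwidthMbps / maxBandwidthMbps);
    }

    public static void main(String[] args) {
        // A 2000-MIPS node in a cluster whose fastest node has 3000 MIPS,
        // with storage and bandwidth equal to the cluster maximum.
        System.out.println(score(2000, 10, 1000, 3000, 10, 1000));
    }
}
```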
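To make the three dimensions concrete, the sketch below stores each failure as (time, node, restore overhead), with the overhead held constant as in the text, and measures spatial locality as the share of events that land on the most failure-prone fraction of nodes (for example 15% of the nodes, matching the statistic above). The class and method names and the example log are hypothetical.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical sketch of the three-dimensional failure model. */
final class FailureModel {
    static final class FailureEvent {
        final double time;     // failure time information (s)
        final int nodeId;      // failure spatial information
        final double overhead; // restore overhead, a constant (w) in the text

        FailureEvent(double time, int nodeId, double overhead) {
            this.time = time;
            this.nodeId = nodeId;
            this.overhead = overhead;
        }
    }

    /**
     * Fraction of all events that fall on the top nodeFraction of nodes,
     * ranked by failure count. totalNodes includes failure-free nodes.
     */
    static double spatialConcentration(List<FailureEvent> events, int totalNodes, double nodeFraction) {
        if (events.isEmpty()) return 0.0;
        Map<Integer, Integer> counts = new HashMap<>();
        for (FailureEvent e : events) counts.merge(e.nodeId, 1, Integer::sum);
        List<Integer> sorted = new ArrayList<>(counts.values());
        sorted.sort((a, b) -> b - a); // descending failure counts
        // Small epsilon guards against floating-point error in nodeFraction * totalNodes.
        int topNodes = (int) Math.ceil(nodeFraction * totalNodes - 1e-9);
        int inTop = 0;
        for (int i = 0; i < Math.min(topNodes, sorted.size()); i++) inTop += sorted.get(i);
        return (double) inTop / events.size();
    }

    public static void main(String[] args) {
        List<FailureEvent> log = new ArrayList<>();
        // 8 failures on node 0 and one each on nodes 1-4, in a 20-node cluster.
        for (int i = 0; i < 8; i++) log.add(new FailureEvent(10.0 * i, 0, 30.0));
        for (int node = 1; node <= 4; node++) log.add(new FailureEvent(5.0 * node, node, 30.0));
        System.out.println(spatialConcentration(log, 20, 0.15)); // share on the top 3 nodes
    }
}
```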
Resource management model
Based on the cloud computing resource distribution model, the paper designs the resource management model shown in Fig. 1. A Global Broker (GB) is set in each region, and a Local Broker (LB) is set in each pool within the region. The Brokers are responsible for two tasks: receiving information and responding to it appropriately. For information receiving, the GB runs on a master node and collects node information within its pool, while the LB continuously monitors the CPU utilization of local nodes and acquires information on each virtual machine from the virtual machine monitors running on the nodes. For processing, the GB selects the optimal node to receive virtual machines that need to be migrated, and the LB adjusts the scale of virtual machines according to resource needs and determines when a virtual machine should be migrated away from its node.
In this management model, abnormal node detection is based on monitoring the variance of computation speed. Whenever a node in the system changes, the variance is updated. If the speed change of a node exceeds a certain proportion of the performance change rate over all records, the node is marked as abnormal. This is because, according to the failure rule, an abnormal node is more likely to fail again.
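The division of labour between the two Brokers can be sketched as follows. This is a hypothetical illustration, not the authors' implementation: the LocalBroker flags nodes whose CPU utilization exceeds a threshold, and the GlobalBroker picks as target the least-utilized other node that still has capacity for the VM. The selection rule, the threshold value and all class and field names are assumptions.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical sketch of the Local/Global Broker roles. */
final class Brokers {
    static final class Node {
        final String id;
        final double capacityMips;
        double usedMips;

        Node(String id, double capacityMips, double usedMips) {
            this.id = id;
            this.capacityMips = capacityMips;
            this.usedMips = usedMips;
        }

        double utilization() { return usedMips / capacityMips; }
    }

    /** Monitors local nodes and decides which ones should shed a VM. */
    static final class LocalBroker {
        final List<Node> nodes = new ArrayList<>();
        final double upperThreshold;

        LocalBroker(double upperThreshold) { this.upperThreshold = upperThreshold; }

        List<Node> nodesNeedingMigration() {
            List<Node> out = new ArrayList<>();
            for (Node n : nodes) {
                if (n.utilization() > upperThreshold) out.add(n);
            }
            return out;
        }
    }

    /** Chooses the node that receives a migrating VM. */
    static final class GlobalBroker {
        /** Least-utilized node (other than the source) that can host the VM, or null. */
        static Node selectTarget(List<Node> candidates, Node source, double vmMips) {
            Node best = null;
            for (Node n : candidates) {
                if (n == source || n.usedMips + vmMips > n.capacityMips) continue;
                if (best == null || n.utilization() < best.utilization()) best = n;
            }
            return best;
        }
    }

    public static void main(String[] args) {
        LocalBroker lb = new LocalBroker(0.8);
        Node a = new Node("a", 1000, 900);
        Node b = new Node("b", 2000, 400);
        Node c = new Node("c", 3000, 2500);
        lb.nodes.add(a);
        lb.nodes.add(b);
        lb.nodes.add(c);
        for (Node hot : lb.nodesNeedingMigration()) {
            Node target = GlobalBroker.selectTarget(lb.nodes, hot, 300);
            System.out.println(hot.id + " -> " + (target == null ? "none" : target.id));
        }
    }
}
```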
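The exact abnormality rule is not spelled out, so the sketch below shows one possible reading under stated assumptions: it keeps a running mean and variance of the relative speed changes recorded for all nodes (Welford's algorithm) and flags a node whose change deviates from the mean by more than k standard deviations. The factor k and the class name are hypothetical.

```java
/**
 * One possible reading of variance-based abnormal node detection (not the
 * authors' exact rule). k is a hypothetical tuning parameter.
 */
final class AbnormalNodeDetector {
    private long count = 0;
    private double mean = 0.0;
    private double m2 = 0.0; // running sum of squared deviations from the mean
    private final double k;

    AbnormalNodeDetector(double k) { this.k = k; }

    /** Records a relative speed change, e.g. (newSpeed - oldSpeed) / oldSpeed. */
    void record(double speedChange) {
        count++;
        double delta = speedChange - mean;
        mean += delta / count;
        m2 += delta * (speedChange - mean);
    }

    /** Sample variance of all recorded changes. */
    double variance() { return count > 1 ? m2 / (count - 1) : 0.0; }

    /** True if this change is an outlier relative to all records so far. */
    boolean isAbnormal(double speedChange) {
        if (count < 2) return false; // not enough history yet
        return Math.abs(speedChange - mean) > k * Math.sqrt(variance());
    }

    public static void main(String[] args) {
        AbnormalNodeDetector d = new AbnormalNodeDetector(3.0);
        double[] history = {0.01, -0.02, 0.00, 0.02, -0.01, 0.01, -0.01};
        for (double h : history) d.record(h);
        System.out.println(d.isAbnormal(-0.40)); // true: a sudden 40% slowdown
        System.out.println(d.isAbnormal(0.015)); // false: within normal variation
    }
}
```

Welford's update avoids storing the full history and is numerically stable, which suits a Broker that updates the statistic every time a node's speed changes.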
Resource migration algorithm
Algorithm description
The Hadoop platform adopts an estimation method to detect slow tasks and migrates them to new nodes for re-execution, so that they do not delay the completion of the whole job. Slow tasks arise because the memory and CPU performance of nodes in a cluster differ, and nodes with low performance affect task scheduling. To address the problems of the original Hadoop model, a research team at the University of California, Berkeley proposed the Longest Approximate Time to End (LATE) re-execution scheduling mechanism. LATE sets the progress rate of a task as

$$\text{ProgressRate} = \frac{\text{ProgressScore}}{T},$$

where ProgressScore ∈ [0, 1] is the fraction of the task that has been completed and T is the time the task has been running.
The main improvement of LATE is that it uses the progress rate to predict which task is likely to finish last and then re-executes that task. The predicted time to completion is

$$\text{TimeLeft} = \frac{1 - \text{ProgressScore}}{\text{ProgressRate}}.$$
This estimate assumes that the running rate of the node remains constant. The re-execution mechanism is designed to deal with abnormal deviations of a node. If a task is heavily delayed, the most likely cause is that the node on which it runs is itself abnormal; fast nodes with idle time can then be allocated to execute the slow task directly. This mechanism ensures that abnormal nodes are detected in time and that high-quality resources are used reasonably, so that the running time of the whole job is not affected by an abnormal event on a single node. To keep the job's response time ideal while avoiding additional overhead as far as possible, the paper puts forward a resource migration method based on a node performance threshold and the failure rule. Virtual machine migration technology is used, and fast and slow nodes are added into the queue for failure rule detection so that the related tasks and nodes can be migrated rapidly. Building on the aforementioned re-execution mechanism, the method not only ensures that the whole job is completed within the prescribed time but also effectively reduces the additional overhead caused by re-execution.
The flow of NPFR-RMT, implemented on the basis of the above resource management model and resource migration model, is shown in Fig. 2.
The specific execution flow mainly includes the following steps:
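The two LATE estimates can be sketched as follows. This is a hedged illustration of the formulas only: it omits the additional caps and slow-node/slow-task thresholds of the full LATE scheduler, and the Task class and its fields are hypothetical.

```java
import java.util.List;

/** Sketch of the LATE progress-rate and time-left estimates. */
final class LateEstimator {
    static final class Task {
        final String id;
        final double progressScore;  // fraction completed, in [0, 1]
        final double elapsedSeconds; // T: how long the task has been running

        Task(String id, double progressScore, double elapsedSeconds) {
            this.id = id;
            this.progressScore = progressScore;
            this.elapsedSeconds = elapsedSeconds;
        }
    }

    /** ProgressRate = ProgressScore / T. */
    static double progressRate(Task t) {
        return t.progressScore / t.elapsedSeconds;
    }

    /** TimeLeft = (1 - ProgressScore) / ProgressRate, assuming a constant rate. */
    static double timeLeft(Task t) {
        double rate = progressRate(t);
        // A task with no measurable progress is treated as the slowest possible.
        return rate > 0 ? (1.0 - t.progressScore) / rate : Double.POSITIVE_INFINITY;
    }

    /** The task LATE would consider for re-execution: longest estimated time to end. */
    static Task longestTimeToEnd(List<Task> running) {
        Task worst = null;
        for (Task t : running) {
            if (worst == null || timeLeft(t) > timeLeft(worst)) worst = t;
        }
        return worst;
    }

    public static void main(String[] args) {
        List<Task> running = List.of(
            new Task("t1", 0.5, 50),  // about 50 s left
            new Task("t2", 0.2, 40),  // about 160 s left
            new Task("t3", 0.9, 90)); // about 10 s left
        System.out.println(longestTimeToEnd(running).id); // t2
    }
}
```

Ranking by estimated time left rather than by raw progress is what lets LATE prefer the task that will actually finish last: t3 has run longest, but t2 is the one holding up the job.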
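The sketch below follows the flow as summarized in the abstract: the Broker reports node performance and failures, nodes are split into fast and slow by a threshold, failure rule detection screens the idle queue, slow and failed nodes are moved into the idle queue, and their tasks are re-executed on normal nodes. It is a hypothetical illustration, not the authors' algorithm; the node fields, the failure-rule predicate and the round-robin choice of target are assumptions.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Collections;
import java.util.Deque;
import java.util.List;
import java.util.function.Predicate;

/** Hypothetical sketch of the NPFR-RMT flow described in the abstract. */
final class NpfrFlowSketch {
    static final class Node {
        final String id;
        final double performance; // performance score reported by the Broker
        final boolean failed;     // failed node detected by the Broker
        final List<String> tasks = new ArrayList<>();

        Node(String id, double performance, boolean failed, String... initialTasks) {
            this.id = id;
            this.performance = performance;
            this.failed = failed;
            Collections.addAll(this.tasks, initialTasks);
        }
    }

    /** Returns "task -> node" migration decisions and updates both node sets. */
    static List<String> migrate(List<Node> active, Deque<Node> idleQueue,
                                double threshold, Predicate<Node> passesFailureRule) {
        // Failed nodes and slow nodes (below the threshold) are migration sources.
        List<Node> sources = new ArrayList<>();
        for (Node n : active) {
            if (n.failed || n.performance < threshold) sources.add(n);
        }
        // Failure rule detection on the idle queue keeps only normal, fast nodes as targets.
        List<Node> targets = new ArrayList<>();
        for (Node n : idleQueue) {
            if (!n.failed && n.performance >= threshold && passesFailureRule.test(n)) targets.add(n);
        }
        List<String> decisions = new ArrayList<>();
        if (targets.isEmpty()) return decisions; // nowhere to migrate: leave everything as is

        // Slow and failed nodes leave the active set and join the idle queue.
        active.removeAll(sources);
        idleQueue.addAll(sources);
        // Their tasks are re-executed on normal nodes, assigned round-robin.
        int next = 0;
        for (Node src : sources) {
            for (String task : src.tasks) {
                Node dst = targets.get(next++ % targets.size());
                dst.tasks.add(task);
                decisions.add(task + " -> " + dst.id);
            }
            src.tasks.clear();
        }
        // Targets that received work leave the idle queue and become active.
        for (Node t : targets) {
            if (!t.tasks.isEmpty()) {
                idleQueue.remove(t);
                active.add(t);
            }
        }
        return decisions;
    }

    public static void main(String[] args) {
        List<Node> active = new ArrayList<>(List.of(
            new Node("n1", 0.9, false, "t1"),
            new Node("n2", 0.3, false, "t2", "t3"), // slow node
            new Node("n3", 0.8, true, "t4")));      // failed node
        Deque<Node> idle = new ArrayDeque<>(List.of(
            new Node("n4", 0.95, false),
            new Node("n5", 0.85, false)));
        // Hypothetical failure rule: n5 has a recent failure history.
        System.out.println(migrate(active, idle, 0.5, n -> !n.id.equals("n5")));
    }
}
```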
Simulation and analysis
Simulation conditions
NPFR-RMT is evaluated in simulation experiments with the CloudSim simulator. The system environment is an Intel(R) Core(TM) 2 Duo at 2.10 GHz with 3 GB of memory. CloudSim is a tool for researching, developing and testing cloud algorithms; it can simulate many algorithms, such as virtual machine resource allocation and energy-saving algorithms. Compared with a real cloud environment, it has many advantages and accelerates algorithm design and testing [19–22].
CloudSim has several important entity classes, namely Datacenter, DatacenterBroker, PowerDatacenter and PowerModel. Datacenter simulates the core infrastructure services provided by cloud providers and encapsulates a set of hosts supporting homogeneous and heterogeneous resource configurations. DatacenterBroker is an agent that links tasks and resources; it is responsible for processing tasks submitted by users and allocating resources. PowerDatacenter simulates energy consumption and resource node migration, and PowerModel simulates the energy consumption model. To simulate the real-time load of a cloud computing service, the paper modifies DatacenterBroker to generate 3 groups of tasks every 5 seconds, each group containing a different number of tasks. The tasks are randomly allocated to virtual machines. To keep the load stable over a certain period, the method maintains the load status of each virtual machine and adds tasks repeatedly in the following several cycles.
NPFR-RMT is implemented in CloudSim. On the target migration hosts, memory resources and virtual machines are re-allocated according to the tasks' requirements on node resources, thus avoiding resource allocation failures on the new hosts. The PowerDatacenter module is modified to implement the NPFR-RMT algorithm. In addition, the simulation terminates automatically once all tasks have been completed. At that moment the data center itself cannot be closed; instead, when there are no tasks to process, the virtual machines are closed and the hosts are shut down. To facilitate the research, the Datacenter module is also modified to set the host power consumption without load as the minimum power, so the computation energy consumption of the original system under no load is 0.
The experiments also implement two other methods for comparison, both built into the CloudSim simulation tool: a no-migration method and a DVFS method. The no-migration method uses the PowerDatacenter module with setDisableMigrations set so that virtual machine migration is not permitted. DVFS is a voltage regulation technique; the DVFS method allows migration and uses the VM allocation policy class PowerVmAllocationPolicySingleThreshold, which applies a single threshold. The node performance optimization method is the method proposed in the paper (NPFR-RMT). Three sets of experiments are designed. The first compares the energy consumption of the three methods to verify that the node performance optimization method consumes less energy. The second compares the number of migrations and the Service-Level Agreement (SLA) of the three methods, further verifying the performance of the node performance optimization method. The third compares the energy consumption trends of the three methods at different scales.
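The load pattern of the modified broker can be reproduced outside CloudSim with a small stand-alone generator. This sketch has no CloudSim dependency and does not use its API: every 5 s it creates 3 groups with a randomly drawn number of tasks and sends each task to a random VM. The maximum group size, the seed and the Task class are assumptions, and the repeated re-submission used to keep load stable is not modeled.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Random;

/** Stand-alone sketch of the described workload (no CloudSim dependency). */
final class WorkloadGenerator {
    static final int GROUPS_PER_INTERVAL = 3;
    static final int INTERVAL_SECONDS = 5;

    static final class Task {
        final int submitTime; // seconds
        final int group;      // group index within the interval
        final int vmId;       // randomly chosen target VM

        Task(int submitTime, int group, int vmId) {
            this.submitTime = submitTime;
            this.group = group;
            this.vmId = vmId;
        }
    }

    /** maxTasksPerGroup is a hypothetical parameter bounding group size. */
    static List<Task> generate(int horizonSeconds, int vmCount, int maxTasksPerGroup, long seed) {
        Random rng = new Random(seed);
        List<Task> tasks = new ArrayList<>();
        for (int t = 0; t < horizonSeconds; t += INTERVAL_SECONDS) {
            for (int g = 0; g < GROUPS_PER_INTERVAL; g++) {
                int size = 1 + rng.nextInt(maxTasksPerGroup); // 1..maxTasksPerGroup tasks
                for (int i = 0; i < size; i++) {
                    tasks.add(new Task(t, g, rng.nextInt(vmCount)));
                }
            }
        }
        return tasks;
    }

    public static void main(String[] args) {
        List<Task> tasks = generate(30, 80, 10, 1L); // 80 VMs, as in the experiment
        System.out.println(tasks.size() + " tasks generated");
    }
}
```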
Result analysis
The simulation experiment configures one Datacenter, the parameters of which are as follows:
The Datacenter includes 10 hosts, each with a maximum power of 250 W. There are three processor types in the Datacenter, with processing abilities of 1000 MIPS, 2000 MIPS and 3000 MIPS respectively. Each host has 10 GB of memory and 1000 Mb/s of bandwidth. Host power consumption increases with the square of the utilization rate.
There are 80 virtual machines in the Datacenter in total. Each host runs the Xen virtual machine monitor; the parameters are as follows:
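The square-law host power model can be written out directly. The sketch below uses the figures given in the text (maximum power 250 W, no-load power 0) and assumes utilization is constant within each interval when integrating energy; the class and method names are hypothetical and this is not CloudSim's own API.

```java
/**
 * Sketch of a host power model whose dynamic part grows with the square
 * of CPU utilization: P(u) = P_idle + (P_max - P_idle) * u^2.
 */
final class SquarePowerModel {
    final double maxPowerWatts;
    final double idlePowerWatts;

    SquarePowerModel(double maxPowerWatts, double idlePowerWatts) {
        this.maxPowerWatts = maxPowerWatts;
        this.idlePowerWatts = idlePowerWatts;
    }

    /** Power draw in watts at utilization u in [0, 1]. */
    double power(double utilization) {
        if (utilization < 0 || utilization > 1) {
            throw new IllegalArgumentException("utilization must be in [0, 1]");
        }
        return idlePowerWatts + (maxPowerWatts - idlePowerWatts) * utilization * utilization;
    }

    /** Energy in joules for a period at constant utilization. */
    double energyJoules(double utilization, double seconds) {
        return power(utilization) * seconds;
    }

    public static void main(String[] args) {
        SquarePowerModel host = new SquarePowerModel(250.0, 0.0); // values from the text
        System.out.println(host.power(0.5));              // 62.5
        System.out.println(host.energyJoules(0.5, 10.0)); // 625.0
    }
}
```

Because power grows quadratically, a host at 50% utilization draws only a quarter of its peak power, which is why spreading or consolidating load changes total energy noticeably under this model.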
String vmm = "Xen";
Each virtual machine is allocated 2500 Kb/s of bandwidth, and the system memory is 128 MB. The low threshold δ1 is 0.20 and the high threshold δ2 is 0.80. The minimum power consumption without load is 0. Every 10 seconds, the system checks whether there is an appropriate virtual machine to migrate.
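The role of the two thresholds can be sketched as a periodic host classification. The text does not say what δ1 and δ2 are applied to, so the sketch assumes they bound host CPU utilization: a host below δ1 is a candidate to be emptied and switched off, and a host above δ2 should shed VMs. The class, enum and constant names are hypothetical.

```java
/**
 * Sketch of a periodic double-threshold check, assuming delta1 = 0.20 and
 * delta2 = 0.80 are bounds on host CPU utilization (an assumption).
 */
final class DoubleThresholdCheck {
    enum HostState { UNDERLOADED, NORMAL, OVERLOADED }

    static final double DELTA_LOW = 0.20;       // delta1
    static final double DELTA_HIGH = 0.80;      // delta2
    static final double CHECK_INTERVAL_S = 10.0; // check period from the text

    /** Underloaded hosts may be emptied; overloaded hosts should shed VMs. */
    static HostState classify(double cpuUtilization) {
        if (cpuUtilization < DELTA_LOW) return HostState.UNDERLOADED;
        if (cpuUtilization > DELTA_HIGH) return HostState.OVERLOADED;
        return HostState.NORMAL;
    }

    /** Number of periodic checks during a simulation of the given length. */
    static int checksDuring(double simulationSeconds) {
        return (int) Math.floor(simulationSeconds / CHECK_INTERVAL_S);
    }

    public static void main(String[] args) {
        for (double u : new double[] {0.10, 0.50, 0.95}) {
            System.out.println(u + " -> " + classify(u));
        }
        System.out.println(checksDuring(5000.0) + " checks in a 5 ks run");
    }
}
```

Leaving hosts between the two thresholds untouched is what limits the number of migrations: only clearly underloaded or overloaded hosts trigger any VM movement.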
The energy consumption of the three methods is compared in Fig. 3. The energy consumption of NPFR-RMT is lower than that of the other two methods, and the advantage becomes more obvious as time increases.
Specifically, Table 1 compares the three methods at a simulation time of 5 ks in terms of three parameters, namely SLA, number of migrations and average SLA, together with energy consumption. In terms of energy consumption, the no-migration method consumes about 23.6% more energy than NPFR-RMT, and NPFR-RMT saves about 15.3% energy compared with the DVFS method in CloudSim.
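The two percentages use different baselines. If "about 23.6% more energy than NPFR-RMT" is read relative to NPFR-RMT's consumption and "saves about 15.3%" relative to DVFS's consumption, the reported figures imply the following relations (derived arithmetic, not additional measurements):

$$E_{\text{none}} \approx 1.236\,E_{\text{NPFR}}, \qquad E_{\text{NPFR}} \approx (1 - 0.153)\,E_{\text{DVFS}} = 0.847\,E_{\text{DVFS}}$$

$$\frac{E_{\text{NPFR}}}{E_{\text{none}}} \approx \frac{1}{1.236} \approx 0.809, \qquad \frac{E_{\text{none}}}{E_{\text{DVFS}}} \approx 1.236 \times 0.847 \approx 1.047$$

Under this reading, NPFR-RMT saves roughly 19% of the energy relative to the no-migration baseline, and the no-migration method consumes roughly 4.7% more energy than DVFS.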
The number of migrations of NPFR-RMT is also far smaller than that of DVFS. This is because under DVFS each VM must be allocated a new host, which requires searching for a migration path each time, so almost all VMs are migrated. In other words, the DVFS method is not practical in terms of the number of migrations.
In terms of SLA, there is almost no difference among the three methods, but the average SLA decreases across them in order, with NPFR-RMT lowest. This indicates that under NPFR-RMT the probability that a VM cannot obtain the CPU it requests is low, so the NPFR-RMT system provides higher service quality.
Taken together, these three aspects indicate that NPFR-RMT not only reduces energy consumption and the number of migrations but also improves system service quality.
Building on the previous experiment, comparison experiments at different cluster scales are designed. The Datacenter is configured with different physical node/virtual node numbers, namely 10/50, 100/500, 500/2500 and 1000/5000. Figure 4 shows the energy consumption of the three methods at the different cluster scales.
As the figure shows, energy consumption increases with data center scale, and NPFR-RMT has the lowest energy consumption of the three methods. This is because NPFR-RMT sets high and low thresholds to reduce the number of VM migrations, which further decreases CPU resource consumption on the hosts.
Conclusion
In this paper, NPFR-RMT is proposed for resource management in cloud computing systems, and a comparison experiment is carried out. The results show that NPFR-RMT achieves the expected goal of decreasing datacenter energy consumption while ensuring service performance. However, owing to the dynamic nature of the underlying network of cloud computing [26], nodes within the network may frequently go offline and online, resulting in frequent migration. Therefore, dynamic resource migration that takes load balancing into account will be the focus of our future research [27].
Acknowledgments
This work was supported by the National Science and Technology Support Program under Grant No. 2015BAK07B03, the National “Twelfth Five-Year” Plan for Science & Technology Support under Grant No. 2013BAH18F02, and the key research topics of the Ministry of Education under the 12th Five-Year Plan for National Education Science under Grant No. CCA140152.
