Topology-aware virtual machine replication for fault tolerance in cloud computing systems

Abstract

Extensive use of cloud services has created a need for service reliability for both service providers and users. In the Infrastructure as a Service cloud computing model, it is critical to ensure the reliability of resources such as virtual machines (VMs), storage, networks, etc. The paper proposes a replication-based fault tolerance method to improve the reliability of VM-based services. The proposed approach uses a data centre topology-aware method to select the physical machines on which replicas of VMs may be placed. The selection criteria for VM replica placement favour physical machines with a lower CPU temperature, more available space and a shorter edge length from the physical machine that primarily hosts the VM. By avoiding deteriorating physical machines, this policy increases the probability of successful recovery if the VM or its host physical machine fails. The proposed approach has been evaluated using two metrics, namely recoverability and the total bandwidth consumed in the replication and recovery process. Its performance has been compared with a random replica placement method as well as a state-of-the-art algorithm. The simulation results illustrate that the proposed approach provides higher reliability than the other methods.

Keywords

Cloud computing, virtual machines, fault tolerance, replication, fat tree network topology

1. Introduction

Cloud computing is a paradigm of distributed computing developed for provisioning of dynamic computing services over the internet. It allows users to access, configure and manipulate the resources (such as software and hardware) at a remote location [18]. According to the U.S. National Institute of Standards and Technology (NIST) definition, “Cloud computing is a model for enabling convenient, on-demand network access to a shared pool of configurable computing resources (for example servers, networks, storage, services, and applications) that can be quickly provisioned and released with the least management effort or service provider interaction” [13].

Cloud computing offers three service models: (a) Software as a Service (SaaS), in which the cloud service provider presents software applications as services to consumers or end-users; (b) Platform as a Service (PaaS), which provides platforms to develop, run, test and manage applications in the cloud; and (c) Infrastructure as a Service (IaaS), which provides access to resources such as physical machines, storage, networks, servers and virtual machines in the cloud [15].

This work focuses on the Infrastructure as a service model of cloud computing which employs infrastructure in data centres and technology of virtualization to provide users with resources on demand. With virtualization technology, users (also referred to as tenants) are able to rent and access shared resources in IaaS data centres [4]. Physical machines or hosts are provisioned as virtual machines to users for running their applications. Of the numerous virtual machines (VMs) executing in a cloud data centre, it is hard to guarantee that all the VMs constantly perform satisfactorily [9]. In practice, various cloud services are not able to meet their commitment towards reliability assurance successfully due to failures of VMs. In a cloud computing-based environment, it is critical to enhance the reliability of the VM-based services for QoS guarantees to users [5].

Several solutions have been developed to deal with service reliability issues; they either forecast and prevent failures or endeavour to provide fault tolerance [17]. Recognizing and eliminating every fault that may arise is infeasible in a complicated computing environment such as a cloud computing system, where VM failures are unavoidable [10]. Therefore, fault tolerance schemes are used more often, since they enable the system to serve requests even if some of its components are not working properly. A fault tolerant system is expected to keep working despite faults in software or hardware components, power failures or other unexpected adversities [7, 20]. Thus, fault tolerance is related to successful operation and hence to system reliability.

Various fault tolerance schemes such as checkpointing and replication have been proposed in literature for the cloud environment. Replication of VMs has been proposed to provide reliability of applications executing on VMs hosted on cloud hosts. Faults and hence, failures of physical machines are inevitable in cloud computing. Each physical machine can store $v$ distinct virtual machines, each of which may be executing a task or application. If a virtual machine fails, the task is suspended and its already performed execution is lost. Similarly, the failure of a physical machine (PM) implies failure of all VMs hosted by this physical machine. Therefore, for achieving fault tolerance, each virtual machine needs to be replicated. If a VM or the host machine fails, it is mapped to a replica VM and the incomplete tasks continue execution at the replica; thus providing service reliability to users.

However, restarting on the replica VM consumes time and network resources since data or information needs to be acquired from the central storage system/servers. Therefore, VM replication is not a trivial problem: the replica must be placed at a node from which recovery is both possible and quick, i.e., does not lead to a high downtime. In many instances, a single VM may fail while its host is still executing. If the replica has been placed randomly, the probability of encountering additional failures increases and more bandwidth is required to transfer the data or information to the replica. In comparison, appropriate placement of the replica of a VM can save a significant amount of time as well as network resources in case of failure.

The paper addresses the problem of fault tolerance in cloud computing by proposing a virtual machine replication scheme. The proposed scheme is a new topology-aware VM replica placement method to improve the reliability and availability of cloud-based systems. Using this approach, replicas (referred to as backup VMs) of virtual machines (referred to as primary VMs) are placed on the optimal candidate servers. The optimality of a candidate for replica placement is determined by calculating its weight, which depends on its distance from the VM and its current temperature. The distance between the replica and the host of the VM needs to be kept low for fast recovery. Further, candidates with higher temperatures are not preferred as they are more likely to fail earlier [5]. Weight-based selection for replica placement contributes to a fast recovery by reducing the time and bandwidth required to transfer the data or information to the replica in case of failure. The proposed algorithm is evaluated by measuring the percentage of successful recoveries of failing virtual machines. The proposed approach is also compared with a random VM placement approach and a related VM replication method, the optimal redundant VM placement method (OPVMP) presented by Zhou et al. [1]. It has been verified that the proposed approach results in a higher percentage of recoveries while traversing fewer links.

The rest of the paper is organized as follows: Section 2 discusses the related work in the area of fault tolerance in cloud computing. Section 3 discusses the proposed approach and Section 4 presents the performance evaluation of the proposed scheme with the results of simulation. Finally, we conclude our presentation in Section 5.

2. Literature review

Fault tolerance allows systems to offer the needed services in the presence of one or more faults or component failures. Fault tolerance approaches aid in detecting and handling faults that may occur due to either hardware failures or software faults. Fault tolerance is especially crucial in a cloud platform as it gives users assurance regarding the performance, reliability and availability of the applications executing in the cloud. As identified from the literature, fault tolerance approaches for cloud-based systems are generally reactive in nature and work to reduce the effect of a failure after it has occurred. The commonly used techniques to achieve reactive fault tolerance are replication, checkpointing-based recovery and job resubmission [12]. Checkpointing [1] is used to save the system’s state periodically. If a constituent task fails, the job is restarted from the last checkpointed state rather than from the beginning, which prevents the loss of useful computation. However, taking checkpoints periodically and restarting a failed service from the last saved checkpoint image(s) is time-consuming and involves high overheads during normal operation.

Another popular and related technique for fault tolerance is replication [11], which takes advantage of redundant and idle computing resources present in data centres. Replication schemes create multiple copies or replicas of tasks or resources and store these replicas in a distributed manner. As a result, a task can continue execution in the presence of failures as long as a replica is available.

Replication is one of the most commonly employed fault tolerance techniques in cloud due to the availability of redundant resources in the datacenters. As a result, a number of replication based fault tolerance techniques have been reported in literature for cloud computing. Zhou et al. [1] proposed an optimal redundant virtual machine placement model to improve the reliability of server-based cloud services using a replication-based fault tolerance method. A heuristic algorithm has been proposed for appropriate host server selection as well as optimal placement of VM replicas. An asymmetric virtual machine replication method to improve the reliability and reduce the access latency has been presented by Chen et al. [16]. The proposed approach employs an active primary virtual machine and semi-active slave VM combination. Wu et al. [19] have proposed a two-stage fault tolerance method. In the first stage, the algorithm ranks the manufacturing services according to their importance or priority. Replication policies are adjusted for services according to their ranks in order to achieve a trade-off among fault tolerance and use of resources for fault tolerance. In the second stage, authors use an A*-based heuristic alternative path searching algorithm to find appropriate replacement systems for the complex services in case of failures. An adaptive Replication and Resubmission based model has been presented by Patra et al. [14] for fault tolerance in real-time cloud computing. Using a predictive model, the system detects faults and subsequently decides on fault tolerance method based on the availability of resources. Another replication based method has been put forth by Abdelfattah et al. [3] that tolerates faults by implementing replication and resubmission approaches depending on the reliability requirements. Once a task failure occurs at the node with highest reliability, the task is rescheduled to the node with second highest reliability.

A related fault tolerance approach is checkpointing, whereby a VM’s execution state is saved periodically to stable storage during failure-free execution. Upon failure, a VM retrieves its last saved state and resumes computation from that state instead of restarting from the initial state. Checkpointing entails storage and communication overheads and hence has not been employed widely for fault tolerance in the cloud. In order to perform checkpointing with low overheads, incremental checkpointing has been suggested by Gill and Buyya [17]. Each checkpoint image contains only the pages that have been modified since the last checkpoint. An additional reduction in the consumption of network resources was presented by Zhao et al. [6], who designed a peer-to-peer checkpoint scheme for a cloud data centre with a fat-tree topology. In this approach, the checkpoint images are kept on neighbouring host servers. It is observed that placing the storage server in the same pod of the fat tree keeps the transfer of checkpoint images away from the core switches. Zhou et al. [1] proposed a checkpoint-based approach to tolerate the failures of both host servers and edge switches. The main motive of this scheme is to ensure cloud service reliability and to minimize network resource consumption. Amoon [8] proposed an adaptive fault tolerant framework that uses both replication and checkpointing. The algorithm dynamically selects the most suitable fault tolerance scheme for each allocated VM.

The present work makes use of the topology of the cloud data centre in determining a suitable host for placing the VM replica. Subsequently, an optimal host is selected based on its temperature and current usage. The aim is to increase the probability of successful recovery of a failed VM.

3. Proposed approach

The paper proposes a topology aware VM replication scheme for a data centre with fat tree network topology.

Figure 1.

Reactive fault tolerance techniques.

3.1 System model

The paper considers an IaaS cloud computing system which employs a fat-tree network topology. The fat tree topology consists of three layers of switches as depicted in Fig. 2, where a fat tree topology with four pods is shown. The bottom layer of the fat tree topology consists of edge switches. The link that connects an edge switch with a host server is known as an edge link. The host PMs are physically linked to the network through edge switches. All host PMs that share the same edge switch are said to lie in one subnet. The middle layer of the fat tree topology consists of aggregation switches. The link that joins an edge switch with an aggregation switch is known as an aggregation link. All host PMs that share the same aggregation switches are said to be in the same pod. The topmost layer, containing the core switches, is the core tier. The link that joins a core switch with an aggregation switch is known as a core link [2, 5]. A main advantage of the fat tree topology is that all switches are identical commodity Ethernet switches. Furthermore, the topology provides multiple paths for communication in case the system encounters bandwidth bottlenecks. A k-port fat tree network, i.e., one in which every switch has k ports, can be constructed using the values in Table 1.

Table 1
Specifications of a fat tree

Number of ports	$k$
Number of core switches	$\left({k/2}\right)^{2}$
Number of aggregation switches in a pod	$\left({k/2}\right)$
Number of edge switches in a pod	$\left({k/2}\right)$
Number of host servers in a pod	$\left({k/2}\right)^{2}$
Total number of host servers	$\left({k^{3}/4}\right)$
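
As a quick check of Table 1, the counts can be evaluated for a given port number. The following is a minimal sketch; the function name `fat_tree_spec` and the dictionary keys are illustrative and not part of the original work, and the pod count of k is taken from the examples in Section 4.1 rather than from Table 1.

```python
def fat_tree_spec(k: int) -> dict:
    """Return the component counts of a k-port fat tree (k must be even)."""
    if k <= 0 or k % 2 != 0:
        raise ValueError("k must be a positive even number of ports")
    half = k // 2
    return {
        "pods": k,  # assumption taken from Section 4.1 (8 ports -> 8 pods)
        "core_switches": half ** 2,
        "aggregation_switches_per_pod": half,
        "edge_switches_per_pod": half,
        "hosts_per_pod": half ** 2,
        "total_hosts": k ** 3 // 4,
    }


if __name__ == "__main__":
    for ports in (8, 12):
        print(ports, fat_tree_spec(ports))
```

For k = 8 this gives 16 core switches and 128 hosts, and for k = 12 it gives 36 core switches and 432 hosts, which matches the configurations used later in the simulation setup.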

Figure 2.

Fat-tree topology architecture.

The fat tree topology architecture consists of varied host PMs, where each PM is characterized by its CPU performance, expressed in MIPS (millions of instructions per second), its network bandwidth, RAM size and disk storage. At any given time, a cloud data centre generally serves several concurrent users. A host PM may contain n heterogeneous VMs. Users submit requests to lease heterogeneous VMs, which are provisioned on the host PMs. The VMs are characterized by their requirements for CPU performance, network bandwidth, RAM and disk storage. The size of each request is quantified in MIPS [5].

3.2 Selection of physical machines for replica placement

The proposed scheme calculates a weight, $w$, for each physical machine (PM) in the data centre and uses it to decide where a replica VM is placed. The weight of a physical machine is directly proportional to its CPU temperature, temp, and to the number of VMs, $n$, currently hosted by the PM, and is computed as follows:

$\displaystyle w=\textit{temp}\times n$ (1)

A low value of a PM’s weight indicates its appropriateness for hosting a replica of a VM. CPU temperature is selected as a causal parameter since it can be used to find a deteriorating physical machine. Maintaining the temperatures of machines below threshold values can improve the energy efficiency of data centres. Monitoring and forecasting CPU temperature are crucial for avoiding PM failures caused by overheating. Thus, machines with lower temperatures are considered more suitable. Further, each PM can host up to a maximum number of VMs that depends on its own configuration as well as the VMs’ configurations. Therefore, a PM currently hosting fewer VMs is preferable as a candidate PM. Consequently, if a given PM $i$ has a lower weight than another PM $j$, PM $i$ is a better candidate than PM $j$ for replica placement.
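
The ranking implied by Eq. (1) can be illustrated with a few made-up values. This is only a sketch: the PM names, temperatures and VM counts below are hypothetical, and the helper names `pm_weight` and `rank_candidates` are not from the original work.

```python
def pm_weight(cpu_temp_c: float, hosted_vms: int) -> float:
    """Weight of a PM as in Eq. (1): w = temp * n."""
    return cpu_temp_c * hosted_vms


def rank_candidates(pms: dict) -> list:
    """Sort PM names by ascending weight.

    pms maps a PM name to a (CPU temperature in degrees C, hosted VM count) pair.
    """
    return sorted(pms, key=lambda name: pm_weight(*pms[name]))


if __name__ == "__main__":
    # Hypothetical readings, chosen only for illustration.
    candidates = {"pm_a": (60.0, 1), "pm_b": (40.0, 2), "pm_c": (45.0, 1)}
    print(rank_candidates(candidates))  # ['pm_c', 'pm_a', 'pm_b']
```

Here pm_b is the coolest machine but ranks last because it already hosts two VMs (weight 80 against 45 and 60). Note also that, under Eq. (1) as written, a PM hosting no VMs has weight 0 whatever its temperature.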

Subsequent to calculation of weights, the proposed scheme places replicas of VMs on each host PM onto the candidate PMs. A PM can host a given number, $v$ of VMs including its own VM as well as replicas of VMs depending on its own as well as the VMs’ configurations. The proposed algorithm firstly selects the most favourable PM of all candidate PMs to place the replica VMs. Candidate PMs are selected based upon the weight parameter and hence, an optimal candidate PM should have the minimum weight out of all candidates. Once the VM replication is done, if any VM crashes, it is checked if its replica is available. Successful recovery is achieved by resuming the task on the replica. The details of the proposed algorithm are presented next and a flowchart is depicted in Fig. 3.

Figure 3.

Virtual machine replication.

3.3 Virtual machine replication

Each PM in a data centre is assumed to host one or more VMs and is referred to as the primary PM for VMs hosted by it. A backup PM is selected from other available PMs to place the replicas of VMs executing on the primary PM. For a given VM, its backup replica does not run till the primary functions normally. In case of failure of the primary, the replica on the backup PM replaces the failed VM. If failure of a VM is related to hardware failure, the replica VM obtains the required data from the central database of the service being executed. However, in case of software related failures of the VM alone, the data can be obtained from its host or primary PM.

Since a fat-tree topology is assumed, the candidate PMs are shortlisted based on their edge length from the primary PM in order to save bandwidth in the data centre. Therefore, the PMs within the same subnet are considered first. These are the PMs that are connected to the same edge switch as the primary PM. If none of these can host the replica, the PMs connected to the same aggregation switch as the primary PM are considered. Only if no PM is found in the same subnet or pod are other PMs considered as candidates. The PM with the minimum weight out of all candidate PMs is then selected as the backup PM. To this end, the weight is computed for all PMs using Eq. (1) and the candidates are sorted in order of increasing weight. The first element of the list is checked for availability of space for an additional VM. If space is available, the replica is placed on that candidate PM; otherwise, the candidate PM with the next lowest weight is selected from the list. This process continues until either

i. There is no space available in any candidate PM, or
ii. Replicas of all VMs have been placed.
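
The three shortlisting stages correspond to edge lengths of 2, 4 and 6 links in a fat tree (same subnet, same pod, different pods). The following sketch shows how such an edge count could be derived. It assumes, purely for illustration, that hosts are numbered consecutively so that each block of k/2 hosts shares an edge switch and each block of (k/2)^2 hosts forms a pod; this numbering and the names `edge_count` and `shortlist` are not taken from the original work.

```python
def edge_count(host_a: int, host_b: int, k: int) -> int:
    """Number of links on the shortest path between two hosts of a k-port fat tree."""
    hosts_per_subnet = k // 2
    hosts_per_pod = (k // 2) ** 2
    if host_a == host_b:
        return 0
    if host_a // hosts_per_subnet == host_b // hosts_per_subnet:
        return 2  # host - edge switch - host
    if host_a // hosts_per_pod == host_b // hosts_per_pod:
        return 4  # path goes up to an aggregation switch and back down
    return 6      # path goes up to a core switch and back down


def shortlist(primary: int, k: int, edge_length: int) -> list:
    """Hosts lying exactly edge_length links away from the primary host."""
    total_hosts = k ** 3 // 4
    return [h for h in range(total_hosts)
            if edge_count(primary, h, k) == edge_length]


if __name__ == "__main__":
    k = 4  # the four-pod fat tree of Fig. 2 has 16 hosts
    for length in (2, 4, 6):
        print(length, shortlist(0, k, length))
```

For host 0 in a 4-port fat tree, the shortlist holds one host at edge length 2, two hosts at edge length 4 and the twelve hosts of the other pods at edge length 6.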

The VM replication procedure is presented in detail in Algorithm 1.

Algorithm 1. Virtual machine replica placement
Input:
Total number of VMs: n;
Number of physical machines (PMs): m;
CPU temperature of PMs: pm_temp[1..m];
Number of VMs hosted at each PM: VMs[1..m];
Number of candidate PMs: m; // all physical machines are also candidates for replica storage
Output:
List of physical machines containing replicas of VMs: Replica_pm[1..m][1..v]
Variables:
Weight of physical machines: weight[1..m];
Edge length between two PMs: edge_length;
Number of edges between each pair of PMs: edge_count[1..m][1..m];
Available capacity on each PM to host additional VMs: available_space[1..m];
List of shortlisted candidate PMs for a PM: list_rep_pm[1..k];
Weight of each PM in list_rep_pm: list_wt_rep_pm[1..k]
Procedure:
1. Initialize edge_length = 2
2. For each host i do
3.    Compute weight[i] = pm_temp[i] * VMs[i]
4. End for
5. For each host i do
6.    Clear list_rep_pm and list_wt_rep_pm
7.    For each candidate j do
8.       If (edge_count[i][j] == edge_length) then
9.          Add j to list_rep_pm and weight[j] to list_wt_rep_pm
10.       End if
11.    End for
12.    Sort list_wt_rep_pm in ascending order
13.    Sort list_rep_pm according to list_wt_rep_pm
14.    For each VM l in i that has no replica yet do
15.       For each j in list_rep_pm, in order, do
16.          If (available_space[j] == 0) then
17.             Continue with the next j
18.          End if
19.          Replicate VM l to PM j
20.          Set Replica_pm[i][l] = j
21.          Decrement available_space[j] by 1
22.          Update weight of PM j
23.          Exit the loop over j
24.       End for
25.    End for
26. End for
27. If any VM has no replica yet, repeat from step 2 for edge_length = 4 and then 6
28. Return Replica_pm
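
The placement procedure described in Section 3.3 can be sketched in Python as follows. This is an illustrative reading of the prose, not the authors' implementation: it assumes that the weight of a backup PM is recomputed after every placement using its current load (primary VMs plus replicas), that ties are broken by the lowest PM index, and that `capacity` gives the maximum number of VMs a PM can hold. The function name `place_replicas` and its parameters are hypothetical.

```python
def place_replicas(pm_temp, vms, capacity, edge_count):
    """Return {(primary_pm, vm_index): backup_pm} for every VM that could be placed.

    pm_temp[i]       CPU temperature of PM i
    vms[i]           number of primary VMs hosted on PM i
    capacity[i]      maximum number of VMs (primaries plus replicas) on PM i
    edge_count[i][j] number of links between PM i and PM j
    """
    m = len(pm_temp)
    load = list(vms)  # VMs currently held by each PM, replicas included
    replica_pm = {}
    # Widen the search step by step: same subnet, same pod, then other pods.
    for edge_length in (2, 4, 6):
        for i in range(m):
            pending = [l for l in range(vms[i]) if (i, l) not in replica_pm]
            for l in pending:
                candidates = [j for j in range(m)
                              if edge_count[i][j] == edge_length
                              and load[j] < capacity[j]]
                if not candidates:
                    break  # nothing left at this edge length for host i
                # Pick the candidate with the minimum weight, Eq. (1).
                j = min(candidates, key=lambda c: pm_temp[c] * load[c])
                replica_pm[(i, l)] = j
                load[j] += 1
    return replica_pm


if __name__ == "__main__":
    # Four PMs in one pod: PMs 0 and 1 share a subnet, as do PMs 2 and 3.
    links = [[0, 2, 4, 4],
             [2, 0, 4, 4],
             [4, 4, 0, 2],
             [4, 4, 2, 0]]
    print(place_replicas([40, 50, 45, 60], [2, 0, 1, 0], [2, 1, 2, 2], links))
```

In the example, PM 1 has room for only one VM, so the second VM of PM 0 is placed in the neighbouring subnet, on PM 2, which has the lower weight of the two candidates there.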

3.4 Handling failures of VMs or PMs

A PM may fail independently of any other PM in the cloud data centre. A PM failure may occur for several reasons, such as network-related failures or software or hardware faults. In this paper, we consider both hardware-related failures of PMs and software failures of specific VMs. A VM may fail randomly due to faults in the operations being executed on it. The failure of a VM does not affect the other VMs executing on the same PM. In this case, the VM is recovered independently.

However, if a PM fails, all VMs hosted by the PM fail and need to recover. The present work has used the CPU temperature-based model for simulating the deterioration or failures of PMs [5]. A PM may fail if the CPU temperature exceeds the threshold value for maximum temperature. The temperature model is defined below:

$\displaystyle f\left(t\mid A,\omega,t_{i},t_{i+1}\right)=\begin{cases}e^{t}&0\leqslant t\leqslant t_{i}\\ e^{t_{i}}&t_{i}\leqslant t\leqslant t_{i+1}\\ A\sin\left(\omega t-\omega t_{i+1}\right)+e^{t_{i}}&t_{i+1}\leqslant t\leqslant t_{i+2}\end{cases}$ (2)

Here, the first case simulates the variation of CPU temperature during computer boot; $t_{i}$ is a fixed value chosen such that $e^{t_{i}}=35$, since $e^{t_{i}}$ denotes the no-load CPU temperature, set at 35 $^{\circ}$C; $t_{i+1}$ is a random value; $t_{i+2}$ is calculated as $t_{i+2}=\pi/\omega+t_{i+1}$; the symbol $A$ represents the highest value of CPU temperature (generally taken as less than 68 $^{\circ}$C); and the symbol $\omega$ determines the period for which the CPU is in execution. The values of $A$ and $\omega$ can be adjusted randomly to represent different levels of CPU utilization in different time intervals. Only the first half cycle of the sinusoidal function is used, and its duration is $\pi/\omega$. Based on this temperature model, it is possible to identify a deteriorating PM.
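
A direct evaluation of Eq. (2) may help in reading the model. The sketch below transcribes the three cases as written; the function name `cpu_temperature`, the parameter names and the sample values of $A$, $\omega$ and $t_{i+1}$ are chosen here for illustration only.

```python
import math

NO_LOAD_TEMP_C = 35.0            # e^{t_i}, the no-load CPU temperature
T_I = math.log(NO_LOAD_TEMP_C)   # t_i, so that exp(t_i) = 35


def cpu_temperature(t: float, amplitude: float, omega: float, t_next: float) -> float:
    """Evaluate Eq. (2) at time t; t_next plays the role of t_{i+1}."""
    if t_next < T_I:
        raise ValueError("t_{i+1} must not be smaller than t_i")
    t_end = math.pi / omega + t_next  # t_{i+2}
    if t < 0 or t > t_end:
        raise ValueError("t lies outside the interval covered by Eq. (2)")
    if t <= T_I:
        return math.exp(t)            # boot phase
    if t <= t_next:
        return NO_LOAD_TEMP_C         # idle phase at no-load temperature
    # Loaded phase: first half cycle of a sinusoid on top of the no-load level.
    return amplitude * math.sin(omega * (t - t_next)) + NO_LOAD_TEMP_C


if __name__ == "__main__":
    # Hypothetical parameters: A = 30, omega = 0.5, t_{i+1} = 10.
    for t in (0.0, 5.0, 10.0 + math.pi, 16.0):
        print(t, round(cpu_temperature(t, 30.0, 0.5, 10.0), 2))
```

With the equation as written, the temperature in the loaded phase peaks at $A+e^{t_{i}}$, reached halfway through the interval, i.e., at $t=t_{i+1}+\pi/(2\omega)$.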

3.5 Recovery from failures

An effective recovery procedure is necessary for a reliable cloud system. When one or more VMs of a host PM fail, the scheduled tasks on the failed VM may pause for a time period and are kept in the waiting queue. The recovery process is required to retrieve the replica of VMs from the backup PMs efficiently. If a failed VM recovers, all tasks in the waiting queue are rescheduled for execution. Otherwise, the scheduled task cannot be finished successfully. In case a PM fails, the recovery procedure is executed for all VMs hosted by it.

The VM recovery procedure is presented in detail in Algorithm 2.

Algorithm 2. Recovery of a virtual machine j
Input:
Number of physical machines: m;
List of PMs holding replicas of VMs: Replica_pm[1..m][1..v];
CPU temperature of each PM: temp[1..m];
Maximum CPU temperature: temp_threshold
Output:
Success of recovery: Recovered_VM(j)
Procedure:
1. Set i = host of VM j
2. Set p = Replica_pm[i][j]
3. If temp[p] >= temp_threshold then
4.    // the backup PM p has reached the threshold, so j cannot be recovered
5.    Set Recovered_VM(j) = false
6. Else
7.    // the backup PM p is below the threshold, so j can be recovered from its replica
8.    Set Recovered_VM(j) = true
9. End if
10. Return Recovered_VM(j)
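
The recovery check of Algorithm 2 amounts to a single comparison, as the following sketch shows. It additionally assumes that a VM without any placed replica counts as unrecoverable, which follows the description in Section 3.2 but is not spelled out in the algorithm; the function name `recover_vm` and the dictionary layout of `replica_pm` are hypothetical.

```python
def recover_vm(host, vm, replica_pm, temp, temp_threshold):
    """Return True if the failed VM can be resumed on its backup PM.

    replica_pm maps (host, vm) to the backup PM holding the replica.
    temp maps each PM to its current CPU temperature.
    """
    backup = replica_pm.get((host, vm))
    if backup is None:
        return False  # assumption: no replica was placed, so no recovery
    # The backup PM is usable only while it is below the threshold.
    return temp[backup] < temp_threshold


if __name__ == "__main__":
    placements = {(0, 0): 1, (0, 1): 2}          # hypothetical placements
    temperatures = {0: 70.0, 1: 52.0, 2: 68.0}   # hypothetical readings
    print(recover_vm(0, 0, placements, temperatures, 68.0))  # True
    print(recover_vm(0, 1, placements, temperatures, 68.0))  # False
```

When a PM fails, the same check would be repeated for every VM it hosts.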

4. Performance evaluation

In this section, we assess the efficiency of the proposed approach with simulation based experiments. We compare the proposed method with a random VM replication approach in terms of

• Total number of successful recoveries possible, and
• The total number of links between a VM and its replica.

4.1 Simulation setup

A data centre with a k-port fat tree network topology has been assumed for all experiments. The specifications used for simulation are described below and listed in Table 2. The number of switches for different values of k follows from Table 1. For example, a data centre with an 8-port fat tree topology consists of 16 core switches and 8 pods. Each pod consists of 4 aggregation and 4 edge switches. There are 4 host PMs in each subnet. The total number of PMs can be 128 and the maximum number of VMs could be 512. Similarly, a 12-port fat tree consists of 36 core switches and 12 pods. Each pod consists of 6 aggregation and 6 edge switches, and each edge switch connects 6 hosts or PMs. The total number of PMs can be 432 and the maximum number of VMs could be 1728. The simulation experiments have been performed on a system with an Intel® Xeon(R) CPU E5607 processor with a CPU speed of 2.27 GHz × 4 and 15.6 GiB of memory.

Table 2
System configuration for the proposed method

Platform configuration
The capacity of core link in fat tree	10 Gbps
The capacity of aggregation link in fat tree	10 Gbps
The capacity of edge link in fat tree	1 Gbps
Minimum CPU temperature	35 ${}^{\circ}$ C
Maximum CPU temperature	68 ${}^{\circ}$ C
Number of VMs for each host PM	0 to 4

4.2 Metrics

The VM replication approaches are evaluated using the following performance metrics:

Recoverability: This metric is the percentage of failed virtual machines that are recovered successfully, and is calculated as follows:

$\displaystyle\textit{Recoverability}=\frac{\textit{Total number of successful recoveries}}{\textit{Total number of failures}}\times 100$ (3)

Figure 4.

Recoverability vs number of VMs.

Figure 5.

Recoverability for varied number of VMs.

Total Bandwidth (TBW): Used to calculate the amount of bandwidth consumed in VM replication procedure. It indicates the total number of edges or links between a primary PM and a backup PM where the replica of VM is placed and is calculated as follows:

$\displaystyle\textit{TBW}=\sum_{i=1}^{n}\sum_{j=1}^{m}\textit{link\_count}\left(\textit{Host\_PM}_{i},\textit{Candidate\_PM}_{j}\right)\times\textit{link\_capacity}$ (4)

where $n$ denotes the number of host PMs and $m$ is the number of VMs hosted by a particular PM, so that $\textit{Candidate\_PM}_{j}$ is the backup PM holding the replica of the $j$-th VM of $\textit{Host\_PM}_{i}$.
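
Both metrics can be computed from a list of replica placements and failure counts. The sketch below is illustrative: it uses a single `link_capacity` value as in Eq. (4), and the function names and sample numbers are hypothetical rather than taken from the experiments.

```python
def recoverability(successful_recoveries: int, failures: int) -> float:
    """Eq. (3): percentage of failures that were recovered successfully."""
    if failures == 0:
        raise ValueError("recoverability is undefined when there are no failures")
    return 100.0 * successful_recoveries / failures


def total_bandwidth(placements, edge_count, link_capacity: float) -> float:
    """Eq. (4): sum of link counts between primary and backup PMs, times capacity.

    placements is an iterable of (primary_pm, backup_pm) pairs, one per replica.
    """
    return sum(edge_count[p][b] * link_capacity for p, b in placements)


if __name__ == "__main__":
    links = [[0, 2, 4],
             [2, 0, 4],
             [4, 4, 0]]
    print(recoverability(45, 50))                          # 90.0
    print(total_bandwidth([(0, 1), (0, 2)], links, 1.0))   # 6.0
```

A placement policy that keeps replicas within the subnet contributes 2 links per replica to TBW, against 4 or 6 for replicas placed in the same pod or in another pod.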

The objective of the proposed VM replication scheme is to maximize the recoverability and reduce the total links between VMs and their replicas. While increase in recoverability is an obvious design parameter, reducing the number of links decreases the overhead and latency of recovery.

4.3 Results

4.3.1 Measuring recoverability

The first set of experiments measures the percentage of VMs that recovered successfully using the proposed method in a given period of time. The results are compared with the number of successful recoveries using a random replica placement method and a state-of-the-art algorithm, Optimal redundant Virtual Machine Placement (OPVMP), presented by Zhou et al. [1]. It can be observed from the results in Fig. 4 that the proposed scheme achieves recoverability that is approximately 15% higher on average than the random VM placement scheme and around 6% higher than the OPVMP scheme. Further, as the number of VMs increases, recoverability does not degrade; hence, the proposed scheme is scalable in performance.

Subsequently, recoverability is measured by varying the time of simulation for different number of VMs. Figure 5 demonstrates the performance of the considered algorithms. The proposed algorithm results in higher recoverability as compared to the random approach as well as OPVMP. As time increases, there is a dip in performance of all algorithms. However, the proposed algorithm performs better than the other algorithms in all cases.

The next set of experiments evaluated recoverability under varying rates of VM failures. The failure rate, i.e., how frequently a PM or VM fails in a specified period of time, indicates the degree of reliability of the cloud computing system. Figures 6a and 6b show system recoverability for failure rates of 0.01, 0.05, 0.1 and 0.5 (denoting low to high rates of failure) for 512 and 1728 VMs, respectively.

Figure 6.

Recoverability vs failure rate.

Figure 7.

Total bandwidth consumed for replicating varying number of VMs.

In both cases, the proposed approach yields higher recoverability than the random replica placement approach as well as OPVMP. Moreover, it is observed that even with a 50-fold increase in failure rate (from 0.01 to 0.5), the recoverability of the system remains comparable to that for lower failure rates.

4.3.2 Measuring bandwidth consumption

The next set of experiments is performed to assess the bandwidth consumed in placing the replicas of VMs on backup PMs. Figure 7 shows the bandwidth consumed while placing replicas of 216, 512, 1728 and 4096 VMs in a particular duration, using the proposed and OPVMP algorithms.

Figure 7 illustrates that the proposed approach consumes significantly lower bandwidth to replicate the VMs as compared to the OPVMP algorithm. This implies a saving of resources both during replica placement as well as during recovery.

Subsequently, the execution times of the proposed algorithm and the OPVMP algorithm were measured. The results in Fig. 8 show that OPVMP takes longer, as it involves sorting operations over subnets, pods or, in the worst case, the whole network.

Figure 8.

Total execution time.

Thus, the simulation results demonstrate that the proposed approach offers a better recovery percentage, scales well and requires traversal of significantly fewer network links than a random replication approach as well as a related contemporary method, OPVMP.

5. Conclusion

The paper has presented a topology-aware, replication-based fault tolerance method to improve the reliability of cloud computing systems. The proposed approach employs a CPU temperature-based model to avoid deteriorating physical machines while placing the VM replicas. It also aims to place the replica of a VM on a physical machine close to its host machine, thereby reducing bandwidth consumption and speeding up recovery of a failed VM. The performance evaluation has employed the metrics of recoverability and the total number of links traversed in the replication process. The simulation results have illustrated that the proposed approach provides better reliability than other related and contemporary replica placement methods.

Authors’ Bios

Priti Kumari received the MCA degree from Banasthali University, Rajasthan, India in 2014 and the M.Tech degree in Computer Science & Engineering from the same university in 2016. She is currently a Ph.D. student in Computer Science at Jaypee Institute of Information Technology, Noida, India. Her research interests include Cloud Computing, Fault Tolerance, and Distributed Computing.

Dr. Parmeet Kaur received the B.E. degree in Computer Science & Engineering from Punjab Engineering College, Chandigarh, India. She received the M.Tech degree in Computer Science from Kurukshetra University, Haryana, and the Ph.D. degree in Computer Science from NIT, Kurukshetra, India. She is currently an Assistant Professor (Senior Grade) at Jaypee Institute of Information Technology, Noida, India. Her research interests include Distributed Systems, Cloud Computing, Fault Tolerance, Big Data and Distributed Algorithms.

References

1. Zhou, Sun and Li, Enhancing reliability via checkpointing in cloud computing systems, China Communications 14(7) (2017), 1–10.

2. Zhou, Wang, Cheng, Zheng, Yang, Chang R.N., Lyu M.R. and Buyya, Cloud service reliability enhancement via virtual machine placement optimization, IEEE Transactions on Services Computing 10(6) (2017), 902–913.

3. AbdElfattah, Elkawkagy and El-Sisi, A reactive fault tolerance approach for cloud computing, in 13th International Computer Engineering Conference (ICENCO), IEEE, 2017.

4. Guo, Liu, Lui J.C.S. and Jin, Fair network bandwidth allocation in IaaS datacenters via a cooperative game approach, IEEE/ACM Transactions on Networking 24(2) (2015), 873–886.

5. Liu, Wang, Zhou, Kumar S.A.P., Yang and Buyya, Using proactive fault-tolerance approach to enhance cloud service reliability, IEEE Transactions on Cloud Computing 6(4) (2016), 1191–1202.

6. Zhao, Xiang, Lan, Huang H.H. and Subramaniam, Elastic reliability optimization through peer-to-peer checkpointing in cloud computing, IEEE Transactions on Parallel and Distributed Systems 28(2) (2016), 491–502.

7. Saikia L.P. and Devi Y.L., Fault tolerance techniques and algorithms in cloud computing, International Journal of Computer Science & Communication Networks 4(1) (2014), 1–8.

8. Amoon, Adaptive framework for reliable cloud computing environment, IEEE Access 4 (2016), 9469–9478.

9. Hasan and Goraya M.S., Fault tolerance in cloud computing environment: A systematic survey, Computers in Industry 99 (2018), 156–172.

10. Cheraghlou M.N., Khadem-Zadeh and Haghparast, A survey of fault tolerance architecture in cloud computing, Journal of Network and Computer Applications 61 (2016), 81–92.

11. Shen, Zhu and Guan, Availability-aware virtual network embedding for multi-tier applications in cloud networks, in 2015 IEEE 17th International Conference on High Performance Computing and Communications, 2015 IEEE 7th International Symposium on Cyberspace Safety and Security, and 2015 IEEE 12th International Conference on Embedded Software and Systems, IEEE, 2015, pp. 1–6.

12. Kumari and Kaur, A survey of fault tolerance in cloud computing, Journal of King Saud University-Computer and Information Sciences (2018).

13. Mell and Grance, The NIST definition of cloud computing, 2011.

14. Patra P.K., Singh, Das, Dey and Victoria A.D.C., Replication and resubmission based adaptive decision for fault tolerance in real-time cloud computing: a new approach, International Journal of Service Science, Management, Engineering, and Technology (IJSSMET) 7(2) (2016), 46–60.

15. Liu, Wang, Liu, Peng and Wu, Achieving reliable and secure services in cloud computing environments, Computers & Electrical Engineering 59 (2017), 153–164.

16. Chen and Chen, Asymmetric virtual machine replication for low latency and high available service, Science China Information Sciences 61(9) (2018).

17. Gill S.S. and Buyya, Failure management for reliable cloud computing: A taxonomy, model and future directions, Computing in Science & Engineering, 2018, p. 1.

18. Mustafa, Nazir, Hayat, Khan A.R. and Madani S.A., Resource management in cloud computing: Taxonomy, prospects, and challenges, Computers & Electrical Engineering 47 (2015), 186–203.

19. Peng, Wang and Zhang, A two-stage fault tolerance method for large-scale manufacturing network, IEEE Access 7 (2019), 81574–81592.

20. Amin, Singh and Sethi, Review on fault tolerance techniques in cloud computing, International Journal of Computer Applications 116(18) (2015), 11–17.
