Abstract
Cloud computing has recently gained popularity as a resource platform for on-demand, high availability and high scalability access to resources, while offering dynamic flexible infrastructures and QoS (Quality of Services) guaranteed services. In this environment, reliability is very important to ensure the effectiveness of the system and meet the desired SLA (System Level Agreement). Our work proposed in this paper uses the checkpointing of tasks to ensure the reliability and the replication of checkpointing files to ensure the accessibility. To ensure the reliability of services, we used a fault tolerance strategy based on adaptive checkpointing of two levels. This strategy takes into account the complexity and characteristics of cloud computing and it minimizes the overhead. To ensure the accessibility and the availability of checkpointing storage, we used a dynamic passive replication based on an availability degree specified by SLA criteria to decide the placement and the number of replicas.
Keywords
Introduction
Cloud computing has emerged as the next generation of computing. It offers a high level of resource virtualization as: computing devices, storages, communication infrastructures. Cloud computing allows users to focus on their own application rather than other related issues (installations, configuration, management…). Cloud computing delivers hardware infrastructures and software application as services, and the user can consume theses services based on SLA (System Level Agreement) which defines theirs QoS parameters based on pay-as-you-go concept.
Since reliability is a critical parameter to ensure the service-ability of the cloud, it is necessary to implement a fault tolerance tool. Fault-tolerance is the property that enables a system to continue operating properly in the event of the failure of some of its components.
Fault tolerance can be classified on two categories: proactive and reactive.
Proactive fault tolerance: It predicts the fault and avoids recovery from fault, errors and failure and proactively replace the suspected component, means detect the problem before it actually come. It is a concept that prevents compute node failures from impacting running parallel applications by preemptively migrating parts of an application (task, process, or virtual machine) away from nodes that are about to fail. Reactive fault tolerance: It reduces the effort of failures, when the failure occurs. This technique makes system more robust. In other words, we can say that it an On-demand fault tolerance.
To achieve a reactive fault tolerance, several techniques are proposed such as: replication, resubmission and checkpointing.
Replication: Means copying. Several replicas of tasks are created and they are run on different resources, for effective execution and for getting the desired result. Hadoop, HA-Proxy, Amazon EC2 like tools are there on which replication can be implemented. Also, there are mainly three different types of replication schemes such as Active Replication, Semi-Active Replication and Passive Replication. Resubmission: Many times it happens that due to high network traffic or due to heavy work load, a task may fail, whenever such failed task is detected, at runtime the task is resubmitted either to the same or different working resource for execution. For these, certain algorithms are designed, which assigns task to resources on the basis of certain properties. Checkpointing [27]: It is a proficient task level fault tolerance technique for large applications. In this method, check pointing is done in system. When a task fails, instead of initiating from beginning it is restarted from the recently checked pointed state. Checkpointing is carried out periodically i.e., checkpoints are kept and process is executed from the recent check point, once system governs the fault.
Each of these strategies has its own advantages and limits and views the characteristics and the complexity of cloud computing, the choice of the fault tolerance technique becomes more difficult. The good strategy ensures the reliability and availability with minimum overhead and without affecting the SLA. Reliability is concerned with willingness for correct service and availability is concern with continuity of correct service.
In this paper, we propose a fault tolerance strategy based on the adaptive multi-level checkpointing. The proposed checkpointing uses an adaptive time-based coordinated checkpointing at the VMs (Virtual machines) of the same server and min-process coordinated checkpointing at the server level. The adaptive time based checkpointing reduces the overhead of classical time based checkpointing by minimizing the piggybacked and stored message. The minimum coordinated checkpointing is already used in several works. What is new in our strategy is that we used this checkpointing in the server level rather than the CIC (Communication induced checkpointing) as it is proposed in many works. In addition, we made our strategy fault tolerant by storing the dependency matrix in the broker and the updates are done only if there is an external communication. Generally in classical strategies this matrix is created in each checkpointing process which increases the overhead. More details can be found in the next sections.
To ensure the availability of checkpointing files even if a storage server fails, we improve the storage process using a dynamic and passive replication. The new in this strategy is using it for the checkpointing files and it is based on certain degree of availability specified by the SLA rules. The details are also in Section 4.
The rest of this paper is organized as follows. Section 2 briefly describes related work on different reactive fault tolerance strategies existing in literature. In Section 3, we provide an overview of our cloud model and introduce the fault-tolerance mechanism. In Section 4, we detail our fault tolerance technique based on the multi-level checkpointing for the reliability and the dynamic replication for the storage availability. We present experimental results in Section 5 and a comparative study in Section 6. Finally, we provide concluding remarks and perspectives for future work in Section 7.
Literature review
In order to tolerate faults, the strategies based on checkpointing save the executed portion of processes on stable storage. Hence, the executed portion can be retrieved in case of failure and further computation can be carried out [30]. The checkpointing protocols can be classified in three categories according to the synchronization among the application nodes:
In Independent checkpointing (uncoordinated) [1]: the checkpoints at each process are taken independently without any synchronization among the processes. Because of absence of synchronization, there is no guarantee that a set of local checkpoints taken will be a consistent set of checkpoints. It may require cascaded rollbacks that may lead to the initial state due to domino-effect [1]. The domino effect appears when a subset of processes, which have to be resumed after a failure, rollback unboundedly while determining a set of mutually consistent checkpoints. The independent checkpointing store all the checkpoints file during the job life. To overcome the problems of the domino effect in this protocol, the authors in [2, 3] propose to use an optimistic sender based messages logging. The time based coordinated checkpointing is an independent checkpointing where the coordination in done only in the timer level [28]. In this type of checkpointing the timers are synchronized to ensure the checkpointing approximately at the same time.
The coordinated or synchronous checkpointing forces the processes to synchronize to take checkpoints in such a manner that the resulting global state is consistent. The main advantage is that it sores only one permanent checkpoint in the stable memory and it is domino-effect free. The Chandy-Lamport [4] algorithm is the earliest coordinated checkpointing algorithm. This technique requires that at least one process sends a marker to notify the other ones to take a snapshot of their local states and then form a global checkpoint. To reduce the overhead caused by the synchronization, it was necessary to reduce the number of nodes involved in the checkpointing process. The paper [29] minimizes the overhead by synchronizing between the communicating nodes during the previous checkpointing interval. In [5, 6]; the authors propose a coordinated checkpointing where only the nodes depending on a selected initiator are forced to created theirs checkpoints. Time based coordinated checkpointing proposed in [8, 9] supposes that the logical timers of nodes are approximately synchronized which allows to creates a consistent checkpoints with a simple management of messages during a certain periods.
The last protocol is communication induced checkpointing (CIC). In this protocol, the processes take two kinds of checkpoints, local and forced [7]. Local checkpoints can be taken independently, while forced checkpoints are taken to guarantee the eventual progress of the recovery line. However the messages are piggybacked and useless checkpoints can be created.
Each protocol has advantages and limits, which makes the combination among two or three protocols necessary to adapt to the system levels and complexities [10].
All the cited works assume the ability to store the checkpoints images (files) in a reliable media which is not subject to failures (stable storage servers). But these dedicated servers may become a bottleneck when the system size increases. To overcome this problem, the authors in [11] introduced the concept of diskless checkpointing. In checkpointing stable storage media is replaced by memory for checkpoint storage. In checkpoint mirroring (MIR) [11], the checkpoint mirroring technique is used in which each processor saves a copy of its checkpoint to another processor’s local disk. In the case of processor failure the copy of that checkpoint will be available for a spare processor to continue the execution of that process. The paper [12] uses the same concept except that all the computational processors and the checkpoint processor are organized in chain. Each processor receives the data segment from its predecessor, calculates the checksum and sends it to the next processor in the chain. The process continues until the segment reaches the checkpoint server which is at the end of the chain. Diskless checkpointing is scalable but it is costly in terms of mapping and consistency management of checkpointing files. One way to provide checkpoint storage reliability with some degree of scalability is replication. In replication, multiple copies of a checkpoint are stored on different storage servers. Thus, data can be recovered even when part of the system is unavailable [13, 14].
Adaptive checkpointing protocol and storage
View the characteristics of each level, fault management must consider and adapt to the each level complexity. For this reason, our checkpointing strategy MLCp (Multi-Level Checkpointing) uses two different cooperative strategies in each level to minimize overhead and meet SLA. And since storage servers are not 100% reliable, it is necessary to ensure the availability of files in case of rollback checkpoints. In this case we add a dynamic replication service. In this section we explain in details our strategy of fault tolerance that hybrids between replication and checkpointing.
System overview
The infrastructure-level services (IaaS) in cloud organizations can be provided by a group of datacenters. Each datacenter entity manages a number of storage services and host entities. The hosts (servers) are assigned to one or more VMs (Virtual Machines) running on them. The virtualization offers a complete environment in which an operating system and many processes, probably belonging to various clients, may reside mutually. Using the VMs, a single-host hardware platform may be compatible with multiple, isolated guest operating system environments concurrently [15]. The use of Virtualization provided by hypervisors otherwise known as Virtual Machine Monitors (VMM), forms an innovative virtualization level that separates the service workload from the resource management.
According to the Cloud architecture, there are three types of failure in a cloud platform: hardware failure, (storage server/host), VM failure and application failure (all VMs executing the applications fail). The broker is an intelligent agent that takes care of many functions that distributed applications require such as: SLA negotiation, application management (discovering the right resources meeting the SLA, scheduling jobs in order to meet deadlines), and handling failure that may occur during execution (all the three types) using fault tolerance. The fault tolerance used by the broker is based on the checkpointing. And to ensure the availability of checkpointing failures, the broker uses also the replicate these files among several storage services.
The user application consists of a set of tasks distributed among different VMs of different hosts (according to the allocation strategies) that communicate by exchanging messages, which creates dependencies between VMs and hosts where these tasks are running. In this case, two levels of communication of different characteristic and complexity are generated:
VM level: This level comprises VMs of the same server involved in application performance. This level is characterized by a high level of communication and very small transfer times message (in some works it is considered as null). Server level: This level includes the system servers that participate in the execution of application. The communication in this level is rare and costly.
Multi-level checkpointing (MLCp)
Each server runs independently its checkpointing according to its own checkpointing interval specified by the SLA rules. The checkpointing of a server is the set of local checkpoints of all its VMs. Since there are dependencies between servers, it is possible to force one or multiple servers to create their checkpoints together to ensure a consistent state.
VM level checkpointing
According to [9, 10], the transfer time of messages between VMs of the same server is negligible compared to the time of transfer of messages between servers. In addition, the VMs clocks are synchronized or approximately synchronized in the worst cases. These characteristics make time based coordinated checkpointing the most suitable for this level.
In the based coordinated checkpointing: All the processes are approximately synchronized and have a deviation from real time in their local clock timers. The local clock drift rate between the processes being assumed as
where MD is the maximum deviation and it is calculated as below:
where
The orphan messages can be eliminated by blocking the communication during
So when a
Blocking the communication during
The checkpointing strategy at this level is the following: the checkpointing of a server is the collect of all checkpoints of its VMs. As the server interval checkpointing expires, each VM creates its local checkpoint independently of other VMs. Among the period
Checkpointing at VM level.
The local checkpointing file contains all the information necessary for correct VM recovery, it also contains the list of potential in-transit messages. These checkpointing files will be sent to a VM initiator in the server and the VMs continue their performances immediately. The initiator VM is responsible for collecting checkpoints of VMs and ensuring the checkpointing atomicity. It is also responsible to ensure the checkpointing storage. Figure 1 illustrates a communication scenario inside the server and between three servers (
If an external communication is detected during the checkpointing of a server, the server level checkpointing is triggered. The majority of existing studies use CIC (Communication Induced Checkpointing) in this level, but this strategy requires more storage checkpointing files and the recovery is expensive. In our work, we used minimum – Coordinated checkpointing process where only the dependent servers will be forced to create their checkpoints. A server can monitor its direct dependencies but it cannot know its transitive dependencies. In Fig. 2, the server
After receiving a request from a server, the broker consults
Checkpointing at server level.
In most checkpointing works, it is assumed that the checkpoint files are stored in 100% reliable and stable memory, which is difficult to guarantee in the real world. In this case, the checkpointing file can be unavailable or lost and the system reliability will be not assured. The diskless checkpointing storage ensures reliability storage but it is expensive in terms of mapping and consistency management of checkpoints files.
In our adaptive storage, we adopt the replication strategy to ensure the storage reliability. The replication is widely used in the literature for different raisons such as the performances or the reliability. The replication based on the reliability focuses on the accessibility [21]. However the replication based on the QoS uses the response time, the consistency management and the cost to decide the number and the placement of the replicas in the system. The replication can be used for files [21], tasks [22], data bases or even objects.
Since the checkpointing file is also a file and its existence is very important to the recovery process, we decide to replicate it to ensure both of reliability and QoS. Replicating the checkpointing file is not new in the literature but our work deals with the majority of checkpointing storage and replication problems such as: when, what, how much and where replicating without increasing the checkpointing overhead. In the next paragraphs we will explain in details our storage strategy.
In our approach, we assume that each datacenter
For example, a data center contains only storage servers with
So as long as the constraint of Eq. (5) is not satisfied, the replication manager in the broker continues creating replicas.
To designate the location of replicas in the group of available storage servers at the datacenter, we used a QoS criteria to select the
where
where
In our work, we have considered three main parameters: the server reliability
where
We also assumed that these three parameters have the same weight, that is to say:
The broker module will choose the server which has the highest rating value, as the primary server. The primary server is responsible for ensuring the atomicity of storage checkpoints files and replicating the files on other shadow servers (Creating replicas).
To select the shadow servers; the broker remains using formulae. 6 with one difference: it must also consider the distance between the primary server and its replica, in this case the selection criteria is as below:
where
To reduce even more the checkpointing storage overhead, our approach uses a hierarchical storage; in this case, a powerful VM is selected as a mediator (generally it is the initiator). The role of the mediator is ensuring the correctness and the atomicity of checkpointing storage of its VMs. A server can use one or more mediators depending on the number of VMs involved in the application. The storage of the checkpointing file in the primary server is done by the mediator but the replication of this file to other shadow servers is done asynchronously to reduce the storage time.
In Fig. 3, the mediators are VM4 for the server
Comparison of checkpointng techniques
Hierarchical checkpointing storage.
To evaluate our approach, we used two parts of experimentations. The first part focuses of the checkpointing technique, so we compared our MLCp approach with the strategy proposed in [9] because it is the most similar work compared to ours (see Table 1).
In the second part, we compared our storage strategy that we call adaptive replication ARS (adaptive replication strategy) with:
Classical Checkpointing Storage (CCS): In this case, a single storage server responsible for storing the checkpoint file. With k replication storage kRS: In this case, the checkpoint files k will be stored in storage servers randomly selected.
Checkpointing evaluation
We used two parameters to evaluate our MLCp approach: the checkpointing overhead and the SLA violation.
Checkpointing overhead
Checkpointing overhead is the most important parameter to measure the effectiveness of checkpointing [7]. The overhead can be measured in several ways depending on the simulation requirements. In our case, the overhead Overh is calculated as below:
where
Impact of the number of checkpoints on the overhead with 70% of internal checkpointing.
In this experiment, we used 40 VMs in seven servers, we used a Poisson Distribution to present the failure behavior with a failure rate
In the first case, the internal communication rate (communication inter-server) is 70% of the total communications of the system. The results presented in Fig. 4 show a small overhead rate in all approaches because the checkpointing will be the majority of the time inside the servers. In addition of that; the time-based checkpointing does not require many control or synchronization messages (except for the synchronization of the time that can be made periodically or during the communication). Our MLCp approach does not cause many overhead because it does not piggyback messages even during the
In the second case, the external communication rate (communication inter-server) is 70% of the total communications of the system. The results presented in Fig. 5 show that the overhead rate is high for both approaches because of the synchronization needed for creating the checkpoints between the servers which also indicates an intra-server checkpointing. But our approach has minimized the overhead because it ensures the minimum synchronization between the servers by the use of the dependency idea. In addition our strategy also minimizes the intra-server overhead (the first case).
Impact of the number of checkpoints on the overhead with 70% of external checkpointing.
Impact of 
The SLA can be determined in terms of several constraints such as the minimum throughput or maximum response time provided by the deployed system. Since these characteristics may vary depending on the applications, it is necessary to define a generic metric that can be used in our simulations to estimate the level of SLA that is delivered by the approach.
For this purposes, we define the overall level of SLA violation caused by the system as the sum of overheads in terms of cost and Energy as below:
where
To evaluate our MLCp approach in term of SLA violation we used the same scenario of the previous experiment except that we fixed the internal communication at 80% (which is the general case in the cloud) but the failure rate
According to the results presented in Fig. 6, the SLA violation increases if the failure rate increases due to:
The checkpointing process that requires time which increases the time of resource occupancy and consequently increases the cost and the energy. The recovery procedure that checks alternative resources to re-execute the task from the last checkpoint stored in the memory and then loads the checkpointing file into this resource (VM).
However, our approach has succeeded in minimizing the SLA violation because the checkpointing procedure consumes less time and the recovery only concerns the VMs depending on the failed VM.
Impact of the number of checkpoints on the overhead.
This section compares our storage service ARS to other storage strategies and evaluates its performances.
Impact of the number of checkpoints on the overhead
We used the following scenario: Number of VMs
According to the results presented in Fig. 7, the number of checkpoints automatically increases the overhead in all the checkpointing storage approaches. The overhead caused by the storage with kRS replication was high compared to conventional storage in CCS. In kRS, the mediators will be blocked to ensure
Impact of the size of checkpointing file on the overhead
We used the same scenario of the previous experiment with Number of Checkpoints equal to 30 but we increase the size of checkpointing file, we studied the impact of the size of checkpointing file on the overhead. Increasing the size of checkpointing files increases the overhead since the storage time increases (see Fig. 8). Our strategy ARS ensures minimum overhead due to hierarchical storage (Primary server node and shadows servers). kSR strategy gives bad results because of the number of checkpointing files that must be replicated and stored in different storage services Asynchronously.
Impact of the number of failed storage severs on data availability
Impact of checkpointing file size on the overhead.
Impact of 
The unavailability of checkpoints file indicates the impossibility of recovery in case of failure detection. In this experiment we used the parameter
Table 1 presents a comparison between different checkpointing strategies cited in the paper including our checkpointing strategy MLCp. The column “Synch” indicates the degree of synchronization requested to ensure the consistency. “FT” column indicates if the strategy is fault tolerant or not. In “Storage” column we can see if the strategy takes in consideration the process of checkpointing storage or not. The column “Transit” indicates the consideration or not of the in-transit messages.
To position our approach of checkpointing storage in relation to other existing works, we used Table 2. “Replicating Column” indicates the replication granularity. “FT Column” indicates if the strategy is fault tolerant or not. “Rep type Column” indicates if the replication is active (Active), Passive (P). “Recov Column” concerns only the checkpointing file and it shows if the recovery process in considered in the replication process. In “Number of Rep”, “n” indicates that the number of replicas is fixed however
Comparison of replication techniques
Comparison of replication techniques
Checkpointing is a popular strategy of fault tolerance in cloud environment. The weak point of the checkpointing is the overhead and the availability of checkpointing files. In this paper we proposed a checkpointing strategy that minimizes the overhead by minimizing the degree of coordination and the management of complex levels.
To ensure the availability of checkpointing files, we also used another fault tolerance technology based on replication. The storage manager in our approach provides a degree of reliability for each file checkpointing with the minimum cost of storage and updating. The results prove the effectiveness of our approach in terms of availability and minimizing the overhead. In future work, we expect to integrate cognitive agents for intelligent management of fault tolerance at checkpointing level and storage level.
Footnotes
Authors’ Bios
