Abstract
One of the current discussions concerning cloud computing environments involves failure prediction, which affects the delivery of on-demand services over the Internet. Proactive failure prediction techniques play an important role in reducing the undesirable consequences of failures in high-performance systems. Accordingly, this study proposes a threshold-sensitive approach that uses a support vector machine (SVM) to create an efficient mechanism for predicting failures in cloud environments. The new approach can operationally avoid system failures for each host based on log files that include features such as CPU utilization, RAM, and bandwidth. In comparison to the base research, the findings demonstrate that the presented method proactively reduces the number of migrations by about 76.19% when the failure threshold level is 70%.
Introduction
Cloud computing offers a wide range of in-demand services in various domains, and the huge number of requests for such services must be served by cloud datacenters that can support a large number of users. Servers under such a heavy workload may sometimes face different types of hardware failures. Therefore, operational provisions must exist to deal with failures occurring in this highly loaded infrastructure. In recent years, there has been heightened interest in monitoring systems used within cloud environments [2, 3]; hardware components are monitored, and information such as temperature and CPU, RAM, and bandwidth utilization is logged. Log files are pre-processed by filters and then analyzed to identify indications of imminent failure, patterns, and future trends [2, 3, 27]. In particular, the preemptive migration method was introduced as one of the earliest attempts to prevent failures based on a feedback loop from the runtime environment [2]. However, this approach cannot predict a failure, which is why this research adds a threshold-sensitive framework based on a support vector machine (SVM) to the feedback-loop control mechanism. The challenge of this research is to find a suitable failure threshold that leads to fewer failures and, consequently, fewer migrations in cloud computing environments.
The hypothesis is that applying SVM along with the preemptive migration technique to categorize and predict failures with a more accurate threshold reduces the number of failures and migrations. The aim of the paper is to increase the availability of cloud-based datacenters by predicting failures and performing migrations in a timely manner. The contributions of this paper are: first, proposing a failure prediction method for cloud computing environments; second, combining failure prediction with the preemptive migration mechanism; third, offering an SVM-based failure prediction for cloud computing (SVM-FPC) framework that predicts failures based on log files. Log files are created based on the health status of hosts. Based on the obtained results, a threshold of 70% is better than the other thresholds for classifying hosts in order to avoid failures.
The rest of this paper is organized as follows: background studies are presented in Section 2; details of the approach are discussed in Section 3; the simulations and results are presented in Section 4; and finally, Section 5 presents conclusions and future work.
Literature review
A review of the literature shows that numerous proactive failure prediction techniques have been introduced to increase system performance, decrease undesirable effects, and improve fault tolerance in cloud computing [1, 2, 3, 4]. SVM has also been applied to estimating the crash possibility of systems at a large European manufacturing company. By keeping track of running systems and log files, Random Indexing and SVM were used to sort failed and non-failed systems into separate classes. However, SVM could not identify failures satisfactorily [3]. Therefore, fault prediction and migration are used in combination to manage failures and increase system availability. Several fault tolerance and prediction techniques exist in cloud computing today. While some focus on preventing errors and failures (proactive fault tolerance), others deal with the impact of failures after they occur (reactive fault tolerance) [2, 14, 15, 16, 17, 18, 19, 20].
In 2010, Machida et al. [18] studied the effectiveness of combining rejuvenation with live virtual machine (VM) migration. Software rejuvenation is a technique for improving the availability of virtualized server systems, as it can postpone failures caused by software. When a VM needs to be rejuvenated, hosted VMs can be moved to another host using live VM migration to continue execution. More experiments on other scalable models are still needed.
In 2012, Fronza et al. [3] described an approach to predicting failures based on log files using Random Indexing and SVM. Each operation was characterized in terms of its context, and the SVM associated sequences with the class of failed or non-failed hosts. A sequence or pattern of messages may be used to predict failures: a classifier can associate the sequence of messages that precedes an event with a certain class. Weighted SVMs were applied to deal with imbalanced datasets in order to improve true positive rates.
In 2013, Catak and Balaban [24] proposed a method referred to as the cloud SVM training mechanism (CloudSVM), which uses the MapReduce technique for distributed machine learning applications in a cloud computing environment. By splitting the data set over several nodes in a cloud computing system, CloudSVM classifies features on each node and then merges them together in MapReduce. The method works on cloud computing systems without knowing how many computers are connected, and its aim is to handle large-scale data sets.
In 2014, Chen et al. [25] investigated how to identify application failures based on resource usage measurements from the Google cluster traces. They applied recurrent neural networks to the resource usage measures and generated features to categorize the input resource usage time series into different classes. Their results show that the model is able to predict failures of batch applications, which are the dominant jobs in the Google cluster. Moreover, they explored early classification to identify failures and found that the prediction algorithm gives the cloud system enough time to take proactive actions well before the termination of applications, with average resource savings of 6% to 10%.
In 2015, Lv et al. [26] proposed a distributed SVM with a penalty factor to improve the SVM data mining algorithm in cloud computing. Using the MapReduce model, they distributed the SVM computation by assigning data records to different nodes. Because data in cloud computing is dynamic, they introduced a penalty factor to improve the accuracy of information mining in cloud computing environments.
In 2014, Kumar et al. [27] presented a novel approach to predicting hardware failure using SVM by analyzing system log files. The approach uses a sequence or pattern of messages, extracted with a sliding window, as the input to the SVM to predict the likelihood of a failure. Experimental results using log files from a Linux-based network-attached storage system indicate that the SVM classifier can achieve an accuracy of 92%.
In 2012, Fu et al. [28] proposed a set of innovative algorithms and a system named LogMaster for parsing log files that have multiple attributes. They used an n-ary sequence in which each event is identified. In addition, they proposed an abstraction named Event Correlation Graphs (ECGs) to represent event relations and enable fast prediction. Experimental results on three log files show that their method can predict failures with high precision and an acceptable recall rate.
In 2012, Bala and Chana [15] described checkpoint and restart. When a task fails, it can be restarted from the most recently checkpointed state rather than from the beginning. This is an efficient fault tolerance technique for long-running applications.
In 2010, Purdy et al. [19] proposed a new model called process replication that can be used to improve the performance and fault tolerance of High Performance Computing (HPC) applications on the cloud. HPC applications involve a set of processes coordinating across a cluster of machines using message passing. They also studied the tradeoff of bandwidth, CPU time, and memory against latency.
In 2009, Angskun et al. [14] presented a self-healing network for supporting scalable and fault-tolerant runtime environments. The self-healing network was designed to support the transmission of messages across multiple nodes and to protect them against recursive node and process failures. It recovers automatically after a failure occurs. Making the protocol aware of the underlying network topology (in both LAN and WAN environments) greatly improves the overall performance of both broadcast and multicast message distribution. This is equivalent to adding a cost function on each possible path and integrating this cost function into the computation of the shortest path.
In 2009, Engelmann et al. [2] presented a preemptive migration method that relies on a feedback-loop control mechanism in which system and application health are monitored. The preventive action can avoid imminent application failure by reallocating running application parts from unhealthy to healthy compute nodes. Application reallocation must be finished before the expected failure occurs; otherwise, the applications will experience the failure. They identified four distinct types of proactive fault tolerance using preemptive migration, based on the monitoring capabilities of the compute nodes and the processing of monitoring data within the feedback loop. What this paper specifically needs is a prediction model based on log files that can be added to the feedback-loop control mechanism.
In 2009, Fulp et al. [21] described a new spectrum-kernel SVM approach to predicting hardware failures, such as hard disk and processor failures, based on system log files. These files include messages that represent a change of system state. Given labeled training data, SVMs are applied to data obtained from system log files to determine which sequences are precursors to failure.
Table 1 shows a comparison of proactive and reactive techniques in terms of performance, availability, downtime, failure prediction, migration and policy.
Comparison of fault tolerance techniques
The idea of this paper is to present a technique for predicting failures in a cloud environment by applying thresholds along with the SVM algorithm. The goal is to develop a proactive fault tolerance framework, namely SVM-FPC, that monitors system health from log files and separates failed and non-failed hosts so that preemptive migration can be performed before a host fails. Prior to describing SVM-FPC, the basic concepts, including the feedback-loop control mechanism, support vector machines, and performance assessment, are reviewed as follows:
Feedback-loop control mechanism: The base method used in this paper is a proactive fault tolerance technique that uses preemptive migration and relies on a feedback-loop control mechanism. In this mechanism, a monitoring system observes the health parameters of each node, such as fan speeds, processor temperature, and processor utilization. Upon detecting an alert or an exceeded preset limit, the monitoring system notifies the resource manager, which prepares the reallocation and notifies the runtime environment to migrate the application from the affected node to another node. To evaluate the history and context in which an alert is triggered or a limit is exceeded, a history database and a filter section were also added. The filter is used on each node to process raw sensor data and alerts from the monitor and to notify the resource manager based on a more thorough analysis of current trends, imminent failures, and possible future threats, while the history database is used to save logs [2].
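The control flow described above can be sketched roughly as follows. This is only an illustration under stated assumptions, not the implementation from [2]: the `HostSample` type, the `LIMITS` values, and the function names are hypothetical, and the migration target is chosen by simple round-robin over the healthy nodes.

```python
from dataclasses import dataclass

# Hypothetical preset limits; the source does not give concrete values.
LIMITS = {"fan_rpm_min": 1000.0, "cpu_temp_max": 85.0, "cpu_util_max": 0.90}


@dataclass
class HostSample:
    host_id: int
    fan_rpm: float
    cpu_temp: float
    cpu_util: float


def exceeds_limits(s: HostSample) -> bool:
    """Filter step: flag a node when any monitored parameter leaves its preset range."""
    return (
        s.fan_rpm < LIMITS["fan_rpm_min"]
        or s.cpu_temp > LIMITS["cpu_temp_max"]
        or s.cpu_util > LIMITS["cpu_util_max"]
    )


def feedback_loop_step(samples, history):
    """One pass of the loop: monitor -> filter -> resource manager -> migration plan."""
    history.extend(samples)  # the history database keeps the raw samples
    unhealthy = [s.host_id for s in samples if exceeds_limits(s)]
    healthy = [s.host_id for s in samples if not exceeds_limits(s)]
    # Resource manager: pair each unhealthy node with a healthy target, if one exists.
    plan = []
    for i, src in enumerate(unhealthy):
        if healthy:
            plan.append((src, healthy[i % len(healthy)]))
    return plan


history = []
samples = [
    HostSample(0, 3000.0, 60.0, 0.40),
    HostSample(1, 2900.0, 91.0, 0.55),  # processor temperature above the limit
    HostSample(2, 3100.0, 58.0, 0.95),  # processor utilization above the limit
]
plan = feedback_loop_step(samples, history)
print(plan)
```

Each tuple in `plan` is a (source node, target node) pair that a runtime environment could act on. Note that this baseline only reacts to limits that have already been exceeded; it does not predict.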
Support vector machines (SVM): SVM has been successfully applied in many fields, such as image retrieval, handwriting recognition, gene profiling, and text classification [3, 5, 6, 7, 8, 9, 10]. SVM is a learning method for recognizing patterns and is used for classification and prediction. Characteristically, SVM uses two independent sets of data, namely training data and testing data. Only the training dataset includes class labels alongside the attribute features or variables used to distinguish between failed and non-failed instances. In our work, the LibSVM function with the Radial Basis Function (RBF) kernel [23], which is a reasonable kernel when the number of features is not very large, has been used.
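For reference, the RBF kernel measures the similarity of two feature vectors as K(x, y) = exp(-gamma * ||x - y||^2). The sketch below computes it directly; the host feature values and the choice of `gamma` (one over the number of features) are assumptions made for illustration and are not taken from the paper.

```python
import math


def rbf_kernel(x, y, gamma):
    """K(x, y) = exp(-gamma * ||x - y||^2) for two equal-length feature vectors."""
    if len(x) != len(y):
        raise ValueError("feature vectors must have the same length")
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-gamma * sq_dist)


# Hypothetical host feature vectors: (CPU utilization, RAM, bandwidth), scaled to [0, 1].
host_a = (0.35, 0.50, 0.20)
host_b = (0.80, 0.75, 0.60)
gamma = 1.0 / 3  # assumed value: 1 / number of features

print(rbf_kernel(host_a, host_a, gamma))  # identical points give the maximum similarity
print(rbf_kernel(host_a, host_b, gamma))  # dissimilar points give a value below 1
```

With only three features per host, as here, the kernel is cheap to evaluate, which is consistent with the remark that RBF is a reasonable choice when the number of features is small.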
Performance assessment: As shown in Table 2, the relevant metrics for evaluating the quality of an SVM are the false positive rate (FPR) and true positive rate (TPR), as well as precision and accuracy. The true positive rate measures the proportion of positives that are correctly identified, whereas the false positive rate measures the proportion of negatives that are incorrectly identified as positives. Accuracy is the proportion of correct predictions (failed and non-failed) among all predictions, and precision is the proportion of instances predicted as positive that are actually positive. Moreover, the F-score is a measure of a test’s accuracy that considers both precision and recall.
Most relevant metrics for classification performance evaluation [13]
TP: true positive classes; FP: false positive classes; TN: true negative classes; FN: false negative classes.
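The metrics named above follow from the four confusion-matrix counts using their standard definitions. The sketch below shows one way to compute them; the counts in the example are hypothetical and are not taken from the paper's result tables.

```python
def classification_metrics(tp, fp, tn, fn):
    """Standard confusion-matrix metrics; a ratio with a zero denominator is reported as 0.0."""
    def ratio(num, den):
        return num / den if den else 0.0

    tpr = ratio(tp, tp + fn)                       # true positive rate (recall)
    fpr = ratio(fp, fp + tn)                       # false positive rate
    precision = ratio(tp, tp + fp)
    accuracy = ratio(tp + tn, tp + fp + tn + fn)
    f_score = ratio(2 * precision * tpr, precision + tpr)
    return {
        "tpr": tpr,
        "fpr": fpr,
        "precision": precision,
        "accuracy": accuracy,
        "f_score": f_score,
    }


# Hypothetical counts for illustration only.
m = classification_metrics(tp=40, fp=10, tn=45, fn=5)
for name, value in m.items():
    print(f"{name}: {value:.4f}")
```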
Figure 1 presents a schematic view of the proposed approach, which includes three sections; however, this paper concentrates on data collection and the SVM-based filter.
Schema of SVM-based failure prediction for cloud computing (SVM-FPC).
While the system is running, the status of the hosts is monitored and the log file is sent to the SVM-based filter as input data. In this section, host features such as CPU utilization, bandwidth, and RAM are monitored, and log files are created subsequently. The monitoring algorithm proceeds as follows:
Algorithm 1 Monitoring
Begin
int N; // number of hosts
int i;
for i = 1 to N {
    Get "Host-CPU-utilization", "Host-bandwidth", and "Host-RAM" of host i
    Write the health status of host i to output
}
Return output and send it to the SVM-based filter
End
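Algorithm 1 can be sketched in Python as below. This is a minimal illustration, assuming that each host exposes its three readings as a dictionary and that the log is written as CSV; the field names and sample values are hypothetical.

```python
import csv
import io


def monitor(hosts):
    """Collect the health status of every host and return it as CSV log text."""
    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\n")
    writer.writerow(["host_id", "cpu_utilization", "bandwidth", "ram"])
    for host_id, host in enumerate(hosts):
        writer.writerow(
            [host_id, host["cpu_utilization"], host["bandwidth"], host["ram"]]
        )
    return out.getvalue()


# Hypothetical utilization readings in percent; real values would come from the hosts.
hosts = [
    {"cpu_utilization": 35.0, "bandwidth": 20.0, "ram": 50.0},
    {"cpu_utilization": 82.0, "bandwidth": 40.0, "ram": 61.0},
]
log_text = monitor(hosts)
print(log_text)
```

The returned text plays the role of the log file that is handed to the SVM-based filter.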
Structure of SVM-based filter
In this section, the failure thresholds are defined based on the host features, and the SVM is trained once the labeling process is finished. Each feature that exceeds the threshold is considered a failure, and it is labeled as
Algorithm 2 SVM-based filter
Begin
int N; // number of hosts
int i;
for i = 1 to N {
    if (Host-CPU-utilization > threshold) or (Host-bandwidth > threshold) or (Host-RAM > threshold)
    then
        set Host label to
    else
        set Host label to 1
}
Train and test the SVM
Return output to the Resource Manager section
End
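The labeling step of Algorithm 2 can be sketched as follows. Two points are assumptions: the value used for the failed label is not legible in the source, so `-1` is chosen here purely for illustration (only the non-failed label `1` is given), and the features are taken to be utilization percentages directly comparable to the threshold. Training and testing the SVM itself, which the paper performs with LibSVM, is not shown.

```python
FAILED = -1      # assumption: the failed-host label is not legible in the source
NON_FAILED = 1   # as given in Algorithm 2


def label_hosts(hosts, threshold):
    """Label a host FAILED when any monitored feature exceeds the threshold."""
    labels = []
    for host in hosts:
        over = (
            host["cpu_utilization"] > threshold
            or host["bandwidth"] > threshold
            or host["ram"] > threshold
        )
        labels.append(FAILED if over else NON_FAILED)
    return labels


# Hypothetical utilization readings in percent.
hosts = [
    {"cpu_utilization": 35.0, "bandwidth": 20.0, "ram": 50.0},
    {"cpu_utilization": 82.0, "bandwidth": 40.0, "ram": 61.0},
    {"cpu_utilization": 55.0, "bandwidth": 71.0, "ram": 30.0},
]
labels_70 = label_hosts(hosts, threshold=70.0)
labels_40 = label_hosts(hosts, threshold=40.0)
print(labels_70)
print(labels_40)
```

Lowering the threshold marks more hosts as failed, which in turn triggers more migrations; this is the tradeoff that the simulations explore.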
Simulations and results
In this section, a case study of the proposed method is introduced, and then the simulation setup and results are presented.
Case study
In order to show the performance of the system, the case study was undertaken over a period of 24 hours through simulation. In this example, the threshold is assumed to be 40%. Data collection: Host features such as CPU utilization, bandwidth, and RAM were taken from the PlanetLab data set. This log file, which represents the health status of the hosts, is sent to the SVM-based filter to identify failed and non-failed hosts according to the chosen threshold level.
SVM-based filter: Table 3 shows the process of labeling ten hosts in SVM-based filter based on threshold 40%. For example, host number six is labeled as
Labeling of the hosts in SVM-FPC
Resource Manager: The Resource Manager receives a host list from the SVM-based filter and reallocates applications, migrating them from one node to another based on the preemptive migration idea [2]. The accuracy of the SVM and the number of migrations are obtained according to the following relations:
Our approach has been applied to predict the failure of hosts based on log files. The PlanetLab dataset [24] was used along with an SVM classifier and a range of threshold levels between 40% and 90% to find the optimum failure threshold in a cloud computing environment.
The simulation environment is illustrated in Fig. 2. CloudSim is a Java-based simulator [22] that is used to simulate cloud-based environments and classify data based on threshold levels. The prepared log file can be exported as a Comma-Separated Values (CSV) file to perform specific statistical operations. Moreover, CSV files are converted to Attribute-Relation File Format (ARFF) files to be used by the Weka software. Weka uses the LibSVM function [11] to obtain accuracy and precision.
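The CSV-to-ARFF conversion step could look like the sketch below. It assumes a header row, numeric feature columns, and one nominal class column; the column names, the relation name, and the class values `non_failed` and `failed` are hypothetical, and the paper does not state which tool performed the conversion.

```python
import csv
import io


def csv_to_arff(csv_text, relation, class_column, class_values):
    """Convert CSV text with a header row into ARFF text.

    Every column is declared numeric except class_column, which is nominal.
    """
    rows = list(csv.reader(io.StringIO(csv_text)))
    header, data = rows[0], rows[1:]
    lines = [f"@relation {relation}", ""]
    for name in header:
        if name == class_column:
            lines.append(f"@attribute {name} {{{','.join(class_values)}}}")
        else:
            lines.append(f"@attribute {name} numeric")
    lines += ["", "@data"]
    lines += [",".join(row) for row in data]
    return "\n".join(lines) + "\n"


# Hypothetical labeled log rows.
csv_text = (
    "cpu_utilization,bandwidth,ram,label\n"
    "35.0,20.0,50.0,non_failed\n"
    "82.0,40.0,61.0,failed\n"
)
arff = csv_to_arff(csv_text, "host_health", "label", ["non_failed", "failed"])
print(arff)
```

The resulting text uses the `@relation`, `@attribute`, and `@data` sections that ARFF readers such as Weka expect.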
The configurations of hosts and VMs are shown in Tables 4 and 5. Four VM types and two host types with different MIPS values have been defined in 800 and 100 instances, respectively.
VMs and hosts configuration
Simulation environment.
Log file example.
Log files are important for managing the security of computer systems since they provide a history of changes in system status. In this context, each instance in these files consists of a message, an event, or a feature; thus, analysis of the log files can identify the causes of failures. In this paper, log files were used to analyze system health status. The PlanetLab dataset used in this paper includes four fields: host ID, MIPS, RAM, and bandwidth. The attributes of each log file are listed in Table 6, and Fig. 3 shows an example file.
Field of log files in the analyzed data set
We applied our approach to the PlanetLab dataset at each threshold level in order to select the best-performing SVM configuration, based on accuracy and the number of migrations. The results are as follows:
Simulation 1: Threshold 40%
To check the results for the 40% threshold, the PlanetLab workload was used. In this simulation, 89 migrations occurred, which the new approach reduced to 62. The number of migrations shows that this threshold is not efficient, because it imposes additional load on the system to perform reallocations and migrations. Figure 4 shows the number of migrations using the feedback-loop control framework and SVM. Table 7 summarizes the results obtained by SVM at the 40% threshold.
SVM results based on threshold 40%
Simulation 2: Threshold 50%
To check the results for the 50% threshold, the PlanetLab workload was used. In this simulation, 76 migrations occurred, which the new approach reduced to 24. The number of migrations shows that this threshold is not efficient, because it imposes additional load on the system to perform reallocations and migrations. Figure 5 shows the number of migrations using the feedback-loop control framework and SVM. Table 8 summarizes the results obtained by SVM at the 50% threshold.
SVM results based on threshold 50%
Simulation 3: Threshold 60%
To check the results for the 60% threshold, the PlanetLab workload was used. In this simulation, 63 migrations occurred, which the new approach reduced to 20. Figure 6 shows the number of migrations using the feedback-loop control framework and SVM. Table 9 summarizes the results obtained by SVM at the 60% threshold.
SVM results based on threshold 60%
Number of migrations based on threshold 40%.
Number of migrations in threshold 50%.
Simulation 4: Threshold 70%
To check the results for the 70% threshold, the PlanetLab workload was used. In this simulation, 42 migrations occurred, which the new approach reduced to 10. Figure 7 shows the number of migrations using the feedback-loop control framework and SVM. Table 10 summarizes the results obtained by SVM at the 70% threshold.
SVM results based on threshold 70%
Number of migrations in threshold 60%.
Number of migrations in threshold 70%.
Simulation 5: Threshold 80%
To check the results for the 80% threshold, the PlanetLab workload was used. In this simulation, 27 migrations occurred, which the new approach reduced to 8. Figure 8 shows the number of migrations using the feedback-loop control framework and SVM. The corresponding table summarizes the results obtained by SVM at the 80% threshold.
SVM results based on threshold 80%
Simulation 6: Threshold 90%
To check the results for the 90% threshold, the PlanetLab workload was used. In this simulation, 15 migrations occurred, which the new approach reduced to 9. Figure 9 shows the number of migrations using the feedback-loop control framework and SVM. Table 12 summarizes the results obtained by SVM at the 90% threshold.
SVM results based on threshold 90%
Number of migrations in threshold 80%.
Table 13 and Fig. 10 summarize the statistical results of all simulations in this paper. The new approach can operationally avoid system failures for each host based on log files that include features such as CPU utilization (MIPS), random access memory (RAM), and bandwidth. In comparison to the base research, the findings demonstrate that the presented method proactively reduces the number of migrations by about 76.19% when the failure threshold level is 70%.
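The reported reduction can be checked from the migration counts given for each simulation. The sketch below assumes that the reduction is measured relative to the number of migrations without the new approach; under that assumption it reproduces the 76.19% figure at the 70% threshold.

```python
# (migrations without the new approach, migrations with it) per failure threshold in percent,
# as reported in the text of the individual simulations.
migrations = {
    40: (89, 62),
    50: (76, 24),
    60: (63, 20),
    70: (42, 10),
    80: (27, 8),
    90: (15, 9),
}


def reduction_percent(baseline, proposed):
    """Relative reduction in the number of migrations, in percent."""
    return 100.0 * (baseline - proposed) / baseline


reductions = {
    threshold: round(reduction_percent(base, new), 2)
    for threshold, (base, new) in migrations.items()
}
best = max(reductions, key=reductions.get)

for threshold, value in reductions.items():
    print(f"threshold {threshold}%: {value}% fewer migrations")
print(f"largest reduction at threshold {best}%")
```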
SVM results of all threshold levels
Number of migrations in threshold 90%.
Number of migrations in all threshold levels.
The aim of this paper was to apply the SVM approach along with the preemptive migration technique [2] to predict failures in cloud computing environments. The results showed that the proposed SVM-FPC framework can help evaluate failures across several threshold levels. The 70% threshold showed the best results and was the most desirable level for failure prediction in this work. The SVM performed satisfactorily in classifying failed and non-failed states.
A limitation of this work was the number of features in the PlanetLab dataset. In future work, more features can be considered in order to verify the consistency and accuracy of the failure prediction. Another recommended direction is to evaluate the time feature in order to predict when failures will occur. An early estimation of failures would provide additional time to take the right actions. Yet another direction is to apply more accurate methods for training the SVM.
