Abstract
Today’s large-scale distributed platforms comprise thousands of resources from production, educational, and ad hoc environments, including Clouds, Grids, and P2P systems. However, finding suitable resources in such a large pool to store large amounts of data and run multi-resource, long-running data processing applications (usually with few or no fault tolerance capabilities) is restricted by the dynamic availability of distributed resources. In addition to resource failures, resources may be unavailable because of their owners’ sharing policies and the nature of the domain they belong to (e.g. P2P systems, non-dedicated desktop Grids). As a result, the availability-aware selection of distributed resources has become a challenging problem for data management, resource provisioning, and job scheduling services. To this end, we present a novel resource availability characterization and prediction method for dynamic heterogeneous distributed environments. We identified 14 availability attributes that can be used effectively to model resource availability in such environments. Three data mining methods (Decision Tree, Naive Bayes’ Classifier and, in particular, a neural network) are proposed to model and predict resource availability using the identified attributes. The availability of a resource is predicted for an instant of time as well as for a time duration. Our experiments for 28 different resources in Austrian Grid show that the predictions of the proposed approach are, on average, 18% and 31% more accurate than those of the best previous method (Naive Bayes’ Classifier) for instant and duration availability, respectively.
Keywords
Introduction
With the increasing maturity of distributed technologies, the scale, development, and utilization of distributed environments continue to expand. Today’s distributed environments include hundreds or thousands of resources from Clouds, Grids (data/computational Grids), clusters, and P2P environments, such as the CERN Worldwide LHC Computing Grid (WLCG), TeraGrid, Grid’5000, Austrian Grid, EGEE, Open Science Grid, and PlanetLab [8]. These may include resources available for short periods, such as Public Resource Computing (PRC) systems [3], university labs, and on-demand resources [20]. In such diverse environments, resources vary largely in their availability properties. Several of the resources may not be available at all times, mainly because of different policies for contributing these resources for shared use, scheduled maintenance, and unpredictable resource failures [20].
High-level middleware services, such as resource management and meta-scheduling, require resource availability predictions for effective utilization of resources in distributed environments. For example, resource management services need availability predictions for allocation and co-allocation of resources for multiple job requests as well as for advance reservations. Scheduling services require predictions about resource availability to map given jobs to available resources according to job properties. For example, jobs with long run times and no checkpointing mechanism need to be scheduled to resources with steady availability, whereas jobs with light checkpointing can be replicated on occasionally available resources. Likewise, considering past patterns of resource availability, heavy checkpointing may be planned instead of periodic checkpointing, thereby reducing large overheads [22]. Data management services need to consider resource availability for effective storage of and efficient access to data. Especially in the case of very large datasets (as in big data) [31], all major data operations, such as storage, curation, and search, involve multiple resources, which may be geographically distributed. Data management services require advance predictions of resource availability to select the most suitable resources for different data access requests. For example, different resources may be chosen based on the size of the requested data or the frequency of data access.
However, due to varying availability properties and patterns over time, the availability-aware selection of distributed resources has become a challenging problem. In addition, the varying reliability of the resource hardware and software stack, different resource maintenance and management practices, and the wide range of policies that resource owners apply when sharing their resources in the distributed environment make availability prediction a hard problem. To address this problem, we present in this paper three data mining methods, based on Decision Tree [37], Naive Bayes’ Classifier [37], and Neural Network [37], to model and predict resource availability. The availability of a resource is predicted in two ways. The first is availability at a given point in time; in the current study, this is the immediate next point in time. This availability is referred to as instant availability. The second is availability for a given time duration; in the current study, this is the immediate next span of a given length. This availability is referred to as duration availability.
To construct the proposed models, we followed the Cross Industry Standard Process for Data Mining (CRISP-DM) [34]. We have demonstrated our approach on a real availability trace of Austrian Grid resources. We evaluated prediction accuracy of the proposed models through a series of experiments for resource instant availability and duration availability. We also compared our results with those from previous prediction methods by Nadeem et al. [20, 22]. Our experiments showed that the instant availability and the duration availability predictions based on the proposed neural network based method are respectively 94.36% and 90.06% accurate, which are respectively 18% and 31% better than the best predictions by the methods proposed by Nadeem et al.
Please note that in this paper the terms “Grid-site” and “resource” refer to the same entity and are used interchangeably. A Grid-site is a computational resource (usually a cluster) that can process a job request (e.g. application execution, data access, data storage, data processing). We consider a Grid-site “available” if it is switched on and accepts job requests; otherwise we consider it “unavailable”.
The rest of the paper is organized as follows. Section 2 empirically describes our motivation of the present study. Section 3 describes our main approach for modeling resource availability through CRISP-DM. Different phases of CRISP-DM in the current study are subsequently described in Sections 3.1-3.5. Section 3.1 describes the dynamic nature of the Grid resources. Section 3.2 highlights major attributes of availability of resources in the trace. Section 3.3 describes data preparation by enriching the trace events with major availability attributes. Section 3.4 presents Decision Tree, Naive Bayes’ Classifier and Neural Network modelling of resource availability. The evaluation of these models is presented in Section 3.5. A review of related work is described in Section 4 and we finally conclude in Section 5.
Resource availability prediction motivation
To demonstrate why resource availability prediction is important, we examined how job requests of different durations may fail on different Grid-sites in the Austrian Grid trace. For this purpose, we simulated a series of 287 jobs of different durations, from 10 min to 24 hrs in steps of 5 min (10 min, 15 min, 20 min, 25 min, ..., 24 hrs), on the real availability trace of each Grid-site in Austrian Grid (see Section 1). These job durations were identified from the real job execution traces collected by local resource managers on Austrian Grid-sites. They did not follow any well-known distribution (such as the normal distribution). In our simulations, the start time of each job on each Grid-site was selected at random from the time span of the trace dataset, and then the execution was simulated. We considered a job execution “successful” if the Grid-site remained available (in the real trace) from the beginning to the end of the job. The job execution was considered “failed” if the Grid-site became unavailable (in the real trace) at any time before the end of the job. In this study, our focus is on modeling resource availability behavior, so we do not consider other causes of job failure, and our trace data consists of only two Grid-site states: available and unavailable. On each Grid-site, the jobs of different durations were simulated multiple times in the same distribution as in the real trace. Figure 1 shows the percentage of jobs that failed due to resource unavailability, for jobs of different durations. We observed that job failures increased on all Grid-sites as the job duration increased. As expected, job failures increased more rapidly on the Grid-sites with low availability. At most, 49% of the jobs failed. It is noteworthy that there were job failures of 32% even for jobs of the minimum duration of 10 min.
These failures of jobs highlight the need for considering resource availability while planning job submission, data storage and access on these resources.
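The simulation described above can be sketched in a few lines. The following is a minimal illustration under stated assumptions, not the simulator used in the study: the trace is taken to be a list of 0/1 states sampled every 5 minutes, a job of d minutes is approximated by the ceil(d/5) consecutive samples starting at its start index, and the names `job_succeeds`, `failure_rate` and `toy_trace` are hypothetical.

```python
import random

SAMPLE_MIN = 5  # one recorded state every 5 minutes


def job_succeeds(trace, start, duration_min):
    """True if the site stays available (state 1) for the whole job."""
    n_samples = -(-duration_min // SAMPLE_MIN)  # ceiling division
    window = trace[start:start + n_samples]
    return len(window) == n_samples and all(window)


def failure_rate(trace, duration_min, n_jobs=1000, seed=0):
    """Fraction of randomly placed jobs that meet an unavailable state."""
    rng = random.Random(seed)
    n_samples = -(-duration_min // SAMPLE_MIN)
    last_start = len(trace) - n_samples
    failed = 0
    for _ in range(n_jobs):
        start = rng.randint(0, last_start)  # random start time in the trace
        if not job_succeeds(trace, start, duration_min):
            failed += 1
    return failed / n_jobs


# Toy trace (an assumption): 8 hours up, 4 hours down, repeated for 30 days.
toy_trace = ([1] * 96 + [0] * 48) * 60
durations = range(10, 1441, 5)  # 10 min, 15 min, ..., 24 hrs
rates = {d: failure_rate(toy_trace, d) for d in (10, 60, 480, 1440)}
```

On this toy trace the failure rate grows with the job duration, and no 24-hour job can succeed because the site is never up for more than 8 hours at a time.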

Percentage of jobs failed (due to resource unavailability) for jobs of different durations.
Considering resource availability is also important for planning data processing requests on such resources. Applications that process large datasets are usually executed in a distributed fashion across several resources. These applications differ largely in their ability to tolerate resource failures. Even checkpointable applications would be affected by resource failures if the checkpointing time is longer than 5 minutes. Intuitively, the job scheduling service should consider the expected availability durations of resources when mapping jobs to them. Jobs with long runtimes should be mapped to resources with long availability durations. Likewise, jobs with no checkpointing capabilities should be mapped to resources that are not expected to transition to the unavailable state. Lastly, resources that fail intermittently may be allocated to short-duration replicated jobs [30, 36].
We describe the details of our method for resource availability prediction by following the Cross Industry Standard Process for Data Mining (CRISP-DM). The major phases of CRISP-DM in our work are the following:
1. Understanding the dynamic nature of the Grid resources (business understanding)
2. Understanding trace data and availability attributes of trace resources (data understanding)
3. Characterizing resource availability with availability attributes (data preparation)
4. Modeling resource availability for predictions (modeling)
5. Evaluating prediction models (evaluation)
These phases are described in detail in the following sections.
Understanding dynamic nature of the grid resources
The dynamic behavior of the Grid resources and its causes are described in detail in Section 1. Here, we describe the Austrian Grid resource availability trace that we considered in our study. Austrian Grid is Austria’s national Grid infrastructure, which joins multiple self-administered institutions distributed nationwide. It consists of 28 heterogeneous Grid-sites at different geographical locations in Austria. These Grid-sites comprise several CPUs and different memory architectures. Altogether, Austrian Grid comprises more than 1500 CPUs. The administrators of Austrian Grid-sites implement different policies for sharing their resources in the Grid, and thus the Grid-sites exhibit different availability behavior.
We collected and characterized the availability trace of Austrian Grid resources for more than 12 months. For each Grid-site, the availability trace consisted of the Grid-site’s state (unavailable or available) and the time stamp at which the state was recorded. The Grid-site state was recorded as “available” if the site was powered on and accepting job requests (e.g. through the Globus Resource Allocation Manager (GRAM)); otherwise it was recorded as “unavailable”. The state of each Grid-site was monitored every 5 minutes. Collectively, the availability trace consisted of more than 25 million states recorded for all Austrian Grid-sites during the monitoring period. In the next section, we explore the availability behavior of Austrian Grid-sites to find the availability attributes that can be used to model their overall availability.
Understanding trace data and availability attributes of trace resources
The trace data was collected through a monitoring application deployed on a dedicated resource external to the monitored resources. The monitoring application ran successfully without any failure during the monitoring period. Thus, no data was missing/incomplete. As the data recorded was very simple (time stamps and resource state), there was no need of data cleaning or transformation.
We attempted to understand the availability properties of the Austrian Grid resources from several perspectives. We began by analyzing the elementary statistics of resource (un)availability. Table 1 presents these as the maximum, minimum, mean, median, first quartile, third quartile, and standard deviation. We noticed several points here. First, a median of 155 means that several resources remained available only for short durations (less than 155 minutes). In comparison, the mean of 1466.7 means that at least some resources were available for much longer durations. We observed a large difference between the mean and the median for both availability and unavailability. This is because there are three different types of resources in Austrian Grid and the summary includes all three. The first type is dedicated resources, which are supposed to be available at all times. It is noteworthy that none of these resources exhibited 100% availability (see Figure 2). These resources exhibit long durations of availability and short durations of unavailability. The second type is resources from computer labs, which are available at some times and unavailable at others. These resources exhibit short durations of both availability and unavailability. The third type is resources that are made available on special request from users and otherwise remain unavailable. These resources exhibit long durations of unavailability and relatively short durations of availability. The uneven spread of durations between the minimum, first quartile, median, third quartile, mean, and maximum shows that several parameters are required to model the availability of these resources. Similar observations were made for the durations of unavailability. Figure 2 shows the mean time between failures (MTBF) and mean time to reboot (MTR) of sample Grid-sites from each of the three types of resources.
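As an illustration of how such elementary statistics can be derived from a raw trace, the sketch below extracts the lengths of consecutive available or unavailable runs and summarizes them. It assumes a 0/1 state list sampled every 5 minutes; the function names and the toy trace are hypothetical, and the values produced are not those of Table 1.

```python
from itertools import groupby
from statistics import mean, median, pstdev, quantiles

SAMPLE_MIN = 5  # one recorded state every 5 minutes


def run_durations(trace, state):
    """Lengths (in minutes) of maximal runs of `state` in a 0/1 trace."""
    return [sum(1 for _ in run) * SAMPLE_MIN
            for value, run in groupby(trace) if value == state]


def summarize(durations):
    """Elementary statistics of a list of durations (needs >= 2 values)."""
    q1, _, q3 = quantiles(durations, n=4)
    return {
        "min": min(durations),
        "q1": q1,
        "median": median(durations),
        "mean": mean(durations),
        "q3": q3,
        "max": max(durations),
        "std": pstdev(durations),
    }


toy_trace = [1, 1, 1, 0, 1, 0, 0, 1, 1, 1, 1, 1, 1]
up = run_durations(toy_trace, 1)    # availability durations
down = run_durations(toy_trace, 0)  # unavailability durations
stats_up = summarize(up)
```

A mean well above the median, as reported for the real trace, indicates a few very long runs among many short ones.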
Summary of the elementary statistics of resource (un)availability in Austrian Grid

As a next step, we identified four availability classes in our trace based on two major availability characteristics: average daily availability and availability durations. The four availability classes are high, medium upper, medium lower, and low. The thresholds for daily availability correspond to the four quarters of a day: less than 6 hours, 6 to less than 12 hours, 12 to less than 18 hours, and 18 to 24 hours; in terms of daily availability these are [<=25%], [>25%-50%], [>50%-75%] and [>75%]. It is noteworthy that we did not find major clusters of daily availability in the availability trace. The thresholds for availability durations are based on the four major clusters of availability durations found in the availability trace: [<=24], [>24-36], [>36-72], and [>72] (in hours). Table 2 depicts this classification. This classification gives a first indication of the durations for which jobs can be executed on these resources. In particular, the average availability duration of a resource indicates whether a job is likely to complete its execution on that resource.
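The thresholds above translate directly into two small classification functions. The sketch below is illustrative only: it assumes that the lowest range maps to the class "low" and the highest to "high", and it returns the two labels separately because the way Table 2 combines them is not restated in the text.

```python
CLASSES = ("low", "medium lower", "medium upper", "high")


def daily_availability_class(percent):
    """Class from average daily availability in percent (0-100)."""
    if percent <= 25:
        return CLASSES[0]
    if percent <= 50:
        return CLASSES[1]
    if percent <= 75:
        return CLASSES[2]
    return CLASSES[3]


def duration_class(hours):
    """Class from average availability duration in hours."""
    if hours <= 24:
        return CLASSES[0]
    if hours <= 36:
        return CLASSES[1]
    if hours <= 72:
        return CLASSES[2]
    return CLASSES[3]


def classify(percent, hours):
    """Both labels for one resource, kept separate (an assumption)."""
    return daily_availability_class(percent), duration_class(hours)
```

For example, a resource that is up 80% of the day but only for 30 hours at a stretch would be labelled ("high", "medium lower") under these assumptions.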
Resource classification based on their daily availability and average availability duration
Figure 3 [20] depicts the average availability of the resources during different hours of the day (Austrian local time). We observed that the resources have higher availability from 9 am to 9 pm. The availability increased from 9 am to 12 pm (noon) and remained within a narrow range between 1 pm and 3 pm. Afterwards, it decreased continuously until midnight and remained at that level until 9 am. This change in availability was due to some of the resources being shut down from the afternoon onwards and turned on again in the early working hours of the day.
Figure 4 [20] shows the daily availability of Austrian Grid resources on different days in the monitoring trace. The daily availability oscillates between two major ranges during the year. In the first six months of the trace (from June to November), the availability mostly remained between 40% and 60%, whereas in the later part it mostly remained between 8% and 22%. At this point, we do not focus on the reasons behind these variations in availability; rather, we look for observable patterns in resource availability that may later help in characterizing it. This is mainly because there may be many different reasons behind variations in the availability of different resources, and we aim to develop resource availability models that are independent of the specific reasons behind such variations.

Availability of Austrian Grid resources on different hours of the day.

Daily availability of Austrian Grid resources on different days in the trace dataset.
Interesting patterns of availability were also noticed on different days of the week. Figure 5 [20] shows the average availability of the considered resources on different days of the week. Relatively higher availability was observed during the first three days of the week, and the lowest availability was observed on Saturday. Interestingly, the availability on Sunday was higher than that on Saturday. This suggests that some of the resources were used in production on Sunday.

Availability of Austrian Grid resources on different days of the week.
During our effort to understand the availability trace, we identified important attributes that help in describing resource availability at a given time. These attributes mainly cover availability information with respect to time and the history of the resource. The time attributes represent calendar information for each recorded instance of data. These are date, hour of the day, and day of the week. The history attributes cover resource state information from the past and include the following self-explanatory items:
- average availability yesterday
- average availability on the same day of last week
- average availability for last week
- average availability for last month
- average availability during peak hours
- average availability during off-peak hours
- average availability on week days
- average availability on weekends
- overall maximum duration of availability
- overall minimum duration of availability
- maximum duration of availability on same day of the week during peak hours
- maximum duration of availability on same day of the week during off-peak hours
- overall average availability
- number of hours since last unavailable state
In the later part of the paper, we refer to these attributes as availability attributes. Initially, for each monitored resource, we recorded the availability data as a pair of the time stamp t and resource state y (available or unavailable) at that time. Each pair of this data is referred to as tuple and the whole data is referred to as trace data. Later, we enriched each tuple with resource availability attributes at that time. After this enrichment, each tuple can be considered as a set {X, y}, where X is set of resource availability attributes at time t and y is resource state at that time.
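To make the enrichment step concrete, the sketch below computes a small subset of the availability attributes for one tuple, using only tuples recorded before it. It is an illustration under stated assumptions rather than the preparation code used in the study: the trace is a list of (datetime, state) pairs with state 1 for available, and the function name `enrich` and the dictionary keys are hypothetical.

```python
from datetime import datetime, timedelta


def enrich(trace, i):
    """Attribute set X for the tuple at index i, built from earlier tuples."""
    t, _ = trace[i]
    history = trace[:i]
    yesterday = (t - timedelta(days=1)).date()
    y_states = [s for ts, s in history if ts.date() == yesterday]
    # Most recent time stamp at which the resource was unavailable, if any.
    last_down = next((ts for ts, s in reversed(history) if s == 0), None)
    return {
        "hour_of_day": t.hour,
        "day_of_week": t.weekday(),  # 0 = Monday
        "avg_avail_yesterday":
            sum(y_states) / len(y_states) if y_states else None,
        "hours_since_last_unavailable":
            (t - last_down).total_seconds() / 3600 if last_down else None,
    }


# Toy trace (an assumption): two days at 5-minute resolution, with the
# resource down for the first 6 hours of day one and up afterwards.
start = datetime(2024, 1, 1)
toy_trace = [(start + timedelta(minutes=5 * k), 0 if k < 72 else 1)
             for k in range(576)]
X = enrich(toy_trace, 300)  # tuple recorded on day two at 01:00
y = toy_trace[300][1]
```

The remaining attributes (weekly and monthly averages, peak and off-peak statistics, maximum and minimum durations) can be added to the returned dictionary in the same way.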
Modeling resource availability for predictions
In this work, we model resource availability predictions using our enriched trace through Decision Tree, Naive Bayes’ Classifier and Radial Basis Function Neural Network. The details of these methods are described in the following sections.
Decision tree for resource availability prediction
Using our enriched data in the form of {X, y}, we can model resource availability y at a time in terms of its availability attributes X. For simplicity, consider X as input and y as an output of the availability model.
In this study, our input data is numeric/nominal and our output data is binary in nature. Since data mining classification techniques are the most suitable for predicting binary or nominal categories [37], we treat resource availability prediction as a binary classification problem in which each set of availability attributes maps to one of two classes: available or unavailable. More specifically, we reformulate our availability prediction problem as “to find a predictive model M(X) that automatically assigns a class label y (available or unavailable) when presented with an attribute set X”.
Since our problem is a simple two-class classification problem [37], we employed a Decision Tree to model y in terms of X. The motivation for choosing the Decision Tree is manifold. First, a Decision Tree effectively deals with nonlinear relationships (if any) among the attributes (here the availability attributes) without manual effort [24]. Second, relatively little effort is required for data preparation for decision trees. Last but not least, modeling through a Decision Tree is simple, fast, and easy to understand. Another reason for preferring the Decision Tree over other methods, such as neural networks, is that we wanted to understand the relationship between resource availability and its availability attributes. However, during this study we noticed that exploring the in-depth relationships between resource availability and its availability attributes requires extra effort, and we plan this as future work. In this paper we focus on resource availability prediction. Since all of our input data was continuous, we discretized it into different intervals. During discretization, we took special care that the intervals were neither so small that they cause overfitting nor so large that they lose accuracy. The final discretized intervals of availability were [1, 7], [7, 14], [14, 21], [21, 28], [28, 35], [35, 42], [42, 49], [49, 56], [56, 63], [63, 70], [70, 77], [77, 84], [84, 91], [91, 100].
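A discretization into these 14 intervals can be sketched with a sorted list of upper edges. The treatment of boundary values is an assumption here (a value equal to an edge is placed in the lower interval), since the text does not state which interval owns a shared endpoint; the names `UPPER_EDGES` and `discretize` are hypothetical.

```python
from bisect import bisect_left

# Upper edges of the 14 intervals: 7, 14, ..., 91, 100.
UPPER_EDGES = [7 * k for k in range(1, 14)] + [100]


def discretize(value):
    """Index (0-13) of the interval that contains an availability value."""
    if not 1 <= value <= 100:
        raise ValueError("availability value outside [1, 100]")
    # bisect_left places a value equal to an edge in the lower interval.
    return bisect_left(UPPER_EDGES, value)
```

For example, a value of 7.5 falls in the second interval (index 1) and a value of 100 in the last one (index 13).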
We modeled resource availability using Classification and Regression Trees (CART) [37] in the RapidMiner tool [28]. The tree induction algorithm of CART takes three inputs: the training dataset D, the set of availability attributes X, and the splitting criteria selection method [37]. The splitting criteria selection method defines a policy for choosing the attribute A ∈ X (along with one of its values) that best separates the tuples in D into individual classes. The Gini index [37] was used to select the splitting criteria. The Gini index (GI) measures the impurity of D as:
\[ GI(D) = 1 - \sum_{i=1}^{m} p_i^2 \]
where p_i is the fraction of tuples in D that belong to class i and m is the number of classes (here m = 2: available and unavailable).
For each attribute A ∈ X, the Gini index considers all possible binary splits. For each of our continuous-valued attributes, we first sort the attribute’s values. Then, the midpoint of each pair of consecutive values is taken as a candidate split-point, and GI is calculated for each midpoint. The midpoint resulting in the minimum GI for attribute A is selected as the final split-point for A. Based on the final split-point of each A ∈ X, D is partitioned into D1 and D2, where D1 satisfies A ≤ split-point and D2 satisfies A > split-point. The Gini index for A is computed as the weighted sum of the impurity of each partition:
\[ GI_A(D) = \frac{|D_1|}{|D|}\, GI(D_1) + \frac{|D_2|}{|D|}\, GI(D_2) \]
The reduction in impurity achieved by a binary split on A is calculated as:
\[ \Delta GI(A) = GI(D) - GI_A(D) \]
The attribute resulting in the maximum reduction in impurity is selected as the final splitting attribute. The final splitting attribute together with its split-point forms the splitting criterion.
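The split-point search described above can be sketched for a single continuous attribute as follows. This is a minimal illustration of the standard two-class Gini computation, not the RapidMiner implementation used in the study; labels are assumed to be 1 (available) and 0 (unavailable), and the function names are hypothetical.

```python
def gini(labels):
    """Gini impurity of a list of 0/1 class labels."""
    n = len(labels)
    if n == 0:
        return 0.0
    p1 = sum(labels) / n
    return 1.0 - p1 ** 2 - (1.0 - p1) ** 2


def best_split(values, labels):
    """Return (impurity reduction, split-point) of the best binary split."""
    pairs = sorted(zip(values, labels))
    n = len(pairs)
    parent = gini([y for _, y in pairs])
    best = None
    for k in range(1, n):
        lo, hi = pairs[k - 1][0], pairs[k][0]
        if lo == hi:
            continue  # no midpoint between equal values
        split = (lo + hi) / 2
        left = [y for v, y in pairs if v <= split]   # D1: A <= split-point
        right = [y for v, y in pairs if v > split]   # D2: A > split-point
        weighted = len(left) / n * gini(left) + len(right) / n * gini(right)
        reduction = parent - weighted
        if best is None or reduction > best[0]:
            best = (reduction, split)
    return best


# Toy data (an assumption): average availability yesterday vs. current state.
avail_yesterday = [10, 20, 30, 60, 70, 90]
state = [0, 0, 0, 1, 1, 1]
result = best_split(avail_yesterday, state)
```

Here the midpoint 45 separates the two classes perfectly, so the impurity drops from 0.5 to 0. Repeating the search over every attribute and keeping the largest reduction yields the splitting criterion for one tree node.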
We used a post-pruning approach for pruning our CART. More specifically, the “cost complexity pruning algorithm” [37] was used.
The Naive Bayes’ Classifier estimates probability of a resource state (available or unavailable) from its likelihood and prior probability conditional to resource availability characteristics. Thus, by nature Naive Bayes’ Classifier exploits resources’ past availability patterns and properties. For our current work of two-class resource state, we describe Naive Bayes’ Classifier as follows.
Suppose C1 and C2 represent the two classes of resource state, i.e. available and unavailable. The prior probabilities of these classes, P(C1) and P(C2), are calculated during the characterization phase as P(C1) = n1/N and P(C2) = n2/N, where N represents the total number of recorded events and n1, n2 represent the number of events belonging to C1 and C2, respectively.
As defined in Section 1, X = {x1, x2, x3, ..., xn} represents our availability attributes. Given a set of availability attributes X, according to Bayes’ Theorem [37]:
\[ P(C_i \mid X) = \frac{P(X \mid C_i)\, P(C_i)}{P(X)}, \quad i \in \{1, 2\} \]
The values of the availability attributes are assumed to be conditionally independent of one another, given the class label. Therefore,
\[ P(X \mid C_i) = \prod_{k=1}^{n} P(x_k \mid C_i) \]
The Naive Bayes’ Classifier was also implemented using RapidMiner.
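For illustration, a two-class Naive Bayes’ Classifier over discretized attributes can be sketched as below. This is not the RapidMiner implementation used in the study. It assumes nominal (already discretized) attribute values and applies Laplace smoothing, which is an addition made here to avoid zero probabilities; the function names and the toy data are hypothetical.

```python
from collections import Counter, defaultdict


def train_nb(rows, labels):
    """Count class frequencies and per-class attribute value frequencies."""
    class_counts = Counter(labels)
    value_counts = defaultdict(Counter)  # (class, attribute index) -> counts
    for x, y in zip(rows, labels):
        for k, v in enumerate(x):
            value_counts[(y, k)][v] += 1
    domains = [set(x[k] for x in rows) for k in range(len(rows[0]))]
    return class_counts, value_counts, domains


def predict_nb(model, x):
    """Return the most probable class and the unnormalized class scores."""
    class_counts, value_counts, domains = model
    total = sum(class_counts.values())
    scores = {}
    for c, n_c in class_counts.items():
        score = n_c / total  # prior P(C_i)
        for k, v in enumerate(x):
            # Laplace-smoothed estimate of P(x_k | C_i)
            score *= (value_counts[(c, k)][v] + 1) / (n_c + len(domains[k]))
        scores[c] = score
    return max(scores, key=scores.get), scores


# Toy tuples (an assumption): (time of day, availability yesterday) -> state.
rows = [("peak", "high"), ("peak", "high"), ("peak", "low"),
        ("off", "low"), ("off", "low"), ("off", "high")]
labels = [1, 1, 1, 0, 0, 1]
model = train_nb(rows, labels)
label_peak, _ = predict_nb(model, ("peak", "high"))
label_off, _ = predict_nb(model, ("off", "low"))
```

The denominator P(X) is the same for both classes, so it is omitted and the class with the larger product is returned.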
In the current study, we used a radial basis function neural network (RBF-NN) to model and predict resource availability. An RBF-NN is a typical neural network (NN) with three layers: an input layer, an approximation layer, and a classification layer, but its neurons in the hidden approximation layer use a radial basis function as the activation function [19, 41]. The main motivation for using an RBF-NN to model resource availability is its efficient and effective approximation capability. Our objective was to find an approximation function g(X) that models resource availability y (available or unavailable) in terms of its availability attributes X as g(X) : X → y. The structure of the three-layered RBF-NN used in our study is shown in Figure 6. The input layer accepts the values of the availability attributes X = {x1, x2, x3, ..., xn} as input to the RBF-NN. In the hidden layer, each neuron j finds the Euclidean distance d_j of the input X from its center C_j:
\[ d_j = \lVert X - C_j \rVert \]

Structure of n-j-1 RBF-NN used to model resource availability.
The center C_j was found as the center of the jth cluster determined using the k-means clustering algorithm on the training dataset.
The output of neuron j was obtained by applying a Gaussian function (as the radial function) to d_j:
\[ \varphi_j(X) = \exp\left( -\frac{d_j^2}{2\sigma^2} \right) \]
Here, σ represents standard deviation (also called width) of the Gaussian and was computed as:
In our implementation of the RBF-NN with the RapidMiner tool, a single hidden layer of 18 neurons was used. The other parameters were: learning rate = 0.3 and momentum = 0.8. The training of the RBF-NN continued until a minimum error of 0.01 was reached or a maximum of 100 iterations had been performed.
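The forward pass of such a network can be sketched as follows. This is an illustration only and makes several assumptions that are not taken from the study: the centers are given (for example, from a prior k-means run), the Gaussian has the common form exp(-d^2 / (2σ^2)), σ is passed in as a fixed value, and the output layer is a weighted sum thresholded at 0.5. The weights would normally be learned during training, which is not shown.

```python
import math


def rbf_activations(x, centers, sigma):
    """Gaussian activation of each hidden neuron for input vector x."""
    acts = []
    for c in centers:
        d = math.dist(x, c)  # Euclidean distance d_j to center C_j
        acts.append(math.exp(-d ** 2 / (2 * sigma ** 2)))
    return acts


def rbf_predict(x, centers, sigma, weights, bias=0.0, threshold=0.5):
    """Class label (1 = available, 0 = unavailable) for input vector x."""
    acts = rbf_activations(x, centers, sigma)
    score = bias + sum(w * a for w, a in zip(weights, acts))
    return 1 if score >= threshold else 0


# Toy network (an assumption): two hidden neurons over two attributes
# scaled to [0, 1]; the first center represents "mostly available" history.
centers = [(0.9, 0.8), (0.1, 0.2)]
weights = [1.0, 0.0]
near_first = rbf_predict((0.9, 0.8), centers, 0.3, weights)
near_second = rbf_predict((0.1, 0.2), centers, 0.3, weights)
```

An input that coincides with a center activates that neuron fully (activation 1.0), and the activation decays with distance at a rate set by σ.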
In our experiments for evaluating the proposed approach, we used leave-one-out cross validation, where the resource state at a given time (during the trace time period) was selected for prediction. The tuple at the given time was removed from the whole dataset, and the remaining data was used for training. The major motivation for using leave-one-out cross validation was to utilize the maximum amount of available data. The quality of our predictive models was evaluated in terms of prediction accuracy. The outcome of each experiment was categorized as one of the following: true positive, false positive, true negative, or false negative. The respective frequencies of these outcomes are denoted by T_p, F_p, T_n, and F_n. The accuracy of our experiments was calculated as [37]:
\[ \text{accuracy} = \frac{T_p + T_n}{T_p + T_n + F_p + F_n} \]
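The evaluation procedure can be sketched as a generic leave-one-out loop followed by the accuracy computation from the four outcome counts. The sketch is illustrative: `fit` and `predict` stand for any of the models above, and the majority-class model used in the example is a placeholder chosen only to keep the code short.

```python
def leave_one_out(rows, labels, fit, predict):
    """Predict each tuple from a model trained on all the other tuples."""
    predictions = []
    for i in range(len(rows)):
        train_x = rows[:i] + rows[i + 1:]
        train_y = labels[:i] + labels[i + 1:]
        model = fit(train_x, train_y)
        predictions.append(predict(model, rows[i]))
    return predictions


def confusion_counts(actual, predicted):
    """Return (Tp, Fp, Tn, Fn) with 1 = available as the positive class."""
    pairs = list(zip(actual, predicted))
    tp = sum(1 for a, p in pairs if a == 1 and p == 1)
    fp = sum(1 for a, p in pairs if a == 0 and p == 1)
    tn = sum(1 for a, p in pairs if a == 0 and p == 0)
    fn = sum(1 for a, p in pairs if a == 1 and p == 0)
    return tp, fp, tn, fn


def accuracy(tp, fp, tn, fn):
    """Fraction of predictions that match the recorded state."""
    return (tp + tn) / (tp + tn + fp + fn)


# Placeholder model (an assumption): always predict the majority class.
def fit_majority(train_x, train_y):
    return max(set(train_y), key=train_y.count)


def predict_majority(model, x):
    return model


rows = [(0,), (1,), (2,), (3,)]
labels = [1, 1, 1, 0]
preds = leave_one_out(rows, labels, fit_majority, predict_majority)
counts = confusion_counts(labels, preds)
acc = accuracy(*counts)
```

In this toy run the single unavailable tuple is misclassified as available, giving one false positive and an accuracy of 0.75.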
In our evaluation, we also compared the accuracy of the proposed models with that of three models from related work proposed by Nadeem et al.: Pattern Matching [22], Nearest Neighbor (NN) Rule [20, 21], and Naive Bayes’ Classifier (NBC) [20]. In their models, Nadeem et al. used the raw availability trace of 1s and 0s, where 1 represents the available state and 0 represents the unavailable state. To differentiate the methods proposed by Nadeem et al. from our proposed methods, in this paper we refer to their methods with the suffix “binary-trace (BT)”: Pattern Matching (BT), nearest neighbor (BT), and Naive Bayes’ Classifier (BT). Our proposed methods are referred to as Neural Network (ET), Decision Tree (ET), and Naive Bayes’ Classifier (ET), where ET stands for “enriched-trace”.
For a fair comparison, we used the same availability trace of Austrian Grid-sites for all eight methods.
We evaluated the prediction accuracy of our proposed model for a set of 21 resources, where 7 resources were selected at random from each of the three resource classes in Austrian Grid: dedicated resources, resources from computer labs, resources available on request. The results presented in this paper represent their average accuracy. For each resource, the accuracy of our predictions was evaluated for 290 individual days of trace period (starting from day 31; the data for first 30 days was used to enrich the data of later days). For each of the 290 days, the predictions were made for each hour of the day (0, 1, 2, 3, . . . , 23), where the exact minutes after each hour of the day were selected at random. It is noteworthy that for each prediction, the tuple corresponding to the time of prediction was excluded from the training data. This way, the proposed method was evaluated for every resource for 290 * 24 = 6960 times. The average accuracy of the 21 resources on each day is presented here as daily prediction accuracy.
Figure 7 shows the daily prediction accuracy of the three proposed methods (Neural Network (ET), Decision Tree (ET), and Naive Bayes’ Classifier (ET)), three methods from related work (Pattern Matching (BT), nearest neighbor (BT), and Naive Bayes’ Classifier (BT)), and two other methods (polynomial regression (ET) and time series forecasting (ET)). Their respective prediction accuracies ranged over 82.21%-99.95%, 79.95%-98.67%, 72.36%-94.77%, 51.94%-77.50%, 64%-79.81%, 66%-85%, 66.35%-86.2% and 61.97%-82.34%, with standard deviations of 3.88, 4.05, 4.23, 5.59, 3.83, 4.12, 4.02 and 4.19. Figure 8 depicts the overall average accuracies of these methods. The respective average accuracies were 94.36%, 91.71%, 87.87%, 66.55%, 72.83%, 80.2%, 77.49% and 74.56%. Clearly, the proposed RBF-NN (ET) method showed the highest prediction accuracy of all eight methods. The Decision Tree (ET) method showed the second highest prediction accuracy and the Naive Bayes’ Classifier (ET) the third highest. We observed that our proposed methods using enriched trace data yielded higher accuracy than the methods using the binary trace. We believe that this difference in accuracy is due to the use of our identified availability attributes, which capture more information about past resource availability behavior. We particularly observed this difference when we compared the prediction accuracy of the Naive Bayes’ Classifier using the enriched trace with that of the same classifier using the binary trace.

Daily prediction accuracy of instant availability predictions using different prediction methods.
We also modeled nearest neighbor method and pattern matching method using enriched trace but their accuracies were less than the methods shown above. Therefore, we have not shown them in this paper.
We evaluated the accuracy of duration availability predictions by the proposed methods for a set of 287 different durations from 10 min to 1440 min (24 hrs), i.e. {10, 15, 20, 25, ..., 1440} minutes. Similar to our evaluation of instant availability predictions (Section 1), we evaluated our duration predictions for each of the 21 selected resources. We started our experiments with predictions for 10 minutes. For each resource, the date and time of the prediction were selected at random (from the available trace), and it was predicted whether the resource would be available for the next 10 minutes or not. The resource’s actual availability (from the historical trace) for the next 10 minutes was used as ground truth to evaluate our prediction. Similarly, experiments were conducted for the other durations. The accuracy over all the resources was averaged to obtain the accuracy for each duration.
Figure 9 shows the prediction accuracies for different time durations using the eight methods. The prediction accuracies of Neural Network (ET), Decision Tree (ET), Naive Bayes’ Classifier (ET), Pattern Matching (BT), nearest neighbor (BT), Naive Bayes’ Classifier (BT), polynomial regression (ET), and time series forecasting (ET) ranged over 81.04%-99.41%, 76.08%-97.39%, 71.19%-93.64%, 51.75%-76.82%, 64.73%-84.23%, 55.72%-81.74%, 69.49%-89.55% and 65.63%-86.53%, with standard deviations of 3.88, 4.01, 3.11, 4.63, 2.76, 3.90, 2.86 and 3.01, respectively. Figure 9-avgAccu depicts the overall average accuracies of the eight methods, which were 90.06%, 86.1%, 81.66%, 64%, 74.11%, 68.69%, 78.67% and 76.15%, respectively. The neural network method, RBF-NN (ET), showed the highest prediction accuracy of all eight methods. The Decision Tree (ET) method showed the second highest, and the Naive Bayes’ Classifier (ET) the third highest prediction accuracy. Once again, our proposed methods using enriched trace data yielded higher accuracy than the methods using the binary trace. As described earlier, we believe that this difference in accuracy is due to the use of our identified availability attributes.
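The improvements of 18% and 31% quoted in the abstract for RBF-NN (ET) over Naive Bayes’ Classifier (BT) can be checked against the average accuracies reported above. The short calculation below assumes that those figures are relative improvements (accuracy gain divided by the baseline accuracy); the text does not state this explicitly, and `relative_gain` is a helper introduced here only for illustration.

```python
def relative_gain(ours, baseline):
    """Relative improvement of `ours` over `baseline`, in percent."""
    return 100.0 * (ours - baseline) / baseline


# Average accuracies (%) taken from the text:
# RBF-NN (ET) versus Naive Bayes' Classifier (BT).
instant_gain = relative_gain(94.36, 80.2)     # instant availability
duration_gain = relative_gain(90.06, 68.69)   # duration availability

print(round(instant_gain), round(duration_gain))  # prints "18 31"
```

Under this reading, the absolute gaps of 14.16 and 21.37 percentage points correspond to relative gains of about 17.7% and 31.1%, which round to the quoted values.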

Average accuracy of instant availability predictions using different prediction methods.

Accuracy of duration availability predictions using different prediction methods.

Average accuracy of duration availability predictions using different prediction methods.
It is noteworthy that the prediction accuracy of our proposed methods is not affected by the duration of a job (no such pattern was observed in our experiments). This shows that our identified availability attributes can effectively model resource availability for short as well as long durations. Both building the RBF-NN and Decision Tree models and computing predictions with them were computationally efficient. For our data, using the RapidMiner tool on a MacBook Pro (2.3 GHz, 16 GB memory), it took only 27.9 and 3.57 seconds to build the RBF-NN and the Decision Tree, respectively, and a few milliseconds for each prediction.
As the process of data mining is repetitive in nature, we believe the accuracy of the proposed models will improve further when the methods are deployed in a real environment and feedback (along with more data) is used in later learning. With our current data, prediction accuracy decreases when less data is used in the learning phase. In contrast, the accuracy of previous methods decreased when the trace data included more resources with different availability behavior and when more historical data was used to train those methods.
Different studies have modeled resource availability in individual environments such as pools of desktop computers [7], peer-to-peer systems [5, 26], P2P desktop grids [27], clusters of computers [1], multi-computers [38], Grids [12, 30], and super-computers [38]. In contrast, we target a method that can predict the availability of a resource independently of its environment. In other words, our method inherently captures the environment properties without their explicit specification. Other studies that investigated resource availability properties include [5, 32]. These efforts were based on short-term data of resource availability and ignored the availability policies defined by resource owners. The two studies most closely related to ours are [12, 30]. These efforts demonstrate the effectiveness of their approaches for a trace that includes resources with similar availability properties, whereas our method is demonstrated to be effective for a trace that consists of a mix of resource types: dedicated resources, resources available temporarily, and resources available on demand.
Among many others, Nurmi et al. [7], Kondo et al. [13, 14], Rood and Lewis [30], Singh and Kaur [35] and Finger et al. [10] investigated various environments to find patterns of resource availability. Shang et al. [33] predicted (with a certain confidence) resource availability based on the resource's past availability. Mustafiz et al. [25] used the Jaccard Index to predict resource availability in enterprise Grids. Ren et al. [40] exploited resources’ CPU usage and contention to predict resource availability through a Markov chain. Perhaps the effort closest to ours is that of Andrzejak et al. [4], who predicted the availability of a pool of resources using a Naive Bayes’ classifier.
Brevik et al. [7] predicted resource availability through mathematical modeling. Studies in [12, 14] employed statistical methods to model resource availability. However, the predictions based on such models are probability values of resource availability, which are mostly not sufficient to conclude whether the resource will remain available or not, especially for mid-range probability values. The authors in [29, 30] predicted resource availability by considering the resource's availability on the same day of the previous week and achieved good prediction accuracy. In contrast, we identify and consider 14 main availability attributes that describe resource availability at a given time and thus achieve much better prediction accuracy. Singh et al. [35] modeled the availability and reliability of Grid computing systems using a Markov model.
Bouyer et al. [6] proposed an online announcer for the detection of available resources in the Grid environment. The authors used rough set analysis to build useful rules for predicting the behavior of individual nodes. Hu et al. [11] used support vector regression to predict resource availability in the Grid. The authors employed a genetic algorithm to automatically determine the optimal parameters of support vector regression. Support vector regression has also been exploited by Zheng [42], who used simulated annealing algorithms to optimize its parameters. Alsoghayera et al. [2] proposed a mathematical model for predicting the risk of Grid-site failures by modeling their reliability through a discrete-time analytical model. Their approach to predicting the failure of a resource for a given time duration is based on the resource's history. Vrignat et al. [39] also predicted resource failure, using a hidden Markov model. The author in [18] ranked different Grid sites on the basis of their reliability in executing jobs successfully. The proposed reliability model considers resource availability and job success rate from historical data.
Andrea et al. [9] predicted the availability of devices in online social networks using linear predictors. The authors in [17, 23] used ant colony optimization and genetic algorithm, respectively, to maximize resource availability for task scheduling.
Conclusion and future work
Large-scale distributed environments include hundreds or thousands of resources that have different availability properties. To exploit these resources effectively, the data management and resource management services need advance predictions of resource availability. However, different resource sharing policies, the reliability of the resource's software and hardware, and management and maintenance services make resource availability prediction a hard task. To address this problem, we presented a resource availability prediction approach based on data mining methods, particularly a Neural Network. To apply data mining methods, we proposed a novel approach to characterize resource availability in terms of past resource availability attributes. We enriched the collected binary availability trace with these availability attributes. The proposed approach significantly improved the accuracy of resource availability predictions. Our experiments showed that the predictions using the RBF-NN (ET) through the proposed approach were 18% and 31% more accurate than those of the best method in related work, Naive Bayes’ Classifier (BT), for instant and duration availability predictions, respectively. The prediction accuracy improved as more data was used in the learning phase, and the prediction overhead was very small. Although our experiments were conducted in the Austrian Grid environment, we believe the proposed model can be easily applied to other similar environments such as HTCondor, CERN, etc.
In the future, we plan to evaluate our approach with variations of the availability attributes. We will also evaluate the reduction in job run time when resources are selected according to their availability predictions. We further plan to evaluate improvements in data management tasks based on availability-aware resource selection.
Acknowledgment
This work was funded by the Deanship of Scientific Research (DSR), King Abdulaziz University, Jeddah, under grant No. 611-002-D1434. The authors therefore gratefully acknowledge DSR's technical and financial support.
