Abstract
Replication technology can efficiently enhance data availability and thereby increase system reliability in cloud storage systems. However, one urgent challenge is how to select the correct replica hosting sites, as different data center configurations lead to entirely different storage service quality. Additionally, users usually have individual requirements; some may care more about reliability while others are more likely to be concerned about cost. This paper investigates choosing the most suitable replica storage sites using an analytic hierarchy process (AHP) model by applying fuzzy comprehensive evaluation to candidate data centers for different kinds of users. First, a novel four-dimensional Qos (quality of service) model of cloud storage service is proposed. Then, a Qos preference-aware algorithm is introduced to deal with individual Qos sensitivity (IQS) constraints. In order to evaluate the candidate replica storage sites and select the best choices from among them, an algorithm based on fuzzy comprehensive evaluation is designed and implemented. Simulation results indicate that the strategy serves users with various IQS well, with improved effectiveness and practicality.
Introduction
By placing multiple copies of data sets in different data centers, replication is a commonly used technique in cloud storage systems to reduce access latency and network bandwidth utilization, thereby enhancing system reliability and load balancing, especially for applied scientific and technical programs that require large-volume data sets [4, 7]. Additionally, it has been widely accepted as an important part of data-driven process monitoring or statistical process monitoring (SPM) that applying multivariate statistics and machine learning methods to fault detection and diagnosis for industrial process operations enhances production results [14].
However, many unsolved questions remain in data set replication technology. One question is how to select the correct sites for new replica storage that can satisfy not only the system requirements but also user requirements. On one hand, it is impossible to provide unlimited storage capacity in data centers. On the other hand, different configurations of different data centers lead to varying storage service quality, such as availability, performance and fault tolerance. Users of different cloud storage services may tolerate different Qos (quality of service). Each user typically has individual requirements; some care more about reliability while others are more likely to be concerned about cost. Therefore, it is necessary to provide a preference-aware replica selection strategy for different users with individual Qos sensitivity (IQS). In addition, all resources in the cloud carry certain costs; wherever the data set replica storage is located, resource consumption must be paid for. Each data center deploys a different price policy, and thereby provides different storage cost ratios. Furthermore, data transfer also incurs cost, as the data sets must be transferred from remote data centers. These costs may differ significantly because application data sets vary in transfer price policies, usage frequencies and data size. Moreover, owing to global uncertainty, there is no suitable mathematical model that characterizes network behavior well enough to predict accurate replica placement. Consequently, the wide utilization of the pay-as-you-go model in the cloud makes the data set replica placement problem more complex than before.
Fuzzy comprehensive evaluation is a mathematical method to comprehensively evaluate things that are not clearly defined in the real world by using fuzzy mathematical methods [8, 19]. In this paper, fuzzy logic is used to identify suitable data centers for the replication of data sets, because fuzzy logic can deal with reasoning that is approximate rather than fixed and exact. In order to address the replica storage placement problem in such an environment, a Qos preference-aware replica selection strategy for different kinds of users with individual Qos sensitivity constraints in cloud computing data centers is proposed. First, a novel four-dimensional Qos model of replica selection is introduced, including reliability, time, cost and security. Next, a Qos preference-aware algorithm based on an analytic hierarchy process (AHP) is proposed to deal with the IQS constraints. In order to evaluate the candidate replica storage sites and choose the best from among them, an algorithm for replica placement selection based on fuzzy comprehensive evaluation is designed and implemented.
The rest of the paper is organized as follows: Section 2 presents related research on data replication in the cloud. Section 3 provides a novel four-dimensional Qos model of replica selection. Section 4 describes the candidate data center evaluation algorithm based on fuzzy logic. Section 5 outlines the simulation environments and presents the simulation results. Finally, Section 6 concludes and provides directions for future research.
Literature review
With the advancement and development of various technologies, data set replica placement in distributed systems has been studied in many works, which are referenced and adopted in cloud data set replication. Lee et al. [12] presented a modified bandwidth hierarchy-based replication (BHR) algorithm by minimizing data access time and avoiding unnecessary replication. Ren et al. [20] proposed replica placement based on storage used and makespan as evaluation parameters.
Unfortunately, current solutions are frequently focused on improvements in data access performance, and neglect the cost of data set replica management, such as storage cost. Simultaneously, new requirements and challenges continue to emerge for the deployment of scientific applications in the cloud along with the development of information technology. Nevertheless, very little has been done to consider the comprehensive factors in data set replica placement, such as reliability and security, and particularly the related cost.
Based on fuzzy evaluation, information about the priority of various alternatives can be achieved as a reference for decision makers. Some recent works have addressed the problem of data set replica placement using fuzzy evaluations. Table 1 summarizes the related technological reports in recent years.
As shown in Table 1, it can be concluded that: (i) current solutions primarily consider a simple linear combination of resources. In the proposed approach, comprehensive factors such as reliability and security are additionally considered to improve system performance; (ii) cost is an important element in deciding replica storage placement, but little research has been conducted considering the cost paid by users. In this paper, cost is regarded as one of the fundamental components in determining replica storage placement; (iii) with growing emphasis on cloud computing, data management systems for the cloud environment have emerged, including Google file system (GFS) and Hadoop distributed file system (HDFS) [1]. However, the employed replica selection features are still relatively simple, and do not consider the Qos preference of the user. These observations motivate the development of a replica placement strategy that can optimize storage sites in addition to delivering high system performance.
Qos based replica placement selection models
There are many factors that affect data set replica selection. This paper proposes a 4-dimensional Qos model, which addresses reliability, time, security and cost, as shown in Fig. 1. These variables describe data access quality in terms of reliability; access time elapsed since the request was sent; security; and the cost of data set storage and transfer. These parameters are analyzed in detail below.
Reliability. It is clear that any application system that satisfies all business requirements but fails to satisfy the reliability quality of service parameters will lead to dissatisfaction among cloud users. Primary parameters involving reliability include: (i) data center availability; (ii) data transfer reliability; and (iii) consistency of data set replicas.
Time. Data centers, as an important infrastructure component of the cloud computing environment, must handle user requests as soon as possible. It is obvious that network bandwidth consumption is an important factor used to identify cloud storage service quality [5]. However, the storage request queue and storage media speed were not addressed in previous work as factors that influence the data set response time. The primary parameters involving the time dimension include: (i) available bandwidth; (ii) data transfer delay; and (iii) storage access latency. Regarding data transfer delay, each data center receives many requests simultaneously, but can serve only one request at a time [2, 9]. Therefore, requests must wait in a queue. However, there are many unknown exceptions in practical applications, such as the interruption of a previous request due to an unstable or failed network, the timeout of a previous request, or a previous request that receives a response with various error codes. To solve these problems, a time threshold $t_{maxtf}$ is defined as the longest transmission delay; each data transfer request must complete within the time threshold, which is proportional to data set size. Regarding storage access latency, the storage media speed and the number of requests in the queue play major roles in determining the average response time experienced by applications [2]. Different storage media have different speeds (data transfer rates) in reading and writing operations. For example, the HP StorageWorks Ultrium 920 Tape Drive has a speed of 120 MBps, while the HP StorageWorks Ultrium 448 Tape Drive has a speed of 24 MBps. Consequently, data set access latency is the data volume divided by storage speed [9].
Cost. With a pay-as-you-go model, it is obvious that cost is one of the most important aspects in deciding whether to use cloud storage services or not, especially when large data sets or “big data” are common in the cloud. There are three typical types of cost consumption: data set storage cost, transfer cost and update cost.
Security. In a cloud environment with many data centers, it is necessary to consider not only the security of the data set itself, but also the safety of the data set environment. The primary parameters involving security include: host security, data transfer security and replica integrity.
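The storage access latency rule from the time dimension (data volume divided by storage speed) can be illustrated with a minimal sketch. The two speeds are the tape drive figures quoted above; the function name and the 12 GB data set size are hypothetical values chosen only for illustration.

```python
# Sketch: storage access latency = data volume / storage media speed.
# The speeds are the two tape drive figures quoted in the text; the 12 GB
# data set size is a hypothetical value chosen for illustration.

def access_latency_seconds(size_mb: float, speed_mbps: float) -> float:
    """Return the time in seconds needed to read or write size_mb at speed_mbps (MB/s)."""
    if speed_mbps <= 0:
        raise ValueError("storage speed must be positive")
    return size_mb / speed_mbps

ULTRIUM_920_MBPS = 120.0
ULTRIUM_448_MBPS = 24.0

size_mb = 12_000.0  # hypothetical 12 GB data set, assuming 1 GB = 1000 MB
fast = access_latency_seconds(size_mb, ULTRIUM_920_MBPS)
slow = access_latency_seconds(size_mb, ULTRIUM_448_MBPS)
print(fast, slow)  # 100.0 500.0
```

The same data set takes five times longer on the slower medium, which is why storage media speed is treated as a separate parameter of the time dimension.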
Fuzzy based Qos preference-aware and replica placement algorithm
In this section, a new method of data set replica placement based on AHP is proposed, called the fuzzy based Qos preference-aware algorithm (FQPA).
Qos preference-aware algorithm
According to user Qos preference, a Qos preference-aware algorithm (QPA) is proposed, as described below.
After the hierarchy structure is established, fuzzy pair-wise comparison matrices of the ultimate goal and the sub-goals can be obtained for each dimension, written as $A$ and $A_k$ ($k = 1, 2, 3, 4$) with respect to reliability, time, cost and security. Specifically, $A = (a_{ij})_{4 \times 4}$ holds the pair-wise comparison values for the four dimensions, where $a_{ij} = f(x_i, x_j)$ ($i, j = 1, 2, 3, 4$) and $x_i$ denotes the $i$-th of the four elements.
The following conclusions can be drawn:
Similarly, four 4-dimension matrices $A_k$ ($k = 1, 2, 3, 4$) can be obtained in the attributes layer. These matrices are also compatible with the above two lemmas.
During construction of the judgment matrix, strict transitivity and consistency are not necessary; that is, $a_{ij} \times a_{jk} = a_{ik}$ need not be satisfied exactly, but the overall consistency of the judgment matrix is generally required. The consistency ratio (CR) can be calculated according to Equation (2) [17, 18].
RI is the average random consistency index, as shown in Table 2. CI is the consistency index, which can be approximately calculated using Equation (3), where $n$ is the dimension of matrix $A$ ($n = 4$) and $\lambda_{max}$ is the maximum eigenvalue of the matrix.
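For reference, the standard AHP definitions of these two quantities are written out here. This is a sketch that assumes Equations (2) and (3), which are not reproduced in this text, follow the usual form; the symbols are the ones defined above.

$$
CR = \frac{CI}{RI}, \qquad CI = \frac{\lambda_{max} - n}{n - 1}
$$

For a perfectly consistent matrix, $\lambda_{max} = n$, so both CI and CR are zero.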
A common way to carry out this calculation, the characteristic root method, is given by Equation (4).
If CR < 0.1, the consistency of the matrix is acceptable; otherwise, if CR ≥ 0.1, appropriate amendments should be made to the matrix and the above steps repeated until CR is less than 0.1.
Once the value of CR meets the pre-defined threshold, W is taken as the weight vector. The weight vectors $w_1$, $w_2$, $w_3$ and $w_4$ can be obtained in the same way.
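The consistency check and weight derivation can be sketched in a few lines of Python. This is a hedged illustration rather than the paper's implementation: the pairwise matrix is a made-up example of a reliability-sensitive user, the function names are hypothetical, power iteration stands in for whatever eigenvalue routine was actually used, and RI = 0.90 is the commonly tabulated random index for n = 4 (Table 2 is not reproduced here).

```python
# Sketch of an AHP consistency check and weight derivation (standard library only).
# Assumptions: the pairwise matrix is a made-up example, and RI = 0.90 is the
# commonly tabulated random index for n = 4 (Table 2 is not reproduced here).

def principal_eigen(matrix, iterations=100):
    """Power iteration: return (lambda_max, weights normalized to sum to 1)."""
    n = len(matrix)
    w = [1.0 / n] * n
    lam = 0.0
    for _ in range(iterations):
        v = [sum(matrix[i][j] * w[j] for j in range(n)) for i in range(n)]
        lam = sum(v)              # since sum(w) == 1, sum(A w) estimates lambda_max
        w = [x / lam for x in v]  # renormalize so the weights sum to 1
    return lam, w

def consistency_ratio(lambda_max, n, random_index):
    """CR = CI / RI with CI = (lambda_max - n) / (n - 1)."""
    ci = (lambda_max - n) / (n - 1)
    return ci / random_index

# Hypothetical judgments over (reliability, time, cost, security).
A = [
    [1.0, 2.0, 3.0, 4.0],
    [1 / 2, 1.0, 2.0, 3.0],
    [1 / 3, 1 / 2, 1.0, 2.0],
    [1 / 4, 1 / 3, 1 / 2, 1.0],
]

lam, W = principal_eigen(A)
cr = consistency_ratio(lam, n=4, random_index=0.90)
print("weights:", [round(x, 3) for x in W])
print("acceptable:", cr < 0.1)
```

If the check fails, the judgments in `A` would be revised and the computation repeated, mirroring the amendment loop described above.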
Using the QPA algorithm, the IQS of each cloud storage service user can be derived from fuzzy, non-quantitative IQS constraints. The resulting weight vector, which reflects the Qos preference of the user, is the foundation for the following algorithm.
In this sub-section, a novel fuzzy comprehensive evaluation algorithm for replica placement selection in the cloud using fuzzy theory is designed. The fuzzy Qos preference-aware replica placement algorithm (FQPA) can be described as follows:
At least two difficulties arise in evaluating the 12 attributes. First, there is no common criterion; second, the upper and lower bounds of the attribute values differ. Moreover, some attributes are better with larger values, while others are better with smaller values.
In order to evaluate Qos attributes, the attribute values must be standardized using membership functions. In this paper, triangular membership functions (MF) are used to describe these variables. Each variable has five MFs: I, II, III, IV and V. Table 3 describes the membership function for each attribute.
In Table 3, the replica update frequency is measured in units of mpt and hpt: x-mpt (x minutes per time) indicates that the data set updates once every x minutes, and x-hpt (x hours per time) indicates that it updates once every x hours. Request wait time in the queue is used to describe data transfer delay. Because a request is served only after every request ahead of it in the queue has completed its storage access, the request has to wait for the total latency of the prior requests. Thus, the data transfer delay is the sum of the access latencies of all prior requests in the queue.
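The queue-based definition of data transfer delay can be sketched as follows. The function name and the queue contents are hypothetical; the 24 MBps speed reuses the slower tape drive figure quoted earlier in the paper.

```python
# Sketch: data transfer delay = sum of the access latencies of all requests
# that are ahead of the current one in the queue.
# The queue contents below are hypothetical.

def transfer_delay_seconds(prior_sizes_mb, speed_mbps):
    """Total wait in seconds before the current request can be served."""
    if speed_mbps <= 0:
        raise ValueError("storage speed must be positive")
    return sum(size / speed_mbps for size in prior_sizes_mb)

queue_mb = [2400.0, 1200.0, 600.0]  # sizes of the requests ahead, in MB
delay = transfer_delay_seconds(queue_mb, speed_mbps=24.0)
print(delay)  # 175.0 (100 s + 50 s + 25 s)
```

An empty queue yields a delay of zero, so a lightly loaded data center is favored on this attribute.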
In addition, the access control strategy, transfer security strategy and data security strategy are described in Table 4. The number of security strategies is used to describe host security, data transfer security and replica integrity, respectively. A modern server CPU usually has multiple cores, and the CPU utilization ratio refers to the utilization of the CPU resources of the whole system. The utilization ratios of all processors are not difficult to acquire using functions in the C++ or Java programming languages. For simplicity, the average CPU utilization ratio is used to represent data center availability. For example, a data center with four cores whose utilization ratios are CPU-1: 23.98%, CPU-2: 17.43%, CPU-3: 16.02% and CPU-4: 8.15% has an average CPU utilization ratio of 16.395%.
As an example, the membership function of the CPU utilization ratio is presented using Equations (5–9). The membership functions of the other attributes are similar to that of the CPU.
Its membership function is depicted in Fig. 3.
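A triangular membership function of this kind can be sketched as follows. Equations (5–9) and the breakpoints of Table 3 are not reproduced in this text, so the evenly spaced peaks at 0, 25, 50, 75 and 100% and the function names are assumptions made purely for illustration.

```python
# Sketch of five triangular membership functions (grades I..V) for the CPU
# utilization ratio. The peaks at 0, 25, 50, 75 and 100 percent are hypothetical;
# the paper's actual breakpoints are defined in Table 3 and Equations (5-9).

def triangular(x, a, b, c):
    """Triangular MF with feet at a and c and peak at b (a < b < c)."""
    if x <= a or x >= c:
        return 0.0
    if x <= b:
        return (x - a) / (b - a)
    return (c - x) / (c - b)

GRADE_PEAKS = [0.0, 25.0, 50.0, 75.0, 100.0]  # assumed peaks for grades I..V
HALF_WIDTH = 25.0

def membership_vector(utilization_percent):
    """Degrees of membership of one attribute value in grades I..V."""
    return [
        triangular(utilization_percent, p - HALF_WIDTH, p, p + HALF_WIDTH)
        for p in GRADE_PEAKS
    ]

# The average CPU utilization from the worked example in the text.
mu = membership_vector(16.395)
print([round(m, 4) for m in mu])  # [0.3442, 0.6558, 0.0, 0.0, 0.0]
```

With evenly spaced, overlapping triangles, every value in the range belongs to at most two adjacent grades and its memberships sum to one, which yields one row of a judgment matrix per attribute.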
Next, we evaluate each attribute according to its membership function, and obtain four judgment matrices as shown in Equation (10).
In this paper, weighted averaging operators M(·,+) take all elements into account, as shown by Equation (12).
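Equation (12) is not reproduced in this text. As a sketch, assuming the usual form of the weighted averaging operator in fuzzy comprehensive evaluation, the first-level result for dimension $k$ combines the weight vector $w_k$ with the judgment matrix of that dimension. The symbols $R_k$, $r^{k}_{ij}$, $B_k$ and $b^{k}_{j}$ are notation assumed here for illustration.

$$
B_k = w_k \circ R_k, \qquad b^{k}_{j} = \sum_{i} w_{k,i} \, r^{k}_{ij}, \quad j = 1, \dots, 5
$$

Because every attribute contributes in proportion to its weight, no single attribute determines the result on its own, which is the sense in which the operator takes all elements into account.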
Afterwards, a second level comprehensive evaluation result can be obtained as shown in Step 3, based on Equation (14).
Then the most suitable replica storage site is the data center with the highest score.
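The two-level evaluation and the final selection can be sketched as follows. Everything numeric here is hypothetical: the attribute and dimension weights, the two candidate data centers, and the scoring vector that maps grades I to V onto scores 5 to 1 (assuming grade I is the best) are illustrative assumptions, not values from the paper.

```python
# Sketch of a two-level fuzzy comprehensive evaluation with the weighted
# averaging operator, followed by selection of the highest-scoring data center.
# All weights, judgment matrices and grade scores below are hypothetical.

def weighted_average(weights, matrix):
    """M(.,+) operator: b_j = sum over i of weights[i] * matrix[i][j]."""
    columns = len(matrix[0])
    return [sum(w * row[j] for w, row in zip(weights, matrix)) for j in range(columns)]

def score(center, attr_weights, dim_weights, grade_scores):
    """Two-level evaluation of one data center, reduced to a single score."""
    first_level = [weighted_average(w_k, r_k) for w_k, r_k in zip(attr_weights, center)]
    second_level = weighted_average(dim_weights, first_level)
    return sum(b * s for b, s in zip(second_level, grade_scores))

def all_in_grade(grade):
    """Hypothetical 3x5 judgment matrix: all three attributes fully in one grade."""
    return [[1.0 if j == grade else 0.0 for j in range(5)] for _ in range(3)]

ATTR_WEIGHTS = [[0.5, 0.3, 0.2]] * 4  # hypothetical attribute weights w1..w4
DIM_WEIGHTS = [0.4, 0.3, 0.2, 0.1]    # hypothetical W: reliability, time, cost, security
GRADE_SCORES = [5, 4, 3, 2, 1]        # assumed scores for grades I..V

# One judgment matrix per dimension, in the order reliability, time, cost, security.
centers = {
    "dc_a": [all_in_grade(0), all_in_grade(1), all_in_grade(2), all_in_grade(0)],
    "dc_b": [all_in_grade(2), all_in_grade(0), all_in_grade(0), all_in_grade(1)],
}

scores = {name: score(c, ATTR_WEIGHTS, DIM_WEIGHTS, GRADE_SCORES) for name, c in centers.items()}
best = max(scores, key=scores.get)
print(best, scores)
```

With these reliability-heavy weights, the more reliable `dc_a` wins even though `dc_b` is faster and cheaper; swapping in a cost-heavy weight vector would change the ranking, which is the behavior a preference-aware strategy is meant to capture.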
In order to verify the validity and reliability of the comprehensive evaluation algorithm with respect to different individual Qos sensitivities, a simulation platform for the replica storage placement problem is designed and implemented at the Network & Information Security Lab, Shandong University of Finance and Economics (SDUFE). This simulation system is constructed based on SwinDeW-C [3, 21], and contains 10 super data centers (servers) and 200 ordinary data centers. VMWare (http://ww.vmware.com) is installed on each data center, so that it can offer unified computing and storage resources.
There are five modules in the simulation platform: (i) the simulation conditions generator, which is responsible for generating various simulated conditions, such as data center transfer delay, data update frequency, disk throughput and data transfer strategy; (ii) the replica locater, which is responsible for locating candidate storage sites, including receiving the data request and collecting the data center information; (iii) the Qos perceiver, whose main purpose is to turn the user Qos preferences into a comprehensive evaluation weight set; (iv) the replica placement selector, whose main purpose is to simulate the algorithm described in Section 4; and (v) the results display, whose main purpose is to depict the results on the screen.
In the performance evaluation, in addition to the proposed algorithms, the no replication (NREP) scheme is used as the baseline, and the random (RAND) scheme and greedy algorithm (GA) are used for comparison. As the name states, in NREP only the original object at the root exists, and no additional replicas of the data set are created.
Random simulations are conducted on randomly generated data sets of different sizes, generation times, and usage frequencies. In the simulations, 100 data sets are used, each with a random size from 100 GB to 1 TB. The usage frequency is also random, ranging from 1 to 10 uses.
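A workload with these parameters could be generated along the following lines. This is only a sketch: the function name, the record layout, the fixed seed, the uniform distributions and the convention 1 TB = 1000 GB are all assumptions, and the random generation times mentioned above are omitted because their range is not stated.

```python
# Sketch of a random workload generator matching the stated parameters:
# 100 data sets, sizes from 100 GB to 1 TB, usage frequency from 1 to 10.
# Uniform distributions, the seed and the record layout are assumptions.
import random

def generate_data_sets(count=100, seed=42):
    """Return a list of hypothetical data set descriptors."""
    rng = random.Random(seed)  # fixed seed so that a run can be repeated
    return [
        {
            "id": i,
            "size_gb": rng.uniform(100.0, 1000.0),  # 100 GB to 1 TB (1 TB = 1000 GB)
            "usage_frequency": rng.randint(1, 10),  # 1 to 10 uses, inclusive
        }
        for i in range(count)
    ]

data_sets = generate_data_sets()
print(len(data_sets), data_sets[0]["usage_frequency"] in range(1, 11))
```

Seeding the generator makes it possible to feed the same workload to each placement strategy, so that differences in the results come from the strategies rather than from the random draw.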
Simulation 1: A comparison of available probability with different replica placement strategies. In simulation 1, $CU_{rel}$ users (those who express high reliability requirements for cloud storage service) are used as an example, and the strategies are compared by available probability. Generally speaking, users concerned with reliability can be satisfied with a high data set available probability. Figure 4(a) depicts the comparison of available probabilities of data sets with an increasing number of replicas, indicating that the available probability increases for every strategy except NREP. Moreover, with comprehensive evaluation of data centers, FQPA is always better than the other three studied algorithms and can optimize the replica storage placement, which further increases the available probability of data sets.
Simulation 2: A comparison of cost with different replica placement strategies. In simulation 2, $CU_{cost}$ users (those with high concern about data management cost) are used as an example, and the strategies are compared by data set management cost. In order to compare the costs, the normalized cost ($NC_k$) is defined as the ratio of the difference between the cost of NREP and the cost of the feasible solution found by the algorithm to the cost of the NREP scheme. Figure 4(b) presents a comparison graph of the total data management costs. The graph clearly shows a cost reduction using the proposed replica strategy as compared to the other algorithms. The total cost of replicas becomes constant beyond a certain number of replicas. As a result, the proposed algorithm optimizes the cost of replication.
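Written as a formula, the verbal definition of the normalized cost reads as follows, where $C_{NREP}$ and $C_k$ are symbols assumed here for the cost of the NREP scheme and the cost of the solution found by algorithm $k$:

$$
NC_k = \frac{C_{NREP} - C_k}{C_{NREP}}
$$

Under this definition NREP itself has a normalized cost of 0, and a larger $NC_k$ corresponds to a larger saving relative to the baseline.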
Simulation 3: A comparison of average transfer times with different replica placement strategies. $CU_{time}$ users (those with high concern about service time) are used as an example, and the strategies are compared by average transfer time. Generally, those concerned with time can be satisfied with less data transfer delay. Figure 4(c) depicts the average transfer time among different replica placement strategies. Similarly, the NREP average is defined as 100, and other values are the ratio of the average time to that of NREP. As shown in Fig. 4(c), in all circumstances, the FQPA algorithm shows performance improvement over all other algorithms in terms of transfer time. As the number of replicas increases, the average transfer time decreases, because data sets are stored in more data centers and data need only be transferred from a nearby site.
Conclusions and future works
Because the human decision-making process usually contains fuzziness and vagueness, the fuzzy AHP (FAHP) is adopted to solve the problem in this paper. In order to find the correct replica storage sites for different kinds of users with individual Qos sensitivities in the cloud environment, this paper proposes an approach based on FAHP with comprehensive consideration of the system configurations and user Qos preference. The analytic hierarchy is structured by four major parameters: data center reliability, cost, time and security. Simulation results have shown that the selection strategy can satisfy users' IQS preferences and demonstrates improved effectiveness and practicality over traditional methods.
In future work, a knowledge-based expert system can be integrated to help decision-makers make more concise calculations and interpret the results in each step. Moreover, a proper evaluation approach as well as a suitable decision logic can be developed to make a correct final decision for data-driven framework systems, i.e., data-driven approaches for industrial process monitoring or improved PLS focused on key performance indicators related to fault diagnosis.
Acknowledgments
The work presented in this paper is partly supported by the Project of Shandong Provincial Natural Science Foundation (No.ZR2016FM01), China; the Doctor Foundation of Shandong University of Finance and Economics under Grant (No. 2010034); and the Project of Jinan Hightech Independent and Innovation (No. 201303015), China.
