Abstract
As a typical distributed network computing environment, cloud computing has grown tremendously in recent years. Although it provides a convenient environment for executing data-intensive applications, a primary challenge is how to choose secure data replicas for such applications from different storage resources. Because security is characterized by fuzziness and randomness, this paper quantifies it using the cloud model. On this basis, taking the constraints of time, cost, and security into account, a scheduling algorithm for data-intensive applications based on the weighted set covering problem (WSCP) is proposed. Simulation results demonstrate that the security-aware data replica selection strategy performs better than traditional selection strategies in terms of time, cost, security utility value and load balancing deviation (LBD).
Keywords
Introduction
Driven by the requirements of collaborative research, large-scale scientific applications, such as the Compact Muon Solenoid Data Grid and applications in astronomy and biology, have received more and more attention. Because they usually generate huge amounts of data, they are typical data-intensive applications [12]. Meanwhile, with the rapid development of cloud computing, customers have begun to migrate data-intensive applications to cloud computing platforms [8], which use virtualization technology to integrate computing, storage and network resources into a resource pool and enable consumers to use cloud resources on a pay-on-demand basis regardless of time and place. By buying cloud services, consumers can obtain the resources they need in different development phases without spending additional time or energy [22]. To enhance the reliability and availability of large volumes of data stored as data files, the files are divided into chunks and replicated on different storage resources [7], which are usually located at different sites scattered throughout the network. Customers therefore face the challenge of selecting appropriate data replicas for their data-intensive applications. Although data replica selection was studied extensively in the data grid era [17,19], including in network-aware scenarios [16], the cost-free, best-effort model of the grid led the traditional optimization goal to focus mainly on time, which is not suitable for the business model of cloud computing. In addition, the cloud service model offered by cloud service providers has the advantages of high efficiency, economy and strong extensibility, which attract many businesses and users. However, it also faces great challenges in information security and privacy protection [23,26,31]. Aguiar et al. [2] pointed out that a cloud service selection strategy without security awareness could lead to poor performance, high risk, information leakage and even failure.
Because of the security problems that exist in cloud computing [28], consumers have started to worry about data security; that is, it is important to choose secure storage resources to protect data from breaches and tampering. Consider the following real-life scenario. Suppose that huge log files of bank customers’ behavior need to be analyzed on a cloud computing platform. The bank is concerned not only with the efficiency of log file processing, but also with the security of the data during storage and transmission.
In this paper, a data-intensive application is regarded as a Bag of Tasks (BoT) consisting of many independent tasks, each of which requires a number of data sets replicated on multiple storage resources. To address the problems mentioned above, we put forward a novel security-aware data replica selection strategy. The contributions of this paper are threefold: (1) we quantify the security of data replicas using the cloud model; (2) we formalize the BoT scheduling problem by considering time, cost, security and load balancing deviation; (3) we present a heuristic algorithm to solve the BoT scheduling problem.
The rest of this paper is organized as follows. Related research on data replica selection is reviewed in the next section. The formalization of BoT is introduced in Section 3. The criteria of data replica selection are given in Section 4. The heuristic scheduling algorithm for BoT is described in Section 5. Section 6 discusses the performance of the algorithm through simulation experiments. Finally, conclusions and suggestions for future research are presented in Section 7.
Related works
Because the data replica selection strategy plays an important role in BoT scheduling, many algorithms have been proposed in the literature [11]. Venugopal and Buyya [27] abstracted the process of data replica selection as a set covering problem (SCP) and chose minimizing the total time as the optimization objective. Al-Mistarihi and Yong [3] had two optimization objectives: one was to ensure fairness among customers, and the other was to choose the best data replica by maximizing reliability and security, in which security was computed with the arithmetic mean method from the historical data file. In [13], the best data replica was chosen through the Euclidean distance according to selection criteria including availability, security and time. Considering that security and availability need to be quantified by linguistic variables, Almuttairi et al. [4] used rough set theory to describe the multi-attribute data replica selection problem, which could yield grey numbers used for data replica decision-making. Wang et al. [29] presented a scheduling policy incorporating a trust mechanism, which described the credibility of the resource provider based on probability theory and preferred reliable resources in order to improve the robustness and predictability of the scheduling policy. Considering that, in a commercialized cloud computing environment, different cloud service providers may not provide true service information because of their own interests, Fard et al. [10] put forward a trusted scheduling mechanism that used an auction mechanism to incentivize cloud service providers to report real prices according to their quality of service. Similarly, selfish strategy dynamics in distributed scheduling, when multiple cloud providers compete, have been explored in [18].
Abbadi and Ruan [1] proposed a credible scheduling strategy that was used to establish the mapping relationship between the virtual machine and the physical resource. To address the potential uncertainty and reliability issues in the scheduling process, Tan et al. [25] measured the trust of cloud services by synthesizing direct trust and indirect trust, and put forward a scheduling algorithm that considers time, cost and trust. Leveraging a multi-source parallel data-retrieval technique, Pandey and Buyya [20] proposed a novel method named enhanced static mapping heuristic (ESMH) for selecting data sources. Rahman et al. [21] proposed a replica selection algorithm based on access latency and bandwidth consumption. A novel hierarchical scheduling algorithm was presented by Mansouri et al. [15] to reduce the search time. To minimize the data access time and avoid unnecessary replication, Sashi and Thanamani [24] presented a Modified Bandwidth Hierarchy Replication (BHR) algorithm based on the standard BHR algorithm. A secure requirement (SRP) prioritized task-scheduling algorithm was proposed by Bhatia et al. [6], which could prevent intruders from modifying the resource requirements of the jobs.
From the data replica selection strategies in the literature above, we can see that their optimization goals mainly focus on time, cost, fairness, security, and so forth. However, because these strategies do not reflect the fuzziness and randomness of security, security is not quantified effectively. To address this issue, we describe the security of data replicas using the cloud model and put forward a security-aware data replica selection strategy for BoT applications.
Formalization and framework of BoT
In this section, we formalize the cloud computing environment, the BoT and the resource set of a task in detail, and design a BoT framework that is used to select an appropriate resource set to execute the BoT.
(Cloud computing environment).
Suppose that there are Q different data centers
(BoT).
Suppose that BoT includes a set of N independent tasks denoted as
(Resource set).
For each task
An example of BoT is depicted in Fig. 1. Assume that task

An example of BoT.
The framework of BoT is shown in Fig. 2. First, the BoT submits the tasks to be performed to the cloud broker. Then, the cloud broker sends the requests of the BoT to the resource set selection module and feeds back to the BoT the information on the resource sets that meet the demand. The resource set selection module is responsible for obtaining resource set information from the cloud resource information registry, and for evaluating and selecting among different resource sets. The evaluation criteria for resource sets are introduced in Section 4, and the algorithm used to select the resource set is discussed in Section 5. The cloud resource information registry collects resource set information from different cloud service providers.

A framework of BoT.
Considering the economic characteristic of cloud computing and its dynamic and heterogeneous nature, the criteria including time, cost and security are considered in the process of resource set selection.
Completion time
Assume that computing resource
In Eq. (2),
Economic cost
Let
Security
Because security is an uncertain concept with fuzziness and randomness, it is difficult to describe. We quantify it with the cloud model proposed by Li et al. [14], which describes the fuzziness and randomness of linguistic concepts and implements the uncertain transformation between a qualitative concept and its quantitative values. To make the ideas in this paper easier to follow, this subsection first introduces the related concepts and then presents the method of describing security using the cloud model.
(Cloud model and cloud drops).
Let A be a qualitative concept defined over a universe of discourse U, and a random instantiation of concept A is denoted as
Then the distribution of x in the universe U is called a cloud model, and x is called a cloud drop.
The cloud model describes qualitative features of an uncertain concept using three numerical characteristics, namely, expectation
There are several kinds of clouds, such as the trapezium cloud, the triangular cloud and the normal cloud. In view of the central limit theorem, we discuss the normal cloud model, which is described in Definition 5. Given three numerical characteristics

Forward normal cloud generator
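As an illustration of the forward generator, the following minimal Python sketch follows the commonly used formulation of the normal cloud model: the entropy is first perturbed by the hyper-entropy, a drop is then sampled around the expectation, and its certainty degree is computed with a Gaussian-shaped membership function. The function name `forward_normal_cloud`, the parameter names `ex`, `en`, `he` (expectation, entropy, hyper-entropy) and the `seed` argument are assumptions introduced here for illustration; this is a sketch, not necessarily the exact procedure of Algorithm 1.

```python
import math
import random


def forward_normal_cloud(ex, en, he, n, seed=None):
    """Generate n cloud drops (x, mu) for a normal cloud with
    expectation ex, entropy en and hyper-entropy he (illustrative names)."""
    rng = random.Random(seed)
    drops = []
    for _ in range(n):
        # Step 1: perturb the entropy; he controls how much it varies.
        en_prime = rng.gauss(en, he)
        # Step 2: sample the drop position around the expectation.
        x = rng.gauss(ex, abs(en_prime))
        # Step 3: certainty degree of x with respect to the concept.
        if en_prime == 0.0:
            mu = 1.0  # degenerate case: x coincides with ex
        else:
            mu = math.exp(-((x - ex) ** 2) / (2.0 * en_prime ** 2))
        drops.append((x, mu))
    return drops


if __name__ == "__main__":
    # Hypothetical security concept centred at 0.8 on a [0, 1] scale.
    for x, mu in forward_normal_cloud(0.8, 0.05, 0.005, 5, seed=1):
        print(f"x={x:.4f}  mu={mu:.4f}")
```

A larger hyper-entropy spreads the drops more widely around the expected membership curve, which is how the model expresses randomness on top of fuzziness.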
Let A be a qualitative concept defined over a universe of discourse U. If

Backward normal cloud generator
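The backward generator can be sketched in Python as follows, assuming the widely used variant that does not require certainty degrees: the expectation is estimated by the sample mean, the entropy by the first-order absolute central moment scaled by sqrt(pi/2), and the hyper-entropy from the gap between the sample variance and the squared entropy. The function name `backward_normal_cloud` and the clamping of a negative gap to zero are assumptions made for this illustration and may differ from Algorithm 2.

```python
import math


def backward_normal_cloud(samples):
    """Estimate (ex, en, he) of a normal cloud from quantitative samples."""
    n = len(samples)
    if n < 2:
        raise ValueError("at least two samples are required")
    # Expectation: sample mean.
    ex = sum(samples) / n
    # Entropy: first-order absolute central moment scaled by sqrt(pi/2).
    abs_moment = sum(abs(x - ex) for x in samples) / n
    en = math.sqrt(math.pi / 2.0) * abs_moment
    # Hyper-entropy: what remains of the variance once en^2 is removed.
    variance = sum((x - ex) ** 2 for x in samples) / (n - 1)
    he = math.sqrt(max(variance - en ** 2, 0.0))  # clamp small negative gaps
    return ex, en, he


if __name__ == "__main__":
    # Hypothetical historical security scores of one storage resource.
    scores = [0.70, 0.75, 0.80, 0.85, 0.90]
    print(backward_normal_cloud(scores))
```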

Three numerical characteristics of the normal cloud model
Security level, concepts and range
In Algorithm 3, considering the lower limit and upper limit of the interval

Standard security cloud generator
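One common way to turn the range of a security level into a standard cloud is sketched below in Python. It assumes the usual bilateral-constraint rule: the expectation is the midpoint of the interval, the entropy is one sixth of its width (so that almost all drops fall inside the interval), and the hyper-entropy is a small constant. This rule, the function name `interval_to_cloud`, the default `he=0.01` and the level ranges in the usage example are all assumptions for illustration; the details of Algorithm 3 may differ.

```python
def interval_to_cloud(lower, upper, he=0.01):
    """Map a security-level interval [lower, upper] to (ex, en, he)."""
    if upper <= lower:
        raise ValueError("upper must be greater than lower")
    ex = (lower + upper) / 2.0  # midpoint of the interval
    en = (upper - lower) / 6.0  # 3*en on each side spans the interval
    return ex, en, he


if __name__ == "__main__":
    # Hypothetical security levels and ranges, for illustration only.
    levels = {"low": (0.0, 0.3), "medium": (0.3, 0.6), "high": (0.6, 0.9)}
    for name, (lo, hi) in levels.items():
        print(name, interval_to_cloud(lo, hi))
```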
There are P available storage resources denoted as
For

Calculate the score of synthetic cloud
To evaluate the multi-objective utility function covering multi-dimensional QoS, we use a multiple attribute decision making approach, which maps each QoS attribute to a value between 0 and 1 by comparing it with the minimum and maximum possible values according to the available QoS information of all scheduling schemes. There are two kinds of QoS attributes: positive attributes, whose values are to be maximized, and negative attributes, whose values are to be minimized [5]. They are calculated according to Eqs (8) and (9), respectively.
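The mapping of QoS attributes into the interval [0, 1] can be sketched in Python as follows, assuming the usual min-max form of such a normalization. The function names and the convention of returning 1 when the maximum equals the minimum are assumptions for illustration and may not match Eqs (8) and (9) exactly.

```python
def normalize_positive(value, minimum, maximum):
    """Scale a positive attribute (larger is better, e.g. security)."""
    if maximum == minimum:
        return 1.0  # all schemes are equal on this attribute
    return (value - minimum) / (maximum - minimum)


def normalize_negative(value, minimum, maximum):
    """Scale a negative attribute (smaller is better, e.g. time or cost)."""
    if maximum == minimum:
        return 1.0  # all schemes are equal on this attribute
    return (maximum - value) / (maximum - minimum)


if __name__ == "__main__":
    # Hypothetical completion times of three scheduling schemes.
    times = [10.0, 30.0, 50.0]
    print([normalize_negative(t, min(times), max(times)) for t in times])
```

After normalization, a value of 1 always denotes the most desirable scheme on that attribute, so positive and negative attributes can be combined in a single utility function.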
In Eq. (10),
The heuristic algorithm for BoT scheduling
A heuristic algorithm based on the utility function is presented for solving the problem of BoT scheduling, which is given in Algorithm 5.

A heuristic algorithm for BoT scheduling
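To illustrate the kind of greedy selection that a weighted set covering formulation suggests, the following Python sketch applies the classical greedy heuristic for weighted set cover to a single task: it repeatedly picks the storage resource with the lowest weight per newly covered data set. This is a generic textbook heuristic, not a reproduction of Algorithm 5; the function name, the data layout and the weights in the usage example are hypothetical, and the weight here simply stands in for a combined utility value in which lower is better.

```python
def greedy_weighted_set_cover(required, candidates):
    """Select storage resources covering all required data sets.

    required:   set of data set identifiers needed by one task
    candidates: dict mapping resource name -> (set of data sets, weight)
    Returns the chosen resource names in selection order.
    """
    uncovered = set(required)
    chosen = []
    while uncovered:
        best_name, best_ratio = None, float("inf")
        for name, (datasets, weight) in candidates.items():
            gain = len(uncovered & datasets)
            if gain == 0:
                continue  # this resource adds nothing new
            ratio = weight / gain  # weight per newly covered data set
            if ratio < best_ratio:
                best_name, best_ratio = name, ratio
        if best_name is None:
            raise ValueError("some required data sets have no replica")
        chosen.append(best_name)
        uncovered -= candidates[best_name][0]
    return chosen


if __name__ == "__main__":
    # Hypothetical replicas and weights for one task.
    candidates = {
        "s1": ({"d1", "d2"}, 4.0),
        "s2": ({"d2", "d3"}, 3.0),
        "s3": ({"d1"}, 1.0),
        "s4": ({"d3"}, 5.0),
    }
    print(greedy_weighted_set_cover({"d1", "d2", "d3"}, candidates))
```

Because each step commits to the locally best resource, the result is not guaranteed to be globally optimal, which is the usual trade-off of greedy heuristics for set covering.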
In Algorithm 5, the trust score of each storage resource is calculated, and the minimum utility value
Experimental settings
To evaluate the proposed algorithm, we develop an extended simulation platform based on CloudSim [9]. The data used in the experiment are all generated randomly. The number of data centers is initialized between 5 and 15, with network bandwidth from 1 to 2 GB/s. The computing capacity of computing resources is set from 1000 to 1500 MIPS (Million Instructions Per Second), the security cloud’s digital characteristic
Performance analysis
We adopt makespan, cost, security utility and LBD for performance evaluation. Makespan is the time of completing all the tasks of the BoT application and can be computed by Eq. (11). The cost is the sum of the costs of completing all tasks and can be computed by Eq. (12). Security utility function
Assume that there are N tasks in the BoT; then the security utility value of the BoT is calculated according to Eq. (13). The higher the security utility value, the better the security performance.
LBD, calculated by Eq. (14), is used to measure how fairly the computing resources are utilized. The smaller the value of LBD, the better the load balance of the computing resources.
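As a rough illustration of a load balancing measure of this kind, the Python sketch below computes the population standard deviation of the load assigned to each computing resource. Treating LBD as a standard deviation is an assumption made here for illustration, as are the function name and the notion of load used in the example; the exact definition is the one given by Eq. (14).

```python
import math


def load_balancing_deviation(loads):
    """Population standard deviation of per-resource load."""
    n = len(loads)
    if n == 0:
        raise ValueError("at least one computing resource is required")
    mean = sum(loads) / n
    return math.sqrt(sum((load - mean) ** 2 for load in loads) / n)


if __name__ == "__main__":
    # Hypothetical busy times of four computing resources.
    print(load_balancing_deviation([10.0, 10.0, 10.0, 10.0]))  # balanced
    print(load_balancing_deviation([2.0, 18.0, 5.0, 15.0]))    # unbalanced
```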
In Eq. (14),
In the experiment, we compare the security-aware data replica selection (SDRS) strategy proposed in this paper with SCP in [27] and ESMH in [20] using the performance indicators of time, cost, security utility and LBD. Fifty BoT applications are generated according to Section 6.1, and the average performance indicators of the algorithms are computed. The experimental results are shown in Figs 4, 5, 6 and 7, respectively. It can be seen that the makespan and cost produced by SDRS are lower than those produced by SCP and ESMH, while the security utility value produced by SDRS is higher. Meanwhile, the LBD of SDRS grows more slowly than that of SCP and ESMH. These results verify the validity and feasibility of SDRS.

The time of BoT.

The cost of BoT.

The security utility value of BoT.

The LBD of BoT.
For data-intensive applications based on the cloud computing architecture, we choose BoT as the research subject in this paper, describe the security of data replicas using the cloud model, and propose a security-aware data replica selection algorithm for BoT scheduling. The simulation experiments show that the proposed selection strategy performs well in terms of time, cost, security utility value and LBD. Because the security-aware heuristic algorithm is a greedy algorithm, it can easily fall into a local optimum; finding effective methods to overcome this problem is left for future work.
