A new popularity-based data replication strategy in cloud systems

Abstract

Data-intensive cloud computing systems are growing year by year due to the increasing volume of data. In this context, data replication is frequently used to ensure Quality of Service (QoS), e.g., performance. However, most existing data replication strategies simply create the same number of replicas on some nodes, which is not sufficient to guarantee such objectives. To address this problem, we propose a new data Replication and Placement strategy based on the popularity of User Requests Group (RPURG). It aims to reduce the tenant response time and maximize the benefit for the cloud provider while satisfying the Service Level Agreement (SLA). We demonstrate the validity of our strategy in a performance evaluation study. The experimental results show the robustness of RPURG.

Keywords

Data management, cloud systems, data replication, SLA, cost model, business model, performance

1. Introduction

Over the last decade, cloud computing has become a common solution in both government and industry for storing and processing large-scale data. As the basic building block of the cloud, data centers can provide nearly unlimited computing and storage capacity cost-effectively to meet the needs of users, i.e., tenants. Recently, the advent of very large cloud-based systems has increased the demand for vast data storage facilities in cloud services. Furthermore, vast volumes of data are being processed in data centers since more and more cloud-based systems are data-intensive.

Modern distributed storage systems typically use data replication to ensure high data availability, performance and fault tolerance [32]. Replicating a data file improves user queries, which can then be served from the closest location. Creating replicas of the data files that are accessed most frequently is particularly useful. The next question is where the new replicas should be placed while guaranteeing consistency. Indeed, if replicas are placed on randomly chosen nodes, system performance does not improve [35].

To cope with the heterogeneity of workloads in cloud systems, a number of data replication strategies have been proposed to maintain the required levels of satisfaction between users and providers [36]. In this context, the provider and its clients conclude utility-related Service Level Agreements (SLAs) to assess costs and fees depending on the performance levels achieved. On the other hand, the service provider needs to control its resources in order to maximize its profit. Utility-based management strategies are widely used to manage load and achieve the best trade-off between QoS levels across job levels [27, 28].

In this paper, we propose a new data Replication and Placement strategy based on the Popularity of User Requests Group (RPURG) for heterogeneous cloud systems. It aims to minimize replication costs. The strategy relies on the access frequency of the latest group of client requests (popularity group), the user budget and the quality of service (QoS) defined in the SLA contract. We formulate this as a knapsack problem and solve it in such a way that the system's flexibility and QoS are maintained at the best possible levels. Regarding the consistency issue [25], the last updated data are propagated to all replicas, i.e., asynchronous replication [41, 42]. A deeper treatment of this issue is deferred to future work. The proposed strategy addresses four questions:

(i)
Which data are replicated? To classify the data involved, we focus on data popularity, an important parameter that most replication strategies take into account. We consider the data popularity of a group of user requests. The number of requests for each data item is obtained from the data request matrix in the registry.
(ii)
When are data replicated? Replication is triggered on the basis of a response time threshold, so that the response time objective of each tenant is satisfied according to the tenant's budget. In this way, we avoid penalties paid by the provider to its tenants.
(iii)
Where are new replicas placed? We consider network bandwidth (NB) locality [28]: the NB between data centers (DCs) is low, whereas the NB between nodes of the same DC is high. We also consider parameters such as the replication cost in each data center and the latency for users, i.e., a low response time threshold for users with a large budget. The replication cost should also be smaller than the tenant's budget. Consequently, replicas are placed according to the replication cost, which includes the data transfer cost. In addition, we take the load balancing issue into account [38].
(iv)
How many replicas are created? The number of replicas to be created or deleted is determined periodically and dynamically. It depends on the provider's profit when executing a group of requests. This permits elastic management of replication [26].

The rest of this paper is organized as follows: Section 2 reviews related work. Section 3 describes the details of the proposed strategy. Section 4 evaluates the performance of the proposed strategy. Section 5 concludes the paper and outlines possible future work.
2. Literature review

Many data replication strategies have been proposed in the literature. Most surveys of replication work [28] classify these strategies in cloud systems into static [34] vs. dynamic [8] strategies, while other work classifies them into provider-centric [28] vs. consumer-centric strategies [29, 37]. In static strategies, each generated replica is kept in the same location or removed manually by the user after a long service period. Dynamic strategies, in contrast, create and remove replicas according to changes in the cloud environment. RPURG is a dynamic strategy since the replicas of each data set are created, placed and maintained dynamically according to the system workload and the tenant budget. We also follow the provider-centric approach, which attempts to ensure the provider's profit while satisfying the tenant's service level objectives (SLOs). Strategies can also be characterized by the metric (or group of metrics) they try to manage: some apply a cost minimization policy for the provider, while others aim to ensure quality of service for tenants. Figure 1 proposes a rough classification of replication strategies.

Figure 1.

Taxonomy of replication strategies.

In what follows, we describe some strategies that aim to satisfy different objectives for the tenants. Most of them aim to satisfy a single tenant objective such as availability [8], energy consumption [43], performance [9] and fault tolerance [40]. Only some strategies aim to simultaneously meet several tenant objectives. On the other hand and as noted below, most of these strategies neglect the cost of replication for the provider.

Lin et al. [7] suggested two algorithms for QoS-aware data replication (QADR) in cloud computing systems. The first algorithm adopts the basic concept of high-QoS first-replication (HQFR) to perform data replication. However, this greedy algorithm cannot minimize the data replication cost and the number of QoS-violated data replicas. To achieve these two minimization goals, the second algorithm transforms the QADR problem into the well-known minimum-cost maximum-flow (MCMF) problem. The HQFR algorithm is scalable but has a high time complexity.

Wei et al. [8] proposed cost-effective dynamic replication management (CDRM). It primarily seeks a cost-effective balance between availability and load balancing in cloud storage. A novel model is introduced to capture the relationship between replica number and availability. CDRM uses this model to compute and maintain the minimum number of replicas for a given availability requirement. Replica placement depends on the capacity and blocking probability of the data nodes. By adjusting the number of replicas, CDRM redistributes the workload among data nodes in a cost-effective way. However, CDRM has low reliability and high energy consumption.

Tos et al. [9] proposed the performance and profit-oriented data replication (PEPR) approach, which offers SLA guarantees such as availability and performance while increasing the cloud provider's economic benefit. PEPR creates replicas where the average response time is greater than the SLO response time threshold. Replication is triggered only if both the response time objective and the provider's profit are satisfied. On the other hand, PEPR places replicas on the least loaded node, not necessarily in the same sub-region as the tenants. Replicas are therefore not close enough to the tenants, which increases the transfer time. The response time is consequently also high.

Gill et al. [10] proposed a dynamic, cost-aware, optimized data replication approach that determines the minimum number of replicas needed to achieve the required availability. The knapsack principle is used to optimize replication costs and to move replicas from higher-cost data centers to lower-cost ones without affecting data availability. However, this strategy has several disadvantages, such as low consistency rates, poor load balancing and high response time.

The authors in [11] proposed a model to analyze real-world replication work processes. They identify three new techniques to optimize the use of storage space during replica creation, and two new QoS-aware greedy algorithms for optimizing replica placement. However, load balancing is not considered.

The strategy proposed in [12] aims to decrease the latency of access to the file server and increase the availability of data while optimizing the infrastructure through load balancing. By juggling these optimization goals, the accurate and enhanced multi-objective integrated replication management (EIMORM) tries to find a trade-off between the different objectives. However, the replication cost is neglected.

Mansouri et al. [13] suggested a dynamic, popularity-based data replication strategy (DPRS) that uses data access information on the cloud platform. Following the 80/20 principle, DPRS replicates only the small portion of data files that is requested most often. It determines the site to which a file is copied based on the free storage space, the number of requests and the site centrality. To improve overall performance by replicating and accessing the data in parallel, the paper also proposes a parallel download strategy. A weakness of the strategy is that it neglects the impact of the file access pattern on the replica decision.

To achieve high data durability, PMCR [14] separates the cloud servers into a primary tier and a backup tier, and uses data popularity to categorize data into hot, warm and cold data. PMCR keeps three replicas of the same data in one copy-set, formed by two servers in the primary tier and one server in the backup tier, to handle both correlated and independent failures. PMCR uses aggregation techniques to minimize computing and latency costs for the third replica of hot data and cold data in the backup tier. However, response time is neglected.

The strategy of Sun et al. [15], DARS, is based on the principle that overloaded nodes create replicas that are stored on other nodes to relieve their workload. Although it is a decentralized solution, it reduces the number of overloaded nodes without sacrificing access latency. Maintaining replicas based on the load state of the nodes is one of the key features of DARS. However, it takes neither bandwidth nor energy consumption into consideration.

The authors in [16] use the file access history. They determine the correlation between data files and prefetch the most popular files, so that the next time a file is required at a site, it is available locally. In addition, the replica replacement technique plays a crucial role because storage capacity is limited. In the PDR strategy, the value of each replica is computed by a fuzzy inference system with four input parameters (number of accesses, replication cost, last access time of the replica, and data availability). On the other hand, load balancing is neglected.

Casas et al. [17] designed the Balanced and file Reuse-Replication Scheduling (BaRRS) algorithm for cloud computing. This algorithm splits a scientific workflow into several sub-workflows to balance system utilization through parallelization. It exploits data reuse and replication techniques to optimize the amount of data transferred. It takes into account task execution time, task dependency patterns and file sizes to adapt the data reuse and replication strategies. Finally, it chooses the best solution on the basis of monetary cost and execution time; however, finding an optimal solution is a complex task.

Zhao et al. [18] suggested an enhanced dynamic replica creation strategy based on file heat and node load. The node load is taken into account and the number of replicas is adjusted using the average heat and the average load. Zeng et al. [11] proposed three new methods for maximizing storage space use during replica creation and two new QoS-aware greedy algorithms for optimizing replica placement. A more uniformly distributed replica layout of the data set can be achieved by using a circular method during the replica creation process. However, energy and consistency are neglected in this work.

Dai et al. [20] addressed I/O load balancing, a significant approach for optimizing the I/O efficiency of distributed storage systems. The data placement algorithm greatly influences the degree of I/O load balancing. It is very difficult to develop a data placement algorithm that guarantees optimal I/O load balancing, as data popularity follows a highly skewed distribution [39].

Elango et al. [21] combined a replication algorithm with a task scheduling strategy for data replication in cloud storage by tracking all work processes. In the large database of a node, data replication is driven by finding frequently used data patterns, which is done with a frequent pattern mining algorithm based on a fuzzy FP-tree.

To formulate the data replica placement problem, Cui et al. [22] built a tripartite graph and suggested a data placement technique based on a genetic algorithm (GA) for scientific workflows to minimize the volume and number of data movements in cloud environments. However, private datasets in the scientific workflow were neglected by this work.

For secure cloud storage, Xue et al. [23] suggested a verifiable data transfer protocol based on provable data possession and deletion. The scheme supports all of these operations. The data owner can transfer the outsourced data from one cloud to another without downloading the entire data from the old cloud, and can verify the integrity of the data on the new cloud. The data owner does not need to worry about the deletion of the data in the original cloud, since the cloud produces a deletion proof for the data. Users can then verify that the data were correctly migrated and actually removed from the original server.

Ma and Yang [24] address the conventional data replication problem by exploring an in-memory data cache solution. More specifically, the authors propose a live data replication method for in-memory document stores using a stream processing system. The published experimental findings indicate that, relative to MapReduce-based batch replication, the proposed method is more effective for replicating continuously changing in-stream data.

Mansouri et al. [2] described mathematical models for five objectives, namely mean service time, load variance, storage consumption, failure risk and latency, taking the efficiency of each data node into consideration. Replicas are placed across data nodes according to these five objectives. They also present a replica replacement policy that takes the following parameters into consideration: file availability, the last time the replica was requested, the number of accesses and the replication calculation. ADRS promotes the transitory placement and the introduction of data-based options such as database transparency to determine the victim replica. They neglected the trade-off between properties, e.g., the availability and cost of different services.

Table 1

Comparison between existing replication strategies in the literature

Author/strategy	Cost	Network bandwidth	Response time	Consistency	Load balancing	Energy
Lin et al. [7]	$+$	$+$	$-$	$-$	$+$	$-$
Wei et al. [8]	$+$	$+$	$-$	$-$	$-$	$-$
Tos et al. [9]	$+$	$+$	$+$	$-$	$-$	$-$
Gill et al. [10]	$+$	$+$	$-$	$-$	$-$	$-$
Zeng et al. [11]	$+$	$+$	$+$	$-$	$-$	$-$
Edwin et al. [12]	$+$	$-$	$-$	$-$	$+$	$+$
Mansouri et al. [13]	$+$	$+$	$+$	$-$	$+$	$-$
Liu et al. [14]	$+$	$+$	$-$	$+$	$+$	$-$
Sun et al. [15]	$+$	$-$	$+$	$-$	$+$	$-$
Mansouri et al. [16]	$+$	$+$	$+$	$-$	$-$	$-$
Casas et al. [17]	$-$	$+$	$+$	$-$	$+$	$-$
Zhao et al. [18]	$+$	$-$	$+$	$-$	$+$	$-$
Dai et al. [20]	$+$	$+$	$+$	$-$	$+$	$-$
Elango et al. [21]	$-$	$-$	$+$	$-$	$-$	$-$
Cui et al. [22]	$+$	$+$	$+$	$-$	$+$	$-$
Xue et al. [23]	$+$	$+$	$+$	$-$	$-$	$-$
Ma et al. [24]	$+$	$-$	$+$	$-$	$-$	$-$
Ebadi et al. [6]	$+$	$+$	$+$	$-$	$-$	$-$
Jayasree et al. [5]	$-$	$+$	$+$	$-$	$-$	$-$
Khalili et al. [4]	$+$	$+$	$+$	$-$	$-$	$+$
Salem et al. [3]	$-$	$+$	$+$	$-$	$+$	$-$
Mansouri et al. [2]	$+$	$+$	$+$	$-$	$+$	$+$
Séguéla et al. [43]	$+$	$+$	$+$	$-$	$-$	$+$

Table 1 compares the replication strategies found in the literature with respect to the following metrics: cost, network bandwidth, response time, consistency, load balancing and energy.

A $+$ indicates that the parameter is taken into account in the approach, while a $-$ indicates that it is not.

For example, Lin et al. [7] took cost, network bandwidth and load balancing into consideration but neglected response time, energy and consistency.

On the other hand, Elango et al. [21] focused on response time and neglected all other parameters.

3. A new data replication placement strategy

Each client, i.e., customer, has a clear view of the SLA constraints and negotiates them adaptively with the provider in the cloud system. Making resources available to the clients of a cloud infrastructure raises various difficulties and challenges.

The main goal of the structured resource distribution strategy is to follow the constraints of Quality of Service (QoS) and increase the cloud provider’s benefit.

Cloud service providers still follow the same traditional methods and auction processes to allocate services. These strategies are not very successful because of the diversity of services and their characteristics. Therefore, the cloud service provider should have an appropriate and personalized resource allocation strategy that respects the different aspects agreed with the clients, such as the defined penalty cost, time and other SLA criteria.

Figure 2.

The proposed topology for our strategy.

For this reason, we propose a new data Replication and Placement strategy based on the Popularity of User Requests Group (RPURG). It relies on the elasticity of replicas, driven by the access frequency of the latest group of client requests (popularity group), the user budget and the quality of service for users. Its goals are to reduce client response time, reduce cost and maximize the benefit for the cloud provider, while respecting the SLA conditions or at least minimizing the number of constraints violated between clients and provider.

3.1 Architecture of RPURG strategy

Figure 2 shows the topology adopted by RPURG. We distinguish three essential parts in the proposed architecture: (i) the users, who send requests and receive responses, (ii) the data centers, which store all the data and information needed, and (iii) the management modules (Solver, Elasticity-alert, Repairer and SLA-violation-alert). Each module plays a specific role in the proposed strategy.

3.2 Model system

In our strategy, we use $N$ heterogeneous data centers, the Solver, Elasticity-alert, Repairer and SLA-violation-alert modules, and six data matrices. Each of these elements is detailed later in this section.

Let $\textit{DC}=\{\textit{DC}_{1},\textit{DC}_{2},\ldots,\textit{DC}_{N}\}$ be the set of data centers, indexed by $i$.

Each $\textit{DC}_{i}$ has a set of nodes $N_{i}=\{N_{i.1},N_{i.2},\ldots,N_{i.K}\}$, indexed by $k$.

Each node $N_{i.k}$ of a data center stores a set of files $F_{i.k}=\{F_{i.k.1},F_{i.k.2},\ldots,F_{i.k.M}\}$, indexed by $j$.
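As a rough illustration of the system model above, the hierarchy of data centers, nodes and files could be represented as follows. The class and field names (`DataCenter`, `Node`, `free_storage_gb`, `bandwidth`) and the toy values are assumptions made for this sketch; they are not taken from the RPURG implementation.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """Node N_{i.k}: a host with free storage and a set of stored files."""
    free_storage_gb: float                        # assumed unit: GB
    bandwidth: float                              # assumed unit: Gbit/s
    files: set[int] = field(default_factory=set)  # indices j of files F_{i.k.j}


@dataclass
class DataCenter:
    """Data center DC_i: a collection of nodes."""
    nodes: list[Node] = field(default_factory=list)

    def free_storage(self) -> float:
        """Total free storage over all nodes of this data center."""
        return sum(node.free_storage_gb for node in self.nodes)


# A toy cloud with N = 2 data centers (values are illustrative only).
cloud = [
    DataCenter([Node(20.0, 40.0, {0, 1}), Node(20.0, 40.0, {2})]),
    DataCenter([Node(10.0, 10.0, {0})]),
]
```

In this toy setup, file 0 is stored in both data centers, which corresponds to a dataset with two replicas.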

Figure 3.

Proposed architecture.

3.3 Modules description

As previously mentioned, our architecture contains several modules: Data Matrices, Solver, Repairer, Elasticity-alert and SLA-violation-alert, described as follows.

3.3.1 Data matrices

We identify six matrices:

Matrix of popularity $P_{ikj}$: it refers to the access frequency of each replica.

Matrix of capacity-node $S_{ik}$: it refers to the available storage of each host (node).

Matrix of size-dataset $V_{j}$: it refers to the storage space of each dataset.

Matrix of threshold response time $Tdt_{u,j}$: it gives the minimum response time for each client.

Matrix of threshold maximum budget $Tdb_{u,j}$: it gives the maximum budget for each client.

Matrix of bandwidth network $\textit{BN}_{ik}$: it refers to the average input/output network bandwidth of each node.

3.3.2 Solver

Since our solution aims to determine the number of replicas needed to satisfy the tenant's objectives while ensuring a profit for the cloud provider, we formulate the problem mathematically as follows:

$\displaystyle\text{Minimize}\quad\sum_{i=1}^{n}\sum_{j=1}^{m}C_{ij}x_{ij}$

$\displaystyle\text{subject to}\quad\sum_{j=1}^{m}v_{j}x_{ij}\leqslant Cap_{i},\quad i=1,\ldots,n$

$\displaystyle\sum_{i=1}^{n}a_{j}x_{ij}\geqslant Tdt_{u,j},\quad j=1,\ldots,m,\quad u=1,\ldots,r$

$\displaystyle x_{ij}\in\mathbb{N},\quad i=1,\ldots,n,\quad j=1,\ldots,m$ (1)

where:

$C_{ij}$: cost of replicating dataset $j$ and allocating space for it in data center $i$ ( $F_{i.k.j}$ ).

$x_{ij}$ (written $X_{ij}$ in the algorithms): number of replicas of dataset $j$ in data center $i$ ( $F_{i.k.j}$ ).

$v_{j}$ (entry $V_{j}$ of the size-dataset matrix): storage space (file size) of dataset $j$ ( $F_{i.k.j}$ ).

$Cap_{i}$: storage capacity of $\textit{DC}_{i}$, where

$\displaystyle Cap_{i}=\sum_{k=1}^{K}S_{ik}$ (2)

$a_{j}$: the importance coefficient of dataset $j$ ( $F_{i.k.j}$ ), where

$\displaystyle a_{j}=\frac{1}{\sum_{i=1}^{n}\sum_{k=1}^{K}P_{ikj}}$ (3)
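To make Eqs. (1) to (3) concrete, the following sketch evaluates them on a tiny instance. It is only an illustration under stated assumptions: the paper solves the program with the simplex method, whereas this sketch swaps in an exhaustive enumeration (practical only for a handful of variables); it uses a single threshold per dataset instead of one per user $u$; and the function names and all numeric values are hypothetical.

```python
from itertools import product


def capacity(S):
    """Eq. (2): Cap_i is the sum of the node capacities S[i][k] of DC i."""
    return [sum(row) for row in S]


def importance(P, j):
    """Eq. (3): a_j = 1 / (sum over i and k of P[i][k][j])."""
    total = sum(P[i][k][j] for i in range(len(P)) for k in range(len(P[i])))
    return 1.0 / total


def solve_brute_force(C, v, cap, a, tdt, max_replicas=3):
    """Enumerate integer x[i][j] in 0..max_replicas and keep the cheapest
    assignment that satisfies both constraints of Eq. (1)."""
    n, m = len(C), len(v)
    best_cost, best_x = None, None
    for flat in product(range(max_replicas + 1), repeat=n * m):
        x = [list(flat[i * m:(i + 1) * m]) for i in range(n)]
        # Storage constraint: sum_j v_j * x_ij <= Cap_i for every data center.
        if any(sum(v[j] * x[i][j] for j in range(m)) > cap[i] for i in range(n)):
            continue
        # Threshold constraint: sum_i a_j * x_ij >= Tdt_j for every dataset.
        if any(sum(a[j] * x[i][j] for i in range(n)) < tdt[j] for j in range(m)):
            continue
        cost = sum(C[i][j] * x[i][j] for i in range(n) for j in range(m))
        if best_cost is None or cost < best_cost:
            best_cost, best_x = cost, x
    return best_cost, best_x


# Toy instance: 2 data centers, 2 datasets (all values are illustrative).
S = [[20, 20], [10]]                               # node capacities (GB)
P = [[[0.25, 0.125], [0.25, 0.0]], [[0.0, 0.375]]]  # popularity P[i][k][j]
C = [[3, 4], [1, 1]]                               # replication cost C[i][j]
v = [8, 8]                                         # dataset sizes (GB)
tdt = [4, 2]                                       # assumed per-dataset thresholds

cap = capacity(S)
a = [importance(P, j) for j in range(len(v))]
cost, x = solve_brute_force(C, v, cap, a, tdt)
```

On this instance the cheaper data center can hold only one replica, so the enumeration places a single replica there and the remaining ones in the more expensive data center. For realistic problem sizes, an integer programming solver would replace the enumeration.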

3.3.3 Repairer

The system takes consistency into account: the most recently modified copy of a file (according to its modification time) is taken as the reference, and all replicas of the same data that hold an outdated value are updated from it.
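The behavior of the Repairer can be sketched as a simple "latest modification wins" propagation. The `Replica` class, its fields and the node identifiers are hypothetical names introduced for this sketch, and the paper does not specify how concurrent updates with identical timestamps are resolved.

```python
from dataclasses import dataclass


@dataclass
class Replica:
    node: str            # hypothetical node identifier
    value: bytes         # content held by this copy
    modified_at: float   # last modification time (seconds)


def repair(replicas):
    """Propagate the most recently modified value to every stale replica.

    Returns the number of replicas that were updated.
    """
    latest = max(replicas, key=lambda r: r.modified_at)
    updated = 0
    for r in replicas:
        if r.value != latest.value:
            r.value = latest.value
            r.modified_at = latest.modified_at
            updated += 1
    return updated


copies = [
    Replica("N1.1", b"v1", 100.0),
    Replica("N2.1", b"v2", 250.0),  # most recent write
    Replica("N3.4", b"v1", 100.0),
]
stale = repair(copies)
```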

3.3.4 Elasticity-alert

Resources can be increased or decreased according to the popularity of each data item. A data item $F_{ikj}$ is replicated if its popularity (its access frequency) is greater than a given threshold $T_{\text{replication}}$, and it is deleted if its popularity is lower than a given threshold $T_{\text{erasure}}$.

The popularity of each file is calculated by the following simple equation:

$\displaystyle P_{ikj}=\frac{\text{number of requests for }F_{ikj}}{\text{total number of requests}}$ (4)

where the $P_{ikj}$ parameter is stored in the popularity matrix.
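A minimal sketch of the popularity computation of Eq. (4) and of the two-threshold elasticity rule is given below. The threshold values, the request counts and the function names are assumptions chosen for illustration; the paper does not report the values of $T_{\text{replication}}$ and $T_{\text{erasure}}$.

```python
def popularity(request_counts):
    """Eq. (4): share of all requests that target each replica.

    request_counts maps a replica key (i, k, j) to its number of requests.
    """
    total = sum(request_counts.values())
    if total == 0:
        return {key: 0.0 for key in request_counts}
    return {key: count / total for key, count in request_counts.items()}


def elasticity_alert(P, t_replication, t_erasure):
    """Classify each replica as 'replicate', 'erase' or 'keep'."""
    decisions = {}
    for key, p in P.items():
        if p > t_replication:
            decisions[key] = "replicate"
        elif p < t_erasure:
            decisions[key] = "erase"
        else:
            decisions[key] = "keep"
    return decisions


# Hypothetical request counts for four replicas F_{ikj}.
counts = {(1, 1, 1): 60, (1, 2, 2): 30, (2, 1, 3): 8, (2, 1, 4): 2}
P = popularity(counts)
decisions = elasticity_alert(P, t_replication=0.5, t_erasure=0.05)
```

Replicas whose popularity lies between the two thresholds are left untouched, which avoids creating and deleting the same replica in quick succession.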

3.3.5 SLA-violation-alert

The SLA-violation-alert module uses parameters such as the response time, the budget of each client and the minimum response time as constraints in a cost-minimization algorithm. In case of a violation for an important client (very high budget), the replication mechanism triggers an increase in resources to avoid a large penalty. We also take bandwidth into account as a parameter to decrease the response time.

3.4 RPURG strategy

In cloud computing, there are different sites and heterogeneous data centers that contain various data files, which makes it difficult to select the file to replicate. Our strategy therefore has to answer the following questions:

Algorithm: Replica selection

Input: $\text{Budget}_{\text{opt}}=0.4$ // the best threshold, which defines the best clients

Input: $\text{Cost}_{\text{penalty}}=50$ // penalty cost fixed at 50% of the client budget

Input: $P_{ikj}$, $T_{\text{replication}}$, $Tdb_{u,j}$, $\text{slaviolation}_{u,j}$

Output: Replica selected

for $u=1$ to $R$ do // each user

if $\text{slaviolation}_{u,j}=$ true and $Tdb_{u}\geqslant\text{Budget}_{\text{opt}}$ then Replica selected;

if $\text{slaviolation}_{u,j}=$ true and $Tdb_{u}<\text{Budget}_{\text{opt}}$ then penalty; // the penalty is $Tdb_{u,j}*\text{Cost}_{\text{penalty}}$

end for

for $i=1$ to $N$ do

for $k=1$ to $K$ do

for $j=1$ to $M$ do

if $P_{ikj}>T_{\text{replication}}$ then Replica selected; else Replica maybe erased;

end for

end for

end for

Question 1: Which data are replicated? We consider two cases. First, we replicate the data that receive the majority of requests, i.e., when the popularity of the file is above the replication threshold ( $P_{ikj}>T_{\text{replication}}$ ).

Second, we replicate when the SLA-violation alert is triggered for a high-budget client ( $Tdb_{u,j}\geqslant\text{Budget}_{\text{opt}}$ ).
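The replica selection procedure can be sketched in Python as follows. This is an illustrative reading of the pseudocode, under several assumptions: budgets are normalized to the range 0 to 1 so that they can be compared with the 0.4 threshold, the penalty is 50% of the client's budget, and selection is tracked per dataset index $j$. The function name and the example data are hypothetical.

```python
BUDGET_OPT = 0.4     # budget threshold that identifies high-budget clients
COST_PENALTY = 0.5   # penalty fixed at 50% of the client's budget


def select_replicas(P, t_replication, budgets, sla_violations):
    """Return (selected datasets, erasure candidates, total penalty).

    P maps (i, k, j) to popularity; budgets maps user u to a normalized
    budget; sla_violations is a set of (u, j) pairs flagged by the
    SLA-violation-alert module.
    """
    selected, penalty = set(), 0.0
    for (u, j) in sorted(sla_violations):
        if budgets[u] >= BUDGET_OPT:
            selected.add(j)                       # replicate for a high-budget client
        else:
            penalty += budgets[u] * COST_PENALTY  # pay the penalty instead
    maybe_erased = set()
    for (i, k, j), p in P.items():
        if p > t_replication:
            selected.add(j)                       # popular data are replicated
        else:
            maybe_erased.add((i, k, j))           # candidate for later erasure
    return selected, maybe_erased, penalty


P = {(1, 1, 1): 0.6, (1, 2, 2): 0.3, (2, 1, 3): 0.1}
budgets = {"u1": 0.8, "u2": 0.2}
violations = {("u1", 1), ("u2", 3)}
selected, maybe_erased, penalty = select_replicas(P, 0.5, budgets, violations)
```

Here the violation reported for the low-budget client `u2` leads to a penalty rather than a new replica, which reflects the trade-off between tenant satisfaction and provider profit described above.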

Algorithm: Replica number

Input: $C_{ij}$, $V_{j}$, $\textit{Cap}_{i}$, $a_{j}$, $\textit{Tdb}_{u,j}$, Replica selected

Output: $X_{ij}$

Simplex resolution

Get $X_{ij}$ // optimal number of resources

for all replica selected do

$X_{ij}++$ // increase number of resources

end for

Question 2: How many replicas are created? To answer this question, we use the simplex method, which gives the optimal number of replicas that satisfies the constraints imposed in Eq. (1). The number of replicas is then increased until the replication threshold of the Elasticity module is reached. The process of creating and deleting replicas is described in the replica placement and replica deleting algorithms.

Algorithm: Replica placement

Input: $X_{ij}$ // optimal replica number

Input: $\textit{EX}_{ij}$ // current replica number

Input: $S_{ik}$, $V_{j}$, $\textit{BN}_{ik}$

for $i=1$ to $N$ do

for $j=1$ to $M$ do

while $\textit{EX}_{ij}<X_{ij}$ do

from available Node${}_{k}$ // $V_{j}<S_{ik}$

select $\min\{C_{ij}/\textit{BN}_{ik}\}$ // compromise between reduced storage cost and a high bandwidth

Create Replica;

$S_{ik}=S_{ik}-V_{j}$;

$\textit{EX}_{ij}++$;

end while

end for

end for

Question 3: Where are replicas created? Among the data centers that have enough storage space, we choose those with a low storage cost and a high bandwidth. In most cases, we must choose those that provide a compromise between these two criteria.
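The placement rule can be illustrated with the sketch below, which creates replicas on the node with the lowest cost-to-bandwidth ratio among the nodes that still have room. The function name, the loop until the target count is reached and the example values are assumptions of this sketch rather than details confirmed by the paper.

```python
def place_replicas(X, EX, S, V, BN, C):
    """Create replicas until the current count EX[i][j] reaches X[i][j].

    S[i][k] is the free storage and BN[i][k] the bandwidth of node k in data
    center i; V[j] is the size of dataset j; C[i][j] is the replication cost.
    Returns the list of (i, k, j) placements and updates EX and S in place.
    """
    placements = []
    for i in range(len(X)):
        for j in range(len(V)):
            while EX[i][j] < X[i][j]:
                # Nodes of data center i with enough free space for dataset j.
                candidates = [k for k in range(len(S[i])) if V[j] < S[i][k]]
                if not candidates:
                    break  # no room left in this data center
                # Lowest cost-to-bandwidth ratio: cheap storage, fast links.
                k = min(candidates, key=lambda k: C[i][j] / BN[i][k])
                S[i][k] -= V[j]
                EX[i][j] += 1
                placements.append((i, k, j))
    return placements


# One data center with two nodes; two replicas of dataset 0 are requested.
X, EX = [[2]], [[0]]
S, V = [[10, 10]], [8]
BN, C = [[40, 10]], [[0.2]]
placements = place_replicas(X, EX, S, V, BN, C)
```

In this example the first replica goes to the high-bandwidth node; once that node is full, the second replica falls back to the slower node.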

Algorithm: Replica deleting

Input: $N=10$ // number of data centers

Input: $K$ // number of nodes

Input: $M$ // number of files

Input: $P_{ikj}$, $T_{\text{erasure}}$, $S_{ik}$, $V_{j}$

for $i=1$ to $N$ do

for $k=1$ to $K$ do

for $j=1$ to $M$ do

if $P_{ikj}<T_{\text{erasure}}$ then

Replica erased

$S_{ik}=S_{ik}+V_{j}$;

end if

end for

end for

end for

Question 4: Which replicas are deleted? The least popular replicas (those with a low access frequency) are deleted until the erasure threshold of the Elasticity module is reached.
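A possible Python reading of the replica deleting procedure is sketched below. The `stored` set, the function name and the example values are hypothetical. Like the pseudocode, the sketch does not guard against erasing the last remaining copy of a dataset, which a real system would presumably need to prevent.

```python
def delete_replicas(P, t_erasure, S, V, stored):
    """Erase replicas whose popularity is below t_erasure.

    stored is a set of (i, k, j) triples naming the replicas that exist;
    S[i][k] is the free storage of node k in data center i and V[j] the
    size of dataset j. Returns the erased triples; S and stored are updated.
    """
    erased = []
    for (i, k, j) in sorted(stored):
        if P.get((i, k, j), 0.0) < t_erasure:
            stored.discard((i, k, j))
            S[i][k] += V[j]  # reclaim the space held by the replica
            erased.append((i, k, j))
    return erased


P = {(0, 0, 0): 0.6, (0, 1, 1): 0.02}
stored = {(0, 0, 0), (0, 1, 1)}
S, V = [[2, 2]], [8, 8]
erased = delete_replicas(P, 0.05, S, V, stored)
```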

Question 5: What is the cost of replication?

We compute a cost for each operation and then derive the total benefit:

$\displaystyle\text{Benefit total}=\text{budget all user}-\text{cost total}$ (5)

$\displaystyle\text{Cost total}=\text{cost replication}+\text{cost bandwidth consumption}-\text{cost erasure}$ (6)
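Eqs. (5) and (6) translate directly into code. The function names and the numeric values below are purely illustrative and do not come from the experiments.

```python
def total_cost(cost_replication, cost_bandwidth, cost_erasure):
    """Eq. (6): replication plus bandwidth cost, minus erasure cost."""
    return cost_replication + cost_bandwidth - cost_erasure


def total_benefit(user_budgets, cost_total):
    """Eq. (5): sum of all user budgets minus the total cost."""
    return sum(user_budgets) - cost_total


# Hypothetical figures in dollars.
cost = total_cost(cost_replication=300, cost_bandwidth=80, cost_erasure=30)
benefit = total_benefit([200, 150, 100], cost)
```

With these figures the total cost is 350 and the provider's benefit is 100; a negative benefit would indicate that replication costs exceed what the tenants pay.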

4. Performance analysis

In this section, we first describe our simulation tools. We then compare the results according to different criteria: response time, SLA violation, cost penalty, total cost and the $\lambda$ standard.

4.1 Simulation tools

To validate our strategy, we used CloudSim [1, 19], an open, flexible and extensible simulation framework that facilitates seamless modeling, simulation and evaluation of current cloud computing systems and application services according to a distributed topology. The experiment was carried out using the tools listed in Table 2.

Table 2
Simulation tools

Simulator	CloudSim 3.0.3 (https://github.com/Cloudslab/cloudsim/releases/tag/cloudsim-3.0.3)
Machine	RAM	8 GB
	CPU	3.017 GHz
	HDD	1 TB
	OS	Windows 7 Pro

4.2 Result comparison

We compared RPURG with the following data replication strategies: PERP [9], DPRS [13] and CDRM [8], in addition to a no-replication baseline. These strategies were chosen for this comparative study because they presented the best results among previous works.

Table 3
Simulation parameters

Parameter	Value
Datacenter storage capacity	[200, 500, 1000] GB
Node storage capacity	20 GB
Intra-datacenter bandwidth	40 Gbit/s
Avg. intra-datacenter delay	5 ms
Avg. inter-region delay	100 ms
Number of queries processed	[200, 400, 600, 800, 1000]
Number of datasets	100
Avg. size of a data set	200 MB
Intra-datacenter data transfer cost	$0.05 per GB
Inter-datacenter data transfer cost	$0.5 per GB
Storage cost	[$0.1, $0.2, $0.3, $0.4, $0.5] per GB
Penalty cost	50% of budget per violation

To evaluate our strategy, we used different metrics (average response time, SLA violations, cost penalty and total cost) in 10 heterogeneous data centers with different capacities, costs and bandwidths. We also vary the number of client requests in steps of 200 tasks [200, 400, 600, 800, 1000]. The other parameters are shown in Table 3.

4.2.1 Average response time (s)

In the first experiment, we measure the average response time of the various strategies. The results are reported in Table 4.

Table 4
Average time response values

	No replication	CDRM	DPRS	PERP	RPURG
200 tasks	25.89	17.03	13.99	15.33	12.40
400 tasks	45.82	28.20	19.29	25.71	18.55
600 tasks	85.67	50.54	29.90	46.46	30.86
800 tasks	111.29	64.90	36.71	59.80	38.77
1000 tasks	145.52	85.88	50.50	70.21	53.16

Figure 4 shows a significant decrease in the average response time of the tested queries with RPURG compared to the CDRM and PEPR strategies. On the other hand, RPURG performs comparably to DPRS, which treats response time as the priority in its algorithm.

4.2.2 SLA violation

We also measure the number of constraints violated by the provider. The results are reported in Table 5.

Table 5
Number of SLA violations

	No replication	CDRM	DPRS	PERP	RPURG
200 tasks	55	30	55	17	7
400 tasks	192	115	83	33	29
600 tasks	466	285	140	66	74
800 tasks	643	395	177	88	103
1000 tasks	741	456	198	100	120

Figure 4.

Average time response comparison.

As shown in Fig. 5, the number of SLA violations for the tested queries is substantially lower with RPURG than with the CDRM and DPRS strategies. On the other hand, PERP takes the lead from 600 tasks onward (Table 5), since it focuses on reducing SLA violations for tenant satisfaction.

4.2.3 Cost penalty ($)

The provider has to pay penalties to clients who suffer losses. Table 6 shows the cost penalty for each strategy.

Table 6
Cost penalty values

	No replication	CDRM	DPRS	PERP	RPURG
200 tasks	399	293	341	102	26
400 tasks	734	437	422	159	85
600 tasks	1404	726	584	275	204
800 tasks	1834	912	688	349	280
1000 tasks	2074	1016	746	391	323

Figure 5.

SLA violation comparison.

Figure 6.

Cost penalty comparison.

In Fig. 6, RPURG has the advantage, with the minimum cost penalty, since it avoids the large penalties and only accepts small ones, in contrast to PERP. We note that DPRS and CDRM do not take this metric into consideration.

4.2.4 Total cost ($)

Table 7 summarizes all costs obtained through Eq. (6) for each strategy when varying the number of tasks.

Table 7
Total cost values

	No replication	CDRM	DPRS	PERP	RPURG
200 tasks	1 500	645	467	489	351
400 tasks	1 959	1 088	765	826	539
600 tasks	2 878	1 975	1 362	1 501	916
800 tasks	3 468	2 545	1 746	1 935	1 158
1000 tasks	3 796	2 862	1 959	2 176	1 293

Figure 7 shows that RPURG leads PERP, DPRS and CDRM in terms of total cost, which is calculated by Eq. (6) from the replication cost, the bandwidth consumption cost and the erasure cost.
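To quantify the gap visible in Table 7, the following sketch computes the percentage by which RPURG lowers the total cost relative to each other strategy. The values are copied from Table 7; the helper name `reduction_vs` is an assumption made for this illustration.

```python
# Total cost values ($) copied from Table 7, for 200, 400, 600, 800 and 1000 tasks.
TOTAL_COST = {
    "No replication": [1500, 1959, 2878, 3468, 3796],
    "CDRM": [645, 1088, 1975, 2545, 2862],
    "DPRS": [467, 765, 1362, 1746, 1959],
    "PERP": [489, 826, 1501, 1935, 2176],
    "RPURG": [351, 539, 916, 1158, 1293],
}


def reduction_vs(baseline, strategy="RPURG"):
    """Percentage by which `strategy` lowers the total cost relative to `baseline`."""
    return [
        100.0 * (b - s) / b
        for b, s in zip(TOTAL_COST[baseline], TOTAL_COST[strategy])
    ]


for baseline in ("No replication", "CDRM", "DPRS", "PERP"):
    formatted = ", ".join(f"{r:.1f}%" for r in reduction_vs(baseline))
    print(f"RPURG vs {baseline}: {formatted}")
```

For example, at 1000 tasks the reduction relative to DPRS, the closest competitor on this metric, is about 34%.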

4.2.5 $\lambda$ standard

In order to study the behavior of our strategy relative to the other strategies (No replication, CDRM, DPRS and PERP), we define a new metric called $\lambda$, which takes into account the average response time, the SLA violation, the cost penalty and the total cost at the same time. It is defined by Eq. (7):

$\displaystyle\lambda=\sqrt{\text{Average response time}\times\text{SLA violation\%}\times\text{Cost penalty}\times\text{Total cost}}$ (7)

We define Eq. (7) in order to compare the different strategies across heterogeneous metrics with a single value. The best strategy is therefore the one with the minimum $\lambda$ standard value.
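As a minimal sketch of how Eq. (7) can be evaluated, the code below computes $\lambda$ from the four metrics and selects the strategy with the smallest value. The function names and the input values are hypothetical and serve only as an illustration. In particular, the exact normalization of the SLA violation percentage is not specified here, so the example is not expected to reproduce the simulation results.

```python
import math


def lambda_standard(avg_response_time, sla_violation_pct, cost_penalty, total_cost):
    """Eq. (7): square root of the product of the four metrics."""
    return math.sqrt(
        avg_response_time * sla_violation_pct * cost_penalty * total_cost
    )


def best_strategy(metrics):
    """Return the name of the strategy with the minimum lambda value."""
    return min(metrics, key=lambda name: lambda_standard(*metrics[name]))


# Hypothetical inputs, NOT taken from the simulation:
# (average response time, SLA violation %, cost penalty, total cost)
EXAMPLE = {
    "A": (10.0, 5.0, 20.0, 100.0),
    "B": (8.0, 6.0, 15.0, 90.0),
}

for name, values in EXAMPLE.items():
    print(name, round(lambda_standard(*values), 2))
print("best:", best_strategy(EXAMPLE))
```

Because the four metrics are multiplied, a strategy cannot obtain a low $\lambda$ by excelling on one criterion while performing poorly on another, which is why the metric reflects a compromise among the criteria.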

Table 8 presents the numerical results obtained by the simulation to measure $\lambda$.

Table 8
$\lambda$ standard values

	No replication	CDRM	DPRS	PERP	RPURG
200 tasks	29 193	9 826	11 069	3 605	890
400 tasks	194 499	70 033	30 506	18 008	10 014
600 tasks	525 110	190 448	69 380	46 813	28 261
800 tasks	737 646	267 857	94 370	65 331	39 991
1000 tasks	855 722	310 862	108 254	75 619	46 508

Figure 7. Total cost comparison.

Figure 8. $\lambda$ standard comparison.

From the results presented in Tables 7 and 8 and the graphs above, we clearly see that RPURG achieves the best $\lambda$ values, which combine all the comparison criteria used (average response time, SLA violation, cost penalty and total cost), compared to the other algorithms (No replication, CDRM, DPRS and PERP).

Figure 8 plots the $\lambda$ standard values calculated by Eq. (7) for the five algorithms studied, and shows the compromise that each of them makes among the four imposed constraints. We note that RPURG provides the best compromise among the different metrics compared to the other four algorithms.

4.3 Discussion

The originality of our paper is that it deals with a twofold problem: on the one hand, the cost of the service for the provider and, on the other hand, the satisfaction of clients. Concerning the assignment problem, we propose to create additional replicas when needed. Compared to the works presented in Section 2, our work integrates a Replicas Manager that creates or removes replicas when needed and makes an exact choice of replica placement. This decision is based on the assignment algorithm. The works presented in Section 2 take similar decisions, but they do not consider both aspects (provider cost and client satisfaction) together, nor do they address the inconsistency problem, which is solved in this paper. This makes our work original compared with the existing works. Our work addresses this dual problem in a large-scale environment. Our strategy does not address fault tolerance in cloud computing [44].

5. Conclusion

Many data replication strategies have already been proposed to improve data availability and performance in cloud systems. Most research work shows that satisfying such objectives relies on the number of replicas and their placement.

In this paper, we have proposed a new data replication strategy that targets Quality of Service (QoS) satisfaction in clouds while considering the replication cost. It allows increased performance and cost containment for cloud services. In order to increase data accessibility and decrease the cost caused by replication, we propose to dynamically adjust the replication degree of each data object so that the replication cost is minimized. We rely on different protocols and tools such as simplex. The strategy takes into consideration a number of parameters, such as data popularity, when satisfying the performance SLO while taking into account the benefit of the provider. Finally, the proposed strategy is validated through experiments based on the CloudSim simulator. The analysis of the results shows the efficacy of our proposal.

For future work, we plan to improve the repair module and to integrate a module that controls failure messages in order to satisfy the fault tolerance objective [40]. We could also rely on data mining strategies [30] and machine learning techniques [31] in order to further improve QoS.

Author’s Bios

	Abdenour Lazeb graduated from the Department of Computer Science in Oran (Algeria), and obtained his Master's degree at USTO University in 2016. He is a PhD candidate in the Department of Computer Science in the Faculty of Exact and Applied Sciences at the University of Oran 1, Ahmed Ben Bella in Algeria. His research interests are distributed systems, cloud computing, grid computing, scheduling, replication, data placement, data management and resource allocation in large-scale systems.
	Riad Mokadem is currently an Associate Professor in Computer Science at Paul Sabatier University, Toulouse, France, and a member of the IRIT laboratory. His main research interests are query optimization in large-scale distributed environments, data replication and database performance. He is involved in several international conferences and journals.
	Ghalem Belalem graduated from the Department of Computer Science, Faculty of Exact and Applied Sciences, University of Oran 1, Ahmed Ben Bella, Algeria, where he received his PhD degree in computer science in 2007. His current research interests are distributed systems, grid computing, cloud computing, replication, consistency, fault tolerance, resource management, economic models, energy consumption, big data, IoT, mobile environments, image processing, supply chain optimization, decision support systems and high performance computing.

References

1. Belalem, Tayeb F.Z. and Zaoui, Approaches to improve the resources management in the simulator CloudSim, in: International Conference on Information Computing and Applications (ICICA’2010), Zhu, Zhang and Liu, eds, Tangshan, China, October 15–18, 2010, Proceedings, Part II, Communications in Computer and Information Science (CCIS), Vol. 106, 2010, pp. 189–196.
2. Mansouri, Javidi M.M. and Zade B.M.H., Using data mining techniques to improve replica management in cloud environment, Soft Computing 24 (2020), 7335–7360.
3. Salem, Salam M.A., Abdelkader and Mohamed A.A., An artificial bee colony algorithm for data replication optimization in cloud environments, IEEE Access 8 (2019), 51841–51852.
4. Khalili azimi, A bee colony (Beehive) based approach for data replication in cloud environments, in: Fundamental Research in Electrical Engineering, Montaser Kouhsari, ed., Lecture Notes in Electrical Engineering, Vol. 480, 2019, pp. 1039–1052.
5. Jayasree and Saravanan, Apsdrdo: Adaptive particle swarm division and replication of data optimization for security in cloud computing, International Conference on Computing Intelligence and Data Science (ICCIDS 2018) 1 IOSR J Eng (2018).
6. Ebadi and Jafari Navimipour, An energy-aware method for data replication in the cloud environments using a Tabu search and particle swarm optimization algorithm, Concurrency and Computation: Practice and Experience 31(1) (2019), e4757.
7. Lin J.W., Chen C.H. and Chang J.M., QoS-aware data replication for data-intensive applications in cloud computing systems, IEEE Transactions on Cloud Computing 1(1) (2013), 101–115.
8. Wei, Veeravalli, Gong, Zeng and Feng, CDRM: A cost-effective dynamic replication management scheme for cloud storage cluster, in: 2010 IEEE International Conference on Cluster Computing, 2010, pp. 188–196.
9. Tos, Mokadem, Hameurlain, Ayav and Bora, Ensuring performance and provider profit through data replication in cloud systems, Cluster Computing 21(3) (2018), 1479–1492.
10. Gill N.K. and Singh, A dynamic, cost-aware, optimized data replication strategy for heterogeneous cloud data centers, Future Generation Computer Systems 65 (2016), 10–32.
11. Zeng, Wang, Kent K.B., Bremner and Xu, Toward cost-effective replica placements in cloud storage systems with QoS-awareness, Software: Practice and Experience 47(6) (2017), 813–829.
12. Edwin E.B., Umamaheswari and Thanka M.R., An efficient and improved multi-objective optimized replication management with dynamic and cost aware strategies in cloud computing data center, Cluster Computing 22(5) (2019), 11119–11128.
13. Mansouri, Rafsanjani M.K. and Javidi M.M., DPRS: A dynamic popularity aware replication strategy with parallel download scheme in cloud environments, Simulation Modelling Practice and Theory 77 (2017), 177–196.
14. Liu, Shen and Narman H.S., Popularity-aware multi-failure resilient and cost-effective replication for high data durability in cloud storage, IEEE Transactions on Parallel and Distributed Systems 30(10) (2018), 2355–2369.
15. Sun, Yao and Li, DARS: A dynamic adaptive replica strategy under high load Cloud-P2P, Future Generation Computer Systems 78 (2018), 31–40.
16. Mansouri and Javidi M.M., A new prefetching-aware data replication to decrease access latency in cloud environment, Journal of Systems and Software 144 (2018), 197–215.
17. Casas, Taheri, Ranjan, Wang and Zomaya A.Y., A balanced scheduler with data reuse and replication for scientific workflows in cloud computing systems, Future Generation Computer Systems 74 (2017), 168–178.
18. Zhao and Zhang, Dynamic replica creation strategy based on file heat and node load in hybrid cloud, in: Proceedings of 19th International Conference on Advanced Communication Technology (ICACT), IEEE, 2017, pp. 213–220.
19. Calheiros R.N., Ranjan, Beloglazov, De Rose C.A. and Buyya, CloudSim: A toolkit for modeling and simulation of cloud computing environments and evaluation of resource provisioning algorithms, Software: Practice and Experience 41(1) (2011), 23–50.
20. Dai, Ibrahim and Bassiouni, An improved replica placement policy for Hadoop distributed file system running on cloud platforms, in: Proceedings of 4th International Conference on Cyber Security and Cloud Computing (CSCloud), 2017, pp. 270–275.
21. Elango and Kuppusamy, Fuzzy FP-tree based data replication management system in cloud, International Journal of Engineering Trends and Technology 36(9) (2016), 481–489.
22. Cui, Zhang, Yue, Shi and Yuan, A genetic algorithm based data replica placement strategy for scientific applications in clouds, IEEE Transactions on Services Computing 11(4) (2015), 727–739.
23. Xue and Shen, Provable data transfer from provable data possession and deletion in cloud storage, Computer Standards and Interfaces 54 (2017), 46–54.
24. and Yang, Stream-based live data replication approach of in-memory cache, Concurrency and Computation: Practice and Experience 29(11) (2017), e4052.
25. Limam and Belalem, A self-adaptive conflict resolution with flexible consistency guarantee in the cloud computing, Multiagent and Grid Systems 12(3) (2016), 217–238.
26. Bouharaoua and Belalem, A quorum-based intelligent replicas management in data grids to improve performances, Multiagent and Grid Systems 13(2) (2017), 143–161.
27. Gupta, Bhadauria H.S. and Singh, SLA-aware load balancing using risk management framework in cloud, Journal of Ambient Intelligence and Humanized Computing (2020), 1–10. doi: 10.1007/s12652-020-02458-1.
28. Mokadem and Hameurlain, A data replication strategy with tenant performance and provider economic profit guarantees in Cloud data centers, Journal of Systems and Software 159 (2020), 110447.
29. Limam, Mokadem and Belalem, Data replication strategy with satisfaction of availability, performance and tenant budget requirements, Cluster Computing 22(4) (2019), 1199–1210.
30. Khelifa, Hamrouni, Mokadem and Charrada F.B., Cloud provider profit-aware and triadic concept analysis-based data replication strategy for tenant performance improvement, International Journal of High Performance Computing and Networking 16 (2021), 2–3. doi: 10.1504/IJHPCN.2020.112678.
31. Wang J.B., Wang J.Y., Zhu, Lin and Wang, A machine learning framework for resource allocation assisted by cloud computing, IEEE Network 32(2) (2018), 144–151.
32. Hasan and Goraya M.S., Fault tolerance in cloud computing environment: A systematic survey, Computers in Industry 99 (2018), 156–172.
33. Tos, Mokadem, Hameurlain and Ayav, Achieving query performance in the cloud via a cost-effective data replication strategy, Soft Computing (2021), 1–18.
34. Long S.-Q., Zhao Y.-L. and Chen, MORM: A multi-objective strategy for cloud storage cluster, Journal of Systems Architecture 60(2) (2014), 234–244.
35. Milani B.A. and Navimipour N.J., A comprehensive review of the data replication techniques in the cloud environments: Major trends and future directions, Journal of Network and Computer Applications 64 (2016), 229–238.
36. Chang W.C. and Wang P.C., Adaptive replication for mobile edge computing, IEEE Journal on Selected Areas in Communications 36(11) (2018), 2422–2432.
37. Amjad, Sher and Daud, A survey of dynamic replication strategies for improving data availability in data grids, Future Generation Computer Systems 28(2) (2012), 337–349.
38. Zeng and Veeravalli, Optimal metadata replications and request balancing strategy on cloud data centers, Journal of Parallel and Distributed Computing 74(10) (2014), 2934–2940.
39. Saito and Shapiro, Optimistic replication, ACM Computing Surveys (CSUR) 37(1) (2005), 42–81.
40. Distler, Byzantine fault-tolerant state-machine replication from a systems perspective, ACM Computing Surveys (CSUR) 54(1) (2021), 1–38.
41. Nik W.N.S.W., Zhou B.B., Zomaya A.Y. and Abawajy J.H., A framework for implementing asynchronous replication scheme in utility-based computing environment, in: 2015 International Conference on Cloud Computing and Big Data (CCBD), 2015, pp. 183–190.
42. Walters J.P. and Chaudhary, A scalable asynchronous replication-based strategy for fault tolerant MPI applications, in: International Conference on High-Performance Computing, Springer, Berlin, Heidelberg, 2007, pp. 257–268.
43. Séguéla, Mokadem and Pierson J.M., Comparing energy-aware vs. cost-aware data replication strategy, in: Tenth International Green and Sustainable Computing Conference (IGSC), IEEE, Alexandria, VA, USA, 2019.
44. Lazeb, Mokadem and Belalem, Towards a new data replication management in cloud systems, International Journal of Strategic Information Technology and Applications (IJSITA) 10(2) (2019), 1–20.
