Abstract
The cloud is an infrastructure that provides decentralized on-demand services. It allows consumers to pay only for the services they use. The consumer is the central entity in the cloud. A violation of the service-level agreement (SLA) between the consumer and the provider is costly because the service provider has to pay penalties. Data replication is emerging as an ideal solution to meet the new challenges of the cloud. This paper proposes a new replication strategy based on the popularity of data. This strategy adaptively selects the files to be replicated to improve the overall availability of data in the system, minimize query response time, and achieve the required quality of service. In addition, it dynamically determines the number of replicas to add and the best locations to store them. Experimental results show the effectiveness of the proposed strategy.
Introduction
Technological development in hardware and software and the large volume of data generated have led to the emergence of cloud computing, which provides online services [1, 2]. Cloud computing has eased the burden on enterprises and users in terms of management, maintenance, updates, security and availability [3], all at reasonable prices, since the consumer only pays for what he/she consumes [4]. This has led most enterprises to migrate to cloud computing and has increased the number of cloud computing users [5]. The cloud provides three main services: i) software as a service, which offers complete programs to the end-user; ii) platform as a service, which provides a ready-made platform for programmers and developers; iii) infrastructure as a service, which provides a complete infrastructure [6, 7]. Before using the cloud, companies were responsible for everything in terms of infrastructure, platforms, services, databases, etc. The emergence of cloud computing has facilitated operations and reduced IT problems [8]. Service providers must respect the SLA contract, which is an agreement between the service provider and the consumer. In case of a breach, the provider must pay penalties [9, 10, 11]. The growing number of service providers has led to competition to provide the best performance measures for availability, response time, load balancing, reliability and latency [12, 13, 14].
To get the most out of these measures, an effective technique called data replication is used. It involves creating identical copies of data and storing them in different locations. This is done to improve availability, reduce access time, increase system robustness, and scale performance according to certain criteria [15, 16, 17, 18]. However, this technique poses particular problems in large-scale dynamic environments [19, 20]. To design a good replication strategy, we need to answer several questions: When to create replicas? What data needs to be replicated? Which volume of data has to be replicated? What is the nature of the data to be replicated? Where to replicate? How to ensure data consistency?
Two types of replication are used to answer these questions: static replication and dynamic replication [21]. In a static replication strategy, the number of replicas for each data item is fixed in advance. The placement of the replicas is also predefined, and these replicas are controlled beforehand. Static replication strategies therefore do not fit the cloud infrastructure, because the cloud environment is heterogeneous and dynamic: data volume increases and decreases, bandwidth changes, the system scales, and nodes fail. In addition, static replication strategies do not take popular data into account. Popular data files are those with the highest numbers of requests or the highest access rates, and they change over time because user access patterns are very dynamic in the cloud system. Furthermore, static replication strategies do not support data replication during the execution of a job. Dynamic replication has emerged as the best solution to overcome these obstacles because it adaptively creates and deletes replicas based on user behavior and network topology. Similarly, replicas are calculated and stored dynamically based on system changes and user access patterns. In general, dynamic replication is most appropriate for an environment like the cloud, where the number and location of users who access data change very dynamically. These attractive characteristics motivated us to propose a new dynamic replication strategy to overcome the obstacles mentioned above.
Dynamic replication has several advantages but also some challenges. Among these challenges: if the wrong data file is chosen for replication, it will contribute to unnecessary use of memory space and thus increase the memory cost and management cost. The optimal number of replicas to perform must be determined. Once the number of replicas reaches a certain point, adding more replicas will no longer improve availability but will lead to unnecessary expenses. Placing multiple replicas of the same data in the same node does not improve availability or minimize bandwidth consumption and workload.
This study introduces a model for managing data replication in cloud computing. The proposed solution aims at optimizing the performance demanded by users while limiting the number of replicas. To do so, a replication strategy based on data popularity is presented. The strategy identifies popular data according to different criteria such as the importance of a file, user rate, file size, number of replicas, etc. Several phases of the strategy are carried out in sequence to achieve the desired benefit. These phases consist of finding out which data should be replicated, when the replication event should be performed, how many replicas have to be created, where to place these replicas and what data are to be replaced. The main contributions of our strategy include: i) a new adaptive measure for data popularity in cloud computing, which covers new factors to deal with the dynamic nature of the cloud; ii) a new dynamic threshold that selects and places files in the set of replicable files; iii) calculation of the minimum number of replicas needed to improve availability and reduce response time according to the SLA contract; iv) a new replica placement based on four important parameters: availability probability, storage utilization, bandwidth and load variance; v) a new replica replacement strategy based on the popularity degree of the file, the size of the replica and the last time the replica was requested.
The proposed approach was implemented and simulated using the CloudSim simulator and the results obtained were compared to other existing strategies.
The remainder of the paper is organized as follows: Section 2 analyzes existing data replication strategies in cloud computing; Section 3 describes the proposed strategy, while Section 4 presents the implementation environment, experimental results, and evaluation of the performance of the proposed strategy against others; Section 5 concludes the paper.
List of abbreviations and acronyms used in this paper
Several strategies have been proposed in the literature. They focus on efficiently creating and placing replicas in the cloud so that a node can find the data it needs in a minimum amount of time. The models on which these strategies are based differ, but they ultimately pursue the same objectives: improving data access time, increasing availability, tolerating failures and reducing bandwidth usage. Wei et al. [22] proposed a dynamic replication system called CDRM (Cost-Effective Dynamic Replication Management) that improves system availability and performance by cost-effectively balancing workloads. The authors addressed two problems: i) What is the minimum number of replicas that must be kept in the system to meet the availability requirement? ii) How to place these replicas effectively to maximize performance and load balancing? This led to the proposal of a mathematical model that calculates the minimum number of replicas needed to meet the availability requirement of a file. Then, the authors calculated the blocking probability of nodes in a heterogeneous environment. This probability reflects the workload intensity and the performance of the data node. CDRM was implemented using the Hadoop Distributed File System (HDFS) [23]. Results show that the CDRM strategy meets availability requirements with improved access latencies, load balancing and system stability compared to the HDFS default replication strategy.
A cost-effective dynamic data replication strategy called Cost-effective Incremental Replication (CIR) was proposed by Li et al. [24]. CIR is a cloud-based strategy in which the principal objective is to manage the reliability problem. This strategy aims to calculate the point of replica creation. A single data replica is created in the first step, while replicas are limited to three. Then, each time the new data is uploaded by the user or the replica creation point is reached, the new replica is created in the system. The replica creation point indicates that the current number of replicas cannot meet the reliability requirement. Furthermore, the authors evaluated the effectiveness of the strategy with the typical three replica data replication strategy. The latter is used by data storage systems such as Amazon S3 [25], Google File System [26] and HDFS [23]. In the CIR simulation, the authors applied the reliability settings and the pricing model on Amazon S3. The evaluation results indicate that the CIR strategy can considerably reduce replicas in a data center and storage cost while meeting reliability requirements.
Mansouri [27] proposed an Adaptive Data Replication Strategy (ADRS) in cloud computing to improve average response time, efficient network usage, load balancing, replication frequency and storage usage. This strategy answers two questions: i) How do you place these replicas to distribute workloads across the data node cluster efficiently? ii) Which replica should be deleted? A cost function calculates the result of a set of factors such as the average service time, load variance, storage usage, probability of failure, and response time to answer these questions. In this strategy, users can define weightings according to their own needs by setting a higher value for expected performance. If there is sufficient storage space in the selected site, the replica is stored. If not, one or more files must be deleted using the following parameters: the availability of the file, the last time the replica was requested, the number of accesses and the size of the replica file. The ADRS strategy compares its results with five existing algorithms. The results are very satisfactory.
An energy-efficient data replication method in cloud data centers was proposed by Boru et al. [28]. The authors assumed that several cloud computing data centers are geographically distributed around the world. Each data center has a three-level topology: core, aggregation and access, which are interconnected. In this replication method, each data object is permanently stored in the central database (CentralDB). Depending on the access model, it is replicated in the Datacenter DB and Rack DB databases. All failures in the data center DBs can be retrieved from the Central DB and vice versa. Moreover, this approach implements dynamic replication to improve both the availability and QoS of cloud applications that contain only an optimal number of replicas. This approach reduces system power and bandwidth consumption. This data replication technique also improves communication delay and network bandwidth between geographically dispersed data centers and within each data center.
Limam et al. [29] proposed a dynamic replication strategy called DRAPP (Data Replication strategy with the satisfaction of Availability, Performance and tenant budget requirements). This strategy simultaneously meets customer requirements for availability and performance while considering the customer’s budget and the supplier’s profit. It selects the data that has the highest access rate. Moreover, it is initiated only if one objective regarding availability or response time to a request is not met. In addition, replication is only performed if the creation of the new replica is cost-effective for the supplier. Then, the placement of replicas is subject to certain constraints such as limited budget and replication cost. The placement is done with load balancing in mind. The authors considered the estimates of expenses and revenues on the execution of a query. The simulation results indicate that DRAPP can significantly increase the availability and performance of the cloud system while taking into account the supplier’s profit.
A dynamic data replication strategy, called DPRS (Dynamic Predicted Replication Strategy), using an exponential smoothing prediction method in cloud systems was proposed by He et al. [30]. DPRS takes into account the response time and additional cost of the cloud storage system. The authors aimed to answer two questions: (i) How much to replicate or delete? (ii) When to replicate? To this end, they proposed a mathematical model that analyzes the access history of each file based on the number of accesses to calculate the optimal number of replicas. The model predicts the access value in the next period using the exponential smoothing method. Then, it calculates the optimal number of replicas and compares it to the current number to decide whether to create a replica for the file in question. If the value is higher than the threshold, a replica is created; if it is lower, a replica is deleted. This exponential smoothing method can effectively reduce the frequency of file creation and deletion. Experimental results indicate that DPRS effectively reduces the response time and additional cost of the cloud storage system.
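To make the prediction step concrete, the following is a minimal sketch of simple exponential smoothing applied to a per-period access history. It is not the model of He et al.; the function name, the input layout and the smoothing factor `alpha` are assumptions chosen for illustration.

```python
def smooth_forecast(access_counts, alpha=0.5):
    """Forecast next-period accesses with simple exponential smoothing.

    access_counts: accesses per past period, oldest first (hypothetical input).
    alpha: smoothing factor in (0, 1]; the default value is an assumption.
    """
    if not access_counts:
        raise ValueError("need at least one observation")
    # Start from the first observation, then blend in each newer one.
    level = float(access_counts[0])
    for observed in access_counts[1:]:
        level = alpha * observed + (1 - alpha) * level
    return level


# Three periods with 10, 20 and 30 accesses: 10 -> 15.0 -> 22.5
print(smooth_forecast([10, 20, 30], alpha=0.5))
```

A larger `alpha` makes the forecast follow recent periods more closely, while a smaller one damps short bursts of accesses.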
Proposed adaptive replication strategy based on popular content
In the general framework of the cloud system, there are several data centers. Each data center consists of several homogeneous or heterogeneous nodes. A node is a physical machine that hosts several virtual machines. The virtual machines (VMs) are logical machines on which tasks are executed. Data centers are geographically distributed. A hierarchical topology is used in our proposal. Figure 1 shows the proposed architecture. The cloud system architecture contains a scheduling broker, a replication manager, a cloud information service (CIS) and a datacenter information service (DIS). The scheduling broker is responsible for interaction with end-users. Users send jobs to the scheduling broker, where they are loaded into the job queue. The replication manager is responsible for creating, selecting, placing and replacing replicas.
The cloud data replication architecture.
The scheduling broker is also responsible for sending tasks to virtual machines for execution. The Cloud Information Service (CIS) is responsible for resource listing, indexing, and data center efficiency discovery. The scheduling broker can ask the Cloud Information Service for the number and the locations of available data replicas. Each data center has a datacenter information service (DIS), which contains information about the data stored at any point in time. The information about the data includes its logical filename, the number of accesses to the data, the size of the data, its importance, the location where the data is stored, etc. Moreover, it contains information for each node. The datacenter information service is involved in the process of replication and the management of replicated data.
The main objective of this work is to contribute to the management of data replication in the cloud by proposing a new data replication strategy (ARSP). This strategy will improve certain performance metrics such as availability, response time and load balancing while respecting the constraints of the SLA contract. To do this, a replication strategy based on the popularity of the data is proposed; the strategy answers the following questions: what data needs to be replicated, which volume of data has to be replicated, where to replicate, which replicas need to be replaced.
To answer these questions, the proposed strategy considers parameters such as the files needed for replication, the minimum number of replicas, the available storage space, and the system’s performance. The proposed approach acts adaptively and dynamically while considering the dynamicity of the cloud system. ARSP proceeds through several phases in sequence to obtain the desired results. Figure 2 summarizes these phases.
Functional model of the ARSP strategy.
For each file added, the number of replicas to be created and verified is based on a certain availability required by the customer. If the probability of the current availability of a storage node is lower than the required availability for the file, then replication becomes necessary. Equation (1) [31, 32] is used to calculate the minimum number of replicas that must be added to the system to ensure the required availability of the file:
with,
where
For more details, we present an illustrative example. If the user requests a 99% availability for a file i and the node that stores that file has an 80% availability, the minimum number of replicas that must be created is 3, according to the calculation:
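As a hedged illustration of this example, the sketch below assumes that replicas sit on nodes with the same availability p and fail independently, so that n copies give an availability of 1 - (1 - p)^n. Solving 1 - (1 - p)^n >= A for n yields n >= log(1 - A) / log(1 - p). This is a common form of the availability constraint and reproduces the value 3 above, but it is an assumption and may differ from the exact notation of Eq. (1); the function name is hypothetical.

```python
import math


def min_replicas(required_availability, node_availability):
    """Smallest n with 1 - (1 - p)**n >= A, assuming independent node failures.

    required_availability: A, requested by the customer, strictly in (0, 1).
    node_availability: p, availability of one storage node, strictly in (0, 1).
    """
    if not (0 < required_availability < 1 and 0 < node_availability < 1):
        raise ValueError("availabilities must lie strictly between 0 and 1")
    n = math.log(1 - required_availability) / math.log(1 - node_availability)
    # Small tolerance so that exact integer solutions are not rounded up
    # because of floating-point noise.
    return max(1, math.ceil(n - 1e-9))


# 99% requested, 80% per node: log(0.01) / log(0.2) is about 2.86, so 3.
print(min_replicas(0.99, 0.80))
```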
Once the replicas are created, they are placed using phase 4 related to the placement of replicas in the nodes.
This phase accurately identifies the files that need to be replicated in the cloud system. It also decides when they need to be replicated, using the concept of popularity. According to this concept, recently accessed data files are more likely to be accessed again in the future [33, 34]. The access history of each data file is examined to determine its respective degree of popularity. The advantage of popular data replication is that it significantly increases the overall performance of the cloud system. Files are selected using a replication factor (RF) that combines several factors such as the degree of popularity, the storage size reserved by each file, and the response time defined in the SLA contract.
Degree of popularity
In the literature [35, 36], popular data replication strategies are based on the Half-life concept. The latter is mentioned in several domains, such as medicine, physics and chemistry, favoring recent periods of file use. Therefore, it gives more weight to recently used files and less weight to historical records. Popularity is calculated using Eq. (3).
where
An example of calculating the popularity of a file
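The half-life idea can be sketched as follows. The code assumes one common form of the weighting, in which the weight of an interval is halved for every interval that separates it from the most recent one. The function name and the input layout are hypothetical, and the exact form of Eq. (3) may differ.

```python
def access_frequency(counts):
    """Half-life weighted access frequency (assumed form).

    counts: number of accesses per time interval, oldest first.
    The most recent interval has weight 1, the one before 1/2, then 1/4, ...
    """
    n = len(counts)
    return sum(c * 2.0 ** -(n - 1 - t) for t, c in enumerate(counts))


# Same total number of accesses (14), different recency:
print(access_frequency([8, 4, 2]))  # fading file:  8/4 + 4/2 + 2 = 6.0
print(access_frequency([2, 4, 8]))  # rising file:  2/4 + 4/2 + 8 = 10.5
```

The two files have the same raw access count, but the one whose accesses are concentrated in recent intervals obtains the higher score.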
In these strategies, the file with the highest access frequency (AF) value is chosen as the popular file. Table 2 provides an illustrative example. This method favors recently accessed files over files accessed in the more distant past. For example, on YouTube, a video that was popular the previous year is not always popular at present. In the proposed strategy, two criteria were added to calculate the degree of popularity (DP): the importance of a file (IF), and the rate of use of this file by different cloud users. The aim is to ensure the selection and prediction of popular files. The DP is calculated as follows:
where
The importance of a file allows measuring the number of requests made by customers on a file. This is the key information because it indicates the importance of the data, thus allowing a great optimization of storage space. Equation (5) is used to calculate
where
Weights are used over the different time intervals to avoid neglecting the importance of data consultation in the current time. These weights allow giving importance to the history of access over time. Thus, in each time interval, we calculate the average access of each file. Table 3 presents an illustrative example.
An example of calculating the importance of a file
RateUser indicates the utilization ratio of files
where
The
After determining which files can be replicated, a comparison is made between the selected file response time and the response time requested in the SLA contract. If the response time is higher than the response time requested in the SLA, we replicate this file because the QoS is not guaranteed with the current number of replicas. The response time (RT) of each file is estimated using Eq. (9) [27]:
where
In a cloud storage system, network bandwidth is very limited and at the same time essential for overall performance. Too many replicas can be detrimental to the system and may not significantly improve availability, resulting in unnecessary workload. In this phase, we determine the minimum number of replicas that must be created for a file to ensure maximum data availability without reducing system performance. Equation (9) is used to deduce the number of replicas necessary to guarantee the response time defined in the file SLA contract.
Placement and replacement of replicas
This phase aims to ensure the desired results, such as high availability with minimum replication and system performance degradation. This is achieved through an effective location selection strategy. Our method of placement is composed of two parts.
Location selection
Once the replicas are created, we have to place them in the available storage nodes and find the best location. A placement factor (PF) is associated with each node and is calculated using Eq. (10):
where
Equation (10) selects the nodes that offer high availability, storage capacity and bandwidth, and lower workload. After calculating the PF of each node, the nodes will be ordered according to the PF value in descending order; the replica is then placed in the first node of the vector.
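The ranking step can be sketched as follows. The code below is not Eq. (10): it assumes, purely for illustration, an equally weighted sum of normalized availability, free storage, bandwidth and the complement of the load. The `Node` fields, the weights and the function names are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Node:
    name: str
    availability: float  # probability that the node is up, in [0, 1]
    free_storage: float  # fraction of storage still free, in [0, 1]
    bandwidth: float     # normalized bandwidth, in [0, 1]
    load: float          # normalized current load, in [0, 1]


def placement_score(node, weights=(0.25, 0.25, 0.25, 0.25)):
    """Assumed stand-in for the placement factor: higher is better."""
    w_avail, w_store, w_band, w_load = weights
    return (w_avail * node.availability
            + w_store * node.free_storage
            + w_band * node.bandwidth
            + w_load * (1.0 - node.load))  # a lower load raises the score


def rank_nodes(nodes, weights=(0.25, 0.25, 0.25, 0.25)):
    """Order candidate nodes by descending score; the first one is chosen."""
    return sorted(nodes, key=lambda n: placement_score(n, weights), reverse=True)


busy = Node("busy", availability=0.6, free_storage=0.5, bandwidth=0.5, load=0.9)
idle = Node("idle", availability=0.9, free_storage=0.8, bandwidth=0.7, load=0.2)
print([n.name for n in rank_nodes([busy, idle])])  # ['idle', 'busy']
```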
As mentioned before, placing multiple replicas of the same data in the same node does not improve availability or fault tolerance. For this reason, it will be useful to store only one replica of the data in a node. Algorithm 4 describes the placement of a replica in the node. Firstly, we check whether the file exists in the selected node. Next, we check whether there is enough space to accommodate the file. If there is not enough space, we delete the files using the suppression factor (SF). The suppression factor is based on five important parameters expressed as follows:
where
Equation (11) favors deleting the replicas with a low degree of popularity, while also taking into account the number of replicas, the size of the files, and the delay between the last request and the current time. After the suppression factor is calculated, the replica with the minimum SF value is deleted. We then test again whether the available space is sufficient to place the new replica.
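The placement and replacement steps described above can be sketched as follows. This is a simplified illustration rather than Algorithm 4 itself: the suppression factor of each stored replica is assumed to be precomputed and passed in as a dictionary, and all names are hypothetical.

```python
def place_replica(node_files, capacity, new_file, new_size, sf):
    """Try to store one replica of new_file on a node, evicting if needed.

    node_files: dict mapping file name -> size of the replica on this node
                (modified in place).
    capacity:   total storage capacity of the node.
    sf:         dict mapping file name -> suppression factor (assumed to be
                precomputed); the replica with the smallest value goes first.
    Returns True if the replica was placed, False otherwise.
    """
    if new_file in node_files:
        return False  # keep at most one replica of a file per node
    if new_size > capacity:
        return False  # the file can never fit on this node
    free = capacity - sum(node_files.values())
    while free < new_size:
        # Delete the replica with the minimum suppression factor, then retest.
        victim = min(node_files, key=lambda name: sf[name])
        free += node_files.pop(victim)
    node_files[new_file] = new_size
    return True


files = {"a": 40, "b": 30, "c": 20}
factors = {"a": 0.9, "b": 0.1, "c": 0.5}
print(place_replica(files, 100, "d", 35, factors))  # True
print(sorted(files))                                # ['a', 'c', 'd']
```

In this run only 10 units are free, so the replica with the lowest factor ("b") is deleted, which frees enough space for the new replica.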
This section describes the simulation environment, the metrics used and the analysis of the simulation results. The results of the four implemented strategies (ARSP, CDRM, ADRS and no replication) are compared.
Simulation environment
A simulator was used to implement our strategy. Several cloud simulators are available in the literature [37] to accurately implement real cloud performance evaluation scenarios. However, each proposed simulator focuses on one aspect of the cloud, such as resource allocation. To date, no cloud simulator has been specifically designed for data replication. CloudSim [38] is an open-source project and is widely used in the literature. The simulation suite can therefore be adapted to meet the requirements of our performance evaluation scenarios.
Class diagram
The CloudSim simulator is composed of several classes (Fig. 3). The classes we modified, such as the Datacenter class, are indicated in red, and the classes we added are indicated in blue.
Sequence diagrams
Figure 4 shows the sequence diagrams that graphically represent the interactions between the main CloudSim entities in chronological order. These interactions are part of the following simulation scenario (Table 4).
Interactions in the simulation sequence diagram
CloudSim simulator class diagram.
Table 5 summarizes the simulation parameters used to carry out various experiments. These parameters are the same for all the experiments presented.
Parameters of simulations
Simulation sequence diagram.
The metrics presented in this sub-section were used to validate our work and evaluate the behaviour of the proposed strategy.
Response time [39]
This metric allows measuring the response time related to the execution of a request. The response time of a request
The response time
Data transfer time to a node
Waiting time for request
Execution time of query
where
Finally, the Average Response Time can be calculated using Eq. (13):
where Average Response Time represents the average response time for the entire simulation and Nb_req indicates the number of queries.
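As a small illustration of this metric, the sketch below assumes that the response time of a request is the sum of the three components listed above (data transfer time, waiting time and execution time) and averages it over all requests. The function names and the tuple layout are assumptions.

```python
def response_time(transfer, waiting, execution):
    """Response time of one request, assumed to be the sum of its components."""
    return transfer + waiting + execution


def average_response_time(requests):
    """Average over all requests of the simulation.

    requests: iterable of (transfer, waiting, execution) tuples, one per
              request; len(requests) plays the role of Nb_req.
    """
    requests = list(requests)
    if not requests:
        raise ValueError("at least one request is required")
    return sum(response_time(*r) for r in requests) / len(requests)


# Two requests of 6 time units each give an average of 6.0.
print(average_response_time([(1, 2, 3), (2, 2, 2)]))
```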
This parameter defines the percentage of storage space used relative to the total available space. It is affected by the number of replicas created.
where
This measure counts the number of replicas created by the proposed strategy to compare it with other strategies.
Number of replicas deleted
This measure counts the number of replicas deleted by replication strategies during simulation.
Load variance
Load variance of data nodes indicates the standard deviation of the load of all data nodes working in the cloud. This metric can be used to indicate the degree of load balancing of the system. A lower value of the load variance indicates better load balancing [27, 40, 41]. The load
The load of the data node
where
where
Response time to requests
In this experiment, we measure the average response time obtained with the implemented strategies. Figure 5 shows the impact of replication on request response time. The results show that as the number of requests increases, the response time of the four strategies increases exponentially. The response time for the strategy without replication (No Repl) is indeed high because the file is stored in a single node; all requests for this file are sent to that node, which overloads it and leaves requests waiting in the queues. On the other hand, the cost-effective dynamic replication management (CDRM) strategy, the adaptive data replication strategy (ADRS) and the proposed strategy (ARSP) increase the availability of replicas, which improves response time.
Average response time with a variation in the number of requests.
Figure 5 shows a significant decrease in response time when applying the ARSP strategy. This is due to the replication mechanism we applied to improve response time. The proposed approach improves the response time by 46.07% compared to the approach without replication, 29.53% compared to the CDRM strategy, and 22.95% compared to the ADRS strategy. ARSP gives a better response time because it only replicates popular files that are likely to be used in the near future and that do not meet the response time required in the SLA contract. In addition, ARSP determines the minimum number of replicas that need to be added to the system to ensure the required response time, whereas the other strategies do not calculate this number.
This experiment examines the contribution of our strategy regarding the number of replicas created. Only three strategies are compared because the strategy without replication does not dynamically replicate data. It should be noted that it is not a question of replicating endlessly for a strategy to be effective. A good replication strategy must minimize the number of replicas created. ARSP results in a reduced number of replicas created compared to the ADRS and CDRM strategies. ARSP deletes files using a suppression factor, which is calculated using several parameters such as the degree of popularity and file size and the fact that the replicas should not be placed in the same region. In addition, the strategy proposed in this study only replicates popular files that do not satisfy the SLA contract.
Figure 6 and Table 6 show the difference between the number of replicas created during the simulation of the ARSP strategy and the other strategies. We can deduce that ARSP was able to reduce this number with an average gain of 2.11% compared to ADRS and 13.64% compared to CDRM.
Number of replicas created in the system
Number of replicas created.
Figure 7 and Table 7 show the number of replicas deleted during the simulation. If the number of deleted replicas is very high, data availability will decrease significantly, and query execution time will increase significantly (waiting time in the queue). The strategic placement of replicas increases this number. It can be noted that there is a major difference between the proposed strategy and the other strategies. The results also show that ARSP reduces the number of replica deletions with an average gain of 22.51% compared to the ADRS strategy and an average gain of 56.85% compared to the CDRM strategy. This is because the proposed strategy guarantees a good distribution of data in the system. In addition, ARSP deletes replicas only if there is no storage space available in the data node. Minimizing this parameter is important if we want to improve cloud performance (storage space, bandwidth, etc.).
Number of replicas deleted in the system
Number of replicas deleted.
Figure 8 shows the available/used storage space for each strategy during the simulation. The percentage of the storage space used is directly related to the number and size of the replicas created and deleted during the simulation. Figure 8 indicates that the storage space used in the cloud is higher with the proposed strategy, because it removes fewer replicas. This indicates that the proposed method uses the available storage space well and that data is only deleted if there is no storage space left in the nodes to store new data. Thus, the data remains available in the system for a longer period. These advantages result from the selection of placement locations and the good distribution of replicas in the cloud. In addition, ARSP favors replicating small files over very large ones, which allows several replicas of such data to be stored without deleting others.
Load variance in the cloud
The curves in Fig. 9 show the impact of the variation in the number of requests on the average system load. We varied the number of requests from 200 to 1200 in steps of 200 requests. The load variance indicates how the workload is distributed among the nodes. A high variance indicates that the workloads between the nodes are far apart, and vice versa. It is null when all load values in the nodes are identical. Therefore, the lower the value of the load variance, the better the load balancing.
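The behaviour of this metric can be checked with a few lines of code. The sketch below follows the description given in the metrics section, where the load variance is the standard deviation of the node loads; the function name and the sample loads are illustrative only.

```python
import statistics


def load_variance(node_loads):
    """Population standard deviation of the per-node loads."""
    return statistics.pstdev(node_loads)


print(load_variance([5, 5, 5]))                 # identical loads: 0.0
print(load_variance([2, 4, 4, 4, 5, 5, 7, 9]))  # uneven loads: 2.0
```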
Storage space used in the cloud.
Load variance according to the number of requests.
Figure 9 also indicates that the proposed strategy gives better results than the other strategies (CDRM, ADRS). Starting from 200 requests, the curve of the proposed strategy has a stable trend even though the number of requests is increasing, which allows it to scale easily, unlike other strategies that maintain instability in load balancing. This is simply due to the fact that the proposed strategy creates and deletes fewer replicas during the simulation. The replicated files are really popular files, so they are not deleted and created each time. In addition, ARSP guarantees a better distribution of replicas in the system thanks to the replication placement factor, which considers several criteria in selecting locations such as current node load, bandwidth, availability and free storage space.
Positioning our strategy regarding some related work
In the literature, researchers have been mainly interested in increasing availability in the cloud system to satisfy consumers. On the other hand, few studies have looked specifically at increasing performance and compliance with SLAs. In addition, even fewer proposed strategies take into account the popularity of data in the cloud system. In Table 8, we position our Adaptive Replication Strategy based on Popular content (ARSP) relative to the existing related works cited above (Section 2). These strategies are compared with each other regarding parameters such as availability, reliability, storage space, the optimal number of replicas, reduction of response time, load balancing, power consumption, and fault tolerance. They are also compared on whether they take popular data into consideration, respect the SLA contract, remove replicas and trigger dynamic replication.
Cloud computing offers a large number of services to users, creating a very large infrastructure. Users expect high performance and better availability from these services. This is why service providers use data replication, while ensuring compliance with the SLA contract. This paper examined the problem of data replication in the cloud computing environment. A new data replication strategy based on the concept of data popularity was presented. According to this concept, recently accessed data files are more likely to be accessed again in the future. The proposed replication strategy consists of four main phases, named respectively, phase of initializing the replica number, phase of selecting the files to replicate, phase of the creation of replicas, phase of placement and replacement of replicas. These phases adapt dynamically to changes in the cloud system.
This strategy called ARSP is proving to be effective in solving the problem of data management in the cloud. It creates the necessary data replicas for the cloud system while avoiding the violation of the SLA contract. In addition, ARSP increases the overall performance of cloud computing by intelligently placing the replicas to achieve the desired benefit. It even takes into consideration replacement in case of insufficient storage space. The proposed strategy selects popular data based on well-defined criteria to decide whether or not to proceed with the replication process. This selection minimizes the number of replicas created, promotes rapid processing of large numbers of requests, and respects response time.
To evaluate this work and confirm our theoretical results, we implemented and simulated our strategy and two other existing strategies (CDRM, ADRS) using the CloudSim simulator. The comparison results show that ARSP is generally more efficient than the CDRM and ADRS strategies. The performance of the strategy improves progressively as the number of requests increases. To continue our work, several interesting perspectives can be considered, such as i) integrating a replica consistency management service, ii) considering energy consumption in the cloud system, iii) integrating a fault tolerance service, iv) taking into account the cost of replication, and v) validating our proposal in a real cloud environment.
