Abstract
Cloud providers have recently offered their unused resources as transient instances. Amazon sells idle cloud resources as spot instances pricing by an auction-based market mechanism to reduce the cost without any availability guarantee. Thus, to dynamically and autonomously manage cloud resources to execute user applications ensuring greater reliability with cheaper spot instances is an open problem. In this context, we propose a fault-tolerant multi-agent architecture as middleware of cloud providers and users to mediate access to a wide range of heterogeneous resources providing a resilient application execution environment with a dynamic flexible fault-tolerant mechanism based on adaptive checkpointing. Our architecture combines a case-based reasoning model with a survival analysis model to predict failure events and refine fault-tolerant plans with adequate parameters to increase reliability optimizing total execution time and costs. We evaluated the proposed architecture with real historical data collected from Amazon EC2 price changes including, with approximately 21 million records and generating 1,362,816 scenarios stored in our case knowledge database. The results considering the time to revocation achieved high levels of accuracy (98%) with a gain up to 74.48% to total execution time, reducing total cost when compared to other approaches in the literature.
Keywords
rm
Introduction
Cloud computing has evolved to offer services to execute applications with high flexibility, scalability and availability with affordable pricing for cloud users and vendors [7]. Cloud computing has emerged as a new paradigm to provide various types of computing resources to serve diverse application scenarios being one of most popular and important information technology trends [18].
Organizations are shifting their business to cloud providers that should furnish resources to attend users in a transparent way over the Internet. The cloud architecture consists of distributed resources with a set of services offered by a pay-as-you-go model, being an interesting alternative to explore resources often appearing to be unlimited [22].
The cloud Infrastructure as a Service (IaaS) is a base class that offers a set of Virtual Machines (VMs) with different configurations and capacity allowing users to choose according their needs. Since users’ resource demands change constantly, cloud providers are offering VMs services at inexpensive prices than the dedicated machines (on-demand) to maximize infrastructure utilization. IaaS is a promising instrument to implement and execute distributed bag-of-tasks (BoT) applications [14]. BoT is formed by a set of consistent applications (independent tasks) that can be executed in parallel using distributed hosts [8]. The effort to compute BoT with memory intensive applications has become common, being widely used in several applications, such as simulations, bioinformatics, data security, and cloud task scheduling [15, 25, 38, 34].
Cloud providers realized that dedicated hardware aren’t being fully used, having considerable idle resources to be offered as unreliable VMs as transient instances. Using these instances users can be revoked irreversibly according to each cloud providers’ rules. But these VMs are offered at considerably lower prices without any availability guarantee [30]. Amazon Elastic Compute Cloud (EC2) and Google Compute Engine (GCE) are offering transient instances using different strategies. Amazon adopts an auction based market mechanism with a dynamic pricing scenario based on users’ bids. Google is exploring idle resources where VMs are provisioned at considerable static value without price changes and usage time limits.
Despite cloud computing benefits, the use of transient instances indicates many relevant issues that still pose critical challenges. Considering automated management issues we may cite: dynamic resource allocation to compute applications ensuring reliability, efficient resource scheduling to achieve high-performance, dynamic flexible infrastructures to guarantee Quality of Services (QoS), ensure the accessibility to meet the desired System Level Agreement (SLA), security aspects (e.g., integrity and confidentiality of data, authentication), and fault-tolerant strategies to ensure accessibility, availability, reliability of services achieving cost optimization. In this challenging scenario, the main problem this research work addresses is the trade-off between reliability to ensure the effectiveness of users applications and the low cost of spot instances resources without availability guarantee.
In order to face the research problem to effectively use spot instances, intelligent techniques are desirable. More specifically, having intelligent agents with the ability to identify the environment in which they are inserted and act autonomously with learning skills would be useful [27]. As pointed out by [29], it is necessary to address three key factors to effectively use transient instances to fulfill user requests: (i) defining the right resource configuration according to user needs; (ii) creating a fault-tolerant strategy to avoid data loss if a failure occurs; and (iii) recovering from instance revocations, when they occur, to ensure application continuity.
Considering the use of autonomous agents to effectively manage spot instance resources, in this research we investigate how to dynamically and autonomously manage these resources to execute user applications ensuring greater reliability with optimized time and cost. Thus, we propose a fault-tolerant multi-agent architecture as middleware of cloud providers and users to mediate access to a wide range of heterogeneous resources providing a resilient application execution environment with a dynamic flexible fault-tolerant mechanism based on adaptive checkpointing.
To predict spot instance failures we use survival analysis and create a fault-tolerant execution plan considering the current availability scenario according the application needs. We build upon our previous work [4], where we demonstrated the viability of the prediction model achieving 92% success rate for survival prediction. The prediction results demonstrated the model potential as a solution to support decisions and efficiently use spot instances in cloud computing.
We evaluated the multi-agent architecture with real historical data collected from spot instances price changes achieving good results. Thus, we argue that the proposed architecture provides a novel approach that encompass a reasoning and a statistical model to predict Time Until Revocation, defining suitable fault-tolerant parameters to avoid extra time in application executions. Agents’ reasoning model uses Case-Based Reasoning (CBR) to decide based on previously observed cases, allowing continuous learning and problem-solving features similar to human being [1].
In summary, the key contributions of this paper are: (i) propose a platform independent multi-agent architecture for integrating BoT enabled systems and provide a fault-tolerant spot instance environment; and (ii) to increase reliability of users’ applications execution optimizing total time and costs. In this paper we propose the combination of features to predict spot instances failures, as proposed in [4], and a multi-agent architecture to dynamically create a fault-tolerant execution plan considering the current availability scenario according to users’ application needs.
The rest of this paper is organized as follows: the literature review is presented in Section 2; in Section 3 we provide details about the proposed multi-agent architecture; in Section 4 we present experimental results; a comparative study is summarized in Section 5; and in Section 6 we present conclusions and future work.
Literature review
As previously explained, the cloud computing unused resources are being exploited by cloud providers and significant work has been done to study the consumption of resources for the cheapest price in an attempt to provide a secure environment and avoid unexpected events. The assignation of unused cloud computational resources in the form of unreliable VMs is commonly known as transient resource. In the following, we briefly review related works focused on cloud resources usage in an attempt to create a fault-tolerant environment to run applications in an agent-based, spot instances and fault-tolerance perspectives.
Considering the use of fault-tolerant approaches using spot resource management, in [36] the authors propose a resource allocation strategy to address the problem of compute intensive application execution. To maximize the chance of finishing tasks on deadlines, the first approach uses a bidding mechanism that estimates future spot prices and support bidding decisions to cloud users. Using price history data authors present five bidding strategies: (i) a minimum value; (ii) a mean of all values; (iii) a current on-demand price; (iv) the high value observed; and (v) the current spot instance price. Efficient bidding strategies for spot instances are crucial in lowering execution costs, being ineffective for long executions. Even avoiding out-of-bid failure, if it occurs, all process data will be lost. As a second approach, this work evaluates an existing checkpointing-based approach using the same definitions used in [37]. The technique considered in this work is an hourly based execution state checkpointing, following the full-time charge rule applied by Amazon. As long applications imply increasing checkpoints, this technique can be improved by using a heuristic to define an appropriate checkpointing interval without compromising the total execution time. In our proposed architecture, an adaptive checkpointing is adopted. This flexible fault-tolerant approach adjusts checkpointing intervals according to the application execution time, e.g., superior to 60 minutes when long executions are present.
In [40, 39], authors simulated an auction with several checkpointing strategies to evaluate out-of-bid situations in spot instances: hourly based and at each price rising edges (when a spot price increasing is observed). Results show that the use of checkpointing technique tolerates instance failures and reduce costs when compared to on-demand instances. In [39] a time window evaluation was used in an adaptive checkpointing based on the spot history data compared to the current price. Authors state that using different bid value reflects on spot instance availability. Similar to [36], the total of checkpointing occurrence will be increased. Small checkpointing interval implies delayed executions and increases total user costs. Our approach is adapted for different bid strategies and each strategy imply on time until revocation, being an important element to define checkpointing interval according to current availability.
From a checkpoint perspective, an important element to be considered is the interval between saving states since it affects total execution time. Consider a task process TP that executes in a specific time
In [33, 41], authors use a set of smaller checkpointing intervals between 10 to 60 minutes. A defined interval of 10 minutes is the worst strategy when a large application is running. This is a scenario where the checkpointing time can exceed the interval time and increase user costs. In [33], the authors adopt job migration as a fault-tolerant technique that implicitly uses checkpointing over execution to be prepared for migration. Similar to other related works, the fault-tolerant parameters are static during execution time, ignoring the fact that changes occur over time considering an auction environment. A better checkpointing strategy is to predict new failures in run-time and refine parameters according to price changes. In [41], authors ignore the fact that CPU and memory intensive applications need considerable time to save the execution state, reflecting in increased total execution time. As observed in Algorithm 6, our proposal consider long executions as a trigger to evaluate and define appropriate checkpointing intervals.
Alternatively, in [20], authors propose a strategy in which a new checkpoint interval occurs every time price changes, i.e. checkpointing is defined by monitoring price changes. This approach is not interesting considering high demand instance types, e.g. r3.8xlarge, a memory-optimized instance type with 1.292.181 prices changes in the last two years (2017 and 2018) according to our records. In an auction scenario, high demand means increase price changes records. Our approach considers price changes when calculating time until revocation after each checkpointing and refine the interval parameter according to current availability.
Considering the multi-agent cloud management perspective, research focuses on security and resource provisioning. Using a model that focuses on private cloud providers, [12] proposes a framework that uses three agents to monitor execution, memory and CPU usage in the SaaS and IaaS layers. An important point to consider in this work is the use of agents with CBR model to predict resource usage and provide vertical elasticity of available resources. Also, historical execution data is needed to predict resource usage; whilst considered useful, it is not always possible. Transient resources are not being used in this work, the authors preferred using reliable servers with availability guarantees.
Considering that the research focus of this paper is on the use of multi-agent approach to effectively manage transient instance resources, the systematic literature review presented some works using agent approach with dedicated machines (on-demand) [31, 12, 2, 6]. The core agent behavior in these works use elastic resource allocation associated to prediction methods for effective resource provisioning in cloud computing. More recent works encompass dynamic resource provisioning approaches [19, 26].
Regarding the approaches with fixed checkpointing strategies, no work uses autonomous agents with spot instances. An important point is to consider checkpointing intervals to reduce total execution time and costs. Even works with adaptive checkpointing strategy, such as [23], authors focus the reliability of services and the availability of checkpointing storage in cloud environment using on-demand instances. While our approach focus on users’ applications execution reliability using spot instances to optimize total execution time and costs. In our prior work [4], we use a heuristic model with checkpoint/restore technique and a statistical model to predict time until revocation by analyzing price changes. Results presented high level of prediction accuracy (92%), but autonomous agents were not used to manage aspects with fault-tolerant technique according to execution scenario in real-time. In this work, an extension of [5] is proposed with increasing agent types, including recency effects to refine time until revocation, and extending the experimental scenario considering the amount of data and instance types which improved results as presented in Sections 3 and 4.
Taking into account this research investigates the efficient use of spot instance resources, our literature review focuses on using checkpoint/restore as the fault-tolerant technique to execution guarantee. Even with additional overhead, this technique is the most used one in cloud computing [13]. What is common among these fault-tolerant related works is the fact that the defined parameters are static and controlled by non-autonomous systems. A compilation of the cited works is presented in Table 1 compared to our proposal (marked with
Literature review summary
Literature review summary
To the best of our knowledge, there is no published research using the elements of this proposal, where the main contribution is a fault-tolerant multi-agent architecture to execute distributed BoT applications in a transparent way using spot instances. The proposed architecture combines CBR reasoning model (retrieve similar execution cases) associated to a survival analysis statistical model (predict failure events and ignore atypical known cases) to support agents’ reasoning to define adequate fault-tolerant plans with an adaptive checkpointing according current availability scenario to increase spot instances reliability optimizing total execution time and costs.
In this section, we present the fault-tolerant multi-agent architecture, that analyzes the current and historical scenario of spot instances (availability, prices and probabilistic indexes) to define an appropriate fault-tolerant execution plan and their respective definitions.
A negotiation layer between users and cloud providers enables a new breed of trust services. The management layer needs to intermediate external cloud resources in an autonomous way, choosing a convenient subset of resources to execute user applications, increasing resource usage and decreasing costs.
A resilient environment to run BoT is one of the most important and complex issues in cloud scenarios with non-guaranteed transient instances, which can be as complex as a multi-objective optimization problem since it attempts to reduce resource consumption, minimize user costs and ensure task execution.
Proposal overview
Transient instances’ revocation affects application performance compared to on-demand servers, since revocation affects the applications overhead because users have to deal with unexpected revocations in a non-transparent way. For example, consider a compute-intensive application that runs on a spot instance and periodically saves its execution state to a remote disk. After a revocation, the application can re-execute from its last consistent saved state. But saving many intervals incurs an overhead that increases running time and decreases the performance of a spot instance relative to an on-demand safe and reliable server.
Our proposal combines the CBR to retrieve similar execution cases, being supported by the survival analysis statistical model, a non-parametric technique used to ignore atypical known cases and helps agents to reason and define appropriate fault-tolerant parameters, e.g. the checkpointing intervals, to reduce cloud resource usage time and decrease user monetary costs. The overall working process of this proposal is presented in Fig. 1.
Flowchart of the overall working process of this proposal.
Figure 1 represents a flowchart that starts when a new application is submitted by the user, having a validation about required files and parameters. Considering a previous knowledge database generation (Algorith 3), after validation, a new spot instance will be requested and the time until revocation will be calculated by retrieving similar cases (Algorithm 3) and processing the survival curve (Algorithm 3), populating the survival time matrix. To define the appropriate fault-tolerant parameters (Algorithm 6), a survival rate is calculated after verifying an experimental scenario to observe if defined time until revocation is achieved by using historical price change data. In the end, a new remote execution will be created with a failure detector monitor, activating the behavior when the following events occur (Fig. 6: if it is the time to checkpointing, a new execution state will be saved in the repository; if an error occurred, recover from the last consistent saved state, retain the case as a failure case and a new execution will be created; and if the execution finished, the case will be retained and the flowchart ends.
Considering the significant amount of unused resources, cloud providers have been adopting a model that offers the unused capacity of VMs resources in the form of transient instances. Spot instances are VMs available at Amazon offered in an auction environment.
As opposed to on-demand instances, spot instances can be revoked without user intervention. Users request spot instances VMs using a bid value (
The value of
The VMs are distributed among distinct locations in a set of regions. Each region is divided into a subset of zones, which offer a set of independent resources and services with different failure probabilities. To launch a transient instance using Amazon EC2, a user submits a spot instance request command composed of 6-tuple, as presented in Eq. (2):
The Region and Zone are the primary attributes in a new spot request, being their availability defined according
The InstanceType represents the capabilities of it resources, like CPU, memory, and local storage capacity, e.g., the p3.8xlarge instance type represents the 3rd generation of GPU accelerated computing optimized instances with high frequency 4 GPUs, 32 vCPU, 64 gigabytes of GPU memory and 244 gigabytes of memory, commonly used to process machine or deep learning applications, speech recognition or high-performance computing.
Amazon EC2 currently supports the generation of 57 different instance types in the spot tier, although not all types are available in all Regions and Zones. The number of instances (QtdOfInstances) represents how many instances will be delivered to the user. Moreover, a set of attributes (AttributeSet) are needed to instance availability status be fulfilled, e.g. the attached storage, security chains, monitoring tasks and fleet strength, being a collection, or fleet, of unsecured spot instances, and optionally secured dedicated instances. When all the specifications for the user request are met, the spot instance VM is fulfilled, which can take a few minutes to be ready to use.
Finally, in order to use spot instances, a customer request is needed and must include a bid value (BidValue). Users define the maximum hourly value they are willing to pay (a bid), obtaining a VM to use while their bid exceeds (or is equal to) the current price of the instance. If the current price of a spot instance is equal or less than the user bid, the VM still available to the user. VMs are revoked when the current price of spot instance type exceeds the user’s bid. A bid fault occurs when the current price of the VM is above the user’s bid [28].
The price of a spot instance depends on the type of instance as well as VM demand within each data center. As an auction proposal, the spot instances are offered at lower prices compared to regular and safe on-demand servers, increasing their values when resource availability and the number of users send bids and receive VMs are considerable, decreasing otherwise.
According to the definition and classification of an environment, presented in [27], in this proposal scenario, the composition of spot instances is considered as a distributed, accessible, non deterministic, dynamic and discrete. A set of definitions about spot instances is presented in Table 2.
Definitions of principal elements used in this proposal related to spot instance
In order to use spot instances, the applications need to be intentionally fault tolerant, i.e. be prepared for unexpected failure events and recover from failures which are not always dominated by software designers.
Using spot instances allows reduced costs when running compute or memory intensive applications, as long as these applications are prepared for scenarios with performance degradation and server revocations.
Considering the use of cloud computing resources, resiliency means the capacity of a resource to be active, reliable, failure tolerant, recoverable, dependable, and secure, in case of unexpected failures that result in a temporary or permanent service disruption [3, 9, 35]. In the scope of our proposal, resilience is the ability to ensure application execution and is achieved through the fault-tolerant approach. Examples of six fully consistent fault-tolerant techniques: (i) The retry is the simplest fault-tolerant technique, since the recovery process consists of restarting the process from scratch, ignoring all processed data; (ii) the task re-submission fault-tolerant technique ignores processed data, resubmitting the failed task to another resource; (iii) the checkpoint and restore technique saves the application’s running state at specified time intervals. An overhead for each saved execution state is added because all process data and volatile memory needs to be saved in persistent memory; (iv) a set of parallel executions is the strategy used in replication, which increases the chances of execution success by increasing the number of parallel executions; (v) the software rejuvenation is used when an application needs a clean resource, rebooting the operating system according to schedule rules; (vi) and finally, job migration is a preventative approach that predict a future failure and migrate the running process to an environment with reliable resources by using a well-defined plan with safe instances.
Using fault-tolerant techniques minimizes the impact of VM unavailability. Results show that under more volatile scenarios, the use of fault-tolerant approaches become considerably more useful and provide significant benefits [39]. Intelligent approaches combined with fault-tolerant techniques allow increased resilience in scenarios that use spot instances to execute applications, avoiding data loss.
Analysis of reasoning model
The performance and availability of VMs in a cloud environment varies according to region, zone, instance types and different times of day [4], where the authors observe the existence of a pattern of price changes according to each day of the week and during each hour in the day.
Patterns can be observed in Fig. 2, that shows, for the price changes recovered from April/17 to December/18, how many price changes occurred on each day of the week and during each hour of the day, grouped by zones in the US-WEST region for M4Large instance type. The probability of bid faults increases when there are more price changes, consequently, few price alterations imply less failures.
Observed patterns in price changes in M4Large instance type.
A pattern can be observed in the m4.large instance type. The number of changes peaks during weekdays (a) (as opposed to weekends) and after 12am (c). The r4.xlarge has considerable price changes on weekends (b) and the number of changes by hour-in-day increases after 7pm (19 h) (d).
Our heuristic uses a model that analyzes historical and current prices of spot instances and their changes to define appropriate fault-tolerant parameters. Compared to existing similar studies, our proposal uses CBR to classify cases considering observed price change patterns of spot instances in terms of hour-in-day and day-of-week, as proposed in [4, 17, 16].
Our knowledge database is composed by cases attributes (Table 3) and is organized as a set of tuples (Eq. (3)) to be explored by our schematic CBR circle of the problem-solving process (illustrated in Fig. 3), composed by an initial knowledge generation (detailed in Algorithm 3), following by a set of eight sequential steps, as follows:
Schematic CBR circle overview.
Generating case knowledge database[1]
CasesGeneratorreg, zone, instType, initDate, finalDateSystem Initialization
addiction in addictions Processing each addiction to increase database
priceRow in priceHistoryList index i Each price change record
futurePriceRow in priceHistoryList Future price changes
censuredA higher price record was not found
Case (
Retrieving similar cases[1]
RetrieveSimilarDatareg, zone, instType, dayOfWeek, hourInDaySystem Initialization
case in model.cases Iterating on each case data
attribute in caseAttributes Iterating on each mapped attribute
functionType is null
(i) Current Problem – a step that represents a new submission problem containing the requested region, zone, instance, bid-strategy, day-of-week and hour-in-day attributes; (ii) Retrieve – as presented in Algorithm 3, this step uses similarity functions using Eq. (5) to fetch a set of similar cases according requested attributes. An amalgamation function is a weighted sum of all local similarities (attribute similarities) of a concept that constitutes the overall global similarity measure of the concept [32]. At the retrieving step, the similarity function process case attributes (Eq. (4)) and uses different weights (Table 3) to get closer or max calculated weights cases; (iii) Similar Cases – represents a structure with a set of similar cases retrieved to be reused as reference; (iv) Reuse – the adaptation and reuse of the retrieved cases are needed in this step, being necessary to construct a proposed solution to the new problem according to previously observed cases; (v) Solved Case – after adaptation, a set of cases is used as input to an algorithm that evaluates each case data, ignoring atypical cases and calculate time until revocation, as presented in Algorithm 3; (vi) Proposed Solution – a result of solved case step, a proposed solution data is composed by the time until revocation considering current availability scenario and bid strategy; (vii) Revise – the proposed solution is evaluated in this step to confirm its efficiency in a scenario with real test cases. In this scenario, the time until revocation are observed; (viii) Retain Case – after evaluation, it is necessary to explore the solved case and transform it into a learned case, adding its solution to the local knowledge database. If the proposed time until revocation was achieved, a success case will be added to the knowledge database, followed by the defined fault parameters. If the time until revocation was not achieved, the properties of the case is the same, but the current execution time will be present on new similar cases obtained in the retrieving step and changing new proposed solutions that used recency factor, as presented in Section 4.3.
Processing
ProcessSurvivalTimereg, zone, instType, confLevel, dow, hidSystem Initialization
case in cases index i
interval in sortedIntervals To find the closer interval
Considering that the composition of the attributes (region, zone, instance type, day-of-the-week and hour-in-day) can influence the revocation time, the similarity functions implemented on a CBR model offer an efficient approach to solve the problem posed in this paper. Once with a case database generated, it is necessary to retrieve similar cases as presented in Algorithm 3.
To increase the reasoning process, this model uses the equivalence and adaptation approaches, which involves the understanding of previous cases to assist with similar decisions and, in the case of failure, allows their adaptation.
Our CBR formalization model includes a case (
Each similarity function is used not only in case retrieval but also to adapt cases to augment the knowledge database, being determined as the proximity of attributes’ values and their respective weights are defined according to the context of the problem.
Context can be considered as an interpretation of a case, where only a selected subset of attributes are considered relevant compared to others, having their specifications and weights defined according to the detailed information in Table 3.
Attributes with N/A function strategy indicates that they are not used in a similarity function, being an important data used by the agent-based process, detailed on Section 3. A context
Using the history of price traces (
A set of real cases can be defined as
When matching cases, some attributes are forced to be equal. In a similarity process, we partition
Using the above partition of the attributes, we then calculate the similarity function between cases as presented in Eq. (5):
The adjustment indicator function,
A convenient statistical model is used to support intelligent agents’ decisions and offer a reliable environment to execute applications when spot instances are used in a cloud computing business model, where price and availability vary according to supply and demand. Statistical models provide mechanisms for extracting information, sometimes without complete data, providing support to organize, analyze and present processed data that may be useful in decision making processes.
Our model uses a Survival Analysis statistical model [24], that is a class of statistical methods which is used when the time to a specific event is one of the principal factors which the data represents. This model, much used in medical research, involves variables associated with time, following observations until the occurrence of an event of interest, frequently failure [10].
The event under study is a bid fault and the time to this event, time until revocation, is treated as the main element. The results were not enough using pure CBR, being necessary to integrate a statistical model to support agent decisions. A non-parametric technique is used, which incorporates incomplete (censored) data to avoid estimation bias. This method requires three elements, as follows:
Time indexed case data. An estimator function The confidence level of prediction approach
We would like to estimate the largest Survival Time (
As presented in Algorithm 3, the
If no deaths occur in a given interval then the survival curve does not decrease. A survival curve with respective times and confidence levels can be observed in Fig. 4.
Survival curves with respective survival times of some spot instances at sunday 6am.
According to Fig. 4, given the 98
We use recovered similar cases from the case knowledge database to produce a survival curve, which is used by the executor agent to predict the time until revocation according to a confidence level which provides sufficient security. The agent estimates the largest
This time is defined by
The working process of using survival analysis is presented in Algorithms 3 and 6, being explored as an important element to process time to revocation using 98
In this proposal, the agents extends the architecture using different approaches according to the current scenario, where application execution, VM monitoring and fault-tolerant execution plan can be managed dynamically and autonomously for a set of cloud providers and their respective services and resources.
According to the definition and classification presented in [27], in this proposed environment, the composition of spot instances is considered as a distributed, accessible, non-deterministic, dynamic and discrete.
Our proposal contains a set of autonomous agents composed by a quadruple
An abstract view of agents and their autonomy can be formalized. First, let us assume
The BE corresponds to a sequence of interleaved environment states and agent behaviours
Representing the history
To represent the effect that an agent behaviour implies on an environment, a state transformer function is defined as
Proposed multi-agent architecture.
To understand the elements of the proposal regarding the agents and their reasoning modeling, the multi-agent architecture is presented in Fig. 5 and some definitions can be formalized in the following way: sen-avail is composed by a sensor that observes the availability of spot instances in a cloud environment; sen-price is a set of finite sensors that monitor spot instances price changes. According to the nature of an auction, the price varies according to supply and demand and the system needs to be updated; sen-fail represents a set of sensors that observe a cloud environment with available and running spot instances, listening for revocation events; sen-exec is composed by sensors that obtain a set of
Agents’ PEAS definitions
The proposed architecture is composed of a finite set of agents that shares the entire environment, which communicate with each other using a multi-agent protocol communication layer to enable interoperability with different technologies. The integration with external cloud infrastructures occurs through pluggable modules to support other cloud providers. Moreover, an API that offers external access to the facade agent (Verifier) is provided to enable integration with external systems.
From the layer definitions presented in Fig. 5 and agent’s PEAS detailed in Table 4, a brief description follows: the Verifier agent checks user input and chooses which resources and instances types will be used based on user requirements, which includes application files, estimated execution time and the amount of CPU and memory required; the Price Monitor agent monitors a set of available price changes in the cloud environment. Each spot instances price change triggers a process on a Core agent to generate a case to be added the CBR knowledge database with the actual time until revocation based on the new record; the Core agent defines the most appropriate fault-tolerant technique and its respective parameters. To achieve high levels of accuracy, it uses an approach to predict time until revocation [4], having a reasoning process detailed in next sections. It produces elements to calculate a survival rate, which is mandatory information to support agent decisions; the Executor Manager agent is responsible to manage the cloud spot instances, creating and allocate
In our scenario, each agent has a defined environment, e.g. the Price Monitor agent is limited to observe only the auction environment, whereas Executor and Recover agents share the same environment. A brief sequence diagram, which summarizes the interaction between the agents since a new BoT submission until a failure detection event is presented in Fig. 6 as a set of eight steps, as follows: Step 01: the user provides the application and parameters to the Verifier agent; Step 02: the Verifier agent validate user input and asks the PriceMonitor agent to inform an appropriate instance type according to user parameters. At this time, the Verifier agent is responsible to reject or accept user submission according to architecture requirements, e.g. application files, bid-value, custom checkpointing interval and application time; Step 03: After PriceMonitor response, the VerifierAgent inform the Core agent to create an execution plan. At a result, the Core agent will receive the chosen instance and user params; Step 04: the Core agent retrieve similar cases according current day-of-week, hour-in-day, region, zone and instance type, calculate time until revocation using similarity functions (Eq. (5)) and define a fault-tolerant execution plan, according to Algorithm 6. As a result, the agent informs the most appropriate fault-tolerant execution plan to be used in the current application execution; Step 05: the Core agent asks the Executor Manager agent to start execution on
Sequence diagram of BoT and behavior in the presence of a failure.
Considering a user execution requirement with
As an agent-based project, the characterization of agents is necessary for the understanding of their objectives and interactions. As proposed by [27], the agents’ characterization is presented through the Performance, Environment, Actuators and Sensors (PEAS) aspects as presented in Table 4.
Given an application’s execution time (appTime) and
Defining the appropriate fault-tolerant execution plan[1]
DefineExecutionPlanappTime, region, zone, instType, bidValueSystem Initialization
validDataApp is able to execut
priceRow in priceHistoryList index i Each price change record
instant increasing 60 Time in minutes
not errorOcurred
A required parameter (appTime) is a mandatory data and it is used in the first step: a checkpoint interval will be adopted in a new execution or will be ignored. In [36, 40], the authors consider longer executions greater than one hour as the simplest and most intuitive, yet effective, way for dealing with the cost/reliability trade-off when running applications on spot instances.
The reasoning detailed in Algorithm 6 considers an execution with no checkpoint as the primary fault-tolerant plan to be applied when appTime is considered short (line 23), i.e. less than 60 minutes not requiring any other parameters. When appTime is considered long (line 5), the survival rate and its success are considered mandatory parameters, having a survival rate calculated (between lines 11 and 20), considering a successful when instant time overtake the processed survival time (line 20).
If success is high, a convenient checkpointing interval is calculated from
When the survival rate offers low confidence, a checkpoint fault-tolerant plan which does not use
The main goal of our proposal is to observe the actual environment and find an optimal fault-tolerant plan and parameters that minimize the total runtime and reduce costs. To reach these goals, our agents use the approach presented in [4], which uses CBR, survival analysis, and quantiles of the time until revocation to find a probable survival time and use it as part of a strategy to avoid occurrences of failures. In addition to [4], elements of this approach can extend the fault-tolerant definitions by using other techniques, like recency factor to obtain better results, no checkpointing or even user defined exception handling.
Experimental scenario
A set of Amazon’s spot instances and it respective price change history was used to evaluate our proposal and predict the
Using 19 months of real price changes, provided and collected from Amazon AWS from April of 2017 to October of 2018 (approximately 21 million records), our experiments simulate 1.362.816 scenarios using 78 instances, 4 bid strategies (actual, mean of last week, median of last week, mean last day), two Regions (US-WEST-1 and US-WEST-2) in a 13 weeks scenario (from August to October) with all relations of day-of-week and hour-in-day.
In order to define the optimal range of data to use for case generation and survival curve estimation, we calculated the success rate as a function of the number of months of previous data used. The results can be found in Fig. 7.
Successful rate according to amount of used data (in months).
According Fig. 7, two important features require comment. First, there is a general trend of increasing success as the number of months of data used increases. This is to be expected, as long as the data generating process has not changed significantly over the period analyzed, since more data allows us to more accurately estimate the non-parametric survival curves, leading to better estimates of the revocation time. Second, there is a significant dip in the success rate at 9 months. This occurs because of the introduction of new instance types which have few observations, leading to poor estimation and increased failures for these instance types, reducing the overall success rate.
Given the increasing success rate as more observations are used, we prefer the use of all the data collected over the entire period. From the collected data, approximately 110 million cases were generated, considering all zones and instances from US-WEST-1 and US-WEST-2 regions.
Each day-of-week and hour-in-day relationship has its own survival curve, being represented as a Survival Time Matrix (STM), as illustrated in Table 5, which shows time until revocation, in minutes, obtained by using the 98
Generated STM of r4.xlarge instance using values from the 98
confidence level and extracted from their respective survival curve
Generated STM of r4.xlarge instance using values from the 98
To evaluate STM times, a set of experiments that compare STM values with real scenarios was created. Each value was compared with an auction simulation. Each time until revocation extracted from a relationship between day-of-week and hour-in-day in STM time that is achieved by a simulation will increase by 1, 0 otherwise, as detailed in Algorithm 6. The results are shown on heat map graphic presented in Fig. 8.
Heat map of survival success compared of time until revocation times in r3.large.
Using the same experimental scenario, 1.362,816 simulations were executed. An example of the memory optimized instance type (r3.large) is illustrated in Fig. 8, achieving success rates around 80% with standard deviation of 1.56 percentage points. Considering all 78 instances’ simulations, the success rate goes up to around 85%. This occurs because some instances with elevated monetary costs have stable variations, allowing long time until revocation times.
An unsecured gap can be observed on Wednesday and Thursday between 12pm and 6pm (18 h), with 7 failures in 13 attempts, showing that another bid strategy is needed to increase the observed time until revocation and achieve better success rates.
To achieve better results, a new strategy can be incorporated into the
To refine the STM values and include a recency factor to time until revocation, the last 4 weeks (Jul/18) were used to simulate real time until revocation, creating a new subset of cases and calculating the median times between them to reproduce a new recent scenario. With this change, a new STM was generated (STM’), representing new values in the day-of-week and hour-in-day relationship with survival times that consider recent data. A generated STM’ is presented in Table 6 with new values compared to Table 5.
A new STM’ created after recency strategy in
function
A new STM’ created after recency strategy in
Better results were achieved with this recency change and can be observed in Fig. 9. Giving greater weight to consider more recent results allowed a gain of 12%, reaching a 92% success rate under the same conditions of our experiment. A comparative achieved gain can be observed in Table 7 regarding instance types illustrated in Fig. 10.
Heat map of survival success after recency factor in 
The new heat map with new experimental results for the day-of-week and hour-in-day relationship can be observed in Fig. 9, increasing success compared to Fig. 8. Green areas means that time prediction was achieved in a real auction scenario. Considering that Amazon AWS provides a wide selection of instance types optimized to fit different use cases, giving the user flexibility to choose the appropriate mix of resources for her applications, a set of optimized instance types are used in our experiments and a subset is presented in Fig. 10.
Successful rates (in percentage) obtained by experiments in Fig. 10
Set of heat map results considering distinct instance types.
A compute optimized instance type was used in (a) and (b), a kind of instance used on compute-intensive workloads that delivers very cost-effective high performance at a low price per compute ratio. The results of an accelerated computing instance with GPU resources optimized for graphics-intensive applications can be found in (c) and (d). The results shown in (e) and (f) represent a storage optimized instance with low latency and very high random I/O performance. Lastly, (g) and (h) represent a general purpose instance that provides a balance of compute, memory, and network resources. A compilation of the results obtained by the experiments presented in Fig. 10 can be observed in Table 7.
Observing the results for (g & h), even using the recency strategy only a small gain was obtained, from 32% to 39%. This kind of instance has considerable price changes over time, requiring a different strategy which considers more recent price change patterns.
Compared to Fig. 10(h), better results were achieved using the last 7 days as a validation period in our recency factor strategy, as shown in Fig. 11.
New heat map of M4.xlarge Instance considering recency of 7 days.
Given the prediction accuracy presented in [4] (92%), a reliable time until revocation extracted from STM can be used to define an appropriate fault-tolerant execution plan and parameters, e.g. checkpoint fault-tolerant mechanism and their intervals, as detailed in Algorithm 6.
Considering an application that requires time


Regarding the use of checkpoint/restore fault-tolerant technique, commonly used when
In (7) we present our approach’s results with respect to different
Since some related works consider only compute-intensive applications, the cost of the checkpoint is low and short intervals do not significantly affect the execution time. Experimental results for different
Results of total time from different scenarios
Table 8 illustrates the total time results. Although different behaviors are observed, in general, the agents perform well using
Results of total cost (in dollar US) charged from different scenarios
The number of checkpoints, CA, and the length of time for each checkpoint,
Results of checkpoint calculated from different scenarios
Table 8 demonstrates that a gain of 49.17% with respect to total execution time can be observed compared to (
All collected data and source code are available at the project website and can be used to reproduce our experiments and test other scenarios not contemplated in this paper.
A resilient multi-agent architecture was presented in this paper to provide an efficient way to use spot instances to run users’ applications achieving better levels of reliability and reduce total execution time and costs, focusing on an important element in a resilient environment: fault tolerant behaviors. The architecture uses a prediction approach to support agent decisions in a dynamic scenario. To the best of our knowledge, there is no published work that combines the use of intelligent agents with CBR, survival analysis to predict failure events and FT mechanism based on adaptive checkpointing to increase reliability of user applications in spot instances.
We believe the proposed approach can be used to define fault-tolerant techniques and their respective parameters. Moreover, to increase longer survival times, new bid strategies can be introduced, e.g. using bid values defined by a percent increase over actual instance prices. Using real price traces of spot instances, our experiments have confirmed the accuracy of the proposed model.
We identified instances for which our methodology yielded poor success rates. For these instances, the data generating process appears to be changing more quickly than for other instances. To capture these changes we implemented a recency factor which gives greater weight to more recent observations data and increases success rates.
As future work, the proposed architecture can be extended to deal with multiple cloud regions by the same provider considering data transfer costs between different regions. In addiction, a multiple provider approach can be used, extending the number of pricing models according to each cloud provider. Also, other strategies can be explored to exploit more recent data, such as comparing recent instance price changes to update fault-tolerant parameters. To increase the reliability, a combination of fault-tolerant plans can be used, e.g. a multi-strategic fault-tolerant framework.
Footnotes
Acknowledgments
This work was supported in part by the Brazilian Coordination for the Improvement of Higher Education Personnel – (CAPES) through grant number 1441250. Prof. Celia Ghedini Ralha thanks the support received from the Brazilian National Council for Scientific and Technological Development (CNPq) for the research productivity in Computer Science area through grant number 311301/2018-5.
Authors’ Bios
