Towards increasing reliability of Amazon EC2 spot instances with a fault-tolerant multi-agent architecture

Abstract

Cloud providers have recently begun offering their unused resources as transient instances. Amazon sells idle cloud resources as spot instances, priced through an auction-based market mechanism, at reduced cost but without any availability guarantee. Thus, dynamically and autonomously managing cloud resources to execute user applications with greater reliability on cheaper spot instances remains an open problem. In this context, we propose a fault-tolerant multi-agent architecture that acts as middleware between cloud providers and users, mediating access to a wide range of heterogeneous resources and providing a resilient application execution environment with a dynamic, flexible fault-tolerant mechanism based on adaptive checkpointing. Our architecture combines a case-based reasoning model with a survival analysis model to predict failure events and refine fault-tolerant plans with adequate parameters, increasing reliability while optimizing total execution time and costs. We evaluated the proposed architecture with real historical data collected from Amazon EC2 price changes, comprising approximately 21 million records and generating 1,362,816 scenarios stored in our case knowledge database. The results considering the time to revocation achieved high levels of accuracy (98%) with a gain of up to 74.48% in total execution time, reducing total cost when compared to other approaches in the literature.

Keywords

Cloud computing, spot instances, fault tolerance, adaptive checkpointing, case-based reasoning, survival analysis

1. Introduction

Cloud computing has evolved to offer services to execute applications with high flexibility, scalability and availability at affordable prices for cloud users and vendors [7]. It has emerged as a new paradigm that provides various types of computing resources to serve diverse application scenarios, and is one of the most popular and important information technology trends [18].

Organizations are shifting their business to cloud providers, which should furnish resources to serve users transparently over the Internet. The cloud architecture consists of distributed resources with a set of services offered under a pay-as-you-go model, making it an interesting alternative for exploiting resources that often appear to be unlimited [22].

The cloud Infrastructure as a Service (IaaS) is a base class that offers a set of Virtual Machines (VMs) with different configurations and capacities, allowing users to choose according to their needs. Since users’ resource demands change constantly, cloud providers offer VM services at lower prices than dedicated (on-demand) machines to maximize infrastructure utilization. IaaS is a promising instrument to implement and execute distributed bag-of-tasks (BoT) applications [14]. A BoT is formed by a set of consistent applications (independent tasks) that can be executed in parallel using distributed hosts [8]. Computing BoT with memory-intensive applications has become common and is widely used in several domains, such as simulations, bioinformatics, data security, and cloud task scheduling [15, 25, 38, 34].

Cloud providers realized that dedicated hardware is not fully used, leaving considerable idle resources that can be offered as unreliable VMs, known as transient instances. These instances can be revoked irreversibly according to each cloud provider’s rules, but they are offered at considerably lower prices without any availability guarantee [30]. Amazon Elastic Compute Cloud (EC2) and Google Compute Engine (GCE) offer transient instances using different strategies. Amazon adopts an auction-based market mechanism with dynamic pricing based on users’ bids. Google is exploring idle resources where VMs are provisioned at considerable static value without price changes and usage time limits.

Despite cloud computing benefits, the use of transient instances raises many relevant issues that still pose critical challenges. Considering automated management, we may cite: dynamic resource allocation to compute applications while ensuring reliability; efficient resource scheduling to achieve high performance; dynamic, flexible infrastructures to guarantee Quality of Service (QoS); accessibility that meets the desired Service Level Agreement (SLA); security aspects (e.g., integrity and confidentiality of data, authentication); and fault-tolerant strategies that ensure accessibility, availability and reliability of services while optimizing cost. In this challenging scenario, the main problem this research addresses is the trade-off between the reliability needed to ensure the effectiveness of users’ applications and the low cost of spot instance resources, which come without availability guarantee.

In order to face the research problem to effectively use spot instances, intelligent techniques are desirable. More specifically, having intelligent agents with the ability to identify the environment in which they are inserted and act autonomously with learning skills would be useful [27]. As pointed out by [29], it is necessary to address three key factors to effectively use transient instances to fulfill user requests: (i) defining the right resource configuration according to user needs; (ii) creating a fault-tolerant strategy to avoid data loss if a failure occurs; and (iii) recovering from instance revocations, when they occur, to ensure application continuity.

Considering the use of autonomous agents to effectively manage spot instance resources, in this research we investigate how to dynamically and autonomously manage these resources to execute user applications ensuring greater reliability with optimized time and cost. Thus, we propose a fault-tolerant multi-agent architecture as middleware of cloud providers and users to mediate access to a wide range of heterogeneous resources providing a resilient application execution environment with a dynamic flexible fault-tolerant mechanism based on adaptive checkpointing.

To predict spot instance failures we use survival analysis and create a fault-tolerant execution plan considering the current availability scenario according to the application needs. We build upon our previous work [4], where we demonstrated the viability of the prediction model, achieving a 92% success rate for survival prediction. The prediction results demonstrated the model’s potential as a solution to support decisions and efficiently use spot instances in cloud computing.

We evaluated the multi-agent architecture with real historical data collected from spot instance price changes, achieving good results. Thus, we argue that the proposed architecture provides a novel approach that encompasses a reasoning model and a statistical model to predict Time Until Revocation, defining suitable fault-tolerant parameters to avoid extra time in application executions. The agents’ reasoning model uses Case-Based Reasoning (CBR) to decide based on previously observed cases, allowing continuous learning and problem-solving features similar to those of human beings [1].

In summary, the key contributions of this paper are: (i) to propose a platform-independent multi-agent architecture for integrating BoT-enabled systems and providing a fault-tolerant spot instance environment; and (ii) to increase the reliability of users’ application execution while optimizing total time and costs. In this paper we propose combining features to predict spot instance failures, as proposed in [4], with a multi-agent architecture that dynamically creates a fault-tolerant execution plan considering the current availability scenario according to users’ application needs.

The rest of this paper is organized as follows: the literature review is presented in Section 2; in Section 3 we provide details about the proposed multi-agent architecture; in Section 4 we present experimental results; a comparative study is summarized in Section 5; and in Section 6 we present conclusions and future work.

2. Literature review

As previously explained, cloud providers are exploiting unused cloud computing resources, and significant work has been done to study how to consume these resources at the cheapest price while providing a secure environment and avoiding unexpected events. Unused cloud computational resources assigned in the form of unreliable VMs are commonly known as transient resources. In the following, we briefly review related work that uses cloud resources to create a fault-tolerant environment for running applications, from the agent-based, spot instance and fault-tolerance perspectives.

Considering fault-tolerant approaches to spot resource management, in [36] the authors propose a resource allocation strategy to address the problem of compute-intensive application execution. To maximize the chance of finishing tasks by their deadlines, the first approach uses a bidding mechanism that estimates future spot prices and supports cloud users’ bidding decisions. Using price history data, the authors present five bidding strategies: (i) the minimum value; (ii) the mean of all values; (iii) the current on-demand price; (iv) the highest value observed; and (v) the current spot instance price. Efficient bidding strategies for spot instances are crucial in lowering execution costs, but they are ineffective for long executions. Even when they aim to avoid out-of-bid failures, if one occurs, all process data will be lost. As a second approach, this work evaluates an existing checkpointing-based approach using the same definitions as in [37]. The technique considered is hourly checkpointing of the execution state, following the full-time charge rule applied by Amazon. Since long applications imply a growing number of checkpoints, this technique can be improved by using a heuristic to define an appropriate checkpointing interval without compromising the total execution time. In our proposed architecture, adaptive checkpointing is adopted. This flexible fault-tolerant approach adjusts checkpointing intervals according to the application execution time, e.g., to more than 60 minutes when long executions are present.

In [40, 39], the authors simulated an auction with several checkpointing strategies to evaluate out-of-bid situations in spot instances: hourly and at each rising price edge (when a spot price increase is observed). Results show that checkpointing tolerates instance failures and reduces costs when compared to on-demand instances. In [39], a time window evaluation was used in adaptive checkpointing based on the spot history data compared to the current price. The authors state that the bid value affects spot instance availability. Similar to [36], the number of checkpoints increases. A small checkpointing interval delays executions and increases total user costs. Our approach adapts to different bid strategies; each strategy affects the time until revocation, which is an important element for defining the checkpointing interval according to current availability.

From a checkpoint perspective, an important element to be considered is the interval between saved states, since it affects total execution time. Consider a task process TP that executes in a specific time $\textit{TP}_{t}=$ 2880 minutes ( $\approx$ 48 hours), with a checkpoint interval $T_{\textit{chk}}=$ 60 minutes and an overhead $T_{\textit{over}}=$ 10 minutes per checkpoint. The resulting additional execution time $T^{+}=((\textit{TP}_{t}/T_{\textit{chk}})-1)\cdot T_{\textit{over}}$ is 470 minutes ( $\approx$ 8 hours), giving a total execution time $\textit{TPT}=\textit{TP}_{t}+T^{+}$ of 3350 minutes ( $\approx$ 56 hours). Thus, a well-defined checkpointing approach is essential to decrease execution time. A larger $T_{\textit{chk}}$ implies faster executions, reducing monetary costs. The following works are presented in order of decreasing checkpointing intervals.
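
The arithmetic of this example can be checked with a short script. This is a minimal sketch that only encodes the formula stated in the text; the function and parameter names are illustrative and do not come from the original work.

```python
def checkpoint_overhead(task_minutes, interval_minutes, overhead_minutes):
    """Additional time T+ = ((TP_t / T_chk) - 1) * T_over, as defined in the text."""
    return ((task_minutes / interval_minutes) - 1) * overhead_minutes


def total_execution_time(task_minutes, interval_minutes, overhead_minutes):
    """Total time TPT = TP_t + T+."""
    return task_minutes + checkpoint_overhead(
        task_minutes, interval_minutes, overhead_minutes
    )


if __name__ == "__main__":
    # TP_t = 2880 and T_over = 10 come from the example; the intervals
    # other than 60 are extra illustrative values.
    for interval in (30, 60, 120):
        extra = checkpoint_overhead(2880, interval, 10)
        total = total_execution_time(2880, interval, 10)
        print(f"T_chk={interval:>3} min  T+={extra:6.0f} min  TPT={total:6.0f} min")
```

With a 60-minute interval the script reproduces the 470 and 3350 minutes of the example; doubling the interval to 120 minutes lowers the overhead to 230 minutes, which illustrates why larger intervals shorten failure-free executions.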

In [33, 41], the authors use a set of smaller checkpointing intervals, between 10 and 60 minutes. A fixed interval of 10 minutes is the worst strategy when a large application is running, since the checkpointing time can exceed the interval and increase user costs. In [33], the authors adopt job migration as a fault-tolerant technique that implicitly uses checkpointing during execution to be prepared for migration. Similar to other related works, the fault-tolerant parameters are static during execution, ignoring the fact that conditions change over time in an auction environment. A better checkpointing strategy is to predict new failures at run time and refine parameters according to price changes. In [41], the authors ignore the fact that CPU and memory-intensive applications need considerable time to save the execution state, which increases total execution time. As observed in Algorithm 6, our proposal considers long executions as a trigger to evaluate and define appropriate checkpointing intervals.

Alternatively, in [20], the authors propose a strategy in which a new checkpoint occurs every time the price changes, i.e., checkpointing is defined by monitoring price changes. This approach is unsuitable for high-demand instance types, e.g., r3.8xlarge, a memory-optimized instance type with 1,292,181 price changes in the last two years (2017 and 2018) according to our records. In an auction scenario, high demand means more price change records. Our approach considers price changes when calculating the time until revocation after each checkpoint and refines the interval parameter according to current availability.

Considering the multi-agent cloud management perspective, research focuses on security and resource provisioning. Using a model that focuses on private cloud providers, [12] proposes a framework that uses three agents to monitor execution, memory and CPU usage in the SaaS and IaaS layers. An important point to consider in this work is the use of agents with a CBR model to predict resource usage and provide vertical elasticity of available resources. However, historical execution data is needed to predict resource usage; whilst useful, it is not always available. Transient resources are not used in this work; the authors preferred reliable servers with availability guarantees.

Considering that the research focus of this paper is on the use of a multi-agent approach to effectively manage transient instance resources, our systematic literature review found some works using an agent approach with dedicated (on-demand) machines [31, 12, 2, 6]. The core agent behavior in these works uses elastic resource allocation associated with prediction methods for effective resource provisioning in cloud computing. More recent works encompass dynamic resource provisioning approaches [19, 26].

Regarding the approaches with fixed checkpointing strategies, no work uses autonomous agents with spot instances. An important point is to consider checkpointing intervals to reduce total execution time and costs. Even in works with an adaptive checkpointing strategy, such as [23], the authors focus on the reliability of services and the availability of checkpointing storage in a cloud environment using on-demand instances, while our approach focuses on the reliability of users’ application execution using spot instances to optimize total execution time and costs. In our prior work [4], we used a heuristic model with the checkpoint/restore technique and a statistical model to predict time until revocation by analyzing price changes. Results showed a high level of prediction accuracy (92%), but autonomous agents were not used to manage the fault-tolerant technique according to the execution scenario in real time. In this work, an extension of [5] is proposed with more agent types, recency effects to refine the time until revocation, and an extended experimental scenario in terms of the amount of data and instance types, which improved results as presented in Sections 3 and 4.

Since this research investigates the efficient use of spot instance resources, our literature review focuses on checkpoint/restore as the fault-tolerant technique to guarantee execution. Even with additional overhead, this technique is the most used one in cloud computing [13]. What these fault-tolerant related works have in common is that the defined parameters are static and controlled by non-autonomous systems. A compilation of the cited works is presented in Table 1, compared to our proposal (marked with $[\>*\>]$ ).

Table 1
Literature review summary

References	Multi-agent	Checkpoint and restore	Recent spot instance generations	User transparent	Adaptive checkpointing	CBR	Survival analysis
[40]		$\CIRCLE$
[37]		$\CIRCLE$
[36]		$\CIRCLE$
[33]		$\CIRCLE$	$\CIRCLE$	$\CIRCLE$
[41]		$\CIRCLE$	$\CIRCLE$	$\CIRCLE$
[39]		$\CIRCLE$		$\CIRCLE$
[20]		$\CIRCLE$		$\CIRCLE$
$[\>*\>]$	$\CIRCLE$	$\CIRCLE$	$\CIRCLE$	$\CIRCLE$	$\CIRCLE$	$\CIRCLE$	$\CIRCLE$

To the best of our knowledge, there is no published research using the elements of this proposal, whose main contribution is a fault-tolerant multi-agent architecture to execute distributed BoT applications transparently using spot instances. The proposed architecture combines a CBR reasoning model (to retrieve similar execution cases) with a survival analysis statistical model (to predict failure events and ignore atypical known cases) to support agents’ reasoning in defining adequate fault-tolerant plans with adaptive checkpointing according to the current availability scenario, increasing spot instance reliability while optimizing total execution time and costs.

3. Proposed work

In this section, we present the fault-tolerant multi-agent architecture, which analyzes the current and historical scenario of spot instances (availability, prices and probabilistic indexes) to define an appropriate fault-tolerant execution plan, together with its respective definitions.

A negotiation layer between users and cloud providers enables a new breed of trust services. The management layer needs to intermediate external cloud resources in an autonomous way, choosing a convenient subset of resources to execute user applications, increasing resource usage and decreasing costs.

A resilient environment to run BoT is one of the most important and complex issues in cloud scenarios with non-guaranteed transient instances, which can be as complex as a multi-objective optimization problem since it attempts to reduce resource consumption, minimize user costs and ensure task execution.

3.1 Proposal overview

Transient instances’ revocation degrades application performance compared to on-demand servers, since it adds overhead to applications: users have to deal with unexpected revocations in a non-transparent way. For example, consider a compute-intensive application that runs on a spot instance and periodically saves its execution state to a remote disk. After a revocation, the application can re-execute from its last consistent saved state. But saving state at many intervals incurs an overhead that increases running time and decreases the performance of a spot instance relative to a safe and reliable on-demand server.

Our proposal combines CBR, used to retrieve similar execution cases, with a survival analysis statistical model, a non-parametric technique used to ignore atypical known cases. Together they help agents to reason and define appropriate fault-tolerant parameters, e.g., the checkpointing intervals, to reduce cloud resource usage time and decrease user monetary costs. The overall working process of this proposal is presented in Fig. 1.
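
To make the role of the survival model concrete, the sketch below estimates a survival curve from retrieved cases. It assumes the Kaplan-Meier estimator, a common non-parametric choice that this section does not name, so it should be read as an illustration rather than as the estimator actually used. The function name and the input layout (durations in minutes plus a censoring flag per case) are also assumptions.

```python
from typing import List, Tuple


def kaplan_meier(durations: List[float], censored: List[bool]) -> List[Tuple[float, float]]:
    """Return [(t, S(t))] at each distinct time where a revocation was observed.

    durations[i] is the observed time until revocation (or until observation
    ended); censored[i] is True when no revocation was observed for case i.
    """
    samples = sorted(zip(durations, censored))
    at_risk = len(samples)  # cases still running just before the current time
    survival = 1.0
    curve = []
    i = 0
    while i < len(samples):
        t = samples[i][0]
        events = 0   # revocations observed exactly at time t
        removed = 0  # all cases leaving the risk set at time t
        while i < len(samples) and samples[i][0] == t:
            if not samples[i][1]:
                events += 1
            removed += 1
            i += 1
        if events:
            survival *= 1 - events / at_risk
            curve.append((t, survival))
        at_risk -= removed
    return curve


if __name__ == "__main__":
    # Hypothetical retrieved cases: minutes until revocation and censoring flags.
    minutes = [30, 60, 60, 90, 120]
    is_censored = [False, False, True, False, True]
    for t, s in kaplan_meier(minutes, is_censored):
        print(f"t={t:>4} min  S(t)={s:.2f}")
```

Censored cases lower the number of instances at risk without counting as failures, which is why the censoring flag matters when estimating the time until revocation.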

Figure 1.

Flowchart of the overall working process of this proposal.

Figure 1 presents a flowchart that starts when the user submits a new application, which is validated for required files and parameters. Assuming the knowledge database has been generated beforehand (Algorithm 3), after validation a new spot instance is requested and the time until revocation is calculated by retrieving similar cases (Algorithm 3) and processing the survival curve (Algorithm 3), populating the survival time matrix. To define the appropriate fault-tolerant parameters (Algorithm 6), a survival rate is calculated after checking, in an experimental scenario based on historical price change data, whether the defined time until revocation is achieved. Finally, a new remote execution is created with a failure detector monitor, which activates the corresponding behavior when the following events occur (Fig. 6): if it is time to checkpoint, a new execution state is saved in the repository; if an error occurs, execution recovers from the last consistent saved state, the case is retained as a failure case and a new execution is created; and if the execution finishes, the case is retained and the flowchart ends.

3.2 Using Amazon EC2 spot instances

Considering the significant amount of unused resources, cloud providers have been adopting a model that offers the unused capacity of VM resources in the form of transient instances. Spot instances are VMs available at Amazon offered in an auction environment.

As opposed to on-demand instances, spot instances can be revoked without user intervention. Users request spot instance VMs using a bid value ( $S_{bid}$ ), the value they are willing to pay, and receive the VMs if the current spot price ( $S_{p}$ ) is below their bid price. As long as $S_{p}$ is below their $S_{bid}$ , the spot instances provided to the user remain available and only $S_{p}$ is charged.

The value of $S_{p}$ is constantly updated according to user demand. High demand from cloud users increases $S_{p}$ , and low demand decreases it. The spot instance availability attribute ( $S_{on}$ ) indicates the state of the VM. When $S_{on}$ is true, the VM is up and running; otherwise it is not. When $S_{on}$ is set to false, the VM will be revoked in two minutes. The state of $S_{on}$ changes according to Eq. (1).

$\displaystyle S_{on}=\begin{cases}\textit{True}&\text{if }S_{p}\leqslant S_{\textit{bid}}\land S_{\textit{ava}}(\ldots)\\ \textit{False}&\text{if }S_{p}>S_{\textit{bid}}\land\neg S_{\textit{ava}}(\ldots)\end{cases}$ (1)
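
Eq. (1) can be read as a simple predicate. The sketch below is one possible reading: it implements the True branch and treats every other combination as False, which is an assumption, since the equation as printed does not cover the mixed cases. The boolean `capacity_available` stands in for the black-box function $S_{\textit{ava}}$, and all names are illustrative.

```python
def spot_is_on(spot_price: float, bid: float, capacity_available: bool) -> bool:
    """True branch of Eq. (1): price at or below the bid and capacity available.

    Assumption: any case not matching the True branch is reported as False.
    """
    return spot_price <= bid and capacity_available


if __name__ == "__main__":
    # Illustrative hourly prices in USD; not taken from the dataset.
    print(spot_is_on(0.031, 0.035, True))   # price below bid, capacity available
    print(spot_is_on(0.040, 0.035, True))   # bid fault: price above bid
    print(spot_is_on(0.031, 0.035, False))  # no capacity reported
```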

The VMs are distributed among distinct locations in a set of regions. Each region is divided into a subset of zones, which offer a set of independent resources and services with different failure probabilities. To launch a transient instance using Amazon EC2, a user submits a spot instance request command composed of a 6-tuple, as presented in Eq. (2):

$\displaystyle\textit{Spot}_{\textit{Req}}=\{\textit{Region, Zone, InstanceType, QtdOfInstances, AttributeSet, BidValue}\}$ (2)

The Region and Zone are the primary attributes in a new spot request; their availability is defined by $S_{\textit{ava}}$ , a black-box cloud function that returns the availability of VMs according to their respective parameters: Region, Zone and InstanceType.

The InstanceType represents the capabilities of its resources, such as CPU, memory, and local storage capacity, e.g., the p3.8xlarge instance type represents the 3rd generation of GPU accelerated computing optimized instances with 4 high-frequency GPUs, 32 vCPUs, 64 gigabytes of GPU memory and 244 gigabytes of memory, commonly used to process machine or deep learning applications, speech recognition or high-performance computing.

Amazon EC2 currently supports 57 different instance types in the spot tier, although not all types are available in all Regions and Zones. The number of instances (QtdOfInstances) represents how many instances will be delivered to the user. Moreover, a set of attributes (AttributeSet) is needed for the instance availability status to be fulfilled, e.g., the attached storage, security chains, monitoring tasks and fleet strength, a fleet being a collection of unsecured spot instances and, optionally, secured dedicated instances. When all the specifications of the user request are met, the spot instance VM is fulfilled, which can take a few minutes to be ready to use.

Finally, in order to use spot instances, a customer request is needed and must include a bid value (BidValue). Users define the maximum hourly value they are willing to pay (a bid), obtaining a VM to use while their bid exceeds (or is equal to) the current price of the instance. If the current price of a spot instance is equal to or less than the user bid, the VM remains available to the user. VMs are revoked when the current price of the spot instance type exceeds the user’s bid. A bid fault occurs when the current price of the VM is above the user’s bid [28].
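
The 6-tuple of Eq. (2) maps naturally onto a small record type. The sketch below is only an illustration of that structure: the class name, field names, validation rules and sample values are assumptions, and it does not call any Amazon EC2 API.

```python
from dataclasses import dataclass
from typing import Dict


@dataclass(frozen=True)
class SpotRequest:
    """Fields follow the order of Eq. (2)."""
    region: str
    zone: str
    instance_type: str
    qtd_of_instances: int
    attribute_set: Dict[str, str]
    bid_value: float  # maximum hourly price the user is willing to pay

    def __post_init__(self):
        # Assumed sanity checks; the text does not specify validation rules.
        if self.qtd_of_instances < 1:
            raise ValueError("at least one instance must be requested")
        if self.bid_value <= 0:
            raise ValueError("bid value must be positive")


if __name__ == "__main__":
    # Hypothetical request; the attribute keys and the bid are made up.
    request = SpotRequest(
        region="us-west-2",
        zone="us-west-2a",
        instance_type="m4.large",
        qtd_of_instances=1,
        attribute_set={"storage": "attached", "monitoring": "enabled"},
        bid_value=0.05,
    )
    print(request)
```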

The price of a spot instance depends on the type of instance as well as VM demand within each data center. As in an auction, spot instances are offered at lower prices than regular, safe on-demand servers; their prices increase when resource availability and the number of users who send bids and receive VMs are considerable, and decrease otherwise.

According to the definition and classification of an environment presented in [27], in the scenario of this proposal the composition of spot instances is considered distributed, accessible, non-deterministic, dynamic and discrete. A set of definitions about spot instances is presented in Table 2.

Table 2

Definitions of principal elements used in this proposal related to spot instance

Property	Details
$C_{\textit{cpu}}$	Indicates the processing capacity, including the number of virtual CPUs (vCPUs) and their respective clock speed.
$C_{\textit{mem}}$	Indicates the volatile RAM memory size, including access speed and extended capacity, when applicable.
$C_{\textit{vd}}$	Represents the virtual disk storage capacity available to the user, used to store required files and data.
$C_{\textit{net}}$	Indicates the data transfer speed and capacity over a specific network adapter.
$C_{\textit{sts}}$	Indicates the execution state of the VM, which can be initializing, running, revoked or stopped.
$\textit{TI}_{\textit{sec}}$	Represents security attributes that allow external access to the VM, such as open ports and the private security key.
$V_{\textit{cur}}$	Represents the current instance price (at hourly intervals), acquired from a cloud provider.

In order to use spot instances, applications need to be intentionally fault tolerant, i.e., prepared for unexpected failure events and able to recover from failures that are not always under the control of software designers.

3.3 Achieving spot instance availability with fault-tolerant approaches

Using spot instances allows reduced costs when running compute or memory intensive applications, as long as these applications are prepared for scenarios with performance degradation and server revocations.

Considering the use of cloud computing resources, resiliency means the capacity of a resource to be active, reliable, failure tolerant, recoverable, dependable, and secure in case of unexpected failures that result in a temporary or permanent service disruption [3, 9, 35]. In the scope of our proposal, resilience is the ability to ensure application execution and is achieved through the fault-tolerant approach. Six fully consistent fault-tolerant techniques are: (i) retry, the simplest fault-tolerant technique, in which recovery consists of restarting the process from scratch, ignoring all processed data; (ii) task re-submission, which also ignores processed data and resubmits the failed task to another resource; (iii) checkpoint and restore, which saves the application’s running state at specified time intervals, adding an overhead for each saved execution state because all process data and volatile memory need to be saved in persistent memory; (iv) replication, which runs a set of parallel executions and increases the chances of success by increasing their number; (v) software rejuvenation, used when an application needs a clean resource, which reboots the operating system according to scheduled rules; and (vi) job migration, a preventive approach that predicts a future failure and migrates the running process to an environment with reliable resources by using a well-defined plan with safe instances.
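
Technique (iii), checkpoint and restore, can be illustrated with a toy loop. The sketch below is a simplified stand-in under stated assumptions: it checkpoints every fixed number of steps rather than at time intervals, stores state in a local file with `pickle` instead of a remote repository, and simulates a revocation with an exception. All function names are hypothetical.

```python
import os
import pickle


def save_checkpoint(state, path):
    """Write the state to a temporary file, then swap it in with os.replace."""
    tmp_path = path + ".tmp"
    with open(tmp_path, "wb") as handle:
        pickle.dump(state, handle)
    os.replace(tmp_path, path)  # avoids leaving a half-written checkpoint


def load_checkpoint(path, default):
    """Return the last saved state, or the default if no checkpoint exists."""
    if not os.path.exists(path):
        return default
    with open(path, "rb") as handle:
        return pickle.load(handle)


def run(total_steps, checkpoint_every, path, fail_at=None):
    """Sum the integers 0..total_steps-1, resuming from the last checkpoint."""
    state = load_checkpoint(path, {"step": 0, "acc": 0})
    while state["step"] < total_steps:
        if fail_at is not None and state["step"] == fail_at:
            raise RuntimeError("simulated revocation")
        state["acc"] += state["step"]
        state["step"] += 1
        if state["step"] % checkpoint_every == 0:
            save_checkpoint(state, path)
    return state["acc"]


if __name__ == "__main__":
    import tempfile

    with tempfile.TemporaryDirectory() as workdir:
        checkpoint_path = os.path.join(workdir, "state.pkl")
        try:
            run(10, 3, checkpoint_path, fail_at=7)
        except RuntimeError as error:
            print("first attempt:", error)
        # The second attempt restarts from step 6, the last saved state.
        print("result after recovery:", run(10, 3, checkpoint_path))
```

Only the work done after the last checkpoint is repeated on recovery, which is the trade-off against the per-checkpoint overhead discussed earlier.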

Using fault-tolerant techniques minimizes the impact of VM unavailability. Results show that under more volatile scenarios, the use of fault-tolerant approaches becomes considerably more useful and provides significant benefits [39]. Intelligent approaches combined with fault-tolerant techniques allow increased resilience in scenarios that use spot instances to execute applications, avoiding data loss.

3.4 Analysis of reasoning model

The performance and availability of VMs in a cloud environment vary according to region, zone, instance type and time of day. In [4], the authors observe a pattern of price changes according to the day of the week and the hour of the day.

Patterns can be observed in Fig. 2, which shows, for the price changes collected from April/17 to December/18, how many price changes occurred on each day of the week and during each hour of the day, grouped by zones in the US-WEST region for the M4Large instance type. The probability of bid faults increases when there are more price changes; consequently, fewer price changes imply fewer failures.

Figure 2.

Observed patterns in price changes in M4Large instance type.

A pattern can be observed in the m4.large instance type. The number of changes peaks during weekdays (a) (as opposed to weekends) and after 12am (c). The r4.xlarge has considerable price changes on weekends (b) and the number of changes by hour-in-day increases after 7pm (19 h) (d).
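
Counts of this kind can be derived directly from the timestamps of price change records. The sketch below is a minimal illustration that assumes the records are already available as Python `datetime` objects; the function name and the sample timestamps are made up.

```python
from collections import Counter
from datetime import datetime
from typing import Iterable, Tuple


def change_patterns(timestamps: Iterable[datetime]) -> Tuple[Counter, Counter]:
    """Count price changes per day of week (0 = Monday) and per hour of day."""
    by_day = Counter()
    by_hour = Counter()
    for ts in timestamps:
        by_day[ts.weekday()] += 1
        by_hour[ts.hour] += 1
    return by_day, by_hour


if __name__ == "__main__":
    # Hypothetical price change timestamps.
    changes = [
        datetime(2018, 1, 1, 13, 5),   # Monday
        datetime(2018, 1, 1, 13, 40),  # Monday
        datetime(2018, 1, 6, 19, 0),   # Saturday
    ]
    days, hours = change_patterns(changes)
    print("per day of week:", dict(days))
    print("per hour of day:", dict(hours))
```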

Our heuristic uses a model that analyzes historical and current prices of spot instances and their changes to define appropriate fault-tolerant parameters. Compared to existing similar studies, our proposal uses CBR to classify cases considering observed price change patterns of spot instances in terms of hour-in-day and day-of-week, as proposed in [4, 17, 16].

Our knowledge database is composed of case attributes (Table 3) and is organized as a set of tuples (Eq. (3)) to be explored by our schematic CBR circle of the problem-solving process (illustrated in Fig. 3), composed of an initial knowledge generation (detailed in Algorithm 3), followed by a set of eight sequential steps, as follows:

Figure 3.

Schematic CBR circle overview.

Generating case knowledge database

procedure CasesGenerator(reg, zone, instType, initDate, finalDate) ▷ System initialization
    $\textit{functionParams}\leftarrow[\textit{reg, zone, instType, initDate, finalDate}]$
    $\textit{priceHistoryList}[]\leftarrow\textit{getPricesFromDB}(\textit{functionParams})$
    $\textit{addictions}[]\leftarrow\textit{new Double}(1,1.1,1.2,1.3,\textit{currentOnDemandPrice})$
    $\textit{caseBasedList}[]\leftarrow\textit{new List}()$
    for addiction in addictions do ▷ Processing each addiction to increase database
        for priceRow in priceHistoryList index i do ▷ Each price change record
            $\textit{baseTime}\leftarrow\textit{priceRow.time}$
            $\textit{basePrice}\leftarrow\textit{priceRow.price}\cdot\textit{addiction}$
            $\textit{censured}\leftarrow\textit{True}$
            for futurePriceRow in priceHistoryList do ▷ Future price changes
                $\textit{comparablePrice}\leftarrow\textit{priceHistoryList}[i+1].\textit{price}$ ▷ futurePriceRow
                if $\textit{comparablePrice}>\textit{basePrice}$ then ▷ A higher price was found
                    $\textit{censured}\leftarrow\textit{False}$
                    $\textit{min}=\textit{calcMinutes}(\textit{baseTime, priceHistoryList}[i+1].\textit{time})$
                    $\textit{caseBasedList.add}(\textit{new Case}(\textit{setOfAttrs}))$
                end if
            end for
            if censured then ▷ A higher price record was not found
                $\textit{censured}\leftarrow\textit{False}$
                $\textit{min}=\textit{calcMinutes}(\textit{baseTime, priceHistoryList}[i+1].\textit{time})$
                $\textit{caseBasedList.add}(\textit{new Case}(\textit{setOfAttrs}))$
            end if
        end for
    end for
    $\textit{persistOnDatabase}(\textit{caseBasedList})$
    return caseBasedList
end procedure

Table 3

Case ( $\Theta$ ) attributes with their respective similarity function strategies and weights

Attribute	Detail	Function strategy	Weight
Instance	A subset of attributes indicating the used instance, including the instance name and resources capacities.	Constant	1 if attr is equal, else 0.
Region	Indicates the region location.	Constant	1 if attr is equal, else 0.
Zone	Attribute indicating the region zone.	Constant	1 if attr is equal, else 0.
Day-of-week	Represents the day-of-week.	Euclidean	A value between 0 and 1 according to the proximity of values 0 and 7.
Hour-in-day	Represents the hour-in-day.	Polynomial	A value between 0 and 1 according to the proximity of values 0 and 23.
Bid-strategy	Represents the bid strategy used to define user bid (median of last $n$ -days, actual price or percentage over current price).	Constant	1 if attr is equal, else 0.
Bid-value	Represents the bid value (spot instance request) used to acquire the VM.	N/A	N/A.
Time-init	Attribute indicating the time (timestamp) of VM acquisition.	N/A	N/A.
Time-end	Attribute indicating the time (timestamp) of VM revocation instant.	N/A	N/A.
Time-until-revocation	Represents the time until revocation, considering time-init and time-end.	N/A	N/A.
Censored-data	Attribute indicating if the $\Theta$ represents a censored event, being an important element to calculate the time until revocation.	N/A	N/A.

Retrieving similar cases

procedure RetrieveSimilarData(reg, zone, instType, dayOfWeek, hourInDay) ▷ System Initialization
  paramSet ← [reg, zone, instType, dayOfWeek, hourInDay]
  model ← loadModule() ▷ Load the CBR model, including the modeled similarity functions
  model.cases ← loadCasesDatabase() ▷ Load the CBR database, including filtered data according to params
  caseAttributes[] ← model.loadAttributes()
  similarCasesMap[] ← new List()
  for each case in model.cases do ▷ Iterating on each case data
    case.weight ← 0
    for each attribute in caseAttributes do ▷ Iterating on each mapped attribute
      functionType ← attribute.getFunctionType()
      if functionType eq model.HEAVISIDE then
        case.weight += (case["attribute"] equals attribute in paramSet ? attribute.weight : 0)
      else if functionType eq model.POLYNOMIAL then
        case.weight += calculateWeight(case["attribute"], attribute, attribute.weight)
      else if functionType is null || functionType eq model.UNKNOWN then
        RaiseAnError → "Unknown Similarity Function (functionType)"
      end if
    end for
    similarCasesMap.put(case)
  end for
  filterListByMaxWeight(similarCasesMap) as Set
  return similarCasesMap
end procedure
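The retrieval loop above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the numeric weights, the dictionary-based case layout and the cyclic linear `proximity` function are hypothetical choices made only to show how an additive per-attribute weight and a max-weight filter fit together.

```python
# Hypothetical weights: Table 3 names the strategies, but the numeric
# values below are illustrative assumptions.
EXACT_ATTRS = {"region": 1.0, "zone": 1.0, "inst_type": 1.0}
# attribute -> (period, weight); the cyclic treatment is an assumption.
CYCLIC_ATTRS = {"day_of_week": (7, 1.0), "hour_in_day": (24, 1.0)}


def proximity(a, b, period):
    """Return 1.0 for equal values, decreasing linearly with cyclic distance."""
    d = abs(a - b) % period
    d = min(d, period - d)
    return 1.0 - d / (period / 2)


def case_weight(case, query):
    """Sum the per-attribute similarity contributions for one stored case."""
    weight = 0.0
    for attr, w in EXACT_ATTRS.items():
        # Heaviside-style contribution: full weight only on an exact match.
        weight += w if case[attr] == query[attr] else 0.0
    for attr, (period, w) in CYCLIC_ATTRS.items():
        weight += w * proximity(case[attr], query[attr], period)
    return weight


def retrieve_similar(cases, query):
    """Keep only the cases that reach the maximum total weight."""
    if not cases:
        return []
    scored = [(case_weight(c, query), c) for c in cases]
    best = max(w for w, _ in scored)
    return [c for w, c in scored if w == best]
```

Filtering by the maximum weight mirrors the `filterListByMaxWeight` step; a tolerance band around the maximum would be an equally plausible reading of "closer or max" weights.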

(i) Current Problem: represents a newly submitted problem containing the requested region, zone, instance, bid-strategy, day-of-week and hour-in-day attributes;
(ii) Retrieve: as presented in Algorithm 3, this step applies the similarity functions of Eq. (5) to fetch a set of similar cases according to the requested attributes. An amalgamation function is a weighted sum of all local similarities (attribute similarities) of a concept and constitutes the overall global similarity measure of the concept [32]. In the retrieval step, the similarity function processes the case attributes (Eq. (4)) and uses different weights (Table 3) to select the cases with the highest, or closest to the highest, calculated weight;
(iii) Similar Cases: a structure holding the set of retrieved similar cases to be reused as reference;
(iv) Reuse: the retrieved cases are adapted and reused to construct a proposed solution to the new problem according to previously observed cases;
(v) Solved Case: after adaptation, the set of cases is used as input to an algorithm that evaluates each case, ignores atypical cases and calculates the time until revocation, as presented in Algorithm 3;
(vi) Proposed Solution: the result of the Solved Case step, composed of the time until revocation considering the current availability scenario and bid strategy;
(vii) Revise: the proposed solution is evaluated to confirm its efficiency in a scenario with real test cases, in which the time until revocation is observed;
(viii) Retain Case: after evaluation, the solved case is transformed into a learned case by adding its solution to the local knowledge database. If the proposed time until revocation was achieved, a success case is added to the knowledge database, together with the defined fault parameters. If the time until revocation was not achieved, the properties of the case remain the same, but the current execution time will be present in new similar cases obtained in the retrieval step, changing new proposed solutions that use the recency factor, as presented in Section 4.3.

Processing $T_{S}$

procedure ProcessSurvivalTime(reg, zone, instType, confLevel, dow, hid) ▷ System Initialization
  cases[] ← RetrieveSimilarData(reg, zone, instType, dow, hid)
  if processRealCasesLength(cases) ≤ confidenceSize then
    raise Exception with too small times length
  end if
  timeCases[] ← Integer(sizeof(cases[]))
  censored[] ← Boolean(sizeof(cases[])) ▷ Necessary on Kaplan Meier Estimator
  for each case in cases, index i do
    timeCases[i] ← case
    censored[i] ← [case.isCensored() == True]
  end for
  sortedIntervals ← KaplanMeierEstimator.compute(cases, censored)
  firstInterval ← sortedIntervals[0]
  for each interval in sortedIntervals do ▷ To find the closer interval
    if interval.cumulativeSurvival ≥ confLevel then
      firstInterval ← interval
    end if
  end for
  return firstInterval.value
end procedure

Considering that the combination of attributes (region, zone, instance type, day-of-week and hour-in-day) can influence the revocation time, the similarity functions implemented in a CBR model offer an efficient approach to the problem posed in this paper. Once a case database has been generated, similar cases must be retrieved, as presented in Algorithm 3.

To strengthen the reasoning process, this model uses the equivalence and adaptation approaches, which involve understanding previous cases to assist with similar decisions and, in the case of failure, allow their adaptation.

Our CBR formalization model includes a case ( $\Theta$ ) element, which corresponds to the context of a real problem situation, represented as a finite set of $n$ key/value pairs composed of attributes ( $\vartheta$ ) and values ( $\nu$ ), as presented in Eq. (3).

$\displaystyle\Theta=\{\langle\vartheta_{0},\nu_{0}\rangle,\langle\vartheta_{1},\nu_{1}\rangle,\langle\vartheta_{2},\nu_{2}\rangle,\ldots,\langle\vartheta_{n},\nu_{n}\rangle\}$ (3)

Each similarity function is used not only in case retrieval but also to adapt cases that augment the knowledge database. Similarity is determined by the proximity of the attributes’ values, and the respective weights are defined according to the context of the problem.

Context can be considered as an interpretation of a case, where only a selected subset of attributes is considered relevant, with specifications and weights defined as detailed in Table 3.

Attributes with an N/A function strategy are not used in any similarity function; they are nevertheless important data used by the agent-based process, detailed in Section 3. A context $\Gamma$ is defined as a finite set of $j$ attribute subsets ( $\Omega$ ) with associated constraints ( $C\alpha$ ) on their values:

$\displaystyle\Gamma=\{\langle\Omega_{1},C\alpha_{1}\rangle,\langle\Omega_{2},C\alpha_{2}\rangle,\langle\Omega_{3},C\alpha_{3}\rangle,\ldots,\langle\Omega_{j},C\alpha_{j}\rangle\}$ (4)

Using the history of price traces ( $P_{t}$ ), a set of cases ( $\Delta$ ) can be created by using algorithms to simulate an auction scenario with user bids. These cases, generated from the real data (price traces), can be used to calculate the expected survival time. As a specific example, the cases can be generated by using the result of an estimated function $\textit{EST}_{\textit{bid}}(P_{t},n)$ , which calculates a median instance price over the last $n$ days, as a bid value and simulating the survival times from the price traces, revoking the instance when the price exceeds the bid.
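The case generation just described can be sketched as follows. This is a hedged illustration, not the authors' generator: the price trace is assumed to be a time-sorted list of `(minute, price)` pairs, and the helper names `est_bid` and `simulate_case` are hypothetical. A case is marked as censored when the trace ends before any price exceeds the bid.

```python
from statistics import median

MINUTES_PER_DAY = 1440


def est_bid(prices, n_days, now):
    """Median price over the n_days preceding `now` (times in minutes)."""
    window = [p for t, p in prices if now - n_days * MINUTES_PER_DAY <= t <= now]
    return median(window)


def simulate_case(prices, start_idx, bid):
    """Simulate one acquisition at prices[start_idx] with the given bid.

    Returns (minutes_until_revocation, censored). The instance is revoked
    at the first later record whose price exceeds the bid; if no such
    record exists, the observation is censored at the end of the trace.
    """
    t0 = prices[start_idx][0]
    for t, p in prices[start_idx + 1:]:
        if p > bid:
            return t - t0, False
    return prices[-1][0] - t0, True
```

Running `simulate_case` for every record and every bid strategy would yield the set $\Delta$ of cases, each carrying the time-until-revocation and censored-data attributes of Table 3.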

A set of real cases can be defined as $\Delta=\{\Theta_{1},\Theta_{2},\ldots,\Theta_{n}\}$ , where each $\Theta$ is composed of the set of attributes presented in Table 3 and is fully included in the knowledge database.

When matching cases, some attributes are forced to be equal. In a similarity process, we partition $\theta$ into $\theta_{E}$ and $\theta_{D}$ , representing attributes that must be equal and attributes that can be different, respectively.

Using the above partition of the attributes, we then calculate the similarity function between cases as presented in Eq. (5):

$\displaystyle\textit{SCases}\Big(\delta^{(1)},\delta^{(2)}\Big)=\prod_{\theta\in\theta_{E}}A_{I}\Big(\gamma_{\theta}^{(1)}=\gamma_{\theta}^{(2)}\Big)\prod_{\theta\in\theta_{D}}K_{\theta}\Big(\big|\gamma_{\theta}^{(1)}-\gamma_{\theta}^{(2)}\big|\Big)$ (5)

The adjustment indicator function, $A_{I}$ , assumes the value $1$ if its argument is true and zero otherwise. The kernel functions, $K_{\theta}$ , for each of the attributes in $\theta_{D}$ all assume the value $1$ when $\gamma_{\theta}^{(1)}=\gamma_{\theta}^{(2)}$ , and decrease polynomially to zero as the values become more distant. The rate of decay varies according to the respective attribute.
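A minimal sketch of Eq. (5) follows. The attribute names, the kernel scales and the quadratic degree are assumptions chosen for illustration; the paper only states that each kernel equals 1 at distance zero and decays polynomially to zero at an attribute-specific rate.

```python
# Attributes that must match exactly (theta_E) and attributes compared
# through a kernel (theta_D). Scales and degrees are hypothetical.
EQUAL_ATTRS = ("region", "zone", "inst_type")
KERNEL_ATTRS = {"day_of_week": (7.0, 2), "hour_in_day": (24.0, 2)}


def indicator(a, b):
    """A_I: 1 if the two attribute values are equal, 0 otherwise."""
    return 1.0 if a == b else 0.0


def poly_kernel(distance, scale, degree=2):
    """K_theta: 1 at distance 0, decaying polynomially to 0 at distance >= scale."""
    u = min(abs(distance) / scale, 1.0)
    return (1.0 - u) ** degree


def s_cases(c1, c2):
    """Product of indicators over theta_E and kernels over theta_D, as in Eq. (5)."""
    s = 1.0
    for attr in EQUAL_ATTRS:
        s *= indicator(c1[attr], c2[attr])
    for attr, (scale, degree) in KERNEL_ATTRS.items():
        s *= poly_kernel(c1[attr] - c2[attr], scale, degree)
    return s
```

Because the indicators enter as a product, a single mismatch in region, zone or instance type drives the whole similarity to zero, which is what "forced to be equal" means in practice.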

3.5 A time-to-revocation prediction approach using survival analysis

A convenient statistical model is used to support intelligent agents’ decisions and offer a reliable environment to execute applications when spot instances are used in a cloud computing business model, where price and availability vary according to supply and demand. Statistical models provide mechanisms for extracting information, sometimes without complete data, providing support to organize, analyze and present processed data that may be useful in decision making processes.

Our model uses Survival Analysis [24], a class of statistical methods used when the time to a specific event is one of the principal quantities the data represent. This approach, widely used in medical research, follows observations over time until the occurrence of an event of interest, frequently a failure [10].

The event under study is a bid fault, and the time to this event, the time until revocation, is treated as the main element. Pure CBR alone did not produce sufficient results, so it was necessary to integrate a statistical model to support agent decisions. A non-parametric technique is used, which incorporates incomplete (censored) data to avoid estimation bias. This method requires three elements, as follows:

i.
Time indexed case data.
ii.
An estimator function $\hat{\textit{SF}}(\textit{time})$ to be applied to the data.
iii.
The confidence level of prediction approach $(c\in(0,1))$ .

We would like to estimate the largest Survival Time ( $T_{S}$ ) with a low probability of revocation. Algorithm 3 presents the steps to obtain $T_{S}$ for a given set of parameters.

As presented in Algorithm 3, the $T_{S}$ time is extracted from estimated Survival Curves through a survival function using the Kaplan-Meier estimator [21], in which $t_{i}$ represents the upper limit of a small time interval, $D_{i}$ the number of deaths (failures) within that interval, and $S_{i}$ the number of survivors at the beginning of the interval, as follows:

$\displaystyle\hat{\textit{SF}}(\textit{time})=\prod_{i:t_{i}\leqslant\textit{time}}\left(1-\frac{D_{i}}{S_{i}}\right)$ (6)

If no deaths occur in a given interval then the survival curve does not decrease. A survival curve with respective times and confidence levels can be observed in Fig. 4.
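The estimator of Eq. (6) can be sketched in a few lines of plain Python. This is an illustrative implementation of the standard Kaplan-Meier product, assuming each observation is a duration plus a flag that is `True` when the case is censored (no revocation observed); it is not the `KaplanMeierEstimator` used by the authors.

```python
def kaplan_meier(times, censored):
    """Return the survival curve as [(t, S(t))] at each distinct failure time.

    times[i] is the observed duration of case i and censored[i] is True when
    the case ended without a revocation. At each failure time the running
    product is multiplied by (1 - D_i / S_i), where D_i is the number of
    failures and S_i the number of cases still at risk.
    """
    data = sorted(zip(times, censored))
    at_risk = len(data)
    survival = 1.0
    curve = []
    i = 0
    while i < len(data):
        t = data[i][0]
        deaths = 0
        removed = 0
        # Group all observations that share the same duration.
        while i < len(data) and data[i][0] == t:
            if not data[i][1]:
                deaths += 1
            removed += 1
            i += 1
        if deaths > 0:
            survival *= 1.0 - deaths / at_risk
            curve.append((t, survival))
        # Censored cases leave the risk set without lowering the curve.
        at_risk -= removed
    return curve
```

Censored cases still count in the denominator up to the moment they drop out, which is how the estimator avoids the bias of simply discarding them.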

Figure 4.
Survival curves with respective survival times of some spot instances on Sunday at 6 am.

According to Fig. 4, at the 98% confidence level (i.e., a 2% failure probability), the time until revocation for a general purpose instance type (m3.large) is around two hours, while the estimated time until revocation for an accelerated computing instance type (p2.xlarge) is much smaller. As illustrated in Fig. 2, the time until revocation on weekdays is considerably lower than on a weekend day.

We use recovered similar cases from the case knowledge database to produce a survival curve, which is used by the executor agent to predict the time until revocation according to a confidence level which provides sufficient security. The agent estimates the largest $T_{S}$ for which we have a high confidence (98%) that our running instance will not be revoked, considering the relationship of day-of-week and hour-in-day.

This time is defined by $T_{S}=\max\{t\in\mathbb{R}\mid P(\textit{TUR}>t)\geqslant 0.98\}$ , where TUR denotes the time until revocation, and can be read from the estimated survival curve as the time at which the survival probability reaches 98% (equivalently, the 2% quantile of the time until revocation). The survival curve for $\delta_{i}$ is calculated from the other cases with the following weights:

$\displaystyle w_{ij}=\frac{\textit{SCases}(\delta_{i},\delta_{j})}{\sum_{k\neq i}\textit{SCases}(\delta_{i},\delta_{k})}\quad\mathrm{for\ }j\neq i$ (7)
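One plausible way to combine the weights of Eq. (7) with the survival estimate is a weighted Kaplan-Meier product, sketched below. How the weights enter the estimator is an assumption made here for illustration (each case contributes its weight instead of a unit count), and the function names are hypothetical. Following the loop of Algorithm 3, the returned $T_{S}$ is the last failure time at which the cumulative survival is still at or above the confidence level.

```python
def normalized_weights(similarities):
    """Eq. (7): divide each SCases(delta_i, delta_j), j != i, by their sum."""
    total = sum(similarities)
    if total <= 0:
        raise ValueError("at least one case must have positive similarity")
    return [s / total for s in similarities]


def weighted_survival_time(times, censored, weights, conf_level=0.98):
    """Return T_S from a weighted Kaplan-Meier curve (0.0 if none qualifies)."""
    data = sorted(zip(times, censored, weights))
    at_risk = sum(weights)
    survival, t_s = 1.0, 0.0
    i = 0
    while i < len(data) and at_risk > 0:
        t = data[i][0]
        deaths = removed = 0.0
        while i < len(data) and data[i][0] == t:
            if not data[i][1]:
                deaths += data[i][2]  # weighted failures at time t
            removed += data[i][2]
            i += 1
        if deaths > 0:
            survival *= 1.0 - deaths / at_risk
            if survival < conf_level:
                break  # the curve has dropped below the confidence level
            t_s = t
        at_risk -= removed
    return t_s
```

With this reading, cases that closely resemble $\delta_{i}$ dominate the curve, while dissimilar cases barely move it.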

The use of survival analysis is presented in Algorithms 3 and 6, where it is the key element for processing the time to revocation at the 98% confidence level in the process survival time step.

3.6 Multi-agent architecture

In this proposal, the agents extend the architecture using different approaches according to the current scenario, so that application execution, VM monitoring and the fault-tolerant execution plan can be managed dynamically and autonomously for a set of cloud providers and their respective services and resources.

According to the definition and classification presented in [27], in this proposed environment, the composition of spot instances is considered a distributed, accessible, non-deterministic, dynamic and discrete environment.

Our proposal contains a set of autonomous agents defined by a quadruple $\textit{BRA}_{\textit{AG}}=\langle A,B,P,G\rangle$ , where:

•
$A$ stands for the set of agents;
•
$B$ describes the set of available behaviors of agents, where a behavior $b^{\prime}$ is defined by a subset of conditions, with $\textit{cond}(b^{\prime})\subseteq\textit{PR}$ ;
•
$P$ is a finite set of propositions used in the agents’ assertions;
•
$G$ depicts the set of goals to be achieved, having $G\subseteq\textit{PR}$ .

An abstract view of agents and their autonomy can be formalized. First, let us assume that $E=\{e^{\prime}_{1},e^{\prime}_{2},e^{\prime}_{3},\ldots,e^{\prime}_{n}\}$ is a finite set of discrete states, e.g. the auction environment and price changes, and VM availability with their execution states. Each agent has a set of possible behaviours $B=\{\beta^{\prime}_{1},\beta^{\prime}_{2},\beta^{\prime}_{3},\ldots,\beta^{\prime}_{j}\}$ , which transform the state of the environment. The decision of which $\beta^{\prime}$ will be triggered depends on the state $e^{\prime}$ , i.e. $E$ responds with a set of states and the agent defines a sequence of behaviours to be executed, BE.

BE corresponds to a sequence of interleaved environment states and agent behaviours, $\textit{BE}:e^{\prime}_{0}\xrightarrow{\beta^{\prime}_{0}}e^{\prime}_{1}\xrightarrow{\beta^{\prime}_{1}}e^{\prime}_{2}\xrightarrow{\beta^{\prime}_{2}}\ldots\xrightarrow{\beta^{\prime}_{u-1}}e^{\prime}_{u}$ . As observed, each change of $e^{\prime}$ triggers a behaviour, and this continues until a final state $e^{\prime}_{u}$ is reached that triggers no further behaviours.

We denote by $H$ the set of all possible finite sequences over $E$ and $B$ ; by $H^{\textit{BE}}$ the subset characterized by tuples $T=\langle e^{\prime}_{x},\beta^{\prime}_{x}\rangle$ , in which the state $e^{\prime}_{x}$ triggers a behaviour $\beta^{\prime}_{x}$ ; and by $H^{E}$ the subset characterized by tuples $T=\langle\beta^{\prime}_{x},e^{\prime}_{x+1}\rangle$ , in which a new state $e^{\prime}_{x+1}$ is the consequence of $\beta^{\prime}_{x}$ .

To represent the effect that an agent behaviour has on an environment, a state transformer function is defined as $\tau:H^{\textit{BE}}\rightarrow\varphi(E)$ . An environment $\Sigma$ is then formalized as a triple $\Sigma=\langle E,e^{\prime}_{0},\tau\rangle$ , where $E$ is the set of states, $e^{\prime}_{0}\in E$ is the initial state, and $\tau$ is the state transformer function, triggered after agent reasoning.
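The interleaving of states and behaviours can be illustrated with a small run loop. This is a deliberately simplified sketch: the state and behaviour names are hypothetical, and the state transformer is assumed to be deterministic (one successor state per pair), whereas the formalization above allows a set of possible successor states.

```python
def run(initial_state, choose_behaviour, transform, max_steps=100):
    """Return the interleaved history [e0, b0, e1, b1, ..., eu].

    choose_behaviour(state) returns the behaviour triggered by a state, or
    None when the state triggers nothing (a final state). transform(state,
    behaviour) plays the role of the state transformer function.
    """
    history = [initial_state]
    state = initial_state
    for _ in range(max_steps):
        behaviour = choose_behaviour(state)
        if behaviour is None:
            break
        state = transform(state, behaviour)
        history.extend([behaviour, state])
    return history


# Hypothetical example: a revocation triggers recovery, then resumption.
POLICY = {"revoked": "recover", "recovering": "resume"}
TRANSITIONS = {("revoked", "recover"): "recovering",
               ("recovering", "resume"): "running"}

example_history = run("revoked", POLICY.get,
                      lambda state, behaviour: TRANSITIONS[(state, behaviour)])
```

Here `example_history` alternates states and behaviours and stops at `"running"`, the first state for which the policy defines no behaviour.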

Figure 5.
Proposed multi-agent architecture.

To understand the elements of the proposal regarding the agents and their reasoning model, the multi-agent architecture is presented in Fig. 5, and the following definitions can be formalized:

•
sen-avail is a sensor that observes the availability of spot instances in a cloud environment;
•
sen-price is a finite set of sensors that monitor spot instance price changes. By the nature of an auction, the price varies according to supply and demand, and the system needs to be kept up to date;
•
sen-fail represents a set of sensors that observe a cloud environment with available and running spot instances, listening for revocation events;
•
sen-exec is composed of sensors that obtain the set of $n$ running spot instances, which run a BoT $T$ with a subset of tasks $T=\{t_{1},t_{2},\ldots,t_{n}\}$ . Each spot instance receives $T$ and executes the same BoT independently;
•
$B$ is a bag of pre-defined implemented behaviors, individually created to achieve their respective objectives according to a fault-tolerant execution plan, with the set defined as $B=\{\beta_{1},\beta_{2},\beta_{3},\;\ldots,\;\beta_{n}\}$ ;
•
database represents a local knowledge database that contains the set of all price changes and processed data, including the cases used by the CBR model, if applicable;
•
$\xi$ depicts a core function that evaluates an observed event to decide which behavior will be executed through a defined set of actions;
•
location is a flag indicating whether the agent is local or remote. Agents are placed in a centralized container that accepts remote agents, and this attribute distinguishes a local agent from a connected remote one.

Table 4
Agents’ PEAS definitions

Agent	Performance	Environment	Actuators	Sensors
Verifier	Receive user application parameters, data and files.	Partially observable, deterministic, sequential, static and discrete.	Validate input data; define cloud provider and instances that match the requirements of the user tasks.	Analyze provided data and files; evaluate instance prices and availability; define instance types.
Price Monitor	Acquires a set of instance types to monitor and their respective regions and zones; identifies each instance type’s price changes.	Observable, deterministic, sequential, dynamic and continuous.	Keeps the instance price database updated.	Analyze current instance price; evaluate instance price changes; inform price change events.
Core	Receive, from Verifier agent, the cloud provider and respective instance definitions; retrieve case database.	Partially observable, deterministic, episodic, static and discrete.	Define execution plan to distribute tasks and run an application with defined fault-tolerant technique and respective parameters.	Request case database; apply the similarity model; recover similar cases to calculate the failure based on time until revocation per instance.
Executor Manager	Receive execution plan.	Observable, stochastic, sequential, dynamic and continuous.	Keep instance and task execution state; evaluate proposed case solution; retrieve case state.	Request spot instances; copy required files to VM; run applications; evaluate application execution; end the instance when execution finishes.
Executor Monitor	Obtain a set of running instances to monitor.	Observable, deterministic, sequential, dynamic and continuous.	Provide status data from running tasks on VMs; report failure events to the Recover agent.	Monitor remote task execution; send information about a VM failure; re-validate events to avoid false-positive failures.
Recover	Task execution failure when a bid fault occurs.	Observable, deterministic, sequential, dynamic and continuous.	Keep instance execution state; ensure the application of the fault-tolerant technique and parameters; evaluate proposed case solution; retrieve case state.	Recover when a bid fault occurs according to the fault-tolerant plan.

The proposed architecture is composed of a finite set of agents that share the entire environment and communicate with each other using a multi-agent communication protocol layer to enable interoperability with different technologies. The integration with external cloud infrastructures occurs through pluggable modules to support other cloud providers. Moreover, an API that offers external access to the facade agent (Verifier) is provided to enable integration with external systems.

From the layer definitions presented in Fig. 5 and the agents’ PEAS detailed in Table 4, a brief description follows:

•
the Verifier agent checks user input and chooses which resources and instance types will be used based on user requirements, which include application files, estimated execution time and the amount of CPU and memory required;
•
the Price Monitor agent monitors a set of available price changes in the cloud environment. Each spot instance price change triggers a process on the Core agent to generate a case, to be added to the CBR knowledge database, with the actual time until revocation based on the new record;
•
the Core agent defines the most appropriate fault-tolerant technique and its respective parameters. To achieve high levels of accuracy, it uses an approach to predict the time until revocation [4], with a reasoning process detailed in the next sections. It produces the elements needed to calculate a survival rate, which is mandatory information to support agent decisions;
•
the Executor Manager agent is responsible for managing the cloud spot instances, creating and allocating $n$ spot instances to meet application requirements, i.e. if the user needs 50 parallel executions, 50 VMs (spot instances) will be created and their execution will be managed. These instances are used to run tasks, and the agent monitors their execution, respecting the previous definitions, i.e., cloud region, zone, instance types, fault-tolerant technique, and parameters;
•
the Executor Monitor is a lightweight agent that monitors instance execution and is used to quickly report failures to be recovered by a Recover agent. Even though the Executor Manager agent monitors spot instance execution, the Executor Monitor agent is more efficient since it acts as an external observer;
•
the Recover agent is instantiated when a failure occurs, being informed by a message from the Executor Monitor agent and applying the recovery processes to guarantee application execution. Like the Executor Manager agent, the Recover agent respects the previous definitions of the fault-tolerant technique and parameters.

In our scenario, each agent has a defined environment, e.g. the Price Monitor agent is limited to observing only the auction environment, whereas the Executor and Recover agents share the same environment. A brief sequence diagram, which summarizes the interaction between the agents from a new BoT submission until a failure detection event, is presented in Fig. 6 as a set of eight steps, as follows:

Step 01: the user provides the application and parameters to the Verifier agent;
Step 02: the Verifier agent validates user input and asks the Price Monitor agent to inform an appropriate instance type according to user parameters. At this time, the Verifier agent is responsible for rejecting or accepting the user submission according to architecture requirements, e.g. application files, bid-value, custom checkpointing interval and application time;
Step 03: after the Price Monitor response, the Verifier agent informs the Core agent to create an execution plan. As a result, the Core agent receives the chosen instance and user parameters;
Step 04: the Core agent retrieves similar cases according to the current day-of-week, hour-in-day, region, zone and instance type, calculates the time until revocation using similarity functions (Eq. (5)) and defines a fault-tolerant execution plan, according to Algorithm 6. As a result, the agent informs the most appropriate fault-tolerant execution plan to be used in the current application execution;
Step 05: the Core agent asks the Executor Manager agent to start execution on $n$ instances, according to user needs;
Step 06: the Executor Manager agent is responsible for managing cloud spot instances. At this time, the agent requests new spot instances and waits for availability;
Step 07: if a failure is detected, the Executor Monitor agent asks the Recover agent to inform the Core agent of the failure and start the recovery process. As a result, a new execution plan will be processed (as stated in Step 04) while the Recover agent initializes the recovery process to start a new execution using the latest consistent state;
Step 08: considering that the recovery instant is different, a new execution plan needs to be calculated by the Core agent until a successful execution.

Figure 6.
Sequence diagram of BoT and behavior in the presence of a failure.

Considering a user execution requirement with $n$ instances, there should be one Verifier agent, one Core agent, $n$ Executor Manager agents, $n$ Executor Monitor agents and, when necessary, up to $n$ Recover agents. In the presence of a failure, the Executor Monitor agent signals the Recover agent; that is, when errors occur there are $r$ temporary Recover agents, where $r$ is a positive integer with $r\leqslant n$ , each of which recovers according to the defined fault-tolerant execution plan.

As an agent-based project, the characterization of agents is necessary for the understanding of their objectives and interactions. As proposed by [27], the agents’ characterization is presented through the Performance, Environment, Actuators and Sensors (PEAS) aspects as presented in Table 4.

Given an application’s execution time (appTime) and $T_{S}$ , a production rule [11] can be used to support agent decisions, as detailed in Algorithm 6.

Defining the appropriate fault-tolerant execution plan

procedure DefineExecutionPlan(appTime, region, zone, instType, bidValue) ▷ System Initialization
  validData ← paramsValidation(appTime, region, zone, instType, bidValue)
  if validData then ▷ App is able to execute
    if appTime > 60 then ▷ Time in minutes
      dayOfWeek, hourInDay ← extractCurrentTime()
      params ← [region, zone, instanceType, 0.98, dayOfWeek, hourInDay]
      survivalTime ← ProcessSurvivalTime(params) ▷ (Algorithm 3)
      priceHistorySet ← recoverPriceHistoryFrom(90) ▷ Time in days
      successRate ← 0
      for each priceRow in priceHistoryList, index i do ▷ Each price change record
        if extractTime(priceRow) == dayOfWeek, hourInDay then
          currentPrice ← priceRow.currentPrice
          errorOcurred ← False
          instant ← priceRow.time
          for instant increasing 60 do ▷ Time in minutes
            if priceRow.price ≥ priceHistorySet[instant].price then
              errorOcurred ← True
            end if
          end for
          if not errorOcurred then
            successRate++
          end if
        end if
      end for
      chkInterval ← calculateCheckpointInterval(successRate)
    else
      chkInterval ← 0
    end if
    return ExecutionPlan(appTime, region, zone, instType, bidValue, chkInterval)
  end if
end procedure

The required parameter appTime is used in the first step to decide whether a checkpoint interval will be adopted in a new execution or ignored. In [36, 40], the authors consider longer executions greater than one hour as the simplest and most intuitive, yet effective, way for dealing with the cost/reliability trade-off when running applications on spot instances.

The reasoning detailed in Algorithm 6 considers an execution with no checkpoint as the primary fault-tolerant plan, applied when appTime is considered short (line 23), i.e. less than 60 minutes, and requiring no other parameters. When appTime is considered long (line 5), the survival rate and its success are considered mandatory parameters: the survival rate is calculated (between lines 11 and 20), counting a success when the instant time overtakes the processed survival time (line 20).

If the success rate is high, a convenient checkpointing interval is calculated from $T_{S}$ . Given an application’s required execution time ( $T_{A}$ ), a checkpoint overhead ( $C_{O}$ ), a maximum interval in minutes defined by the user ( $\textit{TU}_{\textit{chk}}$ ) and $T_{S}$ , estimated by STM, a convenient checkpoint interval (CI) can be obtained as follows:

$\displaystyle\textit{CI}=\min(T_{S}-C_{O},T_{A},\textit{TU}_{\textit{chk}})$ (8)
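As a worked illustration of Eq. (8), assume purely hypothetical values $T_{S}=120$ , $C_{O}=5$ , $T_{A}=300$ and $\textit{TU}_{\textit{chk}}=180$ (all in minutes; none of these figures comes from the experiments):

$\displaystyle\textit{CI}=\min(120-5,\,300,\,180)=\min(115,\,300,\,180)=115$

Under these assumptions the survival time is the binding constraint, so a checkpoint would be taken every 115 minutes, leaving the 5-minute overhead to complete the checkpoint before the predicted survival time elapses.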

When the survival rate offers low confidence, a checkpoint fault-tolerant plan which does not use $T_{S}$ is considered as a last alternative. In a dynamic execution scenario, job migration can occur if a considerable price change pattern is observed by the agents, which then request a new instance with a different bid strategy based on convenient recent values.
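The production rule described above can be sketched as a small decision function. This is an illustration under explicit assumptions, not the authors' rule base: the `min_success` threshold, the strategy labels, the treatment of exactly 60 minutes as short, and the use of the user-defined maximum interval as the fallback are all hypothetical, since the text does not fix them.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ExecutionPlan:
    strategy: str
    checkpoint_interval: Optional[float]  # minutes; None when no checkpoint is taken


def define_plan(app_time, success_rate, t_s, c_o, tu_chk,
                short_limit=60, min_success=0.9):
    """Choose a fault-tolerant plan from appTime, the success rate and T_S.

    short_limit follows the 60-minute rule of Algorithm 6; min_success is a
    hypothetical threshold for what counts as a "high" success rate.
    """
    if app_time <= short_limit:
        # Short executions run without checkpointing.
        return ExecutionPlan("no-checkpoint", None)
    if success_rate >= min_success:
        # Eq. (8): CI = min(T_S - C_O, T_A, TU_chk).
        return ExecutionPlan("survival-checkpoint", min(t_s - c_o, app_time, tu_chk))
    # Low confidence: fall back to a plan that does not rely on T_S
    # (assumed here to be the user-defined maximum interval).
    return ExecutionPlan("fallback-checkpoint", tu_chk)
```

Keeping the three branches in one function makes the ordering of the rule explicit: the duration test comes first, and $T_{S}$ is only consulted when the historical success rate justifies trusting it.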

The main goal of our proposal is to observe the actual environment and find an optimal fault-tolerant plan and parameters that minimize the total runtime and reduce costs. To reach these goals, our agents use the approach presented in [4], which uses CBR, survival analysis, and quantiles of the time until revocation to find a probable survival time and use it as part of a strategy to avoid occurrences of failures. In addition to [4], elements of this approach can extend the fault-tolerant definitions by using other techniques, like recency factor to obtain better results, no checkpointing or even user defined exception handling.
4. Experiments and results

4.1 Experimental scenario

A set of Amazon’s spot instances and their respective price change histories was used to evaluate our proposal and predict $T_{S}$ , which is used as input to define a fault-tolerant execution plan with its respective parameters. We extended our experiment by adding the reasoning process that defines fault-tolerant parameters for 78 instances, including all zones (us-west-1b, us-west-1c, us-west-2a, us-west-2b, us-west-2c) of the US-WEST-1 and US-WEST-2 regions.

Using 19 months of real price changes collected from Amazon AWS between April 2017 and October 2018 (approximately 21 million records), our experiments simulate 1,362,816 scenarios using 78 instances, 4 bid strategies (actual, mean of the last week, median of the last week, mean of the last day), and two regions (US-WEST-1 and US-WEST-2) in a 13-week scenario (from August to October), covering all combinations of day-of-week and hour-in-day.

In order to define the optimal range of data to use for case generation and survival curve estimation, we calculated the success rate as a function of the number of months of previous data used. The results can be found in Fig. 7.

Figure 7.

Success rate as a function of the amount of data used (in months).

Two important features of Fig. 7 require comment. First, there is a general trend of increasing success as the number of months of data used increases. This is to be expected, as long as the data generating process has not changed significantly over the period analyzed, since more data allows us to estimate the non-parametric survival curves more accurately, leading to better estimates of the revocation time. Second, there is a significant dip in the success rate at 9 months. This occurs because of the introduction of new instance types with few observations, which leads to poor estimation and more failures for these instance types and reduces the overall success rate.

Given the increasing success rate as more observations are used, we prefer the use of all the data collected over the entire period. From the collected data, approximately 110 million cases were generated, considering all zones and instances from US-WEST-1 and US-WEST-2 regions.

4.2 Evaluating time until revocation

Each day-of-week and hour-in-day relationship has its own survival curve and is represented in a Survival Time Matrix (STM), as illustrated in Table 5, which shows the time until revocation, in minutes, obtained using the 98 ${}^{\rm th}$ confidence level and extracted from the respective survival curves. The values in the STM result from the similar cases recovered by the similarity function from the case database, calculated from the best experimental results for confidence levels between 90 ${}^{\rm th}$ and 98 ${}^{\rm th}$ . The best results were achieved using 98 ${}^{\rm th}$ as the confidence level.

Table 5
Generated STM of r4.xlarge instance using values from the 98 ${}^{\rm th}$ confidence level and extracted from their respective survival curve

		0	1	2	3	4	5	6	7	8	9	10	11	12	13	14	15	16	17	18	19	20	21	22	23
Day of the week	1	202	170	125	148	120	97	120	101	82	116	139	130	291	380	321	234	187	207	170	134	104	100	90	206
	2	172	120	128	83	109	77	84	98	68	115	110	106	99	112	130	105	159	155	136	110	87	106	259	179
	3	135	169	167	136	99	105	127	95	70	79	96	98	79	101	106	101	78	101	103	80	88	120	153	112
	4	142	103	118	104	136	100	147	94	69	85	164	133	119	92	91	99	99	84	136	84	83	136	139	196
	5	143	117	149	160	128	92	103	87	68	84	130	111	90	81	95	105	91	106	124	95	84	97	158	186
	6	139	153	136	108	90	72	102	91	70	100	172	127	93	82	115	110	97	153	108	126	82	149	121	226
	7	176	183	277	221	197	151	130	156	98	81	198	135	109	116	125	84	103	156	120	100	80	153	117	181
	Hour in Day
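
As a rough illustration of how one STM cell might be derived, the sketch below groups observed times until revocation by day-of-week and hour-in-day and picks, for each cell, a time that at least 98% of the observations reached. It is only a simplified stand-in: it uses a plain empirical quantile instead of the survival-curve estimation of [4], it ignores censored observations, and the names `survival_time` and `build_stm` as well as the sample data are hypothetical.

```python
from collections import defaultdict


def survival_time(times, confidence=0.98):
    """Return a time (minutes) that at least `confidence` of the samples reached."""
    ordered = sorted(times)
    # Skip the shortest (1 - confidence) share of the observations.
    k = int((1.0 - confidence) * len(ordered))
    return ordered[k]


def build_stm(observations, confidence=0.98):
    """Group (day, hour, minutes) observations into a {(day, hour): minutes} matrix."""
    cells = defaultdict(list)
    for day, hour, minutes in observations:
        cells[(day, hour)].append(minutes)
    return {cell: survival_time(ts, confidence) for cell, ts in cells.items()}


# Hypothetical data: 100 observations for day 1 at hour 12, lasting 1..100 minutes.
observations = [(1, 12, m) for m in range(1, 101)]
stm = build_stm(observations)
print(stm[(1, 12)])
```

A dictionary keyed by `(day, hour)` keeps the lookup trivial for an agent that only needs the cell matching the moment an instance is requested.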

To evaluate the STM times, we created a set of experiments that compare STM values with real scenarios. Each value was compared with an auction simulation. For each day-of-week and hour-in-day relationship, the success count increases by 1 when the time until revocation extracted from the STM is achieved in the simulation, and by 0 otherwise, as detailed in Algorithm 6. The results are shown in the heat map presented in Fig. 8.
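
The counting rule described above can be sketched as follows. This is a reading of the prose, not a transcription of Algorithm 6: it assumes a prediction counts as achieved when the simulated instance lasts at least as long as the STM time. The function names and the simulated durations are hypothetical; the two STM cells are taken from Table 5 (day 3, hours 12 and 13).

```python
def success_matrix(stm, simulations):
    """Count, per (day, hour) cell, simulated runs that lasted at least the STM time."""
    counts = {cell: [0, 0] for cell in stm}  # cell -> [successes, attempts]
    for day, hour, observed_minutes in simulations:
        cell = (day, hour)
        if cell not in counts:
            continue  # no prediction available for this cell
        counts[cell][1] += 1
        if observed_minutes >= stm[cell]:
            counts[cell][0] += 1  # prediction held: add 1, otherwise add 0
    return counts


def success_rate(counts):
    """Overall share of attempts in which the predicted time was achieved."""
    successes = sum(s for s, _ in counts.values())
    attempts = sum(a for _, a in counts.values())
    return successes / attempts if attempts else 0.0


stm = {(3, 12): 79, (3, 13): 101}  # values from Table 5
# Hypothetical simulated times until revocation, in minutes.
simulations = [(3, 12, 90), (3, 12, 60), (3, 13, 101), (3, 13, 240)]
counts = success_matrix(stm, simulations)
rate = success_rate(counts)
print(counts, rate)
```

The per-cell counts are what a heat map such as the one in Fig. 8 would display, while the aggregate rate corresponds to the overall success percentage.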

Figure 8.

Heat map of survival success compared with time until revocation in r3.large.

Using the same experimental scenario, 1,362,816 simulations were executed. An example for a memory optimized instance type (r3.large) is illustrated in Fig. 8, achieving success rates around 80% with a standard deviation of 1.56 percentage points. Considering the simulations of all 78 instances, the success rate rises to around 85%. This occurs because some instances with high monetary costs have stable price variations, allowing long times until revocation.

An unsecured gap can be observed on Wednesday and Thursday between 12pm and 6pm (18 h), with 7 failures in 13 attempts, showing that another bid strategy is needed to increase the observed time until revocation and achieve better success rates.

4.3 Updating time until revocations using a recency factor

To achieve better results, a new strategy can be incorporated into the $\hat{\textit{SF}}(\textit{time})$ function, where more recent generated cases and results receive a greater weight. This is one way to deal with changes in the data generating process over the period. In this strategy, the interval used in our experiment was reduced to 15 months, from April/17 to June/18 and a new validation period was added to increase the accuracy of our approach.

To refine the STM values and include a recency factor to time until revocation, the last 4 weeks (Jul/18) were used to simulate real time until revocation, creating a new subset of cases and calculating the median times between them to reproduce a new recent scenario. With this change, a new STM was generated (STM’), representing new values in the day-of-week and hour-in-day relationship with survival times that consider recent data. A generated STM’ is presented in Table 6 with new values compared to Table 5.
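
One possible reading of this refinement is sketched below: for every day-of-week and hour-in-day cell that has observations in the recent validation window, the STM value is replaced by the median of those recent times. The exact combination rule is not spelled out here, so this is an assumption, as is the choice to keep the original value when a cell has no recent data. The function name `refresh_stm` and the recent observations are hypothetical.

```python
from statistics import median


def refresh_stm(stm, recent_observations):
    """Return STM' where cells with recent data take the median of the recent times."""
    recent = {}
    for day, hour, minutes in recent_observations:
        recent.setdefault((day, hour), []).append(minutes)
    updated = dict(stm)  # cells without recent data keep their original value
    for cell, times in recent.items():
        updated[cell] = median(times)
    return updated


stm = {(1, 0): 202, (1, 1): 170}  # two cells from Table 5
# Hypothetical observations from four recent weeks for day 1, hour 0.
recent = [(1, 0, 150), (1, 0, 180), (1, 0, 200), (1, 0, 190)]
stm_prime = refresh_stm(stm, recent)
print(stm_prime)
```

Using the median rather than the mean limits the influence of a single unusually long or short recent run on the refreshed cell.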

Table 6
A new STM’ created after recency strategy in $\hat{\textit{SF}}(\textit{time})$ function

		0	1	2	3	4	5	6	7	8	9	10	11	12	13	14	15	16	17	18	19	20	21	22	23
Day of the week	1	180	120	152	128	110	67	101	101	71	96	99	112	211	240	288	221	117	170	120	111	93	70	80	189
	2	100	100	112	80	109	60	65	89	80	120	101	160	120	112	150	150	139	112	130	140	97	112	233	119
	3	115	122	117	168	120	98	97	96	93	90	91	91	71	111	93	100	70	151	111	86	89	130	143	101
	4	132	131	108	97	106	80	124	90	69	85	124	111	112	111	101	80	89	74	126	88	88	122	132	156
	5	144	111	143	120	113	112	113	90	60	60	111	111	93	88	95	105	91	116	112	99	80	78	118	177
	6	139	168	120	98	97	96	93	90	91	91	71	127	93	82	115	110	97	153	108	126	82	149	121	226
	7	176	91	71	111	93	100	70	151	111	198	93	90	91	91	71	127	93	82	115	110	90	113	116	112
	Hour in Day

Better results were achieved with this recency change, as can be observed in Fig. 9. Giving greater weight to more recent results allowed a gain of 12 percentage points, reaching a 92% success rate under the same conditions of our experiment. The gains achieved for the instance types illustrated in Fig. 10 are compared in Table 7.

Figure 9.

Heat map of survival success after recency factor in $\hat{SF}$ .

The new heat map with the experimental results for the day-of-week and hour-in-day relationship can be observed in Fig. 9, showing increased success compared to Fig. 8. Green areas mean that the time prediction was achieved in a real auction scenario. Considering that Amazon AWS provides a wide selection of instance types optimized to fit different use cases, giving the user flexibility to choose the appropriate mix of resources for her applications, a set of optimized instance types is used in our experiments and a subset is presented in Fig. 10.

Table 7

Success rates (in percentage) obtained by the experiments in Fig. 10

	(a & b)	(c & d)	(e & f)	(g & h)
Original strategy	76%	94%	80%	32%
Recency strategy	91%	96%	89%	39%

Figure 10.

Set of heat map results considering distinct instance types.

A compute optimized instance type was used in (a) and (b), a kind of instance used on compute-intensive workloads that delivers very cost-effective high performance at a low price per compute ratio. The results of an accelerated computing instance with GPU resources optimized for graphics-intensive applications can be found in (c) and (d). The results shown in (e) and (f) represent a storage optimized instance with low latency and very high random I/O performance. Lastly, (g) and (h) represent a general purpose instance that provides a balance of compute, memory, and network resources. A compilation of the results obtained by the experiments presented in Fig. 10 can be observed in Table 7.

Observing the results for (g & h), even using the recency strategy only a small gain was obtained, from 32% to 39%. This kind of instance has considerable price changes over time, requiring a different strategy which considers more recent price change patterns.

Compared to Fig. 10(h), better results were achieved using the last 7 days as a validation period in our recency factor strategy, as shown in Fig. 11.

Figure 11.

New heat map of the m4.xlarge instance considering a recency of 7 days.

5. Comparative study

Given the prediction accuracy presented in [4] (92%), a reliable time until revocation extracted from STM can be used to define an appropriate fault-tolerant execution plan and parameters, e.g. checkpoint fault-tolerant mechanism and their intervals, as detailed in Algorithm 6.

Considering an application that requires time $T_{A}=$ 1440 minutes, with a checkpointing overhead time of $C_{t}$ minutes and a user-defined $T_{\textit{chk}}=T_{A}-C_{t}$ in case of unexpected running delay, we can compare the different approaches in the related works with respect to the number of checkpoints and total cost (Fig. 12) and total execution time (Fig. 13).

Figure 12.

Reducing checkpointing and total cost compared to [36, 37, 40, 41, 39, 20, 33].

Figure 13.

Reducing total execution time compared to [36, 37, 40, 41, 39, 20, 33].

Regarding the checkpoint/restore fault-tolerant technique, commonly used when $T_{A}$ is long, we can compare different approaches in the literature with respect to total execution time (Fig. 13) and number of checkpoints with total cost (Fig. 12), using Eq. (8), as follows: (1) and (2) use a fixed $T_{\textit{chk}}=$ 60 min [37, 40, 41]; (3) uses a fixed $T_{\textit{chk}}=60-C_{t}$ , taking overhead times into account [36]; ( $4_{a}$ ) and ( $4_{b}$ ) treat each price change as a trigger for a new checkpoint [39], using $T_{\textit{chk}}=$ 96 min and $T_{\textit{chk}}=$ 145 min respectively, recovered from real records in our price change database, while [20] treats each price change as a failure; when the checkpoint limit is exceeded, the maximum value is used; ( $5_{a}$ ), ( $5_{b}$ ) and ( $5_{c}$ ) use fixed intervals of $T_{\textit{chk}}=$ 10 min, $T_{\textit{chk}}=$ 15 min and $T_{\textit{chk}}=$ 20 min, respectively [33]; (6) uses $T_{\textit{chk}}=$ 30 min as a fixed interval [41].

In (7) we present the results of our approach for different $T_{S}$ scenarios: ( $7_{a}$ ) $T_{S}=T_{A}$ ; ( $7_{b}$ ) half the total time ( $T_{S}=T_{A}/2$ ); ( $7_{c}$ ) evaluating a new time after a checkpoint, using the same $T_{S}=T_{A}/2$ ; ( $7_{d}$ ) a value slightly below the total time ( $T_{S}=T_{A}-$ overhead time); ( $7_{e}$ ) a value slightly above the total time ( $T_{S}=T_{A}+$ overhead time); and ( $7_{f}$ ) $T_{S}$ equal to double the total time ( $T_{S}=2T_{A}$ ).

Since some related works consider only compute-intensive applications, the cost of the checkpoint is low and short intervals do not significantly affect the execution time. Experimental results for different $T_{A}$ and $C_{t}$ are presented in Tables 8–10.

Table 8

Results of total time from different scenarios

Strategy	$T_{A}=$ 300			$T_{A}=$ 400			$T_{A}=$ 900			$T_{A}=$ 1200			$T_{A}=$ 1440
	$C_{t}{=}\text{5}$	$C_{t}{=}\text{10}$	$C_{t}{=}\text{30}$	$C_{t}{=}\text{5}$	$C_{t}{=}\text{10}$	$C_{t}{=}\text{30}$	$C_{t}{=}\text{5}$	$C_{t}{=}\text{10}$	$C_{t}{=}\text{30}$	$C_{t}{=}\text{5}$	$C_{t}{=}\text{10}$	$C_{t}{=}\text{30}$	$C_{t}{=}\text{5}$	$C_{t}{=}\text{10}$	$C_{t}{=}\text{30}$
$1$	325	350	450	430	460	580	975	1050	1350	1365	1470	1890	1560	1680	2160
$2$	325	350	450	430	460	580	975	1050	1350	1365	1470	1890	1560	1680	2160
$3$	325	350	450	435	470	610	980	1060	1380	1370	1480	1920	1570	1700	2220
$4_{a}$	600	600	600	1280	800	790	1380	1860	1800	1740	2220	2520	1920	2400	2880
$4_{b}$	600	600	600	1600	800	790	1800	2350	1800	1985	2710	2520	2165	2890	2880
$5_{a}$	375	450	750	500	600	1000	1125	1350	2250	1575	1890	3150	1800	2160	3600
$5_{b}$	400	500	900	530	660	1180	1200	1500	2700	1680	2100	3780	1920	2400	4320
$5_{c}$	450	600	1200	600	800	1600	1350	1800	3600	1890	2520	5040	2160	2880	5760
$6$	350	400	600	465	530	790	1050	1200	1800	1470	1680	2520	1680	1920	2880
$7_{a}$	305	310	330	405	410	430	915	930	990	1280	1300	1380	1460	1480	1560
$7_{b}$	310	320	360	410	420	460	930	960	1080	1300	1340	1500	1485	1530	1710
$7_{c}$	305	310	330	405	410	430	905	910	930	1265	1270	1290	1445	1450	1470
$7_{d}$	305	310	330	405	410	430	915	930	990	1280	1300	1380	1460	1480	1560
$7_{e}$	305	310	330	405	410	430	910	920	960	1280	1300	1380	1460	1480	1560
$7_{f}$	305	310	330	405	410	430	905	910	930	1270	1280	1320	1450	1460	1500

Table 8 illustrates the total time results. Although different behaviors are observed, in general, the agents perform well using $T_{S}$ and calculated time until revocation to define the checkpoint interval, this being the best choice compared to other strategies, whether $C_{t}$ represents shorter or longer CPU or memory intensive applications. The application total execution time reflects on the total cost. A better strategy to decrease this time helps to reduce monetary costs for cloud users.

Table 9

Results of total cost (in US dollars) charged in different scenarios

Strategy	$T_{A}=$ 300			$T_{A}=$ 400			$T_{A}=$ 900			$T_{A}=$ 1200			$T_{A}=$ 1440
	$C_{t}{=}\text{5}$	$C_{t}{=}\text{10}$	$C_{t}{=}\text{30}$	$C_{t}{=}\text{5}$	$C_{t}{=}\text{10}$	$C_{t}{=}\text{30}$	$C_{t}{=}\text{5}$	$C_{t}{=}\text{10}$	$C_{t}{=}\text{30}$	$C_{t}{=}\text{5}$	$C_{t}{=}\text{10}$	$C_{t}{=}\text{30}$	$C_{t}{=}\text{5}$	$C_{t}{=}\text{10}$	$C_{t}{=}\text{30}$
$1$	30.44	32.78	42.15	40.28	43.09	54.33	91.33	98.35	126.45	127.86	137.69	177.03	146.12	157.36	202.32
$2$	30.44	32.78	42.15	40.28	43.09	54.33	91.33	98.35	126.45	127.86	137.69	177.03	146.12	157.36	202.32
$3$	30.44	32.78	42.15	40.74	44.02	57.14	91.79	99.29	129.26	128.32	138.63	179.84	147.06	159.23	207.94
$4_{a}$	56.20	56.20	56.20	119.89	74.93	74.00	129.26	174.22	168.60	162.98	207.94	263.04	179.84	224.80	269.76
$4_{b}$	56.20	56.20	56.20	149.87	74.93	74.00	168.60	220.12	168.60	185.93	253.84	263.04	202.79	270.70	269.76
$5_{a}$	35.12	42.15	70.25	46.83	56.20	93.67	105.38	126.45	210.75	147.53	177.03	295.05	168.60	202.32	337.20
$5_{b}$	37.47	46.83	84.30	49.64	61.82	110.53	112.40	140.50	252.90	157.36	196.70	354.06	179.84	224.80	404.64
$5_{c}$	42.15	56.20	112.40	56.20	74.93	149.87	126.45	168.60	337.20	177.03	236.04	472.08	202.32	269.76	539.52
$6$	32.78	37.47	56.20	43.55	49.64	74.00	98.35	112.40	168.60	137.69	157.36	236.04	157.36	179.84	269.76
$7_{a}$	28.57	29.04	30.91	37.94	38.40	40.28	85.70	87.11	92.73	119.89	121.77	129.26	136.75	138.63	146.12
$7_{b}$	29.04	29.97	33.72	38.40	39.34	43.09	87.11	89.92	101.16	121.77	125.51	140.50	139.09	143.31	160.17
$7_{c}$	28.57	29.04	30.91	37.94	38.40	40.28	84.77	85.24	87.11	118.49	118.96	120.83	135.35	135.82	137.69
$7_{d}$	28.57	29.04	30.91	37.94	38.40	40.28	85.70	87.11	92.73	119.89	121.77	129.26	136.75	138.63	146.12
$7_{e}$	28.57	29.04	30.91	37.94	38.40	40.28	85.24	86.17	89.92	119.89	121.77	129.26	136.75	138.63	146.12
$7_{f}$	28.57	29.04	30.91	37.94	38.40	40.28	84.77	85.24	87.11	118.96	119.89	123.64	135.82	136.75	140.50

The number of checkpoints, CA, and the length of time for each checkpoint, $C_{t}$ , impact application execution time. Each checkpoint overhead time $C_{t}$ increases the total used time UT as follows:

$\displaystyle\textit{UT}=\sum_{i=1}^{\textit{CA}}C_{t,i}+T_{A}$ (9)
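
Eq. (9) can be checked against the tables with a few lines of code. The sketch below assumes, for the fixed-interval strategies only, that the number of checkpoints is the integer quotient of $T_{A}$ by $T_{\textit{chk}}$ ; this assumption reproduces the counts of strategies (1) and (6) in Table 10 but is not a statement of Eq. (8). The function names are hypothetical.

```python
def fixed_interval_checkpoints(t_a, t_chk):
    """Assumed checkpoint count CA for a fixed interval: one per completed interval."""
    return t_a // t_chk


def used_time(t_a, overheads):
    """Eq. (9): UT is T_A plus the sum of the overheads C_{t,i} of all CA checkpoints."""
    return t_a + sum(overheads)


t_a, c_t = 1440, 5  # minutes

# Strategy (1): fixed T_chk = 60 min.
ca_fixed = fixed_interval_checkpoints(t_a, 60)
ut_fixed = used_time(t_a, [c_t] * ca_fixed)

# Strategy (7a): 4 checkpoints, the count reported in Table 10 for T_A = 1440.
ut_ours = used_time(t_a, [c_t] * 4)

print(ca_fixed, ut_fixed, ut_ours)
```

With these inputs the sketch yields 24 checkpoints and 1560 minutes for strategy (1), and 1460 minutes for ( $7_{a}$ ), matching the corresponding entries of Tables 8 and 10.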

Table 10

Results of checkpoint calculated from different scenarios

Strategy	$T_{A}=$ 300			$T_{A}=$ 400			$T_{A}=$ 900			$T_{A}=$ 1200			$T_{A}=$ 1440
	$C_{t}{=}\text{5}$	$C_{t}{=}\text{10}$	$C_{t}{=}\text{30}$	$C_{t}{=}\text{5}$	$C_{t}{=}\text{10}$	$C_{t}{=}\text{30}$	$C_{t}{=}\text{5}$	$C_{t}{=}\text{10}$	$C_{t}{=}\text{30}$	$C_{t}{=}\text{5}$	$C_{t}{=}\text{10}$	$C_{t}{=}\text{30}$	$C_{t}{=}\text{5}$	$C_{t}{=}\text{10}$	$C_{t}{=}\text{30}$
$1$	5	5	5	6	6	6	15	15	15	21	21	21	24	24	24
$2$	5	5	5	6	6	6	15	15	15	21	21	21	24	24	24
$3$	5	5	5	7	7	7	16	16	16	22	22	22	26	26	26
$4_{a}$	60	30	10	96	40	13	96	96	30	96	96	42	96	96	48
$4_{b}$	60	30	10	160	40	13	180	145	30	145	145	42	145	145	48
$5_{a}$	15	15	15	20	20	20	45	45	45	63	63	63	72	72	72
$5_{b}$	20	20	20	26	26	26	60	60	60	84	84	84	96	96	96
$5_{c}$	30	30	30	40	40	40	90	90	90	126	126	126	144	144	144
$6$	10	10	10	13	13	13	30	30	30	42	42	42	48	48	48
$7_{a}$	1	1	1	1	1	1	3	3	3	4	4	4	4	4	4
$7_{b}$	2	2	2	2	2	2	6	6	6	8	8	8	9	9	9
$7_{c}$	1	1	1	1	1	1	1	1	1	1	1	1	1	1	1
$7_{d}$	1	1	1	1	1	1	3	3	3	4	4	4	4	4	4
$7_{e}$	1	1	1	1	1	1	2	2	2	4	4	4	4	4	4
$7_{f}$	1	1	1	1	1	1	1	1	1	2	2	2	2	2	2

Table 8 shows a gain of 49.17% in total execution time compared to ( $4_{b}$ ) when short $T_{A}$ and $C_{t}$ are used, and of 72.5% when $C_{t}$ is increased. Moreover, a 74.48% time reduction was achieved for long $C_{t}$ compared to ( $5_{c}$ ). Furthermore, the worst case previously observed ( $T_{\textit{chk}}=10$ ) is a better strategy than a price change strategy when the application is not compute or memory intensive and $T_{A}$ is short. Similarly, Table 10 presents the number of checkpoints used by the different strategies. These values affect the total execution time presented in Table 8 and the total cost per instance shown in Table 9.
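
The reported gains can be recomputed from Table 8 as relative reductions in total time. In the sketch below, the pairing of table entries with each percentage is our reading of the table: 49.17% from ( $4_{b}$ ) versus ( $7_{a}$ ) at $T_{A}=300$ , $C_{t}=5$ , and 74.48% from ( $5_{c}$ ) versus ( $7_{c}$ ) at $T_{A}=1440$ , $C_{t}=30$ . The 72.5% figure is reproduced by ( $5_{c}$ ) versus ( $7_{a}$ ) at $T_{A}=300$ , $C_{t}=30$ , which is an assumption about the intended comparison. The function name `time_gain` is hypothetical.

```python
def time_gain(baseline, ours):
    """Relative reduction in total time, in percent, with respect to a baseline."""
    return 100.0 * (baseline - ours) / baseline


# Total times in minutes, read from Table 8.
gain_short = round(time_gain(600, 305), 2)       # (4b) vs (7a), T_A=300, C_t=5
gain_long_ct = round(time_gain(1200, 330), 2)    # (5c) vs (7a), T_A=300, C_t=30 (assumed pairing)
gain_long_run = round(time_gain(5760, 1470), 2)  # (5c) vs (7c), T_A=1440, C_t=30

print(gain_short, gain_long_ct, gain_long_run)
```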

All collected data and source code are available at the project website and can be used to reproduce our experiments and test other scenarios not contemplated in this paper.

6. Conclusion

A resilient multi-agent architecture was presented in this paper to provide an efficient way to use spot instances to run users’ applications, achieving better levels of reliability and reducing total execution time and costs, with a focus on an important element of a resilient environment: fault-tolerant behaviors. The architecture uses a prediction approach to support agent decisions in a dynamic scenario. To the best of our knowledge, there is no published work that combines intelligent agents with CBR and survival analysis to predict failure events and an FT mechanism based on adaptive checkpointing to increase the reliability of user applications in spot instances.

We believe the proposed approach can be used to define fault-tolerant techniques and their respective parameters. Moreover, to obtain longer survival times, new bid strategies can be introduced, e.g. using bid values defined by a percent increase over actual instance prices. Using real price traces of spot instances, our experiments have confirmed the accuracy of the proposed model.

We identified instances for which our methodology yielded poor success rates. For these instances, the data generating process appears to be changing more quickly than for other instances. To capture these changes we implemented a recency factor which gives greater weight to more recent observations and increases success rates.

As future work, the proposed architecture can be extended to deal with multiple cloud regions of the same provider, considering data transfer costs between different regions. In addition, a multiple provider approach can be used, extending the number of pricing models according to each cloud provider. Also, other strategies can be explored to exploit more recent data, such as comparing recent instance price changes to update fault-tolerant parameters. To increase reliability, a combination of fault-tolerant plans can be used, e.g. a multi-strategic fault-tolerant framework.

Acknowledgments

This work was supported in part by the Brazilian Coordination for the Improvement of Higher Education Personnel – (CAPES) through grant number 1441250. Prof. Celia Ghedini Ralha thanks the support received from the Brazilian National Council for Scientific and Technological Development (CNPq) for the research productivity in Computer Science area through grant number 311301/2018-5.

Authors’ Bios

José Pergentino de Araújo Neto holds a bachelor’s degree in Information Systems from the Faculdades Integradas de Patos (2005) and a Master’s degree in Software Engineering from the Center for Advanced Studies and Systems in Recife, Brazil (2009). He is currently a PhD student in the Informatics Graduate Program at the Department of Computer Science, University of Brasília, Brazil. He is a lecturer at the Institute of Higher Education of Brasília. He has experience in computer science with emphasis on information systems, software architecture, distributed systems, artificial intelligence and multiagent systems.

Donald Matthew Pianto (dpianto@gmail.com) holds a Ph.D. in Physics from the University of Illinois at Urbana Champaign, USA (1995), a PhD in Computational Mathematics from the Federal University of Pernambuco, Brazil (2008), and a Bachelor’s degree in Physics from SUNY at Stony Brook University, USA (1994). He is currently an Adjunct Professor at Statistics Department, University of Brasília, Brazil. He is a member of the Graduate Program in Applied Computing in the Department of Computer Science and Economics of the Public Sector in the Department of Economics, both programs in the University of Brasília. He has experience in the area of probability and statistics with emphasis in econometrics, statistical analysis of neural data, statistical analysis of images, and estimation of causal effects.

Célia Ghedini Ralha (ghedini@unb.br) holds a Ph.D. in Computer Science from Leeds University, England (1996) and a M.Sc. in Informatics from Aeronautics Institute of Technology, Brazil (1990). She is an Associate Professor at the Department of Computer Science, University of Brasília, Brazil. She has participated in many research projects and published several papers in international journals and conferences. She is a member of the Informatics Graduate Program at the Department of Computer Science, University of Brasília and a senior member of the Brazilian Computer Society. She receives a research productivity grant in the Computer Science area from the Brazilian National Council for Scientific and Technological Development (CNPq). Her current research interests include design of intelligent and knowledge-based systems, multiagent systems, agent-based simulation and multiagent planning.

References

1. Aamodt and Plaza, Case-based reasoning: Foundational issues, methodological variations, and system approaches, AI Communications 7(1) (March 1994), 39–59.

2. Al-Ayyoub, Jararweh, Daraghmeh and Althebyan, Multi-agent based dynamic resource provisioning and monitoring for cloud computing systems infrastructure, Cluster Computing 18(2) (June 2015), 919–932.

3. Al-Kuwaiti, Kyriakopoulos and Hussein, A comparative analysis of network dependability, fault-tolerance, reliability, security, and survivability, IEEE Communications Surveys Tutorials 11(2) (June 2009), 106–124.

4. Araujo Neto J.P., Pianto D.M. and Ralha C.G., A Prediction Approach to Define Checkpoint Intervals in Spot Instances, in: Proceedings of the 11th International Conference on Cloud Computing, CLOUD 2018, SCF 2018, Volume 10967, Springer, Seattle, WA, USA, 2018, pp. 84–93.

5. Araujo Neto J.P., Pianto D.M. and Ralha C.G., A resilient agent-based architecture for efficient usage of transient servers in cloud computing, in: 2018 IEEE International Conference on Cloud Computing Technology and Science (CloudCom), 2018, pp. 218–225.

6. Bajo, De la Prieta, Corchado J.M. and Rodríguez, A low-level resource allocation in an agent-based cloud computing platform, Applied Soft Computing 48 (November 2016), 716–728.

7. Buyya, Yeo C.S., Venugopal, Broberg and Brandic, Cloud computing and emerging IT platforms: Vision, hype, and reality for delivering computing as the 5th utility, Future Generation Computer Systems 25(6) (June 2009), 599–616.

8. Cirne, Brasileiro, Sauvé, Andrade, Paranhos, Santos-neto and Medeiros, Grid computing for bag of tasks applications, in: 3rd IFIP Conference on E-Commerce, E-Business and EGovernment, 2003.

9. Colman, Develder and Tornatore, A survey on resiliency techniques in cloud computing infrastructures and applications, IEEE Communications Surveys & Tutorials 18(3) (2016).

10. Cox D.R., Analysis of survival data, 1st edition, Routledge, 1984.

11. Davis, Buchanan and Shortliffe, Production rules as a representation for a knowledge-based consultation program, Artificial Intelligence 8(1) (1977), 15–45.

12. De la Prieta, Rodríguez, Bajo and Corchado J.M., +cloud: A virtual organization of multiagent system for resource allocation into a cloud computing environment, in: Transactions on Computational Collective Intelligence XV, Nguyen N.T., Kowalczyk, Corchado J.M. and Bajo, eds, Springer Berlin Heidelberg, Berlin, Heidelberg, 2014, pp. 164–181.

13. Elnozahy E.N.M., Alvisi, Wang Y.-M. and Johnson D.B., A survey of rollback-recovery protocols in message-passing systems, ACM Computing Surveys 34(3) (September 2002), 375–408.

14. Iosup, Ostermann, Yigitbasi M.N., Prodan, Fahringer and Epema D.H.J., Performance analysis of cloud computing services for many-tasks scientific computing, IEEE Transactions on Parallel and Distributed Systems 22(6) (June 2011), 931–945.

15. Iosup, Sonmez, Anoep and Epema, The performance of bags-of-tasks in large-scale distributed systems, in: Proceedings of the 17th International Symposium on High Performance Distributed Computing – HPDC ’08, ACM Press, 2008, pp. 97–108.

16. Javadi, Thulasiram R.K. and Buyya, Characterizing spot price dynamics in public cloud environments, Future Generation Computer Systems 29(4) (June 2013), 988–999.

17. Javadi, Thulasiram R.K. and Buyya, Statistical modeling of spot instance prices in public cloud environments, in: 2011 Fourth IEEE International Conference on Utility and Cloud Computing, IEEE, 2011, pp. 219–228.

18. Jula, Sundararajan and Othman, Cloud computing service composition: A systematic literature review, Expert Systems with Applications 41(8) (June 2014), 3809–3824.

19. Kumar K.D. and Umamaheswari, Prediction methods for effective resource provisioning in cloud computing: A survey, Multiagent and Grid Systems 14(3) (September 2018), 283–305.

20. Lee and Son, DeepSpotCloud: Leveraging cross-region GPU spot instances for deep learning, in: 2017 IEEE 10th International Conference on Cloud Computing (CLOUD), IEEE, 2017, pp. 98–105.

21. Meeker W.Q. and Escobar L.A., Statistical methods for reliability data, Wiley, New York, 1998.

22. Mell and Grance, The NIST definition of cloud computing, National Institute of Standards and Technology 53(6) (2009).

23. Meroufel and Belalem, Adaptive checkpointing with reliable storage in cloud environment, Multiagent and Grid Systems 13(3) (September 2017), 253–268.

24. Miller R.G., Jr, Survival analysis, 2nd edition, John Wiley & Sons, 2011.

25. Oprescu A.-M. and Kielmann, Bag-of-Tasks Scheduling under Budget Constraints, in: 2010 IEEE Second International Conference on Cloud Computing Technology and Science, IEEE, 2010, pp. 351–359.

26. Ralha C.G., Mendes A.H.D., Laranjeira L.A., Araújo A.P.F. and Melo A.C.M.A., Multiagent system for dynamic resource provisioning in cloud computing platforms, Future Generation Computer Systems 94 (May 2019), 80–96.

27. Russell and Norvig, Artificial Intelligence: A Modern Approach, 3rd edition, Prentice Hall Press, Upper Saddle River, NJ, USA, 2010.

28. Amazon Web Services, Amazon EC2 spot instances, https://aws.amazon.com/ec2/spot, accessed in January 2018.

29. Sharma, Irwin and Shenoy, Portfolio-driven Resource Management for Transient Cloud Servers, Proceedings of the ACM on Measurement and Analysis of Computing Systems 1(1) (June 2017), 5:1–5:23.

30. Shastri, Rizk and Irwin, Transient guarantees: Maximizing the value of idle cloud capacity, in: SC’16: Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis, 2016, pp. 992–1002.

31. Siddiqui, Tahir G.A., Rehman A.U., Ali, Rasool R.U. and Bloodsworth, Elastic jade: Dynamically scalable multi agents using cloud resources, in: 2012 Second International Conference on Cloud and Green Computing, 2012, pp. 167–172.

32. Stahl, Defining similarity measures: Top-down vs. bottom-up, in: Advances in Case-Based Reasoning, Craw and Preece, eds, Springer Berlin Heidelberg, Berlin, Heidelberg, 2002, pp. 406–420.

33. Subramanya, Guo, Sharma, Irwin and Shenoy, Spoton: A batch computing service for the spot market, in: Proceedings of the Sixth ACM Symposium on Cloud Computing, ACM, 2015, pp. 329–341.

34. Tang, Yang, Liao and Zhu, A shared cache-aware task scheduling strategy for multi-core systems, Journal of Intelligent & Fuzzy Systems 31(2) (July 2016), 1079–1088.

35. Vlacheas P.T., Stavroulaki, Demestichas, Cadzow, Gorniak and Ikonomou, Ontology and taxonomies of resilience, in: Tech Rep, European Network and Information Security Agency, 2011.

36. Voorsluys and Buyya, Reliable Provisioning of Spot Instances for Compute-intensive Applications, in: IEEE 26th International Conference on Adv Information Networking and Applications, 2012, pp. 542–549.

37. Voorsluys, Garg and Buyya, Provisioning spot market cloud resources to create cost-effective virtual clusters, in: International Conference on Algorithms and Architectures for Parallel Processing, Springer, 2011, pp. 395–408.

38. Yang, Peng and Wan, Security-aware data replica selection strategy for Bag-of-Tasks application in cloud computing, Journal of High Speed Networks 21(4) (November 2015), 299–311.

39. Andrzejak and Kondo, Monetary cost-aware checkpointing and migration on amazon cloud spot instances, IEEE Transactions on Services Computing 5(4) (2012), 512–524.

40. Kondo and Andrzejak, Reducing costs of spot instances via checkpointing in the amazon elastic compute cloud, in: 2010 IEEE 3rd International Conference on Cloud Computing, 2010, pp. 236–243.

41. Zhou, Zhang and Wong, Fault tolerant stencil computation on cloud-based gpu spot instances, IEEE Transactions on Cloud Computing (2018).
