Gaussian Adapted Markov Model with Overhauled Fluctuation Analysis-Based Big Data Streaming Model in Cloud

Abstract

Accurate resource usage prediction in big data streaming applications remains a complex task. Existing works have developed various resource scaling techniques for forecasting resource usage in big data streaming systems. However, the baseline streaming mechanisms suffer from inefficient resource scaling, inaccurate forecasting, high latency, and long running time. Therefore, this work develops a new framework, the Gaussian adapted Markov model (GAMM) with overhauled fluctuation analysis (OFA), for efficient big data streaming in cloud systems. The purpose of this work is to efficiently manage time-bounded big data streaming applications with a reduced error rate. A gating strategy is also used to extract the set of features for obtaining the nonlinear distribution of data and a fast convergence solution, which are used to perform the fluctuation analysis. Moreover, a layered architecture is developed to simplify resource forecasting in streaming applications. During experimentation, the results of the proposed stream model, GAMM-OFA, are validated and compared by using different measures.

Introduction

In recent times, data from smartphones, network management, product information, the Internet of Things (IoT), and other data streams^1,2 are used in an escalating variety of applications. Handling these types of big data streams in a time-bounded environment is a highly complex and crucial task, because big data^3–5 takes a long time to process. The conventional data management techniques are not well suited to real-time environments, owing to their processing requirements, expense, and time. So, cloud services⁶ are increasingly used to handle large volumes of big data streams. Typically, the cloud^7–9 is a popular and emerging platform that allows users to access their applications/services without delay and at reduced cost.

Storm, Spark, Mahout, and Hadoop are examples of the tools used in cloud streaming^10,11 applications. These engines allow cloud users to continuously process data with simple operations. However, satisfying the quality of service (QoS)¹² requirements of big data applications remains a challenging task, because the service providers must properly determine the amount of resources needed to fulfill the QoS, specifically the CPU and memory usage. In the existing works, different types of resource scaling methods¹³ have been developed for big data streaming applications. Some of the scaling approaches aim to minimize the cost of resource usage prediction in data streaming. Finding efficient methods for real-time resource adaptation therefore becomes important. One of the most important methods for adjusting on-demand processing resources in cloud computing is elasticity.

In general, resource scaling techniques^14,15 are categorized into proactive and reactive types, of which the reactive approaches are computationally very efficient. Elasticity management in reactive approaches depends on static bounds and if-condition-then rules. Users typically set an upper and a lower limit for an intended performance parameter (such as CPU usage, memory usage, or response time) to activate and deactivate resources accordingly. However, there is a delay before resources are supplied after the system hits an upper bound, and the application may be overloaded during that interval. A lack of responsiveness is another issue with these rules: there are cases where the need to (de)allocate resources is easy to predict, but the resource configuration remains unchanged because of poor threshold-setting decisions.
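The threshold rule described above can be illustrated with a minimal sketch. The function name, the default bounds of 30% and 80%, and the instance limits are hypothetical values chosen for illustration; they do not come from this article.

```python
def reactive_scale(cpu_percent, instances, lower=30.0, upper=80.0,
                   min_instances=1, max_instances=10):
    """Return the new instance count under a static if-condition-then rule.

    All bounds are illustrative assumptions, not values from the article.
    """
    if cpu_percent > upper and instances < max_instances:
        return instances + 1   # scale out once the upper bound is exceeded
    if cpu_percent < lower and instances > min_instances:
        return instances - 1   # scale in once usage drops below the lower bound
    return instances           # inside the band: configuration stays unchanged


scaled_out = reactive_scale(90.0, 2)
scaled_in = reactive_scale(10.0, 2)
unchanged = reactive_scale(50.0, 2)
```

Because the rule reacts only after a bound is crossed, any provisioning delay is paid while the application is already under load, which is the weakness the proactive approaches try to avoid.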

To anticipate system behavior, a proactive method uses prediction algorithms to choose the appropriate adjustment actions. Owing to this capability, the application is able to handle a rise in load when it actually occurs. To accomplish this, time-series-based prediction models are used, including machine learning, reinforcement learning, and pattern matching techniques. Reactive managers, in contrast, base their elasticity decisions only on thresholds; more specifically, resource reconfiguration occurs when either the lower or the upper threshold is exceeded.

However, the reactive approach has major problems¹⁶: increased scaling latency, an inability to handle multiple resources at a time, unsuitability for sensitive applications, and poor tolerance. The proactive techniques¹⁷ overcome these problems by accurately forecasting the requirements of future applications. Yet, some of the baseline proactive models^18–20 are limited by inaccurate prediction results, autoscaling issues, and delays in data arrival, and they are cumbersome to use. Therefore, this work aims to develop a new framework for efficient big data streaming in cloud systems. In this study, the resource usage patterns of multiple streams are accurately predicted by using the classifier. Moreover, the framework individually analyzes and captures the resource requirements, which include memory consumption, CPU utilization rate, and I/O status. The research objectives of this work are as follows:

To efficiently manage the time-bounded big data streaming applications in the cloud environment, a novel Gaussian adapted Markov model (GAMM)-integrated overhauled fluctuation analysis (OFA) framework is developed.

To properly address the resource scaling problem in the big data streaming application system, accurate scaling decisions are made by using the combination of models GAMM-GS-OFA.

To accurately forecast the resource usage with a reduced error rate, an advanced classification algorithm is used.

To design and develop a layered architecture for simplifying the process of resource forecasting in the streaming applications.

The main contribution of this research work is to effectively perform data streaming by accurately predicting the resource usage. For this purpose, an advanced streaming framework is developed with the methodologies GAMM, GS, and OFA. In this system, the streaming data are taken as the input for processing, and then, the nonlinear features are obtained with the use of versatile feature unit integrated with the gating strategy (GS). After that, the data distribution is carried out by using the GAMM, which helps to enable reliable data streaming. During this process, the time stamp estimation and weight calculation processes are carried out to predict the resource usage.

To validate the performance and efficacy of the proposed data streaming framework, widely used measures are employed: latency, CPU utilization rate, memory usage rate, log-likelihood, error rate, and time. Owing to its network of centroids and self-learned gap, the suggested streaming model has high multidimensional scalability when compared with the baselines, as demonstrated by the estimated results. Real-time workload assessments in big data streaming applications frequently show a sizable degree of workload variability over time. The fluctuation can be seen in the off-peak and peak times that frequently characterize the arrival of the workload. These temporal patterns can significantly affect the resources used by streaming applications, and hence, resources must be adjusted up or down as necessary.

The GAMM-OFA model's capacity to make predictions over extended time periods is a significant benefit. When the process switches from one distinct state to another, conditional independence in the GAMM-OFA model is ensured.

The remainder of this article is organized as follows. The Related Works section investigates some of the baseline resource scheduling and load balancing models used in big data streaming application systems. It also examines the pros and cons of each streaming model according to its resource usage prediction rate. The Proposed Methodology section presents a clear explanation of the proposed model, GAMM-GS-OFA, with its flow, architecture, and modeling equations. The Results and Discussion section reports the detailed experimentation and analysis carried out to validate the results of the proposed big data streaming model. Finally, the article is concluded along with its future scope in the concluding section.

Related Works

This section investigates some of the baseline models relevant to resource scaling, allocation, and load balancing in the big data streaming application systems. Moreover, it examines the pros and cons of each resource scaling model based on its streaming and scaling operations.

Ali et al.²¹ presented a comprehensive analysis of the different types of issues in big data analytics. The study also analyzes the major effects of using a big data analytical model based on horizontal and vertical platforms. Based on this study, it was identified that the major problems associated with the existing database models were mismatching, lack of numerical calculations, and missing iterative support. Zeng et al.²² presented a taxonomy of service level agreement (SLA) management schemes used for Big Data Analytical Applications (BDAA). The purpose of this work was to ensure high QoS by efficiently managing the SLAs in cloud systems. Moreover, a new conceptual framework was developed based on a multidimensional categorization mechanism for SLA management. Ullah et al.¹ aimed to develop an enhanced big data resource management framework for cloud systems.

In this study, it was stated that the big data resource management frameworks should satisfy the following parameters such as processing speed, fault tolerance, low latency, security, and scalability. Mortazavi et al.²³ implemented an advanced deadline-aware scheduling mechanism for improving the big data streaming process in the cloud system. This work utilized the data analytic query operators for the public cloud resource provisioning. Typically, minimizing the response time and increasing the energy efficiency were the major processes of the scheduling framework used in the cloud systems. However, it is one of the very complex tasks due to increased dimensionality of big data in the streaming applications.

Most of the cloud scheduling frameworks have been developed to increase the speed of processing, and to solve the size difficulties in the big data streaming applications. Furthermore, the suggested big data computing framework comprises the major phases of data model, query model, and scheduling model. In addition to that, the query partitioning algorithm was also implemented in this work to solve the partial critical path problem. Yet, this conceptual framework was very difficult to understand, and it follows some complex operations to perform the big data streaming.

Vergilio et al.²⁴ deployed a systematic approach, named as unified vendor-agnostic solution for big data streaming application system. The purpose of this work was to investigate the importance of virtualization in the cloud systems for enhancing the process of scheduling. The key components of the big data framework were as follows: data processing, loading, aggregation, visualization, and model specification. Moreover, minimizing the energy consumption was one of the most essential factors that must be properly addressed in the big data streaming applications. Liu et al.²⁵ presented a new wireless big data architecture for streaming applications. In this study, the machine learning models used for big data processing were categorized into different types according to their learning capability, which includes the following:

Instance-based learning

Online learning

Batch learning

Model-based learning

Moreover, the prediction and online refinement processes were performed in this framework by using various optimization-based classification models. These encompass the methodologies of online convex optimization, incremental stochastic gradient, progressive learning, and adaptive learning. Based on this study, throughput maximization is identified as one of the essential factors that need to be addressed in big data streaming systems. Inoubli et al.²⁶ discussed the different types of challenges associated with big data streaming models. In this study, four distinct streaming frameworks were validated and examined for big data applications, which include the following:

Apache Spark

Storm

Flink

Samza

All these frameworks are very popular for big data streaming application systems, and they allow real-time scaling operations with high efficiency. Moreover, the performance of these tools was assessed in terms of latency, machine learning compatibility, flexibility, cluster management, Application Programming Interface, parallelization, and data transport. From this study, it is concluded that memory consumption, CPU utilization rate, and the bandwidth of resource usage must be improved for efficient data streaming. Table 1 presents a comparative analysis of the existing and proposed streaming techniques used for big data streaming applications.

Table 1.

Characteristic analysis of baseline and proposed streaming models

Data stream models	Framework	Self-adaptive	Parallel processing	Dimensionality
D-Stream, and GD-Stream^27,28	Online and Offline	$\times$	$\times$	$\times$
DS and Stream DM^29,30	Online and Offline	$\times$	✓	$\times$
SNC Stream, CEDAS, DBSTREAM^31–33	Online	$\times$	$\times$	$\times$
HDD Stream PKS Stream^34–37	Online and Offline	$\times$	$\times$	✓
ASKM, HDDS, SparkCluStream^34,38–40	Online and Offline	$\times$	$\times$	✓
Proposed GAMM-OFA Stream	Online and Offline	✓	✓	✓

Proposed Methodology

This section provides a clear explanation of the proposed resource scaling methodology used for big data streaming in cloud systems. The original contribution of this work is the design of a new resource prediction framework for big data streaming applications. In this study, the resources are forecasted by analyzing the characteristics/features of each data stream. The working flow and architecture model of the proposed big data streaming application system are shown in Figures 1 and 2, respectively. The proposed streaming framework incorporates the methodologies of GAMM, GS, and OFA for predicting the resource usage. As shown in the layered architecture, the given streaming problem is decoupled into layers, where each stream job is individually modeled by using the combined GAMM-OFA method.

FIG. 1.

The proposed work flow of Gaussian adapted Markov model.

FIG. 2.

Architecture model.

Typically, resource provisioning is one of the most significant problems that need to be addressed in big data streaming applications. It is also essential to accurately predict the resource usage and the time span with a reduced error rate. Hence, this research work aims to implement a novel GAMM-based GS-OFA methodology for predicting the resource usage of big data streaming applications. The primary advantages of this work are computational efficiency, minimized error rate, accurate prediction rate, optimized time period, and less overhead. As shown in Figure 1, the streaming data are obtained as the input for processing, where data streaming is performed with proper resource usage prediction. After getting the data, the GS is used to extract the set of features for obtaining the nonlinear distribution of data, where the GAMM is integrated to enable an effective data distribution.

For better data streaming, feature analysis and data distribution are performed in the proposed framework with the use of the GAMM technique. To improve the performance of GAMM, the gradient descent function is estimated based on the learning rate and weight value. Subsequently, the OFA is performed to accurately predict the resource consumption of the streaming application. Thus, the GAMM is used for enabling reliable and efficient data streaming, and the OFA is used for the prediction of resource usage. As shown in Figure 2, different types of streaming data, such as social media streams, IoT data streams, financial data streams, and mobile streams, are processed in the cloud network with the use of distinct layers: application, processing, storage, data integration, and cloud resource.

Moreover, to manage the resources required for processing these data streams, controlling units such as the resource monitor, resource predictor, and resource allocator are used. Overall, high-dimensional big data streams are effectively processed in the proposed framework with an optimal resource utilization rate by using the combined GAMM-OFA methods.

Big data streaming

At first, the streaming data are initialized with the training samples and predicted class label as represented below:

Initialize streaming data,

D_{s} = (I_{ω}, L_{ω}), ω = 1, \dots, T,

(1)

where $I_{ω} \in ℜ^{m}$ indicates the training samples at each time stamp with size d, and m represents the dimensional instance arriving at the time stamp T. Here, $m = 1$ indicates a single data item at each time stamp, and $m > 1$ denotes a group of data obtained at each time stamp. Moreover, the predicted label is defined as follows: $L_{ω} \in \{l_{1}, \dots, l_{h}\},$ (2)

which is the corresponding class label.
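As a rough illustration of the stream representation in Equations (1) and (2), the sketch below generates T blocks of synthetic instances and labels. It assumes that each block holds m instances with d features and that the h class labels are encoded as integers 0 to h-1; the function name and the random data are hypothetical.

```python
import numpy as np


def make_stream(T, m, d, h, seed=0):
    """Yield (omega, I_omega, L_omega) for omega = 1..T.

    Assumption: I_omega is an (m, d) block and L_omega holds one integer
    label per instance, standing in for the classes l_1..l_h.
    """
    rng = np.random.default_rng(seed)
    for omega in range(1, T + 1):
        I = rng.normal(size=(m, d))       # m instances arriving at this time stamp
        L = rng.integers(0, h, size=m)    # one class label per instance
        yield omega, I, L


blocks = list(make_stream(T=5, m=4, d=3, h=2))
```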

In the versatile feature unit, the nonlinear distribution of data and a fast convergence solution are attained based on the set of features extracted by using the GS as shown below:

g_{a} (Δ I_{ω}) = δ (w_{ght}^{g_{a}} * Δ I_{ω} + bias_{s}^{g_{a}}),

(3)

δ (y) = \frac{1}{1 + e^{- y}},

(4)

where $w_{ght}^{g_{a}}$ and $bias_{s}^{g_{a}}$ are the weight and bias values of the update gate, respectively, and $δ (y)$ indicates the sigmoid function.
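The gate in Equations (3) and (4) can be sketched in a few lines. The snippet assumes that the product in Equation (3) is elementwise and uses scalar weight and bias values chosen purely for illustration.

```python
import numpy as np


def sigmoid(y):
    """Equation (4): logistic sigmoid."""
    return 1.0 / (1.0 + np.exp(-y))


def gate(delta_I, w_gate, b_gate):
    """Equation (3): g_a = sigmoid(w * delta_I + bias), applied elementwise (assumed)."""
    return sigmoid(w_gate * delta_I + b_gate)


delta_I = np.array([-2.0, 0.0, 2.0])       # illustrative input changes
g = gate(delta_I, w_gate=1.0, b_gate=0.0)  # illustrative weight and bias
```

The gate output always lies strictly between 0 and 1, so it can act as a mixing coefficient between the extracted features and the raw input change.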

Consequently, the output of versatile feature unit is produced as follows:

v = α (f (I_{ω}) * g_{a} (Δ I_{ω}) + (Δ I_{ω} * (1 - g_{a} (Δ I_{ω})))),

(5)

where $f (.)$ indicates the feature vector, $α (.)$ represents the activation function (either ReLU or sigmoid), and N is the number of versatile feature units.

Then, its corresponding forward propagation function is estimated by using the following model:

v^{(n)} = α (f * g_{a} + v^{(n - 1)} (1 - g_{a})),

(6)

where $n = 1, \dots, N$
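The recursion in Equation (6) can be sketched as follows. The snippet assumes elementwise products, a fixed feature vector f and gate value g_a across units, ReLU as the activation, and an initial value v^(0) supplied by the caller; these choices are illustrative assumptions rather than details given in the text.

```python
import numpy as np


def relu(x):
    return np.maximum(x, 0.0)


def forward(f, g_a, v0, N, alpha=relu):
    """Apply v^(n) = alpha(f * g_a + v^(n-1) * (1 - g_a)) for n = 1..N.

    Returns the list [v^(1), ..., v^(N)]; products are elementwise (assumed).
    """
    outputs, v = [], v0
    for _ in range(N):
        v = alpha(f * g_a + v * (1.0 - g_a))
        outputs.append(v)
    return outputs


f = np.array([1.0, 2.0])
v0 = np.array([0.0, 0.0])
closed = forward(f, np.array([0.0, 0.0]), v0, N=3)  # gate closed: v0 passes through
opened = forward(f, np.array([1.0, 1.0]), v0, N=3)  # gate open: output equals f
half = forward(f, np.array([0.5, 0.5]), v0, N=3)    # gate half open: moves toward f
```

With the gate half open, each unit moves the output halfway from its previous value toward f, which shows how the gate controls the effective depth of the feature unit.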

Here, the online gradient descent function is used to optimize the classifier parameter as indicated below:

{w_{ght}}_{c}^{n} (ω + 1) = {w_{ght}}_{c}^{n} (ω) - τ * ϑ^{n} (ω) (\hat{l_{ω}} - l_{ω}) * {(v^{n})}^{T},

(7)

where $τ$ is the learning rate, $\hat{l_{ω}}$ represents the predicted label, and $ϑ^{n} (ω)$ denotes the ensemble weight parameter.
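A single step of the update in Equation (7) can be sketched as below. The snippet assumes that the predicted and true labels are vectors over the classes (for example, class probabilities and a one-hot target) so that the product with the transposed feature output is an outer product; the numeric values are illustrative only.

```python
import numpy as np


def ogd_step(w, v, l_hat, l_true, tau, theta):
    """Equation (7): w <- w - tau * theta * (l_hat - l_true) (v)^T.

    Assumption: l_hat and l_true are class vectors, v is the feature output,
    and w has shape (number of classes, number of features).
    """
    return w - tau * theta * np.outer(l_hat - l_true, v)


w = np.zeros((2, 3))
v = np.array([1.0, 0.0, 2.0])
l_hat = np.array([0.5, 0.5])    # predicted class scores
l_true = np.array([1.0, 0.0])   # one-hot true label
w_new = ogd_step(w, v, l_hat, l_true, tau=0.1, theta=1.0)
w_same = ogd_step(w, v, l_true, l_true, tau=0.1, theta=1.0)
```

When the prediction already matches the label, the error term vanishes and the weights are left unchanged.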

Furthermore, the weight is constructed according to the updated GS as represented below:

where v^k represents the feature versatile unit for all N number of units.

After that, the model performance vector is constructed from the previous time stamps up to $(ω - 1)$ as shown below: $q_{v} = \{q_{v}^{1}, q_{v}^{2}, \dots, q_{v}^{ω - 1}\} .$ (9)

Consequently, the model performance of the current time stamp is computed by using the following model:

where r represents the model performance parameter.

Then, the fluctuation value vector is estimated according to the fluctuation level as shown below:

Δ q_{ω} = \{|q_{ω} - \hat{q}| \forall \hat{q}, \hat{q} \in q_{ω} [1]\},

(11)

where $\hat{q}$ is the fluctuation parameter with the constant value first element.

If the estimated fluctuation level is small and moderate, the selection function of classifier is determined by using the ensemble model as shown below:

where $\hat{L_{ω}}$ indicates the predicted result, and ${C_{b}}^{i} (t)$ is the list of base classifiers in ensemble.

GAMM-based classification

To accurately predict the resource usage, a GAMM-based probabilistic classification technique is implemented in this work. It is an advanced classification approach that helps to forecast the resource usage in big data streaming application systems. In other words, it is defined as a stochastic process that accumulates the viable points in the parameter space. The key benefits of GAMM are accurate prediction results, a minimal classification error rate, and ease of understanding. In this technique, every next state depends only on the current state, and the state transitions do not vary over time, since the states follow a time-invariant model. The number of states in the GAMM is denoted by M, and the states are most likely hidden. In a practical implementation, a physical meaning is generally assigned to every state or set of states in the model, which is represented as follows: $s_{t}^{i} = \{s_{t}^{1}, s_{t}^{2}, \dots, s_{t}^{M}\} .$ (13)

Then, a fixed state sequence is represented as shown below: $R = \{r_{1}, r_{2}, \dots, r_{T}\},$ (14)

where the state at time t is represented as r_t. The number of observation symbols at each state is denoted as N, and the symbols correspond to the modeled output of the physical system. The individual symbols are represented as follows: $O = \{o_{1}, o_{2}, \dots, o_{N}\} .$ (15)

Then, the initial state of the system $ω$ is defined by using the following model: $ω = \{ω_{i}\}, ω_{i} = P r (s_{t}^{i}),$ (16)

where $ω_{i}$ is the initial state probability vector. After that, the probability distribution function of the state transition is computed as shown below: $h_{i j} = P r \{r_{t + 1} = s^{j} ∕ r_{t} = s^{i}\},$ (17)

where $h_{i j}$ denotes the probability of being in state j at time t + 1 given that it was in state i at time t. Then, it is assured that $h_{i j}$ is independent of time, and the observation symbol of probability distribution is determined as follows: $q_{i j} = P r \{P_{j} = t ∕ s_{t}^{i} = t - 1\},$ (18)

where P_j represents the observation at time t and the following conditions are considered: $1 \leq j \leq N .$ $1 \leq k \leq N .$
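To make the roles of the initial state vector and the time-invariant transition probabilities concrete, the sketch below computes the probability of a fixed state sequence R in a two-state chain. The numeric values are hypothetical, and the observation distribution is left out for brevity.

```python
import numpy as np


def sequence_probability(init, trans, states):
    """Probability of a state sequence: init[r_1] * prod_t trans[r_t, r_{t+1}]."""
    p = init[states[0]]
    for i, j in zip(states[:-1], states[1:]):
        p *= trans[i, j]   # same transition matrix at every step (time invariant)
    return p


init = np.array([0.6, 0.4])       # illustrative initial state probabilities
trans = np.array([[0.7, 0.3],     # illustrative transition probabilities h_ij
                  [0.2, 0.8]])
p = sequence_probability(init, trans, [0, 0, 1])
```

Each row of the transition matrix sums to one, and the sequence probability factorizes step by step because the next state depends only on the current one.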

Consequently, the threshold value is determined for the probability for taking decisions based on the GAMM stochastic process, and is illustrated below: $H_{t} = l o g (\sqrt{{(2 π e)}^{d} C_{m}}),$ (19)

where H_t is the entropy of the multivariate Gaussian distribution, and C_m denotes the covariance information of the given data. Then, the GAMM is initialized with the new solution as shown below: $x^{(t + 1)} = m^{t} + r^{t} * Q_{s}^{t} * ρ^{t},$ (20)

where $Q_{s}^{t}$ indicates the normalized square root of C_m, m^t is the mean information of the given data, r^t represents the scale factor, $ρ^{t}$ is a random vector drawn from $N (0, 1)$, and $x^{(t + 1)}$ indicates the input stream data at the next iteration. Furthermore, the mean m^t and covariance C_m are computed, and the scale factor r^t is updated as represented below:

m^{t + 1} = (1 + \frac{1}{N_{m}}) * m^{t} + \frac{1}{N_{m}} * x^{(t + 1)},

(21)

C_{t + 1} = (1 + \frac{1}{N_{c}}) * C_{t} + \frac{1}{N_{c}} {(Δ x)}^{T},

(22)

r^{t + 1} = e_{f} * r^{t},

(23)

where $e_{f}$ indicates the expansion factor. The estimated acceptance threshold value tH is continuously lowered, and it is determined as follows: $t H^{t + 1} = (1 + \frac{1}{N_{T}}) * C_{t} + \frac{1}{N_{c}} {(f (x^{t + 1}))}^{T},$ (24)

where N_T represents the weighting factor. Finally, the fitness-dependent update of tH makes the model invariant to linear transformations of the objective function.
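The sampling step in Equation (20), the entropy expression in Equation (19), and the scale update in Equation (23) can be sketched as follows. The snippet makes several assumptions that are not stated in the text: the covariance enters Equation (19) through its determinant, the normalized square root is taken as a Cholesky factor rescaled to unit determinant, and the covariance and expansion factor values are illustrative.

```python
import numpy as np


def gaussian_entropy(C):
    """Equation (19) with the determinant of C assumed: log sqrt((2*pi*e)^d * |C|)."""
    d = C.shape[0]
    _, logdet = np.linalg.slogdet(C)
    return 0.5 * (d * np.log(2.0 * np.pi * np.e) + logdet)


def normalized_sqrt(C):
    """Cholesky factor of C rescaled to unit determinant (an assumed normalization)."""
    L = np.linalg.cholesky(C)
    d = C.shape[0]
    return L / np.linalg.det(L) ** (1.0 / d)


def sample(m, r, Q, rng):
    """Equation (20): x = m + r * Q * rho, with rho drawn from N(0, 1)."""
    rho = rng.standard_normal(m.shape[0])
    return m + r * (Q @ rho)


rng = np.random.default_rng(0)
C = np.array([[2.0, 0.5],
              [0.5, 1.0]])           # illustrative covariance
m = np.zeros(2)
Q = normalized_sqrt(C)
x = sample(m, r=1.0, Q=Q, rng=rng)
r_next = 1.5 * 1.0                   # Equation (23) with an assumed e_f = 1.5
H_unit = gaussian_entropy(np.eye(1))
```

Separating the scale factor r from the unit-determinant factor Q means that the step size and the shape of the sampling distribution can be adapted independently.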

Fluctuation analysis

In this study, the fluctuation analysis (FA) is performed to accurately predict the resource usage of big data streaming applications, which improves the prediction accuracy of resource consumption. It is used to determine the fluctuation of the streaming data according to the model performance at adjacent time stamps. If the stream fluctuation is strong, the current data distribution is adapted with the ensemble model of the classifier; if it is moderate, a classifier whose weight value is less than the threshold value is considered stationary. Otherwise, the fitting capability for the data is further improved. The motive of using this technique is to activate or halt the streaming services based on the fluctuation state.

In this algorithm, the trained model loss factor is taken as the input for processing, and the state of fluctuation is produced as the output. During initialization, the loss factor is set with the parameters of model performance vector, time stamp, and strong and small fluctuation detection parameters.

Algorithm 1. Fluctuation analysis

As shown in Algorithm 1, the steps involved in the OFA are illustrated. The OFA is mainly implemented in this study to predict the resource utilization for making a reliable and successful big data streaming. In this technique, the training loss function, model performance parameter, strong and small detection parameters are taken as the inputs for analysis. After initializing the parameters, the performance vector is updated with the time stamp value. According to the loss function, the model performance vector is calculated and updated with the fluctuation value. Then, certain conditions are validated to determine the strong, moderate, and small resource usage predictions.
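The decision step of the fluctuation analysis can be sketched as a simple three-way classification. Since the algorithm listing is not reproduced here, the snippet relies on assumptions: the fluctuation level is taken as the absolute difference in model performance between adjacent time stamps, and it is compared against the strong and small detection parameters φ1 and φ2 with φ1 > φ2. The function name and the threshold values are hypothetical.

```python
def fluctuation_state(q_prev, q_curr, phi1, phi2):
    """Classify the fluctuation between adjacent time stamps.

    Assumption: the level is |q_curr - q_prev|; it is 'strong' at or above
    phi1, 'small' at or below phi2, and 'moderate' in between.
    """
    level = abs(q_curr - q_prev)
    if level >= phi1:
        return "strong"
    if level <= phi2:
        return "small"
    return "moderate"


states = [
    fluctuation_state(0.90, 0.50, phi1=0.30, phi2=0.05),
    fluctuation_state(0.90, 0.80, phi1=0.30, phi2=0.05),
    fluctuation_state(0.90, 0.88, phi1=0.30, phi2=0.05),
]
```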

After that, the versatile classifier weight value is computed for streaming big data in the cloud systems. During this process, the streaming data, learning parameter, model performance parameter, and fluctuation detection parameter are considered inputs for processing. After initializing these parameters, the network is initialized with m number of versatile classifiers as shown below: $C_{b} = \{C_{b}^{1}, \dots, C_{b}^{m}\} .$ (28)

Then, the storage weight matrix of the versatile classifier is estimated according to the former time stamp value. The instance block of the streaming data is received with its appropriate true labels. Consequently, the gate updation is performed and the adaptive depth unit is estimated. Moreover, the loss value of base classifier is determined by using the following model: $\hat{l o s s_{ω}} = l o s s_{ω} (C_{b}^{m}, L_{ω}) .$ (29)

After that, the label and gate parameter of the classifier are updated by using the following models: ${w_{ght}}_{c}^{n} (ω + 1) = {w_{ght}}_{c}^{n} (ω) - τ * ϑ^{n} (ω) (\hat{l_{ω}} - l_{ω}) * {(v^{n})}^{T} .$ (31)

Finally, the normalized versatile classifier weight value is computed by using the following model:

Based on this weight value, the resource usage is predicted with high accuracy and a reduced classification error rate. Furthermore, it enables an efficient and perfect big data streaming in the cloud systems with improved performance outcomes.

Algorithm 2. Activation modeling of data streaming
Initialize:
Streaming data $D_{s}$
Learning rate $τ$
Model performance parameter $r$
Strong fluctuation detection parameter $φ_{1}$
Small fluctuation detection parameter $φ_{2}$
Procedure:
• Initialize the network with m versatile classifiers by using Equation (28);
• Compute the versatile weight storage matrix from the former time stamps;
• for $ω = 1, \dots, T$ do
1. Receive the instance block $I_{ω}$ and the corresponding true labels $L_{ω}$ of the streaming data $D_{s}$;
2. Update the gate as $g_{a} (Δ I_{ω})$;
3. Estimate the adaptive depth unit;
4. Output $v^{(n)}$;
5. Set the loss factor of the classifier as $\hat{l o s s_{ω}}$ by using Equation (29);
6. Predict $\hat{L_{ω}}$ as shown in Equation (30);
7. Update the classifier and gate parameters by using Equations (31) and (32);
8. Estimate the normalized versatile classifier weight by using Equation (33);
• end for

Theorem 1:

Here, the generator sigmoid function $δ_{y}^{γ} (t)$ computed in the GAMM is determined based on the following properties:

i.

$δ_{y}^{γ}$ is a continuous and differentiable function.

ii.

The derivative of $δ_{y}^{γ}$ is computed by using the following model:

[δ_{y}^{γ} (t)]' = γ \frac{d_{f}^{'} (y)}{d_{f} (y)} \frac{d_{f} (δ_{y}^{γ} (t))}{d_{f}^{'} (δ_{y}^{γ} (t))} .

(34)

iii.

If $γ > 0$ ($γ < 0$, respectively), the sigmoid function $δ_{y}^{γ} (t)$ is strictly increasing (strictly decreasing, respectively).

iv.

The value of $δ_{y}^{γ} (t)$ and its derivative at $t = 0$ are as follows:

δ_{y}^{γ} (0) = y \text{ and } [δ_{y}^{γ} (t)]' |_{t = 0} = γ

(35)

v.

If $γ > 0$ (i.e., $δ_{y}^{γ}$ is strictly increasing), then

\lim_{t \to - \infty} [δ_{y}^{γ} (t)] = 0 \text{ and } \lim_{t \to \infty} [δ_{y}^{γ} (t)] = 1

(36)

vi.

If $γ < 0$ (i.e., $δ_{y}^{γ}$ is strictly decreasing), then

\lim_{t \to - \infty} [δ_{y}^{γ} (t)] = 1 \text{ and } \lim_{t \to \infty} [δ_{y}^{γ} (t)] = 0

(37)

Proof:

The proofs of properties (i)–(vi) are given below:

i.

$δ_{y}^{γ}$ is a composition of continuous and differentiable functions.

ii.

For simplicity, let $ρ (t)$ be defined as follows:

ρ (t) = d_{f} (y) e^{γ \frac{d_{f}^{'} (y)}{d_{f} (y)} t}, \quad t \in ℝ .

(38)

Based on this, $δ_{y}^{γ} (t) = d_{f}^{- 1} (ρ (t))$, and its derivative $[δ_{y}^{γ} (t)]'$ is obtained as $\begin{matrix} [δ_{y}^{γ} (t)]' = [d_{f}^{- 1} (ρ (t))]' = \frac{1}{d_{f}^{'} (d_{f}^{- 1} (ρ (t)))} ρ' (t) \\ = \frac{1}{d_{f}^{'} (δ_{y}^{γ} (t))} d_{f} (y) e^{γ \frac{d_{f}^{'} (y)}{d_{f} (y)} t} γ \frac{d_{f}^{'} (y)}{d_{f} (y)} = γ \frac{d_{f}^{'} (y)}{d_{f} (y)} \frac{d_{f} (δ_{y}^{γ} (t))}{d_{f}^{'} (δ_{y}^{γ} (t))} \end{matrix} ,$ (39)

where the last step uses $ρ (t) = d_{f} (δ_{y}^{γ} (t))$.

iii.

Let us consider $γ > 0$ . If f is strictly increasing (decreasing, respectively), then $d_{f}^{- 1}$ and $e^{γ \frac{d_{f}^{'} (y)}{d_{f} (y)} (\cdot)}$ are both strictly increasing (decreasing, respectively), and so $δ_{y}^{γ}$ is strictly increasing.

Let us consider $γ < 0$ . If f is strictly increasing (decreasing, respectively), then $d_{f}^{- 1}$ is strictly increasing (decreasing, respectively) and $e^{γ \frac{d_{f}^{'} (y)}{d_{f} (y)} (\cdot)}$ is strictly decreasing (increasing, respectively), and so $δ_{y}^{γ}$ is strictly decreasing.

iv.

According to the definition of $δ_{y}^{γ}$ , we get $δ_{y}^{γ} (0) = y$ . Substituting $t = 0$ and $δ_{y}^{γ} (0) = y$ into property (ii), the two fractions cancel, and we obtain $[δ_{y}^{γ} (t)]' |_{t = 0} = γ$ .

v., vi.

These properties follow immediately from the definition of the sigmoid function $δ_{y}^{γ} (t)$ together with the limit properties of the generator function f and of the exponential function.
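The properties in Theorem 1 can be checked numerically for one concrete generator. The sketch below assumes the generator f(x) = (1 - x)/x on (0, 1), written d_f in the equations above; this choice, together with the values of y and γ, is an illustrative assumption and is not specified in the text.

```python
import math


def f(x):
    """Assumed generator on (0, 1): strictly decreasing from +inf to 0."""
    return (1.0 - x) / x


def f_prime(x):
    return -1.0 / (x * x)


def f_inv(u):
    return 1.0 / (1.0 + u)


def delta(t, y, gamma):
    """Sigmoid built from the generator: f^{-1}(f(y) * exp(gamma * f'(y)/f(y) * t))."""
    rate = gamma * f_prime(y) / f(y)
    return f_inv(f(y) * math.exp(rate * t))


y, gamma, eps = 0.3, 2.0, 1e-6
value_at_zero = delta(0.0, y, gamma)
# Central difference approximation of the derivative at t = 0.
slope_at_zero = (delta(eps, y, gamma) - delta(-eps, y, gamma)) / (2.0 * eps)
left, right = delta(-50.0, y, gamma), delta(50.0, y, gamma)
```

For this generator the function reduces to a shifted and rescaled logistic curve, so the value y and slope γ at t = 0 and the limits 0 and 1 for γ > 0 can be read off directly from the computed quantities.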

Results and Discussion

In this study, detailed experimentation is conducted to validate the proposed model, GAMM-OFA, by using various parameters such as latency, memory consumption, likelihood function, prediction rate, purity, and standard deviation. A physical system with an 8-core i7-3770 CPU running at 3.40 GHz and 16 GB of RAM was used to set up the Spark cluster. Two VM instance types were taken into consideration, with 4 and 8 GB of RAM and 2 VCPUs each. Each node had Hadoop 2.6, Spark 1.6, Scala 2.11, and JDK/JRE v1.8 installed. One node acted as the master, and the other nodes were set up as slaves. Each slave node had a 50 GB SSD disk, and the master node had a 195 GB SSD drive. Because Apache Spark has no independent storage system, it was set up to operate on top of Hadoop.

It was configured to use the Hadoop Distributed File System (HDFS) to supply storage for the streaming data. Apache Flume, a distributed system for gathering, aggregating, and transmitting the events produced, was used to gather the streaming data. A certain quantity of data is downloaded at each interval and stored in HDFS so that the Spark engine can start processing it right away. The prediction model was evaluated using two streaming application packages. The constructed applications were launched at the master node and monitored at various intervals. Each streaming application's workload arrival rates, the number of streams, and a variety of system performance metrics were obtained and stored for future examination. Moreover, the adaptability and applicability of the proposed big data streaming framework are verified by analyzing the resource usage in the cloud systems.

As shown in Figure 3, latency is computed for the existing data stream models and the proposed model, GAMM-OFA, with respect to varying time sequences in minutes. In general, latency is one of the most important parameters used to assess the performance of streaming applications. It is estimated from the time delay of the data streams, and increased latency can degrade the performance of the entire big data streaming application. Hence, it must be minimized by accurately predicting the resource usage for data streaming. The estimated results show that the proposed model, GAMM-OFA, provides reduced latency when compared with the other techniques. Figure 4 then validates the CPU and memory usage rate of the proposed model GAMM-OFA with respect to varying time in minutes.

FIG. 3.

Latency versus time (minutes).

FIG. 4.

CPU and memory usage.

These parameters are mainly estimated to analyze the resource usage efficiency of the proposed big data streaming framework, in which the CPU usage is expressed as a percentage and the memory usage is expressed in bytes. The curves show that the resource usage is properly maintained in the proposed streaming framework by forecasting the usage with the help of a classifier. Moreover, the log-likelihood function of the proposed GAMM is estimated as shown in Figure 5. According to the states of the classifier, the model parameters of GAMM are updated and re-estimated to obtain the results. In this study, the log-likelihood function is computed based on the training model of the classifier.

FIG. 5.

Log-likelihood analysis.

In Figure 4, the memory utilization and CPU utilization rate are estimated with respect to changing time, where the y-axes indicate the CPU usage as a percentage and the memory usage in bytes, and the x-axis represents the time (minutes). The obtained results indicate that the CPU usage rate is reduced to 20%, and the memory consumption is reduced to 10 bytes. Similarly, the log-likelihood value is estimated in Figure 5 with respect to the number of iterations, where the log-likelihood is represented on the y-axis and the iteration count is represented on the x-axis. This parameter is mainly estimated to validate the training efficiency of the classifier.

Figure 6 validates the states of GAMM classifier with respect to varying time intervals, which includes the actual and predicted classes. From the analysis, it is evident that the proposed GAMM could accurately predict the state of classifier under varying time sequences. Due to the proper weight estimation and model parameter computation, the states of classifier are accurately predicted, which depicts the improved performance of the proposed big data streaming framework. Moreover, the error rate of the GAMM-OFA mechanism is shown in Figure 7, which includes the root mean squared error (RMSE) and mean absolute percentage error (MAPE). The parameters are computed as follows:

FIG. 6.

Actual and predicted states.

FIG. 7.

Error rate.

$\mathrm{MAPE} = \frac{1}{N} \sum_{x = 1}^{N} \left| \frac{Act_{x} - For_{x}}{For_{x}} \right|,$

(40)

where N is the number of iterations, $Act_{x}$ denotes the actual value, and $For_{x}$ indicates the forecast value.

$\mathrm{RMSE} = \sqrt{\frac{\sum_{i = 1}^{N} {(a_{i} - \hat{a}_{i})}^{2}}{N}},$

(41)

where N indicates the number of nonmissing data points, $a_{i}$ is the actual observation, and $\hat{a}_{i}$ is the forecasted observation. Lower values of these parameters indicate better prediction results. The estimated values show that the proposed GAMM-OFA efficiently minimizes the RMSE and MAPE values by accurately predicting the resource usage in the big data streaming applications.
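The two error measures in Equations (40) and (41) can be sketched in a few lines of Python. This is a minimal illustration rather than the authors' implementation: the function names and the sample usage values are hypothetical, and the MAPE follows Equation (40) as written, with the forecast value in the denominator. Both functions return fractions, so they would be multiplied by 100 to express them as percentages.

```python
import math


def mape(actual, forecast):
    """Mean absolute percentage error following Eq. (40).

    The denominator is the forecast value, as written in the text.
    """
    if len(actual) != len(forecast) or not actual:
        raise ValueError("inputs must be non-empty and of equal length")
    total = sum(abs((a - f) / f) for a, f in zip(actual, forecast))
    return total / len(actual)


def rmse(actual, forecast):
    """Root mean squared error following Eq. (41)."""
    if len(actual) != len(forecast) or not actual:
        raise ValueError("inputs must be non-empty and of equal length")
    squared = sum((a - f) ** 2 for a, f in zip(actual, forecast))
    return math.sqrt(squared / len(actual))


# Hypothetical CPU usage values (%), chosen only for illustration.
actual_usage = [50.0, 60.0, 80.0]
forecast_usage = [40.0, 60.0, 100.0]

print("MAPE:", mape(actual_usage, forecast_usage))
print("RMSE:", rmse(actual_usage, forecast_usage))
```

RMSE keeps the unit of the measured quantity, whereas MAPE is a ratio and is undefined whenever a denominator value is zero.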

Figure 8 validates the running time (seconds) of the existing streaming⁴¹ and proposed GAMM-OFA mechanisms by using various data sets such as DB₂, RBF₅, RBF₁₀, and RBF₂₀. The existing techniques considered in this work are CEDAS, HDDStream, GD Stream, D-StreamII, and ESA Stream. Here, the running time is estimated to analyze how data streaming is executed in the cloud systems with different data sets. The estimated results show that the running time of the proposed model GAMM-OFA is reduced when compared with the other data streaming techniques. Because of proper resource prediction based on the fluctuation analysis, the running time required for data streaming is minimized in the proposed framework. In addition, purity (%) is estimated for the existing and proposed streaming frameworks by using various data sets. Figure 9 depicts the purity analysis and compares the existing techniques with the proposed system.

FIG. 8.

Time analysis.

FIG. 9.

Purity analysis.

Here, purity is defined as the precision of the classifier, which helps to determine the forecasting accuracy of classification. Table 2 reports the time and purity analysis of the existing and proposed streaming techniques for various data sets. The results show that purity is also improved in the proposed data streaming framework by using the fluctuation analysis.
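As a rough illustration of how a purity score can be computed, the sketch below uses the standard clustering definition of purity: each cluster is credited with its most frequent true label, and the credited counts are divided by the total number of points. Whether the reported values were obtained in exactly this way is an assumption, and the function name and sample labels are hypothetical.

```python
from collections import Counter, defaultdict


def purity(cluster_ids, true_labels):
    """Fraction of points that carry the majority label of their cluster."""
    if len(cluster_ids) != len(true_labels) or not true_labels:
        raise ValueError("inputs must be non-empty and of equal length")
    # Count how often each true label occurs inside each cluster.
    members = defaultdict(Counter)
    for cluster, label in zip(cluster_ids, true_labels):
        members[cluster][label] += 1
    # Credit each cluster with the size of its largest label group.
    majority_total = sum(max(counts.values()) for counts in members.values())
    return majority_total / len(true_labels)


# Hypothetical assignments for eight stream records.
clusters = [0, 0, 0, 1, 1, 2, 2, 2]
labels = ["a", "a", "b", "b", "b", "c", "c", "a"]

print("Purity (%):", 100.0 * purity(clusters, labels))
```

A purity of 1.0 means every cluster contains records of a single class; values in Table 2 correspond to this fraction multiplied by 100.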

Table 2.

Time and purity analysis

Data sets	CEDAS Time (s)	CEDAS Purity (%)	HDDStream Time (s)	HDDStream Purity (%)	GD Stream Time (s)	GD Stream Purity (%)	D-StreamII Time (s)	D-StreamII Purity (%)	ESA Stream Time (s)	ESA Stream Purity (%)	Proposed Time (s)	Proposed Purity (%)
DB₂	26.01	7.4	429.5	79.3	19.35	82.7	21.47	83.5	16.02	96.1	15.4	96.9
RBF₅	284.54	87.6	549.81	87.3	209.43	95.1	137.4	94.8	61.11	97.3	59.48	98.1
RBF₁₀	369.21	86.9	943.6	59.9	996.45	88.6	271.56	86.7	92.23	96.4	90.4	97.6
RBF₂₀	502.24	83.5	1942.5	57.2	2889.31	83.8	835.44	82.3	138.32	95.4	136.78	96.8
RBF₄₀	1962.95	81.9	4350	53.6	5862.41	80.1	2771.56	79.2	215.23	94.8	213.45	95.3

Figure 10 depicts the average throughput of the existing and proposed data streaming mechanisms in tuples/second. Typically, throughput is defined as the amount of data successfully transmitted through the communication channel of the network. A higher throughput value indicates improved transmission efficiency of the streaming technique. The results show that the average throughput of the proposed GAMM-OFA technique is improved when compared with the other big data streaming models, because the fluctuation analysis with GAMM helps to accurately predict the resource usage in the versatile network.

FIG. 10.

Average throughput.

Figure 11 validates the latency of various data streaming techniques with respect to different time sequences (minutes). Similarly, the average latency of the task scheduling techniques used in the existing big data streaming framework is compared with the proposed model, GAMM-OFA, as shown in Figure 12. The estimated results show that the overall latency of the proposed streaming model is reduced by properly balancing the loads in the cloud systems. Moreover, the GAMM accurately forecasts resource usage with the help of the gating methodology and fluctuation analysis. Hence, the proposed big data streaming technique outperforms the baseline techniques⁴² with reduced average latency. Then, the standard deviation measures of the baseline and proposed streaming models are estimated and compared at different time instances (seconds).

FIG. 11.

Latency of various data streaming techniques.

FIG. 12.

Standard deviation of streams.

In this study, the standard deviation is computed according to the number of streams delivered in the cloud systems. A lower standard deviation indicates more effective load balancing in the streaming applications. The observed results show that the GAMM-OFA technique reduces the standard deviation when compared with the baseline streaming models. Load balancing analysis is shown in Figure 13.
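The link between standard deviation and load balancing can be made concrete with a small sketch. It assumes that the measure is the population standard deviation of the number of streams handled per node at one time instance; the function name and the per-node counts are hypothetical.

```python
import statistics


def load_imbalance(streams_per_node):
    """Population standard deviation of the per-node stream counts.

    A value near zero means the streams are spread evenly over the nodes.
    """
    return statistics.pstdev(streams_per_node)


# Hypothetical stream counts on four nodes at one time instance.
balanced = [25, 26, 24, 25]
skewed = [10, 40, 5, 45]

print("balanced:", load_imbalance(balanced))
print("skewed:  ", load_imbalance(skewed))
```

Both allocations carry 100 streams in total, but the skewed one yields a much larger standard deviation, which is why a lower value is read as better load balancing.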

FIG. 13.

Load balancing analysis.

Table 3 validates the RMSE and MAPE values of the proposed big data streaming system with respect to the experimental time period and prediction time in terms of hours. The results indicate that the error rate is greatly reduced in the proposed framework by effectively streaming the data in the cloud network.

Table 3.

Prediction analysis

Experimental time period	Prediction time	RMSE (%)	MAPE (%)
24 Hours	4 Hours	10.2	10.1
24 Hours	6 Hours	12.3	11.5
24 Hours	8 Hours	13.1	12.8
48 Hours	8 Hours	15.3	14.5
48 Hours	12 Hours	16.1	15.2

MAPE, mean absolute percentage error; RMSE, root mean squared error.

A training set and a prediction set were generated from the collected data sets. We trained the models on the training set and evaluated their accuracy by testing them on the prediction set. After that, the upper layer used the GAMM-OFA results as input features to decide how to scale the resources. In the prediction step, the system's current state is computed and used to forecast its future state. In this study, we attempt to recover the series of states that gave rise to the observation sequence. The degree to which the model works on new data sets that were not used when fitting the model tells us how accurate it is. In addition, two performance indices were used to assess the prediction accuracy of the GAMM-OFA model: the MAPE, a frequently used scale-independent measure, and the RMSE, a scale-dependent measure.
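The prediction step described above, in which the current state is used to forecast the future state, can be illustrated with a plain time-invariant Markov chain. The sketch below is an assumption-laden simplification and not the GAMM itself: the three usage states and the transition probabilities are hypothetical, and the function name is introduced only for this example.

```python
def forecast_state_distribution(transition, current, steps=1):
    """Propagate a state distribution through a time-invariant Markov chain.

    transition[i][j] is the probability of moving from state i to state j,
    and current[i] is the probability of being in state i now.
    """
    dist = list(current)
    n = len(dist)
    for _ in range(steps):
        dist = [sum(dist[i] * transition[i][j] for i in range(n))
                for j in range(n)]
    return dist


# Hypothetical resource usage states: 0 = low, 1 = medium, 2 = high.
transition = [
    [0.7, 0.2, 0.1],
    [0.3, 0.5, 0.2],
    [0.1, 0.3, 0.6],
]
current = [0.0, 1.0, 0.0]  # the system is currently in the medium state

one_step = forecast_state_distribution(transition, current, steps=1)
two_step = forecast_state_distribution(transition, current, steps=2)
print("one step ahead: ", one_step)
print("two steps ahead:", two_step)
```

The most probable entry of the returned distribution can then serve as the forecast state that a scaling layer acts on.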

We used the OFA algorithm to infer the states underlying the given observations using the GAMM. After obtaining the model's parameters, we use a simulation strategy to create our own set of data. Because it reliably maintains its predictive capability over time, the GAMM-OFA is effective at anticipating the resource consumption states of streaming applications. We can forecast the resource requirements of unbounded big data streaming applications by introducing explicit temporal structure into the framework. The proposed framework is validated and tested using the big data streaming applications developed for the Spark streaming environment. Since the GAMM-OFA-based framework directly describes the duration of states, our empirical findings demonstrate that it performs significantly better than the standard models when used in unconstrained scenarios.
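Inferring hidden states from a sequence of observations can be sketched with the standard Viterbi algorithm for a hidden Markov model with Gaussian emissions. This is a generic stand-in offered for illustration and is not the authors' algorithm: the two usage states, their means and standard deviations, and the transition and start probabilities are all hypothetical, and every probability is assumed to be strictly positive so that its logarithm is defined.

```python
import math


def gaussian_logpdf(x, mean, std):
    """Log density of a univariate Gaussian."""
    return (-0.5 * math.log(2.0 * math.pi * std * std)
            - (x - mean) ** 2 / (2.0 * std * std))


def viterbi(observations, start, transition, means, stds):
    """Most likely state sequence for a Gaussian-emission HMM."""
    if not observations:
        raise ValueError("observations must be non-empty")
    n = len(start)
    # Best log score of any path that ends in each state.
    score = [math.log(start[s])
             + gaussian_logpdf(observations[0], means[s], stds[s])
             for s in range(n)]
    back = []
    for x in observations[1:]:
        prev = score
        pointers, score = [], []
        for j in range(n):
            best_i = max(range(n),
                         key=lambda i: prev[i] + math.log(transition[i][j]))
            pointers.append(best_i)
            score.append(prev[best_i] + math.log(transition[best_i][j])
                         + gaussian_logpdf(x, means[j], stds[j]))
        back.append(pointers)
    # Trace the best path backwards from the best final state.
    state = max(range(n), key=lambda s: score[s])
    path = [state]
    for pointers in reversed(back):
        state = pointers[state]
        path.append(state)
    path.reverse()
    return path


# Hypothetical model: state 0 = low CPU usage, state 1 = high CPU usage.
usage = [18.0, 22.0, 79.0, 83.0, 21.0]
states = viterbi(usage,
                 start=[0.5, 0.5],
                 transition=[[0.9, 0.1], [0.1, 0.9]],
                 means=[20.0, 80.0],
                 stds=[5.0, 5.0])
print(states)
```

Working in log space avoids numerical underflow on long observation sequences, which matters for streams that are monitored over many intervals.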

Conclusion

This article presents a new framework, named GAMM-OFA Stream, for forecasting the resource usage of big data streaming applications. In existing works, different types of resource scaling techniques have been developed for big data streaming systems. Yet, most of the baseline models are limited by inaccurate forecasting, high latency, reduced throughput, and increased error rate. Therefore, the proposed work develops a GAMM-based classification methodology to predict resource usage for big data streaming. In this study, the GS and OFA models are incorporated into the proposed streaming model to increase the accuracy of scaling and reduce prediction errors. Typically, resource provisioning is one of the most significant problems that needs to be addressed in big data streaming applications.

In GAMM, every next state depends on the current state, and the state transitions do not vary over time, since the model is time invariant. The OFA is used to determine the fluctuation of streaming data according to the model performance at adjacent time stamp values. Furthermore, it enables efficient big data streaming in the cloud systems with improved performance outcomes. In addition, extensive experimentation is conducted in this work to validate the results of the proposed streaming model, GAMM-OFA, based on latency, throughput, RMSE, MAPE, standard deviation, and running time. The obtained values are then compared with some of the recent baseline big data streaming models. Overall, the obtained results indicate that the proposed GAMM-OFA stream outperforms the baseline models because of its proper resource usage prediction.

In the future, this work can be enhanced by implementing a new methodology that combines optimization and deep learning for resource scaling in big data streaming application systems.

Footnotes

Author Disclosure Statement

No competing financial interests exist.

Funding Information

No funding was received to carry out this research.

References

1. Ullah, Awan, Sikander Hayat Khiyal. Big data in cloud computing: A resource management perspective. Sci Program, 2018; 2018:1–18.

2. Dai H-N, Wong RC-W, Wang, et al. Big data analytics for large-scale wireless networks: Challenges and opportunities. ACM Comput Surv, 2019; 52:1–36.

3. Moreno-Vozmediano, Montero, Huedo, et al. Efficient resource provisioning for elastic cloud services based on machine learning techniques. J Cloud Comput, 2019; 8:1–18.

4. Kollenstart, Harmsma, Langius, et al. Adaptive provisioning of heterogeneous cloud resources for big data processing. Big Data Cogn Comput, 2018; 2:15.

5. Farley, Dawson, Goring, et al. Situating ecology as a big-data science: Current advances, challenges, and solutions. Bioscience, 2018; 68:563–576.

6. Saif, Wazir. Performance analysis of big data and cloud computing techniques: A survey. Procedia Comput Sci, 2018; 132:118–127.

7. Belcastro, Marozzo, Talia. Programming models and systems for big data analysis. Int J Parallel Emergent Distrib Syst, 2019; 34:632–652.

8. […], Yu, Xu, et al. Big Data and Cloud Computing. In: Manual of Digital Earth (Guo H, Goodchild MF, Annoni A eds.) Springer: Singapore; 2020; pp. 325–355.

9. Mushtaq, Mushtaq, Iqbal, et al. Security, Integrity, and Privacy of Cloud Computing and Big Data. In: Security and Privacy Trends in Cloud Computing and Big Data. (Tariq MI, Balas VE, Tayyaba S, et al. eds.) Routledge, CRC Press: United States; 2022; pp. 19–51.

10. Kobusińska, Leung, Hsu C-H, et al. Emerging Trends, Issues and Challenges in Internet of Things, Big Data and Cloud Computing. Vol. 87. Elsevier: Netherlands; 2018; pp. 416–419.

11. Tang, He, Yu, et al. A survey on spark ecosystem: Big data processing infrastructure, machine learning, and applications. IEEE Trans Knowl Data Eng, 2020; 34:71–91.

12. ur Rehman, Yaqoob, Salah, et al. The role of big data analytics in industrial Internet of Things. Fut Gener Comput Syst, 2019; 99:247–259.

13. Mahmud, Huang, Salloum, et al. A survey of data partitioning and sampling methods to support big data analysis. Big Data Mining Anal, 2020; 3:85–101.

14. Kang, Pan, Liu. Job scheduling for big data analytical applications in clouds: A taxonomy study. Fut Gener Comput Syst, 2022.

15. Kang, Pan, Liu. An online algorithm for scheduling big data analysis jobs in cloud environments. Knowl Based Syst, 2022; 245:108628.

16. Zhang W-W, Bhola, Kumar, et al. Study and analysis of big data for characterization of user association in large scale. Int J Syst Assur Eng Manag, 2022; 13:375–384.

17. Xia, Zhou, Ren, et al. Proactive and intelligent evaluation of big data queries in edge clouds with materialized views. Comput Netw, 2022; 203:108664.

18. Faraji-Mehmandar, Jabbehdari, Javadi HHS. A self-learning approach for proactive resource and service provisioning in fog environment. J Supercomput, 2022; 78:16997–17026.

19. Ivanovic, Simic. Efficient evolutionary optimization using predictive auto-scaling in containerized environment. Appl Soft Comput, 2022; 129:109610.

20. Haibeh, Yagoub, Jarray. A survey on mobile edge computing infrastructure: Design, resource management, and optimization approaches. IEEE Access, 2022; 10:27591–27610.

21. Ali AH. A survey on vertical and horizontal scaling platforms for big data analytics. Int J Integr Eng, 2019; 11:138–150.

22. Zeng, Garg, Barika, et al. SLA management for big data analytical applications in clouds: A taxonomy study. ACM Comput Surv, 2020; 53:1–40.

23. Mortazavi-Dehkordi, Zamanifar. Efficient deadline-aware scheduling for the analysis of big data streams in public cloud. Cluster Comput, 2020; 23:241–263.

24. Vergilio, Kor A-L, Mullier. A Unified Vendor-Agnostic Solution for Big Data Stream Processing in a Multi-Cloud Environment. Research Square Preprint; 2022.

25. Liu, Bi, Shi, et al. When machine learning meets big data: A wireless communication perspective. IEEE Vehic Technol Mag, 2019; 15:63–72.

26. Inoubli, Aridhi, Mezni, et al. A Comparative Study on Streaming Frameworks for Big Data. In: VLDB 2018: 44th International Conference on Very Large Data Bases: Workshop LADaS-Latin American Data Science, Rio de Janeiro, Brazil; 2018; pp. 1–8.

27. Alkatheri, Abbas, Siddiqui. A comparative study of big data frameworks. Int J Comput Sci Inf Secur, 2019; 17:66–73.

28. Sayed, Rady, Aref. Enhancing CluStream Algorithm for Clustering Big Data Streaming Over Sliding Window. In: 2020 12th International Conference on Electrical Engineering (ICEENG), Cairo, Egypt; 2020; pp. 108–114.

29. Gomes, Plentz, Rolt CRD, et al. A survey on data stream, big data and real-time. Int J Netw Virt Organ, 2019; 20:143–167.

30. Luengo, García-Gil, Ramírez-Gallego, et al. Big Data Preprocessing. Springer: Cham; 2020.

31. Huang, Wang C-D, Chao H-Y, et al. MVStream: Multiview data stream clustering. IEEE Trans Neural Netw Learn Syst, 2019; 31:3482–3496.

32. Milli, Bulut. SubtStream: Online Subtractive Stream Clustering Algorithm. In: Concurrency and Computation: Practice and Experience; 2022; p. e6968.

33. Kolajo, Daramola, Adebiyi. Big data stream analysis: A systematic literature review. J Big Data, 2019; 6:1–30.

34. Balakrishna, Solanki, Gunjan, et al. Performance Analysis of Linked Stream Big Data Processing Mechanisms for Unifying IoT Smart Data. In: International Conference on Intelligent Computing and Communication Technologies, Singapore; 2019; pp. 680–688.

35. Zeydan, Arslan. Cloud 2 HDD: Large-Scale HDD Data Analysis on Cloud for Cloud Datacenters. In: 2020 23rd Conference on Innovation in Clouds, Internet and Networks and Workshops (ICIN), Paris, France; 2020; pp. 243–249.

36. Rawat, Doku, Garuba. Cybersecurity in big data era: From securing big data to data-driven security. IEEE Trans Serv Comput, 2019; 14:2055–2072.

37. Kokate, Deshpande, Mahalle, et al. Data stream clustering techniques, applications, and models: Comparative analysis and discussion. Big Data Cogn Comput, 2018; 2:32.

38. […] C-J, Huang S-F. Real-time big data analytics for hard disk drive predictive maintenance. Comput Electr Eng, 2018; 71:93–101.

39. Taha, Meshry, Yang, et al. Two Stream Self-Supervised Learning for Action Recognition. arXiv preprint arXiv:1806.07383; 2018.

40. Shah, Mudaliar. Optimizing Latency Issues in Real-time Streaming Data in Big Data using Spark Stream Processing. EasyChair Preprint; 2019; pp. 2516–2314.

41. […], Li, Wang, et al. Esa-stream: Efficient self-adaptive online data stream clustering. IEEE Trans Knowl Data Eng, 2022; 34:617–630.

42. Souravlas, Anastasiadou. A Modular Arithmetic Approach for Runtime Stream Scheduling and Task Redistribution. In: International Conference on High Performance Computing & Simulation (HPCS 2020), Research Gate; 2020.