Trend-based time series data clustering for wind speed forecasting

Abstract

Wind forecasting is a time series problem that can aid in estimating the annual energy production of potential wind farms. Seasonality and trend are the two significant components that characterize wind time series data. Variability in the trend and seasonal components affects the performance of most forecasting methods. Therefore, to simplify wind forecasting, the nonlinear seasonal and trend components are generally eliminated from the wind time series data. Accuracy then depends on the function applied to eliminate the trend and seasonality. In this article, a hybrid approach for time series forecasting is proposed. A clustering technique has been developed that finds clusters of time series segments showing identical trend components. After the clusters of similar trend components are found, statistical methods, namely, the autoregressive integrated moving average and generalized autoregressive score techniques, are applied to each individual cluster. In the end, the resulting components are aggregated. The experiments show that the cluster-based forecasting technique performs better than the existing statistical models.

Keywords

Wind time series, autoregressive integrated moving average, generalized autoregressive score, hybrid forecasting, National Renewable Energy Laboratory

Introduction

Wind energy plays a significant role in dealing with the global energy crisis. To make it a reliable source of energy, an accurate model for estimating the power produced by wind power plants is required. The research reported in Thapar et al. (2011), Morshedizadeh et al. (2017), Wadhvani and Shukla (2018), and Dongre and Pateriya (2019) has concluded that there is a strong relationship between the power generated by a wind plant and the wind speed at its site. An accurate wind speed prediction model can therefore improve the accuracy of power estimation. Wind speed at a particular site can be predicted through time series forecasting techniques, in which existing time series values are used to predict future values by extracting the hidden pattern of the data. Various statistical models have been introduced for wind speed forecasting, such as autoregressive moving average (ARMA) (Yang et al., 2015), autoregressive integrated moving average (ARIMA) (Torres et al., 2005), and generalized autoregressive score (GAS) (Creal et al., 2013). Usually, time series data include seasonal and trend components that may be homogeneous or heterogeneous in nature. When the trend and seasonality components are homogeneous, the existing classical models (Kavasseri and Seetharaman, 2009; Lydia et al., 2015; Ait Maatallah et al., 2015) are enough to model them, whereas heterogeneous trend and seasonality components are first eliminated by applying appropriate algorithms, and the remaining series is then modeled. However, eliminating the trend and seasonal components from the time series data may cause the loss of informative patterns present in the data. To build a reliable forecasting model, a method is required that can model the time series data without eliminating its trend and seasonal components.

Statistical techniques are frequently used in practice because they produce forecasting results in less time. One of their limitations is that they are not capable of handling heterogeneous time series data. In recent years, several hybrid methods have been introduced to model heterogeneous time series data accurately. Kushwah and Wadhvani (2019) proposed hybrid modeling techniques based on GAS and a neural network for wind speed forecasting; the inclusion of a neural network in the existing GAS model performed well, with favorable levels of prediction error. Inniss (2006) suggested that each time series may have a nonlinear trend and seasonality pattern, which can be used to divide heterogeneous wind speed data into homogeneous groups. The trend component of a wind time series shows the general tendency of the wind speed to increase or decrease over a long period. In contrast, seasonality is the presence of periodic fluctuations in the time series data. Statistical methods are used to transform a non-stationary time series into a stationary one, converting nonlinear trend and seasonal components into linear ones. Kuznetsov and Mohri (2020) and Vilar et al. (2018) have proposed methods in which heterogeneous wind speed data are divided into homogeneous samples using a clustering technique.

Clustering is an unsupervised learning method that divides the data points into a number of groups. In the literature, a number of clustering methods are available that can be applied to non-sequential as well as sequential data points. Because the characteristics of non-sequential and sequential data differ, techniques designed for non-sequential data do not produce good results on time series data. Clustering techniques applied to non-sequential data produce a minimal number of clusters based on a distance metric over the data values; for example, the K-means clustering algorithm is one of the methods used in determining the optimal number of clusters (Zhu et al., 2019). However, time series data are characterized by serial correlation between successive observations. A distance metric alone cannot merge similar data into groups without disturbing this serial correlation. Lim et al. (2018) have suggested that, because time series data may have a trend over time and exhibit seasonality, these characteristics can be used to identify similarly structured data values for generating the clusters.

This article begins by introducing the proposed clustering approach for identifying segments of time series data that have identical trend shapes. Once the clusters of similar trends have been created, statistical methods for time series forecasting, namely, ARIMA and GAS, are used to model the time series data of each cluster. In this hybrid approach, the final forecasted values are obtained by aggregating the results of the models developed on each cluster. Finally, the performance of the statistical models (ARIMA and GAS) and the proposed hybrid models (C-ARIMA and C-GAS) is measured using the mean absolute error (MAE) and root mean square error (RMSE) criteria.

Time series data clustering

In general, wind speed time series data have trend and seasonal components. Trend and seasonal analysis of wind time series data is used to extract statistical characteristics over a period (Johnpaul et al., 2020). This work focuses on grouping the segments of a time series based on similar trend behavior. The data values of any time series may show three types of trend characteristics, that is, increasing, decreasing, or equal (unchanging) patterns. An individual segment of the series may contain one or more of these patterns. Based on the sequence of these patterns in the segments, clusters can be constructed (Wang et al., 2006).

Figure 1 shows a detailed description of the clustering approach, which is based on identifying similar trend components. The trend shows the general tendency of the data values, which may be in an increasing, decreasing, or equal direction. The increasing, decreasing, and equal sets are denoted by I, D, and E, respectively, and a segment is denoted by S. Suppose $m$ segments ( $s_{1}, s_{2}, s_{3}, \dots, s_{m}$ ) are created from the data. For each segment S, first find the I, D, and E sets and then calculate the length of each. Based on the calculated lengths, identify the similar segments: if two or more segments show the same pattern, assign them to the same cluster. Segments in which I, D, and E have the same length are put into one cluster; segments in which I is longer than D and E are put into a separate cluster; and segments in which D is longer than I and E are put into another separate cluster. In this way, the number of clusters found depends on the lengths of the sets.

Figure 1.

Clustering approach based on trend components.
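The segment-labeling step described above can be sketched in a few lines of Python. This is only an illustrative reading of the procedure, not the authors' implementation: it assumes that the "length" of the I, D, and E sets is the number of increasing, decreasing, and unchanged steps between consecutive values, and that segments without a single longest set share one cluster. The function names, the `tol` parameter, the `"tie"` label, and the sample values are hypothetical.

```python
from collections import defaultdict


def trend_sets(segment, tol=0.0):
    """Count increasing (I), decreasing (D), and equal (E) steps in one segment."""
    counts = {"I": 0, "D": 0, "E": 0}
    for prev, curr in zip(segment, segment[1:]):
        diff = curr - prev
        if diff > tol:
            counts["I"] += 1
        elif diff < -tol:
            counts["D"] += 1
        else:
            counts["E"] += 1
    return counts


def trend_label(segment, tol=0.0):
    """Label a segment by its longest set; segments with no single longest set get 'tie'."""
    counts = trend_sets(segment, tol)
    longest = max(counts.values())
    winners = [name for name, n in counts.items() if n == longest]
    return winners[0] if len(winners) == 1 else "tie"


def cluster_segments(segments, tol=0.0):
    """Group segment indices by trend label (assumed clustering rule)."""
    clusters = defaultdict(list)
    for index, segment in enumerate(segments):
        clusters[trend_label(segment, tol)].append(index)
    return dict(clusters)


# Made-up wind speed segments (m/s), for illustration only.
segments = [
    [3.0, 3.5, 4.1, 4.0, 4.8],  # mostly increasing
    [6.2, 5.9, 5.1, 5.3, 4.4],  # mostly decreasing
    [2.0, 2.4, 2.9, 3.3, 3.1],  # mostly increasing
    [5.0, 5.0, 5.0, 5.0, 5.1],  # mostly unchanged
]
clusters = cluster_segments(segments)
print(clusters)  # {'I': [0, 2], 'D': [1], 'E': [3]}
```

With real 5-min wind data, exact equality between consecutive values is rare, so a small positive `tol` would probably be needed for the E set to be non-empty; the value to use is a modeling choice that the text does not specify.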

Statistical models for wind speed forecasting

Forecasting of time series data with statistical models can be characterized in terms of data, horizon, and accuracy. Wind time series forecasting may use wind direction, wind speed, air density, temperature, and so on as input data, and the forecasted output mainly consists of the estimated values of wind speed. The forecast horizon is the future time period for which the parameter is predicted, usually ranging from short-term (day-ahead) to long-term (multiple-day-ahead) horizons. Forecasting accuracy measures the quality of the modeling technique and is evaluated using suitable performance metrics. In this work, accuracy is evaluated for the following statistical models: ARIMA and GAS.

ARIMA model

The ARIMA model (Ait Maatallah et al., 2015) is used for time series forecasting. It is a generalized form of the ARMA model, which predicts future values of a time series from its past values. The ARIMA model can also be applied when the given time series is non-stationary; in this case, differencing is applied one or more times to remove the non-stationarity present in the data. AR denotes that the evolving variable is regressed on its own lagged values, MA denotes that the regression error is a linear combination of current and past error terms, and I (integrated) denotes that the data values are replaced by the differences between current and previous values. The mathematical form of the ARIMA (p, d, q) model can be represented as

y_{t} = α_{0} + α_{1} y_{t - 1} + \dots + α_{p} y_{t - p} + ε_{t} + β_{1} ε_{t - 1} + \dots + β_{q} ε_{t - q}

(1)

where $y_{t}$ is the variable of interest (the d-times differenced series when d > 0), which is lagged up to the pth value; $α_{0}, α_{1}, \dots, α_{p}$ are the AR coefficients; $β_{1}, \dots, β_{q}$ are the MA coefficients; and $ε_{t}$ is assumed to be white noise. The parameters p, d, and q are non-negative integers: p is the number of time lags of the autoregressive model, d is the number of times the series is differenced (past values subtracted from current values) to obtain stationary time series data, and q is the order of the moving average model.
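To make equation (1) concrete, the following minimal Python sketch differences a short series and then evaluates the right-hand side of equation (1) for one step ahead, with the unknown current shock $ε_{t}$ set to zero. The coefficients, residuals, and wind speed values are made up for illustration; in practice the coefficients would be estimated from data by a fitted ARIMA routine, which this sketch does not do. The function names are hypothetical.

```python
def difference(series, d=1):
    """Apply first-order differencing d times (the 'I' step of ARIMA)."""
    out = list(series)
    for _ in range(d):
        out = [b - a for a, b in zip(out, out[1:])]
    return out


def arma_one_step(y_hist, eps_hist, alpha0, alphas, betas):
    """Evaluate equation (1) one step ahead with the current shock set to 0.

    y_hist and eps_hist are ordered from oldest to newest, so y_hist[-i]
    is y_{t-i} and eps_hist[-j] is eps_{t-j}.
    """
    ar_part = sum(a * y_hist[-i] for i, a in enumerate(alphas, start=1))
    ma_part = sum(b * eps_hist[-j] for j, b in enumerate(betas, start=1))
    return alpha0 + ar_part + ma_part


# Illustrative values only: four wind speeds (m/s) and assumed coefficients.
speeds = [5.0, 6.0, 8.0, 11.0]
diffs = difference(speeds, d=1)   # [1.0, 2.0, 3.0]
residuals = [0.0, 0.0, 0.5]       # assumed past shocks of the differenced series

# ARIMA(1, 1, 1)-style step: forecast the next difference, then undo the differencing.
next_diff = arma_one_step(diffs, residuals, alpha0=0.5, alphas=[0.5], betas=[0.2])
next_speed = speeds[-1] + next_diff
print(round(next_diff, 3), round(next_speed, 3))  # 2.1 13.1
```

Here the forecast of the differenced series is 0.5 + 0.5 × 3.0 + 0.2 × 0.5 = 2.1, and adding it back to the last observed speed gives 13.1 m/s.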

GAS model

The GAS model is a score-driven model for nonlinear time series data. It is an observation-driven model, which allows it to be applied to asymmetric data, data with more complex dynamics, and long-term data without further complexity (Creal et al., 2013; Harvey, 2013). For wind time series, the GAS model is capable of handling the varying density present in the data. The model is represented by the conditional observation density $P (y_{t} | θ_{t})$ , where $y_{t}$ is the variable of interest, which depends on the latent time-varying parameter $θ_{t}$ . The time-varying parameter is updated using an autoregressive equation, defined recursively as

θ_{t} = ω + \sum_{i = 1}^{p} φ_{i} θ_{t - i} + \sum_{j = 1}^{q} α_{j} S (θ_{t - j}) \frac{\partial \log P (y_{t - j} | θ_{t - j})}{\partial θ_{t - j}}

(2)

where $ω$ is a vector of constants, $φ_{i}$ are the autoregressive coefficients, $α_{j}$ are scaling parameters, and S is a strictly positive scaling factor that multiplies the first derivative of the log conditional density P (the score), which is the contribution of the single observation at time $t - j$ . S depends on the observations and the time-varying parameter. The key feature of the GAS model is the choice of the scaled score as the driving mechanism, which makes it applicable to a range of nonlinear models. The GAS model performs better for nonlinear data as compared to the ARIMA model. Since the GAS model relies on the score, it exploits the entire density structure instead of means and higher moments only.
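As a hedged illustration of the recursion in equation (2), the sketch below takes the simplest case: p = q = 1 and a Gaussian observation density with known variance, where the time-varying parameter is the mean. In that case the score is (y − θ)/σ², and scaling by the inverse Fisher information (σ²) reduces the update to a correction proportional to the last forecast error. The Gaussian density, the inverse-information scaling, the function name, and all numbers are assumptions chosen for illustration; the article does not state which density or scaling was used in the experiments.

```python
def gas_gaussian_mean(y, omega, phi, alpha, sigma2=1.0, theta0=0.0):
    """Filter a time-varying mean with a GAS(1, 1) recursion.

    Returns [theta_0, theta_1, ..., theta_n]; theta_t is the mean used for y[t],
    and the last entry is the one-step-ahead mean after the final observation.
    """
    thetas = [theta0]
    for obs in y:
        theta = thetas[-1]
        score = (obs - theta) / sigma2   # d log P(y | theta) / d theta for a Gaussian
        scale = sigma2                   # inverse Fisher information (assumed scaling)
        thetas.append(omega + phi * theta + alpha * scale * score)
    return thetas


# Illustrative values only.
thetas = gas_gaussian_mean([4.0, 6.0, 5.0], omega=0.5, phi=0.9, alpha=0.5,
                           sigma2=2.0, theta0=4.0)
print([round(t, 3) for t in thetas])  # [4.0, 4.1, 5.14, 5.056]
```

The second update shows the mechanism: the observation 6.0 exceeds the current mean 4.1, so the scaled score 1.9 pulls the next mean up to 0.5 + 0.9 × 4.1 + 0.5 × 1.9 = 5.14.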

Proposed hybrid methodology

The dataset used in the experiments has been taken from the NREL site (National Renewable Energy Laboratory (NREL), 2007). The resource file includes a number of parameters, that is, wind speed, wind power, air temperature, air density, and so on. This work focuses on univariate time series forecasting; therefore, only wind speed data have been considered for modeling. Figure 2 illustrates a detailed overview of the proposed model.

Figure 2.

Proposed model for wind speed prediction.

The complete data are divided into training and testing pairs. The proposed methodology is a hybrid forecasting technique, which combines the clustering of time series data with statistical forecasting methods. Initially, the whole training data are divided into equal-size segments. The size of the segment depends on the size of the complete data taken for modeling. After that, the proposed clustering method, described in section “Time series data clustering,” is applied to form the clusters. As a result, a number of groups having a linear trend component in the time series data are formed. Since each cluster has a linear trend component, statistical time series forecasting methods, namely, ARIMA and GAS, are applied to model the time series data. The hybrid approaches are named C-ARIMA (ARIMA model with clustering technique) and C-GAS (GAS model with clustering technique). In the first one, the ARIMA statistical method is applied after clustering. Similarly, in the second one, the GAS statistical method is applied after clustering. After the hybrid forecasting models are developed on the training set, their performance is evaluated by MAE and RMSE values on the testing data. The complete evaluation process is discussed in section “Research design and analysis.”
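The flow of the hybrid methodology (segment, cluster, then fit one model per cluster) can be outlined as follows. This is a structural sketch under several assumptions that the text does not settle: a short tail that does not fill a segment is dropped, the cluster label is a simplified dominant-trend rule, the segments of a cluster are concatenated before fitting, and `mean_model` is a placeholder standing in for ARIMA or GAS. All function names and data values are hypothetical.

```python
def split_into_segments(series, segment_size):
    """Cut the series into consecutive equal-size segments (incomplete tail dropped)."""
    n_segments = len(series) // segment_size
    return [series[k * segment_size:(k + 1) * segment_size] for k in range(n_segments)]


def dominant_trend(segment):
    """Simplified trend label based on counts of rising and falling steps."""
    ups = sum(1 for a, b in zip(segment, segment[1:]) if b > a)
    downs = sum(1 for a, b in zip(segment, segment[1:]) if b < a)
    if ups > downs:
        return "increasing"
    if downs > ups:
        return "decreasing"
    return "equal"


def fit_per_cluster(train, segment_size, fit):
    """Group segments by trend label and fit one model per cluster."""
    groups = {}
    for segment in split_into_segments(train, segment_size):
        groups.setdefault(dominant_trend(segment), []).extend(segment)
    return {label: fit(values) for label, values in groups.items()}


def mean_model(values):
    """Placeholder 'model': the cluster mean. ARIMA or GAS would be fitted here."""
    return sum(values) / len(values)


# Made-up training values, for illustration only.
train = [1, 2, 3, 4, 8, 6, 5, 3, 2, 3, 5, 6, 4, 4, 4, 4, 9]
models = fit_per_cluster(train, segment_size=4, fit=mean_model)
print(models)  # {'increasing': 3.25, 'decreasing': 5.5, 'equal': 4.0}
```

Passing the fitting routine as the `fit` argument keeps the clustering stage independent of the forecasting model, which mirrors how C-ARIMA and C-GAS differ only in the model applied after clustering.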

Research design and analysis

This section presents the experiments conducted on different datasets and then discusses the results achieved. The first subsection covers the details of the datasets used for experimental purposes. The second subsection describes the metrics used for performance evaluation. The third subsection analyzes the results.

Dataset

The datasets used for the experiments are drawn from NREL sites with Ids 72509, 16833, 68003, and 124693. Site 72509, located at longitude $- 106.259 °$ and latitude $41.776 °$ , has an average wind speed of 9.241 m/s. All the datasets are described in Table 1. NREL provides 5-min average wind speeds from the SCADA (supervisory control and data acquisition) wind plant system, reported at a height of 100 m, with 105,120 observations in a year of data. For our experimentation, 8500 observations have been taken and divided into training and testing pairs.

Table 1.

Dataset description.

Dataset	Site Id	Year	Mean	Standard deviation	Min	Max
#1	72509	2007	11.513	6.055	0.088	28.422
#2	72509	2008	13.604	5.676	0.415	27.832
#3	72509	2009	13.925	5.818	0.253	27.689
#4	72509	2010	9.267	4.498	0.262	21.418
#5	68003	2011	8.646	3.566	0.094	16.772
#6	124693	2012	5.772	5.132	0.048	26.178
#7	16833	2007	7.967	3.963	0.161	20.84
#8	16833	2008	9.831	4.357	0.14	20.841
#9	16833	2009	8.772	3.792	0.653	19.128
#10	16833	2010	8.553	4.093	0.46	20.798
#11	16833	2011	7.816	3.911	0.067	17.829
#12	16833	2012	9.668	4.407	0.189	21.875
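The observation counts quoted for the NREL data can be checked with simple arithmetic. The short sketch below assumes a 365-day year, which is what the figure of 105,120 observations corresponds to at a 5-min sampling interval, and shows how much time the 8500-observation subset covers. The variable names are illustrative.

```python
MINUTES_PER_SAMPLE = 5

samples_per_day = 24 * 60 // MINUTES_PER_SAMPLE   # 288 samples per day
samples_per_year = 365 * samples_per_day          # 105120, assuming a 365-day year

subset = 8500                                     # observations used in the experiments
subset_days = subset / samples_per_day            # roughly 29.5 days of data

print(samples_per_day, samples_per_year, round(subset_days, 1))  # 288 105120 29.5
```

The 8500 observations therefore span roughly one month of measurements at each site and year.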

Performance monitoring criteria

The performance of the statistical and hybrid models is measured by suitable criteria that assess the ability of the models. For our experimentation, MAE and RMSE are used to measure the performance of wind speed forecasting. They are defined as follows

MAE = \frac{1}{N} \sum_{i = 1}^{N} | \hat{y} (i) - y (i) |

(3)

RMSE = \sqrt{\frac{1}{N} \sum_{i = 1}^{N} {(\hat{y} (i) - y (i))}^{2}}

(4)

Here, N is the total number of observations, y is the observed value, and $\hat{y}$ is the forecasted value. The model with the lowest values of MAE and RMSE performs best. The ARIMA, GAS, and hybrid models are implemented in Python 3.6.
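Equations (3) and (4) translate directly into code. The following minimal sketch uses only the Python standard library; the function names and the sample values are illustrative and are not taken from the experiments.

```python
import math


def mae(actual, forecast):
    """Mean absolute error, equation (3)."""
    n = len(actual)
    return sum(abs(f - a) for a, f in zip(actual, forecast)) / n


def rmse(actual, forecast):
    """Root mean square error, equation (4)."""
    n = len(actual)
    return math.sqrt(sum((f - a) ** 2 for a, f in zip(actual, forecast)) / n)


# Illustrative values (m/s): the errors are 1, -2, 0, and 2.
actual = [5.0, 7.0, 6.0, 8.0]
forecast = [6.0, 5.0, 6.0, 10.0]

print(mae(actual, forecast))   # 1.25
print(rmse(actual, forecast))  # 1.5
```

Because RMSE squares the errors before averaging, it is never smaller than MAE and penalizes large individual errors more heavily, which is why both values are reported side by side in the result tables.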

Results analysis

In this research article, statistical models (ARIMA and GAS) and hybrid models (C-ARIMA and C-GAS) are applied to 12 different datasets that are drawn from the NREL site. Table 2 shows the prediction results of ARIMA models in terms of MAE and RMSE values. Similarly, Table 3 shows the prediction results of the GAS model.

Table 2.

MAE, RMSE values using the ARIMA and clustered ARIMA models.

Dataset	ARIMA		C1-ARIMA		C2-ARIMA		C3-ARIMA
Dataset	MAE	RMSE	MAE	RMSE	MAE	RMSE	MAE	RMSE
#1	7.346	8.649	5.159	5.973	5.940	6.937	5.570	6.326
#2	4.675	5.972	7.029	8.430	4.191	5.206	4.726	6.047
#3	2.968	4.281	2.747	3.897	3.821	5.021	6.156	6.964
#4	6.593	6.976	6.374	6.757	9.113	9.383	13.893	14.061
#5	4.074	4.359	10.319	11.136	6.344	6.685	7.404	7.697
#6	5.558	7.124	5.563	7.254	5.653	7.331	5.679	7.241
#7	2.796	3.291	3.257	3.833	1.714	2.106	92.738	123.995
#8	4.246	4.950	3.362	4.049	3.554	4.252	4.880	5.645
#9	4.207	4.785	11.621	12.062	9.483	10.010	9.358	9.877
#10	3.455	3.771	2.528	2.869	4.725	4.983	2.294	2.591
#11	2.188	2.743	2.144	2.699	2.048	2.609	6.471	6.956
#12	4.751	6.593	4.220	4.808	11.128	12.232	3.917	4.949

MAE: mean absolute error; RMSE: root mean square error; ARIMA: autoregressive integrated moving average.

Table 3.

MAE, RMSE values using the GAS and clustered GAS models.

Dataset	GAS		C1-GAS		C2-GAS		C3-GAS
Dataset	MAE	RMSE	MAE	RMSE	MAE	RMSE	MAE	RMSE
#1	5.017	5.767	2.955	3.398	6.324	7.377	6.454	7.342
#2	4.788	6.176	5.776	7.304	4.510	5.672	4.802	6.133
#3	3.003	3.820	13.468	15.224	1.785	2.395	2.887	3.356
#4	2.738	3.431	1.794	2.449	4.767	5.940	9.919	10.250
#5	5.212	5.373	6.928	7.107	8.110	8.579	7.266	7.559
#6	7.268	9.359	7.175	9.369	5.940	7.771	6.143	7.955
#7	1.851	2.599	3.155	3.470	1.715	2.016	5.188	5.402
#8	6.324	7.166	5.487	6.258	5.341	6.077	5.029	5.777
#9	7.068	8.019	5.520	6.553	4.321	5.370	7.387	8.227
#10	4.952	5.371	2.112	2.426	8.576	8.800	2.820	3.159
#11	2.474	3.164	2.988	3.468	2.420	3.078	8.334	8.760
#12	4.378	5.951	5.900	6.899	5.213	6.823	6.093	7.259

MAE: mean absolute error; RMSE: root mean square error; GAS: generalized autoregressive score. Bold numeric values of MAE and RMSE indicate that the prediction model in the corresponding column has the least prediction error and performed best on the dataset in that row.

Figure 3 demonstrates the wind speed prediction using the ARIMA and GAS models, applied separately on Dataset #1. Here, the GAS model shows better accuracy than the ARIMA model. Figure 4 demonstrates the wind speed prediction using the ARIMA model, applied individually on each cluster of Dataset #1. The ARIMA model shows better accuracy on cluster 1 than on the other clusters. Similarly, Figure 5 demonstrates the wind speed prediction using the GAS model, applied individually on each cluster of Dataset #1.

Figure 3.

The figures in the left and right panel show the wind speed prediction using the ARIMA and GAS model, respectively, on Dataset #1.

Figure 4.

The figures in the left, middle, and right panel show the wind speed prediction using the ARIMA model for first, second, and third clusters, respectively, on Dataset #1.

Figure 5.

The figures in the left, middle, and right panel show the wind speed prediction using the GAS model for first, second, and third clusters, respectively, on Dataset #1.

Figure 6 demonstrates the wind speed prediction using the ARIMA and GAS models, applied separately on Dataset #7. Here, the GAS model shows better accuracy than the ARIMA model. Figure 7 demonstrates the wind speed prediction using the ARIMA model, applied individually on each cluster of Dataset #7. The ARIMA model shows better accuracy on cluster 2 than on the other clusters. Similarly, Figure 8 demonstrates the wind speed prediction using the GAS model, applied individually on each cluster of Dataset #7.

Figure 6.

The figures in the left and right panel show the wind speed prediction using the ARIMA and GAS model, respectively, on Dataset #7.

Figure 7.

The figures in the left, middle, and right panel show the wind speed prediction using the ARIMA model for first, second, and third clusters, respectively, on Dataset #7.

Figure 8.

The figures in the left, middle, and right panel show the wind speed prediction using the GAS model for first, second, and third clusters, respectively, on Dataset #7.

Tables 2 and 3 report the experimental results in terms of MAE and RMSE values for the ARIMA variants and the GAS variants, respectively. Lower MAE and RMSE values indicate better performance. As seen from Table 2, the proposed hybrid model outperforms the ARIMA model on most datasets. For example, the C2-ARIMA model developed on Dataset #7 obtains MAE and RMSE values of 1.714 and 2.106, respectively, against 2.796 and 3.291 for the ARIMA model. The table shows that the clustering-based ARIMA model performs better than the ARIMA model most of the time. As seen from Table 3, the proposed hybrid model also outperforms the GAS model on most datasets. For example, the C2-GAS model developed on Dataset #7 obtains MAE and RMSE values of 1.715 and 2.016, respectively, against 1.851 and 2.599 for the GAS model. This table also shows that the clustering-based GAS model performs better than the GAS model most of the time. Overall, it can be concluded that the clustering-based hybrid models outperform the statistical models.
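The statement that the clustering-based models perform better "most of the time" can be checked against the MAE columns of Tables 2 and 3. The sketch below counts, for each table, the datasets on which at least one clustered variant (C1, C2, or C3) has a lower MAE than the unclustered model. Comparing the best cluster with the baseline is an assumption about how the tables are meant to be read; the values are copied from the tables, and the function name is illustrative.

```python
# MAE per dataset (#1..#12): (baseline, C1, C2, C3), copied from Table 2.
arima_mae = [
    (7.346, 5.159, 5.940, 5.570), (4.675, 7.029, 4.191, 4.726),
    (2.968, 2.747, 3.821, 6.156), (6.593, 6.374, 9.113, 13.893),
    (4.074, 10.319, 6.344, 7.404), (5.558, 5.563, 5.653, 5.679),
    (2.796, 3.257, 1.714, 92.738), (4.246, 3.362, 3.554, 4.880),
    (4.207, 11.621, 9.483, 9.358), (3.455, 2.528, 4.725, 2.294),
    (2.188, 2.144, 2.048, 6.471), (4.751, 4.220, 11.128, 3.917),
]

# MAE per dataset (#1..#12): (baseline, C1, C2, C3), copied from Table 3.
gas_mae = [
    (5.017, 2.955, 6.324, 6.454), (4.788, 5.776, 4.510, 4.802),
    (3.003, 13.468, 1.785, 2.887), (2.738, 1.794, 4.767, 9.919),
    (5.212, 6.928, 8.110, 7.266), (7.268, 7.175, 5.940, 6.143),
    (1.851, 3.155, 1.715, 5.188), (6.324, 5.487, 5.341, 5.029),
    (7.068, 5.520, 4.321, 7.387), (4.952, 2.112, 8.576, 2.820),
    (2.474, 2.988, 2.420, 8.334), (4.378, 5.900, 5.213, 6.093),
]


def count_wins(rows):
    """Number of datasets where the best clustered variant beats the baseline."""
    return sum(1 for baseline, *clustered in rows if min(clustered) < baseline)


arima_wins = count_wins(arima_mae)
gas_wins = count_wins(gas_mae)
print(arima_wins, gas_wins)  # 9 10
```

Under this reading, a clustered variant improves on ARIMA for 9 of the 12 datasets (all except #5, #6, and #9) and on GAS for 10 of the 12 (all except #5 and #12).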

Conclusion

The existing classical models of wind forecasting, such as ARIMA and GAS, are relatively simple; however, because of the complexity of the data, they have considerable restrictions in practice. Better model performance can be achieved by reducing the complexity of the time series data without losing its informative patterns. This article provides one such mechanism. In the experiments, the clustering-based hybrid models (C-ARIMA and C-GAS) performed better than the existing ARIMA and GAS models on most datasets. The article analyzes the trend characteristics of wind time series data and finds that the trend component takes different shapes. When the time series data are first grouped according to the shape of the trend component and a model is then developed for each group using an existing forecasting technique, accuracy improves. To achieve better generalization, the models have been developed on 12 different datasets drawn from the NREL site.

Footnotes

Declaration of conflicting interests

The author(s) declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.

Funding

The author(s) received no financial support for the research, authorship, and/or publication of this article.

ORCID iDs

Varsha Kushwah

Rajesh Wadhvani

References

Ait Maatallah, Achuthan, Janoyan, et al. (2015) Recursive wind speed forecasting based on Hammerstein auto-regressive model. Applied Energy 145: 191–197.

Creal, Koopman, Lucas (2013) Generalized autoregressive score models with applications. Journal of Applied Econometrics 28(5): 777–795.

Dongre, Pateriya (2019) Power curve model classification to estimate wind turbine power output. Wind Engineering 43(3): 213–224.

Harvey (2013) Dynamic Models for Volatility and Heavy Tails: With Applications to Financial and Economic Time Series. New York: Cambridge University Press.

Inniss (2006) Seasonal clustering technique for time series data. European Journal of Operational Research 175(1): 376–384.

Johnpaul, Prasad MVNK, Nickolas, et al. (2020) Trendlets: A novel probabilistic representational structures for clustering the time series data. Expert Systems with Applications 145: 113–119.

Kavasseri, Seetharaman (2009) Day-ahead wind speed forecasting using f-ARIMA models. Renewable Energy 34(5): 1388–1393.

Kushwah, Wadhvani (2019) Performance monitoring of wind turbines using advanced statistical methods. Sādhanā 44: 163.

Kuznetsov, Mohri (2020) Discrepancy-based theory and algorithms for forecasting non-stationary time series. Annals of Mathematics and Artificial Intelligence 88: 367–399.

Lim, Wang, Yao (2018) Time-series momentum in nearly 100 years of stock returns. Journal of Banking and Finance 97: 283–296.

Lydia, Suresh Kumar, Immanuel Selvakumar, et al. (2015) Wind resource estimation using wind speed and power curve models. Renewable Energy 83: 425–434.

Morshedizadeh, Kordestani, Carriveau, et al. (2017) Improved power curve monitoring of wind turbines. Wind Engineering 41(4): 260–271.

National Renewable Energy Laboratory (NREL) (2007) Western dataset (Site-id 72509). Available at: https://www.nrel.gov/grid/western-wind-data.html

Thapar, Agnihotri, Sethi (2011) Critical analysis of methods for mathematical modelling of wind turbines. Renewable Energy 36(11): 3166–3177.

Torres, García, De Blas, et al. (2005) Forecast of hourly average wind speed with ARMA models in Navarre (Spain). Solar Energy 79(1): 65–77.

Vilar, Lafuente-Rego, D’Urso (2018) Quantile autocovariances: A powerful tool for hard and soft partitional clustering of time series. Fuzzy Sets and Systems 340: 38–72.

Wadhvani, Shukla (2018) Analysis of parametric and non-parametric regression techniques to model the wind turbine power curve. Wind Engineering 43(3): 225–232.

Wang, Smith, Hyndman (2006) Characteristic-based clustering for time series data. Data Mining and Knowledge Discovery 13(3): 335–364.

Yang, Sharma, et al. (2015) Forecasting of global horizontal irradiance by exponential smoothing, using decompositions. Energy 81: 111–119.

Zhu, Zhang, Wen, et al. (2019) Fast and stable clustering analysis based on grid-mapping K-means algorithm and new clustering validity index. Neurocomputing 363: 149–170.