Abstract
In the wind industry, the power curve serves as a performance index of the wind turbine. Machine-specific power curves are not sufficient to measure the performance of wind turbines under different environmental and geographical conditions. The aim is to develop a site-specific power curve of the wind turbine to estimate its output power. In this article, statistical methods based on empirical power curves are implemented using polynomial regression, splines regression, and smoothing splines regression. In the case of splines regression, instead of randomly selecting knots, the optimal number of knots and their positions are identified using three approaches: particle swarm optimization, half-split, and clustering. The National Renewable Energy Laboratory datasets have been used to develop the models. Empirical investigations show that knot-selection strategies improve the performance of splines regression. However, the smoothing splines-based power curve model estimates output power more accurately than all the others.
Introduction
The wind power industry is growing tremendously, with wind proving to be an inexhaustible, sustainable, and environment-friendly resource for generating electricity. In India, wind power technology development has increased significantly, with 2016 as the record-breaking year (Fried et al., 2017). Large wind farms, which cover extended areas of hundreds of square miles and are located at high wind speed sites, are designed to produce electricity on a large scale. Some approaches for identifying potential wind farms are reported in Carta et al. (2009), Safari (2011), and Mohammadi et al. (2017). These wind farms consist of several hundred individual wind turbines. Consistently high performance is needed to make these farms profitable. The performance of operational wind farms can be maximized by optimizing the wind farm layout and the placement of turbines. At the design stage, turbine placement criteria and wind farm layout issues are addressed in Chowdhury et al. (2013) and Abdulrahman and Wood (2017), respectively. Shortly after system installation, a major issue is to check the performance of the wind farm through an initial system evaluation and subsequently to monitor system performance for continuous output optimization. In this study, the aim is to estimate the wind turbine power output in a site-specific environment to monitor its performance and reliability. In the wind industry, the power curve is used to understand the characteristics of the wind turbine. It indicates how much electric power the turbine will produce at different wind speeds. The power curve is usually provided by the manufacturer in the form of a generic power curve that includes the cut-in speed, rated speed, and cut-off speed, though this information is not sufficient to cover all aspects of the practical environment. An empirical power curve is required to exhibit the actual behavior of the wind turbine. Sohoni et al. (2016) have presented a detailed review of empirical approaches to developing the power curve of the wind turbine to estimate the output power.
Previously, a number of studies were undertaken to develop an empirical power curve using the manufacturer’s data, some of which are stated in Yang et al. (2003), Thapar et al. (2011), and Dongre and Pateriya (2019). In these studies, techniques such as the cubic law, the Weibull parameter-based approach, and least squares are used to build the wind turbine power curve. However, the studies were restricted to developing the machine-specific power curve. Under different conditions, these models do not show sufficient accuracy, as ideal conditions do not always exist for wind turbines. Hence, power curve modeling requires the incorporation of site-specific conditions to improve the prediction accuracy. The techniques mentioned in Shokrzadeh et al. (2014), Manobel et al. (2018), and You et al. (2018) use high-frequency operational data to estimate the turbine’s behavior at the actual location. These are data-driven approaches in which the models are extracted from the operational data, so the practical environmental aspects are considered in modeling. Shokrzadeh et al. (2014) and Wadhvani and Shukla (2019) have conducted extensive empirical studies to derive a good quality power curve from the actual data of wind turbines. In these studies, several variants of the regression approach were applied, including linearized segmented regression, locally weighted polynomial regression, polynomial regression, four-parameter and five-parameter logistic regression, splines regression, and natural cubic splines regression. Any one of these methods can be applied for power curve modeling according to the available data pattern.
Polynomial regression (Llombart et al., 2006) is a simple way to provide a nonlinear fit between the independent variable, that is, wind speed, and the dependent variable, that is, wind power. It extends the linear model by adding extra independent variables, derived by raising each of the original independent variables to a power. This method estimates the model parameters by the least-squares approach to achieve better goodness of fit on observational data. Increasing the degree of the polynomial makes it possible to produce an extremely nonlinear curve. This model, though very flexible, sometimes takes very strange shapes when the data contain many anomalies. In addition, with a higher degree of flexibility, polynomial models overestimate the predicted power, a problem known as model overfitting. To remedy this problem, a flexible class of models that extends polynomial regression, namely splines regression (Wahba, 1990), can be used. Here, in order to produce a quality response, the range of the input variable, that is, the independent variable, is divided into distinct regions. Instead of fitting higher degree polynomials, lower degree polynomials are fitted in the distinct regions. The points where the regions change are known as knots. This method tries to separate the more flexible and the less flexible regions. However, one of the limitations of splines regression is that it shows abnormal behavior in the region around the knots. Introducing more knots in the flexible region may affect the smoothness of the curve in that region. The best spline fit can be accomplished by selecting an optimal number of knots and their positions.
While simulating the wind–power empirical relationship, the first major issue faced in the literature is that the power curve function exhibits a higher degree of flexibility in the regions in which data observations show a higher variability. The objective is to fit a piecewise polynomial function on wind speed-power data such that the curve is smooth and the error term of the fitted function shows constant variance across the different intervals of wind speed. This objective can be achieved by identifying the regions or by penalizing the function in such a way that it returns the minimum sum of squared estimation errors. In this article, three knot selection methods, that is, knot selection using particle swarm optimization (PSO), half-split, and clustering, have been applied to identify the knot regions. Apart from this, it is usually claimed that penalizing the variability in the curve can also improve the performance of the model. Therefore, penalized splines regression has been introduced, which applies a penalty to the loss function of the curve. Finally, an empirical investigation has been conducted on the data provided in the HOMER software’s resource file (National Renewable Energy Laboratory (NREL), 2012).
Regression-based power curve modeling approaches
This section explores how the regressive methods, namely, polynomial regression, splines regression, and smoothing splines regression are used to fit the empirical power curve on the wind power dataset.
Polynomial regression
In this approach, the power output of a wind turbine is described by the n-degree polynomial function as (Shokrzadeh et al., 2014)
Here,
where
The solution to the problem can be formulated as
where
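As a rough illustration of the least-squares polynomial fit described in this section, the following sketch builds the design matrix of powers of wind speed and solves for the coefficients. The function names, the synthetic data, and the use of NumPy are assumptions made for illustration only; they are not taken from the study.

```python
import numpy as np

def fit_polynomial_power_curve(v, p, degree=3):
    """Least-squares fit of power p on wind speed v with a polynomial of given degree."""
    # Design matrix with columns 1, v, v^2, ..., v^degree.
    V = np.vander(v, N=degree + 1, increasing=True)
    # Minimize ||V @ beta - p||^2; lstsq avoids explicitly inverting V^T V.
    beta, *_ = np.linalg.lstsq(V, p, rcond=None)
    return beta

def predict_polynomial(beta, v):
    """Evaluate the fitted polynomial at wind speeds v."""
    V = np.vander(v, N=len(beta), increasing=True)
    return V @ beta

# Synthetic, made-up observations (not the NREL data).
rng = np.random.default_rng(0)
v = np.linspace(3.0, 15.0, 200)
p = 0.5 * v**3 - 2.0 * v**2 + rng.normal(scale=5.0, size=v.size)

beta = fit_polynomial_power_curve(v, p, degree=3)
rss = float(np.sum((p - predict_polynomial(beta, v)) ** 2))
print("coefficients:", beta)
print("residual sum of squares:", rss)
```

Raising `degree` makes the curve more flexible, which is where the overfitting risk discussed earlier comes from.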
Splines regression
The power curve is most flexible in the region in which data observations show high variability. Splines address these issues by separating the domain of v into distinct regions. Splines are piecewise polynomial of order k of function
Here,
More generally, order-M splines with K knots perform least-squares regression with a total of K + M + 1 regression coefficients. For the dataset having N observations, the general equation is
The fitted power curve is
Here,
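A regression spline of this kind can be sketched with a truncated power basis. The sketch below assumes that M denotes the polynomial degree, so that K knots give K + M + 1 coefficients as stated above; the basis choice, the function names, and the synthetic data are illustrative assumptions rather than the implementation used in the study.

```python
import numpy as np

def truncated_power_basis(v, knots, degree=3):
    """Columns 1, v, ..., v^degree plus one truncated power term per knot."""
    v = np.asarray(v, dtype=float)
    cols = [v**j for j in range(degree + 1)]
    # (v - knot)_+^degree is zero to the left of the knot.
    cols += [np.clip(v - k, 0.0, None) ** degree for k in knots]
    return np.column_stack(cols)

def fit_spline(v, p, knots, degree=3):
    """Least-squares estimate of the K + degree + 1 spline coefficients."""
    B = truncated_power_basis(v, knots, degree)
    beta, *_ = np.linalg.lstsq(B, p, rcond=None)
    return beta

def predict_spline(beta, v, knots, degree=3):
    return truncated_power_basis(v, knots, degree) @ beta

# Synthetic S-shaped power curve with noise (made-up data).
rng = np.random.default_rng(0)
v = np.linspace(0.0, 25.0, 300)
p = 1500.0 / (1.0 + np.exp(-(v - 9.0))) + rng.normal(scale=40.0, size=v.size)

knots = [5.0, 9.0, 13.0]  # hypothetical knot positions
beta = fit_spline(v, p, knots)
print("number of coefficients:", beta.size)
```

With three knots and a cubic, the fit has 3 + 3 + 1 = 7 coefficients, which matches the K + M + 1 count under the stated assumption.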
Smoothing splines regression
In smoothing splines (Griggs, 2013), knots are taken at each unique training observation
where
where
where
Thus, the fitted power curve can be represented as
where Z(λ) is a hat matrix and the results are dependent on the smoothing parameter λ.
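The role of the smoothing parameter λ in a linear smoother can be illustrated with a simplified discrete analogue. The sketch below is not an exact natural cubic smoothing spline: it assumes equally spaced, sorted wind speeds and replaces the integrated squared second derivative with squared second differences (a Whittaker-type smoother). It has the same structure, fitted values = Z(λ) times the observations, and all names are hypothetical.

```python
import numpy as np

def second_difference_matrix(n):
    """(n-2) x n matrix whose rows compute f[i] - 2 f[i+1] + f[i+2]."""
    D = np.zeros((n - 2, n))
    for i in range(n - 2):
        D[i, i:i + 3] = [1.0, -2.0, 1.0]
    return D

def hat_matrix(n, lam):
    """Z(lam) = (I + lam * D^T D)^{-1}: minimizes ||p - f||^2 + lam * ||D f||^2."""
    D = second_difference_matrix(n)
    return np.linalg.inv(np.eye(n) + lam * D.T @ D)

def smooth(p, lam):
    """Fitted values of the discrete smoother for observations p."""
    p = np.asarray(p, dtype=float)
    return hat_matrix(p.size, lam) @ p

# Synthetic observations on an equally spaced wind speed grid (made-up data).
rng = np.random.default_rng(1)
v = np.linspace(0.0, 25.0, 60)
p = 1500.0 / (1.0 + np.exp(-(v - 9.0))) + rng.normal(scale=60.0, size=v.size)

for lam in (0.1, 10.0, 1000.0):
    # The trace of the hat matrix acts as the effective degrees of freedom.
    print(lam, float(np.trace(hat_matrix(v.size, lam))))
```

A larger λ lowers the effective degrees of freedom and gives a smoother curve, while λ = 0 reproduces the observations exactly.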
Knots selection for splines regression
In splines, one needs to select the number of knots and their positions. It is common to place knots at regular intervals, because a spline becomes too wiggly in regions that contain many knots. An alternative is to place more knots in the regions in which the function varies more rapidly. Yet, selecting a large number of knots does not assure an optimal fit. The optimal number of knots and their correct positions can provide a suitably flexible model. In the literature, various knot selection algorithms are available, and some of them are reported in He et al. (2001) and Li et al. (2005). This article empirically investigates three knot selection algorithms, which are described in this section.
PSO
PSO is a simple algorithm that is effective for the optimization of a wide range of functions. Siriruk (2012) has applied PSO as a model fitting technique. This section explains how PSO can be applied to approximate a splines function. PSO is an iterative stochastic computational method inspired by the flocking behavior of birds and the schooling behavior of fish. The objective is to find the optimal number of regions as well as their knot locations in a piecewise linear function. This can be achieved by determining the knot locations and the slope of the curve in each region that minimize the residual sum of squares (RSS) over the approximation interval. Let
PSO is an iterative process, performing the evaluation of the solutions which are represented by the particle locations and then adjusting the particle velocities based on prior knowledge. The aim is an adjustment of the particle position in accordance with its best position that is found so far along with the best position in its neighborhood. Assume that all N observations correspond to N particles. Each particle i maintains its current position,
The parameters C, ϕ1, and ϕ2 are user-supplied coefficients, where
PSO usually runs by repeating equations (15) and (16) until a specified number of iterations has been reached. Another option is to stop when the velocities are close to zero, which indicates that the swarm has converged on the best solution it has found.
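The iteration described above can be sketched as follows. Because the update equations themselves are not reproduced here, the sketch uses the widely used constriction-style velocity update, treats the whole swarm as each particle's neighborhood, fixes the number of knots in advance, and picks a small swarm size. These choices, the default values of `C`, `phi1`, and `phi2`, and all function names are assumptions, not the settings used in the study.

```python
import numpy as np

def rss_piecewise_linear(knots, v, p):
    """RSS of a continuous piecewise-linear least-squares fit with the given knots."""
    cols = [np.ones_like(v), v] + [np.clip(v - k, 0.0, None) for k in np.sort(knots)]
    B = np.column_stack(cols)
    beta, *_ = np.linalg.lstsq(B, p, rcond=None)
    return float(np.sum((p - B @ beta) ** 2))

def pso_knots(v, p, n_knots=2, n_particles=20, n_iter=60,
              C=0.729, phi1=2.05, phi2=2.05, seed=0):
    """Search knot positions that minimize the RSS (global-best PSO sketch)."""
    rng = np.random.default_rng(seed)
    lo, hi = float(v.min()), float(v.max())
    pos = rng.uniform(lo, hi, size=(n_particles, n_knots))
    vel = np.zeros_like(pos)
    best_pos = pos.copy()  # best position seen by each particle
    best_cost = np.array([rss_piecewise_linear(x, v, p) for x in pos])
    g = int(np.argmin(best_cost))
    g_pos, g_cost = best_pos[g].copy(), float(best_cost[g])
    history = [g_cost]
    for _ in range(n_iter):
        r1 = rng.random(pos.shape)
        r2 = rng.random(pos.shape)
        # Velocity update: pull toward personal best and swarm best.
        vel = C * (vel + phi1 * r1 * (best_pos - pos) + phi2 * r2 * (g_pos - pos))
        pos = np.clip(pos + vel, lo, hi)  # keep knots inside the data range
        cost = np.array([rss_piecewise_linear(x, v, p) for x in pos])
        improved = cost < best_cost
        best_pos[improved] = pos[improved]
        best_cost[improved] = cost[improved]
        g = int(np.argmin(best_cost))
        if best_cost[g] < g_cost:
            g_pos, g_cost = best_pos[g].copy(), float(best_cost[g])
        history.append(g_cost)
    return np.sort(g_pos), g_cost, history

# Synthetic piecewise-linear power curve with noise (made-up data).
rng = np.random.default_rng(42)
v = np.linspace(0.0, 20.0, 200)
p = 100.0 * np.clip(v - 4.0, 0.0, 8.0) + rng.normal(scale=10.0, size=v.size)
knots, cost, history = pso_knots(v, p)
print("knots:", knots, "RSS:", cost)
```

This version stops after a fixed number of iterations; a velocity-based stopping rule could be added by checking `np.abs(vel).max()` inside the loop.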
Half-split method
Keeping in mind the fact that the second-order derivative of a cubic spline is a piecewise linear function, Tjahjowidodo et al. (2015) applied a data splitting method for knot evaluation. In this approach, the second derivative of the sample data points is first estimated, and then subsets of the resulting samples are each approximated by a straight line. The key step is therefore to split the data into a set of piecewise linear segments, and the half-split method is employed to subdivide the data. In the first step, all data observations are assumed to be part of a single straight line having a knot vector
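The general idea of halving the data until each piece is close to a straight line can be sketched as below. The stopping tolerance, the minimum segment size, the use of the maximum absolute residual as the linearity check, and the function name are assumptions chosen for illustration; the exact rules of Tjahjowidodo et al. (2015) may differ.

```python
import numpy as np

def half_split_knots(x, y, tol=1e-3, min_points=3):
    """Recursively halve the data until each piece is close to a straight line.

    x, y: sorted abscissae and (estimated) second-derivative values.
    Returns the x-positions of the split points, used here as knots.
    """
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    min_points = max(3, min_points)  # a 2-point piece is always a straight line

    def split(lo, hi):
        # lo and hi are inclusive indices of the current segment.
        if hi - lo + 1 < min_points:
            return []
        xs, ys = x[lo:hi + 1], y[lo:hi + 1]
        slope, intercept = np.polyfit(xs, ys, 1)
        max_err = np.max(np.abs(ys - (slope * xs + intercept)))
        if max_err <= tol:
            return []  # segment is already well described by one line
        mid = (lo + hi) // 2
        # Split in half; the midpoint becomes a knot shared by both halves.
        return split(lo, mid) + [float(x[mid])] + split(mid, hi)

    return split(0, len(x) - 1)

# Made-up piecewise-linear "second derivative" with a kink at x = 0.5.
x = np.linspace(0.0, 1.0, 65)
y = np.abs(x - 0.5)
print(half_split_knots(x, y, tol=1e-6))
```

When a kink does not fall on a midpoint, this sketch keeps halving around it and returns several nearby knots; merging neighbouring knots afterwards would be one possible refinement.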
K-means clustering
The objective of this method is the grouping of data observations with common characteristics. The objective can be achieved by identifying k sets
In the second step, for each cluster
The algorithm alternates between the two steps until a stopping criterion is met: no data point changes its cluster, the sum of the distances has been minimized, or the maximum number of iterations has been reached. The algorithm converges to a result that may be only a local optimum, so running it more than once with randomized starting centroids might give better outcomes.
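The two alternating steps can be sketched for one-dimensional wind speed data as follows. How knots are derived from the clusters is not spelled out here, so placing a knot midway between adjacent centroids is an assumption, as are the function names and the optional `init` argument.

```python
import numpy as np

def kmeans_1d(x, k, n_iter=100, init=None, seed=0):
    """Lloyd's algorithm on 1-D data; returns sorted centroids and labels."""
    x = np.asarray(x, dtype=float)
    if init is None:
        rng = np.random.default_rng(seed)
        centroids = rng.choice(x, size=k, replace=False).astype(float)
    else:
        centroids = np.asarray(init, dtype=float).copy()
    labels = None
    for _ in range(n_iter):
        # Step 1: assign each observation to its nearest centroid.
        new_labels = np.argmin(np.abs(x[:, None] - centroids[None, :]), axis=1)
        if labels is not None and np.array_equal(new_labels, labels):
            break  # no observation changed cluster
        labels = new_labels
        # Step 2: move each centroid to the mean of its cluster.
        for j in range(k):
            members = x[labels == j]
            if members.size > 0:  # leave an empty cluster's centroid unchanged
                centroids[j] = members.mean()
    centroids = np.sort(centroids)
    labels = np.argmin(np.abs(x[:, None] - centroids[None, :]), axis=1)
    return centroids, labels

def knots_from_centroids(centroids):
    """Assumed rule: one knot midway between each pair of adjacent centroids."""
    c = np.sort(np.asarray(centroids, dtype=float))
    return (c[:-1] + c[1:]) / 2.0

# Made-up wind speeds forming three loose groups.
rng = np.random.default_rng(3)
speeds = np.concatenate([rng.normal(4.0, 0.5, 50),
                         rng.normal(10.0, 0.5, 50),
                         rng.normal(16.0, 0.5, 50)])
centroids, labels = kmeans_1d(speeds, k=3)
print("centroids:", centroids)
print("knots:", knots_from_centroids(centroids))
```

Because the result depends on the starting centroids, calling `kmeans_1d` with several values of `seed` and keeping the run with the smallest within-cluster distance is one way to follow the advice about multiple runs.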
Empirical investigation with real data
To evaluate the performance of all the algorithms discussed in the above sections, an empirical investigation is conducted. The process of implementing the empirical power curve model to estimate the output power of the wind turbine is described in different phases, as shown in Figure 1.

Figure 1. Modeling structure to estimate wind turbine power output.
The experiment is carried out with real data from North America’s wind farms. For this task, two datasets are taken from the resource files of NREL (2012), which specializes in renewable energy research, development, and efficiency. Dataset-A and Dataset-B correspond to site-id 124693 and site-id 126541, respectively. The geographical location of site A is longitude: −120.005463 and latitude: 46.901657. Site B has longitude: −123.375778 and latitude: 48.64072. All the observations are obtained from the wind plant’s SCADA (Supervisory Control and Data Acquisition) system. In addition, the descriptive statistics of the wind characteristics of both sites are given in Table 1.
Table 1. Wind characteristics of dataset sites used in modeling.
These datasets are first validated for inconsistent data combinations, out-of-range values, and missing values. Subsequently, they are preprocessed to create a training set for empirical modeling. In order to preprocess the raw data, an outlier detection method similar to that of Warren et al. (2011) has been applied. Once both datasets are cleaned, they can be used to assess the performance of the proposed empirical models. After preprocessing, in order to implement splines, the power curve knots are evaluated by the PSO, half-split, and clustering methods. The PSO parameters follow a well-known benchmark setting from Bartz-Beielstein et al. (2004), which is given in Table 2.
Table 2. Settings of the parameters of PSO algorithm.
In the experiment, as applied in Zhang and Yang (2015), the cross-validation approach of data sampling has been used, which helps to approximate the model’s prediction error for unseen data. A fivefold cross-validation approach has been used by splitting the dataset into five groups of roughly equal size. Initially, the first group of observations is considered as the validation set, and the remaining groups are treated as the training set. The training set observations are used to develop the model, and subsequently, the prediction error of the developed model is calculated using the validation set observations. Next, the second group of observations is considered as the validation set, and the prediction error is computed for the model developed on the new training observations. This process is repeated until the last group of observations has been used as the validation set.
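The fivefold procedure can be sketched as follows. Shuffling the observations before splitting, the cubic polynomial used as the example model, and the function names are assumptions made for illustration; any of the regression models discussed in this article could be passed in through `fit` and `predict`.

```python
import numpy as np

def kfold_indices(n, k=5, seed=0):
    """Shuffle 0..n-1 and split into k groups of roughly equal size."""
    rng = np.random.default_rng(seed)
    return np.array_split(rng.permutation(n), k)

def cross_validate(v, p, fit, predict, k=5, seed=0):
    """Return the mean validation RMSE and the per-fold RMSE values."""
    folds = kfold_indices(len(v), k, seed)
    errors = []
    for i, val_idx in enumerate(folds):
        # All groups except the i-th form the training set.
        train_idx = np.concatenate([f for j, f in enumerate(folds) if j != i])
        model = fit(v[train_idx], p[train_idx])
        resid = p[val_idx] - predict(model, v[val_idx])
        errors.append(float(np.sqrt(np.mean(resid ** 2))))
    return float(np.mean(errors)), errors

# Example model: cubic polynomial fitted by least squares.
def fit_cubic(v, p):
    return np.polyfit(v, p, 3)

def predict_cubic(coeffs, v):
    return np.polyval(coeffs, v)

# Made-up observations.
rng = np.random.default_rng(0)
v = rng.uniform(3.0, 15.0, size=250)
p = 0.5 * v**3 - 2.0 * v**2 + rng.normal(scale=5.0, size=v.size)

mean_rmse, fold_rmse = cross_validate(v, p, fit_cubic, predict_cubic)
print("mean RMSE:", mean_rmse)
```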
The developed model characterizes the pattern of the actual data. Evaluating the model’s ability to generalize is important. Hence, measures are required that evaluate the model based on its goodness of fit. The goodness of fit of the model can be assessed by a loss function that measures the difference between the actual and the predicted values. Root mean square error (RMSE) and mean absolute error (MAE) have been used as loss functions for computing the error between the actual output
where for a given wind speed vi,
Here,
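For reference, the error measures can be computed as in the sketch below, using the standard definitions of RMSE, MAE, and the R2 score (which is also used later when the results are compared). The function names and the small example arrays are illustrative assumptions.

```python
import numpy as np

def rmse(actual, predicted):
    """Root mean square error."""
    a, f = np.asarray(actual, float), np.asarray(predicted, float)
    return float(np.sqrt(np.mean((a - f) ** 2)))

def mae(actual, predicted):
    """Mean absolute error."""
    a, f = np.asarray(actual, float), np.asarray(predicted, float)
    return float(np.mean(np.abs(a - f)))

def r2_score(actual, predicted):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    a, f = np.asarray(actual, float), np.asarray(predicted, float)
    ss_res = np.sum((a - f) ** 2)
    ss_tot = np.sum((a - a.mean()) ** 2)
    return float(1.0 - ss_res / ss_tot)

# Made-up actual and predicted power values.
actual = [1.0, 2.0, 3.0, 4.0]
predicted = [1.0, 2.0, 3.0, 6.0]
print(rmse(actual, predicted), mae(actual, predicted), r2_score(actual, predicted))
```

RMSE penalizes large individual errors more heavily than MAE, which is why the two can rank models differently when a few observations are badly predicted.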
The performance of the proposed methodology has been validated in several ways. In the simulation, empirical power curves for Dataset-A and Dataset-B have been developed using the polynomial regression, splines regression, and smoothing splines methods. Because cubic splines are generally flexible enough to fit the patterns in such data, a third-degree polynomial has been used in all the models. For Dataset-A, the respective plot of each model is shown in Figure 2. From the figure, it can be seen that the empirical power curve model obtained with the third-degree polynomial under-fits the data observations, which may lead to an overestimation of output power, whereas the plots obtained through splines regression show a sufficient fit. However, before the cut-in speed and after the rated speed, the splines model appears erratic. Finally, it is observed that the smoothing spline fits the data observations well in all regions.

Figure 2. Empirical power curve fit for Dataset-A (a) with the polynomial regression, (b) with the splines regression, and (c) with the smoothing splines regression.
Later, the average error obtained in each stage is used to calculate the mean prediction error of the developed model. The prediction error is measured with the MAE, RMSE, and R² score, which are computed for each developed model on both datasets. A comparative analysis of all these results is shown in Table 3. From the table, it can be seen that the splines model performs better than the polynomial model, but the improvement is not sufficient. The improvement arises because fitting lower degree polynomials in distinct regions improves the performance, whereas the abnormal behavior of the spline around the knots gives inaccurate results for that range of values. The smoothing splines estimator, however, performs better, as is evident from the results for both datasets.
Table 3. Comparison of prediction errors for different regression techniques on both the datasets.
RMSE: root mean square error; MAE: mean absolute error.
Furthermore, in order to improve the performance of splines regression, three knot selection methods have been applied. The positions of the knots obtained using the PSO, half-split, and clustering methods for both datasets are shown in Table 4. For Dataset-A, the plots of splines regression using the PSO, half-split, and clustering methods are shown in Figure 3(a), (b), and (c), respectively. Here, it can be observed that the plots obtained from splines with PSO and half-split are less over-fitted, while splines with clustering do not show any significant improvement.
Table 4. Set of knot positions for both the datasets using particle swarm optimization, half-split, and clustering methods.
PSO: particle swarm optimization.

Figure 3. Empirical power curve fit for Dataset-A with the splines regression using (a) particle swarm optimization knot selection method, (b) half-split knot selection method, and (c) K-means clustering knot selection method.
Furthermore, a comparative analysis of the knot selection methods in terms of prediction errors, that is, MAE, RMSE, and R² score, for both datasets is shown in Table 5. From the table, it can be observed that the PSO and half-split methods of knot selection improve the performance of the splines regression method, whereas in the case of splines with clustering, the difference is minor.
Table 5. Comparison of prediction errors for splines regression with different knot selection methods on both the datasets.
RMSE: root mean square error; MAE: mean absolute error; PSO: particle swarm optimization.
While investigating the performance of the proposed power curves, it has been observed that inconsistent data observations in the high wind speed interval are not effectively filtered by the preprocessing of the data. Polynomial models show considerable variance in the error obtained at higher and lower wind speeds, whereas splines fit the data observations in segments while keeping the degree fixed. Therefore, to some extent, splines can filter the influence of inconsistent data observations. If the segments are selected appropriately, the smoothness can be further improved. On the other hand, the smoothing splines approach can directly apply the penalty to inconsistent data observations in any region between the cut-in and cut-off wind speeds. Overall, the comparison of different modeling methods on the same set of wind speed-power data shows that the prediction error of smoothing splines tends to be very low, while splines show sufficient accuracy if an appropriate knot selection method is employed.
Conclusion
This article explored various statistical approaches for developing the site-specific power curve of a wind turbine. The polynomial regression-based power curve is implemented to achieve goodness of fit on the NREL dataset. These models are relatively simple to realize; however, they have considerable limitations in terms of prediction accuracy. Splines power curves are more flexible than polynomials because, instead of using a higher degree to produce a flexible fit, splines are fitted in segments while the degree is kept fixed. The spline is a continuous function on the bounded interval, which may be too flexible at the boundary regions. Two strategies have been suggested to remedy this problem. In the first strategy, the optimal number of boundaries and their positions are identified based on the variability present in the observational data. Here, the PSO and half-split methods of knot selection improved the results, whereas clustering has shown only a minor improvement. In the second strategy, a roughness penalty is applied to the loss function to improve the generalization capability of splines. The results show that smoothing splines outperform all the other methods.
Footnotes
Declaration of conflicting interests
The author(s) declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.
Funding
The author(s) received no financial support for the research, authorship, and/or publication of this article.
