Abstract
Energy communities can support the energy transition by engaging citizens through collective energy actions and by generating positive economic, social and environmental outcomes. Renewable Energy Sources (RES) are gaining an increasing share of the electricity mix as the economy decarbonises, with Photovoltaic (PV) plants becoming more efficient and affordable. By incorporating Artificial Intelligence (AI) techniques, innovative applications can be developed to provide added value to energy communities. In this context, the scope of this paper is to compare Machine Learning (ML) and Deep Learning (DL) algorithms for the prediction of short-term production in a solar plant operated by an energy cooperative. Three different cases are considered, based on the data used as inputs for forecasting purposes. Lagged inputs are used to assess how much historical data is needed, and the algorithms’ accuracy is tested for the next hour’s PV production forecast. The comparative analysis identifies the most accurate algorithm in each case, depending on the available data. For the highest performing algorithm, accuracy over longer forecasting horizons (3 hours, 6 hours and 24 hours) is also tested.
Introduction
As consumer-empowerment and community-driven initiatives, energy communities can play a key role in social innovation, as they reflect a fundamental shift in citizens’ behaviour and in their role as consumers [1]. Engaging citizens through collective energy actions can reinforce positive social norms and support the energy transition. In this direction, the Clean Energy Package of the European Commission (EC) recognises ‘Citizen Energy Communities’ and ‘Renewable Energy Communities’ and offers an enabling legislative framework for them [2].
By 2030, the EU will have to increase the share of renewables in the energy supply to 32% and, in order to reach this binding target, an explicit role for citizens and communities is foreseen [3, 4]. This is an important step towards ‘energy democracy’, as it not only acknowledges the role of democratically controlled communities in the energy transition, but will also help European citizens to set up their own renewable energy projects and protect them from the big players of the energy market. Successful renewable energy cooperatives generate positive economic, social and environmental outcomes while accelerating the social and psychological dimensions of the global transition towards clean energy sources [5].
One major source of renewable energy is solar power harvested by PV plants. The technology is becoming more widely used and, year on year, PVs make up a bigger part of the energy mix in the European Union (EU). In 2018, the EU output of PV electricity reached 127 TWh, amounting to 3.9% of the EU’s gross electricity output [6]. Over the coming decade, continued growth is foreseen, mostly driven by increased self-consumption and more rooftop PV installations as a path towards a post-lignite era [7].
In order to maximise efficiency and optimise production for energy supply and demand [8], Machine Learning (ML) and Deep Learning (DL) algorithms are being used in several applications of Artificial Intelligence (AI). Predicting energy production values plays an important role in PV applications, for both short-term and long-term forecasting horizons, as prediction accuracy is the main factor in applications such as fault prediction and anomaly detection, load balancing and performance monitoring of the system.
The applicability of AI applications to PV plants varies, depending on the specific solutions required and the resources available for data-driven solutions. AI applications for PV plants in energy communities, from the simplest to the more advanced, require a significant amount of infrastructure, interconnections and expertise in order to perform efficiently and deliver added value. Relevant infrastructure and sensor-based systems are of major importance for PV plants in order to monitor energy efficiency [9], optimise supply and demand, and perform energy management in general [10]. However, the technological maturity and the available infrastructure in energy communities vary, depending on the activities of the community, its funding capabilities and its participants’ needs. This poses an issue when data must be gathered in order to build data-driven applications that can enhance the PV plant’s energy efficiency. Hence, the solutions and services for energy communities may be fragmented, in need of redesign and not easily replicable. More specifically, the availability of smart meters among consumers, the types of sensors (e.g., temperature, humidity, etc.) installed on the PV plants’ sites, their communication interfaces, and their impact on the implementation of AI applications are factors to consider when developing AI applications for PV plants in energy communities.
Forecasting the future performance of a PV plant at several scales has been examined in the literature, by applying several algorithms and architectures in order to achieve high accuracy.
A data-driven approach that studies two different feature sets, based on the lagged power data and their descriptive statistics, and tests Neural Networks (NN), Support Vector Regression (SVR), Random Forests (RF), Long Short-Term Memory (LSTM) and Convolutional Neural Networks (CNN) has been proposed to forecast PV power production at aggregated regional level (farm level) [11].
A Deep Extreme Learning Machine (DELM), combined with Enhanced Colliding Bodies Optimisation (ECBO) and Variational Mode Decomposition (VMD), has also been proposed to forecast PV production up to 4 hours ahead, using historical PV production data and forecasted solar irradiance (obtained through numerical weather prediction) [12].
Integration of the smart persistence prediction algorithm, irradiance and historical production data using RFs has been proposed to predict PV power; the feature set combines historical data, predictions, averages and variances for training and validation of the algorithm, which improves the accuracy of short-term forecasts [13].
For forecasting the short-term power generated by PV panels, a deep Recurrent Neural Network (RNN) has also been proposed, utilising an on-site weather IoT dataset (solar irradiance, module temperature, ambient temperature, humidity and wind speed) and electrical data (DC current and voltage) [14]. Weather data (ambient temperature, atmospheric pressure, solar irradiance, wind speed and relative humidity) together with historical production measurements have also been used in a supervised sequence-to-sequence deep learning model with attention, to forecast PV power production [15].
An ensemble model of two LSTMs with an Attention Mechanism on the temperature and power time series has also been proposed to forecast short-term production using historical temperature and production data [16]. Discriminative deep models, including autoencoders, LSTM networks, Rectified Linear Units, and CNNs for spatiotemporal pattern recognition, have been examined to estimate PV energy production, given historical measurements [17]. In this approach, the PV production prediction, combined with dictionary learning, is also used for behind-the-meter energy disaggregation for residential and commercial customers.
The performance of several algorithms (LSTM, MLR, SVR, XGBoost, BNN and RT) has also been examined with several weather variables and the historical PV production to forecast the next hours’ [18] and the day-ahead [19] PV production.
Weather data and historical production data have also been used for advanced PV applications: predicting the output power of PVs and determining degradations/anomalies using an LSTM model [20], scheduling the maintenance operations of PV installations using Multiple Linear Regression (MLR) [21], anomaly detection in PV power production data based on a variational recurrent autoencoder with a variational Bi-LSTM [22], fault/status prediction based on self-organising maps (SOM) and key performance indicators [23], and a macro-level PV forecasting methodology to forecast short-term inverter-level PV production power, based on Feed Forward Neural Networks (FFNN), LSTM and Gated Recurrent Units (GRU) [24].
Many solutions and applications use complex algorithms and feature engineering, as well as extensive data inputs, to predict future performance. These approaches can lead to highly accurate results, but they could be difficult to apply in the case of energy communities, where data availability would constrain the level of analysis that is possible and could also impose significant technological and operational costs on the community. The scope of this paper is to compare the performance and accuracy of different algorithms in short-term forecasting of PV production, based on the data that an energy community can have available. The methodology consists of three different cases, which are separated according to data availability. Data inputs are gathered from an energy community’s PV plant and from meteorological stations, and consist of the historical PV production and weather data (solar irradiance, temperature, humidity, wind speed and cloud cover). Five algorithms are proposed: a simple MLR, two ML algorithms (SVR and XGBoost) and two NNs used for DL (LSTM, and CNN in a CNN-LSTM combination). They are trained on data from an operational PV plant to forecast the plant’s production for the next hour. For each case, the algorithms’ performance is analysed and compared in terms of accuracy. To this end, the ability to produce accurate results under different operational conditions of energy communities, based on data availability, is demonstrated and the most suitable ML/DL algorithms are applied. Then, the highest performing algorithm is used to make forecasts over different time horizons and its accuracy is tested.
Apart from the introduction, the paper is structured in four sections. The second section provides the methodology description and the models’ architecture. The results from the application of the selected ML/DL algorithms are summarised in the third section and discussed in the fourth section. Finally, the last section presents the conclusions and the next steps.
Methodology
Methodological framework
The methodological framework is based on a six-step approach and starts with the data gathering process. An energy cooperative’s historical PV production data are gathered from an online monitoring system. The weather data were retrieved from a website application with data from a local meteorological station [25] (temperature, humidity, cloud cover and wind speed), and from the Copernicus Atmosphere Data Store [26] (solar irradiance, i.e., global and diffuse radiation).
The second step separates the three cases based on the availability of data inputs. In Case 1, only the PV plant’s production output (PAC) is used as input to the algorithms, i.e., the signal itself. The goal in this case is to test the accuracy of the algorithms when no weather data are available. Case 2 takes into account the historical PV production data together with weather variables. The goal is to test the accuracy of the algorithms when the most statistically significant weather features, i.e., those that most affect the PV plant’s performance, are available. Case 3 uses as input the historical PV production data together with an extended set of weather variables. In this case it is assumed that an energy community has a fully operational sensor-based monitoring system, which can measure several types of variables, and can utilise a wealth of information to develop ML/DL applications. The scope is to test the accuracy of the forecasting models when all the weather variables can be used as input to perform advanced analytics.
The third step of the methodological framework is applying the algorithms to the specific cases. Lagged inputs, a time-series technique, are used in order to forecast the production over future horizons. The autocorrelation of the production signal is calculated to determine the lag that is required to make the forecasts.
Methodology overview.
Five different algorithms are considered to perform the forecasts: Long Short-Term Memory (LSTM) [27], Convolutional Neural Network (CNN) [28] in a CNN-LSTM combination, Support Vector Regression (SVR) [29], Multiple Linear Regression (MLR) [30] and XGBoost [31]. For Case 3, where extended variables are considered, a clustering of weather conditions is performed, based on the solar irradiance variable, in order to help achieve higher accuracy.
Step 4 produces the algorithms’ results when forecasting the next hour’s PV production. Step 5 compares the performance of all algorithms in each case for every lagged input. This identifies the most accurate algorithm in each scenario. Finally, for the highest performing algorithm, further forecast horizons are considered to test its accuracy. The forecast horizons are set to 3, 6 and 24 hours, following the same process with the lagged inputs. An iterative process is also tested to create the day-ahead forecast, in order to compare the two approaches. Figure 1 presents the methodological framework.
The available datasets cover a period of 30 months (August 2018–January 2021) of production data from a specific PV plant, with a one-hour interval between observations. The PV plant has a nominal capacity of 23.52 kWp and a peak capacity of 20 kWn. The quality of the production data is considered to be very high, as they derive from a monitoring platform through the sensor-based system, and the production observations are the direct output of the 4 DC/AC inverters installed. Weather data, consisting of several features, are gathered from a meteorological station near the solar plant, through the website, and from the Copernicus database. More specifically, temperature, humidity and wind speed are collected from the local meteorological station at one-hour intervals. Observations at one-hour intervals are also available from the Copernicus database, from which the global solar radiation and the diffuse solar radiation variables are collected. The quality of the weather data is also considered to be high, as the main driver of PV production, solar irradiance, is extracted from the Copernicus database [32, 33]. The weather features are used as inputs to the models, according to the methodology, depending on the case.
Pre-processing starts with the calculation of the autocorrelation of the production signal (Fig. 2), to assess the impact of different time lags and how they contribute to the PV output forecast. The seasonality of the signal is observed, as autocorrelation shows the degree of similarity between a given time series and a lagged version of itself over successive time intervals.
Autocorrelation plot of PV production signal.
The autocorrelation of the signal is high at consecutive time steps, which shows that the previous values of the production signal affect the value at the current time step. This can be explained by the fact that the intervals are one hour long and the weather conditions that affect PV performance do not change significantly from hour to hour. A day-by-day seasonality is also observed, which can likewise be explained by the weather conditions and the state of the PVs, as the PV modules perform similarly given similar weather conditions during the day. Based on the autocorrelation and the time-series forecasting approach used for the three cases, the lags chosen are 1 hour, 5 hours and 10 hours.
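The lag analysis described above can be illustrated with a minimal sketch that computes the sample autocorrelation at the chosen lags. The synthetic 24-hour signal and the function name `autocorrelation` are assumptions made for illustration only; they are not the plant's data or the authors' code.

```python
from math import sin, pi


def autocorrelation(series, lag):
    """Sample autocorrelation of `series` at a non-negative integer `lag`."""
    n = len(series)
    if not 0 <= lag < n:
        raise ValueError("lag must satisfy 0 <= lag < len(series)")
    mean = sum(series) / n
    variance = sum((x - mean) ** 2 for x in series)
    if variance == 0:
        raise ValueError("autocorrelation is undefined for a constant series")
    covariance = sum(
        (series[t] - mean) * (series[t + lag] - mean) for t in range(n - lag)
    )
    return covariance / variance


# Synthetic hourly signal (assumption): a daily cycle that is zero at "night",
# repeated for 14 days. It only mimics the shape of a PV production curve.
signal = [max(0.0, sin(2 * pi * (h % 24 - 6) / 24)) for h in range(24 * 14)]

# Autocorrelation at the lags discussed in the text, plus the 24-hour lag.
acf = {lag: autocorrelation(signal, lag) for lag in (1, 5, 10, 24)}
```

On such a signal the value at lag 1 is close to 1 and a second peak appears at lag 24, which is the qualitative pattern (hour-to-hour persistence plus daily seasonality) that the paragraph describes.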
In Case 1, the production data from the energy community’s PV plant are solely used as the methodology suggests. The input consists of the production signal with time lags of 1, 5 and 10 hours.
For Case 2, weather data are used along with the production data. In order to assess which of the available weather data should be used as inputs, a correlation analysis has been conducted. The most statistically relevant parameters are solar irradiance, temperature and humidity, which are used as inputs from the weather datasets. Table 1 presents the Pearson and Spearman correlation coefficients of these variables with the PV production, to capture both linear and non-linear (monotonic) correlations.
Pearson and Spearman correlations between PV production and solar irradiance, temperature and humidity
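As a hedged illustration of the correlation analysis, the sketch below computes both coefficients using only the standard library. The sample values and the function names (`pearson`, `ranks`, `spearman`) are invented for illustration and are not taken from the study's data.

```python
from statistics import mean


def pearson(x, y):
    """Pearson correlation coefficient (linear association)."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sum((a - mx) ** 2 for a in x) ** 0.5
    sy = sum((b - my) ** 2 for b in y) ** 0.5
    if sx == 0 or sy == 0:
        raise ValueError("correlation is undefined for a constant series")
    return cov / (sx * sy)


def ranks(values):
    """1-based ranks; tied values share the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = average_rank
        i = j + 1
    return result


def spearman(x, y):
    """Spearman correlation: Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))


# Hypothetical hourly samples: production rises with irradiance but saturates.
irradiance = [0.0, 50.0, 200.0, 450.0, 700.0, 900.0]
production = [0.0, 0.4, 2.5, 7.0, 12.0, 13.0]

r_pearson = pearson(irradiance, production)
r_spearman = spearman(irradiance, production)
```

Because the hypothetical relationship is monotonic but not perfectly linear, the Spearman coefficient equals 1 while the Pearson coefficient is slightly lower, which is why reporting both can be informative.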
For Case 3, all weather-related variables from the website and the Copernicus database are considered. In this way, the case of an energy community with a fully operational sensor-based system and extensive deployed infrastructure is represented, and the algorithms’ accuracy is assessed under these conditions. This approach makes it possible to test more advanced ML and DL models and can possibly form the basis for more demanding energy-related applications (e.g., maintenance actions, energy matching, load shifting, etc.). To this end, a clustering of weather conditions is also performed using the K-means algorithm. Since the solar irradiance data are in a time-series format, Dynamic Time Warping (DTW) was used in order to apply the K-means algorithm to the data. The result is five clusters, to which each data point is assigned based on the value of solar irradiance at the specific timestamp. The five clusters of sky conditions derived from DTW-K-means are one-hot encoded in order to feed them to the algorithms.
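The two building blocks named above, a DTW distance and a one-hot encoding of cluster labels, can be sketched as follows. This is not the full DTW-K-means procedure used in the study: the centroids are hypothetical (scaled copies of an invented clear-day profile), and the helper names `dtw_distance`, `nearest_centroid` and `one_hot` are assumptions.

```python
def dtw_distance(a, b):
    """Classic dynamic-programming DTW with absolute difference as local cost."""
    inf = float("inf")
    n, m = len(a), len(b)
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            local = abs(a[i - 1] - b[j - 1])
            cost[i][j] = local + min(
                cost[i - 1][j], cost[i][j - 1], cost[i - 1][j - 1]
            )
    return cost[n][m]


def nearest_centroid(profile, centroids):
    """Index of the centroid with the smallest DTW distance to `profile`."""
    return min(
        range(len(centroids)), key=lambda k: dtw_distance(profile, centroids[k])
    )


def one_hot(label, n_clusters=5):
    """Encode a cluster label as a 0/1 vector of length `n_clusters`."""
    if not 0 <= label < n_clusters:
        raise ValueError("label out of range")
    return [1 if k == label else 0 for k in range(n_clusters)]


# Hypothetical irradiance profiles (arbitrary units).
clear_day = [0, 2, 5, 8, 5, 2, 0]
shifted_day = [0, 0, 2, 5, 8, 5, 2]  # same shape, one step later

# Five hypothetical centroids: the clear-day shape at 0%, 25%, ..., 100%.
centroids = [[v * k / 4 for v in clear_day] for k in range(5)]

label = nearest_centroid(shifted_day, centroids)
encoded = one_hot(label)
```

DTW tolerates the one-step shift between the two profiles (distance 2, against a point-by-point absolute difference of 16), which is the motivation for using it instead of a pointwise distance when clustering time series.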
Algorithms and configuration
The datasets are split 80/20, meaning that 80% of the data are used to train the models and 20% are held out as unseen data to evaluate and consequently validate the algorithms’ performance. Since the data span 30 months starting in August, the testing set consists of 6 months of data, covering one month of summer (August), the full autumn season and two months of winter (December, January). Min/max scaling and splitting of the series into sequences for forecasting (lag, horizon) have also been used.
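The preparation steps above (chronological split, min/max scaling, and splitting into lag/horizon sequences) can be sketched as follows. The function names and the toy series are assumptions, and fitting the scaler on the training part only is a common practice assumed here rather than a detail stated in the text.

```python
def chronological_split(series, train_fraction=0.8):
    """Split without shuffling, so the test set is the most recent data."""
    cut = int(len(series) * train_fraction)
    return series[:cut], series[cut:]


def fit_min_max(train):
    """Return the (min, max) of the training data for later scaling."""
    lo, hi = min(train), max(train)
    if hi == lo:
        raise ValueError("cannot scale a constant series")
    return lo, hi


def scale(series, lo, hi):
    """Min/max scaling using previously fitted bounds."""
    return [(x - lo) / (hi - lo) for x in series]


def split_sequences(series, lag, horizon):
    """Build (inputs, targets): `lag` past values -> next `horizon` values."""
    inputs, targets = [], []
    for start in range(len(series) - lag - horizon + 1):
        inputs.append(series[start:start + lag])
        targets.append(series[start + lag:start + lag + horizon])
    return inputs, targets


# Toy hourly production values in kWh (illustrative only).
production = [0.0, 1.0, 4.0, 9.0, 12.0, 10.0, 6.0, 2.0, 0.0, 0.0]

train, test = chronological_split(production)
lo, hi = fit_min_max(train)
X, y = split_sequences(scale(train, lo, hi), lag=5, horizon=1)
```

With `lag=5` and `horizon=1`, each row of `X` holds five consecutive scaled observations and the matching entry of `y` holds the following hour; setting `horizon` to 3, 6 or 24 yields the multi-step targets in the same way.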
Five different algorithms were tested to perform the forecasts as the methodology suggests, LSTM, CNN-LSTM combination, SVR, MLR and XGBoost. Table 2 includes a short description of each algorithm and the values of the parameters/hyperparameters used. The values of the parameters/hyperparameters were chosen in order to optimise the learning process after fine tuning and use the most efficient configuration.
The forecasting horizon is set to the next interval, i.e., the next hour’s PV production. The interval was chosen as only historical data are used (as lagged inputs) to the models and the interval of the observations is of one hour. The models’ output is the forecasted hourly PAC production in kWh.
For the highest performing algorithm, further forecast horizons are calculated following the same approach with the lagged inputs, but instead of producing only the next hour’s PV production forecast (one step ahead), the model produces the next 3, 6 and 24 hours, respectively, as a multi-step model. In addition to this multi-step model, an iterative process has also been created for the day-ahead PV output of the highest performing algorithm. In this iterative process, each forecasted time step is fed back as input to the model in order to predict the following time step, thus building up the 24-hour forecast.
This iterative process, in a real time operational situation, would require the day ahead weather forecast to be used as input. In this study, where historical data are used as training and testing data for the models and the measurements are known, the error of the weather forecast is not incorporated into the models’ accuracy.
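The iterative process can be sketched as a loop that feeds each one-step forecast back into the input window. The sketch uses production lags only and a trivial stand-in model (`mean_of_window`); both are assumptions for illustration. In the study the one-step model is the trained algorithm, and weather inputs would enter each step as well.

```python
def iterative_forecast(one_step_model, history, lag, steps):
    """Roll a one-step-ahead model forward `steps` times.

    `one_step_model` is any callable mapping a list of `lag` past values
    to the next value. `history` is not modified.
    """
    if len(history) < lag:
        raise ValueError("history must contain at least `lag` values")
    window = list(history[-lag:])
    forecasts = []
    for _ in range(steps):
        next_value = one_step_model(window)
        forecasts.append(next_value)
        # Drop the oldest value and append the forecast as the newest input.
        window = window[1:] + [next_value]
    return forecasts


def mean_of_window(window):
    """Hypothetical stand-in for a trained one-step model."""
    return sum(window) / len(window)


history = [2.0, 4.0, 6.0]
forecast = iterative_forecast(mean_of_window, history, lag=3, steps=3)
```

Calling the function with `steps=24` on hourly data would produce a day-ahead forecast in this manner; forecast errors can accumulate across steps because later inputs are themselves predictions.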
Algorithms’ evaluation
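To assess the algorithms’ forecasting performance, three statistical metrics are calculated: the Root Mean Square Error (RMSE), the Mean Absolute Error (MAE) and the Mean Absolute Percentage Error (MAPE). In their standard form, with $y_i$ the actual value, $\hat{y}_i$ the forecasted value and $n$ the number of observations, they are defined as:
$$\mathrm{RMSE} = \sqrt{\frac{1}{n}\sum_{i=1}^{n}\left(y_i - \hat{y}_i\right)^2}$$
$$\mathrm{MAE} = \frac{1}{n}\sum_{i=1}^{n}\left|y_i - \hat{y}_i\right|$$
$$\mathrm{MAPE} = \frac{100\%}{n}\sum_{i=1}^{n}\left|\frac{y_i - \hat{y}_i}{y_i}\right|$$
MAPE is a standard metric of forecasting accuracy that expresses the error relative to the actual values. As a percentage error, MAPE cannot be computed for zero actual values, because this would involve a division by zero. In the case of PV modules this makes it a useful metric, as it disregards the night hours, where the PV production is zero, and focuses on the production hours.
Results, algorithms’ performance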
Table 3 presents the performance of each algorithm for the next hour’s PV production in each of the three cases.
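The three evaluation metrics can be computed as in the sketch below, which uses their standard definitions. Skipping observations with zero actual production in the MAPE follows the description of the metric in the text; the function names and sample values are assumptions for illustration.

```python
from math import sqrt


def _check(actual, predicted):
    if len(actual) != len(predicted) or not actual:
        raise ValueError("inputs must be non-empty and of equal length")


def rmse(actual, predicted):
    """Root Mean Square Error."""
    _check(actual, predicted)
    return sqrt(sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual))


def mae(actual, predicted):
    """Mean Absolute Error."""
    _check(actual, predicted)
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)


def mape(actual, predicted):
    """Mean Absolute Percentage Error over non-zero actual values only."""
    _check(actual, predicted)
    pairs = [(a, p) for a, p in zip(actual, predicted) if a != 0]
    if not pairs:
        raise ValueError("MAPE is undefined when all actual values are zero")
    return 100.0 * sum(abs((a - p) / a) for a, p in pairs) / len(pairs)


# Hypothetical hourly values in kWh; the first two hours are "night".
actual = [0.0, 0.0, 4.0, 10.0, 8.0]
predicted = [0.0, 1.0, 5.0, 9.0, 8.0]

scores = {
    "RMSE": rmse(actual, predicted),
    "MAE": mae(actual, predicted),
    "MAPE": mape(actual, predicted),
}
```

In this toy example the forecast error during a zero-production hour contributes to RMSE and MAE but not to MAPE, so the three metrics are best read together.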
For Case 1, where only historical production data are considered, LSTM has the highest performance. As expected, the more lags used as inputs the higher the accuracy, with
Figure 3 presents the actual and the predicted results of LSTM algorithm for lagged input of 10 hours.
Predicted and actual values (LSTM, Case 1, lag t-10, horizon t+1).
Predicted and actual values (XGBoost, Case 2, lag t-5, horizon t+1).
XGBoost horizons t+1, t+3, t+6, t+24 metrics’ evaluation
Predicted and actual values (XGBoost, Case 3, lag t-10, horizon t+1).
In Case 2, where weather data are considered, all algorithms perform significantly better. The best algorithm in each scenario is XGBoost, with an
In Fig. 4, the predicted and the actual values are plotted for the XGBoost algorithm, for a 5-hour lagged input and a forecasting horizon of 1 hour.
For Case 3, where extended weather information and several additional techniques are used in order to achieve better accuracy, the algorithm with the best performance is XGBoost, with a
All algorithms performed with high accuracy in this approach, demonstrating their ability to predict the next hour’s PV output when data availability is high. The configuration of the data inputs in Case 3, together with the extended weather dataset, led to more accurate results. LSTM and CNN-LSTM follow in performance, performing slightly worse than XGBoost, with
Figure 5 shows the actual and the predicted values for XGBoost for lagged inputs of 10 hours.
After identifying the highest performing algorithm (XGBoost), its behaviour for longer prediction horizons was tested. Since the multi-step forecasting follows the same approach as the one-step forecasting, XGBoost remains more accurate than the other proposed algorithms. To this end, the highest performing algorithm across all cases and all lagged-input scenarios is tested for the 3-hour, 6-hour and 24-hour forecasts as well. Table 4 presents the results.
Predicted and actual values (XGBoost, lag t-24, horizon t+3).
Predicted and actual values (XGBoost, lag t-24, horizon t+6).
Predicted and actual values (XGBoost, day ahead forecast, 6 months data).
Predicted and actual values (XGBoost, day ahead forecast, autumn season).
Predicted and actual values (XGBoost, day ahead forecast, summer season).
As only historical inputs are considered, the accuracy decreases as further time steps are forecasted. A 24-hour lagged input is also used in order to test whether the algorithm can achieve better performance in the multi-step forecast. The results show that the forecast of the next time steps is better with 24-hour lagged inputs, with an
The day-ahead forecast of XGBoost is not very accurate, because the PV production depends on the weather measurements and in this approach only historical data are used. To this end, the iterative process has been created for XGBoost, in order to test a different approach that uses the highly accurate next-hour forecasting results to predict production over longer horizons.
The
The overall MAPE is 14.58%: 19.036% for the winter season, 14.30% for the autumn season and 5.94% for the summer season. These results demonstrate that with the iterative process the day-ahead production (and subsequently all horizons up to 24 hours) can be accurately forecasted, especially when the weather conditions are stable.
Based on the results presented in the previous section, it is observed that highly accurate forecasting of future PV performance is possible over short-term forecasting horizons.
When only the historical production data are used as input, the best algorithm (LSTM) performs reasonably well. When weather information is considered as input, the highest performing algorithm was found to be XGBoost. The approach followed in this study requires very few previous observations in order to make a highly accurate forecast for the next hour. In this way, the efficiency of the approach is demonstrated in an energy community environment, where the availability of data is often limited. With extended weather data available and the techniques used, all algorithms achieve high accuracy. XGBoost is the best algorithm in Case 3 too, but it was observed that the suggested DL algorithms perform with similar accuracy. This can be explained by the fact that DL algorithms need a large amount of data in order to be efficient. If more data become available (e.g., if the time intervals are shorter than an hour), the accuracy of the DL algorithms would be expected to improve.
For further forecasting horizons, the future PV production can also be calculated with good accuracy. However, since only historical data are considered in all three cases, the iterative process performs better than the simple multi-step forecast. The iterative process that was considered for predicting the day-ahead PV production produces highly accurate results on a daily basis, but in order to perform day-ahead forecasting in an operational PV plant, future weather data are needed.
More generally, the methodology followed demonstrates how an energy community can use its available data in data-driven applications and produce accurate forecasts of future PV production up to 24 hours ahead.
Predictions using only historical production data can be used for monitoring purposes in an energy community, as they can provide a quick estimate of future performance without any additional infrastructure. More accurate forecasts, which incorporate weather information, can be used in energy trading. The hourly forecasts can be utilised so that energy communities can optimally choose when to sell the excess energy they produce. Grid operations can also benefit from the proposed algorithms.
Predicting the energy supply from PV systems for the next 3 to 6 hours can help operators to satisfy the electricity demand and perform peak shaving. Flexibility applications, such as positive and negative balancing energy provision for grid stability, can also make use of forecasts a few hours ahead. On the consumer side of the energy community, short-term forecasts of PV production can help increase self-consumption and optimise the scheduling of energy flows inside the community. Finally, maintenance actions for PVs can also be planned using the day-ahead forecast. The forecast can be used as an indicator of malfunction in the PV plants of energy communities, considering that the energy sources may be distributed and not easily accessible for identifying possible errors.
Conclusions
It is evident that energy communities will play a pivotal role in the energy transition. In order to fully utilise their capabilities and address their possible limitations when deploying data-driven approaches, different considerations should be made depending on their available infrastructure. AI applications that can enhance PV production and energy efficiency require high-quality data to produce meaningful results and provide added value. Thus, the operational status of the energy community should be examined, in order to provide services that can offer reliable results.
In this study, the energy production of PV plants is forecasted based only on previous performance and previous weather data. Three different cases have been studied depending on data availability. Case 1 takes into consideration only the historical production data, Case 2 incorporates the most statistically significant weather variables that affect PV performance, and Case 3 includes several weather variables. Five ML/DL algorithms are used for forecasting purposes: LSTM, a CNN-LSTM combination, SVR, MLR and XGBoost. It is observed that all the algorithms perform well, in different combinations of lagged inputs, for all three cases using only historical data. More specifically, the next hour’s production can be forecasted with high accuracy, while the forecasted values for up to 24 hours ahead also achieve high accuracy. The performance of the algorithms can be expected to be higher if measurements became available from on-site sensors, as opposed to this approach, where weather data were retrieved from meteorological stations.
It is observed that, depending on the available infrastructure and with suitable algorithms, the power production of PVs can be forecasted over several short-term time horizons. Thus, PV maintenance, energy matching, grid operations, demand management and supply scheduling for energy communities can be carried out efficiently, laying the basis for more complex applications that require accurate short-term predictions, such as predictive maintenance or energy trading.
Next steps for extending the results would include testing the algorithms’ performance on datasets with shorter time intervals and more data. Shorter intervals could make the DL algorithms perform better, as the NNs would be able to capture the relationships in the data better. The use of measurements coming from on-site sensors should also be considered, as it would provide more accurate weather information. A further next step would be to test the algorithms on other PV plants in different locations. In that way, the replicability of the algorithms’ performance could be tested. Transfer learning techniques, based on the models proposed in this paper, can also be used when data availability is limited.
Finally, the main issue in forecasting PV production is the dependency on the weather forecast. Very short-term forecasting can overcome this issue, as the weather can be considered stable, but for longer horizons the PV production forecast will rely heavily on the weather forecast. A possible next step could be to examine an interval forecast, where the forecasted power is given as a range instead of an exact value.
Acknowledgments
The work presented is based on research conducted within the framework of the project “Modular Big Data Applications for Holistic Energy Services in Buildings (MATRYCS)”, of the European Union’s Horizon 2020 research and innovation programme under grant agreement no. 1010000158 (
