Abstract
A model to predict air pollutants' concentrations was developed by combining spectral decomposition of time series data, obtained with the Kolmogorov–Zurbenko filter, with Artificial Neural Networks (ANNs). This model was used to separate and individually predict three spectral components of air pollutants' time series: short term, seasonal, and long term. The best set of input variables was selected by evaluating the significance of different input variables while modeling the different time series components. Moreover, different possible approaches for constructing such models were examined. The performance of the constructed model was assessed by predicting air pollutants' levels at a central location in Tehran, Iran, which is one of the most polluted cities in the world. The constructed model showed firm and reliable performance in modeling and predicting the two selected air pollutants, NOx and PM10. The R2 between predicted and observed values was ∼0.90 for most cases. The developed model performed better in modeling air pollutants than ordinary ANN models, especially in episodes of highly elevated pollution levels. Furthermore, this model provided the opportunity to separately predict pollutants' spectral components, such as baseline concentrations, which represent urban background levels. Predictions of baseline concentrations were also in good agreement with the observed data. Such modeling and prediction could help policymakers to oversee different trends in pollutants' fluctuations and make proper decisions to control the pollutants.
Introduction
Deterministic models require comprehensive knowledge of pollutants' formation and sources, and of the processes involved in their transportation, transformation, and fate (Seinfeld and Pandis, 2006; Carnevale et al., 2009; Hrust et al., 2009). Statistical models do not simulate such processes; instead, they are formed by analyzing the relation between recorded influential variables and pollutants' concentrations to derive statistical equations. Because of the complex physicochemical processes involved in the formation and emission of air pollutants, statistical models have been shown to be suitable choices (Hrust et al., 2009).
Several statistical models have been developed, including time series and spectral analysis, Multiple Linear Regression (MLR), support vector machines, Kalman filters, and Artificial Neural Networks (ANNs). Among these, ANNs have been used in several previous studies to model different air pollutants, with overall acceptable performance (Boznar et al., 1993; Gardner and Dorling, 1999; Karppinen et al., 2000a, 2000b; Perez et al., 2000; Podnar et al., 2002; Kukkonen et al., 2003; Hooyberghs et al., 2005; Brunelli et al., 2007; Arhami et al., 2013). A notable advantage of ANNs over some traditional statistical models, such as conventional MLR, is their superior ability to deal with data sets that include noise and nonlinear relations (Gardner and Dorling, 1999; Kolehmainen et al., 2001), which is generally the case for air pollutants' data. Considering the good performance of such networks and the nonlinear relation between pollutants and their influential predictors and meteorological variables (Gardner and Dorling, 1999), ANN was chosen as the statistical modeling tool in the current study.
Despite much advancement in modeling air pollutants with ANNs, more studies are still required to increase the accuracy and reliability of the models. Several studies have recommended improving prediction accuracy through noise reduction and rearrangement of data or explicit consideration of meteorological variables (Perez and Reyes, 2006; Arhami et al., 2013). Moreover, in air pollution modeling, as in most other environmental studies, prediction of elevated pollutant levels, also referred to as peak points, remains challenging.
Fluctuations in pollutant levels are influenced by variations in emission sources and atmospheric conditions, which usually follow daily, seasonal, and annual cycles. Because these trends can have a significant impact on pollution variations, modeling and incorporating them can improve ANN performance. Modeling these trends could also help in observing and examining the effectiveness of control strategies and policies on pollutants' trends, and consequently help policymakers develop proper plans and decisions. A hypothesis for improving such models' accuracy, which was adopted in this article, is to preprocess the data for the ANN with a time series filtering tool that separates the data into different temporal components. This approach could help in discerning variations in air quality data and in better incorporating the effects of variations in different influential parameters. Extracting temporal components of air pollutants also helps in better understanding the pollutants' trends and comparing them with trends in emission sources and meteorological variables, to find the best independent variables to use in ANN modeling.
Various filtering methods have been developed for decomposing time series data into temporal components, such as anomalies, wavelet transform, and the Kolmogorov–Zurbenko (KZ) filter. Among these, the KZ filter has been reliably used to separate air pollutant components, for example in studies of tropospheric ozone (Rao and Zurbenko, 1994; Eskridge et al., 1997; Milanchus et al., 1998; Hogrefe et al., 2003; Wise and Comrie, 2005) and PM10 data (Tchepel et al., 2010). The KZ filter can be applied with acceptable precision to data sets containing missing data, such as recorded air pollutant data. In the current article, the KZ filter was used to decompose pollutants' levels and meteorological variables, as described in the methods section.
Tehran, Iran, which was selected as our case study, has frequently faced periods of high air pollutant levels during recent years (Saadat et al., 2010; Givehchi et al., 2013). The main reasons for the air pollution problem are the huge vehicle fleet, which includes a considerable portion of highly polluting vehicles; congested traffic throughout the city; an insufficient public transportation system; some industries located close to the city; and the geographical condition of Tehran, which is surrounded by a high mountain chain on its downstream side (Askariyeh and Arhami, 2013). At present, the few actions implemented to face this crisis are shutting down schools or some governmental agencies, enforcing traffic limitations, and encouraging residents to stay home to reduce the traffic load. Air pollution, like many other environmental problems, is a complex issue influenced by several factors, and needs to be prevented by proper planning and legislation. Knowledge of pollutants' variation trends, especially their long-term baseline concentrations, is useful for policymakers and governments in establishing efficient regulations to prevent pollution crises. The ability to predict pollutants' levels more accurately can also help officials take better and more efficient countermeasures. Despite the significance of the air pollution problem in Tehran, such information is not available, and a reliable air pollution forecasting system has not been developed to predict pollutants' concentrations, particularly during critically polluted episodes.
In this study, a model was developed to predict pollutants' concentrations by combining a filtering tool that decomposes time series data with ANNs. This model uses the separated components of time series data, obtained through the KZ filter, to model air pollutants' concentrations. Concentrations of nitrogen oxides (NOx) and particulate matter less than 10 μm in aerodynamic diameter (PM10) recorded at one of the central urban stations in the polluted city of Tehran were used to evaluate the developed model's performance. Different plausible approaches for building such models were examined to determine the optimal method of using KZ-filtered data with ANNs. After the models were constructed, their accuracy and reliability in predicting pollutants' levels were evaluated and compared with those of an ordinary ANN model. Moreover, the importance of the input parameters in modeling the different time series components was evaluated. Finally, the long-term baseline concentrations were predicted as a novel application of the developed model.
Methods
Input and output data
Two criteria pollutants, NOx and PM10, were studied to examine the ability of the models to predict both gaseous and particulate pollutants. These pollutants are among the key pollutants of large cities and have frequently exceeded the allowable thresholds in Tehran in recent years. Both are affected by complex primary and secondary formation processes and by physicochemical and photochemical parameters (Seinfeld and Pandis, 2006). PM10 and NOx concentrations were obtained from Fatemi station, one of the central stations of the Tehran Air Quality Control Company (AQCC). This station is located in the central part of Tehran (45°45’N, 30°50’ E), is surrounded by main streets with usually congested traffic, and is affected by common urban sources. NOx concentrations were measured with a chemiluminescence NOx analyzer (Model AC31M, Environment SA), and PM10 mass concentrations were measured with a beta attenuation mass monitor (BAM, Model 1020; Met One Instruments, Inc.) operated at standard conditions.
Meteorological variables that potentially influence air pollutants' concentrations were used as predictors for modeling hourly pollutant concentrations. The meteorological variables used in this study were obtained from Mehrabad station, the nearest synoptic station to the studied air quality station, located at 35°41’N, 51°19’E and 1,190 m above mean sea level. Among all meteorological variables measured at the Mehrabad station, four parameters were chosen as the ANN's meteorological input variables, based on the criteria presented in our previous study (Arhami et al., 2013) for optimized model performance. The selected variables are Wind Speed (WS) in m/s, Wind Direction (WD) in degrees, Air Temperature (Temp) in °C, and Relative Humidity (RH) in percent. Two other variables, hour of the day (HOD) and month of the year (MOY), were also used as input parameters, representing the traffic pattern and variations of atmospheric conditions through the day and year. Moreover, for all the analyses in this study, seven separate ANNs were trained and used, one for each day of the week, to represent the different traffic patterns during the week. The significance of using HOD and MOY and of separating the weekdays was explained in detail in Arhami et al. (2013). The three input parameters with periodic rotation, HOD, MOY, and WD, were incorporated using Equations (1)–(3), respectively.
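The equations referred to above are not reproduced in this text. Purely as an illustration, a common way to feed periodic inputs to a network is to map them onto sine and cosine terms, so that the two ends of a cycle (for example, hour 23 and hour 0) lie close together. The sketch below assumes this sine/cosine form; it is not necessarily the transformation used in the study, and the function name `encode_cyclic` is hypothetical.

```python
import math

def encode_cyclic(value, period):
    """Map a periodic value onto the unit circle (assumed sine/cosine form)."""
    angle = 2.0 * math.pi * (value % period) / period
    return math.sin(angle), math.cos(angle)

# Illustrative inputs only.
hod_late = encode_cyclic(23, 24)    # hour of the day, 23:00
hod_early = encode_cyclic(0, 24)    # hour of the day, 00:00
moy = encode_cyclic(7, 12)          # month of the year
wd = encode_cyclic(350, 360)        # wind direction in degrees

# Distance between the two encoded hours: small, although 23 and 0 are far apart as raw numbers.
hour_gap = math.dist(hod_late, hod_early)
```

With this kind of encoding, each periodic variable contributes two network inputs instead of one.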
Spectral decomposition of time series data
The Kolmogorov–Zurbenko (KZ) filter used in this study is a low-pass filter that removes high-frequency variation from a time series. This filter was used to decompose the time series into long-term, seasonal, and short-term temporal components. The trend of each component reflects different temporal variation in pollutants' sources and conditions. The long-term trend generally reflects changes in the overall emission pattern, pollutant transport, climate conditions, implemented policies, and economic situation; the seasonal component mainly relates to the solar cycle; and the short-term component varies with sub-daily variations in weather conditions and in the activity of pollutant sources, such as traffic volume (Wise and Comrie, 2005). The long-term, seasonal, and short-term components were computed by varying the window length and the number of iterations of the KZ filter, as explained in previous studies (Rao and Zurbenko, 1994; Eskridge et al., 1997; Hogrefe et al., 2003; Wise and Comrie, 2005). Short-term variations contained cycles with period less than days and the long-term trend reflected cycles longer than a year (Eskridge et al., 1997; Wise and Comrie, 2005), and the seasonal window was in between. The baseline components of the pollutants, which are mainly their urban background concentrations, were obtained as the sum of the long-term and seasonal components (Hogrefe et al., 2003; Wise and Comrie, 2005). The KZ filter method is able to deal with time series containing some missing data, which fits the case of the pollutants of concern measured in Tehran (Eskridge et al., 1997; Wise and Comrie, 2005).
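As a minimal sketch of the decomposition described above, the KZ filter can be written as an iterated centered moving average that skips missing values. The window lengths and iteration counts used in the study are not stated in this text, so the values below are illustrative assumptions, as are the function names `kz_filter` and `decompose`. The sketch also assumes a common convention in the KZ literature: the baseline is a lightly smoothed series, the seasonal component is the baseline minus the long-term component, and the short-term component is the raw series minus the baseline.

```python
import numpy as np

def kz_filter(x, window, iterations):
    """Iterated centered moving average; NaNs (missing data) are skipped."""
    x = np.asarray(x, dtype=float)
    half = window // 2  # window is assumed to be odd
    for _ in range(iterations):
        smoothed = np.full(x.shape, np.nan)
        for i in range(x.size):
            values = x[max(0, i - half): i + half + 1]
            values = values[~np.isnan(values)]
            if values.size > 0:
                smoothed[i] = values.mean()
        x = smoothed
    return x

def decompose(x, long_params, baseline_params):
    """Split a series into short-term, seasonal, and long-term components."""
    x = np.asarray(x, dtype=float)
    long_term = kz_filter(x, *long_params)
    baseline = kz_filter(x, *baseline_params)  # long-term + seasonal
    seasonal = baseline - long_term
    short_term = x - baseline
    return short_term, seasonal, long_term

# Synthetic series; the (window, iterations) pairs are illustrative only.
t = np.arange(2000)
series = 50 + 0.01 * t + 10 * np.sin(2 * np.pi * t / 500) + np.sin(2 * np.pi * t / 24)
short_term, seasonal, long_term = decompose(
    series, long_params=(501, 3), baseline_params=(25, 3)
)
```

By construction, the three components add back up to the original series wherever the original value is present, which is what allows separately modeled components to be summed into a total concentration.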
Artificial neural networks
Multilayer perceptron (MLP) neural networks were selected to model pollutants' levels and their components. MLP neural networks have been shown to be suitable ANN types for modeling air pollutants' concentrations in several previous studies (Hornik et al., 1989; Gardner and Dorling, 1999; Kolehmainen et al., 2001; Ibarra-Berastegi et al., 2008; Hoi et al., 2009; Hrust et al., 2009; Yazdanpanah et al., 2009; Arhami et al., 2013). The backpropagation learning algorithm was used to train these networks, and a sigmoid activation function was used to convert each neuron's weighted input to its output activation. MLP networks are made of an input layer, an output layer, and hidden layers, and each layer comprises several neurons. The optimized network structure was selected by varying the number of hidden layers and the number of neurons in each layer through a trial-and-error procedure. The networks constructed for this study used 3–5 hidden layers and 14–24 neurons in each layer.
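To make the network description concrete, the following is a small sketch of an MLP with sigmoid hidden layers trained by backpropagation with plain gradient descent. It is not the software or configuration used in the study: the linear output neuron, the weight initialization, the learning rate, the squared-error loss, and the synthetic data are all assumptions, and the function names are hypothetical. Only the layer counts (three hidden layers of 14 neurons) are chosen to fall within the ranges reported above.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def init_mlp(layer_sizes, seed=0):
    """Create (weights, bias) pairs for consecutive layers."""
    rng = np.random.default_rng(seed)
    return [(rng.normal(0.0, 0.5, (n_in, n_out)), np.zeros(n_out))
            for n_in, n_out in zip(layer_sizes[:-1], layer_sizes[1:])]

def forward(params, x):
    """Return the activations of every layer; the output layer is linear (assumption)."""
    activations = [x]
    for i, (w, b) in enumerate(params):
        z = activations[-1] @ w + b
        is_output = i == len(params) - 1
        activations.append(z if is_output else sigmoid(z))
    return activations

def mse(params, x, y):
    return float(np.mean((forward(params, x)[-1] - y) ** 2))

def train_step(params, x, y, lr):
    """One backpropagation step on half the mean squared error."""
    acts = forward(params, x)
    delta = (acts[-1] - y) / len(x)  # error signal at the linear output
    updated = []
    for i in reversed(range(len(params))):
        w, b = params[i]
        grad_w = acts[i].T @ delta
        grad_b = delta.sum(axis=0)
        if i > 0:
            # Propagate the error through the sigmoid of the previous layer.
            delta = (delta @ w.T) * acts[i] * (1.0 - acts[i])
        updated.append((w - lr * grad_w, b - lr * grad_b))
    return updated[::-1]

# Synthetic data with six inputs, standing in for the six predictors.
rng = np.random.default_rng(0)
x = rng.uniform(-1.0, 1.0, (64, 6))
y = np.sin(x.sum(axis=1, keepdims=True))
params = init_mlp([6, 14, 14, 14, 1])
loss_before = mse(params, x, y)
for _ in range(500):
    params = train_step(params, x, y, lr=0.05)
loss_after = mse(params, x, y)
```

A trial-and-error structure search of the kind described above would repeat this training for different `layer_sizes` and keep the structure with the best testing performance.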
A key factor in optimizing ANN performance is the volume of the input data set. As described and examined in our previous study (Arhami et al., 2013), the best performance for short-term predictions was obtained by using the most recent 1 year of input variables. The data set was initially divided into two subsets, one for training and testing (validation) and one for the prediction period. Subsequently, the data set for training and testing was randomly divided into two subsets: 85% of the data were dedicated to training and 15% to testing. The networks were trained using the training subset, and their performance was then evaluated using the testing subset. Subsequently, the performance of the optimized networks in predicting pollutants' levels was examined during the prediction periods, which were not used in the networks' construction, including the training and testing procedures. The whole data set used in this study covered the period from January 2002 to January 2008. For the seasonal and long-term components, 5 years of data (January 2002 to December 2006) were used to train and test the networks, and one whole year (2007) was used as the prediction period. For the short-term component, the whole year 2007 was used for training and testing, and the month of January 2008 was used as the prediction period.
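The two-stage split described above can be sketched as follows: the final block of samples is held out as the prediction period, and the remaining samples are randomly divided 85%/15% into training and testing subsets. The function name `split_dataset`, the random seed, and the sample counts are assumptions made for illustration.

```python
import numpy as np

def split_dataset(n_samples, n_prediction, train_fraction=0.85, seed=0):
    """Hold out the last n_prediction samples, then split the rest at random."""
    rng = np.random.default_rng(seed)
    n_fit = n_samples - n_prediction
    shuffled = rng.permutation(n_fit)
    n_train = int(round(train_fraction * n_fit))
    train_idx = np.sort(shuffled[:n_train])
    test_idx = np.sort(shuffled[n_train:])
    pred_idx = np.arange(n_fit, n_samples)  # never seen during training or testing
    return train_idx, test_idx, pred_idx

# Illustrative sizes only.
train_idx, test_idx, pred_idx = split_dataset(n_samples=1000, n_prediction=200)
```

Keeping the prediction period contiguous and at the end of the record mirrors the setup in the text, where the prediction period follows the training and testing years.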
To evaluate the networks' performances, three statistical parameters of Mean Absolute Error (MAE), Root Mean Square Error (RMSE), and Index of Agreement (IA), in addition to correlation coefficient (R2) and slope of trend line (S), were calculated by comparing the observed and predicted concentrations.
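The formulas for these statistics are not given in this text. The sketch below uses standard definitions as an assumption: Willmott's form for the Index of Agreement, the squared Pearson correlation for R2, and the slope of a least-squares line of predicted against observed values for S. The function name `evaluate` and the sample values are hypothetical.

```python
import numpy as np

def evaluate(observed, predicted):
    """Return MAE, RMSE, IA, R2, and trend-line slope (assumed standard definitions)."""
    o = np.asarray(observed, dtype=float)
    p = np.asarray(predicted, dtype=float)
    mae = np.mean(np.abs(p - o))
    rmse = np.sqrt(np.mean((p - o) ** 2))
    # Willmott's Index of Agreement.
    o_mean = o.mean()
    ia = 1.0 - np.sum((p - o) ** 2) / np.sum((np.abs(p - o_mean) + np.abs(o - o_mean)) ** 2)
    r2 = np.corrcoef(o, p)[0, 1] ** 2
    slope = np.polyfit(o, p, 1)[0]
    return {"MAE": mae, "RMSE": rmse, "IA": ia, "R2": r2, "S": slope}

# A prediction that is offset by a constant: perfectly correlated but biased.
scores = evaluate([1.0, 2.0, 3.0], [2.0, 3.0, 4.0])
```

The example shows why several statistics are reported together: a constant bias leaves R2 and the slope at 1 while MAE, RMSE, and IA all register the error.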
Developing the model
A model using both the KZ filter and ANNs was developed to model air pollutants' levels. As described, the KZ filter was used to separate the three components of the NOx and PM10 concentration time series and of the selected meteorological variables (WD, WS, Temp, and RH), and ANNs were used to model pollutants' levels and/or their components. Three different approaches (Methods 1 to 3) for combining the components of dependent and independent data with ANNs were examined, as shown in Table 1. The performance of these three methods in predicting pollutants' levels was evaluated.
For each method seven different networks were made for different days of a week.
Short (parameter) = short-term component of the parameter; seasonal (parameter) = seasonal component of the parameter; long (parameter) = long-term component of the parameter.
HOD, hour of the day; MOY, month of the year; RH, relative humidity; WS, wind speed; WD, wind direction.
In the first method, the nonfiltered time series of the independent (input) variables were used in three different networks to model the three components of pollutants' concentrations separated by the KZ filter. These three modeled components were summed to obtain the hourly pollutant concentrations. In the second method, all three components of the meteorological variables were used as the input data set for modeling the nonfiltered pollutant concentration. In other words, instead of using hourly values of the meteorological variables, their components were used to model hourly pollutant concentrations. This method helps in understanding the importance and impact of separating the time series on the networks' performance. In the third method, the three components of the filtered input data sets were used to train networks for, and predict, the three corresponding components of the filtered pollutant time series. In this approach, the short-term, seasonal, and long-term components of the meteorological predictors were used to model the short-term, seasonal, and long-term pollutant concentrations, respectively. In all methods, additional variables such as HOD and MOY were used alongside the meteorological variables as model inputs to model pollutant levels or their filtered components.
Implementing method 1 or method 3 required three different ANNs, one for each time series component (short term, seasonal, and long term), and seven different networks, one for each day of the week, resulting in 21 different networks for each pollutant. By contrast, implementing method 2 required only seven networks, one for each day of the week. These networks were trained, validated, and used to predict the different components of the concentration time series for each pollutant, as described in the methods section. Since each day of the week was trained and predicted separately, seven different networks were prepared in each case. The results of each time series component modeled by the seven networks for the different days of the week were combined for consecutive days to obtain a continuous time series of the pollutant during the prediction period that could be compared with the measured values. From this comparison, the corresponding statistical parameters were calculated; they are presented in the results section.
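The bookkeeping described above (three components times seven weekdays, then recombination into one continuous series) can be sketched as follows. The dictionary layout, the function names, and the placeholder predictions are assumptions for illustration; in practice each entry would hold the output of the corresponding trained network.

```python
import numpy as np
from itertools import product

COMPONENTS = ("short", "seasonal", "long")
WEEKDAYS = range(7)

def build_network_keys():
    """One network per (component, weekday) pair: 3 x 7 = 21 per pollutant."""
    return list(product(COMPONENTS, WEEKDAYS))

def combine_predictions(weekday, predictions):
    """Pick each time step from the network of its own weekday, then sum the components.

    predictions[(component, day)] is an array aligned with the full time axis.
    """
    weekday = np.asarray(weekday)
    total = np.zeros(weekday.shape, dtype=float)
    for component in COMPONENTS:
        for day in WEEKDAYS:
            mask = weekday == day
            total[mask] += predictions[(component, day)][mask]
    return total

keys = build_network_keys()
weekday = np.arange(14) % 7  # two consecutive weeks, one sample per day
# Placeholder outputs: the network for weekday d always predicts the value d.
predictions = {(c, d): np.full(weekday.shape, float(d)) for c, d in keys}
total = combine_predictions(weekday, predictions)
```

With these placeholder values, every time step receives its own weekday index from each of the three components, so the combined series equals three times the weekday index.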
Finally, the developed model was used to predict the baseline concentration. The baseline concentration, which is the sum of the long-term and seasonal components, retains the persistent variation of a signal while removing its peak values. Baseline concentration modeling is beneficial in long-term planning and in observing or predicting the effect of control strategies on pollutants' levels. Hence, baseline concentration data sets were modeled for each of the pollutants, using the same predictors as explained. In other words, the baseline components of the four meteorological variables (WS, WD, Temp, and RH) and the other two inputs (HOD and MOY) were used as input variables (predictors), and the baseline components of pollutants' concentrations were used as targets in the training and prediction steps.
Results and Discussion
Time series components and selecting best model
Temporal components (long term, seasonal, and short term) of NOx, PM10, and the meteorological variables were obtained with the KZ filter. As an example, the NOx time series from 2002 to 2007 and its separated components are presented in Fig. 1. As expected, the main fluctuations of the pollutant's time series are attributable to its short-term component, which varied widely between ∼ −150 and +500 ppb. Hourly and daily variations were shown to reflect the noise behavior in the pollutants' time series. A harmonic trend for seasonal variation was found during the studied period; however, the amplitude of the seasonal wave varied between years. The fluctuations of the seasonal term were much smaller than those of the short term and ranged from ∼ −45 to +75 ppb. The long-term component varied rather smoothly, because the noise and the periodic variation of the time series were carried by the short-term and seasonal terms, respectively. The short-term, seasonal, and long-term components of the other variables behaved rather similarly to those of NOx, with the short-term and seasonal terms reflecting noise and periodic variations, respectively, and a smooth long-term variation. The short-term component accounts for more than 50% of the total NOx and PM10 variations. The relative contribution of each component to the total concentration variation (not magnitude) was calculated by dividing the variance of the temporal component by the variance of the original air pollution time series.
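The variance ratio described above can be computed directly. The sketch below uses two small synthetic components rather than measured data, and the function name `variance_contribution` is hypothetical.

```python
import numpy as np

def variance_contribution(components, original):
    """Variance of each component divided by the variance of the original series."""
    total_variance = np.nanvar(original)
    return {name: float(np.nanvar(values) / total_variance)
            for name, values in components.items()}

# Two uncorrelated synthetic components and their sum.
a = np.array([1.0, -1.0, 1.0, -1.0])
b = np.array([1.0, 1.0, -1.0, -1.0])
contributions = variance_contribution({"a": a, "b": b}, a + b)
```

Here the components are uncorrelated, so the ratios add up to 1. For real decomposed series the components are generally correlated to some degree, in which case the ratios need not sum exactly to 1 because the covariance terms are left out.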

The performance of the different methods for combining filtered data with ANNs to build the models (methods 1 to 3, as described in Table 1) was compared to find the best method for predicting the air pollutant data. Comparisons between NOx concentrations predicted by the different approaches and the observed levels are presented in Fig. 2. The same input and target periods were used to train and verify the networks for all three methods. For methods 1 and 3, the results of modeling the three time series components of NOx were summed to generate the total hourly concentrations. Hence, all the results plotted in Fig. 2 represent hourly pollutant concentrations (not their components). Among the three methods, method 1 showed the worst performance and method 3 the best. Method 2 performed similarly to method 3, but less accurately. Method 3 was also the best method for predicting the peak levels of the time series data. Besides superior modeling accuracy, this approach provides the opportunity to model the time series components separately, which is not achievable with method 2. Hence, method 3 was selected for building the model to predict pollutants' levels.

Predictions made by different combining methods,
The selection of the best input parameters for ANN modeling was assessed in our previous study (Arhami et al., 2013). To verify the suitability of the selected data set for modeling each of the time series components (long term, seasonal, and short term), and to examine the models' sensitivity to each input variable, extra networks were trained with the input parameters eliminated one at a time. Comparing the performance of these models with that of the model constructed with all the selected input variables, as described in the methods section, showed the impact of each eliminated input variable. The prediction performance of these networks for the short-term component is listed in Table 2. The performance of the ANNs was highly influenced by HOD and MOY, which shows the models' high sensitivity to eliminating these two predictors. As mentioned earlier, these two predictors represent the variations in traffic pattern (the main pollutant source) and in atmospheric conditions through the day and year. In general, the short-term components of pollutants' time series are highly dependent on each of the input parameters, and all of them are essential for modeling and making predictions. This analysis was also carried out for the long-term and seasonal components, but eliminating input variables (one at a time) did not have a significant impact on the models' performance. This shows that, in modeling the long-term and seasonal components, the other five input variables compensated for the eliminated one in each step, and indicates that the seasonal and long-term components could be modeled without including all of the input parameters. MOY, WS, and Temp are the suggested variables for this purpose, given the proper performance of long-term and seasonal networks trained with only MOY, WS, and Temp as input parameters.
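The one-at-a-time elimination procedure can be sketched as a loop that refits a model without each input and records the change in error. To keep the example short and deterministic, an ordinary least-squares fit is used here as a stand-in for the ANN; this is a deliberate substitution, not the authors' model, and the data, input names, and function names are synthetic or hypothetical.

```python
import numpy as np

def fit_predict_linear(x_train, y_train, x_test):
    """Stand-in for an ANN: ordinary least squares with an intercept."""
    a_train = np.column_stack([x_train, np.ones(len(x_train))])
    coef, *_ = np.linalg.lstsq(a_train, y_train, rcond=None)
    a_test = np.column_stack([x_test, np.ones(len(x_test))])
    return a_test @ coef

def rmse(observed, predicted):
    return float(np.sqrt(np.mean((np.asarray(predicted) - np.asarray(observed)) ** 2)))

def leave_one_input_out(x_train, y_train, x_test, y_test, names):
    """Refit with all inputs, then with each input removed in turn."""
    results = {"all inputs": rmse(y_test, fit_predict_linear(x_train, y_train, x_test))}
    for j, name in enumerate(names):
        keep = [i for i in range(len(names)) if i != j]
        predicted = fit_predict_linear(x_train[:, keep], y_train, x_test[:, keep])
        results["without " + name] = rmse(y_test, predicted)
    return results

# Synthetic data: the target depends strongly on x1, weakly on x2, and not on x3.
rng = np.random.default_rng(0)
x = rng.normal(size=(300, 3))
y = 3.0 * x[:, 0] + 0.5 * x[:, 1] + 0.01 * rng.normal(size=300)
results = leave_one_input_out(x[:200], y[:200], x[200:], y[200:], ["x1", "x2", "x3"])
```

A large increase in error when an input is removed marks it as important, whereas an unchanged error suggests the remaining inputs cover for it, which is the pattern reported above for the long-term and seasonal components.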
MAE, mean absolute error; RMSE, root mean square error; IA, index of agreement.
Modeling results
Since method 3 (of the methods explained in Table 1) was selected, three different ANNs were constructed, one for each of the three time series components (short term, seasonal, and long term). The results of these networks were added to obtain the time series of pollutants' concentrations. Predictions of the hourly short-term components of NOx and PM10 concentrations are compared with observed values in Fig. 3, and the time series of hourly NOx and PM10 concentrations (the sum of the short-term, seasonal, and long-term values) are plotted in Fig. 4. For the short-term components, predictions resulted in R2 of 0.89 for NOx and 0.91 for PM10, which were close to the values for the total concentrations. This similarity indicated that the uncertainties in the model's predictions of total concentrations are mainly attributable to the uncertainties in modeling the short-term component, which is influenced by sub-daily fluctuations in emission sources (affected mainly by traffic conditions) and meteorological variables. Because of sudden and sizable fluctuations in values, performance in modeling the short-term time series was lower than for the long-term and seasonal components. In fact, the developed model performed accurately in modeling the long-term and seasonal components, which could be attributed to the slight variations and smooth changes of these components during the short prediction period of 1 month.

Comparison between observed values and predictions of developed model for short-term components of

Performance of hourly predictions of
Predictions of the developed model, which uses data preprocessed by the KZ filter (separating the time series components), are compared in Fig. 5 with predictions of ordinary ANN models trained with the same input variables without separating the components. The predictions of the developed model were better than or similar to those of the ordinary ANN model. Although the overall performances of the ordinary ANNs and this new approach were rather close, the ANNs with filtered data, with separated time series and a network trained for each of them, showed a better ability to deal with highly elevated peak concentrations. To illustrate this advantage, the prediction period presented in Fig. 5 was selected to include episodes of high NOx levels in Tehran. As can be seen, the ANN with filtered data performed better, especially in predicting peak points, where NOx concentrations reach values of over 400–500 ppb. More accurate prediction of extreme pollutant levels is crucial. These situations can reflect critical and unhealthy pollutant levels, and predicting them would help in taking proper urgent actions. For instance, in Tehran, episodes of elevated pollution levels have occurred frequently in recent years, causing increased rates of hospital admission, morbidity, and mortality (Saadat et al., 2009, 2010; Nabavi et al., 2012). During these episodes, local authorities were forced to take urgent actions such as asking children and the elderly to stay indoors, enforcing traffic limitations, and shutting down some organizations and schools. More reliable forecasting of highly elevated levels could help officials manage such crises better and protect public health as effectively as possible. The developed models could be helpful here, as they were shown to be more reliable in predicting peak levels.
Another advantage of the model is that it provides separate predictions for each of the time series components and can therefore predict their trends of variation. An important aspect of this advantage is presented and discussed in the following section.

Comparison of predictions' performance between ordinary optimized Artificial Neural Network (ANN) and developed hybrid model of ANN with filtered data for the period of containing the episode of elevated pollution levels in Tehran.
Predicting baseline levels
It is beneficial for policymakers to have access to predictions for different components of the time series, such as the baseline concentration component. Variation in this component reflects the variation in the concentration time series excluding the sub-daily variations, and it represents the ambient urban background concentration to which residents are exposed. Hence, baseline concentration data sets were modeled for each of the pollutants, using the same predictors as explained earlier while implementing method 3. The separately trained networks, built with 5 years of baseline data (2002 to 2006), were used to predict 1 year (2007) of baseline concentrations (seasonal and long-term components) that was not included in the training step, with R2=0.88 for NOx and 0.90 for PM10 and IA=0.96 for NOx and 0.97 for PM10. The whole year of hourly predictions (weekly moving average) for NOx and PM10 is plotted in Fig. 6. Modeling and predicting the baseline concentration could be beneficial in long-term planning and in observing or predicting the effect of control strategies on pollutants' levels. Such a long-period prediction of the baseline concentration would give a rather clear view of pollutants' levels in the coming years, given the meteorological predictors. Such models could benefit strategists who make long-term mitigation plans or observe the effect of previously implemented actions.

Long-term prediction (∼1 year ahead) of baseline concentration for
Conclusions
In this study, we combined the KZ filter, a filtering tool that separates time series components, with ANNs to develop a model to predict hourly air pollutant concentrations. The model was applied and assessed in predicting hourly levels of both gaseous and particulate air pollutants at a central station in the polluted city of Tehran, Iran. This model separated and individually predicted three components of air pollutants' time series: short term, seasonal, and long term. The results showed rather strong performance of the developed model in predicting hourly NOx and PM10 levels, with R2 between predicted and observed values of ∼0.90 for most cases. Moreover, the developed model outperformed the ordinary ANN model, particularly in modeling peak pollutant levels. Given the increasing application of ANNs in modeling air pollution, the results of this study could be beneficial in enhancing the accuracy and reliability of such metamodels' predictions. The other application of the developed model tested in this study was providing predictions for different spectral components of pollutants, such as baseline values, which comprise the seasonal and long-term values. Modeling and predicting baseline pollutant levels are crucial for overseeing the effect of long-term control strategies and can help policymakers form a more accurate outlook on the trend of air pollution levels.
Acknowledgments
The authors would like to give special thanks to Tehran AQCC and Iran Meteorological Organization for providing the air pollution and meteorological data used in the current study.
Author Disclosure Statement
No competing financial interests exist.
