Abstract
Air pollution is a serious environmental problem that has attracted much attention. Predicting air pollutant concentration can provide useful information for urban environmental governance decision-making and for residents’ daily health management. However, existing methods either fail to model temporal dependencies or have only a weak ability to capture the spatial correlations of air pollutants. In this paper, we propose a general approach to predicting air pollutant concentration, named DSTFN, which consists of a data completion component, a similar region selection component, and a deep spatial-temporal fusion network. The data completion component uses a tensor decomposition method to fill in missing historical air quality data. The similar region selection component uses region metadata to calculate the spatial similarity between regions. The deep spatial-temporal fusion network fuses heterogeneous urban data to capture the factors affecting air quality and to predict air pollutant concentration. Extensive experiments on a real-world dataset demonstrate that our model outperforms the state-of-the-art air quality prediction models it is compared with.
Introduction
With economic and social development, air pollution is receiving more and more attention, especially in developing countries (e.g., China and Brazil). High concentrations of air pollutants may cause serious health problems, such as heart or lung diseases. According to the World Health Organization, nine out of ten people now breathe polluted air, which is linked to 7 million premature deaths annually [1]. Moreover, cities can be plagued by smog, which negatively affects residents’ daily lives. Air pollution has therefore become an urgent problem that harms public health and hinders urban economic development. For the treatment and prevention of air pollution, the information about air pollutants, such as the concentration of PM
To monitor air pollutant concentration in real time, monitoring stations have been established in different regions to collect air quality data. In addition to monitoring, there is a growing demand for predicting future air quality, which not only helps governments devise air pollution control strategies but also allows the general public to take action (such as staying at home) in advance. However, accurate prediction of air pollutant concentration is extremely challenging. First, air pollutant concentration can vary significantly over time and across locations. Figure 1a shows the PM
The change of air pollutant concentration in Beijing over a period of time: (a) An example of changes in PM
To address the problems above, we propose a deep spatial-temporal fusion network (DSTFN) to predict air pollutant concentration, which contains a bidirectional and unidirectional long short-term memory (BU-LSTM) subnet, a spatial subnet and a temporal subnet. The BU-LSTM subnet fuses the air quality data of the target region and of similar regions, capturing bidirectional temporal dependencies and spatial correlations. The spatial subnet fuses air quality data and region metadata (e.g., POI distribution, road structure, landform), capturing the change patterns of air quality within the region itself. The temporal subnet fuses air quality data, meteorological data and weather forecast data, capturing the impact of historical and future weather conditions on air quality. In our approach, we use a tensor decomposition method to complete the historical air quality data, which increases the number of training samples. Moreover, we use geographic similarity to represent the spatial correlations of air quality: the air quality in two regions with similar topography, road network structure and POI distribution is assumed to follow a similar trend. Extensive experiments on a real-world air quality dataset demonstrate that our proposed model outperforms the state-of-the-art and baseline models it is compared with. In summary, our contributions are as follows:
We propose a general and efficient approach, which trains a deep spatial-temporal fusion network to predict fine-grained air pollutant concentration.
We use tensor decomposition and reconstruction to complete the missing air quality data.
We analyze the correlations between regions and propose the concept of a regional similarity matrix, together with a method for calculating it.
With the increasing demand for air quality prediction, many researchers have explored the change trends of air pollutant concentration and proposed many methods and systems. In this paper, we review these works in three categories, namely classical emission models, statistical learning methods, and recent deep learning methods. The main idea of the classical emission models is to predict the change trend of air pollutants by simulating how pollutants diffuse in the atmosphere. Arystanbekova et al. [2] proposed the Gaussian Plume model, which assumes that the concentration of air pollutants is dispersed in the vertical and horizontal directions in a Gaussian manner, so as to diagnose and predict the level of air pollutants. The Street Canyon models proposed by Kim et al. [3] and Rakowska et al. [4] combine fluid mechanics, the chemical reactions of active pollutants and other empirical knowledge to predict air pollutant concentration by simulating the transport and chemical reactions of air pollutants in a street canyon. These classical emission methods simulate the diffusion of air pollutants based on a large number of empirical assumptions and parameters. Their prediction results are therefore far from satisfactory, because some empirical assumptions may be inconsistent with the actual situation and the parameters required by the models (such as emission density, street geometry and dispersion parameters) are difficult to obtain accurately [5, 6, 7, 8].
The statistical learning methods aim to model the relationships between air quality and its influencing factors. Hasenfratz et al. [9] proposed generalized additive models (GAMs) to predict air quality using land-use features and traffic data. Auto-regressive integrated moving average (ARIMA) is a popular model for time series analysis, which has been successfully applied to air quality prediction [10, 11]. Kumar et al. [11] used ARIMA to predict the average concentration of various air pollutants. These models simulate the change trend of air pollutant concentration and achieve relatively good results in short-term prediction. However, they usually rely on the stationarity hypothesis that air quality changes gradually, which is not consistent with the actual dynamics of air quality.
In recent years, some researchers have begun to address the air pollution problem through deep learning. Deep learning methods are suited to high-dimensional data and are able to capture non-linear relationships [12]. The LSTM [13] model is commonly used for time series prediction and can alleviate the vanishing and exploding gradient problems of RNNs [14]. The Gated Recurrent Unit (GRU) [15] is a simpler variant of LSTM and can also be used for time series prediction. Yi et al. [16] proposed an approach that includes a spatial transformation component and a deep distributed fusion network to predict air quality. Qi et al. [17] proposed a three-stage deep neural network to infer and predict air quality. Kök et al. [18] used LSTM to predict air quality. Le and Duc [19] applied CNN and LSTM to predict real-time air pollution: the CNN predicted the distribution of air pollution, and the LSTM, combined with a neural network, evaluated the relationship between weather factors and future air pollution. Tong et al. [20] applied bidirectional LSTM to the spatial-temporal interpolation of air pollutant concentration. Zhao et al. [21] proposed LSTM-FC model to predict PM
Problem description
Region division
To predict air pollutant concentration at a fine granularity, the urban area must first be divided into fine-grained regions. Taking Beijing as an example, the distribution of air quality monitoring stations in Beijing is shown in Fig. 2a. There are 36 air quality monitoring stations in total, and the red marks indicate the location of each monitoring station.
In this paper, latitude and longitude coordinates are used to divide the city into mutually disjoint grid regions, each with a side length of 1 km. We choose this side length because the coverage radius of each monitoring station is small, ranging from 400 to 3,000 meters. If the grid side were much longer, the data collected by the monitoring station in a region could not fully represent the air quality of that region. The divided region within the dotted line in Fig. 2a is shown in Fig. 2b. The red grids in the figure represent regions with a monitoring station, and the others are regions without one.
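The grid division described above can be sketched as follows. This is only an illustration under stated assumptions: the helper name `grid_index`, the south-west origin of the gridded area, and the equirectangular approximation are not taken from the paper, which does not describe how coordinates are converted to grid cells.

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean Earth radius


def grid_index(lat, lon, origin_lat, origin_lon, side_km=1.0):
    """Map a coordinate to the (row, col) of its square grid cell.

    Assumption: a local equirectangular approximation, which is usually
    adequate at city scale. (origin_lat, origin_lon) is the south-west
    corner of the gridded area and is chosen by the caller.
    """
    km_per_deg_lat = math.pi * EARTH_RADIUS_KM / 180.0
    km_per_deg_lon = km_per_deg_lat * math.cos(math.radians(origin_lat))
    north_km = (lat - origin_lat) * km_per_deg_lat
    east_km = (lon - origin_lon) * km_per_deg_lon
    if north_km < 0 or east_km < 0:
        raise ValueError("point lies outside the gridded area")
    # Floor division assigns every point to exactly one cell,
    # so the cells are mutually disjoint.
    return int(north_km // side_km), int(east_km // side_km)


# Hypothetical origin near the south-west of Beijing, for illustration only.
cell = grid_index(39.7135, 116.02, origin_lat=39.7, origin_lon=116.0)
```

A monitoring station is then associated with the cell that contains its coordinates, and cells containing no station are the regions without a monitoring station.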
(a) The distribution of air quality monitoring stations in Beijing. (b) The region within the dotted line in (a), divided into nonoverlapping one-kilometer grids.
We provide the definition of common symbols, as shown in Table 1.
Commonly used symbols
The architecture of our approach.
Fusion network architecture
Figure 3 shows the architecture of the proposed DSTFN model, which consists of air quality data preprocessing and a deep spatial-temporal fusion network.
The preprocessing of air quality data includes data completion and similar region selection. In the data completion component, we use a tensor decomposition method to fill in the missing historical air quality data. In the similar region selection component, we first extract spatial features from POI data, road network data and geographic location data, and then calculate the regional similarity matrix. For a target prediction region, we select the k most similar regions with a monitoring station through the regional similarity matrix. In the architecture, one AQI input represents the historical air quality of the target region, and the other AQI input represents the historical air quality of the similar regions. For a target region without a monitoring station, we use the mean value of the similar regions as the historical data of the target region.
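The selection step can be illustrated with a small sketch. The similarity function below, 1 / (1 + Euclidean distance), is a stand-in assumption and not the paper's similarity equation; the feature columns, values and function names are likewise hypothetical.

```python
import math


def min_max_normalize(features):
    """Scale each feature column to [0, 1]; constant columns map to 0."""
    cols = list(zip(*features))
    lows = [min(c) for c in cols]
    spans = [max(c) - min(c) for c in cols]
    return [
        [(v - lo) / sp if sp > 0 else 0.0 for v, lo, sp in zip(row, lows, spans)]
        for row in features
    ]


def similarity(a, b):
    """Stand-in similarity (assumption): 1 / (1 + Euclidean distance)."""
    return 1.0 / (1.0 + math.dist(a, b))


def select_similar(target, station_features, k):
    """Return the indices of the k station regions most similar to the target."""
    scores = [similarity(target, f) for f in station_features]
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return ranked[:k]


def mean_history(histories, indices):
    """Average the selected regions' series, time step by time step."""
    chosen = [histories[i] for i in indices]
    return [sum(vals) / len(vals) for vals in zip(*chosen)]


# Hypothetical features per region: [POI count, road length in km].
raw = [[10, 2.0], [50, 8.0], [12, 2.5], [30, 5.0]]  # row 0 is the target
norm = min_max_normalize(raw)
target, stations = norm[0], norm[1:]
similar = select_similar(target, stations, k=2)

# Hypothetical AQI series for the three station regions.
history = [[80.0, 90.0], [40.0, 50.0], [60.0, 70.0]]
filled = mean_history(history, similar)  # stands in for the target's history
```

Normalizing the target together with the station regions keeps all features on the same scale before similarities are compared.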
The deep spatial-temporal fusion network consists of three subnets: the spatial subnet (SS), the temporal subnet (TS), and the BU-LSTM subnet (BS). The AQI data, meteorological data, weather forecast data, region metadata (POI and road network data) and time data are fed into the fusion network. In addition, we use an embedding method to map categorical values to real numbers and to transform the original high-dimensional space into a low-dimensional subspace that captures the intra-dynamics of the input features. The outputs of the three subnets are aggregated by a weighted merge to generate the final prediction result.
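The weighted merge can be pictured as an element-wise weighted sum of the three subnet outputs. The sketch below assumes that form; the weight values are made up for illustration, whereas in a trained model they would be learned parameters, and the function name is hypothetical.

```python
def weighted_merge(outputs, weights):
    """Element-wise weighted sum of equally sized subnet outputs.

    outputs: one vector per subnet (e.g. spatial, temporal, BU-LSTM).
    weights: one weight vector per subnet, same length as the outputs.
    """
    if len(outputs) != len(weights):
        raise ValueError("need one weight vector per subnet output")
    size = len(outputs[0])
    if any(len(v) != size for v in list(outputs) + list(weights)):
        raise ValueError("all vectors must have the same length")
    return [
        sum(w[i] * o[i] for o, w in zip(outputs, weights))
        for i in range(size)
    ]


# Hypothetical subnet outputs for a two-step prediction horizon.
spatial, temporal, bu_lstm = [60.0, 62.0], [55.0, 58.0], [58.0, 61.0]
weights = [[0.2, 0.2], [0.3, 0.3], [0.5, 0.5]]
merged = weighted_merge([spatial, temporal, bu_lstm], weights)
```

Because the weights are applied per element, each subnet can contribute differently at each prediction step.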
Data completion
As mentioned above, urban air quality data is mainly collected by air quality monitoring stations built in cities, but the data will be missing during certain periods due to equipment failure, inspection, maintenance or other reasons. The missing ratio of the air quality dataset (detailed later) used in this paper is shown in Table 2. As can be seen from the table, the missing ratio of all kinds of air pollutants is higher than 10 percent, especially for PM
The missing ratio of air pollutant
We use tensor decomposition method to complete the missing data. The tensor
The structure of air quality tensor decomposition.
We use tucker decomposition [24] method to decompose tensor
where the symbol
where the
Topographical variations in urban regions lead to complex spatiotemporal variations in air pollutant concentration [25]. We propose the concept of regional similarity matrix to describe the similarity of spatial features between regions. The
We normalize the spatial features of each region to [0, 1] and then calculate the similarities between the regions without monitoring station and the regions with monitoring station by Eq. (3).
where the
Bidirectional and unidirectional LSTM subnet
The structure of bidirectional and unidirectional LSTM.
The deep BU-LSTM neural network includes four components: an input layer, a bidirectional LSTM (BI-LSTM) layer, an LSTM layer, and an output layer. The structure of the deep BU-LSTM neural network is shown in Fig. 5. The input layer contains an embedding layer that maps the input features X into a low-dimensional vector. The BI-LSTM layer follows the input layer and can capture more useful information from the spatial-temporal data than a unidirectional layer. When time series features carrying spatial-temporal information are fed into the BI-LSTM layer, both the spatial correlations of air quality in different regions and the temporal dependencies of air quality are captured during feature learning. The LSTM layer in front of the output layer only needs to use the features learned by the BI-LSTM layer; it iterates along the forward direction to generate the predicted value, and thus captures the forward dependencies. The output layer outputs the predicted value. The depth of the LSTM layer and of the BI-LSTM layer can be changed to adjust the generalization capability of the network.
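The layer stacking can be made concrete with a small forward-pass sketch in plain Python. It uses the standard LSTM cell formulation and assumes that the forward and backward hidden states are concatenated at each time step, which the text does not specify. Weights are random and untrained, the embedding and output layers are omitted, and all function names are hypothetical.

```python
import math
import random


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def make_lstm(input_size, hidden_size, rng):
    """Random weights for the four gates; each row acts on [x, h_prev, 1]."""
    width = input_size + hidden_size + 1
    return {
        name: [[rng.uniform(-0.5, 0.5) for _ in range(width)]
               for _ in range(hidden_size)]
        for name in ("input", "forget", "output", "candidate")
    }


def lstm_step(params, x, h_prev, c_prev):
    """One standard LSTM step: gates, then cell state, then hidden state."""
    z = list(x) + list(h_prev) + [1.0]  # trailing 1.0 carries the bias
    pre = {name: [sum(w * v for w, v in zip(row, z)) for row in rows]
           for name, rows in params.items()}
    i = [sigmoid(a) for a in pre["input"]]
    f = [sigmoid(a) for a in pre["forget"]]
    o = [sigmoid(a) for a in pre["output"]]
    g = [math.tanh(a) for a in pre["candidate"]]
    c = [fk * ck + ik * gk for fk, ck, ik, gk in zip(f, c_prev, i, g)]
    h = [ok * math.tanh(ck) for ok, ck in zip(o, c)]
    return h, c


def run_lstm(params, sequence, hidden_size):
    """Run the cell over a sequence; return the hidden state at each step."""
    h, c = [0.0] * hidden_size, [0.0] * hidden_size
    states = []
    for x in sequence:
        h, c = lstm_step(params, x, h, c)
        states.append(h)
    return states


def bu_lstm_features(sequence, hidden_size=4, seed=0):
    """BI-LSTM layer followed by a forward-only LSTM layer."""
    rng = random.Random(seed)
    n_in = len(sequence[0])
    fwd = make_lstm(n_in, hidden_size, rng)
    bwd = make_lstm(n_in, hidden_size, rng)
    top = make_lstm(2 * hidden_size, hidden_size, rng)
    forward = run_lstm(fwd, sequence, hidden_size)
    # The backward layer reads the reversed sequence; re-reverse to align steps.
    backward = run_lstm(bwd, sequence[::-1], hidden_size)[::-1]
    merged = [a + b for a, b in zip(forward, backward)]  # concatenate per step
    # The last hidden state of the forward-only layer would feed the output layer.
    return run_lstm(top, merged, hidden_size)[-1]
```

Calling `bu_lstm_features([[0.1, 0.2], [0.3, 0.1], [0.5, 0.4]])` returns a four-element feature vector for a three-step sequence with two features per step.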
The LSTM cell has three gates, namely the input gate, the forget gate, and the output gate, which control the information flow through the cell and the neural network. At time
input gate:
forget gate:
output gate:
cell input memory state:
where the
cell output memory state:
hidden layer output:
where the
The structure of an unfolded BI-LSTM contains a forward LSTM layer and a backward LSTM layer, and connects two hidden layers to the same output layer. Symbol
where
The structure of fully connected subnet.
The structure of the fully connected subnet (FC-Subnet) is shown in Fig. 6. Each subnet first uses a concatenate layer to merge the embedded air quality data and other domain data as input features, and then applies two fully connected layers to learn higher-order feature representations in a non-linear way. The spatial subnet fuses the air quality data and region metadata (e.g., POI distribution, road structure, landform) to capture the change patterns of air pollutant concentration within the region itself. The temporal subnet fuses the air quality data, meteorological data and weather forecast data to capture the impact of historical and future weather conditions on air pollutant concentration.
We use a parametric-matrix-based fusion method [26] to weight and merge the outputs of three subnets, and generate the final prediction result:
where the
Embedding [27] is a feature learning and dimensionality reduction technique that is widely used in deep learning, especially in natural language processing tasks. It is a parameterized function that maps categorical values to real numbers or transforms the original high-dimensional space into a low-dimensional subspace. For categorical features, such as weather, embedding transforms the one-hot encoded feature into a real-valued vector. For numerical features, such as air quality, embedding transforms the raw features into a low-dimensional space. Embedding brings two main advantages:
The embedded vector can reflect the similarity between the original inputs.
The embedded vector has fewer dimensions, which reduces the cost of network training.
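For a categorical feature, the mapping described above amounts to a table lookup, which is equivalent to multiplying the one-hot vector by an embedding matrix. The sketch below shows that equivalence; the weather categories, the embedding dimension and the random initialization are assumptions for illustration, and in a real model the table would be learned during training.

```python
import random


def make_embedding(num_categories, dim, seed=0):
    """Randomly initialized table with one row per category."""
    rng = random.Random(seed)
    return [[rng.uniform(-1.0, 1.0) for _ in range(dim)]
            for _ in range(num_categories)]


def one_hot(index, num_categories):
    return [1.0 if i == index else 0.0 for i in range(num_categories)]


def embed_lookup(table, index):
    """Embedding as a table lookup."""
    return table[index]


def embed_matmul(table, one_hot_vec):
    """The same embedding, written as one-hot vector times the table."""
    dim = len(table[0])
    return [sum(v * row[j] for v, row in zip(one_hot_vec, table))
            for j in range(dim)]


WEATHER = ["sunny", "cloudy", "rainy", "foggy", "snowy"]  # hypothetical categories
table = make_embedding(len(WEATHER), dim=2)
rainy = WEATHER.index("rainy")
vec = embed_lookup(table, rainy)  # 5-dimensional one-hot reduced to 2 dimensions
```

The lookup form is what makes embedding cheap: no multiplication by the zeros of the one-hot vector is actually carried out.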
The embedding setting. Dimension is represented by time steps * feature dimension in one timestamp
As shown in Table 3, we detail the embedding setting for each domain data. For historical time series data, such as air quality and meteorology, we use past
We use mean square error (MSE) to evaluate the training model, which is defined as follows:
where the
The Training Algorithm[1] Collection of monitoring station region
Dataset and parameter setting
In the evaluation, we apply our model to predict fine-grained air pollutant concentration based on real datasets collected from Beijing, China. The real urban data sources used in experiments are as follows:
Air quality data: We collect the dataset from the Microsoft Research Asia (MSRA) website.
Meteorological data: We collect fine-grained meteorological data from OpenWeatherMap.
Weather forecast data: We collect fine-grained weather forecast data from OpenWeatherMap. It consists of weather, temperature, wind direction, and wind speed for 16 districts of Beijing, spanning from May 2014 to May 2015. We use temporal linear interpolation to convert the 4-hourly raw data to hourly data.
POI data: We collect fine-grained POI data for Beijing from Amap through its APIs.
Road network data: We collect the road network data for Beijing from OpenStreetMap.
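The temporal linear interpolation applied to the weather forecast data can be sketched as follows. The function name and the sample temperatures are hypothetical, and the sketch only covers numeric fields such as temperature or wind speed; how categorical weather and wind direction are converted to hourly values is not described in the text.

```python
def interpolate_hourly(values, step_hours=4):
    """Linearly interpolate readings taken every `step_hours` to hourly values.

    values: numeric readings at hours 0, step_hours, 2 * step_hours, ...
    Returns hourly values from the first to the last reading, inclusive.
    """
    if not values:
        return []
    hourly = []
    for left, right in zip(values, values[1:]):
        for h in range(step_hours):
            hourly.append(left + (right - left) * h / step_hours)
    hourly.append(values[-1])
    return hourly


# Hypothetical temperatures (deg C) at 00:00, 04:00 and 08:00.
hourly_temp = interpolate_hourly([10.0, 14.0, 12.0])
```

Three 4-hourly readings yield nine hourly values, with the original readings preserved at hours 0, 4 and 8.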
Air quality data are completely missing at some timestamps, which account for 11.8% of the total data. We use the tensor decomposition method to complete this part of the missing data. In addition, we divide Beijing into nonoverlapping grids of one square kilometer each, and the total number of grids is 3600. The size of
where the
We compare our model with the following methods:
SVR [29]: Support vector regression is a classical supervised regression model extended from the support vector machine.
RNN [14]: The recurrent neural network is a deep learning model that can capture the temporal dependencies of time series data.
LSTM [13]: The long short-term memory network is a special kind of RNN, developed to deal with the exploding and vanishing gradient problems of traditional RNNs.
GRU [15]: The gated recurrent unit is similar to the LSTM unit, but has fewer parameters and exhibits better performance on smaller datasets.
BI-LSTM [20]: The bidirectional LSTM network is a special LSTM network with two separate hidden layers, which can process time series data in both the forward and backward directions.
LSTM-FC [21]: LSTM-FC consists of an LSTM network for capturing temporal dependencies and an FC network for capturing spatial correlations.
LSTM-NN [19]: LSTM-NN combines an LSTM layer for air pollutant data with a neural network layer for other factors that affect air pollution, such as weather conditions.
DAL [17]: DAL addresses the interpolation, prediction, and feature analysis of fine-grained air quality with a deep learning network.
DeepAir [16]: DeepAir is a deep distributed fusion network that uses multiple subnets to capture the potential relationships between air quality and other domain data.
The MAE and RMSE of DSTFN model and baseline models on the test set when predicting PM
concentration
Taking PM
The MAE of PM
on different fusion architectures
The MAE of PM
The MAE of PM
Taking PM
The MAE of PM
The MAE and RMSE of PM
The MAE and RMSE of PM
The concentration distribution of all air pollutants: (a)–(f) are the distributions of PM
The prediction of air pollutant concentration in the next 1 hour in monitoring station regions.
PM
Taking PM
Taking PM
Performance on predicting other pollutants
To demonstrate the effectiveness of our model for all air pollutants, we summarize the MAE and RMSE of PM
Predicting results visualization
To better understand the effectiveness of our model, we visualize the difference between the predicted and true values of air pollutant concentration in regions with a monitoring station, and the change trend of air quality in regions without a monitoring station.
Visualization of predicted and true air pollutant values
We select 500 consecutive hours of data from the test set for predicting the next 1 hour and visualize them, as shown in Fig. 10. The figure shows that the predicted value curve and the true value curve follow very similar trends, and the values are also very close, especially for PM
Visualization of predicted value in regions without monitoring station
We use the trained DSTFN model to predict the change of PM
Conclusion
In this paper, we propose a deep spatial-temporal fusion network model to predict air pollutant concentration. Based on domain knowledge of air pollution, we integrate heterogeneous urban data to capture the factors affecting air quality. Compared with other state-of-the-art models on the same dataset, our approach achieves higher prediction accuracy, using MAE and RMSE as indicators. The advantages of our model are as follows:
The concept of a regional similarity matrix is proposed to describe the spatial similarity between urban regions.
The missing historical data of the monitoring stations are completed using a tensor decomposition method, which increases the number of training samples.
The embedding method is used to transform the original high-dimensional features into low-dimensional vectors, which reduces the cost of network training.
The BU-LSTM subnet can learn the temporal-spatial information of air quality from historical air quality data in the target and similar regions.
The spatial subnet uses region metadata to capture the specific changes of air quality in certain regions.
The temporal subnet can use meteorological data and weather forecast data to capture the impact of weather conditions on air quality.
