Deep spatial-temporal fusion network for fine-grained air pollutant concentration prediction

Abstract

Air pollution is a serious environmental problem that has attracted much attention. Predicting air pollutant concentration can provide useful information for urban environmental governance decision-making and residents’ daily health control. However, existing methods fail to model the temporal dependencies or suffer from a weak ability to capture the spatial correlations of air pollutants. In this paper, we propose a general approach to predict air pollutant concentration, named DSTFN, which consists of a data completion component, a similar region selection component, and a deep spatial-temporal fusion network. The data completion component uses a tensor decomposition method to complete the missing historical air quality data. The similar region selection component uses region metadata to calculate the spatial similarity between regions. The deep spatial-temporal fusion network fuses heterogeneous urban data to capture the factors affecting air quality and predict air pollutant concentration. Extensive experiments on a real-world dataset demonstrate that our model outperforms state-of-the-art models for air quality prediction.

Keywords

Air pollutant concentration prediction, deep learning, LSTM, embedding, tensor decomposition

1. Introduction

With the development of the economy and society, the air pollution problem is receiving more and more attention, especially in developing countries (e.g., China and Brazil). High concentrations of air pollutants may cause serious health problems, such as heart or lung diseases. According to the World Health Organization, nine out of ten people now breathe polluted air, which is linked to 7 million premature deaths annually [1]. Moreover, cities can be plagued by smog, which negatively affects the daily life of residents. Therefore, air pollution has become an urgent problem that harms public health and hinders urban economic development. For the treatment and prevention of air pollution, information about air pollutants, such as the concentration of PM ${}_{2.5}$ , is important to urban environmental governance and the health of residents.

In order to monitor air pollutant concentration in real time, dedicated monitoring stations have been established in different regions to collect air quality data. In addition to monitoring, there is a growing demand for predicting future air quality, which not only helps governments devise air pollution control strategies but also informs the general public so that they can take action (like staying at home) in advance. However, accurate prediction of air pollutant concentration is extremely challenging. First, air pollutant concentration can vary significantly over time and across locations. Figure 1a shows the PM ${}_{2.5}$ concentration at two monitoring stations in Beijing over a period of time. We observe that the “1011” station generally shows a higher PM ${}_{2.5}$ concentration than the “1001” station (possibly because of heavy traffic congestion near “1011”). Besides, the PM ${}_{2.5}$ concentrations at the two monitoring stations differ significantly during this period. Second, the concentration of air pollutants can be affected by many factors, including weather conditions, human factors and industrial pollution, as well as surrounding air quality. Meanwhile, the interactions between these factors are so complex that it is challenging to determine the weight of each factor. Third, besides normal fluctuations, air quality can change dramatically under extreme conditions, such as rainstorms, typhoons, and sudden factory emissions. As shown in Fig. 1b, the PM ${}_{2.5}$ concentration at this station changes abruptly within two days, soaring from 25 to 180 and then falling back to 25 within 12 hours.

Figure 1.

The change of air pollutant concentration in Beijing over a period of time: (a) An example of changes in PM ${}_{2.5}$ concentration at two monitoring stations. (b) An example of PM ${}_{2.5}$ concentration changing sharply within two days in 2014.

To address the problems above, we propose a deep spatial-temporal fusion network (DSTFN) to predict air pollutant concentration, which contains a bidirectional and unidirectional long short-term memory (BU-LSTM) subnet, a spatial subnet and a temporal subnet. The BU-LSTM subnet fuses the air quality data of the target region and similar regions, capturing bidirectional temporal dependencies and spatial correlations. The spatial subnet fuses air quality data and region metadata (e.g., POI distribution, road structure, landform), capturing the change patterns of air quality in the region itself. The temporal subnet fuses air quality data, meteorological data and weather forecast data, capturing the impact of historical and future weather conditions on air quality. In our approach, we use a tensor decomposition method to complete the historical air quality data, which increases the number of training samples. Moreover, we utilize geographic similarity to represent the spatial correlations of air quality, which means that the air quality in two regions with similar topographical distribution, road network structure and POI distribution follows a similar change trend. Extensive experiments on a real-world air quality dataset demonstrate that our proposed model outperforms state-of-the-art and baseline models for air quality prediction. In summary, our contributions are listed as follows:

(1) We propose a general and efficient approach, which trains a deep spatial-temporal fusion network to predict fine-grained air pollutant concentration.

(2) We use tensor decomposition and reconstruction theory to complete the missing air quality data.

(3) We analyze the correlations between regions and propose the concept and calculation method of the regional similarity matrix.

2. Related work

With the increasing demand for air quality prediction, many researchers have explored the change trend of air pollutant concentration and proposed many methods and systems. In this paper, we review these works in three categories, namely classical emission models, statistical learning methods, and recent deep learning methods. The main idea of the classical emission models is to predict the change trend of air pollutants by simulating how pollutants diffuse in the atmosphere. Arystanbekova et al. [2] proposed a Gaussian plume model, which assumed that air pollutant concentration was dispersed in the vertical and horizontal directions in a Gaussian manner so as to diagnose and predict the level of air pollutants. The street canyon models proposed by Kim et al. [3] and Rakowska et al. [4] combined fluid mechanics, chemical reactions of active pollutants and other empirical knowledge to predict air pollutant concentration by simulating the transmission and chemical reactions of air pollutants in the street canyon. These classical emission methods simulated the diffusion of air pollutants based on a large number of empirical assumptions and parameters. Thus, their prediction results are far from satisfactory, because some empirical assumptions may be inconsistent with the actual situation and the parameters required by the models (such as emission density, street geometry, and dispersion parameters) are difficult to obtain accurately [5, 6, 7, 8].

The statistical learning methods aim to model the relationships between air quality and some influencing factors. Hasenfratz et al. [9] proposed generalized additive models (GAMs) to predict air quality using land-use features and traffic data. Auto-regressive integrated moving average (ARIMA) is a popular model for time series analysis, which has been successfully applied to air quality prediction [10, 11]. Kumar et al. [11] used ARIMA to predict the average concentration of various air pollutants. These models simulated the change trend of air pollutant concentration and achieved relatively good results in short-term prediction. But these models usually rely on the stationary hypothesis that the air quality changes gradually, which is not consistent with the dynamics of air quality changes.

In recent years, some researchers have begun to address the air pollution problem through deep learning. Deep learning methods are applied to deal with high-dimensional data, and have the capability to capture non-linear relationships [12]. The LSTM [13] model is usually used for the prediction of time series data, and can alleviate the problem of vanishing or exploding gradients in RNNs [14]. The Gated Recurrent Unit (GRU) [15] is a simple variant of LSTM and can also be used for time series prediction. Yi et al. [16] proposed an approach including a spatial transformation component and a deep distributed fusion network to predict air quality. Qi et al. [17] proposed a three-stage deep neural network to infer and predict air quality. Kök et al. [18] used LSTM to predict air quality. Le and Duc [19] applied CNN and LSTM to predict real-time air pollution, in which the CNN predicted the distribution of air pollution, and the LSTM combined with a neural network evaluated the relationship between weather factors and future air pollution. Tong et al. [20] applied bidirectional LSTM to the spatial-temporal interpolation of air pollutant concentration. Zhao et al. [21] proposed the LSTM-FC model to predict PM ${}_{2.5}$ concentration. Lin et al. [22] applied a pre-constructed graph in a diffusion convolutional recurrent neural network to build an air quality forecasting model called GC-DCRNN. Cheng et al. [23] proposed the ADAIN model for urban air quality inference, which automatically learned the feature weights of different monitoring stations by using the attention mechanism. However, some of the deep learning methods mentioned above are overly dependent on the capabilities of the model and do not process the input data effectively.

3. Problem description

3.1 Region division

In order to implement fine-grained air pollutant concentration prediction, it is necessary to divide urban regions into fine-grained regions. Taking Beijing as an example, the distribution of air quality monitoring stations in Beijing is shown in Fig. 2a. There are 36 air quality monitoring stations in total and the red marks indicate the location of each monitoring station.

In this paper, the latitude and longitude coordinates are used to divide the city into mutually disjoint grid regions, and the side length of each grid is 1 km. The reason for choosing such side length is that the coverage radius of each monitoring station is very small, ranging from 400 to 3,000 meters. If the side length of each grid region is too long, the data collected by the monitoring station in the region cannot fully represent the air quality of the region. The divided region within the dotted line in Fig. 2a is shown in Fig. 2b. The red grids in the figure represent the regions with monitoring station, and the others are the regions without monitoring station.
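The grid division described above can be illustrated with a minimal sketch. The function name `grid_index`, the south-west origin coordinates used in the usage note, and the flat-earth conversion from degrees to kilometers are all assumptions made for illustration; the paper does not specify how coordinates are projected.

```python
import math

# Approximate length of one degree of latitude in kilometers (assumption:
# a simple spherical approximation is accurate enough at city scale).
KM_PER_DEG_LAT = 111.32


def grid_index(lat, lon, lat0, lon0, cell_km=1.0):
    """Return (row, col) of the grid cell that contains (lat, lon).

    (lat0, lon0) is the south-west corner of the gridded area; rows grow
    northward and columns grow eastward. Hypothetical helper, not from the paper.
    """
    # One degree of longitude shrinks with the cosine of the latitude.
    km_per_deg_lon = KM_PER_DEG_LAT * math.cos(math.radians(lat0))
    row = int((lat - lat0) * KM_PER_DEG_LAT // cell_km)
    col = int((lon - lon0) * km_per_deg_lon // cell_km)
    return row, col
```

For example, with a hypothetical origin `lat0, lon0 = 39.4, 115.9`, a station located about 1.5 km north and 2.5 km east of the origin falls into cell `(1, 2)`; cells that contain a station would belong to $G_{v}$ and all other cells to $G_{u}$.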

Figure 2.

(a) The distribution of air quality monitoring stations in Beijing. (b) The region within the dotted line in (a) is divided into nonoverlapping one-kilometer grids.

3.2 Symbol definition

We provide the definition of common symbols, as shown in Table 1.

Table 1
Commonly used symbols

Symbols	Descriptions
$G$	The collection of all regions
$g$	A region $g\in G$
$G_{u}$	The collection of regions without monitoring station
$G_{v}$	The collection of regions with monitoring station
$T$	The whole length of timestamp of data
$t$	A timestamp $t\in T$
$\{\textit{AQI}_{g}^{t}\}_{G_{v}}^{T}$	The air quality data
$\{M_{g}^{t}\}_{G}^{T}$	The meteorological data
$\{W_{g}^{t}\}_{G}^{T}$	The weather forecast data
$\{P_{g}\}_{G}$	The POI data
$\{R_{g}\}_{G}$	The road network data
$\{L_{g}\}_{G}$	The location data

Figure 3.

The architecture of our approach.

4. Methodology

4.1 Fusion network architecture

Figure 3 shows the architecture of the proposed DSTFN model, which consists of air quality data preprocessing and a deep spatial-temporal fusion network.

The preprocessing of air quality data includes data completion and similar region selection. In the data completion component, we use a tensor decomposition method to complete the missing historical air quality data. In the similar region selection component, we first extract spatial features from POI data, road network data and geographic location data, and then calculate the regional similarity matrix. For a target prediction region, we can select $k$ similar regions with monitoring stations through the regional similarity matrix. In the architecture, "AQI" represents the historical air quality of the target region, and "Other AQI" represents the historical air quality of the similar regions. For a target region without a monitoring station, we use the mean value of the similar regions as the historical data of the target region.

The deep spatial-temporal fusion network consists of three subnets: the spatial subnet (SS), the temporal subnet (TS), and the BU-LSTM subnet (BS). The AQI data, meteorological data, weather forecast data, region metadata (POI and road network data) and time data are fed into the fusion network. In addition, we use an embedding method to map categorical values to real numbers and to transform the original high-dimensional space into a low-dimensional subspace that captures the intra-dynamics of the input features. The outputs of the three subnets are aggregated by a weighted merge to generate the final prediction result.

4.2 Data completion

As mentioned above, urban air quality data is mainly collected by air quality monitoring stations built in cities, but the data will be missing during certain periods due to equipment failure, inspection, maintenance or other reasons. The missing ratio of the air quality dataset (detailed later) used in this paper is shown in Table 2. As can be seen from the table, the missing ratio of every air pollutant is higher than 10 percent, and it reaches 45 percent for PM ${}_{10}$ . Completing the missing data can provide more training samples for the deep learning model. In our architecture, this process is called data completion (DC).

Table 2
The missing ratio of air pollutant

Types	PM ${}_{2.5}$	PM ${}_{10}$	NO ${}_{2}$	CO	O ${}_{3}$	SO ${}_{2}$
Missing ratio	13.3%	45.1%	16%	15.1%	15.4%	15.2%

We use tensor decomposition method to complete the missing data. The tensor $A\in\mathbb{R}^{N\times M\times L}$ represents the spatial-temporal air pollutant concentration of the monitoring stations in different time slots. As can be seen from Fig. 4, $A$ has three dimensions denoting $N$ regions, $M$ air pollutant categories, and $L$ time slots, respectively. An entry $A_{i,j,k}$ represents the $j$ -th air pollutant concentration of region $i$ at time slot $k$ .

Figure 4.

The structure of air quality tensor decomposition.

We use the Tucker decomposition [24] method to decompose tensor $A$ into four parts: a core tensor $S\in\mathbb{R}^{d_{R}\times d_{P}\times d_{T}}$ , a space related matrix $R\in\mathbb{R}^{N\times d_{R}}$ , an air pollutant category related matrix $P\in\mathbb{R}^{M\times d_{P}}$ and a time related matrix $T\in\mathbb{R}^{L\times d_{T}}$ . The loss function that controls the error of decomposition is defined as Eq. (1).

$\displaystyle\textit{Loss}(S,R,P,T)=\frac{1}{2}\left\|A-S\times_{R}R\times_{P}P\times_{T}T\right\|^{2}+\frac{\lambda}{2}\left(\left\|S\right\|^{2}+\left\|R\right\|^{2}+\left\|P\right\|^{2}+\left\|T\right\|^{2}\right),$ (1)

where $\left\|\cdot\right\|^{2}$ denotes the squared $l_{2}$ norm, $\times_{*}$ represents the tensor-matrix multiplication, and $*$ denotes the mode of the tensor along which it is applied. The first term controls the decomposition error and the second term is a regularization penalty that avoids overfitting. The parameter $\lambda$ controls the contribution of the regularization penalty. We adopt the gradient descent algorithm to solve the optimization problem. By minimizing the objective function, we obtain the optimized $S$ , $R$ , $P$ , and $T$ . Then we can recover the missing data in $A$ by Eq. (2).

$\displaystyle A_{r}=S\times_{R}R\times_{P}P\times_{T}T,$ (2)

where $A_{r}$ represents the recovered tensor, in which every entry has a value. Entries that are vacant in the original tensor $A$ therefore receive an estimated value in $A_{r}$ , and we use these values to fill the missing values in $A$ .
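As an illustration of Eqs. (1) and (2), the following is a minimal NumPy sketch of Tucker-based completion by plain gradient descent. It is not the authors' implementation: the function names, ranks, learning rate, and step count are hypothetical, and it assumes that the reconstruction error is evaluated only on the observed entries (through a boolean `mask`), which Eq. (1) does not state explicitly.

```python
import numpy as np


def reconstruct(S, R, P, T):
    # A_r[i, j, k] = sum_{a,b,c} S[a, b, c] * R[i, a] * P[j, b] * T[k, c]  (Eq. 2)
    return np.einsum("abc,ia,jb,kc->ijk", S, R, P, T)


def tucker_complete(A, mask, ranks=(2, 2, 2), lam=0.01, lr=0.01, steps=2000, seed=0):
    """Fill the entries of A where mask is False (hypothetical helper).

    A    : (N, M, L) array; values at unobserved positions are ignored.
    mask : (N, M, L) boolean array, True where A is observed.
    """
    rng = np.random.default_rng(seed)
    N, M, L = A.shape
    d_R, d_P, d_T = ranks
    S = rng.normal(scale=0.5, size=(d_R, d_P, d_T))
    R = rng.normal(scale=0.5, size=(N, d_R))
    P = rng.normal(scale=0.5, size=(M, d_P))
    T = rng.normal(scale=0.5, size=(L, d_T))
    A0 = np.where(mask, A, 0.0)  # neutralize NaNs at missing positions

    for _ in range(steps):
        # Residual restricted to observed entries (assumption, see lead-in).
        E = mask * (reconstruct(S, R, P, T) - A0)
        # Gradients of the loss in Eq. (1) with respect to each factor.
        g_S = np.einsum("ijk,ia,jb,kc->abc", E, R, P, T) + lam * S
        g_R = np.einsum("ijk,abc,jb,kc->ia", E, S, P, T) + lam * R
        g_P = np.einsum("ijk,abc,ia,kc->jb", E, S, R, T) + lam * P
        g_T = np.einsum("ijk,abc,ia,jb->kc", E, S, R, P) + lam * T
        S -= lr * g_S
        R -= lr * g_R
        P -= lr * g_P
        T -= lr * g_T

    A_r = reconstruct(S, R, P, T)
    # Keep observed values, use the recovered tensor only for the gaps.
    return np.where(mask, A, A_r)
```

The sketch assumes the data have been scaled to a small range such as [0, 1]; on differently scaled data the learning rate and the number of steps would need tuning.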

4.3 Similar region selection

Topographical variations in urban regions lead to complex spatiotemporal variations in air pollutant concentration [25]. We propose the concept of regional similarity matrix to describe the similarity of spatial features between regions. The $\{S_{g}\}=\{P_{g},R_{g},L_{g}\}$ denotes the spatial features of region $g$ , which combines POI data, road network data, and geographical location data.

We normalize the spatial features of each region to [0, 1] and then calculate the similarities between every region and the regions with monitoring stations by Eq. (3).

$\displaystyle\textit{Mat}_{S}(i,j)=\textit{Mat}_{S}(g_{i},g_{j})=\sqrt{\left\|S_{g_{i}}-S_{g_{j}}\right\|^{2}},$ (3)

where the $\textit{Mat}_{S}(i,j)$ represents the similarity between the region $g_{i}$ and the region $g_{j}$ , and $g_{i}\in G$ , and $g_{j}\in G_{v}$ . The $\textit{Mat}_{S}$ denotes the regional similarity matrix, where the rows represent all regions $G$ and the columns represent the regions $G_{v}$ with monitoring station. The smaller the value, the more similar they are. For any region, we can use the regional similarity matrix to select those similar regions with monitoring station.
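Eq. (3) and the selection of similar regions can be sketched as follows. The helper names, the min-max normalization, and the optional `exclude_col` argument (used to skip a station's own column when the target region itself has a station) are assumptions for illustration.

```python
import numpy as np


def min_max_normalize(X):
    """Scale every feature column of X to [0, 1]; constant columns map to 0."""
    lo, hi = X.min(axis=0), X.max(axis=0)
    span = np.where(hi > lo, hi - lo, 1.0)
    return (X - lo) / span


def regional_similarity_matrix(S_all, station_idx):
    """Rows: all regions G; columns: regions with stations G_v (Eq. 3).

    S_all       : (|G|, F) spatial features (POI, road network, location).
    station_idx : indices into S_all of the regions that have a station.
    """
    S = min_max_normalize(np.asarray(S_all, dtype=float))
    diff = S[:, None, :] - S[station_idx][None, :, :]
    return np.sqrt((diff ** 2).sum(axis=-1))  # Euclidean distance


def k_similar_selection(mat, i, k, exclude_col=None):
    """Return the column indices of the k most similar station regions for region i."""
    order = np.argsort(mat[i])  # smaller value = more similar
    if exclude_col is not None:
        order = order[order != exclude_col]
    return order[:k]
```

For a region without a monitoring station, the historical series of the selected station regions could then be averaged to serve as that region's history, as described in Section 4.1.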

4.4 Deep spatial-temporal fusion network

4.4.1 Bidirectional and unidirectional LSTM subnet

Figure 5.

The structure of bidirectional and unidirectional LSTM.

The deep BU-LSTM neural network includes four components: an input layer, a bidirectional LSTM (BI-LSTM) layer, an LSTM layer, and an output layer. The structure of the deep BU-LSTM neural network is shown in Fig. 5. The input layer contains an embedding layer that maps the input features X into a low-dimensional vector. The BI-LSTM layer follows the input layer; it can capture more useful information from spatial-temporal data and learn more useful features. When the time series features with spatial-temporal information are fed into the BI-LSTM layer, both the spatial correlations of air quality in different regions and the temporal dependencies of air quality are captured during the feature learning process. The LSTM layer in front of the output layer only needs to utilize the features learned by the BI-LSTM layer; it iterates along the forward direction to generate the predicted value, which captures the forward dependencies. The output layer outputs the predicted value. The depth of the LSTM layer and the BI-LSTM layer can be changed to adjust the generalization capability of the network.

The LSTM cell has three gates, namely the input gate, the forget gate, and the output gate, which control the information flow through the cell and the neural network. At time $t$ , the input gate, the forget gate, the output gate and the cell input memory state are denoted as $i_{t}$ , $f_{t}$ , $o_{t}$ , $\widetilde{C}_{t}$ , respectively. The formulas are as follows:

input gate:

$\displaystyle i_{t}=\sigma(W_{i}x_{t}+U_{i}h_{t-1}+b_{i}),$ (4)

forget gate:

$\displaystyle f_{t}=\sigma(W_{f}x_{t}+U_{f}h_{t-1}+b_{f}),$ (5)

output gate:

$\displaystyle o_{t}=\sigma(W_{o}x_{t}+U_{o}h_{t-1}+b_{o}),$ (6)

cell input memory state:

$\displaystyle\widetilde{C}_{t}=\textit{tanh}(W_{C}x_{t}+U_{C}h_{t-1}+b_{C}),$ (7)

where the $W_{i}$ , $W_{f}$ , $W_{o}$ and $W_{C}$ are the weight matrices connecting input $x_{t}$ to the three gates and cell input memory state. The $U_{i}$ , $U_{f}$ , $U_{o}$ and $U_{C}$ are the weight matrices connecting previous cell output state $h_{t-1}$ to the three gates and cell input memory state. And the $b_{i}$ , $b_{f}$ , $b_{o}$ and $b_{C}$ are the bias vectors of the three gates and cell input memory state. The $\sigma$ denotes the gate activation function and the tanh denotes the hyperbolic tangent function. The formulas of cell output memory state and hidden layer output are as follows:

cell output memory state:

$\displaystyle C_{t}=i_{t}*\widetilde{C}_{t}+f_{t}*C_{t-1},$ (8)

hidden layer output:

$\displaystyle h_{t}=o_{t}*\textit{tanh}(C_{t}),$ (9)

where $i_{t}$ , $f_{t}$ , $\widetilde{C}_{t}$ and $C_{t}$ have the same dimension, and $*$ denotes element-wise multiplication.
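Eqs. (4)–(9) describe one time step of a standard LSTM cell. A minimal NumPy sketch of that step is given below; storing the parameters in dictionaries keyed by `"i"`, `"f"`, `"o"`, `"C"` is a convention chosen here for readability, not something prescribed by the paper.

```python
import numpy as np


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


def lstm_step(x_t, h_prev, C_prev, W, U, b):
    """One LSTM time step following Eqs. (4)-(9).

    W, U, b are dicts keyed by "i", "f", "o", "C" holding the input weights,
    recurrent weights and biases of the three gates and the cell input state.
    """
    i_t = sigmoid(W["i"] @ x_t + U["i"] @ h_prev + b["i"])      # Eq. (4)
    f_t = sigmoid(W["f"] @ x_t + U["f"] @ h_prev + b["f"])      # Eq. (5)
    o_t = sigmoid(W["o"] @ x_t + U["o"] @ h_prev + b["o"])      # Eq. (6)
    C_tilde = np.tanh(W["C"] @ x_t + U["C"] @ h_prev + b["C"])  # Eq. (7)
    C_t = i_t * C_tilde + f_t * C_prev                          # Eq. (8)
    h_t = o_t * np.tanh(C_t)                                    # Eq. (9)
    return h_t, C_t
```

With all weights and biases set to zero, every gate evaluates to 0.5 and $\widetilde{C}_{t}$ to 0, so the cell state is simply halved at each step; this makes a convenient sanity check for the update equations.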

The structure of an unfolded BI-LSTM contains a forward LSTM layer and a backward LSTM layer, and connects the two hidden layers to the same output layer. The symbol $\overrightarrow{h}$ denotes the forward layer output, calculated from the inputs in their original order, and $\overleftarrow{h}$ denotes the backward layer output, calculated from the inputs in reversed order. Both are calculated by the standard LSTM updating equations. The unit output is calculated by Eq. (10).

$\displaystyle y_{t}=\sigma(\overrightarrow{h}_{t},\overleftarrow{h}_{t}),$ (10)

where $\sigma$ here denotes the function that combines the two outputs, such as concatenation, averaging, or summation (it is not the gate activation function of Eqs. (4)–(6)).
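The bidirectional combination in Eq. (10) can be sketched as below. To keep the example self-contained, a compact LSTM loop is included in which the four gate parameter sets are stacked into single arrays; this stacking, the function names, and the `merge` option names are assumptions for illustration.

```python
import numpy as np


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


def run_lstm(xs, W, U, b):
    """Run an LSTM over xs of shape (steps, n); return hidden states (steps, d).

    W: (4d, n), U: (4d, d), b: (4d,), stacked in the order [i, f, o, C~].
    """
    d = U.shape[1]
    h, C = np.zeros(d), np.zeros(d)
    out = []
    for x in xs:
        z = W @ x + U @ h + b
        i, f, o = sigmoid(z[:d]), sigmoid(z[d:2 * d]), sigmoid(z[2 * d:3 * d])
        C = i * np.tanh(z[3 * d:]) + f * C
        h = o * np.tanh(C)
        out.append(h)
    return np.stack(out)


def bi_lstm(xs, fwd, bwd, merge="concat"):
    """Combine a forward and a backward LSTM pass as in Eq. (10)."""
    h_f = run_lstm(xs, *fwd)
    # Feed the reversed sequence, then flip back so both outputs align in time.
    h_b = run_lstm(xs[::-1], *bwd)[::-1]
    if merge == "concat":
        return np.concatenate([h_f, h_b], axis=-1)
    if merge == "sum":
        return h_f + h_b
    if merge == "ave":
        return (h_f + h_b) / 2.0
    raise ValueError(f"unknown merge mode: {merge}")
```

In the BU-LSTM subnet, the merged sequence would then be passed to a further forward-only LSTM layer, for instance via `run_lstm` with a third parameter set.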

4.4.2 Fully connected subnet

Figure 6.

The structure of fully connected subnet.

The structure of the fully connected subnet (FC-Subnet) is shown in Fig. 6. Each subnet first uses a concatenate layer to merge the embedded air quality data and other domain data as input features. It then applies two fully connected layers to learn higher-order feature representations in a non-linear way. The spatial subnet fuses the air quality data and region metadata (e.g., POI distribution, road structure, landform) to capture the change patterns of air pollutant concentration in the region itself. The temporal subnet fuses the air quality data, meteorological data and weather forecast data to capture the impact of historical and future weather conditions on air pollutant concentration.

4.4.3 Fusion

We use a parametric-matrix-based fusion method [26] to weight and merge the outputs of three subnets, and generate the final prediction result:

$\displaystyle\widehat{y}=\textit{Sigmoid}(y_{ss}\circ w_{ss}+y_{ts}\circ w_{ts}+y_{bs}\circ w_{bs}),$ (11)

where $\widehat{y}$ is the predicted value, $y_{ss},y_{ts},y_{bs}$ are the outputs of the three subnets, $\circ$ is the Hadamard product, and $w_{ss},w_{ts},w_{bs}$ are the learnable parameters that adjust the degree of influence of the spatial subnet, the temporal subnet and the BU-LSTM subnet, respectively. The Sigmoid function ensures that the output value is between 0 and 1. Afterwards, we denormalize the predicted value to obtain the final prediction result.
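A minimal sketch of the fusion in Eq. (11) and the subsequent denormalization follows. The paper does not say which normalization is inverted, so the min-max form used in `denormalize`, as well as both function names, are assumptions.

```python
import numpy as np


def fuse(y_ss, y_ts, y_bs, w_ss, w_ts, w_bs):
    """Parametric-matrix-based fusion of the three subnet outputs (Eq. 11)."""
    z = y_ss * w_ss + y_ts * w_ts + y_bs * w_bs  # Hadamard products
    return 1.0 / (1.0 + np.exp(-z))              # Sigmoid keeps the output in (0, 1)


def denormalize(y_hat, y_min, y_max):
    """Map a prediction in [0, 1] back to concentration units.

    Assumes the targets were min-max normalized with (y_min, y_max).
    """
    return y_hat * (y_max - y_min) + y_min
```

In training, `w_ss`, `w_ts` and `w_bs` would be learned jointly with the subnet parameters; here they are passed in as plain arrays.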

4.4.4 Embedding

Embedding [27] is a feature learning and dimensionality reduction technique, which is widely used in deep learning, especially in natural language processing tasks. It is a parameterized function which maps categorical values to real numbers or transforms original high-dimensional space into low-dimensional subspace. For categorical features, such as weather, embedding can transform the features represented by one-hot encoding to a real-valued vector. For numerical features, such as air quality, embedding can transform the raw features to a low-dimensional space. Embedding brings two main advantages:

(1) The embedded vector can reflect the similarity between the original inputs.
(2) The embedded vector reduces the dimensions, which reduces the cost of network training.
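To make the mapping concrete, the sketch below embeds the time features. It assumes that the three one-hot encodings (month, day of week, time of day, with 12, 7 and 4 categories) are concatenated into a single 23-dimensional vector and projected to 3 dimensions; the random matrix `E_time` stands in for weights that would be learned during training, and all names are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in for a learned embedding matrix: 12 + 7 + 4 = 23 inputs -> 3 outputs.
E_time = rng.normal(size=(23, 3))


def one_hot_time(month, day_of_week, time_of_day):
    """Concatenated one-hot encoding; all three arguments are zero-based."""
    v = np.zeros(23)
    v[month] = 1.0             # month in 0..11
    v[12 + day_of_week] = 1.0  # day_of_week in 0..6
    v[19 + time_of_day] = 1.0  # time_of_day in 0..3
    return v


def embed(v, E):
    """Linear embedding: project the sparse input onto a dense low-dimensional vector."""
    return v @ E
```

Because the input is one-hot, the projection reduces to a sum of rows of the embedding matrix: `embed(one_hot_time(2, 5, 1), E_time)` equals `E_time[2] + E_time[17] + E_time[20]`, which is why embedding layers are usually implemented as table lookups.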

Table 3
The embedding setting. Dimension is represented by time steps * feature dimension in one timestamp

Data	Feature	Dimension	Embedding
AQI	PM ${}_{2.5}$	$l$	$l$
	PM ${}_{10}$	$l$
	NO ${}_{2}$	$l$
	SO ${}_{2}$	$l$
	CO	$l$
	O ${}_{3}$	$l$
Other AQI	PM ${}_{2.5}$	$l*k$	$l*k$
	PM ${}_{10}$	$l*k$
	NO ${}_{2}$	$l*k$
	SO ${}_{2}$	$l*k$
	CO	$l*k$
	O ${}_{3}$	$l*k$
Meteorology	Weather	$l*8$	$l$
	Wind Speed	$l$
	Wind Direction	$l*8$
	Humidity	$l$
	Temperature	$l$
	Pressure	$l$
Weather forecast	Weather	$\delta t/4$	$\delta t/4$
	Temperature	$\delta t/4$
	Wind Level	$\delta t/4$
	Wind Direction	$\delta t/4$
Time	Month	12	3
	DayOfWeek	7
	TimeOfDay	4
Region Metadata	Station ID	36	5
	Industrial POI Level	3
	Green POI Level	3
	Service POI Level	3
	Road Level	3

As shown in Table 3, we detail the embedding setting for each kind of domain data. For historical time series data, such as air quality and meteorology, we use the past $l$ hours to incorporate the temporal information, and select $k$ similar regions to capture the spatial correlations of air pollutants. For weather forecast data, we use the future $\delta t/4$ timestamp instances to indicate the change of weather, which provides information about the future. The region metadata consists of the station ID, POI levels, and road network level, which capture the intra-dynamics within the region. Taking the industrial POI level as an example, we divide regions into three levels according to the number and scale of the POIs in each region, representing high-density, medium-density and low-density industrial zones.

4.5 Algorithm

We use mean square error (MSE) to evaluate the training model, which is defined as follows:

$\displaystyle\textit{Loss}=\frac{1}{D}\sum_{i=1}^{D}(y_{i}-\widehat{y_{i}})^{2},$ (12)

where $D$ denotes the number of training samples, $y_{i}$ is the ground-truth air pollutant concentration collected from monitoring stations, and $\widehat{y_{i}}$ is the predicted air pollutant concentration value. We adopt the backpropagation algorithm to train the DSTFN model. The training process of the DSTFN model is shown in Algorithm 1. $\Theta$ denotes all model parameters, including the weight matrices $W$ , $U$ and the bias $b$ of each gate and cell state. Lines 1–15 represent the generation process of the training samples, and lines 16–20 represent the training process. To improve the accuracy of the model, we use Adam [28] to optimize the learning of the parameters, and use a dropout strategy that randomly deactivates some nodes to improve the generalization ability of the model.

Algorithm 1: The training algorithm

Input: Collection of monitoring station regions $G_{v}$ ; historical air quality of $G_{v}$ : $\{\textit{AQI}_{g}^{t}\},t\in T$ ; historical meteorology of $G_{v}$ : $\{M_{g}^{t}\},t\in T$ ; weather forecast data of $G_{v}$ : $\{W_{g}^{t}\},t\in T$ ; POI data of $G_{v}$ : $\{P_{g}\}$ ; road network data of $G_{v}$ : $\{R_{g}\}$ ; location data of $G_{v}$ : $\{L_{g}\}$ ; the future time offset $\delta t$ ; the historical time lags $l$ ; the target air pollutant type $y$
Output: Learned DSTFN model $\mathcal{M}$
1: $\mathcal{D}\leftarrow\varnothing$
2: $\textit{Mat}_{S}=\textit{Regional\_Similarity\_Matrix}(P_{g},R_{g},L_{g})$
3: for all $g\in G_{v}$ do
4:     $G_{s}=\textit{K\_Similar\_Selection}(g,\textit{Mat}_{S})$
5:     for all available time steps $t$ ( $T\leqslant t\leqslant n-1$ ) do
6:         $x_{a}=[\textit{AQI}_{g}^{t-l},\ldots,\textit{AQI}_{g}^{t}]$
7:         $x_{oa}=\textit{Get\_Other\_AQI}(G_{s},[\textit{AQI}_{G_{s}}^{t-l},\ldots,\textit{AQI}_{G_{s}}^{t}])$
8:         $x_{m}=[M_{g}^{t-l},\ldots,M_{g}^{t}]$
9:         $x_{wf}=[W_{g}^{t},\ldots,W_{g}^{t+\delta t}]$
10:         $x_{rm}=\textit{Get\_Region\_Metadata}(P_{g},R_{g})$
11:         $x_{t}=\textit{Get\_Time}(t)$
12:         $y=\textit{Get\_Target}(\textit{AQI}_{g}^{t+\delta t})$
13:         Append ( $\{x_{a},x_{oa},x_{m},x_{wf},x_{rm},x_{t}\}$ , $y$ ) into $\mathcal{D}$
14:     end for
15: end for
16: initialize all learnable parameters $\Theta$ in DSTFN
17: while stopping criteria is not met do
18:     randomly select a batch of samples $\mathcal{D}_{b}$ from $\mathcal{D}$
19:     update $\Theta$ by minimizing the objective function with $\mathcal{D}_{b}$
20: end while
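The sample-generation loop of the training algorithm is essentially a sliding window over each station's time series. The sketch below shows that windowing for a single region in a deliberately simplified form: it keeps only the target region's air quality history $x_{a}$ , its meteorology $x_{m}$ and the target $y$ , and omits the similar-region, forecast, metadata and time inputs. The function name and argument layout are hypothetical.

```python
import numpy as np


def build_samples(aqi, met, l, dt):
    """Generate (x_a, x_m, y) training samples for one region.

    aqi : (n,) hourly series of the target pollutant.
    met : (n, m) hourly meteorological features.
    l   : number of historical time lags.
    dt  : future time offset (delta t) of the prediction target.
    """
    n = len(aqi)
    samples = []
    # t needs l past steps behind it and dt future steps ahead of it.
    for t in range(l, n - dt):
        x_a = aqi[t - l:t + 1]  # [AQI^{t-l}, ..., AQI^{t}]
        x_m = met[t - l:t + 1]  # [M^{t-l}, ..., M^{t}]
        y = aqi[t + dt]         # AQI^{t+dt}
        samples.append((x_a, x_m, y))
    return samples
```

Each window contains $l+1$ time steps because both endpoints $t-l$ and $t$ are included; a training loop would then repeatedly draw random mini-batches from the returned list.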

5. Experiments and results

5.1 Dataset and parameter setting

In the evaluation, we apply our model to predict fine-grained air pollutant concentration based on real datasets collected from Beijing, China. The real urban data sources used in experiments are as follows:

(1) Air quality data: We collect the dataset from the Microsoft Research Asia (MSRA) website (https://www.microsoft.com/en-us/research/project/urban-air/). The data cover 36 stations in Beijing, China, and span from May 2014 to May 2015. Each station provides hourly reports of air pollutants, including PM ${}_{2.5}$ , PM ${}_{10}$ , SO ${}_{2}$ , NO ${}_{2}$ , CO and O ${}_{3}$ .
(2) Meteorological data: We collect fine-grained meteorological data from OpenWeatherMap (https://openweathermap.org/). It consists of weather, temperature, humidity, pressure, wind direction, and wind speed in 16 districts of Beijing, spanning from May 2014 to May 2015. The data may be missing due to equipment maintenance or other disruptions; we also use tensor decomposition to complete the missing entries.
(3) Weather forecast data: We collect fine-grained weather forecast data from OpenWeatherMap. It consists of weather, temperature, wind direction, and wind speed in 16 districts of Beijing, spanning from May 2014 to May 2015. We use temporal linear interpolation to convert the 4-hourly raw data to hourly data.
(4) POI data: We collect the fine-grained POI data in Beijing from Amap through its APIs (https://lbs.amap.com/api/webservice/guide/api/search). We first divide the POIs of each region into three categories, namely industrial POI, green POI, and service POI. Each category is divided into a high-density level, a medium-density level and a low-density level. We then calculate the level of each POI category for every 1 km ${}^{2}$ region as POI features.
(5) Road network data: We collect the road network data in Beijing from OpenStreetMap (http://www.openstreetmap.org/). We first calculate the total length of all road segments in each 1 km ${}^{2}$ region. Then we divide the total length into three levels, namely a high-density level, a medium-density level and a low-density level, as road network features.
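The temporal linear interpolation used for the weather forecast data in item (3) can be sketched with NumPy's `np.interp`. The helper name `to_hourly` is hypothetical, and the sketch only applies to numeric fields such as temperature; how categorical fields such as the weather type are expanded to hourly resolution is not described in the paper and is not covered here.

```python
import numpy as np


def to_hourly(values_4h):
    """Linearly interpolate a series sampled every 4 hours onto an hourly grid."""
    values_4h = np.asarray(values_4h, dtype=float)
    t_4h = np.arange(len(values_4h)) * 4  # timestamps of the raw samples (hours)
    t_1h = np.arange(t_4h[-1] + 1)        # hourly timestamps up to the last sample
    return np.interp(t_1h, t_4h, values_4h)
```

For instance, `to_hourly([0, 4, 0])` yields the nine hourly values `0, 1, 2, 3, 4, 3, 2, 1, 0`.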

The air quality data is completely missing at some timestamps, which account for 11.8% of the total data. We use the tensor decomposition method to complete this part of the missing data. In addition, we divide Beijing into nonoverlapping one-square-kilometer grids, and the total number of grids is 3600. The size of $G_{u}$ is 3564, and the size of $G_{v}$ is 36. The length of historical time lags is $l=12$ , the number of similar stations is $k=3$ , and the offset time steps are $\delta t\in[1,48]$ . For each experiment, we select 70 percent of all samples as training data and the remaining 30 percent as test data. We adopt the mean absolute error (MAE) defined by Eq. (13) and the root mean square error (RMSE) defined by Eq. (14) to evaluate the performance of the approaches.

$\displaystyle\textit{MAE}=\frac{\sum_{i=1}^{n}|y_{i}-\widehat{y_{i}}|}{n},$ (13)

$\displaystyle\textit{RMSE}=\sqrt{\frac{\sum_{i=1}^{n}(y_{i}-\widehat{y_{i}})^{2}}{n}},$ (14)

where the $y_{i}$ is a ground-truth air pollutant concentration collected from the air quality monitoring stations, the $\widehat{y_{i}}$ is the predicted air pollutant concentration value, and the $n$ is the total number of testing samples.
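For reference, the two metrics in Eqs. (13) and (14) can be computed as in the following short sketch; the function names are illustrative.

```python
import numpy as np


def mae(y, y_hat):
    """Mean absolute error, Eq. (13)."""
    y, y_hat = np.asarray(y, dtype=float), np.asarray(y_hat, dtype=float)
    return np.mean(np.abs(y - y_hat))


def rmse(y, y_hat):
    """Root mean square error, Eq. (14)."""
    y, y_hat = np.asarray(y, dtype=float), np.asarray(y_hat, dtype=float)
    return np.sqrt(np.mean((y - y_hat) ** 2))
```

Because RMSE squares the errors before averaging, it penalizes large deviations more strongly than MAE, so RMSE is never smaller than MAE on the same predictions.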

We compare our model with the following methods:

• SVR [29]: Support vector regression is a classical supervised regression model extended from the support vector machine.
• RNN [14]: The recurrent neural network is a deep learning model that can capture the temporal dependencies of time series data.
• LSTM [13]: The long short-term memory network is a special kind of RNN, which was developed to deal with the exploding and vanishing gradient problems of traditional RNNs.
• GRU [15]: The gated recurrent unit is similar to the LSTM unit, but has fewer parameters and can perform better on smaller datasets.
• BI-LSTM [20]: The bidirectional LSTM network is a special LSTM network with two separate hidden layers, which can handle time series data in both the forward and backward directions.
• LSTM-FC [21]: LSTM-FC consists of an LSTM network for capturing temporal dependencies and an FC network for capturing spatial correlations.
• LSTM-NN [19]: LSTM-NN is a combination of an LSTM layer for air pollutant data and a neural network layer for other air pollution impact factors such as weather conditions.
• DAL [17]: DAL addresses the problems of interpolation, prediction, and feature analysis of fine-grained air quality with a deep learning network.
• DeepAir [16]: DeepAir is a deep distributed fusion network that uses multiple subnets to capture the potential relationships between air quality and other domain data.

5.2 Comparison with different baselines

Table 4
The MAE and RMSE of DSTFN model and baseline models on the test set when predicting PM ${}_{2.5}$ concentration

Method	1–6 hour		7–12 hour		13–24 hour		25–48 hour
	MAE	RMSE	MAE	RMSE	MAE	RMSE	MAE	RMSE
SVR	30.35	40.87	46.86	59.33	60.37	82.32	70.91	95.21
RNN	25.48	35.27	40.12	52.32	49.23	65.13	60.62	83.18
LSTM	22.35	32.92	37.62	49.53	47.66	60.71	57.53	79.87
GRU	22.97	33.44	37.93	49.77	47.69	61.83	56.32	79.19
BI-LSTM	21.09	30.40	36.74	48.60	46.33	59.77	56.76	78.31
LSTM-FC	21.49	31.65	36.53	48.24	46.15	59.58	56.36	78.16
LSTM-NN	21.46	31.17	36.19	47.78	46.54	59.89	56.66	78.83
DAL	20.22	29.23	35.98	47.85	45.91	59.14	54.39	77.56
DeepAir	19.29	28.92	35.19	47.18	45.23	58.43	53.84	76.19
DSTFN	17.95	27.33	33.82	45.21	44.78	57.22	53.72	75.39

Taking PM ${}_{2.5}$ as an example, we summarize the MAE and RMSE of SVR, RNN, LSTM, GRU, BI-LSTM, LSTM-FC, LSTM-NN, DAL, DeepAir and DSTFN on the test set in Table 4. DSTFN achieves the best performance at all prediction horizons. For all methods, the MAE is lowest for 1–6 hour prediction and increases as the prediction horizon grows; the RMSE follows the same trend.

The traditional machine learning model SVR performs worst among all compared models because it cannot model high-level representations of time series data. RNN, LSTM, and GRU all perform better than SVR. GRU has fewer parameters than LSTM, so it converges more easily and trains faster; when the dataset is large, LSTM may perform better than GRU. The traditional RNN performs worst among the neural network models because its nodes have no gate units to control information flow. Although RNN, LSTM, and GRU can effectively predict time series data such as air quality, they cannot capture potential factors affecting air quality from other domain data.

BI-LSTM processes time series data in both forward and backward directions with two separate hidden layers and thus acquires more spatial-temporal features from the entire series. DSTFN still performs better than BI-LSTM, because BI-LSTM learns spatial-temporal information only from time series data and does not process the learned features well for prediction. Our model also outperforms LSTM-FC and LSTM-NN. Although LSTM-FC captures spatial correlations through the fully connected layer, it only considers time series data, which limits its accuracy. By considering air quality data recorded by neighboring stations, LSTM-FC outperforms LSTM at every horizon in Table 4, which shows the importance of spatial correlations. LSTM-NN uses a neural network to capture the influence of weather conditions on air quality, but it does not consider the spatial correlations of air quality. Note that we train LSTM-FC and LSTM-NN on the data completed by our completion method, which gives them better performance than training on the original incomplete data.

DAL and DeepAir clearly improve over traditional recurrent neural networks such as RNN, LSTM and GRU. DAL uses feature selection and spatiotemporal semi-supervised learning to capture the spatial correlations and temporal dependencies of air quality. DeepAir captures the potential relationships between air quality and other data through multiple subnets, which greatly improves the accuracy of air pollutant concentration prediction. DSTFN performs better than DeepAir and DAL at all prediction horizons. The reason is that DeepAir and DAL are based on a deep neural network structure, which does not learn time series data effectively, whereas the BU-LSTM subnet in DSTFN can learn the spatial correlations and temporal dependencies of air quality between multiple regions.
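To make the size of the gap over the strongest baseline easier to read, the following Python sketch computes the relative MAE reduction of DSTFN with respect to DeepAir from the values in Table 4. The helper name `relative_reduction` is ours, and the percentages are derived here rather than reported by the authors.

```python
# MAE values copied from Table 4 (PM2.5, test set)
horizons = ["1-6 h", "7-12 h", "13-24 h", "25-48 h"]
mae_deepair = [19.29, 35.19, 45.23, 53.84]
mae_dstfn = [17.95, 33.82, 44.78, 53.72]


def relative_reduction(baseline, proposed):
    """Percentage by which `proposed` lowers the error of `baseline`."""
    return [100.0 * (b - p) / b for b, p in zip(baseline, proposed)]


reductions = relative_reduction(mae_deepair, mae_dstfn)
for horizon, reduction in zip(horizons, reductions):
    print(f"{horizon}: {reduction:.1f}% lower MAE than DeepAir")
```

Under this calculation the relative gain is largest for 1–6 hour prediction (about 6.9%) and shrinks to about 0.2% for 25–48 hour prediction.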

5.3 Performance on fusion architectures

Table 5
The MAE of PM ${}_{2.5}$ on different fusion architectures

Method	Prediction
	1–6 hour	7–12 hour	13–24 hour	25–48 hour
SS	21.45	38.34	47.71	59.91
TS	20.17	36.72	46.67	56.74
BS	19.29	35.21	45.66	55.58
DSTFN	17.95	33.82	44.78	53.72

Table 6

The MAE of PM ${}_{2.5}$ on the DSTFN model with different components

Method	Prediction
	1–6 hour	7–12 hour	13–24 hour	25–48 hour
None	21.54	37.64	49.87	59.19
DC	19.45	35.35	46.42	56.48
Embed	20.93	36.47	47.89	57.61
DC&Embed	17.95	33.82	44.78	53.72

Taking PM ${}_{2.5}$ as an example, Table 5 shows the effectiveness of our deep spatial-temporal fusion architecture. The full DSTFN outperforms each individual subnet: the spatial subnet (SS), the temporal subnet (TS), and the BU-LSTM subnet (BS). The BU-LSTM subnet performs better than the spatial and temporal subnets because it fuses the air quality of the target region and similar regions to capture bidirectional temporal dependencies and spatial correlations from historical series data. The temporal subnet performs better than the spatial subnet because it obtains future information in the time dimension from weather forecast data.

5.4 Performance on data completion and embedding

Figure 7.

The MAE of PM ${}_{2.5}$ for 1–6 hour prediction using different models (a) with and without data completion and (b) with and without embedding.

Figure 8.

The MAE and RMSE of PM ${}_{2.5}$ on DSTFN for 1–6 hour prediction with (a) different numbers of similar regions $k$ and (b) different historical time lags $l$ .

Table 7

The MAE and RMSE of PM ${}_{10}$ , NO ${}_{2}$ , CO, O ${}_{3}$ , SO ${}_{2}$ on the test set

Pollutant type	1–6 hour		7–12 hour		13–24 hour		25–48 hour
	MAE	RMSE	MAE	RMSE	MAE	RMSE	MAE	RMSE
PM ${}_{2.5}$	17.95	27.33	33.82	45.21	44.78	57.22	53.72	75.39
PM ${}_{10}$	26.70	36.50	38.30	50.74	46.03	59.68	59.07	80.79
NO ${}_{2}$	12.58	18.60	16.88	24.02	17.34	24.55	18.38	25.93
CO	0.31	0.57	0.43	0.74	0.49	0.82	0.53	0.88
O ${}_{3}$	15.61	23.48	18.32	25.92	20.89	30.39	22.45	32.30
SO ${}_{2}$	5.60	11.81	7.65	15.10	8.09	15.47	8.39	15.93

Figure 9.

The concentration distribution of all air pollutants: (a)–(f) are the distributions of PM ${}_{2.5}$ , PM ${}_{10}$ , NO ${}_{2}$ , CO, O ${}_{3}$ , SO ${}_{2}$ , respectively.

Figure 10.

The prediction of air pollutant concentration in the next 1 hour in monitoring station regions.

Figure 11.

PM ${}_{2.5}$ prediction in regions without monitoring station.

Taking PM ${}_{2.5}$ as an example, we show the effectiveness of data completion and embedding in Fig. 7 and Table 6. As shown in Fig. 7a, for all models, the MAE obtained on the completed dataset is smaller than that obtained on the dataset with missing values for 1–6 hour prediction. As shown in Fig. 7b, for all models, the MAE with embedding is smaller than the MAE without embedding for 1–6 hour prediction. As shown in Table 6, DSTFN with both data completion and embedding (DC&Embed) achieves the smallest MAE at every horizon. These results indicate that embedding improves the accuracy of air pollutant concentration prediction and that data completion based on tensor decomposition is effective.

5.5 Performance on different parameter setting

Taking PM ${}_{2.5}$ as an example, Fig. 8 shows the effect of the number of similar regions and the length of historical time lags. Fig. 8a demonstrates the effectiveness of the regional similarity matrix $\textit{Mat}_{S}$ : compared with using only the air quality data of the target region ( $k=0$ ), using $k$ similar regions selected from $\textit{Mat}_{S}$ yields a lower MAE. As $k$ increases, the MAE first decreases, reaches a minimum, and then increases. Using similar regions therefore helps, but the number of similar regions should be chosen appropriately. As shown in Fig. 8b, the length of historical time lags also affects the performance of our model: the MAE first decreases as $l$ increases, reaches a minimum, and then increases. Thus $l$ also needs to be chosen carefully.
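Since the error is U-shaped in both parameters, a simple sweep that keeps the value with the lowest held-out error is one way to choose them. The Python sketch below illustrates the idea under stated assumptions: `select_best` is a hypothetical helper, and `toy_validation_mae` is a placeholder curve standing in for "train the model with $k$ similar regions and return the validation MAE". It is not measured data, and the use of a separate validation split is our assumption.

```python
def select_best(candidates, evaluate):
    """Return the candidate with the lowest score, plus all scores."""
    scores = {c: evaluate(c) for c in candidates}
    best = min(scores, key=scores.get)
    return best, scores


def toy_validation_mae(k):
    # Placeholder U-shaped curve, NOT measured data. In practice this would
    # train the model with k similar regions and return the validation MAE.
    return 1.0 + (k - 3) ** 2


best_k, scores = select_best(range(0, 7), toy_validation_mae)
print(best_k, scores[best_k])
```

The same helper can be reused for the time lag $l$ by passing a different candidate range and evaluation function.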

5.6 Performance on predicting other pollutants

To demonstrate the effectiveness of our model for all air pollutants, we summarize the MAE and RMSE of PM ${}_{2.5}$ , PM ${}_{10}$ , NO ${}_{2}$ , CO, O ${}_{3}$ , SO ${}_{2}$ on the test set in Table 7. For all air pollutants, our model performs best for 1–6 hour prediction, and the errors grow as the prediction horizon increases. At all horizons, the results for PM ${}_{10}$ are worse than those for PM ${}_{2.5}$ , because much of the PM ${}_{10}$ data is missing, as can be seen from Table 2. The errors for NO ${}_{2}$ , O ${}_{3}$ and SO ${}_{2}$ are lower than those for PM ${}_{2.5}$ and higher than those for CO. This is reasonable because the concentrations of NO ${}_{2}$ , O ${}_{3}$ and SO ${}_{2}$ are mostly distributed between 0 and 200, which makes them more stable than PM ${}_{2.5}$ (mostly between 0 and 400) and less stable than CO (mostly between 0 and 10), as can be seen from Fig. 9.

5.7 Predicting results visualization

To better understand the effectiveness of our model, we visualize the difference between the predicted and true air pollutant concentrations in regions with monitoring stations, and the trend of air quality in regions without monitoring stations.

5.7.1 Visualization of predicted and true air pollutant values

We select 500 consecutive hours of next-hour predictions from the test set for visualization, as shown in Fig. 10. The predicted and true curves follow very similar trends and their values are close, especially for PM ${}_{2.5}$ . This result shows that our model predicts the next hour well.

5.7.2 Visualization of predicted value in regions without monitoring station

We use the trained DSTFN model to predict the change of PM ${}_{2.5}$ concentration across the whole of Beijing over 2 days. We select 6 hours from the 48-hour prediction for visualization, as shown in Fig. 11. The figure contains 6 prediction maps of the same area of Beijing at different hours. In each map, the small rectangular grids are colored according to the predicted PM ${}_{2.5}$ concentration. As can be seen from the figure, the PM ${}_{2.5}$ concentration in Beijing begins to increase in the southeast and spreads to the north. The industrial zone of Hebei Province lies to the southeast of Beijing, which suggests that the predictions for regions without monitoring stations are in line with the actual situation.

6. Conclusion

In this paper, we propose a deep spatial-temporal fusion network model to predict air pollutant concentration. Based on domain knowledge of air pollution, we integrate urban heterogeneous data to capture the factors affecting air quality. Compared with other state-of-the-art models on the same dataset, our approach achieves higher prediction accuracy in terms of MAE and RMSE. The advantages of our model are as follows:

(1) The concept of a regional similarity matrix is proposed to describe the spatial similarity between urban regions.

(2) The missing historical data of the monitoring stations are completed using a tensor decomposition method, which increases the number of training samples.

(3) The embedding method is used to transform the original high-dimensional features into low-dimensional vectors, which reduces the cost of network training.

(4) The BU-LSTM subnet learns the temporal-spatial information of air quality from historical air quality data in the target and similar regions. The spatial subnet uses region metadata to capture specific changes of air quality in certain regions. The temporal subnet uses meteorological data and weather forecast data to capture the impact of weather conditions on air quality.

References

1. WHO, [Online], https://www.who.int/air-pollution/news-and-events/how-air-pollution-is-destroying-our-health, Last accessed 4 Jun 2019.

2. Arystanbekova N.K., Application of Gaussian plume models for air pollution simulation at instantaneous emissions, Mathematics and Computers in Simulation 67(4–5) (2004), 451–458.

3. Kim M.J., Park R.J. and Kim J.-J., Urban air quality modeling with full O3–NOx–VOC chemistry: implications for O3 and PM air quality in a street canyon, Atmospheric Environment 47 (2012), 330–340.

4. Rakowska, Wong K.C., Townsend, Chan K.L., Westerdahl, Močnik, Drinovec and Ning, Impact of traffic volume and composition on the air quality and pedestrian exposure in urban street canyon, Atmospheric Environment 98 (2014), 260–270.

5. Godish, Davis W.T. and Fu J.S., Air quality, CRC Press, 2014.

6. Chen, Cai, Ding, Yuan and Chen, Spatially fine-grained urban air quality estimation using ensemble semi-supervised learning and pruning, in: Proceedings of the 2016 ACM International Joint Conference on Pervasive and Ubiquitous Computing, ACM, 2016, pp. 1076–1087.

7. Zhu J.Y., Sun and Li V.O., An extended spatio-temporal granger causality model for air quality estimation with heterogeneous urban big data, IEEE Transactions on Big Data 3(3) (2017), 307–319.

8. Zheng, Liu and Hsieh H.-P., U-air: When urban air quality inference meets big data, in: Proceedings of the 19th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, ACM, 2013, pp. 1436–1444.

9. Hasenfratz, Saukh, Walser, Hueglin, Fierz and Thiele, Pushing the spatio-temporal resolution limit of urban air pollution maps, in: Pervasive Computing and Communications (PerCom), 2014 IEEE International Conference on, IEEE, 2014, pp. 69–77.

10. Kumar and Goyal, Forecasting of daily air quality index in Delhi, Science of the Total Environment 409(24) (2011), 5517–5523.

11. Kumar and Jain, ARIMA forecasting of ambient air pollutants (O3, NO, NO2 and CO), Stochastic Environmental Research and Risk Assessment 24(5) (2010), 751–760.

12. LeCun, Bengio and Hinton, Deep learning, Nature 521(7553) (2015), 436.

13. Hochreiter and Schmidhuber, Long short-term memory, Neural Computation 9(8) (1997), 1735–1780.

14. Grossberg, Recurrent neural networks, Scholarpedia 8(2) (2013), 1888.

15. Cho, Van Merriënboer, Gulcehre, Bahdanau, Bougares, Schwenk and Bengio, Learning phrase representations using RNN encoder-decoder for statistical machine translation, arXiv preprint arXiv:1406.1078, 2014.

16. Zhang, Wang and Zheng, Deep distributed fusion network for air quality prediction, in: Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, ACM, 2018, pp. 965–973.

17. Wang, Song and Zhang Z.M., Deep air learning: Interpolation, prediction, and feature analysis of fine-grained air quality, IEEE Transactions on Knowledge and Data Engineering, 2018.

18. Kök İ., Şimşek M.U. and Özdemir, A deep learning model for air quality prediction in smart cities, in: Big Data (Big Data), 2017 IEEE International Conference on, IEEE, 2017, pp. 1983–1990.

19. Real-time Air Pollution prediction model based on Spatiotemporal Big data, arXiv preprint arXiv:1805.00432, 2018.

20. Tong, Zhou, Hamilton and Zhang, Deep learning PM2.5 concentrations with bidirectional LSTM RNN, Air Quality, Atmosphere & Health, 2019, 1–13.

21. Zhao, Deng, Cai and Chen, Long short-term memory-Fully connected (LSTM-FC) neural network for PM2.5 concentration prediction, Chemosphere 220 (2019), 486–492.

22. Lin, Mago, Gao, Chiang Y.-Y., Shahabi and Ambite J.L., Exploiting spatiotemporal patterns for accurate air quality forecasting using deep learning, in: Proceedings of the 26th ACM SIGSPATIAL International Conference on Advances in Geographic Information Systems, ACM, 2018, pp. 359–368.

23. Cheng, Shen, Zhu and Huang, A neural attention model for urban air quality inference: Learning the weights of monitoring stations, in: Thirty-Second AAAI Conference on Artificial Intelligence, 2018.

24. Kolda T.G. and Bader B.W., Tensor decompositions and applications, SIAM Review 51(3) (2009), 455–500.

25. Zhang and Gong, Spatiotemporal characteristics of urban air quality in China and geographic detection of their determinants, Journal of Geographical Sciences 28(5) (2018), 563–578.

26. Zhang, Zheng and Qi, Deep spatio-temporal residual networks for citywide crowd flows prediction, in: Thirty-First AAAI Conference on Artificial Intelligence, 2017.

27. Roweis S.T. and Saul L.K., Nonlinear dimensionality reduction by locally linear embedding, Science 290(5500) (2000), 2323–2326.

28. Kingma D.P. and Ba, Adam: A method for stochastic optimization, arXiv preprint arXiv:1412.6980, 2014.

29. Basak, Pal and Patranabis D.C., Support vector regression, Neural Information Processing-Letters and Reviews 11(10) (2007), 203–224.
