Learning traffic as videos: A spatio-temporal VAE approach to periodic traffic raster data imputation

Abstract

For modern Intelligent Transportation Systems (ITS), missing data during traffic raster acquisition can be inevitable because of loop detector malfunction or signal interference. Nevertheless, missing data imputation is meaningful because traffic raster data exhibit both periodic spatio-temporal characteristics and individual randomness. In this paper, the traffic raster data collected from all spatial regions at each time interval are considered as a multi-channel image. Accordingly, the traffic raster data over a period of time can be regarded as a video, on which an unsupervised generative neural network called MSST-VAE (Multiple Streams Spatial Temporal-VAE) is proposed for traffic raster data imputation. The model performs robustly at varied missing rates, where many other approaches fail. Two major innovations can be summarized in MSST-VAE. Firstly, it uses multiple periodic streams of Variational Auto-Encoders (VAEs) with Sylvester Normalizing Flows (SNFs), which show strong generalization ability. Secondly, after the traffic raster data are converted into videos, an ECB (Extraction-and-Calibration Block) consisting of dilated P3D gated convolution and a multi-horizon attention mechanism is employed to learn global-local-granularity spatial features and long-short-term temporal features. Extensive experiments on three real traffic flow datasets validate that MSST-VAE outperforms other classical traffic imputation models with the least imputation error.

Keywords

Intelligent transportation system, traffic raster data, data imputation

1. Introduction

Along with population growth, traffic raster data have become one of the largest kinds of data in Intelligent Transportation Systems (ITS). As in other typical Internet of Things (IoT) scenarios, the operating conditions can be quite unfavorable, with temperatures that are too low or too high, strong wind, intense sun exposure and so on [1].

For one thing, harsh conditions may age the loop detectors and compromise the integrity of data collection. For another, poor signal strength can result in the loss of transmitted data. According to [48], traffic data missing rates in China and the USA are 10% and 15% respectively, which makes missing data a common problem for ITS [43]. Nonetheless, many decision-making algorithms in ITS (such as [2, 3]) require traffic raster data to be complete and consecutive. Missing values can severely impact the performance of these algorithms, and some of them cannot even function normally [42]. Therefore, to ensure the normal running of ITS, it is of great significance to fill in the missing traffic raster data before other algorithms or modules use them.

Traffic raster data, which are collected by loop sensors embedded underneath the roads to measure the traffic flow [4], are a typical data form in ITS. In this paper, the problem of traffic raster data imputation is investigated by learning the data as videos. The transportation network map of a city can be segmented into grids, and the raster data corresponding to each grid show multiple features (e.g., traffic inflow and outflow) at each temporal interval. For each interval, the raster data can be considered as an image snapshot with multiple channels. Therefore, the urban traffic raster data over a time period can be learned as a video. The video-like traffic raster data show three strong features: a temporal feature among images at different timestamps, a spatial feature within one image, and a periodic feature as a whole [25]. More specifically, two kinds of periodicity can be distinguished: short-term periodicity and long-term periodicity. Traffic raster data show completely different features at these two periodicities; take Beijing as an example. Within one day, traffic flow may peak between 8:00 and 9:00 because most people go to work during this period, which can be treated as a short-term periodicity feature. Meanwhile, within one year, the traffic flow between 8:00 and 9:00 may be lowest from January to February because many people are on holiday in these months, which can be treated as a long-term periodicity feature. Therefore, traffic raster data should be split according to different periodicities, and the corresponding features should be learned separately.

Two challenges appear when populating missing values of traffic raster data in ITS: 1) how to join the periodic temporal feature with the spatial feature fully and effectively; 2) how to complete the imputation task while considering both universal periodic spatio-temporal dependencies and individual stochasticity at the same time. To tackle these challenges, and inspired by [5], a novel generative neural network called MSST-VAE (Multiple Streams Spatial Temporal-VAE) is proposed to model spatio-temporal representations of traffic raster data robustly and then populate the missing values accurately. The model is mainly based on multiple streams of variational auto-encoders (VAEs), which capture the probability densities of hidden variables rather than deterministic latent representations, giving MSST-VAE the ability to generalize. The periodic feature is learned through the multiple streams as well. To further generalize the distribution of traffic raster data, Sylvester normalizing flows (SNFs) [6] are adopted to transform an isotropic normal distribution into a more flexible one through a series of invertible transformations, which adds robustness to MSST-VAE. Moreover, we propose a module named Extraction-and-Calibration Block (ECB) for the encoder part of MSST-VAE. ECB is composed of two main techniques: dilated P3D gated convolution and a multi-horizon attention mechanism. The former is utilized to extract global-local-granularity spatial dependencies and long-short-term temporal dependencies. The latter is employed to fine-tune features in the channel, time, and space dimensions.

The contribution of the study can be encapsulated in the following:

Traffic raster data are viewed as videos, a novel perspective that considers spatial and temporal dependencies simultaneously.

MSST-VAE, a generative neural network based on multiple periodic VAE streams, is proposed to ensure the generalization ability of the model, which performs robustly at varied missing rates. To the best of our knowledge, this is the first work that applies multiple periodic streams of VAEs together with normalizing flows to the traffic raster data imputation problem.

Dilated P3D gated convolution and a multi-horizon attention mechanism are proposed and deployed in ECB to capture global-local-granularity spatial dependencies and long-short-term temporal dependencies, so that the spatio-temporal features in traffic raster data can be fully learned from all aspects.

Extensive comparison and ablation experiments are conducted on three real-world traffic datasets to verify that MSST-VAE exhibits better performance than classical traffic imputation approaches under different missing rates.

The rest of this paper is organized as follows: Section 2 introduces and analyzes the related work about traffic data imputation. Section 3 explains the preliminary knowledge of this paper. Section 4 describes the architecture of MSST-VAE network at length. Section 5 presents the experiment settings and processes in detail, then discusses and analyzes the results obtained in the experiment. Section 6 summarizes the conclusions we get.

2. Related works

To address the problem of various traffic data imputation, plenty of solutions have been put forward, which may mainly be summarized into five groups: the prediction-based method, the interpolation-based method, the decomposition-based method, the statistic method, and deep learning method.

2.1 Prediction-based method

The prediction-based method builds a connection between historical values and future missing values. Auto-regressive integrated moving average (ARIMA), one of the most popular prediction models for time series, is utilized in [7] for the imputation of traffic counts for highway agencies. [10] presents an approach to selecting and estimating ARIMA models for traffic volume imputation. However, the obvious shortcoming of this method is that the information following the missing data is not used during imputation, making it less effective.

2.2 Interpolation-based method

The interpolation-based approach fills the missing value with neighboring values. For spatio-temporal data, a neighboring value can be either a temporal neighbor or a spatial neighbor. [8] proposes a data-driven imputation method for sections of road based on their spatial and temporal correlation using a modified K-Nearest Neighbor method. [9] compares Local Least Squares (LLS) with K-Nearest Neighbor for missing traffic data imputation. Both studies use interpolation-based approaches. Nevertheless, this kind of method is deterministic and cannot capture stochastic variations in spatio-temporal data.

2.3 Decomposition-based method

The decomposition-based approach decomposes the original tensor into multiple lower-dimension tensors, and then fills in the missing values on the lower-dimension tensors. Liu et al. [11] first proposed High Accuracy Low Rank Tensor Completion (HaLRTC), which extends the matrix case to the tensor case by proposing the first definition of the trace norm for tensors and then building a working algorithm. [12] estimates traffic speed for ITS by combining HaLRTC and k-means++. Though the accuracy of this kind of approach is high, the computational complexity is sometimes prohibitive.

2.4 Statistic-based method

The statistic-based approach learns the statistical rule of the data directly, and undertakes the imputation task based on that rule. The most widely used method is Principal Component Analysis (PCA), and many improvements have been made to it. [13] employs PPCA (Probabilistic Principal Component Analysis) to give PCA a probabilistic interpretation when dealing with traffic flow value imputation, while [14] adds a Bayesian network to PCA to solve the same problem, which is known as BPCA (Bayesian Principal Component Analysis). [15] uses KPPCA (Kernel Probabilistic Principal Component Analysis) as the imputation method without any temporal or spatial dependency, and replaces the linear kernel in PCA with a non-linear kernel. However, this kind of approach focuses too much on constructing a probability distribution model, and thus loses detailed information that varies across situations.

2.5 Deep learning method

In recent years, deep learning (DL) methods have made noticeable achievements in many fields, and various DL approaches have been adapted to deal with the data imputation problem. [16] utilizes BRITS, an approach based on recurrent neural networks for imputing missing values in time series data without any specific assumption. Though the bidirectional temporal dependency within a time series is considered during data imputation, little attention is given to the spatial dependency among different time series. [17] first transforms raw data into spatial-temporal images and then implements the data imputation task with the help of a convolutional neural network (CNN). In [18], a deep auto-encoder for missing spatio-temporal data imputation is put forward, and the auto-encoder is designed as a combination of CNN and Bi-LSTM (Bi-directional Long Short-Term Memory). Moreover, [19] deploys a CNN-based GAN (Generative Adversarial Network) to reconstruct missing multivariate time series. All these deep learning methods are limited to 2D convolution when dealing with spatial feature extraction, and long-short-term temporal dependencies are not considered.

3. Preliminaries

3.1 Missing data type

According to [20], missing data can be summarized into three categories: Missing Completely at Random (MCAR), Missing at Random (MAR) and Missing Not at Random (MNAR).

If the missingness is unrelated to both observed and unobserved data, it is MCAR, which may happen in any region at any time randomly. MAR refers to missingness that can be described using observed data. For example, missing in-flow data in one region may be related to missing out-flow data in adjacent regions. When the missingness depends on an unobserved attribute or on the missing attribute itself [21], we call it MNAR. MNAR occurs when devices are in long-term failure. This study mainly focuses on the former two kinds of missingness, for it is of little significance to populate the missing values in MNAR.

3.2 Traffic raster data

As [20] depicts, an urban traffic network can be regarded as a raster map $G$ with longitude and latitude containing $H$ rows and $W$ columns. Each grid in $G$ means one region. For each region $(h,w)$ , $C$ different channels are measured at each timestamp. In this paper, $C$ is set to $2$ for in-flow and out-flow of one region. Therefore, if the map $G$ is considered as a whole, the measurements over a period of time, which involves $T$ intervals, can be considered as a tensor $x\in\mathbb{R}^{C*T*H*W}$ .

3.3 Problem statement

Under the MCAR assumption that data are missing completely at random [21], incomplete traffic raster data that are i.i.d. can be described as a set ${\bm{X}}=\{x_{i}\}^{n}_{i=1}$ with each $x_{i}\in\mathbb{R}^{C\ast T\ast H\ast W}$ . Correspondingly, another set ${\bm{M}}=\{m_{i}\}^{n}_{i=1}\in M^{n}$ is defined as a mask set for ${\bm{X}}$ , where $M=\{0,1\}^{C\ast T\ast H\ast W}$ . When $x_{i}[c,t,h,w]$ is observed, $m_{i}[c,t,h,w]=0$ , whereas when $x_{i}[c,t,h,w]$ is missing, $m_{i}[c,t,h,w]=1$ .

For the tensor-based traffic raster data imputation discussed in this paper, an incomplete tensor $x$ in ${\bm{X}}$ is divided into an observed part $x^{\textit{obs}}$ and a missing part $x^{\textit{mis}}$ . The objective of this paper is to populate the missing values in $x^{\textit{mis}}$ from $x^{\textit{obs}}$ .
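To make the notation concrete, the following is a minimal sketch of one raster sample and an MCAR mask. The synthetic values, the 30% missing rate, the helper name `make_mcar_mask`, and the convention that 1 marks a missing entry are all assumptions made for illustration, not settings taken from the experiments.

```python
import numpy as np

def make_mcar_mask(shape, missing_rate, rng):
    """Return a {0,1} mask; 1 marks a missing entry (assumed convention)."""
    return (rng.random(shape) < missing_rate).astype(np.int8)

rng = np.random.default_rng(0)
C, T, H, W = 2, 48, 32, 32           # in/out flow, 48 intervals, 32x32 grid
x_full = rng.random((C, T, H, W))    # synthetic stand-in for real flows
m = make_mcar_mask(x_full.shape, missing_rate=0.3, rng=rng)

# Missing entries are unknown to the model (NaN here); observed ones are kept.
x = np.where(m == 1, np.nan, x_full)
x_obs = x[m == 0]                    # values available for imputation
n_mis = int(m.sum())                 # number of entries to impute
```

Because every entry is dropped independently with the same probability, the mask does not depend on the data, which is what the MCAR assumption requires.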

4. Method

In this part, the architecture of MSST-VAE is presented at first, and then we propose the crucial modules ECB (Extraction-and-Calibration Block) and TCB (Transposed Convolution Block) used in MSST-VAE. Finally, we describe the engineering practice of our model in the ITS, which includes two parts: offline model training and online imputation.

As Fig. 1 shows, Multiple Streams Spatial Temporal-VAE (MSST-VAE) is mainly based on multiple periodic streams of variational auto-encoders and contains four parts: Pre-processing, Encoder, Sylvester NFs, and Decoder. Generally, the multiple streams help to capture the periodic feature of traffic raster data as a whole. The VAE is utilized to model the distribution of hidden variables rather than deterministic latent representations, and then to rebuild the data. In this way, both universal periodic spatio-temporal dependencies and individual stochasticity are considered at the same time when imputing missing values. However, the distribution of latent variables in a traditional VAE is limited to a Gaussian distribution, which makes the model less robust. To make the approximate posterior of amortized variational inference more flexible, Sylvester normalizing flows (SNFs), an iterative process, are adopted to reshape a diagonal Gaussian distribution into a more flexible one. Moreover, Extraction-and-Calibration Blocks (ECBs) are used in the Encoder part to extract global-local-granularity spatial features and long-short-term temporal features with the help of dilated P3D gated convolution and the multi-horizon attention mechanism.

Figure 1.

Overall architecture of MSST-VAE. The upper part and the bottom part are for long and short periodicity streams respectively.

4.1 Overall network architecture

1) Pre-processing. It is known that traffic raster data show strong hourly, daily, weekly, and yearly periodicity when regarded as a whole [40]. To reflect this multi-periodic feature (mentioned in Section 1), the original traffic raster data are segmented into two kinds of samples ${\bm{X}}^{s}$ and ${\bm{X}}^{l}$ with different sizes (i.e., hour and day), corresponding to short-term periodicity and long-term periodicity. The two kinds of samples are then sent into two streams (as shown in Fig. 1) to learn short-term and long-term periodicity features respectively. This two-stream model is just an illustrative example and can be generalized to more streams as needed. The segment size is recommended to follow a natural time period. For example, supposing that the sampling frequency of the sensor is $q$ times per hour, the segment size for short-term periodicity can be set to $q$ and that for long-term periodicity to $24*q$ . In other words, the former samples represent hourly data, and the latter represent daily data. More details about the pre-processing of the datasets used in this paper are given in Section 5.1.
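The segmentation rule can be sketched as follows. This is an illustration only: the function name `segment`, the non-overlapping windows, and the dropping of leftover intervals are assumptions, since the text fixes only the segment sizes $q$ and $24*q$.

```python
import numpy as np

def segment(raster, length):
    """Split a (C, N, H, W) raster sequence into non-overlapping samples of
    shape (C, length, H, W). Trailing intervals that do not fill a whole
    sample are dropped (an assumption of this sketch)."""
    C, N, H, W = raster.shape
    n = N // length
    trimmed = raster[:, : n * length]
    # (C, n, length, H, W) -> (n, C, length, H, W)
    return trimmed.reshape(C, n, length, H, W).transpose(1, 0, 2, 3, 4)

q = 2                                      # e.g. one reading every 30 minutes
raster = np.zeros((2, 7 * 24 * q, 8, 8))   # one synthetic week on an 8x8 grid
X_s = segment(raster, q)                   # hourly samples, short-term stream
X_l = segment(raster, 24 * q)              # daily samples, long-term stream
```

With $q=2$, one week yields 168 hourly samples of 2 intervals each and 7 daily samples of 48 intervals each.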

2) Encoder. After pre-processing, both streams of traffic raster data enter the Encoder and pass through four ECBs (elaborated in Section 4.2) in sequence to generate the deterministic hidden representation denoted as $h$ . Then $h$ is flattened before entering the fully-connected layers, which parameterize the mean $\mu$ and log variance $\log\sigma^{2}$ of the original latent variable $z_{0}$ . $z_{0}$ is then obtained by the reparametrization trick [21] as follows, where $\varepsilon$ is sampled from ${\cal N}\left({0,1}\right)$ .

$\displaystyle z_{0}=\mu+\sigma\otimes\varepsilon$ (1)
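As a minimal sketch of Eq. (1), the sampling step can be written as below, assuming the dense layers output the mean and the log variance so that the standard deviation is recovered as $\exp(0.5\log\sigma^{2})$. The function name and the toy values are illustrative only.

```python
import numpy as np

def reparameterize(mu, log_var, rng):
    """Draw z0 = mu + sigma * eps with eps ~ N(0, I), as in Eq. (1)."""
    sigma = np.exp(0.5 * log_var)          # log sigma^2 -> sigma
    eps = rng.standard_normal(mu.shape)    # noise independent of mu, sigma
    return mu + sigma * eps

rng = np.random.default_rng(0)
mu = np.array([1.0, -2.0])
log_var = np.log(np.array([0.25, 4.0]))    # sigma = 0.5 and 2.0
samples = np.stack([reparameterize(mu, log_var, rng) for _ in range(20000)])
```

Since the randomness enters only through $\varepsilon$, gradients can flow through $\mu$ and $\sigma$ during training.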

3) Sylvester NFs. However, the output of the encoder $z_{0}$ obeys the probabilistic posterior $q_{0}(z_{0}|x)\sim{\cal N}\left({\mu,\sigma^{2}I}\right)$ , which may limit the expressiveness of the vanilla VAE [21]. Therefore, Berg et al. [6] proposed Sylvester NFs, which enhance the flexibility of the distribution $q_{\phi}(z|x)$ . A series of invertible mappings $f_{K}\circ\cdots\circ f_{k}\circ\cdots\circ f_{1}\left({z_{0}}\right)$ is utilized to transform $z_{0}$ into $z_{K}$ , where $f_{k}\left({z_{k-1}}\right)=z_{k-1}+QR\tanh(\widetilde{R}Q^{T}z_{k-1}+b)$ . $R$ and $\widetilde{R}$ are upper triangular parameter matrices, whereas $Q$ and $b$ are a matrix with orthonormal columns and a parameter vector respectively. All the parameters can be calculated by a dependent dense layer. The log posterior $\log q_{K}(z_{K}|x)$ after the final iteration can be expressed as in Eq. (2), where $\text{det}\left(\frac{\partial f_{k}\left(z_{k-1}\right)}{\partial z_{k-1}}\right)$ is the Jacobian determinant of the $k$-th transformation $f_{k}$ . It is assumed that $q_{\phi}\left({z|x}\right):=q_{K}\left(z_{K}|x\right)$ , and then $z_{K}$ is fed into the decoder part.

$\displaystyle\log q_{K}(z_{K}|x)=\log q_{0}(z_{0}|x)-\sum_{k=1}^{K}\log\left|\text{det}\left(\frac{\partial f_{k}\left(z_{k-1}\right)}{\partial z_{k-1}}\right)\right|$ (2)
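A single Sylvester transformation and its log-determinant can be sketched as follows. This follows the general form in [6] with a tanh nonlinearity; the random parameters, the dimensions, and the function name `sylvester_step` are assumptions for illustration, and the constraints that guarantee invertibility are not enforced here.

```python
import numpy as np

def sylvester_step(z, Q, R, R_tilde, b):
    """One transformation f(z) = z + Q R tanh(R~ Q^T z + b).

    Q: (D, M) with orthonormal columns; R, R_tilde: (M, M) upper triangular.
    Returns the transformed vector and log|det(df/dz)|.
    """
    pre = R_tilde @ (Q.T @ z) + b
    z_new = z + Q @ (R @ np.tanh(pre))
    # Sylvester's determinant identity reduces the D x D determinant to an
    # M x M one; that matrix is triangular, so only diagonals are needed.
    dh = 1.0 - np.tanh(pre) ** 2
    diag = 1.0 + dh * np.diag(R_tilde) * np.diag(R)
    return z_new, np.sum(np.log(np.abs(diag)))

rng = np.random.default_rng(0)
D, M = 6, 3
Q, _ = np.linalg.qr(rng.standard_normal((D, M)))   # orthonormal columns
R = np.triu(rng.standard_normal((M, M)))
R_tilde = np.triu(rng.standard_normal((M, M)))
b = rng.standard_normal(M)
z0 = rng.standard_normal(D)

z1, log_det = sylvester_step(z0, Q, R, R_tilde, b)

# Cross-check against the full D x D Jacobian.
pre = R_tilde @ Q.T @ z0 + b
J = np.eye(D) + Q @ R @ np.diag(1.0 - np.tanh(pre) ** 2) @ R_tilde @ Q.T
_, log_det_full = np.linalg.slogdet(J)
```

The cheap diagonal formula and the full Jacobian give the same log-determinant, which is what makes the sum in Eq. (2) inexpensive to evaluate.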

4) Decoder. The decoder deploys a single dense layer and a reshaping operation to transform $z_{K}$ back to a tensor, and then utilizes three TCBs for up-sampling (elaborated in Section 4.2). Then, the reconstructed sample is obtained through the last 3D transposed convolution and a tanh function. In the end, the reconstructed samples $\widehat{X^{l}}$ and $\widehat{X^{s}}$ derived from the two streams are fused to obtain the final reconstructed traffic raster data $\widehat{X}$ by Eq. (3), where $W^{l}$ and $W^{s}$ are parameter matrices that are trained through the whole network.

$\displaystyle\widehat{X}=W^{l}\odot\widehat{X^{l}}+W^{s}\odot\widehat{X^{s}}$ (3)

4.2 Major modules

1) Extraction-and-Calibration Block. As the major component of the encoder, the Extraction-and-Calibration Block (ECB) mainly involves two sub-modules: dilated P3D gated convolution and the multi-horizon attention mechanism. Assume that a feature map $F_{l-1}\in\mathbb{R}^{C\ast T\ast H\ast W}$ is the input of the $l$-th ECB. By applying the dilated P3D gated convolution $G$ , ECB first extracts spatial-temporal information from $F_{l-1}$ and pays more attention to the valid entries to extract important features. Next, it further calibrates the feature maps with the multi-horizon attention mechanism, which combines channel attention $M_{c}$ , temporal attention $M_{t}$ and cross-spatial attention $M_{cs}$ . Hence the name Extraction-and-Calibration Block, as shown in Fig. 2. The overall procedure of ECB is given by the following equations:

$\displaystyle F^{\prime}_{l-1}=G\left(F_{l-1}\right)$ (4)

$\displaystyle F_{l}=M_{c}\left(F^{\prime}_{l-1}\right)\odot M_{t}\left(F^{\prime}_{l-1}\right)\odot M_{cs}\left(F^{\prime}_{l-1}\right)\odot F^{\prime}_{l-1}$ (5)

where $F^{\prime}_{l-1}\in\mathbb{R}^{C^{\prime}*T^{\prime}*H^{\prime}*W^{\prime}}$ is an intermediate feature map, while $F_{l}\in\mathbb{R}^{C^{\prime}*T^{\prime}*H^{\prime}*W^{\prime}}$ is the output of l-th ECB and $\odot$ denotes element-wise product. $M_{c}$ , $M_{t}$ , and $M_{cs}$ can be calculated in Eqs (6), (7), and (8) respectively. Furthermore, the dilated P3D gated convolution and multi-horizon attention mechanisms are elaborated in the following.
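The calibration step in Eq. (5) relies on broadcasting: each attention map has singleton dimensions everywhere except along the axes it weights. The sketch below uses random stand-ins for the three attention maps (the real ones are computed from $F^{\prime}_{l-1}$); the shapes follow the attention-map definitions, and the function name is hypothetical.

```python
import numpy as np

def calibrate(F_prime, M_c, M_t, M_cs):
    """Eq. (5): scale F' by the three attention maps via broadcasting."""
    return M_c * M_t * M_cs * F_prime

C, T, H, W = 4, 6, 8, 8
rng = np.random.default_rng(0)
F_prime = rng.standard_normal((C, T, H, W))
M_c = rng.random((C, 1, 1, 1))     # one weight per channel
M_t = rng.random((1, T, 1, 1))     # one weight per time step
M_cs = rng.random((1, 1, H, W))    # one weight per grid cell
F_l = calibrate(F_prime, M_c, M_t, M_cs)
```

Each output entry is therefore the input entry scaled by the product of its channel, time, and location weights.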

Figure 2.

The workflow of the Extraction-and-Calibration Block (ECB) where $\otimes$ denotes the element-wise product which obeys broadcasting rules.

Figure 3.

Left Part: Schematic Diagram of 3D CNN. Right Part: The designs for P3D CNN [27].

a) Dilated P3D gated convolution. To extract spatio-temporal features from video-like data, one natural way is to extend the convolution kernels in a CNN from 2D to 3D, as [26] does for video in-painting. To be more specific, in each layer of a 3D CNN, a set of $N$ maps of size $D*H*W$ is convolved with $M$ sets of filters of size $K^{D}*K^{H}*K^{W}$ , as shown in Fig. 3. However, a traditional 3D CNN incurs huge computation and memory overheads [28], so [27] proposes the Pseudo-3D Residual Net (P3D ResNet) to solve this problem. The key point of P3D lies in using 2D convolutions to encode spatial information and 1D convolutional filters for the temporal dimension. Moreover, [27] proposes three designs for P3D, as depicted in Fig. 3. MSST-VAE is mainly based on design (a), a stacked architecture in which temporal 1D filters (T) follow spatial 2D filters (S) in a cascaded manner [27].
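The saving from the factorization can be illustrated with a simple weight count. The sketch below compares a full 3D kernel with a cascaded spatial-then-temporal pair; it is a simplification that ignores biases and the bottleneck layers of the actual P3D ResNet, and it assumes both factored convolutions use the same number of output channels.

```python
def conv3d_params(c_in, c_out, kd, kh, kw):
    """Weights in a full 3D convolution (bias omitted)."""
    return c_in * c_out * kd * kh * kw

def p3d_params(c_in, c_out, kd, kh, kw):
    """Weights in a spatial 1 x kh x kw convolution followed by a temporal
    kd x 1 x 1 convolution, both with c_out output channels (assumed)."""
    spatial = c_in * c_out * 1 * kh * kw
    temporal = c_out * c_out * kd * 1 * 1
    return spatial + temporal

full = conv3d_params(64, 64, 3, 3, 3)      # 110592 weights
factored = p3d_params(64, 64, 3, 3, 3)     # 49152 weights
```

For a 3 x 3 x 3 kernel with 64 input and output channels, the factored pair needs fewer than half the weights of the full 3D kernel.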

Different from [27], in dilated P3D gated convolution, two parts (S and T) of P3D CNN are optimized respectively to capture global-local-granularity spatial dependencies and long-short-term temporal dependencies. Besides, an additional gating mechanism is utilized as an attention map on the output features.

Figure 4.

Overview of dilated P3D gated convolution, which consists of two workflows.

The workflow of dilated P3D gated convolution is displayed in Fig. 4, in which the upper part performs nonlinear spatio-temporal feature learning while the bottom part acts as a gate that applies soft weights to the upper part through a sigmoid function. Finally, both parts are fused by the element-wise product. Dilated convolution was first proposed in [29] to increase the receptive field of the convolution operation.

It should be noted that the spatial 2D filters of the P3D CNN in [27] are replaced with 2D-dilated convolution to extract global-local-granularity spatial features, and the temporal 1D filters are replaced with 1D-dilated convolution to extract long-short-term temporal features. More details about these two parts are discussed below.
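The gating idea and the effect of dilation can be illustrated on a single time series. The sketch below is a pure-Python toy, not the network layer itself: the zero padding, the odd kernel length, the tanh nonlinearity on the feature branch, and both function names are assumptions.

```python
import math

def dilated_conv1d(x, w, dilation):
    """Same-length 1D cross-correlation with zero padding. Tap j of an
    odd-length kernel reads x[t + (j - len(w) // 2) * dilation]."""
    half = len(w) // 2
    out = []
    for t in range(len(x)):
        acc = 0.0
        for j, wj in enumerate(w):
            idx = t + (j - half) * dilation
            if 0 <= idx < len(x):
                acc += wj * x[idx]
        out.append(acc)
    return out

def gated_dilated_conv1d(x, w_feat, w_gate, dilation):
    """Feature branch scaled element-wise by a sigmoid gate branch."""
    feat = [math.tanh(v) for v in dilated_conv1d(x, w_feat, dilation)]
    gate = [1.0 / (1.0 + math.exp(-v))
            for v in dilated_conv1d(x, w_gate, dilation)]
    return [f * g for f, g in zip(feat, gate)]

x = [0.0] * 9
x[4] = 1.0                                   # unit impulse at t = 4
spread = dilated_conv1d(x, [1.0, 1.0, 1.0], dilation=3)
y = gated_dilated_conv1d(x, [1.0, 1.0, 1.0], [0.0, 0.0, 0.0], dilation=3)
```

With dilation 3, a three-tap kernel spreads the impulse to positions 1, 4, and 7, so the receptive field spans 7 steps with the same number of weights. A zero gate kernel gives a gate of 0.5 everywhere, halving the feature branch.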

According to [38], treating traffic raster data as images and simply applying a plain CNN may not achieve the best performance, because correlations among ‘far away’ regions in the image are not taken into account. Inspired by [39], 2D-dilated spatial convolution (i.e., the blue block in Fig. 5) is utilized in dilated P3D gated convolution to capture local-granularity and global-granularity spatial features. Its structure is shown in Fig. 5. For each layer of ECB, the traffic raster image at time interval $t$ is convolved with kernels of different sizes. The feature maps derived from the various kernel sizes are concatenated along the channel dimension. Then, the concatenated feature maps are fed into the next layer, which uses the same set of kernel sizes but a smaller dilation factor, and so on.

Figure 5.

Network of dilated P3D convolution used in ECB. The bottom part is the 1D-dilated temporal convolution and the upper part is the 2D-dilated spatial convolution. The yellow blocks and blue blocks represent the receptive fields of the 1D-dilated temporal and 2D-dilated spatial convolutions respectively, and the red dots represent the actual feature maps that are convolved.

According to [37], a time series is characterized by its numerical and continuous nature, so it is always seen as a whole instead of as individual numerical fields. Traffic raster data can be viewed as 2D time series. Therefore, 1D-dilated temporal convolution (i.e., the yellow block in Fig. 5) is utilized in dilated P3D gated convolution to capture long-term and short-term temporal features. Its structure is similar to that of the 2D-dilated spatial convolution, as shown in Fig. 5. The feature maps derived from the 2D-dilated spatial convolution are convolved with a 1D kernel along the temporal dimension. Kernel sizes are the same in most layers, but differ in one certain layer. As the network goes deeper, the dilation factor gradually decreases.

b) Multi-horizon attention mechanism. The Convolutional Block Attention Module (CBAM) [22] applies channel attention and spatial attention to image classification tasks. In this study, we extend the spatial attention of CBAM to cross-spatial attention and add temporal attention, as shown in Fig. 6. In the multi-horizon attention mechanism, channel attention is utilized to learn the relations between channels, such as traffic inflow and outflow, while temporal attention and cross-spatial attention are employed to strengthen or weaken the influence of features computed by dilated P3D gated convolution. Moreover, max-pooling and average-pooling are introduced to accelerate computation and to avoid over-fitting.

Figure 6.

The overview of multi-horizon attention mechanism consisting of channel attention (left), temporal attention (left) and cross-spatial attention (right).

For channel attention, we first perform max-pooling and average-pooling operations to generate two channel-wise variables $v_{l-1}^{\max}\in\mathbb{R}^{C\ast 1\ast 1\ast 1}$ and $v_{l-1}^{\textit{avg}}\in\mathbb{R}^{C\ast 1\ast 1\ast 1}$ from the feature map $F^{\prime}_{l-1}$ . Then both variables are fed independently into the same two dense layers, one for channel reduction from $C$ to $C/r$ ( $r$ is the reduction ratio) and the other for channel recovery back to $C$ . Finally, the two outputs are summed and passed through a sigmoid function to get the channel attention map $M_{c}\in\mathbb{R}^{C\ast 1\ast 1\ast 1}$ . The computation is as follows:

$\displaystyle M_{c}=\textit{Sigmoid}\left(\textit{MLP}\left(\textit{MaxPooling}\left(F_{l-1}^{\prime}\right)\right)+\textit{MLP}\left(\textit{AvgPooling}\left(F_{l-1}^{\prime}\right)\right)\right)=\textit{Sigmoid}\left(\textit{MLP}\left(v_{l-1}^{\max}\right)+\textit{MLP}\left(v_{l-1}^{\textit{avg}}\right)\right)$ (6)

The architecture of temporal attention is the same as that of channel attention, and the temporal attention map $M_{t}\in\mathbb{R}^{1\ast T\ast 1\ast 1}$ is calculated correspondingly as follows:

$\displaystyle M_{t}=\textit{Sigmoid}\left(\textit{MLP}\left(\textit{MaxPooling}\left(F_{l-1}^{\prime}\right)\right)+\textit{MLP}\left(\textit{AvgPooling}\left(F_{l-1}^{\prime}\right)\right)\right)=\textit{Sigmoid}\left(\textit{MLP}\left(v_{l-1}^{\max}\right)+\textit{MLP}\left(v_{l-1}^{\textit{avg}}\right)\right)$ (7)
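Eqs. (6) and (7) share one pattern: pool over every axis except the one of interest, pass both pooled vectors through a shared two-layer MLP, sum, and apply a sigmoid. A minimal sketch under stated assumptions follows: the ReLU between the two dense layers, the absence of biases, the random weights, the use of the same reduction ratio for the temporal axis, and the helper name `axis_attention` are not specified in the text.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def axis_attention(F, axis, W1, W2):
    """Attention map along one axis of a (C, T, H, W) feature map.

    Max- and average-pooled descriptors share one two-layer MLP
    (W1: (n // r, n), W2: (n, n // r)); ReLU in between is assumed.
    """
    other = tuple(a for a in range(F.ndim) if a != axis)
    v_max = F.max(axis=other)                  # shape (n,)
    v_avg = F.mean(axis=other)                 # shape (n,)
    mlp = lambda v: W2 @ np.maximum(W1 @ v, 0.0)
    weights = sigmoid(mlp(v_max) + mlp(v_avg))
    shape = [1] * F.ndim
    shape[axis] = F.shape[axis]
    return weights.reshape(shape)              # broadcastable against F

rng = np.random.default_rng(0)
C, T, H, W = 8, 12, 4, 4
r = 4                                          # reduction ratio
F_prime = rng.standard_normal((C, T, H, W))
M_c = axis_attention(F_prime, 0, rng.standard_normal((C // r, C)),
                     rng.standard_normal((C, C // r)))
M_t = axis_attention(F_prime, 1, rng.standard_normal((T // r, T)),
                     rng.standard_normal((T, T // r)))
```

The returned maps have shapes $C\ast 1\ast 1\ast 1$ and $1\ast T\ast 1\ast 1$, so they can be multiplied directly with the feature map.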

For cross-spatial attention, max-pooling and average-pooling are applied to the feature map $F^{\prime}_{l-1}$ to produce two 2D maps $u_{l-1}^{\max}\in\mathbb{R}^{1\ast 1\ast H\ast W}$ and $u_{l-1}^{\textit{avg}}\in\mathbb{R}^{1\ast 1\ast H\ast W}$ . Next, the two maps are concatenated along the channel dimension and sent to the cross-convolution part, followed by a sigmoid function, to create a cross-spatial attention map $M_{cs}\in\mathbb{R}^{1\ast 1\ast H\ast W}$ .

$\displaystyle M_{cs}=\textit{Sigmoid}\left(\textit{CrossConv}\left(\left[\textit{MaxPooling}\left(F^{\prime}_{l-1}\right);\textit{AvgPooling}\left(F^{\prime}_{l-1}\right)\right]\right)\right)=\textit{Sigmoid}\left(\textit{CrossConv}\left(\left[u_{l-1}^{\max};u_{l-1}^{\textit{avg}}\right]\right)\right)$ (8)

c) Cross-Convolution. Cross-Convolution mainly consists of two operations, Lat-Conv and Lon-Conv, as Fig. 7 depicts. The former learns spatial attention along the latitude dimension and the latter learns spatial attention along the longitude dimension, a structure that is typical of traffic raster data. The two operations are applied in sequence as follows:

$\displaystyle F_{l-1}^{\prime\prime}=\alpha\cdot\textit{LatConv}\left(F_{l-1}^{\prime}\right)+F_{l-1}^{\prime}$ (9)

$\displaystyle\textit{CrossConv}\left(F_{l-1}^{\prime}\right)=\beta\cdot\textit{LonConv}\left(F_{l-1}^{\prime\prime}\right)+F_{l-1}^{\prime\prime}$ (10)

where $\alpha$ and $\beta$ are scalar values, and LatConv and LonConv are 2D convolutions with kernel sizes of $1*W$ and $1*H$ respectively. Eventually, the feature map $F_{l}$ of the next layer is obtained by the element-wise product of $M_{c}$ , $M_{t}$ , $M_{cs}$ and $F^{\prime}_{l-1}$ , as explained in Eq. (5) above.

Figure 7.

Detailed design of Cross Convolution.

2) Transposed Convolution Block. The Transposed Convolution Block (TCB) is the main component for up-sampling in the Decoder part of MSST-VAE. Compared with ECB, the architecture of TCB is simpler, for it consists only of P3D gated transposed convolution, without the multi-horizon attention mechanism. The workflow of P3D gated transposed convolution is similar to that in Fig. 5, except that the convolution operation is replaced with the transposed convolution operation.

4.3 Engineering practice

The application of MSST-VAE in engineering practice can be divided into two phases: offline model training and online imputation. The former trains the model to convergence, and the latter uses the trained model to impute missing values in real time. Both share the same network architecture, and the model and parameters of the latter are derived from the former.

1) Offline Model Training. Since the true values of missing data are unavailable, the objective of MSST-VAE in offline model training is to minimize the negative log likelihood of the observed data. In Fig. 1, each stream is trained separately, and then the fusion parameters in the Decoder are fine-tuned as expressed in Eq. (3). For the $i$-th sample $x_{i}$ in each stream, the loss function is designed as follows, where $\theta$ and $\phi$ are the parameters of the Decoder and Encoder respectively. In the final bound of Eq. (11), the first term is the reconstruction error of the observed part $x_{i}^{\textit{obs}}$ , while the second term comes from the SNFs. The last term is the Kullback-Leibler divergence [30] between two densities, which avoids over-fitting by making the conditional distribution of $z_{0}$ similar to the prior distribution of $z_{K}$ .

$\displaystyle\begin{aligned}F_{i}\left(\theta,\phi\right)&=-\left(\log p\left(x_{i}^{\textit{obs}},z_{K};\theta\right)-\log p(z_{K}|x_{i}^{\textit{obs}};\theta)\right)\\
&=-\log\frac{p\left(x_{i}^{\textit{obs}},z_{K};\theta\right)}{q_{K}(z_{K}|x_{i}^{\textit{obs}};\phi)}+\log\frac{p(z_{K}|x_{i}^{\textit{obs}};\theta)}{q_{K}(z_{K}|x_{i}^{\textit{obs}};\phi)}\\
&=-\mathbb{E}_{q_{K}}\left[\log p\left(x_{i}^{\textit{obs}},z_{K};\theta\right)\right]+\mathbb{E}_{q_{K}}\left[\log q_{K}(z_{K}|x_{i}^{\textit{obs}};\phi)\right]-\mathbb{E}_{q_{K}}\left[\log\frac{q_{K}(z_{K}|x_{i}^{\textit{obs}};\phi)}{p(z_{K}|x_{i}^{\textit{obs}};\theta)}\right]\\
&\leqslant-\mathbb{E}_{q_{K}}\left[\log p\left(x_{i}^{\textit{obs}},z_{K};\theta\right)\right]+\mathbb{E}_{q_{K}}\left[\log q_{K}(z_{K}|x_{i}^{\textit{obs}};\phi)\right]\\
&=-\mathbb{E}_{q_{0}}\left[\log p\left(x_{i}^{\textit{obs}}|z_{K};\theta\right)\right]-\mathbb{E}_{q_{0}}\left[\sum_{k=1}^{K}\log\left|\text{det}\left(\frac{\partial f_{k}\left(z_{k-1}\right)}{\partial z_{k-1}}\right)\right|\right]+D_{\textit{KL}}\left[q_{0}(z_{0}|x_{i}^{\textit{obs}};\phi)\,||\,p\left(z_{K};\theta\right)\right]\end{aligned}$ (11)
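A single-sample Monte Carlo estimate of this bound can be sketched as below, following the standard Sylvester-flow objective of [6]. Several choices are assumptions made for illustration: a unit-variance Gaussian likelihood (so the reconstruction term becomes a squared error up to a constant), a standard normal prior on $z_{K}$, the convention that a mask value of 1 marks a missing entry, and both function names.

```python
import numpy as np

def log_normal_diag(z, mu, log_var):
    """Log density of a diagonal Gaussian, summed over dimensions."""
    return -0.5 * np.sum(np.log(2 * np.pi) + log_var
                         + (z - mu) ** 2 / np.exp(log_var))

def stream_loss(x, mask, x_hat, z0, zK, mu, log_var, sum_log_det):
    """Single-sample estimate of the training bound for one stream.

    mask == 1 marks missing entries (assumed convention), so only observed
    entries contribute to the reconstruction term.
    """
    obs = (mask == 0)
    recon = 0.5 * np.sum((x[obs] - x_hat[obs]) ** 2)  # -log p(x_obs|zK) + const
    log_q0 = log_normal_diag(z0, mu, log_var)         # log q0(z0 | x_obs)
    log_prior = log_normal_diag(zK, 0.0, 0.0)         # log p(zK), standard normal
    return recon + (log_q0 - log_prior) - sum_log_det

rng = np.random.default_rng(0)
x = rng.random((2, 4, 3, 3))
mask = (rng.random(x.shape) < 0.3).astype(int)
x_hat = rng.random(x.shape)
z = rng.standard_normal(5)
# No flow (zK = z0, zero log-det) and q0 equal to the prior: only the
# reconstruction term remains.
loss = stream_loss(x, mask, x_hat, z, z, np.zeros(5), np.zeros(5), 0.0)
```

Because the reconstruction term is restricted to observed entries, whatever placeholder is stored at missing positions has no influence on the loss.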

2) Online Imputation. Once the model has been trained offline, an incomplete traffic raster sample $x$ (e.g., Fig. 8) can be filled in online. First, all missing values $x^{\textit{mis}}$ are initialized to zero; imputation effectiveness is maintained despite this placeholder because of the nonlinear transformation of the neural network [31]. Then, a robust distribution of the latent variable is obtained through the encoder and SNFs, as introduced in Section 4. Lastly, according to the conditional distribution $p_{\theta}\left({x^{\textit{mis}}|x^{\textit{obs}}}\right)$ in Eq. (12), the missing values $x^{\textit{mis}}$ are imputed by sampling from a Markov chain [32] in the decoder.

$\displaystyle p\left(x^{\textit{mis}}|x^{\textit{obs}}\right)=\int p_{\theta}\left(x^{\textit{mis}}|Z_{K}\right)q_{\phi}\left(Z_{K}|x^{\textit{obs}}\right)dZ_{K}$ (12)
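The online procedure can be sketched as a short loop: zero-fill the missing entries, encode, decode, then overwrite only the missing positions with the decoder output while keeping the observed values fixed. The sketch below is an assumption-laden illustration, not the authors' code: `encode` and `decode` are hypothetical stand-ins for the trained encoder (with SNFs) and decoder, the toy stand-ins are deterministic whereas the paper samples the latent variable, and the number of chain steps is arbitrary.

```python
import numpy as np

def impute(x, mask, encode, decode, n_steps=5):
    """Fill missing entries of x by repeatedly encoding and decoding.

    mask: 1 for observed entries, 0 for missing ones (assumed convention).
    encode / decode: callables standing in for the trained networks.
    """
    x_cur = np.where(mask == 1, x, 0.0)        # zero-initialize missing values
    for _ in range(n_steps):
        z = encode(x_cur)                      # latent representation
        x_hat = decode(z)                      # reconstruction of the full sample
        x_cur = np.where(mask == 1, x, x_hat)  # keep observed, refresh missing
    return x_cur

# Toy stand-ins for the networks (random linear maps with tanh).
rng = np.random.default_rng(0)
W = rng.normal(size=(4, 6))
toy_encode = lambda v: np.tanh(W @ v)
toy_decode = lambda z: np.tanh(W.T @ z)

x = rng.uniform(-1, 1, size=6)
mask = np.array([1, 1, 0, 1, 0, 1])
x_filled = impute(x, mask, toy_encode, toy_decode)
```

Observed entries pass through unchanged, so the loop only ever modifies the positions flagged as missing.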

5. Experiment

In this section, we first introduce the datasets used in all experiments, then list the model settings and the evaluation metrics. Next, MSST-VAE is compared with typical baseline methods. Finally, we conduct ablation experiments to assess the contribution of the major modules in MSST-VAE.

5.1 Datasets and experiment settings

1) Datasets. Experiments in this paper are conducted on three real-world traffic flow datasets: TaxiBJ [2], TaxiNYC [33] and BikeNYC [2], as shown in Table 1. TaxiBJ and TaxiNYC are GPS tracking data of taxis in Beijing and New York respectively, while BikeNYC contains trip data from the New York bike rental system.

Table 1
Detailed description of three datasets

Dataset	TaxiBJ	TaxiNYC	BikeNYC
Data type	taxi GPS tracking data	taxi GPS tracking data	bike trip data
Location	Beijing	New York	New York
Time range	2013.7.1–2013.10.30; 2014.3.1–2014.6.30; 2015.3.1–2015.6.30; 2015.11.1–2016.4.10	2010.1.1–2010.12.31	2014.4.1–2014.9.30
Time interval	30 minutes	1 hour	1 hour
Raster map size	(32,32)	(10,20)	(16,8)
Number of time intervals	22459	26304	4392

Figure 8.

An example of the imputation effectiveness of MSST-VAE on TaxiBJ described in Table 1. Green means unblocked, whereas red means congested. Left: ground-truth data matrix. Middle: incomplete data with a missing rate of 30%. Right: reconstructed data.

Missing values are generated manually at random with different missing rates. Considering the time intervals of the three datasets, the traffic raster data within each day and within each hour are regarded as one sample for the long-periodicity stream and the short-periodicity stream respectively. Taking TaxiBJ as an example, if $x_{t-1}$ and $x_{t}$ are collected within the same hour, then one short-periodicity sample is formulated as ${\bm{X}}^{s}=\{x_{t-1},x_{t}\}$, and the corresponding long-periodicity sample is formulated as ${\bm{X}}^{l}=\{x_{t-25},x_{t-24},\ldots,x_{t-1},x_{t}\}$. During the data reconstruction phase, Min-Max normalization is used to scale the data to the range $[-1,1]$, because tanh (whose output range is also from $-1$ to $1$) is applied as the activation function in the final layer of MSST-VAE. All datasets are divided into training and test sets at a ratio of 9:1. To guard against over-fitting, 90% of the training data is used for model training and the rest for validation. The model with the best performance on the validation set is saved, and the final results in Fig. 9 are obtained by evaluating this saved model on the test sets.
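The preprocessing steps in this paragraph (random masking, Min-Max scaling to $[-1,1]$, and the nested 9:1 splits) can be sketched as follows. This is a minimal illustration under stated assumptions: the helper names are hypothetical, the raster is synthetic with two channels, the scaling uses the global minimum and maximum, and the split is a random permutation, since the paper does not say whether the split is random or chronological.

```python
import numpy as np

def min_max_to_unit_range(data):
    """Scale data linearly so that its minimum maps to -1 and maximum to 1."""
    lo, hi = data.min(), data.max()
    return 2.0 * (data - lo) / (hi - lo) - 1.0

def random_mask(shape, missing_rate, rng):
    """Return a mask with 1 = observed and 0 = missing, dropped at random."""
    return (rng.random(shape) >= missing_rate).astype(np.float32)

def split_9_1(n, rng):
    """Split n sample indices 9:1 into (train+val, test), then 9:1 again."""
    idx = rng.permutation(n)
    n_test = n // 10
    test, rest = idx[:n_test], idx[n_test:]
    n_val = len(rest) // 10
    val, train = rest[:n_val], rest[n_val:]
    return train, val, test

rng = np.random.default_rng(0)
# Synthetic stand-in for a raster sequence: (time, channels, height, width).
raster = rng.integers(0, 500, size=(100, 2, 32, 32)).astype(np.float32)
scaled = min_max_to_unit_range(raster)
mask = random_mask(scaled.shape, missing_rate=0.3, rng=rng)
train_idx, val_idx, test_idx = split_9_1(len(scaled), rng)
```

With 100 samples this yields 81 training, 9 validation and 10 test samples, and roughly 30% of the entries are flagged as missing.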

2) Model Configuration. The hyper-parameters of MSST-VAE are set empirically as follows. The window size is 1 day in the long-periodicity stream and 1 hour in the short-periodicity stream. The dimension of each latent variable $z_{k}$ is fixed to 128. The length of the Sylvester NFs is 5. The numbers of ECBs and TCBs are set to 4 and 3 respectively. The numbers of mid channels (outputs of the 1D dilated temporal convolution) of the four ECBs are 8, 44, 88 and 176 in ascending order, and the numbers of out channels (outputs of the 2D dilated spatial convolution) are 16, 32, 64 and 128, also in ascending order. In the dilated P3D convolution of the ECBs, the kernel sizes of the 2D dilated spatial convolutions are 1 × 2 × 2, 1 × 3 × 3, 1 × 6 × 6 and 1 × 7 × 7, with a stride of 1. Similarly, the kernel sizes of the 1D dilated temporal convolutions are 2 × 1 × 1, 3 × 1 × 1, 6 × 1 × 1 and 7 × 1 × 1, with a stride of 1. The dilation factors of both kinds of convolution in the four ECBs are 8, 4, 2 and 1 in descending order. The numbers of out channels of the three TCBs are 64, 32 and 16, and that of the last 3D transposed convolution layer is 2. The kernel size of the 3D transposed convolution in both the TCBs and the last layer is set to 2, with a padding of 1 in the spatial and channel dimensions and 0 in the temporal dimension.
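To see what these kernel sizes and dilation factors imply, the span covered by a single dilated kernel along one axis can be computed with the standard relation $d\,(k-1)+1$ for kernel size $k$ and dilation $d$. The short sketch below assumes that kernel sizes and dilation factors are paired in the order listed (first ECB: kernel 2 with dilation 8, and so on); the paper does not state this pairing explicitly.

```python
# Kernel sizes and dilation factors of the four ECBs, in the order listed.
kernel_sizes = [2, 3, 6, 7]
dilations = [8, 4, 2, 1]

def effective_kernel(k, d):
    """Span (in cells or time steps) covered by one dilated kernel along an axis."""
    return d * (k - 1) + 1

spans = [effective_kernel(k, d) for k, d in zip(kernel_sizes, dilations)]
print(spans)  # [9, 9, 11, 7]
```

Under this pairing, the early blocks use few taps spread widely (coarse, global context), while the last block uses a dense 7-tap kernel (fine, local context).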

Besides, all hyper-parameters and thresholds are initially set from experience or at random, and a hyper-parameter search over the candidate combinations is then used to select the best-performing configuration. The same procedure is applied to the baseline approaches. The hyper-parameters of some baseline methods in the comparison experiments are listed in Table 2, and those of the remaining deep learning baselines follow the configurations in the original papers.

Table 2

Hyper-parameters of comparison experiments

Baseline method	Parameter	Value
ARIMA	start_q, max_p, max_q	1, 5, 5
LLS	k	4
PPCA	No. of principal components	2
HaLRTC	rho	1e-4

The reduction ratio $r$ of the channel attention and temporal attention mechanisms is set to 2 and 8 respectively. The initial learning rate of the Adam optimizer used for gradient descent is $10^{-3}$. Moreover, the total numbers of epochs in the short-term stream for TaxiBJ, TaxiNYC and BikeNYC are 700, 500 and 2000 respectively; those in the long-term stream are 1200, 1000 and 3000 respectively; and those in the fusion model are 100, 50 and 500 respectively.

3) Evaluation Metrics. To compare our approach with the baseline methods, the agreement between ground-truth data and imputed data is measured by the Refined Root Mean Square Error (RMSE), Refined Normalized Mean Square Error (NMSE) and Mean Absolute Error (MAE), expressed in Eqs (13), (14) and (15) respectively. NMSE measures the error of single-timestamp traffic raster data for each region in detail, while RMSE and MAE measure it for all regions as a whole. Therefore, the former represents individual similarity, and the latter two verify global similarity. Moreover, RMSE is more sensitive to ‘abnormal’ data than MAE, so a high RMSE indicates that extremely ‘bad’ imputed values exist.

$\displaystyle\textit{RMSE}=\sqrt{\frac{1}{T}\sum\limits_{t=1}^{T}\left(1-m\right)\otimes\left(x_{t}-\widehat{x_{t}}\right)^{2}}$ (13)

$\displaystyle\textit{NMSE}=\frac{1}{T}\sum\limits_{t=1}^{T}\frac{\sum\nolimits_{j=1}^{J}\sum\nolimits_{i=1}^{I}\left(1-m\right)\otimes\left(x_{i,j}-\widehat{x_{i,j}}\right)^{2}}{\sum\nolimits_{j=1}^{J}\sum\nolimits_{i=1}^{I}\left(1-m\right)\otimes\left(x_{i,j}\right)^{2}}$ (14)

$\displaystyle\textit{MAE}=\frac{1}{T}\sum\limits_{t=1}^{T}\left|\left(1-m\right)\otimes\left(x_{t}-\widehat{x_{t}}\right)\right|$ (15)

where $m$ is the mask tensor of observations defined in Section 3, $\otimes$ denotes the Hadamard product, $T$ denotes the number of timestamps, and $I$ and $J$ denote the numbers of regions along the two axes of the raster map.
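Eqs (13) to (15) can be sketched in NumPy as below. Two points are assumptions rather than statements from the paper: the mask $m$ is taken to be 1 for observed and 0 for missing entries, so that $(1-m)$ selects the imputed positions, and the per-timestamp terms of RMSE and MAE are summed over all regions before averaging over $T$, since Eqs (13) and (15) leave that reduction implicit. The function name is hypothetical.

```python
import numpy as np

def imputation_metrics(x, x_hat, m):
    """Masked RMSE, NMSE and MAE over missing entries.

    x, x_hat, m: arrays of shape (T, ...) with time on the first axis;
    m is 1 for observed and 0 for missing entries (assumed convention).
    Each timestamp is assumed to contain at least one non-zero missing
    entry, so that the NMSE denominator is non-zero.
    """
    miss = 1.0 - m
    axes = tuple(range(1, x.ndim))                     # all non-time axes
    sq_err = np.sum(miss * (x - x_hat) ** 2, axis=axes)
    abs_err = np.sum(np.abs(miss * (x - x_hat)), axis=axes)
    energy = np.sum(miss * x ** 2, axis=axes)
    rmse = np.sqrt(np.mean(sq_err))
    nmse = np.mean(sq_err / energy)
    mae = np.mean(abs_err)
    return rmse, nmse, mae

# Toy check: two 2x2 frames, one missing cell per frame, each off by 1.
x = 2.0 * np.ones((2, 2, 2))
m = np.ones_like(x)
m[0, 0, 0] = 0.0
m[1, 1, 1] = 0.0
x_hat = x + (1.0 - m)
rmse, nmse, mae = imputation_metrics(x, x_hat, m)
```

In the toy check each frame has a squared error of 1 on a missing cell whose true value is 2, so RMSE and MAE are 1 and NMSE is 1/4.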

Following [13, 44, 45], to test the effectiveness and robustness of MSST-VAE and the compared approaches under a variety of conditions, missing ratios ranging from 10% to 50% at 10% intervals are considered. Therefore, even when some sensors fail, data imputation can ensure the integrity of the data and avoid system failure of the ITS before the related sensors are repaired.

5.2 Comparison experiment result and analysis

In this section, MSST-VAE is compared with ten baseline methods: ARIMA [7], LLS [9], PPCA [13], HaLRTC [34], SDAE [35], MLP-VAE [5], CombCN [36], 3DConvGAN [46], DeepSTN+ [47], and STVAE [41], which can be classified into two groups. ARIMA, LLS, PPCA, and HaLRTC are traditional methods corresponding to the four sub-categories in Section 2. SDAE, CombCN, MLP-VAE, 3DConvGAN, DeepSTN+ and STVAE are deep learning-based methods; among them, MLP-VAE and STVAE are also VAE-based models for data imputation. Figures 9 and 10 show the comparison results of our approach and the baseline methods on the three datasets at different missing rates.

Figure 9.

NMSE and MAE results of the comparison experiment conducted on the TaxiBJ, TaxiNYC and BikeNYC datasets.

ARIMA [7] directly predicts the missing value based on data from past timestamps at the same location, but does not consider spatial dependency among different locations. LLS [9] imputes the missing value with the most “similar” value on the same raster map, but does not take temporal dependency among data from different timestamps into consideration. According to the results in Figs 9 and 10, taking only temporal or only spatial dependency into account is far from satisfactory on all three metrics. It is also worth noting that the prediction ability of ARIMA degrades badly in the presence of missing values, as shown by its poor MAE.

Figure 10.

RMSE results of the comparison experiment conducted on the TaxiBJ, TaxiNYC and BikeNYC datasets.

PPCA [13] builds a Gaussian latent variable model and then fills in the missing values according to the model. However, the relationship between latent and observed variables is linear, making it incapable of discovering complex spatio-temporal features. That is why PPCA performs worse on NMSE than on MAE and RMSE in Figs 9 and 10, since its global imputation ability in the spatio-temporal dimension is limited. By contrast, as a non-linear model, MSST-VAE clearly outperforms PPCA on all datasets.

HaLRTC [34] is a typical decomposition-based method mentioned in Section 2. It performs well at low missing rates on TaxiNYC and BikeNYC, because both the spatial and temporal dimensions are considered. Nevertheless, it is not robust as the missing rate rises, and even shows over-fitting when the missing rate reaches 90%. In contrast, regularization terms are added to the loss function of MSST-VAE in Eq. (11) to prevent over-fitting. The results in Figs 9 and 10 also confirm that MSST-VAE performs more stably under different missing rates. Moreover, as a decomposition-based method, HaLRTC has a higher computational complexity than MSST-VAE (discussed in Section 3.2), which is also verified by the results in Table 3.

Stacked Denoising Autoencoder (SDAE) [35] is a stacked deep learning method based on a reconstruction architecture. Although SDAE uses an encoder-decoder model as MSST-VAE does, the hidden features it extracts are deterministic. In contrast with the VAE used in our model, which has stronger generative ability, SDAE performs even worse than some traditional methods on RMSE, since some extremely bad imputed values may exist. Besides, SDAE requires complete data for model training, whereas MSST-VAE can be trained with incomplete traffic raster data, which is more suitable for practical engineering applications.

CombCN [36] was originally designed for video inpainting; it applies a 3D convolution subnetwork for temporal consistency and a 2D convolution subnetwork for spatial modeling. MSST-VAE improves on this design with dilated P3D convolution. Compared with CombCN, MSST-VAE can efficiently extract global-local-granularity spatial features and long-short-term temporal features, which are typical characteristics of traffic raster data in ITS. The results on the three metrics in Figs 9 and 10 also illustrate the superiority of dilated P3D gated convolution over the traditional 2D and 3D convolutions in CombCN.

MLP-VAE [5] is a deep Bayesian model that learns stochastic mappings between observed data and latent variables. Owing to the stochastic representations of the VAE, the overall performance of MLP-VAE is second only to MSST-VAE, demonstrating the advantage of the generalization ability of VAEs in data imputation tasks. However, it is still not as good as MSST-VAE for two reasons. On the one hand, the hidden variable in MLP-VAE obeys a Gaussian distribution, making the model less general than one using the Sylvester NFs of MSST-VAE. On the other hand, multi-layer perceptrons are less satisfactory than dilated P3D convolution at extracting spatio-temporal features.

DeepSTN+ [47] is based on ST-ResNet and targets the traffic data prediction task. It employs different temporal periodicity streams (e.g., day of week, hour of day), which is similar to the two VAE streams in MSST-VAE shown in Fig. 1. However, although the network of DeepSTN+ is deep, its operations are mainly plain 2D convolutions, which are far from sufficient to learn the complex spatio-temporal features in traffic raster data. That is why MSST-VAE outperforms it on all metrics according to Figs 9 and 10.

3DConvGAN [46] is a generative model based on GAN for the traffic imputation task, in which 3D convolution is used to extract spatio-temporal correlations and improve recovery accuracy. On the one hand, compared with a VAE, the GAN in 3DConvGAN is hard to converge during the training phase, since the input of the generator in 3DConvGAN is randomly sampled from a Gaussian distribution. This is verified by the results in Table 3 (3DConvGAN can need up to 10 times as many epochs as MSST-VAE in the offline training phase). On the other hand, the dilated P3D convolution in MSST-VAE can learn long-short-term temporal features and global-local-granularity spatial features, while the plain 3D convolution in 3DConvGAN can only capture simple spatio-temporal features. Also, 3D convolution is more time-consuming and resource-consuming than P3D convolution [27].

STVAE [41] is our previous work. Compared with it, MSST-VAE considers periodic features by deploying two kinds of streams. Moreover, global-local-granularity spatial dependencies and long-short-term temporal dependencies receive more emphasis, and a multi-horizon attention mechanism is introduced in the ECB, so MSST-VAE outperforms STVAE on the three metrics according to Figs 9 and 10.

From the discussion above, we conclude that MSST-VAE achieves more competitive imputation performance on traffic raster data than all the baseline models, including the state-of-the-art VAE-based model STVAE, especially when the missing rate is high. Moreover, MSST-VAE performs steadily on the three metrics: the lowest NMSE and MAE demonstrate good local and global imputation ability, and the lowest RMSE shows that few extremely bad imputed values occur. MSST-VAE also displays strong robustness across different missing rates.

Table 3

Resource consumption of HaLRTC, 3DConvGAN, and MSST-VAE

Datasets	Epochs in offline training phase		Time in online imputation phase
	3DConvGAN	MSST-VAE	HaLRTC	MSST-VAE
TaxiBJ	5000	500	27.373–32.406s	4.509–5.074s
TaxiNYC	5000–10000	1000	4.015–8.422s	0.790–0.803s
BikeNYC	2000	500	12.881–17.393s	1.001–1.015s

5.3 Ablation experiment result and analysis

To validate the effectiveness of the major techniques in MSST-VAE, ablation experiments are conducted in this section. We reconfigure MSST-VAE to create four variants: 1) single-stream MSST-VAE; 2) MSST-VAE w/o SNFs; 3) MSST-VAE w/o dilated P3D convolution; 4) MSST-VAE w/o multi-horizon attention mechanism. Table 4 presents the NMSE, RMSE and MAE of MSST-VAE and the four variants on the three datasets at different missing rates.

Single-stream MSST-VAE. One of the input streams ($X_{L}$ in Fig. 1) is removed from MSST-VAE, and only the short-periodicity data $X^{s}$ is fed into the network. It shows the second-worst performance among the five models, demonstrating the importance of learning periodic features for traffic raster data.

MSST-VAE w/o SNFs. The SNFs are removed from MSST-VAE, so the hidden variables of the VAE obey a Gaussian distribution. Compared with MSST-VAE, the generalization of the hidden variable in the VAE is weakened, and the imputation performance degrades accordingly, as shown in Table 4.

MSST-VAE w/o dilated P3D convolution. The dilated P3D convolution in the ECBs and TCBs is replaced with plain 3D convolution to extract spatio-temporal features. It performs the worst of the four variants, because plain 3D convolution extracts global-local-granularity spatial features and long-short-term temporal features less effectively and efficiently than dilated P3D convolution. In other words, this highlights the importance of spatio-temporal dependency for the traffic raster data imputation task.

Table 4
Results of ablation experiment on three datasets

Missing rate	Model	TaxiBJ			TaxiNYC			BikeNYC
		NMSE	RMSE	MAE	NMSE	RMSE	MAE	NMSE	RMSE	MAE
10%	Single-stream MSST-VAE	0.0342	4.2449	0.00234	0.0548	1.0784	0.00155	0.0790	1.1583	0.00251
	MSST-VAE w/o SNFs	0.0294	3.9237	0.00226	0.0446	0.9731	0.00147	0.0667	1.0646	0.00191
	MSST-VAE w/o dilated P3D convolution	0.0386	4.5016	0.00242	0.0507	1.0367	0.00171	0.0688	1.0700	0.00223
	MSST-VAE w/o multi-horizon attention	0.0263	3.7060	0.00216	0.0350	0.8605	0.00139	0.0669	1.0666	0.00191
	MSST-VAE	0.0215	3.3518	0.00203	0.0325	0.8303	0.00137	0.0557	0.9696	0.00187
20%	Single-stream MSST-VAE	0.0370	7.6566	0.00690	0.0612	2.0780	0.00515	0.0804	2.0321	0.01171
	MSST-VAE w/o SNFs	0.0325	7.1573	0.00683	0.0519	1.9150	0.00506	0.0741	1.9542	0.00750
	MSST-VAE w/o dilated P3D convolution	0.0407	8.0014	0.00699	0.0552	1.9738	0.00551	0.0729	1.9196	0.00673
	MSST-VAE w/o multi-horizon attention	0.0261	6.3990	0.00672	0.0412	1.7056	0.00449	0.0719	1.9205	0.00607
	MSST-VAE	0.0246	6.2105	0.00645	0.0333	1.5323	0.00407	0.0635	1.8037	0.00590
30%	Single-stream MSST-VAE	0.0408	10.5372	0.01231	0.0599	2.5096	0.00809	0.0887	2.8520	0.01097
	MSST-VAE w/o SNFs	0.0327	9.4061	0.01197	0.0502	2.2993	0.00788	0.0930	2.9373	0.01177
	MSST-VAE w/o dilated P3D convolution	0.0419	10.6474	0.01236	0.0629	2.5733	0.00950	0.0961	2.9594	0.01189
	MSST-VAE w/o multi-horizon attention	0.0287	8.7948	0.01181	0.0450	2.1766	0.00750	0.0899	2.8750	0.01100
	MSST-VAE	0.0239	8.0444	0.01064	0.0364	1.9573	0.00676	0.0664	2.4649	0.01039
40%	Single-stream MSST-VAE	0.0455	12.9473	0.01766	0.0635	3.3392	0.01307	0.1052	3.5782	0.01686
	MSST-VAE w/o SNFs	0.0369	11.6738	0.01704	0.0481	2.8734	0.01143	0.1083	3.6405	0.01694
	MSST-VAE w/o dilated P3D convolution	0.0454	12.9612	0.01765	0.0642	3.3584	0.01392	0.1304	3.6618	0.01854
	MSST-VAE w/o multi-horizon attention	0.0295	10.4175	0.01652	0.0466	2.8629	0.01116	0.1133	3.7124	0.01707
	MSST-VAE	0.0277	10.0816	0.01559	0.0406	2.6706	0.01079	0.0988	3.4777	0.01669
50%	Single-stream MSST-VAE	0.0429	14.6450	0.02470	0.1247	5.3030	0.03219	0.1799	5.8972	0.02965
	MSST-VAE w/o SNFs	0.0382	13.8173	0.02344	0.0742	4.0933	0.01834	0.1686	5.7157	0.02678
	MSST-VAE w/o dilated P3D convolution	0.0497	15.8044	0.02640	0.0979	4.6993	0.02246	0.1818	5.3751	0.03014
	MSST-VAE w/o multi-horizon attention	0.0374	13.6759	0.02332	0.0683	3.9323	0.01763	0.1558	5.5133	0.02354
	MSST-VAE	0.0347	13.1464	0.02281	0.0693	3.9522	0.01773	0.1550	5.5065	0.02334

MSST-VAE w/o multi-horizon attention. The multi-horizon attention mechanism, which is used for better representation of spatio-temporal features, is removed from the ECB in this experiment. Although its effect is not as pronounced as that of dilated P3D convolution, removing it still degrades performance in most settings of Table 4.
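The ranking of the variants can be made concrete by computing the relative NMSE increase over the full model. The sketch below uses only the TaxiBJ values at a 10% missing rate copied from Table 4; the choice of this single setting is for illustration, and other datasets and missing rates give different magnitudes.

```python
# NMSE on TaxiBJ at 10% missing rate, copied from Table 4.
nmse_taxibj_10 = {
    "Single-stream MSST-VAE": 0.0342,
    "MSST-VAE w/o SNFs": 0.0294,
    "MSST-VAE w/o dilated P3D convolution": 0.0386,
    "MSST-VAE w/o multi-horizon attention": 0.0263,
}
full_model_nmse = 0.0215

# Relative increase in NMSE caused by each ablation.
rel_increase = {
    name: (value - full_model_nmse) / full_model_nmse
    for name, value in nmse_taxibj_10.items()
}

# Print variants from most to least harmful ablation.
for name, r in sorted(rel_increase.items(), key=lambda kv: -kv[1]):
    print(f"{name}: +{100 * r:.1f}%")
```

In this setting, removing dilated P3D convolution raises NMSE by about 80%, dropping the long-periodicity stream by about 59%, removing SNFs by about 37%, and removing attention by about 22%, which matches the ordering discussed in this section.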

5.4 Sensitivity study

Figure 11.

Sensitivity study of the ECB number $n$ on three public datasets with average missing rate. The metrics are NMSE, RMSE and MAE from left to right.

Figure 12.

Sensitivity study of the length of NFs $l$ on three public datasets with average missing rate. The metrics are NMSE, RMSE and MAE from left to right.

To evaluate the effect of different parameterizations of MSST-VAE, a hyper-parameter sensitivity study is conducted on the public datasets with average missing rates. Two major hyper-parameters are selected: the number of ECBs $n$ and the length of the NFs $l$, which determine the basic architecture of MSST-VAE. The results are shown in Figs 11 and 12. It can be observed that too many ECBs or overly deep NFs may not improve the performance of MSST-VAE; therefore, $n$ is set to 4 and $l$ is set to 5.

6. Conclusion

In this paper, a Multiple Streams Spatial Temporal-VAE (MSST-VAE) is proposed to address the challenge of periodic traffic raster data imputation. In MSST-VAE, multiple periodic streams of VAEs that leverage Sylvester NFs not only take multi-grained periodicity into account but also support the generalization and accuracy of the imputed data. Moreover, the dilated P3D convolution and multi-horizon attention mechanism deployed in the Extraction-and-Calibration Block (ECB) make MSST-VAE capable of extracting global-local-granularity spatial features and long-short-term temporal features of traffic raster data at the same time. The results of the comparison and ablation experiments also indicate that MSST-VAE is effective and superior to existing work on periodic traffic raster data imputation. In future studies, the authors will extend the data imputation task from traffic raster data to traffic trajectory data, and the P3D convolution may be replaced by a graph convolution network to extract non-Euclidean spatial features when missing values are imputed.

Statements and declarations

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Data availability

The original code is available upon reasonable request after evaluation.

Acknowledgments

This work is financially supported by the Shenzhen Science and Technology Program under Grant No. GXWD20220817124827001 and No. JCYJ20210324132406016.

References

1. Zhang, Chen, Jiang and Huang, Anomaly Detection of Periodic Multivariate Time Series under High Acquisition Frequency Scene in IoT, in International Conference on Data Mining Workshops (ICDMW), Sorrento Italy, (2020), 543–552.
2. Zhang, Zheng and Qi, Deep spatio-temporal residual networks for citywide crowd flows prediction, in Proceedings of the 31st AAAI Conference on Artificial Intelligence, San Francisco, USA, (2017), 1655–1661.
3. Chen and Wang F-Y., Traffic Flow Imputation Using Parallel Data and Generative Adversarial Networks, IEEE Transactions on Intelligent Transportation Systems 21(4), (2020), 1624–1630.
4. Nguyen NT., Dao MS. and Zettsu, Leveraging 3D-Raster-Images and DeepCNN with Multi-source Urban Sensing Data for Traffic Congestion Prediction, Database and Expert Systems Applications, (2020), 396–406.
5. Boquet, Morell, Serrano and Vicario J.L., A variational autoencoder solution for road traffic forecasting systems: missing data imputation, dimension reduction, model selection and anomaly detection, Transportation Research Part C: Emerging Technologies 115, (2020), 102622.
6. Van Den Berg, Hasenclever, Tomczak J.M. and Welling, Sylvester normalizing flows for variational inference, in 34th Conference on Uncertainty in Artificial Intelligence, Monterey USA, (2018), 393–402.
7. Zhong, Sharma and Lingras, Genetically Designed Models for Accurate Imputation of Missing Traffic Counts, Transportation Research Record 1879, (2004), 71–79.
8. Tak, Woo and Yeo, Data-Driven Imputation Method for Traffic Data in Sectional Units of Road Links, IEEE Transactions on Intelligent Transportation Systems 17(6), (2016), 1762–1771.
9. and Li, Missing traffic data: comparison of imputation methods, IET Intelligent Transport Systems 8, (2014), 51–57.
10. Elshenawy, El-darieby and Abdulhai, Automatic Imputation of Missing Highway Traffic Volume Data, in IEEE International Conference on Pervasive Computing and Communications Workshops, Athens Greece, (2018), 373–378.
11. Liu, Musialski, Wonka and Ye, Tensor Completion for Estimating Missing Values in Visual Data, IEEE Transactions on Pattern Analysis and Machine Intelligence 35(1), (2013), 208–220.
12. Qiu and Zhang, A Traffic Speed Imputation Method Based on Self-adaption and Clustering, in 4th IEEE International Conference on Big Data Analytics (ICBDA), Ahmedabad India, (2019), 26–31.
13. Zhang and Hu, PPCA-Based Missing Data Imputation for Traffic Flow Volume: A Systematical Approach, IEEE Transactions on Intelligent Transportation Systems 10(3), (2009), 512–522.
14. Zhang, Jia and Li, A BPCA based missing value imputing method for traffic flow volume data, IEEE Intelligent Traffics Symposium, (2008), 985–990.
15. and Li, Efficient missing data imputing for traffic flow by considering temporal and spatial dependence, Transportation Research Part C: Emerging Technologies 34, (2013), 108–120.
16. Wei, Wang, Zhou and Li, BRITS: Bidirectional Recurrent Imputation for Time Series, Advances in Neural Information Processing Systems 31, (2018), 6776–6786.
17. Zhuang and Wang, Innovative method for traffic data imputation based on convolutional neural network, IET Intelligent Transport Systems 13, (2019), 605–613.
18. Reza and Regan, A convolution recurrent autoencoder for spatio-temporal missing data imputation, arXiv preprint, (2019), abs/1904.12413.
19. Guo, Wan, A data imputation method for multivariate time series based on generative adversarial network, Neurocomputing 360, (2019), 185–197.
20. Guo, Lin, Chen and Wan, Deep Spatial–Temporal 3D Convolutional Neural Networks for Traffic Data Forecasting, IEEE Transactions on Intelligent Transportation Systems 20(10), (2019), 3913–3926.
21. Kingma D.P. and Welling, Auto-encoding variational bayes, arXiv preprint, (2013), abs/1312.6114.
22. Woo, Park and Lee J.Y., CBAM: Convolutional block attention module, Computer Vision – ECCV 11211, (2018), 3–19.
23. Pereira R.C., Santos M.S., Rodrigues P.P. and Abreu P.H., Reviewing Autoencoders for Missing Data Imputation: Technical Trends, Applications and Outcomes, Journal of Artificial Intelligence Research 69, (2020), 1255–1285.
24. Gondara and Wang, MIDA: Multiple Imputation Using Denoising Autoencoders, Advances in Knowledge Discovery and Data Mining 10939, (2018), 260–272.
25. Guo, Lin, Feng, Song and Wan, Attention Based Spatial-Temporal Graph Convolutional Networks for Traffic Flow Forecasting, in Proceedings of the AAAI Conference on Artificial Intelligence, Hawaii USA, 33, (2019), 922–929.
26. Chang Y.-L., Liu Z.Y., Lee K.-Y. and Hsu, Free-Form Video Inpainting With 3D Gated Convolution and Temporal PatchGAN, in IEEE/CVF International Conference on Computer Vision (ICCV), Seoul South Korea, (2019), 9065–9074.
27. Qiu, Yao and Mei, Learning Spatio-Temporal Representation with Pseudo-3D Residual Networks, in IEEE International Conference on Computer Vision (ICCV), Venice Italy, (2017), 5534–5542.
28. Mittal, Vibhu, A survey of accelerator architectures for 3D convolution neural networks, Journal of Systems Architecture 115, (2021), 102041.
29. Koltun, Multi-scale context aggregation by dilated convolutions, arXiv preprint, (2015), abs/1511.07122.
30. Barz, Rodner, Garcia Y.G. and Denzler, Detecting Regions of Maximal Divergence for Spatio-Temporal Anomaly Detection, IEEE Transactions on Pattern Analysis and Machine Intelligence 41(5), (2019), 1088–1101.
31. Nazábal, Olmos P.M., Ghahramani and Valera, Handling incomplete heterogeneous data using VAEs, Pattern Recognition 107, (2020), 107501.
32. Rezende D.J., Mohamed and Wierstra, Stochastic backpropagation and approximate inference in deep generative models, in International Conference on Machine Learning, Beijing China, (2014), 1278–1286.
33. Brian and Dan, New York City Taxi Trip Data (2010–2013), University of Illinois at Urbana-Champaign, (2016), doi: 10.13012/J8PN93H8.
34. Ran, Tan and Jin P.J., Tensor based missing traffic data completion with spatial–temporal correlation, Physica A: Statistical Mechanics and its Applications 446, (2016), 54–63.
35. Duan, Liu Y.L. and Wang F.Y., An efficient realization of deep learning for traffic data imputation, Transportation Research Part C: Emerging Technologies 72, (2016), 168–181.
36. Wang, Huang, Han and Wang, Video inpainting by jointly learning temporal structure and spatial details, in Proceedings of the AAAI Conference on Artificial Intelligence, Hawaii USA, 33, (2019), 5232–5239.
37. T.-C., A review on time series data mining, Engineering Applications of Artificial Intelligence 24, (2011), 164–181.
38. Shahabi and Liu, Diffusion convolutional recurrent neural network: Data-driven traffic forecasting, arXiv preprint, (2017), abs/1707.01926.
39. Szegedy, Liu, Jia, Sermanet, Reed, Anguelov, Erhan, Vanhoucke and Rabinovich, Going deeper with convolutions, in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Boston USA, (2015), 1–9.
40. Yang, Kang and Yuan, ST-LBAGAN: Spatio-temporal learnable bidirectional attention generative adversarial networks for missing traffic data imputation, Knowledge-Based Systems 215, (2021), 106705.
41. Chen, Zhang, Chen, Jiang, Huang and Gu, Learning Traffic as Videos: A Spatio-Temporal VAE Approach for Traffic Data Imputation, in International Conference on Artificial Neural Networks and Machine Learning, Bratislava Slovakia, (2021), 12895.
42. Chen, Cai, Chen and Li, Graph regularized local self-representation for missing value imputation with applications to on-road traffic sensor data, Neurocomputing 303, (2018), 47–59.
43. Zhao and Li, Improving the Traffic Data Imputation Accuracy Using Temporal and Spatial Information, in 7th International Conference on Intelligent Computation Technology and Automation, Changsha China, (2014), 312–317.
44. Al-Deek H.M., Venkata and Ravi Chandra, New algorithms for filtering and imputation of real-time and archived dual-loop detector data in I-4 data warehouse, Transportation Research Record: Journal of the Transportation Research Board 1867, (2004), 116–126.
45. and Shi, Short-term traffic flow forecasting model under missing data, Journal of Computer Applications 30, (2010), 1117–1120.
46. Zheng and Feng, 3D Convolutional Generative Adversarial Networks for Missing Traffic Data Completion, in 10th International Conference on Wireless Communications and Signal Processing, Hangzhou China, (2018), 1–6.
47. Lin, Feng and Jin, DeepSTN+: Context-Aware Spatial-Temporal Neural Network for Crowd Flow Prediction in Metropolis, in Proceedings of the AAAI Conference on Artificial Intelligence, Hawaii USA, (2019), 1020–1027.
48. Zhang, Chen and Huang, Data Imputation in IoT Using Spatio-Temporal Variational Auto-Encoder, Neurocomputing 529, (2023), 23–32.
