Background subtraction by probabilistic modeling of patch features learned by deep autoencoders

Abstract

Video sequence analysis systems must be able to operate for long periods of time, and they must ensure that the different events that can affect the quality of the input data do not degrade their performance excessively. In this work, a method called Probabilistic Mixture of Deeply Autoencoded Patch Features (PMDAPF) is proposed. A Deep Autoencoder is the cornerstone of the methodology for robust background modeling and foreground detection that is presented in this document. Its purpose is to obtain a reduced set of significant features from each patch belonging to one of several shifted tilings of the video frames. Then, a probabilistic model determines whether the whole patch belongs to the background or not. Foreground pixel detection takes into account the information of all patches in which the pixel is included. The robustness of the proposal, as well as its suitability for the uninterrupted analysis and processing of visual information, is reflected in the experiments, in which the performance of the proposed system is only slightly affected whereas that of the classic methods degrades drastically.

Keywords

Background modeling, foreground detection, deep learning, autoencoders, probabilistic mixture models

1. Introduction

The number of software applications that need to extract some kind of information from images or video sequences has grown steadily in the last decades. Among the fields in the area of artificial vision, video surveillance is still one of the most active because it involves several complex video analysis and processing tasks that must be solved efficiently and reliably. That is the case of background subtraction, which consists in deciding which elements of the image are part of the background of the scene and which ones are moving objects, i.e., belong to the foreground.

Background modeling algorithms must work continuously, thus a difficult goal that they should achieve is to perform adequately when processing each frame of the video sequence and not only the initial ones. Therefore, they should be robust enough to maintain their level of performance when events that may compromise the quality of input data occur and when the characteristics of the background itself vary. For instance, it is not uncommon for sudden lighting changes to happen in indoor scenes, while atmospheric conditions can affect object detection in outdoor environments.

Foreground detection algorithms can be considered as binary classification methods that compute the probability of each frame pixel or region belonging to one of the two possible classes: foreground or background. For that purpose, the most referenced proposals attempt to learn an underlying model that describes the changes in the features of that pixel or region over time.

In this work, a background subtraction algorithm that works at the region level is presented. Each pixel is not analyzed independently but as a component of a square patch that is part of a particular tiling of the video frame. Since several distinct shifted tilings are employed for each frame, each single pixel is classified after taking into consideration the information about all those patches to which it belongs. The noise present in the patches is removed by a previously trained deep autoencoder [1], which is an unsupervised deep learning neural network well suited to information representation due to its ability to provide relevant data features [2]. Single layer autoencoders have been proved to span the same subspace as Principal Component Analysis [3]. Although deep autoencoders composed of several linear layers have been successfully used in background modeling in noisy environments [4, 5], it may be difficult for them to capture the visual patterns present in the pixel neighborhood. Deep autoencoders such as the one proposed in [6] try to capture relevant visual features by means of several initial convolutional layers, while the last layers are dense and devoted to computing several significant features of the input that can be expressed as linear combinations of the visual characteristics.

In the present work, we extend our previous methodology described in [4] (published in conference proceedings [7, 8]) by using tilings and proposing a new probabilistic model. Although the proposed methodology is conceptually based on [5, 4], the use of tilings provides a higher resolution and a more stable segmentation than [5]. Besides, the proposed probabilistic method does not depend on parameters, which improves on the model in [4].

The paper is divided into the following sections: Section 2 reviews related works; Section 3 explains the foreground detection methodology, which is based on a deep autoencoder that comprises convolutional and linear layers, and on a probabilistic model that works with the reduced patch representation provided by the autoencoder; Section 4 reports the experimental results over several public surveillance sequences; and finally, Section 5 presents the conclusions.

2. Related works

Background modeling has been of paramount importance in video surveillance systems [9]. In recent decades, a large number of published techniques model the background by analyzing the intensity of each pixel over time. Initially, the methods with better performance modeled this signal through parametric distributions. Thus, Wren et al. [10] assume only one type of moving object in a not very dynamic environment, and pixel classification is based on its color, location and information about a set of $2D$ regions or blobs that may appear in the frame, each of which is modeled by a Gaussian distribution. In an attempt to cope with dynamic backgrounds, [11] proposes a mixture of a fixed number of Gaussian distributions in order to characterize the distribution of the values of each pixel, whereas in [12] a mixture model with a variable number of Gaussian components, which is limited to an upper bound selected a priori, is learned. On the other hand, [13] employs nonparametric probability models based on kernel distributions for each pixel. The aforementioned methods analyze each pixel individually and they seem to have low tolerance to noisy data, as discussed in [14].

Other proposals, such as SOBS [15] and FSOM [16], take into account not only color intensity but also pixel vicinity information. Pixel models are based on unsupervised artificial neural networks known as self-organizing maps [17], in which topological neighborhood relations of the input patterns are preserved. The combination of each pixel output with those of its neighbors allows for a more robust detection of foreground objects. Furthermore, SC-SOBS [18] and SOBS-CF [19] are revisions of SOBS that try to enforce spatial consistency to reduce false positives in detection. More recent techniques like LOBSTER [20], SUBSENSE [21] or PAWCS [22] increase noise resilience by including several local binary similarity patterns in each pixel representation, which describe texture patterns of the region to which the pixel belongs.

Dimensionality reduction techniques such as Robust Principal Component Analysis (RPCA) [23] have also been used successfully in the foreground detection field. Although delivering a good performance, conventional RPCA-based methods are not suitable for real-time applications due to their high computational complexity. More efficient RPCA-based algorithms are based on specific assumptions about subspaces and outlier distribution which prevent them from being used in some cases.

The use of deep learning techniques has been a true revolution in the field of intelligent information processing, especially when images or video sequences are the input data that have to be processed. Thus, it is possible to highlight proposals which evaluate infrared images for face detection [24], explore multi-channel EEG signals for seizure detection [25, 26, 27] or for interaction analysis between cortical regions of the brain [28]. In a context related to video surveillance, object tracking tasks [29] or object classification in road scenes [30] have been carried out. In addition, deep learning networks have also been applied to subjects as diverse as pupil identification [31], the detection of evidence of Parkinson’s disease by means of 3D images [32] or the evaluation of physical structures in engineering [33].

Background subtraction techniques which are based on deep learning neural networks have become very popular and effective in recent years [34]. Some of them focus on exploiting spatial and temporal information in order to learn a background model as precise as possible that is then used to classify the image pixels. In [35], a method composed of two stacked autoencoders which try to find the latent space features of the input video frames is employed to model the underlying structure of dynamic backgrounds. A convolutional long short-term memory is used in [36] in an attempt to learn not only spatial information but also the temporal associativity among the frames where sequential movements of foreground objects occur. Two different deep convolutional architectures that aggregate multi-scale information in order to improve foreground segmentation accuracy and robustness are proposed in [37]. Finally, a background estimation and foreground detection model that is able to reconstruct missing parts of the input image is presented in [38]. Generative adversarial networks are used for predicting new contexts in missing image regions while a semantic convolutional network is responsible for inpainting missing regions. On the other hand, in [39] the background is modeled with a simple grayscale image and a convolutional neural network is trained to subtract the single background image from the video frame image and classify each frame pixel accordingly. A similar approach is followed in [40], where the subtraction operation is made by combining Bayesian generative adversarial networks and parallel vision theory.

3. Methodology

The standard strategy to model the background in video sequences for video surveillance purposes implies the representation of each pixel of the frame individually. As opposed to this strategy, we propose to model the video frames by dividing them into patches of $N\times N$ pixels, which will be individually examined in order to estimate whether they belong to the background. The patch size $N$ may affect the type of visual features that are detected as well as the performance of the background subtraction process. On the one hand, small values of $N$ limit the capacity of the autoencoder convolutional layers to learn visual features. In the particular case where $N=1$, no information about the neighborhood of each pixel is taken into account and only features that are based on the pixel color can be detected. On the other hand, when the size of the patch is increased, more complex and larger visual features can be learned by the autoencoder. Nonetheless, foreground segmentation results may be visually worse due to aliasing effects that may occur when the same class is assigned to every pixel that is part of the same patch. In order to increase the spatial resolution of the resulting segmentation, $M$ shifted tilings are considered. In this way, each individual pixel of the video frame is part of $M$ patches, one per tiling, and its classification depends on the classification of the $M$ patches to which the pixel belongs.

The process of classifying a patch as background or foreground consists of two steps. First, a compressed representation of the patch is obtained by extracting its significant features, as computed by a previously trained Deep Autoencoder (DA from now on) [2]. After that, in a second step, the reduced representation of the patch is fed into a probabilistic mixture model which estimates the probability that the patch belongs to the foreground. This mixture model is updated as the video sequence progresses.

3.1 Patch feature extraction

Since the DA might fail to model patches that are too small, we propose to use $N\times N$ pixel patches where $N$ is big enough to obtain adequate models from patches. We also have $M$ tilings for each video frame so that each tiling is composed of patches of $N\times N$ pixels. Each tiling $i$ with $i\in\{1,\ldots,M\}$ is defined by a vector $s_{i}\in\{0,\ldots,N-1\}\times\{0,\ldots,N-1\}$ that serves as the shift and makes it distinct from all other tilings. The upper left corner of the first $N\times N$ patch from tiling $i$ is located at position $s_{i}$ in the video frame. The video frame is extended by padding so that all tilings have the same number of $N\times N$ patches.

A pixel belongs to only one patch in each tiling. As there are $M$ tilings, the pixel is part of $M$ different patches. In fact, it is within the intersection of the $M$ patches that contain that particular pixel. Once those patches have been classified, Eq. (23) defines how to calculate the pixel classification. Figure 1 shows how $M=4$ patches from different tilings intersect with each other. All pixels within that intersection will be assigned to the same class since they all belong to the same patch in each tiling.
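The shifted tilings described above can be illustrated with a minimal NumPy sketch. It is not the implementation used in this work: the function name `tiling_patches`, the edge-replication padding, and the choice of using one extra row and column of patches so that every tiling has the same number of patches are assumptions made only for the example.

```python
import numpy as np

def tiling_patches(frame, n, shift):
    """Split an H x W x C frame into the N x N patches of one shifted tiling.

    `shift` is the vector s_i = (dy, dx) with 0 <= dy, dx < n.
    Returns the patches with shape (rows, cols, n, n, C) and the (top, left)
    padding, so that pixel (y, x) lies in patch ((y + top) // n, (x + left) // n).
    """
    h, w = frame.shape[:2]
    dy, dx = shift
    # Assumption: one extra patch row/column so all tilings have equal size.
    rows, cols = -(-h // n) + 1, -(-w // n) + 1
    # Pad so that a patch corner falls exactly on position s_i of the frame.
    top, left = (n - dy) % n, (n - dx) % n
    padded = np.pad(
        frame,
        ((top, rows * n - h - top), (left, cols * n - w - left), (0, 0)),
        mode="edge",  # assumption: the padding scheme is not specified
    )
    patches = padded.reshape(rows, n, cols, n, -1).swapaxes(1, 2)
    return patches, (top, left)

# Example: a small frame, N = 4 and shift s_i = (1, 2).
frame = np.arange(5 * 7 * 3).reshape(5, 7, 3)
patches, (top, left) = tiling_patches(frame, 4, (1, 2))
```

With this convention, each pixel falls in exactly one patch per tiling, which is the property required by the pixel classification rule.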

Figure 1.

Shifted tilings example with $M=$ 4. The blue, purple, green and red planes represent shifted patches of size $N\times N$ pixels while the gray plane represents their intersection with size $\frac{N}{2}\times\frac{N}{2}$ pixels.

Let $\mathbf{X}\in\mathbb{R}^{G}$ be a patch of size $G=N^{2}$ pixels. In what follows we will assume that tristimulus pixel color values are employed. A single DA learns to represent all the patches from all the tilings, as defined by the following equations:

$\displaystyle\tilde{\mathbf{X}}=g\left(f\left(\mathbf{X}\right)\right)$ (1)

$\displaystyle f\;:\;\mathbb{R}^{G}\rightarrow\mathbb{R}^{L}$ (2)

$\displaystyle g\;:\;\mathbb{R}^{L}\rightarrow\mathbb{R}^{G}$ (3)

where $\tilde{\mathbf{X}}\in\mathbb{R}^{G}$ is the reconstructed patch associated with the input patch $\mathbf{X}$, $f$ is the encoder contained in the autoencoder, $g$ is the decoder contained in the autoencoder, and $L$ is the width (number of neurons) of the innermost layer of the neural network. That is, the encoder compresses the high-dimensional input of dimension $G$ to a reduced set of significant features of dimension $L$, where $L<G$, and the decoder transforms the compressed data of size $L$ back into a reconstructed vector of size $G$.

As the video sequence progresses, we use the coding layers of the DA to discover relevant features so that, for any patch $\mathbf{t}$, we obtain a feature vector $\mathbf{v}\in\mathbb{R}^{L}$. We will call this process patch encoding.

$\displaystyle\mathbf{v}=f\left(\mathbf{t}\right)$ (4)

The goal of the training of the autoencoder is to minimize the mean squared reconstruction error $\mathcal{E}_{\textit{train}}$ :

$\displaystyle\mathcal{E}_{\textit{train}}=\sum_{i=1}^{T}\left\|\mathbf{X}_{i}-\tilde{\mathbf{X}}_{i}\right\|^{2}$ (5)

where $T$ is the total number of patches which can be found in the training set, which is typically made of video frames. No regularization term has been added to the autoencoder, since the self regularization properties of autoencoders [41] and the large number of training samples have been enough to avoid overfitting.

Although autoencoders aim to obtain a reliable representation consisting of general features, the influence of spurious factors such as illumination and local variation should be reduced. The invariance of the autoencoder to various scene conditions can be enhanced as proposed by several authors [42, 43, 5], who have used a training set that comprises a large volume of general image patches from natural video sequences with manually inserted noise, rather than employing patches extracted from the frames corresponding to the video to be processed. We follow this approach in our proposal. The training set for our single autoencoder is extracted from the Microsoft Common Objects in Context (COCO) dataset [44].

3.2 Patch classification

We perform the background modeling by training an $L$-dimensional probabilistic mixture which models the distribution of the feature values discovered by $f$. One Gaussian mixture component $K$ is used to model the background and one uniform mixture component $U$ is used to model the foreground. Each patch in each tiling has associated vectors $\boldsymbol{\mu}$ and $\boldsymbol{\sigma}$ that represent the mean vector and the vector of standard deviations of its background, respectively. It is assumed that a given encoded patch follows this probabilistic mixture model:

$\displaystyle p(\mathbf{v})=\pi_{\textit{Back}}p(\mathbf{v}|\textit{Back})+\pi_{\textit{Fore}}p(\mathbf{v}|\textit{Fore})$ (6)

where $\pi_{\textit{Back}}$ and $\pi_{\textit{Fore}}$ are the a priori probabilities of the background and foreground, respectively, and:

$\displaystyle p(\mathbf{v}|\textit{Back})=K(\mathbf{v}|\boldsymbol{\mu},\mathbf{\Sigma})$ (7)

$\displaystyle p(\mathbf{v}|\textit{Fore})=U(\mathbf{v})$ (8)

The Gaussian and uniform mixture components are defined as:

$\displaystyle K(\mathbf{v}|\boldsymbol{\mu},\mathbf{\Sigma})=(2\pi)^{-L/2}\text{det}(\mathbf{\Sigma})^{-1/2}\text{exp}\left(-\frac{1}{2}(\mathbf{v}-\boldsymbol{\mu})^{T}\mathbf{\Sigma}^{-1}(\mathbf{v}-\boldsymbol{\mu})\right)$ (9)

$\displaystyle U(\mathbf{v})=\left\{\begin{array}{ll}1/\textit{Vol(H)}&\text{if $\mathbf{v}\in H$}\\ 0&\text{if $\mathbf{v}\not\in H$}\end{array}\right.$ (10)

Please note that $H$ is the support of the uniform probability density function. We call Vol(H) the $L$-dimensional volume of $H$, with $d_{h}$ and $e_{h}$ as the minimum and maximum values that the $h$-th output of the encoding function $f$ can take:

$\displaystyle H=[d_{1},e_{1}]\times[d_{2},e_{2}]\times\ldots\times[d_{L},e_{L}]$ (11)

$\displaystyle\textit{Vol(H)}=\prod_{h=1}^{L}(e_{h}-d_{h})$ (12)
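As a worked illustration of Eqs. (10) to (12), the following minimal sketch evaluates the uniform component for a given box $H$. The function name `uniform_density` and its arguments are hypothetical and serve only the example.

```python
import numpy as np

def uniform_density(v, d, e):
    """Uniform pdf over the box H = [d_1, e_1] x ... x [d_L, e_L]."""
    v, d, e = (np.asarray(a, dtype=float) for a in (v, d, e))
    volume = np.prod(e - d)                 # Vol(H), Eq. (12)
    inside = np.all((v >= d) & (v <= e))    # is v in H?
    return 1.0 / volume if inside else 0.0  # Eq. (10)

# If every feature lies in [0, 1] and L = 16, then Vol(H) = 1.
density_unit_box = uniform_density(np.full(16, 0.3), np.zeros(16), np.ones(16))
# If every feature lies in [-1, 1] and L = 2, then Vol(H) = 4.
density_wide_box = uniform_density([0.0, 0.5], [-1.0, -1.0], [1.0, 1.0])
```

Note that Vol(H) grows or shrinks exponentially with $L$ unless each interval has unit length, so the output range of the encoder directly sets the weight of the foreground component.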

The mean vector and the vector of standard deviations of the Gaussian are defined as follows:

$\displaystyle\boldsymbol{\mu}=E[\mathbf{v}|\textit{Back}]$ (13)

$\displaystyle\boldsymbol{\sigma}=\left(\sqrt{E\left[\left(v_{j}-\mu_{j}\right)^{2}|\textit{Back}\right]}\right)_{j=1,\ldots,L}$ (14)

The covariance matrix of the Gaussian is assumed to be diagonal:

$\displaystyle\mathbf{\Sigma}=E\left[\left(\mathbf{v}-\boldsymbol{\mu}\right)\left(\mathbf{v}-\boldsymbol{\mu}\right)^{T}|\textit{Back}\right]=\left(\begin{array}{cccc}\sigma_{1}^{2}&0&\cdots&0\\ 0&\sigma_{2}^{2}&\cdots&0\\ \vdots&\vdots&\ddots&\vdots\\ 0&0&\cdots&\sigma_{L}^{2}\end{array}\right)$ (15)

The Robbins-Monro stochastic approximation algorithm [45] is used to estimate the mean vector $\boldsymbol{\mu}$ and the variance vector $\boldsymbol{\psi}$, which collects the variance of each component of $\mathbf{v}$:

$\displaystyle\boldsymbol{\psi}=\left(\sigma_{j}^{2}\right)_{j=1,\ldots,L}=\left(E\left[\left(v_{j}-\mu_{j}\right)^{2}|\textit{Back}\right]\right)_{j=1,\ldots,L}$ (16)

Initially, in order to save memory, we use Welford’s online algorithm [46] to obtain the initial values of $\boldsymbol{\mu}$ and $\boldsymbol{\psi}$ for each patch and tiling from the video frames of the training phase.
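For reference, Welford's online algorithm can be sketched as follows. The class name `WelfordStats` is hypothetical, and the use of the population variance (division by the number of samples) is an assumption, since the normalization is not stated here.

```python
import numpy as np

class WelfordStats:
    """Running mean and variance of L-dimensional feature vectors."""

    def __init__(self, dim):
        self.count = 0
        self.mean = np.zeros(dim)
        self.m2 = np.zeros(dim)  # running sum of squared deviations

    def update(self, v):
        v = np.asarray(v, dtype=float)
        self.count += 1
        delta = v - self.mean
        self.mean += delta / self.count
        self.m2 += delta * (v - self.mean)  # uses the updated mean

    def variance(self):
        # Assumption: population variance; requires at least one sample.
        if self.count == 0:
            raise ValueError("no samples seen yet")
        return self.m2 / self.count

# Example: initial mu and psi for one patch position from 50 encoded patches.
rng = np.random.default_rng(0)
encoded = rng.random((50, 16))
stats = WelfordStats(16)
for v in encoded:
    stats.update(v)
mu0, psi0 = stats.mean, stats.variance()
```

Only the count, the mean and the accumulated squared deviations are stored per patch position, so the training frames never need to be held in memory at once.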

The class probabilities given the encoded patch $\mathbf{v}_{n}$ at time $n$ are given by the Bayes theorem:

$\displaystyle\forall i\in\{\textit{Back},\textit{Fore}\},\quad R_{n,i}=P(i|\mathbf{v}_{n})=\frac{\pi_{i}p(\mathbf{v}_{n}|i)}{\pi_{\textit{Back}}p(\mathbf{v}_{n}|\textit{Back})+\pi_{\textit{Fore}}p(\mathbf{v}_{n}|\textit{Fore})}$ (17)
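The posterior of Eq. (17) can be evaluated in the log domain to avoid underflow of the Gaussian density when $L$ is large. The following is a minimal sketch under stated assumptions: the function names are hypothetical, the encoded patch is assumed to lie inside $H$, and the covariance is taken to be diagonal with variances `psi`.

```python
import numpy as np

def log_gaussian_diag(v, mu, psi):
    """Log of the Gaussian density of Eq. (9) with diagonal covariance."""
    dim = v.shape[-1]
    return -0.5 * (dim * np.log(2.0 * np.pi)
                   + np.sum(np.log(psi), axis=-1)
                   + np.sum((v - mu) ** 2 / psi, axis=-1))

def background_posterior(v, mu, psi, log_vol_h=0.0, prior_back=0.5):
    """R_{n,Back} of Eq. (17); assumes v lies inside the support H."""
    log_back = np.log(prior_back) + log_gaussian_diag(v, mu, psi)
    log_fore = np.log1p(-prior_back) - log_vol_h  # log(pi_Fore / Vol(H))
    # 1 / (1 + exp(log_fore - log_back)), computed without overflow.
    return np.exp(-np.logaddexp(0.0, log_fore - log_back))

mu = np.full(16, 0.5)
psi = np.full(16, 0.01)
r_close = background_posterior(mu.copy(), mu, psi)  # patch equal to the mean
r_far = background_posterior(mu + 0.5, mu, psi)     # patch far from the mean
r_fore_far = 1.0 - r_far                            # foreground posterior
```

The foreground posterior is the complement, $R_{n,\textit{Fore}}=1-R_{n,\textit{Back}}$, because the two classes are exhaustive.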

Given the update step size $\alpha$, the background model parameters $\boldsymbol{\mu}_{n}$ and $\boldsymbol{\psi}_{n}$ are updated with the encoded patch $\mathbf{v}_{n}$, whose background probability is $R_{\textit{n,Back}}$, by means of the following equations, where the square is taken component-wise:

$\displaystyle\boldsymbol{\mu}_{n+1}=(1-\alpha R_{\textit{n,Back}})\boldsymbol{\mu}_{n}+\alpha R_{\textit{n,Back}}\mathbf{v}_{n}$ (18)

$\displaystyle\boldsymbol{\psi}_{n+1}=(1-\alpha R_{\textit{n,Back}})\boldsymbol{\psi}_{n}+\alpha R_{\textit{n,Back}}[\mathbf{v}_{n}-\boldsymbol{\mu}_{n}]^{2}$ (19)
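A direct transcription of the update rules in Eqs. (18) and (19) is given below as a minimal sketch. The function name `update_background` is hypothetical, and the square is applied component-wise.

```python
import numpy as np

def update_background(mu, psi, v, r_back, alpha=0.005):
    """One Robbins-Monro step for the background mean and variance vectors."""
    step = alpha * r_back
    new_mu = (1.0 - step) * mu + step * v                # Eq. (18)
    new_psi = (1.0 - step) * psi + step * (v - mu) ** 2  # Eq. (19), uses mu_n
    return new_mu, new_psi

mu_n = np.zeros(2)
psi_n = np.full(2, 0.2)
v_n = np.ones(2)
# A large step is used here only to make the effect visible.
mu_next, psi_next = update_background(mu_n, psi_n, v_n, r_back=1.0, alpha=0.5)
# A patch that is surely foreground (r_back = 0) leaves the model unchanged.
mu_same, psi_same = update_background(mu_n, psi_n, v_n, r_back=0.0)
```

Weighting the step by the background probability means that patches likely to be foreground barely contaminate the background model.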

Also, the a priori probabilities $\pi_{\textit{n,Back}}$ and $\pi_{\textit{n,Fore}}$ can be updated from $R_{\textit{n,Back}}$ and $R_{\textit{n,Fore}}$ by means of the following equation:

$\displaystyle\forall i\in\{\textit{Back},\textit{Fore}\},\quad\pi_{n+1,i}=(1-\alpha)\pi_{n,i}+\alpha R_{n,i}$ (20)

Since $\mathbf{\Sigma}$ is a diagonal matrix with the elements of the variance vector $\boldsymbol{\psi}$ as the main diagonal, we have:

$\displaystyle\text{det}(\mathbf{\Sigma})=\prod_{j=1}^{L}\sigma^{2}_{j}$ (21)

$\displaystyle(\mathbf{v}-\boldsymbol{\mu})^{T}\mathbf{\Sigma}^{-1}(\mathbf{v}-\boldsymbol{\mu})=\sum_{j=1}^{L}\frac{(v_{j}-\mu_{j})^{2}}{{\sigma}^{2}_{j}}$ (22)
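As a derivation sketch that is not part of the original formulation, substituting these two identities into the Gaussian density of Eq. (9) and taking logarithms gives a form that only requires sums over the $L$ features and is convenient for numerical evaluation:

$\displaystyle\log K(\mathbf{v}|\boldsymbol{\mu},\mathbf{\Sigma})=-\frac{L}{2}\log(2\pi)-\frac{1}{2}\sum_{j=1}^{L}\log\sigma_{j}^{2}-\frac{1}{2}\sum_{j=1}^{L}\frac{(v_{j}-\mu_{j})^{2}}{\sigma_{j}^{2}}$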

For the sake of simplicity, we set the prior probabilities to $\pi_{\textit{Fore}}=\pi_{\textit{Back}}=0.5$ and we do not update them. Moreover, adapting them would not be very advantageous, since a video might contain much foreground in some sections and very little in others, so that the adapted prior probabilities would not be helpful.

With $i$ as a pixel and $j$ as a tiling with $j\in\{1,\ldots,M\}$, $\mathbf{t}_{i,j,n}$ is the patch belonging to the $j$-th tiling that contains the $i$-th pixel at time $n$ and $R_{i,j,n}$ is its foreground probability. We segment the $i$-th pixel at time $n$ as foreground if the average foreground probability over all encoded patches containing the pixel is greater than a threshold $\tau$, where usually $\tau=0.5$:

$\displaystyle\frac{\sum_{j=1}^{M}{R_{i,j,n}}}{M}>\tau\implies\text{foreground}$ (23)
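The decision rule of Eq. (23) can be sketched as follows. The function name `segment_pixels` and the pixel-to-patch indexing convention (each tiling is padded so that a patch corner falls on its shift vector) are assumptions made for the example.

```python
import numpy as np

def segment_pixels(patch_probs, shifts, frame_shape, n, tau=0.5):
    """Average the foreground probabilities of the M patches of each pixel.

    patch_probs: list of M arrays, one per tiling, holding the foreground
                 probability of every patch of that tiling.
    shifts:      list of M shift vectors (dy, dx), one per tiling.
    """
    h, w = frame_shape
    ys, xs = np.mgrid[0:h, 0:w]
    total = np.zeros((h, w))
    for probs, (dy, dx) in zip(patch_probs, shifts):
        # Assumed convention: patch index of pixel (y, x) in this tiling.
        rows = (ys + (n - dy) % n) // n
        cols = (xs + (n - dx) % n) // n
        total += probs[rows, cols]
    return total / len(shifts) > tau  # Eq. (23)

# Toy example with M = 2 tilings of 2 x 2 patches on a 4 x 4 frame.
probs_a = np.zeros((3, 3)); probs_a[0, 0] = 1.0  # covers pixels (0..1, 0..1)
probs_b = np.zeros((3, 3)); probs_b[1, 1] = 1.0  # covers pixels (1..2, 1..2)
mask = segment_pixels([probs_a, probs_b], [(0, 0), (1, 1)], (4, 4), n=2)
```

In this toy case only the pixel where both foreground patches overlap exceeds the threshold, which shows how the intersection of shifted patches refines the spatial resolution of the segmentation.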

3.3 Encoding activation functions

Since we use a uniform component to model the foreground, the function $f$ must not diverge to $\infty$ or $-\infty$, so that the support $H$ of the uniform probability density function (pdf) is finite. The range of $f$ will be the same as the range of the activation function of the last encoding layer of the DA.

The scheme of the method is shown in Fig. 2.

Figure 2.

Method scheme. On the left we have the original image, which is divided into patches following shifted tilings. Each patch is encoded and segmented. Each pixel is segmented according to the segmentations of the patches that include it.

Figure 3.

Autoencoder structure. On the left we have the original 16x16 patch (3 color channels). Blue figures are the outputs of convolutional layers with (5,5) filters, where the number above them indicates the number of filters. Red figures and green ones are the outputs of (2,2) MaxPool and Upsampling layers, respectively. The number above them indicates their data depth. Orange figures are the outputs of the two reshape layers: the first one from data cube to vector, the second one from vector to data cube. Black figures represent dense linear layers of the shown length. On the right we have the last 16x16x3 convolutional output. The convolutional layers in the decoding structure are of the same type as those in the encoding one, and padding is added to the data so that only the depth dimension is modified by the convolutions.

4. Experimental results

As the previous methodology section shows, our method needs the encoding layers of an autoencoder to process each patch. The network architecture is shown in Fig. 3. Convolutional and max pooling layers have been used to extract relevant visual features from the patch, so that the last dense layers select the 16 most important features. The proposed autoencoder has an input of size 16 $\times$ 16 $\times$ 3 $=$ 768, which is reduced to $L=$ 16 significant features at the output of the last encoder layer. All activation functions are ReLU functions except that of the last encoder layer. There, hyperbolic tangent or sigmoid functions could be used to address the restriction described in Subsection 3.3, but we have used a restricted ReLU: a function which is linear within the range [0,1], returns 0 if the value is less than 0 and returns 1 if the value is greater than 1. As noted in Subsection 3.1, no regularization term has been added to the autoencoder. The threshold $\tau$ has been set to 0.5, which is one of the best-performing values in the threshold tests reported later in this section (Fig. 4).
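The restricted ReLU described above amounts to clipping its input to [0,1]. A minimal NumPy sketch follows; the function name `restricted_relu` is hypothetical.

```python
import numpy as np

def restricted_relu(x):
    """Identity on [0, 1], 0 below 0 and 1 above 1."""
    return np.clip(x, 0.0, 1.0)

activations = restricted_relu(np.array([-1.0, 0.3, 2.0]))
```

With this activation every encoded feature lies in [0,1], so the support $H$ is the unit hypercube and Vol(H) $=1$ for any $L$. In Keras, a layer such as `ReLU(max_value=1.0)` should give the same behavior, although the exact layer used in this work is not stated.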

Based on our previous work [4], we have selected $M=$ 16 since we consider it a good compromise between performance and computational requirements. All of the frames have been segmented with an update step size $\alpha=$ 0.005, which has been shown to attain an adequate balance between the speed of adaptation to dynamic backgrounds and the stability of the background model. If any diagonal value $\sigma_{j}^{2}$ of the covariance matrix is lower than 0.001 after updating it, an extra 0.001 is added to it so that numerical errors are prevented in the computation of the matrix inversion for the Gaussian mixture component (Eq. (9)).
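The variance safeguard described above can be written in one line. This is a minimal sketch in which the function name `apply_variance_floor` is hypothetical.

```python
import numpy as np

def apply_variance_floor(psi, floor=0.001):
    """Add `floor` to every variance that has dropped below `floor`."""
    psi = np.asarray(psi, dtype=float)
    return np.where(psi < floor, psi + floor, psi)

psi_safe = apply_variance_floor([0.0005, 0.001, 0.2])
```

For non-negative variances, every value returned is at least 0.001, so the divisions by $\sigma_{j}^{2}$ in the Gaussian exponent stay bounded.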

The autoencoder has been trained using 400,000 patches of 16 $\times$ 16 pixels obtained from random images of the COCO dataset and altered with Gaussian noise, which serve as the input $\mathbf{X}$. The COCO dataset has been used to train the neural network in order to ensure the existence of diverse data not related to the sequences used to evaluate the proposal. The autoencoder network has 773,715 trainable parameters and was trained for 10 epochs using the Adam optimizer with default parameters. The output $\tilde{\mathbf{X}}$ is compared with the raw (noise-free) patch that corresponds to each input.

In order to obtain the initial means and variances, the first frames of the video sequence are selected to train the probabilistic model prior to segmenting the following frames of that video sequence. The training frames are defined by the dataset used for the experiment and consist of all the frames that are not within the temporal Region of Interest that is later employed to test the method.

4.1 Methods

Eight methods have been used to make a performance comparison with our proposal: WrenGA [10], ZivkovicGMM [12], ElgammalKDE [13], SOBS [15], SOBS_CF [19], SuBSENSE [21], LOBSTER [20] and PAWCS [22]. All of them are available in the BGS library [47], which we have used in order to obtain the segmentation images.

The proposed method has been implemented in Python. The neural network implementation uses Keras as a high-level application programming interface running on TensorFlow.

4.2 Sequences and noise

Figure 4.

Average F-measure for all sequences and noises with different thresholds.

4.2.1 Sequences

A total of 26 sequences have been selected from five categories of the ChangeDetection.net website in order to test PMDAPF. The Baseline category presents a mixture of mild challenges that are common in computer vision, such as subtle background motion, isolated shadows, an abandoned object or pedestrians that stop for a short time and move away later (4 sequences). The Dynamic Background category includes scenes with strong background motion acting as intrinsic noise in the image, e.g. boats on shimmering water, roads next to a fountain or trees shaken by the wind (6 sequences). The Shadows category contains video sequences with different strong shadows (6 sequences). The Night category includes videos recorded at night (6 sequences). Finally, the BadWeather category is composed of outdoor video sequences affected by winter weather conditions like snow or fog (4 sequences). These categories have been selected since they represent common video surveillance situations.

4.2.2 Noise

The original videos plus nine versions of each, which were obtained by adding different types of noise to the original video sequence, have been used to test the proposed approach. As a result, 10 $\times$ 26 $=$ 260 different videos have been segmented with each method, namely:

• Raw videos with no noise added.
• Gaussian noise: four different versions were obtained by adding Gaussian noise with zero mean and standard deviations 0.1, 0.2, 0.3 and 0.4, respectively.
• Salt and pepper: black and white pixels were randomly added to the raw images.
• Uniform: two distinct uniform noises were added to the raw video, ranging from 0 to 1 and from $-$0.5 to 0.5, respectively.
• Compression: raw images are saved changing the parameter cv2.IMWRITE_JPEG_QUALITY, which defines the compression quality. The best quality corresponds to the maximum value, 100. The parameter was set to 10 and 1, respectively, in order to obtain the corresponding noisy video sequences.
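The additive noise types listed above can be sketched with NumPy as follows. Several details are assumptions made for the example and are not taken from the experimental setup: images are floats scaled to [0,1], noisy values are clipped back to that range, and the fraction of salt and pepper pixels is a free parameter `amount`.

```python
import numpy as np

def add_gaussian_noise(img, sigma, rng):
    """Zero-mean Gaussian noise with standard deviation `sigma`."""
    return np.clip(img + rng.normal(0.0, sigma, img.shape), 0.0, 1.0)

def add_uniform_noise(img, low, high, rng):
    """Uniform noise drawn from [low, high)."""
    return np.clip(img + rng.uniform(low, high, img.shape), 0.0, 1.0)

def add_salt_and_pepper(img, amount, rng):
    """Set a fraction `amount` of the pixels to black or white."""
    out = img.copy()
    mask = rng.random(img.shape[:2])   # one draw per pixel, all channels
    out[mask < amount / 2] = 0.0       # pepper
    out[mask > 1 - amount / 2] = 1.0   # salt
    return out

rng = np.random.default_rng(0)
image = np.full((8, 8, 3), 0.5)
noisy_gauss = add_gaussian_noise(image, 0.1, rng)
noisy_unif = add_uniform_noise(image, -0.5, 0.5, rng)
noisy_sp = add_salt_and_pepper(image, 0.5, rng)
```

The compression variants are not reproduced here to keep the sketch free of external dependencies; with OpenCV they would correspond to a call such as `cv2.imwrite(path, image, [cv2.IMWRITE_JPEG_QUALITY, 10])` followed by reading the file back.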

Table 1
Quantitative results for all categories. Table shows average F-measure (and standard deviation) value for all videos for each method and noise. The higher, the better. Bold value indicates the best result within the row

Noise	PMDAPF	PAWCS	SUBSENSE	LOBSTER	KDE	ZIVKOVIC	WREN	SOBS	SOBS_CF
No noise	0.63(0.18)	0.77(0.23)	0.79(0.19)	0.66(0.26)	0.40(0.25)	0.58(0.23)	0.52(0.23)	0.49(0.28)	0.42(0.26)
Gaussian $0.1$	0.63(0.18)	0.25(0.20)	0.53(0.24)	0.23(0.22)	0.06(0.05)	0.39(0.21)	0.22(0.14)	0.07(0.06)	0.09(0.07)
Gaussian $0.2$	0.63(0.18)	0.07(0.06)	0.24(0.20)	0.06(0.05)	0.05(0.04)	0.09(0.07)	0.07(0.05)	0.06(0.04)	0.06(0.05)
Gaussian $0.3$	0.62(0.19)	0.05(0.04)	0.16(0.19)	0.05(0.04)	0.05(0.04)	0.06(0.05)	0.06(0.04)	0.05(0.04)	0.06(0.04)
Gaussian $0.4$	0.60(0.20)	0.05(0.04)	0.11(0.16)	0.05(0.04)	0.05(0.04)	0.05(0.04)	0.05(0.04)	0.05(0.04)	0.05(0.04)
Uniform [0,1]	0.55(0.24)	0.07(0.05)	0.41(0.26)	0.12(0.14)	0.07(0.06)	0.08(0.06)	0.07(0.06)	0.07(0.05)	0.07(0.06)
Uniform [$-$0.5,0.5]	0.62(0.19)	0.05(0.04)	0.20(0.20)	0.05(0.04)	0.05(0.04)	0.05(0.04)	0.05(0.04)	0.05(0.04)	0.05(0.04)
Salt and pepper	0.56(0.23)	0.42(0.29)	0.44(0.22)	0.55(0.28)	0.18(0.15)	0.15(0.11)	0.13(0.10)	0.13(0.10)	0.13(0.10)
Compression 10	0.63(0.18)	0.61(0.23)	0.74(0.21)	0.64(0.26)	0.26(0.20)	0.53(0.21)	0.50(0.22)	0.41(0.25)	0.39(0.25)
Compression 1	0.59(0.20)	0.51(0.25)	0.52(0.22)	0.44(0.24)	0.22(0.17)	0.26(0.16)	0.45(0.21)	0.22(0.16)	0.24(0.17)

Table 2
Quantitative results for baseline category. Table shows average F-measure (and standard deviation) value for all videos for each method and noise. The higher, the better. Bold value indicates the best result within the row

Noise	PMDAPF	PAWCS	SUBSENSE	LOBSTER	KDE	ZIVKOVIC	WREN	SOBS	SOBS_CF
No noise	0.69(0.08)	0.94(0.01)	0.95(0.01)	0.92(0.02)	0.68(0.12)	0.79(0.13)	0.75(0.15)	0.77(0.07)	0.61(0.23)
Gaussian $0.1$	0.69(0.08)	0.36(0.25)	0.76(0.16)	0.45(0.34)	0.09(0.06)	0.55(0.23)	0.33(0.12)	0.11(0.08)	0.14(0.09)
Gaussian $0.2$	0.69(0.08)	0.11(0.10)	0.41(0.32)	0.09(0.06)	0.07(0.06)	0.13(0.12)	0.10(0.08)	0.08(0.06)	0.09(0.07)
Gaussian $0.3$	0.68(0.08)	0.08(0.07)	0.32(0.33)	0.08(0.06)	0.07(0.06)	0.08(0.07)	0.08(0.06)	0.08(0.06)	0.08(0.06)
Gaussian $0.4$	0.66(0.06)	0.07(0.06)	0.25(0.28)	0.07(0.06)	0.07(0.06)	0.08(0.06)	0.07(0.06)	0.07(0.06)	0.08(0.06)
Uniform [0,1]	0.64(0.13)	0.10(0.06)	0.64(0.08)	0.26(0.28)	0.10(0.05)	0.10(0.05)	0.11(0.05)	0.09(0.05)	0.10(0.05)
Uniform [$-$0.5,0.5]	0.68(0.08)	0.08(0.07)	0.35(0.34)	0.07(0.06)	0.07(0.06)	0.08(0.06)	0.08(0.06)	0.07(0.06)	0.08(0.06)
Salt and pepper	0.64(0.08)	0.66(0.13)	0.56(0.27)	0.85(0.12)	0.28(0.14)	0.23(0.15)	0.19(0.12)	0.20(0.13)	0.19(0.14)
Compression 10	0.68(0.08)	0.80(0.08)	0.92(0.02)	0.89(0.02)	0.37(0.12)	0.72(0.12)	0.71(0.14)	0.57(0.22)	0.54(0.23)
Compression 1	0.64(0.06)	0.71(0.16)	0.72(0.04)	0.70(0.07)	0.34(0.06)	0.37(0.09)	0.65(0.11)	0.33(0.12)	0.35(0.14)

4.3 Evaluation

For quantitative comparison, the well-known F-measure, defined as the harmonic mean of precision and recall, is chosen. It takes values in the interval [0,1], where values closer to one indicate better performance.

The F-measure has been calculated for all sequences. For each method, the True Positives (TP), True Negatives (TN), False Positives (FP) and False Negatives (FN) of all frames in the Region of Interest (the frames to evaluate, as specified by ChangeDetection.net) are added up, and the sequence F-measure is computed from these totals.

$\displaystyle\textit{F-measure}=\frac{2\cdot\textit{Precision}\cdot\textit{Recall}}{\textit{Precision}+\textit{Recall}}$ (24)

with

$\displaystyle\textit{Precision}=\frac{\textit{TP}}{\textit{TP}+\textit{FP}}$ (25)

$\displaystyle\textit{Recall}=\frac{\textit{TP}}{\textit{TP}+\textit{FN}}$ (26)
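The accumulation described above can be sketched as follows. This is only an illustrative sketch of Eqs. (24) to (26): the `Counts` container and the function name are hypothetical, and returning 0 when a denominator vanishes is an assumed convention that the text does not specify.

```python
from dataclasses import dataclass


@dataclass
class Counts:
    """Per-frame confusion counts inside the Region of Interest (hypothetical container)."""
    tp: int = 0
    fp: int = 0
    fn: int = 0
    tn: int = 0


def sequence_f_measure(per_frame_counts):
    """Add the counts of all frames, then apply Eqs. (24)-(26) once per sequence."""
    tp = sum(c.tp for c in per_frame_counts)
    fp = sum(c.fp for c in per_frame_counts)
    fn = sum(c.fn for c in per_frame_counts)
    # Assumed convention: an undefined ratio (zero denominator) is reported as 0.
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0.0:
        return 0.0
    return 2.0 * precision * recall / (precision + recall)


frames = [Counts(tp=8, fp=2, fn=1), Counts(tp=2, fp=0, fn=3)]
score = sequence_f_measure(frames)
```

Note that adding the counts before computing the ratio is not the same as averaging per-frame F-measures: frames with many foreground pixels weigh more in the sequence score.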

4.4 Results

Tables 1 to 6 show the quantitative results for videos with and without added noise. According to them, the proposed method clearly outperforms WrenGA, ZivkovicGMM, ElGammalKDE, SOBS and SOBS_CF in nearly all of the test categories, especially when noise is added. Although the performance of PMDAPF on raw video sequences is lower than that of robust methods such as PAWCS, SUBSENSE and LOBSTER, the performance of these methods degrades sharply when noise is present in the images. Thanks to its noise resilience, the performance of our proposal is only slightly reduced, and it attains the best average F-measure in 8 of the 9 noisy settings, the exception being compression 10 (Table 1).

The good performance of our method on the dynamicBackground and night categories when noise is added is also remarkable (Tables 3 and 6).

Figure 5.

Comparison between PMDAPF and the method from [5] (denoted as PRL2019), both with $\alpha=0.005$. Average F-measure for sequences pedestrians, canoe, fountain01, fountain02 and boats is computed for each kind of noise: no noise, Gaussian noise with $\sigma=0.1$, Gaussian noise with $\sigma=0.2$, uniform noise, salt and pepper, compression 10 and compression 1.

Table 3

Quantitative results for dynamicBackground category. Table shows average F-measure (and standard deviation) value for all videos for each method and noise. The higher, the better. Bold value indicates the best result within the row

	PMDAPF	PAWCS	SUBSENSE	LOBSTER	KDE	ZIVKOVIC	WREN	SOBS	SOBS_CF
No noise	0.74(0.16)	0.90(0.07)	0.82(0.09)	0.57(0.31)	0.21(0.07)	0.42(0.25)	0.29(0.19)	0.17(0.13)	0.16(0.14)
Gaussian $0.1$	0.74(0.16)	0.10(0.09)	0.51(0.21)	0.13(0.19)	0.03(0.03)	0.24(0.20)	0.10(0.09)	0.03(0.03)	0.04(0.04)
Gaussian $0.2$	0.74(0.16)	0.04(0.05)	0.17(0.14)	0.03(0.03)	0.03(0.03)	0.05(0.05)	0.04(0.04)	0.03(0.03)	0.03(0.03)
Gaussian $0.3$	0.72(0.17)	0.03(0.03)	0.07(0.07)	0.03(0.03)	0.03(0.03)	0.03(0.03)	0.03(0.03)	0.03(0.03)	0.03(0.03)
Gaussian $0.4$	0.68(0.23)	0.03(0.03)	0.04(0.05)	0.03(0.03)	0.03(0.03)	0.03(0.03)	0.03(0.03)	0.03(0.03)	0.03(0.03)
Uniform [0,1]	0.65(0.24)	0.03(0.03)	0.31(0.22)	0.08(0.11)	0.03(0.03)	0.03(0.03)	0.03(0.03)	0.03(0.03)	0.03(0.03)
Uniform [ $-$ 0.5,0.5]	0.73(0.17)	0.03(0.04)	0.11(0.10)	0.03(0.03)	0.03(0.03)	0.03(0.03)	0.03(0.03)	0.03(0.03)	0.03(0.03)
Salt and pepper	0.65(0.26)	0.66(0.30)	0.45(0.23)	0.28(0.19)	0.08(0.07)	0.08(0.08)	0.07(0.06)	0.07(0.07)	0.07(0.07)
Compression 10	0.74(0.16)	0.62(0.24)	0.77(0.14)	0.56(0.30)	0.12(0.08)	0.38(0.21)	0.29(0.19)	0.15(0.12)	0.15(0.14)
Compression 1	0.71(0.19)	0.44(0.21)	0.51(0.22)	0.35(0.20)	0.10(0.06)	0.13(0.09)	0.27(0.19)	0.09(0.06)	0.10(0.06)

Table 4

Quantitative results for badWeather category. Table shows average F-measure (and standard deviation) value for all videos for each method and noise. Bold value indicates the best result within the row

	PMDAPF	PAWCS	SUBSENSE	LOBSTER	KDE	ZIVKOVIC	WREN	SOBS	SOBS_CF
No noise	0.63(0.16)	0.78(0.08)	0.86(0.04)	0.61(0.14)	0.46(0.24)	0.74(0.13)	0.65(0.13)	0.63(0.24)	0.54(0.25)
Gaussian $0.1$	0.64(0.16)	0.33(0.31)	0.51(0.26)	0.30(0.27)	0.05(0.06)	0.44(0.25)	0.19(0.15)	0.05(0.05)	0.07(0.06)
Gaussian $0.2$	0.64(0.15)	0.04(0.04)	0.24(0.33)	0.05(0.06)	0.04(0.04)	0.09(0.10)	0.05(0.06)	0.04(0.04)	0.05(0.05)
Gaussian $0.3$	0.64(0.15)	0.04(0.04)	0.19(0.32)	0.04(0.04)	0.04(0.04)	0.05(0.05)	0.04(0.04)	0.04(0.04)	0.04(0.04)
Gaussian $0.4$	0.62(0.16)	0.04(0.04)	0.15(0.28)	0.04(0.04)	0.04(0.04)	0.04(0.04)	0.04(0.04)	0.04(0.04)	0.04(0.04)
Uniform [0,1]	0.37(0.32)	0.06(0.08)	0.57(0.28)	0.09(0.09)	0.06(0.08)	0.08(0.09)	0.07(0.08)	0.06(0.08)	0.07(0.08)
Uniform [ $-$ 0.5,0.5]	0.66(0.13)	0.04(0.04)	0.25(0.33)	0.04(0.04)	0.04(0.04)	0.04(0.04)	0.04(0.04)	0.04(0.04)	0.04(0.04)
Salt and pepper	0.51(0.26)	0.21(0.17)	0.48(0.25)	0.67(0.09)	0.15(0.18)	0.13(0.11)	0.10(0.10)	0.11(0.09)	0.10(0.08)
Compression 10	0.64(0.16)	0.72(0.12)	0.75(0.16)	0.59(0.15)	0.42(0.30)	0.65(0.10)	0.61(0.15)	0.60(0.25)	0.54(0.26)
Compression 1	0.56(0.23)	0.44(0.15)	0.57(0.27)	0.37(0.27)	0.22(0.28)	0.33(0.20)	0.52(0.19)	0.23(0.21)	0.24(0.21)

Table 5

Quantitative results for shadow category. Table shows average F-measure (and standard deviation) value for all videos for each method and noise. The higher, the better. Bold value indicates the best result within the row

	PMDAPF	PAWCS	SUBSENSE	LOBSTER	KDE	ZIVKOVIC	WREN	SOBS	SOBS_CF
No noise	0.69(0.16)	0.89(0.04)	0.90(0.04)	0.87(0.04)	0.61(0.16)	0.70(0.09)	0.66(0.07)	0.65(0.20)	0.55(0.27)
Gaussian $0.1$	0.70(0.16)	0.33(0.19)	0.64(0.10)	0.26(0.11)	0.10(0.05)	0.51(0.09)	0.34(0.09)	0.12(0.05)	0.15(0.07)
Gaussian $0.2$	0.70(0.16)	0.09(0.04)	0.26(0.14)	0.09(0.04)	0.09(0.04)	0.13(0.06)	0.11(0.05)	0.09(0.04)	0.10(0.05)
Gaussian $0.3$	0.69(0.15)	0.08(0.04)	0.14(0.08)	0.09(0.04)	0.09(0.04)	0.09(0.04)	0.09(0.04)	0.09(0.04)	0.09(0.04)
Gaussian $0.4$	0.67(0.15)	0.08(0.04)	0.09(0.06)	0.08(0.04)	0.08(0.04)	0.09(0.04)	0.09(0.04)	0.08(0.04)	0.09(0.04)
Uniform [0,1]	0.70(0.14)	0.10(0.05)	0.53(0.09)	0.16(0.08)	0.12(0.06)	0.13(0.07)	0.13(0.06)	0.11(0.05)	0.12(0.06)
Uniform [ $-$ 0.5,0.5]	0.68(0.15)	0.08(0.03)	0.20(0.12)	0.09(0.04)	0.08(0.04)	0.09(0.04)	0.09(0.04)	0.09(0.04)	0.09(0.04)
Salt and pepper	0.66(0.15)	0.40(0.24)	0.50(0.18)	0.77(0.12)	0.33(0.14)	0.24(0.09)	0.23(0.09)	0.22(0.09)	0.21(0.10)
Compression 10	0.69(0.16)	0.66(0.17)	0.87(0.05)	0.86(0.04)	0.34(0.17)	0.65(0.12)	0.64(0.08)	0.53(0.22)	0.51(0.26)
Compression 1	0.68(0.15)	0.76(0.10)	0.61(0.18)	0.59(0.18)	0.36(0.17)	0.35(0.16)	0.58(0.10)	0.31(0.18)	0.37(0.22)

Table 6

Quantitative results for night category. Table shows average F-measure (and standard deviation) value for all videos for each method and noise. The higher, the better. Bold value indicates the best result within the row

	PMDAPF	PAWCS	SUBSENSE	LOBSTER	KDE	ZIVKOVIC	WREN	SOBS	SOBS_CF
No noise	0.42(0.14)	0.40(0.17)	0.49(0.16)	0.41(0.16)	0.17(0.10)	0.36(0.16)	0.36(0.17)	0.35(0.19)	0.36(0.18)
Gaussian $0.1$	0.42(0.14)	0.18(0.13)	0.29(0.21)	0.09(0.06)	0.04(0.03)	0.28(0.14)	0.15(0.09)	0.05(0.03)	0.07(0.04)
Gaussian $0.2$	0.41(0.14)	0.08(0.06)	0.19(0.12)	0.04(0.02)	0.04(0.02)	0.07(0.04)	0.05(0.03)	0.04(0.02)	0.05(0.02)
Gaussian $0.3$	0.39(0.15)	0.04(0.02)	0.12(0.09)	0.04(0.02)	0.04(0.02)	0.04(0.02)	0.04(0.02)	0.04(0.02)	0.04(0.02)
Gaussian $0.4$	0.37(0.15)	0.04(0.02)	0.09(0.07)	0.04(0.02)	0.04(0.02)	0.04(0.02)	0.04(0.02)	0.04(0.02)	0.04(0.02)
Uniform [0,1]	0.35(0.16)	0.04(0.03)	0.15(0.21)	0.05(0.04)	0.04(0.03)	0.05(0.03)	0.05(0.03)	0.04(0.03)	0.05(0.03)
Uniform [ $-$ 0.5,0.5]	0.39(0.14)	0.05(0.02)	0.15(0.10)	0.04(0.02)	0.04(0.02)	0.04(0.02)	0.04(0.02)	0.04(0.02)	0.04(0.02)
Salt and pepper	0.34(0.17)	0.19(0.12)	0.26(0.15)	0.33(0.22)	0.09(0.05)	0.10(0.06)	0.09(0.05)	0.09(0.04)	0.08(0.04)
Compression 10	0.41(0.15)	0.34(0.16)	0.45(0.17)	0.37(0.17)	0.14(0.13)	0.34(0.14)	0.35(0.17)	0.31(0.17)	0.33(0.16)
Compression 1	0.36(0.12)	0.25(0.15)	0.28(0.10)	0.28(0.19)	0.12(0.06)	0.17(0.08)	0.30(0.15)	0.17(0.08)	0.19(0.08)

Figure 6.

Qualitative results for frame 1275 of fountain02 (dynamicBackground category). The first row shows the original frame with different kinds of noise and the ground truth (GT). The following rows show the segmentations produced by our method, WREN, ZIVKOVIC, PAWCS, SUBSENSE, KDE, SOBS and SOBS_CF, respectively.

Fig. 6 shows some examples of the behavior of our method compared to the others. In accordance with the quantitative tables, WrenGA, SOBS, ZivkovicGMM, ElGammalKDE and SOBS_CF generate images full of False Positives when the input images are noisy. More complex methods such as PAWCS and LOBSTER also show this behavior when Gaussian or uniform noise is added. PMDAPF seems to be less accurate than the other methods when segmenting foreground objects in noise-free frames, but it maintains a similar level of segmentation performance even in noisy images. It also avoids the massive generation of False Positives shown by most methods in the tests where noise was added. Moreover, it does not suffer from the increase in False Negatives that affects SUBSENSE when Gaussian or uniform noise is present.

Fig. 4 shows how the average performance varies as the threshold $\tau$ is modified. Values of $\tau$ between 0.35 and 0.55 yield the best performance.

Since PMDAPF is conceptually based on the previous work [5], a comparison on five sequences with 7 variations each (raw and with 6 different types of noise) has been included; its results can be seen in Fig. 5.

5. Conclusions

A foreground detection method for video sequences which represents the background as a probabilistic mixture model has been proposed. In order to provide estimations of the background probability for each image pixel, the method considers the $M$ square patches of size $N\times N$ which include the pixel and computes a combination of their probabilities of belonging to the background. The patches come from $M$ tilings of the video frame. The patch probabilities are obtained from a probabilistic model of the significant features learned by a previously trained deep autoencoder. The autoencoder is able to learn such features even under noise conditions. The reduced representation of each patch learned by the autoencoder is supplied to the probabilistic mixture model, which in turn yields the likelihood that the patch belongs to the background or the foreground.
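The pixel-level combination summarized above can be illustrated with a minimal sketch. It assumes that the combination is a plain arithmetic mean of the patch probabilities, that each tiling is described by a non-negative (row, column) shift of its origin, and that patches crossing the frame border are clipped; none of these choices is stated in this section, and the function and parameter names are hypothetical.

```python
import numpy as np


def pixel_background_probability(patch_probs, offsets, patch_size, frame_shape):
    """Average, for every pixel, the background probabilities of the patches covering it.

    patch_probs: list of M 2-D arrays; patch_probs[m][i, j] is the background
                 probability of patch (i, j) of tiling m.
    offsets:     list of M non-negative (row, col) shifts of each tiling's origin.
    patch_size:  side N of the square patches.
    """
    height, width = frame_shape
    total = np.zeros(frame_shape, dtype=float)
    count = np.zeros(frame_shape, dtype=float)
    for probs, (dy, dx) in zip(patch_probs, offsets):
        for (i, j), p in np.ndenumerate(probs):
            top, left = dy + i * patch_size, dx + j * patch_size
            # Clip patches that extend beyond the frame (assumed border handling).
            bottom = min(top + patch_size, height)
            right = min(left + patch_size, width)
            total[top:bottom, left:right] += p
            count[top:bottom, left:right] += 1
    # Assumed combination rule: arithmetic mean; uncovered pixels are marked NaN.
    return np.where(count > 0, total / np.maximum(count, 1), np.nan)


tiling_a = np.array([[1.0, 0.0], [0.0, 1.0]])  # tiling anchored at (0, 0)
tiling_b = np.array([[0.5]])                   # tiling shifted by (1, 1)
prob_map = pixel_background_probability([tiling_a, tiling_b], [(0, 0), (1, 1)], 2, (4, 4))
```

In this toy example the pixel at (1, 1) is covered by one patch of each tiling, so its value is the mean of 1.0 and 0.5.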

The chosen probability mixture model comprises a multivariate Gaussian mixture component with diagonal covariance matrix for the background of the scene, and a uniform mixture component for the foreground. This way, the foreground component is able to model equally well any incoming foreground object, while the background component specializes in the particular characteristics of the background region at hand. Stochastic approximation theory is employed to derive the learning algorithm for the probabilistic mixture. Finally, a Bayesian classification is carried out in order to obtain the background and foreground probabilities, according to suitably learned a priori probabilities.
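As an illustration of this two-component model, the sketch below computes the Bayesian posterior of the background for one feature vector and performs one online update. It is a sketch under stated assumptions, not the learning rule derived in the paper: the update is a generic responsibility-weighted running average in the spirit of stochastic approximation, `fore_density` (the constant value of the uniform foreground density) and all function names are hypothetical, and the default step size 0.005 merely reuses the value quoted for $\alpha$ in the Fig. 5 caption on the assumption that $\alpha$ denotes the step size.

```python
import numpy as np


def background_posterior(z, mean, var, prior_back, fore_density):
    """Posterior probability that feature vector z was generated by the background."""
    # Log-density of a Gaussian with diagonal covariance (var holds the diagonal).
    log_gauss = -0.5 * np.sum(np.log(2.0 * np.pi * var) + (z - mean) ** 2 / var)
    back = prior_back * np.exp(log_gauss)
    # Uniform foreground component: constant density over the feature space.
    fore = (1.0 - prior_back) * fore_density
    return back / (back + fore)


def online_update(z, mean, var, prior_back, fore_density, alpha=0.005):
    """One assumed stochastic-approximation-style update with step size alpha."""
    r = background_posterior(z, mean, var, prior_back, fore_density)
    new_prior = (1.0 - alpha) * prior_back + alpha * r
    step = alpha * r  # samples that look like foreground barely move the background model
    new_mean = mean + step * (z - mean)
    new_var = var + step * ((z - mean) ** 2 - var)
    return new_mean, new_var, new_prior


mean, var = np.zeros(2), np.ones(2)
p_near = background_posterior(np.zeros(2), mean, var, prior_back=0.5, fore_density=1.0)
p_far = background_posterior(np.full(2, 10.0), mean, var, prior_back=0.5, fore_density=1.0)
new_mean, new_var, new_prior = online_update(np.ones(2), mean, var, 0.5, 1.0)
```

Because the foreground density is constant, any feature vector far from the background mean is attributed to the foreground, whatever the appearance of the incoming object.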

The experimental design includes several heterogeneous scenes, with and without noise. These scenes have been processed by our method and by eight competing background modeling methods. According to the obtained results, the robustness of our method is outstanding. Not only is it able to maintain good performance in the presence of moderate noise, but it also stands as the method that works best with very noisy sequences, on which the performance of the other methods falls drastically whereas the proposed method is only slightly affected. These experimental results confirm the validity of the proposed feature extraction architecture, the chosen probabilistic model and its associated learning algorithm.

As future work, the influence of the autoencoder architecture on the performance of the probabilistic model could be studied. Different numbers and types (convolutional or dense) of layers and different numbers of neurons would be tested for that purpose. It is also planned to analyze the influence of variational autoencoders (VAE) in the field of background modeling, since they try to represent the encoding latent space in a more organized way. Finally, from a supervised point of view, it is possible to consider the use of recurrent neural networks, since the significant information in video sequences is both spatial and temporal. This would permit the definition of end-to-end deep neural models, which would integrate both the representation of the codified information and the background modeling.

Footnotes

https://github.com/andrewssobral/bgslibrary.

https://keras.io/.

https://www.tensorflow.org/.

http://changedetection.net/.

Acknowledgments

This work is partially supported by the Ministry of Economy and Competitiveness of Spain under grants TIN2016-75097-P and PPIT.UMA.B1.2017. It is also partially supported by the Ministry of Science, Innovation and Universities of Spain under grant RTI2018-094645-B-I00, project name Automated detection with low-cost hardware of unusual activities in video sequences. It is also partially supported by the Autonomous Government of Andalusia (Spain) under project UMA18-FEDERJA-084, project name Detection of anomalous behavior agents by deep learning in low-cost video surveillance intelligent systems. All of them include funds from the European Regional Development Fund (ERDF). The authors thankfully acknowledge the computer resources, technical expertise and assistance provided by the SCBI (Supercomputing and Bioinformatics) center of the University of Málaga. They also gratefully acknowledge the support of NVIDIA Corporation with the donation of two Titan X GPUs used for this research. The authors acknowledge the funding from the Universidad de Málaga.

References

1. Charte, García, del Jesus, Herrera. A practical tutorial on autoencoders for nonlinear feature fusion: Taxonomy, models, software and guidelines. Information Fusion. 2018; 44: 78-96.

2. Vincent, Larochelle, Lajoie, Bengio, Manzagol. Stacked Denoising Autoencoders: Learning Useful Representations in a Deep Network with a Local Denoising Criterion. Journal of Machine Learning Research. 2010; 11: 3371-3408.

3. Baldi, Hornik. Neural networks and principal component analysis: Learning from examples without local minima. Neural Networks. 1989; 2(1): 53-58.

4. García-González, Ortiz-de Lazcano-Lobato, Luque-Baena, López-Rubio. Background Modeling by Shifted Tilings of Stacked Denoising Autoencoders. In: From Bioinspired Systems and Biomedical Applications to Machine Learning, 2019, pp. 307-316.

5. García-González, Ortiz-de Lazcano-Lobato, Luque-Baena, Molina-Cabello, López-Rubio. Foreground detection by probabilistic modeling of the features discovered by stacked denoising autoencoders in noisy video sequences. Pattern Recognition Letters. 2019; 125: 481-487.

6. Guo, Liu, Zhu, Yin. Deep Clustering with Convolutional Autoencoders. In: Neural Information Processing, 2017, pp. 373-382.

7. Ferrandez Vicente, Alvarez-Sanchez, de la Paz Lopez, Toledo Moreo, Adeli, editors. Understanding the Brain Function and Emotions, Proceedings of the 8th International Work-Conference on the Interplay Between Natural and Artificial Computation, Part I. Springer, 2019.

8. Ferrandez Vicente, Alvarez-Sanchez, de la Paz Lopez, Toledo Moreo, Adeli, editors. From Bioinspired Systems and Biomedical Applications to Machine Learning. Proceedings of the 8th International Work-Conference on the Interplay Between Natural and Artificial Computation, Part II. Springer, 2019.

9. Collins, Lipton, Kanade. Introduction to the special section on video surveillance. IEEE Transactions on Pattern Analysis and Machine Intelligence. 2000; 22(8): 745-746.

10. Wren, Azarbayejani, Darrell, Pentland. Pfinder: Real-Time Tracking of the Human Body. IEEE Trans on Pattern Analysis and Machine Intelligence. 1997; 19(7): 780-785.

11. Stauffer, Grimson. Adaptive background mixture models for real-time tracking. vol. 2, 1999, pp. 246-252.

12. Zivkovic, van der Heijden. Efficient adaptive density estimation per image pixel for the task of background subtraction. Pattern Recognition Letters. 2006; 27(7): 773-780.

13. Elgammal, Harwood, Davis. Non-parametric model for background subtraction. In: Computer Vision (ECCV). Springer, 2000, pp. 751-767.

14. López-Rubio, Molina-Cabello, Luque-Baena, Domínguez. Foreground Detection by Competitive Learning for Varying Input Distributions. International Journal of Neural Systems. 2018; 28(5): 1750056.

15. Maddalena, Petrosino. A Self-Organizing Approach to Background Subtraction for Visual Surveillance Applications. Trans Img Proc. 2008 Jul; 17(7): 1168-1177.

16. López-Rubio, Luque-Baena, Domínguez. Foreground detection in video sequences with probabilistic self-organizing maps. International Journal of Neural Systems. 2011; 21(3): 225-246.

17. Ritter, Schulten. Kohonen’s self-organizing maps: exploring their computational capabilities. In: IEEE International Conference on Neural Networks, 1988, pp. 109-116.

18. Maddalena, Petrosino. The SOBS algorithm: what are the limits? In: IEEE Computer Society Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), 2012, pp. 21-26.

19. Maddalena, Petrosino. A fuzzy spatial coherence-based approach to background/foreground separation for moving object detection. Neural Computing and Applications. 2010; 19(2): 179-186.

20. St-Charles, Bilodeau. Improving background subtraction using Local Binary Similarity Patterns. In: IEEE Winter Conference on Applications of Computer Vision, 2014, pp. 509-515.

21. St-Charles, Bilodeau, Bergevin. SuBSENSE: A Universal Change Detection Method With Local Adaptive Sensitivity. IEEE Transactions on Image Processing. 2015; 24(1): 359-373.

22. St-Charles, Bilodeau, Bergevin. Universal Background Subtraction Using Word Consensus Models. IEEE Transactions on Image Processing. 2016; 25(10): 4768-4781.

23. Javed, Narayanamurthy, Bouwmans, Vaswani. Robust PCA and Robust Subspace Tracking: A Comparative Evaluation. In: IEEE Statistical Signal Processing Workshop (SSP), 2018, pp. 836-840.

24. Wang, Bai. Regional parallel structure based CNN for thermal infrared face identification. Integrated Computer-Aided Engineering. 2018; 25(3): 247-260.

25. Antoniades, Spyrou, Martin-Lopez, Valentin, Alarcon, Sanei, et al. Deep Neural Architectures for Mapping Scalp to Intracranial EEG. International Journal of Neural Systems. 2018; 28(8).

26. Acharya, Hagiwara, Tan, Adeli. Deep convolutional neural network for the automated detection and diagnosis of seizure using EEG signals. Computers in Biology and Medicine. 2018; 100: 270-278.

27. Ansari, Cherian, Caicedo, Naulaers, De Vos, Van Huffel. Neonatal Seizure Detection Using Deep Convolutional Neural Networks. International Journal of Neural Systems. 2019; 29(4).

28. Hua, Wang, Liu, Khalid. A Novel Method of Building Functional Brain Network Using Deep Learning Algorithm with Application in Proficiency Detection. International Journal of Neural Systems. 2019; 29(1).

29. Yang, Cappelle, Ruichek, El Bagdouri. Multi-object tracking with discriminant correlation filter based deep learning tracker. Integrated Computer-Aided Engineering. 2019; 26(3): 273-284.

30. Molina-Cabello, Luque-Baena, López-Rubio, Thurnhofer-Hemsi. Vehicle type detection by ensembles of convolutional neural networks operating on super resolved images. Integrated Computer-Aided Engineering. 2018; 25(4): 321-333.

31. Vera-Olmos, Pardo, Melero, Malpica. DeepEye: Deep convolutional network for pupil detection in real environments. Integrated Computer-Aided Engineering. 2018; 26(1): 85-95.

32. Manzanera, Meles, Leenders, Renken, Pagani, Arnaldi, et al. Scaled Subprofile Modeling and Convolutional Neural Networks for the Identification of Parkinson’s Disease in 3D Nuclear Imaging Data. International Journal of Neural Systems, 2019.

33. Rafiei, Adeli. A novel unsupervised deep learning model for global and local health condition assessment of structures. Engineering Structures. 2018; 156: 598-607.

34. Bouwmans, Javed, Sultana, Jung. Deep neural network concepts for background subtraction: A systematic review and comparative evaluation. Neural Networks. 2019; 117: 8-66.

35. Gracewell, John. Dynamic background modeling using deep learning autoencoder network. Multimedia Tools and Applications. 2019; 3.

36. Choo, Seo, Jeong, Cho. Learning Background Subtraction by Video Synthesis and Multi-scale Recurrent Networks. In: Computer Vision – ACCV, 2019, pp. 357-372.

37. Lim, Keles. Foreground segmentation using convolutional neural networks for multiscale feature encoding. Pattern Recognition Letters. 2018; 112: 256-262.

38. Sultana, Mahmood, Javed, Jung. Unsupervised deep context prediction for background estimation and foreground segmentation. Machine Vision and Applications. 2019 Apr; 30(3): 375-395.

39. Braham, Van Droogenbroeck. Deep background subtraction with scene-specific convolutional neural networks. In: International Conference on Systems, Signals and Image Processing (IWSSIP), 2016, pp. 1-4.

40. Zheng, Wang. A Novel Background Subtraction Algorithm Based on Parallel Vision and Bayesian GANs. Neurocomputing. 2019; 6.

41. Radhakrishnan, Belkin, Uhler. Downsampling leads to Image Memorization in Convolutional Autoencoders. CoRR. 2018; abs/1810.10333. Available from: http://arxiv.org/abs/1810.10333.

42. Wang, Yeung. Learning a Deep Compact Image Representation for Visual Tracking. In: Advances in Neural Inform. Processing Systems. 2013; 26: 809-817.

43. Zhang, Zhao. Deep learning driven blockwise moving object detection with binary scene modeling. Neurocomputing. 2015; 168: 454-463.

44. Lin, Maire, Belongie, Hays, Perona, Ramanan, et al. Microsoft COCO: Common Objects in Context. In: European Conference on Computer Vision, ECCV, 2014, pp. 740-755.

45. Robbins, Monro. A Stochastic Approximation Method. The Annals of Mathematical Statistics. 1951; 22(3): 400-407.

46. Welford. Note on a Method for Calculating Corrected Sums of Squares and Products. Technometrics. 1962; 4(3): 419-420.

47. Sobral, Bouwmans. BGS Library: A Library Framework for Algorithm’s Evaluation in Foreground/Background Segmentation. In: Background Modeling and Foreground Detection for Video Surveillance. CRC Press, Taylor and Francis, 2014.