Abstract
Detecting humans carrying baggage in video sequences is an important module for identifying unattended baggage in video surveillance systems. This paper presents a framework for implementing such a module. Because the video is recorded with a static camera, a background model is first constructed to extract foreground regions. These regions are treated as human candidates and verified using a general human detector. To identify whether a human is carrying baggage, the human region is divided into several components (head, body, leg, and baggage) according to the spatial position of the baggage relative to human body proportions. Scalable histogram of oriented gradient features are extracted from each component, and the feature dimension is reduced by applying a genetic algorithm. The features of each component are used to train a support vector machine (SVM), which is regarded as a weak classifier. A boosting machine combines these weak classifiers into a strong classifier for the final decision. In the experiments, standard public datasets are used to evaluate the effectiveness of the proposed approach. The results show that the proposed framework outperforms state-of-the-art methods and can be considered one solution to the aforementioned task.
Introduction
In recent years, the number of surveillance cameras installed in public areas, such as buses, airports, railway stations, building lobbies, schools, stores, and other public spaces, has increased. The surveillance camera is one of the most significant technological innovations in the security domain. Currently, there are an estimated 100 million surveillance cameras in use worldwide [1]. Surveillance cameras play an especially important role in ensuring the safety and security of citizens. This motivates the development of automated surveillance systems that detect suspicious activity in the monitored area. Often, a detailed description of a human appearing in camera footage is used to analyze that person's behavior.
In automatic video surveillance systems, detecting humans carrying baggage is one of the most widely used modules for theft detection, criminal behavior identification, and bombing prevention. Conceptually, such detection can be achieved by observing the changes in human appearance caused by carrying baggage, which means that a human carrying an object can be distinguished from a human without one. In practice, a human-baggage detector can be used for this task. However, the task is inherently difficult because of the wide range of baggage that a person can carry and the different ways in which it can be carried.
This paper proposes an approach for detecting humans carrying baggage. Our approach exploits the strong connection between baggage and body components. Instead of constructing a model for the entire object, our approach builds a model for each component [2] of the body, including the baggage component, according to several possible placements. The spatial model of baggage, adopted from [3] with some improvements, is used for detecting humans carrying baggage.
An initial version of our approach was described in [4]. Compared with our previous work, several major extensions merit highlighting: (1) applying foreground/background segmentation with dual thresholding to extract candidate human-baggage regions from video sequences (Section 4); (2) describing the scalable histogram of oriented gradient feature and its implementation (Section 6), the support vector machine, and the boosting machine (Section 7) in more detail; (3) using a genetic algorithm to reduce the feature dimension (Subsection 6.3); (4) evaluating our component model on video sequences (Subsection 8.1); (5) conducting more experiments, such as the parameter selection for background modeling (Subsection 8.2), the effect of feature selection (Subsection 8.3), and an analysis of processing time (Subsection 8.5); and (6) comparing our approach with state-of-the-art methods for detecting humans carrying baggage (Subsection 8.4).
Related works
In the simplest case, a human-baggage detector can be used for detecting humans carrying baggage. However, the task is inherently difficult because of the wide range of baggage that a person can carry and the different ways in which it can be carried. In the literature, several approaches have been proposed for detecting baggage that has been abandoned by its owner [5, 6] or is still being carried [3, 8]. Tian [5] proposed a method to detect abandoned and removed objects using background subtraction and foreground analysis. In their approach, the background is modeled by a mixture of three Gaussians combined with texture information in order to handle changes in lighting conditions. The static region obtained by background subtraction is then analyzed using region growing and classified as either an abandoned or a removed object by a set of rules. However, in some cases this method produces many false alarms because of imperfect background subtraction.
Fan [6] proposed a relative attributes schema to prioritize alerts by ranking candidate regions. However, in a real deployment it is very important to know who owns the abandoned baggage. Therefore, as a prior step, the system should be able to detect the person who carried the baggage. The authors of [3, 7] proposed the same concept for detecting objects carried by people. They used the sequence of a moving human to build a spatio-temporal template, which was then aligned against view-specific exemplars generated offline to obtain the best match. A carried object was detected from the temporal protrusion. The author of [3] extended the framework so that the system can also classify the baggage type based on its position relative to the body of the person carrying it. However, the method assumes that parts of the carried object protrude from the body silhouette. Because of this dependency on protrusion, the method cannot detect non-protruding carried objects, e.g. a backpack. The protrusion problem can be solved by the method in [8], which uses a ratio color histogram. Under the assumption that the color of the carried object differs from that of the clothes, it achieves good accuracy. However, this method depends on an event in which the bag is transferred or left behind. The assumption that the person is observed before and after the change in carrying status is application-specific, so the method cannot be used as a general detector of humans carrying baggage.
System overview
This section describes the details of our proposed approach for detecting humans carrying baggage in video sequences. The general framework is shown in Fig. 1. The proposed approach consists of the following main stages: object candidate extraction, spatial-based joint component modeling, feature extraction, and training and decision.
Object candidate extraction
Background modeling
One of the important processes for extracting candidate object regions is constructing a background model, which is subtracted from each frame to separate foreground and background pixels. Background subtraction is commonly performed with the Gaussian Mixture Model (GMM) method, but GMM has a high computational cost because of its training stage [9, 10]. Therefore, a simple statistics-based background model is applied. First, statistics for each pixel are computed over the N previous frames. Let $I(x, y) = \{I_1(x, y), \ldots, I_N(x, y)\}$ be the set of pixel brightness values (e.g. gray intensity) at location $(x, y)$. The maximum and minimum pixel values at this location, $I_H(x, y)$ and $I_L(x, y)$, are then calculated and used to decide whether to update the background model. If the location contains only background pixels, the color difference between its maximum and minimum intensity values should be low, and the value of this pixel is used to update the background model. Otherwise, the pixel is considered to contain foreground. The pixels with maximum and minimum brightness are represented as the vectors $(r_{max}, g_{max}, b_{max})$ and $(r_{min}, g_{min}, b_{min})$, respectively. The color difference is calculated as a distance in the 3D color space [11], and if the difference is below a threshold $\delta_d$, the background model is updated at this location.
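The per-pixel update test described above can be illustrated with a short Python sketch. It is a minimal sketch under stated assumptions, not the implementation used in this paper: the brightness measure (channel mean), the Euclidean color distance, and the running-average update with rate `beta` are assumptions chosen for illustration, and the function names are hypothetical.

```python
import math


def color_distance(c1, c2):
    """Euclidean distance between two (r, g, b) colors (assumed metric)."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(c1, c2)))


def brightness(color):
    """Assumed brightness measure: mean of the three channels."""
    return sum(color) / 3.0


def update_background_pixel(background, history, delta_d, beta):
    """Update one background pixel from its last N observed colors.

    background: current (r, g, b) background value at this pixel.
    history:    list of (r, g, b) values observed over the N previous frames.
    delta_d:    threshold on the max/min color difference.
    beta:       update rate of the assumed running-average update.
    """
    c_max = max(history, key=brightness)  # color of the brightest observation
    c_min = min(history, key=brightness)  # color of the darkest observation
    if color_distance(c_max, c_min) >= delta_d:
        # Large variation: the pixel probably contains foreground, keep the model.
        return background
    current = history[-1]
    # Assumed update: blend the most recent value into the model.
    return tuple((1.0 - beta) * b + beta * c for b, c in zip(background, current))
```

A pixel whose recent history is stable is blended into the model, while a pixel whose history varies strongly leaves the model unchanged.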
Foreground/background segmentation is straightforward using the intensity difference $D(x, y)$ between the current frame $I_t(x, y)$ and the background model $B(x, y)$ at pixel location $(x, y)$. General background segmentation methods use a single threshold to decide whether a pixel is foreground or background. In contrast, inspired by the Canny edge detector, double thresholding with a small threshold $T_s$ and a large threshold $T_l$ is introduced. If the difference value exceeds the large threshold, the pixel is classified as foreground. If the difference value lies between the small and large thresholds and its 8-neighborhood pixels are foreground, the pixel is also classified as foreground. Otherwise, the pixel is classified as background. Mathematically, the resulting binary map $L(x, y)$ is defined as:

$$L(x, y) = \begin{cases} 1, & D(x, y) > T_l \\ 1, & T_s < D(x, y) \le T_l \text{ and the 8-neighborhood of } (x, y) \text{ is foreground} \\ 0, & \text{otherwise} \end{cases}$$
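The double-threshold rule can be sketched in a few lines of Python. This is an illustrative sketch only: it assumes that a pixel between the two thresholds is accepted when at least one of its 8 neighbors exceeds the large threshold, and it makes a single pass without further propagation. The function name `segment_foreground` is hypothetical.

```python
def segment_foreground(diff, t_small, t_large):
    """Classify pixels from a 2D list of difference values D(x, y).

    Returns a binary map of the same size (1 = foreground, 0 = background).
    """
    rows, cols = len(diff), len(diff[0])
    # Pixels above the large threshold are foreground outright.
    strong = [[1 if diff[r][c] > t_large else 0 for c in range(cols)]
              for r in range(rows)]
    result = [row[:] for row in strong]
    for r in range(rows):
        for c in range(cols):
            if strong[r][c] or diff[r][c] <= t_small:
                continue  # already foreground, or clearly background
            # Weak pixel: accept it if any 8-neighbor is strong foreground
            # (assumed reading of the neighborhood condition).
            for dr in (-1, 0, 1):
                for dc in (-1, 0, 1):
                    rr, cc = r + dr, c + dc
                    if (dr or dc) and 0 <= rr < rows and 0 <= cc < cols \
                            and strong[rr][cc]:
                        result[r][c] = 1
    return result
```

With thresholds 50 and 75, a pixel with difference 60 is kept only when it touches a pixel whose difference exceeds 75, which suppresses isolated weak responses.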
To verify whether a foreground object is human, a learning framework for a human model is developed. It also filters out non-human foreground regions caused by imperfect background/foreground segmentation. First, connected component labeling is applied to the binary map L to localize the candidate regions. In the verification stage, histogram of oriented gradient (HOG) features [12] are extracted from each candidate region after resizing it to 128 × 64 pixels [13]. The HOG features are computed on a dense grid of cells using local contrast normalization over overlapping blocks. For each cell, a nine-bin histogram of unsigned pixel orientations weighted by gradient magnitude is created. These histograms are normalized over each overlapping block, and the feature vector consists of the normalized cell histograms. After the HOG features are extracted, a support vector machine is trained as the human model. Finally, each foreground region verified as human is further analyzed for the possible presence of baggage, while regions that fail verification are eliminated.
Spatial-based joint component model
As shown in Fig. 2(a), the average height and width of a person according to human body proportions [14] are defined as H = 8h and W = 2h, respectively, where h is the height of the head, i.e. h = H/8. The bend line B is the center of the body along the vertical axis. The vertical line C is the center of the body along the horizontal axis, passing through the body centroid. Let T and L denote the position of the top of the head and the leftmost location of the body in the image, respectively; then B = T + 4h and C = L + h. These spatial parameters are used to build our model of a human carrying baggage.
The main idea of the proposed model derives from the fact that baggage can be placed in any of several possible locations that depend on the body proportions and on the viewing direction relative to the camera. Training images were collected, and the baggage positions in all images were localized manually and summarized as follows: backpacks and handbags are mostly located around or above the bend line, while tote bags, duffle bags, and rolling bags are located below the bend line at different average heights. Thus, our spatial model is divided into three major categories: 1) backpack or handbag, 2) tote bag or duffle bag, and 3) rolling luggage. As shown in Fig. 2(b–d), each spatial model of baggage defines a set of conditions for checking whether baggage exists. For instance, if our model identifies the baggage location in the human region as in Fig. 2(b) with the highest probability among the spatial models, the bag is classified as a backpack. If the baggage cannot be identified by any spatial model, the region is classified as a human without baggage.
The baggage location in the human region is then determined by dividing the region into several small sub-regions (i.e. components) and analyzing their locations relative to the human region. Thus, the human carrying baggage is modeled based on observations of the small components and their positions relative to one another. This model captures the relationship between components. The human region is divided into four components $c_i$: head, torso, leg, and baggage, as shown in Fig. 3. The heights of the head, torso, and leg components are 1.5h, 2.5h, and 4h, respectively, while the height of the baggage component varies according to our model.
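As a worked example of these proportions, the sketch below derives the spatial parameters and the vertical extents of the head, torso, and leg components from a detected body region. It assumes the three components are stacked from top to bottom without gaps (their heights 1.5h, 2.5h, and 4h sum to H = 8h); the function name `body_layout` and the dictionary keys are hypothetical, and the baggage component is omitted because its extent depends on the spatial model.

```python
def body_layout(top, left, height):
    """Derive spatial parameters from the top, leftmost position, and height
    of a human region, using h = H / 8, W = 2h, B = T + 4h, C = L + h."""
    h = height / 8.0
    return {
        "h": h,
        "width": 2 * h,
        "bend_line": top + 4 * h,      # B: vertical center of the body
        "center_line": left + h,       # C: horizontal center of the body
        # (start, end) rows of each component, assumed stacked top to bottom
        "head": (top, top + 1.5 * h),
        "torso": (top + 1.5 * h, top + 4 * h),
        "leg": (top + 4 * h, top + 8 * h),
    }
```

For a region with T = 10, L = 20, and H = 160, this gives h = 20, B = 90, and C = 40, and the torso component ends exactly at the bend line.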
Feature extraction
The feature vector is extracted from the full body and from each component independently. A low-dimensional feature vector approximately covers the entire object, while a higher-dimensional feature vector covers smaller regions of the object.
Pixel-level feature maps
In this section, feature extraction based on the scalable histogram of oriented gradient (SHOG) [15] is implemented. In most feature extraction methods, a region must be resized to the template size in order to obtain the same feature dimension for every region. However, resizing may lose local information, especially when scaling up, because of pixel-value interpolation. In contrast, a fixed-length SHOG feature can be obtained from a region regardless of its pixel size, so it retains richer local information and allows highly discriminative features to be extracted. First, the Sobel operator [16] computes the intensity gradient in the x and y directions, from which the per-pixel gradient magnitude and orientation are obtained.
SHOG is extracted by aggregating per-pixel feature maps of oriented gradients within blocks, using divisions from 2 × 2 to 16 × 16 blocks in increments of 2. Different divisions accumulate different numbers of pixels per block. For example, if a 64 × 64 pixel input image is divided into 2 × 2 blocks, each block contains 32 × 32 pixels, whereas if it is divided into 4 × 4 blocks, each block contains 16 × 16 pixels. The feature of each block is the average magnitude within the block on each oriented-gradient feature map. The histograms of all block divisions are concatenated into one histogram, named SHOG [15], and normalized using the L2 norm [17]. The resulting feature vector has 5,256 elements (584 blocks × 9 elements/block). Consequently, the feature dimension is always fixed, regardless of the input image size.
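The fixed-length property can be illustrated with a small pure-Python sketch. It is not the SHOG implementation of [15]: the border handling, the hard assignment of each pixel to one of nine unsigned orientation bins, and the function names are assumptions, and the block divisions are left as a parameter (`grid_sizes`) because the exact layout behind the reported dimension is not spelled out in the text.

```python
import math


def sobel_gradients(img):
    """Per-pixel gradient magnitude and unsigned orientation in [0, pi)."""
    rows, cols = len(img), len(img[0])

    def px(r, c):
        # Replicate border pixels (assumed border handling).
        return img[min(max(r, 0), rows - 1)][min(max(c, 0), cols - 1)]

    mags = [[0.0] * cols for _ in range(rows)]
    oris = [[0.0] * cols for _ in range(rows)]
    for r in range(rows):
        for c in range(cols):
            gx = (px(r - 1, c + 1) + 2 * px(r, c + 1) + px(r + 1, c + 1)
                  - px(r - 1, c - 1) - 2 * px(r, c - 1) - px(r + 1, c - 1))
            gy = (px(r + 1, c - 1) + 2 * px(r + 1, c) + px(r + 1, c + 1)
                  - px(r - 1, c - 1) - 2 * px(r - 1, c) - px(r - 1, c + 1))
            mags[r][c] = math.hypot(gx, gy)
            oris[r][c] = math.atan2(gy, gx) % math.pi
    return mags, oris


def shog(img, grid_sizes, bins=9):
    """Concatenate per-block orientation histograms over several divisions."""
    mags, oris = sobel_gradients(img)
    rows, cols = len(img), len(img[0])
    feature = []
    for k in grid_sizes:                      # divide the image into k x k blocks
        for br in range(k):
            r0, r1 = br * rows // k, (br + 1) * rows // k
            for bc in range(k):
                c0, c1 = bc * cols // k, (bc + 1) * cols // k
                hist = [0.0] * bins
                for r in range(r0, r1):
                    for c in range(c0, c1):
                        b = min(int(oris[r][c] / math.pi * bins), bins - 1)
                        hist[b] += mags[r][c]
                area = max((r1 - r0) * (c1 - c0), 1)
                feature.extend(v / area for v in hist)   # average magnitude
    norm = math.sqrt(sum(v * v for v in feature)) or 1.0
    return [v / norm for v in feature]        # L2 normalization
```

Because the number of blocks depends only on `grid_sizes`, a 16 × 16 image and a 24 × 32 image both yield (4 + 16) × 9 = 180 elements with `grid_sizes=(2, 4)`, without any resizing.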
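Algorithm 1: Genetic algorithm-based feature selection
population ← n random chromosomes with l genes (feature indices)
repeat
    compute the fitness value of each chromosome
    sort the chromosomes by fitness value
    for i = 1 to k do
        select parents p1, p2 from the chromosomes with high fitness values
        offspring_i ← crossover(p1, p2)
        offspring_i ← mutation(offspring_i)
    end for
    merge the k offspring into the population
    remove the k chromosomes with the worst fitness values
until the maximum number of iterations is reached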
Genetic algorithm-based feature selection
In many classification processes, a high-dimensional feature incurs an expensive computational cost. In addition, the data in certain dimensions may not be very useful for distinguishing object classes. Therefore, feature dimension reduction based on a genetic algorithm (GA) [18] is carried out to reduce the computational cost while maintaining a promising recognition rate.
The general framework of the GA is given in Algorithm 1. As the initial population, n chromosomes with l genes are randomly created, where each gene represents a feature index. The fitness value of each chromosome is computed from the detection rate of the nearest cluster neighbor classifier [19]. The chromosomes are sorted by fitness value, and the parents are taken from the chromosomes with high fitness values in the population. Then, k offspring are generated using the crossover and mutation processes. These new offspring are merged into the population, while the k chromosomes with the worst fitness values are removed from the population.
The crossover process selects one random pivot point and generates a new offspring by recombining sub-chromosomes of the first and second parents. If a randomly drawn number is lower than a fixed threshold, the new offspring is formed by combining the left part of the first parent with the right part of the second parent; otherwise, the left part of the second parent is combined with the right part of the first parent. The mutation process is likewise controlled by a random number. If the random number is larger than a predefined threshold, one randomly selected gene of the new offspring is replaced with a gene from the original features of the first parent. Note that the picked gene must not already exist in the mutated offspring.
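The selection loop, crossover, and mutation described above can be sketched in Python as follows. This is a hedged sketch, not the implementation evaluated in this paper: the threshold values, the choice of the better half of the population as parents, and all function and parameter names are assumptions, and the `fitness` callable is a placeholder for the classifier-based detection rate.

```python
import random


def crossover(p1, p2, rng, threshold=0.5):
    """Single-point crossover; a random draw decides which parent supplies
    the left part (threshold value is an assumption)."""
    pivot = rng.randrange(1, len(p1))
    if rng.random() < threshold:
        return p1[:pivot] + p2[pivot:]
    return p2[:pivot] + p1[pivot:]


def mutation(child, p1, rng, threshold=0.8):
    """With some probability, replace one gene of the child by a gene of the
    first parent that the child does not already contain."""
    child = list(child)
    if rng.random() > threshold:
        candidates = [g for g in p1 if g not in child]
        if candidates:
            child[rng.randrange(len(child))] = rng.choice(candidates)
    return child


def select_features(n_features, fitness, n=20, l=5, k=10, generations=30, seed=0):
    """Evolve n chromosomes of l feature indices; return the fittest one."""
    rng = random.Random(seed)
    population = [rng.sample(range(n_features), l) for _ in range(n)]
    for _ in range(generations):
        population.sort(key=fitness, reverse=True)
        parents = population[: max(2, n // 2)]      # assumed parent pool
        offspring = []
        for _ in range(k):
            p1, p2 = rng.sample(parents, 2)
            offspring.append(mutation(crossover(p1, p2, rng), p1, rng))
        # Merge the offspring, then drop the k worst chromosomes.
        population = sorted(population + offspring, key=fitness, reverse=True)[:n]
    return max(population, key=fitness)
```

Because the k worst chromosomes are removed after merging, the best fitness in the population never decreases from one generation to the next. Note that this simple crossover may place the same feature index twice in one offspring; the text does not say how such duplicates are handled.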
Training and decision
As shown in Fig. 1, all components are trained using support vector machines based on scalable histogram of oriented gradient features. The boosting machine is then applied to combine these components into a strong classifier.
Support vector machine
A support vector machine (SVM) classifier [20, 21] is used to learn each component as a weak classifier. Formally, we are given training data $\mathcal{D} = \{(x_i, y_i) \mid x_i \in \mathbb{R}^p,\ y_i \in \{-1, 1\}\}_{i=1}^{n}$, where $y_i$ is either 1 or -1, indicating the class to which feature $x_i$ belongs. In our task, $y_i = 1$ denotes either a body part or baggage, and $y_i = -1$ denotes background. Each $x_i$ is a p-dimensional real vector. Our goal is to find the maximum-margin hyperplane that divides the two classes. Any hyperplane can be written as the set of points $x$ satisfying $w^T x - b = 0$, where $w$ is the normal vector to the hyperplane. The parameter $b / \|w\|$ determines the offset of the hyperplane from the origin along the normal vector. Assuming the training data are linearly separable, two parallel hyperplanes that separate the data are selected, $w^T x - b = 1$ and $w^T x - b = -1$, so that the samples of the two classes satisfy $w^T x_i - b \ge 1$ and $w^T x_i - b \le -1$, respectively. By geometry, the distance between these two hyperplanes is $2 / \|w\|$. To maximize this distance, the denominator $\|w\|$ should be minimized. Thus, the optimization problem is formulated as follows:

$$\min_{w, b} \|w\| \quad \text{subject to} \quad y_i (w^T x_i - b) \ge 1, \quad i = 1, \ldots, n$$
Finally, the output probability of the SVM classification is computed by
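The geometric quantities in this formulation can be checked numerically with a few lines of Python. The sketch only evaluates a given hyperplane $(w, b)$; it does not solve the optimization problem, and the function names are hypothetical.

```python
import math


def decision_value(w, b, x):
    """Signed value w^T x - b for a sample x."""
    return sum(wi * xi for wi, xi in zip(w, x)) - b


def margin_width(w):
    """Distance 2 / ||w|| between the hyperplanes w^T x - b = 1 and = -1."""
    return 2.0 / math.sqrt(sum(wi * wi for wi in w))


def satisfies_margin(w, b, samples):
    """Check the constraint y_i (w^T x_i - b) >= 1 for every (x_i, y_i)."""
    return all(y * decision_value(w, b, x) >= 1 for x, y in samples)
```

For example, with w = (1, 0) and b = 0 the margin is 2, and doubling w to (2, 0) halves it to 1, which is why minimizing $\|w\|$ under the constraints maximizes the margin.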
Algorithm 2: AdaBoost training procedure [23] (listing not available)
Boosting is an approach to machine learning based on the idea of constructing a highly accurate classifier by combining many relatively weak and inaccurate classifiers. Here, the output probabilities of the SVMs over the body and baggage components are considered weak classifiers. Formally, a component is represented by $c_i = \{x, y, s, v\}$, which specifies an anchor position $(x, y)$ relative to the full body at the $s$-th image scale and a feature vector $v$. The score of component interpolation for model $m$, based on a hybrid of boosting [22], is defined as follows
Many boosting methods have been proposed for finding the weights of a linear combination of several classifiers. One of the simplest and most efficient is Adaptive Boosting (AdaBoost) [23]. Let the training data be $(x_1, y_1, l_1, p_1), \ldots, (x_n, y_n, l_n, p_n)$, where n is the number of training samples. First, the weight of each sample is set to $D_1(i) = 1/n$. The AdaBoost method is then performed as shown in Algorithm 2.
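For readers unfamiliar with the procedure, the standard discrete AdaBoost loop can be sketched as follows. This is the textbook form over a fixed pool of weak classifiers that return +1 or -1, given here as an assumption about the general scheme rather than a reproduction of Algorithm 2; it uses only the $(x_i, y_i)$ part of each training tuple, and the function names are hypothetical.

```python
import math


def adaboost(samples, labels, weak_classifiers, rounds):
    """Return a list of (alpha, classifier) pairs chosen by discrete AdaBoost."""
    n = len(samples)
    weights = [1.0 / n] * n                      # D_1(i) = 1/n
    ensemble = []
    for _ in range(rounds):
        # Pick the weak classifier with the lowest weighted error.
        best, best_err = None, None
        for h in weak_classifiers:
            err = sum(w for w, x, y in zip(weights, samples, labels) if h(x) != y)
            if best_err is None or err < best_err:
                best, best_err = h, err
        if best_err >= 0.5:
            break                                # no classifier beats chance
        eps = max(best_err, 1e-10)               # avoid division by zero
        alpha = 0.5 * math.log((1.0 - eps) / eps)
        ensemble.append((alpha, best))
        # Increase the weights of misclassified samples, then renormalize.
        weights = [w * math.exp(-alpha * y * best(x))
                   for w, x, y in zip(weights, samples, labels)]
        total = sum(weights)
        weights = [w / total for w in weights]
    return ensemble


def predict(ensemble, x):
    """Sign of the weighted vote of the selected weak classifiers."""
    score = sum(alpha * h(x) for alpha, h in ensemble)
    return 1 if score >= 0 else -1
```

On a toy one-dimensional problem where no single threshold rule is correct, three rounds are enough for the weighted vote to classify every sample correctly, which illustrates how weak classifiers are combined into a strong one.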
The human-baggage detection is decided based on the score of each model according to the best possible placement of the baggage and the other components relative to the body region. Let M be the number of possible models trained in the framework. The final score $J_{final}$ of an object hypothesis being a human-baggage object is the highest score among all M models, as formulated in Equation (9). In addition, if the value of $J_{final}$ is less than a fixed threshold, the object hypothesis is classified as another object, which represents either a human without baggage or a non-human region.
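The decision rule amounts to taking the best-scoring model and comparing its score with a threshold, as in the following sketch. The model names, score values, and the label `"other"` are illustrative assumptions.

```python
def classify_hypothesis(model_scores, threshold):
    """Pick the model with the highest score; fall back to 'other' when the
    best score is below the threshold.

    model_scores: dict mapping a model name to its score for this hypothesis.
    Returns (label, j_final).
    """
    best_model = max(model_scores, key=model_scores.get)
    j_final = model_scores[best_model]
    if j_final < threshold:
        return "other", j_final   # human without baggage or non-human region
    return best_model, j_final
```

For instance, scores of 0.8, 0.3, and 0.1 for three spatial models with a threshold of 0.5 select the first model, whereas scores that all fall below 0.5 yield the label `"other"`.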
The procedure in Subsection 7.3 usually yields multiple overlapping bounding boxes for the same human-baggage region. Therefore, these regions must be combined to unify detections and reject misdetections. A greedy procedure based on non-maximum suppression [2] is used to eliminate repeated detections of the same region. Let J be a set of detections, each defined by a bounding box location and a score. The detections in J are sorted by score, and the final detections are determined by greedily choosing the highest-scoring ones and skipping any detection whose bounding box is covered, to at least a given extent, by a previously selected bounding box.
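The greedy suppression step can be sketched as follows. The coverage fraction `min_cover` is a hypothetical parameter with an assumed default of 0.5, since the text does not state the value used; boxes are assumed to be `(x1, y1, x2, y2)` tuples.

```python
def covered_fraction(box, other):
    """Fraction of the area of `box` that lies inside `other`."""
    x1, y1 = max(box[0], other[0]), max(box[1], other[1])
    x2, y2 = min(box[2], other[2]), min(box[3], other[3])
    inter = max(0, x2 - x1) * max(0, y2 - y1)
    area = (box[2] - box[0]) * (box[3] - box[1])
    return inter / area if area > 0 else 0.0


def non_maximum_suppression(detections, min_cover=0.5):
    """Keep the highest-scoring detections, skipping any box that is covered
    by at least `min_cover` of its area by an already selected box.

    detections: list of (box, score) pairs.
    """
    kept = []
    for box, score in sorted(detections, key=lambda d: d[1], reverse=True):
        if all(covered_fraction(box, k[0]) < min_cover for k in kept):
            kept.append((box, score))
    return kept
```

Two heavily overlapping boxes around the same person collapse to the higher-scoring one, while a distant box is kept as a separate detection.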
Experiments
Datasets
Our model was tested on a human-baggage dataset collected from subsets of iLIDS [24] and PETS2006 [25] and from our own images. The dataset is divided into two groups of video sequences, training and testing. The training video contains 976 image frames with 162 human-baggage regions (positive samples) and 542 regions of humans without baggage (negative samples). For training the spatial model, the human-baggage samples are manually cropped into four components: head, body, leg, and baggage. The testing video contains 1,657 image frames with 252 human-baggage regions and 672 regions of humans without baggage. Figure 4 shows several positive samples from our training dataset.
Parameter selection in background modeling
The optimal parameter values were found by conducting experiments on the training video and were cross-validated on the testing video. Figure 5 shows the evaluation results of the training and validation stages for the six best parameter settings. The results are calculated from the human verification accuracy at the foreground segmentation stage. The results show that if the learning rate is set above 0.5, the background model is updated too quickly. In this case, stationary persons who remain in the scene for a short time are blended into the background and can no longer be considered candidate regions of humans carrying baggage, so the accuracy of the system decreases. On the other hand, if the learning rate is set too low, the background model is updated very slowly. In this case, static objects in the scene may be detected as foreground for a longer time, so our method may produce many false regions. The choice of $T_s$ and $T_l$ also affects the accuracy. As the large threshold $T_l$ decreases, the model produces more foreground objects, and the resulting false regions cause the human verification stage to fail to extract candidate regions. In contrast, a larger value of $T_l$ reduces the number of detected moving objects. Finally, the optimal parameters were found to be as follows: the background model is updated with rate β = 0.3, and the foreground segmentation uses small and large thresholds $T_s$ = 50 and $T_l$ = 75.
Optimal feature selection analysis
This part evaluates the effect of feature selection based on the genetic algorithm. Each component is modeled as a binary classification problem, e.g. head or non-head. The fitness value of each component and the average fitness value over all components are calculated. To evaluate the effectiveness of feature selection using the genetic algorithm, two issues are investigated: (1) the effect of the crossover and mutation probabilities, and (2) the effect of the percentage of the feature dimension that is retained. A summary of the evaluation results is given in Fig. 6. As shown in Fig. 6(a), small crossover and mutation probabilities give better fitness values. The optimal average fitness value is obtained when the crossover and mutation probabilities are both set to 0.2. In addition, the average fitness value grows linearly with the number of iterations; in other words, the fitness value improves as the number of iterations increases. Next, the percentage of the feature dimension is investigated to find the dimension size that gives good accuracy as well as fast processing. As seen in Fig. 6(b), the fitness values increase significantly when the percentage is between 1% and 20%, but no such improvement occurs beyond 20%: the fitness values between 20% and 50% are almost constant. Thus, it can be concluded that 20% of the dimension is sufficient to represent the full feature dimension. In the training evaluation, the 20% and full dimension sizes achieve 95.18% and 96.71% accuracy, respectively, a difference of about 1.5 percentage points. However, the 20% dimension size allows a faster classification process. Thus, for the full implementation, the 20% feature dimension is used, and it is compared with the full dimension in terms of accuracy and processing time.
Detection results
First, our model was evaluated on classifying regions as humans with or without baggage. Regions of humans carrying baggage are positive samples, and regions of humans without baggage are negative samples. All samples were collected manually, cropped to fit the object region, and annotated with their components according to our model. The method achieves detection rates of 0.56 and 0.58 on the training and testing data, respectively. Figure 7 shows the results of our model for detecting either a human or a human with baggage. The proposed method achieves a detection rate of 58.12% when background modeling is applied and 52.02% with a sliding window only. These results outperform HOG+SVM [12], LBP+SVM [26], and HOG+LBP+SVM [27], which achieve detection rates of 39.31%, 39.93%, and 41.22%, respectively.
Next, our method was evaluated on the full video sequences. In practice, the same human-baggage region is usually detected several times as overlapping bounding boxes. Therefore, the overlapping regions must be combined to unify detections and reject misdetections. The detected regions are grouped based on their position and prediction probability (score), as described in Subsection 7.4. Some typical results of the proposed method are shown in Fig. 8.
Processing time analysis
The proposed approach was implemented on a computer with an Intel i7-4770 CPU (3.40 GHz) and 8 GB of RAM. The system was implemented in C++ with OpenCV under the Windows 7 operating system; OpenCV was used for basic image processing. The processing speed of the proposed approach was examined on three video sequences from the iLIDS dataset, with each frame resized to 640 × 480 pixels. As shown in Table 1, our approach runs at 6 fps with the full feature dimension. The fastest stages are background modeling and foreground segmentation, since they rely only on statistical information and do not apply any learned model. On the other hand, the most expensive stage is component detection in the human region. Our model contains four components (head, body, leg, and baggage), and each component model localizes its position over different scales and translations.
Next, the effect of feature selection on processing time was investigated. As shown in Table 2, the proposed approach with 20% of the feature dimension is more than 4 times faster than with the full feature dimension: its average processing time is 40.12 ms, compared with around 176.54 ms for the full dimension. However, the accuracy difference between them is only 7%. Thus, the genetic algorithm-based feature selection yields a substantial improvement in processing time while maintaining detection accuracy. Furthermore, the method was also evaluated using a sliding window approach without background modeling. Its accuracy is promising, but it requires a long processing time and is almost 5 times slower than the proposed approach.
Conclusion
This paper proposed a joint component analysis for detecting humans carrying baggage in video sequences. Since the method operates on video recorded with a static camera, a background model was built to separate the foreground and background regions. The foreground regions were further analyzed to verify whether they contain a human carrying baggage. A region verified as human was modeled as a human-baggage region by dividing it into four components: head, body, leg, and baggage. The model used the spatial information of the baggage relative to the human body. Scalable histogram of oriented gradient (SHOG) features were extracted from each component. Because the feature extraction stage produces a high-dimensional feature vector, a genetic algorithm was applied to reduce the feature dimension. The features of the body and baggage components were trained using support vector machines (SVMs), and a boosting machine based on a linear combination of weak classifiers over the components was applied. In extensive experiments, our method achieved a detection rate of 58.12%.
Nevertheless, our method has some limitations in detecting baggage carried by humans. First, it may fail to detect multiple pieces of baggage carried by the same person. An additional model covering multiple baggage placements should be considered in future work to handle this problem. Second, our method fails to detect overlapping human-baggage regions. Increasing the number of body parts is one possible solution to this issue. Third, our method may fail to detect baggage that has the same color as the clothes. Combining several features, such as texture and edges, may solve this problem.
