Abstract
In this study, we propose a novel method to process crowds and partially occluded pedestrians during single-image pedestrian detection. First, two procedures are proposed and developed to extract features of different pedestrian body parts from images. One procedure uses the multiscale block-based histogram of oriented gradients, which is preprocessed via Gabor filtering, to effectively enhance the descriptions of features of pedestrians’ heads and bodies. The other involves modifying the Haar-like features as parallelogram-based Haar-like features to suitably depict pedestrians’ legs and arms, which are typically not straight while walking. In addition, the computations of features are expedited through integral image acceleration for these two extraction methods. After the pedestrian features are acquired, a two-tier support vector machine (SVM) classifier is proposed for processing partially occluded pedestrians. The first-tier SVM classifier is used to judge if there are any occluded body parts. Next, the classification probabilities of nonoccluded body parts from the first-tier classifier are input into the second-tier classifier to determine if there are any pedestrians in the detection window. Compared with six state-of-the-art approaches, the experimental results indicate that the proposed method is more accurate and satisfactory in terms of the receiver operating characteristic curve and four other criteria. Additionally, our method effectively processes images in which a crowd is present or pedestrians are partially occluded and enables pedestrian detection in images of different scenes.
Keywords
Introduction
Pedestrian detection [1, 2, 3, 4, 5, 6, 7, 8, 9], an intelligent camera technology, is critical in machine vision technology [10, 11, 12, 13, 14, 15, 16, 17, 18, 19]. As its name implies, this technology determines whether there are any pedestrians in an image and where they are located. It is applied in various fields [20, 21, 22, 23, 24, 25, 26, 27, 28]: it supports intelligent surveillance systems for public safety, serves as an indispensable component of the ever-evolving automatic driving system, improves human-AI interaction, and aids unmanned aerial vehicles used in mountain or shipwreck rescue operations [29, 30, 31]. Pedestrian detection is used so extensively that it has become a major focus of research.
The target of pedestrian detection has both rigid and flexible object properties [32], and the image features are easily affected by clothes, hair, body shape, skin color, posture, and visual angle. If a detection target exists in a complex image background, the target is likely to be obscured/occluded by trees, buildings, and vehicles in the background [33, 34, 35, 36, 37, 38, 39, 40, 41, 42]. More specifically, herein, obscuration mostly means occlusion. For pedestrian detection to be suitably applied in different images and environments, both recognition rate and speed should be taken into consideration. Because of these considerations, pedestrian detection is a relatively challenging area of image recognition research. Moreover, because the detection system functions similarly to human eyes, its effectiveness is easily affected by weather, climate, and visibility. For example, detection effectiveness is lower at night without streetlights than during the day. Taken together, complex backgrounds and obscurity are two major limitations in current pedestrian detection technology [43]. Overcoming these limitations will enhance the practicality of this technology. Therefore, the objective of this study is to modify the recognition algorithm for pedestrian detection to improve its recognition rate.
The image features of pedestrians can be divided into three categories: primary features, features acquired through machine learning, and mixed features. Primary features refer to colors, texture, and image gradient vectors. Features acquired through machine learning methods such as deep learning are highly recognizable pedestrian image features that computers learn on the basis of a massive amount of pedestrian samples. Mixed features are a combination of multiple low-level features or the higher-order statistical features of low-level features.
The histogram of oriented gradient (HOG) [8], a primary feature, is the most widely used pedestrian image feature with favorable performance. An HOG depicts the local gradient magnitude and gradient-oriented features of an image. The census transform histogram is proposed for illustrating the global information of a scene, and was first used in scene classification [44]. Features acquired through machine learning are the distinctive features of an image that are obtained through feature selection. The AdaBoost Haar-based face detection method [45] is proposed to use the AdaBoost algorithm to select discernible features (weak classifiers) among numerous Haar image features. This method has been successfully used in pedestrian detection. Adaptive contour features [46] are proposed that use AdaBoost to select features in an oriented gradient space in which three operations are defined – growth, consolidation, and cutting – to effectively illustrate the feature co-occurrences of a shape. Regarding mixed features, [47, 48] have combined both oriented gradient and local binary pattern (LBP) histograms, used this combination as the image feature, and proposed a pedestrian detection method for partial obscurity. Moreover, they have trained two linear support vector machine (SVM) classifiers [49, 50, 51] through training data and used occluded pedestrian images (composed of INRIA [52] and PASCAL [53] datasets) in combination with enhanced HOG–LBP features and a procedure for processing global and partial obscurity to verify the procedure.
The present study focuses on increasing the accuracy of pedestrian detection by modifying the HOG as the multiscale block-based histogram of oriented gradient (MBHOG) for pedestrians’ heads and bodies. In addition, the parallelogram-based Haar-like feature (PHF), an improved version of the Haar-like feature, is applied for depicting pedestrians’ arms and legs. The MBHOG and PHF constitute a joint component model that serves as an improved feature vector. This image feature vector is incorporated with the proposed two-tier SVM classifier to improve the accuracy rate of pedestrian detection by processing partial obscurity, complex environments, and crowds. More specifically, a novel method is proposed to improve the accuracy of single-image partially occluded pedestrian detection. Two procedures are proposed to extract features from different body parts of pedestrian images. The first proposes MBHOG features to effectively enhance the descriptions of pedestrians’ heads and bodies. The second involves PHF features to suitably depict pedestrians’ legs and arms. These two features extracted are expedited through integral image acceleration and then concatenated as novel feature vectors. Finally, a two-tier SVM classifier is further proposed to resolve the issues of partially occluded pedestrians. The first-tier classifier is used to judge if there are any occluded body parts, whereas the second-tier one is used to determine if there are any pedestrians in the detection window. The proposed method can effectively process images in which a crowd is present or pedestrians are partially occluded and enables pedestrian detection in images of different scenes.
This paper is structured as follows. Section 2 describes the materials and the proposed methods. In Section 3, the experimental results and discussion are presented. The conclusion is given in Section 4.
Materials and methods
Figure 1 depicts the study procedure. Using training images collected from the INRIA database, this study first adopts two extraction methods to capture the features of pedestrians’ body parts. One method is the MBHOG feature preprocessed by the Gabor filter; this feature describes the trunk. The other is the PHF, which describes the limbs. Next, this study uses principal component analysis (PCA) to reduce the dimensions of these two features. Lastly, this study employs the two-tier SVM classifier to recognize partially occluded pedestrians. The first-tier SVM classifier is used to determine if any body part is occluded. Next, the classification probabilities estimated by the first-tier classifier for nonoccluded parts are input into the second-tier classifier to determine if there are any pedestrians in the detection window.
Flowchart of the proposed method.
In the experiments, the whole images in INRIA database [52] are used. The database contains training and testing sets. In the training set, there are 2,416 pedestrian photos and 1,218 background photos without pedestrians, and in the testing set, there are 1,131 pedestrian photos and 453 background photos without pedestrians (Fig. 2). The border of original positive and negative samples (which refer respectively to pedestrian photos and background photos without pedestrians) is cropped to obtain a resolution of 64
Pedestrian head and body feature extraction through Gabor filtering and MBHOG
Figure 3 shows the MBHOG proposed by this study. After an image is input, gamma calibration and Gabor filter preprocessing are performed. Next, the gradient is calculated, multiscale cells and blocks are configured, and L2-norm normalization is performed to collect feature vectors and thereby enhance the description of pedestrians’ upper body features.
Positive and negative samples for pedestrian detection: (a) positive samples; (b) negative samples.
Flow chart of MBHOG feature extraction.
Preprocessing for pedestrian head and body feature extraction: (a) input images; (b) preprocessed images.
First, gamma correction is performed on images to compensate for human visual characteristics. The purpose of this preprocessing is to reduce the impact of light or shadows on the images and to reduce noise interferences.
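The gamma correction step can be illustrated with a minimal sketch. The text does not state the exponent, so the value 0.5 below is an assumption, as are the function name and the 8-bit input range.

```python
import numpy as np

def gamma_correct(image, gamma=0.5):
    """Power-law (gamma) correction of an 8-bit grayscale image.

    gamma=0.5 is an assumed value; the exponent used in the study is not given.
    """
    scaled = image.astype(np.float64) / 255.0   # map intensities to [0, 1]
    corrected = np.power(scaled, gamma)         # gamma < 1 lifts dark regions
    return np.round(corrected * 255.0).astype(np.uint8)

# Small hypothetical image: black and white are preserved, mid-tones are lifted.
sample = np.array([[0, 64], [128, 255]], dtype=np.uint8)
corrected = gamma_correct(sample)
```

With an exponent below 1, dark regions are stretched more than bright ones, which is how this kind of correction can reduce the influence of shadows.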
Both scenes and backgrounds can be complex during pedestrian detection and Gabor filters are effective in image edge detection [54, 55]. Therefore, this study uses Gabor filters for preprocessing in pedestrian detection. The generating function for Gabor filters is expressed by:
where
Oriented gradient histogram: (a) Pixel gradient angle and intensity, (b) Oriented gradient histogram of a cell.
To calculate the gradient direction, simple one-dimension gradient masks [
Configuring multiscale cells and blocks
The HOG feature segments a training image into multiple cells of equal size, which are then classified into the nine bins, and the number of cells in each bin is counted. More specifically, 180
In addition, to effectively describe the partial contour and shape of a target feature, the HOG feature organizes multiple cells into one block. The blocks are overlapped so that no critical feature descriptor is overlooked; without overlap, the miss rate increases. However, if the overlap is excessively large, feature dimensions and computation time also increase (Fig. 6).
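The cell and block configuration can be sketched as follows. This is an illustration under stated assumptions rather than the implementation used in this study: the central-difference gradient, the 8-pixel cell, the 2 by 2 block, and the one-cell stride are all assumed values, and the function names are hypothetical. Each pixel votes for one of nine orientation bins over 180 degrees, weighted by its gradient magnitude.

```python
import numpy as np

def cell_histograms(image, cell=8, bins=9):
    """Per-cell histograms of unsigned gradient orientation, weighted by magnitude."""
    img = image.astype(np.float64)
    gx = np.zeros_like(img)
    gy = np.zeros_like(img)
    gx[:, 1:-1] = img[:, 2:] - img[:, :-2]   # assumed central-difference mask
    gy[1:-1, :] = img[2:, :] - img[:-2, :]
    mag = np.hypot(gx, gy)
    ang = np.degrees(np.arctan2(gy, gx)) % 180.0          # unsigned angle
    bin_idx = np.minimum((ang / (180.0 / bins)).astype(int), bins - 1)
    rows, cols = img.shape[0] // cell, img.shape[1] // cell
    hist = np.zeros((rows, cols, bins))
    for r in range(rows):
        for c in range(cols):
            win = (slice(r * cell, (r + 1) * cell),
                   slice(c * cell, (c + 1) * cell))
            hist[r, c] = np.bincount(bin_idx[win].ravel(),
                                     weights=mag[win].ravel(),
                                     minlength=bins)
    return hist

def overlapping_blocks(hist, block=2, stride=1):
    """Concatenate cell histograms per block; blocks overlap when stride < block."""
    rows, cols, _ = hist.shape
    feats = [hist[r:r + block, c:c + block].ravel()
             for r in range(0, rows - block + 1, stride)
             for c in range(0, cols - block + 1, stride)]
    return np.array(feats)

# Hypothetical 32x32 horizontal ramp: every gradient points along the x axis.
ramp = np.tile(np.arange(32, dtype=np.float64), (32, 1))
hist = cell_histograms(ramp)          # 4 x 4 cells, 9 bins each
feats = overlapping_blocks(hist)      # 3 x 3 overlapping blocks of 36 values
```

A smaller stride increases the overlap and therefore the number of blocks, which is the dimension and computation cost noted above.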
Cumulative sum of histogram gradients method
The proposed MBHOG feature extraction method links HOG feature vectors obtained from different cells and blocks to improve contour description and feature recognition, thereby increasing the accuracy rate of detection. Because the MBHOG repeatedly computes the gradient histogram of the same block, the cumulative sum of histogram gradients method is adopted to facilitate the computation.
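One common way to avoid recomputing gradient histograms over the same pixels is an integral-histogram table, sketched below. It is offered as an illustration of the cumulative-sum idea and may differ in detail from the equations of this study; the function names and the random test data are hypothetical.

```python
import numpy as np

def integral_histogram(bin_idx, mag, bins=9):
    """Build ih, where ih[y, x, b] sums bin-b votes over rows < y and columns < x."""
    h, w = bin_idx.shape
    votes = np.zeros((h, w, bins))
    # Place each pixel's gradient magnitude in the plane of its orientation bin.
    votes[np.arange(h)[:, None], np.arange(w)[None, :], bin_idx] = mag
    ih = np.zeros((h + 1, w + 1, bins))
    ih[1:, 1:] = votes.cumsum(axis=0).cumsum(axis=1)
    return ih

def region_histogram(ih, top, left, height, width):
    """Histogram of any rectangle from four table lookups."""
    bottom, right = top + height, left + width
    return ih[bottom, right] - ih[top, right] - ih[bottom, left] + ih[top, left]

# Hypothetical bin indices and magnitudes for a 16x16 window.
rng = np.random.default_rng(0)
bin_idx = rng.integers(0, 9, size=(16, 16))
mag = rng.random((16, 16))
ih = integral_histogram(bin_idx, mag)
block_hist = region_histogram(ih, top=4, left=4, height=8, width=8)
```

Once the table is built, the histogram of a block at any scale costs four lookups per bin, so overlapping and multiscale blocks no longer revisit individual pixels.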
The first step of the method is to save the cumulative sum of voting ingredients of the
where
where
where
Blocks overlapping.
To improve the adaptation of cells to partial light and shadows, each block is normalized before all features are combined, which renders each cell self-adaptive and reduces the impact of partial light. The equation for partial L2-norm normalization is expressed by:
where
The PHF is an improvement on Haar-like features for the detection of certain pedestrian body parts. Haar-like features are often used for detecting human faces and are constrained to rectangular areas, which makes them unsuitable for detecting pedestrians because legs and arms are typically inclined during walking. In this case, the parallelogram-shaped image features of the PHF are more appropriate for pedestrian detection. Therefore, this study uses the PHF to extract the features of pedestrians’ arms and legs.
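The benefit of a sheared template over an upright one can be shown with a small sketch. Note that it swaps in a simpler technique than the one used in this study: row-wise prefix sums with an integer shear per row, which cost one lookup pair per row rather than a constant number of lookups. All names and the synthetic stripe image are hypothetical.

```python
import numpy as np

def row_prefix(image):
    """Row-wise prefix sums with a leading zero column."""
    prefix = np.zeros((image.shape[0], image.shape[1] + 1))
    prefix[:, 1:] = np.cumsum(image, axis=1)
    return prefix

def parallelogram_sum(prefix, top, left, height, width, shear):
    """Sum over a parallelogram whose rows shift `shear` pixels per row downwards."""
    total = 0.0
    for i in range(height):
        x0 = left + i * shear
        total += prefix[top + i, x0 + width] - prefix[top + i, x0]
    return total

def two_band_feature(image, top, left, height, width, shear):
    """Sum of one band minus the sum of the adjacent band to its right."""
    prefix = row_prefix(image)
    first = parallelogram_sum(prefix, top, left, height, width, shear)
    second = parallelogram_sum(prefix, top, left + width, height, width, shear)
    return first - second

# Synthetic inclined limb: a 2-pixel-wide stripe that shifts right by 1 per row.
img = np.zeros((8, 14))
for i in range(8):
    img[i, 2 + i:4 + i] = 1.0

slanted = two_band_feature(img, top=0, left=2, height=8, width=2, shear=1)
upright = two_band_feature(img, top=0, left=2, height=8, width=2, shear=0)
```

On this stripe the sheared template yields 16 while the upright rectangle yields -1, which illustrates why a rectangular constraint responds weakly to inclined limbs.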
Preprocessing
Preprocessing comprises three steps. First, the image is rendered in grayscale. Second, gamma correction is performed to reduce both the impact of light and shadows and noise interference. Lastly, histogram equalization is used to limit the differences between image and background intensity to an acceptable range for classifier recognition. The results of preprocessing for pedestrian limb feature extraction are shown in Fig. 7.
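The three steps can be sketched as a short pipeline. The grayscale weights (ITU-R BT.601 luma), the gamma value of 0.5, and the function names are assumptions made for illustration; the text does not specify them.

```python
import numpy as np

def to_gray(rgb):
    """Luma-weighted grayscale conversion (assumed BT.601 weights)."""
    return rgb[..., :3].astype(np.float64) @ np.array([0.299, 0.587, 0.114])

def equalize(gray_u8):
    """Histogram equalization of an 8-bit image through its cumulative histogram."""
    hist = np.bincount(gray_u8.ravel(), minlength=256)
    cdf = hist.cumsum()
    cdf_min = cdf[cdf > 0][0]                    # count at the darkest used level
    denom = max(int(cdf[-1] - cdf_min), 1)       # guard against a constant image
    lut = np.clip(np.round((cdf - cdf_min) / denom * 255.0), 0, 255)
    return lut.astype(np.uint8)[gray_u8]

def preprocess_limb(rgb, gamma=0.5):
    """Grayscale, gamma correction, then histogram equalization."""
    gray = to_gray(rgb) / 255.0
    corrected = np.round(np.power(gray, gamma) * 255.0).astype(np.uint8)
    return equalize(corrected)

# Hypothetical low-contrast patch with intensities between 100 and 140.
rng = np.random.default_rng(0)
patch = rng.integers(100, 141, size=(16, 16, 3)).astype(np.uint8)
out = preprocess_limb(patch)
```

After equalization the low-contrast patch spans the full 0 to 255 range, so the intensity differences seen by the classifier are placed on a common scale.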
PHF image features and integral image acceleration method
A PHF image feature is obtained by subtracting the sum of intensities of white blocks in a region from the sum of intensities of black blocks in the region. The reliability of a PHF is assessed by examining the features of a group of PHFs; these values are acquired via different-sized templates from images. PHF features of various scales and areas are accordingly obtained. The integral image acceleration method is also applied to facilitate the acquisition of PHF feature vectors. A PHF is parallelogram-shaped. The construction of the first integral image
Preprocessing for pedestrian limb feature extraction: (a) input images; (b) preprocessed images.
Eq. (10), and, all TP values are obtained from the computed TP values, as expressed by Eq. (11):
where
where
This study uses templates of different sizes to extract the PHF features of the left arm, left leg, right arm, and right leg. To improve pedestrian detection, this study uses the absolute value of PHF as the feature vector for detection. Next, TP integral images are used to compute the sum of intensities within a parallelogram region, quickly yielding the regional cumulative sum:
Feature dimensions of different body parts
Because of the large dimensions of the MBHOG and PHF features, this study uses PCA to reduce their dimensions and obtain the main features. The extent of reduction varies depending on the training set (Table 1). Nevertheless, 99% of image information is retained. Following feature dimension reduction, the accuracy rate of pedestrian detection increases, as do the classifier training speed and the recognition speed.
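A minimal sketch of this reduction step follows. It assumes that "99% of image information" means 99% of the total variance of the training features, which is a common reading but is not stated explicitly; the function names and the synthetic data are hypothetical.

```python
import numpy as np

def fit_pca(features, retained=0.99):
    """Return the mean and the fewest principal axes explaining `retained` variance."""
    mean = features.mean(axis=0)
    centered = features - mean
    _, s, vt = np.linalg.svd(centered, full_matrices=False)
    variance = s ** 2
    ratio = np.cumsum(variance) / variance.sum()
    k = min(int(np.searchsorted(ratio, retained)) + 1, len(s))
    return mean, vt[:k]

def project(features, mean, axes):
    """Project features onto the retained principal axes."""
    return (features - mean) @ axes.T

# Synthetic 20-dimensional features driven by 3 latent factors plus slight noise.
rng = np.random.default_rng(0)
latent = rng.normal(size=(200, 3))
mixing = rng.normal(size=(3, 20))
features = latent @ mixing + 1e-3 * rng.normal(size=(200, 20))

mean, axes = fit_pca(features)
reduced = project(features, mean, axes)
```

The mean and axes are fitted on the training set only and then reused for new windows, which is why the extent of reduction depends on the training set.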
Proposed two-tier classifiers
This study proposes the two-tier SVM classifier and the joint component model to address the limitations of pedestrian detection technology concerning partially occluded body parts and overlapping pedestrians. The PHF and MBHOG are used on different body parts: the MBHOG is applied to the trunk and the head, whereas different PHF templates are applied to the left arm, right arm, left leg, and right leg. After features of each body part are extracted and their dimensions are reduced through PCA, the SVM is employed for the first-tier training on each body part. Classifier parameters are thereby obtained for all five body parts. For new testing data, the trained first-tier classifier is used to compute the classification probability of each of the five body parts, and possible combinations of these five body parts are specified and used as the second-tier SVM classifier training set (Table 2). If the combination of nonoccluded body parts in the first-tier classifier does not exist in the second-tier classifier, this suggests that no pedestrian is present in the image.
Recognizing new data entails the following steps. First, the size of the data is adjusted to that of the training set. Second, the features of the five body parts are obtained and their dimensions are reduced through PCA. The trained first-tier SVM classifier is used to generate the classification probabilities of the five body parts and determine whether any of the body parts are occluded. If the probability score exceeds the threshold (Fig. 8), this body part is deemed to be occluded. Lastly, nonoccluded body parts are paired with the trained second-tier classifier to determine the presence of any pedestrians. Any combinations of nonoccluded body parts that are not paired with a second-tier SVM classifier indicate the absence of pedestrians.
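The decision logic of the recognition steps can be sketched as below. The part names, the threshold of 0.5, and the stand-in second-tier classifiers are hypothetical; in practice each entry would be a trained SVM, and the admissible combinations would be those listed in Table 2. The comparison direction follows the text as written, in which a score above the threshold marks a part as occluded.

```python
PARTS = ("head_trunk", "left_arm", "right_arm", "left_leg", "right_leg")

def visible_parts(scores, threshold):
    """Parts whose first-tier score does not exceed the threshold."""
    return tuple(p for p in PARTS if scores[p] <= threshold)

def detect(scores, second_tier, threshold=0.5):
    """Return True if a pedestrian is reported for one detection window."""
    visible = visible_parts(scores, threshold)
    classifier = second_tier.get(visible)
    if classifier is None:
        return False              # no second-tier classifier for this combination
    return bool(classifier([scores[p] for p in visible]))

def accept_all(part_scores):
    """Hypothetical stand-in for a trained second-tier SVM decision."""
    return True

# Hypothetical second tier: one classifier per admissible part combination.
second_tier = {
    PARTS: accept_all,
    ("head_trunk", "left_arm", "right_arm"): accept_all,
}

full_body = dict.fromkeys(PARTS, 0.1)
legs_hidden = {**full_body, "left_leg": 0.9, "right_leg": 0.9}
only_leg = {**dict.fromkeys(PARTS, 0.9), "left_leg": 0.1}
```

Keying the second tier by the combination of nonoccluded parts makes the final rule explicit: a window whose combination has no classifier is rejected without further evaluation.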
MBHOG with and without Gabor filtering preprocessing
Body part obscurity as determined by the first-tier classifier.
The objective of this study is to accurately identify whether there are any pedestrians in an image and where these pedestrians are located. The accuracy rate, false negative rate (i.e. miss rate), true positive rate (i.e. detection rate), false positive rate, and receiver operating characteristic (ROC) curve value are used to evaluate the efficacy of the proposed method. The accuracy, miss, detection, and false positive rates are defined as follows:
where the accuracy rate is the proportion of correct detection results to all detection results. The miss rate is the proportion of undetected pedestrians to pedestrians actually present. The detection rate is the proportion of correctly detected pedestrians to the pedestrians actually present. The false positive rate is the proportion of cases in which pedestrians are incorrectly detected to cases in which no pedestrian is present. The receiver operating characteristic (ROC) curve is also used [56], with the
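The four rates described in words above can be computed from confusion counts as in the following sketch. The argument names and the example counts are hypothetical and are not results of this study.

```python
def detection_rates(true_pos, false_pos, true_neg, false_neg):
    """Accuracy, miss, detection, and false positive rates from confusion counts."""
    pedestrians = true_pos + false_neg        # windows that contain a pedestrian
    backgrounds = false_pos + true_neg        # windows that contain none
    return {
        "accuracy": (true_pos + true_neg) / (pedestrians + backgrounds),
        "miss_rate": false_neg / pedestrians,
        "detection_rate": true_pos / pedestrians,
        "false_positive_rate": false_pos / backgrounds,
    }

# Hypothetical counts for 100 pedestrian and 100 background windows.
rates = detection_rates(true_pos=90, false_pos=5, true_neg=95, false_neg=10)
```

Under these definitions the miss rate and the detection rate share the same denominator and are complementary.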
Result evaluation in first-tier classifier
Preprocessing using Gabor filtering on trunk
First, this study uses Gabor filtering to preprocess the trunk and enhance the efficacy of MBHOG features. MBHOG features of the trunk are extracted for all training data. After the dimensions of the features are reduced through PCA, 10-fold cross validation is performed using the SVM classifier. The results show that the mean accuracy and detection rates of MBHOG features preprocessed through Gabor filtering increase by 3.13% and 3.4%, respectively, the false positive rate decreases by 3%, and the miss rate is improved by 1.14% (Table 3).
Classification results of different pedestrian limb
Results of proposed pedestrian detection method: (a) Nonoccluded full-body pedestrian images, (b) Partially occluded pedestrian images.
ROC curves of the proposed method and six state-of-the-art approaches for nonoccluded pedestrian images.
ROC curves of the proposed method and six state-of-the-art approaches for partially occluded pedestrian images.
Thereafter, PHF feature extraction of the four limbs is performed on all training data. After the dimensions of the features are reduced through PCA, 10-fold cross validation is conducted using the SVM classifier with the RBF kernel and the parameters of Gamma and Cost being 0.03375 and 312.5, respectively.
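Two ingredients of this setup, the RBF kernel with the reported Gamma and the 10-fold partition, can be sketched as follows. Training the SVM itself (with the reported Cost of 312.5 as the penalty parameter) is left to whichever SVM library is used and is not reproduced here; the function names, the random seed, and the sample data are hypothetical.

```python
import numpy as np

GAMMA = 0.03375   # RBF kernel parameter reported in the text

def rbf_kernel(a, b, gamma=GAMMA):
    """K(a_i, b_j) = exp(-gamma * ||a_i - b_j||^2) for every pair of rows."""
    sq = (a ** 2).sum(axis=1)[:, None] + (b ** 2).sum(axis=1)[None, :] - 2.0 * a @ b.T
    return np.exp(-gamma * np.maximum(sq, 0.0))   # clamp tiny negative round-off

def k_fold_indices(n_samples, folds=10, seed=0):
    """Yield (train_idx, test_idx) pairs that partition a shuffled index range."""
    order = np.random.default_rng(seed).permutation(n_samples)
    parts = np.array_split(order, folds)
    for i in range(folds):
        train = np.concatenate(parts[:i] + parts[i + 1:])
        yield train, parts[i]

# Hypothetical reduced feature vectors for five samples.
rng = np.random.default_rng(1)
samples = rng.normal(size=(5, 4))
kernel = rbf_kernel(samples, samples)
splits = list(k_fold_indices(53))
```

Each sample appears in exactly one test fold, so the mean of the ten fold scores uses every training sample once for validation.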
As shown in Table 4, the classification results of the four limbs are highly similar, with accuracy rates approximating 87%. The images of the four limbs change considerably more than those of the trunk; thus, the accuracy rate is lower for the limb images. However, these images can still be recognized by the second-tier SVM classifier.
Results of proposed pedestrian detection method
The proposed pedestrian detection method is tested on different complex images. Figure 9 presents nonoccluded full-body pedestrian images (Fig. 9a) and partially occluded pedestrian images (Fig. 9b), with detected pedestrians enclosed in red rectangles.
Comparisons with six state-of-the-art approaches
The performance of the proposed method is tested on full-body and partially occluded images of pedestrians in the INRIA database according to the four criteria (the accuracy, miss, detection, and false positive rates). The method is also compared with six state-of-the-art approaches, which are evaluated on the basis of the ROC curve. These six approaches – i.e., Dalal and Triggs approach [8], Wang et al. approach [47], Hoang and Jo approach [48], Zhang et al. approach [57], Ouyang and Wang approach [58], and Luo et al. approach [59] – are selected for comparison because they resolve pedestrian obscurity, ensure high pedestrian detection accuracy, and incorporate the INRIA database.
The proposed method is first evaluated on nonoccluded pedestrian images with respect to the four criteria. The accuracy rate, miss rate, detection rate, and false positive rate of the proposed method for nonoccluded pedestrian image detection are 97.61%, 2.7%, 97.6%, and 2.4%, respectively. As Fig. 10 shows, at a false positive rate of 0.01, the proposed method achieves a detection rate at least 5.5% higher than that of any of the six compared approaches. The figure also indicates that the detection rate of the proposed method is the highest across all false positive rates.
The proposed method is also evaluated on partially occluded pedestrian images with respect to the four criteria. The accuracy rate, miss rate, detection rate, and false positive rate of the proposed method for partially occluded pedestrian image detection are 93.81%, 3.3%, 93.8%, and 6.5%, respectively. As Fig. 11 shows, with the false positive rate ranging between 0.025 and 0.05, the proposed method exhibits a detection rate at least 4.5% higher than that of any of the six compared approaches. The figure also suggests that the detection rate of the proposed method is the highest across all false positive rates.
Conclusion
This study has proposed a method to process crowds and partially occluded pedestrians during pedestrian detection. This method is developed in two ways. First, two image features are proposed to process partial obscurity according to the body parts of pedestrians involved in images: the MBHOG features preprocessed by Gabor filtering for depicting the trunk, and PHF for depicting the four limbs. Computation of both features is accelerated through the integral image acceleration method. Second, a two-tier SVM classifier is designed to process partial pedestrian obscurity. The first-tier SVM classifier determines if any body part is occluded. The classification probabilities of unoccluded body parts are used as the inputs of the second-tier SVM classifier to determine if any pedestrians are present in the detection window. Compared with six state-of-the-art detection approaches, the proposed method is more accurate, performing better in the ROC curve and four other criteria. In addition, the method processes images in which a crowd is present or pedestrians are partially occluded and enables pedestrian detection in images of different scenes.
Acknowledgments
The author would like to express his sincere appreciation for grants partially from MOST105-2410-H-194-059-MY3, Ministry of Science and Technology, Taiwan. In addition, he thanks the student Ji-Di Su, who helped to handle parts of the materials.
