Abstract
This article presents a novel moving object detection algorithm that uses a median-based scale invariant local ternary pattern for intelligent video surveillance systems. Texture and color local features are extracted independently from the incoming frames and combined at the classification level to improve the detection results. Each incoming image frame is subdivided into several regions, and the median-based scale invariant local ternary pattern (MD-SILTP) is obtained for each sub-region. Based on the MD-SILTP patterns, texture histograms are computed and matched with the background model using the histogram intersection method. Furthermore, color features are extracted through a color histogram matching technique. The background model is then updated based on the best matching texture and color histograms. Finally, the color and texture information is combined for the final feature classification. Experimental results illustrate that the fusion of MD-SILTP texture with color features is more stable than the other methods under smooth surface regions, image noise due to illumination changes, moving cast shadows, and scaling problems.
Keywords
Introduction
Computer science and engineering has become part of day-to-day life and serves society in many ways. Video surveillance, human-computer interaction, traffic monitoring, and remote sensing are a few examples. Intelligent video surveillance systems are in demand for applications including monitoring people in prohibited areas, border security, security for commercial property, and human activity analysis. Such systems require robust moving object detection algorithms for high-level video analysis. Moving object detection is the key process in any video surveillance application and is being extensively studied. It aims to detect the target (e.g., a pedestrian, human face, building, or car) under complex conditions including inconsistent target appearance, variations in target size, different viewpoints, changing lighting conditions, and smooth surface regions [1]. A robust object detection algorithm should adapt to dynamic variations and must be able to discriminate the foreground with few false positives. An appropriate background modeling technique is capable of dealing with dynamic background variations and illumination changes [2]. Elgammal et al. (2002) proposed a dominant background modeling technique in which color features are used for background modeling [3]. However, its performance often degrades due to insufficient data in complex environments. In the Gaussian mixture model [4], each pixel is modeled as a mixture of Gaussian distributions and color features are used to update the background changes over time. Yet, it is not suitable for similar background colors and abrupt illumination changes.
The local binary pattern (LBP) descriptor [5] was proposed to discriminate the foreground under monotonic illumination changes. Heikkila and Pietikainen developed a novel texture-based background modeling method in which the background is modeled by constructing LBP region histograms [6]. Still, the performance of LBP is degraded by noise when the background varies dynamically. The LBP operator was therefore extended to the Local Ternary Pattern (LTP) operator [7] by adding a small offset to the pixel comparison. Though LTP is robust for noisy images, its detection results are unsatisfactory under pixel intensity changes. Later, LBP was extended to Center-symmetric LBP (CS-LBP), which generates effective binary patterns [8] with fewer histogram bins by considering only the center-symmetric pairs of pixels. It is robust in flat image regions, but its performance is poor for noisy images. Afterward, the Scale Invariant Local Ternary Pattern (SILTP) operator was proposed to handle the issues related to LBP and LTP. The SILTP operator is strong enough to deal with moving cast shadows, scaling, and lighting changes [9]. However, the SILTP patterns become zero in flat surface regions. A zero SILTP pattern gives no useful information about the foreground texture; hence, SILTP is not suitable for flat and smooth surface areas. Though texture-based background modeling methods are robust to illumination changes and update dynamic variations successfully, they often deviate from the target under similar texture and smooth surface regions. Under such conditions, local color features are good alternatives to texture features. Nowadays, some researchers use more than one local feature for better results under illumination changes, cast shadows, motion changes, and flat and smooth backgrounds [10].
In the adaptive pixel-wise kernel variance method [11], texture features are combined with color features by allowing multiple variances for each pixel. The pixel with the best variance is selected for updating the dynamic variations. However, shadows are mistakenly classified as foreground and the per-pixel computation cost is high. Hong Han et al. (2015) proposed a framework in which both texture and color information are integrated for background modeling [12]. Here, texture information is derived from SILTP patterns and color information is obtained by a color difference method. Furthermore, the background is modeled with a single histogram of the dominant background patterns, under the assumptions that frequently occurring patterns are likely to be background patterns and that a block with more background patterns is a background block. However, foreground patterns with different distributions or different colors are sometimes wrongly considered background patterns, which violates these assumptions. In this case, the framework [12] is ineffective. Alternatively, multichannel SILTP (MC-SILTP) stays invariant in flat areas because of its different channel thresholding [13]. However, MC-SILTP is not bounded by spatial non-saliency and remains sensitive in featureless areas.
Research contributions
In this article, a median-based SILTP operator is proposed to detect targets such as pedestrians, cars, and human faces with fewer false positives. In addition, color features are incorporated to improve the system performance. The main contributions of the proposed approach are as follows. First, the input RGB image frame is converted into a grayscale image frame and divided into several sub-blocks of equal size. Then, the median-based SILTP (MD-SILTP) pattern is calculated for each sub-block and the texture histogram is estimated using the MD-SILTP patterns as histogram bins. These texture histograms are matched with the background model histogram using the histogram intersection method [14]. The background model is updated with the best matching histograms. Second, the color histogram is estimated for the incoming RGB image frame and compared with the background model using a color histogram matching technique. Then, the background model is updated with the matched color histograms. Third, the texture and color features are combined for the final feature classification, i.e., the features that satisfy either the color threshold or the texture threshold are classified as foreground features. Finally, a bounding box is derived and placed around the foreground features in all successive image frames.
The rest of this article is organized as follows. Section 2 summarizes the materials and methods used. Section 3 presents the experimental results and the comparison of the proposed approach with other algorithms. Section 4 concludes the article.
Materials and methods used
The proposed framework comprises three stages, namely MD-SILTP texture feature detection, histogram-based color feature detection, and combined feature classification. Figure 1 shows the flow diagram of the proposed framework.

Flow diagram of proposed approach.
In the basic LBP operator [5], the center pixel of each sub-image region is compared with its neighboring pixels and replaced by the binary pattern ‘0’ or ‘1’, as shown in Fig. 2(a). From Fig. 2(a), it is inferred that the LBP binary patterns do not vary with scale changes. However, LBP is sensitive to local image noise. From Fig. 2(b), it is observed that LTP [7] produces the correct patterns even if the pixels are corrupted by image noise, but scale changes of the pixel intensities lead to poor classification results. Figure 2(c) shows that the SILTP operator [9] is robust to scale changes, but it is still ineffective in the case of image noise. On the other hand, the proposed median-based scale invariant local ternary pattern (MD-SILTP) operator is strong enough to deal with both image noise and scale variation. Here, all the pixels within each block, including the center pixel, are arranged in ascending order and the median value is calculated. Then, the center pixel is compared with the median value and encoded as 00, 01, or 10, as shown in Fig. 2(d). This technique makes the resulting descriptor less sensitive to image noise and scale changes with little computational overhead.

Encoding scheme (a) LBP, (b) LTP, (c) SILTP, and (d) MD-SILTP. Note: The first row represents the encoding schemes without noise. The second row highlights the resulting binary patterns in the case of image noise, and the third row shows the binary patterns in the case of scale changes (all pixel values are doubled). The red fonts indicate the changes in the pixels and their binary patterns due to noise and scaling problems.
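The encoding idea described above can be illustrated with a short Python sketch. It is a minimal illustration under stated assumptions and not the authors' implementation: it assumes that each neighboring pixel is compared against the block median with a scale tolerance tau, in the same way that SILTP compares neighbors against the center pixel. The function name md_siltp_codes and the value tau = 0.05 are hypothetical.

```python
import numpy as np

def md_siltp_codes(block, tau=0.05):
    """Return 2-bit ternary codes for the neighbours of one square block.

    Assumption: the block median (centre pixel included) is the reference
    value, and tau is a SILTP-style scale tolerance.
    """
    block = np.asarray(block, dtype=float)
    median = np.median(block)                  # median over all pixels in the block
    centre = (block.shape[0] // 2, block.shape[1] // 2)
    upper = (1.0 + tau) * median               # upper tolerance bound
    lower = (1.0 - tau) * median               # lower tolerance bound
    codes = []
    for idx, value in np.ndenumerate(block):   # row-major scan of the block
        if idx == centre:
            continue                           # the centre pixel is not a neighbour
        if value > upper:
            codes.append("01")
        elif value < lower:
            codes.append("10")
        else:
            codes.append("00")
    return codes

block = np.array([[50, 52, 80],
                  [49, 51, 20],
                  [50, 53, 51]])
codes = md_siltp_codes(block)
codes_scaled = md_siltp_codes(2 * block)       # all pixel values doubled
```

Because both tolerance bounds are proportional to the median, doubling every pixel value in this sketch leaves the codes unchanged, which mirrors the scale-change row of Fig. 2.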
Let f_c be the center pixel at the spatial location (r_c, s_c), and f_n be the neighborhood pixels placed around the center pixel. Then, the MD-SILTP is calculated as follows:
The thresholding function is expressed as:
Once the MD-SILTP patterns are derived for each block, the texture histograms are constructed with the patterns as histogram bins (i.e., each histogram bin represents the number of occurrences of an MD-SILTP pattern in the block). Let S_B be the image block of size (M × N), and f_{S_B} be the MD-SILTP histogram of that image block. Then, the MD-SILTP histogram is calculated as follows:
In Equation (6), the term
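The block histogram construction and the histogram intersection matching described above can be sketched as follows. This is a minimal sketch, assuming that each MD-SILTP pattern has already been mapped to an integer label in the range 0 to n_bins - 1 and that histograms are normalized to sum to one; the function names and the example values are hypothetical.

```python
import numpy as np

def pattern_histogram(labels, n_bins):
    """Normalised histogram of integer pattern labels found in one block."""
    labels = np.asarray(labels, dtype=int).ravel()
    hist = np.bincount(labels, minlength=n_bins).astype(float)
    total = hist.sum()
    return hist / total if total > 0 else hist

def histogram_intersection(h1, h2):
    """Sum of bin-wise minima; equals 1.0 for identical normalised histograms."""
    return float(np.minimum(h1, h2).sum())

# Hypothetical labels for one incoming block and one background histogram.
block_hist = pattern_histogram([0, 0, 1, 2, 0, 1], n_bins=3)
background_hist = np.array([0.5, 0.25, 0.25])
similarity = histogram_intersection(block_hist, background_hist)
```

A similarity close to one indicates that the incoming block resembles the background model, while a low value suggests a foreground block.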
MD-SILTP patterns alone are not enough to deal with smooth surface regions such as roads, single-color clothes, and car bodies with similar texture. Furthermore, the MD-SILTP patterns become zero in flat surface regions. Hence, it is difficult to discriminate smooth foregrounds from similar backgrounds. Color information, in contrast, is strong enough to deal with similar texture, so it can be used in addition to the texture features in order to handle smooth environments. In the proposed approach, the color histogram is derived for each incoming frame and matched with the background model color histogram using the histogram matching method.
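A color histogram comparison of this kind can be sketched as below. The sketch makes several assumptions that are not stated in the text: a joint RGB histogram with 8 quantization levels per channel, normalization to unit sum, and histogram intersection as the matching score. The function names are hypothetical.

```python
import numpy as np

def color_histogram(frame_rgb, bins_per_channel=8):
    """Normalised joint RGB histogram of an 8-bit H x W x 3 frame."""
    pixels = np.asarray(frame_rgb, dtype=np.uint8).reshape(-1, 3).astype(int)
    # Quantise each channel from 0..255 to 0..bins_per_channel-1.
    q = (pixels * bins_per_channel) // 256
    # Combine the three quantised channels into a single bin index.
    labels = (q[:, 0] * bins_per_channel + q[:, 1]) * bins_per_channel + q[:, 2]
    hist = np.bincount(labels, minlength=bins_per_channel ** 3).astype(float)
    return hist / hist.sum()

def color_similarity(hist_a, hist_b):
    """Histogram intersection used here as the matching score (assumption)."""
    return float(np.minimum(hist_a, hist_b).sum())

# Hypothetical frames: a pure red background and a frame that is half blue.
background = np.zeros((4, 4, 3), dtype=np.uint8)
background[..., 0] = 255
frame = background.copy()
frame[:2] = (0, 0, 255)
score = color_similarity(color_histogram(frame), color_histogram(background))
```

In this toy example half of the pixels change color, so the matching score drops to one half of its maximum.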
Let f_C be the color histogram of the incoming frame and
Initially, the background MD-SILTP texture histogram is computed and then the background color histogram is obtained. From these background histograms, the probabilities of the pixels being classified as background are calculated and combined for the final feature classification. Here, the texture and color features are combined at the decision level instead of the feature level to avoid pixel ambiguity. Let
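The decision-level combination can be illustrated with the following sketch. It assumes that each cue yields a similarity score to the background model in the range 0 to 1, that a feature is labeled foreground when either score falls below its threshold, and that the texture and color thresholds take the values 0.6 and 0.3 reported in the quantitative evaluation. The direction of the comparison and the function name are assumptions.

```python
import numpy as np

def classify_foreground(texture_sim, color_sim, texture_th=0.6, color_th=0.3):
    """Boolean foreground mask from two background-similarity scores.

    Assumption: low similarity to the background model means foreground,
    and the two per-cue decisions are combined with a logical OR.
    """
    texture_sim = np.asarray(texture_sim, dtype=float)
    color_sim = np.asarray(color_sim, dtype=float)
    texture_fg = texture_sim < texture_th    # texture cue votes foreground
    color_fg = color_sim < color_th          # colour cue votes foreground
    return texture_fg | color_fg             # decision-level fusion

# Hypothetical scores for four blocks.
mask = classify_foreground([0.9, 0.5, 0.9, 0.2], [0.8, 0.8, 0.1, 0.1])
```

Combining decisions rather than concatenating features lets a flat region with an uninformative texture score still be detected through its color score.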
Qualitative analysis
The proposed approach is implemented in MATLAB 2013 on a PC with a Pentium® Dual-core CPU E5200 @ 2.5 GHz and 4 GB RAM, and its performance is tested on five publicly available video sequences: viptraffic.avi, an internet video sequence, visiontraffic.avi, Waving Trees, and Curtain. The complexities involved in these data sets are listed in Table 1. The proposed method is also compared with baseline methods such as multi-scale fusion of texture and color [2], the block-based background modeling method based on integration of texture and color information (BITC) [12], and coarse-to-fine detection theory [15]. For the state-of-the-art methods, the parameters suggested by the authors are used for performance evaluation. The proposed method is evaluated with the constant parameters such as
Tracking sequences used in proposed method
From Table 1, it is inferred that similar background texture (i.e., the road surface is similar to the body of the car) and illumination changes are the major challenges in the viptraffic.avi video sequence. In the internet video sequence, the complexities arise from vehicles and pedestrians crossing the road and from outdoor illumination changes. The visiontraffic.avi video sequence contains moving shadows and similar background texture. In the Waving Tree sequence, a human walks in front of waving trees, which makes the human difficult to detect. In the Curtain video sequence, the waving curtains create a dynamic background and it is difficult to identify the foreground without background disturbances. For qualitative analysis, sample frame results of the proposed method and the baseline methods are shown in Figs. 3–7. From the results, it is noticed that the proposed method identifies the foreground object better than the others in all sequences. In Fig. 3(a), the multi-scale fusion of texture and color detects the foreground with false alarms because of foreground pixel variations due to abrupt motion and similar background. Moreover, it is unable to cope with the pixel intensity variations and fails to detect some foreground pixels.

Sample frame results of proposed approach and others for a viptraffic.avi video sequence. (a) multi-scale fusion of texture and color, (b) Coarse-to-fine detection, (c) BITC and (d) proposed.

Sample frame results of proposed approach and others for internet video sequence. (a) multi-scale fusion of texture and color, (b) Coarse-to-fine detection, (c) BITC and (d) proposed.

Sample frame results of proposed approach and others for a visiontraffic.avi video sequence. (a) multi-scale fusion of texture and color, (b) Coarse-to-fine detection, (c) BITC and (d) proposed.

Sample frame results of proposed approach and others for curtain video sequence. (a) multi-scale fusion of texture and color, (b) Coarse-to-fine detection, (c) BITC and (d) proposed.

Sample frame results of proposed approach and others for Waving Tree video sequence. (a) multi-scale fusion of texture and color, (b) Coarse-to-fine detection, (c) BITC and (d) proposed.
In Fig. 3(b), the bounding box of coarse-to-fine detection breaks into parts and background pixels are wrongly identified as foreground pixels. In Fig. 3(c), BITC detects the target with few false alarms. On the other hand, the proposed approach detects the target correctly in the presence of abrupt motion and similar background texture (see Fig. 3(d)). For the internet video sequence, all baseline methods, including multi-scale fusion, coarse-to-fine detection, and BITC, suffer from split and merge problems. In a few image frames, they combine several foreground objects into a single object and vice versa. In addition, background objects are incorrectly detected as foreground because of illumination changes caused by sunlight (see Fig. 4(a-c)). In the proposed method, the local color and MD-SILTP texture features are integrated well and therefore the target is precisely identified without background influences (see Fig. 4(d)). In order to validate the performance of the proposed approach under moving soft shadows, it is compared with the baseline methods on the visiontraffic.avi video sequence and the sample results are given in Fig. 5(a-d).
The multi-scale fusion captures the target together with its shadow, and the coarse-to-fine detection algorithm is significantly distracted from the target by false positives. Even though BITC suppresses the shadow, it often deviates from the target because of illumination changes and considerable similarity between shadow and foreground. In the proposed approach, the moving cast shadow and large illumination variations are effectively handled by incorporating the supplementary color features in addition to the median-based texture features. Hence, the proposed approach consistently detects the target without its moving shadow in all frames. The proposed and baseline methods are also tested on the dynamic sequences Curtain and Waving Tree, and the sample results are given in Fig. 6 and Fig. 7.
In the Curtain video sequence, the coarse-to-fine detection algorithm wrongly detects the waving curtains as foreground. The multi-scale fusion algorithm is degraded by false alarms due to reflections from the whiteboard. While BITC suppresses the waving curtains, it splits the foreground into several parts and sometimes misses the target altogether. In contrast, the proposed approach updates the background changes and properly detects the foreground in most of the image frames. Compared to the baseline methods, the proposed method also gives appropriate results on the Waving Tree sequence by successfully suppressing the waving tree branches.
For quantitative evaluation, the performance metrics including precision, detection rate, F-measure, false alarm rate, and success rate are numerically evaluated for all the methods as follows [16–18]:
In Equations (12–16), the terms T_p, F_p, and F_n are the true positives (properly detected foreground objects), false positives (falsely detected foregrounds), and false negatives (falsely detected backgrounds), respectively. All frame-based metrics are evaluated by comparing the ground truth regions with the detected regions. In addition, a foreground detection is considered a success if the overlap between the detected region and the ground truth region reaches 50%. The performance metrics are numerically calculated for different texture and color thresholds ranging from 0.1 to 1, and the threshold which gives higher values of precision, detection rate, and F-measure and lower values of false alarm rate is selected as the optimal threshold value. Figure 8(a-d) shows the performance of the proposed approach in terms of the surveillance metrics for different texture thresholds. It is noticed that a texture threshold (φ_th) of 0.6 gives reasonable values for all metrics. Similarly, the color threshold (φ_th) is set to 0.3 after several trials. Hence, a texture threshold of 0.6 and a color threshold of 0.3 are used for the final feature classification. The quantitative results of the proposed and baseline methods on the publicly available video sequences are given in Table 2. The average performance metrics are reported in Table 3. It can be seen that the proposed method achieves good results in terms of all frame-based metrics, with fewer classification errors but at the cost of additional computation time. The quantitative results for the Waving Tree and Curtain image sequences illustrate that the proposed approach updates the background variations robustly and identifies the target, while the other methods are not capable of updating the dynamic backgrounds.
The quantitative results for the visiontraffic.avi and internet video sequences show that the proposed approach attains good results against the baseline methods irrespective of moving cast shadows and multiple moving objects. The results for the viptraffic.avi video sequence confirm that the proposed approach outperforms the others in the case of abrupt motion and similar texture.
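The frame-based metrics and the 50% overlap criterion used above can be sketched in Python as follows. The definitions are assumptions based on common usage in the surveillance literature rather than the article's own equations: precision T_p/(T_p + F_p), detection rate T_p/(T_p + F_n), F-measure as their harmonic mean, false alarm rate F_p/(T_p + F_p), and overlap measured as intersection over union of (x, y, width, height) boxes. All function names are hypothetical.

```python
def frame_metrics(tp, fp, fn):
    """Precision, detection rate, F-measure and false alarm rate from counts."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    detection_rate = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + detection_rate
    f_measure = 2 * precision * detection_rate / denom if denom else 0.0
    false_alarm_rate = fp / (tp + fp) if tp + fp else 0.0   # assumed definition
    return {"precision": precision, "detection_rate": detection_rate,
            "f_measure": f_measure, "false_alarm_rate": false_alarm_rate}

def overlap_ratio(box_a, box_b):
    """Intersection over union of two (x, y, width, height) boxes."""
    ax, ay, aw, ah = box_a
    bx, by, bw, bh = box_b
    iw = max(0.0, min(ax + aw, bx + bw) - max(ax, bx))   # overlap width
    ih = max(0.0, min(ay + ah, by + bh) - max(ay, by))   # overlap height
    inter = iw * ih
    union = aw * ah + bw * bh - inter
    return inter / union if union > 0 else 0.0

def success_rate(detections, ground_truth, min_overlap=0.5):
    """Fraction of frames whose detection overlaps the ground truth enough."""
    if not ground_truth:
        return 0.0
    hits = sum(overlap_ratio(d, g) >= min_overlap
               for d, g in zip(detections, ground_truth))
    return hits / len(ground_truth)

# Hypothetical counts and boxes.
metrics = frame_metrics(tp=80, fp=20, fn=20)
rate = success_rate([(0, 0, 10, 10), (5, 0, 10, 10)],
                    [(0, 0, 10, 10), (0, 0, 10, 10)])
```

Under these assumed definitions the false alarm rate is the complement of precision, so a threshold sweep only needs to track three independent quantities per setting.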

Performance metrics with respect to different SILTP thresholds. (a) Precision, (b) Detection rate, (c) F-measure and (d) False alarm rate.
Performance metrics evaluated for various publicly available data sequences
Note: Bold fonts indicate the best performance.
Average performance metrics of proposed and other baseline methods
Note: Bold fonts indicate the best performance.
The proposed method performs well in several complex situations including similar texture, moving cast shadows, cluttered background, multiple moving objects, illumination variations, abrupt motion, and dynamic backgrounds. Nevertheless, some issues remain to be addressed for further improvement. First, texture and color histograms are constructed for background feature detection, and the temporary background image frame is updated with the dynamic variations by updating the texture and color histograms of the background image. Though these histograms are stable in complex environments, they are computationally intensive. Second, the proposed approach is limited to off-line processing and is not suitable for online processing. Future work will investigate these issues and optimize the proposed approach.
Conclusion
A novel median-based SILTP operator is proposed for background modeling and, in addition, histogram-based color features are incorporated to cope with smooth surface regions. Initially, the incoming frame is subdivided into several small blocks and the MD-SILTP patterns are estimated. Then, the background model is constructed and updated based on the texture and color background features obtained through the histogram intersection between the incoming frame histogram and the temporary background image histogram. The final foreground feature classification is then done by joint thresholding of the color and texture information. Finally, the foreground features are tracked with a bounding box in all successive frames. Experiments were carried out on different video sequences and the performance was compared with other baseline methods. The comparative results illustrate that the proposed approach successfully handles challenging conditions such as similar texture, moving cast shadows, cluttered background, multiple moving objects, illumination variations, abrupt motion, and dynamic backgrounds.
