A parallel approach for the training stage of the Viola-Jones face detection algorithm

Abstract

Face detection is the first step for automatic face recognition systems. However, detecting faces is not an easy task due to variations in factors such as pose, illumination and scale, among others. Efficient face detection algorithms like the one proposed by Viola and Jones allow one to detect faces in real time with high accuracy rates. However, this algorithm involves several stages that are computationally expensive, particularly for the training process. Also, increasing the size of the input images for face detection and using large training data sets for face recognition demand additional computing resources to achieve real-time processing. In this paper we present a parallel approach to perform three stages of the Viola-Jones face detection algorithm: the integral image computation, the Haar-like features estimation and the evaluation of these features. Our experimental results show that our proposed approach obtains better performance than the OpenCV library implementations.

Keywords

Face detection, Viola-Jones, computational parallel model, CUDA

1. Introduction

Parallel computing has become an essential topic in computer science and its application fields, as it is much better suited for modeling, simulating and understanding complex real-world problems when compared to serial computing [23]. For example, parallel computing has been successfully applied in many areas such as genetics, biotechnology, chemistry, physics, mechanical engineering, circuit design, microelectronics, environment, medical imaging and diagnosis, pharmaceutical design, data mining, and big data, among others [23].

Parallel computing, in simple words, is the simultaneous use of multiple computing resources (cores, computers, etc.) to solve a computational problem [12]. Generally, parallel computing is used when facing problems that demand large computing resources, either large-scale problems or applications requiring real-time processing. For instance, parallel computing is essential in computer vision tasks such as visual surveillance, representation learning (e.g., deep learning), image retrieval, human-machine interaction or face recognition. In this last task, guaranteeing real-time performance is crucial in most of its applications (e.g., authentication, surveillance and behavior analysis).

However, a first step for those systems and their applications is face detection, a time-consuming process that is affected by additional factors such as variations in pose, rotation, illumination and scale. Several efficient face detection methods are available nowadays; among the most effective ones is the Viola-Jones algorithm [4]. Although it is among the most effective and efficient techniques, it involves several tasks that consume large computing resources, particularly in the training phase.

In recent years, Field-Programmable Gate Arrays (FPGAs) and Graphics Processing Units (GPUs) have become useful tools for putting large-scale parallelism into practice in the Viola-Jones algorithm [16]. For example, Junguk et al. introduced a parallel FPGA architecture of multiple classifiers to accelerate the processing for face detection based on the Viola-Jones method [13]. They obtained 3.3 times better performance than the architecture of a single classifier and 84 times better than an equivalent software solution. Hiromoto et al. proposed a partially parallel architecture also based on FPGA, achieving higher processing performance and smaller circuit area than the parallel implementation of all classifiers [14]. Their architecture reached 30 fps when processing images of 640 $\times$ 480 pixels. In 2010, Bilgic et al. introduced an algorithm to calculate the integral image that performs in real time on a GPU using CUDA [15]. They achieved 5 ms on a GPU for image sizes of $4.5\times 10^{6}$ pixels versus 45 ms on a CPU implementation. Hefenbrock et al. presented a GPU design to achieve high performance on VGA images [17]. The authors claim that their proposed parallel model obtains better time performance than dedicated devices such as FPGAs.

In 2011, Oro et al. introduced an optimized Haar-based face detector that works in real time using high definition videos [18]. They proposed kernel operations that exploit both coarse- and fine-grained parallelism in GPUs for performing integral image computations and filter evaluations. They achieved competitive results on image sizes of 1080 pixels. Kyrkou and Theocharides presented a systolic array to implement detection in parallel (FPGA) [19]. They detected objects within 1024 $\times$ 768 images at 64 fps with 95% accuracy.

In 2013, Chang and Hwang introduced an enhanced training method using the AdaBoost algorithm for obtaining the localized sampling optimum (LSO) from a local IP-CAM video [21]. They also used an improved motion detection algorithm that cooperates with the former face detector to speed up processing and achieve a better detection rate. Pertsau and Uvarov introduced a parallel algorithm for face detection in images, which is an extension of an OpenCV library [22]. They also developed a scheduling algorithm to balance the workload among GPU threads.

In 2014, Bilaniuk et al. presented an embedded FPGA implementation of a face detection method based on boosted LBP features [24]. Their proposed implementation runs at 5 fps on VGA images, while providing similar accuracy to the PC version of the LBP algorithm included in OpenCV. Wei et al. introduced an FPGA architecture that simulates a small part of the entire Viola-Jones algorithm [6]. This simulation achieved rates up to 15 fps using 120 $\times$ 20 images.

In 2008, Gao and Lu proposed an FPGA design focused on feature classifier calculation [9]. In their approach, the frame rate was 98 fps on 256 $\times$ 192 images. In 2009, Cho et al. [11] presented an architecture that performed all tasks of the algorithm on an FPGA using special frame grabbers and buffers to accelerate the calculations. Their approach is faster than conventional multicore processor implementations, operating at 6.55 fps for VGA images, versus 0.31 fps for single core implementations. Also, their approach is able to compute 3 features in parallel.

In 2011, Ding et al. [20] introduced techniques for face detection and presented a method to select the stop threshold for the image reduction process, which decreases the total computation by half. Their proposed system achieves real-time face detection at 100 fps for SVGA images. In 2014, Kumar and Agarwal [25] presented an architecture for generating the integral of an image of the current detection window dynamically. They reported that their approach reduces hardware resources usage by around 50% using VGA images.

In this paper, we present a parallel computational model to accelerate three stages of the Viola-Jones face detection algorithm that are inherent to the training process, using a GPU and the CUDA programming model. The three targeted tasks are: calculating the integral image, obtaining the haar-like features, and evaluating the acquired features. We therefore present three parallel algorithms to improve the performance of the face detection task in computer vision, which demands high computing resources in real-time applications. Experimental results are compared to the OpenCV implementation, obtaining, in almost all cases, a significant acceleration in terms of processing times. According to these results, the three proposed algorithms improved on the OpenCV implementations; in particular, the integral image computation obtained the lowest processing times.

The remainder of the paper is organized as follows. Section 2 presents a brief background of the Viola-Jones face detection algorithm, which comprises three main stages: haar-like feature extraction, integral image, and training. Also, in this section, a brief overview of parallel computing and the CUDA architecture is provided. The proposed parallel approaches are introduced in Section 3. Experimental results are presented in Section 4. Finally, conclusions and directions for future work are presented in Section 5.

2. Background

In this section, the Viola-Jones algorithm and its three stages considered for acceleration are described. A general overview of GPU architectural characteristics and the CUDA programming model is also presented.

2.1 Viola-Jones algorithm

The Viola-Jones algorithm [4] is considered one of the most effective and efficient algorithms for object detection, and particularly for face detection, due to its fast and accurate performance in real-time applications. This algorithm has three main components: haar-like feature estimation, integral image computation and training of a cascade classifier. These components are described in the rest of this section.

2.1.1 Haar-like features

In 1997, Oren et al. [1] introduced a novel method for people detection using the Haar wavelet concept. Then, in 2000, Papageorgiou and Poggio [2] applied this concept to people and face detection in unconstrained and cluttered scenes. After that, Viola and Jones [4] proposed using this concept for object detection, calling the resulting descriptors haar-like features. The haar-like features are represented as rectangular regions on the image, and can be defined as:

$\displaystyle\textit{feature}_{I}=\sum_{i\in I=\{1,\dots,N\}}w_{i}\,\textit{RecSum}(r_{i})$

To compute the value of a feature, the sum of all pixels contained in each of the rectangles making up the feature is calculated, $\textit{RecSum}(r_{i})$ . Then, each sum is multiplied by the corresponding rectangle’s weight $w_{i}$ and the result is accumulated for all the rectangles in the feature.
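As an illustration, the following Python sketch evaluates the feature definition directly, summing the pixels of each rectangle and weighting the result. The rectangle layout `(x, y, w, h)`, the function names and the toy image are assumptions made for this example; they are not taken from the original implementation.

```python
# Serial reference sketch: a haar-like feature as a weighted sum of rectangle sums.
# Rectangles are (x, y, w, h) tuples; this layout is an assumption for illustration.

def rec_sum(img, rect):
    """RecSum(r): sum of all pixels inside rect, computed pixel by pixel."""
    x, y, w, h = rect
    return sum(img[row][col] for row in range(y, y + h) for col in range(x, x + w))

def feature_value(img, rects, weights):
    """feature = sum over i of w_i * RecSum(r_i)."""
    return sum(wt * rec_sum(img, r) for r, wt in zip(rects, weights))

# 4x4 toy image: left half dark (1), right half bright (3).
img = [[1, 1, 3, 3] for _ in range(4)]

# Two-rectangle (edge) feature: right half weighted +1, left half weighted -1.
rects = [(2, 0, 2, 4), (0, 0, 2, 4)]
weights = [1, -1]
value = feature_value(img, rects, weights)
```

The direct summation costs one addition per pixel of every rectangle, which is the cost that the integral image is meant to remove.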

Because the set of features that can be obtained is almost infinite, some constraints have been proposed to reduce their number [5]. Figure 1 shows the haar-like feature set proposed by Viola and Jones [4] and the extended set of features proposed by Lienhart and Maydt [5].

Figure 1.

Haar-like features proposed by Viola-Jones and the rotated ones proposed by Lienhart and Maydt [5].

Even with these restrictions, the number of potential features is large. For example, about 160,000 features can be generated in an image of 24 $\times$ 24 pixels, because the haar-like features form an over-complete set [4]. For this reason, Viola and Jones proposed using the integral image to compute sums over rectangular areas quickly and efficiently.
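The order of magnitude quoted above can be checked with a short count. The Python sketch below assumes five upright base shapes (two-rectangle horizontal and vertical, three-rectangle horizontal and vertical, and four-rectangle) and counts every scaled and translated copy that fits in the window; the choice of shapes and the function name are assumptions for illustration.

```python
# Counting sketch: how many upright haar-like features fit in a square window.
# Base shapes are given as (width, height) in cells; this set is an assumption.

def count_features(window, base_w, base_h):
    """Count all scales and positions of one base shape inside the window."""
    total = 0
    for w in range(base_w, window + 1, base_w):        # widths: multiples of base_w
        for h in range(base_h, window + 1, base_h):    # heights: multiples of base_h
            total += (window - w + 1) * (window - h + 1)  # valid top-left positions
    return total

BASE_SHAPES = [(2, 1), (1, 2), (3, 1), (1, 3), (2, 2)]
total = sum(count_features(24, bw, bh) for bw, bh in BASE_SHAPES)
```

Under these assumptions the count for a 24 $\times$ 24 window is 162,336, which is consistent with the figure of about 160,000 features.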

2.1.2 Integral image

Estimating RecSum in the feature definition of Section 2.1.1, i.e., calculating the sum of pixels within each rectangle of a haar-like feature, is computationally expensive. Thus, the integral image is used: once it has been computed in a single pass over the image, each haar-like feature can be calculated in constant time. The integral image at a point $(x,y)$ is the sum of all the pixels above and to the left of $(x,y)$, inclusive. In formal terms this is described as:

$\displaystyle ii(x,y)=\sum_{x^{\prime}\leqslant x,\,y^{\prime}\leqslant y}i(x^{\prime},y^{\prime})$ (1)

where $ii(x,y)$ is the integral image at location $(x,y)$ and $i(x^{\prime},y^{\prime})$ is the original image. The integral image can be computed using the following recursive formulas:

$\displaystyle s(x,y)=s(x,y-1)+i(x,y)$ (2)

$\displaystyle ii(x,y)=ii(x-1,y)+s(x,y)$ (3)

where $s(x,y)$ is the cumulative row sum. If $s(x,-1)=0$ and $ii(-1,y)=0$, the integral image can be computed in one pass [4, 5]. Figure 2 shows the result of computing an integral image. Note that the sum of the pixels proceeds from top to bottom and from left to right, so that the last pixel holds the sum of all previous pixels plus its own value.
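The recurrences in Eqs. (2) and (3) can be written as a short serial routine. The Python sketch below is a reference illustration only; the nested-list image layout `img[y][x]` and the function name are assumptions, and the boundary values $s(x,-1)=0$ and $ii(-1,y)=0$ are handled with explicit checks.

```python
# Serial reference sketch of Eqs. (2) and (3); img is indexed as img[y][x].

def integral_image(img):
    height, width = len(img), len(img[0])
    s = [[0] * width for _ in range(height)]    # cumulative sum s(x, y)
    ii = [[0] * width for _ in range(height)]   # integral image ii(x, y)
    for y in range(height):
        for x in range(width):
            s_prev = s[y - 1][x] if y > 0 else 0       # s(x, -1) = 0
            ii_prev = ii[y][x - 1] if x > 0 else 0     # ii(-1, y) = 0
            s[y][x] = s_prev + img[y][x]               # Eq. (2)
            ii[y][x] = ii_prev + s[y][x]               # Eq. (3)
    return ii

img = [[1, 2, 3],
       [4, 5, 6],
       [7, 8, 9]]
ii = integral_image(img)
```

Each pixel is visited once, so the cost is linear in the number of pixels, and the bottom-right entry equals the sum of the whole image.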

Figure 2.

Result of calculating the integral image.

As a result, the sum of any rectangular area in the image can be calculated efficiently, in constant time, from a few array references [4]. Figure 3 shows that the sum of the pixels of the ABCD region, calculated by Eq. (4), requires only four references to the integral image.

$\displaystyle\sum_{(x,y)\in\textit{ABCD}}i(x,y)=ii(D)+ii(A)-ii(B)-ii(C)$ (4)
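Eq. (4) can be illustrated with the following Python sketch. It assumes that A, B, C and D denote the integral image values just outside the top-left, top-right and bottom-left corners and at the bottom-right corner of the rectangle, respectively; this corner labelling, the rectangle parameters and the helper names are assumptions for illustration.

```python
# Sketch of the four-reference rectangle sum of Eq. (4); img is indexed as img[y][x].

def integral_image(img):
    """Inclusive integral image: ii[y][x] is the sum of img[0..y][0..x]."""
    height, width = len(img), len(img[0])
    ii = [[0] * width for _ in range(height)]
    for y in range(height):
        row_sum = 0
        for x in range(width):
            row_sum += img[y][x]
            ii[y][x] = row_sum + (ii[y - 1][x] if y > 0 else 0)
    return ii

def at(ii, x, y):
    """Integral image lookup that returns 0 outside the image (index -1)."""
    return ii[y][x] if x >= 0 and y >= 0 else 0

def rect_sum(ii, x, y, w, h):
    """Sum over the rectangle with top-left pixel (x, y), width w and height h."""
    a = at(ii, x - 1, y - 1)            # A: above and to the left of the rectangle
    b = at(ii, x + w - 1, y - 1)        # B: above the top-right corner
    c = at(ii, x - 1, y + h - 1)        # C: to the left of the bottom-left corner
    d = at(ii, x + w - 1, y + h - 1)    # D: bottom-right corner
    return d + a - b - c                # Eq. (4)

img = [[1, 2, 3],
       [4, 5, 6],
       [7, 8, 9]]
ii = integral_image(img)
inner = rect_sum(ii, 1, 1, 2, 2)   # pixels 5, 6, 8 and 9
```

The cost of `rect_sum` does not depend on the rectangle size, which is why feature evaluation becomes constant-time once the integral image is available.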

Haar-like features are defined as the intensity differences between two or four rectangles, therefore the integral image can be used to calculate one simple rectangular haar-like feature [5]. For example, from the first group in Fig. 1, the feature value of case (a) is the average difference of pixel values between black and white rectangles.

Figure 3.

Gray area is computed using four references as $D-(B+C)+A$ .

To detect faces of different sizes, either the haar-like features can be scaled up or the image can be scaled down. Therefore, the integral image needs to be calculated several times, and this process requires a large amount of computation.

2.1.3 Cascade classifier

Viola and Jones used the AdaBoost algorithm shown in Algorithm 2.1.3 for the training stage. The main goal of AdaBoost is to build a strong classifier using a weighted linear combination of weak classifiers (i.e., an ensemble classifier). Briefly, the algorithm performs as follows. Step 2 initializes a weight for each positive and negative training sample; these weights are updated in each iteration, allowing the ensemble to become a strong classifier.

Step 3 starts the loop from one to the maximum number of training levels $T$. In Step 5 the best classifier is selected; to do so, the examples are ordered according to the feature value. Four sums are maintained: $T^{+}$ and $T^{-}$, the total sums of weights of all positive and negative examples, respectively; $S^{+}$, the sum of the weights of the positive examples before the current example; and $S^{-}$, the sum of the weights of the negative examples before the current example. In Step 7 the weights are updated in order to improve the classification of false positives in later levels. At this point, a weak classifier is stored, and finally in Step 10 a strong classifier is obtained.

We want to highlight that the algorithm involves a process that requires large amounts of computations because it performs an iterative task of the same instructions with different data. Therefore, parallel computing is an alternative to perform this task in order to reduce the processing time for training.

Algorithm: The AdaBoost algorithm

Input: Training set $T=\{(x_{1},y_{1}),...,(x_{n},y_{n})\}$
Output: Strong classifier $C(x)$
1: Let $(x_{1},y_{1}),...,(x_{n},y_{n})$ be a training set where $x_{i}\in X$ is a set of instances, and $y_{i}\in Y=\{0,1\}$ is a set of negative and positive examples, respectively.
2: Initialize weights $w_{1,i}\leftarrow\frac{1}{2m},\frac{1}{2l}$ for $y_{i}=0,1$, where $m$ and $l$ are the total number of negative and positive examples, respectively.
3: for levels $t\leftarrow 1$ to $T$ do
4:   Normalize weights $w_{t,i}\leftarrow\frac{w_{t,i}}{\sum_{j=1}^{n}w_{t,j}}$
5:   Select the weak classifier $h$ with the lowest error
6:   A classifier $h_{t}(x)=h(x,f_{t},p_{t},\theta_{t})$ is defined, where $f_{t},p_{t}$ and $\theta_{t}$ minimize the error $\epsilon_{t}=\min_{f,p,\theta}\sum_{i}w_{i}|h(x_{i},f,p,\theta)-y_{i}|$
7:   Update the weights: $w_{t+1,i}\leftarrow w_{t,i}\beta_{t}^{1-e_{i}}$
8:   where $e_{i}=0$ if the example $x_{i}$ is classified correctly, otherwise $e_{i}=1$, and $\beta_{t}=\frac{\epsilon_{t}}{1-\epsilon_{t}}$
9: end for
10: The final strong classifier is:

$C(x)=\begin{cases}1&\sum_{t=1}^{T}\alpha_{t}h_{t}(x)\geqslant\frac{1}{2}\sum_{t=1}^{T}\alpha_{t}\\ 0&\textit{otherwise}\end{cases}$

where $\alpha_{t}=\log\frac{1}{\beta_{t}}$
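The weight handling of one boosting level can be sketched in Python as follows. The helper names and the toy labels and predictions are hypothetical; the weak classifier output is simply given as a list rather than computed from features, and the sketch assumes $0<\epsilon_{t}<1$ so that $\beta_{t}$ and $\alpha_{t}$ are finite.

```python
import math

# Sketch of weight initialisation, normalisation and update for one AdaBoost level.
# Labels follow the listing: 0 = negative example, 1 = positive example.

def init_weights(labels):
    m = labels.count(0)   # number of negative examples
    l = labels.count(1)   # number of positive examples
    return [1.0 / (2 * m) if y == 0 else 1.0 / (2 * l) for y in labels]

def normalize(weights):
    total = sum(weights)
    return [w / total for w in weights]

def weighted_error(weights, predictions, labels):
    return sum(w * abs(p - y) for w, p, y in zip(weights, predictions, labels))

def update_weights(weights, predictions, labels, eps):
    """Scale correctly classified examples (e_i = 0) by beta; assumes 0 < eps < 1."""
    beta = eps / (1.0 - eps)
    new = [w * (beta if p == y else 1.0)
           for w, p, y in zip(weights, predictions, labels)]
    return new, beta

def strong_classify(alphas, votes):
    """C(x) = 1 if the weighted vote reaches half of the total alpha, else 0."""
    return 1 if sum(a * v for a, v in zip(alphas, votes)) >= 0.5 * sum(alphas) else 0

labels = [1, 1, 0, 0]
predictions = [1, 0, 0, 0]   # hypothetical weak classifier output with one mistake
w = normalize(init_weights(labels))
eps = weighted_error(w, predictions, labels)
w_next, beta = update_weights(w, predictions, labels, eps)
alpha = math.log(1.0 / beta)
```

After the update, the misclassified example keeps its weight while the correctly classified ones shrink, so the next level concentrates on the hard example.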

2.2 Parallel computing

In several computational problems and applications, serial computing is not fast enough to obtain a solution in a reasonable amount of time. Parallel computing is an alternative approach to solve problems that require intensive computation and/or the handling of huge amounts of information [16].

Parallel computing allows the concurrent execution of tasks, in such a way that a single processing load is divided and solved through different processing units. There are several parallel processing schemes, such as having a single instruction or a single program processing multiple data concurrently. At an algorithmic level several considerations must be made, such as data dependency.

It is important to mention that parallel schemes cannot be defined for all algorithmic problems. In addition, a parallel processing scheme must have certain characteristics in order to operate correctly and efficiently. These characteristics include the following [3]:

• Granularity defines the number of processing elements among which the processing load is divided:
– Coarse-grained. A number of tasks are grouped and therefore more intensive computing is required per processing unit.
– Fine-grained. A minimum number of tasks are grouped, implying that processing elements are less complex and carry out much less intensive computing.
• Types of parallel processing:
– Explicit. Instructions are included within the program to explicitly specify which tasks are executed in parallel.
– Implicit. Instructions are inserted by the compiler for the parallel execution of the program.
• Synchronization prevents overlapping of two or more processes.
• Latency relates to the transit time of information from request to receipt.
• Scalability is the parallel scheme's ability to maintain its performance while increasing the number of processors or the problem size.

In terms of processing architectures, a general taxonomy divides them into multi-core and many-core processors. Multi-core processors execute instructions sequentially among processing cores or parallelize their execution to a certain level. Many-core processors, on the other hand, are designed to implement parallel execution schemes such as Single Instruction Multiple Data (SIMD) and can achieve massively parallel execution [16]. Graphics Processing Units (GPUs) are many-core architectures that are able to execute Single Program Multiple Data (SPMD) schemes. GPUs are able to launch thousands of threads, achieving significantly better processing times on demanding computing tasks in different application areas.

CUDA (Compute Unified Device Architecture) was introduced by NVIDIA in 2006, in order to open the usage of GPUs for general purpose computing. CUDA is a software-hardware architecture that enables GPUs to be programmed at a higher level while taking advantage of their implicit massive parallelism [16].

GPU-based architectures have evolved very rapidly in recent years. A GPU architecture consists of a number of streaming multiprocessors (SMs) that have access to global and local memory levels. Every SM contains a number of CUDA cores, which are smaller processing units that carry out the execution of parallel programs. The number of SMs varies according to the GPU model, and GPUs have improved over the years in terms of number of processing elements, memory sizes and access schemes, complexity of arithmetic logic units, and power consumption, among others.

Starting with the G80 (2006) and evolving into the Fermi architecture (2008), NVIDIA GPUs increased the number of processing cores from 128 to 240, defined larger register file sizes, improved memory access and supported double-precision floating point. More recent GPU architectures, known as Kepler and Maxwell, have improved by doubling the shared memory size and increasing the number of threads per SM, the number of available registers and the grid configuration size. These architectural improvements have allowed a remarkable gain in performance when targeting complex computational problems such as image processing tasks.

At a programming level, CUDA introduces some concepts that are necessary to launch a parallel process on the GPU. A kernel is a set of algorithmic steps to be executed on the GPU. It is necessary to configure the size of the grid and of the thread block; both are tightly related to the application being accelerated. Instructions need to be set to indicate which lines of code are executed on the GPU (device) and which on the CPU (host). Input data is normally transferred from CPU to GPU and vice versa; although necessary, data transfers between host and device take considerable time and need to be kept to a minimum (see Fig. 4).
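The relation between problem size, block size and grid size can be illustrated with a serial emulation of a one-dimensional launch. The Python sketch below is not CUDA code; `grid_size`, `launch` and the toy kernel are hypothetical helpers that only mimic how a global index is derived from a block index and a thread index.

```python
# Serial emulation of a 1-D kernel launch; all helper names are hypothetical.

def grid_size(n_items, threads_per_block):
    """Number of blocks needed so that every item gets one thread (ceiling division)."""
    return (n_items + threads_per_block - 1) // threads_per_block

def launch(n_items, threads_per_block, kernel):
    """Call kernel once per (block, thread) pair, as a GPU would do concurrently."""
    for block_idx in range(grid_size(n_items, threads_per_block)):
        for thread_idx in range(threads_per_block):
            i = block_idx * threads_per_block + thread_idx   # global thread index
            if i < n_items:       # guard: the last block may be only partly used
                kernel(i)

out = [0] * 10

def square(i):
    out[i] = i * i

launch(10, 4, square)
```

With 10 items and 4 threads per block, three blocks are needed and two threads of the last block stay idle, which is why the bounds check inside the kernel matters.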

Figure 4.
Structural diagram of CPU and GPU [16].

Figure 5.
Diagram for parallel computation of the integral image.

Figure 6.
Parallel computation of the integral image. In (a) the sum is calculated for rows, while in (b) for columns.

3. Proposed parallel computational model

This section presents the proposed parallel computational model targeting the training stage of the Viola-Jones algorithm, which comprises estimating the integral image, generating the haar-like features and evaluating the obtained features.

3.1 Parallel estimation of the integral image

The integral image can be calculated using a sequential double sum. First, row pixels are added, considering all previous elements in all rows. Afterwards, the corresponding pixels in each column are added, also considering all previous columns. The additions for rows and for columns can be calculated independently, and therefore a parallel approach can be implemented.

Figure 5 shows the two stages of parallel processing to calculate the integral image: first, $n$ threads are launched to process corresponding rows from an image. Addition for every pixel is carried out considering previous calculated ones. In a second step, $m$ threads are launched to process corresponding columns, repeating the rows process but now for columns. Resulting data is the integral image, i.e. an array where each point is the addition of all points within a rectangle formed by current point and position (0, 0) in the original image.

Figure 6 illustrates graphically the steps of the proposed parallel approach for calculating the integral image. Figure 6a shows the row-wise sum of every pixel with the previous ones. After that, a synchronization step waits until all processes have finished. Figure 6b shows the same process for each column. The sum within one row or one column is performed sequentially; however, it is carried out for several rows and columns in parallel.

It is necessary for every block of threads executing in parallel to synchronize. In the proposed model, the integral image is calculated in the following order: rows first and, once this stage is completed, columns. Internally, CUDA executes a synchronization stage in order to avoid race conditions when accessing memory.

CUDA defines a number of built-in identifiers in order to map input data, in this case the image data $GI$, to the corresponding processing threads and blocks. In Algorithm 3.1, the identifier threadIdx.x is used in Steps 1 and 7. For each section, $n$ or $m$ threads were used depending on the image size.

Algorithm: Pseudocode to compute the integral image using CUDA

Input: Vector of the gray scale image GI
Output: Vector of the integral image II
// First, image rows are processed.
1: $i\leftarrow threadIdx.x$
2: for each $Task\ i$ in parallel do
3:   for $k\leftarrow 2\textrm{ to }widthIma$ do
4:     $II[i\times widthIma+k]\leftarrow II[i\times widthIma+k]+GI[i\times widthIma+(k-1)]$
5:   end for
6: end for
// Then, the columns are processed.
7: $i\leftarrow threadIdx.x$
8: for each $Task\ i$ in parallel do
9:   for $k\leftarrow 2\textrm{ to }heightIma$ do
10:     $II[i+widthIma\times k]\leftarrow II[i+widthIma\times k]+GI[i+widthIma\times(k-1)]$
11:   end for
12: end for

where heightIma and widthIma are the height and width of the gray scale image GI, respectively.

Steps 2 and 8 are the parallel start of all $i$ threads. The total number of threads in each section is the height and the width of the image, respectively. In Step 3 the integral image is calculated for image rows, while in Step 9 it is calculated for image columns. A number of threads organized in thread blocks are launched for this calculation.

Step 4 scans across the image width and adds the previous element to each element. This process is done in parallel by row, i.e. there is one task for each row. Each sum is performed sequentially; therefore, $n$ processes are sent, one for each row. In the expression $i\times\textit{widthIma}+k$, the variable $i$ is an index to the start of each row of the image; the calculation for columns is analogous.

Step 10 corresponds to the calculation of the integral image for columns. In this case the index is $i+\textit{widthIma}\times k$, where $i$ now ranges over all columns in parallel, and each column starts a sequential sum. The index $k$ now runs from the top to the bottom of the image; thus $k$ is multiplied by widthIma to obtain the item immediately below in the same column within the image.

Finally, the sums of Steps 3 and 4 for the rows, and 9 and 10 for the corresponding columns, provide the integral image as described in the background section.
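The two-stage scheme can be emulated serially as a check of the indexing. The Python sketch below assumes that $II$ starts as a copy of $GI$ and that each step accumulates the running value of the previous element in $II$ (an assumption about the intended recurrence); indices are zero-based, so the inner loops start at 1 rather than 2. Each iteration of an outer loop stands for one GPU thread.

```python
# Serial emulation of the row pass followed by the column pass on a flat image vector.
# Assumption: II is initialised with GI and accumulates its own previous element.

def integral_image_two_pass(gi, width, height):
    ii = list(gi)
    # Stage 1: one task per row; each task scans its row from left to right.
    for i in range(height):
        for k in range(1, width):
            ii[i * width + k] += ii[i * width + (k - 1)]
    # A synchronization point is needed here: all rows must be finished.
    # Stage 2: one task per column; each task scans its column from top to bottom.
    for i in range(width):
        for k in range(1, height):
            ii[i + width * k] += ii[i + width * (k - 1)]
    return ii

gi = [1, 2, 3,
      4, 5, 6]            # 2 rows x 3 columns, stored row by row
ii = integral_image_two_pass(gi, 3, 2)
```

Within a stage the tasks touch disjoint rows (or columns), so they are independent; only the boundary between the two stages requires synchronization.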

3.2 Parallel generation of haar-like features

According to Viola and Jones [4], 160,000 features can be generated using images of 384 $\times$ 288 pixels, considering a detection window of 24 $\times$ 24 pixels. When larger images have to be processed, many more features will be created and much more processing will be needed.

The proposed method for creating the haar-like features is composed of two stages: the first one runs in sequential order, while the second one runs in parallel. The creation process starts by considering one of the four types of haar-like features. Then, a validation process is performed with the goal of finding features within the search window. When a feature is found, it is stored in a queue data structure; otherwise the feature is scaled in height and the process is repeated. Once the feature has reached the maximum height of the search window, the feature is scaled in width and the process is repeated. When the process has finished, another feature type is selected and the process is restarted, until all types of features have been considered. Figure 7 shows the parallel process to create all possible haar-like features within the window.

Algorithm: Pseudocode for creation of haar-like features using CUDA (sequential section)

Input: Integral image vector II
1: for $dx\leftarrow 1$ to $widthW$ do
2:   for $dy\leftarrow 1$ to $heightW$ do
     // Depending on the feature type, the variable $level$ is calculated in a different way. The brackets in the block assignment mean that the term may or may not apply.
3:     $blocksX=width-(dx\{\times k\}-1)$
4:     $blocksY=height-(dy\{\times l\}-1)$
     // The variable $level$ accumulates the total number of tasks that were created.
5:     $level\leftarrow level+blocksX\times blocksY$
     // The size of blocks is defined according to the feature type.
6:     $dim3\ NB(blocksX,blocksY)$
7:     parallelFeat(type, widthW, heightW, dx, dy, level)
8:   end for
9: end for

where heightW and widthW are the height and the width of the search window, respectively. $k$ and $l$ are defined according to the feature type.

Figure 7.

Parallel process to create haar-like features.

Algorithm 3.2 shows the pseudocode of the sequential section for feature creation. Steps 1 and 2 run two loops in which the scaling in width, $dx$, and in height, $dy$, is performed. Steps 3 and 4 define variables for the total threads and blocks needed for a specific feature size. Step 5 defines a variable that tracks the final index of the feature array. In Step 6, the dim3 type is used to specify the total set of threads and blocks in CUDA, depending on the feature size. Finally, in Step 7, the function parallelFeat is called to create all features of a specific size in parallel; this function corresponds to Algorithm 7.

Algorithm: Pseudocode for one type of haar-like feature creation using CUDA (parallel section)

Input: $k$, $l$, $dx$ and $dy$ are parameters according to the specific feature type to be created. $level$ is a reference to the feature array and $weight$ is a variable to obtain positive values.
Output: A vector with one type of features.
1: $i\leftarrow blockIdx.x\times(blockDim.x\times blockDim.y)+blockDim.x\times threadIdx.y+threadIdx.x$
   // $x$ and $y$ are references to locate a new feature
2: $y\leftarrow threadIdx.x$
3: $x\leftarrow threadIdx.y$
4: for each Task i in parallel do
5:   if $(x\{+dx\times k\}\leqslant widthW)\ and\ (y\{+dy\times l\}\leqslant heightW)$ then
     // The type of feature and its rotation are defined
6:     $carHaar[level+i].featureType\leftarrow k$
7:     $carHaar[level+i].rotated\leftarrow false$
     // The locations of the rectangles are calculated
8:     $carHaar[level+i].x1\leftarrow x$
9:     $carHaar[level+i].y1\leftarrow y$
10:     $carHaar[level+i].w1\leftarrow dx\{\times k\}$
11:     $carHaar[level+i].h1\leftarrow dy\{\times l\}$
12:     $carHaar[level+i].wt1\leftarrow weight$
13:     $carHaar[level+i].x2\leftarrow x\{+dx\times k\}$
14:     $carHaar[level+i].y2\leftarrow y\{+dy\times l\}$
15:     $carHaar[level+i].w2\leftarrow dx\{\times k\}$
16:     $carHaar[level+i].h2\leftarrow dy\{\times l\}$
17:     $carHaar[level+i].wt2\leftarrow weight2$
18:     $carHaar[level+i].x3\leftarrow\{x+dx\times k\}$
19:     $carHaar[level+i].y3\leftarrow\{y+dy\times l\}$
20:     $carHaar[level+i].w3\leftarrow\{dx\times k\}$
21:     $carHaar[level+i].h3\leftarrow\{dy\times l\}$
22:     $carHaar[level+i].wt3\leftarrow\{weight3\}$
23:   end if
24: end for

where heightW and widthW are the height and width of the search window, respectively.

Figure 8.

Feature selection process.

Figure 9.

Diagram for parallel feature evaluation.

Algorithm 7 describes the feature creation process for a specific type, width and height, using a window size of 24 $\times$ 24 pixels. The method varies according to the feature type to be created; thus, the brackets indicate terms that may or may not apply, or variables that are defined as zero. Step 1 declares an identifier for each task, while Steps 2 and 3 define the block dimensions used to scan the window. These variables change depending on the width and height used. The condition in Step 5 verifies whether a feature fits within the window. Step 6 defines the type of feature, while Step 7 defines whether or not the feature is rotated (features proposed by Lienhart and Maydt). Steps 8 to 12 define the main rectangle of the feature: its position, width, height and a weight used to obtain a positive value in the evaluation process. Steps 13 to 17 define the second rectangle, and finally Steps 18 to 22 define the third and fourth rectangles, which use only one reference.
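The enumeration performed for one feature type can be emulated serially. The Python sketch below generates horizontal two-rectangle features only; the record fields, the weights $-1$ and $+1$, and the function name are assumptions for illustration, and every $(x, y)$ pair of the nested loops stands for one GPU thread.

```python
# Serial emulation of feature creation for one type (horizontal two-rectangle feature).
# Record layout and weights are hypothetical; each (x, y) pair represents one thread.

def make_edge_features(width_w, height_w, dx, dy):
    """Enumerate all positions where two adjacent dx-by-dy rectangles fit in the window."""
    features = []
    for y in range(height_w):
        for x in range(width_w):
            # Validation step: keep the feature only if it lies inside the window.
            if x + 2 * dx <= width_w and y + dy <= height_w:
                features.append({
                    "rotated": False,
                    "rect1": (x, y, dx, dy), "wt1": -1,        # left rectangle
                    "rect2": (x + dx, y, dx, dy), "wt2": 1,    # right rectangle
                })
    return features

feats = make_edge_features(24, 24, 1, 1)

# Sequential section: scale in width (dx) and height (dy), one call per size.
all_sizes = sum(len(make_edge_features(24, 24, dx, dy))
                for dx in range(1, 13) for dy in range(1, 25))
```

For the smallest size the window holds 552 such features, and 43,200 over all sizes, which gives an idea of how many independent tasks one feature type produces.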

3.3 Parallel evaluation of haar-like features

Once all features are created, the next step is to evaluate them within the integral image. This process is performed during the training stage to determine if the selected features correctly classify the objects.

Figure 8 shows the process to select features. Once the features are obtained, $t$ processes are sent for one specific feature type. Each process evaluates one feature at a specific location in the image in order to determine whether it is a candidate to be a weak classifier. The features are evaluated according to a threshold and a parity $(1,-1)$, which depends on the feature type. Figure 9 shows the evaluation process for one feature type. There are dark and white areas; the evaluation sums the dark areas and subtracts them from the white areas. Because it is possible to obtain negative values, the parity is used to make the result a positive number and to decide whether a specific feature satisfies the threshold.

Algorithm 3.3 shows the pseudocode to accept a valid feature. In Step 1 an identifier of the task is defined, while in Step 2 all tasks start in parallel. Step 3 evaluates a specific feature of the array by calling the function defined in Step 10. Then, in Step 11, the variable ret is set to the weighted sum of intensities of the first two rectangles. Steps 12 and 13 add the sum of the third rectangle, and the result is returned in Step 15. In the main function, if the evaluation is less than a threshold, the information of the feature is stored in another array; otherwise zero values are assigned to all positions of the same array. Zeros are assigned because all thread blocks in CUDA execute the instructions concurrently, even those where no features were found. Thus, if there is a valid feature, its data is stored; otherwise zeros are assigned.

Pseudocode for haar-like features evaluation using CUDA

Input: $threshold$ permits to accept a feature, while $featHaar$ is a vector of features
Output: Vector of selected features, $featSel$
1: $i\leftarrow blockIdx.x\times(blockDim.x\times blockDim.y)+blockDim.x\times threadIdx.y+threadIdx.x$
2: for each $Task\ i$ do
3: $total\leftarrow\mathbf{evaluation}(featHaar[i],imgI)$
4: if $total<threshold$ then
5: $featSel[i]=featHaar[i]$
6: else
7: $featSel[i]=zeros()$
8: end if
9: end for
10: Function evaluation($cHr,imgI$)
11: $ret\leftarrow cHr.wt1\times(imgI[cHr.pt11]-imgI[cHr.pt12]-imgI[cHr.pt13]+imgI[cHr.pt14])+cHr.wt2\times(imgI[cHr.pt21]-imgI[cHr.pt22]-imgI[cHr.pt23]+imgI[cHr.pt24])$
12: if $cHr.wt3\neq 0$ then
13: $ret\leftarrow ret+cHr.wt3\times(imgI[cHr.pt31]-imgI[cHr.pt32]-imgI[cHr.pt33]+imgI[cHr.pt34])$
14: end if
15: return $ret$
16: End Function

where $imgI$ is the integral image, $cHr$ is a feature array and $cHr.ptxy$ are the coordinates of the corners of each rectangle, used to obtain the sum of intensities in the integral image.
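A serial Python sketch of this evaluation and selection step is given below. It is an illustration under assumptions, not the CUDA kernel: the `Feature` class, the helpers `flat_integral` and `corners`, and the corner ordering (bottom-right, top-right, bottom-left, top-left) are hypothetical choices; the loop in `select` stands in for the parallel tasks, and rejected entries are replaced by an all-zero feature as in the pseudocode.

```python
from dataclasses import dataclass
from typing import List, Tuple

Corners = Tuple[int, int, int, int]  # flat indices into the integral image


@dataclass
class Feature:
    weights: List[float]    # wt1, wt2, wt3 (wt3 == 0: no third rectangle)
    corners: List[Corners]  # pt?1..pt?4 for each of the three rectangles


ZERO = Feature([0, 0, 0], [(0, 0, 0, 0)] * 3)  # marks a rejected slot


def flat_integral(img):
    """Zero-padded integral image stored as a flat list, plus its row stride."""
    h, w = len(img), len(img[0])
    stride = w + 1
    ii = [0] * ((h + 1) * stride)
    for y in range(h):
        row = 0
        for x in range(w):
            row += img[y][x]
            ii[(y + 1) * stride + x + 1] = ii[y * stride + x + 1] + row
    return ii, stride


def corners(x, y, w, h, stride):
    """Flat indices of the four integral-image corners of a rectangle."""
    return ((y + h) * stride + x + w,  # bottom-right (added)
            y * stride + x + w,        # top-right (subtracted)
            (y + h) * stride + x,      # bottom-left (subtracted)
            y * stride + x)            # top-left (added)


def evaluation(f, ii):
    """Weighted sum of the rectangle intensities of one feature."""
    def term(k):
        a, b, c, d = f.corners[k]
        return f.weights[k] * (ii[a] - ii[b] - ii[c] + ii[d])

    ret = term(0) + term(1)
    if f.weights[2] != 0:  # the third rectangle is optional
        ret += term(2)
    return ret


def select(features, ii, threshold):
    """Serial stand-in for the parallel tasks: keep or zero each slot."""
    return [f if evaluation(f, ii) < threshold else ZERO for f in features]


img = [[10, 10, 1, 1] for _ in range(4)]
ii, stride = flat_integral(img)
left, right = corners(0, 0, 2, 4, stride), corners(2, 0, 2, 4, stride)
feat_a = Feature([1, -1, 0], [left, right, (0, 0, 0, 0)])
feat_b = Feature([-1, 1, 0], [left, right, (0, 0, 0, 0)])
selected = select([feat_a, feat_b], ii, threshold=0)
print(selected)
```

Writing a zero entry for rejected features keeps the output array the same length as the input, so every task writes to its own slot and no synchronization between tasks is needed.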

At the end of the evaluation process the error of each classifier is computed, and the classifier with the least error is chosen using Eq. (5). This equation comes from the AdaBoost Algorithm 2.1.3, where it is used to compute the error of each individual classifier (Line 6).

$\displaystyle\epsilon_{t}=\sum_{i}w_{i}|h_{j}(x_{i})-y_{i}|$ (5)

where $w$ are the weights of each feature, $h$ are the weak classifiers that evaluate a feature $x$ and classify it according to the classes $y\in\{1,-1\}$, $i$ indexes the features, and $t$ denotes the training level.
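As a small worked example of Eq. (5), the following Python sketch computes the weighted error of several candidate classifiers and picks the one with the least error. The function names and the toy labels, weights and predictions are hypothetical; only the formula itself comes from the text.

```python
def weighted_error(predictions, labels, weights):
    """Eq. (5): sum over i of w_i * |h(x_i) - y_i|."""
    return sum(w * abs(h - y) for h, y, w in zip(predictions, labels, weights))


def best_classifier(all_predictions, labels, weights):
    """Return (index, error) of the candidate with the least weighted error."""
    errors = [weighted_error(p, labels, weights) for p in all_predictions]
    best = min(range(len(errors)), key=errors.__getitem__)
    return best, errors[best]


# Toy data: four items with uniform weights and classes in {1, -1}.
labels = [1, 1, -1, -1]
weights = [0.25, 0.25, 0.25, 0.25]
candidates = [
    [1, 1, 1, -1],    # one mistake
    [1, 1, -1, -1],   # no mistakes
    [-1, -1, 1, 1],   # all wrong
]
errors = [weighted_error(p, labels, weights) for p in candidates]
best = best_classifier(candidates, labels, weights)
print(errors, best)
```

Note that with classes in $\{1,-1\}$ each misclassified item contributes $2w_{i}$ to the sum, since $|h-y|$ is either 0 or 2; this scaling does not change which classifier attains the minimum.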

4. Experimental results

The experiments were carried out on a personal computer with an AMD® Phenom II at 3.2 GHz, 12 GB of RAM, Linux Fedora 14 x86_64 and OpenCV library version 2.3. For CUDA programming, a GeForce 430 GT graphics card was used, which contains 96 CUDA cores and 1 GB of DDR3 memory.

Different image sizes were tested for the integral image computation. The experiments used five datasets of ten images each, with sizes of 256 $\times$ 256, 512 $\times$ 512, 1024 $\times$ 1024, 1800 $\times$ 1400 and 4000 $\times$ 3000 pixels.

Table 1 shows the processing times of the first parallel task, the integral image computation, and Fig. 10 shows a comparative graph. These results show that our approach obtained lower times than OpenCV in all cases except one: images of 256 $\times$ 256 pixels, where OpenCV was faster. As expected, the advantage of CUDA grows as the image size increases; however, only for the last two cases (images of 1800 $\times$ 1400 and 4000 $\times$ 3000 pixels) is the difference in time (ms) clearly larger than for the smaller images.

Table 1
Results for the integral image computation using OpenCV and our approach

Image resolution	Time (ms) OpenCV	Time (ms) our approach
256 $\times$ 256	0.5235	0.6534
512 $\times$ 512	2.1087	1.9437
1024 $\times$ 1024	7.9498	7.3694
1800 $\times$ 1400	19.0183	15.5521
4000 $\times$ 3000	90.8161	78.7494
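The timings in Table 1 can be turned into speedup factors and absolute savings with a few lines of Python. The sketch below only re-uses the numbers from the table; the variable names are hypothetical.

```python
# (image size, OpenCV time in ms, our approach time in ms), from Table 1
table1 = [
    ("256 x 256", 0.5235, 0.6534),
    ("512 x 512", 2.1087, 1.9437),
    ("1024 x 1024", 7.9498, 7.3694),
    ("1800 x 1400", 19.0183, 15.5521),
    ("4000 x 3000", 90.8161, 78.7494),
]

# Speedup > 1 means the CUDA version is faster than OpenCV.
speedups = {size: cpu / gpu for size, cpu, gpu in table1}
# Absolute time saved per image, in milliseconds.
saved_ms = {size: cpu - gpu for size, cpu, gpu in table1}

for size, _, _ in table1:
    print(f"{size:>12}: speedup {speedups[size]:.3f}, saved {saved_ms[size]:.4f} ms")
```

Computed this way, the ratio is below 1 only for the smallest images and lies between roughly 1.08 and 1.22 for the other sizes, while the absolute saving is largest for the 4000 $\times$ 3000 images.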

Table 2

Results for creating haar-like features using OpenCV and our approach

	OpenCV time (ms)	Our approach time (ms)
	79.737282	34.0566
	80.444962	32.7718
	81.332932	33.9721
	81.923775	21.9925
	81.396545	35.8005
	81.950691	35.5701
	81.152672	35.7644
	81.398209	23.8792
	81.650177	23.8342
	81.520958	38.1009
Average	81.2508	31.5742

Figure 10.

Comparative time-graph of CPU vs GPU for the integral image computation.

Table 3

Execution times in milliseconds for haar-like features evaluation using CUDA

Feature	$x_{2}$	$y_{2}$	$x_{3}$	$y_{3}$	$x_{2},y_{2}$
	10.0368	10.9202	10.6733	11.8821	7.6902
	10.0465	10.8810	10.7094	11.9012	7.6951
	10.0983	10.8721	10.7069	11.9309	7.6749
	10.0974	10.8987	10.6805	11.8571	7.6688
	10.0941	10.9067	10.6606	11.9332	7.6712
Average	10.0746	10.8957	10.6861	11.9009	7.6800
Total	51.2374

Figure 11.

Comparative graph of execution times of our approach vs CPU for haar-like features creation.

Table 2 shows the results for haar-like feature creation using OpenCV and our proposed approach. In all runs, the proposed CUDA solution obtained lower processing times than OpenCV; in fact, it reduces the time by more than half in every run (31.5742 ms versus 81.2508 ms on average). All possible features were created using a window of 24 $\times$ 24 pixels.
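The averages and the resulting speedup can be recomputed directly from the ten runs listed in Table 2. The following Python sketch does only that arithmetic; the variable names are hypothetical.

```python
from statistics import mean

# Ten runs from Table 2, in milliseconds.
opencv_ms = [79.737282, 80.444962, 81.332932, 81.923775, 81.396545,
             81.950691, 81.152672, 81.398209, 81.650177, 81.520958]
ours_ms = [34.0566, 32.7718, 33.9721, 21.9925, 35.8005,
           35.5701, 35.7644, 23.8792, 23.8342, 38.1009]

avg_opencv = mean(opencv_ms)
avg_ours = mean(ours_ms)
avg_speedup = avg_opencv / avg_ours

# Per-run ratio, to check that the gain holds for every run, not only on average.
per_run = [cpu / gpu for cpu, gpu in zip(opencv_ms, ours_ms)]

print(f"average OpenCV: {avg_opencv:.4f} ms")
print(f"average ours:   {avg_ours:.4f} ms")
print(f"average speedup: {avg_speedup:.2f}, worst run: {min(per_run):.2f}")
```

The per-run ratios also make the variability of the CUDA timings visible: the OpenCV times stay within a narrow band, whereas the CUDA times spread over a noticeably wider range.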

Figure 11 shows the execution times for the creation of features. We can observe that some executions took about 10 milliseconds less than most of the others (around 22 to 24 ms versus 33 to 38 ms in Table 2). This may be due to the CUDA run-time system, which dispatches the processes to the cores while trying to take advantage of the maximum number of cores.

The last proposed parallel task is the evaluation of the haar-like features. This evaluation was performed using images of 50 $\times$ 50 pixels, considering one feature of each of the five types created. Therefore, first the features of 3 rectangles were evaluated, then those using 4 rectangles, and finally those using 5 rectangles. Table 3 shows the evaluation times of the five features: two areas ($x_{2}$ and $y_{2}$), three areas ($x_{3}$ and $y_{3}$) and four areas ($x_{2},y_{2}$). In this case there is no comparison with OpenCV, because it was not possible to locate the equivalent evaluation process in the library code; therefore only the CUDA results are shown. The total evaluation time is 51.2374 milliseconds, which is less than 20 milliseconds above the average creation time.

5. Conclusions

In this work we have presented a parallel approach for three tasks of the Viola-Jones face detection algorithm, particularly for the training stage. The three tasks are the integral image computation, the creation of haar-like features, and the evaluation of these features. Thus, we have proposed three parallel algorithms for the face detection problem in computer vision, which demands high computing resources for real-time processing. The CUDA programming model was used to accelerate these tasks. Our experimental results have shown that the proposed methods outperform the OpenCV implementations. For the integral image computation our approach obtained lower processing times in almost all cases, across different image sizes, and for feature creation the proposed method obtained lower times in all experiments. In the future, we plan to implement the proposed methods on new GPU architectures such as NVIDIA's next generation Kepler. In addition, we plan to develop parallel methods for other object detection algorithms that involve large computational loads.

References

1. Oren, Papageorgiou, Sinha, Osuna and Poggio, Pedestrian detection using wavelet templates, in: Proceedings of IEEE Computer Society Conference on Computer Vision and Pattern Recognition, 1997, pp. 193–199.
2. Papageorgiou and Poggio, A Trainable System for Object Detection, International Journal of Computer Vision 30(1) (2000), 15–33.
3. García, Delgado and Castañeda, Metodologías de Paralelización en la Supercomputadora CICESE2000, Technical Report, Departamento de Cómputo Dirección de Telemática Centro de Investigación Científica y de Educación Superior de Ensenada, 2000.
4. Viola and Jones, Rapid object detection using a boosted cascade of simple features, in: Proceedings of the 2001 IEEE Computer Society Conference on Computer Vision and Pattern Recognition CVPR, 2001, pp. 511–518.
5. Lienhart and Maydt, An extended set of Haar-like features for rapid object detection, in: IEEE International Conference on Image Processing, Vol. 1, 2002, pp. 900–903.
6. Wei, Bing and Chareonsak, FPGA implementation of AdaBoost algorithm for detection of face biometrics, in: IEEE International Workshop on Biomedical Circuits and Systems, 2004, pp. S1/6-17-20.
7. Fung and Mann, Using graphics devices in reverse: GPU-based image processing and computer vision, in: 2008 IEEE International Conference on Multimedia and Expo, 2008, pp. 9–12.
8. Bradski and Kaehler, Learning OpenCV. Nutshell Handbook, O'Reilly Media Inc., 2008.
9. Gao and Lu S.L., Novel FPGA based haar classifier face detection algorithm acceleration, in: International Conference on Field Programmable Logic and Applications, 2008, pp. 373–378.
10. García Chang M.E., Diseño e implementación de una herramienta de detección facial, Master Thesis, Instituto Politécnico Nacional. Centro de Innovación y Desarrollo Tecnológico en Cómputo, 2009.
11. Cho, Mirzaei, Oberg and Kastner, FPGA-based face detection system using haar classifiers, in: Proceeding of the ACM/SIGDA International Symposium on Field Programmable Gate Arrays, ACM, New York, NY, USA, 2009, pp. 103–112.
12. Trujillo, Algoritmos paralelos para la solución de problemas de optimización discretos aplicados a la decodificación de señales, Ph.D. dissertation, Departamento de Sistemas Informáticos y Computación. Universidad Politécnica de Valencia, Valencia, España, 2009.
13. Junguk, Benson, Mirzaei and Kastner, Parallelized Architecture of Multiple Classifiers for Face Detection, in: 20th IEEE International Conference on Application-specific Systems, Architectures and Processors, 2009, pp. 75–82.
14. Hiromoto, Sugano and Miyamoto, Partially parallel architecture for AdaBoost-based detection with haar-like features, IEEE Trans. Circ. Syst. Video Technol 19(1) (2009), 41–52.
15. Bilgic, Berthold K.P. Horn and Ichiro Masaki, Efficient Integral Image Computation on the GPU, in: 2010 IEEE Intelligent Vehicles Symposium, 2010, pp. 528–533.
16. Kirk and Hwu, Programming Massively Parallel Processors. A Hands-on Approach, Morgan Kaufmann Publishers, 2010.
17. Hefenbrock, Oberg, Thanh, Kastner and Baden S.B., Accelerating Viola-Jones Face Detection to FPGA-Level Using GPUs, in: 18th IEEE Annual International Symposium on Field-Programmable Custom Computing Machines, 2010, pp. 11–18.
18. Oro, Fernández, Rodríguez, Martorell and Hernando, Real-time GPU-based Face Detection in HD Video Sequences, in: 2011 IEEE International Conference on Computer Vision Workshops, 2011, pp. 530–537.
19. Kyrkou and Theocharides, A flexible parallel hardware architecture for AdaBoost-based real-time object detection, IEEE Trans. Very Large Scale Integr. Syst 19(6) (2011), 1034–1047.
20. Ding, Zhao, Wang, Shu and Wu M.-Y., Hecto-Scale Frame Rate Face Detection System for SVGA Source on FPGA Board, in: IEEE 19th Annual International Symposium on Field-Programmable Custom Computing Machines, 2011, pp. 37–40.
21. Chang C.J. and Hwang S.L., LSO-AdaBoost Based Face Detection for IP-CAM Video, Applied Mechanics and Materials 284–287 (2013), 3543–3548.
22. Pertsau and Uvarov, Face detection algorithm using haar-like feature for GPU architecture, in: 7th International Conference on Intelligent Data Acquisition and Advanced Computing Systems (IDAACS), 2013, pp. 726–730.
23. Barney, Introduction to Parallel Computing. Lawrence Livermore National Laboratory: https://computing.llnl.gov/tutorials/parallel_comp/, 2014.
24. Bilaniuk, Fazl-Ersi, Laganiere, Laroche and Moulder, Fast LBP Face Detection on Low-Power SIMD Architectures, in: IEEE Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), 2014, pp. 630–636.
25. Kumar and Agarwal, A novel architecture for dynamic integral image generation for Haar-based face detection on FPGA, in: IEEE Conference TENCON, 2014, pp. 1–6.