An eye detection method based on convolutional neural networks and support vector machines

Abstract

Eye detection plays an important role in many fields because the eyes provide prominent facial feature information. However, changes in face pose, illumination variation, glasses, and eye occlusions can make it difficult to detect eyes reliably in facial images. This paper proposes a hybrid model for eye detection that integrates two classifiers: a Convolutional Neural Network (CNN) and a Support Vector Machine (SVM). To improve detection speed, an eye variance filter (EVF) is constructed to eliminate most noneye images and leave fewer candidate eye images. The CNN then works as a trainable feature extractor that extracts various latent eye features. Finally, a trained SVM classifier is employed for eye verification in place of the CNN classification function. Experiments applying the model have been conducted on the BioID, IMM, FERET and ORL face databases. Comparisons with other methods on the same databases indicate that the hybrid model achieves higher detection accuracy. Further experiments on facial images with varying eye conditions demonstrate the robustness and efficiency of the method.

Keywords

Eye variance filter, convolutional neural networks, support vector machines, eye detection

1. Introduction

Eyes are more prominent features of the human face than the nose or mouth. Eye detection is a sub-field of object detection in image processing. Eye detection methods have been widely applied in many fields, such as drowsiness detection for intelligent vehicle systems [1, 2, 3], eye gaze tracking devices [4, 5, 6], human-robot interaction [7, 8, 9], and automatic face detection and recognition systems [10]. However, eye detection is a difficult task, since the structural individualities of eyes vary greatly across different populations. The structural individualities include eye size, iris color, and eyelid boldness and width. Additional factors such as glasses and their glints, changes in illumination, face pose, and occlusions introduce noise into eye detection methods, which can lead to false detections.

A good eye detection method needs to be robust to glasses, illumination changes, rotation of the frontal face, and different eye occlusions (e.g., semi-closed eyes, closed eyes, and squinting). In this paper, a coarse-to-fine classifier is designed to locate eye regions. First, an eye variance filter (EVF) is trained as a coarse classifier that roughly finds eye positions and quickly filters out most noneye regions, thereby improving the detection speed of the system. Then, we use the CNN architecture as a feature extractor to learn and extract features automatically. Lastly, the final classification of eye images is completed using the SVM classifier.

To verify the feasibility of our methodology, we compare its eye localization results with our previous work and with other researchers' work on four face databases: Biometric Identity (BioID), Informatics and Mathematical Modeling (IMM), Face Recognition Technology (FERET), and the AT&T database of faces (ORL). The proposed method achieved better eye detection results on images that include eyes with glasses, various illumination scenarios, face pose changes, and eye occlusions.

The remaining parts of this paper are organized as follows. Some related works to eye detection are given in Section 2. The principles of EVF, CNN, and SVM models are described in Section 3. Section 4 presents the process of eye detection. The experimental results and performance analysis are presented in Section 5. The merits of our proposed method are given in Section 6. Finally, Section 7 provides the conclusions.

2. Related works

By looking at the principles of related methods, eye detection techniques can be divided into four categories: shape-based, feature-based, appearance-based, and hybrid-based methods [11].

The shape-based method depends on a geometric eye model and a similarity measure to decide whether a face image contains eye images. The models consist of simple elliptical shapes and complex shapes. Simple elliptical shape models rely on the viewing angle, so that the elliptical iris and pupil regions can be modeled using shape parameters. Complex shape models allow for more detailed modeling of the eye shape. Perez et al. [12] first used thresholds of image intensities to estimate the pupil ellipse center. The limbus and pupil boundaries were then extracted using an edge detection technique. Kawaguchi et al. [13] adopted a separability filter to extract the feature, and then the Hough transform was used for model fitting. Increasing the number of parameters used to construct the model makes the shape-based model more precise, but it also increases the computational demand. This limitation makes handling face pose changes and eye occlusions difficult for this method.

The feature-based method focuses on distinctive features of the eye image, e.g., eyebrow, pupil, and iris, to locate the eye regions. Sirohey and Rosenfeld [14] used linear and non-linear filters constructed from Gabor wavelets to detect the iris and corner features of the eyes. These features were then further filtered to remove false features. A voting mechanism was finally applied to decide the accurate location of the iris. Feng and Yuen [15] focused on three cues for locating eyes on gray-scale face images. Each cue indicated the candidate eye regions. The precise eye location was determined and validated by a variance projection function. Based on this method, Zhou and Geng [16] proposed a hybrid projection function constructed from the integral projection and variance projection functions. Eye detection performance is decided by the optimal parameters of the hybrid projection function. In Ando et al. [17], a seven-segment rectangular filter is employed to find the between-the-eyes feature of the human face for eye region location. Feature-based methods are generally robust to illumination and pose changes, but they require high-quality images [11].

The appearance-based method relies on models constructed directly from the photometric appearance of the eyes. The method needs no specific a priori information about eye images, but it does need a sufficient amount of training data to learn the parameters of an eye model. The feature extraction process is included in the method, which helps to eliminate noise and reduce dimensionality. Vijayalaxmi and Rao [18] used a Gabor filter as a feature extractor and an SVM as a classifier. The eye images were processed by the Gabor filter, and the output was fed into the SVM to train the classifier. Ryu and Oh [19] developed an algorithm that used eigenvectors of the binary edge data set from eye fields and then used neural networks for eye region detection. Wu and Trivedi [20] used a binary tree to model the statistical structure of the human eye for locating accurate eye regions. With a clustering method based on the pairwise mutual information between features, the dependent features were separated into different subsets. The process of eye detection was repeated until the subset features had enough mutual information. The final detection was achieved through the Bayesian criterion. In Fu et al. [21], the orthogonal wavelet analysis method was used as a feature extractor, and the output was given to the SVM for locating eye regions. Because the appearance-based method uses machine learning algorithms to learn the model from training data, it can be applied to many different kinds of eye images.

Compared to the methods mentioned above, the hybrid-based method is more popular for eye region detection. The method combines two or more methods into one system to locate eye regions. Peng et al. [22] and Xia et al. [23] first adopted the feature-based method to locate probable human-eye areas, and accurate eye regions were then found with a modified template-matching method. In Hassaballah et al. [24], the gray intensity variance was used for locating candidate eye regions. The precise eye regions were then detected with an independent components analysis (ICA) method. Kalbkhani et al. [25] used a non-linear RGB to YCbCr color conversion. An eye mapping algorithm was then applied to locate the eye regions on the created face mask. Phromsuthirak and Umchid [26] also adopted the eye mapping algorithm to first locate possible eye regions. The correct eye regions were then determined using the geometrical test method. Our previous research on eye detection also adopted the hybrid-based method [36]. First, the eye variance filter was taken as a coarse classifier to filter out most noneye regions. Principal component analysis (PCA) was then used as a feature extractor. Lastly, the accurate eye regions were determined by a trained SVM classifier. The hybrid-based method acts like a cascaded classifier, which uses a rough initial classification to then achieve accurate eye localization. Consequently, the hybrid-based method aims at combining the advantages of different eye models into one system to overcome their respective shortcomings. Based on this principle, in this paper we propose a new hybrid eye detection method using the CNN and SVM.

3. The proposed method

The proposed training and eye detection processes are shown in Fig. 1. In the training stage, three models (EVF, CNN and SVM) are trained. The training samples used for the EVF are scaled to a resolution of 48 $\times$ 24 pixels, and the samples for the CNN are scaled to 32 $\times$ 32 pixels.

Figure 1.

(a) Training of eye detection framework and (b) Testing of eye detection framework.

In the testing stage, the boosted cascade face detector [27] is applied for initial face region location. Incorrectly detected face regions are then manually corrected. The detected face image is normalized to an image of size 200 $\times$ 200 pixels. To improve calculation efficiency, the eye search region is limited to the top half of the detected face region, as human eyes tend to lie in this region. The input patch size (detection window) is first set to 48 $\times$ 24 pixels, and the patch is passed to the trained EVF. If the EVF judges it to be an eye image, the patch passes to the next stage for further eye classification. If the EVF judges it to be a noneye image, it is filtered out. The candidate eye images are then rescaled in the same way as the training samples. Then, the trained CNN extracts features from the candidate eye images. The features pass through the SVM to obtain the final eye classification results.

3.1 Eye Variance Filter (EVF)

Firstly, the eye variance image is introduced into the system. Based on the fact that the change of gray intensity in the eye region is more obvious than in other regions of the face, the second-order moment (or variance on a domain) is used as an indicator of variation in gray intensity. Hence, the variance of the eye image $I(x,y)$ on the domain $\Omega$ is defined as:

$\displaystyle\sigma_{\Omega}=\frac{1}{A_{\Omega}}\sum\limits_{\left({x,y}\right)\in\Omega}{\left[{I\left({x,y}\right)-\overline{I_{\Omega}}}\right]}^{2}$ (1)

where $A_{\Omega}$ and $\overline{I_{\Omega}}$ are the area and average gray intensity on the domain $\Omega$, respectively. The variance is a non-negative value and has two properties [15]: $\sigma_{\Omega}$ is rotation invariant on the domain $\Omega$, and $\sigma_{\Omega}$ reflects gray intensity variations rather than the exact shape on the domain.

The 48 $\times$ 24 pixel eye image is divided into a 16 $\times$ 8 array of non-overlapping 3 $\times$ 3 pixel sub-blocks. For an image $I(x,y)$, the variance image is defined as:

$\displaystyle V_{\sigma}\left({i,j}\right)=\sigma_{\Omega_{ij}},\quad\Omega_{ij}=\left\{{\left({i-1}\right)l+1\leqslant x\leqslant il,\ \left({j-1}\right)l+1\leqslant y\leqslant jl}\right\}$ (2)

where $l$ and $\Omega_{ij}$ are the side length (width $=$ height) and the domain of each sub-block, respectively.

The variance on each sub-block is calculated by Eq. (1). Each sub-block has different gray intensity features. An example of an eye image from an eye database and its variance image is shown in Fig. 2.

Figure 2.

(a) An eye image and (b) its variance image.
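
As an illustration of Eqs (1) and (2), the following minimal sketch computes a variance image from a gray-level patch stored as a list of rows. The function names and the toy patch are assumptions made for this example and are not taken from the original implementation; plain Python lists are used only to keep the sketch self-contained.

```python
def block_variance(image, x0, y0, size):
    """Population variance of the gray levels in one size x size block (Eq. 1)."""
    pixels = [image[y][x]
              for y in range(y0, y0 + size)
              for x in range(x0, x0 + size)]
    mean = sum(pixels) / len(pixels)
    return sum((p - mean) ** 2 for p in pixels) / len(pixels)


def variance_image(image, size=3):
    """Variance image of Eq. (2): one value per non-overlapping block."""
    height, width = len(image), len(image[0])
    return [[block_variance(image, bx * size, by * size, size)
             for bx in range(width // size)]
            for by in range(height // size)]


# Toy patch (assumed data): 24 rows x 48 columns, flat except for one pixel.
patch = [[10] * 48 for _ in range(24)]
patch[0][0] = 19

v_sigma = variance_image(patch)
print(len(v_sigma), len(v_sigma[0]), v_sigma[0][0])  # 8 16 8.0
```

The result has 8 rows and 16 columns, matching the 16 $\times$ 8 array of sub-blocks described above; only the block containing the altered pixel has a non-zero variance.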

Then to construct an EVF, 30 eye images with or without glasses are extracted from our eye database. The EVF on the $({i,j})$ sub-block (called $F_{e}({i,j})$ ) is constructed by calculating the variance image average across all 30 eye images. $F_{e}({i,j})$ is defined as:

$\displaystyle F_{e}\left({i,j}\right)=\frac{1}{N}\sum\limits_{k=1}^{N}{\left[{V_{\sigma}\left({i,j}\right)}\right]_{k}}$ (3)

where $[{V_{\sigma}({i,j})}]_{k}$ is the variance image $V_{\sigma}({i,j})$ of the $({i,j})$ sub-block on the $k$th eye image and $N$ is the number of eye images ($N=$ 30). The constructed EVF is shown in Fig. 3.

Figure 3.

(a) 30 eye images and (b) the corresponding EVF.
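
Eq. (3) is an element-wise average of the variance images. The sketch below shows that averaging on two tiny 2 $\times$ 2 grids that stand in for the 30 real variance images; the function name `build_evf` and the toy values are assumptions for illustration only.

```python
def build_evf(variance_images):
    """Average a list of equally sized variance images element-wise (Eq. 3)."""
    n = len(variance_images)
    rows = len(variance_images[0])
    cols = len(variance_images[0][0])
    return [[sum(v[r][c] for v in variance_images) / n
             for c in range(cols)]
            for r in range(rows)]


# Two toy "variance images" (assumed data) in place of the 30 real ones.
toy_variance_images = [
    [[2.0, 4.0], [6.0, 8.0]],
    [[4.0, 0.0], [2.0, 8.0]],
]

evf = build_evf(toy_variance_images)
print(evf)  # [[3.0, 2.0], [4.0, 8.0]]
```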

The EVF is used to detect the most probable eye regions. The correlation is calculated between the EVF and eye/noneye variance images on the face. The correlation is defined as:

$\displaystyle R\left({V_{\sigma},F_{e}}\right)=\frac{E\left[{\left({\xi_{V_{\sigma i}}-E\left({\xi_{V_{\sigma i}}}\right)}\right)\left({\xi_{F_{e}}-E\left({\xi_{F_{e}}}\right)}\right)}\right]}{\sqrt{D\left({\xi_{V_{\sigma i}}}\right)D\left({\xi_{F_{e}}}\right)}}$ (4)

where $\xi_{V_{\sigma i}}$ and $\xi_{F_{e}}$ are the concatenated vectors of the variance image of the possible eye regions and $F_{e}$ , respectively. The $E(\bullet)$ and $D(\bullet)$ represent the mathematical expectation and variance of the random variable, respectively.

To determine the EVF correlation threshold, another 60 images (30 noneye images and 30 eye images) are extracted from our eye database. The variance images of the noneye images are constructed in the same manner as the eye variance image. Figure 4 shows that the eye region images have correlation values greater than 0.32, while the noneye region images have correlation values less than 0.32. Therefore, 0.32 is taken as the EVF threshold. The role of the EVF in the eye detection process is thus to filter out most of the noneye images and keep the more probable eye regions, i.e., the candidate eyes.

Figure 4.

Correlation between eye/noneye images and EVF.
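
The correlation test of Eq. (4) with the 0.32 threshold can be sketched as follows. The helper names and the toy grids are assumptions; treating a completely flat patch (zero variance) as uncorrelated is also an assumption of this sketch, since the text does not discuss that case.

```python
import math

EVF_THRESHOLD = 0.32  # threshold read from Fig. 4 in the text


def flatten(grid):
    """Concatenate the rows of a 2-D grid into one vector."""
    return [value for row in grid for value in row]


def correlation(a, b):
    """Correlation coefficient of two equal-length vectors (Eq. 4)."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b)) / n
    var_a = sum((x - mean_a) ** 2 for x in a) / n
    var_b = sum((y - mean_b) ** 2 for y in b) / n
    if var_a == 0 or var_b == 0:
        return 0.0  # assumption: a flat patch counts as uncorrelated
    return cov / math.sqrt(var_a * var_b)


def is_candidate_eye(variance_img, evf, threshold=EVF_THRESHOLD):
    """Keep a window only if its variance image correlates with the EVF."""
    return correlation(flatten(variance_img), flatten(evf)) > threshold


# Toy grids (assumed data).
toy_evf = [[1.0, 2.0], [3.0, 4.0]]
similar = [[2.0, 4.0], [6.0, 8.0]]    # same pattern, different scale
opposite = [[4.0, 3.0], [2.0, 1.0]]   # reversed pattern
```

Because the correlation coefficient is invariant to scaling and offset of the gray-level variances, `similar` passes the filter while `opposite` is rejected.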

3.2 CNN feature extractor

A Convolutional Neural Network is a multi-layered neural network with a deep supervised learning architecture [28]. The CNN architecture consists of two parts: an automatic feature extractor and a trainable classifier. In our research, the trainable classifier is replaced by the SVM classifier and the CNN is used as the feature extractor. Figure 5 shows the architecture of the CNN with an SVM classifier.

Figure 5.

The architecture of CNN with SVM. Conv: Convolution, Subs: Subsampling.

As shown in Fig. 5, the CNN consists of 7 layers, including an input layer, convolution layers, subsampling layers, a fully connected layer, and an output layer. Layers C1 through C5 are used for eye image feature extraction. In this process, a series of successive convolution and subsampling operations are performed. Each layer is composed of multiple two-dimensional planes called feature maps, and each feature map contains multiple independent neurons. Each neuron on a feature map receives inputs from a small neighborhood (identified as the “receptive field” in [29]) in the previous layer. All the neurons in one feature map share the same kernel and connecting weights [29]. The neurons can extract elementary visual features such as oriented edges, end-points, or corners from receptive fields. Extracted features are then combined by the subsequent layers in order to obtain high-level features.

Convolution and subsampling are very important steps in the CNN feature extractor. In the convolution process, each feature map unit is computed in two steps. The input $x$ is first convolved with a trainable convolutional kernel filter $K$ of size 5 $\times$ 5, and then a trainable bias $b_{x}$ is added. The results then pass through a Rectified Linear Unit (ReLU) transformation to obtain the feature maps $C_{x}$ of the convolution layer. There are two reasons for using the ReLU as the activation function. First, the ReLU does not face the vanishing gradient problem experienced with the sigmoid and tanh functions [30]. Second, a CNN with ReLU trains several times faster than a traditional CNN with sigmoid or tanh functions for large models trained on large datasets [31, 32]. A feature map $C_{x}$ can be defined as follows:

$\displaystyle C_{x}=\max\left({0,K*x+b_{x}}\right)$ (5)
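
A minimal sketch of Eq. (5) is given below for a single input map and a single kernel. As in most CNN code, the kernel is applied without flipping (cross-correlation); whether the original implementation flips the kernel is not stated, so this is an assumption, as are the function name and the toy data.

```python
def conv2d_relu(image, kernel, bias=0.0):
    """Stride-1 'valid' filtering of one map, plus bias, followed by ReLU (Eq. 5).

    Assumption: the kernel is not flipped (cross-correlation convention).
    """
    k = len(kernel)
    out_h = len(image) - k + 1
    out_w = len(image[0]) - k + 1
    output = []
    for r in range(out_h):
        row = []
        for c in range(out_w):
            total = sum(kernel[i][j] * image[r + i][c + j]
                        for i in range(k) for j in range(k))
            row.append(max(0.0, total + bias))  # ReLU
        output.append(row)
    return output


# Toy data (assumed): a constant 32 x 32 input and a 5 x 5 averaging kernel.
image = [[1.0] * 32 for _ in range(32)]
kernel = [[0.04] * 5 for _ in range(5)]

positive = conv2d_relu(image, kernel, bias=0.0)   # responses close to 1.0
clipped = conv2d_relu(image, kernel, bias=-2.0)   # negative sums become 0.0
print(len(positive), len(positive[0]))  # 28 28
```

A 32 $\times$ 32 input filtered with a 5 $\times$ 5 kernel at step 1 yields a 28 $\times$ 28 map, and any unit whose pre-activation is negative is set to zero by the ReLU.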

In the subsampling (pooling) process, each feature map unit is also computed in two steps. Each neuron in the subsampling layer first computes the average over a 2 $\times$ 2 spatial neighborhood in the previous convolution layer, multiplies it by a trainable coefficient $w_{x+1}$, and then adds a trainable bias $b_{x+1}$. The result passes through the ReLU. Each subsampling layer reduces the feature map size from $M\times N$ to $\lceil M/2\rceil\times\lceil N/2\rceil$. A feature map $S_{x+1}$ can be expressed as follows:

$\displaystyle S_{x+1}=\max\left({0,\sum{C_{x}}\times w_{x+1}+b_{x+1}}\right)$ (6)
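
The subsampling step can be sketched as below. The text describes an average over each 2 $\times$ 2 neighborhood while Eq. (6) is written with a sum; the sketch follows the text, and the two differ only by a constant factor of 4 that the trainable coefficient can absorb. The function name is hypothetical, and even map dimensions are assumed (as is the case for the 28 $\times$ 28 and 10 $\times$ 10 maps of this network).

```python
def subsample_relu(fmap, weight=1.0, bias=0.0):
    """2 x 2 average pooling, scaled by a weight, plus bias, then ReLU (Eq. 6).

    Assumption: the height and width of fmap are even.
    """
    out_h, out_w = len(fmap) // 2, len(fmap[0]) // 2
    output = []
    for r in range(out_h):
        row = []
        for c in range(out_w):
            block_sum = (fmap[2 * r][2 * c] + fmap[2 * r][2 * c + 1]
                         + fmap[2 * r + 1][2 * c] + fmap[2 * r + 1][2 * c + 1])
            row.append(max(0.0, weight * block_sum / 4.0 + bias))
        output.append(row)
    return output


# Toy 4 x 4 feature map (assumed data).
toy_map = [
    [1, 3, 2, 2],
    [5, 7, 2, 2],
    [0, 0, -4, -4],
    [0, 0, -4, -4],
]

pooled = subsample_relu(toy_map)
print(pooled)  # [[4.0, 2.0], [0.0, 0.0]]
```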

As Fig. 5 shows, the input layer is an eye image of size 32 $\times$ 32 pixels. The eye image data matrix is normalized, centered, and fed to the hidden layers. C1 is a convolution layer with 6 feature maps. Each neuron in a feature map is connected to a 5 $\times$ 5 neighborhood in the input layer. The feature map size is 28 $\times$ 28 because a convolution step of 1 is used. The subsampling layer S2 consists of 6 feature maps. Each feature map in S2 corresponds to one feature map in C1. The receptive field is a 2 $\times$ 2 area. According to the above subsampling process, the feature map size of S2 is 14 $\times$ 14. Layer C3 is also a convolution layer, which contains 16 feature maps. Each feature map in Layer C3 is connected to six 5 $\times$ 5 neighborhoods, one for each feature map in Layer S2. The feature map size is 10 $\times$ 10. Similar to Layer S2, Layer S4 is a subsampling layer, which has a 2 $\times$ 2 receptive field and 16 feature maps of size 5 $\times$ 5. Layer C5 is a convolutional layer (also called the fully connected layer) with 128 feature maps. Because the size of the receptive field is the same as the size of the convolutional kernel filter, the Layer C5 feature map size is 1 $\times$ 1. The output layer contains two neurons and is fully connected to Layer C5. Each output is calculated by an activation function. The input of the activation function is the feature maps multiplied by trainable weights, plus a bias term. The output layer is used to classify the input image as eye or noneye.

In our research, the output layer of the CNN model is replaced by an SVM classifier. Layers C1 through C5 act as a feature extractor. The output values of Layer C5 are treated as features for the SVM classifier. The original CNN with the output layer is first trained for several epochs until the training process converges. The SVM classifier is then trained using the output features of Layer C5 as feature vectors. Once the SVM classifier has been trained, it performs the recognition task and makes decisions on testing eye and noneye images.
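
The feature map sizes quoted above follow from simple arithmetic, which the short check below reproduces. The helper names are assumptions; only the 5 $\times$ 5 kernel, the step of 1, and the 2 $\times$ 2 subsampling window are taken from the text.

```python
def conv_size(size, kernel=5, stride=1):
    """Side length after a 'valid' convolution."""
    return (size - kernel) // stride + 1


def pool_size(size, window=2):
    """Side length after subsampling, rounded up as in the text."""
    return -(-size // window)  # ceiling division


size = 32
sizes = {"input": size}
for name in ("C1", "S2", "C3", "S4", "C5"):
    size = conv_size(size) if name.startswith("C") else pool_size(size)
    sizes[name] = size

print(sizes)
# {'input': 32, 'C1': 28, 'S2': 14, 'C3': 10, 'S4': 5, 'C5': 1}
```

With 128 feature maps of size 1 $\times$ 1 in Layer C5, each candidate image is therefore represented by a 128-dimensional feature vector.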

3.3 SVM classifier

The theory of support vector machines (SVM) was proposed by Vapnik in 1995 [33]. The SVM is a learning method based on statistical learning theory that has been successfully applied in many fields for detection and identification purposes [37, 38]. The main idea of the SVM is to seek an optimal hyperplane as the decision surface that separates the classes while maximizing the separation margin and minimizing the classification error. This method is mainly used for solving two-class problems.

In this paper, the classifier is used to distinguish between eye images and noneye images, which is a binary classification problem. Suppose that there exists a hyperplane that divides the sample space into two categories: a positive set (eye images) and a negative set (noneye images).

Suppose a training set $({x_{i},y_{i}})$ , $i=1,2,\ldots,n$ , $x_{i}\in R^{d}$ , where $x_{i}$ is the sample of the training sets, and $y_{i}=\pm$ 1 is the class label. The optimal separating hyperplane $H$ in the feature space can be defined as follows:

$\displaystyle H:{\bm{w}}^{T}{\bm{x}}+b=0$ (7)

where ${\bm{w}}$ is a d-dimensional vector and $b$ is a real number.

The separation margin between the classes is $m=2/\|{\bm{w}}\|$. To maximize $m$, $\|{\bm{w}}\|$ should be minimized. This optimization problem can be written as:

$\displaystyle\min\frac{1}{2}\left({\left\|{\bm{w}}\right\|^{2}}\right)$ (8)

subject to: $y_{i}({{\bm{w}}^{T}{\bm{x}}_{i}+b})\geqslant 1$ .
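
As a brief aside on where the margin expression comes from (a standard argument, not part of the original derivation): let ${\bm{x}}_{+}$ and ${\bm{x}}_{-}$ denote hypothetical support vectors on the two sides of $H$, for which the constraint holds with equality, i.e., ${\bm{w}}^{T}{\bm{x}}_{+}+b=1$ and ${\bm{w}}^{T}{\bm{x}}_{-}+b=-1$. Subtracting the two equalities gives ${\bm{w}}^{T}({\bm{x}}_{+}-{\bm{x}}_{-})=2$, and projecting the difference onto the unit normal ${\bm{w}}/\|{\bm{w}}\|$ yields:

$\displaystyle m=\frac{{\bm{w}}^{T}\left({{\bm{x}}_{+}-{\bm{x}}_{-}}\right)}{\left\|{\bm{w}}\right\|}=\frac{2}{\left\|{\bm{w}}\right\|}$

Maximizing $m$ is therefore equivalent to minimizing $\frac{1}{2}\|{\bm{w}}\|^{2}$, which is the form used in Eq. (8).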

When the training set is not linearly separable, slack variables are introduced into the optimization problem. The goal is for the SVM to search for the hyperplane that maximizes the margin while minimizing the number of misclassification errors. The optimization problem of the separating hyperplane becomes:

$\displaystyle\min\frac{1}{2}\left\|{\bm{w}}\right\|^{2}+C\sum\limits_{i=1}^{n}{\xi_{i}}$ (9)

subject to: $y_{i}({{\bm{w}}^{T}{\bm{x}}_{i}+b})\geqslant 1-\xi_{i}$, $\xi_{i}\geqslant 0$, $i=1,2,\ldots,n$, where $C>0$ is the penalty parameter of the error term and the $\xi_{i}$ are the slack variables.

By introducing the Lagrange function, the dual quadratic programming problem is obtained:

$\displaystyle\mathop{\max}\limits_{\alpha}\sum\limits_{i=1}^{n}{\alpha_{i}}-\frac{1}{2}\sum\limits_{i,j=1}^{n}{\alpha_{i}\alpha_{j}y_{i}y_{j}{\bm{x}}_{i}^{T}{\bm{x}}_{j}}$ (10)

subject to: $\sum_{i=1}^{n}{\alpha_{i}y_{i}=0}$ , $0\leqslant\alpha_{i}\leqslant C$ , $i=1,2,\dots,n$ .

In general, the training sets are not linearly separable. In the nonlinear case, we need to transform the lower dimensional feature space into a higher dimensional feature space via a nonlinear mapping. Suppose there is a nonlinear mapping $\phi$: $\text{X}\to F$, $\text{X}\subseteq\text{R}^{d}$, $F\subseteq\text{R}^{k}$, $k\geqslant d$, which maps input samples $x_{i}\in\text{X}$ into a $k$-dimensional feature space $F$ [34]. Kernel functions are usually used for mapping nonlinearly separable data into a higher dimensional feature space. The kernel function can be defined as:

$\displaystyle K\left({x_{i},x_{j}}\right)=\phi^{T}\left({x_{i}}\right)\phi\left({x_{j}}\right)$ (11)

In this case, the objective function in Eq. (10) becomes:

$\displaystyle L\left(\alpha\right)=\sum\limits_{i=1}^{n}{\alpha_{i}}-\frac{1}{2}\sum\limits_{i,j=1}^{n}{\alpha_{i}\alpha_{j}y_{i}y_{j}\phi^{T}\left({{\bm{x}}_{i}}\right)\phi\left({{\bm{x}}_{j}}\right)}=\sum\limits_{i=1}^{n}{\alpha_{i}}-\frac{1}{2}\sum\limits_{i,j=1}^{n}{\alpha_{i}\alpha_{j}y_{i}y_{j}K\left({{\bm{x}}_{i},{\bm{x}}_{j}}\right)}$ (12)

Commonly used kernel functions include the linear kernel, the polynomial kernel, and the Gaussian radial basis function (RBF) kernel.

Therefore, the final decision function is defined as follows:

$\displaystyle f\left({\bm{x}}\right)=\text{sgn}\left({\sum\limits_{i=1}^{n}{\alpha_{i}^{*}y_{i}K\left({{\bm{x}}_{i},{\bm{x}}}\right)}+b^{*}}\right)$ (13)

The detailed theory of SVM can be found in [33]. The RBF kernel function is used as the SVM kernel function in this paper. The RBF kernel function can be defined as:

$\displaystyle K\left({x_{i},x_{j}}\right)=\exp\left\{{-\left\|{x_{i}-x_{j}}\right\|^{2}/\sigma^{2}}\right\}$ (14)

where $\sigma$ is the RBF kernel parameter.
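
To make Eqs (13) and (14) concrete, the sketch below evaluates the RBF kernel and the decision function for a hand-made set of support vectors. The support vectors, multipliers, bias, and $\sigma$ are toy values chosen for illustration, not trained parameters, and mapping a score of exactly zero to $+1$ is an assumption of this sketch.

```python
import math


def rbf_kernel(x, z, sigma):
    """RBF kernel of Eq. (14): exp(-||x - z||^2 / sigma^2)."""
    squared_distance = sum((a - b) ** 2 for a, b in zip(x, z))
    return math.exp(-squared_distance / sigma ** 2)


def decide(x, support_vectors, labels, alphas, bias, sigma):
    """Decision function of Eq. (13): +1 for eye, -1 for noneye."""
    score = sum(alpha * y * rbf_kernel(sv, x, sigma)
                for sv, y, alpha in zip(support_vectors, labels, alphas))
    return 1 if score + bias >= 0 else -1  # assumption: sgn(0) maps to +1


# Toy model (assumed values): one support vector per class.
support_vectors = [(0.0, 0.0), (2.0, 2.0)]
labels = [1, -1]
alphas = [1.0, 1.0]

near_positive = decide((0.1, 0.1), support_vectors, labels, alphas, 0.0, 1.0)
near_negative = decide((1.9, 2.0), support_vectors, labels, alphas, 0.0, 1.0)
print(near_positive, near_negative)  # 1 -1
```

In the proposed system the input ${\bm{x}}$ would be the feature vector produced by Layer C5 rather than a two-dimensional point.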

4. The eye detection process

This section presents the detailed process of the proposed eye detection method. The method consists of six steps as follows:

Figure 6.

The eye detection process.

Step1 The boosted cascade face detector [27] is applied to locate the face region as shown in Fig. 6a. However, the face detector cannot achieve 100% face detection accuracy. Hence, the incorrectly detected face regions are manually corrected.

Step2 The detected face is normalized to an image of size 200 $\times$ 200 pixels as shown in Fig. 6b.

Step3 In general, the eyes lie in the top half of the face. Hence, the eye search region is limited to the top half of the detected face region. Based on the geometric structure of the human face, the search region is defined as:

$\displaystyle W_{sr}\approx\frac{1}{2}W_{F}\times\gamma_{W},\quad H_{sr}\approx\frac{1}{2}W_{F}\times\gamma_{H}$ (15)

where $W_{F}$ is the side length of the face image (width $=$ height), $W_{sr}$ and $H_{sr}$ are the width and height of the search region, and $\gamma_{W}$ and $\gamma_{H}$ are the adjustment coefficients for the width and height of the search region, respectively. The search region of size $W_{sr}\times H_{sr}$ pixels on the human face is shown in Fig. 6c.

Step4 The variance image of the search region is calculated as shown in Fig. 6d. Overlapping windows of 16 $\times$ 8 sub-blocks (each sub-block being 3 $\times$ 3 pixels) are then slid over the variance image of the search region with a step of one sub-block (3 pixels) to detect the candidate eye regions.

Step5 The correlation values between each extracted window vector and the EVF are calculated with Eq. (4). Windows with correlation values higher than 0.32 are added to the set of candidate eye regions. Some examples of candidate eye regions are shown in Fig. 6e.

Step6 As Fig. 6e shows, most of the noneye images in the search region are discarded by the EVF. The trained CNN extractor and SVM classifier are then employed to select, from the candidate eye images, the two regions most likely to be the left and right eyes. The accurate left and right eye regions are shown in Fig. 6f.
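
Steps 3 and 4 can be sketched as follows. The values of $\gamma_{W}$ and $\gamma_{H}$ are not given in the text, so the coefficients used here are purely hypothetical, as are the function names; the sketch only shows how the search region size follows from Eq. (15) and how the overlapping windows are enumerated on the grid of sub-blocks.

```python
def search_region_size(face_size, gamma_w, gamma_h):
    """Width and height of the search region in pixels (Eq. 15)."""
    return round(0.5 * face_size * gamma_w), round(0.5 * face_size * gamma_h)


def window_origins(grid_rows, grid_cols, win_rows=8, win_cols=16, step=1):
    """Top-left corners (row, col), in sub-block units, of every window."""
    return [(r, c)
            for r in range(0, grid_rows - win_rows + 1, step)
            for c in range(0, grid_cols - win_cols + 1, step)]


# Hypothetical coefficients: NOT taken from the paper.
width_px, height_px = search_region_size(200, gamma_w=1.8, gamma_h=0.9)

block = 3  # sub-block side length in pixels
origins = window_origins(height_px // block, width_px // block)
print(width_px, height_px, len(origins))  # 180 90 1035
```

Each origin `(r, c)` corresponds to a 48 $\times$ 24 pixel window whose top-left pixel is at `(3 * r, 3 * c)`; in Step 5 the variance values inside each such window would be compared with the EVF.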

5. Experimental validation of the proposed method

5.1 The database

The images used to establish our database are extracted from the extended M2VTS database (XM2VTS), the Psychological Image Collection at Stirling (PICS) face database, the Japanese Female Facial Expression (JAFFE) face database, the Milborrow/University of Cape Town (MUCT) face database, the California Institute of Technology (Caltech) face database, a self-built face database, and face images from Internet websites.

The XM2VTS database contains four recordings of 295 subjects, with each recording consisting of a speaking head shot and a rotating head shot. We selected 750 images from XM2VTS for our database. The PICS face database is a collection of images with many face sets. In this paper, the images are only extracted from 2D face sets, including the Aberdeen set with 687 facial images of 90 individuals, the Pain Expressions set with 599 facial images of 23 individuals (13 women and 10 men), and the Utrecht ECVP set with 131 facial images of 69 individuals (20 women and 49 men). The JAFFE face database was constructed by Kyushu University in Japan with 213 images of 7 facial expressions posed by 10 Japanese female models. The MUCT face database is provided by the University of Cape Town in South Africa with 3755 face images. The Caltech face database was collected by Markus Weber at the California Institute of Technology with 450 frontal face images. The self-built face database was built by the Measurement & Testing Laboratory of Beijing Institute of Technology (BIT M&T) with 1,000 face images from 10 people. In addition, we selected and downloaded 13,000 face images from the Google (USA), Bing (USA) and Baidu (China) search engines.

The database consists of a positive image set and a negative image set. The positive image set contains 18,000 eye images, which were cropped manually from face images in the above face databases, with different sizes, different gaze directions, with/without glasses, rotation, various lighting conditions, etc. The negative image set contains 20,000 noneye images, including eyebrows, incomplete eye images, skin, noses, etc. All images in the eye database are normalized to a size of 48 $\times$ 24.

5.2 Evaluation of eye detection

The proposed method was tested separately on the BioID, IMM, FERET, and ORL face databases. The BioID face database contains 1521 grayscale, frontal facial images of pixel size 384 $\times$ 286, acquired under various illumination conditions with a complex background. It consists of 1039 images without glasses and 482 images with glasses. The IMM face database is a collection of male and female digital color images with different illumination conditions and various facial expressions. It contains 240 images of 40 persons (six images per person) of size 480 $\times$ 640. The FERET face database contains 1400 images of 200 subjects (seven images per person) with various skin colors, rotated faces, and different lighting conditions. The face image size is 80 $\times$ 80. The ORL face database contains 400 images: ten different images of each of 40 distinct subjects. The size of each image is 92 $\times$ 112. The images were taken at different times, varying the lighting, facial expressions (open/closed eyes, smiling/not smiling) and facial details (glasses/no glasses). It consists of 281 images without glasses and 119 images with glasses. The images in the four face databases closely resemble realistic conditions, including different backgrounds, gaze directions, and illumination conditions.

The proposed algorithm is run on the face regions captured by the boosted cascade face detector. However, the face detector could not achieve 100% detection accuracy on the face databases. Hence, in order to evaluate the eye detection accuracy on every image in the four databases, we manually extracted the face regions from the incorrectly detected face images.

At present, there is still no standard criterion for evaluating the accuracy of eye region detection. In this paper, a detection is considered correct if the upper and lower eyelids and the two corners of the eye fall inside the detected eye region. A comparison with existing methods is presented in Table 1. Because the selected methods were tested by their authors on different face databases and different numbers of face images, we only provide the detection rates on the corresponding face databases in Table 1. To show the advantages of the proposed detection method, Table 1 also gives the results of the CNN used not only as the feature extractor but also as the output predictor. Additionally, eye detection rates are given separately for the BioID and ORL face images with and without glasses.
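
One possible reading of this correctness criterion is sketched below: a detection counts as correct when four annotated eye landmarks all lie inside the detected rectangle. The box format `(left, top, width, height)`, the landmark coordinates, and the function names are assumptions made for this illustration; the text does not specify how the landmarks were annotated.

```python
def contains(box, point):
    """True if the point lies inside the box (left, top, width, height)."""
    left, top, width, height = box
    x, y = point
    return left <= x <= left + width and top <= y <= top + height


def is_correct_detection(box, landmarks):
    """Correct if the eyelid points and both eye corners fall in the box."""
    return all(contains(box, point) for point in landmarks)


# Toy example (assumed coordinates): a 48 x 24 detection window.
detected_box = (50, 60, 48, 24)
landmarks = [(74, 64),   # upper eyelid
             (74, 80),   # lower eyelid
             (54, 72),   # inner corner
             (94, 72)]   # outer corner

print(is_correct_detection(detected_box, landmarks))  # True
```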

Table 1
Detection accuracy (NR: not reported, NG: no glasses, WG: with glasses, LE: left eye, RE: right eye)

Method	BioID (%)	IMM (%)	FERET (%)	ORL (%)
Our method	98.94 (NG), 96.47 (WG)	99.17	97.92	97.15 (NG), 94.12 (WG)
CNN	97.79 (NG), 94.19 (WG)	97.92	96.86	96.80 (NG), 92.44 (WG)
Yu et al. [36]	97.80 (NG), 92.50 (WG)	98.70	97.60	NR
Hassaballah et al. [24]	97.10	NR	97.30	NR
Kalbkhani et al. [25]	NR	98.65	NR	NR
Peng et al. [22]	NR	NR	NR	95.20 (NG)
Zhou and Geng [16]	94.70	NR	NR	NR
Xia et al. [23]	NR	NR	NR	94.70
Wu and Trivedi [20]	NR	NR	92.43	NR
Ryu and Oh [19]	NR	NR	NR	91.70 (LE), 88.50 (RE)
Ando et al. [17]	NR	NR	88.00	NR

The bold numbers represent the best accuracy in each face database.

Figure 7.

Examples of eye images from our database, including (a) variance from structural individuality, (b) variance from iris motion and eye condition, and (c) variance from noise (such as glare, eyeglasses, etc.).

The average eye detection rate obtained on the whole BioID face database is 98.16%, which is higher than that of the methods of Hassaballah et al. [24] and Zhou and Geng [16]. The images in the IMM face database are of high quality compared with the other databases. Our proposed method obtained a 99.17% detection rate on the IMM database, which is better than the method of Kalbkhani et al. [25]. For the ORL face database, we obtain eye detection rates of 97.15% and 94.12% on images without and with glasses, respectively. The accuracy on face images without glasses is higher than that of the method of Peng et al. [22]. The average eye detection rate on the ORL database is 96.25%, compared with 94.7% in Xia et al. [23]. Dasgupta et al. [2] used a baseline EVF for eye detection and achieved a detection rate of 92.5% on the MIT AI laboratory face database, whereas the proposed method obtains an average detection rate of 97.92% on the four databases.
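
The whole-database rates for BioID and ORL are consistent with averaging the with-glasses and without-glasses rates weighted by the number of images in each group. The check below assumes that this weighting is how the averages were obtained; the image counts are those given in Section 5.2 and the rates are those of Table 1.

```python
def weighted_rate(groups):
    """Average detection rate over (image_count, rate_percent) groups."""
    total = sum(count for count, _ in groups)
    return sum(count * rate for count, rate in groups) / total


# (number of images, detection rate in %) for the NG and WG subsets.
bioid = weighted_rate([(1039, 98.94), (482, 96.47)])
orl = weighted_rate([(281, 97.15), (119, 94.12)])

print(round(bioid, 2), round(orl, 2))  # 98.16 96.25
```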

Figure 8.

Results of eye detection on the four face databases, including (a) BioID without glasses, (b) BioID with glasses, (c) IMM, (d) FERET, (e) ORL without glasses, and (f) ORL with glasses.

Consistent with the merits described in Section 4, when the system uses the CNN as both the feature extractor and the classifier, the eye detection accuracy is lower than that of the model integrating the CNN and SVM. Additionally, a comparison with our previous research [36] indicates that the method proposed in this paper achieves higher accuracies on the BioID, IMM and FERET databases. This also provides evidence that the CNN performs better as a feature extractor than PCA, most likely because the CNN can extract more features with richer representations than the PCA feature extractor. Furthermore, for the images with glasses in the BioID face database, the detection accuracy obtained by the CNN and SVM is much higher than that obtained by PCA and SVM (96.47% versus 92.50% in Table 1).

5.3 Robustness to illumination changes

This subsection evaluates the robustness of the proposed eye detection method under illumination changes. The Yale Face Database B was used in the test. The database contains 5700 grayscale images of 10 subjects under 576 viewing conditions (9 poses and 64 illumination conditions). The size of each image is $640 \times 480$ pixels. However, some images with large shadows could not be used in our test, as shown in Fig. 9. The results reported here were obtained on the 3175 selected face images from the Yale Face Database B.

Figure 9.

Examples of face images with big shadows from Yale Face Database B.

Fig. 10 shows qualitative samples of eye detection results obtained from four subjects in the subsets of the Yale Face Database B. The results show good detection performance when the light source direction varies within $\leqslant 40^{\circ}$ of azimuth and elevation with respect to the camera axis. Additionally, the database contains some face images with pose changes. An example is given in the last row of Fig. 10, showing the robustness of the eye detection method to pose changes.

Figure 10.

Results of the robustness to illumination changes on Yale Face Database B.

In these experiments, we obtain a correct eye detection rate of 97.64% on the 3175 face images. We conclude that the proposed eye detector is robust to illumination and pose changes. To further test the robustness of the method to pose changes, the next subsection provides more detection results.

5.4 Robustness to pose changes

The effects of pose changes on the proposed eye detection method are evaluated using the FEI Face Database. The database contains images of 200 subjects captured in an upright frontal position with profile rotation of up to 180 degrees, for a total of 2800 images (14 images of each subject). The scale of rotation angle might vary by about 10%, and the size of each image is $640 \times 480$ pixels. In many images, the subject's face is horizontally rotated by $\pm 45^{\circ}$ from the forward position. These images provide very challenging conditions for the proposed eye detection method, as shown in Fig. 11. Only images rotated within the $\pm 30^{\circ}$ range were used for testing, to ensure that both eyes were present in the image. Based on this constraint, 1000 face images, five images of each subject, were selected from the FEI Face Database.

Table 2
Detection results on different facial rotation angles

Pose	$-30^{\circ}$	$-10^{\circ}$	$0^{\circ}$	$+10^{\circ}$	$+30^{\circ}$	Average
Accuracy	94.5%	98.0%	99.5%	98.5%	94.0%	96.9%

Figure 11.

Examples of face images with large rotations from FEI Face Database.

The experiment reports results on images with horizontal rotations of $0^{\circ}$, $\pm 10^{\circ}$ and $\pm 30^{\circ}$ from facing forward. Test results are shown in Table 2. The best accuracy is achieved when subjects face directly forward, which provides a high-resolution image around the eye region as well as the least obstruction from other features. Similar results are achieved when the subjects' faces are horizontally rotated $\pm 10^{\circ}$ left or right. When the horizontal rotation angle reaches $\pm 30^{\circ}$, part of the eye image begins to be obstructed by the turned face region, and the accuracy is lower than in the other cases. Some examples of eye detection results are shown in Fig. 12.
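As a quick consistency check, the Average column of Table 2 and the accuracy drop at the largest rotation can be recomputed from the per-pose values. The short Python sketch below copies the numbers from Table 2; the variable names are illustrative only. It assumes that each pose contributes the same number of test images (one image per subject and pose), in which case the unweighted mean is also the overall accuracy.

```python
# Accuracies (%) copied from Table 2, keyed by horizontal rotation in degrees.
table2_accuracy = {-30: 94.5, -10: 98.0, 0: 99.5, 10: 98.5, 30: 94.0}

# Unweighted mean over the five poses (assumes equal image counts per pose).
average = sum(table2_accuracy.values()) / len(table2_accuracy)

# Gap between the frontal pose and the worse of the two 30-degree poses.
drop_at_30 = table2_accuracy[0] - min(table2_accuracy[-30], table2_accuracy[30])

print(f"average accuracy: {average:.1f}%")
print(f"largest drop from frontal pose: {drop_at_30:.1f} points")
```

The recomputed mean agrees with the 96.9% reported in the table, and the frontal pose exceeds the weakest pose by 5.5 percentage points.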

Figure 12.

Results of the robustness to pose changes on FEI Face Database.

6. Discussion

The proposed hybrid model incorporates the merits of the CNN and SVM classifiers, with the aim of avoiding the limitations of each classifier when it is used alone.

6.1 Feature extraction

The features extracted for a classifier usually have a great impact on the final classification results. Features that provide richer representations generally improve the classification accuracy. Numerous studies and applications have demonstrated that an SVM classifier can achieve good classification performance. However, the SVM alone does not have a strong ability to extract features, so it must depend on other methods. The CNN model can combine low-level representations into high-level representations that are abstract, complex, and non-linear. Its main advantage is that it automatically extracts the salient features of the input image. Because the CNN model uses the weight sharing technique, the extracted features are invariant, to a certain degree, to scale, shift, and shape distortion of the input. Eye images generally cover different conditions of face rotation and head movement. Hence, the CNN model offers a better ability for feature extraction when the eye is rotated. Furthermore, elementary features such as eye corners and eyelid edges play an important role in eye classification. According to the CNN theory in Section 3.2, the CNN uses the receptive field concept to obtain local features such as oriented edges, end-points, or corners. The trainable features of the CNN can therefore be used as the feature extractor to collect more representative and relevant information for eye classification than other feature extractors such as PCA. As a result, after the C5 output unit in the CNN is replaced by the SVM, the generalization ability of the SVM is exploited to enhance the classification accuracy of the hybrid model.
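The effect of weight sharing on shift invariance can be illustrated with a toy example. The Python sketch below is not the CNN of Section 3.2: it uses a single hand-set $2 \times 2$ kernel (a stand-in for a learned one), a sliding-window correlation without padding, and global max pooling, all of which are assumptions made for illustration. Because the same kernel weights are applied at every position, moving the pattern inside the image leaves the pooled feature unchanged.

```python
def conv2d_valid(image, kernel):
    """Slide one shared kernel over the image (no padding, stride 1).

    As in most CNN implementations, this is a cross-correlation:
    the kernel is not flipped.
    """
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    return [
        [
            sum(image[r + i][c + j] * kernel[i][j]
                for i in range(kh) for j in range(kw))
            for c in range(out_w)
        ]
        for r in range(out_h)
    ]


def global_max_pool(feature_map):
    """Keep only the strongest response of the whole feature map."""
    return max(max(row) for row in feature_map)


def place_pattern(pattern, top, left, size=6):
    """Return a size x size zero image with the pattern pasted at (top, left)."""
    image = [[0] * size for _ in range(size)]
    for i, row in enumerate(pattern):
        for j, value in enumerate(row):
            image[top + i][left + j] = value
    return image


PATTERN = [[1, 2], [3, 4]]  # toy stand-in for a local eye feature
KERNEL = [[1, 2], [3, 4]]   # hypothetical kernel matched to that feature

feature_a = global_max_pool(conv2d_valid(place_pattern(PATTERN, 1, 1), KERNEL))
feature_b = global_max_pool(conv2d_valid(place_pattern(PATTERN, 3, 2), KERNEL))
print(feature_a, feature_b)  # 30 30
```

In the hybrid model, feature vectors built from such shared-weight responses would be passed to the SVM instead of to the CNN output layer.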

6.2 Hyperplane

If the CNN classifier is used alone in a system, it faces two limitations identified in previous research [35]. First, the aim of the learning method for a classifier is to find a hyperplane that separates two classes while minimizing the errors on the training set. In the CNN training process, once the first separating hyperplane is found by the back-propagation algorithm, the algorithm does not continue to improve the separating hyperplane. For the SVM classifier, the separating hyperplane is optimized by solving a quadratic programming problem, so that the margin between the two classes of training samples reaches its maximum. Second, at the output layer, the CNN aims to assign a high value (nearly $+1$) to one neuron and a low value (nearly $-1$) to all the remaining neurons. In this case, the CNN classifier has difficulty rejecting classification errors. The SVM classifier calculates the estimated probability of each class on the testing data in the classification decision, and these probability values can help the system to design an efficient rejection mechanism.
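A rejection mechanism built on class probabilities can be sketched as follows. This Python example is an assumption-laden illustration, not the rule used in our system: it maps a signed SVM decision value to a probability with a sigmoid (Platt scaling, one common choice), and the sigmoid parameters `a` and `b` and the rejection threshold are hypothetical placeholder values that would normally be fitted or tuned on held-out data.

```python
import math


def platt_probability(decision_value: float, a: float = -1.0, b: float = 0.0) -> float:
    """Map a signed SVM decision value to an estimate of P(eye).

    a and b are placeholder sigmoid parameters (assumed, not fitted).
    """
    return 1.0 / (1.0 + math.exp(a * decision_value + b))


def classify_with_rejection(decision_value: float, threshold: float = 0.8):
    """Return (label, confidence); reject when neither class is confident enough."""
    p_eye = platt_probability(decision_value)
    confidence = max(p_eye, 1.0 - p_eye)
    if confidence < threshold:
        return "reject", confidence
    return ("eye" if p_eye >= 0.5 else "noneye"), confidence


for value in (3.0, -3.0, 0.2):
    label, confidence = classify_with_rejection(value)
    print(f"decision value {value:+.1f} -> {label} (confidence {confidence:.3f})")
```

Candidates far from the separating hyperplane are accepted as eye or noneye, whereas a candidate close to it (decision value near zero) is rejected instead of being forced into one class.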

Table 3
Comparison of detection times

Model	Image size	Environment	Run time
CNN + SVM	$200 \times 200$	Python	2.60 s
EVF + CNN + SVM	$200 \times 200$	Python	0.65 s

In the testing process, eye regions are detected on the whole face image, so the system needs to extract a large number of features. If the eye detection system uses only the CNN and SVM, the computational complexity increases. Because the trainable EVF removes most noneye images from the candidate images, it is a suitable solution. Table 3 compares the detection time with and without the EVF in the system. The processing time is clearly shorter when the EVF is adopted before the CNN and SVM are used for eye classification (0.65 s versus 2.60 s, a fourfold reduction).
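The speed gain comes from the cascade structure: a cheap test discards most candidates before the expensive classifier runs. The Python sketch below illustrates this idea only. It replaces the trained EVF with a much simpler stand-in, a fixed band on the intensity variance of each candidate window, and it replaces the CNN and SVM with a placeholder function; the variance bounds, window contents, and names are all hypothetical.

```python
from statistics import pvariance


def window_variance(window):
    """Population variance of all pixel intensities in a candidate window."""
    return pvariance([pixel for row in window for pixel in row])


def cascade_detect(windows, expensive_classifier, var_min, var_max):
    """Run the expensive classifier only on windows that pass the variance test.

    Returns the indices of accepted windows and the number of expensive calls.
    """
    detections, calls = [], 0
    for index, window in enumerate(windows):
        if not (var_min <= window_variance(window) <= var_max):
            continue  # cheap rejection, no feature extraction needed
        calls += 1
        if expensive_classifier(window):
            detections.append(index)
    return detections, calls


flat = [[100] * 4 for _ in range(4)]                      # variance 0
eye_like = [[200, 200, 200, 200],
            [200, 50, 50, 200],
            [200, 50, 50, 200],
            [200, 200, 200, 200]]                         # dark centre, variance 4218.75
checker = [[0, 255, 0, 255], [255, 0, 255, 0]] * 2        # variance 16256.25

# Placeholder for the CNN feature extractor followed by the SVM.
accept_all = lambda window: True

detections, calls = cascade_detect([flat, eye_like, checker], accept_all, 1000, 10000)
print(detections, calls)  # [1] 1

# Ratio of the run times reported in Table 3.
speedup = 2.60 / 0.65
```

Only one of the three candidate windows reaches the expensive stage here, which is the same mechanism that reduces the run time in Table 3 by a factor of about four.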

7. Conclusions

In this paper, a new hybrid model based on Convolutional Neural Networks and Support Vector Machines has been proposed to solve the eye detection problem. In general, the multilayer architecture of the CNN is complex and computationally costly. To improve the detection speed of the system, an eye variance filter is trained to quickly filter out most noneye regions, and the experiments demonstrate that eye detection is faster than in a system without the EVF. Feature representation for eye images is a very important factor in eye detection. The CNN has the advantage of extracting various latent eye features and acts as the automatic feature extractor of the system. The SVM is used as the output predictor to further classify noneye and eye regions. The experimental results and performance comparisons are reported in detail, providing evidence that the proposed hybrid model is effective at solving the eye detection problem.

From the reported accuracy of the system, we believe that this eye detection method can provide an enabling technology for future applications. The proposed eye detection method was implemented in Python and run on a laptop with a 2.4 GHz i7-4700MQ CPU and 8 GB of DDR3 RAM.

Acknowledgments

This work has been financially supported by the National Natural Science Foundation of China through grant Nos. 81271568 and 81471743, the U.S. National Science Foundation (NSF) through grant Nos. 0954579 and 1333524, and Zhejiang University State Key Laboratory Open Funding GZKF-201512. We thank all the participants who have participated in this work.

References

1. Vicente, Huang, Xiong, Torre F.D.L., Zhang and Levi, Driver gaze tracking and eyes off the road detection system, IEEE Transactions on Intelligent Transportation Systems 16 (2015), 2014–2027.

2. Dasgupta, George, Happy S.L. and Routray, A vision-based system for monitoring the loss of attention in automotive drivers, IEEE Transactions on Intelligent Transportation Systems 14(4) (2013), 1825–1838.

3. Jiménez, Bergasa L.M., Nuevo, Hernández and Daza I.G., Gaze fixation system for the evaluation of driver distractions induced by IVIS, IEEE Transactions on Intelligent Transportation Systems 13 (2012), 1167–1178.

4. Lin, Tang, Schmidt, Wang and Guo, An easy iris center detection method for eye gaze tracking system, Journal of Eye Movement Research 5 (2015), 1–20.

5. Zhu and Ji, Novel eye gaze tracking techniques under natural head movement, IEEE Transactions on Biomedical Engineering 54(12) (2007), 2246–2260.

6. Fang, Wang and Chen, A novel method for gaze tracking by local pattern model and support vector regressor, Signal Processing 90 (2010), 1290–1299.

7. Lin, Schmidt, Wang and Wang, Human-robot interaction based on gaze gestures for the drone teleoperation, Journal of Eye Movement Research 7 (2014), 1–14.

8. Wang, Lin and Bai, Gaze Tracking System for Teleoperation, In: The 26th Chinese Control and Decision Conference, 2014, pp. 4617–4622.

9. Zhu, Gedeon and Taylor, “Moving to center”: A gaze-driven remote camera control for teleoperation, Interacting with Computers 23 (2011), 85–95.

10. Yang, Kriegman D.J. and Ahuja, Detecting faces in images: A survey, IEEE Transactions on Pattern Analysis and Machine Intelligence 24 (2002), 34–58.

11. Hansen D.W. and Ji, In the eye of the beholder: A survey of models for eyes and gaze, IEEE Transactions on Pattern Analysis and Machine Intelligence 32 (2010), 478–500.

12. Perez, Cordoba M.I., Garcia, Mendez, Munoz M.L., Pedraza J.L. and Sanche, A Precise Eye-Gaze Detection and Tracking System, In: The 11th International Conference in Central Europe on Computer Graphics, Visualization and Computer Vision, 2003, pp. 105–108.

13. Kawaguchi, Hidaka and Rizon, Detection of eyes from human faces by the Hough transform and separability filter, In: International Conference on Image Processing, 2000, pp. 49–52.

14. Sirohey S.A. and Rosenfeld, Eye detection in a face image using linear and nonlinear filters, Pattern Recognition 34 (2001), 1367–1391.

15. Feng and Yuen, Multi-cues eye detection on gray intensity image, Pattern Recognition 34 (2001), 1033–1046.

16. Zhou and Geng, Projection functions for eye detection, Pattern Recognition 5 (2004), 1049–1056.

17. Ando, Senicond O.K.I. and Moshnyaga V.G., A low complexity algorithm for eye detection and tracking in energy-constrained applications, In: IEEE 2013 Communications, Signal Processing, and Their Applications Conference, 2013, pp. 1–4.

18. Vijayalaxmi and Rao P.S., Eye detection using Gabor filter and SVM, In: 12th International Conference on Intelligent Systems and Applications (ISDA), 2012, pp. 880–883.

19. Ryu Y.S. and Oh S.Y., Automatic extraction of eye and mouth fields from a face image using eigenfeatures and ensemble networks, Applied Intelligence 17 (2002), 171–185.

20. Wu and Trivedi M.M., A binary tree for probability learning in eye detection, In: IEEE 2005 Computer Vision and Pattern Recognition Conference, 2005, pp. 164–171.

21. and Xiang, Robust eye localization on multi-view face in complex background based on SVM algorithm, In: International Symposium on Information Engineering & Electronic Commerce, 2010, pp. 1–5.

22. Peng, Chen, Ruan and Kukharev G.A., A robust algorithm for eye detection on grey intensity face without spectacles, Journal of Computer Science & Technology 5 (2005), 127–132.

23. Xia, Dong and Chao, Rapid human-eye detection based on an integrated method, In: IEEE 2010 Communications and Mobile Computing Conference, 2010, pp. 12–14.

24. Hassaballah, Kanazawa, Ido and Ido, Efficient eye detection method based on grey intensity variance and independent components analysis, IET Computer Vision 4 (2010), 261–271.

25. Kalbkhani, Shayeste M.G. and Moussvi S.M., Efficient algorithm for detection of face, eye, and eye state, IET Computer Vision 7 (2013), 184–200.

26. Phromsuthirak and Umchid, Development of a geometrical algorithm for eye detection in color images, In: IEEE 2012 Biomedical Engineering Conference, 2012, pp. 5–7.

27. Viola and Jones M.J., Robust real-time face detection, International Journal of Computer Vision 57 (2004), 137–154.

28. LeCun, Bottou, Bengio and Haffner, Gradient-based learning applied to document recognition, In: Proceedings of the IEEE, 1998, pp. 2278–2324.

29. Lauer, Suen C.Y. and Bloch, A trainable feature extractor for handwritten digit recognition, Pattern Recognition 40 (2007), 1816–1824.

30. Chen, Yang, Zhong, Pan, Chen and Zhang, CNNTracker: Online discriminative object tracking via deep convolutional neural network, Applied Soft Computing 38 (2016), 1088–1098.

31. Krizhevsky, Sutskever and Hinton G.E., Imagenet classification with deep convolutional neural networks, In: Advances in Neural Information Processing Systems 25 (NIPS 2012), pp. 1097–1105.

32. Zhan, Tao and Li, Face detection using representation learning, Neurocomputing 187 (2016), 19–26.

33. Vapnik, The Nature of Statistical Learning Theory, Springer, New York, 1995.

34. Boser, Guyon I.M. and Vapnik, A Training Algorithm for Optimal Margin Classifiers, In: ACM 1992 Computational Learning Theory Annual Workshop, 1992, pp. 27–29.

35. Niu and Suen, A novel hybrid CNN-SVM classifier for recognizing handwritten digits, Pattern Recognition 45 (2012), 1318–1325.

36. Yu, Lin and Wang, An efficient hybrid eye detection method, Turkish Journal of Electrical Engineering & Computer Sciences 24 (2016), 1586–1603.

37. Zhang, Wang and Chen, Intelligent fault diagnosis of roller bearings with multivariable ensemble-based incremental support vector machine, Knowledge-Based Systems 89 (2015), 56–85.

38. Wan H.W.D. and Zhu, An effective feature selection method for hyperspectral image classification based on genetic algorithm and support vector machine, Knowledge-Based Systems 24 (2011), 40–48.
