Abstract
This paper introduces a convolutional neural network (CNN) based on multi-scale features to thermal infrared face identification. A novel CNN structure is proposed according to the characteristics of thermal infrared faces. To enhance and extract inconspicuous thermal infrared facial features for identification, convoluted edges are taken as the initial features. A regional parallel structured CNN (RPS net) is proposed to obtain multi-scale features based on edge information. Extensive experiments are conducted and analyzed, including statistical tests with various classifiers, feature vector properties, per-class accuracies and robustness against various kinds of noise. The experimental results indicate that RPS net outperforms algorithms based on traditional features (HoG, Fisherface and LBP) and several CNN algorithms (Alex net, VGG net, DeepID net and TFR net), and that it extracts high-quality features. Therefore, RPS net is effective and robust for thermal infrared face identification.
Introduction
Face identification is one of the most meaningful problems in the field of pattern recognition. With the rapid development of thermal infrared sensors, face identification in thermal infrared images has been widely used, especially in extreme environments where visible face images are unusable for identification. For example, a system that must run around the clock cannot achieve high accuracy at night with visible face recognition alone, and could be greatly improved by thermal infrared imaging and recognition. Another example is skilled makeup, which can change a person's appearance considerably and makes visible face recognition more difficult, yet has little effect on thermal infrared face recognition. We believe that thermal infrared face imaging and recognition could provide supplementary information for visible image systems, especially in extreme environments. Multi-sensor imaging and recognition, such as systems based on both thermal infrared and visible images, will be the trend of intelligent applications.
Face images in different imaging sensors. (a) shows visible color face images. (b) shows visible grayscale face images. (c) shows corresponding thermal infrared face images. Best viewed in color.
Edge maps of visible face and thermal infrared face by Roberts, Sobel and Prewitt edge detectors.
Different from visible images, thermal infrared faces carry more temperature information than edge detail, as shown in Fig. 1. To show the difference in information between visible and thermal infrared faces, edge detection with the Roberts [1], Sobel [2] and Prewitt [3] operators is applied, with results shown in Fig. 2. In general, visible faces have rich information around facial organs, such as the eyes and nose. The variation in thermal infrared faces is located in the hair and mouth regions, where temperature changes. Moreover, the temperature distribution over the face changes with time, environment, emotion and so on [4]. In other words, thermal infrared face images could be very different from each other even when they are taken from the same person. Compared to faces in visible and near infrared images, thermal infrared faces have smaller inter-class differences and weaker intra-class consistency. This increases the difficulty of distinguishing thermal infrared faces from each other. Usually, identification consists of two procedures: feature extraction and face classification. Extracting sufficiently discriminative features is the key to solving the thermal infrared face identification problem.
The rest of this paper contains four main sections. In Section 2, related works are presented and analyzed, including traditional feature based algorithms and some popular CNN algorithms. In Section 3, the proposed RPS net is given in detail, including the regional parallel structure and the convolution parameters. In Section 4, comparison with other algorithms is conducted on the thermal infrared face dataset. The test accuracies, robustness against noise, and the properties of feature vectors are compared and analyzed. In Section 5, the characteristics of the proposed RPS net are summarized. Multi-scale features, extracted through the regional parallel structure, turn out to be effective for thermal infrared face identification.
The convolutional neural networks with regional parallel structure. Best viewed in color.
Some enlightening algorithms for face identification [5, 6, 7, 8, 9, 10] have been proposed in recent years. In [5], lattice is computed as the feature, and the k-nearest neighbor (KNN) algorithm [11] is taken as the classifier for infrared face identification. In [6], Zernike moments (ZMs) and Hermite kernels (HKs) are used to generate local and global features for recognition. By using principal component analysis followed by linear discriminant analysis as the classifier, the algorithm has good accuracy and robustness. Celebrity-1000 [7], containing 1000 celebrities sampled from YouTube and Youku, is set up as a large-scale video database in an unconstrained environment. The revised multitask joint sparse representation (MTJSR) [7] achieves high accuracy on the Celebrity-1000 dataset. Most face identification algorithms are based on visible images and near infrared images, in which there is rich edge and texture information to generate stable patterns for identification. Therefore, facial feature based algorithms are still effective for these images. Recently, higher identification accuracy has been achieved by deep learning algorithms. This is because these images usually contain clear facial features, such as the regions of the eyes, nose and mouth, which could provide stable patterns for identification.
The convolutional neural network (CNN) has developed rapidly over the past 10 years since Hinton's work [12]. CNNs have been introduced to various areas, such as semantic segmentation [13, 14], object recognition [15, 16], damage detection [17, 18, 19] and transportation problems [20, 21], with good real-time performance and robustness [22, 23]. Many efficient face verification and identification algorithms have been proposed [24, 25, 26, 27, 28, 29, 30, 31, 32, 33]. DeepFace [24] and DeepID [26] are two of the state-of-the-art algorithms for face identification and verification. The DeepFace system [24] contains two steps for identity verification of visible face images. The first step is face frontalization based on 3D models and fiducial points, which provides a frontal face with more features for classification. The second step is a CNN classifier, including three locally connected layers which capture the local facial features. Face frontalization reduces the difficulty of feature extraction by localizing different regions of the face. However, the lack of regional textures in thermal infrared face images would affect the accuracy of frontalization and introduce extra noise while training the network. DeepID net [26] is an efficient CNN for face verification with a serial structure, which has achieved high accuracy on the LFW dataset. Up to now, the DeepID team has proposed a third generation [32] with higher accuracy, which combines some basic elements from VGG net [34] and GoogLeNet [35]. A thermal face recognition algorithm [33] has been proposed that adopts a serial structure CNN for thermal infrared recognition under various head rotations, expressions and illumination. For face identification, most algorithms are based on manually designed features, such as HoG [36], LBP [37], Fisherface (LDA) [38] and Eigenface (PCA) [39, 40].
Followed by a classifier such as SVM [41], reasonably designed features could effectively represent the differences among faces of different persons. However, thermal infrared face images could not be accurately distinguished by these features, as discussed in Section 3. This is because the variation of facial temperature could not be fully described as the difference among different identities. In other words, the resolution of manually designed features is not enough for thermal infrared face images.
To achieve high accuracy of face identification in thermal infrared images, we propose a convolutional neural network with regional parallel structure (RPS net). Different from serial structure based CNN, such as DeepID [26], RPS net contains a regional parallel structure as one of the main modules to extract features in different region sizes. In this way, multi-scale features are generated to learn the difference among face identities. The experimental result indicates that RPS net achieves high identification accuracy with good robustness for thermal infrared face images, due to the extracted high quality feature vector.
Algorithm
In general, pixel intensities in thermal infrared images describe the regional distribution of temperature on the face. There are two kinds of information: absolute intensities and relative pattern. Absolute intensities contribute a lot to detecting the face against the background, while the relative pattern is the key to distinguishing faces from each other in the foreground. As shown in Figs 1 and 2, some differences between thermal infrared faces and visible faces can be observed. Firstly, regional textures and details are not clear enough, especially in the areas of the eyes and mouth. Secondly, no features are stable enough under facial expression. In other words, the intensity distribution in some regions, such as the nose area in Fig. 1(c), might change due to time, environment, emotion and other temperature-related reasons. This leads to an unstable relative relationship between different facial areas. In general, the features for identification in thermal infrared face images are not as effective as those in visible face images. To extract features representing the difference, multi-scale information is introduced by using a novel CNN with regional parallel structure.
Some examples of the learned convolution kernels for initial edge feature extraction.
The structure of the proposed RPS net is shown in Fig. 3. It contains three cascaded parts: initial edge feature extraction, multi-scale feature extraction and feature vector classification. Firstly, the initial edge feature extraction module consists of a cascaded convolution and max-pooling layer, as described in Section 3.1. Secondly, convolutions with different kernels are designed as three parallel channels to generate multi-scale features, as described in Section 3.2. Finally, the convolutional feature maps are transformed into a feature vector by a fully connected layer and softmax loss, as presented in Section 3.3 and analyzed in Section 4.2.2 with experiments. These make up an end-to-end thermal infrared face identification algorithm.
Regardless of sensor types and imaging principles, most images are strongly spatially correlated. In thermal infrared face images, the macroscopic pattern of the facial region could not be extracted due to low contrast and noise. To find features which could describe the difference among facial identities, initial feature maps in the first layer are generated through 128 convolution kernels in shape 2
(a) are the initial edge feature maps after convolution, ReLU and max-pooling. (b)–(d) are some response examples to specific convolution kernels shown in 3D from (a). Best viewed in color.
Extracted feature maps of the three channels in regional parallel layer. Channel (a) is the features in neighbor region. Channels (b) and (c) are the features in local region. Feature maps in Channel (b) are convoluted in shape 3 
In this way, regional edge information, which is very important for thermal infrared face identification, could be maintained in the initial feature maps. The convolution operation in this layer is similar to edge detection filtering with Roberts cross operator [1], as shown below.
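For reference, the standard Roberts cross operator uses two 2 x 2 diagonal-difference kernels and combines their responses into a gradient magnitude. The following is a minimal Python sketch of that classical filter only; it is not the learned first layer of RPS net, and the helper names and the toy image are illustrative assumptions.

```python
import math

# Standard Roberts cross kernels: differences along the two diagonals.
ROBERTS_X = [[1, 0], [0, -1]]
ROBERTS_Y = [[0, 1], [-1, 0]]


def correlate2d_valid(image, kernel):
    """Slide `kernel` over `image` with stride 1 and no padding."""
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    return [
        [
            sum(kernel[i][j] * image[r + i][c + j]
                for i in range(kh) for j in range(kw))
            for c in range(out_w)
        ]
        for r in range(out_h)
    ]


def roberts_edges(image):
    """Gradient magnitude from the two Roberts cross responses."""
    gx = correlate2d_valid(image, ROBERTS_X)
    gy = correlate2d_valid(image, ROBERTS_Y)
    return [
        [math.hypot(a, b) for a, b in zip(row_x, row_y)]
        for row_x, row_y in zip(gx, gy)
    ]


# Hypothetical toy image containing one vertical step edge.
toy = [
    [0, 0, 1, 1],
    [0, 0, 1, 1],
    [0, 0, 1, 1],
]
edges = roberts_edges(toy)  # non-zero only in the column that spans the step
```

A learned 2 x 2 kernel generalizes this idea: instead of fixed weights of 1 and -1, the weights are fitted during training, so each kernel can respond to a different local edge orientation or contrast.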
Actually, Roberts cross operator is an extreme form of convolution in shape 2
The ReLU [44] nonlinearity is added after each convolution in our net. In this way, the convolution operation could be expressed as Eq. (2).
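As a hedged illustration of the convolution, ReLU and max-pooling sequence used for the initial edge features, the sketch below applies one kernel to a single-channel image. The kernel weights, the bias, and the 2 x 2 pooling window with stride 2 are assumptions chosen for the example, not parameters taken from the paper.

```python
def conv2d_relu(image, kernel, bias=0.0, stride=1):
    """Unpadded 2-D convolution (cross-correlation) followed by ReLU."""
    kh, kw = len(kernel), len(kernel[0])
    rows = range(0, len(image) - kh + 1, stride)
    cols = range(0, len(image[0]) - kw + 1, stride)
    return [
        [
            max(0.0, bias + sum(kernel[i][j] * image[r + i][c + j]
                                for i in range(kh) for j in range(kw)))
            for c in cols
        ]
        for r in rows
    ]


def max_pool(fmap, size=2, stride=2):
    """Keep the largest activation in each pooling window."""
    rows = range(0, len(fmap) - size + 1, stride)
    cols = range(0, len(fmap[0]) - size + 1, stride)
    return [
        [
            max(fmap[r + i][c + j] for i in range(size) for j in range(size))
            for c in cols
        ]
        for r in rows
    ]


# Hypothetical 5 x 5 image with a dark-to-bright vertical edge.
image = [[0, 0, 1, 1, 1] for _ in range(5)]
# Hypothetical kernel that responds to left-to-right brightening.
kernel = [[-1.0, 1.0], [-1.0, 1.0]]

activated = conv2d_relu(image, kernel)  # 4 x 4 map, positive only at the edge
pooled = max_pool(activated)            # 2 x 2 map after pooling
```

ReLU discards negative responses, so a kernel of the opposite sign would yield an all-zero map on this image; using many kernels is what lets the layer keep edges of both polarities.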
The final feature vectors and the cosines to mean vectors. The feature vector from the same person are close to each other with large cosine value to the mean vector. Best viewed in color.
In Fig. 5, (a) shows the feature maps generated through all 128 convolution kernels in the current layer. (b), (c) and (d) are the 2nd, 18th and 37th channels from feature map (a). (b) represents the response to the face contour against the background. (c) describes the distribution of facial organs, such as the location of the eyes, nose and mouth. In (d), some key points for identification are emphasized. These feature maps with rich edge information are individually shown as 3-D meshes for better visual effect, as shown in Fig. 5(b)–(d).
After the initial edge feature maps are extracted, the subsequent operations are designed to find the pattern of edges in a neighborhood. Different from the structure of earlier CNN algorithms, such as Alex net [44] and VGG net [34], regional parallel convolutions with different kernel sizes and strides are designed as different channels in one layer. As shown in Fig. 3, the regional parallel layer combines the initial feature maps with two feature maps extracted through different convolution kernels. In this way, patterns of initial edge features could be learned at different scales. The multi-scale features could provide enough information for thermal infrared face identification. The proposed three channels for feature extraction could be grouped into two types: features in neighbor region and features in local region. Details of the structure and parameters are given below.
The features in neighbor region
The key to thermal infrared face identification is finding the hidden relationship among edge features or gradients. In particular, an edge is itself a kind of feature for identification. As the source of edge pattern learning, the initial edge features could be considered as the minimum-scale pattern. Therefore, the initial feature maps, shown as Channel (a) in Fig. 6, are taken as features in neighbor region to generate the final feature vector. Similar strategies are adopted in many CNN algorithms, such as DeepID net [26]. In this way, information in neighbor region could be included in the final feature vector.
The features in local region
To obtain information in middle and large receptive field, convolutions in local region are designed as two paralleled channels, as shown in Fig. 3. In this paper, a convolution with 64 kernels in shape 3
As shown in Fig. 6, the feature maps in Channel (b) and (c) are responses of edge variety in middle scale and large scale. With the feature maps in neighbor region, three channels are taken as the regional parallel layer, which could extract multi-scale features for thermal infrared faces.
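The way three parallel channels contribute to one feature vector can be illustrated with simple shape arithmetic. The sketch below is only an assumption-laden example: the map size, channel counts, kernel sizes and strides are hypothetical placeholders and do not reproduce the configuration of RPS net, and padding is assumed to be absent.

```python
def conv_output_size(n, kernel, stride=1):
    """Side length after an unpadded convolution over an n-by-n map."""
    return (n - kernel) // stride + 1


def parallel_feature_length(map_size, branches):
    """Flattened length when the outputs of parallel branches are concatenated.

    `branches` is a list of (out_channels, kernel, stride) tuples. A branch
    with kernel 1 and stride 1 stands in for passing maps through unchanged.
    """
    total = 0
    for out_channels, kernel, stride in branches:
        side = conv_output_size(map_size, kernel, stride)
        total += out_channels * side * side
    return total


# Hypothetical configuration, NOT the paper's parameters:
# a pass-through branch, a medium-field branch and a large-field branch.
branches = [(8, 1, 1), (4, 3, 1), (4, 5, 2)]
length = parallel_feature_length(16, branches)  # 2048 + 784 + 144 values
```

Because every branch reads the same initial edge maps, a larger kernel or stride widens the region summarized by each output value without stacking additional layers in depth.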
The feature vector of thermal infrared face
Multi-scale feature maps from regional parallel layer describe the patterns in different region sizes, which contain valuable information for the final identification of thermal infrared faces. Actually, there are 128
The dropout method [45] randomly disables some components of the feature vector while the model is being trained. Because the dropped components are chosen at random, any two nodes may be inactive at the same time. This makes the following layer less affected by correlations among nodes in the current layer. Therefore, the components of the feature vector could be trained with weak correlation for classification. In this way, the classifier becomes more sensitive to the components themselves rather than to the collective effect of some components. Experiments [45] indicate that the dropout method is effective for limiting over-fitting.
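A minimal sketch of dropout on a feature vector is given below. It uses the common "inverted" variant, which rescales the surviving components during training so that no change is needed at test time; this variant, the drop rate of 0.5 and the function name are assumptions for illustration, since the rate used in RPS net is not stated here.

```python
import random


def dropout(vector, rate=0.5, training=True, rng=None):
    """Randomly zero components of `vector` with probability `rate`.

    Surviving components are divided by (1 - rate) so the expected value of
    each component is unchanged (inverted dropout, an assumed variant).
    """
    if not 0.0 <= rate < 1.0:
        raise ValueError("rate must be in [0, 1)")
    if not training or rate == 0.0:
        return list(vector)  # dropout is disabled at test time
    rng = rng or random.Random()
    keep = 1.0 - rate
    return [x / keep if rng.random() < keep else 0.0 for x in vector]


# Hypothetical feature vector; a seeded generator keeps the run repeatable.
features = [1.0] * 1000
dropped = dropout(features, rate=0.5, rng=random.Random(0))
unchanged = dropout(features, training=False)
```

Each training step draws a fresh mask, so the following layer cannot rely on any fixed pair of components being present together.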
Discussion
The CNN algorithm in this paper is quite different from the networks for object or face recognition, such as DeepID net [26], VGG net [34] and Alex net [44]. Firstly, kernel size of initial convolution is much smaller. Despite the influence of image sizes, kernel size of initial convolution layer decides the basic scale of local region. Large kernel size, such as 4
Some examples from our thermal infrared face dataset with various movement and expression.
Due to the regional parallel structure, information from the initial edge features could be preserved and trained as patterns in different region sizes. In this way, multi-scale features could be extracted to generate the final feature vector for classification. Different from serial structure based CNNs, RPS net has only four cascaded convolution and fully connected layers, but involves more edge features and edge patterns. These are crucial for thermal infrared face identification. Therefore, the feature vector extracted through RPS net has better inter-class difference and intra-class consistency, as analyzed in Section 4.2.2.
There are some thermal infrared face image databases publicly available for different applications. IRIS (Imaging, Robotics and Intelligent System) Thermal/Visible Face Database [46] and Terravic Facial IR Database [47] have 30 and 20 subjects with various expressions and movements, respectively. These publicly available thermal infrared face datasets are not suitable for CNN based experiments because of the small number of samples and subjects. The thermal infrared face images used in our experiments are 64
To evaluate RPS net comprehensively, three traditional feature based algorithms, HoG [36], LBP [37], Fisherface (LDA) [38] and four convolutional neural network based methods, DeepID net [26], VGG net [34], Alex net [44], Thermal face recognition net (TFR net) [33] are conducted with the public Caffe toolbox [48] on this thermal infrared face image dataset. Actually, DeepID net is proposed for face verification with the Joint Bayesian [49] technique. In this paper, DeepID net is trained as an identification issue with a softmax loss layer to get the predicted classes. The softmax loss
The complexity of CNN algorithms is mainly decided by convolutions. Without accelerating strategy, convolution complexity is
The accuracies on the test datasets
In our experiment, the dataset is divided into five disjoint parts of roughly equal size. Specifically, there are 3007, 2978, 2969, 2978 and 2941 images in these subsets respectively. For comprehensive comparison, each subset is used in turn as the test set, with the model trained on the other four subsets. The average accuracy of the five test results is taken as the final identification accuracy. Test accuracies of the proposed algorithm and the compared algorithms are listed in Table 1. Our algorithm exceeds all the compared algorithms in average accuracy. This indicates that RPS net is effective for thermal infrared face identification, in general.
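The five-fold protocol can be made concrete with a short sketch. The subset sizes are the ones quoted above; the function names are illustrative, and no accuracy values are included because they belong to Table 1.

```python
# Subset sizes as reported for the five disjoint parts of the dataset.
SUBSET_SIZES = [3007, 2978, 2969, 2978, 2941]


def fold_splits(sizes):
    """Return (training images, test images) for each held-out subset."""
    total = sum(sizes)
    return [(total - held_out, held_out) for held_out in sizes]


def mean_accuracy(fold_accuracies):
    """Unweighted average of the per-fold test accuracies."""
    return sum(fold_accuracies) / len(fold_accuracies)


total_images = sum(SUBSET_SIZES)     # 14873 images in total
splits = fold_splits(SUBSET_SIZES)   # first fold: 11866 train, 3007 test
```

Since the folds differ slightly in size, the unweighted mean used here is close to, but not identical with, the accuracy pooled over all images.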
To evaluate and analyze algorithms comprehensively on our thermal infrared dataset, statistical analysis for the compared CNN algorithms with various classifiers, property analysis of extracted feature vectors, accuracy analysis based on each class and robustness analysis based on top-
Compared to traditional feature based algorithms, such as HoG, Fisherface and LBP, CNN algorithms achieve higher identification accuracies. The HoG feature descriptor is based on gradient statistics in regional blocks, which is effective for the detection of specific objects. Combined with a classifier such as SVM [50], the HoG feature works better for detecting objects against the background than for distinguishing faces from each other. Fisherface, or Linear Discriminant Analysis (LDA), similar to Eigenface algorithms, is based on analysis of the whole dataset. Therefore, identification is affected by not only inter-class difference but also intra-class consistency. Any factor that reduces intra-class consistency, such as noise, could easily influence identification accuracy. The LBP feature is a fast and robust texture descriptor in a regional area. Based on the relation to the central pixel, the LBP feature is not sensitive to illumination. However, the facial pattern of thermal infrared face images is quite different from that of visible face images. As a description of the temperature distribution on the face, thermal infrared face images have no clear textures and details stable enough to represent the difference among various identities. Therefore, manually designed features are not sufficient for identification. CNN algorithms, including the proposed RPS net, achieve higher accuracies due to their efficient feature learning strategy.
Comparison to CNN algorithms
Compared to traditional feature based algorithms, CNN algorithms achieve higher accuracies due to good resolution of extracted feature vectors. Actually, CNN algorithms achieve close test accuracies to each other. To compare the detail difference of these algorithms, test accuracies for each person are shown in Fig. 9.
In Fig. 9, the class labels on the horizontal axis are sorted by test accuracy from low to high, separately for each algorithm. Because no algorithm holds an accuracy advantage over the others for every class, the accuracy curves are rearranged in this way so that they can be compared as a whole. In general, RPS net exceeds the others in terms of the accuracy for each person.
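The sorted per-class curves can be computed as in the sketch below. The labels and predictions are hypothetical toy data, not results from the experiments, and the function names are illustrative.

```python
from collections import defaultdict


def per_class_accuracy(labels, predictions):
    """Fraction of correctly predicted samples for each true class."""
    correct = defaultdict(int)
    count = defaultdict(int)
    for truth, predicted in zip(labels, predictions):
        count[truth] += 1
        correct[truth] += int(truth == predicted)
    return {cls: correct[cls] / count[cls] for cls in count}


def sorted_accuracy_curve(labels, predictions):
    """Per-class accuracies sorted from low to high, as plotted per algorithm."""
    return sorted(per_class_accuracy(labels, predictions).values())


# Hypothetical toy data with three identities.
labels = ["a", "a", "b", "b", "c", "c", "c", "c"]
predictions = ["a", "b", "b", "b", "c", "c", "c", "a"]
curve = sorted_accuracy_curve(labels, predictions)
```

Note that sorting is done independently for each algorithm, so the same horizontal position may correspond to different persons on different curves.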
The paired-sample t-test result
The sorted test accuracies of each class.
To show the significance of the proposed algorithm over the other four CNN algorithms, paired-sample
In Table 2, the Hs with low
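For readers unfamiliar with the paired-sample test, the sketch below computes the standard paired t statistic from two lists of per-class accuracies. The input values are hypothetical, and the sketch stops at the statistic and degrees of freedom; converting them to a p-value requires the Student t distribution, for example through a library routine such as `scipy.stats.ttest_rel`.

```python
import math


def paired_t_statistic(first, second):
    """Paired-sample t statistic and degrees of freedom.

    `first` and `second` hold matched measurements, for example the accuracy
    of two algorithms on the same classes.
    """
    if len(first) != len(second) or len(first) < 2:
        raise ValueError("need two matched samples of length >= 2")
    diffs = [x - y for x, y in zip(first, second)]
    n = len(diffs)
    mean = sum(diffs) / n
    variance = sum((d - mean) ** 2 for d in diffs) / (n - 1)  # sample variance
    if variance == 0:
        raise ValueError("differences have zero variance")
    return mean / math.sqrt(variance / n), n - 1


# Hypothetical per-class accuracies of two algorithms (not measured values).
algo_a = [0.90, 0.80, 0.95, 0.85]
algo_b = [0.85, 0.80, 0.90, 0.75]
t_value, dof = paired_t_statistic(algo_a, algo_b)
```

Pairing by class removes the variation between easy and hard identities, so the test looks only at whether one algorithm is consistently better on the same persons.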
To compare these CNN algorithms with RPS net comprehensively, robustness, layer depth and properties of the feature vector are analyzed with more experiments below.
To measure the robustness of the proposed algorithm, two kinds of experiments are conducted. One is the top-1, 3, 5, 7 and 9 accuracies of the compared CNN algorithms, as shown in Fig. 10. The other is the prediction performance of the CNN algorithms on thermal infrared face images with various levels of Gaussian or salt & pepper noise, as shown in Figs 12 and 13. To compare the network complexity and layer depth, the numbers of convolution and fully connected layers in depth are listed in Table 3. These results indicate that RPS net identifies thermal infrared faces more accurately and robustly with fewer layers.
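Top-k accuracy counts a prediction as correct when the true identity is among the k highest-scoring classes. A minimal sketch follows; the score matrix and labels are hypothetical toy values, and ties are broken by class index.

```python
def top_k_accuracy(scores, labels, k):
    """Fraction of samples whose true label is among the k best-scored classes."""
    hits = 0
    for row, label in zip(scores, labels):
        ranked = sorted(range(len(row)), key=lambda i: row[i], reverse=True)
        hits += int(label in ranked[:k])
    return hits / len(labels)


# Hypothetical softmax outputs for four samples over three classes.
scores = [
    [0.10, 0.70, 0.20],
    [0.50, 0.30, 0.20],
    [0.20, 0.30, 0.50],
    [0.40, 0.35, 0.25],
]
labels = [1, 1, 0, 2]

top1 = top_k_accuracy(scores, labels, 1)
top2 = top_k_accuracy(scores, labels, 2)
top3 = top_k_accuracy(scores, labels, 3)
```

Top-k accuracy can only grow with k, so the gap between top-1 and larger k indicates how often the correct identity is ranked just below the first choice.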
The numbers of convolution and fully connected layers in depth
The top-1, 3, 5, 7 and 9 test accuracies of the CNN algorithms. Best viewed in color.
In Fig. 10, top-1, 3, 5, 7 and 9 test accuracies of CNN algorithms are presented. It indicates that the proposed algorithm exceeds the other CNN algorithms at all top-
To measure the robustness of the algorithms against various environments and noise, Gaussian and salt & pepper noise of various intensities are added to the test dataset. The Gaussian noise has zero mean and variances of 0.01, 0.015, 0.02, 0.025 and 0.03. The salt & pepper noise density takes the values 0.01, 0.03, 0.05, 0.07 and 0.09. The noised thermal infrared face images with different noise types and intensities are shown in Fig. 11.
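The two noise models can be sketched as follows. The sketch assumes pixel intensities normalized to [0, 1], clipping after Gaussian noise, and an even split between salt and pepper pixels; these conventions are assumptions, since the intensity scale behind the quoted variances and densities is not stated here.

```python
import random


def add_gaussian_noise(image, variance, rng):
    """Add zero-mean Gaussian noise to every pixel and clip to [0, 1]."""
    sigma = variance ** 0.5
    return [
        [min(1.0, max(0.0, p + rng.gauss(0.0, sigma))) for p in row]
        for row in image
    ]


def add_salt_pepper_noise(image, density, rng):
    """Replace a fraction `density` of pixels with 0 (pepper) or 1 (salt)."""
    noisy = []
    for row in image:
        new_row = []
        for p in row:
            u = rng.random()
            if u < density / 2:
                new_row.append(0.0)   # pepper
            elif u < density:
                new_row.append(1.0)   # salt
            else:
                new_row.append(p)     # pixel left untouched
        noisy.append(new_row)
    return noisy


# Hypothetical flat gray image; seeded generator for repeatable runs.
rng = random.Random(0)
image = [[0.5] * 32 for _ in range(32)]
gaussian_noised = add_gaussian_noise(image, 0.01, rng)
salt_pepper_noised = add_salt_pepper_noise(image, 0.05, rng)
```

Gaussian noise perturbs every pixel by a small amount, whereas salt & pepper noise leaves most pixels intact and saturates a few, which is why the two affect classifiers differently.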
The predicting accuracies of CNN algorithms based on various Gaussian noised images are shown in Fig. 12. With larger Gaussian variance, noised images become more difficult to identify, for all compared algorithms. In Fig. 12, accuracy of RPS net decreases slightly, while exceeding the others. It indicates that RPS net has high accuracy and good robustness against Gaussian noise.
Noised thermal infrared face images with different noise types and intensities.
The test accuracies on images noised by various Gaussian noise. Best viewed in color.
As shown in Fig. 11, salt & pepper noise has more randomness than Gaussian noise, because Gaussian noise affects all pixels of an image, while salt & pepper noise randomly affects only some pixels. A Gaussian noised image is more difficult to identify by human eyes, but easier for a computer-based image classifier. The situation for salt & pepper noise is the opposite. The accuracies of the CNN algorithms for various salt & pepper noised images are shown in Fig. 13.
In Fig. 13, some algorithms, such as Alex net and VGG net, are highly affected by salt & pepper noise. Alex net and VGG net are classic serial structured CNN algorithms. Although little influenced by noise, TFR net could not achieve high accuracy for identification. DeepID net has robustness close to that of the proposed RPS net, due to its regional parallel layer before the fully connected layer. In conclusion, parallel channels could greatly improve the robustness of CNN features. Therefore, RPS net achieves high accuracy with good robustness by taking the regional parallel structure as one of its main modules.
The four compared algorithms are basically serial structure based CNNs, aiming at extracting deeper features through sequentially cascaded convolutions. By adding deeper layers, a CNN could extract more abstract features for learning patterns. Different from the strategy of “going deeper”, parallel convolutions are adopted in the proposed RPS net based on edge information. We believe multi-scale features of thermal infrared faces are more suitable and robust for identification than abstract features. Therefore, serial convolution layers in depth are changed into a parallel-structured convolution, which contains three different channels of convolutional features in one layer. In this way, RPS net contains only two cascaded layers before the fully connected layer, the initial edge convolution layer and the regional parallel layer, to acquire the feature vector for classification. Although the network is shallower, as shown in Table 3, edge information is enhanced and involved in multi-scale features, ensuring correctness and robustness of thermal infrared face identification.
The test accuracies on images noised by various salt & pepper noise. Best viewed in color.
The feature vectors extracted through compared CNNs. The feature vectors for Alex net and VGG net are 4096-dimensional and shown in size 128 
The extracted feature vectors, with intra-class consistency measures for identification extracted through four CNN algorithms, are shown in Fig. 14. As the final representation of thermal infrared face, feature vector could be considered as the key to evaluating the performance of different algorithms. Good feature vector should have high inter-class difference and intra-class consistency at the same time. Inter-class difference could be directly represented by classification accuracy, while intra-class consistency is not easy to describe. Usually, vector distance, such as Euclidean distance between two vectors, is taken as the similarity measure. However, distance measure is not reasonable for vectors in different dimensions. For example, the distance between 1000-dimensional vectors is quite different from the distance between 2-dimensional vectors. Therefore, non-dimensional measures are more reasonable than dimensional measures, such as vector distance. In our paper, cosine values, based on sample feature vectors from the same class, are calculated and compared as the measure of intra-class consistency. Larger cosine value indicates better intra-class consistency in the corresponding class. To evaluate the general intra-class consistency, two measures
In Eq. (4.2.2),
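The cosine-based view of intra-class consistency can be illustrated by computing, for one class, the cosine between each sample feature vector and the class mean vector. The sketch below covers only this per-sample cosine; it does not reproduce the two summary measures defined in the paper, and the vectors are hypothetical toy data.

```python
import math


def cosine(u, v):
    """Cosine of the angle between two vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def mean_vector(vectors):
    """Component-wise mean of a list of vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def cosines_to_class_mean(vectors):
    """Cosine of every sample vector to the mean vector of its class."""
    centre = mean_vector(vectors)
    return [cosine(v, centre) for v in vectors]


# Hypothetical feature vectors of two classes (toy data).
consistent_class = [[1.0, 2.0], [2.0, 4.0]]   # same direction
scattered_class = [[1.0, 0.0], [0.0, 1.0]]    # orthogonal directions

tight = cosines_to_class_mean(consistent_class)
loose = cosines_to_class_mean(scattered_class)
```

Because the cosine is bounded between -1 and 1 whatever the vector length, values from a 160-dimensional and a 4096-dimensional feature vector can be compared directly, which is not the case for Euclidean distance.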
The predicting accuracies using different classifiers
Classification accuracy is mainly affected by inter-class difference and intra-class consistency, whichever theory the classifier is based on. Conversely, vectors with better inter-class difference and intra-class consistency could achieve higher accuracy regardless of classifier properties. In other words, the quality of the vectors for classification could be inferred by comparing accuracies across different classifiers. For this purpose, several classical classifiers, SVM [50], KNN [11] and Random Forests [51], are trained and tested on the feature vectors extracted through different CNN algorithms. 10% of the training images are separated as a validation dataset to adjust hyper parameters, such as
Table 4 indicates that the features of RPS net achieve higher accuracies than the other compared algorithms using different classifiers, in general. For further comparison, accuracies for each class are rearranged according to their values from low to high, as shown in Figs 15–17 by different classifiers.
Similar to the result shown in Fig. 9, Figs 15–17 indicate that RPS net achieves higher single-class accuracies in general. For statistical comparison based on the accuracy of each person, paired-sample
The paired-sample
The sorted accuracies of each class by SVM.
The sorted accuracies of each class by KNN.
The sorted accuracies of each class by Random Forests.
As shown in Table 5, all Hs are rejected due to low
The experiments on the intra-class consistency of feature vectors and on performance with different classifiers show that the features extracted by RPS net have good inter-class difference and intra-class consistency. This is because the multi-scale features from RPS net contain more meaningful information for thermal infrared face identification. Therefore, the proposed RPS net has better performance for thermal infrared face identification.
In serial structure based CNN algorithms, convolution layers are cascaded as a sequence to acquire abstract features from deeper layers. In any serial system, and in CNNs especially, the current layer should have more parameters to enhance information from the previous layer. Therefore, many pooling or down-sampling methods have been proposed and adopted to limit the number of parameters without losing too many features. One of the problems in designing a serial structured CNN is handling the conflict between the growing number of parameters and the features lost through down-sampling. Compared to serial structure algorithms, features extracted by a parallel structure usually have better properties and higher robustness [52, 53, 54, 55], which could avoid losing information between cascaded modules.
By limiting depth and adding regional parallel convolutions, RPS net could generate multi-scale features which are more effective for thermal infrared face identification. This is because the multi-scale features are extracted from low-level information, which is very important for thermal infrared images. Compared to serial structured CNN features, multi-scale features contain more information than high-level abstract features for thermal infrared face identification.
Due to the regional parallel structure, RPS net could extract high quality features with fewer layers. Therefore, higher accuracy of thermal infrared face identification could be achieved robustly by using RPS net.
Conclusions
In this paper, a novel convolutional neural network is proposed and introduced to thermal infrared face identification. The convoluted edges are taken as initial features, based on the characteristics of thermal infrared face images. A regional parallel structure is proposed to extract multi-scale features based on edge information. In this way, the enhanced edges could be represented by multi-scale features including neighbor and local region information, which perform better than abstract features extracted by serial structured CNNs.
Extensive experiments are conducted and analyzed for comprehensive evaluation. The statistical test results indicate that features extracted by RPS net achieve high accuracy with different classifiers. The feature vector property experiment demonstrates that RPS net features have good inter-class difference and intra-class consistency. Moreover, the robustness of RPS net has been demonstrated by adding Gaussian and salt & pepper noise to the test images. The experimental results indicate that RPS net outperforms algorithms based on traditional features (HoG, Fisherface and LBP) and some convolutional neural networks (Alex net, VGG net, DeepID net and TFR net), with high quality features.
In conclusion, the proposed RPS net is more effective and robust than some traditional feature based algorithms and serial structured neural networks for thermal infrared face identification.
Acknowledgments
This work is supported by the National Natural Science Foundation of China under Grant U1736217, the Program for New Century Excellent Talents in Universities under Grant NCET-13-0020, and the Fundamental Research Funds for the Central Universities under Grant YWF-17-BJ-Y-69.
