Abstract
Facial expression recognition (FER) has been an active research area in recent years and plays a vital role in national security and human-computer interaction. Because sufficient expression features and facial images are lacking, it is challenging to recognize facial expressions automatically with high accuracy. In this paper, we propose a method that fuses handcrafted features to improve FER from images. Firstly, a new texture feature extraction method, PD-LDN (Pixel Difference Local Directional Number pattern), is proposed, which extracts more local information while reducing noise disturbance and feature dimension. Secondly, the handcrafted features, including PD-LDN texture features, geometric features, and BOVW (Bag of Visual Words) semantic features, are fed in parallel into an improved autoencoder network for fusion. Finally, the fused features are input into a softmax classifier to recognize facial expressions. We conduct extensive experiments on the JAFFE and CK+ datasets. The proposed method outperforms state-of-the-art approaches in recognizing facial expressions.
Introduction
Facial expressions intuitively reflect people's inner states and play an important role in social interaction. Studies have shown that facial expressions convey as much as 55% of the information in communication [1]. In recent years, FER has become a research hotspot in computer vision. Because facial expressions change in complex ways and the recognition process draws on several disciplines, such as psychology, computer vision, and machine learning, FER is a challenging task [2–4]. The most important step in recognizing facial expressions is extracting discriminative features that distinguish different emotions. Researchers have proposed a variety of feature extraction methods to classify facial expressions from images.
Handcrafted feature-based methods include texture feature-based methods [5, 11–15], geometric feature-based methods [16–18], and semantic feature-based methods [19]. Furthermore, because combining various features can provide a more effective facial representation, many researchers have fused these handcrafted features and achieved performance superior to that of individual feature-based methods [20–25, 43]. With the rapid development of deep learning in recent years, many novel neural networks have been applied to expression recognition [7–9, 40–44]. An end-to-end neural network model can obtain high-level features by learning from the low-level features of the images. However, training and testing deep networks requires a large amount of data, and the existing expression datasets are too small to meet this need [36], which leads to over-fitting.
In this paper, we concentrate on recognizing facial expressions from images. Inspired by the numerous successful applications of handcrafted features in FER, we propose a method that fuses handcrafted texture, geometric, and semantic features. We propose PD-LDN to exploit the local texture features of facial expression images. The seven-order moment is good at describing the locations and shapes of facial components and thus reflects the geometric characteristics of different expressions, so we adopt it to extract the geometric features of facial expression images. In addition, because the eyes, nose, and mouth contribute greatly to facial expression [10], we propose an approach based on the key areas of the face and BOVW to extract representative semantic features. We then fuse these handcrafted features through an autoencoder network to obtain the final feature of the facial expression. Fusion with the autoencoder network allows the information in, and the correlation among, the handcrafted features to be exploited and reconstructed effectively. The method also shows stronger robustness and thus improves the accuracy of facial expression recognition.
The novelty and contributions of our work are primarily as follows: (1) A novel PD-LDN is proposed, which extracts detailed local features of a facial expression image by calculating the edge response values of the local neighborhood and adjusting the pixels in the coding directions through an adaptive threshold. In addition, PD-LDN is highly robust to noise and reduces the feature dimension effectively. (2) We propose to extract the semantic feature from key regions of a facial image with the BOVW model, which introduces DSIFT (Dense SIFT) to describe the key points of the facial expression. (3) Unlike existing fusion methods that connect the features in series, we propose a feature fusion model with an improved autoencoder network, in which three hidden layer vectors are assigned to represent the texture, geometric, and semantic features respectively. The final fused features are obtained through the unified decoding of the hidden layer nodes. The proposed method can distinguish changes of expression in terms of the appearance, structure, and intensity of a face image.
The paper is organized as follows: Section II discusses related work. Section III introduces the proposed method. Section IV discusses the experimental results. Section V concludes the paper.
Related work
The texture feature is vital for pattern analysis of images. Jabid et al. [11] defined LDP (Local Direction Pattern) to classify facial expressions. Wang et al. [12] put forward LPQ (Local Phase Quantization) and sparse representation to extract the texture features of expressions. LDN [32] was presented to extract the texture features of facial expressions; it obtains a set of direction codes according to the edge response values of each pixel and represents the local edge texture information well. Shokoohi et al. [13] proposed to extract texture features of face images by using GLTP (Gradient Local pattern and gradient-based ternary Texture Patterns). Ding et al. [14] used LBP and Taylor expansion to perform texture feature extraction on facial expression images. Gu et al. [15] adopted a local Gabor feature radial coding mode to recognize expressions. Considering the shape variations of the facial features, [16–18] proposed geometric feature-based methods.
To make full use of different methods, Harrati et al. [20] combined BOVW and HOG (Histogram of Oriented Gradient) to extract high-dimensional features for recognizing expressions. Kumari et al. [21] connected LBP, HOG, and LDP features in series to form a fused feature and then used SVM to classify facial expressions. [22–24] utilized texture and geometric features to recognize expressions. In [25], Yang et al. connected three texture features, LBP, LDN, and EOH, in series to perform FER. The recognition rate of this method increased considerably compared to [22–24]. Ghimire et al. [42] and Happy et al. [43] combined shape and appearance features for FER to achieve better accuracy.
Several deep learning networks have recently been used for FER. Chen et al. [26] adopted a Convolutional Neural Network (CNN) to recognize facial expressions, which extracted the implicit features of images through the convolutional layers of the network and used the pooling layers to reduce the features. Liu et al. [27] proposed a BDBN (Bilinear Deep Belief Network) for expression recognition. BDBN consists of a set of weak classifiers, each of which is responsible for the classification of one expression. However, the high accuracy of this method comes at the cost of high computational complexity and long training time. Lopes et al. [28] proposed to combine CNN and data preprocessing to classify facial expressions. In [29], a weighted mixture deep neural network (WMDNN) with two streams is used to extract facial expression features. Literature [30] proposed to use FRR-CNN (Feature Redundancy Reduced Convolutional Neural Network) for FER. These two approaches achieve better FER results than other deep learning methods.
An autoencoder is a symmetrical neural network that learns the features of input data by minimizing the discrepancy between the original data and its reconstruction. Recently, autoencoder networks have been applied to FER. Lv et al. [40] adopted HOG, DBN, and Gabor to extract features and utilized a stacked autoencoder (SAE) to recognize expressions. Zeng et al. [41] proposed a deep sparse autoencoder (DSAE) to recognize facial expressions, which can learn robust and discriminative features from data. This method combined geometric and appearance features into a high-dimensional feature and achieved an accuracy of 95.79%.
Proposed method
Method review
We propose a feature fusion method for FER. The method has two key schemes: handcrafted feature extraction and feature fusion. Figure 1 illustrates the framework of the proposed method.

Illustration of the proposed method for facial expression recognition. First, the original images are preprocessed into irregular blocks of interest. For each block, the texture, geometric, and semantic features are extracted by the proposed PD-LDN, the seven-order moment, and BOVW respectively. Then, the final features are fused by the improved autoencoder network. Finally, the features are sent to the softmax classifier for facial expression recognition.
As shown in Fig. 1, the facial images are first partitioned into blocks. We employ the AAM (Active Appearance Model) [31] to mark feature points on the face image. The face image is then divided into several irregular blocks according to the connections between these feature points. These blocks segment the expression features well, and the size of each block changes with the expression.
Then, we perform feature extraction, which consists of PD-LDN to extract texture features, the seven-order moment to extract geometric features, and the improved BOVW to extract semantic features. Finally, the features are fused by the autoencoder network and then classified by softmax to recognize the facial expression. Below, we present the details of the key schemes: the proposed texture feature extraction method PD-LDN, the semantic feature extraction method (improved BOVW), and the feature fusion model.
LDN is a typical texture feature-based method. It extracts the directional and edge features of the local neighborhood, which reflect the local texture information of the image well and overcome the sensitivity of LBP to illumination and noise. However, some local information may be lost, because its operator performs the mask calculation only on the surrounding pixels of the local neighborhood and ignores the pixel at the neighborhood center. Besides, LDN performs the mask calculation in 8 directions for each neighborhood, resulting in high computational complexity.
To solve these problems, this paper proposes a local description operator, PD-LDN. PD-LDN encodes the edge response of the local neighborhood only in the two main directions and obtains the pixel information in each coding direction through an adaptive threshold, thereby capturing more detailed local texture information and more representative features.
Figure 2 shows the Robinson masks [33]. M0 ∼ M7 are the template matrices of the 8 directions. Owing to the symmetry of the Robinson masks, the edge response values calculated with M0 ∼ M3 and with M4 ∼ M7 are opposite to each other. We therefore conduct the mask calculation only in the four directions M0 ∼ M3.

The Robinson mask.
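The symmetry argument can be checked numerically. The following is a minimal sketch, assuming the standard Robinson compass masks (45-degree rotations of a Sobel-like base mask); the ordering of M0 to M7 used here and the helper names `robinson_masks` and `edge_responses` are illustrative assumptions and may differ from Fig. 2.

```python
import numpy as np

# Border cells of a 3x3 neighborhood in clockwise order, starting at the top-left.
RING = [(0, 0), (0, 1), (0, 2), (1, 2), (2, 2), (2, 1), (2, 0), (1, 0)]
# Border values of the assumed base mask [[-1, 0, 1], [-2, 0, 2], [-1, 0, 1]].
BASE_RING = [-1, 0, 1, 2, 1, 0, -1, -2]


def robinson_masks():
    """Return 8 compass masks, each a 45-degree rotation of the previous one."""
    masks = []
    for k in range(8):
        mask = np.zeros((3, 3), dtype=int)  # the center weight stays 0
        for (r, c), value in zip(RING, np.roll(BASE_RING, k)):
            mask[r, c] = value
        masks.append(mask)
    return masks


def edge_responses(patch, masks):
    """Edge response of a 3x3 patch for each mask (element-wise product, summed)."""
    patch = np.asarray(patch)
    return [int(np.sum(m * patch)) for m in masks]


masks = robinson_masks()
patch = np.array([[12, 40, 7], [90, 55, 3], [61, 8, 25]])  # arbitrary example values
responses = edge_responses(patch, masks)

# Masks four steps apart are negatives of each other, so are their responses.
for k in range(4):
    print(k, responses[k], responses[k + 4])
```

Because each mask in the second half is the negative of its counterpart in the first half, the absolute responses of the last four directions carry no additional information.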
In LDN, the mask value at the center pixel is 0, which means that the center pixel is ignored and much representative information is lost. PD-LDN takes the influence of the central pixel into account in the coding process. As shown in Fig. 3, PD-LDN places each pixel at the center of a 3 × 3 neighborhood and uses the Robinson masks to calculate the edge response of this pixel, as shown in Equation (1):

The coding process of PD-LDN.
The directions of the two largest absolute edge response values are taken as the coding directions D1 and D2, that is:
The corresponding pixel values of D1 and D2 are subtracted from the central pixel value respectively to obtain difference values. The difference values are binarized to obtain two binary patterns B1 and B2 by Equation (3):
The threshold δ is adaptively adjusted according to the pixel value in the neighborhood, and the calculation process is as follows:
First, the mean value of all pixels in the neighborhood is computed. The contrast value of each pixel is then obtained by subtracting each pixel value g_v in the neighborhood from the mean value. A second mean value is then computed.
Finally, the adaptive threshold δ can be obtained by:
In Equations (4)–(7), T is the number of pixels in the neighborhood and g_v is the pixel value.
Then, we combine the direction codes D1 and D2 and the binary codes B1 and B2 to get the PD-LDN code of the point (x, y) in the image. The value of PD-LDN can be obtained by converting the binary PD-LDN into decimal PD-LDN:
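To make the coding steps concrete, the following is a rough sketch of how a 6-bit code could be assembled for one 3 × 3 neighborhood. Several details are assumptions made for illustration only: the mask ordering, the choice of the neighbor that corresponds to each direction, the binarization rule (1 when the difference is at least δ), the bit layout D1|D2|B1|B2, and the use of a caller-supplied threshold `delta` instead of the adaptive threshold. The function names are hypothetical.

```python
import numpy as np

# Border cells of a 3x3 neighborhood in clockwise order, starting at the top-left.
RING = [(0, 0), (0, 1), (0, 2), (1, 2), (2, 2), (2, 1), (2, 0), (1, 0)]
BASE_RING = [-1, 0, 1, 2, 1, 0, -1, -2]  # assumed Robinson base mask, border values


def four_masks():
    """Masks for the four directions that are kept (assumed ordering M0 to M3)."""
    masks = []
    for k in range(4):
        mask = np.zeros((3, 3), dtype=int)
        for (r, c), value in zip(RING, np.roll(BASE_RING, k)):
            mask[r, c] = value
        masks.append(mask)
    return masks


def pd_ldn_code(patch, delta):
    """Return an integer code in [0, 63] for one 3x3 patch (illustrative only)."""
    patch = np.asarray(patch, dtype=float)
    responses = np.array([np.sum(m * patch) for m in four_masks()])
    # Two directions with the largest absolute responses (2 bits each).
    order = np.argsort(-np.abs(responses), kind="stable")
    d1, d2 = int(order[0]), int(order[1])
    center = patch[1, 1]
    bits = []
    for d in (d1, d2):
        # Assumption: the neighbor lying where mask d has its largest weight.
        r, c = RING[(3 + d) % 8]
        # Assumption: binarize the center-minus-neighbor difference against delta.
        bits.append(int(center - patch[r, c] >= delta))
    # Assumed bit layout: D1 (2 bits) | D2 (2 bits) | B1 | B2.
    return (d1 << 4) | (d2 << 2) | (bits[0] << 1) | bits[1]


example = [[10, 10, 10], [10, 50, 10], [10, 10, 10]]
print(pd_ldn_code(example, delta=20))
```

Whatever the exact layout, two 2-bit direction codes plus two binary codes give 6 bits, i.e. 64 possible code values per pixel.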
As shown in Fig. 4, the PD-LDN histogram of each block is obtained by performing the PD-LDN calculation on each pixel. The PD-LDN histograms of the 15 blocks are connected in series to obtain the feature histogram of the entire expression image. The dimension of each block is 2^6 = 64, so the PD-LDN feature dimension of an image is 64 × 15 = 960.

The feature extraction process of PD-LDN.
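The dimension count can be illustrated with a short sketch: each block yields a 64-bin histogram of its code values, and the 15 histograms are concatenated. The random codes below are stand-in data, and the helper names are hypothetical.

```python
import numpy as np

NUM_CODES = 64   # 2**6 possible PD-LDN code values
NUM_BLOCKS = 15  # number of irregular blocks per face image


def block_histogram(codes):
    """Count how often each of the 64 code values occurs in one block."""
    codes = np.asarray(codes, dtype=np.int64).ravel()
    return np.bincount(codes, minlength=NUM_CODES)


def image_feature(blocks):
    """Concatenate the per-block histograms into one feature vector."""
    return np.concatenate([block_histogram(b) for b in blocks])


rng = np.random.default_rng(0)
# Stand-in data: irregular blocks contain different numbers of pixels.
blocks = [rng.integers(0, NUM_CODES, size=rng.integers(50, 200))
          for _ in range(NUM_BLOCKS)]
feature = image_feature(blocks)
print(feature.shape)
```

The histogram length depends only on the number of code values, so blocks of different sizes still produce a fixed-length feature.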
As the local features of an image have certain similarities with the vocabulary of natural language, they can be treated as a visual vocabulary [34]. Because the BOVW model performs feature extraction on the whole expression image, some image areas become fragmented, resulting in a low expression recognition rate [10]. Therefore, we propose a method based on the key areas of the face and BOVW. Firstly, face images are segmented according to the feature points located by AAM. The blocks of the eyes, nose, and mouth are segmented to form key regions. These key regions are normalized to 41 × 130. Then, the semantic feature is extracted by the BOVW model. The process is illustrated in Fig. 5.

Schematic diagram of the construction of the BOVW model. Three dictionaries, each consisting of k words, are obtained by clustering DSIFT features. The cluster centers are denoted as triangles, squares, hexagons, and circles. The statistical word distribution is finally represented as a histogram of visual word frequencies, that is, the semantic feature of an expression.
The BOVW model usually adopts SIFT (Scale Invariant Feature Transform) to represent the key points. SIFT may extract different numbers of key points for different expressions, so the dimension of the features extracted from each image may differ, which adversely affects the subsequent clustering operations. This paper therefore introduces DSIFT to describe facial expression images.
The extracted DSIFT features of the key regions are clustered by k-means to obtain three visual dictionaries. Each word in a dictionary is a cluster center and is represented as a 128-dimensional vector. Each dictionary consists of k words. The Euclidean distance between a DSIFT feature vector and each word in the dictionary is calculated, and the feature vector is mapped to the word with the smallest distance. Finally, the semantic feature of an expression is obtained from the statistical distribution of the mapped words.
As a kind of neural network, an autoencoder can explore the deep information and correlation existing in the features [35]. To improve its representation ability and robustness and to achieve better feature fusion, this paper improves the traditional autoencoder network, as shown in Fig. 6.
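The word assignment and counting step can be sketched as follows. The sketch assumes the dictionary (cluster centers) has already been obtained, for example by k-means, and uses tiny synthetic 128-dimensional vectors instead of real DSIFT descriptors; `bovw_histogram` is a hypothetical helper name.

```python
import numpy as np


def bovw_histogram(descriptors, dictionary):
    """Map each descriptor to its nearest word and count word occurrences."""
    descriptors = np.asarray(descriptors, dtype=float)  # shape (n, 128)
    dictionary = np.asarray(dictionary, dtype=float)    # shape (k, 128)
    # Pairwise Euclidean distances between descriptors and words, shape (n, k).
    diff = descriptors[:, None, :] - dictionary[None, :, :]
    distances = np.sqrt(np.sum(diff ** 2, axis=2))
    nearest = np.argmin(distances, axis=1)
    return np.bincount(nearest, minlength=len(dictionary))


ones = np.ones(128)
# Synthetic dictionary with k = 3 words and four synthetic descriptors.
dictionary = np.stack([0.0 * ones, 1.0 * ones, 2.0 * ones])
descriptors = np.stack([0.1 * ones, 0.9 * ones, 1.2 * ones, 2.3 * ones])
histogram = bovw_histogram(descriptors, dictionary)
print(histogram)
```

The histogram length equals the number of words k regardless of how many descriptors an image produces, which is what makes the semantic feature fixed-length.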

The proposed autoencoder structure. Xp, Xg, and Xb are the input vectors of PD-LDN, seven-order moment, and BOVW features. Wp, Wg, and Wb are the weight matrices. hp, hg, and hb are the hidden layer vectors.
When the number of autoencoder network layers is higher than 3, over-fitting tends to occur [36]. To achieve better fused features, we designed a 3-layer network model.
In Fig. 6, Layer 1 is the input layer, Layer 2 is the hidden layer, Layer 3 is the output layer.
The process of network coding can be expressed as:
Then, all of the hidden layer nodes are combined to form a hidden layer vector for decoding. The hidden layer vector
The weight matrix and the offset vector are adjusted by the gradient descent method as (14) and (15):
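The three-branch structure described in this section can be sketched as a forward pass. This is only an illustration under stated assumptions: a sigmoid activation, random untrained weights, and a decoder that reconstructs the concatenated input; none of these is confirmed by the text. The input and hidden sizes are the ones reported in the parameter settings (960, 105, 300 and 600, 75, 210).

```python
import numpy as np


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


# Input and hidden sizes for the PD-LDN (p), moment (g), and BOVW (b) branches.
IN_DIMS = {"p": 960, "g": 105, "b": 300}
HID_DIMS = {"p": 600, "g": 75, "b": 210}
TOTAL_IN = sum(IN_DIMS.values())    # 1365
TOTAL_HID = sum(HID_DIMS.values())  # 885

rng = np.random.default_rng(0)
# One weight matrix and offset vector per branch (random, untrained placeholders).
encoders = {k: (rng.normal(0.0, 0.01, (HID_DIMS[k], IN_DIMS[k])),
                np.zeros(HID_DIMS[k])) for k in IN_DIMS}
# Assumption: a single decoder maps the combined hidden vector back to the input size.
W_dec = rng.normal(0.0, 0.01, (TOTAL_IN, TOTAL_HID))
b_dec = np.zeros(TOTAL_IN)


def encode(x):
    """Encode each feature with its own branch, then combine the hidden vectors."""
    parts = [sigmoid(encoders[k][0] @ x[k] + encoders[k][1]) for k in ("p", "g", "b")]
    return np.concatenate(parts)


def decode(h):
    """Unified decoding of the combined hidden layer vector."""
    return sigmoid(W_dec @ h + b_dec)


x = {k: rng.random(IN_DIMS[k]) for k in IN_DIMS}  # stand-in feature vectors
h = encode(x)
y = decode(h)
print(h.shape, y.shape)
```

In training, the weights and offsets would be updated by gradient descent on a reconstruction loss; that step is omitted here.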
In this section, we provide a detailed experimental analysis of the proposed method and compare it with some state-of-the-art methods.
Datasets
JAFFE
The JAFFE dataset [37] contains 213 facial expression images of 10 women, covering 7 expressions: angry, scared, happy, disgusted, neutral, sad, and surprised. Each image has a resolution of 256 × 256. In this paper, all the images in JAFFE are selected for the experiment, and the expression labels are assigned as: 1-fear, 2-disgust, 3-happy, 4-sad, 5-angry, 6-neutral, 7-surprise.
CK+
The CK+ (Extended Cohn-Kanade) dataset [38] was released in 2010. CK+ contains 593 expression image sequences from 123 people. The resolution of the frames is 640 × 490. This paper selects 312 sequences from the CK+ dataset, a total of 5200 images, for the experiment. The labels of all expression images are assigned in the same way as for JAFFE.
PD-LDN results and analysis
The results cover the influence of the block mode and the threshold on the recognition rate, as well as further detailed experiments on the proposed PD-LDN. We adopt SVM as the classifier.
The impact of the block mode on PD-LDN
The images are first divided using different block modes. The average of 10 accuracies is taken as the result of each mode, as shown in Table 1. The irregular block mode yields a better result than the regular mode.
Comparison of different block modes
We compare the adaptive threshold, which is set according to the pixel distribution in the neighborhood, with a fixed threshold in PD-LDN. Fig. 7 shows that, with a fixed threshold, PD-LDN achieves its best result when the threshold is set to 40 on JAFFE and 60 on CK+.

The PD-LDN performance with different thresholds.
We conduct the experiment 10 times with the adaptive and the fixed thresholds respectively, and take the average recognition rates as the final result. In Table 2, PD-LDNδ = 40 and PD-LDNδ = 60 denote the fixed thresholds 40 and 60, and PD-LDN K denotes PD-LDN with an adaptive threshold. The adaptive threshold achieves a better recognition rate and is therefore used in this paper.
Comparison of PD-LDN with different thresholds
We compare PD-LDN with several traditional texture feature-based methods. The experimental results are shown in Table 3. The accuracy obtained by the proposed PD-LDN is higher than that of the other methods.
Comparison with different texture feature-based approaches
Gaussian noise of various levels is added to the facial expression images, and the different methods are used to extract features from the noisy images. The recognition rates are shown in Fig. 8. Under the influence of different levels of Gaussian noise, the results of the other texture feature-based methods decrease significantly, whereas PD-LDN shows a remarkable anti-noise ability.

The recognition rate (%) of different approaches with different Gaussian noise.
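A noise test of this kind can be set up along the following lines. This is a minimal sketch, assuming 8-bit grayscale images and zero-mean additive noise; the noise levels below are placeholders, since the levels used in the experiment are not stated here, and `add_gaussian_noise` is a hypothetical helper.

```python
import numpy as np


def add_gaussian_noise(image, sigma, seed=None):
    """Add zero-mean Gaussian noise with standard deviation sigma to an 8-bit image."""
    rng = np.random.default_rng(seed)
    noisy = image.astype(float) + rng.normal(0.0, sigma, size=image.shape)
    # Round and clip back to the valid 8-bit intensity range.
    return np.clip(np.rint(noisy), 0, 255).astype(np.uint8)


rng = np.random.default_rng(0)
image = rng.integers(0, 256, size=(256, 256), dtype=np.uint8)  # stand-in image
for sigma in (5, 10, 20):  # placeholder noise levels
    noisy = add_gaussian_noise(image, sigma, seed=1)
    print(sigma, float(np.mean(np.abs(noisy.astype(int) - image.astype(int)))))
```

Each noisy copy would then be passed through the same feature extraction and classification pipeline as the clean images.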
Parameter settings
The number of hidden layer nodes in the autoencoder network influences the effect of feature fusion. If the number of hidden layer nodes is too small, the network lacks sufficient learning ability; conversely, too many nodes greatly increase the complexity of the network structure, and the network tends to fall into a local minimum during learning.
We therefore set the number of hidden layer nodes by the control variable method: when the nodes of one feature are varied, the other two are held fixed. The dimension of the PD-LDN feature is 15 × 64 = 960, that of the seven-order moment feature is 7 × 15 = 105, and that of the BOVW feature is 300. When the nodes of one feature are adjusted, the nodes of the other two features are set to half of their input dimensions. The hidden layer nodes of the PD-LDN, seven-order moment, and BOVW features are denoted hp, hg, and hb respectively. According to the experiments shown in Fig. 9, the optimal numbers of hidden layer nodes are hp = 600, hg = 75, and hb = 210.

Experiments on the number of hidden layer nodes. hp, hg, and hb are the numbers of hidden nodes of the PD-LDN, seven-order moment, and BOVW features respectively.
In semantic feature extraction, the parameters to be tuned are the number of words k in the visual dictionary and the sampling density m of DSIFT, both of which are selected through experiments. We test k = 100, 200, 300, 400 and m = 2, 4, 6, 8. The experiments are run 3 times on JAFFE and CK+ respectively, and the average value is taken as the final result. As shown in Fig. 10, the best accuracy is obtained when the sampling density m is 2 and the number of words k in the visual dictionary is 300.

Recognition rates of DSIFT with different k (the number of words) and m (the sampling density) (a) On JAFFE. (b) On CK+.
To validate the effectiveness of the proposed method, we compute confusion matrices in the experiments. Figure 11 is the confusion matrix of the proposed method on JAFFE. The recognition rates of anger, happiness, and surprise are higher because of the large variation of the facial muscles. The neutral expression is often confused with other expressions because it shows no remarkable change and its feature distribution occupies the largest area in the feature space, so the recognition accuracy of neutral is reduced.

Confusion matrix on JAFFE. AN, NO, DI, FE, HA, SA, SU denote anger, neutral, disgust, fear, happiness, sadness and surprise respectively.
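For reference, a confusion matrix and the per-class recognition rates read from it can be computed as in the sketch below. The label order follows the caption; the convention that rows are true classes and columns are predicted classes, and the toy labels, are assumptions for illustration.

```python
import numpy as np

LABELS = ["AN", "NO", "DI", "FE", "HA", "SA", "SU"]


def confusion_matrix(y_true, y_pred, n_classes):
    """Rows index the true class, columns index the predicted class."""
    cm = np.zeros((n_classes, n_classes), dtype=int)
    for t, p in zip(y_true, y_pred):
        cm[t, p] += 1
    return cm


def per_class_rate(cm):
    """Fraction of each true class that was predicted correctly (0 if class absent)."""
    totals = cm.sum(axis=1)
    return np.divide(np.diag(cm), totals, out=np.zeros(len(cm)), where=totals > 0)


# Toy labels: indices into LABELS.
y_true = [0, 0, 1, 1, 1, 6]
y_pred = [0, 1, 1, 1, 0, 6]
cm = confusion_matrix(y_true, y_pred, len(LABELS))
rates = per_class_rate(cm)
for name, rate in zip(LABELS, rates):
    print(name, round(float(rate), 3))
```

Off-diagonal entries in a row show which expressions a given class is mistaken for, which is how the confusions between neutral and other expressions are read from the figure.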
Figure 12 is the confusion matrix of the proposed method on CK+. The recognition rates of the sad and neutral expressions are lower because their eyes and mouths look similar, and the features of the eyes and mouth are difficult to extract precisely enough to distinguish them. Anger, surprise, and happiness are almost always predicted correctly because these images show larger facial changes and their features are obvious.

Confusion matrix on CK+. AN, NO, DI, FE, HA, SA, SU denote anger, neutral, disgust, fear, happiness, sadness and surprise respectively.
The recognition results of the three single-feature methods and of different feature fusion methods, all classified by softmax, are compared in Table 4. According to the experiment, the proposed fused features improve the performance of expression recognition and work best.
Comparison with the individual components of the combination
To further validate the effectiveness of the proposed method, Table 5 shows the comparison of the recognition rate with the state-of-the-art FER methods.
In [13], GLTP lacks facial expression information because its features are incomplete. The method in [25] connected LBP, LDN, and EOH features in series for recognition. Since all three are texture features, the facial expression cannot be represented from multiple aspects. Besides, the three features are fused only by simple concatenation, and no deep information is discovered. Some deep learning approaches are also compared, such as CNN, DNN, and FRR-CNN in [26], [29], and [30]. They are trained with our training data for a fair comparison. The results in Table 5 show that our approach performs better than the others. The proposed method extracts texture, geometric, and semantic information through different features, which benefits the recognition of expressions.
Comparison with state-of-the-art approaches (%)
In this paper, we have presented a novel feature fusion approach for FER that recognizes expressions with high accuracy. In particular, a novel texture feature extraction method, PD-LDN, has been proposed to obtain robust and low-dimensional features. Handcrafted texture, geometric, and semantic features have been combined into a multi-feature representation and fused by an autoencoder network, which learns discriminative features from images to identify the expressions. The experimental results have demonstrated that the proposed approach outperforms other state-of-the-art methods in 7-class expression recognition.
