Efficient facial expression recognition based on convolutional neural network

Abstract

The goal of research in Facial Expression Recognition (FER) is to build a model that is robust and has strong recognition ability. In this paper, we propose a new scheme for FER systems based on a convolutional neural network. Part of the regular convolution operation is replaced by depthwise separable convolution to reduce the number of parameters and the computational workload; a self-adaption joint loss function is adopted to improve the classification performance. In addition, we balance our train set through data augmentation, and we preprocess the input images through illumination processing, face detection, and other methods to raise the expression recognition rate. Experiments to validate our methods are conducted on the TensorFlow platform with the Fer2013 dataset. We analyze the experimental results before and after train set balancing and network model modification, and we compare our results with those of other researchers. The results show that our method increases the expression recognition rate under the same experimental conditions. We further conduct an experiment on our own expression dataset relevant to driving safety, and it yields similar results.

Keywords

Convolutional neural network, depthwise separable convolution, facial expression recognition, self-adaption joint loss function

1. Introduction

Facial expression is the most effective human channel of emotional communication other than language. Mehrabian [1], a well-known American psychologist, proposed a formula stating that words, tone of voice, and facial expression account for 7%, 38%, and 55% of emotional exchange, respectively. Facial expression recognition extracts facial expression features from the original input images and classifies them according to human emotional expressions, such as anger, disgust, fear, happiness, sadness and surprise; it is thus deemed the critical technology of an emotion monitoring system. Facial expression recognition is widely applied to scenarios such as driver-emotion monitoring in smart transportation, clinical testing in medical systems, and lie-detection analysis in criminal cases, and it has attracted much research in recent years.

With the development of computer GPUs and the establishment of massive facial expression databases, many researchers now work on expression recognition using deep learning technology. Deep learning combines feature extraction and facial expression classification, using multi-layer nonlinear transformations to implicitly extract features and obtain a more abstract, high-level feature representation. With improved representation ability and robustness, such models can significantly boost the performance of FER. The Convolutional Neural Network (CNN) [2] is a popular deep learning model for computer vision and is widely used for the facial expression recognition problem [3]. A CNN is a type of deep neural network with convolutional layers. It can directly use pixel values as input. By using local receptive fields, parameter sharing, sparse connections and down-sampling of the face image space, a CNN extracts local and other features from the input data through autonomous learning, implicitly attaining a more abstract global feature representation of the image. A CNN is also robust to shifting, scaling, rotation, and other transformations of the image. Deep convolutional neural networks perform impressively on face-related recognition tasks [4, 5]. However, increasing the number of network layers not only brings stronger feature extraction capability but also significantly increases the computational cost and storage requirements [6]. In addition, traditional CNNs use softmax loss to penalize misclassified samples, which forces the features of different classes apart. However, facial expression recognition also suffers from high intra-class variation: differences in the facial features of different people can hinder expression classification. Moreover, some expression classes are less distinct from each other: happy expressions are usually exaggerated, with distinct features; sad expressions, on the other hand, are more subtle and have features similar to those of fear expressions.

Having sufficient and balanced training data is important for the design of a deep expression recognition system. However, in many facial expression databases, the number of expressions in each class is not balanced, which would cause insufficient training of expression classes with fewer samples.

In this paper, we propose a new scheme for an FER system based on convolutional neural network:

1) adopting depthwise separable convolution to replace part of the regular convolution operation to optimize the training model;
2) applying the self-adaption joint loss function to decrease the feature diversity within classes and increase the feature distance between different expression classes to help differentiate the facial expressions; and
3) conducting training and test analysis on the Fer2013 dataset, using data augmentation in the preprocessing phase to increase the sample size of the insufficient classes.

In addition, we compile a facial expression dataset containing five hazardous emotions related to driving safety, and we test our methods on this dataset. The results of our experiment show that the proposed method generalizes well.

2. Related work

The work presented in the literature to solve the facial expression recognition problem with deep learning includes two main groups: dynamic sequence-based and still image approaches. The former classification exploits emotional recognition for dynamic image sequences [7, 8], and the latter recognizes expressions from still images. Since facial expression recognition for dynamic image sequences can benefit from the results from still images, our study focuses on static images.

To increase the network depth without efficiency reduction, based on the foundation architecture of a CNN, Kaiming He et al. introduced residual networks that are easier to optimize and can improve the accuracy through the considerably increased depth [9].

For face recognition, Schroff et al. [10] proposed FaceNet, which directly learns a mapping from face images to a compact Euclidean space where distances directly correspond to a measure of face similarity. To distinguish the features, triplet loss was proposed. The idea is that the feature distance between the same identities should be as small as possible, and the feature distance between different identities should be as large as possible. For FER, exponential triplet-based loss [11] was utilized to give difficult samples more weight when updating the network. Li et al. [12] proposed a loss function that can effectively enhance the discriminant force of the depth features between different categories, namely, the loss function combining the angular margin loss and center loss, and they used the VGGFace2 dataset to conduct face classification training on the network.

Deep neural networks require sufficient training data to ensure generalizability to a given recognition task [13]. However, most publicly available databases for FER do not contain a sufficient quantity of images. Currently, most deep learning models of FER are trained on standard datasets. They usually yield fair results on these datasets, but their accuracy tends to drop severely once they are applied to real-life scenarios. This phenomenon is due to the following: 1) The training datasets mostly come from posed photos taken in laboratories. These photos differ greatly from data collected in daily life, which lowers the models’ generalizability. 2) Many datasets have a small data volume. Datasets such as JAFFE and CK+ have fewer than 100 training samples for each facial expression class.

A large, complete dataset is a fundamental requirement for a method to be reliable. The Fer2013 facial expression dataset (Goodfellow, 2013) is the official dataset for the Facial Expression Recognition Challenge on Kaggle. It contains facial expression data of people from different races, age groups, and genders. All its images were downloaded by a web crawler, and some contain disturbances such as illumination changes, variations in pose (side face), and occlusions (glasses, hands, hair, beard, accessories), which resemble real-life scenarios. Models trained on this dataset have stronger generalizability and are thus more practical. On the Fer2013 dataset, Tang et al. [14] combined a CNN with an SVM loss function, building a model that won the Fer2013 Kaggle Expression Recognition Challenge in 2013 with a recognition rate of 71.2%. Devries et al. [15] added prediction of the locations of facial landmarks to facial expression prediction, giving a multitask model with a final recognition rate of 67.21%. Zhang et al. [16] implemented a bridging layer on Tang et al.’s work, fusing data from multiple sources, including the Fer2013 dataset, and adding additional input into the fully connected layer for face representation learning, with a better performance of 75.1%. Guo et al. [17] adopted an additional classifier and an exponential triplet loss function, increasing the weight of the hard samples, and raised the recognition rate to 71.33%. Kim et al. [18] integrated 9 CNNs, pushing the rate to 73.73%. Pramerdorfer et al. [19], by integrating 8 DCNNs of modified structure including VGG, Inception, and ResNet, achieved a recognition rate of 75.2%.

Some emotions can jeopardize driving safety. According to research performed at Virginia Tech, sadness or anger can be more hazardous than cell phone use while driving: the odds of a traffic accident under such emotions are 5 times those with cell phone use. Excitement makes people drive faster; disgust causes a more offensive driving style; and dullness distracts a driver’s attention. These five emotions (sadness, anger, excitement, disgust, and dullness) are considered harmful to driving. To further study their corresponding expressions, we constructed a dataset consisting of 1,580 raw images of eleven subjects. Each image is 1280 $\times$ 720 pixels. The subjects are shown eliciting videos for each emotion, assisted by verbal induction, and we record their corresponding expressions. These expression images are then manually selected into our dataset. In this paper, we use Fer2013 and our own dataset to train and test our expression recognition model, respectively.

3. Proposed scheme

In this study, we propose an efficient scheme to improve the recognition accuracy based on a convolutional neural network, which includes 3 parts: optimizing the network architecture with depthwise separable convolution, adopting a self-adaption joint loss function, and using data augmentation in the preprocessing phase to increase the sample size for the insufficient classes.

3.1 Optimizing the network architecture

In this paper, we design an improved convolutional neural network architecture for facial expression recognition. This network is modified on the basis of a baseline convolutional neural network, as shown by Fig. 1. In addition to the input layer, this network has a depth of 15 layers, including 6 convolutional layers (C1, C2, C3, C4, C5*, and C6*), 3 pooling layers (P1, P2, and P3), 2 shortcut layers (S1 and S2), one separable convolutional layer (SC), 2 fully connected layers, and one classifier. A batch normalization layer is added after the convolutional layers, shortcut layers, and separable convolutional layer for normalization.

The input layer is a 48 $\times$ 48 pixels matrix of a face image. The convolutional layers and pooling layers have several feature maps; each feature map is connected to part of its former feature map. The first three convolutional layers C1, C2 and C3 all use 64 kernels for the convolution operation; C4 and C5* use 128 and 256 kernels, respectively, for dimensional enhancement to increase the number of channels and receive more features; C6* uses 128 kernels for dimension reduction; S1 and S2 both use 64 convolutional kernels; and SC uses 256 convolutional kernels. C1, C5*, C6* use a kernel size of 1 $\times$ 1; C2, C3, C4, S1, S2 and SC use a kernel size of 3 $\times$ 3. The pooling layer uses a 2 $\times$ 2 sliding window for max pooling. Fully connected layer F1 has 2048 neurons; F2 has 1024 neurons. The two fully connected layers have a dropout layer added behind them, moderating overfitting. F1 and pooling layer P3 are fully connected. The classification layer has 7 neurons. It classifies the output of the fully connected layer into 7 emotional classes: angry, disgust, fear, happy, sad, surprise, and neutral.

Figure 1.

Schematic diagram of the improved network architecture.

3.1.1 1 $\times$ 1 convolutional kernel

Lin et al. [20] proposed a network structure, namely, “Network-In-Network” (NIN), which enhances the model discriminability. Learning from them, we use a 1 $\times$ 1 convolutional kernel to linearly combine every pixel in different channels. This modification not only improves the network’s nonlinear mapping capability while preserving the precision but also enables dimensional enhancement and reduction of the network structure, achieving cross-channel interaction of information.
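
To make the cross-channel combination concrete, the following is a minimal sketch in plain Python, not the implementation used in this paper. It assumes a feature map stored as nested lists (height, width, channels) and a hypothetical weight matrix with one row per output channel; bias and activation are omitted. Choosing fewer or more output channels than input channels gives dimension reduction or enhancement, respectively.

```python
def conv1x1(feature_map, weights):
    """Apply a 1x1 convolution (no bias, no activation).

    feature_map: H x W x C_in nested lists.
    weights:     C_out x C_in nested lists, one weight vector per output channel.
    Returns an H x W x C_out nested list.
    """
    return [
        [
            # Each output channel is a weighted sum over the input channels
            # of the same pixel; neighbouring pixels are never mixed.
            [sum(w * v for w, v in zip(kernel, pixel)) for kernel in weights]
            for pixel in row
        ]
        for row in feature_map
    ]


# Illustrative 2x2 image with 3 channels, reduced to 2 channels.
image = [[[1, 2, 3], [4, 5, 6]],
         [[7, 8, 9], [0, 1, 0]]]
weights = [[1, 0, 0],       # output channel 0 copies input channel 0
           [0.5, 0.5, 0]]   # output channel 1 averages input channels 0 and 1
reduced = conv1x1(image, weights)
```

The spatial size is unchanged; only the channel dimension moves from 3 to 2 in this illustrative case.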

3.1.2 A stack of small kernels replacing large kernels

In convolutional neural networks, larger kernels have larger receptive fields, thus capturing more context at once. A large kernel is good at extracting features, but at the cost of a greater computation workload [21], which is a burden for model optimization. To solve this problem, Simonyan and Zisserman [22] replaced the large convolutional kernels in the VGG network with stacks of small kernels. They found not only that the number of parameters decreased, which helps prevent overfitting, but also that the feature recognition capability after training improved. In this paper, we thus use a stack of two 3 $\times$ 3 convolutional kernels instead of a single 5 $\times$ 5 kernel.
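
The trade-off can be checked with a short calculation. The sketch below is illustrative only: it assumes stride-1 convolutions without bias and, as a hypothetical setting, 64 input and 64 output channels for every layer.

```python
def conv_params(kernel_size, in_channels, out_channels):
    """Number of weights in one regular convolution layer (biases ignored)."""
    return kernel_size * kernel_size * in_channels * out_channels


def stacked_receptive_field(kernel_sizes):
    """Receptive field of stride-1 convolutions applied one after another."""
    field = 1
    for k in kernel_sizes:
        field += k - 1  # each layer widens the field by (k - 1) pixels
    return field


channels = 64  # assumption: same channel count in and out
single_5x5 = conv_params(5, channels, channels)
two_3x3 = 2 * conv_params(3, channels, channels)
field_two_3x3 = stacked_receptive_field([3, 3])
```

Under these assumptions the two stacked 3 $\times$ 3 layers cover the same 5 $\times$ 5 receptive field with 73,728 weights instead of 102,400, i.e., 72% of the parameters.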

3.1.3 Applying depthwise separable convolution

François Chollet [23] proposed depthwise separable convolution as almost an extreme version of Inception modules [24, 25, 26]. Regular convolution maps cross-channel correlations and spatial correlations simultaneously. Depthwise separable convolution first maps the spatial correlations only and then maps the cross-channel correlations separately. This change in the convolution operation also changes the computational complexity. For example, a Fer2013 input feature map has two spatial dimensions (width and height) of 48 each and a channel dimension of 1. Suppose that the two output spatial dimensions remain 48, with a channel dimension of 64 and a kernel size of 3 $\times$ 3. Then, for a regular convolution, the number of parameters is (3 $\times$ 3 $\times$ 1) $\times$ 64 $=$ 576, and the computational workload is 576 $\times$ 48 $\times$ 48 $=$ 1327104. For depthwise separable convolution, the number of parameters is (3 $\times$ 3 $\times$ 1) $+$ (1 $\times$ 1 $\times$ 1) $\times$ 64 $=$ 73, and the computational workload is 73 $\times$ 48 $\times$ 48 $=$ 168192. The two methods have a workload ratio of 1:0.127; that is, depthwise separable convolution reduces both the number of parameters and the computational workload to approximately one-eighth of those of a regular convolution. The comparison between the Xception architecture [23] and Inception V3 [25] shows that a smaller parameter size can still represent the image features while reducing the computational workload and improving the operating efficiency. Overall, depthwise separable convolution uses the model parameters more effectively. In our neural network architecture, SC is a depthwise separable convolution layer, which replaces a regular convolution operation to reduce the number of parameters and the computation workload.
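
The arithmetic above can be reproduced with a few lines of Python. This is a sketch of the counting rule used in the text (biases ignored, workload taken as parameters times output positions); the function names are illustrative and not part of any library.

```python
def regular_conv_cost(k, c_in, c_out, h_out, w_out):
    """Parameters and multiply count of a regular k x k convolution."""
    params = k * k * c_in * c_out
    return params, params * h_out * w_out


def separable_conv_cost(k, c_in, c_out, h_out, w_out):
    """Parameters and multiply count of a depthwise separable convolution."""
    depthwise = k * k * c_in   # one k x k spatial filter per input channel
    pointwise = c_in * c_out   # 1 x 1 convolution mixing the channels
    params = depthwise + pointwise
    return params, params * h_out * w_out


# Example from the text: 48 x 48 x 1 input, 64 output channels, 3 x 3 kernel.
regular = regular_conv_cost(3, 1, 64, 48, 48)
separable = separable_conv_cost(3, 1, 64, 48, 48)
ratio = separable[1] / regular[1]
```

Here `regular` is (576, 1327104), `separable` is (73, 168192), and `ratio` is about 0.127. In general the ratio equals 1/c_out + 1/k², so the saving approaches a factor of k² (nine for a 3 $\times$ 3 kernel) as the number of output channels grows.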

3.1.4 Explicit utilization of the feature information extracted by convolution

Different convolutional layers extract different features. These features are complementary. For deeper neural networks, if expression recognition and classification rely exclusively on features extracted from the last layer, then a proportion of the information in the mid-layers will be neglected. Lu et al. [27] used residual learning blocks to improve the training and optimization process of the deep convolutional neural network model, which improved the generalization ability of the network model for facial expression recognition while reducing the time cost of model convergence. In this paper, we adopt a residual learning method. Connections based on multi-layer residuals are used at the lower layers to fuse the mapping results, fully utilizing the low-level features and thus enriching the convolutional features. Without residual learning, the stacked layers must fit the desired underlying mapping $H(x)$ directly; with residual learning, they fit the residual $F(x)=H(x)-x$ instead, so that the block outputs $H(x)=F(x)+x$ , where the shortcut carries the identity $x$ . In this study, we achieve residual learning by adding a shortcut connection to each of the two consecutive 3 $\times$ 3 convolutions.
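
The shortcut can be illustrated with a minimal sketch. It assumes features held in a flat Python list and a hypothetical `transform` callable standing in for the stacked convolutions $F$; it is not the layer definition used in our network.

```python
def residual_block(x, transform):
    """Return F(x) + x, where `transform` plays the role of F."""
    fx = transform(x)
    if len(fx) != len(x):
        raise ValueError("F(x) must have the same shape as x for the addition")
    # The shortcut adds the unchanged input back onto the learned residual.
    return [f + v for f, v in zip(fx, x)]


# If the stacked layers learn F(x) = 0, the block reduces to the identity.
identity_out = residual_block([1.0, 2.0], lambda x: [0.0 for _ in x])
doubled_out = residual_block([1.0, 2.0], lambda x: [2 * v for v in x])
```

The first call shows why such blocks are easy to optimize: driving the residual to zero is enough to pass low-level features through untouched.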

3.2 Self-adaption joint loss function

Softmax loss is frequently used in multi-classification tasks in FER. It can be used to classify the extracted features and normalize the classification result. The formula [28] is given as follows:

$\displaystyle L_{S}=-\sum\limits_{i=1}^{m}\log\frac{e^{W_{y_{i}}^{T}x_{i}+b_{y_{i}}}}{\sum\nolimits_{j=1}^{n}e^{W_{j}^{T}x_{i}+b_{j}}}$ (1)

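In the formula, $m$ is the size of the batch, $n$ is the number of classes, $x_{i}$ is the feature of the $i$ th sample, $y_{i}$ is its class label, and $W_{j}$ and $b_{j}$ are the weight vector and bias of the $j$ th class.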

$\displaystyle P=\frac{e^{W_{y_{i}}^{T}x_{i}+b_{y_{i}}}}{\sum_{j=1}^{n}e^{W_{j}^{T}x_{i}+b_{j}}}$ (2)

$P$ represents the probability that sample $x_{i}$ belongs to class $y_{i}$ . A traditional convolutional neural network uses softmax loss to enhance the inter-class variations, which helps separate the different classes. However, in the field of FER, both the intra-class compactness and the inter-class dispersion affect the network’s classification capacity. For example, different faces with the same expression can result in some degree of dispersion within the same class, thus hurting the performance of expression recognition. The results of Wen et al. [29] also suggest that using only softmax loss in a convolutional neural network can enhance the dispersion among different classes, but intra-class diversity still exists.
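
As a small numerical illustration of Eqs (1) and (2), the sketch below computes the softmax loss in plain Python. It assumes the logits $W_{j}^{T}x_{i}+b_{j}$ have already been computed for each sample; subtracting the maximum logit is a standard numerical-stability step that does not change the result.

```python
import math


def softmax(logits):
    """Eq. (2) for every class: normalized exponentials of the logits."""
    shift = max(logits)  # avoids overflow; cancels out in the ratio
    exps = [math.exp(z - shift) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def softmax_loss(batch_logits, labels):
    """Eq. (1): sum over the batch of -log P(true class)."""
    return -sum(
        math.log(softmax(logits)[y])
        for logits, y in zip(batch_logits, labels)
    )


# With seven equally likely classes, each sample contributes log(7).
uniform_loss = softmax_loss([[0.0] * 7], [3])
```

A confident, correct prediction drives the per-sample term toward zero, whereas a confident wrong prediction makes it large, which is what pushes the classes apart.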

To increase the intra-class compactness and the network’s feature recognition capability, Wen et al. [29] proposed the center loss function as given in the following formula:

$\displaystyle L_{C}=\frac{1}{2}\sum\limits_{i=1}^{m}\|x_{i}-c_{y_{i}}\|_{2}^{2}$ (3)

$c_{y_{i}}$ represents the feature center of the $y_{i}$ th class; $x_{i}$ is the feature before entering the fully connected layer; and $m$ is the size of the batch. The center loss function shows that the sum of the square of the distance between each sample and the feature center should be as small as possible; i.e., the center loss concentrates features of the same class at its center, increasing the compactness. Wang et al. [30] added center loss to the loss function of Faster R-CNN in face detection and used it to monitor the learning of deep features for face/non-face classification.
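
Equation (3) translates directly into code. The following sketch assumes features stored as lists of floats and class centers held in a dictionary keyed by label; how the centers are updated during training is not shown here.

```python
def center_loss(features, labels, centers):
    """Eq. (3): half the summed squared distance to each sample's class center."""
    total = 0.0
    for x, y in zip(features, labels):
        center = centers[y]
        total += sum((xi - ci) ** 2 for xi, ci in zip(x, center))
    return 0.5 * total


# Two samples, two classes, both centers at the origin (illustrative values).
example = center_loss(
    features=[[1.0, 0.0], [0.0, 2.0]],
    labels=[0, 1],
    centers={0: [0.0, 0.0], 1: [0.0, 0.0]},
)
```

In this illustrative case the loss is 0.5 $\times$ (1 + 4) = 2.5, and it falls to zero only when every feature coincides with its class center.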

In our study, to enhance this inter-class distinction as well as minimize the intra-class distance, we adopt the self-adaption joint function of softmax loss and center loss, and the joint formula is given as follows:

$\displaystyle L=L_{S}+\lambda L_{C}=-\sum\limits_{i=1}^{m}\log\frac{e^{W_{y_{i}}^{T}x_{i}+b_{y_{i}}}}{\sum_{j=1}^{n}e^{W_{j}^{T}x_{i}+b_{j}}}+\frac{\lambda}{2}\sum\limits_{i=1}^{m}\|x_{i}-c_{y_{i}}\|_{2}^{2}$ (4)

In the formula, $\lambda$ is a self-adaption parameter that controls the two loss functions. It usually ranges from 0.0001 to 0.1. $\lambda$ is initially set to a small value to enhance the inter-class distinction. As the number of iterations increases, the value of $\lambda$ increases, and the weight of the center loss function gradually increases to concentrate the features of the same class at its center. In this process, our model is trained to specifically target expression features that are different from face features, which is helpful for expression classification.

3.3 Data balance and augmentation

To guarantee the model’s generalizability, a deep neural network requires enough training data [13]. Given that transformations such as shifting, scaling, and rotation of images should not affect the CNN’s classification result, data augmentation increases the quantity of images by randomly varying the sample data and combining multiple operations, generating more unseen training samples to make the network more robust to shifted and rotated faces. Data augmentation is divided into real-time augmentation and offline augmentation. Real-time augmentation applies online cropping, scaling and other transformations to the input data of each mini-batch, and it is already built into deep learning toolkits. Offline augmentation preprocesses data through spatial geometric, visual, and other transformations, further expanding the data volume and diversity. In our case, even though the Fer2013 dataset already has a large data volume, data augmentation can still improve the overall performance of our constructed model.

In general, the quantity distribution of various categories in the dataset is inconsistent, which would cause insufficient training of those classes with fewer samples and reduce the classification accuracy of the samples. Balancing out the class that has fewer expression samples than the other classes has become especially important. Commonly used data augmentation methods to solve the imbalance of a dataset usually involve resampling the data, including the following [31]:

(1) Random undersampling: The distribution of classes is balanced by randomly removing samples from the majority classes.
(2) Random oversampling: The distribution of classes is balanced by randomly replicating or augmenting samples of the minority classes.

In addition, Cluster-based Sampling, the Synthetic Minority Over-sampling Technique, the Modified Synthetic Minority Oversampling Technique, etc. can also be used. In this study, we use the random-oversampling offline-augmentation method “four-corner cropping $+$ center cropping $+$ horizontal flipping” to enlarge the classes with fewer samples to the volume of the largest class. Four-corner cropping and center cropping of the original image, followed by scaling to a uniform size, enrich the training data in the position dimension. Horizontal flipping of each crop additionally enriches the training data with mirrored orientations. Together, the five crops and their mirror images yield ten samples per original image. The advantage of this approach is that classes with fewer samples can present a richer representation without losing information.
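
The offline augmentation can be sketched as follows. This is a minimal plain-Python illustration on an image stored as a list of pixel rows; the crop size is a free parameter here (the value we used is not assumed), and the rescaling of each crop back to a uniform size is omitted because it requires an interpolation routine.

```python
def crop(image, top, left, size):
    """Cut a size x size window whose upper-left corner is (top, left)."""
    return [row[left:left + size] for row in image[top:top + size]]


def hflip(image):
    """Mirror the image left to right."""
    return [row[::-1] for row in image]


def ten_crop(image, size):
    """Four corner crops, one center crop, and their horizontal flips."""
    height, width = len(image), len(image[0])
    bottom, right = height - size, width - size
    origins = [(0, 0), (0, right), (bottom, 0), (bottom, right),
               (bottom // 2, right // 2)]
    crops = [crop(image, top, left, size) for top, left in origins]
    return crops + [hflip(c) for c in crops]


# Illustrative 5 x 5 image whose pixel value encodes its position.
image = [[r * 5 + c for c in range(5)] for r in range(5)]
variants = ten_crop(image, 3)
```

Each original image produces ten variants, which matches a tenfold enlargement of a minority class.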

Algorithm 1 summarizes the face expression recognition algorithm based on a convolutional neural network proposed in this paper.

Algorithm 1 Facial expression recognition based on CNN

Input: Training set $T=\{x_{i}\}$ after the balance and augmentation operations.
Given: training set size N, mini-batch size m, iteration index t, number of epochs ep, current epoch epoNow, hyperparameter $\lambda$ , and zoom factor f.
For epoNow $=$ 1 to ep
    For $t=1$ to N/m
        Forward propagation:
            Calculate the joint loss: $L=L_{S}+\lambda L_{C}$
        Backward propagation:
            Calculate the gradient of the joint loss
        Calculate the self-adaption parameter:
            If epoNow % 10 $==$ 0 and $\lambda<$ 0.1
                $\lambda=\lambda\ast(1+1/(f\ast\text{epoNow}))$
        Update parameters: the joint loss parameters, the hyperparameter $\lambda$ , and the network layer parameters.
    End for
End for
Output: Training set accuracy acc and training set loss.
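
The self-adaption rule for $\lambda$ in Algorithm 1 can be traced with a short sketch. Two points are assumptions rather than statements of our settings: the update is applied once per epoch (the listing could also be read as once per mini-batch), and the initial value 0.0001 and zoom factor 1.0 are illustrative placeholders.

```python
def update_lambda(lam, epoch, zoom_factor, cap=0.1):
    """Grow lambda on every tenth epoch while it is still below the cap."""
    if epoch % 10 == 0 and lam < cap:
        lam = lam * (1 + 1 / (zoom_factor * epoch))
    return lam


def lambda_schedule(initial, epochs, zoom_factor):
    """Value of lambda after each epoch, starting from `initial`."""
    lam = initial
    history = []
    for epoch in range(1, epochs + 1):
        lam = update_lambda(lam, epoch, zoom_factor)
        history.append(lam)
    return history


# Assumed values for illustration only.
history = lambda_schedule(initial=0.0001, epochs=30, zoom_factor=1.0)
```

Under this reading $\lambda$ never decreases, and a smaller zoom factor f makes each step larger, so f controls how quickly the weight of the center loss rises.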

4. Experiment and discussion

The experiment in this paper uses the Fer2013 expression dataset to evaluate the performance of the proposed methods. The experiment is set up on the TensorFlow deep learning framework with the Python programming language, and it runs on the Ubuntu 16.04 system. The Fer2013 dataset images vary in illumination and contrast. In an unconstrained environment, this difference would cause significant variance within classes, hindering expression recognition. Illumination processing is thus necessary. In addition, face detection is necessary for many facial applications, such as facial recognition and facial expression analysis [32]. Although the Fer2013 dataset has already eliminated most irrelevant image areas, certain images still contain too many non-face areas.

Jin et al. [33] adopted the Gamma transformation to improve the facial recognition ability under uneven illumination. The Gamma transformation is a non-linear gray transformation in image processing, which can expand the dynamic range of images with dark or shaded areas and improve the recognizability of the details. When the Gamma value is less than one, the transformation stretches low grayscale areas of the image while squeezing the high grayscale areas; when the Gamma value is greater than one, the transformation does the opposite. In this paper, we also adopt the Gamma transformation for illumination processing.
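
A minimal sketch of the Gamma transformation is given below. It assumes 8-bit grayscale images stored as lists of pixel rows and the common normalized form $s=255\cdot(r/255)^{\gamma}$ ; the exact formulation used in [33] may differ.

```python
def gamma_transform(image, gamma):
    """Map every 8-bit gray level r to round(255 * (r / 255) ** gamma)."""
    # A lookup table avoids recomputing the power for every pixel.
    lut = [round(255 * (level / 255) ** gamma) for level in range(256)]
    return [[lut[pixel] for pixel in row] for row in image]


# A Gamma value below one brightens dark regions; black and white stay fixed.
brightened = gamma_transform([[0, 64, 255]], 0.8)
```

With a Gamma value of 0.8, the mid-dark level 64 is mapped to a brighter value while the end points 0 and 255 are unchanged, which is the stretching of low grayscale areas described above.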

Figure 2.

Distribution diagram of various expression data in the Fer2013 dataset.

Viola and Jones [34] proposed a detection framework with a cascade architecture (V-J algorithm), which extracts Haar-Like features from images and trains the cascade classifiers using AdaBoost. This detection method has a fast calculation speed, decent detection performance, and real-time capability. In this paper, the V-J algorithm is also used in the image’s preprocessing phase.

4.1 Data preprocessing

The preprocessing of input images is a prerequisite for feature extraction and expression classification. The number distribution of various expression samples in the Fer2013 dataset is shown in Fig. 2. The train set, validation set and test set have 28709, 3589 and 3589 samples respectively. Data preprocessing mainly includes the following:

(1) Balancing out the “disgust” class, which has fewer expression samples than the other classes, by using the offline-augmentation method “four-corner cropping $+$ center cropping $+$ horizontal flipping” to enlarge it tenfold relative to the original volume. The volume of the “disgust” class in the train set increases from the original 436 to 4360.
(2) Setting the Gamma value to 0.8 and applying the Gamma transformation to all of the image data, weakening the illumination’s impact on the face images and making the textures in the shaded areas clearer.
(3) Detecting faces in the images with the V-J algorithm face detector encapsulated in OpenCV, and scaling the images to a 48 $\times$ 48 size.
(4) Applying real-time augmentation to the train set before feeding it into the training model. Data are processed through random cropping, random scaling, and horizontal flipping in mini-batches of size 32. Such augmentation methods are embedded in the TensorFlow toolkit and supported by GPU acceleration.

Note that all the data undergo Gamma illumination processing and V-J face detection, but only the training set is augmented and balanced.

4.2 Experimental process

Figure 3 shows the flow of the facial expression recognition. We use Stochastic Gradient Descent as the optimization method, a learning rate of 1e-2, Xavier uniform initialization, and ReLU as the non-linear activation function. However, we use a linear activation function for dimension reduction at the C6* convolutional layer, because a compressed feature passed through ReLU would lose its negative components. This approach helps preserve more feature information, improving the model’s performance.

Figure 3.

Facial expression recognition flow chart.

The experiment is divided into three parts:

(1) Comparing the recognition rate of our model with and without balancing classes.

(2) Conducting a comparison experiment on the modification of the network model, analyzing each adjustment’s impact on the expression recognition rate.

(3) Comparing the results of our experiment to other results using different network structures on the same dataset, and testing the effectiveness of the model proposed in this paper.

For convenience of expression, we use abbreviations for the following operations: “minus” represents the class balance method involving cutting other classes’ data volume to the same level as the smallest class; “plus” represents conducting offline data augmentation on the smallest class to increase its data volume tenfold through “four-corner cropping $+$ center cropping $+$ horizontal flipping”; “s” means using the softmax loss function; “sc” means using the self-adaption joint function of softmax loss and center loss; “SH” represents adding the residual learning shortcut connection; and “SP” represents replacing some regular convolutional layers with separable convolutional layers.

4.3 Experiment analysis and discussion

Table 1 lists the average expression recognition rates of each trial and their neural network structures and settings. Figure 4 illustrates the baseline model’s network architecture, which is represented by Model 1 in the table. All the modifications are made step by step. We compare the results before and after each adjustment and analyze the impact on the recognition rate. Figure 1 illustrates the final version of our improved network architecture.

Table 1
Network structure and settings of each step-by-step improved model, with the average recognition rate obtained after training

Model	Balance (plus)	Balance (minus)	SH	SP (before adding)	SP (after adding)	Loss: s	Loss: sc	Average recognition rate (%)
1	–	–	–	–	–	$\surd$	–	67.079
2	–	$\surd$	–	–	–	$\surd$	–	51.639
3	$\surd$	–	–	–	–	$\surd$	–	69.183
4	$\surd$	–	$\surd$	–	–	$\surd$	–	70.356
5	$\surd$	–	$\surd$	–	–	–	$\surd$	70.747
6	$\surd$	–	–	$\surd$	–	$\surd$	–	70.317
7	$\surd$	–	–	–	$\surd$	$\surd$	–	70.943
8	$\surd$	–	$\surd$	–	$\surd$	–	$\surd$	71.842

Figure 4.

Schematic diagram of the network architecture of the baseline model.

4.3.1 Comparison experiment of dataset class balancing

We compare the results of three models in this experiment. Model 1 adopts the baseline network architecture; it uses the original samples without class balancing and uses only the softmax loss function. Model 2 balances the classes by reducing the sample volume of the large classes on the basis of Model 1. The “disgust” class has the smallest sample volume among all seven classes, so Model 2 reduces the sample volume of the other classes to the same level as the “disgust” class. In contrast to Model 2, Model 3 balances the classes by data augmentation.
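
The class reduction used for Model 2 corresponds to random undersampling. The sketch below shows one way to write it in plain Python; the function name, the fixed seed, and the toy labels are illustrative assumptions, not the code used in our experiment.

```python
import random
from collections import defaultdict


def undersample_to_smallest(samples, labels, seed=0):
    """Randomly keep as many samples per class as the smallest class has."""
    rng = random.Random(seed)  # fixed seed so the selection is repeatable
    by_class = defaultdict(list)
    for sample, label in zip(samples, labels):
        by_class[label].append(sample)
    target = min(len(items) for items in by_class.values())
    balanced = []
    for label, items in by_class.items():
        for sample in rng.sample(items, target):
            balanced.append((sample, label))
    rng.shuffle(balanced)  # avoid class-ordered batches
    return balanced


# Toy example: five "happy" samples and two "disgust" samples.
samples = list(range(7))
labels = ["happy"] * 5 + ["disgust"] * 2
balanced = undersample_to_smallest(samples, labels)
```

Every class ends up with the size of the smallest one, so the minority class is kept whole while most of the majority-class data is discarded.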

Figure 5.

Influence of category balance on the expression recognition rate.

We compare the experiment results of these three models and illustrate the results using a line chart as shown in Fig. 5:

(1) In Model 1, which does not involve class balancing, the recognition rates of the “happy” and “surprise” classes are notably higher than those of the other expression classes as well as the average recognition rate. Two possible reasons for this result are as follows: i) These two expression classes are more distinctive. In real life, happy and surprised expressions are often exaggerated and easy to read. ii) Unbalanced classes in the train set could affect the recognition rate. In this train set, the “happy” class has a much larger data volume than the other classes, and it also has the highest recognition rate. The “disgust” class has the smallest data volume and the lowest recognition rate of 43.2%, which is much lower than the average recognition rate. To test these hypotheses, we conduct experiments on Model 2 and Model 3.

(2) In Model 2, all the other sample classes are reduced to the size of the “disgust” class, which leads to a great increase in the recognition rate of the “disgust” class. However, the recognition rates of the other six classes drop. This finding shows that i) balancing the training samples of different classes helps increase the recognition rate of the smallest class and ii) decreasing the data volume hurts the model’s recognition rate on the corresponding classes. Additionally, despite a drop, the recognition rates for the “happy” and “surprise” classes are still higher than or at least equal to those of the other expression classes and are much higher than the average rate. This finding further suggests that these two expression classes are rather distinctive.

(3) In Model 3, which uses data augmentation for class balancing, the “disgust” class sample size is enlarged tenfold using offline augmentation of four-corner cropping $+$ center cropping $+$ horizontal flipping, reaching the same level as the other classes. Fig. 5 shows that, compared to the result of Model 2, the augmentation method yields better results for all seven classes. Model 3 also has a higher average recognition rate than the previous two models, as shown in Table 1. This finding gives further evidence that i) an increase in the training data improves the expression recognition rate and ii) class balancing helps the model’s performance in expression recognition.
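For contrast with augmentation, the reduction strategy used in Model 2 can be sketched as random undersampling to the size of the smallest class. The function name, the random seed, and the toy class sizes below are illustrative assumptions; how the discarded samples were actually chosen is not stated.

```python
import numpy as np


def undersample_to_smallest(labels: np.ndarray, seed: int = 0) -> np.ndarray:
    """Return sorted sample indices that keep the same number of samples per class."""
    rng = np.random.default_rng(seed)
    classes, counts = np.unique(labels, return_counts=True)
    target = counts.min()  # size of the smallest class
    keep = [
        rng.choice(np.flatnonzero(labels == c), size=target, replace=False)
        for c in classes
    ]
    return np.sort(np.concatenate(keep))


# Toy label vector with three unbalanced classes (sizes 50, 8, and 30).
labels = np.array([0] * 50 + [1] * 8 + [2] * 30)
idx = undersample_to_smallest(labels)
balanced_counts = np.bincount(labels[idx])
```

Every class is cut to the size of the smallest one, which illustrates why this strategy discards most of the training data when one class is very small.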

Figure 6. Comparison of the feature graphs before and after adding a residual connection.

Figure 7. Comparison of various expression recognition rates before and after adding a residual connection.

4.3.2 Comparison experiment before and after network model improvements

In this part, we conduct three comparison experiments, building a new model for each modification.

Experiment on the residual connection

We build Model 4 by adopting the residual module on the basis of Model 3, and we conduct this comparison experiment to test its effect. We capture a feature map at pooling layer P1 in Fig. 1 and another at pooling layer P1 in Fig. 4. The resulting images are shown in Fig. 6a and b. By comparison, we find that adding a residual connection lets the network extract richer convolutional features. Figure 7 shows a comparison of the two models. Compared to Model 3, which does not use a residual connection, Model 4 has much higher recognition rates for the classes “angry”, “disgust”, “fear”, and “neutral”. Table 1 also shows that with a residual connection, the average recognition rate rises to 70.356%, which is 1.173 percentage points higher than that of Model 3. This finding supports the claim that residual learning improves the model’s recognition rate.
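The idea of a shortcut connection can be illustrated with a small NumPy sketch. It is not the residual module of Fig. 1: the two-convolution body, the identity shortcut, the channel count, and all function names are assumptions chosen to keep the example short.

```python
import numpy as np


def conv3x3_same(x: np.ndarray, w: np.ndarray) -> np.ndarray:
    """3 x 3 convolution, stride 1, zero padding. x: (H, W, C_in), w: (3, 3, C_in, C_out)."""
    h, wd, _ = x.shape
    padded = np.pad(x, ((1, 1), (1, 1), (0, 0)))
    out = np.zeros((h, wd, w.shape[-1]), dtype=x.dtype)
    for i in range(3):
        for j in range(3):
            # Each kernel tap contributes a shifted copy of the input times its weights.
            out += padded[i:i + h, j:j + wd, :] @ w[i, j]
    return out


def relu(x: np.ndarray) -> np.ndarray:
    return np.maximum(x, 0.0)


def residual_block(x: np.ndarray, w1: np.ndarray, w2: np.ndarray) -> np.ndarray:
    """y = ReLU(F(x) + x), where F is conv -> ReLU -> conv and the shortcut is the identity."""
    f = conv3x3_same(relu(conv3x3_same(x, w1)), w2)
    return relu(f + x)


rng = np.random.default_rng(0)
x = rng.random((12, 12, 4))  # non-negative toy feature map with 4 channels
w1 = rng.normal(scale=0.1, size=(3, 3, 4, 4))
w2 = rng.normal(scale=0.1, size=(3, 3, 4, 4))
y = residual_block(x, w1, w2)
```

Because the input is added back to the output of the convolutions, the block passes its input through unchanged when the learned weights are zero, so the convolutions only need to learn a correction to the identity mapping.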

Experiment on the self-adaption joint loss function

In Model 5, we replace the softmax loss function in Model 4 with the self-adaption joint function of softmax loss and center loss. Figure 8 shows a comparison of the results. With the joint loss function, the recognition rates increase for the classes “fear”, “happy”, “sad”, and “surprise”, with the largest increases for “fear” and “sad”. The recognition rates decrease slightly for the classes “neutral”, “angry”, and “disgust”. The overall average recognition rate rises by 0.391 percentage points with the self-adaption joint loss function. This finding suggests that the self-adaption loss function can increase the intra-class compactness, which is helpful in distinguishing the expression classes that are less distinct from each other.
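The combination of the two loss terms can be sketched as below. The sketch uses a fixed weight `lam` between the softmax loss and the center loss, which is an assumption: it does not reproduce the self-adaption weighting of the proposed function, and the averaging over the batch is one common normalization choice.

```python
import numpy as np


def softmax_loss(logits: np.ndarray, labels: np.ndarray) -> float:
    """Mean cross-entropy of softmax(logits) against integer class labels."""
    shifted = logits - logits.max(axis=1, keepdims=True)  # for numerical stability
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))
    return float(-log_probs[np.arange(len(labels)), labels].mean())


def center_loss(features: np.ndarray, labels: np.ndarray, centers: np.ndarray) -> float:
    """Half the mean squared distance between each feature and its class center."""
    diff = features - centers[labels]
    return float(0.5 * (diff ** 2).sum(axis=1).mean())


def joint_loss(logits, features, labels, centers, lam=0.01) -> float:
    """Softmax loss plus lam * center loss; lam is a fixed illustrative weight."""
    return softmax_loss(logits, labels) + lam * center_loss(features, labels, centers)


rng = np.random.default_rng(0)
labels = np.array([0, 3, 6])            # three samples from a 7-class problem
logits = np.zeros((3, 7))               # uniform predictions
centers = rng.normal(size=(7, 16))      # one 16-dimensional center per class
features = centers[labels] + 1.0        # each feature is offset from its center
loss = joint_loss(logits, features, labels, centers)
```

The center-loss term penalizes the distance between a sample’s feature and the center of its own class, which is what pulls samples of the same class together.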

Figure 8. Influence of the loss function on the expression recognition rate.

Experiment on the separable convolutional layer

Figure 9. Influence of the separable convolutional layer on the expression recognition rate.

To analyze the effect of separable convolutional layers, we build Model 6 and Model 7 on the basis of the baseline model and keep their number of network layers constant. In Model 6, which uses data augmentation for class balancing, we replace the two 3 $\times$ 3 convolutional layers C5 and C6 with two 1 $\times$ 1 layers C5* and C6*, and then we add another 3 $\times$ 3 layer C. Model 7 is the same as Model 6, except that it uses a separable convolutional layer SC instead of the 3 $\times$ 3 convolutional layer C. The comparison result is given in Fig. 9. Model 7 has a higher recognition rate for the classes “disgust”, “fear”, “sad”, and “surprise” and a lower rate for the other three classes. Model 7’s average recognition rate is 0.626 percentage points higher than that of Model 6, suggesting that the separable convolution performs better at differentiating hard-to-distinguish expressions.
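The parameter saving of a separable layer over a regular layer follows from a simple count. The sketch below compares the two for a 3 $\times$ 3 kernel; the channel counts (128 in, 128 out) are assumed for illustration and are not the sizes of layers C or SC, and biases are omitted.

```python
def conv_params(k: int, c_in: int, c_out: int) -> int:
    """Weights in a regular k x k convolution (bias omitted)."""
    return k * k * c_in * c_out


def separable_conv_params(k: int, c_in: int, c_out: int) -> int:
    """Weights in a depthwise k x k convolution followed by a 1 x 1 pointwise convolution."""
    return k * k * c_in + c_in * c_out


# Assumed channel counts, for illustration only.
regular = conv_params(3, 128, 128)
separable = separable_conv_params(3, 128, 128)
ratio = separable / regular  # equals 1 / c_out + 1 / k**2
```

For these assumed sizes the separable layer needs 17,536 weights instead of 147,456, roughly 12% of the regular layer.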

4.3.3 Comparison of the experimental results with other methods on the same dataset

In Model 8, we augment the sample volume for class balancing and modify the network architecture by using the joint self-adaption function of softmax loss and center loss, adding the residual learning module, and adopting the separable convolutional layer. The final architecture is shown in Fig. 1. This model achieves an expression recognition rate of 71.842%, which is 4.763 percentage points higher than that of the initial Model 1 and also higher than the average human recognition rate on the same expression dataset, which is 65 $\pm$ 5% [35]. Compared to the results of other researchers, as shown in Table 2, our method has a higher recognition rate under the constraint of not using an external dataset or network ensemble. For example, our recognition rate is 0.512 percentage points higher than that of the method proposed by Guo et al. [17]. This finding indicates that our improvements are effective and improve the trained model’s expression recognition capability.

Table 2
Performance comparison with other methods for FER based on the Fer2013 dataset

Method	Test accuracy (%)
Tang et al. [14]	71.2
Devries et al. [15]	67.21
Guo et al. [17]	71.33
Xu et al. [36]	65.60
Qian et al. [37]	68.00
Proposed method	71.842

4.4 Algorithm application to our dataset

We compile a facial expression dataset covering five emotions that are hazardous to driving safety, and then we test the improved model on this dataset. The results of this experiment show that the proposed method generalizes well.

Anger, disgust, excitement, sadness, and dullness are five emotions that have negative effects on driving safety. We build a dataset consisting of 1,580 720p expression images corresponding to these five emotions. The classes contain 371, 413, 271, 304, and 221 images, respectively. In this experiment, we divide the dataset into a train set, a validation set, and a test set. We preprocess the images with the methods described earlier, and then we apply data augmentation to the smallest class of the train set, “dullness”, increasing its images to the same level as the other expression classes. We train and test Model 8 on this dataset. The classification layer accordingly has five neurons, and it classifies the output of the fully connected layer into the five emotional classes: anger, disgust, excitement, sadness, and dullness. We obtain a recognition rate of 98.361%. This finding supports the effectiveness and generalizability of the proposed model.
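One way to perform the division into train, validation, and test sets is a stratified split that preserves the class proportions. The sketch below applies it to the class sizes reported above; the stratification itself, the 70/15/15 fractions, and the function name are assumptions, since the split procedure and ratios are not stated.

```python
import numpy as np

# Class sizes as reported for the driving-safety dataset.
CLASS_COUNTS = {"anger": 371, "disgust": 413, "excitement": 271,
                "sadness": 304, "dullness": 221}


def stratified_split(labels, fractions=(0.7, 0.15, 0.15), seed=0):
    """Split sample indices per class so that each subset keeps the class proportions.

    The 70/15/15 fractions are an assumed example, not values from the paper.
    """
    rng = np.random.default_rng(seed)
    train, val, test = [], [], []
    for c in np.unique(labels):
        idx = rng.permutation(np.flatnonzero(labels == c))
        n_train = int(round(fractions[0] * len(idx)))
        n_val = int(round(fractions[1] * len(idx)))
        train.extend(idx[:n_train])
        val.extend(idx[n_train:n_train + n_val])
        test.extend(idx[n_train + n_val:])  # remainder goes to the test set
    return np.array(train), np.array(val), np.array(test)


# One integer label per image, in the class order of CLASS_COUNTS.
labels = np.repeat(np.arange(len(CLASS_COUNTS)), list(CLASS_COUNTS.values()))
train_idx, val_idx, test_idx = stratified_split(labels)
train_counts = np.bincount(labels[train_idx])
```

With such a split, “dullness” remains the smallest class in the train set, which is the class that the augmentation step then enlarges.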

5. Conclusion

Facial expression recognition has wide applicability in real-life scenarios. In this paper, we propose a new scheme for FER based on a convolutional neural network. Our contributions include the following: replacing part of the regular convolution operation with depthwise separable convolution to reduce the number of parameters and the computational workload; adopting the self-adaption joint function of softmax loss and center loss to mitigate the influence of subject identity bias, minimize the intra-class distance in acquiring shared features, and increase the inter-class distance in differentiating expression classes; and balancing the train set through data augmentation to increase the expression recognition rate of our model. We conduct multiple controlled experiments on the Fer2013 dataset using the TensorFlow platform, and we compare our results with those of other researchers. The comparison indicates that our methods and model are effective and increase the recognition rate of human facial expressions. Last, by applying our methods to our dataset of emotions hazardous to driving and analyzing the results, we obtain further support for the effectiveness and generalizability of our methods. We thus look forward to further applications of our scheme in monitoring drivers’ emotional states and improving driving safety. In the next phase of our research, external datasets can be introduced, and multiple deep convolutional neural networks can be integrated into our method to further increase the recognition rate.

Acknowledgments

This research is supported by the State Key Laboratory of Geo-Information Engineering of China (Grant No. SKLGIE2017-M-4-6) and the National Natural Science Foundation of China (Grant No. 41701537). We would like to thank Kaggle Inc. for providing the dataset used in this FER study. We also acknowledge Sicheng Zhao for his English editing support.

References

1. Mehrabian, Communication without words, Psychology Today 2(4) (1968), 53–56.

2. Chang, Wen and Ma J.J., Facial expression recognition based on complexity perception classification algorithm, https://arxiv.org/abs/1803.00185, 2018.

3. Uddin M.Z., Khaksar and Torresen, Facial expression recognition using salient features and convolutional neural network, IEEE Access, 2017, pp. 26146–26161.

4. Lopes A.T., Aguiar E.D., Souza A.D. and Oliveira-Santos, Facial expression recognition with Convolutional Neural Networks: Coping with few data and the training sample order, Pattern Recognition, 2017, pp. 610–628.

5. Pons and Masip, Supervised committee of convolutional neural networks in automated facial expression analysis, IEEE Transactions on Affective Computing, 2017, p. 1.

6. Jiwen, Slimming Convolutional Neural Networks with Depthwise Separable Convolutions, M.S. Dissertation, Xiamen University, 2018.

7. Kim, Yoo, Kwak, Choi and Kim, Deep generative contrastive networks for facial expression recognition, arXiv preprint arXiv:1703.07140, 2017.

8. Chen and Liu, Deep peak-neutral difference feature for facial expression recognition, Multimedia Tools and Applications, 2018, pp. 1–17.

9. Zhang, Ren and Sun, Deep residual learning for image recognition, the IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 770–778.

10. Schroff, Dmitry and James, Facenet: A unified embedding for face recognition and clustering, the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2015, pp. 815–823.

11. Guo, Tao, Xiong and Tao, Deep neural networks with relativity learning for facial expression recognition, IEEE International Conference on Multimedia & Expo Workshops (ICMEW), 2016, pp. 1–6.

12. Zhong, Chen and Wang, Deep face recognition combined with angular margin loss and center loss, Journal of Computer Applications 39(S2) (2019), 55–58.

13. and Deng, Deep facial expression recognition: a survey, https://arxiv.org/abs/1804.08348, 2018.

14. Tang, Deep learning using support vector machines, https://arxiv.org/abs/1306.0239v1, 2013.

15. Devries, Biswaranjan and Graham W.T., Multi-task learning of facial landmarks and expression, Canadian Conference on Computer & Robot Vision CRV, 2014, pp. 98–103.

16. Zhang, Luo, Change L.C. et al., Learning social relation traits from face images, IEEE International Conference on Computer Vision (ICCV), Santiago, Chile, 2015, pp. 3631–3639.

17. Guo, Tao et al., Deep neural networks with relativity learning for facial expression recognition, IEEE International Conference on Multimedia & Expo Workshops (ICMEW), Seattle, USA, 2016, pp. 1–6.

18. Bo-Kyeong, Suh-Yeon, Jihyeon et al., Fusing aligned and non-aligned face information for automatic affect recognition in the wild: a deep learning approach, IEEE Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), Las Vegas, USA, 2016, pp. 1499–1508.

19. Pramerdorfer and Kampel, Facial expression recognition using convolutional neural networks: state of the art, https://arxiv.org/abs/1612.02903, 2016.

20. Lin, Chen and Yan, Network in network, https://arxiv.org/abs/1312.4400, 2013.

21. Jin, Gong et al., Classification of Clouds in Satellite Imagery Using Adaptive Fuzzy Sparse Representation, Sensors 16(12) (2016), 2153.

22. Karen and Andrew, Very deep convolutional networks for large-scale image recognition, https://arxiv.org/abs/1409.1556, 2014.

23. Chollet, Xception: deep learning with depthwise separable convolutions, the 30th IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 2017, pp. 1800–1807.

24. Szegedy, Liu, Jia et al., Going deeper with convolutions, IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Boston, MA, USA, 2015, pp. 1–9.

25. Szegedy, Vanhoucke, Ioffe et al., Re-thinking the Inception architecture for computer vision, IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, USA, 2016, pp. 2818–2826.

26. Christian, Sergey, Vincent et al., Inception-v4, Inception-ResNet and the impact of residual connections on learning, https://arxiv.org/abs/1602.07261, 2016.

27. Zhu, Hao and Yan, Facial expression recognition based on deep residual network, Journal of Data Acquisition and Processing 34(1) (2019), 50–57.

28. Omkar M.P., Andrea and Andrew, Deep face recognition, Proceedings of the 2015 British Machine Vision Conference, Guildford: BMVC Press, 2015, pp. 1–12.

29. Wen, Zhang et al., A discriminative feature learning approach for deep face recognition, Computer Vision – ECCV 2016 14th European Conference, Amsterdam, Netherlands, 2016, pp. 499–515.

30. Wang et al., Face R-CNNs, https://arxiv.org/abs/1706.01061, 2017.

31. Gao, Zhou and Hu, Research on Image Recognition of Convolution Neural Network Based on Data Enhancement, Computer Technology and Development 28(8) (2018), 62–65.

32. Zhang and Li, Joint face detection and alignment using multi-task cascaded convolutional networks, IEEE Signal Processing Letters 23(10) (2016), 1499–1503.

33. Jin, Gong, Zeng and Fu, Illumination robust face recognition using random projection and sparse representation, Signal Image and Video Processing 12(4) (2018), 721–729.

34. Viola and Jones M.J., Robust real-time face detection, International Journal of Computer Vision 57(2) (2004), 137–154.

35. Ian J.G., Dumitru, Pierre L.C. et al., Challenges in representation learning: a report on three machine learning contests, Neural Networks (2015), 59–63.

36. Zhang and Zhao, Expression recognition algorithm for parallel convolutional neural networks, Journal of Image and Graphics 24(2) (2019), 0227–0236.

37. Qian, Shao et al., Multi-view facial expression recognition based on improved convolutional neural network, Computer Engineering and Applications 54(24) (2018), 12–19.
