Abstract
The work described in this paper aims to contribute to one of the most stimulating and promising application areas of emotion recognition: health care. Multidisciplinary studies in artificial intelligence, augmented reality, and psychology have stressed the importance of emotions in communication and awareness. The goal is to recognise human emotions by processing images streamed in real time from a mobile device. The proposed techniques rely on open source visual recognition libraries and on machine learning approaches based on convolutional neural networks (CNN).
Introduction
The study proposes and tests an innovative technique for automated emotion recognition through face detection via Convolutional Neural Networks (CNN). The technique is meant to support people with health disorders that can reduce communication skills (e.g. muscle wasting, stroke, autism, or, more simply, pain), by recognising emotions and generating real-time feedback, or by feeding more complex systems. A user who cannot recognise emotions herself/himself (e.g. because of a visual or cognitive impairment) can thus be assisted with augmented emotional stimuli; on the other hand, a user who has difficulties expressing emotions (e.g. a speech impairment or difficulty) can be monitored to infer expressions of suffering, so that intervention can follow accordingly. The proposed system, an extension of the authors' previous EmEx system [39], exploits the high precision of CNNs, which are designed to process images. The added value of the proposed technique is its focus on visual micro-expressions for emotional classification: the system is trained on a specific person in order to recognise her/his facial particularities, and its capability is then exploited on that single user. A dedicated classification of emotions is set up, inspired by the Ekman basic emotion classes [18]: anger, sadness, disgust, happiness, neutral. The system implementation includes a graphical user interface that allows the user to create a set of emotion events, i.e. a personalised training set. Finally, another part of the ad-hoc software uses the convolutional neural model to detect the type of emotion in a video stream in which the same user who trained the network appears. The distinctive trait of our system is, in fact, that it is user-centered: the user trains the network on her/his own facial expressions, and the software supports personalised emotional feedback for each particular user.
Personal traits, such as scars or flaws, and individual variations in emotional feeling and expression help the training achieve precise recognition.
Through an in-depth study of the elements characterising the problem, the system is able to overcome the barriers that have so far hindered its solution. The proposed system also has robustness and scalability properties, verified with experiments. To obtain optimised results, the ambient light needs to be set up properly.
The proposed solution has been implemented in C++ with the OpenCV libraries; hence it is portable across the main operating systems. These libraries were chosen for their high reliability and the constant support of a steadily growing community of developers and users. Moreover, they are widely used in many research projects and prototypes based on an open source approach. The open source approach greatly facilitated our work, thanks to the large number of available examples, which simplified many development phases, the accurate and extensive documentation, and the opportunity to share and improve our work with the community of developers.
Related work
Deep learning and image classification
The scientific focus on Deep Learning that emerged towards the end of the 20th century has contributed to a renewed interest in neural networks. The true impact of Deep Learning began in the context of speech recognition around the year 2010, when two researchers at Microsoft Research, Li Deng and Geoffrey Hinton, realised that training a deep neural network on large amounts of data lowered error rates far below the state of the art [25]. Advances in hardware have certainly contributed to the rise of interest in Deep Learning. In particular, increasingly powerful GPUs are well suited to the countless matrix and vector calculations that Deep Learning requires [21,31,45]. Current GPUs can reduce training times from weeks to days.
Recently, deep learning has been used in several lines of research on image classification and learning, trying to overcome the limitations of machine learning, namely overfitting and domain dependence, through image adaptation, kernel randomisation [32] and transfer learning [34].
Researchers have also worked on exploiting domain dependence as a feature, where personalised classification can easily be performed on a particular user or entity, especially for smart-home systems [29] and microblog sentiment tagging [11].
Alternative approaches consider evolutionary algorithms, random walks on semantic networks of images [16] and max-product neural networks.
Convolutional Neural Networks are among the most used methods for affective image classification [35] thanks to their flexibility for transfer learning, and easy tools available on the Web [6].
AI-assisted health care
For computerised health care assistance, multidisciplinary studies in Artificial Intelligence, Augmented Reality and Robotics have stressed the importance of computer science for automating real-life tasks with assistive and learning objectives, such as detecting words from labial movements (i.e. automated lip detection) [19], virtual reality for prosthetic training [36], neural telerehabilitation of patients with stroke [20], and vocal interfaces for robotics applications [3].
As an application of complex networks, it is possible to predict bacteria diffusion patterns [14], as well as epidemiological data [46], which exhibit viral spread. Notable advances are also being made in medical image recognition, in multi-stage feature selection for the classification of cancer data [1], and in the analysis of text corpora for medical or patient feedback in social networks.
One of the most promising advances of recent years for AI-assisted health care is the opportunity to have lightweight mobile apps that can be easily developed and used in a friendly manner [15] to assist and support disabled users in communication and learning tasks. Such applications can run directly on personal smartphones or wearable devices, for health monitoring and prognosis [33] as well as for interactive support for people with disabilities or conditions that can influence communication and learning, such as autism spectrum disorders [22]. Using cloud services or networks in the Internet of Things (IoT) makes it possible both to connect such devices to high-capability servers and to collect data in a distributed collaborative perspective [44], in order to feed large knowledge bases and increase the capability of the single object, i.e. of its owner, as a member of a large, interactive, collective and dynamic knowledge (i.e. Big Data) network.
Affective computing and emotion recognition
Multidisciplinary approaches have recently stressed the importance of recognising and extracting affective and mental states, in particular emotions, for communication, understanding, and supporting humans in any task with automated detectors and artificial assistants endowed with machine emotional intelligence [38]. In real-life problems, individuals transform overwhelming amounts of heterogeneous data into a manageable and personalised subset of classified items. The process of recognising moods and sentiments is highly complex, but recent research underlines that basic emotional states such as happiness, sadness, anger, disgust, or the neutral state [2] can be recognised from physiological clues such as heart rate, skin conductance and facial expression, unlike sentiments, moods and affect, which are more complex states and are better managed with a multidimensional approach [17,18]. Since Rosalind Picard defined the challenges for Affective Computing in 2003 [37], numerous advances have been made in the task of emotion recognition, such as defining the collective influence of emotions expressed online [8], stating that emotional expressiveness is the crucial fuel that sustains communities; studying cultural aspects of emotions in art [4] and their variations; and creating emotionally engaging experiences in games [7], where affective changes are crucial to the conscious experience of the world around us. Some of the more ethical and important challenges defined by Rosalind Picard, however, still remain open. For example, many of the modalities for emotion recognition (e.g. blood chemistry, brain activity, neurotransmitters) are not easily available, commercial tools are limited [9], datasets for training are not general [41], and people's emotions are so idiosyncratic and variable that it is difficult to accurately recognise an individual's emotional states from the available data [37].
This paper aims to provide a solution to overcome such difficulties.
The proposed method for emotion extraction: EmEx2
The proposed method takes some insights from the literature on lip reading [5,28,40] and from the GUI-based EmEx software [39], implemented ad hoc. EmEx2 recognises the emotion of the user whose face appears in the camera stream [43]. For our work, we need to establish a subdivision into classes (see Table 1) as a representation of the Ekman basic emotions [18], which will be our multi-classification features [42].
Identified classes to represent emotions

Various phases of the EmEx2 application: Acquisition of frames by a camera, identification of the main features (facial micro-expressions) and final classification by the convolutional neural network.
Fig. 1 represents the various phases of the EmEx2 software. The initial phase consists of training the system on the specific user's characteristics, and is realised by acquiring a series of images associated with each of the considered emotions, summarised in Table 1. Each image is then processed by the Convolutional Neural Network to identify the features associated with the emotional expressions of the face and to classify the corresponding emotion.
In routine operation, the frames captured by the camera are processed and the emotions are identified by the EmEx2 application in real time. The information carried by the visual expression captured by the camera is processed to return a given output. This output can be returned to the user as real-time feedback (sounds, symbols, graphics, and elements in the Graphic User Interface), as shown in Fig. 2.A, or it can be saved in a file for further processing by external tools, as shown in Fig. 2.B. On the basis of the emotional expressions interpreted by the system, actions may be carried out to help people with impairments and/or disabilities.
Fig. 3 summarises the general scheme of the implemented Convolutional Neural Network, both for the initial training and for the classification phase. The two phases are structured in a similar way; however, the initial training phase, customised to the user, involves more complex operations, and the communication scheme among the various layers of the Convolutional Neural Network is more complex. The layers are represented in Fig. 3 as rectangles (each type of layer has its own associated color), while the blobs (used to pass information among layers) are represented as hexagons.

Simple output scheme: The output can be returned as real-time feedback (sound, symbols, graphics, and elements in the GUI) (A), or in a file to be further processed by external tools (B). Based on the interpreted emotions, the system may carry out some actions to help people with disabilities or impairments.
The layer sequence is crucial to obtain good classification scores. In the present research, we used a series of Convolutional layers alternated with a Normalisation layer or a Pooling layer, inserted where it is necessary to improve the quality of the input data through transformations that do not alter the original information. For example, with the Normalisation layer we can apply a linear transformation to the image to improve its readability, while with the Pooling layer we can reduce the size of the input data to speed up the computation. These are followed by Inner Product layers (a particular kind of Fully Connected layer, in which the number of outputs equals the number of neurons of the layer itself), alternated with a Rectified Linear Units layer, which speeds up the training of the convolutional neural network. Finally, the Convolutional Neural Network is terminated by the Output layer.
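As an illustration of the two element-wise transformations mentioned above, the following minimal sketch shows a linear rescaling of pixel intensities and a Rectified Linear Unit. It is written in plain Python for readability rather than in the C++/Caffe setting of the system; the function names and the choice of min-max rescaling are assumptions made for this example, since the text does not specify which linear transformation the Normalisation layer applies.

```python
def linear_normalise(pixels, new_min=0.0, new_max=1.0):
    """Linearly rescale a flat list of pixel intensities to [new_min, new_max].

    Assumption: min-max rescaling stands in for the (unspecified) linear
    transformation performed by the Normalisation layer.
    """
    lo, hi = min(pixels), max(pixels)
    if hi == lo:
        # A constant image has no contrast to stretch; map it to new_min.
        return [new_min for _ in pixels]
    return [new_min + (p - lo) / (hi - lo) * (new_max - new_min) for p in pixels]


def relu(values):
    """Rectified Linear Unit: negative activations are replaced by zero."""
    return [max(0.0, v) for v in values]


if __name__ == "__main__":
    print(linear_normalise([50, 100, 150]))
    print(relu([-1.5, 0.0, 2.0]))
```

The rescaling preserves the ordering of the intensities, which is consistent with the requirement that such transformations do not alter the original information.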
During the training phase, the output is represented by the Accuracy of the process and by the Loss; during the classification phase, we obtain the emotion recognition output, according to the classes defined in Table 1.

Representative diagram of CNN network to classify emotions. The training and classification phases.
CNN training phase
Creating the dataset is a crucial phase [23]: it makes it possible to train a neural network with a set of images expressing the emotions described in Table 1. Such a training phase enables the user to train a personal network with the appropriate parameters for her/his case; our algorithm automatically recognises the user's emotions during the software training activity, i.e. in the frame acquisition phase.
A personalised automation software automatically detects the position of the person's face [12,27,40] in the camera image and recognises the emotions based on the training set. The software interface is designed to be usable [15] and to guide the user intuitively through the various facial expressions in the training phase, in order to form the dataset of images and files needed to train the neural network.
Besides the automated training, the software interface also allows the user to provide additional frames (e.g. photos) and to tag them with an emotional label, to improve the neural network and its recognition capabilities. Moreover, the list of training frames is always accessible to the user, so that any noisy frames can easily be removed manually.
Implementing the convolutional neural network
In order to approach the world of convolutional neural networks directly, the proposed approach relies on an open source framework. Open source is in fact a powerful way to support security and privacy, since for each version it is possible to check exactly what the code does and to fix any bugs through a collective effort. For our proposal, we chose the Caffe framework [6], which is available online, supported by extensive online documentation, usable, and of proven efficiency.
To create our user-centered personal dataset of emotional labels, we set up the files for the proper convolutional neural training, in agreement with CNN theory.
Network layers are thus set up to extract the specific information from the image data, with parameters set accurately for effective recognition. Figure 3 shows the layers of the proposed CNN for emotion recognition [24], where training data are labelled with emotions and the results of the layer computation are evaluated in terms of accuracy and loss.
The neural network consists of a number of layers connected to each other. The first layer holds the appropriately converted dataset and the corresponding labels. The second layer is a convolutional layer: being directly linked to the dataset, this layer performs the convolution operations on the images, extracting specific information about each frame in each class. Subsequently, a pooling layer is used to reduce, in width and height, the volume of the previously created data. Such a reduction also reduces the amount of parameters, and therefore the computational time needed by the network. For the downscaling, either a maximum or an average function over a set of values can be used: in our tests we used a max function. After the reduction step, the network continues with further convolutional layers.
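The convolution and max-pooling steps described above can be sketched as follows. This is a minimal plain-Python illustration on nested lists, not the Caffe implementation used by the system; the function names, the absence of padding, the stride of 1 for the convolution and the non-overlapping 2x2 pooling window are all assumptions made for the example.

```python
def conv2d_valid(image, kernel):
    """Slide `kernel` over `image` (no padding, stride 1) and sum the products."""
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    return [
        [
            sum(image[r + i][c + j] * kernel[i][j]
                for i in range(kh) for j in range(kw))
            for c in range(out_w)
        ]
        for r in range(out_h)
    ]


def max_pool(feature_map, size=2):
    """Non-overlapping max pooling: keep the largest value of each size x size window."""
    rows = len(feature_map) // size
    cols = len(feature_map[0]) // size
    return [
        [
            max(feature_map[r * size + i][c * size + j]
                for i in range(size) for j in range(size))
            for c in range(cols)
        ]
        for r in range(rows)
    ]


if __name__ == "__main__":
    image = [[1, 2, 3, 4],
             [5, 6, 7, 8],
             [9, 10, 11, 12],
             [13, 14, 15, 16]]
    print(conv2d_valid(image, [[1, 0], [0, -1]]))  # 3 x 3 feature map
    print(max_pool(image))                         # 2 x 2 map: width and height halved
```

With a 2x2 window, pooling halves both width and height, so the following layers see a quarter of the values, which is the source of the saving in computational time.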
Subsequently, after a further pooling layer is applied, an inner product layer (InnerProduct) groups all the information obtained up to this point. This grouping step has the useful effect of condensing such information into a single numeric value per neuron, which can be processed again in the subsequent phases.
After the grouping layer, the system returns a one-dimensional vector of neurons (from n dimensions to 1). This implies that, from here onwards, it is no longer possible to apply convolutional layers.
Another InnerProduct layer is then applied: it is useful to put these layers in sequence, because the last one has a number of outputs equal to the number of classes needed for the classification. The K final values are then passed to a probability function that produces the final classification.
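The role of the last inner product layer, which maps a feature vector onto K class scores, can be illustrated with a short sketch. The code below is plain Python and only shows the arithmetic of a fully connected layer; the function name, weights and biases are hypothetical values chosen for the example and are not taken from the trained network.

```python
def inner_product(inputs, weights, biases):
    """Fully connected layer: one weighted sum of all inputs per output neuron.

    `weights` has one row per output neuron, so a layer with K rows
    produces K scores, one for each class.
    """
    return [
        sum(w * x for w, x in zip(row, inputs)) + b
        for row, b in zip(weights, biases)
    ]


if __name__ == "__main__":
    features = [1.0, 2.0, 3.0]          # hypothetical flattened features
    weights = [[1.0, 0.0, -1.0],        # hypothetical weights, K = 2 classes
               [0.5, 0.5, 0.5]]
    biases = [0.0, 1.0]
    print(inner_product(features, weights, biases))
```

Choosing the number of rows of the last weight matrix equal to the number of emotion classes is what makes its output directly usable by the final probability function.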
It is worth noting that in the TRAINING phase the network ends with an Accuracy layer, which computes the network accuracy, and with a Loss layer, which computes the error function needed for a correct and useful training phase. In the CLASSIFICATION phase, instead, we use the same file, but with a SoftMax layer as the final layer: the main purpose of the SoftMax layer is the classification of new images (test set) not present in the dataset. This type of layer computes the probability of each class in the classification phase; the most probable class represents the final solution.
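The quantities produced by the final layers can be sketched in a few lines. The example below, in plain Python, computes SoftMax probabilities from K scores, a loss value and an accuracy value; the function names are illustrative, and the use of a cross-entropy loss is an assumption, since the text does not state which error function the Loss layer computes.

```python
import math


def softmax(scores):
    """Turn K raw scores into K probabilities that sum to 1."""
    shift = max(scores)  # subtracting the maximum avoids overflow in exp()
    exps = [math.exp(s - shift) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(probabilities, true_class):
    """Loss for one sample (assumed form): small when the true class is probable."""
    return -math.log(probabilities[true_class])


def accuracy(predicted, expected):
    """Fraction of samples whose predicted class equals the expected class."""
    hits = sum(1 for p, e in zip(predicted, expected) if p == e)
    return hits / len(expected)


if __name__ == "__main__":
    probs = softmax([2.0, 1.0, 0.1])
    best = probs.index(max(probs))   # index of the most probable class
    print(probs, best, cross_entropy(probs, best))
```

During training only the loss and accuracy are needed to monitor progress, whereas during classification the probabilities themselves are the output of interest.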
In order to reach the final solution, we created the network, set the parameters of each layer and prepared the various files needed for Caffe to start the training phase with the appropriate command. Each user will be able to set up a personalised network, in an easy and intuitive way, guided by the EmEx2 graphic interface.
It is also necessary to create a network for the classification phase. This will be identical, in the layer types, to the network used for training, but it will have as input an unknown image captured by the camera, and as output a layer capable of establishing a probability for the most appropriate class corresponding to the detected image. The network in question is shown in Fig. 3.
Image classification
Once the network template has been created, it is possible to move on to the classification step. A software component has been designed to process the image using the OpenCV libraries, automatically extracting the face position, after which the frames are captured and saved under the control of Caffe, which classifies each of them into the corresponding class, returning the output value. To obtain a more reliable result, a check of occurrences is made: the consecutive frames belonging to the same class are counted, and if they exceed a predetermined threshold value, the expression is considered to be the one recognised by the network. In this way it is possible to discard some false positives, thereby improving the accuracy of the network. The classification is carried out using the so-called ‘top-5 accuracy’ method. This means that the classification probabilities are reported for the first five classes in descending order, from the most probable to the least probable. In this way it is also possible to classify multiple emotions, i.e. a visual expression that combines several expressions. The software, and the underlying convolutional neural network, can detect, for instance, that the user is expressing both sadness and seriousness at the same time, given the probabilities that the neural network assigns to each dataset class.
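The ranking of class probabilities described above can be sketched as follows. The code is a plain-Python illustration; the function names, the example probabilities and the 0.3 cut-off used to report co-occurring emotions are assumptions made for this example and are not values taken from the system. The class labels are those listed in the Introduction.

```python
LABELS = ["anger", "sadness", "disgust", "happiness", "neutral"]


def top_k(probabilities, labels, k=5):
    """Return the k most probable (label, probability) pairs, most probable first."""
    ranked = sorted(zip(labels, probabilities),
                    key=lambda pair: pair[1], reverse=True)
    return ranked[:k]


def co_occurring(probabilities, labels, min_probability=0.3):
    """Labels whose probability reaches min_probability (hypothetical cut-off)."""
    ranked = top_k(probabilities, labels, k=len(labels))
    return [label for label, p in ranked if p >= min_probability]


if __name__ == "__main__":
    frame_probabilities = [0.05, 0.45, 0.03, 0.07, 0.40]  # made-up network output
    for label, p in top_k(frame_probabilities, LABELS):
        print(f"{label}: {p:.2f}")
    print(co_occurring(frame_probabilities, LABELS))
```

Reporting the whole ranking rather than only the winning class is what allows a frame with two comparably probable classes to be read as a mixed expression.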
To obtain an even more reliable final classification, the check of occurrences is carried out in agreement with [30] and [26]. By observing how many consecutive times an emotion occurs, it can be assumed that emotions occurring only a small number of times do not really belong to the emotion expressed by the person. This step brings two benefits. The first is a significant reduction in the length of the vector containing the data. The second is that noise can be detected and removed by eliminating emotion sequences shorter than a predetermined threshold. In our experiments this threshold was found empirically. In some cases the threshold may be too low, but it is considered preferable to leave some rare misclassification rather than risk eliminating a sequence of correct emotions.
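The check of occurrences can be sketched as a run-length filter over the per-frame labels. The following plain-Python example is only an illustration of the idea; the function name and the threshold of 3 consecutive frames are assumptions, since the threshold used in the experiments was found empirically and is not reported here.

```python
from itertools import groupby


def filter_short_runs(frame_labels, min_run=3):
    """Collapse consecutive identical labels into (label, length) runs
    and drop the runs shorter than min_run (hypothetical threshold)."""
    runs = [(label, len(list(group))) for label, group in groupby(frame_labels)]
    return [(label, length) for label, length in runs if length >= min_run]


if __name__ == "__main__":
    frames = (["happiness"] * 4 + ["anger"] +
              ["happiness"] * 3 + ["sadness"] * 2)
    # The isolated "anger" frame and the short "sadness" run are treated as noise.
    print(filter_short_runs(frames))
```

Both benefits mentioned above are visible in the sketch: ten per-frame labels are reduced to two runs, and the short sequences are discarded as noise.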
Evaluation of results and future works
Evaluation of results
In this paper we described EmEx2, a system designed to automate the identification of emotions from facial expressions acquired through a Webcam. The system is based on Convolutional Neural Networks. To run the tests, we used a personal computer equipped with an Intel Core i5-3210M CPU at 2.50 GHz, 4.00 GB of RAM and a 640x480 px, 30 fps Webcam.
By using a specific training phase for each user, the recognition of emotions becomes more effective. It is advisable to create the dataset and run the subsequent training phase for each user according to a personalised approach, although this method can also be used without a per-user training phase, by introducing a complex and complete dataset of samples for each class. A significant advantage of this system is also that it allows a variable distance of the user from the Webcam.
The first step consists in creating an image dataset, in the appropriate input format, comprising the training set and the validation set used by Caffe in the training phase, which, thanks to the personalised layers, makes it possible to calculate the accuracy of the model at certain intervals.
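The preparation of the training and validation sets can be sketched as follows. The example, in plain Python, builds two lists of lines of the form "image path, space, integer class label", which is the kind of listing commonly given to Caffe's image-based inputs; the function name, the file paths, the class numbering and the 20% validation share are all hypothetical and are not taken from the EmEx2 software.

```python
import random


def build_listings(frames_by_label, class_ids, validation_fraction=0.2, seed=0):
    """Return (train_lines, val_lines); each line is '<image path> <class id>'.

    Every class is split separately so that both sets contain all emotions.
    """
    rng = random.Random(seed)  # fixed seed keeps the split reproducible
    train, val = [], []
    for label, paths in sorted(frames_by_label.items()):
        shuffled = sorted(paths)
        rng.shuffle(shuffled)
        n_val = int(len(shuffled) * validation_fraction)
        if len(shuffled) > 1:
            n_val = max(1, n_val)  # keep at least one validation frame per class
        for i, path in enumerate(shuffled):
            line = f"{path} {class_ids[label]}"
            (val if i < n_val else train).append(line)
    return train, val


if __name__ == "__main__":
    frames = {
        "happiness": [f"happiness/{i}.jpg" for i in range(10)],  # hypothetical paths
        "anger": [f"anger/{i}.jpg" for i in range(5)],
    }
    train_lines, val_lines = build_listings(frames, {"anger": 0, "happiness": 1})
    print(len(train_lines), len(val_lines))
```

Keeping the validation frames disjoint from the training frames is what makes the accuracy computed at intervals during training a meaningful estimate.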
To facilitate the use of the system by the final user, the training phase can be interrupted in advance, so that an expert user who wants to tune some parameters can do so without waiting for the computation to complete. The model created in this way is then used for classification from the Webcam. Finally, the classification program lets the system classify the emotions expressed by the user, based on the previously prepared training data.
Final tests make it possible to identify the emotion with a given probability, depending on the appropriate classes for the face acquired by the Webcam, as shown in Fig. 4.

Face image classification – 1.

Face image classification – 2.
In the two figures, we can see that the expression of happiness can be recognized easily (Fig. 4), while a small uncertainty is present for the trained data for the second classification (Fig. 5). This is due to the fact that it is easy for the system to recognize a smile by the movement of the lips, i.e. lips detection. A more accurate training for the other emotions may enhance the results.
Table 2 shows the number of correct classifications for each type of emotion, obtained over consecutive image acquisitions. In this case, 5 consecutive acquisitions of the smile images have been carried out. Small defects are found only for visual expressions that are very similar to each other.
Results
A better match can be obtained with more examples in the training dataset. Future work may include a revision of the structure of the convolutional neural network to better fit special applications for different user types.
This work focuses on creating a method for recognising emotions visually. Such a system can be useful to users with difficulties in expressing or recognising emotions, e.g. users with autism spectrum disorders, patients in pain, or people assisting users whose health conditions need to be monitored.
This effort has been made possible by recent developments in neural networks, more specifically convolutional neural networks (CNN), which are able to process images accurately.
The use of a simple Webcam has led to good results, but a higher-resolution camera may improve accuracy, both in training and in classification. The experiments have shown that this operation can be done with simple tools: good results can be achieved with inexpensive hardware. The proposed classification can be accurate even with pictures of emotions that are visually very similar: this feature is considered an advantage for the proposed user-centered application, based on facial micro-expressions and small features.
The main disadvantage of the proposed approach is the same as that of all machine learning techniques: it depends in general on a good training phase.
Acknowledgements
The authors are partially supported by two projects financed by Basic Research Fund, 2015, University of Perugia: G-Lorep and EoL: interfacing between a repository of didactic-scientific material repositories and e-assessment environments (S. Tasso), Computational approaches for the efficient use of General Purpose GPU Computing (O. Gervasi).
