Abstract
The Graphics Interchange Format (GIF) is a bitmap image format that can hold a series of perpetually looping images, or short silent clips, that play without the user having to click to start them. GIFs are frequently used to visually represent emotions expressed through body language such as gestures, movements, and facial expressions. By treating these emotional expressions and movements as facial markers or features, computational methods can recognise emotions and related states such as desire, interest, and sentiment in GIFs. The ability to predict emotions in GIFs may make it easier to express oneself on social media and to convey a person’s attitude or personality. Emotion detection in GIFs may be used for a range of purposes, e.g., developing a recommendation system, detecting inappropriate content, identifying GIF-induced sentiment as perceived by the viewer, and creating a GIF tag generation system. This study reviews prior contributions to emotion identification in GIFs and describes a method for detecting seven emotion classes (Happy, Anger, Sad, Surprise, Disgust, Fear, and Neutral) in GIFs by combining an activity recognition network with facial emotion expression. The proposed approach, based on deep neural networks (RNN and LSTM), achieved an F1-score of 0.89 and an accuracy of 88 percent.
Introduction
GIF is a format for storing and delivering generic colour raster pictures or graphical material. GIFs offer basic advantages: they are silent and short, which makes them more discreet and easier to digest than longer videos, which demand more time and more bandwidth. Animated GIFs are created from short clips of movie scenes, cartoons, and television shows and are used as an enhanced form of emoticon on social media, interactive sites, message boards, and even in emails. As a result of their widespread use in instant messaging, online journalism, social media, and online services, animated GIFs have regained enormous popularity. Compared with static images, GIFs are better at displaying dynamic material, telling stories, and expressing emotions. In the age of big data, a massive amount of multimedia data in multiple modalities, such as text, photographs, and videos, is produced every minute across all types of online social networks. Among these media, people are more likely to post GIFs than static photographs in order to make a personalised and appealing upload. Emotion, as opposed to thought or information, is an innate or intuitive experience. It evolves over time because it is a state of mind influenced by one’s surroundings, mood, or interactions with people. Emotion prediction in GIFs will make it simpler to chat on social networking apps, express a brand’s personality, carry out marketing, and build stories based on one’s mood.
GIFs, which comprise a wide range of facial expressions, movements, personalities, gestures, and other types of body language, are now the most common way of graphically and artistically transmitting a wide range of emotions. These emotional movements or gestures may be utilised as emotional indicators or features in computing research and analysis to detect emotions, interest, attitudes, and other characteristics in GIFs.
Emotion detection from GIFs is a difficult challenge for researchers because, unlike traditional emotion recognition, a single animated GIF might contain humans and a broad variety of objects displaying different emotions in different settings, making prediction more difficult. GIFs, like videos, are spatiotemporal volumes, but they have distinctive characteristics, such as brevity, looping, silence, and emotional expressiveness, which pose unique challenges for research.
The motivation for this study stems from the fact that GIFs have received little attention and are underutilised for emotion identification. Building a recommendation system, identifying toxic and inappropriate material, identifying GIF-induced sentiment as perceived by the audience, and building a GIF tag generation system are just a few of the many tasks that may be accomplished by recognising emotions in GIFs. Furthermore, understanding and modelling videos and GIFs is a complicated and underexplored topic that encompasses, in addition to their context, the study of many subjects such as humans, animals, and animations.
This paper describes a new method for recognising emotions in GIFs. While earlier works were able to perform this task, they were computationally costly and ineffective at predicting emotions; the proposed study tries to address these limitations. The following sections give a high-level overview of the proposed technique. The remainder of the paper is arranged as follows: Section 2 reviews previous research on emotion detection in videos and GIFs. Section 3 outlines the contributions of the work. Section 4 describes the materials used, including the tools, technologies, and datasets. Section 5 presents the proposed methodology along with the training procedure. The results and conclusions are presented in Sections 6 and 7, respectively.
Related works
A method for collaborative sentiment analysis of online user-generated material that includes both textual and visual content has been discussed in [1]. An unsupervised language model was trained with picture descriptions and titles, while a pre-trained CNN model on ImageNet was fine-tuned for extracting visual characteristics. Around 101 keywords were used to query photographs on Getty Images, and the images were subsequently classified based on the keywords’ moods. Both early and late fusion were attempted in this work. The model’s performance was examined using two datasets: 20% of the test data and the Twitter dataset, with results of 0.769 and 0.716, respectively.
Authors of [2] presented a Visual Sentiment Topic Model (VSTM). The authors first obtained the visual sentiment features by using the Visual Sentiment Ontology (VSO); then, they constructed a Visual Sentiment Topic Model by using all images in the same topic; and finally, they chose better visual sentiment features based on the distribution of visual sentiment features in a topic. The suggested technique has the advantage of selecting the properties of the discriminative visual sentiment ontology based on knowledge about the subject of visual sentiment. The experiment findings suggest that their technique outperforms the Visual Sentiment Ontology (VSO) model.
In [3], a 3D-CNN was used to identify perceived emotions in GIFs. The collection consists of 6113 GIFs taken from the GIFGIF dataset and labelled with 17 emotional classifications. The animated GIFs ranged in duration from 2 to 347 frames. The frames of short GIFs were looped to replicate the same perceived emotion, while longer GIFs were split into multiple segments of equal length that preserve the order of consecutive frames. The feature representation is the C3D (Convolutional 3D) video descriptor. The GIFs were rescaled and fed into a 3D CNN, which extracted spatial and spatiotemporal features that were then used to train a Lasso regression. On the 6113-GIF dataset, the model achieved a Normalized Mean Squared Error (nMSE) of 0.7161.
In [4], the authors proposed a Keypoint Attended Visual Attention Network (KAVAN) for human-centred emotion identification in GIFs. To capture the coarse GIF emotion categories, the dataset contains 6119 GIFs with 17 emotions organised into four primary emotion groups. The proposed network is made up of two modules: a facial soft attention module and a temporal module that uses a Hierarchical Segment LSTM. In the soft attention module, human facial features were represented as estimated facial key points that were merged with the extracted frame-level visual features. The temporal module, on the other hand, was made up of a single Long Short-Term Memory (LSTM) network that was fed with the processed key point information. This multi-task learning approach was 68.27 percent accurate. Experiments with this two-module strategy showed that extracting key point information and employing facial masks improved KAVAN’s performance.
A new dataset called Tumblr GIF (TGIF) was created in [5], comprising 100K animated GIFs from Tumblr and 120K crowdsourced natural language descriptions. The aim was to create a testbed for image sequence description systems that produce natural language descriptions for animated GIFs or video clips. To ensure a high-quality dataset, dedicated quality controls were developed to validate the free-form text input from crowd-workers. The authors demonstrated that the dataset exhibits an unambiguous relationship between visual content and natural language descriptions, which makes it a suitable benchmark for the task of captioning visual content.
The authors of [6] worked on the Visual Audio Network, a CNN-based network. The method integrates spatial, channel-wise, and temporal attention into a visual 3D CNN, and temporal attention into an audio 2D CNN. They also propose a classification loss that guides attention generation and mitigates the problem of some videos being wrongly classified into categories of opposite polarity. The authors focus on emotion recognition in User Generated Videos (UGVs). They evaluate on the VideoEmotion-8 dataset and show that their method outperforms other state-of-the-art methods.
A context fusion framework for emotion identification was proposed by the authors of [7]; it uses high-level semantic features, produced by state-of-the-art deep models, as context clues. They first trained detectors to recognise human actions using two large-scale video benchmarks, which provided useful information about what was happening in the video. Then, using large-scale object detectors trained to distinguish 1K classes, they computed object features. Finally, they employed a context fusion network to merge these deep semantic features and identify emotions. The context fusion network obtained 50.4 percent and 51.8 percent on the VideoEmotion and Ekman datasets, respectively, outperforming other fusion approaches.
In [8], researchers developed a sentiment analysis technique for GIFs based on a visual-textual sentiment score. For each GIF, they extracted visual features at both the sequence and frame levels. They computed the visual emotion score of a given GIF using a Convolution3D network, a VGG16 network, and a ConvLSTM model. The textual sentiment score was then derived using the SentiWordNet3.0 model from the output of Synset Forests on the GIF’s brief text annotations. Then, using grid search, they built a multimodal fusion function that combines the visual and textual scores. The proposed model attained a precision of 0.7839 on the T-GIF dataset.
Instead of learning unreferenced functions, the authors of [9] suggested reformulating the layers to learn residual functions with reference to the layer inputs. They provided thorough empirical evidence that these residual networks are easier to optimise and can gain accuracy from considerably increased depth. They evaluated residual nets with up to 152 layers on the ImageNet dataset, about eight times deeper than VGG nets yet with lower complexity. An error rate of 3.57 percent was achieved on the ImageNet test set.
In [10], researchers examined emotion features and feature fusion strategies for the audio and visual modalities. They investigated audio features for emotion using speech spectrograms and log-mel spectrograms, and they evaluated a number of facial features using various CNN models and emotion pre-training techniques. They investigated feature concatenation and factorised bilinear pooling (FBP) as intra-modal and cross-modal fusion approaches, respectively. Attention mechanisms were also designed to highlight the key emotion components. After careful evaluation, the approach scored 65.5 percent on the AFEW validation set and 62.48 percent on the test set.
The authors of [11] introduced a deep convolutional neural network architecture called Inception, which set a new state of the art for classification and detection. The architecture’s primary distinguishing feature is its better use of the network’s computational resources.
The dataset used in [12] contains 400 human action classes, each with at least 400 video clips. Each clip comes from a different YouTube video and lasts about 10 seconds. The actions are human-centred and span a wide range of classes, including human-object interactions such as playing instruments and human-human interactions such as shaking hands. The authors described the dataset’s characteristics and its collection process, and reported baseline performance for neural network architectures trained and tested on this dataset for human action classification.
Using a statistical machine-learning model trained on a representative corpus of text from the World Wide Web, the researchers in [13] replicated a range of known biases as measured by the Implicit Association Test. The findings show that text corpora contain recoverable and accurate imprints of historical biases, whether they are morally neutral, like those toward flowers or insects, problematic, like those toward race or gender, or even simply true, like the current gender distribution in first names or career paths. The authors’ techniques show promise for locating and eliminating cultural biases, including those in technology.
Instead of learning unreferenced functions, the researchers in [14] explicitly reformulate the layers to learn residual functions with reference to the layer inputs. They presented thorough empirical evidence demonstrating that these residual networks are easier to optimise and can gain accuracy from considerably increased depth.
The authors evaluated residual nets with a depth of up to 152 layers on the ImageNet dataset, eight times deeper than VGG nets but still with lower complexity. On the ImageNet test set, an ensemble of these residual nets achieved an error rate of 3.57 percent.
In [15, 16], the authors concentrated on learning discriminative classifiers on hand-crafted features such as HOG3D and IDT. Following the success of deep learning, research shifted from hand-designed features to end-to-end learning systems.
In [17], the authors introduced a novel Two-Stream Inflated 3D ConvNet (I3D) based on the inflation of 2D ConvNets: filters and pooling kernels are expanded into 3D, which enables seamless spatio-temporal feature extractors to be learned from video. ImageNet architecture designs and their parameters were reused. After pre-training on Kinetics, I3D models significantly outperformed the state of the art in action classification, reaching 80.2 percent on HMDB-51 and 97.9 percent on UCF-101.
The authors of [18] introduced UCF101, the most challenging action recognition dataset among those then in use. With almost 13K clips and 101 action classes, it was noticeably larger than other datasets. UCF101 consists of unconstrained YouTube videos that include challenges such as poor lighting, cluttered backgrounds, and severe camera motion. Using a standard bag-of-words approach, they reported baseline action recognition results on the new dataset, with an overall accuracy of 44.5 percent.
Article [19] discussed a comprehensive Deep Neural Network (DNN) method for identifying emotions in cartoon images. Because existing work did not offer a significant amount of data, the authors gathered a dataset of size 8K from two cartoon characters, “Tom” and “Jerry”, who displayed four distinct emotions: happiness, sadness, anger, and surprise.
With an accuracy score of 0.96, the proposed integrated DNN technique correctly identified the character, segmented the face masks, and recognised the resulting emotions. It was trained on a sizeable dataset of animations of both Tom and Jerry. The method used Mask R-CNN for character detection and the state-of-the-art deep learning models ResNet-50, MobileNetV2, InceptionV3, and VGG 16 for emotion classification. In that study, VGG 16 outperformed the others at emotion classification, with an accuracy of 96% and an F1 score of 0.85. The proposed integrated DNN outperformed the state-of-the-art approaches.
In [20], the Facial Action Code was developed after studying the anatomical basis of facial movement. With this method, any facial movement captured on camera, film, or videotape can be described in terms of anatomically based action units.
In [22], an integrated method for producing aspect-level opinion summaries of movie reviews is discussed. The system, known as Movie Prism, examines each movie review, finds terms that refer to aspects, finds the opinions expressed about those aspects, and then creates a visual summary of the opinions about the movie based on those aspects. Without any training, the system can generate visual aspect-level opinion summaries from unstructured textual reviews, and the results are reasonably accurate.
The authors of [23] used two algorithms to recognise facial expressions of emotion. The Cohn-Kanade (CK/CK
In [24], the authors discuss emotion recognition using transfer learning approaches. They used the pre-trained networks ResNet50, VGG19, InceptionV3, and MobileNet. The fully connected layers of the pre-trained ConvNets were removed, and fully connected layers suited to the number of classes in the task were added.
The study in [25] discusses how facial traits are used in biometric applications to recognise user identities specific to age and gender. To categorise facial photos by age group and gender, a novel method was created that extracts Local Binary Pattern (LBP) and Gray Level Co-Occurrence Matrix (GLCM) image features, which effectively represent the facial and skin regions of users. Using the IMDB wikicrop facial dataset, three Deep Learning classification techniques, namely Convolutional Neural Network (CNN), Region-based CNN (RCNN), and Fast RCNN, were used to classify these extracted features. Compared with the other two classifiers, the CNN classifier produced the best result, with 96.4 percent accuracy.
Facial emotion recognition in the elderly was performed using an SVM classifier in [26].
Convolutional neural networks (CNNs) were employed in [27] to address facial expression recognition (FER). With this FER technique, based on regularisation, optimisation, and activation parameters, facial expressions were recognised from databases such as CK+ and JAFFE. The model could distinguish between emotions including joy, sorrow, surprise, fear, rage, disgust, and neutrality. As described in that study, a number of approaches were used to assess the model’s performance. When paired with other methodologies, the FER technique can be used to identify emotions with SoftMax, Adam, and a dropout ratio of 0.1 to 0.2.
In [28], the authors review research on human emotion detection from facial images, speech, and brain signals. They outline the areas where emotion recognition is most needed.
In [29], a deep learning method was developed for accurately identifying emotions in seven categories: Happy, Anger, Sadness, Disgust, Neutral, Surprise, and Fear. The FER2013 dataset, which contains 35887 photos, and the CK48
In [30], the authors extracted different facial features using well-known techniques such as Local Binary Patterns (LBP), Histogram of Oriented Gradients (HOG), Scale Invariant Feature Transform (SIFT), and Speeded-Up Robust Features (SURF). These techniques were tested on the popular Japanese Female Facial Expression (JAFFE) database.
In [31], the authors employed a fine-tuned CNN for image sentiment, VADER for text, and, for GIFs, a combination of image sentiment and facial expression analysis on each frame. They showed that using both textual and visual features achieved better results than models that used only visual or only textual features. The output scores from the text, image, and GIF modules were combined to obtain the final sentiment score for incoming tweets.
In [32], the authors proposed an automated strategy that requires minimal manual labour for gathering emotional animated GIFs from the Internet. The technique sorts a huge number of unlabelled GIFs using weak emotion recognisers trained on labelled data. The authors found that the number of GIFs a labeller needs to check can be significantly decreased by taking advantage of the clustered nature of emotions. A dataset named GIFGIF+, with 23,544 GIFs representing 17 emotions, was produced using the proposed method, offering a viable foundation for affective computing research.
In [33], researchers concluded that animated GIFs are noticeably more engaging than other forms of media. Their study combined a deep visual analysis of about 100,000 animated GIFs with 13 user interviews on Tumblr to determine what makes animated GIFs engaging. According to the authors, GIFs are the most engaging content on Tumblr because of their animation, absence of sound, immediate consumption, low bandwidth requirements, small time commitment, storytelling qualities, and capacity for expressing emotions. The authors also found that GIFs with faces and with higher motion energy, homogeneity, resolution, and frame rate were more engaging.
In [34], various existing classifiers were evaluated on several datasets to assess their performance.
In [35], reaction GIFs were used to capture intricate affective states online. The authors demonstrated how to add induced emotion and induced sentiment labels to the data. Their method was used to produce and release ReactionGIF, a first-of-its-kind affective dataset of 30K tweets. Multi-label induced emotion classification and induced sentiment prediction were proposed in this article. The methodology and dataset open new avenues for research in affective computing and emotion recognition.
In [36], the authors claimed to have proposed the first computational evaluation of its kind for content-based prediction on animated GIFs. The dataset consisted of over 3,800 animated GIFs gathered from the MIT GIFGIF platform, each with scores for 17 discrete emotions aggregated from over 2.5M user annotations. In addition, they proposed a conceptual paradigm for emotion prediction that demonstrates the value of distinguishing between various emotional categories in order to be specific about the emotion target. One of the main goals was to compare low-level, aesthetic, semantic, and facial features, among other content features, for emotion prediction.
Contribution outline
Workflow of the proposed methodology.
The fundamental goal of this research is to use many deep learning approaches to understand and distinguish various emotions displayed in a GIF, as well as to validate the results fairly (in comparison to state-of-the-art works) on enormous amounts of data. This allows us to accomplish a number of goals with this study, including:
Emotion communicated through facial expressions is predicted from combinations of facial action units detected on the face; the action units are detected using the OpenFace library. The emotion class is predicted from the action performed in a GIF, which is recognised using an I3D network with an Inception Net backbone, with action classes categorised by emotion. The final result is obtained by combining the emotion scores derived from facial expressions and from the action performed, and is tested on a subset of the GIFGIF dataset consisting of 856 GIFs for 7 classes. A GIF recommendation system is built that uses text emotion analysis and GIF tag generation to recommend GIFs for textual data.
The proposed method, shown in Fig. 1, starts with loading and preparing the data for action recognition, followed by a two-stream computation of emotion classes based on the action performed and the facial emotions expressed, from which the final emotion depicted by the GIF is obtained.
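The combination of the two streams can be illustrated with a minimal late-fusion sketch. The text does not state the exact fusion rule, so the weighted average below, the `fuse_scores` name, and the `action_weight` parameter are assumptions made purely for illustration, not the authors' implementation.

```python
# Illustrative late fusion of two per-emotion score dictionaries.
# ASSUMPTION: a weighted average is used; the paper does not specify the rule.
EMOTIONS = ["Happy", "Anger", "Sad", "Surprise", "Disgust", "Fear", "Neutral"]


def fuse_scores(action_scores, face_scores, action_weight=0.5):
    """Return (best_emotion, fused_scores) from two score dictionaries.

    action_scores / face_scores map emotion names to scores in [0, 1];
    emotions missing from a dictionary are treated as having score 0.
    action_weight is a hypothetical hyperparameter in [0, 1].
    """
    if not 0.0 <= action_weight <= 1.0:
        raise ValueError("action_weight must lie in [0, 1]")
    fused = {}
    for emotion in EMOTIONS:
        a = action_scores.get(emotion, 0.0)
        f = face_scores.get(emotion, 0.0)
        fused[emotion] = action_weight * a + (1.0 - action_weight) * f
    # The emotion with the highest fused score is taken as the GIF's emotion.
    best = max(fused, key=fused.get)
    return best, fused


if __name__ == "__main__":
    label, scores = fuse_scores({"Happy": 0.6, "Sad": 0.4},
                                {"Happy": 0.2, "Sad": 0.8})
    print(label, scores)
```

With equal weights, a face stream that strongly favours one emotion can override a weaker preference from the action stream; in practice the weight would have to be tuned on validation data.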
LSTM can resolve the main issues of the RNN model. It can thus be applied to address:
The long-term dependency problem of RNNs. Exploding and vanishing gradients.
The core of an LSTM network is its cell, or more specifically, its cell state, which gives the LSTM some memory so it may retain information from the past.
The building blocks of an LSTM are its gates. Three gates are used in the LSTM:
Input gate. Forget Gate. Output Gate.
In an LSTM, the gates use sigmoid activation functions, which output values between 0 and 1; in most cases the output is close to either 0 or 1.
For example, in a language task the cell state may remember the gender of the subject in an input sequence so that the correct pronoun or verb form can be used later. Figure 2 shows the mathematical modelling of the LSTM model.
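As a textual counterpart to the gate description above, the commonly used LSTM update equations are sketched below. The notation is an assumption of this sketch and may differ from that of the cited figure: $x_t$ is the input, $h_t$ the hidden state, $c_t$ the cell state, $W_\ast$ and $b_\ast$ are learned weights and biases, $\sigma$ is the sigmoid function, and $\odot$ denotes element-wise multiplication.

```latex
\begin{aligned}
f_t &= \sigma\left(W_f\,[h_{t-1}, x_t] + b_f\right) && \text{(forget gate)}\\
i_t &= \sigma\left(W_i\,[h_{t-1}, x_t] + b_i\right) && \text{(input gate)}\\
\tilde{c}_t &= \tanh\left(W_c\,[h_{t-1}, x_t] + b_c\right) && \text{(candidate cell state)}\\
c_t &= f_t \odot c_{t-1} + i_t \odot \tilde{c}_t && \text{(cell state update)}\\
o_t &= \sigma\left(W_o\,[h_{t-1}, x_t] + b_o\right) && \text{(output gate)}\\
h_t &= o_t \odot \tanh(c_t) && \text{(hidden state)}
\end{aligned}
```

Because the cell state $c_t$ is updated additively and scaled by the forget gate rather than repeatedly passed through a squashing nonlinearity, gradients can flow across many time steps, which is how the LSTM mitigates the vanishing-gradient problem.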
Dataset used
Distribution of GIFs for different emotions in the testing dataset
Mathematical modelling of LSTM Model [37].
The Kinetics-400 dataset [12] was used to train the proposed model. The Kinetics Human Action Video dataset has a significantly larger number of videos than previous datasets of this type. It contains 400 human action classes, each with at least 400 video clips (400–1150 clips per class), and each clip is taken from a separate YouTube video and lasts around 10 seconds. The actions are all human-centred and fall into a range of categories, including human-object interactions such as playing instruments and human-human interactions such as shaking hands. Because the use case is classification, only short clips of around ten seconds containing the action are provided, and no untrimmed videos are used. The dataset nevertheless has the potential to be used for a range of applications, including multi-modal analysis, because the clips include sound. On this dataset, the performance of standard existing models is significantly lower than on UCF-101 [14] and comparable to that on HMDB-51; however, the dataset is large enough for big models such as 3D ConvNets to be trained from scratch.
The GIFGIF dataset contains about 23,544 GIFs categorised into 17 different emotions (Satisfaction, Anger, Pride, Contentment, Guilt, Excitement, Sadness, Fear, Happiness, Contempt, Pleasure, Relief, Embarrassment, Amusement, Shame, Disgust, and Surprise). It has been used to predict emotions from a large number of animated GIFs with a clustered multi-task learning technique, which efficiently categorises a large number of target GIFs in terms of the 17 emotion categories and the intensity of those emotions using 3D CNNs and transfer learning. The collection includes both animated and human-centred GIFs. For this task, only GIFs containing human faces were evaluated. The Giphy Application Programming Interface (API) was used to pull 856 GIFs from the "GIFGIF" collection for use in the testing procedure. The distribution of each emotion in relation to the total number of GIFs is shown in Table 1.
To train the emotion analysis model on textual data, we created a dataset comprising 20,000 tweets from Twitter. Of these, 16,000 tweets were used for training the emotion analysis model, 2,000 were used as a validation set, and 2,000 were used for testing. The dataset comprises 7 labels: Anger, Happiness, Disgust, Surprise, Sadness, Fear, and Neutral. Figure 3 shows the distribution of the training tweets across the 7 labelled classes.
Amount of textual data tweets divided according to emotion (1: Happy (2285), 2: Sad (2280), 3: Anger (2283), 4: Surprise (2287), 5: Fear (2192), 6: Disgust (2312), 7: Neutral (2361)).
The Kinetics-400 dataset was prepared by first downloading the entire dataset, which has 400 classes, is about 400 GB in size, and comprises around 300,000 videos. The videos were stored in separate folders for the train, test, and validation sets, which were further divided into folders by video clip class. The videos were downloaded with the Python package youtube-dl, invoked through subprocess calls, using the unique YouTube ID provided for each clip in the Kinetics-400 CSV file. All video files that were corrupted or missing were discarded. After the full video was downloaded, it was trimmed using ffmpeg according to the timestamps specified in the CSV file. Parallel computation was used to speed up data pre-processing. After the clips were trimmed, the frames and the optical flow of each video were extracted using the OpenCV package for Python and saved for later training and evaluation.
For the testing dataset, GIFs were extracted from the GIFGIF dataset and placed in a separate directory for each emotion. The dataset was cleaned by removing corrupted and invalid (animated) GIFs. Each video ID was mapped to its matching emotion using a CSV (Comma Separated Values) file. These directories and the CSV file were then used to evaluate the framework’s performance at runtime.
Prior to model building, text pre-processing is required to prepare the text data for use in the model. A variety of techniques are used for this data preparation: eliminating punctuation marks such as “.,!$()*%@”, removing URLs and stop words, lowercasing the text, and tokenization are a few examples. Stop words are then removed. These are common words that carry little or no meaning and contribute nothing to the analysis, so they must be eliminated from the text.
However, it is not mandatory to use every stop word on a predefined list, because stop words should be chosen carefully to meet the project’s requirements. Depending on the circumstances, a custom list of stop words may be constructed instead.
After these procedures are completed, stemming or lemmatization is used to further refine the output. Both reduce words to their root or base form, a process known as text standardisation. Stemming has the drawback that it may lose the meaning of the word or fail to reduce it to a genuine English term, whereas lemmatization reduces the word while ensuring that its meaning is not lost. Lemmatization also relies on a pre-defined vocabulary: each word is checked in context against that dictionary, which reduces the overall size of the vocabulary.
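The text pre-processing steps described above (lowercasing, URL removal, punctuation removal, tokenization, and stop-word removal) can be sketched as follows. This is a minimal illustration using only the Python standard library; the function name `preprocess_tweet` and the tiny `CUSTOM_STOP_WORDS` list are hypothetical, and stemming or lemmatization is deliberately left out because it would typically rely on an external NLP library.

```python
import re
import string

# Hypothetical, deliberately small custom stop-word list; a real project
# would curate its own list to suit its requirements.
CUSTOM_STOP_WORDS = {"a", "an", "the", "is", "are", "to", "of", "and",
                     "so", "this", "i"}

URL_PATTERN = re.compile(r"https?://\S+|www\.\S+")
PUNCT_TABLE = str.maketrans("", "", string.punctuation)


def preprocess_tweet(text, stop_words=CUSTOM_STOP_WORDS):
    """Lowercase, strip URLs and punctuation, tokenize, drop stop words."""
    text = text.lower()
    # URLs are removed before punctuation so that the pattern still matches.
    text = URL_PATTERN.sub(" ", text)
    text = text.translate(PUNCT_TABLE)
    tokens = text.split()  # simple whitespace tokenization
    return [tok for tok in tokens if tok not in stop_words]


if __name__ == "__main__":
    print(preprocess_tweet("This GIF is SO funny!!! https://example.com/x"))
```

Passing a different `stop_words` set is the simplest way to apply a project-specific list without changing the rest of the pipeline.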
Methodology
Action recognition in GIFs
The I3D architecture [13] is used to build seamless spatio-temporal feature extractors from GIFs for the human activity recognition module. I3D is a Two-Stream Inflated 3D ConvNet obtained by inflating a 2D ConvNet: all of the network’s filters and pooling kernels are expanded to take time into account. In 2D models, filters are square matrix of
Architecture
Architecture of proposed action recognition model.
In Inception-V1, the initial convolutional layer with stride 2, is followed by four max-pooling layers with stride 2 and a 7
The Inception-V1 model was trained on videos with stochastic gradient descent on a 12 GB NVIDIA Tesla K80 GPU with a batch size of 128. After the model was initialised, the learning rate hyperparameters were tuned on the Kinetics validation set. All of the models were implemented in PyTorch. During training, random cropping was employed while ensuring the required number of frames, and smaller GIFs were scaled up to 256 pixels before randomly cropping a 224
The action recognition model predicts the top 5 action classes and their related confidence scores for a given input GIF once it has been trained. These action classes are mapped depending on the emotion experienced throughout the activity. As a consequence, the end outcome is emotion mapped to anticipated action class with the greatest confidence score.
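The mapping from predicted actions to an emotion can be sketched as below. The action names, the contents of `ACTION_TO_EMOTION`, and the `Neutral` fallback for unmapped actions are assumptions made for illustration; the mapping used in the study is not reproduced here.

```python
# Hypothetical mapping from action classes to emotions (illustration only).
ACTION_TO_EMOTION = {
    "laughing": "Happy",
    "crying": "Sad",
    "punching person": "Anger",
    "applauding": "Happy",
}


def emotion_from_actions(predictions, mapping=ACTION_TO_EMOTION, default="Neutral"):
    """Return the emotion mapped to the highest-confidence action.

    predictions: list of (action_class, confidence) pairs, for example the
    top-5 outputs of an action recognition model.
    """
    if not predictions:
        return default
    best_action, _ = max(predictions, key=lambda pair: pair[1])
    # Falling back to a default for unmapped actions is an assumption.
    return mapping.get(best_action, default)


top5 = [("crying", 0.61), ("laughing", 0.20), ("applauding", 0.09),
        ("punching person", 0.06), ("yawning", 0.04)]
emotion = emotion_from_actions(top5)
```

Only the single most confident action decides the emotion here, which matches the description above; a weighted vote over all five predictions would be a different design.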
Facial expression recognition in GIFs
In GIFs where human facial landmarks and head posture are plainly visible, facial behaviour analysis (expression recognition) plays a key part in emotion identification. The OpenFace toolkit, which provides a pipeline for constructing interactive applications based on facial behaviour analysis, is employed for this purpose. Emotions in a video sequence can be identified by detecting and analysing facial action units (AUs). Each AU number corresponds to a distinct facial muscle action, making AUs the building blocks of facial expressions. An emotional expression is formed by a combination of action units. Activation of AUs 6 (cheek raiser) and 12 (lip corner puller), for example, produces a happy face, whereas activation of AUs 1 (inner brow raiser), 4 (brow lowerer), and 15 (lip corner depressor) produces a sad face. AUs have been used to predict the facial expressions associated with various experimental situations and to characterise emotional states such as contempt, confusion, and guilt. Figure 5 shows the detection of action units on a face and the corresponding emotion detected from them.
Table for most prominent action units shown in different emotions detected
Detection of action units for emotion recognition.
Emotion detectors are trained on deliberately posed or naturalistically induced emotional facial expressions, allowing them to categorise new images according to how closely a face resembles a canonical emotional facial expression. It is crucial to remember that labelling a smiling face as cheerful does not necessarily mean that the person is in a pleasant internal state. However, in emotion research, labelling specific configurations of AUs with semantic concepts of emotions has been useful for characterising the contexts in which people tend to display these facial expressions, as well as how the display of certain emotion expressions accompanies changes in learning and social behaviours. Facial action units (AUs), which are part of the Facial Action Coding System (FACS), are an appealing tool for studying facial expressions because they are shared by all the fundamental emotions (happiness, sadness, anger, fear, surprise, disgust, and contempt).
FACS is made up of 44 action units, including those for the head and eyes. The production of AUs is physically linked to the contraction of particular facial muscles. AUs can occur on their own or in various combinations. AU combinations can be additive, in which case the appearance of the constituents remains unchanged, or nonadditive, in which case the appearance of the constituents changes as a result of the combination (analogous to co-articulation effects in speech). A 5-point ordinal scale is used to quantify the degree of muscle contraction for action units that vary in intensity. Although there are only a few atomic action units, researchers have observed over 7,000 distinct combinations of them. FACS therefore provides the level of detail needed to characterise facial expressions, and the proposed approach uses FACS data to detect emotions expressed through facial expressions. Table 2 shows the combination of facial action units used to detect a particular emotion.
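A rule-based lookup from active action units to an emotion might look like the sketch below. Only the two AU combinations stated in the text (AUs 6 and 12 for a happy face, AUs 1, 4, and 15 for a sad face) are included; the per-AU intensity dictionary, the activation threshold of 1.0, and the function names are assumptions for illustration and do not reproduce Table 2 or any toolkit's output format.

```python
# AU combinations taken from the examples in the text; other emotions omitted.
EMOTION_AUS = {
    "Happy": {6, 12},      # cheek raiser + lip corner puller
    "Sad": {1, 4, 15},     # inner brow raiser + brow lowerer + lip corner depressor
}


def active_units(intensities, threshold=1.0):
    """Return the set of AUs whose intensity reaches the (assumed) threshold."""
    return {au for au, value in intensities.items() if value >= threshold}


def match_emotion(intensities, rules=EMOTION_AUS, threshold=1.0):
    """Return the first emotion whose required AUs are all active, else None."""
    active = active_units(intensities, threshold)
    for emotion, required in rules.items():
        if required <= active:  # subset test: every required AU is active
            return emotion
    return None


# Hypothetical per-frame AU intensities keyed by AU number.
frame = {1: 0.2, 4: 0.0, 6: 2.5, 12: 3.1, 15: 0.1}
label = match_emotion(frame)
```

Returning `None` when no rule matches leaves room for a fallback, such as the action-based emotion, to be applied by the caller.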
Methodology for getting final emotion score for a GIF.
After the two separate phases, namely the action emotion detection phase and the facial expression emotion phase, we obtain two emotion values for the same GIF, which may be the same or different. While facial expressions are sufficient to identify the emotion in a wide range of GIFs, there are numerous scenarios in which they are unreliable, such as when face information is only partially available or when capturing facial expressions is difficult. In such scenarios, the capacity to interpret emotional states from activities such as bodily motions, contact with external objects, and a wide range of human acts is essential. We do, however, prioritise emotions recognised from facial expressions over emotions inferred from the action performed, since our testing has shown that facial expressions are a more accurate way of recognising emotions displayed in a GIF. Consequently, when computing the final score, we give precedence to the facial expression emotion if a facial expression is detected, and otherwise use the emotion detected from the action performed. This is depicted in Fig. 6.
Language modelling is the first step in the RNN-LSTM-based framework, followed by fine-tuning of the deep learning model for sentiment analysis. PyCharm is used as the software environment for training and testing. The purpose of language modelling is to train the model to understand a given language using a large data corpus. This knowledge is used in various Natural Language Processing (NLP) tasks such as paragraph summarization, text generation, named entity recognition, text classification, and many more. Language-model pre-training is followed by a number of fine-tuning stages to acquire the features required for a specific NLP task. Because language modelling uses unsupervised learning, the amount of labelled data needed to fine-tune the model is reduced dramatically, resulting in less annotation and computing time.
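The precedence rule for combining the two emotion values can be written as a short function. The function name, the use of `None` to signal that no face or action emotion was detected, and the `Neutral` fallback when neither is available are assumptions for this sketch.

```python
def fuse_emotions(face_emotion, action_emotion, default="Neutral"):
    """Prefer the facial-expression emotion; fall back to the action emotion.

    Each argument is an emotion label, or None if that module detected nothing.
    """
    if face_emotion is not None:
        return face_emotion
    if action_emotion is not None:
        return action_emotion
    # Neither module produced a label; the default is an assumption.
    return default


final_both = fuse_emotions("Happy", "Sad")        # face takes precedence
final_action_only = fuse_emotions(None, "Sad")    # no face detected
```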
Following the cleaning of the raw English data, numerous tags are defined in order to build the language modelling dataset from the cleaned Twitter data. TensorFlow along with Fast-AI are used. The host system had 16 GB of RAM and an Nvidia Tesla P100. This section delves into the design and methods of the proposed RNN-LSTM network. This network outperforms the CNN and RNN models.
An RNN can retain information from earlier computations and feed it into the processing of the next element of the input sequence. RNNs excel in tasks involving sequential data, which helps prevent context from being misinterpreted. Accuracy can be improved by combining LSTM with RNN, since LSTM addresses the long-term dependency problem of the plain RNN. Cleaning, tokenization, and encoding are carried out on the data before it is saved for later use. Figure 7 depicts the architecture of the emotional analysis model for textual input using the RNN-LSTM framework.
Architecture for emotional analysis model for textual data
The preliminary GIF recommendation system takes advantage of the emotions recognised in GIFs to create a general corpus of GIFs grouped by emotion. When a person uses the system, the emotion expressed in the typed text is analysed and matched against the GIFs in the corpus, and a random GIF with the matching emotion is recommended.
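The lookup step of such a recommender can be sketched as follows, assuming the text emotion has already been predicted by the text model. The corpus contents, file names, and the `recommend_gif` function are hypothetical.

```python
import random

# Hypothetical corpus: GIF identifiers grouped by their detected emotion.
GIF_CORPUS = {
    "Happy": ["happy_01.gif", "happy_02.gif"],
    "Sad": ["sad_01.gif"],
}


def recommend_gif(text_emotion, corpus=GIF_CORPUS, rng=random):
    """Pick a random GIF whose emotion label matches the text emotion.

    Returns None when the corpus holds no GIF for that emotion.
    """
    candidates = corpus.get(text_emotion, [])
    return rng.choice(candidates) if candidates else None


suggestion = recommend_gif("Happy")
```

Passing a seeded `random.Random` instance as `rng` makes the choice reproducible, which can be convenient when testing.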
Results for action recognition
In this section, we compare the model’s performance across the datasets used for training and testing. Testing was performed on the UCF-101 and Kinetics-400 datasets. The model’s accuracy was 92.3 percent on UCF-101 and 74.2 percent on Kinetics. We conduct our tests on the split 1 test set of UCF-101 and the held-out test set of Kinetics. Several significant observations can be made. To begin with, our I3D models surpass the competitors on both datasets. Given the large number of parameters and the small size of UCF-101, this shows that the advantages of ImageNet pre-training extend to 3D ConvNets as well. All models perform much worse on Kinetics than on UCF-101, indicating that the two datasets differ dramatically in difficulty.
The output from certain nodes can influence future input to the same nodes in a recurrent neural network (RNN), a family of artificial neural networks where connections between nodes can establish a loop. It can display temporal dynamic behaviour as a result of this. RNNs, which are derived from feedforward neural networks, can process input sequences of varying length by using their internal state (memory).
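The role of the internal state can be illustrated with a single-unit recurrence, h_t = tanh(w_x * x_t + w_h * h_{t-1} + b). This is a toy sketch with scalar, hand-picked weights (an assumption for illustration), not the network trained in this work.

```python
import math


def rnn_step(x, h_prev, w_x, w_h, b):
    """One recurrence step: h_t = tanh(w_x * x_t + w_h * h_{t-1} + b)."""
    return math.tanh(w_x * x + w_h * h_prev + b)


def run_rnn(sequence, w_x=0.5, w_h=0.8, b=0.0, h0=0.0):
    """Apply the same step to a sequence of any length; return all states."""
    states = []
    h = h0
    for x in sequence:
        h = rnn_step(x, h, w_x, w_h, b)  # previous state feeds the next step
        states.append(h)
    return states


short = run_rnn([1.0, 0.0])
longer = run_rnn([1.0, 0.0, 0.0, 0.0])
```

The second state in `short` is non-zero even though its input is zero, because the state from the first step is carried forward; the same function also handles both sequence lengths without modification.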
Convolutional neural networks belong to the class of networks with a finite impulse response, whereas recurrent neural networks belong to the class with an infinite impulse response. A finite impulse recurrent network is a directed acyclic graph that can be unrolled and replaced with a strictly feedforward neural network, whereas an infinite impulse recurrent network is a directed cyclic graph that cannot be unrolled. Both kinds of networks display temporal dynamic behaviour.
Both finite impulse and infinite impulse recurrent networks can have additional stored states, and the storage can be under the direct control of the neural network. The storage can also be replaced by another network or graph if it incorporates time delays or has feedback loops. Such controlled states are referred to as gated states or gated memory, and are part of long short-term memory networks (LSTMs) and gated recurrent units. This is also known as a feedback neural network (FNN).
In terms of the model, we developed our neural network using RNN-LSTM sequential, bidirectional, and dense layers. The output is activated with SoftMax, which turns a vector of real numbers into a vector of probabilities, with each probability proportional to the relative scale of its value. Recurrent neural networks of the Long Short-Term Memory (LSTM) type can learn order dependence in sequence prediction problems. This behaviour is required for solving complicated problems in areas such as machine translation and speech recognition. LSTMs are a challenging area of deep learning, and understanding what they are and how concepts like bidirectional and sequence-to-sequence models apply to the field can be difficult.
Furthermore, across all datasets, two-stream systems outperform single-stream architectures, but the relative importance of RGB and flow differs considerably between Kinetics and the other datasets. The contribution from flow alone is slightly higher than that of RGB on UCF-101, but substantially lower on Kinetics. A visual comparison of the datasets suggests that Kinetics contains substantially more camera motion, which may complicate the operation of the motion stream. The I3D model, on the other hand, appears to be able to extract more information from the flow stream than the other models, owing to its significantly larger temporal receptive field (64 frames during training versus 10 frames during testing) and more integrated temporal feature extraction machinery. On Kinetics, the RGB stream seems to contain more discriminative information than the flow stream. We found it difficult to identify actions from flow alone in Kinetics, which was not the case with RGB; future research on incorporating some form of motion stabilisation into these architectures may present opportunities.
Metrics evaluations
Theory of precision and recall
Precision measures how many of the outcomes the model predicts as positive are truly positive. When the cost of False Positives is high, precision is an appropriate criterion for determining efficacy. The formula in Eq. (1) represents precision.
Recall is the ratio of correctly detected positive results to the total number of actual positives. Recall is also known as “sensitivity”, as shown in Eq. (2). Table 3 depicts the precision and recall theory.
where TP
An F1 score is useful for attaining a desirable balance of precision and recall. Accuracy can be dominated by True Negatives, which is why the F1 score may be a better metric when we need to balance Precision and Recall and the class distribution is unequal (large number of Actual Negatives). The mathematical formula for F1 score is shown in Eq. (3).
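The three metrics can be computed from raw counts using their standard definitions: precision = TP / (TP + FP), recall = TP / (TP + FN), and F1 = 2 * precision * recall / (precision + recall). Eqs. (1) to (3) are assumed to follow these definitions. The counts in the sketch below are hypothetical and are not results from this study.

```python
def precision(tp, fp):
    """Fraction of predicted positives that are truly positive."""
    return tp / (tp + fp) if (tp + fp) else 0.0


def recall(tp, fn):
    """Fraction of actual positives that the model detected."""
    return tp / (tp + fn) if (tp + fn) else 0.0


def f1_score(p, r):
    """Harmonic mean of precision and recall."""
    return 2 * p * r / (p + r) if (p + r) else 0.0


# Hypothetical counts: true positives, false positives, false negatives.
tp, fp, fn = 8, 2, 4
p = precision(tp, fp)
r = recall(tp, fn)
f1 = f1_score(p, r)
```

True negatives do not appear in any of the three formulas, which is why these metrics remain informative when negatives dominate the data.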
Classification Report (0: Anger, 1: Disgust, 2: Fear, 3: Happiness, 4: Sadness, 5: Surprise, 6: Neutral)
Confusion matrix.
The macro-averaged F1 score, or macro F1 score, is the arithmetic (unweighted) mean of all the per-class F1 scores. This method treats all classes equally, regardless of their support values. In the table above, the macro average of all the scores is close to 0.89. The weighted-average F1 score is determined by averaging all of the per-class F1 scores while accounting for the support of each class. The “weight” of a class is its share of the total support. In the table above, the weighted average is also close to 0.89.
When the trained emotion recognition model is tested on the Kinetics-400 dataset, the classification report displayed in Table 4 is produced. The precision for GIFs with the ‘Anger’ emotion is 0.70, indicating that the model labels some GIFs as ‘Anger’ when they are not, that is, it has a relatively high false positive rate for this emotion. The recall for the same emotion is strong (0.93), indicating that the model has a lower rate of false negatives and accurately associates the ‘Anger’ label with the GIFs that genuinely have ‘Anger’ as the ground truth emotion. Furthermore, for the emotion label ‘Disgust’, the model achieves the best F1-score (0.92), indicating good precision and recall values. The F1-score for the most dominant emotion, ‘Neutral’, is high (0.91), owing to the model’s ability to acquire more features for this emotion class than for other emotion classes.
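The difference between the two averages can be seen in a small sketch. The per-class F1 scores and support values used here are hypothetical and chosen only to make the arithmetic easy to follow; they are not taken from the classification report.

```python
def macro_f1(f1_scores):
    """Unweighted mean of per-class F1 scores."""
    return sum(f1_scores) / len(f1_scores)


def weighted_f1(f1_scores, supports):
    """Mean of per-class F1 scores weighted by each class's share of support."""
    total = sum(supports)
    return sum(f * s for f, s in zip(f1_scores, supports)) / total


# Hypothetical two-class example: the second class has three times the support.
f1s = [0.5, 1.0]
supports = [10, 30]
macro = macro_f1(f1s)                 # (0.5 + 1.0) / 2
weighted = weighted_f1(f1s, supports) # (0.5 * 10 + 1.0 * 30) / 40
```

When the two averages are close, as reported for the model, per-class performance does not depend strongly on how many examples each class has.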
Classification Report (0: Anger, 1: Disgust, 2: Fear, 3: Happiness, 4: Sadness, 5: Surprise, 6: Neutral)
Accuracy and loss curves vs epochs.
The Confusion Matrix is a matrix representation of a machine learning classification performance metric. To compare actual and expected values, a
For emotion analysis in textual data, we trained our model for 500 steps and validated on 2000 examples in each epoch, for a total of 20 epochs. As a result, our model achieves 87.30% validation accuracy and 96.71% training precision.
The accuracy and loss curves against epochs for the emotion analysis model using textual data are shown in Fig. 9. The model’s accuracy on the training dataset is shown by the blue curves, while its accuracy on the validation dataset is shown by the orange curves. The classification report for the emotional analysis model for textual data is displayed in Table 5.
Conclusion
GIFs are a versatile medium for expressing or visually conveying a variety of emotions due to their capacity to incorporate different types of body language such as facial expressions, movements, and gestures. The inspiration for this study stems from the fact that GIFs have been far less explored for detecting emotions. Recognising emotions in GIFs enables several applications, including a GIF tag generating system, a recommendation system, detection of toxic and inappropriate content, and identification of GIF-induced sentiment as perceived by the viewer. Furthermore, emotion prediction in GIFs can make it simpler to express oneself on social media apps, exhibit a brand’s attitude or personality, advertise, and develop stories based on one’s mood.
The current method combines the visual emotional score with the verbal emotional score to forecast the overall emotion conveyed by the GIF. The visual emotional score of brief annotated GIFs may be extracted using 3D Convolutional Neural Networks (C3D). However, the combined visual features capture only a brief visual context and do not fully comprehend the emotion portrayed in the GIF video. Integrating longer-term context into the model for video sentiment analysis would make it more effective.
