Abstract
With the trend of people expressing opinions and emotions via images online, increasing attention has been paid to affective analysis of visual content. Traditional image affective analysis mainly focuses on single-label classification, but an image usually evokes multiple emotions. To this end, emotion distribution learning is proposed to describe emotions more explicitly. However, most current studies ignore the ambiguity included in emotions and the elusive correlations with complex visual features. Considering that emotions evoked by images are delivered through various visual features, and each feature in the image may have multiple emotion attributes, this paper develops a novel model that extracts multiple features and proposes an enhanced fuzzy k-nearest neighbor (EFKNN) to calculate the fuzzy emotional memberships. Specifically, the multiple visual features are converted into fuzzy emotional memberships of each feature belonging to emotion classes, which can be regarded as an intermediate representation to bridge the affective gap. Then, the fuzzy emotional memberships are fed into a fully connected neural network to learn the relationships between the fuzzy memberships and image emotion distributions. To obtain the fuzzy memberships of test images, a novel sparse learning method is introduced by learning the combination coefficients of test images and training images. Extensive experimental results on several datasets verify the superiority of our proposed approach for emotion distribution learning of images.
Introduction
With the prevalence of social networks, people prefer to record their daily lives and express their opinions via visual content, such as images and videos [1]. As images are one of the most intuitive and convenient media, image emotion recognition has attracted increasing research interest. The emotions people perceive from images may directly determine their decisions and influence their visual preferences. Therefore, practical analysis of images at an affective level can help understand users’ behaviors and attitudes, which benefits a wide range of applications, such as advertising [2], opinion mining [3], and depression detection [4].
However, compared to traditional visual tasks, image emotion recognition is inherently challenging because emotion perception is subjective. The same image may evoke different emotions in viewers with diverse social and cultural backgrounds [5]. Furthermore, even one viewer may have multiple feelings toward an image. Recent studies mainly paid attention to image emotion classification based on the supposition that emotions perceived by viewers can reach a consensus [6]. However, a single label cannot describe the affective content of an image accurately. Thus, emotion distribution [7, 8] has been proposed to capture the latent emotions of images rather than pure classification. By assigning a given number of labels, the emotion distribution denotes the degree to which each label describes the image emotions. Figure 1 shows two samples from different datasets and their distribution labels, which indicates that there exists ambiguity among emotions. Hence, this work considers image emotion recognition as an emotion distribution prediction task instead of a classification task.

Affective images with their emotion distributions from different datasets.
Similar to other computer vision tasks, the semantic gap between low-level features and high-level semantics still exists in image emotion recognition. A major obstacle is how to express subjective emotion with objective visual information. For image emotion recognition, the semantic gap extends to the affective gap [9]. Some studies have investigated various approaches to address this challenge. In the early stages, hand-crafted features were designed based on observations of how people perceive the emotion of an image and on image aesthetics [6, 11]. With the rise of deep convolutional neural networks (CNNs), recent image emotion recognition methods have shifted to learning discriminative image features [12], which shows the potential of robust learned features.
In fact, deep features alone are not sufficient for emotion recognition because emotions and semantics may also be delivered by low-level and mid-level visual features [13]. Thus, several attempts have been made to obtain more discriminative representations for image emotion recognition by jointly combining multiple features through fusion strategies [14, 15]. But these studies focus on mapping visual features to emotions directly, which makes it difficult for people to comprehend how the decisions are made. Others prefer to learn affect-related visual concepts [16, 17] to bridge the affective gap. However, the visual concepts must be pre-defined, which may fail to handle complex scenes because one object may convey different emotions in different contexts. Additionally, they ignore that each feature in the image may have multiple emotional attributes. For example, the color red can not only evoke feelings of excitement but also suggest violence. As emotions are inherently ambiguous and imprecise, the connections between visual features and emotions are elusive, and discriminative learning methods struggle to deal with such fuzziness. Based on this observation, this work introduces the fuzzy approach to bridge the affective gap for image emotion recognition by generating the fuzzy emotional membership of visual features to emotional terms. In this way, the relationships between various visual features and emotions of an image can be quantified. To achieve this, we propose an enhanced fuzzy k-nearest neighbor algorithm (EFKNN) to convert the visual features into fuzzy emotional memberships from the perspective of affective semantics. It is a supervised, deterministic variant of k-nearest neighbor that benefits from fuzzy set theory.
In this paper, we develop a novel emotion distribution learning framework for image emotion recognition. Various types of visual features are extracted to explore the relations between the multiple visual features and emotions. The EFKNN algorithm is proposed to calculate the fuzzy emotional memberships of these visual features. The values of the fuzzy emotional memberships represent the correlations between each feature extracted from training images and each emotion class, which are regarded as the intermediate representations to narrow the affective gap. Then, the fuzzy emotional memberships are taken as the input of a fully connected neural network to learn the relationship between fuzzy emotional memberships and the labeled emotion distributions. For an unlabeled image, the fuzzy emotional memberships cannot be obtained directly because of the affective gap. Hence, sparse learning is utilized to obtain the fuzzy emotional memberships of test images: the visual features of a test image are reconstructed from the visual features of the training images by learning the combination coefficients. This is based on the hypothesis that the fuzzy emotional memberships of an unlabeled image can be approximately modeled as a linear combination of the fuzzy emotional memberships of the training images. The final emotion distribution of a test image is predicted through the trained fully connected neural network.
The main contributions of this paper are summarized as follows: We propose a new strategy for emotion distribution prediction. This method transforms the visual features into fuzzy emotional memberships to bridge the affective gap. To the best of our knowledge, this is the first work to introduce fuzzy memberships to explore the complex connections between visual features and emotions from an interpretable perspective. In order to capture the ambiguity and uncertainty of visual features and emotion classes, an EFKNN algorithm based on semantics is proposed to calculate the fuzzy emotional memberships of multiple features to emotions. A sparse learning algorithm based on fuzzy emotional memberships is proposed for emotion distribution learning, which learns the shared parameters by reconstructing the features of test images from those of training images. The fuzzy emotional memberships of the test image are then calculated by linearly combining the fuzzy memberships of the training images. Experiments performed on the Emotion6 [5], Abstract [6] and IESN [18, 19] datasets verify the effectiveness of our proposed method.
The rest of this paper is organized as follows. Related works are reviewed in Section 2. The proposed approach for image emotion distribution learning is described in Section 3. Experimental settings, results and analysis are elaborated in Section 4. Finally, the conclusion and future work are presented in Section 6.
This section briefly reviews existing works on image emotion recognition, the background of the fuzzy approach and sparse learning.
Image emotion recognition
With the development of computer vision, image emotion recognition has become a meaningful research topic in recent years. The solutions mainly fall into two types of emotion representation approaches: the dimensional model, which projects emotions into a dimensional emotion space [7, 20], and the categorical model, which classifies emotions into several basic categories [6, 21]. Categorical models are easy for ordinary people to understand and thus have been mainly employed by previous studies. As emotion is a highly subjective and complex variable, existing studies that treat image emotion recognition as a classification task have neglected explicit differences in emotion intensity. Therefore, label distribution learning algorithms [22] were proposed to assign a descriptive degree value to each category. In such a case, it is more reasonable to predict the emotion distribution of images. Consequently, Peng et al. [5] constructed a CNN regression model that utilized a deep CNN with the Euclidean loss to train a regressor for each emotion, and the regression results were transformed into an emotion distribution through normalization. Based on the conditional probability neural network (CPNN) [23], the binary conditional probability neural network (BCPNN) and the augmented conditional probability neural network (ACPNN) [24] were developed for emotion distribution prediction. Zhao et al. [8, 25] proposed shared sparse learning (SSL) to map features to emotion distributions by learning the mapping factors. Considering the co-occurrence among emotion labels, graph convolutional networks [26] and a structural learning framework [27] were both applied to model the correlations among labels for emotion distribution prediction.
Learning a discriminative feature representation is crucial to image emotion recognition. Early researchers on image emotion recognition concentrated on designing hand-crafted features, such as holistic features [11], color and texture [6], principles-of-art features [10], attributes [28] and Adjective Noun Pairs (ANPs) [16]. Recently, deep CNN-based features have been widely used in image emotion recognition [12, 17]. Because of the complexity of emotions, researchers gradually prefer to integrate multiple features for emotion recognition [14, 29]. However, the correlations between the features and the emotions have not been explored well. The existing methods stated above only map the features to labels directly, and people cannot comprehend why such features induce a particular emotion. To tackle this issue, we combine multiple features for image emotion recognition with the fuzzy approach to probe the relationships between multiple visual features and emotions, which can improve the interpretability of image emotion recognition and quantify the connections of multiple features and emotions.
Applications of fuzzy approach
The fuzzy approach, which is based on fuzzy logic [30], aims to deal with uncertain data. In this context, each instance is typically not crisp and thus belongs to different classes to different degrees. In recent years, fuzzy approaches have been adopted in a variety of applications, such as text classification [31], medical diagnosis [32] and decision-making [33]. As affective analysis is permeated with uncertainty, the fuzzy approach is well suited to modeling the fuzziness inherent in images and emotions. Li et al. [34] developed 3D fuzzy visual features by fuzzy c-means (FCM) to construct the emotional feature space for analyzing the semantics of images. Fuzzy set theory has also been applied to automatic semantic annotation [35] and emotion understanding [36]. But most of these works only fuzzify the features or construct fuzzy classifiers for the final emotion classification. Besides, researchers have attempted to exploit fuzzy clustering for visual content analysis [37, 38]. Fuzzy clustering is an unsupervised approach to fuzzy classification and hence cannot overcome the affective gap. The aforementioned studies only address the ambiguity of emotion to a certain extent and do not bridge the affective gap at the same time.
The fuzzy k-nearest neighbor algorithm (FKNN) [39] is one of the successful algorithms that performs classification by adding fuzzy logic to the standard KNN algorithm [40]. It allocates a degree of fuzzy membership to each class while considering the distances to the k nearest neighbors. It has been applied to image classification [41] and retrieval [42]. Given its computational and conceptual simplicity, as well as its supervised manner of fuzzy classification, this work proposes an EFKNN algorithm based on FKNN to tackle the fuzziness of visual features and emotions. The proposed EFKNN can not only reduce the affective gap through the fuzzy emotional memberships but also address the black-box non-interpretability of previous methods.
Sparse learning
The main idea of sparse learning is that many important signals can be approximately represented as a linear combination of basis functions [43]. Sparsity has been exploited in computer vision to retrieve the sparse representation of data with regard to a dictionary or a series of bases [44, 45]. Wright et al. [46] have demonstrated that images can be sparsely represented from the perspective of the attributes of visual neurons. Zhao et al. integrated sparse learning to learn the continuous probability distribution [7] and the discrete probability distribution [47] of image emotions. In this paper, we also leverage the sparse representation method to acquire the fuzzy emotional memberships of unlabeled images. The novelty of our model lies in that the visual features are transformed into fuzzy membership vectors that serve as the intermediate representation, instead of learning the emotion distribution by sparse learning directly.
Methodology
In this section, we introduce the proposed model in detail. The goal of this work is to learn the emotion distribution of images. For an affective image, multiple features are extracted to probe the correlations between features and emotions. Suppose there are N training images $I = \{I_1, I_2, \ldots, I_N\}$ and L emotion categories $c_1, c_2, \ldots, c_L$. Given an affective image $I_n$ (n = 1, 2, …, N), we extract J types of features; the j-th feature vector of the image $I_n$ is denoted $x_{nj}$ (j = 1, 2, …, J).
Overall framework
As illustrated in Fig. 2, our proposed framework consists of four parts: a) multiple features extraction, b) fuzzy emotional membership calculation by the proposed EFKNN, c) sparse learning based on fuzzy emotional memberships, and d) emotion distribution prediction. Firstly, in the training phase, various kinds of features are extracted from training images to strengthen the representation ability of emotions. The dimensions of the extracted features are reduced by the principal component analysis (PCA) algorithm [48]. Secondly, an enhanced fuzzy k-nearest neighbor (EFKNN) based on semantics is proposed to calculate the fuzzy emotional memberships, which represent the degrees to which each feature belongs to each emotion category. Thirdly, the obtained fuzzy emotional memberships are fed into a fully connected neural network to learn the relationship between the fuzzy emotional memberships and the labeled emotion distribution. Then, in the test phase, a sparse learning algorithm based on sample transfer is proposed to learn the correlation coefficients (called shared factors learning), which reconstruct the visual features of the test image from the visual features of training images. The learned coefficients are then used to obtain the fuzzy memberships of the test image as a linear combination of the fuzzy emotional memberships of training images. In the end, the emotion distributions of the test images are predicted by the trained fully connected neural network, taking the acquired fuzzy emotional memberships as input.

The overall framework of our proposed method for image emotion distribution learning. The solid lines and dashed lines indicate the training phase and test phase, respectively.
In this section, an extended version of the FKNN algorithm is proposed, called Enhanced FKNN (EFKNN). The proposed EFKNN overcomes the drawback of the original FKNN, which is incapable of computing the fuzzy memberships of the visual features from the perspective of affective semantics. In image emotion recognition, it often happens that the low-level features of two images are very similar while the evoked emotions are quite different. Even though the FKNN can manage the ambiguity by assigning probabilities (called memberships) of instances to a class, it generates the membership degree from the location of the instance in feature space, which is determined by the low-level features. Therefore, the original FKNN is unable to narrow the affective gap.
To tackle this issue, we propose the EFKNN based on semantics to calculate the fuzzy emotional memberships of visual features to emotion categories. Like the conventional FKNN algorithm, the EFKNN considers the K nearest neighbors of each vector in the feature space. In contrast to the FKNN, the EFKNN uses a similarity measure based on the emotion distributions of images as a weight factor, in order to distinguish images with similar features but different emotions. An illustrative example is depicted in Fig. 3, showing the procedure of the EFKNN for three emotion categories A, B and C in a two-dimensional feature space. As shown in Fig. 3(1), we first project all items into the feature space and randomly select a set of class centers for these items. Meanwhile, the emotion distributions of the K nearest neighbors within the same class are averaged to give the emotion distribution of each class center. Since the emotion distribution of an image is delivered through the diverse visual features of that image, we suppose that the emotion distribution of each feature is the same as the emotion distribution of the image. Then, we measure the similarity between the emotion distribution of the feature and that of each class center as the intensity value in Fig. 3(2). The intensity values are regarded as weight factors: the closer the emotion distributions of a feature and a class center, the larger the weight. After that, the new class centers are acquired by taking the average of the K nearest neighbors, as shown in Fig. 3(3). When calculating the final membership degree of each feature vector to the emotion classes, only the distances of the features to the class centers are taken into account. Finally, we get the approximate real emotion distribution of each feature according to the same type of features from different images in the space, as shown in Fig. 3(4).
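For reference, the membership assignment of the original FKNN [39] can be sketched as follows. This is a minimal illustration of the standard inverse-distance-weighted rule, not the proposed EFKNN. The function name `fknn_memberships`, the toy data, and the `eps` guard against zero distances are assumptions made for the example; the fuzzifier `m` plays the role of the fuzzy parameter of the FKNN.

```python
import math


def fknn_memberships(query, train_x, train_u, k=3, m=2.0, eps=1e-12):
    """Fuzzy k-NN memberships of `query` for each class.

    train_x: list of training feature vectors.
    train_u: list of per-class memberships of each training vector.
    m: fuzzifier (> 1); with m = 2 each neighbor is weighted by 1 / distance^2.
    """
    # Keep the k training vectors closest to the query.
    neighbours = sorted(
        zip(train_x, train_u), key=lambda pair: math.dist(query, pair[0])
    )[:k]
    # Inverse-distance weights; eps (an assumption) avoids division by zero.
    exponent = 2.0 / (m - 1.0)
    weights = [1.0 / (math.dist(query, x) ** exponent + eps) for x, _ in neighbours]
    total = sum(weights)
    n_classes = len(train_u[0])
    return [
        sum(w * u[c] for w, (_, u) in zip(weights, neighbours)) / total
        for c in range(n_classes)
    ]


# Toy data: two classes in a two-dimensional feature space (hypothetical values).
train_x = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [5.0, 5.0]]
train_u = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0], [0.0, 1.0]]
memberships = fknn_memberships([0.5, 0.0], train_x, train_u, k=3, m=2.0)
```

Because the weights depend only on distances in feature space, two images with similar low-level features receive similar memberships regardless of the emotions they evoke, which is the limitation discussed above.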

The procedure of EFKNN considering three emotion categories A, B and C of one feature in a two-dimensional space.
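To make the distribution-similarity step concrete, the sketch below computes a Wasserstein-1 distance between two discrete emotion distributions, the distance named in this section for comparing a feature with a class center. It assumes the emotion categories sit at unit-spaced positions on a line, in which case the distance reduces to the sum of absolute differences between the cumulative distributions. This ground distance and the name `wasserstein_1d` are assumptions for illustration and may differ from the setting used in the method; how the distance is converted into the weight factor is not shown.

```python
def wasserstein_1d(p, q):
    """Wasserstein-1 distance between two discrete distributions.

    Assumes both distributions share the same support 0, 1, ..., L-1 with
    unit spacing (an assumption made for this sketch).
    """
    if len(p) != len(q):
        raise ValueError("distributions must have the same length")
    distance = 0.0
    cdf_p = 0.0
    cdf_q = 0.0
    for mass_p, mass_q in zip(p, q):
        cdf_p += mass_p
        cdf_q += mass_q
        # Mass that still has to be moved past this position.
        distance += abs(cdf_p - cdf_q)
    return distance


# Hypothetical distributions over three emotion categories A, B and C.
feature_dist = [0.7, 0.2, 0.1]
center_near = [0.6, 0.3, 0.1]
center_far = [0.1, 0.2, 0.7]
d_near = wasserstein_1d(feature_dist, center_near)
d_far = wasserstein_1d(feature_dist, center_far)
```

A smaller distance indicates a class center whose emotion distribution is closer to that of the feature, and such a center should therefore receive a larger weight.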
Given a set of feature vectors of the jth features of the training images
Let
We adjust the membership degree by the weighting factor $w_l$ to gain a reasonable emotional membership degree. Considering the symmetry between different distributions, we adopt the Wasserstein distance [49] to measure the similarity between the emotion distribution of the feature vector $p(x_{nj})$ and the emotion distribution of the class center
As depicted in Eq. (4), the fuzzy memberships of the jth features of the image $I_n$ can be collected and denoted as
Figure 4 shows two sample images and their fuzzy emotional memberships computed by the proposed EFKNN. Multiple types of features extracted from an image can be associated with multiple emotions, each with a different degree of fuzzy membership.

Sample images from the Emotion6 dataset and their fuzzy emotional memberships calculated by EFKNN. Six types of visual features extracted from an image are associated with emotion categories, each with a degree of fuzzy membership.
Based on the proposed EFKNN algorithm, the fuzzy emotional memberships of multiple features are calculated from the labeled images. However, the fuzzy emotional memberships of unlabeled test images cannot be acquired directly because of the affective gap. In view of the fact that an image is composed of different visual features, we assume that each feature of the image takes on a certain emotional membership degree and that the emotional membership degrees of the same features from different images are highly similar. This similarity makes the representation of similar features sparse. Thus, we introduce a sparse learning method to generate the fuzzy emotional membership of a given feature in a test image.
Different from prior work [47] that shifted the features to emotions directly, we adopt sparse learning to learn the correlation coefficients between the same features in different images. Then these correlation coefficients are used as joint sparsity constraints to linearly combine the fuzzy memberships of unlabeled images and training images. As illustrated in Fig. 2, the sparse learning based on fuzzy emotional memberships contains two procedures, shared factors learning and fuzzy emotional memberships mapping. First, the shared factors are learned through the visual features of training images. Then, we acquire the fuzzy memberships of the test image by linearly combining the fuzzy memberships of training images through the learned shared factors.
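The two procedures can be sketched for a single feature type as follows. This is a simplified illustration under stated assumptions rather than the exact formulation: it uses an ℓ1 penalty solved by iterative soft-thresholding (ISTA) as a stand-in for the sparsity constraint, it omits the joint sparsity across features, and all function names and toy values are hypothetical.

```python
def soft_threshold(value, threshold):
    """Proximal operator of the l1 norm."""
    if value > threshold:
        return value - threshold
    if value < -threshold:
        return value + threshold
    return 0.0


def sparse_coefficients(train_feats, test_feat, beta=1e-4, iters=500):
    """ISTA for min_a 0.5 * ||x - sum_i a_i * t_i||^2 + beta * ||a||_1.

    The l1 penalty is a swapped-in surrogate chosen for this sketch.
    """
    n = len(train_feats)
    dim = len(test_feat)
    # The squared Frobenius norm upper-bounds the gradient's Lipschitz constant.
    lipschitz = sum(v * v for feat in train_feats for v in feat) or 1.0
    step = 1.0 / lipschitz
    coeffs = [0.0] * n
    for _ in range(iters):
        recon = [sum(coeffs[i] * train_feats[i][d] for i in range(n)) for d in range(dim)]
        resid = [r - x for r, x in zip(recon, test_feat)]
        grad = [sum(feat[d] * resid[d] for d in range(dim)) for feat in train_feats]
        coeffs = [
            soft_threshold(coeffs[i] - step * grad[i], step * beta) for i in range(n)
        ]
    return coeffs


def map_memberships(coeffs, train_memberships):
    """Linearly combine training memberships with the learned coefficients."""
    n_classes = len(train_memberships[0])
    return [
        sum(c * member[l] for c, member in zip(coeffs, train_memberships))
        for l in range(n_classes)
    ]


# Toy example: three training images, one feature type, two emotion classes.
train_feats = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
train_memberships = [[0.8, 0.2], [0.5, 0.5], [0.1, 0.9]]
test_feat = [0.7, 0.0, 0.3]

coeffs = sparse_coefficients(train_feats, test_feat)          # shared factors learning
test_membership = map_memberships(coeffs, train_memberships)  # membership mapping
```

In this toy case the test feature is a mixture of the first and third training features, so only those two coefficients are non-zero and the test membership is the corresponding mixture of their memberships.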
Given a test image $I_m$, the feature set is defined as $I_m$ = {
Eq. (9) is an NP-hard problem that cannot be solved directly. We therefore replace the $\ell_0$ norm with the $\ell_r$ norm and rewrite the objective function as:
In practice, the value of r approximates 0, r → 0. Finally, the fuzzy membership
Through our proposed EFKNN, the fuzzy emotional membership of each feature extracted from training images is collected and represented as
Suppose the input vector in layer h is
Finally, the output of the last fully connected layer is converted into the probability distribution of different emotions by a softmax function, which makes sure the output of the neural network
There are various criteria to measure the similarity between two label distributions. Following the literature [53], this work utilizes the Kullback-Leibler (KL) divergence as the overall loss function to measure the similarity of the ground-truth
In this section, we evaluate the proposed method against the state-of-the-art emotion distribution prediction methods by extensive experiments to verify the effectiveness of our proposed method.
Experiment settings
Datasets
To validate the effectiveness of our proposed model for image emotion distribution prediction, we conduct experiments on three image emotion distribution datasets: Emotion6 [5], Abstract [6] and IESN [18, 19].
For emotion distribution prediction, the emotion distribution is obtained through normalization: the number of votes for each emotion category is divided by the total number of votes over the 7 or 8 categories. For example, assuming there are 15 viewers whose votes on 8 emotion categories are {2, 5, 0, 8, 0, 1, 2, 2}, the emotion probability distribution is {0.1, 0.25, 0, 0.4, 0, 0.05, 0.1, 0.1}. Note that one viewer can vote for multiple emotions on the same image, so the total number of votes (20 here) can exceed the number of viewers.
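The normalization can be written in a few lines. The sketch below reproduces the example above; the function name `votes_to_distribution` is hypothetical, and the eighth vote count of 2 is inferred from the stated distribution (0.1 of 20 votes).

```python
def votes_to_distribution(votes):
    """Normalize per-category vote counts into a probability distribution."""
    total = sum(votes)
    if total == 0:
        raise ValueError("at least one vote is required")
    return [count / total for count in votes]


# Vote counts for 8 emotion categories (20 votes in total from 15 viewers).
votes = [2, 5, 0, 8, 0, 1, 2, 2]
distribution = votes_to_distribution(votes)
```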
Features
Based on the fact that the emotion evoked by an image is delivered by various types of visual features, we extract multiple features from different levels to enhance the visual feature representation and explore the relationships between multiple features and emotions.
Low-level features
Mid-level features
High-level features
Baseline models
We compare our proposed method with the following state-of-the-art approaches for image emotion distribution learning: CNNR [5], ACPNN [24], DLDL [53], SSL [25], WMSSL [47] and WMCPNN [55]. The CNNR sets the number of outputs of the last fully connected layer to 1 and replaces the softmax loss with a Euclidean loss. Following the literature [5], we pre-train the Caffe model [56] on ImageNet and fine-tune the CNN on our training set for CNNR. The ACPNN and BCPNN share similar network structures with CPNN [23]. But different from CPNN, which takes single integers as the label representation, the BCPNN and ACPNN replace the integer labels with binary coding and augmented label distributions, respectively. However, the CPNN is specially designed for age estimation, where the integer labels correspond to ages, whereas emotions have no natural correspondence with numbers. It is meaningless to feed the emotion labels to CPNN, so this work does not conduct experiments on it. Besides, the ACPNN was proposed based on the BCPNN. As shown in the literature [24], the performance of ACPNN is better than that of BCPNN, so we only compare our method with the ACPNN. The DLDL method is a CNN-based algorithm that replaces the Euclidean loss with the KL divergence loss. The SSL predicts the emotion distribution of the test image by learning the combination coefficients and linearly combining the emotion distributions of training images with these coefficients. The WMSSL is an extended version of SSL, which learns the optimal weights for multiple features and utilizes joint sparsity constraints across different features. The WMCPNN extends the CPNN to the multiple-feature setting and associates the visual features with the emotion distribution by learning the combination coefficients of multiple features and exploring their complementarity.
Evaluation metrics
Since the evaluation metrics for single-label classification are not applicable to distribution learning, we use five evaluation metrics introduced in [22, 47] to assess our model: the sum of squared difference (SSD), the Bhattacharyya coefficient (BC), the Kullback-Leibler (KL) divergence, the Chebyshev distance (Che) and the cosine coefficient (Cos). We estimate the performance of emotion distribution prediction by measuring the similarity or distance between the predicted emotion distribution and the ground truth. Among these metrics, the SSD measures the error in terms of regression. The BC and KL measure the distance between two distributions, and the Cos measures the similarity. For the SSD, KL and Che metrics, lower is better. For the BC and Cos metrics, higher is better.
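For concreteness, the five metrics can be sketched with their commonly used definitions. The function names, the direction of the KL divergence (ground truth relative to prediction), and the small constant `eps` that guards against zero probabilities are assumptions of this sketch, not details taken from the experimental protocol.

```python
import math


def ssd(p, q):
    """Sum of squared differences (lower is better)."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def bhattacharyya(p, q):
    """Bhattacharyya coefficient (higher is better)."""
    return sum(math.sqrt(a * b) for a, b in zip(p, q))


def kl_divergence(p, q, eps=1e-10):
    """KL(p || q) with p as ground truth (lower is better); eps is an assumption."""
    return sum(a * math.log((a + eps) / (b + eps)) for a, b in zip(p, q) if a > 0)


def chebyshev(p, q):
    """Chebyshev distance: largest per-category gap (lower is better)."""
    return max(abs(a - b) for a, b in zip(p, q))


def cosine(p, q):
    """Cosine coefficient (higher is better)."""
    dot = sum(a * b for a, b in zip(p, q))
    norm_p = math.sqrt(sum(a * a for a in p))
    norm_q = math.sqrt(sum(b * b for b in q))
    return dot / (norm_p * norm_q)


# Hypothetical ground-truth and predicted distributions over two categories.
truth = [0.5, 0.5]
predicted = [1.0, 0.0]
scores = {
    "SSD": ssd(truth, predicted),
    "BC": bhattacharyya(truth, predicted),
    "KL": kl_divergence(truth, predicted),
    "Che": chebyshev(truth, predicted),
    "Cos": cosine(truth, predicted),
}
```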
Implementation details
We run our experiments on three datasets. The Abstract dataset is randomly divided into 80% for training and 20% for testing, and the Emotion6 and IESN datasets are randomly divided into 70% for training and 30% for testing. For our proposed EFKNN, the fuzzy parameter α is set to 2 following the recommendation in the literature [39]. We also investigate the optimal value of K for the EFKNN in section 4.2.4 to ensure that the model has superior performance. For the sparse learning, the parameter β is set to 0.0001 empirically. Then, we set up a fully connected neural network with two hidden layers to learn the relationship between the fuzzy memberships of multiple features and the emotion terms. As the datasets in our experiments contain seven or eight emotion categories and we employ six types of features for emotion distribution prediction, the dimension of the input fuzzy membership vectors is 48. For the Emotion6 dataset, which contains seven categories, the remaining nodes of the input layer are filled with 0. The output is an emotion label distribution over the emotion categories. During the training phase, we set the learning rate to 0.001. The network is trained with the stochastic gradient descent (SGD) optimization method. Our experiments are all carried out on an NVIDIA GTX 1080Ti GPU with 32 GB onboard memory.
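The input layout and the forward pass of such a network can be sketched as follows. This is only an illustration: the hidden widths (32 and 16), the ReLU activation, the random initialization, the placement of the zero padding at the end of the vector, and an output size equal to the number of emotion categories are all assumptions, and training with SGD is not shown.

```python
import math
import random


def build_input(memberships, n_features=6, max_classes=8):
    """Flatten per-feature memberships and zero-pad to n_features * max_classes."""
    flat = [value for feature in memberships for value in feature]
    return flat + [0.0] * (n_features * max_classes - len(flat))


def init_layer(n_in, n_out, rng):
    """Small random weights and zero biases (initialization is an assumption)."""
    weights = [[rng.uniform(-0.1, 0.1) for _ in range(n_in)] for _ in range(n_out)]
    return weights, [0.0] * n_out


def dense(x, layer):
    """Affine transform of one fully connected layer."""
    weights, biases = layer
    return [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weights, biases)]


def softmax(z):
    """Numerically stable softmax."""
    peak = max(z)
    exps = [math.exp(v - peak) for v in z]
    total = sum(exps)
    return [e / total for e in exps]


def forward(x, layers):
    """Hidden layers with ReLU (assumed), then a softmax output layer."""
    for layer in layers[:-1]:
        x = [max(0.0, v) for v in dense(x, layer)]
    return softmax(dense(x, layers[-1]))


rng = random.Random(0)
# Seven-category case: 6 feature types x 7 categories, padded to 48 inputs.
memberships = [[1.0 / 7] * 7 for _ in range(6)]
x = build_input(memberships)
sizes = [48, 32, 16, 7]  # hidden widths 32 and 16 are hypothetical
layers = [init_layer(n_in, n_out, rng) for n_in, n_out in zip(sizes, sizes[1:])]
prediction = forward(x, layers)
```

The softmax output guarantees that the predicted values are positive and sum to one, so they can be read directly as an emotion distribution.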
Results and analysis
The effectiveness of multiple features
Previous studies have shown that combining multiple features can significantly improve the performance of image emotion recognition [15]. However, the influence of various types of features on our proposed fuzzy membership generation still needs to be explored. In order to verify the necessity of the features used in this model, we conduct an ablation study on three datasets to compare the impact of different visual features. We remove one type of visual feature mentioned in section 4.1.2 at a time and run the experiments with the remaining features. Meanwhile, the results of using all features are also reported. The results are illustrated in Fig. 5. The “-GIST”, “-Elem”, “-PAEF”, “-Attribute”, “-SentiBank”, and “-CNN” represent the proposed model without GIST, Elem, PAEF, Attribute, SentiBank, and CNN, respectively, and “Our” represents our model with all features. Note that the more the performance degrades relative to our method with all features, the more the removed feature contributes to emotion distribution prediction. Lower values of SSD, KL, and Che and higher values of BC and Cos indicate better performance.

The effectiveness of different features in the proposed model on three datasets evaluated by SSD, KL, Che, BC, and Cos, respectively.
From Fig. 5, we can observe that: (1) The high-level features have the greatest impact on the performance of the model. When the SentiBank feature or the CNN feature is removed, the performance of the model is significantly reduced. In addition, the CNN feature is superior to the hand-crafted features. (2) Among the low-level and mid-level features, the Elem and PAEF features have more influence on the results of emotion distribution prediction. A possible explanation is that these are artistic features, which can describe emotions more comprehensively. (3) Our model still achieves satisfying results when the attribute feature is removed on the Abstract dataset and when the GIST feature is removed on the IESN dataset. A possible reason is that the images in the Abstract dataset are abstract paintings without apparent objects and scenes, so the attribute feature has little impact on this dataset. In contrast, the images in the IESN dataset are collected from social networks. These images have obvious objects and expressions that are directly related to emotions, so the GIST feature has almost no effect on the IESN dataset. (4) Across different datasets, the prediction accuracy on Emotion6 fluctuates significantly when individual features are removed, whereas the fluctuations on the Abstract and IESN datasets are relatively small. (5) Our method with all visual features performs best compared to the model with any feature removed, which indicates that the multiple features selected in this model are valid.
To assess the effectiveness of the proposed EFKNN in fuzzy emotional membership calculation, we compare variants of our framework in which the fuzzy emotional memberships are calculated by the traditional triangle function, fuzzy C-means clustering (FCM), the original FKNN and the proposed EFKNN, respectively. The experiments were performed on the Emotion6 and Abstract datasets, and the results evaluated by KL and BC are listed in Table 1. As shown in Table 1, our framework with EFKNN achieves better results than with the other methods, which demonstrates the effectiveness of our proposed EFKNN. Moreover, our framework with the triangle function outperforms the one using FCM. The reason is that the triangle function manually selects the centers of the initial emotion categories when calculating the fuzzy emotional memberships, so prior knowledge is injected by human intervention. Therefore, its performance is superior to that of FCM, which is an unsupervised clustering based on the similarity of features. Although our framework with the original FKNN improves the accuracy compared to the triangle function and FCM, its prediction results are even lower than those of some baseline methods in Table 2. This is because the original FKNN calculates fuzzy emotional memberships based on the similarity of features rather than from the affective perspective, which further demonstrates the value of our proposed EFKNN for improving performance.
Performance comparison between our framework with fuzzy emotional memberships calculated by the triangle function, FCM, the FKNN and the proposed EFKNN, respectively
Performance comparison with state-of-the-art methods for emotion distribution learning on the Emotion6, Abstract and IESN datasets
We compare our model with the state-of-the-art approaches mentioned above and then discuss the results in detail. First, we compare our method with the uni-feature based models for emotion distribution prediction, including CNNR [5], ACPNN [24], and DLDL [53]. Then, early fusion and late fusion strategies are employed so that the SSL [8] model can process multiple features, including the emotion features mentioned in Section 4.1.2. Further, WMSSL [47] and MCPNN [55] can make use of multiple features, so the multiple features are fed directly into these models. Table 2 presents the experimental results on three datasets evaluated by five metrics. The results highlighted in bold indicate the best value of each metric.
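The difference between the two fusion strategies can be illustrated with a minimal sketch. Early fusion concatenates all feature types before a single predictor is fitted, whereas late fusion fits one predictor per feature type and averages the predicted distributions. The nearest-neighbor predictor `knn_distribution` below is only a hypothetical stand-in used to keep the example self-contained; it is not the SSL model, and all function names are assumptions.

```python
import numpy as np

def knn_distribution(train_x, train_d, test_x, k=3):
    """Stand-in predictor: average the label distributions of the k nearest training samples."""
    dists = np.linalg.norm(test_x[:, None, :] - train_x[None, :, :], axis=2)
    idx = np.argsort(dists, axis=1)[:, :k]   # indices of the k nearest neighbors
    return train_d[idx].mean(axis=1)         # shape: (n_test, n_emotions)

def early_fusion(train_feats, train_d, test_feats, k=3):
    # Concatenate every feature type into one vector, then fit a single predictor.
    return knn_distribution(np.hstack(train_feats), train_d, np.hstack(test_feats), k)

def late_fusion(train_feats, train_d, test_feats, k=3):
    # Fit one predictor per feature type, then average the predicted distributions.
    preds = [knn_distribution(tr, train_d, te, k)
             for tr, te in zip(train_feats, test_feats)]
    return np.mean(preds, axis=0)
```

Here `train_feats` and `test_feats` are lists with one array per feature type, and `train_d` holds one emotion distribution per training image. Both functions return one distribution per test image.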
From the results shown in Table 2, we can conclude the following. (1) DLDL performs better than the other uni-feature based methods, which validates its effectiveness for emotion distribution prediction. (2) The accuracy of SSL when fusing multiple features depends on the dataset: early fusion achieves better results on Emotion6 and IESN, while late fusion performs better on Abstract. (3) DLDL is superior to SSL on the Emotion6 and IESN datasets, but some of its metrics are worse than those of SSL on the Abstract dataset. A possible explanation is that DLDL is a deep-CNN based model built for object recognition, whereas the images in the Abstract dataset contain no recognizable objects. As a result, DLDL cannot outperform SSL on Abstract, since SSL also uses low-level and mid-level features. (4) In some cases, CNNR outperforms the late fusion of SSL on Emotion6 and IESN, and the early fusion of SSL on Abstract. This may be because CNNR is trained in a supervised manner, and the optimal emotion distribution is obtained by minimizing the objective function of each emotion category. This result shows that CNNR is competitive for emotion distribution prediction. (5) WMSSL and WMCPNN, which jointly fuse multiple features, outperform all uni-feature based models.
As shown in Table 2, compared with the state-of-the-art approaches on three datasets that differ in scale and emotion categories, our proposed method achieves a clear improvement on all datasets. Our model shows an advantage over the uni-feature based methods for two reasons. First, our proposed method combines multiple visual features. Second, our method calculates the fuzzy membership of each feature to each emotion category using the proposed EFKNN, which verifies the feasibility of the EFKNN. As for the multi-feature approaches, our model also outperforms WMSSL and WMCPNN. Unlike WMSSL and WMCPNN, which weight multiple features, our method computes the fuzzy membership of each feature to the emotion categories based on the proposed EFKNN. The emotion distribution of an image is thus indirectly decomposed into fuzzy memberships that serve as an intermediate representation, which indicates that the fuzzy emotional membership is capable of narrowing the affective gap in image emotion distribution prediction. Compared with the best results of SSL, WMSSL, and WMCPNN on each dataset, our method improves performance more on the Emotion6 and Abstract datasets than on the IESN dataset, with the largest gain on the Abstract dataset. On the IESN dataset, the BC metric of our approach is even lower than that of WMCPNN. Because images in the IESN dataset come from social networks, where the labeled emotions are easily influenced by other users, the labeled emotions in the IESN dataset are less subjective than those in the Emotion6 and Abstract datasets. This result suggests that our method is particularly advantageous when dealing with ambiguous data.
To illustrate the predictions of our method intuitively, Fig. 6 visualizes the ground truth and the predicted emotion distributions for several example images from different datasets. The emotion distributions predicted by our proposed method are more similar to the ground truth than those of the other approaches, which further demonstrates the contribution of our model to improving the prediction accuracy of image emotion distributions.

Visualization of the predicted emotion distribution results of several examples using the proposed model and comparable approaches.
As described above, our proposed model calculates the fuzzy emotional memberships of visual features by EFKNN. In the FKNN algorithm, K is the number of nearest neighbors considered when assigning an item to the emotion categories. The choice of K has an important influence on the predictions of the EFKNN algorithm. A K value that is too small may produce unreasonable predictions, while a K value that is too large may allow outliers to influence the results. Therefore, we determine the optimal value of K by cross-validation over the candidate values {1, 3, 5, 7, 9, 11, 13, 15, 17, 19}. The experimental results for each feature versus the parameter K on the Emotion6, Abstract and IESN datasets are measured by KL divergence. As illustrated in Figs. 7, 8 and 9, the best performance is obtained with K = 7 on the Emotion6 dataset, K = 5 on the Abstract dataset and K = 13 on the IESN dataset. We use these values in all our experiments.
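The selection procedure can be sketched as follows. This is a simplified illustration under stated assumptions, not the authors' implementation: it uses the classic FKNN membership rule (inverse-distance weighting with fuzzifier `m`) rather than the proposed EFKNN, and it scores the memberships directly against the ground-truth distributions instead of passing them through the fully connected network. The function names, the number of folds, and the random seed are all hypothetical.

```python
import numpy as np

def fknn_memberships(train_x, train_u, query_x, k, m=2.0, eps=1e-12):
    """Classic FKNN: distance-weighted average of the neighbors' memberships."""
    d = np.linalg.norm(query_x[:, None, :] - train_x[None, :, :], axis=2)
    idx = np.argsort(d, axis=1)[:, :k]                    # k nearest neighbors
    nd = np.take_along_axis(d, idx, axis=1)               # their distances
    w = 1.0 / np.maximum(nd, eps) ** (2.0 / (m - 1.0))    # inverse-distance weights
    w /= w.sum(axis=1, keepdims=True)
    return np.einsum("qk,qkc->qc", w, train_u[idx])

def mean_kl(p, q, eps=1e-12):
    """Mean KL(p || q) over samples; lower is better."""
    p = np.clip(p, eps, None)
    q = np.clip(q, eps, None)
    return float(np.mean(np.sum(p * np.log(p / q), axis=1)))

def select_k(x, u, candidates=(1, 3, 5, 7, 9, 11, 13, 15, 17, 19),
             n_folds=5, seed=0):
    """Return the candidate K with the lowest cross-validated KL divergence."""
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(len(x)), n_folds)
    scores = {}
    for k in candidates:
        fold_scores = []
        for i in range(n_folds):
            val = folds[i]
            train = np.concatenate([folds[j] for j in range(n_folds) if j != i])
            pred = fknn_memberships(x[train], u[train], x[val], k)
            fold_scores.append(mean_kl(u[val], pred))
        scores[k] = float(np.mean(fold_scores))
    best = min(scores, key=scores.get)
    return best, scores
```

Calling `select_k(features, distributions)` for one feature type returns the best K together with the per-K scores, which correspond to the kind of curve plotted for each feature in Figs. 7, 8 and 9.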

Impact of different values of K on the Emotion6.

Impact of different values of K on the Abstract.

Impact of different values of K on the IESN.
In this paper, we propose a novel framework for emotion distribution learning that addresses the subjectivity of emotion perception. We employ a fuzzy approach to explore the correlations between visual features and subjective human emotions. The EFKNN algorithm is developed to transform visual features into fuzzy emotional memberships, which serve as an intermediate representation to bridge the affective gap and capture the ambiguity and uncertainty between visual features and emotions. Moreover, our method offers an advantage in interpretability for image emotion recognition. We also implement a sparse learning algorithm based on fuzzy memberships to obtain the fuzzy emotional memberships of unlabeled images. The experimental results on three datasets verify that our proposed method outperforms the state-of-the-art methods for emotion distribution learning.
However, the proposed approach is computationally expensive because of the complexity of its optimization. In the future, we will improve the efficiency of our method and extend the model to larger-scale datasets. Furthermore, we plan to build real-world applications based on emotion distributions, such as understanding the emotions of artworks, and to apply the learned principles in painting education or intelligent creation.
Acknowledgments
This work is supported by the National Key Research and Development Plan of China (No. 2017YFD0400101).
