Multimedia emotion representation analysis based on graph convolution adversarial learning and attention mechanism

Abstract

Technological progress has driven the rapid development of multimedia technology, and massive amounts of multimedia data are generated every moment. Efficient sentiment analysis algorithms help people understand and use multimedia data, reduce production and management costs, and improve the efficiency of human-computer interaction. Extracting emotional features from multimedia information is a crucial step in capturing semantic information, and accurately extracting emotional states from multimedia content has become an important focus of information processing. Traditional methods extract emotional features from a single perspective, which limits how accurately they convey information and leaves a significant gap between the information content and actual cognition. To address this issue, a multimedia emotion representation method combining graph convolutional adversarial learning and an attention mechanism is proposed. The final multimedia emotion model is built by constructing an emotion representation feature model, designing adversarial learning over multidimensional emotion labels, and adding attention modules for local and overall emotions. The proposed hybrid model was tested and analyzed. The results showed that the average loss value of the multimedia emotion fusion algorithm was less than 0.3 and that its recognition accuracy on video data reached 90.47%. The recognition accuracy for the neutral, angry, happy, and sad emotional labels exceeded 85%, with the highest value reaching 92.30%, significantly better than other algorithms. In addition, the improved hybrid algorithm performed better in information representation and extraction, with an increase in emotional information interactivity of over 40% and an overall average time consumption of less than 1.5 s. The study analyzes multimedia emotional data along two dimensions, features and labels, and is of research value for emotional data mining and emotional content capture.

Keywords

Graph convolution network; adversarial learning strategies; attention mechanism; emotion representation; label information; cross entropy function

1. Introduction

Digital information and data media are important carriers of human thought and knowledge, and they are often rich in emotional information. Strengthening emotion recognition in multimedia data is a key step in extracting information content and analyzing user needs. Traditional emotion labeling, which assigns a single emotion, is limited by the diversity and complexity of human emotions: the emotion conveyed by multimedia information is often a combination of several basic emotions. In addition, sentence order, emotional words, and logical connectives express different emotional properties to a certain extent.¹ Positive emotions are more likely to be triggered together and often accompany one another, while negative emotions rarely occur simultaneously; this relationship between emotions can be referred to as the relational model of emotion. The emergence of artificial emotional intelligence technology has changed the field of computer vision, drawing more attention to the emotional needs behind the expression of information.² Lei Y et al. attempted to reduce the mismatch between expected and perceived labels using support vector machines, deep neural networks, and gradient-boosted decision trees. The results showed that all three preference learning models significantly outperformed the traditional classifier baseline, with the combined model based on gradient-boosted decision trees ranking emotions most accurately.³ Focusing solely on emotional tags or on the distribution of emotional features when analyzing emotional tagging narrows the scope of the emotional information expressed and neglects the global context of the data. Therefore, Wang M et al. proposed an image emotion classification method based on multi-graph and multi-label learning, which avoided the loss of graph information by calculating the similarity of emotion features and fusing the node relationships. The results showed that the method mitigated the accumulation of learning errors and performed better on the dataset.⁴ Wang Q et al. argued that a fusion strategy combining speech and EEG signals could improve the overall accuracy of emotion recognition.⁵ Furthermore, data imbalance can also influence the distribution of emotional parts of speech, making it challenging to annotate multimedia emotions accurately.⁶ To address this problem, Zhang H et al. proposed learning local emotion region features with a regional multiscale network and encoding them with a graph attention network. The results showed that the method achieved a significant improvement on the benchmark dataset.⁷ Chen M et al. proposed a feature extraction emotion computation model based on joint mutual information. The results showed that the method achieved an average improvement of 0.85% in the accuracy of emotion classification.⁸ The sentiment categories and sentiment labels contained in sample data differ, and most previous research has performed sentiment analysis on image or speech data alone, which makes it difficult to represent the complexity of multiple co-occurring sentiments. Based on this, the study analyzes the correlation between feature and label information under the emotional relationship model at two levels, features and labels. It combines the graph convolution network (GCN), which handles topological relationships well, with an adversarial network that can shorten the distance between distributions, so as to extract emotional features.
Unlike previous research, which analyzed emotional characteristics in isolation, this study focuses on multi-dimensional emotional characteristics and the correlation between regional emotions, to better ensure the integrity and accuracy of emotional information extraction. The multimedia emotion representation problem is addressed in four parts. Part I reviews and discusses current multimedia emotion representation algorithms and related adversarial learning algorithms. Part II improves the graph convolutional network and enriches the dimensions of feature extraction by introducing an adversarial learning strategy and attention mechanism theory, taking into account the characteristics of multimedia information representation. Part III investigates and analyzes the effect of multimedia emotion recognition under this fusion method, and Part IV summarizes the whole study. Figure 1 is a graphical summary of the research, presenting the overall idea of the manuscript.

Figure 1.

Graphical summary.

2. Related works

At present, multimedia data is surging, and the annual reading volume of Chinese media is measured in trillions. These data are high-dimensional, large in scale, and diverse in type. To analyze emotional tendencies and identify valuable information in massive data, scholars have proposed a variety of algorithms. Horvat et al. used an unsupervised learning algorithm and the Nencki Affective Picture System data set to verify a Monte Carlo simulation method, and finally developed a software application to identify the characteristics of network emotion.⁹ Hao proposed an intelligent model based on deep learning language enhancement to capture the network emotional state. The results showed that the method had a good detection effect on vector face smile information.¹⁰ Zhao et al. took large-scale heterogeneous multimedia data as the research object and applied the Aho-Corasick (AC) algorithm to compare the effect of emotion analysis across multimedia types. The results showed that the algorithm had good application value in emotion analysis.¹¹ Chattopadhyay et al. took speech in multimedia data as the research object and proposed a hybrid wrapper feature selection algorithm, built on a clustering method and an atomic optimization search algorithm, to recognize the emotion represented by speech data. The results showed that the recognition accuracy of the hybrid algorithm exceeded 70% on various data sets.¹² Asghar et al. verified an EEG-based emotion classification method in the field of human emotion classification. They used the complex continuous wavelet transform for space-time analysis and extracted features through three deep neural networks. The experimental results showed that the algorithm had a lower computational cost while maintaining speed and accuracy.¹³ Song et al. proposed a robust discriminant sparse regression algorithm and introduced a regularization constraint for feature selection. They then designed an alternating optimization algorithm and verified the accuracy of the method on multiple data sets.¹⁴ Annadurai et al. proposed an enhanced support vector machine (E-SVM) algorithm to distinguish true and false emotional information in the research data set. The results showed that the accuracy of the algorithm reached 98%.¹⁵ Marik et al. took multimedia speech as the research object, proposed a two-stage hybrid deep feature selection framework, and combined it with an automatic feature engineering optimization algorithm. Tests on the data set showed that the algorithm had good application value.¹⁶ Koduru et al. compared different feature extraction algorithms for identifying emotions and used Mel frequency cepstral coefficients to identify emotional features. The results showed that this method recognized people's general emotions well.¹⁷

GCN algorithms and attention mechanisms are widely used in multimedia emotion representation analysis. The GCN algorithm proposed by Yu et al. expanded the data by encoding complex patterns and conducted adversarial training based on GCN. The experimental results showed that the framework had good applicability.¹⁸ To address the similarity problem of semantically complementary information, Dong et al. used sample adjacency relationships and instance construction to build a feature generator from a GCN and a fully connected network, and realized feature design and representation through a game strategy. The results showed that this method was applicable and effective on the data set.¹⁹ To solve the end-to-end problem in heterogeneous relational graphs, Qin et al. proposed a new heterogeneous graph attention model, which was embedded in entity form and supplemented with adaptive adversarial technology. The experimental results showed that the method improved accuracy on image data.²⁰ Tiwari et al. proposed a shift incremental accelerated linear discriminant analysis method for multimodal emotion recognition, which extracted the most discriminative and dynamic features from speech and video sequences and used support vector classifiers to classify emotions from the multimodal features. The results showed that the emotion classification accuracy of this method on the database was above 90% and better than that of other algorithms.²¹ Bhattacharya et al. extracted audio features from a multilingual emotion database and used them as inputs to a convolutional neural network for emotion recognition. The results showed that the recognition accuracy of this method on the dataset exceeded 95%, and that the multilingual emotion detection model was little affected by language type, with an application accuracy of 97.89%.²² Le et al. proposed a transformer-based fusion and representation learning method for multimodal emotion recognition in video data. They represented modal information using a unified transformer architecture and classified the information using label-level representation methods. The results showed that this method achieved good emotion recognition accuracy on the benchmark dataset and performed significantly better than other methods.²³ Garcia et al. realized emotion detection by designing a heterogeneous emotion result aggregator that manages different detectors. The results showed that the method had good applicability.²⁴ Huang et al. acquired rich task information with a multi-view GCN and fused the views by adding an attention mechanism, which proved efficient in application.²⁵ Analyzing individual emotions using social media data has an important role and potential in many fields. For example, Ahmad B and Jun S applied natural language processing and emotion analysis to Twitter data. The results could provide decision support for healthcare professionals to improve the management of cancer patients.²⁶ Adversarial learning provides powerful technical support for multimedia sentiment management and can significantly improve the accuracy, robustness, and practicality of sentiment analysis. It also has a wide range of applications in affective computing, mental health, and human-computer interaction. Mao Z et al. proposed weakly supervised target object localization using multi-scale gradient pyramid features. This method avoided the high cost of manual annotation and achieved high localization accuracy.²⁷ Ahmad B et al.²⁸ and Ahmad et al.²⁹ proposed combining a variational autoencoder with a generative adversarial network (GAN) for medical image classification. The results showed that the method had good classification performance.
Table 1 summarizes the research contents and ideas of previous media sentiment representation analysis.

Table 1.
Summary of relevant work on media emotion representation analysis.

Scholar	Method	Limitations or deficiencies	The gap
Horvat et al. ⁹	Unsupervised learning algorithm for identifying network emotional features	Not involving multimodal data	Lack of modeling for complex emotional interactions
Hao ¹⁰	Intelligent model enhanced with deep learning language to capture network emotional states
Zhao et al. ¹¹	Analyzing the sentiment analysis effect of large-scale heterogeneous multimedia data	Limited to algorithm performance comparison only	Lack of deep learning modeling for emotional representation
Chattopadhyay et al. ¹²	Feature selection algorithm for recognizing emotions in speech data	Limited to voice data only	Lack of fusion of multimodal emotional features
Asghar et al. ¹³	Emotion classification method based on EEG, using continuous wavelet transform and deep neural network to extract features	Not involving multimodal data
Song et al. ¹⁴	Robust discriminant sparse regression algorithm
Annadurai et al. ¹⁵	Enhancing support vector machine algorithm to distinguish true and false emotional information	Limited to feature selection methods only
Marik et al. ¹⁶	Two stage hybrid deep feature selection framework	Limited to voice data only
Koduru et al. ¹⁷	Mel frequency cepstral coefficient recognition of emotional features	Not involving multimodal data
Yu et al. ¹⁸	GCN adversarial training framework	Algorithm performance needs to be improved
Dong et al. ¹⁹	Feature generator and game strategy for feature extraction	Not involving multimodal data
Qin et al. ²⁰	Heterogeneous graph attention model with embedded adaptive adversarial techniques
Tiwari et al. ²¹	Shift incremental accelerated linear discriminant analysis method for identifying multimodal emotions	Limited to voice and video data only
Bhattacharya et al. ²²	Convolutional neural network recognition of multilingual sentiment databases	Not involving multimodal data
Le et al. ²³	Transformer multimodal emotion recognition method	High computational complexity
Garcia et al. ²⁴	Heterogeneous emotion result aggregator	Not involving multimodal data
Huang et al. ²⁵	Multi view fusion, attention mechanism	Not involving adversarial learning

To sum up, in research on multimedia emotion representation, previous scholars have used the ant colony algorithm, hybrid wrapper feature selection algorithms, robust discriminant sparse regression, and other algorithms to distinguish multimedia data. However, these algorithms are limited when applied to multimedia picture data, and recognizing a single emotion cannot capture the relationships between emotions. Therefore, this research combines an improved GCN algorithm with an attention mechanism for multimedia emotion representation analysis. It concentrates on emotional region information and dimensional features to construct the representation model, thereby providing a more comprehensive reference for multimedia information analysis. The resulting method combines graph convolutional adversarial learning with attention mechanisms, and its multimodal fusion approach can contribute, both theoretically and technically, to improving the accuracy of sentiment analysis and enriching the methods and tools of information analysis.

3. Proposed methodology

Strengthening the representation analysis of multimedia information is the key to extracting emotional information. This study builds on the feature extraction advantages of GCN and adds an adversarial learning strategy to reduce the distance between the predicted and real label information in the global distribution. A cross entropy loss is designed to capture the integrated characteristics of different emotional information and to mitigate the imbalance among emotional categories. In addition, attention mechanism theory is introduced, and attention maps are extracted at both the local and overall levels, so that, combined with the GCN, the accuracy and application value of emotion representation are improved.

3.1. Multimedia emotion annotation design based on graph convolution adversarial learning network

Unlike the convolutional neural network, which is limited to the local structure of an image, GCN processes representation information through the interaction of node information and can learn representations on non-Euclidean data, so it is better suited to extracting the topological spatial features of irregular data.³⁰ Samples in graph data are connected to one another, and the irregular structure makes them difficult to arrange on a regular grid. GCN can effectively extract spatial features from such topological graphs. It first defines the Fourier transform on the graph and then moves the convolution operation from the spatial domain to the frequency domain: the convolution is carried out in the frequency domain, and the result is transformed back to the spatial domain through the inverse Fourier transform. GCN extends the size of the convolution kernel to the entire set of samples and updates each sample using the connections between samples to obtain the trained feature representation. During training, GCN updates samples by mean aggregation, so the features of each sample contain not only its own information but also information from related samples. However, GCN is difficult to train in batches, so it requires high computing power, and it cannot determine well the strength of the connection between different samples. In short, GCN is a neural network for graph-structured data whose core idea is to extend convolution from regular grids to graph structures, using the adjacency matrix and node features for information propagation. Its mathematical expression is given in equation (1).

H^{(m + 1)} = f ({\tilde{D}}^{- 0.5} \tilde{A} {\tilde{D}}^{- 0.5} H^{(m)} W^{(m)})

(1)

In equation (1), $\tilde{A}$ denotes the adjacency matrix and $\tilde{D}$ its degree matrix, so that ${\tilde{D}}^{- 0.5} \tilde{A} {\tilde{D}}^{- 0.5}$ is the normalized adjacency matrix. $H^{(m + 1)}$ is the node representation of the $(m + 1)$-th layer, $W^{(m)}$ is the learnable weight matrix, and f is the nonlinear activation function. After convolution processing, with $\tilde{A}$ then standing for the normalized adjacency matrix, the given graph structure can also be expressed in functional form as equation (2).

{H^{(m + 1)}}^{'} = h (\tilde{A} H^{(m)} W^{(m)})

(2)
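The propagation rule in equations (1) and (2) can be illustrated with a minimal pure-Python sketch. It is not the authors' implementation: the helper names (`matmul`, `normalize_adjacency`, `gcn_layer`), the choice of ReLU for f, the toy graph, and the assumption that self-loops are already included in the adjacency matrix are all illustrative.

```python
import math

def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def normalize_adjacency(adj):
    """Return D^{-1/2} A D^{-1/2}, where D holds the row sums (degrees) of A."""
    inv_sqrt = [1.0 / math.sqrt(sum(row)) if sum(row) > 0 else 0.0 for row in adj]
    n = len(adj)
    return [[inv_sqrt[i] * adj[i][j] * inv_sqrt[j] for j in range(n)] for i in range(n)]

def relu(m):
    """Element-wise ReLU, used here as the nonlinear activation f (an assumption)."""
    return [[max(0.0, v) for v in row] for row in m]

def gcn_layer(adj, h, w, activation=relu):
    """One propagation step of equation (1): f(D^{-1/2} A D^{-1/2} H W)."""
    a_hat = normalize_adjacency(adj)
    return activation(matmul(matmul(a_hat, h), w))

# Toy graph with 3 nodes; self-loops are assumed to be part of the adjacency matrix.
adj = [[1.0, 1.0, 0.0],
       [1.0, 1.0, 1.0],
       [0.0, 1.0, 1.0]]
h0 = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]   # node features H^(m), shape (3, 2)
w0 = [[1.0, -1.0], [0.5, 0.5]]              # weight matrix W^(m), shape (2, 2)
h1 = gcn_layer(adj, h0, w0)                 # node features H^(m+1), shape (3, 2)
```

Passing an already normalized matrix and skipping `normalize_adjacency` gives the compact form of equation (2).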

In equation (2), h is a nonlinear operation. After interacting with each other, the nodes in the GCN capture the spatial relationships of the feature graph. Following the GCN idea, this study constructs a directed graph over emotions and represents the emotional characteristics in this emotional relationship graph. GANs introduce game theory concepts into machine learning. With an optimal or near-optimal discriminator, minimizing the generator loss is essentially minimizing the Jensen-Shannon divergence between the true label distribution and the generated distribution, which drives the generator to produce data samples that are as realistic as possible.³¹ The nodes in the emotion relationship graph are represented by the word vectors of the corresponding emotions, recorded as $H^{(1)} \in R^{C \times h}$, where C and h represent the number of categories and the dimension of the representation, respectively. Because the adjacency relationship between emotions is not explicitly defined in the annotations, it is represented by conditional probabilities counted over each pair of emotions in the training set. The process can be represented by equation (3).

P (J_{i} = 1 \mid J_{j} = 1) = \frac{\sum_{k = 1}^{N} I (y_{i}^{(k)} = 1, y_{j}^{(k)} = 1)}{\sum_{k = 1}^{N} I (y_{j}^{(k)} = 1)}

(3)
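As a worked illustration of equation (3), the following sketch counts co-occurrences in a small multi-label training set. The function name `cooccurrence_matrix` and the toy labels are hypothetical, and returning 0 for an emotion that never occurs is an assumption made here to avoid division by zero.

```python
def cooccurrence_matrix(labels):
    """Estimate P(J_i = 1 | J_j = 1) from N binary label vectors of length C.

    Entry [i][j] is the fraction of samples labeled with emotion j
    that are also labeled with emotion i.
    """
    num_classes = len(labels[0])
    count = [sum(y[j] for y in labels) for j in range(num_classes)]
    p = [[0.0] * num_classes for _ in range(num_classes)]
    for i in range(num_classes):
        for j in range(num_classes):
            if count[j] == 0:
                continue  # emotion j never occurs: keep the estimate at 0 (assumption)
            joint = sum(1 for y in labels if y[i] == 1 and y[j] == 1)
            p[i][j] = joint / count[j]
    return p

# Four toy samples over three emotions (hypothetical data).
labels = [[1, 1, 0],
          [1, 0, 0],
          [0, 1, 1],
          [1, 1, 0]]
p = cooccurrence_matrix(labels)
```

The resulting matrix is generally not symmetric, which is why the emotion graph is directed: here the third emotion always appears together with the second, but the second appears with the third in only one of its three occurrences.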

In equation (3), I represents the indicator function, N represents the number of training samples, and i and j index emotions. y represents the emotional labels of the training set, and J represents the expected value of the label function; $J_{i}$ and $J_{j}$ denote the sentiment nodes, and $y_{i}^{(k)}$ is the labeled value of emotion i in the k-th sample. The initial layer of the GCN takes the basic emotional representation and the normalized adjacency matrix as input, and the input of each subsequent layer is the output of the previous layer. This process can be represented by equation (4).

{H^{(m + 1)}}^{''} = Φ_{g} (\tilde{A} H^{(m)} W_{gcn}^{(m)})

(4)

In equation (4), $W_{gcn}^{(m)}$ represents the parameters of the network, $Φ_{g}$ denotes the graph convolution mapping that yields the final emotional representation, and m is the layer index of the network. Through this hierarchical transmission between layers, the output emotional representation captures the co-occurrence connections between different emotions. Meanwhile, the encoder extracts the emotional features of the input sample data. To ensure that the emotional relationship model is carried through forward propagation, the product of the network output and the sample features is computed. This process can be represented by equation (5).

d_{gcn} = H^{(m)} Φ_{e}^{k} (X; W_{mn})

(5)
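A minimal sketch of the product in equation (5) follows, under assumed shapes that the text does not state: the emotion representation H has one row per emotion (C rows of dimension d), the encoder features form a batch of B rows of dimension F, and `w_mn` (F by d) projects them into the GCN output space. The function name `integrate` and all values are illustrative.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def transpose(m):
    """Swap rows and columns of a matrix."""
    return [list(col) for col in zip(*m)]

def integrate(emotion_repr, encoder_feats, w_mn):
    """Combine GCN emotion representations with encoder features.

    The encoder features are first projected with w_mn, then the inner
    product with every emotion representation is taken, giving one
    score per (sample, emotion) pair.
    """
    projected = matmul(encoder_feats, w_mn)              # shape (B, d)
    return matmul(projected, transpose(emotion_repr))    # shape (B, C)

emotion_repr = [[1.0, 0.0], [0.5, 0.5]]        # H: C = 2 emotions, d = 2
encoder_feats = [[1.0, 2.0, 3.0]]              # one sample, F = 3 encoder features
w_mn = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]    # projection from F = 3 to d = 2
d_gcn = integrate(emotion_repr, encoder_feats, w_mn)
```

The batch-first layout is a convenience of this sketch; equation (5) writes the same product with H on the left.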

In equation (5), $Φ_{e}^{k}$ represents the features of the encoder $Φ_{e}$ at layer k, $d_{gcn}$ is the integrated feature, X represents the training sample, and $W_{mn}$ denotes a learnable weight matrix that maps the encoder features into the same space as the graph convolution output features. The encoder output is then concatenated with the integrated feature to predict the multi-dimensional emotional labels. The goal of the emotion GCN is to learn a function of the node representations in the given graph.³² The study therefore stacks GCN layers so that the output of each layer is the input of the next. The first layer takes the initial emotion representation and the normalized adjacency matrix as inputs and generates a new emotion representation; in every layer, the new representation of each node is the weighted sum of its neighboring nodes under the guidance of the adjacency matrix. Through this layer-by-layer propagation, the final emotion representation learns the co-occurrence patterns between different emotions. To ensure that the predicted label information resembles the real labels in the global distribution, an emotional distribution constraint is designed in the label layer: a generative adversarial network (GAN) is used to reduce the distance between the two types of label information in the global distribution. As an unsupervised training model, a GAN mainly consists of a generator and a discriminator. The generator tries to produce samples whose distribution is close to the real distribution, while the discriminator tries to tell generated samples from real ones, so the two networks compete. The objective is given in equation (6).

\min_{G} \max_{D} E_{y^{'} \sim P (y^{'})} [\log D (y^{'})] + E_{z \sim P (z)} [\log (1 - D (z))]

(6)

In equation (6), G is the generator, D is the discriminator, $P (y^{'})$ is the real label distribution, $y^{'}$ represents a label, and z represents noise; the objective is maximized with respect to D and minimized with respect to G. $D (y^{'})$ denotes the discriminator's output for a real sample, $D (z)$ denotes its output for a generated sample, and E denotes the expectation over the sample distribution in the loss function. The noise includes label noise, model noise, and environmental noise. When sentiment analysis is performed with GANs, label noise arises when incorrect labeling standards produce erroneous sentiment labels, and model noise arises when an imperfect generator produces flawed outputs. Environmental noise refers to background noise in the dataset that is unrelated to the emotional information and depends mainly on the source of the text. During training, the parameters of the generator and the discriminator are updated iteratively until they reach equilibrium, at which point highly realistic data samples can be obtained. Figure 2 is a schematic diagram of the model structure of the generative adversarial network.³³

Figure 2.

Schematic diagram of the model structure of the generative adversarial network.
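To make the objective in equation (6) concrete, the sketch below evaluates its empirical value from discriminator scores on real and generated samples. The function name `gan_value`, the small constant `eps` for numerical safety, and the example scores are assumptions for illustration only.

```python
import math

def gan_value(real_scores, fake_scores, eps=1e-12):
    """Empirical value of equation (6).

    real_scores: discriminator outputs D(y') on real labels, each in (0, 1).
    fake_scores: discriminator outputs on generated samples, each in (0, 1).
    """
    real_term = sum(math.log(s + eps) for s in real_scores) / len(real_scores)
    fake_term = sum(math.log(1.0 - s + eps) for s in fake_scores) / len(fake_scores)
    return real_term + fake_term

# A discriminator that cannot tell real from generated outputs 0.5 everywhere.
undecided = gan_value([0.5, 0.5], [0.5, 0.5])
# A discriminator that separates the two sets well achieves a higher value.
confident = gan_value([0.9, 0.9], [0.1, 0.1])
```

The discriminator raises this value by scoring real samples near 1 and generated samples near 0, while the generator lowers it by making its samples harder to distinguish. At the undecided point the value equals -2 log 2.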

Considering the co-occurrence of different emotions, the predicted label distribution should be as close as possible to the real label distribution. Therefore, the emotion labeling model described above is used as the label generator, and a neural network that takes only emotional labels as input is introduced as the discriminator to separate the different kinds of emotional labels. The objective function of adversarial learning is shown in equation (7).

\min_{Φ_{e, g, c}} \max_{D} E_{\hat{y} \sim P (\hat{y})} [\log D (\hat{y})] + E_{y^{'} \sim P (y)} [\log (1 - D (y^{'}))]

(7)

In equation (7), $\hat{y}$ represents the predicted labels, drawn from the prediction distribution $P (\hat{y})$. e denotes the sentiment label, which represents the true sentiment category or intensity, g denotes the generator, which is used to generate the sentiment label distribution, c is the condition information, and $Φ$ denotes the emotion label weight. The generator and the discriminator are then optimized alternately, the former by iteratively minimizing the objective and the latter by maximizing it, which gives equation (8).

\begin{aligned} ℓ_{d} (Φ_{e, g, c}, D) & = \min_{D} - [E_{\hat{y} \sim P (\hat{y})} [\log D (\hat{y})] + E_{y^{'} \sim P (y)} [\log (1 - D (y^{'}))]] \\ ℓ_{g} (Φ_{e, g, c}, D) & = \min_{Φ_{e, g, c}} (1 - λ) E_{(x, y) \sim P (x, y)} ℓ_{ce} (Φ_{e, g, c}) - λ E_{\hat{y} \sim P (\hat{y})} [\log D (\hat{y})] \end{aligned}

(8)
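The two losses in equation (8) can be evaluated as in the following sketch, which follows the equation as written. The function names, the constant `eps`, and the example numbers are hypothetical; `ce_loss` stands for an already computed cross entropy value and `lam` for the trade-off parameter.

```python
import math

def mean_log(values, eps=1e-12):
    """Average of log(v) over a list of probabilities."""
    return sum(math.log(v + eps) for v in values) / len(values)

def discriminator_loss(pred_scores, real_scores):
    """Discriminator loss: -(E[log D(y_hat)] + E[log(1 - D(y'))]).

    pred_scores: discriminator outputs D(y_hat) on predicted labels.
    real_scores: discriminator outputs D(y') on real labels.
    """
    one_minus_real = [1.0 - s for s in real_scores]
    return -(mean_log(pred_scores) + mean_log(one_minus_real))

def generator_loss(ce_loss, pred_scores, lam):
    """Generator loss: (1 - lam) * cross entropy - lam * E[log D(y_hat)]."""
    return (1.0 - lam) * ce_loss - lam * mean_log(pred_scores)

# Hypothetical values: cross entropy of 0.4 and an undecided discriminator.
l_d = discriminator_loss([0.5], [0.5])
l_g = generator_loss(0.4, [0.5], lam=0.25)
```

With `lam = 0` the generator loss reduces to the cross entropy term alone, and with `lam = 1` only the adversarial term remains, so the parameter sets how strongly the label distribution constraint influences training.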

In equation (8), $ℓ_{d}$ represents the discriminator loss, $ℓ_{g}$ denotes the generator loss, $ℓ_{ce} (Φ_{e, g, c})$ is the balanced cross entropy loss between the predicted and real label information, and $λ$ is a trade-off parameter. During emotion annotation learning, both sets of parameters are updated in each iteration, with the discriminator held fixed while the generator is updated and vice versa. Because emotion labels are multidimensional and can occur together, using only the plain cross entropy function for the loss easily ignores the balance between emotion categories: positive and negative samples are unevenly distributed both between and within categories. To solve this problem, the loss function is designed with class weights, that is, weighting terms are added to the loss, and their mathematical expression is shown in equation (9).

\begin{aligned} ω_{c}^{intra} & = \frac{\sum_{k = 1}^{N} I (y_{c}^{(k)} = 1)}{N} \\ ω_{c}^{inter} & = \frac{1 / ω_{c}^{intra}}{\sum_{c^{'} = 1}^{C} (1 / ω_{c^{'}}^{intra})} \end{aligned}

(9)
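A short worked example of equation (9) follows. The function name `imbalance_weights` and the toy labels are hypothetical, and the sketch assumes every emotion occurs at least once in the training set, since the inverse frequency is otherwise undefined.

```python
def imbalance_weights(labels):
    """Compute the intra-class and inter-class weights of equation (9).

    labels: list of N binary label vectors of length C.
    Returns (intra, inter), two lists of length C.
    """
    n = len(labels)
    num_classes = len(labels[0])
    # Fraction of training samples in which each emotion is present.
    intra = [sum(y[c] for y in labels) / n for c in range(num_classes)]
    if any(w == 0.0 for w in intra):
        raise ValueError("every emotion must occur at least once")
    # Normalized inverse frequencies: rare emotions receive larger weights.
    inverse = [1.0 / w for w in intra]
    total = sum(inverse)
    inter = [v / total for v in inverse]
    return intra, inter

labels = [[1, 1, 0],
          [1, 0, 0],
          [0, 1, 1],
          [1, 1, 0]]
intra, inter = imbalance_weights(labels)
```

In this toy set the third emotion appears in one of four samples, so its inter-class weight is three times that of the other two emotions, and the inter-class weights sum to 1.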

In equation (9), c indexes the emotion category, $y_{c}^{(k)}$ is the real label of category c in the k-th sample, $ω_{c}^{intra}$ represents the intra-class imbalance weight, that is, the occurrence probability of category c in the training set, and $ω_{c}^{inter}$ represents the inter-class imbalance weight. Adding these weights effectively prevents a single category or a few categories from dominating the results. The design process of the multimedia emotion annotation algorithm is shown in Figure 3.

Figure 3.

Design process of multimedia emotion annotation algorithm.

First, the network components (encoder, GCN, classifier, and discriminator) are initialized on the training set; a matrix is then constructed from the conditional probabilities of the emotional tags and normalized. Second, the emotion representation is constructed from the word vectors of the emotional tags, and the number of iterations and the number of network updates are set. The sampled training samples and training tags are then propagated forward through the encoder and the network to obtain the predicted tags, which are used to update the discriminator. The above steps are then used to assess the generator. The overall structure of emotion annotation, after the local relationship mode is implemented in the GCN and constrained by the adversarial learning strategy, is shown in Figure 4.

Figure 4.

Sentiment labeling with improved graph convolutional network constraints.

In Figure 4, the GCN manages the generated output results and their corresponding emotional representations. The encoder extracts the relevant emotional features from the input value sample data, combines the extracted results with the GCN results, and inputs them into the classifier to obtain the prediction label. Finally, the discriminator is used to distinguish the real tag from the prediction tag, so as to generate a more authentic multi-dimensional emotional tag.

3.2. Image emotion analysis based on attention relationship modeling

The development of multimedia technology has filled people's lives with image information, and extracting and conveying emotional information from the visual content of images has become an important research topic. The key to image emotion analysis is the mining of emotional features. The limitations and biases of visual cognition lead different individuals to focus on different aspects when extracting image content and mining emotional data. The attention mechanism is an important human cognitive ability.³⁴ In the field of emotional representation, the attention mechanism can help people better understand emotion and the multi-level information of emotional experience. The self-attention mechanism is a kind of attention mechanism that makes use of the interaction between the elements in a sequence. It connects elements at different positions, calculates the degree of correlation between them, and assigns different weights to elements at different positions, so as to use the information in the sequence effectively.³⁵ In multimedia emotion representation analysis, the self-attention mechanism can be used to model the relationship between speech, text, and other elements at different time steps in the input sequence to improve the accuracy of emotion classification. In addition to the self-attention mechanism, introducing an attention mechanism into model training makes the model focus on extracting key emotional feature information, which improves its representation ability. Figure 5 is a schematic diagram of the attention mechanism.³⁶

Figure 5.

Schematic diagram of attention mechanism.

In the research of multimedia emotion representation analysis based on graph convolutional adversarial learning, the attention mechanism can focus on the key areas and important information of emotion representation. At the same time, compared with the traditional attention mechanism, the multi-head attention mechanism can perform multiple linear transformations on the vector parameters of the input layer, so as to comprehensively extract the emotional information of the samples. Figure 6 shows the combination of attention mechanism and neural network.

Figure 6.

Emotional recognition model combining attention mechanism and neural network.

The multi-head attention mechanism linearly transforms the output processed by the GCN, and the weight value of the attention model can be multiplied by the input value of the network. Multi-head attention captures different feature representations through multiple attention heads, the mathematical expression of which is shown in equation (10).

MultiHead (Q, K, V) = Concat ({head}_{1}, {head}_{2}, \dots, {head}_{h}) W^{O}

(10)
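For reference, equation (10) does not spell out how each head is computed. In the standard multi-head formulation from the Transformer literature, which is assumed here rather than confirmed by the study, each head applies scaled dot-product attention to linearly projected inputs. The projection matrices $W_{i}^{Q}$, $W_{i}^{K}$, $W_{i}^{V}$ and the key dimension $d_{k}$ are symbols introduced only for this sketch:

```latex
\begin{aligned}
{head}_{i} &= Attention (Q W_{i}^{Q}, K W_{i}^{K}, V W_{i}^{V}), \quad i = 1, \dots, h \\
Attention (Q, K, V) &= softmax \left( \frac{Q K^{T}}{\sqrt{d_{k}}} \right) V
\end{aligned}
```

Under this assumption, each head has its own projections and can therefore attend to a different part of the input before the outputs are concatenated.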

In equation (10), $Q$, $K$, and $V$ denote the query, key, and value matrices, ${head}_{i}$ denotes the $i$-th attention head, $h$ is the number of heads, and $W^{O}$ is the output weight matrix. Each attention head can focus on a different part of the input data. Computing multiple attention heads in parallel allows the model to learn richer feature representations and reduces its dependence on a single attention pattern, thus improving generalization performance. At the same time, considering the differences in emotional information across image regions, a new emotional learning method is proposed that extracts emotional features by introducing semantic attention and global attention. A feature map is first constructed for the input image, characterized by its width, height, and number of channels. The process can be expressed as equation (11).

O_{c} = θ (I^{'})

(11)

In equation (11), $θ$ refers to the feature extractor, $O_{c}$ is the overall feature, and $I^{'}$ is the input image.³⁷ Considering that different image regions correspond to different emotions, semantic word vectors are used to represent the different emotion regions, the emotion pixels of the feature map are scored, and a vector construction algorithm is used to extract semantic representations from the training data to obtain the vector set. The GloVe model is a word vector model based on word co-occurrence probability. It extracts semantic features effectively with the help of matrix factorization. Its mathematical expression is shown in equation (12).

F (o_{t^{'}}, o_{t^{″}}, o_{d}) = \frac{P_{d / t^{'}}}{P_{d / t^{″}}}

(12)
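The ratio on the right-hand side of equation (12) can be illustrated with a small sketch. The co-occurrence counts below are made up for illustration, and the helper name `cooccurrence_ratio` is hypothetical. The conditional probability is estimated in the usual GloVe manner, as the count of a context word divided by the total count for the target word.

```python
from typing import Dict


def cooccurrence_ratio(counts: Dict[str, Dict[str, int]],
                       t1: str, t2: str, d: str) -> float:
    """Return P(d | t1) / P(d | t2) estimated from raw co-occurrence counts.

    counts[t][d] is how often context word d appears near target word t.
    The count for d near t2 must be non-zero.
    """
    p1 = counts[t1][d] / sum(counts[t1].values())
    p2 = counts[t2][d] / sum(counts[t2].values())
    return p1 / p2


# Hypothetical counts, not taken from the study's corpus.
toy_counts = {
    "happy": {"smile": 30, "tears": 5, "the": 65},
    "sad": {"smile": 4, "tears": 36, "the": 60},
}
ratio_smile = cooccurrence_ratio(toy_counts, "happy", "sad", "smile")
ratio_the = cooccurrence_ratio(toy_counts, "happy", "sad", "the")
print(ratio_smile, ratio_the)
```

A ratio far from 1 suggests that the context word discriminates between the two target words, whereas a ratio close to 1 suggests that it relates to both about equally.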

In equation (12), $d$, $t^{'}$, and $t^{″}$ are words, and $o_{d}$, $o_{t^{'}}$, and $o_{t^{″}}$ are their word vectors. The larger the F value, the better the correlation between words. The model obtains the vector representation of text words through iterative gradient descent. When the algorithm is used to represent the word vector set, the initial vectors are four-dimensional, and the multimodal factorized bilinear pooling method is used to fuse the vectors. This process can be represented by equation (13).

\bar{f}_{c^{'}, t^{'}, t^{″}} = SumPooling (W^{T} f_{t^{'} t^{″}} \circ V^{T} e_{c^{'}}, s)

(13)

In equation (13), $SumPooling$ denotes the sum pooling step, $s$ is the size of the pooling window, $(W^{T} f_{t^{'} t^{″}} \circ V^{T} e_{c^{'}})$ represents the fused feature, $f_{t^{'} t^{″}}$ is the feature of the feature map, $W$ is a learnable parameter matrix, and $V$ is the number of emotion categories. $e_{c^{'}}$ denotes the embedding vector of the sentiment category, $T$ denotes the matrix transpose, and $c^{'}$ denotes the index of the sentiment category. The weight of each emotion is then obtained by calculating the importance of the emotion at each position, and the weights are normalized to obtain the emotion-like attention map. Global weighted average pooling is performed on the emotion-like attention map to obtain the regional characteristics of the emotion. The emotion-like attention map reflects the emotional intensity of a region from one particular aspect. By integrating the emotion-like attention maps that show different aspects, the emotional attention map from the global perspective can be obtained. In this process, a multilayer perceptron is used to standardize the calculation of the emotional weights, and its mathematical expression is shown in equation (14).

\begin{aligned} b_{c^{'}} & = MLP (f_{c^{'}}) \\ β_{c^{'}} & = \frac{\exp (b_{c^{'}})}{\sum_{c^{'}} \exp (b_{c^{'}})} \end{aligned}

(14)

In equation (14), $MLP$ denotes a multilayer perceptron, $f_{c^{'}}$ denotes the feature representation of the sentiment category, $b_{c^{'}}$ denotes the weight of the sentiment category before normalization, and $β_{c^{'}}$ denotes the normalized sentiment weight. Then, for the image label distribution, the Kullback-Leibler divergence is used to constrain the weights so that they remain similar to the real label distribution.³⁸ The global attention map $T$ analyzes the emotion of the perceived region from the global perspective, and its mathematical expression is shown in equation (15).

T = \sum_{c^{'}} β_{c^{'}} α_{c^{'}}

(15)
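Equations (14) and (15) can be sketched together as a softmax over per-category scores followed by a weighted sum of per-category attention maps. In this sketch the scores $b_{c^{'}}$ are taken as given (the multilayer perceptron that produces them is not modeled), and the function names and toy inputs are hypothetical.

```python
import math
from typing import List


def softmax(scores: List[float]) -> List[float]:
    """Normalize raw scores b into weights beta, as in equation (14)."""
    peak = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def global_attention_map(scores: List[float],
                         class_maps: List[List[List[float]]]) -> List[List[float]]:
    """Weighted sum of per-category attention maps, as in equation (15)."""
    betas = softmax(scores)
    height, width = len(class_maps[0]), len(class_maps[0][0])
    return [[sum(b * m[i][j] for b, m in zip(betas, class_maps))
             for j in range(width)]
            for i in range(height)]


# Hypothetical inputs: two categories with equal scores and 2x2 attention maps.
toy_scores = [0.0, 0.0]
toy_maps = [[[1.0, 0.0], [0.0, 1.0]],
            [[0.0, 1.0], [1.0, 0.0]]]
toy_global = global_attention_map(toy_scores, toy_maps)
print(toy_global)
```

With equal scores both categories receive weight 0.5, so every position of the combined map is the plain average of the two input maps.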

In equation (15), $α_{c^{'}}$ represents the emotion-like attention map. The feature vectors obtained from the original feature map and the generated feature map after global average pooling represent different emotional feature dimensions, such as emotional information at a single level and emotional information from a global perspective. The features of the different representation dimensions are concatenated, and the prediction vector $s^{'}$ is obtained through the output network, as shown in equation (16).

\begin{aligned} s^{'} = Φ_{0} ([f_{g c n}, f_{b a s e}, f_{s e n}]) \end{aligned}

(16)

In equation (16), $f_{g c n}$ represents the emotional information features under a single level, $f_{s e n}$ represents the emotional information features under the global view, and $f_{b a s e}$ is the initial visual feature of the image. The joint loss function can be obtained by narrowing the distribution distance between the two emotional features with the help of the divergence function, as shown in equation (17).

\begin{aligned} L_{k l} = τ L_{k / 1} + L_{k / 2} \end{aligned}

(17)
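The weighted combination in equation (17) can be sketched as follows. The text does not specify exactly which distributions each term compares, so this sketch assumes that each term is a Kullback-Leibler divergence between the real label distribution and one predicted distribution. The direction of the divergence, the function names, and the toy values are all assumptions made for illustration.

```python
import math
from typing import List


def kl_divergence(p: List[float], q: List[float], eps: float = 1e-12) -> float:
    """Kullback-Leibler divergence KL(p || q) between two discrete distributions."""
    return sum(pi * math.log(pi / max(qi, eps)) for pi, qi in zip(p, q) if pi > 0)


def joint_loss(target: List[float], pred_1: List[float], pred_2: List[float],
               tau: float) -> float:
    """Combine two KL terms with trade-off tau, following the form of equation (17)."""
    return tau * kl_divergence(target, pred_1) + kl_divergence(target, pred_2)


# Hypothetical distributions over two emotion categories.
toy_target = [0.5, 0.5]
toy_pred_1 = [0.5, 0.5]    # matches the target, so its KL term is zero
toy_pred_2 = [0.25, 0.75]
toy_loss = joint_loss(toy_target, toy_pred_1, toy_pred_2, tau=0.5)
print(toy_loss)
```

Because the first prediction matches the target exactly, only the second term contributes in this toy case, regardless of the value chosen for the trade-off parameter.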

In equation (17), $τ$ represents the trade-off parameter of the joint loss function, $L_{k l}$ is the joint loss function, and $L_{k / 1}$ and $L_{k / 2}$ denote the loss functions for the two sentiment features. Based on the above analysis, the image distribution model framework under attention relationship modeling can be obtained, as shown in Figure 7.

Figure 7.

Impression distribution algorithm framework based on attention relationship modeling.

Figure 7 shows that the features of the input image are represented, and then it is divided into attention-like graphs with different emotional attributes based on the difference of emotional semantic word vectors. The obtained emotional feature intensities are fed into the GCN to better understand the relationship between emotional features and regions. At the same time, the image propagation thinning under the attention mechanism can better capture the image emotion.

4. Results and discussion

The study selected the Music database and the NVIE database for experimental analysis. Together, the two databases contained music and video data covering many emotional categories, including positive and negative emotions. The data were divided into a test set and a training set at a ratio of 1:9, and 25% of the training data was selected as the validation set. The experiments were carried out on the Windows 10 operating system. The weight of the adversarial loss was set to 0.1, the depth of the emotion GCN was set to 2, the output emotion representation dimensions were 128 and 256, and the dimension of the word vectors trained by the GloVe algorithm was set to 50. The learning rate of the GCN was 0.1, and the number of iterations was 3500. The Music database contained 10,000 songs covering various genres such as classical, pop, rock, jazz, and electronic. Audio files were stored in WAV format at a sampling rate of 44.1 kHz, with each song lasting between 30 s and 5 min. The songs covered a rich variety of languages (such as English, Chinese, and Spanish) and cultural backgrounds. Emotion labels included happiness, sadness, anger, and serenity, and were manually annotated. The NVIE database contained 10,000 visible light and infrared facial expression images of 100 participants taken under different lighting conditions. Emotion labels covered seven basic emotions (happiness, sadness, anger, surprise, fear, disgust, and calm) and were manually annotated by a team of experts. These data therefore included a large number of participants and a range of lighting conditions. The study considered the visual features of the NVIE database and analyzed them using visual excitability, energy color, average energy, and other factors. The study first removed background noise from the audio and video data, and then extracted emotion-related features, including audio and visual features.
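The data partition described above can be sketched as follows. The text gives only the ratios, so the random shuffling, the fixed seed, the function name `split_dataset`, and the choice to remove the validation samples from the training portion are assumptions made for illustration.

```python
import random
from typing import List, Sequence, Tuple


def split_dataset(items: Sequence[int], seed: int = 0
                  ) -> Tuple[List[int], List[int], List[int]]:
    """Split into train/validation/test: 1:9 test-to-train, then 25% of train held out."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    n_test = len(shuffled) // 10          # 1 part test, 9 parts training
    test, train_full = shuffled[:n_test], shuffled[n_test:]
    n_val = len(train_full) // 4          # 25% of the training portion
    validation, train = train_full[:n_val], train_full[n_val:]
    return train, validation, test


# 10,000 sample indices, matching the stated size of each database.
train_ids, val_ids, test_ids = split_dataset(range(10_000))
print(len(train_ids), len(val_ids), len(test_ids))
```

Under these assumptions, a database of 10,000 samples yields 1,000 test samples, 2,250 validation samples, and 6,750 remaining training samples.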
Based on the emotion model, the sample data in the database were labeled with emotion tags and could then be characterized and analyzed using quantitative indicators. To evaluate the performance of the model, the study considered both algorithm performance and case analysis, selecting loss curves, emotion tag recognition confusion matrices, ablation experiment indicators, information interactivity, error values, and other indicators. The loss curve reflects the data processing performance of the algorithm, and the confusion matrix visually displays the performance of the model on each emotion category, including common misclassifications. The ablation experiment tests the effectiveness of the model: the better the data extraction ability, the better the emotional representation analysis. Information interactivity, feature training time, and error results reflect the algorithm's ability to analyze different sentiment parts of speech and its processing accuracy. The stronger the emotional interaction, the more helpful it is for analyzing the sentiment intensity of the data. The shorter the emotional feature extraction time and the smaller the error value, the higher the representation accuracy of the algorithm. The loss results of the proposed fusion algorithm were analyzed first. The results are shown in Figure 8.

Figure 8.

Training loss and testing loss results of the fusion algorithm.

Table 2.

Comparison of training experience results of different models.

	Music database			Video database
Contrast model	Accuracy	Recall	F value	Accuracy	Recall	F value
WDGRL	74.35	78.23	75.48	72.34	75.36	74.31
GCN	77.47	80.52	79.21	70.32	70.23	70.44
GCN-GAN	82.25	81.49	81.71	80.07	79.32	80.11
E-SVM	84.36	85.32	84.38	83.25	82.43	83.48
GCN-FCN	86.43	87.25	88.12	87.12	89.24	88.75
Proposed method	91.22	90.32	92.23	90.47	91.16	90.52

The experimental results showed that when the basic network was used to analyze the multimedia emotional representation information, the training loss and test loss decreased significantly as the number of iterations increased, and the average training loss and test loss were 0.28 and 0.26, respectively. The slope of the loss curve of the proposed hybrid algorithm was significantly greater than that of the improved algorithm. The curve of the improved algorithm was relatively smooth, and the number of fluctuation nodes was significantly reduced compared with other algorithms. In general, the lower the training loss, the better the model fits the training data, which usually means that the model can extract and represent emotional features more accurately, because it is better able to capture key features and patterns in the data. In the research of multimedia emotional representation, reducing the training loss can improve the emotional feature extraction ability and overall performance of the model, thereby better achieving accurate expression and understanding of emotional information in multimedia content. Then, the application effect of the proposed multimedia emotion representation model was analyzed and compared with the Wasserstein distance guided representation learning algorithm (WDGRL), GCN, GCN-GAN, E-SVM, and graph convolutional networks-fully connected networks (GCN-FCN). The results are shown in Table 2.

The results in Table 2 showed that the proposed model outperformed the other algorithms on all evaluation indexes in both databases. On the Music database, the accuracy of the proposed model exceeded that of the WDGRL algorithm by more than 10 percentage points, while its difference from the GCN-FCN algorithm was within 5 percentage points. On the video database, the accuracy ranking of the algorithms was: research model (90.47) > GCN-FCN (87.12) > E-SVM (83.25) > GCN-GAN (80.07) > WDGRL (72.34) > GCN (70.32). These results indicated that the research method achieved good extraction accuracy of emotional features on both datasets, whereas the performance of the other algorithms varied between the datasets and none of their accuracy values exceeded 90. The reason for this outcome was that the research model integrated adversarial learning with a multi-head attention mechanism. The former achieved structured feature learning, while the latter dynamically captured key emotional features. Attending to different emotional information regions and extracting different emotional intensities enhanced the model's selectivity, which effectively increased the weight of key emotional clues. Although the WDGRL algorithm used adversarial strategies to reduce cross-domain distribution differences, it struggled to achieve fine-grained domain alignment for feature extraction, and its generalization performance was limited. The GAN enhanced the generation ability of the GCN, so GCN-GAN performed better than the plain GCN. However, it still had shortcomings in dealing with the sparsity and wide value ranges of weighted dynamic networks, resulting in lower extraction accuracy than the research model and the GCN-FCN algorithm.
FCN's improvement of GCN could improve its ability to extract edge features. However, it relied heavily on independent modules in series with the E-SVM method. This was not suitable for extracting modal information from video datasets and inevitably led to information loss. To further test and analyze the research model, the vocabulary discrimination under different emotion extraction algorithms was analyzed, and the results are shown in Table 3.

Table 3.

Results of emotional vocabulary extraction under different methods.

Method	Emotional tendency	Number of words	Correct number	Accuracy	Recall rate	F value
WDGRL	Positive	3346	2410	0.666	0.761	0.712
	Negative	4320	4195	0.739	0.708	0.763
	Population	7903	6843	0.713	0.728	0.738
GCN	Positive	3094	2434	0.729	0.768	0.745
	Negative	4572	4299	0.749	0.762	0.797
	Population	7903	6967	0.785	0.765	0.795
GCN-GAN	Positive	3471	2535	0.726	0.821	0.795
	Negative	4445	4320	0.899	0.768	0.823
	Population	8028	6968	0.823	0.788	0.798
E-SVM	Positive	3219	2559	0.789	0.828	0.811
	Negative	4697	4424	0.909	0.822	0.857
	Population	8028	7092	0.862	0.825	0.845
GCN-FCN	Positive	3583	2647	0.795	0.89	0.829
	Negative	4557	4432	0.968	0.837	0.892
	Population	8140	7080	0.892	0.857	0.867
Proposed method	Positive	3331	2671	0.858	0.897	0.869
	Negative	4809	4536	0.978	0.891	0.926
	Population	8140	7204	0.929	0.894	0.909

Table 3 shows that the accuracy and recall of the WDGRL and GCN methods in extracting sentiment parts of speech were below 0.8, and the correctly extracted sentiment words accounted for a comparatively small proportion. The model proposed in the study achieved accuracy, recall, and F values of 0.85 or higher for all three emotional tendency groups. Compared with the GCN-GAN, E-SVM, and GCN-FCN methods, the proposed algorithm had stable performance and showed high accuracy in emotion discrimination. Then the emotion recognition of different algorithms was analyzed, and the results are shown in Figure 9.

Figure 9.

Emotional label recognition results of three algorithms.

Figure 9 shows the emotional tag recognition confusion matrix of the three algorithms, in which the abscissa and ordinate respectively represent the predicted value and the real value. Specifically, there was a big difference between the emotional labels predicted by WDGRL algorithm and the real values, and its accuracy rates in neutral, angry, happy, and sad emotions were 72.35%, 72.48%, 65.36%, and 70.23%, respectively. The accuracy of GCN network's emotional tag prediction results was more than 70%, and the maximum value was 79.54%. The accuracy of GCN-GAN algorithm was 79.564% for angry emotional tags, and the recognition of other emotional tags was 80%. Then, the recognition results of the other three algorithms were analyzed. The results are shown in Figure 10.

Figure 10.

Emotional tag recognition results of three other algorithms.

The results in Figure 10 show the emotion recognition accuracy of the other three algorithms. The recognition accuracy of the GCN-FCN algorithm and the proposed algorithm under the four kinds of emotional tags was greater than or equal to 80%, which meant that the error between the predicted emotional tags and the real values was small for these algorithms. The accuracy of the proposed method for neutral, angry, happy, and sad emotions was 90.15%, 88.96%, 90.65%, and 92.30%, respectively. These results were higher than those of the other two algorithms, with maximum recognition accuracy differences of 10.85% and 20.83%. The recognition accuracy of the E-SVM algorithm for anger and sadness was 68.13% and 89.31%, respectively. The reason may be that the algorithm has difficulty recognizing fuzzy emotional words. The confusion matrix clearly displays the prediction accuracy and misjudgments of the model for each emotional label, which helps evaluate its accuracy on different emotional categories. Analyzing the confusion matrix can reveal whether the model is confused or biased in specific emotional categories. The emotion tag prediction results of the proposed model effectively reflected its information extraction accuracy, and the emotion recognition effect was good. To further test the applicability of the research method, it was compared with emotion recognition with deep siamese network (EmoDSN),³⁹ multi-spatial learning semantic alignment network (SAMS),⁴⁰ deep multi-instance learning algorithm (EDMIL),⁴¹ TransEEG,⁴² and Transformer-based multi-label multimodal emotion recognition (MLM-Trans).⁴³ The results are shown in Table 4.

Table 4.

Emotion recognition results of different depth algorithms.

Contrast model	Accuracy	Recall	F1	MAE	mAP	Parameter quantity/M	Training efficiency
EmoDSN	85.41	84.67	85.12	0.12	0.89	12.21	85.41
SAMS	87.35	86.21	87.89	0.09	0.91	22.75	87.35
EDMIL	82.16	80.94	81.75	0.15	0.86	9.86	82.16
TransEEG	88.72	87.53	88.45	0.09	0.92	28.31	88.72
MLM-Trans	89.63	88.47	89.82	0.07	0.93	30.12	89.63
Proposed method	90.05	90.32	92.23	0.08	0.94	18.50	91.22

In Table 4, the values of the research model on the four classification indicators of Acc, Recall, F1, and mAP were 90.05, 90.32, 92.23, and 0.94, respectively. Its performance was superior to other comparative algorithms, indicating that it had higher discriminative ability in sentiment classification tasks. Second, the TransEEG model and the MLM-Trans model performed well. The model proposed by the research showed an improvement of over 1.5% in both F1 and mAP compared to MLM-Trans, indicating that graph convolution was better at capturing structured emotional features than Transformer. Compared to SAMS, the Acc improvement of the research model exceeded 3.5%, indicating that multi-head attention was more effective in focusing on key emotional information than traditional attention mechanisms. The performance of the EDMIL method was the weakest, possibly due to its reliance on weakly supervised learning, which made it difficult to handle fine-grained emotions, and its MAE was the highest (0.15). The reason for this may be that it was difficult to handle label features at different levels, resulting in significant prediction bias. Moreover, the research model achieved a good balance between parameter quantity and training efficiency, reducing the parameter quantity by 47.1% and improving the training speed by 38.5% compared to MLM-Trans. EDMIL was the most lightweight and suitable for scenarios with limited resources but low accuracy requirements. MLM-Trans had the highest computational cost, which may be due to the high computational complexity of Transformer's self-attention mechanism. The research model realized the collaborative learning of structured features and dynamic weighting of attention mechanisms, which enabled it to have good multimodal emotion recognition performance, and the design of its loss function could also improve the computational efficiency. 
At the same time, the AUC curve was used to analyze the emotional feature extraction results of the comparative algorithms mentioned above, as shown in Figure 11.

Figure 11.

AUC results of sentiment feature classification using different algorithms.

In Figure 11, the overall results showed that the research model had a better AUC value of more than 0.80, followed by the TransEEG algorithm and the MLM-Trans algorithm, which performed better with an AUC value of more than 0.65, and the rest of the comparative algorithms had slightly worse classification accuracy. The above results showed that the research model was able to classify and process emotional features better. Then ablation experiments were carried out on different algorithms, in which algorithms 1–7 respectively represent the basic model (RESNET), emotion-like attention network (EAN), emotion-like attention network + GCN (EAN-GCN), comprehensive emotion attention network + GCN (CEAN-GCN), attention-like mechanism (ALM), multi-attention mechanism + GCN (MAM-GCN) and the fusion model proposed in the article. The results are shown in Figure 12.

Figure 12.

Algorithm ablation experimental results under a simple dataset.

Figure 12 shows that the ablation results of the algorithms differed under different selection strategies. On the simple data set, the values of the algorithms on the distance measures (Chebyshev distance, Canberra index, and edge distance offset) were above 0.22, 0.75, and 0.65, respectively. The Chebyshev distance of the proposed fusion algorithm was 0.223, significantly lower than that of the other algorithms, indicating that its maximum difference was small. In terms of cosine coefficient and intersection similarity index, the values of the proposed algorithm were 0.864 and 0.695, respectively. These results indicated that the performance of the basic model improved on most indicators after components such as the attention mechanism and the graph convolutional network were added, suggesting that the ResNet model alone might not be sufficient to fully capture the complexity of multimedia emotional representation. The values of the emotion-like attention network (EAN) for Chebyshev distance and Canberra distance were 0.25 and 0.68, respectively, which suggested that attention mechanisms helped to better capture emotion-related information. CEAN-GCN performed better than EAN-GCN on multiple indicators, indicating that a more comprehensive attention mechanism might help capture richer emotional information. The attention-like mechanism (ALM) performed worse than CEAN-GCN on certain indicators, suggesting that an attention mechanism on its own might not be as effective as one combined with a graph convolutional network. Both the emotion-based attention networks and the multi-head attention mechanism contributed to the performance improvement.
The multi-head attention mechanism was particularly effective because it could capture more information, and graph convolutional networks were more helpful in capturing structural information in the data, improving the model's representational ability.

Figure 13 shows that on complex data sets, the distance measurement and intersection similarity results of the proposed algorithm were 0.876 and 0.698, respectively. The extraction of vector information improved its representation ability. The introduction of the global attention mechanism and network processing allowed the model to better capture the relationships between information words and effectively reduced the loss of data. To further analyze the application performance of the research method, the Music and NVIE datasets were expanded to 100 K and 50 K samples, respectively. Moreover, multiple hardware environments were set up to test the training and inference times. The results are presented in Table 5.

Figure 13.

Algorithm ablation experimental results under complex datasets.

Table 5.

Scalability analysis of the proposed model across different computational settings.

Experiment setting	Dataset size	Hardware	Avg. inference time (s)	Throughput (samples/sec)	Accuracy (%)	GPU memory usage (GB)
Baseline (Original)	10 K (Music)	RTX 3090	0.9 ± 0.2	110	90.5	14
Baseline (Original)	10 K (NVIE)	RTX 3090	1.1 ± 0.3	95	89.8	12
Scaled Dataset (10×)	100 K (Music)	RTX 3090	1.2 ± 0.3	85	89.1	16
Scaled Dataset (10×)	50 K (NVIE)	RTX 3090	1.4 ± 0.4	75	88.6	14
Edge Device (low-cost GPU)	10 K (Music)	RTX 3060	1.4 ± 0.3	70	88.9	10
Edge Device (low-cost GPU)	10 K (NVIE)	RTX 3060	1.6 ± 0.4	60	87.5	8
CPU-only deployment	10 K (Music)	Xeon Gold 6248R	4.2 ± 0.5	25	85.2	32
CPU-only deployment	10 K (NVIE)	Xeon Gold 6248R	4.5 ± 0.6	20	84.7	28
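The relative changes implied by the RTX 3090 rows of Table 5 can be checked with a few lines of arithmetic. This is only a worked calculation on the mean values printed in the table; the helper name `relative_change` is hypothetical.

```python
def relative_change(before: float, after: float) -> float:
    """Return the change from before to after as a percentage of before."""
    return (after - before) / before * 100.0


# Mean values copied from Table 5 (RTX 3090 rows).
music_time = relative_change(0.9, 1.2)    # inference time, 10 K -> 100 K
nvie_time = relative_change(1.1, 1.4)     # inference time, 10 K -> 50 K
music_acc_drop = 90.5 - 89.1              # accuracy drop in percentage points
nvie_acc_drop = 89.8 - 88.6
print(round(music_time, 1), round(nvie_time, 1),
      round(music_acc_drop, 1), round(nvie_acc_drop, 1))
```

On these mean values, the inference time rises by about 33.3% for the Music database and 27.3% for the NVIE database, while accuracy falls by 1.4 and 1.2 percentage points, respectively.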

In Table 5, GPU memory usage refers to the peak memory consumption during inference. The results in Table 5 showed that even when the dataset size increased by up to a factor of 10, the proposed model maintained stable accuracy (a decrease of about 1–2 percentage points), and the average inference time increased by roughly 27–33%. On the scaled datasets, the throughput of the high-end GPU (RTX 3090) remained at 75 samples/sec or higher, demonstrating high efficiency. On the low-cost GPU (RTX 3060), the model still achieved an average inference time of at most 1.6 s per sample with an accuracy loss of less than 10%. Subsequently, the proposed algorithm was analyzed for information interactivity to better analyze the representation of information emotion. The results are shown in Figure 14.

Figure 14.

Interaction results of the research algorithm in six emotional parts of speech.

Figure 14 shows that when the hybrid algorithm was used to measure the interactivity of different emotional parts of speech, it was found that the interactivity of these six emotions was more than 40%, and the average emotional interaction results were good. The algorithm could effectively analyze the emotional intensity of sentence information, and effectively extract the interactive representation feature of information. Then, the time-consuming and error results of the emotional feature extraction training of the algorithm were analyzed. The results are shown in Figure 15.

Figure 15.

Time consumption and error results of emotional feature extraction training for four algorithms.

The results in Figure 15 (a) show that in terms of emotional feature extraction time, the proposed algorithm had small fluctuation nodes, and the overall average time consumption was less than 1.5 s. Moreover, its curve gradually tended to be stable after the number of emotional words was greater than 100. The average time consumption of GCN-GAN algorithm, GCN-FCN algorithm and E-SVM algorithm was 2.36 s, 1.88 s and 2.97 s. In Figure 15 (b), the emotion classification error curves of the four algorithms showed a downward trend with the increase of the number of iterations, and the loss value of the proposed algorithm tended to 0.56 after more than 1500 iterations. The above results showed that the proposed hybrid algorithm had better emotional feature extraction effect and better algorithm performance.

5. Conclusion

Emotion analysis is an important part of data mining, and extracting representations rich in emotional content from multimedia data is a key problem for current emotion analysis algorithms. Therefore, this research proposed a representation analysis method based on graph convolutional adversarial learning and an attention mechanism. The method was tested, and the results showed that the accuracy of the fusion algorithm in video data analysis was 90.47%, higher than the 87.12% of the GCN-FC algorithm, 83.25% of the E-SVM algorithm, 80.07% of the GCN-GAN algorithm, 72.34% of the WDGRL algorithm, and 70.32% of the GCN algorithm. Moreover, the fusion method achieved recognition accuracy of over 85% on the neutral, angry, happy, and sad emotion labels, with the highest value reaching 92.30%, and it performed well in ablation experiments, outperforming the other comparative algorithms. The hybrid model measured the interactivity of different emotional parts of speech at over 40%, with an average time consumption of less than 1.5 s, demonstrating good applicability and effectiveness. In summary, the proposed fusion method has significant advantages in emotion representation analysis: it can effectively identify emotion labels, it demonstrates good robustness, interactivity, and computational efficiency, and it offers an efficient and accurate solution for multimedia emotion analysis.

Considering that the structure and attributes of different types of multimedia data differ, a universal sentiment annotation algorithm framework was proposed in this study. However, the selected dataset cannot fully represent all possible multimedia emotional contexts. Therefore, future work needs to propose more refined emotional annotation models for multimedia data, to strengthen the analysis of emotional data characteristics and the mining of emotional content.

At the same time, the representation analysis algorithm that fuses GCN with the attention module still struggles to capture the relationships between different emotional regions comprehensively and accurately. The way emotions are expressed in different cultures can also affect model performance. For example, reserved emotional expression in certain cultures can lead to recognition bias. If age, gender, or race is unevenly distributed in the training data, the model's emotion recognition performance for minority groups may decrease. The experiments were based only on specific databases (such as music and video data from the NVIE) and do not cover a broader cultural background. In the future, cross-cultural and multi-year datasets need to be introduced to verify fairness. Adversarial debiasing techniques and unsupervised domain adaptation methods will also be introduced to detect bias, improve the model's cross-domain adaptation ability, and better evaluate emotional differences.

Future work should also consider the application of personalized systems in image emotion analysis. There is still significant room for improvement in the recognition accuracy of commonly used emotional feature segmentation methods. Building on this model, future research can further examine network structure optimization, deeper networks, and the impact of different parameters on recognition performance. Further research is also needed on how to efficiently integrate data from different modalities, capture dynamic changes in emotion, and improve the accuracy of multimodal emotional representation.

Footnotes

Author contributions

Yanmei Tian and Meng Zhu both participated in writing the paper and reviewing the final draft.

Funding

The author(s) received no financial support for the research, authorship, and/or publication of this article.

Declaration of conflicting interests

The authors declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.

Availability of data and materials

The data is provided within the manuscript.
