A cross-model hierarchical interactive fusion network for end-to-end multimodal aspect-based sentiment analysis

Abstract

Traditional approaches to aspect-based sentiment analysis rely on the text modality alone. However, in social media scenarios, texts often contain abbreviations, typos, or grammatical errors, which degrade the performance of these methods. In this study, a cross-model hierarchical interactive fusion network with an end-to-end design is proposed to address this challenge. In the network, a feature attention module and a feature fusion module obtain the multimodal interaction features between the image modality and the text modality. Through an attention mechanism and a gated fusion mechanism, these two modules let the image assist the text in the aspect-based sentiment analysis task. Meanwhile, a boundary auxiliary module is used to explore the dependencies between the two core subtasks of aspect-based sentiment analysis. Experimental results on two publicly available multimodal aspect-based sentiment datasets validate the effectiveness of the proposed approach.

Keywords

Multimodal aspect-based sentiment analysis, hierarchical interactive fusion, multi-head interaction attention mechanism, gated mechanism

1. Introduction

Aspect-based sentiment analysis has long been a vital research direction in natural language processing (NLP) [1, 2, 3]; it studies how users evaluate particular aspects of a product. With the popularity of social media, an increasing number of people share comments and remarks on the Internet. Because the volume of comment data is too large to record and analyze each comment manually, we need a model based on aspect-based sentiment analysis that automatically analyzes users’ comments and their attitudes towards each aspect of the product. However, the many abbreviations caused by fragmented reading and the grammatical errors caused by the informality of social media often lower the performance of traditional methods. Because many tweets contain not only text but also images, audio, and video, these other modalities are considered to provide auxiliary information to the text corpus [4]. In this context, multimodal language modeling for sentiment analysis has become a central research direction in natural language processing. While various types of multimodal information are available, this paper focuses on text and images, because the publicly available datasets commonly used for sentiment analysis tasks typically include only text and images.

The traditional aspect-based sentiment analysis comprises two subtasks, namely, aspect term extraction (ATE) and aspect sentiment classification (ASC). ATE is a sequence annotation task that is primarily used to extract attributes (or aspects) of the expressed opinions. ASC is a sequence classification task that is mainly used to detect the emotional polarity expressed by these extracted aspects in the opinion text [5]. However, most Aspect-Based Sentiment Analysis (ABSA) methods do not consider these two subtasks as a unified whole, which not only overlooks the joint information between the two subtasks but also consumes a significant amount of training resources. Furthermore, aspect terms are often assumed to be already provided in the solving process of the second subtask, which is not aligned with reality. As a result, existing methods have some shortcomings [6].

To address the issues raised above, we propose a new cross-model hierarchical interactive fusion network (CHIFN) based on an end-to-end collapsed-label method to handle the Multimodal Aspect-based Sentiment Analysis (MABSA) task. Multimodal data can enhance the recognition performance of the model, and the end-to-end framework enables the model to handle the ATE and ASC tasks simultaneously. The CHIFN has five primary modules: the Feature Extraction Module (FEM), Feature Attention Module (FAM), Feature Fusion Module (FFM), Sentiment Prediction Module (SPM), and Boundary Auxiliary Module (BAM). (1) FEM extracts initial features from texts and images. Although context is critical when extracting text features, many sentiment analysis works focus only on the content of the data, overlook the context, and as a result fail to capture the correct emotional polarity. While the pre-trained BERT can be used to extract contextual features [1, 7, 8] from text, its attention mechanisms make it difficult to capture contextual semantic relationships within single sentences [1]. To tackle this difficulty, we use a Bi-directional Long Short-Term Memory (BiLSTM) network to acquire context details [2, 5, 9]. However, a powerful encoder and an advanced neural network structure alone cannot obtain enough contextual semantic information from the text, so we introduce a Graph Convolution Network (GCN) as a supplementary method to fully extract the syntactic relationships within the text [10]. For extracting image features, CNNs and their variants are often the primary choice. Therefore, in this study, the pre-trained Visual Geometry Group (VGG) network is employed to extract features for each region of every image. Additionally, a BiLSTM is used to extract the internal connections between different regions as image features.
(2) FAM obtains the interactive effects between the cross-modal data through the attention mechanism [1, 12, 13]. A Transformer-based multi-head interaction attention mechanism is used to obtain text and image attention features. (3) FFM constructs the gated attention fusion mechanism and obtains multimodal interaction influence features from the image and text attention representations. (4) SPM feeds the multimodal interaction influence features into a fully connected layer to mark the aspect terms and acquire the initial aspect sentiment for ASC. In addition, SPM combines aspect and initial sentiment information through the attention mechanism to make the final prediction of aspect sentiment. (5) BAM extracts aspect boundary information to assist aspect sentiment classification. Our proposed BAM acquires boundary information through the boundary-guided matrix and automatically determines its proportion in the final labeling result according to the confidence of the boundary labels. By using text features, image features, and auxiliary modules, we expect the CHIFN to achieve better performance in the classification task [14]. The proposed model can extract the aspect terms of the text from multimodal data and classify the sentiment of these aspect terms. We conducted extensive experiments with the CHIFN on two multimodal datasets (Twitter 2015 and Twitter 2017). Our model and the baseline models were trained under the same experimental conditions, and their performance was evaluated and compared. Furthermore, a series of ablation experiments indicated that each component of our model contributed to the improvement of the sentiment classification results. The primary contributions of our work are as follows:

Aspect-based sentiment analysis on Twitter multimodal datasets is a novel research direction, so our study is enlightening. The CHIFN provides new ideas for ABSA in social scenarios by using image-assisted information to help the text in social media acquire more effective contextual semantic information. Moreover, we adopted the end-to-end framework which identifies multiple aspect terms in the text under the multimodal data and classifies their sentiment at the same time.

The BAM is proposed to link the two subtasks of ABSA by extracting boundary-auxiliary information and transferring aspect term information to the sentiment classification task.

To align information between different modalities, preserve the dominance of text, and reduce noise input, we propose a multi-interaction attention mechanism and a gated attention fusion mechanism. The multi-interaction attention mechanism fully captures the bidirectional interaction between images and text. The gated attention fusion mechanism effectively integrates text features with image-aware text representation features, and focuses on the important parts of the text features while reducing the input of non-essential information.

2. Related work

Our work is closely connected to two areas of research: End-to-End Aspect-Based Sentiment Analysis (E2E-ABSA) and Multimodal Sentiment Analysis (MSA). Subsections 2.1 and 2.2 review the existing end-to-end architectures and the study of multimodal data, respectively.

2.1 E2E-ABSA

Traditional sentiment analysis tasks often rely on textual content. To simultaneously address two subtasks of aspect-based sentiment analysis in traditional sentiment analysis tasks, most of the work takes an end-to-end structure. There are now four main approaches (pipeline, multi-task, joint-label approaches, and collapsed methods) to solve the end-to-end sentiment analysis.

The pipeline method [15, 16] addresses ABSA by treating it as two disconnected tasks and resolving them sequentially [17]. However, this approach may be susceptible to error propagation [18]. The multi-task approach [19, 20] uses an encoder for the input and separate decoding procedures to label aspect terms and predict their sentiment, but the two decoding results may not match [10]. The joint-label approach [10] bridges the two subtasks by converting ASC into a sequence annotation task and jointly trains the model for the two subtasks with a set of aspect term boundary labels (for example, B, I, E, S, and O) and sentiment labels (for example, POS, NEG, NEU) [19]. The collapsed method [10, 17, 21] merges the ATE and ASC tasks into a single sequence labeling task to eliminate the distinction between the two subtasks. It applies a set of combined labels, namely B-{POS, NEG, NEU}, I-{POS, NEG, NEU}, E-{POS, NEG, NEU}, S-{POS, NEG, NEU}, and O. B, I, E, and S mark the beginning of, the middle of, the end of, and a single-word aspect term respectively, and O marks a token outside any aspect term. Apart from the “O” tag, each collapsed tag carries two pieces of information: the position within an aspect term and its sentiment. For instance, “E-NEG” marks the end of a negative aspect term, and “S-NEU” marks a neutral single-word aspect term. Table 1 shows example labels of the joint-label and collapsed methods. Existing studies [15, 19] have indicated that the collapsed and joint-label methods outperform the pipeline and multi-task methods. However, using only the collapsed or joint-label method cannot effectively transfer the aspect term boundary information to the ASC task. Considering that the collapsed method is truly end-to-end, i.e., it treats the two subtasks as one task, we combine it with the newly constructed BSM to further advance the aspect-based sentiment analysis task.
BSM can automatically absorb boundary information through the boundary-guided matrix to provide information for ASC tasks. Using both BSM and collapsed method can effectively treat ATE and ASC tasks as a whole, and let ASC make full use of the aspect boundary information of ATE.

Table 1
Example labels used in the collapsed and joint-label approaches

Input	Former	Bridgecorp	Boss	Rod	Petricevic	Will	Be	Released	From	Jail	Next	Month	.
Collapsed	O	S-POS	O	B-NEU	B-NEU	O	O	O	O	O	O	O	O
Joint	O	S	O	B	E	O	O	O	O	O	O	O	O
	O	POS	O	NEG	NEG	O	O	O	O	O	O	O	O

2.2 MSA

At present, there is little research on multimodal aspect-based sentiment analysis, and most of it focuses on the multimodal word extraction task. The experimental results in [44] indicate that, in social scenarios, a model based on text $+$ images can yield better performance than a model based on the textual modality alone. In early multimodal sentiment analysis tasks based on text and images, models were mainly constructed through feature engineering [22]. However, these methods require extremely detailed, biased, and labor-intensive feature engineering [4]. One example is obtaining 1200 Adjective Noun Pairs (ANP) from an image and estimating the sentiment score based on the grammar and spelling of the text [23].

With the development of science and technology, a large number of feature extraction methods based on deep learning have been proposed in the field of multimodal sentiment classification [24, 25, 26]. For images, motivated by the excellent performance of CNNs in image classification and object detection tasks, CNNs and their variants have become the most popular image feature extractors [27, 28, 29]. For text features, a pre-trained BERT [21, 30, 31] acts as a feature extractor, a BiLSTM [11] extracts the contextual semantic information of the text, and GCNs (Graph Convolution Networks) [10] serve as a supplement to fully extract the syntactic features in the text. Subsequently, the focus of research in this area shifted to how to fuse multiple modalities before performing the classification task.

The fusion techniques of multimodal sentiment analysis aim to provide additional information to enhance the networks’ performance. Multimodal feature fusion methods mainly include early or feature-level fusion, late or decision-level fusion, hybrid fusion, model-level fusion, and rule-based fusion [2]. Early or feature-level fusion [32] builds a feature representation by directly concatenating features extracted from different modalities. The main advantage of the method is that correlations between multimodal data are identified early to provide accurate results. Compared with sentiment analysis on text features or image features alone, the fused features obtained by processing image and text features with feature fusion have significant advantages in terms of accuracy and processing time [7]. Late or decision-level fusion classifies the features of each modality independently [33]. The advantage of the method is that each modality can be learned using its best-fit classifier, but the approach is time-consuming because it needs different classifiers. In sentiment classification, late or decision-level fusion methods achieve better results than traditional early or feature-level fusion methods [34]. Hybrid fusion [14, 35] combines the advantages of decision-level and feature-level fusion techniques, so it can produce satisfactory results. Model-level fusion [36] extracts relationships between data from different modalities. This technique combines features from different modules to improve the performance of the model [37]. Rule-based fusion [38] uses techniques such as weighted fusion and majority voting to fuse multimodal features. Weighted fusion combines features from multiple modalities with operations such as sum or product. However, these methods perform well only when the weights are correctly initialized.

Taking an attention mechanism to fuse multimodal features is a common approach for model fusion in recent years [4, 12, 39]. The attention mechanism can get the weight of each word or each pixel by query vectors, which enables the model to effectively focus on features rich in emotional information and reduce noise. How to effectively fuse semantic information of different modalities and reduce noise input is a research focus. In this paper, we construct a multi-interaction attention mechanism to fuse the dual interaction features between images and text. In addition, we use a gated attention mechanism to fuse interactive features and text features in the fusion stage to avoid excessive noise. In particular, the constructed fusion mechanism also ensures that text is the primary source of affective information, which is consistent with the fact that the quality of sentiment information in the text modality is superior to that of the image modality.

3. Model

Although multimodal data typically includes images, text, audio, video, etc., prevalent datasets often contain only two modalities: text and images. We transform the MABSA task into a sequence labeling task by using a collapsed label strategy. Based on this strategy, we construct the sentiment label $y^{S}\in\{O\}\cup\{B:\textit{POS},I:\textit{POS},E:\textit{POS}\}\cup\{B:\textit{NEG},I:\textit{NEG},E:\textit{NEG}\}\cup\{B:\textit{NEU},I:\textit{NEU},E:\textit{NEU}\}$ . Each label carries two pieces of information: the boundary and the sentiment of the aspect term. For example, the multimodal input includes a sample containing the textual content $T=[\textit{View},\textit{over},\textit{Glaston},\textit{Festival},\textit{tonight}]$ and a related image. In this text, “Glaston” and “Festival” together form a term, “Glaston” is the beginning of the term, “Festival” is the end of the term, and the sentiment of the term is positive. Therefore, our goal is to label the aspect terms in the text and their emotional polarity $Y^{S}=\{O,O,B:\textit{POS},E:\textit{POS},O\}$ . The whole framework of CHIFN is shown in Fig. 1. Firstly, in the FEM, the text features and image features are extracted. Secondly, in the FAM, we obtain the bidirectional interaction features between images and text with the multi-head interaction attention mechanism. Then, in the FFM, we use the gating mechanism to fuse the features obtained from the FAM, aiming to avoid introducing excessive noise and to preserve the dominance of the text. Finally, in the SPM, we use the BSM to obtain the boundary features and combine them with the initial aspect sentiment features to obtain the final aspect sentiment features.
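As a minimal sketch of the collapsed labeling described above, the following Python snippet converts aspect spans into collapsed tags for the example sentence. The function name `collapse_tags`, the span format `(start, end, sentiment)`, and the single-word `S:` tag (used in Section 3.4 but not listed in the label set above) are assumptions made for illustration and are not taken from the original implementation.

```python
from typing import List, Tuple


def collapse_tags(n_tokens: int, aspects: List[Tuple[int, int, str]]) -> List[str]:
    """Build collapsed tags from (start, end_inclusive, sentiment) spans."""
    tags = ["O"] * n_tokens
    for start, end, sentiment in aspects:
        if start == end:
            tags[start] = f"S:{sentiment}"      # single-word aspect term (assumed tag)
            continue
        tags[start] = f"B:{sentiment}"          # beginning of the term
        for k in range(start + 1, end):
            tags[k] = f"I:{sentiment}"          # inside the term
        tags[end] = f"E:{sentiment}"            # end of the term
    return tags


tokens = ["View", "over", "Glaston", "Festival", "tonight"]
tags = collapse_tags(len(tokens), [(2, 3, "POS")])
print(list(zip(tokens, tags)))
```

Each non-O tag carries both the boundary position and the polarity, so a single tagger output is enough to recover the aspect terms and their sentiment.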

Figure 1.

The model CHIFN’s framework.

3.1 Feature extraction module

In the module, we use the pre-trained model BERT, BiLSTM, and GCNs to fully extract high-level semantic features of the text and the pre-trained VGGNet network to extract picture pixel features.

3.1.1 Text feature extraction

Given a text $T=\{{W_{0}},{W_{1}},\ldots,{W_{t}}\}$ , the pre-trained model BERT [8] is used to extract the text feature vector ${T^{B}}=\{{w_{0}},{w_{1}},\ldots,{w_{t}}\}$ from the raw text $T$ , where ${w_{i}}\in{R^{{D_{B}}}}$ , $i\in[0,t]$ , and ${D_{B}}$ denotes the dimension of the word vectors from BERT. To fully acquire the contextual information of the text, we use a BiLSTM and GCNs to refine the text feature. First, ${T^{B}}$ is fed into the BiLSTM to obtain the output $T_{B}^{H}=[{h_{0}},{h_{1}},\ldots,{h_{t}}]$ , where ${h_{i}}=[\overrightarrow{{h_{i}}},\overleftarrow{{h_{i}}}]$ and $i\in[0,t]$ . For each token, the dimension of the hidden feature representation ${h_{i}}$ learned by the LSTM unit is ${D_{h}}$ .

$\displaystyle w_{i}=\textit{BERT}(W_{i})$ (1)

$\displaystyle\overleftarrow{h_{i}}=\overleftarrow{\textit{LSTM}}(w_{i})$ (2)

$\displaystyle\overrightarrow{h_{i}}=\overrightarrow{\textit{LSTM}}(w_{i})$ (3)

Then, the StanfordNLP tool is used to extract the adjacency matrix $A$ from the syntactic structure of the text $T$. Finally, the final text feature ${T^{F}}=\{{x_{0}},{x_{1}},\ldots,{x_{t}}\}$ is obtained by feeding the Laplacian-normalized adjacency matrix $\hat{A}$ and the text features into the k-layer GCNs.

$\displaystyle\tilde{A}=A+I$ (4)

$\displaystyle\hat{A}=\tilde{D}^{-1/2}\tilde{A}\tilde{D}^{-1/2}$ (5)

$\displaystyle T^{B^{(l+1)}}=\sigma(\hat{A}{T^{{B^{(l)}}}}{W^{(l)}})$ (6)

where I is the identity matrix, $\tilde{D}$ is the degree matrix of $\tilde{A}$ , ${T^{{B^{(l)}}}}$ is the feature of the l-th layer of the network, $l\in[0,k]$ , and $\sigma$ is the nonlinear activation function.
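The graph convolution step can be illustrated with a small pure-Python sketch. It adds self-loops, applies the usual symmetric normalization $\tilde{D}^{-1/2}\tilde{A}\tilde{D}^{-1/2}$, and runs one propagation layer. The helper names, the toy three-token chain graph, the feature and weight values, and the choice of ReLU for $\sigma$ are all assumptions for illustration; the paper does not specify them.

```python
import math


def normalize_adjacency(A):
    """Return D~^{-1/2} (A + I) D~^{-1/2} for a square adjacency matrix A."""
    n = len(A)
    A_tilde = [[A[i][j] + (1.0 if i == j else 0.0) for j in range(n)] for i in range(n)]
    deg = [sum(row) for row in A_tilde]  # always > 0 because of the self-loops
    return [[A_tilde[i][j] / math.sqrt(deg[i] * deg[j]) for j in range(n)]
            for i in range(n)]


def matmul(X, Y):
    """Plain matrix product of two nested-list matrices."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*Y)] for row in X]


def gcn_layer(A_hat, H, W):
    """One propagation step: ReLU(A_hat @ H @ W). ReLU is an assumed choice of sigma."""
    Z = matmul(matmul(A_hat, H), W)
    return [[max(0.0, v) for v in row] for row in Z]


# Toy dependency graph over three tokens: 0 - 1 - 2 (hypothetical values).
A = [[0, 1, 0], [1, 0, 1], [0, 1, 0]]
A_hat = normalize_adjacency(A)
H = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]   # per-token input features
W = [[0.5, -0.5], [0.5, 0.5]]              # layer weights
H_next = gcn_layer(A_hat, H, W)
print(H_next)
```

Stacking k such layers lets each token aggregate information from tokens up to k syntactic hops away.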

3.1.2 Image feature extraction

We first use a pre-trained VGG to obtain a series of image feature vectors. To preserve the correspondence between the feature vectors and the regions of the two-dimensional image, we extract the image feature maps from a lower convolution layer, rather than using the output of the top fully connected layer as the initial image features [40]. This visual feature extractor produces S feature maps, each of which is an $M\times M$ tensor. Each feature map is flattened into a ${D_{\textit{img}}}$ -dimensional feature vector ${i_{j}}$ that corresponds to a part of the image. To extract the sequence information in the regional image feature vectors, we use a BiLSTM to model the extracted regional features, which can selectively focus on some parts of the image and reduce the input of image noise.

Given the image I, the pre-trained VGGNet [27] with the top FC (Fully Connected) layer removed is first used to extract S two-dimensional visual feature vectors ${i_{j}}$ from the image I. Then, a BiLSTM is used to obtain the interaction features between pixel points. The visual image features consist of the output of the final time step of the BiLSTM, namely, ${I^{F}}={h_{{i_{S}}}},{h_{{i_{S}}}}\in{R^{2\times{D_{i}}}}$ .

$\displaystyle I_{\textit{vector}}=\textit{VGGNet}(I)$ (7)

$\displaystyle I_{\textit{vector}}=[i_{1},i_{2},\ldots,i_{S}]$ (8)

$\displaystyle\overleftarrow{h_{i_{j}}}=\overleftarrow{\textit{LSTM}}(i_{j})$ (9)

$\displaystyle\overrightarrow{h_{i_{j}}}=\overrightarrow{\textit{LSTM}}(i_{j})$ (10)

3.2 Feature attention module

Considering that image information and text information reinforce and complement each other, we propose a multi-interaction attention mechanism to fully capture the bidirectional interaction between images and text. Because text features contain high-level semantic features and carry more discriminative semantic and emotional information, the text modality needs to play a dominant role in multimodal aspect-based sentiment analysis. In the feature attention module, we first use text features and image features to get text attention features. Then, image attention features are obtained by combining text attention features and text features [11].

3.2.1 Textual attention

When people look at photos, they tend to focus on the parts that matter to them rather than the entire image. In other words, different pixels in a picture contribute differently to sentiment analysis. Therefore, we construct a textual attention layer to assign different weights to different pixels. First, we employ a multi-head interaction attention mechanism, which obtains text-guided image features by using the text feature ${T^{F}}=[{x_{0}},{x_{1}},\ldots,{x_{t}}]$ as queries and the visual image feature ${I^{F}}$ as keys and values. The Feed-forward Network (FFN) [41] and the Layer Normalization (LN) [42] are then used to process the text-guided image attention.

$\displaystyle Z=LN(T^{F}+\textit{MCATT}(T^{F},I^{F}))$ (11)

$\displaystyle H_{T\_I}=LN(\textit{FFN}(Z)+Z)$ (12)

$\displaystyle H_{T\_I}=[H_{T\_I}[0],\ldots,H_{T\_I}[t]]$ (13)

where ${H_{T\_I}}\in{R^{(t+1)\times d}}$ is the text-aware image representation.
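The residual attention step of Eq. (11) can be sketched as follows. This is a simplified single-head version without learned projection matrices and without the feed-forward sublayer of Eq. (12), so it is not the multi-head mechanism of the paper. The function names, the toy feature values, and the use of several image region vectors as keys and values are assumptions for illustration.

```python
import math


def softmax(v):
    """Numerically stable softmax over a list of scores."""
    m = max(v)
    e = [math.exp(x - m) for x in v]
    s = sum(e)
    return [x / s for x in e]


def layer_norm(v, eps=1e-6):
    """Normalize one feature vector to zero mean and unit variance (no learned scale)."""
    mean = sum(v) / len(v)
    var = sum((x - mean) ** 2 for x in v) / len(v)
    return [(x - mean) / math.sqrt(var + eps) for x in v]


def cross_attention(queries, keys, values):
    """Single-head scaled dot-product attention: each query attends over all keys."""
    d = len(keys[0])
    out = []
    for q in queries:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in keys]
        weights = softmax(scores)
        out.append([sum(w * v[j] for w, v in zip(weights, values))
                    for j in range(len(values[0]))])
    return out


def text_guided_image(text_feats, image_feats):
    """Z = LN(T + ATT(T, I)): text tokens are queries, image regions are keys and values."""
    attended = cross_attention(text_feats, image_feats, image_feats)
    return [layer_norm([t + a for t, a in zip(t_row, a_row)])
            for t_row, a_row in zip(text_feats, attended)]


# Hypothetical toy inputs: 3 text tokens and 2 image regions, all 4-dimensional.
text = [[1.0, 0.0, 0.0, 1.0], [0.0, 1.0, 0.5, 0.0], [0.2, 0.1, 0.0, 0.3]]
regions = [[0.5, 0.5, 0.0, 0.0], [0.0, 0.0, 1.0, 1.0]]
Z = text_guided_image(text, regions)
print(Z)
```

The output keeps one row per text token, which matches the $(t+1)\times d$ shape stated for the text-aware image representation.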

3.2.2 Image attention

Since different words contribute differently to sentiment analysis, we construct an image attention layer to extract important sentiment words. The multi-head interactive attention mechanism can obtain the image-guided text representation ${H_{I\_T}}$ by using ${H_{T\_I}}$ as queries and ${T^{F}}$ as keys and values.

$\displaystyle H_{I\_T}=\textit{TAI}(H_{T\_I},T^{F})$ (14)

$\displaystyle H_{I\_T}=[H_{I\_T}[0],\ldots,H_{I\_T}[t]]$ (15)

3.3 Feature fusion module

To avoid excessive noise, the gated attention mechanism is used to fuse the text feature ${T^{F}}$ and the image-guided text feature ${H_{I\_T}}$ obtained from FAM. This mechanism not only enables the acquisition of image features related to text but also helps to disregard unimportant information in the image [24]. Additionally, the gated attention mechanism allows the model to focus more on important information in the text by using the image-guided text feature while retaining the information dominance of the text.

$\displaystyle g_{i}=\delta(W_{g1}H_{I\_T_{i}}+W_{g2}T^{F}_{i}+b_{g})$ (16)

$\displaystyle l_{i}=g_{i}\times H_{I\_T_{i}}$ (17)

$\displaystyle z_{i}=\textit{tanh}(W_{z1}l_{i}+W_{z2}T^{F}_{i}+b_{z})$ (18)

$\displaystyle Z=[z_{1},z_{2},\ldots,z_{t}]$ (19)

where $\delta$ represents the softmax activation function, $W_{g1}$ , $W_{g2}$ , $b_{g}$ , $W_{z1}$ , $W_{z2}$ and $b_{z}$ are the trainable parameters, and $Z$ denotes the final text-image fusion feature.
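A per-token sketch of the gated fusion in Eqs. (16)-(18) is given below. The function names, the identity weight matrices, and the toy vectors are hypothetical, and applying the softmax over the feature dimension of a single token is an assumption, since the axis is not stated in the text.

```python
import math


def softmax(v):
    """Numerically stable softmax over a list of scores."""
    m = max(v)
    e = [math.exp(x - m) for x in v]
    s = sum(e)
    return [x / s for x in e]


def affine(W1, x1, W2, x2, b):
    """Compute W1 @ x1 + W2 @ x2 + b for one token."""
    return [sum(w * x for w, x in zip(r1, x1)) + sum(w * x for w, x in zip(r2, x2)) + bi
            for r1, r2, bi in zip(W1, W2, b)]


def gated_fusion(h, t, Wg1, Wg2, bg, Wz1, Wz2, bz):
    """Fuse the image-guided text feature h with the text feature t for one token."""
    g = softmax(affine(Wg1, h, Wg2, t, bg))                      # Eq. (16): gate
    l = [gi * hi for gi, hi in zip(g, h)]                        # Eq. (17): gated feature
    return [math.tanh(v) for v in affine(Wz1, l, Wz2, t, bz)]    # Eq. (18): fused feature


# Hypothetical 2-dimensional example with identity weights and zero biases.
eye = [[1.0, 0.0], [0.0, 1.0]]
zero_b = [0.0, 0.0]
z = gated_fusion([1.0, -1.0], [0.5, 0.5], eye, eye, zero_b, eye, eye, zero_b)
print(z)
```

Because the text feature enters both the gate and the final projection, the text can still dominate the fused representation when the gate suppresses the image-guided feature.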

3.4 Boundary sentiment module

According to Li et al. [18], BSM is designed to facilitate the ASC task by automatically assimilating boundary information through a boundary-guided matrix. For instance, if the boundary label of a word is S, indicating that the word is a single-word aspect term, the matching collapsed label for the word can only be S: POS, S: NEG or S: NEU. Therefore, we establish the auxiliary module BSM for aspect term boundary prediction, where the effective label set ${Y^{T}}$ is {B, I, E, S, O}.

First, we use the FC layer with a nonlinear activation function to predict boundary labels ${Y^{T}}$ .

$\displaystyle y_{i}^{T}=\delta(W^{t}\times Z[i]+b^{t})$ (20)

$\displaystyle Y_{\textit{pred}}^{T}=[y_{1}^{T},y_{2}^{T},\ldots,y_{t}^{T}]$ (21)

where $\delta$ is the softmax activation function, and $W^{t}$ and $b^{t}$ are trainable parameters.

Then, the transition matrix ${W^{TS}}\in{R^{|{y^{T}}|\times|{y^{S}}|}}$ for absorbing the boundary information is proposed. Since the prior information about the transition probability between boundary labels and collapsed labels is unknown, we initialize ${W^{TS}}$ by

$\displaystyle W_{i,j}^{TS}=\left\{\begin{array}{ll}\frac{1}{\sum\limits_{j'=0}^{k}\Omega_{i\to j'}},&\text{if }\Omega_{i\to j}=1\\ 0,&\text{if }\Omega_{i\to j}=0\end{array}\right.$

where ${\Omega_{i\to j}}=1$ if the aspect term boundary label $i$ can be transferred to the aspect term sentiment tag $j$ , otherwise ${\Omega_{i\to j}}=0$ , $k$ indicates the number of types of aspect term sentiment labels. For example, ${\Omega_{B\to B:\textit{POS}}}=1$ , ${\Omega_{B\to I:\textit{POS}}}=0$ .

After obtaining ${W^{TS}}$ , we can get the aspect boundary information ${Y^{T\to S}}$ by

$\displaystyle Y^{T\to S}=Y^{T}_{\textit{pred}}*W^{TS}$ (22)
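The initialization of the boundary-guided transition matrix and the projection of Eq. (22) can be sketched in a few lines of Python. The label ordering, the `B:POS`-style tag strings, the inclusion of the `S:` tags in the collapsed set, and the function names are assumptions for illustration. During training ${W^{TS}}$ would be updated; only its initial value is shown here.

```python
BOUNDARY = ["B", "I", "E", "S", "O"]
SENTIMENTS = ["POS", "NEG", "NEU"]
# Collapsed label set: "O" plus every boundary/sentiment combination (13 labels).
COLLAPSED = ["O"] + [f"{b}:{s}" for b in "BIES" for s in SENTIMENTS]


def init_transition():
    """W[i][j] = 1 / (number of collapsed labels compatible with boundary i), else 0."""
    W = []
    for b in BOUNDARY:
        # Omega_{i -> j} = 1 when the collapsed label j starts with boundary label i.
        omega = [1.0 if c.split(":")[0] == b else 0.0 for c in COLLAPSED]
        total = sum(omega)
        W.append([o / total for o in omega])
    return W


def boundary_to_sentiment(y_pred, W):
    """Eq. (22): multiply per-token boundary distributions by the transition matrix."""
    return [[sum(p * W[i][j] for i, p in enumerate(row)) for j in range(len(COLLAPSED))]
            for row in y_pred]


W_ts = init_transition()
# One token predicted as boundary "B" with full confidence (hypothetical input).
y_ts = boundary_to_sentiment([[1.0, 0.0, 0.0, 0.0, 0.0]], W_ts)
print(dict(zip(COLLAPSED, y_ts[0])))
```

A confident "B" prediction spreads its probability mass evenly over B:POS, B:NEG, and B:NEU and gives zero mass to every incompatible collapsed label.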

3.5 Sentiment prediction module

We obtain the initial aspect sentiment label ${Y^{\textit{S\_initial}}}$ by feeding the text-image fusion feature Z from FFM into the FC layer. After ${Y^{\textit{S\_initial}}}$ and ${Y^{T\to S}}$ are fed into an attention layer, the final aspect sentiment tag ${Y^{S}}$ is obtained by using ${Y^{T\to S}}$ to guide ${Y^{\textit{S\_initial}}}$ .

$\displaystyle Y^{\textit{S\_initial}}=\delta_{1}(W_{Z}Z+b_{Z})$ (23)

$\displaystyle g_{T\rightarrow S}=\delta_{2}(W_{T\rightarrow S}Y^{T\rightarrow S}+b_{T\rightarrow S})$ (24)

$\displaystyle Y_{i}^{S}={g_{T\rightarrow S,i}}\otimes{Y^{\textit{S\_initial}}}_{i}$ (25)

$\displaystyle Y^{S}=[Y^{S}_{1},Y^{S}_{2},\ldots,Y^{S}_{t}]$ (26)

where $\delta_{1}$ and $\delta_{2}$ denote the tanh and softmax activation functions, and $W_{Z}$ , $b_{Z}$ , $W_{T\rightarrow S}$ and $b_{T\rightarrow S}$ are trainable parameters.

3.6 Model training

In this paper, the CrossEntropy Loss function is used to obtain the loss of aspect terms extraction ${Y^{T}}$ and aspect sentiment classification ${Y^{S}}$ .

$\displaystyle L({F_{T}})=\textit{CrossEntropyLoss}(Y_{\textit{pred}}^{T},Y_{\textit{true}}^{T})$ (27)

$\displaystyle L({F_{S}})=\textit{CrossEntropyLoss}(Y_{\textit{pred}}^{S},Y_{\textit{true}}^{S})$ (28)

In addition, L2 regularization [49] is used for joint optimization of loss functions. This optimization method can effectively improve the feature representation and performance of CHIFN.

$\displaystyle\textit{Loss}=L(F_{T})+L(F_{S})+\lambda\|\theta\|_{2}$ (29)
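The joint objective can be written out as a short Python sketch. The function names and the toy inputs are hypothetical, the cross-entropy is averaged over tokens (an assumption, since the reduction is not stated), and the default value of `lam` mirrors the regularization coefficient 0.0001 listed in Table 4.

```python
import math


def cross_entropy(pred, gold):
    """Mean negative log-likelihood.

    pred: list of per-token probability distributions; gold: list of class indices.
    """
    return -sum(math.log(p[g]) for p, g in zip(pred, gold)) / len(gold)


def l2_norm(params):
    """Euclidean norm of a flat list of parameters."""
    return math.sqrt(sum(w * w for w in params))


def joint_loss(pred_T, gold_T, pred_S, gold_S, params, lam=1e-4):
    """Boundary loss + sentiment loss + lambda * ||theta||_2."""
    return (cross_entropy(pred_T, gold_T)
            + cross_entropy(pred_S, gold_S)
            + lam * l2_norm(params))


# Hypothetical two-token example with 2 boundary classes and 3 sentiment classes.
pred_T = [[0.9, 0.1], [0.2, 0.8]]
pred_S = [[0.7, 0.2, 0.1], [0.1, 0.1, 0.8]]
loss = joint_loss(pred_T, [0, 1], pred_S, [0, 2], params=[0.5, -0.5, 1.0])
print(round(loss, 4))
```

Summing the two cross-entropy terms means the boundary tagger and the sentiment tagger are optimized together, so gradients from both subtasks reach the shared layers.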

4. Experiment

In this paper, comparison and ablation experiments were performed to demonstrate the performance of the proposed CHIFN.

4.1 Datasets

Two publicly available datasets (TWITTER-2015 and TWITTER-2017 [43, 45]) were used to validate the effectiveness of the proposed method, and their details are shown in Table 2. The two datasets are multimodal tweets posted on Twitter from 2014 to 2015 and 2016 to 2017, respectively. When filtering the corpus, tweets without images were removed and tweets with images were retained. If a text corresponds to multiple images, then a randomly selected image is used as the matching image for the text to ensure that the text and image are one-to-one. Finally, tweets without any aspect terms, with a text length less than 3, and those containing challenging text were excluded. Corpus annotation follows the BIO-2 standard [48]. In this article, the two Twitter datasets are used in combination.

Table 2
Basic statistics of the two Twitter datasets

Label	TWITTER-2015			TWITTER-2017
	Train	Dev.	Test	Train	Dev.	Test
#NEU	1883	670	607	1638	517	573
#POS	928	303	317	1508	515	493
#NEG	368	149	113	416	144	168
Total	3179	1122	1037	3562	1176	1234

During pre-processing, we found that many samples in the original datasets were repeated: the same tweet appears several times, each time with a different aspect term. This format is inconvenient for tagging multiple aspect terms in one sentence. Therefore, we merged the aspect term labels of identical tweets in the original datasets to obtain new datasets. We followed the split ratio of the original corpus, as detailed in Table 3. Since the original test dataset did not annotate the sentiment of the words, we divided the processed training dataset into a training dataset and a validation dataset at a ratio of 9:1 and verified the effect of our model on the validation dataset. For the unlabeled test dataset, we only performed label prediction.

Table 3

Details of the dataset used

	Train dataset	Test dataset
Sample size	3414	1348
Average number of aspect terms	2.13	1.06
Total number of words	55977	23608
Total number of pictures	3414	1348
The average length of the sample	16.40	17.51

Table 4

The summary of parameter values

Name	Value
The batch size	10
The learning rate	0.0001
The dropout probability	0.3
The maximum length of text	30
The shape of the text feature	30 $\times$ 768
The output dimension of BERT	768
The regularization coefficient	0.0001
The shape of the image feature	7 $\times$ 7 $\times$ 512
The output dimension of LSTM	128

4.2 Experimental setting

For text data, we used the pre-trained “Bert-base-uncased” model [8] with 12 transformer blocks as the text feature extractor. The images were first resized to 224 $\times$ 224 and then fed into the pre-trained VGG-16 [27] network, and the outputs before the classification layer were taken as image features. We used the weights of the VGG-16 model trained on the ImageNet dataset [46] to initialize the weights of the image encoder, and these weights were kept fixed during subsequent training. We adopted two layers of GCN units, with an output size of 128 for the first layer and 128 $\times$ 2 for the second layer. Additionally, we used early stopping and L2 regularization to prevent overfitting. All of the models were implemented using TensorFlow 2.9.1. The Adam optimizer was used to minimize the loss. Table 4 shows the important parameters of the proposed method.

4.3 Evaluation metrics

CHIFN transforms the two subtasks of aspect-based sentiment analysis into one sequence annotation task, which means that it needs to simultaneously label aspect terms and their emotional polarity. For the ABSA task, a correct prediction means that the aspect terms are completely extracted and the emotional polarity of the aspect terms is correctly labeled. We adopted the Recall, Precision, and F1-measure to evaluate the performance of the term sentiment classification model. F1 is the harmonic mean of Recall and Precision, which can evaluate model performance under the condition of data imbalance.

$\displaystyle\textit{Precision}=\frac{TP}{TP+FP}$ (30)

$\displaystyle\textit{Recall}=\frac{TP}{TP+FN}$ (31)

$\displaystyle\textit{F1-measure}=\frac{2\times\textit{Precision}\times\textit{Recall}}{\textit{Precision}+\textit{Recall}}$ (32)

where TP is the number of actual aspect terms that the model correctly predicts as aspect terms, TN is the number of actual non-aspect terms that the model correctly predicts as non-aspect terms, FP is the number of actual non-aspect terms that the model incorrectly predicts as aspect terms, and FN is the number of actual aspect terms that the model incorrectly predicts as non-aspect terms.
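
The exact-match criterion and Eqs. (30)–(32) can be illustrated with a minimal sketch. It assumes a unified tagging scheme of the form `B-POS`, `I-POS`, `O` (the tag names, function names, and toy sentences are hypothetical and chosen only for illustration); a predicted span counts as a true positive only when its boundaries and its polarity both match a gold span.

```python
def extract_spans(tags):
    """Collect (start, end, polarity) spans from unified tags such as 'B-POS', 'I-POS', 'O'."""
    spans, start, polarity = set(), None, None
    for i, tag in enumerate(list(tags) + ["O"]):  # sentinel 'O' closes a trailing span
        prefix, _, label = tag.partition("-")
        continues = prefix == "I" and start is not None and label == polarity
        if start is not None and not continues:
            spans.add((start, i - 1, polarity))
            start, polarity = None, None
        if prefix == "B":
            start, polarity = i, label
    return spans


def precision_recall_f1(gold_sentences, pred_sentences):
    """Span-level Precision, Recall, and F1-measure with exact matching."""
    tp = fp = fn = 0
    for gold, pred in zip(gold_sentences, pred_sentences):
        gold_spans, pred_spans = extract_spans(gold), extract_spans(pred)
        tp += len(gold_spans & pred_spans)   # boundaries and polarity both correct
        fp += len(pred_spans - gold_spans)   # predicted spans with no exact gold match
        fn += len(gold_spans - pred_spans)   # gold spans the model missed
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


gold = [["B-POS", "I-POS", "O", "B-NEG"]]
pred = [["B-POS", "I-POS", "O", "B-NEU"]]  # second term found, but polarity is wrong
print(precision_recall_f1(gold, pred))  # (0.5, 0.5, 0.5)
```

In this toy example the second aspect term is located correctly but receives the wrong polarity, so it counts as both a false positive and a false negative. Scoring aspect term extraction alone would, under the same assumptions, compare spans with the polarity field dropped.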

4.4 Baseline models

Several baseline algorithms were used as competing algorithms to validate the effectiveness of the proposed method. In addition, we developed some ablation models to study the effectiveness of different modules in the proposed network. Details of these methods are given below.

Base model $+$ BG $+$ SC $+$ OE [18]: A method for the E2E-ABSA task that consists of a base model and three components. The base model stacks two RNNs, with the top one handling the ATE task and the bottom one handling the ASC task. Boundary guidance (BG), sentiment consistency (SC), and opinion enhancement (OE) are the three components attached to the base network.

BERT $+$ C-ATT [17]: A model with a three-layer structure. The first layer obtains word representations from a pre-trained language model. The second layer is a sentence constituent attention layer, which is mainly used to extract sentence constituents. Guided by the induced constituents, words in the same constituent are constrained to attend to each other, so that an aspect term pays more attention to its corresponding opinion. The final layer is a linear classification layer that predicts the unified labels.

GOPREM [47]: GOPREM obtains text representations through a pre-trained BERT model and extracts image features using a pre-trained Faster-RCNN object detection network. It strengthens the syntactic relationship by using a gating mechanism to integrate these two modal features and a syntactic number-based graph attention network. The model then lets the text features interact with the image features through a common attention network and combines the interactive features through a gating mechanism. Finally, a Conditional Random Field (CRF) model is used to extract terms and determine their sentiment polarity.

CHIFN_BAM, CHIFN_GCNs, and CHIFN_FFM represent CHIFN without BAM, CHIFN without GCNs, and CHIFN without FFM, respectively.

4.5 Results and analysis

Table 5 shows the performance of CHIFN and the competing algorithms on the two datasets, and Fig. 2 visualizes these results. As Fig. 2 and Table 5 show, the Precision and F1-measure of the proposed method were higher than those of the current state-of-the-art model (COPREM) on both subtasks. Although the Recall of CHIFN on aspect term extraction was lower than that of COPREM (0.858 versus 0.863), the difference was small. Overall, CHIFN achieved the best F1-measure on both tasks.

In addition, we observed the following. In the text modality, the ATE and ASC F1-measures of BERT $+$ C-ATT are 1.8 and 1.5 percentage points higher than those of GloVe $+$ C-ATT, respectively, suggesting that the pre-trained BERT is a more effective word embedding layer than GloVe. The reason is that GloVe generates context-independent word vectors and therefore cannot distinguish the senses of polysemous words, whereas BERT produces context-dependent representations that alleviate this problem. Both GloVe $+$ C-ATT and Base model $+$ BG $+$ SC $+$ OE use GloVe as the feature extractor, but GloVe $+$ C-ATT, which adds the attention mechanism, performs better, which partly illustrates the effectiveness of the attention mechanism. As shown in Table 5, the models that use multimodal data achieve higher F1-measures than the text-only models. Fig. 2 shows more intuitively that our model performs better than the existing models. Except for the Recall on the aspect term extraction task, all evaluation metrics of CHIFN exceed those of COPREM, the advanced multimodal sentiment analysis model. The F1-measure values of CHIFN for the two subtasks are 85.8% and 85.3%, respectively. On the TWITTER datasets, CHIFN surpasses COPREM by 1.5 and 3.6 percentage points in F1-measure on the two subtasks, respectively. All of these results confirm the superiority of CHIFN in the E2E-ABSA task based on multimodal data.

Table 5
Comparison of the baseline models and CHIFN

Modality	Method	TWITTER
		Aspect terms extraction			Aspect sentiment classification
		Precision	Recall	F1-measure	Precision	Recall	F1-measure
Text	Base model $+$ BG $+$ SC $+$ OE	0.681	0.586	0.630	0.501	0.519	0.510
	GloVe $+$ C-ATT	0.816	0.788	0.802	0.814	0.773	0.793
	BERT $+$ C-ATT	0.838	0.802	0.820	0.829	0.788	0.808
Text $+$ pictures	COPREM	0.824	0.863	0.843	0.801	0.833	0.817
	CHIFN (ours)	0.858	0.858	0.858	0.858	0.848	0.853
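
The F1-measure column and the reported gaps can be re-derived from the Precision and Recall values in Table 5 using Eq. (32). The following sketch does this for three of the rows; the numbers are copied from the table, while the variable names are only illustrative.

```python
def f1(precision, recall):
    """Harmonic mean of Precision and Recall, as in Eq. (32)."""
    return 2 * precision * recall / (precision + recall)


# (Precision, Recall) pairs copied from Table 5: ATE first, ASC second.
table5 = {
    "BERT + C-ATT": [(0.838, 0.802), (0.829, 0.788)],
    "COPREM": [(0.824, 0.863), (0.801, 0.833)],
    "CHIFN": [(0.858, 0.858), (0.858, 0.848)],
}

scores = {name: [round(f1(p, r), 3) for p, r in rows] for name, rows in table5.items()}

# Gap between CHIFN and COPREM in percentage points of F1-measure.
gaps = [round(100 * (ours - base), 1)
        for ours, base in zip(scores["CHIFN"], scores["COPREM"])]

print(scores["CHIFN"], scores["COPREM"])  # [0.858, 0.853] [0.843, 0.817]
print(gaps)                               # [1.5, 3.6]
```

The recomputed values agree with the F1-measure column of Table 5 to three decimal places, and the gaps are differences in percentage points rather than relative improvements.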

Table 6

The results of ablation experiments

Method	TWITTER
	Aspect terms extraction			Aspect sentiment classification
	Precision	Recall	F1-measure	Precision	Recall	F1-measure
CHIFN_BAM	0.819	0.818	0.819	0.820	0.808	0.814
CHIFN_GCNs	0.853	0.850	0.851	0.860	0.817	0.838
CHIFN_FFA	0.872	0.644	0.741	0.874	0.426	0.573
CHIFN (ours)	0.858	0.858	0.858	0.858	0.848	0.853

Figure 2.

The comparison of experimental performance.

Compared with the text-only unimodal models, the text $+$ picture multimodal models perform better, indicating that image data can assist the text in aspect-based sentiment analysis. Moreover, our model outperforms the existing multimodal aspect-based sentiment analysis model GOPREM, mainly for two reasons. First, the aspect boundary transfer matrix transfers aspect boundary information to the sentiment task more effectively. Second, during modal fusion we keep the text as the primary modality, which prevents noise from the images from being introduced.

4.6 Intrinsic comparisons

To understand the contribution of each module in the proposed network, we conducted ablation experiments; the results are shown in Table 6. They show that BAM, GCNs, and FFA all improve the F1-measure of the model. There are three reasons for these results. (1) The BAM module can effectively transfer boundary information to help the system classify more accurately; (2) the GCNs module can enhance and refine the extracted semantic features; (3) feature fusion can effectively reduce the introduction of noise.
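
To make the size of each module's contribution explicit, the sketch below computes how far the F1-measure drops when a module is removed. The values are copied from Table 6; the variable names and the ranking by summed drop are illustrative choices, not part of the original analysis.

```python
# F1-measure values copied from Table 6: (ATE, ASC).
full_model = (0.858, 0.853)
ablations = {
    "CHIFN_BAM": (0.819, 0.814),
    "CHIFN_GCNs": (0.851, 0.838),
    "CHIFN_FFA": (0.741, 0.573),
}

# Drop in F1-measure (percentage points) relative to the full model.
drops = {
    name: tuple(round(100 * (full - ablated), 1)
                for full, ablated in zip(full_model, f1_scores))
    for name, f1_scores in ablations.items()
}

# Variants ordered from the largest to the smallest combined drop.
ranked = sorted(drops, key=lambda name: sum(drops[name]), reverse=True)

print(drops)   # {'CHIFN_BAM': (3.9, 3.9), 'CHIFN_GCNs': (0.7, 1.5), 'CHIFN_FFA': (11.7, 28.0)}
print(ranked)  # ['CHIFN_FFA', 'CHIFN_BAM', 'CHIFN_GCNs']
```

On these numbers, the CHIFN_FFA variant loses the most F1-measure, and Table 6 indicates that this loss comes from Recall rather than Precision.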

5. Conclusion and future work

To realize end-to-end multimodal aspect-based sentiment analysis, this paper proposed CHIFN, which solves the ATE and ASC tasks simultaneously. CHIFN contains five main modules, namely, a feature extraction module, a feature attention module, a feature fusion module, a sentiment prediction module, and a boundary auxiliary module, which work in conjunction with each other to accomplish the task. Specifically, the feature extraction module was used to extract the high-level semantic features within the text and images, and the feature attention module with the interaction attention mechanism was utilized to obtain interaction features for both modalities. The feature fusion module and boundary auxiliary module played important roles in acquiring image features that are highly correlated with the text and providing boundary information, respectively. The sentiment prediction module with the classification layer was used to obtain aspect terms as well as sentiment categories in the text.

Although the proposed method has good performance in the social media corpus, it still has some shortcomings.

Since a large number of words in the text are non-aspect terms, the model may tend to label all words as non-aspect terms. In addition, the sentiment categories of the entities in the dataset are unbalanced, and the polarity of most evaluation objects is neutral. As a result, the model cannot learn sentiment information well from the small number of remaining samples. Although we would like to validate the performance of our model on other datasets, there are only two publicly available multimodal datasets (TWITTER-2015 and TWITTER-2017) that meet the requirements of the fine-grained analysis task. Therefore, we will work towards more accurate and larger corpus annotations in our next work to further validate the performance of our model.

We used two pre-trained models with a large number of parameters to extract text and image features, which makes our model less practical. At the same time, the two pre-trained models are located in different feature representation spaces, which might lead to cross-modal semantic gaps. In our subsequent work, we will consider using the knowledge distillation method to compress the number of model parameters and an auxiliary model to map text and image features to the same spatial dimension to reduce the semantic gap between the different modalities.

While we hope for a metric to assess the correlation between the meaning, aspects, and polarity of the text with the features extracted from the associated image, such a metric does not currently exist. In the future, we will undertake research in this area to further enhance and refine our understanding.

Footnotes

Funding and/or conflicts of interests/competing interests

We declare that we have no financial and personal relationships with other people and organizations, which may improperly affect our work, and that we have no professional or other personal interests of any nature or kind in any products, services and companies.

References

1. Q.F., Lin L.Y., et al., MEDT: Using multimodal encoding-decoding network as in transformer for multimodal sentiment analysis, IEEE Access 10 (2022), 28750–28759.
2. Chandrasekaran G., Nguyen T.N. and Hemanth D.J., Multimodal sentimental analysis for social media applications: A comprehensive review, Wiley Interdisciplinary Reviews: Data Mining and Knowledge Discovery 11 (2021).
3. Q.C., Stefani, Toto, et al., Towards Multimodal Sentiment Analysis Inspired by the Quantum Theoretical Framework, in: 2020 IEEE Conference on Multimedia Information Processing and Retrieval (MIPR), 2020. pp. 177–180.
4. Mao and Chen, A Co-Memory Network for Multimodal Sentiment Analysis, in: The 41st International ACM SIGIR Conference on Research & Development in Information Retrieval, 2018. pp. 177–180.
5. Luo H.S., T.R., Liu, et al., DOER: Dual Cross-Shared RNN for Aspect Term-Polarity Co-Extraction, Annual Meeting of the Association for Computational Linguistics, 2019, 591–601.
6. Zhao S.C., Jia G.L., Yang J.F., et al., Emotion recognition from multiple modalities: Fundamentals and methodologies, IEEE Signal Processing Magazine 38 (2021), 59–73.
7. J.F., Chen and Xia, Hierarchical interactive multimodal transformer for aspect-based multimodal sentiment analysis, IEEE Transactions on Affective Computing 14 (2023), 1966–1978.
8. Devlin, Chang M.W., Lee, et al., BERT: Pre-training of deep bidirectional transformers for language understanding, North American Chapter of the Association for Computational Linguistics 1 (2019), 4171–4186.
9. Wang, Zhu, Dai, et al., Deep memory network with Bi-LSTM for personalized context-aware citation recommendation, Neurocomputing 410 (2020), 103–113.
10. Chen G.M., Tian Y.H. and Song, Joint Aspect Extraction and Sentiment Analysis with Directional Graph Convolutional Networks, in: International Conference on Computational Linguistics, 2020. pp. 272–279.
11. Peng, Zhang C.X., Xue X.J., et al., Cross-modal complementary network with hierarchical fusion for multimodal sentiment classification, Tsinghua Science and Technology 27 (2022), 664–672.
12. Mao W.J. and Chen G.D., Multi-interactive memory network for aspect based multimodal sentiment analysis, in: AAAI Conference on Artificial Intelligence, Vol. 46, 2019. pp. 371–378.
13. Kumar and Vepa, Gated Mechanism for Attention Based Multi Modal Sentiment Analysis, in: ICASSP 2020–2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2020. pp. 4477–4481.
14. Wöllmer, Weninger, Knaup, et al., YouTube Movie Reviews: Sentiment Analysis in an Audio-Visual Context, IEEE Intelligent Systems 28 (2013), 46–53.
15. Mitchell, Aguilar, Wilson, et al., Open Domain Targeted Sentiment, in: Conference on Empirical Methods in Natural Language Processing, 2013. pp. 1643–1654.
16. M.H., Peng Y.X., Huang, et al., Open-Domain Targeted Sentiment Analysis via Span-Based Extraction and Classification, ArXiv abs/1906.03820, 2019, 537–546.
17. Xiang, Zhang, et al., Sentence constituent-aware attention mechanism for end-to-end aspect-based sentiment analysis, Multimedia Tools and Applications 81 (2020), 15333–15348.
18. Bing L.D., P.G., et al., A unified model for opinion target extraction and target sentiment prediction, in: AAAI Conference on Artificial Intelligence, Vol. 824, 2019. pp. 6714–6721.
19. Zhang M.S., Zhang and Vo D.T., Neural Networks for Open Domain Targeted Sentiment, in: Conference on Empirical Methods in Natural Language Processing, 2015. pp. 612–621.
20. D.H., S.J. and Wang H.F., Joint Learning for Targeted Sentiment Analysis, in: Conference on Empirical Methods in Natural Language Processing, 2018. pp. 4737–4742.
21. Bing L.D., Zhang W.X., et al., Exploiting BERT for End-to-End Aspect-based Sentiment Analysis, in: Conference on Empirical Methods in Natural Language Processing, 2019. pp. 34–41.
22. Pennington, Socher and Manning, GloVe: Global Vectors for Word Representation, in: Conference on Empirical Methods in Natural Language Processing, 2014. pp. 1532–1543.
23. Borth, Chen, et al., Large-scale visual sentiment ontology and detectors using adjective noun pairs, in: Proceedings of the 21st ACM International Conference on Multimedia, 2013. pp. 223–232.
24. Y.P., Liu, Peng, et al., Gated attention fusion network for multimodal sentiment classification, Knowl. Based Syst 240 (2013), 108107.
25. Y.H., Lin H.F., Meng J.N., et al., Visual and textual sentiment analysis of a microblog using deep convolutional neural networks, Algorithms 9 (2016), 41.
26. Wang, Zhou, et al., Deep Tensor Evidence Fusion Network for Sentiment Classification, IEEE Transactions on Computational Social Systems, 2022.
27. Simonyan and Zisserman, Very Deep Convolutional Networks for Large-Scale Image Recognition, CoRR abs/1409.1556, 2014.
28. Wang G.R., Wang K.Z. and Lin, Adaptively Connected Neural Networks, in: 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2019. pp. 1781–1790.
29. Tzirakis, Trigeorgis, Nicolaou M.A., et al., End-to-end multimodal emotion recognition using deep neural networks, IEEE Journal of Selected Topics in Signal Processing 11 (2019), 1301–1309.
30. Hoang, Bihorac O.A. and Rouces, Aspect-Based Sentiment Analysis using BERT, in: Nordic Conference of Computational Linguistics, 2019. pp. 187–196.
31. Wang, Chen and Wang, Multi-task BERT for Aspect-based Sentiment Analysis, in: 2021 IEEE International Conference on Smart Computing (SMARTCOMP), 2021. pp. 383–385.
32. Monkaresi, Hussain M.S. and Calvo R.A., Classification of affects using head movement, skin color features and physiological signals, in: 2012 IEEE International Conference on Systems, Man, and Cybernetics (SMC), 2012. pp. 2664–2669.
33. Cai and Xia, Convolutional neural networks for multimedia sentiment analysis, Natural Language Processing and Chinese Computing 9362 (2012), 159–167.
34. Dobrišek, Gajsek, Mihelic, et al., Towards efficient multi-modal emotion recognition, International Journal of Advanced Robotic Systems 10 (2013), 53.
35. Siddiquie, Chisholm and Divakaran, Exploiting Multimodal Affect and Semantics to Identify Politically Persuasive Web Videos, in: Proceedings of the 2015 ACM on International Conference on Multimodal Interaction, 2015.
36. Lin J.C., C.H. and Wei W.L., Error weighted semi-coupled hidden markov model for audio-visual emotion recognition, IEEE Transactions on Multimedia 14 (2012), 142–156.
37. Zeng Z.H., Y.X., Liu, et al., Training combination strategy of multi-stream fused hidden Markov model for audio-visual affect recognition, ACM Multimedia, 2006.
38. Al-Azani and El-Alfy E.S.M., Enhanced video analytics for sentiment analysis based on fusing textual, auditory and visual information, IEEE Access 8 (2020), 136843–136857.
39. Truong Q.T. and Lauw H.W., VistaNet: visual aspect attention network for multimodal sentiment analysis, in: AAAI Conference on Artificial Intelligence, Vol. 38, 2019. pp. 305–312.
40. and Mao, MultiSentiNet: A Deep Semantic Network for Multimodal Sentiment Analysis, in: Proceedings of the 2017 ACM on Conference on Information and Knowledge Management, 2017.
41. Vaswani, Shazeer, Parmar, et al., Attention is all you need, Neural Information Processing Systems, 2017, 5998–6008.
42. J.L., Kiros J.R. and Hinton G.E., Layer normalization, ArXiv abs/1607.06450, 2016.
43. and Jiang, Adapting BERT for target-oriented multimodal sentiment classification, in: International Joint Conference on Artificial Intelligence, 2019. pp. 5408–5414.
44. Neves, Carvalho, et al., Adaptive co-attention network for named entity recognition in tweets, in: AAAI Conference on Artificial Intelligence, 2018. pp. 5674–5681.
45. J.L., Kiros J.R. and Hinton G.E., Visual attention model for name tagging in multimodal social media, Annual Meeting of the Association for Computational Linguistics 1 (2018), 1990–1999.
46. Deng, Dong, Socher, et al., ImageNet: A large-scale hierarchical image database, Computer Vision and Pattern Recognition, 2009, 248–255.
47. Cheng S.L., Key technologies for fine-grained sentiment analysis towards multimodal data, Nanjing: Southeast University, 2021.
48. Sang E.F.T.K. and Veenstra, Representing Text Chunks, ArXiv cs.CL/9907006, 1999, 173–179.
49. Liu, Zhu H.G., Ren Y.C., et al., A Novel Intelligent Forecasting Framework for Quarterly or Monthly Energy Consumption, IEEE Transactions on Industrial Informatics, 2023, 1–12.
