Abstract
For the aspect-based sentiment analysis task, traditional works consider only the text modality. However, in social media scenarios, texts often contain abbreviations, clerical errors, or grammatical errors, which undermine traditional methods. In this study, a cross-model hierarchical interactive fusion network incorporating an end-to-end approach is proposed to address this challenge. In the network, a feature attention module and a feature fusion module are proposed to obtain the multimodal interaction features between the image modality and the text modality. Through the attention mechanism and the gated fusion mechanism, these two modules allow the image to assist the text in the aspect-based sentiment analysis task. Meanwhile, a boundary auxiliary module is used to explore the dependencies between the two core subtasks of aspect-based sentiment analysis. Experimental results on two publicly available multimodal aspect-based sentiment datasets validate the effectiveness of the proposed approach.
Keywords
Introduction
Aspect-based sentiment analysis has long been a vital research direction in natural language processing (NLP) [1, 2, 3]; it studies how users evaluate products. With the popularity of social media, an increasing number of people share comments and remarks on the Internet. Because the volume of comment data is too large to record each comment manually, we need a model based on aspect-based sentiment analysis that automatically analyzes users’ comments and their attitudes towards each aspect of a product. However, the many abbreviations caused by fragmented reading and the grammatical errors caused by the informality of social media often lower the performance of traditional methods. Because many tweets contain not only text but also images, audio, and video, these other modalities are considered to provide auxiliary information to the text corpus [4]. In this context, multimodal language modeling for sentiment analysis has become a central research direction of natural language processing. While various types of multimodal information are available, this paper focuses on text and images, because the publicly available datasets commonly used for sentiment analysis typically include only text and images.
The traditional aspect-based sentiment analysis comprises two subtasks, namely, aspect term extraction (ATE) and aspect sentiment classification (ASC). ATE is a sequence annotation task that is primarily used to extract attributes (or aspects) of the expressed opinions. ASC is a sequence classification task that is mainly used to detect the emotional polarity expressed by these extracted aspects in the opinion text [5]. However, most Aspect-Based Sentiment Analysis (ABSA) methods do not consider these two subtasks as a unified whole, which not only overlooks the joint information between the two subtasks but also consumes a significant amount of training resources. Furthermore, aspect terms are often assumed to be already provided in the solving process of the second subtask, which is not aligned with reality. As a result, existing methods have some shortcomings [6].
To address the questions raised above, we propose a new cross-model hierarchical interactive fusion network (CHIFN) based on an end-to-end collapsed-label method to handle the Multimodal Aspect-based Sentiment Analysis (MABSA) task. The use of multimodal data can enhance the recognition performance of the model, and the end-to-end framework enables the model to handle the ATE and ASC tasks simultaneously. CHIFN has five primary modules: the Feature Extraction Module (FEM), Feature Attention Module (FAM), Feature Fusion Module (FFM), Sentiment Prediction Module (SPM), and Boundary Auxiliary Module (BAM). (1) FEM is used to extract initial features from texts and images. Although context is critical when extracting text features, many works on sentiment analysis focus only on the content of the data, overlook the context and, as a result, fail to capture the correct emotional polarity. While the pre-trained BERT can be used to extract contextual features [1, 7, 8] from text, the integration of attention mechanisms makes it difficult to capture contextual semantic relationships within single sentences [1]. To tackle this difficulty, we use a Bi-directional Long Short-Term Memory (BiLSTM) network to acquire context details [2, 5, 9], but a powerful encoder and an advanced network structure alone cannot capture enough contextual semantic information in the text. To fully extract syntactic relationships within the text, we introduce a Graph Convolution Network (GCN) as a supplementary method [10]. For extracting image features, CNNs and their variants are often the primary choices. Therefore, in this study, the pre-trained Visual Geometry Group (VGG) network is employed to extract features for each region in every image. Additionally, a BiLSTM is used to extract the internal connections between different regions as image features.
(2) FAM obtains the interactive effects between the cross-modal data through the attention mechanism [1, 12, 13]. A Transformer-based multi-head interaction attention mechanism is used to obtain text and image attention features. (3) FFM constructs the gated attention fusion mechanism and obtains multimodal interaction influence features from the image and text attention representations. (4) SPM feeds the multimodal interaction influence features to a fully connected layer to mark the aspect terms and acquire the initial aspect sentiment for ASC. In addition, SPM synthesizes aspect and initial sentiment information through the attention mechanism to make the final prediction of aspect sentiment. (5) BAM extracts aspect boundary information to assist aspect sentiment classification. Our proposed BAM acquires boundary information through the boundary-guided matrix and automatically determines its proportion in the final labeling result according to the confidence of the boundary label. By using text features, image features, and auxiliary modules, we expect CHIFN to achieve better performance in the classification task [14]. The proposed model can extract the aspect terms of the text from multimodal data and classify the sentiment of these aspect terms. We conducted extensive experiments with CHIFN on two multimodal datasets (Twitter 2015 and Twitter 2017). Our model and the baseline models were trained under the same experimental conditions, and their performance was evaluated and compared. Furthermore, a series of ablation experiments indicated that each component of our model contributed to the improvement of the sentiment classification results. The primary contributions of our work are as follows:
(1) Aspect-based sentiment analysis on Twitter multimodal datasets is a novel research direction. CHIFN provides new ideas for ABSA in social scenarios by using image-assisted information to help the text in social media acquire more effective contextual semantic information.
(2) We adopt an end-to-end framework that identifies multiple aspect terms in the text under multimodal data and classifies their sentiment at the same time. BAM is proposed to link the two subtasks of ABSA by extracting boundary-auxiliary information and transferring aspect term information to the sentiment classification task.
(3) To align information between different modalities, preserve the dominance of text, and reduce noise input, we propose a multi-interaction attention mechanism and a gated attention fusion mechanism. The multi-interaction attention mechanism fully captures the bidirectional interaction between images and text. The gated attention fusion mechanism effectively integrates text features with image-aware text representation features, and focuses on the important parts of the text features while reducing the input of non-essential information.
Related work
Our work is closely connected to two areas of research: End-to-End Aspect-Based Sentiment Analysis (E2E-ABSA) and Multimodal Sentiment Analysis (MSA). Subsections 2.1 and 2.2 review existing end-to-end architectures and studies of multimodal data, respectively.
E2E-ABSA
Traditional sentiment analysis tasks often rely on textual content. To address the two subtasks of aspect-based sentiment analysis simultaneously, most work adopts an end-to-end structure. There are four main approaches to end-to-end sentiment analysis: pipeline, multi-task, joint-label, and collapsed methods.
The pipeline method [15, 16] addresses ABSA by treating it as two disconnected tasks and resolving them sequentially [17]. However, this approach is susceptible to error propagation [18]. The multi-task approach [19, 20] uses an encoder for the input and separate decoding procedures to label aspect terms and predict their sentiment, but the two decoding results may not match [10]. The joint-label approach [10] bridges the two subtasks by converting ASC into a sequence annotation task and jointly training the two subtasks with a group of aspect term boundary labels (for example, B, I, E, S, and O) and sentiment labels (for example, POS, NEG, NEU) [19]. The collapsed method [10, 17, 21] merges the ATE and ASC tasks into a single sequence labeling task, which removes the distinction between the two subtasks. It applies a group of well-designed labels, namely B-{POS, NEG, NEU}, I-{POS, NEG, NEU}, E-{POS, NEG, NEU}, S-{POS, NEG, NEU}, and O. B, I, E, and S mark the beginning of, the middle of, the end of, and a single-word aspect term, respectively, and O marks a word outside any aspect term. Besides the “O” tag, each collapsed tag carries two pieces of information: the position within an aspect term and its sentiment. For instance, “E-NEG” represents the end of a negative aspect term, and “S-NEU” indicates a neutral aspect term consisting of a single word. Table 1 shows example labels of the joint-label and collapsed methods. Existing studies [15, 19] have indicated that the collapsed and joint-label methods outperform the pipeline and multi-task methods. However, using only the collapsed or joint-label method cannot effectively transfer the aspect term boundary information to the ASC task. Considering that the collapsed method is truly end-to-end, i.e., it treats the two subtasks as one task, we combine it with the newly constructed BAM to further advance the aspect-based sentiment analysis task.
BAM automatically absorbs boundary information through the boundary-guided matrix to provide information for the ASC task. Using BAM together with the collapsed method treats the ATE and ASC tasks as a whole and lets ASC make full use of the aspect boundary information from ATE.
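The collapsed labeling scheme described above can be illustrated with a short sketch. The function name `collapsed_tags`, the span format (inclusive token indices plus a polarity string), and the example sentence are assumptions made for illustration and are not taken from the original implementation.

```python
from typing import List, Tuple

def collapsed_tags(n_tokens: int, aspects: List[Tuple[int, int, str]]) -> List[str]:
    """Build collapsed boundary-sentiment tags for one sentence.

    `aspects` holds hypothetical (start, end, polarity) triples with inclusive
    token indices and polarity in {"POS", "NEG", "NEU"}.
    """
    tags = ["O"] * n_tokens
    for start, end, polarity in aspects:
        if start == end:
            # Single-word aspect term
            tags[start] = f"S-{polarity}"
        else:
            tags[start] = f"B-{polarity}"
            for i in range(start + 1, end):
                tags[i] = f"I-{polarity}"
            tags[end] = f"E-{polarity}"
    return tags

# Toy sentence with one two-word positive aspect and one single-word negative aspect.
tokens = ["The", "battery", "life", "is", "great", "but", "screen", "cracked"]
tags = collapsed_tags(len(tokens), [(1, 2, "POS"), (6, 6, "NEG")])
for token, tag in zip(tokens, tags):
    print(f"{token}\t{tag}")
```

Because every tag other than O carries both the boundary position and the polarity, a single sequence labeler trained on these tags solves ATE and ASC together.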
The designed labels are used in two approaches
At present, there are few studies on multimodal aspect-based sentiment analysis, and most of them focus on the multimodal word extraction task. The experimental results in [44] indicate that, in the social scenario, the model based on text
With the development of science and technology, a large number of deep-learning-based feature extraction methods have been proposed in the field of multimodal sentiment classification [24, 25, 26]. For images, motivated by the excellent performance of CNNs in image classification and object detection, CNNs and their variants have become the most popular feature extractors [27, 28, 29]. For text, a pre-trained BERT [21, 30, 31] acts as a feature extractor, a BiLSTM [11] extracts contextual semantic information, and GCNs (Graph Convolution Networks) [10] serve as a supplement to fully extract the syntactic features in the text. Subsequently, the focus of research in this area shifted to how to fuse multiple modalities before performing the classification task.
Fusion techniques in multimodal sentiment analysis aim to provide additional information to enhance the network’s performance. Multimodal feature fusion methods mainly include early or feature-level fusion, late or decision-level fusion, hybrid fusion, model-level fusion, and rule-based fusion [2]. Early or feature-level fusion [32] builds a feature representation by directly concatenating features extracted from different modalities. The main advantage of this method is that correlations between multimodal data are identified early, which helps provide accurate results. Compared with sentiment analysis on text features or image features alone, the fused features obtained by processing image and text features with feature fusion have significant advantages in terms of accuracy and processing time [7]. Late or decision-level fusion classifies the features of each modality independently [33]. Its advantage is that each modality can be learned with its best-fit classifier, but it is time-consuming because it needs several different classifiers. In sentiment classification, late or decision-level fusion methods achieve better results than traditional early or feature-level fusion methods [34]. Hybrid fusion [14, 35] combines the advantages of decision-level and feature-level fusion, so it can produce satisfactory results. Model-level fusion [36] extracts relationships between data obtained under different modalities. This technique combines features from different modules to improve the performance of the model [37]. Rule-based fusion [38] uses techniques such as weighted fusion and majority voting to fuse multimodal features. Weighted fusion combines features from multiple modalities through operations such as sum or product. However, these methods perform well only when the weights are properly initialized.
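The difference between feature-level and decision-level fusion can be made concrete with a small NumPy sketch. All shapes, the random toy features, the linear classifiers, and the equal-weight averaging rule are assumptions chosen for illustration; they do not correspond to any specific method cited above.

```python
import numpy as np

rng = np.random.default_rng(0)
text_feat = rng.normal(size=(4, 6))    # 4 samples, 6-dim text features (toy values)
image_feat = rng.normal(size=(4, 3))   # 4 samples, 3-dim image features (toy values)
n_classes = 3

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

# Early (feature-level) fusion: concatenate the features, then apply one classifier.
W_early = rng.normal(size=(6 + 3, n_classes))
early_probs = softmax(np.concatenate([text_feat, image_feat], axis=1) @ W_early)

# Late (decision-level) fusion: one classifier per modality, then combine the decisions.
# The equal-weight average used here is only one possible combination rule.
W_text = rng.normal(size=(6, n_classes))
W_image = rng.normal(size=(3, n_classes))
late_probs = 0.5 * softmax(text_feat @ W_text) + 0.5 * softmax(image_feat @ W_image)

print(early_probs.shape, late_probs.shape)
```

In the early variant the classifier can exploit cross-modal correlations directly, whereas the late variant keeps the modalities separate until the decisions are merged.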
Using an attention mechanism to fuse multimodal features has been a common approach to model-level fusion in recent years [4, 12, 39]. The attention mechanism obtains the weight of each word or each pixel through query vectors, which enables the model to focus on features rich in emotional information and to reduce noise. How to effectively fuse the semantic information of different modalities while reducing noise input is a research focus. In this paper, we construct a multi-interaction attention mechanism to fuse the dual interaction features between images and text. In addition, we use a gated attention mechanism to fuse interactive features and text features in the fusion stage to avoid excessive noise. In particular, the constructed fusion mechanism also ensures that text is the primary source of affective information, which is consistent with the fact that the quality of sentiment information in the text modality is superior to that of the image modality.
Model
Although multimodal data typically includes images, text, audio, video, etc., prevalent datasets often only contain two modalities: text and images. We transform the MABSA task into a sequence marking task by using a collapsed label strategy. Based on this strategy, we construct the sentiment label
The model CHIFN’s framework.
In the feature extraction module, we use the pre-trained BERT model, BiLSTM, and GCNs to fully extract high-level semantic features of the text, and the pre-trained VGGNet to extract pixel-level image features.
Text feature extraction
Given a text
Then, the StanfordNLP tool is used to extract the adjacency matrix A composed of the syntactic structure of text T. Finally, the final text feature
where I is the identity matrix,
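As an illustration of how a syntactic adjacency matrix A and the identity matrix I enter a graph convolution, the following sketch applies one GCN layer to toy token features. It assumes the widely used symmetrically normalized propagation rule ReLU(D^{-1/2}(A + I)D^{-1/2} H W); whether CHIFN uses exactly this normalization is an assumption, and the toy graph and dimensions are hypothetical.

```python
import numpy as np

def gcn_layer(H, A, W):
    """One graph-convolution step: ReLU(D^-1/2 (A + I) D^-1/2 H W)."""
    A_hat = A + np.eye(A.shape[0])       # add self-loops through the identity matrix I
    deg = A_hat.sum(axis=1)              # node degrees (at least 1 because of self-loops)
    D_inv_sqrt = np.diag(deg ** -0.5)
    return np.maximum(D_inv_sqrt @ A_hat @ D_inv_sqrt @ H @ W, 0.0)

rng = np.random.default_rng(0)
# Toy dependency graph over 4 tokens: undirected edges 0-1, 1-2, 1-3.
A = np.zeros((4, 4))
for i, j in [(0, 1), (1, 2), (1, 3)]:
    A[i, j] = A[j, i] = 1.0
H = rng.normal(size=(4, 3))              # token features, e.g. BiLSTM outputs (toy values)
W = rng.normal(size=(3, 2))              # learnable weight matrix (random here)
out = gcn_layer(H, A, W)
print(out.shape)
```

Each output row mixes a token's own features with those of its syntactic neighbours, which is how the graph convolution supplements the sequential context captured by the BiLSTM.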
We first utilize a pre-trained VGG to get a series of image feature vectors. To obtain the relationship between the feature vectors and the parts of the two-dimensional picture, we extract the image feature maps from a lower convolution layer, which is different from utilizing the results of the top fully connected layer as the initial photo features [40]. This visual feature extractor produces S feature maps. Every mapping feature is a
Given the image I, the pre-trained VGGNet [27] removing the top FC (Fully Connected) layer is first used to extract S two-dimensional visual feature vectors
Considering that image information and text information reinforce and complement each other, we propose a multi-interaction attention mechanism to fully capture the bidirectional interaction between images and text. Because text features contain high-level semantic features and carry more discriminative semantic and emotional information, the text modality needs to play a dominant role in multimodal aspect-based sentiment analysis. In the feature attention module, we first use text features and image features to get text attention features. Then, image attention features are obtained by combining text attention features and text features [11].
Textual attention
When people look at photos, they tend to focus on the parts that matter to them rather than on the entire image. In other words, different pixels in a picture contribute differently to sentiment analysis. Therefore, we construct a textual attention layer to assign different weights to different pixels. First, we employ a multi-head interaction attention mechanism, which obtains text-guided image features by using text feature
where
Since different words contribute differently to sentiment analysis, we construct an image attention layer to extract important sentiment words. The multi-head interactive attention mechanism can obtain the image-guided text representation
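The two attention steps described above can be sketched with a generic multi-head scaled dot-product attention in NumPy. The assignment of queries and keys (text attending over image regions first, then the result attending over the text), the shared projection matrices, the omission of an output projection, and all dimensions are assumptions made for illustration rather than the exact formulation of CHIFN.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def split_heads(x, n_heads):
    """Reshape (n, d) into (n_heads, n, d // n_heads)."""
    n, d = x.shape
    return x.reshape(n, n_heads, d // n_heads).transpose(1, 0, 2)

def multi_head_cross_attention(query, context, Wq, Wk, Wv, n_heads):
    """Let each row of `query` attend over the rows of `context`.

    query: (n_q, d), context: (n_c, d), Wq/Wk/Wv: (d, d), d divisible by n_heads.
    Returns an (n_q, d) matrix with one context-aware vector per query row.
    """
    n_q, d = query.shape
    Q = split_heads(query @ Wq, n_heads)
    K = split_heads(context @ Wk, n_heads)
    V = split_heads(context @ Wv, n_heads)
    scores = Q @ K.transpose(0, 2, 1) / np.sqrt(d // n_heads)   # (n_heads, n_q, n_c)
    heads = softmax(scores) @ V                                  # (n_heads, n_q, d_head)
    return heads.transpose(1, 0, 2).reshape(n_q, d)              # concatenate the heads

rng = np.random.default_rng(0)
text = rng.normal(size=(5, 8))     # 5 words, 8-dim features (toy values)
image = rng.normal(size=(7, 8))    # 7 image regions, 8-dim features (toy values)
Wq, Wk, Wv = (rng.normal(size=(8, 8)) for _ in range(3))

# Step 1: words weight the image regions, giving text-guided image features.
text_guided_image = multi_head_cross_attention(text, image, Wq, Wk, Wv, n_heads=2)
# Step 2: those features weight the words, giving an image-guided text representation.
image_guided_text = multi_head_cross_attention(text_guided_image, text, Wq, Wk, Wv, n_heads=2)
print(text_guided_image.shape, image_guided_text.shape)
```

Both outputs keep one row per word, so they can later be fused position by position with the original text features.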
To avoid excessive noise, the gated attention mechanism is used to fuse the text feature
where
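A gated fusion of the text feature and the cross-modal feature can be sketched as follows. The particular form used here, a sigmoid gate computed from the concatenated features and a residual connection that keeps the text feature intact, is an assumption chosen to reflect the stated goals (text remains dominant, visual noise is suppressed); the parameter names `W_g` and `b_g` are hypothetical.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gated_fusion(text_feat, cross_feat, W_g, b_g):
    """Fuse text features with cross-modal features through an element-wise gate.

    text_feat, cross_feat: (n_words, d); W_g: (2 * d, d); b_g: (d,).
    Returns the fused features and the gate values, both of shape (n_words, d).
    """
    gate = sigmoid(np.concatenate([text_feat, cross_feat], axis=-1) @ W_g + b_g)
    # The text feature passes through unchanged; the gate decides how much
    # cross-modal information is added on top of it.
    return text_feat + gate * cross_feat, gate

rng = np.random.default_rng(0)
text_feat = rng.normal(size=(5, 8))     # toy text features
cross_feat = rng.normal(size=(5, 8))    # toy image-guided text representation
W_g = rng.normal(size=(16, 8)) * 0.1
b_g = np.zeros(8)
fused, gate = gated_fusion(text_feat, cross_feat, W_g, b_g)
print(fused.shape, float(gate.min()), float(gate.max()))
```

When the gate is close to zero the fused output reduces to the text feature, which is one way to keep the text modality as the primary source of sentiment information.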
According to Li et al. [18], BAM is designed to facilitate the ASC task by automatically assimilating boundary information through a boundary-guided matrix. For instance, if the boundary label of a word is S, indicating that the word is a single-word aspect term, the collapsed label for that word can only be S-POS, S-NEG, or S-NEU. Therefore, we establish the auxiliary module BAM for aspect term boundary prediction, where the effective label set
First, we use the FC layer with a nonlinear activation function to predict boundary labels
where
Then, the transition matrix
where
After obtaining
We obtain the initial aspect sentiment label
where
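The role of the boundary-guided matrix and of the boundary confidence can be illustrated with the sketch below. It follows the general idea of boundary guidance: boundary probabilities are mapped to the compatible collapsed labels and mixed with the initial sentiment prediction in proportion to how confident the boundary prediction is. The fixed uniform matrix (a learned transition matrix would be used in practice), the confidence measure, and the scaling constant `eps` are all assumptions made for illustration.

```python
import numpy as np

BOUNDARY = ["B", "I", "E", "S", "O"]
SENTIMENT = ["POS", "NEG", "NEU"]
COLLAPSED = [f"{b}-{s}" for b in BOUNDARY[:-1] for s in SENTIMENT] + ["O"]

def boundary_guided_matrix():
    """Row b is a uniform distribution over the collapsed labels compatible with boundary label b."""
    M = np.zeros((len(BOUNDARY), len(COLLAPSED)))
    for i, b in enumerate(BOUNDARY):
        for j, c in enumerate(COLLAPSED):
            if c.split("-")[0] == b:
                M[i, j] = 1.0
    return M / M.sum(axis=1, keepdims=True)

def refine(boundary_probs, collapsed_probs, M, eps=0.5):
    """Mix the boundary-guided distribution with the initial collapsed-label distribution."""
    guided = boundary_probs @ M                           # boundary evidence in collapsed-label space
    confidence = float(boundary_probs @ boundary_probs)   # larger when the boundary prediction is peaked
    alpha = eps * confidence                              # share given to the boundary-guided part
    return alpha * guided + (1.0 - alpha) * collapsed_probs

M = boundary_guided_matrix()
boundary_probs = np.array([0.05, 0.05, 0.05, 0.80, 0.05])   # boundary head is fairly sure of "S"
collapsed_probs = np.zeros(len(COLLAPSED))
collapsed_probs[COLLAPSED.index("B-NEG")] = 0.5             # initial prediction slightly prefers B-NEG
collapsed_probs[COLLAPSED.index("S-NEG")] = 0.4
collapsed_probs[COLLAPSED.index("O")] = 0.1
final = refine(boundary_probs, collapsed_probs, M)
print(COLLAPSED[int(np.argmax(final))])
```

In this toy case the confident boundary prediction shifts the final decision from B-NEG to S-NEG, while the polarity still comes from the initial sentiment prediction.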
In this paper, the CrossEntropy Loss function is used to obtain the loss of aspect terms extraction
In addition, L2 regularization [49] is used for joint optimization of loss functions. This optimization method can effectively improve the feature representation and performance of CHIFN.
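A joint objective of the kind described here can be sketched as the sum of two cross-entropy terms plus an L2 penalty. The equal weighting of the two task losses, the regularization coefficient `l2`, and the function names are assumptions for illustration and are not taken from the original implementation.

```python
import numpy as np

def cross_entropy(probs, gold):
    """Mean negative log-likelihood of the gold label indices; probs is (n_tokens, n_labels)."""
    picked = probs[np.arange(len(gold)), gold]
    return float(-np.mean(np.log(picked + 1e-12)))

def joint_loss(boundary_probs, boundary_gold, collapsed_probs, collapsed_gold, params, l2=1e-5):
    """Boundary loss + collapsed-label loss + L2 penalty over all parameter arrays."""
    penalty = sum(float(np.sum(p ** 2)) for p in params)
    return (cross_entropy(boundary_probs, boundary_gold)
            + cross_entropy(collapsed_probs, collapsed_gold)
            + l2 * penalty)

# Toy example: 2 tokens, 5 boundary labels, 13 collapsed labels, uniform predictions.
boundary_probs = np.full((2, 5), 1.0 / 5)
collapsed_probs = np.full((2, 13), 1.0 / 13)
params = [np.ones((3, 3))]
loss = joint_loss(boundary_probs, [3, 4], collapsed_probs, [10, 12], params)
print(round(loss, 4))
```

Training both heads under one objective lets the boundary supervision shape the shared representation that the sentiment prediction relies on.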
In this paper, comparison and ablation experiments were performed to demonstrate the performance of the proposed CHIFN.
Model training
Two publicly available datasets (TWITTER-2015 and TWITTER-2017 [43, 45]) were used to validate the effectiveness of the proposed method, and their details are shown in Table 2. The two datasets are multimodal tweets posted on Twitter from 2014 to 2015 and 2016 to 2017, respectively. When filtering the corpus, tweets without images were removed and tweets with images were retained. If a text corresponds to multiple images, then a randomly selected image is used as the matching image for the text to ensure that the text and image are one-to-one. Finally, tweets without any aspect terms, with a text length less than 3, and those containing challenging text were excluded. Corpus annotation follows the BIO-2 standard [48]. In this article, the two Twitter datasets are used in combination.
The underlying information of two Twitter datasets
During pre-processing, we found that many samples in the original datasets were repeated. Although the repeated tweets carry different aspect terms, this layout is unsuitable for tagging multiple aspect terms in one sentence. Therefore, we merged the aspect term labels of identical samples in the original datasets to obtain new datasets. We kept the original split ratio of the corpus, as detailed in Table 3. Since the original test dataset did not annotate the sentiment of the words, we divided the processed training dataset into a training dataset and a validation dataset at a ratio of 9:1 and verified the effect of our model on the validation dataset. For the unlabeled test dataset, we only performed label prediction.
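The merging of repeated samples and the 9:1 split can be sketched as follows. The record layout (a dictionary with `text`, `image`, and `aspects` keys), the use of the text and image pair as the duplicate key, and the fixed random seed are assumptions made for illustration.

```python
import random

def merge_duplicates(samples):
    """Merge samples that share the same text and image into one sample with all aspect labels."""
    merged = {}
    for s in samples:
        key = (s["text"], s["image"])
        merged.setdefault(key, set()).update(s["aspects"])
    return [{"text": t, "image": i, "aspects": sorted(a)} for (t, i), a in merged.items()]

def split_train_val(samples, val_ratio=0.1, seed=0):
    """Shuffle with a fixed seed and hold out `val_ratio` of the samples for validation."""
    shuffled = samples[:]
    random.Random(seed).shuffle(shuffled)
    n_val = max(1, round(len(shuffled) * val_ratio))
    return shuffled[n_val:], shuffled[:n_val]

# Two records of the same tweet, each annotated with a different aspect term
# given as (start, end, polarity).
raw = [
    {"text": "battery life great but screen cracked", "image": "1.jpg", "aspects": [(0, 1, "POS")]},
    {"text": "battery life great but screen cracked", "image": "1.jpg", "aspects": [(4, 4, "NEG")]},
]
merged = merge_duplicates(raw)
train, val = split_train_val([{"id": k} for k in range(20)])
print(len(merged), merged[0]["aspects"], len(train), len(val))
```

After merging, each sentence appears once with all of its aspect terms, which matches the multi-aspect tagging setup of the collapsed-label scheme.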
Details of the datasets used
The summary of parameter values
For text data, we used the pre-trained “Bert-base-uncased” model [8] with 12 transformer blocks as the text feature extractor. For the image data, first, they are resized to 224
Evaluation metrics
CHIFN transforms the two subtasks of aspect-based sentiment analysis into one sequence annotation task, which means that it needs to simultaneously label aspect terms and their emotional polarity. For the ABSA task, a correct prediction means that the aspect terms are completely extracted and the emotional polarity of the aspect terms is correctly labeled. We adopted the Recall, Precision, and F1-measure to evaluate the performance of the term sentiment classification model. F1 is the harmonic mean of Recall and Precision, which can evaluate model performance under the condition of data imbalance.
Precision = TP / (TP + FP), Recall = TP / (TP + FN), F1 = 2 × Precision × Recall / (Precision + Recall),
where TP represents the number of actual aspect terms correctly predicted as aspect terms by the model, TN represents the number of actual non-aspect terms correctly predicted as non-aspect terms, FP represents the number of actual non-aspect terms incorrectly predicted as aspect terms, and FN represents the number of actual aspect terms incorrectly predicted as non-aspect terms.
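The evaluation criterion, under which a prediction counts as correct only if both the aspect boundaries and the polarity match, can be sketched as a span-level computation over collapsed tags. The helper names and the handling of malformed tag sequences (an inconsistent tag simply closes the open span) are assumptions for illustration.

```python
def extract_spans(tags):
    """Return the set of (start, end, polarity) spans encoded by collapsed tags."""
    spans = set()
    start, polarity = None, None
    for i, tag in enumerate(tags):
        boundary, _, pol = tag.partition("-")
        if boundary == "S":
            spans.add((i, i, pol))
            start = None
        elif boundary == "B":
            start, polarity = i, pol
        elif boundary == "I" and start is not None and pol == polarity:
            continue                       # still inside the open span
        elif boundary == "E" and start is not None and pol == polarity:
            spans.add((start, i, pol))
            start = None
        else:                              # "O" or an inconsistent tag closes any open span
            start = None
    return spans

def precision_recall_f1(gold_tags, pred_tags):
    gold, pred = extract_spans(gold_tags), extract_spans(pred_tags)
    tp = len(gold & pred)                  # exact boundary and polarity match
    fp = len(pred - gold)
    fn = len(gold - pred)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

gold = ["O", "B-POS", "E-POS", "O", "S-NEG"]
pred = ["O", "B-POS", "E-POS", "O", "S-POS"]   # second aspect found, but with the wrong polarity
print(precision_recall_f1(gold, pred))
```

Here the second aspect term is located correctly but labeled with the wrong polarity, so it counts as both a false positive and a false negative.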
Several baseline algorithms were used as competing algorithms to validate the effectiveness of the proposed method. In addition, we developed some ablation models to study the effectiveness of different modules in the proposed network. Details of these methods are shown below.
Base model
CHIFN_BAM, CHIFN_GCNs, and CHIFN_FFM represent CHIFN without BAM, CHIFN without GCNs, and CHIFN without FFM, respectively.
Results and analysis
Table 5 shows the performance of CHIFN and its competing algorithms on the two datasets, and it is visualized as shown in Fig. 2. Based on the information presented in Fig. 2 and Table 5, we can see that the Precision and F1-measure of the proposed method were better than the Precision and F1-measure of the current state-of-the-art model (COPREM). Although the Recall of CHIFN was worse than that of COPREM, the difference was not significant. In particular, the F1-measure of CHIFN outperformed those of COPREM in the two tasks. Therefore, CHIFN demonstrated the best performance in these two tasks.
In addition, we found the following phenomena. In the text modal, the ATE F1-measure and ASC F1-measure of BERT
The contrast results of the baseline models and the CHIFN
The results of ablation experiments
The comparison of experimental performance.
For the text unimodal model, the performance of the text
To understand the contribution of each module in the proposed network, we conducted ablation experiments; the results are shown in Table 6. The results show that BAM, GCNs, and FFM all improve the performance of the model. There are three reasons for these results. (1) The BAM module effectively transfers boundary information, which helps the system classify more accurately; (2) the GCNs module enhances and refines the extracted semantic features; (3) feature fusion effectively reduces the introduction of noise.
Conclusion and future work
To realize end-to-end multimodal aspect-based sentiment analysis, this paper proposed CHIFN, which solves the ATE and ASC tasks simultaneously and contains five main modules: the feature extraction module, feature attention module, feature fusion module, sentiment prediction module, and boundary auxiliary module. The task is accomplished by these five modules working in conjunction with each other. Specifically, the feature extraction module was used to extract the high-level semantic features within the text and images, and the feature attention module with the interaction attention mechanism was used to obtain interaction features for both modalities. The feature fusion module and the boundary auxiliary module played important roles in acquiring image features that are highly correlated with the text and in providing boundary information, respectively. The sentiment prediction module with the classification layer was used to obtain the aspect terms as well as their sentiment categories in the text.
Although the proposed method has good performance in the social media corpus, it still has some shortcomings.
Since a large number of words in the text are non-aspect terms, the model may tend to label all words as a non-term category. In addition, since the emotional categories of the entities in the dataset are unbalanced, most of the emotional polarity of the evaluation object is neutral. This will result in the model not being able to learn the emotional information in the dataset well with a small number of samples. Although we would like to validate the performance of our model on other datasets, there are only two publicly available multimodal datasets (TWITTER-2015 and TWITTER-2017) that meet the requirements of the fine-grained analysis task. Therefore, we will work towards more accurate and larger corpus annotations in our next work to further validate the performance of our model. We used two pre-trained models with a large number of parameters to extract text and image features, which makes our model less practical. At the same time, the two pre-trained models are located in different feature representation spaces, which might lead to cross-modal semantic gaps. In our subsequent work, we will consider using the knowledge distillation method to compress the number of model parameters and an auxiliary model to map text and image features to the same spatial dimension to reduce the semantic gap between the different modalities. While we hope for a metric to assess the correlation between the meaning, aspects, and polarity of the text with the features extracted from the associated image, such a metric does not currently exist. In the future, we will undertake research in this area to further enhance and refine our understanding.
Footnotes
Funding and/or conflicts of interests/competing interests
We declare that we have no financial and personal relationships with other people and organizations, which may improperly affect our work, and that we have no professional or other personal interests of any nature or kind in any products, services and companies.
