Abstract
The difficulty of cross-domain text sentiment classification is that the data distributions in the source domain and the target domain are inconsistent. This paper proposes an attention network based on feature sequences (ANFS) for cross-domain sentiment classification, which focuses on important semantic features by using the attention mechanism. Particularly, ANFS uses a three-layer convolutional neural network (CNN) to perform deep feature extraction on the text, and then uses a bidirectional long short-term memory (BiLSTM) to capture the long-term dependency relationship among the text feature sequences. We first transfer the ANFS model trained on the source domain to the target domain and share the parameters of the convolutional layer; then we use a small amount of labeled target domain data to fine-tune the model of the BiLSTM layer and the attention layer. The experimental results on cross-domain sentiment analysis tasks demonstrate that ANFS can significantly outperform the state-of-the-art methods for cross-domain sentiment classification problems.
Introduction
In the Internet era, artificial intelligence and cloud computing have become part of people’s lives, and massive amounts of varied data are generated continuously. The application of big data is becoming key to upgrading various industries. E-commerce platforms let people obtain more product information from product reviews and also make it possible to mine customer preferences from user reviews. Machine learning algorithms can predict user preferences, but model training relies on labeled data, and data annotation is time-consuming and labor-intensive, which makes training machine learning models difficult.
As a branch of machine learning, transfer learning [1] has emerged as an effective approach to the problem of insufficient high-quality training data. It assumes that the same classification problem in different domains shares some common properties, so that the source domain’s training data can help train the target domain’s classifier. Typically, the source domain has a large labeled training dataset, while the target domain has very few labeled training instances. In sentiment classification, applying transfer learning methods has been challenging because there is a lack of high-quality labeled source-domain training data.
Two classic methods for applying transfer learning to sentiment classification are the Structural Correspondence Learning (SCL) algorithm proposed by Blitzer et al. [2] and the Spectral Feature Alignment (SFA) algorithm proposed by Pan et al. [3]. SCL is used primarily to find common features between the target and source domains; SFA aligns domain-specific words from different domains into unified clusters and builds a bridge between the source domain and the target domain. The limitation of these methods is that they rely on manual feature extraction and on expert-designed rules or models for specific domains. Although they can achieve good classification results, manual feature extraction cannot keep up with the rapid growth in data volume.
With the rapid development of deep learning, applying deep learning methods to transfer learning has become a hot research topic, and many results have been obtained. Yosinski et al. [4] studied the transferability of features in deep neural networks. They found that initializing a model with transferred features can improve generalization performance, and that fine-tuning on the new task can further improve the performance of deep neural networks. In a recent review, Liu et al. [5] summarized recent research on sentiment analysis, focusing on the algorithms and applications of transfer learning in sentiment analysis. They also described in detail the corpora and research methods used in cross-domain sentiment analysis in recent years. The review lists many recent methods, including CNN and LSTM, and indicates that, on the same data sets, parameter transfer is the main approach used for cross-domain sentiment classification. However, these models are not comprehensive enough in feature extraction.
Ganin et al. [6] proposed the DANN algorithm in 2016, which embeds domain adaptation into the process of learning the feature representation, so that the resulting feedforward neural network can be applied directly to the target domain. Qu et al. [7] proposed an adversarial category alignment network (ACAN), which takes decision boundaries into consideration and attempts to enhance the consistency of categories between the source and target domains. Wei et al. [8] proposed a two-layer convolutional neural network for cross-domain sentiment classification on the cross-domain product review transfer learning benchmark. This method works well by introducing a large-scale auxiliary cross-domain dataset. In a more recent work, Ji et al. [9] designed a bifurcated LSTM, which uses attention-based LSTM classifiers and orthogonal constraints to complete cross-domain sentiment classification tasks. However, these methods do not consider the dependencies between contexts.
In recent years, natural language processing has made great progress through the development of deeper language representation models. These models are pre-trained on large text corpora and fine-tuned for specific tasks, including cross-domain sentiment tasks. Howard and Ruder [10] proposed an effective transfer learning method, Universal Language Model Fine-tuning (ULMFiT), which can be applied to any task in NLP. This method outperformed the state of the art on six text classification tasks. When pre-trained models are applied to different datasets, one only needs to fine-tune the parameters of the language model to account for the differences. In general, however, using pre-trained models for specific tasks requires high-performance equipment, which is difficult for most researchers to obtain.
CNN is a feedforward neural network, first proposed by LeCun [11]. It shares parameters and can capture local features of text, but lacks the ability to learn sequence correlation. In 1997, Hochreiter et al. [12] proposed LSTM, which can not only process input of unlimited length but also use the information of the entire input sequence. BiLSTM is a combination of a forward LSTM and a backward LSTM; it can effectively capture the forward and backward dependencies between features, but it cannot weigh the importance of features. The attention mechanism was first proposed by Bahdanau et al. [13] for machine translation. It learns a distribution of weights over the inputs, so a model with attention can focus on important semantic features. In summary, combining the three models mentioned above makes it possible to take local information, context information and important semantic information fully into account.
To address the problems of negative transfer and incomplete feature extraction in existing methods, this paper proposes an attention network based on feature sequences (ANFS). First, we use word2vec to map the text into low-dimensional dense word vectors. Second, we use the ANFS model to extract local semantic features, contextual semantic relationships and important text features. Third, we use the softmax classifier to classify the sentiment of the text. After training and saving the best model in this way, we transfer it to the target domain and fine-tune it with a small amount of target domain data to solve the cross-domain sentiment classification problem. The main contributions of this paper are as follows:
Extract the text features more comprehensively by using CNN to extract local semantic features of sentences and BiLSTM to capture the contextual relationship between semantic features.
Add an attention mechanism to focus on important semantic features.
Share the parameters of the convolutional layer and use a small amount of target domain data to fine-tune the BiLSTM layer and the attention layer, which effectively mitigates the negative transfer problem.
Deep transfer learning
With the rapid development of deep learning, deep learning methods have been applied to transfer learning, and many research results have been achieved. A review [14] classified deep transfer learning into four categories: instance-based, mapping-based, network-based, and adversarial-based deep transfer learning.
In order to supplement the training set in the target domain, the instance-based deep transfer learning method selects some instances from the source domain and assigns appropriate weight values to the selected instances. Dai et al. [15] proposed a TrAdaBoost method, which uses boosting-based algorithms to filter out samples with large differences in the distribution between the source and target domains, and uses the source domain samples and a few target domain samples to learn an accurate model. This allows knowledge to be efficiently transferred from the source domain data to the target domain data. Wang et al. [16] introduced an adaptive source-domain training instance selection method to address the problem of noisy source-domain training data. This method can effectively identify the most informative training examples. The iterative method selected informative samples from the source domain with the informativeness measures, merged with the target-domain training data, evaluated the performance of learned classifier for the target domain, and updated the informativeness measures for the next iteration.
The mapping-based deep transfer learning method maps the data of the source domain and the target domain to the new data space. This new data space is similar to both domains and is suitable for joint deep neural networks. Pan et al. [17] proposed a transfer component analysis (TCA) method for domain adaptation, which attempts to use the cross-domain transfer component in a Reproducing Kernel Hilbert Space to preserve data attributes. The data distribution in different domains is also closer. The new representation in the subspace is used to train the model in the source domain for the target domain.
The network-based deep transfer learning method uses the source domain data to train a network and transfers it to the target domain. Oquab et al. [18] reused the first few layers of a CNN trained on the ImageNet dataset to extract intermediate image representations for images from other datasets. A CNN trained to learn image representations can thus be transferred effectively to other visual recognition tasks where the amount of training data is limited. Long et al. [19] proposed a joint learning method that combines adaptive classifiers and transferable features learned from labeled data in the source domain and unlabeled data in the target domain; it guides the target classifier by inserting several layers into the deep network to learn the residual function.
The adversarial-based deep transfer learning methods introduce adversarial techniques, inspired by generative adversarial networks (GANs) [20], to find transferable representations that are suitable for both the source and target domains. Ganin et al. [21] proposed an adversarial training method that is applicable to most feedforward neural models by adding several standard layers and a simple new gradient reversal layer. Tzeng et al. [22] proposed a cross-domain and cross-task transfer method for sparsely labeled target domain data. A special joint loss function is used in this work to force the CNN to optimize the distance between domains. Hoffman et al. [23] proposed a new GAN loss and combined the discriminative model with a new domain adaptation method. Long et al. [24] proposed a stochastic multilinear adversarial network to achieve deep and discriminative adversarial adaptation. The principle is to condition the adversarial adaptation on multiple feature layers and a classifier layer through a stochastic multilinear scheme.
Cross-domain sentiment classification
The difficulty of cross-domain text sentiment classification is that the data distribution in the source domain is inconsistent with that in the target domain, and the sentiment of text data is strongly domain dependent. The cross-domain text sentiment classification problem is mainly studied through knowledge transfer between domains, machine learning, fine-tuning of pre-trained models, and deep learning. From the perspective of knowledge transfer, Jia et al. proposed two methods to reduce the difference between the source and target domains, namely word alignment based on association rules [25] and domain alignment based on multi-viewpoint domain-shared features [26]. The former established an indirect mapping between domain-specific words in different domains by learning strong association rules between domains, and shared domain-specific words within the same domain. The latter used a sentiment dictionary and an improved MI [3] technique to establish an unambiguous shared feature set between domains, and extracted unique feature word pairs between domains through syntactic analysis and association rules, thereby expanding the domain dictionary and aligning the information distribution spaces of the domains. Both methods then used machine learning classifiers to solve the cross-domain sentiment classification problem. By aligning strongly associated words and domain-specific words, these approaches reduce, to a certain extent, the difference between the source domain and the target domain.
From the perspective of fine-tuning the pre-trained model, Myagmar et al. [27] fine-tuned the cross-domain sentiment classification models of BERT and XLNet to efficiently solve the cross-domain sentiment classification tasks. From the machine learning perspective, Lin et al. [28] proposed a classification method based on the regression model, which uses a tree-structured domain representation and domain similarity to predict the number of nodes from multiple source nodes to the target node, with a loss of complexity in accuracy.
Tang et al. [29] analyzed the applications and challenges of deep learning methods in sentiment analysis. The authors pointed out that deep learning models can automatically extract high-level features of data, and that they outperform traditional machine learning methods in sentiment classification and sentiment dictionary construction. Yu et al. [30] used deep learning to model sentences, and their experimental results show that deep learning models are superior to traditional machine learning models. Li et al. [31] proposed an end-to-end Adversarial Memory Network (AMN). A gradient reversal layer lets the parameters of the shared layer take part in the gradient updates of both the domain classifier and the sentiment classifier. Cross-domain sentiment analysis is performed by maximizing the classification error of the domain classifier while minimizing the classification error of the sentiment classifier. Because deep learning methods learn higher-level features automatically from the data rather than relying on manually designed features, they are generally superior to traditional machine learning methods.
Deep transfer learning algorithms
Basic definitions
Domain: a domain Source domain: Given the source domain data, Target domain: Given the target domain data,
The overall flowchart of the model proposed in this paper to solve cross-domain sentiment classification is shown in Fig. 1. We use the source domain data to train the model, then migrate it to the target domain, and use a small amount of labeled data in the target domain to fine-tune the model to achieve the goal of predicting the sentiment in the target domain.
In Fig. 1, we can see that the network structures of the source domain and the target domain are exactly the same. First, we use the source domain data to learn the weights of the model. Then, we transfer the model to the target domain and share the parameters of the convolutional layer, which keeps the weight information common to the two domains and avoids learning this knowledge again. Finally, we use a small amount of data in the target domain to fine-tune the model. The information in the target domain updates the parameters of the BiLSTM and attention layers in the network. This operation makes the network more suitable for the target domain, thereby effectively improving the classification accuracy of the network.
Model transfer diagram.
First, we train the neural network by using the labeled data from the source domain. The input layer is a batch of samples, where the sample is expressed as
In Eq. (1),
The first layer of the network consists of a convolutional layer and a maximum pooling layer. The role of this layer is to extract the features of the sentence. Mainly through the convolutional kernel
where
In Eq. (4),
The second and third layers of the model have the same structure as the first layer. The output of the previous layer is the input of the next layer. The purpose is to extract more abstract features of the sentence.
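The convolution and max pooling step described above can be illustrated with a minimal pure-Python sketch. It is not the authors' implementation: the ReLU activation, the valid (no padding) convolution, the pooling window of 2 and the function names `conv1d`, `conv_layer` and `max_pool` are assumptions made for illustration only.

```python
from typing import List

Vector = List[float]


def conv1d(seq: List[Vector], kernel: List[Vector], bias: float = 0.0) -> List[float]:
    """Slide one kernel (kernel_size x dim) over the sequence.

    Valid convolution (no padding) followed by ReLU; both are assumptions.
    """
    k = len(kernel)
    out = []
    for start in range(len(seq) - k + 1):
        s = bias
        for offset in range(k):
            s += sum(w * x for w, x in zip(kernel[offset], seq[start + offset]))
        out.append(max(0.0, s))  # ReLU
    return out


def conv_layer(seq: List[Vector], kernels: List[List[Vector]],
               biases: List[float]) -> List[Vector]:
    """Apply several kernels; output[t][c] is channel c at position t."""
    channels = [conv1d(seq, k, b) for k, b in zip(kernels, biases)]
    return [list(col) for col in zip(*channels)]


def max_pool(seq: List[Vector], size: int = 2) -> List[Vector]:
    """Non-overlapping max pooling over time; a trailing partial window is dropped."""
    pooled = []
    for i in range(0, len(seq) - size + 1, size):
        window = seq[i:i + size]
        pooled.append([max(col) for col in zip(*window)])
    return pooled


# Toy input: five "word vectors" of dimension 2 and two kernels of size 3.
sentence = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.0, 0.0], [2.0, 1.0]]
kernels = [
    [[1.0, 0.0], [0.0, 1.0], [1.0, -1.0]],
    [[0.0, 1.0], [0.0, 0.0], [1.0, 0.0]],
]
features = max_pool(conv_layer(sentence, kernels, [0.0, 0.0]))
```

Because `conv_layer` returns a sequence of vectors again, its pooled output can be fed to a second and third layer of the same form, which mirrors the stacking of three layers described in the text.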
The fourth layer of the network is the BiLSTM layer, whose role is to capture long-term dependencies between features. We feed the deep features produced by the max pooling of the previous layer into the BiLSTM layer to obtain two hidden states with opposite time directions, and concatenate them as the output. In this way, BiLSTM obtains the context information of the input sequence. At time
where
The fifth layer of the network is the attention layer, which captures the importance of different features and gives a higher weight to important features. We will get the attention weight
where
Finally, the output
Let
where Lable is the label of the sample.
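As a rough illustration of the attention layer, the softmax classifier and the loss, the following sketch assumes a common additive-style scoring function (a tanh of a linear projection of each hidden state) and a cross-entropy loss. The exact formulas of the paper are not reproduced here; the scoring function and the names `attention_pool`, `classify` and `cross_entropy` are assumptions.

```python
import math
from typing import List, Tuple


def softmax(scores: List[float]) -> List[float]:
    """Numerically stable softmax."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention_pool(hidden: List[List[float]], w: List[float],
                   b: float = 0.0) -> Tuple[List[float], List[float]]:
    """Score each hidden state, normalize the scores, and return the
    attention weights together with the weighted sum of the hidden states."""
    scores = [math.tanh(sum(wi * hi for wi, hi in zip(w, h)) + b) for h in hidden]
    alphas = softmax(scores)
    dim = len(hidden[0])
    context = [sum(a * h[d] for a, h in zip(alphas, hidden)) for d in range(dim)]
    return alphas, context


def classify(context: List[float], weights: List[List[float]],
             biases: List[float]) -> List[float]:
    """Linear layer followed by softmax over the sentiment classes."""
    logits = [sum(w * c for w, c in zip(row, context)) + b
              for row, b in zip(weights, biases)]
    return softmax(logits)


def cross_entropy(probs: List[float], label: int) -> float:
    """Negative log-likelihood of the true class (assumed training loss)."""
    return -math.log(probs[label])
```

With a zero scoring vector every feature receives the same weight, so the attention layer reduces to average pooling; a trained scoring vector shifts weight toward the features that matter most for the sentiment decision.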
To reduce overfitting and training time, we use the dropout method proposed by Hinton et al. [32] in the model. Dropout temporarily discards some neural network units from the network with a certain probability, which reduces the interaction between neurons in the hidden layer. Figure 2 shows the model diagram used in this paper.
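The effect of dropout on a vector of activations can be sketched as follows. This is a minimal illustration that assumes the widely used "inverted" variant, in which the kept units are rescaled during training so that no rescaling is needed at test time; the text does not state which variant is used, and the function name `dropout` is hypothetical.

```python
import random
from typing import List, Optional


def dropout(values: List[float], rate: float = 0.2, training: bool = True,
            rng: Optional[random.Random] = None) -> List[float]:
    """Zero each unit with probability `rate` during training.

    Kept units are divided by (1 - rate) so the expected activation is
    unchanged (inverted dropout, an assumption). At test time the input
    is returned unchanged.
    """
    if not 0.0 <= rate < 1.0:
        raise ValueError("rate must be in [0, 1)")
    if not training or rate == 0.0:
        return list(values)
    rng = rng or random.Random()
    keep = 1.0 - rate
    return [v / keep if rng.random() < keep else 0.0 for v in values]
```

For example, `dropout(h, rate=0.2)` zeroes roughly one fifth of the entries of `h` during training, while `dropout(h, training=False)` leaves `h` untouched.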
Network structure diagram of the source and target domains.
We transfer the trained network to the target domain and fine-tune it by using a small number of labeled target domain data. The first layer uses word2vec to represent the data of the input layer in the format of
The weights of the three convolutional layers in the target network are the weights trained in the source domain, given by
where
Then, we use a small amount of labeled target domain data to fine-tune the weights of the BiLSTM layer and the attention layer. The initial value of the state of the BiLSTM layer uses the weights trained in the source domain. At time
where
We use a small amount of data in the target domain to fine-tune the attention weight
Finally, the output
Let
We use the labeled data of the source domain for training, and then save the trained network structure and the weights of each layer. We transfer the network to the target domain and share the convolutional layer weights. We then use a small amount of labeled data in the target domain to fine-tune the weights of the BiLSTM layer and the attention layer, and finally perform sentiment classification for the target domain. The specific algorithm flow of the network is shown in Algorithm 1.
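The transfer and fine-tuning flow described above can be sketched in a framework-free form. The sketch below is only an assumption-laden illustration: parameters are stored as flat lists in a dictionary, the layer names (`conv1`, `conv2`, `conv3`, `bilstm`, `attention`) are hypothetical, and plain gradient descent stands in for whatever optimizer the authors used.

```python
import copy
from typing import Dict, List, Sequence

Params = Dict[str, List[float]]

FROZEN = ("conv1", "conv2", "conv3")   # shared with the source model, never updated
TRAINABLE = ("bilstm", "attention")    # fine-tuned on labeled target data


def transfer(source_params: Params) -> Params:
    """Initialize the target network with a copy of the source weights."""
    return copy.deepcopy(source_params)


def fine_tune_step(params: Params, grads: Params, lr: float = 0.01,
                   trainable: Sequence[str] = TRAINABLE) -> Params:
    """One gradient-descent update that touches only the trainable layers."""
    for name in trainable:
        params[name] = [p - lr * g for p, g in zip(params[name], grads[name])]
    return params


# Toy example with one weight per layer.
source = {"conv1": [1.0], "conv2": [2.0], "conv3": [3.0],
          "bilstm": [4.0], "attention": [5.0]}
target = transfer(source)
grads = {name: [1.0] for name in source}
fine_tune_step(target, grads, lr=0.5)
```

After the step, the convolutional weights of `target` still equal those of `source`, while the BiLSTM and attention weights have moved; the source model itself is left unchanged because `transfer` makes a deep copy.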
Dataset
This paper evaluates our algorithm on one corpus, Amazon reviews, a multi-domain sentiment data set collected by Blitzer et al. [2]. From this data set we select Amazon product reviews in four different domains: Book, DVD, Electronic and Kitchen. Each of the four domains contains 1,000 positive reviews and 1,000 negative reviews, for a total of 8,000 reviews. Cross-domain sentiment classification is performed across these four domains; detailed statistics are shown in Table 1.
The information of datasets
In the experiment, we use the word2vec model to convert words into word vectors. The word vectors have 300 dimensions, and the vocabulary contains three million entries. The network consists of three convolutional layers, a pooling layer, a BiLSTM layer, and an attention layer. The specific parameters are given as follows: the convolution kernel size is 3, the numbers of convolution kernels are 32, 64, and 128, the dropout parameter is 0.2, the batch size is 64, the given parameter
Network parameters
We choose accuracy as the evaluation metric to facilitate comparison with other methods. The formula is as follows:
Accuracy = (TP + TN) / (TP + TN + FP + FN)
Here, TP is the number of positive samples that are predicted to be positive; TN is the number of negative samples that are predicted to be negative; FP is the number of negative samples that are predicted to be positive; and FN is the number of positive samples that are predicted to be negative.
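The accuracy computation from the four counts can be written as a short sketch. The helper names `confusion_counts` and `accuracy`, and the convention that the positive class is encoded as 1, are assumptions for illustration.

```python
from typing import Sequence, Tuple


def confusion_counts(y_true: Sequence[int], y_pred: Sequence[int],
                     positive: int = 1) -> Tuple[int, int, int, int]:
    """Count TP, TN, FP and FN for a binary sentiment task."""
    tp = tn = fp = fn = 0
    for t, p in zip(y_true, y_pred):
        if t == positive and p == positive:
            tp += 1
        elif t != positive and p != positive:
            tn += 1
        elif t != positive and p == positive:
            fp += 1
        else:
            fn += 1
    return tp, tn, fp, fn


def accuracy(tp: int, tn: int, fp: int, fn: int) -> float:
    """Fraction of samples whose predicted label matches the true label."""
    total = tp + tn + fp + fn
    if total == 0:
        raise ValueError("no samples")
    return (tp + tn) / total
```

For instance, `accuracy(*confusion_counts([1, 1, 0, 0], [1, 0, 0, 1]))` evaluates two correct predictions out of four.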
In order to verify the effectiveness of the proposed method, we compared the proposed cross-domain sentiment classification method with the following seven methods, namely, SCL [2], SCL-ML [2], SFA [3], ITIAD [33], AE-SCL-SR [34], DANN [6], and CNN_FT [35].
Implementation detail
Table 3 shows the accuracy of the proposed method and the seven methods above. B, D, E, and K denote the data sets in the Book, DVD, Electronic, and Kitchen domains, respectively. B-D indicates that the source domain is Book and the target domain is DVD, and so on.
The experimental results in Table 3 are obtained by using 50 labeled target domain samples to fine-tune the network. These 50 samples were extracted from the target domain and include 25 positive examples and 25 negative examples. As shown in the table, the attention network based on feature sequences is effective: compared with the seven methods above, our method achieves the highest accuracy, with an average accuracy of 81.52%. We use only 1/40 of the labeled data in the target domain to fine-tune the network and still achieve higher performance.
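Drawing a balanced fine-tuning set of k = 50 labeled target reviews could look like the following sketch. The text does not say how the 50 samples were chosen; uniform random sampling with a fixed seed, the `(text, label)` tuple format with labels 1 and 0, and the function name `sample_balanced` are assumptions.

```python
import random
from typing import List, Tuple

Example = Tuple[str, int]  # (review text, label), with 1 = positive and 0 = negative


def sample_balanced(examples: List[Example], k: int = 50, seed: int = 0) -> List[Example]:
    """Draw k examples, half positive and half negative, without replacement."""
    if k % 2 != 0:
        raise ValueError("k must be even for a balanced sample")
    rng = random.Random(seed)
    positives = [e for e in examples if e[1] == 1]
    negatives = [e for e in examples if e[1] == 0]
    half = k // 2
    chosen = rng.sample(positives, half) + rng.sample(negatives, half)
    rng.shuffle(chosen)
    return chosen


# Placeholder target domain with 1,000 positive and 1,000 negative reviews.
target_domain = [(f"pos review {i}", 1) for i in range(1000)] + \
                [(f"neg review {i}", 0) for i in range(1000)]
fine_tune_set = sample_balanced(target_domain, k=50)
```

With 2,000 labeled reviews per domain, the 50 selected samples correspond to the 1/40 fraction mentioned in the text.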
Comparison of different methods on cross-domain sentiment classification (k = 50)
Compared with the classic SCL method, the accuracy is improved by only 4%, but our network does not require any manual operation, which effectively saves time. Compared with the ITIAD method, the average accuracy of our model is improved by 7.3%, which suggests that deep learning methods can outperform traditional machine learning methods. From the table, we can see that when the source domain is the Electronic domain and the target domain is the Kitchen domain, the accuracies are generally high, which indicates that the difference between the Electronic and Kitchen domains is small. Compared with the SCL-ML, SFA, DANN, AE-SCL-SR, and CNN_FT methods, the average accuracy of our model is improved by 7.1%, 2.4%, 6.7%, 3.4%, and 1%, respectively.
In our previous work [35], CNN was used for cross-domain sentiment classification. After the model was trained in the source domain, it was transferred to the target domain and the weights of the convolutional layer were shared. By using a small amount of data in the target domain to fine-tune the fully connected layer, the method achieved relatively good performance. In follow-up research, we found that, for text data, CNN is a strong feature extractor, but it can only extract local features and does not take into account the context and the important semantic features of the text. We therefore proposed a new method named ANFS, which not only extracts the local features of the data but also fully considers the dependencies between features. In addition, we use the attention mechanism to focus on the important semantic features.
Compared with other methods, the advantage of the ANFS method is that the network can automatically extract more comprehensive features of the text. CNN extracts the local features of the data, BiLSTM captures the context dependence between the features, and the attention mechanism focuses on the important semantic features, which effectively improves the classification accuracy of the model. During model transfer, the parameters of the convolutional layer are shared, and the BiLSTM layer and the attention layer of the model are fine-tuned with a very small amount of data in the target domain, which further improves the classification accuracy of the model.
The impact of target domain labeling data on the model.
The impact of the number of iterations on model performance.
We also test the effect of different amounts of labeled target domain data. The values of the data amount
The number of iterations is also an important parameter of the network, with a great impact on the performance and running time of our model. We choose three sub-experiments, B-D, E-B, and K-B, to verify the impact of the number of iterations on model performance. In Fig. 4, it can be seen that the network performs best when the number of iterations is between 6 and 13. After more than 13 iterations, the network overfits and the model performance decreases. Therefore, we set the number of iterations of the model in the range 6–13.
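One way to pick the number of iterations automatically, rather than reading it off a curve, is to monitor accuracy on held-out data and stop when it no longer improves. The sketch below assumes such a held-out split exists, which the text does not describe; the function names, the patience value and the example accuracy values are made up for illustration and are not results from the paper.

```python
from typing import List


def best_epoch(val_accuracies: List[float]) -> int:
    """Return the 1-based epoch with the highest validation accuracy
    (the earliest one in case of ties)."""
    best_index = max(range(len(val_accuracies)), key=lambda i: val_accuracies[i])
    return best_index + 1


def early_stop_epoch(val_accuracies: List[float], patience: int = 3) -> int:
    """Return the 1-based epoch at which training would stop, i.e. after
    `patience` consecutive epochs without a new best accuracy."""
    best, since_best = float("-inf"), 0
    for epoch, acc in enumerate(val_accuracies, start=1):
        if acc > best:
            best, since_best = acc, 0
        else:
            since_best += 1
            if since_best >= patience:
                return epoch
    return len(val_accuracies)


# Hypothetical accuracy curve that rises and then degrades as the model overfits.
curve = [0.70, 0.75, 0.80, 0.79, 0.78, 0.77, 0.76]
```

On this made-up curve, the best epoch is the third one, and training with a patience of 3 would stop at the sixth epoch, after which the saved weights from the best epoch would be used.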
The attention network based on feature sequences proposed in this paper is used for cross-domain sentiment classification. First, we extract the local features of the data through a convolutional neural network, and use BiLSTM to capture the long-term dependence between features. Then, we use the attention mechanism to assign weights to the important sentiment features, which makes the features more comprehensive. Finally, we transfer the network trained on the source domain data to the target domain and fine-tune the network with a small amount of target domain data. Our experimental results show that the method can effectively complete cross-domain sentiment classification tasks.
In the future, we will use more data to train the network, which can improve the generalization ability of the network, and combine the instance-based transfer learning method to solve the cross-domain sentiment classification problems.
Acknowledgments
We thank the reviewers and editors for their valuable suggestions and comments, which helped improve this work. This research was funded by the National Natural Science Foundation of China (No. 61876031), Natural Science Foundation of Liaoning Province, China (20180550921, 2019-ZD-0175) and Scientific Research Fund Project of the Education Department of Liaoning Province (LJYT201906).
