An attention network based on feature sequences for cross-domain sentiment classification

Abstract

The difficulty of cross-domain text sentiment classification is that the data distributions in the source domain and the target domain are inconsistent. This paper proposes an attention network based on feature sequences (ANFS) for cross-domain sentiment classification, which focuses on important semantic features by using the attention mechanism. Particularly, ANFS uses a three-layer convolutional neural network (CNN) to perform deep feature extraction on the text, and then uses a bidirectional long short-term memory (BiLSTM) to capture the long-term dependency relationship among the text feature sequences. We first transfer the ANFS model trained on the source domain to the target domain and share the parameters of the convolutional layer; then we use a small amount of labeled target domain data to fine-tune the model of the BiLSTM layer and the attention layer. The experimental results on cross-domain sentiment analysis tasks demonstrate that ANFS can significantly outperform the state-of-the-art methods for cross-domain sentiment classification problems.

Keywords

Deep transfer learning, CNN, BiLSTM, attention mechanism, sentiment classification

1. Introduction

In the Internet era, artificial intelligence and cloud computing have become part of people’s lives, and massive amounts of varied data are generated at all times. The application of big data will become the key to upgrading various industries. E-commerce platforms let people obtain more product information from product reviews, and also make it possible to mine customer preferences from user reviews. Machine learning algorithms can predict user preferences, but model training relies on labeled data, and data annotation is a time-consuming and labor-intensive process, which makes training machine learning models difficult.

As a branch of machine learning, transfer learning [1] has emerged as an effective approach to addressing the problem of insufficient high-quality training data. It assumes that the same classification problem in different domains shares some common properties, so the source domain’s training data can help train the target domain’s classifier. Typically, the source domain has a large labeled training dataset, while the target domain has very few labeled training instances. In sentiment classification, applying transfer learning has been challenging because high-quality labeled source-domain training data are lacking.

Two classic methods for applying transfer learning to sentiment classification are the Structural Correspondence Learning (SCL) algorithm proposed by Blitzer et al. [2] and the Spectral Feature Alignment (SFA) algorithm proposed by Pan et al. [3]. SCL is used primarily to find common features between the target and source domains; SFA aligns domain-specific words from different domains into unified clusters and builds a bridge between the source domain and the target domain. The limitation of these methods is that they require manual feature extraction and expert-designed rules or domain-specific modeling. Although they can achieve good classification results, manual feature extraction cannot keep up with the rapid increase in data volume.

With the rapid development of deep learning, applying deep learning methods to transfer learning has become a hot research topic, and many results have been obtained. Yosinski et al. [4] studied the transferability of deep neural networks. They found that initializing a model with transferred features can improve generalization performance, and that extensive fine-tuning on new tasks can effectively improve the performance of deep neural networks. In a recent review, Liu et al. [5] summarized research on sentiment analysis in recent years, focusing on the algorithms and applications of transfer learning in sentiment analysis. They also described in detail the corpora and research methods used for cross-domain sentiment analysis in recent years. The review lists many recent research methods, including CNN, LSTM, etc. On the same data sets, the parameter transfer method is the one mainly used for cross-domain sentiment classification research. However, these models are not comprehensive enough in feature extraction.

Ganin et al. [6] proposed the DANN algorithm in 2016, which embeds domain adaptation into the process of learning feature representations. The resulting feedforward neural network can be applied directly to the target domain. Qu et al. [7] proposed an adversarial category alignment network (ACAN), which takes decision boundaries into consideration and attempts to enhance the consistency of categories between the source and target domains. Wei et al. [8] proposed a two-layer convolutional neural network for cross-domain sentiment classification on the cross-domain product review transfer learning benchmark. This method works well by introducing a large-scale auxiliary cross-domain dataset. In a recent work, Ji et al. [9] designed a bifurcated LSTM, which uses attention-based LSTM classifiers and orthogonal constraints to complete cross-domain sentiment classification tasks. However, these methods do not consider the dependencies between contexts.

In recent years, natural language processing has made great progress through the development of deeper language representation models. These models are pre-trained on large text corpora and fine-tuned for specific tasks, including cross-domain sentiment tasks. Howard and Ruder [10] proposed an effective transfer learning method, named Universal Language Model Fine-tuning (ULMFiT), which can be applied to any task in NLP. This method outperformed the state of the art on six text classification tasks. When the pre-trained models are applied to different datasets, one just needs to fine-tune the parameters of the language model to resolve these differences. In general, using pre-trained models to perform specific tasks requires high-performance equipment, which is difficult for most researchers to obtain.

CNN is a feedforward neural network, first proposed by LeCun [11]. It shares parameters and can capture local features of text, but lacks the ability to learn sequence correlations. In 1997, Hochreiter et al. [12] proposed LSTM, which can not only process input of unlimited length, but also utilize the information of the entire input sequence. BiLSTM combines a forward LSTM and a backward LSTM, which can effectively capture the forward and backward dependencies between features, but it cannot weigh the importance of features. The attention mechanism was first proposed by Bahdanau et al. [13] and used in a translation model. It learns a distribution of weights over the inputs, so a model with an attention mechanism can focus on important semantic features. In summary, combining the three models mentioned above makes it possible to fully consider local information, context information and important semantic information.

In order to solve the problems of negative transfer and incomplete feature extraction in the existing methods, this paper proposes an attention network based on feature sequences (ANFS). First, we use word2vec to map the text into low-dimensional dense word vectors. Secondly, we use the ANFS model to extract local semantic features, contextual semantic relationships and important text features. Thirdly, we use the softmax classifier to classify the sentiment of the text. After training with the above steps and saving the optimal model, we transfer the model to the target domain and use a small amount of target domain data to fine-tune it, which solves the cross-domain sentiment classification problem. The main contributions of this paper are as follows:

1. Extract the text features more comprehensively by using CNN to extract local semantic features of sentences and BiLSTM to capture the contextual relationship between semantic features.
2. Add an attention mechanism to focus on important semantic features.
3. Share the parameters of the convolutional layer, use a few target domain data to fine-tune the BiLSTM layer and the attention layer to solve the negative transfer problem effectively.

2. Related works

2.1 Deep transfer learning

With the rapid development of deep learning, deep learning methods have been applied to transfer learning, and many research results have been achieved. A review [14] classified deep transfer learning into four categories: instance-based deep transfer learning, mapping-based deep transfer learning, network-based deep transfer learning, and adversarial-based deep transfer learning.

In order to supplement the training set in the target domain, the instance-based deep transfer learning method selects some instances from the source domain and assigns appropriate weight values to the selected instances. Dai et al. [15] proposed a TrAdaBoost method, which uses boosting-based algorithms to filter out samples with large differences in the distribution between the source and target domains, and uses the source domain samples and a few target domain samples to learn an accurate model. This allows knowledge to be efficiently transferred from the source domain data to the target domain data. Wang et al. [16] introduced an adaptive source-domain training instance selection method to address the problem of noisy source-domain training data. This method can effectively identify the most informative training examples. The iterative method selected informative samples from the source domain with the informativeness measures, merged with the target-domain training data, evaluated the performance of learned classifier for the target domain, and updated the informativeness measures for the next iteration.

The mapping-based deep transfer learning method maps the data of the source domain and the target domain to the new data space. This new data space is similar to both domains and is suitable for joint deep neural networks. Pan et al. [17] proposed a transfer component analysis (TCA) method for domain adaptation, which attempts to use the cross-domain transfer component in a Reproducing Kernel Hilbert Space to preserve data attributes. The data distribution in different domains is also closer. The new representation in the subspace is used to train the model in the source domain for the target domain.

The network-based deep transfer learning method uses the source domain data to train the network structure and transfers it to the target domain. Oquab et al. [18] reused the first few layers of a CNN trained on the ImageNet dataset to extract intermediate image representations for images in other datasets. A CNN trained to learn image representations in this way can be transferred effectively to other visual recognition tasks where the amount of training data is limited. Long et al. [19] proposed a joint learning method that learns adaptive classifiers and transferable features from labeled data in the source domain and unlabeled data in the target domain; it guides the target classifier to learn the residual function by inserting several layers into the deep network.

The adversarial-based deep transfer learning methods introduce adversarial techniques, inspired by generative adversarial networks (GANs) [20], to find transferable representations that are suitable for both the source and target domains. Ganin et al. [21] proposed an adversarial training method that can be applied to most feedforward neural models by adding several standard layers and a simple new gradient reversal layer. Tzeng et al. [22] proposed a method for transferring across domains and tasks when the target domain data are sparsely labeled. A special joint loss function is used in this work to force the CNN to optimize the distance between domains. Hoffman et al. [23] proposed a new GAN loss and combined the discriminative model with a new domain adaptation method. Long et al. [24] proposed a stochastic multilinear adversarial network to achieve deep and discriminative adversarial adaptation. The principle is to use multiple feature layers and a classifier layer based on stochastic multilinear adversarial learning.

2.2 Cross-domain sentiment classification

The difficulty of cross-domain text sentiment classification is that the data distribution in the source domain is inconsistent with that in the target domain, and the sentiment expressed in text is strongly domain dependent. The cross-domain text sentiment classification problem is mainly studied through knowledge transfer between domains, machine learning, fine-tuning of pre-trained models, and deep learning. From the perspective of knowledge transfer, Jia et al. proposed two methods to reduce the difference between the source and target domains, namely word alignment based on association rules [25] and domain alignment based on multi-viewpoint domain-shared features [26]. The former established an indirect mapping between the domain-specific words of different domains by learning strong association rules between domains, and shared domain-specific words within the same domain. The latter used a sentiment dictionary and an improved MI [3] technique to establish an unambiguous shared feature set between domains, and extracted unique feature word pairs between domains through syntactic analysis and association rules, in order to expand the domain dictionary and align the information distribution spaces of the domains. Both methods then used machine learning classifiers to solve cross-domain sentiment classification problems. In other words, aligning words linked by strong association rules with domain-specific words reduces, to a certain extent, the difference between the source domain and the target domain and aligns the spaces or words of different domains, after which machine learning classifiers solve the cross-domain sentiment classification problem.

From the perspective of fine-tuning the pre-trained model, Myagmar et al. [27] fine-tuned the cross-domain sentiment classification models of BERT and XLNet to efficiently solve the cross-domain sentiment classification tasks. From the machine learning perspective, Lin et al. [28] proposed a classification method based on the regression model, which uses a tree-structured domain representation and domain similarity to predict the number of nodes from multiple source nodes to the target node, with a loss of complexity in accuracy.

Tang et al. [29] analyzed the applications and challenges of deep learning methods in sentiment analysis. The authors pointed out that deep learning models can automatically extract high-level features of data, and that for sentiment classification and sentiment dictionary construction they are superior to traditional machine learning methods. Yu et al. [30] used deep learning to model sentences, and the experimental results show that deep learning models are superior to traditional machine learning models. Li et al. [31] proposed an end-to-end Adversarial Memory Network (AMN). Gradient reversal was added so that the parameters of the shared layer take part in the gradient updates of both the domain classifier and the sentiment classifier. Cross-domain sentiment analysis is performed by maximizing the classification error of the domain classifier while minimizing the classification error of the sentiment classifier. Deep learning methods can learn higher-level features from the data and extract them automatically. Compared with the manually designed features of traditional machine learning methods, the features learned by deep learning methods generally give better results.

3. Deep transfer learning algorithms

3.1 Basic definitions

• Domain: a domain $D$ consists of a feature space $\chi$ and a marginal probability distribution $P(X)$, where $\chi$ represents the space of all item vectors, $X$ represents a specific learning sample. In general, if two domains are different, they may have different feature spaces or different marginal probability distributions.
• Source domain: Given the source domain data, $D_{S}=\{(X_{s_{i}},Y_{s_{i}})\}_{i=1}^{n_{s}}$, where $X_{s_{i}}$ represents the i-th sample in the source domain, $Y_{s_{i}}$ represents the label corresponding to the i-th sample, and $n_{s}$ represents the number of the source domain data.
• Target domain: Given the target domain data, $D_{T}=\{(X_{T_{i}})\}_{i=1}^{n_{T}}$, where $X_{T_{i}}$ represents the i-th sample of the target domain, and $n_{T}$ represents the number of the target domain data.

3.2 Overall process

The overall flowchart of the model proposed in this paper to solve cross-domain sentiment classification is shown in Fig. 1. We use the source domain data to train the model, then migrate it to the target domain, and use a small amount of labeled data in the target domain to fine-tune the model to achieve the goal of predicting the sentiment in the target domain.

In Fig. 1, we can see that the network structures of the source domain and the target domain are exactly the same. First, we use the source domain data to learn the weights of the model. Then, we transfer the model to the target domain and share the parameters of the convolutional layers, which keeps the weight information common to the two domains and avoids learning this knowledge again. Finally, we use a small amount of data in the target domain to fine-tune the model. The information in the target domain updates the parameters of the BiLSTM and attention layers in the network. This operation makes the network more suitable for the target domain, thereby effectively improving its classification accuracy.

Figure 1.

Model transfer diagram.

3.3 Structure of attention network based on feature sequence

First, we train the neural network by using the labeled data from the source domain. The input layer is a batch of samples, where each sample is expressed as $xs\in R_{s}^{n\times 1}$, and the fixed length of the sentence is $n$ (if the length is less than $n$, the sentence is padded with zeros). The embedding layer uses word2vec to map the original input to $xs\in R_{s}^{n\times k}$, where $k$ is the word vector dimension. $xs_{i}\in R_{s}^{k}$ represents the i-th word in the sentence, so the input sentence can be expressed as:

$\displaystyle xs_{1:n}=xs_{1}\oplus xs_{2}\oplus\ldots\oplus xs_{n}$ (1)

In Eq. (1), $\oplus$ is a connection operator.
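As a minimal sketch of the input representation in Eq. (1), the following Python snippet pads a tokenized sentence to a fixed length $n$ and stacks the per-word vectors into an $n\times k$ matrix. The toy embedding table, the helper names `pad_ids` and `embed`, the reserved padding index, and the truncation of sentences longer than $n$ are illustrative assumptions; in the paper the vectors come from word2vec.

```python
from typing import Dict, List

PAD_ID = 0  # assumption: index 0 is reserved for padding


def pad_ids(ids: List[int], n: int) -> List[int]:
    """Pad with PAD_ID (or truncate) so the sentence has exactly n entries."""
    return ids[:n] + [PAD_ID] * max(0, n - len(ids))


def embed(ids: List[int], table: Dict[int, List[float]]) -> List[List[float]]:
    """Stack the k-dimensional vector of each word into an n x k matrix."""
    return [table[i] for i in ids]


# Toy embedding table with k = 2 (hypothetical values, not word2vec output).
toy_table = {0: [0.0, 0.0], 1: [0.1, 0.3], 2: [0.5, -0.2]}
sentence = pad_ids([1, 2], n=4)
matrix = embed(sentence, toy_table)
```

Here `matrix` plays the role of $xs_{1:n}$: one row per word position, one column per embedding dimension.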

The first layer of the network consists of a convolutional layer and a max pooling layer. The role of this layer is to extract the features of the sentence. A convolutional kernel $w_{s}\in R_{s}^{h\times k}$ of size $h\times k$ slides over the input from top to bottom to perform the convolution, which produces a feature map with one column and $n-h+1$ rows. The maximum of the features is taken as the main feature, that is,

$\displaystyle c_{s}=(c_{s}^{1},c_{s}^{2},\ldots,c_{s}^{n-h+1})$ (2)

$\displaystyle\max(c_{s})=(\max(c_{s}^{1}),\max(c_{s}^{2}),\ldots,\max(c_{s}^{n-h+1}))$ (3)

where

$\displaystyle c_{s}^{i}=f(w_{s}\cdot xs_{i:i+h-1}+b_{s})$ (4)

In Eq. (4), $f$ is a nonlinear activation function and $b_{s}$ is a bias term.
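The convolution and pooling step of Eqs (2)–(4) can be sketched in plain Python as follows. This is an illustration under stated assumptions rather than the implementation used in the paper: ReLU is assumed for the activation $f$, a single kernel is used, and the pooling window size is a hypothetical parameter because the paper does not specify it.

```python
from typing import Callable, List


def relu(z: float) -> float:
    return max(0.0, z)


def conv_feature_map(
    x: List[List[float]],        # sentence matrix, n rows of k values
    w: List[List[float]],        # kernel, h rows of k values
    b: float,
    f: Callable[[float], float] = relu,
) -> List[float]:
    """Slide the h x k kernel down the n x k input (Eq. 4); returns n - h + 1 values."""
    n, h = len(x), len(w)
    feature_map = []
    for i in range(n - h + 1):
        z = b
        for r in range(h):
            z += sum(wv * xv for wv, xv in zip(w[r], x[i + r]))
        feature_map.append(f(z))
    return feature_map


def max_pool(c: List[float], size: int) -> List[float]:
    """Keep the largest value in each non-overlapping window of `size` entries."""
    return [max(c[i:i + size]) for i in range(0, len(c), size)]


x = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.0, 0.0]]   # n = 4, k = 2
w = [[1.0, -1.0], [0.5, 0.5]]                            # h = 2
c = conv_feature_map(x, w, b=0.0)                        # length n - h + 1 = 3
pooled = max_pool(c, size=2)
```

With $n=4$ and $h=2$ the feature map has $n-h+1=3$ entries, matching the row count stated above.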

The second and third layers of the model have the same structure as the first layer. The output of the previous layer is the input of the next layer. The purpose is to extract more abstract features of the sentence.

The fourth layer of the network is the BiLSTM layer, whose role is to capture long-term dependencies between features. We feed the deep features extracted by max pooling in the previous layer into the BiLSTM layer to obtain two hidden states running in opposite time directions, and combine them as the output. In this way BiLSTM obtains the context information of the input sequence. At time $p$, $\vec{h}_{p}$ represents the forward output of the LSTM; $\overleftarrow{h}_{p}$ represents the backward output of the LSTM; $x_{p}$ represents the input; and $h_{p}$ represents the output of the BiLSTM, as shown in the following formulas:

$\displaystyle\vec{h}_{p}=\textit{LSTM}(x_{p},\vec{h}_{p-1})$ (5)

$\displaystyle\overleftarrow{h}_{p}=\textit{LSTM}(x_{p},\overleftarrow{h}_{p-1})$ (6)

$\displaystyle h_{p}=w_{p}\vec{h}_{p}+v_{p}\overleftarrow{h}_{p}+b_{p}$ (7)

where $w_{p}$ represents the weight of forward output, $v_{p}$ represents the weight of backward output; and $b_{p}$ represents the offset.
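To illustrate how Eqs (5)–(7) fit together, the sketch below runs a recurrent step function over the sequence in both directions and mixes the two states at each position. The step function `toy_step` is a scalar tanh unit used only as a stand-in, not a real LSTM cell, and the scalar weights `w`, `v` and offset `b` are assumptions made to keep the example short.

```python
import math
from typing import Callable, List

Step = Callable[[float, float], float]  # (x_p, h_prev) -> h_p


def toy_step(x: float, h_prev: float) -> float:
    """Stand-in recurrent cell (assumption): a scalar tanh unit, not a real LSTM."""
    return math.tanh(0.5 * x + 0.5 * h_prev)


def bidirectional(xs: List[float], step: Step, w: float, v: float, b: float) -> List[float]:
    """Run `step` left-to-right and right-to-left, then mix the states as in Eq. (7)."""
    forward, h = [], 0.0
    for x in xs:                      # Eq. (5): forward pass
        h = step(x, h)
        forward.append(h)
    backward, h = [], 0.0
    for x in reversed(xs):            # Eq. (6): backward pass
        h = step(x, h)
        backward.append(h)
    backward.reverse()                # align backward states with positions p
    return [w * f + v * g + b for f, g in zip(forward, backward)]


outputs = bidirectional([1.0, -1.0, 0.5], toy_step, w=0.5, v=0.5, b=0.0)
```

Each entry of `outputs` depends on both the words before and the words after position $p$, which is the property the BiLSTM layer contributes to the model.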

The fifth layer of the network is the attention layer, which captures the importance of different features and gives a higher weight to important features. We compute the attention weight $s_{s}$ of the hidden state $h_{p}$ at every moment and obtain the attention value $\alpha_{s}$ by a weighted accumulation of the hidden states. Let $A_{p}$ represent the hidden unit of $h_{p}$, as shown in the following formulas:

$\displaystyle A_{p}=\tanh(w_{a}h_{p}+b_{a})$ (8)

$\displaystyle s_{s}=\mathrm{softmax}(A_{p}v_{a})$ (9)

$\displaystyle\alpha_{s}=\sum{s_{s}h_{p}}$ (10)

where $v_{a}$ represents context vector, $w_{a}$ and $b_{a}$ are attention parameters.
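A small worked sketch of Eqs (8)–(10) is given here in plain Python. The shapes are assumptions chosen for illustration: each hidden state is a $d$-dimensional vector, $w_{a}$ is a $d\times d$ matrix, and $b_{a}$ and $v_{a}$ are $d$-dimensional vectors; the paper does not state these dimensions.

```python
import math
from typing import List


def softmax(scores: List[float]) -> List[float]:
    m = max(scores)                               # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention_pool(
    hs: List[List[float]],      # hidden states h_p, one d-dimensional vector per step
    w_a: List[List[float]],     # d x d projection matrix (assumed shape)
    b_a: List[float],           # d-dimensional bias
    v_a: List[float],           # d-dimensional context vector
) -> List[float]:
    scores = []
    for h in hs:
        # Eq. (8): A_p = tanh(w_a h_p + b_a)
        a = [math.tanh(sum(wij * hj for wij, hj in zip(row, h)) + bi)
             for row, bi in zip(w_a, b_a)]
        # Score of step p: inner product of A_p with the context vector v_a
        scores.append(sum(ai * vi for ai, vi in zip(a, v_a)))
    weights = softmax(scores)                     # Eq. (9), normalised over the steps p
    d = len(hs[0])
    # Eq. (10): weighted sum of the hidden states
    return [sum(wt * h[j] for wt, h in zip(weights, hs)) for j in range(d)]


hs = [[1.0, 0.0], [0.0, 1.0]]
identity = [[1.0, 0.0], [0.0, 1.0]]
alpha = attention_pool(hs, identity, [0.0, 0.0], v_a=[1.0, 1.0])
```

In this toy case both steps receive the same score, so the attention weights are equal and `alpha` is the plain average of the two hidden states.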

Finally, the output $\alpha_{s}$ of the attention layer is used as the input of the softmax classifier to obtain the probability of each class, and the class is judged by the probability value. The formula is shown in Eq. (11):

$\displaystyle\hat{y}_{s}=w\cdot\alpha_{s}+b$ (11)

Let $P(\hat{y}_{s}^{i})$ denote the probability of the source domain sample in the i-th category. The formula is:

$\displaystyle P(\hat{y}_{s}^{i})=\frac{\exp(\hat{y}_{s}^{i})}{\sum\limits_{i=0}^{\textit{Label}}{\exp(\hat{y}_{s}^{i})}}$ (12)

where Label is the label of the sample.
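The classification step of Eqs (11) and (12) amounts to a linear layer followed by a softmax over the classes. The snippet below is a minimal sketch of that computation; the two-class weight matrix and the input vector are hypothetical values, and subtracting the maximum score before exponentiating is a standard numerical safeguard that does not change the probabilities.

```python
import math
from typing import List


def class_probabilities(alpha: List[float], w: List[List[float]], b: List[float]) -> List[float]:
    """Eq. (11) followed by Eq. (12): linear scores, then softmax over classes."""
    logits = [sum(wi * ai for wi, ai in zip(row, alpha)) + bi for row, bi in zip(w, b)]
    m = max(logits)                               # stabilises exp() without changing the result
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical 2-class weights (negative, positive) for a 2-dimensional alpha.
probs = class_probabilities([0.5, 0.5], w=[[1.0, 0.0], [0.0, 3.0]], b=[0.0, 0.0])
predicted = probs.index(max(probs))               # class with the highest probability
```

The predicted class is simply the index with the largest probability, which is how the class is judged by the probability value.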

To reduce overfitting and training time, we use the dropout method proposed by Hinton et al. [32] in the model. The main function of dropout is to temporarily discard some neural network units from the network according to a certain probability, and to reduce the interaction between neurons in the hidden layer. Figure 2 shows the model diagram used in this paper.
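As an illustration of the dropout idea, the sketch below zeroes each unit with probability `p` during training. It uses the common "inverted" variant, which rescales the surviving units by $1/(1-p)$ so that no change is needed at prediction time; the paper does not say which variant is used, so this choice and the function name are assumptions.

```python
import random
from typing import List


def dropout(values: List[float], p: float, rng: random.Random, training: bool = True) -> List[float]:
    """Zero each unit with probability p (0 <= p < 1) during training; rescale survivors."""
    if not training or p == 0.0:
        return list(values)
    keep = 1.0 - p
    return [v / keep if rng.random() < keep else 0.0 for v in values]


rng = random.Random(0)          # fixed seed only to make the sketch repeatable
hidden = [1.0, 2.0, 3.0, 4.0]
dropped = dropout(hidden, p=0.5, rng=rng)
```

At prediction time the same function is called with `training=False` and returns the activations unchanged.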

Figure 2.

Network structure diagram of the source and target domains.

3.4 Cross-domain sentiment classification

We transfer the trained network to the target domain and fine-tune it by using a small number of labeled target domain data. The first layer uses word2vec to represent the data of the input layer in the format of $xt\in R_{t}^{n\times k}$ . The specific formula is as follows:

$\displaystyle xt_{1:n}=xt_{1}\oplus xt_{2}\oplus\ldots\oplus xt_{n}$ (13)

The three convolutional layers of the network reuse the weights $w_{s}$ trained in the source domain. The maximum feature value obtained by the max pooling operation is used as the main feature. The specific formula is:

$\displaystyle c_{t}=(c_{t}^{1},c_{t}^{2},\ldots,c_{t}^{n-h+1})$ (14)

$\displaystyle\max(c_{t})=(\max(c_{t}^{1}),\max(c_{t}^{2}),\ldots,\max(c_{t}^{n-h+1}))$ (15)

where

$\displaystyle c_{t}^{i}=f(w_{s}\cdot xt_{i:i+h-1}+b_{s})$ (16)

Then, we use a small amount of labeled target domain data to fine-tune the weights of the BiLSTM layer and the attention layer. The BiLSTM layer is initialized with the weights trained in the source domain. At time $p$, $\vec{h}^{\prime}_{p}$ represents the forward output of LSTM; $\overleftarrow{h}^{\prime}_{p}$ represents the backward output of LSTM; $x^{\prime}_{p}$ represents the input; and $h^{\prime}_{p}$ represents the output of BiLSTM, as shown in the following formulas:

$\displaystyle\vec{h}^{\prime}_{p}=\textit{LSTM}(x^{\prime}_{p},\vec{h}^{\prime}_{p-1})$ (17)

$\displaystyle\overleftarrow{h}^{\prime}_{p}=\textit{LSTM}(x^{\prime}_{p},\overleftarrow{h}^{\prime}_{p-1})$ (18)

$\displaystyle h^{\prime}_{p}=w^{\prime}_{p}\vec{h}^{\prime}_{p}+v^{\prime}_{p}\overleftarrow{h}^{\prime}_{p}+b^{\prime}_{p}$ (19)

where $w^{\prime}_{p}$ and $v^{\prime}_{p}$ represent the weight values obtained by fine-tuning $w_{p}$ and $v_{p}$, and $b^{\prime}_{p}$ is the offset obtained by fine-tuning $b_{p}$.

We use a small amount of data in the target domain to fine-tune the attention weight $s_{t}$ of the hidden state $h^{\prime}_{p}$ obtained at each moment, and finally obtain the attention value $\alpha_{t}$ by a weighted accumulation of the hidden states. $A^{\prime}_{p}$ represents the hidden unit of $h^{\prime}_{p}$. The specific formulas are as follows:

$\displaystyle A^{\prime}_{p}=\tanh(w^{\prime}_{a}h^{\prime}_{p}+b^{\prime}_{a})$ (20)

$\displaystyle s_{t}=\mathrm{softmax}(A^{\prime}_{p}v_{a})$ (21)

$\displaystyle\alpha_{t}=\sum{s_{t}h^{\prime}_{p}}$ (22)

Finally, the output $\alpha_{t}$ of the attention layer is used as the input of the softmax classifier to obtain the probability of each class, and the category is judged by the probability value.

$\displaystyle\hat{y}_{t}=w\cdot\alpha_{t}+b$ (23)

Let $P(\hat{y}_{t}^{i})$ denote the probability of the target domain sample in the i-th category. The formula is:

$\displaystyle P(\hat{y}_{t}^{i})=\frac{\exp(\hat{y}_{t}^{i})}{\sum\limits_{i=0}^{\textit{Label}}{\exp(\hat{y}_{t}^{i})}}$ (24)

We use the labeled data of the source domain for training, and then save the trained network structure and the weights of each layer. We transfer the network to the target domain and share the convolutional layer weights. We use a small amount of labeled data in the target domain to fine-tune the weights of the BiLSTM layer and the attention layer, and finally perform sentiment classification for the target domain. The specific algorithm flow of the network is shown in Algorithm 1.

Algorithm 1: ANFS training procedure
Input: $\mbox{D}_{S}$ : source domain dataset
$\mbox{D}_{T}$ : target domain dataset
$n$ : target domain data for fine-tuning the model
i/j/k: training/fine-tuning/prediction sample index
len( $\cdot$ ): total number of samples in a dataset “ $\cdot$ ”
Output: ${\hat{\text{y}}}_{T}$ : Target domain prediction label
1: begin
2: map all data into low-dimensional dense vectors using word2vec
3: for i $\leftarrow$ 1 to len( $\mbox{D}_{S}$ ) do
4: compute $w_{s}$ , $b_{s}$ , $w_{p}$ , $v_{p}$ , $\alpha_{s}$ by Eqs (2)–(10)
5: end for
6: save $f_{\theta}$
7: for j $\leftarrow$ 1 to len( $n$ ) do
8: fix $w_{s}$ , $b_{s}$ and update $w_{p}$ , $v_{p}$ , $\alpha_{s}$ according to Eqs (17)–(22)
9: end for
10: repeat steps 7–9 until interNum iterations are reached
11: revise $w_{p}$ , $v_{p}$ , $\alpha_{s}$ to $w^{\prime}_{p}$ , $v^{\prime}_{p}$ , $\alpha^{\prime}_{s}$ with $f_{\theta}$ , $n$
12: get the fine-tuned model $f^{\prime}_{\theta}$
13: for k $\leftarrow$ 1 to len( $\mbox{D}_{T}-n)$ do
14: use model $f^{\prime}_{\theta}$ to predict the label of k
15: end for
16: output ${\hat{\text{y}}}_{T}$
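
The parameter-sharing step of Algorithm 1 (steps 7–9) can be illustrated with a small Python sketch in which the convolutional weights are frozen and only the BiLSTM and attention weights are updated. The parameter names, the toy numbers, the made-up gradients and the use of plain gradient descent are all assumptions for illustration; the paper does not specify the optimizer.

```python
from typing import Dict, List, Set

Params = Dict[str, List[float]]


def sgd_step(params: Params, grads: Params, frozen: Set[str], lr: float) -> Params:
    """Return updated parameters; groups named in `frozen` are copied unchanged."""
    updated = {}
    for name, values in params.items():
        if name in frozen:
            updated[name] = list(values)          # shared convolution weights stay fixed
        else:
            updated[name] = [v - lr * g for v, g in zip(values, grads[name])]
    return updated


# Parameters saved after source-domain training (toy numbers, not real weights).
source_model: Params = {"conv": [0.2, -0.1], "bilstm": [0.5, 0.3], "attention": [0.1, 0.4]}
# Gradients that would come from a batch of labeled target-domain samples (made up here).
target_grads: Params = {"conv": [1.0, 1.0], "bilstm": [0.5, -0.5], "attention": [0.2, 0.0]}

target_model = sgd_step(source_model, target_grads, frozen={"conv"}, lr=0.1)
```

After the step, `target_model["conv"]` is identical to the source-domain weights, while the `bilstm` and `attention` entries have moved toward the target domain.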

4. Experimental results and analyses

4.1 Dataset

We evaluate our algorithm on one corpus, Amazon reviews, a multi-domain sentiment data set collected by Blitzer et al. [2]. From this data set we select Amazon product reviews for four different domains: Book, DVD, Electronic and Kitchen. Each domain contains 1,000 positive and 1,000 negative reviews, for a total of 8,000 reviews. Cross-domain sentiment classification is performed across these four domains; detailed statistics are shown in Table 1.

Table 1
The information of datasets

Dataset type	Positive example	Negative example
Book	1000	1000
Kitchen	1000	1000
DVD	1000	1000
Electronic	1000	1000

4.2 Experimental parameters

In the experiment, we use the word2vec model to convert words into word vectors. The word vectors have 300 dimensions, and the vocabulary contains three million entries. The network consists of three convolutional layers, a pooling layer, a BiLSTM layer, and an attention layer. The specific parameters are given as follows: the convolution kernel size is 3, the numbers of convolution kernels are 32, 64, 128, the dropout parameter is 0.2, the batch size is 64, and the given parameter $k$ is the number of labeled data in the target domain used for fine-tuning. Table 2 lists the parameter settings of the model in the experiment.

Table 2
Network parameters

Parameter name	Parameter value
Numbers of convolution kernel	32, 64, 128
Convolution kernel size	3
LSTM cell status dimension	128
LSTM hidden state dimension	128
Sentence length	600
Word vector dimension	300
Batch size	32
Dropout	0.5
Target domain training data size (K)	50, 100, 200, 400

4.3 Experimental evaluation index

We choose accuracy to facilitate comparison with other methods. The formula is as follows:

$\displaystyle\text{Accuracy}=\frac{TP+TN}{TP+TN+FP+FN}$ (25)

Among them, TP indicates that the sample is positive and is predicted to be positive; TN indicates that the sample is negative and is predicted to be negative; FP indicates that the sample is negative and is predicted to be positive; and FN indicates that the sample is positive and is predicted to be negative. TP $+$ TN $+$ FP $+$ FN is the total sample size.
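
Eq. (25) can be checked on a handful of predictions with the short Python sketch below. The label encoding (1 for positive, 0 for negative) and the example labels are assumptions for illustration.

```python
from typing import Sequence


def accuracy(y_true: Sequence[int], y_pred: Sequence[int]) -> float:
    """Eq. (25), with 1 = positive and 0 = negative."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    return (tp + tn) / (tp + tn + fp + fn)


# Five toy samples: TP = 2, TN = 1, FP = 1, FN = 1.
score = accuracy([1, 1, 0, 0, 1], [1, 0, 0, 1, 1])
```

For these five samples the accuracy is $(2+1)/5=0.6$.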

4.4 Compared methods

In order to verify the effectiveness of the proposed method, we compared the proposed cross-domain sentiment classification method with the following seven methods, namely, SCL [2], SCL-ML [2], SFA [3], ITIAD [33], AE-SCL-LR [34], DANN [6], CNN_FT [35].

4.5 Implementation detail

Table 3 shows the accuracy comparison results of the proposed method and the above seven methods. Among them, B represents the data set in the Book domain; E represents the data set in the Electronic domain; D represents the data set in the DVD domain; K represents the data set in the Kitchen domain. B-D indicates that the source domain is Book and the target domain is DVD, and so on.

The experimental results in Table 3 are obtained by using 50 labeled target domain samples to fine-tune the network. These 50 samples were drawn from the target domain and comprise 25 positive and 25 negative examples. As shown in the table, the attention network based on feature sequences is effective: compared with the above seven methods, its average accuracy is the highest, at 81.52%. We use only 1/40 of the labeled data in the target domain to fine-tune the network and still achieve higher performance.
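
The balanced fine-tuning split described here can be sketched as follows. The helper name `balanced_split`, the random seed, and the placeholder reviews are assumptions; only the sizes (2,000 target reviews, $k=50$, 25 per class, the remainder used for prediction) come from the text.

```python
import random
from typing import List, Tuple

Example = Tuple[str, int]  # (review text, label) with 1 = positive, 0 = negative


def balanced_split(data: List[Example], k: int, seed: int = 0) -> Tuple[List[Example], List[Example]]:
    """Draw k // 2 positive and k // 2 negative examples for fine-tuning; keep the rest for testing."""
    rng = random.Random(seed)
    pos = [i for i, (_, y) in enumerate(data) if y == 1]
    neg = [i for i, (_, y) in enumerate(data) if y == 0]
    chosen = set(rng.sample(pos, k // 2) + rng.sample(neg, k // 2))
    fine_tune = [ex for i, ex in enumerate(data) if i in chosen]
    held_out = [ex for i, ex in enumerate(data) if i not in chosen]
    return fine_tune, held_out


# Placeholder target domain with 1,000 positive and 1,000 negative reviews.
target = [(f"review {i}", i % 2) for i in range(2000)]
fine_tune, held_out = balanced_split(target, k=50)
```

With $k=50$ the fine-tuning set is 50 of 2,000 reviews, that is 1/40 of the target domain, and the remaining 1,950 reviews are left for prediction.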

Table 3
Comparison of different methods for cross-domain sentiment classification with k $=$ 50

Source-target	SCL	SCL-ML	SFA	ITIAD	AE-SCL-LR	DANN	CNN_FT	ANFS
B-D	0.815	0.788	0.826	0.805	0.811	0.737	0.833	0.842
B-E	0.790	0.719	0.720	0.730	0.768	0.680	0.782	0.811
B-K	0.792	0.772	0.780	0.720	0.801	0.788	0.805	0.801
D-B	0.765	0.732	0.826	0.670	0.773	0.750	0.819	0.826
D-E	0.732	0.715	0.770	0.740	0.781	0.745	0.807	0.804
D-K	0.800	0.740	0.810	0.710	0.803	0.776	0.814	0.827
E-B	0.756	0.685	0.755	0.683	0.712	0.700	0.775	0.774
E-K	0.840	0.829	0.871	0.857	0.846	0.845	0.853	0.878
K-B	0.712	0.693	0.740	0.679	0.730	0.712	0.771	0.787
K-D	0.751	0.720	0.770	0.740	0.763	0.714	0.786	0.802
K-E	0.810	0.822	0.846	0.800	0.840	0.821	0.841	0.853
Average	0.775	0.743	0.790	0.742	0.781	0.748	0.806	0.815

Compared with the classic SCL method, the average accuracy is improved by only 4%, but our network does not require any manual operation, which effectively saves time. Compared with the ITIAD method, the average accuracy of our model is improved by 7.3%, which suggests that deep learning methods outperform traditional machine learning methods. From the table, we can see that when the source domain is the Electronic domain and the target domain is the Kitchen domain, the experimental results are generally high, which indicates that the difference between the Electronic and Kitchen domains is small. Compared with the SCL-ML, SFA, DANN, AE-SCL-LR, and CNN_FT methods, the average accuracy of our model is improved by 7.1%, 2.4%, 6.7%, 3.4%, and 1%, respectively.

In our previous work [35], a CNN was used for cross-domain sentiment classification. After the model was trained on the source domain, it was transferred to the target domain with the weights of the convolutional layer shared. By fine-tuning the fully connected layer with a small amount of target domain data, the method achieved relatively good performance. In follow-up research, we found that although a CNN is a strong feature extractor for text data, it extracts only local features and does not take into account the context or the important semantic features of the text. We therefore proposed ANFS, which not only extracts the local features of the data but also fully considers the dependencies between features. In addition, it uses the attention mechanism to focus on the important semantic features.
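To illustrate how an attention mechanism can focus on important features in a sequence, the sketch below weights each feature vector by a softmax over its score and returns the weighted sum. It is only a generic illustration: the dot-product scoring, the `query` vector and the toy feature values are assumptions, and the scoring function actually used in ANFS may differ.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention_pool(features, query):
    """Score each feature vector by its dot product with `query`,
    normalise the scores with softmax, and return (weights, weighted sum)."""
    scores = [sum(f_i * q_i for f_i, q_i in zip(f, query)) for f in features]
    weights = softmax(scores)
    dim = len(features[0])
    pooled = [sum(w * f[d] for w, f in zip(weights, features)) for d in range(dim)]
    return weights, pooled


# Three hypothetical 2-dimensional feature vectors (for example, BiLSTM outputs).
features = [[0.1, 0.0], [2.0, 1.0], [0.2, 0.1]]
query = [1.0, 1.0]  # hypothetical learned context vector
weights, pooled = attention_pool(features, query)
print(weights, pooled)
```

In this toy input the second vector has the largest score, so it receives the largest weight and dominates the pooled representation.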

Compared with other methods, the advantage of ANFS is that the network automatically extracts more comprehensive features of the text. The CNN extracts the local features of the data, the BiLSTM captures the contextual dependencies between the features, and the attention mechanism focuses on the important semantic features, which effectively improves the classification accuracy of the model. During model transfer, the parameters of the convolutional layer are shared, and the BiLSTM layer and the attention layer are fine-tuned with a very small amount of target domain data, which further improves the classification accuracy.
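The transfer step described above can be sketched as an update rule that leaves the shared convolutional parameters untouched and changes only the remaining layers. The example uses plain Python lists and a single SGD step; the group names, parameter values, gradients and learning rate are all hypothetical and serve only to show the selective update, not the actual training setup.

```python
def fine_tune_step(params, grads, frozen_groups, lr=0.01):
    """Apply one SGD step to every parameter group that is not frozen."""
    updated = {}
    for group, values in params.items():
        if group in frozen_groups:
            # Shared layer: copied unchanged from the source-domain model.
            updated[group] = list(values)
        else:
            updated[group] = [v - lr * g for v, g in zip(values, grads[group])]
    return updated


# Hypothetical flattened parameters of a model trained on the source domain.
source_params = {
    "conv": [0.5, -0.3],
    "bilstm": [0.1, 0.2],
    "attention": [0.05],
}
# Hypothetical gradients computed on a mini-batch of target-domain data.
target_grads = {
    "conv": [1.0, 1.0],
    "bilstm": [0.5, -0.5],
    "attention": [2.0],
}

target_params = fine_tune_step(source_params, target_grads, frozen_groups={"conv"})
print(target_params)
```

In a deep learning framework the same effect is usually obtained by marking the convolutional layers as non-trainable before fine-tuning.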

Figure 3. The impact of the amount of labeled target domain data on the model.

Figure 4. The impact of the number of iterations on model performance.

We also test the effect of different amounts of labeled target domain data. The amount $k$ of target domain data takes the values 50, 100, 200 and 400, so the data used for fine-tuning never exceeds 20% of the total data in the target domain. Figure 3 shows the experimental results of fine-tuning the network with these larger amounts of target domain data.
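As a rough illustration of how such fine-tuning subsets could be drawn, the sketch below samples $k$ labeled target-domain reviews with equal numbers of positive and negative examples. The helper name `sample_balanced`, the fixed seed and the label encoding (1 for positive, 0 for negative) are assumptions. The target-domain size of 2000 labeled reviews is inferred from the statement that 50 samples are 1/40 of the labeled data, and drawing balanced subsets for every $k$ is assumed by analogy with the 25/25 split stated for $k = 50$.

```python
import random


def sample_balanced(labels, k, seed=0):
    """Return indices of k samples: k/2 positive (1) and k/2 negative (0)."""
    if k % 2 != 0:
        raise ValueError("k must be even for a balanced sample")
    rng = random.Random(seed)
    positives = [i for i, y in enumerate(labels) if y == 1]
    negatives = [i for i, y in enumerate(labels) if y == 0]
    return rng.sample(positives, k // 2) + rng.sample(negatives, k // 2)


# Assumed target domain: 2000 labeled reviews, half positive and half negative.
target_labels = [1] * 1000 + [0] * 1000

for k in (50, 100, 200, 400):
    chosen = sample_balanced(target_labels, k)
    share = k / len(target_labels)
    print(f"k={k}: {len(chosen)} samples, {share:.1%} of the target-domain data")
```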

The number of iterations is another important parameter of the network, with a great impact on both the performance and the running time of our model. We choose three sub-experiments, B-D, E-B, and K-B, to verify the impact of the number of iterations on model performance. Fig. 4 shows that the network performs best when the number of iterations is between 6 and 13. Beyond 13 iterations, the network overfits and the model performance decreases. Therefore, we set the number of iterations in the range 6–13.
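One common way to choose the number of iterations automatically is to monitor accuracy on held-out data and stop once it no longer improves. The sketch below shows this idea under stated assumptions: the paper does not say that early stopping was used, the `patience` value is arbitrary, and the accuracy curve is synthetic and does not reproduce any result from Fig. 4.

```python
def best_iteration(val_accuracy):
    """Return the 1-based iteration with the highest validation accuracy
    (the earliest one in case of ties)."""
    return val_accuracy.index(max(val_accuracy)) + 1


def stop_iteration(val_accuracy, patience=3):
    """Return the 1-based iteration at which training would stop: the first
    point where accuracy has not improved for `patience` iterations in a row."""
    best, since_best = float("-inf"), 0
    for i, acc in enumerate(val_accuracy, start=1):
        if acc > best:
            best, since_best = acc, 0
        else:
            since_best += 1
            if since_best >= patience:
                return i
    return len(val_accuracy)


# Synthetic curve for illustration only; these are NOT results from the paper.
curve = [0.70, 0.74, 0.77, 0.79, 0.80, 0.81, 0.815, 0.82, 0.818, 0.815, 0.81, 0.80]
print(best_iteration(curve), stop_iteration(curve))
```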

5. Conclusion

This paper proposes an attention network based on feature sequences for cross-domain sentiment classification. First, we extract the local features of the data with a convolutional neural network and use a BiLSTM to capture the long-term dependencies between features. Then, we use the attention mechanism to assign weights to the important sentiment features, which makes the features more comprehensive. Finally, we transfer the network trained on the source domain data to the target domain and fine-tune it with a small amount of target domain data. Our experimental results show that the method can effectively complete cross-domain sentiment classification tasks.

In the future, we will use more data to train the network in order to improve its generalization ability, and we will combine it with instance-based transfer learning methods to solve cross-domain sentiment classification problems.

Acknowledgments

We thank the reviewers and editors for their valuable suggestions and comments, which helped to improve this work. This research was funded by the National Natural Science Foundation of China (No. 61876031), the Natural Science Foundation of Liaoning Province, China (20180550921, 2019-ZD-0175) and the Scientific Research Fund Project of the Education Department of Liaoning Province (LJYT201906).

References

1. Pan S.J. and Yang, A survey on transfer learning, IEEE Transactions on Knowledge and Data Engineering 22(10) (2009), 1345–1359.
2. Blitzer, Dredze and Pereira, Biographies, bollywood, boom-boxes and blenders: Domain adaptation for sentiment classification, in: Proc. 45th Annual Meeting Assoc. Comput. Linguistics, ACL’07, Prague, Czech Republic, 2007, pp. 187–205.
3. Pan S.J., Sun J.T., Yang and Chen, Cross-domain sentiment classification via spectral feature alignment, in: Proceedings of the 19th International Conference on World Wide Web, 2010, April, pp. 751–760.
4. Yosinski, Clune, Bengio and Lipson, How transferable are features in deep neural networks? in: Advances in Neural Information Processing Systems, 2014, pp. 3320–3328.
5. Liu, Shi and Jia, A survey of sentiment analysis based on transfer learning, IEEE Access 7 (2019), 85401–85412.
6. Ganin, Ustinova, Ajakan, Germain, Larochelle, Laviolette and Lempitsky, Domain-adversarial training of neural networks, The Journal of Machine Learning Research 17(1) (2016), 2096–2030.
7. Zou, Cheng, Yang and Zhou, Adversarial category alignment network for cross-domain sentiment classification, in: Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Vol. 1, Long and Short Papers, 2019, June, pp. 2496–2508.
8. Wei et al., Low-resource cross-domain product review sentiment classification based on a CNN with an auxiliary large-scale corpus, Algorithms 10(3) (2017), 81.
9. Luo, Chen and Li, Cross-domain sentiment classification via a bifurcated-LSTM, in: Pacific-Asia Conference on Knowledge Discovery and Data Mining, Springer, Cham, 2018, June, pp. 681–693.
10. Howard and Ruder, Universal language model fine-tuning for text classification, arXiv preprint arXiv:1801.06146, 2018.
11. Lecun and Bottou, Gradient-based learning applied to document recognition, Proceedings of the IEEE 86(11) (1998), 2278–2324.
12. Hochreiter and Schmidhuber, Long short-term memory, Neural Computation 9(8) (1997), 1735–1780.
13. Bahdanau, Cho and Bengio, Neural machine translation by jointly learning to align and translate, arXiv: Computation and Language, 2014.
14. Tan, Sun, Kong, Zhang, Yang and Liu, A survey on deep transfer learning, in: International Conference on Artificial Neural Networks, Springer, Cham, 2018, October, pp. 270–279.
15. Dai, Yang and Xue G.R., Boosting for transfer learning, in: Proceedings of the 24th International Conference on Machine Learning, ACM, 2007, pp. 193–200.
16. Wang, Chen, Thirunarayan and Sheth A.P., Adaptive training instance selection for cross-domain emotion identification, in: Proceedings of the International Conference on Web Intelligence, 2017, August, pp. 525–532.
17. Pan S.J., Tsang I.W., Kwok J.T. and Yang, Domain adaptation via transfer component analysis, IEEE Transactions on Neural Networks 22(2) (2011), 199–210.
18. Oquab, Léon, Laptev and Sivic, Learning and transferring mid-level image representations using convolutional neural networks, 2014.
19. Long, Zhu, Wang and Jordan M.I., Unsupervised domain adaptation with residual transfer networks, in: Advances in Neural Information Processing Systems, 2016, pp. 136–144.
20. Goodfellow, Pouget-Abadie, Mirza, Warde-Farley, Ozair and Bengio, Generative adversarial nets, in: Advances in Neural Information Processing Systems, 2014, pp. 2672–2680.
21. Ganin and Lempitsky, Unsupervised domain adaptation by backpropagation, arXiv preprint arXiv:1409.7495, 2014.
22. Tzeng, Hoffman, Darrell and Saenko, Simultaneous deep transfer across domains and tasks, in: Proceedings of the IEEE International Conference on Computer Vision, 2015, pp. 4068–4076.
23. Tzeng, Hoffman, Saenko and Darrell, Adversarial discriminative domain adaptation, 2017.
24. Long, Cao, Wang and Jordan M.I., Domain adaptation with randomized multilinear adversarial networks, arXiv preprint arXiv:1705.10667, 2017.
25. Jia X.B., Jin, Cardiff and Bhanu, Words alignment based on association rules for cross-domain sentiment classification, Frontiers of Information Technology & Electronic Engineering 19(2) (2018), 260–272.
26. Jia, Jin and Chen, Domain alignment based on multi-viewpoint domain-shared feature for cross-domain sentiment classification, Journal of Computer Research and Development 55(11) (2018), 2439–2451.
27. Myagmar and Kimura, Cross-domain sentiment classification with bidirectional contextualized transformer language models, IEEE Access 7 (2019), 163219–163230.
28. Lin C.K., Lee Y.Y., C.H. and Chen H.H., Taxonomy-based regression model for cross-domain sentiment classification, in: Proceedings of the 22nd ACM International Conference on Information & Knowledge Management, 2013, October, pp. 1557–1560.
29. Tang, Qin and Liu, Deep learning for sentiment analysis: successful approaches and future challenges, Wiley Interdisciplinary Reviews: Data Mining and Knowledge Discovery 5(6) (2015), 292–303.
30. and Jiang, Learning sentence embeddings with auxiliary tasks for cross-domain sentiment classification, Association for Computational Linguistics, 2016.
31. Zhang, Wei and Yang, End-to-end adversarial memory network for cross-domain sentiment classification, in: Twenty-sixth International Joint Conference on Artificial Intelligence, 2017.
32. Hinton G.E., Srivastava, Krizhevsky, Sutskever and Salakhutdinov R.R., Improving neural networks by preventing co-adaptation of feature detectors, arXiv preprint arXiv:1207.0580, 2012.
33. Sharma, Bhattacharyya, Dandapat and Bhatt H.S., Identifying transferable information across domains for cross-domain sentiment classification, in: Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics, Vol. 1: Long Papers, 2018, July, pp. 968–978.
34. Ziser and Reichart, Neural structural correspondence learning for domain adaptation, arXiv preprint arXiv:1610.01588, 2016.
35. Meng, Long, Zhao and Liu, Cross-domain text sentiment analysis based on CNN_FT method, Information 10(5) (2019), 162.
