Fusion of part-of-speech vectors and attention mechanisms for cross-domain sentiment analysis

Abstract

The central challenge of cross-domain sentiment analysis is that classifiers trained in one domain are very sensitive to the discrepancy between domains: a sentiment classifier trained in the source domain usually performs poorly in the target domain. One of the main strategies for this problem is the pivot-based strategy, in which the feature representation is an important component. However, previous pivot-based models did not use part-of-speech information to guide the learning of feature representations and feature mappings. We therefore present a model that fuses part-of-speech vectors with an attention mechanism (FAM). In our model, we fuse part-of-speech vectors and feature word embeddings as the representation of features, giving deep semantics to the mapping features. We then adopt a Multi-Head attention mechanism to train the cross-domain sentiment classifier and capture the connections between different features. The results of 12 groups of comparative experiments on the Amazon dataset show that our model achieves the highest average accuracy among all models compared in this paper.

Keywords

Part-of-speech vectors, Multi-Head attention mechanism, cross-domain sentiment analysis

1 Introduction

Sentiment classification aims to automatically predict the sentiment polarity (e.g. negative, neutral or positive) of users' online texts such as book reviews. The rapid growth of online user reviews has attracted wide research attention [1−4], and significant progress has been made on supervised sentiment analysis algorithms for a specific domain [5−7]. However, user reviews usually come from various domains, and some sentiment words have opposite polarities in different domains. For example, the word “lightweight” in the electronics domain describes products that are easy to carry and conveys a positive sentiment. In the movie domain, by contrast, the same word describes superficial content and usually conveys the negative sentiment of the audience.

This domain discrepancy has motivated a large body of research on cross-domain sentiment classification, which applies knowledge from the source domain to the target domain. Blitzer et al. [8] introduced Structural Correspondence Learning (SCL) and defined pivot features as words that appear frequently in both domains and have the same sentiment polarity across domains (e.g. “good” or “terrible”). In that work, SCL was used to establish the correlation between pivot features and non-pivot features. Pan et al. [9] developed Spectral Feature Alignment (SFA), an algorithm that aligns non-pivot features into unified clusters. Glorot et al. [10] obtained a low-dimensional feature representation with Stacked Denoising Autoencoders (SDA) and then trained a linear support vector machine classifier on that feature space. Bollegala et al. [12] used an unsupervised method to learn domain-specific feature representations, optimizing an objective function that constrains pivot features to have the same representation in different domains. Li et al. [13] proposed the Adversarial Memory Network (AMN), which automatically identifies pivot features by incorporating memory networks into an adversarial training network. In line with Li et al. [13], Qu et al. [23] also focused on minimizing the discrepancy between domains and proposed an adversarial category alignment network (ACAN).

However, the feature vectors used in the above studies do not contain deep semantic information. During feature mapping, the model therefore cannot learn the deep semantics of the features and cannot construct a high-quality feature space. We propose a model (FAM) that fuses part-of-speech vectors into the feature representation and adopts an attention mechanism to construct the sentiment classifier. We combine part-of-speech vectors and feature word embeddings as the representation of features. The part-of-speech information reflects two regularities: pivot features are mostly adjectives or nouns, and an adjective in the text can be followed by a noun but not by an adverb. Using part-of-speech information can deepen the model’s understanding of semantics, help it learn the sequence relationships between different features, and yield a more accurate mapping between non-pivot features and pivot features. Furthermore, because a traditional machine learning classifier cannot fit the features and sentiment labels in the new feature space well, we train the sentiment classifier with a Multi-Head attention mechanism. Self-attention, an important component of the Multi-Head attention mechanism, captures the dependencies across the entire sequence by relating the information at the current position to all other positions, and obtains the connections between different features to improve the performance of the sentiment classifier.

The contributions of this work are as follows:

We fuse part-of-speech vectors and feature word embeddings as the representation of features, giving deep semantics to mapping features.

We use the Multi-Head attention mechanism to aggregate different features and train the cross-domain sentiment classifier simultaneously.

Experiments conducted on the Amazon dataset show that our proposed model FAM outperforms competitive baselines.

2 Related work

2.1 Pivot-based strategies

Pivot-based methods have been successfully applied to the domain adaptation problem. Blitzer et al. [14] adopted structural correspondence learning (SCL) to construct a shared feature space across domains. Subsequently, Blitzer et al. [8] refined their SCL method and collected a benchmark dataset for cross-domain sentiment research. Specifically, they trained a sentiment classifier on the concatenation of the original feature representation and the SCL representation. Pan et al. [9] also proposed a method based on pivots and non-pivots, but adopted a graph-based method to cluster non-pivot features and expand the original feature space. Bollegala et al. [15] proposed an unsupervised method that, given the distribution of a word in one domain, predicts its distribution in another domain. Later, Bollegala et al. [12] proposed an unsupervised, domain-specific word representation learning method. Yu and Jiang [16] trained a convolutional recurrent neural network model as a sentiment classifier while simultaneously predicting the presence of pivots. Ziser and Reichart [17] combined SCL with an autoencoder and proposed neural structural correspondence learning (NSCL), which learns context representations to predict the presence of pivots. Sarma et al. [18] adopted Canonical Correlation Analysis (CCA) to combine generic embeddings and Domain Specific (DS) embeddings into Domain Adapted (DA) embeddings.

In this paper, we also emphasize the role of the word embeddings of features. Unlike these works, we fuse the part-of-speech vectors of features with their word embeddings, which enriches the semantics of the pivot feature vectors.

2.2 Autoencoder strategies

Glorot et al. [10] adopted stacked denoising autoencoders (SDA) to create a lower-dimensional representation of their data, which was used to augment the feature space. Chen et al. [11] extended SDA with a series of closed-form linear transformations and presented Marginalized Denoising Autoencoders (mSDA). Clinchant et al. [19] proposed a different method that adds regularization to the denoising autoencoders.

Overall, autoencoder models perform better than SCL models, but they have worse interpretability, longer training times, and a smaller feature space.

2.3 Part-of-speech in sentiment analysis

Various approaches to sentiment analysis make use of part-of-speech information. Yin et al. [25] proposed a model that combines part-of-speech tags and word features in an interactive way to improve aspect term extraction in aspect-based sentiment analysis. Cheng et al. [26] weighted different parts of speech through an attention mechanism and obtained encouraging results in sentiment classification. Thanh et al. [27] used the part-of-speech of online conversational text in sentiment analysis to analyze users’ behavior. These works show that part-of-speech information has a positive effect on sentiment analysis. In this work, we propose to use part-of-speech information to improve cross-domain sentiment analysis.

2.4 Attention-based strategies in sentiment analysis

The attention mechanism has shown strong performance in sentiment analysis tasks in recent years. Wang et al. [2] trained a sentiment classifier by combining aspect embeddings with an attention mechanism and presented an attention-based LSTM with aspect embedding (ATAE-LSTM). Ma et al. [20] obtained interactive context information by using an attention mechanism with target information and proposed the interactive attention networks (IAN). Tay et al. [21] adopted two methods, circular convolution and circular correlation, to fuse word embeddings and aspect vectors, and then fed the fused vectors into the attention mechanism to train a sentiment classifier.

Given the good performance of the attention mechanism in sentiment analysis, we use it to train the cross-domain sentiment classifier. In this paper, the sentiment classifier is trained with a Multi-Head attention network, and the accuracy of sentiment classification is then tested.

3 Model

3.1 Task description

The task we address is cross-domain sentiment analysis: the goal is for a sentiment classifier trained in the source domain to retain good classification performance in the target domain. Formally, the source domain, denoted $D_s$, contains abundant labeled data $src\_data = \{(x_{s}^{i}, y_{s}^{i})\}$, where $x_{s}^{i}$ and $y_{s}^{i}$ are a sample in the source domain and its sentiment label, respectively. The target domain, denoted $D_t$, contains almost no labeled data; its data $dst\_data = \{x_{t}^{i}\}$ serve as the test set in the testing stage.

The overall model architecture is illustrated in Fig. 1. Our model contains two subnetworks: on the left is a feature mapping network, and on the right is a cross-domain sentiment classification network. The input of the feature mapping network is the labeled source domain data src_data together with the unlabeled data unlab_data of the domains; we denote the combination as map_data. The input of the cross-domain sentiment classification network is src_data. After the feature mapping network is trained, the weight parameters of the fusion layer and the feature mapping layer are saved. The mapping features of the cross-domain sentiment classification network are then obtained by sharing these weight parameters. We use the mapping features as the input of the attention mechanism to train the sentiment classifier. In the next few sections, we describe our model layer by layer.

Fig. 1

The architecture of our proposed FAM. From a horizontal perspective, our model consists of three parts: the fusion layer, feature mapping layer, and the label prediction module.

3.2 Fusion layer

Part-of-speech is a very important aspect of natural language, but it is rarely used in cross-domain sentiment classification. In our model, part-of-speech tags are the basis for generating part-of-speech vectors. On the one hand, part-of-speech information naturally captures the collocation order, regularities, and relationships among words. For example, pivot features are mostly adjectives and nouns, and an adjective in the text can be followed by a noun but not by an adverb. Part-of-speech not only affects the sentence structure of the text, but also determines the order of features. On the other hand, part-of-speech information helps the model understand the semantics more deeply, learn the sequence relationships between different features, and learn the mapping between non-pivot features and pivot features more completely.

We therefore fuse each feature with its part-of-speech information. Note that when deep learning is used for cross-domain sentiment classification, the features must first be vectorized with word embeddings. The word embeddings used in this paper are 300-dimensional vectors pre-trained with GloVe. We obtain the part-of-speech tags with the pos_tag tool in the NLTK package and use these tags to generate 300-dimensional part-of-speech vectors. As illustrated in Fig. 1, for a sentence of length n, we denote the word embeddings as $v^{w} = \{v_{1}^{w}, v_{2}^{w}, \dots, v_{n}^{w}\}$ and the part-of-speech vectors as $v^{p} = \{v_{1}^{p}, v_{2}^{p}, \dots, v_{n}^{p}\}$. We concatenate the word vector $v^{w}$ and the part-of-speech vector $v^{p}$ to obtain the fused representation of the feature, $v^{f}$.
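The concatenation step can be sketched as follows. This is a minimal illustration under stated assumptions, not the implementation used in the paper: the 4-dimensional toy vectors stand in for the 300-dimensional GloVe and part-of-speech vectors, the lookup tables `word_vecs` and `pos_vecs` and the helper `fuse` are hypothetical, and the Penn Treebank style tags are written by hand instead of being produced by a tagger.

```python
# Toy lookup tables (assumed values; the paper uses 300-dimensional vectors).
word_vecs = {
    "nice":    [0.1, 0.3, -0.2, 0.5],
    "durable": [0.0, 0.4, -0.1, 0.6],
    "pan":     [0.7, -0.3, 0.2, 0.0],
}
pos_vecs = {
    "JJ": [1.0, 0.0, 0.0, 0.0],  # adjective
    "NN": [0.0, 1.0, 0.0, 0.0],  # noun
}

def fuse(tagged_sentence, word_vecs, pos_vecs):
    """Concatenate each word embedding v_i^w with its POS vector v_i^p."""
    fused = []
    for word, tag in tagged_sentence:
        # List concatenation: word part first, part-of-speech part second.
        fused.append(word_vecs[word] + pos_vecs[tag])
    return fused

# Hand-written (word, tag) pairs; a tagger would normally supply the tags.
sentence = [("nice", "JJ"), ("durable", "JJ"), ("pan", "NN")]
v_f = fuse(sentence, word_vecs, pos_vecs)
```

With two 300-dimensional inputs, concatenation of this kind would give a 600-dimensional fused vector per word.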

3.3 Feature mapping layer

3.3.1 Input of feature mapping network

The weights of the fusion layer and the feature mapping layer in the cross-domain sentiment classification network come from the feature mapping network. After the feature mapping network is trained, the weight parameters of the fusion layer and the feature mapping layer are saved. The mapping features of the cross-domain sentiment classification network are then obtained by sharing these weight parameters. However, the inputs of the two networks differ. The input of the feature mapping network is map_data, i.e. the combination of src_data and unlab_data, whereas the input of the cross-domain sentiment classification network is src_data, i.e. the labeled source domain data.

3.3.2 Feature mapping layer

Long short-term memory (LSTM) neural networks can learn sentence structure, capture long-distance information, and learn the associations between different features. In this paper, an LSTM is used to associate pivot features with non-pivot features and then map the features of both domains into the same low-dimensional feature space, reducing the discrepancy across domains. In our model, we take the fused feature representation v^f as the input of the LSTM, and the output is the corresponding mapping feature.

The principle of feature mapping is illustrated in Fig. 2. The examples in Fig. 2 are book reviews and kitchen reviews. The LSTM learns the connection between “nice”, usually a positive adjective in both the source and target domains and hence a pivot feature, and “durable”, an adjective that is often used to describe kitchen appliances but not books, and hence a non-pivot feature. In this way, the LSTM is used to predict whether a feature is a pivot feature, and the pivot and non-pivot features are mapped into the same feature space. In our model, the fused representation v^f is encoded by the LSTM network, and the hidden states h = h₁, h₂, ⋯ , h_n are the mapping features.

Fig. 2

Principle of feature mapping based on LSTM.

3.3.3 Output of feature mapping network

The mapping features h = h₁, h₂, ⋯ , h_n are then fed into a softmax layer, which converts each h_t into a probability distribution. Given an input word, the LSTM should predict the next word in the pivot feature sequence. The probability that the next word is the i-th word is calculated as follows.

$p (y_{t} = i) = \frac{e^{h_{t} \cdot w_{i}}}{\sum_{k = 1}^{| V |} e^{h_{t} \cdot w_{k}}}$ (1)

where $w_{i}$ and $w_{k}$ are weight vectors and |V| is the size of the lexicon. The loss function in this paper is the cross-entropy loss.
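Equation (1) and the cross-entropy loss can be illustrated with a small sketch. The hidden state `h_t`, the weight vectors in `W`, and the function names are hypothetical toy values chosen for illustration; in the model they would come from the trained LSTM and its softmax layer.

```python
import math

def next_word_distribution(h_t, W):
    """Return p(y_t = i) for every word i, given the hidden state h_t.

    W is a list of |V| weight vectors w_i, one per lexicon entry.
    """
    scores = [sum(h * w for h, w in zip(h_t, w_i)) for w_i in W]
    m = max(scores)  # subtracting the maximum keeps exp() numerically stable
    exps = [math.exp(s - m) for s in scores]
    z = sum(exps)
    return [e / z for e in exps]

def cross_entropy(probs, target_index):
    """Cross-entropy loss for one prediction with a one-hot target."""
    return -math.log(probs[target_index])

h_t = [0.5, -1.0, 0.25]            # toy hidden state (assumed values)
W = [[1.0, 0.0, 0.0],              # toy lexicon of |V| = 4 words
     [0.0, 1.0, 0.0],
     [0.5, 0.5, 0.5],
     [0.0, 0.0, 2.0]]
p = next_word_distribution(h_t, W)
loss = cross_entropy(p, target_index=0)
```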

Through the feature mapping network module, the connection between the non-pivot features and the pivot features is established, and the weights of the fusion layer and the feature mapping layer are saved so that they can be shared with the sentiment classification module.

3.4 Sentiment label prediction module

As illustrated in Fig. 1, in our model, the sentiment label prediction module consists of three parts: input layer, Multi-Head attention mechanism layer, and output layer.

3.4.1 Input layer

When the training of the feature mapping network is completed, the weight matrices of the fusion layer and the feature mapping layer are saved. We obtain the mapping features by sharing the weight parameters of the feature mapping network, and then use the mapping features as input to train the sentiment classifier.

3.4.2 Multi-head attention mechanism layer

As also shown in Fig. 1, the input of the Multi-Head attention layer is the mapping features. The Multi-Head attention mechanism is used to obtain the position information of each feature. Specifically, given matrices $Q \in R^{n \times d}$, $K \in R^{n \times d}$ and $V \in R^{n \times d}$, the attention output is calculated as follows.

$Attention (Q, K, V) = softmax (\frac{Q K^{T}}{\sqrt{d}}) V$ (2)

where d is the number of hidden units in the neural network; we set d = 8.

Self-attention, a component of the Multi-Head attention mechanism, is used to capture the complex interactions and dependencies between terms within a sequence and to obtain the connections between different features. Specifically, the information at the current position is related to that at all other positions to capture the dependencies across the entire sequence and extract the relevant information from the sentence. In a sentence, each word performs the attention calculation with every other word.

In our model, we first apply linear transformations to Q, K, and V and then perform the attention calculation. The calculation process is shown in Fig. 3. At the center of the figure is a variant of general attention named Scaled Dot-Product Attention, which is the core computational unit of the Multi-Head attention mechanism. The Scaled Dot-Product Attention is computed several times, once per head, and the linear projections of Q, K, and V differ for each head. For head i, $Q^{'} = Q W_{i}^{Q}$, $K^{'} = K W_{i}^{K}$ and $V^{'} = V W_{i}^{V}$, where each $W_{i}^{*}$ is a parameter matrix. Since the input of the Multi-Head attention mechanism is the mapping features, Q = K = V = h. The output of head i is as follows.

$M_{i} = softmax (\frac{Q^{'} {K^{'}}^{T}}{\sqrt{d}}) V^{'}$ (3)

Fig. 3

The calculation process of Multi-Head attention mechanism layer.

After the calculation has been performed H times, the outputs M_i are concatenated. In this paper H = 16.

$M = Concat (M_{1}, M_{2}, \dots, M_{H})$ (4)
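Equations (2) to (4) can be sketched in plain Python as below. This is an illustrative sketch under stated assumptions rather than the authors' code: the projection matrices are random stand-ins for learned parameters, the sequence length n = 5 and input width 12 are toy values, and d = 8 is interpreted here as the per-head projection width. All function names are hypothetical.

```python
import math
import random

def matmul(A, B):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

def transpose(A):
    return [list(col) for col in zip(*A)]

def softmax_rows(A):
    """Apply softmax independently to every row of A."""
    out = []
    for row in A:
        m = max(row)
        exps = [math.exp(x - m) for x in row]
        z = sum(exps)
        out.append([e / z for e in exps])
    return out

def attention(Q, K, V):
    """Scaled dot-product attention, Eq. (2) / Eq. (3)."""
    d = len(Q[0])
    scores = matmul(Q, transpose(K))
    scaled = [[s / math.sqrt(d) for s in row] for row in scores]
    return matmul(softmax_rows(scaled), V)

def multi_head(h, heads):
    """heads holds one (W_q, W_k, W_v) triple of projection matrices per head."""
    outputs = [attention(matmul(h, Wq), matmul(h, Wk), matmul(h, Wv))
               for Wq, Wk, Wv in heads]
    # Concatenate M_1..M_H position by position, Eq. (4).
    return [sum((head_out[t] for head_out in outputs), []) for t in range(len(h))]

def rand_matrix(rows, cols):
    return [[random.uniform(-1.0, 1.0) for _ in range(cols)] for _ in range(rows)]

random.seed(0)
n, width, d, H = 5, 12, 8, 16      # d and H follow the text; n and width are toy values
h = rand_matrix(n, width)          # stand-in for the mapping features
heads = [(rand_matrix(width, d), rand_matrix(width, d), rand_matrix(width, d))
         for _ in range(H)]
M = multi_head(h, heads)           # n rows, each of length H * d
```

Because the same `h` is projected three times per head, each head sees queries, keys and values derived from the same mapping features, which is what makes this self-attention.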

3.4.3 Output layer

A mean pooling operation is applied to M to obtain a vector C. The pooled vector C is then fed into a softmax layer to obtain the sentiment label, as shown in the following formula.

$y = softmax (W_{s} C + b_{s})$ (5)

where W_s and b_s are the parameters of the softmax layer.
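The output layer of Eq. (5) amounts to averaging over positions and applying a softmax classifier. The sketch below uses hypothetical toy values for M, W_s and b_s and two sentiment classes; the function names are illustrative only.

```python
import math

def mean_pool(M):
    """Average the rows of M (one row per position) into a single vector C."""
    n = len(M)
    return [sum(col) / n for col in zip(*M)]

def softmax(z):
    m = max(z)
    exps = [math.exp(x - m) for x in z]
    s = sum(exps)
    return [e / s for e in exps]

def predict(M, W_s, b_s):
    """y = softmax(W_s C + b_s), with C the mean-pooled attention output."""
    C = mean_pool(M)
    logits = [sum(w * c for w, c in zip(row, C)) + b for row, b in zip(W_s, b_s)]
    return softmax(logits)

M = [[1.0, 2.0, 0.0],          # toy attention output for n = 2 positions
     [3.0, 0.0, 2.0]]
W_s = [[1.0, 0.0, 0.0],        # toy weights, one row per sentiment class
       [0.0, 1.0, 0.0]]
b_s = [0.0, 0.0]
y = predict(M, W_s, b_s)       # class probabilities
```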

4 Experiment

4.1 Dataset

We use the Amazon dataset collected by Blitzer et al. [8] to evaluate our proposed method. This dataset consists of reviews from four domains: books (B), DVD (D), kitchen (K), and electronics (E). Each domain has 2,000 labeled reviews, 1,000 positive and 1,000 negative. Each domain also contains a much larger set of unlabeled reviews: 6,000 in the books domain, 34,741 in the DVD domain, 13,153 in the electronics domain, and 16,785 in the kitchen domain. The statistics for this dataset are given in Table 1.

Table 1
Statistics for the Amazon dataset

Domain	Positive reviews	Negative reviews	Unlabeled reviews
B	1,000	1,000	6,000
D	1,000	1,000	34,741
E	1,000	1,000	13,153
K	1,000	1,000	16,785

4.2 Evaluation metrics

4.2.1 Jensen-Shannon (JS) divergence

Measures of the data difference between domains are indispensable in cross-domain sentiment analysis. In our experiments, we choose JS divergence as the indicator of the impact of data discrepancy on the sentiment classifier. The greater the JS divergence, the greater the domain discrepancy. When JS = 0, the feature distributions of the source domain and the target domain are identical. The JS divergence is defined as follows.

$JS (P \| Q) = \frac{1}{2} KL (P \| \frac{P + Q}{2}) + \frac{1}{2} KL (Q \| \frac{P + Q}{2})$ (6)

Here P and Q are the distributions of the source domain and the target domain, respectively, and KL is the Kullback–Leibler (KL) divergence, an asymmetric measure of the difference between two distributions, defined as follows.

$KL (P \| Q) = - \sum_{x} P (x) \log \frac{1}{P (x)} + \sum_{x} P (x) \log \frac{1}{Q (x)}$ (7)

To quantify the domain difference, we selected the 10,000 feature words with the highest frequency across all domains and used their distribution in each domain to calculate the JS divergence between domains. The JS divergence of the four domains in the Amazon dataset is shown in Table 2.
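The JS divergence between two word-frequency distributions can be computed as in the following sketch. The tiny vocabulary and token lists are made-up toy data, the function names are hypothetical, and the natural logarithm is an assumption, since the base of the logarithm is not stated in the text.

```python
import math
from collections import Counter

def distribution(tokens, vocab):
    """Relative frequency of each vocabulary word in a list of tokens."""
    counts = Counter(t for t in tokens if t in vocab)
    total = sum(counts.values())
    return [counts[w] / total for w in vocab]

def kl(p, q):
    """KL(P || Q); terms with P(x) = 0 contribute nothing."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def js(p, q):
    """JS(P || Q) as the average KL divergence to the mixture (P + Q) / 2."""
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

vocab = ["good", "plot", "battery", "sharp"]   # toy stand-in for the top words
books = "good plot good plot plot".split()
kitchen = "good sharp sharp battery".split()
p = distribution(books, vocab)
q = distribution(kitchen, vocab)
d = js(p, q)
```

Unlike the KL divergence, the result is symmetric in its two arguments and, with the natural logarithm, never exceeds log 2.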

Table 2

The JS divergence of the domains in the Amazon dataset

Domain	B	D	K	E
B	0	0.066	0.13	0.136
D		0	0.127	0.137
K			0	0.092
E				0

4.2.2 Accuracy

In this paper, we use accuracy to evaluate the performance of cross-domain sentiment classification models. Accuracy is the percentage of reviews that are classified correctly: among num_all reviews, num_correct are classified correctly. It is defined as follows.

$p = \frac{num\_correct}{num\_all} \times 100\%$ (8)
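As a small worked example of Eq. (8), the following sketch computes accuracy for four made-up predictions; the labels and the function name are illustrative only.

```python
def accuracy(predicted, gold):
    """Percentage of reviews whose predicted label matches the gold label."""
    num_correct = sum(1 for p, g in zip(predicted, gold) if p == g)
    num_all = len(gold)
    return num_correct / num_all * 100

predicted = ["pos", "neg", "pos", "pos"]   # toy model outputs
gold = ["pos", "neg", "neg", "pos"]        # toy reference labels
acc = accuracy(predicted, gold)            # 3 of 4 correct
```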

4.3 Baselines and model

In this paper, we design 12 sets of cross-domain sentiment classification tasks as: B → K, D → K, E → K ; D → B, E → B, K → B ; B → D, E → D, K → D ; B → E, D → E, K → E. The left and right sides of the arrow correspond to the source and target domains respectively. In addition, we design two sets of experiments. The first set of experiments is used to verify the role of the part-of-speech vectors and attention mechanism in our model. The second set of experiments is a comparison experiment between our method and the baseline models. We consider the following comparisons:

DANN: Following the experiment of Qu et al. [23], the batch size is 50, the dropout rate is 0.5, and the learning rate is 0.0001.

AMN: Following the experiment of Li et al. [24], the Transferable Transformer has 4 layers, the size of self-attention is 256, and the learning rate is 7e-4.

mSDA: Following the work of Chen et al. [11], we keep the parameter values of mSDA: the number of reconstructed features is 500, the number of stacked layers is 1, and the corruption probability is 0.1.

AE-SCL-SR: Following the work of Ziser and Reichart [17], we keep the parameter values of AE-SCL-SR: pivot features must occur more than 20 times in both domains, and the number of pivot features is 100.

PBLM_LSTM: Following the work of Ziser et al. [22], we keep the parameter values of PBLM_LSTM: pivot features must occur more than 20 times in both domains, and 100 pivot features are selected. In addition, the pivot feature embeddings are learned by the word2vec model and have dimension 500.

PBLM_CNN: Following the work of Ziser et al. [22], we keep the parameter values of PBLM_CNN: the number of filters is 250 and the size of the convolution kernel is 3.

FAM_CNN: This model is a variant of our model whose classifier is the same as in PBLM_CNN. Unlike PBLM_CNN, its feature representation contains part-of-speech information.

FAM: This is our model. Unlike FAM_CNN, its classifier is the Multi-Head attention mechanism.

4.4 Analysis and results

4.4.1 Experiment of fusion of part-of-speech vectors and attention mechanism

To fairly verify the effectiveness of the part-of-speech vectors and the attention mechanism, we conducted 12 groups of comparative experiments on PBLM_CNN, FAM_CNN, and FAM. The difference between PBLM_CNN and FAM_CNN is that the feature representation of FAM_CNN contains part-of-speech information, while that of PBLM_CNN does not. The difference between FAM and FAM_CNN is that the sentiment classifier of the former is a Multi-Head attention mechanism, while that of the latter is a CNN. The results are summarized in Table 3. FAM_CNN performs better than PBLM_CNN in 10 of the 12 groups of experiments, which demonstrates the effectiveness of part-of-speech information. In particular, on E → B and E → D, FAM_CNN outperforms PBLM_CNN by 4.5 and 2.8 percentage points, respectively. By fusing the part-of-speech vector and the word embedding into one feature vector, we take both part-of-speech information and semantic information into account. The former deepens the understanding of the text content and provides better order and position information between features. The latter helps the model learn the relationships between different features. In the remaining two groups of experiments (B → D and K → D), FAM_CNN performs slightly worse than PBLM_CNN.

Table 3
FAM compared with the baselines (accuracy %). Best scores are in bold

Model	DANN	AMN	mSDA	AE-SCL-SR	PBLM_LSTM	PBLM_CNN	FAM_CNN	FAM
B → K	76.1	79.1	78.8	80.1	80.9	82.5	83.7	85.9
D → K	77.4	78.8	77.4	80.3	83.3	83.2	83.9	83.0
E → K	84.0	85.2	84.5	84.6	87.1	87.8	89.3	89.9
D → B	81.7	76.9	76.1	77.3	80.5	82.5	82.6	82.4
E → B	79.0	73.2	71.9	71.1	70.8	71.4	75.9	76.7
K → B	79.3	72.9	70.0	73.0	73.5	74.2	76.5	77.3
B → D	82.3	83.5	78.3	81.1	82.6	84.2	83.5	84.7
E → D	79.7	75.2	71.0	74.5	77.6	75.0	77.8	79.2
K → D	80.5	77.8	71.4	76.3	78.6	79.8	79.6	80.3
B → E	77.6	76.4	74.6	76.8	74.5	77.6	80.4	81.0
D → E	79.7	77.9	75.0	78.1	80.4	79.6	80.0	81.3
K → E	86.7	85.5	82.4	84.0	85.4	87.1	88.5	87.9
Average	80.3	78.5	76.0	78.1	79.6	80.4	81.8	82.5

A further comparison between FAM_CNN and FAM shows that FAM performs better than FAM_CNN in 9 of the 12 groups of experiments, which indicates that the Multi-Head attention mechanism is effective for the cross-domain sentiment classification task. In particular, on B → K, FAM outperforms FAM_CNN by 2.2 percentage points. Introducing the attention mechanism allows FAM to capture the internal correlations between features and further improve the performance of the sentiment classifier.
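The win counts and averages quoted for this ablation can be re-derived from Table 3 with a few lines of code. The accuracy values below are copied from the PBLM_CNN, FAM_CNN and FAM columns of Table 3; the helper names are our own.

```python
# Accuracy (%) from Table 3, in task order:
# B→K, D→K, E→K, D→B, E→B, K→B, B→D, E→D, K→D, B→E, D→E, K→E
pblm_cnn = [82.5, 83.2, 87.8, 82.5, 71.4, 74.2, 84.2, 75.0, 79.8, 77.6, 79.6, 87.1]
fam_cnn = [83.7, 83.9, 89.3, 82.6, 75.9, 76.5, 83.5, 77.8, 79.6, 80.4, 80.0, 88.5]
fam = [85.9, 83.0, 89.9, 82.4, 76.7, 77.3, 84.7, 79.2, 80.3, 81.0, 81.3, 87.9]

def wins(a, b):
    """Number of tasks on which model a is strictly better than model b."""
    return sum(1 for x, y in zip(a, b) if x > y)

def average(scores):
    """Mean accuracy over the 12 tasks, rounded to one decimal place."""
    return round(sum(scores) / len(scores), 1)

pos_wins = wins(fam_cnn, pblm_cnn)   # effect of adding part-of-speech vectors
att_wins = wins(fam, fam_cnn)        # effect of replacing the CNN with attention
```

Running the sketch gives 10 wins for FAM_CNN over PBLM_CNN and 9 wins for FAM over FAM_CNN, and the rounded averages match the last row of Table 3.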

4.4.2 Comparison experiment with baselines

According to the comparison of FAM with the baselines in Table 3, FAM achieves high classification accuracy. In particular, FAM outperforms mSDA on all 12 tasks, which indicates that establishing the mapping between non-pivot features and pivot features reduces the inter-domain discrepancy and yields better classification accuracy. In addition, PBLM_CNN outperforms AE-SCL-SR on all 12 tasks and PBLM_LSTM does so on 10 of them. This suggests that feature representations based on word embeddings outperform those based on a document-word frequency matrix, and that cross-domain classifiers based on deep neural networks outperform the logistic regression classifier based on traditional machine learning. It further supports the design of our method, in which word embeddings and the attention mechanism have an important influence on the cross-domain classification task. Furthermore, FAM outperforms PBLM_LSTM on 11 of the 12 tasks and PBLM_CNN on 10 of them, which demonstrates that part-of-speech information is useful for cross-domain sentiment classification. We also compare FAM with the adversarial learning models DANN and AMN. Among these three models, FAM achieves the best performance on 8 tasks, while DANN achieves the best performance on only 4. AMN has no obvious advantage.

In terms of average accuracy, FAM improves over all baselines, which demonstrates the value of considering part-of-speech information and using the attention mechanism in the cross-domain sentiment classification task. Additionally, as shown in Table 3, every model achieves its best result on E → K or K → E. The reason may be that the JS divergence between these two domains is relatively small (0.092 in Table 2): the discrepancy between the two domains is small and their feature distributions are similar, so domain adaptation works best.

Conversely, FAM does not perform well on E → B, K → B, E → D and K → D. This can be explained by the large JS divergence between B and E, B and K, D and E, and D and K in Table 2. Another cause of errors is that some sentiment words have different polarities in different domains. For example, “unpredictable” is positive in the book domain (unpredictable plot) but negative in the electronics domain (unpredictable steering). This is a common source of error in current cross-domain sentiment analysis.

5 Conclusion

We have presented an approach that gives features deep semantics by fusing the part-of-speech vectors of features when training the feature embeddings. Furthermore, the Multi-Head attention mechanism is used in this paper to train a cross-domain classifier for the first time. In the 12 groups of comparison experiments on cross-domain sentiment classification tasks, our model FAM shows substantial improvement over the baseline models. Our model achieves improved performance on 10 of the 12 groups, and its average accuracy reaches 82.5%. In future work, we will extend the evaluation to other datasets to verify the classification performance of our model.

Table 1
Statistics for the Amazon dataset

Domain   Positive reviews   Negative reviews   Unlabeled reviews
B        1,000              1,000              6,000
D        1,000              1,000              34,741
E        1,000              1,000              13,153
K        1,000              1,000              16,785