Abstract
A central challenge of cross-domain sentiment analysis is that classifiers trained in a specific domain are very sensitive to the discrepancy between domains: a sentiment classifier trained in the source domain usually performs poorly in the target domain. One of the main strategies for solving this problem is the pivot-based strategy, which treats the feature representation as an important component. However, previous pivot-based models did not use part-of-speech information to guide the learning of feature representations and feature mappings. We therefore present a fused part-of-speech vectors and attention-based model (FAM). In our model, we fuse part-of-speech vectors and feature word embeddings as the representation of features, giving deep semantics to the mapping features. We then adopt a Multi-Head attention mechanism to train the cross-domain sentiment classifier and capture the connections between different features. The results of 12 groups of comparative experiments on the Amazon dataset demonstrate that our model outperforms all baseline models considered in this paper.
Introduction
Sentiment classification aims to automatically predict the sentiment polarity (e.g., negative, neutral, or positive) of users’ online sentiment texts, such as book reviews. The rapid increase in users’ online reviews has attracted wide research attention [1−4], and significant progress has been made on supervised sentiment analysis algorithms for a specific domain [5−7]. However, users’ reviews usually come from various domains, and some sentiment words have opposite polarities in different domains. For example, in the electronics domain the word “lightweight” describes products that are easy to carry and conveys a positive sentiment. In the movie domain, by contrast, the same word describes the superficiality of a movie’s content and usually conveys a negative sentiment.
This domain discrepancy has motivated a large body of research on cross-domain sentiment classification, which applies knowledge from a source domain to a target domain. Blitzer et al. [8] introduced Structural Correspondence Learning (SCL) and defined pivot features as words that appear frequently in both domains and have the same sentiment polarity across domains (e.g., “good” or “terrible”). In that work, SCL was used to establish the correlation between pivot features and non-pivot features. Pan et al. [9] developed the Spectral Feature Alignment (SFA) algorithm to align non-pivot features into unified clusters. Glorot et al. [10] obtained a low-dimensional feature representation with Stacked Denoising Auto-encoders (SDA) and then trained a linear support vector machine classifier on that feature space. Bollegala et al. [12] used an unsupervised method to learn domain-specific feature representations, constraining pivot features to have the same representation in different domains by optimizing the objective function. Li et al. [13] proposed the Adversarial Memory Network (AMN), which automatically identifies pivot features by incorporating memory networks into an adversarial training network. In line with Li et al. [13], Qu et al. [23] also focused on minimizing the discrepancy between domains and proposed an adversarial category alignment network (ACAN).
However, in the feature representation methods of the above studies, the feature vectors do not contain deep semantic information. During feature mapping, the model thus cannot learn the deep semantics of the features and cannot construct a high-quality feature space. Therefore, we propose a model (FAM) that fuses part-of-speech vectors into the feature representation and adopts the attention mechanism to construct the sentiment classifier. We combine part-of-speech vectors and feature word embeddings as the representation of features. The part-of-speech information reflects two regularities: pivot features are mostly adjectives or nouns, and, grammatically, adjectives in the text can be followed by nouns but not by adverbs. Using part-of-speech information can deepen the model’s understanding of semantics, help it learn the sequence relationships between different features, and yield a more accurate mapping between non-pivot features and pivot features. Furthermore, when constructing the sentiment classifier, we consider that traditional machine learning classifiers cannot fit the features and sentiment labels in the new feature space well, so we adopt a deep learning model, the Multi-Head attention mechanism, to train the sentiment classifier. Self-attention, an important component of the Multi-Head attention mechanism, captures the dependencies in the entire sequence by relating the information at the current position to that at all other positions, and obtains the connections between different features to improve the performance of the sentiment classifier.
The contributions of this work are as follows: (1) We fuse part-of-speech vectors and feature word embeddings as the representation of features, giving deep semantics to the mapping features. (2) We use the Multi-Head attention mechanism to aggregate different features and train the cross-domain sentiment classifier simultaneously. (3) Experiments conducted on the Amazon dataset show that our proposed model FAM outperforms competitive baselines.
Related work
Pivot-based strategies
Methods based on pivot features have been successfully applied to the domain adaptation problem. Blitzer et al. [14] adopted structural correspondence learning (SCL) to construct a shared feature space across domains. Subsequently, Blitzer et al. [8] refined their SCL method and collected a benchmark dataset for cross-domain sentiment research. Specifically, they trained a sentiment classifier on the concatenation of the original feature representation and the SCL representation. Pan et al. [9] also proposed a method based on pivots and non-pivots, but adopted a graph-based method to group non-pivot features and expanded the original feature space. Bollegala et al. [15] proposed an unsupervised method that, given the distribution of a word in a specific domain, predicts the distribution of that word in another domain. Later, Bollegala et al. [12] proposed an unsupervised, domain-specific word representation learning method. Yu and Jiang [16] trained a convolutional recurrent neural network model as a sentiment classifier while simultaneously predicting the presence of pivots. Ziser and Reichart [17] combined SCL with an autoencoder and proposed neural structural correspondence learning (NSCL), which learns context representations to predict the presence of pivots. Sarma et al. [18] adopted Canonical Correlation Analysis (CCA) to combine generic embeddings and Domain Specific (DS) embeddings into Domain Adapted (DA) embeddings.
In this paper, we also emphasize the role of the word embeddings of features. Unlike these works, we fuse the part-of-speech vectors of features with their word embeddings, which enriches the semantics of the pivot feature vectors.
Autoencoder strategies
Glorot et al. [10] used stacked denoising autoencoders (SDA) to create a lower-dimensional representation of their data, which was then used to augment the feature space. Chen et al. [11] extended SDA with a series of closed-form linear transformations and presented marginalized stacked denoising autoencoders (mSDA). Clinchant et al. [19] proposed a different method that adds regularization to the denoising autoencoders.
Overall, autoencoder models perform better than SCL models, but they offer worse interpretability, longer training times, and a smaller feature space.
Part-of-speech in sentiment analysis
Part-of-speech information has been used in various approaches to sentiment analysis. Yin et al. [25] proposed a model that combines part-of-speech tags and word features in an interactive way to improve aspect term extraction in aspect-based sentiment analysis. Cheng et al. [26] weighted different parts of speech through an attention mechanism and obtained encouraging results in sentiment classification. In the work of Thanh et al. [27], the part-of-speech information of online conversational text was used in sentiment analysis to study users’ behavior. These works show that part-of-speech information has a positive effect on sentiment analysis. In this work, we propose to use it to improve cross-domain sentiment analysis.
Attention-based strategies in sentiment analysis
Attention mechanisms have shown strong performance in sentiment analysis tasks in recent years. Wang et al. [2] trained a sentiment classifier by combining aspect embeddings with an attention mechanism and presented an attention-based LSTM with aspect embedding (ATAE-LSTM) model. Ma et al. [20] obtained interactive context information by using an attention mechanism with target information and proposed the interactive attention networks (IAN). Tay et al. [21] adopted two methods to fuse word embeddings and aspect vectors, circular convolution and circular correlation, and then fed the fused vectors into the attention mechanism to train a sentiment classifier.
Given the good performance of the attention mechanism in sentiment analysis, we use it to train cross-domain sentiment classifiers. In this paper, the sentiment classifier is trained with a Multi-Head attention network, and the accuracy of sentiment classification is then tested.
Model
Task description
The task we address is cross-domain sentiment analysis: the goal is for a sentiment classifier trained in the source domain to retain good classification performance in the target domain. Formally, the source domain contains rich labeled data; we denote it as $D_s$ and the labeled data as …
The overall model architecture is illustrated in Fig. 1. Our model contains two subnetworks: the left one is a feature mapping network, and the right one is a cross-domain sentiment classification network. The input of the feature mapping network consists of the source domain labeled data src_data and the unlabeled data unlab_data from the domains; we denote this combined input as map_data. The input of the cross-domain sentiment classification network is src_data. After the feature mapping network is trained, the weight parameters of the fusion layer and the feature mapping layer are saved. The mapping features of the cross-domain sentiment classification network are then obtained by sharing these weight parameters. We use the mapping features as the input of the attention mechanism to train the sentiment classifier. In the next few sections, we describe our model layer by layer.

The architecture of our proposed FAM. From a horizontal perspective, our model consists of three parts: the fusion layer, feature mapping layer, and the label prediction module.
Part-of-speech is a very important aspect of natural language, but it is rarely used to solve cross-domain sentiment classification tasks. In our model, it is the basis for generating part-of-speech vectors. On the one hand, part-of-speech information naturally encodes the collocation order, the regularities, and the relationships among words. For example, the pivot features are mostly adjectives and nouns, and the adjectives in the text can be followed by nouns but not by adverbs. Part-of-speech not only affects the sentence structure of the text, but also determines the order of features. On the other hand, part-of-speech information can help the model understand the semantics more deeply, learn the sequence relationships between different features, and learn the mapping relationships between non-pivot features and pivot features more completely.
Intuitively, we fuse each feature with its corresponding part-of-speech information. Note that when using deep learning to solve cross-domain sentiment classification tasks, the features must first be vectorized with word embeddings. The word embeddings used in this paper are 300-dimensional and pre-trained with GloVe. We obtain the part-of-speech tags with the pos_tag tool in the NLTK package and use these tags to generate 300-dimensional part-of-speech vectors. As illustrated in Fig. 1, for a sentence of length $n$, we denote the word embedding as …
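The fusion step can be illustrated with a minimal Python sketch. The fusion operator is not stated in this part of the text, so the sketch assumes element-wise addition of equal-dimension vectors; concatenation would be an equally plausible reading. The lookup tables, the helper names `make_table` and `fuse`, and the 4-dimensional vectors are hypothetical stand-ins for GloVe embeddings and the 300-dimensional part-of-speech vectors. The input uses the (word, tag) pair format that `nltk.pos_tag` returns, but the tagger itself is not called here.

```python
import random

DIM = 4  # the paper uses 300; kept small here for readability

def make_table(keys, dim, seed):
    """Hypothetical lookup table of random vectors (stand-in for GloVe or POS vectors)."""
    rng = random.Random(seed)
    return {k: [rng.uniform(-1.0, 1.0) for _ in range(dim)] for k in keys}

def fuse(tagged_tokens, word_table, pos_table):
    """Return one fused vector per (word, tag) pair.

    Assumption: fusion is element-wise addition of the word embedding
    and the part-of-speech vector.
    """
    fused = []
    for word, tag in tagged_tokens:
        w = word_table[word]
        p = pos_table[tag]
        fused.append([wi + pi for wi, pi in zip(w, p)])
    return fused

# Tagged input in the (word, tag) format produced by nltk.pos_tag.
tagged = [("nice", "JJ"), ("durable", "JJ"), ("pan", "NN")]
word_table = make_table([w for w, _ in tagged], DIM, seed=0)
pos_table = make_table(sorted({t for _, t in tagged}), DIM, seed=1)

v_f = fuse(tagged, word_table, pos_table)
print(len(v_f), len(v_f[0]))  # sentence length n and vector dimension
```

Under this assumption, two words with the same tag share the same part-of-speech component, so the fused vectors carry grammatical category information in addition to word-level semantics.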
Feature mapping layer
Input of feature mapping network
Note that the weights of the fusion layer and the feature mapping layer in the cross-domain sentiment classification network come from the feature mapping network. After the feature mapping network is trained, the weight parameters of the fusion layer and the feature mapping layer are saved, and the mapping features of the cross-domain sentiment classification network are obtained by sharing these weight parameters. However, the inputs of the two networks differ. The input of the feature mapping network is map_data, i.e., the combination of src_data and unlab_data, whereas the input of the cross-domain sentiment classification network is src_data, i.e., the source domain labeled data.
Feature mapping layer
A long short-term memory (LSTM) neural network can learn sentence structure, capture long-distance information, and learn the associations between different features. In this paper, an LSTM is used to associate pivot features with non-pivot features and then map the features of both domains into the same low-dimensional mapping feature space, reducing the discrepancy across domains. In our model, we take the fused feature representation $v_f$ as the input of the LSTM, and the output is the corresponding mapping feature.
The principle of feature mapping is illustrated in Fig. 2, whose examples are book reviews and kitchen reviews. The LSTM learns the connection between “nice” (usually a positive adjective in both the source domain and the target domain, and hence a pivot feature) and “durable” (an adjective that is often used to describe kitchen appliances but not books, and hence a non-pivot feature). In this way, the LSTM is used to predict whether a feature is a pivot feature, and the pivot features and non-pivot features are then mapped into the same feature space. In our model, the input $v_f$ is encoded by the LSTM network, and the hidden layer features $h = h_1, h_2, \dots, h_n$ are the mapping features.
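One way to picture the training signal for this pivot prediction is the small sketch below. It assumes a labeling scheme in the style of pivot-based language modeling, where the target at each position is the next word if that word is a pivot and a placeholder label otherwise. The `NONE` label, the `pivot_targets` helper, and the tiny pivot set are hypothetical; the exact labeling used by FAM is not spelled out in this section.

```python
NONE = "<NONE>"  # hypothetical label meaning "the next word is not a pivot"

def pivot_targets(tokens, pivots):
    """For each position, the target is the next token if it is a pivot, else NONE."""
    targets = []
    for i in range(len(tokens)):
        nxt = tokens[i + 1] if i + 1 < len(tokens) else None
        targets.append(nxt if nxt in pivots else NONE)
    return targets

pivots = {"nice", "great", "terrible"}          # illustrative pivot features
review = ["a", "nice", "and", "durable", "pan"]  # kitchen-style review

for word, target in zip(review, pivot_targets(review, pivots)):
    print(word, "->", target)
```

Because both labeled source reviews and unlabeled reviews can be labeled this way without sentiment annotations, a sequence model trained on such targets sees non-pivot words like “durable” in the context of pivots like “nice”.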

Principle of feature mapping based on LSTM.
The mapping features $h = h_1, h_2, \dots, h_n$ are then fed into a softmax layer, which converts $h$ into a probability distribution. Given an input word, the LSTM should predict the next word in the pivot feature sequence. The probability that the next word is the $i$-th word can be calculated by the following formula.
Where $W_i$ and $W_k$ are the weight vectors and $|V|$ is the size of the lexicon. The loss function in this paper is the cross-entropy loss function.
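For concreteness, the softmax prediction and the cross-entropy loss can be sketched in Python. The sketch assumes the standard softmax form $p(i \mid h_t) = \exp(h_t \cdot W_i) / \sum_{k=1}^{|V|} \exp(h_t \cdot W_k)$, which is consistent with the symbols $W_i$, $W_k$, and $|V|$; the hidden-state symbol $h_t$, the function names, and the toy numbers are assumptions made for illustration.

```python
import math

def softmax_probs(h_t, W):
    """p(i | h_t) = exp(h_t . W_i) / sum_k exp(h_t . W_k); W holds one weight vector per label."""
    scores = [sum(a * b for a, b in zip(h_t, w_i)) for w_i in W]
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy(probs, target_index):
    """Negative log-likelihood of the correct label."""
    return -math.log(probs[target_index])

h_t = [0.5, -1.0, 0.25]   # one mapping feature (LSTM hidden state), toy values
W = [[1.0, 0.0, 0.0],     # |V| = 3 hypothetical weight vectors
     [0.0, 1.0, 0.0],
     [0.0, 0.0, 1.0]]

p = softmax_probs(h_t, W)
loss = cross_entropy(p, target_index=0)
print(p, loss)
```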
Through the feature mapping network module, the connection between the non-pivot features and the pivot features is established, and the weights of the fusion layer and the feature mapping layer are saved to be shared with the sentiment classification module.
As illustrated in Fig. 1, in our model, the sentiment label prediction module consists of three parts: input layer, Multi-Head attention mechanism layer, and output layer.
Input layer
When the training of the feature mapping network is completed, the weight matrices of the fusion layer and the feature mapping layer are saved. We obtain the mapping features by sharing the weight parameters of the feature mapping network, and then use the mapping features as input to train the sentiment classifier.
Multi-head attention mechanism layer
As also shown in Fig. 1, the input of the Multi-Head attention layer is the mapping features. The Multi-Head attention mechanism is used to obtain the position information of each feature. Specifically, given the matrices $Q \in \mathbb{R}^{n \times d}$, $K \in \mathbb{R}^{n \times d}$, and $V \in \mathbb{R}^{n \times d}$, the Multi-Head mechanism calculates the attention weights by the following formula.
Self-attention, a component of the Multi-Head attention mechanism, is used to capture the complex interactions and dependencies between terms within a sequence and to obtain the connections between different features. Specifically, the information at the current position is related to that at all other positions to capture the dependencies in the entire sequence and extract the relevant information from the sentence. In a sentence, each word performs the attention calculation with every other word.
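The attention calculation can be sketched in plain Python using the standard scaled dot-product form, $\mathrm{Attention}(Q, K, V) = \mathrm{softmax}(QK^{\top} / \sqrt{d})\,V$. That this is the exact formula intended here is an assumption, and the function names and toy input are illustrative. In self-attention, $Q$, $K$, and $V$ all come from the same sequence of mapping features.

```python
import math

def softmax(row):
    """Numerically stable softmax over a list of scores."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    s = sum(exps)
    return [e / s for e in exps]

def attention(Q, K, V):
    """softmax(Q K^T / sqrt(d)) V for lists-of-lists Q, K, V of shape n x d."""
    d = len(Q[0])
    out = []
    for q in Q:
        # Similarity of the current position to every position in the sequence.
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d) for k in K]
        weights = softmax(scores)
        # Weighted average of the value vectors.
        out.append([sum(w * v[j] for w, v in zip(weights, V)) for j in range(len(V[0]))])
    return out

H_map = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # toy mapping features, n = 3, d = 2
A = attention(H_map, H_map, H_map)            # self-attention: Q = K = V
print(A)
```

Each output row is a convex combination of all value vectors, which is how every position receives information from every other position.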
In our model, we first apply linear transformations to the three matrices $Q$, $K$, and $V$, and then perform the attention calculation. The calculation process is shown in Fig. 3. At the center of the figure is a variant of general attention named Scaled Dot-Product Attention, which is the central computational unit of the Multi-Head attention mechanism. The Scaled Dot-Product Attention calculation is performed multiple times, and the number of Heads is the number of these calculations; note that the linear projections of $Q$, $K$, and $V$ differ for each Head. Take one of the Heads as an example,

The calculation process of Multi-Head attention mechanism layer.
After $H$ such calculations, the outputs $M_i$ are concatenated. In this paper, $H = 16$.
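Written out in the standard Multi-Head formulation, the per-Head calculation and the concatenation would read as below. The projection matrices $W_i^{Q}$, $W_i^{K}$, and $W_i^{V}$ are notation assumed here for the Head-specific linear transformations, and whether a final output projection follows the concatenation is not stated in this section.

```latex
\begin{align}
M_i &= \mathrm{Attention}\left(Q W_i^{Q},\; K W_i^{K},\; V W_i^{V}\right), \quad i = 1, \dots, H, \\
M   &= \mathrm{Concat}\left(M_1, M_2, \dots, M_H\right).
\end{align}
```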
A mean pooling operation is performed on $M$ to obtain a vector $C$. The pooled vector $C$ is then sent into a softmax layer to obtain the sentiment label, as shown in the following formula.
Where $W_s$ and $b_s$ are the parameters of the softmax layer.
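A minimal Python sketch of this output layer follows, assuming the usual form $\hat{y} = \mathrm{softmax}(W_s C + b_s)$ with one row of $W_s$ per sentiment label. The helper names, the toy attention output `M`, and the parameter values are hypothetical.

```python
import math

def mean_pool(M):
    """Average the n rows of M (n x d) into a single d-dimensional vector C."""
    n = len(M)
    return [sum(row[j] for row in M) / n for j in range(len(M[0]))]

def predict(C, W_s, b_s):
    """softmax(W_s C + b_s): one probability per sentiment label."""
    logits = [sum(w * c for w, c in zip(w_row, C)) + b for w_row, b in zip(W_s, b_s)]
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

M = [[0.2, 0.8], [0.6, 0.4], [1.0, 0.0]]  # hypothetical attention output, n = 3, d = 2
C = mean_pool(M)
W_s = [[1.0, -1.0], [-1.0, 1.0]]          # hypothetical parameters for 2 labels
b_s = [0.0, 0.0]
y = predict(C, W_s, b_s)
print(C, y)
```

Mean pooling makes the classifier input independent of the sentence length $n$, since every review is reduced to a single $d$-dimensional vector before the softmax layer.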
Dataset
We use the Amazon dataset collected by Blitzer et al. [8] to evaluate our proposed method. This dataset consists of book reviews (B), DVD reviews (D), kitchen reviews (K), and electronics reviews (E), i.e., reviews from four domains. For each domain, there are 2,000 labeled reviews, 1,000 positive and 1,000 negative. Each domain also contains a much larger set of unlabeled reviews: 6,000 in the books domain, 34,741 in the DVD domain, 13,153 in the electronics domain, and 16,785 in the kitchen domain. The statistics for this dataset are given in Table 1.
Statistics for the Amazon dataset
Jensen-Shannon (JS) divergence
Indicators that measure the difference in data between domains are indispensable in cross-domain sentiment analysis. In our experiments, we choose JS divergence as the indicator to evaluate the impact of data discrepancy on the sentiment classifier. The greater the JS divergence, the greater the domain discrepancy. When JS = 0, the feature distributions of the source domain and the target domain are identical. The definition of JS divergence is as follows.
Where $P$ and $Q$ correspond to the distributions of the source domain and the target domain, respectively. In addition, KL in this formula is the Kullback–Leibler (KL) divergence, an asymmetric measure of the difference between two distributions, and its definition is as follows.
To quantify the domain difference, we selected the 10,000 feature words with the highest frequency across all domains and used the distribution of these features in each domain to calculate the JS divergence between domains. The JS divergence of the four domains in the Amazon dataset is shown in Table 2.
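The computation can be sketched in Python using the standard definitions $\mathrm{JS}(P \,\|\, Q) = \tfrac{1}{2}\mathrm{KL}(P \,\|\, M) + \tfrac{1}{2}\mathrm{KL}(Q \,\|\, M)$ with $M = \tfrac{1}{2}(P + Q)$, and $\mathrm{KL}(P \,\|\, Q) = \sum_x P(x) \log \frac{P(x)}{Q(x)}$. The natural logarithm, the helper names, and the two toy token lists are assumptions for illustration; the paper builds its distributions over the 10,000 most frequent feature words.

```python
import math
from collections import Counter

def distribution(tokens, vocab):
    """Relative frequency of each vocabulary word in a token list."""
    counts = Counter(t for t in tokens if t in vocab)
    total = sum(counts.values())
    return [counts[w] / total for w in vocab]

def kl(p, q):
    """KL(P || Q) = sum_x P(x) log(P(x) / Q(x)); terms with P(x) = 0 contribute 0."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def js(p, q):
    """JS(P || Q) = 0.5 KL(P || M) + 0.5 KL(Q || M), with M = (P + Q) / 2."""
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

books = "great plot great characters boring ending".split()
kitchen = "great pan durable pan easy cleaning".split()
vocab = sorted(set(books) | set(kitchen))  # toy vocabulary instead of 10,000 words

P = distribution(books, vocab)
Q = distribution(kitchen, vocab)
print(js(P, Q), js(P, P))
```

Unlike KL divergence, JS divergence is symmetric and bounded (by $\log 2$ with the natural logarithm), which makes values comparable across domain pairs.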
The JS divergence of the domains in the Amazon dataset
In this paper, we use accuracy to evaluate the performance of cross-domain sentiment classification models. Accuracy is the fraction of reviews classified correctly: among num_all reviews, num_correct reviews are classified correctly. The accuracy is described as follows.
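Written out from the definition in the preceding sentences, the measure is presumably the plain ratio of the two counts:

```latex
\[
\mathrm{accuracy} = \frac{\mathit{num\_correct}}{\mathit{num\_all}}
\]
```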
In this paper, we design 12 cross-domain sentiment classification tasks: B → K, D → K, E → K; D → B, E → B, K → B; B → D, E → D, K → D; B → E, D → E, K → E. The left and right sides of each arrow correspond to the source and target domains, respectively. In addition, we design two sets of experiments. The first set verifies the role of the part-of-speech vectors and the attention mechanism in our model. The second set compares our method with the baseline models. We consider the following comparisons:
Analysis and results
Experiment of fusion of part-of-speech vectors and attention mechanism
To fairly verify the effectiveness of the part-of-speech vectors and the attention mechanism, we conducted 12 groups of comparative experiments on PBLM_CNN, FAM_CNN, and FAM. The difference between PBLM_CNN and FAM_CNN is that the feature representation of FAM_CNN contains part-of-speech information, while that of PBLM_CNN does not. The difference between FAM and FAM_CNN is that the sentiment classifier of the former is a Multi-Head attention mechanism, while that of the latter is a CNN. The results are summarized in Table 3. FAM_CNN performs better than PBLM_CNN in 10 groups of experiments, which demonstrates the effectiveness of part-of-speech information. In particular, in the E → B and E → D experiments, FAM_CNN outperforms PBLM_CNN by 4.5% and 2.8%, respectively. By fusing the part-of-speech vector and the word embedding into a single feature vector, we take both part-of-speech information and semantic information into account. The former deepens the understanding of the text content and provides better order and position information between features; the latter helps the model learn the relationships between different features. In the remaining two groups of experiments, FAM_CNN performs slightly worse.
FAM compared with the baselines (accuracy %). Best scores are in bold
A further comparison between FAM_CNN and FAM shows that FAM performs better than FAM_CNN in 9 groups of experiments, which indicates that the Multi-Head attention mechanism is effective for the cross-domain sentiment classification task. In particular, in the B → K experiment, FAM outperforms FAM_CNN by 2.2%. In principle, introducing the attention mechanism allows FAM to capture the internal correlations between features and further improve the classification performance of the sentiment classifier.
According to the comparison of FAM with the baselines in Table 3, FAM achieves high classification accuracy. In particular, FAM outperforms mSDA in all groups, which shows that establishing the mapping relationship between non-pivot features and pivot features can reduce the inter-domain discrepancy and yield better classification accuracy. In addition, PBLM_CNN and PBLM_LSTM outperform AE-SCL-SR in all 12 groups of experiments. This demonstrates that feature representations based on word embeddings outperform those based on a document-word frequency matrix, and that cross-domain classifiers based on deep neural networks outperform the logistic regression classifier from traditional machine learning. It further supports the effectiveness of our method, i.e., word embeddings and the attention mechanism have an important influence on the cross-domain classification task. Furthermore, across our 12 groups of experiments, FAM outperforms PBLM_LSTM in 11 groups and PBLM_CNN in 10 groups, which demonstrates that part-of-speech information is useful for cross-domain sentiment classification. We also compare FAM with adversarial learning models, including DANN and AMN. FAM achieves the best performance in 8 groups, while DANN achieves the best performance in only 4 groups. AMN has no obvious advantage.
In terms of average accuracy, FAM improves over all baselines, which demonstrates that it is worthwhile to consider part-of-speech information and use the attention mechanism in the cross-domain sentiment classification task. Additionally, as shown in Table 3, all models achieve their best results in the E → K or K → E experiment. The reason may be that the JS divergence between these two domains is relatively small in Table 2, i.e., the discrepancy between the two domains is small and their feature distributions are similar, so the domain adaptation effect is best.
Specifically, FAM does not perform well in the E → B, K → B, E → D, and K → D experiments. This can be explained by the large JS divergence between B and E, B and K, D and E, and D and K in Table 2. Another cause of errors is that some sentiment words have different sentiment polarities in different domains. For example, “unpredictable” is positive in the book domain (unpredictable plot), but negative in the electronics domain (unpredictable steering). This is a general source of error in current cross-domain sentiment analysis.
Conclusion
We have presented an approach that gives features deep semantics by fusing their part-of-speech vectors into the feature representation when training the feature embeddings. Furthermore, in this paper, the Multi-Head attention mechanism is used to train a cross-domain classifier for the first time. In the 12 groups of comparison experiments on cross-domain sentiment classification tasks, our model FAM shows substantial improvement over the baseline models. Our model achieves new best results on 10 of the 12 groups, and its average accuracy reaches 82.5%. In future work, we will extend the evaluation to other datasets to verify the classification performance of our model.
