The deep learning word vector model using part of speech and sentiment information

Abstract

A language model describes and computes the probability that a sentence occurs in natural language. In practical applications, the language model is at the core of natural language processing and is often used in machine translation, information indexing, voice recognition, and context-processing tasks such as sentiment recognition. We discuss the advantages and weaknesses of traditional statistical language models and of neural network language models such as CBOW and Skip-gram. Building on both families of models, we put forward a word vector model based on part of speech and sentiment information (PSWV-model), which uses more natural language information, such as word order features, part of speech features, and sentiment polarity information, under the framework of Mikolov’s model. Finally, we discuss some advantages of the PSWV-model relative to other models, including CBOW, Skip-gram, and CDNV, in NLP tasks including named entity recognition and sentiment polarity analysis.

Keywords

Deep learning; word vector; sentiment analysis; named entity recognition

1 Introduction

In the contemporary world, internet applications such as micro-blogs, Weibo, WeChat, Facebook, and Twitter have become the main platforms for people to express their views, feelings, and sentiments. The sharp rise in the number of internet users has produced a proportional rise in the volume of their messages. How to extract current trends in public opinion from social media, so that governments and enterprises can make timely and effective decisions, is becoming a hot topic in the field of natural language processing [1]. Natural language processing (NLP) uses computers to process and understand large-scale text on the internet, and it is an important topic in the field of artificial intelligence. The traditional statistical language models include the N-gram model, the N-POS model, the maximum entropy model, and some others. In recent years, machine learning models have become the mainstream of natural language processing research [2]. Over the previous ten years, most machine learning methods in the field of NLP belonged to the class of shallow learning models. Common shallow learning models include classification models based on the support vector machine (SVM), sequence tagging models based on conditional random fields, logistic regression, Naive Bayes, and others [3]. Shallow models almost always rely on human experience to extract features from the text; the machine learning model itself is mainly responsible for classification or prediction. The performance of these models is therefore often determined by the quality of the manually extracted features. Researchers have to spend a long time observing the text and extracting features from it, and they need to do so separately for each task. Extracting effective features is all the more difficult because it requires rich experience and a comprehensive understanding of the data.

In natural language processing, the commonly used word representations are the One-hot representation (bag-of-words) and the Distributed representation. In previous studies, the One-hot representation of words is often found to be simple and crude [4]. A word is expressed as a high-dimensional vector whose length equals the size of the dictionary. Each position of the vector corresponds to a particular word in the dictionary: the position corresponding to the word is 1, and all other positions are 0. The One-hot representation has two problems. Firstly, it has a very high dimension and is extremely sparse, which can cause the common “dimensionality disaster” in natural language processing. For example, if a vector with a dimension of 100,000 is fed into a neural network model, the computational cost is very high even for simple applications. Secondly, it cannot represent complex semantic information in a natural language, such as the similarity between words, which makes it impossible to calculate even basic semantic correlations in the language.
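The second problem can be illustrated with a minimal Python sketch. The four-word vocabulary and the helper names `one_hot` and `dot` are assumptions made for illustration only; they do not come from the models discussed here.

```python
# Toy vocabulary (hypothetical); a real dictionary has tens of thousands of entries.
vocab = ["love", "like", "hate", "story"]
index = {word: i for i, word in enumerate(vocab)}

def one_hot(word):
    """Return a |V|-dimensional vector with a single 1 at the word's position."""
    vec = [0] * len(vocab)
    vec[index[word]] = 1
    return vec

def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))

# Any two distinct words are orthogonal, so "love"/"like" look exactly as
# unrelated as "love"/"hate": the representation carries no similarity signal.
sim_love_like = dot(one_hot("love"), one_hot("like"))
sim_love_hate = dot(one_hot("love"), one_hot("hate"))
```

Both inner products are zero regardless of meaning, and the vector length grows with the dictionary, which is the sparsity problem described above.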

In recent years, a large number of researchers have shown a clear inclination towards using the Distributed representation to express text. Bengio used a neural network model to obtain a vector representation of words, the word embedding (word vector) [5]. The word vector is a low-dimensional, dense, continuous vector representation that can contain both semantic and syntactic information from the text. Word vectors not only avoid the problems of dimensionality disaster and sparsity, but also make it easy to compute the semantic correlation between words. Following Bengio’s study, most natural language processing methods based on neural networks take word vectors as input. Researchers have improved the word vector under different deep learning structures such as the RNN (recursive neural network), the recursive auto-encoder, and the CNN (convolutional neural network) [6]. Current methods basically learn word vectors from large-scale data in an unsupervised way, ignoring the characteristics of human language learning and some very important inherent attributes of language, which have to be supplied manually.

In the following sections, we review the traditional statistical language models, such as the N-gram model, the N-POS model, and the maximum entropy model, to examine their principles, weaknesses, and advantages. Moreover, we review popular word representations, such as the traditional One-hot representation, Bengio’s word vector, and Mikolov’s CBOW and Skip-gram, to analyze their model structures for advantages and for weak points that could be improved. On the basis of these reviews, we propose our new model, the PSWV-model, to improve the performance of word vectors in NLP tasks.

2 Relation to previous work

2.1 Statistical language models

In the traditional statistical language model, the goal is to compute the probability of each sentence appearing in the corpus. The probability of a sentence is obtained by multiplying the conditional probabilities of the words in the sentence: $p (s) = p (w^{T}) = \prod_{t = 1}^{T} p (w_{t} | context)$ (2.1)

where $w^{T} = (w_{1}, w_{2}, \ldots, w_{T})$ and $w_{i}$ is the i-th word in the sentence. Different definitions of the context give different language models. Model (2.1) is the unigram model when context = null. When context = $w_{t - n + 1}, w_{t - n + 2}, \ldots, w_{t - 1}$, model (2.1) is the N-gram model. The objective of N-gram model optimization is to maximize the log-likelihood $\sum_{t = 1}^{T} \log P_{t} (w_{t} | w_{t - n + 1}, w_{t - n + 2}, \ldots, w_{t - 1})$ (2.2) which is the logarithm of the product $\prod_{t = 1}^{T} P_{t} (w_{t} | w_{t - n + 1}, w_{t - n + 2}, \ldots, w_{t - 1})$. When context = $w_{t - 2}, w_{t - 1}$, model (2.1) is the trigram model.
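As an illustration of the N-gram case with n = 2, the following Python sketch estimates bigram probabilities by counting and sums their logarithms over a sentence. The toy corpus, the `<s>` start marker, and the function names are assumptions for illustration; smoothing of unseen bigrams is deliberately left out.

```python
import math
from collections import Counter

# Tiny hypothetical corpus; "<s>" marks the start of a sentence.
corpus = [
    ["<s>", "i", "like", "stories"],
    ["<s>", "i", "like", "love", "stories"],
    ["<s>", "i", "read", "stories"],
]

bigram_counts = Counter()
history_counts = Counter()
for sentence in corpus:
    for prev, word in zip(sentence, sentence[1:]):
        bigram_counts[(prev, word)] += 1
        history_counts[prev] += 1

def bigram_prob(word, prev):
    """Maximum-likelihood estimate of p(word | prev); 0.0 if prev was never seen."""
    if history_counts[prev] == 0:
        return 0.0
    return bigram_counts[(prev, word)] / history_counts[prev]

def sentence_log_prob(sentence):
    """Sum of log p(w_t | w_{t-1}); -inf as soon as one bigram is unseen."""
    total = 0.0
    for prev, word in zip(sentence, sentence[1:]):
        p = bigram_prob(word, prev)
        if p == 0.0:
            return float("-inf")
        total += math.log(p)
    return total

log_p = sentence_log_prob(["<s>", "i", "like", "stories"])
```

A sentence containing a bigram that never occurs in the corpus receives probability zero, which is one reason practical N-gram models add smoothing.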

The N-POS model is derived from the N-gram model. It assumes that the probability of the t-th word depends on the parts of speech (POS) of the preceding n - 1 words. The N-POS model can be expressed as follows:

$p (s) = p (w^{T}) = \prod_{t = 1}^{T} P_{t} (w_{t} | c (w_{t - n + 1}), c (w_{t - n + 2}), \ldots, c (w_{t - 1}))$ (2.3) where c (·) is the function that maps a word to its part of speech.

Another well-known model in natural language processing is the maximum entropy model. It predicts an event using all the available information about the event and makes no further assumptions about it. Such a prediction reduces risk and attains the maximum entropy value. The maximum entropy model is as follows: $p (w | context) = \frac{e^{\sum_{i} α_{i} f_{i} (context, w)}}{z (context)}$ (2.4) where $f_{i}$ are feature functions with weights $α_{i}$, and z (context) is the normalization factor obtained by summing the numerator over all words that can appear in the context.
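Equation (2.4) can be sketched in a few lines of Python. The two binary feature functions, their weights, and the three-word vocabulary below are hypothetical and chosen only to make the normalization visible.

```python
import math

def maxent_prob(word, context, vocab, features, weights):
    """p(word | context) = exp(sum_i a_i * f_i(context, word)) / z(context)."""
    def score(w):
        return math.exp(sum(a * f(context, w) for a, f in zip(weights, features)))
    z = sum(score(w) for w in vocab)  # normalization over all candidate words
    return score(word) / z

# Hypothetical binary features f_i(context, w) and weights a_i.
features = [
    lambda ctx, w: 1.0 if ctx[-1] == "love" and w == "stories" else 0.0,
    lambda ctx, w: 1.0 if w == "the" else 0.0,
]
weights = [1.5, 0.5]
vocab = ["stories", "the", "read"]

context = ["i", "read", "love"]
probs = {w: maxent_prob(w, context, vocab, features, weights) for w in vocab}
```

Because every score is divided by the same z (context), the values in `probs` sum to one, and a word that fires a heavily weighted feature receives a larger share.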

2.2 The neural network language model

In natural language processing, the commonly used word representations are the One-hot representation and the Distributed representation. The idea of the One-hot representation is to express each word as a very long vector whose dimension is the size of the lexicon. This representation has two challenging shortcomings: 1) dimensionality disaster: the dimension of the word vector usually reaches tens of thousands, so the word vector is difficult to apply in deep learning; 2) lexical gap phenomenon: the similarity between words cannot be described.

In 1986, Hinton proposed the Distributed Representation to solve the above two problems. Bengio (2002) further proposed a neural network language model, whose structure is shown in Fig. 2.1.

Fig. 2.1

The structure of neural network language model.

In Fig. 2.1, $w_{t - n}, \ldots, w_{t - 2}, w_{t - 1}$ denote the n - 1 words before the t-th word in the context or sentence. Each word w is first mapped to a vector C (w), where C is a |V| × m matrix, |V| is the size of the lexicon, and m is the dimension of a word vector, usually set to several hundred.

The first layer of the neural network is the input layer, which concatenates vectors. The word vectors are connected in sequence to obtain a vector of dimension (n - 1) × m: $X = (C (w_{t - n}), \ldots, C (w_{t - 2}), C (w_{t - 1}))$

The second layer is the hidden layer. The vector X is first transformed linearly, $HX + d$ (2.5) where H is the weight matrix applied to X and d is the bias. A nonlinear activation function is then applied, and the result is used as the input of the third layer: $tanh (HX + d)$ (2.6)

The third layer is the output layer, which contains |V| neurons. Given context = {$w_{t - n}, \ldots, w_{t - 2}, w_{t - 1}$}, the value of the neuron $y_{w_{t}}$ is the unnormalized log probability of the t-th word. The output y, computed from the input layer and the hidden layer, is $y = b + ω X + U tanh (HX + d)$ (2.7) where ω, U, and H are the weight matrices between the layers of the neural network, and b and d are the bias terms. Finally, y is normalized with the softmax function to obtain a probability distribution: $p (w_{t} | w_{t - n}, \ldots, w_{t - 2}, w_{t - 1}) = \frac{e^{y_{w_{t}}}}{\sum_{i} e^{y_{w_{i}}}}$ (2.8)
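The forward pass of Equations (2.5) to (2.8) can be sketched in plain Python as follows. The toy sizes, the random initialization, and the names `matvec`, `softmax`, and `nnlm_forward` are assumptions for illustration; the matrix `W` plays the role of ω.

```python
import math
import random

def matvec(M, v):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(m * x for m, x in zip(row, v)) for row in M]

def softmax(y):
    """Normalize unnormalized log probabilities as in Equation (2.8)."""
    shift = max(y)  # subtracting the maximum avoids overflow in exp
    exps = [math.exp(v - shift) for v in y]
    total = sum(exps)
    return [e / total for e in exps]

def nnlm_forward(context_ids, C, H, d, U, W, b):
    """Concatenate context vectors, apply tanh(HX + d), then y = b + WX + U tanh(HX + d)."""
    X = [x for i in context_ids for x in C[i]]
    hidden = [math.tanh(a + di) for a, di in zip(matvec(H, X), d)]
    y = [bi + wx + uh for bi, wx, uh in zip(b, matvec(W, X), matvec(U, hidden))]
    return softmax(y)

# Hypothetical toy sizes: |V| = 5 words, m = 3 dimensions, 2 context words, 4 hidden units.
random.seed(0)
V, m, n_ctx, h = 5, 3, 2, 4

def rand_matrix(rows, cols):
    return [[random.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]

C = rand_matrix(V, m)
H = rand_matrix(h, n_ctx * m)
d = [0.0] * h
U = rand_matrix(V, h)
W = rand_matrix(V, n_ctx * m)
b = [0.0] * V

probs = nnlm_forward([1, 3], C, H, d, U, W, b)
```

The output has one entry per vocabulary word, so the cost of the last step grows linearly with |V|.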

The parameters θ = (ω, U, H, d, b, C) of the model are updated by SGA (stochastic gradient ascent): $θ \leftarrow θ + η \frac{\partial \log p (w_{t} | w_{t - n}, \ldots, w_{t - 2}, w_{t - 1})}{\partial θ}$ (2.9) where η is the learning rate, that is, the step size of each update along the gradient.

Based on Bengio’s neural network language model, Mikolov proposed two models, CBOW (Continuous Bag-of-Words) and Skip-gram, in 2013 [7]. CBOW uses the context of a given word w to predict the probability of w, while Skip-gram uses a known word w to predict the probability of its context. The CBOW and Skip-gram frameworks are shown in Figs. 2.2 and 2.3. Each consists of an input layer, a projection layer, and an output layer. The biggest difference between CBOW and Bengio’s neural network language model is that the hidden layer, which requires a huge amount of computation, is removed and a projection layer is added. To further reduce the computational complexity, Mikolov replaced the concatenation of the word vectors with their sum, computed in the projection layer.

Fig. 2.2

CBOW structure.

Fig. 2.3

Skip-gram structure.

The learning objective of CBOW is to maximize the log-likelihood of each word given its context, for example $(w_{t - 2}, w_{t - 1}, w_{t + 1}, w_{t + 2})$ when the learning window is [-2, 2]: $L = \sum_{w \in C} \log p (w | context (w))$ (2.10) where w is any word in corpus C, and the vocabulary size of the corpus is |C|. This problem can be treated as a multi-class classification problem, for which the commonly used method is softmax regression. However, softmax regression needs to compute and normalize the probabilities of all |C| vocabulary words. Because the amount of computation is large, Mikolov put forward two improvements: hierarchical softmax and negative sampling.

3 The proposed model

The review of the neural network language model and its improvements shows that, in contrast with the traditional statistical language models, these models avoid the dimensionality disaster and the lexical gap phenomenon. In particular, Mikolov’s improvements simplify the neural network structure and reduce the computational complexity. However, these improvements drop some features inherent in natural language, such as the part of speech and the sentiment information of words. In the following sections, we focus on preserving the word order of the context and on bringing the part of speech and the sentiment information of words into the model, on the basis of the neural network model and its improvements.

3.1 Preserving word order of the context

To reduce the time complexity of the model, the CBOW model proposed by Mikolov in 2013 replaced the vector concatenation in the input layer of Bengio’s 2002 neural network language model with vector summation, and used softmax in the output layer. But this sacrifices the word order feature of the context. We think that, given increasing computation speed, we can keep the word order feature and capture more semantic information by retaining the concatenation of vectors in the input layer, as in Bengio’s model.

3.2 The characteristics of part of speech

At present, most word vector models seldom use grammatical knowledge, although it is well known that grammatical knowledge is very important for understanding language. The semantics and usage of the same word differ when its part of speech differs. For example, the English word “like” means “prefer or wish to do something” when it is a verb, and “having the same or similar characteristics” when it is an adjective. Clearly, the part of speech is important for understanding words.

Baotian Hu published the article “A novel word embedding learning model using the dissociation” [8] in Neurocomputing in 2016. He proposed CDNV (the continuous dissociation between nouns and verbs model), whose idea is to categorize words as verbs, nouns, and others when building the word vector model. Hu proposed three methods for separating words by part of speech, CDNV-1, CDNV-2, and CDNV-3, of which CDNV-3 has the maximum degree of separation.

In fact, parts of speech can be divided into many kinds, such as verbs, nouns, prepositions, adjectives, and adverbs. We first use a language tagging tool to annotate the corpus and obtain a training set that contains part of speech features. The categories of the POS tagger may include VB, NN, CD, JJ, MD, IN, CC, RB, PRP, etc. In our model, the same word may therefore have different encodings: $C (w_{i}, p_{i})$ is the encoding of word $w_{i}$ when its part of speech is $p_{i}$, where $p_{i} \in \{VB, NN, CD, JJ, RP, IN, CC, RB, PRP, PBR\}$.
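One way to picture the encoding C (w_i, p_i) is a look-up table keyed by (word, tag) pairs, as in the Python sketch below. The tagged tokens, the toy dimension, and the random initialization are assumptions for illustration; in practice the tags would come from a POS tagger.

```python
import random

random.seed(0)
m = 4  # toy word vector dimension (hypothetical)

# Hypothetical tagged tokens, written by hand in place of real tagger output.
tagged_corpus = [
    ("I", "PRP"), ("like", "VB"), ("stories", "NN"),
    ("like", "JJ"), ("stories", "NN"),
]

# One row per distinct (word, POS) pair, so "like" as a verb and "like" as an
# adjective receive separate encodings.
lookup = {}
for word, pos in tagged_corpus:
    if (word, pos) not in lookup:
        lookup[(word, pos)] = [random.uniform(-0.5, 0.5) / m for _ in range(m)]

def C(word, pos):
    """Return the encoding of `word` when its part of speech is `pos`."""
    return lookup[(word, pos)]

verb_vector = C("like", "VB")
adjective_vector = C("like", "JJ")
```

Keying on the pair rather than on the word alone enlarges the table, since each word contributes one row per part of speech it occurs with.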

3.3 The sentiment information of words

The word vector models of Bengio and Mikolov solve the problem that words are semantically isolated from one another in the One-hot representation: the word vectors of two words with similar semantics are close to each other. Their word vector models thus make many NLP tasks, such as clustering, part of speech analysis, and synonym analysis, clearer and simpler. However, the word vector models of Bengio and Mikolov, and the models that subsequently improved on them, ignore a key piece of information in natural language: the sentiment information of words. Some words, such as “beautiful” and “ugly” or “love” and “hate”, have opposite sentiment polarity, so their word vectors should be far away from each other. Conversely, the word vectors of words with similar sentiment polarity, such as “smile” and “laugh”, should be close to each other.

Andrew L. Maas’s (2011) paper “Learning word vectors for sentiment analysis” [9] showed that words with opposite sentiment polarities can be close to each other in current word vector models, because words with similar grammar and usage lie close together in the semantic space. Such word vectors often remain unsatisfactory in sentiment analysis tasks. Maas suggested that both the semantic and the emotional information of words should be considered when training word vectors. He proposed capturing the semantic information of the context with an unsupervised probabilistic model of the whole sample, and then optimizing the word vectors with a supervised model that uses annotated context. Finally, the two parts are combined into a single objective function, so that the resulting word vectors contain not only semantic information but also emotional information.

In this paper, we unify semantic information and sentiment information in the framework of the CBOW model proposed by Mikolov. We attempt to improve the CBOW model so that it captures both the semantic information and the sentiment information of the context, and we use context with supervised sentiment polarity labels in our model. Texts are divided into subjective and objective texts. Objective text does not contain any emotion classification information. Subjective text contains emotional information, such as love, dislike, and sadness. The emotion of a text is normally divided into two, three, or multiple categories. Two-way classification distinguishes positive and negative. Three-way classification distinguishes positive, negative, and neutral. Multi-category sentiment polarity is commonly classified as joy, anger, sadness, shock, and fear. In addition, the OCC model proposes that emotional cognition can be divided into three categories and 21 subgroups. For example, Qiao Xiangjie [10] (2010) proposed that e-learning students have eight emotions based on the OCC model: pleasure / sadness, satisfaction / disappointment, gratitude / anger, shame / pride. The sentiment intensity of a text can be divided into stepwise grades such as like, like very much, annoying, and hate. Another way is to define the sentiment intensity of a text by a score of 1, 2, 3, 4, or 5. We divide the sentiment polarity $s_{w}$ of word w into n grades, $s_{w} \in \{s_{1}^{w}, s_{2}^{w}, \ldots, s_{n}^{w}\}$. The sentiment polarity is separated by a Huffman tree under the framework of the CBOW model. Typical separation structures are shown in Figs. 3.1, 3.2 and 3.3, where $s_{P}^{w} = \{s_{1}^{w}, s_{2}^{w}, s_{3}^{w}\}$ and $s_{N}^{w} = \{s_{4}^{w}, s_{5}^{w}\}$.

Fig. 3.1

The positive and negative separation structure.

Fig. 3.2

The five grades separation structure.

Fig. 3.3

The multi-grades separation structure.

3.4 Word vector model based on part of speech and sentiment Information (PSWV-model)

We now propose a model that retains the word order features, part of speech features, and sentiment polarity features of text on the basis of the CBOW model. The aim is to obtain more semantic and emotional information from texts when training the word vectors. We call it the PSWV-model, the word vector model based on part of speech and sentiment information. The model has three layers, namely the input layer, the projection layer, and the output layer, as shown in Fig. 3.4. The training window size of the word w is n. For example, if n = 2, the probability of the word w is obtained from the two words before it and the two words after it.

Fig. 3.4

The structure of PSWV-model.

Given the current word $w_{t}$, we take four rows of the look-up table (the word vector matrix C) as the input word vectors of $w_{t - 2}, w_{t - 1}, w_{t + 1}, w_{t + 2}$. C is a matrix of size |V| × m, where |V| is the size of the vocabulary. The hyperparameter m is the dimension of the word vector; it is set by the user, usually to a few hundred. The first layer of the model looks up the word vectors in C and concatenates the context of word $w_{t}$ in order: $X_{nm} = (C (w_{t - 2}), C (w_{t - 1}), C (w_{t + 1}), C (w_{t + 2}))$ (3.1)

By contrast, Mikolov’s CBOW model calculates the sum of the word vectors of $w_{t - 2}, w_{t - 1}, w_{t + 1}, w_{t + 2}$, and therefore ignores word order information. Our PSWV-model concatenates the word vectors of $w_{t - 2}, w_{t - 1}, w_{t + 1}, w_{t + 2}$, and thus retains word order information. Mikolov’s CBOW model removes the hidden layer of Bengio’s neural network language model, which reduces the computational complexity significantly. Our PSWV-model adopts this advantage of the CBOW model: the hidden layer is simplified to a projection layer, whose function is to pass the word vector information from the input layer to the output layer.
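The difference between summing and concatenating can be checked directly. In the Python sketch below, the four two-dimensional vectors are made-up values standing in for the rows of C; only the comparison between the two operations matters.

```python
def sum_vectors(vectors):
    """CBOW-style projection: element-wise sum of the context vectors."""
    return [sum(parts) for parts in zip(*vectors)]

def concat_vectors(vectors):
    """Order-preserving projection: context vectors joined end to end."""
    return [x for vec in vectors for x in vec]

# Hypothetical 2-dimensional vectors for w_{t-2}, w_{t-1}, w_{t+1}, w_{t+2}.
a, b, c, d = [1.0, 0.0], [0.0, 1.0], [2.0, 2.0], [3.0, 1.0]
original = [a, b, c, d]
reordered = [d, c, b, a]  # same words, reversed order

same_sum = sum_vectors(original) == sum_vectors(reordered)
same_concat = concat_vectors(original) == concat_vectors(reordered)
```

The sum is identical for both orders, whereas the concatenation differs. The price is that the concatenated input has 2n × m components instead of m.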

The third layer is the output layer. The neural network language model uses the traditional softmax classifier in the output layer. The number of nodes in the output layer is the size of the dictionary, and the size of the parameter matrix is m × |V|, so the output layer of the neural network language model requires a huge amount of computation. Hence, Mikolov used the hierarchical softmax optimization mechanism in the output layer, which reduces the computational complexity significantly. We also use the hierarchical softmax algorithm in our PSWV-model. For example, suppose we model the sentence “I like to read love stories”, the word to be predicted is “like”, and the learning window is n = 2. Then the preceding word is “I” and the following two words are “to read”. Since there is only one preceding word, the missing word is replaced by the placeholder *PADDING*. The words in the learning window are mapped to their word vectors in matrix C. We then concatenate the four word vectors in order and pass them to the projection layer. After that, the part of speech of the word and its sentiment polarity are passed together to the hierarchical softmax classification in the output layer. In this way, the word vector model fuses word order information, the sentiment polarity of words, and part of speech (POS).
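The window construction in this example, including the placeholder, can be sketched as follows. The function name `context_window` and the whitespace tokenization of the sentence are assumptions for illustration.

```python
PAD = "*PADDING*"

def context_window(tokens, position, n):
    """Return the n words before and the n words after tokens[position],
    substituting the placeholder wherever the window leaves the sentence."""
    left = [tokens[i] if i >= 0 else PAD
            for i in range(position - n, position)]
    right = [tokens[i] if i < len(tokens) else PAD
             for i in range(position + 1, position + n + 1)]
    return left + right

sentence = ["I", "like", "to", "read", "love", "stories"]
window = context_window(sentence, sentence.index("like"), n=2)
last_window = context_window(sentence, len(sentence) - 1, n=2)
```

Each entry of the window, including the placeholder, would then be mapped to its row in C before concatenation, so the placeholder needs a vector of its own.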

As shown in Fig. 3.4, we adopt the hierarchical softmax algorithm in the output layer, so the output layer becomes a Huffman tree. The Huffman tree consists of three layers: the first is the semantic separation layer, the second is the POS separation layer, and the third is the sentiment polarity separation layer. Each non-leaf node of the Huffman tree represents a binary classification: if the output is 0, the path goes left; if the output is 1, the path goes right. After passing through the semantic separation layer, the POS separation layer, and the sentiment polarity separation layer, each leaf node represents a word in the corpus with a unique POS feature and sentiment polarity feature. Therefore, a word may have different codes in the Huffman tree when it has different POS and sentiment polarity features. The probability of each word in the corpus with a given POS feature and sentiment polarity feature is the product of the branch probabilities at the root node and all other non-leaf nodes on the path from the root to the leaf. Hence, our PSWV-model should maximize the following log-likelihood function:

$L = \sum_{w_{t} \in C} \log P (w_{t} | W_{t}^{T}, POS, SP)$ (3.2) where $W_{t}^{T} = (w_{t - n}, w_{t - n + 1}, \ldots, w_{t + n - 1}, w_{t + n})^{T}$, POS is the part of speech of the word $w_{t}$, and SP is the sentiment polarity feature of the word $w_{t}$.

In order to describe PSWV-model clearly, we define the following symbols.

$R^{w}$ is the path from the root node of the Huffman tree to the leaf node corresponding to the word w.

$R^{c}$ is the path from the root node of the Huffman tree through the semantic separation layer.

$R^{p}$ is the path of the word w in the POS separation layer.

$R^{s}$ is the path of the word w in the sentiment polarity separation layer.

$n^{w}$ is the number of nodes passed from the root node of the Huffman tree to the leaf node corresponding to the word w.

$n^{c}$ is the number of nodes that the word w passes in the semantic separation layer.

$n^{p}$ is the number of nodes that the word w passes in the POS separation layer.

$n^{s}$ is the number of nodes that the word w passes in the sentiment polarity separation layer.

Here $n^{w} = n^{c} + n^{p} + n^{s}$.

$P_{1}^{w}, P_{2}^{w}, \ldots, P_{n^{c}}^{w}$ are the nodes that the word w passes in the semantic separation layer.

${\hat{P}}_{1}^{w}, {\hat{P}}_{2}^{w}, \ldots, {\hat{P}}_{n^{p}}^{w}$ are the nodes that the word w passes in the POS separation layer.

${\tilde{P}}_{1}^{w}, {\tilde{P}}_{2}^{w}, \ldots, {\tilde{P}}_{n^{s}}^{w}$ are the nodes that the word w passes in the sentiment polarity separation layer.

Here $P_{n^{c}}^{w} = {\hat{P}}_{1}^{w}$ is the common transition node between the semantic separation layer and the POS separation layer.

Likewise, ${\hat{P}}_{n^{p}}^{w} = {\tilde{P}}_{1}^{w}$ is the common transition node between the POS separation layer and the sentiment polarity separation layer.

$ξ_{1}^{w}, ξ_{2}^{w}, \ldots, ξ_{n^{c}}^{w}$ are the Huffman codes of the nodes that the word w passes in the semantic separation layer, where $ξ_{j}^{w}$ is the code of $P_{j}^{w}$.

${\hat{ξ}}_{1}^{w}, {\hat{ξ}}_{2}^{w}, \ldots, {\hat{ξ}}_{n^{p}}^{w}$ are the Huffman codes of the nodes that the word w passes in the POS separation layer, where ${\hat{ξ}}_{j}^{w}$ is the code of ${\hat{P}}_{j}^{w}$.

${\tilde{ξ}}_{1}^{w}, {\tilde{ξ}}_{2}^{w}, \ldots, {\tilde{ξ}}_{n^{s}}^{w}$ are the Huffman codes of the nodes that the word w passes in the sentiment polarity separation layer, where ${\tilde{ξ}}_{j}^{w}$ is the code of ${\tilde{P}}_{j}^{w}$.

$θ_{1}^{w}, θ_{2}^{w}, \ldots, θ_{n^{c}}^{w}$ are the parameter vectors of the non-leaf nodes on the path $R^{c}$.

${\hat{θ}}_{1}^{w}, {\hat{θ}}_{2}^{w}, \ldots, {\hat{θ}}_{n^{p}}^{w}$ are the parameter vectors of the non-leaf nodes on the path $R^{p}$.

${\tilde{θ}}_{1}^{w}, {\tilde{θ}}_{2}^{w}, \ldots, {\tilde{θ}}_{n^{s}}^{w}$ are the parameter vectors of the non-leaf nodes on the path $R^{s}$.

With these definitions, the conditional probability in Equation (3.2) can be written as $P (w_{t} | W_{t}^{T}, POS, SP) = \prod_{i = 2}^{n^{c}} p (ξ_{i}^{w} | X_{w}, θ_{i - 1}^{w}) \prod_{j = 2}^{n^{p}} p ({\hat{ξ}}_{j}^{w} | X_{w}, {\hat{θ}}_{j - 1}^{w}) \prod_{k = 2}^{n^{s}} p ({\tilde{ξ}}_{k}^{w} | X_{w}, {\tilde{θ}}_{k - 1}^{w})$ (3.3) where $p (ξ_{i}^{w} | X_{w}, θ_{i - 1}^{w}) = \begin{cases} ρ (X_{w}^{T} θ_{i - 1}^{w}), & ξ_{i}^{w} = 0 \\ 1 - ρ (X_{w}^{T} θ_{i - 1}^{w}), & ξ_{i}^{w} = 1 \end{cases}$ (3.4) $p ({\hat{ξ}}_{j}^{w} | X_{w}, {\hat{θ}}_{j - 1}^{w}) = \begin{cases} ρ (X_{w}^{T} {\hat{θ}}_{j - 1}^{w}), & {\hat{ξ}}_{j}^{w} = 0 \\ 1 - ρ (X_{w}^{T} {\hat{θ}}_{j - 1}^{w}), & {\hat{ξ}}_{j}^{w} = 1 \end{cases}$ (3.5) $p ({\tilde{ξ}}_{k}^{w} | X_{w}, {\tilde{θ}}_{k - 1}^{w}) = \begin{cases} ρ (X_{w}^{T} {\tilde{θ}}_{k - 1}^{w}), & {\tilde{ξ}}_{k}^{w} = 0 \\ 1 - ρ (X_{w}^{T} {\tilde{θ}}_{k - 1}^{w}), & {\tilde{ξ}}_{k}^{w} = 1 \end{cases}$ (3.6)

ρ (·) is the sigmoid function. Hence, $p (ξ_{i}^{w} | X_{w}, θ_{i - 1}^{w})$ in the semantic separation layer can be written compactly as $p (ξ_{i}^{w} | X_{w}, θ_{i - 1}^{w}) = [ρ (X_{w}^{T} θ_{i - 1}^{w})]^{1 - ξ_{i}^{w}} [1 - ρ (X_{w}^{T} θ_{i - 1}^{w})]^{ξ_{i}^{w}}$ (3.7)

In the POS separation layer, $p ({\hat{ξ}}_{j}^{w} | X_{w}, {\hat{θ}}_{j - 1}^{w})$ can be written compactly as $p ({\hat{ξ}}_{j}^{w} | X_{w}, {\hat{θ}}_{j - 1}^{w}) = [ρ (X_{w}^{T} {\hat{θ}}_{j - 1}^{w})]^{1 - {\hat{ξ}}_{j}^{w}} [1 - ρ (X_{w}^{T} {\hat{θ}}_{j - 1}^{w})]^{{\hat{ξ}}_{j}^{w}}$ (3.8)

In the sentiment polarity separation layer, $p ({\tilde{ξ}}_{k}^{w} | X_{w}, {\tilde{θ}}_{k - 1}^{w})$ can be written compactly as $p ({\tilde{ξ}}_{k}^{w} | X_{w}, {\tilde{θ}}_{k - 1}^{w}) = [ρ (X_{w}^{T} {\tilde{θ}}_{k - 1}^{w})]^{1 - {\tilde{ξ}}_{k}^{w}} [1 - ρ (X_{w}^{T} {\tilde{θ}}_{k - 1}^{w})]^{{\tilde{ξ}}_{k}^{w}}$ (3.9)

Hence, Equation (3.2) becomes $L = \sum_{w_{t} \in C} \log \left[ \prod_{i = 2}^{n^{c}} p (ξ_{i}^{w} | X_{w}, θ_{i - 1}^{w}) \prod_{j = 2}^{n^{p}} p ({\hat{ξ}}_{j}^{w} | X_{w}, {\hat{θ}}_{j - 1}^{w}) \prod_{k = 2}^{n^{s}} p ({\tilde{ξ}}_{k}^{w} | X_{w}, {\tilde{θ}}_{k - 1}^{w}) \right]$ (3.10)

Substituting Equations (3.7) to (3.9) into Equation (3.10) gives $L = \sum_{w_{t} \in C} \left\{ \sum_{i = 2}^{n^{c}} \left\{ (1 - ξ_{i}^{w}) \log ρ (X_{w}^{T} θ_{i - 1}^{w}) + ξ_{i}^{w} \log [1 - ρ (X_{w}^{T} θ_{i - 1}^{w})] \right\} + \sum_{j = 2}^{n^{p}} \left\{ (1 - {\hat{ξ}}_{j}^{w}) \log ρ (X_{w}^{T} {\hat{θ}}_{j - 1}^{w}) + {\hat{ξ}}_{j}^{w} \log [1 - ρ (X_{w}^{T} {\hat{θ}}_{j - 1}^{w})] \right\} + \sum_{k = 2}^{n^{s}} \left\{ (1 - {\tilde{ξ}}_{k}^{w}) \log ρ (X_{w}^{T} {\tilde{θ}}_{k - 1}^{w}) + {\tilde{ξ}}_{k}^{w} \log [1 - ρ (X_{w}^{T} {\tilde{θ}}_{k - 1}^{w})] \right\} \right\}$ (3.11)
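For a single word, the log-likelihood is a sum of per-node terms over the three separation layers. The Python sketch below evaluates it under the convention of Equations (3.7) to (3.9), in which a node with code 0 contributes log ρ and a node with code 1 contributes log (1 - ρ). The data layout (one `(codes, thetas)` pair per layer) and all names are assumptions for illustration.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def path_log_prob(x, codes, thetas):
    """Sum over the nodes of one layer of (1 - code) log rho + code log (1 - rho)."""
    total = 0.0
    for code, theta in zip(codes, thetas):
        rho = sigmoid(dot(x, theta))
        total += (1 - code) * math.log(rho) + code * math.log(1.0 - rho)
    return total

def word_log_prob(x, layers):
    """layers holds one (codes, thetas) pair each for the semantic, POS
    and sentiment polarity separation layers."""
    return sum(path_log_prob(x, codes, thetas) for codes, thetas in layers)

# Hypothetical path: two semantic nodes, one POS node, one sentiment node.
x = [1.0, 2.0]
zero = [0.0, 0.0]
layers = [([0, 1], [zero, zero]), ([1], [zero]), ([0], [zero])]
value = word_log_prob(x, layers)
```

With all parameter vectors at zero every branch has probability 0.5, so the value is simply the number of nodes on the path times log 0.5.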

In the log-likelihood function (3.11), the parameters are $θ_{i - 1}^{w}$, ${\hat{θ}}_{j - 1}^{w}$, ${\tilde{θ}}_{k - 1}^{w}$, and $X_{w}$, where $θ_{i - 1}^{w}$ is the parameter vector of a non-leaf node in the semantic separation layer, ${\hat{θ}}_{j - 1}^{w}$ is the parameter vector of a node in the POS separation layer, ${\tilde{θ}}_{k - 1}^{w}$ is the parameter vector of a node in the sentiment polarity separation layer, and $X_{w}$ is the ordered concatenation of the word vectors of $w_{t - n}, w_{t - n + 1}, \ldots, w_{t + n - 1}, w_{t + n}$.

According to optimization theory, to maximize the likelihood function we need its partial derivatives with respect to the parameters $θ_{i - 1}^{w}$, ${\hat{θ}}_{j - 1}^{w}$, ${\tilde{θ}}_{k - 1}^{w}$, and $X_{w}$. For $θ_{i - 1}^{w}$, $\frac{\partial L (θ_{i - 1}^{w}, {\hat{θ}}_{j - 1}^{w}, {\tilde{θ}}_{k - 1}^{w}, X_{w})}{\partial θ_{i - 1}^{w}} = \frac{\partial \left\{ (1 - ξ_{i}^{w}) \log ρ (X_{w}^{T} θ_{i - 1}^{w}) + ξ_{i}^{w} \log [1 - ρ (X_{w}^{T} θ_{i - 1}^{w})] \right\}}{\partial θ_{i - 1}^{w}}$ (3.12)

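Using the property of the sigmoid function ρ′ (x) = ρ (x) (1 - ρ (x)), the derivative of $\log ρ (X_{w}^{T} θ_{i - 1}^{w})$ with respect to $θ_{i - 1}^{w}$ is $[1 - ρ (X_{w}^{T} θ_{i - 1}^{w})] X_{w}$ and that of $\log [1 - ρ (X_{w}^{T} θ_{i - 1}^{w})]$ is $- ρ (X_{w}^{T} θ_{i - 1}^{w}) X_{w}$, so Equation (3.12) becomes $\frac{\partial L (θ_{i - 1}^{w}, {\hat{θ}}_{j - 1}^{w}, {\tilde{θ}}_{k - 1}^{w}, X_{w})}{\partial θ_{i - 1}^{w}} = [1 - ξ_{i}^{w} - ρ (X_{w}^{T} θ_{i - 1}^{w})] X_{w}$ (3.13)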
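Equation (3.13) can be checked numerically by comparing it with a central finite difference of the per-node log-likelihood term. The vectors below are arbitrary test values, and the function names are assumptions for illustration.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def node_log_prob(x, theta, code):
    """Per-node term: (1 - code) log rho(x.theta) + code log (1 - rho(x.theta))."""
    rho = sigmoid(dot(x, theta))
    return (1 - code) * math.log(rho) + code * math.log(1.0 - rho)

def analytic_grad_theta(x, theta, code):
    """Equation (3.13): [1 - code - rho(x.theta)] x."""
    g = 1 - code - sigmoid(dot(x, theta))
    return [g * xi for xi in x]

def numeric_grad_theta(x, theta, code, eps=1e-6):
    """Central finite difference, one coordinate of theta at a time."""
    grad = []
    for i in range(len(theta)):
        up = theta[:i] + [theta[i] + eps] + theta[i + 1:]
        down = theta[:i] + [theta[i] - eps] + theta[i + 1:]
        diff = node_log_prob(x, up, code) - node_log_prob(x, down, code)
        grad.append(diff / (2 * eps))
    return grad

# Arbitrary test values (hypothetical).
x = [0.5, -1.0, 2.0]
theta = [0.1, 0.3, -0.2]
max_error = max(
    abs(a - n)
    for code in (0, 1)
    for a, n in zip(analytic_grad_theta(x, theta, code),
                    numeric_grad_theta(x, theta, code))
)
```

The same check applies to the other two layers, since their per-node terms have the same form.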

Similarly, we can obtain the following Equations (3.14) to (3.16) $\frac{\partial L (θ_{i - 1}^{w}, {\hat{θ}}_{j - 1}^{w}, {\tilde{θ}}_{k - 1}^{w}, X_{w})}{\partial {\hat{θ}}_{j - 1}^{w}} = [1 - {\hat{ξ}}_{i}^{w} - ρ (X_{w}^{T} {\hat{θ}}_{j - 1}^{w})] X_{w}$ (3.14) $\frac{\partial L (θ_{i - 1}^{w}, {\hat{θ}}_{j - 1}^{w}, {\tilde{θ}}_{k - 1}^{w}, X_{w})}{\partial {\tilde{θ}}_{k - 1}^{w}} = [1 - {\tilde{ξ}}_{i}^{w} - ρ (X_{w}^{T} {\tilde{θ}}_{k - 1}^{w})] X_{w}$ (3.15) $\begin{matrix} L (θ_{i - 1}^{w}, {\hat{θ}}_{j - 1}^{w}, {\tilde{θ}}_{k - 1}^{w}, X_{w}) \end{matrix} \partial X_{w} = [1 - ξ_{i}^{w} - ρ (X_{w}^{T} θ_{i - 1}^{w})] θ_{i - 1}^{w} + [1 - {\hat{ξ}}_{i}^{w} - ρ (X_{w}^{T} {\hat{θ}}_{j - 1}^{w})] {\hat{θ}}_{j - 1}^{w} + [1 - {\tilde{ξ}}_{i}^{w} - ρ (X_{w}^{T} {\tilde{θ}}_{k - 1}^{w})] {\tilde{θ}}_{k - 1}^{w}$ (3.16) In the procedures of parameters training, we adopt stochastic gradient ascending (SGA) to update the parameters. $θ_{i - 1}^{w} \leftarrow θ_{i - 1}^{w} + η_{1} \frac{\partial L (θ_{i - 1}^{w}, {\hat{θ}}_{j - 1}^{w}, {\tilde{θ}}_{k - 1}^{w}, X_{w})}{\partial θ_{i - 1}^{w}}$ (3.17) ${\hat{θ}}_{j - 1}^{w} \leftarrow {\hat{θ}}_{j - 1}^{w} + η_{2} \frac{\partial L (θ_{i - 1}^{w}, {\hat{θ}}_{j - 1}^{w}, {\tilde{θ}}_{k - 1}^{w}, X_{w})}{\partial {\hat{θ}}_{j - 1}^{w}}$ (3.18) ${\tilde{θ}}_{k - 1}^{w} \leftarrow {\tilde{θ}}_{k - 1}^{w} + η_{3} \frac{\partial L (θ_{i - 1}^{w}, {\hat{θ}}_{j - 1}^{w}, {\tilde{θ}}_{k - 1}^{w}, X_{w})}{\partial {\tilde{θ}}_{k - 1}^{w}}$ (3.19) where η₁, η₂, η₃ is the learning rates of the semantic separation layer, the POS separation layer and sentiment polarity separation layer respectively. X_w is concatenated by w_t-n, w_t-n+1, . . . , w_t+2, w_t+n. 
The learning rate of the parameter X_w is η₄, and X_w is updated as follows: $X_{w} \leftarrow X_{w} + η_{4} \frac{\partial L (θ_{i - 1}^{w}, {\hat{θ}}_{j - 1}^{w}, {\tilde{θ}}_{k - 1}^{w}, X_{w})}{\partial X_{w}}$ (3.20) From the above derivation, we obtain the following algorithm for updating the parameters of the PSWV-model with hierarchical softmax.

Input: the word vectors $X_{w_{t-n}}, X_{w_{t-n+1}}, \ldots, X_{w_{t+n-1}}, X_{w_{t+n}}$ of the context of word $w_{t}$

$θ_{i - 1}^{w} = 0, {\hat{θ}}_{j - 1}^{w} = 0, {\tilde{θ}}_{k - 1}^{w} = 0$

$X_{w} = X_{w_{t-n}} ⊕ X_{w_{t-n+1}} ⊕ \ldots ⊕ X_{w_{t+n-1}} ⊕ X_{w_{t+n}}$ (⊕ denotes concatenation)

For i = 2:n^c; j = 2:n^p; k = 2:n^s

{

$\begin{matrix} g_{1} = 1 - ξ_{i}^{w} - ρ (X_{w}^{T} θ_{i - 1}^{w}); \\ g_{2} = 1 - {\hat{ξ}}_{i}^{w} - ρ (X_{w}^{T} {\hat{θ}}_{j - 1}^{w}); \\ g_{3} = 1 - {\tilde{ξ}}_{i}^{w} - ρ (X_{w}^{T} {\tilde{θ}}_{k - 1}^{w}); \\ θ_{i - 1}^{w} := θ_{i - 1}^{w} + η_{1} g_{1} X_{w}; \\ {\hat{θ}}_{j - 1}^{w} := {\hat{θ}}_{j - 1}^{w} + η_{2} g_{2} X_{w}; \\ {\tilde{θ}}_{k - 1}^{w} := {\tilde{θ}}_{k - 1}^{w} + η_{3} g_{3} X_{w}; \\ X_{w} := X_{w} + η_{4} (g_{1} θ_{i - 1}^{w} + g_{2} {\hat{θ}}_{j - 1}^{w} + g_{3} {\tilde{θ}}_{k - 1}^{w}); \end{matrix}$

}
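The update rules (3.17) to (3.20) can be illustrated with a minimal pure-Python sketch of a single ascent step, covering one node in each of the three separation layers. The function names (`sigmoid`, `dot`, `update_step`) and the list-based vector representation are assumptions made for this illustration and do not come from the authors' implementation. As a further assumption, the gradient with respect to X_w is computed before the node parameters are changed, so that all four updates use the same parameter values.

```python
import math
from typing import List, Sequence

Vector = List[float]


def sigmoid(x: float) -> float:
    """Logistic function rho(x) used in Equations (3.13)-(3.16)."""
    return 1.0 / (1.0 + math.exp(-x))


def dot(a: Sequence[float], b: Sequence[float]) -> float:
    """Inner product of two equally sized vectors."""
    return sum(x * y for x, y in zip(a, b))


def update_step(x_w: Vector, thetas: List[Vector], codes: Sequence[int],
                etas: Sequence[float], eta_x: float) -> Vector:
    """One gradient-ascent step for a single node in each layer.

    thetas: [theta, theta_hat, theta_tilde], updated in place.
    codes:  the codes [xi, xi_hat, xi_tilde], each 0 or 1.
    etas:   learning rates [eta1, eta2, eta3]; eta_x plays the role of eta4.
    Returns the updated context vector X_w.
    """
    # Error terms 1 - xi - rho(X_w^T theta), one per separation layer.
    errors = [1.0 - c - sigmoid(dot(x_w, th)) for th, c in zip(thetas, codes)]
    # Gradient w.r.t. X_w (Equation 3.16), using the not-yet-updated thetas.
    grad_x = [sum(e * th[d] for e, th in zip(errors, thetas))
              for d in range(len(x_w))]
    # Node-parameter updates (Equations 3.17-3.19).
    for th, e, eta in zip(thetas, errors, etas):
        for d in range(len(th)):
            th[d] += eta * e * x_w[d]
    # Context-vector update (Equation 3.20).
    return [x + eta_x * g for x, g in zip(x_w, grad_x)]


# Hypothetical two-dimensional example with zero-initialised node parameters.
thetas = [[0.0, 0.0], [0.0, 0.0], [0.0, 0.0]]
new_x = update_step([1.0, 2.0], thetas, codes=[1, 0, 1],
                    etas=[0.1, 0.1, 0.1], eta_x=0.1)
```

With zero-initialised node parameters the first step leaves X_w unchanged, because the gradient in (3.16) is a weighted sum of the node parameters. Only the node parameters move in that step.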

3.5 Time complexity

The word vector models in the literature include the HLBL (hierarchical log-bilinear) word vector [11], Turian’s word vector, the C&W word vector [12], Huang’s word vector, CBOW and Skip-gram. CBOW and Skip-gram have simpler and clearer structures, so they are more efficient and can serve as benchmarks for comparison. The time complexity is defined as the number of parameters that need to be accessed during each iteration step. The time complexity of CBOW can be defined as: $TC (CBOW) = 2 nm + m {log}_{2} | V |$ (3.21) where m is the dimension of the word vector, |V| is the size of the dictionary, and n is the size of the training window. The term 2nm is the number of parameters accessed in the input layer, and m log₂|V| is the number of parameters accessed by hierarchical softmax in the output layer.

The time complexity of the PSWV-model can be defined as: $TC (PSWV) = 2 nm + mn {log}_{2} (1 + λ_{2} + λ_{3}) | V |$ (3.22) where λ₂ and λ₃ are the proportions of nodes added by the POS separation layer and the sentiment polarity separation layer, respectively, relative to the CBOW model. Therefore, the ratio of the time complexity of the PSWV-model to that of CBOW is $TCR = \frac{2 nm + mn {log}_{2} (1 + λ_{2} + λ_{3}) | V |}{2 nm + m {log}_{2} | V |} = \frac{2 n + n {log}_{2} (1 + λ_{2} + λ_{3}) | V |}{2 n + {log}_{2} | V |}$ (3.23) In theory, the least upper bound and greatest lower bound of the proportion of nodes added by the POS separation layer are sup(λ₂) = 5 and inf(λ₂) = 0, respectively. If we adopt the five-grade separation structure shown in Fig. 3.2, the corresponding bounds for the sentiment polarity separation layer are sup(λ₃) = 4 and inf(λ₃) = 0. Hence, the least upper bound of TCR is $sup (TCR) = \frac{2 n + n {log}_{2} 10 | V |}{2 n + {log}_{2} | V |}$ (3.24) and the greatest lower bound of TCR is $inf (TCR) = \frac{2 n + n {log}_{2} | V |}{2 n + {log}_{2} | V |}$ (3.25) It can be seen that adding the POS separation layer and the sentiment polarity separation layer to CBOW does not increase the time complexity unacceptably. For example, when the size of the dictionary is |V| = 130000 and the size of the training window is n = 5, the values of sup(TCR) and inf(TCR) are $sup (TCR) = 4.13$ (3.26) $inf (TCR) = 3.58$ (3.27)
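The bounds on TCR can be checked numerically with a short sketch. The function name `tc_ratio` is hypothetical. The sketch assumes that the ratio has the form of (3.24), that is, numerator 2n + n log₂((1 + λ₂ + λ₃)|V|) and denominator 2n + log₂|V|. It further assumes that the upper bound takes λ₂ = 5 and λ₃ = 4, and the lower bound takes λ₂ = λ₃ = 0.

```python
import math


def tc_ratio(vocab_size: int, n: int, lam2: float, lam3: float) -> float:
    """Ratio of PSWV to CBOW time complexity with the dimension m cancelled.

    Assumed form: (2n + n*log2((1 + lam2 + lam3)*|V|)) / (2n + log2|V|).
    """
    numerator = 2 * n + n * math.log2((1 + lam2 + lam3) * vocab_size)
    denominator = 2 * n + math.log2(vocab_size)
    return numerator / denominator


# Values from the worked example: |V| = 130000, n = 5.
upper = tc_ratio(130_000, 5, lam2=5, lam3=4)  # every possible extra node added
lower = tc_ratio(130_000, 5, lam2=0, lam3=0)  # no extra nodes added
print(f"sup(TCR) = {upper:.2f}, inf(TCR) = {lower:.2f}")
```

Under these assumptions the upper bound evaluates to about 4.13, in agreement with (3.26). The lower bound evaluates to about 3.52, slightly below the value reported in (3.27).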

4 Experiment and evaluation

In the contemporary literature, the classical word vector models include CBOW, Skip-gram, GloVe, HLBL (hierarchical log-bilinear) word vectors, Turian’s word vector, the C&W word vector and Huang’s word vector. A common way to test the quality of word vectors is to add them as features to natural language processing tasks such as named entity recognition and sentiment polarity analysis. We compare and analyze the performance of the PSWV model and other models on the tasks of named entity recognition and sentiment polarity recognition.

4.1 Performance of PSWV-model in task of named entity recognition

In this study, the English Wikipedia corpus (snapshot as of December 2017) and the Reuters RCV training set are used to train the PSWV-model. First, the data sets are cleaned and normalized: short sentences are deleted, abnormal sentences are removed, uppercase letters are converted to lowercase, and Arabic numerals and low-frequency words are removed. The data are then merged into a single 15 billion text dataset. Next, the whole data set is annotated with parts of speech using the Stanford POS tagger. After that, the text is annotated according to sentiment lexicons such as LIWC (Pennebaker, 2007), the MPQA Subjectivity cues lexicon (Riloff and Wiebe 2003), the Bing Liu opinion lexicon (Bing Liu, 2004) and SentiWordNet (Stefano, 2010). In this experiment, the dimension of the word vector is m = 50, and the size of the training window is n = 3.
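The cleaning and normalization steps can be sketched as follows. This is an illustrative outline only. The function name `normalize_corpus`, the regular-expression tokenizer, and the thresholds `min_tokens` and `min_count` are assumptions, since the paper does not report the exact rules or cut-offs. The removal of "abnormal" sentences is not modeled because its criterion is not specified.

```python
import re
from collections import Counter
from typing import Iterable, List


def normalize_corpus(sentences: Iterable[str], min_tokens: int = 3,
                     min_count: int = 2) -> List[List[str]]:
    """Lowercase, drop numerals, drop rare words, then drop short sentences.

    min_tokens and min_count are illustrative thresholds, not values
    taken from the paper.
    """
    tokenized = []
    for sentence in sentences:
        # Lowercase first, then keep alphanumeric tokens that are not numerals.
        tokens = [t for t in re.findall(r"[a-z0-9']+", sentence.lower())
                  if not t.isdigit()]
        tokenized.append(tokens)

    # Word frequencies over the whole corpus, used to remove rare words.
    counts = Counter(t for tokens in tokenized for t in tokens)

    cleaned = []
    for tokens in tokenized:
        kept = [t for t in tokens if counts[t] >= min_count]
        if len(kept) >= min_tokens:  # discard sentences that became too short
            cleaned.append(kept)
    return cleaned


corpus = ["The cat sat on the mat in 2017", "The cat sat", "Hi"]
print(normalize_corpus(corpus))
```

In this toy corpus the numeral and the words that occur only once are removed, and the one-word sentence is discarded.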

Named entity recognition identifies the boundaries and types of named entities in text. It plays a basic role in information extraction, question answering and machine translation, and can be used to evaluate the quality of word vector models. In 2007, Okazaki proposed a fast implementation of conditional random fields that can be used to evaluate the performance of a named entity recognition system. In this study, we chose Okazaki’s standard to evaluate the quality and performance of word vector models as follows: $F 1 = \frac{2 PR}{(R + P)}$ (4.1) where P is the precision for chunks or named entities and R is the recall for chunks or named entities. We use Okazaki’s program (Ner.py) to obtain the commonly used features, and add the word vector on top of these common features. We compare word vectors such as CBOW, Skip-gram, Turian and Huang with our PSWV model. The experimental results are shown in Table 4.1. The test data is based on the publicly evaluated task dataset (CoNLL 2003) from Reuters News [13].

Table 4.1

F1 scores on the test set and the four entity categories, based on CRFSuite by Okazaki

Dataset & Task	Huang	Turian	HLBL	CBOW	CDNV	PSWV
Data:TEST	83.18	86.14	85.62	86.45	87.12	92.16*
Task1:LOC	87.76	88.90	90.62	89.04	90.60	96.81*
Task2:MISC	77.24	77.72	78.33	78.29	78.60	88.84*
Task3:ORG	83.50	83.11	83.61	83.26	83.28	93.64*
Task4:PER	93.92	93.84	94.24	94.21	94.38	98.87*

The results show that all models can improve the performance of the CRF-based named entity recognition system. Table 4.1 shows that the PSWV-model, which adds the POS separation layer and the sentiment polarity separation layer, achieves the highest F1 score in every row: 92.16 on the test set, compared with 86.45 for the CBOW benchmark. Its scores on the LOC, MISC, ORG and PER sub-tasks are also higher than those of the CBOW word vector and the other word vectors. Both CDNV and PSWV separate words by part of speech: CDNV adopts the dissociation between nouns and verbs, whereas PSWV adopts a completely separate structure for parts of speech. Hence, on the test set the margin of PSWV over CDNV (5.04 points) is smaller than its margin over the models that do not separate parts of speech.
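The size of the PSWV advantage can be read directly from Table 4.1. The short sketch below copies the CBOW, CDNV and PSWV columns from the table and computes the margin of PSWV over each of the other two. The variable and function names are chosen for this illustration only.

```python
# F1 scores copied from Table 4.1 (rows: test set, LOC, MISC, ORG, PER).
rows = ["TEST", "LOC", "MISC", "ORG", "PER"]
f1 = {
    "CBOW": [86.45, 89.04, 78.29, 83.26, 94.21],
    "CDNV": [87.12, 90.60, 78.60, 83.28, 94.38],
    "PSWV": [92.16, 96.81, 88.84, 93.64, 98.87],
}


def margins(baseline: str) -> list:
    """F1 margin of PSWV over the given baseline, row by row."""
    return [round(p - b, 2) for p, b in zip(f1["PSWV"], f1[baseline])]


for name, over_cbow, over_cdnv in zip(rows, margins("CBOW"), margins("CDNV")):
    print(f"{name}: +{over_cbow:.2f} over CBOW, +{over_cdnv:.2f} over CDNV")
```

In every row the margin over CDNV is no larger than the margin over CBOW, which is consistent with CDNV already using part of the part-of-speech information.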

4.2 Performance of PSWV-model in task of sentiment polarity analysis

To further verify the performance of the PSWV model in sentiment analysis, we use labeled micro-blog data to test the effect of the PSWV-model on sentiment polarity recognition with a deep learning (LSTM) model. The hot topics in our study are retrieved from Sina Weibo (a platform similar to Facebook and Twitter). According to the retrieval results, we take the reviews from popular and authoritative newspapers and websites such as Beijing News, People’s Daily and other news media. We collected more than 500 thousand reviews or comments from these websites with the data collecting software octopus12.0, and more than 300 thousand remained after data cleaning. The cleaning process removed user names, empty comments, and comments shorter than five characters. User names and empty comments contain no sentiment information, and most comments shorter than five characters carry no emotional information either. Such data would act as noise and interfere with the results of the sentiment analysis experiments, so these three kinds of data were deleted to ensure the accuracy of the experiment. After that, the review texts were labeled with emotion tags and divided into 4 categories and 21 subclasses according to the emotion labeling. After sentiment tagging, 10000 reviews with positive sentiment polarity and 10000 with negative sentiment polarity were randomly selected as the training set. Then 2500 reviews of each polarity were randomly selected from the remaining data as the test set. Hence, there are 20000 reviews in the training set and 5000 reviews in the test set.
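The filtering and balanced sampling described above can be sketched as follows. The names `clean_comments` and `balanced_split`, the label strings `"pos"` and `"neg"`, and the fixed random seed are assumptions made for this illustration. User-name removal is omitted because the format of the collected data is not described.

```python
import random
from typing import List, Tuple

Review = Tuple[str, str]  # (comment text, polarity label: "pos" or "neg")


def clean_comments(comments: List[str], min_chars: int = 5) -> List[str]:
    """Drop empty comments and comments shorter than min_chars characters."""
    stripped = (c.strip() for c in comments)
    return [c for c in stripped if len(c) >= min_chars]


def balanced_split(reviews: List[Review], train_per_class: int,
                   test_per_class: int,
                   seed: int = 0) -> Tuple[List[Review], List[Review]]:
    """Randomly draw equal numbers of positive and negative reviews.

    The test reviews are drawn from what remains after the training
    reviews have been taken, so the two sets do not overlap.
    """
    rng = random.Random(seed)
    train: List[Review] = []
    test: List[Review] = []
    for label in ("pos", "neg"):
        pool = [r for r in reviews if r[1] == label]
        rng.shuffle(pool)
        train.extend(pool[:train_per_class])
        test.extend(pool[train_per_class:train_per_class + test_per_class])
    return train, test


# Hypothetical toy data: six positive and six negative reviews.
toy = [(f"positive review {i}", "pos") for i in range(6)]
toy += [(f"negative review {i}", "neg") for i in range(6)]
train_set, test_set = balanced_split(toy, train_per_class=4, test_per_class=2)
```

With `train_per_class=10000` and `test_per_class=2500`, the same call would produce the 20000 training reviews and 5000 test reviews used in the experiment.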

In our study, the performance of the model is measured by four indexes: accuracy rate, precision, recall rate and F1 value. The confusion matrix is introduced to define these indexes. The accuracy rate is the proportion of correctly predicted reviews in the total number of reviews in the test data set: $Accuracy rate = \frac{TP + TN}{TP + FP + TN + FN}$ (4.2) where TP (True Positive) is the number of positive comments that are correctly predicted as positive, FP (False Positive) is the number of negative comments that are predicted as positive, FN (False Negative) is the number of positive comments that are predicted as negative, and TN (True Negative) is the number of negative comments that are correctly predicted as negative.

Take positive comments as an example to explain the meaning of precision, recall and F1. In the test data set, precision is the ratio of the number of reviews correctly predicted as positive to the total number of reviews predicted as positive: $Precision = \frac{TP}{TP + FP}$ (4.3)

Recall is the ratio of the number of reviews correctly predicted as positive to the total number of positive reviews: $Recall rate = \frac{TP}{TP + FN}$ (4.4)

The F1 value is defined as the harmonic mean of precision and recall: $F 1 = \frac{2 \times Recall \times Precision}{Recall + Precision}$ (4.5)
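Equations (4.2) to (4.5) translate directly into code. The sketch below computes the four indexes from the confusion-matrix counts. The function name `classification_metrics` and the example counts are hypothetical and are not results from the experiments. Returning 0.0 when a denominator is zero is a convention chosen here.

```python
from typing import Dict


def classification_metrics(tp: int, fp: int, tn: int, fn: int) -> Dict[str, float]:
    """Accuracy, precision, recall and F1 for the positive class."""
    accuracy = (tp + tn) / (tp + fp + tn + fn)          # Equation (4.2)
    precision = tp / (tp + fp) if tp + fp else 0.0      # Equation (4.3)
    recall = tp / (tp + fn) if tp + fn else 0.0         # Equation (4.4)
    if precision + recall:
        f1 = 2 * recall * precision / (recall + precision)  # Equation (4.5)
    else:
        f1 = 0.0
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}


# Hypothetical counts for a test set of 100 reviews.
scores = classification_metrics(tp=40, fp=10, tn=45, fn=5)
print({name: round(value, 4) for name, value in scores.items()})
```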

We input the CBOW, GloVe, CDNV and PSWV word vectors as features into the deep learning model (LSTM [14]) to predict the sentiment polarity of comments or reviews. Table 4.2 shows the experimental results for word vector dimensions of 50, 100, 150, 200, 250 and 300. The following steps are adopted in the experiment.

Setting the initial values of the hyperparameters of the LSTM neural network. The initial number of iterations is set to 3, the loss function is the cross-entropy loss, and the optimizer is Adam.

Training the LSTM neural network to predict the sentiment polarity of comments, using the word vectors obtained from the CBOW, GloVe, CDNV and PSWV models.

Carrying out several LSTM experiments for each word vector model at dimensions 50, 100, 150, 200, 250 and 300, while adjusting the hyperparameters to improve the performance of the LSTM model. After many experiments, the performance evaluation indexes of the different word vectors in the LSTM neural network are obtained, as shown in Table 4.2.

Table 4.2

The accuracy rate of sentiment polarity recognition based on the LSTM neural network model under different word vector models

Word vector\Dimension	50	100	150	200	250	300
CBOW	81.1	81.9	82.3	83.3	*83.8	82.8
CDNV	83.1	84.3	85.1	*85.3	84.2	83.8
GLOVE	81	83.70	83.6	*84.8	82.7	83
PSWV	85.1	86.3	91.6	92.5	87.1	85

According to Table 4.2 and Fig. 4.1, sentiment polarity recognition by the LSTM model achieves a higher accuracy rate with the PSWV word vector than with CBOW, GloVe or CDNV at every dimension tested. The reason is that the PSWV word vector not only uses the semantic information provided by the context, the word order information and the part of speech information, but also makes use of the sentiment information of the words during training. The advantage is largest at dimensions 150 and 200, where PSWV reaches 91.6 and 92.5 respectively.

Fig. 4.1

The accuracy rate of sentiment polarity recognition based on LSTM Model under different word vector models and different dimensions.

Compared with CBOW, CDNV uses not only the semantic information provided by the context during word vector training, but also the separation information of nouns and verbs. Therefore, it captures the sentiment polarity of words better than CBOW. In addition, CBOW uses several context words to estimate the probability of the specified word, so it captures only the local information of the context. The GloVe model, by contrast, trains the word vector with global information, so the GloVe word vector performs better than the CBOW word vector at most dimensions.

5 Conclusions and forecast

Language models play an important role in practical NLP applications such as machine translation, information indexing, voice recognition and context processing. Hence, we analyze the principles of traditional statistical language models, such as the N-gram model, the N-POS model and the maximum entropy model. We find that statistical language models have some inherent weaknesses, such as the semantic gap and feature sparsity. The semantic gap refers to the fact that the strings of two semantically similar words may be completely different. But these models also have advantages: they can capture many kinds of language features, including part of speech features, word order features and word combination features.

In contrast, the neural network language model represents words as semantically oriented vectors. The vectors of two semantically similar words are close to each other, which solves the semantic gap problem of the traditional statistical language model. The development trend of neural network models has become increasingly clear since Bengio’s work in 2002. In 2013, the two improved neural network language models proposed by Mikolov, CBOW and Skip-gram, rapidly advanced the application of neural network language models. Mikolov improved Bengio’s neural network model and greatly reduced its computational complexity. But replacing the concatenation of vectors in the hidden layer with the sum of vectors in the projection layer loses the original word order features. In addition, CBOW and Skip-gram cannot capture linguistic features or emotional information. Nevertheless, Mikolov’s improvements provide good ideas for other researchers; for example, the hierarchical softmax algorithm can significantly reduce the computational complexity. Within the framework of Mikolov’s model, the CDNV-model proposed by Hu in 2016 attempts to utilize the part-of-speech features of verbs and nouns, and achieves good results. Inspired by the traditional statistical language model and the neural network model, we put forward the PSWV-model in order to use more language information, such as word order features, part of speech features and sentiment polarity information, under the framework of Mikolov’s model. We then compare and analyze the PSWV model against other models, including CBOW, Skip-gram and CDNV, on NLP tasks including named entity recognition and sentiment polarity recognition.

In the near future, further development may be made in the following aspects. First, more language features can be accommodated under the framework of CBOW, and more deep learning models such as RNN, CNN and LSTM+CNN can be used to verify the performance of the word vector model in NLP tasks such as sentiment polarity recognition. Second, we can try to further develop the model on the Skip-gram architecture. Third, the computational complexity of the model may be reduced further.

Acknowledgments

This work is supported by Project funded by China Postdoctoral Science Foundation (2017M613016), Social Science Foundation of Guangdong (Grant No. GD16XYJ27), and Key Research topic of Shenzhen Polytechnic (Grant No. 601722k35007). The authors would like to thank the editor and an anonymous referee for their helpful comments on the manuscript.

References

1. Lin, Meng, et al., Visual and Textual Sentiment Analysis of a Microblog Using Deep Convolutional Neural Networks, Algorithms 9(2) (2016), 41–49.

2. Tang, Wei, Qin, Yang, Liu and Zhou, Sentiment embeddings with applications to sentiment analysis, IEEE Transactions on Knowledge and Data Engineering 28(2) (2016), 496–509.

3. Liang, Liu, Tan and Bai, Sentiment classification based on as-lda model, Procedia Computer Science 31 (2014), 511–516.

4. Hinton G.E. and Salakhutdinov R.R., Reducing the dimensionality of data with neural networks, Science 313(5786) (2006), 504–507.

5. Bengio, Ducharme, Vincent, et al., A neural probabilistic language model, Journal of Machine Learning Research 3(2) (2003), 1137–1155.

6. Sainath T.N., Kingsbury and Saon, Deep Convolutional Neural Networks for Large-scale Speech Tasks, Neural Networks 64 (2014), 39–48.

7. Henry, Cuffy and Mcinnes B.T., Vector representations of multi-word terms for semantic relatedness, Journal of Biomedical Informatics 77 (2018), 111.

8. Baotian, Buzhou, et al., A novel word embedding learning model using the dissociation between nouns and verbs, Neurocomputing 171 (2016), 1108–1117.

9. Maas A.L., Daly R.E., Pham P.T., et al., Learning word vectors for sentiment analysis, Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies, Volume 1, Association for Computational Linguistics (2011), 142–150.

10. XiangJie, Zhiliang and Wansen, Emotional Modeling in an E-learning System Based on OCC Theory, Computer Science 37(5) (2010), 214–218.

11. Chen, Xia, Lu, Liu and Wang, Word embedding composition for data imbalances in sentiment and emotion classification, Cognitive Computation 7(2) (2015), 226–240.

12. Collobert, Weston, Bottou, et al., Natural Language Processing (almost) from Scratch, Journal of Machine Learning Research 12 (2011), 2493–2537.

13. Wang and Manning C.D., Learning a product of experts with elitist lasso, Newdesign.aclweb.org 34(1) (2013), 58–65.

14. Graves and Schmidhuber, Framewise phoneme classification with bidirectional LSTM and other neural network architectures, Neural Networks 18(5) (2005), 602–610.