A multi-grained aspect vector learning model for unsupervised aspect identification

Abstract

Unsupervised aspect identification is a challenging task in aspect-based sentiment analysis. Traditional topic models are usually used for this task, but they are not well suited to short texts such as product reviews. In this work, we propose an aspect identification model based on aspect vector reconstruction. A key feature of our model is that it connects sentence vectors to multi-grained aspect vectors using the fuzzy k-means membership function. Furthermore, to make full use of different aspect representations in vector space, we reconstruct sentence vectors from coarse-grained and fine-grained aspect vectors simultaneously. The resulting model can therefore learn better aspect representations. Experimental results on two datasets from different domains show that our proposed model outperforms several baselines in terms of aspect identification and topic coherence of the extracted aspect terms.

Keywords

Aspect identification, text clustering, topic coherence, membership function, aspect extraction

1 Introduction

Online reviews are useful sources for evaluating entity aspects. Identifying aspects accurately from reviews is crucial for downstream tasks such as aspect-based sentiment analysis and summarization. Given a collection of reviews from the same domain, aspect identification aims to discover different clusters of aspects (aspect categories), each cluster associated with a set of aspect terms or a distribution over such terms. For example, in the review sentence “my salmon was completely raw”, the aspect term is “salmon”, which belongs to the aspect category “food”. The tasks of our model are: (1) extracting all representative aspect terms from the review corpus; (2) clustering aspect terms with similar meanings into categories so that each category represents a single aspect, e.g. clustering “beef”, “pork”, and “salmon” into the aspect category “food”; and (3) clustering review sentences into the corresponding aspect categories, e.g. assigning the above sentence to “food”.

Previous solutions for aspect identification fall into three groups: rule-based, supervised, and unsupervised methods. Rule-based methods usually rely on syntactic rules, which makes them unsuitable for online reviews that often do not follow such rules. Supervised methods need data annotation and domain adaptation, so they are also a poor choice for fast aspect identification. Unsupervised methods are usually based on topic models such as Latent Dirichlet Allocation (LDA) [1]. Here each aspect is modeled as a topic, which is essentially a multinomial distribution over words, and review sentences are modeled as mixtures of these topics. Another example is the Embedded Topic Model (ETM) [3], a generative model of documents that marries traditional topic models with word embeddings but still uses bag-of-words document representations as model input. Some special topic models and their extensions have been proposed for aspect identification [14, 28]. Although topic models have achieved success, they are limited to formal and well-edited documents, such as news reports and scientific articles, because they rely on document-level word collocations. On short texts such as online reviews, the performance of these models is likely to be compromised by the severe data sparsity issue.

It is worth considering how to use the powerful feature representation ability of neural networks for aspect identification. Compared with traditional language models based on multinomial word distributions, neural language models constructed in a continuous space may better handle low-frequency words in reviews and address the data sparsity problem. To this end, some neural topic models [6, 27] have been proposed and shown to produce more coherent topics than earlier models such as LDA. In the Aspect Based Autoencoder model (ABAE) [6], He et al. exploit word vectors pretrained on the dataset to capture the distribution of word co-occurrences, and predict the aspect probabilities of a sentence in order to reconstruct the sentence vector as a combination of multiple aspect vectors (i.e. the aspect matrix). A sentence can be associated with more than one aspect, so its aspect vector (i.e. the vector representing its aspects, which we simply call the sentence vector in this paper) can be reconstructed from multiple aspect vectors.

ABAE uses a linear layer to transform a sentence vector into a low-dimensional vector of aspect probabilities, which requires training extra parameters. In fact, the aspect probabilities of a sentence are related to the distances between its sentence vector and the aspect vectors, a relation that ABAE does not exploit. In contrast, our model calculates these aspect probabilities directly by applying the distances to the fuzzy k-means membership function. Besides, ABAE fixes the number of aspect categories, i.e. the number of clusters, which means the reconstruction is based on only one aspect matrix. We notice that when the number of aspect categories changes, the clustered aspect categories and aspect terms change with it. A small number of clusters yields coarse-grained aspects (each aspect is a large cluster), which correspond to relatively abstract topics, while a large number of clusters yields fine-grained aspects (each aspect is a smaller cluster), which correspond to relatively specific topics. Motivated by this observation, we reconstruct a sentence vector from two different aspect matrices: one for coarse-grained aspects and the other for fine-grained aspects. With two reconstructions per sentence, the model can learn more robust and reasonable aspect representations.

We summarize the main contributions as follows: (1) We propose an unsupervised aspect identification model based on aspect vector reconstruction with the fuzzy k-means membership function. (2) We propose a multi-grained aspect vector learning mode in our model to obtain better aspect representations, including coarse-grained and fine-grained aspect vectors. (3) We test our model on two datasets from the restaurant and beer domains. Compared with other aspect identification models, our model performs strongly in sentence-level aspect identification and discovers more meaningful and coherent aspects.

The rest of the paper is organized as follows. Section 2 reviews related work. Section 3 describes the proposed model for aspect identification. Section 4 presents the experiments and discusses their results in detail. Section 5 gives brief conclusions and future work.

2 Related work

Early studies of aspect identification mostly focus on the design of hand-crafted rules or features [15, 18]. More recently, neural models have enabled automatic representation learning and treat the task as supervised sequence labeling [11, 25]. These approaches achieve better performance than prior work. Besides, the non-iterative supervised learning model [8] also provides a solution for this task. However, these supervised models rely on manually annotated data, which restricts their ability to scale to new domains or languages.

Our work is mainly in line with research on unsupervised aspect extraction. Unsupervised methods are mainly based on probabilistic topic models, such as LDA, pLSA, the infinite replicated Softmax model (iRSM) [7], the bi-Directional Recurrent Attentional Topic Model (bi-RATM) [10], and the Embedded Topic Model (ETM) [3]. They are often used to extract topics from text collections and learn latent document representations. Though they have been shown to be powerful in modeling large text corpora, these topic models still have a drawback in the sparse-data setting, especially in cases where word co-occurrence data is insufficient. This is because traditional topic models do not directly encode word co-occurrence statistics. Assuming that each word is generated independently, they capture such patterns only implicitly by modeling word generation at the document level. Besides, topic models need to estimate the topic distribution of each document. Online reviews tend to be short, which makes it more difficult to estimate the topic distribution. As a result, they do not work well on short texts or on corpora with few documents.

To deal with this issue, many previous efforts incorporate external representations, such as word embeddings [9, 17] and knowledge pretrained on large-scale high-quality resources [4, 5]. In another line of research, some work focuses on enriching the context of short messages, such as the Biterm Topic Model (BTM) [26], which extends a message into a biterm set containing all combinations of any two distinct words appearing in the message.
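The biterm expansion described above can be illustrated with a minimal sketch. It follows only the description given here (all unordered pairs of distinct words in one message); the function name `biterms` is hypothetical, and details of the actual BTM implementation, such as context windows or repeated-pair counts, are not modeled.

```python
from itertools import combinations


def biterms(tokens):
    """Return all unordered pairs of distinct words in one short message.

    Assumption: the message is already tokenized and preprocessed.
    Words are deduplicated and sorted so the output order is deterministic.
    """
    vocab = sorted(set(tokens))
    return list(combinations(vocab, 2))


# The example sentence from the introduction, after stop-word removal.
print(biterms(["salmon", "completely", "raw"]))
```

A message with n distinct words yields n(n-1)/2 biterms, which is how the expansion enriches the co-occurrence evidence available for a short text.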

Recently, neural topic models have been shown to perform better than traditional topic models [6, 27]. In ABAE, He et al. proposed to predict the aspect probabilities of a sentence and then use these probabilities to reconstruct an embedding for the sentence as a combination of multiple aspect embeddings. The aspect embeddings are initialized by clustering the word embeddings using the k-means algorithm. ABAE is trained by minimizing the sentence reconstruction error. Vargas et al. [20] propose SUAEx, a deep neural network approach for unsupervised aspect extraction that relies on the similarity of word embeddings.

Inspired by neural topic models, the principle of our model is to use two different aspect matrices to reconstruct each sentence. In the process of reconstruction, we directly use the membership function of fuzzy k-means to calculate the reconstruction weights. We finally learn two reasonable aspect matrices and all sentence aspect vectors. Here each row of an aspect matrix is equivalent to the centroid of an aspect cluster in the vector space. Through reconstruction, our model completes both aspect clustering and sentence clustering.

3 Model description

We describe our model in this section. Our goal is to learn two sets of aspect vectors (aspect matrices): one for coarse-grained aspects and the other for fine-grained aspects. We infer the aspect categories and identify aspect terms mainly from the coarse-grained vectors, and infer some extra aspects from the fine-grained vectors. Through the two reconstructions, we can obtain a better and more reasonable aspect representation. The meaning of each aspect can be inferred from the words whose vectors are nearest to the aspect vector in the embedding space. First, we pretrain word embeddings on the dataset with Skipgram, which maps words that often co-occur in a context to points that are close together in the embedding space [12]. Skipgram takes a large corpus of text as input and produces a vector space, typically of several hundred dimensions, with each unique word assigned a vector in that space. The word vectors correspond to the rows of a word embedding matrix $E \in ℝ^{V \times D}$, where V is the vocabulary size and D is the embedding dimension. We want to learn two aspect matrices $T^{c} \in ℝ^{L^{c} \times D}$ and $T^{f} \in ℝ^{L^{f} \times D}$, where aspect vectors share the same embedding space with words. Each input to the model is a list of indexes of the words in a review sentence. Given such an input, three steps are performed, as shown in Fig. 1. First, we filter away non-aspect words with an attention mechanism and construct a sentence vector r from the weighted word embeddings. Then, we reconstruct the sentence vector twice, as linear combinations of the aspect vectors in T^c and T^f. Each reconstruction weights the aspect vectors by the membership degrees of the sentence in the corresponding aspects.

Fig. 1

The structure of our model.

The specific flow of the model is as follows:

3.1 Representing the original aspect vector of a sentence

Let $e_{x_{i}} \in ℝ^{D}$ represent the D-dimensional pretrained word vector of the i′th word in a sentence x. The embedding of the sentence is then $E = e_{x_{1}} \oplus e_{x_{2}} \oplus \dots \oplus e_{x_{n}}$, where ⊕ is the concatenation operator, $E \in ℝ^{n \times D}$, and n is the length of the sentence. We calculate the original aspect vector of a sentence by applying an attention operation to E:

$x_{r} = \frac{1}{n} \sum_{i = 1}^{n} e_{x_{i}}$ (1)

$\emptyset_{i} = e_{x_{i}}^{⊤} \cdot M \cdot x_{r}$ (2)

$α_{i} = \frac{\exp (\emptyset_{i})}{\sum_{j = 1}^{n} \exp (\emptyset_{j})}$ (3)

$r = \sum_{i = 1}^{n} α_{i} e_{x_{i}}$ (4)

Here $M \in ℝ^{D \times D}$ is a parameter matrix to be optimized. Through the attention operation, the sentence is represented as an original aspect vector that pays more attention to the words related to aspects.
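As an illustration of Eqs. (1) to (4), the following is a minimal pure-Python sketch of the attention step. The function name `sentence_vector` and the list-of-lists representation are assumptions made for this example; in the model itself, M is learned during training rather than supplied by hand.

```python
import math


def sentence_vector(word_vecs, M):
    """Attention-weighted sentence vector following Eqs. (1)-(4).

    word_vecs: list of n word vectors, each a list of D floats.
    M: D x D matrix as a list of rows (trainable in the model, fixed here).
    Returns (alpha, r): attention weights and the sentence vector.
    """
    n, D = len(word_vecs), len(word_vecs[0])
    # Eq. (1): average of the word vectors.
    x_r = [sum(e[d] for e in word_vecs) / n for d in range(D)]
    # Eq. (2): score of each word, e_i^T M x_r.
    Mx = [sum(M[a][b] * x_r[b] for b in range(D)) for a in range(D)]
    scores = [sum(e[a] * Mx[a] for a in range(D)) for e in word_vecs]
    # Eq. (3): softmax over the scores (max subtracted for numerical stability).
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    alpha = [v / total for v in exps]
    # Eq. (4): weighted sum of the word vectors.
    r = [sum(alpha[i] * word_vecs[i][d] for i in range(n)) for d in range(D)]
    return alpha, r


# Toy example with D = 2 and M set to the identity matrix.
alpha, r = sentence_vector([[2.0, 0.0], [0.0, 0.0]], [[1.0, 0.0], [0.0, 1.0]])
print(alpha, r)
```

In this toy input the first word is closer to the sentence average, so it receives the larger attention weight.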

3.2 Reconstructing the aspect vector of the sentence using coarse-grained aspect vectors

If there are L^c coarse-grained aspects for a collection of reviews, then a review sentence can be reconstructed from these aspect vectors. Let $t_{l}^{c} \in ℝ^{D}$ represent the D-dimensional aspect vector of the l′th aspect. The coarse-grained aspect matrix is then $T^{c} = t_{1}^{c} \oplus t_{2}^{c} \oplus \dots \oplus t_{L^{c}}^{c}$, where ⊕ is the concatenation operator and $T^{c} \in ℝ^{L^{c} \times D}$. According to the principle of the fuzzy c-means algorithm (also known as fuzzy k-means), the membership degree of a vector in a cluster depends on the distances between the vector and the centroid vectors of the different clusters. Consistent with this, we calculate the membership degree of a sentence in the l′th aspect from the distances between the original aspect vector r and the aspect vectors $t_{l}^{c}$:

$u_{l}^{c} = {\left(1 + \sum_{\substack{k = 1 \\ k \neq l}}^{L^{c}} {\left(\frac{\| r - t_{l}^{c} \|^{2}}{\| r - t_{k}^{c} \|^{2}}\right)}^{\frac{1}{m - 1}}\right)}^{- 1}, l = 1, 2, \dots, L^{c}$ (5)

$p_{l}^{c} = \frac{\exp (u_{l}^{c})}{\sum_{k = 1}^{L^{c}} \exp (u_{k}^{c})}$ (6)

$r^{c} = \sum_{l = 1}^{L^{c}} p_{l}^{c} t_{l}^{c}$ (7)

Here $u_{l}^{c}$ is the membership degree and m is the fuzzy weighted index. $p_{l}^{c}$ are the normalized weights over the L^c aspect vectors, where each weight represents the probability that the input sentence belongs to aspect l, and r^c is the reconstructed aspect vector.
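Eqs. (5) to (7) can be sketched in a few lines of pure Python. This is an illustrative sketch rather than the authors' implementation: the helper names are hypothetical, and the small constant `eps` added to every squared distance is an assumption introduced here to avoid division by zero when r coincides with an aspect vector.

```python
import math


def sq_dist(a, b):
    """Squared Euclidean distance between two vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def memberships(r, aspects, m=2.0, eps=1e-12):
    """Eq. (5): fuzzy membership of r in each aspect (eps is an added safeguard)."""
    d = [sq_dist(r, t) + eps for t in aspects]
    power = 1.0 / (m - 1.0)
    return [
        1.0 / (1.0 + sum((d[l] / d[k]) ** power for k in range(len(d)) if k != l))
        for l in range(len(d))
    ]


def softmax(u):
    """Eq. (6): normalize the membership degrees into weights."""
    top = max(u)
    exps = [math.exp(v - top) for v in u]
    total = sum(exps)
    return [v / total for v in exps]


def reconstruct(r, aspects, m=2.0):
    """Eq. (7): weighted combination of the aspect vectors."""
    u = memberships(r, aspects, m)
    p = softmax(u)
    rec = [sum(p[l] * aspects[l][d] for l in range(len(aspects))) for d in range(len(r))]
    return u, p, rec


u, p, rec = reconstruct([0.5, 0.0], [[0.0, 0.0], [2.0, 0.0]])
print(u, p, rec)
```

For this toy input the squared distances are 0.25 and 2.25, so the memberships are approximately 0.9 and 0.1. Because the memberships lie in [0, 1], the softmax in Eq. (6) yields weights that are less peaked than the memberships themselves.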

3.3 Reconstructing the aspect vector of the sentence using fine-grained aspect vectors

If there are L^f fine-grained aspects for a collection of reviews, then a review sentence can be reconstructed from these aspect vectors. Similar to the previous reconstruction with coarse-grained aspect vectors, let $t_{l}^{f} \in ℝ^{D}$ represent the D-dimensional aspect vector of the l′th aspect. The fine-grained aspect matrix is then $T^{f} = t_{1}^{f} \oplus t_{2}^{f} \oplus \dots \oplus t_{L^{f}}^{f}$, with $T^{f} \in ℝ^{L^{f} \times D}$. We calculate the membership degree of the sentence in the l′th fine-grained aspect from the distances between the reconstructed vector r^c and the aspect vectors $t_{l}^{f}$:

$u_{l}^{f} = {\left(1 + \sum_{\substack{k = 1 \\ k \neq l}}^{L^{f}} {\left(\frac{\| r^{c} - t_{l}^{f} \|^{2}}{\| r^{c} - t_{k}^{f} \|^{2}}\right)}^{\frac{1}{m - 1}}\right)}^{- 1}, l = 1, 2, \dots, L^{f}$ (8)

$p_{l}^{f} = \frac{\exp (u_{l}^{f})}{\sum_{k = 1}^{L^{f}} \exp (u_{k}^{f})}$ (9)

$r^{f} = \sum_{l = 1}^{L^{f}} p_{l}^{f} t_{l}^{f}$ (10)

Here $u_{l}^{f}$ is the membership degree and $p_{l}^{f}$ is the weight of the sentence belonging to aspect l. r^f is the reconstructed aspect vector of the sentence using fine-grained aspect vectors.

3.4 Training the final objective

According to the above steps, we have three representations of a sentence: r, r^c, and r^f. If the coarse-grained and fine-grained aspect vectors learned from the dataset are reasonable, then the reconstructed aspect vectors of a sentence should also be reasonable, which means that r, r^c, and r^f should be similar. Our model is trained to minimize the reconstruction error. We adopt the contrastive max-margin objective function used in previous work [6]. For each input sentence, we randomly sample num sentences from the dataset as negative samples. We represent each negative sample as n_i, which is computed by averaging its word embeddings. Our objective is to make r^c and r^f similar to r while different from the negative samples. Therefore, the objective J_j (θ) of the j′th sentence in the dataset is formulated as a hinge loss that maximizes the inner products of r with r^c and r^f and simultaneously minimizes the inner products of the negative samples with r^c and r^f. We calculate the reconstruction loss of a sentence by the following equation:

$J_{j} (θ) = \sum_{i = 1}^{num} \max (0, 1 - r \cdot r^{c} - λ \, r \cdot r^{f} + r^{c} \cdot n_{i} + λ \, n_{i} \cdot r^{f})$ (11)

Here λ is a hyperparameter that controls the weight of the reconstruction by fine-grained aspect vectors, and num is the number of negative samples. We sum J_j (θ) over all sentences in the dataset to get the total reconstruction loss J (θ). In addition, the aspect matrices T^c and T^f may suffer from redundancy during training. We add two regularization terms proposed in [24] to ensure the diversity of the aspects:

$V^{c} (θ) = \| T_{n}^{c} \cdot {T_{n}^{c}}^{⊤} - I \|$ (12)

$V^{f} (θ) = \| T_{n}^{f} \cdot {T_{n}^{f}}^{⊤} - I \|$ (13)

Here I is the identity matrix, and $T_{n}^{c}$ is T^c with each row normalized to have length 1; $T_{n}^{f}$ and T^f have the same relation. V^c and V^f reach their minimum values when the dot product between any two different aspect vectors is zero. Thus, the regularization terms encourage orthogonality among the rows of each aspect matrix and penalize redundancy between different aspect vectors. The final objective L (θ) is obtained by:

$L (θ) = J (θ) + β V^{c} (θ) + β V^{f} (θ)$ (14)

Where β is a hyperparameter that controls the weight of the regularization term. The corresponding learning objective is to minimize L (θ) by optimizing parameters {T^c, T^f, M}.
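The objective in Eqs. (11) to (14) can be sketched as follows for a single sentence. The function names are hypothetical, the default values of `lam` and `beta` mirror the settings reported in Section 4.2, and the use of the Frobenius norm in the regularizer is an assumption, since the norm type is not stated in the text.

```python
import math


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def hinge_loss(r, r_c, r_f, negatives, lam=0.2):
    """Eq. (11): contrastive max-margin loss of one sentence."""
    positive = dot(r, r_c) + lam * dot(r, r_f)
    total = 0.0
    for n_i in negatives:
        negative = dot(r_c, n_i) + lam * dot(n_i, r_f)
        total += max(0.0, 1.0 - positive + negative)
    return total


def orthogonality_penalty(T):
    """Eqs. (12)-(13): norm of T_n T_n^T - I (Frobenius norm assumed here).

    Rows of T are assumed to be non-zero so they can be normalized.
    """
    rows = []
    for t in T:
        length = math.sqrt(dot(t, t))
        rows.append([x / length for x in t])
    squared = 0.0
    for a in range(len(rows)):
        for b in range(len(rows)):
            entry = dot(rows[a], rows[b]) - (1.0 if a == b else 0.0)
            squared += entry * entry
    return math.sqrt(squared)


def final_objective(J, T_c, T_f, beta=0.1):
    """Eq. (14): total reconstruction loss plus both regularizers."""
    return J + beta * orthogonality_penalty(T_c) + beta * orthogonality_penalty(T_f)


J = hinge_loss([1.0, 0.0], [0.0, 1.0], [0.0, 1.0], [[0.0, 1.0]])
print(J, final_objective(J, [[1.0, 0.0], [0.0, 2.0]], [[1.0, 0.0], [1.0, 0.0]]))
```

In the example, the reconstructions match the negative sample instead of r, so the hinge term is active; the first aspect matrix has orthogonal rows and incurs no penalty, while the second has two identical rows and is penalized.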

After training, the sentences are clustered, and the vectors in T^c and T^f are the centroids of the different clusters in the embedding space. We choose the top n words whose vectors are closest to the aspect vector $t_{l}^{c}$ as the aspect terms of the l′th aspect. Meanwhile, the category of a sentence is given by the aspect with its largest probability $p_{l}^{c}$.
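The two inference steps above can be sketched as follows. Cosine similarity is used as the closeness measure, which is an assumption of this example; the function names and the toy vectors are hypothetical, and aspect indexes are 0-based in the code.

```python
import math


def cosine(a, b):
    """Cosine similarity between two non-zero vectors."""
    num = sum(x * y for x, y in zip(a, b))
    den = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return num / den


def top_terms(aspect_vec, word_vecs, n=10):
    """Return the n words whose vectors are most similar to the aspect vector.

    word_vecs: dict mapping each word to its embedding.
    """
    ranked = sorted(word_vecs, key=lambda w: cosine(aspect_vec, word_vecs[w]), reverse=True)
    return ranked[:n]


def sentence_category(p_c):
    """Return the 0-based index of the aspect with the largest weight p_l^c."""
    return max(range(len(p_c)), key=lambda l: p_c[l])


toy_vectors = {"salmon": [1.0, 0.1], "beef": [0.9, 0.2], "waiter": [0.0, 1.0]}
print(top_terms([1.0, 0.0], toy_vectors, n=2))
print(sentence_category([0.2, 0.5, 0.3]))
```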

4 Experiment and analysis

4.1 Datasets

We evaluate our method on two datasets, from the restaurant and beer domains respectively. The restaurant dataset contains over 50,000 training reviews and a subset of 3,400 sentences with manually labeled aspects. There are six manually defined aspect labels: Food, Staff, Ambience, Price, Anecdotes, and Miscellaneous. For the evaluation of aspect identification, we use three of these labels, following [6]. The number of training samples in the beer dataset is about 30 times that of the restaurant dataset. In the beer dataset, 6,301 sentences are manually annotated with five aspect labels: Feel, Look, Smell, Taste, and Overall. For the evaluation of aspect identification, we use the sentences annotated with the Feel, Look, Smell, and Taste labels. The detailed statistics of the datasets are summarized in Table 1.

Table 1
Dataset description

Domain	Training sentences	Annotated sentences	Max length of a sentence	Total unique words
Restaurant	52,574	Food: 887	158	45,023
		Staff: 352
		Ambience: 251
Beer	1,586,259	Feel: 1022	191	17,017
		Look: 1607
		Smell&Taste: 3672

4.2 Experimental setup

Review corpora are preprocessed by removing punctuation symbols, stop words, and words appearing fewer than 10 times. We use pre-trained word2vec embeddings to initialize the word embedding matrix E. We set the embedding size to 200, the window size to 10, and the negative sample size to 5. We also initialize the aspect matrices T^c and T^f with the centroids of the clusters obtained by running k-means on the word embeddings. After experimental adjustment, we set the batch size to 50, the hyperparameter λ to 0.2, β to 0.1, and the fuzzy weighted index m to 2, and randomly sample 20 sentences from the data as negative samples. Other parameters are initialized randomly. During training, we keep the word embedding matrix E fixed. We employ the Adam optimizer and run 15 epochs with an early stopping strategy. Dropout is also adopted to avoid overfitting. The results reported for all models are the average over 10 runs. Notably, the results of our model are very stable and hardly fluctuate across the ten runs.
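With the setting m = 2 used here, the exponent 1/(m - 1) in Eq. (5) equals 1, and the membership degree reduces to a normalized inverse squared distance. The short derivation below is a worked special case of Eq. (5), assuming that r does not coincide with any aspect vector; the same simplification applies to Eq. (8).

```latex
u_{l}^{c}
  = \Bigl(1 + \sum_{\substack{k = 1 \\ k \neq l}}^{L^{c}}
      \frac{\| r - t_{l}^{c} \|^{2}}{\| r - t_{k}^{c} \|^{2}}\Bigr)^{-1}
  = \Bigl(\sum_{k = 1}^{L^{c}}
      \frac{\| r - t_{l}^{c} \|^{2}}{\| r - t_{k}^{c} \|^{2}}\Bigr)^{-1}
  = \frac{\| r - t_{l}^{c} \|^{-2}}{\sum_{k = 1}^{L^{c}} \| r - t_{k}^{c} \|^{-2}}
```

The second step absorbs the leading 1 as the k = l term of the sum. In this form it is easy to see that the membership degrees of a sentence sum to 1 and that closer aspect vectors receive larger memberships.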

We experimented with different numbers of aspects and set the number of coarse-grained aspects to 14 and the number of fine-grained aspects to 50. The coarse-grained number is set to 14 for comparison with the baseline models. We manually mapped each coarse-grained aspect to one of the gold-standard aspects (refer to Table 1) according to its representative aspect terms, in accordance with previous work [2, 29]. Representative terms of an aspect are those words whose vectors are most similar to the corresponding aspect vector.

4.3 Baseline methods

To validate the performance of our model, we compare it with some baselines:

LocLDA [2]: This is a standard implementation of LDA, in which each sentence is treated as a separate document. We set Dirichlet priors α=0.05 and β=0.1, and run 1,000 iterations of Gibbs sampling.

k-means: We adopt the k-means algorithm to cluster the pretrained word embeddings and use the centroids directly as aspect vectors. In our experiments, the centroids are updated for 10 iterations.

SAS [14]: It is a hybrid topic model that jointly discovers both aspects and aspect-specific opinions. This model has been shown to be competitive among topic models in discovering meaningful aspects. For SAS, we set α=50/K and β=0.1.

BTM [26]: This is a biterm topic model that is specially designed for short texts. The major advantage of BTM over conventional LDA models is that it alleviates the problem of data sparsity in short texts by directly modeling the generation of unordered word-pair co-occurrences (biterms) over the corpus. For BTM, we set α=50/K and β=0.1.

SERBM [21]: This is a restricted Boltzmann machine (RBM) that learns topic distributions and assigns individual words to these distributions. By doing so, it learns to assign words to aspects. For the experimental setup, we use 10 hidden units.

ABAE [6]: This is an unsupervised neural topic model, which has been shown to produce more coherent topics than earlier models such as LDA. We fix the word embedding matrix E and optimize other parameters using Adam with a learning rate of 0.001 for 15 epochs and batch size of 50.

SUAEx [20]: This is an unsupervised aspect extraction model based on the similarity of word embeddings. In the experiment, we use an embedding dimension of 200, a batch size of 50, and 15 training epochs.

ETM [3]: This is the Embedded Topic Model, a generative model of documents that marries traditional topic models with word embeddings. In the experiment, the learning rate is set to 0.002, the model is trained for 15 iterations, and Adam is used as the optimizer.

4.4 Experimental result

4.4.1 Inferred aspects and extracted representative aspect terms

Previous literature has demonstrated the superiority of ABAE in aspect inference. We therefore compare the aspects learned by ABAE and by our model in Table 2, where the left column gives the gold-standard aspect labels and the right column gives the representative aspect terms of each inferred aspect. Among the fourteen aspects, three aspect vectors of our model can be inferred as Food, compared with two for ABAE, and two can be inferred as Staff, compared with only one for ABAE. Food, Staff, and Ambience are easy to infer manually and are the most important aspects for the restaurant domain. More inferred aspects mean that we can obtain more corresponding aspect terms. For example, among the “Food” aspects inferred by our model, Aspect 1 includes terms about dessert, Aspect 2 includes terms about the main course, and Aspect 3 includes terms about seafood. Similar results can be seen on the beer dataset. The topic coherence of the obtained aspect terms is examined in the following section.

Table 2
The inferred aspects and representative aspect terms for restaurant reviews

Aspect	Method	Top 10 aspect terms
Food	Our model	Aspect 1: strawberry chocolate <num> cooky vanilla cheesecake banana brownie cream tiramisu
		Aspect 2: bacon onion bean tortilla omelette lettuce thick chip chili pickle
		Aspect 3: seared scallop cod braised duck bass sea squid octopus veal
	ABAE	Aspect 1: tiramisu cheesecake gelato espresso banana sorbet creme brulee icecream souffle
		Aspect 2: vegetable risotto lasagna stew halibut dish broth veal meatball spaghetti
Staff	Our model	Aspect 4: busboy apologized cleared politely refused proceeded asked apologize repeatedly apology
		Aspect 5: staff waitstaff polite helpful professional courteous knowledgeable efficient server bartender
	ABAE	Aspect 3: waitstaff server staff manner host bartender polite patient waitress hostess
Ambience	Our model	Aspect 6: lit fireplace spacious booth dimly lighting banquette couch patio floor
	ABAE	Aspect 4: couch lighting comfy spacious lit funky fireplace patio dark furniture

4.4.2 Topic coherence

We also evaluated our models with topic coherence, which is a metric measuring aspect quality based on the co-occurrence of words [13]. It is defined as: $COH (t, V^{(t)}) = \frac{2}{M (M + 1)} \sum_{m = 2}^{M} \sum_{l = 1}^{m - 1} \log \frac{D (v_{m}^{(t)}, v_{l}^{(t)}) + 1}{D (v_{l}^{(t)})}$ (15)

Here V^(t) contains the M most probable words in aspect t. $v_{m}^{(t)}$ and $v_{l}^{(t)}$ are the m′th and l′th words in V^(t). $D (v_{l}^{(t)})$ is the number of sentences containing the word $v_{l}^{(t)}$, and $D (v_{m}^{(t)}, v_{l}^{(t)})$ is the number of sentences containing both $v_{m}^{(t)}$ and $v_{l}^{(t)}$. Generally speaking, the higher the topic coherence score, the greater the semantic relevance of the extracted topic and the higher its quality. We calculate the average coherence score of each model as: ${COH}^{AVR} = \sum_{t = 1}^{K} COH (t, V^{(t)})$ (16)
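The coherence score of a single aspect in Eq. (15) can be computed with a short sketch like the one below. The function name `coherence` and the toy corpus are hypothetical; the normalization factor follows Eq. (15) as written, and every top word is assumed to occur in at least one sentence so that the denominator is never zero.

```python
import math


def coherence(top_words, sentences):
    """Topic coherence of one aspect following Eq. (15).

    top_words: the M most representative words of the aspect, in rank order.
    sentences: list of tokenized sentences (lists of words).
    """
    docs = [set(s) for s in sentences]

    def doc_freq(*words):
        # Number of sentences that contain all the given words.
        return sum(1 for d in docs if all(w in d for w in words))

    M = len(top_words)
    total = 0.0
    for m in range(1, M):
        for l in range(m):
            joint = doc_freq(top_words[m], top_words[l])
            total += math.log((joint + 1) / doc_freq(top_words[l]))
    return 2.0 / (M * (M + 1)) * total


corpus = [["beef", "pork"], ["beef"], ["pork", "salmon"]]
print(coherence(["beef", "pork"], corpus))
print(coherence(["beef", "salmon"], corpus))
```

In the toy corpus, "beef" and "pork" co-occur once while "beef" and "salmon" never do, so the first pair obtains the higher score.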

In our experiment, the number of aspects K is set to 14. Figures 2 and 3 show the average topic coherence COH^AVR of the different models in the Restaurant and Beer domains, respectively. Conventional LDA models do not directly encode word co-occurrence statistics; assuming that each word is generated independently, they capture such patterns only implicitly by modeling word generation at the document level. Their performance is therefore significantly lower than that of our model. Our model obtains better coherence than the other models, with especially clear improvements for the top 30, 40, and 50 terms. This indicates that the aspects discovered by our model are more coherent than those discovered by the competitors.

Fig. 2

Average topic coherence score versus number of top n terms for the restaurant domain.

Fig. 3

Average topic coherence score versus number of top n terms for the beer domain.

4.4.3 Sentence level aspect identification

We calculate p^c for each test sentence by Equation (6) and assign the sentence an inferred aspect label according to the highest weight. Then we assign the gold-standard label to the sentence according to the above mapping between inferred aspects and gold-standard labels, shown in Table 1. For example, for the sentence “my salmon was completely raw”, if $p_{k}^{c}$ is the maximum of all $p_{l}^{c}$, l = 1, 2, ⋯ , L^c, then the sentence is assigned to the k′th aspect. Next, if we infer the k′th aspect to be Food according to its representative terms, the final label of the sentence is Food. We evaluate the performance of sentence-level aspect identification on the restaurant domain using the annotated sentences shown in Table 1. The evaluation criteria are precision, recall, and F1 score, which measure how well the predictions match the true labels. The results are shown in Table 3.
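The per-aspect evaluation criteria can be computed as in the following sketch, which uses the standard definitions of precision, recall, and F1. The function name and the toy labels are hypothetical and are not taken from the datasets.

```python
def precision_recall_f1(gold, predicted, label):
    """Precision, recall, and F1 of one aspect label over parallel label lists."""
    pairs = list(zip(gold, predicted))
    tp = sum(1 for g, p in pairs if g == label and p == label)
    fp = sum(1 for g, p in pairs if g != label and p == label)
    fn = sum(1 for g, p in pairs if g == label and p != label)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0.0:
        return precision, recall, 0.0
    return precision, recall, 2.0 * precision * recall / (precision + recall)


gold = ["Food", "Food", "Staff", "Ambience"]
predicted = ["Food", "Staff", "Staff", "Food"]
print(precision_recall_f1(gold, predicted, "Food"))
print(precision_recall_f1(gold, predicted, "Staff"))
```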

Table 3
Sentence level aspect identification results of different models on the restaurant domain

Aspect	Method	Precision	Recall	F1
Food	LocLDA	0.898	0.648	0.753
	ME-LDA	0.874	0.787	0.828
	SAS	0.867	0.772	0.817
	BTM	0.933	0.745	0.816
	SERBM	0.891	0.854	0.872
	k-means	0.931	0.647	0.755
	ABAE	0.953	0.741	0.828
	ETM	0.839	0.762	0.800
	SUAEx	0.917	0.900	0.908
	Our model	0.891	0.870	0.881
Staff	LocLDA	0.804	0.585	0.677
	ME-LDA	0.779	0.540	0.638
	SAS	0.774	0.556	0.647
	BTM	0.828	0.579	0.677
	SERBM	0.819	0.582	0.680
	k-means	0.789	0.685	0.659
	ABAE	0.802	0.728	0.757
	ETM	0.794	0.632	0.704
	SUAEx	0.660	0.872	0.752
	Our model	0.823	0.724	0.770
Ambience	LocLDA	0.603	0.677	0.638
	ME-LDA	0.773	0.558	0.648
	SAS	0.780	0.542	0.640
	BTM	0.813	0.599	0.685
	SERBM	0.805	0.592	0.682
	k-means	0.730	0.637	0.677
	ABAE	0.815	0.698	0.740
	ETM	0.809	0.583	0.678
	SUAEx	0.884	0.546	0.675
	Our model	0.835	0.725	0.776

We can see that our model achieves the highest F1 score on Staff and Ambience and the second-highest on Food, behind SUAEx. As all sentences are from the same domain, it is difficult to discover clear aspects and cluster sentences effectively using co-occurrence statistics alone. Traditional topic models and ETM therefore perform poorly, especially on Staff and Ambience. The recall of our model on Food is 12.9 percentage points higher than that of ABAE, which accords with the result in Table 2. The macro-average F1 of our model is 80.9%, which is 3.4 percentage points higher than that of ABAE; we attribute this to the more robust and reasonable aspect representation that our model learns from the two reconstructions of each sentence. It is also 3.1 percentage points higher than that of SUAEx. We think that SUAEx considers more words for comprehensive expression, which may introduce more noise and damage the performance.
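The macro-average F1 figures quoted above can be reproduced from the per-aspect F1 scores in Table 3, as the short check below illustrates. The helper name `macro_average` is hypothetical; the numbers are copied from Table 3 in the order Food, Staff, Ambience.

```python
def macro_average(values):
    """Unweighted mean of the per-aspect scores."""
    return sum(values) / len(values)


# Per-aspect F1 scores from Table 3 (Food, Staff, Ambience).
table3_f1 = {
    "Our model": [0.881, 0.770, 0.776],
    "ABAE": [0.828, 0.757, 0.740],
    "SUAEx": [0.908, 0.752, 0.675],
}

macro = {name: macro_average(scores) for name, scores in table3_f1.items()}
for name, value in macro.items():
    print(f"{name}: {100 * value:.1f}%")
print(f"gain over ABAE: {100 * (macro['Our model'] - macro['ABAE']):.1f} points")
print(f"gain over SUAEx: {100 * (macro['Our model'] - macro['SUAEx']):.1f} points")
```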

We further compare the performance of our model and other models on the beer dataset. The results are shown in Table 4.

Table 4

Sentence level aspect identification results of different models on the beer domain

Aspect	Method	Precision	Recall	F1
	LocLDA	0.938	0.537	0.675
	ME-LDA	–	–	–
	SAS	0.783	0.695	0.730
	BTM	0.892	0.687	0.772
Feel	SERBM	–	–	–
	k-means	0.720	0.815	0.737
	ABAE	0.815	0.824	0.816
	ETM	0.824	0.699	0.756
	SUAEx	0.687	0.832	0.753
	Our model	0.900	0.758	0.823
	LocLDA	0.651	0.873	0.735
	ME-LDA	–	–	–
	SAS	0.804	0.759	0.769
	BTM	0.885	0.760	0.815
Taste+Smell	SERBM	–	–	–
	k-means	0.697	0.828	0.740
	ABAE	0.897	0.853	0.866
	ETM	0.885	0.730	0.800
	SUAEx	0.844	0.922	0.881
	Our model	0.866	0.873	0.870
	LocLDA	0.963	0.676	0.774
	ME-LDA	–	–	–
	SAS	0.958	0.705	0.806
	BTM	0.953	0.854	0.872
Look	SERBM	–	–	–
	k-means	0.915	0.696	0.765
	ABAE	0.969	0.882	0.905
	ETM	0.968	0.764	0.854
	SUAEx	0.876	0.849	0.862
	Our model	0.920	0.903	0.911

In the Beer domain, we combined Taste and Smell into the single aspect “Taste+Smell”, because the two aspects are so similar that many words describe both. Our model achieves the best F1 score on Feel and Look and is second only to SUAEx on Taste+Smell (0.870 versus 0.881). All of the models score slightly lower in F1 on the Feel aspect. Our analysis found that most of the misclassified Feel sentences were ambiguous and described something close to the Taste+Smell aspect. For example, “Way Bitter perhaps bit Thin” is labeled “Feel”, but the model assigns it to “Taste+Smell” based on “bitter”, because “bitter” is also closely related to that aspect. Another example is “low medium Body”, also labeled “Feel”, where the sentence provides too little information for the model to judge its aspect. These two kinds of sentences cause all the models to have a slightly lower F1 score for the Feel aspect. The macro-average F1 of our model is 86.8%, which is 3.6 percentage points higher than that of SUAEx; we attribute this to the multi-grained reconstruction, which gives our model a more accurate aspect representation. The results show that, overall, our model identifies aspects better than all the baseline models.
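As a quick check of the Beer results, the following sketch picks the method with the highest F1 for each aspect in Table 4 and computes macro-averages. Methods without reported results (ME-LDA, SERBM) are left out, the macro-average is again assumed to be an unweighted mean, and all names are illustrative.

```python
# Per-aspect F1 scores copied from Table 4, in the order given by `aspects`.
aspects = ["Feel", "Taste+Smell", "Look"]
f1_scores = {
    "LocLDA": [0.675, 0.735, 0.774],
    "SAS": [0.730, 0.769, 0.806],
    "BTM": [0.772, 0.815, 0.872],
    "k-means": [0.737, 0.740, 0.765],
    "ABAE": [0.816, 0.866, 0.905],
    "ETM": [0.756, 0.800, 0.854],
    "SUAEx": [0.753, 0.881, 0.862],
    "Our model": [0.823, 0.870, 0.911],
}

# Method with the highest F1 for each aspect.
best = {
    aspect: max(f1_scores, key=lambda method: f1_scores[method][i])
    for i, aspect in enumerate(aspects)
}

# Unweighted macro-average F1 per method.
macro = {method: sum(s) / len(s) for method, s in f1_scores.items()}

for aspect, method in best.items():
    print(f"{aspect}: best F1 from {method}")
for method, value in sorted(macro.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{method}: {value:.3f}")
```

With the table values as given, the per-aspect winners are our model for Feel and Look and SUAEx for Taste+Smell, and the macro-averages come out at roughly 0.868 for our model, 0.862 for ABAE, and 0.832 for SUAEx.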

4.4.4 Ablation study

We conducted an ablation study on our model to analyze the role of the membership function and the fine-grained aspects. In Model 1, we reconstruct sentence vectors with the coarse-grained aspect vectors only, and we calculate $u^{c}$ with the following equation instead of Equation (5):

$u_{l}^{c} = \frac{1}{\| r - t_{l}^{c} \|^{2}}$ (17)

In Model 2, we keep the membership function of Equation (5) but reconstruct sentences with the coarse-grained aspect vectors only, without the fine-grained aspect vectors. The other settings of the two ablation models are consistent with our full model. We compared the performance of the ablation models for sentence-level aspect identification on the restaurant domain. The results are shown in Table 5.
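The two weighting schemes compared in the ablation can be sketched as follows. Equation (5) is not reproduced in this section, so the sketch assumes it has the standard fuzzy k-means (fuzzy c-means) membership form with a fuzzifier `m`; the function names, the default `m = 2`, and the toy vectors are illustrative assumptions, not the authors' implementation.

```python
def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def inverse_distance_weights(r, aspect_vectors):
    """Model 1 style weights: u_l = 1 / ||r - t_l||^2, as in Equation (17).

    Undefined when r coincides exactly with an aspect vector.
    """
    return [1.0 / squared_distance(r, t) for t in aspect_vectors]


def fuzzy_membership_weights(r, aspect_vectors, m=2.0):
    """Standard fuzzy k-means membership (ASSUMED form of Equation (5)).

    u_l = 1 / sum_j (d_l / d_j) ** (2 / (m - 1)), with d the Euclidean distance.
    """
    dists = [squared_distance(r, t) ** 0.5 for t in aspect_vectors]
    power = 2.0 / (m - 1.0)
    return [1.0 / sum((d_l / d_j) ** power for d_j in dists) for d_l in dists]


# Toy 2-d example: one sentence vector and three coarse-grained aspect vectors.
r = [0.9, 0.1]
aspect_vectors = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]

u17 = inverse_distance_weights(r, aspect_vectors)
u5 = fuzzy_membership_weights(r, aspect_vectors)
print("Equation (17) weights:", [round(u, 3) for u in u17])
print("Membership weights:   ", [round(u, 3) for u in u5])
print("Membership sum:       ", round(sum(u5), 6))
```

In this assumed form the memberships always sum to one, whereas the Equation (17) weights are unnormalized; for `m = 2` the memberships are exactly the Equation (17) weights divided by their sum, and other values of `m` sharpen or flatten the distribution over aspects.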

Table 5

Sentence level aspect identification results on the restaurant domain

Aspect	Method	Precision	Recall	F1
Food	ABAE	0.953	0.741	0.828
	Model 1	0.932	0.761	0.838
	Model 2	0.833	0.861	0.847
	Full model	0.891	0.870	0.881
Staff	ABAE	0.802	0.728	0.757
	Model 1	0.773	0.718	0.744
	Model 2	0.772	0.727	0.749
	Full model	0.823	0.724	0.770
Ambience	ABAE	0.815	0.698	0.740
	Model 1	0.786	0.711	0.747
	Model 2	0.710	0.789	0.747
	Full model	0.835	0.725	0.776

We mainly compare the F1 scores of the different models and make the following observations from Table 5. (1) Generally speaking, Model 1 performs close to ABAE, which shows that there is hardly any loss of performance when the sentence vector is reconstructed directly from the distances between the original sentence vector and the aspect vectors, without the dense layer of ABAE. Specifically, compared with ABAE, the F1 of Model 1 increases by 1.0 percentage point for Food and 0.7 for Ambience, but decreases by 1.3 for Staff. Food has 887 test samples, about 2.5 and 3.5 times as many as Staff and Ambience, respectively, so the improvement on Food has a greater impact on the overall performance. (2) Compared with Model 1, Model 2 improves further, which supports the use of the membership function of Equation (5). By using the membership function to calculate the weights and reconstruct sentence vectors, the overall performance of Model 2 is about 2% higher than that of ABAE. (3) Fine-grained aspect vectors play a very important role. With this module added, the F1 of our full model improves for all three aspects. This indicates that using aspects of two different granularities makes the sentence representation more reasonable and thus improves aspect identification. Besides, the previous experiments showed that our full model learns consistent aspect terms, which suggests that the two reconstructions yield a more reasonable aspect representation.
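The argument that Food dominates the overall result can be made concrete with a support-weighted average of the per-aspect F1 scores. The text does not state how "overall performance" is computed, so the weighting by annotated sentence counts below is an assumption, and the names are illustrative.

```python
# Annotated test sentences per aspect in the restaurant domain (Table 1).
support = {"Food": 887, "Staff": 352, "Ambience": 251}

# Per-aspect F1 scores copied from Table 5.
f1_scores = {
    "ABAE": {"Food": 0.828, "Staff": 0.757, "Ambience": 0.740},
    "Model 1": {"Food": 0.838, "Staff": 0.744, "Ambience": 0.747},
    "Model 2": {"Food": 0.847, "Staff": 0.749, "Ambience": 0.747},
    "Full model": {"Food": 0.881, "Staff": 0.770, "Ambience": 0.776},
}


def weighted_f1(per_aspect):
    """Average of per-aspect F1, weighted by the number of test sentences."""
    total = sum(support.values())
    return sum(per_aspect[a] * support[a] for a in support) / total


weighted = {method: weighted_f1(s) for method, s in f1_scores.items()}
for method, value in weighted.items():
    print(f"{method}: {value:.3f}")
```

Under this weighting ABAE comes out near 0.796 and Model 1 near 0.800: the 1.0-point gain on Food outweighs the 1.3-point loss on Staff because Food accounts for about 60% of the test sentences.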

4.4.5 Study on fine-grained aspect vectors

The above experiment shows that the performance of our model is improved by adding the fine-grained aspect vectors. We infer the aspect categories and identify aspect terms mainly from the coarse-grained vectors, but we can also infer some extra aspects from the fine-grained vectors. Table 6 lists seven fine-grained aspects inferred from the fine-grained aspect vectors for the restaurant domain. All these aspects belong to Food, and each describes a specific facet of Food. Except for the third and seventh aspects, whose topic coherence scores are low, the remaining aspects show good topic coherence.

Table 6
The inferred aspects and representative aspect terms using fine-grained aspect vectors for restaurant reviews

Aspect	Topic Coherence Scores	Top 10 aspect terms
Japanese Cuisine	–104.39	sushi roll sashimi tuna oyster noodle spring fish shrimp eel
Drink	–96.49	glass bottle beer sangria martini tap champagne water wine sake
Meat	–141.55	veal lamb duck rib hideous bass chop risotto psychic scallop
Dessert	–102.46	cupcake chocolate coffee cooky cake tea dessert cheesecake ice cup
Western Fast Food	–97.58	pizza slice crust thin oven bagel topping cheese burger pie
Adjectives	–119.15	presented creative chef official variety executed wide seasonal fresh prepared
Adjectives	–166.12	burnt pleasingly nutmeg taste rally dry weiner tasteless stripped piece
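The negative scores in Table 6 are consistent with a log-ratio coherence measure. Assuming the metric is the document co-occurrence coherence of Mimno et al. (it is not restated in this section), the score of a ranked term list could be computed roughly as in the sketch below. The function name and the toy corpus are hypothetical and are not taken from the paper's data.

```python
import math


def topic_coherence(top_terms, documents):
    """Coherence in the style of Mimno et al. (assumed metric).

    Sums log((D(v_m, v_l) + 1) / D(v_l)) over all pairs l < m of the ranked
    terms, where D counts the documents containing the given term(s).
    Every term must occur in at least one document.
    """
    doc_sets = [set(doc) for doc in documents]

    def doc_freq(*terms):
        return sum(1 for doc in doc_sets if all(t in doc for t in terms))

    score = 0.0
    for m in range(1, len(top_terms)):
        for l in range(m):
            v_m, v_l = top_terms[m], top_terms[l]
            score += math.log((doc_freq(v_m, v_l) + 1) / doc_freq(v_l))
    return score


# Toy corpus of tokenized sentences (hypothetical).
documents = [
    ["sushi", "roll", "tuna"],
    ["sushi", "sashimi"],
    ["pizza", "crust"],
    ["sushi", "roll"],
]

coherent = topic_coherence(["sushi", "roll", "tuna"], documents)
mixed = topic_coherence(["sushi", "pizza", "tuna"], documents)
print(round(coherent, 3), round(mixed, 3))
```

Scores closer to zero indicate terms that co-occur more often, which is why the term list that mixes "pizza" with sushi terms scores lower than the purely sushi-related one.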

5 Conclusion

We propose a new aspect identification model, which combines two aspect matrices of different granularity to reconstruct the sentence representation. Our experimental results show that our model not only learns higher-quality aspects but also captures the aspects of reviews more effectively than previous methods, while producing more coherent topics. The macro-average F1 values of our model for sentence-level aspect identification are 80.9% and 86.8% on the Restaurant and Beer datasets, respectively, a clear improvement over the other state-of-the-art models. However, we still need to manually infer the aspect category for each aspect vector, which limits the application of the model. In the future, we will investigate our model on a broader range of datasets.

Table 1
Dataset description

Domain	Training sentences	Annotated sentences	Max length of a sentence	Total unique words
Restaurant	52,574	Food: 887	158	45,023
		Staff: 352
		Ambience: 251
Beer	1,586,259	Feel: 1022	191	17,017
		Look: 1607
		Smell&Taste: 3672