Hierarchical features-based targeted aspect extraction from online reviews

Abstract

With the prevalence of online review websites, the sheer volume of review data makes focused analysis necessary. Focused analysis aims to capture the information that is highly relevant to a specific aspect. However, the broad range of aspects across products makes this task both important and challenging. A commonly used solution is to modify topic models with additional information so that they capture the features of a specific aspect (referred to as a targeted aspect). However, existing topic models either perform a full analysis to capture as many features as possible or estimate similarity to capture features that are as coherent as possible; both overlook the fine-grained semantic relations between features, so the captured features are coarse and confusing. In this paper, we propose a novel Hierarchical Features-based Topic Model (HFTM) to extract targeted aspects from online reviews and then capture the aspect-specific features. Specifically, our model captures not only the direct features, which carry target-to-feature semantics, but also the latent features, which carry feature-to-feature semantics. Experiments conducted on real-world datasets demonstrate that HFTM outperforms the state-of-the-art baselines in terms of both aspect extraction and document classification.

Keywords

Topic modeling, text mining, aspect extraction, focused analysis, online reviews

1. Introduction

On popular online review websites, such as Twitter and Amazon, people discuss products they are interested in or have purchased [24, 11]. These online reviews in turn help other users make purchase decisions, and they cover a broad range of aspects. Each aspect contains several highly related features that describe its characteristics; that is, these features are aspect-specific. For example, Imaging is an aspect of a camera, and it contains several related features, such as the focusing function and the full-frame of a camera. In reality, when browsing the online reviews of a product, people tend to look for the features of some specific aspects, i.e., targeted aspects, rather than the coarse features of broad aspects. Focused analysis emerged to address this situation [30].

Topic modeling is an effective technique for learning the unknown feature structure in large-scale data, and it supports various applications, such as recommendation systems [37, 26, 3], document classification [25, 6, 20], and sentiment analysis [36, 27, 2]. However, when capturing the features of a targeted aspect, the broad aspects of large-scale online reviews pose two research questions to existing topic models: (RQ1) how the targeted aspects are related to the features, i.e., how to capture the target-to-feature semantics, and (RQ2) how these aspect-specific features are connected with each other, i.e., how to capture the feature-to-feature semantics. The existing approaches that address these two questions can be classified into two categories: (1) full analysis-based approaches, and (2) targeted aspects-oriented approaches.

The models in the first category capture as many aspect-specific features as possible by employing additional information, such as sentiment [10, 1, 28], word linking [9, 22, 29], and ratings of a product [15, 23, 18]. The introduced information imposes a constraint over the words and, through a full analysis, helps map each word to a corresponding aspect so as to identify the targeted aspects. However, a simple constraint at the whole-corpus level aggravates the sparsity problem in the word space. Moreover, because the introduced information is not highly related to the targeted aspects, it may capture unrelated features. As a result, the captured features are too general to describe a specific aspect, which means these models cannot answer RQ1 well.

The models in the second category capture aspect-specific features that are as coherent as possible by improving the semantic similarity at the document level or the sentence level [24, 16, 21, 8]. Content with a higher semantic similarity has a higher probability of sharing the same targeted aspect. Hence, the overall coherence can be enhanced by improving the semantic similarity between the captured features. For instance, TTM [30] is a representative topic model that captures the aspect-specific features with the highest similarity to a set of keywords. These models can estimate the semantics between features and an aspect so as to identify the targeted aspects, which answers RQ1 well. However, they overlook the latent semantics between features and therefore cannot answer RQ2 well.

The limitation of the existing approaches is that they overlook the intrinsic semantics between features. In fact, all the aspect-specific features exist in two layers of a hierarchy rooted at a targeted aspect. The upper layer consists of direct features, which carry target-to-feature semantics, and the lower layer consists of latent features, which carry feature-to-feature semantics. Below we introduce an example to show the two types of features and the motivation of our work.

A Motivating Example. Figure 1 depicts an example with two camera reviews from Amazon, which mention three common aspects marked with different colors, e.g., Imaging in blue and Function in green. We give each aspect a weight, computed as the number of words used to describe the aspect divided by the total number of words in the review; e.g., Imaging has a weight of 0.45 in Review 1. When browsing online reviews, people tend to look for the features of a targeted aspect rather than of broad aspects. For example, if the targeted aspect of a user is Imaging, the features of the other two aspects are not wanted.
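The aspect weight described above is a simple ratio. The following is a minimal sketch, assuming every word of a review has already been labelled with the aspect it describes; the function name `aspect_weights` and the word counts are invented for illustration and are not taken from Fig. 1.

```python
from collections import Counter

def aspect_weights(word_aspects):
    """Return, for each aspect, the share of the review's words that describe it.

    word_aspects holds one label per word; None marks a word tied to no aspect.
    """
    total = len(word_aspects)
    if total == 0:
        return {}
    counts = Counter(a for a in word_aspects if a is not None)
    return {aspect: n / total for aspect, n in counts.items()}

# Hypothetical review of 20 words: 9 about Imaging, 6 about Function, 5 unlabelled.
labels = ["Imaging"] * 9 + ["Function"] * 6 + [None] * 5
weights = aspect_weights(labels)
```

Under these made-up counts, Imaging receives 9/20 of the weight and Function 6/20.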

Figure 1.

A motivating example with camera reviews from Amazon.

Figure 2.

An example of the hierarchy formed by a targeted aspect and its two types of features.

To capture the features of the targeted aspect Imaging, a user specifies the targeted aspect with seed keywords, such as {“sensitivity”, “focus”, “fast”, “full-frame”}, shown in blue italics in Fig. 1. There are two types of features, i.e., direct features and latent features. The direct features explicitly describe the targeted aspect with the keywords. For example, the feature full-frame in Review 1 covers the keyword “sensitivity”. By contrast, latent features do not contain the keywords but are semantically linked to the direct features, i.e., they describe the targeted aspects through feature-to-feature semantics. For example, the feature lens in Review 2 is semantically coherent with the direct feature focusing, since a good focusing function of the lens can assist in imaging. Therefore, a targeted aspect, its direct features, and the corresponding latent features form a two-layer hierarchy (see Fig. 2), in which direct features and latent features carry the target-to-feature semantics and the feature-to-feature semantics, respectively.

However, the existing approaches can only capture the direct features with target-to-feature semantics and overlook the feature-to-feature semantics, regardless of whether they perform a full analysis to capture as many features as possible or improve the semantic similarity to capture features that are as coherent as possible. Therefore, they cannot answer both research questions, RQ1 and RQ2, well.

To overcome the above-mentioned limitations of the existing approaches, we propose a Hierarchical Features-based Topic Model (HFTM), which not only captures aspect-specific features, but also explains how they are hierarchically connected with the two types of semantics.

The characteristics and contributions of our method are summarized as follows:

• We are the first to point out the importance of the hierarchy formed by a targeted aspect and its two types of features, namely the direct features with target-to-feature semantics and the latent features with feature-to-feature semantics.

• We propose a Hierarchical Features-based Topic Model (HFTM), which captures both direct features and latent features for a targeted aspect and explains how these features are hierarchically connected.

• The experiments conducted on three real-world datasets demonstrate that our method outperforms the state-of-the-art baselines in terms of both aspect extraction and document classification.

The remainder of this paper is organized as follows. In Section 2, we briefly review related work. Section 3 presents the overall framework of our proposed method and its technical details. In Section 4, we report the experimental results and analysis. Finally, Section 5 concludes our work.

2. Related work

In this section, we review the related work according to the two above-mentioned questions: (RQ1) how the targeted aspects are related to the features, i.e., aspect extraction, and (RQ2) how these aspect-specific features are connected with each other, i.e., feature semantics.

2.1 Aspect extraction

To answer RQ1, a common way is to modify the conventional topic models, such as Latent Dirichlet Allocation (LDA) [5], to extract aspect-specific features, for which many models have been proposed [22, 32, 35, 28]. The existing topic models for aspect extraction can be generally classified into two categories: (1) full analysis-based approaches, and (2) targeted aspects-oriented approaches.

The methods in the first category capture as many aspect-specific features as possible by performing a full analysis at the corpus level. A natural strategy is therefore to introduce additional information that helps identify the target-to-feature semantics. For example, Ramage et al. [22] propose the LabeledLDA model, which uses additional labels to classify each text into a corresponding aspect. Nimala et al. [19] propose RUSBTM, which adds a sentiment layer at the whole-corpus level. Besides imposing constraints over the word space, some approaches treat pre-trained word embeddings as additional information for aspect extraction [32, 35, 33]. Although the full analysis helps capture aspect-specific features to some extent, it aggravates the sparsity problem over the word space because the additional information is too narrow, so the latent semantic connections between words are filtered out. These approaches therefore focus only on the target-to-feature semantics at the corpus level, e.g., by imposing a constraint over the word space to gather the related words for a targeted aspect [28], and the captured features are too general to describe an aspect.

The methods in the second category capture aspect-specific features that are as coherent as possible by improving the semantic similarity at the document or the sentence level. A common strategy is therefore to accurately estimate the semantic similarity between a targeted aspect and a document. For instance, TTM [30] adopts a spike-and-slab prior while identifying similar features in each sentence. Chen et al. [8] propose the MaToAsp model to group similar features into the same aspect. Li et al. [16] capture the aspect-specific features by using the co-occurring words in each sentence. Although these approaches can capture coherent features for a targeted aspect, similarity estimation at the document or sentence level can only capture the high-frequency features [30], so the low-frequency and implicit semantics between features are overlooked.

Unlike the single-layer analysis adopted by the above-mentioned approaches, our work captures the aspect-specific features from the hierarchy formed by a targeted aspect, its direct features, and latent features. Therefore, the captured features contain both the target-to-feature semantics and feature-to-feature semantics, which help effectively describe a targeted aspect.

2.2 Feature semantics

To answer RQ2, it is necessary to better represent and estimate the feature-to-feature semantics, which can complement the semantics between a target and a feature (i.e., target-to-feature semantics) with the semantics between features. The existing methods considering this issue can be generally classified into two categories: (1) weighting-based methods, and (2) WMD (word mover’s distance)-based methods.

The weighting-based methods assume that stronger semantics correspond to a higher frequency. For example, Bekoulis et al. [4] propose an alternative weighting scheme for feature assignment. Yang et al. [34] apply tensor factorization to infer the weights for the features of an aspect. LWE-TM [31] defines a group of probability-weighted coefficients to estimate the semantics between features. In contrast, WMD (word mover’s distance)-based methods assume that more similar semantics correspond to a shorter distance. These methods estimate the similarity within a semantic distance. The commonly used strategies are word co-occurrence [9, 13], attention mechanisms [24], and temporal distance [7]. It is worth mentioning that the feature semantics can be validated by a domain expert. For example, a set of pre-given keywords of an aspect [30] or some well-defined concepts can be used to represent real-world semantics [14].

The aforementioned methods take the feature-to-feature semantics into account, which improves the completeness of the captured features to some extent. However, the weighting mechanism requires complex cross-validation and is not easy to reproduce. Although WMD provides a better way to estimate feature semantics with a quantifiable distance, it cannot represent the target-to-feature semantics. Moreover, the lack of hierarchical analysis makes the captured features too narrow to cover as many angles of a targeted aspect as possible.

3. Our method

In this section, we present our method in detail. We first give the definitions and notations. Then, the generative process and the inference are introduced. Finally, we describe how features are captured under two types of sparsity. Table 1 lists the notations used in this paper.

Table 1
A list of notations

Notation	Meaning
$D$	# of documents in the entire corpus $C$
$V$	# of words in vocabulary
$N$	# of words in a document
$B$	# of word pairs extracted
$R$	# of the relevance status
$T$	The targeted aspects
$K_{d},K_{l}$	# of direct features and latent features, respectively
$f_{d},f_{l}$	The direct features and the latent features
$d_{i},b_{i},w_{i}$	$i^{\text{th}}$ document/word pair/word, respectively
$r$	The relevance status
$x$	Keywords indicator
$\beta$	Word selector (value of word selector $\in\{0,1\}$ )
$\theta_{d},\theta_{l}$	Multinomial distribution over direct/latent features
$\varphi_{d},\varphi_{l}$	Multinomial distribution over words for direct/latent features
$\tau$	Bernoulli distribution over relevance status
$\delta$	Bernoulli distribution over word selector
$\pi$	The distribution of adaptive sliding window size
$\alpha,\gamma$	Dirichlet priors for $\theta$ and $\pi$ , respectively
$p, q$	Beta priors for $\delta$
$v_{t}$	Word vector for a targeted aspect
$v_{d},v_{l}$	Word vector for a direct/latent feature
$n_{w|k}$	# of word $w$ assigned to feature $k$
$n_{\cdot|k}$	# of total words assigned to feature $k$
$n_{\neg i,k}$	# of word pairs assigned to feature $k$ excluding $b_{i}$
$n_{\neg i,w|k}$	# of word $w$ assigned to feature $k$ excluding the words in $b_{i}$

3.1 Definition of terminologies

Given a corpus consisting of $D$ documents $C=\{d_{1},d_{2},\ldots,d_{D}\}$ , each of which is a separate review sentence, we assume that $C$ covers broad aspects of products. In reality, when browsing these online reviews, people tend to look for the features that are highly related to a specific aspect $T$ , which is referred to as a targeted aspect or target, so as to obtain personalized recommendations of the products they are interested in [30]. Each targeted aspect covers $K$ aspect-specific features, and each aspect-specific feature $f$ consists of several highly coherent words. Here, we define two types of features as follows.

Direct feature

A direct feature $f_{d}$ explicitly describes the targets with target-to-feature semantics, which reflect the mapping relationship between the targets and the features. Specifically, each target contains some specific keywords, and each direct feature of it contains at least one of the keywords. Therefore, we can identify the direct features with a keyword indicator $x$ and a word selector $\beta$ . As shown in Fig. 1, the two keywords “full-frame” and “focusing” can help identify the two corresponding direct features in two sentences of Review 1.

Latent feature

A latent feature $f_{l}$ implicitly describes the targets with the feature-to-feature semantics, which reflect the coherence between it and a direct feature. Specifically, a latent feature contains none of the keywords but has a higher semantic similarity to a direct feature than other features do, which helps indirectly describe the low-frequency information of a target and complement the direct features. As shown in Fig. 1, the latent feature lens contains none of the keywords but has a high semantic similarity to the direct feature focusing.

Hierarchical features

For a targeted aspect $T$ , all the aspect-specific features exist in two layers of a hierarchy rooted at the targeted aspect $T$ . The upper layer (Layer 1) consists of the direct features, which carry target-to-feature semantics, and the lower layer (Layer 2) consists of the latent features, which carry feature-to-feature semantics. As shown in Fig. 2, with the targeted aspect Imaging as the root, Layer 1 contains the direct features full-frame and focusing, which can be directly identified by the keywords. Layer 2 consists of the latent features noise, shutter, and lens, which have a high semantic similarity to a direct feature. Specifically, to estimate the semantic similarities, we represent each feature with a word vector $v$ , where $v_{t}$ , $v_{d}$ , and $v_{l}$ denote the word vector of a targeted aspect, a direct feature, and a latent feature, respectively. All the vectors are encoded by a one-hot representation with the corresponding word distribution $\varphi$ .

Co-occurrence word pairs

In this paper, we use co-occurrence word pairs (also called biterm [9]) to compose the latent features, rather than use a single word (also called unigram [9]). A word pair $b_{k}=(w_{i},w_{j})$ denotes two unordered extracted words $w_{i}$ and $w_{j}$ that are co-occurring within a semantic distance (i.e., within a size of sliding window). For example, a sentence with three distinct words ( $w_{1},w_{2},w_{3}$ ) generates three word pairs:

$\displaystyle(w_{1},w_{2},w_{3})=\{(w_{1},w_{2}),(w_{2},w_{3}),(w_{1},w_{3})\}$ (1)

It is worth mentioning that the extraction of co-occurrence word pairs in this paper differs from the method in [9]. We employ an adaptive sliding window size based on a Poisson distribution over each sentence’s length, rather than a fixed size. We assume that each sentence contains only one feature because documents are, on average, too short to cover multiple features.
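The pair construction in Eq. (1) can be illustrated with a short sketch. It assumes a tokenised sentence and a window size supplied by the caller; the function name `extract_word_pairs` and the `window` parameter are hypothetical, and repeated pairs are deduplicated here for brevity, which the paper does not specify.

```python
from itertools import combinations

def extract_word_pairs(words, window):
    """Return unordered pairs of distinct words that co-occur within `window` tokens."""
    pairs = set()
    for start in range(len(words)):
        span = words[start:start + window]      # tokens covered by the sliding window
        for w1, w2 in combinations(span, 2):
            if w1 != w2:                        # a pair needs two distinct words
                pairs.add(tuple(sorted((w1, w2))))
    return pairs

# A three-word sentence with a window covering the whole sentence, as in Eq. (1).
pairs = extract_word_pairs(["w1", "w2", "w3"], window=3)
```

With a window that covers the whole sentence, the three pairs of Eq. (1) are produced; a smaller window (for instance `window=2`) keeps only adjacent words together.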

3.2 Model description and generative process

To capture the aspect-specific features for a targeted aspect from online reviews, we propose a novel Hierarchical Features-based Topic Model (HFTM), which not only answers how the targeted aspects are highly related to the features, but also answers how these aspect-specific features are connected with each other.

There are two phases in our HFTM model. In Phase 1, a keyword indicator $x$ and a word selector $\beta$ are used to sample the relevance status distribution $\tau$ , which indicates whether the current document has the direct features of a target. In Phase 2, HFTM first captures both direct features and latent features with the hierarchically formed semantics. Then, all the captured features are aggregated by ranking their semantic similarities.

Figure 3 presents the graphical representations of the two phases. Each node in the figure denotes a random variable, and a shaded node denotes an observed variable. Each rectangle denotes an iteration, with the number of iterations shown in its bottom right corner.

Figure 3.

Graphical representations of the two phases of our proposed HFTM model: (a) sampling the relevance status, and (b) capturing both direct features and latent features for the targeted aspect.

Algorithm 1 Sampling the relevance status (Phase 1)
Input: $D$ documents of online reviews; the priors of the Beta distribution, $p$ and $q$ ; the keywords indicator $x$ .
Output: The relevance status distribution $\tau$ .
1: for each feature $f\in[1,K_{d}]$ and $[1,K_{l}]$ do
2:   Draw a prior distribution $\tau\sim\textit{Beta}(p,q)$ ;
3:   For each word $w$ , draw a word selector $\beta\sim\textit{Bernoulli}(\delta)$ ;
4:   Draw a prior word distribution $\varphi_{k}\sim\textit{Dirichlet}(\beta)$ ;
5:   Draw the relevance status distribution $\tau\sim\textit{Bernoulli}(x)$ .
6: end for
7: for each document $d\in C=\{d_{1},d_{2},\ldots,d_{D}\}$ do
8:   Draw a relevance status $r$ based on the keyword indicator $x$ and the $\textit{Bernoulli}(\tau)$ .
9: end for

Here, we describe the model and introduce the overall generative process of the two phases. First, Phase 1 samples the relevance status of each document with respect to a targeted aspect. The generative process is described in Algorithm 1.

Given the corpus consisting of $D$ documents, the priors of the Beta distribution $p$ and $q$ , and the keywords indicator $x$ , Phase 1 samples the relevance status in two steps. First, in Lines 1–6, HFTM draws the prior Beta distribution for the word selector $\beta$ and samples the overall word distribution $\varphi_{k}$ for each feature. Second, in Lines 7–9, HFTM samples a relevance status $r$ for each document by employing the keyword indicator $x$ . Specifically, $x=1$ means that the document contains at least one of the keywords, while $x=0$ means that it contains none of the keywords. Moreover, for the relevance status $r$ , $r=1$ means that the document explicitly describes the targeted aspect and contains direct features, while $r=0$ indicates that the current document is a candidate for capturing the latent features in Phase 2. The output of Phase 1 is the relevance status distribution $\tau$ . The inference of Phase 1 is shown in the next part.
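The keyword indicator $x$ is the only deterministic quantity in Phase 1 and can be sketched directly. The snippet below is an illustration under the assumption that documents are already tokenised; the function name `keyword_indicator` and the two example documents are invented, while the seed keywords are those of the motivating example. It computes only $x$ ; the relevance status $r$ itself is sampled by the model and is not reproduced here.

```python
def keyword_indicator(document, keywords):
    """x = 1 if the tokenised document contains at least one seed keyword, else 0."""
    return int(any(word in keywords for word in document))

# Seed keywords of the targeted aspect Imaging from the motivating example.
seed_keywords = {"sensitivity", "focus", "fast", "full-frame"}

# Two made-up tokenised review sentences.
docs = [
    ["the", "full-frame", "sensor", "is", "great"],
    ["battery", "drains", "quickly"],
]
x = [keyword_indicator(d, seed_keywords) for d in docs]
```

Here the first document contains the keyword “full-frame” and the second contains none of the keywords.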

Second, Phase 2 captures the two types of features, i.e., the direct features and the latent features. The generative process is described in Algorithm 2.

Algorithm 2 Capturing both direct features and latent features (Phase 2)
Input: $D$ documents of online reviews; the hyper-parameters, $\alpha$ and $\gamma$ ; the relevance status distribution $\tau$ .
Output: The set of aspect-specific features.
1: Draw the feature distribution $\theta\sim\textit{Dirichlet}(\alpha)$ for the entire corpus $C$ ;
2: for each document $d\in C=\{d_{1},d_{2},\ldots,d_{D}\}$ do
3:   if $r=1$ , i.e., $d$ contains direct features then
4:     Draw a feature $f_{d}\sim\textit{Multinomial}(\theta_{d})$ ;
5:     Draw a word $w\sim\textit{Multinomial}(\varphi_{d})$ ;
6:   end if
7:   if $r=0$ , i.e., $d$ contains latent features then
8:     Draw the size of self-adaptive sliding window $\pi\sim\textit{Poisson}(\gamma)$ ;
9:     Extract a co-occurrence word pair $b\in\{b_{1},b_{2},\ldots,b_{B}\}$ ;
10:     Draw a latent feature $f_{l}\sim\textit{Multinomial}(\theta_{l})$ ;
11:     Draw words in $b$ , $w_{1},w_{2}\sim\textit{Multinomial}(\varphi_{l})$ ;
12:   end if
13: end for
14: for each direct feature $f_{d}$ do
15:   Estimate its semantic similarity to a targeted aspect $T$ by computing the cosine similarity $CS_{d}=\textit{Cosine}(v_{d},v_{t})$ ;
16: end for
17: for each latent feature $f_{l}$ do
18:   Estimate its semantic similarity to a direct feature $f_{d}$ by computing the cosine similarity $CS_{l}=\textit{Cosine}(v_{l},v_{d})$ ;
19: end for
20: Aggregate final features by ranking the similarity values.

Based on the relevance status $r$ sampled in Phase 1, Phase 2 captures both direct features and latent features in three steps. First, in Line 1, HFTM samples an overall feature distribution $\theta$ for all the documents. Second, in Lines 2–13, HFTM samples the topical words for each feature according to the relevance status. Specifically, when $r=1$ , HFTM samples topical words for direct features from the word distribution $\varphi_{d}$ over $K_{d}$ direct features, and when $r=0$ , HFTM samples topical words for latent features from the word distribution $\varphi_{l}$ over $K_{l}$ latent features. Third, in Lines 14–20, the final features are aggregated by ranking the semantic similarity of all the captured features, computed by a function $\textit{Cosine}(\cdot)$ that estimates the semantic similarity between two distributions, each represented by a vector $v$ . The inference of Phase 2 is also shown in the next part.

3.3 Feature capture with hierarchical semantics

This subsection presents the inference details of HFTM to explain how our model captures both direct and latent features for a targeted aspect with the hierarchically formed semantics. This part also explains how to answer the two research questions proposed in Section 1. We use the collapsed Gibbs Sampling [12] to derive the variables in our model.

In Phase 1, HFTM samples the relevance status $r$ for each document based on a set of seed keywords denoted as $v_{t}$ , which impose a sparse constraint over the word space, i.e., $|v_{t}|\ll V$ . Therefore, the word vector $v_{t}$ can be identified by a keyword indicator $x$ and a word selector $\beta$ . The design of these two variables is to derive the relevance status distribution $\tau$ . As shown in the motivating example, when the targeted aspect is Imaging, words “focusing” and “full-frame” are in seed keyword vector $v_{t}$ . Phase 1 is to identify documents that contain direct features with the target-to-feature semantics. And the direct features generated from these documents compose Layer 1 of a targeted aspect.

Algorithm 3 Hierarchical semantic estimation
Input: The relevance status distribution, $\tau$ ; the set of candidate features, $f_{d}$ and $f_{l}$ , encoded by vectors $v_{d}$ and $v_{l}$ ; the set of seed keywords vector, $v_{t}$ ; the number of final aggregated features, $K$ ; the similarity threshold, $t$ .
Output: The set of aggregated aspect-specific features $L_{d}$ and $L_{l}$ .
1: for each $f_{d}\in[1,K_{d}]$ do
2:   if $\frac{v_{d}\cdot v_{t}}{\lVert v_{d}\rVert\lVert v_{t}\rVert}\geqslant t$ then
3:     Put $f_{d}$ into a candidate list $L_{d}$ of Layer 1;
4:   end if
5: end for
6: for each $f_{l}\in[1,K_{l}]$ do
7:   if $\frac{v_{l}\cdot v_{d}}{\lVert v_{l}\rVert\lVert v_{d}\rVert}\geqslant t$ then
8:     Put $f_{l}$ into a candidate list $L_{l}$ of Layer 2;
9:   end if
10: end for
11: Sort the candidate lists $L_{d}$ and $L_{l}$ in descending order;
12: Aggregate the final top $K$ features from both $L_{d}$ and $L_{l}$ .

The sampling process in Phase 1 is as follows. First, we sample the prior Beta distribution $\delta$ for the word selector $\beta$ by employing the two priors $p$ and $q$ , where $w$ denotes a word in the vocabulary, $b\in\{0,1\}$ , $\Gamma(\cdot)$ is a Gamma function, and the symbol $\cdot$ means a summation of the corresponding instances:

$\displaystyle P(\beta_{w}=b|\beta_{\neg w},\delta)\propto\left\{\begin{array}{ll}\Gamma(n_{w}^{r=1}+\delta)\times\Gamma(|\beta_{\neg w}^{r=1}|+\delta+V+n_{\neg w}^{r=1})\times\Gamma(|\beta_{\neg w}^{r=1}|\delta+\delta+V)\times|\beta_{\neg w}^{r=1}|&b=1\\ \Gamma(\delta)\times\Gamma(|\beta_{\neg w}^{r=0}|\delta+\delta+V+n_{\neg w}^{r=0})\times\Gamma(|\beta_{\neg w}^{r=0}|\delta+V)\times(V-|\beta_{\neg w}^{r=0}|-1)&b=0\end{array}\right.$ (2)

Second, we sample the relevance status $r_{i}$ for document $d_{i}$ , where $r_{i}\in R=\{0,1\}$ , $c\in\{0,1\}$ , and $i\in D$ :

$\displaystyle P(r_{i}=c|r_{\neg i},x,\delta,\alpha,\beta)\propto\frac{\sum_{i=1}^{D}(n_{\neg i}^{r=c})}{\sum_{i=1}^{D}(n_{\neg i}^{r=c})+R}\times\frac{\prod_{w=1}^{V}\Gamma(\beta_{w}^{r=c}n_{\neg i}^{r=c})+\beta_{w}^{r=c}\delta}{\Gamma(\sum_{w=1}^{V}(n_{\neg i}^{r=c}\beta_{\neg w}^{r=c})+|\beta_{\cdot}^{r=c}|\delta+V)}.$ (3)

In Phase 2, HFTM captures topical words for both direct features generated from the documents with $r=1$ and latent features generated from the documents with $r=0$ . First, we sample the aspect-specific features by employing the relevance status distribution $\tau$ .

Specifically, if $r=1$ , we sample a direct feature $f_{d}=k\in K_{d}$ as follows:

$\displaystyle P(f_{d}=k|f_{\neg i},\alpha,\beta,\delta)\propto\frac{n_{\neg i|k}^{r=1}+\alpha}{n_{\cdot|k}^{r=1}+K_{d}\alpha}\times\frac{\beta_{w}^{r=1}n_{\neg i|k}^{r=1}+\beta_{w}^{r=1}\delta}{n_{\cdot|k}^{r=1}+|\beta_{\cdot|k}^{r=1}|\delta+V}.$ (4)

If $r=0$ , we sample a latent feature $f_{l}=k\in K_{l}$ as follows:

$\displaystyle P(f_{l}=k|f_{\neg i},B_{\neg i},\alpha,\beta,\gamma)\propto(n_{\neg i|k}^{r=0}+\alpha)\gamma\times\frac{(n_{\neg i,w_{1}|k}^{r=0}+\beta_{w_{1}}^{r=0}+1)(n_{\neg i,w_{2}|k}^{r=0}+\beta_{w_{2}}^{r=0})}{\sum_{B}(n_{\cdot|k}^{r=0}+\beta_{\cdot}^{r=0}+1)(n_{\cdot|k}^{r=0}+\beta_{\cdot}^{r=0})},$ (5)

where $B_{\neg i}$ denotes all the word pairs except $b_{i}$ , and $w_{1}$ and $w_{2}$ are two distinct words in the word pair $b_{i}$ .

Finally, we adopt a semantic estimation to aggregate all the discovered features, as described in Lines 14–20 of Phase 2. We adopt cosine similarity to form the hierarchical semantics with both target-to-feature semantics (i.e., Layer 1 with direct features) and feature-to-feature semantics (i.e., Layer 2 with latent features), as described in Algorithm 3, where $t$ denotes the threshold of the similarity estimation, $L_{d}$ and $L_{l}$ denote the candidate direct features and candidate latent features, respectively, and $K$ denotes the number of final features aggregated from $L_{d}$ and $L_{l}$ .
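The thresholding and ranking steps of the hierarchical semantic estimation can be sketched as follows. This is an illustration rather than the authors' implementation, and it rests on two assumptions that the text does not settle: a latent feature is scored against its closest direct feature, and the final top $K$ is taken from the merged, score-sorted candidates. The function names and the toy three-dimensional vectors are invented.

```python
import math

def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def aggregate_features(v_t, direct, latent, t, K):
    """direct/latent map feature names to vectors; returns (L_d, L_l, top-K names)."""
    # Layer 1: direct features whose similarity to the target reaches the threshold.
    L_d = [(name, cosine(v, v_t)) for name, v in direct.items()]
    L_d = [(name, score) for name, score in L_d if score >= t]
    # Layer 2: latent features scored against the closest direct feature (assumption).
    L_l = []
    for name, v in latent.items():
        score = max((cosine(v, v_d) for v_d in direct.values()), default=0.0)
        if score >= t:
            L_l.append((name, score))
    L_d.sort(key=lambda p: p[1], reverse=True)
    L_l.sort(key=lambda p: p[1], reverse=True)
    # Final aggregation: highest-scoring K candidates across both layers (assumption).
    merged = sorted(L_d + L_l, key=lambda p: p[1], reverse=True)
    return L_d, L_l, [name for name, _ in merged[:K]]

# Toy vectors; the feature names follow Fig. 2, the numbers are made up.
v_t = [1.0, 1.0, 0.0]
direct = {"focusing": [1.0, 0.0, 0.0], "full-frame": [1.0, 1.0, 0.0]}
latent = {"lens": [1.0, 0.0, 1.0], "price": [0.0, 0.0, 1.0]}
L_d, L_l, top = aggregate_features(v_t, direct, latent, t=0.5, K=3)
```

In this toy setting both direct features pass the threshold, lens is kept in Layer 2 through its similarity to focusing, and price is discarded because it is orthogonal to every direct feature.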

3.4 Capturing features with sparsity

In this part, we discuss two types of sparsity in capturing aspect-specific features, i.e., (1) the sparsity over word space caused by the global word constraint, and (2) the sparsity over co-occurrence word pairs caused by the length limitation on each document.

The first type of sparsity is caused by the constraint over the word space imposed when the set of keywords of a target is given in advance, i.e., $|v_{t}|\ll V$ . To address this problem, HFTM adjusts the word distribution over direct features $\varphi_{d}$ by employing the relevance status $r$ . For example, assuming that the words “focusing” and “full-frame” are two pre-given keywords of the targeted aspect Imaging, the keyword indicator $x$ can explicitly identify these words. In the meantime, the word selector $\beta$ helps sample a direct feature-word distribution $\varphi_{d}$ . Consider a document $d$ that contains the keyword “full-frame”; its relevance status is $r=1$ . It is reasonable to assume that the other words in $d$ also have a high semantic similarity to the targeted aspect, even though they cannot be explicitly identified by $x$ and $\beta$ . Therefore, we add them to the set of keywords for the next iteration.
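The keyword expansion described above can be sketched in a few lines. The sketch simplifies the model by treating every document that contains a keyword as having $r=1$ , whereas HFTM samples $r$ ; it also assumes that stop words have already been removed. The function name `expand_keywords` and the example documents are invented.

```python
def expand_keywords(documents, keywords):
    """One iteration: words that share a document with a keyword join the keyword set."""
    expanded = set(keywords)
    for doc in documents:
        if any(word in keywords for word in doc):   # document matched by a keyword
            expanded.update(doc)                    # its other words become keywords
    return expanded

keywords = {"focusing", "full-frame"}
docs = [
    ["full-frame", "sensor", "sharp"],   # contains a keyword
    ["battery", "weak"],                 # contains none
]
next_keywords = expand_keywords(docs, keywords)
```

After one iteration the words “sensor” and “sharp” are added, while the words of the unmatched document are left out.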

The second type of sparsity is caused by the limitation on text length. For example, Amazon limits the length of a review to be no more than 200 characters, while Twitter limits the length of a tweet to be no more than 280 characters. Such limitations restrict the number of features that a document can contain. In addition, statistical information, such as word frequency, cannot be fully utilized due to this sparsity, so the feature-to-feature semantics are overlooked. Although this type of sparsity has less impact on the direct features, it negatively affects the latent features. To address this problem, HFTM enriches the feature space with co-occurrence word pairs. The co-occurrence word pairs are extracted within a global semantic distance (i.e., within a self-adaptive sliding window size), which is denoted as $WL$ and computed as follows:

$\displaystyle WL_{i}=N_{i}\times\gamma$ (6)

where $N_{i}$ is the number of words in document $d_{i}$ . Then, we sample a word pair as follows:

$\displaystyle P(b_{i}|B_{\neg i},\alpha,\beta^{r=0},\gamma)\propto\frac{n_{\neg i|k}+\alpha}{\sum_{k=1}^{K_{l}}(B+K_{l}\alpha)}\times\frac{\prod_{i=1}^{V}(n_{\neg i|k}\gamma)}{\sum_{k=1}^{K_{l}}(B+V\beta_{\cdot}^{r=0})}$ (7)

Therefore, the utilization of co-occurrence word pairs can alleviate the second type of sparsity, which helps capture the latent features. Eq. (7) presents the latent feature assignment for each co-occurrence word pair.
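As an illustration of the self-adaptive window in Eq. (6), the following sketch extracts unordered co-occurrence word pairs from one tokenized document. Several details are assumptions rather than part of the paper: $\gamma$ is treated as a fraction of the document length, the window size is rounded up and kept at two words or more, and the helper names are hypothetical.

```python
import math
from itertools import combinations


def window_length(num_words, gamma):
    """Eq. (6): WL_i = N_i * gamma (rounded up, at least 2 by assumption)."""
    return max(2, math.ceil(num_words * gamma))


def cooccurrence_pairs(tokens, gamma):
    """Collect unordered word pairs that co-occur inside a sliding window."""
    wl = window_length(len(tokens), gamma)
    pairs = set()
    # Slide the window one token at a time over the document.
    for start in range(max(1, len(tokens) - wl + 1)):
        window = tokens[start:start + wl]
        for a, b in combinations(window, 2):
            if a != b:  # a word paired with itself carries no information
                pairs.add(tuple(sorted((a, b))))
    return pairs


tokens = ["lens", "focus", "fast", "sharp"]
print(sorted(cooccurrence_pairs(tokens, gamma=0.5)))
```

With four words and $\gamma=0.5$ the window covers two words, so only adjacent words are paired; a larger $\gamma$ widens the window and yields more pairs for the same document.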

4. Experiments

In this section, we present the experiments conducted on three real-world datasets. Specifically, our experiments wish to answer the following four key questions:

Q1: How effective is our model at capturing features for a particular aspect? (see Experiment 1)
Q2: How semantically coherent are the captured features from the human perspective? (see Experiment 2)
Q3: How do the number of captured features $K$ and the average text length AvgLen affect the performance of our model? (see Experiment 3)
Q4: How effective is our model on an external application? (see Experiment 4)

4.1 Experimental settings

Datasets. We use three real-world datasets, including one film review dataset Douban, which is crawled from the Douban website (https://movie.douban.com/, a popular film review website in China), one product review dataset Camera crawled from Amazon (https://www.amazon.cn/) [30], and one tweet dataset Cigar crawled from Twitter [30]. The reasons that we choose these three datasets are as follows: (1) the reviewed entities are different, e.g., Douban reviews films while Camera reviews digital cameras; (2) they have significantly different average text length AvgLen; (3) the sizes of vocabulary are different.

For each dataset, we conduct the standard pre-processing as follows: (1) convert letters into lowercase, (2) remove meaningless words and the characters not in Latin as stop words, (3) remove the low-frequency words (whose frequency is less than 2), and (4) remove the documents whose length is smaller than 5 words. After the pre-processing, the statistics of the datasets are summarized in Table 2.
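The four pre-processing steps can be sketched as follows. This is a minimal illustration and not the authors' pipeline: the stop-word list, the tokenization rule, and the function name `preprocess` are assumptions, and only the thresholds (frequency below 2, length below 5 words) come from the text.

```python
import re
from collections import Counter

# Illustrative stop-word list; the list used in the paper is not given.
STOP_WORDS = {"the", "a", "is", "and", "of"}


def preprocess(raw_docs, min_freq=2, min_len=5):
    # (1) lowercase; (2) keep Latin-letter tokens only and drop stop words
    docs = [
        [t for t in re.findall(r"[a-z]+(?:-[a-z]+)*", doc.lower())
         if t not in STOP_WORDS]
        for doc in raw_docs
    ]
    # (3) remove words whose corpus frequency is below min_freq
    freq = Counter(t for doc in docs for t in doc)
    docs = [[t for t in doc if freq[t] >= min_freq] for doc in docs]
    # (4) remove documents shorter than min_len words
    return [doc for doc in docs if len(doc) >= min_len]


raw = [
    "The lens is sharp and the focus is fast lens focus sharp",
    "Sharp lens",
]
print(preprocess(raw))
```

In this toy input the word "fast" is dropped because it occurs only once, and the second review is dropped because it has fewer than five words left.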

Table 2

Data statistics, where #Docs is the number of documents, V is the number of words, and AvgLen is the average number of words in a document

Dataset	#Docs	V	AvgLen	Aspects
Douban	2,519	11,817	54.99	Plot, Actors, Image, Box-office, etc.
Camera	9,976	705	34.77	Imaging, Lens, Function, etc.
Cigar	43,762	566	5.95	Event, Production Place, Box, etc.

Baselines. We compare the experimental results of our HFTM model with the following six baselines, which can be classified into two categories: (1) full analysis-based approaches, and (2) targeted aspects-oriented approaches.

4.1.1 Full analysis-based approaches

•
Latent Dirichlet Allocation (LDA) [5] is a conventional topic model, which captures features at the corpus level. It is worth mentioning that the full analysis may split the targeted aspects into multiple features without considering any hierarchically formed semantics.
•
Biterm Topic Model (BTM) [9] is a co-occurrence word pairs-based method, which captures the features at the corpus level with the global semantics. The corpus-level features can be very effective on sparse data, like tweets.
•
Simplified Hierarchical Features based Topic Model (S-HFTM) is a simplified version of our method, which captures aspect-specific features without the keyword indicator. This adjustment is to investigate how our model improves feature coherence by capturing hierarchical features rather than performing the full analysis at the corpus level.

4.1.2 Targeted aspects-oriented approaches

•
Targeted Latent Dirichlet Allocation (T-LDA) is an updated model based on standard LDA. This adjustment is to enable LDA to capture aspect-specific features by employing the keyword indicator. We compare our model with T-LDA to investigate how the latent features enhance the overall coherence and validate the necessity of latent features.
•
Targeted Biterm Topic Model (T-BTM) is an updated model based on standard BTM. This adjustment also enables BTM to capture aspect-specific features by employing the keyword indicator. We compare our model with T-BTM to investigate which of the two mechanisms (i.e., the co-occurrence word pairs and the latent features) contributes more to the improvement of feature coherence.
•
Targeted Topic Model (TTM) [30] is a state-of-the-art method, which performs the focused analysis at the document level to capture aspect-specific features. We compare our model with TTM to investigate how the feature-to-feature semantics improves the overall coherence, i.e., to validate the effectiveness of latent features.

Experimental tasks. To evaluate the performance of our HFTM model and the baselines, we design the following four experimental tasks:

•
Task 1: Quantitative evaluation of feature coherence (for Q1)
•
Task 2: Qualitative evaluation of feature coherence (for Q2)
•
Task 3: Robustness analysis (for Q3)
•
Task 4: Document classification evaluation (for Q4)

Evaluation metrics. For Task 1 and Task 3, we adopt Normalized Pointwise Mutual Information (NPMI) [17], which quantitatively measures the semantic coherence of words in each feature. For Task 2, we visualize the topical words to investigate how coherent the features are. For Task 4, we employ an open-sourced Naive Bayes Classifier (NBC) (http://javaml.sourceforge.net/) because it has a strong probabilistic foundation. These metrics have been widely used in the literature [13, 9].

Parameter settings. For fair comparisons, we have optimized both the parameters of our HFTM model and those of the baselines. First, for the commonly used hyper-parameters $\alpha$ and $\beta$ of the Dirichlet distribution, we adopt $\alpha=0.05$ and $\beta=0.01$ as suggested in the literature [5]. This setting has been validated once on a small test set from each of the datasets. Second, we set the hyper-parameters of the Beta distribution as $p=1$ and $q=1$ as suggested in the literature [30]. Third, for the similarity threshold $t$ , we optimized it via a 10-time grid search with cross-validation. In addition, we set the same number $K$ for both direct and latent features, namely, we set $K_{d}=K_{l}=K$ . For all experiments in this paper, we investigate the performance of each model with different numbers of features $K$ , which changes from 10 to 35 with a step of 5. Finally, for all the baselines, we adopt the optimized parameters as suggested in their respective papers. We run each model 10 times, each run with 1000 iterations of Gibbs sampling, and then report the averaged results.

4.2 Experimental results

To answer the key questions Q1–Q4, we conduct the following four experiments and report the corresponding results, respectively.

4.2.1 Experiment 1: Quantitative evaluation of feature coherence (for Q1)

To answer Q1, we quantitatively evaluate the semantic coherence of the aspect-specific features captured by our HFTM model and the six baselines. The higher the coherence of the captured features, the more effectively they can describe a targeted aspect. We adopt NPMI [17] as the evaluation metric, which measures the semantic coherence between the features by employing an external corpus, such as the Wikipedia text corpus (https://einstein.ai/research/the-wikitext-long-term-dependency-language-modeling-dataset). For an aspect-specific feature $f_{k}$ , we use the $N$ most probable words within a 20-word sliding window to compute the NPMI score as follows:

$\displaystyle\textit{NPMI}(k)=\sum_{1\leqslant i<j\leqslant N}\frac{\log\frac{P(w_{i},w_{j})}{P(w_{i})P(w_{j})}}{-\log P(w_{i},w_{j})}$ (8)

where $P(w_{i},w_{j})$ is the probability of two words $w_{i}$ and $w_{j}$ co-occurring in a document, and $P(w_{i})$ is the probability of word $w_{i}$ occurring in a document. We conduct Experiment 1 with the number of features $K$ changing from 10 to 35 with a step of 5, and we set $N=\{5,10,20,40\}$ . Table 3 presents the experimental results. Due to the space limitation, we only list the NPMIs with $K=\{10,20,30\}$ and $N=\{5,10,20\}$ , whereas the improvement delivered by HFTM is computed from the results of the complete experiment. A higher NPMI indicates a higher feature coherence.
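Eq. (8) can be sketched in a few lines. The sketch estimates the probabilities from document-level co-occurrence in a small reference corpus, which is a simplification of the 20-word sliding window over an external corpus used in the paper; the function name `npmi_score` and the handling of degenerate pairs are assumptions.

```python
import math
from itertools import combinations


def npmi_score(top_words, documents):
    """Sum the NPMI of every pair of top words, following Eq. (8)."""
    doc_sets = [set(doc) for doc in documents]
    n_docs = len(doc_sets)

    def prob(*words):
        # Fraction of reference documents that contain all given words.
        return sum(all(w in d for w in words) for d in doc_sets) / n_docs

    total = 0.0
    for wi, wj in combinations(top_words, 2):
        p_ij = prob(wi, wj)
        # Assumption: pairs that never or always co-occur are skipped,
        # because log(0) and a zero denominator are undefined.
        if p_ij == 0.0 or p_ij == 1.0:
            continue
        pmi = math.log(p_ij / (prob(wi) * prob(wj)))
        total += pmi / -math.log(p_ij)
    return total


reference = [["script", "plot"], ["script", "plot"], ["actor"], ["camera"]]
print(npmi_score(["script", "plot"], reference))
```

In the toy corpus "script" and "plot" always appear together, so their pair reaches the maximum per-pair value of 1. Because Eq. (8) sums over pairs rather than averaging, the score of a feature with $N$ top words can exceed 1, which is consistent with the magnitudes reported in Table 3.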

Table 3

NPMI scores on the three datasets with the number of features $K$ changing from 10 to 30

Number of features $K$		$K=$ 10			$K=$ 20			$K=$ 30			Average scores	Improvement by HFTM
Datasets	Method	Top5	Top10	Top20	Top5	Top10	Top20	Top5	Top10	Top20
Douban	LDA	1.2978	1.2147	0.9902	0.7841	0.6942	0.5155	0.7036	0.6377	0.4142	0.7388	$+$ 1.00 (100.04%)
	T-LDA	1.5988	1.3993	1.1125	1.1158	0.8322	0.6037	0.9209	0.7042	0.4902	0.8741	$+$ 0.87 (86.51%)
	BTM	1.6589	1.4853	1.2071	1.1149	0.9313	0.6614	1.0249	0.7474	0.5133	0.9463	$+$ 0.79 (79.29%)
	T-BTM	1.7361	1.6382	1.4497	1.3278	1.1547	0.9202	1.1186	0.9650	0.8693	1.1510	$+$ 0.59 (58.82%)
	TTM	1.6089	1.4621	1.2237	1.2309	1.0622	0.7815	1.2029	0.9105	0.5232	1.0041	$+$ 0.74 (73.51%)
	S-HFTM	1.9392	1.8087	1.6878	1.6677	1.6028	1.4951	1.6948	1.4451	1.2965	1.5764	$+$ 0.16 (16.28%)
	HFTM	2.1347	1.9182	1.7671	1.8659	1.7343	1.6003	1.8029	1.6423	1.5835	1.7392	N/A
Camera	LDA	1.6580	1.4643	1.3742	1.2164	0.9412	0.8652	1.1570	0.8094	0.7362	1.0873	$+$ 1.03 (102.56%)
	T-LDA	1.7309	1.7001	1.6006	1.5210	1.4711	1.3102	1.4842	1.3170	1.2279	1.4387	$+$ 0.67 (67.41%)
	BTM	1.7846	1.6151	1.5497	1.5381	1.3146	1.1192	1.4897	1.1936	0.8670	1.2939	$+$ 0.82 (81.89%)
	T-BTM	1.9062	1.8401	1.7416	1.7546	1.7257	1.6380	1.7220	1.6567	1.6258	1.7007	$+$ 0.41 (41.21%)
	TTM	1.7346	1.7257	1.5830	1.7146	1.6051	1.4845	1.6946	1.4845	1.4052	1.5643	$+$ 0.55 (54.86%)
	S-HFTM	2.2725	2.1896	2.1209	2.2343	2.0666	2.0113	2.1033	1.8836	1.8107	2.0100	$+$ 0.10 (10.28%)
	HFTM	2.2910	2.1602	2.1132	2.1644	2.1401	2.1160	2.1133	2.0817	2.0539	2.1128	N/A
Cigar	LDA	1.2351	1.1466	0.9649	1.1175	0.9761	0.7692	1.0346	0.8628	0.7731	0.9283	$+$ 0.78 (77.83%)
	T-LDA	1.4947	1.1581	0.8817	1.6711	1.1531	1.0807	1.6849	1.7893	1.4248	1.2700	$+$ 0.44 (43.65%)
	BTM	1.7094	1.4544	1.1972	1.4726	1.2106	0.9961	1.3726	1.1806	0.9942	1.2069	$+$ 0.50 (49.97%)
	T-BTM	1.4603	1.3536	1.1794	1.7694	1.5649	1.3248	1.8894	1.7049	1.4248	1.4402	$+$ 0.27 (26.64%)
	TTM	1.2536	1.1794	1.0329	1.4711	1.2035	1.1307	1.6993	1.3849	1.2748	1.2341	$+$ 0.47 (47.25%)
	S-HFTM	1.9549	1.7139	1.5706	1.8894	1.6849	1.3299	1.7130	1.4255	1.1832	1.5350	$+$ 0.17 (17.15%)
	HFTM	1.8145	1.7036	1.6146	1.7862	1.9053	1.7227	2.1053	1.7862	1.7227	1.7066	N/A

First, we can see from Table 3 that our HFTM model outperforms all the baselines on the three datasets. On the one hand, on average, HFTM significantly improves over LDA, BTM, and S-HFTM by 93.48%, 70.38%, and 14.57%, respectively. This result demonstrates that our HFTM model can improve feature coherence by capturing the target-to-feature semantics with the direct features. On the other hand, the averaged improvements in NPMIs delivered by HFTM over T-LDA, T-BTM, and TTM are 65.86%, 42.22%, and 58.54%, respectively. This result demonstrates that the latent features can complement the direct features, which prevents the captured features from being too specific.
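The averaged improvements quoted above can be retraced from the last two columns of Table 3. The sketch below does this for LDA; the variable and function names are hypothetical, while the numbers are copied from the table. Judging from the table values, the percentage in the last column appears to be the absolute NPMI difference multiplied by 100 rather than a gain relative to the baseline score.

```python
# Average NPMI scores copied from Table 3.
avg_npmi = {
    "Douban": {"LDA": 0.7388, "HFTM": 1.7392},
    "Camera": {"LDA": 1.0873, "HFTM": 2.1128},
    "Cigar": {"LDA": 0.9283, "HFTM": 1.7066},
}
# "Improvement by HFTM" percentages for LDA, also copied from Table 3.
reported_gain = {"Douban": 100.04, "Camera": 102.56, "Cigar": 77.83}


def gain_from_averages(dataset, baseline="LDA"):
    """Absolute NPMI difference between HFTM and a baseline, times 100."""
    scores = avg_npmi[dataset]
    return (scores["HFTM"] - scores[baseline]) * 100


for dataset in avg_npmi:
    print(dataset, round(gain_from_averages(dataset), 2), reported_gain[dataset])

# Mean of the three reported gains, as quoted in the text for LDA.
mean_reported = sum(reported_gain.values()) / len(reported_gain)
print(round(mean_reported, 2))
```

The recomputed differences agree with the reported column up to a rounding difference of 0.01, which is expected because the table lists rounded averages, and the mean of the three reported gains is 93.48, the value stated for LDA.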

Second, we can see that the targeted aspects oriented methods outperform the full analysis-based methods, especially when comparing our HFTM model with its simplified version S-HFTM. This result demonstrates that the extraction of targeted aspects at the document level or the sentence level is more effective to narrow the range of the aspects than performing a full analysis at the corpus level. Moreover, the improvement becomes more significant when introducing latent features.

Summary 1

Overall, our HFTM model significantly outperforms the existing methods for capturing aspect-specific features. Specifically, it enhances the single-layer semantics between a targeted aspect and features by introducing the latent features with feature-to-feature semantics, which improves the feature coherence.

4.2.2 Experiment 2: Qualitative evaluation of feature coherence (for Q2)

To answer Q2, we qualitatively investigate the coherence of the captured features and the corresponding targeted aspect. When the targeted aspect is the Plot of films, Table 4 visualizes the captured features with their topical words. We mark unrelated words in italics and red to highlight the difference between our model and the baselines.

Table 4
Aspect-specific features of a targeted aspect Plot on Douban dataset with a fixed feature number $K=20$

	The targeted aspect: Plot
S-HFTM	Direct feature $f_{d}$ : script	Script plot metaphor insinuate creative english human they fragment ending
T-LDA	Direct feature $f_{d}$ : script	Script plot complete feature series beginning ending replace dad horrible
	Latent feature $f_{l}$ : actors	Yunfat-Chow acting feeling kingdom voice detail reality student fairy look
T-BTM	Direct feature $f_{d}$ : script	Script plot complete motivational fantastic beginning low-cost time ending fiction
	Latent feature $f_{l}$ : photography	Photography life-like image routine mess gorgeous coffer Douban simple love
TTM	Direct feature $f_{d}$ : script	Script scenario protagonist plot scriptwriter brother plagiarize shocking touching look
	Latent feature $f_{l_{1}}$ : actors	Bob Ekin-Cheng Xun-Chow cat say acting talented reputation dinosaur human
	Latent feature $f_{l_{2}}$ : photography	Photography life-like image shocking filter very brilliant support toy cyberpunk style
HFTM	Direct feature $f_{d}$ : script	Script scenario protagonist plot scriptwriter ending plagiarize touching slow hotel
	Latent feature $f_{l_{1}}$ : box-office	Flop public praise business Oscar top reputation effect ancients sing
	Latent feature $f_{l_{2}}$ : actors	Bob Ekin-Cheng Xun-Chow cat dialogue Yunfat-Chow acting talented astonishing Disney

As presented in the results of Experiment 1, the feature coherence of full analysis-based methods (i.e., LDA and BTM) is worse than that of the targeted aspects oriented methods (i.e., T-LDA, T-BTM, and TTM). Therefore, in Experiment 2, we only compare the performance of HFTM and the three targeted aspects oriented methods (i.e., T-LDA, T-BTM, and TTM). Moreover, we also compare our model with the simplified version S-HFTM to demonstrate the necessity of the latent features.

First, all the compared methods capture the direct feature script with some keywords, such as “script” and “scenario”. Specifically, our HFTM model captures better results with more discriminative words, such as “scriptwriter” and “ending” (two important factors of a film’s plot quality). In addition, our model alleviates the influence of noise by capturing fewer unrelated words than the baselines. For example, the direct feature script captured by TTM contains the unrelated words “look” and “brother”. T-BTM groups more adjectives together, such as “complete”, “motivational”, and “fantastic”. However, these words are not coherent with the plot of a film and make the result too general and confusing. Among these methods, T-LDA captures the worst result with many noisy words, such as “dad” and “horrible”.

Second, HFTM improves the interpretability of the features by complementing the direct features with the latent features. Specifically, the box-office (i.e., commercial sales of a film) and the actors are two factors that are highly related to a film’s plot. The two latent features captured by HFTM are more semantically coherent with Plot than the feature photography captured by both TTM and T-BTM. This is mainly because a good plot can directly improve the commercial sales of a film, but it cannot directly affect the audiences’ perception of photography. Moreover, the representation of the plot requires good actors, and the actors’ acting skills, in turn, improve the commercial sales of a film. In terms of S-HFTM, it can capture the direct features, but it cannot complement them with feature-to-feature semantics, so the captured features cannot describe the targeted aspect from multiple angles.

Summary 2

Overall, our HFTM model can capture more interpretable features for a targeted aspect by considering the feature-to-feature semantics. Comparing HFTM with TTM, which is a representative model for targeted aspect extraction, the experimental results demonstrate that the hierarchically formed features are more effective in describing the targeted aspect.

4.2.3 Experiment 3: Robustness evaluation with two factors (for Q3)

To answer Q3, we evaluate the robustness of our model and the baselines by varying two factors, i.e., the number of features $K$ and the average text length AvgLen. The former factor reflects the model’s performance under different focus extents of a targeted aspect, namely, a larger $K$ indicates a lower focus extent, while the latter factor reflects the model’s performance under different data sparsity, namely, a smaller AvgLen indicates a higher data sparsity.

In this experiment, we vary the number of features $K$ from 10 to 35 with a step of 5. The three datasets have different sparsity extents, i.e., their AvgLen differs significantly. When the two factors change, a larger fluctuation of the NPMI curve indicates worse robustness. Figure 4 presents the experimental results on (a) the Douban dataset, (b) the Camera dataset, and (c) the Cigar dataset. For better readability, the solid lines present the performance of the full analysis-based methods (i.e., LDA, BTM, and S-HFTM), while the dotted lines present the performance of the targeted aspects oriented methods (i.e., T-LDA, T-BTM, TTM, and HFTM).
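The notion of fluctuation can be made concrete with a small sketch. The paper reads the fluctuation off the curves in Figure 4 and does not give a formula, so measuring it as the range (maximum minus minimum) of the NPMI values over $K$ is an assumption made here. The numbers are the Top10 scores on Douban for $K=10,20,30$ copied from Table 3, which cover only part of the range of $K$ used in the experiment.

```python
# Top10 NPMI on Douban for K = 10, 20, 30, copied from Table 3.
top10_npmi = {
    "LDA": [1.2147, 0.6942, 0.6377],
    "TTM": [1.4621, 1.0622, 0.9105],
    "HFTM": [1.9182, 1.7343, 1.6423],
}


def fluctuation(curve):
    """Range of an NPMI curve over K (assumed robustness proxy)."""
    return max(curve) - min(curve)


# A smaller fluctuation means a more robust model under this proxy.
ranked = sorted(top10_npmi, key=lambda m: fluctuation(top10_npmi[m]))
for model in ranked:
    print(model, round(fluctuation(top10_npmi[model]), 4))
# HFTM 0.2759
# TTM 0.5516
# LDA 0.577
```

Under this proxy HFTM varies by roughly half as much as TTM and LDA on these three points. Other proxies, such as the standard deviation over all six values of $K$, could be substituted in `fluctuation` without changing the rest of the sketch.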

Figure 4.

Robustness comparison between our model and the baselines with the changing of the number of features $K$ and data sparsity on (a) Douban (less sparse), (b) Camera (less sparse), and (c) Cigar (more sparse) datasets.

First, when $K$ increases, our HFTM model achieves the best robustness with the smallest fluctuation in the curve on all datasets. In contrast, the six baselines, whether they perform a full analysis at the corpus level or improve similarity at the document level, all show significant fluctuations, especially when $K$ increases from 10 to 15 on the Douban dataset. It is worth mentioning that, although the two typical full analysis-based methods LDA and BTM perform worse than their updated models, the simplified model S-HFTM is comparable to HFTM. This result demonstrates that, when the focus extent of a targeted aspect drops (i.e., $K$ increases), the hierarchical semantics are more effective at maintaining the focus on the aspect-specific features than the single-layer semantics.

Second, we can see that, when the data sparsity increases (e.g., AvgLen drops from 54.99 on the Douban dataset to 5.95 on the Cigar dataset), the targeted aspects oriented models show an upward trend, while the full analysis-based models show a downward trend. Specifically, our HFTM model achieves the best robustness with the best NPMI, except when $K$ is very small (e.g., $K=10$ on the Cigar dataset). This result demonstrates that the latent features with feature-to-feature semantics can enrich the aspect-specific features, which helps maintain the robustness when data sparsity increases. The performance of S-HFTM and BTM is comparable to that of our model. This result demonstrates that the feature-to-feature semantics and the co-occurrence word pairs can reinforce each other, so as to improve the topic robustness. This explains why our HFTM performs the best.

Summary 3

Overall, our HFTM model achieves the best robustness under the changes of two factors. In a few cases, the simplified model S-HFTM has slightly better performance. Nevertheless, the NPMIs of our model are more robust than those of the baselines, whether the focus extent is lower or the data are sparse. The experimental results demonstrate the necessity of the latent features represented with co-occurrence word pairs.

4.2.4 Experiment 4: Document classification evaluation (for Q4)

To answer Q4, we further investigate the performance of our HFTM model for document classification. We conduct Experiment 4 on the two datasets Douban and Cigar under the number of features $K$ changing from 10 to 35 with a step of 5. The experimental results are presented in Fig. 5.

Figure 5.

Document classification performance of our model and the baselines with different number of features $K$ on (a) Douban and (b) Cigar datasets.

We have the following three observations: (1) On both datasets, our HFTM model achieves the best classification accuracy compared to the baselines. This result demonstrates that the hierarchical features are more effective in describing the intrinsic semantics of a targeted aspect; (2) Compared with the full analysis-based methods, the targeted aspects oriented methods have better performance. This is mainly because the category label of a document can be regarded as a target label of this document, which means that document-level modeling is more effective at capturing this local feature; (3) Comparing the two datasets, Cigar consists of tweets while Douban consists of film reviews, which means that Cigar has a higher sparsity than Douban. It can be seen that both categories of methods perform worse on the Cigar dataset than on the Douban dataset. This result demonstrates that the updated method for each full analysis-based model can only improve the classification accuracy in a limited way. In other words, this result demonstrates the necessity of the latent features, which can complement the direct features to further improve classification accuracy.
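For readers unfamiliar with the classifier used in this experiment, the following is a minimal multinomial Naive Bayes sketch. It is not the Java-ML implementation used in the paper, and the way the captured features are fed to the classifier is not specified in the text, so the sketch simply classifies token lists. The class name, the smoothing constant, and the toy data are assumptions.

```python
import math
from collections import Counter, defaultdict


class NaiveBayes:
    """Multinomial Naive Bayes with add-alpha smoothing (illustrative)."""

    def fit(self, docs, labels, alpha=1.0):
        self.alpha = alpha
        self.vocab = {tok for doc in docs for tok in doc}
        self.class_counts = Counter(labels)
        self.token_counts = defaultdict(Counter)
        for doc, label in zip(docs, labels):
            self.token_counts[label].update(doc)
        return self

    def predict(self, doc):
        n_docs = sum(self.class_counts.values())
        best_label, best_score = None, -math.inf
        for label, count in self.class_counts.items():
            total = sum(self.token_counts[label].values())
            denom = total + self.alpha * len(self.vocab)
            # log prior plus the sum of smoothed log likelihoods
            score = math.log(count / n_docs)
            for tok in doc:
                score += math.log((self.token_counts[label][tok] + self.alpha) / denom)
            if score > best_score:
                best_label, best_score = label, score
        return best_label


train_docs = [["script", "plot"], ["plot", "ending"],
              ["actor", "acting"], ["acting", "talented"]]
train_labels = ["Plot", "Plot", "Actors", "Actors"]
model = NaiveBayes().fit(train_docs, train_labels)
print(model.predict(["plot", "script"]))  # Plot
```

Classification accuracy is then the fraction of held-out documents whose predicted label equals the true label.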

Summary 4

Overall, our HFTM model is effective for the document classification task, which means the hierarchical features are useful for the real-world application. Especially when dealing with sparse data, the latent features represented with co-occurrence word pairs can enrich the local semantics to improve the classification accuracy.

5. Conclusion

In this paper, we have proposed a novel Hierarchical Features-based Topic Model (HFTM), which can capture aspect-specific features that exist in two layers in a hierarchy rooted at the targeted aspect. Specifically, the direct features in the upper layer pose target-to-feature semantics, while the latent features in the lower layer pose feature-to-feature semantics. The two types of features prevent the description of a targeted aspect from being too general or too specific. Therefore, a person who browses large-scale online reviews can effectively find the aspect-specific features coherent with his/her interest. The experiments conducted on three real-world datasets have demonstrated that our model is superior to the baselines for both the aspect detection and document classification tasks.

In future work, we will consider improving our work with non-parametric sampling strategies so that the method can suit more real-world scenarios. We are also interested in applying more personal information, like temporal behavior patterns, to generate an auxiliary user profile.

Acknowledgments

This work has been supported by the National Key Research and Development Program of China under grant 2016YFB1000901, the National Natural Science Foundation of China under grant 91746209 and the Program for Changjiang Scholars and Innovative Research Team in University (PCSIRT) of the Ministry of Education of China under grant IRT17R32.

References

1. Agathangelou, Katakis, Koutoulakis, Kokkoras and Gunopulos, Learning patterns for discovering domain-oriented opinion words, Knowledge and Information Systems 55(1) (2018), 45–77.
2. Ahuja, Wei and Carley K.M., Microblog sentiment topic model, in: 2016 IEEE 16th International Conference on Data Mining Workshops, ICDMW 16’, 2016, pp. 1031–1038.
3. Bagci and Karagoz, Context-aware location recommendation by using a random walk-based approach, Knowledge and Information Systems 47(2) (2016), 241–260.
4. Bekoulis and Rousseau, Graph-based term weighting scheme for topic modeling, in: 2016 IEEE 16th International Conference on Data Mining Workshops, ICDMW 16’, 2016, pp. 1039–1044.
5. Blei D.M., Ng A.Y. and Jordan M.I., Latent Dirichlet allocation, Journal of Machine Learning Research 3(Jan) (2003), 993–1022.
6. Burkhardt and Kramer, Online multi-label dependency topic models for text classification, Machine Learning 107(5) (2018), 859–886.
7. Chen, Xie, Leung C.-C. and Li, Modeling latent topics and temporal distance for story segmentation of broadcast news, IEEE/ACM Transactions on Audio, Speech, and Language Processing 25(1) (2016), 112–123.
8. Chen, Martineau, Cheng and Sheth, Clustering for simultaneous extraction of aspects and features from reviews, in: Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, NAACL-HLT 16’, 2016, pp. 789–799.
9. Cheng, Yan, Lan and Guo, BTM: topic modeling over short texts, IEEE Transactions on Knowledge and Data Engineering 26(12) (2014), 2928–2941.
10. Fan, Dai, Huang and Chen, Target-oriented opinion words extraction with target-fused neural sequence labeling, in: Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, NAACL-HLT 19’, 2019, pp. 2509–2518.
11. Georgiou, El Abbadi and Yan, Extracting topics with focused communities for social content recommendation, in: Proceedings of the 2017 ACM Conference on Computer Supported Cooperative Work and Social Computing, CSCW 17’, 2017, pp. 1432–1443.
12. Griffiths T.L. and Steyvers, Finding scientific topics, Proceedings of the National Academy of Sciences 101(suppl 1) (2004), 5228–5235.
13. and Wu, A self-adaptive sliding window based topic model for non-uniform texts, in: 2017 IEEE International Conference on Data Mining, ICDM 17’, 2017, pp. 147–156.
14. Herrmann and Pernul, Towards security semantics in workflow management, in: Proceedings of 31st Hawaii International Conference on System Sciences, HICSS 98’, Vol. 7, 1998, pp. 766–767.
15. Shi, Zhuang and Philip S.Y., Integrating topic model and heterogeneous information network for aspect mining with rating bias, in: Advances in Knowledge Discovery and Data Mining – 23rd Pacific-Asia Conference, PAKDD 19’, 2019, pp. 160–171.
16. Wang, Zhang, Sun and Ma, Topic modeling for short texts with auxiliary word embeddings, in: Proceedings of the 39th International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR 16’, 2016, pp. 165–174.
17. Nguyen D.Q., Billingsley and Johnson, Improving topic models with latent feature word representations, Transactions of the Association for Computational Linguistics 3 (2015), 299–313.
18. Nikolenko S.I., Tutubalina, Malykh, Shenbin and Alekseev, Aspera: aspect-based rating prediction model, in: Proceedings of 41st European Conference on Information Retrieval, ECIR 19’, 2019, pp. 163–171.
19. Nimala and Jebakumar, A robust user sentiment biterm topic mixture model based on user aggregation strategy to avoid data sparsity for short text, Journal of Medical Systems 43(4) (2019), 93.
20. Pavlinek and Podgorelec, Text classification method based on self-training and LDA topic models, Expert Systems with Applications 80 (2017), 83–93.
21. Rakesh, Ding, Ahuja, Rao, Sun and Reddy C.K., A sparse topic model for extracting aspect-specific summaries from online reviews, in: Proceedings of the 2018 World Wide Web Conference, WWW 18’, 2018, pp. 1573–1582.
22. Ramage, Hall, Nallapati and Manning C.D., Labeled LDA: A supervised topic model for credit attribution in multi-labeled corpora, in: Proceedings of the 2009 Conference on Empirical Methods in Natural Language Processing, EMNLP 09’, 2009, pp. 248–256.
23. Reihanian, Feizi-Derakhshi M.-R. and Aghdasi H.S., Overlapping community detection in rating-based social networks through analyzing topics, ratings and links, Pattern Recognition 81 (2018), 370–387.
24. Rida-E-Fatima, Javed, Banjar, Irtaza, Dawood and Alamri, A multi-layer dual attention deep learning model with refined word embeddings for aspect-based sentiment analysis, IEEE Access 7 (2019), 114795–114807.
25. Soleimani and Miller D.J., Semi-supervised multi-label topic models for document classification and sentence labeling, in: Proceedings of the 25th ACM International on Conference on Information and Knowledge Management, CIKM 16’, 2016, pp. 105–114.
26. Song, Jiang, Qin and Liao, A novel temporal and topic-aware recommender model, World Wide Web 22(5) (2019), 2105–2127.
27. Toosinezhad, Mohamadpoor and Malazi H.T., Dynamic windowing mechanism to combine sentiment and n-gram analysis in detecting events from social media, Knowledge and Information Systems 60(1) (2019), 179–196.
28. Tutubalina, Target-based topic model for problem phrase extraction, in: Proceedings of 37th European Conference on Information Retrieval, ECIR 15’, 2015, pp. 271–277.
29. Wang, Tong, Cai, Hanratty and Han, Mining multi-aspect reflection of news events in Twitter: Discovery, linking and presentation, in: 2015 IEEE International Conference on Data Mining, ICDM 15’, 2015, pp. 429–438.
30. Wang, Chen, Fei, Liu and Emery, Targeted topic modeling for focused analysis, in: Proceedings of 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, SIGKDD 16’, 2016, pp. 1235–1244.
31. Wei, Luo, Pan, Zhang and Safi Q.G.K., Locally weighted embedding topic modeling by Markov random walk structure approximation and sparse regularization, Neurocomputing 285 (2018), 35–50.
32. Yuan and Huang, A hybrid unsupervised method for aspect term and opinion target extraction, Knowledge-Based Systems 148 (2018), 66–73.
33. Yang, Jones and Samatova N.F., Mining aspect-specific opinions from online reviews using a latent embedding structured topic model, in: International Conference on Computational Linguistics and Intelligent Text Processing, CICLing 17’, 2017, pp. 195–210.
34. Yang, Liu, Nie and Wang, Collaborative filtering with weighted opinion aspects, Neurocomputing 210 (2016), 185–196.
35. Yang, Zhu, Shen and Zhao, Cross-domain aspect/sentiment-aware abstractive review summarization, in: Proceedings of the 27th ACM International Conference on Information and Knowledge Management, CIKM 18’, 2018, pp. 1531–1534.
36. Yue, Chen, Zuo and Yin, A survey of sentiment analysis in social media, Knowledge and Information Systems, 2018, 1–47.
37. Zhu, Zhou, Xiong and Yuan, Privacy-preserving topic model for tagging recommender systems, Knowledge and Information Systems 46(1) (2016), 33–58.
