Abstract
The authors address two significant challenges in using online text reviews to obtain fine-grained, attribute-level sentiment ratings. First, in contrast to methods that rely on word frequency, they develop a deep learning convolutional–long short-term memory hybrid model to account for language structure. The convolutional layer accounts for spatial structure (adjacent word groups or phrases), and long short-term memory accounts for the sequential structure of language (sentiment distributed and modified across nonadjacent phrases). Second, they address the problem of missing attributes in text when constructing attribute sentiment scores, as reviewers write about only a subset of attributes and remain silent on others. They develop a model-based imputation strategy using a structural model of heterogeneous rating behavior. Using Yelp restaurant review data, they show superior attribute sentiment scoring accuracy with their model. They identify three reviewer segments with different motivations: status seeking, altruism/want voice, and need to vent/praise. Surprisingly, attribute mentions in reviews are driven by the need to inform and vent/praise rather than by attribute importance. The heterogeneous model-based imputation performs better than other common imputations and, importantly, leads to managerially significant corrections in restaurant attribute ratings. More broadly, the results suggest that social science research should pay more attention to reducing measurement error in variables constructed from text.
Many firms conduct routine tracking surveys on product/service performance on selected attributes that managers believe drive overall customer satisfaction (Mittal, Katrichis, and Kumar 2001; Mittal, Kumar, and Tsiros 1999). The summary scores from these surveys are used as dashboard metrics of overall satisfaction and as performance metrics at firms. Because surveys are costly, suffer from response biases, and become outdated quickly (Bi et al. 2019; Culotta and Cutler 2016), crowdsourced online review platforms have emerged as an alternative and cheaper source of scalable, real-time feedback for businesses to listen in on their markets for performance tracking as well as competitive benchmarking (e.g., Li, Xiaoyuan, and Lu 2020; Xu 2019). Further, as peer–peer trust and reputation arising from online review platforms gain in importance relative to brand advertising–based trust and reputation (Hollenbeck 2018), consumers use review platforms when making choices in many experience goods markets (e.g., Luca and Vats 2013; Zhu and Zhang 2010). Consequently, by necessity, many firms now rely on quantitative metrics of attribute-level performance from user-generated text—they even link performance benchmarking and employee compensation directly to online review performance. As an example, see Dubois et al. (2016) for how Accor Hotels use online review–based attribute sentiment score cards to manage hotel property employee performance.
This article develops a scalable text analysis method that converts open-ended text reviews from online review platforms into attribute-level summary ratings. This involves solving two novel and challenging subproblems. First, it requires developing a text mining framework that can convert the rich texture of attribute-level sentiment expressed in the text to a fine-grained quantitative rating scale that captures not only the valence of the sentiment, but also the degree of positivity or negativity in sentiment. Our model produces significantly higher accuracy classifications for both sentiment valence and fine-grained scoring relative to common benchmark methods. Because text is increasingly used in social science research (Gentzkow, Kelly, and Taddy 2019), our results suggest that many commonly used text analysis methods can produce large measurement error when converting text to quantitative sentiment data, leading to biased inferences and erroneous substantive insights.
The second problem is that because reviewers self-select the attributes to write about in open-ended text, many attributes will be missing in unprompted reviews. The challenge is to correctly interpret “silence” when a reviewer does not mention an attribute in the review text and impute the correct sentiment to obtain the aggregate attribute-level rating. We show that the correct imputations lead to significant corrections in restaurants’ average attribute ratings. Given that Luca (2016) estimates that a one-point change in rating leads to a 5%–9% change in restaurant revenues, these corrections are economically significant. Further, behavioral research has long recognized the importance of the right imputation for missing values because people do not ignore missing attributes and often make complex and imperfect inferences from missing data in evaluations. For example, Slovic and MacPhillamy (1974) and Peloza, Ye, and Montford (2015) discuss some common types of wrong inferences: placing higher weights on common attributes (i.e., attributes for which information is available for all options) or simply proxying a missing attribute score with some unrelated attribute score (extra-attribute misestimation). Gurney and Loewenstein (2019) provide an excellent review of this topic. Although the nature of these inferences may vary, the general takeaway is that missingness usually worsens choice and decision making. Thus, review platforms are interested in obtaining corrected attribute ratings. Further, as firms begin to increasingly use these text-based attribute ratings for internal feedback, performance measurement, and compensation (e.g., Dubois et al. 2016), obtaining the corrected metrics becomes even more important. We next describe the key challenges involved in tackling these two problems and explain how we address them.
Challenges in Attribute-Level Sentiment Scoring from Text
Attribute-level sentiment scoring from text involves connecting a specific product attribute (e.g., food, service) to an associated satisfaction rating. With fine-grained sentiment scoring, we need to both convert text to valence (positive, negative, or neutral) and represent the degree of positivity and negativity (in, say, a five-point scale). Although there has been some work on sentiment scoring of attribute valence (e.g., Archak, Ghose, and Ipeirotis 2011; Li, Xiaoyuan, and Lu 2020), there has been little work on fine-grained attribute scoring—the focus of our article. Next, we describe the challenges involved relative to extant work. We note that the computer science literature in fine-grained sentiment scoring is still evolving, and it remains an open problem in natural language processing (NLP; Schouten and Frasincar 2015).
Over the last decade, marketing scholars have extensively used text analysis to identify topics, customer needs, and mentions of product attributes. Many of these works have used “bag-of-words” approaches such as the baseline latent Dirichlet allocation (LDA) and lexicons, where the identification of attributes and sentiments is based on the frequency of sentiment words. LDA applications include Tirunillai and Tellis (2014), Hollenbeck (2018), Puranam, Narayan, and Kadiyali (2017), and Büschken and Allenby (2016). Recently, Büschken and Allenby (2020) showed that the quality of topics extracted from an LDA model can be improved by incorporating punctuation and conjunctions (e.g., “and,” “but”), which link sentences. Archak, Ghose, and Ipeirotis (2011) use a lexicon method to identify attributes and sentiment valence but do not address fine-grained sentiment scoring. Word-frequency-based bag-of-words methods are limited in their ability to adequately score attribute sentiments. For example, the sentences “The food was pretty bad, not good at all,” and “The food was pretty good, not bad at all,” would both have the same word frequencies but opposite meanings. Or consider the following examples where sentiment degree is modified, as in (1) “horrible,” “not horrible,” and “not that horrible” and (2) “delight” and “just missed being a delight.” When words are merely counted, as in bag of words, making the connections between the key sentiment words “horrible” and “delight” with their degree modifiers is difficult without considering how they are grouped adjacently to form phrases (i.e., spatial structure).
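The word-frequency limitation can be checked directly. The following minimal sketch counts words in the two example sentences above; the helper name `bag_of_words` and the simple tokenization rule are illustrative assumptions, not part of any method described in this article.

```python
import re
from collections import Counter


def bag_of_words(sentence):
    """Lowercase the sentence and count word occurrences; word order is discarded."""
    return Counter(re.findall(r"[a-z']+", sentence.lower()))


negative = "The food was pretty bad, not good at all"
positive = "The food was pretty good, not bad at all"

# Both sentences contain exactly the same words with the same counts,
# so any purely frequency-based representation cannot tell them apart.
same_counts = bag_of_words(negative) == bag_of_words(positive)
print(same_counts)
```

Because the two representations are identical, any classifier that sees only these counts must assign both sentences the same sentiment.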
More generally, in NLP, certain types of sentences are considered “hard” for sentiment scoring (Socher et al. 2013). Like the previous examples, which modified sentiment degree, negations often require accounting for adjacent words (i.e., spatial structure) to correctly interpret both valence and sentiment degree. Further, other types of sentences, such as long and scattered sentences and contrastive conjunctions, require accounting for both the spatial and the sequential structure of language, as the sentiment is distributed and modified across nonadjacent words in a sentence. When there are long sentences with sentiments scattered across attributes, it becomes challenging to correctly associate the sentiment with the attribute. Further, sentiments may be modified along different parts of a long sentence, and therefore, one must consider these sequences together when inferring sentiment. Contrastive conjunctions—words/phrases such as “but,” “despite,” and “in spite of”—can reverse the sentiment of a sentence on either side of the conjunction. Implied sentiments are challenging because the meaning/sentiment associated with a word lies within a richer context of its usage.
These examples motivate the need to go beyond frequency-based bag-of-words approaches and model the structure of language (in terms of phrases and sequences). In our deep learning model, a convolutional layer captures the spatial structure (grouping of adjacent words), and a long short-term memory (LSTM) layer captures the sequential structure (sequence of adjacent and nonadjacent phrases). This enables us to improve our sentiment classification not only in the aggregate on “easy” sentences but also on the “hard” sentences.
Accounting for Missing Attributes in Attribute Sentiment Scoring
As described previously, the current literature on topic identification focuses on the frequency of mentions across reviews to identify the most common or novel needs/benefits, attributes desired by consumers/users. The implicit assumption is that topics or attributes that are not mentioned are not important and can be ignored.
We question the premise that importance is the primary reason why an attribute is mentioned or not. There may be other reasons why a reviewer is silent on an attribute. Some may write only if it can influence or inform readers. For example, if there is high variance among current raters, one's rating can be influential and informative. Or if one's own rating is different from the consensus based on current reviews, one may be motivated to write a different point of view. There could, of course, be asymmetry in this motivation depending on whether the deviation from consensus is positive or negative. Finally, some raters may choose not to write when the product meets expectations (and their rating would have been a three out of five) but only write to praise/vent when they are very satisfied or dissatisfied.
We develop a model-based strategy that imputes missing sentiment based on observable restaurant characteristics and observable/unobservable reviewer characteristics. We consider and exploit the following four key features of the available data in this context in developing and identifying the structural model of rating with self-selection:
(1) The same restaurant is visited and experienced by multiple reviewers; given that a restaurant provides similar services to all patrons, we assume that all reviewers receive a common latent utility plus idiosyncratic shocks.
(2) The same reviewer visits multiple restaurants. This allows us to identify observable reviewer heterogeneity and unobserved heterogeneity in rating styles (i.e., how they map experienced utility to attribute-level ratings).
(3) All reviewers provide an overall rating, so given multiple observations from a reviewer, we can infer heterogeneous weights of attributes on overall ratings.
(4) Finally, variables such as informativeness and the need to praise/vent help account for self-selection.
We allow the structural model of rating behavior to account for heterogeneity in rating styles and weights on attributes driving overall ratings. Specifically, we allow for a nonlinear and heterogeneous mapping from experienced utility to attribute ratings using an ordinal logit and a heterogeneous weighting of different attributes to explain the observed overall rating as a regression. The heterogeneity is modeled within a latent-class framework. The attribute-mention equation (which models attribute-writing choice) helps account for self-selection. We estimate the model using a nested iterative expectation–maximization (EM) algorithm—an inner iteration for the attribute rating imputation, and an outer iteration for the unobserved heterogeneity parameters. The structural model provides insights on reviewer segments and their behavior. The attribute-mention equation enables us to assess the conjectured drivers of attribute mention (vs. silence) in reviews. Together, the attribute-mention equation and the structural model help with imputation that requires heterogeneous and time-varying reviewer- and restaurant-specific factors to be taken into account. We find that there are multiple reviewer segments with different motivations to write reviews: one segment seeks status, another wants to vent/praise, and a third is altruistic or wants to voice their opinion. Interestingly, we find that the need to inform and the need to vent/praise drive which attributes are mentioned in the review, rather than attribute importance. We then validate the imputations from our structural model by showing superior performance relative to simpler homogeneous models and other ad hoc imputation rules on holdout data. Finally, we demonstrate that corrections for attribute mentions based on observable and unobservable heterogeneity lead to significant corrections in average attribute ratings for a business.
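To make the idea of a nonlinear, heterogeneous mapping from experienced utility to a five-level rating concrete, the sketch below computes rating probabilities under a textbook ordered logit. It is only an illustration: the cutpoint values, the two segment labels, and the function names are hypothetical and are not taken from the estimated model.

```python
import math


def ordinal_logit_probs(utility, cutpoints):
    """Return P(rating = k) for k = 1..len(cutpoints) + 1 under an ordered logit.

    Uses P(rating <= k) = 1 / (1 + exp(-(cutpoint_k - utility))),
    with cutpoints assumed to be strictly increasing.
    """
    cdf = [1.0 / (1.0 + math.exp(-(c - utility))) for c in cutpoints] + [1.0]
    probs, previous = [], 0.0
    for value in cdf:
        probs.append(value - previous)
        previous = value
    return probs


def expected_rating(probs):
    """Probability-weighted average rating on the 1..5 scale."""
    return sum(k * p for k, p in enumerate(probs, start=1))


# Hypothetical rating styles: same experienced utility, different cutpoints.
utility = 0.5
lenient = ordinal_logit_probs(utility, [-2.0, -1.0, 0.0, 1.0])
harsh = ordinal_logit_probs(utility, [0.0, 1.0, 2.0, 3.0])
print(expected_rating(lenient), expected_rating(harsh))
```

With identical utility, the segment with lower cutpoints places more probability on high ratings, which is one simple way that heterogeneity in rating styles could be represented.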
We note that our problem definition for attribute-level ratings abstracts away from issues of (1) selection, in terms of who chooses to review (e.g., Le Mens et al. 2018; Li and Hitt 2008), and (2) strategic review shading by reviewers and/or fake reviews (e.g., Luca and Zervas 2016; Mayzlin, Dover, and Chevalier 2014) when aggregating ratings. Reviewer selection/review shading issues are relevant not just for attribute-level ratings but also for overall ratings; as such, any approaches to address these issues for overall ratings should also be applicable to attribute-level ratings.
In summary, our key contributions are as follows: First, we advance the text analytics literature in marketing by addressing the problem of fine-grained attribute sentiment scoring (i.e., we capture not only attribute sentiment valence but also the degree of positivity or negativity in sentiment). For this, we highlight the need to move beyond word-frequency-based approaches (lexicon and LDA) to a deep learning approach that accounts for language structure. Specifically, we account for the spatial and sequential structure of language using a convolutional–LSTM model. Second, we find that attribute mentions in reviews are driven by the need to inform and the need to praise/vent but are not based on the importance that the reviewer places on the attribute. Using a structural model of rating behavior, we develop a model-based imputation for missing attribute ratings. Overall, we note that although this research is motivated by the empirical context of online reviews, the problems of generating fine-grained attribute sentiment scoring from text and the interpretation/correction of attribute mentions have broad applications across many settings. In particular, we note that the large improvements in sentiment scoring accuracy (for both valence and fine grained) that result from our method suggest that social science research using text analysis (Gentzkow, Kelly, and Taddy 2019) should pay more attention to advanced NLP methods to reduce measurement error in their constructed variables and thus avoid biased and misleading inference about tested hypotheses.
The rest of the article is organized as follows. The following section discusses the related literature. We then describe the problems and challenges of attribute sentiment scoring and explain how our model addresses these challenges. Then, we discuss the structural model of rating behavior, the estimation strategy, and how the model is used for imputing missing attribute scores. After describing our data, we summarize the results and conclude.
Related Literature
This article is related to multiple strands of literature in marketing and computer science. We organize our discussion in two parts.
Text Analytics on User-Generated Content and Online Reviews
Table 1 positions our research with respect to the most relevant literature on online reviews and user-generated content (UGC) in marketing. Some of the early research on this topic (e.g., Chevalier and Mayzlin 2006; Dhar and Chang 2009; Duan, Gu, and Whinston 2008; Ghose and Ipeirotis 2007; Onishi and Manchanda 2012) uses quantitative metrics such as review ratings, review volume, and word count to infer the impact of UGC on business outcomes (e.g., sales, stock prices). While these works established the importance of studying UGC and its specific role in experience goods markets, they did not investigate content in review text.
Table 1. Most Relevant Marketing Literature on Text Analytics.
Notes: Y = yes; N = no; N.A. = not applicable. CNN = convolutional neural network; RNN = recurrent neural network.
Another research stream focused on using UGC in blogs and review forums to extract insights around customer needs and brand positioning (e.g., Büschken and Allenby 2016; Lee and Bradlow 2011; Netzer et al. 2012; Tirunillai and Tellis 2014). Archak, Ghose, and Ipeirotis (2011) study UGC to measure sentiment valence (not fine-grained sentiment) on specific product attributes using a lexicon approach and examine its impact on demand.
Fine-grained sentiment analysis for individual attributes is one of the more challenging variants of the sentiment analysis problem (Feldman 2013; Wang, Lu, and Zhai 2010). Approaches by Wang, Lu, and Zhai (2010) and Taboada et al. (2011) are highly interpretable but rely on carefully hand-crafted features; they are therefore not scalable. In addition, they underperform in detecting sentiments in difficult sentences. Supervised text classification methods such as support vector machine (SVM; Joachims 2002) do not require hand-crafting and are scalable, but they need large amounts of labeled training data (tagged by humans) to reach desired levels of accuracy. Consequently, deep learning models (Kim 2014; Socher et al. 2013; Zhou et al. 2015) combined with meaning-infused word vectors (Mikolov et al. 2013; Pennington, Socher, and Manning 2014) have revolutionized the field of text mining: they do extremely well on text classification tasks yet require a much smaller volume of training data to attain high levels of accuracy. Thus, they overcome the shortcomings of both traditional supervised algorithms and unsupervised algorithms. A limitation is that they lack interpretability, so it is difficult to understand what drives the performance of deep learning models. Recently, marketing scholars have used deep learning models for text analysis to answer important questions such as need identification (Timoshenko and Hauser 2019) and the impact of reading reviews about particular attributes on purchasing decisions (Liu, Lee, and Srinivasan 2019), but their focus is not on fine-grained sentiment, and thus language structure is less important.
We advance the marketing literature on sentiment analysis in two ways: (1) we consider fine-grained attribute sentiment scoring, and (2) we move from bag-of-words methods such as LDA and lexicons to deep learning models that account for structural aspects of language. Hybrid models that combine features of different deep learning architectures can improve performance on hard tasks (Wang et al. 2016); in that spirit, we motivate and construct a hybrid convolutional–LSTM model. Further, to understand the key drivers of model performance, we test our model on various types of hard sentences. In our corpus, nearly half of the sentences are “hard,” justifying the need to account for language structure. By reporting performance metrics not just overall, but on types of hard sentences, we offer new benchmarks for performance evaluation in future research.
Missing Attributes (Attribute Mentions) in Reviews
Our study of attribute mentions in text reviews is primarily related to the statistics literature on missing data and imputations. Rubin (1976) laid the seminal framework for analysis of missing data, in which every data point has some likelihood of being missing. Rubin classifies missing data problems into three groups: “missing completely at random,” “missing at random” (MAR), and “missing not at random.” Missing completely at random occurs when the probability of missing is the same for all cases (i.e., the causes of the missing data are unrelated to the data). This assumption is likely violated in most settings.
Most modern imputation models for missing data are based on the MAR assumption (i.e., the probability of being missing is the same within groups defined by the observed data). Missing data models under the MAR assumption are often estimated using multiple imputation or likelihood methods. Likelihood-based approaches use either Bayesian methods or the EM algorithm for estimation. Recently, Athey et al. (2018) proposed matrix completion methods for imputation in big data settings. However, to the extent that attribute choice involves self-selection, even with a structural model of rating with observed and unobserved heterogeneity, the problem still belongs to the “missing not at random” class (i.e., the missingness of a variable is a function of the variable itself, even after controlling for other observed and unobserved characteristics; e.g., the reviewer's decision to rate could be a function of the rating itself in our setting). The most common approach is then to introduce new identifying restrictions by explicitly justifying a model of missingness for the context at hand and estimate the joint model of missingness with the behavioral model (Little and Rubin 2019; Mohan and Pearl 2021). In this article, we augment the structural model of heterogeneous reviewer rating behavior with an equation for attribute mentions. The structural model also allows for both observable and unobservable heterogeneity in rating styles and linking of attribute ratings to overall ratings. The model allows for a rich, heterogeneous, nonlinear mapping from experienced utility to five-level rating and weighted mapping of attribute ratings to overall rating behavior. The attribute-mention equation accounts for self-selection. We use an EM algorithm to estimate the model, with imputation that allows for both heterogeneity and self-selection to fill in for missing attribute ratings during the EM iterations.
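A toy calculation may help show why the “missing not at random” case matters for average ratings. In the sketch below, the latent ratings and the mention rule (write only when venting or praising) are invented for illustration; it is not the structural model or the imputation procedure used in this article.

```python
# Hypothetical latent ratings of one attribute by eight reviewers.
true_ratings = [1, 3, 4, 4, 4, 4, 5, 4]


def mentions_attribute(rating):
    """Assumed mention rule: only extreme experiences (1 or 5) are written about."""
    return rating in (1, 5)


def mean(values):
    return sum(values) / len(values)


observed = [r for r in true_ratings if mentions_attribute(r)]

# Ad hoc rule A: ignore silent reviewers and average only the mentions.
observed_mean = mean(observed)

# Ad hoc rule B: treat silence as a neutral rating of 3.
neutral_fill = [r if mentions_attribute(r) else 3 for r in true_ratings]
neutral_fill_mean = mean(neutral_fill)

true_mean = mean(true_ratings)
print(true_mean, observed_mean, neutral_fill_mean)
```

In this constructed example both ad hoc rules give 3.0, while the latent average is 3.625, because whether a rating is observed depends on the rating itself.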
Converting Text into Numeric Attribute Sentiment Scores
We first describe the attribute-level sentiment analysis problem of converting unstructured text data in reviews into attribute-level sentiment scores. We then describe two methods of attribute scoring models with text data: (1) the lexicon model and (2) the deep learning model. Along the way, we also describe various implementation issues and choices that need to be made.
The problem of attribute-level sentiment analysis is to take a document d as input (in our empirical example, a Yelp review) and identify the various attributes

Figure 1. Illustration of attribute-level sentiment analysis.
As Figure 1 shows, the first step involves splitting the review text into sentences (all standard sentence separators such as full stops [periods], exclamation points, and question marks are considered for split). The next steps involve identifying relevant attributes in a sentence and their corresponding sentiments. Next, we describe the process of attribute and sentiment scale selection.
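The sentence-splitting step can be sketched with a regular expression over the standard separators named above. The function name `split_sentences` and the sample review are illustrative assumptions; a production splitter would also need to handle abbreviations such as “Dr.”, which this simple rule would split incorrectly.

```python
import re


def split_sentences(review):
    """Split on runs of '.', '!' or '?' that are followed by whitespace or end of text."""
    parts = re.split(r"[.!?]+(?:\s+|$)", review)
    return [part.strip() for part in parts if part.strip()]


review = "Great pizza! Service was slow. Would we come back? Maybe."
sentences = split_sentences(review)
print(sentences)
```

Requiring whitespace or end of text after the separator keeps decimals such as “4.5” intact, since the period there is followed by a digit.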
The attribute discovery phase is similar to an exploratory phase preceding a quantitative survey. For this, we conducted (1) a review of the literature, (2) an analysis of the most frequent attribute words in the corpus, and (3) topic modeling using LDA. The LDA results and most important words associated with each topic are presented in Table W2 in the Web Appendix. The literature on restaurant evaluation and industry customer satisfaction surveys identified food quality, employee behavior and wait time (service), basic hygiene, look and feel (ambiance), and value for money as the most common attributes (Ganu, Elhadad, and Marian 2009). We then performed frequent-word categorization of our review corpus by associating the highest-frequency nouns, noun phrases, and select verbs with restaurant-relevant attributes. Beyond the four attributes identified from previous literature and industry surveys, we found a fifth attribute, “location,” that has words pertaining to parking, convenience, and safety of the restaurant location. Finally, we conducted topic modeling of our review corpus using LDA. As is common with LDA, these topics combined both restaurant attributes and consumer sentiments, and given the very high frequency of food-related comments, the topics were disproportionately around food. Büschken and Allenby (2016) note that by initializing the LDA model with seed words for a wider range of attributes, one could obtain more balanced topics; however, because we only needed to identify relevant topics and not gain greater balance, using seed words did not help in identifying additional attributes that were relevant for a large-enough set of restaurants to be used on a platform. Overall, we concluded that the five attributes—food, service, ambiance, value, and location—captured the most relevant attributes for a restaurant rating platform.
We use a five-point scale for sentiment granularity (1 = “extremely negative,” 3 = “neutral,” and 5 = “extremely positive”) because this is comparable to the five-point rating scale in many review platforms. Moreover, human taggers fail in practice to differentiate well between classes when the sentiment granularity is higher than five levels (Socher et al. 2013). We next describe the two types of attribute sentiment classifiers we consider.
Attribute Sentiment Classifier: The Lexicon Method
We begin with the lexicon-based method because it is highly interpretable, transparent, and widely used and thus serves as a useful benchmark relative to more complicated models. The method consists of lexicon construction followed by attribute sentiment classification of text based on dictionary lookups (i.e., sentences are classified into an attribute and sentiment class by locating word matches in attribute and sentiment class–specific dictionaries). Next, we explain the method and discuss its limitations.
Lexicon building. Lexicon construction involves creating a dictionary of attribute words with corresponding attribute labels (e.g., waiter—“service”) and sentiment words with sentiment class labels (e.g., excellent—“extremely positive”). We first identify the high-frequency attribute and sentiment words in our corpus to create our vocabulary. We construct attribute- and sentiment class–specific dictionaries by asking human taggers on Amazon Mechanical Turk to classify all attribute words into one of the five attributes we identified in Step 1 (food, service, value, ambiance, and location) and all sentiment words into one of the five sentiment classes (given that we decided to use a five-point rating scale). Every word is labeled by three distinct human taggers, and we retain only those words for which at least two of the three taggers agree on the labeling.
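The two-of-three agreement rule for retaining a word can be sketched as a simple majority vote. The tagger responses and the names `majority_label` and `raw_tags` below are hypothetical and serve only to illustrate the filtering step.

```python
from collections import Counter


def majority_label(labels, min_agree=2):
    """Return the label chosen by at least `min_agree` taggers, or None if there is none."""
    label, count = Counter(labels).most_common(1)[0]
    return label if count >= min_agree else None


# Hypothetical responses: each word is labeled by three distinct taggers.
raw_tags = {
    "waiter": ["service", "service", "service"],
    "parking": ["location", "location", "ambiance"],
    "vibe": ["ambiance", "food", "service"],  # no two taggers agree, so it is dropped
}

lexicon = {}
for word, labels in raw_tags.items():
    label = majority_label(labels)
    if label is not None:
        lexicon[word] = label

print(lexicon)
```

The same vote applies unchanged to sentiment words, with the five sentiment classes in place of the attribute labels.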
Attribute-level sentiment scoring. Each review is split into sentences. Using the lexicon, each attribute word in the sentence is classified into one of the prespecified attributes (or none) and each sentiment word is classified into a five-point sentiment rating scale using a “lookup” or search of the precreated lexicons. Following this, the steps are similar to those listed in Figure 1.
Despite its simplicity, interpretability, and transparency, the method has several limitations. First, lexicon construction is costly in both time and effort and scales linearly with the number of words. Second, and more importantly, the method treats language as simply a bag of words or “fixed phrases” and does not account for various aspects of language structure. In practice, lexicon methods therefore work fairly well for sentiment identification in simple sentences but perform poorly on hard sentences (Liu et al. 2010).
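A minimal sketch of lookup-based scoring illustrates both the method and its main weakness. The two miniature lexicons, the function name `score_sentence`, and the rule of averaging matched sentiment scores are assumptions made for this illustration; they do not reproduce the exact steps in Figure 1.

```python
# Hypothetical miniature lexicons; real ones would be built from the tagged corpus vocabulary.
ATTRIBUTE_LEXICON = {"pizza": "food", "pasta": "food", "waiter": "service", "parking": "location"}
SENTIMENT_LEXICON = {"horrible": 1, "bad": 2, "okay": 3, "good": 4, "excellent": 5}


def score_sentence(sentence):
    """Return {attribute: score} using word-by-word dictionary lookups.

    Assumption: if several sentiment words match, their scores are averaged and rounded.
    """
    words = [w.strip(".,!?").lower() for w in sentence.split()]
    attributes = {ATTRIBUTE_LEXICON[w] for w in words if w in ATTRIBUTE_LEXICON}
    scores = [SENTIMENT_LEXICON[w] for w in words if w in SENTIMENT_LEXICON]
    if not attributes or not scores:
        return {}
    rating = round(sum(scores) / len(scores))
    return {attribute: rating for attribute in attributes}


simple = score_sentence("The waiter was excellent.")
negated = score_sentence("Pizza is not that good.")
print(simple, negated)
```

The simple sentence is scored sensibly, but the negated sentence is scored as positive on food because the lookup sees only the word “good” and ignores the modifier “not that.”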
Why the Lexicon Method Fares Poorly with Hard Sentences
We next elaborate on why lexicon methods fail to classify the types of hard sentences mentioned in the introduction. This is problematic because hard sentences make up close to 50% of the sentences in our review corpus. We explain each type next.
Negations and sentiment degree. Sentences with different degrees of negative sentiment can be difficult to classify without accounting for variable size n-grams. Lexicon methods typically look at one word at a time and are unable to obtain sentiment valence or degree. Even if ad hoc approaches are used to address standard negations with bigrams or trigrams by hard-coding negation phrases, examples such as “Pizza is not that good” or “Pizza is not at all great” illustrate that such ad hoc approaches are unlikely to be effective overall in capturing degree of sentiment. This motivates the use of the convolutional layer, which handles the spatial structure.
Long sentences and scattered sentiments. In long sentences (those consisting of more than 20 words), the degree of sentiment (and even polarity) can change multiple times. As an example, consider the sentence, “OK, in fact good, to start with but kept getting worse and wait staff were unapologetic but manager saved the night.” In this sentence, the sentiment flows from being good to bad to extremely bad and then back to positive. Yelp reviews tend to have a significant percentage of long sentences. Without sequence history, the classifier cannot capture sentiment shifts and will classify most of these sentences as neutral due to the mix of positive- and negative-sentiment words. More importantly, immediate sentiment modifiers may be changed by sentiment words that are farther away, so having a “long-term memory” of what was said before and whether recent sentiment (short-term memory) should take precedence needs to be considered. The LSTM layer helps with both the sequencing and the immediate and distant sentiment modifiers, while the convolutional layer still helps group words into phrases within the long sentence before being fed into the LSTM layer.
Contrastive conjunctions. Sentences that have an “X but Y” structure often get misclassified by sentiment classifiers because the model needs to take into account both the clauses before and after the conjunction and weigh their relative importance to decide the final sentiment. An example sentence includes “Despite the creativity in the menu, execution was a disappointment.” The first half of the sentence is extremely positive due to the word “creativity,” but the second half moderates it significantly. A good classifier should be able to learn from both parts of the sentence to arrive at the correct classification. While the convolutional layer identifies phrases before and after the conjunction, the LSTM layer helps interpret the change of meaning after the conjunction.
Implied sentiments (sarcasm and subtle negations). These sentences do not have explicit positive or negative sentiment words, but the context implies the underlying sentiment. This makes the task of sentiment identification extremely difficult for all classes of models and especially for models that rely on a specific set of positive or negative words. An example sentence includes “The place is a treasure if only you are lucky to be there on the right day.” This is an example of sarcasm; the reviewer uses a positive word (“treasure”) but hints at the extreme variance in the type of experience one can have. There could also be subtle negations; for example, “The girl managing the bar had to be the waitress for everyone.” Here, the reviewer is complaining about lack of service arising out of shortage of staff without using an explicitly negative word. Because the meaning/sentiment associated with the word lies in the richer context of its usage, we empirically assess how much the spatial and sequential structure helps with accurate classification.
Attribute Sentiment Classifier: A Deep Learning Hybrid Convolutional–LSTM Model
Lexicon methods use a constructive algorithm based on precoded attributes and sentiment words in a lexicon to score attribute-level sentiment. In contrast, deep learning models are a type of supervised learning model, where the model is trained using a training data set by minimizing a loss function (e.g., the distance between the model's predictions and the true labels). The trained model is then used to score attribute-level sentiment on the full data set. Like deep learning, regression and SVM are also variations of supervised learning.
What distinguishes deep learning from regression and SVM is that deep learning aims to model high-level abstractions in data by using multiple processing layers (thus the word “deep” in its name), composed of linear and nonlinear transformations (Goodfellow, Bengio, and Courville 2016). Deep learning algorithms are useful in scenarios where feature (variable) engineering is complex and it is difficult to select the most relevant features for a classification or regression task. For instance, in our task of fine-grained sentiment analysis, it is not clear which features (combination of variable-length n-grams) are most informative when classifying a sentence into “good food” or “great service.” The two key ingredients behind the success of deep learning models for NLP are (1) meaningful word representations as input and (2) the ability to extract contiguous variable-size n-grams (spatial structure) with ease while retaining sequential structure in terms of word order and associated meaning.
In this subsection, we outline the architecture of the model and its intuition and discuss critical modeling/implementation choices. Web Appendix C provides a self-contained technical description. Figure 2 shows the general architecture of a neural network used for text classification. Following preprocessing of text, the first layer is the embedding layer, where words are converted to numerical vectors by making use of word embeddings. These embedded numerical vectors are then fed to the succeeding feature-generating layers, which are the core of the deep learning model. In contrast to older supervised learning methods such as SVM, which work with the raw data directly as inputs, these feature-generating layers (i.e., the convolutional layer and LSTM network layer) extract higher-level features important for classification. The extracted feature vectors are then passed into a logit classifier (soft-max) that classifies the sentence to the class with highest probability of association.

Figure 2. General architecture of a deep learning network for text classification.
Embedding layer and word representation
Neural network layers work by performing a series of arithmetic operations on inputs and weights of the edges that connect neurons. Thus, words need to be converted into numerical vectors before being fed into a neural network. These vectors are called “embeddings,” and most well-known embedding algorithms (e.g., word2vec, GloVe) are based on the distributional hypothesis, that is, words with similar meanings tend to co-occur more frequently (Harris 1954) and thus have vectors that are close in the embedding space. The efficiency of the neural network improves manifold if these initial inputs carry meaningful information about the relationships between words. Therefore, the choice of embedding is an important one; we experiment with both embeddings trained from scratch on our Yelp review corpus and a range of pretrained word embeddings such as word2vec (Mikolov et al. 2013) and GloVe (Pennington, Socher, and Manning 2014) that are available for all words in our vocabulary but have been trained on different corpora (e.g., Wikipedia dumps, Gigaword news, Common Crawl). There are pros and cons for both approaches: pretrained embeddings are a form of transfer learning that eliminates embedding generation time, but self-trained embeddings may result in higher classification accuracy due to a more context-relevant vocabulary.
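To illustrate what "close in the embedding space" means, the following sketch compares toy vectors by cosine similarity. The three-dimensional vectors are hypothetical and hand-made for illustration; actual word2vec or GloVe vectors are learned from a corpus and typically have 100 to 300 dimensions.

```python
import math

# Hypothetical 3-dimensional embeddings, for illustration only.
toy_embeddings = {
    "pizza": [0.9, 0.1, 0.0],
    "pasta": [0.8, 0.2, 0.1],
    "waiter": [0.1, 0.9, 0.2],
}


def cosine_similarity(u, v):
    """Cosine of the angle between two vectors (1.0 means same direction)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


food_pair = cosine_similarity(toy_embeddings["pizza"], toy_embeddings["pasta"])
mixed_pair = cosine_similarity(toy_embeddings["pizza"], toy_embeddings["waiter"])
```

In this toy setup the two food words are far more similar to each other than a food word is to a service word, which is the kind of relational information a good embedding passes on to the later layers.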
Feature-generating layers (convolutional–LSTM)
The macro architecture of the neural network comprises layers to be included (e.g., feed-forward or convolutional) and type of interconnections between them. As discussed previously, the most challenging aspect of our task is dealing with different types of hard negations resulting from variable-size n-grams (e.g., “not good,” “not that great”) and shifting polarities (e.g., “started off well but ended in a sorry surprise”). In many challenging text and image classification problems (Wang et al. 2016), hybrid models that combine the strengths and mitigate the shortcomings of each individual model have been found to improve performance. In that spirit, we build a network consisting of a single convolutional layer with variable-size filters followed by an LSTM layer.
Convolutional layers with different filter sizes specialize in extracting variable-length n-grams (phrases) associated with relevant attributes and sentiments and have recently been used successfully in various text analysis applications (Kim 2014; Timoshenko and Hauser 2019). To improve granular sentiment detection where sequence information is critical, we follow the convolutional layer with an LSTM layer that processes the features (phrases) identified from the convolutional layer. LSTM is a variant of the recurrent neural networks that specializes in handling longer contextual information (Hochreiter and Schmidhuber 1997). An LSTM employs a cell state (long-term memory) and a combination of gates that are like “regulators” of information to constantly evaluate what parts of the history (in this case, n-grams from the previous part of the sentence) need to be forgotten and what needs to be retained to improve the accuracy of the attribute and sentiment classification task. As we noted in our discussion of “hard” sentences, by taking advantage of the properties of the convolutional layer and LSTM, we expect the hybrid to improve classification accuracy while keeping training time low.
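The gating logic can be sketched with a scalar LSTM cell written in plain Python. The gate equations are the standard LSTM equations; the parameter names and values are hypothetical and chosen only to contrast a cell that retains its memory with one that erases it. This is a sketch for intuition, not the trained network used in our application.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def lstm_step(x, h_prev, c_prev, p):
    """One step of a scalar LSTM cell (hidden size 1, standard gate equations)."""
    f = sigmoid(p["wf"] * x + p["uf"] * h_prev + p["bf"])    # forget gate
    i = sigmoid(p["wi"] * x + p["ui"] * h_prev + p["bi"])    # input gate
    o = sigmoid(p["wo"] * x + p["uo"] * h_prev + p["bo"])    # output gate
    g = math.tanh(p["wg"] * x + p["ug"] * h_prev + p["bg"])  # candidate memory
    c = f * c_prev + i * g   # cell state (long-term memory)
    h = o * math.tanh(c)     # hidden state (short-term memory)
    return h, c


# Hypothetical parameters: all weights zero, only the gate biases differ.
zero = {k: 0.0 for k in ("wf", "uf", "wi", "ui", "wo", "uo", "wg", "ug", "bg", "bo")}
keep_memory = dict(zero, bf=10.0, bi=-10.0)    # forget gate near 1: retain history
erase_memory = dict(zero, bf=-10.0, bi=-10.0)  # forget gate near 0: drop history

_, c_kept = lstm_step(x=1.0, h_prev=0.0, c_prev=1.0, p=keep_memory)
_, c_erased = lstm_step(x=1.0, h_prev=0.0, c_prev=1.0, p=erase_memory)
```

With the forget gate near 1 the earlier cell state passes through almost unchanged, whereas with the gate near 0 it is wiped out; in a trained network these gate values are learned functions of the current phrase and the history.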
Classifier
The loss function choice depends on the nature of the classification task. Because our tasks involve the classification of text into five attribute classes and five sentiment classes, it is a multiclass classification problem. We use the standard loss function for multiclass classification called categorical cross entropy. Say si represents the convolutional–LSTM model classification for sentence i and ti represents the ground truth classification. Then, the cross entropy loss function can be defined as follows:
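In its standard form, which we assume here, categorical cross entropy is L = −Σ_i Σ_c t_ic log(s_ic), where s_ic is the soft-max probability the model assigns to class c for sentence i and t_ic equals 1 for the true class and 0 otherwise. Treating s_i as a probability vector and t_i as a one-hot vector is an assumption about the notation above; the function names below are hypothetical.

```python
import math


def cross_entropy(predicted_probs, true_class):
    """Loss for one sentence: minus the log probability given to the true class."""
    return -math.log(predicted_probs[true_class])


def mean_cross_entropy(batch_probs, batch_labels):
    """Average loss over a batch of sentences."""
    losses = [cross_entropy(p, t) for p, t in zip(batch_probs, batch_labels)]
    return sum(losses) / len(losses)


uniform_guess = [0.2, 0.2, 0.2, 0.2, 0.2]       # five classes, no information
confident_right = [0.01, 0.01, 0.01, 0.01, 0.96]

loss_uniform = cross_entropy(uniform_guess, true_class=4)
loss_confident = cross_entropy(confident_right, true_class=4)
```

The loss falls as the probability placed on the correct class rises, so minimizing it pushes the soft-max output toward the human-coded label.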
Deep Learning Implementation: Important Choices
Word embeddings
We tested pretrained embeddings based on word2vec and GloVe with different numbers of embedding dimensions (e.g., 100, 300) for attributes and sentiment classification. Further, we evaluated whether self-trained embeddings from the specific text corpus could produce superior classification relative to the pretrained embeddings.
Micro architecture
The micro architectural decisions in a neural network involve the number of neurons in each of the layers, the size and number of filters for the convolutional layer, and dimensions of the max pooling function (that concatenates variable-size feature vectors generated from variable-size convolutional filters). Many of these decisions are empirically driven, but the appropriate choices differ between the sentiment and attribute classification tasks. Sentiment classification relies on the presence of longer n-grams, so we typically choose a mix of filter sizes for this task, ranging from one to six grams. In contrast, the attribute classification task often needs only unigrams and bigrams (“chicken,” “cola drink,” “wait time”), and thus simple unigram and bigram filters are sufficient. In addition, because the sequence of n-grams matters for sentiment classification, ideally we would not use a max-pooling layer after the convolutional layer, as the aggregation loses sequential information before being passed to the LSTM layer. However, a pooling layer is needed to merge variable-size feature maps generated from the convolutional filters. We balance this trade-off by max-pooling on the smallest possible pooling dimension so that we can preserve as much of the sequence information as feasible in sending input into the LSTM layer.
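The trade-off between filter width and pooling size can be sketched in a few lines. The helper names and the activation values are hypothetical; the sketch only shows which word spans a width-n filter sees and how a small pooling window keeps the order of the features while a global pool discards it.

```python
def ngram_windows(tokens, n):
    """Contiguous n-grams: the spans a width-n convolutional filter slides over."""
    return [tokens[i:i + n] for i in range(len(tokens) - n + 1)]


def max_pool_1d(values, pool_size):
    """Max over non-overlapping windows of length pool_size."""
    return [max(values[i:i + pool_size]) for i in range(0, len(values), pool_size)]


tokens = "pizza is not that good".split()
bigrams = ngram_windows(tokens, 2)
trigrams = ngram_windows(tokens, 3)

# Hypothetical filter activations along a sentence.
feature_map = [0.1, 0.9, 0.3, 0.7, 0.2, 0.4]
local_pool = max_pool_1d(feature_map, 2)                  # keeps three ordered positions
global_pool = max_pool_1d(feature_map, len(feature_map))  # collapses to one value
```

Only the trigram filter sees "not that good" as a single unit, and only the small pooling window hands the LSTM layer a sequence rather than a single summary value.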
Model training
As is standard for deep learning models, the model parameters are optimized jointly by training the model iteratively on smaller subsamples of the training data (mini-batches) and then using the estimation error to improve the model (i.e., change the weights and biases in small increments) through a feedback loop. We experimented with mini-batch sizes of 5, 10, 25, 30, and 50, and different optimizers. We chose the RMSProp (Dauphin et al. 2015) optimizer because it uses an adaptive learning rate.
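The feedback loop can be sketched with the standard RMSProp update applied to a one-parameter toy problem. The learning rate, decay, and loss function are hypothetical illustration values, not the settings used in our estimation, and the helper names are our own.

```python
import math


def iter_minibatches(examples, batch_size):
    """Yield consecutive mini-batches of at most batch_size examples."""
    for start in range(0, len(examples), batch_size):
        yield examples[start:start + batch_size]


def rmsprop_step(weight, grad, avg_sq_grad, learning_rate=0.01, decay=0.9, eps=1e-8):
    """Standard RMSProp: scale the step by a running average of squared gradients."""
    avg_sq_grad = decay * avg_sq_grad + (1.0 - decay) * grad * grad
    weight = weight - learning_rate * grad / (math.sqrt(avg_sq_grad) + eps)
    return weight, avg_sq_grad


# Toy problem: minimize (w - 3)^2, whose gradient is 2 * (w - 3).
weight, avg_sq_grad = 0.0, 0.0
for _ in range(2000):
    grad = 2.0 * (weight - 3.0)
    weight, avg_sq_grad = rmsprop_step(weight, grad, avg_sq_grad)

batches = list(iter_minibatches(list(range(100)), 25))
```

Because each step is divided by the recent gradient magnitude, the effective learning rate adapts per parameter, which is the property that motivated our choice of optimizer.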
Performance Measures for Model Comparison
The primary metric on which we compare our models is accuracy or hit rate:
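We take accuracy in its usual sense: the number of test sentences whose predicted class equals the human-coded class, divided by the number of test sentences. A minimal sketch under that assumption follows; the function name and the toy labels are hypothetical.

```python
def hit_rate(predicted, actual):
    """Share of items whose predicted class equals the human-coded class."""
    if len(predicted) != len(actual):
        raise ValueError("predicted and actual must have the same length")
    if not actual:
        raise ValueError("at least one labeled item is required")
    hits = sum(p == a for p, a in zip(predicted, actual))
    return hits / len(actual)


# Toy five-point sentiment labels for four sentences.
example_accuracy = hit_rate([5, 4, 3, 1], [5, 3, 3, 1])
```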
Analysis of Structured Ratings Accounting for Missing Attributes
So far, we have focused on converting review text into numerical attribute scores on a five-point scale; we coded attributes as “missing” when the reviewer is silent on an attribute. For every review, we also have an overall rating for the restaurant. With this quantitative data, we develop and estimate a structural model of reviewer–restaurant experience and rating behavior.
The structural model gives us (1) insights into reviewer segments, (2) reviewer attribute rating and writing behaviors, and (3) imputations for missing attributes. We then assess the validity of the model-based imputations on a holdout sample. Finally, we illustrate that corrections for attribute rating using the imputations can be substantial and economically/managerially significant.
A Structural Model of Rating Behavior
The structural model consists of three parts consistent with the data-generating process: The first is an ordinal logit model of attribute rating that accommodates (1) a nonlinear mapping from experienced quality to attribute ratings and (2) heterogeneity in reviewer rating styles. The second is a logit model of attribute mention that enables us to test specific hypotheses related to missing attributes. Using the estimates from the ordinal logit model of attribute rating and the logit models of attribute mention, we impute missing attribute ratings using the Bayes rule. The third part is a regression model of overall ratings against attribute ratings to estimate how attribute ratings impact overall ratings. For this regression, we impute attribute ratings from the previous step when they are missing. The model allows for observed and unobserved heterogeneity. We use an iterative multistep EM algorithm to estimate the model.
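The imputation logic can be sketched as follows, under simplifying assumptions: a single reviewer segment, a scalar latent utility, and a mention probability that depends only on the rating. All numbers, thresholds, and function names are hypothetical and do not correspond to the estimated parameters reported later.

```python
import math


def logistic(x):
    return 1.0 / (1.0 + math.exp(-x))


def ordinal_logit_probs(latent_utility, thresholds):
    """P(rating = s), s = 1..len(thresholds)+1, for increasing cutoffs."""
    cumulative = [logistic(c - latent_utility) for c in thresholds] + [1.0]
    probs, previous = [], 0.0
    for value in cumulative:
        probs.append(value - previous)
        previous = value
    return probs


def posterior_given_silence(rating_probs, mention_probs):
    """Bayes rule: P(rating = s | attribute not mentioned)."""
    joint = [p * (1.0 - m) for p, m in zip(rating_probs, mention_probs)]
    total = sum(joint)
    return [j / total for j in joint]


def expected_rating(probs):
    return sum((s + 1) * p for s, p in enumerate(probs))


prior = ordinal_logit_probs(latent_utility=0.8, thresholds=[-2.0, -1.0, 0.5, 2.0])
# Hypothetical mention probabilities: low ratings are written about more often.
mention_by_rating = [0.9, 0.8, 0.5, 0.4, 0.3]
posterior = posterior_given_silence(prior, mention_by_rating)
```

When dissatisfied reviewers are more likely to mention an attribute, silence shifts probability toward higher ratings, so the imputed rating lies above the unconditional expectation; the reverse holds if satisfied reviewers are the more vocal ones.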
We begin with the model of attribute rating. Every reviewer i who writes a review has an experience with the restaurant. Let
Each reviewer may belong to a segment
We next formulate the attribute-mention model as a binary logit for each attribute for each segment g:
Denote
Finally, we model the overall rating equation for each review as a weighted sum of the ratings on attributes, allowing for both observable and unobservable reviewer heterogeneity (by same latent class as for attribute ratings). Specifically, we model ratings as a segment-specific linear regression model:
Then, the rating equation with segment-specific imputation for missing attribute rating is
Model Likelihood
The parameters to be estimated are
The overall likelihood across all reviewers is given by
Estimation Algorithm
We outline the estimation algorithm here and present the step-by-step details in the Appendix. We begin by describing the iterative procedure to maximize the likelihood of the model without unobserved heterogeneity.
First, we estimate the ordinal logit model in Equation 3 with only observations that have the attribute ratings reported. This gives us initial estimates of αk, βik, and Cks. Second, we estimate the attribute-mention equation in Equation 4, where we use the estimates αk, βik, and Cks to impute the attribute ratings for reviews when attribute ratings are missing. Then, from Equation 5, we apply Bayes rule to revise the probability
Empirical Application
Yelp is a crowdsourced review platform where reviewers can review a range of local businesses (e.g., restaurants, spas and salons, dentists, mechanics, home services). The website was officially launched in a few U.S. West Coast cities in August 2005 and subsequently expanded to other U.S. cities and countries over the next few years. As of Q1 2017, Yelp is present in 31 countries, with 177 million reviews and over 5 million unique businesses listed (Yelp Investor Relations Q4 2018). Given our empirical application, we focus on restaurant reviews. Since 2008, Yelp has shared review, reviewer, and business information for select U.S. and international cities as part of its annual challenge. Unique reviewer and business identification numbers in the data help create a two-way panel of reviews at the reviewer and business levels. For each review, we observe overall rating, textual evaluation, and date of posting as well as information about business characteristics (e.g., cuisine, price range, address, name) and reviewer characteristics (e.g., experience with Yelp, Elite status). Table 2 summarizes the various data sets we use for different types of analysis. A discussion on each data set follows.
Exploratory analysis. We use the full data set of 1.2 million restaurant reviews for the exploratory analysis to identify attribute and sentiment classes that we described in the model section. We created a vocabulary of 8,458 words consisting of both sentiment and attribute words. As is normal, we excluded stop words, meaningless phrases, and the long tail of words with occurrence frequency less than 1,500 in our corpus. We then applied part-of-speech tagging to our word list (i.e., we classified our word list into adjectives, adverbs, nouns, and verbs so as to separate attribute and sentiment words). Attribute words are mainly nouns, whereas sentiment words are mainly adjectives and adverbs, with important exceptions: some verbs are strong indicators of an attribute (e.g., “greeting,” “seated,” “served” referring to service) and some adjectives are good indicators of both attribute and sentiment (e.g., “cheap” refers to the value attribute with negative sentiment). Finally, human taggers classified the attribute and sentiment words into attribute and sentiment classes. In our dictionaries, we retain only those words that have been labeled into a particular class by at least two out of three taggers.
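The two retention rules described above (a minimum corpus frequency of 1,500 and agreement by at least two of three taggers) can be sketched as follows. The words, counts, labels, and function names are hypothetical examples, not entries from our dictionaries.

```python
from collections import Counter


def majority_label(labels, min_votes=2):
    """Return the label chosen by at least min_votes taggers, else None."""
    label, votes = Counter(labels).most_common(1)[0]
    return label if votes >= min_votes else None


def build_dictionary(word_counts, tagger_labels, min_count=1500):
    """Keep words that are frequent enough and on which taggers agree."""
    dictionary = {}
    for word, labels in tagger_labels.items():
        if word_counts.get(word, 0) < min_count:
            continue  # long-tail word, dropped before tagging agreement is checked
        label = majority_label(labels)
        if label is not None:
            dictionary[word] = label
    return dictionary


# Hypothetical counts and tagger labels for three candidate words.
word_counts = {"waiter": 5200, "vibe": 2100, "sommelier": 300}
tagger_labels = {
    "waiter": ["service", "service", "ambiance"],
    "vibe": ["ambiance", "location", "service"],
    "sommelier": ["service", "service", "service"],
}
dictionary = build_dictionary(word_counts, tagger_labels)
```

In this toy example only "waiter" survives: "vibe" fails the agreement rule and "sommelier" fails the frequency rule.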
Training and test data for supervised learning. For supervised learning, we constructed another data set at the sentence level. Human taggers classify the sentences into their primary attribute and sentiment levels. We ensure that the data set is balanced in its representation of all attribute and sentiment classes. We used 75% of these data for training and the remainder for model validation and testing. As discussed previously, lexicon methods do not account for the challenges of obtaining attribute sentiments for hard sentence types. In a randomly sampled subset of sentences from our corpus, 48% of all sentences and 66% of the negative sentences belong to one of the complex types. Long sentences account for 27% of our data. Given their empirical importance, we created a special test data set of hard sentence types to assess model performance specifically on such sentence types.
Restaurant and reviewer stratified sample. To estimate the linkages between attribute-level sentiment and overall ratings, we focus on a stratified sample of reviews. We ensure that we have multiple reviews by individuals so that we can account for unobserved heterogeneity in reviewer rating styles. We want multiple reviews on restaurants to ensure that there are multiple reviewers who obtained similar latent utilities up to a random shock. We therefore restricted our sample to only individuals who posted at least 5 reviews and restaurants that have at least 20 reviews. This restriction also helps eliminate human or bot-generated fake reviews, as fakes are mostly from users with one or only few reviews (Luca and Zervas 2016).
Table 2. Description of Data Sets.
We then used stratified sampling by restaurant and reviewer types to ensure that different restaurant types (e.g., high and low end, chain and independent) and reviewer types (Elite and non-Elite, experienced and naive) are represented in the data. This allows us to study how ratings and missing attributes differ by the types.
The sampling leaves us with 45,652 reviews from 2,704 businesses and 19,583 reviewers. As past restaurant reviews might impact current reviews, we incorporate restaurants’ time-varying features (e.g., variance and mean of past reviews) by extracting all past reviews for the restaurants in our stratified sample. The full data set (including all past reviews for restaurants in our sample) contains 250,000 reviews. We generate each review's time-varying variables, including number of past reviews, mean and variance of past star rating, and mean and variance of past attribute ratings.
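The time-varying variables can be sketched for a single restaurant as follows. The sketch assumes reviews are already sorted by posting date and uses the population variance; both the function name and the variance convention are assumptions rather than details reported here.

```python
from statistics import mean, pvariance


def past_review_features(star_ratings_in_time_order):
    """For each review, summarize only the reviews posted before it."""
    features = []
    for index in range(len(star_ratings_in_time_order)):
        past = star_ratings_in_time_order[:index]
        features.append({
            "n_past": len(past),
            "mean_past": mean(past) if past else None,    # undefined for first review
            "var_past": pvariance(past) if past else None,
        })
    return features


# Hypothetical star ratings for one restaurant, oldest first.
features = past_review_features([5, 3, 4])
```

The same routine would apply to past attribute ratings; using only earlier reviews keeps each feature free of information that was unavailable when the review was written.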
Table 3 provides descriptive statistics for reviews, reviewers, and restaurants. First, we show a comparison of the characteristics of the full data and stratified sample of 45,652 reviews in terms of review and reviewer characteristics. The mean and median number of reviews per reviewer in our sample are slightly higher than in the population (due to stratification). However, the reviewers in our sample are fairly similar to the population in terms of average star rating, experience, and length of reviews. The lower part of Table 3 shows the distribution of different business types in the stratified sample. Our sample has almost an equal mix of chain and independent restaurants, but independent restaurants get more reviews with higher ratings on average. Low-end and high-end restaurants do not show much difference in terms of average star rating.
Table 3. Reviews, Reviewers and Business (Summary Statistics of Data Set and Sample).
Attribute Sentiment Classification
We first report the results of converting text data into quantifiable attribute and sentiment scores. We report the performance in three parts: (1) overall classification accuracy, (2) classification accuracy on “hard” sentence types, and (3) polarity and attribute classification.
Overall Classification Accuracy
The lexicon-based method that relies on carefully crafted rules and human-tagged lexicons performs better than most supervised machine learning algorithms and is as good as the convolutional–LSTM in the attribute classification task. This is because this task is relatively unambiguous and the lexicons are constructed specifically for the domain of restaurant reviews. However, this method does very poorly in the more complex five-grained sentiment analysis task. Among supervised algorithms, SVMs do better than most of the other classifiers in both attribute and sentiment classification tasks. This is in line with previous literature showing that SVMs are the best machine learning–based text classifiers. The network with only the convolutional layer just matches the performance of the SVM. However, the convolutional–LSTM does better than all methods in both attribute and sentiment classification tasks (see Table 4). The accuracy of the convolutional–LSTM in the task of five-level sentiment classification is 50%, lower than the state-of-the-art (SOTA) accuracy of 56% reported in Brahma (2018); however, that figure is for a different data set, for which we do not know the mix of “hard” versus “easy” sentences in the corpus. Relatedly, other articles do not report dimensions of classification accuracy such as confusion matrices, so we are unable to benchmark on these other relevant accuracy metrics.
Table 4. Comparison of Text Mining Methods.
Notes: CNN = convolutional neural network.
The convolutional–LSTM model with self-trained embeddings does slightly better than the one using pretrained GloVe embeddings in terms of both attribute and sentiment accuracy. This could be attributed to the slightly more relevant vocabulary generated when word vectors are trained from scratch on a specific corpus.
Classification Accuracy by Sentence Types
To gain intuition about when our model performs better than benchmarks, we assess the classification accuracy by sentence type. For this, we sampled 100 sentences of each type from the test data set. Table 5 reports the results. Our hybrid convolutional–LSTM performs better than other models for all sentence types, but especially so for hard sentences that require consideration of the spatial and sequential structure. Interestingly, the overall classification accuracy is particularly improved for the scattered sentiments in long sentences.
Table 5. Performance by Sentence Types.
Notes: CNN = convolutional neural network.
Sentiment Polarity and Attribute Classification
A natural question is whether the improvement in convolutional–LSTM is only for fine-grained attribute sentiment scoring or whether it can be of help for sentiment valence as well (i.e., positive, negative, and neutral). Overall, the best-performing convolutional–LSTM model detects the correct valence for 74% of the sentences in the test data compared with 55% for lexicons and 64% for SVM. Further, these improvements are substantially larger for various types of hard sentences. The convolutional–LSTM with self-trained embeddings is particularly good at preserving polarity for positive classes (ratings of 4 and 5), whereas the convolutional–LSTM with GloVe 300 embeddings is better balanced because it preserves polarity reasonably well for both positive and negative classes. Thus, our model is potentially valuable even for applications that only require sentiment valence. In addition, with respect to attribute classification, the model has more than 70% accuracy across four of the five classes (except location, which is sometimes confused with ambiance). The confusion matrices are provided in the Web Appendix (Tables W4b and W4c).
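Valence accuracy can be computed from five-point predictions by collapsing ratings before comparing them. The mapping below treats ratings of 4 and 5 as positive, as stated above; treating 3 as neutral and 1 and 2 as negative is an assumption, and the function names and toy labels are hypothetical.

```python
def valence(rating):
    """Collapse a five-point rating into positive, neutral, or negative."""
    if rating >= 4:
        return "positive"
    if rating <= 2:
        return "negative"
    return "neutral"


def valence_hit_rate(predicted, actual):
    """Share of sentences whose predicted valence matches the coded valence."""
    matches = sum(valence(p) == valence(a) for p, a in zip(predicted, actual))
    return matches / len(actual)


predicted = [5, 2, 3, 4]
actual = [4, 1, 3, 2]
exact_rate = sum(p == a for p, a in zip(predicted, actual)) / len(actual)
valence_rate = valence_hit_rate(predicted, actual)
```

A prediction of 5 for a true rating of 4 counts as a miss on the five-point scale but as a hit on valence, which is why valence accuracy is always at least as high as fine-grained accuracy.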
Structural Model of Rating Behavior
Overall, the three-segment model fits best based on Bayesian information criterion (BIC); at 201, the BIC for the three-segment model is lower than the BICs of the two-segment and four-segment models. To help with the interpretation of the structural model estimates, we first describe the descriptive characteristics of the three segments. We then report and interpret the estimates of the structural model.
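The selection rule can be sketched with the usual definition BIC = k ln(n) − 2 ln(L), where k is the number of parameters, n the number of observations, and L the maximized likelihood. The log-likelihoods and parameter counts below are hypothetical and are not the values from our estimation.

```python
import math


def bic(log_likelihood, n_params, n_obs):
    """Bayesian information criterion; lower values indicate a better trade-off."""
    return n_params * math.log(n_obs) - 2.0 * log_likelihood


# Hypothetical (log-likelihood, number of parameters) by number of segments.
candidates = {2: (-1050.0, 20), 3: (-1000.0, 30), 4: (-995.0, 40)}
n_obs = 1000

scores = {g: bic(ll, k, n_obs) for g, (ll, k) in candidates.items()}
best_segments = min(scores, key=scores.get)
```

In the toy numbers the fourth segment improves the likelihood only slightly, so the penalty for its additional parameters outweighs the gain and the three-segment solution is preferred.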
Segment Description
Table 6 presents the descriptive statistics of three segments. Segment 1 (9% of reviewers), the smallest segment, consists of 65% Elite reviewers. The people in this segment write the most often, contribute double their share in reviews (19%), write the longest reviews, and include the most attributes. On average, they tend to write earlier than others, be harsher in their review than the restaurant’s average rating, and have relatively low rating variance. Given the high percentage of Elites, greater frequency, and more comprehensive and longer reviews, we name this segment “status-seeking regulars.”
Table 6. Reviewer Segments: Descriptors.
In contrast, Segment 3, accounting for 30% of reviewers, has no Elites and writes least often, contributing only 25% of reviews. The reviewers write the shortest reviews and include the fewest attributes. They tend to write at later stages, after others have provided their reviews, and generally tend to be more generous in their overall ratings. Interestingly, they also have the highest variance in their reviews, though they visit restaurants with high ratings and lower variance. We call them the “emotive irregulars,” given their lower frequency and limited contributions in text reviews. They tend to offer either very positive or relatively negative reviews.
Finally, Segment 2 is the largest segment, with 61% of the reviewers; it has only 26% Elites and contributes 56% of reviews. Their behavior lies between the other two more extreme segments. They write fewer, shorter reviews and include fewer attributes than Segment 1, but more than Segment 3. Their ratings are very similar to the average of the restaurant ratings. We call these reviewers the “altruistic mass.” They form the bulk of the Yelp reviewing community: they write reviews diligently with little expectation of rewards and merely want to make their voice heard.
Mapping Latent Utility to Attribute Ratings: The Ordinal Logit Model
We present the estimates of the ordinal logit model that maps latent utility to attribute ratings in two parts. Table 7, Panel A, presents the mapping between restaurant observables and true latent attribute-level experience. As we expected, restaurants with higher ratings have overall higher latent utility, chains have lower latent utility, and prices reduce latent utility. Figure 3 shows the thresholds

Figure 3. Structural model estimates: attribute-level thresholds of latent utility for nonelites by segment.
Table 7. Structural Model Estimates: Drivers of Attribute Rating and Overall Rating.
*p < .1.
**p < .05.
***p < .01.
Notes: Standard deviations are in parentheses.
How Attributes Impact Overall Ratings
Table 7, Panel B, shows the weights on the attribute ratings that impact overall rating for the three latent segments. For ease of interpretation, the weights reported are normalized such that they sum to 1. Note that all weights were estimated to be positive even though such a constraint was not imposed. Segment 1 (status-seeking regulars), the smallest at 9%, places the most importance on food and service in terms of overall ratings. Segment 2 (altruistic mass), the largest at 61%, cares the most about food. In contrast, the ratings of Segment 3 (emotive irregulars), with 30% of reviewers, are driven mostly by value and location. Overall, the root mean square error (RMSE) of predicted overall ratings across the three segments is .79, suggesting that the model has good explanatory power.
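The normalization and the weighted-sum reading of the overall rating can be sketched as follows. The raw weights and attribute ratings are hypothetical, and the sketch ignores any intercept or reviewer covariates that enter the full rating equation.

```python
def normalize_weights(raw_weights):
    """Rescale attribute weights so that they sum to 1."""
    total = sum(raw_weights.values())
    return {attribute: w / total for attribute, w in raw_weights.items()}


def predicted_overall(weights, attribute_ratings):
    """Overall rating as a weighted sum of attribute ratings."""
    return sum(weights[a] * attribute_ratings[a] for a in weights)


# Hypothetical raw weights for one segment and ratings for one review.
raw = {"food": 2.0, "service": 1.0, "ambiance": 0.5, "value": 0.3, "location": 0.2}
ratings = {"food": 5, "service": 4, "ambiance": 3, "value": 2, "location": 4}

weights = normalize_weights(raw)
overall = predicted_overall(weights, ratings)
```

With normalized weights, each coefficient can be read as the share of the overall rating attributable to that attribute, and the predicted overall rating stays on the same five-point scale as the attribute ratings.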
Attribute Self-Selection in Reviews
Table 8 presents the estimates of the attribute-mention equations. The estimates enable us to test the informativeness and praise/vent conjectures as drivers of attribute mention in reviews. The hit rates for mentions are quite high: food (.87), service (.84), ambiance (.71), value (.73), and location (.85).
Table 8. Structural Model Estimates: Drivers of Attribute Mention.
*p < .1.
**p < .05.
***p < .01.
Notes: Standard deviations are in parentheses. The baseline for the dummy variables (attribute and sentiment) are attribute ambiance and sentiment 3. We have 45,652 reviews and 5 attributes, thus 228,260 (45,652
Informativeness
Table 8, Panel A, shows support for the informativeness hypothesis. The higher positive attribute coefficients for food and service and negative coefficients of value and location, relative to the normalized ambiance coefficient of zero, support our conjecture that reviewers write more often about experience attributes and tend to be silent on search attributes that can be discovered easily on the site. Further, as we expected, variance has a positive coefficient, in support of the hypothesis that attributes are more likely to be mentioned when opinions about a restaurant are varied. Interestingly, negative deviations induce the attribute to be mentioned, but the opposite is true for positive deviations. This is the case across all segments.
Praise/vent need
Table 8, Panel B, uses the coefficients of the interaction terms between attribute and sentiment level to help test the praise/vent conjecture for attribute mentions. Those who want to praise/vent are more likely to provide extreme sentiments (ratings of 1, 2, or 5) than moderate sentiments (ratings of 3 and 4). For food and service, there is a higher probability of reporting moderate ratings compared with more extreme ratings. For value and location, there is a tendency to vent, especially among those in Segment 3, who value these attributes.
Finally, we assess the role of attribute importance in attribute-mention choice by comparing an attribute’s probability of being missing across segments (bottom panel of Table 6) with the attribute importance weights of the three segments (Table 7). Food and service (and, to a lesser extent, ambiance) have the lowest missing values. Food, service, and ambiance also have among the highest impact on overall ratings for Segments 1 and 2. But for Segment 3, even though food and service do not drive overall ratings, they still are the most written-about attributes. Similarly, even though value and location impact overall rating for Segment 3, these are still the attributes most often absent in text reviews. Thus, the relationship between attribute importance and mention is not clear and can vary by reviewer type.
In summary, the information value of reviews plays a significant role in the motivation to write about attributes across all segments. We also found the motivation to praise good performance and vent about bad performance, but this varied across segments and attributes. For staple features such as food, service, and ambiance, all three segments are more likely to write when satisfied and less likely to write when dissatisfied. Overall, this might explain in general why reviews tend to be skewed more positively on rating sites, if this also translates to selection into who writes reviews. However, members of Segment 3 are likely to vent more when dissatisfied about two attributes that drive their ratings—value and location. The lack of a strong link between attribute importance for overall ratings and mentions in reviews suggests that online reviews may not be as complete a source of topic and need identification, as previously believed. However, we note that this could be because the importance of an attribute for satisfaction, conditional on visit, may not account for the importance of that attribute in driving visits. Nevertheless, our results suggest caution in the use of frequency of mentions as a proxy for benefit or need importance; we suggest that this issue be explored in future research.
Validation of Imputation
We validate our model-based imputation approach in Table 9 by assessing the ability to predict attribute ratings on a holdout sample, relative to benchmark imputations. For benchmarks, we consider a model with no reviewer heterogeneity in rating styles and rating importance weights and ad hoc fixed imputations. For the fixed imputations, we considered whether missing ratings reflect average satisfaction (rating of 3), very low satisfaction (rating of 1) or very high satisfaction (rating of 5). We do this by comparing the predicted attribute rating with the observed rating if an attribute rating is present on the holdout sample (10% of the observations). The overall RMSE across all attributes is lower for our model relative to the benchmarks. Even when the RMSE is compared by attribute, we find that our model does better on all attributes.
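The validation logic can be sketched as a comparison of RMSE across imputation rules on held-out ratings. All numbers below are made up for illustration and say nothing about the results in Table 9; the function and variable names are hypothetical.

```python
import math


def rmse(predicted, observed):
    """Root mean square error between predicted and observed ratings."""
    squared = [(p - o) ** 2 for p, o in zip(predicted, observed)]
    return math.sqrt(sum(squared) / len(squared))


holdout = [4, 5, 3, 4, 2]       # hypothetical observed attribute ratings
model_based = [4, 4, 3, 4, 3]   # hypothetical model-based predictions
fixed = {value: [value] * len(holdout) for value in (1, 3, 5)}

scores = {"model": rmse(model_based, holdout)}
scores.update({f"fixed_{value}": rmse(pred, holdout) for value, pred in fixed.items()})
best_rule = min(scores, key=scores.get)
```

Because the comparison uses only ratings that were actually observed in the holdout sample, it tests each rule on cases where the truth is known before the rule is applied to genuinely missing attributes.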
Table 9. Model Fit: RMSE Across Imputations.
Correction for Attribute Mention (Silence) in Attribute Ratings
We now illustrate how correcting for attribute mentions (silence) through imputation at the individual review level can impact overall attribute rating for a restaurant. In Table 10, we can see that correction for missing attributes has significant impact on attributes that are missed more frequently: value and location in general, and food for chain restaurants. The correction could be either upward or downward depending on attribute, restaurant type, and reviewer type. For example, at an independent restaurant in Phoenix, where most reviewers are silent about service at higher satisfaction levels, observed service ratings are lower than actual service ratings after imputing for missing attribute ratings. Then, correction results in higher service ratings than observed ratings (Figure 4, Panel A). Food and ambiance scores barely change, and value and location scores slightly go up after imputation for this restaurant. In Figure 4, Panel B, we illustrate a chain restaurant in Las Vegas where many of the reviewers are silent on food and location attributes when satisfied and silent on value rating when dissatisfied. Here, food and location scores increase, and value score decreases. Overall, this shows that our imputation approach based on restaurant observables, rater observables, and unobservable heterogeneity is extremely flexible in its imputations and the ability to correct for missing attribute ratings.
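The restaurant-level correction amounts to averaging over all reviews after filling silent reviews with their imputed ratings, instead of averaging over the mentioned ratings only. The ratings and imputations below are hypothetical, and the function names are our own.

```python
def mean_observed(ratings):
    """Average over reviews that mention the attribute (None means silent)."""
    observed = [r for r in ratings if r is not None]
    return sum(observed) / len(observed)


def mean_corrected(ratings, imputed):
    """Average over all reviews, using the imputed rating where the review is silent."""
    filled = [r if r is not None else i for r, i in zip(ratings, imputed)]
    return sum(filled) / len(filled)


# Hypothetical service ratings for five reviews of one restaurant.
service_ratings = [2, None, 3, None, None]
service_imputed = [2.1, 4.5, 3.0, 4.2, 4.8]  # used only where the review is silent

observed_average = mean_observed(service_ratings)
corrected_average = mean_corrected(service_ratings, service_imputed)
```

In this toy case the silent reviewers are imputed to be satisfied, so the corrected average (3.7) lies well above the observed average (2.5), mirroring the upward correction described for service at the independent restaurant.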

Change in average attribute rating.
Impact of Imputation on Attribute Ratings.
Notes: N = 2,719.
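To make the direction of the correction concrete, the following Python sketch compares a restaurant's average attribute rating before and after filling in silent reviews. All numbers are hypothetical, and the `imputed` values are placeholders standing in for review-level imputations from a model; they are not taken from the article.

```python
from statistics import mean

def observed_mean(ratings):
    """Average over reviews that mention the attribute (None marks silence)."""
    return mean(r for r in ratings if r is not None)

def corrected_mean(ratings, imputed):
    """Average over all reviews, filling silent ones with imputed values."""
    return mean(r if r is not None else i for r, i in zip(ratings, imputed))

# Hypothetical service ratings for one restaurant; None marks attribute silence.
service = [2, None, 3, None, None, 4]
# Hypothetical review-level imputations; only the silent positions are used.
imputed = [None, 5, None, 4, 5, None]

before = observed_mean(service)           # mean of 2, 3, 4
after = corrected_mean(service, imputed)  # mean of 2, 5, 3, 4, 5, 4
shift = after - before                    # positive: correction is upward
```

In this toy case the silent reviews are imputed as satisfied, so the corrected average is higher than the observed one, mirroring the upward service correction in the Phoenix example; imputations below the observed mean would move the average downward instead.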
Conclusion
This research addresses the general problem of using unstructured text data to generate quantifiable market feedback typically obtained through surveys; the specific application is to use restaurant reviews to generate attribute-level ratings of restaurants. We discuss two novel and challenging problems around online text reviews: (1) converting text into fine-grained numerical sentiment scores on prespecified attributes (e.g., food, service) by accounting for language structure, and (2) accounting for missing attributes in attribute sentiment scoring. For the first problem, we use a deep learning convolution-LSTM model that exploits the spatial and sequential structure of language to improve sentiment classification, especially on known types of hard sentences in NLP. To address missing attributes, we create and estimate a structural model of reviewer rating behavior that takes into account the data-generating process to develop a model-based imputation procedure to address attribute silence. Overall, our article illustrates the value of combining “engineering” thinking underlying machine learning approaches with “social science” thinking from econometrics to answer novel marketing questions.
Substantively, we identified three segments of reviewers: the smallest but most active segment (“status-seeking regulars”); the largest segment (“altruistic mass”), who review without reward expectations; and people who review infrequently but write about attributes they are extremely satisfied or dissatisfied with (“emotive irregulars”). Our insights on attribute silence in reviews show that the need to inform and the need to praise/vent drive more of the review writing than the importance of the attribute. Not only does this contribute to the literature on why people engage in online word of mouth (Berger 2014); it also has implications for using reviews as a source of data for needs/benefits identification. In particular, contrary to conventional wisdom, the frequency of mentions of a benefit or a topic may not necessarily be a proxy for its importance for all types of reviewers.
We conclude with suggestions for future research. First, while our method improved accuracy for all hard sentence types, there is clearly room for further improvement. It would be useful to evaluate recent transformer-based language models (e.g., BERT, GPT-2), which are contextual and use an attention mechanism, on our task of sentence-level, fine-grained sentiment analysis of hard sentences. These models have improved the state of the art in several language tasks, especially those involving longer sequences such as paragraphs (though our context is only a sentence). Second, our article shows that many traditional lexicon-based approaches used for text analysis have significant levels of classification error even with respect to sentiment valence, and more so with respect to fine-grained sentiment scoring. It would be useful to study whether the measurement error in the conversion from text to numeric data induced by these simpler (but intuitive) methods leads to attenuation bias that affects the substantive conclusions in social science research. Third, given our objective to summarize only reviews that have been written (as those are what impact consumer perception of sentiments), we abstracted away from the issue of selection in the decision to write reviews. Further, we abstracted away from the issue of fake reviews/review shading. It would be worthwhile to combine our content analysis at the attribute level with work on fake reviews/review shading to get a richer understanding of how to correct for issues of selection and fake reviews in tracking word of mouth.
Supplemental Material
Supplemental material, sj-pdf-1-mrj-10.1177_00222437211052500 for Attribute Sentiment Scoring with Online Text Reviews: Accounting for Language Structure and Missing Attributes by Ishita Chakraborty, Minkyung Kim and K. Sudhir in Journal of Marketing Research
Appendix: Estimation Algorithm: Detailed Steps
Acknowledgments
The authors thank the participants in the marketing seminars at Duke, Hong Kong University of Science and Technology, City University of Hong Kong, Indian School of Business, Kellogg School of Management at Northwestern University, Penn State, National Institute of Industrial Engineering, University of British Columbia, University of North Carolina at Charlotte, University of Texas at Austin, University of Wisconsin–Madison, University of California Berkeley, Notre Dame, University of Washington Marketing Camp, the Yale Quant Marketing Lunch Seminar, the Yale School of Management Faculty Seminar, the 2019 INFORMS Society for Marketing Science Doctoral Consortium, the 2018 Carnegie Mellon University–Temple University Conference on Big Data and Machine Learning, the 2019 Summer Institute in Competitive Strategy, the 2019 Management Science Workshop at University of Chile, 2019 Interactive Marketing Research Conference in Houston, and the 2019 SAS AI Symposium.
Associate Editor
P.K. Kannan
Declaration of Conflicting Interests
The author(s) declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.
Funding
The author(s) received no financial support for the research, authorship and/or publication of this article.
