Abstract
The author proposes a topic model tailored to the study of creative documents (e.g., academic papers, movie scripts), which extends Poisson factorization in two ways. First, the creativity literature emphasizes the importance of novelty in creative industries. Accordingly, this article introduces a set of residual topics that represent the portion of each document that is not explained by a combination of common topics. Second, creative documents are typically accompanied by summaries (e.g., abstracts, synopses). Accordingly, the author jointly models the content of creative documents and their summaries, and captures systematic variations in topic intensities between the documents and their summaries. This article validates and illustrates the model in three domains: marketing academic papers, movie scripts, and TV show closed captions. It illustrates how the joint modeling of documents and summaries provides some insight into how people summarize creative documents and enhances understanding of the significance of each topic. It shows that the model described produces new measures of distinctiveness that can inform the perennial debate on the relation between novelty and success in creative industries. Finally, the author shows how the proposed model may form the basis for decision support tools that assist people in writing summaries of creative documents.
With the digitization of the economy, people are both producing and consuming more creative content. On the supply side, according to Florida (2014), more than 40 million Americans (or approximately one-third of the employed population) belong to the “creative class.” This class includes people in science and engineering, education, arts, and entertainment whose primary economic function is to create new ideas, new technology, and new creative content. On the demand side, the average American spends approximately 12 hours per day consuming media (Statista 2017), and the media and entertainment industry alone is valued at approximately $2 trillion globally (Statista 2018).
This article uses the term “creative document” to refer to any written document that describes the output of a creative process. Examples include academic papers, fiction books, movie and TV show scripts, plays, business models, and new product descriptions. In contrast, noncreative documents include news articles, instruction manuals, and so on. In addition to being managerially relevant, creative documents have captured the interest of academics. Several studies have attempted to identify correlates of success in creative industries—in particular, the link between the distinctiveness of a creative document and its success (e.g., the link between the distinctiveness of an academic paper and its number of citations).
Studying creative documents on a large scale in a scientific manner has been historically challenging, due to the unstructured nature of the data contained in these documents. With the development of natural language processing tools such as latent Dirichlet allocation (LDA) (Blei et al. 2003) and Poisson factorization (Canny 2004), it has become possible to systematically extract text-based topics and features from creative documents. Although some studies have applied variations of traditional topic models to the study of creative documents (e.g., Berger and Packard 2018; Eliashberg, Hui, and Zhang 2007, 2014; Toubia et al. 2018), I argue that these models fail to capture at least two essential aspects of creative documents.
First, the creativity literature has shown that novelty is a key construct when it comes to creative content. Traditional topic models perform dimensionality reduction by approximating each document using a set of topics, which are common across all documents in the corpus. With a traditional topic model, the distinctiveness of a document may be measured by the distinctiveness of its combination of topics. However, traditional topic models fail to capture another aspect of distinctiveness: the extent to which a document may not be explained by common topics. As such, I argue that traditional topic models are limited in their ability to provide rich measures of distinctiveness, which may inform the debate on the link between novelty and success in creative industries.
Second, creative documents are often accompanied by summaries. For example, academic papers are accompanied by abstracts, books and movies by synopses, new products by short descriptions, business plans by executive summaries, and so on. Summaries play a key role in the market by helping consumers extract information from creative products more efficiently and decide which products to consume. For example, a consumer may be enticed to buy a book or watch a movie by a synopsis, or to buy a new product by its short description. One may argue that summaries serve as a “lubricant” in the market for creative content and soften competition by making it easier for consumers to decide which products to consume. Traditional topic models do not capture the relation between a document and its summary. I argue that modeling and quantifying the process by which humans summarize creative documents not only is interesting from an academic perspective but also offers practical benefits. From the perspective of extracting meaningful, interpretable topics from a corpus of creative documents, summaries may be viewed as shorter documents produced by people who invested time and effort to determine which topics in a creative document are “essential” enough to be included in its summary. As such, summaries have the potential to improve our understanding of the significance of each topic. Moreover, modeling the summarization process opens the door for the development of computer-based tools to assist authors and marketers in creative industries in writing summaries of creative documents. For example, by identifying characteristics of summaries that correlate with success in a specific creative industry, it is possible to advise authors to emphasize certain topics in their summaries.
Motivated by these two characteristics of creative documents, I propose a topic model tailored to the study of creative documents. The contribution of this research is primarily methodological. The model extends Poisson factorization in two ways. First, it accounts for not only the portion of a document that may be explained by topics that are common across documents, but also the “residual” (or “outside the cone”; see the geometric interpretation) portion that is not explained by combinations of these common topics. Second, I jointly model the content of creative documents and their summaries. The model represents systematic variations in the extent to which each common topic, as well as each “residual” topic, appears in summaries compared with full documents.
While topic models have been applied to creative documents, to the best of my knowledge this model is the first topic model specifically tailored for creative documents. The model offers at least three benefits that traditional topic models cannot provide, for both academics and practitioners. First, each topic estimated by the model comes with a variable that quantifies the extent to which the topic was deemed “summary worthy” by the people who wrote the summaries of the documents in the corpus. I illustrate how this additional layer of information provides some insight into the process by which people summarize creative documents in a particular domain and enhances our understanding of the significance of each topic. Second, for academics and practitioners interested in participating in the ongoing debate on the link between distinctiveness and success of creative products, I show that the model provides various measures of distinctiveness, which have the potential to uncover new insight into correlates of success in creative industries. I use three data sets to empirically explore the relation between three measures of distinctiveness and various success measures (i.e., number of citations of academic papers, movie and TV show ratings, and movie return on investment). Third, I show that the model may serve as the basis for interactive decision support tools that assist people in writing summaries of creative documents. The development of such tools may be informed by an empirical analysis of correlates of success in the target industry. For example, I find that marketing academic papers whose abstracts put relatively more emphasis on the “outside the cone” content in the paper tend to have more citations. Accordingly, the model can help authors identify the “outside the cone” content in their paper and emphasize it in their abstract. I develop a proof of concept for such a tool.
Relevant Literatures
The study of creativity in various domains, from scientific discovery (e.g., Uzzi et al. 2013) to linguistics (e.g., Giora 2003) and innovation (Toubia and Netzer 2017), has suggested that creativity lies in the optimal balance between novelty and familiarity. For example, Ward (1995, p. 166) argues that “truly useful creativity may reflect a balance between novelty and a connection to previous ideas.” Furthermore, building on previous research from a wide range of domains (e.g., Finke, Ward, and Smith 1992; Mednick 1962), Toubia and Netzer (2017) show that when attempting to quantify familiarity and novelty in a document using text analysis, researchers should focus on novel versus familiar combinations of words, rather than words that themselves appear more or less frequently.
These insights inform the modeling approach used herein. I adopt a natural language processing approach, which captures topics defined as combinations of words. The model nests and extends previous applications of Poisson factorization to the study of text documents, such as Canny (2004) and Gopalan, Charlin, and Blei (2014). For example, Gopalan, Charlin, and Blei (2014) study how researchers rate academic papers by modeling documents and researcher preferences as latent vectors in a topic space. The model proposed herein builds on Gopalan, Charlin, and Blei’s (2014) model, though it differs in a few important ways. First, I model the content of full documents and their summaries, rather than modeling the content of documents and consumers’ preferences for these documents. These different objectives give rise to very different data, model specifications, and data-generating processes. Moreover, I jointly model the content of documents and their summaries, I explicitly model “residual” topics, and I model how residual topics are represented in summaries, none of which is performed by Gopalan, Charlin, and Blei’s (2014) model. Finally, I use offset variables in a novel way to capture systematic variations in topic intensities in full documents versus summaries. As noted in the introduction, several papers have used extant topic models to study creative documents (e.g., Berger and Packard 2018; Eliashberg, Hui, and Zhang 2007, 2014; Toubia et al. 2018). However, to the best of my knowledge the model developed here is the first topic model tailored to the study of creative documents.
Note that most applications of topic modeling in the marketing literature have used latent Dirichlet allocation (LDA; Blei et al. 2003) or extensions thereof (e.g., Büschken and Allenby 2016; Liu and Toubia 2018; Puranam, Narayan, and Kadiyali 2017; Tirunillai and Tellis 2014; Toubia et al. 2018; Zhong and Schweidel 2020). The basic LDA model shares many similarities with the basic Poisson factorization model, although previous research has suggested that Poisson factorization tends to fit data better (Canny 2004; Gopalan, Hofman, and Blei 2013; Gopalan, Charlin, and Blei 2014). My choice of Poisson factorization was primarily driven by the attractive conjugacy property of this approach. Indeed, the model remains conditionally conjugate, despite the additional complexities resulting from jointly modeling the content of documents and summaries while explicitly capturing residual content.
Despite the importance of summaries in the commercialization of creative content, summarization has received very little attention in the marketing literature. In contrast, it is a substantial subfield of computer science (see, e.g., Allahyari et al. 2017; Nenkova and McKeown 2012; Radev, Hovy, and McKeown 2002; Yao, Wan, and Xiao 2017). However, computer scientists have focused mostly on automatic text summarization, in which a summary is produced without any human intervention. This process is typically done by identifying and selecting a subset of the sentences in the original document, a process called extractive summarization (Allahyari et al. 2017). Such text summarization tools are useful for summarizing large numbers of documents (e.g., news articles) on a regular basis, quickly and efficiently (McKeown and Radev 1995; Radev and McKeown 1998). In contrast, I focus on situations in which summaries provide additional content written by humans, from which valuable insights might be learned. In terms of practical applications, I envision computers not as a replacement for, but rather as an aid to humans, and consider decision support tools that assist them in writing summaries of creative documents. The different perspective on summarization adopted here also translates into methodological differences. Some studies have applied topic modeling to text summarization, sometimes introducing document-specific topics that capture unique content in each document, which should be included in the summary (Daumé and Marcu 2006; Delort and Alfonseca 2012; Haghighi and Vanderwende 2009). These document-specific topics are similar in spirit to the residual topics in the model. However, given their focus on extractive summarization, unlike my model, these models do not consider summaries as an additional source of information, they do not model the content of summaries, and they do not include summaries in their training data.
Proposed Model
Model Foundation: Poisson Factorization
I index creative documents by
For each regular topic
For each document For each topic, draw topic intensity For each word
To gain intuition for this base model, recall that the sum of independent Poisson-distributed random variables is a Poisson variable. Hence, according to Poisson factorization, the number of occurrences of word
One can also interpret Poisson factorization geometrically. (To the best of my knowledge, the following geometric interpretation of Poisson factorization is new to the literature.) Topics and documents may be represented in the Euclidean space defined by the words in the vocabulary. That is, topic
Mathematically, the positive combinations of the set of topic vectors,

Geometric interpretation of “inside the cone” versus “outside the cone” content.
In summary, the primary focus of traditional topic models such as Poisson factorization is to understand topics that are common across documents in a corpus and to quantify the intensity with which each topic is featured in each document. In doing so, Poisson factorization approximates each document as a positive combination of common topics.
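To make the "positive combination of common topics" concrete, the following is a minimal sketch of how the Poisson rate of each word in one document is assembled from topic intensities and topic weights. The names `theta_d`, `beta`, and `poisson_rates`, as well as the numerical values, are illustrative assumptions rather than notation or estimates taken from the article.

```python
def poisson_rates(theta_d, beta):
    """Poisson rate of each word in one document.

    theta_d[k] is the (nonnegative) intensity of topic k in the document and
    beta[k][v] is the (nonnegative) weight of word v in topic k.
    The rate for word v is sum_k theta_d[k] * beta[k][v].
    """
    n_words = len(beta[0])
    return [sum(t * topic[v] for t, topic in zip(theta_d, beta))
            for v in range(n_words)]


# Two hypothetical topics over a three-word vocabulary.
beta = [[0.5, 0.5, 0.0],
        [0.0, 0.2, 0.8]]
theta_d = [4.0, 10.0]

rates = poisson_rates(theta_d, beta)  # one rate per word in the vocabulary
```

Because every intensity and every weight is nonnegative, the vector of rates always lies in the cone spanned by the topic vectors, which is the property the geometric interpretation relies on.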
Residual Topics
My model extends Poisson factorization in two ways. First, it captures “outside the cone” content by introducing one “residual topic” associated with each document (to my knowledge, a novel way of using Poisson factorization). For each document
The introduction of this residual topic was motivated by the creativity literature, in an attempt to account for distinct content in the document. One may wonder whether the residual topic is simply “noise.” To address this issue, the “Empirical Applications” section empirically tests whether the residual topic indeed relates to the success of creative documents in ways that are predicted by the creativity literature. If this topic were “just noise,” no systematic relation with the success of creative documents should be present. Theoretically, the model still includes “noise,” above and beyond the residual topics. Indeed, the number of occurrences of each word remains stochastic and governed by a Poisson distribution. In addition, the prior induces sparsity and trades off fit with the complexity of the model. As a result, the expected value of the number of occurrences of each word according to the model does not perfectly fit the observed value, even in the presence of residual topics.
Figure 1 illustrates geometrically how the vector corresponding to a document is decomposed into two vectors: the “inside the cone” component that projects the document vector onto the cone defined by the regular topics and the “outside the cone” component that closes the gap between the original vector and the projection. (Again, this simple illustration focuses on expected values and ignores the effect of the prior; the actual model produces a distribution of word occurrences, and fit is not perfect due to the sparsity-inducing prior.)
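The decomposition can be illustrated numerically with a small sketch. It uses a Euclidean (least-squares) projection onto the cone, computed by cyclic coordinate descent, purely as a geometric illustration; this is an assumption made for simplicity and is not the estimation procedure of the model, which works with a Poisson likelihood and a prior. In particular, a Euclidean residual can have negative entries, whereas a residual topic cannot. All names and values are hypothetical.

```python
def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def project_onto_cone(doc, topics, n_iter=200):
    """Least-squares projection of `doc` onto the cone spanned by `topics`.

    Solves min_w ||doc - sum_k w[k] * topics[k]||^2 subject to w >= 0
    by cyclic coordinate descent. Returns (weights, inside, outside).
    """
    weights = [0.0] * len(topics)
    for _ in range(n_iter):
        for k, topic in enumerate(topics):
            # Part of the document left unexplained by all topics other than k.
            partial = [x - sum(weights[j] * topics[j][v]
                               for j in range(len(topics)) if j != k)
                       for v, x in enumerate(doc)]
            # Best nonnegative weight for topic k given the other weights.
            weights[k] = max(0.0, dot(topic, partial) / dot(topic, topic))
    inside = [sum(w * t[v] for w, t in zip(weights, topics))
              for v in range(len(doc))]
    outside = [x - y for x, y in zip(doc, inside)]
    return weights, inside, outside


# Two topics that only use the first two words; the document also uses a third.
topics = [[1.0, 0.0, 0.0],
          [0.0, 1.0, 0.0]]
doc = [1.0, 2.0, 3.0]

weights, inside, outside = project_onto_cone(doc, topics)
```

In this toy case the "inside the cone" component carries the first two words and the "outside the cone" component carries the third word, which no common topic can reach.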
Offset Variables
The second way in which the model extends Poisson factorization is that it jointly models the content of creative documents and their summaries. To that end, I introduce a set of “offset” variables that capture how topics are weighed in summaries, compared with full documents. The topic intensities in the summary of a creative document may not be the same as the topic intensities in the full document. First, some regular topics may be typically judged by the authors of summaries as being more or less worthy of being featured in a document’s summary, which should translate into systematic differences across regular topics in how they are weighed in summaries versus full documents. For example, topics that relate to data analysis (substantive findings) may be relatively under-weighed (over-weighed) in the abstracts of academic papers compared with the full papers. To capture and quantify such phenomena, I allow each regular topic
A perennial issue with traditional topic models is the difficulty of interpreting topics, resulting from the unsupervised nature of these models. Offset variables provide an additional layer of information that helps users understand the significance of each topic by giving it a “score” that captures the extent to which people decide to include this topic when writing summaries of creative documents in the domain under study. Although offset variables have been used for different purposes in previous applications of Poisson factorization (e.g., Gopalan, Charlin, and Blei 2014), to the best of my knowledge this article, as the first to use Poisson factorization to jointly model documents and their summaries, is also the first to use offset variables to capture how the intensities of topics vary between documents and summaries. Web Appendix F further explores the impact of introducing offset variables by estimating an alternative version of the model that does not include these variables, showing that the topics learned by this alternative model are substantively different from the topics learned by the proposed model. In the proposed model, topics are defined as groups of words that tend to not only appear together but also appear with the same relative frequency in summaries compared with full documents. Accordingly, the presence of offset variables affects the topics learned from the model.
Data-Generating Process
Putting all these pieces together, the data generating process for the model is as follows:
For each regular topic
For each residual topic
For each document
For each document summary
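The following sketch simulates word counts for one document and its summary, taking the model parameters as given (the prior draws are omitted). It assumes that offset variables enter multiplicatively: each regular topic's intensity is scaled by a topic-specific offset in the summary, and the residual topic is scaled by a document-specific offset. This functional form, the names (`eps`, `eps_res_d`, `beta_res_d`), and the numbers are assumptions made for illustration and should be checked against the formal specification.

```python
import math
import random


def draw_poisson(rate, rng):
    """Draw one Poisson variate (Knuth's method; fine for small rates)."""
    threshold = math.exp(-rate)
    count, product = 0, rng.random()
    while product > threshold:
        count += 1
        product *= rng.random()
    return count


def document_rates(theta_d, beta, beta_res_d):
    """Rates for the full document: regular topics plus the residual topic."""
    return [sum(t * topic[v] for t, topic in zip(theta_d, beta)) + beta_res_d[v]
            for v in range(len(beta_res_d))]


def summary_rates(theta_d, beta, beta_res_d, eps, eps_res_d):
    """Rates for the summary, with each component rescaled by its offset (assumed form)."""
    return [sum(t * e * topic[v] for t, e, topic in zip(theta_d, eps, beta))
            + eps_res_d * beta_res_d[v]
            for v in range(len(beta_res_d))]


# Hypothetical parameter values: two regular topics, three words.
beta = [[0.5, 0.5, 0.0], [0.0, 0.2, 0.8]]
theta_d = [4.0, 10.0]
beta_res_d = [0.0, 1.0, 0.0]   # residual topic of this document
eps = [0.5, 0.1]               # offsets of the regular topics
eps_res_d = 2.0                # offset of this document's residual topic

doc_rates = document_rates(theta_d, beta, beta_res_d)
sum_rates = summary_rates(theta_d, beta, beta_res_d, eps, eps_res_d)

rng = random.Random(0)
doc_counts = [draw_poisson(r, rng) for r in doc_rates]
sum_counts = [draw_poisson(r, rng) for r in sum_rates]
```

Under this assumed form, an offset above 1 for the residual topic means that the summary puts relatively more emphasis on "outside the cone" content than the full document does.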
Estimation Using Variational Inference
To estimate the model, I start by defining auxiliary variables that allocate the occurrences of each word
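The role of such auxiliary variables can be illustrated with a simplified sketch. Given fixed Poisson rates, the occurrences of a word split across the regular topics and the residual topic in proportion to each component's rate, which follows from the superposition property of independent Poisson variables. The sketch below uses point estimates of the parameters; a variational algorithm would instead use expectations under the variational distribution, which is not shown here. Names and values are hypothetical.

```python
def expected_allocation(count, theta_d, beta_v, beta_res_dv):
    """Expected split of one word's count across topics, given fixed rates.

    theta_d[k] * beta_v[k] is the rate contributed by regular topic k and
    beta_res_dv is the rate contributed by the document's residual topic.
    Returns one expected count per regular topic, followed by the residual.
    """
    parts = [t * b for t, b in zip(theta_d, beta_v)] + [beta_res_dv]
    total = sum(parts)
    if total == 0:
        return [0.0] * len(parts)
    return [count * p / total for p in parts]


# A word observed 10 times; two regular topics plus the residual topic.
alloc = expected_allocation(10, theta_d=[4.0, 10.0], beta_v=[0.5, 0.2],
                            beta_res_dv=1.0)
```

The allocated counts always sum to the observed count, so no occurrence is left unassigned.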
Selecting the Number of Topics
Although the number of topics could be selected using cross-validation to achieve minimum perplexity, I use a simpler approach advocated by Gopalan, Charlin, and Blei (2014): I set the number of topics
Extension: Dynamic Topics
Web Appendix G introduces a dynamic extension of this model, inspired by Blei and Lafferty (2006). I model each topic as having a base version and introduce a set of time-specific offset variables that capture the evolution of each topic over discrete time periods. In each time period, the weights of each topic are assumed to be equal to the weights in the previous period, plus a set of offset variables specific to that topic and that time period. This extension is also estimated using variational inference. I apply it to the marketing academic paper data set, which contains all papers published in a set of journals over six years. I find that the introduction of dynamics does not change the conclusions from the empirical analysis.
Empirical Applications
Data Sets
I apply the model to three data sets. In each data set, all documents were preprocessed following standard steps in natural language processing: eliminate non-English characters and words, numbers, and punctuation; tokenize the text (i.e., break each document into individual words or tokens); remove common stop words; and remove tokens (words) that contain only one character. No stemming or lemmatization was performed. In each data set, I randomly split the set of documents into two samples: a calibration set with 75% of the documents and a validation set with 25% of the documents.
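The preprocessing steps and the calibration/validation split can be sketched as follows. The stop-word list is a tiny hypothetical stand-in, lowercasing is an assumption, and the regular expression only drops non-letter characters; filtering out non-English words would require a dictionary and is not shown.

```python
import random
import re

# Hypothetical, deliberately tiny stop-word list for illustration only.
STOP_WORDS = {"the", "in", "of", "and", "to", "is"}


def preprocess(text, stop_words=STOP_WORDS):
    """Tokenize, then drop numbers, punctuation, stop words, and one-character tokens."""
    tokens = re.findall(r"[a-z]+", text.lower())  # keeps runs of ASCII letters only
    return [tok for tok in tokens if len(tok) > 1 and tok not in stop_words]


def split_documents(doc_ids, calibration_share=0.75, seed=0):
    """Randomly split document ids into calibration and validation sets."""
    ids = list(doc_ids)
    random.Random(seed).shuffle(ids)
    cut = round(calibration_share * len(ids))
    return ids[:cut], ids[cut:]


tokens = preprocess("The model, in 2021, fits a corpus of 858 scripts!")
calibration, validation = split_documents(range(8))
```

No stemming or lemmatization is applied in this sketch, consistent with the description above.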
I constructed the vocabulary of words in each data set based on the full documents in the calibration set only (i.e., I did not use the summaries and the validation documents to select the vocabulary). I computed the term frequency (
The first data set consists of the full texts (excluding the abstracts, bibliographies, and appendices) and the abstracts of all 1,333 research papers published in Journal of Consumer Research, Journal of Marketing, Journal of Marketing Research, and Marketing Science between 2010 and 2015. Most of the papers were downloaded in PDF format. Some spelling errors occurred while converting PDF files to text files; hence, a spelling corrector was trained based on the autocorrection package in Python and applied before preprocessing the data. Table 1 reports descriptive statistics for all data sets after preprocessing.
Descriptive Statistics.
Notes: There are 1,333 papers in the academic paper data set, 858 movies in the movie data set, and 26,561 TV show episodes in the TV show data set. There are 1,000 words in the vocabulary for each data set. The first column (vertically aligned) contains the data set, the second the metric of interest, the third the unit of analysis, and the remaining columns report the mean, standard deviation, min, and max of the corresponding metric across the units of analysis. For example, the first row indicates that in the marketing academic paper data set, papers have on average 2,110.26 word occurrences.
The second data set consists of the scripts and synopses of 858 movies released in the United States for which scripts were available on the Internet Movie Script Database (imsdb.com) and synopses were available on the Internet Movie Database (IMDB; imdb.com). Words corresponding to names of locations, people, and organizations were identified using the Stanford Named Entity Recognition classifier and removed from the data before preprocessing.
For the third data set, I collaborated with a major global media company interested in creating a “knowledge graph” for its extensive library of TV content (i.e., identifying a set of meaningful, interpretable topics that describe each TV show episode to classify its content). The company made available the collection of closed captions for 26,561 unique TV show episodes, which constitute most of the company’s catalog of U.S.-based, English-language TV show episodes. The company decided to work with closed captions because they are available systematically and consistently for all episodes, as they are required by the Federal Communications Commission. The company also made available the synopses of all TV show episodes, which are part of its internal programming system. As in the previous data set, words corresponding to names of locations, people, and organizations were removed from the data before preprocessing.
Fit and Predictive Performance
Benchmarks
The proposed model extends Poisson factorization in two ways. First, it models “residual” topics that are unique to each document. Second, it allows the topic intensities in summaries to differ from the topic intensities in main documents. To determine the benefits of these two extensions, I tested a series of nested models. All benchmarks are estimated using variational inference, with the same convergence criterion and hyperparameters. The first benchmark considered is a nested model that does not include residual topics. This benchmark is a nested version of the proposed model, in which
Finally, I consider LDA, a nonnested benchmark, due to its popularity. Because LDA does not include offset variables, the topic intensities in the summary of a document are assumed to be the same as in the full document. In addition, LDA does not include residual topics. Web Appendix D provides details of the LDA benchmark.
Measures of Fit
I estimate each model on the full texts and summaries of the calibration documents in each data set. The output from the model and any of its nested benchmarks may be summarized by computing a vector of fitted Poisson rates
In addition, for each document, fitted Poisson rates can be constructed for the number of occurrences of words in the document’s summary:
To compare the model with LDA, I transform these Poisson rates into multinomial distributions
I measure fit using the standard measure of perplexity (Blei et al. 2003). Given a set of full documents
Perplexity is defined similarly for the document summaries:
where
For each model, I also estimate the intensities on regular topics
Therefore, the in-sample fit measures consist of the perplexity scores for the full texts of the calibration documents, the summaries of the calibration documents, and the full texts of the validation documents. In addition, Web Appendix H reports the deviance information criterion (DIC) for each benchmark, showing that it is lowest for the full model in all three data sets.
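As a rough sketch of the fit measure, the code below converts fitted Poisson rates into multinomial distributions by normalizing them within each document and then computes perplexity in its usual form, the exponential of minus the average log-likelihood per word occurrence. This is the standard definition associated with Blei et al. (2003); the function names and the toy data are assumptions.

```python
import math


def to_multinomial(rates):
    """Normalize a document's Poisson rates into word probabilities."""
    total = sum(rates)
    return [r / total for r in rates]


def perplexity(counts, rates):
    """Perplexity of observed counts[d][v] under fitted rates[d][v].

    Lower values indicate better fit. Assumes every word that actually
    occurs in a document has a strictly positive fitted rate.
    """
    log_lik, n_tokens = 0.0, 0
    for w_d, r_d in zip(counts, rates):
        p_d = to_multinomial(r_d)
        for w, p in zip(w_d, p_d):
            if w > 0:
                log_lik += w * math.log(p)
                n_tokens += w
    return math.exp(-log_lik / n_tokens)


# Sanity check: uniform probabilities over four words give a perplexity of 4.
uniform_ppl = perplexity(counts=[[3, 0, 2, 5]], rates=[[1.0, 1.0, 1.0, 1.0]])
```

The uniform case is a convenient reference point: a model that cannot distinguish among the words in the vocabulary has a perplexity equal to the vocabulary size.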
Measure of Predictive Performance
The predictive task considered herein is that of predicting the content of the summary of a validation document, given the full text of this document and the model parameters estimated on the set of calibration documents. Consider a validation document
Results
Table 2 reports the performance of the proposed model, the nested benchmarks, and LDA on each of the three data sets. The comparisons between benchmarks are similar across data sets. It is evident that the proposed model performs best in terms of fitting the summaries of calibration documents and predicting the summaries of validation documents and that the “No residual topic” benchmark usually performs worse than the “
Fit and Predictive Performance.
Notes: Fit and predictive performance are measured using perplexity (lower values indicate better fit).
The “Residual topics only” benchmark, not surprisingly, performs best in terms of fitting the full documents. This benchmark does not attempt to learn any topic across documents; that is, it does not generate any substantive insight. In addition, the fit on the full documents comes at the expense of fitting or predicting the content of the summaries of documents. Interestingly, this benchmark performs similarly to the “
Web Appendix H tests an alternative measure of predictive performance in which I randomly hold out a subset of the word occurrences in each validation document and predict them based on the parameter estimates and the other words in the document. In this scenario, the content of validation summaries is predicted based only on a subset of the words in the full document. I find that the full model performs best in terms of predicting the held-out portion of validation documents and the summaries of validation documents, with the exception of the marketing academic paper data set, in which the “Residual topics only” benchmark performs slightly better at predicting the held-out portion of validation documents.
In summary, these results suggest it is reasonable to extend Poisson factorization to study creative documents and their summaries, by capturing residual content and systematic differences in topic intensities in summaries versus full documents using offset variables. The following three sections illustrate three benefits offered by the proposed model over traditional topic models, as listed in the introduction. First, the joint modeling of creative documents and their summaries sheds light on the process by which people summarize creative documents and enhances understanding of the significance of the topics estimated by the model. Second, the model may be used to construct various measures of distinctiveness for creative documents, which can inform the debate on the link between distinctiveness and success in creative industries. Third, I present a proof of concept of an online tool based on the model, which can assist humans in writing summaries of creative documents. The remainder of the article focuses on the results based on estimating the model on the calibration sample in each data set.
Model Output: Topics and Offset Variables
As mentioned previously, I set the number of regular topics

Distribution of offset variables: marketing academic papers.
Figure 3 reports the distribution of the proportion of fitted content assigned to the residual topic (“outside the cone”) in documents and summaries, for the academic paper data set. The proportion of fitted “outside the cone” content in document

Distribution of the proportion of fitted “outside the cone” content in documents and summaries: marketing academic papers.
I next report descriptions of the nonflat regular topics and illustrate the type of insight offered by estimating offset variables for these topics. Web Appendix A reports the offset variables, the average topic intensities across documents, and the words with the highest topic weights for all nonflat regular topics in each data set. I also visualize some of these topics by creating word clouds using randomly drawn words according to a discrete probability distribution with weights proportional to the topic weights

Word clouds for regular topics with smallest offset variables: marketing academic papers.

Word clouds for regular topics with largest offset variables: marketing academic papers.
Web Appendix A displays similar information for the movie and TV show data sets. In the movie data set, the topics with the lowest offset variables appear to relate to the setting of various scenes in the movie. In the TV show data set, the two topics with the smallest offset variables appear to relate to standard dialogues. The topic with the largest offset variable appears to relate to actions (e.g., “gets,” “takes,” “finds,” “comes”), and relationships (e.g., “friends,” “family”). The topic with the second largest offset variable appears to relate to the appearance of guest stars and other special events in the episode.
The figures and tables reported in this section illustrate the additional layer of information provided by the joint modeling of creative documents and their summaries. Offset variables provide insight into the process by which people summarize creative documents in a particular domain and enhance understanding of the significance of each topic. As noted previously, the introduction of residual topics reduces the number of nonflat regular topics estimated by the model. “Rare” topics, those that are shared by only a small number of documents, are likely to be reflected in residual topics rather than regular topics. Hence, if a researcher’s goal is to identify such rare topics, the version of the model that does not include residual topics may be preferred. Including residual topics, in contrast, greatly improves the model’s ability to fit and predict the content of documents and summaries and allows researchers to develop a rich set of distinctiveness measures that may be linked to success. Indeed, when residual topics are not present, two of the distinctiveness measures defined in the next section become unavailable.
Measuring the Distinctiveness of Creative Documents
Some debate has occurred in the literature on the relationship between distinctiveness and success in creative industries. This section reviews some of the empirical studies that have contributed to this debate and shows that the proposed model may be used to estimate various measures of distinctiveness, which may help researchers paint a more nuanced picture of the relationship between the distinctiveness and success of creative documents.
Distinctiveness Measures Based on Proposed Model
I consider three distinctiveness measures, each of which relies on different aspects of the model. The first, directly based on Berger and Packard (2018), measures the distinctiveness of the combination of regular topics in a document. Given a reference group
The second measure, which is novel and not available from traditional topic models, is based on the “outside the cone” content in the document. I compute, for each creative document, the proportion of fitted content allocated to the residual topic:
In these analyses, I standardize all three measures across documents for interpretability. Table 3 reports the correlations between the three distinctiveness measures in each data set. The lack of consistently high correlation between any pair of measures suggests that these three measures indeed capture different aspects of creative documents.
Correlation Between Distinctiveness Measures.
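Standardizing the measures and computing their pairwise correlations can be sketched as below. The helper names and the toy values are hypothetical; the sketch assumes the sample standard deviation (n − 1 denominator), which the text does not specify.

```python
import math

def standardize(values):
    """Return z-scores: subtract the mean, divide by the sample standard deviation."""
    n = len(values)
    mean = sum(values) / n
    sd = math.sqrt(sum((v - mean) ** 2 for v in values) / (n - 1))
    return [(v - mean) / sd for v in values]

def correlation(x, y):
    """Pearson correlation, computed from the standardized values."""
    zx, zy = standardize(x), standardize(y)
    return sum(a * b for a, b in zip(zx, zy)) / (len(x) - 1)

# Hypothetical values of two distinctiveness measures for five documents.
inside_cone = [0.2, 0.5, 0.1, 0.9, 0.4]
outside_cone = [0.30, 0.10, 0.25, 0.05, 0.40]
r = correlation(inside_cone, outside_cone)
```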
Distinctiveness versus Success in Academic Papers
The extant literature has found a general positive relationship between distinctiveness and number of citations in academic papers (Uzzi et al. 2013). The number of citations of each paper in the data set is extracted using the application programming interface (API) offered by Crossref (www.crossref.org). For the papers in the calibration data set, I regress the log of 1 plus the number of citations (I take the log due to the skewness of the number of citations) on the three distinctiveness measures. I control for journal fixed effects, publication year fixed effects, the paper’s intensities on (nonflat) regular topics
Link Between Distinctiveness Measures and Citations: Marketing Academic Papers.
Note: Ordinary least squares regression. All three distinctiveness measures are standardized across papers for interpretability.
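A regression of this form can be sketched with NumPy as follows. This is a simplified illustration, not the author's code: the function name and toy data are hypothetical, and only journal fixed effects are included (the publication year fixed effects and topic intensity controls would enter as additional columns of the design matrix).

```python
import numpy as np

def fit_citation_regression(citations, measures, journal_ids):
    """OLS of log(1 + citations) on standardized distinctiveness measures
    plus one dummy per journal (the dummies absorb the intercept)."""
    y = np.log1p(np.asarray(citations, dtype=float))
    X = np.asarray(measures, dtype=float)
    X = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)  # standardize across papers
    journals = sorted(set(journal_ids))
    dummies = np.array([[1.0 if j == k else 0.0 for k in journals]
                        for j in journal_ids])
    design = np.hstack([X, dummies])
    coef, *_ = np.linalg.lstsq(design, y, rcond=None)
    return coef[:X.shape[1]]  # coefficients on the distinctiveness measures

# Hypothetical toy data: 40 papers, 3 measures, 4 journals.
rng = np.random.default_rng(0)
toy_measures = rng.normal(size=(40, 3))
toy_journals = [i % 4 for i in range(40)]
toy_citations = rng.poisson(20, size=40)
coefs = fit_citation_regression(toy_citations, toy_measures, toy_journals)
```

Because the measures are standardized, each coefficient can be read as the change in log(1 + citations) associated with a one standard deviation increase in the corresponding measure.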
These results are purely correlational. Moreover, I was not able to include all the variables considered in previous analyses of the drivers of citations of marketing academic articles (e.g., Stremersch, Verniers, and Verhoef 2007; Stremersch et al. 2015, who do not focus on distinctiveness). My goal is not to make definitive claims on the causal relation between distinctiveness and number of citations of marketing academic papers; rather, it is to illustrate how the distinctiveness measures derived from the proposed model may be used by researchers interested in contributing to that literature. Interestingly, at least in this data set, the content of summaries appears to be related to the success of creative documents. This echoes Pryzant, Chung, and Jurafsky (2017), who study the link between the presence of certain phrases in the description of products in e-commerce platforms (e.g., including references to authority or seasonality) and product sales. Given the ubiquity of summaries across creative industries, further research could link the success of creative products to variations in the content of their summaries.
Distinctiveness versus Success in Entertainment Products
While the extant literature makes a clear prediction on the link between distinctiveness and citations in academic papers, the literature is not as clear on the link between distinctiveness and success in the context of entertainment products. On the one hand, Berger and Packard (2018) show that songs whose lyrics are more different from their genres are ranked higher in digital downloads. Danescu-Niculescu-Mizil et al. (2012) and Askin and Mauskapf (2017) also find that distinctiveness is an attractive feature of entertainment products. On the other hand, according to Salganik, Dodds, and Watts (2006), the content of entertainment products has little impact on the success of these products. This echoes previous research by Hahn and Bentley (2003) and by Bielby and Bielby (1994), who also report a quote from a past president of CBS Entertainment that “all hits are flukes.”
Ratings
I first analyze the link between the three measures of distinctiveness and the ratings of movies and TV shows. For each movie in the calibration data set, I collect the average rating from IMDB (based on the ratings of IMDB users), which I standardize across movies for interpretability. I include fixed effects for the movie’s MPAA rating, fixed effects for the movie’s genre(s), the movie’s intensities on the (nonflat) regular topics, the movie’s duration (in min), and the log of the movie’s production budget (in U.S. dollars, adjusted for inflation, using the tool available at https://data.bls.gov/cgi-bin/cpicalc.pl). All these control variables (with the exception of the intensities on regular topics) are obtained from IMDB. Results are provided in the first column of Table 5. I find that “outside the cone distinctiveness” is positively related to the movie’s rating. Interestingly, “inside the cone distinctiveness” is actually negatively related to the movie’s rating in the data set (i.e., movies whose regular topic intensities deviate more from the mean of their genre tend to receive lower ratings). I also find that “outside the cone emphasis in summary” is not significantly related to ratings. This is not surprising, given that the role played by synopses in the movie industry is more restricted than the role played by abstracts in academia.
Link Between Distinctiveness Measures and Performance: Movies.
Notes: Each column corresponds to one regression estimated separately using ordinary least squares. All three distinctiveness measures and movie ratings are standardized across movies for interpretability. Observations in the first (second) regression are limited to movies for which production budget was available (production budget and box office performance were available).
For TV shows, I obtained IMDB ratings for 9,358 of the episodes in the calibration data set (some episodes were not found on IMDB, and IMDB reports ratings only for episodes that were rated by at least five users). For the analysis, I only kept episodes from TV series for which ratings on at least two episodes were available, so that I could include fixed effects for each TV series. This resulted in 9,285 observations and 318 fixed effects. In addition, I control again for the episode’s intensities on the (nonflat) regular topics. Results are provided in Table 6. In this data set, consistent with the analysis of movie ratings, I find that “outside the cone distinctiveness” is positively related to the TV show’s rating. The coefficients for the other two measures of distinctiveness are not statistically significant.
Link Between Distinctiveness Measures and Performance: TV Episodes.
Notes: Ordinary least squares regression. All three distinctiveness measures and episode ratings are standardized across episodes for interpretability.
Return on investment
Finally, for movies, I analyze the link between distinctiveness and financial success, measured as the log of the movie’s return on investment, defined as in Eliashberg, Hui, and Zhang (2014) as the ratio of the movie’s domestic box office performance (also obtained from IMDB) to its production budget. In addition to the controls included in the first regression reported in Table 5, I control for the movie’s rating. Results, again based on the calibration data set, are provided in the second column of Table 5. In this data set, none of the distinctiveness measures is significantly related to financial success.
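The dependent variable in this regression is a simple transformation; a minimal sketch, with a hypothetical function name and made-up dollar amounts, is:

```python
import math

def log_roi(domestic_box_office, production_budget):
    """Log of return on investment, with ROI defined as the ratio of
    domestic box office to production budget (both must be positive)."""
    if domestic_box_office <= 0 or production_budget <= 0:
        raise ValueError("box office and budget must be positive")
    return math.log(domestic_box_office / production_budget)

# Hypothetical movie: $150M domestic box office on a $60M budget.
example = log_roi(150e6, 60e6)
```

A positive value indicates that domestic box office exceeded the production budget; a negative value indicates the opposite.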
Discussion
The analysis provided herein suggests that “inside the cone distinctiveness,” “outside the cone distinctiveness,” and “outside the cone emphasis in summary” provide meaningful and useful measures of distinctiveness, which may have different relations to success, depending on the context and on how success is defined and measured. Across three data sets, “outside the cone distinctiveness” (a novel measure introduced here) is robustly and positively associated with success. In contrast, “inside the cone distinctiveness” (which is directly based on extant research) is positively related to the number of citations of marketing academic papers but negatively related to movie ratings. This is not inconsistent with the literature, which suggests that distinctiveness should be positively related to success for academic papers, but which is more ambivalent on the link between distinctiveness and success in entertainment industries. Finally, in the context of marketing academic papers, I find that putting more emphasis in an academic paper’s abstract on the “outside the cone” content from the paper is associated with a larger number of citations.
Note that the measures of distinctiveness are based on the entire set of training documents and thus do not capture novelty with respect to contemporaneous documents. In particular, some documents may have been novel when they were released/published and may have become influential, leading to similar future documents. Such novel documents may not score high on the distinctiveness measures despite being novel, due to the presence of similar documents in the corpus. The dynamic version of the model described in Web Appendix G addresses this issue by allowing topics to evolve over time, hence measuring the distinctiveness of a document with respect to the topics defined at the time this document was published. I apply this dynamic extension of the model to the marketing academic paper data set, which contains all papers published in a set of journals over six years, and find that “inside the cone distinctiveness,” “outside the cone distinctiveness,” and “outside the cone emphasis in summary” are still all positively and significantly related to the number of citations.
Web Appendix H also tests various alternative measures of distinctiveness and alternative ways to explore the link between distinctiveness and success. I find that as the vocabulary size changes, the significance of some of the coefficients associated with distinctiveness measures may change, although I observe no reversal (i.e., a coefficient that is significant in one direction under one vocabulary size is never significant in the other direction under a different vocabulary size). A simulation study conducted to illustrate how measures of distinctiveness are affected as relevant words are omitted from the vocabulary or as irrelevant words are included in the vocabulary confirms that the selection of the vocabulary size is bound to have some impact on the output of the model. While this is not an attractive feature, unfortunately this is a characteristic of any topic model, not just the one presented here. Using alternative specifications that link distinctiveness to financial success in the movie data set yields similar results to those reported in Table 5. Measuring “inside the cone distinctiveness” using the entire set of training documents as the reference group, rather than documents in the same journal / genre / TV series, produces results similar to those reported in Tables 4–6. I perform an analysis that reflects the fact that measures of distinctiveness are constructed from model parameters that are estimated with uncertainty rather than measured precisely. I run each regression 1,000 times using different draws from the posterior distribution of the model parameters and report the average coefficients as well as whether the 90% and 95% credible intervals include 0. Results are consistent with those reported in Tables 4–6. Finally, I measure distinctiveness using standard topic models (LDA and Poisson factorization), rather than the proposed model. 
“Inside the cone distinctiveness” is the only distinctiveness measure available from these models, and it is never statistically significantly related to success in any of the regressions, with the exception of “inside the cone distinctiveness” estimated based on the standard Poisson factorization, which is marginally related to the return on investment of movies.
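The robustness check that reruns each regression over posterior draws can be sketched as follows. This is an illustration under assumptions not stated in the text: the intervals are taken as percentile intervals of the coefficients across draws, controls are omitted, and all names and simulated data are hypothetical.

```python
import numpy as np

def regress_over_draws(y, measure_draws, levels=(0.90, 0.95)):
    """Run one OLS regression per posterior draw of the distinctiveness
    measures; report average coefficients and whether each percentile
    interval across draws excludes zero.

    measure_draws has shape (n_draws, n_documents, n_measures)."""
    y = np.asarray(y, dtype=float)
    coefs = []
    for X in np.asarray(measure_draws, dtype=float):
        design = np.column_stack([np.ones(len(y)), X])  # intercept + measures
        beta, *_ = np.linalg.lstsq(design, y, rcond=None)
        coefs.append(beta[1:])
    coefs = np.array(coefs)
    summary = {"mean": coefs.mean(axis=0)}
    for level in levels:
        lo, hi = np.quantile(coefs, [(1 - level) / 2, (1 + level) / 2], axis=0)
        key = f"excludes_zero_{int(round(level * 100))}"
        summary[key] = (lo > 0) | (hi < 0)
    return summary

# Simulated data: 300 documents, 2 measures, 200 draws (the paper uses 1,000).
rng = np.random.default_rng(0)
true_measures = rng.normal(size=(300, 2))
outcome = 1.0 * true_measures[:, 0] + rng.normal(scale=0.5, size=300)
draws = true_measures + rng.normal(scale=0.1, size=(200, 300, 2))
summary = regress_over_draws(outcome, draws)
```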
Computer-Assisted Summary Writing
As mentioned in the literature review, the traditional approach in the computer science literature would be to attempt to completely automate the summarization of documents, typically via sentence extraction. I argue that this approach is less relevant in the context of creative documents. In particular, the nature of creative documents is such that the stakes are usually high enough for people to be motivated and available to write summaries. For example, the author or publisher of a new book typically has enough motivation to write a synopsis for this document and may not find as much value in a tool that would automatically generate a summary. Similar comments may be made about the publisher of a new movie or play, the author of an academic paper, the developer of an innovative product, the author of a business plan, and so on. This is in contrast to the traditional text summarization literature that typically deals with the summarization of large volumes of documents such as news articles, where automation has significant cost saving implications. Moreover, sentence extraction is likely to be an inappropriate text summarization approach in many creative contexts. For example, an abstract of a scientific paper made exclusively of sentences from the paper, or a TV show synopsis made exclusively of sentences from the show’s dialogues, may be unacceptable to the relevant audience. Accordingly, I argue that in the creative context, it is more useful to develop decision support tools that assist humans in writing summaries of creative documents, rather than developing automatic text summarization tools featuring sentence extraction.
I have built a proof of concept for such a decision support tool, using PHP and a MySQL database. The tool allows a user to upload a creative document that was not necessarily part of the corpus on which the model was estimated. When the user submits a new document
As output, the tool reports representative words for the five regular topics with the largest intensities (
Because the model should be run separately in each domain, I customize the tool for each domain of application. I have created one version of the online tool corresponding to each corpus studied herein (marketing academic papers, movies, and TV shows). This proof of concept is publicly available at http://creativesummary.org.
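The reporting step mentioned above, in which the tool lists representative words for the five regular topics with the largest intensities, might look like the following sketch. It is written in Python for illustration (the actual tool uses PHP), and it assumes that "representative" means the highest-weight words of each topic; the function name, data structures, and toy values are hypothetical.

```python
def top_topics(intensities, topic_words, n_topics=5, n_words=3):
    """Return the n_topics topics with the largest intensities, each with
    its n_words highest-weight words."""
    ranked = sorted(intensities, key=intensities.get, reverse=True)[:n_topics]
    report = []
    for topic in ranked:
        weights = topic_words[topic]
        words = sorted(weights, key=weights.get, reverse=True)[:n_words]
        report.append((topic, intensities[topic], words))
    return report

# Hypothetical fitted intensities and topic weights for an uploaded document.
toy_intensities = {"t1": 0.1, "t2": 2.3, "t3": 0.7, "t4": 1.5, "t5": 0.0, "t6": 0.9}
toy_topic_words = {
    t: {"alpha": 0.5, "beta": 0.3, "gamma": 0.15, "delta": 0.05}
    for t in toy_intensities
}
report = top_topics(toy_intensities, toy_topic_words)
```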
Importantly, such a decision support tool may also leverage analysis such as the one reported in the previous section, to help users improve the effectiveness of their summaries. For example, I found that marketing academic papers in which the abstract puts more emphasis on the “outside the cone” content in the paper (i.e., higher
Conclusions
The contribution of this article is primarily methodological. I develop and apply a new topic model designed specifically for the study of creative documents. Guided by the creativity literature, this model nests and extends Poisson factorization in two ways. First, I explicitly model residual, “outside the cone” content and how it is represented in summaries versus documents. Second, I jointly model the content of documents and their summaries, and quantify (using offset variables) how the intensity of each topic differs systematically in summaries compared with full documents. I validate the model using three data sets containing marketing academic papers (summarized by abstracts), movie scripts (summarized by synopses), and TV show closed captions (summarized by synopses). The proposed model offers the standard benefits of topic models; that is, it extracts topics from a corpus of documents and assigns intensities on each topic for each document (although the introduction of residual topics changes the number and content of the nonflat regular topics). This article illustrates three additional benefits the model provides for academics and practitioners. First, the offset variables estimated by the model, which quantify the extent to which each topic was deemed “summary worthy” by the humans who wrote the summaries of the documents in the corpus, shed light on the process by which humans summarize creative documents and identify the significance of each topic. Second, I illustrate how the model may be used to construct new measures of distinctiveness for creative documents, which have the potential to shed new light on the relation between distinctiveness and success in creative industries. Third, I develop an online, interactive, freely accessible tool based on the model, which provides a proof of concept for using the model’s output to assist humans in writing summaries of creative documents.
I close by highlighting additional areas for future research. First, it would be interesting to introduce covariates into the model that influence the topic intensities and/or the offset variables. In the context of entertainment products, such covariates might include genres, country of origin, and so on. In the context of academic papers, these covariates may include subfields, whether the paper is based on a dissertation, and so on. Second, alternative topic models may capture the structure of creative documents (e.g., different sections, scenes, acts). Third, it would be worthwhile to study how the content of summaries varies systematically based on the objectives of the summary. For example, in some cases summaries serve primarily as “teasers” for creative products, while in others they serve more as “substitutes” for the products. Accordingly, the offset variables might differ systematically between spoilers and synopses, or between abstracts written for conferences versus journal articles.
Online supplement
Supplemental Material, web_appendix for A Poisson Factorization Topic Model for the Study of Creative Documents (and Their Summaries) by Olivier Toubia in Journal of Marketing Research
Footnotes
Acknowledgments
Yanyan Li, Ahmed Mrad, and Sibel Sozuer Zorlu provided outstanding research assistance on this project.
Associate Editor
Vrinda Kadiyali
Declaration of Conflicting Interests
The author(s) declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.
Funding
The author(s) received no financial support for the research, authorship, and/or publication of this article.
