Abstract

1. Introduction
Goldenstein and Poschmann (hereafter GP, this volume, pp. 83–131) offer a novel approach, or combination of approaches, for tracking multiple layers of “meaning” in a collection of texts over time. The core theoretical argument is sound: Any given text conveys multiple layers of meaning simultaneously, the relationships across those layers may evolve dynamically over time in any given corpus, and different text-analytic tools capture different layers of meaning and should be able to characterize this co-evolution if used in combination. In fact, similar logic underlies recent developments at the cutting edge of natural language processing (NLP) research. There are, however, some pernicious measurement traps in the text-analytic techniques GP apply here, and in the absence of some important validation, we cannot be certain that they have captured the meaning they assert.
2. The Target Concept and the Data
In their running example, GP explore how the (layered, dynamic) narrative about “corporate responsibility” in the United States has changed since 1950. GP are modest about their substantive goals, indicating their example merely “facilitates first interpretations and . . . provides a beneficial road map for a deeper consideration of responsibility’s multiplexity.” Contrasting with literature that claims no change or slow change, they assert their analysis shows corporate responsibility “has changed significantly over time” and “did not happen gradually.”
In this example, they analyze a set of 15,371 articles from the New York Times and Washington Post that appeared in the business section between 1950 and 2013, use the word responsibility or responsibilities (hereafter responsibilit*), and identify a corporation or a representative as the one responsible. The sentence mentioning responsibility is analyzed with dependency parsing as a possible source of a “semantic triplet” capturing “macro-framing.” A triplet is recorded if responsibilit* is the direct object of a verb and has one or more identified modifiers. The remaining sentences in the same paragraph, stripped of all but adjectives and nonproper nouns, become a document in the topic model capturing the “context” in which the triplet appears.
If compiled as described, these data are extremely sparse. I was able, for example, to find exactly two instances (I think) of the triplet organization-have-day-to-day in the New York Times between 1950 and 1978 (spanning the first two eras GP define). The paragraph containing the first one reads, in its entirety, “Edie is a wholly-owned subsidiary of Lionel D. Edie & Co., a Merrill Lynch sub-unit. It will have day-to-day responsibility for investment decisions.” This also produces the triplet organization-have-decisions and perhaps organization-have-investment. The adjectives and nonproper nouns of the first sentence form the three-word context document to be topic-modeled: wholly-owned subsidiary sub-unit. It is unclear, but presumably this three-word document is replicated in the topic-modeling data for each of the other triplets extracted. In the second instance I found, the target sentence comprises the whole paragraph, meaning there is no context document at all. It is not clear what happens in this case, which is extremely common in newspapers. (It is possible GP did not really mean “paragraph,” but we are not given any examples or statistics about length.)
If it is typical that specific triplets are rare, context documents are one to three sentences in length (or even blank), and context documents may be duplicated with each extracted triplet, the sparsity and sampling noise in the data are likely substantial and consequential. This is likely to be particularly problematic for understanding eras one and two, given they appear collectively to comprise only about 20 percent of the data, and it will make comparisons over time very noisy.
3. Meaning in the Topics
We have very little to go on with respect to measurement validation of the latent Dirichlet allocation (LDA) topic model, so it is difficult to know what meaning is being captured by the topics (“semantic patterns”) it finds. We are given no example documents, no descriptive statistics about the input document-term matrix, no descriptive statistics about the LDA output model beyond highest ranking keywords, no statistics over time, and no comparison to other measures (Grimmer and Stewart 2013; Quinn et al. 2010). For the moment, then, I turn to another example to illustrate the nontopical sorts of meaning that LDA and similar topic models can capture. 1
The example provided in the documentation of the Python lda package used by GP is a 20-topic model of 395 news articles from 1996 to 1997 selected from the Reuters RCV1 data set using a search for “church.” 2 Several of these are recognizable as news “topics” that would be expected given the scope of the data set: for example, the Pope, Mother Teresa, the status of Scientology in Germany. Others seem like clear topics with dubious relationship to church (e.g., elvis king fans presley or yeltsin russian russia president), and others seem more abstract (e.g., years year time last or church government political country).
Inspection of the high-loading documents helps with the first group. It turns out the Elvis and Yeltsin topics, along with three others, are artifact topics defined by duplicate or triplicate articles. LDA chases these perfectly replicated term co-occurrences and devotes entire topics to them. Did that happen here? GP do not say whether they deduplicated the articles before they started. And if every document enters the model once for every unique triplet it contains, the data may be mostly duplicates. In any case, given how short these documents are, it is not inconceivable that there are many duplicates and near duplicates consisting entirely or primarily of common co-occurring terms like chief executive, vice president, or board directors.
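A check of this kind is cheap to run before fitting the model. The following is a minimal sketch, assuming each context document is available as a list of tokens; the function name `duplicate_share` and the example documents are hypothetical illustrations, not GP's data or code.

```python
from collections import Counter

def duplicate_share(docs):
    """Return the share of documents that repeat an earlier bag of words,
    together with the count of documents per distinct bag."""
    # Represent each document as an order-insensitive bag of words.
    bags = [tuple(sorted(Counter(doc).items())) for doc in docs]
    counts = Counter(bags)
    # Every copy beyond the first occurrence of a bag is a duplicate.
    n_duplicates = sum(c - 1 for c in counts.values())
    share = n_duplicates / len(docs) if docs else 0.0
    return share, counts

# Hypothetical context documents, for illustration only.
docs = [
    ["chief", "executive"],
    ["executive", "chief"],
    ["wholly-owned", "subsidiary", "sub-unit"],
    ["vice", "president"],
]
share, counts = duplicate_share(docs)
```

A high share, or a handful of bags with very large counts, would suggest that some topics are being fit to replicated documents rather than to substantive co-occurrence.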
The abstract group deconstructs the documents not by topic but by different residual aspects of news articles, corresponding in this case perhaps to the what and when of journalism 101’s five Ws. A topic-topic, like “Mother Teresa,” has a relatively small number of documents with a relatively high proportion of content attributed to the topic and a relatively large number of documents with very small proportion (relatively low mean, high variance, high skew, high kurtosis). An aspect-topic has a relatively large number of documents with a moderate proportion (relatively high mean, low variance, low skew, low kurtosis). Here, the “when” topic is largest in 15 percent of documents, second largest in 57 percent of documents, and third largest in 17 percent.
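The distinction can be made operational with a few summary statistics per topic. The sketch below assumes a fitted document-topic matrix `theta` (rows are documents, columns are topic proportions summing to one); the function name `topic_profile` and the small matrix are hypothetical and chosen only to show the two shapes.

```python
import numpy as np

def topic_profile(theta, k):
    """Summarize how topic k's proportions are distributed across documents."""
    x = theta[:, k]
    mean, var = x.mean(), x.var()
    z = (x - mean) / np.sqrt(var)
    skew = (z ** 3).mean()            # third standardized moment
    kurt = (z ** 4).mean() - 3.0      # excess kurtosis
    # Rank of topic k within each document (0 = largest proportion).
    order = np.argsort(-theta, axis=1)
    rank = np.argmax(order == k, axis=1)
    rank_share = [float((rank == r).mean()) for r in range(3)]
    return {"mean": float(mean), "var": float(var), "skew": float(skew),
            "kurt": float(kurt), "rank_share": rank_share}

# Hypothetical proportions: column 0 behaves like a topic-topic,
# column 1 like an aspect-topic, column 2 absorbs the remainder.
theta = np.array([
    [0.80, 0.20, 0.00],
    [0.02, 0.30, 0.68],
    [0.01, 0.25, 0.74],
    [0.03, 0.35, 0.62],
    [0.00, 0.30, 0.70],
    [0.04, 0.28, 0.68],
])
topic_like = topic_profile(theta, 0)
aspect_like = topic_profile(theta, 1)
```

In this toy matrix the first column has the lower mean and the higher variance, skew, and kurtosis, while the second column is the second-largest topic in every document, which is the signature described above.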
We are not given enough information to know whether GP’s model contains such topics—and the restriction to nouns and adjectives obscures the suspects—but I have never seen a topic model up close that did not. This matters for at least two reasons. First, with documents this short, many documents with essentially no content of interest will appear to be “about” such residual topics. Second, GP “explicitly exclude” 38 of their 70 topics. If documents tend to have a residual topic that is the second or third largest in proportion before exclusion, a high number will appear to be about the residual topic once their highest proportion topic is excluded.
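The second concern can be illustrated with a toy example. This is a sketch under an assumption of mine, namely that each document is assigned to its largest remaining topic after exclusion; GP do not describe the assignment rule in this detail, and the matrix and the name `largest_topic` are hypothetical.

```python
import numpy as np

def largest_topic(theta, excluded=()):
    """Index of each document's largest topic among those not excluded."""
    excluded = set(excluded)
    keep = np.array([k for k in range(theta.shape[1]) if k not in excluded])
    return keep[theta[:, keep].argmax(axis=1)]

# Hypothetical document-topic proportions; column 3 plays the residual topic.
theta = np.array([
    [0.60, 0.05, 0.05, 0.30],
    [0.10, 0.55, 0.05, 0.30],
    [0.50, 0.10, 0.15, 0.25],
])
before = largest_topic(theta)                # no topics excluded
after = largest_topic(theta, excluded=[0])   # substantive topic 0 excluded
```

Here the two documents whose largest topic was excluded are both reassigned to the residual topic, even though neither was mainly "about" it.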
4. Meaning in the Semantic Triplets
GP provide an inconsistent description of how triplets enter their data set, leaving it unclear exactly what meaning is being captured. They assert that responsibilit* is always the direct object of the target sentence, but they provide at least one example where it is not: “failed to live up to their fiduciary responsibilities.” 3 They assert that they extract for each triplet its entity modifiers and object modifiers, but neither of these is a term defined in either the current or former dependency parsing specifications or output by Stanford CoreNLP, 4 and their software labels these “ObjectModifier” and “SecondObject,” respectively. It seems the operationalization is to extract any modifiers of responsibilit* with dependency parse codes amod (adjectival modifiers; e.g., total responsibility), poss (possessive modifiers; e.g., manager’s responsibility), compound (compound noun premodifiers; e.g., business responsibility), and nmod (noun modifiers such as attributes through a preposition; e.g., responsibility to the public).
The details remain unclear. Consider two examples given for the triplet organization-have-management: We feel the [Organization] that profits from the sales of these products should have the financial responsibility for proper management and disposal. This combines with domestic and Canadian operations the overseas subsidiaries for which the [Organization] has management responsibility.
Does extraction operate through conjunctions, yielding also organization-have-disposal in the first example? Does it include relative pronouns, yielding also organization-have-which in the second, or is there coreference resolution to organization-have-subsidiaries? Stepping back, GP state obliquely that there is one triplet per sentence. If so, why would their method extract organization-have-management in the second case and not organization-have-financial in the first?
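Writing the inferred operationalization down makes these open questions concrete. The sketch below is my reading, not GP's software: it assumes a parse is available as (head, relation, dependent) edges, uses hand-written edges for the first example sentence rather than actual parser output, and exposes conjunction handling as an explicit, hypothetical `follow_conj` switch.

```python
# Modifier relations named in the text as the apparent operationalization.
MODIFIER_RELATIONS = {"amod", "poss", "compound", "nmod"}

def extract_triplets(edges, follow_conj=False):
    """Return (subject, verb, modifier) triplets for responsibilit* objects."""
    triplets = []
    for verb, rel, obj in edges:
        # Only sentences where responsibilit* is the direct object of a verb.
        if rel != "dobj" or not obj.startswith("responsibilit"):
            continue
        subjects = [d for h, r, d in edges if h == verb and r == "nsubj"]
        modifiers = [d for h, r, d in edges
                     if h == obj and r in MODIFIER_RELATIONS]
        if follow_conj:
            # Optionally add words conjoined with an extracted modifier.
            modifiers += [d for h, r, d in edges
                          if r == "conj" and h in modifiers]
        triplets.extend((s, verb, m) for s in subjects for m in modifiers)
    return triplets

# Hand-written, simplified edges (an assumption, not parser output) for:
# "... the [Organization] ... should have the financial responsibility
#  for proper management and disposal."
edges = [
    ("have", "nsubj", "organization"),
    ("have", "dobj", "responsibility"),
    ("responsibility", "amod", "financial"),
    ("responsibility", "nmod", "management"),
    ("management", "amod", "proper"),
    ("management", "conj", "disposal"),
]
without_conj = extract_triplets(edges)
with_conj = extract_triplets(edges, follow_conj=True)
```

Under this reading the sentence yields organization-have-financial as well as organization-have-management, and organization-have-disposal appears only if conjunctions are followed, so a one-triplet-per-sentence rule would need some further, unstated selection step.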
5. Meaning in the Distribution of Triplets Over Time
From their Figure 5, GP draw the conclusion that macro-framing of responsibility changed significantly and suddenly over time. For example, over half of the triplets that appear in era one fail to appear in era two but reappear in era three. It seems to me less plausible that this represents a sudden shift in macro-framing followed by a sudden shift back than that, because many specific triplets are rare, the samples in era one and era two are simply too small to support such conclusions.
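The sampling explanation can be made concrete with a back-of-the-envelope calculation. This sketch assumes that triplet occurrences are independent draws and that a triplet's underlying probability is constant across eras (that is, no change in framing at all); the probability and the era sample sizes below are made-up inputs, since GP report neither.

```python
def p_seen(p, n):
    """Probability that a triplet with per-draw probability p
    occurs at least once in n independent draws."""
    return 1.0 - (1.0 - p) ** n

def p_absent_then_back(p, n2, n3):
    """Given a triplet was seen in era one, the probability that it is
    absent in era two and seen again in era three, with p unchanged."""
    return (1.0 - p_seen(p, n2)) * p_seen(p, n3)

# Hypothetical inputs: a rare triplet, a small era two, a large era three.
p = 0.001
n2, n3 = 300, 4000
prob = p_absent_then_back(p, n2, n3)
```

With these made-up inputs the probability exceeds one half, so a disappear-and-reappear pattern of the size GP report is what a rare triplet with entirely stable usage could be expected to show.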
We are given no descriptive statistics about the triplets beyond the information in Figure 5. From this we can determine there are 790 unique triplets. We do not know how many triplets were observed in total. We do know that 55 triplets occurred at least three times (at least once in each era), and 223 occurred at least twice (at least once in each of two eras). As many as 512 triplets might have occurred only once: as many as 36 in era one, 16 in era two, and 460 in era three.
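The arithmetic behind these bounds is simple enough to write out. The sketch below uses only the counts stated above; the variable names are mine, and the final quantity is a lower bound on total occurrences implied by those counts, not a figure GP report.

```python
# Counts read from GP's Figure 5, as stated in the text.
unique_triplets = 790
in_three_eras = 55    # seen in every era, so at least three occurrences each
in_two_eras = 223     # seen in exactly two eras, so at least two occurrences each

# Triplets seen in only one era are the only ones that can be singletons.
single_era = unique_triplets - in_three_eras - in_two_eras
single_by_era = {"era one": 36, "era two": 16, "era three": 460}

# Fewest total occurrences consistent with these counts.
min_total_occurrences = 3 * in_three_eras + 2 * in_two_eras + single_era
```

The per-era figures sum to the 512 single-era triplets, and the counts imply at least 1,123 observed triplet occurrences in total.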
These bounds are consistent with what we would estimate from a Zipf’s law sort of assumption. Using the patterns observed by Booth (1967; again, treating these as we would individual word frequencies), if
6. Meaning in Context Association of Triplets Over Time
This logic carries through to skepticism about the meaning we can attribute to GP’s Figure 6, in which triplets appear to change their context across eras. Most of the triplets are rare. Some triplets that occur in era three and in a previous era will have occurred many times, but most are in the long tail and will have occurred only once or twice in the earlier eras. If so, their context association in the earlier era is based on one or two very short documents, which have a high likelihood of being associated with the largest semantic groups (likely management) or residual topics, especially after exclusion of the majority of other topics.
So, perhaps these triplets did not change association but rather were assigned a relatively capricious association as a baseline for comparison. One face validity check we have for this argument is that, if true, then these switcher triplets will be associated with a more logical semantic group in era three (due to the much larger sample size) than in their prior association. This seems to be the case. The switchers into the shareholder issue semantic group include fiduciary responsibility to shareholders and stockholders with respect to a fund, portfolio, value that a person can perform, execute, exercise, honor, violate. The switchers into the misconduct and risk group include legal responsibility for failure that one can share, assume, accept, carry, bear. The switchers into the social issue group include moral, social responsibility to society, community, consumers, workers that one can accept, shoulder, recognize. 6
7. Conclusion
Interest in layered, dynamic, context-dependent semantics is both appropriate and consistent with contemporary NLP research. The most important development in NLP in the past decade is probably research around word embeddings that learn “semantic vector spaces” based on the words that appear within a certain “context window” of surrounding words. Different aspects of word semantics are captured by defining context differently. The most dramatic breakthrough in language modeling (for things like dependency parsing) in the past few years has been the advent of long short-term memory (LSTM) neural networks, which model text based on both short-range and long-range dependencies. Jurafsky and Martin (2018) provide an accessible introduction to these and related models that might offer alternative approaches to extract such semantic complexities.
I applaud the concept and the effort here. But I am skeptical that the end result is what GP think it is, and I am frustrated that we do not have basic things like descriptive statistics and content validity checks that would help allay that skepticism. It has been demonstrated (perhaps most dramatically by Lazer and colleagues 2014) that social science offers validation practices and standards that can identify when and why black-box computational tools have misled us. No one article can do everything, of course, but we could have more confidence in GP’s results if more such validation efforts had been pursued and reported here.
