Measuring Abstract Mind-Sets Through Syntax: Automating the Linguistic Category Model

Abstract

Abstraction in language has critical implications for memory, judgment, and learning and can provide an important window into a person’s cognitive abstraction level. The linguistic category model (LCM) provides one well-validated, human-coded approach to quantifying linguistic abstraction. In this article, we leverage the LCM to construct the Syntax-LCM, a computer-automated method which quantifies syntax use that indicates abstraction levels. We test the Syntax-LCM’s accuracy for approximating hand-coded LCM scores and validate that it differentiates between text intended for a distal or proximal message recipient (previously linked with shifts in abstraction). We also consider existing automated methods for quantifying linguistic abstraction and find that the Syntax-LCM most consistently approximates LCM scores across contexts. We discuss practical and theoretical implications of these findings.

Keywords

LCM, construal-level theory, text analysis, syntax, language abstraction

Abstraction is a critical construct that influences outcomes such as learning, memory, judgment, self-regulation, and behavior (for a review, see Burgoon, Henderson, & Markman, 2013). A key subset of abstraction research focuses on abstraction in language. Decades of work suggest that abstract language impacts processing and memory (e.g., Paivio, 1991; Schwanenflugel, Harnishfeger, & Stowe, 1988) as well as perceptions of affective connotation (Kousta, Vigliocco, Vinson, Andrews, & Del Campo, 2011), informativeness and enduringness (Semin & Fiedler, 1991), truthfulness (Hansen & Wänke, 2010), and social evaluation of a communicator (Wakslak, Smith, & Han, 2014). Separately, linguistic abstraction is also a useful window into cognitive abstraction; because mental abstraction affects language choice, a key method for differentiating between concrete and abstract cognition is considering the words a person uses to describe his or her thinking.

Prior literature has considered varied approaches to conceptualizing and measuring linguistic abstraction. In the current article, we present a syntax-based automated method for measuring abstraction in language that builds on Semin and Fiedler’s (1988) theoretically grounded and well-validated approach, the linguistic category model (LCM). We start by describing the LCM in more detail, along with an alternative approach (Brysbaert, Warriner, & Kuperman, 2014) that is readily automatable. We then introduce the Syntax-LCM, our syntax-based, automated approach. In three studies, we validate this method and compare it to alternative automated methods for coding linguistic abstraction.

The LCM: Measuring Abstraction in Language

The LCM is a theoretical framework that considers the social–cognitive functions of linguistic categories (Semin & Fiedler, 1988). This model has been used widely in research oriented toward better understanding the impact of language on social cognition (e.g., Semin & Fiedler, 1991) including research on individual-level attribution (Semin & Fiedler, 1989) and constructive memory biases (Fiedler, Semin, & Bolten, 1989). Linguistic intergroup bias research (e.g., Maass, Salvi, Arcuri, & Semin, 1989) serves as an exemplar of the LCM’s social–cognitive approach to language and its explanatory power, suggesting that people are biased toward using different levels of linguistic abstraction when describing positive in-group (rather than out-group) behaviors and negative out-group (rather than in-group) behaviors and that this plays a role in stereotype perpetuation. The LCM has been fruitfully applied to both short lines of language (e.g., Maass et al., 1989) and lengthier texts (e.g., Schmid & Fiedler, 1996).

The LCM distinguishes between four linguistic categories which vary in their degree of abstraction. Adjectives (along with adverbs and noun-modifiers) form the most abstract linguistic category, as they emphasize decontextualized, invariant features of an object or event. By comparison, verbs are more concrete than adjectives because they provide specific contextual information that changes over time. Within verb classes, the LCM distinguishes between three verb types. Descriptive action verbs (DAVs) are most concrete, describing an observable action with a clear beginning and end that is grounded in a physical body part (e.g., eating, walking). Interpretive action verbs (IAVs) are actions with a clear beginning and end but that involve some amount of interpretation (e.g., helping, exercising). IAVs require interpretation and are thereby more abstract than DAVs. Finally, state verbs (SVs) describe enduring mental or emotional states (e.g., love, admire); these are more abstract than IAVs and DAVs but less abstract than adjectives.

The LCM has been successfully applied for decades, leading to a more refined understanding of language’s social functions (Semin, 2011). Abstraction is one central component in this work, along with inductive inference, on which the linguistic categories also vary. The LCM was originally applied to better understand the implications of describing behaviors using different linguistic categories (e.g., how does person A perceive behavior B given it is described using linguistic category C) or to better understand how behaviors are likely to be described (e.g., how does person X describe behavior Y of person Z). More recently, it has also been used to more generally quantify the abstraction level of a passage of text (e.g., Fujita et al., 2006; Joshi & Wakslak, 2014). Many researchers using LCM for this purpose are conducting research that is informed by construal level theory (CLT; Trope & Liberman, 2010), a theoretical perspective that has argued for a link between abstract mental representation and psychological distance.

An important challenge for CLT and other theories of cognitive abstraction is how to measure an individual’s abstraction level. Although researchers have developed several constrained tasks that measure abstraction in regard to a specific set of stimuli (see Burgoon et al., 2013), they also hope to gauge individuals’ level of cognitive abstraction in more naturalistic contexts. One fruitful approach is to approximate a speaker’s level of cognitive abstraction by quantifying his or her level of linguistic abstraction; this allows researchers to use less constrained tasks and leverage real-world archival data such as online reviews or social media posts. CLT researchers interested in quantifying linguistic abstraction have turned to the LCM because of its theoretically grounded, well-validated history. Critically, the LCM’s conceptualization of abstraction also fits well with how CLT conceptualizes this construct, with both perspectives emphasizing that abstract representations focus on characteristics that are relevant across contexts (i.e., enduring characteristics).

Automated Methods for Coding Abstract Mind-Sets

Despite the LCM’s attractiveness as a method for quantifying linguistic abstraction, this approach can be costly to implement for large amounts of text. As with any hand-coding scheme, coders must be trained, and coding large amounts of text by hand is inherently time-consuming. To bypass these constraints, researchers are increasingly turning to automated methods for coding abstraction in larger corpora (longer passages or a higher volume of short messages; Bhatia & Walasek, 2016; Joshi, Wakslak, & Huang, 2018; Reyt, Wiesenfeld, & Trope, 2016; Snefjella & Kuperman, 2015). For example, Snefjella and Kuperman (2015) coded large corpora (e.g., New York Times articles, Twitter data) for linguistic abstraction and correlated this with communicators’ distance from the event they were describing to explore the CLT-posited link between distance and abstraction in naturalistic contexts. This “big data” approach was made possible through automated coding, as hand-coding thousands of articles or millions of tweets would not be feasible in practice.

Snefjella and Kuperman’s automated abstraction coding approach (see also Bhatia & Walasek, 2016) relies on research conducted by Brysbaert, Warriner, and Kuperman (2014; henceforth BWK), who used crowdsourcing to collect concreteness ratings for 40,000 English word lemmas, including verbs, nouns, prepositions, adjectives, adverbs, and single-letter words (e.g., “a”). Raters were instructed that concrete words are those experienced directly by one of the five senses, while abstract words cannot be experienced directly but are rather defined by other words, with many words falling in between the two extremes. Each word was then judged by 25–30 raters on a 5-point scale (1 = abstract to 5 = concrete). These ratings can be used to generate an overall concreteness score via a weighted word-count approach: Each word in the to-be-coded text that appears in the BWK data set is weighted by its concreteness rating, the values are summed, and the sum is divided by the total number of counted words.
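The weighted word-count computation can be illustrated with a minimal Python sketch. The ratings in `toy_ratings` are made-up placeholders rather than actual BWK norms, and the function name and the naive whitespace tokenizer are assumptions for illustration only; an actual analysis would look words up in the published BWK data set.

```python
def bwk_concreteness(text, ratings):
    """Mean concreteness of the words in `text` that appear in `ratings`.

    Returns None when no word in the text has a rating.
    """
    # Naive tokenizer (an assumption): split on whitespace, strip
    # surrounding punctuation, lowercase.
    tokens = [t.strip(".,;:!?\"'()").lower() for t in text.split()]
    rated = [ratings[t] for t in tokens if t in ratings]
    if not rated:
        return None
    # Sum of ratings divided by the number of counted (rated) words.
    return sum(rated) / len(rated)


# Hypothetical ratings on the 1 (abstract) to 5 (concrete) scale.
toy_ratings = {"dog": 5.0, "idea": 1.5, "runs": 3.5}

# "the" is absent from the toy lexicon, so only "dog" and "runs" count.
score = bwk_concreteness("The dog runs.", toy_ratings)
print(score)
```

Words missing from the lexicon are simply skipped, which mirrors the description above: only words that appear in the rating data set enter the numerator and the denominator.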

The overall scores generated by the BWK and the LCM methods are likely to correlate in many contexts, given that the LCM’s categories broadly vary along the experience-based/language-based continuum emphasized in the BWK rater instructions; for example, the LCM’s most concrete category, DAVs, involves a physical referent and thereby is more experience-based than other verb forms and adjectives. However, there are several notable differences between the two approaches that may also lead to divergence, depending on the specific types of words likely to be used in that context.

First, the ratings in the BWK data set reflect a lay understanding of abstraction, guided by the experience-based versus language-based distinction provided in the initial rating instructions; this differs from the LCM, where linguistic categories are distinguished conceptually and argued to vary in abstraction. Second, BWK ratings are of individual, decontextualized words; ratings of sentences thus reflect the average concreteness rating of the individual words used in that sentence and do not consider the way these words are being used in conjunction with one another. In comparison, the LCM primarily considers how a word is being used in a sentence when generating an abstraction weight (e.g., when a noun is used to describe an object, it is coded as an adjective; Coenen, Hedebouw, & Semin, 2006).

Third, SVs are highly language-based; thus, while the LCM identifies this category as less abstract than adjectives, their BWK ratings are typically more abstract than many adjectives. This relates to a fourth, larger issue: Whereas the LCM does not distinguish abstractness within linguistic category, BWK ratings do. For example, adjectives in BWK ratings can be abstract (e.g., “ethical,” rated 1.3) or concrete (e.g., “bald,” rated 4.69). Fifth, the LCM does not code words that do not fall into its four linguistic categories, whereas the BWK method codes a wider range of words, including articles, prepositions, and pronouns.

To explicate distinctions between the two methods, consider the following sentence: “She is a thief.” A researcher using the LCM would code “thief” (a noun) as an adjective, since it describes the subject “she”; the sentence would therefore have the same LCM score as the sentence “She is unethical.” In contrast, a researcher using the BWK method would identify the former sentence as more concrete (M = 2.695) than the latter sentence (M = 2.19). Now, consider a similar sentence containing a DAV: “She stole something.” A researcher using the LCM would code this sentence as more concrete than “She is a thief” but would code the two quite similarly using the BWK method (M = 2.75 vs 2.695).¹

A different approach to automating abstraction coding is to mimic the LCM by creating dictionaries of the LCM verb categories. Seih, Beier, and Pennebaker (2016) attempted this by collecting 1,800 commonly used verbs and sorting them into the three LCM verb categories using human coders and existing General Inquirer dictionaries (Stone, Dunphy, & Smith, 1966). Then, they applied part-of-speech tagging (to identify adjectives and verbs) and the Linguistic Inquiry and Word Count program to create LIWC-LCM scores (Pennebaker, Booth, Boyd, & Francis, 2015). Using this method, they found higher abstract language scores for participants who wrote about distal (rather than proximal) events.

While the BWK and LIWC-LCM methods offer the benefit of decreased coding costs, neither has been rigorously compared to human-generated LCM scores. As described above, the BWK method is an inherently different approach, and further understanding of how it relates to the well-established LCM is important, given researchers may use this method for pragmatic, rather than theoretically derived, reasons. The LIWC-LCM would also benefit from further investigation, as it was specifically designed to approximate the LCM but does not consider the larger sentence context integral to many LCM decision rules (instead coding words in isolation). Further, the gap between LIWC-LCM and the context-based rules of the LCM offers the potential for a third approach to LCM automation that incorporates the larger sentence context.

Syntax-LCM: Automating the LCM Using Syntax

To bridge the gap between existing automation methods and the LCM, we developed the Syntax-LCM, a method that quantifies both part-of-speech tags and dependency tree features that indicate abstraction levels. Since the LCM considers syntactical organization (e.g., copulas and clausal nouns), we hypothesized that quantifying both feature types may lead to more accurate approximations of LCM scores. In this method, we combine the LIWC-LCM's verb lists with novel syntactic features to create an abstraction score that captures both the context and specific verb word choices integral to the LCM.

In what follows, we describe the Syntax-LCM method development and present three studies validating its effectiveness. In Study 1, we test the predictive accuracy of Syntax-LCM for approximating hand-coded LCM (hLCM) scores and its effectiveness for differentiating between experimental conditions designed to elicit abstract and concrete sentences using a corpus collected by an affiliated lab. In Study 2, we test its generalizability using a data set hand-coded by unaffiliated researchers. Finally, in Study 3, we examine whether the Syntax-LCM accurately predicts hand-coded scores for Twitter data, a major source of textual data for social scientists. Materials, data, and R scripts are available at https://osf.io/hsnmq/?view_only=8e33ec6a2c6644f58a0437bc95d4d2e5. For all studies, we report how we determined sample size, all data exclusions (if any), and all measures used for comparison.

Syntax-LCM Method Development

To create the Syntax-LCM feature dictionaries, we first selected an existing, open-ended response data set (henceforth referred to as development corpus) collected by the first author’s research lab. The corpus is comprised of 256 undergraduate psychology participants’ responses to two writing prompts. In the first prompt, participants wrote about the importance of being loyal or fair to other students; in the second, they wrote about another student’s work quality. Participants generated a total of 1,439 sentences in response to these prompts (Prompt 1 = 973 sentences; Prompt 2 = 466 sentences), and we used each sentence as the unit of analysis for method creation.

Establishing Human-Coded LCM Abstraction Scores

We began by hand-coding each sentence using the LCM manual (Coenen et al., 2006). During the course of coder training, we corresponded extensively with Gün Semin, one of the model’s developers and manual authors, to develop an LCM coding addendum clarifying rules that our coders were uncertain about. Two independent coders used this addendum in conjunction with the LCM manual to hand-code the corpus for DAV, IAV, SV, and adjectives (ADJ) categories, resolving disagreements through discussion (average intercoder reliability k = .84).

Next, we computed an hLCM abstraction score for each sentence using the LCM manual equation:

\frac{(DAV \times 1) + (IAV \times 2) + (SV \times 3) + (ADJ \times 4)}{DAV + IAV + SV + ADJ} .

In this equation, DAV, IAV, SV, and ADJ represent the number of times each of these features occurred in the text; these counts are assigned a weight based on their theorized abstraction level, with concrete verbs (DAVs) receiving the lowest weight and ADJ the highest. The weighted sum is divided by the number of coded items to generate abstraction scores ranging from 1 (concrete) to 4 (abstract). We use hLCM scores as the criterion for comparison, given our goal of approximating this method and interest in comparing automated methods with the LCM.
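The LCM manual equation is a weighted average of category counts, which a short Python sketch makes concrete. The function name, the dictionary-based input, and the choice to return `None` for text without any coded items are assumptions for illustration; the counts in the example are hypothetical.

```python
# Abstraction weights from the LCM manual equation.
LCM_WEIGHTS = {"DAV": 1, "IAV": 2, "SV": 3, "ADJ": 4}


def lcm_score(counts):
    """Weighted LCM abstraction score from per-category counts.

    `counts` maps category names to how often each was coded in the text.
    Returns None when nothing was coded, since the score is undefined
    (division by zero) in that case.
    """
    total = sum(counts.get(cat, 0) for cat in LCM_WEIGHTS)
    if total == 0:
        return None
    weighted = sum(w * counts.get(cat, 0) for cat, w in LCM_WEIGHTS.items())
    return weighted / total


# Hypothetical sentence with one DAV, one SV, and two adjectives:
# (1*1 + 1*3 + 2*4) / 4 = 3.0
example = lcm_score({"DAV": 1, "IAV": 0, "SV": 1, "ADJ": 2})
print(example)
```

A text containing only DAVs scores 1 and a text containing only adjectives scores 4, which gives the 1 (concrete) to 4 (abstract) range described above.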

Syntax-LCM Method

We developed the Syntax-LCM method using three steps.

Step 1: Syntax feature generation

First, we created the Parsed Corpus R function that parses each sentence and extracts its syntactic part-of-speech (e.g., noun, adjective) and dependency parse tree features (e.g., copula, clausal subject) using the coreNLPR version 3.4.2 package (Arnold & Tilden, 2016). This step results in a syntactic representation of each sentence that can be analyzed in place of the sentence itself (see Supplemental Material for an in-depth explanation of these features).

Step 2: Syntax-LCM dictionary creation

Next, we created the “concrete” and “abstract” syntax dictionary lists. Whereas typical dictionaries are comprised of lists of words related to a theme, these dictionaries instead are comprised of syntactic and dependency tree features related to either abstract or concrete language.

To identify which syntactic features distinguish reliably between abstract and concrete sentences, we created two text groupings, one containing the most concrete third of sentences in the corpus and one containing the most abstract third. Then, we conducted a binary logistic regression with all nonpunctuation-based syntactic features predicting group membership, using 10-fold cross-validation.² The classification algorithm achieved 83% cross-validated accuracy (83% precision, 82% recall, and an F₁ score of 0.83), demonstrating the effectiveness of syntactic and dependency features for distinguishing between abstract and concrete sentences.
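The grouping step can be sketched in Python as follows. This is an illustration under stated assumptions, not the authors' R code: the function name is hypothetical, sentences are ranked by their hLCM score, ties are broken by sort order, and the middle third is simply discarded. The example sentences and scores are placeholders.

```python
def split_extreme_thirds(scored_sentences):
    """Split (sentence, hlcm_score) pairs into extreme groups.

    Returns (concrete, abstract): the lowest-scoring third and the
    highest-scoring third. The middle third is dropped.
    """
    ranked = sorted(scored_sentences, key=lambda pair: pair[1])
    k = len(ranked) // 3  # group size; any remainder stays in the middle
    concrete = ranked[:k]
    abstract = ranked[len(ranked) - k:]
    return concrete, abstract


# Placeholder data: sentence identifiers with hypothetical hLCM scores.
data = [("s1", 1.0), ("s2", 4.0), ("s3", 2.5),
        ("s4", 3.5), ("s5", 1.5), ("s6", 3.0)]

concrete, abstract = split_extreme_thirds(data)
print([s for s, _ in concrete])  # lowest-scoring third
print([s for s, _ in abstract])  # highest-scoring third
```

Group membership (concrete vs. abstract) then serves as the binary outcome for the classifier, with the syntactic features of each sentence as predictors.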

We compared each feature’s logistic regression coefficients across the 10 cross-validation folds and retained features whose coefficients were significant at the p < .05 level and had absolute weights greater than .05 in every fold. This resulted in a total of 22 features, split evenly between the abstract feature dictionary (six adjective-related features, five verb-related features) and the concrete feature dictionary (see Table 1).
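The selection rule can be expressed as a small Python filter. This is a sketch under assumptions the article does not spell out: the abstract group is taken to be coded 1, so positive coefficients point to the abstract dictionary, and features whose sign flips between folds are dropped. The function name, the two-fold example, the feature label `det`, and all numbers are hypothetical.

```python
def select_features(fold_results, alpha=0.05, min_abs_weight=0.05):
    """Keep features that pass both thresholds in every fold.

    `fold_results` is a list with one dict per fold, mapping a feature
    name to a (coefficient, p_value) pair.
    Returns (abstract_features, concrete_features).
    """
    abstract, concrete = [], []
    for name in sorted(fold_results[0]):
        stats = [fold.get(name) for fold in fold_results]
        if any(s is None for s in stats):
            continue  # feature missing from at least one fold
        if not all(p < alpha and abs(coef) > min_abs_weight
                   for coef, p in stats):
            continue  # fails significance or weight threshold somewhere
        signs = {coef > 0 for coef, _ in stats}
        if len(signs) > 1:
            continue  # inconsistent direction across folds (assumption)
        if signs.pop():
            abstract.append(name)   # positive weight: abstract group
        else:
            concrete.append(name)   # negative weight: concrete group
    return abstract, concrete


# Two hypothetical folds (the article used ten).
folds = [
    {"cop": (0.40, 0.001), "nsubj": (-0.30, 0.010), "det": (0.02, 0.600)},
    {"cop": (0.35, 0.003), "nsubj": (-0.25, 0.020), "det": (0.10, 0.040)},
]
abstract_feats, concrete_feats = select_features(folds)
print(abstract_feats, concrete_feats)
```

Requiring the thresholds to hold in every fold is a conservative rule: a feature that is strong in most folds but weak in one is excluded.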

Table 1.

Syntax-LCM Features List.

Abstract Features	Concrete Features
LCM-specified features	Theory-consistent features
amod: adjectival modifier	appos: appositional modifier
auxpass: passive auxiliary	advcl: adverbial clause modifier
cop: copula	case: case marking
compound: noun compound	conj: conjunct
mark: subordinate clause marker	csubj: clausal subject
nmod:npmod: noun as adverb modifier	discourse: discourse element
xcomp: clausal complement	mwe: multiword expression
expl: expletive	nnps: proper plural noun
Theory-consistent features	nsubj: nominal subject
vbn: past participle verb	nummod: numeric modifier
vbz: 3rd person present tense verb	vbg: present participle verb

Notably, these features mirrored both LCM manual coding rules (e.g., copulas, adjectives) and novel syntactic features not directly captured in the LCM but consistent with theories of abstraction such as CLT; third-person and past tense verbs signified abstract sentences, whereas first-person and present verbs indicated concrete sentences. These features parallel CLT research that finds objects/events with greater temporal or physical distance are represented more abstractly, providing new evidence that findings core to CLT are identifiable in language.

Step 3: Computing Syntax-LCM abstraction scores

Finally, we created the SyntaxLCMR function for calculating Syntax-LCM scores. The Syntax-LCM function takes the syntactic representations generated in Step 1, imports the LIWC-LCM verb dictionaries (Seih, Beier, & Pennebaker, 2016) and the new syntax dictionaries, and counts the total number of features present in each sentence. Then, it uses the following equation (where SADJs and SVERBs stand for syntax adjectives and syntax verbs) to apply the weights from the LCM manual to the frequency counts for each category and to calculate a Syntax-LCM abstraction score, ranging from 1 (concrete) to 4 (abstract):

\frac{(\text{abstract SADJs} \times 4) + (SVs \times 3) + ((IAVs + \text{abstract SVERBs}) \times 2) + ((DAVs + \text{concrete S}) \times 1)}{\text{abstract SADJs} + SVs + IAVs + \text{abstract SVERBs} + DAVs + \text{concrete S}} .
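The scoring equation can be sketched in Python as a weighted average over six counts. The function and parameter names are assumptions for illustration and this is not the authors' R implementation; the counts would come from matching a parsed sentence against the LIWC-LCM verb dictionaries and the syntax dictionaries, a step not shown here. The example counts are hypothetical.

```python
def syntax_lcm_score(abstract_sadj, sv, iav, abstract_sverb, dav, concrete_s):
    """Syntax-LCM abstraction score from feature counts.

    Weights follow the equation above: abstract syntax adjectives 4,
    state verbs 3, IAVs and abstract syntax verbs 2, DAVs and concrete
    syntax features 1. Returns None when no feature was counted.
    """
    total = abstract_sadj + sv + iav + abstract_sverb + dav + concrete_s
    if total == 0:
        return None
    weighted = (abstract_sadj * 4
                + sv * 3
                + (iav + abstract_sverb) * 2
                + (dav + concrete_s) * 1)
    return weighted / total


# Hypothetical counts: (2*4 + 1*3 + (0+1)*2 + (1+3)*1) / 8 = 2.125
example = syntax_lcm_score(abstract_sadj=2, sv=1, iav=0,
                           abstract_sverb=1, dav=1, concrete_s=3)
print(example)
```

Because every counted feature carries a weight between 1 and 4, the score is bounded by the same 1 to 4 range as the hand-coded LCM score.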

Study 1

In Study 1, we selected a corpus to test the Syntax-LCM method’s validity in three ways. First, the corpus was generated by a unique population in response to novel prompts in a different research lab (compared to the development corpus) to test the generalizability of the method. Second, the corpus was selected from a study that manipulated audience distance, known to influence linguistic abstraction, to test its efficacy for differentiating between conditions known to promote more abstract or concrete communication. Finally, we compare the Syntax-LCM’s accuracy at approximating hLCM scores to that of two existing automated methods (BWK and LIWC-LCM).

Method

Data set

An affiliated research lab asked 71 business school students to describe a day in the life of a [University-Name] student in writing (data published as Study 2 of Yip-Bannicq, Kalkstein, & Trope, 2019). They were told responses would be sent to a prospective student located in either a close or distal location (close audience, concrete condition = 275 sentences; far audience, abstract condition = 225 sentences; Corpus N = 500 sentences).

Procedure

Three independent coders generated hLCM scores using the LCM manual (average intercoder reliability k = .89). We then applied the respective automated methods to calculate Syntax-LCM, BWK, and LIWC-LCM scores (details below). Finally, we compared the variance accounted for by each method predicting the hand-coded scores and their efficacy for differentiating between distance conditions.

BWK scores

We calculated BWK scores using the weighted word count algorithm described earlier (Brysbaert et al., 2014) and reverse-scored ratings so higher scores reflect more abstract sentences for cross-method consistency.
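The reverse-scoring formula is not stated here; a conventional choice for a bounded rating scale, assumed in this Python sketch, is to subtract each value from the sum of the scale endpoints (6 minus the rating on the 1 to 5 BWK scale). The function name and the example value are hypothetical.

```python
def reverse_score(concreteness, scale_min=1.0, scale_max=5.0):
    """Flip a concreteness value so that higher means more abstract.

    Assumes the conventional reversal for a bounded scale:
    reversed = scale_min + scale_max - original.
    """
    return scale_min + scale_max - concreteness


# A hypothetical sentence with mean concreteness 4.25 becomes 1.75.
abstractness = reverse_score(4.25)
print(abstractness)
```

Because the transformation is linear, it flips the sign of correlations with other measures but leaves their magnitude unchanged.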

LIWC-LCM

We calculated LIWC-LCM scores following procedures detailed by Seih, Beier, and Pennebaker (2016). We used the coreNLP tagger to identify parts-of-speech, applied the LIWC-LCM verb dictionary to count and weight verbs in each category (DAVs, IAVs, and SVs), summed the features, and divided the sum by the total feature count (for a full explanation, see Seih et al., 2016).

Results and Discussion

We began by conducting Pearson’s correlation analyses of the relationship between Syntax-LCM, LIWC-LCM, BWK, and hLCM scores (see Table 2, below diagonal). Results indicated that Syntax-LCM scores were more strongly correlated with hLCM scores than were BWK ratings, Z = 7.98, p < .001, and LIWC-LCM scores, Z = 5.26, p < .001.

Table 2.

Study 1 and Study 2 Pearson’s R Correlations Between Abstraction Scores.

Method	hLCM	Syntax-LCM	BWK	LIWC-LCM
hLCM		0.43(.001)	0.18(.001)	0.28(.001)
Syntax-LCM	0.61(.001)		0.16(.001)	0.48(.001)
BWK	0.26(.001)	0.31(.001)		0.10(.028)
LIWC-LCM	0.38(.001)	0.26(.001)	0.00(.841)

Note. p values in parentheses. Study 1 correlations below diagonal. Study 2 correlations above diagonal. hLCM = hand-coded LCM; BWK = Brysbaert, Warriner, and Kuperman; LIWC-LCM = Linguistic Inquiry and Word Count–Linguistic Category Model.
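Comparing two correlations that share a variable (here, hLCM) calls for a test for dependent correlations. The article does not state which test was used, so the Python sketch below assumes one common option, the Z test of Meng, Rosenthal, and Rubin (1992); the function name and the sample size of 500 (the Study 1 corpus size, ignoring any missing scores) are likewise assumptions.

```python
import math


def meng_z(r1, r2, r12, n):
    """Z test for two dependent correlations sharing one variable.

    r1, r2: correlations of predictors 1 and 2 with the criterion.
    r12:    correlation between the two predictors.
    n:      sample size.
    """
    z1, z2 = math.atanh(r1), math.atanh(r2)  # Fisher r-to-z transform
    rbar_sq = (r1 ** 2 + r2 ** 2) / 2
    f = min((1 - r12) / (2 * (1 - rbar_sq)), 1.0)  # f is capped at 1
    h = (1 - f * rbar_sq) / (1 - rbar_sq)
    return (z1 - z2) * math.sqrt((n - 3) / (2 * (1 - r12) * h))


# Study 1 values from Table 2 (below diagonal): Syntax-LCM vs. BWK as
# predictors of hLCM, with the Syntax-LCM/BWK intercorrelation of .31.
z = meng_z(r1=0.61, r2=0.26, r12=0.31, n=500)
print(round(z, 2))
```

Because the table values are rounded and the exact test and sample size are assumptions, the result need not reproduce the reported Z exactly.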

Next, we ran a hierarchical regression analysis predicting hLCM scores with BWK scores entered at Step 1, LIWC-LCM scores at Step 2, and Syntax-LCM scores at Step 3 (see Table 3). Supporting our hypothesis, we found Syntax-LCM scores accounted for significant, unique variance in hLCM scores beyond other methods.

Table 3.

Summary of Hierarchical Regression Analysis for Automated Methods Predicting Hand-Coded LCM scores (Study 1).

Variable	β	SE	t	p	95% CI	ηp²	R ²	ΔR ²
Step 1							.07	.07
BWK	.19	.03	5.95	.001	[.13, .25]	.07
Step 2							.21	.14
BWK	.19	.03	6.57	.001	[.13, .25]	.08
LIWC-LCM	.28	.03	9.37	.001	[.22, .33]	.15
Step 3							.33	.12
BWK	.10	.03	3.34	.001	[.04, .15]	.02
LIWC-LCM	.10	.03	2.98	.003	[.03, .16]	.01
Syntax-LCM	.32	.03	9.38	.001	[.25, .39]	.15

Note. LIWC-LCM = Linguistic Inquiry and Word Count–Linguistic Category Model; BWK = Brysbaert, Warriner, and Kuperman.
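The ΔR² column in a hierarchical regression is the gain in explained variance at each step, that is, the difference between the R² of the current step and that of the previous step. A minimal Python sketch using the rounded R² values from Table 3 (the function name is an assumption):

```python
def delta_r_squared(r_squared_by_step):
    """Increment in R-squared contributed by each step.

    The first step is compared against a baseline of 0. Results are
    rounded to two decimals to match the precision of the table.
    """
    deltas = []
    previous = 0.0
    for r2 in r_squared_by_step:
        deltas.append(round(r2 - previous, 2))
        previous = r2
    return deltas


# R-squared after Steps 1, 2, and 3 in Table 3.
steps = delta_r_squared([0.07, 0.21, 0.33])
print(steps)
```

Differences computed from rounded R² values can deviate by .01 from increments computed on unrounded values, so small mismatches between the two columns of a published table are not necessarily errors.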

Finally, we tested each method’s efficacy for predicting distance conditions (i.e., whether sentences generated in the distant condition received more abstract scores than those generated in the proximal condition; see Table 4). First looking at the hLCM scores as a manipulation check, we found that participants in the distant condition generated more abstract sentences than those in the proximal condition. Next, we found support for the Syntax-LCM method’s validity; participants in the distant condition had significantly higher Syntax-LCM scores than those in the proximal condition. LIWC-LCM and BWK scores also successfully differentiated between conditions.

Table 4.

t Test and Descriptive Analysis for Experimental Condition Predicting Automated and Hand-Coded Methods.

Method	Abstract Condition		Concrete Condition		t	df	p	Cohen’s d
	M	SD	M	SD
hLCM	3.25	.60	2.77	.74	7.85	489	.001	.71
Syntax-LCM	2.15	.41	1.93	.44	5.86	491	.001	.52
LIWC-LCM	3.39	.48	3.29	.59	2.12	498	.035	.19
BWK	3.60	.25	3.47	.34	4.95	488	.001	.44

Note. hLCM = hand-coded LCM; LIWC-LCM = Linguistic Inquiry and Word Count–Linguistic Category Model; BWK = Brysbaert, Warriner, and Kuperman.
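The effect sizes in Table 4 can be approximated from the reported means and standard deviations. The Python sketch below assumes the pooled-standard-deviation form of Cohen's d and uses the condition sentence counts from the Method section (225 and 275) as group sizes; the actual group sizes may be slightly smaller if some sentences were not scored, and the function name is hypothetical.

```python
import math


def cohens_d(m1, sd1, n1, m2, sd2, n2):
    """Cohen's d from summary statistics, using the pooled SD."""
    pooled_var = (((n1 - 1) * sd1 ** 2 + (n2 - 1) * sd2 ** 2)
                  / (n1 + n2 - 2))
    return (m1 - m2) / math.sqrt(pooled_var)


# hLCM row of Table 4: abstract (distant) vs. concrete (proximal) condition.
d = cohens_d(m1=3.25, sd1=0.60, n1=225, m2=2.77, sd2=0.74, n2=275)
print(round(d, 2))  # about 0.71, in line with the hLCM row
```

The same function can be applied to the other rows of the table by substituting their means and standard deviations.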

Study 2

In Study 2, we conduct a stricter test of the Syntax-LCM method’s generalizability and predictive accuracy by applying it to text generated and hand-coded by a different research lab.

Method

Data Set

We acquired Study 2’s corpus from researchers unaffiliated with our institution (Yip-Bannicq, Kalkstein, & Trope, 2017). One hundred and two participants completed a lab study where they watched five video clips of shapes interacting and wrote a sentence describing what they saw in the video after each clip. A research assistant trained by the data collection lab coded each sentence using the LCM manual, resulting in 504 sentences.

Results

Using Study 1’s empirical approach, Pearson’s correlation analyses showed that Syntax-LCM scores were more strongly correlated with hLCM scores (see Table 2, above diagonal) compared to BWK, Z = 4.68, p < .001, and LIWC-LCM, Z = 3.60, p < .001. Hierarchical regression analysis results also indicated Syntax-LCM scores accounted for unique variance in hLCM scores after controlling for the other methods (see Table 5).

Table 5.

Summary of Hierarchical Regression Analysis for Automated Methods Predicting Hand-Coded LCM scores (Study 2).

Variable	β	SE	t	p	95% CI	ηp²	R ²	ΔR ²
Step 1							.05	.05
BWK	0.17	.03	4.89	.001	[.10, .24]	.05
Step 2							.13	.08
BWK	0.15	.03	4.59	.001	[.09, .22]	.04
LIWC-LCM	0.23	.03	6.61	.001	[.16, .29]	.08
Step 3							.21	.08
BWK	0.12	.03	3.67	.001	[.05, .18]	.03
LIWC-LCM	0.11	.04	2.89	.004	[.03, .18]	.02
Syntax-LCM	0.26	.04	7.16	.001	[.19, .33]	.10

Note. BWK = Brysbaert, Warriner, and Kuperman; LIWC-LCM = Linguistic Inquiry and Word Count–Linguistic Category Model.

Discussion

As with data from our lab (Study 1), we found the Syntax-LCM was the best automated approximation of hLCM scores for text coded by an external lab source, suggesting its effectiveness is not constrained to our own research lab or experimental contexts. In addition, each of the three data sets used in method creation and Studies 1–2 asked participants to respond to different topic domains, further validating the generalizability of the method (i.e., values in method creation corpus; day-in-the-life descriptions in Study 1; description of videos in Study 2).

Study 3

Studies 1 and 2 validated the Syntax-LCM’s approximation of hLCM scores for lab-generated responses and its efficacy for differentiating between construal manipulation conditions. In Study 3, we tested whether the Syntax-LCM also approximates hLCM scores for Twitter data, for two primary reasons. First, Twitter is a readily available source of social media data, making it a valuable research tool for social scientists. For example, two recent papers exploring CLT ideas in natural language use relied on Twitter data, both using the BWK ratings as their automated coding method (Bhatia & Walasek, 2016; Snefjella & Kuperman, 2015).

Second, Tweet syntax is unique due to Tweet character limits (at time of data collection, 140). This restriction may lead users to generate text with different syntactic patterns from regular speech or written prompts, and these sentence structures may not be comparable to everyday English syntax. Thus, it is feasible that our Syntax-LCM method could be less effective for predicting hLCM scores in this context. Ensuring the Syntax-LCM can effectively approximate hLCM scores for this data source would be helpful if we hope to provide a useful tool for many current, large-scale corpora.

Method

Data Set

We selected a subset of a previously purchased data set of Tweets containing Hurricane Sandy–related hashtags (e.g., “#sandy,” “#HurricaneSandy”) that contained the word “hurricane” to ensure Tweets were related to the same topic.³ After removing retweets and duplicate, non-English, or indecipherable Tweets (e.g., hyperlinks without additional text), our final corpus size was 52,183 Tweets. We used the same methods as prior studies to calculate the automated scores for the entire corpus⁴ and selected a random subset of 1,500 Tweets for hand-coding using the LCM manual. Two research assistants completed the coding and resolved disagreements through discussion (average intercoder reliability k = .81).

Results and Discussion

About 287 of the 1,500 tweets were uncodable as they lacked LCM coding scheme features, leaving a final sample size of 1,287 Tweets. As in previous studies, we first conducted Pearson’s correlation analyses of the relationship between automated methods and hLCM scores (see Table 6).

Table 6.

Study 3 Correlations Between Abstraction Scores.

Method	hLCM	Syntax-LCM	BWK
hLCM
Syntax-LCM	0.39(.001)
BWK	−0.07(.020)	−0.17(.001)
LIWC-LCM	0.34(.001)	0.64(.001)	−0.23(.001)

Note. p values in parentheses. hLCM = Hand-Coded LCM; BWK = Brysbaert, Warriner, and Kuperman; LIWC-LCM = Linguistic Inquiry and Word Count–Linguistic Category Model.

The correlation between hLCM and Syntax-LCM scores was significantly higher than the correlation between hLCM and BWK scores, Z = 11.35, p < .001, and between hLCM and LIWC-LCM scores, Z = 2.30, p = .011. In this study, BWK ratings were negatively correlated with hLCM scores, a surprising direction of association. As mentioned earlier, the BWK and LCM methods approach abstraction differently, and the context may affect how well they correlate.

Next, we used the same hierarchical regression analysis method as in previous studies to assess each method’s predictive accuracy (see Table 7). Replicating previous studies, we found that Syntax-LCM scores were a strong predictor of hLCM scores, with LIWC-LCM scores also contributing significantly to the model. However, unlike previous studies, we did not find that BWK scores provided unique predictive accuracy.

Table 7.

Summary of Hierarchical Regression Analysis for Automated Methods Predicting Hand-Coded LCM Scores (Study 3).

Variable	β	SE	t	p	95% CI	ηp²	R²	ΔR²
Step 1							.00	.00
BWK	−0.05	.02	−2.32	.020	[−.10, −.01]	.01
Step 2							.14	.13
BWK	0.00	.02	0.16	.872	[−.04, .05]	.00
LIWC-LCM	0.30	.02	13.57	.001	[.26, .34]	.14
Step 3							.19	.04
BWK	0.00	.02	0.03	.976	[−.04, .04]	.00
LIWC-LCM	0.16	.03	5.80	.001	[.11, .22]	.03
Syntax-LCM	0.22	.03	7.65	.001	[.16, .27]	.05

Note. BWK = Brysbaert, Warriner, and Kuperman; LIWC-LCM = Linguistic Inquiry and Word Count–Linguistic Category Model.
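The stepwise logic behind Table 7 can be sketched as follows: predictors are entered in a fixed order, and the change in R² at each step shows how much additional variance the newly entered predictor explains. This is a minimal sketch on synthetic data, not the study's data or analysis code. The helper `r_squared`, the variable names, and the simulated effect sizes are all assumptions, and the sketch reports only R² and ΔR², not the standardized coefficients or confidence intervals shown in the table.

```python
import random


def r_squared(predictors, y):
    """R^2 from an OLS fit of y on the predictor columns plus an intercept."""
    n = len(y)
    X = [[1.0] + [col[i] for col in predictors] for i in range(n)]
    k = len(X[0])
    # Normal equations (X'X) b = X'y, stored as an augmented k x (k + 1) matrix.
    A = [[sum(X[i][r] * X[i][c] for i in range(n)) for c in range(k)]
         + [sum(X[i][r] * y[i] for i in range(n))] for r in range(k)]
    # Gauss-Jordan elimination with partial pivoting.
    for col in range(k):
        pivot = max(range(col, k), key=lambda r: abs(A[r][col]))
        A[col], A[pivot] = A[pivot], A[col]
        for r in range(k):
            if r != col:
                factor = A[r][col] / A[col][col]
                A[r] = [a - factor * b for a, b in zip(A[r], A[col])]
    beta = [A[r][k] / A[r][r] for r in range(k)]
    fitted = [sum(b * x for b, x in zip(beta, row)) for row in X]
    mean_y = sum(y) / n
    ss_res = sum((yi - fi) ** 2 for yi, fi in zip(y, fitted))
    ss_tot = sum((yi - mean_y) ** 2 for yi in y)
    return 1 - ss_res / ss_tot


# Synthetic scores (illustrative only): Syntax-LCM and LIWC-LCM are correlated,
# and both relate to the hand-coded score; BWK is unrelated noise.
rng = random.Random(0)
n = 500
bwk = [rng.gauss(0, 1) for _ in range(n)]
liwc = [rng.gauss(0, 1) for _ in range(n)]
syntax = [0.6 * l + rng.gauss(0, 1) for l in liwc]
hlcm = [0.2 * l + 0.3 * s + rng.gauss(0, 1) for l, s in zip(liwc, syntax)]

steps = [("Step 1", [bwk]),
         ("Step 2", [bwk, liwc]),
         ("Step 3", [bwk, liwc, syntax])]
results = []
previous = 0.0
for name, predictors in steps:
    r2 = r_squared(predictors, hlcm)
    results.append((name, r2, r2 - previous))  # (step, R^2, delta R^2)
    previous = r2
for name, r2, delta in results:
    print(f"{name}: R2 = {r2:.3f}, delta R2 = {delta:.3f}")
```

Because the models are nested, R² cannot decrease from one step to the next, so ΔR² at the final step isolates the variance explained by the last predictor over and above the earlier ones.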

General Discussion

Across three studies and four data sets, we introduce the Syntax-LCM method for measuring abstraction in text using syntactic features and consider its effectiveness in predicting hand-coded LCM scores. While each of the three automated methods we tested accounted for unique information in our models, we found that the Syntax-LCM was most accurate at approximating hLCM scores across topic prompts and labs. It also varied with an established driver of abstract communication: Scores were higher when participants were communicating with a distal audience than with a proximal one. Finally, it outperformed other methods for Twitter data, a unique context where syntax usage is often idiosyncratic.

The syntactic features used in the Syntax-LCM method also demonstrate theoretical validity, paralleling the LCM manual’s coding rules. Notably, the Syntax-LCM contributed novel linguistic evidence for consistency between typical CLT results and CLT in language: Third-person and past tense verbs indicated abstract sentences, whereas first-person and present tense verbs indicated concrete sentences.

The Syntax-LCM may be of particular interest to CLT researchers who have struggled to find an automated method that fits CLT’s theoretical conceptualization and facilitates coding data efficiently, reliably, and with interpretable output. The Syntax-LCM provides a reasonable option, correlating most highly with hLCM scores across our studies. However, we note that although the Syntax-LCM best approximated hLCM scores, this does not make it inherently the “best” way to automate abstraction coding. Having been developed based upon the LCM, its usefulness may be constrained to contexts appropriate to that method. Further, Syntax-LCM, LIWC-LCM, and BWK seem to capture unique parts of the variance in hLCM scores. Researchers studying abstraction should consider whether combining approaches (e.g., the LCM and BWK methods, and other potential conceptual approaches) may provide further value and how to conceptualize these differing methods when considering their predictive utility in tandem.

In general, we suggest that if researchers are confined to using a single method, Syntax-LCM appears to be a strong choice, as it has the highest correlation with hLCM scores. However, at minimum, we encourage researchers to think carefully about the appropriate measure for their given research question, based upon the measure’s fit with their conceptualization of abstraction and the nature of language in the particular focal context.

Conclusion

Cognitive abstraction is indicated not only by the words people use but by the relationships between their words. The Syntax-LCM method described here incorporates such relationships to better approximate hand-coded scores generated using the LCM, a well-validated and established approach to conceptualizing and measuring linguistic abstraction. Researchers can use this practical tool to test their ideas with larger and more varied data sources while simultaneously ensuring construct validity and avoiding semantic restrictions, providing a useful bridge from the lab to the field.

Supplemental Material

Supplemental Material, Syntax-LCM_Supplemental for Measuring Abstract Mind-Sets Through Syntax: Automating the Linguistic Category Model by Kate M. Johnson-Grey, Reihane Boghrati, Cheryl J. Wakslak and Morteza Dehghani in Social Psychological and Personality Science

Footnotes

Acknowledgments

Thank you to Gün Semin for his help in generating the LCM addendum.

Declaration of Conflicting Interests

The author(s) declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.

Funding

The author(s) disclosed receipt of the following financial support for the research, authorship, and/or publication of this article: This research was supported in part by NSF IBSS Grant 1520031 and NSF Grant BCS-1349054.

ORCID iD

Kate M. Johnson-Grey

Morteza Dehghani

Supplemental Material

The supplemental material is available in the online version of the article.

Notes

References

Arnold & Tilton (2016). coreNLP: Wrappers around Stanford CoreNLP tools [Computer software manual] (R package version 0.4-2). Retrieved from https://CRAN.R-project.org/package=coreNLP

Bhatia & Walasek (2016). Event construal and temporal distance in natural language. Cognition, 152, 1–8.

Brysbaert, Warriner, A. B., & Kuperman (2014). Concreteness ratings for 40 thousand generally known English word lemmas. Behavior Research Methods, 46, 904–911.

Burgoon, E. M., Henderson, M. D., & Markman, A. B. (2013). There are many ways to see the forest for the trees: A tour guide for abstraction. Perspectives on Psychological Science, 8, 501–520.

Coenen, L. H. M., Hedebouw, & Semin, G. R. (2006). Measuring language abstraction: The Linguistic Category Model (LCM) Manual. Retrieved December 12, 2014, from http://www.cratylus.org/Text/1111548454250-3815/pC/1111473983125-6408/uploadedFiles/1151434261594-8567.pdf

Fiedler, Semin, G. R., & Bolten (1989). Language use and reification of social information: Top-down and bottom-up processing in person cognition. European Journal of Social Psychology, 19, 271–295.

Fujita, Henderson, M. D., Eng, Trope, & Liberman (2006). Spatial distance and mental construal of social events. Psychological Science, 17, 278–282.

Fujita, Trope, Liberman, & Levin-Sagi (2006). Construal levels and self control. Journal of Personality and Social Psychology, 90, 351–367.

Hansen & Wänke (2010). Truth from language and truth from fit: The impact of linguistic concreteness and level of construal on subjective truth. Personality and Social Psychology Bulletin, 36, 1576–1588.

Joshi & Wakslak, C. J. (2014). Communicating with the crowd: Speakers use abstract messages when addressing larger audiences. Journal of Experimental Psychology: General, 143, 351–362.

Joshi, Wakslak, C. J., & Huang (2018). Gender differences in speech abstraction and implications for women’s success in organizations. Manuscript submitted for publication.

Kousta, S. T., Vigliocco, Vinson, D. P., Andrews, & Del Campo (2011). The representation of abstract words: Why emotion matters. Journal of Experimental Psychology: General, 140, 14–34.

Maass, Salvi, Arcuri, & Semin, G. R. (1989). Language use in intergroup contexts: The linguistic intergroup bias. Journal of Personality and Social Psychology, 57, 981.

Paivio (1991). Dual coding theory: Retrospect and current status. Canadian Journal of Psychology, 45, 255–287.

Pennebaker, Booth, Boyd, & Francis (2015). Linguistic inquiry and word count: LIWC 2015 operators manual. Austin, TX: Pennebaker Conglomerates.

Reyt, J. N., Wiesenfeld, B. M., & Trope (2016). Big picture is better: The social implications of construal level for advice taking. Organizational Behavior and Human Decision Processes, 135, 22–31.

Schmid & Fiedler (1996). Language and implicit attributions in the Nuremberg trials: Analyzing prosecutors’ and defense attorneys’ closing speeches. Human Communication Research, 22, 371–398.

Schwanenflugel, P. J., Harnishfeger, K. K., & Stowe, R. W. (1988). Context availability and lexical decisions for abstract and concrete words. Journal of Memory and Language, 27, 499–520.

Seih, Beier, & Pennebaker, J. W. (2016). Development and examination of the linguistic category model in a computerized text analysis method. Journal of Language and Social Psychology, 36, 1–13.

Semin, G. R. (2011). Culturally situated linguistic ecologies and language use: Cultural tools at the service of representing and shaping situated realities. In Advances in Cultural Social Psychology, 1, 217–249.

Semin, G. R., & Fiedler (1988). The cognitive functions of linguistic categories in describing persons: Social cognition and language. Journal of Personality and Social Psychology, 54, 558–568.

Semin, G. R., & Fiedler (1989). Relocating attributional phenomena within a language-cognition interface: The case of actors’ and observers’ perspectives. European Journal of Social Psychology, 19, 491–508.

Semin, G. R., & Fiedler (1991). The linguistic category model, its bases, applications and range. European Review of Social Psychology, 2, 1–30.

Snefjella & Kuperman (2015). Concreteness and psychological distance in natural language use. Psychological Science, 26, 1449–1460.

Stone, P. J., Dunphy, D. C., & Smith, M. S. (1966). The general inquirer: A computer approach to content analysis. Oxford, England: MIT Press.

Trope & Liberman (2010). Construal-level theory of psychological distance. Psychological Review, 117, 440–463.

Wakslak, C. J., Smith, P. K., & Han (2014). Using abstract language signals power. Journal of Personality and Social Psychology, 107, 41–55.

Yip-Bannicq, Kalkstein, D. A., & Trope (2019). Abstraction in shared reality. Manuscript in preparation.
