Abstract
To date, there is a paucity of research conducting natural language processing (NLP) on the open-ended responses of behavior rating scales. Using three NLP lexicons for sentiment analysis of the open-ended responses of the Behavior Assessment System for Children-Third Edition, the researchers discovered a moderately positive correlation between the human composite rating and the sentiment score using each of the lexicons for strengths comments and a slightly positive correlation for the concerns comments made by guardians and teachers. In addition, the researchers found that as the word count increased for open-ended responses regarding the child’s strengths, there was a greater positive sentiment rating. Conversely, as word count increased for open-ended responses regarding child concerns, the human raters scored comments more negatively. The authors offer a proof-of-concept to use NLP-based sentiment analysis of open-ended comments to complement other data for clinical decision making.
Developmental and behavioral health disorders are among the top five chronic pediatric conditions that cause functional impairment in the United States (Halfon et al., 2012; Slomski, 2012). Behavioral problems in children are associated with numerous long-term negative consequences, including difficulties in academic functioning (Tremblay et al., 1992), school dropout, unemployment later in life, and poverty (Morgan et al., 2009). Additionally, children who demonstrate both aggression and academic difficulties are more likely to experience peer rejection and to struggle with substance abuse and delinquency (Schaeffer et al., 2003; Schaeffer et al., 2006).
Pediatric behavioral and emotional problems are often undetected (Lavigne et al., 1993; Weitzman & Wegner, 2015), partly due to the difficulty in the identification of specific behavioral problems without the use of standardized measures (Sheldrick et al., 2011). Parents and caregivers can be helpful partners to clinicians in the use of validated screening tools (Glascoe & Marks, 2011), especially for younger children (Forness et al., 1996). As such, clinicians (e.g., health care providers, psychologists, and educators) encourage parents or guardians to complete validated screening instruments to assess children and adolescents for behavioral and psychosocial concerns (American Academy of Pediatrics Task Force on Mental Health, 2010; Duncombe et al., 2012; Rishel et al., 2005).
There are numerous instruments available for clinicians to obtain information about children’s behavioral and social–emotional functioning. The format of these assessments ranges from forced-choice questionnaires to more open-ended styles. Research indicates that the format of assessment items may constrain the behavior of the respondent (Martinez, 1999). Haladyna (1998) elaborated that item format selection should be based on the interpretation to be made from the information gathered. The standard recommendation is for assessments to include items of both formats (i.e., forced-choice and open-ended) in order to obtain the most comprehensive information (Traub & MacRury, 1990). Therefore, if the assessment developer includes both forced-choice (a selection response mode) and open-ended (a production response mode) assessment items, then the provider should be able to gather the most clinically relevant and complete data. When parents complete open-ended assessment items, they often present information that is unique from the information gathered in the forced-choice items. Additionally, parents often elaborate on their thoughts and opinions in free text. For instance, when responding to items on the Parents’ Evaluation of Developmental Status, 27.5% of parents qualified their concerns with a written comment (Cox et al., 2010). While the information gathered by such open-ended items may prove fruitful for clinicians, they are then tasked with accurately analyzing and coding the data presented in the open-ended response text, while attempting to avoid confirmation bias.
To address the analysis of open-ended responses, natural language processing (NLP) has been used to systematically break down language into smaller elemental pieces, which can then be used to assess relationships or associations (Crowston et al., 2012). NLP uses algorithms, methodologies, and tools to analyze naturally occurring text for the purpose of achieving human-like language processing (Joseph et al., 2016). Additionally, NLP methods appear useful in the identification of major themes found with traditional qualitative analysis (Guetterman et al., 2018). NLP has a history of improving the efficiency while maintaining the accuracy of free text analysis (e.g., Ruge et al., 1991). Not only does NLP allow for quick coding, but NLP appears to reduce the burden of manual abstraction or human analyses (e.g., Carrell et al., 2014). Finally, NLP can improve the efficiency of human processing of language by applying NLP to an entire data set, and qualitatively coding a smaller subsample (Guetterman et al., 2018).
One analytical approach that involves NLP is sentiment analysis, or the process of identifying and categorizing opinions expressed in a text to determine whether the attitude toward the topic is positive, negative, or neutral (Liu, 2012). Generally, when conducting a sentiment analysis, a lexicon is used to identify the polarity (positive or negative) or degree of polarity for a given word. Some lexicons have been created for a specific purpose, such as to extract sentiment from financial texts (Oliveira et al., 2014) or product reviews (Hu & Liu, 2004), while others were created as general-purpose lexicons to extract sentiment from texts (Khoo & Johnkhan, 2018). Lexicons also vary in the way in which the sentiment of the words in the lexicon are classified, ranging from a scale rating of sentiment to a binary indicator for positive or negative sentiment. Sentiment lexicons are typically validated by crowd-sourcing data, running the lexicon against natural language sets of text (e.g. product reviews) or by evaluating the lexicon’s performance against another validated lexicon (Khoo & Johnkhan, 2018).
Clinical researchers have sought to utilize NLP in the health and behavioral health arenas. For instance, NLP has been used to reduce error when dictating information about services provided into an electronic health record (Kumar et al., 2014; Murff et al., 2011; Pakhomov et al., 2007). In a more recent study, He et al. (2017) assessed patient self-narratives in posttraumatic stress disorder screening using NLP and text mining. They found this process helpful for individuals with “middle to moderate mental health needs” due to the ease of administration. They also noted the flexibility of such an approach and reported that it allowed respondents to express themselves more freely. de Vries (2017) utilized machine learning to detect and analyze the sentiment of children’s expressions in their diaries. The researcher found that machine learning models were not only stronger at capturing context and more complex negations but also better at analyzing the sentiment of diary text than a symbolic sentiment scoring algorithm.
Purpose of the Current Study
The authors aimed to conduct a proof-of-concept study to utilize NLP to analyze the sentiment of open-ended responses from a behavior rating scale designed for children and adolescents. The purpose was to evaluate the sentiment of open-ended responses via NLP against human ratings to determine the level of agreement between the two rating methods.
Primary Objective
The authors of the current study hypothesized that the NLP lexicon-based sentiment of the Behavior Assessment System for Children-Third Edition (BASC-3) comments would moderately to strongly correlate with the sentiment score determined by human raters.
Secondary Objective
The authors hypothesized that total word count of each open-ended comment would correlate moderately to strongly with the human rated sentiment. That is, longer comments highlighting strengths would tend to have greater levels of positive sentiment and longer comments highlighting concerns would tend to have greater levels of negative sentiment.
Method
The current authors report how they determined the sample size, all data exclusions, all manipulations, and all measures in the study.
Participants
The researchers reviewed BASC-3 questionnaire responses from 185 parents, guardians, caretakers, and teachers of children under the age of 18 referred to the Psychology and Developmental-Behavioral Pediatric clinics of an academic health system between March 1, 2017 and February 28, 2019. Exclusion criteria included (a) an incomplete BASC-3 checklist or (b) a BASC-3 completed in a language other than English, because the lexicons used in the current study are based in English. Of the 185 eligible participants, 11 were excluded because both the concerns comment and strengths comment fields were blank, bringing the sample size to 174. The 174 BASC-3 forms were completed by 35 (20.11%) teachers, 109 (62.64%) mothers, 23 (13.22%) fathers, and 7 (4.02%) guardians or other relatives. The age of the index child ranged from 2 to 17 years with Mdn of 7 (interquartile range [IQR] = 5-11) years. The BASC-3 protocols used consisted of the teacher rating scales (TRS) for preschoolers (n = 15; 2-5 years of age), children (n = 15; 6-10 years), and adolescents (n = 5; 12-14 years), and parent rating scales (PRS) for preschoolers (n = 43; 2-5 years), children (n = 68; 6-12 years), and adolescents (n = 28; 12-17 years). The investigative team included a developmental–behavioral pediatrician, biostatistician, pediatric resident, and doctoral-level clinical psychology practicum student.
Design
A retrospective cross-sectional design was used.
Materials
Q-global
Q-global is a web-based system for administering, scoring, and reporting tests that permits clinicians to score more than 60 assessments on a secure server. The BASC-3 is one of the assessments available in the Q-global system.
BASC-3
For the current study, the authors utilized the BASC-3 (Kamphaus & Reynolds, 2015), a multimethod system which gathers detailed information regarding the child’s adaptive and problem behavior across a variety of settings. Both the PRS and TRS yield four to five composite scales, 14 primary scales, seven Content Scales, five Clinical Probability Indices, and five Executive Functioning Indices (Kamphaus & Reynolds, 2015). Standardization was conducted with about 1,700 to 1,800 forms administered by approximately 300 examiners in 44 states on a representative population aged two through 21. Coefficient α reliabilities ranged from .83 to .97 across all scales and indices.
RStudio
RStudio and the R packages tidytext and syuzhet were used to clean, process, and extract the sentiment from each of the strengths and concerns comments. The corpus of concerns and strengths comments was created and transformed into lists of words, and contracted words were expanded. Numbers, punctuation, and unnecessary white spaces were removed. A document term matrix was used to identify the frequency distribution of words in the concerns and strengths comments.
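The cleaning steps described above can be illustrated with a minimal sketch. It is written in Python rather than R and is not the authors' code; the contraction map, the function names, and the order of operations are assumptions made for illustration only.

```python
import re
from collections import Counter

# Hypothetical, deliberately tiny contraction map (an assumption for this sketch);
# a real pipeline would need a much fuller list.
CONTRACTIONS = {
    "he'll": "he will",
    "i'll": "i will",
    "can't": "cannot",
    "doesn't": "does not",
    "won't": "will not",
}


def clean_comment(text):
    """Lowercase, expand contractions, then drop numbers, punctuation, extra spaces."""
    text = text.lower().replace("\u2019", "'")  # normalize curly apostrophes
    for short, full in CONTRACTIONS.items():
        text = text.replace(short, full)
    text = re.sub(r"[0-9]+", " ", text)       # remove numbers
    text = re.sub(r"[^a-z\s]", " ", text)     # remove punctuation
    return re.sub(r"\s+", " ", text).strip()  # collapse white space


def term_frequencies(comments):
    """Count how often each word occurs across a list of comments."""
    counts = Counter()
    for comment in comments:
        counts.update(clean_comment(comment).split())
    return counts


print(clean_comment("He\u2019ll focus 2 times, I'll help!"))
print(term_frequencies(["He'll try.", "I'll try"]).most_common(2))
```

Expanding contractions before stripping punctuation matters here: reversing the order would turn the apostrophe into a space and leave fragments such as "ll" in the word list.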
SAS v9.4
SAS v9.4 was used to analyze BASC-3 data and the resulting sentiment score data obtained using RStudio.
Data Collection
Data collection procedures were approved by the corresponding author’s institutional review board. All data used in this study had originally been collected for clinical decision-making purposes. Data were extracted from the Q-global system and de-identified by the researchers before inclusion in analysis.
Analytic Procedure
The Principal Investigator (PI) maintained secure online access to Q-global throughout the duration of the study. The PI downloaded a spreadsheet (.csv file) of the requisite BASC-3 data that included limited demographic information of the participant (i.e., age without identifying birth date and gender), rater (i.e., gender and relation to the participant), and instrument used. The PI extracted the responses from the two open-ended free-text responses for the strengths (What are the behavioral and/or emotional strengths of this child?) and concerns (Please list any specific behavioral and/or emotional concerns you have about this child.) questions at the beginning of the BASC-3. Finally, identifiers in the open-ended responses were redacted prior to analysis.
Two independent raters, including one expert rater, rated the sentiment of each of the strengths and concerns comments provided by the 174 respondents (i.e., parent, guardian, or teacher). The polarity of the sentiment was rated using a 7-point Likert-type scale ranging from −3 (very negative) to +3 (very positive). The level of difficulty for rating the text was rated using a 4-point Likert-type scale ranging from 1 (very easy) to 4 (very difficult). As previously stated, if both the strengths and concerns comment fields were left blank, then that participant’s data were discarded. However, if either the strengths or the concerns comment field was left blank, that missing text was coded as neutral; this neutral coding of blank text allows for reports to have “nothing to say” in one section. Weighted Kappa (95% confidence interval) was used to measure agreement between the expert rater and other human raters for the strengths and concerns sentiment. Weighted Kappa takes into account the relative proximity of discordant ratings. For the concerns and strengths comments, discordant human ratings were reconciled to obtain one agreed on sentiment rating.
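To make the idea of "relative proximity of discordant ratings" concrete, the following is a minimal sketch of a weighted kappa computed on the 7-point scale. The source does not state which weighting scheme was used, so the linear weights (and the optional quadratic variant via `power=2`) are assumptions, as are the function and parameter names.

```python
def weighted_kappa(rater_a, rater_b, categories, power=1):
    """Cohen's weighted kappa using disagreement weights (|i - j| / (k - 1)) ** power.

    power=1 gives linear weights and power=2 quadratic weights; which scheme the
    study used is not stated, so this choice is an assumption of the sketch.
    Undefined (division by zero) if both raters use one identical category only.
    """
    k = len(categories)
    index = {category: i for i, category in enumerate(categories)}
    n = len(rater_a)

    # Observed proportion of items in each (rater A, rater B) cell.
    observed = [[0.0] * k for _ in range(k)]
    for a, b in zip(rater_a, rater_b):
        observed[index[a]][index[b]] += 1 / n

    row = [sum(observed[i]) for i in range(k)]
    col = [sum(observed[i][j] for i in range(k)) for j in range(k)]

    weighted_observed = 0.0
    weighted_expected = 0.0
    for i in range(k):
        for j in range(k):
            weight = (abs(i - j) / (k - 1)) ** power
            weighted_observed += weight * observed[i][j]
            weighted_expected += weight * row[i] * col[j]
    return 1 - weighted_observed / weighted_expected


scale = [-3, -2, -1, 0, 1, 2, 3]
expert = [3, 2, -1, 0, -3, 1]   # made-up ratings for illustration
second = [3, 1, -1, 0, -2, 1]
print(round(weighted_kappa(expert, second, scale), 3))
```

Under this weighting, a disagreement of +2 versus +3 is penalized far less than a disagreement of −3 versus +3, which is the property the text attributes to weighted kappa.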
NLP was used to extract the sentiment from the open-ended responses using the bag-of-words approach in which (a) the text was stripped of punctuation, (b) grammar and word order were disregarded, and (c) the multiplicity of words was accounted for. First, contractions were expanded to their full forms in order to maintain the intended semantics when removing punctuation. For example, expanding “he’ll” and “I’ll” to “he will” and “I will” prevents these words from being scored as “hell” and “ill.” Then, punctuation was stripped from the text. The frequency of each unique word within each participant’s comment was totaled and multiplied by the sentiment value in the corresponding lexicon using the syuzhet package (Jockers, 2015). In doing so, the frequency of a given repeated word was accounted for in the overall sentiment rating for the concerns and strengths comment.
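The frequency-times-value scoring described above can be sketched as follows. This is an illustration in Python, not the syuzhet implementation; the toy lexicon and its values are invented for the example and are not taken from any published lexicon.

```python
from collections import Counter

# Hypothetical toy lexicon with made-up values (not real lexicon entries).
TOY_LEXICON = {"kind": 2, "happy": 3, "hell": -4, "ill": -2, "angry": -3}


def bag_of_words_score(cleaned_text, lexicon):
    """Sum of (word frequency x lexicon value); word order is ignored."""
    counts = Counter(cleaned_text.split())
    return sum(freq * lexicon.get(word, 0) for word, freq in counts.items())


# Stripping punctuation without expanding "he'll" first leaves "hell".
print(bag_of_words_score("hell be kind", TOY_LEXICON))     # -2
print(bag_of_words_score("he will be kind", TOY_LEXICON))  # 2
print(bag_of_words_score("happy happy angry", TOY_LEXICON))  # 3
```

The first two calls show why the contraction step comes first: the same intended comment moves from a negative to a positive score once "he'll" is expanded rather than collapsed into "hell."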
For each of the strengths and concerns comments, the agreed upon sentiment rating by the human raters was compared with the sentiment rating of three general-purpose sentiment lexicons: AFINN, Bing, and National Research Council Canada (NRC; see Supplemental Table S1, available online). The AFINN (Nielsen, 2011) lexicon assigns each word an integer score ranging from negative five to five, with negative scores indicating negative sentiment and positive scores indicating positive sentiment. The Bing sentiment lexicon (Liu et al., 2005) categorizes each word in a binary fashion for positive and negative sentiment. Finally, the NRC Word-Emotion Association Lexicon (Mohammad & Turney, 2010) categorizes each word in a binary fashion for each sentiment (i.e., negative and positive) and emotion (i.e., anger, anticipation, disgust, fear, joy, sadness, surprise, and trust) based on Plutchik’s (2001) wheel of emotions. The sentiment rating for the strengths and concerns comment was computed as the difference in word sentiment counts, or positive count minus negative count, for both the NRC and Bing lexicons. The sentiment rating for the strengths and concerns comment was computed as the sum of sentiment ratings for the AFINN lexicon. Therefore, for each lexicon, scores closer to zero indicate a neutral sentiment and the use of similar numbers of positive and negative words. Higher (positive) scores indicate more positive sentiment whereas lower (negative) scores indicate a more negative sentiment. Spearman rank-order correlation was used to assess (a) the correlation between composite human sentiment rating and the sentiment score using each lexicon and (b) the correlation between the composite human sentiment rating and the word count for the strengths and concerns comments.
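For the binary lexicons, the "positive count minus negative count" rule can be sketched in a few lines. The word sets below are hypothetical placeholders chosen for illustration, not entries confirmed to appear in Bing or NRC.

```python
# Hypothetical word sets standing in for a binary (positive/negative) lexicon.
POSITIVE_WORDS = {"kind", "helpful", "happy"}
NEGATIVE_WORDS = {"stubborn", "angry", "aggressive"}


def net_sentiment(tokens, positive_words, negative_words):
    """Positive word count minus negative word count for one comment."""
    positives = sum(1 for token in tokens if token in positive_words)
    negatives = sum(1 for token in tokens if token in negative_words)
    return positives - negatives


print(net_sentiment("kind helpful but stubborn".split(), POSITIVE_WORDS, NEGATIVE_WORDS))  # 1
print(net_sentiment("happy but angry".split(), POSITIVE_WORDS, NEGATIVE_WORDS))            # 0
```

A score of zero can therefore mean either that no lexicon words were found or that positive and negative words balanced out, which is worth keeping in mind when reading near-zero scores.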
For the sets of strengths and concerns comments, we used partial Spearman rank-order correlations to evaluate the correlation between composite human sentiment rating and the sentiment score using each lexicon while accounting for word count.
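The rank-order and partial rank-order correlations described above can be sketched as follows. The study ran these analyses in SAS; this Python version is only an illustration, and it assumes the common first-order formula for a partial correlation applied to the pairwise Spearman coefficients. All function names and the example data are hypothetical.

```python
def average_ranks(values):
    """Rank values from 1..n, giving tied values the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    pos = 0
    while pos < len(order):
        end = pos
        while end + 1 < len(order) and values[order[end + 1]] == values[order[pos]]:
            end += 1
        shared = (pos + end) / 2 + 1
        for i in order[pos:end + 1]:
            ranks[i] = shared
        pos = end + 1
    return ranks


def pearson(x, y):
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = sum((a - mean_x) ** 2 for a in x) ** 0.5
    sd_y = sum((b - mean_y) ** 2 for b in y) ** 0.5
    return cov / (sd_x * sd_y)


def spearman(x, y):
    """Spearman correlation = Pearson correlation of the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


def partial_from_pairwise(r_xy, r_xz, r_yz):
    """First-order partial correlation of x and y controlling for z."""
    return (r_xy - r_xz * r_yz) / ((1 - r_xz ** 2) * (1 - r_yz ** 2)) ** 0.5


def partial_spearman(x, y, z):
    return partial_from_pairwise(spearman(x, y), spearman(x, z), spearman(y, z))


human = [1, 2, 3, 4, 5]        # made-up human ratings
lexicon = [2, 1, 4, 3, 5]      # made-up lexicon scores
word_count = [1, 3, 2, 5, 4]   # made-up word counts
print(round(spearman(human, lexicon), 3))
print(round(partial_spearman(human, lexicon, word_count), 3))
```

Here `word_count` plays the role of the controlled variable, so the second value estimates how strongly the two ratings agree once their shared association with comment length is removed.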
Results
The word count for both the strengths and concerns comments ranged from zero to a maximum of 95 and 99 words, respectively. The strengths comments were typically shorter with a Mdn of 16 (IQR = 9-32) relative to the concerns comments which were typically longer with a Mdn of 29 (IQR = 15-58) words. Furthermore, teachers typically provided longer strengths and concerns comments than parents or guardians. Strengths comments provided by teachers included a Mdn of 23 words (IQR = 14-36), whereas parents included a Mdn of 15 words (IQR = 6-27). Concerns comments provided by teachers included a Mdn of 47 words (IQR = 28-68) whereas parents included a Mdn of 24 words (IQR = 11-49). For strengths, the median word count provided by teachers was about 50% greater than that provided by parents/guardians, whereas for concerns, the median word count provided by teachers was about 100% greater than that provided by parents/guardians.
Reconciliation of human ratings led to exact agreement of both the concerns and strengths sentiment for 167 (96%) of the 174 participants. The human sentiment rating for concerns comments had a stronger negative correlation with word count (r = −.579 [p < .0001]) than the positive correlation of human sentiment rating of the strengths comments with word count (r = .348 [p < .0001]). Table 1 shows the comments, sentiment ratings, and difficulty ratings for the strengths and concerns comments on which the human raters disagreed.
Table 1. Strengths and Concerns Comments With Human Rater Disagreement.
Note. Sentiment was rated using a 7-point Likert-type scale consisting of −3 (very negative); −2 (moderately negative); −1 (slightly negative); 0 (neutral); +1 (slightly positive); +2 (moderately positive); +3 (very positive).
For the strengths and concerns comments, the Spearman correlation between the human sentiment rating and the lexicon sentiment rating was greatest for the Bing lexicon (see Table 2). The concerns comments had consistently lower Spearman correlations with the composite human sentiment rating across all three lexicons, implying that the lexicon-based sentiment scoring of the concerns comments did not perform as well as that for the strengths comments.
Table 2. Spearman Rank-Order Correlation Between Composite Human Sentiment Rating and Lexicon Sentiment Rating for Strengths and Concerns and Partial Correlation Controlling for Word Count.
Partial correlation between composite human sentiment rating and lexicon sentiment rating accounting for word count. NRC = National Research Council Canada.
Figure 1 illustrates a positive correlation between the composite human sentiment rating and the sentiment scored by each of the three lexicons for the strengths comments. However, there appears to be no correlation between the composite human sentiment rating and the sentiment scored by each of the three lexicons for the concerns comments. That is, for each of the three lexicons, the human composite rating for the strengths comments tended to have greater agreement with the lexicon rating than the human composite rating and the lexicon rating for the concerns comments.

Figure 1. Jittered scatter plots of composite human sentiment rating and the sentiment score for each of the concerns comments and strengths comments using each of the three lexicons.
For teachers’ strengths comments, the researchers observed a similar, yet slightly greater correlation between the human composite sentiment rating and the lexicon sentiment rating (Bing r = .66 [p < .0001]; NRC r = .57 [p = .0006]) than for parent/guardians’ strengths comments (Bing r = .53 [p < .0001]; NRC r = .40 [p < .0001]). However, the AFINN lexicon ran contrary to the Bing and NRC lexicons in this regard: the researchers observed a similar, yet slightly smaller correlation between the human composite sentiment rating and the AFINN lexicon sentiment rating for teachers’ strengths comments (AFINN r = .46 [p = .0086]) than that observed for parents/guardians (AFINN r = .54 [p < .0001]). Because word count was correlated with the human sentiment rating, the authors explored whether word count was a confounding variable. In Table 2, partial correlations indicated that when accounting for word count, the correlation between the human composite sentiment rating and lexicon sentiment rating slightly decreased yet remained significant.
Discussion
To the authors’ knowledge, this is the first study to use NLP in sentiment analysis of open-ended responses on a commonly used behavior rating scale. Using three commonly found NLP lexicons, the authors found a moderately positive correlation between the human composite rating and the sentiment score using each of the lexicons for strengths comments written by parents or guardians. Conversely, the authors found a weak positive correlation between the human composite rating and the sentiment score using each of the three NLP lexicons for the concerns comments written by parents or guardians. Similarly, the authors discovered a moderately positive rank-order correlation between the human composite sentiment rating and the sentiment score using each of the three NLP lexicons for the strengths comments written by teachers. The authors also uncovered a relatively weak positive correlation between the human composite sentiment rating and Bing rating for the concerns comments provided by teachers; however, there was a moderately positive correlation between the human composite sentiment rating and the AFINN lexicon. In addition, the authors found that as the word count increased there was a higher positive sentiment rating for open-ended responses regarding the child’s strengths. Conversely, as word count increased for the open-ended response regarding child concerns, the comment was rated more negatively among the human expert raters.
Compared with parents or guardians, teachers typically provided strengths and concerns comments with a lexicon-based sentiment score that was more highly correlated with the human composite sentiment rating. This concordance is likely due to teachers typically providing comments with greater word count than parents. Although teachers often provided greater amounts of text than parents or guardians, word count was found to be weakly correlated with the human sentiment rating, and weakly to moderately correlated with lexicon sentiment rating. Accordingly, the type of respondent (i.e., teacher or parent/guardian) does not confound the relationship between human and lexicon sentiment rating beyond the extent influenced by word count. It is possible that specific lexicons are more appropriate for use with parents while others are more appropriate to use with teachers. The results illustrated that there was particular difficulty in obtaining agreement between human ratings and NLP for negative comments. Furthermore, the strongest NLP lexicon in this study appeared to be the Bing lexicon. It is plausible that teachers use different terminology to describe the participants’ behaviors than parents. Additionally, children may exhibit different behavior depending on the context; consequently, the degree of perceived behavioral concern based on the context could lead to discrepancies in reporting between teachers and parents (Dirks et al., 2012). Furthermore, parents and teachers may hold different developmental expectations and therefore may interpret behavior differently (Wakschlag et al., 2010). The authors suggest that context be considered when analyzing the sentiment from the behavior ratings of different respondents.
For each lexicon, the rank-order correlation between strengths sentiment (measured using the lexicon) and the human composite rating is greater than the rank-order correlation between the concerns sentiment (measured using the lexicon) and the human composite rating. This discrepancy could largely be due to the use of unigram lexicons. Unigram lexicons do not take negations into account. Consider the following two comments, which would result in the same sentiment score: “I am concerned with Jane Doe not being able to engage with peers making and keeping good friends” and “I am concerned with Jane Doe being able to engage with peers making and keeping good friends.” The first sentence is an actual, de-identified example from a concerns comment. Each time a negation was present in the respondent’s language, the lexicon sentiment rating was neutral or positive while the human raters rated the overall comment as negative. Within the sample (n = 165), 50 (30.30%) of the concerns comments contained the word “not,” while only 17 (10.30%) of the strengths comments contained the word “not.” The AFINN lexicon does have some bigrams such as “not good” and “not working,” but the majority of sentiments within the lexicon are scored for unigrams. Within Bing and NRC there are only unigrams and the word “not” is not included. As such, additional preprocessing could be used to handle negations prior to obtaining lexicon-based sentiment.
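One simple form such negation preprocessing could take is sketched below: the sign of a lexicon word is flipped when a negator appears within a short window before it. This heuristic was not used in the study; the negator list, the two-word window, and the toy lexicon values are all assumptions chosen for illustration.

```python
# Hypothetical toy lexicon and negator list (assumptions for this sketch).
TOY_LEXICON = {"concerned": -1, "able": 1, "good": 1}
NEGATORS = {"not", "never", "no"}


def unigram_score(tokens, lexicon):
    """Plain unigram scoring: negations are ignored."""
    return sum(lexicon.get(word, 0) for word in tokens)


def score_with_negation(tokens, lexicon, window=2):
    """Flip a word's value if a negator occurs within `window` tokens before it."""
    total = 0
    for i, word in enumerate(tokens):
        value = lexicon.get(word, 0)
        if value and any(t in NEGATORS for t in tokens[max(0, i - window):i]):
            value = -value
        total += value
    return total


negated = ("i am concerned with jane doe not being able to engage "
           "with peers making and keeping good friends").split()
plain = [word for word in negated if word != "not"]

print(unigram_score(negated, TOY_LEXICON), unigram_score(plain, TOY_LEXICON))              # 1 1
print(score_with_negation(negated, TOY_LEXICON), score_with_negation(plain, TOY_LEXICON))  # -1 1
```

With plain unigram scoring the two example comments receive the same score, whereas the windowed rule separates them. A fixed window is crude (it can flip words the negator does not actually govern), so any such rule would need validation against human ratings.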
Another potential reason for the discrepancy between the lexicon-based sentiment rating and the human rater sentiment rating is that the human rater was aware of whether the content was a “concern” or “strength.” Furthermore, the human rater was able to interpret the comment based on that contextual understanding. The following concerns comments were rated more positively by the lexicons than by the human raters: “Speaking and understanding,” “regulation of her emotions,” “focus, concentration, planning,” and “concentration, self-help, emotions.” “Speaking and understanding” was rated as neutral (i.e., neither positive nor negative) sentiment by the human raters whereas the Bing lexicon rated this statement as positive because of the inclusion of the word “understand.” Thus, the NLP engine has limitations, which reduce its accuracy in rating negative comments. One strategy to address this weakness would be to include the stem of the prompt the respondent receives (i.e., “what concerns”) and thus train the engine that those responses are negative in context. Another strategy would be to include a sentence-level sentiment analysis; however, the ability to complete a sentence-level sentiment analysis is dependent on the use of correct punctuation.
NLP is efficient with regard to time and effort. As an actuarial-based approach in quantifying sentiment, NLP avoids the potential bias from a single human rater since evaluating sentiment can be highly subjective and influenced by personal experiences, thoughts, and beliefs, and may be influenced by confirmation bias. Additionally, the use of a given set of NLP rules and sentiment lexicon will consistently provide precise and objective measures of a given comment’s sentiment. Of course, language changes and adapts over time; therefore, sentiment lexicons need to be periodically fine-tuned to reflect the sentiment of the population of interest. The end-result is that NLP is not intended to replace clinical judgment, but to provide added utility that contributes to the clinical decision-making process.
Behavioral health providers are presented with numerous demands on the job, ranging from administrative tasks (e.g. writing progress notes) to completing direct care work (e.g. completing evaluations). As previously mentioned, there are significant benefits to using behavioral screening tools across settings to identify children at risk of developing or who are currently experiencing socioemotional difficulties. Providers would benefit from having the methodology to streamline their workflow when it comes to analyzing screening tool responses and flagging significant concerns within the text. With these strategies in place, they would be able to focus more on clinical care and less on administrative tasks, thereby potentially reducing provider burnout, while still ensuring the high-quality assessment and care of their patients.
Utilizing the open-ended responses in behavior rating scales provides the respondent an opportunity to let clinicians and researchers know what is on their mind in a potentially less intimidating environment than required when communicating concerns verbally and directly to the provider. In the current study, an overwhelming majority of respondents provided written text in the open-ended fields, which bears testament to the value of including this response option in information gathering and clinical decision making. Additionally, respondents’ comments oblige clinicians and researchers to read what they have written, thereby further informing treatment planning (Singer & Couper, 2017).
Limitations and Future Directions
The current study had several limitations. First, misspelled words or grammatical errors were not corrected prior to analysis and therefore the correct sentiment may not have been extracted from these words. This is a well-known challenge in qualitative analysis, going back half a century with Kammeyer and Roth (1971) stating that “Responses to open-ended questions are usually less than completely clear; they often contain ambiguous words and phrases; and they are frequently ungrammatical and poorly worded.” That being said, investigators were able to ascertain the message of the attempted communication despite misspelled words and grammatical errors. Future researchers might consider proactively correcting such errors to reduce the risk of inaccurate sentiment ratings. As previously stated, performing sentiment analysis using NLP may prevent the influence of bias by human raters with the use of automated opinion mining. However, lexicons may still contain the bias of the individual(s) who created the lexicon. Additionally, the lexicons used in this study contained mostly unigrams, which did not allow for detection of negated phrases typically found in the concerns comments. Therefore, human raters often rated the sentiment of the concerns comments more accurately and reliably. Notably, not all comment ratings could be reconciled by the human raters, which indicates the level of difficulty in scoring sentiment of open-ended responses. Another limitation of performing lexicon-based sentiment analysis is the finite number of words in the lexicons and the assignment of a fixed sentiment orientation/score to each word. Finally, although NLP can automate components of a qualitative coding process, a trained NLP analyst may be needed to develop and validate the rule set used to clean, prepare text files, and run programming to obtain the lexicon-based sentiment score.
This method may be less efficient for small data sets; however, as the code develops, there will likely be efficiency improvements for the larger data sets. Eventually, this would reduce response effort required for human coders (Crowston et al., 2012).
NLP has already been used in the field of psychology. For instance, machine learning automated approaches have explored psychological constructs (i.e., psychosis) for prognostic prediction (Bedi et al., 2015; Fitzpatrick et al., 2017). The current study is an initial proof-of-concept study that utilizes NLP in the screening/diagnostic arena, to correlate open-ended responses to the forced-choice questions of a rating scale. The open-ended responses offer content for diagnostic and treatment decisions. The current authors were intrigued by the possibility of NLP identifying patterns that may elude human coders (without much effort). This ability would need to be replicated in other, larger samples, with an aim both of automation and increasing efficiency of the process. Furthermore, future researchers might also seek to identify discrepancies between the analyzed open-ended responses and the forced-choice items to determine the clinical utility for clinical decision making by the practitioner. Participants are required to translate their judgments of a child’s behavior into the one-dimensional response format to make a singular choice if completing forced-choice items. Kjell et al. (2019) found that using open-ended questions and NLP has the potential of complementing and extending traditional rating scales. The current authors argue that NLP can help translate open-ended or semantic responses with less burden on the respondent and clinician.
A combination of NLP for the labor-intensive tasks (i.e., word count and sentiment extraction) and human coding for the nuanced parts (i.e., contextual analysis) of open-ended response analyses would be a viable path forward as NLP lexicons continue to be refined. Future researchers might consider the real-world feasibility, utility, and practicality of using NLP by clinicians and clinical researchers. Increasing the availability as well as ease of use of NLP algorithms or engines would also be critical.
Conclusion
This proof-of-concept study utilizes NLP to extract sentiment information from open-ended questions in behavioral screening instruments. As the use of machine learning becomes more widely available, clinicians and researchers may nearly effortlessly access a vast lexicon and perform sentiment analysis. Free and open-source software such as R (through RStudio) and Python provide text mining packages that can be easily executed, and there is no shortage of guides and tutorials to assist with obtaining output (e.g., sentiment of a given text using the Bing lexicon). Tools should become increasingly user-friendly and sentiment analysis will likely become more ubiquitously used by researchers and clinicians. Ultimately, the goal of machine learning and NLP should not be to replace or validate human performance but to raise its level.
Supplemental Material
Supplemental material, sj-pdf-1-asm-10.1177_1073191121996466 for Assessment of Agreement Between Human Ratings and Lexicon-Based Sentiment Ratings of Open-Ended Responses on a Behavioral Rating Scale by Olivia Gratz, Duncan Vos, Megan Burke and Neelkamal Soares in Assessment
Footnotes
Declaration of Conflicting Interests
The author(s) declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.
Funding
The author(s) received no financial support for the research, authorship, and/or publication of this article.
Supplemental Material
Supplemental material for this article is available online.
