Abstract
Negative schizotypal traits potentially can be digitally phenotyped using objective vocal analysis. Prior attempts have shown mixed success in this regard, potentially because acoustic analysis has relied on small, constrained feature sets. We employed machine learning to (a) optimize and cross-validate predictive models of self-reported negative schizotypy using a large acoustic feature set, (b) evaluate model performance as a function of sex and speaking task, (c) understand potential mechanisms underlying negative schizotypal traits by evaluating the key acoustic features within these models, and (d) examine model performance in its convergence with clinical symptoms and cognitive functioning. Accuracy was good (> 80%) and was improved by considering speaking task and sex. However, the features identified as most predictive of negative schizotypal traits were generally not considered critical to their conceptual definitions. Implications for validating and implementing digital phenotyping to understand and quantify negative schizotypy are discussed.
Socio-emotional dysfunctions have long been considered a critical risk factor for schizophrenia-spectrum pathology (Brown et al., 2007; Meehl, 1962). Negative schizotypal traits—often defined in terms of social anhedonia, lack of close friends, and constricted affect (American Psychiatric Association, 2013; Raine, 1991)—are thought to be central to these dysfunctions. Negative schizotypy is typically measured using self-report trait questionnaires. Scores from these questionnaires have been consistently associated with concomitant abnormalities in self-reported quality of life (Cohen, Auster, MacAulay, & McGovern, 2014) and cognitive (Chun et al., 2013; Ettinger et al., 2015), emotional (Cohen, Callaway, et al., 2012; Kerns et al., 2008; Li et al., 2019), and social (Horan et al., 2007; Kwapil et al., 2009) functioning. They have also been associated with objective physiological abnormalities (Ettinger et al., 2015; Gooding, 1999; Wang et al., 2018) and the development of Cluster A psychopathology (Blanchard et al., 2011; Gooding et al., 2005; Kwapil et al., 2013). Phenotypes associated with negative schizotypal traits involve, among other things, social behaviors that can potentially be quantified and objectified using in situ digital-phenotyping technologies. These technologies could offer many practical and psychometric advantages over traditional self-report trait questionnaires (Cohen, 2019; Cohen & Elvevåg, 2014) in that they can potentially provide ratio-level data that are continuously recorded as individuals navigate their daily routine.
One approach to digitally phenotype negative schizotypal traits involves quantifying vocal expression using computerized acoustic analysis of natural speech. Vocal expression is an attractive target for measurement because it (a) maps onto the “constricted affect” trait of schizotypy (American Psychiatric Association, 2013); (b) conceptually relates to schizophrenia-spectrum symptoms of blunted vocal affect and alogia (Cohen et al., 2019; Cohen, Schwartz, et al., 2021); (c) reflects a key sociocognitive ability that generally requires psychomotor, social cognitive, and working memory abilities putatively central to negative schizotypy; (d) is based on speech data that can be collected unobtrusively using inexpensive recording technologies; and (e) employs natural language processing, a field of computational linguistics that has been applied to schizophrenia research for more than a decade (Cohen, Auster, McGovern, & MacAulay, 2014; Corcoran et al., 2018; Holshausen et al., 2014; Kuperberg, 2010; Parola et al., 2020). We are aware of seven studies that have examined relationships between self-reported negative schizotypy and acoustic features of speech (Bedwell et al., 2014; Cohen, Auster, McGovern, & MacAulay, 2014; Cohen & Hong, 2011; Cohen, Iglesias, & Minor, 2009; Cohen, Morrison, et al., 2012; Dickey et al., 2012; Tsakanikos & Claridge, 2005). As yet, these relationships have been surprisingly modest, if not negligible, in magnitude, and there have been some counterintuitive findings (for a review of these studies, see Table 1).
Summary of Published Studies Using Acoustic Analysis of Natural Speech in Negative Schizotypy
Note: SPQ = Schizotypal Personality Questionnaire (Raine, 1991); SPQ-CA = Schizotypal Personality Questionnaire–Constricted Affect; SPQ-BR = Schizotypal Personality Questionnaire–Brief Revised (Callaway et al., 2014); SPD = schizotypal personality disorder; O-LIFE = Oxford–Liverpool Inventory of Feelings and Experiences (Mason & Claridge, 2006); r = Pearson’s correlations of schizotypy with the relevant acoustic measure; d = Cohen’s d of high schizotypy group relative to a control group.
There are several considerations when evaluating prior findings. First, nearly all studies have employed small, “conceptually critical” acoustic feature sets, selected according to their conceptual overlap with symptoms of blunted vocal affect (e.g., decreased intonation and emphasis) and alogia (e.g., decreased number of utterances, word counts, pause times). This feature selection approach assumes continuity in phenotypic expression between negative symptoms and negative schizotypy. Although implied in some schizophrenia-spectrum models (Lenzenweger, 2006; Meehl, 1962), it is not clear that this continuity actually exists, and it is not clear that these conceptually critical features are particularly important even in schizophrenia (Cohen, Auster, McGovern, & MacAulay, 2014; Cohen, Cox, et al., 2020; Cohen, Mitchell, et al., 2016; Meaux et al., 2018). Moreover, the reliance on small feature sets (typically on the order of two to 10 features) is a problem in that the human voice is complex and can be quantified in thousands of potentially distinct summary features. Consider that the Institute of Electrical and Electronics Engineers held a series of competitions to predict emotional expression from a corpus of vocal recordings and that the winning algorithm contained more than 6,500 vocal features (for elaboration, see Eyben et al., 2016). This implicates thousands of features that may still be leveraged for prediction. Hence, it could be the case that larger and more comprehensive acoustic feature sets can capture negative schizotypy in ways small feature sets cannot. In the present study, we employed a comprehensive acoustic feature set to derive and cross-validate optimal models of negative schizotypy.
Second, the influences of demographics and speaking task on feature selection are likely significant but are understudied. Sex is associated with considerable variation in vocal expression (Kent & Kim, 2008) and may moderate the relationship between schizotypy and vocal expression. In a study of 44 young adults, Bedwell et al. (2014) observed that relationships between self-reported schizotypy and vocal acoustic anomalies were specific to males—although this was confined to positive and not negative schizotypy traits. Likewise, speaking tasks are associated with considerable and systematic variation in vocal expression. It is well known, for example, that vocal expression changes as a function of cognitive (Cohen, Auster, McGovern, & MacAulay, 2014; Cohen, Dinzeo, et al., 2015), affective (Johnstone et al., 2007; Scherer, 1989; Sobin & Alpert, 1999), social (Cohen, Mitchell, Beck, & Hicks, 2017; Laukka et al., 2008; Parola et al., 2020), and contingency-based (Johnstone et al., 2007) factors. One preliminary study of young adults found that self-reported negative schizotypy was associated with flat and sparse vocal expression, but only when cognitive load of the speaking task was experimentally increased (Cohen, Morrison, et al., 2012).
In the present study, we employed least absolute shrinkage and selection operator (LASSO), a regularized regression procedure, to evaluate a large vocal feature set from archived speech samples procured from young adults. LASSO is an established machine-learning procedure that accommodates large feature sets and is robust to Type 1 errors and overfitting. We sought to accomplish the following:
Aim 1: Optimize and cross-validate predictive models of self-reported negative schizotypy using acoustic features.
Aim 2: Evaluate model performance as a function of sex and speaking task.
Aim 3: Identify the acoustic features critical to optimizing model performance (hence, providing insight into the potential mechanisms underlying negative schizotypy).
Aim 4: Evaluate model performance in its convergence with clinical symptoms and cognitive functioning.
Method
Participants
Analyzed in this study are data for 515 undergraduate adult students enrolled in large universities in Louisiana and New Jersey (for details, see Fig. S1 and Tables S1 and S2 in the Supplemental Material available online). Collectively, these participants provided 5,191 usable audio recordings, aggregated from seven studies over the course of a decade at Louisiana State University. Cases were included in the present study if they had Schizotypal Personality Questionnaire (SPQ) scores, demographic information (age range = 18–30; to avoid potential confounds associated with older age), and sufficient speech data (detailed below) for analysis. Information about data availability is included in Tables S1 and S2 in the Supplemental Material. Participants were, on average, 19.33 years old (SD = 1.70; range = 18.00–30.00) and had the following average SPQ scores: positive (M = 1.72, SD = 1.02; range = 0.00–3.48; possible range = 0.00–4.00), negative (M = 1.78, SD = 1.20; range = 0–4.00; possible range = 0–4.00), and disorganized (M = 1.96, SD = 1.2; range = 0.00–4.00; possible range = 0.00–4.00). The sample was 42% male and 58% female, and the majority were White (77%) or Black/African American (12%). The remaining 11% were Asian, Native American, or undisclosed/missing demographic data. For an overview of our general methodology and critical terms used in this study, see Fig. S1 in the Supplemental Material.
Self-reported schizotypal traits
Schizotypy was measured using various iterations of the SPQ, including the full version (k items = 72; Raine, 1991), the brief version (k items = 22; Raine & Benishay, 1995), and the brief revised (SPQ-BR) version (k items = 32; Callaway et al., 2014). For all versions, we employed a 5-point Likert response format scale from 0 (strongly disagree) to 4 (strongly agree) with computerized administration. For each measure, we used the superordinate Positive (i.e., comprising items from the ideas of reference/suspiciousness, odd beliefs/magical thinking, unusual perception subordinate scales), Negative (i.e., comprising items from the no close friends and constricted affect subordinate scales), and Disorganization (i.e., comprising items from the odd speech and eccentric behavior subordinate scales) factor solutions on the basis of relevant factor analyses from our lab (Callaway et al., 2014; Cohen et al., 2010). Scale items are included in Table S3 in the Supplemental Material.
Our preliminary machine-learning procedures focused on binary categorization using a score of 3. This score reflects an average of agree or greater on each individual negative schizotypy item, which was important given that schizotypy items (and hence, descriptive statistics) varied across studies. Binarization also helped to remove potentially ambiguous cases when building our model and helped to define the extreme ends of the continuum when applied to individual scores. Our use of a binary criterion is not meant to imply that the symptom is binary in nature because modeling allows a “degree of fit” (i.e., predicted negative schizotypy) score that is continuous in nature.
Speaking tasks
Although the specifics of the speaking tasks varied across the seven studies, there were two general types of tasks examined here. The first, a picture task, involved having participants discuss their thoughts and feelings to visual pictures displayed on a computer screen for 20 s each. Visual stimuli were positive-, negative-, and neutral-valence images from the International Affective Picture System (IAPS; Lang et al., 1997). The valence blocks were presented in a random valence order. The second speech task involved participants providing free-recall speech describing their daily routines, hobbies, living situations, and autobiographical memories for either 60 s or 90 s, depending on the study. Participants were encouraged to select self-relevant and important topics, and they were provided with exemplar probes. Note that negative schizotypy and vocal characteristics generally did not vary in their relationships as a function of speaking prompts or IAPS condition in our prior studies (for references, see Table 1; for an exception, see Cohen, Morrison, et al., 2012; for a lack of replication, see Cohen, Auster, McGovern, & MacAulay, 2014). Administration was standardized such that for all tasks, instructions and stimuli presentation were automated on a computer and participants were encouraged to speak as much as possible. Research assistants were present in the room and read instructions to participants but refrained from speaking during the vocal-recording process.
Acoustic analysis
Two different software packages were employed in this study. The first is designed to capture relatively global features conceptually relevant in psychiatric symptoms and reflects an iteration of the VOXCOM system developed by Murray Alpert (Alpert et al., 1986). This iteration, called the Computerized Assessment of Affect from Natural Speech (CANS) system (Cohen & Hong, 2011; Cohen, Minor, et al., 2009), involves extracting basic acoustic properties from recordings and analyzing them within vocal utterances (defined as silence bounded by 150+ ms). Support for the CANS comes from more than a dozen studies, including psychometric evaluation in 1,350 nonpsychiatric adults (Cohen, Renshaw, et al., 2016) and 309 patients with severe mental illnesses (Cohen, Mitchell, et al., 2016). The CANS feature set includes 46 distinct acoustic features related to speech production (e.g., number of utterances, average pause length) and speech variability (e.g., intonation, emphasis). The second system captures more basic, psychophysically complex features relevant to affective science and is dubbed the Extended Geneva Minimalist Acoustic Parameter Set (GeMAPS; Eyben et al., 2016). GeMAPS was derived using machine-learning-based feature-reduction procedures on a large feature set as part of the INTERSPEECH competitions from 2009 to 2013. GeMAPS analyzes 88 distinct features. Validity for this feature set for predicting emotional expressive states in demographically diverse clinical and nonclinical samples can be found elsewhere (e.g., Eyben et al., 2016).
For data-reduction purposes, we organized the GeMAPS and CANS features into a limited set of conceptually distinct functional categories using principal components analysis (PCA) and interpreted the conceptual and functional properties of these features. PCA of the 134 features explained only about 60% of potential variance from 25 factors, so we further refined this factor solution according to conceptual and functional commonalities between features. Intraclass correlation coefficients were used to evaluate internal consistency of these categories. Our final solution included five categories (see Table S3 in the Supplemental Material). Internal consistency for each category was adequate (αs > .55). Categories included “intensity/amplitude” (k items = 25, α = .87), “formant frequency” (k items = 32, α = .59), “fundamental frequency” (k items = 20, α = .87), “spectral” (k = 37, α = .69), and “pause/utterance properties” (k items = 20, α = .77). The latter category most closely approximates symptoms of alogia. Descriptions of these categories are included in Table 3.
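The internal-consistency check for a functional category can be illustrated with a short sketch. It reads the α values reported above as Cronbach's alpha, which is an assumption, and it z-scores each feature first because acoustic features sit on very different scales (also an assumption). The function names `zscore` and `cronbach_alpha` are hypothetical and are not taken from the study's analysis code.

```python
from statistics import mean, stdev, variance


def zscore(values):
    """Standardize one feature column to mean 0 and sample SD 1."""
    m, s = mean(values), stdev(values)
    if s == 0:
        raise ValueError("feature has zero variance and cannot be standardized")
    return [(v - m) / s for v in values]


def cronbach_alpha(feature_columns):
    """Cronbach's alpha for a category of features.

    feature_columns: list of columns, one per acoustic feature, each holding
    that feature's values over the same recordings (same order, same length).
    """
    k = len(feature_columns)
    if k < 2:
        raise ValueError("alpha needs at least two features")
    columns = [zscore(col) for col in feature_columns]
    n = len(columns[0])
    # Composite score per recording: sum of the standardized features.
    totals = [sum(col[i] for col in columns) for i in range(n)]
    item_variance = sum(variance(col) for col in columns)
    total_variance = variance(totals)
    return (k / (k - 1)) * (1 - item_variance / total_variance)
```

Features that move together perfectly give α = 1 regardless of their original units, and α falls as the features in a category become less correlated.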
To support exploratory analysis (for elaboration, see Results section), we identified three features from the CANS deemed conceptually critical to the operational definitions of negative symptoms of schizophrenia and, hence, of potential importance to negative schizotypy. These included two features reflecting blunted vocal affect (i.e., intonation: computed as the average of standard deviation of fundamental frequency values computed within each utterance; emphasis: computed as the average of standard deviation of intensity/volume values computed within each utterance) and one related to alogia (i.e., number of utterances: number of consecutively voiced frames bounded on either side by silence). These features are by no means comprehensive in their conceptual coverage but reflect “face-valid” proxies of their respective constructs, which have been identified in PCA of nonpsychiatric and psychiatric samples (Cohen, Mitchell, et al., 2016; Cohen, Renshaw, et al., 2016).
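The three conceptually critical features follow directly from their verbal definitions, as the sketch below shows. It assumes each recording has already been segmented into utterances and that each utterance carries frame-level fundamental frequency and intensity values. The dictionary keys `f0` and `intensity`, the function name, and the use of the sample standard deviation are illustrative assumptions. This is not the CANS implementation.

```python
from statistics import mean, stdev


def conceptual_features(utterances):
    """Compute intonation, emphasis, and utterance count for one recording.

    utterances: list of dicts, each with frame-level lists under the
    (hypothetical) keys "f0" (fundamental frequency) and "intensity".
    """
    # Within-utterance variability; utterances with fewer than two frames
    # have no defined sample SD and are skipped for the variability features.
    f0_sds = [stdev(u["f0"]) for u in utterances if len(u["f0"]) >= 2]
    intensity_sds = [stdev(u["intensity"]) for u in utterances if len(u["intensity"]) >= 2]
    return {
        # Intonation: average of the within-utterance SDs of F0.
        "intonation": mean(f0_sds) if f0_sds else None,
        # Emphasis: average of the within-utterance SDs of intensity.
        "emphasis": mean(intensity_sds) if intensity_sds else None,
        # Alogia proxy: number of utterances in the recording.
        "n_utterances": len(utterances),
    }
```

Averaging the within-utterance standard deviations, rather than taking one standard deviation over the whole recording, keeps differences between utterances (e.g., a louder second sentence) from being counted as intonation or emphasis.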
Clinical symptoms, cognitive functioning, and inpatient psychiatric hospitalization history
Clinical symptoms were measured using the 53-item self-report Brief Symptom Inventory (BSI; Derogatis & Melisaratos, 1983). Symptoms were rated on a 5-point Likert scale reflecting the prior 2-week epoch using an online survey completed concurrently with the SPQ. Increasing scores reflect increasing symptom severity. Cognitive functioning was measured with the Repeatable Battery for the Assessment of Neuropsychological Status (RBANS; Randolph et al., 1998). Index scores for the Immediate Memory, Visual Construction, Language, Attention, and Delayed Memory tests were examined here. The BSI and RBANS have been used in hundreds of published research studies and have demonstrated good reliability and convergent validity for their respective purposes.
Statistical analyses
Our analyses addressed four aims. First, we sought to optimize and cross-validate predictive models of self-reported negative schizotypy using acoustic features. The machine-learning procedures used for these analyses are described in the next section. Second, we sought to evaluate model performance as a function of sex (male, female) and speaking task (picture and free-recall tasks). Note that for the latter, we examined all recordings (i.e., those from the picture and free-recall tasks), as opposed to those in the picture task only, because we lacked sufficient samples for modeling free-recall samples independently. As part of this process, we cross-validated models built on one criterion (e.g., men, picture task) to evaluate their accuracy in predicting a different criterion (e.g., women, free-recall task). Third, we evaluated the individual acoustic features critical to these various models to help provide insight into potential mechanisms underlying negative schizotypy. This was done by inspecting model weights and correlations, by using stability selection (see next section), and by examining the top 10 features for each model. Because conceptually related features often correlate highly with each other, we examined the functional category of each item, as defined in the Acoustic Analysis section.
Finally, we sought to examine the optimal algorithm with respect to prediction of clinical symptoms and cognitive functioning. We used the models derived from the first two aims to compute predicted machine-learning-based scores for each vocal sample. These scores were then averaged across speech samples within individual participants and examined in their convergence with clinical symptoms (i.e., BSI scores) and cognitive functioning (i.e., RBANS total scores) using bivariate correlations. We hypothesized that predicted scores (i.e., those derived from the machine-learning model) would be significantly correlated with clinical symptoms and cognitive functioning given that these models are built to approximate, as closely as possible, the self-reported negative schizotypy ratings. Exploratory analyses, involving conceptually critical acoustic features because of their overlap with operational definitions of alogia and blunted vocal affect, were conducted and are explained in the exploratory results and the Acoustic Analysis method sections. Acoustic data were truncated at ±3.5 SD before being analyzed. This was conducted primarily to reduce the impact of extreme values from background noise that our filtering and other data-processing procedures missed. Analyses were conducted in the R software environment (Version 4.0.2; R Core Team, 2020) using the glmnet (Version 4.0-2; Friedman et al., 2020) and stabs (Version 0.6-3; Hofner & Hothorn, 2017) packages.
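Two of the data-handling steps in this section are sketched below: truncating a feature at ±3.5 SD and averaging predicted scores within participants. The sketch reads "truncated" as clipping extreme values to the boundary; the text does not specify this, so dropping them instead would be an equally plausible reading. The function names are hypothetical, and the study's own analyses were run in R.

```python
from collections import defaultdict
from statistics import mean, stdev


def truncate_at_sd(values, limit=3.5):
    """Clip values lying more than `limit` sample SDs from the mean."""
    m, s = mean(values), stdev(values)
    low, high = m - limit * s, m + limit * s
    return [min(max(v, low), high) for v in values]


def participant_means(scores, participant_ids):
    """Average recording-level scores within each participant."""
    grouped = defaultdict(list)
    for pid, score in zip(participant_ids, scores):
        grouped[pid].append(score)
    return {pid: mean(vals) for pid, vals in grouped.items()}
```

For example, `participant_means([1.0, 3.0, 5.0], ["a", "a", "b"])` yields one score per participant, which is the level at which the bivariate correlations with BSI and RBANS scores are computed.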
Machine learning
We used LASSO regularized regression with fivefold cross-validation. LASSO regularized regression has two objectives. The first, as in all models, is to minimize the difference between the true values and predicted values of the dependent variable. The second objective is to minimize some function of the model coefficients themselves. In the case of LASSO, the function is the sum of the absolute value of the model coefficients. This term is minimized when all coefficients are set to zero, but such a model completely disregards the data and so will fail to accurately predict the dependent variable. The right balance of regularization and predictive accuracy is different for each data set, and so we estimated this balance through cross-validation to obtain a model with many weights set to zero. By explicitly selecting highly informative features and constraining model degrees of freedom, LASSO can produce models that generalize well to new data. The strength of the LASSO regularization penalty was tuned to each training set with cross-validation. This is nested within the procedure described above and, importantly, never includes examples included in the test set. The penalty is refit to each training set so that the definition of the penalty is completely uninfluenced by aspects of the current test set.
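The two objectives described above can be made concrete with a minimal coordinate-descent sketch. To stay short, it uses a squared-error loss with standardized features and no intercept. A binary outcome like the one in this study would more typically be fit with a penalized logistic model, so this should be read as an illustration of the penalty's behavior rather than as the procedure glmnet runs. All function and variable names are hypothetical.

```python
def soft_threshold(z, lam):
    """Shrink z toward zero by lam; anything within [-lam, lam] becomes 0."""
    if z > lam:
        return z - lam
    if z < -lam:
        return z + lam
    return 0.0


def lasso_coordinate_descent(X, y, lam, n_iter=200):
    """Minimize (1/(2n)) * sum((y - X b)^2) + lam * sum(|b_j|).

    X: list of rows (one per recording), y: list of outcomes,
    lam: strength of the L1 penalty (larger values zero out more weights).
    """
    n, p = len(X), len(X[0])
    beta = [0.0] * p
    residual = list(y)  # equals y - X @ beta while beta is all zeros
    col_sq = [sum(row[j] ** 2 for row in X) / n for j in range(p)]
    for _ in range(n_iter):
        for j in range(p):
            if col_sq[j] == 0.0:
                continue
            # Covariance of feature j with the residual that ignores feature j.
            rho = sum(X[i][j] * (residual[i] + X[i][j] * beta[j]) for i in range(n)) / n
            new_b = soft_threshold(rho, lam) / col_sq[j]
            delta = new_b - beta[j]
            if delta != 0.0:
                for i in range(n):
                    residual[i] -= X[i][j] * delta
                beta[j] = new_b
    return beta
```

The soft-threshold step is what sets weights exactly to zero: a feature whose covariance with the current residual does not exceed `lam` drops out of the model. In a nested scheme such as the one described above, `lam` would be chosen by cross-validation within each training set only.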
Each round of testing will produce a different model, which means potentially a different selection of features. After determining that machine-learning models can generalize to untrained cases, follow-up analysis of important features was carried out according to models’ fit to all cases, and stability selection was employed. Stability selection is a subsampling procedure that resembles bootstrapping (Meinshausen & Bühlmann, 2010; Shah & Samworth, 2013). By training on thousands of random subsets of cases, we can estimate feature probabilities given different amounts of regularization. Sets of features identified through stability selection are reliably important over many sets of cases. The inclusion threshold is set to control the familywise error rate.
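The subsampling logic of stability selection can be sketched as follows. The sketch refits a small squared-error LASSO on random half-samples and records how often each feature keeps a nonzero weight. It assumes roughly centered data, a single fixed penalty, and an arbitrary inclusion threshold. The stabs package instead derives the threshold from an error bound, which is not reproduced here. All names are hypothetical.

```python
import random


def soft_threshold(z, lam):
    """Shrink z toward zero by lam; anything within [-lam, lam] becomes 0."""
    if z > lam:
        return z - lam
    if z < -lam:
        return z + lam
    return 0.0


def lasso_fit(X, y, lam, n_iter=50):
    """Squared-error LASSO by cyclic coordinate descent (no intercept)."""
    n, p = len(X), len(X[0])
    beta = [0.0] * p
    residual = list(y)
    col_sq = [sum(row[j] ** 2 for row in X) / n for j in range(p)]
    for _ in range(n_iter):
        for j in range(p):
            if col_sq[j] == 0.0:
                continue
            rho = sum(X[i][j] * (residual[i] + X[i][j] * beta[j]) for i in range(n)) / n
            new_b = soft_threshold(rho, lam) / col_sq[j]
            delta = new_b - beta[j]
            if delta != 0.0:
                for i in range(n):
                    residual[i] -= X[i][j] * delta
                beta[j] = new_b
    return beta


def stability_selection(X, y, lam, n_subsamples=100, threshold=0.6, seed=0):
    """Return per-feature selection frequencies and the stable feature set."""
    rng = random.Random(seed)
    n, p = len(X), len(X[0])
    counts = [0] * p
    for _ in range(n_subsamples):
        # Draw half of the cases without replacement and refit the model.
        idx = rng.sample(range(n), n // 2)
        beta = lasso_fit([X[i] for i in idx], [y[i] for i in idx], lam)
        for j in range(p):
            if beta[j] != 0.0:
                counts[j] += 1
    freqs = [c / n_subsamples for c in counts]
    selected = [j for j, f in enumerate(freqs) if f >= threshold]
    return freqs, selected
```

A feature that is selected in nearly every half-sample is unlikely to owe its selection to a particular set of cases, which is the sense in which the retained features are "reliably important."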
The ratio of high to low negative schizotypy cases in each subsample was matched across the folds in the cross-validation process. Then, a test set was formed by selecting one of these subsamples, and a training set was formed by combining the remaining subsamples. A model was fit to the training set and evaluated on the test set. This was repeated so that each of the five subsamples was used as the test set. We report hit-rate and correct-rejection values that have been averaged over these five folds. We report accuracy as the sum of the hit rate and correct-rejection rate divided by 2 so that 0.5 corresponds to random performance.
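The fold construction and the accuracy metric described above can be sketched as follows. The sketch assumes binary labels coded 1 (high negative schizotypy) and 0 (low), and it balances classes across folds by dealing shuffled cases of each class in turn. Function names are hypothetical, and the study's own folds were built in R.

```python
import random


def stratified_folds(labels, k=5, seed=0):
    """Split case indices into k folds with a matched class ratio in each."""
    rng = random.Random(seed)
    folds = [[] for _ in range(k)]
    for cls in sorted(set(labels)):
        idx = [i for i, lab in enumerate(labels) if lab == cls]
        rng.shuffle(idx)
        # Deal the shuffled cases of this class round-robin across folds.
        for pos, i in enumerate(idx):
            folds[pos % k].append(i)
    return folds


def balanced_accuracy(y_true, y_pred):
    """Return (hit rate, correct-rejection rate, their average)."""
    positives = sum(1 for t in y_true if t == 1)
    negatives = sum(1 for t in y_true if t == 0)
    hits = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    rejections = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    hit_rate = hits / positives
    rejection_rate = rejections / negatives
    return hit_rate, rejection_rate, (hit_rate + rejection_rate) / 2
```

Averaging the two rates is what anchors chance at 0.5: a model that labels every case as low scores a hit rate of 0 and a correct-rejection rate of 1, for an accuracy of 0.5, however rare the high group is.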
Results
Preliminary analyses
Self-reported negative schizotypy scores for men (n = 216; M = 1.56; SD = 1.08) and women (n = 299; M = 1.57, SD = 1.09) did not differ in either continuous, t(515) = 0.11, p = .91, d = 0.01, or binary, χ²(1, N = 515) = 0.72, p = .40, formats. Self-reported ethnicity did not differ with respect to continuous, F(2, 512) = 0.12, p = .73, or binary, χ²(1, N = 515) = 1.33, p = .53, self-reported negative schizotypy scores. Most of the acoustic feature values (averaged across all recordings per participant) were significantly different between men and women, reflecting 78% (68 of 88) of GeMAPS and 74% of CANS (37 of 46) features. Negative schizotypy scores were not significantly correlated with age, |r(513)s| < .05, ps > .55, either in binary or continuous form.
Aim 1: optimize and cross-validate predictive models of negative schizotypy using acoustic features
Average accuracy for predicting high negative schizotypy was good across the five analytic folds, and the overall range was 80% to 91% for the training sets (see Table 2). There are two important things to note. First, training and test-set accuracy were generally very close to each other, which suggests that overfitting was not a major concern. Second, accuracy scores reflected comparable hit rates and correct rejections, which suggests that our models were reasonably balanced in sensitivity and specificity.
Summary of Machine-Learning-Based Analyses, Predicting Self-Reported Trait Negative Schizotypy
Note: k = number of samples included in each category; neg = negative schizotypy is below the cutoff to be considered present; pos = negative schizotypy is considered present.
Predicted (i.e., a continuous, linear score predicted from the machine-learning algorithm) and self-report-based negative schizotypy scores were significantly correlated with each other, r(513) = .49, p < .001. Although the magnitude of this correlation may seem modest given the high accuracy reported in the machine-learning models, it is important to consider that acoustic features were modeled on binary negative schizotypy scores, not continuous ones. For the solutions for the machine-learning models as well as the functional categories of the various features, see Table S4 in the Supplemental Material.
Aim 2: evaluate model performance as a function of sex and speaking task
Models constrained by sex (i.e., run independently for men and women) showed improved accuracy over models that were not; those including only men showed the highest accuracy. Likewise, models constrained by task showed improved accuracy over those that were not. Constraining by both task and sex showed the highest accuracy overall.
For the most part, the individual models failed to generalize as a function of sex and speaking task (see Table S5 in the Supplemental Material). When a model was built on all speech tasks and applied to only data from the free-speech task, accuracy dropped to 59%. A model built on the picture task and applied to only data from the free-speech task performed near chance (i.e., 52%). Likewise, accuracy for models built on men and then applied to women was poor (i.e., 60%; 52% for vice versa).
Although we lacked sufficient diversity to evaluate model generalization as a function of ethnicity, self-reported ethnicity did not significantly differ with respect to predicted negative schizotypy, F(1, 224) = 0.15, p = .70.
Aim 3: identify the acoustic features that differ as a function of sex and speaking task
For the features identified through stability selection, see Table S6 in the Supplemental Material. Several findings are notable. First, only three of the 30 total features identified across the six models were directly related to the operational definitions of alogia and blunted affect (i.e., “utterance properties,” “pause properties,” “intensity/amplitude,” or “fundamental frequency”). Rather, the features reflected relatively comprehensive spectral components, which concern much more comprehensive measures of vocal frequencies, formants, harmonics, tonal qualities, and intensity. Second, there were few overlapping features between men and women (e.g., < 20% overlapping across any models). Third, the features identified through stability selection were generally sufficient, in and of themselves, for accurately predicting self-reported negative schizotypy for men. When LASSO models were recomputed using just those features selected through stability selection, accuracy for Models 2 and 5 was 81% (for both test and training models) and 79% (for both test and training models), respectively. For women (and combined samples), accuracy for the models did not exceed 66%. This suggests that the models for men were accurate with fewer features than models for women.
Evaluating How Acoustic Features Differ Across the Six Machine-Learning Models by Comparing Functional Categories of the Top 10 Acoustic Features in Each Model
Note: dB = decibel; MFCC = Mel frequency cepstral coefficients.
For the functional categories of the top 10 features from each of these models, see Table 3. Examining functional category, as opposed to the individual features, is informative because of potential high correlations between features within the same functional category and potential coefficient instability across bootstrapping iterations of the model. As above, the vast majority of features were not directly related to alogia or blunted vocal affect but were related, instead, to more comprehensive aspects of vocal signal. The one exception was that 20% to 40% of the features for women were related to intensity/amplitude. As above, there was considerable variability between men and women in the functional categories.
Aim 4: examine the optimal algorithm with respect to prediction of clinical symptoms and cognitive functioning
Per the bivariate correlation analyses, self-reported negative schizotypy scores were significantly associated with greater severity for virtually all self-report symptom measures (e.g., positive and disorganized schizotypy and BSI psychoticism, paranoia, depression, and anxiety; rs with BSI measures range = .36–.68; mean r = .54; see Table 4). In contrast, predicted negative schizotypy was not highly related to positive and disorganized schizotypal traits (i.e., rs = .14 and .11, respectively) but was significantly correlated (rs > .30, ps < .05) with BSI psychoticism and depression symptoms. It is noteworthy that depression, a construct that overlaps with negative schizotypy (Lewandowski et al., 2006), was less related to predicted than self-reported negative schizotypy scores (rs = .37 vs. .62). Neither self-reported nor predicted negative schizotypy was significantly related to cognitive functioning.
Bivariate Correlations Between Self-Report and Predicted Negative Schizotypy and Clinical Symptoms and Cognitive Functioning
Note: Conceptual-based acoustic features are included to inform exploratory analyses. SPQ = Schizotypal Personality Questionnaire.
p < .10. *p < .05.
Exploratory follow-up analyses: examine conceptual-based acoustic measures of blunted affect and alogia
The lack of features related to alogia and blunted vocal affect in the predicted models of negative schizotypy was unexpected. To explore this, we evaluated relationships between conceptually critical acoustic measures (described in the Method section) and the self-report and predicted measures of negative schizotypy. We also evaluated their relative contributions to clinical symptoms and cognitive functioning using regressions (see Table S7 in the Supplemental Material).
Self-reported negative schizotypy was significantly but modestly associated with reduced speech production (i.e., lower utterance number) but not intonation or emphasis (Table 4). Predicted negative schizotypy was significantly associated with reduced speech production and increased variability in intensity. Conceptually critical measures were not significantly related to BSI symptoms but were related to cognitive functioning. Reduced speech production was significantly associated with poorer immediate memory and language performance (and attention at a trend level). Reduced emphasis was associated with poorer immediate memory performance. Collectively, conceptual-based measures explained 8% of the variance in immediate memory beyond the 3% explained by self-report and predicted measures of schizotypy (see Table S7 in the Supplemental Material). They similarly explained between 2% and 6% of the variance in other cognitive functions.
Exploratory follow-up analyses: subcomponents of negative schizotypy
Note that the subcomponents of negative schizotypy (i.e., constricted affect and no close friends) overlapped heavily. The two subscales were highly correlated in this study, r(513) = .78, p < .001, and were similarly represented in the superordinate negative schizotypy scale, rs(513) = .93 and .95, respectively, ps < .001. Moreover, predicted negative schizotypy scores similarly tapped both the Constricted Affect, r(513) = .49, p < .001, and No Close Friends, r(513) = .43, p < .001, scales.
Discussion
In this study, we performed LASSO regression on a comprehensive feature set to optimize the prediction of negative symptoms with acoustic data. We used the largest acoustic feature set to date, accommodated by a machine-learning approach chosen to maximize generalizability. Four findings were notable. First, models were able to achieve high accuracy for predicting negative schizotypy relative to previous studies. Second, accuracy was improved when accounting for sex and speaking task. Third, and paradoxically, the acoustic features most predictive were generally not conceptually critical to the operational definitions of negative schizophrenia symptoms (i.e., blunted affect and alogia). Finally, both self-reported negative schizotypy and machine-learning-based scores were associated with a range of self-reported psychopathology symptoms, but neither was related to cognitive functioning. Implications of these findings are discussed below.
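For readers unfamiliar with the approach, the general logic of LASSO regression with cross-validation can be sketched as follows. This is an illustrative sketch only and does not reproduce the study's pipeline: it assumes a continuous outcome, uses a plain coordinate-descent solver written for the example, and runs on simulated data in which the number of "speakers", the number of "features", the penalty grid, and all function names are hypothetical.

```python
import random

def soft_threshold(z, lam):
    """Shrink z toward zero by lam; values within [-lam, lam] become exactly 0."""
    if z > lam:
        return z - lam
    if z < -lam:
        return z + lam
    return 0.0

def lasso_fit(X, y, lam, n_sweeps=50):
    """Coordinate descent for (1/2n)*||y - X b||^2 + lam*||b||_1 (no intercept)."""
    n, p = len(X), len(X[0])
    beta = [0.0] * p
    resid = list(y)  # residuals when all coefficients are zero
    col_ms = [sum(row[j] ** 2 for row in X) / n for j in range(p)]
    for _ in range(n_sweeps):
        for j in range(p):
            if col_ms[j] == 0.0:
                continue
            # Correlation of feature j with the residual that excludes feature j.
            rho = sum(X[i][j] * (resid[i] + X[i][j] * beta[j]) for i in range(n)) / n
            new_bj = soft_threshold(rho, lam) / col_ms[j]
            step = new_bj - beta[j]
            if step != 0.0:
                for i in range(n):
                    resid[i] -= X[i][j] * step
                beta[j] = new_bj
    return beta

def predict(X, beta):
    return [sum(b * x for b, x in zip(beta, row)) for row in X]

def mse(y, yhat):
    return sum((a - b) ** 2 for a, b in zip(y, yhat)) / len(y)

def cv_choose_lambda(X, y, lambdas, k=5):
    """Return the penalty with the lowest mean held-out MSE across k folds."""
    n = len(y)
    best_lam, best_err = None, float("inf")
    for lam in lambdas:
        errs = []
        for fold in range(k):
            test_idx = set(range(fold, n, k))
            X_tr = [X[i] for i in range(n) if i not in test_idx]
            y_tr = [y[i] for i in range(n) if i not in test_idx]
            X_te = [X[i] for i in sorted(test_idx)]
            y_te = [y[i] for i in sorted(test_idx)]
            errs.append(mse(y_te, predict(X_te, lasso_fit(X_tr, y_tr, lam))))
        mean_err = sum(errs) / k
        if mean_err < best_err:
            best_lam, best_err = lam, mean_err
    return best_lam

# Simulated stand-in data (NOT the study's data): 150 "speakers" and 15 "features",
# only the first three of which are truly related to the outcome.
rng = random.Random(0)
n, p = 150, 15
true_beta = [2.0, -1.5, 1.0] + [0.0] * (p - 3)
X = [[rng.gauss(0, 1) for _ in range(p)] for _ in range(n)]
y = [sum(b * x for b, x in zip(true_beta, row)) + rng.gauss(0, 0.5) for row in X]

penalty_grid = [0.05, 0.2, 0.8]  # hypothetical grid
lam = cv_choose_lambda(X, y, penalty_grid)
beta = lasso_fit(X, y, lam)
selected = [j for j, b in enumerate(beta) if b != 0.0]
print("chosen penalty:", lam)
print("features kept:", selected)
```

The L1 penalty sets many coefficients exactly to zero, which is what makes the method practical for very large feature sets, and choosing the penalty by held-out error is one common way to favor models that generalize. With real acoustic data, features would normally be standardized before fitting.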
From a practical perspective, the present study provides valuable proof of concept for digital phenotyping of negative schizotypy. Self-report is largely considered a reliable, valid, and practical method for measuring schizotypy, but it has inherent limitations. Clinical-rating-based approaches have been used in the general population (Loranger et al., 1997), but they tend to require substantial resources and have inherent limitations of their own (Cohen, 2019; Cohen, Schwartz, et al., 2021). Several research groups in this area have proposed that objective modalities could be used to enhance or replace existing measures (Bedwell et al., 2006; Gooding et al., 2006; Lenzenweger et al., 2007; Minor & Cohen, 2012). Acoustic analysis is potentially advantageous because vocal data can be procured from many different media (e.g., videos, telephone, ambient recording) as individuals navigate their daily routine. This potentially allows components of negative schizotypy to be tracked over time and space, thus greatly enhancing the ecological validity of schizotypy assessment over traditional interview or single-administration “trait” measures (Cohen & Elvevåg, 2014). Although self-report ecological momentary assessment solutions for measuring state negative schizotypy have been advanced (Brown et al., 2007), acoustic analyses (and other digital-phenotyping technologies) offer potentially greater precision for measuring changes in signal because they employ active/passive recording technologies, yield ratio-level data, and are amenable to multimodal and high-dimensional data analytic approaches. The present findings are similar to those of a study of patients with serious mental illness that used acoustic features to predict negative symptom clinical ratings (Cohen, Cox, et al., 2020). In that study, we achieved accuracy rates similar to those in the present study (generally, > 80%; in some cases, > 90%), which suggests that acoustic-based digital phenotyping is appropriate for measuring negative traits/symptoms across the schizophrenia spectrum.
The present study also highlights potential complexities in implementing acoustic analytic technologies. First, our findings demonstrate the importance of accounting for demographic variables when modeling self-reported negative schizotypy. Many schizotypy phenotypes (e.g., for a meta-analysis of self-report scales, see Miettunen & Jääskeläinen, 2010) are quantitatively different between men and women, and a prior study found that sex moderated relationships between schizotypy and vocal acoustics (Bedwell et al., 2014). In the present study, directly testing a schizotypy by sex moderation on vocal acoustics was impractical given the large feature set, but there was evidence for moderation in that (a) features tended to differ between men and women, (b) models did not generalize well across sex, and (c) the critical features for predicting negative schizotypy were different between men and women. Potential (and non-mutually exclusive) explanations for sex differences in schizophrenia involve hormonal (Rao & Kölsch, 2003), neurostructural (Goldstein et al., 2002), and psychosocial (Abel et al., 2010; Read et al., 2001) systems. Differentiating whether the effects reflect sex or gender is also important. Regarding the latter, it is possible that differences in schizotypal traits reflect systematic biases in how the items are interpreted or reported by men and women. The SPQ-BR item, “I rarely laugh and smile,” for example, may be evaluated using different socio-normative comparison groups in men compared with women, at least, as a group. The present findings point to another possibility: that raters judge different objective characteristics (e.g., vocal acoustic qualities) as being more indicative of schizotypy for certain sexes even if they are judging themselves. Understanding the reasons underlying sex differences in schizotypy phenotypic expression is an important topic for future research.
In the present study, we built our models on a well-validated self-report trait measure of schizotypy. Although our models demonstrated reasonable convergent validity with both self-reported schizotypal trait and other measures (e.g., BSI psychoticism, depression), neither the self-reported nor predicted measures of schizotypy were associated with their expected acoustic features, at least, those conceptually critical to definitions of alogia and blunt affect in schizophrenia. Moreover, neither was related to cognitive functioning as might be expected given neurodevelopmental theories of schizotypy (Cohen, Dinzeo, et al., 2015; Ettinger et al., 2015; Meehl, 1962; but for a meta-analysis that reported general null findings, see Chun et al., 2013). These null findings highlight two conceptual points. First, it is unclear to what degree self-report trait schizotypy measures should be used as a “ground truth” criterion and, if not, what should or could be used to improve the model’s construct validity. Unlike schizophrenia, schizotypy lacks a consensus operational definition to disambiguate cases. Necessary and sufficient components of schizotypy have been proposed, for example, as reflected in “schizotaxia” (Meehl, 1962), although their application has largely been in supporting a conceptual framework as opposed to defining individual cases for model building. Further work in this regard is needed, particularly given evidence that subjective and objective aspects of schizotypy do not necessarily converge but both are potentially clinically important (Cohen, Mitchell, Beck, & Hicks, 2017). Second, to the degree that a criterion contains gender, age, and cultural biases, an “accurate” model will emulate those biases. It is important to appreciate that “objective” technologies are not necessarily unbiased if validated against a biased criterion.
The present findings offer insight into the acoustic features underlying self-reported negative schizotypy. Features related to Mel-frequency cepstral coefficients (MFCCs), spectral properties, and formant frequencies (i.e., F1, F2, and F3 values) were particularly important given that they represented at least half of the top features in our predictive models. These features have not been typically examined in the context of either schizophrenia or schizotypy (see Table 1). They reflect the spectral quality and richness of speech and involve a much broader summary of frequency and intensity than the features examined in prior schizotypy studies. Spectral measures involve coordination between the vocal tract and vocal folds and the shaping of sounds with the mouth and tongue. MFCC values have become particularly important for speech and music recognition systems and figure prominently in machine-learning-based applications of acoustic features more generally (Frühholz & Belin, 2018; Ittichaichareon et al., 2012). Their relative importance in the models probably provides less insight into the pathophysiology of negative schizotypy per se and more insight about the acoustic features that individuals use to self-evaluate when completing self-report trait questionnaires. From this perspective, it makes sense that relatively global features are critical for modeling purposes given that they are probably more readily accessible and interpretable than more basic features related to pause length, intonation, and emphasis (at least, as defined in prior studies).
We found similar results in a study using acoustic features to predict clinical ratings in patients with serious mental illness; conceptually critical features were not particularly important for modeling purposes and were more highly related to cognitive functioning than either clinical ratings or predicted scores (Cohen, Cox, et al., 2020). LASSO regression, which has been used to identify predictors of schizophrenia-spectrum disorders that are not conceptually related to their diagnosis (Ciarleglio et al., 2019), is advantageous precisely because of its ability to identify significant features that may not otherwise be evaluated. Regardless of whether these “high-level” spectral acoustic features are important for negative schizotypal traits or whether they are simply important for participants’ self-evaluation of them, their inclusion in “big data” models of schizotypy is warranted by their clear potential to improve model accuracy.
Limitations of the present study include lack of data to account for medication, illicit substance use, or co-occurring anxiety or depression. The latter is a particular concern given conceptual overlap between depression and negative schizotypy (Lewandowski et al., 2006). Our sample was constrained in terms of culture/ethnicity and was recruited from a small geographic catchment region. Moreover, our models were dependent on self-report measures of schizotypy. Although informative, they may not be ideal for measuring some aspects of negative schizotypy and for differentiating between its subcomponents (e.g., constricted affect and no close friends). These limitations notwithstanding, the present findings are important for digital-phenotyping efforts targeting schizotypal traits. Future research should leverage additional acoustic tasks and replicate our use of large feature sets to generalize findings and further refine the mechanisms that moderate state phenotypic expression of negative schizotypy.
Supplemental Material
Supplemental material, sj-pdf-1-cpx-10.1177_21677026211017835 for High Predictive Accuracy of Negative Schizotypy With Acoustic Measures by Alex S. Cohen, Christopher R. Cox, Tovah Cowan, Michael D. Masucci, Thanh P. Le, Anna R. Docherty and Jeffrey S. Bedwell in Clinical Psychological Science
Footnotes
Transparency
Action Editor: Michael F. Pogue-Geile
Editor: Kenneth J. Sher
Author Contributions
A. S. Cohen developed the study concept, performed the data analysis, and wrote the manuscript. C. R. Cox helped design and interpret the analyses and revised the manuscript. T. P. Le, T. Cowan, and M. D. Masucci aided in interpretation and revision of the manuscript. All of the authors approved the final manuscript for submission.
References
