Abstract
In this article, I discuss construction of a set of weighted indices for the Minnesota Multiphasic Personality Inventory-2 Restructured Form (MMPI-2-RF) designed to provide direct guidance in three specific differential diagnostic problems. I created a calibration data set using a combined sample of mental health patients (n = 2,043). Using the MMPI-2-RF’s Substantive Scales as a pool of potential predictors, I applied the lasso, a penalized regression technique, to derive three logistic regression equations differentiating three major diagnostic groups (schizophrenia, bipolar disorder, and major depressive disorder) from one another. Then, I extracted empirically derived beta weights from these equations and used them to create composite differential diagnostic indices, which I scored in a separate holdout validation data set (n = 875). The differential diagnostic indices performed well in the validation data set (schizophrenia vs. bipolar area under the curve [AUC] = .76; schizophrenia vs. major depression AUC = .90; bipolar vs. major depression AUC = .75). Moreover, they substantially outperformed any single existing MMPI-2-RF scale in the same differential diagnostic tasks. In addition to discussing the development and initial validation of these indices, I present methods for deriving clinically referenced standard scores and diagnostic classification probabilities for obtained raw index scores.
Differential diagnosis is a common clinical problem facing psychologists engaged in assessment. In differential diagnosis, the psychologist attempts to rule out competing diagnostic hypotheses on the basis of observed signs and symptoms. The stakes in differential diagnosis can be quite high. For example, competing diagnoses under consideration may point to rather different case formulations and therefore to different (and potentially mutually exclusive) courses of intervention. Differential diagnosis can also have serious implications for other conclusions a psychologist may reach (e.g., whether someone meets the threshold criterion of a major mental illness in a forensic criminal responsibility evaluation).
At its core, differential diagnosis is a form of prediction, and like many predictive tasks in psychological assessment, it generally involves integration of complex and potentially ambiguous data. A psychologist has two broad options for combining data to derive a prediction: mechanical (statistical, actuarial) or clinical (impressionistic, subjective, intuitive). The distinction between mechanical and clinical judgment was most famously delineated by Meehl (1954), who argued that despite the many potential advantages an expert clinician can bring to bear on clinical prediction problems, one should favor mechanically driven prediction methods when given a choice between the two approaches. Meehl supported his assertion with empirical data (a review of the then-modest literature describing studies comparing mechanical and clinical predictions), and his conclusion was later supported by multiple independently conducted meta-analyses surveying hundreds of comparative studies across many substantive domains (Ægisdóttir et al., 2006; Grove et al., 2000; Sawyer, 1966). Taken as a whole, these findings indicate that statistical prediction nearly always performs as well as, or better than, clinical prediction; moreover, no situation in which a clinician might be expected to systematically outperform mechanical methods has been identified. However, in practice these findings are often effectively irrelevant, as psychologists generally lack empirically validated mechanical methods to apply even to relatively common clinical prediction tasks (Grove & Meehl, 1996; Meehl, 1956). 1
Differential Diagnosis With the MMPI-2-RF
Broadband personality assessment instruments offer a range of clinically useful information that is potentially relevant to differential diagnosis. The Minnesota Multiphasic Personality Inventory–2 Restructured Form (MMPI-2-RF; Ben-Porath & Tellegen, 2008/2011) is currently used in a broad range of clinical contexts (mental health treatment, neuropsychology, behavioral health, forensics, etc.). It includes nine Validity Scales to assess response style (inconsistent responding, overreporting, underreporting) as well as 42 Substantive Scales assessing a range of homogeneous clinical constructs primarily across five broad substantive domains (internalizing problems, externalizing problems, thought dysfunction, interpersonal problems, and somatic/cognitive problems).
Case conceptualization is one of the most common clinical tasks for which the MMPI-2-RF is used. This often includes situations in which the test taker’s presentation includes clinical features that potentially point toward alternative diagnostic formulations. Although an MMPI-2-RF protocol offers a wealth of information potentially useful for differential diagnosis, such an application of the test is nevertheless subject to at least two noteworthy limitations. First, by design, the MMPI-2-RF’s Substantive Scales measure relatively unitary transdiagnostic clinical constructs, not complex diagnoses. 2 Second, despite the apparent frequency with which the MMPI-2-RF is used for differential diagnosis, the MMPI-2-RF research literature (like the literature associated with other major personality assessment instruments) provides scant empirical guidance for directly applying the instrument to specific differential diagnostic problems outside of the detection of response bias (e.g., malingering). Nevertheless, several studies associated with using the MMPI-2-RF Substantive Scales for differential diagnosis bear mentioning.
Watson et al. (2011) identified the MMPI-2-RF ACT (Activation) scale as the best single-scale discriminator for a differential diagnosis of major depressive disorder versus bipolar disorder, with an area under the curve (AUC) in their clinical sample of .74 (recommended raw cut score = 4); furthermore, they noted that the scale appeared to effectively differentiate between the two mood disorders even for patients with a bipolar disorder in a depressive episode. However, they did not consider the use of multiple scales in combination for this differential diagnostic task. Locke et al. (2010) identified scale RC1 (Somatic Complaints) as an optimal single-scale discriminator between epileptic and psychogenic nonepileptic seizure patients, obtaining a classification accuracy rate of 68% in the authors’ sample at an optimal cut score (T = 65). In a follow-up study, Locke and Thomas (2010) developed two novel scales to discriminate between those same two seizure patient groups, and they further demonstrated that a logistic regression equation could differentiate between the groups with 78% accuracy in the scale development sample. Sellbom et al. (2012) found that multiple MMPI-2-RF scales validly discriminated major depressive disorder, schizophrenia, and bipolar disorder diagnostic groups from one another, though the authors did not present a method for directly applying that information clinically, as that study was focused on construct validation of the MMPI-2-RF scales. Finally, Lee et al. (2018) examined the potential utility of a select set of MMPI-2-RF and MMPI-2 (Butcher et al., 1989) scales in discriminating between major depressive disorder and schizophrenia; to aid clinical interpretation, the authors presented odds ratios for select scales at select cutoffs.
The Present Study
The studies just mentioned share at least two limitations with respect to clinical application. First, where they do provide guidance for using the MMPI-2-RF’s standard scores to guide differential diagnosis, they strongly emphasize single-scale interpretation, which is potentially problematic given the comparatively complex nature of psychological diagnostic constructs. Second, studies that provide empirically optimized cut scores or algorithms do not provide validation data to indicate how well the interpretive method provided would generalize to a new sample.
The current study aims to overcome both of these limitations by (1) producing statistically derived indices for the MMPI-2-RF that guide differential diagnosis by systematically combining data from multiple scale scores and (2) fairly evaluating the performance of those indices using validation data that are separate from the index development data. In addition, I provide methods for possible clinical interpretation and diagnostic classification using index scores. I select three major diagnostic constructs as targets for differential diagnosis: schizophrenia, bipolar disorder, and major depressive disorder (MDD). These diagnoses provide compelling targets for differential prediction for several reasons. First, although any two of these diagnoses share one or more phenotypic clinical features (e.g., schizophrenia and major depression are both often characterized by anhedonia, and bipolar disorder and major depression both reflect disturbances of positive affect), they putatively reflect at least partly distinct underlying processes. Second, the outcome of differential diagnosis in each case has significant implications for treatment. Finally, past research has indicated that the MMPI-2-RF is capable of statistically distinguishing between any two of these diagnostic groups to some meaningful extent (Lee et al., 2018; Sellbom et al., 2012; Watson et al., 2011).
Method
Participants
I drew participants in the present study from three archival samples consisting entirely of mental health patients who had produced scorable MMPI-2-RF protocols. 3 One sample consisted of psychiatric inpatients from a large county hospital (n = 1,524), one of psychiatric inpatients from a VA medical center (n = 1,401), and one of outpatients from a community mental health center (n = 1,020). The two inpatient samples are detailed in Arbisi et al. (2002, 2003). The outpatient sample is described in significant detail in Graham et al. (1999). The properties of MMPI-2-RF scale scores in all three samples (including associations with extra-test criteria) are detailed in the MMPI-2-RF Technical Manual (Tellegen & Ben-Porath, 2008).
In the present study, I combined the three samples, then excluded any protocols considered invalid according to any standard MMPI-2-RF invalidity criteria (CNS ≥ 18, VRIN-r ≥ 80, TRIN-r ≥ 80, F-r ≥ 120, Fp-r ≥ 100), for an overall exclusion rate of 26%. Then, I randomly assigned 70% of participants from the remaining combined sample to a calibration data set (i.e., a data set to be used for index development; n = 2,043), and I set aside the remaining 30% of the combined valid sample for use as a holdout validation data set (n = 875).
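The exclusion and random-assignment steps just described can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the dictionary-based record layout, the function names, and the random seed are hypothetical and are not taken from the study's supplemental R code; only the five invalidity thresholds and the 70/30 split come from the text.

```python
import random

# Invalidity thresholds quoted in the text; a protocol is excluded if ANY is met.
INVALIDITY_CUTOFFS = {"CNS": 18, "VRIN-r": 80, "TRIN-r": 80, "F-r": 120, "Fp-r": 100}

def is_valid(protocol):
    """Return True when no validity indicator reaches its exclusion threshold."""
    return all(protocol[scale] < cutoff for scale, cutoff in INVALIDITY_CUTOFFS.items())

def split_calibration_validation(protocols, calibration_fraction=0.70, seed=0):
    """Drop invalid protocols, then randomly assign the rest to two data sets."""
    valid = [p for p in protocols if is_valid(p)]
    rng = random.Random(seed)  # the seed is an arbitrary illustrative choice
    rng.shuffle(valid)
    n_calibration = round(calibration_fraction * len(valid))
    return valid[:n_calibration], valid[n_calibration:]

# Hypothetical toy records (not study data): ten valid protocols and one invalid one.
toy = [{"CNS": 0, "VRIN-r": 50, "TRIN-r": 57, "F-r": 60 + i, "Fp-r": 50} for i in range(10)]
toy.append({"CNS": 20, "VRIN-r": 50, "TRIN-r": 57, "F-r": 60, "Fp-r": 50})  # CNS >= 18

calibration, validation = split_calibration_validation(toy)
print(len(calibration), len(validation))
```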
In the inpatient settings, diagnoses were generally provided by the attending psychiatrist, whereas in the outpatient setting, diagnoses were provided primarily by master’s-level mental health professionals. In the inpatient data sets, both intake and discharge diagnoses were available, but only the discharge diagnoses were used in the present study, as those diagnoses were made after a more prolonged period of clinical observation. Only intake diagnoses were available for the outpatient sample.
I did not include all participants in the calibration and validation data sets in analyses, as only some participants were assigned relevant clinical diagnoses. Additionally, in an effort to maximize the validity of diagnostic criteria, I did not consider diagnoses that were coded as rule-outs for inclusion in analyses. Diagnostic frequencies for the three target diagnoses 4 are listed in Table 1, along with subsample sizes for the various combined differential diagnostic groups. I excluded participants from differential diagnostic analyses on a case-by-case basis if they were assigned both of the relevant diagnoses; for example, a patient whose diagnoses of record included both schizophrenia and bipolar disorder would be excluded from analyses involving the schizophrenia versus bipolar disorder differential diagnostic index.
Table 1. Relevant Diagnostic Frequencies by Data Set.
Note. MDD = major depressive disorder.
Measures
Minnesota Multiphasic Personality Inventory-2 Restructured Form (MMPI-2-RF)
I described the MMPI-2-RF in greater detail earlier. In the present study, I used the Validity Scales only to exclude participants from analyses, as already described. However, I used the full set of 42 Substantive Scales of the MMPI-2-RF to form a pool of candidate predictors for the differential diagnostic indices; that is, I considered all Substantive Scales for potential inclusion in each of the differential diagnostic indices.
Analysis
To calibrate the differential diagnostic indices, I set up each differential diagnostic task as a separate logistic regression model, with one diagnosis coded 1, and the other coded 0 (schizophrenia vs. bipolar disorder: SCZ = 1, BD = 0; schizophrenia vs. major depression: SCZ = 1, MDD = 0; bipolar disorder vs. major depression: BD = 1, MDD = 0). The MMPI-2-RF Substantive Scale raw scores served as a pool of potential predictors. In order to create parsimonious models with robust beta weights, I selected a regression approach called the lasso (least absolute shrinkage and selection operator; Tibshirani, 1996). This form of modeling tends to produce conservative and generalizable predictor weights while simultaneously performing predictor selection. Moreover, explorations of the validity of using the lasso to generate predictions from psychological assessment data suggest that the lasso produces at least modestly more robust and generalizable predictions than more conventional maximum likelihood logistic regression (Menton, 2020). The prediction equation produced by lasso logistic regression is identical in form to the equations produced by more conventional maximum likelihood estimation logistic regression; the only difference is that the parameter weights have been estimated by a modified process. In the present study, lasso modeling was performed using the glmnet package (Friedman et al., 2009) in R (R Core Team, 2019). All other analyses were also carried out in R, and the R code used in the present study is available as an online supplementary file. In the supplemental materials, I also describe the modeling process in substantially greater detail and explain the considerations that led to selection of the predictor weights used in the differential diagnostic indices. Mean MMPI-2-RF Substantive Scale scores for each diagnostic group are displayed in Supplemental Table S1.
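To make the idea of lasso logistic regression concrete, the following Python sketch fits an L1-penalized logistic model by proximal gradient descent. It is not the glmnet algorithm used in the study (glmnet uses coordinate descent and standardizes predictors by default); the function names, step size, iteration count, penalty value, and toy data are all assumptions chosen for illustration. The point it shows is that the penalty shrinks weights toward zero and sets weak predictors to exactly zero, which is how the lasso performs predictor selection.

```python
import math

def soft_threshold(value, amount):
    """Shrink value toward zero by amount; values within +/- amount become exactly 0."""
    if value > amount:
        return value - amount
    if value < -amount:
        return value + amount
    return 0.0

def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)

def fit_lasso_logistic(X, y, lam, step=0.1, n_iter=2000):
    """Minimize mean logistic loss + lam * sum(|beta_j|); the intercept is unpenalized."""
    n, p = len(X), len(X[0])
    intercept, beta = 0.0, [0.0] * p
    for _ in range(n_iter):
        # Residuals (predicted probability minus observed class) at current weights.
        resid = [sigmoid(intercept + sum(b * x for b, x in zip(beta, row))) - yi
                 for row, yi in zip(X, y)]
        intercept -= step * sum(resid) / n
        for j in range(p):
            grad_j = sum(r * row[j] for r, row in zip(resid, X)) / n
            beta[j] = soft_threshold(beta[j] - step * grad_j, step * lam)
    return intercept, beta

# Hypothetical toy data: the first predictor tracks the outcome, the second does not.
x1 = [-2.0, -1.5, -1.0, -0.5, 0.5, 1.0, 1.5, 2.0]
x2 = [1.0, -1.0, -1.0, 1.0, 1.0, -1.0, -1.0, 1.0]
X = [[a, b] for a, b in zip(x1, x2)]
y = [0, 0, 0, 1, 0, 1, 1, 1]

intercept, beta = fit_lasso_logistic(X, y, lam=0.1)
print(beta)  # the uninformative second predictor receives a weight of exactly zero
```

Raising `lam` far enough removes every predictor from the model, so in practice the penalty strength has to be tuned (for example, by cross-validation).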
To aid application and interpretation, I then converted the calibrated logistic regression models to linear indices by directly extracting the models’ beta weights, multiplying them by a constant of 1,000 (to produce whole numbers, which some test users may find easier to work with, while retaining precision), and summing the weighted scale scores without further transformation. I removed from their respective indices any predictors whose weights were judged too small to meaningfully influence index performance. 5 I then evaluated the performance of the resulting indices using the holdout validation data set, with the AUC statistic serving as the index performance criterion. 6
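The conversion from fitted beta weights to a linear index can be illustrated as follows. This is a minimal sketch: the beta values and raw scores are invented for demonstration, and the `min_abs_weight` rule is a hypothetical stand-in for the judgment-based removal of negligible predictors described above, not the criterion actually applied.

```python
def build_index_weights(betas, multiplier=1000, min_abs_weight=1):
    """Scale beta weights to whole numbers and drop those that are too small."""
    scaled = {scale: round(beta * multiplier) for scale, beta in betas.items()}
    return {scale: w for scale, w in scaled.items() if abs(w) >= min_abs_weight}

def score_index(weights, raw_scores):
    """Sum the weighted raw scale scores without further transformation."""
    return sum(w * raw_scores[scale] for scale, w in weights.items())

# Hypothetical beta weights for illustration only (not the published weights).
example_betas = {"RC6": 0.0412, "ACT": -0.0358, "SFD": -0.0004}
svb_weights = build_index_weights(example_betas)
print(svb_weights)  # SFD is dropped because its scaled weight rounds to zero

# Hypothetical raw scale scores for one test taker.
print(score_index(svb_weights, {"RC6": 5, "ACT": 2, "SFD": 3}))
```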
To further guide interpretation of the three primary differential diagnostic indices, I decomposed each index into two subindices, each of which consisted only of predictors weighted in the direction of one of the diagnoses (i.e., positive or negative) targeted by the primary index. I entirely carried over the magnitudes of the predictor weights from the primary indices; however, I converted any negative weights to positive weights so that increases in scores on the MMPI-2-RF scales comprising the subindex would always correspond to an increased score on the subindex. I intended these subindices to provide a basis for interpreting ambiguous primary index scores (i.e., scores not clearly consistent with one target diagnosis or the other), as there are two potential reasons that such a score could be obtained, each associated with substantially different clinical implications. First, an individual could produce generally unelevated scores on the MMPI-2-RF scales comprising the index. Second, an individual could produce substantial elevations on scales scoring in both of the index’s diagnostic “directions,” essentially cancelling one another out.
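The decomposition into subindices, and the reason it helps with ambiguous primary scores, can be sketched as below. The weights and raw scores are hypothetical and chosen only so that the arithmetic is easy to follow; the function names are likewise illustrative.

```python
def decompose_index(weights):
    """Split signed primary-index weights into two subindices with positive weights."""
    toward_coded_1 = {s: w for s, w in weights.items() if w > 0}
    toward_coded_0 = {s: -w for s, w in weights.items() if w < 0}  # sign flipped
    return toward_coded_1, toward_coded_0

def score(weights, raw_scores):
    return sum(w * raw_scores[s] for s, w in weights.items())

# Hypothetical SvB-style weights (positive = schizophrenia, negative = bipolar disorder).
svb = {"RC6": 40, "ACT": -30, "SFD": -20}
sob, bos = decompose_index(svb)

flat = {"RC6": 0, "ACT": 0, "SFD": 0}    # generally unelevated profile
mixed = {"RC6": 5, "ACT": 4, "SFD": 4}   # elevations in both diagnostic directions

# Both profiles produce the same ambiguous primary score of 0 ...
print(score(svb, flat), score(svb, mixed))
# ... but the subindices separate them: (0, 0) versus (200, 200).
print((score(sob, flat), score(bos, flat)), (score(sob, mixed), score(bos, mixed)))
```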
I then clinically standardized the differential diagnostic primary indices and subindices on the calibration sample to provide a reference group–based approach to understanding the magnitude of index elevations. The use of conventional norm-referenced standardization techniques (creating a reference group drawn from the general population rather than a clinical population) with a differential diagnostic index would have raised significant conceptual and statistical concerns; for example, what would a normative reference point even mean for a purely clinical indicator? 7 As an alternative, I applied a clinically based standardization procedure to create what I here call clinically referenced standard scores (CRSSs). For the primary differential diagnostic indices, I standardized each CRSS so that a standard score of zero corresponds to the raw index score that falls at the precise midpoint between the mean index scores of the two target diagnostic groups. The standardized dispersion value is set so that a value of 10 corresponds to the mean raw index score of one of the diagnostic groups. To aid interpretation, I “reflected” primary index standard scores in a manner similar to the MMPI-2-RF’s TRIN-r scale. In other words, raw scores below the reference midpoint are converted into positive CRSSs so that a deviation from the midpoint in either direction always results in a score elevation. For example, for the schizophrenia versus bipolar disorder index, I calculated the mean index score for individuals with schizophrenia separately from the mean index score for individuals with bipolar disorder. I then used the index midpoint between these two values to anchor the index CRSS score of zero, and I set the distance from the midpoint to the diagnostic group means to equal a CRSS of 10.
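A minimal sketch of the reflected standardization for a primary index follows. The group means are invented, and the function name, the rounding to whole numbers, and the string output with a direction suffix are illustrative assumptions rather than the scoring rules of the published lookup tables.

```python
def primary_crss(raw, mean_high, mean_low, high_label, low_label):
    """Reflected clinically referenced standard score for a primary index.

    mean_high and mean_low are the calibration-sample mean raw index scores of the
    diagnostic groups coded 1 and 0, respectively.
    """
    midpoint = (mean_high + mean_low) / 2.0
    unit = (mean_high - midpoint) / 10.0      # raw-score distance worth one CRSS point
    signed = (raw - midpoint) / unit
    magnitude = round(abs(signed))            # Python rounds exact halves to even
    if magnitude == 0:
        return "0"
    return f"{magnitude}{high_label if signed > 0 else low_label}"

# Hypothetical group means on an SvB-style index (not the published reference values).
print(primary_crss(300, mean_high=300, mean_low=100, high_label="S", low_label="B"))
print(primary_crss(0, mean_high=300, mean_low=100, high_label="S", low_label="B"))
```

With these made-up means, a raw score equal to the schizophrenia group mean maps to 10S, and a raw score twice as far below the midpoint as the bipolar group mean maps to 20B.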
I derived CRSS scores for the subindices (described in greater detail below) using a slightly different procedure. For these, I set the anchor point to the contrast group mean, and I set a CRSS score of 10 to equal the distance to the target group mean. For example, subindex SoB (schizophrenia over bipolar disorder) consists only of the weighted scales that point toward a diagnosis of schizophrenia, rather than bipolar disorder, in primary index SvB (schizophrenia vs. bipolar disorder). I therefore anchored the CRSS for SoB such that a standard score of zero equals the mean raw score of the bipolar disorder group (the contrast group), and a standard score of 10 equals the mean raw score of the schizophrenia group (the target group). Subindex CRSSs are not “reflected” in the same manner as the primary index standard scores; thus, they can take negative values.
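The subindex standardization differs only in its anchor points, as this short sketch shows. The group means are hypothetical, and the function name is an illustrative choice.

```python
def subindex_crss(raw, target_mean, contrast_mean):
    """Unreflected CRSS: 0 at the contrast group mean, 10 at the target group mean."""
    return 10.0 * (raw - contrast_mean) / (target_mean - contrast_mean)

# Hypothetical calibration means on an SoB-style subindex
# (target = schizophrenia, contrast = bipolar disorder).
print(subindex_crss(50, target_mean=250, contrast_mean=50))
print(subindex_crss(250, target_mean=250, contrast_mean=50))
print(subindex_crss(30, target_mean=250, contrast_mean=50))  # below the contrast mean: negative
```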
Finally, to provide index users with differential diagnostic classification probabilities associated with specific test taker index scores, I derived predicted class probabilities by applying a simple Bayesian machine learning algorithm called naïve Bayes through the naivebayes package (Majka, 2019) in R. In the present study, naïve Bayes models the underlying distributions of primary index scores for both diagnostic classes, which I assume to be approximately gaussian based on a visual inspection of the class score distributions. Once so calibrated, naïve Bayes can generate class probabilities for any given index score value by examining the relative densities of the modeled class distributions at the specified value of the index. To avoid overfitting the estimated class probabilities to the present study’s calibration data, I instead derived class probabilities from the index scores in the validation data. These class probabilities differ notably from cut score statistics such as positive predictive value (PPV) and negative predictive value (NPV), as the class probabilities provide class prediction estimates associated with specific index scores, whereas PPV and NPV describe probabilities associated with a range of scores (scores at or beyond a cut point). Thus, although PPV and NPV are arguably more useful for generating prespecified classification decision rules, the class probabilities associated with specific index scores provide more direct guidance for interpreting a particular set of obtained scores. Additionally, at the tails of the score distributions, the classification probabilities derived from naïve Bayes permit reasonable extrapolation beyond the range of observed values in the validation sample using the basic distributional properties (mean, standard deviation) of the scores for each class.
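With a single predictor, a gaussian naïve Bayes classifier reduces to comparing two prior-weighted normal densities. The Python sketch below illustrates that logic; it is not the naivebayes package, and the scores, function names, and the use of the sample standard deviation are assumptions made for illustration. The prior plays the role of the relative base rate of the first class.

```python
import math
from statistics import mean, stdev

def gaussian_pdf(x, mu, sigma):
    z = (x - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2.0 * math.pi))

def fit_class(scores):
    """Summarize one class's index scores by their mean and standard deviation."""
    return mean(scores), stdev(scores)

def class_probability(x, params_a, params_b, prior_a):
    """Posterior probability of class A at index score x (two-class problem)."""
    weighted_a = prior_a * gaussian_pdf(x, *params_a)
    weighted_b = (1.0 - prior_a) * gaussian_pdf(x, *params_b)
    return weighted_a / (weighted_a + weighted_b)

# Hypothetical index scores for two diagnostic classes (not study data).
class_a = fit_class([8, 10, 12])
class_b = fit_class([-12, -10, -8])

print(class_probability(0, class_a, class_b, prior_a=0.5))  # midpoint, equal priors
print(class_probability(0, class_a, class_b, prior_a=0.3))  # midpoint, unequal priors
print(class_probability(4, class_a, class_b, prior_a=0.3))  # score leaning toward A
```

Because the probabilities depend only on each class's mean and standard deviation, they can be computed for index scores outside the observed range, although extremely remote scores would require working with log densities to avoid numerical underflow.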
Results
Differential Diagnostic Index Composition
The index creation process resulted in three primary differential diagnostic indices, which I label SvB (schizophrenia vs. bipolar disorder), SvD (schizophrenia vs. [major] depression), and BvD (bipolar disorder vs. [major] depression). I adopt a similar labeling scheme for the subindices: SoB (schizophrenia over bipolar disorder, i.e., the scales that conjointly predict in the direction of schizophrenia in the SvB primary index), BoS (bipolar over schizophrenia), SoD (schizophrenia over depression), DoS (depression over schizophrenia), BoD (bipolar disorder over depression), and DoB (depression over bipolar disorder). I display the compositions of the primary indices in Table 2, and an interested reader can readily derive subindex compositions from these data using the decomposition procedure previously described (see the “Method” section).
Table 2. Primary Index Compositions.
The compositions of the primary indices appear largely theoretically consistent with the differential diagnostic tasks for which they were designed. For example, in index SvB, differential prediction of schizophrenia is primarily driven by Persecutory Ideation (corresponding to the paranoia often experienced by individuals with schizophrenia), whereas differential prediction of bipolar disorder is driven by a combination of mood disturbance (Self-Doubt) and excessive behavioral activation (Activation). For index SvD, schizophrenia is primarily differentiated from major depression by problems associated with thought dysfunction (Persecutory Ideation, Psychoticism) and depression is differentiated from schizophrenia by internalizing distress (e.g., Emotional/Internalizing Dysfunction, Self-Doubt) and by somatic problems (e.g., Somatic Complaints, Malaise), likely capturing the neurovegetative symptoms that are often associated with major depressive episodes. 8 For index BvD, bipolar disorder is distinguished by excessive behavioral activation (Activation) and by interpersonal aggression (Aggressiveness) whereas major depression is distinguished by negative mood disturbance (Emotional/Internalizing Dysfunction) and by neurovegetative symptoms (Somatic Complaints, Malaise).
Index Performance
I evaluated the performances of all differential diagnostic indices in the validation data set using the AUC statistic. 9 Table 3 displays the AUC values for all primary indices and subindices 10 by differential diagnostic task, along with 95% confidence intervals computed by bootstrapping. 11 I also include AUC values for all MMPI-2-RF Substantive scales for comparison. All three primary differential diagnostic indices performed well overall in their respective differential diagnostic prediction tasks (SvB AUC = .76 [95% CI = .67, .84]; SvD AUC = .90 [.83, .94], BvD AUC = .75 [.67, .82]). Each primary differential diagnostic index also substantially outperformed all individual MMPI-2-RF scales in discriminating between its target diagnoses. The strongest single-scale discriminator of schizophrenia from bipolar disorder was RC6 (Persecutory Ideation; AUC = .68 [.57, .77]). EID (Emotional/Internalizing Dysfunction) was the strongest single-scale differentiator of schizophrenia and major depressive disorder (AUC = .78 [.69, .84]). In discriminating bipolar disorder from major depressive disorder, scales EID, RCd, and RC2 (Emotional/Internalizing Dysfunction, Demoralization, and Low Positive Emotions, respectively) performed approximately equally well (AUC = .69 [.61, .76-.77]). Practical implications of the possible performance advantage of using composite indices rather than individual scales in differential diagnosis will be discussed later, in the context of classification utility. As an alternative metric, I provide point-biserial correlations for these indicators in Supplemental Table S2.
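For readers unfamiliar with the AUC, the sketch below computes it as the probability that a randomly chosen case from the group coded 1 outscores a randomly chosen case from the group coded 0, and attaches a bootstrap confidence interval. Note the substitution: the study reports BCa intervals, whereas this example uses the simpler percentile bootstrap with resampling within each group. The scores, seed, and function names are hypothetical.

```python
import random

def auc(scores_pos, scores_neg):
    """Proportion of positive/negative pairs ranked correctly (ties count half)."""
    wins = sum((p > n) + 0.5 * (p == n) for p in scores_pos for n in scores_neg)
    return wins / (len(scores_pos) * len(scores_neg))

def bootstrap_auc_ci(scores_pos, scores_neg, n_boot=2000, level=0.95, seed=0):
    """Percentile bootstrap interval for the AUC, resampling each group separately."""
    rng = random.Random(seed)  # arbitrary seed for reproducibility
    replicates = []
    for _ in range(n_boot):
        resampled_pos = [rng.choice(scores_pos) for _ in scores_pos]
        resampled_neg = [rng.choice(scores_neg) for _ in scores_neg]
        replicates.append(auc(resampled_pos, resampled_neg))
    replicates.sort()
    alpha = (1.0 - level) / 2.0
    lower = replicates[int(alpha * n_boot)]
    upper = replicates[min(n_boot - 1, int((1.0 - alpha) * n_boot))]
    return lower, upper

# Hypothetical index scores for two diagnostic groups (not study data).
group_1 = [3, 5, 6, 8, 9, 11]
group_0 = [1, 2, 4, 4, 7, 10]
print(auc(group_1, group_0), bootstrap_auc_ci(group_1, group_0))
```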
Table 3. AUC Values With 95% Confidence Intervals for Differential Diagnostic Indicators.
Note. Boldfaced values indicate the highest obtained AUC for each differential diagnostic task. Confidence intervals are BCa bootstrapped intervals (2,000 replications). AUC = area under the curve; SCZ = schizophrenia; BD = bipolar disorder; MDD = major depressive disorder; SvB = SCZ versus BD; SvD = SCZ versus MDD; BvD = BD versus MDD; SoB = SCZ over BD; BoS = BD over SCZ; SoD = SCZ over MDD; DoS = MDD over SCZ; BoD = BD over MDD; DoB = MDD over BD; EID = Emotional/Internalizing Dysfunction; BXD = Behavioral/Externalizing Dysfunction; THD = Thought Dysfunction; RCd = Demoralization; RC1 = Somatic Complaints; RC2 = Low Positive Emotions; RC3 = Cynicism; RC4 = Antisocial Behavior; RC6 = Persecutory Ideation; RC7 = Dysfunctional Negative Emotions; RC8 = Aberrant Experiences; RC9 = Hypomanic Activation; MLS = Malaise; GIC = Gastrointestinal Complaints; HPC = Head Pain Complaints; NUC = Neurological Complaints; COG = Cognitive Complaints; SUI = Suicidal/Death Ideation; HLP = Helplessness/Hopelessness; SFD = Self-Doubt; NFC = Inefficacy; STW = Stress/Worry; AXY = Anxiety; ANP = Anger Proneness; BRF = Behavior-Restricting Fears; MSF = Multiple Specific Fears; JCP = Juvenile Conduct Problems; SUB = Substance Abuse; AGG = Aggression; ACT = Activation; FML = Family Problems; IPP = Interpersonal Passivity; SAV = Social Avoidance; SHY = Shyness; DSF = Disaffiliativeness; AES = Aesthetic-Literary Interests; MEC = Mechanical–Physical Interests; AGGR-r = Aggressiveness–Revised; PSYC-r = Psychoticism–Revised; DISC-r = Disconstraint–Revised; NEGE-r = Negative Emotionality–Revised; INTR-r = Introversion–Revised.
Index Standardization
I transformed the primary and subindices into CRSSs using the standardization process previously described. Table 4 displays the reference values used to derive the CRSSs. As previously indicated, to aid interpretation, primary index CRSSs are “reflected” in a manner similar to the MMPI-2-RF’s TRIN-r scale. In other words, raw scores below the reference anchor value (the clinical midpoint for the index) are converted into positive standard scores so that a deviation from the index’s clinical midpoint always results in a score elevation. A reflected CRSS value is also presented with a suffix indicating the diagnostic direction of the elevation (S = schizophrenia, B = bipolar disorder, D = [major] depression). Thus, for example, if an individual obtains a raw score twice as far below the clinical midpoint on index SvB (on which lower raw scores indicate bipolar disorder and higher raw scores indicate schizophrenia) as the typical patient with bipolar disorder, that raw score would be converted to a CRSS of 20B.
Table 4. Standardization Reference Values.
Note. SvB = schizophrenia versus bipolar disorder; SvD = schizophrenia versus (major) depression; BvD = bipolar disorder versus (major) depression; SoB = schizophrenia over bipolar disorder; BoS = bipolar disorder over schizophrenia; SoD = schizophrenia over (major) depression; DoS = (major) depression over schizophrenia; BoD = bipolar disorder over (major) depression; DoB = (major) depression over bipolar disorder.
Table 5 displays CRSS means for each of the target diagnostic groups in the validation sample along with bootstrapped 95% confidence intervals. 12 Mean CRSS values for patients with schizoaffective disorder in the validation sample (n = 25) are also provided as an additional point of comparison, as that diagnosis represents a possible confound (or compromise, depending on one’s view of the construct) in some of the differential diagnostic problems relevant to these indices given the heterogeneous symptoms associated with the diagnosis. As expected, each of the three target diagnostic groups deviates in the predicted direction along each relevant primary index, producing mean scores highly similar to the mean scores in the calibration sample (i.e., CRSS ≈ 10). For example, patients in the validation group diagnosed with schizophrenia produced mean scores of 11S [95% CI = 6S, 15S] on SvB and 9S [6S, 12S] on SvD. Patients diagnosed with major depression evidenced possible slight regression toward the midpoint along relevant primary indices (M SvD = 8D [7D, 10D], M BvD = 6D [3D, 9D]). I provide abbreviated CRSS lookup tables for SvB, SvD, and BvD as Tables 6 to 8, respectively; more granularly detailed lookup tables can be found in the online Supplemental Materials.
Table 5. Mean Index Standard Scores by Diagnosis With 95% Confidence Intervals.
Note. Confidence intervals are BCa bootstrapped intervals (2,000 replications). SvB = schizophrenia versus bipolar disorder; SvD = schizophrenia versus (major) depression; BvD = bipolar disorder versus (major) depression; SoB = schizophrenia over bipolar disorder; BoS = bipolar disorder over schizophrenia; SoD = schizophrenia over (major) depression; DoS = (major) depression over schizophrenia; BoD = bipolar disorder over (major) depression; DoB = (major) depression over bipolar disorder.
Table 6. SvB Standard Scores and Classification Utility Statistics (RBRs = .40 S, .60 B).
Note. SvB = schizophrenia versus bipolar disorder; RBRs = relative base rates; S = schizophrenia; B = bipolar; CRSS = clinically referenced standard score; Sen = sensitivity; Spc = specificity; PPV = positive predictive value; NPV = negative predictive value; HR = hit rate (overall classification accuracy); Pr(S) = probability of schizophrenia; Pr(B) = probability of bipolar disorder.
Table 7. SvD Standard Scores and Classification Utility Statistics (RBRs = .30 S, .70 D).
Note. SvD = schizophrenia versus (major) depression; RBRs = relative base rates; S = schizophrenia; D = depression; CRSS = clinically referenced standard score; Sen = sensitivity; Spc = specificity; PPV = positive predictive value; NPV = negative predictive value; HR = hit rate (overall classification accuracy); Pr(S) = probability of schizophrenia; Pr(D) = probability of depression.
Table 8. BvD Standard Scores and Classification Utility Statistics (RBRs = .40 B, .60 D).
Note. BvD = bipolar disorder versus (major) depression; RBRs = relative base rates; B = bipolar; D = depression; CRSS = clinically referenced standard score; Sen = sensitivity; Spc = specificity; PPV = positive predictive value; NPV = negative predictive value; HR = hit rate (overall classification accuracy); Pr(B) = probability of bipolar disorder; Pr(D) = probability of [major] depression.
In the validation sample, the target diagnostic groups also tended to differentially elevate their relevant subindex scores in an appropriate manner (e.g., patients diagnosed with schizophrenia tended to elevate subindices positively predicting a differential diagnosis of schizophrenia) and at a magnitude similar to that observed in the calibration sample (e.g., producing CRSSs around 10 on subindices positively predicting the diagnosis). For example, patients diagnosed with bipolar disorder generated mean standard scores of 8 [95% CI = 5, 12] on BoS and 8 [5, 11] on BoD. Conversely, diagnostic groups tended to produce low (near-zero) elevations on subindices for which they served as the contrast class; for example, the mean SoB score for patients diagnosed with bipolar disorder was −1 [−3, 3]. As expected, patients diagnosed with schizoaffective disorder produced fairly ambiguous and undistinctive mean scores on primary indices as well as heterogeneous patterns of elevation across most of the subindices (albeit with an increased tendency toward elevating subindices positively predictive of schizophrenia). I provide abbreviated CRSS conversions for the differential diagnostic subindices in Table 9 (see the Supplemental Materials for a more detailed CRSS conversion table).
Subindex Standard Score Lookup Table.
Note. CRSS = clinically referenced standard score; SvB = schizophrenia versus bipolar disorder; SvD = schizophrenia versus (major) depression; BvD = bipolar disorder versus (major) depression; SoB = schizophrenia over bipolar disorder; BoS = bipolar disorder over schizophrenia; SoD = schizophrenia over (major) depression; DoS = (major) depression over schizophrenia; BoD = bipolar disorder over (major) depression; DoB = (major) depression over bipolar disorder.
Classification Utility Statistics
Tables 6 to 8 display classification utility statistics (sensitivity [Sen], specificity [Spc], positive predictive value [PPV], negative predictive value [NPV], and hit rate [HR]) for the primary indices. Those tables also contain diagnostic class probability values for specific index score values, obtained through application of the naïve Bayes algorithm in the validation sample. When interpreting and evaluating these classification utility statistics in light of the differential diagnostic problems addressed by the indices, one must consider not the overall base rates (BRs) of the target diagnoses in the parent setting, but rather their relative base rates (RBRs; i.e., their base rates in the narrower subpopulation consisting only of the individuals with one of the two diagnoses targeted by the index), an issue I discuss later in greater detail.
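The dependence of PPV, NPV, and HR on the RBR can be illustrated with a short Python sketch. The function name and the numeric inputs below are hypothetical and are not taken from Tables 6 to 8; the sketch only applies the standard definitions of these statistics to a single cut score, treating the first-named diagnosis as the "positive" class.

```python
def classification_utility(sen, spc, rbr):
    """Return (PPV, NPV, HR) for one cut score.

    sen, spc: sensitivity and specificity at the cut score.
    rbr: relative base rate of the "positive" target diagnosis
         (the contrast diagnosis then has RBR = 1 - rbr).
    """
    true_pos = sen * rbr                 # positives correctly flagged
    false_neg = (1 - sen) * rbr          # positives missed
    true_neg = spc * (1 - rbr)           # contrast cases correctly cleared
    false_pos = (1 - spc) * (1 - rbr)    # contrast cases wrongly flagged

    ppv = true_pos / (true_pos + false_pos)
    npv = true_neg / (true_neg + false_neg)
    hr = true_pos + true_neg             # overall classification accuracy
    return ppv, npv, hr


# Hypothetical values chosen only for illustration.
ppv, npv, hr = classification_utility(sen=0.80, spc=0.70, rbr=0.40)
baseline = max(0.40, 1 - 0.40)           # accuracy from "betting the base rates"
print(round(ppv, 2), round(npv, 2), round(hr, 2), baseline)
```

With these illustrative inputs, the hit rate can be compared directly against the accuracy obtained by always predicting the more populous class, which is the comparison that matters when judging a cut score.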
For each primary index, at least some of the cut scores displayed in Tables 6 to 8 provide significantly better overall classification accuracy in the validation sample than simply “betting the base rates” (i.e., simply predicting that each validation case belongs to the most populous class for each differential diagnostic problem). SvB evidences a peak HR of .76 at scores of 15S and 20S (beating the bipolar disorder RBR of .60), SvD obtains a maximum HR of .83 at a CRSS of 0 (beating an RBR of .70 for major depression), and BvD demonstrates a maximum HR of .73 at a score of 10B (beating an RBR of .60 for major depression). To provide additional points of comparison, I computed overall classification accuracy at optimally selected cut scores in the validation data for the best performing individual MMPI-2-RF scales for these differential diagnostic problems (as indicated by their previously calculated AUC values). At these optimal cut scores, RC6 (Persecutory Ideation) discriminates schizophrenia from bipolar disorder with a hit rate of .72 (T = 75), EID (Emotional/Internalizing Dysfunction) discriminates schizophrenia from MDD with a hit rate of .77 (T = 57), and EID discriminates bipolar disorder from MDD with a hit rate of .71 (T = 57). I provide estimated PPV, NPV, HR, and classification probabilities for each index at various other RBRs in the Supplemental Materials to facilitate evaluation of these indices in other samples.
Discussion
In the present study, I developed and evaluated a set of three weighted differential diagnostic indices for the MMPI-2-RF: SvB (schizophrenia vs. bipolar disorder), SvD (schizophrenia vs. [major] depression), and BvD (bipolar disorder vs. [major] depression). In addition, to augment interpretation, I created two subindices for each of the primary differential diagnostic indices and provided standard score and classification probability conversions for raw index scores. This study’s method included several novel elements, including use of lasso logistic regression to select and weight linear index predictors, derivation of CRSSs to standardize the indices, and application of the naïve Bayes algorithm to map classification probabilities onto index raw scores by combining diagnostic group score distributional data with local relative base rate information.
Criterion Validity of the Differential Diagnostic Indices
Each of the three primary differential diagnostic indices demonstrated good performance in discriminating between its two target diagnostic groups in the validation sample. Moreover, each primary index markedly outperformed any single MMPI-2-RF scale in discriminating between the index’s target diagnostic groups, indicating that use of these combined indices offers a substantial potential advantage over single-scale interpretation within the context of specific differential diagnostic tasks. Additionally, each primary index appeared to substantially improve on base rate prediction. Of the primary indices, SvD demonstrated particularly strong discriminative power, likely reflecting greater dissimilarity between its two target diagnoses (at least along the relevant dimensions assessed by the MMPI-2-RF) than between the competing diagnoses targeted by the other two primary indices. The advantage of using the composite BvD index over a single scale to discriminate between its target diagnoses was more modest than that observed for the other indices, though it was nevertheless measurable and likely practically meaningful in typical clinical contexts.
Primary indices SvB and BvD demonstrated performance in differentiating schizophrenia from major depression that was roughly comparable to their performance in their keyed differential diagnostic tasks (SvB AUC = .73, BvD AUC = .77). This is perhaps unsurprising, as some predictors are shared across indices, and the prediction criteria also overlap between tasks. Nevertheless, in the present study, each differential diagnostic task was performed most effectively by the index keyed to that task. These findings underscore the importance of interpreting each index only within the context of the keyed differential diagnostic task.
Interpretive Considerations
These are experimental indices that have not yet been subjected to additional validation beyond the analyses presented in this article. Thus, absent further validation work, a recommendation to apply these indices clinically would be premature. Nevertheless, given that these indices (or modified or alternative versions thereof) could eventually be applied in practice, potential interpretive strategies are worth considering. As a starting point, I briefly propose the following broad approach: (1) clearly identifying the differential diagnostic task at hand, (2) computing the relevant index and subindex scores from MMPI-2-RF raw scale scores if the protocol is valid, (3) deriving a qualitative description of the index scores from the primary and subindex CRSS scores, and (4) interpreting classification probabilities in light of the known (or estimated) relative base rates of the target diagnoses in the local population. Index result interpretation may need to be modified given relevant extra–test data that significantly add to or otherwise alter the potential meaning of test data. Additionally, in light of the nature and purpose of these differential diagnostic indices, several other considerations should also be taken into account; these are discussed next.
Appropriate Uses of Indices
Each of the differential diagnostic indices presented in this article was designed for a highly circumscribed clinical situation: discrimination of one specified diagnostic group from another specified diagnostic group. Any application of an index outside of the differential diagnostic task for which it was intended is not supported by the presently available data and is beyond the scope of the intended purpose of these indices. In particular, clinicians should avoid any temptation to use these indices for general diagnosis rather than differential diagnosis (e.g., diagnosis of bipolar disorder driven by index or subindex elevations without any a priori differential of schizophrenia or major depression). The subindices are not designed to be used in isolation, only to clarify the reason for obtaining ambiguous scores on the primary differential diagnostic indices.
Overall Base Rates Versus Relative Base Rates
Incorporation of base rate information is crucial to decision-making using psychometric data, as base rates can have a profound influence on classification probabilities even after test scores are known (Meehl & Rosen, 1955). In discussing these indices, I make an explicit distinction between overall diagnostic base rates (BRs) and relative base rates (RBRs), which I defined previously. This distinction is important because the differential diagnostic indices were not designed to be applied to a general clinical population in which diagnostic frequencies are appropriately described in terms of overall BRs. Rather, the indices were designed to be used with much narrower clinical subpopulations consisting only of individuals with one of two target diagnoses (or perhaps falling on a spectrum between those diagnoses), and in this case, diagnostic frequencies are much more helpfully described in terms of RBRs. Perhaps even more precisely, RBRs should be calculated or estimated for the even narrower subpopulation of individuals who are presumed to have one of the two target diagnoses, are referred for testing, successfully complete testing, and produce valid test protocols.
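As a rough illustration of the BR versus RBR distinction, the following Python sketch rescales two overall base rates so that they sum to 1 within the two-diagnosis subpopulation. The function name and the overall BRs used are hypothetical, and the sketch assumes that the two diagnoses are mutually exclusive and that the overall BRs are already restricted to the relevant tested subpopulation.

```python
def relative_base_rates(br_a, br_b):
    """Convert overall base rates of two target diagnoses into RBRs.

    br_a, br_b: proportions of the parent population with each diagnosis
    (assumed mutually exclusive for the purposes of this sketch).
    """
    total = br_a + br_b
    if total <= 0:
        raise ValueError("At least one base rate must be greater than zero.")
    return br_a / total, br_b / total


# Hypothetical setting: 10% of patients carry diagnosis A, 15% carry diagnosis B.
rbr_a, rbr_b = relative_base_rates(0.10, 0.15)
print(round(rbr_a, 2), round(rbr_b, 2))  # RBRs within the A-or-B subpopulation
```

In this illustration, two diagnoses that are each uncommon in the parent setting become fairly balanced once attention is restricted to cases with one diagnosis or the other.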
Use of Standard Scores and Cut Scores
I designed the CRSSs computed in the present study to be relatively intuitive to interpret. For example, a primary index CRSS of 10 in one diagnostic direction or the other indicates a score very similar to the average score obtained by members of that diagnostic group. This does not necessarily mean that one should use a score of 10 (or any other prespecified CRSS value) as a cut score for predicting diagnosis. Indeed, I do not advocate for using any inflexible cut scores to derive a diagnosis from these indices. Thoughtful interpretation of any of these indices should include comparison of an obtained score to reference group performance, incorporation of local RBR information, and consideration of relevant extra–test data. In this sense, the role of CRSSs is primarily descriptive: The primary indices indicate the extent to which a test taker’s responses distinctively resemble one diagnostic group or another, and the subindices provide a means of further interpreting ambiguous primary index scores. For those interested in discrete diagnostic classification, the classification probabilities are likely to provide a much better and more straightforward means of predicting diagnostic group membership, because they place an obtained score in the context of local RBRs. This is especially important when the RBRs for the target diagnoses are highly imbalanced, as in such a scenario the CRSS may be highly misleading and the true class probabilities may be unintuitive.
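The effect of imbalanced RBRs on class probabilities can be sketched in Python using Bayes' rule with a single index score. This is only an illustration under stated assumptions: the score distributions of the two groups are assumed to be normal with hypothetical means of +10 and −10 and a common standard deviation of 10, which is not a description of how the class-conditional distributions were estimated in the present study.

```python
from statistics import NormalDist

# Hypothetical class-conditional score distributions (assumed normal).
GROUP_A = NormalDist(mu=10, sigma=10)    # target diagnosis A
GROUP_B = NormalDist(mu=-10, sigma=10)   # contrast diagnosis B


def prob_group_a(score, rbr_a):
    """Posterior probability of diagnosis A given an index score and A's RBR."""
    weighted_a = rbr_a * GROUP_A.pdf(score)
    weighted_b = (1 - rbr_a) * GROUP_B.pdf(score)
    return weighted_a / (weighted_a + weighted_b)


# The same score (equal to group A's mean) under balanced and imbalanced RBRs.
for rbr in (0.50, 0.10):
    print(rbr, round(prob_group_a(10, rbr), 2))
```

Under these assumptions, a score equal to group A's average yields a high probability of A when the RBRs are balanced, but the probability of A stays below .50 when A's RBR is only .10, even though the score itself looks typical of group A.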
Inherent Diagnostic Uncertainty
Clinical diagnosis of psychological disorders is a probabilistic process, and the information used to inform such prediction (including scores on the present indices) generally contains substantial uncertainty. Use of these differential diagnostic indices provides no guarantee that a clinician will be able to cleanly and unambiguously place an examinee into one diagnostic bin or another. However, even in ambiguous cases, use of these indices provides a basis for quantifying and characterizing the nature and extent of the diagnostic ambiguity (e.g., identifying heterogeneity in reported problems or describing diagnostic probabilities in terms of both problems reported on the MMPI-2-RF and local base rates).
Integration With Dimensional Nosologies
Although the present differential diagnostic indices are keyed to categorical diagnostic constructs, the research literature increasingly indicates that psychopathology is much better understood in terms of overlapping dimensional constructs of varying specificities than in terms of diagnostic categories (see, e.g., Kotov et al., 2017). Each of the diagnoses targeted by the present indices can be thought of as a cluster, perhaps poorly defined, of correlated features localized around, but extending away from, a different point in the many-dimensional nomological network described by a dimensional nosology. These clusters overlap, of course, giving rise to the differential diagnostic problems that result when clinicians attempt to map a crude categorical system onto underlying dimensional constructs. When approached from this perspective, the purpose of the present indices is not to neatly place test takers into artificial diagnostic bins with fuzzy boundaries, but to approximately locate them in feature space along the line that intersects the centroids of the relevant clusters in the nomological network.
Future Directions
When a new measure is developed, its associations with a broad range of criteria are typically evaluated in order to appraise the measure’s validity. Arguably, given the highly circumscribed intended uses of the present indices, the need for such a broad range of criteria is greatly reduced, though at the very least the discriminative power of the indices should be established in additional settings. The classification probabilities produced by the naïve Bayes algorithm should also be evaluated in other settings, especially given that those probabilities were established in the present validation sample, without an additional independent sample on which to validate them. These classification probabilities can be evaluated by attending to their calibration in new settings (i.e., the extent to which the predicted probabilities map onto the probabilities of actual known group membership), which can be carried out by using the classification probabilities displayed in the Supplemental Materials in Tables S3-S35 (available online).
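One simple way to examine calibration in a new sample is to group cases by predicted probability and compare each group's mean predicted probability with its observed proportion of target diagnoses. The Python sketch below illustrates that idea; the function name, the equal-width binning scheme, and the small data set are assumptions made for illustration and are not drawn from the present study.

```python
def calibration_table(probs, outcomes, n_bins=5):
    """Compare predicted probabilities with observed outcomes by bin.

    probs: predicted probabilities of the target diagnosis (0 to 1).
    outcomes: 1 if the case actually has the target diagnosis, else 0.
    Returns one row per non-empty bin:
    (bin lower bound, bin upper bound, n, mean predicted, observed proportion).
    """
    bins = [[] for _ in range(n_bins)]
    for p, y in zip(probs, outcomes):
        idx = min(int(p * n_bins), n_bins - 1)   # p = 1.0 falls in the top bin
        bins[idx].append((p, y))

    rows = []
    for idx, members in enumerate(bins):
        if not members:
            continue                              # skip empty bins
        n = len(members)
        mean_pred = sum(p for p, _ in members) / n
        observed = sum(y for _, y in members) / n
        rows.append((idx / n_bins, (idx + 1) / n_bins, n, mean_pred, observed))
    return rows


# Hypothetical predicted probabilities and known diagnoses for eight cases.
probs = [0.1, 0.1, 0.3, 0.3, 0.7, 0.7, 0.9, 0.9]
outcomes = [0, 0, 0, 1, 1, 0, 1, 1]
for row in calibration_table(probs, outcomes):
    print(row)
```

Well-calibrated probabilities would show mean predicted values close to the observed proportions in every bin; systematic gaps would suggest that the probabilities need to be re-estimated for the new setting.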
Future research could focus on refining the present differential diagnostic indices, advancing the methods used to derive them, or creating new indices altogether. For example, the validity of the indices could perhaps be enhanced through use of higher quality diagnostic data (e.g., data derived from structured clinical interview rather than ordinary clinical intake). With appropriate statistical power and methods, more powerful indices could also potentially be derived from item-level test information rather than scale-level information, an approach that might reduce the “noise” introduced by less diagnostically relevant items contained in the standard MMPI-2-RF scales. Incorporating additional predictive information from other sources (e.g., clinician ratings, other tests) would also likely improve the indices’ predictive power, with the caveat that the generalizable utility of such indices would be limited by the availability of such information for new cases. Finally, the methods described here could also be easily applied to other predictive tasks or used to develop similar indices for other instruments.
Supplemental Material
Supplemental material (sj-docx-1-asm-10.1177_1073191120978797, sj-pdf-2-asm-10.1177_1073191120978797, and sj-r-3-asm-10.1177_1073191120978797) for Development and Initial Validation of Differential Diagnostic Indices for the MMPI-2-RF by William H. Menton in Assessment
Footnotes
Acknowledgements
The author wishes to gratefully acknowledge the contributions of Will Grove, who passed away on September 1, 2017. The fundamental concept underlying the project (i.e., developing empirically keyed differential diagnostic indices for the MMPI) was proposed by Dr Grove, who also provided a significant degree of guidance and statistical programming work during early stages of the project.
Declaration of Conflicting Interests
The author declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.
Funding
The author received no financial support for the research, authorship, and/or publication of this article.
Supplemental Material
Supplemental material for this article is available online.
Notes
References