Abstract
Across various structured diagnostic instruments, the criteria used to diagnose alcohol use disorder (AUD) are not assessed consistently. For example, different instruments often pose questions that reflect different thresholds of the underlying symptoms. We consider the criteria for craving and the inability to cut down or stop drinking to demonstrate the influence of using different thresholds for a positive symptom endorsement with respect to the estimated edges of a symptom network. Results indicate that the utilization of these differing thresholds leads to significant differences in edge weights. Generally, higher thresholds relate more strongly to lower prevalence rate criteria and the reverse for lower thresholds. These findings have implications for reproducibility of effects in symptom networks and their generalization across studies.
The diagnostic revolution in the ’60s and ’70s culminated in the widespread adoption of the Diagnostic and Statistical Manual of Mental Disorders (3rd ed.; DSM–III; American Psychiatric Association [APA], 1980). Although this ushered in an era of more explicit criteria definitions and structured interviews designed to assess them, there has been a general lack of appreciation of variability in how individual diagnostic criteria are operationalized across alternative diagnostic instruments. In the case of alcohol use disorder (AUD), and presumably other disorders as well, there is considerable variability across diagnostic instruments, such that a criterion that is low to middle in severity in one instrument can be a high-severity criterion in another (Lane, Steinley, & Sher, 2016). This might appear trivial at first glance and may often be overlooked as the researcher focuses on the final diagnoses or model characteristics. The specific way criteria are assessed, however, should be considered when interpreting the final results, as the reliability and validity of any results are predicated on our ability to precisely identify the disorder itself.
Variation in operationalization has received relatively little attention in the empirical literature, although Regier et al. (1998) discussed the general need for improved standardization of assessment methods 20 years ago. More recently, it was demonstrated that for the diagnosis of a major depressive episode (MDE), varying the threshold of the “depressed mood all day” criterion, which is required for the diagnosis, produces very different prevalence rates (Karlsson, Marttunen, Karlsson, Kaprio, & Hillevi, 2010). In this case, the threshold was defined by setting the cutoff for the criterion at different response levels to the same question. Similarly, different diagnostic instruments for AUD operationalize the descriptors “recurrent” or “repeatedly” as reflecting two or more (e.g., Alcohol Use Disorder and Associated Disabilities Interview Schedule 5, or AUDADIS-5; Grant, Goldstein, Smith, et al., 2015) or three or more times (Semi-Structured Assessment for the Genetics of Alcoholism, or SSAGA; Bucholz et al., 1994).
Beyond using the duration and frequency of specific symptoms, the severity of criterion thresholds can vary according to the symptoms used to operationalize them. For example, one can vary the threshold for the presence/absence of a diagnostic criterion by using multiple, binary items that exhibit properties of a Guttman scale. Guttman scales utilize a series of binary questions to assess a single construct. The notable feature is that items are (largely) nested within each other successively (e.g., a “yes” for Question 2 implies the answer to Question 1 was also “yes”). In the context of symptoms of psychopathology, this could be viewed as degrees of severity. The implication of this definitional variability for nosological research on symptom structure is unknown, but should it be influential, the generalizability of estimates of relations across studies employing different assessments is called into question.
AUD
The assessment of AUDs provides an informative case to explore the relevance of criterion thresholds. In a recent meta-analysis, Lane et al. (2016) found considerable variability in the rank order of Item Response Theory (IRT) thresholds across 34 published studies (30 unique data sets), with the largest source of identifiable variance attributable to the diagnostic interview employed in the study. Some of this variation is likely due to how different interviews assess the extent to which a given symptom is considered “repeated” or “recurrent” (e.g., two or more times, three or more times), some possibly due to the number of questions used to assess the symptom and some due to the actual wording of the item.
In many survey instruments, each AUD criterion is assessed using responses from multiple items. In some instances, the items included are distinct components of the criteria. For example, one criterion asks if an individual has given up activities in order to drink. This is measured in the AUDADIS-IV and AUDADIS-5 (Grant, Dawson, & Hasin, 2001; Grant, Goldstein, Smith, et al., 2015), asking “Did you ever . . .”:
Give up or cut down on activities that were important to you in order to drink—like work, school, or associating with friends or relatives?
Give up or cut down on activities that you were interested in or that gave you pleasure in order to drink?
These two questions assess different aspects of the same criterion, and one may exist without the other. However, there are other criteria whose subitems appear to assess different severities of the same symptom. Though not technically Guttman scales in structure, a number of items assessing AUD might be termed “Guttman-like” in that, for the most part, a positive response to the more severe item implies a positive response to the less severe one. The added severity information from these Guttman-like questions is lost, however, when the scoring requires simply “one or the other.” Final estimates of AUD severity are assessed using the total number of criteria met, ignoring the implicit severity of the individual items within the subset of items assessing each criterion. Scoring algorithms aside, it is this quality of the AUDADIS that permits an examination of the extent to which the threshold for determining the presence or absence of an individual symptom/criterion can affect analyses. The subitems of two AUD criteria in particular provide excellent opportunities for exploring the significance and impact of variation in criterion threshold: alcohol craving and the inability to cut down on alcohol use. The National Epidemiological Surveys on Alcohol and Related Conditions, NESARC Wave 2 (Grant, Kaplan, Moore, & Kimball, 2007) and NESARC III (Grant, Goldstein, Saha, et al., 2015), provide large samples with which to test the differences between these AUDADIS items. Prior to analyzing these data, we describe the two criteria in detail.
Craving
The craving criterion is defined by the DSM–5 as “craving, or a strong desire or urge to use alcohol” (APA, 2013, pp. 490–491). In diagnostic interviews, however, the wording of the actual questions for this criterion varies. There exist larger scales used to assess craving, such as the 14-item Desires for Alcohol Questionnaire (DAQ; Kramer et al., 2010; Love, James, & Willner, 1998), which includes a range of phrases such as “I wanted a drink so much I could almost taste it” and “I thought drinking was pleasant.” Larger, comprehensive surveys more often use fewer items. For example, in the Composite International Diagnostic Interview 3.0 (CIDI 3.0; Kessler et al., 2013), craving is assessed by the single item “Was there ever a time in your life when you often had such a strong desire to drink that you couldn’t stop yourself from taking a drink or found it difficult to think of anything else?” Similarly, in the Semi-Structured Assessment for the Genetics of Alcoholism (SSAGA; Bucholz et al., 1994), craving is assessed by “In situations where you couldn’t drink, did you ever have such a strong desire for it that you couldn’t think of anything else?” In contrast, for someone to endorse craving on the AUDADIS, one needs only to endorse one of two items asking “Did you ever . . .”:
Feel a very strong desire or urge to drink?
Want a drink so badly that you couldn’t think of anything else?
Here, the first statement is arguably a lower threshold of craving. The two craving items thus reflect a Guttman-like quality, where Statement 1 could be seen as a lower threshold and a prerequisite to Statement 2, which is indeed reflected in the endorsement rates. The low-threshold craving item has a prevalence rate of 4.01% in NESARC Wave 2 and 10.69% in NESARC III, and the high-threshold item has corresponding prevalence rates of 0.83% and 2.14%. In NESARC Wave 2 and NESARC III, respectively, 80.87% and 89.25% of those who endorse the high-threshold item also endorse the low-threshold item. Correspondingly, only 16.65% and 17.86% of those who endorse the low-threshold item also endorse the high-threshold item.
Cut down
The inability to cut down or stop drinking (cut down) provides a second example of a diagnostic criterion assessed with statements that differ in severity in the NESARC studies. Cut down was included in the DSM–IV as one of the alcohol dependence criteria and was retained in the DSM–5. It is defined by the DSM–5 as “a persistent desire or unsuccessful efforts to cut down or control alcohol use” (APA, 2013, p. 490). The “persistent” component of this implies multiple occurrences of the symptom without a strictly defined count, and this is reflected in its varying assessments. For example, the CIDI 3.0 asks, “Were there times when you tried to stop or cut down on your drinking and found that you were not able to do so?” (Kessler et al., 2013). The SSAGA-I instrument asks, “Have you 3 or more times wanted to stop or cut down on drinking?” followed by “Were you always able to stop or cut down when you wanted to?” Here, the criterion is met if the answer to the first is “yes” or the answer to the second is “no” (Bucholz et al., 1994).
In the National Longitudinal Alcohol Epidemiology Survey (NLAES; Grant, Peterson, Dawson, & Chou, 1994), which used an earlier version of the AUDADIS, the persistence of the cut down items is assessed by first asking if the experience had ever happened and then asking how many times. Individuals qualified for this criterion by either endorsing both statements or reporting multiple occasions of either in the follow-up question. Both the AUDADIS-IV and -5 collapse the statement and its follow-up into single items by asking “Did you ever . . .”:
More than once want to stop or cut down on your drinking?
More than once TRY to stop or cut down on your drinking but found you couldn’t do it?
As with the other criteria, cut down is scored positively if either item is endorsed (and all respondents are asked both questions). Although, in the structure of the interview, these items are not strictly Guttman scales, Statement 1 does appear to measure a lesser degree of the criterion, and logically it could be argued that in most cases a person would want to cut down before they tried. This is reflected in the endorsement rates in the NESARC data. The low-threshold cut down item has a prevalence rate of 12.31% in NESARC Wave 2 and 13.43% in NESARC III, and the high-threshold item has corresponding prevalence rates of 2.69% and 3.46%. Of those who report the high-threshold cut down item, 86.55% for NESARC Wave 2 and 88.21% for NESARC III report the low-threshold item as well, demonstrating that it rarely appears alone. On the other hand, only 18.89% and 22.73% of those who positively endorse the low-threshold item also endorse the high-threshold item.
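The Guttman-like quality described for the craving and cut down items can be checked from a simple cross-tabulation of the two subitems. The following is a minimal Python sketch, not the authors' code (the study itself used R); the function name `guttman_like_summary` and the toy responses are hypothetical and are not NESARC data.

```python
def guttman_like_summary(low, high):
    """Summarise how nested two binary items are.

    low, high: equal-length sequences of 0/1 responses to the
    low-threshold and high-threshold item (hypothetical inputs).
    """
    if len(low) != len(high):
        raise ValueError("items must have the same number of respondents")
    n = len(low)
    n_low = sum(low)
    n_high = sum(high)
    n_both = sum(1 for lo, hi in zip(low, high) if lo == 1 and hi == 1)
    return {
        "prev_low": n_low / n,
        "prev_high": n_high / n,
        # Share of high-threshold endorsers who also endorse the low item;
        # values near 1 suggest Guttman-like nesting.
        "low_given_high": n_both / n_high if n_high else float("nan"),
        # Share of low-threshold endorsers who also endorse the high item.
        "high_given_low": n_both / n_low if n_low else float("nan"),
    }


# Toy responses invented for illustration only (not NESARC data).
low_item = [1, 1, 1, 1, 1, 0, 0, 0, 0, 0]
high_item = [1, 0, 0, 0, 0, 0, 0, 0, 0, 0]
summary = guttman_like_summary(low_item, high_item)
```

A high value of `low_given_high` together with a low value of `high_given_low` corresponds to the pattern reported for both criteria: the high-threshold item rarely appears without the low-threshold item, whereas the reverse is common.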
Network analysis
To examine these effects, we utilize an emerging modeling tool for epidemiological studies of psychopathological symptoms: network analysis. This is of particular use here because of its focus on the symptoms or criteria themselves. Network models of psychological symptoms are increasingly utilized as they allow the researcher an alternative to the classical latent variable models of disorders (Fried et al., 2017; Hofmann, Curtiss, & McNally, 2016). In these networks, symptoms interact with and influence each other rather than reflecting manifestations of an underlying (latent) disorder (Borsboom & Cramer, 2013). Generally, a network is composed of a set of nodes, here representing the symptoms or criteria of interest, and a set of edges that describe how strongly the nodes relate to each other. Unlike traditional networks where the edges are observed (for a review of traditional social networks, see Wasserman & Faust, 1994), in symptom or criteria networks they must be inferred from data. This methodology has already been applied to many studies of specific psychiatric disorders such as substance use disorders (e.g., Anker et al., 2017; Rhemtulla et al., 2016) or depression (e.g., Boschloo, van Borkulo, Borsboom, & Schoevers, 2016).
With the focus on the symptoms themselves, these models provide a framework with which to consider diagnostic criteria individually, as opposed to focusing on the larger diagnosis. This can be done by considering each of an individual criterion’s connections within the network. The presence of an edge between two criteria indicates that they remain associated with each other after the other criteria in the network are taken into account. Within the realm of network analysis, there also exist specific measures, called centrality statistics, used to evaluate the role of a node within an entire network. Centrality is a measure of importance, and in a network, this can take multiple forms. The three indices commonly used to study symptom networks are betweenness, closeness, and strength. Betweenness centrality is calculated by counting how often each node lies on the shortest paths between all pairs of other nodes in the network, giving higher scores to those that are intermediaries more often. Closeness centrality describes how close, on average, the given node is to any other node in the graph. It is calculated for each node as the inverse of the sum of the shortest-path distances from that node to all other nodes. The strength of a node, the weighted analogue of its degree, is simply the sum of the weights of its adjacent edges.
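To make the three indices concrete, the sketch below computes them for a small weighted network in plain Python. It rests on assumptions that are not stated in the text: strength is taken as the sum of absolute edge weights, the distance along an edge is taken as the inverse of its absolute weight (a common convention for weighted networks), and betweenness simply counts the node pairs whose shortest path passes through a node, which matches the usual definition only when shortest paths are unique. The function name and the three-node example are hypothetical.

```python
from itertools import combinations


def centrality(w, tol=1e-12):
    """Strength, closeness, and betweenness for a symmetric weight matrix.

    w: list of lists with w[i][j] the edge weight between nodes i and j
    (0 means no edge). Assumes at least two nodes.
    """
    n = len(w)
    inf = float("inf")

    # Strength: sum of absolute weights of the edges adjacent to each node.
    strength = [sum(abs(w[i][j]) for j in range(n) if j != i) for i in range(n)]

    # Edge length = 1 / |weight|, so stronger edges are "shorter".
    d = [[0.0 if i == j else (1.0 / abs(w[i][j]) if w[i][j] != 0 else inf)
          for j in range(n)] for i in range(n)]

    # Floyd-Warshall all-pairs shortest-path distances.
    for k in range(n):
        for i in range(n):
            for j in range(n):
                if d[i][k] + d[k][j] < d[i][j]:
                    d[i][j] = d[i][k] + d[k][j]

    # Closeness: inverse of the summed shortest-path distances to all others
    # (0.0 for a node that cannot reach every other node).
    closeness = [1.0 / sum(d[i][j] for j in range(n) if j != i) for i in range(n)]

    # Betweenness: number of node pairs whose shortest path runs through v.
    betweenness = [0] * n
    for s, t in combinations(range(n), 2):
        if d[s][t] == inf:
            continue
        for v in range(n):
            if v not in (s, t) and abs(d[s][v] + d[v][t] - d[s][t]) <= tol:
                betweenness[v] += 1

    return {"strength": strength, "closeness": closeness, "betweenness": betweenness}


# Hypothetical three-node network: nodes 0 and 2 are only weakly tied directly.
weights = [
    [0.0, 0.5, 0.1],
    [0.5, 0.0, 0.5],
    [0.1, 0.5, 0.0],
]
stats = centrality(weights)
```

In this toy network the middle node has the highest value on all three indices, because the shortest route between the two outer nodes runs through it rather than along their weak direct edge.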
The Present Study
In this article, we are interested in examining the effect of the different thresholds of a criterion on the estimated edges and centrality estimates of an AUD criteria network. Utilizing network estimation techniques allows us to characterize the way the remaining criteria differentially relate to different thresholds of the criterion in question. Finding that estimates of various network parameters are affected by the severity of the criteria would have important implications for determining the likely reproducibility of research across instruments, as the wording utilized varies.
In the AUDADIS interviews used for NESARC Wave 2 and NESARC III, we focus on the craving and cut down items as they reflect the Guttman-like quality discussed above. Although counted equally in the scoring of the data (endorsement of either leads to meeting the criterion), the items could be expected to identify people with very different AUD severities. Because of this, we expect them individually to relate differently to other symptoms in the network. This allows us to examine how the threshold used to determine presence/absence of a symptom might affect the structure of the criteria used to make a given diagnosis and to investigate the effect of varying the threshold for determining the presence/absence of a symptom on both generalizability and reproducibility.
Method
Samples
Two NESARC data sets were included in this analysis. These are nationally representative samples of civilian, noninstitutionalized adults in the U.S. population conducted by the National Institute on Alcohol Abuse and Alcoholism (NIAAA). NESARC Wave 2 was a follow-up to the original survey, taken in 2004–2005 (n = 34,653; Grant et al., 2007). NESARC III was conducted in 2012–2013 with a new sample of n = 36,309 (Grant, Goldstein, Saha, et al., 2015). Participants under the age of 21 were removed from the sample, 24 from Wave 2 and 1,597 from III. Of this sample, only those who reported they had consumed at least one drink of alcohol in the past year were included, leaving n = 22,160 and n = 24,773 observations, respectively. In this final sample, the average age of participants was 45.88 (σ = 15.89) for Wave 2 and 44.40 (σ = 16.03) for III. The sample from NESARC Wave 2 consisted of 46.86% males and was 63.35% White, 15.44% Black, and 17.36% Hispanic. The sample from NESARC III was 46.57% male and 56.03% White, 19.74% Black, and 18.57% Hispanic.
Measures
The DSM–5 defines 11 criteria for AUD: larger amounts/longer time spent drinking than intended (larger/longer), inability to cut down (cut down), excessive time spent drinking or recovering from drinking (time spent), craving, failure to fulfill major roles and obligations (role interference), use despite causing social/interpersonal problems (social problems), giving up activities in order to drink (give up activities), getting involved in risky situations while or after drinking (hazardous use), use despite causing physical/psychological problems (phys/psych problems), tolerance, and withdrawal. The AUDADIS-IV (Grant et al., 2001) provides the wording for the questions used to assess these criteria in the NESARC Wave 2 survey, and the updated AUDADIS-5 (Hasin et al., 2015) was used for NESARC III. The AUDADIS is a fully structured diagnostic interview that utilizes the DSM–IV criteria for mental disorders, with the focus on AUD. In this study, only the past year occurrences of the criteria are considered. The unweighted sample prevalence rates (for past year drinkers aged 21 and older) can be found in Table 1.
Table 1. Prevalence Rates and Results of the Analyses
Note: The columns labeled “Estimate” contain the edges estimated using the entire sample. The “Bootstrap” columns contain the mean and standard deviation of the bootstrap estimates of each edge. The “Difference” columns are calculated by subtracting the edge estimated for the high threshold from the edge estimated for the low threshold.
p < .005 for the t test of the difference in bootstrapped edges.
Network estimation
The eLasso algorithm creates networks from binary data and can be performed using the IsingFit package in R (van Borkulo, Epskamp, & van Borkulo, 2016). A full description is given by van Borkulo et al. (2014), but here we briefly summarize the process. It begins with the set of binary variables representing the criteria, each indicating the presence or absence of a criterion in each participant. Selecting one criterion at a time, a regularized logistic regression model is fit using the remaining criteria as the predictor variables. The resulting regression coefficients are used as the edge weights in the final network. To create a sparse network, where not all edges are present between pairs of criteria, a Lasso estimator is used to reduce weights to zero where conditional independence is implied between criteria. Finally, to create a symmetric matrix, the estimates for each pair are averaged (e.g., the edge between cut down and larger/longer is the average of the coefficient where cut down is the dependent variable and the coefficient where larger/longer is the dependent variable).
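The final symmetrization step can be illustrated with a short Python sketch. It does not reproduce the IsingFit implementation: the coefficient matrix is hypothetical, and the optional `and_rule` argument, which keeps an edge only when both directional coefficients are nonzero, is an assumption added for illustration rather than something described above.

```python
def symmetrize(coef, and_rule=True):
    """Average the two directional coefficients for each pair of criteria.

    coef[i][j] is the (hypothetical) regularized logistic regression
    coefficient of predictor j in the model with criterion i as outcome.
    With and_rule=True an edge is kept only if both coefficients are nonzero.
    """
    n = len(coef)
    edges = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            a, b = coef[i][j], coef[j][i]
            if and_rule and (a == 0 or b == 0):
                weight = 0.0
            else:
                weight = (a + b) / 2
            edges[i][j] = edges[j][i] = weight
    return edges


# Hypothetical coefficients for three criteria (rows are outcomes).
coef = [
    [0.0, 0.8, 0.0],
    [0.6, 0.0, 0.3],
    [0.2, 0.5, 0.0],
]
edges_and = symmetrize(coef, and_rule=True)
edges_avg = symmetrize(coef, and_rule=False)
```

The two settings differ only for pairs in which the Lasso penalty has shrunk one of the two coefficients to zero, as for the first and third criteria in this example.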
Analysis
Four networks were estimated for each data set. Networks were created by altering one criterion at a time: when we study the differences in the craving threshold, the cut down criterion is held constant, and when we study the differences in the cut down threshold, the craving criterion is held constant. A criterion that is held constant is scored with the original rule, which counts it as present if the respondent indicates that either or both of the statements are true.
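The four scoring variants can be written down compactly. The following Python sketch uses hypothetical function names and toy responses; it only constructs the craving and cut down columns that would enter each of the four networks, with the remaining nine criteria assumed to be passed through unchanged.

```python
def score(low, high, rule):
    """Score one criterion from its two binary subitems."""
    if rule == "either":  # original scoring: either statement endorsed
        return [int(a == 1 or b == 1) for a, b in zip(low, high)]
    if rule == "low":     # low-threshold statement only
        return list(low)
    if rule == "high":    # high-threshold statement only
        return list(high)
    raise ValueError(f"unknown rule: {rule}")


def build_variants(craving_items, cutdown_items):
    """Return the craving and cut down columns for the four networks.

    Each *_items argument is a (low, high) pair of 0/1 response lists.
    """
    variants = {}
    for rule in ("low", "high"):
        # Vary craving, hold cut down at the original scoring.
        variants[f"craving_{rule}"] = {
            "craving": score(*craving_items, rule),
            "cut_down": score(*cutdown_items, "either"),
        }
        # Vary cut down, hold craving at the original scoring.
        variants[f"cut_down_{rule}"] = {
            "craving": score(*craving_items, "either"),
            "cut_down": score(*cutdown_items, rule),
        }
    return variants


# Toy responses for four respondents (invented for illustration).
craving_items = ([1, 1, 0, 0], [1, 0, 1, 0])  # (low, high)
cutdown_items = ([1, 0, 1, 0], [0, 0, 1, 0])  # (low, high)
variants = build_variants(craving_items, cutdown_items)
```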
In terms of edge weights, it seems unlikely that the changes in one criterion will extend much beyond its adjacent edges. Because of the pairwise nature of the estimation algorithm and the large sample, these remaining edges, between criteria not modified, are expected to be fairly stable. Thus, there are 10 potential edges of interest for each of the criteria being modified: each describing the strength of the relationship (or lack thereof) between the cut down or craving criterion and the remaining criteria. While the differences in the estimated relations can be observed by considering the correlation of the edge weights, a larger number of observations is required to provide meaningful confidence intervals for the differences. A bootstrapping technique is employed for this purpose. The procedure begins by taking a random sample of the data with replacement (with the sample size of n = 10,000). From this sample, both the high (e.g., using the high-threshold item) and the low (e.g., using the low-threshold item) networks are estimated for a criterion. The data are resampled and the networks estimated 100 times, providing a large enough sample of estimates for inference. It gives us 100 estimates of each of the edges to conduct statistical analyses and provide meaningful confidence intervals on the correlations, as well as t tests for the differences in specific criterion edges. For the correlations, the intraclass correlation coefficient (ICC, with 95% confidence intervals) for random raters is used (Shrout & Fleiss, 1979). All node centrality measures (betweenness, closeness, and strength) are calculated within the R package qgraph (Epskamp, Cramer, Waldorp, Schmittmann, & Borsboom, 2012).
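The resampling procedure can be sketched as follows. This is an illustration under stated assumptions rather than the analysis code: the phi coefficient stands in for an eLasso edge weight so that the example needs only the Python standard library, the data are simulated, and the t statistic is a one-sample test on the paired differences (low minus high), which is one plausible reading of the t tests described above. In the study the resulting p values are compared against a Bonferroni-adjusted level of .05/10 = .005.

```python
import random
import statistics


def phi(x, y):
    """Phi coefficient between two 0/1 lists (stand-in for an eLasso edge)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum(a * b for a, b in zip(x, y)) / n - mx * my
    den = (mx * (1 - mx) * my * (1 - my)) ** 0.5
    return cov / den if den > 0 else 0.0


def bootstrap_edge_difference(low, high, other, n_boot=100, sample_size=None, seed=0):
    """Bootstrap the difference (low minus high) in one edge estimate."""
    rng = random.Random(seed)
    n = len(other)
    m = sample_size or n
    diffs = []
    for _ in range(n_boot):
        # Resample respondents with replacement; both networks use the same draw.
        idx = [rng.randrange(n) for _ in range(m)]
        o = [other[i] for i in idx]
        edge_low = phi([low[i] for i in idx], o)
        edge_high = phi([high[i] for i in idx], o)
        diffs.append(edge_low - edge_high)
    mean = statistics.fmean(diffs)
    sd = statistics.stdev(diffs)
    t = mean / (sd / n_boot ** 0.5) if sd > 0 else float("inf")
    return {"diffs": diffs, "mean": mean, "sd": sd, "t": t}


# Simulated toy data (not NESARC): the high item is nested within the low item.
gen = random.Random(1)
other = [int(gen.random() < 0.3) for _ in range(500)]
low = [int((o == 1 and gen.random() < 0.8) or gen.random() < 0.1) for o in other]
high = [int(lo == 1 and gen.random() < 0.3) for lo in low]

result = bootstrap_edge_difference(low, high, other, n_boot=100, seed=42)
```

Estimating both networks from the same resample keeps the two edge estimates paired, so the variability of the difference reflects the change in threshold rather than differences between resamples.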
Results
As the relationships in these networks are pairwise, the rest of the network is not appreciably changed by the modification of one criterion. This is reflected in the correlations using all the estimated relations in the network (as the network is symmetric, there are 55 total edges estimated). For NESARC Wave 2, the corresponding edges of the networks using the low versus high threshold of the craving criterion correlated ICC = .93 (95% confidence interval = [.89, .96]), and the cut down networks correlated ICC = .96 (95% confidence interval = [.93, .97]). For NESARC III, the craving networks correlated ICC = .89 (95% confidence interval = [.82, .94]) and the cut down networks ICC = .98 (95% confidence interval = [.96, .99]).
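For readers unfamiliar with the intraclass correlation, the sketch below computes one of the Shrout and Fleiss (1979) coefficients, ICC(2,1), in plain Python. Which of the random-rater forms was used in the analysis is not specified above, so the choice of ICC(2,1) is an assumption; in the present setting the "targets" would be the edges and the two "raters" would be the low- and high-threshold networks. The example data are the six-target, four-judge ratings from Shrout and Fleiss, for which the published ICC(2,1) is .29.

```python
def icc_2_1(ratings):
    """ICC(2,1): two-way random effects, single rater, absolute agreement.

    ratings: list of rows, one per target, each with one value per rater.
    """
    n = len(ratings)       # number of targets (e.g., edges)
    k = len(ratings[0])    # number of raters (e.g., networks)
    grand = sum(sum(row) for row in ratings) / (n * k)
    row_means = [sum(row) / k for row in ratings]
    col_means = [sum(row[j] for row in ratings) / n for j in range(k)]

    ss_rows = k * sum((m - grand) ** 2 for m in row_means)
    ss_cols = n * sum((m - grand) ** 2 for m in col_means)
    ss_total = sum((x - grand) ** 2 for row in ratings for x in row)
    ss_err = ss_total - ss_rows - ss_cols

    msr = ss_rows / (n - 1)              # between-targets mean square
    msc = ss_cols / (k - 1)              # between-raters mean square
    mse = ss_err / ((n - 1) * (k - 1))   # residual mean square
    return (msr - mse) / (msr + (k - 1) * mse + k * (msc - mse) / n)


# Example ratings from Shrout and Fleiss (1979): 6 targets, 4 judges.
shrout_fleiss = [
    [9, 2, 5, 8],
    [6, 1, 3, 2],
    [8, 4, 6, 8],
    [7, 1, 2, 6],
    [10, 5, 6, 9],
    [6, 2, 4, 7],
]
icc_example = icc_2_1(shrout_fleiss)

# Two identical "networks" agree perfectly.
icc_identical = icc_2_1([[0.1, 0.1], [0.4, 0.4], [0.0, 0.0], [0.7, 0.7]])
```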
There is some variation in the percentages of the sample who meet criteria for AUD by reporting two or more criteria. When only the low-threshold items are used, the rates are roughly the same, but a slight decrease can be seen where the high-threshold items are used. In NESARC Wave 2, 15.5% of the sample met criteria for AUD when the original “either or” scoring was used. For both craving and cut down, the rate under the low threshold was the same, but using only the high craving threshold, the percentage drops to 15.1%, and using only the high cut down threshold, it drops to 13.1%. In NESARC III, 19.2% of the sample met criteria for AUD, and moving to the low threshold only for both items reduces this to 19.1%. Using only the high-threshold craving and cut down items, the percentage drops to 17.8% and 17.2%, respectively.
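The dependence of the diagnosis rate on the scoring of a single criterion follows directly from the "two or more criteria" rule, as the short Python sketch below illustrates. The function names, the column order (taken from the list of the 11 criteria in the Measures section), and the four toy respondents are hypothetical.

```python
CRAVING = 3  # position of craving among the 11 criteria (order assumed)


def aud_rate(rows, min_criteria=2):
    """Share of respondents endorsing at least `min_criteria` criteria."""
    positive = sum(1 for row in rows if sum(row) >= min_criteria)
    return positive / len(rows)


def swap_column(rows, col, new_values):
    """Return a copy of `rows` with one criterion column rescored."""
    return [row[:col] + [v] + row[col + 1:] for row, v in zip(rows, new_values)]


# Toy respondents scored with the original "either" rule (not NESARC data).
either_scored = [
    [1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0],  # larger/longer + craving
    [0, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0],  # cut down + time spent + craving
    [0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0],  # hazardous use only
    [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],  # no criteria
]
# Only the second respondent endorses the high-threshold craving item.
high_craving = [0, 1, 0, 0]

rate_either = aud_rate(either_scored)
rate_high = aud_rate(swap_column(either_scored, CRAVING, high_craving))
```

Respondents who sit exactly at two criteria and endorse only the low-threshold statement are the ones who lose the diagnosis when the high threshold is required, which is why the observed decreases are modest.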
Looking specifically at the edges related to the modified criteria, we begin to see the impact of the varying criteria thresholds. Figure 1 depicts the networks estimated from the full sample, and the relevant edges are colored in blue. Here, the changes in the thickness and darkness of the edges between the networks using the high and low threshold are notable. Table 1 contains the estimates of the network relations from the full data as well as the means and standard deviations of the bootstrapped estimates and the difference between the high and low thresholds. The edges in the table are ordered by the prevalence rates of the criteria, averaged over both data sets, in descending order. A broad trend can be seen here in the differences between the high- and low-threshold edges for both modified criteria: The more prevalent criteria tend to have stronger connections to the low-threshold items, and the rarer criteria are more strongly associated with the high-threshold items.

Figure 1. Graphs of the networks created using the qgraph package (Epskamp et al., 2012). The nodes are placed using the Fruchterman and Reingold (1991) layout method.
Considering the modification of the craving criterion, the edges of the low- and high-threshold networks are correlated ICC = .25 (95% confidence interval = [–.47, .75]) for NESARC Wave 2 and ICC = –.09 (95% confidence interval = [–.69, .56]) for NESARC III. These correlations indicate little similarity in the estimated edges, but the confidence intervals reflect a considerable amount of uncertainty for these estimates (based on only 10 values). For this reason, we consider t tests of the bootstrapped edges (with a Bonferroni correction of the significance level for 10 estimates). Except for social problems in NESARC III, the t tests for the differences are all significant. The largest increases from the high to the low threshold in both data sets are the edges to the three most prevalent criteria: larger/longer, cut down, and hazardous use. Conversely, the largest decreases (indicating a stronger relation to the high-threshold item) in both data sets are found in the least prevalent items: give up activities and role interference.
The correlations for the edges estimated with different thresholds for cut down are ICC = .74 (95% confidence interval = [.27, .93]) for NESARC Wave 2 and ICC = .89 (95% confidence interval = [.62, .97]) for NESARC III. In the NESARC Wave 2 data, the results continue the general trend mentioned before, with stronger relations between the low-threshold item and the more prevalent criteria, and with all differences significant except the edges to craving, phys/psych problems, and role interference. In NESARC III, the trend in the sign of the changes is less clear. The edges to withdrawal, craving, and social problems are not significantly different. We do see the same trend for the prevalent hazardous use criterion, which relates more strongly to the lower threshold item, and the rarer give up activities criterion is again more strongly related to the higher threshold cut down statement.
The centrality measures showed little variation between the networks overall. This was anticipated, as the centrality measures would be expected to be fairly stable when using the same data with only one node varying at a time. Though small, the difference does appear to always favor low-threshold items as being more central, as can be seen in the estimated centrality statistics in Table 2. A better evaluation of the effects of symptom thresholds on centrality measures would involve permuting the thresholds of all criteria systematically and estimating the distribution of each centrality parameter across these permutations. We are prevented from doing this, however, in that in the AUDADIS, there are not multiple thresholds for each criterion that would display the Guttman-like qualities needed.
Table 2. The Centrality Statistics of the Nodes for the Symptoms Craving (Crave) and Inability to Cut Down (Cut Down)
Note: The unstandardized centrality statistics are always lower (or equal) for the criteria calculated using the higher threshold.
Discussion
Since the dawn of the modern diagnostic era and the introduction of DSM–III, research on psychiatric diagnosis at the formal diagnostic level, the latent variable level, and the network level of analysis appears to tacitly assume that, for all intents and purposes, the diagnostic instrument employed is not a critical factor in research. Although the sequence of steps involved in translating a set of diagnostic criteria into a structured interview has been thoughtfully characterized (Robins & Cottler, 2004), research on differences across diagnostic instruments has tended to focus on the syndromal level (assessed categorically or by criteria counts; e.g., Grant, Goldstein, Smith, et al., 2015) and not the symptom/criterion level (Samet, Waxman, Hatzenbuehler, & Hasin, 2007). Lane et al.’s (2016) demonstration of large variation in the relative thresholds of different AUD criteria across diagnostic instruments highlights the potential importance of considering individual criterion or symptom thresholds when examining symptom network structure. Consequently, examining the robustness of symptom networks across alternative operationalizations of symptoms would seem to be a high priority for assessing the reproducibility of findings and for having confidence in their generalization.
For both cut down and craving, when we vary the threshold for determining its presence or absence, we observe different patterns of interaction with the other symptoms of AUD. The lower threshold craving statement has a much higher prevalence rate, and when it is used in network analyses, the relationships tend to be stronger with other criteria that are more common. When only the higher threshold craving statement is used, the criterion becomes much rarer and the relationships with other low prevalence rate criteria are strengthened, while ties to the more common items are weakened. This trend might be expected given the nature of the high and low thresholds. The high-threshold items indicate a more severe degree of the criterion, which likely reflects a more severe disorder. Giving up activities, a rarer criterion, characterizes someone who is choosing alcohol over other important or pleasurable activities. This would reflect a greater degree of impact on their lives than the more common criterion, hazardous use, which identifies those whose drinking has led them into risky situations. As demonstrated in our analyses, giving up activities is more closely associated with both higher thresholds of cut down and craving, and hazardous use more associated with lower.
While our primary motivation for undertaking the analyses reported here was methodological, it is important to consider that there may be substantive interpretations to these findings. That is, the true relational structure of symptoms may depend upon the severity of the symptom as operationalized. Low-threshold Symptom X_L might only be related to Symptom Y because of its association with a high-threshold manifestation of the same symptom, X_H; that is, there is no “direct association” between X_L and Y if the path from X_L to X_H is modeled. For network models like those in this study, such potential issues could be addressed by modeling the networks with alternative methods that do not limit the analyses to binary data (e.g., the polychoric correlation graphs in qgraph, or mixed graphical models, MGM; Haslbeck & Waldorp, 2015). However, most research utilizes the binary outcome variables to assess criteria and disorders, implicitly assuming that alternative operationalizations of a criterion are functionally interchangeable. Additionally, symptom-level bias in the assessment of criteria associated with a range of covariates (e.g., differential item functioning) could distort estimates of edge weights and centrality measures. These considerations highlight the importance of not taking symptom and criterion assessments at face value and of considering both the severity of symptoms overall and population mixtures that could differentially bias such severity.
Until now we have only considered the differences between relative thresholds of the examined criteria. The identification of a Guttman-like variable was accomplished using the cross-tabulations of the subitems, finding cases where there are a large number of people who report the “low threshold” only and very few who report only the “high threshold.” However, these interview items were not developed as true Guttman scales, and in the cases examined here, we only have two levels for comparison. What happens between these two “thresholds” is unknown, and thus, the findings here could reflect either categorical or continuous differences in the “severity” of the symptoms. For example, the actual attempts to cut down on drinking, and particularly their subsequent failures, might have social consequences that the desire alone would not have. Indeed, in our analysis, the social/interpersonal problems criterion was more strongly associated with the higher threshold cut down item. This might indicate that the two subitems are reflecting categorically different symptoms. If the underlying construct is continuous, however, this might reflect instead a critical threshold of sorts where the increasing severity of one symptom gets to a point where it begins to cause other problems. While our analyses are limited in what we can say about these possibilities, they indicate the potential for further investigation into the individual criterion definitions.
Extending beyond the comparison here, it is important to highlight that there are potentially numerous differences across instruments, each of which could affect the type of finding reported here or findings as basic as prevalence. It has been previously demonstrated that a simple change in the structure of the survey likely resulted in an increase of lifetime AUD prevalence from 18% to 30%, largely because of how it affected responding to the hazardous use item (Vergés, Littlefield, & Sher, 2011). Other survey instruments assess the same criteria using entirely different wording. The results presented here reinforce the need for a more critical look at variation across diagnostic instruments and at its implications for generalization across studies, beyond its effect on obtained network structures.
While we demonstrated the effect of different AUD criteria wording using network analysis, it is likely that the same issues will plague other multivariate methods that are used to analyze multidimensional binary data sets. For instance, it is highly likely that the thresholding would have an influence on both the number of classes and the class structure of a latent class analysis. Additionally, varying the thresholds would influence the representation achieved via correspondence analysis. Likewise, we would expect sensitivity in the results of factor models that use different thresholds to operationalize the criteria. As such, this should serve as a call to take care in how criteria are assessed and operationalized, and it further supports the notion that sophisticated models are not able to “fix” shortcomings at the assessment level.
Although the focus of the current analyses is on AUD, as noted above, a similar situation appears to be present for major depression. There is reason to suspect that variation in operationalization of diagnostic criteria is relevant for many if not most psychiatric conditions. While the effects of using more or less “severe” operationalizations on prevalence are obvious, the importance of variation in the operationalization of criteria is less obvious but profoundly important to understanding the associations observed among symptoms/criteria. Consequently, not only must we not assume generalizability across studies utilizing different diagnostic instruments, but we should more carefully consider the wide variation in diagnostic procedures that characterize modern research. As noted by Lane et al. (2016), variation in diagnostic instruments that are foundational to much psychiatric research is large and unappreciated, and even the recovery of similar psychometric structures across studies can reflect mere “skin deep” reproducibility.
Footnotes
Author Contributions
All authors contributed to the design of the study. M. Hoffman and D. Steinley contributed to the statistical analysis. T. J. Trull and K. J. Sher contributed to the interpretation and discussion of results. M. Hoffman drafted the manuscript, and all the authors provided critical revisions. All the authors approved the final version of the manuscript for submission.
Declaration of Conflicting Interests
The author(s) declared that there were no conflicts of interest with respect to the authorship or the publication of this article.
Funding
This work was supported by National Institute on Alcohol Abuse and Alcoholism Grants R01-AA024133, K05-AA017242, and T32-AA013526 (to K. J. Sher) and R01-AA023248 (to D. Steinley).
