Abstract
Because studies examining youth drug use often have data with a high proportion of zeros, they often do not meet the assumptions of the univariate or linear regression analyses that are typically used. We demonstrate the use of zero-inflated negative binomial regression models to address excessive zeros when modeling drug use frequency as a function of perceived disapproval and perceived harm among middle and high school students (N = 522). We found that perceptions of parent disapproval were a better predictor of marijuana use (p = .01) than peer disapproval. Perceived harm was related to marijuana use (p = .04). Researchers should consider using zero-inflated negative binomial regression models when examining youth drug use.
Marijuana is the most commonly used drug among youth in the United States (Substance Abuse and Mental Health Services Administration [SAMHSA], 2020). Approximately 21% of adolescents in grades 8, 10, and 12 reported using marijuana at some point in their life (Johnston et al., 2020; Centers for Disease Control & Prevention [CDC], 2020). While levels of marijuana use among adolescents have remained steady since 2000, adolescents’ perceptions of the harmfulness of marijuana use are diminishing, and societal acceptance of marijuana and other drug use is rising (Johnston et al., 2020). Recent trends in the legalization of recreational marijuana in many states and the subsequent increase in marijuana advertising and access to marijuana may influence adolescent marijuana use and normalize substance use behaviors (Dai, 2017). This is concerning because adolescents are in a transition stage in which marijuana use could damage brain development, and alcohol, tobacco, and other drug use is known to lead to other unhealthy risk behaviors that have long-lasting mental and physical consequences (Dunn & Yannessa, 2022; SAMHSA, 2022).
However, most youths do not use marijuana or other illegal drugs regularly. Both the Monitoring the Future Study (Johnston et al., 2020) and the CDC's Youth Risk Behavior Surveillance System (YRBSS) (CDC, 2020) report that only around 17% of youth in grades 9–12 used marijuana in the past 30 days. In fact, marijuana use among teens tends to be episodic and occurs infrequently (Pelham et al., 2021). For example, Siegel et al.'s (2015) study found that most adolescents received zero offers to use marijuana in the past 30 days (76%) and had not used marijuana (78%) in the past year. Both Hammond et al. (2020) and Chadwick et al. (2013) found that less than 6 percent of teens used marijuana daily, using data from the National Longitudinal Study for Adolescent Health. Similar findings of infrequent marijuana use have been found in the Monitoring the Future Study with around 5.5 percent of 12th graders using marijuana weekly (Johnston et al., 2020). The episodic nature of marijuana use is most pronounced among early adolescents with some studies revealing that less than 0.4 percent of 8th graders used marijuana weekly (Johnston et al., 2020). However, the prevalence of current marijuana use (past 30 days) steadily increases throughout adolescence as do perceptions of parents’ and peers’ favorable attitudes toward marijuana use (Guttmannova et al., 2019). Adolescent perceptions of parental disapproval of drug use are known to be inversely correlated with current use (Meldrum et al., 2022). Conversely, perceptions of risk, specifically the belief that marijuana use is not harmful, decrease as teens get older and are highly correlated with alcohol, tobacco, and other drug use (Bailey et al., 2020; Johnston et al., 2020; Mariani & Williams, 2021; Sarvet et al., 2018).
Because drug use is infrequent among adolescents, count data in studies of alcohol, tobacco, and other drug use often exhibit two related characteristics: a large proportion of zero counts (referred to as zero inflation) and an excess of variability (referred to as overdispersion) (Pittman et al., 2020). As Grimm and Stegmann (2019) point out, when researchers measure the frequency of drug use with items such as “How many times have you used drugs in the past 30 days?”, the responses form distributions of counts. These count outcomes are unique and share certain properties. For example, the possible values are discrete (e.g., 0, 1, 2), and the distribution of these scores is often positively skewed and contains a large number of zero responses. In practice, counts are generally over-dispersed or zero-inflated and, indeed, commonly both (Famoye & Singh, 2003; Jochmann, 2013; Yang et al., 2010). This can lead to substantial analytical issues in studies of adolescent drug use or the evaluation of drug prevention programs. When the observed data involve excessive zero counts, the resulting overdispersion produces serious estimation errors (e.g., biased parameter estimates, invalid standard errors, and p values that are too small) and thus misleading conclusions (Lee et al., 2012).
In recent years there has been considerable interest in count data models with excessive zeros. Application areas are diverse and include manufacturing defects (Lambert, 1992), medical consultations (Gurmu, 1997), dentistry (Böhning et al., 1999), and public health (Moulton et al., 2002). Studies of datasets with excess zeros explain that zero counts can be generated by one of two separate processes: (1) structural zeros, among subjects not at risk for the event, and (2) sampling zeros, which occur by chance. A simple example is data measuring the number of fish caught by youth attending a summer camp. A value of zero could result from either of two distinct processes: lack of fishing skills among those who chose to fish, or having chosen not to fish at all. Both yield a value of zero for “number of fish caught” but for different reasons (University of California, Los Angeles [UCLA] Statistical Consulting Center, 2021). In drug education and substance use research, the absence of marijuana use may come from two different groups: individuals who are truly abstinent and individuals who would use marijuana but did not happen to use it during the last 30 days. Since the data combine zero responses from both groups and the first group is certain to have zeros, the number of zeros will be inflated. Hence, count data with an excess of zeros call for zero-inflated regression models that distinguish the sources of the zeros.
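The two-process idea can be made concrete with a small simulation. The sketch below is illustrative only: the share of structural zeros (0.80) and the Poisson rate (1.5) are hypothetical values, not estimates from this study, and the helper names are invented for the example.

```python
import math
import random

def poisson_draw(rng, lam):
    """Draw one Poisson(lam) variate using Knuth's multiplication method."""
    threshold = math.exp(-lam)
    k, p = 0, 1.0
    while True:
        p *= rng.random()
        if p <= threshold:
            return k
        k += 1

def simulate_zero_inflated(n, pi_structural, lam, seed=0):
    """Simulate n counts from a two-process (zero-inflated Poisson) mixture.

    With probability pi_structural a subject is a structural zero (not at risk);
    otherwise the count is Poisson(lam), which can still be a sampling zero.
    """
    rng = random.Random(seed)
    counts, n_structural = [], 0
    for _ in range(n):
        if rng.random() < pi_structural:
            counts.append(0)
            n_structural += 1
        else:
            counts.append(poisson_draw(rng, lam))
    return counts, n_structural

# Hypothetical parameter values chosen only for illustration
counts, n_structural = simulate_zero_inflated(n=5000, pi_structural=0.80, lam=1.5)

n_zero = counts.count(0)
n_sampling = n_zero - n_structural          # zeros that arose by chance among those at risk
mean = sum(counts) / len(counts)
variance = sum((c - mean) ** 2 for c in counts) / (len(counts) - 1)

print(f"structural zeros: {n_structural}, sampling zeros: {n_sampling}")
print(f"share of zeros: {n_zero / len(counts):.3f}")
print(f"mean: {mean:.3f}, variance: {variance:.3f}")
```

In the observed data the two kinds of zero are indistinguishable, which is why the mixture has to be modeled rather than read off directly. The simulated variance also exceeds the mean, showing how zero inflation by itself induces overdispersion.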
This study applies both zero-inflated Poisson (ZIP) and zero-inflated negative binomial (ZINB) regression models, which have been proposed to account for the excessive zeros of substance use in addition to allowing for overdispersion (Famoye & Singh, 2006). The zero-inflated models estimate the probabilities for the two processes separately and combine them, using a logistic link for the zero-inflation part, to identify covariates that might predict non-users (structural zeros) and users (including sampling zeros). Since different explanatory variables, such as perceptions of parent disapproval, may be significant for one group but not for the other, results from a zero-inflated model provide two different groups of variables: (1) variables that are associated with a higher likelihood of being in the zero group and (2) variables that are associated with higher count values. We examined various psycho-social factors and the frequency of marijuana use in a sample of youth in grades 8 through 12. We demonstrate the advantages of a ZINB regression model for addressing excessive zeros when modeling drug use frequency as a function of perceived parent and peer disapproval and perceived harm among middle and high school students.
Method
Participants and procedure
This study used data from an ongoing, longitudinal study of adolescent drug use funded by the Drug-Free Communities, Substance Abuse and Mental Health Services Administration (SAMHSA). Data were collected in March 2019 (N = 522) among a diverse sample of public middle school and high school students in grades 8–12 in a suburban school district in Southern California. Prior to data collection, parent/guardian consent was obtained. The consent form described the purpose of the study, risks/benefits of participation, the voluntary nature of participation, confidentiality of responses, and the researcher's contact information. The consent form was double-sided and available in both English and Spanish. Teachers distributed consent forms to students, who were asked to give the form to their parent/guardian and return it 4–5 days later. Additionally, before survey data collection, participants’ assent was obtained by telling them about the confidential nature of the survey and that they did not have to complete the survey if they did not want to, even if their parent/guardian said it was OK to do so. Participants completed the self-administered, paper-and-pencil survey during normal class time. The survey took an average of 7 minutes to complete. Data were collected by the researchers. This study was reviewed and approved by the Institutional Review Board at California Baptist University.
Measures
The survey included demographic items covering age (years), gender (boy/girl), grade (8th to 12th), race (White, Black, Asian, Pacific Islander, American Indian, or other race), and ethnicity (Hispanic/Latino- yes/no). Items from the California Healthy Kids Survey (CHKS, 2019) were used to measure current marijuana use, “In the last 30 days, how many times have you used marijuana (pot, weed, grass)?”; perceived harm of using marijuana, “How much do you think people risk harming themselves physically or in other ways if they use marijuana?” ranging from 1 (No Risk) to 4 (Great Risk); parental disapproval of marijuana use, “How wrong do your parents or guardians feel it would be for you to use marijuana?” ranging from 1 (Not Wrong at All- OK) to 4 (Very Wrong); and peer disapproval of marijuana use, “How wrong do your friends feel it would be for you to use marijuana?” ranging from 1 (Not Wrong at All- OK) to 4 (Very Wrong). These items demonstrated satisfactory test-retest reliability (r = .79).
Analytical approach
The number of times a student used marijuana during the last 30 days is treated as a count response, in which the observations can take only non-negative integer values and these integers arise from counting rather than ranking. Since the counts of marijuana use are highly non-normal and thus are not well estimated by OLS regression, generalized linear models based on the Poisson or Negative Binomial distribution can be considered for analyzing the skewed counts. However, if zero values are dominant in the data set, these ordinary count data models risk underestimating the probability of zeros. In particular, although a large share of respondents (92% of the sample) reported zero use of marijuana during the last 30 days, both the Poisson and Negative Binomial models estimate that approximately 70% of students would report no use, underpredicting the observed proportion of zeros.
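The kind of check described above can be sketched in a few lines. The frequency table below is hypothetical and is not the study data (the full distribution of counts is not reported in the text). It is used only to show how an intercept-only Poisson model, whose fitted rate equals the sample mean, can predict far fewer zeros than are observed.

```python
import math

# Hypothetical frequency table (count -> number of respondents); NOT the study data
freq = {0: 460, 1: 15, 2: 10, 3: 5, 5: 4, 10: 3, 20: 3}

n = sum(freq.values())
mean = sum(k * f for k, f in freq.items()) / n
variance = sum(f * (k - mean) ** 2 for k, f in freq.items()) / (n - 1)

observed_zero = freq[0] / n        # share of zeros actually observed
poisson_zero = math.exp(-mean)     # P(Y = 0) under a Poisson model with rate = sample mean
dispersion = variance / mean       # close to 1 for Poisson data, above 1 when overdispersed

print(f"observed share of zeros:  {observed_zero:.3f}")
print(f"Poisson-predicted zeros:  {poisson_zero:.3f}")
print(f"variance-to-mean ratio:   {dispersion:.2f}")
```

A predicted zero share well below the observed one, together with a variance-to-mean ratio far above 1, is the usual informal signal that a zero-inflated or overdispersed model deserves consideration.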
As noted earlier, count data are zero-inflated when more zeros exist than would be expected under a given probability distribution such as the Poisson or Negative Binomial. Zero-inflated models assume that zero counts come from two latent sub-processes because the population consists of observations that are always zero and observations that may or may not be zero. Specifically, in the ZIP regression model, the number of substance use
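For reference, zero-inflated count models are conventionally written as a two-part mixture. The block below gives the standard textbook form (the ZIP case follows Lambert, 1992). The notation is assumed here for illustration and may differ from the authors' own: pi_i is the probability of a structural zero, mu_i the mean of the count process, alpha the negative binomial dispersion parameter, and z_i and x_i the covariate vectors for the zero-inflation and count parts.

```latex
% ZIP: structural zero with probability \pi_i, otherwise Poisson(\mu_i)
\begin{align*}
P(Y_i = 0) &= \pi_i + (1 - \pi_i)\, e^{-\mu_i}, \\
P(Y_i = y) &= (1 - \pi_i)\, \frac{e^{-\mu_i} \mu_i^{y}}{y!}, \qquad y = 1, 2, \ldots \\
\operatorname{logit}(\pi_i) &= \mathbf{z}_i^{\top} \boldsymbol{\gamma}, \qquad
\log(\mu_i) = \mathbf{x}_i^{\top} \boldsymbol{\beta}.
\end{align*}

% ZINB: same mixture, with a negative binomial count part (dispersion \alpha > 0)
\begin{align*}
P(Y_i = 0) &= \pi_i + (1 - \pi_i)\,(1 + \alpha \mu_i)^{-1/\alpha}, \\
P(Y_i = y) &= (1 - \pi_i)\, \frac{\Gamma(y + 1/\alpha)}{\Gamma(1/\alpha)\, y!}
  \left(\frac{1}{1 + \alpha \mu_i}\right)^{1/\alpha}
  \left(\frac{\alpha \mu_i}{1 + \alpha \mu_i}\right)^{y}, \qquad y = 1, 2, \ldots
\end{align*}
```

Under this parameterization, exponentiated coefficients of the zero-inflation part are odds ratios for membership in the always-zero class, exponentiated coefficients of the count part are incidence rate ratios, and the ZINB model reduces to the ZIP model as alpha approaches 0.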
Results
Demographics
The students who participated in this study comprised 267 males (51.6%) and 250 females (48.4%). The average age was 15.01 years (SD = 1.32). Approximately 78% of the students were Hispanic/Latino. The racial distribution of the sample was 58.9% White, 4.2% Black, 1.6% American Indian, 1.6% Asian, 1.0% Pacific Islander, and 32.7% other race. The percentages of students by grade were 8th grade (20.5%), 9th grade (22.6%), 10th grade (31.0%), 11th grade (19.9%), and 12th grade (6.0%) (see Table 1).
Table 1. Demographics.
Note. N varies by variable because of missing data.
Marijuana use
A total of 510 respondents answered the question: “In the last 30 days, how many times have you used marijuana?” (see Figure 1). Of these, 17 had used 1 time, 9 had used 2 times, 2 had used 3 times, etc., with a maximum reported count of 23. Figure 1 shows the excessive number of zeros (n = 467; 91.55%) in marijuana use as well as overdispersion of the variable of interest; thus, the ZIP and zero-inflated negative binomial (ZINB) regression models were fitted. The parameters for the models are evaluated in terms of the odds ratio (OR) of “no marijuana use vs. use” for the zero-inflated part and the incidence rate ratio (RR) of “one more marijuana use (given that its use has already taken place)” for the count part (Poisson for ZIP, negative binomial for ZINB).

Figure 1. Frequency of marijuana use (past 30 days).
Parent disapproval and marijuana use
Table 2 shows that parental disapproval of marijuana use is significantly associated with the likelihood of adolescent marijuana use. Adolescents who perceive that their parents strongly disapprove of marijuana use (Very Wrong) are less likely to use marijuana than adolescents who perceive that their parents somewhat disapprove (Wrong or Little Wrong) or do not disapprove (Not Wrong at All- OK). Specifically, the parameter estimates of the zero-inflated parts indicate that the expected odds of ‘No Use’ are statistically associated with parental disapproval. The odds of ‘No Use’ for the Wrong group are 0.17 times (ORZIP; ORZINB = 0.19) those of the Very Wrong group (p < .05). The corresponding odds are 0.19 times (ORZIP; ORZINB = 0.29) for the Little Wrong group and 0.12 times (ORZIP; ORZINB = 0.10) for the Not Wrong at All- OK group. That is, weaker perceived disapproval is associated with lower odds of abstaining and therefore a higher likelihood of use.
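Because the zero-inflated part reports the odds ratio of “no marijuana use vs. use” (as defined in the Marijuana use section), a value below 1 means lower odds of belonging to the ‘No Use’ group. The sketch below shows one way to restate such an odds ratio. The function name is hypothetical, and the inputs are the ZIP odds ratios quoted in the text.

```python
def describe_odds_ratio(or_no_use):
    """Restate an odds ratio for 'No Use' relative to the reference group.

    Returns the percent change in the odds of 'No Use' and the reciprocal
    odds ratio (the same contrast expressed as odds of NOT being in the
    'No Use' group).
    """
    pct_change_no_use = (or_no_use - 1.0) * 100.0   # negative => lower odds of abstaining
    or_not_no_use = 1.0 / or_no_use
    return pct_change_no_use, or_not_no_use

# ZIP odds ratios for 'No Use' as quoted in the text (reference group: Very Wrong)
reported_or_zip = {"Wrong": 0.17, "Little Wrong": 0.19, "Not Wrong at All- OK": 0.12}

for group, value in reported_or_zip.items():
    pct, inverse = describe_odds_ratio(value)
    print(f"{group}: odds of 'No Use' change by {pct:.0f}%; "
          f"reciprocal odds ratio = {inverse:.2f}")
```

For example, an odds ratio of 0.17 corresponds to 83% lower odds of being in the ‘No Use’ group than the reference group, not to odds that are 0.17 times higher.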
Table 2. Effect of parent disapproval on marijuana use.
Note. Reference group is ‘Very Wrong’.
*p < 0.05. **p < 0.01.
From the count component (i.e., Poisson) in Table 2, given that use has already occurred, the expected incidence rate ratios (RR) of adolescent marijuana use are 2.73 (Wrong) and 3.98 (Little Wrong) relative to the Very Wrong group as parental disapproval becomes weaker, and these results are statistically significant (p < .05). The expected RR for the Not Wrong at All- OK group is smaller than the RRs of the Wrong or Little Wrong groups and is not statistically significant (p > .05). The count component of the ZINB model shows similar results; however, the expected RRs are larger than those in the ZIP model (Wrong = 3.69, Little Wrong = 6.15, and Not Wrong at All- OK = 1.65).
The results indicate that the ZINB model better estimates the effect of parent disapproval on adolescent marijuana use. We fit four competing models (Poisson, negative binomial [NegBin], ZIP, and ZINB) and compared them using two popular model selection criteria for which lower values are better: Akaike's Information Criterion (AIC) and the Bayesian Information Criterion (BIC). Then, we employed Vuong's test to determine the best-fitting model. Tables 3 and 4 present the model comparison criteria and Vuong's statistics, respectively, for the competing models. According to the AIC and BIC values in Table 3, the ZINB model provides the best fit whereas the standard Poisson model provides the worst fit. Because NegBin fits substantially better than ZIP even though it does not account for excessive zeros, ZIP appears to underestimate the extreme values. The pairwise comparisons in Table 4 also show that the ZINB model outperforms the other three models (p < 0.01), while the Poisson model is the least preferred.
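The comparison criteria named above can be illustrated on a toy problem. The sketch below is not the authors' analysis: it uses a hypothetical frequency table, fits only intercept-only Poisson and ZIP models (the ZIP fit uses a simple EM iteration), and then computes AIC, BIC, and Vuong's z statistic from the pointwise log-likelihoods. All function names are invented for the example.

```python
import math

def aic(loglik, k):
    """Akaike's Information Criterion for a model with k parameters."""
    return 2 * k - 2 * loglik

def bic(loglik, k, n):
    """Bayesian Information Criterion for k parameters and n observations."""
    return k * math.log(n) - 2 * loglik

def poisson_logpmf(y, lam):
    return -lam + y * math.log(lam) - math.lgamma(y + 1)

def zip_logpmf(y, pi, lam):
    if y == 0:
        return math.log(pi + (1 - pi) * math.exp(-lam))
    return math.log(1 - pi) + poisson_logpmf(y, lam)

def fit_zip_intercept_only(data, iterations=200):
    """Maximum likelihood for an intercept-only ZIP model via EM."""
    n, n_zero, total = len(data), data.count(0), sum(data)
    pi, lam = 0.5, total / n
    for _ in range(iterations):
        # E-step: probability that an observed zero is a structural zero
        w = pi / (pi + (1 - pi) * math.exp(-lam))
        # M-step: update the mixing weight and the Poisson rate
        pi = n_zero * w / n
        lam = total / (n - n_zero * w)
    return pi, lam

def vuong_statistic(ll_a, ll_b):
    """Vuong z statistic from pointwise log-likelihoods; positive favors model a."""
    m = [a - b for a, b in zip(ll_a, ll_b)]
    n = len(m)
    mean_m = sum(m) / n
    sd_m = math.sqrt(sum((x - mean_m) ** 2 for x in m) / (n - 1))
    return math.sqrt(n) * mean_m / sd_m

# Hypothetical frequency table (count -> number of respondents); NOT the study data
freq = {0: 460, 1: 15, 2: 10, 3: 5, 5: 4, 10: 3, 20: 3}
data = [k for k, f in freq.items() for _ in range(f)]
n = len(data)

lam_pois = sum(data) / n                      # Poisson MLE is the sample mean
pi_zip, lam_zip = fit_zip_intercept_only(data)

ll_pois_i = [poisson_logpmf(y, lam_pois) for y in data]
ll_zip_i = [zip_logpmf(y, pi_zip, lam_zip) for y in data]
ll_pois, ll_zip = sum(ll_pois_i), sum(ll_zip_i)
z = vuong_statistic(ll_zip_i, ll_pois_i)

print(f"Poisson: AIC = {aic(ll_pois, 1):.1f}, BIC = {bic(ll_pois, 1, n):.1f}")
print(f"ZIP:     AIC = {aic(ll_zip, 2):.1f}, BIC = {bic(ll_zip, 2, n):.1f}")
print(f"Vuong z (ZIP vs. Poisson) = {z:.2f}")
```

Lower AIC and BIC values indicate the preferred model, and a large positive Vuong statistic favors the first model in the pair. Extending the comparison to NegBin and ZINB would require estimating the dispersion parameter as well, which this sketch omits.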
Table 3. Model fit.
Table 4. Pairwise comparison of models.
Note. *p < 0.05. **p < 0.01.
Peer disapproval and marijuana use
Table 5 shows that perceptions of peer disapproval are significant predictors of adolescent marijuana use. Specifically, the odds of ‘No Use’ for adolescents who perceive that their friends think marijuana use is Wrong are about 0.3 times the odds for the Very Wrong group (ORZIP = 0.32; ORZINB = 0.26) (p < .05), indicating a higher likelihood of use. The predicted odds of ‘No Use’ for the Little Wrong (ORZIP = 0.18; ORZINB = 0.12) and the Not Wrong at All- OK (ORZIP = 0.07; ORZINB = 0.06) groups are likewise lower than those of the Very Wrong group (p < .05).
Table 5. Effect of peer disapproval on marijuana use.
Note. Reference group is ‘Very Wrong’.
*p < 0.05. **p < 0.01.
Perceived harm and marijuana use
Table 6 shows the effect of adolescent perceptions of harm on marijuana use. Estimates from the models suggest that adolescents who believe that marijuana use is a Great Risk are less likely to use marijuana. Specifically, adolescents in the Moderate Risk group are more likely to use marijuana than adolescents in the Great Risk group (ORZIP = 0.23; ORZINB = 0.02) (p < .05). The model also shows that when adolescents’ perceptions of risk decrease from Great Risk to Slight Risk, the odds of ‘No Use’ fall to roughly 0.3 to 0.4 times those of the Great Risk group (ORZIP = 0.38; ORZINB = 0.31).
Table 6. Effect of perceived harm on marijuana use.
Note. Reference group is ‘Great Risk’.
*p < 0.05. **p < 0.01.
Discussion
The purpose of this study was to examine the effect of parent and peer disapproval and perceived harm on adolescent marijuana use and to demonstrate the use of ZINB regression models to address excessive zeros in drug use data. We found that perceptions of parent disapproval and peer disapproval were predictors of marijuana use among adolescents. Perceived harm of marijuana use was also related to marijuana use. This is consistent with prior research showing that greater parental and peer disapproval of substance use is negatively associated with marijuana use and that low or even moderate perceived harm is associated with higher adolescent marijuana use (Bailey et al., 2020; Meldrum et al., 2022). Additionally, we demonstrated that zero-inflated negative binomial (ZINB) regression is the preferred method of handling drug use data that are over-dispersed due to a large number of zeros and/or positively skewed, as is often found in studies of adolescent drug use. We showed that the ZINB model performed better than the traditional Poisson, Negative Binomial (NegBin), and ZIP models.
Another approach is to dichotomize the frequency data (Xie et al., 2013). The outcome variable then distinguishes those who engaged in marijuana use in the past 30 days (at least once) from those who did not. The data can then be analyzed with traditional logistic regression models or a chi-square statistic. However, this approach may lead to a loss of statistical power (i.e., ability to find important effects or associations) and, consequently, larger sample sizes are needed. Additionally, dichotomizing quantitative data like drug use can lead to the loss of information such as variability (i.e., how similar or different youth are in their drug use) and to spurious statistical significance (i.e., finding statistically significant relationships that are not really there) (MacCallum et al., 2002).
Several limitations should be noted. First, ZINB models are applicable when there is interest in a model for “latent” classes with counts (e.g., 1, 2, 3, …) generated from a negative binomial distribution and the ‘No Use’ group with only zero counts. Quantifying an explanatory variable's effect in the overall mixture population would be problematic. Second, we used data from a sample of adolescents that was primarily Hispanic. The results of this study may not apply to other adolescent populations. Finally, as Meldrum et al. (2022) note, measuring adolescents’ perceptions of the degree of parental or peer disapproval may not capture the extent to which parents or peers “approve” of using drugs like marijuana. It is unclear whether selecting “Not wrong at all” reflects an adolescent's perception that his/her parents or peers approve of drug use or think that marijuana use is OK (Meldrum et al., 2022).
An advantage of this approach is that existing statistical computer programs like R and Stata can easily conduct the analyses, and the learning curve is gentle for anyone with a background in inferential statistics. More information regarding the use of Stata for conducting zero-inflated negative binomial regressions (ZINBs), using a practical, step-by-step example, is available at the UCLA Research Computing Website at https://stats.oarc.ucla.edu/stata/output/zero-inflated-negative-binomial-regression/.
Substance use among adolescents is a complex issue. Reducing adolescent substance use by teaching parents the importance of setting expectations and how to communicate clear messages of disapproval and increasing adolescents’ perceptions of risk through universal, evidence-based drug education programs is essential. Likewise, understanding why adolescents use drugs and which drug education programs are effective requires the use of methods and analytic techniques that best fit the data. Future research should examine both the frequency of parent-child communication about substance use and the content of those discussions, particularly among various ethnic and racial groups. Likewise, this analytical approach may be useful to those researchers studying substance abuse prevention and drug education programs aimed at adolescents.
Footnotes
Declaration of Conflicting Interests
The authors declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.
Funding
This study was supported by a grant from the Substance Abuse and Mental Health Services Administration #5H79SP021394-05.
