Abstract
Binary regression models generally assume that the response variable is measured perfectly. However, in some situations, the outcome is subject to misclassification: a success may be erroneously classified as a failure or vice versa. Many methods described in the literature have been developed to deal with misclassification, but we demonstrate that these methods may lead to serious inferential problems when only a single evaluation of the individual is taken. Thus, this study proposes to incorporate repeated and independent responses in misclassification binary regression models, considering either the total number of successes obtained or only the simple majority classification. We use subjective prior distributions, specified as conditional means priors, to evaluate and compare models. A data augmentation approach, Gibbs sampling, and Adaptive Rejection Metropolis Sampling are used for posterior inferences. Simulation studies suggested that repeated measures significantly improve the posterior estimates, in that these estimates are closer to those obtained in a case with no misclassification and have a lower standard deviation. Finally, we illustrate the usefulness of the new methodology with an analysis of defects in eyeglass lenses.
Introduction
Binary data are frequently subject to misclassification in a variety of applied fields, such as epidemiological and medical research, and econometric and social sciences. As first observed by Bross (1954), the parameter estimators are extremely biased when the classification system is subject to error, that is, when a success may be erroneously classified as a failure (known as a false negative in medical applications), or vice versa (a false positive). Examples of cases where misclassification can occur include smoking cessation (McInturff et al. 2004), cognitive severity in patients with Alzheimer's disease (Luo et al. 2016), human papillomavirus detection (Paulino et al. 2003), cancer death due to radiation (Roy et al. 2005), and serological tests (Fujisawa and Izumi 2000; Menten et al. 2012). Many methods for correcting the estimates have been proposed. Examples include a double sampling scheme in frequentist modelling (Tenenbein, 1970) and the use of an informative prior distribution from a Bayesian point of view (Gaba and Winkler, 1992).
Some approaches are also available for regression settings for data arising from longitudinal studies. For an unobservable alternating binary Markov process (a subject may alternate between two states over time), Cook et al. (2000), Nagelkerke et al. (1990) and Rosychuk and Thompson (2003) discussed the impact of misclassification on transition probability estimates. García-Zattera et al. (2010) focused on misclassified longitudinal binary data where the true response follows a progressive or monotone process (subjects cannot alternate between the two stages once the severe stage is reached over time). In a Bayesian approach, they also extended the proposed model to describe the relationship between the covariates and the prevalence and incidence, allowing for different time intervals between examinations for each subject, and different classifiers. García-Zattera et al. (2012) extended the obtained results and proposed a multivariate binary inhomogeneous Markov model for the case of unobserved, correlated response variables subject to an unconstrained misclassification process and a monotone behaviour.
Misclassification issues in survival analysis can also be found in the statistical literature. Some methods estimate time-to-event distributions for time-dependent sensitivity with no covariates (Balasubramanian and Lagakos 2001, 2003) or with auxiliary covariates in the Cox proportional hazards model (Snapinn, 1998). Gu et al. (2015) proposed a semiparametric likelihood-based approach to time-to-event models to estimate the association of one or more covariates with an error-prone, self-reported outcome. For current status data, a method for nonparametric estimation of the distribution function for the one-sample problem was proposed by McKeown and Jewell (2010). They also extended this methodology to parametric regression models, and Sal Y Rosas and Hughes (2011) extended it to situations in which sensitivity and specificity may vary across groups of individuals, using nonparametric and semiparametric methods. In a discrete-time context, Meier et al. (2003) extended the discrete proportional hazards model to incorporate misclassified responses. Hunsberger et al. (2010) incorporated sensitivity and specificity quantities into a subject-specific and time-dependent covariates model, and Richardson and Hughes (2000) implemented an expectation-maximization (EM) algorithm to estimate the product limit of the survival function with no covariates and misclassification. García-Zattera et al. (2016) extended the accelerated failure time (AFT) frailty modelling approach to account for misclassification of the determination of the event in the context of interval-censored data.
In the context of binary regression models for cross-sectional studies, ignoring misclassification errors can produce biased covariate effect estimates (Neuhaus, 1999; McInturff et al., 2004; Meyer and Mittag, 2017; Savoca, 2011; Carroll et al., 2006). McInturff et al. (2004) also noted in their example that some variables that were significant if misclassification was accounted for were not significant if misclassification was ignored.
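To make the source of this bias concrete, the following minimal sketch computes the success probability that is actually observed when a logistic model holds for the true response and the classifier has fixed, covariate-independent error rates. The names `fp` (false-positive probability) and `fn` (false-negative probability), as well as the coefficient values, are illustrative assumptions and are not taken from the studies cited above.

```python
import math


def expit(z):
    """Inverse logit."""
    return 1.0 / (1.0 + math.exp(-z))


def logit(p):
    """Log-odds of a probability p in (0, 1)."""
    return math.log(p / (1.0 - p))


def observed_prob(p_true, fp, fn):
    """P(observed success) when a true success is missed with probability fn
    and a true failure is reported as a success with probability fp."""
    return p_true * (1.0 - fn) + (1.0 - p_true) * fp


# Hypothetical true model: logit P(Y = 1 | x) = -1 + 2 x
b0, b1 = -1.0, 2.0
fp, fn = 0.10, 0.10  # assumed non-differential error rates

# Log-odds difference between x = 1 and x = 0, for true and observed responses
true_slope = logit(expit(b0 + b1)) - logit(expit(b0))
naive_slope = (logit(observed_prob(expit(b0 + b1), fp, fn))
               - logit(observed_prob(expit(b0), fp, fn)))

print(f"true log-odds difference:     {true_slope:.3f}")
print(f"observed log-odds difference: {naive_slope:.3f}")
```

Under these assumed values, the log-odds difference computed from the observed probabilities is smaller in magnitude than the true one, which is the attenuation that a naive fit would inherit.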
Many frequentist methods have been developed to handle the problem of logistic regression with outcome misclassification. These methods generally consider a validation subsample (Lyles et al., 2011; Carroll et al., 2006; Meyer and Mittag, 2017). Magder and Hughes (1997) derived maximum likelihood estimators (MLEs) using an EM algorithm. Roy et al. (2005) also considered maximum likelihood estimation, but allowed for measurement error in the covariates. Semiparametric MLEs were proposed by Cheng and Hsueh (2003) for handling differential (covariate-dependent) misclassification and by Cheng and Chen (2006) for case-control data. Edwards et al. (2013) presented an alternative approach, using missing data methods (multiple imputation with internal validation data).
Bayesian methods have also been developed to address the problem of binary regression with outcome misclassification. For case-control studies with differential misclassification, Prescott and Garthwaite (2005) proposed a Bayesian approach in which validation data were used to correct misclassification. Tu et al. (1999) proposed a method that accommodates error probabilities in a broad family of link functions and uses a latent variable approach with an MCMC algorithm to perform posterior inferences. Paulino et al. (2003) also considered latent variables, but prior distributions for the regression coefficients were elicited using the so-called conditional means prior (CMP; Bedrick et al., 1996). Naranjo et al. (2013) used two types of latent variables in probit and t-link regressions. McInturff et al. (2004) followed the methods proposed by Paulino et al. (2003), but their likelihood function was based only on observed variables, as the software WinBUGS was used.
However, merely accounting for misclassification does not eliminate the possibility of inferential problems, such as in assessing the significance of important variables in the study. This is particularly true when the misclassification probabilities are unknown or gold-standard tests are unavailable or prohibitively expensive. An alternative approach is therefore to use a repeated-measurement sampling procedure, in which repeated and independent measurements of the subjects are made with the same error-prone diagnostic test.
Accounting for multiple measures in a misclassification context has already been explored. Evans et al. (1996) considered a Bayesian estimation of the success probability of binary data subject to misclassification errors when repeated measures are available, using both Gauss–Jacobi quadrature and Gibbs sampling for posterior analyses. A similar approach was developed by Swartz et al. (2004) regarding misclassification in multinomial data. Fujisawa and Izumi (2000) proposed maximum likelihood estimation for prevalence, sensitivity and specificity based on repeated classifications. Quinino et al. (2013) and Morais et al. (2017) proposed an estimator based on simple majority results of repeated classifications as an alternative to the method of moments for obtaining parameter estimators of a mixture of two- or three-binomial distributions.
For case-control studies in which exposure assessments are error-prone, Lai et al. (2007) assumed non-differential (covariate-independent) misclassification, performed repeated imperfect measurements and obtained the maximum likelihood estimates by using an EM algorithm. In the same context, Zhang et al. (2013) accounted for both differential and dependent misclassification and, in a Bayesian approach, proposed a method to estimate the measurement-error corrected exposure-disease association.
The purpose of this study is to develop a Bayesian method that takes into account repeated and independent measures of misclassified responses in binary regression. Repeated measures are not considered a validation subsample because they are performed using the same imperfect test, that is, there is not a gold-standard test available. We evaluated two possible specific situations usually observed in practice: (a) the results of all repeated measures are available and (b) only the result of the majority of the repeated measures is known. In Section 2, we propose models that incorporate the results of repeated classifications. Bayesian procedures are described in Section 3. Simulation studies and a real-data application are presented in Sections 4 and 5, respectively. Discussion and comments are provided in Section 6.
Statistical methods
Let
Using generalized linear models (McCullagh and Nelder, 1989) we can express
Suppose that a non-destructive fallible mechanism measures each sample subject
Frequency of success model (FSM)
Let
Therefore, the likelihood function is expressed as
An augmented data approach can be used to obtain a more tractable likelihood function. Let
For each
Based on the augmented data,
For the sake of simplicity and without loss of generality, we can use the logit link function (or an alternative function, according to the nature of the data) and assume unknown and non-differential errors to rewrite the likelihood function (4) in terms of
An alternative way to address repeated classifications is to consider the response variable as equal to the most frequent test result (i.e., the simple majority result) after
Then, we have
where Bin
It is reasonable to assume that if
The likelihood function can, therefore, be written as
An augmented data approach can also be used. Let
The likelihood function based on the augmented data
Using the logit link function and non-differential errors, we obtain the simple majority model (SMM):
When
The identifiability analysis of the proposed models is discussed in Appendix A. In the next section, we develop a Bayesian approach to the FSM and SMM, presented earlier.
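As a small numerical illustration of why the simple majority result carries less classification error than a single result, the sketch below computes the probability that the majority of several independent classifications is wrong. It assumes an odd number of repetitions and the same error probability for every repetition; the names `err` and `n_rep` are illustrative and do not correspond to the notation of the models above.

```python
from math import comb


def majority_error_prob(err, n_rep):
    """Probability that the simple majority of n_rep independent
    classifications is wrong, given a per-classification error
    probability err. n_rep must be odd so that ties cannot occur."""
    if n_rep < 1 or n_rep % 2 == 0:
        raise ValueError("n_rep must be a positive odd integer")
    if not 0.0 <= err <= 1.0:
        raise ValueError("err must lie in [0, 1]")
    k_min = n_rep // 2 + 1  # smallest number of wrong results that forms a majority
    return sum(comb(n_rep, k) * err**k * (1.0 - err)**(n_rep - k)
               for k in range(k_min, n_rep + 1))


for n_rep in (1, 3, 5):
    print(n_rep, round(majority_error_prob(0.10, n_rep), 5))
```

With an assumed per-classification error of 0.10, the majority of three results is wrong with probability 3(0.10)^2(0.90) + (0.10)^3 = 0.028, and the majority of five with probability 0.00856.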
A Bayesian approach to binary regression with misclassification requires prior knowledge about the regression coefficients and error probabilities. As in Paulino et al. (2003), Gerlach and Stamey (2007), Naranjo et al. (2013) and others, it is appropriate to independently specify the error probabilities as
As reported by O'Hagan et al. (1990), it is extremely difficult to directly specify a prior distribution for the regression coefficients. Many methods have been proposed, but the standard approach is to assume a normal (e.g., West, 1985) or diffuse distribution
Another approach to binomial regression is to evaluate the success probability for various covariate values instead of evaluating the regression coefficients, as in Tsutakawa (1975), Tsutakawa and Lin (1986) and Bedrick et al. (1996). As suggested by Bedrick et al. (1996), this procedure is less complicated, especially when considering competing models. When comparing competing models, such as a logistic versus probit regression, model coefficients may have different interpretations and require different information. However, this problem is circumvented if information on the probabilities is elicited, as the prior distributions for regression coefficients can be induced from this information.
Bedrick et al. (1996) proposed evaluating prior information at
The prior distribution
For instance, under the logistic regression model, we have
The likelihood functions (2.3) and (2.5) can be factored as
For the FSM, the posterior distributions are
Note that (3.3) is a product of beta distributions.
For the SMM, we have
MCMC methods are necessary to conduct Bayesian analysis. We will consider the use of the Gibbs sampler algorithm (Gelfand and Smith, 1990). The procedure, described in Appendix B, was implemented in the object-oriented matrix programming language Ox, Version 5.1 (Doornik, 2007), and the code is available from the first author upon request.
In this section, we consider the data from two studies that were previously presented in the existing literature: O-ring data (a small sample with only one independent variable) and Trauma data (a larger sample, considering five candidate independent variables). In both cases, the authors did not discuss the possibility of misclassification. Therefore, we simulated ‘misclassified’ responses and evaluated the impact of ignoring this misclassification. Then, we carried out simulation studies to compare the performance of the repeated classifications approach with that of a single classification approach in which some prior information about the error probabilities was available. We conducted a sensitivity analysis that accounted for some different prior distribution specifications for the misclassification errors.
Single regressor: O-ring data
Challenger, the American space shuttle, exploded in 1986, 73 seconds after its take-off. The cause of the accident was attributed to the failure of two O-rings (circular rubber rings that sealed the joints of the shuttle's solid rocket boosters). It is suspected that the low temperature at launch time (31
The authors proposed a logistic regression model in which the probability that any O-ring failed was modelled as a function of temperature; the CMPs were Beta(1, 0.577) and Beta(0.577, 1) for 55
Table 1 presents the posterior means, standard deviations (SD) and percentiles of the regression coefficients (5% and 95%). Additionally, we made inferences about modelled probabilities in 53
The analysis of the original dataset did not consider the possibility of misclassification. However, to improve the understanding of the effect of misclassification, we generated misclassified responses that were simulated under the assumption that some non-differential error could occur in the response variable, perhaps due to the technical difficulty of detecting failure or a registration mistake. Thus, from the original data, we generated a sample with
Posterior estimates for O-ring data
Notably, ignoring misclassification leads to some inference problems: the intercept and the variable temperature became non-significant (90% central posterior intervals for the coefficients covered
Suppose that three repeated evaluations of the O-rings are available and the logistic regression model accounting for misclassification was fitted. First, we considered just a single measure (FSM
Posterior estimates for O-ring data accounting for misclassification
When considering a single evaluation of the O-rings, the intercept was not significant and the posterior estimates for modelled probabilities in 53
Two prior beta distributions for the misclassification probabilities were used (Figure 1): Beta(2, 10) (for Case 1) and Beta(89.9, 809.1) (for Case 2). The hyperparameters for these two distributions were obtained from prior specified means (0.17 and 0.10, respectively) and variances (0.01 and 0.0001, respectively). The latter distribution is more informative. For the regression coefficients, we used the aforementioned CMPs produced by Christensen (1997) and Bedrick et al. (1996).
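The step from a prior mean and variance to beta hyperparameters can be sketched with the standard method-of-moments formulas. The function names below are illustrative, and the sketch makes no claim about how the hyperparameters were actually computed for this study.

```python
def beta_from_mean_var(mean, var):
    """Method-of-moments Beta(a, b) hyperparameters for a given mean and variance."""
    if not 0.0 < mean < 1.0:
        raise ValueError("mean must lie in (0, 1)")
    if not 0.0 < var < mean * (1.0 - mean):
        raise ValueError("var must lie in (0, mean * (1 - mean))")
    common = mean * (1.0 - mean) / var - 1.0  # equals a + b
    return mean * common, (1.0 - mean) * common


def beta_mean_var(a, b):
    """Mean and variance of a Beta(a, b) distribution."""
    total = a + b
    return a / total, a * b / (total**2 * (total + 1.0))


# Case 2: mean 0.10 and variance 0.0001
print(beta_from_mean_var(0.10, 0.0001))

# Case 1: moments implied by Beta(2, 10)
print(beta_mean_var(2.0, 10.0))
```

A mean of 0.10 and a variance of 0.0001 reproduce Beta(89.9, 809.1), while Beta(2, 10) has mean 1/6 (about 0.167) and variance of about 0.0107, which agrees with the stated 0.17 and 0.01 after rounding.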
Prior beta distributions for the misclassification probabilities
The details of the simulation study are described below:
Step I: Generate the
Step II: Using Case 1’s prior distribution, obtain posterior estimates under the FSM (or the SMM) using the first result of the test obtained in Step I.
Step III: Given the data
the FSM based on the number of successes obtained in the first
the SMM based on the simple majority classification obtained in the first
Step IV: Repeat Steps II and III using Case 2 prior distribution.
Step V: Repeat Steps I through IV for a large number (
Step VI: Find the average of
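The data-generating part of such a study can be sketched as follows: each true binary response is classified several times by a fallible test, and the repeated results are reduced to the number of reported successes (the input to the FSM) and to the simple majority result (the input to the SMM). The error probabilities, the number of repetitions, the sample size and the constant success probability of 0.3 (which stands in for covariate-driven probabilities) are all illustrative assumptions, and posterior sampling is not shown.

```python
import random


def classify_repeatedly(y_true, fp, fn, n_rep, rng):
    """Return n_rep independent fallible classifications of one true response.
    fp: probability a true failure is reported as a success.
    fn: probability a true success is reported as a failure."""
    results = []
    for _ in range(n_rep):
        if y_true == 1:
            results.append(0 if rng.random() < fn else 1)
        else:
            results.append(1 if rng.random() < fp else 0)
    return results


def summarise(results):
    """Number of reported successes and simple majority result (odd length assumed)."""
    total = sum(results)
    majority = 1 if total > len(results) / 2 else 0
    return total, majority


rng = random.Random(2024)  # fixed seed so the sketch is reproducible
n_subjects, n_rep = 1000, 3
y_true = [1 if rng.random() < 0.3 else 0 for _ in range(n_subjects)]
repeated = [classify_repeatedly(y, fp=0.10, fn=0.10, n_rep=n_rep, rng=rng)
            for y in y_true]
summaries = [summarise(r) for r in repeated]

# Empirical error rates of the first classification and of the majority result
single_err = sum(r[0] != y for r, y in zip(repeated, y_true)) / n_subjects
major_err = sum(s[1] != y for s, y in zip(summaries, y_true)) / n_subjects
print(f"single-classification error rate: {single_err:.3f}")
print(f"simple-majority error rate:       {major_err:.3f}")
```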
We used
The posterior mean, standard deviation (i.e., the square root of the average of the posterior variances, denoted by SD) and coverage probabilities (CP) of 90% credible (central posterior) intervals are shown in Tables 3 and 4. For
Estimates of the mean, standard deviation (SD) and coverage probabilities (CP) of 90% credible intervals for the intercept coefficient
Estimates of the mean, standard deviation (SD), and coverage probabilities (CP) of 90% credible intervals for Temperature
Biased parameter estimates were obtained under the standard model (
We also considered data from a random sample of 300 patients admitted to the University of New Mexico Trauma Center between 1991 and 1994. This dataset was first analysed in Bedrick et al. (1996) and Christensen's (1997) studies to explore CMPs. For each patient, they obtained data describing the injury severity score (ISS), revised trauma score (RTS), age (AGE), predominant type of injury (TI) and survival. ISS is the overall index of a patient's injuries and ranges from 0 (no injuries) to 75 (a patient with serious injuries at 3 or more body areas). RTS is an index of physiological injury, constructed as a weighted average of an incoming patient's respiratory rate, systolic blood pressure and a coma scale. RTS ranges from 0 (no vital signs) to 7.84 (normal vital signs). TI is a dichotomous variable, where TI
A trauma surgeon proposed a logistic regression model to estimate the probability of a patient's death. The model includes an intercept, the predictors ISS, RTS, AGE and TI, and the interaction of AGE and TI. This model also utilized CMPs, that is, the beta prior distributions for death probabilities on six sets of conditions
Prior distribution parameters for death probabilities
The obtained prior and posterior distributions for the regression coefficients are displayed in Figure 2. The posterior estimates obtained using our Ox routine (Table 6) are similar to those obtained by Christensen (1997).
Prior (solid line) and posterior (dashed line) distributions for regression coefficients
The analysis of the original dataset did not consider the possibility of misclassification and, as in the previous example, we generated contaminated ‘reported’ responses with
Posterior estimates for Trauma data
For original data (no misclassification), ISS, RTS and AGE were considered significant variables (the 90% central posterior interval for the regression coefficients associated with these variables did not cover
We supposed that three repeated evaluations of the patients were available and the logistic regression models accounting for misclassification were fitted. The results of the FSM with
Posterior estimates for Trauma data accounting for misclassification
Considering a single measure, ISS and RTS are significant, but the variable AGE is incorrectly considered to be a non-significant variable. Fitting the FSM with
Adjusted ORs were similar between the models considering the direction of effect, with the exceptions of TI and AGE*TI in the naive model.
A simulation study was carried out again to demonstrate the good performance of repeated measures. The same prior beta distributions for the misclassification probabilities were used: Beta(2, 10) (for Case 1) and Beta(89.9, 809.1) (for Case 2). For the regression coefficients, we used the CMPs that were described in Table 5. Posterior estimates obtained after
Estimates of the mean, standard deviation (SD) and coverage probabilities (CP) of 90% credible intervals for the intercept coefficient
Estimates of the mean, standard deviation (SD) and coverage probabilities (CP) of 90% credible intervals for ISS
Estimates of the mean, standard deviation (SD) and coverage probabilities (CP) of 90% credible intervals for RTS
Estimates of the mean, standard deviation (SD) and coverage probabilities (CP) of 90% credible intervals for AGE
Estimates of the mean, standard deviation (SD) and coverage probabilities (CP) of 90% credible intervals for TI
Estimates of the mean, standard deviation (SD) and coverage probabilities (CP) of 90% credible intervals for AGE*TI
The results obtained in this example are similar to those in the previous example: when increasing the number of repeated classifications, the bias and standard error decrease. Thus, these results indicate that the proposed models, which incorporate repeated test results, markedly improve parameter estimation in terms of bias and efficiency.
An industry that produces lenses for glasses contacted the Department of Statistics of the Federal University of Minas Gerais to evaluate the increase in the number of defective lenses after the diversification of the feedstock source (Region 0 or 1) and the use of an alternative laboratory built for lens production (Laboratory 0 or 1). The lenses were submitted to a visual inspection in order to detect defects such as scratches, cracks and excessive thickness or thinness, among others. If a lens showed one or more defects, it was classified as ‘defective’; otherwise, it was classified as ‘perfect’. Four inspectors took part in the evaluation study, and they performed five repeated independent classifications of 400 randomly selected lenses from the production. The selection was made in equal quantities for each combination of feedstock source (
A summary of the data obtained after five repeated independent classifications is given in Table 14. It was observed that, for example, among five repeated classifications, 13 L0
In this example, we will restrict the analysis to logistic regression. For the regression coefficients, we used the aforementioned CMPs (Bedrick et al., 1996). Since we have two covariates, we chose three covariate configurations to induce the prior distribution for the regression coefficients. We carried out the elicitation of the prior with the kind collaboration of the process and quality control manager of the industry, who was asked to choose the three configurations and to provide values for the 1%, 50% and 99% quantiles of his prior density for the probability of a perfect lens with those covariate characteristics. This is a purposeful over-specification of the beta distributions; therefore, to calculate the prior hyperparameters, we used the two quantiles about which the expert was more confident, namely 50% and 99%, and the third one to assess the consistency of the choice. This process was iterated until a consistent set of values was found and then the hyperparameters were obtained numerically.
Frequency of ‘perfect’ classifications among five repeated classifications
A similar procedure was used to induce the beta prior distributions for the misclassification probabilities. The complete set of elicited quantiles and obtained hyperparameters are given in Table 15.
Expert elicitation of prior quantiles
To conduct a Bayesian analysis, posterior distributions for FSM and SMM were obtained by the MCMC method. Figure 3 displays the prior and posterior distributions of the probability of a perfect lens for each combination of Laboratory (
The posterior estimates for all parameters are given in Table 16. The estimated regression coefficients associated with Laboratory and Supplier and the misclassification probabilities were quite similar under SMM and FSM, but the extra amount of information available for FSM results in narrower credible intervals for all the parameters.
Posterior estimates for lens data
The 95% highest posterior density (HPD) credible intervals: Perfect lenses by laboratory and region
In practical terms, using the FSM, the feedstock originating in Region 1 and Laboratory 1 (
In the context of binomial regression with misclassification, we proposed to perform multiple assessments of the sample units. This additional information was incorporated using two different models. The first model uses the total number of successes obtained (FSM), while the second model uses simple majority classification (SMM). The new methodology we introduced is quite general and can be used to address a variety of practical problems.
The FSM results are closer to those obtained in a no-misclassification case with a lower standard deviation than the results of the SMM. However, both methods lead to similar conclusions. In this sense, the SMM may be a reasonable recommendation for cases in which we only have the final classification of the individuals.
A sensitivity analysis that took into account different prior specifications was performed. We demonstrated that the repetitive test procedure is a good alternative if there is some (or even little) information about classification errors, as the proposed models did not appear to be sensitive to different prior specifications.
We provided a discussion about the local identifiability of the proposed models. As far as we know, proving global identifiability for binary regression models subject to non-differential misclassification of the response variable is an open problem that would benefit from future studies. Future studies should also perform Bayesian model selection (as in Gerlach and Stamey, 2007) and extend the proposed model to accommodate random effects (or frailties).
Appendix
Usually, consistent estimators of identifiable model parameters can be obtained in likelihood or Bayesian approaches, in the sense that estimates will converge to true parameter values as the sample size increases. However, if the true distribution of observable data in fact corresponds to more than one set of parameter values (non-identifiable situations), then no amount of data can lead to the true parameter values (Gustafson, 2004). Therefore, although Bayesian analysis can be performed under a non-identified model if proper prior distributions are used, the arising posterior distribution may not have certain appealing theoretical properties. For example, the joint posterior distribution of the parameters may not converge asymptotically (Dendukuri et al., 2010), the posterior distribution may not concentrate the mass around the true value of a parameter when the sample size increases, and there tend to be strong correlations between parameters in the posterior distribution, resulting in poor mixing of Markov chains used to explore the posterior, and thus rendering estimators that may fail to converge within reasonable computing times (Swartz et al., 2004; Gelfand and Sahu, 1999). Furthermore, even large sample sizes may be unable to overcome prior information (Johnson et al., 2001), or the posterior distributions arising from different priors will not match as the sample size increases.
In this sense, even when adopting a Bayesian approach, it is important to evaluate the situations that guarantee the identifiability of parameters in the proposed models, in order to minimize practical problems.
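A well-known instance of non-identifiability with a single fallible classification can be checked numerically. In the sketch below, the observed success probability is written as a0 + (1 - a0 - a1) F(x'β) with a symmetric link F, the form used by Hausman et al. (1998); the names `a0` (false-positive probability) and `a1` (false-negative probability) and all parameter values are assumptions made for illustration.

```python
import math


def expit(z):
    """Inverse logit, a symmetric link: expit(-z) = 1 - expit(z)."""
    return 1.0 / (1.0 + math.exp(-z))


def observed_prob(x, a0, a1, beta0, beta1):
    """P(observed success | x) with false-positive probability a0
    and false-negative probability a1."""
    p = expit(beta0 + beta1 * x)
    return a0 + (1.0 - a0 - a1) * p


theta = dict(a0=0.10, a1=0.20, beta0=-0.5, beta1=1.5)

# Mirror image: swap and complement the error rates, flip the coefficient signs
theta_mirror = dict(a0=1.0 - theta["a1"], a1=1.0 - theta["a0"],
                    beta0=-theta["beta0"], beta1=-theta["beta1"])

xs = [-2.0, -1.0, 0.0, 0.5, 1.0, 3.0]
max_gap = max(abs(observed_prob(x, **theta) - observed_prob(x, **theta_mirror))
              for x in xs)
print(f"largest difference in observed probabilities: {max_gap:.2e}")
```

The two parameter sets give the same observed probabilities at every covariate value, so data from a single classification cannot separate them. Requiring a0 + a1 < 1 excludes the mirror image.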
First, consider FSM (or SMM) with
where
Following Newey and McFadden (1994), Hausman et al. (1998) state, in Theorem 1 (p. 243) that, under the probit link function,
where
Thus, estimators based on (7.1), such as NLS and MLE, cannot distinguish between the parameter values
Note that restriction
FSM local identifiability
If
where
Probability distribution under FSM in a non-identifiability case
It is also possible to prove FSM local identifiability through the results obtained by Teicher (1963) and Blischke (1962). To do so, consider the simple case with no covariates:
where
If there are
Note that in this step the parameters of interest
This completes the proof that
SMM local identifiability
For SMM with
This fact is illustrated in Table A2, with
Probability distribution under SMM in a non-identifiability case
SMM local identifiability can also be proved directly using Hausman's probability function (Hausman et al., 1998), as described in the following. Suppose that each sample subject is evaluated
Following the results obtained for (A.1) by Hausman et al. (1998),
This completes the proof that
After choosing adequate initial values for
Sample
For FSM, given β^(r−1), λ^(r−1) and t, sample d^(r) in:
where
Do
For SMM, given
Do
Sample
For FSM, given
where
For SMM, given
Sample
For FSM, given
and sample
Do
For SMM, given
and sample
Then, do
The Adaptive Rejection Metropolis Sampling (ARMS) algorithm (Gilks et al., 1995) can be used to sample from (B.1), (B.2), (B.3) and (B.4).
Appendix C: Convergence issue in FSM and SMM
A simple way to evaluate the convergence of a chain is to monitor the ergodic means. Convergence occurs when the ergodic means of chains that started from different initial values converge to the same values. Having checked this, we use a single chain to produce inferences. Monitoring autocorrelation is also very useful: low values indicate fast convergence and high values indicate slow convergence.
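Both diagnostics are simple to compute from stored draws. The sketch below is a minimal illustration on a toy autoregressive chain and is not the code used to produce the figures; the function names and the toy chain are assumptions.

```python
import random


def ergodic_means(chain):
    """Running (cumulative) means of a sequence of draws."""
    means, total = [], 0.0
    for i, value in enumerate(chain, start=1):
        total += value
        means.append(total / i)
    return means


def autocorrelation(chain, lag):
    """Sample autocorrelation of the chain at the given lag."""
    n = len(chain)
    if not 0 <= lag < n:
        raise ValueError("lag must satisfy 0 <= lag < len(chain)")
    mean = sum(chain) / n
    var = sum((v - mean) ** 2 for v in chain)
    if var == 0.0:
        raise ValueError("autocorrelation is undefined for a constant chain")
    cov = sum((chain[i] - mean) * (chain[i + lag] - mean) for i in range(n - lag))
    return cov / var


def toy_chain(start, n, rng, rho=0.9):
    """Toy AR(1) draws standing in for MCMC output (illustration only)."""
    draws, x = [], start
    for _ in range(n):
        x = rho * x + rng.gauss(0.0, 1.0)
        draws.append(x)
    return draws


rng = random.Random(1)
chain_a = toy_chain(start=10.0, n=5000, rng=rng)
chain_b = toy_chain(start=-10.0, n=5000, rng=rng)

# Ergodic means of chains started far apart should approach the same value
print(ergodic_means(chain_a)[-1], ergodic_means(chain_b)[-1])
# Autocorrelation at several lags; slow decay suggests thinning the chain
print([round(autocorrelation(chain_a, k), 2) for k in (1, 10, 50)])
```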
Figures C1 and C2 show the monitoring charts of the ergodic means and autocorrelations of the SMM parameters for one sample generated in the O-ring and Trauma simulation studies, respectively.
This analysis was performed for many generated samples and scenarios using the R package ‘MCMCpack’. Therefore, a burn-in period of 50 000 and a lag of 10 iterations were used. Figure C3 shows convergence results of FSM for the Lens data, for which we used a burn-in period of 50 000 and a lag of 15 iterations.
SMM in O-ring simulation: Convergence results
SMM in Trauma simulation: Convergence results
FSM for Lens data: Convergence results
Acknowledgments
This work was supported by CAPES, CNPq and Pró-Reitoria de Pesquisa–UFMG [grant number 04/2016]. We would like to thank the editor, an associate editor and the two referees for their detailed and insightful comments, which led to a much improved manuscript.
