Estimating the AUC of mixture MROC curve in the presence of measurement errors

Abstract

In classification problems, data may arrive with or without class labels. If the class labels of individuals are unknown or masked by hidden components, the classification rule must also identify the actual number of subcomponents in the data. In addition, measurement errors in the data may distort the summary measures of the receiver operating characteristic (ROC) model. In this paper, a mixture of multivariate receiver operating characteristic models is proposed to deal with multimodal patterns in the data, and a bias-corrected estimator is derived for the area under the curve of the proposed model. The proposed methodology is illustrated with simulation studies and a real dataset.

Keywords

Multivariate ROC curve, measurement error, area under the curve

1. Introduction

The receiver operating characteristic (ROC) curve is a widely used tool for evaluating the performance of a diagnostic test and for comparing diagnostic tests by means of the area under the curve (AUC) and sensitivities. The AUC is a measure of accuracy that describes how well a diagnostic test can allocate an individual to one of the predefined classes. It takes a value between 0 and 1; in practice, a useful test has an AUC between 0.5 (no better than chance) and 1 (perfect discrimination), and a higher AUC indicates better performance. Balaswamy and Vishnu Vardhan (2016) provide exhaustive coverage of several bi-distributional parametric ROC models developed for normal and non-normal data.

The most widely used parametric ROC model is the bi-normal model, which assumes that both populations are distributed according to normal distributions. The bi-distributional ROC models that are available in the literature are mostly based on a univariate setup. In the recent past, multivariate versions of the ROC curve were also proposed under the assumption that the population follows a multivariate normal distribution (Su and Liu, 1993; Yin and Tian, 2014; Sameera et al., 2016).

The profile of a patient or subject is built from measurements obtained from a diagnostic test or marker, in a univariate or multivariate setup. Such measurements are often susceptible to errors, which may arise from the laboratory instruments, the knowledge of the technicians, biological variability, temporal changes in subjects, etc. In medicine, for example, technicians use laboratory instruments to record a patient’s systolic and diastolic blood pressure, and these recordings are subject to measurement error if the instrument or the person reporting them produces incorrect results. The presence of such measurement errors (MEs) may introduce both bias and large variability in the data, which in turn reduces the power of the study and contaminates the information expressed by the measures of the ROC curve. In an ROC setting, Coffin and Sukhatme (1996) demonstrated that in the presence of measurement errors the area estimates of a bi-normal ROC curve are biased downward, and they developed a bias-corrected approximation for the AUC. The asymptotic distribution of these corrected estimators was given by Kim and Gleser (2000). Faraggi (2000) derived confidence intervals and coverage probabilities for the AUC in the presence of MEs. Further references on confidence intervals for the AUC in the presence of MEs include Reiser (2000), Schisterman et al. (2001), Tosteson et al. (2005) and Perkins et al. (2009).

The ROC models discussed so far were developed with prior knowledge of the class labels. Whether the class labels are known or unknown, subcomponents within the data still need to be identified, because information is masked if the analysis proceeds with the existing class labels alone. For illustration, the Vertebral Column dataset is considered and the probability density function plots for each variable are drawn. The plots support the claim that the diseased population contains two components, so that the dataset has three components in total: Healthy ( $H$ ), Disk Hernia ( $D_{1}$ ) and Spondylolisthesis ( $D_{2}$ ). Hence, two classifiers are needed to classify individuals into one of the three populations. In such scenarios, the existing procedures for binary classification and for correcting the AUC in the presence of measurement error may not be feasible. With this background, the present work develops a mixture of multivariate ROC (mMROC) model to address the problem of subcomponents in each class, and derives a bias-corrected approximation for the AUC when the observations are measured with error.

The rest of the paper is organized as follows. In Section 2, the proposed methodology is discussed in detail, including the derivations for the mixture ROC model and the estimation of the AUC in the presence of measurement errors. The methodology is illustrated with simulated and real datasets in Section 3. A summary of the work is presented in Section 4.

2. Methodology

2.1 Mixture of Multivariate Receiver Operating Characteristic (mMROC) curve

Figure 1.

Hypothetically overlapping probability density curves of healthy and diseased populations.

Let us assume that a total dataset has three subcomponents based on the hypothetical structures shown in Fig. 1. Let $S=\left\{X,Y,Z\right\}$ denote the data matrix containing information about the classes healthy ( $H$ ), diseased population 1 ( $D_{1}$ ) and diseased population 2 ( $D_{2}$ ), respectively, each of which is assumed to follow a multivariate normal distribution.

Notationally, we can set $X\sim\textit{MVN}(\bm{\mu}_{H},\Sigma_{H});\;Y\sim\textit{MVN}(\bm{\mu}_{D_{1}},\Sigma_{D_{1}})$ and $Z\sim\textit{MVN}(\bm{\mu}_{D_{2}},\Sigma_{D_{2}})$ , where $\bm{\mu}_{H},\;\bm{\mu}_{D_{1}},\;\bm{\mu}_{D_{2}}$ and $\Sigma_{H}$ , $\Sigma_{D_{1}}$ , $\Sigma_{D_{2}}$ are the mean vectors and covariance matrices of $H,D_{1}$ and $D_{2}$ populations, respectively.

The expressions for FPR and TPR in the mixture form are defined as

$\displaystyle\textit{FPR}=\lambda_{1}\textit{FPR}_{1}+\lambda_{2}\textit{FPR}_{2},\qquad\textit{TPR}=\lambda_{1}\textit{TPR}_{1}+\lambda_{2}\textit{TPR}_{2},$ (1)

where $\lambda_{1}$ and $\lambda_{2}$ are the mixing proportions obtained through the EM-algorithm.

By definition, we write

$\displaystyle\textit{FPR}_{1}=1-\Phi\left(\frac{c_{1}-b_{1}^{\prime}\bm{\mu}_{H}}{\sqrt{b_{1}^{\prime}\Sigma_{H}b_{1}}}\right);\qquad\textit{FPR}_{2}=1-\Phi\left(\frac{c_{2}-b_{2}^{\prime}\bm{\mu}_{D_{1}}}{\sqrt{b_{2}^{\prime}\Sigma_{D_{1}}b_{2}}}\right)$

$\displaystyle\textit{TPR}_{1}=\Phi\left(\frac{b_{1}^{\prime}\bm{\mu}_{D_{1}}-c_{1}}{\sqrt{b_{1}^{\prime}\Sigma_{D_{1}}b_{1}}}\right);\qquad\textit{TPR}_{2}=\Phi\left(\frac{b_{2}^{\prime}\bm{\mu}_{D_{2}}-c_{2}}{\sqrt{b_{2}^{\prime}\Sigma_{D_{2}}b_{2}}}\right),$ (2)

where $c_{1}$ and $c_{2}$ are the threshold values, $b_{1}$ and $b_{2}$ $(\neq 0)$ are the vector of coefficients of $k$ variables obtained from a minimax procedure. They are given in Sameera et al. (2016) as

$\displaystyle b_{1}=[t\Sigma_{D_{1}}+(1-t)\Sigma_{H}]^{-1}(\bm{\mu}_{D_{1}}-\bm{\mu}_{H});\qquad b_{2}=[t\Sigma_{D_{2}}+(1-t)\Sigma_{D_{1}}]^{-1}(\bm{\mu}_{D_{2}}-\bm{\mu}_{D_{1}}),$

where $t$ is a constant between $0$ and $1$ that is determined by trial and error with a $0.1$ increment. Here, $b_{1}$ is the coefficient vector for discriminating between the $H$ and $D_{1}$ populations, and $b_{2}$ is the coefficient vector for discriminating between the $D_{1}$ and $D_{2}$ populations. From Eq. (2), the expressions for $c_{1}$ and $c_{2}$ are as follows:

$\displaystyle c_{1}=b_{1}^{\prime}\bm{\mu}_{H}+\sqrt{b_{1}^{\prime}\Sigma_{H}b_{1}}\;\Phi^{-1}(1-\textit{FPR}_{1});\quad c_{2}=b_{2}^{\prime}\bm{\mu}_{D_{1}}+\sqrt{b_{2}^{\prime}\Sigma_{D_{1}}b_{2}}\;\Phi^{-1}(1-\textit{FPR}_{2}),$ (3)

where $\Phi^{-1}$ is the inverse cumulative distribution function of the standard normal distribution. The ROC expression is obtained by substituting Eq. (3) into the TPR expressions of Eq. (2) and combining them as in Eq. (1). After simplification, we get

$\displaystyle\textit{ROC}=\lambda_{1}\Phi\left[\alpha_{1}+\beta_{1}\Phi^{-1}\left(1-\textit{FPR}_{1}\right)\right]+\lambda_{2}\Phi\left[\alpha_{2}+\beta_{2}\Phi^{-1}\left(1-\textit{FPR}_{2}\right)\right],$

where $\alpha_{1}=\frac{b_{1}^{\prime}(\bm{\mu}_{D_{1}}-\bm{\mu}_{H})}{\sqrt{b_{1}^{\prime}\Sigma_{D_{1}}b_{1}}}$ , $\alpha_{2}=\frac{b_{2}^{\prime}(\bm{\mu}_{D_{2}}-\bm{\mu}_{D_{1}})}{\sqrt{b_{2}^{\prime}\Sigma_{D_{2}}b_{2}}}$ , $\beta_{1}=-\frac{\sqrt{b_{1}^{\prime}\Sigma_{H}b_{1}}}{\sqrt{b_{1}^{\prime}\Sigma_{D_{1}}b_{1}}}$ , $\beta_{2}=-\frac{\sqrt{b_{2}^{\prime}\Sigma_{D_{1}}b_{2}}}{\sqrt{b_{2}^{\prime}\Sigma_{D_{2}}b_{2}}}$ . This quantity is called a 2-component mixture of multivariate receiver operating characteristic (mMROC) curve.
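As a rough numerical illustration of the curve above, the following Python sketch evaluates the two component curves and their mixture for the Set A parameters listed in Table 1. The helper names (`quad`, `coef`, `component_tpr`, `mmroc`), the restriction to two markers, and the fixed choice $t=0.5$ are assumptions made for this example only; the text selects $t$ by trial and error.

```python
from math import sqrt
from statistics import NormalDist

STD = NormalDist()  # standard normal: STD.cdf is Phi, STD.inv_cdf is Phi^{-1}

def quad(b, S):
    """Quadratic form b' S b for a 2-vector b and a 2x2 matrix S."""
    return sum(b[i] * S[i][j] * b[j] for i in range(2) for j in range(2))

def coef(mu_lo, mu_hi, S_lo, S_hi, t):
    """b = [t*S_hi + (1-t)*S_lo]^{-1} (mu_hi - mu_lo), written out for 2x2."""
    m = [[t * S_hi[i][j] + (1 - t) * S_lo[i][j] for j in range(2)]
         for i in range(2)]
    det = m[0][0] * m[1][1] - m[0][1] * m[1][0]
    d = [mu_hi[0] - mu_lo[0], mu_hi[1] - mu_lo[1]]
    return [(m[1][1] * d[0] - m[0][1] * d[1]) / det,
            (m[0][0] * d[1] - m[1][0] * d[0]) / det]

def component_tpr(fpr, mu_lo, mu_hi, S_lo, S_hi, t=0.5):
    """Phi(alpha + beta * Phi^{-1}(1 - fpr)) for one pair of populations."""
    b = coef(mu_lo, mu_hi, S_lo, S_hi, t)
    shift = b[0] * (mu_hi[0] - mu_lo[0]) + b[1] * (mu_hi[1] - mu_lo[1])
    sd_lo, sd_hi = sqrt(quad(b, S_lo)), sqrt(quad(b, S_hi))
    alpha, beta = shift / sd_hi, -sd_lo / sd_hi
    return STD.cdf(alpha + beta * STD.inv_cdf(1 - fpr))

# Set A of Table 1 (t = 0.5 is an assumption of this sketch)
MU_H, MU_D1, MU_D2 = [15.2, 16.8], [18.1, 17.8], [26.2, 24.1]
S_H, S_D1, S_D2 = [[2, 1], [1, 2]], [[2, 1], [1, 4]], [[2, 1], [1, 6]]

def mmroc(fpr1, fpr2, lam1=0.5, lam2=0.5):
    """Mixture TPR at component false positive rates fpr1 and fpr2."""
    return (lam1 * component_tpr(fpr1, MU_H, MU_D1, S_H, S_D1)
            + lam2 * component_tpr(fpr2, MU_D1, MU_D2, S_D1, S_D2))

for f in (0.05, 0.10, 0.25, 0.50):
    print(f, round(mmroc(f, f), 4))
```

The false positive rates must lie strictly between 0 and 1, because `inv_cdf` is undefined at the endpoints. Evaluating both components at the same rate is only a convenience for tabulating the curve.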

The expression of the mMROC in terms of AUCs is

$\displaystyle\Theta=\lambda_{1}\Phi\left(\frac{b_{1}^{\prime}(\bm{\mu}_{D_{1}}-\bm{\mu}_{H})}{\sqrt{b_{1}^{\prime}(\Sigma_{D_{1}}+\Sigma_{H})b_{1}}}\right)+\lambda_{2}\Phi\left(\frac{b_{2}^{\prime}(\bm{\mu}_{D_{2}}-\bm{\mu}_{D_{1}})}{\sqrt{b_{2}^{\prime}(\Sigma_{D_{2}}+\Sigma_{D_{1}})b_{2}}}\right),$

that is, $\Theta=\lambda_{1}\;\Theta_{1}+\lambda_{2}\;\Theta_{2}$ , where $\Theta_{1}$ is the AUC for discriminating between the $H$ and $D_{1}$ populations and $\Theta_{2}$ is the AUC for discriminating between the $D_{1}$ and $D_{2}$ populations.
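The mixture AUC can be checked numerically with a short Python sketch. It uses the Set A parameters of Table 1 with equal mixing proportions; the function names, the two-marker restriction and the fixed value $t=0.5$ are assumptions of this example. Under these assumptions the mixture value should come out near 0.964, which is in line with the Set A entry of Table 2.

```python
from math import sqrt
from statistics import NormalDist

def quad(b, S):
    """Quadratic form b' S b for a 2-vector b and a 2x2 matrix S."""
    return sum(b[i] * S[i][j] * b[j] for i in range(2) for j in range(2))

def coef(mu_lo, mu_hi, S_lo, S_hi, t):
    """b = [t*S_hi + (1-t)*S_lo]^{-1} (mu_hi - mu_lo), written out for 2x2."""
    m = [[t * S_hi[i][j] + (1 - t) * S_lo[i][j] for j in range(2)]
         for i in range(2)]
    det = m[0][0] * m[1][1] - m[0][1] * m[1][0]
    d = [mu_hi[0] - mu_lo[0], mu_hi[1] - mu_lo[1]]
    return [(m[1][1] * d[0] - m[0][1] * d[1]) / det,
            (m[0][0] * d[1] - m[1][0] * d[0]) / det]

def pair_auc(mu_lo, mu_hi, S_lo, S_hi, t=0.5):
    """Phi( b'(mu_hi - mu_lo) / sqrt(b'(S_hi + S_lo) b) ) for one pair."""
    b = coef(mu_lo, mu_hi, S_lo, S_hi, t)
    shift = b[0] * (mu_hi[0] - mu_lo[0]) + b[1] * (mu_hi[1] - mu_lo[1])
    pooled = [[S_lo[i][j] + S_hi[i][j] for j in range(2)] for i in range(2)]
    return NormalDist().cdf(shift / sqrt(quad(b, pooled)))

# Set A of Table 1
MU_H, MU_D1, MU_D2 = [15.2, 16.8], [18.1, 17.8], [26.2, 24.1]
S_H, S_D1, S_D2 = [[2, 1], [1, 2]], [[2, 1], [1, 4]], [[2, 1], [1, 6]]

theta1 = pair_auc(MU_H, MU_D1, S_H, S_D1)      # H versus D1
theta2 = pair_auc(MU_D1, MU_D2, S_D1, S_D2)    # D1 versus D2
theta = 0.5 * theta1 + 0.5 * theta2            # lambda1 = lambda2 = 0.5
print(round(theta1, 4), round(theta2, 4), round(theta, 4))
```

The second component is almost perfectly separated for these parameters, so the mixture AUC is driven mainly by the $H$ versus $D_{1}$ comparison.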

2.2 Bias corrected approximation

Suppose the markers are subject to measurement errors. Then, instead of observing $X_{i}$ , $i=1,\ldots,m,$ $Y_{j},j=1,\ldots,n,$ and $Z_{k},k=1,2,\ldots,l,$ we observe the data matrix $S=\left\{X,Y,Z\right\}$ as $S_{e}=\left\{A,B,C\right\}$ , where

$\displaystyle A_{i}=X_{i}+\epsilon_{i},\;i=1,2,\ldots,m;\;\epsilon\sim\textit{iid}\;N(0,b^{\prime}\Sigma_{\epsilon}b)$

$\displaystyle B_{j}=Y_{j}+\eta_{j},\;j=1,2,\ldots,n;\;\eta\sim\textit{iid}\;N(0,b^{\prime}\Sigma_{\eta}b)$

$\displaystyle C_{k}=Z_{k}+\gamma_{k},\;k=1,2,\ldots,l;\;\gamma\sim\textit{iid}\;N(0,b^{\prime}\Sigma_{\gamma}b)$

and $\Sigma_{\epsilon},\Sigma_{\eta}$ and $\Sigma_{\gamma}$ are the covariance matrices of the random measurement errors. When such measurement errors are present, the likelihood estimator of the AUC can be defined as

$\displaystyle\hat{\Theta}_{\textit{ME}}=\hat{\lambda}_{1}\hat{\Theta}_{1}+\hat{\lambda}_{2}\hat{\Theta}_{2},$

where $\hat{\Theta}_{1}=\Phi\left(\frac{\hat{b}_{1}^{\prime}(\hat{\bm{\mu}}_{B}-\hat{\bm{\mu}}_{A})}{\sqrt{\hat{b}_{1}^{\prime}(\hat{\Sigma}_{B}+\hat{\Sigma}_{A})\hat{b}_{1}}}\right)$ , $\hat{\Theta}_{2}=\Phi\left(\frac{\hat{b}_{2}^{\prime}(\hat{\bm{\mu}}_{C}-\hat{\bm{\mu}}_{B})}{\sqrt{\hat{b}_{2}^{\prime}(\hat{\Sigma}_{C}+\hat{\Sigma}_{B})\hat{b}_{2}}}\right)$ , and $\hat{\Sigma}_{A}$ , $\hat{\Sigma}_{B}$ and $\hat{\Sigma}_{C}$ are the estimated covariance matrices of the populations $A$ , $B$ and $C$ , respectively. Since the observations are measured with error, we have $E(\hat{\Theta}_{\textit{ME}})=\Theta+O(1)$ , that is, the bias does not vanish as the sample size grows. In order to correct the bias in $\hat{\Theta}_{\textit{ME}}$ , let $\bm{\delta}_{1}=\bm{\epsilon}-\bm{\eta}\sim N(0,b_{1}^{\prime}\Sigma_{\epsilon}b_{1}+b_{1}^{\prime}\Sigma_{\eta}b_{1}),\;\bm{\delta}_{2}=\bm{\eta}-\bm{\gamma}\sim N(0,b_{2}^{\prime}\Sigma_{\eta}b_{2}+b_{2}^{\prime}\Sigma_{\gamma}b_{2})$ . The expectations of the component estimators can then be approximated as follows:

$\displaystyle E(\hat{\Theta}_{1})\simeq P(Y>X+\bm{\delta}_{1})=\iint[1-F_{Y}(s+t)]f_{X}(s)f_{\bm{\delta}_{1}}(t)dsdt\simeq\Theta_{1}-\frac{E(\bm{\delta}_{1}\bm{\delta}_{1}^{\prime})}{2}\int f_{Y}^{\prime}(s)f_{X}(s)ds=\Theta_{1}+\Omega_{1}$

$\displaystyle E(\hat{\Theta}_{2})\simeq P(Z>Y+\bm{\delta}_{2})=\iint[1-F_{Z}(s+t)]f_{Y}(s)f_{\bm{\delta}_{2}}(t)dsdt\simeq\Theta_{2}-\frac{E(\bm{\delta}_{2}\bm{\delta}_{2}^{\prime})}{2}\int f_{Z}^{\prime}(s)f_{Y}(s)ds=\Theta_{2}+\Omega_{2},$

where $F_{Y}(.)$ and $F_{Z}(.)$ are the cumulative distribution functions of $Y$ and $Z$ , respectively, $f_{X}(.)$ , $f_{Y}(.)$ and $f_{Z}(.)$ are the probability density functions of $X$ , $Y$ and $Z$ , respectively, a prime on a density denotes its derivative, and $f_{\bm{\delta}_{1}}(.)$ and $f_{\bm{\delta}_{2}}(.)$ are the probability density functions of $\bm{\delta}_{1}$ and $\bm{\delta}_{2}$ , respectively. After simplification, the approximate bias expressions in using $\hat{\Theta}_{\textit{ME}}$ to estimate $\Theta$ are

$\displaystyle\hat{\Omega}_{1}=\frac{b_{1}^{\prime}(\hat{\Sigma}_{\epsilon}+\hat{\Sigma}_{\eta})b_{1}}{2\sqrt{2\pi}(\tau_{1})}\left(\frac{b_{1}^{\prime}(\hat{\bm{\mu}}_{D_{1}}-\hat{\bm{\mu}}_{H})}{\sqrt{\tau_{1}}}\right)\exp\left\{-\frac{1}{2}\left(\frac{b_{1}^{\prime}(\hat{\bm{\mu}}_{D_{1}}-\hat{\bm{\mu}}_{H})}{\sqrt{\tau_{1}}}\right)^{2}\right\}$

$\displaystyle\hat{\Omega}_{2}=\frac{b_{2}^{\prime}(\hat{\Sigma}_{\eta}+\hat{\Sigma}_{\gamma})b_{2}}{2\sqrt{2\pi}(\tau_{2})}\left(\frac{b_{2}^{\prime}(\hat{\bm{\mu}}_{D_{2}}-\hat{\bm{\mu}}_{D_{1}})}{\sqrt{\tau_{2}}}\right)\exp\left\{-\frac{1}{2}\left(\frac{b_{2}^{\prime}(\hat{\bm{\mu}}_{D_{2}}-\hat{\bm{\mu}}_{D_{1}})}{\sqrt{\tau_{2}}}\right)^{2}\right\},$

where $\tau_{1}=b_{1}^{\prime}[(\hat{\Sigma}_{D_{1}}+\hat{\Sigma}_{H})-(\hat{\Sigma}_{\epsilon}+\hat{\Sigma}_{\eta})]b_{1}$ , and $\tau_{2}=b_{2}^{\prime}[(\hat{\Sigma}_{D_{2}}+\hat{\Sigma}_{D_{1}})-(\hat{\Sigma}_{\eta}+\hat{\Sigma}_{\gamma})]b_{2}$ . Then the bias-corrected estimator of $\Theta$ is defined by the following mixture form:

$\displaystyle\hat{\Theta}_{\textit{corr}}=\hat{\lambda}_{1}\hat{\Theta}_{1}^{\prime}+\hat{\lambda}_{2}\hat{\Theta}_{2}^{\prime},$ (4)

where $\hat{\Theta}_{1}^{\prime}=\hat{\Theta}_{1}+\hat{\Omega}_{1}$ and $\hat{\Theta}_{2}^{\prime}=\hat{\Theta}_{2}+\hat{\Omega}_{2}$ . It is the corrected estimator for $\Theta$ in the presence of measurement errors.
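The correction in Eq. (4) can be sketched in Python as follows. The function names and the two-marker restriction are assumptions of this example, as is the fixed $t=0.5$. For illustration only, the "estimates" passed in are idealised stand-ins: the Set A means of Table 1 and the Set A covariances inflated by the Case (i) error covariance of Table 2, rather than quantities estimated from a sample.

```python
from math import exp, pi, sqrt
from statistics import NormalDist

def quad(b, S):
    """Quadratic form b' S b for a 2-vector b and a 2x2 matrix S."""
    return sum(b[i] * S[i][j] * b[j] for i in range(2) for j in range(2))

def add(P, Q):
    """Element-wise sum of two 2x2 matrices."""
    return [[P[i][j] + Q[i][j] for j in range(2)] for i in range(2)]

def coef(mu_lo, mu_hi, S_lo, S_hi, t):
    """b = [t*S_hi + (1-t)*S_lo]^{-1} (mu_hi - mu_lo), written out for 2x2."""
    m = [[t * S_hi[i][j] + (1 - t) * S_lo[i][j] for j in range(2)]
         for i in range(2)]
    det = m[0][0] * m[1][1] - m[0][1] * m[1][0]
    d = [mu_hi[0] - mu_lo[0], mu_hi[1] - mu_lo[1]]
    return [(m[1][1] * d[0] - m[0][1] * d[1]) / det,
            (m[0][0] * d[1] - m[1][0] * d[0]) / det]

def corrected_pair_auc(mu_lo, mu_hi, S_lo, S_hi, E_lo, E_hi, t=0.5):
    """Return (theta_hat, omega_hat, theta_hat + omega_hat) for one pair.

    mu_*, S_* describe the error-contaminated data; E_* are the error
    covariance matrices of the two populations being compared.
    """
    b = coef(mu_lo, mu_hi, S_lo, S_hi, t)
    shift = b[0] * (mu_hi[0] - mu_lo[0]) + b[1] * (mu_hi[1] - mu_lo[1])
    total = quad(b, add(S_lo, S_hi))   # b'(S_hi + S_lo) b
    err = quad(b, add(E_lo, E_hi))     # b'(E_hi + E_lo) b
    tau = total - err
    if tau <= 0:
        raise ValueError("error variance exceeds the observed variance")
    theta_hat = NormalDist().cdf(shift / sqrt(total))
    z = shift / sqrt(tau)
    omega_hat = err / (2 * sqrt(2 * pi) * tau) * z * exp(-0.5 * z * z)
    return theta_hat, omega_hat, theta_hat + omega_hat

# Idealised stand-ins for the contaminated-data estimates (Set A, Case (i))
E = [[1.7, 0.0], [0.0, 1.7]]
MU_A, MU_B, MU_C = [15.2, 16.8], [18.1, 17.8], [26.2, 24.1]
S_A = add([[2, 1], [1, 2]], E)
S_B = add([[2, 1], [1, 4]], E)
S_C = add([[2, 1], [1, 6]], E)

t1, o1, c1 = corrected_pair_auc(MU_A, MU_B, S_A, S_B, E, E)
t2, o2, c2 = corrected_pair_auc(MU_B, MU_C, S_B, S_C, E, E)
theta_me = 0.5 * t1 + 0.5 * t2     # uncorrected mixture AUC
theta_corr = 0.5 * c1 + 0.5 * c2   # Eq. (4) with lambda1 = lambda2 = 0.5
print(round(theta_me, 4), round(theta_corr, 4))
```

The guard on `tau` reflects that the correction is only defined when the assumed error variance along $b$ is smaller than the observed variance along $b$.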

3. Results

3.1 Simulation study

Table 1
Considered mixing proportions, mean vectors and covariance matrices for simulation studies

Sets	$\lambda_{1}$	$\lambda_{2}$	$\bm{\mu}_{H}$	$\bm{\mu}_{D_{1}}$	$\bm{\mu}_{D_{2}}$	$\Sigma_{H}$	$\Sigma_{D_{1}}$	$\Sigma_{D_{2}}$
A	0.5	0.5	$\begin{pmatrix}15.2\\ 16.8\end{pmatrix}$	$\begin{pmatrix}18.1\\ 17.8\end{pmatrix}$	$\begin{pmatrix}26.2\\ 24.1\end{pmatrix}$	$\begin{pmatrix}2&1\\ 1&2\end{pmatrix}$	$\begin{pmatrix}2&1\\ 1&4\end{pmatrix}$	$\begin{pmatrix}2&1\\ 1&6\end{pmatrix}$
B	0.5	0.5	$\begin{pmatrix}15.2\\ 16.8\end{pmatrix}$	$\begin{pmatrix}18.1\\ 17.8\end{pmatrix}$	$\begin{pmatrix}26.2\\ 24.1\end{pmatrix}$	$\begin{pmatrix}2&1\\ 1&1\end{pmatrix}$	$\begin{pmatrix}2&1\\ 1&1\end{pmatrix}$	$\begin{pmatrix}2&1\\ 1&1\end{pmatrix}$

In this section, simulation studies are conducted to support the proposed methodology. Using the arbitrarily chosen parameter values in Table 1, two sets of bivariate random samples are generated at sample sizes $n=$ 25, 50, 100 and 200.

Two cases of error covariance matrices are considered for generating the random errors:

Case (i): the presence of measurement errors with a common error variance, i.e.,

$\displaystyle\Sigma_{\epsilon}=\Sigma_{\eta}=\Sigma_{\gamma}=(\sigma^{2}).$

Case (ii): the presence of measurement errors with different error variances, i.e.,

$\displaystyle\Sigma_{\epsilon}=\Sigma_{\eta}=\Sigma_{\gamma}=(\sigma_{i}^{2});\;i=1,2,\ldots,n.$
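One way to generate such contaminated samples is sketched below in Python. The helper names, the seed, and the hand-written 2x2 Cholesky factor are assumptions of this example; the healthy-population parameters are those of Set A in Table 1 and the error covariance is the Case (i) matrix reported in Table 2.

```python
import random
from math import sqrt

def chol2(S):
    """Lower-triangular Cholesky factor of a 2x2 covariance matrix."""
    l11 = sqrt(S[0][0])
    l21 = S[1][0] / l11
    l22 = sqrt(S[1][1] - l21 * l21)
    return l11, l21, l22

def draw(rng, mu, S, n):
    """n draws from a bivariate normal with mean mu and covariance S."""
    l11, l21, l22 = chol2(S)
    out = []
    for _ in range(n):
        z1, z2 = rng.gauss(0, 1), rng.gauss(0, 1)
        out.append((mu[0] + l11 * z1, mu[1] + l21 * z1 + l22 * z2))
    return out

def contaminate(rng, sample, S_err):
    """Add independent zero-mean normal errors with covariance S_err."""
    errors = draw(rng, (0.0, 0.0), S_err, len(sample))
    return [(x[0] + e[0], x[1] + e[1]) for x, e in zip(sample, errors)]

def sample_cov(sample):
    """Unbiased 2x2 sample covariance matrix."""
    n = len(sample)
    m0 = sum(p[0] for p in sample) / n
    m1 = sum(p[1] for p in sample) / n
    c00 = sum((p[0] - m0) ** 2 for p in sample) / (n - 1)
    c11 = sum((p[1] - m1) ** 2 for p in sample) / (n - 1)
    c01 = sum((p[0] - m0) * (p[1] - m1) for p in sample) / (n - 1)
    return [[c00, c01], [c01, c11]]

rng = random.Random(2024)                          # arbitrary seed
X = draw(rng, (15.2, 16.8), [[2, 1], [1, 2]], 200)  # healthy sample, Set A
A = contaminate(rng, X, [[1.7, 0], [0, 1.7]])       # Case (i) errors
print(sample_cov(A))
```

Because the errors are independent of the true values, the covariance of the contaminated sample should approach $\Sigma_{H}+\Sigma_{\epsilon}$ as the sample size grows, which is what drives the downward bias in the uncorrected AUC.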

Table 2

Bias and MSE of the uncorrected and bias-corrected estimators of the AUC

Case (i)	$\Sigma_{\epsilon}=\Sigma_{\eta}=\Sigma_{\gamma}=\begin{pmatrix}1.7&0\\ &1.7\end{pmatrix}$
Sets	$\hat{\Theta}$	$n$	$\hat{\Theta}_{\textit{ME}}$	Bias	MSE	$\hat{\Theta}_{\textit{corr}}$	Bias	MSE
A	0.96419	25	0.92781	$-$ 0.03637	0.00188	0.96729	0.00311	0.00045
		50	0.92885	$-$ 0.03533	0.00173	0.96808	0.00390	0.00038
		100	0.93002	$-$ 0.03417	0.00138	0.97002	0.00584	0.00020
		200	0.92914	$-$ 0.03505	0.00134	0.96964	0.00545	0.00012
B	0.96776	25	0.93095	$-$ 0.03681	0.00190	0.96873	0.00097	0.00040
		50	0.93059	$-$ 0.03717	0.00184	0.96966	0.00190	0.00032
		100	0.92974	$-$ 0.03802	0.00168	0.96970	0.00194	0.00018
		200	0.92889	$-$ 0.03887	0.00163	0.96846	0.00070	0.00010
Case (ii)	$\Sigma_{\epsilon}=\Sigma_{\eta}=\Sigma_{\gamma}=\begin{pmatrix}1.5&0\\ &1.9\end{pmatrix}$
A	0.96419	25	0.93420	$-$ 0.02998	0.00140	0.96877	0.00459	0.00043
		50	0.93300	$-$ 0.03119	0.00140	0.96841	0.00422	0.00035
		100	0.93201	$-$ 0.03217	0.00127	0.96785	0.00367	0.00021
		200	0.93181	$-$ 0.03238	0.00117	0.96838	0.00419	0.00012
B	0.96776	25	0.93246	$-$ 0.03530	0.00172	0.96856	0.00068	0.00035
		50	0.93261	$-$ 0.03515	0.00168	0.96840	0.00064	0.00032
		100	0.93258	$-$ 0.03518	0.00146	0.96819	0.00043	0.00018
		200	0.93218	$-$ 0.03558	0.00138	0.96824	0.00048	0.00010

Figure 2.

The true ROC curve and contaminated ROC (with ME) curves at various sample sizes.

In each case, the AUCs, biases and MSEs are computed and presented in Table 2.

According to Table 2, the AUC estimates computed from the contaminated data ( $\hat{\Theta}_{\textit{ME}}$ ) are lower than the true AUC ( $\hat{\Theta}$ ) in both cases and for both parameter sets; this is due to the inclusion of error observations in the data. Using Eq. (4), the $\hat{\Theta}_{\textit{ME}}$ is corrected and the $\hat{\Theta}_{\textit{corr}}$ values are also reported in Table 2. These $\hat{\Theta}_{\textit{corr}}$ values are close to the true AUC ( $\hat{\Theta}$ ) and have smaller bias and MSE than the uncorrected estimates. Hence, the proposed bias-corrected approximation is shown to be useful for estimating the AUC when the data are affected by measurement errors.
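For reference, bias and MSE of the kind reported in Table 2 can be computed from replicated estimates as in the Python sketch below. The text does not state the number of replicates or the exact formulas used, so the conventional definitions are assumed here, and the replicate values in the example are hypothetical.

```python
from statistics import fmean

def bias_and_mse(estimates, truth):
    """Bias = mean(estimate) - truth; MSE = mean((estimate - truth)^2)."""
    bias = fmean(estimates) - truth
    mse = fmean([(e - truth) ** 2 for e in estimates])
    return bias, mse

# Hypothetical replicate estimates of an AUC whose true value is 0.95
replicates = [0.90, 0.94]
bias, mse = bias_and_mse(replicates, 0.95)
print(bias, mse)
```

With these definitions the MSE equals the variance of the replicates around their own mean plus the squared bias, so an estimator can only reach a small MSE if both its bias and its sampling variability are small.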

The corresponding mMROC plots are shown in Fig. 2 to support the numerical results. The figure shows how the mMROC curves are affected by measurement errors.

3.2 Real data set

Table 3
Bias and MSE of estimated and corrected estimator of the AUC of the Vertebral Column dataset

	$\Theta$	$\hat{\Theta}_{\textit{ME}}$	Bias	MSE	$\hat{\Theta}_{\textit{corr}}$	Bias	MSE
Mixture ROC	0.93235	0.89757	$-$ 0.03479	0.00121	0.92187	$-$ 0.01049	0.00011

Figure 3.

Identified mixture probability density function plots for each variable in Vertebral Column dataset.

Figure 4.

True and contaminated ROC (with ME) curves for the Vertebral Column dataset.

The Vertebral Column dataset (Guilherme and Ajalmar, 2011) is considered. It consists of 310 samples and 6 characteristics (pelvic incidence (PI), pelvic tilt (PT), lumbar lordosis angle (LLA), sacral slope (SS), pelvic radius (PR) and grade of spondylolisthesis (GS)) that describe the shape and orientation of the pelvis and lumbar spine. For demonstration purposes, the available class labels of this dataset are ignored. Upon implementing the EM algorithm, three components are identified and matched to the classes given in the actual data. The probability density function plot exhibits a trimodal pattern (see Fig. 3).

Further, to mimic the presence of MEs the samples are generated using the error covariance matrices, as shown below.

$\displaystyle\Sigma_{\epsilon}=\Sigma_{\eta}=\Sigma_{\gamma}=\begin{pmatrix}11.6&0.0&0.0&0.0&0.0&0.0\\ &10.4&0.0&0.0&0.0&0.0\\ &&18.9&0.0&0.0&0.0\\ &&&17.8&0.0&0.0\\ &&&&12.5&0.0\\ &&&&&18.1\end{pmatrix}.$

The AUC ( $\Theta$ ) value is calculated both before and after the error samples are added. The corrected AUC ( $\hat{\Theta}_{\textit{corr}}$ ) is computed using the bias-corrected approximation in Eq. (4), and the results are shown in Table 3.

From the results, the true AUC is $\Theta=$ 0.93235, whereas in the presence of MEs we get $\hat{\Theta}_{\textit{ME}}=$ 0.89757, which is biased downward. Applying the proposed bias-corrected estimator gives $\hat{\Theta}_{\textit{corr}}=$ 0.92187, which is closer to the AUC of the original data and has smaller bias ( $-$ 0.01049) and mean square error ( $\textit{MSE}=$ 0.00011) than the estimate from the contaminated data ( $-$ 0.03479 and 0.00121, respectively). This indicates that when the variables or markers are affected by measurement error, the AUC estimate is contaminated, and the bias reflects the information masked by the measurement error. The bias-corrected estimator recovers most of this loss and therefore gives an estimate closer to the true AUC, even when the data are influenced by measurement error. In Fig. 4, the ROC curves are drawn for the true and contaminated data, and it is clearly seen that the performance of the classifier is affected by the presence of measurement errors.

4. Summary

In this paper, we address the problem of constructing an ROC model when there are multimodal patterns within the known class labels. In a medical scenario, most of the datasets exhibit such multimodal patterns. In such situations, before modelling the data, it is suggested to examine the hidden probability density function patterns in each class, particularly the diseased population. An illustration of this kind is given, and to model such patterns, a mixture of multivariate ROC models is proposed.

The impact of measurement errors on estimating the AUC of the mixture MROC curve is then discussed. When the data contain measurement errors, it is shown that the AUC is biased downward. To address this, a bias-corrected estimator is derived, and its usefulness is demonstrated through simulated and real datasets. The corrected AUC values are observed to have smaller bias and MSE than the uncorrected ones.

References

Balaswamy & Vishnu Vardhan (2016). An Anthology of Parametric ROC Models. Research & Reviews: Journal of Statistics, 5(2), 32-6.

Coffin & Sukhatme (1996). A parametric approach to measurement errors in receiver operating characteristic studies. In Lifetime data: Models in reliability and survival analysis (pp. 71-75). Boston, MA: Springer.

Faraggi (2000). The effect of random measurement error on receiver operating characteristic (ROC) curves. Statistics in Medicine, 19(1), 61-70.

de Alencar Barreto & da Rocha Neto, A.R. (2011). UCI Machine Learning Repository. Retrieved from: https://archive.ics.uci.edu/ml/datasets/Vertebral+Column.

Kim & Gleser, L.J. (2000). SIMEX approaches to measurement error in ROC studies. Communications in Statistics-Theory and Methods, 29(11), 2473-2491.

Perkins, N.J., Schisterman, E.F. & Vexler (2009). Generalized ROC curve inference for a biomarker subject to a limit of detection and measurement error. Statistics in Medicine, 28(13), 1841-1860.

Reiser (2000). Measuring the effectiveness of diagnostic markers in the presence of measurement error through the use of ROC curves. Statistics in Medicine, 19(16), 2115-2129.

Sameera, Vardhan, R.V. & Sarma, K.V.S. (2016). Binary classification using multivariate receiver operating characteristic curve for continuous data. Journal of Biopharmaceutical Statistics, 26(3), 421-431.

Schisterman, E.F., Faraggi, Reiser & Trevisan (2001). Statistical inference for the area under the receiver operating characteristic curve in the presence of random measurement error. American Journal of Epidemiology, 154(2), 174-179.

Siva & Vishnu Vardhan (2022). Multi-Class Classification using Mixtures of Univariate and Multivariate ROC Curves. Journal of Biostatistics and Epidemiology, 8(2), 208-233.

Siva, Vishnu Vardhan & Asha (2022). Estimating the AUC of the MROC curve in the presence of measurement errors. Communications for Statistical Applications and Methods, 29(5), 533-545.

Su, J.Q. & Liu, J.S. (1993). Linear combinations of multiple diagnostic markers. Journal of the American Statistical Association, 88, 1350-1355.

Tosteson, T.D., Buonaccorsi, J.P., Demidenko & Wells, W.A. (2005). Measurement error and confidence intervals for ROC curves. Biometrical Journal: Journal of Mathematical Methods in Biosciences, 47(4), 409-416.

Yin & Tian (2014). Optimal linear combinations of multiple diagnostic biomarkers based on Youden index. Statistics in Medicine, 33, 1426-1440.