Parametric modeling of receiver operating characteristics curves

Abstract

Receiver operating characteristics (ROC) curves play a pivotal role in the analyses of data collected in applications involving machine vision, machine learning and clinical diagnostics. The importance of ROC curves lies in the fact that all decision-making strategies rely on the interpretations of the curves and features extracted from them. Such analyses become simple and straightforward if it is possible to have a statistical fit for the empirical ROC curve. A methodology is developed and demonstrated to obtain a parametric fit for the ROC curves using multiple tools in statistics such as chi square testing, bootstrapping (parametric and non-parametric) and $t$-testing. Relying on three data sets and an ensemble of density functions used in modeling sensor and econometric data, statistical modeling of the ROC curves (best fit) is accomplished. While the reported research relied on simulated data sets, the approaches implemented and demonstrated in this work can easily be adapted to data collected in clinical as well as non-clinical settings.

Keywords

Receiver operating characteristics curves, bigamma fits, ROC fits, chi square tests, t-tests, parametric and non-parametric bootstrapping, machine vision

1. Introduction

Receiver operating characteristics (ROC) analysis is extensively used in machine vision and machine learning on issues of detecting the presence or absence of a target in the field of view of a sensor (Bradley, 1997; Chang, 2010; Provost & Fawcett, 2001). Its wide use in clinical diagnostics is also well established (Obuchowski & Bullen, 2018; Swets, 1988; Vickers, 2008). Regardless of whether the interest is in machine vision or clinical diagnostics, the availability of appropriate statistical fits for the ROC curves is essential. These fits allow the user to identify the optimal operating points and draw appropriate conclusions regarding the sensor (machine vision) or a clinical test (medical diagnostics). They also provide a means to test signal processing algorithms (combining data from multiple sensors in machine vision or combining data from multiple clinical observations) for performance enhancements before actual implementation (Brown & Davis, 2006; Hanley, 1996; Metz, 1989; Walsh, 1999). Goodness-of-fit (GOF) studies of ROC curves in terms of binormal and bigamma fits have been around for a while, and interest has been directed at exploring other fits (Dorfman et al., 1997; Hoyer & Kuss, 2020; Kester & Buntix, 2000; Pundir & Amala, 2014; Rota & Antolini, 2014). Recent efforts based on fully parametrized bi-distributional models for ROC have examined several statistical models using a meta-analysis with pairs of sensitivity and specificity (Hoyer & Kuss, 2020; Kester & Buntix, 2000; Pundir & Amala, 2014). None of these reported models and accompanying analyses examined the impact of such models on the critical threshold values (Neyman-Pearson, Youden’s index and Bayes’ risk) for the interpretation of ROC curves (Chang, 1989; Lipovetsky & Conklin, 2020; Provost & Fawcett, 2001; Zweig & Campbell, 1993), or on the estimation of sensitivity, specificity, positive predictive values, etc. In other words, the efficacy of these statistical models in producing results that match those provided by the data (empirical results) has not been studied.

In this work, a methodology is proposed and tested to establish a statistical fit for the data under the hypothesis $H_{0}$ (disease absent in a clinical scenario or target absent in the case of machine vision) and another statistical fit for the data under the hypothesis $H_{1}$ (disease present in a clinical scenario or target present in the case of machine vision). These two fits are used to obtain a continuous fit (goodness-of-fit) for the ROC curve. Multiple hypothesis testing relying on chi square testing is used for both hypotheses individually to ascertain the best statistical fit in each case (Barton, 1956; Cochran, 1954; Watson, 1958). The densities studied include gamma, Nakagami, Rician, Rayleigh, lognormal, Weibull, inverse Gaussian, generalized gamma (Stacy’s distribution), Burr, and generalized Pareto. These densities were chosen considering various statistical models used in modeling sensor (radio frequency, wireless, sonar) and medical imaging (X-ray and ultrasound) data (Destrempes & Cloutier, 2010; Simon & Alouini, 2001). The Burr and Pareto distributions find extensive application in econometrics (Hogg & Klugman, 1983). The use of different densities for the data sets from the two hypotheses makes the effort in this research different from the existing body of research where the same density is used for modeling data from both hypotheses (Hoyer & Kuss, 2020; Pundir & Amala, 2014).

A two-step methodology is followed here to carry out the analysis and interpretation of the ROC curves. In the first step, the area under the ROC curve (AUC) is examined in terms of its mean and standard deviation (Siva et al., 2023). In other words, the AUC obtained from the data (empirical) is compared to the AUC calculated using the continuous fit for the ROC curve. In this regard, three different fits are chosen: a bigamma fit used by many researchers, a bi-generalized gamma fit, and a biBest fit using the two fits obtained from hypothesis testing (Demidenko, 2012; Dorfman et al., 1997; Hanley, 1988; Rota & Antolini, 2014). A bi-generalized gamma fit is a general form of a bigamma fit because the three-parameter Stacy’s distribution or generalized gamma density becomes a gamma density when one of its parameters is equal to 1. The AUC comparison relies on both parametric and non-parametric bootstrapping (Davison et al., 2003). Because the sensor data values are always positive, binormal fits were not examined. In the second step, the ROC curves (empirical and continuous fits) are studied in terms of three thresholds used by researchers (Schisterman et al., 2005): the distance to the top left corner of the ROC curve (Neyman-Pearson), the difference between the probability of detection or sensitivity and the probability of false alarm (Youden’s index J), and the accuracy (Bayes’ risk). A parametric hypothesis test (two-sided $t$-test) is conducted with the null hypothesis supporting the notion that, in terms of the three thresholds, there is no statistically significant difference between the empirical ROC and each of these different fits along the ROC curve (Rohatgi & Saleh, 2000).

The manuscript starts with the attributes of the data sets used in this study. The section on statistical methods provides details of the densities used. The section on methodology offers insights into the approaches for the analysis of the ROC fits using hypothesis (chi square) tests, bootstrapping, and $t$-tests. This is followed by results and discussion.

2. Data sets

Table 1
Three data sets. In each set, the first six rows hold the target absent cohort (70 values) and the last five rows hold the target present cohort (60 values)

a. #1
3.05	0.36	2.15	0.38	0.54	3.65	0.85	0.1	0.95	0.74	1.36	5.2
3.62	2.75	0.31	0.25	1.54	2.99	0.61	0.38	0.9	0.74	1.28	1.61
0.01	3.41	0.58	0.57	0.23	0.81	5.74	1.77	2.59	0.03	3.04	2.66
1.47	0.44	0.68	8.06	1.73	0.23	0.77	0.59	1.99	1.45	4.27	1.14
3.5	1.05	1.1	1.77	0.66	2.71	1.68	1.58	4.79	1.08	2.27
1.7	2.35	1.63	0.81	0.36	1.16	3.83	1.72	0.13	0.66	1.08
2.02	12.92	4.55	12.85	7.36	2.92	18.7	4.98	2.84	3.73	9.17	5.44
3.45	12.32	7.12	3.86	7.83	7.21	5.41	5.93	3.48	4.71	9.91	3.25
1.26	3.61	9.57	4.93	4.83	3.63	1.86	0.72	1.79	5.74	3.64	3.07
0.56	0.98	9.4	6.19	5.34	5	2.65	9.7	7.28	1.99	4.78	3.86
1.08	11.49	7.85	5.04	2.52	9.56	5.59	11.72	10.54	3.91	3.8	4.59
b. #2
0.51	1	0.71	0.6	0.76	0.61	0.2	1.49	0.05	0.21	0.95	0.91
0.08	1.8	2.12	0.21	0.87	0.39	1.02	1.18	0.84	0.33	0.36	0.23
0.15	0.64	0.27	0.31	1.72	1.5	0.11	1.58	2.59	1.46	0.53	0.4
3.19	0.17	0.26	0.43	0.31	1.05	0.8	0.67	1.47	1.3	0.77	1.57
1.4	0.62	0.62	2.1	1.73	0.28	0.71	0.92	0.56	1.5	0.31
0.23	0.25	0.76	2.56	0.14	0.08	2.47	1.12	0.75	0.23	0.98
0.85	4.2	1.79	1.1	3.76	2.18	1.97	3.91	3.3	1.4	1.19	1.41
2.06	2.33	0.71	1.67	2.3	0.62	5.21	10.59	0.1	3.19	5.55	0.96
1.29	6.21	1.62	2.26	1.35	3.1	3.01	0.7	0.19	2.25	3.94	7.52
2.26	6.07	1.57	3.2	3.7	0.49	3.55	2.57	0.33	3.57	4.09	4.1
1.98	3.18	1.53	0.5	0.91	0.53	3.62	2.01	1.65	1.44	1.82	2.49
c. #3
0.52	2.88	2.92	2.02	0.54	3.47	0.78	2.31	6.39	6.49	1.18	1.64
0.09	1.4	0.18	0.35	1.06	1.12	1.8	1.28	3.55	2.48	0.28	3.2
3.31	2.16	0.64	0.36	0.11	1.58	1.07	0.59	1.45	1.22	2.44	2.16
7.75	3.23	0.07	0.92	0.94	2.26	1.25	0.41	2.66	0.91	2.71	1.36
3.14	1.05	0.87	0.63	0.67	0.24	0.67	5.25	3.23	0.13	3.16
5.55	2.87	1.29	3.03	0.66	3.88	3.57	0.8	0.81	0.75	5.77
4.54	8.73	6.88	5.36	8.64	0.98	10.33	7.21	0.64	4.31	13.3	5.9
4.16	6.13	2.9	7.24	1.57	4.92	3.43	8.91	8.73	7.49	3.76	8.4
12.9	7.01	4.04	5.45	4.56	10.99	7.83	1.4	1.76	2.86	8.62	2.54
5.64	0.64	1.69	2.05	5.24	5.79	8.24	4.27	1.82	2.9	10.44	5.25
12.35	5.35	2.05	11.02	4.03	6.4	5.47	3.57	9.89	8.43	2.55	3.01

The author has been active in research on statistical modeling of wireless channels and backscattered ultrasonic signals (medical ultrasound). During the past few years, the author incorporated data analytics in the undergraduate probability course for engineers where every student was provided with a data set containing samples of sensor outputs (target absent and target present). These randomly generated data sets were rigorously tested to ensure they had attributes of data that might fit both sensor outputs in machine vision and statistics of medical imaging data. Three data sets (#1 – #3) used in this study are displayed in Table 1. Each set has 70 values representing the Target Absent (hypothesis H₀) cohort and 60 values representing the Target Present cohort. The Target Present (hypothesis H₁) cohort has a mean value that is higher than the mean value of the Target Absent cohort.

Each data set implies existence of 131 samples of threshold values (130 samples of data along with the sample value of 0 allowing for the lowest possible measured value) for analysis that take the ROC curve from the bottom left corner [0,0] to the top right corner [1,1] of the ROC plot. Note that the ROC curve is drawn with the probability of false alarm (1-specificity) along the X-axis and the sensitivity or probability of detection along the Y-axis. For each threshold, it is possible to obtain a confusion matrix in terms of false and missed counts or a transition matrix in terms of the probabilities of false alarm and miss (Ltifi et al., 2020; Provost & Fawcett, 2001). If the probability density functions associated with the two hypotheses are represented as $f_{X}(x|H_{k})$ , $k=$ 0,1, the probability of false alarm ( $P_{F}$ ) and the probability of detection ( $P_{D}$ ) become

$\displaystyle P_{F}=\int\limits_{T}^{\infty}{f_{X}\left({x|H_{0}}\right)}dx=\mbox{Prob}\left({x|H_{0}>T}\right)=S_{X}\left({T|H_{0}}\right),$ (1)

$\displaystyle P_{D}=\int\limits_{T}^{\infty}{f_{X}\left({x|H_{1}}\right)}dx=\mbox{Prob}\left({x|H_{1}>T}\right)=S_{X}\left({T|H_{1}}\right).$ (2)

In Eqs (1) and (2), ${\bm{T}}$ is the threshold and $S$ (.) is the survival function (Rohatgi & Saleh, 2000). The probability of miss ( $P_{M}$ ) is

$\displaystyle P_{M}=1-P_{D}=\int\limits_{0}^{T}{f_{X}\left({x|H_{1}}\right)}dx=\mbox{Prob}\left({x|H_{1}\leqslant T}\right)=F_{X}\left({T|H_{1}}\right).$ (3)

It should be noted that

$\displaystyle F_{X}\left({T|H_{1}}\right)=1-S_{X}\left({T|H_{1}}\right).$ (4)

In Eq. (3), $F$(.) is the cumulative distribution function (CDF). Notice the equality sign within the parentheses in Eq. (3), reflecting the definition of the CDF of a random variable (Rohatgi & Saleh, 2000). The ROC curve is the plot of $P_{D}$ versus $P_{F}$ as the threshold varies from 0 to the largest value of the pooled ($H_{0}$ and $H_{1}$) data.

If $N_{0}$ and $N_{1}$ are the sample counts in the respective cohorts, a priori probabilities of the two hypotheses become,

$\displaystyle P\left({H_{0}}\right)=\frac{N_{0}}{N_{0}+N_{1}},\quad P\left({H_{1}}\right)=\frac{N_{1}}{N_{0}+N_{1}}.$ (5)

If $N_{F}$ represents the false counts (number of samples from the target absent cohort that exceed the threshold ${\bm{T}}$) and $N_{C}$ represents the correct counts (number of samples from the target present cohort that exceed the threshold ${\bm{T}}$), the probability of false alarm $P_{F}$ and the probability of detection $P_{D}$ become,

$\displaystyle P_{F}=\frac{N_{F}}{N_{0}},\quad P_{D}=\frac{N_{C}}{N_{1}}.$ (6)
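The counting in Eq. (6) can be illustrated with a short Python/NumPy sketch (Python is used here for illustration only; it is not the Matlab environment used in this work). The function names `empirical_roc` and `auc_trapezoid` and the small toy cohorts are assumptions introduced for the example and are not the data of Table 1. Thresholds are taken as 0 together with every pooled sample value, as described above, and the trapezoidal rule for the area is one common choice rather than a documented step of this study.

```python
import numpy as np

def empirical_roc(absent, present):
    """Return thresholds T and the P_F, P_D of Eq. (6) at each threshold."""
    absent = np.asarray(absent, dtype=float)
    present = np.asarray(present, dtype=float)
    # Thresholds: 0 plus every pooled sample value (N0 + N1 + 1 values).
    thresholds = np.sort(np.concatenate(([0.0], absent, present)))
    # N_F and N_C: number of samples strictly exceeding each threshold.
    n_f = (absent[None, :] > thresholds[:, None]).sum(axis=1)
    n_c = (present[None, :] > thresholds[:, None]).sum(axis=1)
    return thresholds, n_f / absent.size, n_c / present.size

def auc_trapezoid(p_f, p_d):
    """Area under the empirical ROC curve by the trapezoidal rule."""
    order = np.lexsort((p_d, p_f))  # sort by P_F, ties broken by P_D
    x, y = p_f[order], p_d[order]
    return float(np.sum(np.diff(x) * (y[1:] + y[:-1]) / 2.0))

# Hypothetical toy cohorts (illustration only, not from Table 1).
h0 = [0.2, 0.5, 0.9, 1.4]   # target absent
h1 = [0.8, 1.6, 2.3]        # target present
T, pf, pd_ = empirical_roc(h0, h1)
auc = auc_trapezoid(pf, pd_)
```

With the threshold at 0 every sample exceeds it, giving the point [1,1]; at the largest pooled value no sample exceeds it, giving [0,0].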

For every threshold, $N_{F}$ and $N_{C}$ are available and the confusion matrix ($C_{X}$) becomes (Ltifi et al., 2020; Provost & Fawcett, 2001),

$\displaystyle C_{X}=\left[{{\begin{array}{ll}{N_{0}-N_{F}}&{N_{F}}\\ {N_{1}-N_{C}}&{N_{C}}\\ \end{array}}}\right].$ (7)

While the elements of rows sum up to the target absent (first row) and target present (second row) counts, the elements of the columns sum up to target not detected (first column) and target detected (second column) respectively. The transition matrix ( $T_{X}$ ) becomes (Ltifi et al., 2020; Provost & Fawcett, 2001),

$\displaystyle T_{X}=\left[{{\begin{array}{cc}{1-P_{F}}&{P_{M}}\\ {P_{F}}&{1-P_{M}}\\ \end{array}}}\right]=\left[{{\begin{array}{cc}{\frac{N_{0}-N_{F}}{N_{0}}}&{\frac{N_{1}-N_{C}}{N_{1}}}\\ {\frac{N_{F}}{N_{0}}}&{\frac{N_{C}}{N_{1}}}\\ \end{array}}}\right].$ (8)

The transition matrix has the property that elements of each column add to 1. Furthermore, the probabilities of detecting the target, $P_{\textit{Det}}$ and not detecting the target, $P_{\textit{notDet}}$ can be expressed as

$\displaystyle\left[{{\begin{array}[]{c}{P_{\textit{notDet}}}\\ {P_{\textit{Det}}}\\ \end{array}}}\right]=T_{X}\left[{{\begin{array}[]{c}{P\left({H_{0}}\right)}\\ {P\left({H_{1}}\right)}\\ \end{array}}}\right].$ (9)

The transition or confusion matrix completely defines the properties of the sensor or a decision maker that uses the data. These matrices rely on the threshold, demonstrating the importance of the choice of an appropriate threshold.
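As a minimal sketch of Eqs (5) and (7) to (9), the following Python/NumPy fragment builds both matrices for a single threshold. The function name `confusion_and_transition`, the toy cohorts and the threshold value 0.85 are hypothetical and serve only to make the arithmetic concrete.

```python
import numpy as np

def confusion_and_transition(absent, present, threshold):
    """Build C_X (Eq. 7), T_X (Eq. 8) and the detection probabilities (Eq. 9)."""
    absent = np.asarray(absent, dtype=float)
    present = np.asarray(present, dtype=float)
    n0, n1 = absent.size, present.size
    n_f = int((absent > threshold).sum())    # false counts N_F
    n_c = int((present > threshold).sum())   # correct counts N_C
    c_x = np.array([[n0 - n_f, n_f],
                    [n1 - n_c, n_c]])
    t_x = np.array([[(n0 - n_f) / n0, (n1 - n_c) / n1],
                    [n_f / n0, n_c / n1]])
    priors = np.array([n0, n1]) / (n0 + n1)  # Eq. (5)
    p_not_det, p_det = t_x @ priors          # Eq. (9)
    return c_x, t_x, float(p_not_det), float(p_det)

# Hypothetical toy cohorts and threshold (illustration only).
h0 = [0.2, 0.5, 0.9, 1.4]
h1 = [0.8, 1.6, 2.3]
c_x, t_x, p_not_det, p_det = confusion_and_transition(h0, h1, threshold=0.85)
```

In this toy case two target absent and two target present samples exceed the threshold, so four of the seven samples are declared detections and `p_det` equals 4/7, the second column sum of the confusion matrix divided by the total count.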

Table 1b and Table 1c provide two additional data sets (#2 and #3) that were used in this study. They have attributes like those of set #1.

3. Statistical models

While bigamma fits (Dorfman et al., 1997) have been used to model the ROC curves specifically when the data contains only positive values, in this work other models, one for the target absent cohort and another one for the target present cohort, are explored. A specific set of models where both hypotheses (target absent and target present) are modeled in terms of identical densities was reported earlier (Pundir & Amala, 2014). The probability densities considered in this work are listed below (in all cases, $x\geqslant 0$ and $H_{k}$, $k=0,1$ correspond to the two hypotheses):

Gamma

$\displaystyle f_{X}\left({x|H_{k}}\right)=\frac{x^{a_{k}-1}}{b_{k}^{a_{k}}\Gamma\left({a_{k}}\right)}\exp\left({-\frac{x}{b_{k}}}\right),\;\;a_{k},\;b_{k}>0;$ (10)

Generalized gamma (Stacy’s distribution expressed in a slightly different format)

$\displaystyle f_{X}\left({x|H_{k}}\right)=\frac{c_{k}x^{c_{k}a_{k}-1}}{b_{k}^{a_{k}}\Gamma\left({a_{k}}\right)}\exp\left({-\frac{x^{c_{k}}}{b_{k}}}\right),\;\;a_{k},\;b_{k},\;c_{k}>0;$ (11)

Weibull

$\displaystyle f_{X}\left({x|H_{k}}\right)=\left({\frac{b_{k}}{a_{k}}}\right)\left({\frac{x}{a_{k}}}\right)^{b_{k}-1}\exp\left({-\left[{\frac{x}{a_{k}}}\right]^{b_{k}}}\right),\;\;a_{k},\;b_{k}>0;$ (12)

Nakagami

$\displaystyle f_{X}\left({x|H_{k}}\right)=2\left({\frac{m_{k}}{\Omega_{k}}}\right)^{m_{k}}\frac{1}{\Gamma\left({m_{k}}\right)}x^{2m_{k}-1}\exp\left({-\frac{m_{k}}{\Omega_{k}}x^{2}}\right),\;\;\Omega_{k}>0,\;m_{k}\geqslant\frac{1}{2};$ (13)

Inverse Gaussian

$\displaystyle f_{X}\left({x|H_{k}}\right)=\sqrt{\frac{\lambda_{k}}{2\pi x^{3}}}\exp\left({-\frac{\lambda_{k}}{2\mu_{k}^{2}x}\left({x-\mu_{k}}\right)^{2}}\right),\;\lambda_{k},\;\mu_{k}>0;$ (14)

Lognormal

$\displaystyle f_{X}\left({x|H_{k}}\right)=\frac{1}{\sqrt{2\pi\sigma_{k}^{2}x^{2}}}\exp\left({-\frac{1}{2\sigma_{k}^{2}}\left[{\log_{e}x-\mu_{k}}\right]^{2}}\right),\;\sigma_{k}>0;$ (15)

Rayleigh

$\displaystyle f_{X}\left({x|H_{k}}\right)=\frac{x}{b_{k}^{2}}\exp\left({-\frac{x^{2}}{2b_{k}^{2}}}\right),\;b_{k}>0;$ (16)

Rician

$\displaystyle f_{X}\left({x|H_{k}}\right)=\frac{x}{\sigma_{k}^{2}}\exp\left({-\frac{x^{2}+s_{k}^{2}}{2\sigma_{k}^{2}}}\right)I_{0}\left({\frac{xs_{k}}{\sigma_{k}^{2}}}\right),\;\sigma_{k}>0,\;s_{k}\geqslant 0;$ (17)

Burr

$\displaystyle f_{X}\left({x|H_{k}}\right)=\left[{1+\left({\frac{x}{\alpha}}\right)^{c}}\right]^{-\left({k+1}\right)}\frac{kc}{\alpha}\left({\frac{x}{\alpha}}\right)^{c-1},\;\alpha>0,\;c>0,\;k>0;$ (18)

Generalized Pareto

$\displaystyle f_{X}\left({x|H_{k}}\right)=\frac{1}{\sigma}\left[{1+k\frac{\left({x-\theta}\right)}{\sigma}}\right]^{-\left({1+\frac{1}{k}}\right)},\;x>\theta,\;k>0,\;\sigma>0.$ (19)

In Eq. (19), $k$ and $\theta$ are real.

In the expressions for densities, $\Gamma$(.) is the gamma function and $I_{0}$(.) is the zeroth-order modified Bessel function of the first kind (Simon & Alouini, 2001). The generalized gamma density becomes the gamma density when the parameter $c_{k}=1$. The Rician density is included in the models so that the existence of a line-of-sight (LOS) component in wireless or radar communications can be considered. The Rician density becomes Rayleigh when the LOS component, $s_{k}$, becomes zero. Similarly, the Nakagami density becomes Rayleigh when the Nakagami parameter $m_{k}$ becomes 1 (Simon & Alouini, 2001). Burr and generalized Pareto distributions are used in econometrics and, under special conditions, are related to each other and to some of the other densities (Hogg & Klugman, 1983). Normal densities (and hence binormal fits) are not included in this study considering the positive values of the data collected from the sensors.
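Two of the special cases mentioned above can be checked numerically. The sketch below codes the gamma density of Eq. (10), the generalized gamma density of Eq. (11), the Rayleigh density of Eq. (16) and the standard Nakagami density (with the factor $(m/\Omega)^{m}$) in plain Python. The function names, the parameter values and the evaluation points are illustrative assumptions; the Nakagami-to-Rayleigh check additionally assumes $\Omega=2b^{2}$, which is the relation that makes the two expressions coincide when $m=1$.

```python
import math

def gamma_pdf(x, a, b):
    """Gamma density, Eq. (10)."""
    return x ** (a - 1) * math.exp(-x / b) / (b ** a * math.gamma(a))

def gen_gamma_pdf(x, a, b, c):
    """Generalized gamma density, Eq. (11)."""
    return (c * x ** (c * a - 1) * math.exp(-(x ** c) / b)
            / (b ** a * math.gamma(a)))

def nakagami_pdf(x, m, omega):
    """Standard Nakagami density with the factor (m/omega)**m."""
    return (2.0 * (m / omega) ** m / math.gamma(m)
            * x ** (2 * m - 1) * math.exp(-m * x * x / omega))

def rayleigh_pdf(x, b):
    """Rayleigh density, Eq. (16)."""
    return x / b ** 2 * math.exp(-x * x / (2.0 * b ** 2))

# Hypothetical evaluation points and parameters (illustration only).
xs = [0.1, 0.5, 1.0, 2.0, 5.0]
# c = 1 reduces the generalized gamma to the gamma density.
gap_gamma = max(abs(gen_gamma_pdf(x, 2.5, 1.3, 1.0) - gamma_pdf(x, 2.5, 1.3))
                for x in xs)
# m = 1 with omega = 2 b^2 reduces Nakagami to Rayleigh.
gap_rayleigh = max(abs(nakagami_pdf(x, 1.0, 2.0 * 0.8 ** 2) - rayleigh_pdf(x, 0.8))
                   for x in xs)
```

Both gaps are zero up to floating-point rounding, which is a quick sanity check on any hand-coded density before it is used in a fit.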

4. Methodology

The first step in analyzing the ROC curve through a continuous density model is to establish a statistical fit for the two cohorts of the data or the two hypotheses, $H_{0}$ and $H_{1}$ . A multi-hypothesis goodness-of-fit test is implemented using the chi square test. For each cohort, the best fit (the criterion yet to be defined) will be based on the results of chi square tests (Barton, 1956; Rohatgi & Saleh, 2000).

The chi square test is undertaken to estimate the test statistic and the $p$-value. The null hypothesis is not rejected if the $p$-value exceeds 0.05 (level of significance of 5%). This means that for each density listed in Section 3, a chi square test is conducted. The process is carried out for each cohort separately. The density with the highest value of $p$ (exceeding 0.05) is taken as the best fit. It is understood that any one of the densities with a $p$-value exceeding 0.05 might be acceptable, keeping in mind that the $p$-value itself is random (Rohatgi & Saleh, 2000). This composite hypothesis testing results in one density as the best fit for the hypothesis $H_{0}$ and another density for the hypothesis $H_{1}$, creating the pair identified as the biBest. It is possible that the two best fits may turn out to be the bigamma (gamma densities for both hypotheses). The pair may also be the bi-generalized gamma (generalized gamma fits for both hypotheses). With the availability of the best fits, a fit for the ROC curve is obtained, and a thorough analysis of the ROC curves can be carried out through bootstrapping, namely parametric bootstrapping to establish the appropriateness of the fits and non-parametric or standard bootstrapping for further comparison of the fits to the empirical set (Davison et al., 2003). Furthermore, $t$-tests are conducted to check whether the thresholds obtained from the data and those obtained from the fits match statistically (Rohatgi & Saleh, 2000). For this purpose, three different criteria are used: the Neyman-Pearson, Youden’s index, and Bayes’ criteria (Lipovetsky & Conklin, 2020; Provost & Fawcett, 2001; Schisterman et al., 2005; Tilbury et al., 2000; Xu, 2012). The Neyman-Pearson criterion relies on the distance to the top left corner of the ROC curve ($P_{F}=0$, $P_{D}=1$; the ideal operating point) and the optimum threshold is chosen based on the shortest distance. The distance to the top left corner of the ROC curve, ${\bm{D}}$, is

$\displaystyle D=\sqrt{P_{F}^{2}+\left({1-P_{D}}\right)^{2}}=\sqrt{P_{F}^{2}+P_{M}^{2}}.$ (20)

The Youden’s index criterion is based on maximizing the difference between $P_{D}$ and $P_{F}$, with the index ${\bm{Y}}$ defined as (Schisterman et al., 2005; Zweig & Campbell, 1993)

$\displaystyle Y=P_{D}-P_{F}.$ (21)

Bayes’ criterion relies on cost factors involved with decisions. In the absence of cost considerations, Bayes’ criterion leads to a threshold that maximizes the accuracy of the system ${\bm{B}}$ defined as

$\displaystyle B=\left({1-P_{F}}\right)P\left({H_{0}}\right)+P_{D}P\left({H_{1}}\right).$ (22)

The continuous fits can be compared to the empirical ROC by conducting a paired $t$-test (2-sided) for all three thresholds. As stated earlier, with 130 samples of data, there will be 131 threshold values, including the lowest possible value of the data, namely zero. Thus, the $t$-tests use 131 pairs of values.
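The three criteria of Eqs (20) to (22) can be evaluated along an ROC curve as in the following Python/NumPy sketch. The function name `threshold_criteria`, the operating points and the prior value are hypothetical; the selection rules (minimum of $D$, maximum of $Y$, maximum of $B$) follow the description above.

```python
import numpy as np

def threshold_criteria(p_f, p_d, prior_h0):
    """Evaluate D (Eq. 20), Y (Eq. 21) and B (Eq. 22) at each operating point."""
    p_f = np.asarray(p_f, dtype=float)
    p_d = np.asarray(p_d, dtype=float)
    d = np.sqrt(p_f ** 2 + (1.0 - p_d) ** 2)               # distance to [0,1]
    y = p_d - p_f                                          # Youden's index
    b = (1.0 - p_f) * prior_h0 + p_d * (1.0 - prior_h0)    # accuracy
    return d, y, b

# Hypothetical operating points, one per threshold (illustration only).
p_f = np.array([1.0, 0.75, 0.5, 0.25, 0.0, 0.0])
p_d = np.array([1.0, 1.0, 1.0, 2.0 / 3.0, 2.0 / 3.0, 0.0])
d, y, b = threshold_criteria(p_f, p_d, prior_h0=4.0 / 7.0)

# Index of the optimum threshold under each criterion.
best = {
    "distance": int(np.argmin(d)),
    "youden": int(np.argmax(y)),
    "accuracy": int(np.argmax(b)),
}
```

In this toy example all three criteria select the same operating point ($P_{F}=0$, $P_{D}=2/3$); with real data the three optima need not coincide, which is why each criterion is compared separately.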

5. Computational aspects

Matlab (www.mathworks.com) was used for the computational analysis relying primarily on the Statistics and Machine Learning Toolbox. All the densities other than generalized gamma are available in Matlab (version R2019a). Matlab also provides maximum likelihood estimates (MLE) of the parameters (of the densities), random numbers, probability density and cumulative distribution plots. The command chi2gof(.) performs the chi square tests providing various outputs (test statistic, degrees of freedom and $p$ -value of the test).
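To make the chi square computation concrete, the following Python/NumPy sketch tests a Rayleigh fit, chosen here only because its maximum likelihood estimate and CDF have closed forms. It is not a reproduction of chi2gof(.): the function name `rayleigh_chi_square`, the use of equal-probability bins, the number of bins and the simulated sample are all assumptions made for illustration.

```python
import numpy as np

def rayleigh_chi_square(x, n_bins=5):
    """Chi square statistic and degrees of freedom for a Rayleigh fit."""
    x = np.asarray(x, dtype=float)
    n = x.size
    # MLE of the Rayleigh parameter b in Eq. (16).
    b = np.sqrt(np.sum(x ** 2) / (2.0 * n))
    # Interior bin edges at equal-probability quantiles of the fitted CDF,
    # F(x) = 1 - exp(-x^2 / (2 b^2)).
    probs = np.arange(1, n_bins) / n_bins
    edges = b * np.sqrt(-2.0 * np.log(1.0 - probs))
    observed = np.bincount(np.searchsorted(edges, x), minlength=n_bins)
    expected = np.full(n_bins, n / n_bins)
    stat = float(np.sum((observed - expected) ** 2 / expected))
    # Degrees of freedom: bins - 1 - number of estimated parameters (one).
    dof = n_bins - 1 - 1
    return stat, dof, float(b)

# Hypothetical simulated cohort of 70 values (illustration only).
rng = np.random.default_rng(0)
sample = rng.rayleigh(scale=2.0, size=70)
stat, dof, b_hat = rayleigh_chi_square(sample)
```

The $p$-value is the survival function of the chi square distribution with `dof` degrees of freedom evaluated at `stat`; in Python this is available, for example, as `scipy.stats.chi2.sf(stat, dof)`.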

Most of the densities are 2-parameter ones except for the Rayleigh, Burr, generalized gamma, and generalized Pareto. Rayleigh is a single parameter density while the generalized gamma, Burr, and generalized Pareto are 3-parameter densities. Of these, the generalized gamma is not available in Matlab. The generalized gamma variable (GG) is obtained from a gamma variable ( $G$ ) as

$\displaystyle\textit{GG}=G^{\frac{1}{c_{k}}}.$ (23)

The fitdist(.) command in Matlab provides the maximum likelihood estimates of the parameters of all the densities except the generalized gamma density. Matlab, however, allows MLE of the parameters of any density using a ‘custom’ provision, making it possible to estimate the parameters of the generalized gamma density. Random numbers are generated using the command random(.), densities are obtained using the pdf(.) command, and cumulative distribution functions are obtained using the cdf(.) command. Continuous plots of the density are obtained using the command ksdensity(.), and histogram(.) provides the histogram with provisions for various normalized formats. Paired $t$-testing is accomplished through the command ttest(.). Equation (23) can be used to obtain the generalized gamma variable and its parameters, pdf and cdf. Traditional bootstrapping is carried out through the bootstrp(.) command. Parametric bootstrapping is carried out through random number generation.
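The transformation in Eq. (23) can be sketched in Python/NumPy as follows. The function name `generalized_gamma_samples` and the parameter values are hypothetical, and NumPy's gamma generator is assumed to stand in for the Matlab random(.) command; its shape and scale arguments correspond to $a_{k}$ and $b_{k}$ in Eq. (10).

```python
import numpy as np

def generalized_gamma_samples(a, b, c, size, rng):
    """Draw generalized gamma samples via Eq. (23): GG = G**(1/c)."""
    g = rng.gamma(shape=a, scale=b, size=size)  # gamma variable G, Eq. (10)
    return g ** (1.0 / c)

# Hypothetical parameter values (illustration only).
rng = np.random.default_rng(0)
gg = generalized_gamma_samples(a=2.0, b=1.5, c=2.0, size=200_000, rng=rng)

# Raising GG to the power c recovers a gamma variable with mean a * b.
recovered_mean = float((gg ** 2.0).mean())
```

The same relation gives the CDF: the generalized gamma CDF at $x$ equals the gamma CDF evaluated at $x^{c_{k}}$, so any routine for the gamma CDF can be reused.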

6. Data analysis and results

Table 2
Chi square test results for the three data sets

Hypothesis H₀				Hypothesis H₁
Density	DoF	$\chi^{2}_{\textit{stat}}$	$p$ -value	Density	DoF	$\chi^{2}_{\textit{stat}}$	$p$ -value
a. #1
gen. Pareto	2	0.07	0.96	gen. Gamma	3	3.81	0.28
gamma	2	0.22	0.89	gamma	2	3.82	0.15
Weibull	2	0.32	0.85	lognormal	2	4.07	0.13
Burr	1	0.32	0.57	Weibull	2	5.16	0.08
Nakagami	2	1.45	0.48	Rayleigh	3	7.17	0.07
gen. Gamma	3	2.56	0.46	inv. Gaussian	2	5.85	0.05
lognormal	2	3.44	0.18	Nakagami	2	5.91	0.05
inv. Gaussian	1	18.06	$<$ 0.05	gen. Pareto	3	14.77	$<$ 0.05
Rayleigh	2	29.18	$<$ 0.05	Burr	1	4.18	$<$ 0.05
Rician	1	29.18	$<$ 0.05	Rician	2	7.17	$<$ 0.05
b. set #2
Weibull	3	4.97	0.17	Weibull	2	1.74	0.41
Nakagami	2	3.50	0.17	gamma	2	2.08	0.35
gamma	3	5.22	0.16	Nakagami	1	1.16	0.28
gen. Gamma	3	6.31	0.10	gen. Gamma	3	4.07	0.25
lognormal	2	9.26	$<$ 0.05	Rayleigh	2	5.62	0.06
inv. Gaussian	2	13.14	$<$ 0.05	Rician	1	5.62	$<$ 0.05
Rayleigh	3	19.75	$<$ 0.05	lognormal	2	6.94	$<$ 0.05
Rician	2	19.75	$<$ 0.05	inv. Gaussian	2	13.01	$<$ 0.05
c. set #3
gamma	3	3.74	0.29	Nakagami	4	1.13	0.89
Weibull	2	3.04	0.22	Weibull	4	1.26	0.87
Nakagami	2	3.22	0.20	Rayleigh	4	1.92	0.75
gen. Gamma	3	9.46	$<$ 0.05	gen. Gamma	3	1.63	0.65
lognormal	2	7.29	$<$ 0.05	gamma	4	2.48	0.65
inv. Gaussian	2	14.29	$<$ 0.05	Rician	3	1.92	0.59
Rayleigh	3	32.36	$<$ 0.05	lognormal	3	4.05	0.26
Rician	2	32.67	$<$ 0.05	inv. Gaussian	3	6.28	0.10

Table 3

Results of parametric and non-parametric bootstrapping of the AUC, displaying the mean ($\mu$), 95% confidence interval (CI), and standard deviation ($\sigma$) for the three data sets

Fit	$\mu$	95% CI	$\sigma$
#1 biBest^* gen. pareto-gen. gamma
EMP	0.88	[0.82, 0.94]	0.03
biG	0.88	[0.82, 0.93]	0.03
bigG	0.88	[0.83, 0.93]	0.03
biBest^*	0.88	[0.82, 0.93]	0.03
#2 biBest^* weibull-weibull
EMP	0.82	[0.75, 0.90]	0.04
biG	0.83	[0.75, 0.89]	0.04
bigG	0.82	[0.76, 0.89]	0.03
biBest^*	0.82	[0.74, 0.89]	0.04
#3 biBest^* gamma-nakagami
EMP	0.86	[0.79, 0.92]	0.03
biG	0.86	[0.80, 0.92]	0.03
bigG	0.86	[0.78, 0.92]	0.03
biBest^*	0.86	[0.80, 0.92]	0.03

Table 4

Comparison of ROC curves for the three fits and the empirical (EMP) curve: $p$-values of the paired $t$-tests for the three criteria

#1 biBest^* gen. pareto-gen. gamma
Pair	Youden’s J	Dist. to [0,1]	Bayes’
[EMP,biG]	0.89	0.49	0.73
[EMP,bigG]	0.81	0.50	0.61
[EMP,biBest^*]	0.43	0.30	0.51
#2 biBest^* weibull-weibull
[EMP,biG]	0.48	$<$ 0.05	0.40
[EMP,bigG]	0.88	0.39	0.99
[EMP,biBest^*]	0.21	0.25	0.15
#3 biBest^* gamma-nakagami
[EMP,biG]	0.28	$<$ 0.05	0.25
[EMP,bigG]	0.15	0.13	0.25
[EMP,biBest^*]	0.46	0.90	0.61

Data set #1 was analyzed first, followed by the other two sets. For all the densities, chi square tests were carried out in Matlab. For the generalized gamma, the custom MLE routine in Matlab was used to obtain the parameters. Tables 2, 3, and 4 display the results of chi square tests, bootstrapping, and $t$-tests respectively for all three data sets. For each cohort (hypothesis), the densities are sorted in descending order of the $p$-values (Table 2).

Figure 1.

Histograms and theoretical (best) fits (set #1) [left] ROC curves [right].

Table 2a summarizes the results of chi square tests for set #1. The generalized Pareto and generalized gamma are taken as the best fits for the target absent and target present cohorts respectively. Several other densities also result in $p$-values exceeding 0.05. Figure 1 [left] displays the histogram of the data cohorts and the corresponding best fits and Fig. 1 [right] displays the empirical ROC curve (EMP) along with the bigamma fit (biG), bi-generalized gamma fit (bigG) and the best fit (biBest). The fits were obtained using the respective cumulative distribution functions as the threshold is varied as explained in Section 2. For set #1, all three ROC fits are sufficiently close to the empirical ROC curve. The similarities among these curves (three fits and the empirical) were examined through bootstrapping and paired hypothesis tests based on Student’s $t$-tests (Rohatgi & Saleh, 2000).

The comparison of the ROC curves relied on both non-parametric and parametric bootstrapping (Davison et al., 2003). Traditional (non-parametric) bootstrapping (500 times) was used with the data set to obtain the 95% confidence interval, mean ($\mu$) and standard deviation ($\sigma$) of the AUC. Parametric bootstrapping was used to obtain the statistics of the AUC from the theoretical fits, starting with the bigamma fit. This required generating 70 random samples from a gamma density with parameters estimated from the data (H₀) and 60 samples from a gamma density with parameters estimated from the data (H₁). The ROC area for this set was calculated and the process was repeated 500 times. This procedure constitutes parametric bootstrapping for the bigamma case. It was repeated with the bi-generalized gamma fit and then with the biBest fit. The results are displayed in Table 3a, showing remarkable agreement with regard to the 95% confidence interval (CI) of the area, its mean ($\mu$) and standard deviation ($\sigma$). The non-parametric bootstrapping (500 boot samples) results are shown as EMP (empirical). The AUC values obtained directly from the ROC plots seen in Fig. 1 are close to the corresponding mean values obtained from bootstrapping.
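The two bootstrapping schemes can be sketched in Python/NumPy as follows. Several choices here are assumptions rather than documented steps of this study: the AUC is computed with the Mann-Whitney form (equivalent to the trapezoidal area of the empirical ROC), the non-parametric resampling is done within each cohort, the confidence interval is the percentile interval, and the gamma parameters are hypothetical values standing in for maximum likelihood estimates.

```python
import numpy as np

def auc_mann_whitney(absent, present):
    """Empirical AUC: P(present > absent) + 0.5 * P(tie)."""
    diff = (np.asarray(present, float)[:, None]
            - np.asarray(absent, float)[None, :])
    return float(np.mean(diff > 0) + 0.5 * np.mean(diff == 0))

def summarize(aucs):
    """Mean, 95% percentile interval and standard deviation of AUC values."""
    lo, hi = np.percentile(aucs, [2.5, 97.5])
    return float(np.mean(aucs)), (float(lo), float(hi)), float(np.std(aucs, ddof=1))

def nonparametric_bootstrap(absent, present, n_boot, rng):
    """Resample each cohort with replacement and recompute the AUC."""
    absent = np.asarray(absent, float)
    present = np.asarray(present, float)
    aucs = [auc_mann_whitney(rng.choice(absent, absent.size, replace=True),
                             rng.choice(present, present.size, replace=True))
            for _ in range(n_boot)]
    return summarize(aucs)

def parametric_bootstrap(draw_absent, draw_present, n_boot):
    """Draw fresh cohorts from the fitted densities and recompute the AUC."""
    aucs = [auc_mann_whitney(draw_absent(), draw_present())
            for _ in range(n_boot)]
    return summarize(aucs)

rng = np.random.default_rng(1)
# Hypothetical gamma parameters (shape, scale) for the two cohorts.
a0, b0, a1, b1 = 1.2, 1.5, 2.0, 3.0
absent = rng.gamma(a0, b0, 70)    # stand-in for a target absent cohort
present = rng.gamma(a1, b1, 60)   # stand-in for a target present cohort

emp = nonparametric_bootstrap(absent, present, 500, rng)
big = parametric_bootstrap(lambda: rng.gamma(a0, b0, 70),
                           lambda: rng.gamma(a1, b1, 60), 500)
```

Each result is a tuple (mean, 95% interval, standard deviation), matching the columns of Table 3. Replacing the two gamma samplers with samplers for any other fitted pair gives the corresponding bi-generalized gamma or biBest case.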

While bootstrapping provides a comparison of AUC values, there is also a need to examine the fits in terms of applicability to the estimation of the probabilities of false alarm and miss, and the positive predictive value, through the determination of an optimum threshold (Liu, 2012; Schisterman et al., 2005). This step was accomplished through paired 2-sided $t$-tests. With the availability of 131 thresholds (130 values of the data and the lowest possible value of the data, namely 0), the empirical ROC can be paired with the theoretical ROC from the bigamma, bi-generalized gamma and the biBest fits based on three criteria: the distance to the top left corner of the ROC plot (Neyman-Pearson) in Eq. (20), the Youden’s index in Eq. (21) and the Bayes’ criterion in Eq. (22). The results are tabulated in Table 4a. With the high $p$-values displayed, the 2-sided $t$-tests suggest that there are no statistically significant differences in the paired approaches, further demonstrating the strength of the theoretical fits. In other words, any interpretation of the ROC curve requiring a specific threshold (based on any one of the three criteria) is unaffected by a continuous fit.
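The statistic behind the paired $t$-test can be written out in a few lines of Python/NumPy. The function name `paired_t_statistic` and the four pairs of criterion values are hypothetical; in the analysis above each test would use 131 pairs, one per threshold, with the criterion evaluated on the empirical ROC and on a fitted ROC.

```python
import numpy as np

def paired_t_statistic(u, v):
    """Paired t statistic and degrees of freedom for H0: mean(u - v) = 0."""
    diff = np.asarray(u, dtype=float) - np.asarray(v, dtype=float)
    n = diff.size
    # Mean difference divided by its standard error.
    t = diff.mean() / (diff.std(ddof=1) / np.sqrt(n))
    return float(t), n - 1

# Hypothetical criterion values at four thresholds (illustration only):
# u from an empirical ROC, v from a fitted ROC.
u = [0.10, 0.35, 0.52, 0.61]
v = [0.12, 0.33, 0.50, 0.66]
t_stat, dof = paired_t_statistic(u, v)
```

The two-sided $p$-value follows from the Student $t$ distribution with `dof` degrees of freedom; a library routine such as `scipy.stats.ttest_rel` returns both the statistic and the $p$-value in one call.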

Figure 2.

Histograms and theoretical (best) fits (set #2) [left] ROC curves [right].

Analysis was next carried out with set #2 (Table 1b). The summary of the chi square results is displayed in Table 2b. They show that the best fits for both data cohorts are Weibull densities. Several other densities are also acceptable fits based on the criterion of the $p$-value exceeding 0.05. Figure 2 [left] displays the histogram and the best fits for each cohort. Figure 2 [right] shows the ROC curves (empirical and the three fits) along with the corresponding values of the area under the ROC curves (AUC).

Bootstrapping results are provided in Table 3b. While the best fit ROC results in a slightly lower value of AUC, all the AUC values and the standard deviations are reasonably close. The mean values are also close to the AUC values displayed in Fig. 2 [right]. Table 4b displays the results of the paired $t$-tests to examine whether there exist any statistically significant differences among the ROC fits (and the empirical curve) based on the three criteria. As before, the results show that no statistically significant differences exist, the only exception being the threshold behavior associated with the bigamma fit under the Neyman-Pearson criterion. The Burr and generalized Pareto distributions were not pursued because of the difficulties in estimating their parameters.

Figure 3.

Histograms and theoretical (best) fits (set #3) [left] ROC curves [right].

Analysis was carried out with set #3 (Table 1c). The summary of the chi square results is displayed in Table 2c. They show that the best fits for the target absent and target present cohorts are the gamma and Nakagami densities respectively. It is also seen that all the densities studied are acceptable fits for the target present cohort. Figure 3 [left] displays the histogram and the best fits for each cohort. Figure 3 [right] shows the ROC curves (empirical and the three fits). Table 3c summarizes the results of bootstrapping. All the AUC values and the standard deviations are reasonably close. Table 4c displays the results of the paired $t$-tests to examine whether there exist any statistically significant differences among the ROC fits (and the empirical curve) based on the three criteria. Except for the bigamma fit under the Neyman-Pearson criterion, the fits show no statistically significant differences from the empirical ROC.

7. Conclusions

ROC analysis is an essential step in applications involving machine vision and medical diagnostics, and in data analytics in general. Both the analysis of the data and the subsequent decision-making based on its interpretation are simplified when a statistical fit for the ROC curve is available. This assumes that the ROC fits have been rigorously tested using established statistical tools. With fits assured for each cohort of the data sets (target absent and target present), the models and results presented here offer a very general approach to the theoretical modeling of ROC curves (Kochański, 2022; Rota & Antolini, 2014).

All elements of a typical ROC curve, such as the area under the ROC curve (AUC) and the three important thresholds (Neyman-Pearson, Youden’s J and Bayes’), are examined to ensure that the model is appropriate. By combining hypothesis testing based on chi square tests with bootstrapping and $t$-tests, it is possible to demonstrate the strength and validity of the statistical model. While the research reported here was conducted using simulated data, the approaches demonstrated here rely on established statistical methodologies and can be applied to data sets collected in experiments. Additional data sets and information on MATLAB scripts will be available from the author upon request.

References

1. Barton, D.E. (1956). Neyman’s test of goodness of fit when the null hypothesis is composite. Scandinavian Actuarial J, (2), 216-245.

2. Bradley, A.P. (1997). The use of the area under the ROC curve in the evaluation of machine learning algorithms. Pattern Recognition, 30(7), 1145-1159.

3. Brown, C.D., & Davis, H.T. (2006). Receiver operating characteristics curves and related decision measures: A tutorial. Chemometrics and Intelligent Laboratory Syst, 80(1), 24-38.

4. Chang, C.-I. (2010). Multiparameter receiver operating characteristic analysis for signal detection and classification. IEEE Sensors Journal, 10(3), 423-442.

5. Chang, P.J. (1989). Bayesian analysis revisited: a radiologist’s survival guide. American Journal of Roentgenology, 152(4), 721-729.

6. Cochran, W.G. (1954). Some methods for strengthening the common $\chi^2$ tests. Biometrics, 10(4), 417-451.

7. Davison, A.C., Hinkley, D.V., & Young, G.A. (2003). Recent developments in bootstrap methodology. Statistical Science, 141-157.

8. Demidenko (2012). Confidence intervals and bands for the binormal ROC curve revisited. Journal of Applied Stat, 39(1), 67-79.

9. Destrempes, & Cloutier (2010). A critical review and uniformized representation of statistical distributions modeling the ultrasound echo envelope. Ultrasound in Medicine and Biology, 36(7), 1037-1051.

10. Dorfman, D.D., Berbaum, K.S., Metz, C.E., Lenth, R.V., Hanley, J.A., & Dagga, H.A. (1997). Proper receiver operating characteristic analysis: the bigamma model. Academic Radiology, 4(2), 138-149.

11. Hanley, J.A. (1988). The robustness of the “binormal” assumptions used in fitting ROC curves. Medical Decision Making, 8(3), 197-203.

12. Hanley, J.A. (1996). The use of the ‘binormal’ model for parametric ROC analysis of quantitative diagnostic tests. Statistics in Medicine, 15(14), 1575-1585.

13. Hogg, R.V., & Klugman, S.A. (1983). On the estimation of long tailed skewed distributions with actuarial applications. Journal of Econometrics, 23(1), 91-102.

14. Hoyer, & Kuss (2020). Meta analysis of full ROC curves with flexible parametric distributions of diagnostic test values. Res Syn Meth, 11(2), 301-313.

15. Kester, A.D., & Buntinx (2000). Meta-analysis of ROC curves. Medical Decision Making, 20(4), 430-439.

16. Kochański (2022). Which curve fits best: Fitting ROC curve models to empirical credit-scoring data. Risks, 10(10), 184.

17. Liu (2012). Classification accuracy and cut point selection. Statistics in Medicine, 31(23), 2676-2686.

18. Lipovetsky, & Conklin, M.W. (2020). Bayesian sensitivity-specificity and ROC analysis for finding key drivers. Journal of Modern Applied Statistical Methods, 19(1), eP3023.

19. Ltifi, Benmohamed, Kolski, & Ben Ayed (2020). Adapted visual analytics process for intelligent decision-making: application in a medical context. International Journal of Information Technology and Decision Making, 19(1), 241-282.

20. Metz, C.E. (1989). Some practical issues of experimental design and data analysis in radiological ROC studies. Investigative Radiology, 24(3), 234-245.

21. Obuchowski, N.A., & Bullen, J.A. (2018). Receiver operating characteristic (ROC) curves: review of methods with applications in diagnostic medicine. Physics in Medicine and Biol, 63(7), 07TR01.

22. Provost, F.J., & Fawcett (2001). Robust classification for imprecise environments. Machine Learning, 42(3), 203-231.

23. Pundir, & Amala (2014). Parametric receiver operating characteristic modeling for continuous data: A glance. Model Assisted Statistics and Applications, 9(2), 121-135.

24. Rohatgi, V.K., & Saleh, A.K. (2000). An Introduction to Probability and Statistics. 2nd Edition. Wiley.

25. Rota, & Antolini (2014). The optimal cut-point for Gaussian and Gamma distributed biomarkers. Computational Statistics and Data Analysis, 69, 1-14.

26. Schisterman, E.F., Perkins, N.J., Liu, & Bondell (2005). Optimal cut-point, and its corresponding Youden Index to discriminate individuals using pooled blood samples. Epidemiology, 16(1), 73-81.

27. Simon, M.K., & Alouini, M.S. (2001). Digital communication over fading channels. New York: Wiley.

28. Siva, Vishnu Vardhan, & Chesneau (2023). Estimating the AUC of mixture MROC curve in the presence of measurement errors. Model Assisted Statistics and Applications, 18(3), 237-244.

29. Swets, J.A. (1988). Measuring the accuracy of diagnostic systems. Science, 240(4857), 1285-1293.

30. Tilbury, J.B., Van Eetvelt, W.J., Garibaldi, J.M., Curnow, J.S., & Ifeachor, E.C. (2000). Receiver operating characteristic analysis for intelligent medical systems-a new approach for finding confidence intervals. IEEE Transactions on Biomedical Engineering, 47(7), 952-963.

31. Vickers, A.J. (2008). Decision analysis for the evaluation of diagnostic tests, prediction models, and molecular markers. The American Statistician, 62(4), 314-320.

32. Walsh, S.J. (1999). Goodness-of-fit issues in ROC curve estimation. Medical Decision Making, 19(2), 193-201.

33. Watson, G.S. (1958). On chi-square goodness-of-fit tests for continuous distributions. J of the Royal Statistical Society. Series B (Methodological), 20(1), 44-72.

34. Zweig, M.H., & Campbell (1993). Receiver-operating characteristic (ROC) plots: a fundamental evaluation tool in clinical medicine. Clinical Chemistry, 39(4), 561-577.
