Comparing Statistical Methods for Quantifying Drug Sensitivity Based on In Vitro Dose

Abstract

In vitro chemosensitivity assays are invaluable for assessing chemotherapeutic agents' effects on cancer cells. Yet the dose–response curves generated by those assays, usually approximated by four-parameter logistic (4PL) models, are often difficult to interpret, with no clear indication of which metric should be used to compare them. Here, five commonly used metrics, namely absolute and relative half-maximal inhibitory concentration (IC₅₀), area under the dose–response curve (AUC) based on the trapezoidal rule and on a parametric approach, and the effect at the maximal concentrations (E_max), were compared in both simulations and real-life scenarios to evaluate their use with 4PL curves. Although IC₅₀ is the most widely used metric for analyzing dose–response curves, this study demonstrated that it was not the most reliable of the metrics tested. Fitted AUC showed the best overall performance in both the simulation and real-life scenarios; trapezoidal AUC showed similar performance to fitted AUC in most cases.

Introduction

When assessing the sensitivity of in vitro cultures to chemotherapeutic compounds, it is important to compare dose–response curves to evaluate assay performance within and between assays, to weigh experimental curves against reference curves, and to rank order experimental compounds. Yet accurate assessment of these curves is oftentimes hampered by variables such as compound concentration, antibody dilution, and the detection method used. Data generated by everyday assays are usually constrained by technology and resources, are frequently difficult to interpret, and are not as perfect a fit of the model as one would like. A few metrics are commonly used to rank and classify dose–response curves, but it is not always apparent which metric will provide the most accurate information, especially when the data are far from ideal. This study evaluated these commonly used metrics in a variety of computer simulations and real-life examples. The goal was to investigate the statistical properties of each metric and thus identify the metrics that are the most often reliable when used in typical datasets. A more accurate evaluation of dose–response curves can lead to a better understanding of experimental drugs' potency and efficacy, which in turn can lead to more precise appraisals of anticancer agents in vitro.

The focus of this research was the quantification of dose–response curves assumed to follow a four-parameter logistic (4PL) model, which was defined as

\begin{align*} y = \beta_2 + \frac{\beta_1 - \beta_2}{1 + (x / \beta_3)^{\beta_4}} \end{align*}

where y is the response, x is the drug concentration, β₁ is the upper limit (top), β₂ is the lower limit (bottom), β₃ is the half-way response (IC₅₀) between β₁ and β₂, and β₄ is the slope. The 4PL model, initially associated with ligand binding assays, was chosen for this study, because it is one of the most commonly used models for investigating nonlinear dose–response relationships in pharmacology laboratories today.^1–4 Once the model is used to fit dose–response data, the estimated response at maximal dose, the estimated response at minimal dose, the curve slope, and the dose at 50% maximal response (IC₅₀ or EC₅₀) are all reported; from this information, responses at various concentrations can be calculated using the fitted model.⁵ Although the 4PL model has its limitations (e.g., it is a symmetrical function and some dose–response data are not symmetrical), the 4PL model has the advantages that it is flexible, it can be used to fit data over a large range of distribution forms, and it is widely used and accepted in the pharmacology community.^4,5
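As a minimal illustration of the model defined above, the 4PL response can be written as a short Python function. This is a sketch, not the authors' implementation (the study used R); the function name `four_pl` and the example parameter values are assumptions chosen for illustration, and a positive slope is assumed.

```python
def four_pl(x, top, bottom, ic50, slope):
    """Response of the 4PL model at dose x (x >= 0, slope > 0 assumed).

    top    = beta_1 (upper limit)
    bottom = beta_2 (lower limit)
    ic50   = beta_3 (dose at the halfway point between top and bottom)
    slope  = beta_4
    """
    return bottom + (top - bottom) / (1.0 + (x / ic50) ** slope)


# Illustrative (hypothetical) parameters: top=1, bottom=0.25, IC50=5, slope=1.
# At x = IC50 the response lies halfway between top and bottom.
print(four_pl(5.0, 1.0, 0.25, 5.0, 1.0))  # 0.625
```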

In this study, five metrics were considered: relative and absolute IC₅₀, fitted and trapezoidal area under the dose–response curve (AUC), and effect at the maximal concentrations (E_max) (Fig. 1). Specifically, relative IC₅₀ was the concentration that corresponds to the inflection point of the dose–response curve (halfway between the top and the bottom of the fitted 4PL curve); absolute IC₅₀ was the concentration that causes 50% of maximal inhibition effect (halfway between top of the 4PL and zero); trapezoidal AUC was the area under the dose–response curve generated by piece-wise linear connection of the observed data points using the trapezoidal rule; fitted AUC was the area under the dose–response curve generated by the fitted 4PL model; E_max was the summation effect at the highest three concentrations. (Note that this E_max definition is slightly different than the usual definition. Here, it does not require that the plateau of the curve be reached at high concentrations.)

Fig. 1.

Five metrics illustrated on a sample four-parameter logistic (4PL) dose–response curve. This curve points out the five metrics evaluated in this study: relative IC₅₀ (the concentration that corresponds to the inflection point of the dose–response curve, halfway between the top and the bottom of the fitted 4PL curve); absolute IC₅₀ (the concentration that causes 50% of maximal inhibition effect, halfway between the top of the 4PL and zero); fitted area under the curve (AUC) (the model-based derivation of area under the curve); trapezoidal AUC (the area under the dose–response curve generated by piece-wise linear connection of the observed data points using the trapezoidal rule); and E_max (the summed effect at the highest three concentrations, a slightly different definition than the traditional definition).

These particular five metrics were chosen based on the authors' belief that these metrics are the most frequently used metrics for analyzing dose–response data. Relative and absolute IC₅₀ are outputs of the 4PL model and are therefore the most often used to analyze dose–response curves.⁵ They were considered here as the primary competitor metrics against which alternative metrics were to be assessed. Trapezoidal AUC was considered, because it is commonly used to quantify dose–response curves.^6–10 Fitted AUC was included as a new method in this study. The primary advantage of fitted AUC over trapezoidal AUC is that the fitted curve is guaranteed to be monotonic, so the metric is less affected by measurement errors. E_max was included, because it was used to evaluate dose–response data by groups that performed a similar in vitro assay to the one described here.^11,12 The metrics investigated in this study characterize different aspects of a dose–response curve; IC₅₀ and E_max are point metrics (IC₅₀ reports a specific concentration at half-maximal response and E_max reports a specific level of response at the maximum concentrations tested), whereas AUC is more of an overview or summation of a drug's effect across the entire range of doses tested.^13,14

Materials and Methods

The five metrics' performances were initially judged on their abilities to classify and rank curves in computer simulations, which created typical 4PL dose–response curves normally generated by in vitro sensitivity assays. Next, the metrics were assessed by “real-life” scenarios. These real-life scenarios evaluated the metrics' capacities to differentiate estrogen receptor (ER)-positive and -negative cell lines and the metrics' capacities to correlate with publicly recognized drug sensitivities of cancer cell lines. The workflow is illustrated in Table 1.

Table 1.

Workflow for Materials and Methods

	Simulations		Real-life scenarios
Data	Classification scheme (8 scenarios)	Ranking scheme (8 scenarios)	Dataset 1	Dataset 2
Evaluation criteria	Average classification error rate	Mean correlation with true ranking of the curves	Association with ER status	Association with ER status; correlation with published IC₅₀ values

ER, estrogen receptor.

Assuming that there are 11 drug doses used in the assay, noted as $x_i$, $i = 0, 1, \ldots, 10$, x₀ corresponds to dose=0, and x₁₀ corresponds to dose=10, the highest dose. Relative IC₅₀ and absolute IC₅₀ were based on fitted curves made by the 4PL model of curve fitting.^1,15 Relative IC₅₀ was defined here as the compound dose that equals a 50% response between the maximum response of the curve (the top) and the minimum response of the curve (the bottom).^16,17 In the case of relative IC₅₀, the top and bottom may or may not have been 100% and 0%, respectively. Absolute IC₅₀ was defined here as the compound dose that results in 50% response compared with the control (usually equal to 100%), and absolute IC₅₀ was independent of the dynamic range of the assay.^16,17 Fitted AUC was the model-based derivation of the area under a dose–response curve (i.e., a parametric 4PL model is fitted to the data, and then the area under this parametric curve is calculated), that is,

\begin{align*} \text{Fitted AUC} = \int_0^{10} \left( \beta_2 + \frac{\beta_1 - \beta_2}{1 + (x / \beta_3)^{\beta_4}} \right) dx. \end{align*}
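The model-based quantities described above can be sketched in Python once the four parameters have been estimated. This is a hedged illustration, not the authors' code: the integral is approximated with composite Simpson's rule as a stand-in for whatever routine was actually used, and the absolute IC₅₀ is taken as the dose where the fitted curve crosses half of the top (following the description "halfway between the top of the 4PL and zero"). All function names and the example parameters are assumptions.

```python
import math


def four_pl(x, top, bottom, ic50, slope):
    """4PL response at dose x (slope > 0 assumed)."""
    return bottom + (top - bottom) / (1.0 + (x / ic50) ** slope)


def fitted_auc(top, bottom, ic50, slope, lo=0.0, hi=10.0, n=1000):
    """Area under the fitted 4PL curve on [lo, hi] by composite Simpson's rule."""
    if n % 2:
        raise ValueError("n must be even for Simpson's rule")
    h = (hi - lo) / n
    total = four_pl(lo, top, bottom, ic50, slope) + four_pl(hi, top, bottom, ic50, slope)
    for k in range(1, n):
        weight = 4.0 if k % 2 else 2.0  # odd nodes weigh 4, even interior nodes weigh 2
        total += weight * four_pl(lo + k * h, top, bottom, ic50, slope)
    return total * h / 3.0


def relative_ic50(top, bottom, ic50, slope):
    """Relative IC50 is the beta_3 parameter of the fitted curve."""
    return ic50


def absolute_ic50(top, bottom, ic50, slope, target=None):
    """Dose at which the fitted curve crosses `target` (assumed default: top / 2)."""
    if target is None:
        target = top / 2.0
    if not (bottom < target < top):
        return math.nan  # the fitted curve never reaches the target level
    return ic50 * ((top - bottom) / (target - bottom) - 1.0) ** (1.0 / slope)


# Hypothetical parameters: top=1, bottom=0.25, IC50=5, slope=1.
params = (1.0, 0.25, 5.0, 1.0)
print(fitted_auc(*params), relative_ic50(*params), absolute_ic50(*params))
```

With a nonzero bottom, the two IC₅₀ definitions diverge: in this example the relative IC₅₀ is 5, whereas the curve only falls to half of the top at a dose of about 10.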

Trapezoidal AUC was the area under the curve based on the trapezoidal rule, which is calculated as

\begin{align*} \text{Trapezoidal AUC} = \sum_{i=0}^{10} y_i - \frac{y_0 + y_{10}}{2}. \end{align*}

Here, it is assumed that the log-doses are equally spaced.

E_max usually refers to the drug effect at the highest concentration. For this study, E_max was defined a bit differently. To reduce variability, E_max equaled the summation of the last three responses on the dose–response curve. For the purpose of this study, E_max can be thought of as a truncated version of trapezoidal AUC, eliminating the requirement for the responses at the highest three concentrations to fall in an asymptote.

\begin{align*} E_{\max} = \sum_{i=8}^{10} y_i \end{align*}
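The two data-based metrics follow directly from the observed responses. The sketch below assumes 11 responses y₀ to y₁₀ at equally spaced (log-)doses with unit spacing, as in the formulas above; the function names and the example responses are hypothetical.

```python
def trapezoidal_auc(y):
    """Trapezoidal rule with unit spacing: sum of all responses minus
    half of the first and last responses."""
    return sum(y) - (y[0] + y[-1]) / 2.0


def e_max(y, n_last=3):
    """Sum of the responses at the highest `n_last` doses (3 in this study)."""
    return sum(y[-n_last:])


# Hypothetical survival fractions at doses 0..10.
y = [1.00, 0.97, 0.90, 0.80, 0.68, 0.60, 0.52, 0.47, 0.44, 0.42, 0.41]
print(trapezoidal_auc(y), e_max(y))
```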

In the real-life evaluation of dose–response curves, quality control processes typically exclude outliers in datasets from downstream analysis; therefore, outliers were not included in the simulations described here. The focus of this research was to evaluate the five metrics' performance when the data fit the 4PL model and no outliers occurred. The simulations purposefully challenged the 4PL model with attributes, such as a truncated right end, to mimic what is seen in reality.

Computer Simulation Scenarios

The simulations performed here were based on the 4PL model, using the statistical software package R version 2.10.1,^1,18 which was also used for all the other statistical analyses in this study. In particular, the estimates of relative IC₅₀, absolute IC₅₀, and fitted AUC were based on nonlinear ordinary least squares curve fitting of the 4PL model using the R package drc. In all of the simulations, normal random errors, ɛ∼N(0,σ²), were added to the true 4PL models. Different variation levels, σ, were investigated between 1% and 15%, but the relative performance of the metrics was found to be independent of the variation level (data not shown). Therefore, all of the simulations were based on σ=5%. Differences in IC₅₀s and bottoms were also investigated over a range of values. All the simulations assumed that the response range was between 0% and 100%, and the doses ranged between 0 (no treatment) and 10 (maximum concentration).
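The data-generating step described above (a true 4PL curve plus independent normal errors with σ=5%) can be sketched as follows. This is an illustrative Python sketch under stated assumptions, not the R code used in the study: responses are expressed as fractions (so σ=0.05), doses are the integers 0 to 10, and the function names and parameter values are hypothetical.

```python
import random


def four_pl(x, top, bottom, ic50, slope):
    """4PL response at dose x (slope > 0 assumed)."""
    return bottom + (top - bottom) / (1.0 + (x / ic50) ** slope)


def simulate_curve(top, bottom, ic50, slope, sigma=0.05,
                   doses=tuple(range(11)), rng=None):
    """Return noisy responses: true 4PL value plus N(0, sigma^2) error per dose."""
    if rng is None:
        rng = random.Random()
    return [four_pl(float(x), top, bottom, ic50, slope) + rng.gauss(0.0, sigma)
            for x in doses]


# One simulated curve with a fixed seed so the example is reproducible.
curve = simulate_curve(1.0, 0.25, 5.0, 1.0, rng=random.Random(2010))
print([round(v, 3) for v in curve])
```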

Two schemes were set up to assess the metrics: the classification scheme and the ranking scheme. The classification scheme contained scenarios that evaluated the metrics' abilities to correctly categorize samples into corresponding groups (see Supplementary Fig. S1 and Supplementary Table S1 for visualization of the classification scenarios; Supplementary Data are available online at www.liebertonline.com/adt). The ranking scheme contained scenarios that evaluated the metrics' abilities to correctly rank order curves generated from 4PL models with known parameter values (see Supplementary Fig. S2 and Supplementary Table S2 for visualization of the ranking scenarios).

For the classification scheme, the two groups were assumed to represent two 4PL “families”: resistant to compound treatment and sensitive to compound treatment. There were eight classification scenarios considered within this classification scheme. For each scenario, the curve families had three parameters in common and differed in one (IC₅₀ or bottom) (Supplementary Table S1). One hundred curves for each of the resistant and sensitive families were generated per simulation run. The median of the 200 values of each metric (a convenient number chosen to simplify programming) was used as the cutoff to determine resistant or sensitive in each scenario, and the error rate was then calculated for each metric in each scenario. For instance, if in one simulation run the top 100 estimated absolute IC₅₀ values (classified as the “upper” group) include five curves that actually belong to the “lower” group, then the classification error rate for absolute IC₅₀ in this simulation run is 5%. This process was then independently repeated 100 times, and the average classification error rate (ACER) and the standard deviation (SD) of classification error rates over the 100 simulation runs were calculated to appraise the performance of each metric. In this simulation scheme, the true membership of each curve was known by simulation design, so the parameter values (including the four parameters in 4PL and the noise level) were intentionally chosen so that each method had the chance to misclassify curves. This was done, because when two curve groups are very different, all of the methods can do a perfect job of classification. The objective was to compare the misclassification error rates by the five methods. To do a fair comparison, the eight scenarios were designed so that no one method was particularly favored. 
For instance, scenario 1 featured two parallel curves, which may have favored IC₅₀ and AUC over E _max; scenario 2 was a right-truncation of scenario 1, so E _max was expected to improve; scenario 3 was comprised of two nonparallel curves, which may have favored AUC and E _max over IC₅₀.
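The median-cutoff classification and its error rate can be sketched as below. The sketch assumes an even number of curves with no tied metric values, so that taking the top half of the sorted values is equivalent to cutting at the median; the function names and the toy data are hypothetical.

```python
import statistics


def classification_error_rate(values, is_upper_true):
    """Fraction of curves placed in the upper half (by metric value) that
    truly belong to the family expected to have the lower values."""
    n = len(values)
    order = sorted(range(n), key=lambda i: values[i], reverse=True)
    top_half = order[: n // 2]
    wrong = sum(1 for i in top_half if not is_upper_true[i])
    return wrong / len(top_half)


def summarize_runs(error_rates):
    """Average classification error rate (ACER) and SD over simulation runs."""
    return statistics.mean(error_rates), statistics.stdev(error_rates)


# Toy run with six curves: one truly "lower" curve lands in the upper half.
values = [0.9, 0.8, 0.3, 0.7, 0.2, 0.1]
truth = [True, True, True, False, False, False]
print(classification_error_rate(values, truth))
```

In the study's setting, `values` would hold one metric's 200 estimates from a single simulation run, and `summarize_runs` would be applied to the 100 per-run error rates.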

For the ranking scheme, there were also eight scenarios considered. In each scenario, the curves had three parameters in common and differed in one (IC₅₀ or bottom, the variable parameter); the variation range of the variable parameter is given in brackets (Supplementary Table S2). For each variable parameter in each scenario, 200 equidistant intermediate values were considered to span that variation range. In each simulation run, 200 curves were generated based on the 200 sets of parameters. The correlation between the true ranks of the variable parameter and the ranks of the simulation results (based on 4PL model fit) was calculated for each metric. The best-performing metric was expected to have a higher correlation with the true ranks of the curves. This simulation was repeated 100 times. The mean and the SD of the 100 correlation values by the metric were used to evaluate the metric's performance.
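The rank-based comparison can be sketched as a correlation between two sets of ranks. The sketch below assumes a Spearman-type calculation (Pearson correlation applied to ranks) without tie handling; the text does not state which correlation coefficient was used, so this choice, like the function names, is an assumption.

```python
import math


def ranks(values):
    """Rank values from 1 (smallest) to n (largest); ties are not averaged."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    for rank, i in enumerate(order, start=1):
        result[i] = float(rank)
    return result


def pearson(a, b):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return cov / math.sqrt(var_a * var_b)


def rank_correlation(true_parameter_values, metric_values):
    """Correlation between the true ranks and the ranks given by a metric."""
    return pearson(ranks(true_parameter_values), ranks(metric_values))


# Hypothetical example: a metric that orders four curves perfectly.
print(rank_correlation([4.0, 4.5, 5.0, 5.5], [6.1, 6.4, 6.6, 6.9]))
```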

The intent of the simulation designs was to mimic different situations in reality (e.g., partial coverage of the entire assay dynamic range, nonparallel curves). For example, in the classification scheme, setting the true IC₅₀ values around 5 (assuming the higher and lower plateaus were achieved at concentrations 0 and 10, respectively) meant that the entire curve could be observed, whereas setting the true IC₅₀s around 8 meant that the assay dose design was not adequate to see the bottom part of the 4PL curve. Similarly, setting the slope parameter to 0.5 flattened the true 4PL curves so that the dose range only covered the linear portion of the 4PL curve, as in scenarios 5, 6, 7, and 8 of both classification and ranking simulation schemes. In fact, some of the curves appeared to be more linear than 4PL. The bottom was not set to zero in either simulation scheme, because (1) the bottom is rarely zero in real-life assays and (2) it allowed for the differentiation between absolute IC₅₀ and relative IC₅₀. However, the comparisons of the metrics were expected to be the same regardless of whether the bottom was set to zero or a different value such as 0.25.

To more fully explain and confirm the metrics' performances in the simulations under the classification scheme, the effect size and the coefficient of variation (CV) for each metric were calculated based on theoretical formulas and/or by simulations. Effect size was defined here as

\begin{align*} \text{Effect size} = \frac{\text{Mean}_1 - \text{Mean}_2}{\text{SD}} \end{align*}

where, for a given metric, Mean₁ was the mean value of one group and Mean₂ was the mean of another group; SD was the standard deviation of the mean based on the variation of the two groups. Theoretical formulas could then be derived for the means and variances of relative IC₅₀, trapezoidal AUC, and E _max. Using those derived formulas, the effect size and CV of each metric could then be calculated when the parameter values of the 4PL were known. There are no closed-form formulas for the mean and variance of fitted AUC and absolute IC₅₀, so they were estimated based on simulations.
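When the quantities are estimated from simulated metric values, the calculation might look like the following sketch. The text does not specify how the two groups' variation was combined into a single SD, so a pooled standard deviation is assumed here; the function names and sample values are hypothetical.

```python
import math
import statistics


def effect_size(group1, group2):
    """(Mean_1 - Mean_2) / SD, with SD taken as the pooled standard deviation
    of the two groups (an assumption; the source does not give the formula)."""
    n1, n2 = len(group1), len(group2)
    v1, v2 = statistics.variance(group1), statistics.variance(group2)
    pooled_sd = math.sqrt(((n1 - 1) * v1 + (n2 - 1) * v2) / (n1 + n2 - 2))
    return (statistics.mean(group1) - statistics.mean(group2)) / pooled_sd


def coefficient_of_variation(values):
    """Sample standard deviation divided by the sample mean."""
    return statistics.stdev(values) / statistics.mean(values)


# Hypothetical metric values for two curve families.
resistant = [6.9, 7.1, 7.0, 7.2, 6.8]
sensitive = [6.4, 6.6, 6.5, 6.3, 6.7]
print(effect_size(resistant, sensitive), coefficient_of_variation(resistant))
```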

Real-Life Scenarios

The first real-life scenario involved 27 breast cancer cell lines obtained from ATCC (Manassas, VA), which were maintained in RPMI media (Mediatech, Herndon, VA) containing 10% FBS (HyClone, Logan, UT) at 37°C in 5% CO₂. These cell lines included AU565, BT20, BT474, BT483, BT549, CAMA1, HCC1143, HCC1187, HCC1428, HCC1500, HCC1569, HCC1937, HCC1954, HCC202, HCC38, MCF10A, MCF7, MDAMB157, MDAMB175VII, MDAMB231, MDAMB361, MDAMB436, MDAMB453, SKBR3, T47D, UACC812, and ZR751. Of these 27 cell lines, 10 were estrogen receptor positive (ER+) and 17 were negative (ER−).¹⁹ Cells from each cell line were seeded at 320 cells per well in 384-well plates and were allowed to adhere to the plate for 24 h. A four-drug mixture of paclitaxel (T), 5-fluorouracil (F), doxorubicin (A) (McKesson Specialty Care Solutions, La Vergne, TN), and cyclophosphamide (C) (Niomech, Bielefeld, Germany), also known as T/FAC, was created and 10 serial dilutions of the mixture were made.²⁰ The wells were treated with each T/FAC dose in triplicate, one cell line per plate, with three control wells of media alone per cell line. The plates were incubated for 72 h at 37°C.

The second and third real-life scenarios used groups of 30 and 21 breast cancer cell lines, respectively. The group of 30 cell lines included AU565, BT20, BT549, HCC1143, HCC1569, HCC1937, HCC1954, HCC38, MDAMB157, MDAMB231, MDAMB453, MDAMB468, BT474, CAMA1, MCF7, MDAMB175VII, MDAMB361, MDAMB415, T47D, UACC812, ZR7530, CAL120, CAL51, CAL851, EFM19, EVAST, HCC1395, HCC1419, MFM223, and UACC893 (ATCC). The first 21 cell lines in the list were used in the third real-life scenario as well. All cell lines were maintained as described earlier.

All cell lines in both real-life scenarios were treated with 10 serial dilutions of paclitaxel in triplicate, one cell line per 384-well plate. Each plate contained three control wells of media alone. The plates were incubated overnight, treated with the serial dilutions of paclitaxel, and then incubated for 72 h at 37°C.

All plates in the real-life scenarios were assayed by the ChemoFx® drug response marker, an in vitro chemosensitivity assay (Precision Therapeutics, Pittsburgh, PA), to create dose–response curves for each cell line's response to treatment.^21,22 Briefly, ChemoFx began with the removal of media and nonadherent cells after the incubation period. The remaining cells were fixed in 95% ethanol and then stained with DAPI (Molecular Probes, Eugene, OR). A proprietary automated microscope (Precision Therapeutics) was used to capture and count UV images of the stained cells in each well. A survival fraction (SF) at dose i ($i = 0, 1, 2, \ldots, 10$) was calculated as $SF_i = \frac{mean_{drug}^{(i)}}{mean_{control}}$, where $mean_{drug}^{(i)}$ was the average of the number of surviving cells in the drug-treated wells at dose i, and $mean_{control}$ was the average number of living cells in the control wells. The SF data were then used to generate dose–response curves for each cell line, which were assessed by the metrics of interest: relative IC₅₀, absolute IC₅₀, fitted AUC, trapezoidal AUC, and E_max.
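The survival fraction calculation can be sketched as below, assuming triplicate cell counts per dose and three control wells per plate as described above. The function name and the cell counts are hypothetical.

```python
def survival_fractions(drug_counts_by_dose, control_counts):
    """SF_i = mean cell count in drug-treated wells at dose i
    divided by the mean cell count in the control wells."""
    control_mean = sum(control_counts) / len(control_counts)
    return [(sum(wells) / len(wells)) / control_mean
            for wells in drug_counts_by_dose]


# Hypothetical counts: three control wells and triplicate wells at two doses.
control = [310, 325, 295]
treated = [[300, 290, 310], [150, 160, 155]]
print(survival_fractions(treated, control))
```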

Each metric's average values for ER+ and ER− cell lines were calculated from the ChemoFx-generated T/FAC dose–response curves. The Student's t-test was performed to compare the separation of average values for each metric, the Wilcoxon rank test was run to test the ranks of the metrics, and the chi-square test was executed to evaluate how many times a metric correctly categorized a cell line based on ER status. Collectively, these tests were completed to determine which metric demonstrated the greatest separation between the two populations of cell lines.

In the second real-life scenario, the paclitaxel dose–response curves for each of the 30 cell lines were analyzed by each of the five metrics. Each metric's performance for each cell line was directly correlated to paclitaxel IC₅₀ values for that cell line published in a publicly available online database.^23,24 The database, created by the Wellcome Trust Sanger Institute, consists of dose–response data for cancer cell lines treated with various chemotherapeutic agents and analyzed using an ATP-based cell viability assay.²⁵ The Pearson's correlation coefficients associated the 30 values for each metric with the 30 Sanger IC₅₀ values for the corresponding cell lines. The coefficients served to demonstrate how well each metric correlated with the paclitaxel IC₅₀ values for each cell line listed in the Sanger database.

The third real-life scenario used a subset of 21 breast cancer cell lines with known ER status from the 30 lines used in the second scenario.¹⁹ Nine cell lines were ER+ and 12 lines were ER−. Using the paclitaxel dose–response curves generated in the previous scenario, each metric was assessed by t-test to establish how well it separated the cell lines into ER+ and ER− groups.

Results

Classification Scheme

The classification scheme contained eight scenarios to test the five metrics, and the ACERs and SDs were calculated for each (Table 2). Overall, fitted AUC was the most reliable, with the lowest average error rate in six out of eight scenarios. Trapezoidal AUC behaved similarly to fitted AUC in all scenarios. Absolute IC₅₀ and E_max each had the lowest mean error rate in one out of eight scenarios, whereas relative IC₅₀ never had the lowest mean error rate in any of the scenarios. As a whole, the differences in mean error rates for fitted AUC, trapezoidal AUC, and absolute IC₅₀ were quite small, with only relative IC₅₀ and E_max showing comparatively larger mean error rates in certain scenarios. Absolute IC₅₀ always performed better than relative IC₅₀, and E_max had the lowest average error rate only when the curves were not parallel and the bottoms were truncated.

Table 2.

Average Classification Error Rates for Classification Scheme Scenarios

Scenario number	Relative IC₅₀	Absolute IC₅₀	Fitted AUC	Trapezoidal AUC	E_max
1	0.120 (0.02)	0.112 (0.02)	0.099 (0.02)	0.117 (0.03)	0.450 (0.04)
2	0.274 (0.03)	0.139 (0.03)	0.090 (0.02)	0.140 (0.03)	0.138 (0.02)
3	0.503 (0.03)	0.169 (0.03)	0.033 (0.01)	0.043 (0.01)	0.046 (0.01)
4	0.504 (0.03)	0.185 (0.03)	0.175 (0.03)	0.226 (0.03)	0.097 (0.02)
5	0.334 (0.03)	0.228 (0.03)	0.151 (0.03)	0.154 (0.03)	0.376 (0.03)
6	0.438 (0.04)	0.243 (0.03)	0.178 (0.03)	0.190 (0.03)	0.237 (0.03)
7	0.504 (0.04)	0.109 (0.02)	0.041 (0.01)	0.043 (0.01)	0.063 (0.02)
8	0.510 (0.04)	0.126 (0.02)	0.179 (0.03)	0.187 (0.03)	0.132 (0.02)

Average classification error rate and standard deviation (in parentheses) per metric for the classification scenarios, which tested the metrics' capacities to correctly classify the curves into sensitive and resistant groups.

AUC, area under the dose-response curve.

Ranking Scheme

The ranking scheme comprised eight scenarios to test the five metrics, and the mean correlation (MC) and SD were calculated for each metric in each scenario (Table 3). Again, fitted AUC performed best overall with the highest MC to true rank in six out of eight scenarios; trapezoidal AUC performed similarly in most cases. E_max had the highest MC to true ranks in two out of eight scenarios, again when the curves were not parallel and the bottoms were truncated. Relative IC₅₀ and absolute IC₅₀ did not have the highest MC to true ranks in any of the scenarios. Once again, the differences in MC to true ranks between absolute IC₅₀, fitted AUC, and trapezoidal AUC were minimal, with only relative IC₅₀ and E_max showing comparatively lower MC to true ranks in certain scenarios. As before, absolute IC₅₀ was always preferable to relative IC₅₀. In this study, classification and ranking scenarios showed similar trends in performance by the metrics, which serves to confirm the findings outlined here.

Table 3.

Mean Correlation for Ranking Scheme Scenarios

Scenario number	Relative IC₅₀	Absolute IC₅₀	Fitted AUC	Trapezoidal AUC	E_max
1	0.903 (0.01)	0.909 (0.01)	0.918 (0.01)	0.907 (0.01)	0.201 (0.07)
2	0.720 (0.04)	0.889 (0.01)	0.924 (0.01)	0.890 (0.01)	0.886 (0.01)
3	0.001 (0.06)	0.865 (0.02)	0.956 (0.01)	0.950 (0.01)	0.949 (0.01)
4	0.006 (0.07)	0.843 (0.02)	0.857 (0.02)	0.805 (0.02)	0.919 (0.01)
5	0.606 (0.04)	0.807 (0.02)	0.881 (0.01)	0.878 (0.01)	0.481 (0.05)
6	0.286 (0.08)	0.764 (0.03)	0.857 (0.02)	0.847 (0.02)	0.786 (0.02)
7	0.001 (0.07)	0.913 (0.01)	0.950 (0.01)	0.949 (0.01)	0.940 (0.01)
8	0.025 (0.08)	0.879 (0.01)	0.858 (0.02)	0.850 (0.02)	0.896 (0.01)

Mean correlation and standard deviation (in parentheses) per metric for the ranking scenarios, which tested the metrics' abilities to order the curves based on true parameter values.

CV and Effect Size

Assuming the true 4PL parameters are (β₁, β₂, β₃, β₄) = (1, 0.25, 5, 1), and the error variance σ²=0.01 (the conclusion is independent of the actual values used), the CV calculations for trapezoidal AUC, relative IC₅₀, and E_max were made based on the derived formulas; the CVs for fitted AUC and absolute IC₅₀ were calculated from simulations (Table 4A). The differences between the CV values in the theoretical and simulation instances were very small, showing good agreement between the two approaches. The ranking of the CV values was fitted AUC<trapezoidal AUC<relative IC₅₀<absolute IC₅₀<E_max, which provides an indication of the relative stability of the five metrics. In terms of the effect size (the larger the value the greater the differentiating power of the metric for the classification scheme), fitted AUC performed the best, and trapezoidal AUC behaved similarly in all of the scenarios (Table 4B). Relative IC₅₀ had the smallest effect size for most of the cases; absolute IC₅₀ and E_max were in between. These results confirm the simulation results in Table 2.

Table 4A.

The Theoretical and Simulation Coefficient of Variation for the Classification Scenarios for Each Metric

CV	Relative IC₅₀	Absolute IC₅₀	Fitted AUC	Trapezoidal AUC	E_max
Theoretical	0.096	NA	NA	0.053	0.215
Simulation	0.105	0.123	0.054	0.056	0.222

CV, coefficient of variation; NA, theoretical formula not available.

Table 4B.

Effect Size for the Classification Scenarios for Each Metric

Scenario	Relative IC₅₀	Absolute IC₅₀	Fitted AUC	Trapezoidal AUC	E_max
1	2.07	2.44	2.59	2.40	0.59
2	1.19	2.22	2.66	2.18	2.75
3	0.00	1.92	3.67	3.40	3.29
4	0.00	1.78	1.86	1.53	1.92
5	0.77	1.54	2.12	2.06	0.90
6	0.29	0.51	1.89	1.78	1.55
7	0.00	2.45	3.44	3.38	2.87
8	0.00	0.87	1.90	1.80	1.84

These data serve to explain the metrics' performances in the simulations. Bigger effect size is expected to result in lower classification error rate.

Real-Life Scenario 1: T/FAC-Treated Differentiation Based on ER Status

Based on P values by t-test, Wilcoxon rank test, and chi-square test, absolute IC₅₀ had the smallest P values and therefore differentiated best on ER status, followed by E_max and then relative IC₅₀ and fitted AUC (Table 5). Note that the dose–response curves may not have covered the whole span of the sigmoidal profile, so this may have resulted in a favoring of particular metrics over others (Fig. 2). Although absolute IC₅₀ had the smallest P values in these cases, the differences in P values across all metrics and tests completed were very small, and therefore, all metrics could be regarded as differentiating based on ER status equally well.

Fig. 2.

ChemoFx dose–response curves of the 27 breast cancer cell lines treated with T/FAC (paclitaxel [T], 5-fluorouracil [F], doxorubicin [A], and cyclophosphamide [C]). Each data point represents the average of triplicate measurements at a given dose. There were 17 estrogen receptor-negative (ER−) cell lines (black) and 10 estrogen receptor-positive (ER+) cell lines (gray).

Table 5.

P values to Assess the Metrics' Abilities to Differentiate 27 T/FAC-Treated Cell Lines Based on Estrogen Receptor Status

	Relative IC₅₀	Absolute IC₅₀	Fitted AUC	Trapezoidal AUC	E_max
t-test	0.0105	0.0009	0.0072	0.0073	0.0021
Wilcoxon rank test	0.0058	0.0016	0.0050	0.0058	0.0019
Chi-square test	0.0006	0.0006	0.0010	0.0010	0.0014

t-Test, Wilcoxon rank test, and chi-square test were run to evaluate each metric's ability to classify cell lines by ER status after generation of T/FAC (paclitaxel [T], 5-fluorouracil [F], doxorubicin [A], and cyclophosphamide [C]) dose–response curves by ChemoFx.
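As an illustration of how one of these group comparisons can be computed, the following minimal sketch implements a two-sided Wilcoxon rank-sum test with the normal approximation. It assumes average ranks for ties and applies no tie or continuity correction, so its P values may differ slightly from those of a statistical package. The function names and the metric values are hypothetical and are not the data behind Table 5.

```python
import math

def average_ranks(values):
    """Return 1-based ranks, giving tied values the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2.0 + 1.0  # mean of positions i..j, shifted to 1-based
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks

def rank_sum_test(group_a, group_b):
    """Two-sided rank-sum test; returns (U for group_a, z score, P value)."""
    n_a, n_b = len(group_a), len(group_b)
    ranks = average_ranks(list(group_a) + list(group_b))
    u_a = sum(ranks[:n_a]) - n_a * (n_a + 1) / 2.0
    mean_u = n_a * n_b / 2.0
    sd_u = math.sqrt(n_a * n_b * (n_a + n_b + 1) / 12.0)
    z = (u_a - mean_u) / sd_u
    p_value = math.erfc(abs(z) / math.sqrt(2.0))  # equals 2 * (1 - Phi(|z|))
    return u_a, z, p_value

# Hypothetical metric values (e.g., an AUC) for two groups of cell lines.
er_positive = [310.0, 295.0, 330.0, 305.0, 320.0]
er_negative = [240.0, 260.0, 300.0, 235.0, 250.0, 270.0]

u, z, p = rank_sum_test(er_positive, er_negative)
print(f"U = {u:.1f}, z = {z:.3f}, P = {p:.4f}")
```

For groups as small as those in this sketch, an exact test is generally preferable to the normal approximation.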

Real-Life Scenario 2: Paclitaxel-Treated Comparison with Sanger Database

The correlation coefficients (Table 6) associated each metric's values for each of the 30 cell lines to the corresponding paclitaxel IC₅₀ values contained in the Sanger database. The highest correlation coefficients were generated by fitted and trapezoidal AUC, followed by absolute IC₅₀, then E_max, and finally relative IC₅₀.

Table 6.

Correlation Coefficients to Demonstrate the Degree of Correlation Between the Five Metrics and the Sanger Database of IC₅₀ Values in Paclitaxel-Treated Cells

	Relative IC₅₀	Absolute IC₅₀	Fitted AUC	Trapezoidal AUC	E_max
Pearson's correlation coefficient	0.4028	0.5187	0.5641	0.5735	0.4771

Each of the five metrics was used to analyze paclitaxel dose–response curves produced by ChemoFx for 30 breast cancer cell lines. Pearson's correlation coefficients were then calculated between each metric's values for the 30 cell lines and the corresponding paclitaxel IC₅₀ values for those same cell lines in the publicly available Sanger database.
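The correlation reported in Table 6 is the standard Pearson product-moment coefficient. A minimal sketch of that calculation is given below; the function name and the paired values are hypothetical and stand in for one metric's values and the matching reference IC₅₀ values.

```python
import math

def pearson_r(x, y):
    """Pearson product-moment correlation between two equal-length sequences."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("x and y must have the same length (at least 2)")
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # Sums of cross-products and squared deviations about the means.
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

# Hypothetical paired values for five cell lines (illustrative only).
metric_values = [210.0, 250.0, 190.0, 300.0, 275.0]
reference_log_ic50 = [-2.1, -1.6, -2.4, -0.9, -1.5]

print(f"r = {pearson_r(metric_values, reference_log_ic50):.4f}")
```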

Real-Life Scenario 3: Paclitaxel-Treated Differentiation Based on ER Status

Table 7 reports the t-test P values for the difference in each metric's values between the ER+ and ER− groups. Again, fitted and trapezoidal AUC showed the smallest P values, followed by E_max, relative IC₅₀, and absolute IC₅₀. The dose–response curves for the 21 paclitaxel-treated breast cancer cell lines are shown in Figure 3.

Fig. 3.

ChemoFx dose–response curves of the 21 breast cancer cell lines treated with paclitaxel. The paclitaxel dose–response curves are plotted for the 21 breast cancer cell lines of known ER status. Nine cell lines were ER+ (in gray) and 12 cell lines were ER− (in black).

Table 7.

P Values to Determine the Metrics' Abilities to Differentiate Paclitaxel-Treated Cell Lines Based on Estrogen Receptor Status

	Relative IC₅₀	Absolute IC₅₀	Fitted AUC	Trapezoidal AUC	E_max
t-Test P value	0.0662	0.2695	0.0236	0.0269	0.0565

ChemoFx-generated dose–response curves for paclitaxel-treated breast cancer cell lines were analyzed by the five metrics. t-Tests were run to determine how well each metric separated the cell lines into ER+ and ER− groups.

Discussion

The purpose of this study was to investigate five metrics that are commonly applied to 4PL dose–response curves by assessing their classification and ranking accuracy across computer simulations and real-life scenarios. The goal was to identify the metric that performs best in classifying and ranking dose–response curves. Although other models are sometimes used in practice (e.g., polynomial models, smoothing spline models, 5-parameter logistic models), 4PL is the most popular. Therefore, this research focused on comparing the metrics when the underlying models were assumed to be 4PL. When a curve cannot be approximated by 4PL (for example, when it is bell-shaped), metrics such as IC₅₀ and fitted AUC cannot be derived directly without modifying the model, and in certain cases a point estimate such as IC₅₀ may no longer be a suitable metric. For assays with fixed concentrations, trapezoidal AUC and E_max can still be calculated in the same way, because they are model-free.
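To make the five metrics concrete, the following minimal sketch computes each of them for a single curve. It assumes a common 4PL parameterization, response = bottom + (top − bottom)/(1 + (dose/IC₅₀)^slope), with the response expressed as percent survival, AUC taken over log₁₀ dose, and the fitted AUC approximated by numerical integration of the fitted curve. The function names, parameter values, and dose series are illustrative assumptions and are not necessarily the parameterization or data used in this study.

```python
import math

def four_pl(dose, top, bottom, ic50, slope):
    """Assumed 4PL form: response falls from `top` to `bottom` as dose rises."""
    return bottom + (top - bottom) / (1.0 + (dose / ic50) ** slope)

def absolute_ic50(top, bottom, ic50, slope, level=50.0):
    """Dose at which the fitted response equals `level` (None if never reached)."""
    if not bottom < level < top:
        return None
    return ic50 * ((top - level) / (level - bottom)) ** (1.0 / slope)

def trapezoidal_auc(doses, responses):
    """Model-free AUC over log10(dose) by the trapezoidal rule."""
    logs = [math.log10(d) for d in doses]
    return sum(
        (logs[i + 1] - logs[i]) * (responses[i] + responses[i + 1]) / 2.0
        for i in range(len(logs) - 1)
    )

def fitted_auc(params, min_dose, max_dose, steps=1000):
    """AUC of the fitted 4PL curve over log10(dose), integrated numerically."""
    lo, hi = math.log10(min_dose), math.log10(max_dose)
    grid = [10 ** (lo + (hi - lo) * i / steps) for i in range(steps + 1)]
    return trapezoidal_auc(grid, [four_pl(d, *params) for d in grid])

# Hypothetical curve and dose series (illustrative values only).
params = (100.0, 20.0, 1.0, 1.0)  # top, bottom, ic50, slope
doses = [0.01, 0.1, 1.0, 10.0, 100.0]
responses = [four_pl(d, *params) for d in doses]  # noise-free for simplicity

metrics = {
    "relative_ic50": params[2],  # midpoint parameter of the 4PL curve
    "absolute_ic50": absolute_ic50(*params),  # dose giving a 50% response
    "fitted_auc": fitted_auc(params, doses[0], doses[-1]),
    "trapezoidal_auc": trapezoidal_auc(doses, responses),
    "e_max": responses[-1],  # response at the highest dose tested
}
for name, value in metrics.items():
    print(f"{name}: {value:.4f}")
```

In this sketch the absolute IC₅₀ is undefined whenever the curve never crosses the 50% level, whereas the two AUC values and E_max can always be computed from the tested doses.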

Although relative IC₅₀ is probably the most widely used metric for analyzing dose–response curves, the data generated here suggest that it performs worse than the other options in the majority of generated datasets. Ultimately, fitted AUC was the metric that showed the best overall performance in both the classification and ranking simulation scenarios, performed well in differentiating based on ER status, and demonstrated high correlation with publicly recognized drug sensitivities. In practice, obtaining a good fit of a 4PL model may pose a challenge to those who do not use sophisticated statistical packages; in that case, trapezoidal AUC can be used instead of fitted AUC, because it showed comparable performance in most situations. In certain data situations, E_max was slightly better than the AUC calculations and was usually comparable to absolute IC₅₀; however, the difference between E_max and the AUC calculations in those cases was insignificant. Finally, AUC and absolute IC₅₀ always performed better than relative IC₅₀.

This research considered IC₅₀ to be a point-sensitive metric; the conclusion should be easily extended to IC_#, where # is a number between 0 and 100. This investigation focused on the situations in which the curves did not cross, even though they might have been nonparallel, that is, all of the metrics were positively correlated. Therefore, it is intuitive that point-sensitive metrics such as IC₅₀ and E_max are not as reliable as summary-sensitive metrics such as AUC, because AUC accumulates information across the entire dose range. When curves do cross (Supplementary Fig. S3), it is much harder to rank curves. For instance, IC₂₀ and IC₈₀ could give opposite rankings for two curves. In those cases, the comparison and choice of the best metric would have to incorporate other information such as clinical knowledge.
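The point about crossing curves can be checked with a small worked example. The sketch below assumes a 4PL curve in which the fraction of maximal effect at a given dose is r/(1 + r), with r = (dose/IC₅₀)^slope, so that IC_# = IC₅₀ × (#/(100 − #))^(1/slope). The two curves, their parameter values, and the function name are hypothetical.

```python
def ic_percent(percent, ic50, slope):
    """Dose producing `percent` of the maximal effect under an assumed 4PL curve.

    With effect fraction f = r / (1 + r) and r = (dose / ic50) ** slope,
    solving for dose gives ic50 * (f / (1 - f)) ** (1 / slope).
    """
    if not 0.0 < percent < 100.0:
        raise ValueError("percent must be strictly between 0 and 100")
    ratio = percent / (100.0 - percent)
    return ic50 * ratio ** (1.0 / slope)

# Two hypothetical curves with the same midpoint but different slopes,
# so they cross at the shared IC50.
shallow = {"ic50": 1.0, "slope": 0.5}
steep = {"ic50": 1.0, "slope": 2.0}

for percent in (20.0, 50.0, 80.0):
    dose_shallow = ic_percent(percent, **shallow)
    dose_steep = ic_percent(percent, **steep)
    if dose_shallow < dose_steep:
        verdict = "shallow curve ranks as more potent"
    elif dose_shallow > dose_steep:
        verdict = "steep curve ranks as more potent"
    else:
        verdict = "tie"
    print(f"IC{percent:.0f}: shallow={dose_shallow:.4f}, "
          f"steep={dose_steep:.4f} -> {verdict}")
```

Under these assumed parameters, IC₂₀ favors the shallow curve, IC₈₀ favors the steep curve, and IC₅₀ cannot separate them.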

One caveat for using AUC and E_max is that their values depend on the range of doses examined and the sensitivity of the assay technology used.²⁶–²⁸ To appropriately compare two dose–response curves using either AUC or E_max, the curves need to be measured across the same range of doses using the same assay detection technology. In contrast, calculating IC₅₀ requires only enough data to fit a 4PL model. If the data can be fit by a 4PL model, the resulting IC₅₀ remains relatively stable and is not affected by the dose range tested or the assay technology used. Therefore, although AUC was the metric that demonstrated the best overall performance in this study, AUC and E_max should be reserved for comparing only curves that cover the same range of doses using the same assay technology. On the other hand, IC₅₀ requires that the data be approximated by a 4PL model. If the data do not produce a good fit of the 4PL model, then the IC₅₀ estimate is not reliable. By contrast, trapezoidal AUC and E_max are nonparametric, and therefore no model assumptions are required.
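The dose-range caveat can be illustrated numerically. The sketch below evaluates one hypothetical 4PL curve over two different dose ranges; the parameterization (response = bottom + (top − bottom)/(1 + (dose/IC₅₀)^slope)), the parameter values, and the dose series are assumptions chosen only for illustration.

```python
import math

def four_pl(dose, top, bottom, ic50, slope):
    """Assumed 4PL form: response falls from `top` to `bottom` as dose rises."""
    return bottom + (top - bottom) / (1.0 + (dose / ic50) ** slope)

def trapezoidal_auc(doses, responses):
    """Model-free AUC over log10(dose) by the trapezoidal rule."""
    logs = [math.log10(d) for d in doses]
    return sum(
        (logs[i + 1] - logs[i]) * (responses[i] + responses[i + 1]) / 2.0
        for i in range(len(logs) - 1)
    )

def summarize(doses, params):
    """Return (trapezoidal AUC, E_max) for one curve over the given doses."""
    responses = [four_pl(d, *params) for d in doses]
    return trapezoidal_auc(doses, responses), responses[-1]

params = (100.0, 20.0, 1.0, 1.0)  # top, bottom, ic50, slope (hypothetical)
wide_doses = [0.01, 0.1, 1.0, 10.0, 100.0]
narrow_doses = [0.01, 0.1, 1.0]  # same curve, truncated at the midpoint

wide_auc, wide_e_max = summarize(wide_doses, params)
narrow_auc, narrow_e_max = summarize(narrow_doses, params)

print(f"wide range:   AUC={wide_auc:.2f}, E_max={wide_e_max:.2f}")
print(f"narrow range: AUC={narrow_auc:.2f}, E_max={narrow_e_max:.2f}")
print(f"IC50 parameter in both cases: {params[2]}")
```

The underlying curve, and hence its IC₅₀ parameter, is identical in both cases, yet AUC and E_max change with the doses tested. In practice the narrow range might also be too limited to fit the 4PL model reliably.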

In this investigation of real-life scenarios, it was necessary to make some assumptions to proceed. The first assumption was based on the consistent in-house observation that ChemoFx outcomes correlate to the ER status of the cell line tested. Specifically, ER+ cell lines are more often labeled “resistant” by the assay, whereas ER− cell lines are more often labeled “sensitive.” Reports in the literature confirm that ER status can be correlated with the outcomes of in vitro ATP assays,¹¹ yet no publications reflect the same correlation between ER status and ChemoFx outcome. In this investigation, this observation was assumed to reflect the truth.

It might seem surprising that the IC₅₀ values from the ChemoFx assays did not show the strongest correlation with the IC₅₀ values from the Sanger data. Possible explanations are as follows: (1) the Sanger data were based on ATP assays, whereas ChemoFx used direct counting of the surviving cells; (2) the doses differed between the two assays, and although in theory this should not affect the IC₅₀ estimates, in practice it could contribute to the differences; (3) the absolute IC₅₀ correlation of 0.52 is comparable to the AUC correlations of 0.56 (fitted) and 0.57 (trapezoidal), whereas relative IC₅₀ has the lowest correlation, 0.40 (Table 6); and (4) even in scenario 1 of the simulations (in both the Classification scheme and the Ranking scheme), where the IC₅₀ value is the only differentiating factor between the curves and the other three parameters are identical, so that IC₅₀ methods would be expected to perform best, the AUC methods still showed superior performance.

In conclusion, this study demonstrated that, although widely used, relative IC₅₀ is not as accurate as AUC or absolute IC₅₀ in most situations, when ranking and classification of 4PL dose–response curves are necessary. AUC is generally the most accurate and best to use in the situations examined here. Fitted AUC should be utilized whenever sophisticated statistical software is available to accurately fit dose–response curves, but trapezoidal AUC is recommended if basic, more limited, software packages are used. The findings reported here can be extremely beneficial to researchers investigating candidate compounds in drug–response assays and should be considered to more precisely evaluate dose–response curves in the laboratory.

Footnotes

Acknowledgments

The authors thank Rebecca J. Palmer, PhD, for her assistance in the preparation of this manuscript and also thank the Informatics Team members at Precision for valuable inputs to this research as well as the development of the manuscript.

Disclosure Statement

No competing financial interests exist.

References

1. Findlay, Dillard. Appropriate calibration curve fitting in ligand binding assays. AAPS J, 2007; 9:E260–E267.

2. DeLean, Munson, Rodbard. Simultaneous analysis of families of sigmoidal curves: application to bioassay, radioligand assay, and physiological dose-response curves. Am J Physiol, 1978; 235:E97–E102.

3. Volund. Application of the four-parameter logistic model to bioassay: comparison with slope ratio and parallel line models. Biometrics, 1978; 34:357–365.

4. Gottschalk, Dunn. The five-parameter logistic: a characterization and comparison with the four-parameter logistic. Anal Biochem, 2005; 343:54–65.

5. Dudley, Edwards, Ekins, Finney, McKenzie IGM, Raab, Rodbard, Rodgers RPC. Guidelines for immunoassay data processing. Clin Chem, 1985; 31:1264–1271.

6. Singh, Prasad, Singer, MacAllister. Ageing is associated with impairment of nitric oxide and prostanoid dilator pathways in the human forearm. Clin Sci, 2002; 102:595–600.

7. Gentile, Skoner. The relationship between airway hyperreactivity (AHR) and sodium, potassium adenosine triphosphatase (Na+,K+ATPase) enzyme inhibition. J Allergy Clin Immunol, 1997; 99:367–373.

8. Kayano, Horiuchi, Mori, et al. A simulation study to evaluate limited sampling strategies to estimate area under the curve of drug concentration versus time following repetitive oral dosing: limited sampling model versus naive trapezoidal method. Biol Pharm Bull, 2009; 32:1486–1490.

9. Saint-Marcoux, Royer, Debord, et al. Pharmacokinetic modelling and development of Bayesian estimators for therapeutic drug monitoring of mycophenolate mofetil in reduced-intensity haematopoietic stem cell transplantation. Clin Pharmacokinet, 2009; 48:667–675.

10. Margolis, Bilker, Boston, Localio, Berlin. Statistical characteristics of area under the receiver operating characteristic curve for a simple prognostic model using traditional and bootstrapped approaches. J Clin Epidemiol, 2002; 55:518–524.

11. Koo, Jung, Shin, et al. Impact of grade, hormone receptor, and HER-2 status in women with breast cancer on response to specific chemotherapeutic agents by in vitro adenosine triphosphate-based chemotherapy response assay. J Korean Med Sci, 2009; 24:1150–1157.

12. d'Amato, Landreneau, McKenna, Santos, Parker. Prevalence of in vitro extreme chemotherapy resistance in resected nonsmall-cell lung cancer. Ann Thorac Surg, 2006; 81:440–446; discussion 446–447.

13. Hioe, Wrin, Seaman, et al. Anti-V3 monoclonal antibodies display broad neutralizing activities against multiple HIV-1 subtypes. PLoS One, 2010; 5:e10254.

14. MacDougall. Analysis of Dose-Response Studies—Emax Model. Springer: New York, 2006.

15. Smith, Sittampalam. Conceptual and statistical issues in the validation of analytic dilution assays for pharmaceutical applications. J Biopharm Stat, 1998; 8:509–532.

16. Glossary of Quantitative Biology Terms. www.ncgc.nih.gov/guidance/glossary.html. 2010 October 14.

17. Kang, Smith, Morton, Keshelava, Houghton, Reynolds. National cancer institute pediatric preclinical testing program: model description for in vitro cytotoxicity testing. Pediatr Blood Cancer, 2011; 56:239–249.

18. The R Project for Statistical Computing. www.r-project.org. 2010 October 14.

19. Neve, Chin, Fridlyand, et al. A collection of breast cancer cell lines for the study of functionally distinct cancer subtypes. Cancer Cell, 2006; 10:515–527.

20. Hess. Statistical issues in clinical trial design. Curr Oncol Rep, 2007; 9:55–59.

21. Brower, Fensterer, Bush. The ChemoFx assay: an ex vivo chemosensitivity, resistance assay for predicting patient response to cancer chemotherapy. Methods Mol Biol, 2008; 414:57–78.

22. Mor, Alvero. Apoptosis, Cancer: Methods, Protocols. Humana Press: Totowa, NJ, 2008.

23. Genomics of Drug Sensitivity in Cancer, Summary Information December 21, 2010. www.sanger.ac.uk/genetics/CGP/translation/add_info.shtml. 2011 June 7.

24. Genomics of Drug Sensitivity in Cancer January 5, 2011. www.sanger.ac.uk/genetics/CGP/translation/compound_sens_data.shtml. 2011 June 7.

25. Genomics of Drug Sensitivity in Cancer, Screening Methodologies December 21, 2010. www.sanger.ac.uk/genetics/CGP/translation/screening_meth.shtml- curve_fit. 2011 June 7.

26. Hand. Measuring classifier performance: a coherent alternative to the area under the ROC curve. Mach Learn, 2009; 77:103–123.

27. Toutain. Veterinary Pharmacology and Therapeutics, 9th ed. Wiley-Blackwell: Ames, IA, 2009.

28. Hanczar, Hua, Sima, Weinstein, Bittner, Dougherty. Small-sample precision of ROC-related estimates. Bioinformatics, 2010; 26:822–830.
