Abstract
Integrating information from in vitro, in silico, and in chemico methods into toxicity testing strategies has been widely considered the way to phase out animal testing. At the same time, testing strategies using new approaches and methods should provide adequate and relevant information about chemicals' hazardous properties. We reviewed objectives and requirements for guiding the process of data integration that are suggested in the scientific literature. Based on the existing approaches, we developed criteria for resource-efficient testing strategies, and we evaluated existing testing strategies for skin sensitization hazard and risk assessment under these criteria. We conclude that existing testing strategies, with two exceptions, still focus predominantly on maximizing toxicity information but largely ignore resource efficiency criteria. Balancing the information gained from testing strategies against their direct and indirect costs (including welfare losses for society in case of unintended health or environmental damages) is a necessary condition for transparent comparisons of their resource efficiency. Therefore, developing approaches for balancing information gains and costs should become an explicit part of the developmental process of nonanimal testing strategies to ensure that phasing out animal testing complies not only with regulatory information requirements but also with available resources.
Introduction
Source: OECD (2016). 19
The development of NAMs and testing strategies has been driven by multiple objectives. Besides replacing or reducing animal testing, a key purpose has been the need to acquire sufficient and relevant information about chemicals' hazardous properties in less time and at lower cost than traditional animal tests.20,23–25 For specific human endpoints such as skin sensitization, available mechanistic information about the pathway from a molecular initiating event (MIE) through the sequence of key events that ultimately cause an adverse effect, that is, the adverse outcome pathway (AOP),26,27 has been considered a guiding principle and component for the construction of testing strategies.19,28–30
Clearly, these multiple objectives are not necessarily complementary. Maximizing the information outcomes from testing strategies, decreasing the time and the number of test substances needed to attain hazard or risk information, reducing costs of testing, and minimizing or even avoiding the use of animals are likely to be competing objectives. 31 A crucial challenge for the development of testing strategies is, therefore, to define criteria that allow balancing conflicting objectives to ensure that the outcome of an IATA, that is, information for hazard and/or risk assessment purposes, is generated in an efficient manner.
Resource efficiency—the use of resources to either maximize information outcomes at a given resource endowment or to minimize costs for achieving a given outcome target—is a fundamental economic paradigm. 32 Although criteria and conceptual requirements for developing resource-efficient testing strategies have been proposed in the toxicological literature,20,21,25,33 a comprehensive review of these criteria, and an evaluation of existing testing strategies under these criteria, has not been provided.
The aim of this article is, therefore, to review criteria proposed in the scientific toxicological literature for constructing testing strategies to be used for hazard and risk assessment of chemicals. Next, we propose conceptual and informational criteria that can guide resource-efficient data integration. We then evaluate existing testing strategies for assessing skin sensitization potential (i.e., hazard identification) and potency (i.e., subcategorization into weak, moderate, strong, and extreme sensitizers) with respect to these criteria.
Methods: Survey of the Scientific Literature and Selection of Studies
For identifying requirements and conceptual criteria that are considered relevant for combining different types of information into testing strategies, we conducted a systematic survey of the toxicological literature published between 2000 and 2017. This period was chosen because it covers course-setting developments in chemicals' risk management, for example, the development and enforcement of the European REACH legislation, 34 or the publication of the path-breaking report of the U.S. National Research Council (NRC) “Toxicity Testing in the 21st Century: A Vision and a Strategy.”33,35
The aim of the survey was to select scientific studies that address and discuss general objectives, requirements, and criteria for constructing testing strategies. Given the different terminologies used for describing the overall process of combining different types of information into testing strategies, and to ensure that the literature analysis captured a sufficiently broad spectrum of studies, the search was based on individual and composed search terms as shown in Figure 1.

Figure 1. Keywords and composed search terms for identifying scientific studies addressing general objectives, requirements, and criteria for integrating information into testing strategies.
The search was conducted in two online literature databases, Scopus (https://www-scopus-com-s.web.bisu.edu.cn/search/form.uri?zone=TopNavBar&origin=recordpage) and PubMed (https://www.ncbi.nlm.nih.gov/pubmed?otool=inlwurlib). Search terms were matched against the title, the abstract, and the keywords of a study. Since the focus of the analysis was on identifying general, that is, endpoint-independent, criteria and requirements, we did not consider studies that referred to specific endpoints, endpoint categories, or substance groups (acute toxicity, skin irritation and sensitization, reproductive toxicity, repeated dose toxicity, etc.). Furthermore, policy reports or guideline documents, and studies that did not explicitly discuss criteria and requirements for integrating information, were not included in the analysis.
Results
Objectives, criteria, and requirements for integrating information into testing strategies
The individual and composed term-based literature search revealed a total of 47 studies, 15 of which discussed objectives and criteria or requirements for integrating information into testing strategies and were thus considered sufficiently informative for the evaluation (Table 2). For the 15 studies we also checked the titles, keywords, and abstracts of the quoted references for the search terms. The search term “defined approach” could not be identified in the title, abstract, or keywords of publications in the Scopus and the PubMed databases. Consequently, this search term did not retrieve results.
The study was also identified under other composed search terms.
KE, key event; MOE, margin of exposure.
As already mentioned in the introduction, most studies pointed to multiple objectives underlying the development of testing strategies. These comprise outcome-based objectives (e.g., reduction of animals and costs and high protection of human health), conceptual objectives (e.g., combination of different pieces of information and tools), and procedural objectives (e.g., cost efficiency of the assessment, generating adequate information at low cost and with minimal or no animal use, and prioritization of testing). Regarding the criteria and requirements that were considered important for the development of testing strategies, we found a broad range of aspects that can be grouped into (i) criteria related to generating information (e.g., about a hazard or a risk) and (ii) criteria guiding the conceptual process of combining different types of information and data integration.
Criteria for generating information about chemical hazards or risks by means of a testing strategy
Several studies emphasized that exploiting all available information from different sources, including testing (e.g., in vitro and high throughput in vitro methods) and nontesting approaches (e.g., in chemico methods and in silico methods), is a key requirement for constructing testing strategies.7,20,21,25,36,37 Although there seems to be general agreement that the use of animal tests should be minimized, some studies still suggest including in vivo tests in a testing strategy “as a last resort.” 38 A testing strategy should start with a careful evaluation of all existing information, for example, from in vivo or in vitro approaches and existing exposure information, to decide whether remaining data gaps need to be filled with additional testing, how data gaps can be filled most efficiently, or whether exposure-based waiving approaches can be applied.36,38,39 Furthermore, it was pointed out that the uncertainty of outcomes from individual building blocks in a testing strategy, and of the conclusion that is ultimately adopted, should be assessed and minimized.21,33,40
Criteria for guiding the conceptual process of information and data integration in a testing strategy
Corresponding to the use and combination of different pieces of information, several authors underlined that testing strategies should remain flexible regarding the selection of methods and the order of steps in the assessment.20,25,39 In general, the integration of information was characterized as a dynamic process that progresses along with the development of testing methods in combination with exposure information, and the exploitation of mechanistic information.7,41 Some studies explicitly pointed to the potential of AOPs to prioritize and guide testing, but also to the use of AOPs as prediction tools (e.g., of MIEs) within testing strategies.7,30 However, the confidence in the events of an AOP, and of the AOP as a whole, should be documented by means of semiquantitative and quantitative methods. 30 In addition, it was repeatedly acknowledged that the development of testing strategies requires criteria and tools addressing how the combination of information from different sources can be done in an “optimal” way. Several studies pointed to the urgent need to structure testing strategies more efficiently.20,21,25,33 This also included the recommendation to use quantitative performance parameters and statistical methods for evaluating data quality and the (inter- and intralaboratory) variability of the data,25,42 but also information on testing costs, time, and regulatory acceptance. 10 In this context, the terms “efficiency,” “resource-efficiency,” or “cost-effectiveness” were frequently used to characterize the process of balancing different types of information toward an overall result of the strategy that is considered reliable, robust, and relevant. 33
Defining “resource-efficient” testing
The studies listed in Table 2 emphasize the need to combine different types of data about a chemical's properties with information about the resource use for generating these data. In the earlier literature on chemicals' testing and safety evaluation, too, “resource-efficiency” 43 or “cost-effectiveness”44,45 were frequently proposed as targets of a modern approach to toxicity testing and safety evaluation. However, the meaning of these terms in the context of the development of testing strategies has not been made concrete. The term “efficiency” or “resource efficiency” is a key economic decision criterion for guiding the allocation of scarce resources. In the economics literature, “resource efficiency” denotes an allocation of resources that allows achieving a given outcome target with a minimum of resources. 32 Assuming that the ultimate goal of toxicity testing is to allow for better-informed decisions on the use of chemicals, toxicological testing requires a variety of resources, in particular appropriate laboratory equipment or computational capacities, manpower, laboratory animals, and time. Clearly, resource use depends on the toxicological effect of interest (the so-called endpoint). The challenge is, therefore, to distribute available resources such that a maximum of output (e.g., hazard information) can be achieved, or, to use a dual formulation, that a certain information outcome can be achieved with a minimum of resources. 46 Thus, efficient or optimal testing can be characterized as a process wherein a maximum of information is achieved at the lowest cost. In the context of chemicals testing, we propose to base efficiency evaluations on the following operational criteria:
Efficiency evaluations of individual testing methods and testing strategies require, first, specifying information gains and costs. Then we need to select appropriate quantifiable parameters for both components. Information gains can be quantified by different metrics. For example, a testing method's information outcome can be characterized in terms of its predictive accuracy, describing “the closeness of agreement between test method results and accepted reference values.” 85 Common metrics, in the simple case of a dichotomized outcome, are sensitivity (the proportion of hazardous substances correctly classified as hazardous by a testing method) and specificity (the proportion of nonhazardous substances correctly classified as nonhazardous by the testing method). In addition, information gains from testing can be characterized by a testing method's reliability, denoting the reproducibility of its results within and between laboratories over time, usually expressed in terms of the method's intra- and inter-laboratory reproducibility. Finally, information about the coverage of specific key events in the AOP of a particular in vivo adverse outcome by a specific testing method is important in order to quantify the informational gains from testing.
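As a minimal illustration of the two accuracy metrics defined above, the following Python sketch computes sensitivity and specificity from the counts of a dichotomized comparison against reference values. The function name and the counts are hypothetical and serve only as an example; they are not taken from any of the reviewed studies.

```python
def sensitivity_specificity(tp, fn, tn, fp):
    """Return (sensitivity, specificity) for a dichotomized test outcome.

    tp/fn: hazardous reference substances classified as hazardous/nonhazardous.
    tn/fp: nonhazardous reference substances classified as nonhazardous/hazardous.
    """
    if tp + fn == 0 or tn + fp == 0:
        raise ValueError("need at least one hazardous and one nonhazardous reference substance")
    sensitivity = tp / (tp + fn)  # share of hazardous substances correctly identified
    specificity = tn / (tn + fp)  # share of nonhazardous substances correctly identified
    return sensitivity, specificity


# Hypothetical counts for illustration only.
sens, spec = sensitivity_specificity(tp=40, fn=10, tn=30, fp=20)
print(f"sensitivity={sens:.2f}, specificity={spec:.2f}")
```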
Testing costs can be distinguished into direct and indirect costs. Direct costs consist of (i) laboratory equipment or computational capacities for conducting a testing method, (ii) laboratory animal welfare loss (in case of an animal test), and (iii) testing time. Indirect testing costs include, for example, expenditures, resources, and time needed for the validation of a (nonanimal) testing method, or switching costs in cases wherein new technologies have to be adopted. 46
The evaluation of resource efficiency can be based on five key criteria. First, the purpose of the assessment, for example a classification or labeling, a hazard or a risk prediction, should be spelled out. Second, efficiency evaluations require defining a mechanism to balance information gains from testing and the costs of this test. This, third, requires specifying how information gains and costs are valued. Basically, two possibilities exist: a monetary and a nonmonetary valuation. In case of a nonmonetary valuation, information or cost parameters are expressed in terms of their natural units (e.g., the proportion of positive chemicals correctly classified or the number of animals used in a test). A monetary assessment requires transferring information or cost parameters into monetary (e.g., Dollar or Euro) values. Although direct costs, that is, expenditures for conducting a test, are usually expressed in monetary terms, a monetary valuation of other cost components (e.g., animal welfare loss) is less common and often not wanted due to, for example, ethical concerns. 47 Likewise, a monetization of test outcomes, in particular the expected gains and losses arising from decisions that are based on these outcomes (e.g., health and environmental benefits and costs from a continued use of a substance), is often not straightforward due to the absence of market-based values.
Fourth, evaluating the resource efficiency of testing strategies must account for different types of uncertainty underlying information gains and costs. Any testing method, including the animal test, has limited precision and accuracy, since it is merely a model representation of a human or environmental toxicological endpoint. Hence, the information outcomes from these methods are uncertain. Ideally, if different (nonanimal) testing methods are combined into a testing strategy, uncertainty will be reduced throughout the strategy. Again, different options exist for assessing the uncertainty of test information and costs. A general distinction can be made between frequentist and Bayesian approaches. Frequentist approaches, for example the calculation of confidence intervals of statistical measures, assess the variability of testing outcomes due to the variability of input data. Applying frequentist approaches requires the underlying data sets to be of sufficient size. Bayesian inference methods, by contrast, adopt a more comprehensive concept of uncertainty by explicitly accounting for a decision-maker's subjective beliefs, for example about a substance's properties. This provides a means for updating information if new data become available. Finally, given that nonanimal testing strategies combine individual nonanimal testing methods, integrating data from individual testing methods requires determining a stopping rule for testing.24,48 Ideally, this stopping rule should be endogenous, that is, it should result from the process of combining different types of information. An endogenous stopping rule is conditional on the results from testing following the sequential steps of a testing strategy. This contrasts with exogenous stopping rules, for example information targets that are defined prior to testing.
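The distinction between frequentist and Bayesian uncertainty assessment drawn above can be sketched in a few lines of Python. The first function computes a Wilson score confidence interval for an estimated proportion such as sensitivity; the Wilson interval is one common choice made here for illustration, not one prescribed by the reviewed studies. The second function applies Bayes' rule to update a prior belief that a substance is hazardous after a test result, assuming that successive test results are conditionally independent given the true hazard status. All names and numbers are hypothetical.

```python
from math import sqrt


def wilson_interval(successes, n, z=1.96):
    """Frequentist view: Wilson score interval for a proportion (z=1.96 is roughly 95%)."""
    if n <= 0:
        raise ValueError("n must be positive")
    p = successes / n
    denom = 1 + z**2 / n
    center = (p + z**2 / (2 * n)) / denom
    half = z * sqrt(p * (1 - p) / n + z**2 / (4 * n**2)) / denom
    return center - half, center + half


def update_hazard_probability(prior, sensitivity, specificity, positive):
    """Bayesian view: posterior probability of hazard after one test result."""
    if positive:
        like_hazard, like_safe = sensitivity, 1 - specificity
    else:
        like_hazard, like_safe = 1 - sensitivity, specificity
    evidence = like_hazard * prior + like_safe * (1 - prior)
    return like_hazard * prior / evidence


# Hypothetical example: 40 of 50 hazardous reference substances detected.
low, high = wilson_interval(40, 50)

# Hypothetical example: two positive results from methods with the same accuracy,
# starting from a prior belief of 0.5 (assumes conditional independence).
belief = 0.5
for result in (True, True):
    belief = update_hazard_probability(belief, 0.8, 0.6, result)

print(f"interval for sensitivity: ({low:.3f}, {high:.3f}); updated belief: {belief:.3f}")
```

The interval narrows only as the reference data set grows, which mirrors the data-size requirement noted above, whereas the Bayesian update can be repeated whenever a new result arrives.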
Evaluating testing strategies addressing skin sensitization according to resource efficiency criteria
Following the identification of general criteria for resource-efficient testing in the previous section, this section evaluates existing testing strategies used for skin sensitization hazard and risk assessment against these criteria. Skin sensitization is the clinically relevant endpoint for assessing allergic contact dermatitis (ACD).49,50 Approximately 15%–20% of the human population suffer from an ACD incident once in their life. 51 Assessing chemicals' ability to cause ACD, that is, their skin sensitization potential (i.e., hazard), is a key information requirement for the safety assessment of chemicals falling under the European chemicals' legislation REACH 34 and the European Cosmetics Regulation. 52 It is the first complex toxicological endpoint for which an AOP has been defined and is used in regulatory practice. 53 Hence, skin sensitization can be used as a case to investigate progress on the challenge of resource-efficient toxicity testing using NAMs.
To date, none of the existing NAMs is considered to provide sufficient information to fully replace the animal tests used for skin sensitization hazard identification and potency assessment. 53 Instead, a combination of in vitro, in chemico, and in silico methods has been considered a promising way forward to generate sufficient information and eventually replace in vivo tests.43,91 During the past years, several testing strategies have been proposed for the assessment of skin sensitization potential and potency.19,41,54,84
In general, testing strategies for skin sensitization potential and potency assessment use different conceptual and methodological approaches to combine information from the individual NAMs. Hence, they are presented in different ways, for example, in the form of qualitative flowcharts,8,55–57 quantitative probabilistic approaches (machine learning) applying artificial neural networks (ANNs),58–60 Bayesian networks (BNs),21,61,62 as deterministic approaches based on a “majority vote” decision rule for batteries of NAMs,63–66 or as score-based batteries of NAMs.67–69 In addition, a regression analysis model 70 and a quantitative model using toxicokinetics and toxicodynamics modeling 71 have been proposed. Based on the criteria defined in Table 3, Table 4 offers a comparative evaluation of existing testing strategies for assessing skin sensitization hazard and potency suggested in the scientific literature, and in the recent OECD guidance document. 19 The testing strategies, therefore, comply with the definition of IATA proposed in the Guidance document recently published by the OECD. 19
Sources: Own collection of information from the OECD Annex I on the case studies proposed as DAs (OECD, 2016) and from individual publications. “2 out of 3” ITS approach63,64,85; Kao ITS and Kao STS68,69; RIVM STS65,66; Stacking meta model 80 ; IDS81,82; BN ITS21,61,62; ANN ITS58–60 ; EC-JRC72,73; Global and local regression models57,70,75; SARA. 71
ACD, allergic contact dermatitis; ANN, artificial neural network; BNs, Bayesian networks; GPMT, guinea pig maximization test; LLNA, local lymph node assay; MI, mutual information; MIE, molecular initiating event; n.a., not available; NPV, negative predictive value; PLS, partial least squares; PPV, positive predictive value; RIVM, Dutch National Institute for Public Health and the Environment; RMS, root mean square; SVM, support vector machine.
Reliability assessments of individual testing methods used in a strategy, accounting for intra- and inter-laboratory reproducibility of the testing methods, are usually provided. Exceptions are the RIVM STS and the Stacking meta model, for which the variability of individual data sources was not explicitly taken into account. 19
For assessing the reliability of testing strategies, different methods are used. In the case of deterministic strategies, these include an assessment of individual methods' reproducibility, applicability domain, or predictive accuracy parameters such as positive predictive value and negative predictive value. Within probabilistic approaches, additional quantitative assessments based on, for instance, Bayes factors and regression analyses are used. Each source of information in a testing strategy (i.e., each individual testing method) addresses a specific key event within the skin sensitization AOP. In general, it is assumed that the first three key events of the AOP (i.e., covalent binding of the electrophilic substance to skin proteins, release of proinflammatory cytokines and induction of cytoprotective cellular pathways, activation and maturation of dendritic cells, and their migration to the local lymph nodes) must be considered in a WoE approach to meet the information requirements of REACH Annex VII and to allow for conclusions on a substance's skin sensitization potential. 53 This is the case for most strategies presented in Table 4.61,62,64,65 However, some testing strategies focus on selected key events only. For example, the Kao DA68,69 covers the first and third key events, whereas the EC-JRC DA covers the MIE only,72–74 which is considered to provide the final conclusion on the skin sensitization potential. 72 The IATA suggested by Patlewicz et al. 75 and the skin allergy risk assessment (SARA) strategy 71 cover the fourth key event.
Considering direct and indirect testing costs, we observe that only a few testing strategies report direct and/or indirect testing cost estimates. These are the “2 out of 3” ITS 76 and the RIVM STS. 65 Studies suggesting probabilistic testing strategies58–60,70 assume that these strategies always save costs because unnecessary testing can be avoided. A transparent assessment of the resource use is, however, lacking. Furthermore, the BN ITS 61 and the “2 out of 3” ITS63,78 suggest that additional testing costs can be avoided if the information collected at a certain step of the strategy is sufficient to conclude on the skin sensitization potential. Again, this is not underpinned by any form of assessment. In a few cases, for example, the nontesting pipeline approach,57,75 the animal test local lymph node assay (LLNA) is proposed as a “last resort.”
Regarding the valuation of information and cost metrics, all selected strategies document the information outcomes in a nonmonetary way. Furthermore, direct or indirect testing costs are reported in only two cases, the “2 out of 3” ITS and the RIVM STS. However, these strategies do not explicitly include cost information in the building process of the strategy. In the case of the “2 out of 3” ITS, a follow-up study 48 applied a decision-theoretic Value-of-Information (VOI) approach. This approach incorporates information about direct testing costs and health damage costs related to ACD occurrence into an efficiency assessment of STSs consisting of selected nonanimal prediction methods for skin sensitization assessment.
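To make the idea of a Value-of-Information assessment more tangible, the following Python sketch computes the expected value of a single test in a generic two-action decision problem. It is a textbook-style illustration under stated assumptions, not the model used in the follow-up study: the decision maker either permits use of a substance (risking a hypothetical health damage cost if it is a sensitizer) or restricts it (forgoing a hypothetical benefit if it is not), and chooses the action with the lower expected loss. All function names, parameters, and numbers are hypothetical.

```python
def expected_loss(p_hazard, damage_cost, forgone_benefit):
    """Expected loss of the better action given the current belief p_hazard."""
    permit = p_hazard * damage_cost              # loss if a sensitizer is permitted
    restrict = (1 - p_hazard) * forgone_benefit  # loss if a nonsensitizer is restricted
    return min(permit, restrict)


def value_of_test(prior, sensitivity, specificity, damage_cost, forgone_benefit):
    """Expected reduction in loss from seeing one test result before deciding."""
    loss_without = expected_loss(prior, damage_cost, forgone_benefit)
    loss_with = 0.0
    # Likelihoods of a positive and of a negative result, for hazardous and safe substances.
    outcomes = ((sensitivity, 1 - specificity), (1 - sensitivity, specificity))
    for like_hazard, like_safe in outcomes:
        p_outcome = like_hazard * prior + like_safe * (1 - prior)
        if p_outcome > 0:
            posterior = like_hazard * prior / p_outcome
            loss_with += p_outcome * expected_loss(posterior, damage_cost, forgone_benefit)
    return loss_without - loss_with


# Hypothetical numbers in arbitrary monetary units.
voi = value_of_test(prior=0.3, sensitivity=0.8, specificity=0.9,
                    damage_cost=100.0, forgone_benefit=40.0)
test_cost = 5.0
print(f"value of information: {voi:.1f}, net of testing cost: {voi - test_cost:.1f}")
```

In such a setting, testing is worthwhile only while the value of the next result exceeds its cost, which is one way an endogenous stopping rule can arise.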
None of the testing strategies included in the evaluation (Table 4) proposes or uses a mechanism for balancing information gains and costs. Furthermore, they do not incorporate an endogenous stopping rule for testing. Instead, the decision to stop testing is based on exogenous rules, for example, a predefined number and order of testing and nontesting methods or AOP coverage.65,66,68 Other deterministic testing strategies such as the “2 out of 3” ITS apply a majority vote rule, under which the decision follows the outcome of two concordant test results.63,64 The RIVM STS 65 uses as a first step a Bayesian QSAR approach, described by Rorije et al., 79 which is followed by tiers of testing methods. The overall conclusion is also based on a majority vote over test results from sequential steps in the strategy. In probabilistic strategies such as the BN ITS,20,61,62 the stopping rule is exogenously determined by defining information targets, for example, the prediction of the skin sensitization potential using the LLNA results as a reference.
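The majority vote logic described above can be sketched as follows. This is a simplified illustration of a “2 out of 3” rule, assuming three dichotomous test results obtained one after another and a decision as soon as two results agree; it does not reproduce the prediction models of the individual methods. The function name and the placeholder tests are hypothetical.

```python
def two_out_of_three(tests):
    """Run up to three tests in order; decide once two results are concordant.

    tests: sequence of three zero-argument callables, each returning True
    for a positive (sensitizer) result and False for a negative result.
    Returns (classified_as_sensitizer, number_of_tests_run).
    """
    if len(tests) != 3:
        raise ValueError("exactly three tests are required")
    results = []
    for run_test in tests:
        results.append(bool(run_test()))
        positives = sum(results)
        negatives = len(results) - positives
        if positives == 2 or negatives == 2:
            return positives == 2, len(results)
    raise AssertionError("unreachable: three results always contain two concordant ones")


# Placeholder tests with fixed outcomes, for illustration only.
decision, tests_run = two_out_of_three([lambda: True, lambda: False, lambda: False])
print(decision, tests_run)
```

The rule stops after two or three tests regardless of what testing costs or how uncertain the results are, which is why it counts as an exogenous stopping rule in the sense used here.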
The uncertainty of relevant parameters for information outcomes is assessed in a variety of ways. The BN ITS20,61,62 offers an elaborate uncertainty assessment with regard to the predictive accuracy of each individual method and its precision, that is, the ability of a method to produce concordant results in repeated testing. The uncertainty assessment is based on Bayesian inference and mutual information theory. In deterministic approaches, the majority vote is frequently applied to test results without explicitly assessing uncertainties. Individual testing methods' reproducibility, interchangeability, and reliability are assessed for methods used in, for example, the “2 out of 3” ITS63,64,85 and the RIVM STS.65,66 For the “2 out of 3” ITS, Leontaridou et al. 76 introduced a statistical approach for quantifying precision, defined as a testing method's or testing strategy's ability to correctly classify substances in repeated applications. By determining the pooled standard deviation of a testing method's results, a range around the classification threshold of a nonanimal testing method's prediction model, the so-called borderline range, is quantified. Within the borderline range, test results are inconclusive due to a testing method's biological and technical variability, that is, test results falling in this range cannot be considered unambiguous.
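A rough sketch of the borderline-range idea is given below. The pooled standard deviation follows the standard formula that weights each substance's sample variance by its degrees of freedom. How the range is placed around the threshold is an assumption made here for illustration (threshold plus or minus one pooled standard deviation, with values above the range read as positive); the text above does not specify the multiplier or the direction of the prediction model. All names and data are hypothetical.

```python
from math import sqrt
from statistics import variance


def pooled_sd(groups):
    """Pooled standard deviation over groups of repeated measurements.

    groups: list of lists, one list of replicate results per substance,
    each with at least two values.
    """
    numerator = sum((len(g) - 1) * variance(g) for g in groups)
    degrees_of_freedom = sum(len(g) - 1 for g in groups)
    return sqrt(numerator / degrees_of_freedom)


def classify_with_borderline(value, threshold, sd, k=1.0):
    """Classify a result, treating threshold +/- k*sd as inconclusive (k is an assumption)."""
    lower, upper = threshold - k * sd, threshold + k * sd
    if lower <= value <= upper:
        return "borderline"
    return "positive" if value > upper else "negative"


# Hypothetical replicate data for two substances.
sd = pooled_sd([[1.0, 2.0, 3.0], [2.0, 4.0, 6.0]])
print(round(sd, 3), classify_with_borderline(1.6, threshold=1.5, sd=0.2))
```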
We observe differences between deterministic and probabilistic approaches regarding the integration of different types of information toward a final decision (i.e., a conclusion on hazard or potency or whether further testing is required): whereas deterministic strategies use qualitative criteria such as the majority vote rule applied in the “2 out of 3” ITS,63,64 probabilistic testing strategies use quantitative methods offering endogenous data integration mechanisms. Examples are the ANN ITS58–60 and the BN ITS.20,21,61,62 In the Stacking meta model, 80 the IDS,81,82 and the ANN ITS,58–60,83 machine learning approaches are used, which encompass computational algorithms developed to predict hazardous properties of substances and to reduce uncertainties underlying the assessment of the hazardous properties. They include Support Vector Machines and Classification and Regression Trees.81,82 The EC-JRC DA uses the Classification Trees machine learning approach based on in silico information to predict skin sensitization potential.72,73
Machine learning approaches are considered suitable for optimizing testing strategies because they allow for a quantification of uncertainties at any stage of the testing strategy, and they allow for learning (i.e., updating the assessment) if new information (e.g., about the molecular structure of a substance) is received. Still, all testing strategies using machine learning approaches focus on maximizing the information outcomes from testing. Information outcomes (expressed, e.g., in terms of predictive accuracy metrics) are thus not balanced with the costs of generating this information.
Conclusions and Recommendations
We reviewed the state-of-the-art of testing strategies regarding the combination of information from different testing and nontesting methods (i.e., NAMs) to determine the hazardous properties of chemicals. Based on a systematic literature search and evaluation we identified objectives, requirements, and criteria that are considered relevant for the development of testing strategies. Our analysis revealed that one of these requirements is resource efficiency. To make this requirement operational, we propose criteria that facilitate the evaluation of resource efficiency of testing strategies. We applied these criteria to existing testing strategies suggested for the assessment of skin sensitization potential (i.e., hazard assessment) and potency (i.e., risk assessment), which were described in the scientific literature and discussed in a recent OECD guidance document. 19
Our findings illustrate that most of the testing strategies were dedicated to maximizing the information outcome of testing. Only the “2 out of 3” ITS and the RIVM STS considered information about direct or indirect testing costs. Still, none of the testing strategies incorporated costs or resource use as a complementary piece of information into the construction of the strategy. As a consequence, a key requirement for efficiency evaluations, i.e. generating information about the gains and costs of the assessment process, is not met. Furthermore, tools that allow balancing information outcomes with the resources required to generate this information are not incorporated into these testing strategies. Consequently, the (information) gains per unit of cost remain unclear, and conclusions on the resource efficiency of testing strategies for assessing skin sensitization are, therefore, not possible.
If resources and time for testing are limited, an optimal allocation of scarce resources is considered highly desirable. Still, developers and users of nonanimal testing strategies (and toxicological testing methods in general) have only started to become aware that optimization includes balancing potentially competing objectives rather than merely maximizing the information gains from testing. There is thus a need for incorporating information and data about resource use (direct and indirect costs, animal welfare, and time) into the construction process of testing strategies.23,24 This requires developing and testing approaches that allow combining information on resource use with toxicological information. Depending on the specific context at hand, different tools can be used for this purpose. For example, in a recent article by Leontaridou et al., 48 a decision-theoretic approach using Bayesian Value-of-Information analysis was applied to guide the optimization of testing strategies. Applying the approach to nonanimal testing methods used for assessing skin sensitization potential (i.e., hazard) demonstrated that testing strategies can be more resource efficient than the animal test LLNA. Furthermore, (sequential) combinations of nonanimal testing methods usually perform better than individual methods, a result that was also confirmed by Roberts and Patlewicz. 92 In addition, their results indicate that full coverage of all key events in the sensitization AOP is neither a necessary nor a sufficient condition for optimal, resource-efficient testing and safety evaluations. Rather, depending on the available information, it may be preferable to start a sequence with a testing method (or a combination of methods) that refers to a key event occurring “late” in the AOP, or, conversely, to the MIE.
As an alternative to incorporating optimization methods in the construction process of testing strategies, their resource efficiency can be evaluated ex post, that is, after the strategy was developed. This can be done, for example, by applying cost–benefit or cost-effectiveness analysis. Both methods have been widely used for efficiency assessments of medical treatments, wherein the conceptual challenge to identify the best performing alternative is similar to the challenge of toxicity testing.86–89 Applications to the field of toxicity testing were provided by Nordberg et al., 90 Norlen et al., 46 and Gabbert and van Ierland. 47 Despite the rich set of available tools, practical applications are required to understand their applicability and their implications for chemicals risk assessment, and to ensure that phasing out animal testing complies not only with regulatory information requirements but also with available resources. This, in turn, requires strengthening the inter- and transdisciplinary collaboration of toxicologists and economists.
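An ex post cost-effectiveness comparison of two testing strategies can be sketched with the incremental cost-effectiveness ratio, a measure commonly used in cost-effectiveness analysis of medical treatments. The sketch below assumes that effectiveness is expressed in natural units, here the number of correctly classified substances, and that costs are direct testing costs in arbitrary monetary units. The strategies, names, and figures are hypothetical and do not refer to any strategy evaluated in this article.

```python
def incremental_cost_effectiveness(cost_a, effect_a, cost_b, effect_b):
    """Extra cost per extra unit of effect when moving from strategy A to strategy B."""
    delta_effect = effect_b - effect_a
    if delta_effect == 0:
        raise ValueError("strategies have equal effect; the ratio is undefined")
    return (cost_b - cost_a) / delta_effect


# Hypothetical strategies evaluated on 100 reference substances.
icer = incremental_cost_effectiveness(cost_a=1000.0, effect_a=80,
                                      cost_b=1500.0, effect_b=90)
print(f"additional cost per additional correctly classified substance: {icer:.1f}")
```

Whether such an additional cost is acceptable is a valuation question that the ratio itself does not answer.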
Footnotes
Author Disclosure Statement
No competing financial interests exist.
