Abstract
After an eyewitness completes a lineup, officers are advised to ask witnesses how confident they are in their identification. Although researchers in the lab typically study eyewitness confidence numerically, confidence in the field is primarily gathered verbally. In the current study, we used a natural language-processing approach to develop an automated model to classify verbal eyewitness confidence statements. Across a variety of stimulus materials and witnessing conditions, our model correctly classified adult witnesses’ (N = 4,541) level of confidence (i.e., high, medium, or low) 71% of the time. Confidence-accuracy calibration curves demonstrate that the model’s confidence classification performs similarly in predicting eyewitness accuracy compared to witnesses’ self-reported numeric confidence. Our model also furnishes a new metric, confidence entropy, that measures the vagueness of witnesses’ confidence statements and provides independent information about eyewitness accuracy. These results have implications for how empirical scientists collect confidence data and how police interpret eyewitness confidence statements.
Of the over 3,300 exonerations recorded to date by the National Registry of Exonerations (n.d.), over 900 are due, at least in part, to eyewitness misidentifications. Nearly all these cases involved an eyewitness testifying that they were highly confident in their identification (Garrett, 2011). Highly confident witnesses are persuasive to jurors, influencing their verdict decisions, judgments of guilt, and perceptions of a witness’s accuracy (Slane & Dodson, 2022). Researchers often refer to eyewitness confidence as a reflector variable (Wells, 2020). Reflector variables are witness behaviors occurring during or after an identification procedure that are associated with witness accuracy. Eyewitness confidence is a reflector variable because high confidence, from an unbiased lineup, indicates accuracy (Wixted & Wells, 2017). For this reason, officers are advised to document witness confidence immediately after any identification procedure (National Research Council, 2014; Wells et al., 2020).
Empirical research on the relationship between confidence and accuracy largely measures witness confidence numerically (Smalarz et al., 2021). However, officers in the field typically ask for witness confidence verbally, in the witness’s own words (National Research Council, 2014). Although verbal confidence is the most common method of documenting confidence, it has distinct drawbacks. Interpretations of verbal confidence are inconsistent, and individuals’ numeric translations of verbal confidence vary widely (Theil, 2002). Imagine a situation in which a witness makes an identification from a lineup and then states they are “fairly certain” the person they selected is the person who committed the crime. How does a police officer interpret this confidence statement? Does the officer believe that this witness has made a highly confident identification or not? If the officer misinterprets the intended meaning of the confidence statement (e.g., believes the witness is highly confident when they are not), then this impairs the ability of confidence to act as a cue to witness accuracy.
Compounding this problem, interpretations of verbal confidence are affected by base rate (Wallsten et al., 1986) and contextual information (Brun & Teigen, 1988). For example, researchers have documented a featural justification effect (Dodson & Dobolyi, 2015) in which witnesses justifying their reasons for their level of confidence (e.g., “I remember his bushy eyebrows”) can lead to more misinterpretations of confidence than when justification information is not provided. Thus, not only are verbal confidence statements prone to misunderstandings, but additional contextual information provided by witnesses tends to worsen, rather than improve, this problem.
One way to improve comprehension and reduce the ambiguity of verbal confidence statements is to use machine-learning approaches rather than human evaluators to classify verbal confidence. In one of the only articles to date using this approach, a “bag of words” model revealed that verbal confidence was predictive of accuracy and that the content of verbal confidence statements contained additional diagnostic cues beyond the information provided by numeric confidence (Seale-Carlisle et al., 2021). However, the confidence statements used in this study included both verbal statements of confidence as well as verbal statements of justification, which are currently not typically collected in the field. Additionally, the language classifier used in this study relied on a basic linguistic model that simply counted the use of individual words.
Using machine-learning approaches to evaluate verbal eyewitness confidence statements has several potential benefits. It is faster than human coders, especially with large amounts of data. It is also more replicable and can be easily implemented for both researchers and practitioners. Machine-learning approaches may also be less influenced by cognitive biases, such as the influence of preexisting information or cultural biases (Grabman & Dodson, 2019).
In the current study, we developed a Transformer-based large language model (LLM) to categorize witness confidence statements. The goal of this model is to interpret the intended meaning of a witness’s verbal confidence statement. To do this, we analyzed a sample of witnesses who explained their confidence both in their own words and using numbers. We used these witness-provided numeric translations to identify the ground truth of how confident witnesses actually are in their identifications. We define confidence statements as low confidence (0%–25%), medium confidence (26%–74%), or high confidence (75%–100%).
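The three confidence bands defined above can be illustrated with a short sketch. The function name and the handling of values that fall between the stated bands (e.g., 25.5) are our assumptions; the text specifies only the ranges 0%–25%, 26%–74%, and 75%–100%.

```python
def confidence_category(percent):
    """Map a self-reported numeric confidence (0-100) to low, medium, or high.

    Cutoffs follow the ranges stated in the text. Treating values strictly
    between 25 and 26 (or 74 and 75) as medium is an illustrative assumption.
    """
    if not 0 <= percent <= 100:
        raise ValueError("confidence must lie between 0 and 100")
    if percent <= 25:
        return "low"
    if percent < 75:
        return "medium"
    return "high"


# Example: a witness who translates "pretty certain" into 60% is medium confidence.
print(confidence_category(60))
```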
Statement of Relevance
After an eyewitness completes a lineup, police officers are advised to ask witnesses how confident they are in their identification. Confidence, from an unbiased lineup, can help predict whether a witness has made an accurate identification. Although researchers in the lab typically assess eyewitness confidence numerically, confidence in the field is primarily gathered verbally, in the witness’s own words. We developed a machine-learning model to read and classify eyewitness confidence statements, and we made it freely available online to researchers and practitioners (https://huggingface.co/spaces/psheaton/eyewitness_confidence_classifier). Across a variety of lineup types, our model correctly classified witnesses’ level of confidence (i.e., high, medium, or low) 71% of the time. We further demonstrate that the model’s confidence classification serves as a reliable tool for identifying accurate witnesses. These results have implications for how empirical scientists collect confidence data in the lab and how police interpret eyewitness confidence statements in the field.
After developing the model, we then tested the performance of our LLM by applying it to data previously unseen by the model (hereafter called external data)—specifically, samples of verbal eyewitness confidence statements from studies in which researchers also collected self-reported numeric confidence, furnishing a reference measure of ground-truth confidence. Our LLM is freely available for use by other researchers and practitioners at https://huggingface.co/spaces/psheaton/eyewitness_confidence_classifier. It includes functionality to batch process large collections of confidence statements.
Open Practices Statement
The data and materials for the pilot study are available on the Open Science Framework (OSF) at https://osf.io/9cbt6/. Appendix S2 details how the remaining data sets were obtained. This study was not preregistered.
Method
Pilot data
We recruited 989 participants on Amazon Mechanical Turk (MTurk) using CloudResearch’s MTurk Toolkit. Participants watched a short video of a robbery (Kenchel et al., 2021) and were randomly assigned to view either a target-present or a target-absent six-person lineup. After completing the lineup, all participants were asked to explain their confidence in their own words and then to translate their confidence to a number using an 11-point scale. The data and materials for this study are available on OSF. A full description of the methods and outcomes of the pilot study is provided in the Supplemental Material available online. This study was approved by the University of Mississippi Institutional Review Board.
Data sets
To ensure that the model was as generalizable as possible, we sought to obtain all existing data sets for which participants (a) made a lineup identification, (b) expressed their confidence in their own words, and (c) translated their verbal confidence into a numeric response. To obtain these data sets, the first author reviewed the eyewitness confidence literature and contacted authors of published articles that contained data that met these conditions. Seven data sets were used in the current study: the pilot data described above as well as data from six additional articles (Bergold & Heaton, 2018; Dobolyi & Dodson, 2018; Grabman & Dodson, 2024; Grabman et al., 2019; Kenchel et al., 2021; Smalarz et al., 2021). Only participants who made a lineup identification (i.e., not a lineup rejection) from a target-present lineup were included in our analyses. A more detailed description of each of these data sets can be found in the Supplemental Material: Table S1 compares the key features of these data sets, and Figure S1 provides a flowchart about the distribution of data to the test and the external data sets.
Modeling approach
Our model relies on the Transformer (Vaswani et al., 2017). The Transformer is a neural-net-based architecture featuring multiheaded self-attention in which the relative importance between different words in an input is calculated. This enables Transformers to handle dependencies between words, a key feature in natural language understanding. Using this Transformer architecture, researchers have trained several LLMs to achieve unprecedented performance on language-modeling tasks. The first of these LLMs was BERT (Devlin et al., 2018), which was developed at Google and introduced in 2018.
BERT is trained using masked language modeling. During training, tokens (usually words) are masked out or hidden from the model, and the transformer neural net is trained to guess the missing token on the basis of the tokens in the surrounding context. As this process is repeated over millions of tokens, the language model effectively learns the probability distribution of the language. From there, the language model can be fine-tuned to perform a variety of downstream tasks, such as text classification and extractive question answering.
In the current study, we relied on RoBERTa (Liu et al., 2019), a successor to BERT, which is pretrained on massive amounts of text, including all of English Wikipedia and 100 GB of other Web-crawled data from the Internet. We used transfer learning to adapt the RoBERTa model to our task. Transfer learning begins with a model pretrained on a very large amount of generic data; the model is then fine-tuned by providing it with ancillary data specific to a particular downstream task—in this case, classifying a particular postlineup confidence statement as reflecting low, medium, or high confidence. This allows the fine-tuned model to leverage all the general semantic information gained during the pretraining step (e.g., that the word “very” denotes greater intensity of quality or belief) with customization toward performing a particular task (e.g., that the response “not very” maps onto low confidence).
Model training
We used the Transformers library from HuggingFace to train our model on the classification task. The training data included three separate data sets: the pilot data as well as data from two previous studies (Dobolyi & Dodson, 2018; Grabman et al., 2019) using a similar methodology (see Fig. S1 in the Supplemental Material). Responses from the three studies were comingled and treated equally.
We randomized the training data and split it into training, test, and validation sets in proportions of roughly 80%, 10%, and 10%. Training, which follows standard machine-learning practice, predicts an output for a given set of model parameters or weights, compares this to the known true output in the training set, and then updates the weights to iteratively achieve better classification accuracy. To assess how well the model was learning, we periodically tested performance using the test set, allowing us to gauge improvement and terminate the iterative learning process once we reached the model’s maximum achievable classification accuracy. Finally, we used the validation set, containing the 10% of the initial training data set that the model had never seen, to measure model performance. Beyond providing a classification for each verbal confidence statement, the LLM also outputs a probability distribution over the three categories (i.e., the probability that a given confidence statement reflects low, medium, or high confidence).
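A minimal sketch of a randomized split of this kind is given below. The function name, the fixed seed, and the use of integer arithmetic for the set sizes are illustrative assumptions; the text states only that the proportions were roughly 80%, 10%, and 10%.

```python
import random


def split_80_10_10(records, seed=0):
    """Shuffle records and split them into training, test, and validation sets.

    Following the usage in the text, the test set is for periodic checks
    during training and the validation set is held out for the final measure.
    """
    shuffled = list(records)
    random.Random(seed).shuffle(shuffled)  # seeded so the split is reproducible
    n = len(shuffled)
    n_train = n * 8 // 10
    n_test = n // 10
    train = shuffled[:n_train]
    test = shuffled[n_train:n_train + n_test]
    validation = shuffled[n_train + n_test:]  # remainder, roughly 10%
    return train, test, validation


train, test, validation = split_80_10_10(range(100))
print(len(train), len(test), len(validation))
```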
Across all the data sets, respondents often included numeric language in their verbal confidence statements (e.g., “I am 100% sure”). To help the model learn a numerical baseline, we performed data augmentation. We added 115 additional samples to the training set that used numeric language in sample verbal confidence reports (see Table S2 in the Supplemental Material). This data augmentation ensured that the model saw examples (both numeric and text) of different numbers and percentages to help it learn how to classify words associated with numeric confidence.
We used the Trainer API from HuggingFace to fine-tune RoBERTa, adding a randomly initialized classification head and training for five epochs at a learning rate of 5e-6. During the validation step, our final model correctly classified 71% of the validation set. We defined a correct or accurate classification as one in which the model’s categorization matched the witness’s intended level of confidence. That is, if a witness states they are “pretty certain” and defines this as 60% certain, then the ground truth would be that this witness is moderately confident (i.e., medium confidence). If the model outputs a categorization of medium confidence, then this would be an accurate outcome.
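A fine-tuning setup of this kind might look roughly like the sketch below. Only the number of epochs, the learning rate, and the three output categories come from the text. The checkpoint name `roberta-base`, the function name `fine_tune`, the `StatementDataset` wrapper, the output directory, and the encoding of labels as the integers 0, 1, and 2 are our assumptions, and all other training settings are left at the library defaults.

```python
# Values stated in the text.
NUM_LABELS = 3        # low, medium, high
NUM_EPOCHS = 5
LEARNING_RATE = 5e-6

# Assumed checkpoint name; the text says only that RoBERTa was used.
BASE_CHECKPOINT = "roberta-base"


def fine_tune(train_texts, train_labels, eval_texts, eval_labels,
              output_dir="confidence-classifier"):
    """Fine-tune a sequence classifier on (statement, integer label) pairs."""
    # Imported lazily so that the constants above can be inspected without
    # installing the heavy dependencies.
    import torch
    from transformers import (AutoModelForSequenceClassification,
                              AutoTokenizer, Trainer, TrainingArguments)

    tokenizer = AutoTokenizer.from_pretrained(BASE_CHECKPOINT)

    class StatementDataset(torch.utils.data.Dataset):
        """Hypothetical wrapper pairing tokenized statements with labels."""

        def __init__(self, texts, labels):
            self.encodings = tokenizer(list(texts), truncation=True, padding=True)
            self.labels = list(labels)

        def __len__(self):
            return len(self.labels)

        def __getitem__(self, idx):
            item = {k: torch.tensor(v[idx]) for k, v in self.encodings.items()}
            item["labels"] = torch.tensor(self.labels[idx])
            return item

    # Loading with num_labels adds a randomly initialized classification head.
    model = AutoModelForSequenceClassification.from_pretrained(
        BASE_CHECKPOINT, num_labels=NUM_LABELS)
    args = TrainingArguments(output_dir=output_dir,
                             num_train_epochs=NUM_EPOCHS,
                             learning_rate=LEARNING_RATE)
    trainer = Trainer(model=model, args=args,
                      train_dataset=StatementDataset(train_texts, train_labels),
                      eval_dataset=StatementDataset(eval_texts, eval_labels))
    trainer.train()
    return trainer
```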
Results
Model performance
Table 1 reports the classification accuracy of the LLM when applied to four different external data sets not used in the model-development or training process. These results illustrate the range of performance that might be expected when applying the model to new, unseen data. Tables S3 to S6 in the Supplemental Material provide full confusion matrices for each data set.
Across the four external data sets, when classifying statements as low, medium, or high confidence, the LLM correctly classifies 71% of the confidence statements. When classifying confidence as either highly confident (75% or above) or not highly confident (less than 75%), the model performance improves to an average of 83% accuracy.
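The difference between the three-category and the two-category accuracy figures can be illustrated with a toy example. The function names and the four example labels are hypothetical and are not drawn from the study data.

```python
def accuracy(predicted, actual):
    """Share of statements whose predicted category matches the actual one."""
    if len(predicted) != len(actual) or not actual:
        raise ValueError("inputs must be non-empty and of equal length")
    return sum(p == a for p, a in zip(predicted, actual)) / len(actual)


def to_binary(label):
    """Collapse low and medium into a single 'not high' category."""
    return "high" if label == "high" else "not high"


# Hypothetical labels for four statements.
predicted = ["low", "medium", "high", "high"]
actual = ["medium", "medium", "high", "low"]

three_way = accuracy(predicted, actual)
two_way = accuracy([to_binary(p) for p in predicted],
                   [to_binary(a) for a in actual])
print(three_way, two_way)
```

Confusing low with medium counts as an error in the three-category comparison but not in the two-category one, which is why the collapsed accuracy can only be equal or higher.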
We believe that comparing accuracy to a perfect 100% rate is not appropriate for this model. The LLM cannot achieve 100% accuracy because the classification process requires a unique mapping between a particular statement and a confidence level. In reality, different participants may use the same words to describe different degrees of numeric confidence (e.g., Participant 1 is 80% confident [high] and describes their confidence as “pretty confident” versus Participant 2 who is 60% confident [medium] but also describes their confidence as “pretty confident”). The final column of Table 1 reports the maximum possible accuracy achievable by any classification process for each data set, taking into account the incidence of such conflicting responses. For three of the four external data sets, the LLM achieves greater than 75% of the maximum possible accuracy.
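One way such a ceiling could be computed is sketched below. It assumes that a classifier must assign every occurrence of an identical statement to the same category, so that the best it can do is pick the most common category for each distinct statement. The function name, the lowercase and whitespace normalization, and the toy data are our assumptions and may differ from the procedure behind Table 1.

```python
from collections import Counter, defaultdict


def max_possible_accuracy(statements, labels):
    """Upper bound on accuracy when identical statements share one prediction."""
    if len(statements) != len(labels) or not labels:
        raise ValueError("inputs must be non-empty and of equal length")
    counts = defaultdict(Counter)
    for statement, label in zip(statements, labels):
        # Normalizing case and surrounding whitespace is an assumption.
        counts[statement.strip().lower()][label] += 1
    # For each distinct statement, the best single guess is its modal category.
    best_hits = sum(max(c.values()) for c in counts.values())
    return best_hits / len(labels)


# Toy data: three witnesses say "pretty confident" but mean different things.
statements = ["pretty confident", "Pretty confident", "pretty confident", "very sure"]
labels = ["high", "medium", "high", "high"]
print(max_possible_accuracy(statements, labels))
```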
Table 1. LLM Classification Accuracy for External Data Sets
Note: 95% confidence intervals are reported in brackets. LLM = large language model.
The main reason for measuring confidence is to permit inferences about the likely accuracy of a particular identification. In the laboratory, researchers often use confidence-accuracy calibration curves as a way of characterizing the relationship between confidence and accuracy for a given eyewitness task. Do the LLM-based classifications yield accuracy information similar to what direct, witness-reported numeric confidence would provide? To examine this question, we plotted confidence-accuracy curves in Figure 1 for each of the four external data sets based on true confidence measured numerically (labeled “actual”; dashed blue line with circles) and imputed confidence measured by our model on the basis of verbal confidence (labeled “imputed”; dotted orange line with squares).

Fig. 1. Confidence-accuracy curves for four external data sets, from (a) Bergold and Heaton (2018), (b) Kenchel et al. (2021), (c) Smalarz et al. (2021), and (d) Grabman and Dodson (2022). Whiskers denote 95% confidence intervals for each accuracy level.
The external data sets exhibit a range of actual calibration patterns. Figures 1a and 1b show examples of tasks with weak calibration: In the true data, there is a modest but statistically significant increase in the likelihood of a correct identification for highly confident eyewitnesses compared to those with medium or low confidence (Fig. 1a: 13.2 percentage points, p = .003; Fig. 1b: 13.1 percentage points, p = .011). However, there was no measurable difference between low- and medium-confidence respondents (Fig. 1a: p = .144; Fig. 1b: p = .610). Figures 1c and 1d show examples of tasks with good calibration. The likelihood of a correct identification is monotonically increasing in the level of confidence, it is substantially higher for high-confidence respondents compared to low-confidence respondents, and the absolute rate of correct responses is high for highly confident respondents (Fig. 1c: 86%; Fig. 1d: 85%).
The model-imputed confidence classifications appear to perform well. True accuracy levels generally fall within the 95% confidence intervals of those estimated on the basis of the LLM classification, and the LLM-based confidence-accuracy curves yield qualitatively comparable insights to the true curves, demonstrating similarly weak calibration in Figures 1a and 1b and good calibration in Figures 1c and 1d.
To more formally test for differences between the calibration curves, we estimated regression models in which we predicted accuracy using indicators for the LLM-based confidence levels (low, medium, and high) as primary predictors and indicators for the actual confidence levels as auxiliary predictors. We conducted an F test for joint significance of the actual confidence levels, which in essence tested statistically whether the actual confidence levels provided any information about accuracy above what is available from the LLM. For two of the data sets, we failed to reject the null of no difference: F(2, 241) = 1.73, p = .180, ηp² = .014 (see Fig. 1a), and F(2, 113) = 0.38, p = .685, ηp² = .007 (see Fig. 1b). For the other two data sets, true confidence measured numerically did provide some additional explanatory power: F(2, 1,673) = 54.28, p < .001, ηp² = .062 (see Fig. 1c), and F(2, 2,028) = 39.49, p < .001, ηp² = .037 (see Fig. 1d). For the two data sets in which there are statistically significant differences, the qualitative differences remain modest; for example, the expected accuracy based on the LLM classification is always within 8 percentage points of the actual accuracy based on true confidence, despite the fact that accuracy varies by over 40 percentage points across confidence levels.
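The logic of this joint test can be sketched as a comparison of two nested regressions. The sketch assumes a linear probability model fit by ordinary least squares, with low confidence as the reference category; the text says only "regression models," so the estimator, the function names, and the balanced toy data are our assumptions. Obtaining a p value would additionally require the F distribution, which the Python standard library does not provide.

```python
def dummies(level):
    """Indicators for medium and high; low is the reference category."""
    return [1.0 if level == "medium" else 0.0, 1.0 if level == "high" else 0.0]


def solve(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        if abs(m[pivot][col]) < 1e-12:
            raise ValueError("design matrix is rank deficient")
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        x[i] = (m[i][n] - sum(m[i][j] * x[j] for j in range(i + 1, n))) / m[i][i]
    return x


def residual_sum_of_squares(X, y):
    """Fit OLS through the normal equations and return the residual sum of squares."""
    k = len(X[0])
    xtx = [[sum(row[i] * row[j] for row in X) for j in range(k)] for i in range(k)]
    xty = [sum(row[i] * yi for row, yi in zip(X, y)) for i in range(k)]
    beta = solve(xtx, xty)
    return sum((yi - sum(b * x for b, x in zip(beta, row))) ** 2
               for row, yi in zip(X, y))


def joint_f_statistic(llm_levels, actual_levels, correct):
    """F statistic for adding actual-confidence indicators to an LLM-only model."""
    y = [float(c) for c in correct]
    restricted = [[1.0] + dummies(l) for l in llm_levels]
    full = [[1.0] + dummies(l) + dummies(a)
            for l, a in zip(llm_levels, actual_levels)]
    rss_r = residual_sum_of_squares(restricted, y)
    rss_f = residual_sum_of_squares(full, y)
    q = 2  # number of auxiliary indicators being tested
    df_resid = len(y) - len(full[0])
    return ((rss_r - rss_f) / q) / (rss_f / df_resid), (q, df_resid)


# Balanced toy data: accuracy depends on actual confidence, not on the LLM level.
n_correct = {"low": 1, "medium": 2, "high": 3}
llm, actual, correct = [], [], []
for l in ("low", "medium", "high"):
    for a in ("low", "medium", "high"):
        for i in range(4):
            llm.append(l)
            actual.append(a)
            correct.append(1 if i < n_correct[a] else 0)

f_stat, dfs = joint_f_statistic(llm, actual, correct)
print(round(f_stat, 3), dfs)
```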
To examine whether our results are particular to our choice of RoBERTa as the base LLM, we also fine-tuned OPT (Zhang et al., 2022), another transformer-based language model, using the same data. OPT, developed by Meta AI, has a similar architecture to the GPT class of models and can incorporate substantially more parameters than RoBERTa (1.3 billion versus 125 million). Rather than predicting what word should fill in a mask, as in RoBERTa’s masked language modeling, OPT is simply trained to predict the next word in a given sequence. Our results using OPT are statistically and qualitatively similar to those achieved with RoBERTa (see Table S7 in the Supplemental Material), demonstrating the robustness of our results.
To further probe the real-world usefulness of the model, we applied it to the 35 confidence phrases extracted by Behrman and Davey (2001) and Behrman and Richards (2005) from statements of actual eyewitnesses obtained by the Sacramento Police Department during the investigation of 183 real criminal cases. Behrman and Richards (2005) employed 84 human coders to classify these statements into low-, medium-, and high-confidence categories. Our model’s categorization of these 35 statements is shown in Figure 2. Behrman and Richards (2005) used different confidence-level cutoffs from the present study, meaning that conventional confusion matrices would likely be uninformative. We can nevertheless assess statistically whether our model replicates human interpretations by conducting a multivariate analysis of variance in which the outcomes are our three model-generated category probabilities and the main predictors are Behrman and Richards’s (2005) low, medium, and high groupings. This tests whether the LLM assigns systematically different probability ratings to statements categorized differently by humans. For the overall joint test, Wilks’ lambda = .35, F(6, 60) = 6.93, p < .001, and for each individual categorical comparison (low versus medium: F(1, 32) = 10.29, p = .003; high versus medium: F(1, 32) = 13.80, p < .001; and high versus low: F(1, 32) = 46.72, p < .001), we reject the null of no difference, demonstrating that when humans distinguish particular real-world eyewitness confidence statements, the model also distinguishes them.

Fig. 2. Model output for 35 confidence statements from real witnesses, from Behrman and Richards (2005), by human categorization. Low, medium, and high categorizations were made by 84 human coders, as described in Behrman and Richards (2005). The LLM’s probabilities for low, medium, and high confidence are depicted with the green, yellow, and red bars in the figure. The LLM’s final classification would be represented by the longest of the three bars. Note that Behrman and Richards (2005) define low confidence as ranging from 0 to 4, medium confidence as ranging from 5 to 7, and high confidence as ranging from 8 to 10.
Confidence entropy
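Our model also furnishes a new metric containing independent information about eyewitness confidence. Adapting a concept from information theory, we define the confidence entropy of a particular statement as $H = -\sum_{c} p_c \log p_c$, where $p_c$ is the probability that the model assigns to confidence category $c$ (low, medium, or high).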
As shown in Figure 3, which depicts a histogram of confidence entropy measures for statements that are low, medium, and high confidence drawn from the combined four external data sets (N = 4,541), confidence entropy is a distinct concept from confidence itself. It is possible, for example, for someone to express high certainty with little ambiguity in interpretation (e.g., “I’m 100% certain”) or for someone to express high certainty but in a way that leaves more room for interpretation (e.g., “I believe I saw this one”). Whereas confidence level provides information about whether the participant believes they made an accurate identification, confidence entropy measures how well they can explain their confidence. Table S8 in the Supplemental Material provides examples of actual confidence statements that exhibit differing combinations of confidence level and confidence entropy.
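Assuming that confidence entropy takes the standard Shannon form over the model's three category probabilities, it could be computed as in the sketch below. The function name, the use of the natural logarithm (the base only rescales the values), and the example probabilities are our assumptions.

```python
import math


def confidence_entropy(probabilities):
    """Shannon entropy of a probability distribution over confidence categories.

    Zero-probability categories contribute nothing, by the usual convention.
    """
    if abs(sum(probabilities) - 1.0) > 1e-6:
        raise ValueError("probabilities must sum to 1")
    return -sum(p * math.log(p) for p in probabilities if p > 0)


# Hypothetical model outputs, ordered as (low, medium, high).
clear_statement = [0.01, 0.04, 0.95]   # little ambiguity in interpretation
vague_statement = [0.10, 0.40, 0.50]   # more room for interpretation

print(confidence_entropy(clear_statement) < confidence_entropy(vague_statement))
```

Both example statements would be classified as high confidence, yet the second has the higher entropy, which is the sense in which entropy is distinct from the confidence level itself.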

Fig. 3. Entropy distribution by self-reported numeric confidence level.
To assess whether entropy provides useful additional information about eyewitness accuracy over and above the confidence level itself, we plotted confidence-accuracy curves for the two well-calibrated external data sets in Figure 4 (Grabman & Dodson, 2022; Smalarz et al., 2021); the figure also differentiates between responses with low, medium, and high confidence entropy (defined by terciles). For high-confidence identifications, confidence entropy appears clearly related to accuracy, with higher entropy (i.e., vaguer) responses associated with lower accuracy. To more formally test the predictive power of entropy, we estimated regressions identical to those described previously, with a full set of interactions between true and predicted confidence levels as main predictors but with confidence entropy as an auxiliary predictor. For both data sets, after conditioning on the confidence level, entropy was negatively and statistically significantly related to accuracy, F(1, 1,668) = 5.39, p = .020, ηp² = .003 (Smalarz et al., 2021), and F(1, 2,489) = 39.49, p < .001, ηp² = .016 (Grabman & Dodson, 2022).
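Grouping responses into entropy terciles could be done along the following lines. The function name, the treatment of values that equal a cut point, and the default interpolation method of `statistics.quantiles` are our assumptions; the text states only that the groups were defined by terciles.

```python
import statistics


def entropy_terciles(entropies):
    """Label each entropy value as low, medium, or high by tercile."""
    # With n=3, quantiles returns the two cut points separating the thirds.
    lower, upper = statistics.quantiles(entropies, n=3)
    return ["low" if e <= lower else "medium" if e <= upper else "high"
            for e in entropies]


# Hypothetical entropy values for nine statements.
values = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9]
groups = entropy_terciles(values)
print(groups)
```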

Fig. 4. Eyewitness accuracy by confidence level and confidence entropy, from (a) Smalarz et al. (2021) and (b) Grabman and Dodson (2022). Whiskers denote 95% confidence intervals for each accuracy level.
To examine whether entropy can add predictive power relative to current best practices, we estimated regression models using these two data sets in which the outcome was a correct identification and the predictors were a full set of indicators for all numeric confidence levels reported by participants. Saturating the model with predictors in this manner incorporates all possible information obtainable from self-reported numeric confidence. We then entered entropy as an additional predictor and tested its significance. For the Smalarz et al. (2021) data set, which allows respondents to make a nuanced report of numeric confidence (i.e., 0–100%), we found that entropy remained negatively and statistically significantly associated with accuracy even after fully controlling for numeric confidence, F(1, 1,613) = 3.86, p = .049, ηp² = .002. For the Grabman and Dodson (2024) data set, entropy was not a statistically significant predictor once numeric confidence was controlled, F(1, 2,492) = 0.42, p = .519, ηp² < .001.
Together, these analyses suggest that confidence entropy—a measure unavailable previously, but now readily producible by applying natural language processing to confidence statements—merits further investigation as a potential new reflector variable that can be used to better characterize eyewitness accuracy.
Discussion
Under appropriate conditions, eyewitness confidence measured at the time of an identification procedure can be a valuable diagnostic cue for identification accuracy (Wixted & Wells, 2017). Moreover, the U.S. Supreme Court has specifically identified a witness’s confidence at the time of the identification as a relevant factor for courts to consider in evaluating the admissibility of lineup evidence (Neil v. Biggers, 1972). However, when lineups are administered in the field, eyewitnesses are rarely asked to provide numeric or other structured ratings of their confidence (Police Executive Research Forum, 2013). Until now there has been no efficient, systematic, reproducible method to interpret verbal or textual descriptions of confidence. Our Transformer-based LLM accomplishes this in a manner that largely reproduces the categorization one would obtain had the eyewitnesses been asked to rate their confidence numerically. Moreover, our LLM-based categorization provides similar information about eyewitness accuracy to that obtained with a numeric confidence measure; it also provides the new metric of confidence entropy.
Our work extends the existing literature about the relationship between confidence and accuracy across confidence-scale types. Past work has shown similar confidence-accuracy relationships between different numeric confidence scales (Tekin & Roediger, 2017), between numeric and graded verbal scales (Weber et al., 2008), and between numeric and freely reported verbal confidence (Smalarz et al., 2021). Our model replicates this pattern of findings showing a similar confidence-accuracy relationship between the model’s confidence classification and participants’ self-reported numeric confidence.
Our model has several practical applications. Initial confidence recorded from an unbiased lineup is predictive of accuracy (Wixted & Wells, 2017). Most known misidentification cases involved an eyewitness who testified at trial with high confidence in their identification but was not highly confident at the time of the lineup (Garrett, 2011). We believe our LLM is an efficient, low-cost solution to help officers better understand a witness’s initial confidence statement. Outside evaluators often have differing evaluations of verbal confidence statements (Greenspan & Loftus, 2024). Our model provides a way for officers to reliably, simply, and replicably interpret the intended meaning of a witness’s initial confidence statement. In situations in which the identification procedure is video recorded (a recommended best practice in lineup administration; Wells et al., 2020) or recorded verbatim in writing, the LLM could categorize the witness’s description of their confidence at any point in the future, potentially many years after the original procedure, and then evaluate that statement free from contextual bias. The model also provides a way to adjudicate ambiguous cases, such as when human coders disagree as to whether a particular statement denotes high confidence (“I’m thinking I’m right”), by leveraging the large body of responses in our training data. The model also offers a linguistically informed method to infer confidence when eyewitnesses offer unusual or unexpected statements that might be difficult for humans to interpret (“that’s a clown question, bro”).
Confidence, and now confidence entropy, are two reflector variables that can help indicate witness accuracy. One additional reflector variable that might be at play here is decision time. In addition to high confidence, fast identification decisions are predictive of witness accuracy (Quigley-McBride & Wells, 2023). Future research using this LLM could explore the interplay of witness confidence, confidence entropy, and decision time to further understand factors related to accurate and inaccurate identifications.
The model also has considerable potential to support the expansion of academic research on verbal confidence statements. One significant impediment to experimental research is that any quantitative analysis of verbal confidence statements has traditionally required researchers to hire and train human coders, who then review each verbal statement and classify it manually. This process is expensive, time-consuming, and nonreplicable across coders. The LLM can process thousands of confidence statements almost instantaneously, and the version of the model at https://huggingface.co/spaces/psheaton/eyewitness_confidence_classifier includes functionality to accept file uploads for batch processing of statements. We anticipate that this model should substantially reduce the cost and complexity of coding verbal confidence statements, thus removing barriers for researchers to use the more ecologically valid measure of verbal confidence in their studies. Moreover, the LLM can readily improve over time both as the underlying language model is upgraded and as data from additional studies are incorporated into the training process. Whereas human coding typically requires starting anew with a fresh set of coders for each study, the LLM can draw from the accumulated knowledge embedded in thousands or even tens of thousands of identification responses generated by multiple researchers.
Supplemental Material
Supplemental material, sj-docx-1-pss-10.1177_09567976241229028 for Assessing Verbal Eyewitness Confidence Statements Using Natural Language Processing by Rachel Leigh Greenspan, Alex Lyman and Paul Heaton in Psychological Science
Footnotes
Acknowledgements
This paper was presented at the 2023 American Psychology–Law Society conference in Philadelphia, PA. We acknowledge and thank Amanda Bergold, Chad Dodson, Jesse Grabman, Jillian Kenchel, and Laura Smalarz for sharing their data for this study.
Transparency
Action Editor: Angela Lukowski
Editor: Patricia J. Bauer
