Simulating Data for Clinical Research: A Tutorial

Abstract

Simulation studies use computer-generated data to examine questions of interest and have traditionally been used to study the properties of statistics and estimation algorithms. With the recent advent of powerful processing capabilities in affordable computers along with readily usable software, it is now feasible to use a simulation study to aid in clinical decision making. By simulating large quantities of data that mimic clinical situations, it is possible to understand the ramifications of different decisions better than by using tangentially relevant data or intuition. In this tutorial article, I describe the general steps in conducting a simulation study with particular emphasis on clinical decision making. I conclude with a didactic example taken from clinical literature on identifying a specific learning disability.

Keywords

simulation study, Monte Carlo study, clinical decision making, specific learning disability

Simulation studies use computer-generated data to examine questions of interest. Originally, the most common purposes for conducting a simulation study in education or psychology were to determine the sampling distributions of test statistics, compare parameter estimators, evaluate an algorithm, or compare multiple algorithms that perform the same function (Psychometric Society, 1979). Later, researchers found that simulation studies could also be useful in planning the sample size required for a study (e.g., Beaujean, 2014; Muthén & Muthén, 2002).

With the advent of affordable computers that have powerful processing capabilities and readily usable software, it is now feasible to use a simulation study to help with clinical decision making. In clinical practice, it is not unusual to make decisions about individuals who are normatively rare (e.g., psychiatric/education diagnosis, atypical symptoms). In these populations, there are often very limited data available to test clinical hypotheses that are generalizable beyond a single case. Consequently, by simulating large quantities of data that mimic clinical situations, it is possible to understand the ramifications of different decisions better than by using tangentially relevant data or intuition. In this tutorial article, I describe the general steps in conducting a simulation study with particular emphasis on the clinical context. I conclude the article with a didactic example.

Steps in Conducting a Simulation Study

There are six basic steps required for all simulation studies (Burton, Altman, Royston, & Holder, 2006; Fan, 2012; Feinberg & Rubright, 2016; Harwell, Stone, Hsu, & Kirisci, 1996; Law, 2006; Paxton, Curran, Bollen, Kirby, & Chen, 2001).

1. Define the problems of interest and determine whether a simulation study is the most suitable way to study them. This step includes specifying the scope of the problems, questions to be answered, resources required for the simulation, and time frame to complete the study. It may not be possible to specify this information very precisely at the beginning of the study. As the study proceeds, however, you should be able to gain additional insights and be able to reformulate the problems/questions and understand the required resources better.

Simulation studies can be conducted to investigate almost any type of scientific problem, but they are best used when the needed data cannot be reasonably obtained in another way. All simulation studies are based on simplified versions of reality, so they will never be able to fully replicate all the complexities involved in the problems of interest. Thus, if actual data are available—or it is feasible to collect them—then they should be analyzed instead of, or in conjunction with, simulated data.

2. Collect needed information to build the conceptual model. A conceptual model is a simplified version of the real system of interest that only includes the major variables thought to affect the outcomes of interest. When initially developing the conceptual model, it is important to collect information from multiple sources (e.g., professional literature, existing system, subject matter experts), so that the model does as good a job as possible representing the essential aspects of the real system.

To build a worthwhile conceptual model, gather information about the essential variables, including how the variables are related to each other and what distributions data from the variables follow. While the normal distribution is the “go to” distribution in educational and psychological research, it is not always the most appropriate. For example, Shadish and Sullivan (2011) noted that the majority of data collected in single-case research were counts (e.g., number of behaviors observed in a time period) or proportions (e.g., proportion of time intervals a behavior was observed). Thus, simulating data for these types of variables would require using alternative distributions, such as Poisson, binomial, or even a mixture of distributions.
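As a rough illustration of simulating non-normal variables, the following sketch draws count data from a Poisson distribution and proportion data from a binomial distribution. It is written in Python using only the standard library, which is an assumption made here (the article's own syntax is in R). The function names and the parameter values (a rate of 3 behaviors per session, 20 intervals, a .25 probability) are hypothetical.

```python
import math
import random

rng = random.Random(2017)  # fixed seed so the sketch is reproducible


def draw_poisson(rate, rng):
    """Draw one Poisson count using Knuth's multiplication method (suitable for small rates)."""
    limit = math.exp(-rate)
    count, product = 0, rng.random()
    while product > limit:
        count += 1
        product *= rng.random()
    return count


def draw_binomial_proportion(n_intervals, prob, rng):
    """Proportion of n_intervals in which a behavior with probability prob was observed."""
    hits = sum(1 for _ in range(n_intervals) if rng.random() < prob)
    return hits / n_intervals


# Hypothetical values: about 3 behaviors per session; behavior seen in 25% of 20 intervals.
counts = [draw_poisson(3.0, rng) for _ in range(5000)]
proportions = [draw_binomial_proportion(20, 0.25, rng) for _ in range(5000)]

mean_count = sum(counts) / len(counts)
mean_proportion = sum(proportions) / len(proportions)
print(round(mean_count, 2), round(mean_proportion, 2))
```

Checking that the simulated means sit close to the generating values (3 and .25 here) is a quick way to confirm the chosen distribution behaves as intended before building it into a larger conceptual model.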

In addition to determining information about the essential variables, creating a conceptual model also requires making simplifying assumptions about the system. This is necessary to make the conceptual model tractable for a simulation study. Unnecessary details can result in excessive execution time for the simulations or obscure aspects of the system that are really important. Typically, it is better to start with a simple model and add more complexity as needed.

In clinical situations, the model assumptions will typically be that certain variables can be excluded without substantially influencing the results, the relations among the variables are of a certain magnitude and direction, and the model residuals/errors have a certain structure. For example, if simulating longitudinal data, some typical assumptions would be that variables collected at Wave 1 are related to the same variables collected at Wave 2, and that the Wave 2 variables are not causing the Wave 1 variables. As another example, if simulating classroom behavior data, it might be a tenable assumption that the principal’s education level can be excluded from the conceptual model without substantially influencing the results.

While not required, it may be helpful to create a path diagram of the conceptual model. Not only can many clinical problems be conceptualized in terms of a path diagram (Hoyle & Smith, 1994), but creating such a diagram also requires the user to be explicit about the model’s assumptions and can also aid in the analysis of the data. Loehlin and Beaujean (2017) provide details for creating and analyzing path diagrams for a variety of common models in psychology and related disciplines.

3. Design the simulation study. After creating the conceptual model, the next step is to design the actual simulation study. This involves determining sample sizes (n), the number of simulations to execute (m), values for all the model’s conditions (i.e., parameters), and the methods for analyzing the simulated data. The first three aspects of this step are all intimately related to each other, and the selected values should be based on the study’s purpose and the resources available.

While a bit of an oversimplification, simulation studies can be classified as either unreplicated or replicated. Unreplicated studies simulate m = 1 data set for a given set of conceptual model conditions (e.g., data distributions, parameter values, n). As only one data set is being simulated, typically, the n used for this type of design is large.

One example of an unreplicated study is Stuebing, Fletcher, Branum-Martin, and Francis (2012). They were interested in the accuracy of three different methods for identifying a specific learning disability (SLD). To do so, they simulated n = 1,000,000 observations using a single set of published correlations between scores on norm-referenced cognitive ability and academic achievement tests. As another example, Crawford, Garthwaite, and Gault (2007) showed how using unreplicated simulations with a large value of n can aid in determining base rates for score differences.

A replicated study (also called a Monte Carlo [MC] study) simulates m > 1 data sets for each set of model conditions. As multiple data sets are simulated, the n is selected to reflect either what is typically found in the system or what previous studies of the system have used. Usually, multiple values for some of the conceptual model conditions are used to generate the data and then the results from the different conditions are compared.

MC studies are common in statistical research, but they are relatively uncommon in clinical studies. One example of such a study is Moreau (2014). He was interested in the influence of individual differences on working memory training interventions. To investigate this, he simulated data under a variety of conditions of how the treatment and control groups could be formed. For each set of conditions, he simulated n = 20 observations in each group, calculated the between-group differences, and then repeated the process m = 10,000 times.

If conducting an MC study, selecting the model conditions and values for m is important, and there are no absolute best values for either. On one hand, large values of m provide precise results. On the other hand, assuming a fixed amount of resources, larger values of m reduce the number of conceptual model conditions that can be investigated, which can reduce the external validity (i.e., generalizability) of the study. Skrondal (2000) advocated for choosing these values based on the individual simulation study instead of just relying on “conventional wisdom.”

The last part involved in designing a simulation study is determining how to analyze the generated data. Planning this out before starting the simulation process can not only increase the efficiency of the data analysis but will also aid in making sure the simulated data will be able to answer the questions of interest. The data analysis should include descriptive statistics, such as the mean and standard deviation (SD) of the statistics of interest across the m replications (see Table 1). In MC studies, the SD of a statistic across replications is an approximation of the statistic’s standard error; thus, it is sometimes called the empirical standard error (ESE). The ESE can be used the same way as analytic standard errors, such as examining the precision of the statistic. The ESE could be used to create an empirical confidence interval (CI) as well, but more typically CIs are created in MC studies by rank ordering the m statistics and finding values at the desired percentiles (Buckland, 1984).

Table 1.

Summary Statistics and Performance Measures for Simulation Studies.

Measure	Symbol/formula
Summary
Average value of statistic across m samples ( $\bar{\hat{θ}}$ )	$\bar{\hat{θ}} = \frac{1}{m} \sum_{i = 1}^{m} \hat{θ}_{i}$
Empirical standard error (ESE)^a	$E S E (\hat{θ}) = \sqrt{\frac{1}{m} \sum_{i = 1}^{m} {(\hat{θ}_{i} - \bar{\hat{θ}})}^{2}}$
Bias
Parameter bias	$\bar{\hat{θ}} - θ$
Relative bias	$\frac{\bar{\hat{θ}} - θ}{θ}$
Standardized bias	$\frac{\bar{\hat{θ}} - θ}{E S E (\hat{θ})}$
Statistical efficiency
Mean square error (MSE)	${(\bar{\hat{θ}} - θ)}^{2} + E S E {(\hat{θ})}^{2}$
Root mean square error (RMSE)	$\sqrt{{(\bar{\hat{θ}} - θ)}^{2} + E S E {(\hat{θ})}^{2}}$
Coverage
Coverage	Proportion of the data sets where the confidence interval includes θ

Note. $θ$ = parameter value; $\hat{θ}$ = statistic value from single sample; n = sample size; m = number of simulated data sets of size n for a given set of conditions.

^a If parameter estimates are unbiased, $\bar{\hat{θ}}$ can be replaced with $θ$ .
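The summary statistics in Table 1 can be sketched in a few lines of code. The example below is a minimal Python illustration (standard library only; the article's own syntax is in R) that stores the sample mean from each of m simulated data sets and then computes the average, the ESE with divisor m as in Table 1, and a percentile interval obtained by rank ordering the m statistics. The values of n, m, and the population mean and SD are hypothetical, and the percentile ranks are approximate.

```python
import math
import random

rng = random.Random(1984)

n, m = 50, 2000          # hypothetical sample size and number of replications
true_mean = 100.0        # hypothetical parameter value (theta)
true_sd = 15.0

# One statistic (the sample mean) is stored from each of the m simulated data sets.
estimates = []
for _ in range(m):
    sample = [rng.gauss(true_mean, true_sd) for _ in range(n)]
    estimates.append(sum(sample) / n)

# Average of the statistic across replications.
mean_estimate = sum(estimates) / m

# Empirical standard error: SD of the statistic across replications (divisor m, as in Table 1).
ese = math.sqrt(sum((est - mean_estimate) ** 2 for est in estimates) / m)

# Percentile interval: rank order the m statistics and read off the 2.5th and 97.5th percentiles.
ordered = sorted(estimates)
lower = ordered[int(0.025 * m)]
upper = ordered[int(0.975 * m) - 1]

print(round(mean_estimate, 2), round(ese, 2), round(lower, 2), round(upper, 2))
```

For this simple case the ESE can be compared with the analytic standard error of the mean, 15 / sqrt(50), which is about 2.12; the two should be close when m is large.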

The data analysis for most simulation studies should go beyond descriptive statistics. Because researchers have full control over the parameter values examined in a simulation study, many have advocated for treating simulation studies—especially MC studies—as experiments. Thus, they argue that simulation studies should use appropriate experimental designs, validity checks, and data analyses techniques (e.g., Harwell, 1997; Skrondal, 2000).

4. Simulate the data. After designing the study, the data need to be simulated. This is done using the conceptual model and some pseudorandom numbers (Gentle, 2003). This step also includes any data manipulation, such as transforming standardized values to values on a T scale, adding skew or kurtosis to variables, or removing observations to mimic missing value conditions.
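A minimal sketch of this step, written in Python with only the standard library (an assumption; the article's own syntax is in R), is shown below. It seeds a pseudorandom number generator, draws standardized scores, transforms them to the T scale, and removes observations completely at random to mimic missing values. The seed, sample size, and 10% missing rate are hypothetical choices.

```python
import random

rng = random.Random(2003)  # seeding the pseudorandom number generator makes results repeatable

n = 1000  # hypothetical sample size

# Standardized (z) scores drawn from a standard normal distribution.
z_scores = [rng.gauss(0.0, 1.0) for _ in range(n)]

# Data manipulation 1: transform z scores to the T scale (mean 50, SD 10).
t_scores = [50.0 + 10.0 * z for z in z_scores]

# Data manipulation 2: delete roughly 10% of observations completely at random,
# using None as the missing-value marker.
missing_rate = 0.10  # hypothetical
t_with_missing = [None if rng.random() < missing_rate else t for t in t_scores]

observed = [t for t in t_with_missing if t is not None]
print(len(observed), round(sum(observed) / len(observed), 1))
```

Because the generator is seeded, rerunning the block reproduces the same data set, which makes debugging and replication easier.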

Simulating data can be done in a general-purpose programming language (e.g., Python, C++), many statistical programs (Lee, 2015), as well as some specialized simulation programs. For example, in the next section, I used the R program (R Development Core Team, 2016). I did so because it is free, and there are many packages available to conduct most analyses that are of interest for examining clinical data.

When simulating the data, start by creating some pilot data using small values for n and m (if conducting an MC study) along with one or two sets of parameter values. Use these data to check the adequacy of the simulated data, such as the absence of impossible values, reasonable values for the sample statistics, and the performance measures discussed in Step 5. If the data are not being simulated as expected, this may indicate that the program needs to be debugged.

Once you are sure the data are being simulated correctly, simulate the m desired data sets of size n. Be sure to store the necessary information from each of the m data sets so that you do not have to repeat the simulation process at a later date. For unreplicated studies, it will usually be possible to save each of the simulated data sets. For MC studies, it may not be feasible to save each data set; it depends on the size of m, n, and the number of conceptual model conditions. If saving all the data sets is not feasible, then save the necessary information from each of the m iterations before the simulation procedure discards them. The necessary information will be the values required for the desired data analysis as well as those needed to check the validity of the results (see Step 5).
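The pilot-then-store workflow can be sketched as follows. This is a hypothetical Python illustration (standard library only; the article's own syntax is in R) in which `simulate_scores` stands in for any conceptual model and `run_replications` keeps only summary values from each replication instead of the raw data. The function names, the T-score model, and the plausibility range are all assumptions made for the example.

```python
import random


def simulate_scores(n, rng):
    """Simulate n T scores (mean 50, SD 10); a stand-in for any conceptual model."""
    return [50.0 + 10.0 * rng.gauss(0.0, 1.0) for _ in range(n)]


def run_replications(n, m, seed):
    """Run m replications of size n, storing only what later analyses need."""
    rng = random.Random(seed)
    stored = []
    for rep in range(m):
        data = simulate_scores(n, rng)
        mean = sum(data) / n
        sd = (sum((x - mean) ** 2 for x in data) / (n - 1)) ** 0.5
        # Keep the summary values and discard the raw data set.
        stored.append({"rep": rep, "mean": mean, "sd": sd,
                       "min": min(data), "max": max(data)})
    return stored


# Pilot run: small n and m to check the data before the full simulation.
pilot = run_replications(n=30, m=20, seed=1)
# T scores are unbounded in theory, so this range is only a rough plausibility screen.
suspicious = [row for row in pilot if row["min"] < -10 or row["max"] > 110]
print(len(pilot), len(suspicious))
```

Once the pilot output looks reasonable, the same function can be called with the full values of n and m, and the stored rows can be written to disk so the simulation does not need to be repeated.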

5. Evaluate the simulated data. Evaluating the simulations involves typical data analysis as well as a validation analysis. The process involved in the typical data analysis depends on the purpose of the simulation study, and should have been specified as part of Step 3. Make sure that the statistical model used to analyze the simulated data converges for each data set. If not, then record how many data sets failed and determine whether those particular data sets need to be discarded and new ones generated or whether the failures require a post hoc change to some aspects of the conceptual model to omit certain scenarios that cannot be simulated reliably.

Validating the simulation process is similar to validating the use of test scores in that there is not a single statistic that will give you this information; instead, you need to gather multiple pieces of information and make decisions based on all the evidence available. One source of evidence (results validity) comes from comparing results from the simulated data with results from data generated from the actual existing system. Of course, this assumes that data are available from the existing system, which may not be the case. Another source of evidence (face validity) comes from determining whether the simulation results are consistent with how you perceive the system should operate based on what you learned in Step 2.

A third type of validity evidence comes from gathering data from performance measures, which compare the simulated results with the parameter values used to simulate the data. The three most common performance measures examine bias, efficiency, and coverage, although there are many others available (see Bandalos & Leite, 2013; Burton et al., 2006; Muthén & Muthén, 2002). I discuss each of the performance measures conceptually and provide the formulae in Table 1. Except for bias, they all require m > 1, so they can only be estimated for MC studies.

In statistics, bias is the difference between a statistic’s value (estimated from sample data) and the value of the population parameter. Thus, calculating parameter bias requires subtracting the value of the parameter from the mean value of the statistic used to estimate the parameter across the m data sets. In an unreplicated study, m = 1; therefore, parameter bias is estimated from a single data set.

The problem with calculating parameter bias is that the scale depends on the metric of the parameter, so determining severity is difficult. Thus, it is common in simulation studies to transform the scale of the bias measure to a proportion. One method (relative bias) requires dividing the bias value by the value of the parameter, which works as long as the parameter value is not zero. An alternative for MC studies (standardized bias) requires dividing bias by the ESE. No matter what bias measure is used, smaller values indicate less bias.

Statistical efficiency is related to the statistic’s sampling variance, with less variability being preferred. One measure of efficiency is the mean square error (MSE), which is the sum of the squared parameter bias and squared ESE (i.e., empirical sampling variance). If there is bias, then MSE represents the overall accuracy of the parameter estimation. If there is no bias, then MSE is a measure of the sampling variance of the statistic. The square root of the MSE (RMSE) transforms the MSE back onto the same scale as the parameter, which makes interpretation somewhat easier. If there is no bias, RMSE ≈ ESE.
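The decomposition behind the MSE entry in Table 1 can be written out as a short worked derivation. It is a sketch that assumes the divisor-m definitions of $\bar{\hat{θ}}$ and ESE given in Table 1 and measures squared error around the parameter value $θ$.

```latex
\begin{aligned}
\frac{1}{m}\sum_{i=1}^{m}(\hat{θ}_{i} - θ)^{2}
  &= \frac{1}{m}\sum_{i=1}^{m}\bigl[(\hat{θ}_{i} - \bar{\hat{θ}}) + (\bar{\hat{θ}} - θ)\bigr]^{2} \\
  &= \frac{1}{m}\sum_{i=1}^{m}(\hat{θ}_{i} - \bar{\hat{θ}})^{2}
     + \frac{2(\bar{\hat{θ}} - θ)}{m}\sum_{i=1}^{m}(\hat{θ}_{i} - \bar{\hat{θ}})
     + (\bar{\hat{θ}} - θ)^{2} \\
  &= ESE(\hat{θ})^{2} + (\bar{\hat{θ}} - θ)^{2}
\end{aligned}
```

The middle term drops out because deviations from the mean sum to zero. When the bias term is zero, the MSE reduces to the squared ESE, so the RMSE equals the ESE.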

Confidence interval coverage is the proportion of samples in which the parameter value is contained in the statistic’s CI. The coverage should be approximately equal to the nominal coverage rate (e.g., 95% of the m samples for a 95% CI). Under-coverage (i.e., coverage << .95 for a 95% CI) indicates that the CI is too narrow, which results in an increase in finding effects when they are not there (i.e., type I errors). Over-coverage (i.e., coverage >> .95 for a 95% CI) indicates that the CI is too wide, which results in an increase in not finding effects when they are there (i.e., type II errors).
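The bias, efficiency, and coverage measures can be computed together from the stored results of an MC study. The sketch below is a hypothetical Python example (standard library only; the article's own syntax is in R) that estimates a population mean. It assumes a known population SD so that a simple z interval can be used for the 95% CI; the design values n, m, theta, and sigma are made up for illustration.

```python
import math
import random

rng = random.Random(2006)

n, m = 40, 2000      # hypothetical design values
theta = 50.0         # parameter value used to generate the data (population mean)
sigma = 10.0         # population SD, treated as known for the interval below (an assumption)

estimates = []
covered = 0
for _ in range(m):
    sample = [rng.gauss(theta, sigma) for _ in range(n)]
    estimate = sum(sample) / n
    estimates.append(estimate)
    # 95% CI for the mean using the known-sigma z interval.
    half_width = 1.96 * sigma / math.sqrt(n)
    if estimate - half_width <= theta <= estimate + half_width:
        covered += 1

mean_estimate = sum(estimates) / m
ese = math.sqrt(sum((e - mean_estimate) ** 2 for e in estimates) / m)

bias = mean_estimate - theta
relative_bias = bias / theta            # requires theta != 0
standardized_bias = bias / ese
mse = bias ** 2 + ese ** 2
rmse = math.sqrt(mse)
coverage = covered / m

print(round(bias, 3), round(relative_bias, 4), round(standardized_bias, 3),
      round(rmse, 3), round(coverage, 3))
```

Because the sample mean is unbiased and the interval is correctly specified in this toy setup, the bias should be near zero, the RMSE should be close to the ESE, and the coverage should be close to .95.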

In addition to performance measures, it may be useful to conduct a sensitivity analysis, especially for unreplicated studies. A sensitivity analysis involves slightly changing the values of parameters of interest to see how the outcomes (or performance measures) change. If there is a substantial change, this would indicate that those aspects of the model have a large influence on the results, so their values should be selected very carefully.
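A sensitivity analysis of this kind might look like the following Python sketch (standard library only; the article's own syntax is in R). It perturbs a correlation parameter by ±.05 around a baseline and tracks one outcome: the share of simulated cases whose two standardized scores both fall below a threshold. The function name, baseline value, perturbation size, threshold, n, and seed are all hypothetical.

```python
import math
import random


def proportion_both_below(r, threshold, n, seed):
    """Unreplicated simulation: share of n cases with both standardized scores below threshold."""
    rng = random.Random(seed)
    both = 0
    for _ in range(n):
        z1 = rng.gauss(0.0, 1.0)
        z2 = r * z1 + math.sqrt(1.0 - r ** 2) * rng.gauss(0.0, 1.0)
        if z1 < threshold and z2 < threshold:
            both += 1
    return both / n


# Perturb the correlation slightly around a baseline value; the same seed is reused so that
# differences reflect the parameter change rather than different random draws.
baseline_r = 0.60
results = {r: proportion_both_below(r, threshold=-1.0, n=50000, seed=7)
           for r in (baseline_r - 0.05, baseline_r, baseline_r + 0.05)}
for r, value in sorted(results.items()):
    print(round(r, 2), round(value, 4))
```

If the outcome moves substantially across these small perturbations, the correlation value deserves extra care when it is chosen for the main study.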

6. Disseminate the results. The last step of a simulation study involves communicating the results. It is important not only to communicate what you found but also to describe your study in sufficient detail so that others can replicate it. Boomsma, Hoyle, and Panter (2012), Hoaglin and Andrews (1975), and the Psychometric Society (1979) all provide explanations of what to include in such communications.

Example

Macmann, Barnett, Lombard, Belton-Kocher, and Sharpe (1989) were interested in studying the dependability of actuarial methods to identify an SLD using the discrepancy model. While this question is now somewhat outdated, they simulated data for part of their study and provided enough details so that it can be replicated.

Step 1: Macmann et al. (1989) investigated the concordance of aptitude-achievement discrepancy scores calculated using different score comparison methods. Of major interest to them was the classification agreement between the methods.

This problem is well suited for a simulation study. First, there is currently no “gold standard” criterion for SLD, so it is impossible to determine any given method’s true accuracy. Second, although Macmann et al. (1989) were able to collect “real” data, the sample sizes were small (ns ranging from 106 to 298) and consisted only of students referred for evaluation of a suspected SLD. Thus, these results have questionable generalizability.

Step 2: Macmann et al. (1989) developed their conceptual model based on values from their student data as well as the two essential factors that determine the reliability of an actuarial classification procedure: score correlation and diagnostic threshold. One possible path diagram of the model is given in Figure 1. Since thresholds were not directly included in the simulation of the data, the path diagram is relatively simple, consisting of five parameters: 1 covariance (r), 2 means, and 2 SDs. To standardize the variables and covariance, both means were set to zero and both SDs were set to one. They thought that both test scores jointly followed a bivariate normal distribution and that the correlation parameter ranged between .60 and .95. To determine SLD presence and diagnosis, they used two diagnostic thresholds: −1.00 and −1.96.
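The conceptual model can be sketched in code by generating pairs of standardized scores with a chosen population correlation. The Python example below (standard library only; an assumption, since the article's own syntax is in R) builds the second score as a weighted sum of the first score and independent noise, which yields a bivariate normal distribution with means of zero, SDs of one, and correlation r. The function names and seed are hypothetical.

```python
import math
import random


def simulate_bivariate_normal(n, r, rng):
    """Return n pairs of standardized scores (means 0, SDs 1) with population correlation r."""
    pairs = []
    for _ in range(n):
        z1 = rng.gauss(0.0, 1.0)
        e = rng.gauss(0.0, 1.0)
        z2 = r * z1 + math.sqrt(1.0 - r ** 2) * e
        pairs.append((z1, z2))
    return pairs


def pearson_r(pairs):
    """Sample Pearson correlation of a list of (x, y) pairs."""
    n = len(pairs)
    mean_x = sum(x for x, _ in pairs) / n
    mean_y = sum(y for _, y in pairs) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in pairs)
    sxx = sum((x - mean_x) ** 2 for x, _ in pairs)
    syy = sum((y - mean_y) ** 2 for _, y in pairs)
    return sxy / math.sqrt(sxx * syy)


rng = random.Random(1989)
scores = simulate_bivariate_normal(n=5000, r=0.60, rng=rng)
sample_r = pearson_r(scores)
below_test1 = sum(1 for z1, _ in scores if z1 < -1.00)
print(round(sample_r, 2), below_test1)
```

The sample correlation should be close to .60, and roughly 16% of the Test 1 scores should fall below the −1.00 threshold, as expected under a standard normal distribution.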

Figure 1.

Path diagram of Macmann, Barnett, Lombard, Belton-Kocher, and Sharpe’s (1989) conceptual model.

Step 3: The sample statistics Macmann et al. (1989) calculated were related to SLD identification decisions. First, they calculated an agreement matrix: the positive agreement (scores on Tests 1 and 2 below threshold), negative agreement (scores on Tests 1 and 2 above threshold), negative disagreement (only score on Test 2 below threshold), and positive disagreement (only score on Test 1 below threshold). Second, they calculated multiple measures of classification agreement, although I only calculate one for this example: kappa (κ; Cohen, 1960).
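The agreement matrix and κ can be computed with two small functions. This is a Python sketch using only the standard library (the article's own syntax is in R); the function names and the toy scores are hypothetical, while the cell definitions follow the text above and the second example uses cell counts reported in Table 3. The key `NegA` follows the article's abbreviation for negative agreement.

```python
def agreement_matrix(pairs, threshold):
    """Count the four agreement cells for (test1, test2) score pairs at a given threshold."""
    cells = {"PA": 0, "NegA": 0, "ND": 0, "PD": 0}
    for test1, test2 in pairs:
        low1, low2 = test1 < threshold, test2 < threshold
        if low1 and low2:
            cells["PA"] += 1      # both scores below threshold
        elif not low1 and not low2:
            cells["NegA"] += 1    # both scores above threshold
        elif low2:
            cells["ND"] += 1      # only Test 2 below threshold
        else:
            cells["PD"] += 1      # only Test 1 below threshold
    return cells


def cohen_kappa(cells):
    """Cohen's (1960) kappa for a 2 x 2 agreement matrix; None if chance agreement equals 1."""
    n = sum(cells.values())
    observed = (cells["PA"] + cells["NegA"]) / n
    pos1, pos2 = cells["PA"] + cells["PD"], cells["PA"] + cells["ND"]
    expected = (pos1 * pos2 + (n - pos1) * (n - pos2)) / n ** 2
    if expected == 1.0:
        return None
    return (observed - expected) / (1.0 - expected)


# Tiny hand-made data set (hypothetical scores) to show the classification rules.
toy = [(-1.5, -1.2), (0.3, 0.8), (0.1, -1.1), (-1.3, 0.2), (0.5, 0.4)]
print(agreement_matrix(toy, threshold=-1.00))

# Cell counts reported in Table 3 for Macmann et al.'s r = .60, threshold = -1.00 condition.
table3_cells = {"PA": 377, "NegA": 3726, "ND": 420, "PD": 477}
print(round(cohen_kappa(table3_cells), 2))
```

Applied to the Table 3 counts, the function returns a κ that rounds to 0.35, matching the value reported for that condition.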

Macmann et al. (1989) conducted an unreplicated study, simulating m = 1 sample of n = 5,000 individuals for each set of model conditions (see Table 2). While I replicate this scenario, I also extend the design to an MC study. I do this by simulating m = 5,000 samples of n = 200 for each set of model conditions. The value of n was taken to represent the sample sizes Macmann et al. reported for their student data.

Table 2.

Model Condition Values for Macmann, Barnett, Lombard, Belton-Kocher, and Sharpe (1989) Study.

	Correlation between two measures
	.60	.80	.95
Threshold
−1.00	US: n = 5,000; m = 1	US: —	US: —
−1.00	MC: n = 200; m = 5,000	MC: n = 200; m = 5,000	MC: n = 200; m = 5,000
−1.96	US: n = 5,000; m = 1	US: n = 5,000; m = 1	US: n = 5,000; m = 1
−1.96	MC: n = 200; m = 5,000	MC: n = 200; m = 5,000	MC: n = 200; m = 5,000

Note. US = unreplicated study; — = data not simulated under this condition in original article; MC = Monte Carlo study; m = number of simulations; n = sample size. All data were simulated to follow a bivariate normal distribution.
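The design in Table 2 can also be laid out programmatically, which helps keep track of the conditions when looping over them. The following Python sketch (standard library only; the article's own syntax is in R) crosses the correlation and threshold values and attaches the n and m used for the unreplicated (US) and MC designs. The dictionary keys and variable names are hypothetical.

```python
from itertools import product

correlations = (0.60, 0.80, 0.95)
thresholds = (-1.00, -1.96)

# Conditions the original unreplicated study did not simulate (the dashes in Table 2).
not_in_original = {(0.80, -1.00), (0.95, -1.00)}

design = []
for r, threshold in product(correlations, thresholds):
    if (r, threshold) not in not_in_original:
        design.append({"study": "US", "r": r, "threshold": threshold, "n": 5000, "m": 1})
    design.append({"study": "MC", "r": r, "threshold": threshold, "n": 200, "m": 5000})

for row in design:
    print(row)
```

Each row of `design` corresponds to one cell entry in Table 2, so a simulation loop can iterate over the rows and read its settings from them.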

Step 4: In this article’s Appendix, I provide R syntax for replicating the unreplicated simulation study as well as the extension to an MC study.

Steps 5 and 6: The results from the unreplicated simulation are shown in Table 3. The one performance measure that can be calculated (bias) indicated that the simulated data’s correlations, on average, are the same as the population values. Moreover, the agreement matrix and κ values are similar to Macmann et al.’s (1989) study, with any differences likely due to using different pseudorandom number generators or different seeds for the generators (Brooks, Barcikowski, & Robey, 1999).

Table 3.

Agreement Data From Macmann, Barnett, Lombard, Belton-Kocher, and Sharpe (1989) and Current Study.

Model conditions		Macmann et al.’s study						The present study
Cor	Threshold	r	ND	NA	PA	PD	κ	r	ND	NA	PA	PD	κ
0.60	−1.00	—	420	3,726	377	477	0.35	0.60	424	3,802	370	404	0.37
0.60	−1.96	—	92	4,774	29	105	0.21	0.60	99	4,763	43	95	0.29
0.80	−1.96	—	67	4,799	63	71	0.46	0.80	65	4,800	64	71	0.47
0.95	−1.96	—	39	4,827	95	39	0.70	0.95	35	4,832	90	43	0.69

Note. Cor: correlation parameter value; ND = negative disagreement; NA = negative agreement; PA = positive agreement; PD = positive disagreement; — = not reported.

For the MC study, some of the simulated data sets did not have values for each cell of the agreement matrix, which produced a missing value for κ. For those data sets, I removed all the observations and resimulated a new data set so that each condition had m = 5,000 data sets. The results from the MC study are in Table 4. The left part of the table gives the values from the performance measures; the right part gives summary statistics for the m = 5,000 κ values simulated for each model condition.
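The discard-and-resimulate procedure can be sketched as a loop that keeps drawing data sets until m usable κ values have been collected. The Python example below (standard library only; it is not the article's R syntax) does this for the r = .60, −1.00 threshold condition with n = 200. To keep the sketch quick, it assumes m = 1,000 instead of the 5,000 used in the article; the function name and seed are hypothetical, and the percentile ranks are approximate.

```python
import math
import random


def simulate_kappa(n, r, threshold, rng):
    """Simulate one data set of size n and return kappa, or None if any agreement cell is empty."""
    pos_agree = neg_agree = neg_dis = pos_dis = 0
    for _ in range(n):
        z1 = rng.gauss(0.0, 1.0)
        z2 = r * z1 + math.sqrt(1.0 - r ** 2) * rng.gauss(0.0, 1.0)
        low1, low2 = z1 < threshold, z2 < threshold
        if low1 and low2:
            pos_agree += 1
        elif not low1 and not low2:
            neg_agree += 1
        elif low2:
            neg_dis += 1
        else:
            pos_dis += 1
    if 0 in (pos_agree, neg_agree, neg_dis, pos_dis):
        return None
    observed = (pos_agree + neg_agree) / n
    expected = ((pos_agree + pos_dis) * (pos_agree + neg_dis)
                + (neg_agree + neg_dis) * (neg_agree + pos_dis)) / n ** 2
    return (observed - expected) / (1.0 - expected)


rng = random.Random(2017)
n, m = 200, 1000          # the article uses m = 5,000; m is reduced here to keep the sketch quick
kappas, discarded = [], 0
while len(kappas) < m:
    kappa = simulate_kappa(n, r=0.60, threshold=-1.00, rng=rng)
    if kappa is None:
        discarded += 1    # discard and resimulate so that m usable data sets remain
    else:
        kappas.append(kappa)

mean_kappa = sum(kappas) / m
ese_kappa = math.sqrt(sum((k - mean_kappa) ** 2 for k in kappas) / m)
ordered = sorted(kappas)
lower, upper = ordered[int(0.025 * m)], ordered[int(0.975 * m) - 1]
print(round(mean_kappa, 2), round(ese_kappa, 2), round(lower, 2), round(upper, 2), discarded)
```

Recording the number of discarded data sets alongside the summary statistics makes it possible to report how often the problem occurred in each condition.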

Table 4.

Results From Monte Carlo Study With n = 200 for Each Condition.

Model conditions		Performance measures				Summary
Cor	Threshold	$\bar{\hat{θ}}$	$E S E (\hat{θ})$	RMSE	Coverage	$\bar{\hat{κ}}$	$E S E (\hat{κ})$
.60	−1.00	0.60	0.05	0.05	0.94	0.35	0.09
.60	−1.96	0.60	0.05	0.05	0.95	0.31	0.12
.80	−1.00	0.80	0.03	0.03	0.95	0.54	0.08
.80	−1.96	0.80	0.03	0.03	0.95	0.45	0.15
.95	−1.00	0.95	0.01	0.01	0.95	0.77	0.07
.95	−1.96	0.95	0.01	0.01	0.95	0.68	0.14

Note. See Table 1 for description of performance measures and summary statistics. Cor = correlation parameter value; $\bar{\hat{κ}}$ = average κ value across m samples; $E S E (\hat{κ})$ = empirical standard error for κ.

The performance measures indicate that the MC simulations created good sample data: The bias is minimal (the average correlation equals the parameter value to two decimal places in each condition), the RMSE is low, and coverage is between .94 and .95 for the 95% CI. The average κ values from the MC study are similar to the values from the unreplicated study, with any differences likely due to sampling because the MC study used a much smaller n than the unreplicated study. In addition to the average value of κ, the MC study provides an indication of κ’s sampling variability. For example, for the model conditions of r = .60 and a −1.00 threshold, the average value of κ is .35 and ESE is .09. Although not shown in the table, the values for κ at the 2.5 and 97.5 percentiles are .18 and .53. These are the values for the lower and upper bound of the empirical 95% CI. Thus, if the true correlation between tests used for SLD identification was .60 and was evaluated across 200 students using a threshold of −1.00 for diagnosis, it would not be uncommon to find a κ value ranging anywhere from .18 to .53.

This MC analysis can extend beyond descriptive statistics to compare the values of κ across the conditions. For example, a two-way ANOVA (correlation by threshold) shows that there are differences between κ values across the model conditions. The most prominent factor in these differences is the correlation between the measures (i.e., higher correlations produce higher κ values), although the threshold used also contributes to agreement differences (i.e., in Table 4, the less extreme threshold of −1.00 produces higher κ values than the threshold of −1.96). The generalized eta-squared ( $η_{G}^{2}$ ) values for the correlation and threshold factors are .66 and .09, respectively, while $η_{G}^{2}$ for the correlation-threshold interaction is .01.

The results from both the unreplicated and MC studies show that SLD classification reliability using the discrepancy model is generally low for the conditions used in Macmann et al.’s (1989) study. Kappa levels can be made more acceptable by using measures that are strongly correlated (i.e., r = .95) or using less extreme diagnostic thresholds (e.g., −1.00 rather than −1.96), but these conditions are typically uncommon in clinical practice. Thus, Macmann et al.’s main conclusion (that the problems with using score discrepancies for SLD classification cannot be resolved through using “better” measures or statistical formulae) applies here as well. Moreover, their suggestion of creating expectancy tables to describe the effects of score correlation and cutoff values on classification agreement, for situations where using aptitude-achievement discrepancies is required by administrative policy, is essentially realized in the right part of Table 4. In the absence of “real” agreement data for a given set of test scores, such a table could easily be extended to include other model conditions as well as other measures of classification agreement.

Summary

Simulation studies can be a powerful tool for understanding the essential aspects of psychological and educational systems. While historically these methods were not accessible to clinicians or clinically oriented researchers, this is no longer the case. The availability of computers with powerful processing capabilities along with readily usable software has allowed simulation studies to be part of studying clinical decision making. While simulated data will never be able to replace data actually collected from the variables of interest, they are well suited for situations where it is not feasible to collect the necessary data. Hopefully, this tutorial article can help individuals use simulation studies to understand such problems.

Appendix

This appendix provides R syntax to conduct the simulation studies described in this article. The syntax uses loops and is designed for didactic purposes, making it functional but not computationally efficient. For information on writing more efficient R syntax to simulate data, see Hallgren (2013) or Robert and Casella (2009).

For both the unreplicated and MC studies, I only analyze data from one of the model conditions. Analyzing the other data sets requires straightforward modifications of the syntax. In R, NA is used to indicate missing values, so I abbreviate negative agreement using NegA. Any line starting with a pound symbol (#) is a comment.

Declaration of Conflicting Interests

The author(s) declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.

Funding

The author(s) received no financial support for the research, authorship, and/or publication of this article.

References

Bandalos, D. L., & Leite (2013). The use of Monte Carlo studies in structural equation modeling research. In Hancock, G. R., & Mueller, R. O. (Eds.), Structural equation modeling: A second course (2nd ed., pp. 625-666). Charlotte, NC: Information Age Publishing.

Beaujean, A. A. (2014). Sample size determination for regression models using Monte Carlo methods in R. Practical Assessment, Research, and Evaluation, 19(12), 1-16. Retrieved from http://pareonline.net/getvn.asp?v=19&n=12

Boomsma, Hoyle, R. H., & Panter, A. T. (2012). The structural equation modeling research report. In Hoyle, R. H. (Ed.), Handbook of structural equation modeling (pp. 341-358). New York, NY: Guilford Press.

Brooks, G. P., Barcikowski, R. S., & Robey, R. R. (1999, April). Monte Carlo simulation for perusal and practice. Paper presented at the annual meeting of the American Educational Research Association, Montreal, Quebec, Canada.

Buckland, S. T. (1984). Monte Carlo confidence intervals. Biometrics, 40, 811-817. doi:10.2307/2530926

Burton, Altman, D. G., Royston, & Holder, R. L. (2006). The design of simulation studies in medical statistics. Statistics in Medicine, 25, 4279-4292. doi:10.1002/sim.2673

Cohen (1960). A coefficient of agreement for nominal scales. Educational and Psychological Measurement, 20, 37-46. doi:10.1177/001316446002000104

Crawford, J. R., Garthwaite, P. H., & Gault, C. B. (2007). Estimating the percentage of the population with abnormally low scores (or abnormally large score differences) on standardized neuropsychological test batteries: A generic method with applications. Neuropsychology, 21, 419-430. doi:10.1037/0894-4105.21.4.419

Fan (2012). Designing simulation studies. In Cooper, Camic, P. M., Long, D. L., Panter, A. T., Rindskopf, & Sher, K. J. (Eds.), APA Handbook of Research Methods in Psychology: Vol. 3. Data analysis and research publication (pp. 427-444). Washington, DC: American Psychological Association.

Feinberg, R. A., & Rubright, J. D. (2016). Conducting simulation studies in psychometrics. Educational Measurement, 35, 36-49. doi:10.1111/emip.12111

Gentle, J. E. (2003). Random number generation and Monte Carlo methods (2nd ed.). New York, NY: Springer.

Hallgren, K. A. (2013). Conducting simulation studies in the R programming environment. Tutorials in Quantitative Methods for Psychology, 9, 43-60.

Harwell, M. R. (1997). Analyzing the results of Monte Carlo studies in item response theory. Educational and Psychological Measurement, 57, 266-279. doi:10.1177/0013164497057002006

Harwell, M. R., Stone, C. A., Hsu, T.-C., & Kirisci (1996). Monte Carlo studies in item response theory. Applied Psychological Measurement, 20, 101-125. doi:10.1177/014662169602000201

Hoaglin, D. C., & Andrews, D. F. (1975). The reporting of computation-based results in statistics. The American Statistician, 29, 122-126. doi:10.2307/2683438

Hoyle, R. H., & Smith, G. T. (1994). Formulating clinical research hypotheses as structural equation models: A conceptual overview. Journal of Consulting and Clinical Psychology, 62, 429-440. doi:10.1037/0022-006X.62.3.429

Law, A. M. (2006). Simulation modeling and analysis (4th ed.). New York, NY: McGraw-Hill Higher Education.

Lee (2015). Implementing a simulation study using multiple software packages for structural equation modeling. SAGE Open, 5, 1-16. doi:10.1177/2158244015591823

Loehlin, J. C., & Beaujean, A. A. (2017). Latent variable models: An introduction to factor, path, and structural equation analysis (5th ed.). New York, NY: Routledge.

Macmann, G. M., Barnett, D. W., Lombard, T. J., Belton-Kocher, & Sharpe, M. N. (1989). On the actuarial classification of children: Fundamental studies of classification agreement. The Journal of Special Education, 23, 127-149. doi:10.1177/002246698902300202

Moreau (2014). Making sense of discrepancies in working memory training experiments: A Monte Carlo simulation. Frontiers in Systems Neuroscience, 8, Article 161. doi:10.3389/fnsys.2014.00161

Muthén, L. K., & Muthén, B. O. (2002). How to use a Monte Carlo study to decide on sample size and determine power. Structural Equation Modeling: A Multidisciplinary Journal, 9, 599-620. doi:10.1207/S15328007SEM0904_8

Paxton, Curran, P. J., Bollen, K. A., Kirby, & Chen (2001). Monte Carlo experiments: Design and implementation. Structural Equation Modeling: A Multidisciplinary Journal, 8, 287-312. doi:10.1207/S15328007SEM0802_7

Psychometric Society. (1979). Publication policy regarding Monte Carlo studies. Psychometrika, 44, 133-134.

R Development Core Team. (2016). R: A language and environment for statistical computing (Version 3.2.3) [Computer program]. Vienna, Austria: R Foundation for Statistical Computing.

Robert, & Casella (2009). Introducing Monte Carlo methods with R. New York, NY: Springer.

Shadish, W. R., & Sullivan, K. J. (2011). Characteristics of single-case designs used to assess intervention effects in 2008. Behavior Research Methods, 43, 971-980. doi:10.3758/s13428-011-0111-y

Skrondal (2000). Design and analysis of Monte Carlo experiments: Attacking the conventional wisdom. Multivariate Behavioral Research, 35, 137-167. doi:10.1207/S15327906MBR3502_1

Stuebing, K. K., Fletcher, J. M., Branum-Martin, & Francis, D. J. (2012). Evaluation of the technical adequacy of three methods for identifying specific learning disabilities based on cognitive discrepancies. School Psychology Review, 41, 3-22.