Development of a Research Methods and Statistics Concept Inventory

Abstract

Research methods and statistics are core courses in the undergraduate psychology major. To assess learning outcomes, it would be useful to have a measure that assesses research methods and statistical literacy beyond course grades. In two studies, we developed and provided initial validation results for a research methods and statistical knowledge concept inventory for eventual use in further scholarship of teaching and learning. In Study 1, we created vignettes and administered open-ended questions to psychology subject pool students. In Study 2, we refined the vignettes and created multiple-choice items using participant responses from Study 1. After administering the measure to psychology subject pool students and a community-based sample of Amazon Mechanical Turk workers, we used item response theory to select 20 items to compose the final Psychological Research Inventory of Concepts.

Keywords

research methods, statistics, assessment, measure development

Research methods and statistics in the behavioral sciences are core domains of learning at the undergraduate level, consistent with Goal 2 of the American Psychological Association’s (APA, 2013) guidelines for the psychology major Version 2.0. Yet, courses in research methods and statistics are often “dreaded” by psychology students (Conners, McCown, & Roskos-Edwoldsen, 1998; Dempster & McCorry, 2009; Freng, Webber, Blatter, Wing, & Scott, 2011; Vittengl et al., 2004). These courses are typically taken because they are required, not out of genuine interest, with low levels of intrinsic motivation and low expectations of performing well, both of which are more likely to involve performance goals (e.g., obtaining a good grade) over actual mastery goals (e.g., learning and retaining the material; Pintrich, 2003). Most students are capable of learning the material for long enough to pass a test, but doing so does not ensure actual knowledge acquisition or the ability to apply the knowledge to a novel situation (e.g., reasoning).

Beyond APA’s guidelines, mastery of research methods and statistics also serves a practical function. Psychology has been criticized by other disciplines as a “soft” science (Berezow, 2012; Wilson, 2012). Even worse, psychology is considered “less scientific” than other fields (e.g., physics) even by teachers of psychology (Howell, Collisson, & King, 2014). Most students and faculty in psychology, when informing a new acquaintance that psychology is his or her field of study, can attest to receiving responses such as “Are you analyzing me right now?” that imply mind-reading abilities and an interest in psychotherapy, despite the broad nature of psychological science. Beyond simply “passing the word” that psychology is scientific, undergraduate psychology majors are likely to encounter popular press depictions of psychological research. An important global role of psychology education is to give students the ability to critically evaluate media reports of behavioral science (Lawson, Jordan-Fleming, & Bodle, 2015), even if the majority of undergraduate majors do not pursue careers in the discipline. In short, grasping the concepts of research methods and statistics is vital to the undergraduate psychology curriculum, which means it is important to ensure that teaching in research methods and statistics produces learning that lasts beyond the current semester.

How can we know if our teaching produces mastery learning? Traditional research on teaching techniques uses exam grades (e.g., Burkley & Burkley, 2009; Stansbury & Munro, 2013) or study-specific measures (Bachiochi et al., 2011) as outcome variables. Course and exam grades are not ideal as outcome measures because they typically do not reflect reasoning ability within the content area, and study-specific measures may not be generally applicable or reproducible. Researchers have also used longer term outcomes such as global measures given to senior majors upon graduation to assess methodology and statistics knowledge. As one example of this, the psychology Area Concentration Achievement Test (http://www.collegeoutcomes.com) assesses learning in several subareas, including experimental design and statistics. Other researchers have used the test successfully to demonstrate the influence of programmatic choices (e.g., sequencing and integration of methods and statistics) on scientific knowledge (Barron & Apple, 2015; Pliske, Caldwell, Calin-Jageman, & Taylor-Ritzler, 2015). However, this test is intended to assess learning in graduating seniors only and costs money to administer and score, which makes it less feasible for examination of teaching within a semester-long course.

To improve teaching and assessment in research methods and statistics, it would be useful to have a tool that helps instructors differentiate between students who simply perform well in a course (assessed via course grades) and those who actually learn the material (Hake, 2015). Other fields have developed such tools, called “concept inventories” (Anderson, Fisher, & Norman, 2002; Evans et al., 2003; Halloun & Hestenes, 1985a, 1985b; Wallace & Bailey, 2010). The physics inventory, for example, distinguishes actual reasoning about motion learned through physics coursework from lay or commonsense beliefs about motion (Halloun & Hestenes, 1985a, 1985b). Concept inventories serve multiple functions, including use as a placement exam to evaluate instruction and as a diagnostic test for identifying misconceptions. Concept inventories are typically quick and free to administer, simple to score, and easily accessible to instructors and researchers alike. As such, a research methods and statistics concept inventory may hold promise for assessment of teaching within semester- or year-long courses as well as for programmatic assessment of scientific literacy in psychology.

To date, psychologists (and other behavioral scientists) do not have a good measure of applied research methods and statistical knowledge. We have measures assessing interest in research (Vittengl et al., 2004), fears of math or statistics (Onwuegbuzie & Wilson, 2003), willingness to invest effort toward knowledge acquisition (Norris, Pacini, & Epstein, 1998), and measures of general critical thinking (Facione & Facione, 1994). However, we have no measure that can differentiate knowledge and reasoning from grades. Indeed, even measures that are conceptually closer, such as research self-efficacy, are aimed at graduate students actively engaging in production of research (Forester, Kahn, & Hesson-McInnis, 2004), which is a concept not applicable to the average undergraduate population. Researchers in mathematics developed the Statistics Concept Inventory (SCI; Allen, 2006) to test facility with probability, descriptive statistics (e.g., frequencies, data presentation and summary), distributions, and regression, among other topics. Although useful, the SCI does not assess practical aspects of interpreting and applying statistical results that are needed to succeed in a research methods course and to critically examine psychological literature or media reports of scientific findings. Finally, the best measure, or the measure that comes closest to evaluating scientific literacy in psychology, is the Psychological Critical Thinking Exam (Lawson, 1999). This newly revised exam (Lawson et al., 2015) uses vignettes about psychological research and has students identify problems in an open-ended format. The test appears to have adequate psychometric properties (Lawson et al., 2015), but as an open-ended test, it requires coding to score and would likely be less amenable to administration in large courses or large assessment programs, especially as a pretest/posttest format.

The purpose of the current project was to develop a concept inventory that would test core components of scientific knowledge, statistical literacy, and correct interpretation of study results. We aimed to model processes students would need to critically evaluate media reports of social science and behavioral research. Our goal was to create a measure that was brief enough that it could be completed within one class period (approximately 50 min to an hour), easy to score, and amenable to multiple administrations, similar to the concept inventories in other fields. In Study 1, we conducted a qualitative study by constructing vignettes and administering them with open-ended questions to obtain responses as the basis for multiple-choice foils. In Study 2, we administered an initially wide range of multiple-choice items and used item response theory (IRT) to select a set of 20 well-performing items for use in a multiple-choice concept inventory. Ultimately, our goal was to develop a measure that can be used for assessment of research methods and statistical literacy for use in the further scholarship of teaching and learning as well as for undergraduate psychology program assessment.

Study 1

Method

Design and Measure

To capture the full scope of concepts to cover in our instrument (Clark & Watson, 1995), we first generated a list of topics within the area of research methods and statistics. This list of topics was vetted by other experts (psychology faculty at a teaching of psychology conference). The topics included importance of replication, external influences on studies (e.g., sampling bias, experimenter bias), operationalization of variables, reliability and validity of measurement, interpreting correlations, correlation and causation, random assignment procedures, experimental design/confounds, comparing the strength of two studies, factorial designs, order effects, group-level data applied to the individual, interpreting mean differences (using both graphs and numbers, interpreting statistics), external validity and generalizability, influences on statistics (e.g., ceiling and floor effects, outliers, restriction of range), and a few “general” questions that did not fit into another category.

We then created two to four short vignettes per topic, each followed by a few yes/no questions (e.g., “do you agree with the conclusion of the researcher in this scenario?”) but mostly by open-ended questions asking the participants to critically evaluate the study or specific features of the study (e.g., method and conclusions). Two example vignettes from Study 1 are provided in Table 1.

Table 1.

Examples of Vignettes and Queries Used in Study 1.

Topic	Vignette	Question(s)	Common Responses
Group-level data applied to individual	According to published research, college freshman drink more than five drinks per week (5.7 drinks). Forty-five percent of freshman also reported binge drinking (drinking five or more drinks on a single occasion) one or more times per week. The same study showed that students more likely to binge drink are male, White, under 24, and residents of a fraternity	Your friend John says, “That study isn’t right. I’m White, 22, male, in a frat, and I only have three drinks a week.” How do you respond to him?	John is an outlier; John is in the minority; John is in the other 55%; John is not part of the 45%; With three drinks per week, John fits the stereotype closely enough; Just because that is the average, it doesn’t mean that the number applies to everyone individually; The study is an average of many people not just you; You are an exception to the rule
Reliability and validity of measurement	Burt is curious about his IQ score, so he looks online to find a free intelligence test. He takes the test, which asks him to complete several puzzles and word games, and gets an IQ of 115, which is above average (100 is the average score in the United States). He wants to post his result to his blog, but he forgets. A few days later, he remembers the test and takes it again, so that he can get a screenshot of his test result. This time he gets a score of 99. He is surprised, and takes it again just to be sure. The third time he gets a score of 105	(a) Do you think this test is a valid measure of intelligence? (b) Why or why not? If you said no, what are some problems with the measure?	“No” (88.9%): The test failed to provide a consistent score when taken multiple times; The test is not from a credible source; Free online intelligence tests do not provide a valid IQ measure; Valid IQ tests need to be given by a proctor/professional; The test is inaccurate because it did not provide the same response all 3 times the test was taken; The tests probably pull from a test bank and use different questions testing different areas each time the test was taken; Puzzles and word games are not valid ways of testing IQ. “Yes” (11.1%): The three scores are different, they are similar enough to think that the measure is valid

Participants and Procedure

Participants were 196 college students at a mid-South university who were participating in a psychology subject pool and received partial course credit in Introductory Psychology in exchange for completing the study. The participants were mostly college freshmen (59.2%), had an average age of 19.32 years (SD = 1.41), and were 56.6% women and 81.1% White. A sizable percentage (40.8%, n = 80) of the participants reported taking a psychology course in high school, and 33.3% indicated prior history with a statistics or research methods course (though not necessarily research methods in the behavioral sciences). After excluding 21 participants who did not complete the study in one sitting (identified by duration times of more than 150 min), the average time to complete a set of items was 53.34 min (SD = 23.30).

Participants completed the study online via Qualtrics (Provo, UT; see https://www.qualtrics.com/support/research-resources/cite-reference-qualtrics-research/) and were told the study would examine knowledge of and attitudes toward research methods and statistics in the behavioral sciences. Due to the large number of vignettes, each participant was randomized to one vignette per topic to prevent fatigue and to ensure that the length of the study session would be approximately 1 hr per person. Thus, each participant responded to an average of 16 vignettes, and each vignette was completed by between 49 and 98 students.
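The randomization described above can be illustrated with a minimal sketch. It assumes simple random selection of one vignette per topic; the topic labels follow the list in the Design and Measure section, whereas the vignette identifiers, the number of vignettes per topic (drawn here between two and four), and all function names are hypothetical and are not taken from the actual Qualtrics survey flow.

```python
import random

# Topic labels paraphrased from the Design and Measure section (16 topics).
TOPICS = [
    "Replication", "External influences", "Operationalization",
    "Reliability and validity", "Interpreting correlations",
    "Correlation and causation", "Random assignment", "Confounds",
    "Comparing studies", "Factorial designs", "Order effects",
    "Group-level data applied to individual", "Interpreting mean differences",
    "External validity", "Influences on statistics", "General",
]

def build_vignette_pool(rng):
    """Hypothetical pool: two to four placeholder vignette IDs per topic."""
    return {
        topic: [f"{topic} #{k + 1}" for k in range(rng.randint(2, 4))]
        for topic in TOPICS
    }

def assign_vignettes(pool, rng):
    """Pick one vignette per topic at random for a single participant."""
    return {topic: rng.choice(vignettes) for topic, vignettes in pool.items()}

def expected_responses_per_vignette(n_participants, n_vignettes_in_topic):
    """Expected number of participants seeing each vignette within a topic."""
    return n_participants / n_vignettes_in_topic

rng = random.Random(0)  # fixed seed so the illustration is repeatable
pool = build_vignette_pool(rng)
assignment = assign_vignettes(pool, rng)
print(len(assignment), "vignettes assigned to this participant")
```

Under this assumption, 196 participants spread over two to four vignettes per topic would be expected to yield roughly 49 to 98 responses per vignette, which is in line with the range reported above.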

Results and Brief Discussion

The two goals of this study were to evaluate our vignettes for readability and to use participant-generated responses to create multiple-choice items. In the service of our first goal, we identified several vignettes and questions that, based on the responses given, were not easily understood by our sample. For each of these vignettes, we either edited the vignette for clarity or marked it for elimination.

For our second goal, we examined the content of the responses qualitatively, focusing on general trends and themes that emerged when considering the responses to each question for each vignette. Research assistants grouped the responses by similar content or theme; most questions had several common responses (see Table 1 for example). We then examined the responses for well-phrased “correct” answers and well-phrased “incorrect” answers that we used as the basis for multiple-choice responses.

Overall, the open-ended questions were successful in generating a large amount of data and providing us insight into participants’ thought processes. We note three central findings from Study 1. First, there was significant variability in how participants responded, but neither ceiling nor floor effects were present; none of the questions generated responses that were all accurate or all inaccurate. Second, as we had hoped, the vignettes varied in their level of difficulty. Some of the vignettes (e.g., factorial designs, interpreting statistics) were associated with few correct answers, whereas others (e.g., interpreting correlations, group-level data applied to an individual) were associated with on-target answers from the majority of respondents. As we hoped to include items with a range of difficulty levels in our final measure, the qualitative findings were promising. Finally, the participants converged on similar themes in their responses. The common “incorrect” answers allowed us to generate plausible foils, which we would have had a difficult time generating on our own; as instructors of research methods and statistics, we can no longer easily “think like a novice.”

Study 2

The goal of Study 2 was to develop a multiple-choice version of the vignette-based measure described in Study 1 and to test it using Item Response Theory (IRT). For each vignette retained from Study 1, we converted the open-ended questions into question stems amenable to multiple-choice answers and developed four to five responses (1 correct and the remainder incorrect) for each question. The incorrect responses were generated from the most common answers given in Study 1 and reworded for uniformity and style when needed. With the intention of creating a final measure of about 20 items, our first revision reduced the item set to 38 questions based on 32 vignettes (several vignettes had multiple associated questions, such as the factorial design scenarios), retaining 2 vignettes per topic.

Method

Participants and Procedure

Data from Study 2 were collected from two sources: (a) individuals from a psychology subject pool at a large mid-Southern university and (b) individuals recruited from Amazon Mechanical Turk (MTurk), a web-based service where “workers” receive small amounts of money to complete online tasks, including questionnaires. Subject pool participants (n = 284) signed up for the study via an online study management program (Experimetrix) and received partial course credit for participation. MTurk participants (n = 474) were required to be from the United States and were paid US$3 for completing the full inventory plus demographics.

Of the 758 individuals who completed the study, 624 remained in the final sample after exclusions. We excluded people who reported they did not live in the United States (MTurk = 3, subject pool = 2) and people who did not click the “consent” box (MTurk = 11). We also excluded people with a duration of less than 20 min (MTurk = 38, subject pool = 27) after pilot testing revealed the average time to complete the measure was around 40 min. Based on this pilot, we considered it implausible for a person to read the vignettes and respond to all items in less than 20 min. We also excluded people who admitted they did not read the vignettes or put less than 70% effort into the study (MTurk = 27, subject pool = 67). In total, 58 people were excluded from the MTurk sample and 76 were excluded from the subject pool sample (note that many of the people excluded met more than one exclusion criterion). A greater percentage of the subject pool sample (26.76%) than of the MTurk sample (12.21%) was excluded, χ² = 25.82, p < .001. However, there were no significant differences between people included and excluded on gender or ethnicity.
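The comparison of exclusion rates can be sketched as a Pearson chi-square test on the 2 × 2 table of sample (MTurk vs. subject pool) by status (excluded vs. retained), built from the counts reported above. This is an illustrative reconstruction only: the function name is hypothetical, no continuity correction is applied, and the exact procedure behind the reported statistic is not stated, so the result should land near, but need not exactly match, the reported value.

```python
import math

def chi_square_2x2(a, b, c, d):
    """Pearson chi-square (no continuity correction) for the table [[a, b], [c, d]].

    Returns the statistic and its p value on 1 degree of freedom.
    """
    n = a + b + c + d
    chi2 = n * (a * d - b * c) ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))
    # With 1 df, the chi-square survival function equals erfc(sqrt(x / 2)).
    p_value = math.erfc(math.sqrt(chi2 / 2))
    return chi2, p_value

# Counts taken from the text: 474 MTurk and 284 subject pool completers,
# with 58 and 76 exclusions, respectively.
mturk_excluded, mturk_retained = 58, 474 - 58
pool_excluded, pool_retained = 76, 284 - 76

chi2, p_value = chi_square_2x2(
    mturk_excluded, mturk_retained, pool_excluded, pool_retained
)
print(f"chi-square = {chi2:.2f}, p = {p_value:.2g}")
```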

The 32 vignettes were administered in random order via Qualtrics. We also asked participants demographic questions (age, gender, and ethnicity) and several questions about educational history. Specifically, we asked about educational level (some high school, high school diploma, trade school certification, some college, bachelor’s degree, or advanced degree). We also asked for ACT and/or SAT scores and whether they had taken high school and/or college psychology or statistics courses. We also asked about completion of a college-level research methods course (assuming that the majority of high schools do not offer a research methods course specifically).

Results

Demographics of both samples are listed in Table 2. Unsurprisingly, the MTurk sample was older than the subject pool sample. The subject pool sample also had a higher percentage of women and lower ACT scores than the MTurk sample, with no differences in ethnicity. In terms of educational achievement, more of the MTurk sample had completed a college degree (39.7% had an undergraduate degree, and an additional 10.9% had an advanced degree beyond the bachelor’s), and only a small portion of MTurk workers were currently enrolled in college.

Table 2.

Demographic Characteristics for Participants in Study 2.

Variable	Total	MTurk (n = 416)	Subject Pool (n = 208)	Statistical Differences	d and 95% CI [LL, UL]
Age	30.09 (12.18)	35.38 (11.53)	19.50 (3.34)	t = 19.46**	1.65 [1.46, 1.84]
Gender	58.00% female	52.4% female	69.2% female	χ² = 15.98**	0.32 [0.17, 0.48]
Ethnicity	78% White	76.8% White	80.3% White	χ² = 1.30	0.09 [−0.07, 0.25]
% Current college students	42.7% current	14.1% current	100% current	χ² = 418.00**	2.84 [2.57, 3.11]
% With at least a college degree	34.0%	50.6%	0%	χ² = 159.79**	1.18 [0.99, 1.36]
ACT	26.42 (3.80)	27.21 (4.36), n = 116	25.93 (3.33), n = 187	t = 2.91**	0.34 [0.11, 0.58]
SAT	1,719.49 (286.53)	1,761.92 (322.05), n = 67	1,673.63 (236.51), n = 62	t = 1.76, p = .08	0.29 [−0.03, 0.61]
PRIC	8.80 (3.10)	8.92 (3.19)	8.56 (2.89)	t = 1.40	0.12 [−0.05, 0.29]

Note. CI = confidence interval; LL = lower limit; PRIC = Psychological Research Inventory of Concepts; MTurk = Amazon Mechanical Turk; UL = upper limit.

*p < .05. **p < .01.
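As an illustration of how the effect sizes in Table 2 relate to the summary statistics, the sketch below recomputes Cohen's d for the PRIC row from the reported means, standard deviations, and group sizes. It assumes a pooled-standard-deviation d and a common large-sample approximation for its standard error; the exact formulas used for the table are not stated, so the interval limits may differ slightly in the second decimal. Function names are hypothetical.

```python
import math

def cohens_d(m1, sd1, n1, m2, sd2, n2):
    """Cohen's d using the pooled standard deviation of two independent groups."""
    pooled_var = ((n1 - 1) * sd1 ** 2 + (n2 - 1) * sd2 ** 2) / (n1 + n2 - 2)
    return (m1 - m2) / math.sqrt(pooled_var)

def d_confidence_interval(d, n1, n2, z=1.96):
    """Approximate 95% CI for d (large-sample standard error; an assumption)."""
    se = math.sqrt((n1 + n2) / (n1 * n2) + d ** 2 / (2 * (n1 + n2)))
    return d - z * se, d + z * se

# PRIC row of Table 2: MTurk 8.92 (3.19), n = 416; subject pool 8.56 (2.89), n = 208.
d = cohens_d(8.92, 3.19, 416, 8.56, 2.89, 208)
lower, upper = d_confidence_interval(d, 416, 208)
print(f"d = {d:.2f}, 95% CI [{lower:.2f}, {upper:.2f}]")
```

An interval that spans zero, as here, corresponds to the nonsignificant t test reported for the PRIC row.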

Any person with a college degree or currently enrolled in college was also asked about their major and was coded as a psychology major (9.4%, n = 43), a nonpsychology major (85.8%, n = 394), or undetermined (including undeclared; 4.8%, n = 22). The proportion of psychology majors in the MTurk sample was not statistically different from that in the subject pool sample, χ² = 3.62, p = .16.

IRT

IRT is a sophisticated analytic strategy for evaluating measurement quality. Unlike classical test theory, which uses the simple percentage of correct responses as an index of item difficulty, IRT uses logistic functions to model the relationship between individual item response and participants’ abilities. A full description of IRT as used in scale design and assessment is not possible here (for use of IRT with concept inventories, see Bristow et al., 2012; Stone, Ye, Zhu, & Lane, 2009; Wang & Bao, 2010; see also Edelen & Reeve, 2007). Essentially, IRT provides item-level information for people with varying ability levels, modeled via item characteristic curves (ICCs), and allows for more nuanced decision-making when selecting items in the context of scale development (Edelen & Reeve, 2007). For the current analysis, items were recoded as correct or incorrect and thus modeled as dichotomous. We first conducted some basic item analyses (e.g., percentage correct and point biserials) on the full data set, with both samples combined. IRT analysis assumes the latent construct is unidimensional, so 5 items with negative point biserials were discarded. Point biserials and percentage correct of all remaining 33 items are listed in Table 3.
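The basic item analyses mentioned above (percentage correct and point biserials) can be sketched as follows. The response matrix is hypothetical toy data, not study data, and the sketch assumes the point biserial is computed as the Pearson correlation between a dichotomous item and the total score including that item; the source does not say whether a corrected item-total correlation was used.

```python
import math

def percent_correct(item_scores):
    """Percentage of respondents answering the item correctly (1 = correct)."""
    return 100.0 * sum(item_scores) / len(item_scores)

def pearson(x, y):
    """Pearson correlation between two equal-length numeric sequences."""
    mean_x, mean_y = sum(x) / len(x), sum(y) / len(y)
    cov = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    var_x = sum((xi - mean_x) ** 2 for xi in x)
    var_y = sum((yi - mean_y) ** 2 for yi in y)
    return cov / math.sqrt(var_x * var_y)

# Hypothetical toy data: rows are respondents, columns are items.
responses = [
    [1, 1, 1, 0],
    [1, 1, 0, 0],
    [1, 0, 0, 1],
    [0, 0, 0, 1],
    [1, 1, 1, 0],
    [0, 0, 0, 1],
]

totals = [sum(row) for row in responses]
items = list(zip(*responses))  # transpose to one tuple of scores per item

point_biserials = [pearson(item, totals) for item in items]
# Items that correlate negatively with the total score are candidates for removal.
flagged = [i for i, r in enumerate(point_biserials) if r < 0]

for i, item in enumerate(items):
    print(f"item {i}: {percent_correct(item):.1f}% correct, r_pb = {point_biserials[i]:.2f}")
print("negative point biserials:", flagged)
```

In this toy matrix only the last item is answered correctly mainly by low scorers, so it is the only one flagged.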

Table 3.

IRT Parameters for Items in Study 2.

Item	IRT Parameter a	IRT Parameter b	IRT Parameter c	Percentage Correct	Point Biserial	p Value for χ² Statistic	Retained in PRIC	Topic
Item 1	1.00	0.74	0.11	37.80	.51	.22		General
Item 2	0.47	0.24	0.17	39.80	.19	.03	Y	Replication
Item 4	0.27	1.20	0.22	54.90	.30	.0001		External influences
Item 5	0.63	−0.30	0.19	19.00	.10	.38	Y	External influences
Item 6	0.69	0.98	0.23	40.00	.15	.99	Y	Operationalization
Item 7	0.76	1.08	0.18	51.20	.11	.77		Operationalization
Item 8	0.62	2.32	0.19	16.50	.21	.95	Y	Reliability and validity
Item 9	0.62	1.62	0.14	39.50	.15	.26	Y	Reliability and validity
Item 10	0.85	−0.72	0.19	70.70	.38	.04	Y	Interpreting correlation
Item 11	0.78	−0.73	0.19	65.00	.40	.41		Interpreting correlation
Item 13	1.15	0.55	0.23	85.10	.20	.01	Y	Correlation/causation
Item 15	1.54	2.24	0.12	49.30	.29	.94	Y	Random assignment
Item 16	0.36	1.05	0.27	45.40	.31	.59		Confounds
Item 17	0.39	1.87	0.21	22.90	.12	.35	Y	Confounds
Item 18	0.76	3.38	0.17	39.40	.36	.13		Confounds
Item 20	0.88	1.85	0.32	67.80	.43	.77	Y	Comparing studies
Item 21	1.14	2.19	0.12	27.80	.20	.95		Main effects
Item 22	0.77	2.10	0.32	31.00	.30	.25		Main effects
Item 23	0.64	−0.61	0.20	73.80	.50	.02		Interactions
Item 24	0.43	−2.25	0.20	35.40	.28	.53	Y	Main effects
Item 25	0.51	0.93	0.24	80.30	.34	.96	Y	Main effects
Item 26	0.81	2.82	0.20	54.10	.31	.52	Y	Interactions
Item 27	0.68	−0.47	0.18	31.50	.33	.63	Y	Order effects
Item 28	1.06	1.47	0.24	73.60	.45	.58		Order effects
Item 29	0.52	−1.46	0.21	49.30	.46	.42		Applying data to individual
Item 30	0.54	0.40	0.20	13.40	.20	.14	Y	Applying data to individual
Item 31	0.87	1.49	0.18	14.60	.18	.91		Interpreting statistics
Item 32	1.61	2.26	0.11	16.60	.07	.89	Y	Interpreting statistics
Item 33	1.17	2.72	0.15	41.40	.35	.46	Y	Interpreting statistics
Item 35	0.82	0.98	0.20	47.00	.35	.97	Y	External validity
Item 36	0.64	0.82	0.22	53.10	.17	.25		Influences on statistics
Item 37	0.67	1.17	0.20	48.00	.31	.24	Y	Influences on statistics
Item 38	0.71	0.75	0.21	41.00	.32	.81	Y	Influences on statistics

Note. IRT = item response theory; PRIC = Psychological Research Inventory of Concepts; Y = yes.

Because a correct answer can be obtained purely by chance or via guesswork, we chose the three-parameter logistic (3PL) model, which adds a c parameter, or “guessing” parameter, to the two parameters included in a 2PL model. In IRT, b is the location parameter (also known as the difficulty parameter) and essentially indicates the ability level associated with a 50% probability of obtaining the correct response (strictly, in the 3PL model, the point halfway between the guessing floor c and 1). Higher b values (which typically range between −3 and 3, akin to z-scores) indicate that greater ability is needed to obtain the correct answer (i.e., harder items), whereas lower b values suggest easier items. The second parameter, the a parameter, also called the discrimination parameter (which typically ranges between 0 and 2), provides the slope of the ICC at the difficulty level associated with parameter b; steeper slopes are more representative of the latent construct and discriminate better between people with different abilities.
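To make the roles of the three parameters concrete, the following minimal sketch evaluates a 3PL item characteristic curve using the standard logistic form. The parameter values are those listed for Item 10 in Table 3; the function name and the optional scaling constant (set to 1.0 here, although some software applies 1.7) are assumptions made for illustration and do not reproduce the BILOG-MG estimation itself.

```python
import math

def icc_3pl(theta, a, b, c, scale=1.0):
    """Probability of a correct response under the 3PL model.

    theta: latent ability; a: discrimination; b: difficulty (location);
    c: lower asymptote ("guessing"); scale: optional scaling constant.
    """
    return c + (1.0 - c) / (1.0 + math.exp(-scale * a * (theta - b)))

# Item 10 parameters as listed in Table 3.
a, b, c = 0.85, -0.72, 0.19

for theta in (-3.0, -0.72, 0.0, 3.0):
    print(f"theta = {theta:5.2f}: P(correct) = {icc_3pl(theta, a, b, c):.3f}")
```

At theta equal to b the curve sits at (1 + c) / 2, that is, halfway between the guessing floor and 1, and for very low ability the probability approaches c rather than 0.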

After removing the 5 items with negative point biserials, we proceeded with fitting a 3PL model to the data using BILOG-MG software (Zimowski, Muraki, Mislevy, & Bock, 1996).¹ We then examined item parameters along with the ICCs for the categories of items included in the measure (all item parameters and percentage correct are listed in Table 3). Of note, we evaluated item fit not only by looking for nonsignificant χ² but also by examining fit plots. Beyond item fit, we examined the set of parameters to include items with varying difficulties (b values), and when items were otherwise similar, we typically retained the items with better discrimination (a values). We typically kept one vignette per topic, with a few exceptions (reliability and validity of measurement had two questions for one vignette, main effect and interactions had three questions for one vignette, interpreting statistics had two questions for one vignette, and we kept two “influences on statistics” vignettes). Although we retained 2 items with significant χ² (Items 2 and 10), the fit plots for these items did not suggest significant misfit (Wallace & Bailey, 2010). In total, 20 items were retained for the final version of the Psychological Research Inventory of Concepts (PRIC; full measure available from authors). To preserve integrity of the entire version, we present two example items with good fit statistics that were not retained in the final iteration in Table 4.

Table 4.

Examples of Multiple-Choice Items With Correct Answer in Boldface Font.

Example Item 1. What is the relationship between physiological activation and aggression? Dr. Longo brought a group of 3- to 4-year-old children into the lab and had them encounter a mildly stressful situation. He measured physiological activation (sweat) in response to stress, where higher scores represent a greater degree of activation. He then had the children interact with a human-like doll and assessed the verbal and physical aggression the children showed toward the doll. Below is the graphic depicting the relationship between physiological activation and aggression.

Describe the relationship between physiological activation and aggression in children.

(A) The more physiologically active children are, the less aggressive they are.

(B) Aggressive behavior increases when the stress level of the situation increases.

(D) John, who sweats a lot in response to social situations, is less aggressive than other children.

Example Item 2. Do you want to be happier? Scientists say: get a dog! A review of published research, conducted by Dr. Wells, suggests that dog owners, compared to nonpet owners, have fewer minor physical ailments (e.g., headaches, colds, and allergies) and fewer serious medical problems. One study showed that dog owners were more than 8.6% likely to be alive 1 year after a heart attack. Dogs can also smell certain types of cancer and low blood sugar associated with diabetes. Dr. Wells said, “It is possible that dogs can directly promote our well-being by buffering us from stress. The ownership of a dog can also lead to increases in physical activity and facilitate the development of social contacts, which may enhance physiological and psychological human health in a more indirect manner.” Describe additional research that would be required to confirm that dogs truly cause fewer health ailments.

(A) Researchers should include a group of people who own other types of pets (e.g., cats, rabbits, etc.) in addition to just dog owners to make sure that the effect is not just due to companionship.

(B) Researchers should investigate the age of pet owners, as those who are capable of taking care of a pet are likely younger and healthier than those who are not.

(C) Researchers should follow people with depression or other mental health problems who own dogs to see if dog ownership helps improve symptoms.

(D) Researchers should find a group of people who do not own pets and assign some of them to take care of a dog for a year and then compare the health of people who received the dog to those who did not get the dog.

(E) Researchers need to account for the fact that some dogs are sick or aggressive and likely cause more stress not less.

Initial Scores

The mean raw score of items correct out of 20 was 8.80 (SD = 3.10, range = 1–19) or 44.02% (SD = 15.51, range = 5–95%). Of note, the 20-item final version correlated at .90 with the full score, suggesting the 20-item version is a good representation of the overall constructs assessed. There were no differences between MTurk sample PRIC scores and subject pool scores (see Table 2).

We found that the PRIC score correlated significantly with participant-provided ACT scores, r = .40, p < .001 (n = 303), and SAT scores, r = .22, p = .01 (n = 129). Because we also asked about prior experience with psychology, research methods, and statistics courses, we were able to tabulate score differences based on educational history. Table 5 includes the PRIC scores based on past experience with psychology, statistics, or research methods courses. There were no differences in PRIC scores based on enrollment in high school statistics or psychology courses. However, individuals who had taken college-level coursework in statistics, research methods, or psychology scored higher on the PRIC than those who had not. We also asked people who had taken college coursework whether they had completed the course within the last year (or were currently enrolled), 1–5 years ago, or more than 5 years ago. Time since course completion was not significant for college statistics, F(2, 214) = 1.64, ns, η² = .02, or for college research methods, F(2, 97) = 1.14, ns, η² = .02. However, time since completion was significant for those who had completed a college psychology course, F(2, 428) = 4.18, p = .02, η² = .02: those who completed the course more than 5 years ago (M = 9.48, SD = 3.26) scored higher than those who completed the course within the last year or were currently enrolled (M = 8.54, SD = 2.99), p = .02. Those who completed the course 1–5 years ago (M = 9.26, SD = 3.23) did not differ significantly from either of the other groups in Bonferroni post hoc tests.
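As a rough check on the correlations reported above, the significance of r for a given n can be approximated with the Fisher z transformation. This is only a sketch: the function name is ours, and the original analysis most likely used the conventional t test for a correlation, so the p values it produces are approximations rather than the authors' exact figures.

```python
from math import atanh, sqrt
from statistics import NormalDist


def correlation_p_value(r: float, n: int) -> float:
    """Two-tailed p value for H0: rho = 0, using the Fisher z approximation."""
    z = atanh(r) * sqrt(n - 3)  # Fisher z scaled by its standard error
    return 2 * (1 - NormalDist().cdf(abs(z)))


# Values reported in the text: PRIC with ACT (r = .40, n = 303)
# and PRIC with SAT (r = .22, n = 129).
p_act = correlation_p_value(0.40, 303)
p_sat = correlation_p_value(0.22, 129)
print(f"ACT: p = {p_act:.2g}")
print(f"SAT: p = {p_sat:.3f}")
```

Under this approximation, the ACT correlation falls well below .001 and the SAT correlation lands near .01, consistent with the values reported above.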

Table 5.

PRIC Score Differences Based on Prior Statistics and Research Experiences in Study 2.

Item	No	Yes	t test	d and 95% CI [LL, UL]
High school-level statistics or methods class?	8.78 (3.07), n = 444	8.86 (3.18), n = 181	−0.27	0.02 [−0.15, 0.20]
High school-level psychology course?	8.97 (2.98), n = 368	8.60 (3.28), n = 252	1.45	0.11 [−0.04, 0.28]
College-level statistics course?	8.35 (2.85), n = 404	9.63 (3.38), n = 220	−5.00**	0.42 [0.25, 0.58]
College-level research methods course?	8.66 (2.96), n = 523	9.55 (3.67), n = 100	−2.64*	0.29 [0.07, 0.50]
College-level psychology course?	8.33 (2.91), n = 195	9.02 (3.16), n = 430	−2.60*	0.22 [0.05, 0.39]

Note. CI = confidence interval; LL = lower limit; PRIC = Psychological Research Inventory of Concepts; UL = upper limit.

*p < .05. **p < .01.
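The t and d values in Table 5 can be approximately recovered from the reported means, standard deviations, and group sizes. The sketch below assumes a pooled-variance independent-samples t test, with t signed as No minus Yes and d reported as an absolute value; these conventions are inferred from the pattern of signs in the table, not stated by the authors, and the function names are ours.

```python
from math import sqrt


def pooled_sd(sd_no, n_no, sd_yes, n_yes):
    """Pooled standard deviation of two independent groups."""
    return sqrt(((n_no - 1) * sd_no**2 + (n_yes - 1) * sd_yes**2)
                / (n_no + n_yes - 2))


def t_and_d(m_no, sd_no, n_no, m_yes, sd_yes, n_yes):
    """Return (t, d): t is signed as No minus Yes; d is an absolute value."""
    sp = pooled_sd(sd_no, n_no, sd_yes, n_yes)
    t = (m_no - m_yes) / (sp * sqrt(1 / n_no + 1 / n_yes))
    d = abs(m_no - m_yes) / sp
    return t, d


# College-level statistics course row of Table 5
t_stats, d_stats = t_and_d(8.35, 2.85, 404, 9.63, 3.38, 220)
# College-level research methods course row of Table 5
t_methods, d_methods = t_and_d(8.66, 2.96, 523, 9.55, 3.67, 100)

print(f"Statistics:       t = {t_stats:.2f}, d = {d_stats:.2f}")
print(f"Research methods: t = {t_methods:.2f}, d = {d_methods:.2f}")
```

Because the inputs are the rounded summary statistics from the table, small discrepancies in the second decimal place relative to the published values are to be expected.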

In terms of educational differences, a one-way analysis of variance revealed significant differences between education groups, F(2, 615) = 22.96, p < .001, η² = .07. Specifically, Bonferroni post hoc tests revealed that people with an advanced degree scored higher on the PRIC (M = 11.13, SD = 3.47) than people with a bachelor’s degree (M = 9.41, SD = 3.10), p < .001, who in turn scored higher than those without a bachelor’s degree (M = 8.29, SD = 2.89), p < .001.
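For a one-way analysis of variance, η² can be recovered from the F statistic and its degrees of freedom, because η² is the ratio of the between-groups sum of squares to the total sum of squares. The following minimal sketch (the function name is ours) applies that identity to the F values reported in this section.

```python
def eta_squared(f: float, df_between: int, df_within: int) -> float:
    """Eta squared for a one-way ANOVA, computed from F and its df."""
    return f * df_between / (f * df_between + df_within)


# Education-level comparison: F(2, 615) = 22.96
eta_education = eta_squared(22.96, 2, 615)
# Time since college psychology course: F(2, 428) = 4.18
eta_psych_timing = eta_squared(4.18, 2, 428)

print(f"Education level: eta squared = {eta_education:.2f}")
print(f"Psychology course timing: eta squared = {eta_psych_timing:.2f}")
```

The identity holds only for designs with a single factor; in multifactor designs the same formula yields partial η² instead.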

Brief Discussion

Our central goal for Study 2 was to test a multiple-choice version of our concept inventory, developed from the open-ended responses of Study 1, and reduce the item set to a well-performing 20 items using IRT. We were able to do that successfully and ended up with a 20-item inventory that includes items varying in difficulty and discriminability.

As an initial step toward measurement validity, we found that higher PRIC scores were associated with higher standardized test scores (ACT and SAT), which follows as both are performance-based measures and may reflect careful thinking and problem-solving. We also found that individuals who had taken college-level coursework in statistics, research methods, or psychology performed better on the PRIC than people who had not, whereas high school-level coursework in these subjects was unrelated to PRIC scores. Although only an initial test, these findings are promising in the scope of measurement development, as they suggest that at the cross-sectional level, the PRIC assesses constructs taught in research methods, statistics, and psychology courses.

One limitation of Study 2 is the large number of participants who were excluded. Low effort, whether assessed directly via a question asking about effort or indirectly by the duration of measure completion, is common in both subject pool and MTurk samples (Goodman, Cryder, & Cheema, 2012; Oppenheimer, Meyvis, & Davidenko, 2009). Participants who are paid for their time (MTurk) or who need to complete studies for course completion (subject pool) are not inherently invested in the topic and thus might not find putting forth significant mental effort worthwhile. However, the lack of significant differences between MTurk and psychology subject pool samples on both the PRIC score and proportion identifying as psychology majors suggests that these are fairly comparable groups in terms of research methods and statistical knowledge, even with the exclusions.

General Discussion

The two studies in this article describe the process used to create the PRIC, a measure designed to assess understanding of and reasoning about critical concepts in research methods and statistics within the behavioral sciences. Overall, we provided initial evidence supporting the PRIC as a valid measure of reasoning and application of statistics and research methodology in psychology. We also demonstrated that individuals who scored higher on the PRIC had greater academic ability, indexed by standardized test scores (ACT and SAT), and that people with more advanced education also scored higher on the PRIC compared to those with less education. The creation of this measure fills a needed niche in the behavioral sciences: a standardized way for educators and researchers to assess research methods and statistical reasoning.

One major advantage of the PRIC is its brevity; it can be used in college-level classes that average 50–75 min. The PRIC is also easy to administer and score, given that it employs multiple-choice questions with a single correct answer. Students can complete the PRIC online, which allows for easy randomization of items and response options. However, an instructor or researcher may also choose to administer the PRIC in a supervised paper format. These features and flexibility of the PRIC make it a useful educational and research tool.

An additional strength of the PRIC is that it was developed iteratively, using participants' own language for the multiple-choice responses, and that we used IRT to guide development and refinement of the items. IRT provides considerable psychometric advantages over classical test theory approaches to measurement development (Edelen & Reeve, 2007), as the basis for including or excluding items involves more than just interitem correlations. With IRT, items can be evaluated for discriminability and difficulty, and the model can include a guessing parameter, which is particularly useful for multiple-choice items. Our IRT analyses demonstrated that our final set of 20 multiple-choice items varied both in difficulty and in discrimination between high- and low-ability individuals, suggesting that the PRIC may be useful not only for evaluation of teaching but also for identification of students' ability levels in the research methods arena.
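To illustrate how discrimination, difficulty, and guessing jointly shape an item, the sketch below implements the standard three-parameter logistic (3PL) item response function. It assumes the common form without a scaling constant, and the parameter values are hypothetical illustrations, not estimates from the PRIC items.

```python
from math import exp


def p_correct(theta: float, a: float, b: float, c: float) -> float:
    """3PL probability of a correct response.

    theta: person ability; a: discrimination; b: difficulty;
    c: guessing parameter (lower asymptote).
    """
    return c + (1 - c) / (1 + exp(-a * (theta - b)))


# Hypothetical item: moderate discrimination, slightly hard,
# with a guessing floor of .20 (e.g., five response options).
a, b, c = 1.2, 0.5, 0.2
for theta in (-2.0, 0.0, 0.5, 2.0):
    print(f"theta = {theta:+.1f}: P(correct) = {p_correct(theta, a, b, c):.2f}")
```

When ability equals the item difficulty, the probability of a correct response sits halfway between the guessing floor and 1, which is why very low-ability respondents still answer correctly at roughly the chance rate under this model.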

Finally, a strength of the PRIC is that it was developed using multiple samples. Overall performance on the PRIC was similar for both the college-aged novice sample (undergraduate psychology subject pool students) and a wide range of adults in the United States recruited via MTurk. This can be taken as evidence that the PRIC does not merely test content knowledge in the area of social sciences. Instead, the PRIC appears to test the ability to reason about and apply information to scenarios about research and statistics, which is likely why our currently enrolled introductory psychology students did not perform differently from a diverse, older MTurk sample.

Despite promising initial results, questions remain about the usefulness of the PRIC in departmental and classroom settings. We found that higher PRIC scores were associated with greater academic ability and greater educational achievement. Thus, it is plausible that the PRIC assesses perseverance toward long-term goals (e.g., grit; Duckworth, Peterson, Matthews, & Kelly, 2007), cognitive effort (Frederick, 2005), or basic knowledge in psychology (Smith & Barker, 2008), not specific reasoning in research methods and statistics. We also recognize that people with college or advanced degrees may perform better on the PRIC because of critical thinking skills accumulated throughout an undergraduate education as opposed to the particular type of critical thinking needed in research methods and statistics. Finally, it will be important to know if statistics and/or research methods courses increase PRIC scores. Although the idea of a concept inventory is to assess research methods and statistical knowledge beyond grades, it would still be useful to know if PRIC scores are associated with higher grades in research methods or statistics courses. Thus, we recognize that additional work is needed to further validate the PRIC, and we turn to answering these questions in subsequent studies (Veilleux & Chapman, 2017, this issue).

Footnotes

Declaration of Conflicting Interests

The authors declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.

Funding

The authors disclosed receipt of the following financial support for the research, authorship, and/or publication of this article: This work was funded by a “Research in Teaching” grant from the University of Arkansas Teaching and Faculty Support Center and from a Society for the Teaching of Psychology/Psi Chi Assessment Grant.

References

Allen. (2006). The Statistics Concept Inventory: Development and analysis of a cognitive assessment instrument in statistics (Unpublished doctoral dissertation). University of Oklahoma, Norman, OK.

American Psychological Association. (2013). APA guidelines for the undergraduate psychology major (Version 2.0). Retrieved from http://www.apa.org/ed/precollege/undergrad/index.aspx

Anderson, D. L., Fisher, K. M., & Norman, G. J. (2002). Development and evaluation of the conceptual inventory of natural selection. Journal of Research in Science Teaching, 39, 952–978. doi:10.1002/tea.10053

Bachiochi, Everton, Evans, Fugere, Escoto, Letterman, & Leszczynski. (2011). Using empirical article analysis to assess research methods courses. Teaching of Psychology, 38, 5–9. doi:10.1177/0098628310387787

Barron, K. E., & Apple, K. J. (2015). Debating curricular strategies for teaching statistics and research methods: What does the current evidence suggest? Teaching of Psychology, 41, 187–194. doi:10.1177/0098628314537967

Berezow, A. B. (2012). Why psychology isn’t a science. Los Angeles Times. Retrieved from http://articles.latimes.com/2012/jul/13/news/la-ol-blowback-pscyhology-science-20120713

Bristow, Erkorkmaz, Huissoon, J. P., Jeon, Owen, W. S., Waslander, S. L., & Stubley, G. D. (2012). A control systems concept inventory test design and assessment. IEEE Transactions on Education, 55, 203–212. doi:10.1109/TE.2011.2160946

Burkley. (2009). Mythbusters: A tool for teaching research methods in psychology. Teaching of Psychology, 36, 179–184. doi:10.1080/00986280902739586

Clark, L. A., & Watson. (1995). Constructing validity: Basic issues in objective scale development. Psychological Assessment, 7, 309–319. doi:10.1037//1040-3590.7.3.309

Conners, F. A., McCown, S. M., & Roskos-Ewoldsen. (1998). Unique challenges in teaching undergraduate statistics. Teaching of Psychology, 25, 40–42. doi:10.1207/s15328023top2501_12

Dempster, & McCorry, N. K. (2009). The role of previous experience and attitudes toward statistics in statistics assessment outcomes among undergraduate psychology students. Journal of Statistics Education, 17. Retrieved from http://www.amstat.org/publications/jse/v17n2/dempster.pdf

Duckworth, A. L., Peterson, Matthews, M. D., & Kelly, D. R. (2007). Grit: Perseverance and passion for long-term goals. Journal of Personality and Social Psychology, 92, 1087–1101. doi:10.1037/0022-3514.92.6.1087

Edelen, M. O., & Reeve, B. B. (2007). Applying item response theory (IRT) modeling to questionnaire development, evaluation, and refinement. Quality of Life Research, 16, 5–18. doi:10.1007/s11136-007-9198-0

Evans, D. L., Gray, G. L., Krause, Martin, Midkiff, & Wage. (2003). Progress on concept inventory assessment tools. In Proceedings of the 33rd ASEE/IEEE Frontiers in Education Conference (pp. T4G1–T4G8).

Facione, P. A., & Facione. (1994). The California critical thinking skills test (CCTST): Test manual. Millbrae: California Academic Press.

Forester, Kahn, J. H., & Hesson-McInnis, M. S. (2004). Factor structures of three measures of research self-efficacy. Journal of Career Assessment, 12, 3–16. doi:10.1177/1069072703257719

Frederick. (2005). Cognitive reflection and decision making. Journal of Economic Perspectives, 19, 25–42.

Freng, Webber, Blatter, Wing, & Scott, W. D. (2011). The role of statistics and research methods in the academic success of psychology majors: Do performance and enrollment timing matter? Teaching of Psychology, 38, 83–88. doi:10.1177/0098628311401591

Goodman, J. K., Cryder, C. E., & Cheema. (2012). Data collection in a flat world: The strengths and weaknesses of Mechanical Turk samples. Journal of Behavioral Decision Making, 26, 213–224. doi:10.1002/bdm.1753

Hake, R. R. (2015). What might psychologists learn from scholarship of teaching and learning in physics? Scholarship of Teaching and Learning in Psychology, 1, 100–106. doi:10.1037/stl0000022

Halloun, I. A., & Hestenes. (1985a). Common sense concepts about motion. American Journal of Physics, 53, 1056–1065. doi:10.1119/1.14031

Halloun, I. A., & Hestenes. (1985b). The initial knowledge state of college physics students. American Journal of Physics, 53, 1043–1055. doi:10.1119/1.14030

Howell, J. L., Collisson, & King, K. M. (2014). Physics envy: Psychologists’ perceptions of psychology and agreement about core concepts. Teaching of Psychology, 41, 330–334. doi:10.1177/0098628314549705

Lawson, T. J. (1999). Assessing psychological critical thinking as a learning outcome for psychology majors. Teaching of Psychology, 26, 207–209.

Lawson, T. J., Jordan-Fleming, M. K., & Bodle, J. H. (2015). Measuring psychological critical thinking: An update. Teaching of Psychology, 42, 248–253. doi:10.1177/0098628315587624

Norris, Pacini, & Epstein. (1998). The rational-experiential inventory, short form (Unpublished inventory). University of Massachusetts Amherst, Amherst, MA.

Onwuegbuzie, A. J., & Wilson, V. A. (2003). Statistics anxiety: Nature, etiology, antecedents, effects, and treatments—A comprehensive review of the literature. Teaching in Higher Education, 8, 195–209. doi:10.1080/1356251032000052447

Oppenheimer, D. M., Meyvis, & Davidenko. (2009). Instructional manipulation checks: Detecting satisficing to increase statistical power. Journal of Experimental Social Psychology, 45, 867–872. doi:10.1016/j.jesp.2009.03.009

Pintrich, P. R. (2003). A motivational science perspective on the role of student motivation in learning and teaching contexts. Journal of Educational Psychology, 95, 667–686.

Pliske, R. M., Caldwell, T. L., Calin-Jageman, R. J., & Taylor-Ritzler. (2015). Demonstrating the effectiveness of an integrated and intensive research methods and statistics course sequence. Teaching of Psychology, 42, 153–156. doi:10.1177/0098628315573139

Smith, D. L., & Barker. (2008). Using yes–no recognition tests to assess student memory for course content. Teaching of Psychology, 35, 319–326. doi:10.1080/00986280802374468

Stansbury, J. A., & Munro, G. D. (2013). Gaming in the classroom: An innovative way to teach factorial designs. Teaching of Psychology, 40, 148–152. doi:10.1177/0098628312475037

Stone, C. A., Zhu, & Lane. (2009). Providing subscale scores for diagnostic information: A case study when the test is essentially unidimensional. Applied Measurement in Education, 23, 63–86. doi:10.1080/08957340903423651

Veilleux, J. C., & Chapman, K. M. (2017). Validation of the Psychological Research Inventory of Concepts: An Index of Research and Statistical Literacy. Teaching of Psychology, 44, 212–221.

Vittengl, J. R., Bosley, C. Y., Brescia, S. A., Eckardt, E. A., Neidig, J. M., Shelver, K. S., … Sapenoff, L. A. (2004). Why are some undergraduates more (and others less) interested in psychological research? Teaching of Psychology, 31, 91–97. doi:10.1207/s15328023top3102

Wallace, C. S., & Bailey, J. M. (2010). Do concept inventories actually measure anything? Astronomy Education Review, 9. Retrieved from http://astronomy101.jpl.nasa.gov/files/Wallace_07.pdf

Wang, & Bao. (2010). Analyzing force concept inventory with item response theory. American Journal of Physics, 78, 1064–1070. doi:10.1119/1.3443565

Wilson, T. D. (2012). Stop bullying the ‘soft’ sciences. Los Angeles Times. Retrieved from http://articles.latimes.com/2012/jul/12/opinion/la-oe-wilson-social-sciences-20120712

Zimowski, M. F., Muraki, Mislevy, R. J., & Bock, R. D. (1996). Bilog MG: Multiple-group IRT analysis and test maintenance for binary items. Chicago, IL: Scientific Software International.