Abstract
The Statistical Anxiety Rating Scale (STARS) was used to measure statistics anxiety across 423 graduate and undergraduate students from a midsized university in the western United States. Students' responses were analyzed using confirmatory factor analysis (CFA) to assess the validity of scores from the proposed six-factor model, which fit well according to various adjunct fit indexes. Students' responses were then examined using multigroup CFA to explore factorial invariance across sex and student classification (i.e., undergraduates and graduates). The model was found to be factorially invariant across sex, but not across student classification, possibly meaning graduate and undergraduate students ascribed different meaning to some items. If one ignores the test of factorial invariance, between-groups statistical tests can be unduly influenced by measurement artifacts, sometimes erroneously identifying statistically significant mean differences when there are none.
Many university students taking statistics classes report statistics anxiety (Onwuegbuzie, 2004). Statistics anxiety has been researched for several decades (Earley & Mertler, 2002), using numerous tests. One of the most popular tests is the Statistical Anxiety Rating Scale (STARS; Cruise, Cash, & Bolton, 1985). The STARS was developed to measure statistics anxiety using a sample of 423 students (graduate and undergraduate) across a variety of academic disciplines. Statistics anxiety has been defined in several ways, but commonly cited ones are attributable to Cruise, et al. (1985), Zeidner (1990), and Onwuegbuzie, Da Ros, and Ryan (1997). Onwuegbuzie, et al. defined statistics anxiety as “a state-anxiety reaction to any situation in which a student is confronted with statistics in any form and at any time” (p. 28). Cruise, et al.'s definition stated “the feelings of anxiety encountered when taking a statistics course or doing statistical analysis; that is, gathering, processing, and interpreting” (p. 92), whereas Zeidner defined statistics anxiety as a performance characterized by extensive worry, intrusive thoughts, mental disorganization, tension, and physiological arousal … when exposed to statistics content, problems, instructional situations, or evaluative contexts, and is commonly claimed to debilitate performance in a wide variety of academic situations by interfering with the manipulation of statistics data and solution of statistics problems. (p. 319)
Prior estimates show statistics anxiety is experienced by the majority of graduate students at uncomfortable levels (Onwuegbuzie, 2004). Of major concern is that performance in a statistics class and magnitude of statistics anxiety are negatively related (Zeidner, 1990; Onwuegbuzie & Seaman, 1995; Fitzgerald & Jurs, 1996). This is alarming, considering the important role statistics plays in quantitative research (Birenbaum & Eylath, 1994). Graduate students should be able to readily interpret statistical findings in scholarly publications (Birenbaum & Eylath, 1994). Knowledge of statistics and applying statistical techniques are ever more critical in all academic disciplines (Baloğlu, 2003; Mji, 2009). Although research methods are commonly taught separately from statistics in graduate programs, research methods per se are not typically part of the undergraduate curriculum, so statistics courses may be students' only formal introduction to research methods. Statistics anxiety is not the principal concern, but rather the outcomes related to anxiety (Mji, 2009). Anxious students may have difficulty learning and using statistics.
There are many explanations for possible sources of statistics anxiety. These are often broken down into three categories: environmental, dispositional, and situational (Baloğlu, 2003). Environmental aspects might include sex, age, ethnicity, academic major, and previous mathematics experiences (Baloğlu, 2003), which may be described as the biases one brings into the statistics course (Onwuegbuzie, et al., 1997). Some prior research indicated women experience difficulty in quantitative areas (Royse & Romph, 1992) and that women experienced higher statistics anxiety (Zeidner, 1990; Onwuegbuzie & Seaman, 1995); however, other research has indicated no such significant differences (Cruise & Wilkins, 1980; Baloğlu, 2003). Unlike environmental factors, situational factors associated with greater statistics anxiety have been reported during enrollment in a statistics class. These factors include exposure to statistical definitions, the instructor (Onwuegbuzie, et al., 1997), lack of feedback from the instructor (Zeidner, 1991), and the general nature of the statistics class. Dispositional factors which might be related to statistics anxiety include learning styles (Onwuegbuzie, 1998), general attitude toward statistics (Harvey, Plake, & Wise, 1985), and perceptions of statistics (Zeidner, 1991). For instance, because many students dread being required to take a statistics course, they often take these courses at the end of their degree program (Onwuegbuzie & Wilson, 2003; Zeidner, 1991). Waiting until the end of their academic careers before enrolling in a required statistics class means not actively applying statistics during their academic training (Onwuegbuzie, et al., 1997). In addition, many students perceive statistics as the most difficult class they have to take (Schacht & Stewart, 1990) and perform worse than in other classes (Onwuegbuzie, Slate, Paterson, & Watson, 2000).
There is also evidence of higher statistics anxiety among graduate students than undergraduates (Harvey, et al., 1985; Benson & Bandalos, 1989). On the other hand, in a separate study, Benson (1989) reported statistics anxiety did not differ statistically significantly between undergraduate and graduate students. Statistics anxiety also was correlated negatively with the length of the class (Bell, 2001), suggesting shorter courses may be linked to higher anxiety.
Psychometrics of the STARS
Factorial invariance (also known as equivalence of measurement) allows the assumption that a construct holds the same meaning for the different groups tested. It is proposed that by ignoring a test of factorial invariance, researchers essentially ignore a fundamental measurement assumption when conducting between-group comparisons. Then, a comparison could be statistically significant or not only as a result of a measurement artifact, namely nonequivalence of measurement between groups. The implications of ignoring this issue in the context of the STARS should be examined. Even researchers not interested in statistics anxiety may have interest in the broad implications of measurement nonequivalence, which, in the extreme, could put into question one's statistical conclusions.
Baloğlu (2002) attempted to confirm the six-factor model of STARS using a sample of 221 undergraduate college students. All factor loadings were significant (p < .05). Model fit was also assessed using several fit indices, including the goodness-of-fit index (GFI; Jöreskog & Sörbom, 1982), comparative fit index (CFI; Bentler, 1990), and root mean square error of approximation (RMSEA; Steiger, 1990). All measures of fit indicated the six-factor model was not a good fit for the data, so support for construct validity of the STARS was minimal in this group. Hanna, Shevlin, and Dempster (2008) conducted confirmatory factor analyses, testing one-, four-, and six-factor models with 849 undergraduate psychology students in the United Kingdom. Appropriate fit indices, including RMSEA, CFI, and standardized root mean square residual (SRMR), indicated the one-factor model fit the data poorly. Both the four- and six-factor models exhibited reasonable model fit, with the six-factor model fitting the data the best.
Dauphinee, Schau, and Stevens (1997) conducted a confirmatory factor analysis of the Survey of Attitudes Toward Statistics (SATS) in a group of 991 undergraduate students. An invariance analysis suggested that the SATS model was equivalent across sex. Maximum likelihood confirmatory factor analysis was conducted using LISREL 7. Although this was then the only known method of performing an invariance analysis, this method has now been judged to be inaccurate for use with ordinal data. Instead, a weighted least squares mean- and variance-adjusted (WLSMV) estimator must be used to accommodate ordinal data (Lubke & Muthén, 2004; Millsap & Yun-Tein, 2004). This estimator is generally robust under ordinal (non-normal) data conditions (Flora & Curran, 2004; Hutchinson, Raymond, & Black, 2008).
Clearly, statistics anxiety has been studied for decades; however, there are no studies which have first assessed whether the validity of the scores from the STARS is equivalent across different subgroups. This is a particularly important issue because absence of measurement equivalence implies that group responses are not meaningfully comparable (Byrne, Shavelson, & Muthén, 1989; Vandenberg & Lance, 2000). Absence of measurement equivalence prohibits valid score comparison across different subgroups because the comparison is essentially “misleading and illegitimate” (Hui & Triandis, 1985, p. 134). Without measurement equivalence, notable mean differences in statistics anxiety across different groups may be attributable to measurement artifacts rather than real differences in perception of statistics anxiety (Hutchinson, et al., 2008). If there is no measurement equivalence across comparison groups, it is possible that prior research results are invalid because the assumption of equivalent groups was incorrect. The current study brings to the forefront the importance of measurement equivalence when making group comparisons. For instance, if subpopulations interpret the meaning of the STARS items differently, then no accurate comparison of groups can be meaningfully made.
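The idea of a measurement artifact can be illustrated with a small numerical sketch. It assumes a simplified linear one-factor model (observed item score = intercept + loading × latent score), not the ordinal threshold model estimated in this study, and every number in it is hypothetical. Two groups are given the same latent mean but different factor loadings, and their expected item means still differ.

```python
def expected_item_mean(intercept, loading, latent_mean):
    """Expected observed item mean under a linear one-factor model.

    Simplifying assumption: x = intercept + loading * eta + error,
    with the error term averaging to zero.
    """
    return intercept + loading * latent_mean


# Hypothetical values: both groups share the SAME latent mean.
latent_mean = 1.0
group_a_mean = expected_item_mean(intercept=2.0, loading=0.8, latent_mean=latent_mean)
group_b_mean = expected_item_mean(intercept=2.0, loading=0.5, latent_mean=latent_mean)

# The observed gap comes only from the unequal loadings.
observed_gap = group_a_mean - group_b_mean
print(round(observed_gap, 2))
```

In this sketch the groups do not differ on the trait at all, yet a comparison of observed means would suggest that they do.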
The research on statistics anxiety is fairly extensive, but there are clear weaknesses in the literature. Although the STARS has been a popular scale for measuring statistics anxiety, there has been no psychometric research on whether STARS is measuring statistics anxiety equivalently for all students. Perhaps previous studies have been published under the assumption that STARS does measure statistics anxiety equivalently for all students. This assumption may be incorrect. Therefore, the rationale for the present study was to assess whether measurement is equivalent among different groups of students who have typically been compared.
Hypothesis 1. Statistics anxiety will exhibit factorial invariance across sex.
Hypothesis 2. Statistics anxiety will exhibit factorial invariance across students' year of study.
Method
Participants
Upon IRB approval, students at a midsized university in the Western part of the USA were recruited from intact classrooms to complete the STARS and a demographics questionnaire. There were 423 participants (293 women, Mage = 26.6 yr., SD = 10.5; 130 men, Mage = 24.9 yr., SD = 9.6). Both undergraduate students (120 freshmen, 64 sophomores, 51 juniors, 23 seniors) and graduate students (53 master's, 112 doctoral) were recruited to take the survey. Students were enrolled in introductory undergraduate statistics and psychology classes, or in intermediate and advanced graduate statistics and research methods classes. Students across various academic disciplines were represented in this sample, including behavioral sciences (n = 89), business (n = 14), education (n = 41), health sciences (n = 82), performing and visual arts (n = 6), social sciences (n = 31), and other (n = 8).
Measures
The survey included the STARS instrument followed by demographic questions (see Appendix, pp.). Each of the 51 items in the STARS (Cruise & Wilkins, 1980) was rated on a 5-point scale. For the first 23 items measuring test and class anxiety, interpretation anxiety, and fear of asking for help, participants rated their anxiety using anchors of 1: No anxiety and 5: Strong anxiety. A sample item is “Studying for an examination in a statistics course”. For the next 28 items measuring worth of statistics, computational self-concept, and fear of statistics teachers, participants were asked to rate their agreement using anchors of 1: Strongly agree and 5: Strongly disagree. A sample item is “Statistics is worthless to me since it is empirical and my area of specialization is abstract”. For Items 1 to 23, higher scores on each item correspond to higher anxiety; and for Items 24 to 51, higher scores on each item correspond to more positive attitudes.
Onwuegbuzie (1999) reported an estimate of internal reliability of scores on the STARS of .78 on the Worth of Statistics subscale and .84 on the Test and Class Anxiety subscale with a median of .8 for 225 African-American participants at a small suburban college of education in a mid-southern state. Baloğlu (2002) reported internal consistency reliability coefficients on the STARS scores ranging from .64 on the Fear of Statistics Teachers subscale to .94 on the Worth of Statistics subscale for 246 college students.
The survey also comprised a series of demographic questions, including sex, age, academic major, classification (i.e., freshman, sophomore, junior, senior, master's, doctoral), and mathematics background (i.e., statistics course in high school, undergraduate statistics course in college, graduate statistics courses, high school algebra, college algebra, trigonometry, and calculus). For mathematics background, students could check all that applied.
Procedure
Participants in undergraduate introductory statistics courses were given a hard copy of the survey and a separate sheet of paper with the URL to access the online version. The class was given the option of either taking the survey in paper format or online; all chose to complete the hard copy. Participants in another class were provided a hard copy of the survey without an option to complete online. Other classes were only provided with the URL to access the online survey. The online survey format was also used as a follow-up method for collecting data—an e-mail was sent out to introductory statistics and psychology students with the URL of the survey. In all, 69 participants completed the paper survey and 354 completed the online survey.
The human participants consent form was on the first page of the paper survey packet. Participants were explicitly told to tear off the consent form to keep for their own records. The next six pages comprised the STARS. A cash prize entry form was attached as the last page; participants were told to detach this page from the survey packet and fill in their name, phone number, and e-mail address if they were interested in having a chance to win the $20 cash prize. The cash prize entry forms were kept in a separate stack from the rest of the survey. Entry forms were shuffled to minimize possible matching of entry form to participants' corresponding surveys.
Those participants who chose to complete the survey online were first presented a screen containing the human participant consent form. At the bottom of the screen were two options: I consent and I decline. Declining to consent directed the participant to the university's website. Consenting directed the participant to the survey. Online, if a participant wished to enter into the $20 cash prize drawing, a text box was available to provide the necessary contact information (i.e., name, phone number, e-mail address). If the participant felt that anonymity could be compromised by entering contact information into the text box, an e-mail address was also provided for separate transmission.
Before data collection began, an application for exempt review was submitted to and subsequently approved by the institutional review board at the university at which the study was conducted. Depending on the feasibility of the researcher personally being able to distribute paper surveys, some students were only given the option to take the online version. For all introductory statistics students, a follow-up email was sent to indicate that they could take the survey online if they had not taken it in class. The primary purpose of this follow-up was to solicit a response from those students who did not attend class on the day the paper survey was distributed.
Results
Preliminary Analyses
Data were entered into PASW 18 (IBM, 2010) and preliminary descriptive (e.g., means, standard deviations, frequencies) and internal consistency reliability analyses (e.g., Cronbach's alpha) were run. Cronbach's alpha was .96 for the total score. Reliability coefficients for the six subscales are shown in Table 1, which are relatively consistent with estimates found in prior studies.
Cronbach's Alpha by Subpopulation
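The internal consistency estimate described above can be sketched in a few lines. This is only an illustration of the standard Cronbach's alpha formula using Python's standard library; it is not the PASW procedure used in the study, and the response matrix is hypothetical.

```python
from statistics import variance


def cronbach_alpha(rows):
    """Cronbach's alpha for a list of respondents.

    rows: list of equal-length lists, one list of item scores per respondent.
    alpha = k / (k - 1) * (1 - sum of item variances / variance of total scores)
    """
    k = len(rows[0])
    item_columns = list(zip(*rows))  # transpose: one tuple per item
    item_variance_sum = sum(variance(col) for col in item_columns)
    total_variance = variance([sum(row) for row in rows])
    return (k / (k - 1)) * (1 - item_variance_sum / total_variance)


# Hypothetical 5-point ratings: five respondents, three items.
responses = [
    [1, 2, 1],
    [2, 2, 3],
    [3, 4, 3],
    [4, 4, 5],
    [5, 5, 4],
]
alpha = cronbach_alpha(responses)
print(round(alpha, 3))
```

Applying the same function separately to the rows of each subgroup would give subgroup-specific estimates of the kind a table by subpopulation reports.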
Baseline Model
A six-factor confirmatory factor analysis model was specified with all observed variables being ordered categorical and estimated with Mplus (Version 5.21; Muthén & Muthén, 2010), using the WLSMV estimator. Overall model fit and component fit were examined using several measures of fit, including an appropriate χ2 test employing the WLSMV estimator, RMSEA, Tucker-Lewis index (TLI; Tucker & Lewis, 1973), and CFI.
The six-factor model in the current study fit well. Although the χ2 test statistic was statistically significant, as shown in Table 2, several adjunct fit indexes provided evidence supporting the six-factor model. The CFI was .943 and the TLI was .940 (each with a range of 0 to 1.0), where the closer the measure is to 1.0 the better the model fit. The RMSEA was .059, where a lower number is more desirable and indicates better fit (RMSEA values can range from 0 to infinity and the RMSEA is considered a measure of “badness of fit”). These fit values were compared to Hu and Bentler's (1999) guidelines for cutoff criteria and indicate adequate model fit. The RMSEA was higher than the suggested cutoff value of .05 for close fit (Browne & Cudeck, 1993); however, Browne and Cudeck also stated that an RMSEA value less than .08 is indicative of reasonable fit. The CFI was slightly lower than .95, which is the cutoff criterion suggested by Hu and Bentler (1999) and indicates model fit may not be ideal. In regard to component fit, all parameter estimates (standardized factor loadings) were statistically significant (p < .01) and in the expected direction, which is indicative of good fit.
Baseline Model Fit for Six-factor STARS Model
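For readers unfamiliar with the adjunct fit indexes, the sketch below shows commonly used textbook formulas for RMSEA, CFI, and TLI computed from model and baseline (null model) χ2 values. It is an illustration under stated assumptions: the χ2 inputs are hypothetical, the formulas are the conventional maximum likelihood versions (some software uses N rather than N − 1 in the RMSEA), and the WLSMV-adjusted statistics reported by Mplus are not reproduced here. Only the cutoffs (.05, .08, .95) and the reported values (.059, .943) come from the text.

```python
import math


def rmsea(chi2, df, n):
    """Root mean square error of approximation (N - 1 convention)."""
    return math.sqrt(max(chi2 - df, 0.0) / (df * (n - 1)))


def cfi(chi2_model, df_model, chi2_base, df_base):
    """Comparative fit index from model and baseline noncentrality."""
    d_model = max(chi2_model - df_model, 0.0)
    d_base = max(chi2_base - df_base, d_model, 0.0)
    return 1.0 if d_base == 0 else 1.0 - d_model / d_base


def tli(chi2_model, df_model, chi2_base, df_base):
    """Tucker-Lewis index from chi-square / df ratios."""
    ratio_base = chi2_base / df_base
    ratio_model = chi2_model / df_model
    return (ratio_base - ratio_model) / (ratio_base - 1.0)


def rmsea_label(value):
    """Label an RMSEA using the Browne and Cudeck cutoffs cited in the text."""
    if value < 0.05:
        return "close fit"
    if value < 0.08:
        return "reasonable fit"
    return "above the cited cutoffs"


# Hypothetical chi-square values; n = 423 is the sample size in this study.
example_rmsea = rmsea(chi2=250.0, df=100, n=423)
example_cfi = cfi(250.0, 100, 2600.0, 120)
example_tli = tli(250.0, 100, 2600.0, 120)

# Values reported in the text, checked against the cited cutoffs.
reported_label = rmsea_label(0.059)
reported_cfi_meets_cutoff = 0.943 >= 0.95
print(reported_label, reported_cfi_meets_cutoff)
```

The last two lines mirror the interpretation given in the text: the reported RMSEA falls in the range described as reasonable fit, and the reported CFI falls just short of the .95 criterion.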
Model modification indices (MI), which provide an estimate of the decrease in χ2 for the overall model if a given parameter were freed for estimation (Brown, 2006), were examined to identify areas of misfit. Two items from the Agreement subscale produced extreme modification index values on the factor loadings. Item 11, “Since I have never enjoyed math I do not see how I can enjoy statistics,” was set to load on the factor Computational Self-concept, and Item 24, “Statistical figures are not fit for human consumption,” was set to load on the factor Worth of Statistics. Given possible multiple factor loadings suggested by the modification indices for these two items, they were removed from the model and the CFA was re-run. The fit of the new model was slightly better, as shown in Table 2, which summarizes the fit statistics for the two versions of the baseline model. A baseline model is the first step for subsequent model comparisons.
Invariance Results
Invariance by sex.—Before testing for invariance across sex, the six-factor baseline model was separately fit for women and men to assess whether the model provided good fit for each group separately. Adjunct fit indexes suggested good global model fit for both sexes (Table 3). In regard to component fit, all parameter estimates were statistically significant for both sexes and indicative of good fit.
Invariance Tests of Six-factor STARS Model by Sex and Student Classification
Based on WLSMV-adjusted calculations.
The first step of a factorial invariance analysis is to assess configural invariance, which requires identical models to be specified for both men and women with the added stipulation that parameters are separately estimated for the men and women. As shown in Table 3, despite the χ2 test being statistically significant, the CFI, TLI, and RMSEA are suggestive of good fit. Therefore, the six-factor model seems to be suitable for both men and women.
The second step of the invariance analysis, metric invariance (Vandenberg & Lance, 2000), tested whether the decrement in model fit would be statistically significant if the factor loadings were constrained to be equal for the men and women. The χ2 difference test was not statistically significant, p = .02 (applying Bonferroni adjustment and using alpha of .01 to account for numerous statistical tests), indicating items appear to be functioning similarly for both groups. Therefore, the invariance testing continued.
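The decision rule in this step can be sketched as follows. The p value of .02 and the adjusted alpha of .01 come from the text; the familywise alpha of .05 and the count of five tests are assumptions chosen only because they reproduce .01, as the text does not state how the adjustment was derived. The p value itself is taken as given, because the difference tests were based on WLSMV-adjusted calculations rather than a simple subtraction of χ2 values.

```python
def bonferroni_alpha(family_alpha, n_tests):
    """Per-test alpha under a Bonferroni adjustment."""
    return family_alpha / n_tests


def is_significant(p_value, alpha):
    """True when the test would be declared statistically significant."""
    return p_value < alpha


# Assumed inputs: .05 familywise alpha split across 5 tests gives .01.
adjusted_alpha = bonferroni_alpha(family_alpha=0.05, n_tests=5)

# Reported p value for the metric invariance test across sex.
metric_test_significant = is_significant(p_value=0.02, alpha=adjusted_alpha)
print(metric_test_significant)
```

Under the unadjusted .05 level the same p value would have been declared significant, which is why the choice of alpha matters for the conclusion at this step.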
The test for scalar invariance (Vandenberg & Lance, 2000) was the third step in the invariance analyses, where it was tested whether the item thresholds were invariant across men and women. The χ2 difference test was not statistically significant (Table 3), meaning scalar invariance held, so the invariance testing proceeded to the test of latent means differences.
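The item thresholds tested in this step may be unfamiliar. Under the usual assumption that each ordinal rating arises from cutting a standard normal latent response into categories, a threshold is the normal quantile of the cumulative proportion of responses at or below a category. The sketch below uses hypothetical category counts and a simple one-item calculation; it is not the multigroup estimation carried out in Mplus.

```python
from statistics import NormalDist


def thresholds_from_counts(counts):
    """Thresholds for one ordinal item from its category counts.

    Assumes a standard normal latent response. A 5-category item has
    4 thresholds. Cumulative proportions must lie strictly between 0 and 1,
    so the first category must not be empty.
    """
    total = sum(counts)
    cumulative = 0
    thresholds = []
    for count in counts[:-1]:  # the last category needs no upper threshold
        cumulative += count
        thresholds.append(NormalDist().inv_cdf(cumulative / total))
    return thresholds


# Hypothetical counts for rating categories 1 through 5.
taus = thresholds_from_counts([10, 20, 40, 20, 10])
print([round(t, 3) for t in taus])
```

Scalar invariance asks whether thresholds like these can be held equal across groups without a statistically significant loss of fit.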
Latent mean differences between men and women for both the Test and Class Anxiety factor and the Interpretation Anxiety factor were statistically significant (Table 4), indicating that women were reporting higher anxiety in those areas. Differences between latent means were not statistically significant for the men and women on Fear of Asking for Help, Worth of Statistics, Fear of Statistics Teacher, or Computational Self-concept (Table 4).
Tests of Latent Means Between Sexes
Note.—A negative latent mean indicates that women exhibited higher anxiety or more agreement.
Invariance by students' classification.—Separate CFA models were tested for undergraduate and graduate students. According to the adjunct fit indexes (Table 3), the baseline model fit adequately for undergraduates; however, both CFI and TLI were slightly lower than desirable. Estimates of all parameters were statistically significant. The CFA model for graduate students fit well (Table 3). Estimates of all parameters were statistically significant for both groups.
Configural invariance was observed, despite a statistically significant χ2 value (Table 3), indicating support for the tenability of the same pattern of fixed and free factor loadings across undergraduate and graduate students. Metric invariance, however, did not hold across this subpopulation, with the χ2 difference test being statistically significant (Table 3). In general, lack of metric invariance indicates that some items may have behaved differently across undergraduate students and graduate students, even though the overall factor structure held across the two groups.
Because factorial invariance failed at the metric level of testing, scalar invariance was not tested. Instead, post hoc tests were performed to study the nature of between-group differences for the latent variables. Chi-square difference tests were conducted for the six latent variables (Table 5). From the statistical tests, one may infer both configural and metric invariance were evident for two factors, Test and Class Anxiety and Interpretation Anxiety. The factor Fear of Asking for Help passed the test for configural, metric, and scalar invariance. However, the last three factors (Worth of Statistics, Fear of Statistics Teachers, Computational Self-concept) were noninvariant on the factor loadings (metric invariance), which may suggest that undergraduate and graduate students interpreted the STARS items in different ways.
Post Hoc Model Fit for Student Classification
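The sequential logic of the invariance tests (configural, then metric, then scalar, stopping at the first failure) can be summarized in a short sketch. The alpha of .01 and the metric p value of .02 for sex come from the text; the other p values are hypothetical placeholders, and the function name and dictionary layout are illustrative choices only.

```python
def highest_invariance_level(p_values, alpha=0.01):
    """Return the highest invariance level supported.

    p_values: dict mapping 'metric' and 'scalar' to the p value of the
    corresponding chi-square difference test. Assumes configural fit has
    already been judged acceptable. Testing stops at the first
    statistically significant (p < alpha) or missing test.
    """
    level = "configural"
    for step in ("metric", "scalar"):
        p = p_values.get(step)
        if p is None or p < alpha:
            break
        level = step
    return level


# Sex: metric p = .02 is reported in the text; scalar p is hypothetical.
sex_level = highest_invariance_level({"metric": 0.02, "scalar": 0.20})

# Student classification: metric p is hypothetical (reported as significant).
classification_level = highest_invariance_level({"metric": 0.001})

# Latent means are compared only when scalar invariance holds.
compare_latent_means_by_sex = sex_level == "scalar"
print(sex_level, classification_level, compare_latent_means_by_sex)
```

This mirrors the pattern of results reported here: testing proceeded to latent means for sex but stopped after the metric step for student classification.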
Discussion
Cruise and other authors using the STARS suggested that the construct of statistics anxiety consists of six factors: Worth of Statistics, Interpretation Anxiety, Test and Class Anxiety, Computational Self-concept, Fear of Asking for Help, and Fear of Statistics Teachers. A CFA confirmed the tenability of this six-factor structure. The WLSMV estimator was used because of its robustness in handling ordinal data (Flora & Curran, 2004).
Use of the WLSMV estimator does not seem to be described in studies examining rating scale data from scales purporting to measure statistics anxiety. Results of studies not employing the WLSMV estimator were based on state-of-the-art estimators at the time; however, new evidence in the form of a more appropriate estimator could bring the prior results into question. For instance, Dauphinee, et al. (1997) analyzed rating scale data from participants completing the Survey of Attitudes Toward Statistics (SATS), employing the maximum likelihood estimator to conduct a multigroup CFA across sex, and found the measurement model to be equivalent across both groups. However, it is conceivable that employing the WLSMV estimator, which would have produced unconfounded estimates of thresholds and factor loadings (Lubke & Muthén, 2004; Millsap & Yun-Tein, 2004), may have led to different conclusions.
In the current study, the factorial invariance tests employed the WLSMV estimator in the multigroup CFA and indicated that the six-factor measurement model was equivalent across sex. The presence of metric invariance indicates that the individual items from the STARS function similarly for men and women. Scalar invariance indicates the thresholds do not differ by sex.
Because the model by sex was invariant across the thresholds, tests of latent means were performed for each of the six factors. Two sex differences were identified as statistically significant: Women had higher mean ratings on Test and Class Anxiety and Interpretation Anxiety. Further implications of invariance of the thresholds include the validity of between-group comparisons made for sex. If the researcher compares observed means (or latent means) on two groups, an independent-samples t test would be meaningful and readily interpretable as a true mean difference. However, in the absence of metric invariance, researchers should be cautious in interpreting the meaning of such differences.
On the other hand, invariance tests across undergraduate and graduate students indicated that these two groups may not have ascribed the same meaning to all items on the STARS, i.e., metric noninvariance was present. This finding precluded any meaningful between-group comparisons of these two groups (Vandenberg & Lance, 2000). In other words, the STARS should not be used if the purpose is to compare mean scores of undergraduate and graduate students.
Although some research supports higher scores on statistics anxiety among graduate students than for undergraduate students (e.g., Benson & Bandalos, 1989; Harvey, et al., 1985), tests for factorial invariance were not conducted. Therefore, it is unknown whether the results are meaningful for the samples in those studies.
Conclusions
Although discussions of equivalence of measurement or factorial invariance may not be found in many statistics texts, researchers should be aware of the effect of imprecise measurement in statistical analyses. In particular, lack of factorial invariance of metrics (thresholds) should preclude between-groups comparisons, even for an independent-samples t test. For instance, should one find a statistically significant difference in observed (or latent) sample means without first conducting a multigroup CFA to test for the presence of factorial invariance, there can be no certainty that this difference reflects a real difference in the trait of interest. It is possible the difference is an artifact of measurement, reflecting nonequivalence in the sample.
Factorial invariance should be viewed as an assumption in between-groups statistical tests, because meeting it strengthens the validity of statistical conclusions based on comparison of group scores on a trait. Before assuming that certain subpopulations of interest are invariant, that assumption must first be tested to ensure such comparisons are valid.
Limitations
The sample of students in the current study came from one university, which may limit generalizability of results to other populations. Invariance tests by academic discipline were not conducted in the current study, as group sample sizes were not sufficient. Future research should focus on obtaining larger and more diverse samples by academic concentration, as statistics anxiety might manifest differently for physical science majors as opposed to liberal arts majors, for example.
