Abstract
Achievement goal theory helps describe how and why students engage in various academic behaviors. Historically, achievement goals have been examined almost exclusively with undergraduate, nonminority samples, and predominately with factor analytic techniques. The present study adds to a growing literature by providing initial validation of a leading achievement goal measure, the Achievement Goal Questionnaire-Revised (AGQ-R; Elliot & Murayama, 2008), among rural (N = 186) and urban (N = 197) African American high school students. Collectively, results from both confirmatory factor and Rasch analyses highlight issues that should be considered when using the AGQ-R among African American high school students.
Achievement goal theory offers a framework for understanding student motivation and behavior by distinguishing between different achievement goals (Ames, 1992; Pintrich & Schunk, 2002). The achievement goal construct has shown promise as an explanatory construct of student motivation and achievement in many educational settings and across numerous student populations (Hulleman, Schrager, Bodmann, & Harackiewicz, 2010). Current achievement goal theory proposes a 2 × 2 framework, whereby achievement goals are assumed to vary according to goal definition (i.e., mastery or performance) and goal valence (i.e., approach or avoidance) (Elliot & Murayama, 2008; Wu & Chen, 2010). In achievement settings, students with mastery-approach goals seek to improve their competence, while students with mastery-avoidance goals focus on the avoidance of task-based incompetence. Furthermore, students with performance-approach goals strive to perform better than their peers, whereas students with performance-avoidance goals seek to avoid performing worse than their peers. Within the 2 × 2 framework, the Achievement Goal Questionnaire-Revised (AGQ-R; Elliot & Murayama, 2008) is the latest validated measure of achievement goals.
Validating 2 × 2 Achievement Goal Measures
In developing and validating items for the original Achievement Goal Questionnaire (AGQ), Elliot and McGregor (2001) employed confirmatory factor analyses (CFA) and tested competing models among a sample of 180 U.S. undergraduate students. They showed that the 2 × 2 framework provided the best fit among other dichotomous and trichotomous models. More recently, Muis, Winne, and Edwards (2009) validated the AGQ with undergraduate students using both traditional psychometric and modern Rasch analyses. In both of these studies, the ethnicities of the participants were not mentioned. In 2004, Finney, Pieper, and Barron modified the original AGQ items and created the Achievement Goal Questionnaire-Modified (AGQ-M) to measure achievement goals in general academic contexts. The majority of participants (85%) in that study were Caucasian. Later, Elliot and Murayama (2008) revised the original AGQ items and developed the Achievement Goal Questionnaire-Revised (AGQ-R), which assesses achievement goals in a course-specific context. Similar to Finney et al.’s (2004) study, Elliot and Murayama employed traditional factor analytic techniques with a predominantly Caucasian (69%) undergraduate sample.
Some follow-up AGQ validation studies have utilized non-Caucasian samples, such as Murayama, Zhou, and Nesbit’s (2009) study of Canadian and Japanese college students or Chiang, Yeh, Lin, and Hwang’s (2011) study of Taiwanese preuniversity students. However, among the many validation studies that exist, we could find none that focused exclusively on African Americans or on high school students, let alone on samples combining both. Further, the few studies that have examined achievement goals among African Americans offer inconsistent findings. For example, while Pastor and RiCharde (2003) found differences in achievement goal measurement between African American and White college students using differential item functioning on the AGQ-M, Campbell, Barry, Joe, and Finney (2008) found no differences with CFA. The psychometric evidence for the AGQ-M among African Americans is therefore conflicting, and to date, there have been no achievement goal validation studies among African Americans in noncollege settings, let alone studies using the latest 2 × 2 achievement goal measure, the AGQ-R.
Purpose of the Present Research
From our review, we perceive three continued gaps in the achievement goal validation literature. First, a continued lack of validation studies examining achievement goal measures among African American and noncollege-aged samples has limited the use of the robust achievement goal construct as an explanatory factor among historically underexamined student populations. Second, the limited research that has examined achievement goals among African American students has produced conflicting results, which may be caused, in part, by the use of different psychometric approaches. Lastly, these limited and conflicting findings may not apply to the latest 2 × 2 achievement goal measure, the AGQ-R, as they were obtained from studies that utilized prior versions of the instrument. The purpose of the present study was to address these three gaps by validating the AGQ-R among two samples (rural and urban) of African American high school students using confirmatory factor and Rasch analyses.
We focused on both rural (Study 1) and urban (Study 2) African American students to potentially identify differences between rural and urban youth, given that the few studies that have investigated African Americans’ achievement goals have focused on urban samples (e.g., Campbell et al., 2008; Pastor & RiCharde, 2003). Furthermore, because nearly half of all African Americans live outside of urban areas (McKinnon, 2003), it seemed especially important to address the validity of achievement goal measures with both samples. Previous research has also shown that urban and rural settings may yield different educational outcomes for students (e.g., Elder, King, & Conger, 1996; Hektner, 1995; Howley, 2006; Rojewski, 1999).
Study 1: Rural Sample
Method
Participants
Participants were 186 African American high school students from a rural town in the southern United States (nfemale = 99, nmale = 87). The sample consisted of 9th through 12th graders ranging from 14 to 19 years of age (M = 16 years, 5 months; SD = 1 year, 4 months). The high school is located in a low socioeconomic status area, and offers free lunches to 100% of the students.
Apparatus
Students reported their achievement goals using the AGQ-R (Elliot & Murayama, 2008). As the AGQ-R is a context-specific instrument, all students answered the AGQ-R in regard to their math class. The AGQ-R measures the 2 × 2 achievement goal framework with 12 items (see Table 1 for AGQ-R items). Three items serve as indicators for each of the four achievement goals. Participants indicated the extent to which they agreed with each item on a 5-point Likert-type scale ranging from 1 (strongly disagree) to 5 (strongly agree).
Items for the Achievement Goal Questionnaire-Revised (AGQ-R; Elliot & Murayama, 2008).
Note: The present study used the same item order as Elliot and Murayama (2008).
Procedure
Data collection occurred in the middle of the 2011 spring semester. One of the principal investigators and several graduate students administered student demographic questionnaires and the AGQ-R during students’ math classes.
Data analyses
We employed both confirmatory factor and Rasch analyses to examine the validity of the AGQ-R. Our choice of methods was guided by two factors. First, in their original AGQ-R validation study, Elliot and Murayama (2008) relied exclusively on CFA, so using CFA here allows a direct comparison between their results and ours. Second, other researchers have come to understand the unique information that is gained through employing both traditional and modern techniques in validating instruments (e.g., Boman, Curtis, Furlong, & Smith, 2006; Mok, 2004) but, to date, these complementary techniques have rarely been used in achievement goal validation research (e.g., Muis et al., 2009). We thus included Rasch analyses as a complement to CFA to examine underlying measurement properties at both the item and person level.
CFA
Following Elliot and McGregor (2001) and Elliot and Murayama (2008), we began with a test of the 2 × 2 achievement goal model and followed with six alternative models. The following six models were tested: (1) trichotomous model A, in which the performance-approach and performance-avoidance items loaded on their respective factors and the mastery-approach and mastery-avoidance items were grouped together; (2) trichotomous model B, in which the mastery-approach and mastery-avoidance items loaded on their respective factors and the performance-approach and performance-avoidance items were grouped together; (3) trichotomous model C, in which the mastery-approach and performance-approach items loaded on their respective factors and the mastery-avoidance and performance-avoidance items were grouped together; (4) trichotomous model D, in which the mastery-avoidance and performance-avoidance items loaded on their respective factors and the mastery-approach and performance-approach items were grouped together; (5) a mastery-performance model, in which the mastery-approach and mastery-avoidance items formed one factor and the performance-approach and performance-avoidance items formed a second factor; and (6) an approach-avoidance model, in which the mastery-approach and performance-approach items formed one factor and the mastery-avoidance and performance-avoidance items formed a second factor.
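The seven competing structures can be written down compactly as groupings of the four goals. The following Python sketch is only an illustration of that bookkeeping and is not the software used for the CFA; the helper name `merge` and the dictionary `MODELS` are hypothetical. Each factor is represented by the tuple of goals whose items load on it.

```python
# The four goals of the 2 x 2 framework (definition x valence).
MAP = "mastery-approach"
MAV = "mastery-avoidance"
PAP = "performance-approach"
PAV = "performance-avoidance"
GOALS = [MAP, MAV, PAP, PAV]


def merge(groups):
    """Build a factor structure in which each group of goals shares one factor.

    Goals not named in any group keep their own single-goal factor.
    """
    grouped = {goal for group in groups for goal in group}
    factors = [tuple(group) for group in groups]
    factors += [(goal,) for goal in GOALS if goal not in grouped]
    return factors


MODELS = {
    "2 x 2": merge([]),
    "trichotomous A": merge([(MAP, MAV)]),          # mastery items grouped
    "trichotomous B": merge([(PAP, PAV)]),          # performance items grouped
    "trichotomous C": merge([(MAV, PAV)]),          # avoidance items grouped
    "trichotomous D": merge([(MAP, PAP)]),          # approach items grouped
    "mastery-performance": merge([(MAP, MAV), (PAP, PAV)]),
    "approach-avoidance": merge([(MAP, PAP), (MAV, PAV)]),
}

for name, factors in MODELS.items():
    print(f"{name}: {len(factors)} factors -> {factors}")
```

Because every model assigns all 12 items to exactly one factor, the models differ only in which of the four goals are allowed to collapse into a shared factor.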
Rasch analysis
Rasch analysis provides follow-up measurement information at both the person and item levels. According to Ludlow, Enterline, and Cochran-Smith (2008), “Rasch models are used as confirmatory tests of the extent to which scales have been successfully developed according to explicit a priori measurement criteria” (p. 196). Furthermore, Rasch models are invariant and capable of investigating person and item interactions separately. For the present study, the Rasch Rating Scale Model (RRSM; Andrich, 1978) was utilized. Winsteps measurement software (Linacre, 2010) estimated the parameters for the model using joint maximum likelihood estimation procedures. Consistent with prior Rasch validation studies (e.g., Royal & Elahi, 2011; Wolfe, Ray, & Harris, 2004; Wolfe & Smith, 2007), the evaluation of the AGQ-R’s validity came from six criteria: dimensionality, reliability, rating scale effectiveness, item measure quality, person measure quality, and item hierarchy.
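To make the rating scale model concrete, the sketch below computes the probability of each response category for one person and one item. It follows the usual formulation of Andrich's model, in which all items share one set of step calibrations. It is an illustration only and does not reproduce the Winsteps estimation; the function name and all numeric values are hypothetical.

```python
import math


def rsm_category_probabilities(theta, delta, thresholds):
    """Category probabilities under the Rasch Rating Scale Model.

    theta      -- person measure in logits
    delta      -- item difficulty in logits
    thresholds -- step calibrations tau_1..tau_m shared by all items
    Returns m + 1 probabilities, one per rating category (lowest first).
    """
    # Numerator exponent for category k is the sum of (theta - delta - tau_j)
    # over the first k steps; the lowest category has an empty sum of 0.
    exponents = [0.0]
    for tau in thresholds:
        exponents.append(exponents[-1] + (theta - delta - tau))
    # Subtract the largest exponent before exponentiating for numerical stability.
    peak = max(exponents)
    weights = [math.exp(value - peak) for value in exponents]
    total = sum(weights)
    return [weight / total for weight in weights]


# Hypothetical step calibrations for a 5-category scale (not the study's values).
thresholds = [-1.5, -0.5, 0.5, 1.5]
probs = rsm_category_probabilities(theta=1.0, delta=0.0, thresholds=thresholds)
# Expected rating when categories are scored 1 (strongly disagree) to 5 (strongly agree).
expected_score = sum((k + 1) * p for k, p in enumerate(probs))
print([round(p, 3) for p in probs], round(expected_score, 2))
```

A person whose measure lies above the item difficulty, as in this example, is more likely to choose the upper categories, which is the sense in which an item is "easy to endorse."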
Results
CFA
Even though it provided the best fit, the hypothesized 2 × 2 achievement goal model did not meet the criteria for a good model fit (Comparative Fit Index [CFI] = 0.88; Incremental Fit Index [IFI] = 0.88; Root Mean Square Error of Approximation [RMSEA] = 0.090; χ²/df = 2.34, χ²(48, n = 186) = 112.3, p < .01). Varying the alignment of the items, either by combining performance and mastery items or by combining approach and avoidance items, did not lead to acceptable fit for any of the six alternative models (see Table 2). Figure 1 illustrates the factor loadings and correlations for the hypothesized 2 × 2 achievement goal model for the rural sample. Most factor loadings were acceptable, with the exception of two (Q1 and Q11). All four achievement goals were significantly correlated with each other (p < .01). Cronbach’s α ranged from 0.68 to 0.74, demonstrating questionable (0.6 < α < 0.7; George & Mallery, 2003) to acceptable (0.7 < α < 0.8; George & Mallery, 2003) levels of internal consistency (see Table 3 for descriptive and reliability statistics).
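Two of the quantities reported above are simple to compute by hand. The sketch below shows the χ²/df ratio calculated from the reported χ² and degrees of freedom, and Cronbach's α calculated with its standard formula. The response matrix is hypothetical (the raw data are not reproduced here), and treating the lower bound of each George and Mallery band as inclusive is an assumption of this sketch.

```python
from statistics import variance


def chi_square_ratio(chi_square, df):
    """Relative chi-square: the model chi-square divided by its degrees of freedom."""
    return chi_square / df


def cronbach_alpha(rows):
    """Cronbach's alpha for a list of respondents, each a list of item scores."""
    k = len(rows[0])
    item_variances = [variance([row[i] for row in rows]) for i in range(k)]
    total_variance = variance([sum(row) for row in rows])
    return (k / (k - 1)) * (1 - sum(item_variances) / total_variance)


def alpha_label(alpha):
    """Label alpha using only the two bands quoted in the text.

    Assumption: lower bounds are treated as inclusive.
    """
    if 0.6 <= alpha < 0.7:
        return "questionable"
    if 0.7 <= alpha < 0.8:
        return "acceptable"
    return "outside the two bands quoted in the text"


# Reported values for the rural sample: chi-square = 112.3 on 48 df.
print(round(chi_square_ratio(112.3, 48), 2))

# Hypothetical responses of five students to one three-item subscale (1-5 scale).
responses = [[4, 5, 4], [3, 3, 4], [5, 5, 5], [2, 3, 2], [4, 4, 5]]
print(round(cronbach_alpha(responses), 2))

# The reported range of alpha for the rural sample.
print(alpha_label(0.68), alpha_label(0.74))
```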
Comparisons Between Models.
Note: N = 186 for Study 1. N = 197 for Study 2.

CFA for the rural sample (Study 1). All coefficients are standardized and significant (p < .01). Error terms have been removed from the model for simplicity.
Descriptive Statistics.
Rasch analysis
We conducted the Rasch analysis examining all 12 achievement goal items concurrently, as CFA results did not provide support for splitting the items according to any meaningful pattern (e.g., approach/avoidance, mastery/performance, 2 × 2 structure). Dimensionality was assessed by conducting a principal components analysis of standardized residual correlations, which is useful to identify the amount of variance explained by each extracted principal component or dimension. A total of 44.3% of the variance was explained by the first extracted Rasch dimension. The largest secondary dimension explained 7.2% of the variance. The eigenvalue of the first contrast was 2.1. A contrast of at least 2.0 is necessary to be considered a dimension (Linacre, 2012). Thus, if a secondary dimension is present it only has the strength of about two items. This information suggests there is sufficient evidence that the construct is unidimensional, thus making the RRSM an appropriate tool for data analysis.
Reliability and separation measures indicate the extent to which scores are reproducible (see Table 4). Separation measures indicate the number of statistically distinguishable levels in the data. Person reliability was 0.77, indicating moderate internal consistency. Item reliability was stable at 0.95, indicating high item reliability. Separation estimates of 1.85 for persons indicated a reasonable spread. Item separation measures of 4.44 indicate sufficient spread of items.
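Reliability and separation are two expressions of the same information. In the standard Rasch formulation, separation G and reliability R are linked by G = sqrt(R / (1 − R)), or equivalently R = G² / (1 + G²). The sketch below applies that relation to the values reported for the rural sample. Small discrepancies from the reported figures are expected because the published reliabilities are rounded to two decimals. The function names are hypothetical.

```python
import math


def separation_from_reliability(reliability):
    """Separation index G implied by a reliability coefficient R."""
    return math.sqrt(reliability / (1.0 - reliability))


def reliability_from_separation(separation):
    """Reliability R implied by a separation index G."""
    return separation ** 2 / (1.0 + separation ** 2)


# Reported (reliability, separation) pairs for the rural sample.
reported = {"persons": (0.77, 1.85), "items": (0.95, 4.44)}

for label, (reliability, separation) in reported.items():
    implied_g = separation_from_reliability(reliability)
    implied_r = reliability_from_separation(separation)
    print(f"{label}: reported G = {separation}, implied G = {implied_g:.2f}; "
          f"reported R = {reliability}, implied R = {implied_r:.3f}")
```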
Reliability and Separation Measures.
Rating scale effectiveness indicates how each rating scale category is functioning and, more importantly, whether the categories are functioning properly according to Rasch assumptions. It was assessed by evaluating the sample’s use of the rating scale categories and the scale’s inferential value (see Table 5). Counts and percentages were provided to determine the extent to which the various rating scale categories were utilized by participants. INFIT and OUTFIT mean square fit statistics indicated the extent to which each rating scale category is “noisy,” or producing calibrations that are not desirable for productive measurement. Structure calibration refers to the calibrated measure of transition between categories. Also called “step calibration,” this measure indicates how difficult it is to observe each category. Results indicate that the rural sample responded in a skewed manner, finding it easier to agree with items the majority of the time. Fit statistics are within acceptable range and indicate relatively noise-free calibrations. Step calibrations are somewhat problematic. These calibrations should advance from smallest to largest in accordance with the direction of the scale. Here, the second negative category has a positive calibration, while the third and fourth categories are negative. Step disordering does not necessarily mean respondents were unable to distinguish the difference between each category. Here, it likely reflects the low probability that certain categories will be observed due to the skewness of participants’ responses.
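Checking for disordered steps amounts to verifying that each step calibration is larger than the one before it. The sketch below shows that check. The calibrations are hypothetical values chosen only to mimic the pattern described above (a positive calibration for the second category, negative calibrations for the third and fourth); the actual values are those in Table 5.

```python
def disordered_categories(step_calibrations):
    """Return the categories whose step calibration fails to exceed the previous one.

    step_calibrations maps a category number to its step calibration in logits.
    The lowest category has no step calibration and is omitted.
    """
    categories = sorted(step_calibrations)
    flagged = []
    for previous, current in zip(categories, categories[1:]):
        if step_calibrations[current] <= step_calibrations[previous]:
            flagged.append(current)
    return flagged


# Hypothetical calibrations for categories 2-5 of a 5-point scale.
hypothetical_steps = {2: 0.40, 3: -0.70, 4: -0.30, 5: 0.60}
print(disordered_categories(hypothetical_steps))

# A properly ordered scale advances from smallest to largest and flags nothing.
ordered_steps = {2: -1.20, 3: -0.40, 4: 0.30, 5: 1.30}
print(disordered_categories(ordered_steps))
```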
Rating Scale Effectiveness.
Item measure quality is determined by investigating the extent to which the items vary in difficulty, the size of the standard errors, and the degree to which the items fit the model’s expectations (see Table 6). Item difficulty calibrations ranged from −0.62 to 0.79 logits, indicating a lack of spread. Standard errors ranged in size from 0.07 to 0.11. As mentioned previously, fit statistics are useful for identifying noisy measures. Wright and Linacre (1994) indicate for rating scales, values of 0.6 to 1.4 are ideal. With regard to the present data, items Q1 and Q6 demonstrate a bit of noise with regard to OUTFIT mean square statistics (1.50 and 1.42 respectively).
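The screening rule described above can be sketched as a simple range check against the 0.6 to 1.4 window. In the example, the OUTFIT values for Q1 and Q6 are those reported in the text; the values for Q2 and Q7 are hypothetical placeholders added only so that the check has non-flagged items to pass over.

```python
LOWER_BOUND = 0.6  # lower limit of the range cited from Wright and Linacre (1994)
UPPER_BOUND = 1.4  # upper limit of the same range


def flag_noisy_items(outfit_by_item, lower=LOWER_BOUND, upper=UPPER_BOUND):
    """Return the items whose OUTFIT mean square falls outside [lower, upper]."""
    return {
        item: value
        for item, value in outfit_by_item.items()
        if value < lower or value > upper
    }


outfit_mean_squares = {
    "Q1": 1.50,  # reported in the text
    "Q6": 1.42,  # reported in the text
    "Q2": 0.95,  # hypothetical placeholder
    "Q7": 1.10,  # hypothetical placeholder
}

print(flag_noisy_items(outfit_mean_squares))
```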
Item Fit Statistics.
Person measure quality is similar to item measure quality, and indicates replicability of scores if the same items were given to a similar sample of participants. It is determined by investigating the stability of the measures, their associated standard errors, and the extent to which the measures are noisy (see Table 7). A total of 13 participants were removed from the final analysis because they grossly misfit the model with OUTFIT mean square fit statistics of 2.0 or greater (Wolfe & Smith, 2007). The measures were relatively stable after removing these outliers. Fit statistics nearly met the ideal value of 1.0, indicating relatively noise-free calibrations.
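For readers unfamiliar with the mean square statistics, OUTFIT is the unweighted average of a person's squared standardized residuals across items, whereas INFIT weights each squared residual by the model variance of the response. Both have an expected value of 1.0. The sketch below computes the two statistics for a single respondent and applies the 2.0 exclusion threshold used here. The expected scores and variances would come from the fitted model; the numbers shown are hypothetical.

```python
MISFIT_THRESHOLD = 2.0  # OUTFIT cutoff used in the text (Wolfe & Smith, 2007)


def person_fit(observed, expected, variances):
    """Return (infit, outfit) mean squares for one respondent.

    observed  -- the ratings the person gave
    expected  -- model-expected ratings for the same items
    variances -- model variances of those ratings
    """
    squared_residuals = [(x - e) ** 2 for x, e in zip(observed, expected)]
    # OUTFIT: mean of squared standardized residuals (outlier-sensitive).
    outfit = sum(r / w for r, w in zip(squared_residuals, variances)) / len(observed)
    # INFIT: squared residuals weighted by the information in each response.
    infit = sum(squared_residuals) / sum(variances)
    return infit, outfit


# Hypothetical respondent answering four items.
observed = [5, 1, 4, 5]
expected = [4.2, 3.9, 4.0, 4.4]
variances = [0.7, 0.9, 0.8, 0.6]

infit, outfit = person_fit(observed, expected, variances)
print(round(infit, 2), round(outfit, 2), outfit >= MISFIT_THRESHOLD)
```

In this hypothetical case, a single highly unexpected rating (a 1 where about 3.9 was expected) is enough to push OUTFIT past the threshold.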
Overall Data to Model Fit Statistics.
Item hierarchy investigates the extent to which the items rank-order themselves in a manner that is consistent with theory. The item hierarchical map (Figure 2) presents an illustration of the item hierarchy. Items Q7 and Q8 were the most difficult items for students to endorse, followed by Q10 and Q5. Item Q2 was the easiest item for students to endorse. Based on the distribution of the person sample (left side of the map), there appears to be a ceiling effect. In addition, most items are relatively easy to endorse because the item mean (0 logits) is located near the bottom of the person endorsability scale.

Item Hierarchical Map for the Rural Sample (Study 1).
Each of the four hypothesized achievement goal subscales, as well as any other dichotomous and trichotomous breakdowns, can also be identified on the item hierarchical map. The placement of the items along their respective subscale indicates whether or not the scale seems to be functioning appropriately. For example, as we know that items Q1, Q3, and Q7 are supposed to make up the mastery-approach subscale, we can examine the distribution of all three items on the hierarchical map. If all three items are clumped together on the hierarchical map, then the scale is not working. If they are spread out, preferably with some items appearing above the item mean and some below, then we might have some evidence that the scale is working properly (as it is more likely to spread people and items out, thus increasing reliability). In this study, the tendency for the items of each subscale to be clumped together on the item map is aligned with the CFA results, as it fails to provide strong support for the 2 × 2 achievement goal model.
Overall, the Rasch results for the rural sample highlight some psychometric issues related to the use of the AGQ-R with low-income rural African American high school students. Study 2, described in the next section, was designed to determine if similar issues could be found with low-income urban African American high school students.
Study 2: Urban Sample
Method
Participants
Participants were 197 African Americans from two urban high schools located in a large city in the southern United States (nfemale = 102, nmale = 95). The sample consisted of 9th through 12th graders ranging from 13 to 19 years of age (M = 15 years, 8 months; SD = 1 year, 3 months). Both high schools are located in a low socioeconomic status area and offer free lunches to 100% of the students.
Apparatus
As in Study 1, the AGQ-R was used.
Procedure
Data were collected in the middle of the 2011 fall semester. Students answered the AGQ-R in regard to their math class in the schools’ computer labs.
Data analyses
The analyses conducted for Study 2 were the same as the ones used in Study 1.
Results
CFA
The hypothesized 2 × 2 model provided a better fit than the six alternative models (see Table 2). Similar to Study 1, CFA results did not provide strong support for the 2 × 2 achievement goal structure. None of the four fit statistics met the criteria for good model fit (CFI = 0.87; IFI = 0.87; RMSEA = 0.090; χ²/df = 2.58, χ²(48, n = 197) = 123.8, p < .01). Figure 3 illustrates the factor loadings and correlations for the hypothesized 2 × 2 achievement goal model for the urban sample. Four out of the 12 factor loadings (Q7, Q5, Q9, and Q4) were unacceptable. All achievement goals were significantly correlated with each other (p < .01), except for mastery-approach and mastery-avoidance goals. With the exception of the Cronbach’s α for mastery-approach goals (0.71), which was merely acceptable, all Cronbach’s α values were questionable (George & Mallery, 2003), ranging from 0.60 to 0.67 (see Table 3 for descriptive and reliability statistics).

Confirmatory factor analysis for the urban sample (Study 2). All coefficients are standardized and significant (p < .01), with the exception of the mastery-approach to mastery-avoidance coefficient, which is not significant. Error terms have been removed from the model for simplicity.
Rasch analysis
A total of 41.9% of the variance was explained. Results indicated a single Rasch dimension, as the second contrast produced an eigenvalue of 1.9. Thus, similar to Study 1, the data were found to be sufficiently unidimensional.
Person reliability estimates indicated good internal consistency with a value of 0.82. At the item level, a stable item reliability of 0.97 indicated high item reliability. Separation estimates of 2.11 for persons and 5.51 for items indicated sufficient spreads (see Table 4).
Consistent with Study 1, rating scale effectiveness results with the urban sample showed that the participants favored the positive end of the scale. Fit statistics are within acceptable range indicating relatively noise-free calibrations. Again, step calibrations continue to have some issues as they did not advance from smallest to largest in accordance with the direction of the scale (see Table 5).
Item difficulty calibrations ranged from −0.48 to 1.34 logits, indicating an acceptable spread. Standard errors ranged in size from 0.08 to 0.12. Overall, Q9 was the only item with a bit of noise with an OUTFIT mean square statistic of 1.44 (see Table 6).
For person measure quality, a total of 31 participants were removed from the final analysis because they grossly misfit the model with OUTFIT mean square fit statistics of 2.0 or greater (Wolfe & Smith, 2007). INFIT mean square values were a little higher than desirable but still within acceptable range. OUTFIT mean square values were very close to the ideal value of 1.0, indicating relatively noise-free calibrations (see Table 7).
Lastly, the item hierarchical map (Figure 4) indicates that most students could easily endorse just about every item with the exception of Q9 and Q5. Item Q3 was the easiest item for students to endorse. Similar to Study 1 and consistent with the CFA results, the tendency for the items of each subscale to be clumped together on the item map revealed a lack of support for the 2 × 2 achievement goal model.

Item Hierarchical Map for the Urban Sample (Study 2).
General Discussion
We addressed three perceived gaps in the existing achievement goal validation literature by examining the psychometric properties of the AGQ-R among rural and urban African American high school students employing CFA and Rasch analyses. For both samples, CFA results failed to yield acceptable fit statistics. Conversely, Rasch results suggested mixed findings at the individual person and item level. As both Study 1 and Study 2 yielded similar results, findings from both studies are discussed concurrently next.
CFA
Across both samples, individual reliability data (Cronbach’s α) did not provide strong support that the AGQ-R measured four empirically separable and internally consistent achievement goals. A lack of acceptable fit statistics along with low factor loadings did not uphold the overall construct validity of the 2 × 2 achievement goal model among both samples. Furthermore, fit statistics for six follow-up models showed no improvement in overall fit.
In addition to not confirming the 2 × 2 factorial structure, our CFA results did not support Elliot and Murayama’s (2008) original finding that goals sharing a common definition dimension (i.e., mastery or performance) were more closely related than goals sharing a common valence dimension (i.e., approach or avoidance). In fact, across the present studies, not only were goals sharing a common valence dimension (e.g., mastery-approach and performance-approach goals) more closely related than goals sharing a common definition dimension (e.g., mastery-approach and mastery-avoidance goals), but some goals that shared neither a definition nor a valence dimension were also positively correlated (e.g., mastery-approach and performance-avoidance goals). Previous research suggests that mastery-approach and performance-avoidance goals are distinct constructs (Elliot & Murayama, 2008). Thus, from a conceptual perspective, we were surprised that these goals were as highly correlated as they were. However, a closer look at the interfactor correlations provides some rationale for these seemingly conflicting results. That is, the significant positive correlations between all four goals suggest that the four AGQ-R subscales might not be empirically distinguishable from each other. This lack of discriminant validity might have been precipitated by the use of high school students, who, unlike the college students from the original AGQ-R validation study (Elliot & Murayama, 2008), might have encountered confusion with some AGQ-R wording, thus leading to an inability to differentiate among some of the items. As suggested by follow-up Rasch analyses, inconsistent findings from the CFA results may be explained, in part, by some confusion or bias existing at the specific item level on the AGQ-R. We turn our attention to those results next.
Rasch Analyses
In both Study 1 and Study 2, the principal components analysis of residuals determined a strong primary Rasch dimension for all 12 items, providing evidence toward the substantive aspect of validity. Reliability estimates were less than desirable for person measures, but good for items. The moderate levels of reliability are likely due to the items not varying much and the participants having little trouble endorsing the items. The effectiveness of the rating scale addressed the structural aspect of validity. Various quality control checks presented acceptable results with the exception of disordered steps. In both datasets, the disordering was likely due to the skewness of responses, as students tended to favor the positive end of the scale. Results of item measure quality appeared sound and lend evidence to the content aspect of validity. In addition, there appears to be some evidence for communicative validity (Lopez, 1996), or the extent to which the rating scale categories are sufficient and appropriately interpreted by students. Therefore, the problems that may be apparent in the rating scale are likely due to the targeting of items for these particular samples of students, as virtually all 12 items were rather easy to endorse.
Limitations and Future Directions
While the current study sheds more light on the applicability of the AGQ-R with African American students, it has several limitations that should be acknowledged. Findings might not be generalizable to all African American students as both our rural and urban samples were selected from low socioeconomic areas. Additional research examining African Americans in various developmental stages and socioeconomic statuses is therefore necessary before drawing conclusions. Beyond African Americans, more AGQ-R research is needed on noncollege students with different ethnicities.
At this time, we can only speculate as to the reasons why the findings of the current study did not match prior findings. It could be, for instance, that high school students, as opposed to more cognitively developed college students, encounter difficulty with the wording of some AGQ-R items, therefore leading to a lack of sufficient differentiation among the four achievement goal constructs. The double negative wording of the avoidance goals (e.g., “avoid learning less,” “avoid performing worse”) might have been especially difficult to comprehend for our samples of high school students. This hypothesis could explain why all items were relatively easy to endorse as well as why all four achievement goals were highly correlated. Additional research should examine whether cognitive abilities play a role in interpreting AGQ-R items. Furthermore, modification indices could be used in future studies to identify additional reasons for lack of fit. Lastly, collapsing the AGQ-R rating scale down to three categories could also be considered in future studies, as it might make it easier for students to rate the items. Doing so might improve the disordered step calibrations (Muis et al., 2009), which in turn should enhance the interpretability of measures.
Conclusion
The current study provides preliminary evidence to inform educators and researchers about the applicability of the AGQ-R for African American high school students. Collectively, findings from both CFA and Rasch analyses highlight continued strengths, but also apparent cautions that must be considered when utilizing the AGQ-R among samples of African American high school students.
Footnotes
Declaration of Conflicting Interests
The author(s) declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.
Funding
Funding was received from an anonymous donation to the University of Memphis Foundation and a University of Memphis Faculty Research Grant awarded to Dr. Martin Jones.
