Abstract
In this rejoinder, we focus on five important themes derived from the valuable comments of Harvey; Wang, Hogge, and Sahai; and Whittaker and Worthington. Specifically, we reflect on (a) methodological and conceptual issues associated with the use of focus groups in developing an initial item pool, (b) the debate about the use of Rasch versus other item response theory (IRT) models, (c) response scale functioning and the use of item parcels, (d) bandwidth and scale sensitivity, and (e) multicultural considerations. We reconsider the strengths and limitations of the approaches that we have endorsed in light of the comments of our colleagues. We conclude with the hope that dialogue on the use of focus groups and IRT in measurement development and on multicultural assessment continues.
We are grateful for three very thoughtful responses (Harvey, 2016; Wang, Hogge, & Sahai, 2016; Whittaker & Worthington, 2016) to our contribution advocating wider use of focus groups and Rasch model item response theory (IRT) methods for instrument development (Mallinckrodt, Miles, & Recabarren, 2016). In the space available, we cannot respond to all of the excellent points raised by our colleagues. Our rejoinder is organized into five themes that highlight many areas where we agree with their critique, as well as a few points of disagreement. We view the latter as mostly differences of emphasis rather than as diametrically opposite viewpoints.
Focus Groups and Content Validity
Wang et al. (2016) noted that we did not clearly articulate our paradigmatic stance or attend to epistemological issues in our recommendations for using focus groups. We agree that this is important in mixed-methods research. This suggestion may have actually helped strengthen our argument for the use of focus groups in measurement development research, particularly with a cultural construct. Quantitative instruments like the Everyday Multicultural Competencies/Revised Scale of Ethnocultural Empathy (EMC/RSEE) are typically used in research grounded in a postpositivist paradigm that seeks to approximate an objective reality through hypothesis testing (Ponterotto, 2005), but these instruments need not be developed exclusively from this perspective. Using focus groups to generate items for a quantitative measure does not remove measurement development from the domain of postpositivism, but the use of mixed-methods does imply a valuing of both subjective and objective data (Hansen, Creswell, Plano Clark, Petska, & Creswell, 2005). Thus, it embraces the constructivist-interpretivist notion that meaning is uncovered through reflection “stimulated by the interactive researcher-participant dialogue” (Ponterotto, 2005, p. 129). Specifically, in terms of research on cultural issues, this stance values and prioritizes the voices of those who have lived experience with the construct under study, rather than foregrounding the voices of the research team. In addition, inviting focus group members themselves to generate example items that illustrate their experience and understanding of a construct can also help reduce researcher bias in interpreting the focus group data and translating it into actual items. Thus, we agree with the recommendation by Wang et al. that researchers using mixed-methods should clearly describe their research paradigm and epistemology. 
Explicitly addressing these steps is an important component in the larger goals of self-reflexivity and bracketing that help researchers reduce bias. As Wang et al. point out, more information about our research team’s own self-reflection and subjectivity would have been especially important given that one of our focus groups consisted of graduate student participants who were also members of the research team.
We agree with Wang et al.’s (2016) further suggestion that “it is imperative for researchers to discuss who has the power to define everyday multicultural competencies and who should be included in the focus groups.” Counseling psychologists should strive to be advocates for social justice in each of their professional roles, including (and especially) that of researcher (Vera & Speight, 2003). When developing a measure and defining a construct—particularly a cultural construct—failure to attend to power dynamics risks further silencing the voices of those from historically marginalized social identity groups and perpetuating systemic oppression. Wang et al. cautioned that focus groups consisting of members with the same ethnic/racial identity may have generated different items than the mixed groups that we used. Whittaker and Worthington (2016) presented a related critique regarding the usefulness of focus groups, suggesting that “to produce items that will be applicable and generalizable across a variety of identity groups, representation of participants from a multitude of identities is likely to be necessary.”
In assembling the three focus groups to develop the EMC/RSEE, we solicited individuals whom we felt had a vested interest in, knowledge of, and experience with the development of multicultural competencies among undergraduate students in a variety of settings. We solicited input from undergraduate residence-hall peer counselors, doctoral students involved in undergraduate education in a counseling psychology program with a heavy emphasis on multiculturalism and social justice, and university administrators. Thus, we sampled a broad range of campus perspectives about how multicultural competencies are taught to and modeled for undergraduate students. Although we should have made this clear in Mallinckrodt et al. (2016; see also Mallinckrodt et al., 2014), our focus groups did reflect diversity in terms of age, ethnicity, nationality, race, religion, sexual orientation, and social class, which, as described in the following paragraph, helped expand our thinking about the universe of items. However, we acknowledge that this diversity, and the diversity of the samples used in the validation study, was limited to the context of a predominantly White institution, which limits the applicability of the current EMC/RSEE to White students. Wang et al. (2016) also bring up the issue of power dynamics that may exist within preexisting groups. Although we tried to minimize this to the extent possible by conducting separate groups for individuals presumed to have similar levels of power (i.e., undergraduate students, graduate students, and university administrators), we agree that a further examination of power dynamics and preexisting relationships could aid in the interpretation of focus group data.
We also note that representing the social identities of all potential research participants was not our primary intention, nor was it the goal of the focus groups used in Mallinckrodt et al. (2014). Instead, the great advantage of using focus groups to generate an initial item pool is to establish content validity “by defining a universe of items and sampling systematically within this universe” (Cronbach & Meehl, 1955, p. 282). From this perspective, the ideal focus group for generating items is composed of members whose life experiences have given them extensive exposure to the construct of interest, regardless of their social identity. The task of developing a new instrument can be summarized by a variation on a staple phrase of U.S. courtroom TV dramas, “the construct, the whole construct, and nothing but the construct.” From this framework we believe much of Worthington and Whittaker’s (2006) summary of best practices emphasizes the important task of “nothing but the construct” by focusing on construct validity, eliminating items that are confounded by extraneous factors, and assessing validity and reliability in classical test theory (CTT) terms. Our intention in advocating for a wider use of focus groups to generate an initial item pool was to draw more attention to the equally important task of sampling “the whole construct” as Cronbach and Meehl (1955) described content validity. Here the goal is to ensure that all important facets of the construct are represented in the measure. We stress again how crucial the additional perspective of a focus group is when the researchers themselves have little or no direct experience with the construct. The goal should not be for the focus group(s) to represent social identities different from the researchers or a range of possible identities but, instead, to represent the broadest possible experience with the construct.
To illustrate the value of tapping these perspectives, we note that Factor 2, Resentment and Cultural Dominance, of the EMC/RSEE was identified only through participation by residence hall peer advisors in one of the three focus groups (Mallinckrodt et al., 2014). This factor would not have emerged if the researchers alone had generated the initial item pool. It is disquieting to consider that even without the items contributed by the peer advisor focus group, five of the six remaining EMC/RSEE factors would almost certainly have emerged and have been confirmed. The crucial point is that, as valuable as the procedures described by Worthington and Whittaker (2006) are, following them faithfully would not have provided us with an indication that the EMC/RSEE lacked the crucial dimension assessed by Factor 2. This subscale is emerging as one of the most interesting because it may assess reactionary backlash in the responses of some participants to diversity training (Mallinckrodt, Miles, & Chery, 2016).
Whittaker and Worthington (2016) raise further concerns because using focus groups can be cumbersome given that scale development often involves multiple studies. They argue against adopting a new best practice and are concerned about the impact on the editorial review process. We agree that it would be best to use multiple focus groups, but it is far better to have one group assist in generating an item pool than none at all. The difficulty of organizing multiple groups should not deter researchers from including at least one. We paraphrase Voltaire: Le mieux est l’ennemi du bien, or “the best is the enemy of good [enough].” Regarding best practices, Worthington and Whittaker (2006) themselves suggested that “having the items reviewed by one or more groups of knowledgeable people (experts) to assess item quality on a number of different dimensions is another critical step in the process. At a minimum, expert review should involve an analysis of content validity” (p. 814).
As an alternative to this best practice (i.e., having groups of experts review items after they have been generated by the researchers), our position is that it can be just as useful to form groups of experts with real-life experience of the construct and involve them in the generation of the initial item pool.
Regarding the editorial review process, attention to content validity has been a norm for some time. For example, Mallinckrodt (2006) stated as incoming editor of the Journal of Counseling Psychology (JCP) that “it is important for authors to furnish convincing evidence that the initial item pool they have generated adequately samples the construct domain. Generally, readers have more confidence that the domain of interest has been adequately sampled when items are generated by panels of experts, by focus groups, or through other methods designed to tap the experience of informants who have had a wide range of direct experience with the construct” (p. 129).
The editorial review principle is that researchers have an affirmative burden to make a case for the content validity of the instruments they develop by describing procedures to give some assurance that the universe of items has been adequately sampled. Focus groups are one very good way to make this case, but because they are not the only way, we agree that they should not be elevated to the status of a best practice in instrument development. We do believe that describing whatever method was used to establish content validity of the initial item pool should be a best practice. Because concerns about content validity can represent a “fatal flaw,” they were a frequent reason instrument development manuscripts were rejected from JCP without an opportunity to revise (Hoyt & Mallinckrodt, 2012).
Data Fit of Rasch Versus Alternative IRT Models
One of the most stringent requirements of the Rasch model is that all items have equal discrimination parameters, and thus, each has an equal relationship to the latent trait. Although this assumption cannot be met perfectly by any set of actual data, the question should not be presented as a false dichotomy in which either the model fits perfectly or researchers are guilty of “forcing items to fit the model on an ad hoc basis.” Instead, the question of model fit is one of degree, not unlike the decision to conduct parametric statistical tests on data that are not distributed perfectly normally. In both cases, researchers must assess the degree to which data depart from the required assumptions of the procedure they intend to use (understanding that no real-world data will match perfectly), together with how robust the procedure is to a given type of departure. However, the matter is more complex than this because, as Harvey (2016) pointed out, if one believes that factors such as guessing, social desirability, and variability in item discrimination are important, then the Rasch model is indeed fundamentally misspecified.
We agree with all our colleagues (Harvey, 2016; Wang et al., 2016; Whittaker & Worthington, 2016) that techniques based on the Rasch model should never be uncritically adopted and that one model must never be assumed best in all circumstances. We pointed out that many of the powerful benefits of the Rasch model derive from its assumption of specific objectivity, which in turn depends on a simplified view of how latent constructs, test items, and test takers interact. One side in the “Rasch wars” believes that the model is hopelessly flawed in making these simplistic assumptions, whereas Rasch proponents believe that the model is relatively robust to modest departures from these requirements and therefore is useful in many applications (McNamara & Knoch, 2012). For many researchers, the fundamental assumptions of the Rasch model are “deal breakers” because the parameters it omits simply cannot be ignored. Although we respect this position, for us the significance of ignoring parameters excluded by the Rasch model is a relative matter. Of course, the fit of the data to the model must always be assessed. Thissen and Wainer (1982) advised beginning by examining simpler models first, starting with the 1-pl or Rasch model. If only a small proportion of items causes an observed lack of fit and they do not form a coherent subset of content, they can be omitted, much as researchers shorten subscales based on CTT considerations of item-factor loadings. We find considerable value in Harvey’s (2016) alternative suggestion to start with a more complex model and work toward more simplicity. We are also very intrigued by Harvey’s (2016) suggestion that an alternative ideal-point approach may be preferable to the family of dominance approaches that includes the Rasch model. The number of IRT approaches is ever growing and will present researchers with new dilemmas for selecting the most appropriate approach.
Our position is that the Rasch model offers considerable advantages if it provides a reasonable fit to the data without discarding too many items, and researchers are comfortable philosophically with the fundamental assumptions it makes.
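To make the equal-discrimination requirement concrete, the following is a minimal sketch that contrasts the Rasch (1-pl) item response function with a 2-pl function for dichotomous items. It is illustrative only and is not drawn from any of the analyses discussed here; the function names and all parameter values are hypothetical assumptions.

```python
import math

def irf_2pl(theta, b, a=1.0):
    """Probability of endorsing a dichotomous item under the 2-pl model.

    theta: person location on the latent trait
    b:     item difficulty
    a:     item discrimination
    """
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))

def irf_rasch(theta, b):
    """Rasch model: the 2-pl with discrimination fixed at 1 for every item."""
    return irf_2pl(theta, b, a=1.0)

# Hypothetical person locations and item parameters (assumptions for illustration).
thetas = [-2.0, -1.0, 0.0, 1.0, 2.0]
rasch_easy = [irf_rasch(t, b=-0.5) for t in thetas]
rasch_hard = [irf_rasch(t, b=0.5) for t in thetas]
steep_item = [irf_2pl(t, b=0.0, a=2.0) for t in thetas]
flat_item = [irf_2pl(t, b=0.5, a=0.5) for t in thetas]

# Under the Rasch model the curves never cross: the easier item is more
# likely to be endorsed at every person location.
rasch_never_cross = all(e > h for e, h in zip(rasch_easy, rasch_hard))

# With unequal discrimination the ordering of the two items reverses
# between low and high person locations.
steep_lower_at_low_theta = steep_item[0] < flat_item[0]
steep_higher_at_high_theta = steep_item[-1] > flat_item[-1]
```

In this sketch the two Rasch curves keep the same order everywhere, whereas the two 2-pl curves cross. How much crossing of this kind can be tolerated before Rasch-based conclusions are compromised is the matter of degree discussed above.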
Analysis of Response Scale Performance and Item Parcels
One of these powerful advantages is the rich detail in analyses of response scale performance provided by Rasch IRT methods. We believe this aspect of instrument development receives far too little attention. The benefits of careful item wording are not realized if the response scale performs poorly. Consider, for example, that if respondents cannot meaningfully differentiate between levels of a 7-point response scale, their failure to do so will inject error into every item of the scale, as Mallinckrodt and Tekie (2015) found in their analysis of the Working Alliance Inventory (WAI). It is true that the Rasch model omits measurement parameters that many researchers believe are crucial, but the model can also be used to check other critical assumptions that are often ignored, such as the step calibrations between response scale points or the possibility of differential item function across critical demographic groups.
We provided the full data set from Mallinckrodt et al. (2014) to Whittaker and Worthington (2016) for reanalysis. They only examined data for Factor 2 at Time 1. Thus, one reason for the differences in our findings may have been that we conducted IRT and factor analyses (both exploratory and confirmatory) on randomly selected subsamples drawn from both time points, and scores on Factor 2 changed significantly in the 6 weeks between the middle and end of the fall semester. To clarify their questions about the order of our procedures, the final stage of IRT analyses involved examination of Andrich step calibrations between response scale points for a subscale as a whole. Neighboring response points for three subscales were collapsed. Whittaker and Worthington do raise an important concern in that the item difficulties we reported were all based on the initial 6-point response scale. These coefficients can be expected to differ after collapsing some response scale points, but not by much given the relatively low frequency of respondents who used the responses we collapsed. Nevertheless, after deciding to collapse categories based on Andrich step calibrations, it is desirable to repeat the entire sequence of IRT analyses that lead to the final selection of an item pool.
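The logic of inspecting step calibrations and then collapsing neighboring response points can be sketched as follows. This is a minimal illustration under the Andrich rating scale model, not a reproduction of the analyses reported in Mallinckrodt et al. (2014); the 4-category scale, the step values, and the helper names are hypothetical assumptions chosen to keep the example short.

```python
import math

def rsm_category_probs(theta, b, taus):
    """Category probabilities for one item under the Andrich rating scale model.

    theta: person location; b: item difficulty;
    taus:  step calibrations between adjacent response categories.
    Returns len(taus) + 1 probabilities that sum to 1.
    """
    logits = [0.0]  # category 0 is the reference category
    for tau in taus:
        logits.append(logits[-1] + (theta - b - tau))
    top = max(logits)  # subtract the maximum for numerical stability
    weights = [math.exp(x - top) for x in logits]
    total = sum(weights)
    return [w / total for w in weights]

def modal_categories(b, taus, grid):
    """Categories that are the most probable response somewhere on the grid
    (ties go to the lower category)."""
    modal = set()
    for theta in grid:
        probs = rsm_category_probs(theta, b, taus)
        modal.add(max(range(len(probs)), key=probs.__getitem__))
    return modal

grid = [i / 10 for i in range(-40, 41)]  # person locations from -4 to 4

# Hypothetical step calibrations: one ordered set, one disordered set.
ordered_steps = [-2.0, 0.0, 2.0]
disordered_steps = [-2.0, 1.0, 0.0]

modal_ordered = modal_categories(0.0, ordered_steps, grid)
modal_disordered = modal_categories(0.0, disordered_steps, grid)

# Collapsing neighboring response points: merge categories 1 and 2.
collapse_map = {0: 0, 1: 1, 2: 1, 3: 2}  # hypothetical recoding
responses = [0, 1, 2, 3, 2, 1]
recoded = [collapse_map[r] for r in responses]
```

With the ordered steps every category is the most probable response somewhere along the continuum. With the disordered steps category 2 is never the most probable response, which is the kind of pattern that can motivate collapsing it into a neighbor. After such a recoding, item difficulties would need to be re-estimated on the collapsed scale, as noted above.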
Finally, we note that we did not use item parceling for differential item function analyses, but instead only for our final confirmatory factor analyses (CFA). We stand by the recommendations of Floyd and Widaman (1995), who warned against using individual items as indicators of the latent construct in CFA, especially for subscales with few items. They point out that the structure of subscales with as few as five to eight items can be quite difficult to confirm using one-item indicators. This is because pairs of items are likely to have correlated errors due to idiosyncratic wording. IRT can be used to address a frequent concern about item parcels in CFA, namely masking multidimensionality.
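For readers unfamiliar with parceling, the basic operation can be sketched as follows. This is a minimal illustration that assumes a simple round-robin assignment of items to parcels and mean scoring; the assignment scheme, item names, and scores are hypothetical and are not a description of how the parcels in our CFA were formed.

```python
from statistics import mean

def make_parcels(item_names, n_parcels):
    """Distribute items across parcels in round-robin order.

    Round-robin is only one possible assignment scheme (an assumption here).
    """
    parcels = [[] for _ in range(n_parcels)]
    for i, name in enumerate(item_names):
        parcels[i % n_parcels].append(name)
    return parcels

def parcel_scores(responses, parcels):
    """Mean item score within each parcel for one respondent.

    responses: dict mapping item name to that respondent's score.
    """
    return [mean(responses[name] for name in parcel) for parcel in parcels]

# Hypothetical six-item subscale split into three two-item parcels.
items = ["i1", "i2", "i3", "i4", "i5", "i6"]
parcels = make_parcels(items, 3)

# Hypothetical responses from one participant.
respondent = {"i1": 2, "i2": 5, "i3": 4, "i4": 4, "i5": 3, "i6": 6}
scores = parcel_scores(respondent, parcels)
```

Each parcel score then serves as one indicator of the latent construct in place of the individual items, which averages over wording idiosyncrasies that would otherwise appear as correlated errors between item pairs.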
Item Difficulty, Bandwidth, and Scale Sensitivity
We appreciate the attention Whittaker and Worthington (2016) draw to the lack of clarity in our use of the term “differentiate” in connection with individual test items, especially given that the Rasch model does not provide estimates of an individual item’s capacity to discriminate. (This is the parameter excluded in moving from 2-pl to 1-pl models like the Rasch.) To clarify, our intention was to highlight how a careful selection of items at specific levels of difficulty enhances test information within a specific range of the continuum of person-ability. Enhanced test information increases the precision of estimation and, thus, the capacity of the subscale as a whole to better differentiate between individuals who are close in ability. However, unlike the CTT assumption of equal sensitivity throughout the entire range of test scores, IRT assumes varying levels of sensitivity at different points of the continuum that correspond to the test information curve. Careful selection of items allows researchers to flatten that curve if their goal is an equi-precise scale, or heighten its peak to concentrate information value (i.e., sensitivity) near a predetermined cutoff score (Embretson & Reise, 2000). Greater information value in a particular range of the continuum increases the capacity of the subscale as a whole (not an individual item) to differentiate between individuals closely spaced in ability within this zone of heightened sensitivity.
It is not widely appreciated that when discarding items to shorten a scale using only the decision criteria of item-factor loadings derived from CTT, researchers can alter the information curve in unintended ways. For example, Mallinckrodt and Tekie (2015) compared two different brief versions of the WAI. They showed through IRT analyses of information curves that Tracey and Kokotovic’s (1989) four-item version of the full 12-item Tasks subscale is more sensitive in estimating the lower range of Tasks scores, whereas Hatcher and Gillaspy’s (2006) four-item version is more sensitive in the higher ranges of the original WAI because Tracey and Kokotovic inadvertently selected less difficult items than did Hatcher and Gillaspy. Mallinckrodt and Tekie created the Brief Alliance Inventory by selecting 16 items that preserved as much of the original bandwidth and information value of the 36-item WAI as possible.
Multicultural Considerations
Wang et al. (2016) raised the important concern that “everyday multicultural competencies are not an end product that can be measured using a brief scale.” We appreciate the quote from Desmond Tutu about “little bits of good” creating social change because it reflects our own views on social justice. As Bell (2010) suggested, we believe “that social justice is both a process and a goal” (p. 21) and agree with Wang et al. that multicultural competence requires a “lifetime commitment.” Wang et al. pointed out that the question prompt for the focus groups used in developing the initial item pool for the EMC/RSEE, which asked about the forms of knowledge, skills, attitudes, and awareness that a student needs to “function effectively” “in a productive career” (Mallinckrodt et al., 2014, p. 135), further limits the scope of the everyday multicultural competencies assessed. However, we believe it is important that educators and administrators have means for assessing multicultural interventions across institutions and studies to ensure these interventions are effective. In developing the EMC/RSEE, we hoped to sample the content universe. We believe that these competencies are important, foundational competencies (e.g., awareness of contemporary racism and privilege, cultural openness, and desire to learn). In developing these competencies, we hope that the stage is set for a “lifetime commitment” to further developing multicultural competence and working toward social justice.
Part of the concern of Wang et al. (2016) arises from the fact that the EMC/RSEE is limited in scope to diversity related to race and ethnicity, and that diversity programming on college campuses often focuses on other forms of culture (e.g., gender, sexual orientation) and associated forms of oppression or “–isms” (e.g., sexism, heterosexism, respectively). This is true on our campus as well, and so we see the limitation of the EMC/RSEE as a measure focused on multicultural competencies related to race and ethnicity. This, too, strikes us as an apt example of “the best is the enemy of the good [enough].” We believe that, while we wait for the development of an even more comprehensive and inclusive measure, the EMC/RSEE with its limitations is a positive step forward (and away from the use of idiosyncratic scales for each unique study). The EMC/RSEE, like the Scale of Ethnocultural Empathy (SEE) from which it was derived (Wang et al., 2003), is not intended to be comprehensive or prescriptive. We consider the EMC/RSEE a first step (or, more accurately, a second step, given that the original SEE was an excellent first step). We hope other researchers carry the work forward to identify other competencies.
We appreciate the thoughtful responses of Harvey (2016), Wang et al. (2016), and Whittaker and Worthington (2016). We thank them for advancing this dialogue about the use of focus groups and IRT in instrument development and the assessment of multicultural competencies. We hope the dialogue will continue and be joined by others.
Footnotes
Declaration of Conflicting Interests
The authors declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.
Funding
The authors received no financial support for the research, authorship, and/or publication of this article.
