Abstract
We review contemporary best practice for developing and validating measures of constructs in the organizational sciences. The three basic steps in scale development are: (a) construct definition, (b) choosing operationalizations that match the construct definition, and (c) obtaining empirical evidence to confirm construct validity. While summarizing this 3-step process [i.e., Define-Operationalize-Confirm], we address many issues in establishing construct validity and provide a checklist for journal reviewers and authors when evaluating the validity of measures used in organizational research. Among other points, we pay special attention to construct conceptualization, acknowledging existing constructs, improving existing measures, multidimensional constructs, macro-level constructs, and the need for independent samples to confirm construct validity and measurement equivalence across subpopulations.
The accuracy of tests of relationships between constructs rests on the foundation of sound construct development and measurement (Edwards, 2003; Nunnally & Bernstein, 1994; Schwab, 1980). Without evidence that measures represent their intended constructs, researchers run the risk that tests of theoretical relationships are biased, misleading, or simply wrong. We articulate contemporary best practices for selecting, developing, revising, and validating measures of constructs in the organizational sciences.
The activities involved in scale development can be organized into three general steps: (1) construct conceptualization, (2) operationalizing the construct, and (3) assessing evidence to confirm construct validity. These three steps capture the essence of approaches articulated in previous reviews that build upon the seminal work of Loevinger (Clark & Watson, 1995; 2019; DeVellis, 2003; Hinkin, 1998; Loevinger, 1957; MacKenzie, Podsakoff, & Podsakoff, 2011). These sets of authors, drawing from a common base of knowledge and practice, have described similar approaches to construct development. The three steps in construct development also apply (in an abbreviated way) when revising measures and when using existing measures. While summarizing this three-step process, we pay special attention to construct conceptualization and mapping operationalizations of constructs to their definitions. We also address issues of establishing construct validity for micro-level, macro-level, multilevel, and multidimensional constructs; measurement equivalence across time and subpopulations; and the need for independent samples to confirm construct validity.
The literature on construct development is large with many brilliant insights and considerable practical advice, and we direct readers to delve into the papers we cite instead of relying solely on this paper. We also acknowledge that researchers differ in their opinions and stances on a number of issues related to construct development. These divergences of opinion may be rooted in philosophical positions, or custom, but we urge researchers to develop their own principled stand as they specify their measurement models, collect and assemble empirical evidence to test their measurement theory, and present it to academic peers for inspection – just as we do for theories of relationships between constructs.
How researchers practice construct definition and operationalization, and then confirm validity, creates a foundation for accumulating knowledge in organizational science. Construct validity—i.e., the extent to which we are measuring what we believe we are measuring—is a sine qua non of organizational research, and knowledge of methods for establishing construct validity is therefore indispensable. Our recommendations are also presented in the form of a checklist that embodies the principles we discuss, which we hope succinctly captures the state of the art for reviewers, editors, and authors. Before we begin, it is helpful to direct the reader to the glossary of construct validity terms in the appendix.
Constructs are conceptual phenomena that facilitate our understanding of the world and how it operates. Thus, the nature of constructs varies substantially not only across disciplines (e.g., organizational behavior, strategic management) but from construct to construct. Constructs may differ considerably in the logical arguments that justify their conceptualization, in how they are manifested, and in the evidence that supports their plausibility. Despite the variety of constructs, there are common principles underlying the three steps in conceptualizing, operationalizing, and using evidence to confirm their validity.
Although the steps of conceptualization, operationalization, and confirmatory evidence are presented as a sequence, in practice these steps may not be followed in a strictly sequential fashion and are often iterative. For example, researchers may use available empirical evidence to clarify a definition and to revise items/indicators in the measurement model. Nor should the steps and procedures we describe be slavishly followed, because constructs may require deviations from our specific recommendations in order to adhere to the principles we espouse. Following a checklist is not prima facie evidence that a construct is well measured. We also remind readers that construct validity requires a unitary approach to establishing that the scores for a construct are valid for the intended purpose (e.g., theoretical research, decision making) rather than ticking off a list of types of validity. Whereas it may be useful to discuss evidence regarding content, construct, and criterion validity as these represent facets of validity, confirming construct validity requires a holistic assessment of the assembled evidence (American Educational Research Association, 2014; Landy, 1986). The task is to construct a logical, theoretical argument for a construct, articulate the relationships between the construct and its measures, and obtain empirical evidence to support the plausibility of the hypothesized measurement model. The result should be a theoretical and empirical case for the measurement model that is convincing to a skeptical academic audience.
Step 1: Construct Conceptualization and Definition
A construct is an abstraction that helps us make sense of our environment and is a useful aid to developing theories about relationships. Only by naming these abstractions as constructs (e.g., job satisfaction, organizational performance) can we theorize about relationships between them. We take the position that the phenomenon underlying the construct is real, even if our definition and understanding of it are flawed (Cronbach & Meehl, 1955; Messick, 1981). Construct definitions should correspond to the underlying phenomenon, and should distinguish not only what the construct is, but also what it isn’t. Definitions should also clarify how a construct is different from, and similar to, other constructs. Constructs may be defined narrowly or broadly. For example, overall conscientiousness is a broad construct, whereas industriousness is a narrow construct or facet of conscientiousness. Likewise, corporate visibility is a slice of the wider construct of corporate reputation.
Review the Literature Thoroughly
The first step in conceptualizing a construct should be to review the relevant research literature to see if the construct is in use, perhaps under another name, or if multiple constructs with the same name have different definitions. Organizational science is vast and diverse with hundreds of existing constructs, and we recommend that researchers consider the extent to which their target construct is redundant with, or distinct from, the well-known and impactful construct domains in the field. When introducing a new construct, authors should take care to acknowledge whether the alleged new construct might be a relabeling or recombination of content sampled from other domains.
The enterprise of academic research presents powerful incentives for researchers to ignore existing constructs and to pretend that relabeled or reshuffled constructs are novel. Kelley’s (1927) jangle fallacy occurs when two different construct names are used for the same phenomenon (or two labels for the same construct). As evidence of the jangle fallacy in strategy research, three measures (i.e., R&D intensity, patent counts, patent citations) have been used to assess a diverse array of constructs and construct labels (Ketchen, Ireland, & Baker, 2013), using multiple construct names for the same phenomenon (e.g., using patent counts to measure three distinct constructs: innovative productivity, knowledge stock, and technology expertise). Further, recent examples of highly-cited constructs and construct labels that have essentially ignored or downplayed their redundancy with pre-existing constructs can be found in the areas of grit (grit is correlated, corrected r = .84, with conscientiousness; Credé, Tynan, & Harms, 2017) and work engagement (work engagement has high content overlap and is correlated, corrected r = .77, with a combination of job involvement, job satisfaction, and organizational commitment; Newman, Joseph, & Hulin, 2010). To address the jangle fallacy, Newman, Harrison, Carpenter, and Rariden (2016) surveyed the management literature and the editorial board of the Academy of Management Journal to enumerate seven cardinal construct domains in the field of OB/HR (i.e., general mental ability, core self-evaluations, overall job attitude, social exchange quality, behavioral work engagement, job complexity, and leader individualized consideration). They noted that these seven established construct domains have often been resampled and relabeled, and they recommended that future authors explicitly acknowledge these existing constructs rather than relabeling them.
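To illustrate how a "corrected" correlation of the kind reported above can signal redundancy, the following minimal sketch applies the classical correction for attenuation, r_corrected = r_observed / sqrt(r_xx * r_yy). It assumes that the correction in question is for unreliability in both measures; the function name and the input values are hypothetical and are not taken from the cited meta-analyses.

```python
import math


def disattenuate(r_observed: float, rel_x: float, rel_y: float) -> float:
    """Correct an observed correlation for unreliability in both measures.

    rel_x and rel_y are reliability estimates (e.g., coefficient alpha).
    """
    if not (0 < rel_x <= 1 and 0 < rel_y <= 1):
        raise ValueError("reliabilities must lie in (0, 1]")
    return r_observed / math.sqrt(rel_x * rel_y)


# Hypothetical inputs: an observed r of .60 between two scales,
# each with reliability .80.
r_corrected = disattenuate(r_observed=0.60, rel_x=0.80, rel_y=0.80)
print(round(r_corrected, 2))  # 0.75
```

A corrected correlation approaching 1.0 suggests that two labels may refer to largely the same construct, although the threshold at which overlap becomes redundancy remains a matter of judgment.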
Next, Kelley’s (1927) jingle fallacy occurs when two different constructs are given the same name. For instance, the construct name ‘strategic consensus’ has been defined and measured in distinctly varied ways (Kellermanns, Walter, Lechner, & Floyd, 2005); and the construct name ‘emotional intelligence’ has also been used to refer to many different constructs (Mayer, Roberts, & Barsade, 2008). The problems of construct proliferation, empirical redundancy (Le, Schmidt, Harter, & Lauver, 2010; Schwab, 1980), relabeling of existing constructs (i.e., the jangle fallacy/false differentiation; Kelley, 1927; Cardinal, Sitkin, & Long, 2010), and ignoring distinctions among constructs that are different (jingle fallacy) all contribute to confusion, lack of parsimony, and inefficiency in scientific research because of inadequate construct conceptualization.
Unfortunately, published research tends to minimize, or draw attention away from, measurement problems. Close inspection may reveal a variety of difficulties, including: inconsistencies in definitions, a lack of correspondence between definitions and operationalizations, weak discrimination from related constructs, troublesome items/indicators, or lack of substantial relationships when there is strong theory predicting such relationships. For example, in the area of leadership, items have been confounded with outcomes, supposedly distinct types of leadership have conceptual overlap, and similar items are used to measure different types of leadership (Shaffer, DeGeest, & Li, 2016; Van Knippenberg & Sitkin, 2013).
Traditional advice recommends using established measures whenever possible, yet problems with constructs and their operationalizations introduce bias into estimates of relationships between constructs. For example, the organizational commitment questionnaire (Mowday, Steers, & Porter, 1979) is a well-established measure, but has been criticized for being contaminated with items measuring turnover intentions, leading to upward bias in the correlation between organizational commitment and turnover (Bozeman & Perrewé, 2001). This raises the conceptual issue of whether turnover intentions should become part of the construct definition of organizational commitment (Klein, Molloy, & Brinsfield, 2012). When the literature review reveals inadequacies, ambiguities, or conflicts surrounding construct definitions, these issues must be resolved before proceeding. We argue that perpetuating poor measurement, in the face of documented weaknesses, does not advance science. Instead, we urge scholars to contribute toward building a solid foundation for generating knowledge by taking advantage of the opportunity to revise existing measures or develop new measures. In some instances, it may be necessary to define and develop a new construct. For example, the construct of firm risk taking has been conceived of as research and development (R&D) spending, and also as R&D intensity (spending to sales); but Bromiley, Rau and Zhang (2017) develop conceptual arguments for distinguishing between spending and intensity and offer supporting empirical evidence of the distinction. In another example, Brady, Brown and Liang (2017) expanded the concept of workplace gossip away from a narrow deviance perspective, to also include positive (and potentially prosocial) evaluative talk about another person who is not present.
Fortunate researchers will find that their focal construct is already adequately defined in past work. In many cases, prior definitions, accompanied by empirical validity evidence, may be sufficient to indicate that researchers can proceed with already-published scales or operationalizations. However, the mere existence of published definitions and scales is not sufficient to preclude the need for local construct validity evidence.
Conceptualizing a construct may lead to one of three decisions: adopting an existing construct and measurement, revising an existing construct and/or measurement, or developing a new construct and measurement. Our recommendations vary depending on the decision to adopt, revise, or newly develop a construct and measure. Regardless of this choice, the construct definition is arguably the central element of a construct conceptualization.
Formally Define the Construct: Characteristics of Good Construct Definitions
In a monumental work on developing construct definitions, Podsakoff, MacKenzie, and Podsakoff (2016) provide a list of issues reviewers and authors should consider when evaluating a construct definition. To our understanding, good definitions (a) clarify the type of property (e.g., feelings, perceptions, beliefs, behavior, or performance metrics) the construct represents, (b) clarify the entity/level of analysis (e.g., individual, group, organization, task, or event) to which the property applies, (c) note the construct's essential and unique attributes, or the attributes shared by cases of the concept, and (d) specify the dimensionality of the construct. In addition to the formal construct definition, construct conceptualization also involves detailing how the construct relates to existing constructs. It is also important to avoid circularity in the definition, as one should not embed antecedents or consequences in the definition (e.g., the construct work withdrawal should not be defined as a response to a dissatisfying job situation; rather the construct should be defined independently, and its antecedents should be empirically studied, rather than assumed and embedded in the construct definition itself). Constructs may be defined narrowly or broadly; narrow definitions are useful for fine-grained constructs (e.g., satisfaction with coworkers), whereas broad constructs (e.g., overall job satisfaction) may be more useful for theorizing at a more abstract or general level. The definition also should indicate whether the construct is stable or variable (e.g., over time, culture, organizational membership). The construct should be defined across its full expected range (Tay & Jebb, 2018). That is, is the low end of a construct defined as an absence of the construct, or as its negative? For example, work engagement has sometimes been defined as the opposite of burnout, whereas positive affectivity and negative affectivity are two distinct constructs and not opposites of each other; as such, positive affectivity should be defined while specifying that its low end is not the same as negative affectivity.
Articulating the Nomological Net of the Target Construct
Constructs exist in a network of related constructs: antecedent causes, outcomes/criteria, and other variables that may be related to the target because they share a common cause. Theoretically predicted relationships between the target construct and other constructs clarify the nature of the target construct, and when tested serve to offer evidence of nomological validity. Literature review will indicate whether the nomological network was well specified and tested for existing constructs, but we recommend that researchers undertake this task anew for substantially revised and original construct measures. To increase the value of this step, we encourage researchers to be more precise in expressing both the direction (positive or negative) and the strength (weak, moderate, strong; or range) of relationships among constructs in the nomological network (Edwards & Berry, 2010). To elaborate, predicting that the relationship between a target construct and another construct will not only be positive and significant, but further that it will fall within a specified range, or that it will be of greater magnitude than the relationship with a different construct, will facilitate understanding of the target construct itself. For example, Shipp and colleagues specified both the direction and effect size magnitude for predicted relationships between their measure of temporal focus and other related constructs (Shipp, Edwards, & Lambert, 2009).
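One way to make a range prediction testable is to compare a confidence interval for the observed correlation against the predicted range. The sketch below uses the standard Fisher z approximation; the function names, the decision rule (the whole interval must fall inside the predicted range), and the numbers are illustrative assumptions rather than a procedure prescribed by the sources cited above.

```python
import math
from typing import Tuple


def correlation_ci(r: float, n: int, z_crit: float = 1.96) -> Tuple[float, float]:
    """Approximate confidence interval for a correlation (Fisher z method)."""
    if n <= 3:
        raise ValueError("n must exceed 3")
    z = math.atanh(r)              # Fisher r-to-z transformation
    se = 1.0 / math.sqrt(n - 3)    # standard error of z
    return math.tanh(z - z_crit * se), math.tanh(z + z_crit * se)


def consistent_with_range(r: float, n: int, low: float, high: float) -> bool:
    """True if the entire interval lies within the predicted range [low, high]."""
    ci_low, ci_high = correlation_ci(r, n)
    return low <= ci_low and ci_high <= high


# Hypothetical prediction: a moderate positive relationship between .20 and .50.
print(consistent_with_range(r=0.35, n=403, low=0.20, high=0.50))
```

A stricter or looser rule (for example, requiring only the point estimate to fall in the range) may suit some research questions better; the point is that the prediction is stated precisely enough to be checked.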
Documenting the Process and Evidence
The process for developing a revised or new definition must be adequately described. Researchers should identify the literature that was searched, include the steps used to revise and refine the attributes and definition of the construct (e.g., consulting dictionaries), and describe the role of subject matter experts, practitioner experts, or focus groups used to clarify matters central to the definition (Podsakoff et al., 2016). For example, O’Neill and Rothbard (2017) relied on extensive interviews with firefighters to develop the constructs of companionate love and joviality among coworkers.
When the choice is to adopt an existing construct, the definition must be reported with appropriate citations to critical past work on the construct. When researchers choose to revise a construct or to develop a new construct, they must also fully describe and document the conceptualization. Keep in mind that the definition of the construct comes before, and may be independent of, decisions on how the construct will be measured. Figure 1 lists the steps associated with conceptualizing a construct. Table 1 is a checklist, in which we distinguish between information necessary when using an existing construct definition, vs. choosing to revise or develop a new construct.

Figure 1. The general steps of construct development.
Table 1. Checklist for Construct Development and Validation.
Step 2: Operationalizing the Construct
Constructs are not directly observable. To infer the presence or degree/amount of a construct, we rely on signals of the construct as expressed in its items/indicators (e.g., items in a survey, responses to questions in an interview, accounting numbers associated with a firm's activity). The relationships between a construct definition and its items/indicator(s) constitute a measurement theory, and it is up to researchers to articulate the theoretical logic linking constructs to the indicators of the construct. The operationalization of a construct must represent the definition of the construct (i.e., content validity). Also, the relationship between a construct and its items/indicator(s) must be theoretically described, because this specification will ultimately guide the development and selection of measures, as well as the process and standards for later confirming construct validity.
The source of the data does not define the relationship between the item/indicator and the construct. Indicators of a construct may come from self-reports, scores on word puzzles, others’ reports of a target, or informed respondents (e.g., CFO or HR giving organization-level information). Counts and ratings of events, including behavioral observations in situ from videos, may be appropriate. Sources of archival information can include financial records, observable characteristics of groups or organizations, scores from content coding of emails or public speeches, web page material, and transcripts of presentations to analysts, among many others.
When a review of relevant research (Step 1: Define) reveals that a construct has existing measure(s) that have been theoretically justified and explicitly articulated in a measurement model, and that the measurement model has been empirically confirmed, it makes sense to adopt an existing measure. Using an existing measure facilitates comparing results across studies and thus enables the creation of new knowledge that can be integrated with past research. Revising an existing operationalization may be necessary when evidence indicates prior problems or inconsistencies (e.g., item content does not match definition, construct deficiency/measures do not capture all aspects of the construct, construct contamination/measures capture surplus content that is not part of the construct definition). Operationalizing the construct entails specifying the measurement model.
Specify Measurement Models
We believe that construct definitions (Step 1) and operationalizations (Step 2) can be more easily confirmed via construct validity methods (i.e., see Step 3 below) if the researcher begins with the outcome in mind, and then thinks backward from that outcome. To this end, we note there are several types of relationships between indicators and constructs, which can be specified in various types of measurement models (for examples see Figure 2).

Figure 2. Examples of Measurement Models.
Figure 2a shows a measurement model that is unidimensional, with one construct and multiple measures/items/operationalizations/indicators. For example, the construct might be job satisfaction, and the indicators might be survey items from the Brayfield and Rothe (1951) overall job satisfaction scale. Job satisfaction would be considered a latent construct (not directly observed, but rather inferred from each person's scores on the measures/observable items/manifest indicators). The latent job satisfaction construct causes or gives rise to the observed scores on the items/indicators, which is why the measurement model is drawn with arrows pointing from the construct to its items/indicators. In this construct validity paradigm, the survey items are written with the goal of quantifying each individual's standing on the latent concept of job satisfaction. Another example is the unidimensional measurement model for the construct of board of directors control (Boyd, 1994; Boyd, Gove, & Hitt, 2005), which is measured by five indicators: % of stock owned by board, number of directors representing ownership groups, proportion of insiders on the board, director pay and CEO duality (negative indicators).
Next, Figure 2b shows a measurement model with more than one construct (i.e., oblique 3-factor model). The three constructs (also called factors in factor analysis) are labeled A, B, and C; and each of these constructs/factors is reflected with its own unique set of measures. The model is called oblique, which means the constructs (or factors) are correlated with each other (there is no theoretical constraint requiring the constructs to be uncorrelated, or orthogonal). Note the correlations between constructs (ϕ, called factor correlations), and the relationships between each indicator and its corresponding construct/factor (λ, called factor loadings). Examples of Figure 2b might include the measurement model for job satisfaction, organizational commitment, and job involvement (which are 3 correlated constructs, but are measured via different items/indicators; Mathieu & Farr, 1991).
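The roles of λ and ϕ in an oblique model can be seen in the correlations the model implies among indicators. The sketch below assumes standardized indicators and factors, no cross-loadings, and uncorrelated unique variances; under those assumptions the implied correlation between two indicators is λ_i × ϕ_ab × λ_j, where ϕ_ab = 1 when both indicators belong to the same factor. All loadings and factor correlations shown are hypothetical.

```python
from typing import List, Sequence


def implied_correlation(
    loadings: Sequence[float],
    factor_of: Sequence[int],
    phi: List[List[float]],
    i: int,
    j: int,
) -> float:
    """Model-implied correlation between standardized indicators i and j."""
    if i == j:
        return 1.0
    a, b = factor_of[i], factor_of[j]
    # phi[a][b] is 1.0 on the diagonal (same factor) and the factor
    # correlation off the diagonal (different factors).
    return loadings[i] * phi[a][b] * loadings[j]


# Hypothetical 3-factor model (A, B, C) with three indicators per factor.
loadings = [0.8, 0.7, 0.6, 0.8, 0.7, 0.6, 0.8, 0.7, 0.6]
factor_of = [0, 0, 0, 1, 1, 1, 2, 2, 2]
phi = [
    [1.0, 0.5, 0.3],
    [0.5, 1.0, 0.4],
    [0.3, 0.4, 1.0],
]

same_factor = implied_correlation(loadings, factor_of, phi, 0, 1)   # 0.8 * 0.7
cross_factor = implied_correlation(loadings, factor_of, phi, 0, 3)  # 0.8 * 0.5 * 0.8
print(round(same_factor, 2), round(cross_factor, 2))  # 0.56 0.32
```

Indicators of the same construct are implied to correlate more strongly with one another than with indicators of a different construct, which is the pattern later confirmatory analyses look for.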
Another measurement model is shown in Figure 2c (hierarchical model): (a) there are 3 constructs/factors (A, B, and C), (b) each construct/factor has its own unique items/indicators, but (c) the pattern of relationships among the 3 constructs/factors is modeled as a more general or abstract higher-order factor. One example of a higher-order construct/factor is general mental ability (e.g., Spearman's g), which is reflected by lower-order factors such as numerical ability, verbal ability, and spatial ability (Ackerman, Beier, & Boyle, 2005), each of which has its own indicators/items/operations.
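In a hierarchical model, the correlation between two lower-order factors is implied to equal the product of their standardized loadings on the higher-order factor. With exactly three lower-order factors, those loadings can be recovered algebraically from the three factor correlations, as the sketch below shows. It assumes standardized factors and positive correlations; the function name and input values are hypothetical.

```python
import math
from typing import Tuple


def higher_order_loadings(
    phi_ab: float, phi_ac: float, phi_bc: float
) -> Tuple[float, float, float]:
    """Recover standardized higher-order loadings for factors A, B, and C.

    Uses phi_ab = g_a * g_b, phi_ac = g_a * g_c, phi_bc = g_b * g_c.
    """
    if min(phi_ab, phi_ac, phi_bc) <= 0:
        raise ValueError("this sketch assumes positive factor correlations")
    g_a = math.sqrt(phi_ab * phi_ac / phi_bc)
    g_b = math.sqrt(phi_ab * phi_bc / phi_ac)
    g_c = math.sqrt(phi_ac * phi_bc / phi_ab)
    return g_a, g_b, g_c


# Hypothetical factor correlations among three lower-order factors.
g_a, g_b, g_c = higher_order_loadings(phi_ab=0.56, phi_ac=0.48, phi_bc=0.42)
print(round(g_a, 2), round(g_b, 2), round(g_c, 2))  # 0.8 0.7 0.6
```

Because three loadings reproduce three correlations exactly, a higher-order model with only three lower-order factors fits the same as the corresponding oblique model; a recovered loading above 1.0 would indicate that a single higher-order factor is not a plausible account of the correlations.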
Generate Items or Indicators
After the construct is defined and its conceptual properties and distinctiveness from related constructs are clarified, it is time to begin choosing items/indicators. One principle of measurement models is that the content of the items/indicators should correspond to the content of the construct. The process of selecting or creating particular items or measures from a universe of possible indicators, in order to represent a particular hypothetical construct domain, is called domain sampling (Nunnally, 1970).
Domain sampling can follow either a global approach or a facet-composite approach. The global approach requires that each of the indicators, or items, fully reflects the content of the construct at the level of abstraction that is used to define the construct. Continuing the job satisfaction example, global survey items would reference the idea of overall satisfaction with the job in its entirety (Ironson, Smith, Brannick, Gibson, & Paul, 1989). In contrast, the facet-composite approach uses facet domain sampling by identifying important facets of a broad, higher-order construct. Items/indicators are chosen to reflect each facet-level construct, theoretically reasoning that the facet constructs themselves are specific reflections of the higher-order construct. Regardless of whether global or facet domain sampling is used, Clark and Watson (1995; 2019) recommend oversampling the content domain to include both items/indicators directly assessing the target construct and items/indicators tangentially related to the target construct, to enable distinctions to be drawn in later analyses.
When choosing a certain number of items/indicators to measure a construct, researchers can face a bandwidth-fidelity dilemma (Cronbach, 1960; Ones, Viswesvaran, & Reiss, 1996). For example, if we are constrained to use only 3 items on a survey measure, then we must make the choice between measuring a narrow facet construct (e.g., industriousness) reliably (by writing 3 very-similar items), versus measuring a broad construct (e.g., conscientiousness) unreliably (by using one item to measure each facet; industriousness, attention to detail, responsibility). If a researcher claims to be measuring a broad Big Five personality trait with only 2 or 3 items, then we know that researcher has chosen to either: (a) only measure one facet of the trait, reliably, or (b) measure the broad trait, unreliably. In order to measure a broad trait reliably, one could simply use more than 2 or 3 items.
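The dilemma can be made concrete with the standardized form of Cronbach's α, α = k × r̄ / (1 + (k − 1) × r̄), where k is the number of items and r̄ is the average inter-item correlation. The sketch below assumes that near-synonymous items from one facet correlate more highly with each other than items drawn from different facets; the specific r̄ values are hypothetical.

```python
def standardized_alpha(k: int, mean_r: float) -> float:
    """Standardized coefficient alpha for k items with mean inter-item correlation mean_r."""
    if k < 2:
        raise ValueError("alpha requires at least two items")
    return k * mean_r / (1 + (k - 1) * mean_r)


# Three very similar items from one narrow facet (hypothetical mean r = .60).
narrow = standardized_alpha(k=3, mean_r=0.60)
# Three items, one per facet of a broad trait (hypothetical mean r = .25).
broad_short = standardized_alpha(k=3, mean_r=0.25)
# The same broad trait measured with ten items at the same mean r.
broad_long = standardized_alpha(k=10, mean_r=0.25)

print(round(narrow, 2), round(broad_short, 2), round(broad_long, 2))  # 0.82 0.5 0.77
```

With these assumed values, three items measure the narrow facet reliably but the broad trait poorly, and lengthening the broad scale restores acceptable reliability.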
Adherents of the global domain sampling approach argue that the hierarchical facet domain sampling approach is problematic. First, if you measure a broad construct (job satisfaction) by oversampling one part of the content domain (satisfaction with working conditions) and under-sampling another part of the content domain (satisfaction with compensation), then your unbalanced domain sampling can produce over- or underrepresentation of particular lower-order constructs in the resulting measurement instrument. Second, it is difficult to theoretically or empirically determine that the facet sampling approach has captured the “right” facets (Scarpello & Campbell, 1983). In contrast, those favoring facet domain sampling argue that the global approach is not applicable for some constructs, because it is difficult to sample some constructs at the proper level of abstraction (e.g., effective items/indicators exist for the facet constructs of verbal intelligence and spatial intelligence, but it is difficult to find good indicators for the higher order construct of general intelligence).
We, the authors, differ in our views on the utility and appropriateness of the global and facet approaches—specifically, Lambert believes that domain sampling should occur at the same level of abstraction or construct breadth where the analysis occurs, whereas Newman believes that facet domain sampling can support analyses at both narrow and broad levels of abstraction. We recognize the value in both perspectives, particularly as regards the untested assumption that broad constructs can be assessed by combining items/indicators designed to measure narrower constructs (is the whole meaningfully different from the sum of its parts?; cf. Chang, Ferris, Johnson, Rosen, & Tan, 2012; Scarpello & Campbell, 1983). We urge you to understand the distinctions between these different approaches and to choose based on your own reason and logic.
Items/indicators may be sampled from the content domain either deductively (items/indicators drawn directly from the construct definition, prior theory, and/or prior measurement; e.g., Colquitt, 2001) or inductively (perhaps by asking members of the target population or experts to generate examples of items/indicators; e.g., Bennett & Robinson, 2000). The deductive approach rests on theory and acknowledges the expertise of past scholars, and the inductive approach may incorporate the perspectives of knowledgeable stakeholders and members of the population being studied, facilitating the development of new theory (Hinkin, 1995). Both the deductive and inductive approaches to indicator generation are useful, and may be employed singly or in combination.
When developing items/indicators to measure a construct, how many are desirable? According to Hinkin’s (1998) rule-of-thumb, the initial data collection (prior to validation) should contain approximately 8 to 12 items/indicators per construct, and the final data collection should contain 4 to 6 items/indicators per construct. We suggest the absolute minimum number of items/indicators is three, because that number is required to algebraically identify a measurement model when testing construct validity via structural equation modeling (SEM). In order to ensure adequate internal consistency reliability (e.g., Cronbach's α > .7), the number of items required would depend upon the average inter-item correlation.
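As a rough illustration of how the required number of items depends on the average inter-item correlation, the sketch below searches for the smallest k at which standardized α, computed as k × r̄ / (1 + (k − 1) × r̄), reaches a target value. The use of the standardized formula, the "at least .70" target, and the function name are assumptions made for illustration; they are not a formula quoted from the text.

```python
def min_items_for_alpha(mean_r: float, target: float = 0.70, max_items: int = 1000) -> int:
    """Smallest number of items whose standardized alpha is at least `target`."""
    if not (0 < mean_r < 1):
        raise ValueError("mean_r must lie strictly between 0 and 1")
    if not (0 < target < 1):
        raise ValueError("target must lie strictly between 0 and 1")
    for k in range(2, max_items + 1):
        alpha = k * mean_r / (1 + (k - 1) * mean_r)
        if alpha >= target:
            return k
    raise ValueError("target not reached within max_items")


# Hypothetical average inter-item correlations.
for mean_r in (0.50, 0.30, 0.20):
    print(mean_r, min_items_for_alpha(mean_r))
# 0.5 3
# 0.3 6
# 0.2 10
```

Under these assumptions, highly intercorrelated items reach the target with the three-item minimum, whereas weakly intercorrelated items need a scale longer than the 4 to 6 items suggested for a final measure.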
Response scales (e.g., with response options referring to the extent of agreement, to frequency, or amount) should be carefully selected to accurately map onto respondents’ ability to discriminate between response options (e.g., neither too few nor too many points in the scale), and in terms they understand (e.g., Americans generally are more familiar with Fahrenheit temperature than with Celsius, and in the range of 0°F to 100°F rather than to 300°F; Krosnick & Presser, 2010). Moreover, the meaning of the verbal anchors (i.e., response format) for response scales should align with the wording of the items/indicators, and should capture the full range of respondents’ intended answers (Tourangeau, Conrad, & Couper, 2013, p. 78). Results are somewhat mixed and depend on the question and the sample, but 5-point, 7-point, and 9-point (odd-numbered) scales are common and may be preferable; and reliability can be higher when the points are accompanied by verbal anchors rather than just labeling the endpoints (Alwin & Krosnick, 1991).
Assess Content Validity
We address content validity in this step, rather than step 3, because of its central role in choosing effective operationalizations of constructs. All measures, regardless of their type, should exhibit content validity (Aguinis & Vandenberg, 2014; Anderson & Gerbing, 1991; Schriesheim, Powers, Scandura, Gardiner, & Lankau, 1993). Because the relationships between items/indicators and constructs represent a measurement theory, researchers must present a theoretical rationale to justify the operationalization of the construct. The point of content validity analysis is to demonstrate that the indicators correspond to the construct definition.
Multiple approaches can be used to bolster the case for content validity of items/indicators used to measure a construct (see J. C. Anderson & Gerbing, 1991; Colquitt, Sabey, Rodell, & Hill, 2019; Hinkin & Tracey, 1999; Schriesheim et al., 1993). Briefly, the approaches described by these authors involve asking a sample of judges to either (a) classify each item/indicator as matching one construct definition more than it matches other construct definitions (Anderson & Gerbing, 1991; Colquitt et al. label this definitional distinctiveness), and/or (b) rate the degree of correspondence between each item and a set of various construct definitions (Hinkin & Tracey, 1999; Colquitt et al. label this definitional correspondence). Ideally, each item/indicator can be correctly classified as belonging to its intended construct definition, and/or can be rated to have a high degree of correspondence with its intended construct definition. For example, subject matter experts (e.g., faculty, doctoral students, advanced undergraduates, or members of the population being studied) can rate the extent to which draft items/indicators are consistent with definitions of a new construct (Wolfson, Tannenbaum, Mathieu, & Maynard, 2018). Colquitt et al. (2019) systematically tested and developed norming standards for evaluating the probability of correct item categorization, and for evaluating the magnitude of item-definition correspondence (norming standards for both of these depend upon correlations between the focal construct and related constructs). Another content validity approach, sometimes called cognitive interviewing or a “think aloud” technique, involves prompting respondents to report every thought that occurs to them as they respond to survey questions or other measures (Willis, 2005). For example, Grégoire et al. (2010) asked experienced entrepreneurs to report their thoughts as they completed an opportunity recognition exercise, supporting the logical argument for content validity. As a result of content validity analysis, items/indicators that do not correspond to their intended constructs may be deleted, prior to collecting data for confirmatory factor analysis.
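As a rough illustration of the classification approach, the sketch below computes two item-sorting indices associated with Anderson and Gerbing (1991): the proportion of substantive agreement (the share of judges who assign an item to its intended construct) and the substantive validity coefficient (intended assignments minus the largest number of assignments to any single other construct, divided by the number of judges). The judge data, construct labels, and function name are hypothetical and serve only to show the arithmetic.

```python
from collections import Counter


def sorting_indices(assignments, intended):
    """Return (p_sa, c_sv) for one item.

    assignments: construct labels chosen by each judge (hypothetical data).
    intended: the construct the item was written to measure.
    """
    n_judges = len(assignments)
    counts = Counter(assignments)
    n_correct = counts.get(intended, 0)
    # Largest number of assignments to any single *other* construct.
    n_other = max(
        (c for label, c in counts.items() if label != intended), default=0
    )
    p_sa = n_correct / n_judges              # proportion of substantive agreement
    c_sv = (n_correct - n_other) / n_judges  # substantive validity coefficient
    return p_sa, c_sv


# Ten hypothetical judges sort one item among competing construct definitions.
judges = ["commitment"] * 8 + ["satisfaction"] * 2
p_sa, c_sv = sorting_indices(judges, "commitment")
print(f"p_sa = {p_sa:.2f}, c_sv = {c_sv:.2f}")
```

Items with low values on either index would be candidates for revision or deletion before any CFA data are collected; what counts as "low" is a judgment that should draw on published norming standards rather than on this sketch.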
It is difficult to overstate the importance of content validity assessment. Miller et al. found that 66% of a sample of published papers lacked correspondence between the construct definition and the operationalization of organizational performance (e.g., organizational performance is defined as a broad latent construct, but often operationalized as a single dimension of performance; Miller, Washburn, & Glick, 2013). Likewise, individual employees’ job performance has also been operationalized in ways that diverge from construct conceptualizations (J. P. Campbell, Gasser, & Oswald, 1996). As both sets of authors point out, the lack of correspondence between measures and constructs renders the interpretation of results from a single study meaningless, and obstructs the cumulation of results across studies.
In our experience, measures that routinely suffer from weak confirmatory factor analysis (CFA) results have likely not been subjected to content validity assessment (either by the Anderson-Gerbing 1991 approach, or the Hinkin-Tracey 1999 approach); and such assessment greatly improves the chances that one's measurement model will exhibit good fit after data are collected. Further, content validity assessment can be used for scale revision with existing scales, to improve CFA results (Carpenter, Son, Harris, Alexander, & Horner, 2016).
Documenting the Process and Evidence
When selecting operationalizations to match a construct definition, the choices must be documented. The requirements for documenting an existing measure are relatively straightforward. Researchers should report: (a) representative example items/indicators (or the full scale if the scale is new), (b) notes about instructions to participants, (c) scoring guidelines (including standardization decisions), (d) the response scale [e.g., (1-not at all) to (5-a great deal)], (e) the measurement model to be tested (e.g., which items/indicators load onto which constructs, and which constructs are specified to be correlated; see Figure 2), and (f) relevant citations for the measure and any existing theory that supports the chosen operationalization.
The instructions to respondents, response scale, instructions for scoring and coding, description of the training for behavioral raters, required procedures (e.g., for web-scraping), and other materials integral to interpreting data accurately are all part of the measure, and must be reported (DeVellis, 2003; Tourangeau et al., 2013). Instructions, to both respondents and to researchers, can be theoretically important by creating context. For example, the same behavioral items/indicators may be used to measure followers’ perceptions of their leaders’ behavior today (Tepper et al., 2018), or on average (Judge & Piccolo, 2004); and the instructions are necessary to clarify which information was requested.
The theoretical rationale supporting changes to an existing measure should also be described (Heggestad et al., 2019). For example, when items/indicators originally referring to supervisors as the target are changed to refer to the organization as the target, this should be mentioned. However, when troublesome items/indicators are revised (e.g., wording changes), or a lengthy scale is trimmed, the content validity process should be repeated with the modified scale (including empirical evidence from an independent sample for why an item/indicator was dropped, the resulting part-whole correlation for the shortened scale, and new validity evidence to support any changes to item wording; Heggestad et al., 2019). This supports the correspondence between the construct definition and its modified operationalization/items/indicators.
Documenting the process for proposing a new measure requires a lengthier and more detailed description. This includes the theoretical rationale for the items/indicators, specifying the measurement model, and steps taken to document content validity.
Step 3: Evidence to Confirm Construct Validity
Once a construct has been defined (Step 1: Define) (see Pedhazur & Schmelkin, 1991; Podsakoff et al., 2016), and after measures/items/indicators have been selected and screened for their subjective content-based connection to the construct definition (Step 2: Operationalize) (Anderson & Gerbing, 1991; Colquitt et al., 2019; Hinkin & Tracey, 1999), the final step (Step 3: Confirm) is to begin the iterative process of confirming and refining the measurement model in a series of independent samples. Construct validity is not determined by pointing to a specific statistic, but is a plausible conclusion that is based on an array of evidence consistent with the proposed theoretical measurement model (Jackson, Gillaspy, & Purc-Stephenson, 2009; McDonald & Ho, 2002). Because construct validity is not a property of a scale, but rather a property of the specific application of the scale in a particular sample, evidence for construct validity must be examined anew for each sample (Messick, 1995; Nunnally & Bernstein, 1994). The specific type of evidence that may be persuasive varies depending on the definition of the construct, but often includes assessing the reliability of the measures, testing the measurement model with CFA, and assessing the nomological validity of the focal construct(s).
In our discussion of construct validity, we report common rules of thumb to orient readers to desirable standards; but rules of thumb are only coarse approximations of truth, can be easily misapplied, and might be useless in specific circumstances (Lance & Vandenberg, 2009). Rules of thumb should not be applied thoughtlessly, and are not hard and fast. As with any rule of thumb, it is the researcher's underlying logic and strength of argument that should be paramount, not the numerical rule per se.
Collect Data to Test Measurement Model
When collecting data to evaluate one's measurement model, the researcher should sample data from the population of interest (e.g., working adults, top management team members, customer service representatives, firms in a dynamic environment). Convenience samples (e.g., MBA students, MTurk workers/online panels/crowdsourced data, undergraduate students) might well represent one's population of interest. However, we advise researchers to use a diverse selection of convenience samples—that is, to avoid using three MTurk samples or three student samples only—and to attempt to validate constructs across different samples from the population.
Planning an adequate sample size is a complicated issue in CFA (Gerbing & Anderson, 1985; Jackson, 2003; MacCallum, Browne, & Sugawara, 1996; MacCallum, Lee, & Browne, 2010; Muthén & Muthén, 2002). In summary, it is important to secure an adequate sample size in order to maintain adequate statistical power for CFA hypothesis tests, as well as to limit convergence failures and error in parameter estimates (factor loadings, factor intercorrelations) and model fit indices (χ², CFI, TLI, RMSEA). Sample size is not the only important factor, as the quality of CFA outcomes is also enhanced by larger magnitudes of factor loadings, having multiple indicators per variable, and avoiding model misspecification. A general rule of thumb for CFA sample size might not make sense, but without one researchers may push the boundaries by using tiny samples. The closest we can find to an empirically-grounded rule of thumb is Jackson's (2001) result, showing sample sizes of N = 200-400 produced better CFA results than samples of N = 100, but that there are diminishing returns after N = 400 (as summarized by Jackson, 2007). Sample sizes smaller than 200 might be adequate when factor loadings are large (e.g., greater than .7) or when there are many items/indicators per factor. The rule of thumb we advocate (N ≈ 200 or greater) is the same as Hoelter’s (1983) tentative suggestion that sample size should exceed N = 200 per group, to “indicate that a particular model adequately reproduces an observed covariance structure” (p. 331).
Further, the population from which one samples might influence item/indicator variance, item/indicator means/base rates, normality, or whether the item/indicator makes sense. Prior to conducting CFA, it is important to inspect and report item/indicator distributions to ensure the items/indicators exhibit adequate item variance and are not extremely skewed [e.g., Bennett & Robinson, 2000, removed items with standard deviations < 1.2 (on a 1-to-7 scale) from their workplace deviance measure]. In the long run, the measurement model can ultimately be tested across different populations using item-level meta-analysis (Carpenter et al., 2016).
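The item screening described above can be sketched in a few lines. The example below computes the mean, standard deviation, and a simple moment-based skewness for each item and flags items with little variance. The responses are invented, and the SD floor of 1.2 simply borrows the Bennett and Robinson (2000) example for a 1-to-7 scale; it is an illustrative assumption, not a general standard.

```python
import statistics


def describe_item(responses):
    """Return (mean, sample SD, skewness) for one item's responses."""
    n = len(responses)
    mean = statistics.fmean(responses)
    sd = statistics.stdev(responses)       # sample SD (n - 1 denominator)
    pop_sd = statistics.pstdev(responses)  # population SD, used for skewness
    # Moment-based skewness: the average cubed z-score.
    skew = (
        sum(((x - mean) / pop_sd) ** 3 for x in responses) / n
        if pop_sd > 0
        else 0.0
    )
    return mean, sd, skew


# Hypothetical responses on a 1-to-7 scale.
items = {
    "item_1": [1, 2, 3, 4, 5, 6, 7, 4, 4, 3],
    "item_2": [1, 1, 1, 1, 1, 1, 1, 1, 2, 1],  # nearly everyone picks the lowest option
}

SD_FLOOR = 1.2  # illustrative cutoff only (assumption)
flagged = [
    name for name, resp in items.items() if describe_item(resp)[1] < SD_FLOOR
]
print(flagged)
```

Flagging is only a prompt for inspection. Whether a low-variance item should be dropped depends on the construct definition and the population being sampled.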
Test the Measurement Model with Confirmatory Factor Analysis
Confirmatory factor analysis (CFA) is an elegant methodology for confirming the construct validity of one's measures. Many sources describe how to conduct CFA (e.g., Brown, 2015; Lance & Vandenberg, 2002; Pedhazur & Schmelkin, 1991). CFA is the first step in the two-step approach to structural equation modeling advocated by Anderson and Gerbing (1988), where the first step is to test the measurement model and the second step is to test the substantive structural model. That is, the a priori measurement model is a hypothesis about the relationships between indicators and constructs, and should be tested before proceeding to subsequent tests of substantive hypotheses as specified in the theoretical model of interest. Because the goodness-of-fit of a substantive structural equation model is often driven by the fit of its measurement model component (O'Boyle & Williams, 2011), it is essential to assess the measurement model independent of the overall theoretical model. CFA is not limited to survey data alone, but can be used for any multiple-indicator micro or macro construct, regardless of the source of the data (archival, self-report, etc.).
It is worth mentioning at this point that exploratory factor analysis (EFA) is unnecessary and rarely appropriate, until after a CFA has already been attempted and failed. When the researcher assigns items/indicators to a construct, this constitutes a hypothesized relationship that should be tested. Thus, CFA should be used first. The attempt to use an EFA first would be tantamount to a confession that the researcher did not know what s/he was trying to measure when the data were collected (had no hypothesized measurement model). To restate, EFA only makes sense after CFA has failed, or if data being used were collected without any construct definitions guiding the selection of items/indicators (e.g., archival data). In such instances, we still recommend undertaking Step 2 (Operationalize the Construct, content validity analysis) as described above in order to specify one's measurement model, prior to implementing any factor analysis.
Is EFA ever appropriate? Yes. Exploratory, inductive approaches can complement deductive approaches (Aguinis & Vandenberg, 2014). Thus, EFA may be useful, under two circumstances: (a) when the researcher legitimately does not know what they are trying to measure (e.g., when the Big Five personality traits were originally derived; see Cattell, 1947; Fiske, 1949; Norman, 1963), and/or (b) after CFA has failed (i.e., EFA, including EFA with parallel analysis to detect inadvertent multidimensionality, or exploratory uses of CFA with post hoc model modifications, can be appropriate, but only if used for the purpose of specifying a model that is then immediately tested using CFA in an independent dataset). When assessing dimensionality with limited information about what may be a complex measurement structure, both of the above conditions might be met. Whether using traditional EFA or more recent exploratory procedures in SEM (Asparouhov & Muthén, 2009; Brown, 2015; Conway & Huffcutt, 2003; Fabrigar et al., 1999; Morin, Arens & Marsh, 2016; Zickar, 2020), we reiterate the requirement to test the obtained measurement model with CFA using an independent sample. Further, we emphasize that principal components analysis (PCA) is not factor analysis, and should be avoided whenever the goal is to measure latent constructs (see discussion by Conway & Huffcutt, 2003; Ford et al., 1986). Also, when using EFA, orthogonal rotations should be avoided because most constructs are theoretically correlated, and if constructs are indeed uncorrelated the oblique techniques will still reveal that.
Reliability
A common reliability index is Cronbach's coefficient α, which assesses internal consistency across items/indicators (Cortina, 1993; cf. Cho & Kim, 2015). Cronbach's α is weak evidence of construct validity because it: (a) is strongly influenced by the number of items/indicators in the measure, (b) is a lower bound estimate of reliability, (c) does not address convergent or discriminant validity between constructs, and (d) assumes that the combined items/indicators are unidimensional, tau-equivalent (i.e., have equal factor loadings), and have uncorrelated item errors (Cho & Kim, 2015; Cortina, 1993; McNeish, 2018). Alternative internal consistency reliability indices include coefficient omega (ωh, which relaxes the assumption of tau equivalence, and also indexes unidimensionality) and composite reliabilities (which are appropriate for a construct of multiple related dimensions for either tau equivalent or congeneric measurement models; Cho, 2016). Cronbach's α assesses reliability across items/indicators on a measure, but reliability can also be estimated across raters (LeBreton & Senter, 2008), across occasions (Schmidt, Le, & Ilies, 2003), or across items/indicators, raters, and occasions simultaneously (i.e., generalizability theory; Cronbach, Gleser, Nanda, & Rajaratnam, 1972; DeShon, 2002; Woehr, Putka, & Bowler, 2012). For example, an approach for using trained raters to code CEO narcissism on the basis of video clips was evaluated by comparing doctoral students’ ratings of the video clips to both self-reports and others’ reports of narcissism (Petrenko, Aime, Ridge, & Hill, 2016). Content analysis of text can also be assessed for reliability. For instance, multiple kinds of measurement error for constructs developed from computer-aided text analyses can be estimated (McKenny, Aguinis, Short, & Anglin, 2018).
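To make point (a) concrete, the sketch below uses the standardized form of coefficient α, which depends only on the number of items k and the average inter-item correlation r̄: α = k·r̄ / (1 + (k − 1)·r̄). It assumes standardized (equal-variance) items, and the function names and input values are hypothetical.

```python
def standardized_alpha(k, mean_r):
    """Standardized alpha for k items with average inter-item correlation mean_r."""
    return k * mean_r / (1 + (k - 1) * mean_r)


def items_needed(target_alpha, mean_r, max_items=100):
    """Smallest number of items reaching target_alpha, or None if not reached."""
    for k in range(2, max_items + 1):
        if standardized_alpha(k, mean_r) >= target_alpha:
            return k
    return None


# With the same average inter-item correlation, alpha rises as items are added.
for k in (3, 6, 10):
    print(k, round(standardized_alpha(k, 0.3), 3))

# Items needed to reach alpha = .70 under two hypothetical values of mean_r.
print(items_needed(0.70, 0.3), items_needed(0.70, 0.5))
```

Because α climbs with k even when the items are only modestly intercorrelated, a high α on a long scale says little about whether the items reflect a single construct.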
Additionally, researchers often seek to support discriminant validity inferences by comparing their hypothesized measurement model against theoretically plausible alternative models (e.g., comparing an oblique factor model with two constructs against a unidimensional model in which the two constructs are constrained to be perfectly correlated: ϕ = 1.0, or a single-factor model). If the model-data fit for the unidimensional model is worse than model-data fit for the oblique multifactor model (e.g., if ΔCFI > .01), this is treated as initial evidence for discriminant validity. However, this sort of model comparison evidence can be quite weak, because with large sample sizes even a factor correlation of ϕ = .8 or ϕ = .9 can be empirically distinguished from ϕ = 1.0. There are circumstances where two constructs may be very highly correlated: such as a very strong causal (nearly deterministic) effect, the existence of a higher-order construct, or a slightly different form of the same construct (same content but different targets; e.g., perceived organizational support and perceived supervisor support).
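The model comparison described above can be illustrated with the usual definition of the comparative fit index, which compares the noncentrality (χ² minus degrees of freedom) of a target model with that of the baseline (independence) model. All χ² and df values below are invented for an eight-indicator example; in practice they would come from one's SEM software.

```python
def cfi(chi2_model, df_model, chi2_base, df_base):
    """Comparative fit index from model and baseline chi-square statistics."""
    d_model = max(chi2_model - df_model, 0.0)
    d_base = max(chi2_base - df_base, d_model, 0.0)
    return 1.0 if d_base == 0 else 1.0 - d_model / d_base


# Hypothetical fit statistics (8 indicators, so the baseline model has 28 df).
chi2_base, df_base = 1200.0, 28
cfi_two_factor = cfi(45.0, 19, chi2_base, df_base)   # oblique two-factor model
cfi_one_factor = cfi(160.0, 20, chi2_base, df_base)  # unidimensional model

delta_cfi = cfi_two_factor - cfi_one_factor
print(round(cfi_two_factor, 3), round(cfi_one_factor, 3), round(delta_cfi, 3))
print("two-factor model preferred" if delta_cfi > 0.01 else "no clear difference")
```

As the text cautions, clearing the ΔCFI > .01 threshold is only initial evidence: the same comparison can favor the two-factor model even when the estimated factor correlation is very high.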
Exceptions could be made when there is a strong a priori theoretical reason for expecting correlated residuals, for example longitudinal designs where an item/indicator is repeated and its uniqueness correlates with itself over time, or perhaps because the content of the items/indicators is influenced by a construct other than the one in question (Cole et al., 2007). Much of the time when researchers want to allow correlated uniquenesses among items/indicators, it is because they believe there exists a lower-order or specific factor that two or more items/indicators have in common (based on similar item wording, etc.). In such cases, the researcher should specify the lower-order factor and confirm its existence in a new dataset.
Nomological Validity
The understanding of a construct is facilitated by knowing its relationships with other constructs. Testing the nomological network (as predicted in Step 2 above) entails showing that a construct is related to other variables as expected (Cronbach & Meehl, 1955; Schwab, 2005). As an example, when Klein, Cooper, Molloy and Swanson (2014) proposed their new measure of organizational commitment, they predicted and found relationships between their new measure and theoretically-related constructs (i.e., positive relationships with job satisfaction, organizational identification, extra-role behavior, and in-role effort; and a negative relationship with turnover intentions). Likewise, Danneels (2016), as a part of developing new measures for dynamic capabilities, tested the nomological net showing that R&D competency predicted concurrent and subsequent accumulation of technological resources. Further, in the rare circumstance that it might be necessary to use a single-item/indicator measure of a construct, assessing the nomological network can bolster the empirical case for its construct validity.
Modifying the Measurement Model: Data-Driven Modifications Require Collecting New Data
When the results of the CFA suggest that the measurement model fits the data and there is evidence of convergent validity, discriminant validity, and nomological validity, then it is reasonable to proceed to hypothesis testing. However, when the measurement model exhibits poor fit to the data, then steps should be taken to identify the problem(s) and to modify the model. Such steps typically include deleting an item/indicator, or specifying an item/indicator to load onto a different factor than was originally hypothesized. However, we strongly caution that, when estimating measurement model fit, one should not use modification indices or other CFA information to change the model in any way after looking at the data. Changing the model and then reporting the modified fit on the same data renders the model fit indices meaningless in the current dataset. Only after CFA fails (e.g., poor model fit, standardized factor loadings < .4, standardized factor correlations > .7) may CFA or EFA be used in an exploratory fashion (i.e., by inspecting both the model modification indices/standardized residuals and the results of a content validity assessment; see Step 2 above) to identify the source of misfit. We emphasize that the modified measurement model must be tested on a new, independent sample. Post hoc model modifications to improve model fit often fail to replicate in future samples, because they capitalize on chance characteristics of the dataset at hand (MacCallum, Roznowski, & Necowitz, 1992). A new dataset must be collected for each revised measurement model.
When CFA fails, researchers might be tempted to improve model fit by using item parceling. An item parcel is a subset of items/indicators aggregated (usually averaged together) to form an indicator (e.g., instead of using 15 items as indicators, one might use 5 parcels of 3 indicators each). The advantages of parceling compared to using single items/indicators (Bandalos, 2002; Marsh, Lüdtke, Nagengast, Morin, & Von Davier, 2013; Williams, Vandenberg, & Edwards, 2009) include: (a) parcels are more reliable (smaller uniquenesses), more normally distributed, have smaller correlations between residuals, and have more intervals between scale points, (b) parceling greatly reduces the number of parameters being estimated, enabling CFA to converge with much smaller sample sizes, and (c) parceling generates better model goodness-of-fit. Regardless of the parceling strategy employed (e.g., based on item factor loadings, random assignment of items/indicators to parcels, or a priori theoretical facets), the major disadvantage of parceling is that a CFA with parceled indicators hides the very information that is necessary to evaluate the relationships between a construct and its item-level indicators (Little, Rhemtulla, Gibson, & Schoemann, 2013; Marsh et al., 2013; Meade & Kroustalis, 2006). Parcels may or may not reflect the construct, but they are not diagnostic of the construct-item relationship for each item/indicator. As such, parceling may be appropriate when testing structural models, but should be avoided when testing measurement models. Parceling strategies, when used to assess measurement models, signal that more item/indicator-level construct validation work is needed.
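A small sketch shows what parceling does mechanically and why it hides item-level information. It averages 15 hypothetical item responses from one respondent into 5 parcels of 3 consecutive items each; the data, the consecutive-item assignment rule, and the function name are assumptions made for illustration.

```python
import statistics


def make_parcels(item_scores, parcel_size):
    """Average consecutive items into parcels for one respondent."""
    if len(item_scores) % parcel_size != 0:
        raise ValueError("number of items must be a multiple of parcel_size")
    return [
        statistics.fmean(item_scores[i:i + parcel_size])
        for i in range(0, len(item_scores), parcel_size)
    ]


# One hypothetical respondent's answers to 15 items on a 1-to-5 scale.
respondent = [4, 5, 3,  2, 2, 5,  1, 1, 1,  5, 4, 3,  3, 3, 3]
parcels = make_parcels(respondent, 3)
print(parcels)
```

The second parcel (items scored 2, 2, 5) and the fifth parcel (3, 3, 3) receive the same score even though the underlying item responses differ sharply, which is the kind of item-level discrepancy a parceled CFA cannot reveal.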
Documenting the Process and Evidence
Other Considerations: Method Variance, Measurement Equivalence, Formative Constructs, Single Item Measures, Forced Choice Measures, Multilevel Constructs, and Algebraically Combined Measures
In addition to the basic methods of construct validation reviewed in the preceding sections, there are a few other considerations to keep in mind when addressing construct validity.
In a formative measure, the items/indicators may be uncorrelated with each other, and may correspond to unrelated facets of the construct (meaning that the items/indicators need not correspond to the entire construct definition). If researchers choose to develop a formative measure, it is not sufficient to generate a set of items/indicators where each measure refers to different content, and then to simply declare the measurement model as formative. Instead, it is essential to justify the theoretical logic for why and how items/indicators cause the latent construct, why it is unnecessary to account for measurement error in the item/indicator(s), to specify the measurement model, to demonstrate statistical evidence required for validity, and to specify how the model will be identified for estimation in SEM. To identify a formative model, it is necessary to include at least two reflective measures, or two endogenous outcome constructs (MacKenzie et al., 2005); but the loadings on the formative indicators will vary depending on which reflective constructs or measures are chosen, changing the meaning of the formative construct itself (Edwards, 2011). Accordingly, researchers who choose formative measurement should develop theoretically sound qualifying criteria for specifying which reflective variables or indicators, and not others, are appropriate for identifying the model when assessing construct validity (Bollen & Diamantopoulos, 2017; Edwards, 2001, 2011; Howell et al., 2007; MacKenzie et al., 2005). One of the better examples of validating a formative measurement approach is the research on entrepreneurial orientation by Anderson et al. (2015).
Despite the clear advantages of using multiple indicators of a construct, single item/indicator measures are sometimes unavoidable, for instance when relying on archival measures. Using single item/indicator measures confers a special obligation to clearly articulate the theoretical rationale linking item/indicator and construct, and to assess content and nomological validity. Using single item/indicator measures is better than no measures at all, but not as desirable as using multiple items/indicators to indicate a construct.
Establishing construct validity for aggregate or group-level constructs requires matching the nature of the construct to the appropriate type of empirical evidence. Many widely-studied organizational phenomena inherently require multilevel measurement and analysis, because the phenomena are often measured via individual-level perceptions but the constructs are conceptualized to reside at the group level of analysis (e.g., organizational climate, leadership; James & Jones, 1974; Klein & Kozlowski, 2000b; Rousseau, 1985). These constructs require considering within-group agreement and reliability (Bliese, 2000; LeBreton & Senter, 2008; Newman & Sin, 2020) and the item referent (e.g., “I am satisfied” vs. “My team is satisfied”; Chan, 1998b; Klein & Kozlowski, 2000b), as well as, potentially, measurement equivalence across levels of analysis (psychometric isomorphism; Tay, Woo, & Vermunt, 2014) and nomological network homology across levels of analysis (Chen, Bliese, & Mathieu, 2005). In many cases (e.g., when measuring team-level constructs or organization-level constructs that are assessed via individual-level perceptions or surveys), multilevel CFA is appropriate (Muthén, 1994); it estimates two sets of factor loadings and factor correlations (at the within-group and between-group levels of analysis) simultaneously.
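As one illustration of within-group agreement, the sketch below computes the single-item r_wg index, which compares the observed variance of a group's ratings with the variance expected if members responded uniformly at random across the A response options, (A² − 1)/12. The team data are hypothetical, and resetting negative values to zero is one common convention that we adopt here as an assumption.

```python
import statistics


def rwg(ratings, n_options):
    """Single-item within-group agreement for one group's ratings."""
    observed_var = statistics.variance(ratings)  # sample variance (n - 1)
    expected_var = (n_options ** 2 - 1) / 12     # uniform (no-agreement) null
    value = 1 - observed_var / expected_var
    # Assumption: out-of-range negative values are reset to 0.
    return max(0.0, value)


# Two hypothetical five-person teams rating "My team is satisfied" on a 1-to-5 scale.
team_a = [4, 4, 5, 4, 4]  # members largely agree
team_b = [1, 5, 2, 4, 3]  # members disagree
print(round(rwg(team_a, 5), 2), round(rwg(team_b, 5), 2))
```

High agreement in one team and low agreement in another would bear on whether aggregating individual responses to the team level is defensible for this item; the choice of null distribution and of any cutoff remains a substantive decision.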
Summary and Conclusion
It is important to remember, whether using established, revised, or newly developed measures, that the relationships between items/indicators and the constructs they are intended to represent constitute a measurement theory that must be tested. We have endeavored to provide practical guidance to reviewers, editors, and authors in the form of a checklist with supporting explanations. Our advice is no guarantee that the measure of a construct is valid, but it should be viewed as a guide to improving measurement practice. The reader should keep in mind that our paper is simply an overview of current recommendations and is subject to future revision. Moreover, it was necessary to give short shrift to many construct measurement and validity topics, and our review of recommended practices cannot address all contingencies that will arise in research practice. Instead, our overarching recommendation is to develop a sound, theoretically-derived approach to measuring constructs and to provide substantial and persuasive evidence that the measurement model should not be rejected.
We remind readers that confirming construct validity does not certify that a measure is validated for all time and for all purposes. The extent to which tests require local validity analyses varies. Tests that have been developed for specific and well-defined purposes, and have been extensively validated (e.g., well-known tests for educational, vocational, or clinical purposes), may not need validity assessment for a specific application. Yet many, if not most, tests used for research purposes and academic publications lack recommended and extensive validity assessment (American Educational Research Association, 2014). Moreover, the purpose of the research is related to the required precision of the instruments; scales used for making decisions about people's lives (e.g., hiring, admissions) often require more and different validity evidence than empirical contributions to theoretical work. Construct validity is an ongoing process (American Educational Research Association, 2014). Even if using measures that exhibited adequate construct validity in prior studies, local construct validity evidence may be critical. Measures for a construct never reach standards of validity such that further testing is unnecessary; construct validity must be revisited each time the construct is used.
Appendix A: Glossary of Construct Validity Terms
Construct (latent construct, concept, factor) – an attribute, process or disposition of people, groups, or firms (Cronbach & Meehl, 1955, p. 283; Messick, 1981, p. 577).
Measure (operationalization, item, indicator) – “an observed score gathered through self-report, interview, observation, or some other means” (DeVellis, 2003; Edwards, 2003, p. 329; Edwards & Bagozzi, 2000; Lord & Novick, 1968; Messick, 1995).
Construct validity – “the correspondence between a construct and a measure” as evaluated by cumulative evidence (Cronbach & Meehl, 1955; Edwards, 2003, p. 329; Nunnally, 1978; Schwab, 1980).
Content validity – “the degree to which a measure represents a particular domain of content” (Anderson & Gerbing, 1991; Edwards, 2003, p. 330).
Construct domain – theoretical definition of the content area of a particular construct (Hinkin, 1995, p. 969; Nunnally, 1970; Podsakoff, MacKenzie & Podsakoff, 2016; Schwab, 1980; Schriesheim et al., 1999). The notion of a construct domain is useful for understanding the practice of domain sampling.
Domain sampling – choosing particular items or measures from a universe of possible items, in order to represent a particular hypothetical construct domain (Nunnally, 1970, p. 546).
Measurement model – specifies the relationships of indicators/items to their assigned constructs, typically with freely correlated constructs (Anderson & Gerbing, 1988).
Note. Adapted from Newman, Harrison, Carpenter, & Rariden (2016).
Declaration of Conflicting Interests
The author(s) declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.
Funding
The author(s) received no financial support for the research, authorship, and/or publication of this article.
