Abstract
The verity of results about a psychological construct hinges on the validity of its measurement, making construct validation a fundamental methodology in the scientific process. We reviewed a representative sample of articles published in the Journal of Personality and Social Psychology for construct validity evidence. We report that latent variable measurement, in which responses to items are used to represent a construct, is pervasive in social and personality research. However, the field does not appear to be engaged in best practices for ongoing construct validation. We found that validity evidence of existing and author-developed scales was lacking, with coefficient α often being the only psychometric evidence reported. We provide a discussion of why the construct validation framework is important for social and personality researchers and recommendations for improving practice.
Whatever exists at all exists in some amount. To know it thoroughly involves knowing its quantity as well as its quality.
Phenomena investigated in psychology are often latent, in that the constructs of interest are typically unobservable (e.g., attitudes). Measures are developed and employed in the pursuit of studying these phenomena. For instance, the construct of life satisfaction is often measured by the 5-item Satisfaction with Life Scale (Diener, Emmons, Larsen, & Griffin, 1985). After responses to these items are collected and scored, these scores are taken to represent the construct of life satisfaction in data analysis and in the interpretations from analysis. Studying latent constructs of this nature, as opposed to observable variables such as height or weight, requires the process of construct validation. This process begins with identifying a construct, defining it, developing a theory about the structure of the construct (e.g., how many factors are present, how they are related), selecting a means of measuring the construct (e.g., Likert-type scales), and establishing that the measure appropriately represents the construct. This process of construct validation is the means by which evidence is generated to support that scores reflect the target construct (i.e., have construct validity).
The verity of results about a psychological construct hinges on the validity of its measurement, making construct validation a fundamental methodology in the scientific process, particularly in psychology. If the construct of interest is studied with poor measurement, the ability to make any claims about the phenomenon is severely curtailed because what exactly is being measured is unknown and that uncertainty trickles down into the primary results.
Purpose of Study
To assess current practice, we conducted a systematic review of the use of psychological measures using a random sample of 34% of the empirical articles published in the Journal of Personality and Social Psychology (JPSP) in 2014. Many consider JPSP the flagship journal of social and personality psychology; accordingly, we assumed that all aspects of research within would be exemplary. Thus, we set out to determine to what extent researchers are utilizing rigorous methodology for construct validation. Prior to reporting results from our review, we briefly review the established standards for generating validity evidence of measures, reiterating the fundamental role of construct validity in strengthening the conclusions drawn from psychological research. Subsequent to reporting our results, we offer recommendations for improving the use of psychological measures so as to strengthen research findings in the areas of personality and social psychology.
Construct Validation
Construct validation is the process of integrating evidence to support the meaning of a number which is assumed to represent a psychological construct. Cronbach and Meehl (1955) described construct validation as necessary whenever “an investigator believes that his instrument reflects a particular construct, to which are attached certain meanings.” Further, construct validity pertains to a specific use of a scale (e.g., diagnosis or research) and can often be context or population dependent (Kane, 2013; Messick, 1995). Stated differently, a particular scale may only measure the intended construct within a specific context. Van Bavel, Mende-Siedlecki, Brady, and Reinero (2016) discuss this same issue, termed “contextual sensitivity,” in relation to scientific reproducibility broadly. They found that studies were less likely to replicate when the psychological processes under study were contextually sensitive. Just as some psychological processes may be influenced by context, so too can their measurement. Thus, construct validation is best viewed as an ongoing process in which validity evidence is continually gathered in defense of findings.
The Standards for Educational and Psychological Testing (American Educational Research Association, American Psychological Association, & National Council on Measurement in Education [AERA, APA, & NCME], 2014) serve as an official reference, outlining best practice and methodology for conducting construct validation. These recommended practices have been categorized into three phases: substantive, structural, and external (Loevinger, 1957). The substantive phase comprises the theoretical underpinnings of a measure where previous literature is used to define the construct and outline its scope, describing the necessary content required for reasonably measuring the construct (i.e., items which tap certain dimensions). In the structural phase, quantitative analyses are used to examine the psychometric properties of the measure such as the factor structure or internal consistency. In the final, external phase, researchers gather evidence for how the construct relates to other constructs or predicts criteria, placing it in a larger nomological network (Cronbach & Meehl, 1955). These phases encompass a host of potential studies and methodologies which cannot be comprehensively reviewed here. Table 1 provides a nonexhaustive summary of validity evidence from each phase.
Table 1. Examples of Validity Evidence and Resources for Each Phase of Construct Validation.
Note. Table draws from a collection of seminal works and texts on validation and measurement more broadly including Benson (1998), Clark and Watson (1995), Crocker and Algina (2006), Loevinger (1957), Strauss and Smith (2009), and Raykov and Marcoulides (2011).
The majority of research conducted in the social and personality areas can be couched in the external phase. Although researchers may not explicitly be aware they are engaged in construct validation, they are implicitly conducting external validation when they gather information about a construct. The three phases of construct validation progress sequentially such that conclusions made in the external (third) phase may not be valid if the construct does not have a strong theoretical foundation (first phase), and the scale which measures it does not have acceptable psychometric properties (second phase). Further, even though acceptable psychometric properties were previously determined by the researchers developing the scale, it does not mean that the scale will have these same properties in a different study (i.e., the measurement may not replicate; for a discussion, see Fabrigar & Wegener, 2016). And critically, if a scale does not have acceptable properties in current research, it is questionable whether the scale is measuring the same construct as determined previously. Thus, substantive and structural evidence of validity are prerequisites for considering findings that relate to the external phase or a replication study. If the measurement properties of a scale do not replicate, then the replicability of the results from analyses using those measures is suspect.
Given the recent interest in the replication of psychological findings, we investigated what methodologies social and personality researchers use to provide ongoing structural validity evidence for measures they employ. We first obtained a snapshot of the most pervasive applications of measurement and then focused specifically on scales, defined as measures that use items to represent a latent construct. We coded the structural validity evidence provided in support of the use of these scales. Using this review as a basis, we develop recommendations for improving measurement and ultimately, psychological findings. Specifically, we aimed to answer the following questions: What types of measures are social and personality researchers using? How often do authors report a previous validation study? How often do authors report psychometric information?
Method
Sampling and Data Sources
The total number of articles published in the JPSP in 2014 (N = 122) served as the finite population. Among these articles, seven were editorials, errata, commentaries, or meta-analyses. A random sample of 39 (34%) empirical articles, stratified by substantive area (i.e., Attitudes and Social Cognition [ASC], Interpersonal Relations and Group Processes [IRGP], and Personality Processes and Individual Differences [PPID]), was drawn from the remainder for coding. Of the N = 39 sampled articles, 26% (n = 10) were from the section on ASC, 33% (n = 13) were from IRGP, and 41% (n = 16) were from PPID. These percentages closely matched the finite population percentages (26%, 35%, and 39% for ASC, IRGP, and PPID, respectively), indicating fidelity of the sampling procedure.
For this review, we focus on empirical studies in which effects based on latent constructs are of interest; as such, we removed four articles from the sample (leaving n = 35) because they were a research synthesis, a theoretical paper, or a scale development paper. Although scale development papers focus on measurement, we considered them a different population from those utilizing scales because they focus on the full-scale development process and all phases of construct validation.
Coding of Articles
In this review, we focus on how researchers engaged in ongoing construct validation, specifically the validity evidence from the structural phase reported in the Method section. Common approaches to this phase of construct validation are listed in Table 1. We focused on the Method section because that is where the primary variables of interest and their psychometric properties (e.g., factor analysis or reliability) are typically described. Accordingly, our results exclude substantive or external construct validity evidence (e.g., theoretical breadth or predictive validity) possibly present in other sections. Additionally, we did not code for validity evidence of manipulation checks or measures that were not used in the final analysis. The current work is only a snapshot of one part of the construct validation process and should not be construed as reviewing all evidence researchers should report, such as other phases of validation and the validity of manipulation checks. We stress that manipulation checks serve the essential role of quantifying the internal validity (and construct validity) of experimental designs.
All articles were coded independently by a senior and junior coder for the frequency of evidence reported. The senior coders are authors of this article with formal training in measurement and statistics, whereas junior coders were student research assistants trained specifically for this project. We coded the frequency of reported evidence for each measure, which were objective observations (e.g., number of items on a scale or presence of a reliability coefficient), as opposed to subjective judgments. To ensure there were no data entry errors, we used double entry for all articles, whereby coders met to cross check any disagreement with the original work. Errors of entry were corrected, which resulted in one accurate data set for analysis, a common approach in reviews of measurement properties (e.g., Hulleman, Schrager, Bodmann, & Harackiewicz, 2010; Weidman, Steckler, & Tracy, 2016).
Results
Types of Measures Used
On average, we coded 4.02 (standard deviation [SD] = 2.16) experiments per article and a total of 700 instances of measures with an average of 20.00 (SD = 9.71) measures per article. Some of these measures were only used once within an article, but most were used repeatedly across experiments within a paper. When taking into account measures which were used repeatedly across experiments, we coded N = 500 unique measures, with an average of 14.29 (SD = 6.54) unique measures per article. Eighty-seven percent (n = 433) of these unique measures were item-based scales in which questions or statements were combined in some way to form a composite score, meant to represent the construct of interest. These scales included 1-item measures, surveys, questionnaires, and tests. Thirteen percent of measures were not scales and varied in their approach to measurement; these measures included demographic variables, tasks, qualitative data, and observations.
In this study, we focus on scales, defined as measures that use items to capture a latent construct for which the process of construct validation is applicable. Among the unique scales, 30% (n = 132) were 1-item scales and 70% (n = 301) included more than 1 item. However, the specific number of items was not reported for 18% of unique scales (n = 79). For those with the specific number of items reported, the mean scale length was 4.69 (SD = 6.35, range = 1–58, n = 354). Excluding 1-item scales, the average scale length was 6.87 (SD = 7.18, range = 2–58, n = 222). Finally, 81% (n = 351) used a Likert-type response scale. The second most common response scale was binary (e.g., yes/no or right/wrong), representing 4% of scales. Nine percent did not report the response scale.
Validity Evidence Reported
Validity evidence for a scale can take two major forms: evidence from a previous study (which assumes that the evidence extends to the current study) and sample-specific analyses that provide ongoing evidence. As such, we coded and report how often authors used existing scales and how often structural validity evidence was reported for those scales. We intended to code for other information, but it was not routinely reported in the Method section. However, it may have been presented in other areas. For example, some authors reported correlations between variables in the Results section, but these were not presented as validity evidence for the scale in the Method section. Such results are not reflected in our study.
Use of existing scales
Roughly half the unique scales, 53% (n = 230), were accompanied by a citation, suggesting that the scales had previously established validity evidence. Forty percent of the scales had no stated source and are assumed to be author created, whereas 7% of the scales were explicitly stated to have been developed by the author. Notably, of the scales (n = 230) which were cited from existing literature, 19% were modified or adapted in some way such that the psychometric information provided by the citation may not extend to the adapted version. Scales accompanied by a citation were longer on average (M = 6.18, SD = 7.20) than scales with no citation (M = 3.43, SD = 5.25).
Psychometric information
For this analysis, we focused on scales with 2 or more items, as 1-item scales require different validation methodologies (discussed later). Two types of psychometric evidence were presented in the reviewed articles: reliability coefficients and factor analyses. Table 2 presents frequencies and percentages of the type of structural validity evidence reported, split by whether or not a citation was provided. Authors reporting the use of a previously developed scale, that is, one accompanied by a citation, were more likely to report a factor analysis. Scales without a citation tended to be shorter, and scales of 2–3 items are not appropriate for a factor analysis, which partly explains why so few researchers reported a factor analysis.
Table 2. Structural Validity Evidence Reported by Presence of a Citation for the Scale.
Note. These percentages do not sum to 100% because scales sometimes included both reliability coefficients and factor analyses.
Some of these scales include instances where the author combined multiple scales to form a new scale or index. These combination scales included scales which were used separately in a previous experiment, a combination of previously published scales, or a combination of items with multiple modes of responses such as a qualitative response with a Likert-type response. For example, one author noted that two separate scales had low αs, reported that combining the scales resulted in a higher α, and then created an average score from items across both scales. We coded 22 combination scales, and 18 of those reported coefficient α as the sole justification for combining measures.
Reliability coefficients
Given the frequent reporting of reliability coefficients, we further examined their characteristics. Of the scales that included 2 or more items (n = 301), coefficient α (Cronbach, 1951) was by far the most common reliability coefficient provided, comprising 73% (n = 222) of reported reliability information, with a correlation between 2 items representing 4%; the remaining scales did not report reliability information. One article utilized numerous scales and reported test–retest reliability in addition to α.
Of the scales for which α was reported, 15% were not specific estimates. Instead, a range across repeated measures designs or groups (e.g., α = .80–.86) or the lowest estimate (e.g., α > .80) were reported. Scales without their specific reliability coefficients (n = 45) were not included in the analyses.
Many scales were used multiple times within an article and some authors reported sample-specific α coefficients. Two hundred and forty-five estimates of α were reported for 166 unique scales. The average coefficient α estimate was .79, SD = .13, range = .17–.97. Figure 1 shows the distribution of α by whether a citation was provided for the scale. This plot shows that the variance in reliability is smaller for cited scales. We also ran a multilevel model to take into account the nested structure of these αs, as multiple αs were reported for unique scales within an article. This model included three parameter estimates: the expected grand mean (the intercept, γ00), the variance within unique measures (σ²), and the variance between unique measures (τ00). The estimated grand mean of α across all unique measures was

Figure 1. Boxplots of the α distributions for both novel and previously developed scales.
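The multilevel model described in the Results can be written out as a random-intercept model. The notation below is a sketch under our own assumptions: the indices i (an α estimate) and j (a unique measure) and the normality of the random terms are not stated in the text, and only γ00, σ², and τ00 come from the description of the model.

```latex
% Random-intercept model for alpha estimate i nested in unique measure j
\alpha_{ij} = \gamma_{00} + u_{0j} + e_{ij},
\qquad u_{0j} \sim N(0, \tau_{00}),
\qquad e_{ij} \sim N(0, \sigma^{2})

% Share of the variance in alpha that lies between unique measures
\rho = \frac{\tau_{00}}{\tau_{00} + \sigma^{2}}
```

Here γ00 is the expected grand mean of α, τ00 captures how much unique measures differ from one another, and σ² captures how much repeated α estimates for the same measure vary. The intraclass correlation ρ is a standard summary of such a model and is included only as an illustration.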
Discussion
Latent variable measurement is at the foundation of social and personality psychological research. The importance of establishing construct validity for these measures is reflected in the many resources which outline best practices (AERA et al., 2014; Borsboom & Mellenbergh, 2004). These resources are keys to strengthening the verity and dependability of findings. Generally, our results indicate that researchers who report structural evidence of ongoing construct validation in the Method section of their paper are in the minority. This suggests that many constructs studied in social and personality research lack appropriate validation, which will contribute to questionable conclusions and make it difficult for subsequent research to replicate them. There is a vast field of measurement research that has established best measurement practices (e.g., see Table 1), and we highlight key findings of our review and provide recommendations for improving current practice.
On the Fly Measurement
There is an abundance of latent variable measurement in social and personality psychology research. An average article in JPSP used 20 measures, and latent variable measurement accounted for 87% of these measures. Roughly half of these scales (46%) included no reference to previous validation, appearing to have been developed on the fly. Coefficient α was the only psychometric information reported for half of these scales which had no previously published validity evidence, and 19% had no accompanying psychometric information.
These scales are intended to represent latent constructs, and the lack of validity evidence suggests that rigorous methodology for measurement has been overlooked by authors and reviewers. Valid measurement is a necessary prerequisite to the interpretation of results and cannot be ensured if no evidence is reported. For instance, researchers studying temperature need to ensure that their thermometer provides accurate readings of temperature before interpreting their results. In psychology, ensuring accurate scores from measures is more complicated, requiring an entire process of construct validation. When newly developed scales are reported, evidence is required to indicate that scores from these scales reflect the purported construct of interest, because these scores have a direct and dramatic impact on the theory researchers are developing. Until that evidence is available, any conclusions are questionable.
We recommend researchers consider their studies as part of a broader literature which encompasses substantive theory including what is known about how to best measure the constructs central to that theory. If a new scale is needed, then the full process of construct validation is necessary. We recognize that construct validation is a lengthy process, but it is theoretically and methodologically rich, providing the potential for numerous contributions to one’s field. As such we recommend researchers and reviewers gather and look for multiple sources of validity evidence, especially when a scale has no cited source.
The Importance of Ongoing Validation
Nineteen percent of scales accompanied by a citation were explicitly said to have been adapted or modified, but new validity evidence for these scales was often not provided. Further, when citations were reported for existing scales, there was little discussion of why the scale would be valid for the current research context. For example, some research utilizes a small number of publicly available items from the Graduate Record Exam (GRE), which was originally designed for graduate admissions. The GRE was intended to have hundreds of items, and using a small subset of this total makes it unlikely that scores based on these items continue to reflect the intended construct. Although the items are not modified, the validity evidence supporting the original purpose of graduate admissions is unlikely to extend to this new purpose.
Constructs are in a constant state of validation, where researchers attempt to hone and expand existing theory using the evidence they garner in their studies. The measurement of these constructs similarly requires continual evaluation and refinement, which is why construct validation is discussed as an ongoing process in The Standards. Just as primary research findings can be context dependent (Van Bavel, Mende-Siedlecki, Brady, & Reinero, 2016), so too can measurement properties. If the validity evidence for a scale does not hold in an adapted version or in a new context, then the scores do not represent the same construct and results based on these scores will not be comparable to the previous research. If researchers are using an adapted scale, or a scale in a new way, evidence is needed to show that the scale scores are a valid representation of the construct. Examples of such psychometric evidence are described in Table 1 and include factor analyses which indicate the same factor structure as previous research. Another approach is studies of measurement invariance, which test for the same measurement properties across different populations (see Millsap, 2011). We observed numerous studies which tested hypotheses across multiple populations (e.g., age-groups, cultures), but only one tested measurement invariance.
Big Theories, Small Scales
Thirty percent of the scales we reviewed had 1 item, and the majority of scales without a citation had fewer than 3 items. Construct validation is built on the notion that when researchers develop items for a scale, they are sampling from a population of possible items. As such, short scales have historically been discouraged by the measurement community (e.g., Nunnally, 1978) because they would not adequately represent the construct and would lack predictive power compared with multi-item scales (Diamantopoulos, Sarstedt, Fuchs, Wilczynski, & Kaiser, 2012). The case of using a 1-item scale requires careful consideration and validation (e.g., see Robins, Hendin, & Trzesniewski, 2001).
We recommend researchers consider the construct they wish to measure and whether a couple of items can fully capture the breadth of that construct (see construct representativeness in Table 1). For example, the construct of status includes multiple dimensions, such as wealth, social affiliation, and prestige (Cheng & Tracy, 2014), which would be difficult to capture with a short scale. If 2–3 items were used to represent status, they would provide an extremely narrow conceptualization of the construct and may not generalize to the larger theoretical domain or existing literature. Measurement of broader or multidimensional constructs requires longer scales. For narrow conceptualizations of a construct, strong validity evidence can justify the use of a shorter scale. Such analyses include comparing the predictive power of single-item or short scales to a longer scale and using correction formulas to estimate scale reliability (see Eisinga, Grotenhuis, & Pelzer, 2013).
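One widely used correction formula of this kind is the Spearman-Brown formula, which projects the reliability of a k-item scale from the correlation among its items. The sketch below is illustrative only: the function name and the example correlation are our own, and the formula assumes parallel items (equal true-score variances and equal error variances), which may not hold in a given data set.

```python
def spearman_brown(r: float, k: float = 2.0) -> float:
    """Projected reliability of a scale of k parallel items.

    r is the correlation between two items (or the average inter-item
    correlation); k = 2 gives the two-item case.
    """
    if not -1.0 <= r <= 1.0:
        raise ValueError("r must be a correlation in [-1, 1]")
    if k < 1:
        raise ValueError("k must be at least 1")
    denominator = 1.0 + (k - 1.0) * r
    if denominator <= 0:
        raise ValueError("formula is undefined for this combination of r and k")
    return k * r / denominator


# Hypothetical example: two items that correlate .50
two_item_reliability = spearman_brown(0.50)

# The same inter-item correlation projected to a 10-item scale
ten_item_reliability = spearman_brown(0.50, k=10)
```

Under these assumptions, the same inter-item correlation yields a noticeably higher projected reliability for the longer scale, which is one reason a short scale needs stronger supporting evidence.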
Limitations of α
Coefficient α was by far the most common type of evidence reported with regard to the psychometric properties of a scale. The average α was .78, and ranged from .17 to .97, with lower estimates of .60 and below being somewhat common (10%). We reiterate that these low αs were associated with scales used in primary analyses, suggesting that a substantial number of primary variables are measured with poor reliability. Further, there was a heavy reliance on α as the sole source of structural validity evidence. Over half of the scales which were not accompanied by a citation (i.e., were explicitly said to have been author developed or a source was not stated) reported α as the only psychometric property. Although α is a useful tool for summarizing the internal consistency of items on a scale as a measure of reliability, reliability is necessary but not sufficient evidence of validity. Further, α has a long history of misuse and abuse in the social sciences (Schmitt, 1996), which our results corroborate.
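For readers who want to see what the coefficient summarizes, α can be computed from the item variances and the variance of the total score: α = k/(k − 1) × (1 − Σ item variances / total-score variance). The sketch below is a minimal illustration with hypothetical responses (five respondents, three items); the function name and the data are ours and are not drawn from the reviewed articles.

```python
import numpy as np


def cronbach_alpha(scores) -> float:
    """Coefficient alpha for a respondents-by-items array of scores."""
    x = np.asarray(scores, dtype=float)
    if x.ndim != 2 or x.shape[1] < 2:
        raise ValueError("need a 2-D array with at least two items (columns)")
    k = x.shape[1]
    item_variances = x.var(axis=0, ddof=1)        # sample variance of each item
    total_variance = x.sum(axis=1).var(ddof=1)    # sample variance of the sum score
    return k / (k - 1) * (1.0 - item_variances.sum() / total_variance)


# Hypothetical Likert-type responses: rows are respondents, columns are items
responses = np.array([
    [4, 5, 4],
    [2, 3, 2],
    [5, 5, 4],
    [3, 3, 3],
    [1, 2, 2],
])

alpha = cronbach_alpha(responses)
```

Nothing in this computation checks whether the items measure a single construct, or the intended one, which is why a high value on its own is limited evidence.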
A comprehensive review of the assumptions and uses of α is beyond the scope of this article. We highlight key information relevant to our findings and refer readers to comprehensive references in Table 1. Given certain assumptions, the α derived from a sample provides an estimate of internal consistency of items within a scale. These assumptions are expressed as an essentially tau-equivalent measurement model, which is a factor model where each item indicates only one factor, items have equal loadings, but item intercepts and error variances can differ. α was the most common and sole source of psychometric evidence reported for scales with no previously published source (78%), making it unclear whether such assumptions were met. To the extent that these assumptions are not met, α can be biased. We refer readers to Graham (2006), Sijtsma (2009), and Yang and Green (2011) for details on how to test these assumptions in the classical test theory and structural equation modeling frameworks. Additionally, we saw no reporting of McDonald’s (1999) ω, which can be used under circumstances in which items measure the same factor but have unequal loadings.
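The consequence of unequal loadings can be illustrated with model-implied values. The sketch below assumes a single standardized factor with uncorrelated errors and uses hypothetical loadings; it compares α with ω computed as (Σλ)² / ((Σλ)² + Σθ), where λ are the loadings and θ the error variances. The function names and numbers are ours, and in practice both quantities would be estimated from a fitted factor model.

```python
import numpy as np


def implied_covariance(loadings, error_variances) -> np.ndarray:
    """Covariance matrix implied by a one-factor model with factor variance 1."""
    lam = np.asarray(loadings, dtype=float)
    return np.outer(lam, lam) + np.diag(np.asarray(error_variances, dtype=float))


def alpha_from_covariance(cov) -> float:
    """Coefficient alpha computed from an item covariance matrix."""
    cov = np.asarray(cov, dtype=float)
    k = cov.shape[0]
    return k / (k - 1) * (1.0 - np.trace(cov) / cov.sum())


def omega(loadings, error_variances) -> float:
    """Omega for a one-factor model with uncorrelated errors."""
    common = np.sum(loadings) ** 2
    return common / (common + np.sum(error_variances))


# Hypothetical standardized items, so each error variance is 1 - loading^2
equal = np.array([0.7, 0.7, 0.7, 0.7])
unequal = np.array([0.9, 0.8, 0.5, 0.3])

alpha_equal = alpha_from_covariance(implied_covariance(equal, 1 - equal**2))
omega_equal = omega(equal, 1 - equal**2)

alpha_unequal = alpha_from_covariance(implied_covariance(unequal, 1 - unequal**2))
omega_unequal = omega(unequal, 1 - unequal**2)
```

With equal loadings the two coefficients coincide; with the unequal loadings assumed here, α falls below ω, which is the sense in which α can be biased when tau-equivalence does not hold.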
It is incorrect to use α as a measure of unidimensionality, when unidimensionality is a prerequisite for its computation (Cronbach, 1951). In our review, α was used to justify combining multiple scales to form a single variable 18 times, implying that the misinterpretation of α as a measure of unidimensionality (Schmitt, 1996) continues today. There are numerous demonstrations showing that α can be high even if the scale has multiple and completely orthogonal factors (Cortina, 1993; Schmitt, 1996). When authors combine items or scales to form a combination scale, they are assuming that their scores represent a single construct. If the score used is a blend of numerous constructs, the results cannot capture the theoretical insights which would be gained by representing the factors separately, conflating several distinct psychological processes.
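The point about orthogonal factors can be checked with a small constructed example. The sketch below builds a hypothetical correlation matrix for 18 items that form two clusters of 9; items correlate .70 within a cluster and exactly 0 across clusters. These values are our own illustration and are not the figures reported by Cortina (1993) or Schmitt (1996).

```python
import numpy as np


def cluster_block(n_items: int, r: float) -> np.ndarray:
    """Correlation block for n_items that all correlate r with one another."""
    block = np.full((n_items, n_items), r)
    np.fill_diagonal(block, 1.0)
    return block


n_per_factor = 9
r_within = 0.70

# Two uncorrelated clusters: off-diagonal blocks stay at zero
k = 2 * n_per_factor
corr = np.zeros((k, k))
corr[:n_per_factor, :n_per_factor] = cluster_block(n_per_factor, r_within)
corr[n_per_factor:, n_per_factor:] = cluster_block(n_per_factor, r_within)

# Coefficient alpha from the correlation matrix (standardized items)
alpha_two_factors = k / (k - 1) * (1.0 - np.trace(corr) / corr.sum())
```

Under these assumed values α is close to .90 even though the composite mixes two unrelated dimensions, so a high α for a combined scale says nothing about whether the combination is unidimensional.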
The heavy reliance on α also suggests that researchers are using it as a criterion for scale use and even item selection. Indeed, we noted numerous instances in which α was reported to justify item removal. Reliability is important to consider in construct validation, but it should not be maximized at the expense of other evidence. Drawing from the example of the broad construct of status in the previous section, we would expect a scale with numerous and similarly worded items, which captures a narrow conceptualization of status, to have a high reliability coefficient. However, this scale would not capture the breadth of the construct and would lack content validity. The construct validity of a scale cannot be boiled down into a single number, as evidenced by the list of potential validity studies one could conduct in Table 1. Even with high reliability as measured by α, researchers should offer evidence from the substantive and structural phases of construct validation before moving on to interpreting results from primary analyses.
Conclusions and Recommendations
Our review indicates that the use of scales is pervasive in social and personality psychology research and highlights the crucial role of construct validation in the conclusions derived from the use of scale scores. It also indicates that the practice of conducting and reporting evidence of ongoing construct validation could be increased, which would be to the benefit of the field. In Table 1, we recommend many resources and practices regarding construct validity for researchers and for the reviewers who will be evaluating their work. In summary, the key points to take away are: Consider valid measurement a prerequisite for interpreting the results of a study or a replication. If adequate measurement properties are not replicated, the rest of the results are necessarily not replicated. Incorporate ongoing validation from all phases into your program of research and report on it, particularly if you have created a new scale, adapted an existing scale, or are using an existing scale in a new context or population. Consider the construct representation and relevance when choosing items. Broad constructs will generally require longer scales. Halt the sole and incorrect use of coefficient α.
In closing, we want to stress that a fundamental step toward supporting a research community in utilizing more rigorous methodology is formal training. In the most recent review of graduate training in psychology (Aiken, West, & Millsap, 2008), few departments offered a full course on measurement such as test construction or classical test theory (20–24% depending on the specific topic), with 20–42% offering no curriculum on any measurement topics. Given this lack of graduate training, it is likely that many social and personality researchers are unaware of the vast methodologies associated with construct validation. Psychometrics often utilizes advanced statistical modeling such as item response theory and structural equation modeling. However, full courses devoted to such topics are rare. So even researchers who are mindful of measurement may have had little experience using the methodologies needed to evaluate scales. We hope the present assessment of the field and recommendations may serve as a starting point for strengthening the research methodology of social and personality psychology.
Footnotes
Authors’ Note
All authors designed the study. Eric Hehman and Jessica K. Flake compiled the data. Jessica K. Flake and Jolynn Pek analyzed the data. All authors wrote the article.
Acknowledgments
We would like to acknowledge Andrew Kim, Nyiesha Grant, and Lina Kanawati for their help in coding.
Declaration of Conflicting Interests
The author(s) declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.
Funding
The author(s) disclosed receipt of the following financial support for the research, authorship, and/or publication of this article: This research was funded in part by the SSHRC small grants program (P2016-0202) and the Early Researcher Award granted by the Ontario Ministry of Research and Innovation (ER15-11-004), both awarded to Jolynn Pek.
