Abstract
While there are a number of short personality trait measures that have been validated for use with adults, few have been specifically validated for use with adolescents. To trust such measures, it must be demonstrated that they have adequate construct validity. In keeping with the view of construct validity as a unifying form of validity requiring the integration of different complementary sources of information, this article reports an evaluation of the content, factor, convergent, and criterion validity as well as the reliability of adolescents’ self-reported personality traits. Moreover, this study sought to address an inherent potential limitation of short personality trait measures, namely their limited conceptual breadth. The study started with items from an existing measure, adjusted their language level for use with adolescents, and then added items tapping fundamental primary traits to determine the impact of added conceptual breadth on the psychometric properties of the scales. The resulting new measure was named the Big Five Personality Trait Short Questionnaire (BFPTSQ). A group of expert judges considered the items to have adequate content validity. Using data from a community sample of early adolescents, the results confirmed the factor validity of the Big Five structure in adolescence as well as its measurement invariance across genders. More important, the added items improved the convergent and criterion validity of the scales without negatively affecting their reliability. This study supports the construct validity of adolescents’ self-reported personality traits and points to the importance of conceptual breadth in short personality measures.
Understanding personality trait development across the life course has overarching theoretical and practical implications (Caspi & Shiner, 2006). Most adults have the cognitive capabilities to reflect on the various aspects of the self and to report their thoughts, values, motivations, feelings, and behaviors, but is the same true of adolescents? While there has been growing interest in the validity and reliability of youths’ self-reported personality traits (Soto, John, Gosling, & Potter, 2008; Tackett, Slobodskaya, et al., 2012), there are still few short assessment instruments specifically validated for this developmental period. To trust these self-reported assessments by adolescents, it should be demonstrated that they have adequate construct validity. This article reports an evaluation of the construct validity of adolescents’ self-reported personality traits. In keeping with the view of construct validity as a unifying form of validity requiring the integration of different complementary sources of information (Messick, 1995; Simms & Watson, 2007), content validity, factor validity, convergent validity, criterion validity, and reliability were evaluated. This study also sought to address an inherent potential limitation of short personality trait measures, namely their limited conceptual breadth.
A Developmental Perspective on Big Five Personality Traits
There appears to be a current consensus that a taxonomy of five broad or higher-order personality traits can help classify and account for the variation in most of the numerous existing primary traits (which are also sometimes called facets, subdomains, or lower-order traits; Goldberg, 1990, 1993; John, Neumann, & Soto, 2008; Markon, Krueger, & Watson, 2005; McCrae & Costa, 1997). These “Big Five” broad traits are Openness to Experience, Extraversion, Agreeableness, Conscientiousness, and Neuroticism (or its positive pole, Emotional Stability). Openness represents individual differences in intellectual curiosity, imagination, appreciation of different ideas and artistic expressions, and different social and political values; Extraversion reflects individual differences in sociability, assertiveness, activity level, appreciation of exciting activities, and the propensity to express positive emotions; Agreeableness reveals individual differences in prosociality, empathy, collaboration, and helpfulness toward others; Conscientiousness represents individual differences in the propensity to be organized, to plan things ahead, to control impulses, and to respect and abide by conventional social norms and rules; and Neuroticism refers to individual differences in the propensity to experience negative emotions such as anxiety, fear, depressed mood, and irritability and to have low self-worth.
The early evidence for the existence of the Big Five traits originated mainly from factor analytic studies with adult samples (Goldberg, 1990, 1993; John et al., 2008). Still, a growing body of research supports the notion that the Big Five traits can be recognized and measured in childhood and adolescence (Caspi & Shiner, 2006; Shiner, 2010). On the one hand, research on temperament or personality traits in infancy and early childhood suggests that many primary traits identifiable within each of the broad Big Five traits in adults are not yet developed at this early age. Indeed, most models of infant and early childhood personality posit three rather than five broad dimensions, namely Positive Emotionality/Surgency, Negative Emotionality, and Effortful Control, which are conceptually similar to Extraversion, Neuroticism, and Conscientiousness, respectively (Rothbart & Bates, 2006; Shiner, 2010). On the other hand, in contrast to these early stages of life, accumulating evidence from different areas of research suggests that arguably all of the primary traits identified in adults are also identifiable by early adolescence, at least in an emerging form.
First, individual differences in the physiological, endocrine, and neural systems hypothesized to underlie adult personality traits (DeYoung & Gray, 2009) can be measured by early adolescence, though various brain circuits related to these systems continue to change or “mature” throughout the second decade of life (Spear, 2010). Second, helped by their growing capacity to think abstractly, early adolescents can form an abstraction of their self by combining several of their attributes (Harter, 2006). Moreover, the self-concept starts becoming more differentiated in early adolescence, which means that early adolescents are more able than children to describe their self-attributes differently according to different social situations or roles (Harter, 2006). In the same vein, there is evidence that by early adolescence, around age 12, autobiographical life narratives are more globally coherent than at earlier ages (Habermas & de Silveira, 2008). This suggests that a coherent sense of self is manifested at least by early adolescence.
Third, a number of studies using item-level factor analyses of early adolescents’ self-reports on instruments constructed for adults tend to recover the Big Five factor structure reasonably well (e.g., Allik, Laidra, Realo, & Pullmann, 2004; McCrae et al., 2002; W. D. Parker & Stumpf, 1998; Scholte, van Aken, & van Lieshout, 1997; Soto et al., 2008). In general, these studies suggest that by the age of 10 to 12, there is systematic and meaningful covariation between Big Five trait indicators (items tapping different primary traits), even though within-trait coherence (internal consistency) and between-trait differentiation (lower factor correlations) tend to increase throughout adolescence (Soto et al., 2008). In addition to self-reports, studies using item-level factor analyses of informant reports (e.g., from parents and teachers) on instruments not designed a priori to measure the Big Five traits tend to recover the Big Five trait structure well in samples of late childhood and early adolescence (e.g., Goldberg, 2001; Tackett, Krueger, Iacono, & McGue, 2008; Tackett, Slobodskaya, et al., 2012).
Based on this evidence, it is reasonable to argue for the construction of a developmentally sensitive personality trait measure for adolescents assessing the same primary traits that are typically assessed in adulthood. Of course, the items of such an instrument need to be either developmentally neutral or sensitive to this developmental period (e.g., referring to school rather than work) and to have a language level appropriate for young adolescents.
Beyond the replication of the factor structure across developmental periods, there is also evidence that the Big Five personality traits have adequate predictive validity in childhood, adolescence, and adulthood. There is mounting evidence that the Big Five traits are related, both concurrently and prospectively, to various positive and negative consequential life outcomes in adulthood (Ozer & Benet-Martínez, 2006), even beyond classical constructs such as socioeconomic status and intelligence (Roberts, Kuncel, Shiner, Caspi, & Goldberg, 2007). Clinical and developmental psychopathology research in recent years has also shown that personality traits are meaningful correlates or predictors of various psychopathologies (Malouff, Thorsteinsson, & Schutte, 2005; Widiger & Smith, 2008), notably mood and anxiety disorders (Klein, Dyson, Kujawa, & Kotov, 2012; Klein, Kotov, & Bufferd, 2011; Kotov, Gamez, Schmidt, & Watson, 2010), antisocial behavior (Jones, Miller, & Lynam, 2011; Miller & Lynam, 2001), and substance use and abuse (Ball, 2005; Terracciano, Lockenhoff, Crum, Bienvenu, & Costa, 2008). Research on children and adolescents also supports the validity of personality traits for predicting psychopathology (Caspi, 2000; De Pauw & Mervielde, 2010; Krueger, Caspi, Moffitt, Silva, & McGee, 1996; Tackett, 2006; Tackett, Martel, & Kushner, 2012).
Short Personality Trait Measures and Conceptual Breadth
Initial evidence for the existence of the Big Five originated from early factor analytic studies of hundreds of personality trait descriptors or adjectives (Goldberg, 1990, 1993; John et al., 2008). Each of these five broad traits thus encompasses several primary personality traits. In other words, the Big Five traits are constructs with large conceptual breadth or bandwidth (Cattell, 1966). A number of integrative instruments measuring these broad traits and some of their corresponding primary traits have been developed (see chapters in Boyle, Matthews, & Saklofske, 2008). However, because users of personality measures often face limited testing time, particularly in research situations, complete integrative personality trait measures are often considered too lengthy. For example, in large community-based longitudinal studies, in which personality traits are only one of several constructs being assessed, using shorter measures becomes practically inevitable. Not surprisingly, short measures of the Big Five traits are currently quite popular. A number of such measures exist, most of them having satisfactory basic psychometric properties (e.g., Big Five Questionnaire–Children Version [BFQ-C]—Barbaranelli, Caprara, Rabasca, & Pastorelli, 2003; Mini-International Personality Item Pool Big Five Measure [Mini-IPIP]—Donnellan, Oswald, Baird, & Lucas, 2006; International Personality Item Pool Big Five Measure [IPIP50]—Goldberg, 1999; Ten-Item Personality Inventory [TIPI]—Gosling, Rentfrow, & Swann, 2003; Big Five Inventory [BFI]—John, Donahue, & Kentle, 1991; John et al., 2008; NEO Five-Factor Inventory [NEO-FFI-3]—McCrae & Costa, 2010; Five Factor Model Rating Form [FFMRF]—Mullins-Sweatt, Jamerson, Samuel, Olson, & Widiger, 2006; Big Five Inventory–10 [BFI-10]—Rammstedt & John, 2007; Big Five Mini Markers [BFMM]—Saucier, 1994; Mini Modular Markers [3M40]—Saucier, 2002).
Various authors have reviewed issues pertaining to the validity and usefulness of short measures in general (e.g., Silverstein, 1990; Smith, McCarthy, & Anderson, 2000), and conceptual and statistical strategies have been proposed to shorten long versions (Marsh, Ellis, Parada, Richards, & Heubeck, 2005; Stanton, Sinar, Balzer, & Smith, 2002). Most of these authors have pointed to some nontrivial shortcomings of short measures of conceptually complex or broad constructs. One central shortcoming is limited conceptual breadth. Indeed, conceptually broad constructs typically consist of several primary traits. In principle, for a short measure to be considered a valid alternative to the full-length one, each of the construct’s primary traits should be represented. A short measure of a broad construct has limited conceptual breadth when some primary traits are proportionally less represented, or worse, not represented at all (Loevinger, 1957; Smith, Fischer, & Fister, 2003). In other words, such a measure does not have adequate content validity (Haynes, Richards, & Kubany, 1995).
As argued by different scholars, the main reason short personality measures have reduced conceptual breadth lies in the common ways items are retained (Marsh et al., 2005; Saucier & Goldberg, 2002; Smith et al., 2000; Stanton et al., 2002). These authors point to three common empirical methods for developing short personality trait scales: factor saturation (selecting items with the highest factor loadings), factor discrimination (selecting univocal items without cross-loadings), and internal consistency maximization (selecting items contributing the most to internal consistency). These methods are useful, but they all tend to lead to the same shortcoming: scales with limited conceptual breadth in which items tapping some fundamental primary personality traits are absent (see Saucier & Goldberg, 2002). However, it is critical to preserve adequate conceptual breadth in a personality measure for two related reasons. First, when representative sampling of all the primary traits embedded in a broad trait is not maintained, the measure’s content validity is compromised (Haynes et al., 1995; Smith et al., 2000; Smith et al., 2003). When content validity is inadequate, almost all other psychometric properties are affected (Haynes et al., 1995). Second, preserving conceptual breadth is important when personality trait measures are used to predict external criteria (i.e., criterion validity). As pointed out by different scholars, all other things being equal, a conceptually broader scale will tend to be related to a broader array of criteria (Loevinger, 1957; Saucier & Goldberg, 2002). This is particularly important for short Big Five measures because they are often used in community and longitudinal studies to predict a multitude of criterion variables.
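To illustrate why internal consistency maximization tends to drop items that cover distinct primary traits, a minimal sketch in Python follows. It computes Cronbach's alpha and greedily removes, one at a time, the item whose deletion raises alpha the most. This is one simple version of the strategy, not the procedure of any particular study; the function names, toy responses, and item labels are hypothetical (three near-redundant sociability items and one sensation-seeking item that correlates weakly with them).

```python
from statistics import variance


def cronbach_alpha(items):
    """Cronbach's alpha for a list of item columns (each a list of scores
    from the same respondents, in the same order). Needs at least 2 items."""
    k = len(items)
    n_respondents = len(items[0])
    totals = [sum(col[i] for col in items) for i in range(n_respondents)]
    sum_item_var = sum(variance(col) for col in items)
    return k / (k - 1) * (1 - sum_item_var / variance(totals))


def shorten_by_alpha(items, labels, target_k):
    """Greedy internal consistency maximization: repeatedly drop the item
    whose removal yields the highest alpha until target_k (>= 2) items remain."""
    items, labels = list(items), list(labels)
    while len(items) > target_k:
        alpha_if_deleted = [
            cronbach_alpha(items[:j] + items[j + 1:]) for j in range(len(items))
        ]
        drop = max(range(len(items)), key=lambda j: alpha_if_deleted[j])
        del items[drop], labels[drop]
    return labels, cronbach_alpha(items)


# Hypothetical 0-4 responses from six respondents: three near-redundant
# sociability items and one sensation-seeking item.
soc1 = [0, 1, 2, 3, 4, 2]
soc2 = [0, 1, 2, 3, 4, 3]
soc3 = [1, 1, 2, 3, 4, 2]
sens = [3, 0, 4, 1, 2, 2]

kept, alpha = shorten_by_alpha(
    [soc1, soc2, soc3, sens], ["soc1", "soc2", "soc3", "sens"], target_k=3
)
print(kept, round(alpha, 3))
```

In this toy example the sensation-seeking item is the one removed, even though it is the only item covering that primary trait: the criterion rewards redundancy among items, not coverage of the construct.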
Of course, measures of conceptually broad predictors generally tend to provide optimal prediction of equally complex and heterogeneous criteria, whereas measures of narrower predictors are most efficient in predicting specific criteria (Cronbach & Gleser, 1965; Loevinger, 1957). As noted by Hogan and Roberts (1996), psychologists and other social scientists generally study conceptually complex or broad constructs or syndromes. In this context, it is important that short measures of broad constructs such as the Big Five personality traits preserve conceptual breadth.
An examination of the content of most existing short Big Five measures indicates that they suffer from limited conceptual breadth. Indeed, a number of theoretically important primary traits are usually absent, which ipso facto limits content validity but may also affect other psychometric properties. The point is not to argue that existing short personality trait measures are inappropriate or not useful, but merely to note that most, if not all, of them have limited conceptual breadth. Yet evidence is still lacking regarding the impact of this limitation on other psychometric properties, such as convergent validity with independent measures of related constructs or criterion validity with consequential outcomes.
The Present Study
Rationale for Items Modification and Addition
In this study, the impact of adding conceptual breadth in a short Big Five personality trait measure was evaluated. While existing studies typically aimed at reducing the number of items, the reverse strategy was used in this study. Indeed, starting with an existing measure, a number of items tapping missing fundamental primary traits were added to the scales.
The BFI (John et al., 1991; John et al., 2008) was selected to provide an initial item pool because of its adequate basic psychometric properties and, more specifically, its satisfactory content validity compared with other noncommercial short personality trait measures. Moreover, this measure contains short verbal statement items, which are preferable to single-trait adjective items because the former are more likely to be easily and unequivocally understood by early adolescents (Barbaranelli et al., 2003). Also, single-trait adjectives are often at a higher level of abstraction than short verbal statements, making them potentially harder to translate precisely into other languages (Saucier & Goldberg, 2002). The BFI items posted on the public domain International Personality Item Pool website were used as the initial item pool (http://ipip.ori.org/newBFITable.htm). The items are also available in John et al. (2008). As explained below, because of all the modifications made to the original item pool, the resulting new measure was named the Big Five Personality Trait Short Questionnaire (BFPTSQ) to avoid confusion with the original questionnaire. The BFPTSQ items are presented in the appendix. This English version was produced by back translation from the original French version (with all the modifications and additions described next).
Because this study was conducted with French Canadian early adolescents, it was important to ensure that all items would be well understood by the participants. A number of researchers have reported that adolescents have difficulty reading or understanding many items in personality questionnaires (e.g., Allik et al., 2004; De Fruyt, Mervielde, Hoekstra, & Rolland, 2000; McCrae, Costa, & Martin, 2005), arguably because these measures were constructed and validated for use with adults. Therefore, the first step was to translate and revise the items. Three researchers and three graduate students, all familiar with personality theory (and the Big Five model more specifically) and experienced in psychometric test construction and in research with adolescents, first translated the items into French. Modifications were made until a consensus was reached. The items were then reviewed for language level. The criterion was that adolescents starting high school (at ages 12 or 13 in the province of Quebec) be able to understand the items. This led to minor modifications to most of the items. A number of items were slightly simplified, and for a few of them an additional descriptor or stem, connected by a comma, was added. Although it is generally not recommended to have multiple stems in an item, the objective was merely to provide a synonym or an elaboration.
Before new items were added, one Openness item from the original item pool was deleted because it was judged less relevant for adolescents and not central to the target construct (“prefer work that is routine”). Moreover, an Extraversion item that was judged equivocal (“generates a lot of enthusiasm”) was removed and replaced with an item tapping social dominance or leadership (Item 17).
The second step was to add new items representing important primary personality traits. To keep the measure short, with a testing time of 10 to 15 minutes for adolescents (5 to 10 minutes for adults), a maximum of 50 items overall, with 10 items per scale, was decided on. The goal was not to add items for all missing primary traits in the original item pool, but to add only a few of the most fundamental.
Scholarly reviews on trait structure and empirical research with adolescents guided the choice of the primary traits for which items would be added. For Openness, the original BFI scale has good content coverage for the primary traits of Intellectual Inquisitiveness, Creativity, and Aesthetic and Artistic Appreciation. It was decided that only one item tapping Openness to Cultural Diversity (31r) would be added. This primary trait is part of some comprehensive Openness models (McCrae & Sutin, 2009). For instance, it would be represented by the facet “Openness to Values” in the NEO-PI-3 (McCrae & Costa, 2010). As discussed by McCrae and Sutin (2009), this social attitude can be considered part of Openness because open persons are generally lower in ethnocentrism and thus more open to different cultures and lifestyles. Assessing individual differences in this social attitude is important because it is associated with the behavioral and emotional adjustment of both adolescents and teachers in multicultural high schools in large metropolitan areas (García-Coll & Magnuson, 1997). Among existing short Big Five measures, there are no items tapping Openness to Cultural Diversity in the BFI, the IPIP50, the NEO-FFI-3, the BFMM, or the 3M40.
For Extraversion, the original BFI scale has good content coverage for the primary traits of Sociability, Expressiveness, Assertiveness, and Activity. It was decided that an item tapping Sensation Seeking (42) and one tapping Joyfulness or Positive Emotions (47) would be added. These two primary traits are part of various comprehensive models of Extraversion (Wilt & Revelle, 2009). For instance, they would correspond to Excitement Seeking and Positive Emotions in the NEO-PI-3 (McCrae & Costa, 2010). Sensation Seeking is certainly a fundamental personality trait that has proven useful for understanding the development of negative life outcomes (Zuckerman, 2009). For instance, it is related to antisocial behavior (Jones et al., 2011). Assessing individual differences in Sensation Seeking in adolescents can be useful for predicting substance use and for deciding which adolescents should participate in a preventive intervention (Sargent, Tanski, Stoolmiller, & Hanewinkel, 2010). Joyfulness is also an important primary trait, as some scholars consider it to be the core aspect of Extraversion (Tellegen & Waller, 2008; Watson & Clark, 1997). Assessing individual differences in Joyfulness is important because it is related to various life outcomes, such as attachment styles (Shiota, Keltner, & John, 2006), stress recovery, and resilience (Ong, Bergeman, Bisconti, & Wallace, 2006). Among existing short Big Five measures, there are no items tapping Sensation Seeking in the BFI, the BFQ-C, the IPIP50, the BFMM, or the 3M40, while there are no items tapping Joyfulness in the BFI, the IPIP50, the BFMM, or the 3M40.
For Agreeableness, the original BFI scale has good content coverage for the primary traits of Altruism, Compassion, Forgiveness, and Cooperation. It was decided that an item tapping Machiavellianism (48r) would be added. Machiavellianism is a ubiquitous primary personality trait in human experience (Jones & Paulhus, 2009; Wilson, Near, & Miller, 1996) and is arguably a central construct for studying antisocial behavior development and psychopathy (Barry, Kerig, Stellwagen, & Barry, 2010). This primary trait is part of various comprehensive Agreeableness models and would be represented by the facet Straightforwardness and, to a lesser extent, Altruism in the NEO-PI-3 (McCrae & Costa, 2010).¹ Assessing individual differences in this construct is important because it is clearly associated with serious conduct problems and antisocial behavior in children, adolescents, and adults (Frick et al., 2003; Jones et al., 2011). Among existing short Big Five measures, there are no items tapping Machiavellianism in the BFI, the BFQ-C, the IPIP50, the BFMM, or the 3M40.
For Conscientiousness, the original BFI scale has good content coverage for the primary traits of Dutifulness, Self-Discipline, Planfulness, and Order. It was decided that an item tapping Impulsiveness (49r) would be added. Impulsiveness is undoubtedly a fundamental primary personality trait (Madden & Bickel, 2010) and is part of most comprehensive models of Conscientiousness (Roberts, Jackson, Fayard, Edmonds, & Meints, 2008). In the NEO-PI-3, it would be represented by the facet Deliberation (McCrae & Costa, 2010).² Assessing individual differences in impulse control is important because it is related to executive functions and plays a key role in various psychopathologies, such as ADHD and antisocial behavior (Crews & Boettiger, 2009; Jones et al., 2011; Newman & Wallace, 1993; Nigg, 2006). Among existing short Big Five measures, there are no items tapping Impulsiveness in the BFI, the BFQ-C, the IPIP50, the NEO-FFI-3, the BFMM, or the 3M40.
Finally, for Emotional Stability, the original BFI scale has good content coverage for the primary traits of Anxiousness, Fearfulness, Worry, and Depressed Mood. It was decided that an item tapping Low Self-Worth (45r) and another one tapping Irritability (50r) would be added. Self-Worth (or Self-Esteem) is clearly a crucial trait of human experience (Bosson & Swann, 2009). Irritability (or Anger) is also a central construct in human personality (Barefoot & Boyle, 2009; Snaith & Taylor, 1985). Both of these primary traits are part of various comprehensive models of Neuroticism (Clark & Watson, 2008; Tellegen & Waller, 2008). For instance, in the NEO-PI-3 (McCrae & Costa, 2010), low Self-Worth would be represented by the facet Vulnerability and, to a lesser extent, by Depression, while Irritability would be represented by the facet Angry Hostility. Assessing individual differences in Self-Worth is important because it is related to psychopathology development, particularly internalizing psychopathologies (Mann, Hosman, Schaalma, & de Vries, 2004) and aggression and antisocial behavior (Donnellan, Trzesniewski, Robins, Moffitt, & Caspi, 2005). Assessing individual differences in Irritability is also important because it is related to the development of internalizing psychopathologies (Stringaris, Cohen, Pine, & Leibenluft, 2009), ADHD (Leibenluft, Cohen, Gorrindo, Brook, & Pine, 2006), reactive aggression (Scarpa & Raine, 1997), antisocial behavior (Jones et al., 2011), and substance use (Tarter, Blackson, Brigham, Moss, & Caprara, 1995). Among existing short Big Five measures, there are no items tapping Irritability in the BFI or the NEO-FFI-3, while there are no items tapping Low Self-Worth in the BFI, the BFQ-C, the IPIP50, the BFMM, or the 3M40.
Evaluation of Construct Validity and Impact of Added Conceptual Breadth
To ensure that the modified and newly added items were valid, a group of experts in personality trait theory and test construction was first asked to rate the relevance or representativeness of each item with regard to its target Big Five trait. Factor validity was then assessed in two ways. First, factor analyses were conducted to ensure that all items were associated with their target broad trait. Second, given the well-documented gender differences in personality trait means (Else-Quest, Hyde, Goldsmith, & Van Hulle, 2006; Schmitt, Realo, Voracek, & Allik, 2008), measurement invariance across genders was also evaluated. Indeed, before mean differences are tested, it must be established that the constructs of interest are the same in both groups, and few studies have fully tested the measurement invariance of personality traits across genders (Marsh et al., 2010). Reliability was also estimated. Given the secondary objective of this study, reliability estimates from the scales with original content (i.e., without the new items) and from those with added items were compared statistically. It was hypothesized that, given their small number, the added items would have no or only a marginal impact on reliability estimates. Two opposing effects were expected: the added items increase the conceptual breadth of the scales, which could reduce their reliability, but they also make the scales longer, which should slightly increase it. Convergent validity was also assessed using the NEO-PI-3 scales (McCrae & Costa, 2010). Given its widespread use in personality research, this questionnaire can arguably be considered a gold standard for convergent validity. Again, to determine whether the BFPTSQ scales with added items provided enhanced convergent validity, the correlations derived from the original-content scales and those derived from the added-item scales were compared using statistical tests.
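The hypothesis that the added items would have little impact on reliability can be made concrete with the Spearman-Brown prophecy formula, which predicts the reliability of a scale lengthened by a factor k. This is an illustration under a strong assumption the added items do not meet: Spearman-Brown treats the new items as parallel to the old ones (same average inter-item correlation), whereas the added items were chosen precisely to tap different primary traits. The values (an 8-item scale with reliability .70 lengthened to 10 items) are hypothetical, not results from this study.

```latex
\rho^{*} = \frac{k\,\rho}{1 + (k - 1)\,\rho}, \qquad k = \frac{\text{new length}}{\text{old length}} = \frac{10}{8} = 1.25

\rho^{*} = \frac{1.25 \times .70}{1 + 0.25 \times .70} = \frac{0.875}{1.175} \approx .745
```

Because items tapping a different primary trait will usually correlate less with the rest of the scale than parallel items would, the real gain from lengthening should be smaller than this upper figure, and the breadth effect may offset it entirely, which is consistent with expecting no or only a marginal net change.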
It was hypothesized that the scales with added items would have significantly higher correlations with each of the broad Big Five traits, as well as with the NEO-PI-3 primary traits corresponding to the specific content of the newly added items. Finally, criterion validity was assessed by correlating the personality scales with consequential outcomes. Specifically, because several studies have shown them to be related to personality traits, measures of externalizing and internalizing psychopathology (e.g., De Pauw & Mervielde, 2010; Klein et al., 2012; Tackett, 2006; Tackett, Martel, et al., 2012; Widiger & Smith, 2008) as well as academic achievement (e.g., Noftle & Robins, 2007; Poropat, 2009) were used. Based on these studies, it was hypothesized that the externalizing psychopathology and substance use scales would be negatively related to Agreeableness and Conscientiousness and positively related to Extraversion. Internalizing psychopathology scales were expected to be negatively related to Emotional Stability and Extraversion. It was also hypothesized that academic achievement would be positively related to Conscientiousness and Openness. Lastly, because of their greater conceptual breadth, the added-item scales were hypothesized to show significantly higher correlations with each relevant outcome.
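The text does not name the test used to compare the correlations of the original-content and added-item scales. Because both versions share most of their items and are scored for the same respondents, their correlations with a given criterion are dependent, so a test for correlated correlations is needed rather than one for independent samples. A minimal sketch of one widely used option, the Meng, Rosenthal, and Rubin (1992) z test, follows; the function name and the correlation values are hypothetical, and only the subsample size of 598 comes from the text.

```python
import math


def compare_dependent_correlations(r1, r2, r_x, n):
    """Meng, Rosenthal & Rubin (1992) z test of H0: rho(y, x1) == rho(y, x2),
    where both correlations come from the same n respondents.

    r1  -- correlation of the criterion y with scale version x1
    r2  -- correlation of the criterion y with scale version x2
    r_x -- correlation between the two scale versions (must be < 1)
    Returns (z, two-tailed p).
    """
    r2_bar = (r1 ** 2 + r2 ** 2) / 2               # mean squared correlation
    f = min((1 - r_x) / (2 * (1 - r2_bar)), 1.0)   # f is capped at 1
    h = (1 - f * r2_bar) / (1 - r2_bar)
    z = (math.atanh(r1) - math.atanh(r2)) * math.sqrt(
        (n - 3) / (2 * (1 - r_x) * h)
    )
    p = math.erfc(abs(z) / math.sqrt(2))           # two-tailed normal p value
    return z, p


# Hypothetical values: added-item scale r = .40 and original-content scale
# r = .35 with the same criterion, the two versions correlating .95, n = 598.
z, p = compare_dependent_correlations(0.40, 0.35, 0.95, 598)
print(f"z = {z:.2f}, p = {p:.5f}")
```

With these hypothetical inputs the test yields z of about 4.2, whereas treating the two correlations as if they came from independent samples would give z of about 1.0. The high overlap between the two scale versions makes the difference between their correlations much more precisely estimated, which is why the dependency must be modeled.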
Method
Procedure and Participants
The data come from the first assessment of an ongoing prospective longitudinal study of personality, antisocial behavior, and psychopathology development during adolescence. Eight French-language high schools from four school boards in the greater Montreal and Quebec City areas (Quebec, Canada) were initially invited to participate. These schools were selected a priori to sample adolescents from various socioeconomic statuses and ethnic backgrounds. After the project was presented to all school administrations, one school declined to participate. In total, 41 classes participated in the first assessment: 29 regular classes, 8 international curriculum classes (i.e., enriched curriculum), and 4 specialized classes for special-needs students (i.e., with behavior problems and/or learning difficulties).
Before the data collection, the project was approved by the University of Montreal’s ethical review board as well as by the ethics committees of all the school boards. All the adolescents signed a consent form, completed a form providing their contact information (email, home address, etc.), and were given an envelope to bring to their parents. The envelope contained a letter to the parents explaining the study, a consent form, a questionnaire about their child, a questionnaire about themselves, and a preaddressed, prepaid envelope for returning all the material. All parents who did not return the signed consent form were contacted by phone. Only six parents refused to allow their child to participate in the study.
For the data collection, two trained research assistants visited each group between January and June 2009. Because the initial questionnaire in this longitudinal study was extensive, the adolescents were given two regular class periods (75 minutes each) to fill it out. All those who were absent from school the day of the assessment were contacted by phone to set a new date for the data collection. Moreover, to minimize missing data, all adolescents who were unable to fully complete the questionnaire during the two periods were contacted and asked to return the missing responses by email. Two $20 gift certificates were randomly drawn in each group among the participating adolescents.
Questionnaires were gathered from 1,036 adolescents in the first assessment of this study. Personality data were available for 1,028 of them (only a subsample of 598 completed the NEO-PI-3). Their mean age was 12.71 years (SD = 0.59); most were aged 12 (35.5%) or 13 (59.1%), and only a few were aged 14 (4.4%) or 15 (1.0%). The sample was evenly divided by gender (females: n = 520, 50.2%). Consistent with the population of the province of Quebec, most participants were Caucasian (74%), while Blacks (4.9%), Hispanics (3.3%), Asians (3.2%), Arabs (7.4%), Canadian First Nations (2.8%), and multiethnic persons (4.4%) made up the rest of the sample. The large majority were born in Canada (90.8%). Most lived with their two biological parents (68.1%), while 16.2% lived in a shared custody arrangement between their biological mother and father, 5.0% with their biological mother and her new partner, 1.7% with their biological father and his new partner, 6.1% with their biological mother alone, 1.2% with their biological father alone, and the rest (1.7%) with another family member, with an adoptive family, or in foster care. With regard to siblings, 65.5% of these adolescents lived with at least one brother (M = 0.98, SD = 0.98), and 61.6% lived with at least one sister (M = 0.90, SD = 0.93).
Measures
Personality
Big Five Personality Trait Short Questionnaire (BFPTSQ)
Even though the questionnaire resulting from this study is similar in content to the BFI developed by John and his colleagues (1991, 2008), it is a different measure: the language of most items was simplified for use with early adolescents, and new items corresponding to important primary traits were added. Because of these differences, and to avoid confusion with the original measure, the new questionnaire was given a different name: the Big Five Personality Trait Short Questionnaire (BFPTSQ). The BFPTSQ has 50 items, 10 for each trait. Respondents use a 5-point Likert-type response format (totally disagree = 0, disagree a little = 1, neutral opinion = 2, agree a little = 3, totally agree = 4). As in the original instrument, the introductory phrase “I see myself as someone who” is presented at the top of each page. The French and English versions of the BFPTSQ are available from the author.
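The response format and the reverse-keyed items (marked with an r in the results, e.g., Items 22r and 31r) can be illustrated with a minimal scoring sketch. It assumes that a scale score is the mean of its keyed items (the scoring rule is not stated in this section; a sum orders respondents with complete data identically) and uses hypothetical item numbers, since the BFPTSQ item key is not reproduced here.

```python
from statistics import mean

# Response format: totally disagree = 0 ... totally agree = 4.
MIN_RESPONSE, MAX_RESPONSE = 0, 4


def reverse_key(response: int) -> int:
    """Reverse-key a 0-4 response (0 <-> 4, 1 <-> 3, 2 stays 2)."""
    return MIN_RESPONSE + MAX_RESPONSE - response


def score_scale(responses: dict, items: list, reversed_items: set) -> float:
    """Mean item score for one scale after reverse-keying the flagged items.

    `responses` maps item number -> 0-4 response. The item numbers used below
    are hypothetical placeholders, not the actual BFPTSQ key.
    """
    keyed = []
    for item in items:
        raw = responses[item]
        if not MIN_RESPONSE <= raw <= MAX_RESPONSE:
            raise ValueError(f"item {item}: response {raw} is outside 0-4")
        keyed.append(reverse_key(raw) if item in reversed_items else raw)
    return mean(keyed)


# Hypothetical four-item example with items 2 and 4 reverse-keyed.
example = {1: 4, 2: 0, 3: 3, 4: 1}
print(score_scale(example, [1, 2, 3, 4], {2, 4}))  # keyed values 4, 4, 3, 3 -> prints 3.5
```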
NEO-Personality Inventory-3 (NEO-PI-3)
The NEO-PI-3 (McCrae & Costa, 2010) assesses the broad and primary personality traits of the five-factor model. The broad traits are Openness, Extraversion, Agreeableness, Conscientiousness, and Neuroticism. There are 30 primary-trait scales in total, six for each broad trait. The major difference between the NEO-PI-3 and its previous edition (NEO-PI-R) is that 37 items were replaced or modified to make them easier for adolescents to read and to make them better indicators of their target primary trait (McCrae et al., 2005). The inventory includes 240 items (8 per primary trait) with a 5-point Likert-type response scale (strongly disagree = 0 to strongly agree = 4). In the present sample, internal consistency is satisfactory for almost all scales. For the broad traits, internal consistency estimates are all high, ranging from .79 (Openness) to .92 (Conscientiousness). For most primary traits, they range from .65 to .79, but they fall below .60 for one primary trait of Extraversion (Activity, .55), one of Agreeableness (Compliance, .54), two of Neuroticism (Angry Hostility, .58; Impulsiveness, .56), and three of Openness (Feelings, .56; Actions, .37; Values, .43). Although internal consistency is substandard for some primary traits, particularly those of Openness, other researchers have observed the same phenomenon using the NEO inventories with adolescents (e.g., De Fruyt et al., 2000).
Outcomes
Psychopathology
To measure psychopathology, seven scales from the Youth Inventory Version 4 (Gadow & Sprafkin, 1999) were used. This inventory is a self-report measure of the most prevalent DSM-IV mental disorders during adolescence, and all items were written to correspond directly to DSM-IV symptoms. The adolescents rated how often each item had applied to them over the past 12 months on a 4-point scale (never = 0, sometimes = 1, often = 2, very often = 3). Although the scales retain the DSM disorder labels, each can be scored either categorically (i.e., a symptom count threshold identifying the presence of a disorder) or dimensionally (i.e., the sum of the items, representing a frequency or severity score). The dimensional scores were used in this study. Three externalizing and four internalizing psychopathology scales were selected. The externalizing scales are Attention Deficit Hyperactivity Disorder (ADHD, 18 items), Conduct Disorder (CD, 15 items), and Oppositional Defiant Disorder (ODD, 8 items); the internalizing scales are Major Depressive Disorder (MDD, 12 items), Bipolar Disorder (BD, 10 items), Generalized Anxiety Disorder (GAD, 8 items), and Social Phobia (SOP, 4 items). The internal consistency of the scales is satisfactory in this sample, with coefficients ranging from .78 to .91.
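The difference between the two scoring approaches can be sketched as follows. This is an illustration, not the published YI-4 scoring procedure: it assumes that an item counts as a symptom when rated “often” or “very often,” and the cutoff passed to the categorical screen is a hypothetical, scale-specific value that would have to be taken from the inventory manual.

```python
# Ratings: never = 0, sometimes = 1, often = 2, very often = 3.
VALID_RATINGS = {0, 1, 2, 3}
# Assumption: a symptom is counted as present when rated "often" or "very often".
SYMPTOM_PRESENT_MIN = 2


def _validate(ratings):
    if any(r not in VALID_RATINGS for r in ratings):
        raise ValueError("ratings must be 0, 1, 2, or 3")


def dimensional_score(ratings):
    """Sum of the item ratings: the frequency/severity score used in this study."""
    _validate(ratings)
    return sum(ratings)


def symptom_count(ratings):
    """Number of items rated at or above the assumed symptom threshold."""
    _validate(ratings)
    return sum(1 for r in ratings if r >= SYMPTOM_PRESENT_MIN)


def categorical_screen(ratings, cutoff):
    """True if the symptom count reaches `cutoff` (a hypothetical value)."""
    return symptom_count(ratings) >= cutoff


# Hypothetical responses to an 8-item scale such as ODD.
odd = [0, 1, 2, 3, 2, 0, 1, 3]
print(dimensional_score(odd), symptom_count(odd), categorical_screen(odd, cutoff=4))  # prints 12 4 True
```

The dimensional score keeps the full frequency information, which is one reason it suits correlational analyses such as those reported here.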
Substance Use
Substance use was assessed with modified items from the Measures of Quebec Adolescents’ Social and Personal Adjustment (Le Blanc, 1996). This instrument assesses several dimensions of adolescent adjustment and was validated on a large representative sample of the Quebec adolescent population. Adolescents were asked how often they had used each of eight substances during the past 12 months. The items correspond to major psychoactive substance categories, namely (a) “drank alcohol to get drunk (not just for tasting or for a dinner),” (b) “smoked cannabis (marijuana, pot, hash, weed),” (c) “inhaled substances such as glue, gas, paint stripper, Dust-Off or other chemical substances,” (d) “used stimulants such as Ecstasy, MDMA, XTC, etc.,” (e) “used other stimulants such as methamphetamines (‘meth’), cocaine, crack, or crystal meth (‘Ice’),” (f) “used hallucinogens such as LSD, acids, PCP, mescaline, ‘angel dust,’ magic mushrooms, etc.,” (g) “used tranquilizers such as GHB, Valium, Librium, Ativan, Xanax, etc.,” and (h) “used analgesics such as heroin, morphine, opium, codeine, Oxycontin, Demerol, etc.” All items were rated on a 4-point frequency scale (never = 0, rarely (1 or 2 times) = 1, often = 2, and very often = 3). Internal consistency is adequate in this sample, with a coefficient of .86.
Grade Point Average (GPA)
The GPA scale was computed using grades from the official final report cards transmitted by the schools. The courses in the curriculum are French (native language), English (second language), sciences, mathematics, history, geography, arts, physical education, and ethics and religions. Grades can vary from 0 to 100, with 60 as the passing grade, as per the norms in effect in the Canadian province of Quebec. Internal consistency is .91 in this sample.
Statistical Analyses
All analyses were conducted using Mplus Version 6.12 (Muthén & Muthén, 2010). Unless otherwise noted, all analyses were conducted using the robust maximum likelihood estimator (MLR), which provides adjusted standard errors and statistical fit tests that are robust to nonnormality in the data. Confidence intervals (95%) were calculated and reported to provide information regarding the precision of all relevant parameter estimates (Cumming, 2012).
Content validity
To assess content validity, six judges with expertise in both personality theory and the construction of personality measures were asked to rate each item for its relevance to, or representativeness of, its target Big Five scale.³ The raters used a 4-point response scale: not at all representative = 0, somewhat representative = 1, very representative = 2, extremely representative = 3. From these ratings, a Content Validity Index (Polit & Beck, 2006) can be computed for individual items (I-CVI) and for scales (S-CVI). The I-CVI is the number of raters who gave a rating of 2 or 3 divided by the total number of raters. With six or more raters, an item is considered relevant if its I-CVI is .78 or higher (Polit & Beck, 2006). The S-CVI is the sum of the I-CVI values of the items forming a scale divided by the number of items (i.e., the average I-CVI of the scale). With six or more raters, a scale is considered to have adequate content validity if its S-CVI is at least .80, and preferably .90 or higher (Polit & Beck, 2006).
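Both indices can be computed directly from a raters-by-items table. The sketch below follows the definitions above, with a rating of 2 or 3 counted as relevant and the S-CVI taken as the average I-CVI; the ratings are hypothetical. For instance, a ten-item scale on which eight items are judged relevant by all six raters and two by five of six yields an S-CVI of 29/30, about .967.

```python
RELEVANT_MIN = 2  # "very representative" (2) or "extremely representative" (3)


def item_cvi(ratings):
    """I-CVI: proportion of raters who rated the item 2 or 3."""
    return sum(1 for r in ratings if r >= RELEVANT_MIN) / len(ratings)


def scale_cvi(ratings_by_item):
    """S-CVI/Ave: I-CVIs of the scale's items summed and divided by the number of items."""
    i_cvis = [item_cvi(r) for r in ratings_by_item]
    return sum(i_cvis) / len(i_cvis)


def item_is_relevant(ratings, criterion=0.78):
    """Item criterion of Polit & Beck (2006) for panels of six or more raters."""
    return item_cvi(ratings) >= criterion


# Hypothetical ratings from six judges for a ten-item scale:
# eight items rated relevant by all six judges, two items by five of six.
full = [3, 3, 2, 3, 2, 3]
five_of_six = [3, 2, 2, 3, 1, 3]
scale = [full] * 8 + [five_of_six] * 2
print(round(scale_cvi(scale), 3))            # (8 * 1 + 2 * 5/6) / 10 -> 0.967
print(item_is_relevant([3, 2, 2, 3, 1, 1]))  # 4/6 = .667 < .78 -> False
```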
Factor validity
For all factor analyses presented hereafter, the items were treated as continuous. This was justified by recent simulation studies suggesting that Likert-type scales with five or more response categories can be treated as continuous without producing biased parameter estimates (Beauducel & Herzberg, 2006; Rhemtulla, Brosseau-Liard, & Savalei, 2012). Moreover, following Marsh et al. (2010), all factor models were estimated with and without a priori correlated uniquenesses (CUs), which reflect the fact that some items relate to the same primary trait (or subdomain), share similar content (but are reverse-keyed), or share the same word.⁴ This was necessary because CUs are common in personality trait measures and failing to include them can lead to biased parameter estimates (Marsh et al., 2010; Marsh & Hau, 1996). As noted by Marsh et al. (2010), ex post facto CUs should generally be avoided, but in the case of Big Five personality trait measures, many CUs, such as those included in this study, are theoretically or conceptually defensible.
Factor validity was assessed using two types of models. The a priori five-factor structure was first tested with independent clusters model confirmatory factor analysis (ICM-CFA). The initial ICM-CFA model (M1) specified that responses to the 50 items are explained by five correlated factors, with each item having a nonzero loading on its target factor and zero loadings on all other factors, and with uncorrelated item uniquenesses. However, a number of researchers have noted that a good-fitting ICM-CFA model is rarely, if ever, obtained for personality trait structures measured with multiple items per factor (e.g., Church & Burke, 1994; Marsh et al., 2010; McCrae, Zonderman, Costa, Bond, & Paunonen, 1996; J. D. A. Parker, Bagby, & Summerfeldt, 1993). One possible explanation is that ICM-CFA imposes a simple structure (i.e., a nonzero loading on the target factor and all cross-loadings fixed to 0), whereas the Big Five personality trait structure was originally uncovered using exploratory factor analysis (EFA), which estimates all possible factor loadings. In other words, an EFA specification is arguably closer to the true model. Consequently, factor validity was also tested using exploratory structural equation modeling (ESEM; Asparouhov & Muthén, 2009; Morin, Marsh, & Nagengast, 2013). An ESEM model closely resembles EFA but has some important advantages. In addition to allowing all items to load on all factors, it (a) provides standard errors for all parameters so that their statistical significance can be tested, (b) allows correlated item uniquenesses to be included, (c) allows multiple-group measurement invariance tests to be conducted, and (d) provides the fit indices and statistical tests routinely available in CFA.
The a priori ESEM five-factor model (M2) was specified so that all target loadings and cross-loadings are freely estimated, subject to the minimal constraints imposed on the unrotated solution for identification purposes (for technical details on identification, see Asparouhov & Muthén, 2009; Morin et al., 2013). One potential drawback of ESEM is rotational indeterminacy: the same model can yield different solutions depending on the chosen rotation method (for technical details on rotation issues, see Asparouhov & Muthén, 2009; Morin et al., 2013; Morin & Maïano, 2011). Different rotations can therefore produce different factor loading patterns, which in turn affect the size of the factor correlations, even though all rotated solutions imply the same covariance matrix and thus provide the same fit to the data. In this study, six rotations available in Mplus were tested: Geomin (the Mplus default), Geomin with an epsilon value of .5 (a value recommended in previous ESEM studies of Big Five measures; Marsh et al., 2010), CF-varimax, CF-quartimax, CF-equamax, and target rotation (in which all loadings on nontarget factors are targeted toward 0).
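Why the choice of rotation changes the loadings and factor correlations but not the fit can be shown numerically. In the sketch below (illustrative values only, not estimates from this study), an oblique transformation T maps a two-factor solution (Λ_A, Φ_A) to Λ_B = Λ_A(T′)⁻¹ and Φ_B = T′Φ_A T. The common part of the implied covariance matrix, ΛΦΛ′, is unchanged, and because the uniquenesses are untouched, so is the full implied matrix. Rotation criteria such as Geomin or target rotation differ only in which T they select.

```python
# Two rotated solutions of the same two-factor model imply the same covariances.
# All numbers are illustrative; T is an arbitrary oblique transformation.


def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b))) for j in range(len(b[0]))]
            for i in range(len(a))]


def transpose(a):
    return [list(col) for col in zip(*a)]


def inv2(m):
    """Inverse of a 2 x 2 matrix."""
    (a, b), (c, d) = m
    det = a * d - b * c
    return [[d / det, -b / det], [-c / det, a / det]]


def common_cov(lam, phi):
    """Common-factor part of the implied covariance matrix: Lambda Phi Lambda'."""
    return matmul(matmul(lam, phi), transpose(lam))


# Solution A: loadings with uncorrelated factors.
lam_a = [[0.7, 0.1], [0.6, 0.2], [0.1, 0.8], [0.2, 0.5]]
phi_a = [[1.0, 0.0], [0.0, 1.0]]

# Oblique transformation whose columns have unit length, so Phi_B = T' Phi_A T
# keeps unit diagonals and remains a correlation matrix.
T = [[1.0, 0.6], [0.0, 0.8]]
phi_b = matmul(matmul(transpose(T), phi_a), T)
lam_b = matmul(lam_a, inv2(transpose(T)))  # Lambda_B = Lambda_A (T')^-1

sigma_a = common_cov(lam_a, phi_a)
sigma_b = common_cov(lam_b, phi_b)
max_diff = max(abs(x - y) for ra, rb in zip(sigma_a, sigma_b) for x, y in zip(ra, rb))
print("factor correlation A: 0.0, B:", round(phi_b[0][1], 3))
print("largest implied-covariance difference:", max_diff)
```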
Once the best factor model was identified, its measurement invariance across genders was evaluated using a series of increasingly stringent multiple-group models (see Meredith, 1993; Morin & Maïano, 2011; Vandenberg & Lance, 2000): configural invariance (MG1; all loadings, intercepts, and uniquenesses are freely estimated, with the latent variances constrained to 1 and the latent means constrained to 0), metric invariance (MG2; loadings constrained to invariance, which allows the factor variances to be freely estimated in one group), scalar invariance (MG3; intercepts constrained to invariance, which allows the factor means to be freely estimated in one group), strict invariance (MG4; uniquenesses constrained to equality), CU invariance (MG5), variance/covariance invariance (MG6; in ESEM, the factor variances and covariances must be constrained simultaneously), and latent means invariance (MG7). In this sequence, the constraints are cumulative, and each model is compared with the preceding one as its reference.
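The cumulative logic of this sequence, in which each model keeps all previous equality constraints and serves as the reference for the next, can be written down as simple bookkeeping. The sketch only tracks which parameter sets are constrained across genders; estimation itself was done in Mplus, and the labels are descriptive names rather than Mplus syntax.

```python
# Parameter sets added to the equality constraints at each step (MG1-MG7).
ADDED_CONSTRAINTS = [
    ("MG1 configural", set()),
    ("MG2 metric", {"loadings"}),
    ("MG3 scalar", {"intercepts"}),
    ("MG4 strict", {"uniquenesses"}),
    ("MG5 CU", {"correlated uniquenesses"}),
    ("MG6 variance/covariance", {"factor variances", "factor covariances"}),
    ("MG7 latent means", {"latent means"}),
]


def invariance_sequence():
    """Yield (model, cumulative constraints, reference model) for each step."""
    constrained = set()
    previous = None
    for name, added in ADDED_CONSTRAINTS:
        constrained = constrained | added  # constraints are additive
        yield name, frozenset(constrained), previous
        previous = name


for model, constraints, reference in invariance_sequence():
    print(f"{model:<26} vs {reference or '-':<26} {sorted(constraints)}")
```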
The assessment of model fit was based on various indices (West, Taylor, & Wu, 2012). The chi-square test was computed for all models; a nonsignificant chi-square suggests a good-fitting model. However, because this test is known to be overly sensitive to large sample sizes, to minor departures from multivariate normality, and to minor (substantively irrelevant) model misspecifications, additional fit indices were given more weight: the comparative fit index (CFI), the Tucker–Lewis index (TLI), the root mean square error of approximation (RMSEA) and its 90% confidence interval, and the standardized root mean square residual (SRMR). Values of .95 or above for the CFI and TLI, .06 or below for the RMSEA, and .08 or below for the SRMR have been suggested as indicative of good model fit (Hu & Bentler, 1999). Because these criteria are often considered too restrictive for many applications, values of .90 or above for the CFI and TLI, .08 or below for the RMSEA, and .10 or below for the SRMR were taken to indicate acceptable fit (Bentler, 1990; Marsh, Hau, & Wen, 2004). For the RMSEA 90% CI, a lower bound below .05 and an upper bound below .08 suggest acceptable fit (MacCallum, Browne, & Sugawara, 1996).
To assess changes in model fit across the invariance tests, the Satorra–Bentler scaled chi-square difference test (Satorra, 2000) was computed for all multiple-group models; this test takes into account the scaling correction used in MLR estimation. However, because chi-square tests are also sensitive to sample size and to minor departures from multivariate normality, changes in other fit indices were examined more closely. Cheung and Rensvold (2002) suggested using the decrease in CFI: values below .01 indicate that the invariance hypothesis should not be rejected, values between .01 and .02 suggest possible non-invariance, and values above .02 support rejection of the invariance hypothesis (Cheung & Rensvold, 2002; Vandenberg & Lance, 2000). Chen (2007) suggested using the increase in RMSEA, with values below .015 indicating that the invariance hypothesis should not be rejected.
Despite these suggested criteria, it is important to note that very few applications of ESEM have yet tested complex factor structures with 50 items or more, and the adequacy of the aforementioned fit indices and cutoff scores has not been rigorously evaluated for such models (Marsh et al., 2010; Morin et al., 2013). The proposed cutoff values should thus be considered useful but rough guidelines in an ESEM context.
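To show how the incremental and absolute indices relate to the chi-square, the sketch below implements the common maximum likelihood formulas for RMSEA, CFI, and TLI, together with the cutoffs listed above. These are textbook formulas under stated assumptions: with MLR, software substitutes scaled chi-square values, and some programs use N rather than N − 1 in the RMSEA, so results may differ slightly from program output. The SRMR and the RMSEA confidence interval are not computed here, and all input values are hypothetical.

```python
import math


def rmsea(chi2, df, n):
    """RMSEA = sqrt(max(chi2 - df, 0) / (df * (n - 1)))."""
    return math.sqrt(max(chi2 - df, 0.0) / (df * (n - 1)))


def cfi(chi2_m, df_m, chi2_b, df_b):
    """CFI from the tested model (m) and the baseline independence model (b)."""
    d_m = max(chi2_m - df_m, 0.0)
    d_b = max(chi2_b - df_b, d_m)
    return 1.0 if d_b == 0 else 1.0 - d_m / d_b


def tli(chi2_m, df_m, chi2_b, df_b):
    """Tucker-Lewis index (non-normed fit index)."""
    ratio_b = chi2_b / df_b
    return (ratio_b - chi2_m / df_m) / (ratio_b - 1.0)


def fit_label(cfi_v, tli_v, rmsea_v, srmr_v):
    """Apply the 'good' and 'acceptable' cutoffs described in the text."""
    if cfi_v >= .95 and tli_v >= .95 and rmsea_v <= .06 and srmr_v <= .08:
        return "good"
    if cfi_v >= .90 and tli_v >= .90 and rmsea_v <= .08 and srmr_v <= .10:
        return "acceptable"
    return "poor"


# Hypothetical values: model chi2 = 250 (df = 100), baseline chi2 = 2000 (df = 120), N = 501.
r, c, t = rmsea(250, 100, 501), cfi(250, 100, 2000, 120), tli(250, 100, 2000, 120)
print(round(r, 3), round(c, 3), round(t, 3), fit_label(c, t, r, srmr_v=0.05))
```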
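The scaled difference test cannot be obtained by simply subtracting two MLR chi-square values; each model’s scaling correction factor must be used. The sketch below implements the commonly used scaled-difference formula for MLR output (t is the reported scaled chi-square, c its scaling correction factor, d its degrees of freedom) together with the ΔCFI and ΔRMSEA rules described above. The numbers are hypothetical, and the resulting statistic would be referred to a chi-square distribution with d0 − d1 degrees of freedom.

```python
def scaled_chi2_difference(t0, c0, d0, t1, c1, d1):
    """Scaled chi-square difference for nested models estimated with MLR.

    Model 0 is the more constrained (nested) model, model 1 the less
    constrained reference model. Returns (difference statistic, df).
    """
    if d0 <= d1:
        raise ValueError("model 0 must be nested in model 1 (more df)")
    cd = (d0 * c0 - d1 * c1) / (d0 - d1)  # scaling correction for the difference
    if cd <= 0:
        raise ValueError("nonpositive difference-test scaling correction")
    return (t0 * c0 - t1 * c1) / cd, d0 - d1


def delta_cfi_decision(cfi_reference, cfi_constrained):
    """Reading of a CFI decrease following Cheung & Rensvold (2002)."""
    drop = cfi_reference - cfi_constrained
    if drop < .01:
        return "invariance not rejected"
    if drop <= .02:
        return "possible non-invariance"
    return "invariance rejected"


def delta_rmsea_ok(rmsea_reference, rmsea_constrained):
    """Chen (2007): an RMSEA increase below .015 does not reject invariance."""
    return rmsea_constrained - rmsea_reference < .015


# Hypothetical MLR output for a constrained model (0) and its reference model (1).
trd, ddf = scaled_chi2_difference(t0=300.0, c0=1.20, d0=110, t1=250.0, c1=1.25, d1=100)
print(round(trd, 2), ddf)
```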
Reliability
Reliability of the scales was estimated using different statistical indices. First, the traditional Cronbach’s alpha coefficient was computed. However, this coefficient is recognized as a limited estimator of reliability because it is accurate only if three conditions are met: the items are strongly unidimensional, the items are essentially tau-equivalent (i.e., they have equal loadings on the common factor), and there are no correlated errors or uniquenesses (Raykov, 1998). Therefore, the latent variable model composite reliability, denoted by rho (ρ), suggested by Raykov (1997, 2012) was also computed. Composite reliability is the ratio of true-score variance to total scale-score variance and thus corresponds to the traditional definition of reliability in classical test theory. The latent variable approach to reliability tends to provide less biased estimates than Cronbach’s alpha (Raykov, 2012). Moreover, given that failure to account for correlated item uniquenesses can lead to systematic overestimation of reliability, ρ conditional on CUs was also computed (Raykov, 2001, 2012). Given their importance in determining estimate precision, 95% confidence intervals were computed for all ρ estimates. Finally, because the main objective of this study was to assess whether scales with added items provide different reliability estimates, the method proposed by Raykov (2007b, 2012) for estimating the difference between two composite reliability coefficients was used. Using the MODEL CONSTRAINT option in Mplus, composite reliability was specified for both scales, and a change parameter (Delta) was then computed along with a formal statistical significance test. This method also allows computation of a 95% confidence interval for the delta parameter.
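The contrast between α and ρ, and the effect of CUs on ρ, can be sketched with the usual formulas. The composite reliability function assumes a single-factor (congeneric) model for a unit-weighted sum score, so it ignores the cross-loadings of an ESEM solution, and it returns point values only; the Δρ tests and confidence intervals reported in this study were obtained in Mplus. The loadings and the CU value below are hypothetical.

```python
from statistics import variance


def cronbach_alpha(items):
    """Cronbach's alpha; `items` holds one list of scores per item (same respondents)."""
    k = len(items)
    totals = [sum(person) for person in zip(*items)]
    item_var_sum = sum(variance(scores) for scores in items)
    return k / (k - 1) * (1 - item_var_sum / variance(totals))


def composite_reliability(loadings, uniquenesses, factor_variance=1.0, cu_covariances=()):
    """Composite reliability (rho) of the unit-weighted sum of congeneric items.

    rho = (sum lambda)^2 phi / [(sum lambda)^2 phi + sum theta_i + 2 sum theta_ij],
    where theta_ij are the correlated-uniqueness covariances (each pair listed once).
    """
    true_var = sum(loadings) ** 2 * factor_variance
    error_var = sum(uniquenesses) + 2 * sum(cu_covariances)
    return true_var / (true_var + error_var)


# Hypothetical standardized solution: four items loading .70 (uniqueness .51).
lam, theta = [0.7] * 4, [0.51] * 4
rho = composite_reliability(lam, theta)
rho_cu = composite_reliability(lam, theta, cu_covariances=[0.10])  # one positive CU
print(round(rho, 3), round(rho_cu, 3), round(rho_cu - rho, 3))
```

With a positive CU in the model, the shared unique variance is moved from the true-score side to the error side, which is why ρ conditional on CUs comes out lower.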
Convergent and criterion validities
For convergent validity, the scales were correlated with their corresponding scales from the NEO-PI-3 (McCrae & Costa, 2010), while for criterion validity, they were correlated with several outcome scales: three externalizing psychopathology scales, four internalizing psychopathology scales, one substance use scale, and one academic achievement scale. To determine whether the added-item scales yield higher correlations than the original-content scales (i.e., improved convergent and criterion validity), the structural equation modeling approach suggested by Raykov (2007a, 2012) for comparing the correlations of scales with differing numbers of items was used. Using the MODEL CONSTRAINT option in Mplus, both correlations were specified, and a correlation change parameter (Delta) was then computed along with a formal statistical significance test. This method also allows computation of a 95% confidence interval for the delta parameter.
Results
Content Validity
In general, the six experts’ ratings point to adequate content validity. No item received a rating of 0 (“not at all representative”). At the item level, 76% of the items (38 of 50) showed perfect agreement (I-CVI = 1). All the other items have an I-CVI above the recommended criterion of .78, except three items with an I-CVI of .667 (4 of 6 raters): Items 42 (Sensation Seeking), 45 (Low Self-Worth), and 48 (Machiavellianism). Nonetheless, four of the six raters judged these three items, all of which are newly added, to be “very representative.” At the scale level, all scales received highly favorable ratings, with every S-CVI above the preferred criterion of .90: .966 for Conscientiousness; .949 for Openness, Extraversion, and Emotional Stability; and .933 for Agreeableness.
Factor Validity
The goodness-of-fit statistics from the different factor analytic models are presented in Table 1. All indices indicate that the ICM-CFA model (M1) clearly does not fit the data. Adding a priori CUs (M1b) significantly improved the fit, but the model still fit poorly. Fitting an ESEM model (M2) substantially improved fit over the ICM-CFA model, as indicated by the large Δχ², ΔCFI, and ΔRMSEA. After the results from all the tested rotations were examined, target rotation was selected because it provided a somewhat clearer factor loading pattern (i.e., target loadings tended to be slightly larger and cross-loadings smaller than with the other rotation methods), at the cost of only slightly higher factor correlations.⁵ The fit of this model, however, remained unacceptable because the CFI and TLI values were below the acceptable criterion. Adding a priori CUs (M2b) again significantly improved the fit to the data. In contrast to the preceding models, this ESEM model with CUs shows fit indices all in the acceptable range, with CFI and TLI above .90, and RMSEA and SRMR below .06.
Goodness-of-Fit Statistics From the Confirmatory Factor Analytic and Exploratory Structural Equation Models.
Note. ICM-CFA = independent clusters model confirmatory factor analysis; ESEM = exploratory structural equation modeling; χ² = chi-square; df = degrees of freedom; CFI = comparative fit index; TLI = Tucker–Lewis index; RMSEA = root mean square error of approximation; 90% CI = 90% confidence interval of the RMSEA; SRMR = standardized root mean square residual; Ref = reference model; ΔSχ² = Satorra–Bentler scaled chi-square difference test; Δdf = change in degrees of freedom; ΔCFI = change in CFI; ΔRMSEA = change in RMSEA; λ = factor loadings; τ = intercepts; δ = uniquenesses; ξ = factor variances; φ = factor covariances; η = factor means.
*p < .01.
Table 2 presents the standardized factor loadings from the ESEM model with CUs (M2b). Most target loadings are substantial and statistically significant. Still, 10 of the 50 target loadings are below .30, although these are also statistically significant. Standard errors tend to be small for all items, including those with low loadings, and none of the 95% confidence intervals of the target loadings includes 0. Apart from the 50 target loadings, there are 105 statistically significant cross-loadings: 58 significant at p < .001 and 47 significant only at p < .05.
Standardized Factor Loadings From the Exploratory Structural Equation Model of the BFPTSQ Items.
Note. Shaded entries are the target loading items. Item numbers with an r are reverse scored. λ = factor loadings; δ = uniquenesses; 95% CI = 95% confidence interval.
*p < .05. **p < .001.
Examination of each factor shows that for Openness, the anchor item (21) taps ingenuity and creativity. The new Openness to Cultural Diversity item (31r), although it is clearly statistically related to Openness, has a somewhat low standardized loading. As expected, this item has a significant cross-loading on Agreeableness. For Extraversion, the anchor item (22r) taps talkativeness and liveliness. The new Sensation Seeking (42) and Joyfulness (47) items are clearly statistically related to Extraversion, but also tend to have small loadings. As expected, Item 47 has a significant cross-loading on Emotional Stability. For Agreeableness, the anchor item (28r) taps warmth and kindness. The new Machiavellianism item (48r) is strongly related to Agreeableness and, in fact, has one of the highest loadings on that factor. For Conscientiousness, the anchor item (4) taps thoroughness or diligence. The new Impulsiveness item (49r) is statistically related to Conscientiousness, but has a somewhat low loading. Moreover, this is a complex item because it has significant cross-loadings on all other factors, particularly on Agreeableness. Finally, for Emotional Stability, the anchor item (40r) taps nervousness. The new Low Self-Worth (45r) and Irritability (50r) items are clearly related to Emotional Stability and are in fact items with some of the highest loadings on that factor. As expected, Item 45r has a significant cross-loading on Extraversion, as does Item 50r on Agreeableness.
Table 3 presents the latent factor correlations and their 95% confidence intervals from the ICM-CFA and ESEM models. As expected, the factor correlations from ESEM are substantially smaller than those from ICM-CFA. While the average absolute factor correlation for ICM-CFA is .27 (SD = .17), it is .13 (SD = .08) for ESEM. For example, in both models, the largest correlation is between Agreeableness and Conscientiousness; it is .605 in ICM-CFA but decreases to .348 in ESEM. The second largest correlation in both models is between Openness and Extraversion; it is .447 in ICM-CFA but decreases to .269 in ESEM. Interestingly, the lowest and only nonsignificant correlation in ICM-CFA is between Openness and Emotional Stability, but it becomes negative and significant in ESEM.
Point and Interval Estimates of Factor Correlations of the BFPTSQ.
Note. Latent factor correlations from the final exploratory structural equation model (ESEM) are presented above the diagonal, while latent correlations from the independent clusters model confirmatory factor analysis (ICM-CFA) are presented below the diagonal. φ = factor covariance/correlation; 95% CI = 95% confidence interval.
*p < .05. **p < .001.
The goodness-of-fit statistics from the gender invariance tests are presented in Table 1. Fitting the ESEM model with all parameters freely estimated for males and females (MG1) provides an acceptable fit to the data. Constraining all factor loadings to equality did not significantly worsen the fit (MG2). This result is noteworthy because this test constrains many more parameters (i.e., all target loadings and cross-loadings) than the equivalent test in ICM-CFA. Constraining intercepts to equality across genders, however, suggested possible non-invariance (MG3): the scaled chi-square difference test is significant, and the change in CFI is at the upper limit of the cutoff criterion. Therefore, a model with partial invariance of intercepts (MG3b) was estimated (Byrne, Shavelson, & Muthén, 1989). Based on modification indices, the intercepts of four items were freely estimated across groups: females had higher intercepts for Items 26, 41, and 47 and a lower intercept for Item 40. This model provided a better fit than the model with fully invariant intercepts. Constraining item uniquenesses to equality across genders did not lead to a meaningful worsening in model fit (MG4): even though the scaled chi-square difference test is significant, the change in CFI is below the cutoff and the change in RMSEA is trivial. Constraining the CUs to equality across genders did not significantly worsen fit (MG5), suggesting that these parameters are not gender-specific. Overall, although the CFI and TLI gradually decreased as equality constraints were added, the RMSEA remained in the acceptable range for all models, even when its confidence intervals were considered. Together, these results reasonably suggest that the BFPTSQ factor structure is invariant across genders, with partial invariance limited to the intercepts of four items.
Constraining the factor variances and covariances to equality across genders did not significantly worsen fit (MG6). As expected, however, constraining the latent factor means to equality across genders did significantly worsen fit (MG7). Examination of the modification indices and of the results of the preceding model (MG6) revealed several significant mean differences between males and females. With the males’ means fixed to 0 and the variances fixed to 1 in both groups (which allows these differences to be interpreted as standardized effect sizes), the females’ latent means were significantly lower in Emotional Stability (−.755, p < .0001), significantly higher in Agreeableness (.578, p < .0001) and Conscientiousness (.344, p < .001), and marginally higher in Openness (.158, p < .10). There was no significant gender difference in Extraversion.
Reliability
The point and interval estimates of reliability and the comparison tests between reliability estimates (Δρ) are presented in Table 4. In general, all estimates suggest that the BFPTSQ scales have adequate reliability. Cronbach’s α and ρ without CUs tend to be similar across all scales. As expected, however, the ρ estimates conditional on CUs are all noticeably lower, falling below .70 for Openness and Agreeableness. Across indices, the estimates tend to be somewhat higher for Extraversion and Emotional Stability and lowest for Openness and Agreeableness.
Point and Interval Reliability Estimates of the BFPTSQ Scales.
Note. α = Cronbach’s alpha internal consistency coefficient; ρ = latent variable model composite reliability estimate; 95% CI = 95% confidence interval; Δρ = delta parameter (composite reliability difference estimate); CU = correlated uniqueness.
*p < .05. **p < .001.
Overall, the delta parameters (Δρ) from the comparison tests between the original-content and added-item scales suggest that both versions provide similar reliability estimates, a conclusion consistent with the overlapping confidence intervals of all composite reliability estimates. The newly added items therefore do not appear to be detrimental to scale reliability, even though a number of them show low factor loadings. More specifically, when ρ without CUs is considered, the added items significantly decrease reliability for Openness and Extraversion and significantly increase it for Agreeableness and Emotional Stability. However, when ρ conditional on CUs is considered (arguably closer to the true model), all differences between reliability estimates are nonsignificant, except for Agreeableness, for which the added item still increases reliability.
Convergent Validity
The point and interval correlation estimates between the BFPTSQ and NEO-PI-3 scales as well as the comparison tests between correlations (Δr) are presented in Table 5. The overall pattern of correlations suggests adequate convergent validity. All correlations between the broad-trait scales are high, mostly around .70. For the scales with added items, they range from .629 for Openness to .804 for Conscientiousness. The correlations between the BFPTSQ scales and the corresponding NEO-PI-3 primary-trait scales are generally lower, but all are clearly significant. The only exception is the correlation between BFPTSQ Agreeableness and Modesty, which is quite low and even nonsignificant for the original-content scale.
Point and Interval Correlation Estimates Between BFPTSQ and NEO-PI-3 Scales.
Note. In this table, each Big Five personality trait is correlated with its corresponding higher-order and primary-trait scales from the NEO-PI-3. 95% CI = 95% confidence interval; Δr = delta parameter (correlation difference estimate).
Because Emotional Stability is the opposite pole of Neuroticism, all correlations with Neuroticism and its primary traits from the NEO-PI-3 are negative; they are presented in absolute value to simplify the table.
*p < .05. **p < .001.
Overall, the delta parameters (Δr) from the comparison tests between the original-content and added-item scales suggest that the added items significantly improved convergent validity. Indeed, all correlations between the broad Big Five trait scales are significantly higher for the added-item scales, although the confidence intervals of each correlation pair overlap somewhat. For the primary-trait scales, the expected pattern is observed: for Extraversion, the increase in correlation is largest for Excitement Seeking and Positive Emotions; for Agreeableness, it is largest for Straightforwardness; for Conscientiousness, it is largest for Deliberation; and for Emotional Stability, it is largest for Depression, Self-Consciousness, and Angry Hostility. Several other correlations with primary-trait scales also increased significantly, which is unsurprising because all primary traits share common variance with their corresponding broad Big Five trait. Again, the confidence intervals of all correlation pairs overlap somewhat.
Criterion Validity
The point and interval correlation estimates between the BFPTSQ and outcome scales as well as the comparison tests between correlations (Δr) are presented in Table 6. The pattern of correlations between the Big Five and outcome scales generally suggests adequate criterion validity. Openness is related to lower scores on CD and to higher scores on BD and GAD. As expected, Openness is most strongly related to higher GPA. Extraversion is related to higher scores on ADHD, CD, BD, and substance use, but to lower scores on MDD and SOP. As anticipated, Extraversion is most strongly related to lower scores on SOP and to higher scores on ADHD. Agreeableness is related to lower scores on all psychopathology scales and substance use, but to higher GPA. As predicted, Agreeableness is most strongly related to the three externalizing psychopathology scales, particularly CD and ODD. Conscientiousness is also related to lower scores on all psychopathology scales and substance use, but to higher GPA. As hypothesized, Conscientiousness is most strongly related to higher GPA and to lower scores on ADHD. Emotional Stability is related to lower scores on all psychopathology scales, except CD and substance use, and is not related to GPA. As expected, Emotional Stability is most strongly related to the four internalizing psychopathology scales, particularly MDD, GAD, and SOP.
Point and Interval Estimates of Correlations Between BFPTSQ and Outcomes Scales.
Note. All psychopathology scales are dimensional scores representing the sum of the frequency ratings across all items. ADHD = Attention-Deficit/Hyperactivity Disorder; CD = Conduct Disorder; ODD = Oppositional Defiant Disorder; MDD = Major Depressive Disorder; BP = Bipolar Disorder; GAD = Generalized Anxiety Disorder; SOP = Social Phobia; SUBS = Substance Use; GPA = Grade Point Average; 95% CI = 95% confidence interval; Δr = delta parameter (correlation difference estimate).
p < .10. *p < .05. **p < .001.
Overall, the delta parameters (Δr) resulting from the comparison tests between the original-content scales and those with added items suggest that the new scales significantly improved criterion validity. For Openness, the added-item scale increases the negative relation with CD (i.e., it becomes more negative), but reduces the positive relation with BD. For Extraversion, the added-item scale increases the positive correlations with ADHD, ODD, and BD, but reduces the negative relations with MDD and SOP. For Agreeableness, the added-item scale increases the negative relations with ADHD, CD, ODD, and BD, but does not change the relations with internalizing psychopathology scales or GPA. For Conscientiousness, the added-item scale increases the negative relations with all psychopathology scales, but does not change the relation with GPA. Finally, for Emotional Stability, the added-item scale increases the negative relations with ADHD, ODD, MDD, BD, and SOP. However, the confidence intervals of all correlation pairs overlap somewhat.
Discussion
The general objective of this study was to evaluate the construct validity of adolescents’ self-reported personality traits. In accordance with the view of construct validity as a unifying form of validity requiring the integration of different complementary sources of information (Messick, 1995; Simms & Watson, 2007), content validity, factor validity, convergent validity, criterion validity, and reliability were evaluated. A secondary objective was to evaluate the potentially beneficial impact of increasing the conceptual breadth of scales from a short personality measure. Starting with an item pool from an existing measure, the language level was first adjusted for use with adolescents, and items tapping fundamental primary personality traits that were missing from this pool were added to each of the Big Five scales. This led to a new measure which was named the Big Five Personality Trait Short Questionnaire (BFPTSQ). Overall, the results of this study supported the construct validity of adolescents’ self-reported Big Five personality traits. The results also support the idea that adding conceptual breadth in a short personality trait measure can have a significant positive impact on some of its psychometric properties.
Construct Validity of the BFPTSQ
For the evaluation of content validity, experts in personality theory and questionnaire construction were asked to rate the adequacy or representativeness of the items. All items were identified as valid indicators of their target trait according to the usual criteria (Polit & Beck, 2006), which suggests that the BFPTSQ scales have adequate content validity. However, three items, all newly added ones, were identified as somewhat less representative, namely Items 42 (Sensation Seeking), 45 (Low Self-Worth), and 48 (Machiavellianism). This suggests that even though most scholars would agree that these items represent important personality traits, there appears to be no clear consensus among experts as to their place within the Big Five taxonomy.
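Polit and Beck (2006) describe content validity indices computed from expert relevance ratings. The sketch below assumes a 4-point relevance scale on which ratings of 3 or 4 count as relevant, and uses the benchmarks commonly associated with that work (an item-level CVI of at least .78 with six or more experts, and an average scale-level CVI of at least .90). The article does not report its rating format, so the scale, the ratings, and the item names are all hypothetical.

```python
# Hypothetical ratings: one row per item, one column per expert, on an
# assumed 4-point relevance scale (1 = not relevant ... 4 = highly relevant).
RATINGS = {
    "item_a": [4, 4, 3, 4, 3, 4],
    "item_b": [4, 3, 2, 4, 3, 4],
    "item_c": [3, 2, 4, 2, 3, 4],
}

I_CVI_MIN = 0.78      # common item-level benchmark with six or more experts
S_CVI_AVE_MIN = 0.90  # common scale-level (average) benchmark


def item_cvi(ratings, relevant=(3, 4)):
    """Proportion of experts who rate the item as relevant (3 or 4)."""
    return sum(r in relevant for r in ratings) / len(ratings)


def scale_cvi_ave(ratings_by_item):
    """S-CVI/Ave: the mean of the item-level CVIs."""
    cvis = [item_cvi(r) for r in ratings_by_item.values()]
    return sum(cvis) / len(cvis)


for name, ratings in RATINGS.items():
    cvi = item_cvi(ratings)
    flag = "" if cvi >= I_CVI_MIN else "  <- below benchmark"
    print(f"{name}: I-CVI = {cvi:.2f}{flag}")
print(f"S-CVI/Ave = {scale_cvi_ave(RATINGS):.2f} (benchmark {S_CVI_AVE_MIN})")
```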
The tests of the BFPTSQ factor validity revealed five notable results. First, the Big Five factor structure was well recovered in a sample of French Canadian adolescents. There are still few published item-level factor analyses of the Big Five structure with adolescents' self-reports, but the results of this study tend to replicate those from other studies with adolescent samples (e.g., Allik et al., 2004; McCrae et al., 2002; W.D. Parker & Stumpf, 1998; Soto et al., 2008). Second, all the newly added items were significantly associated with their target Big Five trait. However, some of these added items tended to show low factor loadings. This is perhaps not so surprising because, in accordance with the classic bandwidth-fidelity dilemma (Cronbach & Gleser, 1965), when the conceptual breadth of a scale composed of repeated items measuring only a few primary traits is increased, the new items will inevitably have somewhat lower loadings. Even though factor loadings are commonly considered meaningful when they exceed .30 or .40, this popular rule of thumb is generally not recommended for deciding whether an item is part of a factor. Indeed, as pointed out by Preacher and MacCallum (2003), now that some statistical programs calculate the standard error associated with each factor loading, it is advisable to use statistical significance and confidence intervals. It should also be taken into account that including CUs in an ESEM model typically results in lower loading estimates for some items. In factor analyses of the NEO-FFI items, Marsh et al. (2010) also observed that a number of items show standardized loadings below the common recommendation of .30. Apart from the low loadings, another characteristic of the factor solution observed in this study is that there were a number of complex items (i.e., items with significant loadings on two or more factors). For instance, the Impulsiveness item (49r) added to the Conscientiousness scale is one such item.
Even though it would be conceptually preferable to have no such complex items (Donnellan et al., 2006), given that Impulsiveness is such a fundamental personality trait, researchers must deal with the trade-off between a full representation of primary traits within a Big Five measure and the increased complexity of interpreting scale scores with complex items.
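Following the Preacher and MacCallum (2003) recommendation, a loading can be judged by its standard error rather than by a fixed .30 cutoff. This minimal Python sketch assumes a Wald-type normal approximation and uses a hypothetical loading and standard error; it shows how a loading below .30 can still be clearly different from zero.

```python
import math

Z_95 = 1.959963984540054  # two-sided 95% standard normal quantile


def loading_test(estimate, se, z_crit=Z_95):
    """Wald z test and 95% CI for a single factor loading."""
    z = estimate / se
    p = math.erfc(abs(z) / math.sqrt(2))  # two-sided p under a normal approximation
    return z, p, (estimate - z_crit * se, estimate + z_crit * se)


# Hypothetical standardized loading below the .30 rule of thumb.
z, p, (lo, hi) = loading_test(0.22, 0.05)
print(f"z = {z:.2f}, p = {p:.2g}, 95% CI = [{lo:.3f}, {hi:.3f}]")
```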
A third notable result from the factor validity tests is that it is difficult, if not impossible, to achieve good model fit by using ICM-CFA for a complex structure like the Big Five when it is measured by several items. Indeed, as recently demonstrated by a number of researchers, it is more appropriate to use ESEM to model all possible factor loadings because it is closer than ICM-CFA to the true model of the Big Five structure (e.g., Furnham, Guenole, Levine, & Chamorro-Premuzic, 2013; Marsh et al., 2010; Rosellini & Brown, 2011). In this study, out of a total of 250 possible factor loadings (number of items times the number of factors), there were 105 (42%) significant cross-loadings. It should be noted, however, that the fit of the ESEM model is acceptable but far from excellent according to typical criteria suggested for practical fit indices (e.g., Hu & Bentler, 1999). This is perhaps not surprising because, as Marsh, Hau, Balla, and Grayson (1998) noted, fit tends to decrease as the number of indicators in a factor model increases, even for properly specified models.
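For readers less familiar with the practical fit indices mentioned above, the sketch below computes point estimates of RMSEA, CFI, and TLI from model and baseline chi-square values using their usual maximum likelihood formulas. The numbers are hypothetical (they are not the article's results), robust estimators use adjusted versions of these indices, and Hu and Bentler's (1999) suggested benchmarks are approximately RMSEA ≤ .06 and CFI/TLI ≥ .95.

```python
import math


def rmsea(chi2, df, n):
    """Point estimate of RMSEA (some programs use n instead of n - 1)."""
    return math.sqrt(max(chi2 - df, 0.0) / (df * (n - 1)))


def cfi(chi2_m, df_m, chi2_b, df_b):
    """Comparative fit index relative to a baseline (independence) model."""
    num = max(chi2_m - df_m, 0.0)
    den = max(chi2_b - df_b, chi2_m - df_m, 0.0)
    return 1.0 if den == 0 else 1.0 - num / den


def tli(chi2_m, df_m, chi2_b, df_b):
    """Tucker-Lewis index (non-normed fit index)."""
    ratio_b = chi2_b / df_b
    return (ratio_b - chi2_m / df_m) / (ratio_b - 1.0)


# Hypothetical values for a 50-item model; the baseline df is 50 * 49 / 2.
N, CHI2_M, DF_M, CHI2_B, DF_B = 1000, 2000.0, 985, 12000.0, 1225
fit = {
    "RMSEA": rmsea(CHI2_M, DF_M, N),
    "CFI": cfi(CHI2_M, DF_M, CHI2_B, DF_B),
    "TLI": tli(CHI2_M, DF_M, CHI2_B, DF_B),
}
print({k: round(v, 3) for k, v in fit.items()})
```

With many indicators, an RMSEA can look good while CFI and TLI fall short of .95, which is one concrete form of the "acceptable but far from excellent" pattern described above.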
A fourth interesting result from the factor analyses is that factor correlations are considerably lower with ESEM than with ICM-CFA. As shown by Marsh et al. (2010), when researchers use ICM-CFA, thereby fixing to 0 the numerous significant cross-loadings that are actually expected in a Big Five personality trait structure, factor correlations are vastly inflated because they become the only way the model can represent the covariance carried by these cross-loadings (see also Asparouhov & Muthén, 2009). Using ESEM provides factor correlations that are probably closer to the true population parameters and supports the discriminant validity among the Big Five traits as measured by the BFPTSQ.
A fifth interesting finding from the factor analyses is the measurement invariance across genders. This is important because, to make valid factor mean comparisons, researchers must demonstrate at least that the loadings and intercepts are invariant across groups (Millsap & Olivera-Aguilar, 2012). Unfortunately, this verification is still rarely done in personality research. The results of this study suggest that it is reasonable to assume that the BFPTSQ factor structure shows measurement invariance across genders. That is, factor loadings, item uniquenesses, CUs, and factor variances/covariances (or correlations) were all fairly invariant across males and females. However, the item intercepts were not fully invariant, and four of them were freed across groups. Two of these items were from Openness and were significantly higher for females, and, interestingly, both tap artistic interests (26 and 41r). This suggests that females are significantly more disposed than males to have artistic interests during adolescence. Another intercept, for an item tapping joyfulness (47), was significantly higher for females, which suggests they are more prone than males to experience positive emotions during adolescence. Finally, the intercept of a reversed item tapping nervousness (40r) was significantly lower for females, which suggests that females are more predisposed than males to experience anxiety. Non-invariant item intercepts suggest that the mean differences observed across groups in the corresponding factors are not solely due to latent factor mean differences, but also partly due to differences in these non-invariant item intercepts (Millsap & Olivera-Aguilar, 2012). However, even though researchers would want all intercepts to be invariant across groups, partial invariance is considered acceptable for making valid mean comparisons when only a few items are non-invariant (Byrne et al., 1989).
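Invariance is typically evaluated by comparing a sequence of nested models (for example, a configural model, then one with equal loadings, then one with equal intercepts). The article does not say which comparison statistic was used. As one common option when a robust maximum likelihood estimator that reports scaling correction factors is used, the sketch below implements the Satorra-Bentler (2001) scaled chi-square difference test, with a small standard-library chi-square tail function so that it needs no third-party packages. All numbers are hypothetical.

```python
import math


def chi2_sf(x, df):
    """Upper-tail probability of a chi-square variate with integer df."""
    if x <= 0:
        return 1.0
    y = x / 2.0
    if df % 2 == 0:
        # Even df: exp(-y) * sum over k < df/2 of y^k / k!
        term = math.exp(-y)
        total = term
        for k in range(1, df // 2):
            term *= y / k
            total += term
        return total
    # Odd df: start from the df = 1 tail and add half-integer gamma terms.
    total = math.erfc(math.sqrt(y))
    term = math.exp(-y) * math.sqrt(y) / math.gamma(1.5)
    for j in range(1, (df - 1) // 2 + 1):
        total += term
        term *= y / (j + 0.5)
    return total


def sb_scaled_diff(t0, c0, d0, t1, c1, d1):
    """Satorra-Bentler (2001) scaled chi-square difference test.

    Model 0 is the more constrained (nested) model, so d0 > d1.
    t0 and t1 are the robust (scaled) chi-square values; c0 and c1 are
    the scaling correction factors reported with them.
    """
    cd = (d0 * c0 - d1 * c1) / (d0 - d1)
    if cd <= 0:
        raise ValueError("non-positive difference-test scaling correction")
    trd = (t0 * c0 - t1 * c1) / cd
    ddf = d0 - d1
    return trd, ddf, chi2_sf(trd, ddf)


# Hypothetical output for a constrained (0) vs. less constrained (1) model.
trd, ddf, p = sb_scaled_diff(t0=1500.0, c0=1.20, d0=1200, t1=1450.0, c1=1.21, d1=1150)
print(f"scaled delta chi2 = {trd:.2f}, df = {ddf}, p = {p:.3f}")
```

A simple difference between two scaled chi-square values is not itself chi-square distributed, which is why the correction factors have to be combined as above.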
With partial measurement invariance across genders established, the results showed sizable mean gender differences in adolescents' personality traits. In order of magnitude, females have significantly lower levels of Emotional Stability, significantly higher levels of Agreeableness and Conscientiousness, and marginally higher levels of Openness. There was no significant gender difference in Extraversion in this sample. These results are consistent with those from a recent meta-analysis with adult samples, except for the absence of differences in Extraversion (Schmitt et al., 2008). However, when gender differences in Extraversion are observed in adolescents, they are typically rather small (Allik et al., 2004). A study by Costa, Terracciano, and McCrae (2001) could help in interpreting these results. These authors observed that while mean gender differences tended to be observed across all primary traits subsumed by Neuroticism and Agreeableness, gender differences were not consistent across primary traits for Openness and Extraversion: for these two broad traits, some primary traits favored women, while others favored men. These offsetting differences within each broad trait explained why mean gender differences were rather small for Openness and Extraversion. It is thus possible that the particular items included in the BFPTSQ scales partly explain the absence of a mean difference in Extraversion.
With regard to the reliability of the BFPTSQ scales, overall, the results suggest that the estimates are all acceptable. Importantly, the comparison tests (i.e., delta parameters) between the original-content scales and the BFPTSQ ones with added items revealed that the new items did not improve reliability, nor did they have a detrimental impact. Of course, only one or two items were added per scale, so the absence of significant influence on reliability estimates is not unexpected. Still, it is interesting that reliability estimates were not significantly decreased, considering that some of these new items tended to have somewhat low factor loadings. Another noteworthy observation concerning reliability is that the latent variable model composite reliability (ρ) estimates were systematically lower when models including CUs were considered. It is known that the traditional Cronbach's alpha coefficient will be accurate only when the items are clearly unidimensional and essentially tau-equivalent and the uniquenesses are uncorrelated (Raykov, 2001, 2012). When positive CUs are ignored, the item-specific covariance they carry is counted as common (true-score) variance. The results of this study confirmed that CUs that are not accounted for tend to inflate reliability estimates of personality trait scales.
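The effect of correlated uniquenesses on reliability can be illustrated with hypothetical standardized loadings. The sketch below computes the composite reliability ρ of a unit-weighted sum score, in which error covariances enter the denominator (Raykov, 2001), and Cronbach's alpha computed on the same model-implied covariance matrix. For simplicity, the loadings are held fixed across the two scenarios, whereas in a re-estimated model with CUs the loadings would usually change as well.

```python
def composite_reliability(loadings, uniquenesses, cu=None):
    """Composite reliability (rho) of a unit-weighted sum score.

    cu maps (i, j) item-index pairs to correlated-uniqueness covariances,
    which enter the error variance of the sum score twice.
    """
    cu = cu or {}
    true_var = sum(loadings) ** 2
    error_var = sum(uniquenesses) + 2 * sum(cu.values())
    return true_var / (true_var + error_var)


def alpha_from_implied_cov(loadings, uniquenesses, cu=None):
    """Cronbach's alpha computed on the model-implied covariance matrix."""
    cu = cu or {}
    k = len(loadings)
    cov = [[loadings[i] * loadings[j] for j in range(k)] for i in range(k)]
    for i in range(k):
        cov[i][i] += uniquenesses[i]
    for (i, j), value in cu.items():
        cov[i][j] += value
        cov[j][i] += value
    total_var = sum(sum(row) for row in cov)  # variance of the sum score
    item_var = sum(cov[i][i] for i in range(k))
    return k / (k - 1) * (1 - item_var / total_var)


# Hypothetical standardized loadings; uniquenesses = 1 - loading^2.
LOADINGS = [0.7, 0.6, 0.5, 0.4]
UNIQ = [1 - lam**2 for lam in LOADINGS]
CU = {(2, 3): 0.15}  # items 3 and 4 share some item-specific variance

rho_no_cu = composite_reliability(LOADINGS, UNIQ)
rho_cu = composite_reliability(LOADINGS, UNIQ, CU)
alpha_no_cu = alpha_from_implied_cov(LOADINGS, UNIQ)
alpha_cu = alpha_from_implied_cov(LOADINGS, UNIQ, CU)
print(f"rho:   no CU {rho_no_cu:.3f}, with CU {rho_cu:.3f}")
print(f"alpha: no CU {alpha_no_cu:.3f}, with CU {alpha_cu:.3f}")
```

With these values, adding a single positive CU lowers ρ, whereas alpha, which treats all inter-item covariance as shared true-score variance, ends up above ρ.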
Concerning convergent validity, overall, the correlations with the NEO-PI-3 (McCrae & Costa, 2010) scales suggest adequate validity of the BFPTSQ scales. All the correlations between the broad-trait scales are high, ranging from .629 for Openness to .804 for Conscientiousness. The comparison tests confirmed that, compared with the original-content scales, the BFPTSQ scales with added items significantly increased these correlations for all Big Five traits. Moreover, the correlations between the BFPTSQ scales and their target NEO-PI-3 primary-trait scales were generally moderate to high, and all were significant. The comparison tests also showed that the BFPTSQ scales with new items had significantly higher correlations with a number of NEO-PI-3 primary-trait scales. Because about 43% (13 out of 30) of the correlations with NEO-PI-3 primary-trait scales did not significantly increase with the added items, this suggests that the correlation increases are not merely a matter of increased general variance. Importantly, the comparison tests generally confirmed expectations regarding the BFPTSQ scales as compared with those with the original content: for Extraversion, the increase in correlation was largest for Excitement Seeking and Positive Emotions; for Agreeableness, for Straightforwardness; for Conscientiousness, for Deliberation (which could also be called Impulsiveness based on its content); and for Emotional Stability, for Depression and Self-Consciousness. Contrary to expectations, however, the BFPTSQ Openness scale with a new item tapping Openness to Cultural Diversity did not significantly increase the relation with the NEO-PI-3 Openness to Values scale, which is surprising since that scale contains similar items. It should be noted that, for all comparison tests, the confidence intervals of the two correlations overlap somewhat.
Even though nonoverlapping 95% confidence intervals are a rather stringent test of the difference between two estimates (Cumming, 2012), this overlap suggests that the correlation increases flagged as statistically significant by the delta parameters tend to be small.
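The claim that nonoverlapping 95% confidence intervals are a stringent test can be checked with a little arithmetic. Assuming two estimates with equal standard errors and approximately normal sampling distributions (for correlations, on the Fisher z scale), the sketch below computes the two-sided p value when the two intervals just touch, and how much the intervals may overlap at p = .05, for different correlations ρ between the two estimates. Positive ρ is the relevant case when an original-content scale and its added-item version are both correlated with the same criterion.

```python
import math

Z_95 = 1.959963984540054  # two-sided 95% standard normal quantile


def p_if_cis_just_touch(rho=0.0, z_crit=Z_95):
    """Two-sided p when two 95% CIs with equal SEs just touch.

    rho is the correlation between the two estimates; the common SE cancels.
    """
    separation = 2 * z_crit             # distance between estimates, in SE units
    se_diff = math.sqrt(2 * (1 - rho))  # SE of the difference, in SE units
    return math.erfc(separation / se_diff / math.sqrt(2))


def overlap_at_p05(rho=0.0):
    """CI overlap, as a fraction of one margin of error, when p = .05."""
    # At p = .05 the estimates are z_crit * se_diff apart; each arm is z_crit.
    return 2 - math.sqrt(2 * (1 - rho))


for rho in (0.0, 0.5, 0.8):
    print(f"rho = {rho}: p if CIs touch = {p_if_cis_just_touch(rho):.2g}, "
          f"overlap at p = .05 = {overlap_at_p05(rho):.2f} MOE")
```

For independent estimates (ρ = 0), touching intervals correspond to p of roughly .006, and the intervals can still overlap by a little more than half a margin of error at p = .05; the stronger the dependence between the estimates, the more overlap is compatible with a significant difference.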
As for criterion validity, overall the correlations with the outcome measures suggest adequate concurrent validity of the BFPTSQ scales. As expected based on scholarly reviews (Klein et al., 2012; Tackett, Martel, et al., 2012; Widiger & Smith, 2008) and meta-analytic findings derived with adult samples (Malouff et al., 2005), externalizing psychopathology scales tended to be related to low levels of Agreeableness and Conscientiousness and, to a lesser extent, to low levels of Emotional Stability and high levels of Extraversion. However, the BFPTSQ Extraversion scale did not show a significantly higher correlation with substance use. A higher correlation was expected because of the added Sensation Seeking item, which is typically related to substance use (Sargant et al., 2010; Zuckerman, 2009). Internalizing psychopathology scales, meanwhile, tended to be related to low levels of Emotional Stability, Agreeableness, and Conscientiousness and, to a lesser extent, to low levels of Extraversion. An exception is bipolar disorder, which was related to high levels of Extraversion, even though it is considered a mood disorder. As expected based on Malouff et al.'s (2005) meta-analytic findings, Openness showed small but significant relations to some psychopathology scales. Indeed, this meta-analysis found that for self-reports, Openness is positively related to psychopathology, while for observer ratings, there is a negative relation. In the present study, Openness was negatively related to CD and positively related to MDD, BD, and GAD. However, compared with the other correlations, these relations tended to be small. Moreover, as expected based on meta-analytic findings (Poropat, 2009), Conscientiousness and Openness were both positively related to GPA. In this sample of adolescents, Agreeableness was also clearly related to GPA, much more strongly than is typically observed in adult samples (Noftle & Robins, 2007).
The comparison tests generally confirmed that, compared with those with the original content, the BFPTSQ scales with added items provide several significantly higher correlations with outcome scales. Because 33% (15 out of 45) of the correlations with outcome scales did not increase significantly with the added items, this again suggests that the correlation increases are not simply a matter of increased general variance. When the 95% confidence intervals are considered for comparison tests, all overlap somewhat, which suggests that the correlation increase flagged as statistically significant as per the delta parameters tends to be small.
Impact of Added Conceptual Breadth
This study showed that adding only a few items tapping important primary personality traits kept factor validity adequate, left reliability unaffected, and somewhat improved convergent and criterion validities. Even though the BFPTSQ is far from optimal in terms of content coverage and psychometric properties, the results of this study underscore the value of its added conceptual breadth.
Because of the limited time available in many assessment situations, short personality measures are popular. In principle, to be valid, short measures should have items tapping all of the most important primary traits subsumed in a given Big Five trait (Haynes et al., 1995; Smith et al., 2003). Therefore, the minimal number of items in a short scale should arguably be the number of primary traits encompassed in a given Big Five trait. If some items tapping fundamental primary traits do not load strongly (but significantly) on their target factor, they should not be excluded a priori for this reason. In other words, results of factor or reliability analyses should not supersede issues of conceptual breadth and validity in questionnaire construction (Cizek, Rosenberg, & Koos, 2008; Loevinger, 1957). Such weak loadings should rather be taken as a challenge for personality researchers to find the place of these primary personality traits in the complex multivariate space of the Big Five taxonomy. This challenge is difficult, however, because even though personality psychologists have reached a reasonable consensus on the hierarchical nature and number of broad personality traits (Markon et al., 2005), they are not even close to a consensus concerning the nature and number of fundamental primary traits.
Limitations and Future Research
The BFPTSQ psychometric properties evaluated in this study are generally satisfactory, but a few limitations should be mentioned. First, all the psychometric properties of the BFPTSQ were evaluated for one sample of adolescents. Even though efforts were made to gather a large, very roughly representative sample of adolescents, the results need to be replicated. Second, because studying personality trait development is important, a cross-validation should be carried out with an adult sample. Third, the French-language version of the BFPTSQ was validated, so it would be important to replicate these results using the English- or other-language versions. Fourth, it would be interesting to evaluate whether the factor structure and measurement invariance across genders can be replicated with reports from informants, such as parents and teachers. Fifth, there is a potential shared method effect in the criterion validity tests because self-report measures were used for the assessment of both personality traits and outcomes. It would be important to replicate these results using different informants or methods. Sixth, the evaluation of criterion validity was conducted with a limited number of outcomes (i.e., psychopathology and achievement). Clearly, personality traits have been shown to be associated with a wide range of other positive and negative life outcomes, so additional predictive studies using these new scales are needed. Seventh, this study provided only a preliminary evaluation of incremental validity. Indeed, before making any strong claims about the incremental validity of the BFPTSQ, it would be important to demonstrate that its scales significantly add to the prediction of consequential outcomes beyond what can be predicted by scales from other known short Big Five measures. Despite these limitations, the results of this study suggest that the BFPTSQ appears to be a potentially useful alternative to existing Big Five short measures.
Appendix
Big Five Personality Trait Short Questionnaire (BFPTSQ) Items
| Item | Key | I see myself as someone who . . . |
|---|---|---|
| *Openness* | | |
| 1 | | Is original, often has new ideas. |
| 6 | | Is curious about many different things. |
| 11 | | Is ingenious, reflects a lot. |
| 16 | | Has a lot of imagination. |
| 21 | | Is inventive, creative. |
| 26 | | Likes artistic or aesthetic experiences. |
| **31** | R | Is not really interested in different cultures, their customs and values. |
| 36 | | Likes to reflect, tries to understand complex things. |
| 41 | R | Has few artistic interests. |
| 46 | | Is sophisticated when it comes to art, music or literature. |
| *Extraversion* | | |
| 2 | | Likes to talk, expresses his/her opinion. |
| 7 | R | Is reserved or shy, has difficulty approaching others. |
| 12 | | Is full of energy, likes to always be active. |
| 17 | | Is a leader, capable of convincing others. |
| 22 | R | Is rather quiet, does not talk a lot. |
| 27 | | Shows self-confidence, is able to assert himself/herself. |
| 32 | R | Is timid, shy. |
| 37 | | Is extraverted, sociable. |
| **42** | | Likes exciting activities, which provide thrills. |
| **47** | | Has a tendency to laugh and have fun easily. |
| *Agreeableness* | | |
| 3 | R | Has a tendency to criticize others. |
| 8 | | Is helpful and generous with others. |
| 13 | R | Provokes quarrels or arguments with others. |
| 18 | | Is lenient, forgives easily. |
| 23 | | Generally trusts others. |
| 28 | R | Can be distant and cold towards others. |
| 33 | | Is considerate and kind to almost everyone. |
| 38 | R | Can sometimes be rude or mean towards others. |
| 43 | | Likes to cooperate with others. |
| **48** | R | Can deceive and manipulate people to get what he/she wants. |
| *Conscientiousness* | | |
| 4 | | Works conscientiously, does the things he/she has to do well. |
| 9 | R | Can be a little careless and negligent. |
| 14 | | Is a reliable student/worker, who can be counted on. |
| 19 | R | Has a tendency to be disorganized, messy. |
| 24 | R | Has a tendency to be lazy. |
| 29 | | Perseveres until the task at hand is completed. |
| 34 | | Does things efficiently, works well and quickly. |
| 39 | | Plans things that need to be done and follows through the plans. |
| 44 | R | Is easily distracted, has difficulty remaining attentive. |
| **49** | R | Can do things impulsively without thinking about the consequences. |
| *Emotional Stability* | | |
| 5 | R | Has a tendency to be easily depressed, sad. |
| 10 | | Is generally relaxed, handles stress well. |
| 15 | R | Can be tense, stressed out. |
| 20 | R | Worries a lot about many things. |
| 25 | | Is emotionally stable, not easily upset. |
| 30 | R | Can be moody. |
| 35 | | Stays calm in tense or stressful situations. |
| 40 | R | Can easily become nervous. |
| **45** | R | Has a tendency to feel inferior to others. |
| **50** | R | Has a tendency to be easily irritated. |
Note. R = reversed-score item. Boldface item numbers represent newly added items.
Acknowledgements
I want to thank all the school board members, school principals, as well as all the adolescents, parents, and teachers, who participated in the study. Thank you also to Alexandre Morin and Dave Miranda for their comments on an early draft of this article.
Declaration of Conflicting Interests
The author declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.
Funding
The author disclosed receipt of the following financial support for the research, authorship, and/or publication of this article: This research was made possible by a grant from the Fonds québécois pour la recherche sur la société et la culture (FQRSC).
