Abstract
Questions concerning longitudinal stability and multi-method consistency are critical to temperament research. Latent State-Trait (LST) analyses address these directly, and were utilized in this study. Thus, our primary objective was to apply LST analyses in a temperament context, using longitudinal and multi-method data to determine the amount of trait vs. state variance, as well as convergence for measures of Distress to Limitations (DL) facets. Mothers’ ratings and independent observations of DL behaviors collected on two occasions (8 months old and 12 months old) for 148 infants (49.2% female) were utilized. Single source latent state-trait (LST) analyses indicated that parent ratings of DL behavior (PDL) contained more trait (M = 61%) than state residual (M = 39%) variance, whereas independent observations (IO) of DL behavior contained substantially more state residual (75%) than trait (25%) variance. A multiple source LST analysis indicated virtually zero convergence for either trait or state residual variance between PDL and IO ratings (M = 2%). In conclusion, PDL ratings were more trait-like across the 4-month interval, whereas IO ratings of DL were more state-like in nature. Also, no convergence was found between the two methods of measurement. Results are discussed with an emphasis on implications for the utility of LST analyses in temperament research.
One prominent controversy in temperament research concerns agreement among different sources of information, especially congruence between parent-report and observation-based indices of reactivity. Although laboratory observations are often characterized as more objective, parent-report indicators were shown to possess predictive validity that rivals alternative indicators (Pauli-Pott, Mertesacker, Bade, Haverkock, & Beckmann, 2003). In addition, it can be argued that maternal perspective on infant temperament is critical because of the primary caregiver’s contribution to the child’s milieu, especially in early childhood (Bornstein, 2014). At the same time, laboratory observations provide an opportunity to structure the situation in a standardized manner, administering episodes designed to elicit certain attributes, such as distress to limitations or frustration (Gagne, Van Hulle, Aksan, Essex, & Goldsmith, 2011). More recently, it has been proposed that each methodological approach possesses unique strengths and weaknesses, and future research should articulate these distinctions more precisely (Gartstein et al., 2016). Convergence (or lack thereof) between parent-report and observation-based measures of temperament could be framed from the state vs. trait perspective. That is, the nature of parent-report and observation-based tools is such that the former lends itself primarily to capturing trait-related variability, whereas infant responses to brief, albeit highly structured experimental manipulations may speak more to state-related changes in emotion/reactivity.
Longitudinal data with multiple assessment points are required to answer state vs. trait questions, which are relevant not only to measurement considerations but also to temperament theories, which typically include assumptions about temporal stability and setting/person consistency. Stability is viewed as one of the defining features of temperament, with significant consistency observed across assessments as early as infancy (Bornstein et al., 2015). At the same time, the psychobiological model of temperament anticipates changes as a function of development, especially rapid in early childhood, and considerable growth has been demonstrated for multiple attributes, such as fear (e.g., Gartstein et al., 2010). The current article illustrates an application of a novel methodological approach, Latent State-Trait (LST) Analyses, to questions concerning stability/change and convergence of temperament measures. Specifically, we examine the possibility that the commonly reported lack of convergence between laboratory observations and parent-report can be framed as a function of the state vs. trait distinction among these indicators.
Existing studies addressing longitudinal stability and cross-setting consistency of temperament typically rely on correlation or regression coefficients (Bornstein et al., 2015; Garcia Coll, Halpern, Vohr, Seifer, & Oh, 1992), which do not directly address questions concerning stability and/or consistency of longitudinal variance. Studies relying on latent growth modeling (LGM) also fall short in this regard (e.g., Gartstein et al., 2010), as LGM addresses change in mean levels of an attribute, rather than variance stability. If the variance of an attribute (but not its mean values) lacks stability, the instability occurs in a non-systematic manner (some individuals decrease while others increase). Importantly, this pattern of results would suggest the attribute in question is relatively occasion-specific rather than trait-like.
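The distinction between mean-level stability and stability of individual differences can be illustrated with a toy example. The sketch below is not an LST analysis and uses entirely hypothetical scores for five infants (the names `time1`, `time2`, and `pearson_r` are introduced here for illustration only): the group mean is identical on both occasions, so a mean-level analysis would register no change, yet the ordering of individuals is almost completely reshuffled.

```python
from statistics import mean


def pearson_r(x, y):
    """Pearson correlation between two equal-length lists of scores."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    ss_x = sum((a - mx) ** 2 for a in x)
    ss_y = sum((b - my) ** 2 for b in y)
    return cov / (ss_x * ss_y) ** 0.5


# Hypothetical scores for the same five infants on two occasions.
# Some infants increase and others decrease, but the mean is unchanged.
time1 = [1, 2, 3, 4, 5]
time2 = [3, 5, 1, 2, 4]

mean_change = mean(time2) - mean(time1)   # mean-level change
stability = pearson_r(time1, time2)       # rank-order stability

print(f"mean change: {mean_change}, rank-order stability: {stability:.2f}")
```

In this constructed case the mean change is zero while the cross-occasion correlation is close to zero, which is the non-systematic instability pattern described above.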
LST analyses directly address stability and convergence questions using longitudinal/multi-method data. Single-method LST models apply to single-method longitudinal data (e.g., parent-report across multiple occasions) (for review of LST modeling, see Geiser, Hintz, Burns, & Servera, 2017; Steyer, Mayer, Geiser, & Cole, 2014). With a minimum of two occasions of measurement, a single method LST model can separate the variance in temperament indicators into trait variance (i.e., see trait consistency definition in Table 1), state residual variance (i.e., see occasion-specificity definition in Table 1), and measurement error variance (unreliability). Trait variance represents the amount of true score variance in temperament indicators that is purely person-specific and independent of the occasion and/or person-occasion interactions. State residual (occasion-specific) variance, in contrast, represents the amount of true score variance in temperament indicators that reflects occasion-specific (situational) influences and/or person-occasion interactions. This LST model can help determine if a temperament attribute is more trait- or more state-like for a single source across multiple occasions of measurement (Figure 1). Definitions for this model are presented in Table 1. A multiple method LST model (Courvoisier et al., 2008; Litson, Geiser, Burns, & Servera, 2016; Preszler, Burns, Litson, Geiser, & Servera, 2016; Preszler et al., 2017) differs from a single method LST model by the simultaneous analysis of two or more methods of measurement (e.g., ratings). Including multiple methods in the same model (for example, information obtained from parent ratings and laboratory observations) makes it possible to study the convergent validity of trait and occasion-specific components across methods.
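The single-method decomposition described above can be written out as a short sketch. The notation is standard LST notation and is an assumption introduced here for illustration; it is not taken from Table 1. It assumes that the trait factor, the state residual factor, and the error term are mutually uncorrelated, and that consistency and occasion-specificity are expressed relative to true score variance (as the reported percentages, which sum to 100%, suggest).

```latex
\begin{align*}
Y_{it} &= \alpha_{it} + \lambda_{it} T_i + \delta_{it} O_t + \varepsilon_{it} \\
\mathrm{Var}(Y_{it}) &= \lambda_{it}^{2}\,\mathrm{Var}(T_i) + \delta_{it}^{2}\,\mathrm{Var}(O_t) + \mathrm{Var}(\varepsilon_{it}) \\
\mathrm{Var}(\tau_{it}) &= \lambda_{it}^{2}\,\mathrm{Var}(T_i) + \delta_{it}^{2}\,\mathrm{Var}(O_t) \\
\text{Trait consistency}(Y_{it}) &= \frac{\lambda_{it}^{2}\,\mathrm{Var}(T_i)}{\mathrm{Var}(\tau_{it})} \\
\text{Occasion-specificity}(Y_{it}) &= \frac{\delta_{it}^{2}\,\mathrm{Var}(O_t)}{\mathrm{Var}(\tau_{it})} \\
\text{Reliability}(Y_{it}) &= \frac{\mathrm{Var}(\tau_{it})}{\mathrm{Var}(Y_{it})}
\end{align*}
```

Here Y is the observed score for indicator i at occasion t, T is the indicator-specific trait factor, O is the state residual factor, and τ denotes the true score. Under these assumptions, trait consistency and occasion-specificity sum to 1 for each indicator.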

Figure 1. Single method latent state-trait model with parcel-specific trait factors measured at two time points. This model was applied to both PDL and IO methods separately. The Mplus code for this model is shown in the supplemental material. T = trait factor; O = state (occasion-specific) residual factor. Participant n = 147.
Table 1. Definitions for single and multiple source latent state-trait (LST) models
Note. “Reliable variance” or “reliability” is also sometimes referred to as “true score variance.” Similarly, “occasion-specific variance” is also sometimes referred to as “state residual variance” in LST modeling.
The multiple method LST model used in this study requires that one method be selected as a reference method (e.g., mother ratings) with the other methods considered nonreference methods (e.g., lab indicators). This multiple method LST model can be used to identify the amount of trait and state residual variance in temperament indicators that the nonreference methods either share or do not share with the reference method. Our multiple method LST model resulted in four variance components for the nonreference method (see Table 1)—(a) shared trait consistency, or the proportion of trait variance in temperament indicators that a nonreference method shares with the reference method; shared trait consistency thus indicates the degree of convergent validity with regard to stable (trait) aspects of behavior; (b) unique trait consistency, or the proportion of trait variance in temperament indicators that a nonreference method does not share with the reference method; unique trait consistency thus indicates the degree of method-specificity with regard to stable (trait) aspects of behavior; (c) shared occasion-specificity, or the proportion of occasion-specific variance in temperament indicators that a non-reference method shares with the reference method; shared occasion-specificity thus indicates the degree of convergent validity with regard to time-variable (state residual) aspects of behavior; and (d) unique occasion-specificity, or the proportion of occasion-specific variance in temperament indicators that is unique to a nonreference method and not shared with the reference method; thus, unique occasion-specificity indicates the degree of method-specificity with regard to variable (state residual) aspects of behavior. 
The decomposition of true score (i.e., reliable) variance into these four components for the nonreference method affords analysis of both temporal consistency and cross-method convergent validity of trait and state residual components in temperament indicators. Figure 2 shows the multiple method LST model used in the present study, with the definitions for this model shown in Table 1.
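The four-component decomposition for a nonreference indicator amounts to simple proportion arithmetic, sketched below. This is a minimal illustration and not the authors' Mplus implementation: the class and function names are hypothetical, and the variance values in the example are invented so that the resulting proportions match the pattern of averages reported in this article. The error variance in particular is arbitrary.

```python
from dataclasses import dataclass


@dataclass
class NonreferenceVariances:
    """Variance contributions for one nonreference indicator (hypothetical container)."""
    shared_trait: float      # trait variance shared with the reference method
    unique_trait: float      # trait variance specific to the nonreference method
    shared_occasion: float   # state residual variance shared with the reference method
    unique_occasion: float   # state residual variance specific to the nonreference method
    error: float             # measurement error variance


def decompose(v: NonreferenceVariances) -> dict:
    """Express each component as a proportion of true score (reliable) variance."""
    true_var = v.shared_trait + v.unique_trait + v.shared_occasion + v.unique_occasion
    return {
        "shared_trait_consistency": v.shared_trait / true_var,
        "unique_trait_consistency": v.unique_trait / true_var,
        "shared_occasion_specificity": v.shared_occasion / true_var,
        "unique_occasion_specificity": v.unique_occasion / true_var,
        "reliability": true_var / (true_var + v.error),
    }


# Invented variance values (assumption, for illustration only).
example = NonreferenceVariances(
    shared_trait=0.0, unique_trait=1.4,
    shared_occasion=0.1, unique_occasion=3.5,
    error=5.0,
)
result = decompose(example)
for name, value in result.items():
    print(f"{name}: {value:.2f}")
```

The two shared components index convergent validity with the reference method, and the two unique components index method-specificity; by construction the four proportions sum to 1.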

Figure 2. Multiple method latent state-trait model with item-specific trait factors. T = reference trait factors that are defined by mother reports (Shared Trait Consistency); TM = residual trait factors that pertain to the lab indicators (Unique Trait Consistency); O = reference occasion-specific factors that are defined by mother reports (Shared Occasion-Specificity); OM = occasion-specific factors that pertain to the lab indicators (Unique Occasion-Specificity). M11 and M21 = IBQ19; M12 and M22 = IBQ20R; M13 and M23 = IBQ21; L11 and L21 = BDYANG; L12 and L22 = INANG; L13 and L23 = INDSVC. Participant n = 147.
Importantly, there are different types of LST models that may be more appropriate depending on the specific goals of the investigation. For example, certain LST models can take into account autoregressive effects, and others can take into account trait change (for an in-depth explication of the variety of LST modeling techniques, see Geiser et al., 2017). The multitrait multistate (MTMS) model with no autoregressive or trait change effects (Geiser et al., 2017, Figure 2) was selected for the purposes of this study, given that only two time points were utilized and temperament stability has been demonstrated within this developmental timeframe (e.g., Bornstein et al., 2015). This type of model was used for both the single- and multiple- method analyses, shown in our Figure 1 and Figure 2, respectively.
LST models seem particularly well suited for the temperament context because they provide an opportunity to directly examine consistency of a temperament construct across time and across methods of measurement (e.g., lab measure and parent/observer ratings). Temperament is conceptualized as an enduring construct, yet empirical evaluations of stability remain informative, especially during periods of rapid development (e.g., infancy). Moreover, an ongoing measurement debate, elucidating agreement and areas of discordance among different sources of information, makes this field optimal for LST applications.
Distress to limitations as an example
Recent uses of LST involve investigations of developmental psychopathology, such as ADHD and ODD, in school-age children, (Litson et al., 2016; Preszler et al., 2016). However, to our knowledge, LST models have not yet been applied to temperament constructs, despite the demonstrated utility of this approach and its applicability to individual differences in reactivity and regulation. The present study was designed to address this gap in research and represents an illustration of LST in this context. For the purposes of this illustration, we focused on a fine-grained infant temperament construct of Distress to Limitations, defined as fussing, crying, or showing distress while a) in a confining place or position, b) in caretaking activities, and c) unable to perform a desired action. These reflect anger/frustration, and were measured across two time points using two methods. This level of analysis was undertaken because more global temperament factors (e.g., Negative Emotionality) represent multi-dimensional constructs with heterogeneous content. Distress to Limitations was also selected as a target because it has been successfully evaluated via structured laboratory observations, along with parent-report ratings (Stifter & Braungart, 1995; Gartstein & Rothbart, 2003). Although other fine-grained components of temperament should be examined via LST techniques in the future, we focused on Distress to Limitations for this initial illustration because negative emotionality has been the most widely studied broadband domain of temperament, and anger/frustration represents a component with clearly observable manifestations, expected to provide the basis for consistency across measurement modalities.
Our objectives were to (1) use single-method LST models to determine the amount of time-stable and occasion-specific variance for each method separately, and (2) to use a multiple-method model to determine the amount of shared time-stable variance and shared occasion-specific variance between the two methods: laboratory observation-based and parent-report scores. Stability was examined from the age of 8 to 12 months, capturing an important developmental period at the end of the first year of life.
Method
Participants and procedures
A community sample of 148 English-speaking mothers with healthy full-term infants was recruited through birth announcements and a universal prevention program. Participating caregivers (mean age = 28.67, SD = 5.27) were primarily Caucasian (92%), representative of the general population. The vast majority of mothers were married (93.1%) and had completed at least some college (90.9%), with family incomes generally >US$20,000 (73.2%). Mothers responded to the Infant Behavior Questionnaire-Revised (IBQ-R; Gartstein & Rothbart, 2003) when their infants were 8 and 12 months of age, and visited the laboratory with their infants (49.2% female; 50.8% male), who participated in the Laboratory Temperament Assessment Battery (Lab-TAB; Gagne et al., 2011; Goldsmith & Rothbart, 1996). Only the Distress to Limitations (DL) scale of the IBQ-R and the Toy Retraction episode of Lab-TAB, designed to elicit mild frustration, were considered.
Measure
Research assistants were trained to assign reliable codes to infant behaviors reflecting anger/frustration in the Lab-TAB Toy Retraction task. These coders rated infant displays of anger/frustration across nine 5-second epochs for: (1) latency of first anger response; (2) intensity of anger expression; (3) intensity of distress vocalizations; (4) intensity of bodily anger; (5) presence of struggling/escape behavior; (6) intensity of struggling/escape behaviors; (7) attempts to reach toy; and (8) banging on the table (see Table 2 for the coding scheme description). Satisfactory reliability, including inter-rater agreement (75% to 88%), was demonstrated for these indicators (Majdandzic, van den Boom, & Heesbeen, 2008). In the current sample, inter-rater reliability coefficients ranged from .61 to .98 (mean r = .81).
Table 2. Toy retraction coding scheme
Analyses
In the present study, we followed a twofold strategy, first selecting DL items most relevant in content to the Toy Retraction experimental situation a-priori. Second, if measurement requirements (i.e., adequate measurement model, longitudinal invariance, and mean stability) were not satisfied for these a priori selected items, we sought indicators within the DL domain of the IBQ-R that met measurement requirements. Although in this study items were selected based on their content similarity to the Lab-TAB measure, as well as to each other, it should be noted that specific facets or subdomains within Distress to Limitations have not been previously identified.
Analyses were conducted using Mplus 7.4 (Muthén & Muthén, 1998–2012). The robust maximum likelihood (MLR) estimator accounted for any non-normality in the item ratings, and Full Information Maximum Likelihood (FIML) was used to accommodate missing data. The comparative fit index (CFI, ideal study criterion ≥.95), the Tucker-Lewis Index (TLI, ideal study criterion ≥.95), and the root mean square error of approximation (RMSEA, ideal study criterion ≤.05) were considered to evaluate model fit. Before applying the LST model, we first needed to demonstrate invariance of like-item loadings and intercepts across time, which was evaluated using changes in CFI value and chi-square difference tests. If the CFI decrease was <.01 with the introduction of constraints on like-item loadings and like-item thresholds (strong invariance), the invariance constraints were considered tenable (Little, 2013, Chapter 5). In addition, if the chi-square difference test was non-significant (p > .05), we considered the invariance constraints tenable.
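The fit and invariance criteria just described can be sketched as two small decision helpers. The function names and return structure are hypothetical and are not part of Mplus. Because the text does not state whether the CFI and chi-square criteria had to hold jointly, the sketch reports the two invariance checks separately rather than combining them. The input values in the example are those reported in the Results section.

```python
def fit_flags(cfi: float, tli: float, rmsea: float) -> dict:
    """Check each fit index against the ideal study criteria."""
    return {
        "cfi_ok": cfi >= 0.95,
        "tli_ok": tli >= 0.95,
        "rmsea_ok": rmsea <= 0.05,
    }


def invariance_flags(delta_cfi: float, p_value: float,
                     max_cfi_drop: float = 0.01, alpha: float = 0.05) -> dict:
    """Check the two invariance criteria separately.

    delta_cfi is the decrease in CFI after adding the invariance constraints;
    p_value is the p value of the chi-square difference test.
    """
    return {
        "cfi_criterion_met": delta_cfi < max_cfi_drop,
        "chi2_criterion_met": p_value > alpha,
    }


# Configural model for the a priori toy-removal items (values from the Results).
toy_removal_fit = fit_flags(cfi=0.966, tli=0.914, rmsea=0.092)

# Strong invariance tests for the waking-up PDL items and the IO indicators.
pdl_invariance = invariance_flags(delta_cfi=0.000, p_value=0.691)
io_invariance = invariance_flags(delta_cfi=0.008, p_value=0.123)

print(toy_removal_fit)
print(pdl_invariance)
print(io_invariance)
```

With these inputs, the toy-removal items fall short on TLI and RMSEA, whereas both invariance criteria are met for the waking-up PDL items and for the IO indicators.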
Results
As noted, we initially focused on Distress to Limitations items that overlapped in content with the experimental situation, i.e., addressed anger/frustration responses to toy removal: “When something the baby was playing with had to be removed, how often did s/he: (1) cry or show distress for a time?; (2) seem not bothered?”; and (3) “When the baby wanted something, how often did s/he become upset when s/he could not get what s/he wanted?” However, these did not produce an adequate-fitting configural model across time (RMSEA = .092; CFI = .966; TLI = .914), and the third item did not have high loadings (.54) on the construct at either time point.
Additional parent-report DL (PDL) items with content addressing two distinct themes were considered next. The first item set included three questions related to waking up (e.g., “After sleeping, how often did the baby (1) fuss or cry immediately?”, (2) “play quietly in crib?”, and (3) “cry if someone doesn’t come within a few minutes?”). The second set included three items related to behavior before falling asleep (e.g., “How often did the baby: (1) seem angry (crying and fussing) when you left her/him in the crib?”, (2) “seem contented when left in the crib?”, (3) “cry or fuss before going to sleep for naps?”). Only the first of these item sets met invariance (ΔCFI = .000, χ2 difftest = 2.25, p = .691) and relative mean stability assumptions necessary to conduct our LST analyses. Thus, DL items addressing anger/frustration upon waking up were considered further. A laboratory-based measurement model comprised of three independent observer (IO) indices met strong measurement invariance (ΔCFI = .008, χ2 difftest = 7.25, p = .123) and mean stability requirements of LST analysis: bodily anger, intensity of anger, and distress vocalizations.
Time invariance of loadings and thresholds was attained across both methods. We conducted invariance analyses of like-item loadings and intercepts across time on the PDL and IO measures. The configural model provided a good fit for each method (CFIs > .997, TLIs > .989, and RMSEAs < .035). Invariance of like-item loadings and intercepts held across the two time points for each of the models (i.e., no decrease in CFI was greater than .008), providing a foundation for proceeding with LST analyses.
PDL and IO LST models were associated with good fit (Table 3), as both estimated models adequately represented the data. The multiple-method model, including both methods simultaneously, also had good fit (Table 3). The variance components of our models are displayed in Table 4. PDL single method analyses showed more trait consistency (M = .61) than occasion-specificity (M = .39), whereas the IO single method results demonstrated substantially more occasion-specificity (M = .75) than trait consistency (M = .25). The multiple-method model, comparing PDL to the IO, showed that the IO measure shared virtually no variance with PDL, as indicated by the levels of shared trait consistency (M = 0%) and shared occasion-specificity (M = 2%), in contrast to the higher levels of unique trait consistency (M = 28%) and unique occasion-specificity (M = 70%).
Table 3. Goodness-of-fit statistics for latent state-trait analyses for distress to limitations reports
Note. CFI = comparative fit index; TLI = Tucker-Lewis index; RMSEA = root mean square error of approximation; CI = confidence interval; ns = nonsignificant. Participant n = 147.
Table 4. Average consistency, occasion-specificity, and reliability estimates from single and multiple method latent state-trait analyses on the distress to limitations reports
Note. Entries indicate averages across items (with ranges in parentheses). Multiple method model compares Independent observer-rated factor to Parent-rated factor. Participant n = 147.
Discussion
Our study sought to introduce the use of LST analysis (Steyer et al., 2014) to investigate the trait vs. state components of temperament assessed via parent-report and laboratory observations. The longstanding questions concerning agreement among different sources of information and stability vs. change in individual differences make this approach ideal for temperament research. Our initial effort focused on Distress to Limitations, reflecting anger and frustration-related emotional reactivity exhibited across contexts, including a situation wherein an attractive toy is removed, enacted in the laboratory evaluation of individual differences.
Single-method analyses indicated that PDL was about 60% stable across two time points, but IO was only about 25% stable. Although these findings require replication, this pattern of results suggests a number of possibilities. First, these findings bring attention to the questions concerning test-retest reliability of IO ratings of infants; specifically, those concerning an appropriate interval that would capture this psychometric property of the test, rather than the development/instability of the phenomenon in question. Although short in duration, the period 8 to 12 months could represent a transition with respect to temperament, and development more generally (Bornstein, Arterberry, & Lamb, 2014), and it may be more appropriate to frame IO as more sensitive to these developmental shifts. Laboratory-based measures were previously described as advantageous in terms of sensitivity to developmental changes, as parents tend to be stable in their expectations, not shifting perceptions of child attributes effectively with developmental transitions (Gagne et al., 2011; Saudino, 2003). This increased sensitivity is especially critical during periods of rapid transitions, and observation-based techniques may be preferred to parent-report in this context.
According to the multiple method analysis, the two methods (parent-report and laboratory observations) shared virtually zero variance. There are many potential explanations, including the significantly briefer time frame of the independent observation compared to maternal ratings, the introduction of an unfamiliar person into the infant’s environment in the context of Lab-TAB, or other novel aspects of the laboratory. Moreover, discrepancy in content of questionnaire items and the nature of laboratory observations could be responsible, as the most content-consistent items did not satisfy LST requirements. That is, although the parent-report IBQ-R indicators and Lab-TAB derived codes reflect the same underlying temperament construct, namely Distress to Limitations, these did not overlap in terms of specific content: items derived from laboratory observations addressed toy removal, whereas parent-report items addressed anger/frustration upon waking up. Distress to Limitations was conceptualized (Rothbart, 1981) and has been consistently treated as a unidimensional construct (Gartstein & Rothbart, 2003; Gartstein, Bridgett, & Low, 2012). The pattern of results observed in this study suggests this approach should be subjected to a systematic empirical evaluation. Notably, these latter explanations speak to a context-dependent nature of DL in infancy. Although replication and future research utilizing additional measures (e.g., physiological indicators) are required, our findings suggest that parent-report of infant DL functions more consistently with the trait-like definition than IO ratings recorded in the laboratory setting. Laboratory-based observations may appear more state-like in their functioning due to increased sensitivity to developmental changes occurring in the second half of the first year of life (e.g., Gartstein, Hancock, & Iverson, in press).
Limitations
The present sample was demographically homogeneous, limiting generalizability of results. Only one temperament dimension, Distress to Limitations, was examined, with observations limited to two time points. In addition, we relied on heterogeneous aspects of the “Distress to Limitations” construct across the two examined methods, as a result of the necessarily limited content addressed in the laboratory, and the failure of parallel parent-report items to meet measurement requirements. Unfortunately, results of this study do not provide the basis for conclusive interpretations concerning this failure of items most comparable with respect to face validity to meet LST prerequisite criteria. However, if replicated, this pattern of results would have implications for temperament measurement and theory. It may be that manifestations of certain temperament attributes are context dependent during specific developmental periods, such as infancy. That is, whereas our laboratory observation-based indicators reflected infant distress in response to toy removal, parental ratings addressed child distress in response to waking-up. This discrepancy stands in contrast to recent studies of ADHD (Litson et al., 2016) and ODD (Preszler et al., 2016), wherein different raters completed nearly identically worded items. The latter technique appears useful in minimizing item-specific method effects that might otherwise lead to higher nonconvergence across sources, and could be applied in the context of temperament research.
Future directions
Future studies should address the limitations noted above, recruiting more diverse samples with respect to socio-demographic factors. Additional time points should also be collected to help elucidate causes of instability across time and/or inconsistency across settings. More extensive longitudinal designs, following participants to determine if the trait-like presentation becomes consolidated later in childhood, should also be implemented. Questions concerning dimensionality of temperament constructs should also be addressed more conclusively in future research. For example, facets addressing components of this attribute (e.g., frustration to blocked goals) could be examined to determine if these provide meaningful distinctions. Importantly, researchers should continue to develop and refine tools to support multi-method measurement of different aspects of temperament, ensuring good psychometric properties. Methods employed in related LST applications, for example Burns and colleagues’ studies (Litson et al., 2016; Preszler et al., 2016, 2017) addressing adolescent and child psychopathology (e.g., ADHD, ODD, SCT), could provide a useful model in their rigorous psychometric evaluations preceding LST studies (Burns & Lee, 2011; Burns, Servera, Bernad, Carrillo, & Geiser, 2014; Khadka & Burns, 2013).
Conclusion
Temperament is typically conceptualized as trait-like, expected to manifest consistently across time and settings. We set out to demonstrate that LST modeling allows for a more direct and effective investigation of this trait-like nature, relative to previously used correlational, regression, and LGM techniques. Our illustration demonstrated time-stability for mothers’ DL ratings, but little time-stability in IO indicators, or consistency between the two sets of measures designed to address the same underlying temperament construct.
Supplemental material
Supplemental Material, JBD743066_supplementary_file_1 for Latent state-trait modeling: A new tool to refine temperament methodology by Jonathan Preszler, and Maria A. Gartstein in International Journal of Behavioral Development
Supplemental Material, JBD743066_supplementary_file_2 for Latent state-trait modeling: A new tool to refine temperament methodology by Jonathan Preszler, and Maria A. Gartstein in International Journal of Behavioral Development
Supplemental Material, JBD743066_supplementary_file_3 for Latent state-trait modeling: A new tool to refine temperament methodology by Jonathan Preszler, and Maria A. Gartstein in International Journal of Behavioral Development
Footnotes
Funding
The authors declared receipt of the following financial support for the research, authorship, and/or publication of this article: Jonathan Preszler’s work was supported by the Anthony Marchionne Foundation for the Scientific Study of Human Relations and Psychological Processes Endowed Graduate Fellowship for Research at Washington State University.
Supplemental material
The supplemental material is available online with the article.
