Abstract
It is often not stated or quantified how well measured proxy variables account for the variance in the latent constructs they are intended to represent. A sensitivity analysis was run using data from the Survey of Health, Ageing and Retirement in Europe to estimate models varying in the degree to which proxy variables represent intended constructs. Results showed that parameter estimates differ substantially across different levels of variable representation. When variables with poor construct validity are used, an insufficient amount of variance is removed from the observed spurious relationship between design variable and outcome. The findings from this methodological demonstration underscore the importance of selecting proxy variables that accurately represent the underlying construct for which control is intended.
Investigators who do not have access to the variables they wish to control for often use proxy variables. Many studies use education as a control variable when income is not available because education and income are frequently correlated. However, correlations and known validity coefficients do not necessarily establish an accurate mapping between the indicator and the construct targeted for measurement (Sechrest, 2005). Education and income may indeed be correlated, but education may not capture the variance in income that is associated with design variables in a model.
It is not always clear or stated how well the observed variable represents the latent construct. Investigators may aimlessly include demographic variables that are typically used as confounds (e.g., age, sex) without much thought about how accurately these variables represent the variance they want to model.
Two issues warrant distinction when evaluating the quality of this proxy relationship. The first is a thorough conceptualization of how well the observed variable represents the latent construct of primary interest. Many researchers, especially in the social sciences, who use income as a control variable are not interested in the number of dollars a person earns per se; rather, they are interested in an underlying construct, such as resource availability, financial stress, or familial provisioning, that is relevant to the theoretical object of the research.
The second issue is a specific instance of the first and relates to how accurately the operationalized variable represents the intricacies of the construct it is intended to measure. Income may be tested as a predictor of quality of life; however, the effectiveness of the income measure depends on how closely its operationalization embodies the construct predicted to affect quality of life. If resource availability is a primary influence on quality of life, net income may capture the variance in this outcome better than gross income. Two people with the same gross income may have vastly different resources available if one allocates a large percentage of gross income to family health insurance while the other, a single worker, retains a larger portion of base salary.
The importance of adequately identifying and operationalizing a proxy variable is an issue that spans many scientific disciplines. Below are examples of how income can represent drastically different constructs across health, criminology, and education.
Income in Health Research
Wood et al. (2012) evaluated competing theories used to explain the influence of income on mental health. The authors were interested in two mechanistic processes, beyond actual dollars, that underlie income and may directly influence mental illness. The first process that income can represent is access to resources and the ability to obtain medical treatment. The second is social status. Occupying low social status can cause individuals to experience profound anxiety and physiological distress, such as reduced serotonin and dopamine, that can alter brain and behavioral functioning.
Data from the British Household Panel Survey were analyzed, and a relationship between income and psychological distress was detected (b = −.08 to −.13). This relationship deteriorated (income b = −.04 to .06) when income rank (b = −.29 to −.40) was added to the statistical model, leading the authors to infer that social standing among community members, rather than simple access to medical services, can explain mental health. In this scenario, proxy variables need to measure social rank when investigators predict mental health. As the results indicated, income rank represents this construct better than absolute income.
Income in Criminology Research
Income can also represent a variety of distinct constructs in criminology and sociology research. Galloway and Skardhamar (2010) investigated the association between socioeconomic status (SES) and crime. They hypothesized that this relationship is due to the greater resources available to high SES individuals, which allow them to accomplish their goals. In addition to material resources, high SES can relate to advantages and support bestowed by parents. Without access to material and familial support, low SES individuals may view criminal activities as an alternative strategy for attaining solvency and other financial goals and necessities. Furthermore, low SES can affect interpersonal interactions, in which financial and employment stress breeds a contentious home environment. Children experiencing such dysfunctional parental relationships may be more likely to engage in risk-taking and aggressive behaviors (Galloway & Skardhamar, 2010).
Galloway and Skardhamar (2010) used income to represent SES and found that, although there was a strong association between income and crime, the relationship (expressed as hazard ratios) diminished with the inclusion of parental education in the model. They concluded that crime is a mechanism to meet financial demands when other goal-required resources, such as education, are not available. Hence, education may be a more valid proxy than income for modeling incidences of crime.
Income in Education Research
Income can also represent constructs related to academic performance that are beyond basic effort such as study time. Mullis et al. (2003) noted that educated or wealthy parents can teach their children study strategies, provide materialistic resources, such as technology, and be active in the educational experience to help these students succeed.
Mullis et al. (2003) modeled data on parental income and education and the degree of resources available that make the home conducive to academic success (e.g., newspapers and work areas as indicators of resource capital) and interpreted them as proxy causes of academic success via parents’ provisioning of academic support. Results indicated that resource capital accounted for a large portion of the variance (23%) in academic performance, with income obtaining the largest β loading of .73. The β loadings for parental education (.68) and home resources (.67) were slightly lower. Mullis et al. (2003) maximized the potency of proxy variables in modeling the underlying constructs of interest. Income, by itself, did not contain the information necessary to explain the variance in academic success sufficiently.
The examples outlined above demonstrate how a single variable can represent vastly different constructs when we predict outcomes. It is, therefore, essential for investigators to have a clear understanding of the constructs they wish to model and ensure that proxy variables adequately represent the latent phenomenon.
The following example is of the second issue regarding the quality of proxy variables: accurate operationalization of measured variables in representing theoretically important latent constructs.
Income Measuring Income
As outlined above, Galloway and Skardhamar (2010) estimated the relationship between parental income and child criminal offense. The income used in statistical models was measured as yearly total earnings from employment and social security. Income from social services, such as welfare, was not included because it was unavailable. Measured income was defined as earned income. This distinction, or at least the disclosure of the specific type of income used, is essential for evaluating which construct income is really measuring. Can welfare income be expected to play a part in predicting criminal offenses, and does the exclusive use of earned income reduce the amount of variability that can potentially be explained in crime rates? To answer this question, we must contemplate what it is about income that relates it to crime and ensure that our measure of income adequately represents this latent factor.
Effects of Imperfect Proxy Variables
When predictors are estimated without rigorous consideration of how and why they interact with other variables in a model, poor conceptualization and operationalization of the observed–latent trait relationship can occur. These conditions compromising the representativeness of measured proxy variables are likely to cause inaccurate parameter estimates because these imperfect proxies do not account for an adequate amount of the construct’s influence on other select variables in the network.
What, then, is the effect of using a nonrepresentative proxy variable? The study by Wood et al. (2012) detailed above demonstrates the dramatic effect that an imperfect proxy can have on parameter estimates. The authors hypothesized that low social status can cause physiological changes in the brain, which may lead to psychological distress. Using income to represent social status, they found that, across numerous models, significant β values for income ranged in magnitude from .08 to .13. However, income may not have been ideal for capturing the interpersonal competitive spirit that can cause changes in the brain during the struggle for status. When income rank was included in the model, the β for income deteriorated to half its original magnitude (ranging from .04 to .06), while income rank yielded a β (ranging from −.29 to −.40) up to almost 4 times the magnitude of the original income estimate. Income rank may, therefore, be effectively tapping into a competitive earning construct that causes physiological changes in the brain. These estimates illustrate the discrepancy in the two income variables’ ability to measure, under the proposed hypothesis, causes of psychological distress.
The purpose of this article was to quantify and evaluate the effect of misspecifying the measured variable–latent trait relationship. Models varying in the representativeness of measured variables on latent constructs were used in a sensitivity analysis.
Method
Data
Archival data from Wave 4 of the Survey of Health, Ageing and Retirement in Europe (SHARE) were used for this demonstration. The SHARE project houses an extensive database of health, social, and SES measures of older Europeans and Israelis. The fourth wave of SHARE data was collected between 2010 and 2012 and contained 12,222 records. The analysis was limited to one country (Spain) to reduce the variability that can result from country-specific heterogeneity (Buber & Engelhardt, 2011). The final sample size was 1,138.
Measures
Many of the published SHARE studies involve health- and age-related influences on depression in aging adults. The effect of age on depression was selected as the main substantive hypothesis of this demonstration. Depression was measured by the EURO-D. This measure comprises a list of 12 symptoms of depression for which participants report a presence or absence, and scores range from 0 to 12, with higher scores indicating more severe depression (see the SHARE website for symptoms used to create this measure, http://www.share-project.org/). This measure was developed to compare depression across European countries, and its strong psychometric properties have been documented (Prince et al., 1999).
The EURO-D was obtained from the SHARE dataset of generated variables. This dataset contains variables created from multiple questions from the original SHARE survey data.
The intended controls were physical health and physical and mental impairment, as these constructs have been used in depression research with SHARE data (Brandt et al., 2012; Buber & Engelhardt, 2011; Deindl, 2013; Ladin, 2008; Lindwall et al., 2011; Verropoulou & Tsimbos, 2007). Two measured proxy variables included to represent the constructs of physical health and physical and mental impairment were (1) the number of health-related symptoms the participant is bothered by and (2) the number of contacts with a doctor in the past year. The former is a strong proxy of these two constructs and the latter a poor proxy. A latent variable of general health comprising these two proxy variables was created as a reference for comparison.
The two constructs that proxy variables were intended to represent, and the composite proxy variable, were calculated by creating composite measures of the variables described above. These variables were standardized using the STANDARD procedure in SAS Version 9.4 to have a mean of 0 and standard deviation of 1. The physical health construct comprised the number of chronic conditions a physician had told the participant they had, a single-item assessment of current health on a 5-point Likert-type scale, the number of drugs taken, and the presence or absence of long-term illness. The number of activities of daily living (ADLs) and the number of instrumental activities of daily living (IADLs) that the participant had difficulty performing were each computed as a sum of the activity limitation questions the participant endorsed. These two counts comprised the physical and mental impairment construct. While the ADL measure is a survey of limitations that result strictly from physical disability, the IADL measure contains a number of limitations that result from cognitive impairment, such as using a map and making telephone calls. Details of the generated variables are on the SHARE website mentioned above. The structure of these models can be seen in Figures 1–3.

Figure 1. Structural model of good proxy variable.

Figure 2. Structural model of poor proxy variable.

Figure 3. Structural model of general health latent proxy variable.
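The standardization and composite-building steps described above can be sketched in a few lines. This is a minimal illustration, not the SAS code used in the study: the toy values are hypothetical (not SHARE data), the helper names `standardize` and `composite` are invented here, and averaging the standardized indicators is an assumption, since the article does not state how the indicators were combined.

```python
from statistics import mean, stdev


def standardize(values):
    """Rescale a list of numbers to mean 0 and standard deviation 1.

    Uses the sample standard deviation (n - 1 denominator); that choice
    is an assumption of this sketch.
    """
    m, s = mean(values), stdev(values)
    return [(v - m) / s for v in values]


def composite(*indicators):
    """Average standardized indicators respondent by respondent.

    Averaging is an assumption; the article only says that composite
    measures were created from the standardized variables.
    """
    z_columns = [standardize(col) for col in indicators]
    return [mean(row) for row in zip(*z_columns)]


# Hypothetical toy data for five respondents (not SHARE values).
chronic_conditions = [0, 1, 2, 3, 4]
drugs_taken = [1, 1, 3, 4, 6]

physical_health = composite(chronic_conditions, drugs_taken)
print([round(score, 2) for score in physical_health])
```

Standardizing first keeps an indicator with a wide range, such as a drug count, from dominating an indicator with a narrow range, such as a 5-point rating.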
Analysis
A sensitivity analysis was run to compare model parameter estimates and fit indices between two models differing in how well the proxy variable is related to the construct it is intended to represent. As a reference, a third model was run using a latent variable of general health that comprised the two proxy measures. The CALIS procedure in SAS Version 9.4 was used to run structural equation models (SEMs). All models were estimated with diagonally weighted least squares on Spearman correlation matrices.
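For readers unfamiliar with the input to these models, the following sketch shows how a Spearman correlation matrix can be computed: each variable is converted to ranks (ties receive the average rank), and Pearson correlations are then taken on the ranks. It is a plain illustration of the statistic, not the CALIS estimation itself; the function names and the toy data are hypothetical.

```python
from statistics import mean


def average_ranks(values):
    """Return 1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        tied_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = tied_rank
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation of two equal-length lists."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5


def spearman(x, y):
    """Spearman correlation: Pearson correlation computed on ranks."""
    return pearson(average_ranks(x), average_ranks(y))


def spearman_matrix(columns):
    """Build a nested dict of pairwise Spearman correlations."""
    names = list(columns)
    return {a: {b: spearman(columns[a], columns[b]) for b in names} for a in names}


# Hypothetical toy data (not SHARE values).
data = {
    "age": [65, 70, 72, 80, 85],
    "symptoms": [1, 2, 2, 5, 4],
    "depression": [0, 3, 2, 6, 5],
}
matrix = spearman_matrix(data)
print(round(matrix["age"]["symptoms"], 2))
```

Rank-based correlations suit count and ordinal measures such as symptom counts because they do not assume interval scaling or normality.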
Nonsignificant path coefficients were retained in the models to compare the effects of different proxy variables across models. Although all β weights for the design variables and the constructs were obtained through SEM, path coefficients for the constructs and indicators were generated with general linear models. This two-step estimation process was conducted to stabilize the variance when the relationship between design variables and the constructs was estimated.
Results
The physical health and physical and mental impairment constructs had item-level α values of .81 and .71, respectively. Figures 1–3 show the path coefficients of each item to its corresponding construct. These coefficients range from .74 to .85 for the physical health construct and are .91 for both indicators of the physical and mental impairment construct.
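The α values reported above are internal consistency coefficients. As a reminder of what such a coefficient summarizes, the sketch below computes Cronbach's α from item-level scores using the standard formula α = k/(k − 1) × (1 − Σ item variances / variance of the total score). The function name and toy scores are hypothetical, and the article does not say which software routine produced its α values.

```python
from statistics import variance


def cronbach_alpha(items):
    """Cronbach's alpha for a list of item columns.

    Each element of `items` holds one item's scores for the same
    respondents, in the same order.
    """
    k = len(items)
    totals = [sum(row) for row in zip(*items)]
    summed_item_variance = sum(variance(col) for col in items)
    return k / (k - 1) * (1 - summed_item_variance / variance(totals))


# Hypothetical toy scores for five respondents on three items.
toy_items = [
    [1, 2, 3, 4, 5],
    [2, 2, 3, 5, 5],
    [1, 3, 3, 4, 4],
]
alpha = cronbach_alpha(toy_items)
print(round(alpha, 2))
```

α rises as the items covary more strongly relative to their individual variances, which is why it is read as evidence that the items tap a common construct.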
Number of symptoms bothering the respondent had a .70 correlation with the physical health construct and a .65 correlation with the physical and mental impairment construct, confirming its strength as a good proxy variable. Number of doctor contacts in the last year had a .50 correlation with the physical health construct and a .28 correlation with the physical and mental impairment construct, confirming it as a moderate to poor proxy variable. The general health latent variable of these proxy variables had a β coefficient of .66 with the physical health construct and .36 with the physical and mental impairment construct.
Good Representation of a Proxy Variable: Number of Symptoms Bothered By
The model using the number of symptoms bothering the respondent as a proxy for the latent constructs of interest had a strong model fit (GFI = .996, AGFI = .99, NFI = .98; Table 1). This variable rendered the relationship between age and depression spurious (β = −.01). Figure 1 shows the strong direct effect of symptoms bothered by on depression (β = .60) and a moderate effect of symptoms bothered by on age (β = .34). The physical health construct had a smaller effect on symptoms bothered by (β = .43) than did the physical and mental impairment construct (β = .65). The R² was .35 for depression, .12 for age, and .95 for symptoms bothered by. Physical health had an indirect impact (β) on depression of about .26 (the product of the two associated pathway coefficients), and physical and mental impairment had an indirect impact (β) of about .39.
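The indirect effects reported for this model follow from multiplying the standardized coefficients along each pathway from construct to proxy to depression. The short sketch below reproduces that arithmetic with the coefficients given above; the function and variable names are illustrative only.

```python
def indirect_effect(*path_coefficients):
    """Multiply standardized path coefficients along one indirect pathway."""
    effect = 1.0
    for coefficient in path_coefficients:
        effect *= coefficient
    return effect


# Standardized coefficients reported for the good proxy model.
health_to_symptoms = 0.43
impairment_to_symptoms = 0.65
symptoms_to_depression = 0.60

via_health = indirect_effect(health_to_symptoms, symptoms_to_depression)
via_impairment = indirect_effect(impairment_to_symptoms, symptoms_to_depression)

print(round(via_health, 2))      # about .26
print(round(via_impairment, 2))  # about .39
```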
Table 1. Fit Indices for Structural Equation Models.
Note. AGFI = adjusted goodness-of-fit index; GFI = goodness-of-fit index; NFI = normed fit index; SRMR = standardized root-mean-square residual.
Poor Representation of Proxy Variable: Number of Doctor Contacts
The number of contacts with a doctor in the past year, used as a proxy, yielded much poorer model fit than the symptoms count proxy (GFI = .98, AGFI = .91, NFI = .87; Table 1). Figure 2 shows that the relationship between age and depression was much larger (β = .14), reflecting this proxy’s inability to remove adequate confounding variance. This relationship was .15 standardized units lower in the good proxy model. The direct effects of doctor visits on depression (β = .23) and age (β = .20) were much smaller than the corresponding effects in the good proxy model. The path coefficient from the proxy to depression in the good proxy model was almost 3 times as large as that in the poor proxy model. The designation of doctor visits as a poor proxy is supported by the small direct effect from the physical and mental impairment construct (β = −.06). The direct effect from the physical health construct on doctor visits, however, was moderate to strong (β = .64), indicating that simply being correlated with a construct does not make a variable a good proxy. The R² was .09 for depression and .04 for age, indicating a lack of adequate explanatory value in this model. The R² was .37 for the number of doctor contacts. Physical health continued to indirectly impact depression (β ≈ .17; the sum of the product of two indirect pathway coefficients), while the indirect impact of physical and mental impairment all but disappeared (β ≈ .02).
The general health latent variable composed of these proxy measures yielded metrics similar to those of the number of symptoms bothered by model (Figure 3). Model fit indices showed strong fit to the data (GFI = .996, AGFI = .99, NFI = .98; Table 1). The direct effect from age to depression was close to zero (β = −.06), suggesting that this proxy variable was effective in removing the confounding variance of the health-related constructs. The direct effects of the latent proxy on depression (β = .61) and age (β = .40) were similar to those in the strong proxy model. The effects from the physical health construct to the composite proxy (β = .40) and from the impairment construct to the latent proxy (β = .69) were also similar to those in the good proxy model. Physical health had a noticeable indirect impact on depression (β ≈ .40) and physical and mental impairment had a moderate indirect impact (β ≈ .22). The R² was .35 for depression and .17 for age, indicating that the variance from the good proxy variable was still influential in explaining variance in these two outcomes. The R² was .87 for the general health latent proxy variable.
Table 2 comprises a summary of the β weights of the pathways estimated from models with different proxy variables. It is clear that the better a proxy variable represents the health and impairment constructs required for adequate statistical control, the smaller the β estimates for the primary hypothesis become. From this table, and noted correlations of each proxy to the constructs they are intended to represent, we are able to quantify the effect of using imperfect proxy variables with these models.
Table 2. Summary of Path Coefficients (β) Across the Two Models Differing in Quality of Proxy Variable.
Discussion
Investigators use proxy variables when they do not have access to an intended variable, and the proxy variable is correlated with this intended variable. When specifying a good proxy variable, investigators must understand the dimensions of the construct they wish to control so that this proxy captures the essential variance, beyond simple correlation, required for statistical control.
The current demonstration illustrates that a proxy that does not capture this variance leads to inaccurate path coefficients. The poor proxy in this study had relatively smaller path coefficients to the main design variables. As a result, the β for the main hypothesis of the direct effect of age on depression was much higher compared to the strong proxy variable, which rendered the relationship between age and depression virtually spurious.
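The pattern described above can be illustrated with a closed-form sketch that is much simpler than the SEMs estimated in this study. Assume, hypothetically, that a standardized construct correlates a with the design variable and b with the outcome, that there is no direct effect between design variable and outcome, and that the proxy relates to both only through its correlation with the construct (its validity). The standardized coefficient for the design variable after controlling for the proxy is then the usual two-predictor formula, β = (r_xy − r_xp r_yp) / (1 − r_xp²). All names and values below are assumptions for illustration, not estimates from the SHARE data.

```python
def residual_beta(a, b, validity):
    """Standardized coefficient of the design variable on the outcome
    after controlling for a proxy, under the simplified setup above.

    a        -- correlation between construct and design variable
    b        -- correlation between construct and outcome
    validity -- correlation between proxy and construct
    """
    r_xy = a * b          # entirely spurious: induced by the construct
    r_xp = a * validity   # design variable with proxy
    r_yp = b * validity   # outcome with proxy
    return (r_xy - r_xp * r_yp) / (1 - r_xp ** 2)


# Hypothetical values: construct correlates .5 with the design variable
# and .6 with the outcome; proxies of increasing validity are compared.
for validity in (0.0, 0.3, 0.7, 1.0):
    print(validity, round(residual_beta(0.5, 0.6, validity), 3))
```

Under these assumptions, a proxy with zero validity leaves the full spurious association in place, a perfect proxy removes it, and intermediate proxies leave a residual coefficient that shrinks as validity increases.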
As shown, the number of doctor contacts is related to the physical health construct. However, the model fit is clearly inferior to the model using the number of symptoms bothered by. This finding speaks to the idea that there are multiple dimensions of controlling constructs. The number of symptoms bothered by had moderate to strong path coefficients for both the physical health and the physical and mental impairment constructs. As such, there were moderate indirect effects of these constructs on the primary outcome of depression. However, despite the fact that the number of doctor contacts was related to the physical health construct that needed to be modeled as a confound, it did not adequately capture the variance in the two constructs that needed to be removed from the relationship between age and depression. This finding is demonstrated by the failure of this poor proxy variable to control for the confounding variance in the direct effect of age on depression, as well as the virtual nonexistence of an indirect effect of the physical and mental impairment construct on depression, an effect that was present with the good proxy variable.
The model using the general health latent proxy variable produced quality model fit, likely because it captured the explanatory variance from the good proxy variable, while also containing variance from an additional dimension of health status from the poor proxy variable. This additional dimension from the poor proxy bolstered the indirect effect of the physical health construct on depression from .26 in the good proxy model to .40 in the model using the general health latent construct as a proxy variable.
The inadequacy of the number of doctor contacts in capturing specific physical health variance may result from the large variability in health problems, and in the effects of these problems, among aging adults. Brandt et al. (2012) identified a number of childhood SES and income factors relating to successful aging in older Europeans. This research supports the current finding that leading a functional life in late adulthood can be determined by multidimensional variables. It is, therefore, essential to identify what elements of age are associated with certain outcomes. What made the number of symptoms bothered by a much better proxy variable than the number of doctor visits is that it accounted for and removed specific health variance from the measured association between the two design variables. Ladin (2008) made a similar point in noting that, although education and income are frequently correlated, each variable has unique components that allow it to link to and predict various outcomes. This finding highlights the importance of accurately specifying the essence of what control variables are intended to represent, beyond simple intervariable correlations.
The presence of multidimensional constructs needing control may require large numbers of variables. In addition to the problem of measuring these variables, sample sizes may limit the power to effectively model an ecosystem of relevant variables. While the composite proxy variable in this study produced slightly smaller path coefficients to depression and from the physical health construct, it did show the ability to capture multidimensional variance. Even though it lost some variance from the good proxy variable, this composite was still capable of explaining quality variance in the outcomes. Composite variables of multiple proxy variables can be a solution to include multiple dimensions of variance without sacrificing statistical power.
When scientific problems call for an estimate of an underlying construct, and this construct is not adequately represented by a proxy variable, then investigators must concede that their parameter estimates may be biased. Golden et al. (1982), estimating the relationship between the intelligence of brain-damaged patients and degree of brain damage, determined that measures of predamage intelligence were required for adequate statistical control. In the absence of any possible measure of predamage intelligence, the authors used predamage education as a proxy. Their response to use this measure is appealing, in that they not only recognized education as an imperfect proxy but they also assessed how it would likely overestimate the relationship between measured intelligence in brain-damaged patients and degree of affliction. This type of responsible qualification is necessary when we work with limited data.
Realizing the underlying influences that interact in a network of variables is essential for accurately choosing relevant proxy variables and specifying models. This understanding will allow us to quantify the correction necessary to estimate and adjust for the effect of an imperfect proxy variable. Furthermore, we can better examine if proxy variables are pulling in additional, unwanted variance if we are clear on how the proxy variable relates to the construct of interest.
Proxy variables are necessary because we do not have unlimited access to relevant data, even if we are designing prospective studies. It is therefore very likely that any proxy variable we choose will be limited. Careful selection of this variable and some quantification of its limitation relative to its relationship with other model variables will help investigators qualify any conclusions and guide prospective methodologies. This approach can help researchers working with archival data because they can better evaluate the relevance of available data to their models. If quality proxies are not available in data repositories, investigators can estimate the effect of inadequate statistical control using published work.
Detailing the relationships between latent constructs and measured variables should help our analytic methods become concordant with our conceptual hypotheses. We are, therefore, getting more accurate estimates by precisely modeling relationships that are scientifically credible and hypothesized to exist. Ultimately, this adds up to getting more information out of our data.
Limitations
This work proceeded under the assumption that the good proxy model was a gold standard. This, of course, was not the case, because of the limitations of the data available in the SHARE dataset. The estimates of the degree to which imperfect proxy variables over- or underestimated path coefficients, although accurate in a relative sense, were made relative to a model that was not complete. This inability to obtain a true good proxy model speaks to a central point in this work, namely, that validity coefficients of true underlying constructs of interest are rarely known.
Another limitation was the small range of correlations between the latent constructs and the proxy variables. This restricted range led, in part, to path coefficients that did not always differ dramatically when poor and moderate proxies were used (as discussed above, a large reason for this lack of difference is that the desired variance was simply not removed by the proxy variable). It would have been ideal, at least for demonstration purposes, to have proxy variables whose correlations with the latent traits ranged from about .10 to .85.
Finally, the data from the fourth wave of SHARE are no longer current. However, since the purpose of this research is to demonstrate the statistical effects of different model specifications and not to draw substantive conclusions about mental health, the age of these SHARE data should not affect the methodological principles studied.
Acknowledgments
The author wishes to thank Lee Sechrest, Richard Bootzin, Aurelio José Figueredo, W. Jake Jacobs, and Heather York for their detailed review and feedback on this work.
Declaration of Conflicting Interests
The author(s) declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.
Funding
The author(s) received no financial support for the research, authorship, and/or publication of this article.
