Abstract
Factor analysis is a commonly used method of modelling correlated multivariate exposure data. Typically, the measurement model is assumed to have constant factor loadings. However, from our preliminary analyses of the Environmental Protection Agency's (EPA's) PM2.5 fine speciation data, we have observed that the factor loadings for four constituents change considerably in stratified analyses. Since invariance of factor loadings is a prerequisite for valid comparison of the underlying latent variables, we propose a factor model that includes non-constant factor loadings that change over time and space using P-spline penalized with the generalized cross-validation (GCV) criterion. The model is implemented using the Expectation-Maximization (EM) algorithm and we select the multiple spline smoothing parameters by minimizing the GCV criterion with Newton's method during each iteration of the EM algorithm. The algorithm is applied to a one-factor model that includes four constituents. Through bootstrap confidence bands, we find that the factor loading for total nitrate changes across seasons and geographic regions.
Keywords
Introduction
Fine particulate matter that is, air particles with diameter <2.5 micrometers (PM2.5), is associated with adverse health effects (e.g., Dominici et al., 2006; Schwartz et al., 1996). (Laden et al., (2000)) suggested PM2.5 from different sources has different toxic effects, implying the importance of the chemical profile of the constituents of PM2.5. Because the constituents are typically correlated, studies that analyze the relationship between constituents and health outcomes often use methods that summarize the correlated constituents, for example, using clustering (Austin et al., 2013) or latent variable models (Gryparis et al., 2007; Nikolov et al., 2008). In particular, widely used source apportionment methods (Thurston et al., 2011) use factor analytic models to summarize constituent data in terms of their relationship to air pollution sources, such as power plant or vehicle exhaust, that can be potentially regulated. Since the air pollution sources are not directly observed, source contributions are treated as latent factors and estimated from the data.
Array of heat maps shows how the Pearson correlation among the four PM2.5 constituents (log-transformed) change across different strata defined by region (row) and time period (column). The column headers denotes the year (Yr1: 2008, Yr2: 2009) and the quarter (Qtr) within that year. The row headers indicate the three US regions (also see Section 5 and Figure 5). The strength of correlation is represented with different shades of green. We can see that the differences in correlations are primarily between total nitrate and the other constituents.
Array of heat maps shows how the Pearson correlation among the four PM2.5 constituents (log-transformed) change across different strata defined by region (row) and time period (column). The column headers denotes the year (Yr1: 2008, Yr2: 2009) and the quarter (Qtr) within that year. The row headers indicate the three US regions (also see Section 5 and Figure 5). The strength of correlation is represented with different shades of green. We can see that the differences in correlations are primarily between total nitrate and the other constituents.
However, methods to summarize PM2.5 constituents may yield inconsistent summaries when applied to large geographic areas or long time periods. Such summaries depend on the correlation among constituents and correlations among constituents may vary with time and space (Austin et al., 2013; Bell et al., 2007). For instance, our preliminary anaylses of PM2.5 constituent data issued by the Environmental Protection Agency (EPA) Air Quality System (AQS) for years 2008 and 2009 (see Section 5 for more details on the data) suggest that the correlation among constituents related to secondary inorganic aerosols (Salvador et al., 2004; Thurston et al., 2011) varies with season and geographic location (Figure 1). Nitrate exhibits the greatest differences in terms of its correlation with other constituents, both across regions and across time within region. As a result, it is not surprising that preliminary factor analyses of this constituent data, stratified by time and region, yield factor loadings that differ across time and region for some constituents. In particular, the factor loading for total nitrate (Figure 2a) demonstrates large differences across regions and time. The estimated factor loadings tend to be larger in the winter months, implying a season effect. Furthermore, the estimated factor loadings are significantly higher in the midwest region, compared to other regions, for all periods except the first quarter of 2008. By contrast, the factor loading for ammonium ion exhibits fewer and less pronounced differences across either region or time (Figure 2b).
(a) and (b) compare the factor loadings for a particular constituent from stratified confirmatory factor analyses. The
-axis shows the size of the factor loading. Each dot on the plot is the factor loading from analysis of one stratum. The midwest, northeast and south regions are represented by solid, dashed and dotted lines, respectively. These results suggest that there is temporal and spatial difference for the factor loading of total nitrate (a). In comparison, ammonium ion (b) has a rather constant factor loading as the confidence intervals (represented as error bars) of the different estimates mostly overlap.
Differences in the factor model structure are not specifically modelled in current literature that uses factor analytic models applied to constituent data collected across large geographical areas. Although many models have been proposed to reduce data dimension and to predict outcomes based on many constituents simultaneously (Austin et al., 2013; Choi et al., 2009; Peng et al., 2009), a more thorough understanding of how the correlations among PM2.5 constituents change with time and space may allow us to gain further insights on the validity and interpretation of such dimension-reduction approaches and thus potentially improve modelling the association between constituents and health outcomes.
There is now a large literature on building models that analyze air pollution data from the perspective of time and space (e.g., Berrocal et al., 2010; Daniels et al., 2006; Huerta et al., 2004; Shaddick and Wakefield, 2002; Sahu and Mardia, 2005 to name a few). However, we focus on models utilizing latent factors, given their connection to air pollution sources (Nikolov et al., 2008; 2011) and because some of these models can be applied to understanding temporal and spatial patterns in the correlation of PM2.5 constituents specifically. For example, (Gryparis et al., (2007)) proposed semiparametric latent variable regression models for modelling multiple surrogates of a single pollution source. They used penalized splines to allow non-linear effects of covariates on the latent variable mean and used geoadditive models to account for the temporal and spatial correlations in the latent variable (Kammann and Wand, 2003). (Lopes et al., (2008)) used a spatial dynamic factor model to incorporate temporal and spatial correlations among observations of a single pollutant from different monitors over a certain time period. In that model, the spatial correlation was introduced through the factor loading matrix, that is, at a given time-point, monitors were treated as different variables. They modelled the temporal correlation among the latent factors using an autoregressive structure. (Ippoliti et al., (2012)) used a similar dynamic factor model but they coupled two sets of latent factors so that one latent factor could be predicted from the other.
More generally, models using latent variables have gained much popularity in research that relates environmental exposures to health outcomes because they can reduce the dimension of the data by summarizing correlated exposure measurements (see Sánchez et al., 2005). In the last decade, much work has been done to relax model assumptions, including nonlinear relationships among latent variables (Arminger, 1998; Guo et al., 2012; Lee et al., 2003; Lu et al., 2012; Song and Lu, 2010; Zhu and Lee, 1999), nonlinear links relating items to latent variables (Bauer, 2005; Gryparis et al., 2007), as well as models with spatial structures on the latent variables (Fahrmeir and Raach, 2007; Hewson and Bailey, 2010; Liu et al., 2005).
However, except for (Zhang and Lee, (2009)) and (Valentini et al., (2013)), none of these existing advanced models specifically address the variation in the factor loadings exhibited among the constituent data (Figure 2a). Traditionally, the measurement model that links the observed items to the latent variables is assumed constant across the covariate space (e.g., across time and geographic location). A constant measurement model, also termed measurement model invariance (Horn and McArdle, 1992; Meredith, 1993), is needed for valid comparison of differences in latent variable means across covariate levels or groups. However, it has been shown that allowing the measurement model to vary across values of categorical covariates may be desirable when estimating exposure–health outcome associations, potentially because the flexible measurement model enables better estimation of the latent factor itself (Sánchez et al., 2012; Tao et al., 2015). Allowing the measurement model to vary across continuous covariates, however, is a more challenging problem. (Zhang and Lee (2009)) used a local maximum likelihood-based estimation procedure to model the factor loadings as varying with time. (Valentini et al., (2013)) gave the factor loadings a spatial structure and they modelled the factor loadings as conditionally independent multivariate Gaussian Markov Random Fields.
In this article, we propose a semiparametric factor analysis model with non-constant factor loadings that change with multiple covariates, in particular time and space as motivated by the EPA's PM2.5 constituents data. We use penalized splines to estimate factor loadings that change with multiple covariates. This construction for the factor loadings bears a resemblance to the classic varying coefficients model (Hastie and Tibshirani, 1993). In Section 2, we propose the factor model that incorporates non-constant factor loadings, while Section 3, explains the estimation procedure. A key feature of our estimation algorithm is to include the generalized cross-validation (GCV) score within the EM iterations so that we can optimize the smoothing of the splines in a latent variable setting. In Section 4, we present a simulation study to assess the algorithm and estimation properties for certain model parameters and the estimated factor scores. We then analyze the constituent data set and draw inferences based on bootstrap approaches in Section 5. Section 6 ends with a discussion and future directions.
Let
We assume that the elements of
In classical latent variable models, the factor loadings are estimated as constant parameters or sometimes allowed to differ across levels of a categorical variable (Horn and McArdle, 1992). However, in our proposed model, factor loadings are allowed to vary smoothly with time and/or location, resembling a varying coefficients model (Hastie and Tibshirani, 1993).
We model the non-constant factor loadings as a smooth function of
We represent the non-constant factor loadings using penalized splines. Specifically, we construct univariate cubic B-spline basis functions for each of the covariates,
We use the squared first difference penalty (Eilers and Marx, 1996) to control the degrees of freedom of the smooth, which yields the desirable property of shrinking the smooth to the intercept
We use the EM algorithm to facilitate the estimation, where the complete data structure includes the observed data
Using the assumption of a normal distribution and omitting the proportionality constant, the complete data log-likelihood is
E-step
Even though we are to maximize the penalized likelihood as in equation (3.1), the penalty does not affect the E-step because the penalty does not involve the missing latent variable. The expected values of
M-step
In the M-step, we estimate
For each observed variable
So far, we have treated the smoothing parameters
Since the latent variable is unknown and is itself to be computed from
We minimize
We use bootstrap resampling to construct a
For single point estimates, such as
Simulation study
We conduct simulation studies to examine the advantages and disadvantages of this novel estimation method, as well as to verify that the estimation algorithm can recover model parameters. We first examine whether the algorithm can recover the true parameter values when the factor loadings vary with several covariates. For this purpose, we simulate 1000 data sets where
Next, we further investigate how biases can enter various components of the model across different simulation scenarios where the factor loadings take on different functional forms. For this study, we let the factor loadings be functions of only one covariate. This simpler setting can enhance computational efficiency while still illustrating the impact of various estimating approaches under different scenarios. Specifically, we investigate how using non-constant factor loadings affects the estimation of model parameters and the factor scores. Examining estimation of factor scores gives insight as to how estimated source contributions in air pollution studies may be affected, for instance.
The simulations are set up with three different shapes for the true factor loadings. We let
Constant:
In the formulas,
Shape of
for the simulation study. Each plot shows three curves differentiated by line styles when a scenario has three observed variables and the amplitude parameter
.
Shape of
for the simulation study. Each plot shows three curves differentiated by line styles when a scenario has three observed variables and the amplitude parameter
.
For each of the three shapes of factor loadings listed above, we hold some parameters constant, but we change the signal-to-noise ratio in the model as follows. For all the scenarios, we set
Bias and standard deviation (in parentheses) of
,
for different simulation scenarios and estimating approaches.
Bias and mean squared error (MSE) of the factor score for different simulation scenarios and estimating approaches. IBIAS and IMSE show the bias and MSE integrated over the entire distribution of the latent variable
; the remaining columns show the bias and MSE integrated within ranges of the distribution of
, for example, the ‘
’ column shows the bias at the lower tail of the distribution of
. The bias and MSE were symmetric, hence only columns for the lower half of the distribution are shown. As the values show, bias and MSE are larger at the tails of the distribution of
.
Simulation results showing plots of
against
when factor loadings are estimated as constant or non-constant, when true factor loadings are non-constant. The marker symbol distinguishes
by tertiles of
, where circle, diamond and triangle refer to the lower, middle and upper tertile, respectively. The solid line is where (
,
) falls if
.
For each scenario, we generate 1000 simulated data sets, each with 1000 equally spaced data points within the range of
We report the simulation results primarily focusing on scenarios where (a) each simulated data set has 1000 equally spaced data points within the range of
We look at the results of our simulation study from two aspects. First, we compare the bias and standard error of
Next, we compare how estimation approaches affect the factor score (Table 2). We define factor score as
The overall average bias and MSE for the factor scores are similar across estimation approaches; however, differences emerge when we examine bias by percentiles of the distribution of the true latent variable and when we examine the factor scores at different covariate values
In scenarios where the variance in the observed variables is smaller (i.e.,
We applied our model to four constituents from the US PM2.5 speciation data issued by EPA for years 2008 and 2009. The data set we used included 108 monitors stationed in part of the midwest, south and northeast (see Figure 5). Thirty-six monitors had measurements in every three days when they were functioning, but the majority had measurements only in every six days. Thus, we took every other observation from the thirty–six monitors that had measurements in every three days so that data from all monitors were equally spaced.
Many of the constituents in the EPA data were zero inflated and would require special treatment beyond the scope of this article. Thus, we selected constituents that, after taking the logarithm, had an approximate normal distribution. Based on the exploratory factor analysis (see Section 1 and Figure 1), we decided to look at four constituents, sulphur, ammonium ion, total nitrate, sulphate, that are associated with secondary aerosols (Nikolov et al., 2008) and fit into a one-factor structure.
Since the concentration of constituents change across time and space, the mean of each observed variable (
The factor loadings were initially modelled according to equation (2.2), but the interaction term
Table 3 summarizes the estimates of all the parameters in the model except the spline coefficients. (We do not list
Parameter estimates for four PM2.5 constituent data.
Parameter estimates for four PM2.5 constituent data.
(a) is the contour plot of
for the factor loading of total nitrate. The bigger scattered dots are the monitor locations and the smaller arrayed dots represent areas that are significantly different from zero according to the 95% confidence band constructed using bootstrap. (b) is
, and 95% confidence band, for the factor loading of total nitrate
(
= 0 is 1 January of 2008 and
= 2 is 31 December of 2009).
Factor scores from two models: (a) all factor loadings estimated as constant parameters (y-axis), (b) all factor loadings estimated as non-constant (x-axis). Factor scores representing observations from different seasons and regions are plotted as follows: warmer regions in cold months (circle), colder regions in cold months (diamond), warmer regions in warm months (square) and colder regions in warm months (triangle); warm months refer to May, June, July, August; cold months refer to November, December, January, February; warmer regions refer to Mississippi, Georgia, Alabama and colder regions refer to the midwest except Missouri.
We propose a latent variable model where the factor loadings are allowed to vary across continuous covariate values. This modelling approach extends classical factor analysis models and can help shed light into how the relationship between measured variables and underlying constructs may be modified by covariates, and thus can help improve estimation of underlying latent variables. This model is motivated by analysis of PM2.5 data that showed a noticeable difference for some of the factor loadings across time and region. Variations of factor analysis models have long been used to summarize PM2.5 constituents and gain insight into the sources the particles arose from, a process termed source apportionment. However, such data reduction methods have typically been applied to smaller geographical areas. Factor structures that can better capture the temporal and spatially varying correlations among constituents can help improve summaries of PM2.5 constituents collected from a wide geographical area and time span. Our model allows us to capture time and space-varying correlations among constituents through the temporal and spatial pattern of factor loadings while maintaining a connection to interpetable sources.
We use the penalized regression spline to model factor loadings as functions of multiple covariates and we control the smoothness related to each covariate with its own smoothing parameter. In order to optimize the smoothing parameters in the context of a latent variable model, we propose the use of GCV within each iteration of the EM algorithm. This bypasses the issue of lack of predictors for the marginal distribution of the model by using the latent variable as the predictor in a single M-step. We also incorporate Newton's method into the EM iterations to facilitate the optimization of multiple smoothing parameters at the same time. The algorithm is successfully implemented in SAS (available as supplementary material), thus enhancing applicability of our method.
Alternatively, we can penalize the spline by treating the spline coefficients as random variables, similar to the likelihood approach used for smoothing parameter selection in additive models (Ruppert et al., 2003; Wood, 2006). Under this approach, the variance of the random coefficients acts as the smoothing parameter and it can be estimated using restricted maximum likelihood (REML). (Reiss and Ogden, 2009) discusses the advantage of REML over GCV based on a theoretical comparison. However, this approach would be more computationally intensive for a latent factor model because the likelihood does not have a closed form.
In this article, we use bootstrap to draw inference, which can be computationally intensive. Future work can include developing methods to test the significance of different component functions of the non-constant factor loading more efficiently. In our model, we assume that the random errors in equation (2.1) are uncorrelated. In our data application, we de-trended the data before applying our modelling strategy to remove spatial and temporal autocorrelation using GAM. Conditional residuals from the model had negligible correlation, suggesting that an independent error structure may be sufficient for the de-trended data that we have analyzed. Incorporating alternative correlation structures is a desirable next step to avoid the two-step approach we have used. Alternatively, as in the work of (Gryparis et al. (2007)), the mean trend of the latent variable could also be used to capture the temporal and spatial correlation; this extension could also be used to readily predict constituents in space and time, by first predicting the latent variable at a given time and location.
In our current data analysis, we have modelled every constituent with a penalized regression spline and then assessed whether it is invariant across time and space. Estimating each spline with its own set of smoothing parameters is demanding, although feasible, computationally. It may also be less efficient to build up a model with this much complexity when there are a considerable number of observed items. However, the non-constant factor loadings for many of the constituents may potentially change in similar patterns, due to their common source. Therefore, one potential extension to further improve estimation efficiency is to model multiple non-constant factor loadings with fewer number of curves that capture the essential shape (Kneip and Gasser, 1988; Jiang et al., 2013). The extension of our model to more than one latent variable would be fairly straightforward after proper identifiability constraints are applied. The model can also easily accommodate data missing at random, since it is likelihood based. However, a potentially more challenging problem is the case when observed variables are not measured simultaneously at all locations, that is, missing by design.
The model and algorithm implemented in this article provide a platform for developing more comprehensive measurement models that can allow us to apply flexible dimension reduction strategies. The measurement model for the factor analysis that we have proposed can incorporate known information (such as well-understood sources of air pollution) while learning from the data by allowing for non-constant factor loadings.
Acknowledgments
The authors acknowledge support from: NIH grants R01ES016932; R01ES017022; P01ES022844; P20ES018171; UM NIEHS Core Center P30 ES017885 and EPA grants RD834800 and RD83543601. The contents of this study are solely the responsibility of the authors and do not necessarily represent the official views of the The National Institute of Environmental Health Sciences (NIEHS) or the National Institute of Health or EPA.
Appendix
,
and their 95% confidence bands. (a), (b) respectively show
,
as contour plots. (c) plots
(circles),
(solid line) and its confidence band (shaded area). Results for the other
,
are similar.
Bias and standard error (in parentheses) of
,
,
for simulations where the factor loadings change with three covariates.
Mean integrated squared error (MISE) of
for different simulation scenarios and estimating approaches.
