Generalized multiple indicators, multiple causes measurement error models

Abstract

Generalized Multiple Indicators, Multiple Causes Measurement Error Models (G-MIMIC ME) can be used to study the effects of an unobservable latent variable on a set of outcomes when the causes of the latent variable are unobserved. The errors associated with the unobserved causal variables can be due to either recall bias or day-to-day variability. Another potential source of error, the Berkson error, is due to individual variations that arise from the assignment of group data to individual subjects. In this article, we accomplish the following: (a) extend the classical linear MIMIC models to allow both Berkson and classical measurement errors where the distributions of the outcome variables belong to the exponential family, (b) develop likelihood-based estimation methods using the MC-EM algorithm and (c) estimate the variance of the classical measurement error associated with the approximation of the amount of radiation dose received by atomic bomb survivors at the time of their exposure. The G-MIMIC ME model is applied to study the effect of genetic damage, a latent construct based on exposure to radiation, and the effect of radiation dose on physical indicators of genetic damage.

Keywords

atomic bomb survivor data; Berkson error; generalized linear models; instrumental variables; measurement error; MIMIC models

1 Introduction

The biological effects of ionizing radiation on exposed body organs vary depending on the dosage received at the time of exposure and also on the organ in question. Ionizing radiation can have several effects on living cells, including damage to the cells themselves or to the DNA within the cell's nucleus (Awa, 1997). Cells have a natural repair mechanism whereby the initial damage due to the exposure can result in their self-repair, after which they may function either normally or abnormally. Characteristics of cellular damage due to exposure to ionizing radiation include the deletion of segments of the DNA. In studying the effects of ionizing radiation on genetic damage among subjects exposed to high levels of radiation doses, such as a subset of the atomic bomb survivors, genetic damage can be seen as an underlying latent construct: it is not directly observable but can be measured by various indicators or assays. Some of these assays include the chromosome aberrations (CA) and the glycophorin A (GPA) assays. Chromosome aberrations are often used as biomarkers for genetic damage, where the damage to the chromosome includes changes in the number of chromosomes on each DNA strand or changes to the structure of the chromosome (Awa, 1997; Fenech, 2002; Rana et al., 2010). Some of these structural abnormalities result from the breakage of chromosomes or the incorrect joining of the resulting fragments during the repair mechanisms (Awa, 1997). Another indicator of genetic damage is the GPA assay (Grant and Bigbee, 1993). GPA is a molecule found in mature red blood cells among individuals with M/N blood types. The GPA assay has been considered a biodosimeter for exposure to ionizing radiation and has been recommended as an intermediate biomarker for genetic damage among individuals chronically exposed to radiation, where it is considered a ‘powerful cumulative biodosimeter’ (Grant and Bigbee, 1993). The assay measures the number of variant cells that have lost expression of the M allele in blood samples from subjects with heterozygous M/N alleles (Grant and Bigbee, 1993). It has also been used as a biomarker for individual DNA repair capacity and cancer risk following genetic or DNA damage (Kyoizumi et al., 2005).

In studying the effects of ionizing radiation on indicators of genetic damage, a variation of the multiple indicators, multiple causes measurement error (MIMIC ME) (Tekwe et al., 2014) models can be applied. MIMIC ME models are useful for studying the effects of an underlying latent construct on its multiple outcomes or indicators when some of its causes are measured with error in linear settings. For example, the true ionizing radiation dose received by atomic bomb survivors at the time of their exposure is unknown but is measured by the Dosimetry System 2002 (DS02), a physical dosimetry system developed by the Radiation Effects Research Foundation (RERF) for estimating true radiation doses. DS02 is based on the survivor-reported distance and shielding at the time of exposure. Once obtained, the self-reported location data are placed on a map of the city, and individuals reporting similar distances from the hypocenter are assigned distances based on the grid in which their reported distance falls. All survivors whose self-reported distances fall within the same grid are assigned the averaged DS02 calculated for the grid based on the distance associated with the centre of the grid. The self-reported measures of location and shielding introduce classical measurement error into the dosimetry system, while the Berkson error is introduced because all individuals within each grid are assigned the same DS02 values based on the averaged distance. The Berkson error is an averaging error that occurs when individuals with similar characteristics are assigned an average value of a measurement rather than their individual measured value.

It is well known that one of the effects of exposure to ionizing radiation is genetic damage. Genetic damage due to ionizing radiation is not directly observable; however, indicators such as the GPA and CA assays are used to estimate its physical manifestation. In this manuscript, we define the generalized multiple indicators, multiple causes measurement error (G-MIMIC ME) models, which allow the relationship between a latent construct and its multiple outcomes to be of the generalized linear form while one of the causal variables for the underlying latent construct is measured with error. The defined G-MIMIC ME model allows a mixture of both Berkson and classical measurement errors.

The outline of the article is as follows: In Sections 2 and 3, we define the G-MIMIC ME model and discuss its estimation, respectively. The application of the model to our motivating example and the results are presented in Sections 4 and 5. The results from a brief simulation study are provided in Section 6 and some concluding remarks can be found in Section 7.

2 Generalized multiple indicators, multiple causes measurement error models

The G-MIMIC ME model extends the classical linear MIMIC model (Zellner, 1970; Goldberger, 1972; Joreskog and Goldberger, 1975; Skrondal and Rabe-Hesketh, 2004) to the generalized linear model setting with a mixture of both Berkson and classical measurement errors. The defined model is a structural equation model with multiple causes including error-free covariates and a single true covariate measured with error. It is a combination of a confirmatory factor analysis with a common underlying factor or latent construct and path models with covariates in the generalized linear setting. The use of latent constructs is prevalent in fields such as the social sciences or medicine. For example, in medical terminology, syndromes are not directly observable. However, they are defined based on their multiple indicators which occur simultaneously to be defined collectively as a syndrome. The G-MIMIC ME model with the subscript for the $i^{th}$ subject is specified as follows:

\begin{matrix}
E (Y_{ij}) & = & g_{j}^{- 1} (X_{i} β_{1 j} + β_{2 j} T_{i} + Z_{ij}^{T} γ_{j}); & (2.1) \\
T_{i} & = & X_{i} α_{1} + Z_{i}^{T} α_{2} + η_{i}; & (2.2) \\
X_{i} & = & L_{i} + V_{i}; & (2.3) \\
W_{i} & = & L_{i} + U_{i}; & (2.4) \\
L_{i} & = & Z_{i}^{T} ζ + ψ_{i}, & (2.5)
\end{matrix}

where there are $i = 1, . . ., n$ subjects and $j = 1, . . ., J$ indicator or outcome variables in the model. The quantities $X_{i}$ , $L_{i}$ , $W_{i}$ , $V_{i}$ and $U_{i}$ are all scalars, while $ζ$ is a $K \times 1$ vector; $g_{j} (\cdot)$ is a monotone, twice continuously differentiable function; $T_{i}$ is an underlying latent variable; and $η_{i}$ is the model error in the causal equation for $T_{i}$ . The $Y_{ij}$ ’s are the $J$ multiple indicators of $T_{i}$ . The multiple causes of $T_{i}$ include $Z_{ij}$ , vectors of error-free covariates associated with the $j^{th}$ outcome for the $i^{th}$ subject, and $X_{i}$ , an imperfectly measured covariate subject to the Berkson error $V_{i}$ . In the Berkson error model in (2.3), the true covariate is assumed to be correlated with $V_{i}$ , the Berkson error. By definition of the Berkson error, $L_{i}$ is assumed to be independent of $V_{i}$ , while $X_{i}$ is correlated with $V_{i}$ . The variable $W_{i}$ represents an unbiased measure of $X_{i}$ prone to the classical measurement error $U_{i}$ . The term $L_{i}$ is an intermediary predictor of $X_{i}$ introduced to the model for modelling the mixture of Berkson and classical measurement errors (Mallick et al., 2002; Schafer and Gilbert, 2006); it serves as an intermediate variable between $W_{i}$ and $X_{i}$ .
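The covariance structure implied by (2.3) to (2.5) can be checked with a small simulation. The sketch below is illustrative only: it assumes a single scalar error-free covariate and arbitrary values for $ζ$ and the error standard deviations, none of which are taken from the application, and the function names are hypothetical.

```python
import random


def simulate_covariate_errors(n, zeta=0.5, sd_psi=1.0, sd_v=0.3, sd_u=0.4, seed=0):
    """Draw (L, X, W, V, U) following (2.3)-(2.5) with one scalar covariate Z.

    All parameter values are illustrative assumptions, not estimates.
    """
    rng = random.Random(seed)
    draws = {"L": [], "X": [], "W": [], "V": [], "U": []}
    for _ in range(n):
        z = rng.gauss(0.0, 1.0)                # error-free covariate
        l = zeta * z + rng.gauss(0.0, sd_psi)  # (2.5): L = Z * zeta + psi
        v = rng.gauss(0.0, sd_v)               # Berkson error
        u = rng.gauss(0.0, sd_u)               # classical error
        draws["L"].append(l)
        draws["V"].append(v)
        draws["U"].append(u)
        draws["X"].append(l + v)               # (2.3): true covariate
        draws["W"].append(l + u)               # (2.4): observed surrogate
    return draws


def covariance(a, b):
    """Unbiased sample covariance of two equal-length sequences."""
    n = len(a)
    mean_a = sum(a) / n
    mean_b = sum(b) / n
    return sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b)) / (n - 1)


sim = simulate_covariate_errors(20000)
cov_xv = covariance(sim["X"], sim["V"])  # theory: var(V) = 0.09
cov_lv = covariance(sim["L"], sim["V"])  # theory: 0
cov_wu = covariance(sim["W"], sim["U"])  # theory: var(U) = 0.16
cov_xu = covariance(sim["X"], sim["U"])  # theory: 0
```

Under these assumptions the true covariate is correlated with the Berkson error but not with the classical error, while the observed surrogate is correlated with the classical error, which is the distinction the model relies on.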

The effect of T_i on the $j^{th}$ indicator variable is determined through $β_{2 j}$ , its coefficient in the indicator models. The slope $β_{1 j}$ is the coefficient on X_i in the $j^{th}$ indicator model. We make the following model assumptions:

The $Y_{ij}$ ’s are conditionally independent given the latent variables $T_{i}$ , $X_{i}$ and the error-free covariates, $Z_{ij}$ .

The random variables $V_{i}$ , $U_{i}$ and $η_{i}$ have mean zero and are mutually independent. We re-scale (2.2) such that $var (η_{i}) = 1$ to remove the indeterminacy in the model. We define $var (V_{i}) = σ_{v}^{2}$ and $var (U_{i}) = σ_{u}^{2}$ . We assume $U_{i} \sim Normal (0, σ_{u}^{2})$ , $V_{i} \sim Normal (0, σ_{v}^{2})$ and $η_{i} \sim Normal (0, 1)$ .

The latent construct, $T_{i}$ , is independent of all random variables except $η_{i}$ and $X_{i}$ .

The random error $ψ_{i}$ has mean zero and variance $σ_{L}^{2}$ and is independent of all other random variables.

The true covariate measured with error, X_i, is independent of the error terms $η_{i}$ and $U_{i}$ . However, it is not independent of the Berkson error, $V_{i}$ . Thus, by definition of the Berkson error, X_i is correlated to the Berkson error, $V_{i}$ , in the Berkson measurement error model in (2.3).

Given $(X_{i}, T_{i}, Z_{i})$ , the indicator variables $Y_{ij}$ each follow a generalized linear model with density function

\begin{matrix} f (y_{ij} | X_{i}, T_{i}, Z_{i}) & = & \exp \left[ \frac{Y_{ij} θ_{j} (X_{i}, T_{i}, Z_{ij}) - B_{j} \{θ_{j} (X_{i}, T_{i}, Z_{ij})\}}{σ_{ε_{j}}} + ψ (Y_{ij}) \right], \end{matrix}

with mean functions $B_{j}^{(1)} \{θ_{j} (X_{i}, T_{i}, Z_{ij})\} = μ_{j} (X_{i}, T_{i}, Z_{ij})$ . The mean function is modeled through the link function $g_{j} (\cdot)$ :

g_{j} \{μ_{j} (X_{i}, T_{i}, Z_{ij})\} = X_{i} β_{1 j} + β_{2 j} T_{i} + Z_{ij}^{T} γ_{j} .

Given $(X_{i}, T_{i}, Z_{i})$ , the indicator variables $(Y_{i 1}, Y_{i 2}, . . ., Y_{iJ})$ are mutually independent and only correlated through $T_{i}$ , the underlying latent construct. A novel contribution of the defined model to the current literature on MIMIC models is the extension of the classical model to allow both Berkson and classical measurement errors in the presence of non-linear relationships between the underlying latent construct and its multiple outcomes.

In Figure 1, we provide a path diagram for the defined model. The model is illustrated with one underlying latent construct $(T)$ , three indicators $(Y_{1}, Y_{2}, Y_{3})$ and two error-free covariates $(Z)$ , and a covariate measured with error $(X)$ which is unbiasedly measured by W. An advantage of the defined model is that it allows for the simultaneous modelling of the indicators of the latent construct. Additionally, such models can also be used to reduce the dimension of the multiple indicators of a hypothetical construct. The models can also be used in medical research to formally define constructs such as syndromes which are not directly observed but are defined based on multiple observable outcomes.

Figure 1

The model when $J = 3$ . Given $Z$ , $X$ and T, the outcomes Y’s are conditionally independent.

2.1 Reduced form equations

The G-MIMIC ME model can be re-written in its reduced form by substituting the causal equation for T_i into the structural equation models for the outcome variables. The reduced form equation is expressed as:

\begin{matrix} E (Y_{ij}) & = & g_{j}^{- 1} (X_{i} κ_{1 j} + Z_{ij}^{T} κ_{2 j} + β_{2 j} η_{i}), \end{matrix}

where $κ_{1 j} = β_{1 j} + β_{2 j} α_{1}$ and $κ_{2 j} = γ_{j} + β_{2 j} α_{2}$ . The coefficient $κ_{1 j}$ allows the assessment of the total effects of $X_{i}$ on $Y_{ij}$ , while $κ_{2 j}$ represents the total effects of the error-free covariates on $Y_{ij}$ .
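For completeness, the substitution behind the reduced form can be written out step by step. This only restates (2.1) and (2.2); it assumes that the same error-free covariate vector enters both equations, since (2.2) is written with $Z_{i}$ and (2.1) with $Z_{ij}$ .

```latex
\begin{aligned}
g_{j} \{E (Y_{ij})\}
  &= X_{i} β_{1 j} + β_{2 j} T_{i} + Z_{ij}^{T} γ_{j} \\
  &= X_{i} β_{1 j} + β_{2 j} (X_{i} α_{1} + Z_{ij}^{T} α_{2} + η_{i}) + Z_{ij}^{T} γ_{j} \\
  &= X_{i} (β_{1 j} + β_{2 j} α_{1}) + Z_{ij}^{T} (γ_{j} + β_{2 j} α_{2}) + β_{2 j} η_{i} \\
  &= X_{i} κ_{1 j} + Z_{ij}^{T} κ_{2 j} + β_{2 j} η_{i} .
\end{aligned}
```

The latent construct $T_{i}$ therefore enters the reduced form only through its error term $η_{i}$ , which is why $η_{i}$ rather than $T_{i}$ is treated as missing data in the estimation.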

The identifiability of non-linear structural equation models is often established on a case-by-case basis. As in most measurement error problems, additional information from the data is usually used to identify the model. Our defined model can be identified under any of the following conditions:

The Berkson measurement error variance, $σ_{v}^{2}$ , is known and there are instrumental variables, M, for the true covariate $X$ in the data.

Both measurement error variances $σ_{v}^{2}$ and $σ_{u}^{2}$ are known.

Repeated measures on $W$ are available in the data and $σ_{v}^{2}$ is known.

In this manuscript, we focus on the instrumental variable approach to identify the model. However, either of the other two conditions can also be used to identify the model, depending on the available data. We propose the estimation of the reduced form parameters by employing the MC-EM algorithm (Wei and Tanner, 1990).

3 Estimation

3.1 MC-EM estimation of the G-MIMIC ME

We propose an extension of the EM algorithm, namely the MC-EM algorithm (Wei and Tanner, 1990), to obtain the maximum likelihood estimates (MLEs) of the identified model. The EM algorithm (Dempster et al., 1977) is an iterative procedure for finding the MLEs in missing data settings. The E-step of the EM algorithm requires the calculation of the expectation of the complete data log likelihood, $L_{c} (θ)$ , provided in (3.1), with respect to the posterior distribution of the latent variables. If the latent variables were directly observable, the calculation of the integral would be straightforward for models in which the closed form of the integral exists. However, for most distributions belonging to the exponential family, closed form solutions of the integral do not exist. Thus, the integral is approximated using Monte Carlo methods, where the latent variable is imputed at each expectation step. Several Markov chain Monte Carlo approaches can be used to obtain the Monte Carlo sample associated with each step of the MC-EM algorithm, including the Gibbs sampler (Levine and Casella, 2001). In this manuscript, the Gibbs sampler (Geman and Geman, 1984) was used to approximate the integral involved in the expectation step.
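The alternation between a Monte Carlo E-step and an M-step can be illustrated on a deliberately simplified problem. The sketch below is not the algorithm used for the G-MIMIC ME model: it keeps only $W_{i} = L_{i} + U_{i}$ , assumes the mean and variance of $L_{i}$ are known, and draws $L_{i}$ directly from its closed-form normal posterior instead of using a Gibbs sampler. The function name, the simulated data and the parameter values are hypothetical.

```python
import random


def mc_em_var_u(w, mu_l, var_l, start, n_iter=40, n_draws=50, seed=1):
    """Toy MC-EM estimate of var(U) in W = L + U, with L ~ Normal(mu_l, var_l) known.

    E-step: impute L from its posterior given W and the current var(U).
    M-step: update var(U) as the average of (W - L)^2 over all imputed draws.
    """
    rng = random.Random(seed)
    var_u = start
    for _iteration in range(n_iter):
        weight = var_l / (var_l + var_u)   # shrinkage of W towards mu_l
        post_sd = (weight * var_u) ** 0.5  # posterior sd of L given W
        total = 0.0
        for w_i in w:
            post_mean = mu_l + weight * (w_i - mu_l)
            for _draw in range(n_draws):
                l_draw = rng.gauss(post_mean, post_sd)  # Monte Carlo E-step
                total += (w_i - l_draw) ** 2
        var_u = total / (len(w) * n_draws)              # M-step
    return var_u


# Simulated data with illustrative values var(L) = 0.2 and var(U) = 0.5.
_rng = random.Random(0)
w_obs = [_rng.gauss(0.0, 0.2 ** 0.5) + _rng.gauss(0.0, 0.5 ** 0.5) for _ in range(300)]
estimate = mc_em_var_u(w_obs, mu_l=0.0, var_l=0.2, start=1.0)

# In this toy model the exact EM fixed point is available in closed form.
closed_form = sum(w_i ** 2 for w_i in w_obs) / len(w_obs) - 0.2
```

In this toy setting the MC-EM iterates settle near the closed-form value up to Monte Carlo noise. In the full model no such closed form exists, which is why the expectation has to be approximated by simulation.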

By applying the MC-EM algorithm, we treat the measurement error problem as a missing data problem, where the observations on the outcome variables $Y_{i}$ and $W_{i}$ are the observed incomplete data, while values of $X_{i}$ , $η_{i}$ and $L_{i}$ are considered the missing data. Together, they form the complete data. To proceed with the estimation, we first construct the complete data likelihood based on the reduced form model. We assume that the conditional distributions of the outcome variables given the latent variables, $Y_{ij} | X_{i}, η_{i}$ , belong to the exponential family, while $X_{i} | L_{i}$ and $W_{i} | L_{i}$ are assumed to be normally distributed. We also assume conditional independence of the outcome variable distributions given the latent variables. The resulting likelihood for the reduced form model parameters is:

\begin{matrix}
L_{c} (θ) & = & \prod_{i = 1}^{n} f (Y_{i}, W_{i}, X_{i}, L_{i}, η_{i}) & (3.1) \\
 & = & \prod_{i = 1}^{n} f_{1} (Y_{i 1} | Y_{i 2}, . . ., Y_{iJ}, W_{i}, X_{i}, L_{i}, η_{i}, Z_{i 1}, θ_{1}) & (3.2) \\
 & \times & f_{2} (Y_{i 2} | Y_{i 3}, . . ., Y_{iJ}, W_{i}, X_{i}, L_{i}, η_{i}, Z_{i 2}, θ_{2}) \times . . . & (3.3) \\
 & \times & f_{J} (Y_{iJ} | W_{i}, X_{i}, L_{i}, η_{i}, Z_{iJ}, θ_{J}) f_{W | L} (W_{i} | X_{i}, L_{i}, η_{i}, θ_{W | L}) & (3.4) \\
 & \times & f_{X | L} (X_{i} | L_{i}, η_{i}, θ_{X | L}) f_{L} (L_{i} | η_{i}, Z_{i}, θ_{L}) f_{η} (η_{i} | θ_{η}), & (3.5)
\end{matrix}

where $θ_{j}^{T} = (κ_{1 j}, κ_{2 j}, β_{2 j})$ , $θ_{W | L} = σ_{u}^{2}$ , $θ_{X | L} = σ_{v}^{2}$ , $θ_{L}^{T} = (μ_{L}, σ_{ψ}^{2})$ and $θ_{η}^{T} = (0, 1)$ . There are $J$ outcome variables in total, $i$ represents the $i^{th}$ subject and $n$ is the total sample size included in the analysis. The details of our MC-EM approach are provided in the Appendix.

To obtain the standard errors for the estimators, the Oakes (1999) method which adjusts for the observed data likelihood can be used. Another option is to use re-sampling approaches such as the bootstrap method (Efron, 1979) to obtain the standard errors. For our current motivating example, the standard errors for the MC-EM estimates were obtained using the bootstrap approach.
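The bootstrap standard error can be sketched generically as follows. This is a minimal nonparametric version for an arbitrary estimator passed in as a function; in the application each resample would consist of whole subjects and the estimator would be a full MC-EM fit, which is not reproduced here. The function name and arguments are hypothetical.

```python
import random
import statistics


def bootstrap_se(data, estimator, n_boot=200, seed=0):
    """Bootstrap standard error of `estimator` (Efron, 1979).

    `data` is a list of independent units (subjects); each bootstrap sample
    draws len(data) units with replacement and re-applies the estimator.
    """
    rng = random.Random(seed)
    n = len(data)
    estimates = []
    for _ in range(n_boot):
        resample = [data[rng.randrange(n)] for _ in range(n)]
        estimates.append(estimator(resample))
    # The spread of the re-estimated values approximates the standard error.
    return statistics.stdev(estimates)
```

For example, `bootstrap_se(values, statistics.fmean)` should be close to the usual standard error of a sample mean; the default `n_boot=200` simply mirrors the number of bootstrap samples reported in Section 5.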

3.2 Instrumental variable approach to identifying the G-MIMIC ME model

Instrumental variable analysis can be used to identify the model parameters of the defined model. An instrumental variable, $M$ , is an observed measure of $X$ , which is correlated with $X$ but independent of the measurement error, $U$ . The model involved in the instrumental variable approach to identifying the model is expressed as (2.1) – (2.5) with the addition that for $r = 1, . . ., R$ ,

E (M_{ir}) = g_{r}^{- 1} (X_{i} δ_{1 r} + Z_{i 2 r}^{T} δ_{2 r}),

where $g_{r} (\cdot)$ is a monotone, twice continuously differentiable function associated with the $r^{th}$ instrumental variable. Thus the relationship between the instrumental variables and $X_{i}$ is also of the generalized linear form. The reduced models remain unchanged; however, the conditional distributions $f_{M | X, Z_{2}} (M_{i r} | X_{i}, Z_{i 2 r}, θ_{r})$ , where $θ_{r}^{T} = (δ_{1 r}, δ_{2 r})$ , are added to the likelihood function in Section 3.1. The instrumental variable approach has the advantage of permitting the estimation of the variance of the classical measurement error while also identifying the model.

4 Application of the G-MIMIC ME model

4.1 Background and data

In this article, we apply the G-MIMIC ME model to the atomic bomb survivors data collected by RERF. We define the variables in our application as follows:

$T_{i}$ is the underlying latent construct defined as the overall genetic damage. This underlying latent construct has two physical manifestations, namely, chromosome aberrations $Y_{1} = CA$ and glycophorin A gene mutations $Y_{2} = GPA$ .

$Y_{1} = CA$ is defined as the number of cells in a sample of approximately 100 peripheral blood cells that have stable chromosomal aberrations (Sposto et al., 1991; Stram et al., 1993).

$Y_{2} = GPA$ is defined as the hemizygous mutant fraction at the glycophorin A locus in mature red blood cells (Kyoizumi et al., 2005).

$X_{i} = \log (dose)$ is the $\log$ of the true radiation dose measured imperfectly with error and is based on radiation dosimetry calculations.

$W_{i} = \log (DS 02)$ is the observed grid-specific value of $X_{i}$ .

$L_{i} = \log (ds 02)$ is a latent intermediary variable such that

\begin{matrix} \log (dose)_{i} & = & L_{i} + v_{i}; \\ \log (DS 02)_{i} & = & L_{i} + u_{i}, \end{matrix}

and is used to represent the exact radiation dose received at the time of exposure by the survivors. Thus, $L_{i}$ is the true average value based on the $i^{th}$ survivor's self-reported distance from the hypocenter at the time of exposure.

A subset of the Adult Health Study (AHS) participants in Hiroshima and Nagasaki, Japan, who were exposed within 500 m to 2500 m of the hypocenter and had complete data were included in the study. For the current analysis, survivors with $\log (DS 02) > 6$ were included in the analysis, allowing us to focus on individuals who were exposed to higher doses of radiation. The error-free covariates included in the indicator models are sex ( $1 =$ male, $2 =$ female), age at measurement (age at CA measurement, age at GPA measurement, age at the time of bombing) and smoking. Two smoking variables describing the number of cigarettes smoked by the survivors prior to the CA and GPA measurements were defined: the average number of cigarettes smoked prior to the date on which the measurements for CA were taken (CigC) and the average number of cigarettes smoked prior to the date on which the blood work was done to measure GPA (CigG). The final analytical sample size was $338$ .

4.2 Model and instrument

In our data, $X$ , $W$ , $L$ , $U$ , $V$ , $σ_{u}^{2}$ and $σ_{v}^{2}$ are all scalar. Two instrumental variables were used to identify the variance of the classical measurement error, $σ_{u}^{2}$ : epilation and internal bleeding at the time of exposure. Epilation is defined as the loss of over two-thirds of scalp hair (Sposto et al., 1991), while internal bleeding is characterized as an acute symptom of exposure to ionizing radiation that occurs when the bone marrow cells are damaged. Both epilation and internal bleeding are symptoms that appear within days to weeks after exposure to acute levels of radiation, and both therefore serve as instrumental variables for the radiation dose received.

In our analysis, we assumed $σ_{v}^{2} = 0.08$ based on prior analyses of the data. A starting value of $σ_{u}^{2} = 0.18$ was used in the estimation procedure.

A side goal of the analysis is to obtain an estimate of $σ_{u}^{2}$ . Currently, a coefficient of variation 0.35, corresponding to $σ_{u}^{2} = 0.1155$ , is assumed by researchers, largely based on weak evidence and/or heuristic terms. The G-MIMIC ME model with an instrumental variable for identifying the measurement error component of the model can be expressed as (2.1) – (2.5) with the addition that for $r = 1, . . ., R$ ,

M_{i r} = g_{r} (X_{i} δ_{1 r} + Z_{i 2 r}^{T} δ_{2 r}) + ω_{i r},

(4.1)

where $g_{r} (\cdot)$ , $r = 1, . . ., R$ is a monotone, twice continuously differentiable function associated with the $r^{th}$ instrumental variable. The $(ω_{i 1}, . . ., ω_{iR})$ are independent of all other random variables and have mean zero and covariance matrix diag $(σ_{ω 1}^{2}, . . ., σ_{ω R}^{2})$ . Thus, the relationship between the instrumental variables and X_i is also of the generalized linear form. The following distributional assumptions are made regarding the conditional distributions for our current application:

\begin{matrix}
{CA}_{i} | X_{i}, η_{i}, Z_{i} & \sim & Binomial ({ncells}_{i}, π_{{CA}_{i}}); & (4.2) \\
\log (GPA)_{i} | X_{i}, η_{i}, Z_{i} & \sim & Normal (μ_{{GPA}_{i}}, σ_{ε 2}^{2}); & (4.3) \\
{EP}_{i} | X_{i}, Z_{i 2} & \sim & Bernoulli (π_{{EP}_{i}}); & (4.4) \\
{BLE}_{i} | X_{i}, Z_{i 2} & \sim & Bernoulli (π_{{BLE}_{i}}); & (4.5) \\
\log (DS 02)_{i} | L_{i} & \sim & Normal (L_{i}, σ_{u}^{2}); & (4.6) \\
X_{i} | L_{i} & \sim & Normal (L_{i}, σ_{v}^{2}); & (4.7) \\
L_{i} & \sim & Normal (μ_{Li}, σ_{L}^{2}), & (4.8)
\end{matrix}

where

\begin{matrix}
π_{{CA}_{i}} & = & \frac{\exp (κ_{11} X_{i} + β_{21} η_{i} + Z_{i 1}^{T} κ_{21})}{1 + \exp (κ_{11} X_{i} + β_{21} η_{i} + Z_{i 1}^{T} κ_{21})}; \\
μ_{{GPA}_{i}} & = & κ_{12} X_{i} + β_{22} η_{i} + Z_{i 2}^{T} κ_{22}; \\
π_{{EP}_{i}} & = & \frac{\exp (δ_{11} X_{i} + Z_{i}^{T} δ_{21})}{1 + \exp (δ_{11} X_{i} + Z_{i}^{T} δ_{21})}; \\
π_{{BLE}_{i}} & = & \frac{\exp (δ_{12} X_{i} + Z_{i}^{T} δ_{22})}{1 + \exp (δ_{12} X_{i} + Z_{i}^{T} δ_{22})},
\end{matrix}

while ${ncells}_{i}$ is the number of cells observed for the $i^{th}$ subject, $Z_{i 1}^{T} = ({AgeC}_{i}, {sex}_{i}, {CigC}_{i})$ , $Z_{i 2}^{T} = ({AgeG}_{i}, {sex}_{i}, {CigG}_{i})$ and $Z_{i}^{T} = ({AgeATB}_{i}, {sex}_{i})$ .
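Under the distributional assumptions (4.2) to (4.8), together with $η_{i} \sim Normal (0, 1)$ , the complete data log likelihood contribution of one subject is a sum of simple terms. The sketch below shows one way this could be evaluated for given values of the latent variables; the dictionary keys and function names are hypothetical, $μ_{Li}$ is passed in as a number, and probabilities of exactly 0 or 1 are not guarded against.

```python
import math


def expit(t):
    """Inverse logit, arranged to avoid overflow for large |t|."""
    if t >= 0:
        return 1.0 / (1.0 + math.exp(-t))
    e = math.exp(t)
    return e / (1.0 + e)


def normal_logpdf(y, mean, var):
    return -0.5 * (math.log(2.0 * math.pi * var) + (y - mean) ** 2 / var)


def binomial_logpmf(k, n, p):
    log_choose = math.lgamma(n + 1) - math.lgamma(k + 1) - math.lgamma(n - k + 1)
    return log_choose + k * math.log(p) + (n - k) * math.log1p(-p)


def bernoulli_logpmf(y, p):
    return math.log(p) if y == 1 else math.log1p(-p)


def dot(a, b):
    return sum(ai * bi for ai, bi in zip(a, b))


def complete_data_loglik(obs, latent, par):
    """Log of the complete data density for one subject, given (X, L, eta)."""
    x, l, eta = latent["x"], latent["l"], latent["eta"]
    # Mean functions of the reduced form outcome and instrument models.
    pi_ca = expit(par["k11"] * x + par["b21"] * eta + dot(obs["z1"], par["k21"]))
    mu_gpa = par["k12"] * x + par["b22"] * eta + dot(obs["z2"], par["k22"])
    pi_ep = expit(par["d11"] * x + dot(obs["z"], par["d21"]))
    pi_ble = expit(par["d12"] * x + dot(obs["z"], par["d22"]))
    return (
        binomial_logpmf(obs["ca"], obs["ncells"], pi_ca)              # (4.2)
        + normal_logpdf(obs["log_gpa"], mu_gpa, par["var_eps2"])      # (4.3)
        + bernoulli_logpmf(obs["ep"], pi_ep)                          # (4.4)
        + bernoulli_logpmf(obs["ble"], pi_ble)                        # (4.5)
        + normal_logpdf(obs["w"], l, par["var_u"])                    # (4.6)
        + normal_logpdf(x, l, par["var_v"])                           # (4.7)
        + normal_logpdf(l, par["mu_l"], par["var_l"])                 # (4.8)
        + normal_logpdf(eta, 0.0, 1.0)                                # eta ~ N(0, 1)
    )
```

In an MC-EM iteration, a function of this kind would be averaged over the imputed values of $(X_{i}, L_{i}, η_{i})$ and summed over subjects before being maximized in the parameters.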

5 Results

In this section, we discuss the results from our application of the G-MIMIC ME model with instrumental variables to RERF data. The Gibbs sampler was used in the MC-E step (please see the Appendix for additional details). Based on the Heidelberger–Welch tests of convergence, we find that the generated data for $L_{i}$ , $X_{i}$ and $η_{i}$ do come from stationary distributions. An evaluation of the autocorrelation plots also indicates that the generated data were independent. A Monte Carlo sample size (M) of $2000$ was generated for each survivor, following a burn-in sample of $10 000$ . The trace plots of the simulated data provided no evidence of any irregularities in the generated values. To obtain the standard errors, the bootstrap approach was used with 200 bootstrap samples.

Table 1 provides the results for fitting the outcome models to RERF data. The MME adjusted estimates in the table are based on the adjustment for the mixture of Berkson and classical measurement errors, and the unadjusted estimates are based on performing a generalized linear model analysis by replacing $X_{i}$ by $W_{i}$ and failing to account for any measurement errors. Finally, the CME adjusted estimates are based on the adjustment for classical measurement error alone, ignoring Berkson measurement error. The coefficient on the overall genetic damage in the unadjusted model for CA was assumed to be 1. In the $\log (GPA)$ analysis, the term for the overall genetic damage was replaced by CA in the unadjusted model. Overall, the table provides estimated coefficients based on analyses which adjust for the mixture of measurement errors, for classical measurement error alone, and unadjusted estimates based on analyses which do not account for either of the measurement errors.

Table 1

Results from the analysis of CA, Loggpa, EP and Bleeding. ${\hat{β}}_{MME}$ = mixed (Berkson and classical) measurement error adjusted estimates, ${\hat{β}}_{CME}$ = classical measurement error adjusted estimates, ${\hat{β}}_{UN}$ = unadjusted parameter estimates, $%$ ${change}_{CME}$ = $\frac{{\hat{β}}_{CME} - {\hat{β}}_{MME}}{{\hat{β}}_{MME}} \times 100 %$ = percentage change in the CME adjusted estimates from the MME adjusted estimates, $%$ ${change}_{UN}$ = $\frac{{\hat{β}}_{UN} - {\hat{β}}_{MME}}{{\hat{β}}_{MME}} \times 100 %$ = percentage change in the unadjusted estimates from the MME adjusted estimates

CA	${\hat{β}}_{MME}$	SE	P	${\hat{β}}_{UN}$	$%$ ${change}_{UN}$	${\hat{β}}_{CME}$	$%$ ${change}_{CME}$
AgeCA	0.0196	0.007	0.003	$-$ 0.002	$-$ 110.20	0.0197	0.51
Sex	$-$ 0.111	0.181	0.5396	0.248	$-$ 323.42	$-$ 0.101	$-$ 9.00
AvecigCA	0.0136	0.011	0.209	0.01	$-$ 26.47	0.017	25.00
True Dose	1.293	0.172	$< 0.0001$	1.02	$-$ 21.11	2.154	66.59
Genetic Damage	0.384	0.142	0.007	1.00	160.42	0.157	$-$ 59.11
Loggpa	${\hat{β}}_{MME}$	SE	P	${\hat{β}}_{UN}$	$%$ ${change}_{UN}$	${\hat{β}}_{CME}$	$%$ ${change}_{CME}$
AgeGPA	0.004	0.011	0.728	$-$ 0.001	$-$ 125.00	0.005	25.00
Sex	$-$ 0.050	0.185	0.786	$-$ 0.081	62.00	$-$ 0.061	22.00
AvecigGPA	0.008	0.009	0.401	0.004	$-$ 50.00	0.009	12.50
True Dose	0.608	0.160	0.0001	0.469	$-$ 22.86	1.054	73.35
Genetic Damage	0.023	0.215	0.916	2.862	12343.48	$-$ 0.44	$-$ 2013.48
EP	${\hat{β}}_{MME}$	SE	P	${\hat{β}}_{UN}$	$%$ ${change}_{UN}$	${\hat{β}}_{CME}$	$%$ ${change}_{CME}$
AgeATB	0.016	0.037	0.661	0.001	$-$ 93.75	0.025	50.00
Sex	0.112	0.581	0.847	0.10	$-$ 10.71	0.004	$-$ 96.43
True Dose	3.048	0.966	0.004	$-$ 45.01	81.86	3.56	16.79
Bleeding	${\hat{β}}_{MME}$	SE	P	${\hat{β}}_{UN}$	$%$ ${change}_{UN}$	${\hat{β}}_{CME}$	$%$ ${change}_{CME}$
AgeATB	0.039	0.028	0.173	0.026	$-$ 33.33	0.044	12.82
Sex	$-$ 0.023	0.489	0.962	$-$ 0.023	0.00	$-$ 0.08	247.82
True Dose	1.681	0.486	1.13	$-$ 32.72	48.63	2.37	40.99
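The percentage change columns follow directly from the formula in the table caption. As a small check, the sketch below recomputes three entries from the CA block of Table 1; the function name is hypothetical.

```python
def pct_change(estimate, mme_estimate):
    """Percentage change of an estimate relative to the MME adjusted estimate."""
    return (estimate - mme_estimate) / mme_estimate * 100.0


# True Dose in the CA model: MME = 1.293, unadjusted = 1.02, CME = 2.154.
change_un_dose = pct_change(1.02, 1.293)   # table reports -21.11
change_cme_dose = pct_change(2.154, 1.293)  # table reports 66.59

# Genetic Damage in the CA model: MME = 0.384, unadjusted = 1.00.
change_un_damage = pct_change(1.00, 0.384)  # table reports 160.42
```

A negative value means the comparison estimate is smaller than the MME adjusted estimate, and a positive value means it is larger.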

Overall, we find that CA is highly predictive as an assay for genetic damage $(p = 0.007)$ . However, $\log (GPA)$ was not found to be predictive of genetic damage $(p = 0.92)$ after adjusting for the relevant error-free covariates and radiation dose. Both assays, CA and $\log (GPA)$ , were significantly related to radiation dose ( $p < 0.0001$ and $p = 0.0001$ , respectively). This finding supports the use of these biological assays as biodosimeters for radiation dose exposure.

The results from the modeling of the instrumental variables, epilation and bleeding, are also included in Table 1. A statistically significant relationship was found between true radiation dose and epilation, after adjusting for age at the time of exposure and the gender of the survivor (p-value $=$ 0.002). However, AgeATB and gender were not statistically significant (p-value = 0.66 and 0.845, respectively) after adjusting for true dose in the epilation model. Similar results were also found for internal bleeding.

5.1 Estimating $σ_{u}^{2}$

Epilation and internal bleeding were used as instrumental variables to identify the model due to the presence of the classical measurement error, $u$ . Based on our analysis, $σ_{u}^{2}$ was estimated to be 0.109. Our data-driven estimate for $σ_{u}^{2}$ is based on an adjustment for both classical and Berkson measurement errors. Previous estimates that also account for both types of measurement error while using an instrumental variable approach have estimated $σ_{u}^{2}$ to be 0.181 (Miller, unpublished manuscript), 0.1225 (Carter, unpublished manuscript) and 0.092 (Tekwe et al., 2014). The corresponding coefficient of variation based on our estimated value for $σ_{u}^{2}$ is 0.34 in the current application. The current coefficient of variation being used at RERF is 0.35, which is based on an adjustment for classical measurement error alone with a corresponding variance of 0.1156. An additional objective of our current application is to use instrumental variables to identify $σ_{u}^{2}$ . Using our proposed model, an estimate of $σ_{u}^{2}$ that is comparable to RERF's current value and our previous estimate is obtained. This application shows that the G-MIMIC ME model can also be used as an alternative model to obtain an estimate for the variance of the classical measurement error, $σ_{u}^{2}$ .
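The correspondence between the variance on the log-dose scale and the coefficient of variation quoted above can be reproduced under one assumption: that the multiplicative error $\exp (U_{i})$ is lognormal, which follows if $U_{i}$ is normal on the log scale. Whether RERF defines its coefficient of variation in exactly this way is an assumption of this sketch, although the conversion does recover the quoted figures. The function names are hypothetical.

```python
import math


def cv_to_log_variance(cv):
    """Variance of log-scale error implied by a lognormal coefficient of variation."""
    return math.log1p(cv ** 2)


def log_variance_to_cv(var):
    """Lognormal coefficient of variation implied by a log-scale error variance."""
    return math.sqrt(math.expm1(var))


rerf_variance = cv_to_log_variance(0.35)  # about 0.1156
estimated_cv = log_variance_to_cv(0.109)  # about 0.34
```

On this scale, the estimate of 0.109 and the value of 0.1156 implied by a coefficient of variation of 0.35 differ by less than 0.01.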

5.2 Impact of measurement error on the parameter estimates

Table 1 also provides measurement error adjusted and unadjusted parameter estimates, together with the percentage changes relative to the estimates that adjust for the mixture of measurement errors. Overall, we find that adjusting for the measurement errors has a substantial impact on the estimated parameters. The effect of failing to adjust for any measurement error on the coefficients of the error-free covariates depends on the covariate. However, we find that failing to account for any measurement error leads to under-estimating the effects of true radiation dose while over-estimating the effects of the overall genetic damage on its outcomes. We also find that failing to account for the Berkson error leads to over-estimating the true effects of the radiation dose. The percentage changes in the estimated coefficient on true radiation dose when comparing the estimates adjusted for classical measurement error alone to the estimates adjusted for the mixture of measurement errors range from $16 %$ to $73 %$ . This finding confirms the importance of adjusting for both types of measurement error when the error associated with the estimation of the true covariate is due to a mixture of Berkson and classical measurement errors.

5.3 Sensitivity analysis of the parameter estimates to assumed values of $σ_{v}^{2}$ and starting values of $σ_{u}^{2}$

In this section, we discuss the sensitivity of the parameter estimates to the assumed known value of the variance of the Berkson error, $σ_{v}^{2}$, and to the starting value for $σ_{u}^{2}$. In the sensitivity analysis, the assumed values for $σ_{v}^{2}$ ranged from 0.02 to 0.08 while the starting values for $σ_{u}^{2}$ ranged from 0.10 to 0.28 (see Table 2). The parameters were estimated under all possible pairs of these values. The first set of parameters was estimated under $σ_{v}^{2 KNOWN} = 0.02$ and $σ_{u}^{2 INITIAL} = 0.10$, while the last set was estimated under $σ_{v}^{2 KNOWN} = 0.08$ and $σ_{u}^{2 INITIAL} = 0.28$.

In general, the coefficients on the error-free covariates and on the underlying latent construct, the overall genetic damage, were the least sensitive to $σ_{v}^{2 KNOWN}$ and $σ_{u}^{2 INITIAL}$. However, the coefficients on true radiation dose in the various models were very sensitive to $σ_{u}^{2 INITIAL}$ and $σ_{v}^{2 KNOWN}$. We find that the coefficients on true radiation dose in the instrumental variable models were much more sensitive to the measurement errors than the estimated coefficients on true dose in the outcome models.

We also assessed the sensitivity of the estimates of $σ_{u}^{2}$ to $σ_{v}^{2 KNOWN}$ and $σ_{u}^{2 INITIAL}$. The estimated value for $σ_{u}^{2}$ when $σ_{v}^{2 KNOWN} = 0.02$ and $σ_{u}^{2 INITIAL} = 0.10$ was 0.157, while the estimated value when $σ_{v}^{2 KNOWN} = 0.08$ and $σ_{u}^{2 INITIAL} = 0.28$ was 0.111. The corresponding percentage change was $29\%$ in absolute magnitude. In general, we find that the estimated parameters were sensitive to the assumed known values of the variance of the Berkson error. As the assumed value for the variance of the Berkson error increased, the parameters that were more sensitive to the measurement errors tended to decrease. That is, the estimated coefficients on true dose and ${\hat{σ}}_{u}^{2}$ decreased as the assumed known value $σ_{v}^{2 KNOWN}$ increased.
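The percentage change quoted above can be checked directly from the entries of Table 2. The following is a minimal sketch; the helper name is hypothetical, and taking the first setting ($σ_{v}^{2 KNOWN} = 0.02$, $σ_{u}^{2 INITIAL} = 0.10$) as the reference value is an assumption about how the change was computed.

```python
def percent_change(reference, new):
    """Percentage change of `new` relative to `reference` (hypothetical helper)."""
    return 100.0 * (new - reference) / reference


# Estimated sigma_u^2 from Table 2 under the first and last settings
sigma_u2_first = 0.157   # sigma_v^2 known = 0.02, sigma_u^2 initial = 0.10
sigma_u2_last = 0.111    # sigma_v^2 known = 0.08, sigma_u^2 initial = 0.28

change = percent_change(sigma_u2_first, sigma_u2_last)
print(round(abs(change)))  # 29
```

The same helper can be applied to any other row of Table 2, for example the coefficients on true dose.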

Table 2

Impact of assumed known values of $σ_{v}^{2}$ and initial values of $σ_{u}^{2}$ on the estimated model parameters for CA analysis. The subscript in ${\hat{β}}_{σ_{v}^{2}, σ_{u}^{2}}$ represents the assumed known value for $σ_{v}^{2}$ and initial value for $σ_{u}^{2}$, respectively

Outcome	Covariate	${\hat{β}}_{0.02, 0.10}$	${\hat{β}}_{0.02, 0.18}$	${\hat{β}}_{0.02, 0.28}$	${\hat{β}}_{0.04, 0.10}$	${\hat{β}}_{0.04, 0.18}$	${\hat{β}}_{0.04, 0.28}$	${\hat{β}}_{0.08, 0.10}$	${\hat{β}}_{0.08, 0.18}$	${\hat{β}}_{0.08, 0.28}$
CA	AgeC	0.019	0.018	0.018	0.019	0.019	0.019	0.019	0.019	0.019
CA	Sex	-0.101	-0.103	-0.106	-0.102	-0.106	-0.104	-0.105	-0.106	-0.103
CA	CigC	0.014	0.014	0.014	0.014	0.014	0.014	0.014	0.014	0.014
CA	True Dose	1.581	1.591	1.576	1.486	1.488	1.488	1.352	1.35	1.351
CA	Genetic Damage	0.397	0.394	0.4	0.398	0.397	0.396	0.394	0.395	0.396
GPA	AgeG	0.003	0.003	0.003	0.003	0.004	0.003	0.003	0.003	0.003
GPA	Sex	-0.063	-0.065	-0.068	-0.065	-0.067	-0.067	-0.068	-0.068	-0.068
GPA	CigG	0.008	0.007	0.008	0.008	0.008	0.007	0.008	0.007	0.007
GPA	True Dose	0.756	0.762	0.755	0.711	0.712	0.715	0.648	0.645	0.648
GPA	Genetic Damage	-0.008	-0.014	-0.006	-0.008	-0.01	-0.014	-0.012	-0.009	-0.012
EP	AgeATB	0.018	0.017	0.018	0.018	0.019	0.018	0.018	0.018	0.018
EP	Sex	0.071	0.065	0.058	0.068	0.059	0.059	0.056	0.057	0.057
EP	True Dose	3.22	3.22	3.228	3.026	3.029	3.019	2.721	2.724	2.73
Bleeding	AgeATB	0.038	0.037	0.038	0.037	0.038	0.038	0.038	0.038	0.038
Bleeding	Sex	-0.045	-0.048	-0.053	-0.047	-0.053	-0.052	-0.053	-0.053	-0.053
Bleeding	True Dose	1.844	1.852	1.841	1.724	1.724	1.732	1.579	1.565	1.562
$σ_{u_{estimated}}^{2}$		0.157	0.16	0.156	0.139	0.14	0.139	0.112	0.11	0.111

5.4 Illustration of identifiability of the G-MIMIC ME

A parameter $θ_{i} \in Θ$ is said to be identified if no two different values of $θ \in Θ$ lead to the same sampling distribution of the observable random variables, and the model is considered identified if, and only if, every element of $θ$ is identified (Fuller, 1987). A general treatment of the identifiability of non-linear measurement error models is not possible, since in some settings the model parameters are identified without additional information while in other settings they are not. For example, in binary regression with a normally distributed measurement error, the probit model is not identified while the logistic measurement error model is identified (Carroll et al., 2008). In this section, we numerically demonstrate that our proposed model is identified with two outcome variables when two instrumental variables are used to identify the model. We do this by fitting the model with four different sets of starting values; if all the resulting parameter estimates converge to the same maximum likelihood estimates, the model is considered to be identified.

Table 3

Identifiability results from the CA analysis. ${\hat{β}}_{0k}$ denotes the $k$th set of starting values and ${\hat{β}}_{k}$ the corresponding estimates, $k = 1, \dots, 4$

Outcome	Covariate	${\hat{β}}_{01}$	${\hat{β}}_{1}$	${\hat{β}}_{02}$	${\hat{β}}_{2}$	${\hat{β}}_{03}$	${\hat{β}}_{3}$	${\hat{β}}_{04}$	${\hat{β}}_{4}$
CA	AgeC	-0.002	0.019	0	0.019	0.017	0.019	0.02	0.019
CA	Sex	-0.248	-0.103	0.245	-0.104	-0.108	-0.103	-0.111	-0.103
CA	CigC	-0.01	0.014	0.018	0.014	0.01	0.014	0.014	0.014
CA	True Dose	1.019	1.349	0.12	1.346	0.808	1.35	1.293	1.349
CA	Genetic Damage	1	0.395	1	0.398	1	0.395	0.385	0.395
log(GPA)	AgeG	-0.004	0.003	-0.004	0.003	0.003	0.003	0.004	0.003
log(GPA)	Sex	-0.21	-0.066	-0.243	-0.067	-0.036	-0.066	-0.05	-0.066
log(GPA)	CigG	-0.001	0.008	0.007	0.007	0.008	0.008	0.008	0.008
log(GPA)	True Dose	0.42	0.646	0.241	0.645	0.238	0.647	0.608	0.646
log(GPA)	Genetic Damage	2.304	-0.012	-1.208	-0.01	3.306	-0.012	0.023	-0.011
EP	AgeATB	-0.002	0.018	-0.03	0.018	-0.009	0.018	0.016	0.018
EP	Sex	-0.308	0.061	1.447	0.061	0.184	0.062	0.112	0.062
EP	True Dose	2.023	2.729	0.482	2.734	0.729	2.732	3.048	2.73
Bleeding	AgeATB	0.043	0.038	0.038	0.038	0.025	0.038	0.039	0.038
Bleeding	Sex	-0.198	-0.051	1.035	-0.05	0.126	-0.05	-0.023	-0.05
Bleeding	True Dose	0.918	1.566	1.572	1.564	0.683	1.568	1.681	1.566

To illustrate identifiability, we analyzed the data under different starting values for the parameters (see Table 3). Four different sets of starting values were used to see if the estimated values would converge to the same maximum likelihood estimate. Based on our analysis, we find that the estimated model parameters all converge to the same values regardless of the starting values used. Thus, we conclude that the model is identified with known $σ_{v}^{2}$ and at least two outcome variables for T_i, in addition to at least two instrumental variables to identify $σ_{u}^{2}$ when an instrumental variable approach is used.
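The multi-start check described above can be sketched numerically. The snippet below is an illustration only, not the code used in the analysis: it takes the CA-model starting values and converged estimates reported in Table 3 and checks whether the four runs agree within a tolerance. The function names and the tolerance of 0.01, chosen to allow for Monte Carlo noise in the MC-EM estimates, are assumptions introduced here. Agreement across starting values is numerical evidence of identifiability rather than a proof.

```python
def max_spread(values):
    """Largest difference between any two values in a list."""
    return max(values) - min(values)


def same_optimum(runs, tol=0.01):
    """True if, for every parameter, all runs agree within `tol` (assumed tolerance)."""
    return all(max_spread(values) <= tol for values in runs.values())


# Converged CA-model estimates from Table 3, one value per set of starting values
ca_estimates = {
    "AgeC": [0.019, 0.019, 0.019, 0.019],
    "Sex": [-0.103, -0.104, -0.103, -0.103],
    "CigC": [0.014, 0.014, 0.014, 0.014],
    "True Dose": [1.349, 1.346, 1.35, 1.349],
    "Genetic Damage": [0.395, 0.398, 0.395, 0.395],
}

# Starting values for two of the CA-model parameters, also from Table 3
ca_starts = {
    "True Dose": [1.019, 0.12, 0.808, 1.293],
    "Genetic Damage": [1, 1, 1, 0.385],
}

print(same_optimum(ca_estimates))  # True: runs agree despite different starts
print(same_optimum(ca_starts))     # False: the starting values were far apart
```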

6 Simulation study

A simulation study was performed to assess the impact of ignoring measurement error in estimating the effects of true radiation dose on the relevant outcomes. We generated a dataset similar to our motivating example. The data were generated under the assumption that the variances of the classical and Berkson measurement errors were $0.25$ and $0.10$, respectively. We also assumed that $\log (X) \sim Normal (7.33, 0.61)$, $\log (L) \sim Normal (7.33, 0.51)$ and $\log (DS02) \sim Normal (7.33, 0.65)$. Since our application focused on survivors exposed to acute levels of radiation doses, we also restricted the simulated values to $\log (DS02) > 6.0$.
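One way to generate doses with a mixture of Berkson and classical errors is sketched below. This is a hedged illustration under stated assumptions, not the simulation code used here: it assumes the additive log-scale structure common in the mixture-error literature, $\log X = \log L + v$ (Berkson) and $\log W = \log L + u$ (classical), with $W$ standing in for the DS02 estimate, and it treats the second argument of each Normal distribution as a variance. The function and argument names are hypothetical. The sketch does not attempt to reproduce the stated marginal distribution of $\log (DS02)$; its marginal follows from the assumed structure.

```python
import random


def simulate_doses(n, seed=0, mean_log_l=7.33, var_log_l=0.51,
                   var_berkson=0.10, var_classical=0.25, cutoff=6.0):
    """Simulate (log L, log X, log W) triples under an assumed mixture-error model.

    log L: latent intermediate dose
    log X: true dose, log L plus Berkson error v
    log W: observed dose, log L plus classical error u
    Only records with log W above `cutoff` are retained.
    """
    rng = random.Random(seed)
    rows = []
    while len(rows) < n:
        log_l = rng.gauss(mean_log_l, var_log_l ** 0.5)
        log_x = log_l + rng.gauss(0.0, var_berkson ** 0.5)    # Berkson error
        log_w = log_l + rng.gauss(0.0, var_classical ** 0.5)  # classical error
        if log_w > cutoff:
            rows.append((log_l, log_x, log_w))
    return rows


sample = simulate_doses(1000)
print(len(sample), min(row[2] for row in sample) > 6.0)
```

Outcome and instrumental variables would then be generated from the simulated true dose; that step depends on the outcome models and is omitted from this sketch.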

Figure 2 provides the results from the simulated dataset for all the models considered. The solid blue line in the figure indicates the true $logit (π_{CA})$, $\hat{E} (logGPA)$, $logit (π_{EP})$ and $logit (π_{Bleeding})$, respectively. The dashed black lines provide the estimated values for the outcomes based on our application of the G-MIMIC ME model to the simulated data, the dashed grey lines provide the results from unadjusted estimation, and the solid red lines provide the results based on adjusting for classical measurement error only. Our simulation study indicates that, for all the outcomes considered, failing to account for both classical and Berkson measurement errors in assessing the impact of radiation dose results in under-estimating its impact, while failing to account for Berkson measurement error leads to over-estimating the effects of true radiation dose. All the dashed grey lines in Figure 2 were below the solid blue lines and the dashed black lines. Similarly, all the red lines for the estimates adjusted for classical measurement error only were above the solid blue and dashed black lines. These results indicate that our proposed approach performs adequately in assessing the effects of radiation dose on the various outcomes when compared to the methods which do not fully account for the mixture of measurement errors. The estimated variance of the classical measurement error based on the simulated data was 0.114.

Figure 2

Note: Figure is available in full colour in the online version at smj.sagepub.com.

7 Discussion

In this article, we applied the G-MIMIC ME model with an instrumental variable to assess the impact of true radiation dose and overall genetic damage on the physical indicators of overall genetic damage, CA and log(GPA). The G-MIMIC ME model allows for the predictor variable in the causal equation for the underlying latent construct, T_i, to be measured imprecisely, as in the case of DS02 estimates of radiation dose among survivors of the atomic bomb. The presence of classical measurement error complicates the estimation of the regression coefficients because the variance of the classical measurement error, $σ_{u}^{2}$, is not identified without additional information. In our current analysis, the instrumental variable approach was used as the identifying information to estimate $σ_{u}^{2}$. Since the presence of measurement error biases the regression coefficients, one needs to adjust for its presence. In our current application, $σ_{u}^{2}$ is estimated to be $0.109$. This estimate corresponds to a coefficient of variation of $0.34$. We find that the number of stable chromosome aberrations following exposure to radiation dose is an indicator for overall genetic damage, while the GPA gene mutation on mature red blood cells was not found to be its physical indicator.

Our current model, the G-MIMIC ME model, allows us to assess the direct effect of an underlying latent construct on its physical indicators or outcomes after adjusting for error-free covariates and a true covariate prone to both classical and Berkson measurement errors.

Appendix

One of the complications often faced by researchers implementing the EM algorithm (Dempster et al., 1977) is the need to explicitly calculate the expected log likelihood of the complete data. Wei and Tanner (1990) extended the EM algorithm to the Monte Carlo EM (MC-EM) algorithm, which does not require the direct computation of the expected log likelihood of the complete data. MC-EM is an approximation method that is justified by the weak law of large numbers. Rather than directly calculating the expected log likelihood as required under the EM algorithm, the MC-EM algorithm simulates the missing data, $Y_{mis}$, from its conditional distribution, $f (Y_{mis} | Y_{obs}, θ)$ (Wei and Tanner, 1990). Thus, the expected complete-data log likelihood is approximated by:

\widehat{Q} (θ | θ_{0}, Y_{mis}) = \frac{1}{m} \sum_{i = 1}^{m} \log L_{c} (θ | Y_{mis}^{(i)}, Y_{obs})

(A.1)

where $Y_{mis}^{(i)}$ is the $i$th simulated draw of the missing data, $Y_{mis}$, and $Y_{obs}$ represents the observed data. $\widehat{Q} (θ | θ_{0}, Y_{mis})$ converges to $Q (θ | θ_{0}, Y_{mis})$ as $m$ goes to infinity by the weak law of large numbers. The E-step of the EM algorithm requires the calculation of the expectation of the complete-data log likelihood, $\log L_{c} (θ)$, with respect to the posterior distribution of the latent variables. If the latent variables were directly observable, the calculation of the integral would be straightforward for models for which a closed form of the integral exists. However, for most distributions belonging to the exponential family, a closed form of this integral does not exist. Also, since the posterior distribution of the latent variables is unknown, the expectation step of the EM algorithm is turned into an approximation problem where the latent variables are treated as missing data. The integral is approximated using Monte Carlo methods where the latent variable is imputed at each expectation step. We use the Gibbs sampler to approximate the integral involved in the expectation step.

The maximization step involves the maximization of the approximated completed data log likelihood to obtain the parameter estimates under the conditional independence assumption. Under the conditional independence assumption, the structural parameters associated with each outcome or indicator variable can be obtained independently of the parameters associated with the other outcomes.

Our MC-EM approach can be summarized as follows:

1. Obtain initial values, $θ^{(0)}$.

2. Simulate $M$ Markov chain Monte Carlo samples $\{η_{im}, L_{im}, X_{im}\}_{m = 1}^{M}$ from

f (η_{i}, L_{i}, X_{i} | θ^{(0)}, Y_{i}, W_{i}, M_{i})

(A.2)

for the $i$th survivor, $i = 1, \dots, n$ and $m = 1, \dots, M$, assuming the initial parameters are the true parameters.

3. Fill in the $M$ simulated MCMC samples as replacement data for the missing data in $\log L_{c} (θ | Y_{i}, W_{i}, M_{i}, Z_{i})$.

4. Create an expanded dataset for each subject ($i = 1, \dots, n$) based on the $M$ simulated Monte Carlo samples from step 2. The original incomplete and expanded complete datasets are indicated below for the $i$th subject:

\begin{matrix} [Y_{ij} W_{ik} M_{ir} Z_{ij}^{T}] \\ [Y_{ij} W_{ik} M_{ir} Z_{ij}^{T} L_{i1} η_{i1} X_{i1}] \end{matrix}

where $Y_{ij} = (Y_{i1}, Y_{i2}, \dots, Y_{iJ})$, $W_{ik} = (W_{i1}, W_{i2}, \dots, W_{iK})$, $Z_{ij} = (Z_{i1}^{T}, \dots, Z_{iJ}^{T})$, and $M_{ir} = (M_{i1}, M_{i2}, \dots, M_{iR})$ are the instrumental variables.

5. Approximate the Q function using the expanded data as follows:

Q (θ | θ^{(t)}) = \frac{1}{M} \sum_{m = 1}^{M} \log L_{c} (θ | Y, M, W, η_{m}^{(t)}, X_{m}^{(t)}, L_{m}^{(t)})

6. M-step: Maximize $Q (θ | θ^{(0)})$ with respect to $θ$ to get an updated vector of estimates $θ^{(1)}$ for $θ$.

7. Repeat this process at the $t$th iteration by generating a new set of samples $\{η_{im}, L_{im}, X_{im}\}_{m = 1}^{M}$ from the conditional distributions $f (η_{i}, X_{i}, L_{i} | Y_{i}, M_{i}, W_{i}, θ^{(t)})$ for the $i$th subject.

8. Maximize the resulting Q-function, $Q (θ | θ^{(t)})$, and continue iterating between the Monte Carlo E-step and the M-step until the chosen convergence criterion is satisfied.

We note that under the conditional independence assumption, the Q functions associated with the conditional distribution of each outcome variable can be obtained independently. Once the expanded or concatenated data have been created, this likelihood can be maximized using optimization techniques such as the Newton-Raphson algorithm. Another approach to obtaining the parameter estimates is to use readily available procedures in statistical packages, such as PROC GLM in SAS or the glm function in R, to analyze the augmented dataset as if Q were a log likelihood function constructed from i.i.d. observations. This method of maximizing the likelihood works because the approximated Q function is a pseudo-log likelihood that has the form of a likelihood of $Mn$ independent observations, as illustrated in the expression for Q above. Once we obtain the MC-EM based parameter estimates, we perform inferential procedures on the estimated parameters.
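The alternation between a Monte Carlo E-step on an expanded dataset and an M-step can be illustrated on a deliberately simple toy model. The sketch below is not the G-MIMIC ME model and makes several simplifying assumptions, all introduced here for illustration: observations follow $y_{i} = x_{i} + e_{i}$ with latent $x_{i} \sim Normal (μ, 1)$ and $e_{i} \sim Normal (0, 1)$, the latent values are drawn directly from their known conditional distribution $Normal ((y_{i} + μ) / 2, 1 / 2)$ instead of with a Gibbs sampler, and a fixed number of iterations replaces a convergence criterion. For this toy model the maximum likelihood estimate of $μ$ is the sample mean of the $y_{i}$, which gives a simple check.

```python
import random


def mcem_latent_mean(y, n_draws=50, n_iter=30, mu0=0.0, seed=0):
    """Toy MC-EM for y_i = x_i + e_i with x_i ~ N(mu, 1) and e_i ~ N(0, 1).

    All names and defaults are illustrative assumptions.
    """
    rng = random.Random(seed)
    mu = mu0
    post_sd = 0.5 ** 0.5  # sd of x_i given y_i and mu (posterior variance 1/2)
    for _iteration in range(n_iter):
        # Monte Carlo E-step: impute n_draws latent values per subject,
        # giving an expanded dataset of len(y) * n_draws rows.
        expanded = [rng.gauss(0.5 * (yi + mu), post_sd)
                    for yi in y for _draw in range(n_draws)]
        # M-step: the approximated Q is a normal log likelihood in mu over
        # the expanded rows, so its maximizer is their mean.
        mu = sum(expanded) / len(expanded)
    return mu


data_rng = random.Random(1)
y = [data_rng.gauss(3.0, 2 ** 0.5) for _ in range(200)]  # marginal variance 1 + 1
mu_hat = mcem_latent_mean(y)
y_bar = sum(y) / len(y)
print(round(mu_hat, 2), round(y_bar, 2))
```

Because the M-step here is an ordinary sample mean over the expanded rows, the sketch also shows why standard fitting routines can be applied to the expanded dataset as if it consisted of independent observations.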

To obtain the standard errors for the estimators, the method of Oakes (1999), which adjusts for the observed data likelihood, can be used. Another option is to use the bootstrap method. For our current motivating example, the standard errors for the MC-EM estimates were obtained using the bootstrap approach.

Acknowledgments

The Radiation Effects Research Foundation (RERF), Hiroshima and Nagasaki, Japan is a private, non-profit foundation funded by the Japanese Ministry of Health, Labour and Welfare (MHLW) and the U.S. Department of Energy (DOE), the latter in part through the DOE Award DE-HS0000031 to the National Academy of Sciences. This publication was supported by RERF Research Protocol A5-11. The views of the authors do not necessarily reflect those of the two governments.

References

Awa (1997) Analysis of chromosome aberrations in atomic bomb survivors for dose assessment: Studies at the Radiation Effects Research Foundation from 1968 to 1993. Stem Cells, 15, 163–73.

Carroll, Spiegelman, Lan, Bailey and Abbott (1984) On errors-in-variables for binary regression models. Biometrika, 71, 19–26.

Carter, Cullings, Cologne, Funamoto, Kusunoki, Neriishi, Seed, Miller, Tekwe, Nakamura, Stram and Ross (2008) Estimation of radiation dose from biological manifestations and imperfect measures of physical determinants. Paper presented at the 2008 Joint Statistical Meetings, Denver, CO, August 7.

Dempster, Laird and Rubin (1977) Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society, Series B, 39, 1–38.

Efron (1979) Bootstrap methods: Another look at the jackknife. The Annals of Statistics, 7, 1–26.

Fenech (2002) Biomarkers of genetic damage for cancer epidemiology. Toxicology, 181, 411–16.

Fuller (1987) Measurement Error Models. New York: Wiley.

Geman and Geman (1984) Stochastic relaxation, Gibbs distribution and the Bayesian restoration of images. IEEE Transactions on Pattern Analysis and Machine Intelligence, 6, 721–41.

Goldberger (1972) Structural equation methods in the social sciences. Econometrica, 40, 979–1001.

Grant and Bigbee (1993) In vivo somatic mutation and segregation at the human glycophorin A (GPA) locus: Phenotypic variation encompassing both gene-specific and chromosomal mechanisms. Mutation Research, 288, 163–72.

Joreskog and Goldberger (1975) Estimation of a model with multiple indicators, and multiple causes of a single latent variable. Journal of the American Statistical Association, 70, 631–39.

Kyoizumi, Kusunoki, Hayashi, Hakoda, Cologne and Nakachi (2005) Individual variation of somatic gene mutability in relation to cancer susceptibility: Prospective study on erythrocyte glycophorin A gene mutations of atomic bomb survivors. Cancer Research, 65, 5462–69.

Levine and Casella (2001) Implementations of the Monte Carlo EM algorithm. Journal of Computational and Graphical Statistics, 10, 422–39.

Mallick, Hoffman and Carroll (2002) Semiparametric regression modeling with mixtures of Berkson and classical error, with application to fallout from the Nevada test site. Biometrics, 58, 13–20.

Miller (2009) Statistical methods for biodosimetry in the presence of both Berkson and classical measurement errors. PhD thesis, State University of New York at Buffalo, Buffalo, New York.

Oakes (1999) Direct calculation of the information via EM algorithm. Journal of the Royal Statistical Society, Series B, 61, 479–82.

Rana, Kumar, Sultana and Sharma (2010) Radiation-induced biomarkers for the detection and assessment of absorbed radiation doses. Journal of Pharmacy and Bioallied Sciences, 2, 189–96.

Schafer and Gilbert (2006) Some statistical implications of dose uncertainty in radiation dose response analyses. Radiation Research, 166, 303–312.

Skrondal and Rabe-Hesketh (2004) Generalized Latent Variable Modeling: Multilevel, Longitudinal, and Structural Equation Models. New York: Chapman and Hall.

Sposto, Stram and Awa (1991) An estimate of the magnitude of random errors in the DS86 dosimetry data on chromosome aberrations and severe epilation. Radiation Research, 128, 157–69.

Stram, Sposto, Preston, Abrahamson, Honda and Awa (1993) Stable chromosome aberrations among A-bomb survivors: An update. Radiation Research, 136, 29–36.

Tekwe, Carter, Cullings and Carroll (2014) Multiple indicators, multiple causes measurement error models. Statistics in Medicine, 33, 4469–81.

Wei and Tanner (1990) A Monte Carlo implementation of the EM algorithms and the poor man's data augmentation algorithms. Journal of the American Statistical Association, 85, 699–704.

Zellner (1970) Estimation of regression relationships containing unobservable independent variables. International Economic Review, 11, 441–54.
