A Skew-Normal Mixture Regression Model

Abstract

A challenge associated with traditional mixture regression models (MRMs), which rest on the assumption of normally distributed errors, is determining the number of unobserved groups. Specifically, even slight deviations from normality can lead to the detection of spurious classes. The current work aims to (a) examine how sensitive the commonly used model selection indices are in class enumeration of MRMs with nonnormal errors, (b) investigate whether a skew-normal MRM can accommodate nonnormality, and (c) illustrate the potential of this model with a real data analysis. Simulation results indicate that model information criteria are not useful for class determination in MRMs unless errors follow a perfect normal distribution. The skew-normal MRM can accurately identify the number of latent classes in the presence of normal or mildly skewed errors, but fails to do so in severely skewed conditions. Furthermore, across the experimental conditions it is seen that some parameter estimates provided by the skew-normal MRM become more biased as skewness increases whereas others remain unbiased. Discussion of these results in the context of the applicability of skew-normal MRMs is provided.

Keywords

skew-normal distributions, mixture regression models, class identification, estimation bias

Normality-based mixture regression analysis has been widely used in marketing research over the past two decades and has been increasingly applied in other fields within the social and behavioral sciences (Wedel & DeSarbo, 1994). However, this type of analysis has been criticized for being too sensitive to the violation of its residual normality assumption, resulting in additional unobserved groups and thus misleading inference from data (e.g., Van Horn et al., 2012). In this article, we present a generalized form of mixture regression models (MRMs) based on the skew-normal distribution introduced by Azzalini (1985). The proposed model subsumes the traditional normality-based MRM as a special case and has the advantage of accommodating a nonnormal continuous response variable. Moreover, it is expected to have important implications for researchers who want to study complex human behaviors exhibiting different response patterns to environmental conditions.

This article begins with an overview of MRMs that includes their primary limitation and an alternative approach to tackle this problem. Next, to alleviate the negative consequences caused by nonnormality in MRMs, skew-normal mixture regression models (SNMRMs) are introduced and investigated through a designed simulation study. Finally, the potential utility of SNMRMs is illustrated with a real data set from psychology.

Overview of Mixture Regression Models

Regression analysis may be one of the most commonly used statistical techniques in social science research. Typically, this method models the relation between a set of predictors (either categorical or continuous) and a dependent variable. Based on a sample of interest, regression coefficients are estimated and used to make inferences about a particular population. However, as numerous studies have shown, when the regression coefficients differ across unobserved subgroups in the population, MRMs are more appropriate because they can simultaneously estimate separate regression equations for those subgroups and classify subjects into their latent classes (DeSarbo & Corn, 1988; Wedel & DeSarbo, 1994). Scenarios in which this model is an ideal approach include the absence of an observed grouping variable to explicate heterogeneity in the data, or cases in which the grouping variable is available but many interaction/moderation effects are needed to capture this heterogeneity. In the latter scenario, model estimation and interpretation may be especially difficult (McClelland & Judd, 1993).

Mixture regression analysis was first introduced by Quandt (1972) and Quandt and Ramsey (1978) under the title switching regression, where the model was estimated with a somewhat inefficient method based on the moment-generating function. Similarly, in what was referred to as clusterwise linear regression, Späth (1979) offered a more efficient ordinary least squares estimation method that minimizes the overall sum of the within-group sums of squared errors. Finally, inspired by Dempster, Laird, and Rubin’s (1977) seminal work on the expectation-maximization (EM) algorithm, Aitkin and Wilson (1980) carried out the maximum likelihood estimation of MRMs, treating group memberships as missing data within the EM framework. This more fine-tuned estimation method still serves as the basis for the estimation of mixture models today, and has been extended to a variety of more generalized scenarios with different parametric forms for a response variable: binary/dichotomous (Kamakura & Russell, 1989; Wong & Maffini, 2011; Zhu & Zhang, 2004), counts (Wedel, DeSarbo, Built, & Ramaswamy, 1993), or any other type from the exponential family (Wedel & DeSarbo, 1995).

Normality-based MRMs specifically have been widely applied in many fields. Applications include marketing (DeSarbo & Corn, 1988; DeSarbo, Wedel, Vriens, & Ramaswamy, 1992; Goldfeld & Quandt, 1973; Naik, Shi, & Tsai, 2007; Quandt & Ramsey, 1978), economics (Cosslett & Lee, 1985; Hamilton, 1989), finance (Engel & Hamilton, 1990), agriculture (Turner, 2000), psychogeriatrics (Kliegel, Zimprich, & Eschen, 2005), nutrition (Arellano-Valle, Castro, Genton, & Gómez, 2008), public health (Schmeige, Levin, & Bryan, 2009), and psychometrics (Liu, Hancock, & Harring, 2011). Despite these widespread applications, researchers in psychology, education, and other social and behavioral sciences have not commonly adopted MRMs, likely in part because of their computational difficulty. Nevertheless, the model is not new to education, and examples do exist. Aitkin, Anderson, and Hinde (1981), for instance, found that students in England achieve different proficiency levels when exposed to different teaching styles, where teaching style was modeled as a latent variable with three classes (formal, informal, and mixed style) inferred from 38 binary questionnaire items. More recently, Van Horn, Bellis, and Snyder (2001) assessed the differential effects of various family resources on students’ academic achievements, whereas Ding (2006) made a clear call for MRMs in education and illustrated their utility with a study on children’s math achievement. This call has become more reasonable given the availability of convenient computer programs that can estimate MRMs, including Mplus (Muthén & Muthén, 2012), Latent Gold (Vermunt & Magidson, 2008), and the free R package Flexmix (Grün & Leisch, 2007).

Like other conventional normality-based mixture models, MRMs rely heavily on the assumption of the underlying distribution, taking the form of

$$f (y_{j} | x_{j}) = \sum_{i = 1}^{g} ϖ_{i} ψ (y_{j} | x_{j}^{'} β_{i}, σ_{i}^{2}), \qquad (1)$$

where $y_{j}$ is the value of the dependent variable for the jth observation (j = 1, . . ., n), and $ϖ_{i}$ denotes the mixing proportion of the ith class (i = 1, . . ., g), constrained so that $ϖ_{i} \in (0, 1)$ and $\sum_{i = 1}^{g} ϖ_{i} = 1$. Moreover, $ψ (y_{j} | x_{j}^{'} β_{i}, σ_{i}^{2})$ indicates that $y_{j}$ has the normal density function with mean $x_{j}^{'} β_{i}$ and residual variance $σ_{i}^{2}$, in which $x_{j}^{'}$ represents the transpose of the (p + 1)-dimensional vector of independent variables for the jth observation, and $β_{i}$ and $σ_{i}^{2}$ are the class-specific regression coefficient vector and residual variance, respectively. Clearly, MRMs require the assumption of within-class normality.
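To make the mixture density in Equation (1) concrete, the following Python sketch runs EM iterations for an MRM with a single predictor, treating class membership as missing data. It is only an illustration under stated assumptions, not the estimation code used in this article: the function names (`em_step`, `fit_mrm`), the standard normal predictor, the sample size of 200 per class, and the starting values are all hypothetical choices made for the example.

```python
import math
import random

def normal_pdf(y, mean, var):
    """Normal density with the given mean and variance."""
    return math.exp(-(y - mean) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)

def em_step(x, y, params):
    """One EM iteration; params is a list of dicts with keys w, b0, b1, var.
    Returns the updated parameters and the log-likelihood of the input parameters."""
    n = len(y)
    post, loglik = [], 0.0
    # E-step: posterior class probabilities for every observation
    for xj, yj in zip(x, y):
        dens = [p["w"] * normal_pdf(yj, p["b0"] + p["b1"] * xj, p["var"]) for p in params]
        total = sum(dens)
        loglik += math.log(total)
        post.append([d / total for d in dens])
    # M-step: weighted least squares within each class
    new_params = []
    for i in range(len(params)):
        r = [pj[i] for pj in post]
        sw = sum(r)
        mx = sum(ri * xj for ri, xj in zip(r, x)) / sw
        my = sum(ri * yj for ri, yj in zip(r, y)) / sw
        sxx = sum(ri * (xj - mx) ** 2 for ri, xj in zip(r, x))
        sxy = sum(ri * (xj - mx) * (yj - my) for ri, xj, yj in zip(r, x, y))
        b1 = sxy / sxx
        b0 = my - b1 * mx
        var = sum(ri * (yj - b0 - b1 * xj) ** 2 for ri, xj, yj in zip(r, x, y)) / sw
        new_params.append({"w": sw / n, "b0": b0, "b1": b1, "var": var})
    return new_params, loglik

def fit_mrm(x, y, params, n_iter=50):
    """Run a fixed number of EM iterations and record the log-likelihood path."""
    history = []
    for _ in range(n_iter):
        params, loglik = em_step(x, y, params)
        history.append(loglik)
    return params, history

# Hypothetical two-class data with normal errors (residual variance .25)
rng = random.Random(1)
x, y = [], []
for b0, b1 in [(0.0, 0.2), (0.5, 0.7)]:
    for _ in range(200):
        xj = rng.gauss(0.0, 1.0)
        x.append(xj)
        y.append(b0 + b1 * xj + rng.gauss(0.0, 0.5))

start = [{"w": 0.5, "b0": -0.5, "b1": 0.0, "var": 0.5},
         {"w": 0.5, "b0": 1.0, "b1": 1.0, "var": 0.5}]
fitted, history = fit_mrm(x, y, start)
```

The recorded log-likelihood path should never decrease from one iteration to the next, which is a convenient sanity check on any EM implementation. In practice, several sets of starting values would be tried, because EM can stop at a local maximum.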

Even a mild violation of within-class normality in these MRMs may result in the overextraction of unobserved groups in an attempt to fit the data (Bauer & Curran, 2003, 2004; Maclean, Morton, Elston, & Yee, 1976; McLachlan & Peel, 2000). Recent work by Van Horn et al. (2012) demonstrated that even mild skewness of residuals in MRMs can lead to spurious classes being identified and biased parameter estimates. To cope with this problem, Van Horn et al. (2012) and George et al. (2012) applied an ordered polytomous approach (i.e., a proportional odds model) to handle mild and severe nonnormality in errors separately. More specifically, they simulated nonnormal response variables in R with varying degrees of skew in the residuals and recoded the nonnormal data into an ordinally scaled variable with six categories. Next, the polytomous MRM was applied to estimate the transformed data in Mplus. Because no specific distribution is assumed for this polytomous model, this method was expected to perform well in recovering the true number of latent classes and population parameters. Under very mild skew conditions, Van Horn et al. found that the true two-class model was correctly identified nearly 100% of the time. Comparing the thresholds, regression coefficients, and associated standard errors between the transformed polytomous model (representing the true population setting) and the estimated polytomous model results, these authors found that some parameter estimates were biased and some had fairly large standard errors. In a parallel study for severely skewed conditions, George et al. found that when intercept differences existed between two classes under conditions of mild skewness, among all model selection criteria they examined (Akaike’s information criterion [AIC], Bayesian information criterion [BIC], sample-adjusted BIC [SABIC], bootstrapping likelihood ratio test, entropy) the BIC performed best but still mistakenly selected a three-class model 11% of the time; meanwhile, under high skew conditions, the SABIC mistakenly supported the three-class model in 80% of total simulations. Furthermore, the parameter estimates for correctly selected two-class models were biased in most cases, with small thresholds being underestimated, large thresholds being overestimated, and the regression coefficients being overestimated in both classes in the presence of even mild to moderate skewness. Thus, although Van Horn et al. and George et al. took the first steps to address class identification issues in MRMs, their method is only applicable to very mildly skewed data with regard to class enumeration and, as a result of the dependent variable’s transformation into an ordinal scale, interpretation of results becomes more difficult.

Skew-Normal Mixture Regression Models

We propose an SNMRM in which skewness is captured by a separate parameter and normal error is encompassed as a special case. Skew-normal distributions were first proposed by Azzalini (1985) as a density function to “allow a ‘continuous’ variation from normality to non-normality” (p. 171). As defined, a skew-normal random variable Z has density function $2 ϕ (z) Φ (λ z)$ , in which $ϕ$ and $Φ$ are the standard normal density function and the standard normal cumulative distribution function, respectively. This class of density functions adds a shape parameter $λ$ , and includes the normal density as a special case when $λ$ equals 0. In short, Z ~ SN( $λ$ ).

This idea has been used in conventional statistical techniques to accommodate slight deviations from normality, such as linear regression models (Arellano-Valle et al., 2008; Chen & Chen, 2003; Sahu, Dey, & Branco, 2003), nonlinear regression models (Cancho, Dey, Lachos, & Andrade, 2010; Cancho, Lachos, & Ortega, 2010; Xie, Weia, & Lina, 2009), and linear mixed models (Ghosh, Branco, & Chakraborty, 2007; Ma, Genton, & Davidian, 2004). The user-friendly R package “sn” (Azzalini, 2013) can estimate univariate or multivariate skew-normal distributions and skew-normal linear regression models. Very recently, skew-normal distributions have also been incorporated into mixture modeling (e.g., Lin, Lee, & Yen, 2007). However, no study has yet examined whether skew-normal distributions can effectively mitigate the extraction of spurious latent classes in MRMs and the overfit and biased parameter estimates that result from it.

The SNMRM we propose takes the form of

$$f (y_{j} | x_{j}) = \sum_{i = 1}^{g} ϖ_{i} ψ (y_{j} | x_{j}^{'} β_{i}, σ_{i}^{2}, λ_{i}) . \qquad (2)$$

The key difference between Equations (1) and (2) lies in the one additional shape parameter $λ_{i}$ . Here $ψ (y_{j} | x_{j}^{'} β_{i}, σ_{i}^{2}, λ_{i})$ represents the skew-normal density function with location $x_{j}^{'} β_{i}$ , squared scale $σ_{i}^{2}$ , and shape parameter $λ_{i}$ . When $λ_{i}$ equals 0, the skew-normal density functions reduce to the normal ones and thereby Model (2) takes a form identical to that of Model (1). As such, MRMs are considered a special case of SNMRMs. Because the shape parameter $λ_{i}$ can capture nonnormality in data, SNMRMs are expected to have the advantage of overcoming the negative consequences of nonnormality.
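The relation between the two models can be checked numerically. The Python sketch below evaluates the mixture density of Model (2) using the usual location-scale form of Azzalini's skew-normal density, $(2/σ) ϕ(z) Φ(λ z)$ with $z = (y − x' β)/σ$. That parameterization is an assumption here, since the article does not write it out, and the function names and parameter values are hypothetical.

```python
import math

def std_normal_pdf(z):
    """Standard normal density."""
    return math.exp(-0.5 * z * z) / math.sqrt(2 * math.pi)

def std_normal_cdf(z):
    """Standard normal cumulative distribution function."""
    return 0.5 * (1 + math.erf(z / math.sqrt(2)))

def skew_normal_pdf(y, loc, scale_var, shape):
    """Skew-normal density with location loc, squared scale scale_var, and shape."""
    scale = math.sqrt(scale_var)
    z = (y - loc) / scale
    return 2 / scale * std_normal_pdf(z) * std_normal_cdf(shape * z)

def snmrm_density(y, x, weights, betas, scale_vars, shapes):
    """Mixture density of Model (2); x includes a leading 1 for the intercept."""
    total = 0.0
    for w, beta, var, lam in zip(weights, betas, scale_vars, shapes):
        loc = sum(b * xk for b, xk in zip(beta, x))
        total += w * skew_normal_pdf(y, loc, var, lam)
    return total

# Hypothetical two-class setting: intercept and slope per class
weights = [0.5, 0.5]
betas = [[0.0, 0.2], [0.5, 0.7]]
scale_vars = [0.25, 0.25]
x_obs = [1.0, 0.3]

density_skewed = snmrm_density(0.4, x_obs, weights, betas, scale_vars, [2.5, -2.3])
density_normal = snmrm_density(0.4, x_obs, weights, betas, scale_vars, [0.0, 0.0])
```

With both shape parameters set to 0, `snmrm_density` returns exactly the normal mixture density of Model (1). Note also that when the shape parameter is nonzero, the location parameter is no longer the mean of the class-specific distribution, and the squared scale is no longer its variance.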

The current study first aims to investigate how sensitive the commonly used model selection indices are in class determination under various degrees of violation of the normality assumption in MRMs. Van Horn et al. (2012) examined this problem, and we expected our results to be similar to theirs. Second and more important, the SNMRM, as a more robust extension of MRMs, will be investigated in terms of its performance in accommodating both normal and nonnormal errors through a simulation study. Finally, the potential advantages of this flexible approach will be illustrated with a real data analysis.

Simulation Study

Data Generation Model

Consider that subjects may respond to the same intervention in two ways, with the intervention having a larger effect on one group than on the other. Accordingly, the two simple linear regression functions $Y_{i} = 0.2 X_{i} + e_{i}$ and $Y_{i} = 0.5 + 0.7 X_{i} + e_{i}$ used in Van Horn et al. (2012) and George et al. (2012) were adopted to represent the differential effect of the covariate X in comparing our new model, the SNMRM, with the MRM. A significant benefit of mixture modeling is that group membership, unknown to the researchers, can be inferred indirectly from the data. Given that our focal interest was the robustness of SNMRMs to nonnormal errors, skewness was the primary manipulated factor examined as it relates to the recovery of the true population parameters. Following Van Horn et al. and George et al., a power transformation was used to create four levels of skewness: Powers of 1, 1.5, 2, and 2.76 were used to obtain normal error and errors with skewness of .2, .5, and 1, respectively. Skewed error was adopted in one or both groups, resulting in six simulation conditions: (a) no skewness for both groups, (b) skewness of .2 for one group, (c) skewness of .2 for both groups, (d) skewness of .5 for one group, (e) skewness of .5 for both groups, and (f) skewness of 1 for both groups. Scatterplots of each condition are displayed in Figure 1. As planned, the nonnormality is not obvious through visual inspection of the six scatterplots. Moreover, as per our design, differential regression patterns do exist in these conditions but are not immediately obvious given the cloud of data.

Figure 1.

Scatterplot of each simulated condition.

It should be noted that when skewed error was placed into both groups, positively skewed error was assigned to the first group and negatively skewed error to the second to vary the nonnormal conditions. This differs from the approach of Van Horn et al. (2012) and George et al. (2012), in which only positive skewness was manipulated. A total of 1,000 subjects per group were used to achieve reliable results for both models, and R (R Development Core Team, 2008) was used for generating the data.
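A data-generating step of this kind can be sketched in Python as follows. The exact form of the power transformation is not given here, so the sketch swaps in a different device: centered and rescaled gamma errors, whose skewness equals 2 divided by the square root of the shape parameter. The standard normal covariate, the function names, and the seed are likewise assumptions made only for illustration. The residual variance of .25 and the regression coefficients follow the design described above.

```python
import math
import random

def skewed_errors(n, skew, sd, rng):
    """Draw n zero-mean errors with standard deviation sd and target skewness skew.
    Uses a standardized gamma variate (not the source's power transformation)."""
    if skew == 0:
        return [rng.gauss(0.0, sd) for _ in range(n)]
    shape = 4.0 / skew ** 2            # gamma skewness is 2 / sqrt(shape)
    sign = 1.0 if skew > 0 else -1.0   # negate to obtain negative skew
    return [sign * sd * (rng.gammavariate(shape, 1.0) - shape) / math.sqrt(shape)
            for _ in range(n)]

def simulate_condition(n_per_class, skew1, skew2, seed=0):
    """Return a list of (x, y, class_label) tuples for one simulated data set."""
    rng = random.Random(seed)
    sd = math.sqrt(0.25)
    data = []
    design = [(0.0, 0.2, skew1), (0.5, 0.7, skew2)]  # (intercept, slope, skewness)
    for label, (b0, b1, skew) in enumerate(design, start=1):
        for e in skewed_errors(n_per_class, skew, sd, rng):
            x = rng.gauss(0.0, 1.0)  # assumed covariate distribution
            data.append((x, b0 + b1 * x + e, label))
    return data

def sample_skewness(values):
    """Moment-based sample skewness."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n
    m3 = sum((v - mean) ** 3 for v in values) / n
    return m3 / m2 ** 1.5

# Analogue of Condition 5: skewness of .5 in both classes, with opposite signs
condition5 = simulate_condition(1000, 0.5, -0.5, seed=2024)
```

Because the errors are centered and rescaled, their mean stays at 0 and their variance at .25 regardless of the skewness level, so only the shape of the error distribution changes across conditions.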

Model Estimation and Model Selection

No existing program is available to estimate SNMRMs directly. Therefore, maximum likelihood estimation via the EM algorithm was implemented using R to estimate one-, two-, and three-class MRMs and SNMRMs for each simulated scenario (see the appendix for technical details¹). Furthermore, because there is no universal criterion for class enumeration in mixture modeling, we relied on two types of model fit indices for selecting the optimal model among one-, two-, and three-class mixture models: six information criteria include AIC (Akaike, 1987), consistent AIC (CAIC; Bozdogan, 1987), sample-adjusted CAIC (SACAIC; Tofighi & Enders, 2008), Bayesian information criterion (BIC; Schwarz, 1978), sample-adjusted BIC (SABIC; Sclove, 1987), and difference in BIC (DBIC; Draper, 1995); and two classification-based criteria are entropy (Celeux & Soromenho, 1996) and integrated completed likelihood (ICL; Biernacki, Celeux, & Govaert, 2000). Among them, Entropy, AIC, BIC, and SABIC were examined by Van Horn et al. (2012). The entropy criterion in the current study was developed by Celeux and Soromenho (1996) and is not the one defined by Ramaswamy, DeSarbo, Reibstein, and Robinson (1993) and used in Mplus.
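For readers who want to compute the information criteria themselves, the Python sketch below uses their commonly cited forms. The article does not print these formulas, so they should be read as assumptions. As a check, the example plugs in the one-class MRM from the real data analysis reported later (log-likelihood of −7,879 and 2,013 children) and assumes three free parameters: intercept, slope, and residual variance.

```python
import math

def information_criteria(loglik, n_params, n_obs):
    """Common forms of six information criteria; smaller values indicate better fit."""
    deviance = -2.0 * loglik
    n_star = (n_obs + 2) / 24.0  # sample-size adjustment used by SABIC and SACAIC
    return {
        "AIC": deviance + 2 * n_params,
        "CAIC": deviance + n_params * (math.log(n_obs) + 1),
        "SACAIC": deviance + n_params * (math.log(n_star) + 1),
        "BIC": deviance + n_params * math.log(n_obs),
        "SABIC": deviance + n_params * math.log(n_star),
        "DBIC": deviance + n_params * (math.log(n_obs) - math.log(2 * math.pi)),
    }

# One-class MRM from the applied example: assumed 3 free parameters
one_class = information_criteria(loglik=-7879, n_params=3, n_obs=2013)
rounded = {name: round(value) for name, value in one_class.items()}
```

Rounded to whole numbers, these values agree with the one-class MRM entries reported in Table 4. All six criteria share the same deviance term and differ only in how heavily they penalize each additional parameter, which is why they can disagree about the number of classes.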

To assess the quality of the classification function of each mixture model, two indices were used in the current study: correct classification rates (CCRs) based on the posterior probability assigned to each subject and the adjusted Rand index (ARI; Hubert & Arabie, 1985). ARI corrects for chance by accounting for the fact that classification performed randomly would be expected to correctly classify some cases. This index has expected values of 0 under random classification and 1 for perfect classification. For both CCR and ARI, larger values indicate better classification results.
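The chance correction in the ARI can be illustrated with a short Python sketch that follows the Hubert and Arabie (1985) formula based on pair counts in the cross-classification table. The function name and the toy label vectors are hypothetical.

```python
from collections import Counter
from math import comb

def adjusted_rand_index(true_labels, pred_labels):
    """Adjusted Rand index from pair counts in the cross-classification table.
    Undefined (division by zero) if both partitions put every case in one class."""
    n = len(true_labels)
    cells = Counter(zip(true_labels, pred_labels))
    rows = Counter(true_labels)
    cols = Counter(pred_labels)
    sum_cells = sum(comb(c, 2) for c in cells.values())
    sum_rows = sum(comb(c, 2) for c in rows.values())
    sum_cols = sum(comb(c, 2) for c in cols.values())
    expected = sum_rows * sum_cols / comb(n, 2)   # agreement expected by chance
    max_index = 0.5 * (sum_rows + sum_cols)
    return (sum_cells - expected) / (max_index - expected)

# Perfect recovery up to a relabeling of the classes
ari_perfect = adjusted_rand_index([1, 1, 2, 2], [2, 2, 1, 1])
# Every predicted class splits the true classes evenly
ari_poor = adjusted_rand_index([1, 1, 2, 2], [1, 2, 1, 2])
```

Because the index is built from pairs of cases rather than from the labels themselves, it is unaffected by which class is called Class 1 and which is called Class 2. It can also fall below 0 when agreement is worse than expected by chance.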

In the mixture context, the likelihood is invariant under a permutation of the class labels in parameter vectors. Therefore, a label switching problem can occur when some labels of the mixture classes permute (McLachlan & Peel, 2000). Although the switching of class labels is not a concern in the general course of the maximum likelihood estimation via the EM algorithm for studies with only one replication, it was a serious problem in our simulation study because the same model was estimated iteratively for 500 replications per cell. To solve this problem, we used a simple strategy of considering all permutations of the class labels and the one with the lowest misclassification error was treated as the final class membership assignment (Lo, Brinkman, & Gottardo, 2008).
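The permutation strategy described above can be sketched in Python as follows. The function name and the toy labels are hypothetical, and the sketch assumes that the true class memberships are known, as they are in a simulation.

```python
from itertools import permutations

def align_labels(true_labels, pred_labels):
    """Try every relabeling of the estimated classes and keep the one with the
    fewest misclassified cases. Returns the relabeled assignments and the CCR."""
    classes = sorted(set(true_labels) | set(pred_labels))
    best_map, best_errors = None, None
    for perm in permutations(classes):
        mapping = dict(zip(classes, perm))
        errors = sum(mapping[p] != t for t, p in zip(true_labels, pred_labels))
        if best_errors is None or errors < best_errors:
            best_map, best_errors = mapping, errors
    relabeled = [best_map[p] for p in pred_labels]
    ccr = 1 - best_errors / len(true_labels)
    return relabeled, ccr

# The estimated labels are mostly correct, but the two classes are swapped
true_classes = [1, 1, 1, 2, 2, 2]
estimated = [2, 2, 1, 1, 1, 1]
relabeled, ccr = align_labels(true_classes, estimated)
```

In this toy case the swapped labeling leaves one misclassified case out of six. Enumerating all permutations is feasible here because at most three classes are fitted; the number of permutations grows factorially with the number of classes.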

Results

For the purpose of comparing the performance of the MRM and SNMRM, the results of class enumeration, parameter estimation, and classification accuracy were summarized in pairs. Table 1 presents the percentages of one-, two-, and three-class models selected according to the eight aforementioned indices for the six simulated conditions examined. Tables 2 and 3 present the parameter estimates and classification accuracies averaged over 500 replications for each condition separately. All the one-, two-, and three-class MRM and SNMRM solutions converged properly. No offending parameter estimates (e.g., negative variances) were found in the estimated results.

Table 1.

Class Identification Results in Percentages for Six Simulation Conditions With Varying Degrees of Skewness.

	Condition 1: Normal error
	Log-likelihood		AIC		CAIC		SACAIC		BIC		SABIC		DBIC		Entropy		ICL
	MRM	SN	MRM	SN	MRM	SN	MRM	SN	MRM	SN	MRM	SN	MRM	SN	MRM	SN	MRM	SN
1 class	−2686.2	−2683.6	0	0	0	2	0	0	0	0	0	0	0	0	/	/	/	/
2 class	−2659.7	−2659.9	80	94	100	98	99	99	99	98	98	99	99	100	99	98	99	98
3 class	−2658.5	−2658.6	20	6	0	0	1	1	1	2	2	1	1	0	1	2	1	2
	Condition 2: Skew of .2 in Class 2
	Log-likelihood		AIC		CAIC		SACAIC		BIC		SABIC		DBIC		Entropy		ICL
	MRM	SN	MRM	SN	MRM	SN	MRM	SN	MRM	SN	MRM	SN	MRM	SN	MRM	SN	MRM	SN
1 class	−2686.1	−2678.3	0	0	0	0	0	0	0	0	0	0	0	0	/	/	/	/
2 class	−2651.2	−2651.5	62	80	98	99	94	99	98	99	87	98	94	99	99	99	99	99
3 class	−2648.9	−2649.4	38	20	2	1	6	1	2	1	13	2	6	1	1	1	1	1
	Condition 3: Skew of .2 in both classes
	Log-likelihood		AIC		CAIC		SACAIC		BIC		SABIC		DBIC		Entropy		ICL
	MRM	SN	MRM	SN	MRM	SN	MRM	SN	MRM	SN	MRM	SN	MRM	SN	MRM	SN	MRM	SN
1 class	−2687.3	−2687.2	0	0	0	0	0	0	0	0	0	0	0	0	/	/	/	/
2 class	−2660.8	−2659.9	19	54	84	100	68	94	81	100	56	88	68	94	93	93	93	94
3 class	−2655.9	−2656.5	81	46	16	0	32	6	19	0	44	12	32	6	7	7	7	6
	Condition 4: Skew of .5 in Class 2
	Log-likelihood		AIC		CAIC		SACAIC		BIC		SABIC		DBIC		Entropy		ICL
	MRM	SN	MRM	SN	MRM	SN	MRM	SN	MRM	SN	MRM	SN	MRM	SN	MRM	SN	MRM	SN
1 class	−2684.8	−2669.8	0	0	0	1	0	0	0	0	0	0	0	0	/	/	/	/
2 class	−2640	−2640.3	30	50	94	99	77	88	91	98	72	87	80	91	100	100	100	100
3 class	−2636.3	−2636.6	70	50	6	0	23	12	9	2	28	17	20	9	0	0	0	0
	Condition 5: Skew of .5 in both classes
	Log-likelihood		AIC		CAIC		SACAIC		BIC		SABIC		DBIC		Entropy		ICL
	MRM	SN	MRM	SN	MRM	SN	MRM	SN	MRM	SN	MRM	SN	MRM	SN	MRM	SN	MRM	SN
1 class	−2682.1	−2682	0	0	0	0	0	0	0	0	0	0	0	0	/	/	/	/
2 class	−2652.2	−2645.1	0	3	13	89	1	46	10	78	1	32	3	50	81	76	80	77
3 class	−2635.9	−2636.3	100	97	87	11	99	54	90	22	99	68	97	50	19	24	20	23
	Condition 6: Skew of 1 in both classes
	Log-likelihood		AIC		CAIC		SACAIC		BIC		SABIC		DBIC		Entropy		ICL
	MRM	SN	MRM	SN	MRM	SN	MRM	SN	MRM	SN	MRM	SN	MRM	SN	MRM	SN	MRM	SN
1 class	−2686.3	−2684.2	0	0	0	0	0	0	0	0	0	0	0	0	/	/	/	/
2 class	−2647.2	−2621.5	0	0	0	8	0	1	0	3	0	1	0	1	66	58	61	54
3 class	−2597.5	−2597.8	100	100	100	92	100	99	100	97	100	99	100	99	34	42	39	46

Note. MRM = mixture regression model; SN = skew-normal; AIC = Akaike’s information criterion; BIC = Bayesian information criterion; CAIC = consistent AIC; SACAIC = sample-adjusted CAIC; SABIC = sample-adjusted BIC; DBIC = difference in BIC; ICL = integrated completed likelihood.

Table 2.

Parameter Estimates and Associated Standard Errors for Six Simulation Conditions With Varying Degrees of Skewness.

	Condition 1: Normal error		Condition 2: Skew of .2 in Class 2		Condition 3: Skew of .2 in both classes
True parameters	MRM	SN	MRM	SN	MRM	SN
Class 1
Intercept = 0	.01 (.025)	0 (.037)	.01 (.031)	0 (.057)	−.01 (.072)	−.11 (.081)
Slope = .2	.20 (.026)	.20 (.026)	.21 (.027)	.20 (.027)	.21 (.045)	.21 (.041)
Var(Residual) = .25	.27 (.013)	.27 (.014)	.27 (.019)	.30 (.022)	.25 (.077)	.27 (.116)
Shape parameter	—	.02 (.057)	—	0 (.13)	—	.24 (.417)
Class 2
Intercept = .5	.49 (.026)	.50 (.037)	.51 (.03)	.60 (.069)	.52 (.071)	.61 (.084)
Slope = .7	.70 (.026)	.70 (.026)	.71 (.027)	.70 (.027)	.70 (.044)	.70 (.04)
Var(Residual) = .25	.27 (.014)	.27 (.015)	.24 (.018)	.30 (.023)	.25 (.078)	.26 (.115)
Shape parameter	—	0 (.068)	—	−0.20 (.177)	—	−.23 (.42)
	Condition 4: Skew of .5 in Class 2		Condition 5: Skew of .5 in both classes		Condition 6: Skew of 1 in both classes
True parameters	MRM	SN	MRM	SN	MRM	SN
Class 1
Intercept = 0	0 (.028)	0 (.242)	−.10 (.063)	−.20 (.036)	−.1 (.089)	−.29 (.019)
Slope = .2	.24 (.042)	.23 (.03)	.20 (.044)	.20 (.027)	.21 (.05)	.21 (.017)
Var(Residual) = .25	.29 (.014)	.29 (.046)	.23 (.045)	.26 (.039)	.20 (.082)	.28 (.02)
Shape parameter	—	−.08 (.686)	—	2.49 (.504)	—	5.05 (.903)
Class 2
Intercept = .5	.57 (.061)	.66 (.12)	.54 (.065)	.70 (.132)	.58 (.093)	.77 (.018)
Slope = .7	.71 (.029)	.71 (.025)	.69 (.046)	.70 (.024)	.68 (.05)	.69 (.017)
Var(Residual) = .25	.20 (.039)	.23 (.093)	.24 (.047)	.28 (.095)	.21 (.086)	.26 (.025)
Shape parameter	—	−2.52 (.834)	—	−2.31 (.808)	—	−5.38 (.885)

Note. Standard deviations of parameter estimates across 500 replications are included in parentheses. MRM = mixture regression model; SN = skew-normal.

Table 3.

Classification Results for Six Simulation Conditions With Varying Degrees of Skewness.

Classification	No skew		Skew of .2 in Class 2		Skew of .2 in both classes
	MRM	SN	MRM	SN	MRM	SN
CCR	0.608	0.661	0.70	0.70	0.702	0.703
ARI	0.051	0.106	0.16	0.16	0.164	0.165
Classification	Skew of .5 in Class 2		Skew of .5 in both classes		Skew of 1 in both classes
	MRM	SN	MRM	SN	MRM	SN
CCR	0.701	0.701	0.709	0.71	0.716	0.72
ARI	0.162	0.162	0.175	0.177	0.187	0.193

Note. MRM = Mixture regression model; SN = skew-normal; CCR = correct classification rate; ARI = adjusted Rand index.

Our first consideration was the accuracy of class identification. Inspecting the results of the six conditions with various degrees of skewed errors in either one or two groups, we found that all six information criteria tended to select three-class models, thereby overfitting the data, as the degree of nonnormality increased. This result was consistent with previous findings (George et al., 2012; Van Horn et al., 2012). As shown in Table 1, when the normality assumption was met in Condition 1, two-class models were correctly selected by all criteria except AIC in more than 95% of the replications in both the MRM and the SNMRM. Consequently, the SNMRM can be used as an alternative to the MRM in class enumeration under normal conditions.

When mildly skewed errors of .2 were embedded in one or both groups, as demonstrated in Conditions 2 and 3, respectively, the SNMRM was better than the MRM in terms of the performance of all model selection criteria. Moreover, BIC, DBIC, CAIC, SACAIC, Entropy, and ICL were prone to select the correct two-class model in the SNMRM with accuracy rates exceeding 95%. In contrast to the SNMRM results, model selection indices in the MRM were more likely to select the three-class model, especially when both classes had skewed errors, resulting in unacceptable rates of accuracy. A similar pattern was revealed in Conditions 4 and 5, in which skewness increased to .5 for one or both groups: Accuracy rates for most model fit indices in the SNMRM decreased but remained much higher than those in the MRM. Comparing Conditions 4 and 5, again, the selection rates of two-class models dropped when both classes had skewed error. This considerable decrease indicated that all the model fit indices in the MRM were particularly sensitive to the change in skewed errors from one group to both groups. The results of Condition 2 through Condition 5 indicate that the SNMRM is more robust to mild skewness than the MRM.

When the errors with skewness of .5 were in only one group in Condition 4, CAIC, BIC, entropy, and ICL selected the two-class SNMRM more than 95% of the time. However, with the same degree of skewness operating in both groups, none of the model fit criteria provided a sufficient rate of accuracy in determining the number of groups. When nonnormality reaches skewness of 1 in Condition 6, no model fit index performs well in class enumeration, although all the information criteria have a slightly better chance of selecting the two-class model in the SNMRM than in the MRM. Clearly, neither the SNMRM nor the MRM can effectively salvage nonnormal data when the skewness of error becomes large, although model fit indices in SNMRM generally outperform their counterparts in MRM.

For replications where the two-class model was correctly selected, our next inquiry concerned the model parameter estimates for the two-class MRM and SNMRM. When errors were normally distributed as shown in Condition 1 of Table 2, both types of two-class mixture models on average provided unbiased estimates for the intercept and slopes whereas both slightly overestimated the residual variance. The shape parameter estimates in the SNMRM are close to 0, which is reasonable given that the residuals meet the assumption of normality. However, as errors become more skewed, most estimates are more distorted as evidenced by the increasing magnitude of bias reflected by Table 2. In Condition 2 with very mild skewness of .2 in one group, the MRM and the SNMRM provide similar estimates for the group with normal error whereas SNMRM slightly overestimates the intercept for the other group with skewed error. The shape parameter estimates accurately capture that skewed residuals occurred in Class 2 with nonzero values. When both classes of residuals are skewed in Condition 3, the SNMRM overestimates both the intercept and residual variance noticeably.

As for the two conditions with skewness of .5, a similar pattern is observed: The two models tend to provide comparable and good estimates of the slopes, slightly biased but comparable estimates of the residual variance, unbiased estimates of the intercepts for the group with normal error, and somewhat biased estimates of the intercepts for the group with skewed error. In the last condition, with severely skewed errors of 1 in both classes, slope estimates are similar in the MRM and the SNMRM, and both models underestimate the intercept in Class 1 and overestimate the intercept in Class 2. The magnitude of bias is relatively larger in the SNMRM. Moreover, the MRM is prone to underestimate the residual variance whereas the SNMRM produces an inflated estimate of residual variance. The sign and magnitude of the estimated shape parameters are consistent with the specified model settings. Overall, once the two-class model is correctly identified, both the MRM and the SNMRM can provide comparable and reasonable estimates of the slopes but somewhat biased estimates of the intercepts and residual variance across all six conditions. As the degree of skewness increases, the bias in the intercept estimates becomes larger in the SNMRM than in the MRM.

The empirical standard errors of these estimates, that is, the standard deviations of parameter estimates over the 500 replications, were also included in the parentheses of Table 2. Comparing these two mixture models, we found that as long as the estimates of intercepts and residual variance are not too biased (like those in Condition 6), the SNMRM generally provides estimates with less certainty across samples since their empirical standard errors are larger than those in the MRM. However, the SNMRM can provide the same or more accurate estimates for slope parameters with less uncertainty, as evidenced by the smaller standard errors across all the six conditions.

Finally, because of unavoidable prediction errors in regression, membership classification is not expected to be perfect. Comparing the MRM and the SNMRM in the CCR (see Table 3), the average posterior probability in SNMRM is greater than or equal to its counterpart in the MRM across all the conditions, implying that the SNMRM assigns membership to the subjects more accurately. The same results are seen with the ARI. Overall, the SNMRM can achieve better classification results than the MRM does once the number of latent classes is correctly identified.

Applied Example

Data for the applied study came from the National Survey of Child and Adolescent Well-Being (U.S. Department of Health and Human Services, Administration for Children, Youth and Families, 2003). This survey included children (newborn to 12 years) who were subject to child abuse or neglect, and collected a wide variety of physiological and psychological variables. Among them, one key variable was the head circumference for children from newborn up to 4 years of age. An inspection of the histogram of the variable as shown in Figure 2 indicated that some researchers might have measured the head circumference in inches even though they were instructed to use centimeters as the measurement units. Although the principal researchers were unable to contact individual field researchers to confirm this suspicion, a series of general linear model analyses using this variable (Liu et al., 2011) also suggested that systematic measurement errors did occur and could be accommodated by mixture models. Taking into consideration the possible consequences of the systematic measurement error, we screened out observations with missing data or unrealistic values, probably due to coding errors, and kept 2,013 children in the final sample. The sample has a mean of 39.10 cm and a standard deviation of 12.25 cm, which is inconsistent with prior findings regarding head circumference in this age range. Thus, the apparently compromised first and second moment information could not be used directly for our focal regression analysis, in which age is considered a useful predictor. Fortunately, mixture modeling is a promising technique to deal with such systematic measurement error. The two distinct linear patterns in the scatterplot of head circumference and age in months again implied the existence of two latent classes. Therefore, class determination was not a substantive issue in this example but could serve as a means to compare the effectiveness of the MRM and the SNMRM.

Figure 2.

Histogram of head circumference.

In practice, the number of latent classes needs to be determined first for the MRM. Absent a sound theoretical rationale, several statistical indices are typically used together for this purpose. Although the two-class model is believed to be the true condition in this case (cases measured in inches and cases measured in centimeters), the model information criteria (AIC, CAIC, SACAIC, BIC, SABIC, and DBIC) all favor the four-class model over the two-class model, as Table 4 shows. In addition, both the bootstrap likelihood ratio test and the Lo–Mendell–Rubin likelihood ratio test supported the four-class model. (Because of their computational intensity, these two tests were not included in our simulation study.) Consistent with previous findings, ICL and Entropy are more robust to nonnormal data in general. They serve better in the SNMRM than in the MRM with regard to class determination.
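To make the class enumeration indices concrete, the following minimal sketch computes several of them from a log-likelihood, a parameter count k, and a sample size n. The formulas are the commonly used definitions and are assumed, not confirmed, to match the ones used for Table 4; the function name and the value k = 3 for the one-class MRM (intercept, slope, residual variance) are likewise assumptions made for illustration.

```python
import math

def information_criteria(loglik, k, n):
    """Return common likelihood-based criteria (smaller values indicate better fit)."""
    deviance = -2.0 * loglik
    return {
        "AIC": deviance + 2 * k,
        "BIC": deviance + k * math.log(n),
        "CAIC": deviance + k * (math.log(n) + 1),
        "SABIC": deviance + k * math.log((n + 2) / 24),
        "SACAIC": deviance + k * (math.log((n + 2) / 24) + 1),
        "DBIC": deviance + k * (math.log(n) - math.log(2 * math.pi)),
    }

# One-class MRM row of Table 4: log-likelihood -7,879 with n = 2,013 children.
# k = 3 is an assumed parameter count, not a value stated in the article.
one_class = information_criteria(loglik=-7879, k=3, n=2013)
rounded = {name: round(value) for name, value in one_class.items()}
print(rounded)
```

Under these assumptions the rounded values agree with the one-class MRM entries in Table 4 (for example, AIC = 15,764 and BIC = 15,781). Because the tabled log-likelihoods are themselves rounded, small discrepancies in other rows would not be surprising.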

Table 4.

Class Identification Results for the Real Data Analysis.

Model	Log-likelihood (MRM)	Log-likelihood (SN)	AIC (MRM)	AIC (SN)	CAIC (MRM)	CAIC (SN)	SACAIC (MRM)	SACAIC (SN)	BIC (MRM)	BIC (SN)
1 class	−7,879	−7,261	15,764	14,530	15,784	14,556	15,774	14,543	15,781	14,552
2 class	−5,660	−5,611	11,330	11,236	11,363	11,282	11,347	11,260	11,358	11,275
3 class	−5,511	−5,511	11,036	11,042	11,082	11,108	11,060	11,076	11,075	11,098
4 class	−5,489	−5,488	10,995	11,003	11,055	11,089	11,026	11,047	11,046	11,076

Model	SABIC (MRM)	SABIC (SN)	DBIC (MRM)	DBIC (SN)	Entropy (MRM)	Entropy (SN)	ICL (MRM)	ICL (SN)
1 class	15,771	14,539	15,775	14,545	—	—	15,781	14,552
2 class	11,342	11,253	11,348	11,262	0	0	11,358	11,275
3 class	11,053	11,066	11,062	11,079	95	95	11,264	11,287
4 class	11,017	11,034	11,029	11,052	349	497	11,743	12,069

Note. The numbers in boldface within each column indicate the best fitting model among the one- to four-class models. MRM = mixture regression model; SN = skew-normal; AIC = Akaike’s information criterion; BIC = Bayesian information criterion; CAIC = consistent AIC; SACAIC = sample-adjusted CAIC; SABIC = sample-adjusted BIC; DBIC = difference in BIC; ICL = integrated completed likelihood.
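The relation between the Entropy and ICL columns of Table 4 can be sketched as follows. The sketch assumes that Entropy is the raw classification entropy of the posterior class probabilities and that ICL is computed as BIC plus twice that entropy; both are assumptions, although the tabled values are consistent with them to within rounding (for the three-class MRM, 11,075 + 2 × 95 = 11,265 against a reported ICL of 11,264). The function names are hypothetical.

```python
import math

def classification_entropy(posteriors):
    """Sum of -p * ln(p) over all cases and classes; 0 means perfectly separated classes."""
    total = 0.0
    for row in posteriors:
        for p in row:
            if p > 0.0:  # treat 0 * ln(0) as 0
                total -= p * math.log(p)
    return total

def icl_bic(bic, entropy):
    """ICL approximated as BIC plus twice the classification entropy."""
    return bic + 2.0 * entropy

# Two cases assigned with certainty contribute no entropy.
crisp = [[1.0, 0.0], [0.0, 1.0]]
# One case split evenly between two classes contributes ln(2).
fuzzy = [[0.5, 0.5]]

print(classification_entropy(crisp), classification_entropy(fuzzy))
print(icl_bic(11075, 95))
```

Read this way, an entropy of 0 for the two-class models means the "inch" and "centimeter" groups are classified with essentially no uncertainty, so ICL equals BIC in that row, whereas the extra classes in the three- and four-class models are poorly separated and are penalized accordingly.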

A post hoc analysis showed that the two groups identified by the two-class MRM deviated slightly from normality: the “inch” group had a skewness of −.5 and a kurtosis of 2.2, whereas the “centimeter” group had a skewness of −.3 and a kurtosis of −.1. This finding suggested that the additional classes in the three- or four-class MRM might reflect a violation of the normality assumption, even though the nonnormality may seem trivial. It is this negative finding that furthered our curiosity about the more robust SNMRM.
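A post hoc check of this kind can be sketched as below. The article does not state which estimators were used or whether they were applied to raw values or to residuals, so the simple moment estimators and the use of excess kurtosis (normal distribution = 0, which the reported value of −.1 suggests) are assumptions, as are the function name and the toy data.

```python
def skewness_and_excess_kurtosis(values):
    """Moment-based sample skewness and excess kurtosis (both 0 for a normal distribution)."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n
    m3 = sum((v - mean) ** 3 for v in values) / n
    m4 = sum((v - mean) ** 4 for v in values) / n
    skewness = m3 / m2 ** 1.5
    excess_kurtosis = m4 / m2 ** 2 - 3.0
    return skewness, excess_kurtosis

# Toy data only: a symmetric sample and a right-skewed sample.
symmetric = [1.0, 2.0, 3.0, 4.0, 5.0]
right_skewed = [0.0, 0.0, 0.0, 10.0]

print(skewness_and_excess_kurtosis(symmetric))
print(skewness_and_excess_kurtosis(right_skewed))
```

In an application, the function would be called once per class on the observations (or residuals) assigned to that class by the fitted two-class model.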

As seen in Table 5, the MRM and the SNMRM provided almost identical slope estimates and similar intercept estimates but considerably different estimates of the error variance. The estimated nonzero shape parameters for both classes indicate that the residuals were not normally distributed. In sum, the proposed SNMRM performed reasonably well for this data set.
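As a rough plausibility check on the inches-versus-centimeters interpretation, the MRM estimates reported in Table 5 can be compared after a unit conversion. The sketch below assumes that Class 1 is the "inch" group and Class 2 the "centimeter" group, which the article implies but does not state in these terms; the variable names are hypothetical.

```python
CM_PER_INCH = 2.54

# MRM estimates from Table 5; labeling Class 1 as the "inch" group is an assumption.
class1 = {"intercept": 16.63, "slope": 0.09}
class2 = {"intercept": 42.34, "slope": 0.22}

# Convert the Class 1 regression line from inches to centimeters.
class1_in_cm = {name: value * CM_PER_INCH for name, value in class1.items()}

for name in class1:
    print(f"{name}: Class 1 x 2.54 = {class1_in_cm[name]:.2f}, Class 2 = {class2[name]:.2f}")
```

The converted Class 1 intercept (about 42.24) and slope (about 0.23) are close to the Class 2 estimates, which is what one would expect if the two classes describe the same growth pattern recorded in different units. The slopes are reported to only two decimals, so this comparison is necessarily coarse.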

Table 5.

Parameter Estimates for the Real Data Analysis.

Parameter	MRM	SN
Intercept, Class 1	16.63	17.79
Slope, Class 1	0.09	0.09
Residual variance, Class 1	2.56	3.91
Shape parameter, Class 1	—	−1.10
Intercept, Class 2	42.34	44.77
Slope, Class 2	0.22	0.21
Residual variance, Class 2	6.86	12.67
Shape parameter, Class 2	—	−1.67

Note. MRM = mixture regression model; SN = skew-normal.
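One way to read the differences between the MRM and SN columns of Table 5 is through the moments of the skew-normal distribution. With location ξ, scale ω, and shape λ, and with δ = λ/√(1 + λ²), the mean is ξ + ωδ√(2/π) and the variance is ω²(1 − 2δ²/π). The sketch below assumes that the SN intercept and residual variance in Table 5 are the location and squared scale of this direct parameterization; that reading is an assumption, and the function and variable names are hypothetical.

```python
import math

def skew_normal_mean_shift_and_variance(scale_sq, shape):
    """Mean offset from the location, and variance, of a skew-normal error term."""
    delta = shape / math.sqrt(1.0 + shape ** 2)
    mean_shift = math.sqrt(scale_sq) * delta * math.sqrt(2.0 / math.pi)
    variance = scale_sq * (1.0 - 2.0 * delta ** 2 / math.pi)
    return mean_shift, variance

# SN estimates from Table 5, read as location, squared scale, and shape (an assumption).
sn_estimates = {
    "Class 1": {"intercept": 17.79, "scale_sq": 3.91, "shape": -1.10},
    "Class 2": {"intercept": 44.77, "scale_sq": 12.67, "shape": -1.67},
}

implied = {}
for label, est in sn_estimates.items():
    shift, variance = skew_normal_mean_shift_and_variance(est["scale_sq"], est["shape"])
    implied[label] = (est["intercept"] + shift, variance)
    print(f"{label}: implied mean intercept = {implied[label][0]:.2f}, "
          f"implied error variance = {implied[label][1]:.2f}")
```

Under this reading, the implied mean-scale intercepts (about 16.62 and 42.33) and error variances (about 2.55 and 6.73) are close to the MRM estimates (16.63 and 42.34; 2.56 and 6.86). This suggests that much of the apparent discrepancy between the two columns may reflect parameterization and not a different description of the data.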

Discussion

Human behaviors are complex. Human actions and reactions in environmental contexts might be heterogeneous. For example, students might differ in how they respond to the same educational treatment or intervention. Considering differential effectiveness may help researchers better understand the complex interactions between individuals and treatments and thereby better promote positive changes in different individuals. Given the growing recognition of heterogeneous regression patterns underlying data and the capability of the MRM to reveal such heterogeneity, the MRM is expected to see increasingly widespread application in studying educational and other social and behavioral phenomena.

However, as exemplified in Figure 1, visual inspection of a scatterplot may not be an effective way to detect heterogeneous regression patterns, which makes communication and collaboration between quantitative researchers and content experts essential for applying this model validly in practice. Unfortunately, when it comes to class determination, researchers without a clear theoretical expectation very often rely on model fit indices to find the optimal number of groups for the MRM. As the current study and other researchers (George et al., 2012; Van Horn et al., 2012) have shown, the MRM requires the normality assumption to hold almost perfectly. Otherwise, it tends to overextract groups to compensate for the nonnormal errors. Because of this oversensitivity, the MRM is rendered useless for practitioners who need to explore the heterogeneous regression patterns underlying data. Therefore, as a robust version of the MRM, the SNMRM appears to be a more theoretically compelling modeling tool for practitioners because it can investigate differential effects of covariates while also accommodating moderately nonnormal errors.

This study used both simulated and real data to investigate the performance of the SNMRM in comparison with the MRM in the presence of skewed errors. Our first finding is that the commonly used model selection information criteria in the MRM are all prone to enumerating spurious latent classes to fit the data, even when the skewness is mild. As the degree of nonnormality increases, the chance of overextraction increases. In practice, even large samples may not guarantee that the residuals perfectly follow the hypothesized normal distribution. Moreover, in scenarios where one or more key covariates are missing from the linear composite of predictors, this assumption is even more likely to be violated. Entropy and ICL-BIC are relatively robust to mildly skewed errors but do not work well under more severe skewness. As such, we suggest that practitioners be cautious when using information criteria–based indices for class enumeration in the MRM, although such indices might be useful in this regard in other types of mixture modeling.
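For readers who want to explore this sensitivity themselves, skewed errors can be generated with the standard stochastic representation of a skew-normal variate: with δ = λ/√(1 + λ²), the quantity δ|U₀| + √(1 − δ²)U₁ is skew-normal with shape λ when U₀ and U₁ are independent standard normal draws. The sketch below is only an illustration and is not the simulation design used in this study; the function names, the covariate range of 0 to 48, and all parameter values are hypothetical.

```python
import math
import random

def draw_skew_normal(rng, shape):
    """One standard skew-normal draw via delta*|U0| + sqrt(1 - delta^2)*U1."""
    delta = shape / math.sqrt(1.0 + shape ** 2)
    u0 = abs(rng.gauss(0.0, 1.0))
    u1 = rng.gauss(0.0, 1.0)
    return delta * u0 + math.sqrt(1.0 - delta ** 2) * u1

def simulate_class(rng, n, intercept, slope, scale, shape):
    """Generate (x, y) pairs for one latent class with skew-normal errors."""
    data = []
    for _ in range(n):
        x = rng.uniform(0.0, 48.0)  # hypothetical covariate range
        y = intercept + slope * x + scale * draw_skew_normal(rng, shape)
        data.append((x, y))
    return data

rng = random.Random(2024)

# Check the generator against the theoretical mean delta * sqrt(2 / pi).
errors = [draw_skew_normal(rng, shape=-1.5) for _ in range(20000)]
sample_mean = sum(errors) / len(errors)
theoretical_mean = (-1.5 / math.sqrt(1.0 + 1.5 ** 2)) * math.sqrt(2.0 / math.pi)
print(round(sample_mean, 3), round(theoretical_mean, 3))

# Two illustrative classes that could then be passed to a mixture regression routine.
two_class_data = (simulate_class(rng, 300, 16.0, 0.1, 2.0, -1.5)
                  + simulate_class(rng, 300, 42.0, 0.2, 3.5, -1.5))
```

Fitting normal-error mixture regressions with increasing numbers of classes to data generated this way, and comparing the resulting indices, is one way to see the overextraction tendency described above.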

Both the simulation results and the real data analysis demonstrated that the SNMRM is more accurate in selecting the true number of latent classes under conditions with normal or mildly skewed errors. Unfortunately, the SNMRM fails to achieve satisfactory results in the presence of severely skewed errors, although it still outperforms the MRM in class determination. Once the number of latent classes is correctly identified, the SNMRM estimates the slopes of the identified classes with more accuracy and certainty than the MRM, whereas the MRM provides less biased estimates of the intercepts and residual variances than the SNMRM across the simulated skewed conditions. As such, given the correct number of latent classes, if researchers’ focal interest is the relation between the covariate and the dependent variable, that is, the regression coefficient, the SNMRM is a better option than the MRM because it can provide an unbiased estimate of the slope with better certainty across samples. But if researchers are also interested in the intercept or the residual variance, the MRM is preferred because it provides less biased estimates of these parameters.

Although the SNMRM appears to be a flexible tool for accommodating departures from normality, this conclusion has been drawn from the simulated scenarios included in this work. Generalizability to other conditions may be problematic. Additional research is necessary to understand the SNMRM further. To this end, the effects of additional factors may be considered, such as the mixture proportions, the magnitude of the differential effects of the covariate, and the residual variance.

Another direction for future work concerns outliers: the robustness of the SNMRM might still be insufficient when data involve strongly heavy-tailed observations. Such weakness might be tackled by adopting a broader family of component densities, such as mixtures of skew t distributions (Lin, 2010; Lin, Lee, & Hsieh, 2007) and mixtures of skew Student’s t-normal distributions (Lin, Ho, & Lee, 2013). Overall, the emergence of an SNMRM that is more robust to nonnormal errors is good news for practitioners who aim to investigate the differential effects of intervention programs on participants.

Declaration of Conflicting Interests

The author(s) declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.

Funding

The author(s) declared receipt of the following financial support for the research, authorship, and/or publication of this article: This research was supported by the National Science Council of Taiwan (Grant no. NSC101-2118-M-005-006-MY2).

References

Aitkin, Anderson, & Hinde (1981). Statistical modelling of data on teaching styles. Journal of the Royal Statistical Society, 144, 419-461.

Aitkin, & Wilson, G. T. (1980). Mixture models, outliers, and the EM algorithm. Technometrics, 22, 325-331. doi:10.2307/1268316

Akaike (1987). Factor analysis and AIC. Psychometrika, 52, 317-332.

Arellano-Valle, R. B., Castro, L. M., Genton, M. G., & Gómez, H. M. (2008). Bayesian inference for shape mixtures of skewed distributions with application to regression analysis. Bayesian Analysis, 3, 513-540.

Azzalini (1985). A class of distributions which includes the normal ones. Scandinavian Journal of Statistics, 12, 171-178.

Azzalini (2013). R package “sn”: The skew-normal and skew-t distributions (Version 0.4-18). Retrieved from http://azzalini.stat.unipd.it/SN

Bauer, D. J., & Curran, P. J. (2003). Distributional assumptions of growth mixture models: Implications for the overextraction of latent trajectory classes. Psychological Methods, 8, 338-363.

Bauer, D. J., & Curran, P. J. (2004). The integration of continuous and discrete latent variable models: Potential problems and promising opportunities. Psychological Methods, 9, 3-29. doi:10.1037/1082-989X.9.1.3

Biernacki, Celeux, & Govaert (2000). Assessing a mixture model for clustering with the integrated completed likelihood. IEEE Transactions on Pattern Analysis and Machine Intelligence, 22, 719-725.

Bozdogan (1987). Model selection and Akaike’s information criterion (AIC): The general theory and its analytical extensions. Psychometrika, 52, 345-370.

Cancho, V. G., Dey, D. K., Lachos, V. H., & Andrade (2010). Bayesian nonlinear regression models with scale mixtures of skew normal distributions: Estimation and case influence diagnostics. Computational Statistics & Data Analysis, 55, 588-602.

Cancho, V. G., Lachos, V. H., & Ortega, E. M. M. (2010). A nonlinear regression model with skew-normal errors. Statistical Papers, 51, 547-558.

Celeux, & Soromenho (1996). An entropy criterion for assessing the number of clusters in a mixture model. Journal of Classification, 13, 195-212.

Chen (2003). A skew regression model for inference of stock volatility. In Zhang, H. P., Zhang, Y. T., & Huang, J. C. (Eds.), Development of modern statistics and related topics (pp. 129-139). Singapore: World Scientific. doi:10.1142/9789812796707_0011

Cosslett, S. R., & Lee (1985). Serial correlation in discrete variable models. Journal of Econometrics, 27, 79-97.

Dempster, A. P., Laird, N. M., & Rubin, D. B. (1977). Maximum likelihood with incomplete data via the E-M algorithm. Journal of the Royal Statistical Society, 39, 1-38.

DeSarbo, W. S., & Cron, W. L. (1988). A maximum likelihood methodology for clusterwise linear regression. Journal of Classification, 5, 249-282.

DeSarbo, W. S., Wedel, Vriens, & Ramaswamy (1992). Latent class metric conjoint analysis. Marketing Letters, 3, 273-288.

Ding, C. S. (2006, December). Using regression mixture analysis in educational research. Practical Assessment Research & Evaluation, 11(11). Retrieved from http://pareonline.net/getvn.asp?v=11&n=11

Draper (1995). Assessment and propagation of model uncertainty. Journal of the Royal Statistical Society, 57, 45-97.

Engel, C. M., & Hamilton, J. D. (1990). Long swings in the dollar: Are they in the data and do markets know it? American Economic Review, 80, 689-713.

George, Yang, Van Horn, M. L., Smith, Jaki, Feaster, . . . Howe (2012). Using regression mixture models with non-normal data: Examining an ordered polytomous approach. Journal of Statistical Computation and Simulation, 1, 1-14. doi:10.1080/00949655.2011.636363

Ghosh, Branco, M. D., & Chakraborty (2007). Bivariate random effect model using skew-normal distribution with application to HIV-RNA. Statistics in Medicine, 26, 1255-1267.

Goldfeld, S. M., & Quandt, R. E. (1973). The estimation of structural shifts by switching regressions. Annals of Economic and Social Measurement, 2, 475-485.

Grün, & Leisch (2007). Fitting finite mixtures of generalized linear regressions in R. Computational Statistics & Data Analysis, 51, 5247-5252.

Hamilton, J. D. (1989). A new approach to the economic analysis of nonstationary time series and the business cycle. Econometrica, 57, 357-384.

Hubert, & Arabie (1985). Comparing partitions. Journal of Classification, 2, 193-218.

Kamakura, W. A., & Russell, G. J. A. (1989). Choice model for market segmentation and elasticity structuring. Journal of Marketing Research, 26, 379-390.

Kliegel, Zimprich, & Eschen (2005). What do subjective cognitive complaints in persons with aging-associated cognitive decline reflect? International Psychogeriatrics, 17, 499-512.

Lin, T. I. (2010). Robust mixture modeling using multivariate skew t distributions. Statistics and Computing, 20, 343-356.

Lin, T. I., Ho, H. J., & Lee, C. R. (2013). Flexible mixture modelling using the multivariate skew-t-normal distribution. Statistics and Computing, 2, 1-16. doi:10.1007/s11222-013-9386-4

Lin, T. I., Lee, J. C., & Hsieh, W. J. (2007). Robust mixture modeling using the skew t distribution. Statistics and Computing, 17, 81-92.

Lin, T. I., Lee, J. C., & Yen, S. Y. (2007). Finite mixture modelling using the skew normal distribution. Statistica Sinica, 17, 909-927.

Liu, Hancock, G. R., & Harring, J. R. (2011). Using finite mixture modeling to deal with systematic measurement error: A case study. Journal of Modern Applied Statistical Methods, 10, 249-261.

Lo, Brinkman, R. R., & Gottardo (2008). Automated gating of flow cytometry data via robust model-based clustering. Cytometry, 73, 321-332. doi:10.1002/cyto.a.20531

Ma, Y. Y., Genton, M. G., & Davidian (2004). Linear mixed effects models with flexible generalized skew-elliptical random effects. In Genton, M. G. (Ed.), Skew-elliptical distributions and their applications: A journey beyond normality (pp. 339-358). Boca Raton, FL: Chapman & Hall/CRC Press.

Maclean, C. J., Morton, N. E., Elston, R. C., & Yee (1976). Skewness in commingled distributions. Biometrics, 32, 695-699.

McClelland, G. H., & Judd, C. M. (1993). Statistical difficulties of detecting interactions and moderator effects. Psychological Bulletin, 114, 376-390.

McLachlan, G. J., & Peel (2000). Finite mixture models. New York, NY: Wiley.

Muthén, L. K., & Muthén, B. O. (2012). Mplus user’s guide (6th ed.). Los Angeles, CA: Muthén & Muthén.

Naik, Shi, & Tsai, C. L. (2007). Extending the Akaike information criterion to mixture regression models. Journal of the American Statistical Association, 102, 244-254.

Quandt, R. E. (1972). A new approach to estimating switching regressions. Journal of the American Statistical Association, 67, 306-310.

Quandt, R. E., & Ramsey, J. B. (1978). Estimating mixtures of normal distributions and switching regressions. Journal of the American Statistical Association, 73, 730-738.

R Development Core Team. (2008). R: A language and environment for statistical computing. Vienna, Austria: R Foundation for Statistical Computing.

Ramaswamy, DeSarbo, W. S., Reibstein, D. J., & Robinson, W. T. (1993). An empirical pooling approach for estimating marketing mix elasticities with PIMS data. Marketing Science, 12, 103-124.

Sahu, Dey, D. K., & Branco, M. D. (2003). A new class of multivariate skew distributions with applications to Bayesian regression models. Canadian Journal of Statistics, 31, 129-150.

Schmiege, S. J., Levin, M. E., & Bryan, A. D. (2009). Regression mixture models of alcohol use and risky sexual behavior among criminally-involved adolescents. Prevention Science, 10, 335-344.

Schwarz (1978). Estimating the dimension of a model. Annals of Statistics, 6, 461-464.

Sclove, L. S. (1987). Application of model-selection criteria to some problems in multivariate analysis. Psychometrika, 52, 333-343.

Späth (1979). Algorithm 39: Clusterwise linear regression. Computing, 22, 367-373.

Tofighi, & Enders, C. K. (2008). Identifying the correct number of classes in a growth mixture model. In Hancock, G. R., & Samuelsen, K. M. (Eds.), Advances in latent variable mixture models (pp. 317-341). Charlotte, NC: Information Age.

Turner, T. R. (2000). Estimating the propagation rate of a viral infection of potato plants via mixtures of regressions. Applied Statistics, 49, 371-384.

U.S. Department of Health and Human Services, Administration for Children, Youth and Families. (2003). National Survey of Child and Adolescent Well-Being (NSCAW), Restricted Release, Waves 1-4. Washington, DC: Author.

Van Horn, M. L., Bellis, J. M., & Snyder, S. W. (2001). Family Resource Scale-Revised: Psychometrics and validation of a measure of family resources in a sample of low-income families. Journal of Psychoeducational Assessment, 19, 54-68.

Van Horn, M. L., Smith, Fagan, A. A., Jaki, Feaster, D. J., Masyn, Hawkins, J. D., & Howe (2012). Not quite normal: Consequences of violating the assumption of normality with regression mixture models. Structural Equation Modeling, 19, 227-249.

Vermunt, J. K., & Magidson (2008). LG-syntax user’s guide: Manual for Latent GOLD 4.5 syntax module. Belmont, MA: Statistical Innovations.

Wedel, & DeSarbo, W. S. (1994). A review of recent developments in latent class regression models. In Bagozzi, R. P. (Ed.), Advanced methods of marketing research (pp. 352-388). Cambridge, England: Blackwell.

Wedel, & DeSarbo, W. S. (1995). A mixture likelihood approach for generalized linear models. Journal of Classification, 12, 21-55.

Wedel, DeSarbo, W. S., Bult, J. R., & Ramaswamy (1993). A latent class Poisson regression model for heterogeneous count data with an application to direct mail. Journal of Applied Econometrics, 8, 397-411.

Wong, Y. J., & Maffini, C. S. (2011). Predictors of Asian American adolescents’ suicide attempts: A latent class regression analysis. Journal of Youth and Adolescence, 40, 1453-1464. doi:10.1007/s10964-011-9701-3

Xie, F. C., Wei, B. C., & Lin, J. G. (2009). Homogeneity diagnostics for skew-normal nonlinear regression models. Statistics & Probability Letters, 79, 821-827.

Zhu, H.-T., & Zhang (2004). Hypothesis testing in mixture regression models. Journal of the Royal Statistical Society, 66, 3-16.