Joint Sequence Analysis

Abstract

In its standard formulation, sequence analysis aims at finding typical patterns in a set of life courses represented as sequences. Recently, some proposals have been introduced to jointly analyze sequences defined on different domains (e.g., work career, partnership, and parental histories). We introduce measures to evaluate whether a set of domains is interrelated and whether their joint analysis is justified. We also discuss the quality of the results obtained using joint sequence analysis. In particular, we focus on cluster analysis and propose criteria to assess whether clusters obtained using a joint approach satisfactorily describe each domain.

Keywords

sequence analysis, multiple sequence analysis, dissimilarity, optimal matching, cluster analysis, Cronbach’s α, principal components analysis

Introduction and Motivation

Sequence analysis (SA) is now an established technique to describe life course trajectories (see Aisenbrey and Fasang 2010; Elzinga 2003, 2007, for a discussion). The aim of SA is to describe life courses represented as sequences, that is, the ordered collections of the states (activities) experienced by individuals during a period of time. A major issue in SA concerns the identification of the most typical patterns in data, and the identification of individuals who experienced similar trajectories. Therefore, the proper measurement of the dissimilarity between two trajectories plays a crucial role in SA. Actually, optimal matching analysis (OMA), introduced in the social sciences by Abbott and Forrest (1986) to calculate and analyze pair-wise dissimilarities between sequences “has become the standard technique so much that sequence analysis and OMA-like techniques are commonly regarded as being almost synonymous” (Elzinga 2003).

In its “standard” formulation, SA focuses on life courses defined on a single domain. Examples concern the analysis of early careers, work careers, and retirement patterns (Abbott and Hrychak 1990; Blair-Loy 1999; Chan 1994, 1995; Han and Moen 1999a; Malo and Munoz-Bùllon 2003; McVicar and Anyadike-Danes 2001; Scherer 2001; Schoon et al. 2001; Stovel, Savage, and Bearman 1996), of social mobility (Halpin and Chan 1998), of housing and residential mobility (Clark, Deurloo, and Dieleman 2003; Stovel and Bolan 2004), and of family formation (Aassve, Billari, and Piccarreta 2007; Piccarreta and Billari 2007).

Nonetheless, social scientists are often interested in studying several domains that are supposedly related to one another. Relevant examples are the joint analysis of work, family, and housing trajectories, or of the work or family formation histories of parents and their children.

Some proposals in this direction are described and discussed in the second section. Han and Moen (1999b) and Widmer and Ritschard (2009) propose first simplifying the domains using separate cluster analyses and then studying the relation between the obtained results. Other approaches instead focus on the definition of joint dissimilarities combining information on the domains. Aassve et al. (2007) and Piccarreta and Billari (2007) suggest building a trajectory describing the combination of states experienced in each period. Pollock (2007) and Gauthier et al. (2010) introduce multiple sequence analysis (MSA), a technique to compute optimal matching (OM) distances based on sequences defined on the different domains.

A joint analysis can serve both to explore the relations among domains and to exploit such relations to obtain joint rather than marginal results. The abovementioned proposals mostly focus on the latter aspect. However, a joint analysis of domains is worthwhile, reasonable, and effective only when the domains are associated.

For the case of two domains, Piccarreta and Lior (2010) propose a graphical approach to visualize their relation, and Piccarreta and Elzinga (2013) introduce criteria to quantify the strength of the possible association.

In this article, we focus on the relations among several domains and extend the work of Piccarreta and Elzinga (2013) following two main directions.

First, in the third section we consider the association between domains. Concepts very popular in multivariate data analysis, namely, Cronbach’s α and principal components analysis (PCA), are extended to our context to explore the relations among the domains prior to the possible application of a joint analysis. We then consider the relation between joint and domain-specific dissimilarities, to evaluate if and to what extent the former can be used without losing relevant information on one or more domains.

Second, we consider the problem of assessing the performance of joint dissimilarities when employed in the dissimilarity-based procedures usually applied in SA, such as cluster analysis (CA), multidimensional scaling, analysis of variance (ANOVA), or regression trees. While comparing the results obtained using joint and domain-specific dissimilarities is surely a relevant issue, suitable criteria for this purpose necessarily differ from one technique to another.

For the sake of brevity, in the fourth section, we limit attention to CA, one of the most popular techniques in SA (actually, all the abovementioned contributions on the joint analysis of domains refer to CA). Also, CA unveils the most relevant patterns in data, and it can be used in our context to visualize the most relevant combinations of the domains’ trajectories. Finally, as will be discussed later, some conclusions on the domains’ interrelations can also be drawn based on the results of CA.

Our proposals are analyzed referring first to synthetic data, and then, in the fifth section, to data arising from the British Household Panel Survey, already used in Pollock (2007).

“Simple” and Joint Sequence Analysis (JSA): A Brief Overview

SA usually focuses on the description of life courses defined on a single life domain (e.g., education, family formation, work career). The observed individuals’ trajectories are represented by sequences, that is, ordered collections of the states experienced during a given period. For a given domain D, the set of the possible states that can be observed in each period, Ω_D, is called the alphabet. Thus, sequences are categorical time series whose elements are included in Ω_D.

The application of SA to the analysis of life courses traces back to the work of Andrew Abbott and his coauthors (Abbott 1990a, 1990b, 1995; Abbott and Forrest 1986; Abbott and Hrychak 1990). In their seminal paper, Abbott and Forrest (1986) extended to sociology the work of Sankoff and Kruskal (1983), who used OM (originated in the field of information theory and computer science; Hamming 1950; Levenshtein 1966) to study DNA sequences.

OM¹ is an alignment technique used to measure the dissimilarity between two sequences taking into account the duration and possible simultaneity of events. Such dissimilarity is based on the operations needed to transform one sequence into another. Three operations are considered: insertion (of a state into the sequence), deletion (of a state from the sequence), and substitution (of one state with another). A cost is assigned to each operation, ideally reflecting the difficulty of modifying a sequence according to the operation itself. If K operations, o₁, …, o_K, are needed to transform the ith sequence into the hth one (or conversely), the total transformation cost is determined as the sum of the single operation costs, $c_{i,h} = \sum_{k} c(o_{k})$. Since several different sets of operations can transform one sequence into another, the OM dissimilarity is defined as the minimum transformation cost, which makes it unique and guarantees that it is symmetric.
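The minimum transformation cost described above can be computed by dynamic programming. The following is a minimal sketch, not the implementation used for the analyses in this article: the function names, the constant substitution cost of 2, and the insertion/deletion cost of 1 are illustrative assumptions.

```python
from typing import Callable, Hashable, Sequence


def om_distance(
    seq_a: Sequence[Hashable],
    seq_b: Sequence[Hashable],
    sub_cost: Callable[[Hashable, Hashable], float],
    indel: float = 1.0,
) -> float:
    """Minimum total cost of transforming seq_a into seq_b."""
    # prev[j] = cheapest cost of turning the processed prefix of seq_a into seq_b[:j]
    prev = [j * indel for j in range(len(seq_b) + 1)]
    for i, state_a in enumerate(seq_a, start=1):
        curr = [i * indel]  # turning seq_a[:i] into the empty sequence: i deletions
        for j, state_b in enumerate(seq_b, start=1):
            curr.append(min(
                prev[j] + indel,                            # delete state_a
                curr[j - 1] + indel,                        # insert state_b
                prev[j - 1] + sub_cost(state_a, state_b),   # substitute (free if equal)
            ))
        prev = curr
    return prev[-1]


def constant_cost(x: Hashable, y: Hashable) -> float:
    """Hypothetical substitution cost: 0 for identical states, 2 otherwise."""
    return 0.0 if x == y else 2.0


# Toy work careers (E = employed, U = unemployed), one state per period.
print(om_distance("EEUU", "EUUU", constant_cost))
```

With these costs a substitution is never cheaper than a deletion followed by an insertion, which is one common convention; other cost choices simply change the arguments passed to the function.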

A heated debate in the literature concerns the choice of costs (see, e.g., Elzinga 2003; Halpin 2003), and a number of proposals have been introduced partially modifying the original algorithm (see among others, Gauthier et al. 2009; Halpin 2010; Hollister 2009; Lesnard 2010; MacIndoe and Abbott 2004). Also, measures of dissimilarity alternative to OM have been introduced (Elzinga 2003, 2007). Each proposal has its pros and cons, but standard OM surely remains the approach most used in social sciences (Aisenbrey and Fasang 2010).

In its “standard” formulation, SA focuses on life courses defined on a single domain. In many applications, limiting attention to a single domain might be reductive, and it can be of interest to study the interplay between several domains (i.e., trajectories describing different careers—e.g., work, family, housing), possibly related one to another.

A first way to do so is to evaluate the association between the results obtained for each domain. This approach has usually been adopted when only two domains are taken into account. In Han and Moen (1999b), CA is applied to the two domains separately, and the association between the two obtained partitions is analyzed. Of course, this is reasonable only when the domain-specific clusters are highly homogeneous. Also, problems can arise when more than two domains are considered² (multiple contingency tables should be analyzed, possibly with multiple correspondence analysis). More importantly, no attempt is made to combine the domains’ characteristics.

In general terms, with JSA of two or more domains, we refer to any approach leading to the definition of joint dissimilarities based on the information arising from all the domains taken into account.

Along this line, some authors (see, among others, Aassve et al. 2007; Lesnard 2008; Piccarreta and Billari 2007) suggest building a combined domain, describing the combination of states experienced in each period. Even if in the context of life course analysis the number of domains analyzed jointly will generally be limited, the combined alphabet can quickly become large, and the combined sequences can easily turn out to be noisy and unstable. Therefore, this approach is reasonable and convenient only when few, connected domains are considered, so that the number of states of the combined alphabet is not too high.

Another possibility is to combine (by averaging or summation) the domain-specific dissimilarities. This clearly resembles what is done for numerical vectors: For example, the squared Euclidean distance between two vectors of measurements is the sum of the squared component-by-component distances. This makes it easy to combine domains without losing information on their complexity, as measured by the domain-specific dissimilarities. Nonetheless, this approach disregards the cost needed to simultaneously align the set of sequences (one for each domain) characterizing one individual with the set of sequences characterizing another.

Such a simultaneous alignment is instead obtained by combining the costs defined for the specific domains. This idea (Blair-Loy 1999; Stovel et al. 1996), formalized and systematized by Pollock (2007) and Gauthier et al. (2010), consists in calculating the dissimilarity between two cases by averaging the substitution (and insertion and deletion) costs needed to align the sequences in each domain. This approach, named MSA (Pollock 2007), nicely extends the rationale underlying OM to the case of multiple domains. Also, it preserves the information on each domain, as measured by the specific transformation costs. Nonetheless, it is important to observe that MSA, unlike the approaches described earlier, can be applied only when the dissimilarity measure is based on substitution costs (e.g., when the OM distance is considered).
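One way to read the averaging of costs described above is as a single OM alignment of the combined states, in which each operation is charged the mean of the corresponding domain-specific costs. The sketch below follows that reading; it is an illustration under stated assumptions (the function name `msa_distance`, the toy alphabets, and the cost values are hypothetical), not the exact procedure of the cited authors.

```python
from typing import Callable, Hashable, List, Sequence

CostFn = Callable[[Hashable, Hashable], float]


def msa_distance(
    case_a: Sequence[Sequence[Hashable]],
    case_b: Sequence[Sequence[Hashable]],
    sub_costs: List[CostFn],
    indels: List[float],
) -> float:
    """Joint alignment cost for two cases, each given as one sequence per domain.

    Assumes that, within a case, all domain sequences cover the same periods.
    """
    n_domains = len(sub_costs)
    joint_a = list(zip(*case_a))  # one tuple of domain states per period
    joint_b = list(zip(*case_b))
    indel = sum(indels) / n_domains  # averaged insertion/deletion cost

    def joint_sub(x, y) -> float:
        # averaged substitution cost across domains
        return sum(c(xs, ys) for c, xs, ys in zip(sub_costs, x, y)) / n_domains

    prev = [j * indel for j in range(len(joint_b) + 1)]
    for i, state_a in enumerate(joint_a, start=1):
        curr = [i * indel]
        for j, state_b in enumerate(joint_b, start=1):
            curr.append(min(
                prev[j] + indel,
                curr[j - 1] + indel,
                prev[j - 1] + joint_sub(state_a, state_b),
            ))
        prev = curr
    return prev[-1]


def constant_cost(x: Hashable, y: Hashable) -> float:
    """Hypothetical substitution cost used in both toy domains."""
    return 0.0 if x == y else 2.0


# Domain 1: work (E/U); domain 2: partnership (S = single, M = married).
case_1 = ["EEUU", "SSMM"]
case_2 = ["EUUU", "SSMM"]
print(msa_distance(case_1, case_2, [constant_cost, constant_cost], [1.0, 1.0]))
```

Because the same alignment is imposed on all domains, the resulting value cannot be smaller than the average of the domain-specific OM dissimilarities computed separately.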

While a JSA can be reasonable and well motivated from an interpretative point of view, its implementation is justified and effective only when the considered domains are associated. Also, even in the case of association, joint dissimilarities do not necessarily describe all the domains adequately.

Furthermore, when joint dissimilarities are used in dissimilarity-based procedures (such as, e.g., CA or multidimensional scaling), it is important to assess whether or not the obtained results are satisfactory for all the domains. Clearly, for a given domain, a JSA procedure will possibly perform worse than a domain-specific procedure. Nonetheless, it is important to evaluate whether the loss is homogeneous across domains or if some domains strongly influence the results at the expense of the others. It is therefore important to assess the extent to which the joint results are actually “joint.”

These aspects are only marginally accounted for in the cited contributions on multiple domains. Actually, most of these works (e.g., Aassve et al. 2007; Han and Moen 1999b; Piccarreta and Billari 2007; Pollock 2007) focus on joint CA and on the substantive interpretation of the obtained clusters. For the case when two domains are considered, Gauthier et al. (2010) discuss the problem and claim that only associated domains should be analyzed jointly, but do not illustrate how to diagnose association in the case of several domains.

In this work, we deal with the mentioned issues. Our aim is first to introduce criteria to evaluate association among domains and second to evaluate if and to what extent a chosen JSA approach is satisfactory with respect to all the domains taken into account.

It is important to underline that our procedures are conditional on both the method chosen to measure domain-specific dissimilarities (e.g., OM) and the chosen JSA approach. For the sake of illustration, in the following, we will focus on OM and MSA (Pollock 2007), respectively. The same techniques can also be applied to evaluate other JSA methods. In this sense, our proposals might also be used to compare results obtained using different JSA approaches, but here we consider this as a secondary issue.

Evaluating the Association Among Domains

To evaluate the extent of association between several domains, we extend to our context concepts that are very well known in multivariate data analysis: Cronbach’s α (Cronbach 1951) and PCA (see, e.g., Jolliffe 2002).

To do so, it is necessary to introduce a measure of association or, better, correlation, between two domains. Piccarreta and Elzinga (2013) discuss different indices. Here, we refer to the Mantel coefficient (Mantel 1967), often used in ecology (see, e.g., Legendre and Legendre 1998; Manly 1997). Let D₁ and D₂ be the dissimilarity matrices built for the two domains (obtained using, e.g., OMA), and let d₁ and d₂ denote the vectors of their n(n − 1)/2 off-diagonal elements (each pair of cases being counted once). The Mantel coefficient is defined as the correlation between d₁ and d₂, that is, the correlation between all the possible pair-wise dissimilarities. As an alternative to the linear relation (see later for a discussion on this point), the monotone relation between dissimilarities can be measured using rank-based correlation coefficients such as the Spearman coefficient.
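As a minimal sketch of the definition above, the Mantel coefficient can be computed by vectorizing the lower triangles of two dissimilarity matrices and correlating them. The function names and the toy matrices are illustrative assumptions; the rank-based variant uses average ranks for ties.

```python
import math
from typing import List

Matrix = List[List[float]]


def lower_triangle(D: Matrix) -> List[float]:
    """The n(n - 1)/2 dissimilarities below the diagonal, each pair counted once."""
    return [D[i][j] for i in range(1, len(D)) for j in range(i)]


def pearson(x: List[float], y: List[float]) -> float:
    mx, my = sum(x) / len(x), sum(y) / len(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def average_ranks(v: List[float]) -> List[float]:
    """Ranks starting at 1; tied values share the mean of their ranks."""
    order = sorted(range(len(v)), key=lambda k: v[k])
    ranks = [0.0] * len(v)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and v[order[j + 1]] == v[order[i]]:
            j += 1
        for k in range(i, j + 1):
            ranks[order[k]] = (i + j) / 2 + 1
        i = j + 1
    return ranks


def mantel(D1: Matrix, D2: Matrix, method: str = "pearson") -> float:
    d1, d2 = lower_triangle(D1), lower_triangle(D2)
    if method == "spearman":
        d1, d2 = average_ranks(d1), average_ranks(d2)
    return pearson(d1, d2)


# Toy matrices for three cases: D_b is a monotone but nonlinear function of D_a.
D_a = [[0, 1, 2], [1, 0, 3], [2, 3, 0]]
D_b = [[0, 1, 4], [1, 0, 9], [4, 9, 0]]
print(mantel(D_a, D_b), mantel(D_a, D_b, method="spearman"))
```

In the toy example the rank-based coefficient equals 1 because the two sets of dissimilarities are ordered identically, whereas the linear coefficient is slightly below 1.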

When P domains are considered, P dissimilarity matrices, D₁, D₂, …, D_P, can be built based on the n sequences observed in each domain. To measure the coherency among the P domains, we refer to the Cronbach’s α calculated on the vectors d₁, d₂, …, d_P (or on their corresponding ranked values if the Spearman coefficient is considered). The vectors d₁, d₂, …, d_P can be regarded as “measurements” on P “variables.” The Cronbach’s α defined on these variables is based on σ_T = var(d₁ + d₂ + … + d_P), the variance of the sum of the considered variables, and σ_Σ = σ₁ + σ₂ + … + σ_P, the sum of the variances of the involved variables:

$\alpha = \frac{P}{P - 1} \left(1 - \frac{\sigma_{\Sigma}}{\sigma_{T}}\right).$

To prevent the units of measurement or the ranges of the variables from influencing the value of α, standardized variables are usually taken into account, so that σ_Σ = P (the sum of the P unit variances). In this case, σ_T equals σ_Σ plus the sum of the correlations over all ordered pairs of distinct variables (i.e., twice the sum of the pair-wise correlations). If the ds are strongly (linearly) related, all the pair-wise correlations will take high values, and σ_Σ will be much lower than σ_T. In this case, α will take high values (close to 1).³ It is also possible to evaluate the strength of the connection of one specific d_p with the others by calculating α_−p, that is, the Cronbach’s α obtained without considering d_p. If one domain is weakly related to the others, then α_−p will be higher than α, indicating that the level of interconnection between the involved domains increases when the pth is not taken into account.
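For standardized variables, α depends only on the matrix of pair-wise (Mantel) correlations, since σ_Σ = P and σ_T is the sum of all the entries of that matrix. The following sketch computes α and the leave-one-out values α_−p from such a matrix; the function names and the toy correlation matrix are assumptions made for illustration.

```python
from typing import List

Matrix = List[List[float]]


def cronbach_alpha(R: Matrix) -> float:
    """Cronbach's alpha for standardized variables with correlation matrix R."""
    P = len(R)
    sigma_sum = float(P)                         # sum of the P unit variances
    sigma_total = sum(sum(row) for row in R)     # variance of the sum: P + 2 * sum_{p<q} r_pq
    return P / (P - 1) * (1 - sigma_sum / sigma_total)


def alpha_without(R: Matrix, p: int) -> float:
    """Alpha recomputed after dropping domain p (0-based index)."""
    keep = [k for k in range(len(R)) if k != p]
    return cronbach_alpha([[R[i][j] for j in keep] for i in keep])


# Toy example: domains 0 and 1 are strongly related, domain 2 is weakly related to both.
R_toy = [
    [1.0, 0.8, 0.1],
    [0.8, 1.0, 0.1],
    [0.1, 0.1, 1.0],
]
print(cronbach_alpha(R_toy))
print([alpha_without(R_toy, p) for p in range(3)])
```

In the toy matrix, dropping the weakly related third domain raises α, while dropping either of the first two lowers it, which is the pattern described in the text.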

To analyze the relations among the ds in more detail, a PCA can be applied to the matrix of the Mantel’s correlations. If only the first principal component (PC) is relevant (i.e., characterized by a very large eigenvalue, explaining a very large proportion of the total variance), then the ds are jointly (linearly) related. Instead, if some higher-order PCs are relevant too (having relatively high eigenvalues⁴), then PCA can be used to describe the “structure” of the relations among the ds, identifying possible splits of the domains into blocks, or the presence of isolated domains. This can be done by referring to the loadings (correlations) of the ds with the PCs.
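The eigenvalues and loadings referred to above can be obtained from any eigen-decomposition of the correlation matrix. The sketch below uses power iteration with deflation only to stay dependency-free; this numerical choice, the function names, and the two-block toy matrix are assumptions for illustration, and a standard linear algebra routine would normally be preferred.

```python
import math
from typing import List, Tuple

Matrix = List[List[float]]


def _matvec(A: Matrix, v: List[float]) -> List[float]:
    return [sum(a * x for a, x in zip(row, v)) for row in A]


def dominant_eigenpair(A: Matrix, iters: int = 5000, tol: float = 1e-13) -> Tuple[float, List[float]]:
    """Largest eigenvalue and unit eigenvector of a symmetric PSD matrix (power iteration)."""
    v = [1.0 + 0.1 * k for k in range(len(A))]  # arbitrary non-degenerate start
    lam = 0.0
    for _ in range(iters):
        w = _matvec(A, v)
        norm = math.sqrt(sum(x * x for x in w))
        if norm == 0.0:
            return 0.0, v
        v = [x / norm for x in w]
        new_lam = sum(x * y for x, y in zip(v, _matvec(A, v)))  # Rayleigh quotient
        done = abs(new_lam - lam) < tol
        lam = new_lam
        if done:
            break
    return lam, v


def principal_components(R: Matrix, k: int = 2) -> List[Tuple[float, List[float]]]:
    """First k (eigenvalue, loadings) pairs; loadings = sqrt(eigenvalue) * eigenvector."""
    A = [list(row) for row in R]
    n = len(A)
    components = []
    for _ in range(k):
        lam, v = dominant_eigenpair(A)
        components.append((lam, [math.sqrt(max(lam, 0.0)) * x for x in v]))
        # deflation: remove the component just extracted
        A = [[A[i][j] - lam * v[i] * v[j] for j in range(n)] for i in range(n)]
    return components


# Toy correlation matrix with two unrelated blocks: domains (0, 1) and (2, 3, 4).
r = 0.8
R_blocks = [
    [1, r, 0, 0, 0],
    [r, 1, 0, 0, 0],
    [0, 0, 1, r, r],
    [0, 0, r, 1, r],
    [0, 0, r, r, 1],
]
for eigenvalue, loadings in principal_components(R_blocks):
    print(round(eigenvalue, 2), [round(x, 2) for x in loadings])
```

In this toy case the first PC loads only on the larger block and the second only on the smaller one, so the block structure can be read directly from the loadings. The sign of each loading vector is arbitrary.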

An alternative measure of association (named JSA association) can be defined by considering the relations between joint and domain-specific dissimilarities. Let D_JSA indicate the joint dissimilarity matrix (obtained by combining information on the domains using one JSA approach). The corresponding vector of joint dissimilarities, d_JSA, can be regarded as a summary of the domain-specific dissimilarities, d₁, …, d_P. If the domains are associated, d_JSA should satisfactorily summarize all the ds. More precisely, the relations between d_JSA and the d_p’s will all be strong if the domains are associated and if this association can be efficiently exploited to summarize domain-specific information using JSA.

To evaluate JSA association, that is, the adequacy of the (chosen) JSA with respect to the domains, we refer to the Mantel’s coefficients between d_JSA and d₁, …, d_P. The sum of these squared correlations, $SSC_{JSA} = \sum_{p} \mathrm{Corr}^{2}(d_{p}, d_{JSA})$, is the sum of the proportions of the domain-based dissimilarities explained by d_JSA and can be used to evaluate the goodness of fit of the latter. In particular, the mean squared correlation, $\overline{SSC}_{JSA} = SSC_{JSA} / P$, ranges between 0 and 1 (each squared correlation being 1 in the case of perfect linear association), and measures the total proportion of explained domain-specific dissimilarities.⁵
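As a small worked illustration of the two summaries just defined, the sketch below computes the sum and the mean of the squared correlations from a list of Mantel coefficients. The function name is an assumption; the input values are the rounded correlations reported for simulations 1 and 4 in panel A of Table 3, so the results agree with the tabulated SSC values only up to rounding.

```python
from typing import List, Tuple


def squared_correlation_summary(correlations: List[float]) -> Tuple[float, float]:
    """Return (SSC, mean SSC) for the correlations between each d_p and d_JSA."""
    ssc = sum(r * r for r in correlations)
    return ssc, ssc / len(correlations)


# Rounded Corr(d_p, d_MSA) values, p = 1, ..., 5, as listed in Table 3 (panel A).
simulation_1 = [0.840, 0.844, 0.928, 0.862, 0.904]
simulation_4 = [0.837, 0.856, 0.905, 0.836, 0.205]

for name, corr in [("simulation 1", simulation_1), ("simulation 4", simulation_4)]:
    ssc, mean_ssc = squared_correlation_summary(corr)
    print(name, round(ssc, 2), round(mean_ssc, 2))
```

The single low correlation of the fifth domain in simulation 4 lowers the mean only moderately, which is why the individual correlations should be inspected together with the summary.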

Insights on the relations between one domain, say the pth domain, and the others also arise from the analysis of the correlation between d_p and d_JSA−p, the vector of the joint dissimilarities obtained without considering the pth domain. A high correlation indicates that d_JSA−p explains d_p, that the pair-wise dissimilarities in the pth domain can be recovered based on information on the other domains, and, consequently, that the pth domain is related to the others.

Before proceeding, it is worth discussing in more detail the type of association between domains that is studied using measures of linear (or monotone) association between their dissimilarities. A positive (Mantel’s) coefficient between two sets of dissimilarities is observed when (a) cases similar in one domain are also similar in the other one, (b) cases dissimilar in one domain are also dissimilar in the other one, and (c) the dissimilarities in the two domains increase together.

A negative correlation indicates instead that cases similar in one domain tend to be dissimilar in the other one, and vice versa. Thus, as the dissimilarity between two cases in one domain increases, that in the other domain decreases accordingly (in the most extreme situation, two cases will be close in one domain if and only if they are far apart in the other). This is rarely observed in applications where domains are considered to be at least theoretically related.

Focusing on positive correlations, condition (a) must clearly hold to make a joint analysis meaningful. Instead, in some situations, a one-to-many association can exist between domains, so that condition (b) does not hold. For example, people have children irrespective of their marital or occupational status. Also, condition (c) does not refer to the association between domains but, rather, to the relation between the specific domains’ states and dissimilarities. Imagine, for example, an oversimplified society where employed people live in rented houses, self-employed own their houses with a mortgage, and retired own their houses. Employed and self-employed will be the “closest” states in the employment domain, but they will be dissimilar in the housing domain, where owners will be considered as more similar instead. Even if there is clearly an association between these two simplified domains, the corresponding sets of dissimilarities will not be (strongly) linearly related due to the effect of the (chosen) substitution costs (we will refer to the situation when the chosen substitution costs partially mask the association between domains as measured by the correlation between dissimilarities as costs misalignment).

To analyze the behavior of the proposed measures, we simulated six sets of data on five domains characterized by different types and levels of association (Figure 1). In each simulation, five groups of sequences are built for each domain. Individual sequences in each group are characterized by the same combination of experienced states but by slightly different durations (realizations of multinomial random variables).
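The generation of sequences sharing the same ordered states but differing in durations can be sketched as follows. This is only an illustration of the idea: the state labels, the multinomial probabilities, the sequence length, and the function names are hypothetical, and the actual settings used for Figure 1 are not reproduced here.

```python
import random
from typing import List


def simulate_sequence(states: List[str], probs: List[float], length: int,
                      rng: random.Random) -> List[str]:
    """One sequence visiting `states` in order, with multinomial durations.

    The durations are the category counts of `length` independent draws,
    i.e., one realization of a multinomial(length, probs) variable.
    A duration of zero means that the state is skipped.
    """
    draws = rng.choices(range(len(states)), weights=probs, k=length)
    durations = [draws.count(k) for k in range(len(states))]
    sequence: List[str] = []
    for state, duration in zip(states, durations):
        sequence.extend([state] * duration)
    return sequence


def simulate_group(states: List[str], probs: List[float], length: int,
                   size: int, seed: int = 0) -> List[List[str]]:
    """A group of sequences with the same state order and slightly different durations."""
    rng = random.Random(seed)
    return [simulate_sequence(states, probs, length, rng) for _ in range(size)]


# Hypothetical group: three states experienced in the same order over 20 periods.
group = simulate_group(["A", "B", "C"], [0.3, 0.4, 0.3], length=20, size=5)
for seq in group:
    print("".join(seq))
```

Repeating the call with different state orders gives different groups within a domain; matching the group membership of cases across domains would then produce the one-to-one correspondence of simulation 1.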

Figure 1.

Randomly generated domains with different types and levels of association. Simulation 1: perfect one-to-one matches (PM); Simulations 2 to 4: PM with perturbed fifth domain (30, 50, and 80 percent of sequences randomly permuted). Simulation 5: PM with one-to-many matches for the fifth domain; Simulation 6: Independent blocks (domains 1, 2 and domains 3, 4, and 5) with internally matching domains. Each row reports the five domains for the corresponding simulation. Cases are placed on the horizontal axis, and to each case a vertical bar is associated describing the states experienced during the considered period of time (on the vertical axis), distinguished using different colors. For each simulation, the ordering of cases along the horizontal axis is the same for all the domains.

In simulation 1 (Figure 1, first row), data were generated so that there is a perfect match (PM) between sequences in the different domains: Cases similar in one domain are also similar in the others (thus, condition (a) is satisfied). No adjustments were made to ensure that condition (c) holds: Thus, there is not necessarily a perfect linear relation between the domains’ dissimilarities. Even so, since condition (a) holds, we expect our criteria to detect a relevant association (thus, condition (a), when met, should ideally more than compensate for the lack of a monotone relation between dissimilarities).

In the next simulations, the domains in simulation 1 are modified to build scenarios with different types of association. In simulations 2 to 4 (Figure 1, rows 2–4), a proportion π (30, 50, and 80 percent, respectively) of the sequences in the fifth domain was randomly permuted, thus disrupting the perfect matching of this domain with the others. In simulation 5 (Figure 1, row 5), the fifth domain was simplified, and only two types of sequences were generated, presenting therefore one-to-many matches with sequences in the other domains (thus, condition (b) is not satisfied). Finally, in simulation 6 (Figure 1, row 6), the domains built in simulation 1 were split into two blocks. Domains 1 and 2 (first block) were left unchanged, whereas cases in domains 3 to 5 (second block) were jointly randomly permuted. Hence, the domains in the two blocks remain internally related (one-to-one matches), but the two blocks are not related (sequences in the first block are randomly associated with sequences in the second block). Note that this is the case of lowest association considered here, since the maximal number of connected domains is three (second block).

For each simulation, the domain-specific OM dissimilarity matrices were built using insertion and deletion costs equal to 1 and substitution costs inversely related to the frequency of transition (see Rohwer and Pötter 2004). The global Cronbach’s α and the α_−p’s (obtained without considering the pth domain) are reported in Table 1. Table 2 displays the eigenvalues of the Mantel’s correlation matrix and the loadings of the first and the second PCs. Panel A in Table 3 contains the correlations between the domain-based dissimilarities and the joint dissimilarities calculated using MSA (Gauthier et al. 2010; Pollock 2007), d_JSA = d_MSA, together with the sum and the mean of the squared correlations (SSC_MSA and $\overline{SSC}_{MSA}$). The correlations between the d_p’s and the joint dissimilarities calculated excluding one domain at a time, d_MSA−p, are shown in panel B in Table 3.
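Substitution costs that decrease with the frequency of transition can be derived from the observed sequences. The sketch below uses one commonly adopted form, cost(a, b) = 2 − p(a→b) − p(b→a), where p(a→b) is the observed rate of moving from a to b between consecutive periods. That this exact formula was used here is an assumption; the text only states that costs are inversely related to transition frequencies. The function names and toy sequences are likewise hypothetical.

```python
from collections import Counter
from typing import Dict, Iterable, Sequence, Tuple


def transition_rates(sequences: Iterable[Sequence[str]]) -> Dict[Tuple[str, str], float]:
    """Observed rate of moving from state a to state b between consecutive periods."""
    counts: Counter = Counter()
    origin_totals: Counter = Counter()
    for seq in sequences:
        for a, b in zip(seq, seq[1:]):
            counts[(a, b)] += 1
            origin_totals[a] += 1
    return {(a, b): c / origin_totals[a] for (a, b), c in counts.items()}


def transition_based_costs(sequences: Sequence[Sequence[str]]) -> Dict[Tuple[str, str], float]:
    """Assumed cost form: 0 on the diagonal, 2 - p(a->b) - p(b->a) elsewhere."""
    rates = transition_rates(sequences)
    states = sorted({s for seq in sequences for s in seq})
    return {
        (a, b): 0.0 if a == b else 2.0 - rates.get((a, b), 0.0) - rates.get((b, a), 0.0)
        for a in states
        for b in states
    }


# Toy data: two short sequences on the alphabet {A, B}.
costs = transition_based_costs(["AAB", "ABB"])
print(costs[("A", "B")])
```

Under this form, the more often two states follow one another in the data, the cheaper it is to substitute one for the other, and the cost matrix is symmetric by construction.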

Table 1.

Simulated^a Data: Cronbach’s αs Based on Mantel’s Correlations Between Domain-based Dissimilarities.

	Simulation 1	Simulation 2	Simulation 3	Simulation 4	Simulation 5	Simulation 6
Without D₁ , α₋₁	.923	.839	.792	.711	.843	.762
Without D₂ , α₋₂	.928	.843	.796	.716	.827	.751
Without D₃ , α₋₃	.913	.828	.782	.707	.833	.632
Without D₄ , α₋₄	.924	.842	.799	.719	.826	.661
Without D₅ , α₋₅	.914	.914	.914	.914	.914	.625
All domains, α	.935	.881	.852	.805	.877	.737

^aSimulation 1 = perfect one-to-one matches (PM); Simulations 2 to 4 = PM with perturbed fifth domain (30, 50, and 80 percent of sequences randomly permuted). Simulation 5 = PM with one-to-many matches for the fifth domain; Simulation 6 = Independent blocks (domains 1, 2 and domains 3, 4, and 5) with internally matching domains.

Table 2.

Simulated^a Data: Loadings of the First Two PCs (PC₁ and PC₂) and Eigenvalues (λ₁ and λ₂) Extracted From the Mantel’s Correlation Matrix Between Domain-based Dissimilarities.

Domain	Simulation 1		Simulation 2		Simulation 3		Simulation 4		Simulation 5		Simulation 6
Domain	PC₁	PC₂	PC₁	PC₂	PC₁	PC₂	PC₁	PC₂	PC₁	PC₂	PC₁	PC₂
D₁	0.88	0.20	0.88	−0.17	0.89	−0.11	0.89	−0.01	0.86	−0.32	0.23	0.90
D₂	0.86	0.42	0.87	−0.21	0.88	−0.13	0.89	−0.03	0.89	−0.05	0.26	0.89
D₃	0.92	−0.21	0.90	−0.01	0.90	−0.02	0.90	0.00	0.88	−0.18	0.93	−0.15
D₄	0.88	−0.06	0.87	−0.14	0.87	−0.15	0.88	−0.03	0.89	0.03	0.88	−0.17
D₅	0.92	−0.32	0.58	0.81	0.39	0.92	0.07	1.00	0.56	0.82	0.94	−0.16
λ _i	3.97	0.37	3.44	0.75	3.29	0.90	3.18	1.00	3.42	0.81	2.63	1.69

Note: PC = principal component.

Table 3.

Simulated^a Data: Mantel’s Correlations Between Domain-based Dissimilarities (d_p) and MSA Dissimilarities Obtained Based on All the Domains (d_MSA) and Excluding One Domain at a Time (d_MSA−p).

A. Correlations between the d_p’s and d_MSA, sum and mean of squared correlations ($SSC$ and $\overline{SSC}$)
	Simulation 1	Simulation 2	Simulation 3	Simulation 4	Simulation 5	Simulation 6
Corr(d₁ , d_MSA)	0.840	0.835	0.833	0.837	0.844	0.518
Corr(d₂ , d_MSA)	0.844	0.844	0.851	0.856	0.863	0.544
Corr(d₃ , d_MSA)	0.928	0.915	0.913	0.905	0.890	0.792
Corr(d₄ , d_MSA)	0.862	0.844	0.833	0.836	0.878	0.749
Corr(d₅ , d_MSA)	0.904	0.587	0.437	0.205	0.562	0.790
SSC	3.838	3.302	3.136	2.992	3.336	2.377
$\overline{SSC}$ = SSC/5	0.768	0.660	0.627	0.598	0.667	0.475
B. Correlations between the d _p ’s and d_MSA−p
Corr(d₁, d_MSA−1)	0.772	0.756	0.754	0.755	0.761	0.268
Corr(d₂ , d_MSA−2)	0.825	0.808	0.810	0.808	0.814	0.311
Corr(d₃ , d_MSA−3)	0.898	0.871	0.866	0.846	0.832	0.655
Corr(d₄ , d_MSA−4)	0.815	0.778	0.758	0.756	0.833	0.587
Corr(d₅ , d_MSA−5)	0.888	0.468	0.287	0.045	0.467	0.682

Note: MSA = multiple sequence analysis.

Results in Tables 1–3 reflect the association structure underlying the simulations. For domains in simulation 1, a relevant association is detected. The global α and the first eigenvalue (Tables 1 and 2) are very high, all the domains present high loadings on the first PC, and the α_−p’s are all lower than the global α. Also, the correlations between the d_p’s and d_MSA are high and similar, and the same holds for the correlations with the d_MSA−p’s (Table 3).

For simulations 2–4, a weaker and weaker association of the fifth domain with the others is correctly diagnosed. The global αs, all lower compared to simulation 1, decrease with the proportion (π) of perturbed sequences, and the deletion of the fifth domain leads to α₋₅ > α (Table 1). The second eigenvalues of the Mantel’s correlation matrix and the loadings of the fifth domain on the second PC increase with π (Table 2). Correspondingly, the correlations between d₅ and d_MSA are much lower compared to the other domains (Table 3). Also, Corr(d₅, d_MSA−5) is very low, indicating that information on the fifth domain cannot be recovered based on the others. For simulation 5 (one-to-many association), results are aligned with those observed in simulations 2–3: A weak relation between the fifth domain and the others is diagnosed.

Also for simulation 6 (independent blocks), results are coherent with expectations. The global α is the lowest, and the α_−p’s are similar for domains in the same block (even if this result alone would not allow one to infer the division of domains into blocks). The deletion of one domain in the second block (domains 3, 4, or 5) lowers α, whereas the deletion of a domain in the first block slightly raises it; this is due to the decrease in the relative strength of the second block (compared to the first block, including only two domains). The division into blocks is perfectly recovered by the PCs. The first two eigenvalues are both relevant: The first two domains are related to the second PC (which is the less relevant of the two, due again to the larger size of the second block), and the last three domains to the first PC. Also, the first two domains show a lower correlation with d_MSA (mostly influenced by the domains in the second block; Table 3). Moreover if, say, domain 1 is removed, Corr(d₁, d_MSA−1) is much lower than Corr(d₁, d_MSA), due to the decreased weight of the first block (now including only domain 2) on the joint dissimilarity.

Based on these results, we argue that the proposed measures provide insights into the structure of the association among domains and make it possible to assess, in a preliminary way, whether there are isolated domains or blocks of domains not related to one another. Also, JSA association makes it possible to assess whether the relations among domains allow an effective summary of their dissimilarities through the joint dissimilarities obtained using a given JSA approach.

Nonetheless, our simulations show that the considered measures do not distinguish between lack of association (simulations 2–4) and non-monotone association (simulation 5, one-to-many). It is therefore necessary to introduce criteria to properly measure a possible nonlinear association. Actually, situations may arise in which a JSA can be successfully applied even when only a weak linear association is diagnosed. Furthermore, it is important to evaluate whether a JSA gives satisfactory results for all the domains, or if there are domains that are not well described. This might help the researcher to carefully identify, possibly ex post, the domains that could be analyzed jointly.

Evaluating JSA

There are a number of dissimilarity-based methods used to analyze sequences. Examples include CA (and self-organizing maps), multidimensional scaling, ANOVA, or regression trees (suitably extended to SA). When these methods are based upon a joint dissimilarity matrix, it is crucial to evaluate if the obtained results are satisfactory for all the considered domains. Criteria to properly compare JSA and domain-specific SA obviously strongly depend on the particular technique applied to data.

For the sake of brevity, here we limit attention to CA. Besides being one of the techniques most applied in SA, CA simplifies the inspection of the most typical patterns in data by partitioning cases into groups having similar characteristics. Therefore, it can be particularly useful in JSA to gain a clearer understanding of the relations among domains.

Let C _JSA be a partition of cases into G clusters obtained on the basis of a joint dissimilarity matrix, D_JSA . A standard criterion to evaluate the quality of C _JSA with respect to D_JSA is to compare the heterogeneity of sequences within the clusters with that within the whole sample (see Piccarreta and Billari 2007). A measure of the heterogeneity within the whole sample is $T(D_{JSA}) = \sum_{i,j} d_{JSA}^{2}(i,j) / 2n$ , where n is the sample size and $d_{JSA}^{2}(i,j)$ is the squared joint dissimilarity between the ith and the jth sequence. The dispersion within the gth cluster of C _JSA can be measured as $w_{g}(D_{JSA}) = \sum_{i,j \in g} d_{JSA}^{2}(i,j) / 2n_{g}$ , n_g being the cluster’s size. A global measure of the within-cluster heterogeneity is $W(D_{JSA}) = \sum_{g} w_{g}(D_{JSA})$ , and a measure of the quality of C _JSA with respect to D_JSA is $R^{2}(D_{JSA} | C_{JSA}) = [T(D_{JSA}) - W(D_{JSA})] / T(D_{JSA})$ , the proportion of the total heterogeneity accounted for by the G clusters in C _JSA.

Clusters based upon D_JSA , obtained by combining information on the P domains, should ideally group cases that are similar across all the domains. Thus, in the case of association between the P domains, C _JSA should explain satisfactorily both D_JSA and all the domain-specific dissimilarity matrices, D₁ , …, D _P . The measures $R^{2}(D_{1} | C_{JSA}), \dots, R^{2}(D_{P} | C_{JSA})$ , which can be calculated by substituting each D _p for D_JSA in the aforementioned expressions, measure the ability of C _JSA to explain each domain individually. If one of these quantities is relatively low, C _JSA does not properly describe the corresponding domain.
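
As a minimal sketch of how these quantities can be computed, the following Python functions evaluate T, W, and R² for any dissimilarity matrix and any partition. The toy dissimilarities, the partition `joint_labels`, and all function names are hypothetical and serve only as an illustration. The sums are assumed to run over all ordered pairs (i, j); this choice does not affect R² as long as it is applied consistently to T and W.

```python
def dispersion(D, members):
    """Sum of squared dissimilarities over ordered pairs of `members`,
    divided by twice the number of members (T for the sample, w_g for a cluster)."""
    return sum(D[i][j] ** 2 for i in members for j in members) / (2 * len(members))


def r_squared(D, labels):
    """Proportion of the total dispersion in D accounted for by the partition `labels`."""
    clusters = {}
    for case, g in enumerate(labels):
        clusters.setdefault(g, []).append(case)
    total = dispersion(D, list(range(len(labels))))             # T(D)
    within = sum(dispersion(D, m) for m in clusters.values())   # W(D)
    return (total - within) / total


def toy_dissimilarities(positions):
    """Hypothetical helper: absolute differences between made-up 1-D positions."""
    return [[abs(a - b) for b in positions] for a in positions]


# Made-up data: six cases, two domains. Domain 1 has two tight groups
# (cases 0-2 and 3-5); domain 2 is grouped in an unrelated way.
D_1 = toy_dissimilarities([0, 1, 0, 10, 11, 10])
D_2 = toy_dissimilarities([0, 10, 1, 11, 0, 10])

# Stand-in for a partition extracted from joint dissimilarities.
joint_labels = [0, 0, 0, 1, 1, 1]

r2_by_domain = {
    name: r_squared(D, joint_labels)
    for name, D in (("domain 1", D_1), ("domain 2", D_2))
}
for name, value in r2_by_domain.items():
    print(f"R2({name} | joint partition) = {value:.3f}")
```

In this toy example the partition accounts for almost all of the dispersion in the first domain and for very little of it in the second, which is the pattern the domain-specific R²s are meant to flag.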

For the six sets of simulated data described in the previous section, CA was applied to the joint dissimilarity matrix obtained using MSA, D_MSA . Different algorithms were tried, leading to very similar results due to the neat structure of the data. For the first five simulations, standard criteria (e.g., the average silhouette width or the Calinski–Harabasz index; see Studer 2013, for a review) selected five clusters substantially grouping the matching sequences in the first four domains (characterized by one-to-one matches). For the sixth simulation (independent blocks), standard criteria suggested a relatively high number of clusters, due to the independence of blocks. Among the partitions with a relatively small number of clusters, the five-cluster partition was again supported by standard criteria.

Panel A in Table 4 displays the performances of the five-cluster partitions, C _MSA , obtained using Ward’s (1963) algorithm, both at a global level—R ²(D_MSA | C _MSA )—and at a domain-specific level—R ²(D _p | C _MSA ).

Table 4.

Simulated^a Data: Performance (R ²) of the Partitions Obtained Using Ward’s Algorithm, Based on D_MSA ( C _MSA ), on the Domain-based Dissimilarities D _p ( C_p ), and on D_MSA ₋ _p (C _MSA ₋ _p ).

Performance (R ²)	Simulation 1	Simulation 2	Simulation 3	Simulation 4	Simulation 5	Simulation 6
A. R ² of the partitions C _MSA , based on D_MSA
R ² (D₁\|C _MSA)	.957	.957	.957	.957	.957	.101
R ² (D₂\|C _MSA)	.963	.963	.963	.963	.963	.152
R ² (D₃\|C _MSA)	.972	.972	.972	.972	.972	.972
R ² (D₄\|C _MSA)	.964	.964	.964	.964	.964	.964
R ² (D₅\|C _MSA)	.935	.495	.293	.056	.897	.935
R ² (D_MSA\|C _MSA)	.971	.924	.905	.885	.967	.748
B. R ² of the partitions C_p , based on the D _p ’s
R ² (D₁\|C_p )	.957	.957	.957	.957	.957	.957
R ² (D₂\|C_p )	.963	.963	.963	.963	.963	.963
R ² (D₃\|C_p )	.972	.972	.972	.972	.972	.972
R ² (D₄\|C_p )	.964	.964	.964	.964	.964	.964
R ² (D₅\|C_p )	.935	.935	.935	.935	.954	.935
C. R ² of the partitions C _MSA−p , based on the D_MSA−p ’s
R ² (D_MSA−1\|C _MSA−1)	.969	.906	.878	.851	.965	.852
R ² (D_MSA−2\|C _MSA−2)	.969	.903	.876	.846	.965	.862
R ² (D_MSA−3\|C _MSA−3)	.969	.903	.876	.847	.964	.596
R ² (D_MSA−4\|C _MSA−4)	.970	.903	.875	.844	.965	.583
R ² (D_MSA−5\|C _MSA−5)	.972	.972	.972	.972	.972	.580

To distinguish between domains that are possibly “difficult” to cluster and domains that are not well explained by C _MSA , it is also convenient to consider for each domain the five-cluster partition, C_p , obtained using the domain-specific dissimilarity matrix, D _p , and the corresponding R ²(D _p | C_p ). These R ²s (Table 4, panel B) are all higher than 0.9, indicating that every domain can be satisfactorily partitioned into five clusters. Focusing now on the quality of C _MSA , note that when the domains are associated (simulation 1), C _MSA is adequate also with respect to the domain-specific dissimilarities. In simulations 2 to 4 (30, 50, and 80 percent of cases in the fifth domain randomly permuted), instead, the low association of the fifth domain with the others is reflected by lower values of R ²(D₅ | C _MSA ), which decrease as the percentage of permuted cases grows. Interestingly, in simulation 5 (one-to-many matches), the fifth domain is well explained by C _MSA . Unlike the measures based upon Mantel’s correlations, the R ²s also capture non-monotone association between dissimilarities. Finally, in simulation 6 (two independent blocks of internally connected domains), the first and the second domain are not explained by C _MSA , whereas the last three domains are well captured. This is clearly due to the different sizes of the two blocks (2 vs. 3 domains): The last three domains dominate the joint dissimilarities, the clustering process, and also the obtained partition.

Note that the R ²(D_MSA | C _MSA )s are aligned with the level of association underlying the simulations (with higher values corresponding to stronger association; last row of Table 4, panel A).

We now analyze the quality of the JSA partitions obtained by removing one domain at a time. For each p, the dissimilarity matrix D_MSA−p was obtained excluding the pth domain, and the five-cluster partition C _MSA−p was extracted. The measures R ²(D_MSA−p | C _MSA−p ) are reported in Table 4, panel C. The most interesting result concerns simulation 6. When one domain in the first block is removed, R ²(D_MSA−p | C _MSA−p ) is much higher than R ²(D_MSA | C _MSA ). Actually when one domain in the first block is not taken into account, the second block of domains is even more able to drive the clustering process. Therefore, the last three domains will be better explained by C _MSA−1 , with a consequent increase in the global R ². The reverse holds when one domain in the second block is removed. In this case, the two blocks have the same size (two domains each) and therefore the same importance in the determination of the cluster solution which, being a compromise between two equally weighted blocks, has a worse performance.

Observe that in simulations 3–4 (50 and 80 percent of permuted sequences in the fifth domain), the performance notably increases when the fifth domain (not strongly related to the others) is disregarded. In this case, MSA does not provide satisfactory results for all the involved domains. Thus, the procedure is “joint,” but the quality of the results is not the same for all the domains.

These results are coherent with those obtained in the previous section. Nonetheless, this is not always the case. For simulation 5 (one-to-many matches), a low association between the fifth domain and the others was diagnosed. The analysis based on the R ²s correctly shows instead that the low correlation between the fifth domain and the others does not compromise the quality of JSA (specifically MSA).

Beyond CA: Some Considerations on Other Dissimilarity-based Techniques

As already discussed, CA is only one of the techniques that can be applied using joint rather than domain-specific dissimilarities. Therefore, measures similar to those introduced earlier should also be defined for techniques which are more and more popular in SA, such as, for example, ANOVA and regression trees (Piccarreta and Billari 2007; Studer et al. 2011).

Due to space limitations, we cannot consider in detail how to extend our proposals to these methods. Nonetheless, ANOVA and regression trees are substantially based upon the evaluation of partitions defined on the basis of one or more covariates. More specifically, in ANOVA, groups are formed based on the levels of one categorical factor. The factor is significant with respect to a given domain if the factor-based groups contain similar sequences. Regression or classification trees aim instead at successively partitioning the sample according to the levels of one or more covariates so as to define groups of cases that are more and more homogeneous with respect to a given domain. Thus, in both cases, the dispersion within the (supervised) final partition’s groups plays a crucial role in the evaluation of the procedure’s results.

Evidently, for ANOVA, the dispersion within the factor-based groups can be evaluated referring both to the joint and to the domain-specific dissimilarities. Attention will be focused on the comparison of the significance of the selected factor across domains.

More interestingly, for regression trees, the final partition will be obtained by referring to the joint dissimilarities (exactly as it was done in CA; see Piccarreta and Billari [2007] for a discussion on the relations between the two techniques) and attention will be focused on the evaluation of the JSA regression tree at the domain-specific level. The procedure proposed to evaluate CA can therefore be easily extended to regression trees.

Application to the British Household Panel Survey Data

We now refer to a data set analyzed in Pollock (2007), who focuses on the first 10 years of the British Household Panel Survey. In particular, for each of the 5,124 individuals in the sample, trajectories on four domains are considered. The first domain describes the Employment (E) trajectories followed in each of the 10 considered years: self-employed (S.Emp), employed (Emp), unemployed (Un), retired (Ret), maternity leave (Mat), family care (Fam), full-time student or at school (Sch), long-term sick or disabled (Dis), and on a Government training scheme (GTr). The Housing sequences (H) describe the histories of housing tenure: owned outright (Own), owned with a mortgage (Own.M), local authority rented (LaR), housing association rented (HaR), rented from employer (ER), rented private, unfurnished (UpR), and furnished (FpR). The Marital (M) domain is focused on the marital status of individuals: married (Mar), separated (Sep), divorced (Div), widowed (Wid), and never married (Nev). Finally, the Children (C) trajectories describe parenthood and cohabitation with children, depending on whether the individual is responsible for children under 16 years old (Yes) or not (No).

Pollock (2007) applied MSA to the four domains, with insertion and deletion costs set to 1 and with the substitution costs matrices reported in Table 5. Based on the MSA dissimilarities, he extracted 15 clusters using Ward’s (1963) algorithm (refer to Pollock [2007] for an interesting and in-depth analysis and interpretation of the obtained clusters).

Table 5.

Substitution Costs Matrices.

		Status identification number
		1	2	3	4	5	6	7	8	9
Employment status
Self-employed	1	0
Employed	2	0.6	0
Unemployed	3	1.4	0.8	0
Retired	4	1.2	1.2	1.2	0
Maternity leave	5	1.4	0.6	1.4	1.4	0
Family care	6	1.2	0.8	1.2	0.8	1	0
Full-time student	7	1.4	0.6	1	1.4	2	1.4	0
Long-term sick	8	1.4	1.4	1.2	0.8	2	1.2	1.4	0
Government training	9	1.4	0.8	1	1.4	2	1.4	1.4	1.4	0
Housing status
Own outright	1	0
Own with mortgage	2	0.6	0
Local authority rent	3	1.4	0.8	0
Housing association rent	4	1.4	1	0.8	0
Rent from employer	5	1.4	1	1.4	1.4	0
Private rent (unfurnished)	6	1.4	1	1.4	1	0.8	0
Private rent (furnished)	7	1.4	0.8	1.4	1.4	1.2	1	0
Marital status
Married	1	0
Separated	2	1	0
Divorced	3	0.5	0.6	0	2
Widowed	4	0.5	1.4	2	0	2
Never married	5	0.5	2	2	2	0
Children responsibility
Yes	1	0	0.5
No	2	0.5	0

Source: Pollock (2007).
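
To make the role of these costs concrete, the following sketch computes an optimal matching dissimilarity between two sequences of a single domain by dynamic programming. Only the indel cost of 1 and the Children substitution cost (0.5 between Yes and No, Table 5) are taken from the text; the function name, the data structure used for the costs, and the two example sequences are illustrative assumptions. The sketch covers one domain only and is not an implementation of MSA.

```python
def om_distance(seq_a, seq_b, sub_cost, indel=1.0):
    """Optimal matching distance between two sequences (single domain).

    sub_cost maps frozenset({state_1, state_2}) to a substitution cost;
    identical states are substituted at zero cost.
    """
    n, m = len(seq_a), len(seq_b)
    # dp[i][j] = minimal cost of transforming seq_a[:i] into seq_b[:j]
    dp = [[0.0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        dp[i][0] = i * indel          # delete the first i states
    for j in range(1, m + 1):
        dp[0][j] = j * indel          # insert the first j states
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            a, b = seq_a[i - 1], seq_b[j - 1]
            cost = 0.0 if a == b else sub_cost[frozenset((a, b))]
            dp[i][j] = min(
                dp[i - 1][j - 1] + cost,   # substitution (or match)
                dp[i - 1][j] + indel,      # deletion
                dp[i][j - 1] + indel,      # insertion
            )
    return dp[n][m]


# Children domain: substituting "Yes" and "No" costs 0.5 (Table 5).
children_costs = {frozenset(("Yes", "No")): 0.5}

# Hypothetical 10-year trajectories differing in the timing of parenthood.
seq_1 = ["No"] * 4 + ["Yes"] * 6
seq_2 = ["No"] * 6 + ["Yes"] * 4

print(om_distance(seq_1, seq_2, children_costs))
```

Because a substitution (0.5) is cheaper than an insertion followed by a deletion (2), the two years in which these trajectories differ are aligned through substitutions.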

We now describe the steps of our integrated approach to evaluate the interrelations among the considered domains and the performance of a JSA both at a joint and at a domain-specific level. As already noted, our evaluation is conditional on the (chosen) domain-based and JSA dissimilarities. OM and MSA were applied using the same specifications as in Pollock (2007), obtaining the dissimilarity matrices D_E , D_H , D_M , D_C , D_MSA , and the corresponding vectors d_E , d_H , d_M , d_C , and d_MSA .

Step 1: Preliminary Evaluation of the (Linear) Association Among the Involved Domains

Table 6 displays the Cronbach’s αs and PCs extracted from the Mantel’s correlation matrix of the domain-based ds. The global α is low (0.24), and it decreases substantially if Employment or Housing is deleted, increasing instead if Children is not taken into account. The first PC is strongly related to Employment and Housing and to a lesser extent to Marital, whereas the second and the third PCs relate to Children and Marital, respectively. The domains do not show a strong linear association. Children is weakly related to the other domains, and Marital shows a moderate relation with Employment and Housing, which turn out to be the most connected domains.

Table 6.

Cronbach’s αs and Principal Components Analysis (PCA) Based on Mantel’s Correlations Between Domain-based Dissimilarities.

Cronbach’s αs			Loadings and Eigenvalues
Cronbach’s αs			PC1	PC2	PC3	PC4
Without employment, α_−E	0.09	Employment	0.70	0.16	−0.42	0.55
Without housing, α_−H	0.03	Housing	0.74	0.17	−0.15	−0.63
Without marital, α_−M	0.18	Marital	0.55	−0.12	0.81	0.16
Without children, α_−C	0.40	Children	−0.18	0.96	0.20	0.04
All_domains, α	0.24	Eigenvalues	1.38	1.00	0.90	0.73
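
For readers who wish to compute summaries of this kind, the sketch below derives a Cronbach’s α from a correlation matrix, together with the values obtained when one domain is deleted. It assumes the standardized form α = P·r̄ / (1 + (P − 1)·r̄), where r̄ is the mean off-diagonal correlation; whether this coincides exactly with the variant used for Table 6 is an assumption. The correlation values are hypothetical and are not the Mantel’s correlations underlying Table 6.

```python
def standardized_alpha(R):
    """Standardized Cronbach's alpha from a P x P correlation matrix R."""
    P = len(R)
    off_diagonal = [R[i][j] for i in range(P) for j in range(P) if i != j]
    r_bar = sum(off_diagonal) / len(off_diagonal)
    return P * r_bar / (1 + (P - 1) * r_bar)


def alpha_if_deleted(R, names):
    """Alpha recomputed after removing each domain in turn."""
    result = {}
    for k, name in enumerate(names):
        keep = [i for i in range(len(R)) if i != k]
        reduced = [[R[i][j] for j in keep] for i in keep]
        result[name] = standardized_alpha(reduced)
    return result


names = ["Employment", "Housing", "Marital", "Children"]

# Hypothetical correlations between domain-based dissimilarities
# (made up for illustration; NOT the values behind Table 6).
R = [
    [1.00, 0.20, 0.10, -0.05],
    [0.20, 1.00, 0.12, -0.04],
    [0.10, 0.12, 1.00, 0.00],
    [-0.05, -0.04, 0.00, 1.00],
]

alpha_all = standardized_alpha(R)
alpha_without = alpha_if_deleted(R, names)

print(f"all domains: {alpha_all:.2f}")
for name, value in alpha_without.items():
    print(f"without {name}: {value:.2f}")
```

With these made-up values, deleting Children raises α whereas deleting Employment lowers it, which is the kind of pattern read off Table 6.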

Step 2: Preliminary Evaluation of the (Linear) Association Between Joint and Domain-based Dissimilarities

Table 7 displays the Mantel’s correlations between the d _p ’s (d_E , d_H , d_M , d_C ), the joint dissimilarities, d_MSA , and the d_MSA−p ’s (the joint dissimilarities obtained by disregarding one domain at a time). Also, the summaries of the squared correlations are reported (their sum, SSC, and their average, $\overline{S S C}$ , the ratio between SSC and the number of domains, i.e., 4 when all the domains are considered and 3 when one domain is excluded), providing information about the “proportion” of the variance of the d _p ’s explained by d_MSA . Results are coherent with those in Table 6. The Children-specific dissimilarities are weakly related to d_MSA , whereas for the first three domains, medium-high correlations are observed. Also, the exclusion of Children leads to the highest increase in the proportion of explained variance. Even so, all the d _p ’s show relatively low correlations with the d_MSA−p ’s, indicating that domain-specific dissimilarities cannot be recovered by (combining) information on the other domains. Interestingly, Corr(d_M , d_MSA−M ) = 0.15; this confirms the relatively weaker linear relation between Marital and Employment and Housing, which again appear as the two most connected domains.

Table 7.

Mantel’s Correlations Between Domain-based and MSA Dissimilarities, Sum and Mean of Squared Correlations (SSC and $\overline{S S C}$ ).

Domain-based Dissimilarities, d	Corr(d,d_MSA) ^a	Corr(d,d_MSA−p), deleting domain^a:
Domain-based Dissimilarities, d	Corr(d,d_MSA) ^a	Employment	House	Marital	Children
Employment	0.66	0.23	0.70	0.77	0.68
House	0.68	0.71	0.27	0.76	0.69
Marital	0.63	0.74	0.72	0.15	0.65
Children	0.18	0.25	0.25	0.26	−0.05
SSC (explained variance)	1.32	1.12	1.07	1.22	1.36
$\overline{S S C}$ (percentage of explained variance)	0.33	0.37	0.36	0.41	0.45

^a d_MSA is the vector of multiple sequence analysis (MSA) dissimilarities; d_MSA−p is the vector of the MSA dissimilarities obtained by excluding one domain at a time.
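
The summaries in Table 7 can be illustrated with a short sketch. Mantel’s correlation is computed here as the Pearson correlation between the vectors of off-diagonal dissimilarities (the permutation test usually associated with it is not covered), and SSC is the sum of the squared correlations. The function names and the toy matrices are illustrative assumptions; the correlations passed to `ssc` are the rounded values printed in Table 7.

```python
from math import sqrt


def lower_triangle(D):
    """Vector of the dissimilarities below the diagonal of D."""
    return [D[i][j] for i in range(len(D)) for j in range(i)]


def mantel_correlation(D_1, D_2):
    """Pearson correlation between the vectorized dissimilarity matrices."""
    x, y = lower_triangle(D_1), lower_triangle(D_2)
    mean_x, mean_y = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)


def ssc(correlations):
    """Sum and mean of squared correlations."""
    values = list(correlations)
    total = sum(r ** 2 for r in values)
    return total, total / len(values)


# Rounded correlations from Table 7.
with_all_domains = [0.66, 0.68, 0.63, 0.18]    # Corr(d_p, d_MSA), four domains
children_deleted = [0.68, 0.69, 0.65]          # Corr(d_p, d_MSA-C), remaining three

ssc_all, mean_all = ssc(with_all_domains)
ssc_no_children, mean_no_children = ssc(children_deleted)
print(round(ssc_all, 2), round(mean_all, 2))
print(round(ssc_no_children, 2), round(mean_no_children, 2))

# Toy check of mantel_correlation on made-up matrices.
D_a = [[0, 1, 4], [1, 0, 2], [4, 2, 0]]
D_b = [[0, 2, 8], [2, 0, 4], [8, 4, 0]]        # D_a rescaled by 2
print(mantel_correlation(D_a, D_b))
```

Recomputed from the rounded entries, the sums are approximately 1.33 (all domains) and 1.36 (Children deleted), close to the values reported in Table 7; small discrepancies are to be expected because the printed correlations are themselves rounded.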

The weak level of detected association could make a joint analysis unable to satisfactorily “represent” each specific domain. Nonetheless, as discussed in the third section, Mantel’s correlations could fail to detect nonlinear relations. It is therefore important to directly evaluate the results obtained applying a dissimilarity-based technique on combined dissimilarities. We here focus on CA which could also conveniently describe the data, unveiling the most typical combinations of patterns across domains and shedding some light on possible nonlinear relations.

Step 3: CA Based on JSA Dissimilarities. Evaluate Partitions both at a Joint and at a Domain-specific Level

Following Pollock (2007), we applied Ward’s hierarchical algorithm to D_MSA, the MSA dissimilarity matrix.

Figure 2 reports the R ²s and the average silhouette widths (Kaufman and Rousseeuw 1990) for a number of clusters ranging between 2 and 20. The silhouette coefficients suggest a rather low number of clusters (3). Moving to more detailed higher-order partitions, the six-cluster solution appears as a possible choice. Also, a steep decrease of the R ²s is observed when reducing the number of clusters from G = 6 to G = 5 (Figure 2).

Figure 2.

Performance (R ² and average silhouette width coefficients) of clusters solutions obtained using Ward’s algorithm, evaluated for MSA-dissimilarities.
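
As a reminder of how the average silhouette width is obtained from a dissimilarity matrix, a minimal sketch follows. For each case, a is the mean dissimilarity from the other members of its own cluster, b is the smallest mean dissimilarity from the members of any other cluster, and the silhouette is (b − a) / max(a, b); singleton clusters are given a width of 0. The function name and the toy data are illustrative assumptions, and at least two clusters are required.

```python
def average_silhouette_width(D, labels):
    """Average silhouette width of the partition `labels` for dissimilarities D."""
    clusters = {}
    for case, g in enumerate(labels):
        clusters.setdefault(g, []).append(case)

    widths = []
    for i, g in enumerate(labels):
        own = [j for j in clusters[g] if j != i]
        if not own:                      # singleton cluster: width set to 0
            widths.append(0.0)
            continue
        a = sum(D[i][j] for j in own) / len(own)
        b = min(
            sum(D[i][j] for j in members) / len(members)
            for h, members in clusters.items()
            if h != g
        )
        scale = max(a, b)
        widths.append((b - a) / scale if scale > 0 else 0.0)
    return sum(widths) / len(widths)


# Made-up dissimilarities: absolute differences between 1-D positions.
positions = [0, 1, 0, 10, 11, 10]
D = [[abs(p - q) for q in positions] for p in positions]

separated = average_silhouette_width(D, [0, 0, 0, 1, 1, 1])
overlapping = average_silhouette_width(D, [0, 1, 0, 1, 0, 1])
print(f"{separated:.2f} {overlapping:.2f}")
```

The partition that follows the two groups yields a width close to 1, whereas the partition that mixes them yields a width close to 0; negative values for individual cases signal overlapping clusters.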

Since more domains are taken into account, it is important to monitor the performance of the partitions also with respect to the domain-based dissimilarity matrices (D_E , D_H , D_M , D_C ).

Figure 3 displays the R ² and the silhouette coefficients characterizing the MSA partitions evaluated with respect both to the MSA and to the domain-based dissimilarities. The number of clusters suggested by the silhouette coefficients varies with domain: Two clusters are suggested for Employment and Children, three for Housing, and four for Marital. Also, for Children, the coefficients are negative (indicating overlapping), and they become negative for all the domains when more than six clusters are taken. The plot of the R ²s shows that when G < 5 Children is left unexplained, and that when G is increased from 6 to 7, a decrease in R ² is observed for Marital, which is again satisfactorily explained when at least nine clusters are considered.

Figure 3.

Performance (R ² and average silhouette width coefficients) of clusters solutions obtained using Ward’s algorithm, evaluated with respect to MSA and domain-based dissimilarities.

Also, it can be noted that when G is higher than 10, the domain-specific R ²s stabilize, and a rather constant gap is observed between the R ²s for Children and Marital (aligned with the global R ²s) and those for Employment and Housing, indicating a possible division of the domains into blocks. Children and Marital turned out to be the domains least connected to the others (see steps 1 and 2). Their good explanation might therefore be due to one-to-many relations. As a further consideration, note that these domains are those with the smallest alphabets. Therefore, these domains will be easier to cluster compared to the others. A clustering procedure attempting to determine homogeneous groups of cases along all the domains might favor the simplest domains in the case of weak association (simply because the reduction of heterogeneity in the simplest domains is easier to achieve). This intuition is also supported by the analysis of the silhouette coefficients (which decrease with the number of clusters for Children and Marital, indicating overlapping).

To test these considerations in a more detailed manner and to better analyze the association among domains, we focus on a specific partition, extracting G = 6 clusters. This solution (supported by the global silhouette coefficient) is far from being optimal, and actually Pollock (2007) uses a much higher number of clusters (15). Nonetheless, we prefer a low-degree partition since its clusters can be graphically represented (see step 4 below). Also, results in Table 8 are aligned with those observed when a higher number of clusters are extracted.

Table 8.

Performance (R ²) of the Six-cluster Solutions (Obtained Applying Ward’s to the MSA Dissimilarities) Evaluated Both at the Joint Level and at the Domains Level.

Dissimilarity matrix, D	R ²(D\|C _MSA )^a	R ²(D\|C _MSA−p ),^a deleting domain/s
Dissimilarity matrix, D	R ²(D\|C _MSA )^a	E	H	M	C	(E,H)	(C,M)
Employment, D_E	.504	.241	.610	.630	.578	.177	.650
House, D_H	.541	.499	.057	.601	.556	.045	.606
Marital, D_M	.501	.822	.818	.125	.548	.920	.107
Children, D_C	.661	.563	.712	.651	.122	.799	.138
MSA, D_MSA	.605	.673	.727	.664	.630	.900	.679

^a C_MSA is the six-cluster partition obtained applying Ward’s algorithm to the multiple sequence analysis (MSA) dissimilarities; C_MSA−p is the six-cluster partition obtained applying Ward’s algorithm to the MSA dissimilarities obtained without considering one or more domains. The diagonal elements of the sub-matrix (in gray) are the R ²(D_p |C_MSA−p )s, that is, the R ²s of C_MSA-p evaluated with respect to the dissimilarities of the excluded domain.

The first column in Table 8 shows the R ²s evaluated by referring to the MSA and to the domain-based dissimilarities. Columns 2 to 5 report the R ²s characterizing the six-cluster partitions, C _MSA−p , extracted from the dissimilarity matrices, D_MSA−p ’s, obtained excluding one domain at a time. As expected, the R ²(D_MSA−p | C _MSA−p )s (last row) are all higher than R ²(D_MSA | C _MSA ), due to the “simplification” of the MSA dissimilarities consequent to the deletion of one domain. Similar results are observed for the specific domains with one exception. When Employment is disregarded, the R ²s for Children and Housing decrease (respectively, from .661 to .563 and from .541 to .499), and Marital prevails (R ² increased from .501 to .822).

The diagonal elements of the sub-matrix (cells in gray) are the R ²(D _p | C _MSA−p )s, that is, the R ²s of C _MSA−p evaluated with respect to the dissimilarities of the excluded domain, providing information about the “joint ability” of the other domains to explain the excluded one. These values are small for all the domains, but Employment constitutes again an exception. Actually, the MSA partition C _MSA−E , obtained without taking the domain into account, captures a relatively high proportion of the Employment-specific dissimilarities (24 percent). These findings confirm the relatively higher association between Employment and the other domains and the relatively weak level of association among the others.

To explore also the possible division of the domains into blocks, we built the MSA dissimilarity matrices D_MSA−(MC) and D_MSA−(HE) , obtained by excluding first Children and Marital and then Employment and Housing, and again applied CA. The R ² of the obtained six-cluster partitions are reported in the last two columns of Table 8. The block structure appears now quite evident: The excluded domains (cells in gray) are very poorly explained.

We therefore hypothesize that (1) the four domains are split into two blocks; (2) Marital and Children, if taken into account, will always prevail over Employment and Housing, which turned out to be the most connected domains at steps 1 and 2; (3) the weak association between Children and Marital and the other domains might be due to one-to-many association. For this reason, we decided to explore the C _MSA−(MC) partition, based only on Employment and Housing, and to evaluate the obtained (six) clusters also with respect to the excluded domains.

Step 4.1: Analyzing Clusters: An In-depth Analysis Based on Graphical Tools

Our integrated approach, combining ex-ante and ex-post evaluations of the relations among domains, makes it possible to acquire information on the appropriateness and on the quality of a JSA. However, our measures are summaries: They only indicate the possible presence (or absence) of problems. When one or more domains appear to be “critical” and/or not well captured by the joint dissimilarities-based clusters, a detailed inspection of the clusters domain by domain can provide insights about the quality of the obtained clusters from a substantive point of view as well as information about the nature of the detected problems. To do so, we suggest considering index plots of the sequences in each cluster for each domain.

The plots for our data are reported in Figure 4: Each column corresponds to a cluster and each row to a domain. In each plot, cases are placed on the horizontal axis, and each case is represented by a vertical bar describing the states experienced during the considered period of time (on the vertical axis), distinguished using different colors. The ordering of cases (in the same cluster) along the horizontal axis is the same for all the domains.

Figure 4.

Plots of sequences grouped into the six clusters obtained by applying Ward’s algorithm to the multiple sequence analysis dissimilarities based on the Employment and the Housing domains.

These plots provide very interesting insights about the combinations of patterns in different domains as well as about the relations among the considered domains. Of course, cases in the same clusters are mostly characterized by their Employment and Housing trajectories. These trajectories are not perfectly homogeneous due to the low number of chosen clusters, but even so their (homogeneous) subclusters are clearly visible. A one-to-many association is clearly detected: Individuals with similar E–H combinations are characterized by different family trajectories, as was reasonable to expect. It is important to underline that the cluster solution based upon all the domains (not reported here) led to clusters which, being more homogeneous with respect to Children, Marital, and Housing, failed to describe and capture the Employment trajectories, by far the most difficult to describe. Also, for the sake of completeness, we considered the partition obtained by referring to Employment only, and we observed a deteriorated cluster structure. This confirms the observed relation (steps 1 and 2) between Employment and Housing.

Interestingly, the results obtained focusing only on two domains allow us to draw conclusions even clearer than those obtained by Pollock (2007). The typologies found by Pollock using 15 clusters are clearly visible in the clusters’ plots. Furthermore, the study of association makes it possible to draw clearer conclusions on the interrelations among domains. Actually, if more clusters are considered (results not shown), groups of cases which are also homogeneous according to the excluded domains are found. Finally, it has to be pointed out that Pollock (2007) described clusters using modal combinations of states and modal switches, which did not necessarily characterize a high proportion of cases in some clusters. The description of clusters using graphical tools makes their interpretation much easier, and it is surely recommended when several domains are taken into account.

Step 4.2: Analyzing Clusters: Summarizing Within Clusters Dispersion

As a further piece of information, we introduce measures to identify clusters that possibly group together sequences highly heterogeneous in one or more domains. This procedure is of interest per se, but it can also be useful to understand if and to what extent a low association can be explained by costs misalignment (i.e., condition (c), the dissimilarities in the two domains increase together, is not met due to misalignments between substitution costs and closeness).

Given a partition C _JSA , recall from the fourth section that the heterogeneity of a cluster, say the gth, with respect to a given dissimilarity matrix D (of joint or domain-specific dissimilarities) can be measured as $w_{g}(D) = \sum_{i,j \in g} d^{2}(i,j) / 2n_{g}$ (n_g being the cluster’s size). To summarize the level of heterogeneity within the cluster, we propose to refer to the average amount of dispersion within it, $\bar{w}_{g}(D) = w_{g}(D) / n_{g}$ . Note that $\bar{w}_{g}(D)$ can (also) depend on the total dispersion in D. To fairly compare the different domains (possibly having different levels of original dispersion, or complexity), it is convenient to refer to the ratio $r_{g}(D) = \bar{w}_{g}(D) / \bar{w}(D)$ , where $\bar{w}(D)$ is the average dispersion in the whole sample, $\bar{w}(D) = T(D) / n$ . Results for our clusters ( C _MSA−(MC) ), reported in Table 9, indicate that the proposed criterion, r_g, correctly identifies clusters that are more dispersed with respect to a given domain. For example, the most heterogeneous clusters are the sixth and the fifth, the former particularly for Marital and Housing and the latter for Children. In both cases, the dispersion with respect to Employment is similar and almost aligned with that observed for the first cluster. The second cluster turns out to be the most homogeneous across all the domains.

Table 9.

Evaluation of the Dispersion Within the Six-cluster Solution (Obtained Applying Ward’s to the MSA Dissimilarities Based Only on Employment and Housing) Evaluated Both at the Joint Level and at the Domains Level.

	Cluster 1		Cluster 2		Cluster 3		Cluster 4		Cluster 5		Cluster 6
Dissimilarity matrix, D	$\bar{w}_{1}(D)$	r₁(D)	$\bar{w}_{2}(D)$	r₂(D)	$\bar{w}_{3}(D)$	r₃(D)	$\bar{w}_{4}(D)$	r₄(D)	$\bar{w}_{5}(D)$	r₅(D)	$\bar{w}_{6}(D)$	r₆(D)
Employment, D_E	26.6	0.88	11.2	0.37	10.5	0.35	6.2	0.21	21.6	0.72	20.0	0.66
House, D_H	5.5	0.23	4.9	0.21	26.5	1.12	8.4	0.35	12.7	0.53	22.4	0.94
Marital, D_M	8.7	0.54	3.4	0.21	0.4	0.02	6.1	0.38	5.0	0.31	38.4	2.37
Children, D_C	0.4	0.13	1.7	0.53	2.0	0.64	0.3	0.10	5.4	1.70	0.4	0.12
MSA, D_MSA	78.6	0.43	49.3	0.27	81.2	0.44	43.8	0.24	117.9	0.64	166.1	0.90

Note: $\bar{w}_{g}(D)$ is the average dispersion within the gth cluster with respect to the dissimilarity matrix D. $r_{g}(D) = \bar{w}_{g}(D) / \bar{w}(D)$ is the ratio between $\bar{w}_{g}(D)$ and $\bar{w}(D)$ , which is the average dispersion in the whole sample.

Observe in addition that the fourth cluster turns out to be the most homogeneous with respect to Employment, even if it actually includes sequences dominated by different states, namely, “Family care” and “Employed” (and to a lesser extent “Unemployed”). A low heterogeneity is detected because “Family care” and “Employed” are considered as close states based on the substitution cost matrix defined for this domain (see Table 5).

In this sense, a careful inspection and “comparison” of the plots and of the r_g ’s can shed some light on the possible effects of costs definition on the domains dissimilarities and, consequently, on the joint dissimilarities and on the measures of association and dispersion.

Since the impact of the combination of the substitution costs defined on the different domains on the considered association measures is unpredictable, it could be worthwhile to preliminarily analyze the association between domains using data-driven substitution costs (following, e.g., Rohwer and Pötter 2004, and specifying substitution costs inversely related to the frequency of transition), which do not depend on the user’s opinion (however well founded) about the closeness or distance between states.
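
One way to obtain costs of this kind is sketched below. It assumes the commonly used transition-rate formula, cost(a, b) = 2 − p(a→b) − p(b→a), where p(a→b) is the observed proportion of transitions from a to b among all transitions leaving a; this specific formula, the function names, and the example sequences are assumptions made for illustration and are not taken from the analyses reported here.

```python
from collections import Counter


def transition_rates(sequences):
    """Observed rate of moving from state a at time t to state b at time t + 1."""
    pair_counts = Counter()
    origin_counts = Counter()
    for seq in sequences:
        for a, b in zip(seq, seq[1:]):
            pair_counts[(a, b)] += 1
            origin_counts[a] += 1
    return {pair: count / origin_counts[pair[0]] for pair, count in pair_counts.items()}


def transition_based_costs(sequences, states):
    """Substitution costs that decrease as transitions between two states become more frequent."""
    p = transition_rates(sequences)
    return {
        (a, b): 0.0 if a == b else 2.0 - p.get((a, b), 0.0) - p.get((b, a), 0.0)
        for a in states
        for b in states
    }


# Hypothetical employment sequences (made up for illustration).
sequences = [
    ["Emp", "Emp", "Un", "Emp"],
    ["Un", "Un", "Emp", "Emp"],
    ["Emp", "Ret", "Ret", "Ret"],
]
states = ["Emp", "Un", "Ret"]

costs = transition_based_costs(sequences, states)
for a, b in (("Emp", "Un"), ("Emp", "Ret"), ("Un", "Ret")):
    print(a, b, round(costs[(a, b)], 3))
```

In this toy example the pair of states with the most frequent transitions (Emp and Un) receives the lowest cost, and the pair with no observed transitions (Un and Ret) receives the maximum cost of 2.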

Conclusions

The problem of jointly analyzing several domains in SA is not new. Nonetheless, many contributions propose ad hoc solutions, which are sensible and reasonable with reference to the specific considered data (Aassve et al. 2007; Blair-Loy 1999; Han and Moen 1999b; Piccarreta and Billari 2007; Stovel et al. 1996). A systematic and well-organized treatment of the problem recently appeared in Pollock (2007) and in Gauthier et al. (2010). The relevance of the problem is evident if one considers that the package TraMineR (developed for the R software; see Gabadinho et al. 2011), which is surely the most popular, now also supports MSA.

Nonetheless, there are no contributions in the literature proposing methods to preliminarily evaluate the degree of interconnection between a set of domains or to evaluate the quality of the results obtained by jointly analyzing them at both a “global” and a “domain-specific” level. With respect to the former point, some contributions exist concerning the case in which two domains are taken into account. Piccarreta and Lior (2010) propose a graphical approach to visualize the relations between two domains, and Piccarreta and Elzinga (2013) introduce some criteria to quantify the strength of the association between two domains.

In this article, we innovate over the existing literature by introducing an integrated approach for the ex-ante and ex-post evaluation of JSA. We start by focusing on the measurement of the association between several domains and on the association between joint and domain-specific dissimilarities. Subsequently, we introduce criteria to evaluate dissimilarity-based methods applied to joint dissimilarities. There are many procedures that surely deserve attention: CA, multidimensional scaling, ANOVA, and regression trees are only some of the most widely used in SA.

For the sake of brevity, we limit attention to CA. CA is surely one of the most popular techniques in SA, and it is also often applied to gain insight into the most relevant patterns in the data before more structured analyses are carried out. Also, we show that the analysis of clusters can provide some further insights into the association among the domains taken into account.

We illustrate the reliability of the proposed criteria using synthetic data, and then move to the analysis of data arising from the British Household Panel Survey, already analyzed in Pollock (2007).

Our results demonstrate the substantive importance of our proposals. We showed the importance of an exploratory analysis aimed at acquiring as much information as possible on the (possible) association among domains and its structure. The suitable identification and selection of the domains that can be efficiently studied together may prove particularly useful in SA: Looking at too large a number of domains could make the analysis unnecessarily and hopelessly overcomplex. Also, we illustrated how crucial it is to assess the “joint” reliability of the results obtained using JSA and to carefully analyze the performance of the algorithms with respect to all the domains instead of focusing only on the global performance.

The assessment of association is a very useful preliminary step when several domains are studied jointly. The same of course applies when attention is focused on two domains only. As already mentioned, this topic has been discussed in Piccarreta and Elzinga (2013). Even so, here we introduce an integrated approach which allows both testing the association and investigating the exact nature of this association. Also, our approach could be applied to compare the extent of association within subgroups (based, e.g., on gender, race, or education). Actually, finding association only in some of these subgroups could possibly support some theoretical hypotheses or expectations about the interplay among domains (in the life course and work–family literature, for instance).

Surely, the proposed techniques can be improved in a number of directions. The definition of clear benchmarks for our measures or of permutation tests aimed at evaluating the departure from independence is clearly a relevant issue. Also, it would surely be important to define suitable criteria to evaluate JSA when it is combined with techniques other than CA, in particular discrepancy analysis or regression trees.
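To give a concrete idea of what a permutation test for the departure from independence could look like, the Python sketch below implements a Mantel-type test on two dissimilarity matrices: the correlation between their off-diagonal entries is compared with the values obtained after randomly relabeling the cases in one of the two matrices. This is only a generic illustration under our own assumptions (Pearson correlation, one-sided alternative of positive association, hypothetical function names and toy data), not a procedure defined or validated in this article.

```python
import random
from math import sqrt


def upper(D, order):
    """Upper-triangle entries of D after relabeling the cases by `order`
    (rows and columns are permuted together)."""
    n = len(order)
    return [D[order[i]][order[j]] for i in range(n) for j in range(i + 1, n)]


def pearson(x, y):
    """Pearson correlation between two equally long lists of numbers."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def mantel_test(D1, D2, n_perm=999, seed=0):
    """Return (observed correlation, one-sided permutation p-value)."""
    rng = random.Random(seed)
    identity = list(range(len(D1)))
    x = upper(D1, identity)
    observed = pearson(x, upper(D2, identity))
    hits = 0
    for _ in range(n_perm):
        perm = identity[:]
        rng.shuffle(perm)  # random relabeling of the cases in D2
        # small tolerance guards against floating-point ties
        if pearson(x, upper(D2, perm)) >= observed - 1e-12:
            hits += 1
    # the observed arrangement is counted among the permutations
    return observed, (hits + 1) / (n_perm + 1)


# Toy example: two perfectly related domains (hypothetical data).
positions = [0, 1, 2, 3, 4, 5]
D1 = [[abs(a - b) for b in positions] for a in positions]
D2 = [[2 * abs(a - b) for b in positions] for a in positions]
r, p = mantel_test(D1, D2)
```

A small p-value suggests that the two domains are more strongly associated than would be expected if cases were matched at random; with real data, the number of permutations and the choice of correlation coefficient would need to be considered carefully.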

Of course, a quantitative approach such as that described here is mostly useful as a screening procedure, to gain some insight into solutions that merit inspection, and to highlight the possible problems connected with the joint analysis of several domains. As for CA, for instance, the substantive interpretation of clusters plays a crucial and irreplaceable role in the choice of the partition or in the final decision about the domains to be jointly analyzed. Actually, one might be interested in extracting clusters that explain all the domains in a way that is judged satisfactory, or might be prepared to accept clusters that explain some domains worse than others. These considerations are strongly related to the goal of the analysis and to the researcher’s prior knowledge of the data, which are needed to support, fully understand, and interpret the results.

Footnotes

Acknowledgments

All the analyses in this article were conducted using the statistical software R (R Core Team 2014). User-defined routines were used (based on a number of packages), but the dissimilarity matrices analyzed throughout this article were all obtained using the TraMineR package (Gabadinho et al. 2011).

The author is grateful to Gary Pollock for sharing his data and to three anonymous referees for their highly appreciated comments on a previous version of this article.

Declaration of Conflicting Interests

The author(s) declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.

Funding

The author(s) received no financial support for the research, authorship, and/or publication of this article.

References

Aassve, F. C. Billari, and Piccarreta. 2007. “Strings of Adulthood: A Sequence Analysis of Young British Women’s Work–Family Trajectories.” European Journal of Population 23:369–88.

Abbott. 1990a. “Conceptions of Time and Events in Social Science Methods.” Historical Methods and Research 23:140–50.

Abbott. 1990b. “A Primer on Sequence Methods.” Organization Science 1:373–92.

Abbott. 1995. “Sequence Analysis: New Methods for Old Ideas.” Annual Review of Sociology 21:93–113.

Abbott and Forrest. 1986. “Optimal Matching Methods for Historical Sequences.” Journal of Interdisciplinary History 16:471–94.

Abbott and Hrychak. 1990. “Measuring Resemblance in Sequence Data: An Optimal Matching Analysis of Musicians’ Careers.” American Journal of Sociology 96:144–85.

Aisenbrey and A. E. Fasang. 2010. “New Life for Old Ideas: The ‘Second Wave’ of Sequence Analysis. Bringing the ‘Course’ Back Into the Life Course.” Sociological Methods and Research 38:420–62.

Blair-Loy. 1999. “Career Patterns of Executive Women in Finance: An Optimal Matching Analysis.” American Journal of Sociology 104:1346–97.

Chan, T. W. 1994. Tracing Typical Mobility Paths. Oxford, UK: Ms Nuffield Coll.

Chan, T. W. 1995. “Optimal Matching Analysis: A Methodological Note on Studying Career Mobility.” Work and Occupations 22:467–90.

Clark, W. A. V., M. C. Deurloo, and Dieleman. 2003. “Housing Careers in the United States, 1968-93: Modelling the Sequencing of Housing States.” Urban Studies 40:143–60.

Cronbach, L. J. 1951. “Coefficient Alpha and the Internal Structure of Tests.” Psychometrika 16:297–334.

Elzinga, C. H. 2003. “Sequence Similarity: A Nonaligning Technique.” Sociological Methods and Research 32:3–29.

Elzinga, C. H. 2007. “Sequence Analysis: Metric Representation of Categorical Time Series.” Mimeo. Retrieved June 14, 2015 (http://www.researchgate.net/publication/228982046_Sequence_analysis_Metric_representations_of_categorical_time_series).

Gabadinho, Ritschard, N. S. Müller, and Studer. 2011. “Analyzing and Visualizing State Sequences in R with TraMineR.” Journal of Statistical Software 40:1–37.

Gauthier, J.-A., E. D. Widmer, Bucher, and Notredame. 2009. “How Much Does It Cost? Optimization of Costs in Sequence Analysis of Social Science Data.” Sociological Methods and Research 38:197–231.

Gauthier, E. D. Widmer, Bucher, and Notredame. 2010. “Multichannel Sequence Analysis Applied to Social Science Data.” Sociological Methodology 40:1–38.

Halpin. 2003. “Tracks through Time and Continuous Processes: Transitions, Sequences and Social Structure.” Mimeo. Conference paper for Frontiers in Social and Economic Mobility, Cornell University, Ithaca, NY, March 2003.

Halpin. 2010. “Optimal Matching Analysis and Life-course Data: The Importance of Duration.” Sociological Methods and Research 38:365–88.

Halpin and T. W. Chan. 1998. “Class Careers as Sequences: An Optimal Matching Analysis of Work-life Histories.” European Sociological Review 14:111–30.

Hamming. 1950. “Error Detecting and Error Correcting Codes.” Bell System Technical Journal 26:147–60.

Han, S.-K., and Moen. 1999a. “Clocking Out: Temporal Patterning of Retirement.” American Journal of Sociology 105:191–236.

Han, S.-K., and Moen. 1999b. “Work and Family over Time: A Life Course Approach.” Annals of the American Academy of Political and Social Science 562:98–110.

Hollister. 2009. “Is Optimal Matching Suboptimal?” Sociological Methods and Research 38:235–64.

Jolliffe, I. T. 2002. Principal Component Analysis. 2nd ed. New York: Springer.

Kaiser, H. F. 1960. “The Application of Electronic Computers to Factor Analysis.” Educational and Psychological Measurement 20:141–51.

Kauffman and P. J. Rousseeuw. 1990. Finding Groups in Data. New York: Wiley.

Legendre. 1998. Numerical Ecology. 2nd ed. Amsterdam, the Netherlands: Elsevier Science.

Lesnard. 2008. “Off-scheduling within Dual-earner Couples: An Unequal and Negative Externality for Family Time.” American Journal of Sociology 114:447–90.

Lesnard. 2010. “Setting Cost in Optimal Matching to Uncover Contemporaneous Socio-temporal Patterns.” Sociological Methods and Research 38:389–419.

Levenshtein. 1966. “Binary Codes Capable of Correcting Deletions, Insertions and Reversals.” Cybernetic Control Theory 10:707–10.

MacIndoe and Abbott. 2004. “Sequence Analysis and Optimal Matching Techniques for Social Science Data.” Pp. 387–406 in Handbook of Data Analysis, edited by M. A. Hardy and Bryman. Thousand Oaks, CA: Sage.

Malo, M. A., and Munoz-Bùllon. 2003. “Employment Status Mobility from a Life-cycle Perspective: A Sequence Analysis of Work-histories in the BHPS.” Demographic Research 9:119–61.

Manly, B. F. J. 1997. Randomization, Bootstrap and Monte Carlo Methods in Biology. 2nd ed. London, UK: Chapman and Hall.

Mantel. 1967. “The Detection of Disease Clustering and a Generalized Regression Approach.” Cancer Research 27:209–20.

McVicar and Anyadike-Danes. 2001. “Predicting Successful and Unsuccessful Transitions from School to Work by Using Sequence Methods.” Journal of the Royal Statistical Society, Series A 165:317–34.

Piccarreta and F. C. Billari. 2007. “Clustering Work and Family Trajectories Using a Divisive Algorithm.” Journal of the Royal Statistical Society, Series A 170:1061–78.

Piccarreta and C. H. Elzinga. 2013. “Mining for Association between Life Course Domains.” Pp. 190–220 in Contemporary Issues in Exploratory Data Mining in the Behavioral Sciences (Quantitative Methodology Series), edited by McArdle and Ritschard. New York: Routledge.

Piccarreta and Lior. 2010. “Exploring Sequences: A Graphical Tool Based on Multi-dimensional Scaling.” Journal of the Royal Statistical Society, Series A 173:165–84.

Pollock. 2007. “Holistic Trajectories: A Study of Combined Employment, Housing and Family Careers by Using Multiple-sequence Analysis.” Journal of the Royal Statistical Society, Series A 170:167–83.

R Core Team. 2014. R: A Language and Environment for Statistical Computing. Vienna, Austria: R Foundation for Statistical Computing. Retrieved June 14, 2015 (http://www.R-project.org).

Rohwer and Pötter. 2004. TDA User’s Manual. Bochum, Germany: Ruhr-Universität Bochum.

Sankoff and J. B. Kruskal, eds. 1983. Time Warps, String Edits, and Macromolecules: The Theory and Practice of Sequence Comparison. Reading, MA: Addison-Wesley.

Scherer. 2001. “Early Career Patterns: A Comparison of Great Britain and Germany.” European Sociological Review 17:119–44.

Schoon, McCullough, Joshi, Wiggins, and Bynner. 2001. “Transitions from School to Work in a Changing Social Context.” Young 9:4–22.

Stovel and Bolan. 2004. “Residential Trajectories: Using Optimal Alignment to Reveal the Structure of Residential Mobility.” Sociological Methods and Research 32:559–98.

Stovel, Savage, and Bearman. 1996. “Ascription into Achievement.” American Journal of Sociology 102:358–99.

Studer. 2013. “Weighted Cluster Library Manual: A Practical Guide to Creating Typologies of Trajectories in the Social Sciences with R.” LIVES Working Papers, 24. doi: 10.12682/lives.2296-1658.2013.24. Retrieved June 14, 2015 (http://cran.r-project.org/web/packages/WeightedCluster/vignettes/WeightedCluster.pdf).

Studer, Ritschard, Gabadinho, and N. S. Müller. 2011. “Discrepancy Analysis of State Sequences.” Sociological Methods and Research 40:471–510.

Ward, J. H. 1963. “Hierarchical Grouping to Optimize an Objective Function.” Journal of the American Statistical Association 58:236–44.

Widmer and Ritschard. 2009. “The De-standardization of the Life Course: Are Men and Women Equal?” Advances in Life Course Research 14:29–39.