Abstract
The use of sequence analysis in the social sciences has significantly increased during the last decade or two. Sequence analysis explores and describes trajectories and “fishes for patterns” (Abbott, 2000). Many dissimilarity metrics exist in various domains (bioinformatics, data mining, etc.); therefore a crucial and pervasive issue in papers using sequence analysis is robustness. To what extent do the various techniques lead to consistent and converging results? What kinds of patterns are more easily fished out by each of the metrics? Here we propose a systematic comparison of about ten metrics that have been used in the social science literature, based on the examination of dissimilarity matrices computed from a simulated sequence data set including various patterns that sociologists can try to identify. This should help scholars in picking the method best suited to their data design and inquiry objectives.
Introduction 1
Life Course Analysis - From Causal to Descriptive Approaches
Since the mid-1970s, life course analysis has become a major field of interest in the social sciences. It implies a shift “from structure to process, from macro to micro, from analysis to synthesis, from certainty to uncertainty” (Willekens, 1999: 26, quoted in Ritschard and Oris, 2005). The development of life course analysis is simultaneously linked to theoretical issues as well as to advances in longitudinal micro-individual data collection and statistical analysis. The use of longitudinal data, such as panels or retrospective event-history surveys, has developed considerably (GRAB, 1999). On the methodological side, new analysis techniques have spread slowly but cumulatively, and social scientists now have a full toolbox. Since the early 1980s, the central approach has been event history analysis (Kalbfleisch and Prentice, 1980 ; Allison, 1984 ; Courgeau and Lelièvre, 1992 ; Mayer and Tuma, 1990). This set of techniques, such as the famous Cox model (1972), is a generalization of life table methods. Focusing on the occurrence of specific events, they model transition likelihoods or durations, under the assumption that life courses result from a complex stochastic process (Courgeau and Lelièvre, 1986). As a consequence, more often than not, event history analysis is a parametric approach, with a causal view.
Although most empirical life course studies rely on an event-based approach, theory has underlined the importance of the concept of trajectory (Sackmann and Wingens, 2003): events should not be studied independently from each other, but rather as a sequence. This implies a “‘holistic’ perspective that sees life courses as one meaningful conceptual unit” (Billari, 2001: 440). A holistic approach allows to summarize the timing and sequence of events, durations in the various states and durations between events (Settersten and Mayer, 1997). Contrary to event history analysis, trajectory-based methods are mostly non-parametric: they make no assumption about the process underlying life courses and they belong to the “algorithmic model culture” (Breiman, 2001). They chiefly aim at exploring and describing life course as a whole and at “fishing for patterns” (Abbott, 2000). Event-based and trajectory-based approaches can thus be considered as two complementary cultures in life course analysis (Billari, 2005; Bry and Antoine, 2004): the latter intends to identify what differentiates life courses as a whole, while the former focuses on the risk of experiencing events (and its determinants).
Which Metric Should One Choose? 2
More often than not, the first step of holistic approaches consists in measuring the dissimilarity between life courses (regarded as sequences). Pairwise distances between sequences can further be used in various ways, often with data reduction techniques such as multidimensional scaling or clustering. Many dissimilarity metrics exist in various domains (bioinformatics, data mining…) and their use in social sciences has been developing rapidly for a decade or two. The most widely known is certainly Optimal Matching Analysis (Abbott and Forrest, 1986; Lesnard and de Saint Pol, 2006), but other metrics for sequence analysis have been proposed and similar techniques using correspondence analysis also exist. Therefore, a crucial and pervasive issue in papers using holistic approaches is robustness: to what extent do the various techniques lead to consistent and converging results? What kinds of patterns does each of the metrics identify most effectively?
Numerous articles have been devoted to comparing metrics. However, most of them have limitations: they deal with a narrow range of methods at a time; they apply to specific sets of empirical data; other choices implied in the holistic approach (clustering techniques, etc.) may blur the results. So generalization is often problematic. We propose a systematic comparison of a collection of metrics that have been used in the social science literature, based on the examination of dissimilarity matrices computed from two data sets: a simulated one comprising various sequence patterns that sociologists may aim at identifying, and an empirical one (about occupational careers) as a “control sample”. Thus, what we are trying to do here is not to point out a hypothetical “best metric”, but rather to unravel the specific patterns to which each alternative is actually more sensitive.
We will successively present a short review of existing methods for sequence analysis, a summary of the comparisons conducted in the literature, our own protocol for comparison, and finally, our results and discussion.
Various Dissimilarity Metrics
The numerous dissimilarity metrics can be broadly grouped into two families: one linked to sequence analysis algorithms and the other to the tradition of geometric data analysis.
“Algorithmic” Sequence Analysis Metrics
Sequence analysis is a set of techniques for handling longitudinal data as ordered strings of elements. Among these techniques, Optimal Matching has been used, discussed and criticized far more often than the others. Optimal Matching was first developed in signal treatment and bioinformatics (Sankoff and Kruskal, 1983), and was introduced into the social sciences by Andrew Abbott during the 1980s (Abbott and Forrest, 1986). Its principle consists in measuring the dissimilarity between two sequences by transforming one sequence into the other. This can be done through three kinds of elementary operations: insertion (one element is inserted into a sequence), deletion (one element is deleted from a sequence) and substitution (one element is replaced by another in a sequence). Each operation can be assigned a cost. Thus, the distance between two sequences is equal to the lowest cost needed to transform one sequence into the other.
To illustrate optimal matching, we will present two imaginary examples of sequences (Table 1). These sequences characterize school-to-work transitions, observed from ages 18 to 29 (made of 12 one-year elements) and with 3 possible states: student (S); unemployed or inactive (U); employed (E). Calvin is at school up to the age of 19, then remains unemployed for three years and finally gets a job at age 23. Hobbes is a student until age 20, experiences two one-year spells of unemployment at ages 21 and 23, and works during the rest of the trajectory. These two sequences can be matched in different ways by means of insertions, deletions and substitutions. For instance, Hobbes’ sequence can be aligned with Calvin’s by deleting an S at the beginning of the sequence, inserting an E at the end and replacing E by U between the U spells: three operations are necessary. Another possibility consists in replacing S by U at 20, E by U at 22 and U by E at 23: this also requires three operations.
Two examples of school-to-work transitions
Indicator matrix of Calvin’s life course
At age 18, Calvin is a student, and not unemployed or in employment, and likewise at age 19 etc…
Cost setting of the elementary operations is a crucial step in optimal matching techniques: “The assignment of transformation costs haunts all optimal matching analyses” (Stovel et al., 1996). It underpins the method’s flexibility and its ability to fit the object of research (Lesnard, 2010). Practically speaking, insertion and deletion are considered as a single operation, called indel, as deleting an element in one sequence is equivalent to inserting it in the other.
Indel operations pay more attention to the order of events in the sequences, by matching identical subsequences located at distinct positions within the sequences. The associated drawback is that they distort the timing of events. Substitutions, on the other hand, preserve the time structure by comparing simultaneous situations, but they alter the sequence of events. The balance between these costs will determine which kind of regularities are best captured in the dissimilarities computed.
Substitution costs are often chosen first. They may be constant whatever the states substituted, or distinct for each pair of possible states. In the latter case, costs can be driven by theoretical issues such as social stratification (Halpin and Chan, 1998; Blair-Loy, 1999; Scherer, 2001), or by the data themselves. Data-driven substitution costs are usually based on transition likelihoods between the states: the more frequent a transition, the more similar the states, the lower the costs. This cost scheme has become widely used in recent years (Han and Moen, 1999; Rohwer and Pötter, 2005; Aassve et al., 2007; Robette and Thibault, 2008). Other strategies can also be imagined; for instance, combining a theoretical hierarchy of states and transition rates (Abbott and Hrycak, 1990; Stovel and Bolan, 2004).
Regarding indel costs, it is their relationship with substitution costs which is central. Some scholars choose equal substitution and indel costs, arguing that there would be no theoretical grounds for acting differently (Dijkstra and Taris, 1995). Thus, the distance between two sequences is equal to the smallest number of operations needed to match them. This is known as the Levenshtein I distance (Levenshtein, 1966). Moreover, optimal matching users often used to set high indel costs. However, with sequences of equal length, if indel costs are higher than the maximum substitution cost multiplied by half the sequence length, insertions and deletions will never be used. This is equivalent to the Hamming distance (Hamming, 1950), based on the simultaneousness of identical elements: the dissimilarity between two sequences is equal to the number of substitutions needed to match them; that is, the number of time units where the situation is different. With sequences of different length, such a high indel cost leads to the use of insertions and deletions only to compensate for the length difference. Conversely, when indel costs are lower than half the minimum substitution cost, only indel operations will be used. The dissimilarity between two sequences thus corresponds to the length of their longest common subsequence (LLCS), which is also called Levenshtein II distance. In the end, cost setting comes down to positioning the cursor between Hamming and Levenshtein II distances, depending on a preference for contemporaneousness of events or the existence of common subsequences (Lesnard and de Saint Pol, 2009).
While the use of optimal matching has increased significantly over the last decade, these techniques have been criticized (Levine, 2000; Wu, 2000; Elzinga, 2003) 3 , giving birth to a “second wave” of sequence analysis (Aisenbrey and Fasang, 2010).
For instance, the choice of costs is often considered as arbitrary and weakly related to theoretical grounds, as elementary operations have no straightforward sociological interpretation (Levine, 2000; Wu, 2000; Elzinga, 2003). As a consequence, distances have no intrinsic meaning. Another criticism focuses on the order of events. Substitution costs are symmetrical, as replacing A by B in a sequence is equivalent to replacing B by A, and so the order of events is not properly taken into account (Wu, 2000). Moreover, optimal matching does not handle the direction of time (Wu, 2000). Indeed if transformation costs are identical at any point in time, then non-linear time dependency of sequences is neglected. With respect to the latter limit, Lesnard proposed a modified version of optimal matching, called Dynamic Hamming Matching, which uses time-varying costs (Lesnard, 2010). Practically, a substitution costs matrix is computed at each point in time, from the transition likelihoods between the various states at this particular moment, and insertions and deletions are prohibited. This variant has been applied in a few studies (Glorieux et al., 2008; Fasang, 2010; Lesnard and Kan, 2011).
Another set of sequence analysis methods, called non alignment techniques, has been developed. These metrics have the distinctive characteristic of not using the elementary operations needed in optimal matching. Their principle consists in calculating the similarity between sequences by comparing pairs of ordered elements of the sequences. DT coefficients (Dijkstra and Taris, 1995) compute the number of common pairs of ordered elements between two sequences. Elzinga’s metrics can be considered as extensions of DT coefficients (Elzinga, 2003; 2006). They are also based on order relationships between pairs of elements. For instance, Elzinga suggested evaluating the longest common prefix between sequences (LCP), the length of the longest common subsequence (LLCS), the number of common subsequences (NCS) or the number of matching subsequences (NMS). The latter has been used successfully in some life course studies (Elzinga and Liefbroer, 2007; Bras et al., 2010; Liefbroer and Elzinga, 2012).
A few recent options for cost setting in optimal matching can also be reported (Gauthier et al., 2009; Hollister, 2009; Halpin, 2010). Lastly, another alternative has been proposed in a recent issue of this journal (Rousset et al., 2012): we refer the reader to the original article for a detailed presentation. It should be noted that the foundations of the Rousset et al. method – taking into account the fact that transition likelihoods may be unequally spread over the life course – are rather similar to those of Lesnard's Dynamic Hamming Matching.
“Geometric Data Analysis Tradition” Metrics
A second family of dissimilarity measures between sequences makes use of the broad range of geometric data analysis techniques. These metrics have seldom been used in the Anglo-Saxon social science literature (Van der Heijden, 1987; Martens, 1994; Van der Heijden et al., 1997). Yet numerous examples exist in French studies, particularly in the field of school-to work transitions and careers 4 . This might be related to the long-standing tradition of correspondence analysis (or “geometrical analysis”) in French statistics (Bry, 1995, 1996; Lebart et al., 2000; Le Roux and Rouanet, 2004).
Several variants exist to apply geometric data analysis to life courses. The main difference between them lies in the way trajectories are coded and on the balance between chi-squared and Euclidean distance.
A first way of coding life courses consists in transforming them into an indicator matrix. In our example, 3 dummy variables are created for each year (one for each possible state): they are equal to 1 if the individual is in the given state during the given year, 0 otherwise. Thus, there are 12*3=36 dummy variables. For instance, Calvin’s trajectory would be coded as follows:
After the coding step, the indicator matrix may used as input data for Correspondence Analysis (CA), which applies the chi-squared distance. However, unscaled Principal Component Analysis (PCA) may be used instead of CA, which implies the Euclidean distance instead of chi-squared (Espinasse, 1993; Béduwé et al., 1995). In the former case, the distance between two elements of the sequence is weighted by the inverse of the variable frequency: states which are infrequent in a given year contribute more to the measure of the dissimilarity between two life courses than the most frequent states. In other words, rare situations are given more importance. On the contrary, Euclidean distance makes the states' contribution equal and corresponds to the number of discordances between sequences. The final dissimilarity matrix is computed from the coordinates on the dimensions of CA or PCA.
Dissimilarity metrics using an indicator matrix and CA or PCA focus on the contemporaneousness of identical situations, whether these identical situations follow each other or are located at distant moments of the life course. The resemblance between trajectories is based upon the duration of simultaneousness in common states. The simultaneousness implies that the timing of situations or events is taken into account. On the other hand, the nature of transitions and their unfolding – in other words, the sequence – are not part of the dissimilarity measure.
Another way of coding life courses can be viewed as a summary of indicator matrices. It is sometimes called Qualitative Harmonic Analysis (QHA) (Deville, 1974; Deville and Saporta, 1980; Dureau et al., 1994; Robette and Thibault, 2008). More precisely, the period under study is divided into sub-periods, then the time spent in each of the states for each sub-period is computed: this creates a number of variables equal to the number of sub-periods multiplied by the number of states (Table 3). This set of variables is used as input for a Correspondence Analysis.
Example of QHA coding of Calvin’s life course
Calvin spent half of the first sub-period (from 18 to 21) as a student, the other half in unemployment, etc.
A number of sub-periods equal to the period’s length is equivalent to PCA with disjunctive coding. On the other hand, a unique sub-period would focus solely on durations in states. Therefore, the choice of the number of sub-periods is a trade-off between these two borderline cases. Moreover, the sub-periods do not have to be of equal duration. On the contrary, short sub-periods can be chosen for eventful years of the life course and longer sub-periods for calmer years. The ability to highlight “dense” moments of the life course (Rindfuss et al., 1987) is one of the major advantages of this metric.
Compared to the codings based on indicator matrices, the division into sub-periods makes the metric less sensitive to exact simultaneousness in common states. In other words, two sequences of states that are identical but slightly shifted will be considered as more similar by QHA metrics than by the two previous ones.
A Few Existing Comparisons
Although comparisons between methods are not a central issue for most authors 5 , a (certain) number of papers test the influence of different alternatives while detailing their methodological protocol.
Some of them choose Optimal Matching and test various cost settings. For instance, drawing data from a study of the diffusion of Morris dancing in England, Forrest and Abbott (1990) introduce variation in substitution costs and conclude that the method seems to behave robustly. Reviewing this work in a further article, Abbott and Hrycak (1990) conclude that “the method thus seems to behave robustly with respect to variation […] in substitution costs […]. As is often the case, while care is needed, differences in minor analytic decisions are unlikely to drastically change results”. Chan (1995) tests three substitution cost matrices on career data and notes “an impressive common core” across clusters and adds that the variation “follows interpretable patterns”. In the same way, Anyadikes-Danes and McVicar perform several sensitivity analyses by introducing changes in substitution and indel costs in a study of school-to-work transitions. These highlight the fact that “relatively well-defined careers show up whether the cost matrix is designed to pick them out or not” (Anyadike-Danes and McVicar, 2010) and that clusters are similar in nature, although there may be differences in cluster size and membership (McVicar and Anyadike-Danes, 2002).
Other studies compare OMA with alternative metrics. Lesnard (2010) applies his own technique – dynamic Hamming matching – to time-use data and it turns out to be quite similar to Hamming distance, while the Levenshtein II distance is a little less sensitive to contemporaneousness. Robette and Thibault (2008) study occupational careers with both Optimal Matching and Qualitative Harmonic Analysis. They note that the main career clusters are very similar whatever the technique, and observe that OMA distinguishes slightly better between stable and mobile careers, while QHA seems slightly more sensitive to the presence of rare states. Aisenbrey and Fasang (2010) complete their review of sequence analysis methods by a comparative overview of dynamic Hamming matching, OMA with transition-based substitution costs and Elzinga’s Number of Matching Subsequences (ignoring durations) applied to school-to-work transitions. The first two metrics lead to the same substantive patterns, despite mild differences in size and sensitivity to temporal variation. Still, NMS “departs more radically”: it finds several “internally homogeneous cluster” and “one extremely heterogeneous clusters […] that comprises more than one half of all cases”.
Finally, Grelet (2002) focuses on Geometric Data Analysis techniques and applies Principal Component Analysis, Correspondence Analysis and Qualitative Harmonic Analysis to school-to-work transitions. The results obtained largely converge, although CA and QHA seem more sensitive to rare situations than PCA.
On the whole, most of these comparisons broadly agree that pattern fishing remains robust, whatever the metric: “As is often the case, while care is needed, differences in minor analytic decisions are unlikely to drastically change results” (Abbott and Hrycak, 1990) 6 .
Another interesting result is that the variations observed in these empirical studies are consistent with the statistical groundings of the metrics. For example, CA and QHA use chi-squared distance, which weights dissimilarities by the inverse of the states' frequency on the sequences: thus it is not surprising that they attach more importance to rare states. In the same way, the use of indel operations in Optimal Matching shifts subsequences backward or forward, which logically makes the Hamming distance (which utilizes no indel operations) less sensitive to sequence and transitions than the Levenshtein II distance (which utilizes only indel operations).
However, these comparisons have several limitations. First, they focus on only a few metrics at a time: none of them tries to encompass a wide range of methods. Second, they all use one given set of empirical data. As a consequence, it is difficult to assess whether the conclusions remain valid beyond the specific case under study. And last but not least, comparisons are always performed by examining clusters of sequences. Yet building a typology of sequences implies a long string of methodological choices: choice of the sequence analysis metric, but also choice of coding, clustering technique, parameterization and number of clusters. If the comparison is led from the end of the string, it is not easy to disentangle the mutual influence of the various steps of the analysis.
The strategy we adopt in the remainder of this paper attempts to overcome – to some extent – these limitations. To do so, we build an artificial set of sequences and then compare the dissimilarity matrices themselves to explore similarities and differences between a large range of sequence analysis metrics.
Comparing Metrics Using a Selected Set of Artificial Sequences
A “Reasoned” Set of Sequences
Most of the papers comparing some of the numerous sequence analysis methods use the empirical data they aim to study. The drawback of this approach is that while the results may be interesting, the extent to which they can be generalized is questionable: it only solves the robustness issue in a very specific context. On the other hand, analyzing randomly simulated data would be pointless, as most of the time actual sequences don’t look like random data at all, particularly for life course analysis. Even if they include a certain amount of diversity, life courses usually have strong regularities and similarities. For these reasons, we chose to build a reasoned set of artificial sequences. This set is designed to contain the various kinds of regularities or differences that life course analysts usually have to address: shifts, swaps, insertions, deletions, replacements, repetitions of spells (Barban and Billari, 2012), etc.
The sequences in our artificial set are of equal length (l = 20), as some metrics do not handle length differences in a straightforward and unambiguous way. Possible states are the following 8 letters: A, B, C, D, E, F, G, H.
Practically, the set of sequences is divided into subsets. Each subset corresponds to a specific sequence of spells, with variable durations in these spells.
#1. Time warping: A subset of sequences A-B-C with varying durations in A, B and C (n = 171)
#2. Shifts: Sequences A-B-C with B spell of fixed length equal to 6 and varying durations in A and C (n = 13)
#3. Reversal: Initial sequences (subset #1) in reversed order – C-B-A (n = 171)
#4. Swaps: Initial sequences (subset #1) with B and C swapped (A-C-B) or A and B swapped (B-A-C) (n = 342)
#5. Total permutation: Initial sequences (subset #1) with all spells swapped – C-A-B and B-C-A (n = 342)
#6. Short insertion: Sequence A-B-C with one short insertion (length = 1) of spell D – D-A-B-C, A-D-B-C, A-B-D-C and A-B-C-D – with varying durations in A, B and C (n = 612)
#7. Long insertion: Sequence A-B-C with one long insertion (length = 10) of spell D – D-A-B-C, A-D-B-C, A-B-D-C and A-B-C-D – with varying durations in A, B and C (n = 144)
#8. Two shorts identical insertions: Sequence A-B-C with two short insertions (length = 1) of spell D – D-A-D-B-C, D-A-B-D-C, D-A-B-C-D, A-D-B-D-C, A-D-B-C-D, and A-B-D-C-D – with varying durations in A, B and C (n = 821)
#9. Two long identical insertions: Sequence A-B-C with two long insertions (length = 7) of spell D – D-A-D-B-C, D-A-B-D-C, D-A-B-C-D, A-D-B-D-C, A-D-B-C-D, and A-B-D-C-D – with varying durations in A, B and C (n = 60)
#10. Two shorts distinct insertions: Sequence A-B-C with two short insertions (length = 1) of spell D and E – D-A-E-B-C, D-A-B-E-C, D-A-B-C-E, A-D-B-E-C, A-D-B-C-E, and A-B-D-C-E – with varying durations in A, B and C (n = 821)
#11. Two long distinct insertions: Sequence A-B-C with two long insertions (length = 7) of spell D and E – D-A-E-B-C, D-A-B-E-C, D-A-B-C-E, A-D-B-E-C, A-D-B-C-E, and A-B-D-C-E – with varying durations in A, B and C (n = 60)
#12. One deletion: Sequence A-B-C with A, B or C spell deleted – B-C, A-C (subset #12a) and A-B (subset #12b), with varying durations in A, B and C (n = 57)
#13. Two deletions: Sequence A-B-C with A and B, B and C or A and C spells deleted – sequences A, B and C (n = 3)
#14. One replacement: Initial sequences (subset #1) with A, B or C spell replaced by F – F-B-C, A-F-C and A-B-F (n = 523)
#15. Two replacements: Initial sequences (subset #1) with A and B, B and C or A and C spells replaced by F and G spells – F-G-C, A-F-G and F-B-G (n = 523)
#16. Three replacements: Initial sequences (subset #1) with A spell replaced by F, B by G and C by H – F-G-H (n = 171)
#17. Repetition of one spell: Sequences A-B-A with varying durations in A and B spells (n = 171)
#18. Repetition of two spells: Sequences A-B-A-B with varying durations in A and B spells (n = 969)
#19. Repetitions of a subsequence: Sequences A-B-A-B-A-B-A-B-A-B-A-B-A-B-A-B-A-B-A-B (n = 1)
#20. Shifted repetitions of a subsequence: Sequences B-A-B-A-B-A-B-A-B-A-B-A-B-A-B-A-B-A-B-A (n = 1)
#21. Repetition of the whole A-B-C sequence: Sequences A-B-C-A-B-C with varying durations in A, B and C spells (n = 11 628)
These subsets represent a total of 17,604 sequences. To reduce computation costs and to balance the weights of the various subsets, in subsets comprising more than 50 sequences, 50 sequences were selected at random. We end up with an artificial set of 854 sequences.
Moreover, to check some of our results against real life course data, we use a sample of 1,341 French male occupational careers drawn from the Biographies et entourage survey of INED in 2000. These data provide a record of occupations held from ages 14 to 50: the length of sequences is constant and equal to 37. There are 9 distinct states: farmer, self-employed, intermediate occupation, higher-level occupation, clerical and sales worker, manual worker, economically inactive, military conscript 7 .
The Set of Metrics
Our aim is to provide comparisons of a comprehensive set of dissimilarity metrics 8 . Regarding sequence-based metrics, we first use optimal matching with various cost schemes. The trade-off between substitution and indel costs determines which kinds of operations will be favoured. On the one hand, the Hamming distance only uses substitutions, while on the other, Levenshtein II distance uses exclusively insertions and deletions. With sequences of equal length, both indel and substitution operations may be used, if indel costs are somewhere between half the minimum substitution cost at one extreme and the maximum substitution cost multiplied by half the sequence length at the other. To uncover the optimal matching process between these two extremes, dissimilarity matrices are computed for the artificial set of sequences, with a constant substitution cost (s = 2) and various indel costs. Cost schemes are then compared by computing correlation – by means of Mantel tests – between these matrices and dissimilarity matrices using the Hamming distance on one side and Levenshtein II distance on the other (Figure 1).

Mantel correlation between dissimilarity matrices with varying optimal matching cost schemes
First, our approach shows a Mantel correlation of 87.7 percent between Hamming and Levenshtein II metrics. As stated before, with an indel cost lower than 1 – half the substitution cost – optimal matching is equivalent to the Levenshtein II metric, which means that only indel operations are used. On the other side, equal indel and substitution costs – Levenshtein I distance – lead to a correlation of 99.4 percent with the Hamming distance: they are almost equivalent. The drop in correlation occurs with an indel cost between 0.5 and 1: an indel equal to 0.75 times the substitution cost can be seen as a median setting.
The same analysis applied to our occupational career data leads to the same results, with even higher correlations: the Mantel correlation between Hamming and Levenshtein II reaches 97.3 percent (Appendix I - document available from the authors and distributed by the BMS-RC33 list.).
On the basis of these first results, the following “algorithmic sequence analysis” metrics are chosen for further analyses: Hamming distance (HAM); Levenshtein II distance (LEVII); optimal matching with substitution costs based on transition likelihoods and a high indel cost (indel = 2) to test the consequences of this common mode of substitution cost setting (OMAtr); Dynamic Hamming Distance (DHD); the variant proposed by Rousset et al. (2012) (ROUS); Elzinga’s Number of Matching Subsequences with durations handled through vector product (NMS). For metrics relating to geometric data analysis techniques, indicator matrix with Principal Component Analysis (PCA), indicator matrix with Correspondence Analysis (CA) and Qualitative Harmonic Analysis (QHA) are chosen 9 .
Life course analysis can be defined as the statistical analysis of the timing of events or states, their sequencing, their quantum – the number of events or episodes – and the durations in the states (Billari, 2005). The metrics we have presented often take several of these dimensions theoretically into account. But in order to compare the extent to which they do so, it can be interesting to add metrics to our analysis that focus exclusively on one of them. The Hamming distance focuses on the timing, through the simultaneousness in the states. For duration, we define a dissimilarity metric equivalent to the sum of squared differences in durations in the various states (DUR). For example, Calvin spends 2 years in S, 3 years in U and 7 years in E, while Hobbes spends 3 years in S, 2 years in U and 7 years in E: the dissimilarity between Calvin's and Hobbes’ sequences is equal to (2-3)2 + (3-2)2 + (7-7)2 = 2. Moreover, sequences can be represented in terms of episodes. Thus, Calvin’s sequence would take the shape S/2; U/3; E/7 and Hobbes’ S/3; U/1; E/1; U/1; E/6, the figures indicating the number of consecutive years spent in the states. We can now define a dissimilarity metric focusing on quantum of episodes (QUA), equivalent to the sum of squared differences in the number of episodes in the states. For instance, Calvin spends 1 episode in each of the 3 states, while Hobbes spends 1 episode in S, 2 in U and 2 in E: the dissimilarity between their sequences is equal to (1-1)2 + (1-2)2 + (1-2)2 = 2. Last, concerning sequence of episodes, we use the Length of the Longest Common Subsequence metric ignoring duration (SEQ) – the Levenshtein II distance on sequences represented as S-U-E and S-U-E-U-E (respectively for Calvin and Hobbes), for example.
We finally have a set of 12 metrics: the 9 examined here and 3 “control” metrics (DUR, QUA and SEQ).
RESULTS
Three Sets of Similar Metrics
Each of these metrics is then used to compute a dissimilarity matrix between the sequences of the “reasoned” set: we thus have 12 dissimilarity matrices. Next the matrices are compared by measuring the Mantel correlation between each pair of matrices (Table 4).
Mantel correlations between dissimilarity matrices, using different metrics
First of all, we notice that the metrics are closer to the duration measure (DUR) – correlations range from 61 percent to 89 percent – than to the sequence measure (SEQ) – from 45 percent to 53 percent – and above all the quantum measure (QUA) – from 20 percent to 38 percent – except for Elzinga's metric (NMS). Indeed the latter has a null correlation with the duration measure but a relatively strong correlation with measures based on quantum (67%) and sequence (52%). Among the others, LEVII seems to better combine these dimensions of temporality: it has the highest correlations with DUR, QUA and SEQ.
More generally, looking at the whole set of correlations, three groups of metrics may be distinguished: a first group that we will call “OM-like” metrics, which comprises LEVII, HAM, OMAtr, DHD, ROUS and PCA; a small group of “CA-like” metrics (CA and QHA); and NMS.
In the first group, correlations range from 83 percent to almost 100 percent. Among others, we see that OMAtr, DHD and ROUS's correlations with Hamming distance (HAM) are above 99 percent: they are almost equivalent. PCA is very close as well, as its correlation with these other four metrics ranges from 95 percent to 98 percent: the emphasis on contemporaneousness brings them together. LEVII, through the use indel operations, gives less priority to contemporaneousness, its correlation with the other “OM-like” metrics nonetheless remains high (from 83% to 91%).
In the second group, CA and QHA are almost equivalent (the correlation is 99.6%). Their correlation with “OM-like” metrics is high: between 62 percent and 76 percent.
NMS, conversely, is totally orthogonal to the other metrics: its correlation is almost null with any of them.
Most of these results still hold when the same approach is applied to actual sequence data, such as male occupational career data from the Biographies et entourage survey (appendix II - document available from the authors and distributed by the BMS-RC33 list.) 10 . The major differences are an even higher homogeneity of “OM-like” metrics (for instance, LEVII and HAM have a correlation of 97.3%) and a lower correlation between “CA-like” and “OM-like” metrics (from 45% to 57%).
Different Ways of Fishing for Patterns
Now that we have a global picture of the relative resemblance between metrics, we intend to uncover what drives the differences between them, in other words to determine the kind of sequence patterns to which each of these metrics is more sensitive.
For each metric, the distances between sequences of the “reasoned” set are computed, the distances are ranked 11 and then the ranks are scaled to make them comparable (the scaled rank is equal to 0 if the sequences are considered identical, and the most distant sequences have a scaled rank of 100).
We then compare the results of the different metrics for the specific sequence patterns one may want to fish (Table 5 - document available from the authors and distributed by the BMS-RC33 list.). For example, if the focus is on time-warping, scaled ranked distances between sequences from the first subset will be studied; that is, ABC sequences with varying durations in A, B and C. If the focus is then on reversals, we will be interested in scaled ranked distances between sequences from the first subset and sequences from the third subset – between ABC and CBA sequences.
This approach gives a similar picture as in the previous section: there are still three distinct groups of metrics, “OM-like” (LEVII, HAM, OMAtr, DHD, ROUS and PCA), “CA-like” (CA and QHA) and NMS.
Compared to “OM-like” metrics, “CA-like” metrics seem more sensitive to insertions of one long spell or two long different spells, and to one or two replacements. Conversely, they are less sensitive to short insertions. This means that “CA-like” metrics more easily capture differences in the universe of states composing sequences, insofar as the states appearing in one sequence and not in the other correspond to long spells. Moreover, these metrics are a little less sensitive to time warping and shifts, reversals, swaps, total permutations and repetitions: they attach less importance to the way and the order in which spells unfold. Lastly, they consider two highly unstable and slightly shifted sequences (subset #19 vs subset #20) as totally similar.
As one might expect from the previous stages of our inquiry, NMS behaves noticeably differently from “OM-like” metrics. It is highly sensitive to repetitions of spells. It is also significantly more sensitive to two insertions, especially when short spells are inserted; that is, when the sequence of spells composing the whole sequence differs, even if the differing spells have a short duration. Furthermore, NMS is less sensitive to differences in duration, i.e. to time warping and shifts – but above all to reversals, swaps, total permutations, deletions and replacements. This appears harder to interpret: it seems that NMS's focus on sequence of spells operates only in specific cases, in particular when “alien” spells are short. Even more strangely, NMS considers ABC and FGH sequences (subset #1 vs. subset #16) as rather similar, although they do not share any common state. On the other hand, two highly unstable and only slightly shifted sequences (subset #19 vs. subset #20) are viewed as totally distinct.
Let us go a little further by comparing metrics from the “OM-like” group. Compared to the Hamming distance (HAM), PCA is somewhat more sensitive to time warping and shifts, reversals, swaps and total permutations, deletions and long insertions. ROUS, OMAtr and DHD are almost equivalent to HAM, except that the two latter and, to a lesser extent, the former, capture replacements a bit more easily. Unsurprisingly, because of the use of indel operations which gives less importance to contemporaneousness, the Levenshtein II distance (LEVII) is less sensitive then HAM to shifts and more notably to total permutations. Moreover, it captures deletions and replacements better. But the main difference among this group regards a special case – that of the comparison between two unstable and slightly shifted sequences (subset #19 vs subset #20): while LEVII, OMAtr and PCA ignore the shift, HAM, DHD and ROUS consider the two sequences as highly distinct.
As far as “CA-like” metrics are concerned, CA and QHA do not differ in a significant way for any kind of pattern.
Conclusion
Since the 1980s, sequence analysis approaches have become widespread in the social sciences, and Optimal Matching has been the leading method. But OMA has been discussed and amended, and other metrics have been proposed. So there is a crucial need for comparisons between existing metrics. They are based on different statistical traditions (algorithmic culture, geometric data analysis, etc.) and all have specificities, particularly in the way they handle the various dimensions of temporality – contemporaneousness, durations or order. We have proposed an approach, based on a “reasoned” set of sequences, to uncover what kind of patterns each sequence analysis method is more able to fish for.
The results confirm what was already suspected: social science sequence data are strongly structured in such a way that the main patterns they conceal will be uncovered by most of the metrics. But as marginal differences may be of importance, it is useful to understand precisely the kinds of sequences to which these differences are tied. We have revealed three groups of heavily converging metrics and the small distinctions between them: 1) Levenshtein II, Hamming and Dynamic Hamming distances, OMA with data-driven substitution costs, the Rousset et al. metric and Principal Component Analysis; 2) Correspondence Analysis and Qualitative Harmonic Analysis; 3) Elzinga's Number of Matching Subsequences. This constitutes a further step towards a better knowledge of the wide range of available sequence analysis methods so that scholars can pick the one that best suits their data design and inquiry objectives.
