A new approach to the analysis of heterogeneous categorical sequences is proposed. The first-order Markov model is employed in a finite mixture setting with initial state and transition probabilities being expressed as functions of time. The expectation–maximization algorithm approach to parameter estimation is implemented in the presence of positive equivalence constraints that determine which observations must be placed in the same class in the solution. The proposed model is applied to a dataset from the British Household Panel Survey to evaluate the association between the education background and life outcomes of study participants. The analysis of the survey data reveals many interesting relationships between the level of education and major life events.
The aim of cluster analysis is to separate observations into distinct groups where members of the same group have similar characteristics. Traditional clustering methodologies include hierarchical procedures (Sneath (1957); Ward (1963); Szekely and Rizzo (2005)) and partition optimization approaches such as the -means algorithm (Hartigan and Wong (1979); MacQueen (1967)) and distribution-based methods making use of finite mixture models (McLachlan and Peel (2004)). Finite mixture models have a wide variety of applications thanks to their interpretability and flexibility that makes them one of the most popular tools in the area of cluster analysis.
Although, the majority of clustering algorithms have been developed for the analysis of quantitative data (Banerjee and Ghosh (2001)), applications to partitioning heterogeneous categorical observations are found in many areas such as animal species classification (Nimmo et al.(2018)Nimmo, Herrmann, Sublette, Melnykov, Helland, Romine, Carsella, Herrmann-Hoesing, Turner, and Vanden Heuvel), customer behaviour prediction (Chang et al.(2007)Chang, Hung, and Ho) and many others. The methodology proposed in this article is largely motivated by our interest in the analysis of data obtained by the British Household Panel Survey (BHPS) (University of Essex and Research (2018); Taylor et al.(2018)Taylor, Brice, Buck, and Prentice-Lane). The survey represents a massive project considering a variety of topics such as education, employment, financial management and health. The aforementioned aspects of the lives of survey participants have been observed over several years, with the occurrence times of major life events being collected. For the purposes of our project, we are mainly interested in exploring the association between the education outcome and life patterns as reflected by the observed sequences of life events. Other common examples of similarly structured categorical sequences include Internet navigation clickstream data and patient medical history logs. A number of methods have been developed for the cluster analysis of categorical sequences from a variety of perspectives. An R package ClickClust (Melnykov(2016a); Melnykov(2016b)) employs an expectation–maximization (EM) algorithm (Dempster et al.(1977)Dempster, Laird, and Rubin) based on a mixture of first-order Markov models. Another R package, clickstream (Scholz (2016)), carries out the classification based on the -means algorithm applied to transition frequency matrices. An R package TraMineR (Gabadinho et al.(2011)Gabadinho, Ritschard, Mueller, and Studer) relies on the traditional hierarchical clustering based on the dissimilarity calculation. Although the above-listed packages are useful in a variety of settings, they do not take into account the potentially temporal character of categorical sequences. The necessity for such an accommodation for the BHPS dataset is immediate to notice as the probability of making certain transitions can be sensitive to the time of event occurrence. As an example, health-related transitions associated with the diseases that are more likely to occur at an older age would make the time variable a crucial element of the model. Hence, in this article, the information about time is taken into account to ensure model adequacy.
Another important aspect of the analysis of the BHPS dataset in this manuscript is the availability of subject background information that should be accounted for. A setting in which the data clustering is carried out in the presence of some additional knowledge about the membership of certain observations falls under the umbrella of the semi-supervised clustering framework. For example, the classification may be known for a portion of the dataset or some observations could be restricted from being assigned to a particular class (Bair (2013); Hennig et al.(2015) Hennig, Meila, Murtagh, and Rocci). A number of semi-supervised clustering procedures have been proposed in literature, some of them based on the -means algorithm (Dinler and Tural (2016); Melnykov and Melnykov (2020)) and some using model-based approaches (Shental et al.(2003)Shental, Bar-Hillel, Hertz, and Weinshall; Melnykov et al.(2016) Melnykov, Melnykov, and Michael). We focus on a specific variation of restrictions, called positive equivalence constraints, that are imposed on the observations that must be assigned to the same class in the solution (Basu et al.(2008) Basu, Davidson, and Wagstaff). In the framework of our article, we are interested in the clustering solution driven by the education level. Thus, people who possess the same degree will be united within the same block of observations and treated jointly in the model as well as clustering outcome. We will further examine this idea in Section 4 where the BHPS dataset is considered and thoroughly analysed. The proposed mathematical model for semi-supervised clustering of categorical sequences is discussed in Section 2. Section 3 describes the performance of the proposed methodology on synthetic data in several experimental settings. The article is concluded with a brief discussion about the overall performance of the model.
Methodolody
Finite mixture models and model-based clustering
Let be a random sample with each following the probability distribution
where . The distributions of this form are known as finite mixture models of order . Each represents the distribution of the mixture component with its vector of parameters , while denotes the mixing proportion defining the weight of this component in the mixture. Mixing proportions follow intuitive constraints: and .
The estimates of parameters in a finite mixture model are usually obtained with the use of maximum likelihood estimation. At the same time, the direct optimization of the likelihood function may not be practical due to its complicated functional form, in which case an iterative approach known as the EM algorithm (Dempster et al.(1977) Dempster, Laird, and Rubin) provides a useful alternative for finding the maximum likelihood estimate (MLE). There are two parts in the EM algorithm that are repeated iteratively until convergence: E-step and M-step. The E-step amounts to finding the conditional expectation of the complete data log-likelihood, often referred to as the Q-function. For the M-step, the Q-function is maximized to obtain the updated estimate of the parameter vector.
From the data clustering perspective, it is important that the posterior probabilities that observation originated from a specific mixture component are readily available from the last E-step. According to the Bayes’ decision rule, observation would be classified to the component with the highest posterior probability. This classification rule can be expressed as , where is the estimated group membership label of observation . In practice, the number of components is usually unknown and needs to be estimated as a part of the main data clustering task. In this article, we utilize the Bayesian information criterion (BIC; Schwarz (1978)), a standard tool for finding the optimal mixture order. The estimated number of components is associated with the minimum BIC value.
Model for time-dependent categorical sequences
Let observation represent an ordered sequence, where each of its elements is a categorical value coded with a natural integer and assume that the number of unique categories equals , that is, for . As a joint probability expression, one can write . In such a setup, the first-order Markov model offers an appealing approach to describing the transitions between different categories, or states. The model assumes that probability of transition to the next state depends only on the current state and has no relationship to any other past states. It should be noted that the first-order Markov model presents a reasonable compromise in many modelling situations. Moreover, if more complex modelling assumptions are needed, it is straightforward to show that higher order Markov models can be reduced to a first-order model by introducing additional categories as available states (Durrett (1999)). On the other hand, the zero-order model would assume that the current state is independent of all past states which would be an unrealistic assumption in many cases.
By employing the first-order Markov model, the joint probability can be expressed as . To simplify the notation, we will use to denote the initial probability and for the transition probability. Then, would be the probability that the initial state is and would stand for the transition probability from state to state . Utilizing this notation, . As there are states in this Markov model, the vector of initial probabilities is given by and the transition matrix is .
Although, in the described model, all initial state and transition probabilities are constants, time can be an essential factor in real-life data. For example, in the context of one's life events, the transition probability from leaving school to entering an institution of higher education would be time-dependent, with this transition being more likely at a younger age. This consideration motivates the treatment of probabilities as functions of time , that is, and , and assumes that each observation consists of a categorical sequence as well as the vector of times of each event occurrence (Zhang et al., 2020). Let be the time vector indexing , then . With the introduction of the finite mixture model framework, and are replaced by and for specific mixture components and we obtain the model
The log-likelihood function corresponding to this model can be expressed as
where is the indicator function and represents the length of the categorical sequence.
By employing the EM algorithm, the Q-function is given by
where and represents the parameter estimates obtained at the previous iteration of the EM algorithm. We will follow the same convention throughout the article, where two dots denote a value from the current iteration, while one dot refers to a value going one iteration back. In the E-step, is updated with
In the M-step, the estimates of mixing proportions are obtained from , while the expressions for and depend on the specific parameterization used. Accounting for the constraints on initial probabilities, and for any , these probabilities can be parameterized in the manner similar to that described by Krzanowski and Marriott (1994), resulting in
and
Similarly, the functions for transition probabilities after taking into account the constraints and can be specified as follows:
and
With the selected functions for initial state and transition probabilities, the Q-function becomes
The maximization of this function needs to be carried out numerically, since closed form expressions are not available for the parameter updates. The authors employed the optimization routine proposed by Nelder and Mead (1965) and available through the R function .
Semi-supervised clustering with positive constraints
Suppose now that a number of positive equivalence constraints have been defined on the dataset. We define the notion of a block as a group of observations that have to be together in the clustering solution. It is easy to see that positive constraints are transitive and we will view a block as the largest group derived from the constraints that bind multiple points. In the framework of this article focusing on the analysis of BHPS data, we are interested in detecting homogeneous groups driven by the similarity in educational background rather than other socio-economic characteristics. As a result, we impose positive constraints on individuals with a similar degree received. The implementation of blocks adds the necessary context to our clustering application. While an unrestricted solution is likely to provide a nominally closer fit to the BHPS dataset, our focus on the educational background requires treating all individuals that share the same background as a block and finding the integral characteristics of such a block as a whole. Moreover, similar blocks can be modelled together and thus combined into clusters. Mathematically, the incorporation of the block-related information into the EM algorithm can be seen as the task of optimization over the constrained parameter space. The log-likelihood function provided in equation (2.3) cannot immediately accommodate these types of constraints, but all necessary adjustments can be made in the framework of the EM algorithm as discussed in Melnykov et al.(2016) Melnykov, Melnykov, and Michael.
Let us suppose that there are blocks in total and is the set of indices of the observations that belong to the block and therefore should be treated jointly. The block that observation belongs to can be identified as , and it follows that , where is the membership label for observation . It is readily seen that the collection of blocks defines a partition of the full set of indices , that is, and for all .
Denoting all positive constraints by and incorporating them into the E-step of the EM algorithm, the posterior probabilities can be re-written as
Since for all , the posterior probability may now be associated with the whole block and denoted by . Then, applying this methodology to equation (2.5) in the case of time-dependent categorical sequences, we obtain
where is the cardinality of . The corresponding Q-function can now be adjusted as follows:
Additional variables and variable selection
The traditional mixture modelling framework assumes that the probability of an observation to belong to the cluster is constant and given by mixing proportion . However, various aspects can affect the chances of a data point to enter a specific data group. In our setting, variables reflecting the level of parents’ education may have an influence on the magnitude of probablity . To include the information on these variables into the model, the mixing proportions are considered as their functions instead of constants. Suppose that in the finite mixture model (2.2) each mixing proportion is now treated as a function of some vector of additional variables and denotes the vector of coefficients specific to a particular mixing proportion. We define , where . The mixing proportions must still adhere to the constraints and , for all . The most popular parameterization approach within this framework is the multinomial logit model that leads to the following functions for the mixing proportions:
With additional variables introduced to the model, the posterior probabilities are updated as
The corresponding Q-function is given by
With the time-dependent first-order Markov model, the corresponding Q-function accordingly changes to
Once again, R function optim() was utilized to maximize this modified -function.
Computational issues
In the blocks of substantial size, both the numerator and denominator of (2.15) can be very small since they arise from the products of pdfs and probabilities. In order to avoid potential computational problems with the handling of small numbers, the following expression can be used for the computer implementation of the posterior probability calculations:
where . In equation (2.18), the ratios of components as well as proportions are calculated first instead of the products of individual probabilities.
Another computational issue is related to the calculation of transition and initial state probabilities as they occasionally can be very small. In the course of computations, this can lead to an overflow (or underflow). To avoid this consequence of having overly low probabilities, we used as the lower bound for all transition probabilities. Higher transition probabilities were adjusted proportionately to keep all constraints within the transition probability matrices satisfied.
Finally, the initial parameter values can have a significant influence on the convergence of the EM algorithm. In the experiments, we used a modified emEM method described by Biernacki et al.(2003) Biernacki, Celeux, and Govaert. After the random initialization, this method uses the EM algorithm in a short (em) stage to determine the promising sets of parameter values for running the longer (EM) stage until convergence. In our experiments, 20 random seeds were used during the short stage that was three-iteration-long. During the long stage, only the solution that produced the highest log-likelihood value was pursued further.
Experimental validation
This section describes a series of simulation studies to analyse the performance of the proposed methodology. For simplicity, we assume that the sequence length of all observations remains the same, that is, for all . Table 1 lists the values of parameters used in the models described by equations (2.6)–(2.9) and (2.17). In particular, it is assumed that two binary variables, and , are going to play a key role in the process of classification leading to . Consequently, is the corresponding vector of coefficients in the logit model, with representing the intercept in the expression of equation (2.14). are the parameters of the initial state probabilities in the expressions (2.6) and (2.7) and are the parameters for the transition probabilities in (2.8) and (2.9). In this experiment, all observations are generated from a mixture with components. Variables Bernoulli(), are generated during the first step in order to determine the mixing proportions associated with each component early in the process. During the next step, time was generated so that the initial state probabilities and transition probabilities for each component could be determined. In the spirit of recording the life events of an individual, we consider which would account for a 100-year lifespan and generate the initial between 0 and 20 from a uniform distribution. All subsequent values of were generated from uniform(20, 100) and ordered to imitate a sequence of life event occurrences. At this point, all initial probabilities and transition probabilities can be determined. At the final step that determines the positive constraints, we defined five blocks in such a manner that all observations from the first component formed one block, while observations from the two other components formed two blocks per component in equal proportions.
Parameters of the model used in the simulation study of Section 3
(αk0, αk1, αk2)
(ak1,bk1)
(ak2,bk2)
(ak3,bk3)
(c1j1,d1j1)
(c1j2,d1j2)
(c1j3,d1j3)
k = 1 (−2.02, 0.95, −0.90)
(−2.1, −0.9)
(1.7, −0.7)
(−2.2, 0.4) j = 1
(−9, 0.1)
(−3.1, −0.5)
(3.7, −0.1)
k = 2 (0.95, −0.51, −1.41)
(−0.2, 0.6)
(1.9, −1)
(0.9, 0.5) j = 2
(9.8, −0.2)
(3.5, −0.1)
(8.8, −0.5)
k = 3 (−0.90, −1.51, 1.31)
(1.6, −0.3)
(3.3, −0.9)
(−3.7, −0.5) j = 3
(4, −0.6)
(8.5, −0.3)
(7, −0.5)
j = 4
(−9.9, 0.2)
(−9.6, 0.2)
(−9, 0.1)
(c2j1,d2j1)
(c2j2,d2j2)
(c2j3,d2j3)
(c3j1,d3j1)
(c3j2,d3j2)
(c3j3,d3j3)
j = 1
(−7.5, 0.2)
(−9.5, 0.3)
(−5.9, 0.2) j = 1
(4, −0.1)
(−4.9, 0.2)
(7.8, −0.2)
j = 2
(−6.9, 0.1)
(10, −0.2)
(−8.2, 0.1) j = 2
(3.4, 0.8)
(−1.3, 0.9)
(8.4, 0.7)
j = 3
(−7.3, −0.6)
(7.1, −0.1)
(0.5, 0) j = 3
(−7.5, 0.1)
(6.8, −0.4)
(0.6, −0.2)
j = 4
(9.9, −0.5)
(−9.3, 0.1)
(3.8, −0.3) j = 4
(−8.7, 0.3)
(−0.4, 0.2)
(7, 0.1)
The simulations were performed for the sequences of and events in length and with and observations, thus leading to nine simulation scenarios. Each of these scenarios was replicated 100 times. Taking into account the fact that five blocks were defined, the proposed solution can have no more than five components, leading us to evaluate and . We used BIC as the criterion for identifying the best model and also evaluated the classification obtained in the solution comparing it with the true classification by means of the adjusted Rand index (ARI Hubert and Arabie (1985)). The upper bound of ARI is 1, which would only occur in the case of a complete agreement between the two partitions. Lower values of ARI indicate some disagreement between the two classifications, while 0 represents the expected level of agreement under a completely random classification.
The simulation results are presented in Table 2. We observe that in the case of and the correct number of components was identified only in of cases. There is also an apparent tendency to underestimate the number of components as was chosen as the best model of the time and even captured . This simulation scenario also produced the lowest average ARI value of 0.747, while the standard deviation was notably high at 0.278. However, the increase in the length of the sequence leads to substantial improvements even for the same sample size. Thus, with 50 elements in the sequence, was estimated correctly in of cases and both average ARI and the standard deviation improved to 0.866 and 0.235, respectively. This tendency remains very pronounced with all sample sizes in our simulation. As expected, the results improve substantially for larger sample sizes. In particular, in the best case with and the mean ARI is 0.926 and the correct mixture order is detected in of cases. Overall, we can conclude that the model as well as underlying classification can be estimated rather accurately, especially in those cases when and are sufficiently high.
Simulation results. ARI and sARI are the average adjusted Rand index and corresponding standard error, respectively
m = 10
m = 25
m = 50
Kˆ = 1
2
1
0
Kˆ = 2
42
29
23
Kˆ = 3
52
67
75
n = 100
Kˆ = 4
4
3
2
Kˆ = 5
0
0
0
ARI
0.747
0.828
0.866
sARI
0.278
0.259
0.235
Kˆ = 1
2
1
0
Kˆ = 2
38
26
15
Kˆ = 3
59
73
84
n = 250
Kˆ = 4
1
0
1
Kˆ = 5
0
0
0
ARI
0.790
0.851
0.918
sARI
0.268
0.258
0.195
Kˆ = 1
1
0
0
Kˆ = 2
22
17
14
Kˆ = 3
77
83
86
n = 500
Kˆ = 4
0
0
0
Kˆ = 5
0
0
0
ARI
0.882
0.910
0.926
sARI
0.226
0.206
0.190
Application
In this section, we apply the proposed methodology to the data obtained from the BHPS. BHPS was a longitudinal study that began in 1991 and followed the same individuals for 17 years. The households in the survey were selected based on the postcode address files in 250 areas of Great Britain to form a nationally representative sample. In our project, we focus on the fraction of the data that corresponds to the most senior female in each household to ensure that the independence of observations assumption is not violated by the presence of within-household correlations. As described in the Introduction, BHPS followed major developments in the lives of survey participants in the areas of health, education, family, etc. The age of each subject at the time of occurrence for each of numerous characteristics observed was recorded. In our study, we focus on the seven major life events listed in Table 3. Thus, each observation is represented by a sequence of events falling into one of the seven categories listed along with the age at which a particular event occurred. After the data cleaning, 1 934 observations were obtained.
Seven major life events considered in the study of Section 4. Abbr represents the abbreviation of the description for each state
State
Description
Abbr
State
Description
Abbr
1
School left
SL
5
Job left
JL
2
Marriage
MR
6
Health state worsened
HW
3
Child born
CH
7
Financial state worsened
FW
4
Job accepted
JA
With respect to their education, the individuals in the study were placed in 12 categories based on the highest education level achieved as shown in Table 4. In the semi-supervised framework described in Section 2.3, people who have the same education level are grouped with the aid of positive constraints to form 12 education-driven blocks. This approach ensures that the focus is kept on the life outcomes of the individuals that share the same educational background.
Twelve education qualification categories
Level
Description
1
University or CNAA Higher Degree
2
University or CNAA First Degree
3
Teaching Qualifications
City & Guilds Certificate (Rull Technological/Part III), HNC, HND,
4
BEC/TEC/BTEC Higher Certificate/Diploma, University Diploma. Any other technical, professional or higher qualifications
5
Nursing Qualifications
A Levels, Scottish Higher Grades, Scottish School Leaving Certificate
6
Higher Grade, Scottish Certificate of Sixth Year Studies, Higher School Certificate, Ordinary National Certificate/Diploma, BEC/TEC/BTEC National/General
Certificate or Diploma or City & Guilds Certificate (Advanced/Final/Part II) O Levels (pre-1975), O Level grades A–C (1975 or later), GCSE grades A–C, CSE grade 1, Scottish O Grades (pass or bands A–C or 1–3), Scottish School
7
Leaving Certificate Lower Grade, School Certificate or Matric, Scottish Standard
Grade Level 1–3 or City & Guilds Certificate (Craft/Intermediate/Ordinary/Part I)
8
Clerical or Commercial Qualifications
9
CSE Grades 2–5, O Level grades D–E, GCSE grades D–G, Scottish SCE Ordinary Grade bands D–E or 4–5 or Scottish Standard Grade levels 4–7
10
Recognized trade apprenticeship
11
Youth Training Certificate, any other qualifications
12
No qualifications
In the process of seeking the optimal model, we started with a finite mixture assuming constant mixing proportions. The models with varying order were evaluated and the minimum BIC value of 54031.56 was obtained for . In this solution, one cluster consisted of blocks 1 and 2, and the other contained the remaining 10 blocks. The posterior probabilities are shown in Table 5. The posterior probability for each block is either 1 or 0 making the clustering results for each block immediately clear. With respect to the education levels listed in Table 4, this solution has a clear interpretation stating that the life patterns of women with university or CNAA degrees are different from those of other women. Then, we utilize the approach of Section 2.4 and forward model selection technique to identify the set of variables that cause a decrease in the BIC value and thus lead to the model improvement by providing a better explanation of the effect of variables on the assignment to one of the two clusters. Being interested in the education level of study subjects, we introduce the variables that describe the educational background of their parents as shown in Table 6. There are two additional variables, each with four education levels as shown in the table. We denote their corresponding ‘truth’ values by 1. For example, means that the mother left school with no qualifications and says that the father received a university degree. The implemented forward selection procedure included as the first variable with following it. Both variables differentiate among the survey participants whose parents had or did not have university degrees. The selection process stopped after this step as the next candidate variable, , did not lead to a lower BIC. The corresponding BIC and log-likelihood values of the models encountered during the forward selection process are listed in Table 7.
Posterior probabilities for the 12 blocks per Table 4
Block
1
2
3
4
5
6
7
8
9
10
11
12
First component
1
1
0
0
0
0
0
0
0
0
0
0
Second component
0
0
1
1
1
1
1
1
1
1
1
1
Additional variables selected for the mixing proportion function
Var.
Mother’s education
Var.
Father’s education
1
Mother left school with no qualifications
5
Father left school with no qualifications
2
Mother left school with some qualifications
6
Father left school with some qualifications
3
Mother got further education qualification
7
Father got further education qualifications
4
Mother got university/higher degree
8
Father got university/higher degree
BIC values in the forward selection procedure
Variables
none
x4
(x4, x8)
Log-likelihood
−26 285.53
−26 261.04
−26 054.82
BIC value
54 031.56
53 997.71
53 600.41
Table 8 shows the mixing proportions that correspond to the four possible combinations of variables and included into the model. Recall that the first cluster represents blocks 1 and 2 as listed in Table 4, while the second cluster combines the rest of the blocks. It can be clearly seen that when neither mother nor father have a higher education background, their daughters are much less likely to obtain a university or CNAA degree themselves. The chances to enter the first cluster are just 0.048 versus 0.952 corresponding to the second group. However, if at least one of the parents has higher education training, the probability of their child's being in the first cluster increases substantially. In particular, when both mother and father received a high-level degree, this probability reaches its maximum value at 0.594, a more than tenfold difference compared to the similar probability for children without university educated parents. It is also noteworthy that mother's education appears to play a bigger role in the future education of a child. Thus, in families where the mother was the only parent with a university degree, the proportion of children with a CNAA or university degree stands at 0.243, while a similar proportion for families where the father was the only university educated parent reaches only 0.188. In conclusion, it is clear that the education level of parents plays a vital role in the education attained by their daughters.
Mixing proportions corresponding to the combinations of x4 and x8
(x4,x8)
(0,0)
(0,1)
(1,0)
(1,1)
First cluster
0.048
0.188
0.243
0.594
Second cluster
0.952
0.812
0.757
0.406
It should be noted that the mixing proportions displayed in Table 8 are relatively close to the relative frequency values obtained as empirical fractions shown in Table 9. As an example, the fraction of observations with and that belong to education categories 1 or 2 equals 0.048 and the same value is reported for the corresponding mixing proportion. This fact suggests that nearly crisp assignments to clusters occur when both parents do not have a university degree. The other -combinations produce more distinct values, with the largest discrepancies observed for , that is, in cases when mother received a university or higher degree.
Relative frequency table corresponding to the combinations of x4 and x8
(x4,x8)
(0,0)
(0,1)
(1,0)
(1,1)
First cluster
0.048
0.207
0.313
0.529
Second cluster
0.952
0.793
0.687
0.471
We proceed by exploring the initial state and transition probabilities for each of the two detected mixture components. The examination of the plots of initial probabilities shown in Figure 1 reveals important differences between the two clusters. Both plots show the probability of a specific state on the vertical axis and the age of an individual on the horizontal axis. It can be noticed that probabilities of states behave as functions of the age. For simplicity, the life span from 0 to 100 years was considered in these and all subsequent graphs. In addition, the solid vertical lines in these plots mark the interval with the more reliable model fit as they correspond to the data domain, while reaching outside of this region should be done with caution due to extrapolation. In the cluster of well-educated women, finishing school was the first significant event in the lives of of survey participants as indicated by the solid yellow block representing state 1 from Table 3. At the same time, in the second cluster, marriage (or state 2) turns out to be a common first major life event after age 20 and, in fact, completely replaces finishing school by age 25.
Initial state probability plots associated with the first (left) and second (right) mixture components discussed in Section 4. The seven states represent major life events listed in Table 3
While the initial state probabilities show a clear-cut picture, the transition probabilities describe more complicated life event dynamics presented in Figure 2. Each of the presented plots reflects age-dependent transition probabilities from a particular state.
Transition probability plots associated with the first (top) and second (bottom) mixture components discussed in Section 4. The seven states represent major life events listed in Table 3
State 1 (school left). In both clusters, the majority of women get married after completing their education. Still, there are certain differencs between the two clusters. Thus, in the first component, a considerable number of women will also make a transition to state 3 (child born) and state 4 (job accepted) after age 25. However, in the second component both of these transitions are much less common. Instead, at this age, there is a substantial proportion of women moving to state 5 (job left).
State 2 (marriage). For the high-educated women in the first cluster, the two prevailng transitions are into state 3 (child born) and state 4 (job accepted) with the former being more common in younger age and the latter prevailing in the older age. Leaving one's job is also not an uncommon transition in this state. On the other hand, the second cluster shows a larger variety of transitions. In addition to the outcomes described for women in cluster 1, there is a large proportion of women whose financial situation worsens, especially, after age 50. There is also a notable proportion of those getting married repeatedly.
State 3 (child born). In both clusters, when a woman is younger than 30, it is very common for her to have another baby, while after age 30, getting a job is the more likely transition in both groups. Also, by age 50, about of women in both groups encounter financial issues. The main difference between the two clusters is in the balance between the transitions into state 4 (job accepted) and state 5 (job left). While the latter is almost non-existent in the first cluster, it is prominent in the second one, especially, after 50. Seemingly, the women in cluster 1 show more career dynamics after having a child by taking on a new job or switching jobs even at age 50, while the women in cluster 2 are more likely to continue with the job that they currently have and leave it eventually.
State 4 (job accepted). In state 4, both clusters show a number of similarities. It is common to switch jobs starting at 25 and up until about 75 years of age. Another similarity is the worsening financial situation during the later years, with over chance of this transition around age 70. After 50, roughly similar proportions of women leave their job or experience worsening health condition. On the other hand, some differences between the two clusters occur during the younger age. In particular, for well-educated women, it is very common to get married about age 25, while in the second cluster only a minority of women experience this particular transition. Instead, a large proportion of women in the second cluster will have a child in their late teens or early 20s after having accepted a job. State 5 (job left). The patterns associated with this state are once again similar for the two clusters. Understandably, large proportions are associated with a worsening financial situation (state 7), especially, after 50. It appears that university educated women are more successful in finding a new job and do not experience health problems (state 6) to the same extent as the women with education below the university level. State 6 (health state worsened). The observation range in the first cluster is fairly short in this state, limiting the possibilities for comparisons to the second cluster. Still, it is notable that the proportion of women leaving their job after experiencing health problems grows steadily with age in the first cluster, while in the second one it remains roughly the same. After age 50, a growing proportion of women in the second cluster experience even worse health conditions, while this is not the case in cluster 1. State 7 (financial state worsened). Facing financial difficulties, younger university educated women tend to take a new job, while after age 50, it becomes more common for them to leave their current job. In the second cluster, finding a new job is also a common transition between ages 25 and 50, but leaving the current job is much less typical after 50 than in the first cluster. At the same time, in the second cluster, it is common to experience further financial problems, with about facing more difficulties by age 25 and about by age 85. This dynamic is completely absent from cluster 1, where the participants did not report deepening financial problems. Overall, the transition probabilities reveal considerable differences in life patterns for women with different education levels. Women who received higher education tend to have lives that are more conventional and career-oriented. They are also better equipped to handle such challenges as the loss of a job or worsening financial situation. The women who did not receive higher education are more likely to fall into a vicious cycle of accumulating financial and health problems, especially, when they are older.
Discussion
The article introduced a method for semi-supervised clustering of time-dependent categorical sequences. The model includes a mechanism for the selection of additional external variables that can affect the mixing proportions. Both the synthetic simulation study and analysis of real-life data showed interpretable and intuitive results. The application of this methodology to the BHPS data revealed the effect that educational background has on the life pattern of an individual. The subjects in the BHPS study can essentially be separated into two distinct groups: those who obtained a university degree and those who did not. The well-educated individuals of the first group exhibit conventional life patterns and are better equipped to cope with potential life challenges. At the same time, the subjects in the second cluster are more likely to experience adversity and fall into a cycle of compounding financial and health problems.
The strong connection that exists between the education of parents and their children was also confirmed. Those from the families of university educated parents are substantially more likely to receive a university degree themselves, especially, if both of their parents followed down this path. However, if only one of the parents completed a university degree, mother's education was a stronger predictor of the child's success than father's education. The proposed model can be applied to time-dependent data not only in sociology, but also in other areas dealing with categorical sequences such as medicine or financial industry.
Footnotes
Declaration of Conflicting Interests
The authors declared no potential conflicts of interest with respect to the research, authorship and/or publication of this article.
Funding
The authors received no financial support for the research, authorship and/or publication of this article.
BanerjeeA and GhoshJ (2001) Clickstream clustering using weighted longest common subsequences. In Proceedings of the Web Mining Workshop at the 1st SIAM Conference on Data Mining, Vol. 143, page 144, Chicago, Illinois, USA.
3.
BasuS, DavidsonI and WagstaffK (2008) Constrained Clustering: Advances in Algorithms, Theory, and Applications. Boca Raton, FL: CRC Press.
4.
BiernackiC, CeleuxG and GovaertG (2003) Choosing starting values for the em algorithm for getting the highest likelihood in multivariate Gaussian mixture models. Computational Statistics & Data Analysis, 41, 561–75.
5.
ChangH-J, HungL-P and HoC-L (2007) An anticipation model of potential customers purchasing behavior based on clustering analysis and association rules analysis. Expert Systems with Applications, 32, 753–64.
6.
DempsterAP, LairdNM and RubinDB (1977) Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society: Series B (Methodological), 39, 1–22.
7.
DinlerD and TuralMK (2016) A survey of constrained clustering. In Unsupervised Learning Algorithms, pages 207–235. Cham: Springer.
8.
DurrettR (1999) Essentials of Stochastic Processes. Vol. 1. Berlin: Springer.
9.
GabadinhoA, RitschardG, MuellerNS and StuderM (2011) Analyzing and visualizing state sequences in R with TraMineR. Journal of Statistical Software, 40, 1–37.
10.
HartiganJA and WongMA (1979) Algorithm as 136: A K-means clustering algorithm. Journal of the Royal Statistical Society. Series C (Applied Statistics), 28, 100–108.
11.
HennigC, MeilaM, MurtaghF and RocciR (2015) Handbook of Cluster Analysis. CRC Press.
12.
HubertL and ArabieP (1985) Comparing partitions. Journal of Classification, 2, 193–218.
13.
Institute for Social Economic Research (2018) British Household Panel Survey: Waves 1–18, 1991–2009. URL http://doi.org/10.5255/UKDA-SN-5151-2(last accessed 18 January 2021).
14.
KrzanowskiW and MarriottF (1994) Multivariate Analysis: Part I, Distributions, Ordination and Inference. London: Arnold Publication.
15.
MacQueenJ (1967) Some methods for classification and analysis of multivariate observations. Proceedings of the Fifth Berkeley Symposium, 1, 281–97.
16.
McLachlanG and PeelD (2004) Finite Mixture Models. Hoboken, NJ: John Wiley & Sons.
17.
MelnykovI and MelnykovV (2020) A note on the formal implementation of the K-means algorithm with hard positive and negative constraints. Journal of Classification. doi: 10.1007/s00357-019-09349-x.
18.
MelnykovV (2016a) Model-based biclustering of clickstream data. Computational Statistics & Data Analysis, 93, 31–45.
19.
MelnykovV (2016b). Clickclust: An R package for model-based clustering of categorical sequences. Journal of Statistical Software, 74, 1–34.
20.
MelnykovV, MelnykovI and MichaelS (2016) Semi-supervised model-based clustering with positive and begative constraints. Advances in Data analysis and Classification, 10, 327–49.
21.
NelderJA and MeadR (1965) A simplex method for function minimization. The Computer Journal, 7, 308–13.
22.
NimmoDWR, HerrmannSJ, SubletteJE, MelnykovIV, HellandLK, RomineJA, CarsellaJS, Herrmann-HoesingLM, TurnerJA and Vanden HeuvelBD (2018) Occurrence of Chironomid species (Diptera: Chironomidae) in the high Se-78 concentrations and high pH of Fountain Creek Watershed, Colorado, USA. Western North American Naturalist, 78, 39–64. doi: 10.3398/064.078.0106.
23.
ScholzM (2016) R package clickstream: Analyzing clickstream data with Markov chains. Journal of Statistical Software, 74, 1–17.
24.
SchwarzG. (1978) Estimating the dimension of a model. The Annals of Statistics, 6, 461–64.
25.
ShentalN, Bar-HillelA, HertzT and WeinshallD (2003) Computing Gaussian mixture models with EM using equivalence constraints. In Advances in NIPS, edited by ThrunS., SaulL. K. and Sch–lkopfB.. Vol. 15, pages 465–472. Cambridge, MA: MIT Press.
26.
SneathP (1957) The application of computers to taxonomy. Journal of General Microbiology, 17, 201–26.
27.
SzekelyGJ and RizzoML (2005) Hierarchical clustering via joint between-within distances: Extending ward’s minimum variance method. Journal of Classification, 22, 151–83.
28.
TaylorMF, BriceJ, BuckN and Prentice-LaneE (2018) British Household Panel Survey User Manual Volume A: Introduction, Technical Report and Appendices. Colchester: University of Essex.
29.
University of Essex I. f. S. Research E. (2018). British household panel survey: Waves1-18, 1991-2009. http://doi.org/10.5255/UKDA-SN-5151-2.
30.
WardJH (1963) Hierarchical grouping to optimize an objective function. Journal of the American Statistical Association, 58, 236–44.
31.
ZhangY, MelnykovV and ZhuX (2020) Model-based clustering of time-dependent categorical sequences with application to the analysis of major life event patterns. Statistical Analysis and Data Mining. doi: 10.1002/sam.11502.