Abstract
Characterizing longevity profiles from longitudinal studies poses several challenges. Longitudinal databases usually have high dimensionality, and the strong similarity between long-lived and non-long-lived records makes profile characterization difficult. To address these issues, we use data from the English Longitudinal Study of Ageing (ELSA-UK) to characterize longevity profiles through data mining. We propose a feature engineering method that reduces data dimensionality through merging techniques, factor analysis, and biclustering. We apply biclustering to select relevant features that discriminate both profiles. Two classification models, one based on a decision tree and the other on a random forest, are built from the preprocessed dataset. Experiments show that our methodology can successfully discriminate longevity profiles. We identify insights into features contributing to individuals being long-lived or non-long-lived. According to the results of both models, the main factor that impacts longevity is related to the correlations between the economic situation and the mobility of the elderly. We suggest that this methodology can be applied to identify longevity profiles from other longitudinal studies since that factor is deemed relevant for profile classification.
Introduction
Several ongoing studies of the human aging process seek to understand the aspects that influence human longevity. These studies are essential because world life expectancy is increasing, a consequence of improved living conditions derived from, e.g., greater access to sanitation, food, and vaccines [1]. This increase may affect socioeconomic structures if it is not supported by social policies. In 2022, 10% of the world population was over 65 years old, and forecasts indicate that this rate will reach 16% in 2050 [2].
Efforts to understand human aging have taken the form of interdisciplinary studies that connect areas such as medicine, psychology, social sciences, and economics. This integration of knowledge has made it possible to obtain valuable information and identify the aspects that most affect human longevity.
Studies on human aging are usually longitudinal. That is, a group of individuals is observed repeatedly at fixed intervals (called waves), and each observation monitors roughly the same aspects related to aging. Studies have shown that several conditions can influence aging, such as socioeconomic conditions, housing, retirement, and physical and psychological health [3, 4]. This variety of information makes longitudinal aging studies a good source for comprehensive studies about aging in a society [3].
Studies on human aging are being conducted in various countries, such as China,1 the United States,2 Ireland,3 Korea,4 Brazil,5 and England [5]. The latter seems comprehensive, as it covers socioeconomic conditions and health aspects, and its data have been collected every two years since 2002 through questionnaire surveys of participants aged 50 and over [3, 6]. According to data from The World Bank,6 since 2011, people in the UK have been considered long-lived once they reach the age of 81.
Data from ELSA (the English Longitudinal Study of Ageing in the UK) have been used to investigate several health issues. Examples include correlations between depression and comorbidities [7, 8, 9], comorbidities in obese individuals [10, 11], mortality over a given life period [12, 13], correlations between social frailty (such as financial difficulty, household status, and limited social activities, among others) and all-cause mortality [14], and comparisons of indexes of disability and nutritional status [15].
In the data mining context, the authors of [6] describe a data preparation process for longitudinal data. They propose a new semi-supervised approach to characterize long-lived and non-long-lived profiles from the ELSA. The authors introduced a new concept of grouping feature sets by merging variables within a common aspect and classifying records as long-lived and non-long-lived. Characterizing longevity profiles in longitudinal datasets is nevertheless challenging, mainly due to the high dimensionality of the databases and the high similarity among records. For example, for algorithms based on dissimilarity measures, the higher the dimensionality of a dataset, the smaller the measured difference between records [16, 6].
In [17], biclustering and feature selection methods applied to data originating from genomic research are discussed. Records (experimental conditions) in these data contain thousands of features (genes), which makes clustering with conventional algorithms difficult. For this type of dataset, which is affected by high dimensionality, biclustering and feature selection have favored the discovery of significant subsets that provide underlying and actionable knowledge for the domain. A classifier based on discriminative biclusters with non-constant patterns was proposed in [18] to assist in the classification of medical and biological data; its authors suggest that this is a new field to be considered in research on classification algorithms.
To characterize the longevity profiles, in this work we propose a feature engineering procedure based on factor analysis and biclustering. Biclustering, applied to the ELSA dataset, is used to discover records that are similar on a subset of features and thereby characterize the longevity profiles. However, finding all subgroups in a dataset requires exhaustive search algorithms. These algorithms have an exponential cost for searching biclusters, which demands prohibitive processing times. For this reason, we reduce the dimensionality of the dichotomous variables using the statistical method of Factor Analysis (FA) [19]. This method identifies correlated variables to define new unobserved variables called factors. The factors that best characterize the profiles are pre-selected through biclustering and added to the classification dataset, which consists of ordinal categorical variables. This enriches the dataset and improves the predictive accuracy of the resulting decision tree classifier. Based on the results and the extracted rules, we interpret the factors/aspects that can influence human longevity.
This article is organized as follows. Section 2 presents the theoretical frameworks that are the basis for this study. Section 3 describes the methodology for carrying out this work. We present the classification results and discussion in Section 4. In Section 5, we suggest how to apply the knowledge acquired through our results to improve the development of public policies related to human aging. Finally, Section 6 presents our conclusions and future work.
Background
The ELSA-UK database
The ELSA is one of the broadest population studies in the world, containing features describing physical and mental health, quality of life, genetic inheritance, and social and economic conditions of a population [6]. When an ELSA participant dies, a relative or close friend responds to an end-of-life questionnaire reporting on the health of the deceased before death. For this study, participants who died of non-accidental causes before reaching longevity are classified as non-long-lived, while participants who have reached longevity receive the long-lived label.
The ELSA is currently in its ninth wave, but we only processed the first six waves of ELSA. We kept only the last available record of respondents classified as non-long-lived, and the first record of those classified as long-lived. Thus, we minimize the average age gap between respondents in both classes. Intuitively, the more distant the ages of respondents from different profiles, the greater the chance of obtaining a wrong conclusion about the problem investigated.
In this work, we utilize variables from the preprocessed dataset proposed by [6]. We reduce the dimensionality of dichotomous variables through factor analysis, selecting the variables (called factors) that better characterize the longevity profiles through biclustering. We add the selected ones to the dataset with merged features to build a classification model.
Factor analysis (FA)
Factor Analysis is a statistical method that describes the correlation between the dataset variables through a set of linear combinations of the variables, known as latent factors. The variables most correlated with each other belong to the same group, with each group (set of variables) represented by a factor. Therefore, dimensionality reduction through FA generates a number of factors smaller than the number of original variables [20].
Factor analysis (FA), applied on a dataset represented by Eq. (1), is described as follows:
Consider a vector of the average values of the
and the covariance and correlation matrices, respectively
Where
The factors model in matrix notation is given by Eq. (3).
For the matrix
We utilized the following methods to estimate the ideal number of factors
Scree Analysis: a method proposed by [21] that plots the eigenvalues and uses the inflection point of the curve (the elbow), where the scree appears, to indicate the ideal number of factors; and
Parallel Analysis: a statistical method based on Monte Carlo simulation, proposed by [22], which randomly constructs correlation matrices of variables with the same dimensionality as the original database and, after several factoring processes of the hypothetical matrix, compares the average of the resulting eigenvalues with the original eigenvalues [23].
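As an illustration of Parallel Analysis, the following is a minimal sketch in Python with NumPy. It is not the implementation used in this work: the function names `parallel_analysis` and `make_toy_data`, the use of standard normal random data, and the retention rule (keep the leading factors whose eigenvalue exceeds the simulated average) are assumptions made for the example.

```python
import numpy as np


def parallel_analysis(X, n_sim=100, seed=0):
    """Suggest a number of factors by comparing observed and random eigenvalues.

    X is an (n_records, n_variables) array. Random datasets of the same shape
    are drawn from a standard normal distribution (an assumption of this sketch).
    """
    rng = np.random.default_rng(seed)
    n, p = X.shape
    # Eigenvalues of the observed correlation matrix, largest first.
    observed = np.sort(np.linalg.eigvalsh(np.corrcoef(X, rowvar=False)))[::-1]
    # Average eigenvalues over random datasets with the same dimensionality.
    simulated = np.zeros(p)
    for _ in range(n_sim):
        R = np.corrcoef(rng.standard_normal((n, p)), rowvar=False)
        simulated += np.sort(np.linalg.eigvalsh(R))[::-1]
    simulated /= n_sim
    # Keep the leading factors whose eigenvalue exceeds the random baseline.
    n_factors = 0
    for obs, sim in zip(observed, simulated):
        if obs <= sim:
            break
        n_factors += 1
    return n_factors, observed, simulated


def make_toy_data(n=500, seed=1):
    """Hypothetical data: six variables driven by two latent factors."""
    rng = np.random.default_rng(seed)
    factors = rng.standard_normal((n, 2))
    loadings = np.array([[0.9, 0.0]] * 3 + [[0.0, 0.9]] * 3)
    noise = 0.436 * rng.standard_normal((n, 6))
    return factors @ loadings.T + noise


if __name__ == "__main__":
    k, observed, simulated = parallel_analysis(make_toy_data(), n_sim=50)
    print("suggested number of factors:", k)
```

The same `observed` eigenvalues can be plotted against their rank to produce the scree curve used by Scree Analysis.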
The Explained Variance can be used for both methods. This can be done by comparing the numerical value of the eigenvalues (
The matrix
with
The factors model given by Eq. (3) can be solved using weighted least squares, which can be used when the errors and
where
Finally,
To summarise, given a dataset expressed by Eq. (1):
Our first goal is to select factors representing highly correlated variables that contribute to characterizing the longevity profiles. For this task, we choose the BiMax algorithm to discover all maximal biclusters in a dataset, that is, all biclusters not entirely contained within other biclusters [24].
Although BiMax is one of the pioneering biclustering algorithms, it can find significant biclusters. In [25], biclusters from BiMax show interactions between HIV and human proteins, yielding the best solutions compared with other algorithms. [26] compares the performance of biclustering algorithms for discovering biomarkers of Esophageal Squamous Cell Carcinoma (ESCC); BiMax was able to recover biclusters from synthetic data, locate significantly enriched biclusters in real datasets, and identify potential biomarkers.
In this work, we choose the
Consider a bicluster given by
The
Decision Tree (J48) and Random Forest are the algorithms chosen to classify the records as long-lived and non-long-lived. We chose J48 for its higher interpretability, and Random Forest for its good performance and for the degree of interpretability it offers by reporting the importance of each variable in the classification result. These algorithms are available on the Weka platform [31].
Methodology
The proposed methodology for characterizing longevity profiles (long-lived and non-long-lived) involves two main phases:
The first is to select factors that represent the relationships among the dichotomous variables and better discriminate both longevity profiles. To do so, we apply biclustering to the dataset transformed by factor analysis, analyzing each discovered bicluster and assigning a longevity label to it. This phase reveals which factors (variables) are more related to one profile than to the other. We select these factors and add them to the dataset containing other categorical preprocessed variables from [6]. Section 3.2.7 describes the procedure adopted for assembling this dataset.
The second phase is to build classification models using this dataset, comparing classification results against results from different preprocessed datasets from ELSA. We aim to apply a Knowledge Discovery in Databases (KDD) process to select the best model and to give some insights through the knowledge obtained by following this methodology.
Materials
ELSA raw data contains 12,099 records and 4,484 variables in the first of its current nine waves alone. [6] proposes a methodology for preprocessing and reducing data from longitudinal studies, which is applied in this work. The preprocessing steps are:
(Step 1) Selection of records labeled as long-lived and non-long-lived, considering the answers provided in the first record of individuals who became long-lived and in the last record of non-long-lived individuals.
(Step 2) Filtering process to remove inconsistencies, such as a) individuals under the age of 50 or born outside the United Kingdom; b) responses considered inconsistent by the interviewer; c) individuals with deaths caused by accidents; d) records where the missing data rate is greater than 20%; e) information related to metadata (i.e., information that is not related to the respondent, only to the questionnaire itself).
(Step 3) Missing data estimation through the Last-Observation-Carried-Forward method, which replaces a missing value with the value of the same feature for the same record in a previous wave [32]. This choice assumes that a person's state changes little between consecutive waves and aims to minimize distortions caused by imputation.
(Step 4) Conceptual feature selection of variables closely related to human aging, based on expert knowledge from a previous literature review [3].
(Step 5) Merging of related features to reduce dimensionality, e.g., we merge questions about the participant engaging in an activity and the frequency of that activity into a single feature that adequately represents the information of both questions.
(Step 6) Coding of the feature values to keep the ordered nominal variable values in a [0..1] interval, where ‘0’ always represents the least favorable outcome for reaching longevity, and ‘1’ represents the most favorable.
(Step 7) Fusion of feature sets of questions associated with the same aspect. Each feature set was merged into a single variable, with its value calculated by weighting the involved features.
Regarding feature fusion, the information loss is not significant when related or directly dependent information is transformed into a single attribute [33]. Thus, questions like “In the past 12 months, have you had an alcoholic drink?” and “How often has the respondent had an alcoholic drink during the last 12 months?” are merged into a single variable.
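The Last-Observation-Carried-Forward imputation of Step 3 can be sketched as follows. This is a minimal illustration, not the preprocessing code of [6]: the function name `locf`, the representation of a respondent as a list of per-wave dictionaries, and the feature names `smoke` and `walk` are hypothetical.

```python
def locf(waves):
    """Fill missing values (None) with the last observed value of each feature.

    `waves` is a list of dicts for ONE respondent, ordered from the oldest
    wave to the newest (an assumed layout for this sketch).
    """
    filled = []
    last_seen = {}
    for record in waves:
        current = {}
        for feature, value in record.items():
            if value is None:
                # Carry the previous observation forward; stays None if the
                # feature was never observed in an earlier wave.
                value = last_seen.get(feature)
            else:
                last_seen[feature] = value
            current[feature] = value
        filled.append(current)
    return filled


if __name__ == "__main__":
    respondent = [
        {"smoke": 1, "walk": None},
        {"smoke": None, "walk": 0},
        {"smoke": None, "walk": None},
    ]
    for wave in locf(respondent):
        print(wave)
```

A value that is missing in the first wave has nothing to carry forward and remains missing, which is why Step 2 also discards records with too many missing values.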
Equation (9) shows the process for calculating values of merged features [6].
Where
It is worth highlighting that the more variables the dataset contains, the lower the similarity between records. Therefore, it becomes more challenging to deal with machine learning tasks. Reducing variables through merging tends to increase this similarity. However, it is also worth noting that the more variables merged, the smaller the number of variables. Furthermore, it is possible to reduce the number of values resulting from the merge process by cartesian product if adjacent values of the scale of values resulting from this product conceptually represent the same condition, favorable or unfavorable, for the class of an object. For example, one can consider that daily smoking of a box or two boxes of cigarettes tends to represent similar conditions favorable to non-longevity. Then, these two adjacent values can be transformed into a single value, reducing the scale generated by the cartesian product and benefiting the discovery of knowledge through machine learning.
[6] reduced the raw data from the ELSA dataset through preprocessing to 128 variables and 1,333 records (1,091 long-lived and 242 non-long-lived). They also merged these 128 variables, forming 28 blocks of variables where each block represents the information of its components.
In our preliminary experiments, the dichotomous variables from the merged feature set lost their predictive power, which prompted a different process to treat those variables. Therefore, this work keeps only 13 of these 28 variables to compose the dataset (See Table 1). We select the remaining variables from the preprocessed ELSA dataset, adding the categorical ones to the dataset (See Table 2) and reducing the dimensionality of the dichotomous ones through the FA. Thus, we selected the most relevant factors through biclustering to address our longevity classification problem. In Table 3, we show the various covered aspects of dichotomous features.
Merged feature sets from categorical variables that were added to the dataset
(unmerged) Categorical variables added to the dataset
Aspects addressed by the dichotomous variables
Dimensionality reduction through factor analysis
The dataset used in this phase contains only the 48 dichotomous variables from the preprocessed ELSA (see Table 4). Bartlett's test and the Kaiser-Meyer-Olkin (KMO) test [34] were used to assess the applicability of FA to this dataset. The first analyzes the correlation matrix of the dataset, and the second measures the suitability of the dataset for the factors model. In this dataset, the
Because these estimations differ, we analyze the three transformed datasets suggested by FA. Table 4 shows the loading values of each feature after transformation by FA. Comparing the results from FA, one can observe that some variables remained highly correlated with others, regardless of their representation by factors. As an example, Table 4 shows that the questionnaire variables related to depression, addressing negative feelings (set D2), and the variable related to difficulties reading the question cards (‘fqhelp’) remained correlated with each other. Therefore, we assign them to the same set because they are correlated in the data representations by 5, 9, and 11 factors.
Variable loadings for the three possible factor analysis datasets
Following this analysis, we set eight merged feature sets, and we present in Table 4 more descriptions of these factors, their compounding variables, and the formed sets (D1 to D8) for better comprehension and to explain these high correlations. This Table also presents a 9th block labeled ‘D9 – Undefined’. This block contains variables not simultaneously correlated in the three data representations by FA.
To confirm the relevance and veracity of the correlations, we reviewed the literature to understand the effects observed in a population (Section 3.2.2); in light of that review, the results of reducing the data by FA were consistent and satisfactory. The short variable descriptions shown in Table 4 were adapted from the UK Data Archive Data Dictionaries.
It can be seen in Table 4 that some factors considered less relevant (with
To analyze and substantiate the influence of correlated variables in sets D1 to D8 of Table 4, we searched the literature for studies that justify these observed correlations.
Regarding the variables in set D1 (musculoskeletal problems), four of them are strongly correlated (
The variables in set D2 are related to negative feelings and may be associated with psychological issues; they suggest that depression is a possible cause of the difficulties in reading the question cards observed by the interviewer. These variables describe feelings that occurred during the previous week: felt depressed (pscedea), felt that everything required effort (pscedb), felt that sleep was restless (pscedc), felt lonely (pscede), felt sad (pscedg), and felt they could not go on (pscedh). The set also includes problems reading the question cards (fqhelp). According to [38], a common characteristic of people with depression symptoms is difficulty reading, which is associated with low self-esteem and feelings of incompetence that lead to inability to concentrate and self-isolation.
The variables in set D5, related to physical limitations derived from the occurrence of stroke (hepbs and hespk), are significantly correlated in the three analyses (5, 9, and 11 factors). We also note a high correlation with the variable hewks (set D9), presence of weakness in the limbs, which is highly correlated with the variables in set D5 in the analyses with 5 and 11 factors. The studies in [39] and [40] support the high correlation between these variables. According to [39], 80% of people affected by stroke report weakness in the limbs or inability to perform movements on one side of the body, while [40] reports that 65% of people who have had a stroke are affected by dysphagia, characterized by swallowing difficulties.
Set D6 is formed by the variables related to being retired (wplljy_rage), currently working (wpdes), and medical diagnosis of serious health problems (heaga_h). In a study conducted in 2014, considering 40,000 families in the UK, 70% of people over 60 years old, 18% over 70 years old, and 7% over 80 years old were still active in the workforce [41].
No variable in set D7 has a significant value in the analyses of the datasets reduced to 5 and 9 factors. The variables related to the use of diabetes medication (hemdb) and having a private health plan (wpphi) had a moderate correlation in the representation by 11 factors. We found in [42] that the UK provides universal access to care via the National Health Service (NHS), but more than 10% of people still have Private Health Insurance (PHI) to complement the NHS. That study compares diabetes management in the USA and the UK, and it does not differentiate people who have access to PHI from those who do not because the NHS is accessible to everyone. Since this is the only study we found that investigates relations between these variables, we do not consider it sufficient to assess the variables from D7 in our work.
Finally, variable set D8 (Mobility and economic situation) has a diversity of information covering dimensions of physical health, education, and wealth status. The variables related to hip fracture (hefrac), vision problems caused by stroke (hevsi), and the current existence of financial debts (iadebt_owe) have the highest loading weights (above 0.6) in the representation by 11 factors, so these factors partially describe the respondent’s physical health and economic situation. We could not find studies that directly relate the variables of this set. Other variables are related to reduced mobility caused by health problems (mmhss), insulin self-injection for diabetes (heins), whether the respondent had formal education in the previous year (wpedc), and whether the respondent has their own business (wpbus).
Regarding the sets D3 (Social relationship with family or friends), D4 (Positive feelings in the previous week, such as happy feelings (pscedd) and enjoying life (pscedf)), D7 (Health plan and comorbidities), and D8 (Mobility and wealth status), we were unable to find an article to substantiate the correlations found. Therefore, we suggest carrying out studies to understand the causes and impacts of these variables on the lives of the elderly.
We do not investigate correlations among the variables in block D9 since they are not correlated simultaneously in the three datasets. They describe biological sex (dhsex); whether the respondent had angina or chest pain in the last two years (heyra); fall (hefla) or weakness in the limbs (hewks); cataract surgery (hecat); cancer (hecanb); medication for asthma (heama), hypertension (hemda) or lung problems (helng); hearing difficulties (hehra); walking aid (mmaid); children living outside the household (chouthh) and whether they have a partner (scptr); vehicle access (hoveh_spcar); home ownership (hoevm_hopay) and whether they have savings (iacisa_npb_sava).
Finding biclusters through factors
Reducing the dichotomous dataset dimensionality through factor analysis in 5, 9, and 11 factors computationally enabled the discovery of all biclusters through the BiMax algorithm. It is necessary to highlight that in experiments on these reduced datasets from FA, the results are not satisfactory when using algorithms that search for biclusters based on variance analysis [43] or searching by linear relationships [44]. It is important to note that variance-based algorithms search for groups having variance limited by a threshold considering its mean value. However, despite the numerical variable values from FA, the groups formed did not discriminate the profiles adequately. Using BiMax, which defines the standard threshold as the absolute value of an element
BiMax is commonly applied to discover subgroups on dichotomous datasets. However, on continuous data, a user-defined threshold can be used to transform the data into binary data. In this study, this threshold is the absolute value of the feature scores, then the records with have their
Bicluster analysis for datasets represented by 5, 9, and 11 factors
(*) The Cumulative Percentage of Variance (CPV) measures the total factors variance. 5F, 9F, and 11F are, respectively, the datasets reduced to 5, 9, and 11 factors.
Biclusters are labeled as long-lived or non-long-lived if at least 70% of their records are from the same class, while biclusters with less than 70% are unlabeled. Biclusters with
Labeling a bicluster when at least 70% of its records belong to the majority class was an empirical choice, made to search for purer groups with higher representation. Note that the higher the required purity percentage, the fewer biclusters are identified, which makes the discovery of the best factors unfeasible. For the problem addressed in the ELSA study, where the long-lived and non-long-lived profiles are very similar, requiring high percentages of records from the same class may not return biclusters representative of both profiles.
To select the most representative factors, in Sections 3.2.4 to 3.2.6, we consider whether a bicluster contains at least 70% of records from the same class and whether its number of records is representative of the total of that class in the dataset. We show in Tables 6, 8 and 10 only the profile characterization for trivial biclusters, that is, biclusters with only one feature. We show only these because, among labeled biclusters (L and NL), only those labeled non-long-lived have the factor F5 in the analysis with 5 factors and the factor F4 in the analysis with 9 factors. In the analysis with 11 factors, 120 out of 128 non-long-lived biclusters also have the factor F5. The other factors appear in both long-lived and non-long-lived biclusters.
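The 70% labeling rule can be written as a short function. This is a sketch of the rule as stated in the text, with a hypothetical function name `label_bicluster`; the counts in the usage lines are the ones reported for the trivial biclusters F4 and F1 of the 9-factor dataset.

```python
def label_bicluster(n_long_lived, n_non_long_lived, purity=0.70):
    """Label a bicluster by its majority class if it reaches the purity threshold.

    Returns "L" (long-lived), "NL" (non-long-lived) or "UD" (undefined).
    """
    total = n_long_lived + n_non_long_lived
    if total == 0:
        return "UD"  # an empty bicluster cannot be labeled
    if n_long_lived / total >= purity:
        return "L"
    if n_non_long_lived / total >= purity:
        return "NL"
    return "UD"


if __name__ == "__main__":
    print(label_bicluster(64, 162))   # bicluster F4 (9 factors): 71.68% non-long-lived
    print(label_bicluster(190, 60))   # bicluster F1 (9 factors): 76% long-lived
```

Raising `purity` makes the rule stricter, which is the trade-off discussed above: purer groups, but fewer labeled biclusters.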
Table 6 shows only biclusters from the dataset of 5 factors whose records are clustered by one factor. The F1, F2, and F4 biclusters represent the long-lived (L) because at least 70% of records are from the same class in each bicluster. Bicluster F5 represents non-long-lived (NL), and bicluster F3 is not labeled (UD) because it does not contain at least 70% records from the same class.
Records into biclusters discovered by Bimax for 5 factors
Labels on the left stand for real classes of records in biclusters. The label in parentheses represents the label received by the bicluster. Label L stands for Long-lived, NL for non-long-lived, and UD for Undefined. Tot is the total of records in each bicluster.
All 14 non-long-lived biclusters have the factor F5 (see column 5F in Table 5). This factor is observed only in non-long-lived biclusters, which makes it a good predictor of this class. The records in these biclusters total 163 out of the 242 non-long-lived records (67.36%) in the dataset, while 58 out of the 1,091 long-lived records are observed in these biclusters (see Section 3.1). We therefore added this factor as a predictor variable in our proposed dataset. Table 7 describes the variables that compose the factor F5.
Dichotomous features represented by factor F5
(*) Feature considered not significant because its loading is lower than 0.3 [46, 47]. All possible answers for these questions are Yes or No.
Table 8 shows only biclusters from the dataset represented by 9 factors whose records are grouped by one of these factors.
Records into biclusters discovered by Bimax for 9 factors dataset
Labels on the left stand for real classes of records in biclusters. The label in parentheses represents the label received by the bicluster. Label L stands for Long-lived, NL for non-long-lived, and UD for Undefined. Tot is the total of records in each bicluster.
Dichotomous features represented by factor F4
(*) Feature considered not significant because its loading is lower than 0.3 [46, 47]. All possible answers for these questions are Yes or No. Acronym d.l.w. stands for “during last week”.
The F1, F2, F5, F6, and F7 biclusters are labeled as long-lived, although most records that do not belong to these biclusters are also long-lived, which may indicate that these factors do not have high predictive capacity. For example, bicluster F1 has 250 records, of which 190 are long-lived (76%). Of the 1,083 records (out of 1,333 available in the dataset) not assigned to this bicluster, 83.2% are also long-lived. Therefore, because the proportions of the same class are high both inside and outside these biclusters, these factors are not considered suitable for characterizing profiles.
In contrast, bicluster F4 has 226 records, of which 162 are non-long-lived (71.68%) and 64 are long-lived (28.32%). Most of the records not assigned to this bicluster belong to the long-lived class (92.77% of the 1,107 unassigned records). We therefore have a higher proportion of non-long-lived records inside this bicluster and a higher proportion of long-lived records outside it, so we conclude that factor F4 contributes to profile characterization and add it to our proposed dataset. This factor is also present in all 49 non-long-lived biclusters and in 31 out of 89 unlabeled biclusters, and it is absent from every long-lived bicluster (see column 9F in Table 5).
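The comparison between the class share inside a bicluster and among the unassigned records can be reproduced from the reported counts. The sketch below uses a hypothetical helper, `class_share_inside_and_outside`, and the dataset totals stated in Section 3.1 (1,333 records, 1,091 of them long-lived).

```python
def class_share_inside_and_outside(n_class_in_bicluster, bicluster_size,
                                   n_class_total, n_records_total):
    """Return the share of a class inside a bicluster and among unassigned records."""
    inside = n_class_in_bicluster / bicluster_size
    outside = ((n_class_total - n_class_in_bicluster)
               / (n_records_total - bicluster_size))
    return inside, outside


if __name__ == "__main__":
    # Long-lived share for bicluster F4 of the 9-factor dataset.
    inside, outside = class_share_inside_and_outside(64, 226, 1091, 1333)
    print(f"F4: {inside:.2%} inside, {outside:.2%} outside")
    # Long-lived share for bicluster F1 of the 9-factor dataset.
    inside, outside = class_share_inside_and_outside(190, 250, 1091, 1333)
    print(f"F1: {inside:.2%} inside, {outside:.2%} outside")
```

For F1 the long-lived share is high both inside and outside the bicluster, so the factor separates little; for F4 the two shares diverge sharply, which is the behavior sought when selecting factors.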
Biclusters F3, F8, and F9 are unlabeled because they do not meet the criteria of having at least 70% of their records from the same class. Table 9 shows variable descriptions from factor F4.
Table 10 shows only biclusters from the dataset represented by 11 factors with records grouped by one of these factors.
Records into biclusters discovered by Bimax for 11 factors dataset
Labels on the left stand for real classes of records in biclusters. The label in parentheses represents the label received by the bicluster. Label L stands for Long-lived, NL for non-long-lived, and UD for Undefined. Tot is the total of records in each bicluster. * This trivial bicluster has 69.3% of non-long-lived records.
In this analysis, BiMax discovered 616 significant biclusters (see Table 5). Among these, 120 out of the 128 non-long-lived biclusters have the factor F5. No significant long-lived bicluster has this factor. Furthermore, the variables from F5 (set D2 from Table 4) are also observed in factor F4 of the 9-factor data representation (sets D2 and D4 from Table 4). This factor is therefore relevant for characterizing longevity profiles.
The dataset built for the classification model is formed by: a) merged feature sets (sets A1, A7, A8, A9, B5, B6, B7, B8, B9, C3, C4, C5, C6; see Table 1); b) factor F5 from the five-factor dataset and factor F4 from the nine-factor dataset, both from the FA analysis (see Table 4); and c) original ordinal categorical features from the preprocessed ELSA dataset (hepaa, hefunc, hhtot, disib, dimar, dignmy, wpvw, hotenu and exrslf, described in Table 2). The resulting dataset has 1,091 long-lived records, 242 non-long-lived records, and 24 features. We refer to this dataset as d5 in this work.
Statistical validation of training sets
We select four datasets to compare the classification results with our proposed dataset: d1, the 128 preprocessed ELSA features (categorical and dichotomous, preprocessed following steps 1 to 4 briefly described in Section 3.1); d2, the 28 merged feature sets proposed in [6], obtained through steps 1 to 7 briefly described in Section 3.1; d3, 5 factors from FA; and d4, 9 factors from FA. Dataset d5 is the one proposed in this work. Datasets d3 and d4 represent the 48 dichotomous variables from ELSA.
We performed statistical tests for multivariate data to validate the training sets, verifying through the analysis of their distributions whether samples from different classes represent different populations. The categorical dataset d1 is better fitted by a Dirichlet multinomial distribution model, while the others have a normal distribution with all
Classifiers
The parameters of the Decision Tree (DT) J48 and Random Forest (RF) algorithms considered in this work were adjusted through experiments, with classification results obtained through 10-fold cross-validation (see Tables 11 and 12).
Best parameters for Random Forest algorithm by dataset
Parameter descriptions:
-P: Size of each bag, as a percentage of the training set size (default 100); -print: Print the individual classifiers in the output; -attribute-importance: Compute and output feature (attribute) importance (mean impurity decrease method); -I: Number of iterations (i.e., the number of trees in the random forest) (default 100); -num-slots: Number of execution slots (default 1, i.e., no parallelism); -K: Number of features to randomly investigate (default 0); -M: Set the minimum number of instances (records) per leaf (default 1); -V: Set the minimum numeric class variance proportion of train variance for split (default
Best parameters for J48 algorithm by dataset
-R: Use reduced error pruning; -C: Set confidence threshold for pruning (default 0.25); -M: Set the minimum number of instances (records) per leaf (default 2); -N: Set the number of folds for reduced error pruning. One fold is used as a pruning set (default 3); -Q: Seed for random data shuffling (default 1).
To address the class imbalance (1,091 L, 242 NL), we performed random undersampling on the training dataset. Through systematic classification experiments on dataset d5, we concluded that the best proportion of training records from ELSA is 160 L to 80 NL, resulting in a training set with 240 records in total.
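The undersampling step can be illustrated as follows. This is a minimal sketch on synthetic record identifiers, not on ELSA data. It assumes that the records not drawn for training form the test set, which is consistent with the reported sizes but is not stated explicitly. The function name `undersample` and the record layout are hypothetical.

```python
import random

def undersample(records, per_class, seed=1):
    """Randomly keep per_class[label] records of each class, without replacement."""
    rng = random.Random(seed)
    by_class = {}
    for record in records:
        by_class.setdefault(record["label"], []).append(record)
    train = []
    for label, size in per_class.items():
        train.extend(rng.sample(by_class[label], size))
    train_ids = {record["id"] for record in train}
    # Assumption: every record that was not sampled goes to the test set.
    test = [record for record in records if record["id"] not in train_ids]
    return train, test

# Synthetic stand-in for d5: ids and class labels only.
records = [{"id": i, "label": "L"} for i in range(1091)]
records += [{"id": 1091 + i, "label": "NL"} for i in range(242)]

train, test = undersample(records, {"L": 160, "NL": 80})
print(len(train), len(test))
# 240 1093
```

Fixing the seed keeps the split reproducible. Repeating the draw with several seeds would show how much the results depend on the particular sample.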
The test sets contain only records not seen by the classifier, and each has 1,093 records (931L, 162NL). Results are assessed through precision, recall, and F-Score measures. The results for the test sets are shown in Tables 13 and 14 and discussed in this section.
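For reference, the three measures and their weighted average can be computed from a confusion matrix as sketched below. The confusion counts are hypothetical and chosen only to match the test set sizes (931 L, 162 NL). They are not results from this study, and the helper names are illustrative.

```python
def class_metrics(tp, fp, fn):
    """Precision, recall, and F-score for one class from its confusion counts."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f_score = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f_score

def weighted_average(values, supports):
    """Average of per-class values weighted by the number of records per class."""
    return sum(v * s for v, s in zip(values, supports)) / sum(supports)

# Hypothetical predictions for 162 actual NL and 931 actual L records.
nl_as_nl, nl_as_l = 100, 62
l_as_l, l_as_nl = 900, 31

nl = class_metrics(tp=nl_as_nl, fp=l_as_nl, fn=nl_as_l)
ll = class_metrics(tp=l_as_l, fp=nl_as_l, fn=l_as_nl)
wa_f = weighted_average([ll[2], nl[2]], [931, 162])

print("NL:", [round(x, 3) for x in nl])
print("L: ", [round(x, 3) for x in ll])
print("WA F-score:", round(wa_f, 3))
```

Because the weighted average is dominated by the majority class, the per-class values for the non-long-lived class are the more informative figures when classes are imbalanced.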
Random forest classification results by test set
L: Long-lived. NL: Non-long-lived. WA: Weighted Average. We highlight in bold the best results.
Table 13 shows that RF performs better on dataset d5 than on the others for every measure. Comparing dataset d5 against datasets d1 and d2, the F-Measure increases by at least 5.5% for the non-long-lived class and improves slightly for the long-lived class. Improving the classification results, mainly for the non-long-lived class, is necessary to reduce the impact of incorrect predictions on the longevity and sociopolitical domains, since after an erroneous prediction individuals may not feel the need to adopt good practices to improve their health conditions. Therefore, the RF model for dataset d5 is deemed suitable for this classification task.
Decision tree (J48) classification results by test set
L: Long-lived. NL: Non-long-lived. WA: Weighted Average. We highlight in bold the best results. ()*: number of leaves of the tree.
F-Measure results for the DT models built on datasets d1, d2, d3, and d5 are approximately equal for each performance measure, while the model on dataset d4 has a lower F-Measure for the non-long-lived class. The model built on dataset d1 has a better recall, which is of interest for classifying the non-long-lived, and a weighted average F-Measure similar to that of the model on dataset d5. It is worth noting, however, that dataset d1 has 128 features, while d5 has only 24 features representing all features from d1. We cannot claim that these differences are statistically significant, because we did not apply a formal method to compare the classification results. Even so, the results on dataset d5 are suitable, indicating that the model built on our proposed dataset can be used to classify longevity profiles. Table 16 shows the rules extracted from the DT classification model.
Predictions for the long-lived class do not vary significantly between the models (less than 0.5% on each model for both classification algorithms). This class has substantially more records than the non-long-lived class, which favors its classification results. Conversely, the scarcity of records representing the non-long-lived hampers the learning of that class and lowers its classification results.
There is no significant difference between RF (Table 13) and DT (Table 14) for the long-lived class. Dataset d5 achieves its best F-Score for the non-long-lived class with the RF model. Therefore, choosing the number of records based on systematic tests, selecting the best factors through biclustering, and enriching the data with the blocks of variables improved the non-long-lived classification for dataset d5, raising the F-Score of the minority class to 74.1%.
We observe that factors F5 and F4 from d5 rank first and third in importance in the RF model (see Table 15). The first and fourth variables (F5 and A8) also appear in the DT model (see Fig. 1). The variable wpvw from the DT, which refers to the frequency of volunteer work, ranks 19th in importance in the RF.
Variable importance based on average impurity decrease
(*) The number of nodes using that variable in the 280 trees composing the forest. Variables selected by the DT model are highlighted in bold.
Classification rules from the DT tree considering the test set d5
[1] iadebt_owe: Reported having debt; [2] hevsi: Reported eyesight difficulties due to stroke; [3] hefrac: Reported hip fracture; [4] hoevm_hopay: House owner; [5] wpbus: Business owner; [6] heins: Injects insulin (diabetes); [7] mmhss: Walking test not taken due to health condition; [8] hecanb: Cancer treatment in the last 2 years; [9] wpedc: Formal education in the last 12 months; [10] dhsex: Biological gender (Male or Female); [11] hecat: Had cataracts surgery; [12] iacisa_npb_sava: Reported having savings; [13] heska_smk: Tobacco consumption; [14] scako: alcohol consumption; and, [15] wpvw: volunteer work frequency.
DT classification results from d5 test set.
We interpret the factors F4 and F5, analyzing the correlation of adjacent variables. The correlation between features and factors is called factorial loading. This correlation can vary between
Table 9 shows the values of correlation and meaning of the variables from the F4 factor. Psychological aspects and difficulty in reading question cards have correlations above 50%. Studies show that depressed people often report reading issues. Thus, this can explain the correlation between the difficulty in reading question cards and the variables related to negative feelings (See Table 4 and Section 3.2.2).
The higher correlations from factor F5, selected on the DT model, are related mainly to wealth status (iadebt_owe: Reported having debt, hoevm
The score of each record from F5 is in the interval
74.79% of the non-long-lived and 6.6% of the long-lived have scores lower than 16.25%, indicating a worse condition of wealth status and mobility; 93.4% of the long-lived and 25.21% of the non-long-lived have scores greater than 16.25%, indicating better wealth status and mobility conditions.
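One way to quantify how well this threshold separates the classes is the impurity decrease of the corresponding split, the quantity behind the mean impurity decrease importance reported for the RF. The sketch below uses Gini impurity as one common choice; the impurity measure used by the authors' tooling is not stated here, so this is an assumption. The record counts are also an assumption: they are back-computed from the percentages above and the class totals (1,091 L, 242 NL), and were not reported directly.

```python
def gini(counts):
    """Gini impurity of a node given its per-class record counts."""
    total = sum(counts)
    if total == 0:
        return 0.0
    return 1.0 - sum((c / total) ** 2 for c in counts)

def impurity_decrease(parent, children):
    """Parent impurity minus the size-weighted impurity of the child nodes."""
    n = sum(parent)
    weighted = sum(sum(child) / n * gini(child) for child in children)
    return gini(parent) - weighted

# Counts are (long-lived, non-long-lived), back-computed from the percentages:
# 6.6% of 1,091 L is about 72; 74.79% of 242 NL is about 181.
parent = (1091, 242)
low_score = (72, 181)     # F5 score below the 16.25% threshold
high_score = (1019, 61)   # F5 score above the threshold

decrease = impurity_decrease(parent, [low_score, high_score])
print(round(gini(parent), 4), round(decrease, 4))
```

A split that left the class mix unchanged in both branches would give a decrease of zero, so a clearly positive value is consistent with F5 ranking high in the importance table.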
Variable A8, related to alcohol and tobacco consumption, is codified to favor people with the lowest consumption of these substances, setting them with higher scores, so these people should be long-lived. Otherwise, when they get a lower score on this variable, they are deemed as non-long-lived (See Tables 17 and 18). But one can note in the tree model (See Fig. 1) that people with lower consumption of these substances are non-long-lived (score
The weighting strategy calculates the value representing the information for each ELSA questionnaire response in the set. In these merged feature sets, the higher the value answered by a respondent, the greater the chance that the respondent will achieve longevity. For instance, Table 18 shows the possible values assumed by the merged feature set A8, following Eq. (9). Regarding Tobacco Consumption (0: Smokers) and Alcohol Consumption (0: Almost every day), the lowest value (0.0) contributes to a non-long-lived classification. On the other hand, high values for Tobacco Consumption (1: Never smoked) and Alcohol Consumption (1: Nothing in the last 12 months) result in a feature set value of 7.0, contributing to long-lived classification.
Example calculation of merged feature set value – Tobacco and Alcohol consumption (set of variables A8)
Possible values for the feature A8 (Tobacco and Alcohol consumption)
Finally, if someone is a smoker independently of the amount of consumed alcohol or they are a former smoker and consumes alcohol almost every day or up to twice a week but performs volunteer work more frequently (wpvw
Volunteer work frequency variable
Economic security, such as good wealth, and the absence of health problems that affect the mobility of the elderly are the most important aspects that lead to a long life. Therefore, people should be economically secure and debt-free when they reach retirement age. They must also have guaranteed access to housing and healthcare. The government must create policies to cover these aspects.
Policies to ensure greater self-esteem and satisfaction for the elderly must provide for the inclusion or continuity of the individual’s participation in society, observing their physical and psychological health status. In this way, the individual can feel that their participation in society is valuable, reducing the chances of isolation, as this can lead to depression and other typical disorders that can affect people with low self-esteem. One of these measures may be related to guaranteeing access to paid or voluntary work, as it provides a feeling of well-being and participation in society.
Conclusion
This study presented a methodology to characterize profiles from longitudinal datasets through a case study on data from the English Longitudinal Study of Ageing of the United Kingdom (ELSA-UK). Discovering patterns in longevity profiles using longitudinal studies is always challenging. The ELSA database is even more challenging because its instances are very similar to one another, which hinders building classification models. We therefore presented a feature engineering process that merges variables to reduce the dimensionality of the categorical data and reduces the binary data through factor analysis, followed by the selection of relevant factors through biclustering.
In this work, we choose the ELSA dataset to demonstrate a feature engineering procedure. In this process, we decided to consider 5 databases: a) The first database considers all features previously pre-processed according to a study used as the main baseline. This procedure is briefly presented in Section 3.1. The resulting dataset is obtained by applying steps 1 to 4; b) The second database considers merged features, according to the baseline, as presented in Section 3.1. The resulting dataset is obtained by applying steps 1 to 7; c) The third and fourth databases are composed of factors coming from the binary features of the first baseline used in this work, these sets being differentiated by the number of factors; and, d) The fifth base proposed as a set of features to represent longevity is composed of merged variables and relevant factors selected by biclustering.
The composition of each of these baselines aimed to compare feature engineering procedures indirectly. The first database was only preprocessed, without feature engineering. The second was built through a feature engineering process based on merging. The third and fourth were obtained solely through factor analysis. The fifth combines the benefits of both feature engineering processes and brought the best results in our research.
The experimental results of the classification models generated from our proposed dataset highlight the need for measures that contribute to a higher quality of life for the elderly population regarding health, economic situation, mobility, and voluntary work. Such measures can raise well-being by not limiting access to social environments and by favoring access to housing and health facilities.
We highlight the following contributions:
Preliminary results assessing the correlations between variables from factor analysis are validated through published research;
The binary dataset reduction allows the selection of the most relevant features to distinguish longevity profiles through factor analysis and biclustering;
Characteristic correlations obtained through factor analysis could lead to future in-depth studies to investigate how the features indicated by these variables are related;
Identification of aspects influencing human longevity can aid the development of socioeconomic measures that mitigate future shortages of economic, social, and health resources, directly favoring the active aging process.
We also highlight the following drawbacks found in this work:
The absence of more records representing the minority class can reduce the accuracy of our results.
The classification models were not compared according to a standard methodology, which can impair the analysis of how statistically different one model is from another; this comparison was instead done through observation, domain knowledge, and visual analysis of the results.
The proposed method for feature selection via FA and biclustering is sensitive both to the method used to estimate the ideal number of factors and to the threshold used to dichotomize variables before applying the BiMax algorithm. Ideally, a specialist should support the decision on the number of factors used to represent the dataset. One can also choose the number of factors through experimental analysis, the procedure adopted in this work. For the dichotomization, preliminary experiments on the ELSA dataset analyzed the influence of three thresholds on the input data for the BiMax algorithm: the mean, the median, and the feature score. Values below or equal to the threshold received a value of 0; otherwise, they received a value of 1. We obtained the best result by setting the feature score threshold to 1.
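The dichotomization rule described above can be sketched as follows. The score values are hypothetical and the function name `dichotomize` is illustrative. Only the rule itself (values less than or equal to the threshold become 0, the rest become 1) and the three threshold choices come from the text.

```python
from statistics import mean, median

def dichotomize(values, threshold):
    """Map values <= threshold to 0 and values > threshold to 1."""
    return [0 if v <= threshold else 1 for v in values]

# Hypothetical scores of seven records on one factor.
scores = [0.2, 0.9, 1.0, 1.1, 1.4, 2.5, 3.1]

by_mean = dichotomize(scores, mean(scores))
by_median = dichotomize(scores, median(scores))
by_fixed = dichotomize(scores, 1.0)  # fixed feature score threshold of 1

print(by_mean)    # [0, 0, 0, 0, 0, 1, 1]
print(by_median)  # [0, 0, 0, 0, 1, 1, 1]
print(by_fixed)   # [0, 0, 0, 1, 1, 1, 1]
```

Even on this small example the three thresholds yield different binary columns, which illustrates why the biclusters found by BiMax depend on this choice.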
Therefore, the feature engineering procedure adopted for the ELSA dataset, although not robust or sufficient for any other dataset, can be considered a valid procedure where expert intervention and auxiliary experiments are necessary to ensure representativeness of the results. We also propose to evaluate, through statistical methods, the robustness of the proposed method, the sensitivity in choosing the number of factors when using other datasets, and compare conceptual modeling with feature selection methods.
In our experience, the participation of the domain expert is highly relevant when selecting the features that compose the dataset used to build learning models. The literature provides several algorithms for feature selection. In this work, we consider it relevant that the preliminary definition of the dataset be assisted by the domain specialist through a conceptual feature selection process, which can yield knowledge discovery that is more in line with the needs of the application area. In future work, we suggest a more comprehensive comparison between conceptual selection and purely algorithmic selection of attributes.
We also recognize the limitation related to the absence of previous studies describing relationships of some of the features observed in the correlation analysis of the factor analysis. Therefore, we drew on published social science research to validate these correlations. We believe that validation through analysis of external sources can help in building more accurate conceptual models.
For this work, we transform the longitudinal database into a two-dimensional dataset, obtaining subgroups of longevity profiles. For future work, as the longitudinal databases can be interpreted as a triadic context (with three dimensions: records, features, and waves), we suggest applying tri-clustering to cluster tridimensional data. By doing so, we can observe the influence of time on the discovered subgroups. We also suggest validating the statistical relevance of the results, including parameterization, through nested cross-validation.
Footnotes
Chinese Longitudinal Healthy Longevity Study (CLHLS).
Health and Retirement Study (HRS).
The Irish Longitudinal Study of Ageing (TILDA).
Korean Longitudinal Study of Ageing (KLoSA).
Longitudinal Study of Adult Health (ELSA-Brasil).
Acknowledgments
The data were made available through the UK Data Archive. ELSA was developed by researchers based at the NatCen Social Research, University College London, and the Institute for Fiscal Studies. NatCen Social Research collected the data. The funding is provided by the National Institute of Aging in the United States, and a consortium of UK government departments coordinated by the Office for National Statistics. The developers and funders of ELSA and the Archive do not bear any responsibility for the analyses or interpretations presented here.
This work was conducted during a scholarship supported by the National Council for Scientific and Technological Development of Brazil (CNPq – Conselho Nacional de Desenvolvimento Científico e Tecnológico) and by the Graduate Support Program for Community Higher Education Institutions (PROSUC – Programa de Suporte à Pós-Graduação de Instituições Comunitárias de Ensino Superior)/Coordination for the Improvement of Higher Education Personnel – Brazil (CAPES – Coordenação de Aperfeiçoamento de Pessoal de Nível Superior – Brasil), Finance Code 001, Brazilian Federal Agency for Support and Evaluation of Graduate Education within the Ministry of Education of Brazil. This work was carried out at the Pontifical Catholic University of Minas Gerais, PUC-Minas.
