Comparative analysis of life expectancy prediction using regression algorithms

Abstract

This study performed a comparative analysis of various imputations for NULL values in the dataset, namely, mean, median, and mode. We implemented eleven regression models, including Linear and Support Vector Regression and tree-based regression models, such as decision tree, surrogate tree, and random forest, with five different pre-processing techniques, each producing different results. The core objective of this study is to compare these results and interpret why a given imputation technique produces a given output. This interpretation is helpful in the selection of the regression model. The experimental results were evaluated and validated for the performance and quality of life expectancy prediction using various quality parameters. Among the results, the highest accuracy was produced by random forest regression, with an accuracy of 96.8%, indicating that random forest outperforms the other regression methods considered for life expectancy prediction.

Keywords

Life expectancy, random forest, decision tree, surrogate tree, support vector regression, regression methods

1. Introduction

Life Expectancy is a measure of how long a person can live under certain conditions [1,2]. These conditions include several socio-economic factors such as BMI, Income composition, schooling, and status of development [3,4,5]. Usually, developed countries tend to have higher life expectancies than developing or underdeveloped countries. The global average life expectancy of human beings has been rising, indicating that standards of living are improving and the mortality rate of diseases has been decreasing day by day [6,7,8,9]. Life expectancy also acts as an important indicator of how well the region’s population is doing in terms of social and economic development [10,11].

Regression models are used in various fields of research such as medicine, engineering, and economics to predict continuous values by establishing relationships between several variables. The accuracy of a model also depends on the data used for its training, and the quality of pre-processing techniques also plays a major role in the outcome of the model [12,13,14]. These data can be improved by replacing the NULL values with the mean, median, mode, and interpolation, or by completely dropping them. This study aimed to determine how algorithms respond to various pre-processing techniques by analyzing their impact on target life expectancy predictions [15,16]. This will help in determining the appropriate pre-processing techniques for specific algorithms [17].

Each result was evaluated by recording its error metrics, which indicate the extent of error and reveal problems that arise while running statistical models, such as underfitting, overfitting, and other problems that might generate wrong results. Since different pre-processing techniques can produce different results in each model, understanding which techniques work best with an algorithm becomes essential. Some hyper-parameter changes were also made via trial and error to further improve the results.

The outcome of this study can improve our understanding of which factors severely impact life expectancy, and this information could be used to further raise the standard of living. It also shows which regression methods give the best results and why. We tracked various performance metrics to understand how each model behaves with different imputations. We conclude that the Random Forest model has the best outcome compared to Linear Regression and Support Vector Regression.

The rest of the paper is organized as follows: Section 2 reviews the literature, Section 3 describes the materials and methods, Section 4 presents the results and discussion, and Section 5 gives the conclusion and future scope of the proposed system.

2. Literature review

In a study by Bali et al. [18] to find a suitable model for predicting life expectancy, the authors used techniques such as ridge regression, linear regression, random forest, and decision tree regression. The NaN values were replaced with 0, and the fields ‘country’ and ‘year’ were dropped, as the study did not analyze a specific country’s data. In their findings, Random Forest Regression performed the best, with an accuracy of 96%.

Another study by Lipesa et al. [10] implemented XGBoost, Random Forest, and Artificial Neural Networks. Their findings were analyzed based on the MAE and RMSE values. The best-performing model was XGBoost with an MAE of 1.554 and an RMSE of 2.402. The data used for this study were obtained from the World Health Organization (WHO). However, some of the data were found to be incorrect and were then replaced by new values taken from the development indicators’ dataset from the World Bank. Kavitha et al. [19] presented a comparison between ‘linear regression’ and ‘support vector regression’. They used different kernels for SVM, namely, linear and non-linear support vector regression. In this study, linear regression with the least squares method performed better than SVM.

Aydin and Bulut [20] used even more kernels in support vector regression (SVR), namely, radial basis, linear, polynomial, and sigmoid functions. They used data from 32 countries and compared the results using graphical methods. Ali et al. [21] utilized additional algorithms and models, such as the classification model, recursive feature elimination, logistic regression model, and cirrhosis mortality model, to predict the life expectancy of people suffering from Hepatitis B. The Area Under the Curve method was used to assess the model’s performance.

In a recent study by Tuj et al. [22] to find the most accurate regression model, the authors used eight different techniques, including “K-neighbors”, “Stacking Regressor”, “Random Forest”, and “Decision Tree”. The most accurate result was obtained by an extreme gradient boosting regressor with an accuracy of 99%, followed by gradient boosting, which produced an accuracy of 96%. In addition, they collected data separately from rural and urban sources [23].

In another study [24], several techniques were applied to a WHO dataset, including “Linear Regression”, “Decision Tree”, and “K-neighbour Regression”, along with the Correlation Feature Selection and Mutual Information Feature Selection methods, and the results were evaluated using R2 and RMSE. The Decision Tree produced the best results with mutual information feature selection; the mutual information and correlation feature selection methods used k values of 15 and 10, respectively.

A study by Lakshmanrao et al. [25] examined the feasibility of life expectancy on a WHO dataset of 15 years by applying machine-learning models. In this study, they applied logistic regression, SVM, random forest, and decision tree to achieve a good r-squared value of 0.81 by Multiple Linear Regression, 0.56 by Support Vector Regression, 0.91 by Decision Tree, and 0.96 by Random Forest.

In a study conducted by Faisal et al. [26], the WHO dataset was used to identify statistically significant factors for life expectancy using machine-learning models. Random Forest regression and Linear regression models were used. Furthermore, they used an ensemble voting regressor and a decision tree. In the results, it was found that schooling has a positive impact of 0.71, and the best model was the Random Forest regressor, which achieved the lowest MSE of 1.93, and the lowest MAE of 1.24.

Fransiska et al. [27] used geographically weighted regression (GWR) and Random Forest Regression (RFR) and compared their life expectancy predictions using RMSE. The RMSE of GWR was found to be 64.99, with significant influence variables X3 (percentage of proper sanitation households), X5 (number of doctors), and X7 (average years of schooling), which denote several socio-economic statistical values. The RMSE of the RFR was found to be 84.04, with the same significant influence variables X3, X5, and X7.

In a recent study by Yifan Wang [28] to find the strongest factors that affect life expectancy across several countries and continents, 24 different variables were considered from 2000 to 2015 in nearly 200 countries and six continents. The factors include the Adult Mortality Rate, Income, and HIV/AIDS. North America was found to be the least affected by these factors, while Asia was found to be the most affected. On applying the Random Forest model, it was found that every continent has different affecting factors. The XGBoost model had the lowest SHAP values for Haiti in North America and Sierra Leone in Africa, with life expectancy below 42 years, and the highest SHAP values for Canada, Germany, and Ireland, with an average life expectancy of 80 years and above.

In conclusion, HIV/AIDS was found to be negatively correlated with income and adult mortality, more schooling, higher SHAP values, higher income, and lower adult mortality. Xinyang He et al. [29] performed a recent study to find the factors affecting life expectancy, using the “Multiple Linear Regression” (MLR) model to predict life expectancy. Eighteen different factors were considered in the regression model, such as GDP, schooling, and disease.

When the MLR model was applied, disease, income, and schooling were found to be the most influential factors, with an RMSE of 3.818. Furthermore, the authors split their study into two parts, developing nations and developed nations. For developing nations, the most important factors were medical factors and schooling, and the RMSE was 3.99; for developed nations, the RMSE was 2.57.

The gap identified was that pre-processing techniques were not mentioned in some studies. In addition, one of the studies [18] replaced NULL values with 0. Treating real-life parameters such as Adult Mortality and BMI as zero is unrealistic.

Therefore, the need for a study to perform an extensive comparative analysis across various models and pre-processing techniques was identified to understand how different models behave based on different imputations.

This study was carried out with eleven different algorithms, namely ‘Linear Regression (LR)’, ‘Support Vector Regression (SVR)’, ‘Random Forest (RF)’, ‘Decision Tree (DT)’, ‘Polynomial Regression (POLY)’, ‘Logistic Regression (LOGI)’, ‘Principal Component Regression (PRIN)’, ‘Gradient Boosting (GB)’, ‘Surrogate Tree (ST)’, ‘Ridge Regression (RIDG)’, and ‘Multi-Layer Perceptron (MLP) Regressor’, and five different pre-processing techniques, which lead to 55 different combinations of results.

Each result was evaluated based on several error metrics to analyze the extent of error, under-fitting, over-fitting, and other problems that can lead to inaccurate predictions by the model. As each error metric has a different behavior from pre-processing changes, it is important to analyze how it behaves to fine-tune the model properly to find the best pre-processing technique for each algorithm. Specific hyper-parameters were changed using the trial-and-error method to further improve the results.

3. Materials and methods

Table 1
Dataset’s composition used in the study.

Column name	Null count
Year	0
Life expectancy	10
Adult mortality	10
Infant deaths	0
Alcohol	194
Percentage expenditure	0
Hepatitis B	553
Measles	0
BMI	34
Under-five deaths	0
Polio	19
Total expenditure	226
Diphtheria	19
HIV/AIDS	0
GDP	448
Population	652
Thinness 1–19 years	34
Thinness 5–9 years	34
Income composition of resources	167
Schooling	163
Status_Developed	0
Status_Developing	0

The dataset utilized in the study consisted of 22 columns, including life expectancy. Twenty of these columns held continuous data, while the categorical ‘Status’ column, which indicated whether a country was ‘Developed’ or ‘Developing’, was divided into two separate binary columns. As a result, the dataset contained 22 columns, with 20 having continuous values and 2 representing the binary status. The data was sourced from Kaggle [30], and Table 1 provides a detailed breakdown of the dataset’s composition utilized in this study.

The techniques used in this study were ‘Linear Regression (LR)’, ‘Support Vector Regression (SVR)’, ‘Random Forest (RF)’, ‘Decision Tree (DT)’, ‘Polynomial Regression (POLY)’, ‘Logistic Regression (LOGI)’, ‘Principal Component Regression (PRIN)’, ‘Gradient Boosting (GB)’, ‘Surrogate Tree (ST)’, ‘Ridge Regression (RIDG)’, and ‘Multi-Layer Perceptron (MLP) Regressor’. The following is a brief description of these techniques.

3.1. Linear regression

Linear regression is an approach for establishing a relationship between dependent and independent variables. There are two types of linear regression: simple linear regression, with only one independent variable, and multiple linear regression, with two or more independent variables that contribute to the relationship. The model tries to fit the straight line across the data points that best represents this relationship.

3.2. Support vector regression

Support Vector Machine (SVM) is a supervised ML algorithm that can be used for both regression and classification. It aims to find the best line or decision boundary, called a hyperplane, which is determined by the extreme points or vectors (the support vectors). The upper boundary line is called the positive hyperplane and the lower one is called the negative hyperplane. There are two types of SVM: linear and non-linear. Linear SVM is used on datasets that can be separated into two parts by a single straight line, whereas non-linear SVM is used whenever the data cannot be separated by a single straight line.

3.3. Decision tree regression

A Decision Tree is a flowchart-like structure comprising nodes and branches. Each internal node represents an attribute, each leaf node represents a class label (or, in regression, a predicted value), and each branch represents an outcome. It works by recursively splitting the data into subsets based on the most significant feature at each node.

3.4. Random forest regression

Random Forest is an ensemble machine-learning algorithm built from decision trees. During training, it creates multiple subsets of the original dataset using bootstrapping and fits a separate decision tree to each subset. During prediction, it combines the trees by averaging their outputs for regression (or by taking the majority vote for classification), which provides a more accurate prediction than a single tree. Ensemble models commonly use one of two methods: bagging, used by Random Forest, in which different training subsets of the original data are created for training the model, and boosting, in which weak learners are combined sequentially into a strong learner, as in AdaBoost and XGBoost.
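The averaging step described above can be illustrated with a minimal sketch. It assumes scikit-learn's RandomForestRegressor and uses synthetic data; the data, the number of trees, and the variable names are illustrative assumptions and not the configuration used in this study.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Synthetic data (assumption): 200 samples, 4 features, non-linear target.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 4))
y = 3.0 * X[:, 0] + X[:, 1] ** 2 + 0.1 * rng.normal(size=200)

# bootstrap=True makes every tree train on a bootstrap sample (bagging).
forest = RandomForestRegressor(n_estimators=25, bootstrap=True, random_state=0)
forest.fit(X, y)

# The forest's regression output is the mean of the individual tree outputs.
per_tree = np.stack([tree.predict(X) for tree in forest.estimators_])
averaged = per_tree.mean(axis=0)
```

Comparing `averaged` with `forest.predict(X)` shows that the ensemble prediction is the plain average over the bootstrapped trees.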

3.5. Polynomial regression

Polynomial regression establishes a non-linear relationship between variables, modelled as an n-degree polynomial, where n is a positive integer. The non-linear nature of polynomial regression gives it the flexibility to better capture the relationship between the dependent variable and its explanatory variables.
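As a small illustration, polynomial regression can be sketched as a linear regression on polynomial features. The example assumes scikit-learn's PolynomialFeatures and LinearRegression and a noise-free degree-2 toy target; the degree and the data are assumptions made only for illustration.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

# Toy data (assumption): y = 1 + 2x - 0.5x^2 without noise.
x = np.linspace(-2.0, 2.0, 50).reshape(-1, 1)
y = 1.0 + 2.0 * x[:, 0] - 0.5 * x[:, 0] ** 2

# Expand x into [x, x^2] and fit an ordinary linear model on those terms.
model = make_pipeline(
    PolynomialFeatures(degree=2, include_bias=False),
    LinearRegression(),
)
model.fit(x, y)

coefs = model.named_steps["linearregression"].coef_
intercept = model.named_steps["linearregression"].intercept_
```

On this noise-free data the fitted coefficients recover the generating polynomial, which shows that the model stays linear in its parameters even though it is non-linear in x.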

3.6. Principal component regression

This technique combines multiple linear regression and principal component analysis (PCA). It transforms the original explanatory variables into linearly uncorrelated variables called principal components, which are arranged in descending order of the variance they explain. Multiple linear regression is then performed on a preferred number of leading components to predict the outcome, which in this study was life expectancy.
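A minimal sketch of principal component regression is given below, assuming a scikit-learn pipeline of StandardScaler, PCA, and LinearRegression. The synthetic data and the choice of four components are assumptions for illustration, not the settings used in this study.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data (assumption): 6 predictors, one nearly duplicating another.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 6))
X[:, 3] = X[:, 0] + 0.05 * rng.normal(size=200)
y = 2.0 * X[:, 0] - 1.0 * X[:, 1] + 0.1 * rng.normal(size=200)

# Scale, project onto the 4 leading principal components, then regress.
pcr = make_pipeline(StandardScaler(), PCA(n_components=4), LinearRegression())
pcr.fit(X, y)

# Share of predictor variance captured by each retained component.
explained = pcr.named_steps["pca"].explained_variance_ratio_
predictions = pcr.predict(X)
```

The number of components is the main tuning choice: fewer components remove more collinearity but may discard directions that matter for the outcome.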

3.7. Logistic regression

Logistic Regression is designed to predict binary outcomes or outcomes divided into categories. It is not useful to predict continuous values using this algorithm; however, the task can be performed by transforming the problem into a classification task. After producing the outcome for some data, the outcome can be used to classify the results and calculate the weighted average of life expectancy for that particular task. Thus we could predict continuous values using logistic regression.
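One possible reading of this transformation is sketched below: the continuous target is binned into classes, a logistic regression predicts class probabilities, and the prediction is the probability-weighted average of the bin means. The quantile binning, the number of bins, and the synthetic data are assumptions; the source does not specify these details.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Synthetic data (assumption): target loosely resembling life expectancy.
rng = np.random.default_rng(0)
X = rng.normal(size=(300, 3))
y = 70.0 + 5.0 * X[:, 0] - 3.0 * X[:, 1] + rng.normal(size=300)

# Turn the continuous target into 5 ordered classes using quantile edges.
n_bins = 5
edges = np.quantile(y, np.linspace(0.0, 1.0, n_bins + 1)[1:-1])
labels = np.digitize(y, edges)  # integer class labels 0 .. n_bins - 1
bin_means = np.array([y[labels == k].mean() for k in range(n_bins)])

# Classify, then convert class probabilities back to a continuous value.
clf = LogisticRegression(max_iter=1000).fit(X, labels)
proba = clf.predict_proba(X)            # columns follow clf.classes_
y_hat = proba @ bin_means[clf.classes_]  # probability-weighted average
```

Because each prediction is a convex combination of the bin means, it can never fall outside the range of those means, which is one limitation of this workaround.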

3.8. Gradient boosting regression

The gradient boosting model creates a strong model by combining several weak models. These weak models are usually decision trees; however, any algorithm can be used. Each weak model is trained on the residuals, that is, the differences between the actual and predicted values, and gradient descent is used to improve the model in each iteration.
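The residual-fitting loop can be sketched by hand as follows. This is a simplified illustration for squared-error loss using shallow scikit-learn decision trees; the learning rate, tree depth, number of rounds, and data are assumptions and do not reproduce the GradientBoostingRegressor settings of this study.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Synthetic data (assumption): noisy sine curve.
rng = np.random.default_rng(0)
X = rng.uniform(-3.0, 3.0, size=(200, 1))
y = np.sin(X[:, 0]) + 0.1 * rng.normal(size=200)

learning_rate = 0.1
prediction = np.full_like(y, y.mean())  # start from the mean prediction
trees = []
mse_history = [np.mean((y - prediction) ** 2)]

for _ in range(50):
    # For squared error, the residual is the negative gradient of the loss.
    residual = y - prediction
    tree = DecisionTreeRegressor(max_depth=2, random_state=0).fit(X, residual)
    # Take a small step in the direction suggested by the weak learner.
    prediction += learning_rate * tree.predict(X)
    trees.append(tree)
    mse_history.append(np.mean((y - prediction) ** 2))
```

The recorded `mse_history` shows the training error shrinking as more weak learners are added.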

Figure 1. Null instances in the dataset.

Figure 2. Schooling vs life expectancy.

Figure 3. HIV/AIDS vs life expectancy.

Figure 4. Correlation matrix.

3.9. Surrogate tree

Surrogate tree is a tree-based imputation technique that supports tree-based algorithms. It focuses on the nodes where the model struggles in order to improve overall predictions. It is used when the standard model struggles to produce accurate results, so combining the surrogate and main trees may yield a better result. In this study, we used the IterativeImputer class from sklearn, which acts as an imputer for multivariate datasets and considers every feature while performing the imputation.
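A minimal sketch of tree-based multivariate imputation with IterativeImputer is shown below. Using a DecisionTreeRegressor as the estimator, along with the depth, iteration count, and synthetic data, is an assumption made for illustration; the source only states that IterativeImputer was used.

```python
import numpy as np
# IterativeImputer is experimental and must be enabled explicitly.
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer
from sklearn.tree import DecisionTreeRegressor

# Synthetic data (assumption): third column depends on the first two.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
X[:, 2] = X[:, 0] + X[:, 1]

# Knock out roughly 10% of the entries at random.
mask = rng.random(X.shape) < 0.1
X_missing = X.copy()
X_missing[mask] = np.nan

# Each feature with gaps is modelled from the other features by a tree.
imputer = IterativeImputer(
    estimator=DecisionTreeRegressor(max_depth=5, random_state=0),
    max_iter=10,
    random_state=0,
)
X_filled = imputer.fit_transform(X_missing)
```

Observed entries are left unchanged; only the missing cells are replaced by model-based estimates.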

3.10. Ridge regression

Ridge regression is a statistical technique for estimating the unknown parameters of a linear regression model. It extends linear regression by penalizing large coefficients, which shrinks them and encourages the model to take weaker features into account. Its optimization algorithm minimizes the combined loss function (the squared error plus the penalty), resulting in a more stable solution when the dataset is noisy.
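For reference, the combined loss that ridge regression minimizes is commonly written as follows, where the symbols are assumptions introduced here: $x_i$ is the feature vector of observation $i$, $\beta$ is the coefficient vector with $p$ entries, and $\alpha \ge 0$ is the penalty strength.

$\begin{aligned} \hat{\beta} = \arg\min_{\beta} \sum_{i = 1}^{n} (y_{i} - x_{i}^{\top} \beta)^{2} + \alpha \sum_{j = 1}^{p} \beta_{j}^{2} \end{aligned}$

Setting $\alpha = 0$ recovers ordinary least squares, while larger values of $\alpha$ shrink the coefficients more strongly toward zero.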

3.11. Multilayer perceptron regressor

An MLP regressor is an artificial neural network used to perform regression. It consists of multiple layers of neurons, which are used to predict continuous numerical values. The MLP regressor has an input layer, an output layer, and one or more intermediate hidden layers.

The data used in the analysis were obtained from the World Health Organization and sourced from Kaggle [30]. The data consisted of 193 countries with various parameters that could be used to describe life expectancy. The dataset consists of 2937 records across 22 parameters. The dataset was divided into 80% training and 20% testing based on randomization techniques. This was performed to prevent bias while selecting the data for the training and testing sets. The study was performed using Python version 3.11.2 to perform all analyses, calculations, and predictions.

Before training the regression models on the dataset, the NULL values were replaced with four different values to determine which value is best suited for each regression model and how they behave while replacing the NULL value with various values. The values used to replace the NULL values were as follows:

1. Mean
2. Median
3. Mode
4. Interpolation Value

The fifth pre-processing technique that we used was to drop the records containing NULL values and train the regression model without them.
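The five pre-processing options can be sketched with Pandas as follows. The helper name `impute`, the toy values, and the use of linear interpolation are assumptions for illustration; the study's exact calls are not given in the text.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the real data; the values are made up.
df = pd.DataFrame({
    "Alcohol": [1.0, np.nan, 3.0, 3.0, np.nan, 8.0],
    "GDP": [100.0, 200.0, np.nan, 400.0, 400.0, 600.0],
})


def impute(frame: pd.DataFrame, strategy: str) -> pd.DataFrame:
    """Return a copy of `frame` with NULL values handled by `strategy`."""
    if strategy == "drop":
        return frame.dropna()
    if strategy == "mean":
        return frame.fillna(frame.mean())
    if strategy == "median":
        return frame.fillna(frame.median())
    if strategy == "mode":
        # mode() may return several rows; use the first mode per column.
        return frame.fillna(frame.mode().iloc[0])
    if strategy == "interpolation":
        # Linear interpolation between the neighbouring rows.
        return frame.interpolate(method="linear", limit_direction="both")
    raise ValueError(f"unknown strategy: {strategy}")


strategies = ["drop", "mean", "median", "mode", "interpolation"]
variants = {name: impute(df, name) for name in strategies}
```

Each entry of `variants` can then be passed to the same model-training code, so that only the imputation differs between runs.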

For data-handling operations, we used the Pandas library. It provides a data structure called a dataframe, which stores its data in arrays that are essentially NumPy arrays. NumPy is another important library used to perform mathematical operations on complex structures such as arrays and matrices. The combination of NumPy and Pandas helped us clean the data and perform several operations for proper structuring before providing it to the regression model.

For visualization purposes, Matplotlib and Seaborn were used. Matplotlib is a visualization library that is typically used to plot two-dimensional graphs and visualize data. Seaborn, which is based on Matplotlib and integrates well with the dataframe, was used to create multidimensional visualizations.

In the initial phase, we applied the pre-processing. To do so, we determined the instances of NULL data in the dataset, as suggested by the diagram shown in Fig. 1.

An experimental exploratory analysis was performed to ensure that the data had parameters that were correlated with life expectancy to some extent. To do so, we plotted scatterplots of several parameters with respect to the life expectancy.

A linear pattern was observed in some scatterplots, suggesting some correlation between the parameters. The dependent variable was placed on the y-axis and the independent variables were placed on the x-axis while plotting the scatterplot. Figure 2 suggests a strong positive correlation between schooling and life expectancy.

The plot indicates that life expectancy increases as the schooling of children increases. Another scatter plot, shown in Fig. 3, was used to assess whether HIV/AIDS affected life expectancy. Countries with high HIV/AIDS cases tend to have lower life expectancies than countries with a relatively lower count of HIV/AIDS. These scatterplots indicate that both socioeconomic and health-related factors can affect life expectancy. This analysis also confirms the presence of correlation in our dataset. Thus, to assess the correlation for all parameters available in the dataset, we visualized a correlation matrix, which is shown in Fig. 4.

The dataset consisted of a categorical variable, that is, the development status of a country. It states whether the country is already developed represented by the value ‘Developed’ or is in a state of developing, represented by the value ‘Developing’. This parameter also has to be fitted into the regression models, but to make the models consider them, we have to make these string values machine-understandable.

To solve this problem, we used one hot encoding. This technique uses numerical values to represent the categorical data [31,32,33]. This proves to be of great help because most regression models require numeric input. To apply one hot encoding to our dataset, we used the ‘get_dummies’ method from the Pandas library.
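A minimal sketch of this encoding step is shown below, using a three-row toy frame. The toy values and the `dtype=int` argument are assumptions for illustration; the resulting column names match those listed in Table 1.

```python
import pandas as pd

# Toy frame (assumption) with the categorical 'Status' column.
df = pd.DataFrame({
    "Status": ["Developed", "Developing", "Developing"],
    "Schooling": [15.2, 9.8, 11.1],
})

# One binary column is created per category of 'Status'.
encoded = pd.get_dummies(df, columns=["Status"], dtype=int)
```

After encoding, `encoded` holds the columns `Schooling`, `Status_Developed`, and `Status_Developing`, and the original string column is gone.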

To begin our comparative analysis, we first split the data into X and Y parts in order to establish a relationship between them. X contains all parameters except life expectancy, whereas Y contains only the life expectancy. The dataset was then divided into two parts for training and testing. To achieve this, we used the ‘train_test_split’ function from the scikit-learn library, passing the X and Y data and setting the ‘test_size’ parameter to 0.2, meaning that 20% of the dataset is allocated for testing and the remaining 80% for training the model.
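The split can be sketched as follows on a synthetic frame. The column values are made up, and the `random_state` argument is an assumption added here so that repeated runs give the same split; the text does not state that a seed was fixed in the study.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic stand-in (assumption) for the pre-processed dataset.
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "Schooling": rng.uniform(0, 20, size=50),
    "Adult Mortality": rng.uniform(50, 500, size=50),
    "Life expectancy": rng.uniform(40, 90, size=50),
})

X = df.drop(columns=["Life expectancy"])  # all predictors
y = df["Life expectancy"]                 # target only

# 80% training, 20% testing; random_state pins the shuffle.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)
```

Fixing `random_state` makes the comparison between imputations repeatable, because every model then sees the same training and testing rows.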

The models were trained and their results were recorded in a tabular format. The following performance metrics were used:
1. R2 Score
2. Adjusted R2 Score
3. Mean Absolute Error
4. Mean Squared Error
5. Root Mean Squared Error

All of the above-mentioned metrics were recorded for comparison, since no single metric perfectly depicts the performance of the model.
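The five metrics listed above can be computed with a short helper such as the one below. The function name `regression_metrics` and the toy values are hypothetical and serve only as a worked example.

```python
import math


def regression_metrics(y_true, y_pred, n_features):
    """Compute MAE, MSE, RMSE, R2, and adjusted R2 for one model."""
    n = len(y_true)
    errors = [t - p for t, p in zip(y_true, y_pred)]
    mae = sum(abs(e) for e in errors) / n
    mse = sum(e * e for e in errors) / n
    rmse = math.sqrt(mse)
    mean_true = sum(y_true) / n
    ss_res = sum(e * e for e in errors)                  # regression line
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)   # mean line
    r2 = 1 - ss_res / ss_tot
    adj_r2 = 1 - (1 - r2) * (n - 1) / (n - n_features - 1)
    return {"MAE": mae, "MSE": mse, "RMSE": rmse, "R2": r2, "Adj R2": adj_r2}


# Toy example: four observations and one independent variable.
scores = regression_metrics([60, 70, 80, 90], [62, 69, 78, 91], n_features=1)
```

For this toy input the errors are -2, 1, 2, and -1, giving an MAE of 1.5, an MSE of 2.5, an R2 of 0.98, and an adjusted R2 of 0.97.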

The Mean Absolute Error shown in Eq. (1) is calculated by adding all the absolute differences between the actual outputs $y_{i}$ and the predicted outputs $y_{i}^{'}$ and dividing this value by the total number of observations $n$.
$\begin{aligned} MAE = \frac{1}{n} \sum_{i = 1}^{n} | y_{i} - y_{i}^{'} | \end{aligned}$
(1)

Table 2
Quality metrics using linear regression.

Metric	Drop NaN	Mean	Median	Mode	Interpolation (lerp)
RMSE	3.426	4.162	4.16	4.146	3.428
MSE	11.736	17.322	17.304	17.189	11.748
MAE	2.569	3.098	3.090	3.111	2.637
R2	0.849	0.808	0.808	0.809	0.840
Adj R2	0.839	0.801	0.801	0.802	0.829

Table 3
Quality metrics using support vector regression.

Metric	Drop NaN	Mean	Median	Mode	Interpolation (lerp)
RMSE	2.400	2.499	2.621	2.635	2.414
MSE	5.761	6.249	6.874	6.944	5.832
MAE	1.518	1.532	1.590	1.625	1.513
R2	0.933	0.916	0.909	0.925	0.911
Adj R2	0.929	0.913	0.906	0.922	0.907

Table 4
Quality metrics using random forest.

Metric	Drop NaN	Mean	Median	Mode	Interpolation (lerp)
RMSE	1.455	1.65	1.662	1.639	1.764
MSE	2.118	2.721	2.760	2.687	3.111
MAE	0.914	0.989	0.972	0.976	1.146
R2	0.970	0.968	0.967	0.968	0.960
Adj R2	0.968	0.966	0.965	0.966	0.957

Table 5
Quality metrics using decision tree.

Metric	Drop NaN	Mean	Median	Mode	Interpolation (lerp)
RMSE	2.287	2.3	2.455	2.34	2.354
MSE	5.230	5.288	6.026	5.475	5.542
MAE	1.413	1.352	1.470	1.397	1.466
R2	0.936	0.938	0.930	0.936	0.925
Adj R2	0.932	0.934	0.925	0.932	0.920

Table 6
Quality metrics using polynomial regression.

Metric	Drop NaN	Mean	Median	Mode	Interpolation (lerp)
RMSE	3.7237	3.9089	4.0338	4.1883	4.2906
MSE	13.866	15.28	16.271	17.541	18.409
MAE	2.841	2.883	3.0669	3.1193	3.2436
R2	0.8327	0.8366	0.8135	0.8004	0.8074
Adj R2	0.8213	0.8305	0.8065	0.793	0.8002

Table 7
Quality metrics using logistic regression.

Metric	Drop NaN	Mean	Median	Mode	Interpolation (lerp)
RMSE	3.2069	4.1064	4.098	3.8927	3.8651
MSE	10.284	16.863	16.794	15.153	14.939
MAE	2.4964	3.0778	3.0646	2.9331	2.9481
R2	0.8492	0.8146	0.8219	0.831	0.8321
Adj R2	0.8389	0.8077	0.8153	0.8248	0.8259

Table 8
Quality metrics using principal component regression.

Metric	Drop NaN	Mean	Median	Mode	Interpolation (lerp)
RMSE	3.4029	3.9979	3.9411	3.9886	4.1755
MSE	11.58	15.984	15.533	15.909	17.435
MAE	2.6456	2.9508	2.9775	3.0575	3.0361
R2	0.8485	0.8226	0.8202	0.8377	0.8069
Adj R2	0.8382	0.816	0.8136	0.8317	0.7997

Table 9
Quality metrics using gradient boosting regression.

Metric	Drop NaN	Mean	Median	Mode	Interpolation (lerp)
RMSE	2.0912	2.1019	2.3304	2.2908	2.3785
MSE	4.373	4.4178	5.4308	5.2479	5.6572
MAE	1.4731	1.5316	1.6512	1.6031	1.6479
R2	0.9454	0.9497	0.9437	0.939	0.9378
Adj R2	0.9417	0.9478	0.9417	0.9368	0.9355

Table 10
Quality metrics using MLP regressor.

Metric	Drop NaN	Mean	Median	Mode	Interpolation (lerp)
RMSE	2.0816	2.3791	2.2435	2.5704	2.5236
MSE	4.3331	5.6603	5.0334	6.6068	6.3684
MAE	1.4375	1.6218	1.6046	1.6876	1.6725
R2	0.9508	0.94	0.9464	0.9221	0.9256
Adj R2	0.9475	0.9378	0.9444	0.9192	0.9228

Table 11
Quality metrics using surrogate tree regression.

Metric	Drop NaN	Mean	Median	Mode	Interpolation (lerp)
RMSE	1.7683	3.2256	1.1758	0.9604	0.9578
MSE	2.0146	4.1907	1.3054	0.9540	0.9519
MAE	1.9528	3.8967	1.2883	0.9567	0.9547
R2	0.9634	0.9410	0.9468	0.9451	0.9426
Adj R2	0.9495	0.9778	0.9523	0.9428	0.9399

Table 12
Quality metrics using ridge regression.

Metric	Drop NaN	Mean	Median	Mode	Interpolation (lerp)
RMSE	1.9250	3.7793	1.3066	0.9556	0.9526
MSE	2.1968	4.9255	1.4636	0.9470	0.9449
MAE	2.0981	4.4651	1.4465	0.9516	0.9496
R2	0.9376	0.9269	0.9097	0.9336	0.9309
Adj R2	0.9337	0.9541	0.9409	0.9342	0.9314

The Mean Squared Error shown in Eq. (2) is calculated in a similar fashion to the MAE, except that each difference is squared before summation; the sum is then divided by the total number of observations.
$\begin{aligned} MSE = \frac{1}{n} \sum_{i = 1}^{n} (y_{i} - y_{i}^{'})^{2} \end{aligned}$
(2)

The Root Mean Squared Error given in Eq. (3) can be derived simply by taking the square root of the MSE.
$\begin{aligned} RMSE = \sqrt{\frac{1}{n} \sum_{i = 1}^{n} (y_{i} - y_{i}^{'})^{2}} \end{aligned}$
(3)

Figure 5. Root mean squared error.

Figure 6. Mean squared error.

Figure 7. Mean absolute error.

Figure 8. R2 score.

Figure 9. Adjusted R2 score.

The R2 score given in Eq. (4) is a performance metric calculated using the square sum error of the regression line and the square sum error of the mean line.
$\begin{aligned} R^{2} = 1 - \frac{SSR}{SSM} \end{aligned}$
(4)

where SSR is the square sum error of the regression line and SSM is the square sum error of the mean line.

The Adjusted R2 score as given in Eq. (5) was calculated using the R2 score, total sample size (T), and number of independent variables (N).
$\begin{aligned} R^{2}_{adj} = 1 - \frac{(1 - R^{2}) (T - 1)}{T - N - 1} \end{aligned}$
(5)

4. Results and discussion

The proposed approach compares eleven regression methods: Linear Regression, Support Vector Regression, Random Forest, Decision Tree, Polynomial, Logistic, Principal Component, Gradient Boosting, Surrogate Tree, Ridge, and MLP regression. Each method was run with five different pre-processing techniques, which leads to 55 different combinations of results.

The results obtained using the eleven regression methods on five different quality parameters are shown in Tables 2–12.

The DropNaN pre-processing, that is, dropping the records with NULL values, produced the lowest RMSE for nine of the eleven models (Tables 2–10), as it best preserves the behavior of the dataset. Linear regression establishes a relation between the data points by fitting a line. When the NULL values are replaced with interpolation values, each missing value is filled with a point on the line between its neighboring values. The interpolation values therefore help linear regression, as they continue the linear trend that the model fits; in Table 2, interpolation comes closest to the Drop NaN result.

Tree-based models, such as random forest and decision tree regression, use mean values to make predictions. Replacing the NULL values with mean values is consistent with this behavior and prevents the bias that some other imputation could have introduced.

Furthermore, to compare the quality parameters across the regression methods, all the quality parameters were plotted, as shown in Figs 5 to 9.

5. Conclusion and future scope

We implemented all the selected regression models and applied the pre-processing techniques to each of them. We also tracked various performance metrics to understand how each model behaves under different imputations.

Other imputations, such as end-of-tail imputation, could be used with the models and may yield better results. Moreover, the method of splitting data between training and testing can be further optimized: each time the method is called, the split is randomized, resulting in a variance of 1–2% between iterations of the results. If the split can be made to produce the same randomized training and testing values on every call, this problem can be eliminated. In the future, more parameters related to life expectancy will be considered, along with more regression and classification techniques, to analyze life expectancy better. The current study is limited to 22 factors and 11 imputation techniques; in the future, more than 22 factors affecting life expectancy will be considered for a more concrete analysis.
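One common way to obtain the same randomized training and testing values on every call is to fix the seed of the random number generator. The following is a minimal sketch under that assumption; the function `train_test_split_seeded`, the 20% test fraction, and the seed value are illustrative and not taken from the study.

```python
import random


def train_test_split_seeded(rows, test_fraction=0.2, seed=42):
    """Split rows into (train, test) with a shuffle that is repeatable per seed."""
    rng = random.Random(seed)          # private generator, fixed seed
    indices = list(range(len(rows)))
    rng.shuffle(indices)
    n_test = int(round(len(rows) * test_fraction))
    test_idx = set(indices[:n_test])
    train = [r for i, r in enumerate(rows) if i not in test_idx]
    test = [r for i, r in enumerate(rows) if i in test_idx]
    return train, test


rows = list(range(10))
first = train_test_split_seeded(rows)
second = train_test_split_seeded(rows)
print(first == second)  # the same seed gives the same split on every call
```

In scikit-learn, passing a fixed `random_state` to `train_test_split` serves the same purpose. Averaging the metrics over several seeds, rather than fixing one, is an alternative way to report the 1–2% spread instead of hiding it.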
