Abstract
In today's competitive business landscape, retaining employees has become one of the most important yet difficult tasks for any organization. Retaining top-performing employees not only improves organizational performance but also reduces recruitment costs. In this study, the authors investigate the major drivers of employee attrition using machine learning algorithms applied to the well-proven and validated IBM HR data set. Although the data set labels each sample with a binary target variable (attrited or non-attrited), the work presented in this paper introduces a three-way labelling: (1) likely to leave, (2) on the verge of leaving, and (3) will stay. The data set is evaluated with ten widely used machine learning algorithms, and a comparative analysis is made between them based on various factors. The best model achieved a prediction accuracy of over 85%. Insights and recommendations are provided at the end to help managers proactively identify at-risk employees and implement effective retention strategies.
Introduction
With the Great Resignation becoming a buzzword in recent times, it is no wonder that understanding attrition has grown in importance from a managerial viewpoint. Though the topic has been studied in the past, it is all the more relevant in the current scenario, given the recent pandemic and the upsurge of resignations across industries and domains [7]. The older generation's perspective on job gratification has changed drastically, and a noticeable change can be observed in the general outlook towards job satisfaction: employees no longer seek to stay in one organization for a long duration. The current generation is always seeking the next best opportunity and will not hesitate to leave a current job for a better offer. The word "better" may mean different things: closer to home, better work-life balance, role satisfaction, a salary hike, and the list goes on. How can management improve employee retention by ascertaining which employees are on the verge of leaving and adopting policies and strategies that discourage them from doing so? In light of technological advancements and notable progress in Artificial Intelligence, there is an abundance of algorithms that can be used to forecast attrition from available data. Although there are many industrial sectors, this paper focuses on jobs in the IT and healthcare sectors, as covered by the IBM data set [21, 22].
In this study, disparate sectors are surveyed (such as IT and healthcare). A diverse range of computational techniques, including artificial neural networks and random forests, then undergoes a comparative assessment to determine the most fitting algorithm for each specific industry. The suitability of an algorithm for an industry depends on how substantially it improves the accuracy of attrition prediction. With this information, an enterprise can make precise projections regarding employee turnover and implement appropriate measures that encourage workers with outstanding potential to stay. Moreover, such strategies save funds by avoiding unnecessary expenditure on finding competent replacement candidates.
The paper is organised as follows. Section 2 provides a short review of past work. Section 3 describes the general methodology used for every machine learning algorithm applied to the IBM data set; it also describes all the features of the data set. Results are discussed in Section 4, where relative inferences are drawn and the best features are obtained based on the feature score. The paper concludes in Section 5.
Review of associated works
Attrition is not a recent issue; it has plagued industries for a long time. It is more evident than ever that companies need to retain talent to gain a competitive edge. Employees nowadays are more aware of what they want and will often leave if they are not satisfied with their current job. Thus, targeting and retaining those employees is becoming more vital by the day for an organization. The authors surveyed various factors that affect attrition, how machine learning can be used to accurately predict which employees are at risk of attrition, and what measures can be taken to prevent it. Related works are summarised briefly below.
Sunil Ramlall [1, 2] studied the factors that affect attrition in an organization. The author found that factors like “Salary, Lack of challenge and opportunity, Lack of career advancement opportunities, Lack of recognition” are the most important factors leading to attrition. In a later study, the same author examined how motivating employees in line with various motivation theories can help an organization retain its employees. Some examples of such theories are “Need theories of Motivation, Maslow’s Need Hierarchy Theory, Equity Theory,” etc.
Further, Das and Baruah [3] reviewed the literature on the strategies that an organization can adopt to retain employees. They also proposed a model that can increase employee satisfaction and lead to employee retention. The model is depicted in Fig. 1.

The employee retention and job satisfaction model [3].
Some researchers used statistical tools to identify the factors behind attrition. In this context, Inayat et al. [8] used descriptive statistics to examine the relation between job satisfaction and employee performance; their study showed that lower job satisfaction leads to a dip in performance. Similar studies, such as those conducted by Zhang et al. [6] and Dorta-Afonso et al. [7], also supported this relationship between job satisfaction and performance. It is vital to manage job satisfaction because the association between job satisfaction and employee attrition might serve as a substitute for the relationship between job satisfaction and performance.
Researchers also started applying machine learning algorithms to predict attrition accurately from a given data set. Adhikari et al. [9] used a multiple regression approach to model attrition in the IT industry. They also made use of Principal Component Analysis, cluster analysis and factor analysis to find the important factors that affect attrition. Khare et al. [10] used logistic regression to predict attrition. They also used statistical methods like ANOVA to determine the factors of attrition. Zhuang and Pan [13], seeking the factors that influence employee satisfaction, used linear regression, quantile regression and descriptive statistics to unearth those factors. Here one can take employee satisfaction as a surrogate for attrition. Alao and Adeyemo [11] used decision tree classifiers to predict employee attrition. Eduvie et al. [12] also used decision tree algorithms and carried out a comparative study of three notable algorithms, namely ID3, C4.5 and C5.0. Alsheref et al. [14] built an ensemble model to predict attrition, using machine learning techniques like Artificial Neural Networks.
Many researchers then began comparing different machine learning algorithms to find out which works best. Nagadevara et al. [15] carried out a comparative study of Artificial Neural Networks (ANN), Logistic Regression and Classification and Regression Trees (CART) version C5.0, and reported that the best of these achieved an accuracy of almost 90% for their use case. Ajit [16] then carried out a comparative study of seven methods, namely XGBoost, Logistic Regression, Naïve Bayes, Random Forest, Support Vector Machines (SVM), Linear Discriminant Analysis (LDA) and K-Nearest Neighbours (KNN); according to his findings, XGBoost was the best. Frierson et al. [17] carried out similar experiments comparing Decision Tree, Logistic Regression, SVM, Gaussian Naïve Bayes, KNN and Neural Networks (NN), and found that Logistic Regression worked best for their data set. Francesca et al. [18] similarly studied Gaussian NB, Bernoulli NB, Logistic Regression, KNN, Decision Tree, Random Forest, SVC and Linear SVC, and identified Gaussian NB as the best for their needs. Instead of focusing on accuracy, they focused on recall, as they needed to minimise the number of false negatives, i.e., employees who will probably leave but are misclassified as staying. Jain et al. [19] compared not only the machine learning techniques but also how they perform across various departments. They used Decision Tree, Random Forest and SVM and determined that Random Forest worked best for them. Emmanuel-Okereke et al. [20] compared SVM and Gaussian NB and found that Gaussian NB was superior to SVM. As one can see from all these experiments, different algorithms prove superior in different situations.
What distinguishes the work presented in this paper from earlier works is that the earlier works aim only to predict the cohort of employees who are making up their mind to leave. The work presented in this paper is therefore not a traditional bi-class classification problem; rather, it is a tri-class classification problem, in which the authors draw a finer distinction within the cohort of employees who are leaving (will attrite). Hence, the paper proposes to predict which of three classes an employee belongs to, as against traditional bi-class classification. These three classes are: Class 0: employees most likely to leave; Class 1: employees on the verge of leaving; Class 2: employees who will stay. Although it will be difficult to hold on to persons in Class 0, efforts and policies can be framed to retain employees belonging to Class 1.
The work presented in this paper implements the above concepts with various machine learning algorithms. A comparative study of ten algorithms has been carried out to find out which algorithm works best in each specific case. The authors are also interested in finding the factors that affect attrition and in providing suggestions, based on those factors, to retain employees. The authors also want to identify the cohort of employees who have not yet made up their mind to leave or stay, and to determine how one can retain them. Those who have already made up their mind to leave will often not stay no matter what; thus, identifying and targeting the undecided cohort will be more beneficial to the organization. The ultimate goal is to identify a strategy that helps organizations identify and retain talent, thereby reducing attrition.
For this study, a well-proven and validated data set (provided by IBM) is used, and ten widely used machine learning algorithms are tested on it to determine which is best for the use case. A flow chart of the process is depicted in Fig. 2.

Flow chart depicting the steps in processing the data set.
The data set used is the IBM HR Analytics Employee Attrition & Performance data set [21, 22]. This IBM Analytics simulation database has been made public in the community forum. The data set has a total of 1470 records. Table 2 lists all of its 35 features, which are a good combination of numerical and categorical data types. After initial pre-processing (data cleaning), it is observed that not all the features are required for effective prediction. Thus, a few parameters (features) were dropped.
Data cleaning
A few features were deleted from the data set: the ‘EmployeeCount’, ‘EmployeeNumber’, ‘Over18’ and ‘StandardHours’ columns. The reason for each is as follows:
‘EmployeeCount’: all rows contain the same value, i.e., 1.
‘EmployeeNumber’: it is a unique identifier and has no predictive value.
‘Over18’: all rows contain the same value, i.e., Y.
‘StandardHours’: all rows contain the same value, i.e., 80.
To keep all the numerical data on the same scale, standardization is applied to most of the numeric columns, such as Age and DailyRate.
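The cleaning and scaling steps described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the three toy rows and their values are invented, the helper names `drop_columns` and `standardize` are hypothetical, and standardization is assumed to mean a z-score using the population standard deviation.

```python
from statistics import mean, pstdev

# Hypothetical toy rows with invented values; the real data set has
# 1470 records and 35 features.
records = [
    {"Age": 30, "DailyRate": 500, "EmployeeCount": 1, "EmployeeNumber": 1,
     "Over18": "Y", "StandardHours": 80},
    {"Age": 40, "DailyRate": 800, "EmployeeCount": 1, "EmployeeNumber": 2,
     "Over18": "Y", "StandardHours": 80},
    {"Age": 50, "DailyRate": 1100, "EmployeeCount": 1, "EmployeeNumber": 3,
     "Over18": "Y", "StandardHours": 80},
]

# Columns the paper reports as dropped (constant values or a pure identifier).
DROPPED = ("EmployeeCount", "EmployeeNumber", "Over18", "StandardHours")
# Numeric columns to rescale in this toy example.
NUMERIC = ("Age", "DailyRate")


def drop_columns(rows, names):
    """Return copies of the rows without the named columns."""
    return [{k: v for k, v in row.items() if k not in names} for row in rows]


def standardize(rows, names):
    """Z-score each named column: (x - mean) / population standard deviation."""
    out = [dict(row) for row in rows]
    for name in names:
        values = [row[name] for row in rows]
        mu, sigma = mean(values), pstdev(values)
        for row in out:
            # A constant column has sigma == 0; map it to 0.0 instead of dividing.
            row[name] = (row[name] - mu) / sigma if sigma else 0.0
    return out


cleaned = standardize(drop_columns(records, DROPPED), NUMERIC)
```

After this step each retained numeric column has zero mean and unit variance, so features measured on very different scales (for example age in years and a daily rate) contribute comparably to scale-sensitive models.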
Data splitting
For the purpose of training and testing the algorithms on the data set, four data-splitting strategies were followed: 80-20%, 75-25%, 70-30% and 60-40%. This ensures that the prominent features come out distinctly in any combination. Interesting inferences are obtained, and these can be seen in the results section (refer to Section 4.2 and Section 4.3).
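The four splitting strategies can be sketched as below. This is only an illustration: the helper `split` is hypothetical, a plain seeded random shuffle is assumed (the paper does not state whether the split was stratified by the target variable), and integers stand in for the 1470 records.

```python
import random


def split(rows, train_fraction, seed=0):
    """Shuffle a copy of the rows and cut it into (train, test) parts."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)  # seeded for reproducibility
    cut = round(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]


rows = list(range(1470))  # stand-in for the 1470 records of the data set

# One (train, test) pair per strategy: 80-20, 75-25, 70-30 and 60-40.
splits = {fraction: split(rows, fraction) for fraction in (0.80, 0.75, 0.70, 0.60)}

for fraction, (train, test) in splits.items():
    print(f"train fraction {fraction:.2f}: {len(train)} train rows, {len(test)} test rows")
```

Keeping the seed fixed means every algorithm sees the same partition for a given ratio, which makes the per-split comparisons between algorithms fair.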
Various machine learning models employed for evaluation
As mentioned earlier, ten prediction algorithms are evaluated on the various splits. The algorithms are listed in Table 1. Their performance is measured in terms of Accuracy, Precision and Recall. Apart from prediction, one of the motives of this work is to identify the people who would attrite. The results section (Section 4) discusses this in detail.
List of algorithms for the study
Out of the 35 features, 4 are dropped and 31 features contribute to the prediction process. Not all of these features are statistically significant or contribute to the outcome variable. To this end, six prominent features were identified.
The IBM HR Simulation database
Competitive Analysis of various predictive techniques
After identifying the six most important features, the next step is to take a subset of the data containing only those six variables and run logistic regression on it, in order to change the bi-class dependent variable into a tri-class variable. This is achieved by assigning cut-offs to the predicted probability values. The assignment of the three classes based on the cut-offs is shown in Table 4. As a reminder, the classes and their meanings are:
Class 0: Employees most likely to leave
Class 1: Employees on the verge of leaving
Class 2: Employees who will stay
Bi-class to tri-class probability cut-offs
Best algorithm overall
Accuracy, Recall, Precision and F1 Score are calculated as performance measures for the ten predictive algorithms across the various splits (80-20, 75-25, 70-30 and 60-40). The neural network gives the best accuracy (85%) among all the algorithms, while Naive Bayes outperforms all others in terms of recall (60%) for all the splits. The best results are obtained for the 80-20% split, as can be seen in Table 3.
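To make the difference between accuracy and recall concrete, the sketch below computes the four measures from scratch on invented labels. The helper `binary_metrics` and both prediction vectors are hypothetical and do not reproduce any result from Table 3; they only illustrate how a model can have higher accuracy yet miss more leavers on an imbalanced data set.

```python
def binary_metrics(y_true, y_pred, positive=1):
    """Accuracy, precision, recall and F1 for one positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    tn = sum(t != positive and p != positive for t, p in pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return {"accuracy": (tp + tn) / len(pairs), "precision": precision,
            "recall": recall, "f1": f1}


# Invented, imbalanced labels: 1 = attrited (4 employees), 0 = stayed (16).
y_true = [1] * 4 + [0] * 16

# A cautious model flags only one leaver; an eager model flags more people.
cautious = [1, 0, 0, 0] + [0] * 16
eager = [1, 1, 1, 0] + [1] * 4 + [0] * 12

cautious_scores = binary_metrics(y_true, cautious)
eager_scores = binary_metrics(y_true, eager)
print("cautious:", cautious_scores)
print("eager:   ", eager_scores)
```

In this toy case the cautious model is more accurate overall but finds only one of the four leavers, whereas the eager model finds three of them at the cost of some false alarms. This is the trade-off behind preferring recall when the goal is to catch at-risk employees.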
Feature ranking (Best 6 features using weighted sum)
To identify the top features that have a major effect on the outcome variable (employee attrition), feature importance analysis was performed for each algorithm. The analysis was done using data from the various train-test splits. The outcomes are shown in Figs. 3 to 6. Of the total 35 features, the top 6 features were determined for every split. Each of these features was assigned a score from 1 to 6, where 1 denotes the least important of the six and 6 the most important. Finally, the weighted sum of these scores was calculated for each feature (for all algorithms combined). The result is listed in Table 5.
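The scoring scheme described above can be sketched as follows. The feature names (`feat_a` to `feat_g`) and the three ranked lists are placeholders, not the paper's results, and the helper `weighted_scores` is hypothetical. The sketch also assumes that every list contributes with equal weight; the exact weighting used for Table 5 is not specified in the text.

```python
from collections import Counter


def weighted_scores(rankings, top_n=6):
    """Sum position scores over ranked lists: first place earns top_n, last earns 1."""
    totals = Counter()
    for ranked in rankings:
        for position, feature in enumerate(ranked[:top_n]):
            totals[feature] += top_n - position
    return totals


# Placeholder top-6 lists, one per (algorithm, split) combination,
# ordered from most to least important.
rankings = [
    ["feat_a", "feat_b", "feat_c", "feat_d", "feat_e", "feat_f"],
    ["feat_b", "feat_a", "feat_d", "feat_c", "feat_f", "feat_g"],
    ["feat_a", "feat_c", "feat_b", "feat_g", "feat_d", "feat_e"],
]

scores = weighted_scores(rankings)
best = [feature for feature, _ in scores.most_common(6)]
print(scores.most_common())
```

A feature that appears near the top for many algorithms and splits accumulates a high total, while one that appears only once, or only in low positions, ends up with a small total. Ties at the cut-off would need an explicit tie-breaking rule.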

Feature ranking for respective algorithm (for 60:40 Split).

Feature ranking for respective algorithm (for 70:30 Split).

Feature ranking for respective algorithm (for 75:25 Split).

Feature ranking for respective algorithm (for 80:20 Split).
Six best features according to average weighted sum
Based on the feature ranking and the logistic regression coefficients, one can analyse which factors contribute the most to employee attrition. As discussed before, the employees are divided into three buckets: most likely to leave, on the verge of leaving, and most likely to stay. This division is based on the predicted probabilities of attrition (as shown in Table 4). Based on the features, the discussion here concerns the ‘Employees who are most likely to leave’ and the ‘Employees on the verge of leaving’.
In conclusion, the logistic regression analysis highlights key factors that contribute to employee attrition and provides insights into the actions that the company can take to improve employee retention. By addressing these issues, the company can create a more engaged and motivated workforce, which will lead to increased productivity and profitability.
The authors also ran an experiment to predict which of the following buckets an employee falls into:
Employees most likely to leave
Employees on the verge of leaving
Employees who will stay
Originally, the dependent variable in the provided data set had only two outcomes, ‘Yes’ or ‘No’. This was changed into the three classes mentioned above. It was inferred that, by taking a subset of the original data restricted to the six most important independent variables, the best prediction can be obtained by applying cut-offs to the probability scores. Logistic regression was then performed on this subset to determine the attrition probability of each employee. The logic used is stated in Table 4 and is repeated below:
Employees who will stay: Probability score between 0 and 0.4
Employees on the verge of leaving: Probability score between 0.4 and 0.6
Employees most likely to leave: Probability score above 0.6
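The cut-off logic can be sketched as a small function. The name `attrition_class` and the example probabilities are hypothetical. The handling of scores exactly equal to 0.4 or 0.6 is also an assumption, since the stated ranges do not say which class the boundaries belong to; here both are assigned to the "on the verge of leaving" class.

```python
# Class labels as defined in the paper.
LABELS = {
    0: "most likely to leave",
    1: "on the verge of leaving",
    2: "will stay",
}


def attrition_class(probability, low=0.4, high=0.6):
    """Map a predicted attrition probability to class 0, 1 or 2."""
    if not 0.0 <= probability <= 1.0:
        raise ValueError("probability must lie in [0, 1]")
    if probability > high:
        return 0  # above 0.6: most likely to leave
    if probability >= low:
        return 1  # 0.4 to 0.6 (boundaries assumed inclusive): on the verge
    return 2      # below 0.4: will stay


# Invented probabilities standing in for logistic regression outputs.
example_probabilities = [0.12, 0.45, 0.83]
example_classes = [attrition_class(p) for p in example_probabilities]
for p, c in zip(example_probabilities, example_classes):
    print(f"p = {p:.2f} -> Class {c}: {LABELS[c]}")
```

Keeping the two thresholds as parameters makes it easy to widen or narrow the middle band if an organization wants to target a larger or smaller group of undecided employees.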
This helps one determine which categories of employees to focus on. Managers can focus on the employees who are on the verge of leaving and offer them incentives targeted specifically at making them stay. Managers may or may not offer the same incentives to the employees most likely to leave, as those employees may leave no matter what incentives are provided.
According to the study in this paper, Naïve Bayes is the best for this use case as it has the best recall, while the neural network is best in terms of accuracy. Since managers want to identify as many as possible of the employees at risk of leaving, recall becomes more important than accuracy. Separating the whole cohort of employees into sub-cohorts is also advantageous for managers, as they can then employ separate strategies for the three cohorts that were identified.
Based on the feature ranking data, the company can take several steps to prevent employee attrition or to retain employees who are on the verge of leaving. These steps are listed in the section above.
By taking steps to address these issues, the company can reduce attrition rates and retain valuable employees. The employees who are satisfied and not at risk of attrition provide an opportunity to identify the strengths of an organization and further reinforce them. It can be very difficult to retain employees who have already made up their mind to leave, perhaps because they already hold an offer from elsewhere or are in the process of interviewing. But many employees are somewhere between these two extremes. These are people who have neither made up their mind to leave nor decided to stay. Here one can apply various strategies to convert them into employees who will stay. Thus, with the aid of machine learning algorithms, managers can focus on various factors to improve employee satisfaction, which will in turn improve retention.
Recommendations for future research
Retaining employees is becoming very important for organizations that want to remain competitive in today’s environment. The study outlined in this paper makes it apparent that, for company owners, it is important not only to predict which employees are going to leave but also to understand the underlying reasons that lead them to do so. This calls for a two-pronged strategy: first, address the underlying issue, and second, provide incentives that encourage the employees to stay.
In this study, the authors experimented with changing the dependent variable, which had two classes, ‘Yes’ and ‘No’, into three classes, and used Logistic Regression to achieve this. Future studies can be conducted using other supervised and/or unsupervised methods (like clustering) to group employees based on various parameters like mindset, nature, attitude and working procedures.
