Abstract
The life insurance industry is inherently data-driven, with many applications for analytical decision-making. Data science has influenced every business function in insurance organizations, providing a distinct competitive advantage and pushing the industry towards the vision of ‘InsureTech’.
This case revolves around one such application of analytics at HDFC Life: championing the initiative for forecasting and analysing business persistency. The persistency ratio is a fairly simple yet very important metric that provides a snapshot of the health of the insurance industry. Given the importance of this metric, it became essential for HDFC Life to understand the factors behind its persistency numbers and what lies ahead for the organization.
The existing forecasting techniques were biased by the nature of the work of the teams producing them and did not yield sufficiently accurate or realistic numbers. The top management found this challenging for decision-making and decided that it required the intervention of the Business Insights department. Mr. Francis Rodrigues, SVP of Data Labs, Business Insights and Innovation, was given the task of taking over the pilot project and increasing the use of analytical tools for persistency analysis.
While the Quarter 1 results have been significant, Mr. Francis Rodrigues still wonders whether he captured all the internal and external measures needed to obtain effective results. Has he done enough, and in how many more areas of the insurance domain can analytics be applied?
Keywords
Introduction
Data science and its myriad applications are enabling organizations to harness their data resources for analytical decision-making. The insurance industry is an inherently data-driven industry, and analytics can influence all business functions as a distinct competitive advantage. Data analytics can bring a paradigm shift in outlook and provide well-organized insights to an insurance company, just as it has transformed the functioning of HDFC Life.
HDFC Life has actively adopted emerging technologies and focused on core areas to improve its insurance business. It has identified client satisfaction and consumer experience as key to significantly improving persistency in insurance premium payments. In identifying the factors behind sales persistency numbers and balancing the expenses incurred to maintain customer relationships, data analytics has proved a goldmine for HDFC Life.
HDFC Life
HDFC Life Insurance Company Limited (Formerly HDFC Standard Life Insurance Company Limited) is a joint venture between HDFC Ltd., one of India’s leading housing finance institutions and Standard Life Aberdeen, a global investment company.
Established in 2000, HDFC Life is a leading long-term life insurance solutions provider in India, offering a range of individual and group insurance solutions that meet various customer needs such as protection, pension, savings, investment and health.
Customer Insights
Every person reaches a stage in their life when they settle into a stable life and start investing in insurance products. Even when people do become policyholders, lapsing policies have a significant negative impact on the revenue and growth of life insurance companies (Maurer, 2016). Most policyholders believe that, having purchased life insurance, they can simply forget about it. Life insurance companies, which do little more than send premium-payment reminders and administrative updates, do nothing to change that erroneous perception. It is therefore not surprising that customers let their policies lapse. Most insurance companies have done little to remind policyholders about the service they provide, let alone to ensure their satisfaction, engagement and loyalty (Keller, 2018).
Persistency
Insurance is a long-term business. After a policy is bought, the customer continues to pay premiums for several years, and it is from this future income that insurers make their profits. Policy lapsation, whether due to non-payment, insurance claims or withdrawals, is a loss for the company and directly impacts its strike rate. To achieve profitability, the primary objective of an insurance company is to maintain a strike rate such that the maximum number of renewal payments is received and persistency is as high as possible.
Persistency is the percentage of an insurance company’s written policies that remain in force without lapsing or being terminated. The persistency ratio is a fairly simple yet very important metric that provides a snapshot of the health of the insurance industry (Vyas, 2018). It helps to evaluate the consistency and stability of an organization’s growth by indicating how long customers stay with their policies. Persistency looks at the number of customers who choose to renew their policies at the end of the year, thereby showing the loyalty of customers and how much confidence they have in the products being offered. Since persistency is a critical factor in the viability and success of insurance companies, it is closely monitored by management, analysts and shareholders.
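As a minimal sketch of the definition above, the ratio can be computed as the share of written policies still in force. The function name and the sample counts are illustrative assumptions, not HDFC Life figures.

```python
def persistency_ratio(policies_written: int, policies_in_force: int) -> float:
    """Percentage of written policies that are still in force."""
    if policies_written <= 0:
        raise ValueError("policies_written must be positive")
    if not 0 <= policies_in_force <= policies_written:
        raise ValueError("policies_in_force must lie between 0 and policies_written")
    return 100.0 * policies_in_force / policies_written


# Hypothetical cohort: 1,000 policies written, 820 still in force a year later.
ratio = persistency_ratio(1000, 820)
print(f"{ratio:.1f} per cent")
```

In practice insurers often track this ratio for cohorts at fixed intervals after issue; the single-cohort version here is only meant to make the definition concrete.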
Importance of Persistency
Persistency rates are important for both insurers and customers (Vyas, 2018). For customers, most policies double as investments, so they need to be satisfied with a product to keep it in force, which is the only way they will maximise the benefits of their investment. If customers surrender a policy or let it lapse, they not only give up their security but also lose part of their investment, as the policy has been withdrawn before reaching maturity. Additionally, customers who let a policy lapse are harder to sell another policy to, as they develop a negative perception of the market.
Similarly, for insurers, high persistency rates translate into increased profitability, reduced costs, optimal long-term income and overall growth and development (Vyas, 2018). The key priority for insurers is to find customers who will stay loyal to their policy and their insurer, as this is a major determinant of the build-up of policy reserves and the reduction in policy surrenders. Analysis of persistency rates also helps in understanding which policies are working and which need to be abandoned or changed in order to sell more and continually provide income to the business. Hence, there is a pressing need to arrive at persistency numbers and work proactively to achieve them in order to manage business profitability and growth (Shah, 2018).
The Challenge: Current Persistency Evaluation Methods
Since persistency is a critical factor in important business decisions in the insurance segment, the top management of every insurance company demands a projection of monthly persistency numbers in order to gain insights. Usually, two teams in the organization produce these forecasts on a monthly basis: the persistency team and the actuarial team.
The persistency team collects renewal premiums and works both centrally and at ground level through branch operations, alongside customers and insurance agents, on a day-to-day basis. Its projections are usually conservative, as the numbers it submits eventually become the targets it must meet over the year. The numbers reported are therefore pessimistic, and the final collection usually exceeds them.
The actuarial team, by contrast, deals with the statistical and mathematical computations related to the top line of the organization, including valuations and the cost of acquisition of every product. Its projections reflect numbers that would meet the organization’s top line and deliver maximum profitability. Such projections are usually optimistic, however, and actual collections never reach them. The problem, solution and expected payoff for the project are tabulated in Table 1.
Project Overview
Drawbacks
The current methods of persistency calculation are biased by the nature of each team’s work and deviate by a huge margin from the numbers actually achieved. There are no realistic forecasts for the coming year, leaving the management uncertain about the strategic decisions it takes.
The analysis is incomplete and based solely on experience. It does not provide any insight into the factors that actually affect the persistency numbers, the direct correlation of customer experience with premium payments or the reasons behind low conversion rates. The forecasts have no statistical backing, and entire business decisions depend heavily on the actuarial team.
Having realized the importance of customer experience, organizations are investing heavily in 24×7 customer support; however, the lack of direction is only increasing the associated costs (Kulkarni, 2018). Customers who pay regularly are contacted multiple times unnecessarily, while those who actually need follow-up are overlooked because of the customer support workload. These overheads become additional expenses with no returns, reducing the revenues of insurance companies.
The Solution: Use of Data Analytics
Based on the experience of the persistency department and previous persistency collections, it was known that customer data is one of the major factors contributing to sales and persistency numbers for the organization. Customer choices, behaviour towards payment frequency and the importance customers attached to an insurance scheme varied with locality, age, gender, income and many other demographic parameters. What drew attention was that persistency numbers are affected not just by customer choices and experience; policy parameters such as the channel or the billing frequency can also directly impact the persistency of premium payments.
A recurring observation raised by the head of the persistency team suggested that customers with a particular combination of the aforementioned parameters behaved in a similar way. For example, customers on the auto-debit billing channel with a monthly billing frequency almost always paid their premiums on time, whereas those on the cash billing channel with an annual billing frequency were highly likely to default. The logical explanation is that the auto-debit channel deducted the premium directly from the policyholder’s bank account every month, keeping payments consistent (unless the bank account was empty, which was rare), whereas customers who opted for the cash channel paid of their own volition. Previous premium collections confirmed that more than 95 per cent of the customers who opted for auto-debit paid every month, whereas only 53 per cent of the customers who opted for cash paid.
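A comparison of this kind can be sketched by grouping premium records by billing channel and frequency and computing the share paid on time. The records, field names and resulting rates below are made up for illustration and do not reproduce the 95 and 53 per cent figures from the case.

```python
from collections import defaultdict

# Hypothetical premium records: (billing_channel, billing_frequency, paid_on_time)
records = [
    ("auto_debit", "monthly", True),
    ("auto_debit", "monthly", True),
    ("auto_debit", "monthly", True),
    ("auto_debit", "monthly", False),
    ("cash", "annual", True),
    ("cash", "annual", False),
]


def payment_rate_by_segment(rows):
    """Return {(channel, frequency): share of premiums paid on time}."""
    paid = defaultdict(int)
    total = defaultdict(int)
    for channel, frequency, paid_on_time in rows:
        key = (channel, frequency)
        total[key] += 1
        paid[key] += int(paid_on_time)
    return {key: paid[key] / total[key] for key in total}


rates = payment_rate_by_segment(records)
for segment, rate in sorted(rates.items()):
    print(segment, f"{rate:.0%}")
```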
The follow-up logs were rigorously maintained on a day-to-day basis. It was evident from these logs that some customers required more follow-ups from the persistency team than others in order to maintain the organization’s retention ratio. Customer-centric classification was also seen as an effective cost-saving strategy, since those who pay regularly need only be sent a text message, saving call and email costs.
With this scenario understood, the objective of the project was set: to produce monthly persistency numbers for the coming year and to classify customers into categories according to the frequency of follow-up required for premium collection.
Data Exploration
Once the objective was clearly defined, the team at HDFC Life began collecting and extracting the relevant data from its repository. Coming from an analytical background, the team realized the importance of data sourcing, not just in terms of quantity (historical data over the years) but also in terms of quality, considering the business importance of parameters and their potential to generate a competitive advantage. The key objective of data collection was to incorporate every factor even mildly related to persistency, so that the observed results would be accurate and the data used for decision-making would be of consistently high quality.
The final extract had around 11.2 million records and was more than 6 GB in size. All available policy and customer demographic parameters were considered, giving around 22 direct and 6 derived variables for analysis. The policy and demographic parameters are described in Tables 2a and 2b, respectively. All data that the organization had recorded over the last few years was taken into consideration. Derived variables included industry persistency ratios, agent vintage and revival flagging, to give an overall picture of every policy. Policy-level time-indexed data was extracted to give deeper insights into the business and support the forecasting techniques to be implemented. The idea was to identify the most important variables and reduce the parameter list to the significant ones only.
Policy Parameters
Demographic Parameters
Having extracted the data, the next step taken by HDFC Life was to explore it thoroughly in order to understand the correct approaches to solving the business problem. The data cleaning process took a while because the team was dealing with enormous amounts of data. The data spanned 13 years (FY 2007–2019), during which HDFC Life had changed data storage vendors multiple times. This led to discontinuity in data aggregation and ambiguity in column names, mismatches in archived data, inconsistent data types and other pre-processing issues.
To make the data ready for analysis, it was important to bring consistency into it so that all columns followed a consistent pattern and category and aligned to the correct customer ID. The business insights team collaborated with the underwriting, persistency and information technology (IT) departments to gain a deeper understanding of the data and restore it to statistically identifiable data points. The persistency and underwriting teams helped with various industry metrics that can be tracked for better analysis; these ratios were included in the data as derived variables. They also provided insight into how customers usually react to certain policies and what exceptions to the trend might exist for a particular geography or age. The IT department helped in mapping the missing variables and column names and in completing the data as a whole to aid pre-processing.
The data exploration stage was followed by a meeting amongst the business insight executives along with the VP of the department. The aim of the meeting was to discuss the most appropriate approaches towards the business problem and accordingly decide on the tools, skills and effort required to be put in for the project.
Tools and Techniques
The last step before implementing the approaches was to decide on the tools and platforms. HDFC Life has invested heavily in some of the most popular and useful software for business analytics. It has licensed various data modelling and visualization tools, including SAS, QlikView and Amazon Web Services (AWS) SageMaker.
To solve this business problem, the team decided to use QlikView for data exploration and visualization, and AWS SageMaker, specifically Python Jupyter notebooks, for data modelling. The team could have used Tableau or Power BI instead of QlikView, considering they provide slightly better visualization formats; however, the organization already had a cloud server set up for QlikView and the senior management was well-versed with its dashboards, making it the more convenient choice. With its in-memory data model, dynamic dashboards and automated data integration, QlikView provides extensive advantages to the organization (Anjani, 2018). Python was the obvious choice for the team because QlikView provides a direct API plugin to import and visualise Python outputs. R could have been a better choice statistically, but considering memory overheads and performance, Python was the way forward (Pandey, 2019). It also enabled easier communication and documentation, since most of the data scientists were well-versed with Jupyter notebooks and used them actively for everyday modelling.
These choices were made keeping in mind that the dataset was very large and that the RAM of local machines would not be able to handle the processing requirements. Both selected tools run in the cloud and provided the capabilities required for the success of the project.
The Approaches
Once the data was cleaned and the tools were finalized, it was important to understand which models to apply and how to reach the desired outcomes. Considering data science models and the available machine learning packages, the team knew that time series modelling was the best approach to forecast future values, since time-indexed data was available (Ivanović, 2016). However, that would solve only half of the problem. To understand which parameters were actually affecting the persistency numbers, it was important to identify the key variables behind persistency collections and classify the customers accordingly. For this purpose, classification models were considered.
Approach I: Time Series Forecasting
The idea was to utilize the time indexes and analyse the policy-level transactions for every policyholder. Over the span of 11 years, enough data had accumulated, and visualizing the monthly collections showed a clear trend and seasonal pattern, which can be observed in Figure 1.
This pattern made it obvious to the team that premium collections have a direct correlation with time. Premium collections had increased over time every year, as can be seen from the trendline in the decomposition graph in Figure 1. Similarly, every month had a specific peak and the yearly pattern was the same every year, giving strong evidence of yearly seasonality, as can be observed in Figure 2. With a steadily growing trend and seasonality observed in the data, it was clear that time series forecasting was the best way to deal with this forecasting problem.


Since the data was at policy level, it had transaction-level details on premiums. Every collection and rollback was recorded date-wise for every policy number. Given the volume of data and the available processing resources, it was decided to aggregate premiums monthly. All the premiums collected in a month were added together (with rollbacks subtracted) and reduced to one value per month. Hence, a set of 132 data points (11 financial years, 132 months) from April 2007 to March 2018 was to be modelled using time series methods such as ARIMA, Holt-Winters or neural networks, and the next 12 data points, that is, the coming financial year from April 2018 to March 2019, were to be forecast based on the analysed trend and seasonality.
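The monthly aggregation described above can be sketched as follows. The transaction layout, field order and amounts are assumptions made for illustration; the case does not describe the actual schema.

```python
from collections import defaultdict
from datetime import date

# Hypothetical policy-level transactions: (policy_no, date, amount, kind)
transactions = [
    ("P001", date(2017, 4, 3), 12000.0, "collection"),
    ("P002", date(2017, 4, 18), 8000.0, "collection"),
    ("P002", date(2017, 4, 25), 8000.0, "rollback"),
    ("P003", date(2017, 5, 9), 15000.0, "collection"),
]


def monthly_net_premium(rows):
    """Sum collections and subtract rollbacks per (year, month)."""
    totals = defaultdict(float)
    for _policy_no, txn_date, amount, kind in rows:
        sign = -1.0 if kind == "rollback" else 1.0
        totals[(txn_date.year, txn_date.month)] += sign * amount
    return dict(sorted(totals.items()))


series = monthly_net_premium(transactions)
print(series)
```

Reducing millions of transactions to one value per month is what turns the policy-level extract into a series short enough for classical time series models.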
Approach II: Customer Clustering
Although time series modelling worked best for producing accurate persistency numbers for the coming year, it was not the appropriate approach for analysing the factors that actually affected those numbers.
The team wanted to give the top management greater insight into what exactly affects customers’ ability to pay. They wanted to show the management that certain types of people always pay their premiums on time, whereas certain other groups tend to default. These groups differ both in their personal or demographic details and in the type of policies they have purchased from the organization.
Such insights would greatly inform the top management’s decisions on sales strategies as well as policy and product design. The team decided to approach this as a classification problem: based on the parameters collected, customers would be classified into groups with similar payment behaviour.
Three actions were pre-decided:
Unpaid: The customer does not pay the premium at all.
Paid_InGrace: The customer pays on or within 30 days of the due date.
Paid_OutGrace: The customer pays after the grace period of 30 days is over.
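The three labels above can be expressed as a simple rule on the due date and the payment date. This is a sketch only: the function name is hypothetical, and treating a payment made before the due date as Paid_InGrace is an assumption, since the case does not discuss early payments.

```python
from datetime import date, timedelta
from typing import Optional

GRACE_PERIOD = timedelta(days=30)  # grace period stated in the case


def payment_label(due_date: date, paid_date: Optional[date]) -> str:
    """Assign one of the three pre-decided actions to a premium that fell due."""
    if paid_date is None:
        return "Unpaid"
    if paid_date <= due_date + GRACE_PERIOD:
        # Assumption: early payments also count as within grace.
        return "Paid_InGrace"
    return "Paid_OutGrace"


due = date(2018, 4, 1)
print(payment_label(due, None))
print(payment_label(due, date(2018, 4, 20)))
print(payment_label(due, date(2018, 6, 15)))
```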
Various classification algorithms were to be tried on a comparative basis in order to arrive at the best accuracy metrics, considering that every algorithm takes a different approach with its own pros and cons. A comparative classification study by Box and Jenkins in their book Time Series Analysis: Forecasting and Control (Box, 1970) can be seen in Figure 3.

The classification was approached in two levels. The first level was a binary classification dividing the customers into two groups: Paid or Unpaid. The second level further classified the customers identified as ‘Paid’ into two categories: Within Grace Date or Beyond Grace Date. Both models were to be tested for feature importance in order to identify the parameters contributing most to persistency numbers.
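The two-level structure can be sketched as a cascade in which the second model is consulted only for customers the first model marks as Paid. The two rule-based functions below are stand-ins invented for illustration; in the project these would be trained classifiers.

```python
from typing import Callable, Dict

Record = Dict[str, str]


def cascade_predict(
    record: Record,
    level1_pays: Callable[[Record], bool],
    level2_within_grace: Callable[[Record], bool],
) -> str:
    """Apply the Level 2 model only to records that Level 1 marks as Paid."""
    if not level1_pays(record):
        return "Unpaid"
    return "Paid_InGrace" if level2_within_grace(record) else "Paid_OutGrace"


# Stand-in rules (hypothetical), used here in place of trained models.
def toy_level1(record: Record) -> bool:
    return record["channel"] != "cash"


def toy_level2(record: Record) -> bool:
    return record["frequency"] == "monthly"


customer = {"channel": "auto_debit", "frequency": "monthly"}
print(cascade_predict(customer, toy_level1, toy_level2))
```

One consequence of this design is that any Level 1 mistake is carried into Level 2, so the second-level accuracy is measured on an already filtered population.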
According to the stakeholder, a two-level approach would be the most appropriate way forward, avoiding multiclass clustering and modelling complexity at later stages. Having decided on the approaches and reviewed them for both data fit and implementability, HDFC Life decided to take this project on board immediately. With everything planned, data visualization and modelling began at a quick pace. Multiple iterations of persistency numbers and customer clusters were produced, finally reaching the best solutions over a period of 2 months.
Results: How Well Did Data Analytics Help?
After rigorous effort and multiple iterations by the HDFC Life team, the time series forecast proved highly accurate, with an RMSE (root mean squared error) of 44.25 crores. The monthly forecast for April–July FY19–20 was compared with the actual collections (Table 3), and the predictions differed by barely a few thousand rupees. The R-square value was 93.6 per cent on the training set and 92.1 per cent on the test set, indicating a very accurate prediction of the desired persistency numbers. These forecasts represented the actual scenario more accurately than the forecasts previously provided by the actuarial or persistency teams.
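For readers unfamiliar with the two metrics quoted above, the following sketch shows how RMSE and R-square are computed from actual and predicted values. The sample numbers are hypothetical and are not the Table 3 values.

```python
import math


def rmse(actual, predicted):
    """Root mean squared error between two equal-length sequences."""
    if len(actual) != len(predicted) or not actual:
        raise ValueError("inputs must be non-empty and of equal length")
    squared = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(sum(squared) / len(squared))


def r_squared(actual, predicted):
    """Coefficient of determination: 1 - SS_residual / SS_total."""
    mean_actual = sum(actual) / len(actual)
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    ss_tot = sum((a - mean_actual) ** 2 for a in actual)
    return 1.0 - ss_res / ss_tot


# Hypothetical monthly collections in crores (not the Table 3 values).
actual = [100.0, 120.0, 110.0, 130.0]
predicted = [102.0, 118.0, 113.0, 127.0]
print(round(rmse(actual, predicted), 3), round(r_squared(actual, predicted), 3))
```

RMSE is expressed in the same unit as the series (here crores), whereas R-square is unitless, which is why the two are usually reported together.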
Actual v/s Predicted Forecast
The Python package pmdarima was used to obtain the best ARIMA/SARIMAX model using grid search. The best model features and output test results are shown in Figure 4.

Every ARIMA model assumes a stationary time series, that is, one whose statistical properties such as mean, variance and autocorrelation are constant over time (Box, 1970). To fit an ARIMA model to the existing dataset, first-order differencing was applied to transform the series into a stationary one.
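Differencing replaces each value with its change from an earlier value, which removes a steady trend. A minimal sketch, using a made-up series rather than the HDFC Life data:

```python
def difference(series, lag=1):
    """Return series[t] - series[t - lag] for every t >= lag."""
    if lag < 1 or lag >= len(series):
        raise ValueError("lag must be between 1 and len(series) - 1")
    return [series[t] - series[t - lag] for t in range(lag, len(series))]


# Hypothetical series with a steady upward trend.
monthly_collections = [100, 104, 109, 113, 118, 122]
first_diff = difference(monthly_collections)  # first-order differencing
print(first_diff)
```

Calling the same function with `lag=12` on monthly data gives a seasonal difference, which is the operation a seasonal differencing order of 1 with a 12-month cycle refers to.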
The data and the model were put through the following tests to evaluate whether the time series was still non-stationary and whether the model required further improvement by refining the data:
Once the tests confirmed the assumption of stationarity, the HDFC Life team performed a grid search to fit the best ARIMA model. Since the data followed a distinct seasonality, a SARIMA (seasonal ARIMA) model was the most appropriate, as can be seen in Figure 5. SARIMAX(2,0,4)x(1,1,1,12) gave the lowest AIC and BIC values, along with the lowest log-likelihood, among the model iterations, and hence emerged as the best model from the study. All parameters of the model, along with their coefficients, were significant at the 95 per cent level, as can be observed in Figure 4 from their p-values, all of which are less than 0.05.
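The selection criteria mentioned above follow standard formulas: AIC = 2k - 2 ln(L) and BIC = k ln(n) - 2 ln(L), where k is the number of estimated parameters, n the number of observations and ln(L) the log-likelihood. The sketch below ranks a few candidate orders by AIC. The log-likelihoods and parameter counts are invented for illustration, so the winner here is by construction and is not evidence about the actual fit.

```python
import math


def aic(log_likelihood: float, n_params: int) -> float:
    """Akaike information criterion: 2k - 2 ln(L)."""
    return 2 * n_params - 2 * log_likelihood


def bic(log_likelihood: float, n_params: int, n_obs: int) -> float:
    """Bayesian information criterion: k ln(n) - 2 ln(L)."""
    return n_params * math.log(n_obs) - 2 * log_likelihood


# Hypothetical candidates: (order, seasonal_order, log-likelihood, parameter count).
candidates = [
    ((1, 0, 1), (1, 1, 1, 12), -512.4, 5),
    ((2, 0, 4), (1, 1, 1, 12), -498.7, 9),
    ((3, 0, 3), (0, 1, 1, 12), -505.1, 8),
]
N_OBS = 132  # number of monthly data points mentioned in the case

for order, seasonal, loglik, k in candidates:
    print(order, seasonal, round(aic(loglik, k), 1), round(bic(loglik, k, N_OBS), 1))

best = min(candidates, key=lambda c: aic(c[2], c[3]))
print("lowest AIC:", best[0], best[1])
```

Both criteria reward fit through the log-likelihood term and penalize the number of parameters, with BIC penalizing extra parameters more heavily once n exceeds about 7 observations.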

A SARIMA model is usually in the form of SARIMA(p,d,q)x(P,D,Q,m). In the model identified as the most appropriate for premium forecasting:
p: The order of the autoregressive (AR) model is 2
d: The degree of differencing is 0 in this case, since a first-order differenced series was given as input
q: The order of the moving average (MA) model is 4
P: The order of the seasonal component of the autoregressive (AR) model is 1
D: The integration order of the seasonal process is 1
Q: The order of the seasonal component of the moving average (MA) model is 1
m: The number of observations per seasonal cycle is 12, since the data is aggregated monthly
On the other hand, the customer classification model also fared well. The Level 1 Classification achieved an accuracy of 91 per cent and Level 2, an accuracy of 82 per cent despite the cascading effect.
Various classification models were trained on the dataset to arrive at satisfactory results. Logistic Regression, Decision Tree, Random Forest, Support Vector Classifier, Naïve Bayes and Gradient Boosting were attempted. Extreme Gradient Boosting (XGBoost) performed the best among them. This can be attributed to the fact that boosting uses an ensemble of models to turn weak learners into strong ones. XGBoost allows the user to specify evaluation metrics such as classification error and log loss, which helps in classification problems such as the customer classification in this case. It also offers customization and regularization parameters that help prevent overfitting.
The model details are elaborated below:
Level 1 classification into paid and unpaid categories was successfully performed with the following metrics:
Model: Extreme Gradient Boosting
Accuracy: 91.43 per cent
Precision: 0.4082
Recall: 0.9079
Confusion Matrix: [210118, 302405], [24684, 3229158]
Level 2 classification into within grace date and beyond grace date categories was performed with the following metrics:
Model: Extreme Gradient Boosting
Accuracy: 82.22 per cent
Precision: 0.5115
Recall: 0.86
Confusion Matrix: [2647881, 1477], [574108, 13343]
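As a sketch of how accuracy relates to a confusion matrix, the snippet below divides the diagonal (correct predictions) by the total for the two matrices reported above. Accuracy does not depend on which axis holds the actual classes. Precision and recall do depend on that orientation and on which class is treated as positive, neither of which the case states, so they are not recomputed here. Small differences from the quoted percentages may arise from rounding or from how the figures were compiled.

```python
def accuracy(matrix):
    """Share of correct predictions: diagonal total over grand total."""
    correct = sum(matrix[i][i] for i in range(len(matrix)))
    total = sum(sum(row) for row in matrix)
    return correct / total


# Confusion matrices as reported in the case.
level1 = [[210118, 302405], [24684, 3229158]]
level2 = [[2647881, 1477], [574108, 13343]]

print(f"Level 1 accuracy: {accuracy(level1):.2%}")
print(f"Level 2 accuracy: {accuracy(level2):.2%}")
```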
Based on the correlation matrix in Figure 6 and feature importance coefficients in Figure 7, the classification approach statistically proved that the following variables, in order, affected the persistency numbers the most. Age and occupation did not show a significant trend towards premium collected. However, region of the policyholder did have a significant impact. The trends identified are listed below:


Conclusion
The use of InsureTech has delivered long-term benefits for HDFC Life and made this project a major success for the organization.
The ARIMA model gives accurate forecasts of expected collections, enabling the organization to assess performance and plan future steps for persistency renewal accordingly. With a test-set R-square of 92.1 per cent, the forecasting model provides accurate numbers, and with 91.33 per cent accuracy, the classification model identifies the relevant factors affecting persistency, enabling the top management to make decisions at the beginning of the year itself. Classifying customers into categories makes the follow-up requirements for the persistency team very clear. Those tagged Unpaid will require rigorous follow-ups, with a gradually decreasing number for Beyond Grace Date payers. For those who pay regularly, a reminder SMS or email would be enough, and this clarity should significantly reduce customer support costs. Using the insights gathered from the important features affecting persistency, the persistency team can associate a customer with a cluster based on channel, policy type and location, and hence identify the most appropriate policy to sell. The top management can also understand the most important policy parameters for retention and accordingly design new products or tweak existing ones, thereby increasing the probability of sales. Considering all these applications, HDFC Life stands to benefit from this project to a great extent. It should not only increase its persistency rates but also increase new business sales through the correct selling strategies and appropriately designed products. All of this is expected to increase the revenue as well as the market share of the organization.
With the remarkable results obtained, HDFC Life has made significant progress in forecasting persistency using InsureTech. These tools are expected to help HDFC Life maintain its leadership position in the industry and grow exponentially in the coming years. While HDFC Life has only just started using the power of data analytics, there are many more opportunities to apply analytics, particularly in areas such as sales, underwriting and claims.
Footnotes
Declaration of Conflicting Interests
The authors declared no potential conflicts of interest with respect to the research, authorship and/or publication of this article.
Funding
The authors received no financial support for the research, authorship and/or publication of this article.
