Abstract
There has been increasing interest in models that try to predict which patients will be readmitted to hospitals or, for those who have not yet been hospital patients, which will need hospitalization. One major reason is to optimize the intervention process, thereby saving billions of healthcare dollars. In this article, we use patient data from a large healthcare organization and apply different segmentation models to identify the patients who are most at risk for hospitalization. One of these models uses multiple linear regression to predict a patient’s hospitalization in the subsequent year, and examines the variables that are significant in the prediction process. Two further segmentations use logistic regression to explore how well we can discriminate (1) between those who will be hospitalized and those who will not be hospitalized, and (2) between those who will be hospitalized for a “lengthy stay” (≥15 days) and those who will not be hospitalized. We adopt an approach from the database-marketing literature, and consider “lift” and Pareto curves to evaluate the success of the segmentations. Our findings suggest that there is great potential for this segmentation approach. This approach provides not only profit motivation for the focal (healthcare) organization, an HMO, pharmaceutical company, or similar organization, but also the societal benefit of more effective and cheaper healthcare.
Introduction
Insurance companies and provider networks have struggled to rein in the escalating costs of healthcare in the United States. The sickest patients contribute disproportionately to healthcare spending; indeed, in 2008, the top 1% of patients accounted for 20.2%, and the top 5% of patients for about 50%, of the total healthcare dollars spent. 1 If provider networks can better identify costly patients, they can intervene prior to expensive hospitalization.
Some health networks have started to use case workers for intervention efforts; see, for example, Fisher et al. 2 Providing a case worker who oversees patient adherence to a prescribed medication regimen can make a significant impact on overall patient health and keep patients out of the hospital. Several companies use the Internet to remind patients/‘clients’ to take their medications. 3 Our main objective is to effectively segment and identify the members of the healthcare organization who are most at risk for hospitalization. We view our analyses as useful not only to the specific healthcare organization whose data we have analyzed, but also as a template for physician provider networks in general. The primary reason for identifying these members is, ultimately, to be able to formulate optimal intervention strategies.
Literature review
Despite the importance of controlling healthcare costs, relatively little public research has studied member claims data to identify factors that contribute to rising costs. The Health Insurance Portability and Accountability Act (HIPAA) of 1996 has prevented much healthcare data from being made public. 4 Indeed, some of the data released by the healthcare organization and used in this article are truncated, or top-coded, for privacy protection. Studies such as the Dartmouth Atlas of Health Care typically focus more on macro issues of reimbursement for physicians and procedures than on individual patient behaviors. 5 For example, some health networks spend considerable resources trying to identify whether an MRI improves treatment, as opposed to identifying patients who need more frequent MRIs.
Studies that have attempted to model readmission to hospitals within a short time period are surveyed in a literature review by Kansagara et al. 6 The authors reviewed 30 studies; the most common dependent variable was readmission within 30 days. Most of these studies had relatively poor discrimination ability, although a few had potential to be used at hospital discharge. Most models incorporated variables for medical comorbidity and prior medical services, but few examined variables associated with overall health, illness severity, or social determinants of health. Most considered hospital readmissions, and several dealt with consequences relating to one specific disease. Our study considers patient data from a complete 1-year period (year ‘1’), and has a dependent variable pertaining to hospital stays during the subsequent full year (year ‘2’), regardless of whether a hospital stay had occurred during the first year of data. As we will note later, severity of illness and other additional variables are also utilized as independent variables.
Data and methodology
A well-established physician provider network has provided a random selection of its member data (with appropriate anonymity) in an effort to develop models that better predict which members are likely to spend time in the hospital in subsequent years. As of October 2011, the network had over 400,000 members; this article uses a random sample of about 76,000 members. Given the real-world problem of missing data, different analyses used different amounts of data; the maximum used in any one analysis is 47,344 members.
Our dependent variable is the number of days spent in the hospital in year 2 (the immediate, subsequent year following the claims data – i.e. data from year 1). Figure 1 below displays a histogram of days spent in the hospital in year 2. The vast majority of members in our dataset (about 85%) did not spend any time in the hospital. The mean number of days spent in the hospital is less than one (0.47). Our modeling approaches will rank and segment patients who are more likely – or more at risk – to spend some amount of time in the hospital. Figure 1(a) looks like an exponential function, although there is a slight increase in the tail at 15 days because the data is ‘top-coded’ (i.e. members with more than 15 days in the hospital receive a coded value of 15). Figure 1(b) presents the data in tabular form.
Figure 1. (a) Histogram and (b) table of days in hospital, Year 2.
A number of independent variables from the member claims data in year 1 are included in the modeling and segmentation analyses. These variables are represented as dummy variables so that the model does not mandate linearity. As Magliozzi and Berger 7 explain, this ‘…permits each category to take on its own unique value….’ In other words, although a variable is available in interval-scale or ratio-scale form, no particular functional form needs to be assumed in advance; whatever pattern exists in the data is the pattern that results. The use of dummy variables also helps to compensate for the issue of top-coded values. Prescription (Rx) and lab counts, for example, are top-coded at the 95th percentile: members with Rx and lab counts in the 96th through 100th percentiles receive the same value as members in the 95th percentile. Thus, ‘95th and above percentile’ becomes one category in the set of dummy variables.
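As a minimal sketch of this coding scheme, the Python below maps a lab count to one of the lab cohorts reported in the Model 1 results (0, 1–6, 7–12, 13–18, ≥19) and emits one 0/1 indicator per cohort, leaving out a reference cohort. The function names and the choice of the highest cohort as the reference are illustrative assumptions, not the authors' implementation.

```python
# Lab-count cohorts as reported in the Model 1 results. The last cohort is
# open-ended, so every top-coded count falls into the same category.
LAB_COHORTS = [
    ("0", 0, 0),
    ("1-6", 1, 6),
    ("7-12", 7, 12),
    ("13-18", 13, 18),
    ("19+", 19, None),  # None = no upper bound
]


def lab_cohort(count):
    """Return the cohort label for a non-negative integer lab count."""
    for label, low, high in LAB_COHORTS:
        if count >= low and (high is None or count <= high):
            return label
    raise ValueError(f"lab count must be a non-negative integer, got {count!r}")


def dummy_code(count, reference="19+"):
    """One 0/1 indicator per cohort, omitting the (assumed) reference cohort."""
    cohort = lab_cohort(count)
    return {
        f"labs_{label}": int(label == cohort)
        for label, _, _ in LAB_COHORTS
        if label != reference
    }
```

Because each cohort receives its own coefficient, no functional form is imposed across cohorts. A member in the reference cohort has every indicator equal to 0, and any count at or above the top-coded threshold is coded identically.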
Independent variables
We have six core independent variables in our study.
Table 1. Relative frequency of age categories
Table 2. Relative frequency of Rx categories
Table 3. Relative frequency of lab cohort categories
Table 4. Relative frequency of claims cohort categories
Table 5. Relative frequency of Charlson cohort categories
Table 6. Relative frequency of diagnosis grouping categories
So, overall, we have 5 sets of dummy variables as represented in Tables 1 to 5. We also have 3 additional dummy variables, the diagnosis groupings represented in Table 6. The variable, days-spent-in-the-hospital during year 1, was not among the year 1 data available.
Dependent variables and models
The following analyses include three different models that segment members based on their risk for hospitalization. All dependent variable measures pertain to hospital stays during year 2. Other models were also analyzed, but space limitations do not allow their presentation in this article.
The first model attempts to predict the actual length of time spent in a hospital using multiple linear regression analysis. The dependent variable is the natural log of the number of days in the hospital; the number of days ranges from 1 to 14 days. We excluded 0 in this analysis, because the vast majority of members spent no time in the hospital, and we excluded 15 because it is a top-coded value. Both the ‘0’s’ and ‘15’s’ are extensively utilized in subsequent models. The second model attempts to predict, using a logistic regression, whether members spend any time at all in the hospital. The dependent variable is either 0 (no days spent in the hospital) or 1 (at least one day spent in the hospital). The third model also uses a logistic regression, but it focuses solely on the extremes. The dependent variable is either 0 (no days spent in the hospital) or 1 (15 or more days spent in the hospital).
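The three dependent variables can be derived from the single top-coded days-in-hospital value, as the hedged sketch below illustrates. The function name and the use of `None` to mark an excluded member are assumptions made for illustration. Dropping members with 1 to 14 days from Model 3 is our reading of "focuses solely on the extremes".

```python
import math

TOP_CODE = 15  # days in hospital are top-coded at 15 in the source data


def model_targets(days_year2):
    """Return each model's dependent variable, or None where the member is excluded."""
    if not 0 <= days_year2 <= TOP_CODE:
        raise ValueError("days must lie between 0 and the top-coded value 15")
    # Model 1: ln(days), only for 1-14 days (0 and the top-coded 15 are excluded).
    model1 = math.log(days_year2) if 1 <= days_year2 < TOP_CODE else None
    # Model 2: 1 if hospitalized at all, else 0; every member is used.
    model2 = int(days_year2 >= 1)
    # Model 3: 0 for no days, 1 for the top-coded 15; 1-14 days are excluded.
    if days_year2 == 0:
        model3 = 0
    elif days_year2 == TOP_CODE:
        model3 = 1
    else:
        model3 = None
    return {"model1": model1, "model2": model2, "model3": model3}
```

For example, a member with 7 days contributes ln(7) to Model 1 and a 1 to Model 2, and is left out of Model 3.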
Goodness-of-fit criterion
We take a database-marketing segmentation approach and apply it to identify the patients most likely to spend time in the hospital for various time periods. Magliozzi and Berger 7 define and explain various segmentation strategies and methodologies. That article, written with a focus on database consumer marketing and mailing lists, offers techniques that can rank healthcare members most likely to contribute to healthcare costs. We follow their formulation of evaluating the goodness of performance of a segmentation analysis, by determining rank-ordered lists (ranked by decreasing predicted probability of spending time in the hospital) whose performance is assessed by using Pareto curves that capture the lift, or improvement in targeting, of a given segmentation.
In essence, a Pareto curve is constructed with a horizontal axis showing how deeply one dips into the rank-ordered list, and a vertical axis showing what percent of all occurrences (here, patients who are hospitalized during the subsequent year) are captured at that depth. A 45 degree line would indicate that the model has no segmentation ability (i.e. the top 10% predicted to be most likely to be hospitalized contains [only] about 10% of all those hospitalized, the top 20% contains about 20% of those hospitalized, etc.). In database marketing, obtaining, for example, a ‘10/25’ (the 10% of the entire list of people predicted to be the most likely to make a purchase contains not 10%, but 25% of the buyers on the entire list) would generally be considered an excellent segmentation performance. 8 The optimal decision of ‘how deep[ly] to dip’ in the rank-ordered list would be dictated by the relative values of the ‘in-the-mail-cost’ (essentially, printing and postage costs) and the average profit received per purchase/response. There is a clear analogy to the member-intervention decision, albeit with more difficulty in determining the exact values of the costs and benefits.
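The construction just described can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the authors' code. The function name `segment_performance` and the toy data are hypothetical, and the sketch assumes at least one member was hospitalized.

```python
def segment_performance(scores, outcomes, depth):
    """Return (capture, lift) for the top `depth` fraction of a rank-ordered list.

    scores   -- predicted probabilities, one per member
    outcomes -- 1 if the member was actually hospitalized, else 0
    depth    -- fraction of the list to dip into, e.g. 0.10 for the top 10%
    """
    # Rank members by decreasing predicted probability.
    ranked = sorted(zip(scores, outcomes), key=lambda pair: pair[0], reverse=True)
    n_top = max(1, round(depth * len(ranked)))
    hits_top = sum(outcome for _, outcome in ranked[:n_top])
    hits_all = sum(outcomes)
    capture = hits_top / hits_all        # vertical axis of the Pareto curve
    base_rate = hits_all / len(ranked)   # hospitalization rate of the whole list
    top_rate = hits_top / n_top          # hospitalization rate within the segment
    lift = (top_rate - base_rate) / base_rate
    return capture, lift


# Toy data (hypothetical): 10 members, 3 of whom were hospitalized.
scores = [0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2, 0.1, 0.0]
outcomes = [1, 1, 0, 1, 0, 0, 0, 0, 0, 0]
# Points of the Pareto curve at depths 10%, 20%, ..., 100%.
curve = [(d / 10, segment_performance(scores, outcomes, d / 10)[0]) for d in range(1, 11)]
```

A model with no segmentation ability would trace the 45 degree line, with capture roughly equal to depth and lift near zero at every depth.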
Results and discussion
Model 1
As mentioned earlier, the first model applied a natural log transformation to the dependent variable. The natural log transformation greatly reduced the heteroscedasticity and improved the model. The coefficient of determination, R², is 0.10, indicating that the model (i.e. the X’s collectively) explains 10% of the variability in the dependent variable. Figure 2 presents the ANOVA table for the regression analysis:
Figure 2. Model 1 multiple regression ANOVA table.
The results confirmed several expectations. The model predicts that members in the age cohorts of 70–79 and 80–89 (both dummy variables significant at p < 0.01) are likely to spend about half a day more in the hospital, on average, than the base age cohort of members 0–9 years old (all of the results are on average; for brevity, we will not repeat that expression for each stated result). In fact, every age group starting with 20–29 spent significantly more time in the hospital than those in the 0–9 age group (p < 0.01 for each age group); and we cannot reject, at 5% significance, that the time spent in the hospital is monotonic non-decreasing as age category increases. Other findings included:
Members with between 1 and 36 Rx’s spend less time in the hospital than the members in the highest, ≥37 Rx’s, category (p < 0.01 for each other category).
Members with the fewest (i.e. 1–10) claims and those with the next fewest number (i.e. 11–20) of claims spend significantly fewer days in the hospital (each with p < 0.01) than those with the highest number of claims (i.e. ≥31). The results were not significant for the 21–30 claim category, but were directionally correct (p ≈ 0.30).
The members with a Charlson index of 1–2 (p < 0.05) and 3–4 (p < 0.01) spend more time in the hospital than those members with a Charlson index of 0. Surprisingly, while those (relatively few) with a Charlson index ≥ 5 did spend more time in the hospital than those with an index of 0, the result was not significant at α = 0.05 (p ≈ 0.15).
Pregnant women and members with blood, pancreas, liver or renal diseases all spent more days in the hospital than their non-diagnosed counterparts (p < 0.01 for pregnancy, p < 0.05 for the blood diagnosis and for the pancreas/liver/renal diagnosis).
The results for Labs Cohort were directionally correct, with all cohorts spending fewer days in the hospital than the highest, ≥19, cohort. However, of the other four cohorts, the next-to-highest cohort (13–18) was only marginally below the highest cohort, while the 7–12 cohort had p < 0.05, but the 1–6 and 0 cohorts had p-values ≈ 0.17 and 0.07, respectively.
Model 2
The remaining models have categorical dependent variables, and thus the focus is on a segmentation approach to identify members who are most at risk for requiring costly hospitalization (i.e. spending more time in the hospital). As Magliozzi and Berger 7 explain, ‘Virtually all direct marketers evaluate a list segmentation outcome by examining the Pareto Curve, or some other equivalent measure of the ‘lift’ provided by the segmentation; none measure the outcome by examining the R² value.’ Even though the R² values of the regressions are relatively weak, there is a strong directional relationship between the claims data and members at risk for hospitalization, which creates a valuable basis for segmentation. Additionally, all of the models had a very high overall significance level (p < 0.001). Berger and Magliozzi 9 provide real-world list-segmentation multiple regression results in which the multiple R² is only about 0.01 (highly significant due, in part, to the large sample size), but the lift/segmentation ability of the ‘scoring equation’ (i.e. regression equation) leads to a more than 100% increase in profit (by mailing to only the ‘profitable’ portion of the list) compared to mailing to the entire mailing list.
Lift for members with ≥ 1 day (vs. 0 days) in hospital

Figure 3. Pareto curve for members with ≥1 day (vs. 0 days) in hospital.
In other words, those 4734 (10% of 47,344) members predicted to be the 10% most likely members to spend at least one day in the hospital actually had a 32.9% chance of spending at least one day in the hospital, as opposed to a ‘random’ 10% selection of the membership, which would have about a 14.8% (percent for the entire list) chance of spending time in the hospital. This represents a lift of (32.9 − 14.8)/14.8 = 122%. If we consider the predicted top 20%, we have an 85% lift, and the top 30% provides a 64% lift.
In other words, Figure 3 indicates that the predicted top 10% of those most likely to spend at least one day in the hospital included 22.3% of all the members (about 7000, i.e. 14.8% of 47,344) who did, indeed, spend time in the hospital. The top 20% contained 37.1% of those members, etc. So, if the top 10% were allocated risk-prevention resources, 22.3% of those needing the resources (not just 10%) would be among the selected 10%.
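The lift and the Pareto-curve reading are two views of the same quantities: the share of all hospitalized members captured at a given depth equals the depth multiplied by the ratio of the segment's hospitalization rate to the whole-list rate. The short check below applies that identity to the Model 2 figures reported above. The function names are illustrative.

```python
def lift(top_rate, base_rate):
    """Relative improvement of the segment's rate over the whole-list rate."""
    return (top_rate - base_rate) / base_rate


def capture_share(depth, top_rate, base_rate):
    """Share of all hospitalized members found in the top `depth` fraction of the list."""
    return depth * top_rate / base_rate


# Figures reported for Model 2 at a depth of 10%.
model2_lift = lift(0.329, 0.148)                    # about 1.22, i.e. a 122% lift
model2_capture = capture_share(0.10, 0.329, 0.148)  # about 0.222
```

The computed share of about 22.2% is close to the 22.3% read from the Pareto curve; the small gap presumably reflects rounding in the published percentages.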
Model 3
Lift for members with ≥ 15 days (vs. 0 days) in hospital

Figure 4. Pareto curve for members with ≥15 days (vs. 0 days) in hospital.
The Pareto curve results are excellent, if not amazing. The top 10% of the rank-ordered list contains a remarkable 53.2% of the members who spent 15 or more days in the hospital, and the top half of the list contains 93.6% of these members. Still, the true value of this segmentation can be determined only after a resolution of the tradeoff between the relatively small number of members in this high-hospitalization group and the relatively high cost that can potentially be saved per member with an intervention.
Summary and implications
The various segmentation analyses above significantly improve the ability to identify and target patients who are most at risk for hospitalization. While there are not any major surprises in the predictions of the impact of the independent variables (results of model 1), these impacts confirm the viability of the further analyses, which provide strong lifts in identifying members who are at risk for spending time in the hospital. Depending on the available resources, it is strongly implied that a healthcare network can potentially benefit greatly from investigating and evaluating intervention programs for their members. Suppose that it could assign a case worker to intervene with the top 10% of patients most likely to spend selected times in the hospital and achieved results that correspond to those found in the various models analyzed above. That would contribute significantly to, presumably, increasing patient welfare and cutting healthcare-provider costs – i.e. profitable for the healthcare provider, good for society.
If the intervention targeted, for example, the 10% of members predicted to be most likely to be hospitalized ≥ 15 days, then a lift of 475% would be achieved, and the members receiving the intervention (10% of the membership) would, notably, include over half (53.2%) of all members who are actually hospitalized ≥ 15 days. If the intervention targeted the 10% of members predicted to be most likely to be hospitalized even one day (i.e. ≥1 day), then a lift of 122% would be achieved, and the members receiving the intervention would include 22.3% of all members who are actually hospitalized. These results certainly appear promising and, at minimum, suggest the worthiness of a cost/benefit analysis.
Healthcare providers, at least at this time, likely need to outsource the analyses to a company that specializes in this type of modeling. Eventually, it may pay for the provider to form its own ‘in-house analytics program.’ This process would mirror what has happened in the marketing arena during the last 20 years, since the key marketing metric of CLV (customer lifetime value) was introduced. At first, specialized companies were hired to aid with determining CLV and the associated optimal marketing strategies. Today, a large number of companies have formed their own in-house analytics capability.
Studies have pointed to issues such as poor medication adherence as contributors to hospitalization costs. In September 2010, the New England Healthcare Institute published a study estimating that ‘poor medication adherence in all its manifestations costs the United States upwards of $290 billion per year in unnecessary health care spending, not to mention illnesses and deaths that could be otherwise prevented.’ 10 Identifying, segmenting and intervening in the healthcare of such patients could offer significant returns.
Some physicians and health networks are advocating for an even more targeted approach to healthcare. As noted in The New Yorker, 11 Dr Jeffrey Brenner believes that targeting the very sickest patients can drastically reduce healthcare costs, and that this requires a coordinated effort between healthcare professionals and social workers. It seems obvious, but, in healthcare costs, as with crime and taxpayers, the top 10% contribute disproportionately to the total sum. Indeed, as we noted earlier, the top 1% of patients account for over 20% of the healthcare dollars spent. 1 Insurers may simply need (and, from a profit and societal perspective, want) to dedicate more resources, even more disproportionately, to the top ‘X%’ of healthcare users to keep them out of the hospital, with ‘X’ to be determined by cost/benefit considerations. Predicting these patients ahead of time remains a significant opportunity and a major challenge. We believe that the segmentation approach put forth in this article is a solid beginning toward meeting this opportunity and challenge.
Limitations and future research
The underlying data set presented a number of analysis issues, many due to privacy concerns that prevented a complete view of patient data. The lab and prescription values posted were especially weak variables. Lab results are perhaps the best leading indicator of hospitalization and healthcare costs. For example, BMI and hemoglobin A1C levels are leading indicators of diabetes. Blood pressure is a leading indicator of hypertension. HDL and LDL levels are leading indicators of coronary disease. All of these diseases are generally chronic and costly. Similarly, prescription data offers a critical insight into patient care. If a patient adheres to a particular medication regimen, he/she is far more likely to control his/her disease and prevent more costly hospitalization down the road.
Yet, the data used in this study provided only counts of lab and prescription values and not the detailed data mentioned above. This is likely an issue of privacy concerns; nevertheless, this may limit the ability to create a stronger predictive algorithm. Perhaps some accommodation of privacy concerns and data availability can be reconciled to aid the analysis process for these worthy causes – reduction of healthcare costs and saving lives. Additionally, the top-coding of certain variables may be making it more difficult to identify the most costly healthcare utilizers. More granular information (e.g. without the top-coding) might help to create a more complete picture of costs.
The possible strategies may be less straightforward; for example, the optimal decision may not be as simple as ‘how deep(ly) to dip’ into the rank-ordered list for intervention strategies – not that that is such a simple determination! One might consider a strategy, for example, ‘Do not intervene at all with the top 5%, but implement a certain intervention protocol with the 5%–15% segment.’ This might be optimal if it is determined that the top 5% segment cannot be profitably helped even by the most intrusive/expensive intervention.
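To make the logic of such a non-contiguous strategy concrete, the sketch below selects every segment whose expected saving per member exceeds the per-member intervention cost. All segment definitions, rates, effectiveness values, savings, and costs are hypothetical numbers chosen only for illustration; none come from the study's data, and the function name is an assumption.

```python
def profitable_segments(segments, cost_per_member):
    """Return labels of segments whose expected saving per member exceeds the cost.

    Each segment is (label, hospitalization_rate, effectiveness, saving), where
    effectiveness is the assumed probability that the intervention averts a
    hospitalization and saving is the assumed cost avoided when it does.
    """
    chosen = []
    for label, rate, effectiveness, saving in segments:
        expected_saving = rate * effectiveness * saving
        if expected_saving > cost_per_member:
            chosen.append(label)
    return chosen


# All numbers below are hypothetical and chosen only to illustrate the logic.
segments = [
    ("top 5%", 0.40, 0.02, 20000),    # very high risk, but barely helped by intervention
    ("5%-15%", 0.30, 0.15, 20000),    # high risk and responsive to intervention
    ("15%-100%", 0.10, 0.15, 20000),  # low risk
]
targets = profitable_segments(segments, cost_per_member=500)
```

Under these assumed values only the 5%–15% segment is selected, so the rule skips the very top of the rank-ordered list instead of dipping down from the top to a single cutoff.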
A final observation: many healthcare network members have 100+ claims (physician visits, outpatient surgeries, labs, Rx’s, etc.), but have never spent a single day in the hospital. In this regard, days-in-the-hospital may be a limited outcome variable that fails to represent the true impact of healthcare costs; perhaps, other dependent variables should be considered.
Footnotes
Funding
This research received no specific grant from any funding agency in the public, commercial or not-for-profit sectors.
Conflict of interest
The authors declare that they do not have any conflicts of interest.
