Abstract
Access to copious amounts of information has reached unprecedented levels, and can generate very large data sources. These big data sources often contain a plethora of useful information but, in some cases, finding what is actually useful can be quite problematic. For binary classification problems, such as fraud detection, a major concern therein is one of class imbalance. This is when a dataset has more of one label versus another, such as a large number of non-fraud observations with comparatively few observations of fraud (which we consider the class of interest). Class rarity further delineates class imbalance with significantly smaller numbers in the class of interest. In this study, we assess the impacts of class rarity in big data, and apply data sampling to mitigate some of the performance degradation caused by rarity. Real-world Medicare claims datasets with known excluded providers are used as fraud labels for a fraud detection scenario, incorporating three machine learning models. We discuss the necessary data processing and engineering steps in order to understand, integrate, and use the Medicare data. From these already imbalanced datasets, we generate three additional datasets representing varying levels of class rarity. We show that, as expected, rarity significantly decreases model performance, but data sampling, specifically random undersampling, can help significantly with rare class detection in identifying Medicare claims fraud cases.
Introduction
Today’s society has access to a multitude of differing information sources providing access to unique observations or cases in order to make informed assessments or decisions. In many real-world cases, these unique and interesting observations do not constitute the majority in the provided information. An example that most people are familiar with is using a search engine to cull through billions of internet sources and return the few observations of interest. Most of this information could be considered irrelevant (i.e. noise) and, as such, can be excluded. Some less mundane examples where the interesting observations are significantly less frequent than the normative observations include the detection of fraud within credit card transactions [35] and credit scoring [25], healthcare billing fraud [76], or detecting faults in software development [50]. Another factor contributing to the relative infrequency of these observations of interest, and potentially making them more difficult to find, is the explosion of data in recent years [70]. This so-called big data dramatically increases the number of normative cases relative to the cases of interest [49]. This discrepancy between interesting and normative cases is known as class imbalance, where each (interesting or normative) case is considered a class. The concern with class imbalance is akin to looking for the proverbial ‘needle in a haystack’, where what is of interest is difficult to find because of the large amounts of irrelevant or unimportant information.
In binary classification problems, e.g. interesting versus uninteresting, observations (also known as cases or instances) of the positive class are usually considered the class of interest with the negative class making up the remaining instances. The problem of class imbalance impacts many real-world problems and any possible solutions. Many of these problems have manual solutions, but the application of machine learning methods can help derive meaningful information, as well as reduce manual efforts. However, machine learning methods are susceptible to issues related to class imbalance [78]. This is further exacerbated when the positive class instances become increasingly rare. In our case, rarity, in relation to machine learning methods, includes any extremely small number of positive class instances, regardless of the proportion to the negative class. For instance, take a dataset where the positive class makes up 1% of the instances (with 99% being the negative class): 1% of 1,000 instances is only 10 positive class instances, whereas 1% of 10 million instances is 100,000 positive class instances. A dataset with 100,000 positive class instances provides adequate information for a model to discern class patterns, whereas only 10 positive class instances may not provide enough discriminatory information to effectively predict the correct class. From this example, we note that using percentages to express the severity of class imbalance, while useful, can be misleading, especially with regard to rare observations.
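The gap between relative and absolute class size in this example can be checked with a few lines of arithmetic. The following is a minimal Python sketch; the function name `positive_count` is our own illustration and is not part of the study's code.

```python
def positive_count(total_instances: int, positive_percent: int) -> int:
    """Return the absolute number of positive class instances."""
    return total_instances * positive_percent // 100


# The same 1% positive class percentage at two very different dataset sizes.
small = positive_count(1_000, 1)        # 10
large = positive_count(10_000_000, 1)   # 100000
print(small, large)
```

Both datasets have the same class ratio, yet only the larger one offers a learner a substantial number of positive examples.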
There are a number of studies on class imbalance, primarily using smaller datasets [43], with few studies addressing class imbalance and big data [44, 40]. This lack of research could, in part, be due to the difficulties in applying traditional machine learning techniques to big data to assess the impact of class imbalance. These difficulties tend to stem from typical attributes associated with big data [36, 49], making it difficult to model and analyze using traditional methods. As there is no exact measure for or meaning of small versus big data, for simplicity, we consider datasets with under 100,000 instances as small. An example of smaller (non-big) data could be a dataset of 5,000 instances collected over a period of one month for a small company, where each instance represents the front door access record of an employee of the company. In contrast, an example of big data could be a dataset of several million or more weather forecast reference points for collecting real-time weather data, or insurance claims records collected over several years. With the advent of big data frameworks, such as Apache Spark [82] and H2O [56], machine learning models can be efficiently built using large datasets with millions of instances to assess the effects of severe class imbalance and rarity [48]. Class imbalance with rare cases, however, is a far less studied area only employing smaller datasets [37, 72]. As of this writing, there are no known studies with regards to big data and class rarity.
As mentioned, fraud detection is one area with extreme sparsity in known fraud cases. Given this disparity in fraud and normal cases, fraud detection is a good real-world scenario with which to evaluate the impact of having rarity in the positive class. Therefore, in our study, we assess the impact of rarity in fraud detection using Medicare provider claims data, focusing on the effects of class rarity in Medicare fraud detection. Medicare is one of the largest government healthcare programs in the United States (U.S.), with over 54.3 million beneficiaries. This program reimburses hospitals and physicians for medical care provided to people over the age of 65 and to younger individuals with specific medical conditions and disabilities [5]. The use of Medicare claims data allows us to gauge the impact of rarity on big data with real-world implications. The healthcare field produces a large amount of information, from patient records to medical claims, and continues to utilize big data sources in order to become more efficient and productive [61, 68]. With this increased use of information, medical fraud continues to be attractive to would-be perpetrators, adversely affecting healthcare costs and quality of service. Understanding the impacts of having such low numbers of known positive instances is critical for the successful detection of actual fraudulent cases. With successful fraud detection methods, the FBI estimates that up to 10% of healthcare costs could be recovered [64]. In particular, the increase in the elderly population [2, 6] poses unique challenges, such as increased medical care due to chronic illness, necessitating appropriate healthcare coverage for various medical drugs and services. This increased longevity in the elderly population in the U.S. can be attributed to overall improvements in healthcare, particularly with treatments of acute illnesses and diseases [9]. Medicare alone accounts for around 15% ($495 billion) of total U.S. federal spending [3, 34]. Thus, fraud detection could garner substantial savings that could assist patients and beneficiaries by decreasing healthcare costs.
The detection of Medicare-related fraud and abuse1
In this study, we focus on positive class rarity in big data, for fraud detection, by leveraging three publicly available Medicare datasets using LEIE excluded providers as fraud labels (i.e. positive class), as well as creating a combined dataset. Furthermore, we present our unique methods for data processing and engineering, to fully leverage the Medicare datasets for machine learning applications. The processing of each dataset is critical, not only for understanding the data, but for the successful application of machine learning. With these datasets, three models (learners) are used to assess fraud detection performance. In order to illustrate the effects of rarity, we map known excluded providers to the Medicare data and then artificially generate additional datasets (from the original Medicare data) by decreasing the number of positive class instances while maintaining the same number of negative class instances. We discuss the results of our experiment and use statistical significance tests to help assess meaningful differences in model performance due to the varying levels of positive class instances, and clearly demonstrate degraded performance as positive classes become rarer. Since this pattern of decreased performance is expected, the question is what can be done to lessen this loss in performance? We perform data sampling in order to increase model performance for rare class classification and show that significant improvement can be achieved. To the best of our knowledge, this is the only study to investigate the effects of rarity in big data within a current, real-world healthcare application domain, and employing data sampling techniques as a means to mitigate the detrimental effects on model performance due to rarity.
The rest of the paper is organized as follows. Section 2 discusses works related to the current research, focusing on class rarity. We discuss Medicare fraud and abuse in Section 3 and provide an overview of the datasets used (processing and data engineering) and the mapping of fraud labels in Section 4. In Section 5, we discuss class imbalance and rarity. In Section 6, we present our experimental design. The results of our research are discussed in Section 7. Finally, Section 8 summarizes our research limitations, conclusions, and future work.
The problems associated with class imbalance (of which class rarity is a subset) have been researched in a number of studies [8, 37, 43, 80]. Even so, the focus on truly rare class data is very limited. There have been several studies on overcoming the issues stemming from the detection of imbalanced or severely imbalanced classes, focusing on specific techniques for increasing machine learning performance, but not on how rarity impacts a current, real-world application domain (e.g. Medicare fraud detection). Additionally, these related studies primarily use smaller datasets from public repositories, such as the UCI Machine Learning Repository.
There are several papers employing different methods to detect positive classes using relatively small datasets. In a study by Lin et al. [59], the authors use the Multivariate Imputation by Chained Equations (MICE) algorithm with optimal threshold based on posterior probabilities. Several datasets are used, with the largest having 20,000 instances and the lowest percentage of positive class instances being 9.3%. An approach using emerging patterns and decision trees to classify rare cases is presented in [7]. The authors improve on the decision tree model by biasing the decisions toward the rare class. They validate their results using five different datasets which are not specifically described in the paper, thus no information is available on what constituted rarity. Zhang et al. [84] propose a k Rare-class Nearest Neighbor (KRNN) method, which directly adjusts the induction bias in traditional KNN. They leverage 14 small datasets, with the largest having 13,302 instances and 3.51% being the smallest percentage of positive class instances. An artificial dataset is also used with varying positive class percentages, relative to the full dataset, from 50% to 5.88%, maintaining 50 positive class instances. Li et al. [58] use adaptive swarm-balancing algorithms to predict events using clinical healthcare data and claim their study incorporates big data. However, the largest dataset only has 47,781 instances with 50 (0.1%) positive class instances. The data used in these studies exhibit severe class imbalance but with datasets that we consider relatively small and not big data. Furthermore, as well as with smaller datasets, only [84] generates artificial cases to demonstrate the impacts of severe class imbalance on classification performance. Another study by Seiffert et al. [72] specifically assesses the effect of data size and class distribution on classification. The data samples go down to 0.1% positive class percentage, relative to the full data size. 
Seiffert et al. and the research in [74] are the only two studies that generate artificial cases to demonstrate the impacts of severe class imbalance, albeit with smaller datasets.
Big data and class imbalance is the subject of a paper by Fernández et al. [40]. The authors present an overview and discussion of related works on class imbalance and big data, providing a preliminary experiment comparing three traditional approaches to class imbalance applied to big data using the MapReduce framework with two datasets. Two imbalanced sources, derived from the ECBDL14 dataset, are used in their experiments. These two datasets consist of 12 million instances and 600,000 instances, respectively, both with 98:2 (majority:minority) class ratios. In a study by Rastogi et al. [69], the authors also address the issue of class imbalance using the ECBDL14 dataset. They use Locality Sensitive Hashing to identify nearest neighbors of the samples and then use SMOTE to generate random samples for the minority class employing the Apache Spark framework. The authors use an ECBDL14 data subset which has 2.89 million instances and a 98.3:1.7 class ratio. Zhai et al. [83] use an ensemble of extreme learning machines, compared with three SMOTE variants, to address class imbalance. They also employ a MapReduce framework with several standard datasets (up to 335,910 instances) and an artificially generated dataset with 150 (0.05%) of 312,191 instances. The studies in [40, 69] incorporate big data but do not investigate or specifically address rarity, nor do they provide varying class ratios to assess the impacts of a reduced positive class.
Three studies that focus on infrequent events using big data sources are Tayal et al. [74], Maalouf et al. [60], and Chai et al. [23]. In one study, Tayal et al. employ a RankRC algorithm, which is a nonlinear kernel-based classification method. They compare their method to several SVM-based approaches using both artificially generated datasets, with 12,000 data points from 10% to 40% class ratios, and real-world datasets. The largest real-world dataset used is from KDD Cup 1999, from which the authors used a subset of 812,808 instances with 0.098% positive class instances. The RankRC method performed better than the SVM methods and was more efficient in time and space requirements. Another study, by Maalouf et al., leverages the KDD Cup 1999 dataset, along with six others, to study the issue of severe class imbalance. Two of the datasets have a 0.34% positive class percentage (which is the lowest in the study), with one of the datasets having 304,814 instances. The authors present the truncated Newton method in prior correction logistic regression with a regularization term added to improve performance. Chai et al. examine the feasibility of using statistical text classification, with Logistic Regression, to automatically identify health information technology incidents using a manufacturer and user facility device experience database. They use a subset of the incident data with 570,272 instances and 1,534 positive class instances. Both a balanced (50% positive class) dataset and a so-called ‘stratified’ dataset (0.297% positive class) are generated from the subset of incident reports. Each of these studies incorporates some degree of what could be considered big data, but none focuses on or assesses the impacts of rarity by varying the number of positive class instances, starting from the original datasets, to show trends associated with increasing levels of rarity.
These studies do illustrate some of the impacts of severe class imbalance with larger datasets, but do not focus on rarity or use particularly relevant big datasets. In contrast to our study, there are no papers that consider big Medicare fraud data and class rarity.
Medicare fraud, waste, and abuse
One source of real-world healthcare provider fraud is the List of Excluded Individuals/Entities (LEIE). This dataset was established and is maintained monthly by the Office of Inspector General (OIG) [66] in accordance with Sections 1128 and 1156 of the Social Security Act [65]. The OIG has authority to exclude individuals and entities from federally funded healthcare programs, such as Medicare, for a given period of time. The LEIE contains the reason for exclusion, date of exclusion, and reinstate/waiver date for all current physicians found unsuited to practice medicine. It is important to point out that the LEIE is aggregated at the provider-level (i.e. a single recorded exclusion per provider by NPI) and does not have specific information regarding procedures, drugs, or equipment related to fraudulent activities. As seen in Table 1, for our study, we selected the providers excluded for mandatory, not permissive, exclusions.
LEIE mandatory exclusion rules
High-level view of fraud, waste, and abuse.
In this section, we summarize the Medicare datasets and data processing and engineering steps. The Medicare datasets are made publicly available from the Centers for Medicare and Medicaid Services (CMS) [4, 26]. Note that CMS is the Federal agency within the U.S. Department of Health and Human Services that administers Medicare, Medicaid, and several other health-related programs. Each dataset is derived from administrative claims data for Medicare beneficiaries enrolled in the Fee-For-Service program, with all claims records being recorded after payments are made [30, 31, 32]. Because these are final payments, we consider the Part B, Part D, and DMEPOS datasets to be appropriately cleansed and correct. The Part B dataset includes claims information for each procedure a physician/provider performs within a given year. The Part D dataset provides claims information pertaining to the prescription drugs administered under the Medicare Part D Prescription Drug Program within a given year. Finally, the DMEPOS dataset includes claims for medical equipment, prosthetics, orthotics, and supplies that physicians/providers referred patients to for purchase or rent from a supplier within a given year. In addition to the three aforementioned Medicare datasets, we create a combined dataset incorporating information from all three Medicare datasets. Our assumption is that there is no reliable way to know within which part of Medicare a physician/provider has or will commit fraud. Therefore, joining the Part B, Part D, and DMEPOS datasets can potentially better represent a provider’s claims, from procedures and drugs to equipment. This is because the combined dataset has a larger number of features from which machine learning algorithms can detect fraud.
Selected Medicare dataset features
For each dataset, we combine the individual years (2012 to 2015) of Medicare data by appending the annual files to one another across matching features. Some features, such as the standardized payments and standard deviation values, are not included since they do not appear in all of the Medicare years. We select features, listed in Table 2, that are readily usable by most machine learning models. However, we exclude features used for identification purposes (such as NPI [27]) or for filtering the data, e.g. for Medicare participation or equipment rentals. Redundant features, like HCPCS [29] descriptions (not the actual procedure codes) and others indicating demographic information, such as provider names, credentials, and residence, are also not used. Additionally, we exclude features that contain mostly missing or constant values, as these can reduce classification performance. Details on all of the available Medicare features can be found in the “Public Use File: A Methodological Overview” documents, for each respective dataset, available at CMS.
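As a rough illustration of appending annual files across matching features, the sketch below keeps only those features that occur in every record of every year. It is not the processing code used in the study; the function name `combine_years`, the feature names (`npi`, `line_srvc_cnt`, `std_payment`), and the toy records are hypothetical.

```python
def combine_years(yearly_records):
    """Append per-year record lists, keeping only features present in every record."""
    common = None
    for records in yearly_records.values():
        for row in records:
            common = set(row) if common is None else common & set(row)
    common = common or set()

    combined = []
    for year in sorted(yearly_records):
        for row in yearly_records[year]:
            kept = {name: row[name] for name in sorted(common)}
            kept["year"] = year  # remember which annual file the row came from
            combined.append(kept)
    return combined


# Toy, hypothetical records: the 2014 file carries a feature missing in 2012.
yearly = {
    2012: [{"npi": "A", "line_srvc_cnt": 12}],
    2014: [{"npi": "A", "line_srvc_cnt": 20, "std_payment": 310.5}],
}
combined = combine_years(yearly)
print(combined)
```

In this toy case the `std_payment` feature is dropped because it does not appear in 2012, mirroring how features absent from some Medicare years are left out.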
Because the LEIE dataset only provides information on excluded providers and not on particular procedures performed or medical specialties, we aggregate the Medicare data at the provider- or NPI-level to account for this discrepancy and then map LEIE exclusion labels to each Medicare dataset. Note that currently there is no known publicly available data source with fraud labels by provider for each procedure performed. We merge the Medicare datasets and the LEIE providers by NPI and year, generating fraud or non-fraud labels. Our labeling process includes both the exclusion period and the period prior to the start of the recorded exclusion. The rationale for keeping the former is that claims made during the exclusion period are improper payments and could be considered fraudulent per the federal False Claims Act (FCA) [33]. The latter could indicate fraudulent activities leading up to that provider being put on the LEIE, most of which stem from criminal convictions, patient abuse or neglect, or revoked licenses. Additional information on our Medicare data processing can be found in [15, 46].
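One way to read this labeling rule is sketched below in Python. It is an assumption-laden illustration rather than the study's implementation: we assume each excluded provider is summarized by the last year of its exclusion, and that every provider-year up to and including that year (covering both the pre-exclusion and exclusion periods) is labeled as fraud. The names `label_provider_years` and `last_excluded_year` are hypothetical.

```python
def label_provider_years(provider_years, last_excluded_year):
    """Label each (NPI, year) pair as fraud (1) or non-fraud (0).

    provider_years: iterable of (npi, year) pairs from the aggregated claims data.
    last_excluded_year: dict mapping an excluded NPI to the final year of its
        exclusion (an assumed summary of the LEIE record).
    """
    labels = {}
    for npi, year in provider_years:
        end_year = last_excluded_year.get(npi)
        # Assumption: years before and during the exclusion are both positive.
        is_fraud = end_year is not None and year <= end_year
        labels[(npi, year)] = 1 if is_fraud else 0
    return labels


# Hypothetical providers: "A" is excluded through 2014, "B" is never excluded.
pairs = [("A", 2012), ("A", 2014), ("A", 2015), ("B", 2013)]
labels = label_provider_years(pairs, {"A": 2014})
print(labels)
```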
Our study uses Medicare data and fraud detection to demonstrate the impact of class imbalance, specifically rare classes, on machine learning classification performance. As rarity is a severe form of class imbalance, we discuss both in this section. Class imbalanced data refers to the condition where the classes are not represented equally [77, 78]. The concerns associated with class imbalance are caused by these differences in the majority (negative) and minority (positive) classes. This imbalance creates a bias towards the majority class during the model training, which seeks both to minimize error (maximize accuracy) and provide predictive generalization, i.e. provide good predictions on new, unknown data. In more severe cases of class imbalance (e.g.
The original Medicare datasets are already severely imbalanced with the majority of provider claims being non-fraudulent. In order to assess the impact of increased rarity, we artificially generate three subsets, from the original Medicare data, with decreasing numbers of positive class instances (200, 100, and 50). These counts were selected based on preliminary results which indicated that they were a good representation of rare class subsets. For instance, using a subset of 1,000 positive class instances showed very similar results to the original dataset and thus does not meaningfully contribute to the discussion on the impact of rarity. To get a good representation of samples from the positive class, we repeat the sampling process ten times per subset. For example, using the Part B dataset and the subset with 100 positive class instances, we randomly select 100 instances out of the original 1,409, repeat this process 10 times, and average the scores across those 10 repetitions to produce the final result. Table 3 lists each of the original datasets and subsets used in our experiment with counts of positive (fraud) and negative (non-fraud) classes.
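The subset generation described above can be sketched as follows. This is a minimal Python illustration under our own assumptions: instances are represented as `(id, label)` tuples, the function name `rarity_subsets` and the seed are hypothetical, and the 5,000 negatives are a toy stand-in for the much larger real negative class.

```python
import random


def rarity_subsets(positives, negatives, n_positive, repeats=10, seed=0):
    """Build `repeats` datasets, each with `n_positive` randomly drawn positives
    and all of the negative class instances."""
    rng = random.Random(seed)
    subsets = []
    for _ in range(repeats):
        sampled = rng.sample(positives, n_positive)  # without replacement
        subsets.append(sampled + list(negatives))
    return subsets


# 1,409 positives as in the Part B example; the negative count is a toy value.
positives = [(i, 1) for i in range(1_409)]
negatives = [(i, 0) for i in range(5_000)]
subsets = rarity_subsets(positives, negatives, n_positive=100)
print(len(subsets), len(subsets[0]))
```

A model would then be trained and scored on each of the ten subsets, with the ten scores averaged to give the reported result.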
Description of experiment datasets
In addition to using the original datasets and the rarity subsets, we employ random undersampling (RUS) and random oversampling (ROS) with the following class distributions (majority:minority): 99:1, 90:10, 75:25, 65:35, and 50:50. These class distributions are applied to the original dataset, as well as the rare class subsets. Note that the original, non-sampled, datasets are denoted by ‘None’ or ‘Full’. The selected ratios were chosen because they provide a good range of class distributions from balanced (50:50) to highly imbalanced (99:1) compared to the original (non-sampled) datasets. We repeat the RUS and ROS processes 10 times per dataset, for each class distribution, to reduce bias due to poor random draws.
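To make the two sampling schemes concrete, the following Python sketch resamples a toy dataset to a target majority:minority ratio. It is an illustration under stated assumptions, not the study's Spark implementation: RUS is assumed to keep every minority instance and draw majority instances without replacement, ROS is assumed to keep every original instance and add minority duplicates drawn with replacement, and the function names and toy data are hypothetical.

```python
import random


def random_undersample(majority, minority, ratio, seed=0):
    """RUS: shrink the majority class to reach `ratio` = (majority, minority)."""
    major_share, minor_share = ratio
    target = round(len(minority) * major_share / minor_share)
    target = min(target, len(majority))  # cannot keep more than we have
    rng = random.Random(seed)
    return rng.sample(majority, target) + list(minority)


def random_oversample(majority, minority, ratio, seed=0):
    """ROS: duplicate minority instances to reach `ratio` = (majority, minority)."""
    major_share, minor_share = ratio
    target = round(len(majority) * minor_share / major_share)
    extra = max(target - len(minority), 0)
    rng = random.Random(seed)
    return list(majority) + list(minority) + rng.choices(minority, k=extra)


# Toy data: 10,000 negatives and 100 positives, as (id, label) tuples.
majority = [(i, 0) for i in range(10_000)]
minority = [(i, 1) for i in range(100)]

rus_90_10 = random_undersample(majority, minority, (90, 10))
ros_50_50 = random_oversample(majority, minority, (50, 50))
print(len(rus_90_10), len(ros_50_50))
```

With these toy counts, RUS at 90:10 keeps 900 negatives alongside the 100 positives, whereas ROS at 50:50 grows the positive class to 10,000 instances, which illustrates why undersampled datasets are much cheaper to train on.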
In this section, we present the supervised machine learning models (also known as learners) and evaluation methods used to assess the impacts of rare class classification. Previous research indicates that supervised learning significantly outperforms unsupervised methods [10, 12]. Because our study includes big data, we use Apache Spark [62, 81], with the Machine Learning Library (MLlib) [63], to effectively handle these large dataset sizes to detect Medicare fraud. Apache Spark is a unified analytics engine for big data and machine learning that provides dramatically increased data processing speed compared to traditional methods or implementations using MapReduce approaches. MLlib is a scalable machine learning library built on top of Apache Spark.
Models
We use Logistic Regression (LR) [55], Random Forest (RF) [22], and Gradient Boosted Trees (GBT) [63]. Currently, Spark is limited to eight learners that could be used to classify fraud and non-fraud cases.
LR uses a sigmoidal, or logistic, function to generate values in [0, 1] that can be interpreted as class probabilities. LR is similar to linear regression but uses a different hypothesis class to predict class membership. The bound matrix parameter was set to match the shape of the data so the algorithm knows the number of classes and features the dataset contains. The bound vector size is equal to 1 for binomial regression, with no thresholds set for binary classification.
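The logistic function itself is simple to write down. The sketch below is a generic, numerically stable Python version for illustration only; the weights, bias, and feature values are made up and are unrelated to the models fitted in this study.

```python
import math


def sigmoid(z: float) -> float:
    """Logistic function mapping any real number into [0, 1]."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)  # avoids overflow for large negative z
    return ez / (1.0 + ez)


def predict_proba(weights, bias, features) -> float:
    """Probability of the positive class for one instance."""
    z = bias + sum(w * x for w, x in zip(weights, features))
    return sigmoid(z)


# Hypothetical two-feature model and instance.
p = predict_proba(weights=[0.8, -0.5], bias=-1.0, features=[2.0, 1.0])
print(p)
```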
RF is an ensemble approach building multiple decision trees. The classification results are calculated by combining the results of the individual trees, typically using majority voting. RF generates random datasets via sampling with replacement to build each tree and selects features at each node automatically based on entropy and information gain. In this study, we set the number of trees to 100 and the max depth to 16. Additionally, the parameter that caches node IDs for each instance was set to true, and the maximum memory parameter was set to 1024 MB in order to minimize training time. The setting that manipulates the number of features to consider for splits at each tree node was set to one-third, since this setting provided better results upon initial investigation. The maximum bins parameter, which is for discretizing continuous features, was set to 2 since we use one-hot encoding on categorical variables.
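For readers who wish to reproduce a comparable configuration, the settings above can be collected as keyword arguments. The sketch below assumes the PySpark `RandomForestClassifier` parameter names; the text does not state which Spark language binding was used, so this mapping is our assumption. The dictionary is plain Python, and the lines that require a Spark installation are left as comments.

```python
# Assumed mapping of the described RF settings onto PySpark parameter names.
rf_params = {
    "numTrees": 100,                       # number of trees
    "maxDepth": 16,                        # maximum tree depth
    "cacheNodeIds": True,                  # cache node IDs for each instance
    "maxMemoryInMB": 1024,                 # maximum memory parameter
    "featureSubsetStrategy": "onethird",   # one-third of features per split
    "maxBins": 2,                          # bins for discretizing features
}

# With PySpark installed, the settings could be applied as:
# from pyspark.ml.classification import RandomForestClassifier
# rf = RandomForestClassifier(labelCol="label", featuresCol="features", **rf_params)

for name, value in rf_params.items():
    print(f"{name} = {value}")
```

The `labelCol` and `featuresCol` values in the commented lines are placeholder column names, not names taken from the Medicare data.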
GBT is an ensemble approach that trains each decision tree iteratively in order to minimize loss determined by the algorithm’s loss function. During each iteration, the ensemble is used to predict the class for each training instance. The predicted values are compared with the actual values, allowing for the identification and correction of previously mislabeled instances. The parameter that caches node IDs for each instance was set to true, and the maximum memory parameter was set to 1024 MB to minimize training time.
Machine learning model performance is scored using the Area Under the Receiver Operating Characteristics (ROC) Curve (AUC) [20]. A receiver operating characteristic curve (ROC curve), is commonly used to visualize the performance of binary classification. The ROC curve is generated by plotting the true positive rate (TPR), also called sensitivity (
Model evaluation
Model evaluation is done using
Significance testing
In order to provide additional rigor around our AUC performance results, we use hypothesis testing to show the statistical significance of the model performance results. Both ANalysis Of VAriance (ANOVA) [42] and post hoc analysis via Tukey’s Honestly Significant Difference (HSD) [75] tests are used in our study. ANOVA is a statistical test determining whether the means of several groups (or factors) are equal. Tukey’s HSD test determines factor means that are significantly different from each other. This test compares all possible pairs of means using a method similar to a
Results and discussion
To reiterate, our goal is to assess the impact of rarity in big data, leveraging a real-world use case in Medicare provider claims fraud detection. This Medicare fraud scenario provides a realistic picture as to the effects of rare classes (e.g. fraud cases) on the detection performance of machine learning models. In Fig. 2, we depict the average AUC results for each dataset, model, positive class count, and sampling method. The table associated with this figure, providing the average AUC scores, can be found in Table 5 in the Appendix. In this figure, we employ notched box plots to visually compare groups and assess any significant differences therein. For example, if the notches of two boxes do not overlap, this suggests that there are significant differences between these results.
Summary of Medicare fraud detection model performance.
Overall, the trend shows decreasing model performance and increasing AUC score variability, across all models and datasets, as the number of positive class instances decreases. These results are intuitive, as we would expect a machine learning model to struggle as the number of positive class instances decreases. In other words, the difficulty in discriminating fraud cases leads to lower, wide-ranging AUC scores. For example, compare the results for Part B at 1,409 versus 50 positive class instances. This is due to the strong influence of the non-fraud instances that make up the preponderance of each dataset. Moreover, model performance indicates a failure to correctly discern small disjuncts, compounded by the complication of noise inherent in big data. In examining the results further, because there are so few positive cases to begin with, the models tend to have a very low false positive rate (FPR), meaning that few non-fraud observations are classified as fraud, together with a very high false negative rate (FNR), implying that most of the fraud observations are classified as non-fraud. A good machine learning model has both a low FPR and a low FNR. Some of the variation in AUC scores, especially for the smaller positive class counts, stems from the TPR, because the numbers of true positives (actual fraud cases) and false negatives (fraud classified as non-fraud) are highly variable when the total number of positive class instances is so small. In general, the poor model performance due to rarity indicates the difficulty, and inconsistency, in classifying rare positive class cases. There are simply not enough data points to discriminate between classes as the positive class instances decrease. Yet, the results using all known fraud cases show promise, indicating good overall fraud detection performance.
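The interplay between these rates under rarity can be illustrated with a small Python sketch. The confusion-matrix counts below are invented for illustration (50 fraud cases among one million non-fraud cases) and are not results from our experiments; the function name `rates` is likewise hypothetical.

```python
def rates(tp: int, fp: int, tn: int, fn: int) -> dict:
    """Compute TPR, FPR, and FNR from confusion-matrix counts."""
    return {
        "TPR": tp / (tp + fn),  # fraud correctly flagged
        "FPR": fp / (fp + tn),  # non-fraud wrongly flagged as fraud
        "FNR": fn / (tp + fn),  # fraud missed (classified as non-fraud)
    }


# Hypothetical outcome: only 5 of 50 fraud cases are detected.
result = rates(tp=5, fp=100, tn=999_900, fn=45)
print(result)

# With 50 positives, each additional detected case shifts the TPR by 1/50.
step = 1 / 50
print(step)
```

In this made-up case the FPR is tiny because the negative class is huge, while the FNR is 0.9, and each single true positive moves the TPR by two percentage points, which is one source of the variability discussed above.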
In an attempt to improve fraud detection performance, we performed both RUS and ROS data sampling. Figure 2 also shows these results and, upon inspection, data sampling appears to provide some possible benefit. While patterns and trends can be derived from this figure, we employ hypothesis testing to determine the significance of these results. We performed a one-factor ANOVA test on Sampling Method (across all models, datasets, and positive class counts), with this factor being significant at the 95% confidence level. From this, a Tukey’s HSD test was performed to obtain the significance between no sampling, RUS, and ROS performance results. Table 4 shows that RUS, which is in group ‘a’, significantly increases overall fraud detection performance versus no sampling, whereas ROS decreases performance. It is important to note that these changes in performance can vary based on the model used. In our case, RUS performs better using the tree-based learners (RF and GBT), which could be due to RUS being better able to represent misclassification costs and class distribution [38, 52]. However, LR shows more favorable results with ROS. This may be due to the squared-error loss function with the application of L2 regularization, also known as Ridge Regression, penalizing large coefficients and improving the generalization performance, making LR fairly robust to noise and overfitting (both possibilities due to the duplicated observations in ROS).
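For reference, the F statistic behind a one-factor ANOVA can be computed directly from group means. The Python sketch below is a textbook formulation with invented AUC values for the three sampling groups; it does not reproduce our results and stops short of the p-value, which requires the F distribution (available in statistical packages).

```python
from statistics import mean


def one_way_anova_f(groups: dict) -> float:
    """Return the one-way ANOVA F statistic for a dict of group -> values."""
    all_values = [v for values in groups.values() for v in values]
    grand_mean = mean(all_values)
    k = len(groups)        # number of groups (factor levels)
    n = len(all_values)    # total number of observations

    # Variation of group means around the grand mean.
    ss_between = sum(
        len(values) * (mean(values) - grand_mean) ** 2
        for values in groups.values()
    )
    # Variation of observations around their own group mean.
    ss_within = sum(
        (v - mean(values)) ** 2
        for values in groups.values()
        for v in values
    )
    return (ss_between / (k - 1)) / (ss_within / (n - k))


# Invented AUC scores per sampling method, for illustration only.
auc_by_sampling = {
    "None": [0.70, 0.72, 0.71],
    "RUS": [0.78, 0.80, 0.79],
    "ROS": [0.68, 0.69, 0.70],
}
print(one_way_anova_f(auc_by_sampling))
```

A large F value indicates that the differences between group means are large relative to the spread within groups, which is the condition under which a post hoc test such as Tukey's HSD is then applied.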
Tukey’s HSD test results for sampling methods
Tukey’s HSD test results for group ‘a’.
Given that models built with RUS-generated datasets outperform both the ROS and non-sampled datasets, we perform Tukey’s HSD tests for each positive class count value to assess any significant differences across class ratios and datasets. We focus on examining the group ‘a’ results over all models. Investigating the results for only this group provides some indication as to the best performing sampling methods for the majority of rare positive class cases. Note that multiple combinations can share a group designation. For example, for the combined dataset with 100 and 200 positive class counts, RUS and ROS with a class ratio of 90:10 are both in group ‘a’. Figure 3 depicts the average AUC scores, per dataset, for sampling methods and associated class ratios. This shows that RUS with a 90:10 class ratio consistently accounts for the majority of group ‘a’ results for the original and rare subsets across the Medicare datasets, irrespective of the model used. ROS exhibits better performance with more balanced datasets, particularly the 50:50 class ratio. From these results, RUS is effective in increasing model performance, and we show that retaining a reasonable percentage of the majority class and all of the minority class provides good results without losing too much information from the original dataset, even though the undersampled dataset remains imbalanced. Additionally, due to the high level of imbalance present in these datasets, the 90:10 class ratio dramatically decreases the number of non-fraudulent cases, so building these models requires significantly less time and fewer computing resources. The complete Tukey’s HSD results over the positive classes are listed in Tables 6 and 7 in the Appendix.
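The idea of undersampling to a target class ratio while keeping every minority instance can be sketched as follows. This is a minimal illustration, not the implementation used in the experiments: the function name `random_undersample`, the label encoding (1 for fraud, 0 for non-fraud), and the dataset sizes are all assumptions made for the example.

```python
import random


def random_undersample(labels, majority_pct, seed=0):
    """Return sorted indices kept by random undersampling (RUS).

    All positive (label 1) instances are kept. Negative (label 0) instances
    are sampled without replacement so that negatives:positives is roughly
    majority_pct:(100 - majority_pct), e.g. 90:10.
    """
    rng = random.Random(seed)
    pos = [i for i, y in enumerate(labels) if y == 1]
    neg = [i for i, y in enumerate(labels) if y == 0]
    n_neg = round(len(pos) * majority_pct / (100 - majority_pct))
    n_neg = min(n_neg, len(neg))  # cannot keep more negatives than exist
    return sorted(pos + rng.sample(neg, n_neg))


# Hypothetical dataset: 50 fraud and 100,000 non-fraud instances.
labels = [1] * 50 + [0] * 100_000
kept = random_undersample(labels, majority_pct=90)
n_pos = sum(labels[i] for i in kept)
n_neg = len(kept) - n_pos
print(n_pos, n_neg)  # 50 positives, 450 negatives
```

With 50 positive instances, a 90:10 ratio keeps only 450 of the negative instances, which illustrates the reduction in training data described above.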
The fraud detection results using real-world Medicare big datasets convey the difficulties in implementing machine learning solutions. The issues around rarity are clearly demonstrated in relation to a set of known fraud labels. We demonstrated that data sampling, in particular RUS, can be used to increase rare class classification performance. Moreover, RUS has been shown to perform well even with class noise [77]. Yet, there is a point at which there are simply not enough positive class instances to detect any discernible patterns to partition positive and negative class instances. At this point, the only reasonable course of action would be to obtain more quality data. Even though we were able to show the effects of rarity in big data on machine learning and the advantageous usage of data sampling, as with any experimental study, we would be remiss not to mention possible research limitations. The main limitation revolves around the data employed for the use case in our study. More specifically, the fraud label information in the LEIE is not all-inclusive: 38% of providers with fraud convictions continue to practice medicine and 21% were not suspended from medical practice despite their convictions [67]. This implies that there are potentially unlabeled or mislabeled providers, which could affect the ground truth and model performance evaluations with rare positive class labels.
The problems with severe class imbalance can be detrimental to predictions from machine learning models. In particular, the effects of rare classes exacerbate this issue. With the ubiquitous nature of big data in today’s society, confronting the issues from rare classes becomes increasingly important in areas such as fraud detection for which there are limited known cases. Yet, there are few studies that investigate or attempt to mitigate these concerns regarding big data and class rarity. In our study, we leverage publicly available Medicare claims data (Part B, Part D, DMEPOS, and a combined dataset) with labels from the LEIE, to study the effects of rarity in big data focusing on Medicare fraud detection. We describe our unique methods for data processing and engineering, which are critical in creating usable Medicare datasets for the creation of successful fraud detection models. We use three different machine learning models to detect fraud in each Medicare dataset and assess performance using AUC scores. Three artificially generated rare data subsets are created from the original Medicare datasets. From our results, we show that performance using the original datasets, with all known fraud labels, is reasonable across learners and datasets. The application of rarity, however, significantly decreases model performance in nearly every scenario. This is an expected result, so in order to lessen some of this performance degradation, we apply two data sampling approaches. RUS is demonstrated to be significantly better than the original dataset and rarity subset results, with ROS exhibiting the worst overall fraud detection performance. Therefore, given this real-world scenario, rarity can be alleviated by employing RUS, in particular with a 90:10 class ratio. 
Lastly, retaining more of the negative class (non-fraud) instances, thus reducing information loss, appears to produce the best overall results when combating the effects of rarity, over most of the Medicare datasets. Future work will involve other viable sources of big data from healthcare insurance fraud. We also intend to assess the generalization of our results to other domains, with the use of additional performance metrics.
Acknowledgments
The authors would like to thank the anonymous reviewers for the constructive evaluation of this paper and also the various members of the Data Mining and Machine Learning Laboratory at Florida Atlantic University. We also acknowledge partial support by the NSF (CNS-1427536). Opinions, findings, conclusions, or recommendations in this paper are the authors’ and do not reflect the views of the NSF.
Conflict of interest
The authors declare that they have no conflict of interest.
Appendix
Average AUC scores for all experiment combinations
Column headers give the dataset and, in parentheses, the number of positive class instances.

| Learner | Sampling method | Class ratio | Part B (50) | Part B (100) | Part B (200) | Part B (1409) | Combined (50) | Combined (100) | Combined (200) | Combined (473) | Part D (50) | Part D (100) | Part D (200) | Part D (1018) | DMEPOS (50) | DMEPOS (100) | DMEPOS (200) | DMEPOS (635) |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| LR | None | [Full] | 0.7368 | 0.7477 | 0.7737 | 0.8052 | 0.6682 | 0.7550 | 0.7911 | 0.8155 | 0.7000 | 0.7292 | 0.7439 | 0.7816 | 0.6404 | 0.6890 | 0.7068 | 0.7406 |
| LR | ROS | [50:50] | 0.7137 | 0.7241 | 0.7653 | 0.8191 | 0.6676 | 0.7171 | 0.7824 | 0.8155 | 0.6756 | 0.7062 | 0.7330 | 0.7874 | 0.6198 | 0.6779 | 0.7060 | 0.7329 |
| LR | ROS | [65:35] | 0.7305 | 0.7286 | 0.7698 | 0.8187 | 0.6745 | 0.7202 | 0.7882 | 0.7904 | 0.6731 | 0.7132 | 0.7320 | 0.7882 | 0.6102 | 0.6816 | 0.7098 | 0.7169 |
| LR | ROS | [75:25] | 0.7273 | 0.7339 | 0.7664 | 0.8189 | 0.6834 | 0.7261 | 0.7889 | 0.7530 | 0.6752 | 0.7046 | 0.7319 | 0.7880 | 0.6212 | 0.6783 | 0.7060 | 0.7035 |
| LR | ROS | [90:10] | 0.7168 | 0.7408 | 0.7660 | 0.8143 | 0.6969 | 0.7248 | 0.7872 | 0.6831 | 0.6873 | 0.7062 | 0.7403 | 0.7871 | 0.6274 | 0.6754 | 0.7119 | 0.6509 |
| LR | ROS | [99:1] | 0.7165 | 0.7324 | 0.7591 | 0.8094 | 0.6779 | 0.7302 | 0.7872 | 0.6621 | 0.6874 | 0.7122 | 0.7364 | 0.7824 | 0.6290 | 0.6790 | 0.7063 | 0.5941 |
| LR | RUS | [50:50] | 0.6642 | 0.6859 | 0.7469 | 0.8141 | 0.6766 | 0.6811 | 0.7331 | 0.7941 | 0.6005 | 0.6615 | 0.7190 | 0.7756 | 0.5890 | 0.6099 | 0.6836 | 0.7222 |
| LR | RUS | [65:35] | 0.6723 | 0.7340 | 0.7659 | 0.8181 | 0.7044 | 0.6945 | 0.7552 | 0.8100 | 0.6509 | 0.6936 | 0.7470 | 0.7822 | 0.6169 | 0.6336 | 0.7011 | 0.7349 |
| LR | RUS | [75:25] | 0.7032 | 0.7316 | 0.7731 | 0.8169 | 0.6969 | 0.7072 | 0.7673 | 0.8155 | 0.6614 | 0.7056 | 0.7517 | 0.7854 | 0.6396 | 0.6347 | 0.7025 | 0.7372 |
| LR | RUS | [90:10] | 0.6840 | 0.7542 | 0.7813 | 0.8188 | 0.7041 | 0.7239 | 0.7784 | 0.8187 | 0.6852 | 0.7206 | 0.7685 | 0.7866 | 0.6306 | 0.6468 | 0.7129 | 0.7442 |
| LR | RUS | [99:1] | 0.7068 | 0.7587 | 0.7799 | 0.8124 | 0.7242 | 0.7247 | 0.7844 | 0.8201 | 0.7016 | 0.7195 | 0.7770 | 0.7849 | 0.6381 | 0.6519 | 0.7181 | 0.7409 |
| GBT | None | [Full] | 0.6906 | 0.7344 | 0.7649 | 0.7957 | 0.6833 | 0.7321 | 0.7674 | 0.7905 | 0.6398 | 0.6948 | 0.7178 | 0.7485 | 0.6413 | 0.6711 | 0.6954 | 0.7313 |
| GBT | ROS | [50:50] | 0.6507 | 0.6757 | 0.7342 | 0.8162 | 0.6530 | 0.7079 | 0.7620 | 0.8070 | 0.6433 | 0.6446 | 0.6639 | 0.7669 | 0.6040 | 0.6285 | 0.6834 | 0.7163 |
| GBT | ROS | [65:35] | 0.6344 | 0.6832 | 0.7343 | 0.8170 | 0.6213 | 0.7117 | 0.7551 | 0.7642 | 0.6120 | 0.6408 | 0.6608 | 0.7651 | 0.5904 | 0.6198 | 0.6900 | 0.6834 |
| GBT | ROS | [75:25] | 0.6403 | 0.6642 | 0.7309 | 0.8204 | 0.5952 | 0.6886 | 0.7665 | 0.7256 | 0.6055 | 0.6268 | 0.6684 | 0.7662 | 0.6110 | 0.6357 | 0.6885 | 0.6639 |
| GBT | ROS | [90:10] | 0.6728 | 0.7017 | 0.7561 | 0.8204 | 0.6362 | 0.7223 | 0.7746 | 0.6280 | 0.6139 | 0.6437 | 0.6853 | 0.7704 | 0.6150 | 0.6585 | 0.6947 | 0.6138 |
| GBT | ROS | [99:1] | 0.6941 | 0.7316 | 0.7777 | 0.8033 | 0.6811 | 0.7313 | 0.7739 | 0.7433 | 0.6379 | 0.6731 | 0.7014 | 0.7557 | 0.6102 | 0.6703 | 0.6990 | 0.6369 |
| GBT | RUS | [50:50] | 0.6284 | 0.6823 | 0.7350 | 0.8050 | 0.6119 | 0.6603 | 0.7150 | 0.7759 | 0.6252 | 0.6340 | 0.6785 | 0.7451 | 0.5751 | 0.5895 | 0.6482 | 0.7060 |
| GBT | RUS | [65:35] | 0.6900 | 0.7141 | 0.7440 | 0.8143 | 0.6320 | 0.6851 | 0.7388 | 0.7913 | 0.6139 | 0.6477 | 0.6869 | 0.7600 | 0.5988 | 0.6059 | 0.6684 | 0.7209 |
| GBT | RUS | [75:25] | 0.6450 | 0.7080 | 0.7544 | 0.8195 | 0.6310 | 0.6847 | 0.7474 | 0.8040 | 0.6073 | 0.6635 | 0.7037 | 0.7654 | 0.5933 | 0.6148 | 0.6778 | 0.7334 |
| GBT | RUS | [90:10] | 0.6639 | 0.7341 | 0.7669 | 0.8206 | 0.6817 | 0.7185 | 0.7640 | 0.8167 | 0.6515 | 0.6794 | 0.7184 | 0.7676 | 0.6051 | 0.6417 | 0.6998 | 0.7378 |
| GBT | RUS | [99:1] | 0.7129 | 0.7537 | 0.7840 | 0.8037 | 0.6843 | 0.7261 | 0.7668 | 0.8037 | 0.6422 | 0.6910 | 0.7376 | 0.7573 | 0.6395 | 0.6599 | 0.7087 | 0.7359 |
| RF | None | [Full] | 0.6059 | 0.6375 | 0.7214 | 0.7960 | 0.5446 | 0.6239 | 0.7177 | 0.7938 | 0.5687 | 0.6182 | 0.6387 | 0.7089 | 0.5642 | 0.6253 | 0.6555 | 0.7076 |
| RF | ROS | [50:50] | 0.6698 | 0.6916 | 0.7230 | 0.7824 | 0.6494 | 0.6854 | 0.7489 | 0.7931 | 0.6065 | 0.6101 | 0.6333 | 0.6605 | 0.5852 | 0.6296 | 0.6679 | 0.6817 |
| RF | ROS | [65:35] | 0.6447 | 0.6809 | 0.6981 | 0.7697 | 0.6055 | 0.6670 | 0.7378 | 0.7699 | 0.5975 | 0.5894 | 0.6151 | 0.6598 | 0.5619 | 0.6158 | 0.6425 | 0.6649 |
| RF | ROS | [75:25] | 0.6319 | 0.6598 | 0.6882 | 0.7691 | 0.5864 | 0.6538 | 0.7292 | 0.7573 | 0.5969 | 0.5901 | 0.6079 | 0.6582 | 0.5313 | 0.6046 | 0.6416 | 0.6593 |
| RF | ROS | [90:10] | 0.6203 | 0.6515 | 0.6835 | 0.7712 | 0.5751 | 0.6294 | 0.7066 | 0.7564 | 0.5984 | 0.5793 | 0.6043 | 0.6789 | 0.5650 | 0.6119 | 0.6437 | 0.6828 |
| RF | ROS | [99:1] | 0.6272 | 0.6692 | 0.6917 | 0.7917 | 0.5686 | 0.6517 | 0.7345 | 0.7754 | 0.5826 | 0.5894 | 0.6272 | 0.7059 | 0.5652 | 0.6206 | 0.6508 | 0.7022 |
| RF | RUS | [50:50] | 0.6883 | 0.7225 | 0.7519 | 0.8150 | 0.6746 | 0.7079 | 0.7519 | 0.7955 | 0.6228 | 0.6614 | 0.6886 | 0.7409 | 0.6033 | 0.6418 | 0.6732 | 0.7238 |
| RF | RUS | [65:35] | 0.6827 | 0.7276 | 0.7536 | 0.8216 | 0.6890 | 0.7222 | 0.7594 | 0.8062 | 0.6282 | 0.6806 | 0.7045 | 0.7490 | 0.6179 | 0.6498 | 0.6905 | 0.7239 |
| RF | RUS | [75:25] | 0.7101 | 0.7385 | 0.7633 | 0.8270 | 0.6935 | 0.7332 | 0.7653 | 0.8150 | 0.6539 | 0.6641 | 0.7156 | 0.7584 | 0.6212 | 0.6590 | 0.6951 | 0.7289 |
| RF | RUS | [90:10] | 0.6827 | 0.7677 | 0.7682 | 0.8301 | 0.7071 | 0.7416 | 0.7792 | 0.8279 | 0.6385 | 0.6862 | 0.7152 | 0.7586 | 0.6399 | 0.6542 | 0.6959 | 0.7377 |
| RF | RUS | [99:1] | 0.6513 | 0.7220 | 0.7605 | 0.8159 | 0.6620 | 0.7010 | 0.7529 | 0.8152 | 0.6396 | 0.6666 | 0.6965 | 0.7371 | 0.5969 | 0.6488 | 0.6812 | 0.7224 |
Tukey’s HSD results for Part B and Combined datasets
| Dataset | Pos class | Sampling | Ratio | AUC | Group | Dataset | Pos class | Sampling | Ratio | AUC | Group |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Combined | 473 | RUS | [90:10] | 0.8211 | a | Part B | 1409 | RUS | [90:10] | 0.8232 | a |
| Combined | 473 | ROS | [50:50] | 0.8052 | a | Part B | 1409 | ROS | [50:50] | 0.8059 | a |
| Combined | 473 | None | [Full] | 0.7999 | a | Part B | 200 | RUS | [99:1] | 0.7748 | a |
| Combined | 200 | RUS | [90:10] | 0.7739 | a | Part B | 200 | RUS | [90:10] | 0.7721 | a |
| Combined | 200 | ROS | [99:1] | 0.7652 | a | Part B | 200 | None | [Full] | 0.7533 | a |
| Combined | 200 | ROS | [50:50] | 0.7644 | a | Part B | 100 | RUS | [90:10] | 0.7520 | a |
| Combined | 200 | ROS | [75:25] | 0.7615 | a | Part B | 100 | ROS | [99:1] | 0.7111 | a |
| Combined | 200 | ROS | [65:35] | 0.7604 | a | Part B | 50 | RUS | [99:1] | 0.6903 | a |
| Combined | 200 | None | [Full] | 0.7587 | a | Part B | 50 | ROS | [99:1] | 0.6792 | a |
| Combined | 200 | ROS | [90:10] | 0.7561 | a | Part B | 50 | ROS | [50:50] | 0.6781 | a |
| Combined | 100 | RUS | [90:10] | 0.7280 | a | Part B | 50 | None | [Full] | 0.6778 | a |
| Combined | 100 | ROS | [99:1] | 0.7044 | a | Part B | 50 | ROS | [90:10] | 0.6700 | a |
| Combined | 100 | None | [Full] | 0.7037 | a | Part B | 50 | ROS | [65:35] | 0.6699 | a |
| Combined | 100 | ROS | [50:50] | 0.7035 | a | Part B | 50 | ROS | [75:25] | 0.6665 | a |
| Combined | 100 | ROS | [65:35] | 0.6996 | a | Part B | 1409 | RUS | [75:25] | 0.8211 | ab |
| Combined | 50 | RUS | [90:10] | 0.6977 | a | Part B | 1409 | ROS | [75:25] | 0.8028 | ab |
| Combined | 100 | ROS | [90:10] | 0.6922 | a | Part B | 200 | RUS | [75:25] | 0.7636 | ab |
| Combined | 50 | RUS | [99:1] | 0.6902 | a | Part B | 100 | RUS | [99:1] | 0.7448 | ab |
| Combined | 100 | ROS | [75:25] | 0.6895 | a | Part B | 200 | ROS | [99:1] | 0.7428 | ab |
| Combined | 50 | ROS | [50:50] | 0.6567 | a | Part B | 100 | None | [Full] | 0.7065 | ab |
| Combined | 200 | RUS | [99:1] | 0.7680 | ab | Part B | 100 | ROS | [90:10] | 0.6980 | ab |
| Combined | 100 | RUS | [99:1] | 0.7173 | ab | Part B | 100 | ROS | [65:35] | 0.6976 | ab |
| Combined | 100 | RUS | [75:25] | 0.7084 | ab | Part B | 100 | ROS | [50:50] | 0.6971 | ab |
| Combined | 50 | RUS | [65:35] | 0.6751 | ab | Part B | 50 | RUS | [75:25] | 0.6861 | ab |
| Combined | 50 | RUS | [75:25] | 0.6738 | ab | Part B | 50 | RUS | [65:35] | 0.6817 | ab |
| Combined | 50 | ROS | [99:1] | 0.6425 | ab | Part B | 50 | None | [Full] | 0.6778 | ab |
| Combined | 50 | ROS | [90:10] | 0.6361 | ab | Part B | 50 | RUS | [90:10] | 0.6769 | ab |
| Combined | 50 | ROS | [65:35] | 0.6338 | ab | Part B | 200 | ROS | [50:50] | 0.7408 | abc |
| Combined | 50 | None | [Full] | 0.6321 | ab | Part B | 1409 | RUS | [65:35] | 0.8180 | b |
| Combined | 473 | RUS | [99:1] | 0.8130 | b | Part B | 1409 | ROS | [90:10] | 0.8020 | b |
| Combined | 473 | RUS | [75:25] | 0.8115 | b | Part B | 1409 | ROS | [65:35] | 0.8018 | b |
| Combined | 473 | ROS | [65:35] | 0.7748 | b | Part B | 1409 | ROS | [99:1] | 0.8015 | b |
| Combined | 100 | None | [Full] | 0.7037 | b | Part B | 1409 | None | [Full] | 0.7990 | b |
| Combined | 50 | ROS | [75:25] | 0.6217 | b | Part B | 100 | ROS | [75:25] | 0.6860 | b |
| Combined | 200 | RUS | [75:25] | 0.7600 | bc | Part B | 50 | RUS | [50:50] | 0.6603 | b |
| Combined | 200 | None | [Full] | 0.7587 | bc | Part B | 200 | RUS | [65:35] | 0.7545 | bc |
| Combined | 100 | RUS | [65:35] | 0.7006 | bc | Part B | 200 | None | [Full] | 0.7533 | bc |
| Combined | 50 | RUS | [50:50] | 0.6544 | bc | Part B | 200 | ROS | [90:10] | 0.7352 | bc |
| Combined | 473 | RUS | [65:35] | 0.8025 | c | Part B | 200 | ROS | [65:35] | 0.7341 | bc |
| Combined | 473 | None | [Full] | 0.7999 | c | Part B | 100 | RUS | [75:25] | 0.7260 | bc |
| Combined | 200 | RUS | [65:35] | 0.7511 | c | Part B | 100 | RUS | [65:35] | 0.7252 | bc |
| Combined | 473 | ROS | [75:25] | 0.7453 | c | Part B | 1409 | RUS | [50:50] | 0.8114 | c |
| Combined | 100 | RUS | [50:50] | 0.6831 | c | Part B | 1409 | RUS | [99:1] | 0.8107 | c |
| Combined | 50 | None | [Full] | 0.6321 | c | Part B | 200 | RUS | [50:50] | 0.7446 | c |
| Combined | 473 | RUS | [50:50] | 0.7885 | d | Part B | 200 | ROS | [75:25] | 0.7285 | c |
| Combined | 200 | RUS | [50:50] | 0.7333 | d | Part B | 100 | None | [Full] | 0.7065 | cd |
| Combined | 473 | ROS | [99:1] | 0.7269 | d | Part B | 1409 | None | [Full] | 0.7990 | d |
| Combined | 473 | ROS | [90:10] | 0.6892 | e | Part B | 100 | RUS | [50:50] | 0.6969 | d |
Tukey’s HSD results for Part D and DMEPOS datasets
| Dataset | Pos class | Sampling | Ratio | AUC | Group | Dataset | Pos class | Sampling | Ratio | AUC | Group |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Part D | 1018 | RUS | [90:10] | 0.7709 | a | DMEPOS | 200 | RUS | [99:1] | 0.7027 | a |
| Part D | 1018 | RUS | [75:25] | 0.7697 | a | DMEPOS | 200 | ROS | [99:1] | 0.6854 | a |
| Part D | 1018 | ROS | [99:1] | 0.7480 | a | DMEPOS | 200 | RUS | [90:10] | 0.7029 | a |
| Part D | 1018 | None | [Full] | 0.7463 | a | DMEPOS | 635 | RUS | [90:10] | 0.7399 | a |
| Part D | 1018 | ROS | [90:10] | 0.7455 | a | DMEPOS | 200 | ROS | [90:10] | 0.6834 | a |
| Part D | 200 | RUS | [99:1] | 0.7370 | a | DMEPOS | 100 | None | [Full] | 0.6618 | a |
| Part D | 200 | None | [Full] | 0.7002 | a | DMEPOS | 50 | None | [Full] | 0.6153 | a |
| Part D | 100 | RUS | [90:10] | 0.6954 | a | DMEPOS | 100 | None | [Full] | 0.6618 | a |
| Part D | 100 | None | [Full] | 0.6807 | a | DMEPOS | 200 | None | [Full] | 0.6859 | a |
| Part D | 50 | RUS | [99:1] | 0.6611 | a | DMEPOS | 635 | None | [Full] | 0.7265 | a |
| Part D | 50 | RUS | [90:10] | 0.6584 | a | DMEPOS | 200 | ROS | [75:25] | 0.6787 | a |
| Part D | 50 | ROS | [50:50] | 0.6418 | a | DMEPOS | 50 | RUS | [99:1] | 0.6248 | a |
| Part D | 50 | None | [Full] | 0.6362 | a | DMEPOS | 50 | RUS | [90:10] | 0.6252 | a |
| Part D | 50 | ROS | [99:1] | 0.6360 | a | DMEPOS | 200 | ROS | [65:35] | 0.6808 | a |
| Part D | 50 | ROS | [90:10] | 0.6332 | a | DMEPOS | 200 | ROS | [50:50] | 0.6858 | a |
| Part D | 50 | ROS | [65:35] | 0.6275 | a | DMEPOS | 50 | RUS | [75:25] | 0.6181 | ab |
| Part D | 50 | ROS | [75:25] | 0.6259 | a | DMEPOS | 50 | None | [Full] | 0.6153 | ab |
| Part D | 200 | RUS | [90:10] | 0.7340 | ab | DMEPOS | 50 | RUS | [65:35] | 0.6112 | ab |
| Part D | 100 | RUS | [99:1] | 0.6923 | ab | DMEPOS | 100 | RUS | [99:1] | 0.6535 | ab |
| Part D | 200 | ROS | [99:1] | 0.6883 | ab | DMEPOS | 200 | RUS | [75:25] | 0.6918 | ab |
| Part D | 100 | None | [Full] | 0.6807 | ab | DMEPOS | 50 | ROS | [50:50] | 0.6030 | ab |
| Part D | 100 | RUS | [75:25] | 0.6777 | ab | DMEPOS | 50 | ROS | [90:10] | 0.6025 | ab |
| Part D | 50 | RUS | [75:25] | 0.6409 | ab | DMEPOS | 50 | ROS | [99:1] | 0.6015 | ab |
| Part D | 50 | None | [Full] | 0.6362 | ab | DMEPOS | 100 | ROS | [99:1] | 0.6567 | ab |
| Part D | 50 | RUS | [65:35] | 0.6310 | ab | DMEPOS | 100 | ROS | [90:10] | 0.6486 | ab |
| Part D | 1018 | RUS | [65:35] | 0.7637 | b | DMEPOS | 100 | ROS | [50:50] | 0.6453 | ab |
| Part D | 1018 | RUS | [99:1] | 0.7597 | b | DMEPOS | 100 | RUS | [90:10] | 0.6476 | abc |
| Part D | 1018 | ROS | [50:50] | 0.7383 | b | DMEPOS | 50 | RUS | [50:50] | 0.5891 | b |
| Part D | 1018 | ROS | [65:35] | 0.7377 | b | DMEPOS | 200 | RUS | [65:35] | 0.6867 | b |
| Part D | 1018 | ROS | [75:25] | 0.7375 | b | DMEPOS | 200 | None | [Full] | 0.6859 | b |
| Part D | 100 | RUS | [65:35] | 0.6740 | b | DMEPOS | 635 | RUS | [75:25] | 0.7332 | b |
| Part D | 100 | ROS | [99:1] | 0.6582 | b | DMEPOS | 635 | RUS | [99:1] | 0.7331 | b |
| Part D | 100 | ROS | [50:50] | 0.6536 | b | DMEPOS | 50 | ROS | [75:25] | 0.5878 | b |
| Part D | 100 | ROS | [65:35] | 0.6478 | b | DMEPOS | 50 | ROS | [65:35] | 0.5875 | b |
| Part D | 100 | ROS | [90:10] | 0.6431 | b | DMEPOS | 100 | ROS | [75:25] | 0.6395 | b |
| Part D | 100 | ROS | [75:25] | 0.6405 | b | DMEPOS | 100 | ROS | [65:35] | 0.6391 | b |
| Part D | 50 | RUS | [50:50] | 0.6162 | b | DMEPOS | 635 | ROS | [50:50] | 0.7103 | b |
| Part D | 200 | RUS | [75:25] | 0.7237 | bc | DMEPOS | 100 | RUS | [75:25] | 0.6362 | bc |
| Part D | 200 | ROS | [50:50] | 0.6767 | bc | DMEPOS | 200 | RUS | [50:50] | 0.6683 | c |
| Part D | 200 | ROS | [90:10] | 0.6766 | bc | DMEPOS | 635 | RUS | [65:35] | 0.7266 | c |
| Part D | 1018 | RUS | [50:50] | 0.7538 | c | DMEPOS | 635 | None | [Full] | 0.7265 | c |
| Part D | 200 | ROS | [75:25] | 0.6694 | c | DMEPOS | 635 | ROS | [65:35] | 0.6884 | c |
| Part D | 200 | ROS | [65:35] | 0.6693 | c | DMEPOS | 100 | RUS | [65:35] | 0.6298 | cd |
| Part D | 100 | RUS | [50:50] | 0.6523 | c | DMEPOS | 100 | RUS | [50:50] | 0.6137 | d |
| Part D | 200 | RUS | [65:35] | 0.7128 | cd | DMEPOS | 635 | RUS | [50:50] | 0.7173 | d |
| Part D | 1018 | None | [Full] | 0.7463 | d | DMEPOS | 635 | ROS | [75:25] | 0.6756 | d |
| Part D | 200 | None | [Full] | 0.7002 | de | DMEPOS | 635 | ROS | [90:10] | 0.6492 | e |
| Part D | 200 | RUS | [50:50] | 0.6954 | e | DMEPOS | 635 | ROS | [99:1] | 0.6444 | e |
