A study on rare fraud predictions with big Medicare claims fraud data

Abstract

Access to copious amounts of information has reached unprecedented levels, and can generate very large data sources. These big data sources often contain a plethora of useful information but, in some cases, finding what is actually useful can be quite problematic. For binary classification problems, such as fraud detection, a major concern therein is one of class imbalance. This is when a dataset has more of one label versus another, such as a large number of non-fraud observations with comparatively few observations of fraud (which we consider the class of interest). Class rarity further delineates class imbalance with significantly smaller numbers in the class of interest. In this study, we assess the impacts of class rarity in big data, and apply data sampling to mitigate some of the performance degradation caused by rarity. Real-world Medicare claims datasets with known excluded providers are used as fraud labels for a fraud detection scenario, incorporating three machine learning models. We discuss the necessary data processing and engineering steps in order to understand, integrate, and use the Medicare data. From these already imbalanced datasets, we generate three additional datasets representing varying levels of class rarity. We show that, as expected, rarity significantly decreases model performance, but data sampling, specifically random undersampling, can help significantly with rare class detection in identifying Medicare claims fraud cases.

Keywords

Big data, Medicare, fraud detection, class imbalance, data sampling, rare classes

1. Introduction

Today’s society has access to a multitude of information sources that provide unique observations or cases for making informed assessments or decisions. In many real-world cases, these unique and interesting observations do not constitute the majority in the provided information. An example that most people are familiar with is using a search engine to cull through billions of internet sources and return the few observations of interest. Most of this information could be considered irrelevant (i.e. noise) and, as such, can be excluded. Some less mundane examples where the interesting observations are significantly less frequent than the normative observations include the detection of fraud within credit card transactions [35] and credit scoring [25], healthcare billing fraud [76], or detecting faults in software development [50]. Another factor contributing to the relative infrequency of these observations of interest, and making them potentially more difficult to find, is the explosion of data in recent years [70]. This so-called big data dramatically increases the number of normative cases relative to the cases of interest [49]. This discrepancy between interesting and normative cases is known as class imbalance, where each (interesting or normative) case is considered a class. The concern with class imbalance is akin to looking for the proverbial ‘needle in a haystack’, where what is of interest is difficult to find because of the large amounts of irrelevant or unimportant information.

In binary classification problems, e.g. interesting versus uninteresting, observations (also known as cases or instances) of the positive class are usually considered the class of interest with the negative class making up the remaining instances. The problem of class imbalance impacts many real-world problems and any possible solutions. Many of these problems have manual solutions, but the application of machine learning methods can help derive meaningful information, as well as reduce manual efforts. However, machine learning methods are susceptible to issues related to class imbalance [78]. This is further exacerbated when the positive class instances become increasingly rare. In our case, rarity, in relation to machine learning methods, includes any extremely small number of positive class instances, regardless of the proportion to the negative class. For instance, take a dataset where the positive class makes up 1% of the instances (with 99% being the negative class): 1% of 1,000 instances is only 10 positive class instances, whereas 1% of 10 million instances is 100,000 positive class instances. A machine learning model built with 100,000 positive class instances has adequate information to discern class patterns, whereas only 10 positive class instances may not provide enough discriminatory information to effectively predict the correct class. From this example, we note that using percentages to express the severity of class imbalance, while useful, can be misleading, especially with regard to rare observations.
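
The arithmetic in the example above can be checked with a short sketch. The function name `positive_count` is purely illustrative; the sketch only shows that the same 1% positive rate yields very different absolute numbers of positive instances depending on dataset size.

```python
def positive_count(total_instances: int, positive_rate: float) -> int:
    """Return the number of positive instances implied by a dataset size and rate."""
    return round(total_instances * positive_rate)


# Same 1% positive rate, very different absolute counts.
small = positive_count(1_000, 0.01)        # small dataset
large = positive_count(10_000_000, 0.01)   # big dataset
print(small, large)
```

The percentage alone hides the fact that the smaller dataset gives a learner only a handful of positive examples to learn from.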

There are a number of studies on class imbalance, primarily using smaller datasets [43], with few studies addressing class imbalance and big data [44, 40]. This lack of research could, in part, be due to the difficulties in applying traditional machine learning techniques to big data to assess the impact of class imbalance. These difficulties tend to stem from typical attributes associated with big data [36, 49], making it difficult to model and analyze using traditional methods. As there is no exact measure for or meaning of small versus big data, for simplicity, we consider datasets with under 100,000 instances as small. An example of smaller (non-big) data could be a dataset of 5,000 instances collected over a period of one month for a small company, where each instance represents the front door access record of an employee of the company. In contrast, an example of big data could be a dataset of several million or more weather forecast reference points for collecting real-time weather data, or insurance claims records collected over several years. With the advent of big data frameworks, such as Apache Spark [82] and H2O [56], machine learning models can be efficiently built using large datasets with millions of instances to assess the effects of severe class imbalance and rarity [48]. Class imbalance with rare cases, however, is a far less studied area, with existing work employing only smaller datasets [37, 72]. As of this writing, there are no known studies on big data and class rarity.

As mentioned, fraud detection is one area with extreme sparsity in known fraud cases. Given this disparity in fraud and normal cases, fraud detection is a good real-world scenario with which to evaluate the impact of having rarity in the positive class. Therefore, in our study, we assess the impact of rarity in fraud detection using Medicare provider claims data, focusing on the effects of class rarity in Medicare fraud detection. Medicare is one of the largest government healthcare programs in the United States (U.S.), with over 54.3 million beneficiaries. This program reimburses hospitals and physicians for medical care provided to people over the age of 65 and to younger individuals with specific medical conditions and disabilities [5]. The use of Medicare claims data allows us to gauge the impact of rarity on big data with real-world implications. The healthcare field produces a large amount of information, from patient records to medical claims, and continues to utilize big data sources in order to become more efficient and productive [61, 68]. With this increased use of information, medical fraud continues to be attractive to would-be perpetrators, adversely affecting healthcare costs and quality of service. Understanding the impacts of having such low numbers of known positive instances is critical for the successful detection of actual fraudulent cases. With successful fraud detection methods, the FBI estimates that up to 10% of healthcare costs could be recovered [64]. In particular, the increase in elderly population [2, 6] poses unique challenges, such as increased medical care due to chronic illness, necessitating appropriate healthcare coverage for various medical drugs and services. This increased longevity in the elderly population in the U.S. can be attributed to overall improvements in healthcare, particularly with treatments of acute illnesses and diseases [9]. Medicare alone accounts for around 15% ($495 billion) of total U.S. federal spending [3, 34]. Thus, fraud detection could garner substantial savings that could assist patients and beneficiaries by decreasing healthcare costs.

The detection of Medicare-related fraud and abuse¹ is an area of study made possible by the public nature of the provider payment and utilization information. In addition, the List of Excluded Individuals/Entities (LEIE) [57] is a source of information on real-world fraud and abuse listing providers who have been excluded from healthcare practices in government-funded programs, such as Medicare. There are several studies on fraud in healthcare [73, 79] using Medicare-related data and other healthcare sources. The vast majority of studies using Medicare data focus on Medicare Part B [11, 13, 16, 18, 39, 53, 54]. Other studies [71] include multiple Medicare datasets, or leverage fraud labels from the LEIE [21, 24, 67]. These studies, however, do not account for the issue of class imbalance, even with the use of the LEIE fraud labels. Even so, the methods used to address class imbalance are very limited and preliminary. No related studies on Medicare fraud detection focus on addressing or investigating class rarity.

¹ https://www.cms.gov/Outreach-and-Education/Medicare-Learning-Network-MLN/MLNProducts/downloads/fraud_and_abuse.pdf.

In this study, we focus on positive class rarity in big data, for fraud detection, by leveraging three publicly available Medicare datasets using LEIE excluded providers as fraud labels (i.e. positive class), as well as creating a combined dataset. Furthermore, we present our unique methods for data processing and engineering, to fully leverage the Medicare datasets for machine learning applications. The processing of each dataset is critical, not only for understanding the data, but for the successful application of machine learning. With these datasets, three models (learners) are used to assess fraud detection performance. In order to illustrate the effects of rarity, we map known excluded providers to the Medicare data and then artificially generate additional datasets (from the original Medicare data) by decreasing the number of positive class instances while maintaining the same number of negative class instances. We discuss the results of our experiment and use statistical significance tests to help assess meaningful differences in model performance due to the varying levels of positive class instances, and clearly demonstrate degraded performance as positive classes become rarer. Since this pattern of decreased performance is expected, the question is what can be done to lessen this loss in performance? We perform data sampling in order to increase model performance for rare class classification and show that significant improvement can be achieved. To the best of our knowledge, this is the only study to investigate the effects of rarity in big data within a current, real-world healthcare application domain, and employing data sampling techniques as a means to mitigate the detrimental effects on model performance due to rarity.

The rest of the paper is organized as follows. Section 2 discusses works related to the current research, focusing on class rarity. We discuss Medicare fraud and abuse in Section 3 and provide an overview of the datasets used (processing and data engineering) and the mapping of fraud labels in Section 4. In Section 5, we discuss class imbalance and rarity. In Section 6, we present our experimental design. The results of our research are discussed in Section 7. Finally, Section 8 summarizes our research limitations, conclusions, and future work.

2. Related works

The problems associated with class imbalance (of which class rarity is a subset) have been researched in a number of studies [8, 37, 43, 80]. Even so, the focus on truly rare class data is very limited. There have been several studies on overcoming the issues stemming from the detection of imbalanced or severely imbalanced classes, focusing on specific techniques for increasing machine learning performance, but not on how rarity impacts a current, real-world application domain (e.g. Medicare fraud detection). Additionally, these related studies primarily use smaller datasets from public repositories.² In this section, we focus on studies investigating class imbalance, discussing methods and, more specifically, what these related works consider imbalanced data.

² UCI Machine Learning Repository – https://archive.ics.uci.edu/ml/index.php.

There are several papers employing different methods to detect positive classes using relatively small datasets. In a study by Lin et al. [59], the authors use the Multivariate Imputation by Chained Equations (MICE) algorithm with optimal threshold based on posterior probabilities. Several datasets are used, with the largest having 20,000 instances and the lowest percentage of positive class instances being 9.3%. An approach using emerging patterns and decision trees to classify rare cases is presented in [7]. The authors improve on the decision tree model by biasing the decisions toward the rare class. They validate their results using five different datasets which are not specifically described in the paper, thus no information is available on what constituted rarity. Zhang et al. [84] propose a k Rare-class Nearest Neighbor (KRNN) method, which directly adjusts the induction bias in traditional KNN. They leverage 14 small datasets, with the largest having 13,302 instances and 3.51% being the smallest percentage of positive class instances. An artificial dataset is also used with varying positive class percentages, relative to the full dataset, from 50% to 5.88%, maintaining 50 positive class instances. Li et al. [58] use adaptive swarm-balancing algorithms to predict events using clinical healthcare data and claim their study incorporates big data. However, the largest dataset only has 47,781 instances with 50 (0.1%) positive class instances. The data used in these studies exhibit severe class imbalance but with datasets that we consider relatively small and not big data. Furthermore, beyond relying on smaller datasets, only [84] among these studies generates artificial cases to demonstrate the impacts of severe class imbalance on classification performance. Another study by Seiffert et al. [72] specifically assesses the effect of data size and class distribution on classification. The data samples go down to 0.1% positive class percentage, relative to the full data size. Seiffert et al. and the research in [74] are the only two studies that generate artificial cases to demonstrate the impacts of severe class imbalance, albeit with smaller datasets.

Big data and class imbalance is the subject of a paper by Fernández et al. [40]. The authors present an overview and discussion of related works on class imbalance and big data, providing a preliminary experiment comparing three traditional approaches to class imbalance applied to big data using the MapReduce framework with two datasets. Two imbalanced sources, derived from the ECBDL14 dataset, are used in their experiments. These two datasets consist of 12 million instances and 600,000 instances, respectively, both with 98:2 (majority:minority) class ratios. In a study by Rastogi et al. [69], the authors also address the issue of class imbalance using the ECBDL14 dataset. They use Locality Sensitive Hashing to identify nearest neighbors of the samples and then use SMOTE to generate random samples for the minority class employing the Apache Spark framework. The authors use an ECBDL14 data subset which has 2.89 million instances and a 98.3:1.7 class ratio. Zhai et al. [83] use an ensemble of extreme learning machines, compared with three SMOTE variants, to address class imbalance. They also employ a MapReduce framework with several standard datasets (up to 335,910 instances) and an artificially generated dataset with 150 (0.05%) of 312,191 instances. The studies in [40, 69] incorporate big data but do not investigate or specifically address rarity, nor do they provide varying class ratios to assess the impacts of a reduced positive class.

Three studies that focus on infrequent events using big data sources are Tayal et al. [74], Maalouf et al. [60], and Chai et al. [23]. In one study, Tayal et al. employ a RankRC algorithm, which is a nonlinear kernel-based classification method. They compare their method to several SVM-based approaches using both artificially generated datasets, with 12,000 data points from 10% to 40% class ratios, and real-world datasets. The largest real-world dataset used is from KDD Cup 1999, from which the authors used a subset of 812,808 instances with 0.098% positive class instances. The RankRC method performed better than the SVM methods and was more efficient in time and space requirements. Another study, by Maalouf et al., leverages the KDD Cup 1999 dataset, along with six others, to study the issue of severe class imbalance. Two of the datasets have a 0.34% positive class percentage (which is the lowest in the study), with one of the datasets having 304,814 instances. The authors present the truncated Newton method in prior correction logistic regression with a regularization term added to improve performance. Chai et al. examine the feasibility of using statistical text classification, with Logistic Regression, to automatically identify health information technology incidents using a manufacturer and user facility device experience database. They use a subset of the incident data with 570,272 instances and 1,534 positive class instances. Both a balanced (50% positive class) dataset and a so-called ‘stratified’ dataset (0.297% positive class) are generated from the subset of incident reports. Each of these studies incorporates some degree of what could be considered big data, but none focuses on or assesses the impacts of rarity using variations of the positive class, to include the original datasets, showing trends associated with increasing levels of rarity. These studies do illustrate some of the impacts of severe class imbalance with larger datasets, but do not focus on rarity or use particularly relevant big datasets. In contrast to our study, there are no papers that consider big Medicare fraud data and class rarity.

3. Medicare fraud and abuse

Medicare fraud, waste, and abuse³ can take many forms but, in nearly all cases, they result in some form of monetary loss. Some examples of fraud and abuse involve billing Medicare for appointments the patient failed to keep, services rendered that were more complex than those actually performed, unnecessary medical services, submitting excessive charges for services, drugs, or supplies, and misusing claims codes (e.g. upcoding or unbundling). Aside from these typical fraud and abuse descriptions in Medicare, improper payments can also indicate possible fraud or abuse. The term improper payments refers to payments made by the government to the wrong person, in the wrong amount, or for the wrong reason [1]. Thus, finding improper payments could be a way to detect possible fraud and abuse activities. However, it is important to note that not all improper payments are considered fraud and abuse; some are instead related to clerical or bookkeeping errors. Figure 1 depicts the scope of Medicare losses from simple errors to fraud.⁴ Additional information on Medicare-related fraud, waste, and abuse can be found in [19, 28, 41].

³ https://www.cms.gov/Outreach-and-Education/Medicare-Learning-Network-MLN/MLNProducts/Downloads/Fraud-Abuse-Products.pdf.

⁴ https://www.girardgibbs.com/whistleblower/healthcare-fraud/medicare/.

One source of real-world healthcare provider fraud is the LEIE. This dataset was established and is maintained monthly by the Office of Inspector General (OIG) [66] in accordance with Sections 1128 and 1156 of the Social Security Act [65]. The OIG has authority to exclude individuals and entities from federally funded healthcare programs, such as Medicare, for a given period of time. The LEIE contains the reason for exclusion, date of exclusion, and reinstate/waiver date for all current physicians found unsuited to practice medicine. It is important to point out that the LEIE is aggregated at the provider-level (i.e. a single recorded exclusion per provider by NPI) and does not have specific information regarding procedures, drugs, or equipment related to fraudulent activities. As seen in Table 1, for our study, we selected the providers excluded for mandatory, not permissive, exclusions.⁵ We use these excluded providers as fraud labels in our final datasets [15].

⁵ https://oig.hhs.gov/exclusions/authorities.asp.

Table 1

LEIE mandatory exclusion rules

Rule number	Description	Exclusion period
1128(a)(1)	Conviction of program-related crimes.	5 years
1128(a)(2)	Conviction for patient abuse or neglect.	5 years
1128(a)(3)	Felony conviction due to healthcare fraud.	5 years
1128(b)(4)	License revocation or suspension.	5 years
1128(c)(3)(g)(i)	Conviction of 2 mandatory offenses.	10 years
1128(c)(3)(g)(ii)	Conviction of 3+ mandatory offenses.	Permanent

Figure 1.

High-level view of fraud, waste, and abuse.

4. Medicare data

In this section, we summarize the Medicare datasets and data processing and engineering steps. The Medicare datasets are made publicly available from the Centers for Medicare and Medicaid Services (CMS) [4, 26]. Note that CMS is the Federal agency within the U.S. Department of Health and Human Services that administers Medicare, Medicaid, and several other health-related programs. Each dataset is derived from administrative claims data for Medicare beneficiaries enrolled in the Fee-For-Service program, with all claims records being recorded after payments are made [30, 31, 32]. Because these are final payments, we consider the Part B, Part D, and DMEPOS datasets to be appropriately cleansed and correct. The Part B dataset includes claims information for each procedure a physician/provider performs within a given year. The Part D dataset provides claims information pertaining to the prescription drugs administered under the Medicare Part D Prescription Drug Program within a given year. Finally, the DMEPOS dataset includes claims for medical equipment, prosthetics, orthotics, and supplies that physicians/providers referred patients to for purchase or rent from a supplier within a given year. In addition to the three aforementioned Medicare datasets, we create a combined dataset incorporating information from all three Medicare datasets. Our assumption is that there is no reliable way to know within which part of Medicare a physician/provider has or will commit fraud. Therefore, joining the Part B, Part D, and DMEPOS datasets can potentially better represent a provider’s claims, from procedures and drugs to equipment. This is because the combined dataset has a larger number of features from which machine learning algorithms can detect fraud.

Table 2
Selected Medicare dataset features

Dataset	Feature	Description	Type
Part B	npi	Unique provider identification number	Categorical
	provider_type	Medical provider’s specialty (or practice)	Categorical
	nppes_provider_gender	Provider’s gender	Categorical
	line_srvc_cnt	Number of procedures/services the provider performed	Numerical
	bene_unique_cnt	Number of distinct Medicare beneficiaries receiving a service	Numerical
	bene_day_srvc_cnt	Number of distinct Medicare beneficiary/per day services	Numerical
	average_submitted_chrg_amt	Average of the charges that the provider submitted for a service	Numerical
	average_medicare_payment_amt	Average payment made to a provider per claim for a service	Numerical
Part D	npi	Unique provider identification number	Categorical
	specialty_description	Medical provider’s specialty (or practice)	Categorical
	bene_count	Number of distinct Medicare beneficiaries receiving the drug	Numerical
	total_claim_count	Number of drugs the provider administered	Numerical
	total_30_day_fill_count	Number of standardized 30-day fills	Numerical
	total_day_supply	Number of days’ supply	Numerical
	total_drug_cost	Cost paid for all associated claims	Numerical
DMEPOS	referring_npi	Unique provider identification number	Categorical
	referring_provider_type	Medical provider’s specialty (or practice)	Categorical
	referring_provider_gender	Provider’s gender	Categorical
	number_of_suppliers	Number of suppliers used by provider	Numerical
	number_of_supplier_beneficiaries	Number of beneficiaries associated with the supplier	Numerical
	number_of_supplier_claims	Number of claims submitted by a supplier from a referring order	Numerical
	number_of_supplier_services	Number of services/products rendered by a supplier	Numerical
	avg_supplier_submitted_charge	Average payment submitted by a supplier	Numerical
	avg_supplier_medicare_pmt_amt	Average payment awarded to suppliers	Numerical
All	exclusion	Fraud labels mapped from the LEIE dataset	Categorical

We combined each of the individual years (2012 to 2015) of Medicare data, for each dataset, by appending each annual dataset to each other, across matching features. Some features, such as the standardized payments and standard deviation values, are not included since they do not appear in all of the Medicare years. We select features, listed in Table 2, that are readily usable by most machine learning models. However, we exclude features used for identification purposes (such as NPI [27]) or for filtering the data, e.g. for Medicare participation or equipment rentals. Redundant features, like HCPCS [29] descriptions (not the actual procedure codes) and others indicating demographic information, such as provider names, credentials, and residence, are also not used. Additionally, we exclude features that contained mostly missing or constant values, as these can reduce classification performance. Details on all of the available Medicare features can be found in the “Public Use File: A Methodological Overview” documents, for each respective dataset, available at CMS.⁶

⁶ https://www.cms.gov/Research-Statistics-Data-and-Systems/Statistics-Trends-and-Reports/Medicare-Provider-Charge-Data/index.html.
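
The step of appending annual files "across matching features" can be illustrated with a minimal sketch. It assumes each year is a list of dictionary records and keeps only the features present in every year; the function name `append_years` and the toy column `stdev_amt` are hypothetical and stand in for features, such as the standard deviation values, that do not appear in all years.

```python
from typing import Dict, List

Row = Dict[str, object]


def append_years(yearly: Dict[int, List[Row]]) -> List[Row]:
    """Stack yearly record lists, keeping only features present in every record."""
    shared = None
    for rows in yearly.values():
        for row in rows:
            shared = set(row) if shared is None else shared & set(row)
    shared_names = sorted(shared or [])

    combined = []
    for year in sorted(yearly):
        for row in yearly[year]:
            record = {name: row[name] for name in shared_names}
            record["year"] = year  # keep the source year for later label mapping
            combined.append(record)
    return combined


# Toy records: the hypothetical 'stdev_amt' column exists only in 2013, so it is dropped.
toy = {
    2012: [{"npi": "A", "line_srvc_cnt": 10}],
    2013: [{"npi": "A", "line_srvc_cnt": 12, "stdev_amt": 1.5}],
}
combined = append_years(toy)
```

Retaining the year on each record matters for the later merge with the LEIE, which is keyed by both NPI and year.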

Because the LEIE dataset only provides information on excluded providers and not on particular procedures performed or medical specialties, we aggregate the Medicare data at the provider- or NPI-level to account for this discrepancy and then map LEIE exclusion labels to each Medicare dataset. Note that currently there is no known publicly available data source with fraud labels by provider for each procedure performed. We merge the Medicare datasets and the LEIE providers by NPI and year, generating fraud or non-fraud labels. Our labeling process includes both the exclusion period and the period prior to the start of the recorded exclusion. The rationale for keeping the former is that claims made during the exclusion period are improper payments and could be considered fraudulent per the federal False Claims Act (FCA) [33]. The latter could indicate fraudulent activities leading up to that provider being put on the LEIE, most of which stem from criminal convictions, patient abuse or neglect, or revoked licenses. Additional information on our Medicare data processing can be found in [15, 46].
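
A rough sketch of the aggregation and label mapping follows. It is not the authors' implementation: the use of a simple sum as the aggregate, the helper names, and the rule that a provider-year is labeled fraud when the year is no later than a hypothetical exclusion end year (covering both the exclusion period and the years before it) are all assumptions made for illustration.

```python
from collections import defaultdict


def aggregate_by_provider_year(claims):
    """Sum one numeric feature per (npi, year) pair (assumed aggregate)."""
    totals = defaultdict(float)
    for claim in claims:
        totals[(claim["npi"], claim["year"])] += claim["line_srvc_cnt"]
    return dict(totals)


def map_fraud_labels(provider_years, exclusion_end_year):
    """Label 1 (fraud) if the provider is excluded and the year is on or before
    the assumed end of the exclusion period; otherwise label 0 (non-fraud)."""
    labels = {}
    for npi, year in provider_years:
        end_year = exclusion_end_year.get(npi)
        labels[(npi, year)] = int(end_year is not None and year <= end_year)
    return labels


# Toy claims and a toy LEIE lookup (NPI -> hypothetical exclusion end year).
claims = [
    {"npi": "1111", "year": 2012, "line_srvc_cnt": 10},
    {"npi": "1111", "year": 2012, "line_srvc_cnt": 5},
    {"npi": "1111", "year": 2015, "line_srvc_cnt": 7},
    {"npi": "2222", "year": 2012, "line_srvc_cnt": 3},
]
leie_end_year = {"1111": 2014}

provider_totals = aggregate_by_provider_year(claims)
fraud_labels = map_fraud_labels(provider_totals, leie_end_year)
```

Because labels exist only per provider, every aggregated row for an excluded provider within the labeled window receives the same fraud label, regardless of which procedures contributed to it.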

5. Class imbalance and rarity

Our study uses Medicare data and fraud detection to demonstrate the impact of class imbalance, specifically rare classes, on machine learning classification performance. As rarity is a severe form of class imbalance, we discuss both in this section. Class imbalanced data refers to the condition where the classes are not represented equally [77, 78]. The concerns associated with class imbalance are caused by these differences in the majority (negative) and minority (positive) classes. This imbalance creates a bias towards the majority class during the model training, which seeks both to minimize error (maximize accuracy) and provide predictive generalization, i.e. provide good predictions on new, unknown data. In more severe cases of class imbalance (e.g. < 1% positive class), the vast majority of instances belong to one class and a very small minority belong to the other class, which is typically the class of interest [51, 72]. Machine learning performance is degraded by increased class imbalance and rarity, particularly with big data. In this case, there is a disproportionately large number of majority class instances, increased data variability, and disjuncts. The concern with small disjuncts is related to the problem of within-class imbalances [45]. A learner frequently creates large disjuncts, which consider a significant portion of instances related to the target class. However, there are also underrepresented subconcepts with small portions of instances that are associated with small disjuncts. These small disjuncts cover few instances in a trained model and generally have much higher error rates in contrast to large disjuncts [80]. These errors can be compounded in big data sources with greater disparity between majority and minority classes. In order to assess methods to mitigate issues related to class rarity, we employ two basic sampling methods: random oversampling (ROS) and random undersampling (RUS). Oversampling is a method for balancing classes by adding instances to the minority class, whereas undersampling removes samples from the majority class. The main disadvantage of RUS is discarding potentially useful information, whereas ROS duplicates existing minority class instances and can decrease model generalization [17, 52, 78].

The original Medicare datasets are already severely imbalanced with the majority of provider claims being non-fraudulent. In order to assess the impact of increased rarity, we artificially generate three subsets, from the original Medicare data, with decreasing numbers of positive class instances (200, 100, and 50). These were selected based on preliminary results which indicated that this was a good representation of rare class subsets. For instance, using a subset of 1,000 positive class instances showed very similar results to the original dataset, and thus does not meaningfully contribute to the discussion on the impact of rarity. To get a good representation of samples from the positive class, we repeat the sampling process ten times per subset. For example, using the Part B dataset, given the subset with 100 positive class instances, we randomly select 100 instances out of the original 1,409, repeating this process 10 times, and average across the scores of those 10 to produce the final result. Table 3 lists each of the original datasets and subsets used in our experiment with counts of positive (fraud) and negative (non-fraud) classes.
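
The subset generation described above can be sketched as follows. This is a minimal illustration, assuming sampling without replacement and a fixed seed for reproducibility; the function name `rarity_subsets` and the shrunken toy negative class are not from the study.

```python
import random


def rarity_subsets(positives, negatives, n_positive, repeats=10, seed=0):
    """Draw `repeats` random subsets of positives, each paired with ALL negatives."""
    rng = random.Random(seed)
    subsets = []
    for _ in range(repeats):
        sampled = rng.sample(positives, n_positive)  # without replacement
        subsets.append((sampled, negatives))         # negative class is unchanged
    return subsets


# Toy stand-ins: 1,409 positive ids (the Part B count) and a small negative class.
positives = list(range(1_409))
negatives = list(range(1_409, 6_409))

subsets = rarity_subsets(positives, negatives, n_positive=100)
```

A model would then be trained and scored on each of the ten subsets, and the ten scores averaged to give the reported result for that rarity level.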

Table 3
Description of experiment datasets

Dataset	# Positive	# Negative	Total	% Positive	% Negative
Part B	1,409	3,691,146	3,692,555	0.038	99.962
	200	3,691,146	3,691,346	0.005	99.995
	100	3,691,146	3,691,246	0.003	99.997
	50	3,691,146	3,691,196	0.001	99.999
Part D	1,018	2,098,715	2,099,733	0.048	99.952
	200	2,098,715	2,098,915	0.010	99.990
	100	2,098,715	2,098,815	0.005	99.995
	50	2,098,715	2,098,765	0.002	99.998
DMEPOS	635	862,792	863,427	0.074	99.926
	200	862,792	862,992	0.023	99.977
	100	862,792	862,892	0.012	99.988
	50	862,792	862,842	0.006	99.994
Combined	473	759,267	759,740	0.062	99.938
	200	759,267	759,467	0.026	99.974
	100	759,267	759,367	0.013	99.987
	50	759,267	759,317	0.007	99.993
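
The percentage columns in Table 3 follow directly from the class counts. The short sketch below recomputes two Part B rows; the function name `class_percentages` is illustrative.

```python
def class_percentages(n_positive: int, n_negative: int):
    """Return (% positive, % negative) rounded to three decimals, as in Table 3."""
    total = n_positive + n_negative
    return round(100 * n_positive / total, 3), round(100 * n_negative / total, 3)


part_b_full = class_percentages(1_409, 3_691_146)  # original Part B dataset
part_b_50 = class_percentages(50, 3_691_146)       # rarest Part B subset
print(part_b_full, part_b_50)
```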

In addition to using the original datasets and the rarity subsets, we employ RUS and ROS with the following class distributions (majority:minority): 99:1, 90:10, 75:25, 65:35, and 50:50. These class distributions are applied to the original dataset, as well as the rare class subsets. Note that the original, non-sampled, datasets are denoted by ‘None’ or ‘Full’. The selected ratios were chosen because they provide a good range of class distributions from balanced (50:50) to highly imbalanced (99:1) compared to the original (non-sampled) datasets. We repeat the RUS and ROS processes 10 times per dataset, for each class distribution, to reduce bias due to poor random draws.
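
To make the sampling step concrete, the following sketch shows one way RUS and ROS could reach a target majority:minority distribution. It is an assumption-laden illustration rather than the study's Spark implementation: RUS keeps every positive and draws negatives without replacement, ROS keeps every negative and duplicates positives with replacement, and the function names are hypothetical.

```python
import random


def random_undersample(positives, negatives, majority, minority, rng):
    """RUS: keep all positives, keep only enough negatives to hit majority:minority."""
    n_keep = round(len(positives) * majority / minority)
    n_keep = min(n_keep, len(negatives))
    return positives, rng.sample(negatives, n_keep)


def random_oversample(positives, negatives, majority, minority, rng):
    """ROS: keep all negatives, duplicate positives until majority:minority is reached."""
    n_target = round(len(negatives) * minority / majority)
    n_extra = max(0, n_target - len(positives))
    return positives + rng.choices(positives, k=n_extra), negatives


rng = random.Random(0)
toy_pos = list(range(10))          # 10 positive instances
toy_neg = list(range(10, 1_010))   # 1,000 negative instances

rus_pos, rus_neg = random_undersample(toy_pos, toy_neg, 90, 10, rng)
ros_pos, ros_neg = random_oversample(toy_pos, toy_neg, 75, 25, rng)
```

With only 10 positives, RUS at 90:10 retains just 90 negatives, which shows how much majority-class information is discarded, while ROS at 75:25 repeats the same 10 positives many times over.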

6. Experimental design

In this section, we present the supervised machine learning models (also known as learners) and evaluation methods used to assess the impacts of rare class classification. Previous research indicates that supervised learning significantly outperforms unsupervised methods [10, 12]. Because our study includes big data, we use Apache Spark [62, 81], with the Machine Learning Library (MLlib) [63], to effectively handle these large dataset sizes to detect Medicare fraud. Apache Spark is a unified analytics engine for big data and machine learning that provides dramatically increased data processing speed compared to traditional methods or implementations using MapReduce approaches. MLlib is a scalable machine learning library built on top of Apache Spark.

6.1 Models

We use Logistic Regression (LR) [55], Random Forest (RF) [22], and Gradient Boosted Trees (GBT) [63]. Currently, Spark is limited to eight learners that could be used to classify fraud and non-fraud cases.⁷ Of these eight, we use RF and GBT in place of the simple decision tree model, and we exclude the multilayer perceptron and Naive Bayes models which, based on preliminary experiments, do not exhibit comparable detection performance. Note that the default configuration for each learner is assumed, unless otherwise stated.

⁷ https://spark.apache.org/docs/2.3.0/ml-classification-regression.html.

LR uses a sigmoidal, or logistic, function to generate values from [0, 1] that can be interpreted as class probabilities. LR is similar to linear regression but uses a different hypothesis class to predict class membership. The bound matrix parameter was set to match the shape of the data so the algorithm knows the number of classes and features the dataset contains. The bound vector size is equal to 1 for binomial regression, with no thresholds set for binary classification.

RF is an ensemble approach that builds multiple decision trees. The classification results are calculated by combining the results of the individual trees, typically using majority voting. RF generates random datasets via sampling with replacement to build each tree and selects features at each node automatically based on entropy and information gain. In this study, we set the number of trees to 100 and the max depth to 16. Additionally, the parameter that caches node IDs for each instance was set to true and the maximum memory parameter was set to 1024 MB in order to minimize training time. The setting that controls the number of features to consider for splits at each tree node was set to one-third, since this setting provided better results upon initial investigation. The maximum bins parameter, which is used for discretizing continuous features, was set to 2 since we use one-hot encoding on categorical variables.

GBT is an ensemble approach that trains each decision tree iteratively in order to minimize the loss determined by the algorithm’s loss function. During each iteration, the ensemble is used to predict the class for each training instance. The predicted values are compared with the actual values, allowing for the identification and correction of previously mislabeled instances. The parameter that caches node IDs for each instance was set to true, and the maximum memory parameter was set to 1024 MB to minimize training time.
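
The RF and GBT settings described above can be collected as keyword arguments. The sketch below assumes the PySpark ML API, where the corresponding parameter names are `numTrees`, `maxDepth`, `cacheNodeIds`, `maxMemoryInMB`, `featureSubsetStrategy`, and `maxBins`; the mapping from the prose to these names is our reading, and `build_learners` is a hypothetical helper.

```python
# Settings named in the text, expressed as PySpark keyword arguments.
# The mapping of prose descriptions to parameter names is an assumption.
RF_PARAMS = {
    "numTrees": 100,                      # number of trees
    "maxDepth": 16,                       # maximum tree depth
    "cacheNodeIds": True,                 # cache node IDs for each instance
    "maxMemoryInMB": 1024,                # maximum memory parameter
    "featureSubsetStrategy": "onethird",  # features considered per split
    "maxBins": 2,                         # bins for discretizing features
}

GBT_PARAMS = {
    "cacheNodeIds": True,
    "maxMemoryInMB": 1024,
}

def build_learners():
    """Instantiate the tree-based learners; requires a PySpark installation."""
    from pyspark.ml.classification import GBTClassifier, RandomForestClassifier
    return RandomForestClassifier(**RF_PARAMS), GBTClassifier(**GBT_PARAMS)
```

The import is placed inside `build_learners` so the settings can be inspected without Spark installed; all other parameters are left at their defaults, as stated in the text.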

6.2 Performance metric

Machine learning model performance is scored using the Area Under the Receiver Operating Characteristic (ROC) Curve (AUC) [20]. An ROC curve is commonly used to visualize the performance of a binary classifier. The ROC curve is generated by plotting the true positive rate (TPR), also called sensitivity ( $\frac{TP}{TP+FN}$ ), against the false positive rate (FPR), represented as ( $\frac{FP}{FP+TN}$ ), across all decision thresholds. The true positive rate is the proportion of actual positives (fraud) that are correctly identified as the positive class, and the false positive rate is the proportion of actual negatives (non-fraud) that are incorrectly classified as positive (equivalently, one minus the true negative rate, which measures the proportion of actual negatives that are correctly classified). The TPR and FPR relationship indicates a machine learning model’s ability to distinguish between positive and negative classes. The AUC is a concise measure of ROC curve performance with a single value that ranges from 0 to 1, where a perfect model has a score of 1 and 0.5 is the baseline score indicative of a random guess. Furthermore, AUC has been found to be effective for class imbalance [47].
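
To make the metric concrete, the sketch below computes TPR and FPR at a single threshold and the AUC through its rank interpretation: the probability that a randomly chosen positive instance receives a higher score than a randomly chosen negative one, with ties counted as one half. This pairwise form is chosen for readability and is not how Spark computes the metric; the function names are illustrative.

```python
def rates_at_threshold(labels, scores, threshold):
    """Return (TPR, FPR) when scores >= threshold are predicted as fraud (1)."""
    tp = sum(1 for y, s in zip(labels, scores) if y == 1 and s >= threshold)
    fn = sum(1 for y, s in zip(labels, scores) if y == 1 and s < threshold)
    fp = sum(1 for y, s in zip(labels, scores) if y == 0 and s >= threshold)
    tn = sum(1 for y, s in zip(labels, scores) if y == 0 and s < threshold)
    return tp / (tp + fn), fp / (fp + tn)

def auc(labels, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one instance of each class")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

labels = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]
print(rates_at_threshold(labels, scores, 0.5))  # (0.5, 0.0)
print(auc(labels, scores))                      # 0.75
```

With only a handful of positive instances, a single re-ranked fraud case changes a large share of the pairs, which is one way to see why AUC becomes more variable as rarity increases.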

6.3 Model evaluation

Model evaluation is done using $k$-fold cross-validation. The model is trained and tested $k$ times, each time being trained on $k-1$ folds and tested on the remaining fold. This ensures that all data are used in the classification. More specifically, we use stratified cross-validation, which tries to ensure that each class (i.e. fraud or non-fraud) is represented in each fold in approximately the same proportion as in the full dataset. For our experiments, we set $k=5$. We repeat this 5-fold cross-validation process 10 times for each learner and dataset pairing. The use of repeats helps to reduce bias due to bad random draws when creating the folds. Given that our study uses big data, using 5 folds, which is equivalent to using 80% of the data to train and 20% to validate the model, provides a low likelihood of repeating results across the folds, and has been used effectively in prior research [14]. The final performance result is the average AUC over all 10 repeats.
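
A minimal sketch of stratified fold assignment follows, assuming labels are held in a Python list. The helper `stratified_folds` is hypothetical and only illustrates the idea of dealing each class evenly across folds; it is not the Spark-based implementation used in the study.

```python
import random

def stratified_folds(labels, k=5, seed=0):
    """Assign instance indices to k folds, class by class.

    Indices of each class are shuffled and dealt round-robin, so every
    fold receives close to 1/k of each class.
    """
    rng = random.Random(seed)
    by_class = {}
    for idx, y in enumerate(labels):
        by_class.setdefault(y, []).append(idx)
    folds = [[] for _ in range(k)]
    for indices in by_class.values():
        rng.shuffle(indices)
        for position, idx in enumerate(indices):
            folds[position % k].append(idx)
    return folds

# Toy data: 50 fraud and 950 non-fraud instances.
labels = [1] * 50 + [0] * 950
folds = stratified_folds(labels, k=5, seed=0)

for test_fold in folds:
    held_out = set(test_fold)
    train = [i for i in range(len(labels)) if i not in held_out]
    # Train on `train` (80%), score AUC on `test_fold` (20%).

# Repeating with seeds 0..9 would give the 10 repeats described above.
```

With only 50 positive instances, each test fold contains just 10 fraud cases, so per-fold AUC is sensitive to individual instances; repeating the procedure with different seeds averages out part of that variation.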

6.4 Significance testing

In order to provide additional rigor around our AUC performance results, we use hypothesis testing to assess the statistical significance of the model performance results. Both ANalysis Of VAriance (ANOVA) [42] and post hoc analysis via Tukey’s Honestly Significant Difference (HSD) [75] tests are used in our study. ANOVA is a statistical test that determines whether the means of several groups (or factors) are equal. Tukey’s HSD test determines which factor means are significantly different from each other. This test compares all possible pairs of means using a method similar to a $t$-test, where statistically significant differences are assigned different letter combinations (e.g. group ‘a’ is significantly different from group ‘b’).
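
For readers unfamiliar with ANOVA, the sketch below computes the one-factor F statistic, the ratio of between-group to within-group variance, on small made-up groups. The function name is illustrative. Obtaining a p-value from the F distribution, and the subsequent Tukey’s HSD grouping, require statistical tables or a statistics library and are not shown.

```python
def one_way_anova_f(groups):
    """Return (F, df_between, df_within) for a one-factor ANOVA."""
    n_total = sum(len(g) for g in groups)
    grand_mean = sum(sum(g) for g in groups) / n_total
    means = [sum(g) / len(g) for g in groups]

    # Variation of the group means around the grand mean.
    ss_between = sum(len(g) * (m - grand_mean) ** 2
                     for g, m in zip(groups, means))
    # Variation of observations around their own group mean.
    ss_within = sum((x - m) ** 2 for g, m in zip(groups, means) for x in g)

    df_between = len(groups) - 1
    df_within = n_total - len(groups)
    f_stat = (ss_between / df_between) / (ss_within / df_within)
    return f_stat, df_between, df_within

# Made-up scores for three levels of a factor (not results from this study).
toy_groups = [[1.0, 2.0, 3.0], [2.0, 3.0, 4.0], [5.0, 6.0, 7.0]]
f_stat, df_b, df_w = one_way_anova_f(toy_groups)
print(f_stat, df_b, df_w)
```

A large F value relative to the F distribution with (df_between, df_within) degrees of freedom indicates that at least one group mean differs, which is what motivates the pairwise follow-up test.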

7. Results and discussion

To reiterate, our goal is to assess the impact of rarity in big data, leveraging a real-world use case in Medicare provider claims fraud detection. This Medicare fraud scenario provides a realistic picture of the effects of rare classes (e.g. fraud cases) on the detection performance of machine learning models. In Fig. 2, we depict the average AUC results for each dataset, model, positive class count, and sampling method. The table associated with this figure, providing the average AUC scores, can be found in Table 5 in the Appendix. In this figure, we employ notched box plots to visually compare groups and assess any significant differences therein. For example, if the notches of two boxes do not overlap, this suggests that there are significant differences between these results.

Figure 2.

Summary of Medicare fraud detection model performance.

Overall, the trend shows decreasing model performance and increasing AUC score variability, across all models and datasets, as the number of positive class instances decreases. These results are intuitive, as we would expect a machine learning model to struggle as the number of positive class instances decreases. In other words, the difficulty in discriminating fraud cases leads to lower, more wide-ranging AUC scores; compare, for example, the results for Part B at 1,409 versus 50 positive class instances. This degradation is due to the strong influence of the non-fraud instances that make up the preponderance of each dataset. Moreover, model performance indicates a failure to correctly discern small disjuncts, compounded by the complication of noise inherent in big data. Examining the results further, because there are so few positive cases to begin with, the models tend to have a very low FPR (few non-fraud observations are classified as fraud) but a very high false negative rate (FNR), implying that most of the fraud observations are classified as non-fraud. A good machine learning model has both a low FPR and a low FNR. Some of the variation in AUC scores, especially for the smaller positive class counts, stems from the TPR: the numbers of true positives (actual fraud cases classified as fraud) and false negatives (fraud classified as non-fraud) are highly variable because the total number of positive class instances is so small. In general, the poor model performance due to rarity indicates the difficulty, and inconsistency, in classifying rare positive class cases. There are simply not enough data points to discriminate between classes as the positive class instances decrease. Yet, the results using all known fraud cases show promise, indicating good overall fraud detection performance.

In an attempt to improve fraud detection performance, we performed both RUS and ROS data sampling. Figure 2 also shows these results and, upon inspection, data sampling appears to provide some benefit. While patterns and trends can be derived from this figure, we employ hypothesis testing to determine the significance of these results. We performed a one-factor ANOVA test on Sampling Method (across all models, datasets, and positive class counts), with this factor being significant at the 95% confidence level. From this, a Tukey’s HSD test was performed to obtain the significance between no sampling, RUS, and ROS performance results. Table 4 shows that RUS, which is in group ‘a’, significantly increases overall fraud detection performance versus no sampling, whereas ROS decreases performance. It is important to note that these changes in performance can vary based on the model used. In our case, RUS performs better using the tree-based learners (RF and GBT), which could be due to RUS being better able to represent misclassification costs and class distribution [38, 52]. However, LR shows more favorable results with ROS. This may be due to the application of L2 regularization (the penalty also used in Ridge Regression) to the LR loss function, which penalizes large coefficients and improves generalization performance, making LR fairly robust to noise and overfitting (both possibilities due to the duplicated observations in ROS).

Table 4

Tukey’s HSD test results for sampling methods

Factor	Level	AUC	std	r	Min	Max	Q25	Q50	Q75	Group
Sampling method	RUS	0.7132	0.0843	12000	0.3221	0.9333	0.6631	0.7266	0.7751	a
Sampling method	None	0.7052	0.0872	2400	0.3229	0.9095	0.6517	0.7172	0.7731	b
Sampling method	ROS	0.6908	0.0880	12000	0.2928	0.9264	0.6352	0.6978	0.7607	c

Figure 3.

Tukey’s HSD test results for group ‘a’.

Given that models built with RUS-generated datasets outperform both the ROS and non-sampled datasets, we perform Tukey’s HSD tests for each positive class count value to assess any significant differences across class ratios and datasets. We focus on examining the group ‘a’ results over all models. Investigating the results for only this group provides some indication as to the best performing sampling methods for the majority of rare positive class cases. Note that there can be multiple combinations within a group designation. For example, for the Combined dataset with 100 and 200 positive class counts, both RUS and ROS with a class ratio of 90:10 are in group ‘a’. Figure 3 depicts the average AUC scores, per dataset, for sampling methods and associated class ratios. This shows that RUS with a 90:10 class ratio consistently has the majority of group ‘a’ results with the original and rare subsets across Medicare datasets, irrespective of the model used. ROS exhibits better performance with more balanced datasets, particularly the 50:50 class ratio. From these results, RUS is effective in increasing model performance, and we show that retaining a reasonable percentage of the majority class and all of the minority class provides good results without losing too much information from the original dataset, even though the undersampled dataset remains imbalanced. Additionally, due to the high level of imbalance present in these datasets, there is a dramatic decrease in the number of non-fraudulent cases with the 90:10 class ratio, so building these models requires significantly less time and fewer computing resources. The complete Tukey’s HSD results over the positive classes are listed in Tables 6 and 7 in the Appendix.

The fraud detection results using real-world Medicare big datasets convey the difficulties in implementing machine learning solutions. The issues around rarity are clearly demonstrated in relation to a set of known fraud labels. We demonstrated that data sampling, in particular RUS, can be used to increase rare class classification performance. Moreover, RUS has been shown to perform well even with class noise [77]. Yet, there is a point at which there are simply not enough positive class instances to detect any discernible patterns to partition positive and negative class instances. At this point, the only reasonable course of action would be to obtain more quality data. Even though we were able to show the effects of rarity in big data on machine learning and the advantageous usage of data sampling, as with any experimental study, we would be remiss not to mention possible research limitations. The main limitation revolves around the data employed for the use case in our study. More specifically, the fraud label information in the LEIE is not all-inclusive: 38% of providers with fraud convictions continue to practice medicine, and 21% were not suspended from medical practice despite their convictions [67]. This implies that there are potentially unlabeled or mislabeled providers, which could affect the ground truth and model performance evaluations with rare positive class labels.

8. Conclusion and future work

The problems with severe class imbalance can be detrimental to predictions from machine learning models. In particular, the effects of rare classes exacerbate this issue. With the ubiquitous nature of big data in today’s society, confronting the issues from rare classes becomes increasingly important in areas such as fraud detection for which there are limited known cases. Yet, there are few studies that investigate or attempt to mitigate these concerns regarding big data and class rarity. In our study, we leverage publicly available Medicare claims data (Part B, Part D, DMEPOS, and a combined dataset) with labels from the LEIE, to study the effects of rarity in big data focusing on Medicare fraud detection. We describe our unique methods for data processing and engineering, which are critical in creating usable Medicare datasets for the creation of successful fraud detection models. We use three different machine learning models to detect fraud in each Medicare dataset and assess performance using AUC scores. Three artificially generated rare data subsets are created from each of the original Medicare datasets. From our results, we show that performance using the original datasets, with all known fraud labels, is reasonable across learners and datasets. The application of rarity, however, significantly decreases model performance in nearly every scenario. This is an expected result, so in order to lessen some of this performance degradation, we apply two data sampling approaches. RUS is demonstrated to be significantly better than using the non-sampled original datasets and rarity subsets, with ROS exhibiting the worst overall fraud detection performance. Therefore, given this real-world scenario, rarity can be alleviated by employing RUS, in particular with a 90:10 class ratio. Lastly, retaining more of the negative class (non-fraud) instances, thus reducing information loss, appears to produce the best overall results when combating the effects of rarity, over most of the Medicare datasets. Future work will involve other viable sources of big data from healthcare insurance fraud. We also intend to assess the generalization of our results to other domains, with the use of additional performance metrics.

Acknowledgments

The authors would like to thank the anonymous reviewers for the constructive evaluation of this paper and also the various members of the Data Mining and Machine Learning Laboratory at Florida Atlantic University. We also acknowledge partial support by the NSF (CNS-1427536). Opinions, findings, conclusions, or recommendations in this paper are the authors’ and do not reflect the views of the NSF.

Conflict of interest

The authors declare that they have no conflict of interest.

Appendix

Table 5

Average AUC scores for all experiment combinations

Learner	Sampling	Class	PartB				Combined				PartD				DMEPOS
	method	ratio	50	100	200	1409	50	100	200	473	50	100	200	1018	50	100	200	635
LR	None	[Full]	0.7368	0.7477	0.7737	0.8052	0.6682	0.7550	0.7911	0.8155	0.7000	0.7292	0.7439	0.7816	0.6404	0.6890	0.7068	0.7406
	ROS	[50:50]	0.7137	0.7241	0.7653	0.8191	0.6676	0.7171	0.7824	0.8155	0.6756	0.7062	0.7330	0.7874	0.6198	0.6779	0.7060	0.7329
		[65:35]	0.7305	0.7286	0.7698	0.8187	0.6745	0.7202	0.7882	0.7904	0.6731	0.7132	0.7320	0.7882	0.6102	0.6816	0.7098	0.7169
		[75:25]	0.7273	0.7339	0.7664	0.8189	0.6834	0.7261	0.7889	0.7530	0.6752	0.7046	0.7319	0.7880	0.6212	0.6783	0.7060	0.7035
		[90:10]	0.7168	0.7408	0.7660	0.8143	0.6969	0.7248	0.7872	0.6831	0.6873	0.7062	0.7403	0.7871	0.6274	0.6754	0.7119	0.6509
		[99:1]	0.7165	0.7324	0.7591	0.8094	0.6779	0.7302	0.7872	0.6621	0.6874	0.7122	0.7364	0.7824	0.6290	0.6790	0.7063	0.5941
	RUS	[50:50]	0.6642	0.6859	0.7469	0.8141	0.6766	0.6811	0.7331	0.7941	0.6005	0.6615	0.7190	0.7756	0.5890	0.6099	0.6836	0.7222
		[65:35]	0.6723	0.7340	0.7659	0.8181	0.7044	0.6945	0.7552	0.8100	0.6509	0.6936	0.7470	0.7822	0.6169	0.6336	0.7011	0.7349
		[75:25]	0.7032	0.7316	0.7731	0.8169	0.6969	0.7072	0.7673	0.8155	0.6614	0.7056	0.7517	0.7854	0.6396	0.6347	0.7025	0.7372
		[90:10]	0.6840	0.7542	0.7813	0.8188	0.7041	0.7239	0.7784	0.8187	0.6852	0.7206	0.7685	0.7866	0.6306	0.6468	0.7129	0.7442
		[99:1]	0.7068	0.7587	0.7799	0.8124	0.7242	0.7247	0.7844	0.8201	0.7016	0.7195	0.7770	0.7849	0.6381	0.6519	0.7181	0.7409
GBT	None	[Full]	0.6906	0.7344	0.7649	0.7957	0.6833	0.7321	0.7674	0.7905	0.6398	0.6948	0.7178	0.7485	0.6413	0.6711	0.6954	0.7313
	ROS	[50:50]	0.6507	0.6757	0.7342	0.8162	0.6530	0.7079	0.7620	0.8070	0.6433	0.6446	0.6639	0.7669	0.6040	0.6285	0.6834	0.7163
		[65:35]	0.6344	0.6832	0.7343	0.8170	0.6213	0.7117	0.7551	0.7642	0.6120	0.6408	0.6608	0.7651	0.5904	0.6198	0.6900	0.6834
		[75:25]	0.6403	0.6642	0.7309	0.8204	0.5952	0.6886	0.7665	0.7256	0.6055	0.6268	0.6684	0.7662	0.6110	0.6357	0.6885	0.6639
		[90:10]	0.6728	0.7017	0.7561	0.8204	0.6362	0.7223	0.7746	0.6280	0.6139	0.6437	0.6853	0.7704	0.6150	0.6585	0.6947	0.6138
		[99:1]	0.6941	0.7316	0.7777	0.8033	0.6811	0.7313	0.7739	0.7433	0.6379	0.6731	0.7014	0.7557	0.6102	0.6703	0.6990	0.6369
	RUS	[50:50]	0.6284	0.6823	0.7350	0.8050	0.6119	0.6603	0.7150	0.7759	0.6252	0.6340	0.6785	0.7451	0.5751	0.5895	0.6482	0.7060
		[65:35]	0.6900	0.7141	0.7440	0.8143	0.6320	0.6851	0.7388	0.7913	0.6139	0.6477	0.6869	0.7600	0.5988	0.6059	0.6684	0.7209
		[75:25]	0.6450	0.7080	0.7544	0.8195	0.6310	0.6847	0.7474	0.8040	0.6073	0.6635	0.7037	0.7654	0.5933	0.6148	0.6778	0.7334
		[90:10]	0.6639	0.7341	0.7669	0.8206	0.6817	0.7185	0.7640	0.8167	0.6515	0.6794	0.7184	0.7676	0.6051	0.6417	0.6998	0.7378
		[99:1]	0.7129	0.7537	0.7840	0.8037	0.6843	0.7261	0.7668	0.8037	0.6422	0.6910	0.7376	0.7573	0.6395	0.6599	0.7087	0.7359
RF	None	[Full]	0.6059	0.6375	0.7214	0.7960	0.5446	0.6239	0.7177	0.7938	0.5687	0.6182	0.6387	0.7089	0.5642	0.6253	0.6555	0.7076
	ROS	[50:50]	0.6698	0.6916	0.7230	0.7824	0.6494	0.6854	0.7489	0.7931	0.6065	0.6101	0.6333	0.6605	0.5852	0.6296	0.6679	0.6817
		[65:35]	0.6447	0.6809	0.6981	0.7697	0.6055	0.6670	0.7378	0.7699	0.5975	0.5894	0.6151	0.6598	0.5619	0.6158	0.6425	0.6649
		[75:25]	0.6319	0.6598	0.6882	0.7691	0.5864	0.6538	0.7292	0.7573	0.5969	0.5901	0.6079	0.6582	0.5313	0.6046	0.6416	0.6593
		[90:10]	0.6203	0.6515	0.6835	0.7712	0.5751	0.6294	0.7066	0.7564	0.5984	0.5793	0.6043	0.6789	0.5650	0.6119	0.6437	0.6828
		[99:1]	0.6272	0.6692	0.6917	0.7917	0.5686	0.6517	0.7345	0.7754	0.5826	0.5894	0.6272	0.7059	0.5652	0.6206	0.6508	0.7022
	RUS	[50:50]	0.6883	0.7225	0.7519	0.8150	0.6746	0.7079	0.7519	0.7955	0.6228	0.6614	0.6886	0.7409	0.6033	0.6418	0.6732	0.7238
		[65:35]	0.6827	0.7276	0.7536	0.8216	0.6890	0.7222	0.7594	0.8062	0.6282	0.6806	0.7045	0.7490	0.6179	0.6498	0.6905	0.7239
		[75:25]	0.7101	0.7385	0.7633	0.8270	0.6935	0.7332	0.7653	0.8150	0.6539	0.6641	0.7156	0.7584	0.6212	0.6590	0.6951	0.7289
		[90:10]	0.6827	0.7677	0.7682	0.8301	0.7071	0.7416	0.7792	0.8279	0.6385	0.6862	0.7152	0.7586	0.6399	0.6542	0.6959	0.7377
		[99:1]	0.6513	0.7220	0.7605	0.8159	0.6620	0.7010	0.7529	0.8152	0.6396	0.6666	0.6965	0.7371	0.5969	0.6488	0.6812	0.7224

Table 6

Tukey’s HSD results for Part B and Combined datasets

Dataset	Pos class	Sampling	Ratio	AUC	Group	Dataset	Pos class	Sampling	Ratio	AUC	Group
Combined	473	RUS	[90:10]	0.8211	a	Part B	1409	RUS	[90:10]	0.8232	a
Combined	473	ROS	[50:50]	0.8052	a	Part B	1409	ROS	[50:50]	0.8059	a
Combined	473	None	[Full]	0.7999	a	Part B	200	RUS	[99:1]	0.7748	a
Combined	200	RUS	[90:10]	0.7739	a	Part B	200	RUS	[90:10]	0.7721	a
Combined	200	ROS	[99:1]	0.7652	a	Part B	200	None	[Full]	0.7533	a
Combined	200	ROS	[50:50]	0.7644	a	Part B	100	RUS	[90:10]	0.7520	a
Combined	200	ROS	[75:25]	0.7615	a	Part B	100	ROS	[99:1]	0.7111	a
Combined	200	ROS	[65:35]	0.7604	a	Part B	50	RUS	[99:1]	0.6903	a
Combined	200	None	[Full]	0.7587	a	Part B	50	ROS	[99:1]	0.6792	a
Combined	200	ROS	[90:10]	0.7561	a	Part B	50	ROS	[50:50]	0.6781	a
Combined	100	RUS	[90:10]	0.7280	a	Part B	50	None	[Full]	0.6778	a
Combined	100	ROS	[99:1]	0.7044	a	Part B	50	ROS	[90:10]	0.6700	a
Combined	100	None	[Full]	0.7037	a	Part B	50	ROS	[65:35]	0.6699	a
Combined	100	ROS	[50:50]	0.7035	a	Part B	50	ROS	[75:25]	0.6665	a
Combined	100	ROS	[65:35]	0.6996	a	Part B	1409	RUS	[75:25]	0.8211	ab
Combined	50	RUS	[90:10]	0.6977	a	Part B	1409	ROS	[75:25]	0.8028	ab
Combined	100	ROS	[90:10]	0.6922	a	Part B	200	RUS	[75:25]	0.7636	ab
Combined	50	RUS	[99:1]	0.6902	a	Part B	100	RUS	[99:1]	0.7448	ab
Combined	100	ROS	[75:25]	0.6895	a	Part B	200	ROS	[99:1]	0.7428	ab
Combined	50	ROS	[50:50]	0.6567	a	Part B	100	None	[Full]	0.7065	ab
Combined	200	RUS	[99:1]	0.7680	ab	Part B	100	ROS	[90:10]	0.6980	ab
Combined	100	RUS	[99:1]	0.7173	ab	Part B	100	ROS	[65:35]	0.6976	ab
Combined	100	RUS	[75:25]	0.7084	ab	Part B	100	ROS	[50:50]	0.6971	ab
Combined	50	RUS	[65:35]	0.6751	ab	Part B	50	RUS	[75:25]	0.6861	ab
Combined	50	RUS	[75:25]	0.6738	ab	Part B	50	RUS	[65:35]	0.6817	ab
Combined	50	ROS	[99:1]	0.6425	ab	Part B	50	None	[Full]	0.6778	ab
Combined	50	ROS	[90:10]	0.6361	ab	Part B	50	RUS	[90:10]	0.6769	ab
Combined	50	ROS	[65:35]	0.6338	ab	Part B	200	ROS	[50:50]	0.7408	abc
Combined	50	None	[Full]	0.6321	ab	Part B	1409	RUS	[65:35]	0.8180	b
Combined	473	RUS	[99:1]	0.8130	b	Part B	1409	ROS	[90:10]	0.8020	b
Combined	473	RUS	[75:25]	0.8115	b	Part B	1409	ROS	[65:35]	0.8018	b
Combined	473	ROS	[65:35]	0.7748	b	Part B	1409	ROS	[99:1]	0.8015	b
Combined	100	None	[Full]	0.7037	b	Part B	1409	None	[Full]	0.7990	b
Combined	50	ROS	[75:25]	0.6217	b	Part B	100	ROS	[75:25]	0.6860	b
Combined	200	RUS	[75:25]	0.7600	bc	Part B	50	RUS	[50:50]	0.6603	b
Combined	200	None	[Full]	0.7587	bc	Part B	200	RUS	[65:35]	0.7545	bc
Combined	100	RUS	[65:35]	0.7006	bc	Part B	200	None	[Full]	0.7533	bc
Combined	50	RUS	[50:50]	0.6544	bc	Part B	200	ROS	[90:10]	0.7352	bc
Combined	473	RUS	[65:35]	0.8025	c	Part B	200	ROS	[65:35]	0.7341	bc
Combined	473	None	[Full]	0.7999	c	Part B	100	RUS	[75:25]	0.7260	bc
Combined	200	RUS	[65:35]	0.7511	c	Part B	100	RUS	[65:35]	0.7252	bc
Combined	473	ROS	[75:25]	0.7453	c	Part B	1409	RUS	[50:50]	0.8114	c
Combined	100	RUS	[50:50]	0.6831	c	Part B	1409	RUS	[99:1]	0.8107	c
Combined	50	None	[Full]	0.6321	c	Part B	200	RUS	[50:50]	0.7446	c
Combined	473	RUS	[50:50]	0.7885	d	Part B	200	ROS	[75:25]	0.7285	c
Combined	200	RUS	[50:50]	0.7333	d	Part B	100	None	[Full]	0.7065	cd
Combined	473	ROS	[99:1]	0.7269	d	Part B	1409	None	[Full]	0.7990	d
Combined	473	ROS	[90:10]	0.6892	e	Part B	100	RUS	[50:50]	0.6969	d

Table 7

Tukey’s HSD results for Part D and DMEPOS datasets

Dataset	Pos class	Sampling	Ratio	AUC	Group	Dataset	Pos class	Sampling	Ratio	AUC	Group
Part D	1018	RUS	[90:10]	0.7709	a	DMEPOS	200	RUS	[99:1]	0.7027	a
Part D	1018	RUS	[75:25]	0.7697	a	DMEPOS	200	ROS	[99:1]	0.6854	a
Part D	1018	ROS	[99:1]	0.7480	a	DMEPOS	200	RUS	[90:10]	0.7029	a
Part D	1018	None	[Full]	0.7463	a	DMEPOS	635	RUS	[90:10]	0.7399	a
Part D	1018	ROS	[90:10]	0.7455	a	DMEPOS	200	ROS	[90:10]	0.6834	a
Part D	200	RUS	[99:1]	0.7370	a	DMEPOS	100	None	[Full]	0.6618	a
Part D	200	None	[Full]	0.7002	a	DMEPOS	50	None	[Full]	0.6153	a
Part D	100	RUS	[90:10]	0.6954	a	DMEPOS	100	None	[Full]	0.6618	a
Part D	100	None	[Full]	0.6807	a	DMEPOS	200	None	[Full]	0.6859	a
Part D	50	RUS	[99:1]	0.6611	a	DMEPOS	635	None	[Full]	0.7265	a
Part D	50	RUS	[90:10]	0.6584	a	DMEPOS	200	ROS	[75:25]	0.6787	a
Part D	50	ROS	[50:50]	0.6418	a	DMEPOS	50	RUS	[99:1]	0.6248	a
Part D	50	None	[Full]	0.6362	a	DMEPOS	50	RUS	[90:10]	0.6252	a
Part D	50	ROS	[99:1]	0.6360	a	DMEPOS	200	ROS	[65:35]	0.6808	a
Part D	50	ROS	[90:10]	0.6332	a	DMEPOS	200	ROS	[50:50]	0.6858	a
Part D	50	ROS	[65:35]	0.6275	a	DMEPOS	50	RUS	[75:25]	0.6181	ab
Part D	50	ROS	[75:25]	0.6259	a	DMEPOS	50	None	[Full]	0.6153	ab
Part D	200	RUS	[90:10]	0.7340	ab	DMEPOS	50	RUS	[65:35]	0.6112	ab
Part D	100	RUS	[99:1]	0.6923	ab	DMEPOS	100	RUS	[99:1]	0.6535	ab
Part D	200	ROS	[99:1]	0.6883	ab	DMEPOS	200	RUS	[75:25]	0.6918	ab
Part D	100	None	[Full]	0.6807	ab	DMEPOS	50	ROS	[50:50]	0.6030	ab
Part D	100	RUS	[75:25]	0.6777	ab	DMEPOS	50	ROS	[90:10]	0.6025	ab
Part D	50	RUS	[75:25]	0.6409	ab	DMEPOS	50	ROS	[99:1]	0.6015	ab
Part D	50	None	[Full]	0.6362	ab	DMEPOS	100	ROS	[99:1]	0.6567	ab
Part D	50	RUS	[65:35]	0.6310	ab	DMEPOS	100	ROS	[90:10]	0.6486	ab
Part D	1018	RUS	[65:35]	0.7637	b	DMEPOS	100	ROS	[50:50]	0.6453	ab
Part D	1018	RUS	[99:1]	0.7597	b	DMEPOS	100	RUS	[90:10]	0.6476	abc
Part D	1018	ROS	[50:50]	0.7383	b	DMEPOS	50	RUS	[50:50]	0.5891	b
Part D	1018	ROS	[65:35]	0.7377	b	DMEPOS	200	RUS	[65:35]	0.6867	b
Part D	1018	ROS	[75:25]	0.7375	b	DMEPOS	200	None	[Full]	0.6859	b
Part D	100	RUS	[65:35]	0.6740	b	DMEPOS	635	RUS	[75:25]	0.7332	b
Part D	100	ROS	[99:1]	0.6582	b	DMEPOS	635	RUS	[99:1]	0.7331	b
Part D	100	ROS	[50:50]	0.6536	b	DMEPOS	50	ROS	[75:25]	0.5878	b
Part D	100	ROS	[65:35]	0.6478	b	DMEPOS	50	ROS	[65:35]	0.5875	b
Part D	100	ROS	[90:10]	0.6431	b	DMEPOS	100	ROS	[75:25]	0.6395	b
Part D	100	ROS	[75:25]	0.6405	b	DMEPOS	100	ROS	[65:35]	0.6391	b
Part D	50	RUS	[50:50]	0.6162	b	DMEPOS	635	ROS	[50:50]	0.7103	b
Part D	200	RUS	[75:25]	0.7237	bc	DMEPOS	100	RUS	[75:25]	0.6362	bc
Part D	200	ROS	[50:50]	0.6767	bc	DMEPOS	200	RUS	[50:50]	0.6683	c
Part D	200	ROS	[90:10]	0.6766	bc	DMEPOS	635	RUS	[65:35]	0.7266	c
Part D	1018	RUS	[50:50]	0.7538	c	DMEPOS	635	None	[Full]	0.7265	c
Part D	200	ROS	[75:25]	0.6694	c	DMEPOS	635	ROS	[65:35]	0.6884	c
Part D	200	ROS	[65:35]	0.6693	c	DMEPOS	100	RUS	[65:35]	0.6298	cd
Part D	100	RUS	[50:50]	0.6523	c	DMEPOS	100	RUS	[50:50]	0.6137	d
Part D	200	RUS	[65:35]	0.7128	cd	DMEPOS	635	RUS	[50:50]	0.7173	d
Part D	1018	None	[Full]	0.7463	d	DMEPOS	635	ROS	[75:25]	0.6756	d
Part D	200	None	[Full]	0.7002	de	DMEPOS	635	ROS	[90:10]	0.6492	e
Part D	200	RUS	[50:50]	0.6954	e	DMEPOS	635	ROS	[99:1]	0.6444	e

References

Improper payments elimination and recovery act of 2010.

How Growth of Elderly Population in US Compares With Other Countries, 2013.

National Health Expenditures 2016 Highlights, 2016.

Medicare Provider Utilization and Payment Data, 2018.

US Medicare Program, 2018.

Administration for Community Living. 2017 profile of older americans.

Alhammady

and Ramamohanarao

, Using emerging patterns and decision trees in rare-class classification, in: Data Mining, 2004. ICDM’04. Fourth IEEE International Conference on, IEEE, 2004, pp. 315–318.

Ali

Shamsuddin

S.M.

and Ralescu

A.L.

, Classification with class imbalance problem: a review, Int J Adv Soft Comput Appl 7(3) (2015), 176–204.

Association of American Retired Persons (AARP). Chronic conditions among older americans.

10.

Bauder

R.A.

Rosa

and Khoshgoftaar

T.M.

, Identifying medicare provider fraud with unsupervised machine learning, in: 2018 IEEE International Conference on Information Reuse and Integration (IRI), IEEE, 2018, pp. 285–292.

11.

Bauder

R.A.

and Khoshgoftaar

T.M.

, A novel method for fraudulent medicare claims detection from expected payment deviations (application paper), in: Information Reuse and Integration (IRI), 2016 IEEE 17th International Conference on, IEEE, 2016, pp. 11–19.

12.

Bauder

R.A.

and Khoshgoftaar

T.M.

, Medicare fraud detection using machine learning methods, in: Machine Learning and Applications (ICMLA), 2017 16th IEEE International Conference on, IEEE, 2017, pp. 858–865.

13.

Bauder

R.A.

and Khoshgoftaar

T.M.

, Multivariate outlier detection in medicare claims payments applying probabilistic programming methods, Health Services and Outcomes Research Methodology 17(3-4) (2017), 256–289.

14.

Bauder

R.A.

and Khoshgoftaar

T.M.

, Medicare fraud detection using random forest with class imbalanced big data, in: Information Reuse and Integration (IRI), 2018 IEEE 19th International Conference on, IEEE, 2018, pp. 80–87.

15.

Bauder

R.A.

and Khoshgoftaar

T.M.

, A survey of medicare data processing and integration for fraud detection, in: Information Reuse and Integration (IRI), 2018 IEEE 19th International Conference on, IEEE, 2018, pp. 9–14.

16.

Bauder

R.A.

and Khoshgoftaar

T.M.

, The effects of varying class distribution on learner behavior for medicare fraud detection with imbalanced big data, Health Information Science and Systems 6 (September 2018), 9.

17.

Bauder

R.A.

Khoshgoftaar

T.M.

and Hasanin

, Data sampling approaches with severely imbalanced big data for medicare fraud detection, in: 2018 IEEE 30th International Conference on Tools with Artificial Intelligence (ICTAI), IEEE, 2018, pp. 137–142.

18.

Bauder

R.A.

Khoshgoftaar

T.M.

Richter

and Herland

, Predicting medical provider specialties to detect anomalous insurance claims, in: Tools with Artificial Intelligence (ICTAI), 2016 IEEE 28th International Conference on, IEEE, 2016, pp. 784–790.

19.

Bauder

R.A.

Khoshgoftaar

T.M.

and Seliya

, A survey on the state of healthcare upcoding fraud analysis and detection, Health Services and Outcomes Research Methodology 17(1) (2017), 31–55.

20.

Bekkar

Djemaa

H.K.

and Alitouche

T.A.

, Evaluation measures for models assessment over imbalanced data sets, Iournal of Information Engineering and Applications 3(10) (2013).

21.

Branting

L.K.

Reeder

Gold

and Champney

, Graph analytics for healthcare fraud risk estimation, in: Advances in Social Networks Analysis and Mining (ASONAM), 2016 IEEE/ACM International Conference on, IEEE, 2016, pp. 845–851.

22.

Breiman

, Random forests, Machine Learning 45(1) (2001), 5–32.

23.

Chai

K.E.K.

Anthony

Coiera

and Magrabi

, Using statistical text classification to identify health information technology incidents, Journal of the American Medical Informatics Association 20(5) (2013), 980–985.

24.

Chandola

Sukumar

S.R.

and Schryver

J.C.

, Knowledge discovery from massive healthcare claims data, in: Proceedings of the 19th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, ACM, 2013, pp. 1312–1320.

25.

Chong

Bravo

and Davison

, How much effort should be spent to detect fraudulent applications when engaged in classifier-based lending? Intelligent Data Analysis 19(s1) (2015), S87–S101.

26.

CMS. Medicare provider utilization and payment data.

27.

CMS. National provider identifier standard (npi), 2016.

28.

CMS. Medicare fraud & abuse: Prevention, detection, and reporting booklet, 2017.

29.

CMS. Hcpcs – general information, 2018.

30.

CMS Office of Enterprise Data and Analytics. Medicare fee-for service provider utilization & payment data part d prescriber public use file: A methodological overview.

31.

CMS Office of Enterprise Data and Analytics. Medicare fee-for-service provider utilization & payment data physician and other supplier.

32.

CMS Office of Enterprise Data and Analytics. Medicare fee-for-service provider utilization & payment data referring durable medical equipment, prosthetics, orthotics and supplies public use file: A methodological overview.

33.

CMS Outreach and Education. Medicare Fraud & Abuse: Prevention, Detection, and Reporting, Sep 2017.

34.

Cubanski and Neuman, The facts on medicare spending and financing, 2018.

35.

Pozzolo A.D., Caelen, Borgne Y.-A.L., Waterschoot and Bontempi, Learned lessons in credit card fraud detection from a practitioner perspective, Expert Systems with Applications 41(10) (2014), 4915–4928.

36.

Demchenko, Zhao, Grosso, Wibisono and Laat C.D., Addressing big data challenges for scientific data infrastructure, in: Cloud Computing Technology and Science (CloudCom), 2012 IEEE 4th International Conference on, IEEE, 2012, pp. 614–617.

37.

Dongre S.S. and Malik L.G., Rare class problem in data mining: review, International Journal of Advanced Research in Computer Science 8(7) (2017), 1102–1105.

38.

Drummond and Holte R.C., C4.5, class imbalance, and cost sensitivity: why under-sampling beats over-sampling, in: Workshop on Learning from Imbalanced Datasets II, Citeseer, Vol. 11, 2003, pp. 1–8.

39.

Feldman and Chawla N.V., Does medical school training relate to practice? Evidence from big data, Big Data 3(2) (2015), 103–113.

40.

Fernández, Río, Chawla N.V. and Herrera, An insight into imbalanced big data classification: outcomes and challenges, Complex & Intelligent Systems 3(2) (2017), 105–120.

41.

Centers for Medicare & Medicaid Services. What’s medicare, 2016.

42.

Gelman, Analysis of variance – why it is more important than ever, The Annals of Statistics 33(1) (2005), 1–53.

43.

Guo, Jennifer, Huang and Gong, Learning from class-imbalanced data: Review of methods and applications, Expert Systems with Applications 73 (2017), 220–239.

44.

Hasanin and Khoshgoftaar T.M., The effects of random undersampling with simulated class imbalance for big data, in: 2018 IEEE International Conference on Information Reuse and Integration (IRI), IEEE, 2018, pp. 70–79.

45.

He H. and Garcia E.A., Learning from imbalanced data, IEEE Transactions on Knowledge and Data Engineering 21(9) (2009), 1263–1284.

46.

Herland, Khoshgoftaar T.M. and Bauder R.A., Big data fraud detection using multiple medicare data sources, Journal of Big Data 5 (September 2018), 29.

47.

Jeni L.A., Cohn J.F. and Torre F.D.L., Facing imbalanced data-recommendations for the use of performance metrics, in: Affective Computing and Intelligent Interaction (ACII), 2013 Humaine Association Conference on, IEEE, 2013, pp. 245–251.

48.

Walker S.J., Big data: A revolution that will transform how we live, work, and think, 2014.

49.

Katal, Wazid and Goudar R.H., Big data: issues, challenges, tools and good practices, in: Contemporary Computing (IC3), 2013 Sixth International Conference on, IEEE, 2013, pp. 404–409.

50.

Khoshgoftaar T.M., Allen E.B., Hudepohl J.P. and Aud S.J., Application of neural networks to software quality modeling of a very large telecommunications system, IEEE Transactions on Neural Networks 8(4) (1997), 902–909.

51.

Khoshgoftaar T.M., Golawala and Van Hulse, An empirical study of learning from imbalanced data using random forest, in: Tools with Artificial Intelligence, 2007. ICTAI 2007. 19th IEEE International Conference on, IEEE, Vol. 2, 2007, pp. 310–317.

52.

Khoshgoftaar T.M., Seiffert, Van Hulse, Napolitano and Folleco, Learning with limited minority class data, in: Machine Learning and Applications, 2007. ICMLA 2007. Sixth International Conference on, IEEE, 2007, pp. 348–353.

53.

Khurjekar, Chou C.-A. and Khasawneh M.T., Detection of fraudulent claims using hierarchical cluster analysis, in: IIE Annual Conference. Proceedings, Institute of Industrial and Systems Engineers (IISE), 2015, p. 2388.

54.

Ko J.S., Chalfin, Trock B.J., Feng, Humphreys, Park S.-W., Carter H.B., Frick K.D. and Han, Variability in medicare utilization and payment among urologists, Urology 85(5) (2015), 1045–1051.

55.

Cessie S.L. and Van Houwelingen J.C., Ridge estimators in logistic regression, Applied Statistics (1992), 191–201.

56.

LeDell, Gill, Aiello, Candel, Click, Kraljevic, Nykodym, Aboyoun, Kurka and Malohlava, h2o: R Interface for ‘H2O’, 2018. R package version 3.20.0.2.

57.

LEIE. Office of inspector general leie downloadable databases, 2018.

58.

Liu, Fong, Wong R.K., Mohammed, Fiaidhi, Sung and Wong K.K.L., Adaptive swarm balancing algorithms for rare-event prediction in imbalanced healthcare data, PloS One 12(7) (2017), e0180830.

59.

Lin S.-C., Wang Z.-Y. and Chung Y.-F., Detect rare events via mice algorithm with optimal threshold, in: Innovative Mobile and Internet Services in Ubiquitous Computing (IMIS), 2013 Seventh International Conference on, IEEE, 2013, pp. 70–75.

60.

Maalouf, Homouz and Trafalis T.B., Logistic regression in large rare events and imbalanced data: a performance comparison of prior correction and weighting methods, Computational Intelligence 34(1) (2018), 161–174.

61.

Marr, Bernard. How Big Data Is Changing Healthcare, Apr 2015.

62.

Meng, Bradley, Yavuz, Sparks, Venkataraman, Liu, Freeman, Tsai D.B., Amde, Owen et al., Mllib: machine learning in apache spark, The Journal of Machine Learning Research 17(1) (2016), 1235–1241.

63.

Meng, Bradley J.K., Yavuz, Sparks E.R., Venkataraman, Liu, Freeman, Tsai D.B., Amde, Owen, Xin, Franklin M.J., Zadeh, Zaharia and Talwalkar, Mllib: Machine learning in apache spark, CoRR, abs/1505.06807, 2015.

64.

Morris, Combating Fraud in Health Care: An Essential Component of Any Cost Containment Strategy, 2009.

65.

OIG. Office of inspector general exclusion authorities.

66.

OIG. Office of inspector general exclusion authorities us department of health and human services.

67.

Pande and Maas, Physician medicare fraud: characteristics and consequences, International Journal of Pharmaceutical and Healthcare Marketing 7(1) (2013), 8–33.

68.

Raghupathi and Raghupathi, Big data analytics in healthcare: promise and potential, Health Information Science and Systems 2(1) (Feb 2014), 3.

69.

Rastogi A.K., Narang and Siddiqui Z.A., Imbalanced big data classification: a distributed implementation of smote, in: Proceedings of the Workshop Program of the 19th International Conference on Distributed Computing and Networking, ACM, 2018, p. 14.

70.

Reinsel, Gantz and Rydning, Data age 2025: The evolution of data to life-critical, Don’t Focus on Big Data, 2017.

71.

Sadiq, Tao, Yan and Shyu M.-L., Mining anomalies in medicare big data using patient rule induction method, in: Multimedia Big Data (BigMM), 2017 IEEE Third International Conference on, IEEE, 2017, pp. 185–192.

72.

Seiffert, Khoshgoftaar T.M., Van Hulse and Napolitano, Mining data with rare events: a case study, in: Tools with Artificial Intelligence, 2007. ICTAI 2007. 19th IEEE International Conference on, IEEE, Vol. 2, 2007, pp. 132–139.

73.

Sheshasaayee and Thomas S.S., A purview of the impact of supervised learning methodologies on health insurance fraud detection, 2018, 978–984.

74.

Tayal, Coleman T.F. and Li, Rankrc: large-scale nonlinear rare class ranking, IEEE Transactions on Knowledge and Data Engineering 27(12) (2015), 3347–3359.

75.

Tukey J.W., Comparing individual means in the analysis of variance, Biometrics (1949), 99–114.

76.

van Capelleveen, Poel, Mueller R.M., Thornton and van Hillegersberg, Outlier detection in healthcare fraud: a case study in the medicaid dental domain, International Journal of Accounting Information Systems 21 (2016), 18–31.

77.

Van Hulse and Khoshgoftaar T.M., Knowledge discovery from imbalanced and noisy data, Data & Knowledge Engineering 68(12) (2009), 1513–1542.

78.

Van Hulse, Khoshgoftaar T.M. and Napolitano, Experimental perspectives on learning from imbalanced data, in: Proceedings of the 24th International Conference on Machine Learning, ACM, 2007, pp. 935–942.

79.

Waghade S.S. and Karandikar A.M., A comprehensive study of healthcare fraud detection based on machine learning, International Journal of Applied Engineering Research 13(6) (2018), 4175–4178.

80.

Weiss G.M., Mining with rarity: a unifying framework, ACM Sigkdd Explorations Newsletter 6(1) (2004), 7–19.

81.

Zaharia, Chowdhury, Das, Dave, McCauley, Franklin M.J., Shenker and Stoica, Resilient distributed datasets: A fault-tolerant abstraction for in-memory cluster computing, in: Proceedings of the 9th USENIX Conference on Networked Systems Design and Implementation, USENIX Association, 2012, pp. 2–2.

82.

Zaharia, Xin R.S., Wendell, Das, Armbrust, Dave, Meng, Rosen, Venkataraman and Franklin M.J., Apache spark: a unified engine for big data processing, Communications of the ACM 59(11) (2016), 56–65.

83.

Zhai, Zhang and Wang, The classification of imbalanced large data sets based on mapreduce and ensemble of elm classifiers, International Journal of Machine Learning and Cybernetics 8(3) (2017), 1009–1017.

84.

Zhang, Kotagiri, Tari and Cheriet, Krnn: k rare-class nearest neighbour classification, Pattern Recognition 62 (2017), 33–44.

¹ https://www.cms.gov/Outreach-and-Education/Medicare-Learning-Network-MLN/MLNProducts/downloads/fraud_and_abuse.pdf.

² UCI Machine Learning Repository – https://archive.ics.uci.edu/ml/index.php.

³ https://www.cms.gov/Outreach-and-Education/Medicare-Learning-Network-MLN/MLNProducts/Downloads/Fraud-Abuse-Products.pdf.

Table 2 Selected Medicare dataset features

Table 3 Description of experiment datasets

⁷ https://spark.apache.org/docs/2.3.0/ml-classification-regression.html.