Abstract
Objective:
An accurate estimate of the prevalence, demographic characteristics, and geographic distribution of thalassemia in the United States is needed to plan for the health care needs of people with this disease. We developed and evaluated the predictive value of administrative case definitions for correctly identifying people living with thalassemia.
Methods:
We conducted a retrospective study of the diagnostic accuracy of 3 claims-based case definitions to identify people with thalassemia in Medicaid administrative data from 2012 through 2019. Case definition 1 was ≥5 encounters with a code for thalassemia; case definition 2 was ≥1 encounter with a code for thalassemia and ≥6 encounters with a transfusion code; and case definition 3 was ≥2 encounters with a code for thalassemia and a transfusion code occurring on the same encounters. We validated our findings by using confirmatory laboratory assessment and expert review by clinicians at thalassemia treatment centers in Georgia (Children’s Healthcare of Atlanta) and California (University of California San Francisco) as the gold standard.
Results:
Of the 327 people identified, thalassemia was confirmed in 173 (52.9%), excluded in 68 (20.8%), and found indeterminate in 86 (26.3%) people. Case definition 1 had the lowest positive predictive value (PPV) (range, 55%-77%). For case definition 2, the PPV range was 80% to 86%. Case definition 3 had the highest PPV range (82%-96%) but also captured more indeterminate cases than case definition 2.
Conclusions:
Accurately identifying patients with thalassemia using a case definition based on administrative claims data is feasible. Extending our method to other health care databases beyond Medicaid may allow for an estimate of the national prevalence of transfusion-dependent thalassemia. However, cases of nontransfusion-dependent thalassemia were difficult to define with sufficient precision.
Thalassemia is an inherited blood disorder that results from mutations in the α- or β-globin genes essential for the synthesis of hemoglobin. The mutations create an imbalance in the production of α- and β-globin chains leading to ineffective erythropoiesis, hemolysis, and chronic lifelong anemia. The clinical manifestations in α- and β-thalassemia are closely linked to genotype.1,2 The degree of anemia can range from mild to severe to life-threatening, and those affected are categorized as transfusion dependent (TD) and nontransfusion dependent (NTD) depending on the need for regular red blood cell transfusions. The most severe forms of thalassemia become TD during infancy (β-thalassemia major) 1 or even during the prenatal period (α-thalassemia major). 2 People with less severe forms of β-thalassemia (β-thalassemia intermedia, hemoglobin E β-thalassemia) and α-thalassemia (hemoglobin H disease) may not need regular transfusions in early childhood (NTD), but some people go on to develop severe symptoms from anemia and become TD at a later age. The designation of a person as NTD is, therefore, not static or final, although the probability of needing regular transfusions differs among various genotypes.1,2
Thalassemia is considered a rare disease, especially in the United States. Most cases in the United States are the result of migration from areas with a high prevalence of thalassemia, including China, Southeast Asia, South Asia, the Middle East, and the Mediterranean. 3 Globally, an estimated 56 000 people are born with thalassemia each year, of whom 30 000 need regular transfusions for survival.4-6 A recent systematic review of thalassemia prevalence found limited population-based estimates for many countries and heterogeneity in case definitions, diagnostic methodology, type of thalassemia reported, and details on transfusion requirements. 7 Similarly, the prevalence and distribution of thalassemia in the United States are underestimated because of a lack of population-based registries, variable newborn screening practices,8-10 and insufficient access to diagnostic genetic testing because of barriers such as follow-up failures after abnormal screenings, limited health care provider awareness, and cost-prohibitive molecular testing. 11 Birth prevalence estimates are available from California’s newborn screening program, which reports estimated rates of 1 in 9000 births for α thalassemia and 1 in 55 000 births for β thalassemia genotypes. 12 These estimates are now almost 20 years old, and even then, a survey suggested that thalassemia among births in the United States was rising, reflecting a heterogeneous group of conditions with new ethnicities, genotypes, and phenotypes. 13 Furthermore, estimates based on newborn screening data do not account for the effect of immigration and state-to-state migration. 14 Additional attempts to estimate population-based prevalence or identify a group of people with thalassemia to study their need for or use of health care services in the United States have included surveys of thalassemia treatment centers, 13 assessment of large-scale claims datasets,15,16 or a combination of these methods. 17
An accurate estimate of the prevalence, demographic characteristics, and geographic distribution of thalassemia in the United States is necessary to effectively plan for the health care needs of the growing and diverse population living with this disease. This study tests the discriminatory ability of 3 claims-based definitions to identify people living with TD and NTD thalassemia who are Medicaid beneficiaries against confirmatory laboratory diagnosis and expert review by clinicians at thalassemia treatment centers. We hypothesized that claims-based case definitions would have higher predictive value in identifying people who are TD (vs NTD) because of their more frequent use of health care, so we report results by patients’ transfusion status.
Methods
We conducted a retrospective study of the diagnostic accuracy of claims-based definitions to identify patients with thalassemia in Medicaid administrative data by using confirmatory laboratory assessments and expert medical record review from thalassemia treatment centers as the gold standard.
Inclusion Criteria
This study included people covered by Medicaid in California or Georgia who had ≥1 encounter with a recorded thalassemia diagnosis from January 1, 2012, through December 31, 2019, identified by the International Classification of Diseases, Ninth Revision, Clinical Modification (ICD-9-CM) codes 18 and International Classification of Diseases, Tenth Revision, Clinical Modification (ICD-10-CM) codes 19 for thalassemia (Table 1). Patients also needed to have ≥1 encounter (regardless of diagnosis) at a partner clinical institution. Specifically, for California, eligibility was confirmed if any encounter included a University of California, San Francisco (UCSF), institutional National Provider Identifier. For those identified in Georgia, eligibility was confirmed if any encounter included an institutional Medicaid provider identification for Children’s Healthcare of Atlanta (CHOA).
Table 1. ICD-9-CM and ICD-10-CM codes for thalassemia and sickle cell disease and procedure codes for transfusion
Abbreviations: —, not applicable; CPT, Current Procedural Terminology; HCPCS, Healthcare Common Procedure Coding System; ICD-9-CM, International Classification of Diseases, Ninth Revision, Clinical Modification; ICD-10-CM, International Classification of Diseases, Tenth Revision, Clinical Modification; ICD-10-PCS, International Classification of Diseases, Tenth Revision, Procedure Coding System.
Data source: Centers for Disease Control and Prevention. 18
Data source: Centers for Medicare & Medicaid Services. 19
Data source: Centers for Medicare & Medicaid Services. 21
Data source: American Hospital Association. 22
We excluded patients with Medicaid encounters containing more diagnosis codes for sickle cell disease (a related hemoglobin disorder) than for thalassemia. We deduplicated all encounters, inclusive of inpatient and outpatient encounters, by patient identifiers and service date.
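The two preprocessing steps described above can be sketched as follows. This is a minimal illustration, not the study's actual code: the `Encounter` record and its field names are hypothetical, the flags are assumed to have been derived already from the Table 1 code lists, duplicates are assumed to be merged by combining their flags, and "more diagnosis codes for sickle cell disease than for thalassemia" is interpreted here as a comparison of coded encounter counts after deduplication.

```python
from collections import Counter
from typing import NamedTuple


class Encounter(NamedTuple):
    patient_id: str
    service_date: str     # ISO date string, e.g. "2015-01-05"
    has_thal_code: bool   # any thalassemia diagnosis code on the encounter
    has_scd_code: bool    # any sickle cell disease diagnosis code


def deduplicate(encounters):
    """Collapse records that share patient_id and service_date into one."""
    merged = {}
    for e in encounters:
        key = (e.patient_id, e.service_date)
        prev = merged.get(key)
        if prev is None:
            merged[key] = e
        else:
            # Assumption: a duplicate keeps a flag if either copy carried it.
            merged[key] = Encounter(
                e.patient_id,
                e.service_date,
                prev.has_thal_code or e.has_thal_code,
                prev.has_scd_code or e.has_scd_code,
            )
    return list(merged.values())


def exclude_scd_dominant(encounters):
    """Drop patients with more SCD-coded than thalassemia-coded encounters."""
    thal, scd = Counter(), Counter()
    for e in encounters:
        thal[e.patient_id] += e.has_thal_code
        scd[e.patient_id] += e.has_scd_code
    return [e for e in encounters if scd[e.patient_id] <= thal[e.patient_id]]


# Hypothetical records, not study data.
raw = [
    Encounter("A", "2015-01-05", True, False),
    Encounter("A", "2015-01-05", True, False),  # duplicate record
    Encounter("A", "2015-02-09", True, False),
    Encounter("B", "2015-01-07", True, True),
    Encounter("B", "2015-03-11", False, True),
]
kept = exclude_scd_dominant(deduplicate(raw))
```

In this example, patient A is retained with 2 deduplicated encounters, and patient B is excluded because 2 encounters carry a sickle cell disease code and only 1 carries a thalassemia code.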
Ethics Approval
This study received expedited approval from the UCSF Institutional Review Board (IRB) (IRB no. 22-37734) as research involving materials (data, documents, records, or specimens) that were collected or will be collected solely for nonresearch purposes (eg, medical treatment, diagnosis). The Georgia State University IRB considered this study exempt because it was conducted as a secondary analysis of data collected for the public health surveillance of hemoglobinopathies (IRB no. H11142) under an active agreement between Georgia State University and CHOA.
Validation Data
Thalassemia centers at UCSF and CHOA contributed their patient-level data including transfusion dependency status for patients with laboratory-confirmed thalassemia. A patient was considered to have confirmatory evidence of thalassemia if their medical records included a genetic test showing globin gene mutations compatible with a diagnosis of thalassemia. We confirmed pathogenicity of rare mutations through the online globin gene mutation database HbVar. 20
For patients who met a case definition based on Medicaid claims data but were not reported by a clinical center, hematologists at each center reviewed medical records to determine their thalassemia and transfusion dependency status. Patients were confirmed to have thalassemia if pertinent genetic tests were available in the medical record. In the absence of genetic tests, we used a clinical review of complete blood count, red blood cell indices (mean cell volume and mean cell hemoglobin), body iron status, and hemoglobin electrophoresis to assign the diagnosis of thalassemia. We classified people who were found either to carry the thalassemia trait or to not have thalassemia after this review as false positives. We classified patients lacking confirmatory laboratory evidence of thalassemia because of insufficient medical record evidence as indeterminate. Transfusion dependency status for true-positive cases could be validated only if patients received ≥6 recurring transfusions from participating clinical centers; we considered those patients with incomplete documentation to be NTD (Figure).
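The review logic in this paragraph can be summarized as a small decision function. The sketch below is only an illustration of the rules as stated; the names `Review`, `classify`, and `TD_THRESHOLD` are hypothetical, and the clinical judgment involved in reading laboratory results is reduced to a single precomputed outcome.

```python
from enum import Enum


class Review(Enum):
    """Outcome of the clinical laboratory review when no genetic test exists."""
    DISEASE = "thalassemia disease"
    TRAIT = "thalassemia trait"
    NOT_THAL = "no thalassemia"
    INSUFFICIENT = "insufficient evidence"


TD_THRESHOLD = 6  # recurring transfusions documented at a participating center


def classify(genetic_confirmed, lab_review, recurring_transfusions=None):
    """Return (validation status, transfusion status or None)."""
    if genetic_confirmed or lab_review is Review.DISEASE:
        # Incomplete transfusion documentation is treated as NTD.
        documented = recurring_transfusions or 0
        status = "TD" if documented >= TD_THRESHOLD else "NTD"
        return "true positive", status
    if lab_review in (Review.TRAIT, Review.NOT_THAL):
        return "false positive", None
    return "indeterminate", None
```

For example, `classify(False, Review.TRAIT)` yields a false positive, and a genetically confirmed patient with no documented transfusions is returned as a true positive with NTD status.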

Figure. Overview of methodology in a study of the diagnostic accuracy of claims-based definitions to identify people with thalassemia in Medicaid administrative data, California and Georgia, 2012-2019.
Administrative Case Definitions
We tested 3 case definitions by using various combinations of thalassemia diagnosis codes (ICD-9-CM/ICD-10-CM), transfusion procedure codes (ICD-9 Procedure Coding System, 18 ICD-10 Procedure Coding System, 19 and Current Procedural Terminology/Healthcare Common Procedure Coding System [CPT/HCPCS] 21 ), and revenue codes 22 (Table 1). Case definition 1 was ≥5 encounters with an ICD-9-CM/ICD-10-CM code for thalassemia occurring within a 12-month period; case definition 2 was ≥1 encounter with an ICD-9-CM/ICD-10-CM code for thalassemia and ≥6 encounters with a transfusion code within a 12-month period; and case definition 3 was ≥2 encounters with an ICD-9-CM/ICD-10-CM code for thalassemia and a transfusion code occurring on the same encounters within a 12-month period.
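To make the 3 definitions concrete, the following sketch applies them to one patient's deduplicated encounters. It rests on several assumptions that the text does not specify: a "12-month period" is read as a rolling 365-day window starting at any encounter, both conditions of case definition 2 must fall within the same window, and the hypothetical `Encounter` flags have already been derived from the code lists in Table 1.

```python
from datetime import date, timedelta
from typing import NamedTuple


class Encounter(NamedTuple):
    service_date: date
    thal_dx: bool       # encounter carries a thalassemia diagnosis code
    transfusion: bool   # encounter carries a transfusion code


WINDOW = timedelta(days=365)  # assumed reading of "12-month period"


def windows(encounters):
    """Yield the encounters falling within 365 days of each encounter date."""
    ordered = sorted(encounters, key=lambda e: e.service_date)
    for i, first in enumerate(ordered):
        end = first.service_date + WINDOW
        yield [e for e in ordered[i:] if e.service_date < end]


def meets_definitions(encounters):
    """Return {1: bool, 2: bool, 3: bool} for one patient's encounters."""
    met = {1: False, 2: False, 3: False}
    for window in windows(encounters):
        thal = sum(e.thal_dx for e in window)
        transfused = sum(e.transfusion for e in window)
        same_visit = sum(e.thal_dx and e.transfusion for e in window)
        met[1] = met[1] or thal >= 5
        met[2] = met[2] or (thal >= 1 and transfused >= 6)
        met[3] = met[3] or same_visit >= 2
    return met


# Hypothetical patient: two visits, each with a diagnosis and a transfusion code.
patient = [
    Encounter(date(2016, 1, 10), True, True),
    Encounter(date(2016, 2, 10), True, True),
]
print(meets_definitions(patient))  # {1: False, 2: False, 3: True}
```

Starting a window at each encounter date is sufficient, because any 365-day span that satisfies a definition can be shifted forward to begin at its earliest qualifying encounter without losing any of the others.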
To identify transfusion dependency, we opted for 6 transfusions rather than the conventional 8, because the latter number was primarily derived from classification criteria used in clinical trials, 23 and we believe that a 6-transfusion threshold is reasonable and aligns more closely with actual disease management needs.
Data Analysis
We obtained patients’ demographic characteristics from Medicaid data. We calculated age as of the first date of the study period, January 1, 2012. We obtained information on thalassemia genotype and transfusion status through clinical case review. We determined the positive predictive value (PPV) of each case definition by dividing the number of true positives by the total number of people who met the case definition. Because some patients had an indeterminate thalassemia status, we report a PPV range: the minimum PPV includes patients with true-positive, false-positive, and indeterminate status in the denominator for each definition, and the maximum PPV excludes patients with an indeterminate status from the denominator.
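The PPV bounds described here reduce to 2 ratios that differ only in whether indeterminate patients count in the denominator. A minimal sketch follows; the function name `ppv_range` is hypothetical, and the counts in the example are invented for illustration and are not study results.

```python
def ppv_range(true_pos, false_pos, indeterminate):
    """Return (minimum PPV, maximum PPV) as fractions between 0 and 1."""
    if true_pos + false_pos == 0:
        raise ValueError("need at least one patient with a determinate status")
    # Minimum: indeterminate patients are kept in the denominator.
    minimum = true_pos / (true_pos + false_pos + indeterminate)
    # Maximum: indeterminate patients are dropped from the denominator.
    maximum = true_pos / (true_pos + false_pos)
    return minimum, maximum


# Hypothetical counts: 80 true positives, 10 false positives, 10 indeterminate.
low, high = ppv_range(80, 10, 10)
print(f"PPV range: {low:.0%} to {high:.0%}")  # PPV range: 80% to 89%
```

The gap between the 2 bounds widens as the number of indeterminate patients grows, which is why a definition that captures many indeterminate patients shows a wide PPV range.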
Results
In California and Georgia, 327 people met at least 1 of the administrative case definitions (Table 2). Among these, 54.4% (n = 178) were female and 45.6% (n = 149) were male. About half (50.8%; n = 166) of the patients were aged <18 years, and 49.2% (n = 161) were aged ≥18 years. By race, 50.5% (n = 165) were Asian/Pacific Islander, 12.2% (n = 40) were Black, 9.5% (n = 31) were White, and 27.8% (n = 91) were classified as other/unknown race.
Table 2. Demographic characteristics of all patients (N = 327) and true-positive thalassemia patients (n = 173) identified in a study of the diagnostic accuracy of administrative case definitions to identify thalassemia patients in Medicaid data, California and Georgia, 2012-2019
Data source: California and Georgia Medicaid, 2012-2019, with thalassemia status validated with clinical data from the University of California, San Francisco, and Children’s Healthcare of Atlanta.
Includes only unknown race in Georgia. In California, ethnicity is included in the race designation, so this category includes patients with Hispanic ethnicity, patients with unknown race, and patients who indicated “other” race.
Of the 327 people initially identified, we confirmed thalassemia in 173 (52.9%), excluded thalassemia in 68 (20.8%), and classified thalassemia as indeterminate in 86 (26.3%) people (Table 3). Most patients classified as indeterminate had few claims, no mention of thalassemia in their medical record, and insufficient laboratory results.
Table 3. Positive predictive value (PPV) of administrative case definitions for thalassemia tested in Medicaid administrative data, California and Georgia, 2012-2019
Data source: California and Georgia Medicaid, 2012-2019, with thalassemia status validated with clinical data from the University of California, San Francisco, and Children’s Healthcare of Atlanta.
Case definition 1 had a PPV range of 55% to 77% (Table 3). The exclusion of transfusion codes doubled the number of true-positive patients, compared with case definition 2 and case definition 3, but failed to exclude any indeterminate patients and most false-positive patients. For case definition 2, the PPV range was 80% to 86%. For case definition 3, the PPV range was the highest at 82% to 96%; case definition 3 identified the most patients with TD thalassemia (79 of 84) and performed better than case definition 2 (72 of 84). The addition of transfusion codes on the same visit improved the number of true-positive patients identified but also included more indeterminate patients.
Among true-positive patients, people with TD thalassemia represented 84 of 172 (48.8%) for case definition 1, 72 of 84 (85.7%) for case definition 2, and 79 of 91 (86.8%) for case definition 3. Patients with NTD thalassemia were underidentified: case definition 2 and case definition 3 each captured only 12 of 89 (13.5%) (Table 4).
Case definition 1: ≥5 thalassemia encounters; case definition 2: ≥1 thalassemia encounter AND ≥6 transfusion encounters; case definition 3: ≥2 thalassemia encounters, with a transfusion code on the same encounters.
Data source: California and Georgia Medicaid, 2012-2019, with thalassemia status validated with clinical data from the University of California, San Francisco, and Children’s Healthcare of Atlanta.
Cell sizes are suppressed for counts <11.
Discussion
Claims-based definitions have many advantages for estimating the prevalence of rare conditions and their associated morbidities and health care use, specifically ease, speed, and cost. 24 They also pose challenges because their accuracy relies on the coding taxonomy, coding practices, and the necessity to validate administrative case definitions against clinical gold-standard data. 25 The lack of universal newborn screening and population migration are recognized barriers to estimating the prevalence of thalassemia in the United States, which has prompted research into the use of administrative data. Since the original Registry and Surveillance for Hemoglobinopathies (RuSH) pilot, 17 several studies have updated and validated the surveillance-based case definitions for sickle cell disease (SCD),26,27 a related hemoglobin disorder, but challenges in data validation for a claims-based definition to identify thalassemia were formidable.
Until 2012, ICD-9-CM 18 coding for thalassemia was limited to a single code (282.4) for all forms of thalassemia disease and included those who were carriers of the thalassemia trait. In the Public Health Research, Epidemiology and Surveillance for Hemoglobinopathies (PHRESH) project, attempts to evaluate and validate data collected during RuSH found substantial problems in accurately identifying and describing the population with thalassemia in California and Georgia, likely due to the inclusion of thalassemia trait in the same ICD-9-CM codes as true disease. The report concluded that future efforts to conduct surveillance for thalassemia should take advantage of changes introduced in 2012 to ICD-9-CM coding and further revised in late 2015 with ICD-10-CM 19 coding and combine codes for red blood cell transfusion and iron overload with thalassemia diagnosis codes (Paulukonis S, Tracking California, written communication: unpublished California RuSH Data Validation Report, December 16, 2014).
Learning from prior experience, we tested the PPV of 3 case definitions to identify thalassemia using administrative data since the coding changes in 2012. Our results showed that case definition 3 had the highest PPV, but the reliance of this definition on a thalassemia diagnosis and transfusion code on the same visit excluded most people with NTD thalassemia. Case definition 3 more accurately identified TD thalassemia, compared with case definition 2, which includes multiple transfusion codes without a simultaneous diagnosis of thalassemia. While people who are TD receive ≥6 transfusions per year, it was not necessary to capture >2 visits with transfusion codes if a diagnosis code for thalassemia was present on the same visit. The reason that case definition 3 captures a much higher proportion of TD thalassemia than NTD thalassemia is that only a small proportion of people with thalassemia receive intermittent (or on-demand) transfusions. 28 Among 717 people with thalassemia followed by the Thalassemia Western Consortium, only 9% had received intermittent transfusions, compared with 35% who received regular transfusions and 56% who had never received a transfusion. 29 Case definition 3 would also have the advantage of identifying patients with TD thalassemia who receive care at multiple health care systems or experience changes in health insurance coverage. However, because of the variable timing at which people may become TD, case definitions that rely on transfusion codes can miss patients who transition to transfusion dependence outside the study period. Additionally, claims data may not capture every administered transfusion if patients are under care management arrangements with capitated payments to health care providers, which could further limit the accuracy of the case definition.
The use of multiple thalassemia claims without requiring transfusion codes (case definition 1) identified the most NTD cases but had the lowest PPV. While consistent with findings from the PHRESH report, this high rate of false-positive and indeterminate cases was not explained by our data but could point toward continuing diagnostic inaccuracies or a lack of detailed information available to those entering the diagnosis of thalassemia in the patient’s medical record. 24 Despite the wider availability of genetic testing, many patients lack definitive testing that can improve the reliability of diagnosis and the coding of claims data.
Our evaluation of case definition 1 and case definition 3 demonstrates that approaches using administrative data to simultaneously identify TD and NTD thalassemia may be mutually conflicting and should be developed independently. The use of separate methods to identify TD and NTD thalassemia aligns with the distinct public health goals for these disorders. People who are TD require complex multidisciplinary care at health care facilities with expertise in transfusion medicine and iron overload; management according to standard guidelines reduces morbidity and improves long-term outcomes. 30 The extension of case definition 3 to all-payor claims databases (public and private) could allow for the emergence of state and possibly national prevalence estimates and a better understanding of the distribution of TD thalassemia in the United States. This knowledge would provide a rational basis for the development of thalassemia treatment centers through the training and education of hematologists and other health care providers. As with other rare diseases, targeted outreach by patient advocacy organizations can assist with access to comprehensive care and curative therapies for those with thalassemia.
The public health concern with NTD thalassemia stems from underdiagnosis and undermanagement of anemia, which predisposes affected people to complications and poor quality of life.31,32 Therefore, independent methods that focus on people who are NTD should be explored in future studies that address limitations imposed by the inconsistent use of genetic testing and overlap of diagnosis codes with thalassemia trait. While separate ICD-10-CM codes exist for thalassemia trait (D56.3) and unspecified thalassemia (D56.9), the real-world use of these codes rather than codes for thalassemia disease (D56.0: α-thalassemia, D56.1: β-thalassemia) is unknown. Incorrect coding of thalassemia trait contributes to a high false-positive rate for identifying NTD thalassemia from claims data alone, suggesting the need to test definitions using electronic medical record data, which also include laboratory values. Conversely, some people who are NTD may have thalassemia that is undiagnosed or misdiagnosed as a trait because of an incomplete laboratory investigation, and no existing databases would be useful for identifying these cases. A strong case can be made for performing genetic testing on all people suspected of having thalassemia (trait or disease), which should be universally covered through health insurance.
Because claims data more reliably identify TD (vs NTD) thalassemia, we considered whether these data could help estimate the size of the population with NTD thalassemia. Data from the 2018 Thalassemia Western Consortium showed that 35% of patients were TD, 29 suggesting that the population with NTD thalassemia may be nearly twice as large as the population with TD thalassemia. However, the Thalassemia Western Consortium’s higher proportion of α-thalassemia (55%) and HbE β-thalassemia (15%), likely because of regional ethnic differences, limits generalizability. Most people with α-thalassemia are NTD; people with α-thalassemia constituted <10% of those with TD thalassemia at UCSF and CHOA. 33 Therefore, regional clinical data are essential if TD thalassemia prevalence is used to estimate NTD prevalence nationally.
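The arithmetic behind this extrapolation is simple: if a fraction p of people with thalassemia are TD, the NTD population is (1 − p)/p times the TD population. The sketch below illustrates the calculation with the 35% figure cited above; the function names are hypothetical, and the result carries over to another population only if its TD share is similar, which the regional differences noted above call into question.

```python
def ntd_to_td_ratio(td_share):
    """Ratio of NTD to TD population size, given the TD share (0 < share < 1)."""
    if not 0 < td_share < 1:
        raise ValueError("td_share must be strictly between 0 and 1")
    return (1 - td_share) / td_share


def estimate_ntd(td_count, td_share):
    """Extrapolate an NTD count from a TD count and an assumed TD share."""
    return td_count * ntd_to_td_ratio(td_share)


# TD share reported by the Thalassemia Western Consortium: 35%.
print(round(ntd_to_td_ratio(0.35), 2))  # 1.86
```

A lower TD share, as would be expected in a population with more α-thalassemia, raises the ratio sharply, so the choice of regional TD share dominates any such estimate.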
Given the genetic nature of thalassemia, TD and NTD forms co-occur in the same populations. Thus, in the absence of broader data, methods for identifying TD thalassemia can guide targeted outreach and education, especially for the population with NTD thalassemia, by improving awareness, diagnosis, and adherence to clinical guidelines among affected communities and their health care providers.
Limitations
This study had several limitations. First, our analysis was limited by a small sample size, which restricted genotype-specific analysis and resulted in many indeterminate diagnoses. These indeterminate cases yielded a range of PPV estimates and prevented calculation of sensitivity and specificity. Second, data from 2 specialty centers may not be representative of other health care settings or private health insurance claims. Third, some people with NTD thalassemia may lack a formal diagnosis, making them undetectable via claims-based methods.
Conclusion
Our study’s findings contribute to the growing body of research that uses claims-based methods to identify rare diseases. 34 However, accurately capturing the full spectrum of thalassemia is challenging, especially for NTD cases. We recommend refining the computable phenotype for NTD thalassemia and supporting public health policies that expand access to genetic testing through newborn screening and health insurance coverage. In contrast, TD thalassemia can be efficiently identified by using our method, which may support studies on the natural history of the disease and evaluation of new therapies. Future research applying this definition to all-payor claims data could help estimate TD thalassemia prevalence at state and national levels, informing public health planning and outreach.
Footnotes
Acknowledgements
The authors thank Mary Hulihan, DrPH, Centers for Disease Control and Prevention, for her advice, review, and guidance on this article; and Sujit Sheth, MD, Cornell Medical Center, for his advice, guidance, and data contributions.
Declaration of Conflicting Interests
The authors declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.
Funding
The authors disclosed receipt of the following financial support for the research, authorship, and/or publication of this article: This study was supported by funding from CDC-RFA-DD19-1903: Characterizing the Complications Associated With Therapeutic Blood Transfusions for Hemoglobinopathies.
Disclaimer
The findings and conclusions in this article are those of the authors and do not necessarily represent the official position of the Centers for Disease Control and Prevention.
