Abstract
This article, prepared as part of a special issue on multiarmed experiments, describes the design of the RAND Health Insurance Experiment, paying particular attention to the choice of arms. It also describes how the results of the Experiment were used in a simulation model and, looking back, how the design might have differed, and how the results apply today, 4 decades after the Experiment was conducted.
The RAND Health Insurance Experiment (hereafter the Experiment) is a well-known example of a multiarm experiment (Aron-Dine et al., 2013; Gruber, 2006; Newhouse and the Insurance Experiment Group, 1993). Its primary aim was to estimate effects on medical spending and health outcomes from varying the amount of cost sharing in health insurance plans, in particular, varying coinsurance rates or the proportion of medical spending paid by the patient when using care. I was the principal investigator for the Experiment, the fieldwork for which was carried out between 1974 and 1982.
Despite the time that has passed since the Experiment was completed and the natural issue of the degree to which its findings remain applicable, its results on demand elasticities are still widely used. In 2006, the Congressional Budget Office said, “Even though the RAND data were gathered several decades ago, the study’s findings remain relevant and are widely relied on by analysts…” (Congressional Budget Office, 2006). Two years later, it called the findings “the best available evidence about the effects of cost sharing…” (Congressional Budget Office, 2008).
In what follows, I describe the rationale for several of the Experiment’s design decisions, including the choice of arms and sample size; additional material is in Newhouse (1974), which describes the initial design. Newhouse and the Insurance Experiment Group (1993) describes the final design, which was close to the initial design but incorporated some midcourse corrections described below. I also describe below one use of the Experiment’s results, a simulation model to extrapolate results from the Experiment’s insurance plans or arms to other insurance plans. Extrapolation to treatments not locally close to treatments included in an experiment, of course, is a common problem for both simple treatment-and-control and multiarmed experiments. I conclude with some thoughts on what in retrospect I would have done differently and how one might think about the applicability of the Experiment’s results from 4 decades ago to today’s health care financing and delivery. Before turning to issues around the design, however, I briefly describe how the Experiment came about.
The Genesis of the RAND Health Insurance Experiment
In August 1969, the Nixon Administration proposed the Family Assistance Program, a form of guaranteed income for poor families, with the amount of government assistance falling with family earnings. This form of income support, or a negative income tax, had been proposed by prominent economists across the political spectrum as an antipoverty policy (Friedman, 1962; Tobin et al., 1967). In Congressional hearings on the Administration’s proposal, Senator Russell Long (D-LA), the chairman of the Senate Finance Committee, asked how the proposal might affect the Medicaid notch, whereby individuals just below an income or asset threshold were eligible for Medicaid but those just above it were not. As a result of that question, the Nixon Administration formed a task force to address the issue of how Medicaid should integrate with its Family Assistance Program proposal and more generally its policy toward the then relatively new Medicaid program. 1
Larry Orr, an economist from the Office of Economic Opportunity, the agency created by the Johnson Administration to implement its War on Poverty, was assigned as a staff member to the Deputy Under Secretary for Policy, who was a member of the task force. Although the Medicaid statute, Title XIX of the Social Security Act, did not allow for any cost sharing, Orr raised the issue of whether some cost sharing should be considered. Based on the initial discussions of the task force, Orr concluded that little was known about the effect of cost sharing on the utilization of medical care.
At this time, the Office of Economic Opportunity was sponsoring a series of randomized experiments in income support or negative income taxes (Cain & Watts, 1973; Rivlin & Wiener, 1988). Although these experiments had not concluded, enough experience had been gained that the Office of Economic Opportunity viewed them as a promising method to develop evidence-based social policy.
In 1969, I had submitted a grant application to the agency that was a forerunner of today’s Agency for Healthcare Research and Quality to study the effect of cost sharing on the demand for medical care using observational data. My grant application found its way to Orr, and he came to visit me to explore the possibility of an experiment in health insurance and the effects of cost sharing for low-income persons. I had not contemplated an experiment (if nothing else, the necessary budget would have greatly exceeded monies available through the usual grant mechanisms), but Orr and I agreed that an experiment could be useful, and so the Office of Economic Opportunity gave me a small grant to design such an experiment. With the help of colleagues at RAND and economists at the Office of Economic Opportunity, I did so. Based on that design, the Office of Economic Opportunity subsequently decided to hold a competition to implement the experiment, which our group won.
Although the original design was intended to ascertain effects of cost sharing only in a poor and near-poor population, consistent with the Medicaid focus and the mission of the Office of Economic Opportunity, the Nixon Administration in 1971 proposed a national health insurance plan that contemplated near universal coverage—and in fact bore many similarities to the Affordable Care Act (Obamacare) that was enacted 4 decades later (Altman & Shactman, 2011; Blumenthal & Morone, 2009). The Nixon proposal for national health insurance included cost sharing for medical services. There were two principal Democratic alternatives to the Nixon proposal. One had been introduced by Senator Edward Kennedy (D-MA) and Representative James Corman (D-CA) and called for universal coverage with no cost sharing, that is, free medical care. The second had been introduced by Senators Russell Long (D-LA) and Abraham Ribicoff (D-CT) and called for an insurance plan with considerably more cost sharing than the Administration’s proposal. Their proposal included a large deductible, the so-called catastrophic plan. 2 These various proposals for universal coverage with their different stances on cost sharing naturally led to interest in the effects of cost sharing in a general population, and the target population for the Experiment was duly modified to include a general population as I describe below.
The reader should consult Newhouse and the Insurance Experiment Group (1993) for a summary of the Experiment’s results, but it may help in reading what follows to know that the two extremes of cost sharing included in the Experiment, no cost sharing and a large income-related deductible, showed around a 25%–30% difference in use and that for most persons there were no detectable effects on health outcomes from this difference in the use of medical care. Confidence intervals were generally narrow, so large effects on health outcomes could be ruled out. The sick poor, approximately 6% of the Experimental population, were an exception; specifically, the blood pressure of those with hypertension (high blood pressure) was less well controlled if they were assigned to a cost sharing plan rather than a plan with no cost sharing. As a result of the poorer blood pressure control, they were predicted to have a 16% higher likelihood of dying at any given time.
The Design of the Experiment
Of course, the design needed to specify the variation in cost sharing among the experimental insurance plans, but there were also many other design decisions. What medical services should the plans cover and at what rates should hospitals and physicians be reimbursed? Beyond those fundamental decisions about the details of the insurance plans, how could refusal and attrition be minimized? What was the population to be sampled, and should the sampling be proportional or should certain subpopulations be oversampled? How many persons should participate, and how should they be allocated among plans? How many sites should be included, and how should specific sites be chosen? How might methods effects be measured?
As in any experiment, in answering these design questions, it was crucial to keep two related questions firmly in mind: What policy questions were we trying to answer? How would we analyze the data the Experiment would produce?
Although an initial aim of the Experiment was to estimate demand responses or price elasticities, it was also a goal to say something about the effects on health outcomes of any variation in the use of services. There had been a decades-long, heavily ideological debate about whether cost sharing deterred necessary or unnecessary services, or, for those who preferred gray to black and white, the mix of effects on both. In standard welfare economics, simply estimating demand and cost functions suffices to estimate the degree of inefficiency (deadweight loss), but it is highly problematic to attach any normative significance to the demand curve for medical care because of potential rents in provider reimbursement, asymmetric information, agency issues, and the many behavioral biases persons bring to dealing with both uncertainty and health care (Bernheim & Rangel, 2004; Beshears et al., 2008; Ericson & Sydnor, 2017; Handel & Kolstad, 2015; Handel & Schwartzstein, 2018; Kahneman, 2011; Loewenstein et al., 2013). Therefore, in addition to estimating how use responded to cost sharing, a key aim of the Experiment was to assess the effects of cost sharing on health outcomes, including both self-reported outcomes and physiologic measures such as blood pressure, as well as effects on quality of care and patient satisfaction. This, however, implied a major investment in measure development since existing measures were, to put it mildly, rudimentary.
The Range of Cost Sharing or Choice of Arms
What cost sharing should be in the Experimental plans depended on the policy questions for which answers were sought. From the three legislative proposals described above, it was clear that the Experiment’s cost sharing should span the range from no cost sharing to catastrophic insurance. But where should plans be located within this range? An economist’s first instinct might be to estimate a continuous function by, for example, assigning persons coinsurance rates that varied in increments of 1 percentage point from 0% (no cost sharing) to 100% (no insurance benefits). The design group rejected this notion because of practicality (see below), and because interest was greater in certain ranges of the coinsurance space than others, namely, coinsurance rates that would roughly correspond to the three legislative proposals just described.
The Nixon Administration’s proposal mandated that all employers provide subsidized plans to their workers, and it would have replaced Medicaid with a plan that covered all poor families with children with graduated cost sharing tied to income. 3 Those without employer-provided plans who were ineligible for Medicaid would be eligible for state-based insurance pools. At that time, employment-based insurance plans typically had a coinsurance rate of 20% or 25%. As a result, the Experiment’s design called for a plan with 25% coinsurance, as well as one with 0 coinsurance (free care or no cost sharing), like the Kennedy–Corman proposal, and a plan with an initial 100% coinsurance rate, like the Long–Ribicoff proposal. The desire for some information on responses at coinsurance rates between 25% and 100% also led us to include a 50% coinsurance rate plan. These four coinsurance rates formed four principal arms of the Experiment.
All the plans with cost sharing had an annual cap or stop-loss feature on out-of-pocket spending. This cap was scaled down for low-income individuals and families. The 100% coinsurance plan with a cap was thus a large income-related deductible. After the first year in the first site, the 100% rate was changed to 95% to give families on that plan who did not expect to exceed the deductible a financial incentive to file a claim. 4
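The interaction of a coinsurance rate with an annual cap can be made concrete with a short sketch. The function names and the dollar figures below are illustrative assumptions, not part of the Experiment's documentation; the sketch only shows why a 95% (or 100%) coinsurance rate combined with a cap behaves like a large deductible, and how the price of the next dollar of care drops to zero once the cap is reached.

```python
def out_of_pocket(total_spending, coinsurance, cap):
    """Annual out-of-pocket cost: the coinsurance share of spending, limited by the cap."""
    return min(coinsurance * total_spending, cap)


def marginal_price(total_spending, coinsurance, cap):
    """Share of the next dollar of care that the family pays itself."""
    return coinsurance if coinsurance * total_spending < cap else 0.0


# Hypothetical family with a $1,000 cap on the 95% coinsurance plan.
cap = 1000.0
for spending in (200.0, 1000.0, 5000.0):
    paid = out_of_pocket(spending, 0.95, cap)
    price = marginal_price(spending, 0.95, cap)
    print(f"spending={spending:8.2f}  out_of_pocket={paid:8.2f}  marginal_price={price:.2f}")
```

Below the cap the family pays nearly the full bill, and above it care is free at the margin, which is the sense in which such a plan acts as an income-related deductible when the cap is tied to income.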
The cap on out-of-pocket spending was included both because of its policy relevance and because Kenneth Arrow’s seminal work on the economics of medical care had shown that, provided there was no behavioral response to insurance, that is, no moral hazard, an optimal insurance plan was full insurance above a deductible, meaning a cap on out-of-pocket expenses (Arrow, 1963).5,6 There was also a critical practical reason for including such a limit, namely, limiting adverse selection in enrollment. I describe this further below.
Having experimental plans that roughly mimicked the cost sharing in the three most prominent legislative proposals and policies already in the market was not only policy relevant but also may have made it easier to obtain consent to enroll. Scam artists sometimes posed as insurance salespersons, especially in low-income neighborhoods, so potential enrollees might well have harbored doubts about the bona fides of a group that asked them to enroll in a plan with 18% or 81% coinsurance or some other rate that was not found in any actual policy.
Although an initial, relatively small deductible was common in actual plans of the day, we did not include one in the plans with 25% and 50% coinsurance. At the time we began, we did not have methods from economic theory to extrapolate results from plans in which unit price varied with total expenditure. Although the stop-loss provision in all the cost sharing plans meant this feature was present in all the Experimental plans except the free care (zero coinsurance) plan, adding an initial modest deductible would only have compounded the problem of extrapolation. After the Experiment began, however, we developed theory and methods for analyzing the behavior of persons with such plans, as we describe below when discussing the simulation model.
Although a principal Experimental objective was to estimate how varying cost sharing affected the use of medical services and total medical spending, or in technical terms an own-price elasticity, there was an additional debate at the time about differential coverage of specific medical services, especially the then prevailing better coverage of inpatient services relative to physician office visits. Many policies in fact excluded physician office visits from coverage altogether. A common argument of those advocating for more generous coverage was that the lack of coverage for outpatient services led persons to defer seeking care until they became sufficiently sick to require more expensive hospital treatment. That argument in turn led some to conclude that extending coverage to outpatient services would both save money and improve health outcomes, the so-called offset effect (Roemer et al., 1975). In economic terms, this debate was about whether inpatient and outpatient services were substitutes or complements, or about the sign of cross-price elasticities.
There was a similar argument around coverage of mental health services, which were generally excluded from coverage. If they were covered at all, the cost sharing was generally higher than for other services on the assumption that demand for them was more responsive to insurance coverage, that is, more elastic. For example, the coinsurance rate for mental health services might be 50%, whereas for other covered services it might be 20% or 25%.
The Experiment’s design addressed these differential coverage issues in two ways. One Experimental plan had 100% coinsurance (later changed to 95%) for outpatient services only and inpatient services were free, thus emulating those policies of the time that had better coverage of inpatient services. Results on this plan could be compared both with the plan in which all services were free and also with plans in which all services were equally costly. In that plan, it seemed preferable to make the annual out-of-pocket spending limit a flat US$150 per person (to a maximum of $450 for a family) for reasons of administrative simplicity, whereas in the other cost sharing plans the out-of-pocket limit was scaled to household income. 7 In the analysis, this plan was treated as a fifth arm, in addition to the four arms with 0%, 25%, 50%, and 95% coinsurance.
With respect to the then higher coinsurance rates for mental health services, the design included a plan that covered mental health services at 50% coinsurance and all other services at 25% coinsurance. 8 This created a contrast with plans that covered all services either at 25% or at 50%.
The results showed that better coverage of outpatient services raised total spending; there was no offset effect from the greater coverage of office visits. In the case of equal coverage (“parity”) of mental health services, we could not reject the null hypothesis of no offset, but confidence intervals were such that any true offset effect was almost certainly small in absolute terms.
An important decision was where to set the annual out-of-pocket limit. Because there was no strong consensus around where a national plan would set such a limit, the limit was randomly varied between 5%, 10%, and 15% of household income, subject to a $1,000 maximum. The $1,000 would be somewhat over $4,000 in 2020 dollars if using the all-items Consumer Price Index to adjust and over $17,000 using the increase in medical spending per capita to adjust. Although these varying limits could be regarded as additional arms, we did not have sufficient statistical power to distinguish among them and so when estimating results for the major arms, we simply pooled them together.
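The income-related limit described above amounts to a simple rule, sketched here under stated assumptions. The function name and the example incomes are hypothetical; the sketch covers only the 5%, 10%, and 15% rule with the $1,000 maximum and ignores plan-specific exceptions such as the flat per-person limit on the plan that covered outpatient services less well.

```python
MAX_LIMIT = 1000.0  # dollar maximum stated in the text


def annual_limit(household_income, income_share):
    """Out-of-pocket limit: a share of household income, subject to the dollar maximum."""
    return min(income_share * household_income, MAX_LIMIT)


# Hypothetical household incomes, shown for each randomly assigned share.
for income in (6000.0, 20000.0):
    limits = [annual_limit(income, share) for share in (0.05, 0.10, 0.15)]
    print(income, limits)
```

For the lower hypothetical income the three shares produce three different limits, whereas for the higher one the dollar maximum binds in every case, which is one reason the contrast among the three shares carries limited information.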
Another question was whether there should be a control group. There were two arguments for a control group. If any national health insurance legislation were to be enacted during the period of the Experiment, the insurance of a control group would be directly affected by that plan, whereas the Experimental participants could be held out and used as a classic control group. A second argument was simply optics; including a control group would forestall criticism from those who mistakenly thought a valid experiment had to have a control group as opposed to randomly assigned comparison groups. Although potentially attractive, a pilot study in the first site showed a control group was not feasible; many hospitals and physicians were simply unwilling to fill out two claim forms, one for the control group member’s actual insurer and a second for the Experiment. As a result, it proved impossible to obtain comparable data on use among a control group, and a control group was simply dropped from the design.
We enrolled entire family units rather than randomly selected family members. Although this decision decreased the statistical efficiency of the estimated results because of intrafamily correlation in health care utilization, we deemed it impractical to enroll only selected family members. Moreover, any national plan with an annual out-of-pocket limit would likely relate that limit to family income, as indeed the Nixon Administration’s plan did.
In 1970, the Nixon Administration proposed including Health Maintenance Organizations (HMOs) in Medicare and Medicaid, which led to the HMO Act of 1973 (Blumenthal & Morone, 2009). Although HMOs, in particular the Kaiser Permanente health plan, were positively viewed by many reformers, their views were not informed by a randomized experiment. For that reason, there were also two HMO arms of the Experiment, a treatment group who were not in an HMO when enrolled and who were randomized to an HMO and a control group of existing HMO members. These were sixth and seventh arms.
The Experiment did not include an uninsured arm. Although it was of great policy and intellectual interest to contrast the effects of being uninsured with having insurance of varying comprehensiveness, it was both ethically and practically impossible to randomize families who already had health insurance to a no insurance arm. 9
What Services to Cover
The Experimental plans covered almost all personal health care services because failure to cover a service would mean no claims data for that service would be available to analyze. 10 Not only medical but also dental services were covered. Dental insurance, however, was then and even today remains less generous than medical insurance. 11 For that reason, the 50% coinsurance rate that applied to mental health services in the split 25%/50% coinsurance plan also applied to dental services. In the other plans, dental care was covered at the same coinsurance rate as medical care.
Although the Experiment’s coverage of medical services was broad, there were certain limits and exclusions, especially for services that a national plan was unlikely to cover and where one was unlikely to observe steady-state demand because satisfying an accumulated stock of demand could last several years or even the entire duration of the Experiment. One example was multiple sessions per week of psychoanalysis, so psychotherapy was limited to 52 visits per year, despite a plea from the psychoanalysts’ professional organization to drop the limit. Cosmetic surgery other than for trauma-related incidents that occurred during the Experiment was another example, as was orthodontia for preexisting conditions. Vision benefits were limited to new lenses and examinations not more frequently than every year and new frames not more frequently than every 2 years. Although the primary reasons for these exclusions were the lack of policy interest and the likely inability to observe steady-state demand in a finite period, the exclusions also marginally reduced cost.
Reimbursement Rates
Carrying out the Experiment required setting up and operating a small insurance company; RAND therefore contracted with an existing third-party administrator to process claims. With respect to reimbursing providers, the administrator followed its usual practice of paying billed charges and negotiating a lower charge if it deemed the fee excessive. Although no record of such negotiations was kept, this happened rarely, in well under 1% of the cases. Prior authorization was also much less common than today, and the Experiment only used it for relatively expensive dental services.
At the time of the Experiment, health insurers were typically passive, usually reimbursing medical bills subject to limits on unit prices. Today’s networks and tiered formularies had not been developed. Apart from a few HMOs, an insured patient could seek care from any physician and pay approximately the same amount out-of-pocket. Today, of course, commercial insurance is dominated by plans that use networks and employ prior authorization and utilization management. As a result, the passive reimbursement policy of the Experiment that largely mimicked the health insurance plans of the time would not be relevant in today’s world. If one were conducting the Experiment today, one would surely not pay billed charges and would employ prior authorization for some medical services. I return to this difference from today’s insurance in the concluding section.
Minimizing Refusal and Attrition: The Participation Incentive and Completion Bonus
The cap on annual out-of-pocket spending in all policies with cost sharing made it feasible to calculate a side payment such that no family could be financially worse off from participating. The side payment, called the Participation Incentive, thereby minimized adverse selection against the cost sharing plans among the great majority of families who had health insurance at the time of enrollment. 12 The amount of the side payment equaled the worst-case outcome for the family under its Experimental plan relative to its existing insurance plan. For example, if the family was enrolled in a cost sharing plan with a cap of $1,000, the Participation Incentive equaled $1,000 less the amount the family would have paid out-of-pocket under its existing insurance plan in the worst-case scenario.
This amount, divided by 12, was paid monthly, which was another design choice. 13 The Incentive could have been paid at the end of a year of participation but that might have weakened the credibility of the enrollment offer and led to a higher refusal rate. It could also have been paid at the beginning of each year but that risked greater attrition from a family pocketing the money and returning to its prior plan. (For ethical reasons and potentially to increase enrollment, families had the right to return to their prior plan at any time, even though payments to the families were structured such that it was never in their financial interest to do so.)
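The Participation Incentive arithmetic described above can be sketched in a few lines. The function names are hypothetical, and flooring the payment at zero (for a family whose existing plan already had a worst case at least as bad as the Experimental plan's) is an assumption made for this illustration rather than a rule stated in the text.

```python
def participation_incentive(experimental_cap, prior_plan_worst_case):
    """Annual side payment: worst case under the Experimental plan minus the
    worst case under the family's existing plan (floored at zero by assumption)."""
    return max(experimental_cap - prior_plan_worst_case, 0.0)


def monthly_payment(annual_incentive):
    """The annual amount was paid in twelve monthly installments."""
    return annual_incentive / 12.0


# Hypothetical family: $1,000 cap, and a worst case of $400 under its existing plan.
annual = participation_incentive(1000.0, 400.0)
print(annual, round(monthly_payment(annual), 2))  # 600.0 50.0
```

Because the payment equals the difference in worst cases, a family that reached its cap would be no worse off than under its existing plan, and any family spending less would come out ahead.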
Setting the annual out-of-pocket limit at a maximum of $1,000 attempted to balance two conflicting arguments. A higher figure would have increased the range of variation in cost sharing, although potentially above the range of policy relevance, but it would also have increased the budget by increasing the worst case for most families and hence the amount of the Participation Incentive payments.
At the time, Participation Incentive payments were criticized on the grounds that they would not be part of a national health insurance program and thus the Experiment would not replicate what such a program would look like. Our response was that these payments would have a negligible effect on utilization because the payments represented an income effect on demand, which, given the income elasticity of medical care and the mostly small percentage increases in household income that the payments represented, would cause negligible distortion in the results. As a result of the criticism, however, we not only tested that prediction directly, since we wanted to adjust for whatever modest effect there might be, but we also built experimental variation into the design to estimate the size of any effect; we describe that feature below.
Long after the Experiment concluded, however, high deductible plans in conjunction with Health Savings Accounts and Health Reimbursement Accounts were introduced and became widespread; 28% of those enrolled in employment-based insurance were in such plans in 2019 (Claxton et al., 2019). 14 These Accounts, when funded by the employer, are similar to the Experiment’s Participation Incentive in also creating an income effect on use (Chernew & Newhouse, 2008). Thus, in a way that was not anticipated at the time of the design, the combination of the Experiment’s cost sharing plans with side payments appears more relevant to today’s world than Experimental cost sharing plans with no side payments would have been.
The design also addressed a potential terminal condition problem. Because payments were made monthly, a family could be worse off from continuing in the Experiment during its final year of participation. For example, it might have no medical expenses for the first half of that year and then a large medical bill that its insurance at its place of employment would pay in full. If the family continued in the Experiment, its remaining Participation Incentive could be less than its out-of-pocket expense for the medical bill, so it would be to the family’s financial advantage to withdraw from the Experiment and return to its employment-based insurance plan. To address this issue, the Experiment also paid a Completion Bonus, an additional amount equal to the family’s worst case that was paid at the end of the Experiment if the family completed its period of participation.
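The terminal-year incentive problem and the role of the Completion Bonus can be illustrated with a worked example. This is a simplified sketch: the function name and all dollar amounts are hypothetical, and it compares only the remaining cash flows for the rest of the final year, ignoring premiums and anything else that might differ between the two plans.

```python
def net_gain_from_staying(remaining_incentive, completion_bonus,
                          oop_experimental, oop_prior_plan):
    """Financial gain from finishing the Experiment rather than returning to the
    prior plan, measured over the remainder of the final year."""
    stay = remaining_incentive + completion_bonus - oop_experimental
    leave = -oop_prior_plan
    return stay - leave


# Hypothetical family: $1,000 cap, prior plan pays the bill in full (worst case $0),
# so the annual Participation Incentive is $1,000, paid monthly.
annual_incentive = 1000.0
remaining = annual_incentive * 6 / 12  # six monthly payments are still to come

# A large bill arrives mid-year: the family would pay its full cap on the
# Experimental plan and nothing on its prior plan.
without_bonus = net_gain_from_staying(remaining, 0.0, 1000.0, 0.0)
with_bonus = net_gain_from_staying(remaining, 1000.0, 1000.0, 0.0)
print(without_bonus, with_bonus)  # -500.0 500.0
```

Without the bonus the family in this example gains by withdrawing; with a bonus equal to its worst case, staying is always at least as good financially.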
After the Experiment began, the annual out-of-pocket limit was reduced from a maximum of $1,000 to $750 in the 25% coinsurance plans. This change, which reduced Participation Incentive payments by reducing the worst case, was made because of budgetary pressure, but the results of the Experiment were minimally affected. 15
What Population Should Be Sampled?
As mentioned above, the initial population of interest was poor and near poor, but after the Nixon Administration’s 1971 proposal for national health insurance, the interest became the general under-65 population. We excluded those eligible for Medicare because the Nixon Administration had no intention of changing Medicare and did not want to risk signaling that it did. We also excluded households with income above $25,000 (in 1974 dollars) because the Administration did not want to make transfer payments to high-income households. The $25,000 figure excluded about 3% of the population in the sites where the Experiment operated, so it was only a minor limitation on generalizability.
There remained the question of whether sampling should be proportional or whether the low-income group should be oversampled. The policy interest in the low-income group was greater, both because of the question of whether there should be cost sharing in Medicaid or whatever program would replace it and also because lower income groups were less likely to be insured so that effects of a new national health insurance program would be larger among lower income groups. For these reasons, low-income groups were oversampled in the first site, Dayton, OH.
Before the second site, Seattle, WA, became operational, we made an argument that income was a noisy variable and that a sufficiently high oversampling rate on such a variable could lose statistical efficiency even for the favored group (Morris et al., 1979). The intuition for this result is that someone with a low income today, such as a graduate student or someone temporarily unemployed, might well not have a low income next year, and conversely. Although in the case of the Experiment, we estimated that there was some gain in precision for the low-income group by modestly oversampling it, such oversampling would complicate all analyses because of the need to include sampling weights for what the RAND group thought was a modest gain in precision for the low-income group (and a modest loss in precision for the entire population). This argument for proportional sampling carried the day in the second site, but in the subsequent sites, which came on stream later, the government asked that we revert to oversampling low-income groups.
How Long Should the Experiment Run?
Several considerations went into a decision to randomly split the sample within plan and site into two groups whose enrollment periods varied. One group, comprising 70% of the participants, participated for 3 years, and the remaining 30% participated for 5 years. Why 3 and 5 years and why 70% and 30%? The design generally sought to optimize the amount of information the Experiment would generate subject to an overall budget constraint.16,17 For purposes of estimating effects of cost sharing on demand, two independent individuals participating for 1 year were more informative than one individual participating for 2 years because of the positive correlation of utilization across years for the same individual. Indeed, for the purpose of estimating demand, the gain in statistical efficiency from individuals participating for more than 2 years and especially for more than 3 years was small. That argued for more persons participating for a shorter period of time, which had the additional benefit of making the results of the Experiment available sooner. On the other hand, the period of participation had to be long enough to detect a beneficial health effect of lower cost sharing if such an effect existed. Not surprisingly, there were no data to estimate how long a period might suffice. We arbitrarily decided that 3 years were likely to be a sufficiently long period of time for health effects to manifest themselves, and we allocated a greater proportion of the sample to the 3-year group both because of the more precise estimates of standard errors that such an allocation permitted and because the 3-year group was less costly. Having some families participate for 5 years, however, gave some protection against a 3-year period being insufficient for health status effects to appear.
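The efficiency argument about person-years can be written out under a simple assumption that is not taken from the Experiment's own analysis: each person-year of spending has the same variance \sigma^2, and any two years for the same person have a common correlation \rho > 0. The subscripts A and B denote two different persons.

```latex
% Two independent persons observed for one year each:
\operatorname{Var}\!\left(\tfrac{1}{2}\,(y_{A1} + y_{B1})\right) = \frac{\sigma^2}{2}

% One person observed for two years:
\operatorname{Var}\!\left(\tfrac{1}{2}\,(y_{A1} + y_{A2})\right) = \frac{\sigma^2\,(1 + \rho)}{2}

% One person observed for T years:
\operatorname{Var}\!\left(\frac{1}{T}\sum_{t=1}^{T} y_{At}\right)
  = \frac{\sigma^2\,\bigl[1 + (T-1)\,\rho\bigr]}{T}
  \;\longrightarrow\; \rho\,\sigma^2 \quad \text{as } T \to \infty
```

With positive correlation the second variance exceeds the first by the factor 1 + \rho, and the third expression shows that adding further years for the same person yields diminishing gains, since the variance cannot fall below \rho\sigma^2.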
The main argument for an enrollment period longer than 3 years, however, was the high likelihood of transitory effects on utilization at the beginning and end of the Experiment, which would interfere with the goal of measuring steady-state demand. Using two different lengths of participation permitted direct estimation of any transitory effects. In some sites, we started both the 3- and 5-year groups at the same time, so comparing behavior of the 3-year group in their final year with that of the 5-year group in their third year provided an estimate of effects from anticipating the end of the Experiment. In other sites, we started the 5-year group 2 years ahead of the 3-year group, so any initial transitory effects could be estimated by comparing the 5-year group in their third year with the 3-year group in their first year. Fortunately, such transitory effects turned out to be quantitatively modest except for dental services.
How Many Persons on Each Plan or in Each Arm?
The initial analysis plan for utilization and spending effects envisioned estimating plan means (analysis of variance) or, to gain statistical precision, means adjusted by some standard covariates such as age (analysis of covariance). The optimal allocation of a sample when one is estimating means of discrete arms is in proportion to $\sqrt{w_i / c_i}$, where $w_i$ is the weight assigned to the mean of the $i$th arm, and $c_i$ is the marginal cost of enrolling another household in the $i$th arm. The overall sample size of 2,000 families was set so that the standard errors of each of the five fee-for-service arms would be under 6% of the mean given our estimates of the means and variances, and in fact they all were under 6%.
This formula was used to allocate the sample to plans, with weights for the ith arm corresponding to the policy interest in that arm. In the first site, the 0%, 25%, 50%, and 95% coinsurance plans were given equal weight and the Individual Deductible plan, the plan that covered outpatient services less well, a lower weight. After the Experiment had begun in the first site but before it began in the second site, however, we developed a more sophisticated theory of demand based on episodes of illness, which led to a model for analyzing data from plans in which unit price changed with total spending and extrapolating to plans that differed from the Experiment’s plans. This led us to somewhat increase the weight on the 95% coinsurance and Individual Deductible plans, the plans where the caps on annual spending were most likely to be binding.
Also after the first site had begun, the group analyzing health outcomes had made substantial progress in developing measures and, in contrast to the group analyzing utilization and spending, wished to compare health outcomes between families on the free plan with families on cost sharing plans taken as a group. Thus, from the health status group’s point of view, the free plan would have ideally comprised approximately half the sample. The final allocation of the sample to plans was a compromise that gave the optimal allocation for the utilization analysis half the weight and the optimal allocation for the health status group the other half. Further details of the rationale for the allocation of sample by plan are given in Appendix B of Newhouse and the Insurance Experiment Group (1993).
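The square-root allocation rule and the half-and-half compromise described above can be sketched as follows. All weights and costs are hypothetical placeholders, not the Experiment's values. The even split of the health status group's non-free half across the cost sharing arms, and the reading of the compromise as a simple average of two allocations, are assumptions made only for illustration.

```python
import math


def sqrt_rule_shares(weights, costs):
    """Sample shares proportional to sqrt(w_i / c_i), normalized to sum to 1."""
    raw = {arm: math.sqrt(weights[arm] / costs[arm]) for arm in weights}
    total = sum(raw.values())
    return {arm: value / total for arm, value in raw.items()}


def blend(shares_a, shares_b, weight_a=0.5):
    """Convex combination of two allocations defined over the same arms."""
    return {
        arm: weight_a * shares_a[arm] + (1.0 - weight_a) * shares_b[arm]
        for arm in shares_a
    }


# Hypothetical policy weights and relative marginal costs per household.
weights = {"free": 1.0, "25%": 1.0, "50%": 1.0, "95%": 1.0, "indiv_deductible": 0.5}
costs = {"free": 2.0, "25%": 1.5, "50%": 1.2, "95%": 1.0, "indiv_deductible": 1.0}

utilization_shares = sqrt_rule_shares(weights, costs)

# Health status group's ideal: about half the sample on the free plan
# (the even split of the remainder is an assumption).
health_shares = {arm: (0.5 if arm == "free" else 0.125) for arm in weights}

final_shares = blend(utilization_shares, health_shares, weight_a=0.5)
for arm, share in final_shares.items():
    print(f"{arm}: {share:.3f}")
```

With these placeholder inputs, the blended allocation puts more of the sample on the free plan than the square-root rule alone would, which is the direction of the compromise described in the text.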
How Many Sites?
Adding a site entailed fixed costs. Field offices had to be opened and maintained for the duration of the fieldwork. Local personnel had to be hired. Local medical and political leaders had to be informed. Additional sites reduced between-site variance, but the greater fixed costs meant fewer total participants for a given budget and so greater within-site variance.
An optimal method would minimize the sum of the estimated between- and within-site variances subject to a budget constraint.
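The trade-off between fixed site costs and sample size can be illustrated with a generic two-stage sampling calculation. This is a textbook-style sketch, not the formula used in the Experiment: the variance expression, the equal allocation of families across sites, and every numerical value below are assumptions chosen only to show the shape of the problem.

```python
def design_variance(n_sites, budget, fixed_cost_per_site, cost_per_family,
                    var_between, var_within):
    """Variance of an overall mean with n_sites sites and equal families per site.

    Returns None if the budget cannot pay the fixed costs plus at least one
    family per site. The variance form is a standard two-stage approximation,
    assumed here for illustration.
    """
    families_per_site = (budget / n_sites - fixed_cost_per_site) / cost_per_family
    if families_per_site < 1:
        return None
    return var_between / n_sites + var_within / (n_sites * families_per_site)


def best_number_of_sites(max_sites, **params):
    """Search over 1..max_sites for the feasible design with the smallest variance."""
    feasible = {}
    for n_sites in range(1, max_sites + 1):
        variance = design_variance(n_sites, **params)
        if variance is not None:
            feasible[n_sites] = variance
    return min(feasible, key=feasible.get)


# Hypothetical inputs in arbitrary units.
params = dict(budget=1000.0, fixed_cost_per_site=50.0, cost_per_family=1.0,
              var_between=1.0, var_within=100.0)
best = best_number_of_sites(15, **params)
print(best, design_variance(best, **params))
```

In this sketch, adding sites lowers the between-site term but shrinks the total number of families the budget can support, so the variance is minimized at an interior number of sites.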
Which Sites?
There remained the question of which six actual sites should be chosen. We chose the sites purposefully rather than at random to assure obtaining variation in a number of characteristics that would give some face validity to a claim of generalizability. Because of geographic variation in the use of medical care, we wanted at least one site in each of the four Census regions. Because the sophistication of the medical care delivery system varied with city size, we wanted varying city sizes, including both metropolitan and nonmetropolitan areas. To accommodate the HMO arm, at least one site had to have a well-established HMO amenable to participating in the Experiment.
Finally, one aim of the Experiment was to understand the nature of any nonprice rationing mechanisms that a national health insurance plan with markedly less cost sharing might activate. Although there was a wide range of existing estimates of the price elasticities of demand from observational data, most estimates implied that the Kennedy–Corman proposal with no cost sharing and universal coverage would likely increase demand beyond the short-run capacity of the delivery system. It was clear, however, that the Experiment would not stress the delivery system in any site and so would not trigger any rationing mechanisms. Not only would the Experiment’s participants represent a small share of any site’s population, some families would have more cost sharing than their prior insurance, while other families would have less. Thus, the net change in demand at any site would be negligible.
To shed some light on the nature of nonprice rationing mechanisms, we chose sites with varying degrees of excess demand, in effect a small observational study within the Experiment. The measure of excess demand that we used was the wait time in a site for an appointment with a primary care physician for a nonurgent problem. Before choosing the actual sites, we carried out a survey to measure wait times for an appointment in many locations; across five of the six sites that we ultimately chose, wait times varied from 4 to 25 days; the sixth site was a rural site where physicians did not make appointments and operated on a first-come, first-served basis. 19 Despite the wide range of demographic, economic, and social differences among the six sites we chose, the utilization response to the Experimental plans was remarkably uniform across them. The only relationship with wait times to an appointment was greater use of the emergency department in sites with longer waits.
Measurement of Methods Effects
Virtually all policy experiments run the risk that some features of the design will not be replicated in an actual program and so could cause the results to differ from those of an actual program. In the case of the Experiment, the likelihood of transitory demand and the Participation Incentive have already been mentioned as examples of such features. In order to measure any effects from those two features, we built variation into the Experiment’s design to allow estimation of the relevant effects. In the case of transitory demand, as already described, the approach was to split the sample and to stagger the start of enrollment across sites. In the case of the Participation Incentive, the approach was to randomly give some families additional amounts above their worst case, allowing a comparison of their use with that of families who received only their worst case.
A similar approach was taken with respect to obtaining baseline measures of physiologic biomarkers such as blood pressure and cholesterol levels. It was an ethical imperative to notify the participants of any abnormal results—and if sufficiently abnormal to try to facilitate their getting immediate treatment. Notifying participants, however, could induce demand for medical care. To measure any such induced demand, we decided to split the sample such that only a random 60% of the participants received a baseline screening exam.
This decision was not without cost because baseline measures of health status greatly improve statistical power. Absent medical intervention, most biomarkers such as blood pressure and cholesterol levels do not change much over a 3- to 5-year period, and the availability of a baseline measure absorbs much of the substantial between-person variation in these measures. Despite the loss of power from not having baseline values for the entire sample, the outcome measures turned out to be estimated with sufficient precision for policy purposes. 20
Yet another measurement issue was the frequency of collecting survey data from the participants. Such data were necessary because not all the information we sought was available on claims forms. Moreover, recall error could be substantial if the participant was asked well after the fact about data such as days lost to activity impairment from illness and time spent seeking medical care. To obtain such information, we initially sent a mail questionnaire weekly to some randomly selected families and to others biweekly. There was only a small difference in the quality of the data between the two groups, so after the first year in the first site, this questionnaire was sent biweekly for the duration of the Experiment. To determine whether this questionnaire itself stimulated use, we did not send it to a random 25% of the sample in four of the sites in their first year of participation. 21
The allocation of the sample to the various subexperiments or arms to measure methods effects was done within plan and site (i.e., stratifying within the major arms of interest). The allocations for the various subexperiments accounted for the incremental cost of the treatment; for example, 44% rather than 50% of the participants received the additional Participation Incentive payment because of its additional cost. The allocations also accounted for our relative interest in the treatment and control groups. In most cases, our interest was equal (so with equal cost we would have allocated half the sample to treatment and half to control), but in the case of the baseline physical exam, we increased the allocation to the baseline exam and decreased the allocation to no exam because of the value of a baseline measure in estimating insurance plan effects on exit outcomes. Because sample sizes for the entire Experiment were set based on highly skewed medical spending distributions, they were more than ample for estimating methods effects, which had less skewed distributions. Methods effects turned out to be generally negligible.
Although different in spirit from the type of methods effects just described, another possible distortion was potential underfiling of claims in the 95% coinsurance plan. Whether underfiling was an actual distortion depended on whether the Experiment was simply trying to measure insurer payout or was trying to measure all use of medical care. If the goal was simply measuring insurer payout, underfiling was of no concern, but if the goal was measuring the use of the medical care system, it was. We deemed the use of the medical care system to be the more important question, which implied a need to estimate underfiling.
A preliminary step described above was simply to reduce the 100% coinsurance rate plan to a 95% rate after the first year in the first site, which gave families with that coinsurance rate a financial incentive to file claims. To directly measure underfiling, however, we conducted an audit study of physician use in two sites. 22 The audit showed that claims for 7%–9% of physician visits had not been filed in the 95% coinsurance plan; this amount of underfiling, however, did not change any qualitative conclusions.
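To see the size of the correction that an underfiling rate of this magnitude implies, the arithmetic can be sketched as below. The count of filed visits is a made-up number, and the sketch assumes that the audited rate applies uniformly to all physician visits on the plan; neither comes from the Experiment's data.

```python
def adjust_for_underfiling(filed_visits, underfiling_rate):
    """Scale filed visits up to estimated actual visits.

    If a fraction u of actual visits generated no claim, then
    filed = (1 - u) * actual, so actual = filed / (1 - u).
    """
    if not 0.0 <= underfiling_rate < 1.0:
        raise ValueError("underfiling_rate must be in [0, 1)")
    return filed_visits / (1.0 - underfiling_rate)


filed = 1000.0  # hypothetical number of filed physician visits
low = adjust_for_underfiling(filed, 0.07)
high = adjust_for_underfiling(filed, 0.09)
print(round(low, 1), round(high, 1))
```

Under these assumptions, filed claims understate actual visits on the 95% plan by roughly 8% to 10%, an adjustment small enough to be consistent with unchanged qualitative conclusions.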
A Simulation Model for Spending and Use of Medical Services
Although analysis of mean plan spending was straightforward, extrapolation to other price schedules, including being uninsured, was not. When the Experiment was designed, the assumption was that relatively few families would spend amounts that approached or exceeded the upper limit or stop loss (this expectation turned out to be incorrect), so that ignoring the upper limit and treating spending differences among plans as simply a response to coinsurance would approximate an estimate of a pure response to price.
That assumption would not hold for small deductibles, however, so, as mentioned above, the Experimental plans did not replicate a common type of insurance policy at the time, one with a relatively small deductible, followed by coinsurance, followed by an upper maximum beyond which the policy would not pay. 23 What medical care spending might be under such a plan, or for that matter in a plan with an out-of-pocket limit that was substantially smaller than that in the Experiment, was not clear. Furthermore, if a plan covered a different mix of services, accrual of spending toward any deductible or out-of-pocket limit would differ and so price effects would likely differ.
During the design phase of the Experiment, we assumed that users of the Experiment’s results could extrapolate from estimates of mean spending in the different plans or arms as they best saw fit. Shortly after the Experiment began, however, we developed an economic theory of behavior when a consumer faced nonconstant unit price schedules (Keeler et al., 1977). That theory implied that the unit of observation to analyze utilization behavior given a nonconstant unit price schedule was not annual per person spending but rather episodes of treated illness because expected unit price would change within the year as the family used medical services. For example, if the consumer knew with certainty that they would subsequently exceed the stop-loss amount, any additional medical care would not increase annual out-of-pocket spending even if the stop-loss amount had not been exceeded at a particular point in time.
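A small sketch may make the nonconstant price schedule concrete. It models a stylized plan with a single coinsurance rate and a stop loss; the function names, the 25% rate, and the $1,000 limit are illustrative assumptions rather than a description of any particular Experimental plan.

```python
def out_of_pocket(total_spending, coinsurance, stop_loss):
    """Annual out-of-pocket cost: coinsurance on all spending, capped at stop_loss."""
    return min(coinsurance * total_spending, stop_loss)


def marginal_price(spending_so_far, coinsurance, stop_loss):
    """Price paid for the next dollar of care, judged only by spending to date."""
    return coinsurance if coinsurance * spending_so_far < stop_loss else 0.0


coinsurance, stop_loss = 0.25, 1000.0  # hypothetical plan parameters

# Spot price before and after the stop loss is reached.
print(marginal_price(2000.0, coinsurance, stop_loss))  # below the limit
print(marginal_price(4000.0, coinsurance, stop_loss))  # limit reached

# A family at $2,000 so far that is certain to end the year at $6,000:
# an extra $100 of care today leaves annual out-of-pocket spending unchanged.
extra_cost = (out_of_pocket(6100.0, coinsurance, stop_loss)
              - out_of_pocket(6000.0, coinsurance, stop_loss))
print(extra_cost)
```

The last calculation is the certainty example in the text: the spot price at $2,000 of spending is still the coinsurance rate, yet the effective price of additional care is zero once year-end spending is known to exceed the limit.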
Using this theory, the RAND group built a simulation model from the Experimental data based on episodes of treated illness. For this purpose, the many arms of the Experiment were invaluable because they allowed estimation of effects in the relevant regions of the supported response surface reasonably well. Furthermore, by imposing structure, one could extrapolate to regions outside the supported surface. In short, estimates from this model could be and were used to estimate effects on spending of plans with varying initial deductibles, coinsurance rates, out-of-pocket limits, covered services, and whether the deductible was per person or per family. The model was even used to estimate the average spending of an uninsured consumer.
The simulation model, which is summarized in Keeler and Rolph (1988), began by grouping an individual’s claims into one of five types of episodes: hospitalization, well care, routine chronic outpatient care, outpatient care to treat acute episodes or flare-ups of chronic disease, and dental. It allowed the response to plan (price elasticity) to differ for each type of episode and the propensity to initiate episodes to differ across persons. The model assumed that the total cost of each type of episode was known at the time the episode began; for example, when a woman made her first visit for a pregnancy, she would know the ultimate cost. Episodes of routine care for chronic diseases such as hypertension or diabetes were assumed to begin on the first day of the annual accounting period and last the entire year. Flare-ups of chronic disease, however, were dated to when they occurred during the year, as were acute episodes, hospitalization, well care, and dental episodes.
The model required estimating how use and spending responded as a person’s total spending approached a change in unit price, for example, as spending neared the upper limit on out-of-pocket spending or moved past it. A rational maximizer would use care at higher rates as the probability of exceeding an upper limit increased because of a higher probability that the marginal unit of care would be free. Once over the limit, the rational maximizer would treat medical care as being on sale for the remainder of the accounting period and use it at an even higher rate than those on the free care plan because at the beginning of the next accounting period medical care would again be expensive. Empirically, however, the average Experimental participant turned out not to be the rational maximizer of economic theory; instead, the average participant did not much anticipate the change in unit price, that is, was myopic before the annual upper limit was exceeded. Once over the limit, the average participant spent at roughly similar rates to those on the free plan. Participants recognized that care was now free and increased their use of care, but they did not increase it above the rate of those on the free plan; in other words, they did not treat care as temporarily on sale.
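The contrast between the forward-looking and myopic consumer can be sketched with two stylized price rules. This is a deliberate simplification and not the model in Keeler et al. (1977) or Keeler and Rolph (1988): it assumes the forward-looking consumer's effective price is the coinsurance rate times the probability of finishing the year below the limit, and the function names and numbers are hypothetical.

```python
def forward_looking_price(coinsurance, prob_exceed_limit):
    """Expected price per dollar of care for a consumer who anticipates the limit.

    Simplifying assumption: care is paid for at the coinsurance rate only if
    the year ends below the out-of-pocket limit.
    """
    return coinsurance * (1.0 - prob_exceed_limit)


def myopic_price(coinsurance, limit_already_exceeded):
    """Price perceived by a consumer who reacts only once the limit is passed."""
    return 0.0 if limit_already_exceeded else coinsurance


coinsurance = 0.25  # hypothetical plan

# A family that is 80% likely to exceed the limit but has not yet done so.
anticipating = forward_looking_price(coinsurance, prob_exceed_limit=0.8)
not_anticipating = myopic_price(coinsurance, limit_already_exceeded=False)
print(anticipating, not_anticipating)
```

In this stylized comparison, the forward-looking consumer already faces a much lower effective price before reaching the limit, whereas the myopic consumer, who resembles the average participant described above, responds to the full coinsurance rate until the limit is actually exceeded.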
Empirically, almost all of the actual response to the different insurance plans was driven by variation in the frequency of treated episodes: the less costly medical care was, the greater was the number of episodes of each type. Cost per episode had only a very small response to price. In other words, the main effect of cost sharing was on the consumer’s decision to initiate care; once under treatment, cost per episode (for a given type of episode) was nearly constant across plans.
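One way to summarize this finding is as a decomposition of expected spending into episode frequency and cost per episode. The notation below is introduced here for illustration and does not appear in the original analyses: $N_k(p)$ is the expected annual number of episodes of type $k$ under plan $p$, and $\bar{c}_k$ is the mean cost of an episode of type $k$.

```latex
% Illustrative decomposition; the notation is an assumption, not the authors'.
\[
  \mathrm{E}[\text{spending} \mid p]
  \;=\; \sum_{k=1}^{5} N_k(p)\,\bar{c}_k(p)
  \;\approx\; \sum_{k=1}^{5} N_k(p)\,\bar{c}_k ,
\]
% The approximation reflects the finding that cost per episode was nearly
% constant across plans, so plan differences enter mainly through N_k(p).
```

Read this way, nearly all of the plan effect on spending operates through the number of episodes initiated rather than through the cost of each episode.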
Although the theory underlying this model had not been conceived at the time of the initial design, the ability to analyze the data by episode fortuitously turned out not to require any information beyond the information that was already being collected. In particular, date of service, which was critical for the episode analysis in order to date episodes within the year, was routinely collected as part of the claim form.
The Hindsight of 40 Years
Looking back, what would I have done differently in designing the Experiment? And what do I now make of its results? The answer to the first question is that I would not have materially changed the design, although if I had known at the outset that health outcome effects would be concentrated among the sick poor, I would have oversampled that group. At the outset, however, measures of health outcomes for studies such as the Experiment were in a primitive state. As a result, when the Experiment was being designed it was not clear what measures of health status would even be included, let alone what effects they might reveal. Indeed, one of the lasting contributions of the Experiment has been the development of measures of health outcomes and quality of care. 24
Of course, I would have made the midcourse changes in the design of the Experiment at the outset to obviate the need for such corrections. For example, I would have started with the 95% coinsurance plan rather than changing the 100% coinsurance plans to 95% coinsurance after the first year in the first site. Like that change, however, the changes in the design, which were made after the Experiment started, were small in magnitude and in my view have not affected any of the major inferences that have been drawn from the results of the Experiment.
From today’s vantage point, however, there are three potential changes in the design that would have warranted greater consideration. First, it would be desirable to know more about what today are called value-based insurance designs (Chernew et al., 2008; Lee et al., 2013). These designs are more specific versions of the argument that lowering or exempting some services from cost sharing would lower total spending and/or improve outcomes. While the Experiment’s design did test the effect of varying cost sharing for only outpatient services and only mental health services, today’s value-based proposals are at a finer level of detail, for example, exempting from cost sharing specific drugs known to be efficacious for those with a chronic disease such as diabetes in order to encourage compliance. Even if this idea had been proposed at the time, however, it might have been left on the cutting room floor because incorporating it into the design would have required additional treatment arm(s) and thus lowered precision for the arms that were included.
Second, one might have added a second or even a third HMO. The results from the HMO arm of the Experiment have been widely cited, even though they came from just one staff model HMO. Some indication of their generalizability would certainly have been useful. Unlike value-based insurance, this idea was considered at the time. The Experiment began with only one HMO because there was much greater doubt about the feasibility of the HMO arm than the fee-for-service arms. That was because the HMO arm required participants to change providers whereas the fee-for-service arms did not. Beginning with only one HMO allowed us to determine whether refusal rates would be so high as to make the results not useful for policy purposes. It turned out that refusal rates at the first HMO site were only slightly higher than in the fee-for-service arms with cost sharing. 25 We therefore looked into adding a second HMO and found that it would have been feasible to do so without adding another field office. 26 At that point, however, the federal government did not wish to increase the budget to accommodate an additional HMO arm.
A third potential design change would have been to include Medicare beneficiaries. Unlike the option of additional HMOs, that option was off the table from the outset. As mentioned above, the Nixon Administration did not wish to be perceived as potentially considering changes in Medicare. Over many of the intervening years, however, cost sharing in traditional Medicare remained above that in large employer plans (McArdle et al., 2012). 27 Perhaps as a result, well after the RAND Experiment was completed, I was occasionally asked by federal policy makers about carrying out an analogous Experiment among the elderly, but both cost and the political sensitivity of Medicare always precluded it.
How applicable are the Experiment’s results to contemporary policy? Two important caveats stem from changes in the larger environment. The first is the ongoing technological change in medicine and the resulting enhancement of medical capabilities. A dramatic illustration is the fall in the age-adjusted mortality rate from cardiovascular disease, then and now the leading cause of death. Between 1970 and 2017, that rate fell by a factor of three (National Center for Health Statistics, 2019). Although this fall is not all attributable to medical care, advances such as medications that better control blood pressure and cholesterol have been estimated to account for 44% of the decrease between 1980 and 2000 (Ford et al., 2007). The ability to save low-birth-weight babies has also steadily improved. And better medical care has not only reduced mortality; disability-free life years among the elderly rose between 1982 and 2011 (Freedman & Spillman, 2016), one cause of which has undoubtedly been developments such as artificial hips and knees and improved treatments for cataracts.
With more on offer from the medical establishment, it is certainly conceivable that the response to variation in what consumers pay for their care today would differ from 4 decades ago. Nonetheless, subsequent observational studies of this issue have found approximately the same utilization response to price that the Experiment did (Brot-Goldberg et al., 2017). That the response is still similar 4 decades later seems plausible given the Experiment’s finding that almost all of the effect of cost sharing is on the consumer’s decision to seek care. Because the consumer may well not know the diagnosis before seeking care and therefore be quite uncertain about the benefit from seeking care, it seems plausible that the initial decision to seek care at the margin could be less affected by technological change. And although newer technology would likely affect how a physician would treat a given clinical problem after establishing a diagnosis, the physician’s decisions may continue to be relatively unaffected by the amount of cost sharing in the patient’s insurance plan.
Technological change does mean that outcome effects from seeking more care might differ. The Experiment found little or no effect on outcomes in the general population from the additional use in the free care plan, although it did find better blood pressure control among low-income hypertensives. For that group, the better blood pressure control mattered for predicted mortality, as mentioned above. 28 Moreover, we not only found no effect of cost sharing on outcomes in the general population but confidence intervals were sufficiently small that more than modest effects could be ruled out. At the time, our after-the-fact explanation for the lack of a sizable effect in the overall population was that poor quality care in a generally healthy population had negative effects that offset the positive effects of better access among the subset of the population that was not in good health (Newhouse and the Insurance Experiment Group, 1993). Although there was evidence of poor quality care among the Experimental population that supported this view, in the subsequent years, a great deal more evidence of poor quality health care has accumulated, making this explanation of no outcome effects for the average nonelderly person more plausible (Institute of Medicine, 1999, 2001; Kilo & Larson, 2009; McGlynn et al., 2003). An optimist, however, might believe that quality of care has now improved sufficiently such that outcomes today would be more positive from inducing a general population to seek more care by reducing their out-of-pocket cost.
A second change in the environment has been the role of insurers. As already noted, the Experiment was conceived in an era when health insurers, both public and private, were mostly passive, simply reimbursing patients for a percentage of their medical spending. In contrast to today’s managed care, insurance arrangements of the 1970s could be termed unmanaged care. Beyond that well-known change, there has also been a subtle change in the nature of cost sharing. At the time of the Experiment, the dominant mode of cost sharing was coinsurance, which meant the patient paid something additional at the margin from using a higher priced provider or drug. The dominant mode of cost sharing today is copayment, meaning a fixed dollar price for a physician visit or a month’s supply of a drug.
The shift from coinsurance to copayment was made possible by insurers intervening directly on the supply side to create networks and formularies, whereby patients paid less to use providers in the network or drugs on preferred tiers of the formulary. This enabled insurers to bargain directly with providers on unit price rather than rely on the indirect mechanism of patients seeking out lower cost providers with the incentive of a somewhat lower out-of-pocket payment because of coinsurance. Both the Experiment and subsequent work have shown that consumers do not actively shop for lower provider prices (Marquis, 1985; Sinaiko & Rosenthal, 2016). Since unit prices paid to providers are set today in negotiations between commercial insurers and providers, insurers could change coinsurance to copayment without materially changing the consumer’s incentives. 29
From an insurer’s point of view, a copayment can be set to achieve the same effect on the incentive to seek care as a coinsurance rate. From a policy analyst’s point of view, however, using the simulation model described above to estimate the effect of different levels of copayment on demand requires translating copayments into coinsurance rates. That translation necessarily introduces uncertainty because the denominator to estimate the coinsurance rate from the copayment is not known.
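The translation problem can be made concrete with a short sketch. It assumes that the implied coinsurance rate is simply the copayment divided by the full price of the service; the copayment and the price range below are hypothetical values chosen only to show how uncertainty in the denominator carries through.

```python
def implied_coinsurance(copayment, full_price):
    """Coinsurance rate implied by a fixed copayment, capped at 100%."""
    if full_price <= 0:
        raise ValueError("full_price must be positive")
    return min(copayment / full_price, 1.0)


copay = 20.0                   # hypothetical copayment for an office visit
plausible_prices = (80.0, 200.0)  # hypothetical range for the unknown full price

rates = [implied_coinsurance(copay, price) for price in plausible_prices]
print(rates)
```

With these illustrative numbers, the same $20 copayment corresponds to an implied coinsurance rate anywhere between 10% and 25%, so any demand estimate drawn from the simulation model inherits that range.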
In sum, the design of the multiarmed RAND Health Insurance Experiment has stood the test of time well. The strongest evidence of that is the continued use and citation of its results. Nonetheless, changes in the financing and delivery of medical care over the subsequent 4 decades have naturally introduced greater uncertainty about the contemporary applicability of its results.
Footnotes
Acknowledgment
I would like to thank Larry Orr for his comments on a draft of this article, as well as his assistance and counsel in designing and implementing the Experiment. Many other persons deserve credit as well, as evidenced not only by the many coauthors of Newhouse and the Insurance Experiment Group (1993) but also by the lengthy list of acknowledgments in that book.
Declaration of Conflicting Interests
The author declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.
Funding
The author received no financial support for the research, authorship, and/or publication of this article.
