Abstract
Pay-for-performance (P4P) programs, based on productivity, patient satisfaction, quality of care, efficiency profiling, or unspecified criteria, have become popular in American medicine. Theoretically, such programs hold the potential to narrow the gender pay gap among physicians by employing what are arguably neutral, meritocratic criteria. Such criteria are often unspecified in prior analyses but in reality may include a host of indicators, including objective features of performance, dimensions that entail a high degree of discretion, and gendered aspects, such as masculine competence (i.e., intelligence, confidence, efficiency, and decisiveness) or feminine warmth (i.e., kindness, trustworthiness, sympathy, and selflessness). Using data from four waves of the Community Tracking Study (CTS) Physician Surveys, I analyze the effects of such unique P4P criteria on the gender pay gaps among physicians. Most notable among the findings is more pronounced gender inequality when criteria are unspecified as opposed to being based on productivity. No effect is found when P4P centers on warmer patient satisfaction criteria. I conclude by discussing how and why P4P schemes may reduce but also exacerbate gender inequalities in pay.
Organizations in the United States have increasingly shifted to “market-driven” human resource (HR) practices over the last two decades, including the use of pay-for-performance (P4P) to link rewards directly to employee performance (Campbell, Campbell, and Chia 1998; Cappelli and Neumark 2001; Ichniowski, Shaw, and Prennushi 1997; Lockwood 2002). Theoretically, P4P has the potential to narrow the gender pay gap by focusing attention on workers’ performance instead of their ascribed characteristics (Castagnetti and Rosti 2013; Kunda and Spencer 2003). However, scholars of gender and organizations have raised concerns about the potential for systematic inequality and bias in P4P systems, finding that men and women with identical performance ratings continue to experience unequal rewards (Bartol 1999; Castilla 2008, 2012; Castilla and Benard 2010; Dobbin, Schrage, and Kalev 2015; Roth, Huffcutt, and Bobko 2003). But why?
Prior research on P4P often assumes a singular definition of performance when, in fact, multiple evaluative criteria are possible, including those that entail culturally masculine skills or feminine qualities, such as warmth, communalism, and relational effectiveness. The same holds true in contemporary American medicine—my specific case in point. Indeed, in medicine, multiple informal and formal criteria can form the basis for bonuses. In this article, and drawing on four waves of the Community Tracking Study (CTS) Physician Surveys, I analyze the specific effects of P4P based on productivity, patient satisfaction, quality of care, efficiency profiling, or unspecified criteria on the gender pay gap among physicians. My core questions center on variations in gender pay gaps depending on the criteria used, and whether and how benefits may accrue differentially to men and women.
Gender Inequality and Meritocracy in Reward Systems
Research on organizational inequality has found that bureaucratic control, accountability structures, and formal personnel practices reduce arbitrariness in employment decisions (Bielby 2000, 2013; Dobbin 2009; Dobbin et al. 1993). For example, formal written guidelines for personnel decisions, and effective oversight of personnel practices, encourage decision-makers to use clear and consistent criteria instead of ascribed characteristics to evaluate individuals (Bielby 2013; Kalev 2014). Evaluators are also more likely to scrutinize their own behavior for bias when they expect their organizations or legal authorities to monitor their decisions (Kalev, Dobbin, and Kelly 2006; Lerner and Tetlock 1999; Tetlock 1992).
Personnel procedures that lack accountability, however, tend to reinforce informal network influence and increase ascriptive inequality in organizations because managers routinely evaluate employees who share their race and gender more highly than other employees, advantaging white men when the evaluators are white men (Bartol 1999; Bielby 2000; Roth et al. 2003).[1] As a result, experts on equal employment opportunity have widely endorsed discretion-control practices, including the use of performance evaluations to make personnel decisions (Dobbin et al. 2015). At the same time, managerial discretion over how to evaluate performance can introduce bias within P4P systems, especially if there are no explicit criteria (Bielby 2000; Maas and Torres-Gonzalez 2011).
How does the use of P4P influence ascriptive inequality within jobs, and how does this vary by performance criteria? P4P appeals to a belief in meritocracy, defined as a system where everyone has equal opportunities for rewards based on their individual merits and efforts, regardless of their gender, race, and other nonmerit characteristics (Scully 1997). Most Americans view meritocracy as a fair and legitimate distributive principle, and believe that meritocracy exists in American society (Kluegel and Smith 1986; McNamee and Miller 2014). Organizations mobilize this belief through P4P systems that aim to increase worker productivity and motivation. If men and women are equally talented and the system is a fair tournament with equal rules of competition for all players, the argument goes, then P4P should eliminate gender differences in pay by using concrete performance data as the basis for promotion and pay decisions (Castagnetti and Rosti 2013; Miller 2011). Following this point, research using tournament theory has found a smaller gender pay gap in intellectual professions and highly specialized occupations where detailed information about employees’ qualifications, skills, and abilities is used (Castagnetti and Rosti 2013).
Other research on P4P systems, in contrast to the arguments above, has found that they often increase ascriptive inequality within specialized jobs (Campbell et al. 1998; Castilla 2008, 2012; Chauvin and Ash 1994; Dobbin et al. 2015; Jonnergaard, Stafsudd, and Elg 2010; Madden 2012; Roth et al. 2003). In fact, inequities can arise in the allocation of work, evaluations of performance, or the translation of performance ratings into rewards. As a result, studies of jobs with intensive performance evaluations have found a larger-than-average gender pay gap among workers with similar productivity characteristics (Castilla 2008, 2012; Chauvin and Ash 1994; Dobbin et al. 2015; Jonnergaard et al. 2010; Madden 2012; Roth 2006).
When organizations use P4P with the goal of promoting a culture of meritocracy and rewarding employees equitably, the practice can have the opposite effect (Castilla and Benard 2010; Dobbin et al. 2015). Castilla and Benard (2010) found more bias against women when the organizational culture accentuated meritocracy than when it did not. Other recent organizational research shows that the use of performance evaluations without monitoring helps white men and hurts white women (Dobbin et al. 2015). Corresponding experimental research has shown that people act in more biased ways when they believe that they are unbiased, especially when they have no accountability for their decisions (Monin and Miller 2001; Uhlmann and Cohen 2005, 2007). Studies examining the translation of performance evaluations into pay have also found “performance-reward bias” (Castilla 2008), whereby decision-makers offer larger rewards to men than to women with equal performance scores (Castilla 2008, 2012; Castilla and Benard 2010; Jasso and Webster 1997). Taken together, this body of research implies a larger gender pay gap under P4P systems, even when men and women perform equally and receive equal performance ratings.
Implicit Versus Explicit Performance Criteria
What mechanisms can produce inequality in P4P systems, and under what conditions? Organizational analysts have incorporated understandings of experimental social psychology in what Bielby (2013) calls “the cognitive turn” in research on workplace inequality. A large body of research in social psychology finds that gender is a primary characteristic that people automatically observe and treat as relevant, especially when they evaluate mixed-sex groups or tasks that are culturally gendered (Perdue et al. 1990; Ridgeway 2011).
Categorization encourages people to use culturally shared beliefs (stereotypes) that value men over women when they evaluate others (Banaji and Greenwald 2013; Biernat 2012; Dovidio et al. 1997; Kunda and Spencer 2003; Ridgeway 2011). Consequently, people hold men to a lower standard for performance than women who perform equally and evaluate women more harshly than men in most settings, especially when they perform a traditionally male task (Berger, Ridgeway, and Zelditch 2002; Biernat 2012; Foschi 1996, 2000; Ridgeway 2011). Unless there are opposing forces, people typically give a man more opportunities to perform, higher performance evaluations, and larger rewards than an otherwise similar woman (Foschi 1996, 2000; Jasso and Webster 1997; Ridgeway 2011).
The above pattern tends to matter most when there is little other information or the other evidence is ambiguous, while detailed individuating information about relevant skills or attributes can decrease bias in evaluations (Banaji and Greenwald 2013; Bauer and Baltes 2002; Biernat 2012; Kunda and Spencer 2003; Ridgeway 2011). However, individuating information has limited ability to override stereotypes in complex environments with many distractions or disruptions—a situation that characterizes most organizations (Martell 1996; Maynard and Brooks 2008; Pratto and Bargh 1991).
Scholars also note that some criteria for performance depend on managers’ subjective judgments of their workers, while others require them to consider objective criteria for performance. Rewards that decision-makers allocate at their discretion are especially subject to biased performance assessments, increasing the likelihood that decision-makers will overreward men and underreward women (Bielby 2000; Reskin 2000).[2] However, the possession of task-specific competence, or skills that are directly related to the task in question, can override the effects of ascribed characteristics on evaluations, especially if decision-makers are required to use other criteria (Berger et al. 2002; Ridgeway 2011). Thus, objective performance measures can reduce bias by encouraging consistency across cases and emphasizing explicit benchmarks (Bauer and Baltes 2002; Maas and Torres-Gonzalez 2011).
Accountability for following consistent procedures (procedural accountability) can also reduce stereotype bias if the evaluation process explicitly requires the use of information about task-specific skills, abilities, and performance, especially if there are costs to inaccuracy (Dobbs and Crano 2001; Lerner and Tetlock 1999). This has led many to argue that organizations can suppress biases by increasing accountability for antidiscrimination outcomes and emphasizing objective performance criteria (Bielby 2000; Kreiger 1995; Reskin 2000). As a whole, extant research suggests that accountability structures, oversight, and formal rules can help to keep cultural biases out of P4P reward allocations, especially in comparison with unspecified criteria. This leads to the following hypothesis:
Competence and Warmth
While explicit performance criteria are likely to reduce gender inequality compared with unspecified criteria, all definitions of performance are socially constructed and context-dependent. As such, measures of performance can focus attention on culturally masculine, gender-neutral, or culturally feminine skills and attributes. Research using Status Characteristics Theory (SCT) has found that evaluators favor men over equal-performing women when tasks are culturally masculine or gender-neutral, but they somewhat favor women when tasks are culturally feminine. Experimental research in this tradition as well as the Stereotype Content Model (SCM) has found that group differences tend to be organized along two fundamental dimensions, competence and warmth, and these dimensions correspond to gender stereotypes (Berger et al. 2002; Cuddy, Fiske, and Glick 2008; Fiske, Cuddy, and Glick 2007; Fiske et al. 2002; Ridgeway 2011).
Gender stereotypes define women as warm (kind, helpful, trustworthy, sympathetic, and selfless) but not competent, and men as competent (skillful, intelligent, confident, efficient, and decisive) but not warm (Burgess and Borgida 1999; Heilman and Eagly 2008). These stereotypes both define expectations about how men and women are, and prescribe how men and women should be. Although there is a hierarchy that values skills associated with competence more highly than those associated with warmth under most conditions, P4P based on definitions of performance that highlight warmth may provide women with some evaluative advantage and reduce the gender gap. Thus, one expects that P4P based on culturally masculine criteria for performance will widen the gender gap while P4P based on culturally feminine criteria for performance will narrow it. Contemporary American medicine is an opportune setting for testing these propositions because it currently uses multiple criteria for P4P.
The Case of American Medicine
In the United States, medicine is a prestigious, highly paid profession in which P4P is a growing trend. The gender pay gap within medicine is also larger than in other professions or the labor force as a whole (Ash et al. 2004; Weinberg 2004). Some of this is due to internal segregation of the profession, whereby men and women work in different practice types and specializations (Boulis and Jacobs 2008; Ku 2011). Gender-unequal family responsibilities and work hours also explain some of the difference, although women in medicine earn substantially less than men after controlling for work effort and experience, specialty, practice type, family status, and other relevant characteristics (Boulis and Jacobs 2008).
A question that previous research has not addressed is how P4P might contribute to this gap. Most research on P4P assumes a single definition of performance that corresponds to stereotypically masculine traits. Yet, contemporary American medicine has adopted multiple P4P strategies, including four formal criteria (Armour et al. 2001; Gilmore et al. 2007; Grumbach et al. 2006; Jha et al. 2012; Rosenthal et al. 2005; Satin 2009). Some bonuses in medicine are also based on unspecified criteria. Thus, medicine represents a unique setting for understanding the effects of explicit and discretionary criteria for performance on the gender gap in pay.
The most common explicit basis for allocating P4P is productivity, typically defined as a measure of labor output. National estimates from the CTS Physician Surveys indicate that 70.5 percent of physicians received P4P based on productivity during the period 1996 to 2005 (Center for Studying Health System Change [CSHSC] 2001, 2004, 2006, 2008). Practices and health plans often measure productivity in Relative Value Units (RVUs), which assign a standard metric to medical services that allows them to be compared with other services (American Academy of Family Physicians [AAFP] 2002). Examples of productivity include the amount of revenue that a physician generates for the practice, the number of patient visits that he or she provides, or the size of his or her enrollee panel.
Measures of productivity are independent of health outcomes and may encourage poor outcomes within the fee-for-service (FFS) American medical system because healthcare providers earn more when they provide more care but not necessarily when they provide better care (De Jaegher 2010).[3] Research suggests that financial incentives based on productivity have a strong impact on physician behavior, encouraging them to perform more tests and procedures (Armour et al. 2001; Tufano et al. 2001). This criterion also aligns with masculine stereotypes of aggressiveness and ambition in the FFS healthcare model and thus corresponds to a culturally masculine definition of competence that could increase male advantage (Biernat 2012; Burgess and Borgida 1999; Cuddy, Fiske, and Glick 2008). Also, while evaluative bias alone can widen the gender gap under productivity criteria, it is possible that there will be actual gender differences in productivity because of prescriptive stereotypes that encourage men to focus more on profits than women, and penalize women who pursue profits and efficiency rather than “warmer” patient service goals (Burgess and Borgida 1999; Heilman 2001). Thus, I hypothesize the following:
However, productivity measures are also relatively objective, and objective measures of performance can reduce the effects of stereotyping, increase procedural fairness, and decrease gender bias. As a result, this criterion could be associated with a smaller unexplained gender gap if men and women perform similarly under similar conditions, especially compared with unspecified criteria. This leads to the following hypothesis:
A second, less common, criterion for performance is patient satisfaction. National estimates from the CTS indicate that 23.1 percent of physicians received bonuses based on patient satisfaction between 1996 and 2005 (CSHSC 2001, 2004, 2006, 2008). Patients fill out satisfaction surveys, and practices, hospitals, and health plans increasingly take their results seriously, use them as a marketing tool, and reward physicians accordingly (Jackson 2001). Research has found that patient satisfaction is a good measure of how well physicians communicate and of the aesthetics and cleanliness of facilities, but a poor measure of surgical quality, compliance with quality processes, or overall safety (Lyu et al. 2013).
Patient satisfaction also depends on patients’ impressions of the quality of doctor-patient interactions, and patients are likely to have gendered expectations that influence their satisfaction ratings. On one hand, patients may expect more participatory care and better quality doctor-patient communication from female physicians because stereotypes of women include traits of deference and niceness (Bertakis 2009; Schmittdiel et al. 2000; West 1984). Prescriptive gender stereotypes also encourage women to be helpful and sympathetic, and there is some evidence that women physicians are more collegial and participatory in their approach (Bertakis 2009; Hall et al. 1994a, 1994b; Schmittdiel et al. 2000). Thus, P4P based on patient satisfaction may reduce the gender gap because it rewards physicians for skills that are associated with warmth and femininity (Biernat 2012; Cuddy et al. 2008; Fiske, Cuddy, and Glick 2007; Fiske et al. 2002; Ridgeway 2011). This leads to the following hypothesis:
A third basis for physician performance is quality of care, which primarily measures adherence to clinical protocols and subsequent health outcomes (Boyd et al. 2005; Gilmore et al. 2007; Rosenthal et al. 2005; Young et al. 2007). Some health plans, including Medicare and Medicaid, have begun to emphasize quality measures, largely in response to problems of overtreatment. From 1996 to 2005, 18.0 percent of physicians received P4P on the basis of quality (CSHSC 2001, 2004, 2006, 2008).
Quality-based payment schemes reward physicians for meeting a payer’s predefined clinical benchmarks, and sometimes offer incentives for physicians to encourage positive health behaviors such as losing weight or quitting smoking (Rosenthal and Frank 2006; Rosenthal et al. 2005; Rosenthal et al. 2004; Satin 2009; Young et al. 2007). Most quality-based bonus schemes standardize disease management protocols, with the goal of increasing physician adherence to clinical guidelines (Satin 2009). Some have criticized such benchmarks, arguing that they are unreliable as measures of performance and replace clinical judgment with an inappropriate one-size-fits-all strategy (Boyd et al. 2005; Hofer et al. 1999; Larriviere and Bernat 2008; Satin 2009). However, the quality benchmarks set by the Centers for Medicare & Medicaid Services represent explicit formal criteria for performance that are largely gender-neutral. Thus, I hypothesize the following:
Finally, a growing trend in P4P in medicine involves economic profiling, whereby health plans assess physicians’ cost efficiency, and use financial incentives to encourage physicians to limit referrals and hospitalizations, and to reduce the number of tests and procedures that they prescribe (Armour et al. 2001; Grumbach et al. 2006; Hofer et al. 1999; Thomas and Ward 2006). This type of bonus influenced the incomes of 12.7 percent of physicians between 1996 and 2005 (CSHSC 2001, 2004, 2006, 2008). Cost-efficiency data come from claims databases, and health plans use “episode grouper” software to aggregate their members’ claim records into “episodes of care” (periods where one disease process is present, and healthcare providers are managing it) (Thomas, Grazier, and Ward 2004).
Profiling software packages apply approximately six standard methodologies to calculate the expected costs for an episode based on the average actual cost for all episodes of the same type and compare them with the actual cost of cases managed by individual physicians (Thomas et al. 2004). The different methodologies are moderately consistent with one another, and their biggest difference involves determining how to handle outliers (individual cases that are more complex and therefore require more resources) and how to attribute episodes to physicians when multiple physicians are involved in treatment (Thomas et al. 2004). Health plans define efficiency in terms of the average cost of treating a particular condition or providing a specific set of services, without reference to outcomes: Physicians with actual costs that are at or below their expected costs are “efficient,” and those with higher than expected costs are “inefficient” (Thomas et al. 2004). There is some evidence that the incentives created by profiling succeed in reducing physician resource use (Armour et al. 2001). However, because profiling focuses on reducing billing in an FFS-based healthcare system, contradicting other incentives, one expects bonuses for cost efficiency to be associated with lower physician incomes compared with other criteria. At the same time, profiling criteria are relatively objective and gender-neutral. Thus, I hypothesize the following:
While practices, hospitals, and health plans may reward physicians based on one or more of these four explicit criteria, physicians can also receive bonuses that are not based on any formal criteria. Practices distribute these bonuses without tying them explicitly to productivity, patient satisfaction, quality, or cost efficiency. I define the criteria for these bonuses as unspecified because medical practices do not need to use any consistent criteria or follow any formal procedures to allocate these bonuses. Among bonus-eligible physicians in the 1996 to 2005 CTS, 9.5 percent received bonuses based only on unspecified criteria (CSHSC 2001, 2004, 2006, 2008). (Some physicians may receive P4P based on explicit criteria in addition to a discretionary bonus, but I am unable to assess this.) While it is impossible to know the basis of these bonuses, it is likely that managerial discretion has a larger influence when criteria are unspecified, permitting stereotypes to influence evaluations and allocations of rewards (Maas and Torres-Gonzalez 2011). As a result, it is reasonable to expect a larger unexplained gender gap when physicians receive bonuses based on unspecified criteria alone (see H1). Table 1 summarizes the hypotheses.
Table 1. Hypothesized Effects on the Gender Gap in Physician Income.
Note. P4P = pay-for-performance.
Data and Method
Data for this study come from the restricted-use versions of the CTS Physician Surveys for 1996–1997, 1998–1999, 2000–2001, and 2004–2005. The CTS is a longitudinal national study sponsored by the Robert Wood Johnson Foundation to track changes in the healthcare system and the effects of such changes on healthcare delivery.
Each round of the survey sampled physicians from 60 sites (51 metropolitan and 9 nonmetropolitan areas), and the first three surveys also included an independent supplemental national sample of physicians. The sample design involved random selection of physicians from the American Medical Association (AMA) Masterfile (which includes non-AMA members) and the American Osteopathic Association membership file. The survey drew separate samples of primary care physicians (PCPs) and non-PCPs within each site, defining primary care as family practice, general practice, internal medicine, and pediatrics (see Online Appendix A for subspecialties in each category).
All respondents provided direct patient care for at least 20 hours per week. The survey excluded federal employees, specialists in fields with a focus that was not direct patient care, graduates of foreign medical schools who were only temporarily licensed to practice in the United States, physicians who had not completed their medical training (residents, interns, and fellows), and physicians who requested that the AMA not release their names to outsiders. Each survey after 1996–1997 consisted of a combination of physicians who were part of the previous survey and new respondents.
The mode of data collection was a computer-assisted telephone interview (CATI). Interviews with PCPs averaged 21 to 22 minutes, and interviews with non-PCPs averaged 17 to 21 minutes, depending on the survey wave. The Robert Wood Johnson Foundation sent advance letters to physicians and offered a $25 honorarium for participating in the survey. The response rate among eligible physicians was 65.4 percent in the 1996–1997 survey, 60.9 percent in the 1998–1999 survey, 58.6 percent in 2000–2001, and 52.4 percent in 2004–2005.[4] I exclude the 2008 survey because it measures income in wide categories rather than in thousands of dollars, losing a significant amount of information.[5] The CTS Physician Surveys contain measures of whether physicians received bonus pay and on what performance basis, so they are suitable for addressing questions about the effects of P4P on gender inequality in income.
The CTS sample design uses stratification, clustering, and oversampling, so standard error estimation must account for the sample design. The user guide recommends the use of SUDAAN software but also offers a report on how to use Stata to provide reasonable national estimates with the full population (Schaefer et al. 2003). I used Stata commands that declared the survey design’s strata, primary sampling units (clusters), and sampling weight.[6] However, Stata produces larger variance estimates than SUDAAN and is likely to overstate the standard errors, leading to a lower likelihood of finding statistical significance.[7] Because this analysis does not concern the effects of the 60 sites, I used the sampling weight for the national sample (WTPHY4) to approximate nationally representative estimates (Schaefer et al. 2003). I then estimated ordinary least squares (OLS) regression models with linearized standard errors.
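To make the role of the declared strata, primary sampling units, and weights concrete, the following sketch computes a weighted mean and its linearized (Taylor series) standard error for a stratified cluster sample. It is a simplified illustration on a mean rather than on regression coefficients, it assumes with-replacement sampling of clusters within strata, and the function name, record layout, and toy numbers are all hypothetical. It is not meant to reproduce Stata or SUDAAN output for the CTS.

```python
import math
from collections import defaultdict


def weighted_mean_linearized_se(records):
    """Return (weighted mean, linearized standard error).

    Each record is a tuple (stratum, psu, weight, y). This layout is a
    hypothetical stand-in for a survey file, not the CTS format.
    """
    total_weight = sum(w for _, _, w, _ in records)
    mean = sum(w * y for _, _, w, y in records) / total_weight

    # Linearized score of each observation, summed to the PSU level.
    psu_totals = defaultdict(float)
    for stratum, psu, w, y in records:
        psu_totals[(stratum, psu)] += w * (y - mean) / total_weight

    # Group the PSU totals by stratum.
    by_stratum = defaultdict(list)
    for (stratum, _), z in psu_totals.items():
        by_stratum[stratum].append(z)

    variance = 0.0
    for stratum, totals in by_stratum.items():
        n_psu = len(totals)
        if n_psu < 2:
            raise ValueError(f"stratum {stratum!r} needs at least two PSUs")
        center = sum(totals) / n_psu
        # Between-PSU variation within the stratum, with the n / (n - 1)
        # correction used for with-replacement cluster sampling.
        variance += n_psu / (n_psu - 1) * sum((z - center) ** 2 for z in totals)

    return mean, math.sqrt(variance)


# Toy data: two strata with two PSUs each (made-up values).
toy = [
    ("A", 1, 1.0, 1.0), ("A", 1, 1.0, 3.0),
    ("A", 2, 1.0, 5.0), ("A", 2, 1.0, 7.0),
    ("B", 3, 2.0, 2.0), ("B", 4, 2.0, 6.0),
]
toy_mean, toy_se = weighted_mean_linearized_se(toy)
print(toy_mean, toy_se)
```

Because the variance is built from differences between PSU totals within each stratum, observations that share a cluster do not count as independent pieces of information, which is why ignoring the design would tend to understate the standard errors.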
Individual physicians could respond from one to four times across the four survey waves, with an average of 1.5 responses per person. I attempted to use multilevel models (MLM) for change to take advantage of the longitudinal design and to examine within-individual change over time (see Online Appendix D). Where relevant, I discuss the results of these models, although Stata does not allow adjustments for the complex sampling design with MLM, and the average number of responses was too low to meet some assumptions of MLM. As a result, OLS models with linearized standard errors were the best available technique for analyzing these data. To control for differences between those who responded in more than one wave and one-time responders, I created an indicator variable for whether a physician responded to multiple waves of the survey.
Respondents could receive bonuses based on multiple criteria or no specific criteria. I limit the analyses to cases where some kind of “merit” practices, specified or not, are built into the reward structure (84.6 percent of the combined sample). This produced an analytic sample with 22,645 cases, including 16,994 men (75.05 percent) and 5,651 women (24.95 percent). Among these physicians, 44.5 percent were eligible for bonuses based on multiple, explicit criteria (n = 10,082). The models treat the effects of each type of bonus on income as independent and control for other types, although I also tested interaction effects, and they did not significantly improve the model (see Online Appendix D).
Variables in the CTS measure specialty, practice type, ownership, sources of practice revenue, physician compensation, effects of care management strategies, and physicians’ allocation of time, career satisfaction, and perceptions of the ability to deliver care. The surveys also collected demographic information about physician age, gender, years in practice, and hours of work. Surveys after 1996–1997 collected data on race and Hispanic origin.[8] The analyses exclude full owners of solo practices because the survey did not ask these physicians about their bonus-eligibility. (The survey codebook states, “Full owners of solo practices are assumed not eligible for bonuses.”) Health plans sometimes offer bonuses to full owners of solo practices, so this omission represents a limitation of the data.
Additional limitations of the data include the lack of information on the specific bonus amount, the amount received from different stakeholders (practices, hospitals, and health plans), performance ratings, or actual performance. As a result, the CTS data cannot discern whether there are gender differences in ratings, inequitable translation of performance ratings into rewards, or both. The data also contain no information on health outcomes, the quality of care, or characteristics of physician–patient interactions. The CTS also provides no information on patient mix or administrative resources, so that the data cannot speak to gender differences in support staffing or in the relative difficulty of cases. Women may receive less organizational or administrative support, or work with patients who suffer from more serious health problems, leading to “performance support bias” (Madden 2012). Another important limitation of the data for analyses of gender inequality is the lack of data on marital or parental status, even though marriage and parenthood have known effects on income by gender. At the same time, the CTS is a nationally representative sample of physicians that contains data on whether physicians received P4P based on several distinct criteria, making it the best available data for understanding how performance criteria influence gender inequality within a single profession. The CTS also offers a large sample, and provides measures for most relevant human capital, specialization, salary status, and practice type.
Dependent Variables
The dependent variable is the natural logarithm of total income, adjusted for inflation to 2003 dollars. The question on the survey asks physicians, “What was your own net income from the practice of medicine to the nearest $1,000, after expenses but before taxes?” for the year prior to the start of the survey (i.e., 1995, 1997, 1999, or 2003). This question followed questions about P4P, which should have primed respondents to include bonuses in their estimated net income. Income is a continuous variable, with a top category of $400,000 or more (4.65 percent of cases in the analysis). I transformed this measure into the natural logarithm of income because income tends to be right-skewed, and normal probability plots revealed that models using the log transformation better fit OLS assumptions. I performed the log transformation on the previously censored income variable because the data were already top-coded.[9] The models test the effects of different criteria for bonuses on total income by gender.
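The construction of the dependent variable can be sketched as follows. This is a minimal illustration under stated assumptions: the function names are hypothetical, the deflator is a placeholder supplied by the caller rather than an actual price index value, and applying the top code before the inflation adjustment is an assumption, since the text does not specify the order. The second helper uses the standard result that a coefficient b on an indicator variable in a log-income model corresponds to a proportional difference of exp(b) − 1.

```python
import math

TOP_CODE_THOUSANDS = 400.0  # top income category described in the text


def log_real_income(reported_thousands, deflator_to_2003):
    """Natural log of top-coded income in 2003 dollars (thousands).

    `deflator_to_2003` is a hypothetical adjustment factor; no actual
    price index values are assumed here.
    """
    if reported_thousands <= 0:
        raise ValueError("log income is undefined for non-positive income")
    # Assumption: the top code is applied before the inflation adjustment.
    capped = min(reported_thousands, TOP_CODE_THOUSANDS)
    return math.log(capped * deflator_to_2003)


def percent_difference(coefficient):
    """Convert a coefficient from a log-income model to a percent difference."""
    return (math.exp(coefficient) - 1.0) * 100.0


# Incomes above the top code are indistinguishable after censoring.
print(log_real_income(650.0, 1.0) == log_real_income(400.0, 1.0))
```

One consequence of top-coding is visible in the usage line: any two incomes above the threshold yield the same value of the dependent variable, so differences among the highest earners cannot contribute to the estimated gaps.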
Independent and Control Variables
The primary independent variables of interest are gender, performance criteria, and their interactions. The surveys contained indicators for bonuses based on productivity, patient satisfaction, quality of care, and profiling. (See Online Appendix B for the survey questions about P4P.) I also created a variable for bonuses based on unspecified criteria using the following questions: “Is your base salary a fixed amount that will not change until your salary is renegotiated or is it adjusted up or down during the present contract period depending on your performance or that of the practice?” “Are you currently eligible to earn income through any type of bonus or incentive plan?” and “Are you eligible to receive end-of-year adjustments, returns on withholds, or any type of supplemental payments, either from this practice or from health plans?” If a physician indicated eligibility for additional pay in one or more of these questions, but did not indicate that he or she received a bonus based on productivity, patient satisfaction, quality, or profiling, then he or she had a bonus based on “unspecified” criteria (the reference category in the analysis).
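The coding rule for the "unspecified" category can be expressed as a short function. The sketch below follows the rule as described in the paragraph above; the argument names and the dictionary layout are hypothetical and do not correspond to actual CTS variable names.

```python
EXPLICIT_CRITERIA = ("productivity", "satisfaction", "quality", "profiling")


def bonus_criteria(adjustable_salary, bonus_eligible, supplemental_eligible, explicit):
    """Return the set of P4P criteria for one respondent.

    The first three arguments are yes/no answers to the three eligibility
    questions; `explicit` maps each explicit criterion name to True/False.
    """
    named = {c for c in EXPLICIT_CRITERIA if explicit.get(c, False)}
    if named:
        return named                      # one or more explicit criteria reported
    if adjustable_salary or bonus_eligible or supplemental_eligible:
        return {"unspecified"}            # eligible for extra pay, no explicit basis
    return set()                          # not eligible for performance pay
```

Returning a set reflects the fact that explicit criteria are not mutually exclusive, whereas "unspecified" applies only when none of the four explicit criteria is reported.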
Control variables include salary status (salaried = 1), as both salaried and unsalaried physicians can receive P4P based on any performance criteria. Models control for survey wave, Census region (Northeast, Midwest, South, and West, with Pacific as reference), 10 private practice (reference includes health maintenance organizations [HMO], medical schools, hospitals, and other practice types), and percentage of revenue from Medicare and Medicaid. The CTS Physician Surveys grouped subspecialties into larger specialty groups: medical specializations, surgical specializations, psychiatry, and obstetrics/gynecology (see Online Appendix A). Primary care is the reference category.
Relevant productivity-related characteristics include hours per week and years in practice. I control for the natural logarithm of hours worked per week (ln[hours]) because the use of an untransformed variable for hours per week assumes that income has a linear relationship with the number of hours that an individual works, when actual returns to hours typically diminish as hours increase (Morgan and Arthur 2005). Although income increases with longer work weeks, it does so at a decreasing rate, so that the appropriate specification of the pay equation for professional workers with a dependent variable of ln(income) uses ln(hours) as an independent variable (Morgan and Arthur 2005). Models using the untransformed variable for weekly hours produced similar results (available from the author).
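The implication of the log-log specification can be illustrated numerically. In a model of ln(income) on ln(hours), the coefficient on ln(hours) is an elasticity, so a given percentage change in hours implies a fixed percentage change in income. The sketch below is a hedged illustration: the elasticity of 0.5 used in the usage note is hypothetical and is not an estimate from this study.

```python
def pct_change_income(elasticity, pct_change_hours):
    """Percent change in income implied by a log-log model for a percent change in hours."""
    hours_ratio = 1.0 + pct_change_hours / 100.0
    return 100.0 * (hours_ratio ** elasticity - 1.0)
```

With a hypothetical elasticity of 0.5, `pct_change_income(0.5, 100)` shows that doubling weekly hours raises income by about 41 percent rather than 100 percent, which is the pattern of diminishing returns that motivates the transformation. An elasticity of 1 would reproduce proportional returns to hours.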
Analysis of outliers suggested that influential cases were not a problem for the analysis, and residual plots demonstrated that the models conformed to OLS assumptions of uncorrelated and normally distributed error terms. However, collinearity diagnostics suggested substantial, but expected, collinearity between gender and the interaction terms, which is likely to inflate the standard errors of these coefficients. 11 Also, the number of design degrees of freedom for female physicians after adjusting for the complex sample design was 642, which is relatively small given the large number of variables included in the models. This may also reduce the likelihood of finding statistically significant Gender × P4P criteria interactions.
Results
Table 2 presents descriptive statistics for the primary variables of interest, for all respondents in the models and separately for men and women. (See Online Appendix C for descriptive statistics for control variables.) 12 According to Table 2, productivity was the most common performance basis among physicians who were eligible for some kind of merit-based pay (83 percent of bonus-eligible physicians). In comparison, 27 percent of bonus-eligible physicians received compensation based on patient satisfaction, 21 percent received bonuses for quality of care, 15 percent received bonuses for cost efficiency, and 11 percent received bonuses based on unspecified criteria. Men were significantly more likely to receive a bonus based on productivity, while women were significantly more likely to receive compensation for patient satisfaction, quality of care, or efficiency.
Table 2. Descriptive Statistics and Metrics for Gender, P4P, and Income in the CTS Physician Surveys, Using Survey Estimation.
Source. Community Tracking Study Physician Surveys, combined data 1996 to 2005.
Note. P4P = pay-for-performance; CTS = Community Tracking Study; PSU = Primary Sampling Unit.
Denotes missing standard error because of stratum with a single sampling unit.
Two-tailed t test for difference of means: *p < .05 **p < .001.
Average adjusted income was more than $203,000 in 2003 dollars, and Table 2 reveals a gender gap in income that favored men under all reward schemes. According to Table 2, physicians earned above-average incomes when they received bonuses based on unspecified criteria, while physicians who received bonuses for patient satisfaction, quality of care, or cost efficiency earned below-average incomes for the sample. The raw gender gap in compensation among physicians who were eligible for merit pay was $70,811.60 in 2003 dollars. This gap also varied depending on criteria for allocating bonus pay: from a low of $54,460.20 among physicians who received bonuses only for quality to a high of $88,190.70 among those who received bonuses without explicit criteria. These raw averages offer some initial support for H1, which hypothesized that the gap would be wider when P4P is based on unspecified criteria.
Table 3 presents OLS models of the natural logarithm of total adjusted income for physicians. Model 1 is a control model that includes human capital, work effort, specialization, and practice characteristics and explains about 34 percent of the variance in compensation. Using the formula
Table 3. OLS Regression Models of the Natural Log of Total Compensation in 2003 Dollars.
Source. Community Tracking Study Physician Surveys, combined data 1996 to 2005.
Note. Numbers in parentheses are linearized standard errors, adjusted for the complex sample design and sampling weights. OLS = ordinary least squares; P4P = pay-for-performance; OBGyn = obstetrics/gynecology; PSU = Primary Sampling Unit.
In two-tailed tests: †p < .10. *p < .05. **p < .01. ***p < .001.
According to Model 1, white physicians earned 4.3 percent more than nonwhite physicians, and salaried physicians earned 7.8 percent less than similar, unsalaried physicians. In this model, a 10 percent increase in hours is associated with a 3.8 percent increase in physician income (Gordon 2010). 13 All specializations earned significantly more than primary care doctors. Surgical specialties were the most highly paid, bringing in 52.3 percent more than primary care, followed by obstetrics/gynecology (46.6 percent more) and medical specializations (33.4 percent more). Physicians who worked in private practice earned an average of 6.5 percent more than physicians in other types of practice.
Model 2 adds the effects of specific bonus criteria, compared with a reference group of physicians who received bonuses based on unspecified criteria, and explains 34.2 percent of the variance in income. Although this is a tiny increment over Model 1, the addition of the bonus criteria variables is statistically significant in a Wald test evaluating the difference between the nested models, F(4, 2136) = 3.74, p = .005. 14 In this model, the overall gender gap and the effects of control variables do not change from Model 1. However, compared with unspecified criteria, bonuses based on productivity or profiling were associated with 4.2 percent and 1.1 percent lower incomes, respectively, while bonuses based on patient satisfaction were associated with 4.9 percent higher incomes. Quality of care had no significant effects. Linear combinations of coefficients revealed that bonuses based on patient satisfaction were associated with significantly higher incomes than bonuses based on productivity, quality, or profiling by 9.46 percent (p < .01), 6.02 percent (p < .05), and 8.74 percent (p < .01), respectively, a surprising effect given the association between patient satisfaction and culturally feminine warmth. Productivity-, quality-, and profiling-based bonuses were not significantly different from each other.
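The point estimate of one of these linear combinations can be approximated from the rounded percentages in the text. The sketch below assumes that percentage effects were derived from log-income coefficients as 100 × (exp(b) − 1), a common convention for log-linear models that the text does not confirm at this point; the function names are illustrative. Only the point estimate is reproduced, because the standard error of a linear combination also depends on the covariance between coefficients, which is not reported here.

```python
import math


def percent_to_coef(pct):
    """Log-scale coefficient implied by a percentage effect (assumed conversion)."""
    return math.log(1.0 + pct / 100.0)


def coef_to_percent(b):
    """Percentage effect implied by a log-scale coefficient (assumed conversion)."""
    return 100.0 * (math.exp(b) - 1.0)


# Model 2 effects relative to unspecified criteria, as reported in the text (rounded).
b_satisfaction = percent_to_coef(4.9)
b_productivity = percent_to_coef(-4.2)

# Linear combination of coefficients: satisfaction minus productivity.
satisfaction_vs_productivity = coef_to_percent(b_satisfaction - b_productivity)
print(round(satisfaction_vs_productivity, 1))  # prints 9.5
```

The result of roughly 9.5 percent is close to the 9.46 percent reported for satisfaction versus productivity; a small discrepancy is expected because the inputs are rounded percentages rather than the underlying coefficients.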
Model 3 includes interaction terms for Female × Bonus criteria, and explains 34.5 percent of the variance in physician income. Again, the change in R² is very small, and the effects of control variables remain stable, but the inclusion of gender interactions offers a statistically significant improvement in the model, F(4, 2136) = 3.05, p = .016. The main effect of female reveals that the gender gap among physicians who receive bonuses based on unspecified criteria is 27.4 percent, which is considerably larger than the gap in Models 1 and 2 (supporting H1). The significant main effect for productivity in Model 3 suggests that these bonuses are associated with 6.5 percent lower incomes than unspecified bonuses for men, while the significant interaction of Female × Productivity suggests that bonuses based on productivity are associated with a 2.8 percent higher income for women than bonuses based on unspecified criteria.
These results support H2B, which predicted that productivity criteria would reduce the unexplained gender gap compared with unspecified criteria because productivity is an explicit and objective criterion for performance. Possibly because it is the most common explicit criterion, productivity is also the only P4P basis that has any discernible impact on the gender gap in physician pay compared with unspecified criteria.
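To see how the main effects and the interaction combine, the reported Model 3 percentages can be chained as income ratios. This is an illustrative back-of-the-envelope calculation, not a figure reported in the article: it assumes that the 27.4 percent gap means women earn 27.4 percent less than comparable men under unspecified criteria, and that the rounded percentages can be treated as multiplicative ratios. The variable names are hypothetical.

```python
def gap_percent(female_to_male_ratio):
    """Gender gap expressed as the percent by which women's income falls below men's."""
    return 100.0 * (1.0 - female_to_male_ratio)


# Rounded Model 3 effects as reported in the text.
ratio_unspecified = 1.0 - 0.274    # women relative to men, unspecified criteria
men_productivity = 1.0 - 0.065     # men: productivity relative to unspecified
women_productivity = 1.0 + 0.028   # women: productivity relative to unspecified

# Women-to-men income ratio when bonuses are based on productivity.
ratio_productivity = ratio_unspecified * women_productivity / men_productivity
implied_gap = gap_percent(ratio_productivity)
```

Under these assumptions the implied gap under productivity criteria is roughly 20 percent, compared with 27.4 percent under unspecified criteria, which is the direction of the effect predicted by H2B.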
Linear combinations of coefficients reveal that productivity-based P4P is associated with significantly lower income than P4P based on patient satisfaction and with significantly higher income than P4P based on economic profiling, but that there is no significant gender difference in these effects. Thus, the results fail to support H2A, which suggested that the gender gap would be wider under productivity than other explicit criteria because productivity is associated with masculine competence.
In the case of pay for patient satisfaction, the main effect of these bonuses is not statistically significant (p = .10) in Model 3, in contrast to the significant main effect of this criterion in Model 2 (p = .008). This is possibly due to the inflation of the linearized standard errors in Stata and the small number of design degrees of freedom relative to the number of variables in the model. Linear combinations of coefficients in Model 3 reveal that patient satisfaction bonuses are associated with significantly higher incomes than bonuses based on productivity and profiling, by 10.95 percent (p < .01) and 9.42 percent (p < .01), respectively. However, the lack of a significant Female × Satisfaction interaction in Model 3 fails to support H3, revealing no gender difference in the effect of satisfaction criteria on income. Patient satisfaction may seem like the most culturally feminine criterion for P4P, but it does not reduce the gender gap in pay.
Bonuses based on quality have no significant main effect and no significant interaction with gender, thus failing to support H4. However, an identically specified MLM analysis suggested that the Female × Quality interaction was positive and significant in a one-tailed test, so the results are not conclusive enough to fully reject H4. Conversely, the negative main effect of profiling-based bonuses in Model 3 suggests that these bonuses are associated with 5.2 percent lower physician incomes than unspecified criteria. The interaction with gender is not statistically significant in this model, thus failing to support H5, although the interaction coefficient is equal in magnitude and opposite in direction to the main effect of profiling, and the same opposing effects were evident and statistically significant in MLM analyses (see Online Appendix D). Fully interacted, separate models by gender also revealed that profiling had a significant negative effect on men’s incomes, but no significant effect on women’s (not shown).
While one must be cautious about interpreting nonsignificant coefficients, the lack of significance in this model could be due to Stata’s estimation of the linearized standard errors, multicollinearity between the gender main effect and the interaction terms, and the small design degrees of freedom for the subsample of women after adjusting for the complex sample design. It appears that P4P based on profiling has a negative effect on men’s incomes in medicine but no effect on women’s, although future research will need to further test this hypothesis. As a robustness check, I also analyzed a model that included two-way interactions between criteria, but this did not significantly improve upon models that treat criteria as independent (see Online Appendix D).
Discussion and Conclusion
This study builds on existing research on gender and performance-based pay by focusing on how different performance criteria affect the pay gap when they form the basis for P4P. On one hand, P4P schemes can highlight skills and attributes that are unrelated to gender as the basis for pay decisions, and have the potential to increase accountability for performance assessments and reward allocations. On the other hand, definitions and measures of performance vary in their degree of accountability and objectivity.
One of the research questions guiding this study is whether implicit versus explicit criteria for performance influence the size of the gender gap in pay. My analysis illustrates that the gender gap is larger in medicine when criteria are unspecified compared with the most common explicit criterion, productivity, possibly because unspecified criteria can increase opportunities for managerial discretion to influence both evaluations and reward allocations. Returning to the original hypotheses (Table 1), these results support H1, which predicted that bonuses without formal criteria would be associated with a wider gender pay gap.
Conversely, productivity criteria are based on relatively objective measures, and this appears to reduce the unexplained gender gap. The difference between the effects of productivity-based versus unspecified bonuses, thus, also supports H2B. In fact, the availability of objective productivity measures appears to be more important than any association between productivity and culturally masculine definitions of competence: productivity bonuses do not produce a wider gender gap in pay than other explicit P4P criteria, failing to support H2A. The results also fail to support H3 to H5 regarding P4P based on patient satisfaction, quality, or profiling, although there are some limits to the data that may have influenced the ability to discern gender differences based on these P4P criteria. First, these criteria are the basis for P4P less often than productivity, and often in combination with other criteria, potentially diluting any effects. Less than 2 percent of respondents received P4P based on each of these criteria alone. Second, after accounting for the complex sample design, there may have been too little statistical power to discern significant interactions. Finally, some physicians who received P4P based on one or more explicit criteria could also have received a bonus based on unspecified criteria, thus diluting the effects of the explicit criteria on pay.
Another important research question involved criteria associated with competence and warmth as gendered dimensions of social evaluation. Results in this regard are somewhat surprising, especially with respect to the criterion highlighting culturally feminine skills, namely patient satisfaction. The insignificant Female × Satisfaction interaction suggests that bonuses based on satisfaction have no significant effect on the gender gap in physician income, failing to support H3. Other models testing the robustness of this effect, including separate models by gender, also suggest that patient satisfaction bonuses are associated with similar effects for both men and women. These results create an interesting puzzle for future research because patient satisfaction criteria are less common than productivity criteria, more subjective than the other explicit criteria, and more loosely coupled than productivity with remuneration in the FFS American healthcare system.
It is possible, of course, that satisfaction effects are an artifact of the data: Only 1.7 percent of respondents received P4P based on patient satisfaction alone. However, analyzing the effects of culturally feminine definitions of performance, such as patient satisfaction, on the gender gap is theoretically important because it is relevant to debates about gender and the social construction of skill. Previous research has suggested that employers take for granted many abilities that most women have by virtue of their gender socialization and do not reward those qualities as skills, thus devaluing care-related skills in paid work (England and Folbre 1999; England et al. 1994; Steinberg 1990). Defining and rewarding patient satisfaction is a way that some medical practices, hospitals, and health plans can elevate stereotypically feminine qualities (warmth) to the level of valued skills, thus reducing the extent to which care work is devalued.
At the same time, as the models revealed, there may be no gender difference in the impact of highlighting “feminine” skills. Some research cautions that optimistic visions of feminine patient-centered care do not fit with recent changes to American medicine that make it difficult for all physicians to provide humane care, such as increasing pressures to maximize efficiency and productivity, declining reimbursements, pressures to limit the time spent per patient, heightened use of clinical protocols, and the demands of insurance companies and quality control agencies (Boulis and Jacobs 2008). In the end, organizational and institutional constraints on physician behavior and limitations to the reach of patient satisfaction-based bonuses may restrict their ability to ameliorate the unexplained gender gap in pay.
The lack of significant effects of quality measures may similarly be related to the structure of American healthcare. Measurement of quality is not currently consistent with FFS-based American medicine, which rewards physicians and practices for performing more procedures. Patients also tend to demand more tests and treatments, whether or not they are evidence-based, so that quality criteria can clash with consumer satisfaction. In other words, bonuses based on quality may not effectively counteract other forces. Moreover, quality measures suffer larger problems of underreach than satisfaction-based bonuses. Only 20.6 percent of physicians received bonuses for quality at all, and even fewer received bonuses for quality alone (only 208 cases, or 0.9 percent).
Finally, economic profiling had a consistent, negative main effect on income. In Model 3, it has no statistically significant interaction with gender, although other models suggest that the negative effect holds only for men and suggest that it could narrow the gender gap in income. What is clear is that profiling criteria contradict other incentives in the FFS system, leading to lower overall compensation.
What are the implications of the findings for understanding evaluation and compensation processes in organizational settings? The results underscore the importance of interrogating claims about meritocracy by considering definitions of performance. Different explicit definitions of performance have the potential to reward particular kinds of skills, including the traditionally feminine skills of caring for others and expressing empathy.
While many occupations may define performance based on culturally masculine criteria, actual performance may depend as much or more on relational skills associated with warmth. For example, effective managers often depend more on relational than on technical skills. Elevating these types of skills into explicit performance criteria might have the potential to invoke culturally feminine meanings and improve opportunities for women. Future research might assess the use of performance criteria based on qualities associated with warmth, and their effects on gender inequality, in other occupational settings. Other studies might also examine racial inequality in the effects of P4P based on multiple criteria using data with more nonwhite respondents. The competence/warmth dichotomy does not map onto racial stereotypes in the same way as gender stereotypes, but objective evaluation criteria might similarly help racial minorities. Exploring the effects of P4P criteria on racial-ethnic disparities might offer stronger conclusions about how definitions of performance and processes of evaluation influence inequality.
Footnotes
Acknowledgements
I would like to thank Rochelle Cote, Martha Foschi, Megan Henley, Reeve Vanneman, Blair Wheaton, Jane Zavisca, the Inequality Workshop at the University of Arizona, and three anonymous reviewers for comments on earlier drafts.
Declaration of Conflicting Interests
The author(s) declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.
Funding
The author(s) received no financial support for the research, authorship, and/or publication of this article.
Notes
References
