Abstract
Patients increasingly rely on online physician ratings to select their physicians and make health care decisions. However, it is unclear whether online physician ratings signal physician quality information and affect patients’ physician choices. By combining physician rating data from Yelp with data from Medicare, which covers a large elderly patient group, the authors find that ratings are positively associated with important measures of physician quality, including physicians’ credentials, adherence to clinical guidelines, and patients’ health outcomes. They introduce novel instrumental variables, where reviewers’ leniency in rating other businesses is employed as an instrument for physicians’ ratings. They find that an increase in physicians’ average rating increases physicians’ patient flow. To understand the quality signals that patients respond to, the authors also use the latent Dirichlet allocation model and extract topics from review texts. Patients respond differentially to different information and respond most to information about physicians’ interpersonal and clinical skills. In addition, rating credibility, accessibility, and strength of other existent signals moderate the positive effects of online ratings on patient flow. Overall, online physician rating platforms can promote efficiency by disseminating important quality information to patients and directing patients to higher-quality physicians.
With the rise of technologies and online health information, health care has become increasingly consumer-oriented, and consumers no longer passively receive health care services. In particular, many consumers (patients) rely on user-generated online physician ratings to select their physicians and make informed decisions for their health care. A recent survey finds that almost three-quarters of patients rely on online reviews as the very first step to finding a new doctor (Hedges and Couey 2020). A defining characteristic of the health care market is the presence of imperfect information and information asymmetry between health care providers and patients, where consumers typically lack the specialized knowledge required to evaluate the quality of the service (Arrow 1963). In fact, health care services are generally regarded as a credence good, which is characterized by the persistent presence of information asymmetry even after consumption (Darby and Karni 1973). Although several public and private organizations have initiated programs to publicly report the quality of medical care provided by physicians or hospitals to help reduce information asymmetry in health care markets and help facilitate informed choices, traditional outcome-based health care report cards have not elicited significant consumer response (Dranove et al. 2003; Kolstad 2013; Werner et al. 2012). We examine whether user-generated online physician reviews and ratings, which have become increasingly popular, can help reduce information asymmetry, provide user-friendly information to patients, and improve efficiency in health care management.
Despite the popularity of consumer-generated online physician ratings, the effectiveness and reliability of online physician ratings have remained controversial and unclear. The theory of credence goods predicts that patient-generated online physician reviews would not be effective in providing information on physicians’ quality, as patients lack medical knowledge. In line with this argument, the American Medical Association, a professional association of physicians, has raised concerns that user-generated physician ratings may lack useful information and that the ratings may not reflect actual patient treatment outcomes (Lieber 2012). Yet online ratings may still contain informative and valid signals regarding physician qualities. Patients may be able to infer physicians’ clinical quality by observing their own health conditions or directly assess physicians’ empathy, attentiveness, and communication skills. Because the effectiveness of online physician ratings is unclear given the credence aspect of health care services, our first research question empirically examines whether online ratings are correlated with physician quality. It also remains an important empirical question as to whether and how user-generated online ratings impact patients’ choice of physicians. So far, scarce research has studied the effects of user-generated ratings on patient choice. To understand the impact of user-generated online ratings on health care management, our second research question examines the effect of online ratings on patient flow and the underlying mechanisms through which ratings affect patients’ choices. Our conceptual framework is summarized in Figure 1, and the two research questions are formulated as follows:
Are online ratings correlated with physician quality?
Do online ratings affect patients’ physician choices? What are the underlying mechanisms through which ratings affect patients’ physician choices?

Figure 1. Conceptual Framework.
To address our research questions, we collect all available Yelp physician rating data up to June 2017 and match them with Medicare claims data and other external data sources. Yelp has been one of the major players in online physician ratings, and the number of reviews has grown substantially in recent years. We collect all Yelp physician rating data at the individual review level over time, which allows us to create longitudinal physician review data and examine how ratings change for a physician. Then, we link ratings with Medicare claims data at the individual physician level. Medicare covers more than 61 million U.S. residents, and Medicare spending grew to $830 billion in 2020 and accounted for 20% of national health expenditure (Centers for Medicare and Medicaid Services [CMS] 2020). As the largest insurer in the United States, Medicare is also expected to experience the fastest spending growth among major payers because it has the highest projected enrollment growth (CMS 2020). Medicare data allow us to examine the effects of online ratings on an important and large patient group of elderly people. The combination of Yelp and Medicare data enables us to analyze the effects of ratings at a granular level for more than 36,000 physicians who are rated on Yelp.
We find that physicians with higher ratings have better educational and professional credentials, measured by board certification status, ranks of schools, and accreditations. Furthermore, physicians with higher ratings show higher adherence to clinical guidelines, and patients of physicians with higher ratings display better clinical outcomes, which are captured by lower preventable inpatient admission rates and better ex post risk scores. The findings indicate that online reviews are highly correlated with important measurements of clinical quality and provide important quality signals to patients.
We then explore the second major research question of whether Yelp ratings affect patients’ physician choices. If consumers base their physician choice decisions on online ratings, our findings that physicians with higher ratings have higher clinical quality would indicate that patients will be matched with higher-quality physicians. We study the effects of online ratings on Medicare patients’ physician choices by examining the effects of ratings on patient flow, measured by physicians’ revenue and patient volume. To examine the treatment effect of being rated on Yelp and the treatment effect of receiving higher ratings, we employ a difference-in-differences (DiD) approach along with instrumental variable (IV) and propensity score weighting methods. We exploit variations across and within physicians in the timing of their first Yelp review and exploit the panel nature of the data to compare rated physicians and unrated physicians before and after being rated on Yelp. To examine the effects of higher online ratings on patient flows, we exploit variations in ratings across and within physicians. Furthermore, we introduce and employ a novel IV approach to address potential endogeneity concerns.
Methodologically, the estimation of the causal effects of online ratings on consumer demand is often empirically challenging due to endogeneity concerns. For example, if physicians receive different average ratings for reasons that also directly affect their patient flow, an ordinary least squares (OLS) estimation of the DiD model would be biased and confound the causal interpretation of the effects of ratings. To address the potential endogeneity concern, we instrument for Yelp ratings using the leniency of reviewers, where reviewers’ leniency is defined as their average ratings when rating other businesses on Yelp. On the Yelp platform, users can review businesses, including restaurants, physicians, and beauty salons. The main intuition behind our IV is that some reviewers may be more lenient (or less harsh) in rating any business while other reviewers may be harsher in rating any business. To construct our IV, we collect data on all other review ratings that reviewers generated on the platform. Using such rich detailed data, we calculate each physician's annual cumulative reviewer leniency as the average leniency for all reviewers who have rated that physician by each year. A reviewer's leniency measurement captures the reviewer's baseline leniency or harshness in reviewing any businesses, which would influence a physician's rating, as a physician rated by more lenient reviewers would have higher ratings. At the same time, reviewers’ leniency would be orthogonal to potential endogenous factors, such as physicians’ time-varying characteristics, that may directly affect patient flow. When we instrument for a physician's yearly cumulative average ratings with the leniency of reviewers, we find that a one-star increase in a physician's average rating has statistically significant positive effects on patient flow and increases the physician's annual patient revenue and volume by 1.9% and 1.2%, respectively. 
The estimated effect of online ratings may underestimate the phenomenon as the estimate is based on the elderly patient population, and younger generations rely more heavily on electronic word of mouth (WOM) for information acquisition. We conduct various robustness checks to ensure that our results are robust, and we provide additional analysis that validates our empirical model and IVs.
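The construction of the leniency instrument described above can be illustrated with a minimal Python sketch. It is not the authors' implementation: the toy review tuples, the function names, and the choice to average a reviewer's ratings of all businesses other than the focal physician (over all years) are assumptions made for illustration only.

```python
from statistics import mean

# Hypothetical toy reviews: (reviewer_id, business_id, year, stars).
REVIEWS = [
    ("r1", "dr_a", 2013, 5),
    ("r1", "cafe_1", 2012, 5),
    ("r1", "salon_1", 2013, 4),
    ("r2", "dr_a", 2014, 2),
    ("r2", "cafe_1", 2013, 1),
    ("r2", "cafe_2", 2014, 3),
    ("r3", "dr_b", 2013, 4),
    ("r3", "cafe_2", 2012, 4),
]


def reviewer_leniency(reviews, reviewer, focal_physician):
    """Average stars the reviewer gave to businesses other than the focal physician.

    Assumption: all of the reviewer's other ratings count, regardless of year.
    """
    others = [s for r, b, _y, s in reviews if r == reviewer and b != focal_physician]
    return mean(others) if others else None


def cumulative_reviewer_leniency(reviews, physician, year):
    """Mean leniency of every reviewer who rated `physician` by the end of `year`."""
    raters = {r for r, b, y, _s in reviews if b == physician and y <= year}
    values = [reviewer_leniency(reviews, r, physician) for r in raters]
    values = [v for v in values if v is not None]  # drop reviewers with no other ratings
    return mean(values) if values else None


print(cumulative_reviewer_leniency(REVIEWS, "dr_a", 2013))
print(cumulative_reviewer_leniency(REVIEWS, "dr_a", 2014))
```

In this toy data, the physician "dr_a" is first rated by a lenient reviewer and later by a harsher one, so the cumulative leniency measure falls between the two years even though nothing about the physician has changed. This is the kind of variation the instrument is meant to isolate.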
To further investigate the underlying mechanisms of the effect of online ratings on patient flow, we examine what information is included in online physician reviews. We use a machine-learning latent Dirichlet allocation (LDA) algorithm on all collected Yelp text reviews to detect the common topics of the written reviews. We find that reviews contain signals about physicians’ service-related quality (e.g., a physician's bedside manner, waiting time, office amenities) and clinical and treatment-related quality (e.g., treatment, diagnosis, prescription, outcomes). When choosing a physician, patients may differentially respond to service-related and treatment-related information in text reviews. To understand quality signals that patients respond to, we use topic weights from the LDA model to construct rating variables that capture physicians’ ratings on bedside manner, other service-related quality, and clinical-related quality. We find that patients respond most to information on physicians’ interpersonal and clinical skills.
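One plausible way to turn topic weights into topic-specific rating variables is sketched below. The exact construction is not specified in this section, so the weighted-average formula, the topic labels, and the toy weights are all assumptions. In practice the weights would come from a fitted LDA model's document-topic distribution.

```python
# Hypothetical topic labels mirroring the three dimensions named in the text.
TOPICS = ["bedside_manner", "other_service", "clinical"]

# Hypothetical reviews for one physician: (stars, {topic: LDA weight}).
PHYSICIAN_REVIEWS = [
    (5, {"bedside_manner": 0.6, "other_service": 0.1, "clinical": 0.3}),
    (2, {"bedside_manner": 0.1, "other_service": 0.7, "clinical": 0.2}),
]


def topic_weighted_ratings(reviews, topics):
    """Assumed formula: rating_k = sum_i(w_ik * stars_i) / sum_i(w_ik)."""
    ratings = {}
    for topic in topics:
        total_weight = sum(w.get(topic, 0.0) for _stars, w in reviews)
        if total_weight > 0:
            weighted_sum = sum(stars * w.get(topic, 0.0) for stars, w in reviews)
            ratings[topic] = weighted_sum / total_weight
        else:
            ratings[topic] = None  # topic never discussed for this physician
    return ratings


print(topic_weighted_ratings(PHYSICIAN_REVIEWS, TOPICS))
```

Under this assumed weighting, a review that mostly discusses bedside manner contributes mostly to the bedside manner rating, so a physician can score high on one dimension and low on another.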
Furthermore, we extend signaling theory (Nelson 1970; Spence 1973) to examine how online rating credibility, accessibility, and the strength of other existent signals moderate the positive effects of online ratings on patient flow. We find that patients’ responses to online ratings are greater for physicians with more reviews. This finding is consistent with signaling theory, which would indicate that ratings would signal more information about a physician's quality when there are a greater number of reviews. We also find that the effects of ratings on patient flow are larger for physicians with a greater share of younger patients, who tend to have greater access to online rating information. Furthermore, we find that the positive effect of positive online ratings on patient flow is greater for sole practitioners, who may lack hospital or organization brand names that can provide credible signals. For hospital-based physicians, hospital or organization brand names can provide credible signals and reduce uncertainties that patients face. For self-employed physicians without brand names, good online ratings can provide extra information, help signal their good quality, and reduce patients’ uncertainties.
We contribute to the literature by providing empirical evidence that physician ratings provide quality signal information and have positive effects on physicians’ patient flow. Given the credence aspect of health care services, the reliability and efficacy of user-generated physician ratings have been a controversial topic. Furthermore, despite academic, policy, and health care leaders’ interests in examining user-generated physician ratings and their effects on patients’ physician choices, empirical evidence on the effects of these ratings has been scarce. In addition to providing this empirical evidence, we examine the potential mechanisms through which online ratings affect patient flow by analyzing reviewers’ textual comments and extending the prior literature on signaling theory. Finally, we make methodological contributions by introducing a novel IV approach, which allows researchers to understand how ratings affect consumer demand. Taken together, our findings have important managerial and policy implications for major constituents in the health care sector, including health care managers, policy makers, patients, physicians, and online physician rating platforms.
Theoretical Background, Related Literature, and Hypothesis Development
Information Asymmetry in Health Care
The literature on the economics of information categorizes products as search, experience, and credence goods, depending on the level of information asymmetry present between buyers and sellers (Darby and Karni 1973; Klein 1998; Nelson 1970). Search goods are products or services that can be easily evaluated prior to purchase or consumption. Experience goods, such as restaurants and hair salons, cannot be easily evaluated before purchase but can be accurately evaluated after consumption (Nelson 1970). Credence goods, in contrast, are characterized by the persistent presence of information asymmetry even after consumption (Darby and Karni 1973). For credence goods, consumers lack expertise and knowledge to accurately evaluate the quality even after consumption. In general, health care services have been regarded as credence goods, as patients may lack medical training to assess the accuracy and appropriateness of physicians’ diagnosis and treatment (Arrow 1963; Dulleck, Kerschbamer, and Sutter 2011). Asymmetric information in credence goods can lead to inefficiencies and market failure, and partial solutions to the asymmetric information problem would include government regulations, warranties, and reputation (Dulleck, Kerschbamer, and Sutter 2011).
To address information asymmetry present in health care markets, several public and private organizations have initiated programs to publicly report the quality of medical care provided by physicians or hospitals. For instance, New York's publication of physician and hospital coronary artery bypass graft surgery mortality rates is an example of a health care report card that was introduced to provide information about the performance of hospitals and physicians (Dranove et al. 2003). However, the prior literature on the traditional outcome-based health care report cards provides mixed evidence on the effects of report cards and finds that they do not elicit significant consumer response and do not always have positive effects on physician behavior (Dranove et al. 2003; Kolstad 2013; Werner et al. 2012). For example, Dranove et al. (2003) find that cardiac surgery report cards in New York and Pennsylvania decrease patient and social welfare due to provider selection. Werner et al. (2012) also find that publicly reported quality information has minimal consumer response effects in the post–acute care market.
The traditional health care report cards have not been effective, as consumers have difficulty understanding the information in report cards because of its technical nature and the way it is presented (Brook et al. 2002). Consumers also report that they have little time to review the information included in traditional health care report cards and that they find the information provided by the health plans and providers not trustworthy, which leads them to prefer relying instead on WOM information from family and friends when making health care decisions (Brook et al. 2002). In response to such concerns, researchers have proposed next-generation report cards, which should be easy to use and understand for consumers (Fung et al. 2008; Kolstad and Chernew 2009; Schneider and Epstein 1998). In fact, consumer-generated online physician reviews and ratings—a new type of report card that can provide user-friendly information—have become more popular. As an increasing number of patients rely on internet review sites to find doctors, it is important to examine whether user-generated ratings convey physician quality information and whether ratings affect patients’ physician choices.
Online Physician Reviews and Physician Quality
With a recent survey indicating that 90% of patients use online reviews to select and evaluate physicians, the number of patients who use online reviews to select a new physician has dramatically increased (Hedges and Couey 2020). The most-visited online platforms for physician ratings include Healthgrades, Yelp, and RateMDs (Furnas et al. 2020; Kadry et al. 2011). Despite the popularity of consumer-generated online physician reviews and ratings, the effectiveness and reliability of online physician ratings have been controversial and unclear. On the one hand, patient-generated online physician reviews may not be effective or useful in evaluating physicians’ quality, due to the credence good nature of health care services. Accordingly, the American Medical Association has voiced concerns that “online opinions of physicians … should not be a patient's sole source of information when looking for a new physician” (Lieber 2012). Many physicians have also been worried about being rated online and expressed concerns that online ratings may be inaccurate and biased (Deardorff 2010), and some physicians have even pursued defamation lawsuits against people who have reviewed them negatively online (Scotti 2018).
On the other hand, online ratings can yield insights on physician quality in multiple ways. First, although many patients may not have sufficient medical training or knowledge to fully assess the clinical quality of physicians, patients have a reasonable understanding of their health conditions and can infer quality signals from improvements in their conditions before and after physicians’ treatments (Lu and Rui 2018). Second, several aspects of health care services embody attributes of search or experience goods rather than those of credence goods. In the literature on the economics of information, goods are classified as search, experience, or credence goods based on their most important attributes (Darby and Karni 1973); nevertheless, goods can consist of one or more combinations of search, experience, and credence attributes (Ekelund, Mixon, and Ressler 1995; Sheffet 1983). For example, patients can reliably assess a physician's empathy, attentiveness, and communication skills after their direct interactions with the physician, and such aspects have experience good attributes. Prior medical research shows that physicians’ bedside manner, which refers to physicians’ attitudes toward patients, is an important characteristic that improves patients’ health outcomes and health care satisfaction (Beusterien et al. 2013; Hausman 2004). For instance, when doctors and patients engage in effective and meaningful communication, patients form a personal bond with physicians and become empowered to participate more in their own care, which leads to better outcomes and faster recoveries (Levinson and Pizzo 2011). Lastly, the online rating platform, which helps collect information from diverse crowds, can allow expert patients with sufficient medical knowledge to enhance and contribute to the platform. 
Therefore, our first research question examines whether online ratings are correlated with observable physician qualities, as the relationship remains unclear due to the credence aspect of health care services. We predict the following:
Figure 1 illustrates and summarizes our conceptual framework. H1 is illustrated by the first arrow from physicians’ quality to online ratings, where physicians’ quality is measured by physicians’ credentials and patients’ clinical outcomes.
The Effects of Physician Reviews on Patient Flow
To fully understand the effects of online physician reviews, it is important not only to address the first primary research question but also to examine whether online ratings affect patients’ physician choices. If consumers choose their physicians based on online ratings, and online ratings are positively correlated with physicians’ quality, online ratings help channel patients toward higher-quality physicians. However, if online ratings affect patients’ physician choices but are not correlated or are negatively correlated with physicians’ quality, online ratings can steer patients toward lower-quality physicians. Furthermore, the finding that online ratings do not affect patients’ physician choices would have different managerial and policy implications. Therefore, our second research question examines the extent to which online ratings affect patients’ physician choices.
The prior literature on the online retail industry finds that online consumer reviews can potentially affect sales (Chevalier and Mayzlin 2006; Chintagunta, Gopinath, and Venkataraman 2010; Duan, Gu, and Whinston 2008; Forman, Ghose, and Wiesenfeld 2008; Liu 2006). For example, Chevalier and Mayzlin (2006) find that improvement in a book's reviews leads to an increase in sales, whereas Duan, Gu, and Whinston (2008) find that online ratings do not have significant effects on a movie's box office sales. Although the effect of online reviews on consumer demand has been studied in the context of search and experience products, including books (Chevalier and Mayzlin 2006; Forman, Ghose, and Wiesenfeld 2008), movies (Chintagunta, Gopinath, and Venkataraman 2010; Duan, Gu, and Whinston 2008), cameras (Chen and Xie 2008), and craft beer (Clemons, Gao, and Hitt 2006), there has been scarce empirical research on the effects of physician reviews on patient flow, as it is challenging to observe the actual patient flow.
WOM refers to informal communication among consumers about products, and prior literature on this topic finds that it can be perceived as a credible information source and, as a result, can have positive effects on consumers’ demand (e.g., Banerjee 1992; Brown and Reingen 1987). Furthermore, it is theorized that the impact of WOM is higher for credence goods than for search goods (Lim and Chung 2011). Consumers of search goods can easily and confidently acquire and evaluate information on the goods directly from marketers and retailers prior to purchase (Wright and Lynch 1995). In contrast, because consumers of credence goods cannot directly observe and evaluate a good prior to purchase, they may have to rely on signals from other consumers who have already experienced the product (Burnkrant and Cousineau 1975). In addition, WOM is generally perceived as a credible information source for credence goods because it is based on consumers’ actual experiences; consumers of credence goods find messages from marketers and retailers less credible because marketers have strong incentives to take advantage of information asymmetry and deceive consumers (Bansal and Voyer 2000).
In the case of online physician reviews, online ratings and signals about physicians’ qualities contained in reviews can provide credible information to patients. Therefore, we build on the WOM literature to hypothesize:
Moderating Effects of Rating Credibility, Accessibility, and Existent Signals
To understand the potential mechanisms behind the effect of online ratings on patient flow, we further extend the signaling theory (Nelson 1970; Spence 1973) and examine what information is included in online physician reviews and how online rating credibility, accessibility, and the strength of other signals moderate the positive effects of online ratings on patient flow.
First, extending the argument that online ratings increase patient flow by providing credible signals, we further posit that online physician ratings would have stronger effects when online rating signals are more credible and accessible to patients. Specifically, when the number of ratings received by physicians increases, the credibility of information contained in reviews would also increase; overall ratings are likely to converge toward the true value when the number of ratings increases (Zhu and Zhang 2010). Although patients may attribute a single review to idiosyncratic experiences or motivations, multiple consistent reviews will be perceived as a more reliable signal (Ho-Dac, Carson, and Moore 2013). Therefore,
In addition, the effect of online ratings would be greater for patients with higher internet accessibility and usage. Although the advances in information systems have made it easy to access the internet, the digital divide nonetheless remains, and patient groups with limited internet access are largely identical to patient groups who face health care disparities (Philbin et al. 2019; Viswanath and Kreuter 2007). Patient groups with greater internet experience and online access are more likely to use online physician rating platforms to find a physician and thus are more likely to be affected by online physician ratings. Therefore,
In addition, we build on the signaling theory to predict that the positive effect of positive online ratings on patient flow would be greater for sole practitioners who may lack hospital or organization brand names that can provide a credible signal. It remains an important question for physicians, policy makers, patients, and rating platforms to understand whether the effect of online ratings is greater for independent sole practitioners when compared with employed organization-based physicians. From patients’ perspectives, hospital or organization brand names can provide credible signals and reduce uncertainties. For self-employed physicians, positive online ratings can help signal their good quality, provide extra information, and reduce patients’ uncertainties. In addition, as a sole physician is the only physician a patient would interact with, whereas an organization-based physician may be exchangeable with other physicians in the organization, sole physicians’ ratings may have stronger effects on patients’ choices. Relatedly, the prior literature on online reviews also finds that online ratings have stronger effects on the growth of independent restaurants (Luca 2016) and independent hotels (Ding, Gao, and Liu 2022) when compared with chain restaurants and branded hotels, respectively. Therefore,
Setting and Data
Empirical Setting
Yelp
We use Yelp as our empirical setting for studying online physician ratings. Yelp is a leading online platform on which consumers can voluntarily leave feedback on their experiences with the visited businesses. In the platform, a business profile can be created by a consumer, an owner, or directly by Yelp after scraping business directories. Once a profile is created, users can leave an impression on the business by leaving a text review and posting a star rating (of between one and five stars). When a consumer searches for a business on Google or directly within Yelp, a snapshot of the business profile is displayed. The snapshot shows the business contact information as well as key indicators, including the average ratings of the business and the total number of reviews. After the user clicks on the snapshot, the detailed profile page of the doctor opens and displays each individual review, the individual star rating associated with each review, and the review’s posting date.
Yelp has a long history of physician ratings and has been one of the major players in online physician ratings. In fact, Yelp was originally founded in 2004 to rate physicians, as the founder had a difficult time finding a good physician (Loten 2012). During our observation period, a survey found that around 50% of patients consult online physician ratings when looking for physicians, and among patients who use online physician ratings, 27% use Yelp, followed by Healthgrades and RateMDs, each at 26% (Leslie 2014). The prior medical literature also recognizes Yelp as an important and widely used source of physician rating information (e.g., Detz, Lopez, and Sarkar 2013; Furnas et al. 2020; Ranard et al. 2016). Yelp provides detailed rating information. Not only does Yelp provide the content and the timestamp of each review, but it also provides detailed information on a reviewer, including how many businesses the reviewer has reviewed and the ratings of these reviews. Such information can allow researchers to understand the time trends, rating content, and reviewers’ general rating patterns.
Potential data limitations
There are potential limitations to employing Yelp data. First, the data are collected from one rating platform. To address this concern, we show that similar major findings hold in other major physician rating platforms (i.e., Healthgrades and RateMDs; Web Appendix A3). Furthermore, although the number of rated physicians is on the rise, not all physicians are rated, and rated physicians may not be representative of all physicians. Specifically, there were approximately 759,000 physicians in direct patient care and 209,000 physicians in primary care during our observation period (AAMC 2016), and our Yelp data include 36,787 rated physicians. We employ inverse probability of treatment weighting (IPTW) to account for the fact that not all physicians receive ratings. Finally, patients who write reviews may not be representative of all patients. A primary care physician has 1,200 to 1,900 patients under their care (Raffoul et al. 2016), and a physician received an average of 5.3 reviews during our observation period. Because we examine the effects of existing online ratings on patient choice, the data reflect the extant state of online ratings and would not bias our results. Nevertheless, as more patients rate their physicians, different effects may be observed in the future. In Web Appendix A, we provide additional details on the Yelp rating platform, and we further discuss and address potential data limitations.
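The weighting idea behind IPTW can be sketched as follows. This is a generic textbook form rather than the authors' specification: the propensity scores are taken as given (in practice they would be estimated, for example with a logistic regression of rated status on physician characteristics), and the function name and toy values are hypothetical.

```python
def iptw_weights(rated_flags, propensities):
    """Standard IPTW weights: 1/p for rated physicians, 1/(1-p) for unrated ones.

    `propensities` holds each physician's assumed probability of being rated.
    """
    weights = []
    for rated, p in zip(rated_flags, propensities):
        if not 0.0 < p < 1.0:
            raise ValueError("propensity scores must lie strictly between 0 and 1")
        weights.append(1.0 / p if rated else 1.0 / (1.0 - p))
    return weights


# Hypothetical example: two rated and two unrated physicians.
rated = [True, True, False, False]
scores = [0.25, 0.5, 0.25, 0.5]
print(iptw_weights(rated, scores))
```

A rated physician who was unlikely to be rated (low propensity) receives a large weight, so the weighted rated group more closely resembles the full physician population on the characteristics used to estimate the propensity score.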
Data
Yelp rating
We collect Yelp rating data by using an algorithm outlined in Web Appendix B1. For each physician rated on Yelp, we collect the date of each individual review, the text of each review, the star rating of each review, and the historical distribution of each reviewer's ratings of all businesses that the reviewer rated. A physician's average Yelp rating by each year is reconstructed as the physician's cumulative average rating up to the end of that year, which mimics what Yelp displays as the average rating of the business. All this information was collected up to June 2017.
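The yearly cumulative average rating described above can be reconstructed from individual reviews as in the following sketch. The toy review list and function name are hypothetical, and any rounding that the platform applies when displaying averages is ignored.

```python
def cumulative_average_by_year(reviews, years):
    """Return {year: mean of all star ratings posted up to the end of that year}.

    `reviews` is a list of (review_year, stars); years with no reviews yet map to None.
    """
    result = {}
    for year in years:
        stars = [s for review_year, s in reviews if review_year <= year]
        result[year] = sum(stars) / len(stars) if stars else None
    return result


# Hypothetical review history for one physician.
history = [(2012, 5), (2013, 3), (2013, 4), (2015, 1)]
print(cumulative_average_by_year(history, range(2011, 2016)))
```

Because the average is cumulative, a year without new reviews (2014 in the toy data) carries forward the previous year's value, which matches what a visitor to the profile would have seen at that time.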
Patient flow and patients’ clinical outcomes
We capture a physician's patient flow and patients’ clinical outcomes from the Medicare fee-for-service claims data. Medicare, which is the largest insurer in the United States, constitutes a substantive portion of national health spending. The federal government spent $830 billion on Medicare in 2020, and Medicare is projected to experience fast spending growth of 7.6% per year until 2028 (CMS 2020). In 2019, the program covered 61.5 million U.S. residents, 85% of whom were elderly individuals aged 65 or older (CMS 2020). The program is widely accepted among physicians. Among primary care physicians alone, surveys found that 93% of them accept Medicare insurance (Boccuti et al. 2015). The wide coverage of Medicare allows us to analyze the impact of Yelp on the elderly, a large and important sector of patients.
We obtain Medicare claims data from two different sources. The first source is the 100% Medicare payment data from 2012 to 2015, which contain all physicians’ annual revenue and the numbers of unique patients served in a year from the Medicare Part B fee-for-service program. Medicare Part B covers doctor visits, lab tests, diagnostic screenings, ambulance transportation, and other outpatient services. These data are obtained from the Centers for Medicare and Medicaid Services (CMS). Second, we obtain the research identifiable files of Medicare fee-for-service inpatient and outpatient claims and Part D drug event files for a random 20% sample of Medicare enrollees between 2008 and 2015. The data contain granular claim-level information, including procedure codes, prescriptions filled, the amounts of bills, and dates of services.
We use Medicare payment data to obtain two variables on a physician's patient flow: physician's revenue and patient volume. We measure a physician's revenue as the annual revenue a physician receives and their patient volume as the number of distinct patients a physician sees from Medicare Part B.
We use Medicare claims data to construct patients’ clinical outcome variables. First, we collect data on whether a patient receives an applicable clinical procedure recommended by the Healthcare Effectiveness Data and Information Set (HEDIS) during a year. For example, HEDIS recommends eye exams for diabetic patients and mammograms for breast cancer screening for eligible women. In addition, we construct a preventable inpatient admission rate that captures whether patients receive inpatient admissions that could have been avoided with good primary care services. Preventable inpatient admissions refer to admissions for conditions that evidence suggests could have been avoided, in part through better outpatient care. For example, diabetic patients may be hospitalized due to complications of their disease if their conditions are not adequately monitored or if they do not receive education that would enable them to self-manage the disease (Hughes 2008).
Using patient demographics and diagnoses, we also compute a patient's Charlson Comorbidity Index (CCI) and hierarchical condition category (HCC) risk scores. The CCI is an index that predicts the risk of death within one year of hospitalization for patients with specific comorbid conditions. The CMS uses the HCC risk adjustment model to calculate risk scores and estimate future health care costs for the patient. Higher CCI and HCC risk scores indicate higher predicted mortality and spending, respectively. In general, higher values of the two health measures indicate sicker patients.
Physicians’ credentials
We also collect detailed information on physicians’ credentials, including their educational and professional accreditations, from external websites. From Healthgrades, we collect physicians’ board certification status; certification requires passing a voluntary test that demonstrates a physician's mastery of the minimum knowledge and skills for their subject. Physician Compare is the official information website for physicians who bill Medicare. From the website, we obtain the medical school information of all physicians and merge it with the rankings from U.S. News & World Report and StartClass. From Physician Compare, we also obtain all physicians’ self-reported accreditations, such as accreditations in “Preventive Care and Screening: Influenza Immunization” (see Web Appendix Table W1 for the list of accreditations).
Data construction
To combine the Yelp and Medicare data, we construct the main data sample by matching a physician's rating profile on Yelp with the National Provider Identifier (NPI) directory using the physician's last name, first name, and practice health service area (HSA). As an NPI is required for all physicians who bill Medicare, the directory is a superset of physicians who bill Medicare. After applying the matching algorithm, 36,787 physicians from Yelp profiles are uniquely matched with the NPI directory, and these physicians constitute the main sample for our analysis on rated physicians. 3 From the NPI directory, we also identify sole practitioners, as they use social security numbers, rather than employer identification numbers, to bill Medicare.
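A minimal sketch of such a unique-match step is shown below. The record fields (`yelp_id`, `npi`, `last`, `first`, `hsa`) and the name normalization are hypothetical; the authors' actual matching algorithm is described in their web appendix and may differ.

```python
from collections import defaultdict


def normalize(name):
    """Lowercase a name and drop non-letters (an assumed cleaning rule)."""
    return "".join(ch for ch in name.lower() if ch.isalpha())


def match_unique(yelp_profiles, npi_records):
    """Map each Yelp profile to an NPI when exactly one NPI record shares
    the same (last name, first name, HSA) key; ambiguous keys are dropped."""
    index = defaultdict(list)
    for rec in npi_records:
        key = (normalize(rec["last"]), normalize(rec["first"]), rec["hsa"])
        index[key].append(rec["npi"])

    matches = {}
    for prof in yelp_profiles:
        key = (normalize(prof["last"]), normalize(prof["first"]), prof["hsa"])
        candidates = index.get(key, [])
        if len(candidates) == 1:  # keep only unambiguous matches
            matches[prof["yelp_id"]] = candidates[0]
    return matches


npi = [
    {"npi": "111", "last": "Lee", "first": "Ann", "hsa": "A"},
    {"npi": "222", "last": "Kim", "first": "Jo", "hsa": "B"},
    {"npi": "333", "last": "Kim", "first": "Jo", "hsa": "B"},
]
yelp = [
    {"yelp_id": "y1", "last": "LEE", "first": "Ann", "hsa": "A"},
    {"yelp_id": "y2", "last": "Kim", "first": "Jo", "hsa": "B"},
]
print(match_unique(yelp, npi))
```

Dropping ambiguous keys trades sample size for match accuracy, which is one plausible reading of "uniquely matched."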
Are Ratings Associated with Physicians’ Quality Information?
In this section, we test H1 and examine whether physician ratings are positively associated with physician clinical quality measurements. Despite the controversies surrounding the effectiveness and reliability of online ratings, online physician ratings can be a valuable information source for patients if ratings are positively associated with other conventional measurements of physicians’ clinical quality, such as the physicians’ credentials and patients’ clinical outcomes. Therefore, we introduce and construct several measurements that reflect (1) physicians’ credentials and (2) patients’ clinical outcomes, and we estimate the following physician j–level regression using all physicians who are rated on the platform and are included in the main sample:
Ratings and Physicians’ Credentials
Credentialsj includes the following set of measurements for physicians’ educational and professional credentials: board certification status, ranks of medical schools, and the number of self-reported accreditations. Although such credentials may be readily available outside of Yelp, if these quality measures are positively correlated with Yelp ratings, we can have more confidence that Yelp ratings convey some unobserved information on physician quality.
Table 1 presents the estimation results, and Panel A presents the relationship between online rating and physicians’ credentials. Columns 1 and 2 present the estimation results in which the ranking of a physician's medical school is measured by U.S. News & World Report and StartClass, respectively. We find that a physician's board certification status is positively associated with physician rating, and physicians who are board-certified have a .33-star higher rating when compared with physicians who are not board certified (Column 1, significant at 1%). The ranking of a physician's medical school is also associated with an increase in physician rating; physicians who graduated from higher-ranked medical schools where the rank is measured by U.S. News & World Report (Column 1, significant at 1%) and StartClass (Column 2, significant at 1%) have a higher online rating (the ranking order is reversed so that higher rankings indicate better schools). In addition, we find that an increase in the number of accreditations that a physician receives is associated with a higher physician rating (significant at 1%).
Correlations Between Yelp Ratings and Physicians’ Quality Information.
*p < .10. **p < .05. ***p < .01.
Notes: FEs = fixed effects. The ranking order of medical school is reversed so that higher rankings indicate better schools.
Ratings and Patients’ Clinical Outcomes
We introduce and construct several measurements that reflect the clinical outcomes of physicians’ patients and examine their associations with online ratings. ClinicalOutcomesj includes both procedural-based outcome variables (i.e., a physician's adherence to clinical guidelines itself is considered good clinical quality) and patients’ health-based outcome variables that measure patients’ preventable inpatient admission rate and health risk scores. 4 To construct clinical outcome variables, we use Medicare Part B claims data to link each patient to the patient's most frequently visited primary care physician j to identify the physician in charge of the patient's health. Then, we construct five different clinical outcomes: (1) eye exam, which is the probability that an eligible diabetic patient of physician j receives a recommended eye exam, (2) mammogram, which is the probability that an eligible female patient of physician j receives a recommended mammogram for breast cancer screening, (3) preventable inpatient admissions, which is the probability that a patient of physician j experiences a preventable inpatient admission, (4) the average CCI of physician j's patients, and (5) the average CMS-HCC health risk scores of physician j's patients.
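The attribution and rate construction described above can be sketched as follows. The input layouts, function names, and the tie-breaking rule (smallest physician identifier among equally visited physicians) are assumptions made for illustration.

```python
from collections import Counter, defaultdict


def assign_primary_physician(visits):
    """visits: iterable of (patient_id, physician_id) primary care visits.
    Returns {patient_id: most frequently visited physician_id}.
    Ties are broken by the smallest physician_id (an assumed rule)."""
    counts = defaultdict(Counter)
    for patient, physician in visits:
        counts[patient][physician] += 1
    return {
        patient: min(c.items(), key=lambda kv: (-kv[1], kv[0]))[0]
        for patient, c in counts.items()
    }


def guideline_rate(assignment, eligible, received):
    """Share of each physician's eligible patients who received the
    recommended procedure (e.g., a diabetic eye exam)."""
    numer, denom = Counter(), Counter()
    for patient in eligible:
        physician = assignment.get(patient)
        if physician is None:
            continue  # patient has no attributed physician
        denom[physician] += 1
        if patient in received:
            numer[physician] += 1
    return {physician: numer[physician] / denom[physician] for physician in denom}


visits = [("p1", "dA"), ("p1", "dA"), ("p1", "dB"), ("p2", "dA"), ("p3", "dB")]
assignment = assign_primary_physician(visits)
print(guideline_rate(assignment, eligible={"p1", "p2", "p3"}, received={"p1", "p3"}))
```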
Panel B in Table 1 presents the association between online rating and patients’ clinical outcomes. We find that a physician who better adheres to clinical guidelines and performs recommended eye exams on eligible patients has a higher online rating (significant at 10%). Relatedly, a physician who has a higher likelihood of performing recommended mammograms on eligible patients has a higher online rating (significant at 1%). The results show that a higher physician rating is associated with a higher likelihood that a physician provides a patient with a recommended clinical procedure.
We also find that an increase in the probability that a patient undergoes a preventable inpatient admission is associated with a decline in online rating (significant at 1%). Furthermore, a one-unit increase in patients’ average CMS-HCC health risk score and CCI is associated with an approximately .13-star decline (significant at 1%) and .19-star decline (significant at 1%) in online rating, respectively. The results indicate that higher ratings are associated with a lower likelihood that a patient experiences a preventable inpatient admission, lower health risks, and slower developments in comorbidities.
Overall, consistent with H1, we find that physicians’ Yelp ratings are positively and strongly associated with various conventional measures of clinical quality, including physicians’ educational and professional backgrounds, their adherence to clinical guidelines, and their patients’ risk-adjusted health outcomes. All of these factors are indicators of high clinical quality, and high Yelp ratings signal higher quality in multiple dimensions.
The Impact of Yelp Ratings on Physicians’ Patient Flow
In this section, we test H2 and investigate the effects of Yelp ratings on physicians’ annual patient flow. Physicians’ patient flows, measured by using physicians’ revenue and patient volume, can capture patients’ demand for physicians’ services. Our two main objectives in this section are to (1) obtain the treatment effect of being rated on Yelp and (2) obtain the treatment effect of receiving higher ratings. To examine the effect of Yelp ratings on physicians’ annual patient flow, we employ a DiD approach along with the IV and propensity score weighting method.
Main Model
To examine the effect of Yelp ratings on physicians’ annual patient flow, we employ the DiD approach to compare differences in the changes in physicians’ patient flow before and after receiving Yelp ratings between physicians who are rated on Yelp and physicians who are not rated on Yelp. The DiD approach is widely used to examine the effect of differential treatments on outcome variables of interest (Heckman, Ichimura, and Todd 1997). Using data on all physicians included in Medicare across all specialties, including both rated and unrated physicians, we exploit variations across and within physicians in the timing of their first Yelp review and exploit the panel nature of the data to compare rated physicians and unrated physicians before and after the treatment of being rated on Yelp. We define treatment as receiving a rating, and a physician is treated if they received at least one rating.
To examine the extent to which physicians who receive higher ratings grow faster in annual patient flow than those who receive lower ratings, we extend the DiD framework to examine whether receiving higher ratings increases patient flow. We exploit variations in ratings across and within physicians to examine the effects of higher online ratings on patient flow, and we estimate the following:
The coefficient α, which measures the extent to which being rated on Yelp affects patient flow, is identified from variations across and within physicians in the timing of their first Yelp review. The panel data and the flexible fixed effects model allow us to control for unobservable physician characteristics that remain constant over time and examine the same physicians’ patient flow before and after receiving ratings. In addition, we employ a propensity score weighting method to correct for observable characteristics that may have affected the treatment assignment status, and we provide more details regarding this method in another subsection. The coefficient β, which measures the extent to which receiving higher ratings affects patient flow, is identified from two sources. First, suppose that when ratings of rated physicians are held constant (i.e., Ratingjt = Ratingj), β captures whether a physician j who is rated higher (whose patient flow after rating is χj + α + βRatinghigh) than a physician j′ who is rated lower (whose patient flow after rating is χj′ + α + βRatinglow) grows faster in patient flow, compared with before they receive a rating (whose patient flow before rating is χj and χj′, respectively). Second, as a physician's rating evolves over time, β is additionally identified from changes in online ratings within physicians.
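As a reading aid, the identification argument above is consistent with a specification of roughly the following form. This is a sketch under stated assumptions rather than the authors' own equation: the year-effect term \delta_t and the error term \varepsilon_{jt} are notational assumptions, and y_{jt} stands for either patient flow outcome.

```latex
y_{jt} = \chi_j
       + \alpha \,(\mathit{Rated}_j \times \mathit{Post}_t)
       + \beta \,(\mathit{Rating}_{jt} \times \mathit{Rated}_j \times \mathit{Post}_t)
       + \delta_t + \varepsilon_{jt}
```

Under this form, a rated physician's outcome after the first rating is \chi_j + \alpha + \beta\,\mathit{Rating}_{jt} (plus year effects), and \chi_j before it, which matches the expressions used in the paragraph above.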
The coefficient β would capture the average effect of receiving higher ratings if the ratings are uncorrelated with physicians’ other time-varying quality that may affect patient flow and if cumulative end-of-year average ratings accurately measure the ratings a patient observes prior to visiting a physician. However, potential concerns in our research setting are that rating levels are not randomly assigned and that the end-of-year cumulative average ratings may not perfectly measure what patients observe, which creates two potential endogeneity problems. First, and most importantly, physicians’ time-varying characteristics, such as changes in physicians’ abilities, qualities, or budget, may codetermine their likelihood of receiving high or low ratings and their patient flow. For example, physicians may have improved their office amenities and thus have a higher likelihood of receiving higher ratings. However, the improvement in office amenities would also directly improve patient flow itself. In a different hypothetical scenario, physicians may have decided to spend their budget on marketing efforts to increase the patient flow but do not have enough staffing capacity to accompany the increase in patient flow. This can improve patient flow but, at the same time, increase the likelihood of receiving lower ratings. The direction of the bias due to this potential endogeneity problem is unclear a priori. Second, the yearly cumulative end-of-year average ratings may contain measurement errors if new ratings arrive near the end of the year and patients visit their physicians before the arrival of new ratings. In the presence of classical measurement errors, β may be biased toward zero. To account for the potential endogeneity due to unobservable time-varying characteristics and measurement errors, we introduce a novel IV approach and instrument for Yelp ratings using the leniency of reviewers. We provide more details regarding our IV in the following subsection.
IV Strategy
We instrument for Yelp ratings using the leniency of reviewers, where a reviewer's leniency is defined as their average rating when rating other businesses on Yelp. On the platform, reviewers can post reviews and ratings for many types of businesses. The intuition behind our IV is that some reviewers may be more lenient in reviewing any business, whereas other reviewers may be harsher overall.
To determine each reviewer's leniency, we collect all review ratings that reviewers generated. Then, we construct our IV—a physician's cumulative reviewer leniency—similar to how we construct our cumulative ratings variable, Ratingjt. We let njt denote the number of reviewers for physician j by year t. We let
A valid IV should satisfy the relevance condition (i.e., should be correlated with the endogenous variable) and the exclusion restriction (i.e., should be uncorrelated with the error term) (Wooldridge 2010). Therefore, an IV should affect the dependent variable (i.e., patient flow) only through the endogenous variable (i.e., Yelp rating). Because physicians have different luck or chances of being reviewed by more lenient or harsher reviewers, physician reviewers’ leniency creates variations in physicians’ Yelp ratings. Reviewers’ leniency would affect physicians’ Yelp ratings because physicians who are reviewed by more lenient reviewers would have higher ratings, whereas physicians who are reviewed by harsher reviewers would have lower ratings. Therefore, physician reviewers’ leniency affects a physician's Yelp rating and, thus, satisfies the relevance condition. At the same time, reviewers’ leniency when rating other businesses is unlikely to have direct effects on physicians’ patient flow, thereby satisfying the exclusion restriction condition. Reviewers’ leniency is orthogonal to potential endogenous factors, such as physicians’ time-varying quality, which may codetermine patient flow and online ratings. Furthermore, the IV is orthogonal to potential classical measurement errors.
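The cumulative reviewer leniency instrument described in this subsection might be computed along the following lines. This is a minimal sketch: the equal-weighted average over reviewers, the handling of reviewers with no other ratings, and all names are assumptions, not the authors' exact formula.

```python
def reviewer_leniency(other_business_stars):
    """A reviewer's leniency: average stars given to businesses other
    than the focal physician."""
    return sum(other_business_stars) / len(other_business_stars)


def cumulative_leniency(physician_reviews, reviewer_history, year):
    """Average leniency of all reviewers who rated the physician up to
    the end of `year`.

    physician_reviews: list of (review_year, reviewer_id) for one physician.
    reviewer_history:  {reviewer_id: [stars given to other businesses]}.
    Reviewers with no other ratings are skipped (an assumption).
    Returns None when no usable reviewer exists yet.
    """
    values = [
        reviewer_leniency(reviewer_history[reviewer])
        for (review_year, reviewer) in physician_reviews
        if review_year <= year and reviewer_history.get(reviewer)
    ]
    return sum(values) / len(values) if values else None


history = {"r1": [5, 5, 4, 4], "r2": [1, 2, 3], "r3": []}
reviews = [(2012, "r1"), (2013, "r2"), (2013, "r3")]
print(cumulative_leniency(reviews, history, 2012))
print(cumulative_leniency(reviews, history, 2013))
```

The instrument moves only with who happened to review the physician, not with anything the physician did in that year, which is the intuition behind the exclusion restriction.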
We perform several tests to support the validity of our IVs. For each specification, we perform tests for underidentification and weak instruments. First, the Kleibergen–Paap Lagrange multiplier statistic for the underidentification test is statistically significant, and we can reject the null hypothesis that the instruments are uncorrelated with the endogenous variables. The first-stage regression also indicates a strong relationship between Yelp ratings and our IVs, the leniency of reviewers (Web Appendix Table W2). Furthermore, F-statistics from the Angrist–Pischke multivariate F-test of excluded instruments are much larger than both the critical value of 10 (Angrist and Pischke 2008; Staiger and Stock 1994) and Stock–Yogo test statistics (Stock and Yogo 2005).
Propensity Score Weighting Method
In examining the treatment effect of being rated on Yelp, the estimate of α may be subject to a selection problem if physicians who are rated are systematically different from those who are not rated, and we take several approaches to address the concern. The DiD model and the flexible fixed effects model allow us to control for time-invariant differences in physicians’ characteristics and quality, and we validate the parallel pretrend assumption of the DiD model by showing that rated and unrated physicians exhibit common parallel trends prior to receiving their first rating (see the “Robustness Checks” section). To further address the potential selection problem, we employ an IPTW estimation, which uses propensity score weights to construct a comparable treatment and control group (Rosenbaum and Rubin 1983). We predict the treatment status using a logistic model that incorporates rich data on observable physician characteristics, including physicians’ specialty, practice locations, and patients’ risk scores. Propensity score estimates,
Main Results
Table 2 presents the main results of Yelp ratings on physicians’ annual patient flow, and all models are weighted by IPTW weights. In Columns 1 and 3, we include physician fixed effects and flexible year fixed effects and estimate the model using OLS, but we do not employ IVs. The main results of the IV estimation from Equation 2 appear in Columns 2 and 4. We focus on IV estimation results, as the OLS estimations may be biased. In the first-stage regression of the IV model, a 1-star increase in average reviewer leniency increases a physician's average rating by .57 stars (Web Appendix Table W2). As expected, physicians who are reviewed by less harsh reviewers have higher ratings.
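For readers unfamiliar with IPTW, the weights applied to the models can be illustrated with the standard textbook form: a treated unit receives weight 1/p and a control unit receives 1/(1 − p), where p is the estimated propensity score. This sketch assumes the propensity scores have already been estimated (e.g., by a logistic model) and does not claim to reproduce the authors' exact weighting or any trimming they may apply.

```python
def iptw_weights(treated, propensity):
    """Standard inverse probability of treatment weights.

    treated:    list of 0/1 indicators (1 = physician is rated).
    propensity: list of estimated probabilities of being rated.
    """
    weights = []
    for d, p in zip(treated, propensity):
        if not 0.0 < p < 1.0:
            raise ValueError("propensity scores must lie strictly between 0 and 1")
        # Treated units are weighted by 1/p, controls by 1/(1 - p).
        weights.append(1.0 / p if d else 1.0 / (1.0 - p))
    return weights


print(iptw_weights([1, 0, 1], [0.25, 0.5, 0.5]))
```

Rated physicians who looked unlikely to be rated, and unrated physicians who looked likely to be rated, receive larger weights, which makes the two groups more comparable on observables.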
The Effects of Yelp Ratings on Physicians’ Patient Flow.
*p < .10. **p < .05. ***p < .01.
Notes: FEs = fixed effects. Standard errors are two-way clustered at the physicians’ HSA and specialty levels. R2 has no statistical meaning in IV regressions (Wooldridge 2015, p. 471), and R2 is not reported in IV regressions.
Columns 2 and 4 present the second-stage results where the outcome variable is regressed on the predicted values for endogenous variables based on the first stage regression and all control variables (Angrist and Pischke 2008; Wooldridge 2010). In Column 2, the dependent variable measures the log of a physician's annual revenue. We find that once a physician is rated with an average rating (3.65 stars), their revenue would decrease by 1.3% (
The IV estimates of β, which capture the differential effects of ratings, are generally positive and slightly larger than the OLS estimates. Overall, the estimates show that a one-star increase in Yelp ratings is associated with a 1%–2% increase in a physician's revenue and patient volume, which confirms H2. As the estimate is based on the elderly patient population, the effect of online ratings may be greater for younger generations who rely more heavily on electronic WOM.
Potential Mechanisms and Heterogeneity in Effects of Ratings on Patient Flow
What Information Do Yelp Ratings Convey, and How Do Patients Respond?
There are possible mechanisms for positive estimates of
We use a machine-learning LDA model (Blei, Ng, and Jordan 2003) to understand the common themes of Yelp reviews and reduce review information dimensionality. The LDA model is a widely used topic modeling algorithm that seeks to discover latent topics in a collection of documents. The algorithm categorizes the texts of reviews into a small set of topics, which can be interpreted as a cluster of keywords that tend to show up together in a review. Technically, in the algorithm, a review is regarded as an unordered word collection generated by a small number of topics. Each topic is a distinct probability distribution that generates keywords in such a way that a small set of keywords will be drawn frequently. After prespecifying the total number of possible topics, the algorithm reads in the keyword distributions from all reviews and uses a Bayesian algorithm to infer the distribution of topics among reviews and keyword distributions among topics. We use the Cv coherence score to identify the optimal number of topics of the LDA algorithm (Röder, Both, and Hinneburg 2015), and we find 13 different topics in Yelp reviews. Additional details of the algorithm specification and the Cv coherence score are in Web Appendix D1.
Table 3 shows the labels of the 13 topics with their top relevant keywords. We label each topic based on its top relevant keywords and categorize it as service quality-related (Panel A) or clinical and treatment quality-related (Panel B). For example, Topic 1, which includes keywords such as “question,” “care,” “answer,” “explain,” and “listen,” describes a physician with a good bedside manner with strong empathy, care, and interpersonal skills (Table 3). Other service-related topics include patients’ waiting time, office amenities, appointment scheduling, and billing process (Topics 2–5). We also identify topics that capture clinical and treatment-related information (Topics 6–9). In total, service quality information and clinical quality information account for 36% and 19% of review content, respectively. 5
Topics of Yelp Reviews.
Notes: Top relevant words are derived using the formula and modules provided by Sievert and Shirley (2014), setting λ = 2/3, which is the weight balancing a keyword's probability in a topic, and its probability in a topic divided by the overall probability of the keyword in all usage (see Web Appendix D for details). Subjective interpretation consists of our personal interpretation of the keywords of each topic. The probability column shows the probability that a Yelp review is classified according to each topic.
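The relevance measure referenced in these notes has the form λ·log p(w | t) + (1 − λ)·log[p(w | t) / p(w)]. The sketch below applies it to made-up probabilities; the function and variable names are hypothetical, and the numbers are illustrative only.

```python
import math


def relevance(p_word_given_topic, p_word, lam=2 / 3):
    """Sievert-Shirley relevance of a keyword to a topic.

    lam weights the keyword's log probability within the topic against its
    log lift (within-topic probability divided by overall probability).
    """
    return lam * math.log(p_word_given_topic) + (1 - lam) * math.log(
        p_word_given_topic / p_word
    )


def rank_keywords(topic_probs, overall_probs, lam=2 / 3):
    """Return keywords sorted from most to least relevant."""
    return sorted(
        topic_probs,
        key=lambda w: relevance(topic_probs[w], overall_probs[w], lam),
        reverse=True,
    )


# Illustrative probabilities (not taken from the paper).
topic = {"listen": 0.05, "the": 0.08, "explain": 0.04}
overall = {"listen": 0.005, "the": 0.08, "explain": 0.004}
print(rank_keywords(topic, overall))           # lift-adjusted ranking
print(rank_keywords(topic, overall, lam=1.0))  # ranking by raw topic probability
```

With λ below 1, a generic word that is frequent everywhere is pushed down the list, which is why the top relevant keywords tend to be topic-specific.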
To understand the extent to which patients respond to service-related and treatment-related information in text reviews, we use topic weights from the LDA model to construct rating variables, BedsideMannerRating, OtherServiceRating, and ClinicalRating, which capture ratings on bedside manner, other service-related features, and clinical-related features, respectively. More specifically, BedsideMannerRating is constructed as the weighted average of ratings weighted by the LDA topic weight on physician's bedside manner (Topic 1). When we let
To understand which quality signals patients respond to, we estimate Equation 2 by replacing Rating with BedsideMannerRating, OtherServiceRating, and ClinicalRating. We also employ instrumental variables (BedsideMannerLeniency, OtherServiceLeniency, and ClinicalLeniency) constructed by combining the leniency measurements with the text analysis of online reviews. Specifically, BedsideMannerLeniency is constructed as the weighted average of leniency weighted by the LDA topic weight on a physician's bedside manner:
Table 4 shows that BedsideMannerRating has the highest positive effect on patient flow. The finding is consistent with the medical literature, which emphasizes the importance of bedside manner and its association with patients’ satisfaction and health outcomes (Simpson et al. 1991; Bylander 2015). For example, physicians with good interpersonal skills elicit greater relevant information from patients, which leads to greater patient satisfaction, recall, and adherence (O’Keefe 2001; Beusterien et al. 2013). Furthermore, we find that ClinicalRating has positive effects on patient flow, which indicates that patients respond to treatment-related information included in reviews. Finally, positive but statistically insignificant coefficients on OtherServiceRating indicate that ratings derived from service-related information on waiting time, office amenities, appointment scheduling, and billing do not have significant effects on patients’ physician choice decisions. Overall, the machine learning–based analysis reveals that text reviews contain important signals about physicians’ service-related and treatment-related information, and review information on physicians’ interpersonal and clinical skills has significant effects on patients’ physician choices.
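The topic-weighted rating variables used in this analysis (e.g., BedsideMannerRating) are described as averages of review ratings weighted by each review's LDA topic weight. A minimal sketch of that weighting follows; the input layout and the handling of physicians with zero total weight are assumptions, and the same function could be applied to reviewer leniency to form the corresponding instruments.

```python
def topic_weighted_average(reviews, topic):
    """Weighted average of review-level values for one physician.

    reviews: list of (value, {topic_id: weight}) pairs, where value is a
             star rating (or a reviewer's leniency) and the dict holds the
             review's LDA topic weights (hypothetical layout).
    topic:   topic id, or a collection of topic ids whose weights are summed.
    Returns None when the reviews carry no weight on the topic(s).
    """
    topics = topic if isinstance(topic, (list, tuple, set, frozenset)) else [topic]

    def weight(topic_weights):
        return sum(topic_weights.get(t, 0.0) for t in topics)

    total = sum(weight(w) for _, w in reviews)
    if total == 0:
        return None
    return sum(value * weight(w) for value, w in reviews) / total


reviews = [(5, {1: 0.75, 6: 0.25}), (1, {1: 0.25, 6: 0.75})]
print(topic_weighted_average(reviews, 1))  # weight on bedside manner (Topic 1)
print(topic_weighted_average(reviews, 6))  # weight on a clinical topic
```

In the toy input, the five-star review is mostly about bedside manner, so the Topic 1 rating is pulled toward five stars while the clinical-topic rating is pulled toward one star.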
Differential Effects of Service and Clinical Information on Patient Flow.
*p < .10. **p < .05. ***p < .01.
Notes: FEs = fixed effects. Standard errors are two-way clustered at the physicians’ HSA and specialty levels.
Moderating Effects of Rating Credibility, Accessibility, and Existent Signals
To further examine the potential mechanisms behind the effect of online ratings on patient flows, we test H4, H5, and H6. We estimate the moderating effects of online rating credibility, accessibility, and the strength of existent signals on the effects of ratings on patient flow.
Rating credibility
The signaling theory suggests that, with more reviews, an average Yelp rating signals more information about a physician's quality. Therefore, patients’ responses to online ratings received by physicians would be greater for physicians with more reviews. To test such predictions, we construct the variable
Moderating Effects of Rating Credibility, Accessibility, and Existent Signals.
*p < .10. **p < .05. ***p < .01.
Notes: FEs = fixed effects. Standard errors are two-way clustered at the physicians’ HSA and specialty levels. Columns 2 and 4 show IV estimations where Leniencyjt × Ratedj × Postt and Leniencyjt × Ratedj × Postt interacted with additional variables
Rating accessibility
Internet usage among older adults has steadily increased over time, and 82% of those aged between 65 and 70 in the United States are internet users (Anderson and Perrin 2017). Several surveys find that the elderly are among the age groups that use online physician ratings most heavily (Cheney 2020; Software Advice 2013). In fact, patients over age 65 are more likely to utilize online physician ratings than the average population (Software Advice 2013), which supports the idea that the elderly constitute an important consumer segment for online physician ratings. At the same time, internet usage is higher among the younger elderly (i.e., those under 85 years; Gell et al. 2015; Hargittai and Dobransky 2017), and online physician ratings may be used more frequently by younger elderly individuals. 7 Therefore, we construct an age index YoungElderlySharej of physician j's patient pool by computing the average share of patients aged less than 85 years among physician j's total number of Medicare patients from 2012 to 2015. We include the interaction term between [YoungElderlySharej − Average(YoungElderlyShare)] and Ratedj × Postt, where Average(YoungElderlyShare) indicates the mean of YoungElderlySharej in the sample. Consistent with H5, the estimated coefficients on the interaction term are positive, indicating that younger patients, who have greater accessibility to and utilization of online ratings, respond more strongly to online ratings.
Sole practitioners
To examine whether the positive effect of online ratings on patient flow is greater for sole practitioners, who may lack hospital or organization names that can provide credible signals, we identify sole practitioners from the NPI directory. 8 We construct the variable Sole, which is an indicator variable that equals 1 for sole practitioners and 0 for employed physicians. The baseline Equation 2 is extended to include the interaction term between Sole and Ratingjt × Ratedj × Postt. The positive and statistically significant interaction term estimates in Table 5 indicate that ratings have stronger effects on patient flow for sole practitioners, which is consistent with H6.
Robustness Checks
In this section, we conduct several checks to assess the robustness of our results. We first provide robustness checks for the model specification. Then, we provide various robustness checks for the IV to support the exclusion restriction assumption and to show that our results are robust to alternative constructions and measurements of the IVs.
Model Validations
The validity of the DiD model relies on the common parallel pretrend assumption (Athey and Imbens 2006), and it is important that there is no difference in pretrends in the patient flow between rated and unrated physicians. To assess the common path assumption, we follow Angrist and Pischke (2008) and the prior literature to estimate the DiD coefficient α separately for each year leading up to when physicians receive their first Yelp rating. The annual DiD coefficients prior to receiving the first rating can be used to test whether the estimated treatment effect of being rated on the platform began prior to receiving the first rating. We estimate the following relative time model:
In Figure 2, Panels A and B, we plot the coefficients of αk, where the dependent variables are patient flow measured by physicians’ revenue and patient volume, respectively. For both dependent variables, the coefficients on the pretreatment indicator variables are all statistically insignificant. The result indicates that there are no preexisting trends in patient flow between rated and nonrated physicians. Therefore, the estimated impact of Yelp ratings on patient flow cannot be explained by differential pretrends or a false impact that began prior to the treatment.

Model Validation: Annual DiD Coefficients.
IV Validations
We perform several additional robustness checks that support the exclusion restriction assumption of our instruments and present empirical evidence that reviewers’ leniency (or harshness) affects patient flow only through its effects on ratings and does not have direct effects on patient flow. First, to validate the exclusion restriction assumption, we follow the approach taken in Duflo (2001) and show that the outcome variables exhibit parallel pretrends independent from actual instrument assignments (Web Appendix C2). Second, in Web Appendix C3, we construct an alternative version of instruments to alleviate potential concerns regarding endogeneity. The main intuition behind the alternative instruments is that we use only nonmedical businesses to construct a reviewer's leniency and residualize the impact of baseline business ratings. Specifically, if a reviewer gives a five-star rating to an average four-star restaurant, conceptually we only use the 5 − 4 = 1 star “residual leniency” measurement when constructing the alternative instruments. For example, if one worries that a high-quality physician attracts reviewers who like to visit and rate high-quality businesses such as four-star restaurants, the residualization removes such baseline attraction and only uses reviewers’ idiosyncratic “residual leniency,” which is not dependent on the average quality of the rated businesses. Similarly, if one worries that a physician may selectively encourage more lenient patients to leave ratings on the platform, the residualization removes such baseline leniency and only uses reviewers’ idiosyncratic “residual leniency,” which is not dependent on the average rating of the rated businesses. We find results to be robust. 
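The "residual leniency" idea can be sketched as follows: for each of a reviewer's ratings of nonmedical businesses, subtract the business's average rating, then average the differences. The field names and the set of categories treated as medical are hypothetical, and the authors' exact residualization may differ.

```python
def residual_leniency(reviews, medical_categories=frozenset({"physician", "health"})):
    """Average of (stars given - business's average stars) over a
    reviewer's ratings of nonmedical businesses.

    reviews: list of dicts with keys 'stars', 'business_avg', 'category'
             (hypothetical layout).
    Returns None if the reviewer rated no nonmedical business.
    """
    residuals = [
        r["stars"] - r["business_avg"]
        for r in reviews
        if r["category"] not in medical_categories
    ]
    return sum(residuals) / len(residuals) if residuals else None


history = [
    {"stars": 5, "business_avg": 4, "category": "restaurant"},  # residual +1
    {"stars": 3, "business_avg": 4, "category": "restaurant"},  # residual -1
    {"stars": 1, "business_avg": 3, "category": "physician"},   # excluded
]
print(residual_leniency(history))
print(residual_leniency(history[:1]))
```

A reviewer who rates a four-star restaurant five stars contributes a residual of one star, as in the example in the text; a reviewer who simply frequents highly rated businesses contributes a residual near zero.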
Furthermore, to assess the robustness of the measurement of ratings, in Web Appendix C4, we perform a similar estimation measuring ratings and instruments by the end of the previous year, as they may contain fewer measurement errors because Yelp readers would have read the previous year's ratings. We find qualitatively similar results with a stronger effect size and higher statistical significance.
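The “residual leniency” construction used for the alternative instruments can be illustrated with a minimal sketch. The record layout, the function names, and the step that averages reviewers’ residual leniency up to the physician level are assumptions made for illustration; the exact construction is described in Web Appendix C3. The business average rating is taken as given here.

```python
from collections import defaultdict
from statistics import mean


def reviewer_residual_leniency(reviews):
    """Mean of (stars - business average) over each reviewer's nonmedical reviews."""
    residuals = defaultdict(list)
    for r in reviews:
        if not r["is_medical"]:  # medical businesses are excluded by construction
            residuals[r["reviewer"]].append(r["stars"] - r["business_avg"])
    return {reviewer: mean(vals) for reviewer, vals in residuals.items()}


def physician_instrument(physician_reviewers, leniency):
    """Average residual leniency over a physician's reviewers (assumed aggregation)."""
    known = [leniency[r] for r in physician_reviewers if r in leniency]
    return mean(known) if known else None


# Hypothetical reviews. Reviewer "a" gives five stars to an average
# four-star restaurant, which contributes 5 - 4 = 1 star of residual leniency.
reviews = [
    {"reviewer": "a", "business": "restaurant_1", "is_medical": False,
     "stars": 5, "business_avg": 4.0},
    {"reviewer": "a", "business": "cafe_1", "is_medical": False,
     "stars": 4, "business_avg": 3.5},
    {"reviewer": "a", "business": "clinic_1", "is_medical": True,
     "stars": 5, "business_avg": 4.5},
    {"reviewer": "b", "business": "restaurant_1", "is_medical": False,
     "stars": 3, "business_avg": 4.0},
]

leniency = reviewer_residual_leniency(reviews)       # {"a": 0.75, "b": -1.0}
instrument = physician_instrument(["a", "b"], leniency)  # -0.125
```

Because each rating is measured relative to the business's own average, a reviewer who mostly visits highly rated businesses does not appear lenient merely for that reason, which is the concern the residualization is meant to address.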
Discussion and Conclusion
Managerial and Practical Implications
Our findings have several important policy and managerial implications for policy makers, health care managers, physicians, patients, and online physician rating platforms. Despite the heated debate in the health care sector on the reliability and efficacy of user-generated physician ratings, the empirical evidence on the extent to which online ratings are associated with physicians’ quality and affect patients’ physician choices has been scarce and mixed. We find that online ratings are robustly and positively associated with conventional measurements of physicians’ credentials, physicians’ adherence to clinical guidelines, and patients’ clinical outcomes. As all of these measurements are important indicators of high physician quality, Yelp ratings signal better quality in multiple dimensions, and online ratings can help patients choose higher-quality physicians. The finding that user-generated physician ratings are positively associated with important measures of physician quality highlights to patients, policy makers, and health care managers that online physician reviews can be a reliable and user-friendly source of information. Therefore, patients would want to avoid physicians with low online ratings. It is also important for health care policy makers, physicians, and online physician rating platforms to seek potential policies and strategies to encourage the accumulation of physician ratings and to improve the information quality of rating platforms. Policy makers can also consider online ratings and reviews as good complements to traditional public health care report cards. Traditional public health care report cards often include clinical and administrative information, such as mortality rates for different procedures, claims turnaround time, and patient satisfaction survey results. As patients tend to find information in traditional report cards difficult to understand or poorly presented (Brook et al. 2002), online rating platforms and policy makers can consider presenting user-friendly ratings along with traditional report cards to enable patients to make more informed choices.
The finding that online ratings have significant and positive effects on patients’ choices also has important practical and managerial implications. Physicians should be mindful of online reputation management, as low online ratings negatively affect patient demand. Furthermore, because information contained in online ratings affects patients’ physician choices, it is important to maintain the credibility and reliability of online physician ratings. For example, fraudulent reviews are an important problem that should be continuously monitored and addressed by policy makers and online review platforms. Online physician rating platforms should assess their reviewer verification and anonymity policies to identify suspicious review activities and remove fraudulent reviews.
It is empirically challenging to effectively extract value from massive unstructured text data (Wedel and Kannan 2016). We employ the LDA machine learning model to identify common topics contained in text reviews. We find that information on physicians’ interpersonal and clinical skills has significant effects on patients’ physician choices, whereas other service-related information, such as waiting time and office amenities, does not have significant effects. Therefore, physicians who wish to improve their patient flow can prioritize enhancing their interpersonal skills and clinical quality, given that patients value such qualities. Because we find that the textual information contains service- and treatment-related signals about physicians, rating platforms and policy makers can encourage patients to write informative text reviews.
Consistent with signaling theory, online rating credibility, accessibility, and the strength of existent signals moderate the positive effects of online ratings on patient flow. The positive moderating effect of the number of online reviews emphasizes to health care leaders that it is important to accumulate a greater number of reviews to enhance the credibility of reviews. Furthermore, the heterogeneity in effects of online ratings across patients’ accessibility to the Internet suggests that policy makers and online physician rating platforms need to carefully consider how to promote the usage of online ratings among different members of the population. Despite the advances in information systems, the digital divide remains, in which the patient groups most affected by the digital divide are likely to be those who face health care disparities (Viswanath and Kreuter 2007). In devising policies that aim to bridge such disparities, health care policy makers should be mindful that certain patient groups may be less familiar with user-generated online reviews. Furthermore, physician rating platforms can consider making their platforms more accessible to all patient groups. Finally, given that online ratings have greater effects on sole practitioners than on hospital-based physicians, sole practitioners should be more proactive in maintaining good online ratings.
Theoretical Contributions
This work contributes to the literature in several ways. First, to the best of our knowledge, this is the first study to examine the association between online reviews and the actual clinical quality information constructed by using rich information from Medicare individual claims data. The Medicare claims data enable us to observe physicians’ adherence to clinical guidelines and patients’ actual health outcomes. Despite the popularity of online physician review platforms, the evidence on the correlation between physicians’ online ratings and their quality has been mixed and limited to a specific region or physician specialty (e.g., Gao et al. 2015; Lu and Rui 2018; Saifee et al. 2020).
Second, another important dimension through which this study differs from the prior literature is that it examines not only whether user-generated ratings convey physician quality information but also whether ratings affect patients’ physician choices. There has been limited empirical research on the effects of physician reviews on patient flow, as it is challenging to observe the actual patient flow. Xu, Armony, and Ghose (2021) use physician appointment data from an online appointment booking platform, which covers 872 physicians in the United States, and estimate a structural model to show that integrating quality information from reviews can improve the predictive power of patients’ appointment booking choices. We reliably measure the patient flow of physicians by observing their actual physician revenue and patient volume. Third, we analyze textual comments, in addition to rating data, to examine the extent to which service quality-related and clinical quality-related textual information affect patient flow. Fourth, we contribute to the literature on WOM (e.g., Chevalier and Mayzlin 2006; Liu 2006) and signaling theory (e.g., Nelson 1970; Spence 1973) by examining the potential mechanisms through which online ratings affect patient flow.
Finally, this research also makes important methodological contributions to the literature on how rating mechanisms affect consumer demand by proposing a new causal research design. The literature on ratings has covered a wide array of industries including health plans, restaurants, education, and consumer goods (Dafny and Dranove 2008; Jin and Leslie 2003; Kolstad and Chernew 2009; Scanlon et al. 2002). Methodologically, the prior studies in the literature often relied on cross-sectional or panel variation of the demand of the business unit in response to ratings. The estimation of the causal effects of online ratings has been especially challenging due to the potential endogeneity problems arising from the fact that ratings may be correlated with other characteristics that may affect consumer demand and ratings. Our article introduces a different causal approach by using an instrumental variable design that exploits Yelp reviewers’ leniency in other reviews. The main intuition behind our IV is in line with Huang and Sudhir (2021), in which the skill level of the assigned representative is employed as an instrument for service satisfaction to examine the effect of service satisfaction on customer loyalty. Both IVs are similar in that both exploit variations in characteristics (i.e., skill level or leniency) of an agent (i.e., representative or reviewer), yet differ in actual construct and context, and our empirical approach can be applied in other rating demand estimations. For example, the subsequent literature cites and follows our new IV approach to examine consumer choices in an online platform for residential home services (Farronato et al. 2020).
Limitations and Future Research
This article has limitations, which can help identify promising areas for future research. First, an important challenge in studying patient demand is the difficulty of observing the actual patient flow. The access to Medicare data, fortunately, allows us to examine the effect of online ratings on Medicare patients’ physician choices. Medicare patients are a large and important patient group to examine. Furthermore, based on the positive moderator effect of younger patient age, we posit that the effect of online ratings would be even greater for younger patients who are not enrolled in Medicare, as younger generations rely more heavily on the internet for information acquisition. Nevertheless, we acknowledge the limitation of employing Medicare data, and we hope that future research will examine other patient groups and extend our findings. Second, fraudulent reviews are a serious problem that can undermine the credibility of online ratings, and physician rating data may contain fraudulent reviews. Although Yelp claims to have an internal system that removes fake reviews, their detection is a complex process (Malbon 2013), and researchers cannot fully observe Yelp's fake review–filtering algorithms, as they are proprietary. Finally, future research can examine how different characteristics of reviewers and patients interplay and affect patients’ decisions. For example, as homophily could play a significant role in credibility formation (McPherson, Smith-Lovin, and Cook 2001), future research, with the availability of adequate data, could examine whether the similarity between patients and reviewers affects patient decisions.
Supplemental Material
Supplemental material, sj-pdf-1-jmx-10.1177_00222429221146511 for User-Generated Physician Ratings and Their Effects on Patients’ Physician Choices: Evidence from Yelp by Yiwei Chen and Stephanie Lee in Journal of Marketing
Footnotes
Acknowledgments
The authors are listed in alphabetical order. The authors thank the JM review team for their helpful comments throughout the review process. The authors are also grateful to Liran Einav, Jonathan Levin, Mark Duggan, Kate Bundorf, Grant Miller, Matthew Gentzkow, and Jay Bhattacharya. Yiwei Chen gratefully acknowledges financial support from the Stanford Institute of Economic Policy Research's Leonard W. Ely and Shirley R. Ely Graduate Student Fellowship.
Special Issue Editor
Harald van Heerde
Associate Editor
Michael Trusov
Declaration of Conflicting Interests
The author(s) declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.
Funding
The author(s) disclosed receipt of the following financial support for the research, authorship, and/or publication of this article: This work was supported by the Stanford Institute of Economic Policy Research (Leonard W. Ely and Shirley R. Ely Graduate Student Fellowship).
