Abstract
The issue of how to measure the impact of situational-, suspect-, and officer-level factors on police actions has long been debated in the policing literature. One promising method is to use interval-level metrics developed via a combined method of concept mapping and Thurstone scaling. Our objective here was to use these metrics to score 667 incident reports from a large (n ∼ 1,500) urban police department. From this process, we explored significant trends in how police officers perform during encounters with the public. We found that officers performed better in “higher stakes” encounters and excelled in vigilance situational assessment as well as use of tactics and adapting tactics. Officers tended to receive the worst scores in routine police–citizen interactions and the highest in crisis encounters. Interpretation and implications of these findings for American policing are discussed.
Introduction
A large body of research has examined situational-, suspect-, and officer-level factors that may influence how police officers behave during encounters with the public (Engel, Sobol, & Worden, 2000; Fagan, Braga, Brunson, & Pattavina, 2016; Terrill & Reisig, 2003). The majority of this research has focused on the outcomes of these encounters as the dependent variable of interest. For example, the questions of whether suspect race influences officer-involved shootings (James, James, & Vila, 2016; Nix, Campbell, Byers, & Alpert, 2017), whether officers use greater force against suspects based on their demeanor (Crawford & Burns, 1998; James, James, & Vila, 2018), and whether neighborhood predicts the outcomes of police–citizen encounters (Lee, 2016; Sun, Payne, & Wu, 2008) have dominated the policing literature. Although notable studies have analyzed process—for example, examining predictors of procedural justice (Holtfreter, Mastrofski, Jonathan-Zamir, Moyal, & Willis, 2016), measuring police use of force relative to suspect resistance (Alpert & Dunham, 1997; Hine, Porter, Westera, Alpert, & Allen, 2018), and de-escalation tactics (Todak & James, 2018)—most have used the outcome of encounters to judge appropriateness of police behavior.
Focusing on the outcome of police–citizen encounters (e.g., use of force, arrest rates, or citizen complaints) rather than police performance during the encounters (e.g., fairness in decision-making, use of de-escalation tactics, or procedural justice techniques) assigns much of the variation in these encounters caused by chance, or the actions of others, to the individual officer in question. Police–citizen encounters are probabilistic in that an officer will never have full control over the outcome. An officer could behave impeccably and still generate a citizen complaint, or could treat a citizen fairly yet still have to arrest him or her. Conversely, an officer could do everything wrong, behave appallingly, and “get away with it” if the citizen does not feel like making a complaint or believes making a complaint will have no impact. Thus, measurement techniques that focus on officer performance have the potential to provide more nuanced information about what influences officers’ treatment of the citizens they serve and protect.
To examine the situational-, suspect-, and officer-level predictors of how officers perform during encounters with the public, we used interval-level police performance metrics developed by Vila, James, and James (2018) to score 667 incident reports from a large urban department. By focusing on performance over outcome, the resulting data and analysis add nuance to the body of literature on what key variables influence officers during police–citizen encounters.
Literature Review
The question of what influences police action has been examined from multiple perspectives with differing theoretical underpinnings. These perspectives can be broadly categorized into situational-, suspect-, and officer-level predictors. 1 Given the extent of the literature on this topic, our summary is by no means comprehensive. As previously mentioned, most of the research has dealt specifically with outcomes such as use of force to judge appropriateness of police action, instead of measuring officer performance or actual behaviors during police–citizen encounters. Notable exceptions from procedural justice, appropriateness of force, and de-escalation studies exist and are described later.
Situational-Level Factors Influencing Police Actions
How situational factors predict police action has been well examined in the empirical literature. For example, many studies have analyzed how neighborhood context predicts arrest and use of force. In a seminal study, Smith (1986) reported that the most prominent factor influencing the probability of arrest across neighborhoods is socioeconomic status, which he interpreted as either potential preferential treatment of individuals living in high community status areas (evidence of discretion), or officers encountering more offenses necessitating arrest in low community status areas. Smith also found that officers used more coercive force in primarily Black or racially mixed neighborhoods, and in areas with higher numbers of transient community members, even when accounting for violent crime rates. These factors accounted for over a third of the between-neighborhood variation in police use of coercive force. Socioeconomic status or race/ethnicity as predictors of police action sit under the theoretical umbrella of the “bias” hypothesis, which states that officers are motivated by factors other than just the actual threat that they face.
More recently, studies continue to explore neighborhood and other situational-level predictors of police use of force. Sun et al. (2008) found that officers were more likely to use coercive force in neighborhoods with concentrated disadvantage and in neighborhoods with a lower percentage of senior citizens. Using the New York Police Department’s Stop, Question, and Frisk data, Lee (2016) measured the impact of location (inside or outside), time of incident, rate of violent crime in the neighborhood, and racial composition of the neighborhood on use of nonlethal force. Lee found that nonlethal force (compared with no force) was more likely to be employed in neighborhoods with a higher proportion of non-White citizens and with higher rates of violent crime. This latter finding is supportive of the theory that often counters the “bias hypothesis”—the “threat hypothesis.” This hypothesis states that officers act in accordance with the level of threat they face. Supportive of this theory, Garner, Maxwell, and Heraux (2002) found that prevalence of force increased in high-crime neighborhoods, although severity of force did not. They did, however, find that antagonizing bystanders increased the severity of police use of force.
In an Australian study of factors influencing police recruits’ decisions about force, Hine et al. (2019) found that recruits were more likely to use force in domestic violence role-play scenarios than in other types of role-play scenarios. They also found that the number of citizens present, whether or not the incident occurred in a confined space, perception of how the force would be perceived by the public, and concerns about the use of force being recorded and taken out of context influenced the level of force the recruits employed. Other factors that have been shown to influence police use of force are whether a location is known to be dangerous, the incident occurring at night, and the incident involving the use of lights and sirens (Crawford & Burns, 2008; Phillips & Smith, 2000).
Suspect-Level Factors Influencing Police Actions
The most widely studied predictors of police actions are suspect-level factors, none more so than suspect race and ethnicity. In recent years, this literature has been dominated by investigations of whether police are more likely to shoot unarmed Black men than unarmed White men. Findings vary alarmingly. Some studies provide clear support for the bias hypothesis by reporting that officers are significantly more likely to shoot unarmed Black men (Nix et al., 2017). Others provide support for the threat hypothesis by reporting that officers are no more likely to shoot unarmed Black suspects than suspects of other races (Miller et al., 2016). Many studies fall somewhere in the middle. For example, Fryer (2016) found evidence of racial discrimination in use of less lethal force, but not lethal force. Goff, Lloyd, Geller, Raphael, and Glaser (2016) at the Center for Policing Equity found that officers appeared to racially discriminate when data were examined using “all arrests” as a benchmark, but this finding disappeared when “violent crime” was used as the benchmark. Laboratory studies on the impact of suspect race on officer decisions to shoot have also produced mixed results. These range from officers being more likely to shoot unarmed Black suspects than unarmed White suspects (Correll et al., 2007), to no evidence of bias in shooting decisions (Correll & Keesee, 2009), to officers being more hesitant to shoot Black suspects despite implicit associations between Black suspects and weapons (James et al., 2016).
The impact of suspect race and ethnicity on police action other than use of deadly force is more cohesive. The majority of the research indicates that Black citizens are more likely to be stopped on the road (Lange, Johnson, & Voas, 2005), searched (Braga et al., 2016; Higgins, Jennings, Jordan, & Gabbidon, 2011), and arrested (Crawford, 2000; Kochel, Wilson, & Mastrofski, 2011) than non-Black citizens. Braga et al. (2016) found that, even when controlling for suspect resistance, Black suspects were more likely to be frisked than Hispanic, White, or Asian suspects. Conversely, Alpert and Dunham (1997) found that officers used the highest level of force against Hispanic citizens. Terrill and Mastrofski (2002) found that officers were more likely to use a higher level of force (e.g., moving from verbal commands to weaponless techniques such as handcuffing) against non-White suspects, even when controlling for suspect resistance. Similarly, both Gau et al. (2010) and Fridell and Lim (2016) found that officers were more likely to use Tasers on Hispanic and Black suspects, respectively. These findings are all supportive of the bias hypothesis and suggest that implicit racial bias plays a role in officers’ decision-making (Fridell & Pate, 1997).
Although much of the research has focused on suspect race and ethnicity, the biggest suspect-level predictors of police action tend to be behavioral. As laid out by the threat hypothesis, the level of threat posed by the suspect is a strong indicator of the level of force an officer is likely to use (Fyfe, 1980; Garner, Buchanan, Schade, & Hepburn, 1996; Klinger, 2004). Crawford (2000) found that during traffic stops, noncompliance was the strongest predictor of arrest over ticketing. A distinction is often drawn between demeanor that is threatening and aggressive (potentially indicating intent to assault or be non-compliant) and that which is simply disrespectful, rude, or combative (“contempt of cop”). Van Maanen’s (1978) seminal study on this topic asserted that “assholes get street justice” regardless of criminal or threatening behavior. Bayley and Garofalo (1989) found that use of obscene or insulting remarks or gestures by citizens strongly predicted police use of force. This assertion was contested by Klinger (1994) who found that suspect demeanor did not influence officers when separated from illegal activity. Since then, many studies have focused specifically on this topic and tend to find that officers are in fact more punitive when faced with hostile or disrespectful citizens, even when controlling for illegal activity (Coates, Kautt, & Mueller-Johnson, 2009; Engel & Worden, 2000; Lundman, 1996; Worden, Shepard, & Mastrofski, 1996). James et al. (2018) tested whether officers were influenced by suspect demeanor during use-of-force simulation and found that it strongly predicted a scenario devolving into a deadly encounter.
Other suspect-level predictors of police action that have been explored include gang affiliation (Fagan et al., 2016; Garner et al., 1995), apparent wealth (Mastrofski et al., 2006), attire (James et al., 2018), and prior arrest history (Fagan et al., 2016). These all hark back to Skolnick’s (1966) idea of the “Symbolic Assailant” that officers used to make rapid determinations about whether an individual was potentially dangerous. Another factor that has been found to predict increased use of force by police is having comorbid behavioral health disorders (Morabito, Socia, Wik, & Fisher, 2017).
Officer-Level Factors Influencing Police Actions
The body of research on officer-level predictors of police actions is less extensive than that on situational- and suspect-level predictors, in part because variation between officers—approximately 10%—tends to be much less than variation across situations and suspects—approximately 90% (Sun et al., 2008). Similar to other work in this field, it is plagued with contradiction. For example, despite the assumption that female officers are less likely to use force than male officers, results are mixed. Brandl, Stroshine, and Frank (2001) found that female officers were less likely to receive citizen complaints for excessive use of force than male officers. Lersch and colleagues (2008) explored gender differences in citizen complaints of misconduct and found some overrepresentation of male officers. Sun et al. (2008), using the Project on Policing Neighborhoods data set, found that male officers had higher odds of using coercive force than female officers. Paoline and Terrill (2007), on the other hand, employed a systematic social observation study and found that female officers were no less likely to use coercive force than their male colleagues.
Several studies have examined whether officer race influences use of force. Smith (2003) found that, contrary to assumption, departments with a higher proportion of minority officers do not have lower levels of officer-involved shootings. Sun et al. (2008) found that officer race did not predict coercive force. Digging deeper into the connection between officer race and complaints of excessive force, Brandl et al. (2001) found that minority officers were more likely to be assigned to higher crime areas, yet were no more likely to receive complaints of excessive force than White officers assigned to lower crime areas. Given that one would anticipate more complaints would be generated in high-crime areas, it is possible that minority officers do in fact use less excessive force. On the other hand, Crawford (2000) found that Black and Hispanic officers were more likely to arrest than issue a ticket during vehicle stops—although this may be more reflective of the small number of non-White officers in the department in question than evidence of proclivity to arrest across race.
More consensus can be drawn regarding the impact of officer experience and education on police action. Crawford and Burns (1998) analyzed arrests in Phoenix and found that more years of experience predicted less likelihood of using force during an arrest. Terrill and Mastrofski (2006) found that less experienced and less educated officers were more likely to use higher levels of force than those with more experience and education. Paoline and Terrill (2007) found that officers with a 4-year degree used significantly less physical force than those with a high school education and that officers with any college education used significantly less verbal force than their high school-educated colleagues. This study also found that officers with greater experience were less likely to use both physical and verbal force. McElvain and Kposowa (2004) analyzed data from a Sheriff’s department and found that less experienced officers were 8 times more likely to be investigated for use of force complaints than more experienced officers.
Other officer-level characteristics that may influence police action include having an “authoritarian” personality (Worden et al., 2015), being a night-shift officer (James, James, & Vila, 2017; Sun et al., 2008), being sleep deprived or fatigued (James, 2018), and “peer group aggression” (McCluskey et al., 2005). Although officer-level predictors of police–citizen encounter outcomes might not be as salient as situational- or suspect-level predictors, they clearly have their place in the policing literature.
Measuring Police Performance
Despite a focus on the outcomes of police–citizen encounters, and an emphasis on use of force and citizen complaints, some studies have shifted attention from the outcome to the process. Leading the way are the studies on procedural justice—the idea that citizens care as much or more about how they are treated during a police–citizen encounter than they do about the specific outcome of the encounter (Sunshine & Tyler, 2003). For example, Wheller, Quinton, Fildes, and Mills (2013) found that officers who show empathy and build rapport with victims are more likely to be perceived positively and cooperated with. Giacomantonio et al. (2016) found that officers who applied procedural justice techniques such as active listening and validating the individual were met with a positive change in suspect attitudes during stop and search encounters. When considering whether officers are more likely to use procedural justice techniques in some cases versus others, Holtfreter et al. (2016) found that officers were more likely to employ them with victims than with suspects and with respectful versus disrespectful citizens. Officer race or sex did not predict use of procedural justice tactics, nor did suspect characteristics such as apparent socioeconomic status.
Other research that focuses on specific police decisions during an encounter includes Todak and James’s (2018) examination of de-escalation tactics such as humanizing the individual, treating them with dignity and respect, and keeping control of their (the officers’) own emotions. Using systematic social observation, they found that officers were more likely to use de-escalation tactics with citizens who were visibly upset and less likely to use them with citizens who showed “contempt of cop.” Citizen characteristics such as race, age, gender, and socioeconomic status did not predict officers’ behavior. Several officer-level variables did, however: both officer age and years of experience increased the likelihood of officers using de-escalation tactics. The coding protocol Todak and James used to score officer behavior (as well as situational- and suspect-level variables) was based on that used in the Project on Policing Neighborhoods study.
Another notable series of studies examining police behavior versus the outcomes of police–citizen encounters is the work of Alpert and colleagues on the “Force Factor” (Alpert & Dunham, 1997; Hine et al., 2019). This work measures the appropriateness of police use of force by categorizing it based on suspect resistance. Using this method, the researchers have found that police officers use a higher level of force relative to suspect resistance against Black suspects compared with White suspects, against male suspects compared with female suspects, and against suspects who were under the influence of alcohol or drugs compared with those who were not (Alpert & Dunham, 1997). Furthermore, Hine et al.’s (2019) application of the force factor to police in Australia found that they were more likely to use a higher level of force relative to suspect resistance against suspects who were physically aggressive, who were armed with weapons, and who were male.
The question of how to measure police performance during encounters with the public is complex. One approach, proposed by Vila, James, and James (2016) and Vila et al. (2018), is to develop interval-level metrics via a combined process of concept mapping and Thurstone scaling. This involves bringing a group of top-level experts together for an intensive series of focus groups around a specific topic such as use of force, or crisis intervention. During these sessions, the experts propose specific items that they believe relate to officer performance during these encounters. For example, an item related to performance in crisis encounters might be “the officer repeated statements back to the person in crisis to ensure understanding.” The full list of items from all experts is then compiled, categorized, and entered into online surveys that get distributed nationwide for “rating” using a Thurstone scaling technique. This process involves officers scoring each item for importance on a Likert-type scale. After several hundred officers rate the list of items, the median score for each performance item is selected, and the result is a set of objective- and interval-level metrics for scoring officer performance. A higher score is indicative of a better performance. This is based both on guidelines of the law and appropriateness of actions based on expert opinion. To date, Vila, James, James, and Waggoner (2012) have developed metrics for use of force, tactical social interaction (Vila & James, 2014), and crisis intervention (James & James, 2017).
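As a rough illustration of the final scaling step described above, the sketch below takes a set of officer ratings for each performance item and uses the median rating as that item's score. The item labels and ratings are invented for illustration and are not drawn from Vila et al.'s surveys; the published procedure includes concept mapping and other steps that are not shown here.

```python
from statistics import median

# Hypothetical Likert-type ratings (1-7) from five raters for three items.
# Item labels and numbers are invented for illustration only.
ratings = {
    "repeated statements back to person in crisis": [6, 7, 5, 6, 6],
    "introduced self by name": [4, 5, 3, 4, 6],
    "maintained a safe distance": [7, 7, 6, 5, 7],
}

def item_weights(ratings_by_item):
    """Return the median rating for each item as its performance score."""
    return {item: median(scores) for item, scores in ratings_by_item.items()}

weights = item_weights(ratings)
for item, weight in weights.items():
    print(f"{item}: {weight}")
```

Using the median rather than the mean keeps a handful of extreme raters from shifting an item's score.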
The Current Study
We scored 667 incident reports from a large urban police department using Vila and colleagues’ metrics to examine the impact of situational-, suspect-, and officer-level predictors of police performance. Our study is the first to apply this combined method. All three sets of metrics—use of force, tactical social interaction, and crisis intervention—were used to score incident reports, resulting in a comprehensive look at officer performance across a wide range of police–citizen encounters. By focusing on police performance during the encounter instead of on the outcome of the encounter, we can better assess the impact of situational-, suspect-, and officer-level factors on police actions. Given the importance of holding officers accountable for their behaviors and decisions, which they control, instead of for outcomes, which are probabilistic and might be outside of their control, this study represents an important step forward in understanding factors that influence police.
Methodology
Research Design
We worked with a single large police department during this study. They provided us with identifying numbers of all of the incident reports generated between 2015 and 2017 (n = 183,331). From these, we randomly selected 1,000 incident reports to analyze. 2 The department redacted officer and citizen names from these reports, then uploaded them as PDF files to a protected site that we could access. We then reviewed each incident report to determine whether the metrics could be used to score officer performance from the available data within the report. Of the 1,000 reports, 667 had sufficient information to score. The 333 reports that were not included either were not completed by the on-scene officer (e.g., theft report) or did not feature any interaction between the officer and a citizen (e.g., responding to an alarm). The dropped reports appeared to be random. For example, there was no clustering of these calls around a particular location or jurisdiction. The remaining 667 reports were scored using the interval-level performance metrics. From these, we examined situational-, suspect-, and officer-level predictors of police performance.
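The random selection step can be sketched as a simple draw without replacement. This is an illustrative assumption about how such a sample might be drawn, not a description of the procedure actually used; the sequential identifiers and the fixed seed are hypothetical stand-ins.

```python
import random

def sample_report_ids(report_ids, k=1000, seed=None):
    """Draw k distinct report IDs uniformly at random, without replacement."""
    rng = random.Random(seed)
    return rng.sample(report_ids, k)

# Hypothetical sequential IDs standing in for the department's identifiers.
all_ids = list(range(1, 183_331 + 1))

# A fixed seed (an assumption here) makes the draw reproducible.
selected = sample_report_ids(all_ids, k=1000, seed=42)
print(len(selected), "reports selected from", len(all_ids))
```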
Sample
A large urban police department with approximately 1,500 sworn officers participated in the study. In total, 667 incident reports were scored using the interval-level performance metrics. To protect individual officers, names were redacted, so it is not known how many officers were represented in those 667 incident reports. The reports covered a wide range of police–citizen encounters. Table 1 shows that reports included assaults (aggravated and nonaggravated), domestic violence, traffic collisions (including hit and runs), crisis encounters, thefts, investigation of suspicious circumstances, warrant arrests (felony and misdemeanor), and other threats.
Table 1. Summary of Incident Reports Scored.
Materials
The interval-level metrics used to score officer performance were developed by Vila et al. to assess performance across a range of police–citizen encounters. Three sets of metrics were used in the current study: use of force (designed to measure officer performance in situations where force is required), tactical social interaction (designed to measure officer performance in routine police–citizen encounters), and crisis intervention (designed to measure officer performance in crisis encounters or encounters with persons with mental illness). Please see Vila et al. (2018) for a detailed description of how the metrics were developed.
The metric items used in the current study are described in Tables 2 to 4 (including a footnote for each table on the possible range of scores and what they mean for performance). They include items from all three metric types (use of force, tactical social interaction, crisis intervention) and are further grouped into categories of related items (e.g., preplan, observe/assess, tactics, adapt). Although all three sets of metrics were used, many items were not applicable to particular situations (e.g., if use of force was not used, most of the force items were not applicable, or if the encounter did not involve a person in crisis, most of the crisis intervention items were not applicable). Some incident reports, however, were scored using items from all three sets of metrics.
Table 2. Use of Force Metrics Used in the Current Study.
(a) Score range = 0 (no impact on performance) to 6 (extremely good impact on performance).
(b) These items would only be scored positively if appropriate (by the law and by the department in question’s training and policies). For example, if the officer drew a handgun unnecessarily, that would be scored negatively (points deducted).
Table 3. Tactical Social Interaction Metrics Used in the Current Study.
(a) Score range = 1 (no impact on performance) to 7 (extremely good impact on performance).
Table 4. Crisis Intervention Metrics Used in the Current Study.
(a) Score range = 1 (no impact on performance) to 6 (extremely good impact on performance).
Procedures
Research assistants were trained on use of the metrics. Following an approximately 4-hr training session, they were scoring data with greater than 90% interrater reliability. The performance metrics were used by:
1. Reading through the incident report to get an indication of which performance items were applicable;
2. Identifying which items were possible for an officer to achieve to ensure an officer was not penalized for something he or she could not have done or where it was not possible to assess whether he or she could have done it from the report details;
3. Coding all items that an officer did with a “1,” coding all items they did not do (but could have done) with a “0,” and coding all items that were not achievable (or where it was not clear if they were achievable or not) with a “–1”;
4. Assigning performance scores to each item (see the footnotes in Tables 2 to 4 for a description of these scores); and
5. Calculating an overall performance score by summing the performance scores of each item and converting the sum into a percentage (see Note 3) based on how many of the achievable items the officer performed.
Thus, performance scores were expressed as a proportion of all behaviors that were possible in the encounter which are measured by the metrics. It is important to note that every detail of officer performance is unlikely to be captured by the metrics. We argue that they do, however, allow for a more comprehensive assessment of officer performance than has been feasible to date. In addition to scoring incidents for officer performance, we also coded situational-, suspect-, and officer-level variables (see Table 5).
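One way to read the coding scheme above is as a weighted proportion: the summed item scores for behaviors the officer performed, divided by the summed item scores for every behavior that was achievable. The sketch below follows that reading. It is an assumption about the arithmetic rather than the exact formula used in the study, the item names and weights are hypothetical, and it does not model the point deductions described in the footnote on the use of force items.

```python
DONE, NOT_DONE, NOT_ACHIEVABLE = 1, 0, -1

def performance_score(codes, weights):
    """Percentage of achievable, weighted performance the officer attained.

    codes   -- dict mapping item -> 1 (done), 0 (not done but possible),
               or -1 (not achievable, or unclear from the report)
    weights -- dict mapping item -> interval-level item score
    Returns None when no item in the report was achievable.
    """
    achievable = [item for item, code in codes.items() if code != NOT_ACHIEVABLE]
    possible = sum(weights[item] for item in achievable)
    if possible == 0:
        return None
    earned = sum(weights[item] for item in achievable if codes[item] == DONE)
    return 100.0 * earned / possible

# Hypothetical items and weights, for illustration only.
weights = {"preplan": 6, "observe": 4, "tactics": 5, "adapt": 3}
codes = {"preplan": DONE, "observe": NOT_DONE, "tactics": DONE, "adapt": NOT_ACHIEVABLE}

score = performance_score(codes, weights)
print(f"{score:.1f}%")  # 11 of 15 achievable points
```

Excluding the "–1" items from the denominator is what keeps an officer from being penalized for behaviors the situation did not allow.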
Table 5. Situational-, Suspect-, and Officer-Level Variables Analyzed in the Current Study.
(a) For binary variables (0 = no, 1 = yes), only “yes” is displayed. In all other instances (age, sex, race/ethnicity), frequencies are displayed for each category in order of coding.
Analytical Strategy
First, performance scores were calculated, and differences in performance across different types of incident were described. Second, the situational-, suspect-, and officer-level variables listed in Table 5 were analyzed for their effect on officer performance.
Means, standard deviations, and frequencies were used to describe data. Linear regression was used to analyze the impact of situational-, suspect-, and officer-level variables on performance. All data analyses were conducted in SPSS v 25.
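To make the regression output easier to interpret, the sketch below fits an ordinary least squares line with a single 0/1 predictor. With a binary predictor, the slope B is simply the difference in mean performance score between the two groups. The data are toy values invented for illustration, and the single-predictor setup is an assumption; the model specification used in SPSS is not reproduced here.

```python
from math import sqrt

def simple_ols(x, y):
    """Fit y = a + b*x by least squares; return (a, b, standard error of b)."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    b = sxy / sxx
    a = y_bar - b * x_bar
    # Residual sum of squares, used for the standard error of the slope.
    sse = sum((yi - (a + b * xi)) ** 2 for xi, yi in zip(x, y))
    se_b = sqrt(sse / (n - 2) / sxx)
    return a, b, se_b

# Toy data (hypothetical): 1 = more than one civilian on scene, 0 = one civilian.
multiple_civilians = [0, 0, 0, 1, 1, 1]
score = [78.0, 80.0, 79.0, 82.0, 81.0, 83.0]

intercept, b, se_b = simple_ols(multiple_civilians, score)
print(f"intercept = {intercept:.2f}, B = {b:.2f}, SE = {se_b:.2f}")
```

Here the intercept is the mean score of the reference group (coded 0), and B is the number of percentage points by which the other group's mean differs.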
Results
Across the 667 incidents scored, the mean performance score was 80.5% (SD = 8.8%). When exploring performance across different encounter types, we found that, on average, officers received the highest performance scores in crisis encounters (83.6%, SD = 0.6%), aggravated assaults (83.4%, SD = 1.2%), and domestic violence incidents (82.4%, SD = 1.3%). As Figure 1 indicates, officers tended to score lower on traffic collisions (74.8%, SD = 2.0%), harassment calls (76.9%, SD = 2.8%), and investigation of suspicious circumstances (76.7%, SD = 1.4%). It is worth noting that deadly and nondeadly threats, and hit and run incidents have very large confidence intervals, indicating instability in these measures likely due to their low numbers.

Figure 1. Mean overall performance score (as a percentage) by encounter type. Error bars depict 95% confidence intervals.
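The link between low incident counts and wide confidence intervals can be seen in a short sketch. It uses the normal approximation (mean plus or minus 1.96 standard errors), which is an assumption; for very small groups a t-based critical value would be more appropriate. The scores are toy values, not study data.

```python
from math import sqrt
from statistics import mean, stdev

def ci95(scores):
    """Approximate 95% confidence interval for the mean (normal approximation)."""
    m = mean(scores)
    half_width = 1.96 * stdev(scores) / sqrt(len(scores))
    return m - half_width, m + half_width

# Toy data (hypothetical): same spread of scores, very different group sizes.
rare_encounter = [70.0, 80.0, 90.0]
common_encounter = [70.0, 80.0, 90.0] * 12

low_r, high_r = ci95(rare_encounter)
low_c, high_c = ci95(common_encounter)
print(f"n = {len(rare_encounter)}: {low_r:.1f} to {high_r:.1f}")
print(f"n = {len(common_encounter)}: {low_c:.1f} to {high_c:.1f}")
```

Because the standard error shrinks with the square root of the group size, encounter types with only a few reports produce much wider intervals around the same mean.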
To investigate where this approximately 20% “deficit” in officer performance was coming from, we explored subcategories of officer performance to see where strengths and weaknesses lay. From this, we found that officers excelled within the observe/assess subcategories (96.0%, SD = 7.9%) but were less proficient at preplanning (80.5%, SD = 22.5%), with use of tactics (84.4%, SD = 21.6%), adapting tactics (83.8%, SD = 28.7%), and closing encounters (93.6%, SD = 23.2%) falling somewhere in between. Interestingly, officers were far better at interacting with citizens during crisis encounters (94.5%, SD = 17.0%) than during routine (noncrisis) police–citizen interactions (76.9%, SD = 9.4%).
Impact of Situational-Level Factors on Performance Scores
Of the 667 incidents scored, 40% occurred at night. More than one civilian was on scene in 64% of incidents, children were present at 22% of incidents, cultural differences were noted in 10% of reports, and language barriers were present in 4% of reports. When examining differences in officer performance based on situational-level factors, we found that officers seemed to perform similarly in nighttime incidents (81.5%, SD = 4.6%) and daytime incidents (82.5%, SD = 3.7%), in incidents with children present (81.7%, SD = 3.9%) and without children (81.9%, SD = 4.3%), and in incidents with cultural differences present (83.7%, SD = 3.8%) and without cultural differences (81.8%, SD = 4.3%). They performed slightly better in incidents with language barriers present (84.2%, SD = 4.0%) than in incidents with no language barriers (81.8%, SD = 4.3%) and in incidents with more than one civilian present (81.5%, SD = 7.7%) than in incidents with just one civilian present (78.6%, SD = 10.1%). The impact of number of civilians present on officer performance was significant, with officers performing significantly better when more than one civilian was on the scene (B = 1.43, SE = 0.66, p = .03).
Impact of Suspect-Level Factors on Performance Scores
The majority of the people described in incident reports were either young adults (48%) or older adults (43%). Sixty-nine percent were men, and 68% were White (30% were Black and 2% were Hispanic). Across the 667 incident reports, suspects were armed in 3%. Suspects were described as emotionally disturbed in 24%, noncompliant in 26%, hostile in 13%, and self-harming in 7% of reports. Furthermore, suspects were homeless in 14% and impaired by substances in 18% of incidents.
When exploring differences in performance scores based on suspect-level factors, we found that age did not influence performance score—when a teen was involved, officers received a mean performance score of 83.6% (SD = 3.0%) compared with when a young adult (84.7%, SD = 3.7%) or older adult (83.9%, SD = 3.0%) was involved. Officers received slightly higher performance scores across incidents with men (84.7%, SD = 3.7%) than women (82.1%, SD = 4.0%). Officers also received slightly higher performance scores with Black citizens (85.8%, SD = 3.4%) compared with White (83.2%, SD = 4.1%) or Hispanic (83.8%, SD = 3.8%) citizens. Officers received slightly higher scores when interacting with emotionally disturbed citizens (84.8%, SD = 3.9%) compared with nonemotionally disturbed citizens (83.0%, SD = 3.5%), citizens who were impaired by substances (86.0%, SD = 2.8%) versus nonimpaired citizens (82.9%, SD = 3.8%), homeless citizens (86.6%, SD = 1.9%) versus nonhomeless citizens (83.5%, SD = 3.8%), self-harming individuals (86.8%, SD = 4.8%) compared with non-self-harming individuals (83.6%, SD = 3.7%), and hostile citizens (83.6%, SD = 3.9%) compared with nonhostile citizens (84.9%, SD = 3.2%). Finally, officers received higher performance scores in incidents where citizens were noncompliant (86.3%, SD = 3.0%) versus compliant (82.8%, SD = 3.6%) and in incidents with armed suspects (88.0%, SD = 4.6%) than with unarmed suspects (83.6%, SD = 3.7%).
Of these suspect-level influences on officer performance, several were significant (Table 6). First, officers performed significantly better in incidents with emotionally disturbed individuals compared with in incidents without emotionally disturbed individuals (B = 2.45, SE = 0.75, p = .001). Second, officers performed significantly better in incidents with Black individuals versus non-Black individuals (B = 2.06, SE = 0.70, p = .004). Third, officers performed significantly better in incidents with noncompliant versus compliant suspects (B = 2.76, SE = 0.88, p = .002). An interpretation of these slightly nonintuitive findings is provided in the Discussion section.
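To give a sense of the precision of the three significant suspect-level coefficients, approximate 95% confidence intervals can be computed as B ± 1.96 × SE. This is a minimal sketch that assumes a normal approximation; the intervals are illustrative and are not reported by the authors.

```python
Z_95 = 1.959964  # standard normal quantile for a two-sided 95% interval

# (B, SE) pairs as reported in the text for the significant predictors.
coefficients = {
    "emotionally disturbed": (2.45, 0.75),
    "Black vs. non-Black": (2.06, 0.70),
    "noncompliant vs. compliant": (2.76, 0.88),
}

def confidence_interval(b, se, z=Z_95):
    """Return the (lower, upper) bounds of the approximate interval."""
    half_width = z * se
    return b - half_width, b + half_width

intervals = {name: confidence_interval(b, se) for name, (b, se) in coefficients.items()}
for name, (lower, upper) in intervals.items():
    print(f"{name}: [{lower:.2f}, {upper:.2f}] percentage points")
```

All three approximate intervals exclude zero, consistent with the reported significance, although each spans several percentage points.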
Table 6. The Results of a Linear Regression Model Testing the Impact of Suspect-Level Variables on Officer Performance Scores.
Impact of Officer-Level Factors on Performance Scores
The majority (84.10%) of officers involved in the incidents scored were men. There was no difference in performance scores between men (80.47%, SD = 8.4%) and women (80.7%, SD = 10.1%) officers, indicating that officer gender did not influence how officers interacted with citizens in the current sample. This variable was unfortunately the sole officer-level factor that we were able to assess from the incident reports provided.
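The absence of a gender difference can be illustrated with a Welch t statistic computed from the summary statistics above. The group sizes are an assumption: the text reports only the percentage of officers who were men, so the sketch treats each of the 667 incidents as contributing one officer.

```python
import math

N_INCIDENTS = 667   # scored incidents (from the text)
SHARE_MEN = 0.8410  # proportion of officers who were men (from the text)

# Assumed group sizes: one officer per incident (not stated in the source).
n_men = round(N_INCIDENTS * SHARE_MEN)
n_women = N_INCIDENTS - n_men

def welch_t(mean1, sd1, n1, mean2, sd2, n2):
    """Welch's t statistic for two independent groups with unequal variances."""
    standard_error = math.sqrt(sd1**2 / n1 + sd2**2 / n2)
    return (mean1 - mean2) / standard_error

# Means and SDs (in percentage points) as reported for men and women officers.
t = welch_t(80.47, 8.4, n_men, 80.7, 10.1, n_women)
print(f"n_men = {n_men}, n_women = {n_women}, t = {t:.2f}")
```

Under these assumptions the statistic is far from conventional significance thresholds, in line with the conclusion that officer gender did not influence performance scores in this sample.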
Discussion
Interpretation of Results
Several interesting findings emerged from this study that require exploration. First, officers’ average performance score was 80.5%, indicating that approximately 20% of metric items were not being achieved. Recall that the metrics were developed based on a rigorous process starting with diverse experts brainstorming what constitutes “good” officer performance across a range of police–citizen encounters and ending with hundreds of law enforcement professionals rating those performance indicators for importance. To explore where this performance deficit was coming from, we looked at variation in performance scores across types of call and subcategories of performance. We found that officers tended to score the highest in crisis encounters, aggravated assaults, and domestic violence incidents but received lower scores during traffic stops, harassment calls, and investigation of suspicious circumstances. It is possible that the “high-stakes” nature of crisis encounters, aggravated assaults, and domestic violence incidents lends itself to the types of tasks officers excel at (e.g., vigilant situational assessment, use of tactics, and adapting tactics). This seems to be supportive of the threat hypothesis and in line with prior research such as Hine et al. (2019). When looking specifically at subcategories, we found that officers received the highest performance scores within the “observe/assess” categories and the lowest during routine police–citizen interaction. Perhaps lending weight to our speculation that officers perform better in “high-stakes” encounters, a large difference was observed between how officers interacted with people in crisis encounters versus in routine police–citizen encounters. This suggests that officers are in fact very good at interacting with people (measured with performance items such as clearly explaining actions, showing empathy, and demonstrating concern for the citizen) but perhaps feel more need to employ these tactics during crisis intervention than during routine encounters. The implications of this possibility and potential for improving police performance based on these findings are described later.
Second, several situational- and suspect-level factors significantly predicted variation in officer performance scores. Officers performed better in situations involving more than one citizen, again potentially speaking to the idea that officers perform better in higher stakes encounters. Officers tended to receive higher performance scores in incidents with Black citizens, which might speak to officers entering situations with Black citizens with greater awareness than situations with non-Black citizens, which would be supportive of the bias hypothesis. Digging deeper into this possibility, we found that the discrepancy in officer performance scores between incidents with Black versus non-Black citizens was in the “observe/assess” categories (officers received on average 99% in this category across incidents with Black citizens, compared with 95% across incidents with non-Black citizens). Whether this increased vigilance speaks to implicit bias associating Black citizens with threat is difficult to discern from the current data. Prior research has certainly indicated that officers tend to associate (subconsciously or semiconsciously) Black citizens with weapons (James, 2018) and that this bias can influence how officers interact in the field (Fachner & Carter, 2015). However, it is also possible that officers paying greater attention in situations with Black citizens reflects a desire to perform well in those situations in order to avoid the label of bias. On this possibility, all officers in the department in question had received implicit bias training within the 12 months prior to the data collection period. As such, we caution against reading any implications into our race finding here.
Officers also performed better in incidents with emotionally disturbed individuals and with noncompliant suspects. Prior research (Todak & James, 2018) has similarly found that officers employ de-escalation tactics such as humanizing citizens and treating them with dignity and respect more so in volatile situations with citizens who are escalated, in crisis, or emotionally disturbed. It makes sense that officers would employ tactics specific to de-escalation more so in situations that actually need to be de-escalated. Our metrics here, however, do not just measure de-escalation, or even just ability to interact with citizens across a range of police–citizen encounters. They measure a broad spectrum of police performance: preplanning prior to encounters, observing and assessing situations both on approach and throughout the encounter, use of tactics (including force), adapting tactics (including de-escalation), interacting with citizens (in crisis and routine encounters), and closing encounters. Our findings could be indicative of officers “trying harder” during situations they perceive to be more challenging.
Implications of Results
Our results have implications for the measurement of police performance, for future research investigating the impact of situational-, suspect-, and officer-level determinants of police action, and for police training and policy. Our use of Vila et al.’s (2018) metrics to measure police performance across a broad range of police activities and situations contributes nuance to the policing literature on what influences police action in the field. We coded 667 incidents and from these were able to explore patterns in how police behave across different types of encounter (e.g., domestic violence vs. investigation of suspicious circumstances), how they varied in different types of job task (e.g., tactics vs. interaction), and which incidents they excelled in more than others (crisis intervention vs. routine police–citizen encounters). By looking specifically at officer performance, rather than the outcomes of police–citizen encounters, we measured what officers were directly accountable and responsible for: their actions. The metrics of Vila and colleagues were developed with a great deal of empirical rigor, and as such, they go far beyond subjective judgments of how well an officer performed in any given encounter. We hope that moving forward other policing scholars interested in measuring police performance will consider using these metrics.
The implications of our study for future research are several. First, we urge researchers to measure police behavior instead of relying solely on outcomes such as use of force (which is typically considered failure on the part of the officer even when appropriate) or citizen complaints (which can either overestimate or underestimate a problem). Even if researchers are focused on a specific area of police performance, such as de-escalation or crisis intervention, metrics such as the ones used in the current study could be employed selectively, using only those directly relevant to the topic in question. Second, we urge researchers to capitalize on body-worn camera (BWC) footage for analyzing officer performance. Although it was feasible and appropriate to code officer performance from incident reports in the current study, we speculate that more information could be gathered from BWC footage and that some limitations (such as exactly what is reported) could be overcome. We address this, and other, study limitations later. Third, despite our emphasis on officer performance, we acknowledge the importance of the outcomes of police–citizen encounters. Outcomes are important for the fair application of law enforcement, as well as building public trust in police legitimacy. Researchers in this field could consider scoring both officer performance and encounter outcome (be it arrest, use of force, or citizen complaint), to determine how probabilistic these outcomes are, and how much they are driven by good or bad performance on the part of the officer. Fourth, the metrics provide a useful way of measuring change in police performance based on training or policy change. For example, the crisis intervention metrics could be used to evaluate the impact of Crisis Intervention Team (CIT) training on police behavior. This would be a more appropriate design than evaluating the effectiveness of CIT training based on reductions in use of force or arrest (which, again, might be outside of the officers’ control).
Finally, our findings have implications for police training and policy. We speculated earlier that officers may perform better in crisis encounters than in routine police–citizen encounters because crisis encounters are more challenging, causing officers to “try harder.” It is also possible, however, that officers perform better in these encounters because they are more likely to receive training in this area. CIT training is a staple in many police departments (Augustin & Fagan, 2011; Teller, Munetz, Gil, & Ritter, 2006). Training officers to treat each and every person they meet with dignity and respect, such as Procedural Justice Training, although often taught at the academy level, is less common during in-service police training (Todak & James, 2018). Given that the majority of interactions officers have with citizens will not require crisis intervention, it might be prudent to stress the importance of using these tactics (e.g., showing empathy, building rapport, and demonstrating concern for the citizen) in everyday routine police–citizen encounters to enhance public trust in police. In this regard, our training implications are very similar to those of Todak and James (2018), who stressed that officers should employ “de-escalation” tactics, such as being respectful and reducing the cop/citizen power differential, not just in situations that require de-escalation, but in all situations to reduce the likelihood of citizen escalation. It is possible that training such as this, which emphasizes the importance of treating each and every citizen in ways that prevent escalation, could reduce the 20% officer performance deficit we observed in the current study.
Study Limitations
Several limitations of the current design need to be acknowledged. First, relying on incident report data allows for the possibility of report bias, inaccuracies, or omissions. Within our study, 333 incident reports had to be dropped due to insufficient data (e.g., alarm calls with no police–citizen interaction). It did not appear that these cases were dropped in a nonrandom manner (e.g., all connected to a particular geographical location or patrol division). However, this speaks to the potential limitation of insufficient detail within incident reports. Although 667 of the 1,000 incident reports randomly selected between 2015 and 2017 had sufficient data within them to score using the metrics developed by Vila et al. (2018), it is possible that another method, such as BWC footage, would have revealed greater variation in officer performance. On the other hand, incident reports recorded details that might not be apparent in BWC footage, such as preplanning on the part of the officer on the way to the call or observation and assessment on approaching the citizen. Similarly, BWC footage would have provided a more objective recording of the incident but likely not have captured details specific to the officer’s mindset throughout the encounter. Second, we were severely limited in our ability to assess the impact of officer-level variables on performance. Given the redaction necessary to obtain incident reports, this was unavoidable. Other methods, such as systematic social observation (Paoline & Terrill, 2008), can provide a much richer source of information on officer-level influences of police action. Relatedly, it is possible that important citizen- and situational-level variables that may have influenced officer behavior were not captured in the incident reports we reviewed. Third, given our lack of information on the officers represented in this study, the generalizability of our findings is unclear. The incidents were randomly selected from a large urban department with nationally comparable diversity (35% non-White, 15% women), but without specific information on the incidents used in our analyses, we caution against broad generalization of the results. Fourth, the data analyzed in this study are from the perspective of the officer, and thus, any implicit or explicit biases the officer held were likely translated into their report of the incident.
Conclusion
We used a novel technique for scoring officer performance using objective, interval-level metrics developed by Vila et al. (2018). Using these metrics, we found that across 667 incidents, officers received an average performance score of 80.5%. Variation in performance was observed across encounter types, with officers tending to perform better in the “higher stakes” encounters, such as crisis intervention, than in more typical and routine police–citizen interactions. Several situational- and suspect-level factors predicted officer performance, including having more than one citizen on scene, dealing with citizens who were in emotional distress or who were not compliant, and interacting with Black citizens. We speculate that these findings might reflect officers being more “on their game” during situations they perceive as challenging, or that these are situations officers receive more training on how to respond to (e.g., CIT training). Consequent implications are for officers to employ the tactics they might reserve for crisis or escalated situations (e.g., demonstrating empathy) during routine police–citizen encounters to promote public trust in police. We also urge researchers to employ these metrics for evaluating training effectiveness (e.g., CIT or de-escalation), to investigate how well officer performance predicts the outcomes of police–citizen encounters, and to expand the use of the metrics to scoring BWC footage.
Footnotes
Declaration of Conflicting Interests
The authors declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.
Funding
The authors received no financial support for the research, authorship, and/or publication of this article.
