Abstract
The collection of large-scale administrative records in electronic form by many cities provides a new opportunity for the measurement and longitudinal tracking of neighborhood characteristics, but one that will require novel methodologies that convert such data into research-relevant measures. The authors illustrate these challenges by developing measures of “broken windows” from Boston’s constituent relationship management (CRM) system (aka 311 hotline). A 16-month archive of the CRM database contains more than 300,000 address-based requests for city services, many of which reference physical incivilities (e.g., graffiti removal). The authors carry out three ecometric analyses, each building on the previous one. Analysis 1 examines the content of the measure, identifying 28 items that constitute two independent constructs, private neglect and public denigration. Analysis 2 assesses the validity of the measure by using investigator-initiated neighborhood audits to examine the “civic response rate” across neighborhoods. Indicators of civic response were then extracted from the CRM database so that measurement adjustments could be automated. These adjustments were calibrated against measures of litter from the objective audits. Analysis 3 examines the reliability of the composite measure of physical disorder at different spatiotemporal windows, finding that census tracts can be measured at two-month intervals and census block groups at six-month intervals. The final measures are highly detailed, can be tracked longitudinally, and are virtually costless. This framework thus provides an example of how new forms of large-scale administrative data can yield ecometric measurement for urban science while illustrating the methodological challenges that must be addressed.
The global move toward digital technology has instigated a marked shift in the practice of science over the past two decades. Surveys and experiments are now often conducted through Internet platforms, Global Positioning System devices and other sensors allow us to track patterns of movement and behavior, and computer processing technology has supported the development of new forms of statistical analysis. A recent consequence of this “digital revolution” is the availability of large-scale administrative data that might prove useful in research. Many public agencies and private companies systematically collect information on services and clients and compile it in digital databases. Some of these are more detailed versions of familiar data (e.g., crime reports) while others (e.g., cell phone records or citizen requests for governmental services) are novel. These “big” or next-generation data offer the opportunity to paint a comprehensive picture of cities, which has the potential to transform theoretical models of urban governance and social behavior (Lazer et al. 2009).
Despite considerable excitement at this prospect, big data have not yet become commonplace in contemporary social science research, in part, it seems, because researchers do not entirely know what to make of them. Without a clear understanding of how these new data sources contribute to ongoing debates and the questions facing our fields, some analysts might reasonably consider their promise overblown. There is thus a need for methodologies that can connect big data with the current practice of social science.
We offer one such “proof of concept” in the present paper, using a database of more than 300,000 citizen-generated requests for public services in Boston, Massachusetts, to measure the conditions of urban neighborhoods across space and time. Building on the methodology of ecometrics (Raudenbush and Sampson 1999), we construct and assess a measure of physical disorder, one of the most widely used and popular concepts in urban sociology, criminology, and public policy. Although the idea of disorder has a long history in sociology, it has received increased attention in recent decades because of the influential “broken windows” theory of crime and urban decline (Raudenbush and Sampson 1999; Ross, Mirowsky, and Pribesh 2001; Skogan 1992; Wilson and Kelling 1982), making it an ideal test case for assessing the potential for ecometrics based on large-scale administrative data.
1. An Ecometric Approach to Disorder
Fifteen years ago, Raudenbush and Sampson (1999) proposed a systematic approach to the measurement of neighborhood social ecology, what they termed “ecometrics.” They encouraged researchers to borrow three tools developed by psychometricians for the measurement of behavior: (1) item-response models, which call for the use of scales whose multiple items vary in their difficulty, allowing greater precision in measurement across neighborhoods; (2) factor analysis, which can be used to address the interrelation between items and to identify one or a few latent constructs that the items reflect; and (3) generalizability theory, which requires criteria for ensuring that a given measurement of a neighborhood is reflective of the “true” score on the characteristic of interest, and not overly influenced by either stochastic or confounding processes. These guidelines, along with the illustrative examples that accompanied them, provided researchers with a step-by-step methodology for developing survey and observational protocols that could measure ecometrics, an approach that has been implemented by hundreds of researchers in dozens of cities.
The advent of large administrative data represents a new opportunity for ecometric study. The gigabytes and terabytes of data being collected by both public and private sector entities are a rich, low-cost resource for measuring the characteristics of neighborhoods, but using them in this manner poses clear methodological and substantive challenges. Administrative data are not collected according to any research question or plan, and thus, in their raw state, lack some of the characteristics expected of researcher-collected data. These challenges are well suited to the techniques common to ecometric study, which can act as a guide to both what is missing or occluded in such data sets, as well as how a researcher might address such issues.
We focus here on one of the most influential concepts in the urban sciences: that of physical disorder, including the iconic “broken window,” the accumulation of litter, the presence of graffiti, or other indications that a neighborhood is poorly maintained and monitored. Such incivilities are often associated with elevated crime rates (Raudenbush and Sampson 1999; Wilson and Kelling 1982) and lower mental, physical, and behavioral health among residents (Burdette and Hill 2008; Caughy, Nettles, and O’Campo 2008; Furr-Holden et al. 2012; Mujahid et al. 2008; O’Brien and Kauffman 2013; Wen, Hawkley, and Cacioppo 2006), attracting attention from a variety of disciplines (Caughy, O’Campo, and Patterson 2001; Cohen et al. 2000; O’Brien and Wilson 2011; Taylor 2001). The importance of physical disorder as a neighborhood characteristic is such that it was also one of the two test cases that Raudenbush and Sampson (1999) used to illustrate their methodological approach to ecometrics.
Physical disorder is traditionally measured either through surveys or detailed neighborhood audits (e.g., Raudenbush and Sampson 1999; Taylor 2001), but the effort and cost associated with such protocols have made whole-city assessments challenging and precise longitudinal tracking nearly impossible. Modern technology used by city agencies, however, is now recording similar information in real time. These databases have the potential to supplement traditional ecometric protocols. One such database is a result of a recent policy innovation called the constituent relationship management (CRM) system. Colloquially known as a mayor’s hotline or a 311 hotline, these systems provide constituents with a variety of channels for directly requesting services from the city government, using phone, Internet, or smartphone applications that communicate requests to the appropriate department. The resultant database is a detailed documentation of constituent needs, leading some of its initial implementers to refer to it as “the eyes and ears of the city”—Jane Jacobs (1961) meets “big data,” as it were. Many of the requests refer to individual instances of physical disorder, such as graffiti or abandoned housing, giving the database the ability to reflect their prevalence across neighborhoods.
Although the potential impacts of big data on science have been overhyped (Pigliucci 2009) and there have been highly visible failures of prediction based on large-scale data (Lazer et al. 2014), CRM databases offer a number of possibilities as an alternative or supplement to expensive new data collection, especially in a time of declining research support. For one, the systems receive hundreds of cases every day, each attributed to a particular address or intersection, giving researchers considerable flexibility in how they might geographically divide cities. CRM also lends itself to the longitudinal tracking of physical disorder, a major advance considering that no whole-city protocol to date has been conducted more than once in a five-year period. Furthermore, the databases differentiate among dozens of case types, allowing greater precision in defining the events that constitute disorder than has previously been possible.
CRM databases were not created for the purposes of disorder research, however, and they have three weaknesses that any methodology must address. First, the substantive content of the databases is noisy, and it is not immediately apparent what they can measure or how they can do so. Some cases, such as requests for graffiti removal, are clear examples of physical disorder, but others, such as scheduling a bulk item pickup, are not. Second, there may be some aspect of data collection that creates systematic biases in measurement. For instance and quite importantly, CRM systems may suffer from skewed reporting in the incidence of disorder across neighborhoods. Last, there is no information about what scale of geographical analysis the databases can support (e.g., census block groups [CBGs] or tracts) or what time spans.
Whereas Raudenbush and Sampson (1999) forwarded criteria for survey- and observation-based measurement across geographical units, CRM databases make clear the need for a new set of guidelines for the use of administrative data in the creation of ecometrics. In the present study, we used the CRM database from Boston to illustrate the multiple analytic steps in the formulation of original ecometrics. This process is reported in three parts, each analysis requiring its own distinct logic, data sources, and analytical approach. Analysis 1 examines the content within a CRM database that reflects physical disorder, and it uses correlational analyses to identify an underlying factor structure. Analysis 2 then addresses the validity of any measure extracted from the CRM database by assessing biases in reporting through original data collection involving neighborhood audits. A method is then developed for using auxiliary measures from within the CRM database to estimate these biases and to help account for over- or underreporting. Analysis 3 then examines the reliability of these composite measures by identifying the spatial and temporal ranges at which their measurement is consistent. In each case, we spell out the assumptions in our analysis that arise from the use of administrative data.
2. Analysis 1: Operationalizing Physical Disorder
When Raudenbush and Sampson (1999) developed their methodology for ecometrics, they emphasized the development of item-response models and their examination through factor analysis, an approach that had been in common use in the field of psychometrics for decades. When conducting a neighborhood audit, for example, a protocol might measure a variety of items that collectively capture an overall pattern. A factor analysis based on their intercorrelations would then help determine which of these items in fact measured the desired construct, while also testing whether they reflected one or multiple constructs regarding the neighborhood’s ecology. The challenge with next-generation data, however, is that it is not immediately apparent what they can measure. Traditionally, research measures are derived from protocols written by the researchers themselves, and their items are based on an underlying theoretical construct. Administrative data are not endowed with an a priori theoretical organization of this sort. CRM databases, for example, are by-products of systems intended to transmit the needs of constituents to the appropriate government agencies, and their organization reflects this function, rather than a deliberate intent to measure neighborhood characteristics. Nonetheless, with thousands of requests spanning more than 150 case types, CRM databases offer a rich store of information for measuring neighborhood characteristics. But before factor analysis can be considered, it falls to researchers to use existing theory to identify those specific items that are likely to be relevant.
Physical disorder is typically defined as any aspect of a neighborhood’s visual cues that reflect a “breakdown of the local social order” (Skogan 1992:2), though this has come to mean two different things in practice. Raudenbush and Sampson’s (1999) measure focused specifically on the publicly visible artifacts of physical incivilities that denigrated the public space according to the broken windows theory, such as graffiti and various forms of litter indicating illegal or typically problematic behavior (e.g., used condoms, empty beer bottles, hypodermic needles). A variety of other researchers have expanded this definition to include any item that might be evidence that “spaces are not being kept or used properly” (Taylor 2001:5). This has led to a variety of protocols that also include items that, although not the result of flagrant incivilities, reflect an overall pattern of neglect, including deteriorating or abandoned housing, unkempt lawns or vegetation, and litter of all kinds (Caughy et al. 2001; Cohen et al. 2000; Furr-Holden et al. 2008; O’Brien and Wilson 2011; Ross and Mirowsky 1999; Rundle et al. 2011; Skogan 1992; Taylor 2001). One important consequence of this approach is that it extends measurement to elements of the neighborhood that are technically private but whose appearance and use are a visible part of the local scenery, like front porches, lawns, and the facades of houses. Despite this distinction, factor analyses on such protocols often identify a single factor, though Ross and Mirowsky (1999) found evidence for two latent constructs they referred to as disorder and decay, approximating the dichotomy described here.
To make the greatest use of the CRM database, we identify case types that reflect either private neglect or public denigration. Some will correspond directly to items in previous methodologies, such as a report of an abandoned house or a request for graffiti removal. But others will be novel, either because they are too uncommon to be measured through one-time neighborhood audits (e.g., cars illegally parked on a lawn) or because they are more likely to be experienced in private spaces (e.g., rodent infestation). This latter opportunity to “look” at the conditions inside houses could potentially add a new dimension to the measurement of disorder, one that has been hinted at in previous protocols that examine visible deterioration but has not been completely accessible. Altogether, it is possible to construct a battery of “items” that offers a greater breadth and depth than any previous measure of physical disorder. The second stage of the analysis will then use factor analysis to explore the dimensionality of these items. Given their large number, it seems feasible that they will describe not a unitary construct but one with multiple aspects that are related but distinct.
2.1. The CRM Database
Boston’s CRM system received 365,729 requests for service via its three channels (hotline calls, Internet self-service portals, and smartphone applications) between March 1, 2010, and June 29, 2012; of those, 334,874 had geographic references. March 1 was chosen as the start date because that is when a standardized data entry form was implemented.
The requests for service included 178 different case types. A subset of types reflected examples of physical disorder arising from either human negligence or denigration of the neighborhood (e.g., illegal dumping, abandoned bicycles). Other case types either did not indicate physical disorder (e.g., general request, bulk item pickup) or indicated deterioration that was not the fault of local residents (e.g., streetlight outage).
Each case record included the date of the request, the address or intersection where services were to be rendered, and the case type. These locations came from a master geographical database of the addresses and intersections of Boston that was based on the city’s tax assessment and roads data, with each address keyed to the appropriate census geographies (from the 2005–2009 American Community Survey [ACS], the most recent census with socioeconomic data when the database was built). The main measures for this analysis were counts of events that occurred in a neighborhood, which we operationalize as the CBG. CBGs are smaller than the more typically used census tract (average population ≈ 1,000 vs. 4,000), but the volume of CRM calls enables measurement and analysis at this finer scale. Boston contains 544 CBGs with a substantial population.
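The tabulation of event counts by CBG can be illustrated with a short Python sketch. It is a minimal example under stated assumptions rather than the procedure used in the analysis: the record layout, the CBG identifiers, and the case-type labels are all hypothetical, and records without a geographic reference are simply skipped.

```python
from collections import Counter, defaultdict

# Hypothetical records: (census block group ID, case type).
# IDs and case-type labels are illustrative, not taken from the Boston data.
records = [
    ("250250001001", "Graffiti Removal"),
    ("250250001001", "Illegal Dumping"),
    ("250250001001", "Graffiti Removal"),
    ("250250002002", "Bulk Item Pickup"),
    ("250250002002", "Illegal Dumping"),
    (None, "Graffiti Removal"),  # no geographic reference; excluded below
]


def count_cases_by_cbg(records):
    """Return {cbg_id: Counter({case_type: n})}, skipping ungeocoded records."""
    counts = defaultdict(Counter)
    for cbg_id, case_type in records:
        if cbg_id is None:
            continue
        counts[cbg_id][case_type] += 1
    return counts


counts = count_cases_by_cbg(records)
for cbg_id in sorted(counts):
    print(cbg_id, dict(counts[cbg_id]))
```

Each CBG then carries one count per case type, which is the form of input assumed by the factor analyses that follow.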
2.2. Defining Physical Disorder from Case Types
An initial examination of the 178 case types produced a list of 33 that might be evidence of human neglect or denigration in public spaces (see Table 1). Counts were tabulated for each of these 33 case types for each CBG over the period covered by the database. As a first step to identifying an underlying factor structure, an exploratory factor analysis was run on the 33 count variables (Tabachnick and Fidell 2006). The final solution produced five factors with eigenvalues > 1. These factors, whose constituent types and loadings are listed in Table 1, might be described as follows:
Housing issues, including 11 items referring to poor maintenance by landlords (e.g., poor heating, chronic dampness) and the presence of pests (e.g., bedbugs).
Uncivil use of space, including seven items that reflect how private actions can negatively affect the public sphere (e.g., illegal rooming house, poor condition of property, abandoned building).
Big-building complaints, including three different case types regarding problems with the upkeep of big buildings such as condominiums.
Graffiti, including two different case types regarding graffiti, one generated by constituents, the other by the Public Works Department.
Trash, including five items related to incivilities regarding trash disposal: illegal dumping, improper storage of trash barrels, empty litter baskets, abandoned bicycles, and rodent activity. (The last item, not itself an incivility, is a consequence of poor trash storage.)
Five items did not load on any factor and were discarded before further analysis. Four other items that loaded at <.4, though, were retained on the basis of conceptual similarity: abandoned buildings loaded at .36 on the factor of uncivil use; requests to empty a litter basket loaded at >.3 on both trash and graffiti and were retained on the former factor on the basis of their substantive content; and two items were added to the housing factor because they were conceptually identical to the definition of the factor and likely did not load in the factor analysis because of their low frequency.
Table 1. Counts of Case Types That Reflect Human Neglect or Denigration of the Neighborhood, Including the Factors and Loadings from an Exploratory Factor Analysis
Note: For factor analysis, n = 544 census block groups. An iterated principal-factors estimation was used with a promax rotation. PWD = Public Works Department.
Items did not load in the initial factor analysis but were added because their content was similar to that of the factor or of one or more of its constituent items.
Item loaded at >.3 on both the trash and graffiti factors. It was retained on the trash factor for reasons of content.
2.3. Exploring the Dimensions of Physical Disorder
New measures were created from these five factors to evaluate their higher order factor structure. We accomplished this by summing counts for each of the constituent case types for each CBG over the period covered by the database. These measures had substantial outliers and were all log-transformed before analysis. Correlations between them were all significant (except for uncivil use and graffiti), although they were modest if they are considered to be manifestations of a superordinate construct (see Table 2); only two bivariate correlations were above r = .4 (housing issues and uncivil use of space, graffiti and trash), and one other was above r = .3 (housing issues and big-building complaints). Given both content and the pattern of correlations, the five factors appear to suggest two main groupings: denigration of the public space, composed of trash and graffiti, and poor care or negligence for private space, composed of big-building complaints, housing, and uncivil use of space.
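The construction of these submeasures can be sketched as follows. This is an illustration only: the per-CBG counts and case-type keys are hypothetical, and because the text does not specify how zero counts were handled, the sketch assumes a log(1 + x) transformation.

```python
import math

# Hypothetical case-type groupings for two of the five factors.
GRAFFITI = ["graffiti", "pwd_graffiti"]
TRASH = ["illegal_dumping", "abandoned_bicycle"]

# Hypothetical counts for four CBGs (one dict per CBG).
cbg_counts = [
    {"graffiti": 3, "pwd_graffiti": 1, "illegal_dumping": 4, "abandoned_bicycle": 0},
    {"graffiti": 0, "pwd_graffiti": 0, "illegal_dumping": 1, "abandoned_bicycle": 0},
    {"graffiti": 10, "pwd_graffiti": 5, "illegal_dumping": 2, "abandoned_bicycle": 1},
    {"graffiti": 1, "pwd_graffiti": 0, "illegal_dumping": 0, "abandoned_bicycle": 0},
]


def log_composite(type_counts, constituent_types):
    """Sum counts over a factor's case types, then apply log(1 + x).

    The +1 offset is an assumption made here so that zero counts are defined.
    """
    total = sum(type_counts.get(t, 0) for t in constituent_types)
    return math.log1p(total)


def pearson_r(xs, ys):
    """Pearson correlation between two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


graffiti_scores = [log_composite(c, GRAFFITI) for c in cbg_counts]
trash_scores = [log_composite(c, TRASH) for c in cbg_counts]
r = pearson_r(graffiti_scores, trash_scores)
print(f"r(graffiti, trash) = {r:.2f}")
```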
Table 2. Descriptive Statistics for and Correlations between Five Submeasures of Physical Disorder
Note: N = 544 census block groups. All variables were log-transformed before correlations.
*p < .05. **p < .01. ***p < .001.
Confirmatory factor analysis, via structural equation modeling, was used to compare this two-factor structure with a one-factor structure in which all five measures were loaded together on an overarching measure of physical disorder. The two-factor model was superior by all measures. It had better fit (comparative fit index [CFI] = .82 vs. .61, standardized root mean square residual [SRMR] = .07 vs. .10, Δχ²(df = 1) = 89.27, p < .001) and accounted for 42 percent, as opposed to 26 percent, of the variation across factors. The model estimated the correlation between the two factors at r = .38 (p < .001). Although the two-factor model was stronger, note that it still had a poor fit. Because the hypothesis in question was the efficacy of a one- or two-factor model, there were no assumptions that the components of each were completely independent. We thus took the exploratory step of examining modification indices, leading to the addition of a covariance between uncivil use of space and trash to the model, greatly improving fit (CFI = .95, SRMR = .05, χ²(df = 5) = 24.26, p < .001). The final parameter estimates for this model are presented in Figure 1.
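For readers who wish to check the reported significance levels, the upper-tail probability of a chi-square statistic can be computed with the Python standard library alone. The function below is an illustrative helper (its name and form are not from the original analysis) that uses the standard closed-form recurrence for integer degrees of freedom; it is applied to the two chi-square values reported above.

```python
import math


def chi2_sf(x, df):
    """Upper-tail probability P(X > x) for a chi-square variable with integer df.

    Starts from the closed form for df = 1 (odd) or df = 2 (even) and applies
    Q_{v+2}(x) = Q_v(x) + (x/2)^(v/2) * exp(-x/2) / Gamma(v/2 + 1).
    """
    if df < 1 or x < 0:
        raise ValueError("df must be a positive integer and x non-negative")
    if df % 2 == 1:
        q, v = math.erfc(math.sqrt(x / 2)), 1
    else:
        q, v = math.exp(-x / 2), 2
    while v < df:
        q += (x / 2) ** (v / 2) * math.exp(-x / 2) / math.gamma(v / 2 + 1)
        v += 2
    return q


# Chi-square difference between the one- and two-factor models (df = 1).
p_difference = chi2_sf(89.27, 1)
# Chi-square of the modified two-factor model (df = 5).
p_final_model = chi2_sf(24.26, 5)

print("difference test p < .001:", p_difference < 0.001)
print("final model p < .001:", p_final_model < 0.001)
```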

Figure 1. Estimated relationships between categories of physical disorder with standardized parameters from best-fitting confirmatory factor analysis.
Analysis 1 thus suggests that the CRM database is at least in principle capable of measuring two distinct but related aspects of physical disorder: private neglect and public denigration. This result provides a more nuanced measurement than existing scales of physical disorder, particularly with the ability to go beyond elements visible from the street and to access conditions within buildings. Many previous protocols for measuring disorder have combined items from each of these categories (e.g., abandoned or deteriorating housing with graffiti), and thus it is not surprising that the two constructs are correlated. It may also explain why previous longitudinal work has found that such items become uncoupled across time (Taylor 2001). It is important to note that correlational constructs of this sort reflect a shared process, but it is not clear what this process actually is. It is possible, for example, that housing issues and uncivil use of private space are generated by the same behavioral tendencies, but it is equally feasible that one of these causes the other, or even that they are mutually reinforcing. These are questions that go beyond the scope of our analysis and thus are ripe for future study. For present purposes, the reliable co-occurrence of these elements across neighborhoods provides two different sets of measures we subject to an ecometric analysis: two of a generalized sort, private neglect and public denigration, and five lower level categories that are more specific, housing, uncivil use of space, big-building complaints, graffiti, and trash.
3. Analysis 2: Validity and Bias in Administrative Data
Although it is tempting to treat the CRM database as the “eyes and ears of the city,” and thereby a direct reflection of neighborhood conditions across space and time, its accuracy in this regard cannot be assumed, because each case in the database is in fact the coincidence of two events: the issue itself and the decision of a resident or passer-by to report it. This fact suggests that assumptions must be imposed to analyze the data. In guiding this process, we invoke a simple behavioral model for the distribution of calls defined not only by the probability of an issue in a given space (P1), but also the probability that it will be reported (P2).¹ If P2 varies across neighborhoods, it could in turn create systematic biases in any measure based on the CRM system. To illustrate, in regions where residents are not inclined to make such calls, an issue might sit unnoted for a lengthy period, or even indefinitely, creating a gap or false negative in the database. Conversely, the residents of some neighborhoods might be highly vigilant, generating multiple reports for a single issue, leading to false positives that exaggerate the actual prevalence of disorder. This variation in P2 might be referred to as the civic response rate, which we must account for to establish validity for the measures identified in analysis 1.
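A toy calculation makes the consequence of this model concrete. The sketch below assumes, for illustration only, that issue occurrence and reporting are independent, so that the expected number of reports is the number of locations multiplied by P1 and P2; the function names and all numbers are hypothetical, and the sketch does not represent duplicate reports of a single issue.

```python
def expected_reports(n_sites, p_issue, p_report):
    """Expected reports if each site has an issue with probability p_issue (P1)
    and an existing issue is reported with probability p_report (P2).

    Independence of the two events is an assumption of this sketch.
    """
    return n_sites * p_issue * p_report


def implied_issue_rate(reports, n_sites, p_report):
    """Back out P1 from a report count when P2 is known (illustration only)."""
    return reports / (n_sites * p_report)


# Two hypothetical neighborhoods with identical disorder (P1 = 0.10)
# but different civic response rates (P2 = 0.80 vs. 0.20).
vigilant = expected_reports(1000, 0.10, 0.80)
disengaged = expected_reports(1000, 0.10, 0.20)

# Raw counts differ by the ratio of response rates despite equal disorder.
print("raw report ratio:", vigilant / disengaged)

# Dividing by the response rate recovers the same issue rate in both places.
print(implied_issue_rate(vigilant, 1000, 0.80),
      implied_issue_rate(disengaged, 1000, 0.20))
```

The point of the exercise is simply that raw counts confound P1 with P2; the adjustment developed in this analysis estimates the civic response rate from the data rather than assuming it is known.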
In pursuing this goal, we develop a methodology that accounts for the local civic response rate, producing final measures that more accurately reflect neighborhood conditions. We focus particularly on issues in the public domain, such as streetlight outages, as these are likely to be the most vulnerable to such biases, because the responsibility for reporting them belongs to no specific individual but to the neighborhood as a whole. Developing this methodology entails three steps that use data from the CRM system and a series of neighborhood audits.
First, there must be an independent or “objective” measure of response rate that captures the propensity of a neighborhood’s residents or visitors to report a given issue. We use two such measures, one identifying streetlight outages and the other evaluating sidewalk quality.
Second, it is necessary to create a measure of civic response rate that is based on measures from within the CRM system. This is critical because such a measure would allow the continual estimation of response rate, and in turn the production of valid measures of disorder, in lieu of regular neighborhood audits. In the next subsection, we develop the theoretical basis for how particular patterns in the CRM database might be reflective of the civic response rate. By examining the multivariate relationships between these internal measures and the objective measures of response rate, it is possible to construct a new measure from within the CRM system that can be used as an adjustment factor.
Third, we develop an equation that combines counts of cases with the adjustment factor to calculate final measures of physical disorder. This requires a measure of objective physical disorder, against which it is possible to calibrate the adjustment factor, determining how heavy its influence should be. This is done through an additional neighborhood audit that assessed loose litter on streets and sidewalks, an item that has been central to measures of physical disorder. In sum, this process produces a complete methodology for translating a raw database of CRM calls into a measure of physical disorder across a city. To conclude, we examine the construct validity of the measures produced by this methodology, comparing it with a series of other demographic, economic, and social indicators traditionally associated with disorder.
3.1. Sources of the Civic Response Rate
Reporting rates in the CRM database for public issues can be seen as having two distinct elements. The first entails knowledge of the CRM system and a willingness to use it. The second is a decision to take action or responsibility for the public space. To the former, a large part of the battle for any public service agency is informing residents of available services and making them comfortable with using them. The CRM system also requires direct interaction between constituents and the government, something that those from disadvantaged or minority groups are sometimes less inclined toward, either because they distrust the government in general or because they do not expect the requested services to actually be delivered (Putnam 1993; Verba, Schlozman, and Brady 1995). The sum of these effects might be described as engagement, or the likelihood that a person would use the CRM system in any case. Given the evidence that such patterns cluster demographically, it is likely to vary across neighborhoods, potentially contributing to measurement bias.
Knowing of and being willing to use the CRM system is not sufficient for using it to report a public issue, however. When calling in a report about something like graffiti or illegal dumping, an individual is taking responsibility for the public space, something that might have a different set of motivations than a call addressing personal needs (e.g., a request for a bulk item pickup). There are a number of mechanisms that may cause this concern for public space to vary systematically across neighborhoods. First, such variation may be the result of differences in the cognitive perception of disorder. One striking finding of citywide neighborhood surveys is that residents’ ratings of local disorder vary within the same neighborhood and only moderately correlate with observational measures such as video or researcher ratings (Franzini et al. 2008; Sampson and Raudenbush 2004; Taylor 2001). This would indicate that individuals and communities vary in their definition of “disorder,” something that might play an important role in how likely they are to feel compelled to report such issues. At the same time, it reveals that survey reports by definition are not “objective” measures of disorder either.
A second mechanism may be the variation across individuals in the level of personal responsibility they feel for the public space. For example, homeowners tend to be more engaged with public maintenance (O’Brien 2012), likely because of the long-term investment they have made by purchasing a house (Fischel 2005). Consistent with this, our preliminary analysis of the CRM data indicated that homeowners are four times more likely to report public issues than renters (O’Brien 2015). A third mechanism may be that the accumulation of physical disorder inclines residents to see the act of reporting new issues as useless, as such action will be unable to overcome the consistent generation of such problems (Ross et al. 2001). The truth may involve any one of these mechanisms, or some combination thereof, but the point stands that concern for public space could contribute to cross-neighborhood variations in the rate of reporting actual instances of physical disorder.
There are two features of the CRM database that will prove useful in the development of measures that reflect engagement and concern for public space. First, as noted, CRM case records indicate the type of services requested. From these, there is a subset that identifies issues in the public space. This subset overlaps with, but is not equivalent to, the subset of case types regarding physical disorder. Second, users of the CRM system are able to register, creating an account for tracking their reports.² Reports made by a registered user are then attributed to the individual’s account using an anonymous code, making it possible to determine how often an individual uses the system and to approximate the individual’s home location. Although this ignores those individuals who have used the system but not established accounts, the information still provides insights into an individual’s calling patterns that we would not otherwise have.
The most direct way to measure engagement would be to tabulate the number of individuals who do and do not know about the CRM system. This can be approximated as the proportion of neighborhood residents who have accounts with the CRM system. A less direct approach would be to identify case types whose need might be even across the city, that is, for which P1 would be constant across neighborhoods. In these cases, measuring their geographic distribution would then provide access to P2, the likelihood of using the system. For example, we might expect the need for general requests, which entail questions about city services and other government-related items, to be driven solely by interest and engagement with government. Another example would be requests for sanitation services to pick up bulk items. It is reasonable to assume that residents of all neighborhoods have a similar need for this service, as it is not determined by external, neighborhood processes. A third example of an evenly distributed issue is the need for snowplows during a snowstorm, when all neighborhoods should have a roughly equal need for snowplows, controlling for certain infrastructural characteristics (e.g., the total road length, dead ends). We then have four candidate measures of a neighborhood’s engagement: total registered users, general requests, bulk item pickups, and snowplow requests.
Measuring concern for public space requires a focus on reports that document a case of public deterioration and, in turn, a constituent’s decision to take action regarding it. This requires a list of case types that indicate a public issue. It is not possible to use any one of these types as a benchmark, as was done with general requests, bulk item pickups, and snowplow requests, because the very issue at hand is whether public issues are uniformly distributed across the city. Instead, we focus on the other two techniques described for engagement. First, it is possible to identify a subset of users who have made one or more reports of a public issue. This could be used to tabulate the number of individuals in each neighborhood who have used the CRM system for such a purpose. Additionally, some of these “public reporters” make a disproportionate number of reports. Given their zeal for neighborhood maintenance, these individuals might be referred to as “exemplars.” Public issues in a neighborhood with a greater number of either average or exemplar public reporters would be expected to instigate reports to the CRM system more often and more quickly. Second, it is possible to measure the proportion of reports of public issues that were made by registered users. This would indicate how consistently such calls are part of a sustained relationship between a resident and government services. This amounts to three measures of concern for public space: (1) public reporters, (2) exemplars, and (3) proportion of calls made by registered users. Importantly, none of these measures is fully independent of engagement itself. For example, regardless of one’s inclination to report a streetlight outage, he or she must first know that the CRM system exists. Consequently, the analysis that follows allows these measures to load on one or both of these constructs.
3.2. Estimating the Civic Response Rate from the CRM Database
To be concurrent with the neighborhood audits (described below), the current analysis uses only CRM reports from 2011, amounting to 161,703 cases with geographic reference across 154 case types. This analysis incorporates two new ways of using the CRM database. First, similar to the identification of case types reflecting physical disorder in analysis 1, 59 case types were identified as reflecting issues in the public space (e.g., streetlight outage, pothole repair, graffiti removal; a complete list appears in Appendix A). Such a report indicates a concern for the maintenance of the public space on the part of the reporter. Other case types reflected personal needs rather than public concerns (e.g., general request, bulk item pickup). Second, all individuals who have registered with the CRM system have anonymous ID codes that are appended to each of their reports. In 2011, there were 29,439 constituent users, accounting for 38 percent of all requests for service. 3 The ID codes make it possible to construct a database of users with variables describing each individual’s pattern of reporting across time and space. This two-part database of calls and users was used to calculate the measures hypothesized to reflect a CBG’s civic response rate. 4
The call database was used to measure four of the seven proposed measures. Bulk item pickups (bulk items) and general requests were measured as the number of such requests occurring within a CBG. Proportion of public issues reported by registered users was measured as the number of public issues reported in a CBG attributed to a registered user divided by the total number of public issues reported in the CBG. Snowplow requests were first tabulated as a count for each CBG but were then adjusted for the total population, road length, and the length of dead-end roads. 5
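As a rough illustration of the count-based tabulations just described, the following Python sketch assumes a simplified record layout (CBG identifier, case type, a public-issue flag, and an anonymous user ID that is missing for unregistered callers). The field names, case-type labels, and sample records are hypothetical, and the snowplow adjustment is left out because its controls are only summarized here.

```python
from collections import defaultdict

# Hypothetical records: (cbg_id, case_type, is_public_issue, user_id or None).
calls = [
    ("cbg_A", "bulk item pickup", False, "u1"),
    ("cbg_A", "graffiti removal", True, "u1"),
    ("cbg_A", "pothole repair", True, None),
    ("cbg_A", "general request", False, None),
    ("cbg_B", "streetlight outage", True, "u2"),
    ("cbg_B", "streetlight outage", True, "u3"),
    ("cbg_B", "bulk item pickup", False, None),
]

def call_based_measures(records):
    """Tabulate per-CBG counts and the registered share of public-issue reports."""
    tallies = defaultdict(lambda: {"bulk": 0, "general": 0, "public": 0, "public_reg": 0})
    for cbg, case_type, is_public, user_id in records:
        t = tallies[cbg]
        if case_type == "bulk item pickup":
            t["bulk"] += 1
        elif case_type == "general request":
            t["general"] += 1
        if is_public:
            t["public"] += 1
            if user_id is not None:  # report attributed to a registered account
                t["public_reg"] += 1
    measures = {}
    for cbg, t in tallies.items():
        measures[cbg] = {
            "bulk_items": t["bulk"],
            "general_requests": t["general"],
            # Left undefined (None) when a CBG has no public-issue reports.
            "prop_public_registered": t["public_reg"] / t["public"] if t["public"] else None,
        }
    return measures

measures = call_based_measures(calls)
```

In this toy input, half of the public-issue reports in `cbg_A` come from a registered account, whereas all of those in `cbg_B` do.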
The other three measures were calculated from the database of registered users, which included three main pieces of information: (1) the total number of calls a user had made, (2) the total number of calls a user had made regarding a public issue, and (3) an estimate of the user’s home location, based on the locations at which he or she requested services. 6 In 2011, 46 percent of registered users were public reporters. Of these, 87 percent reported two or fewer public issues, though there were those who were considerably more active (18 made more than 100 reports). Given this distribution, total users were measured as the number of registered users whose estimated locations fell within the CBG, average public reporters were measured as the number of a CBG’s total users who had reported one or two public issues during 2011, and exemplars were measured as those who had reported three or more public issues.
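The classification rule above (one or two public reports for an average public reporter, three or more for an exemplar) can be sketched as follows. The user table layout and the sample rows are hypothetical, and the home-location estimate is taken as given rather than derived.

```python
from collections import Counter

# Hypothetical user table: (user_id, estimated_home_cbg, n_public_reports_in_year).
users = [
    ("u1", "cbg_A", 0),
    ("u2", "cbg_A", 1),
    ("u3", "cbg_A", 2),
    ("u4", "cbg_A", 3),
    ("u5", "cbg_B", 120),
    ("u6", "cbg_B", 0),
]

def classify_user(n_public_reports):
    """Apply the thresholds stated in the text: 1-2 reports vs. 3 or more."""
    if n_public_reports >= 3:
        return "exemplar"
    if n_public_reports >= 1:
        return "average_public_reporter"
    return "other_user"

def user_based_measures(user_rows):
    """Count total users, average public reporters, and exemplars per CBG."""
    total_users, avg_reporters, exemplars = Counter(), Counter(), Counter()
    for _user_id, home_cbg, n_public in user_rows:
        total_users[home_cbg] += 1
        label = classify_user(n_public)
        if label == "average_public_reporter":
            avg_reporters[home_cbg] += 1
        elif label == "exemplar":
            exemplars[home_cbg] += 1
    return total_users, avg_reporters, exemplars

total_users, avg_reporters, exemplars = user_based_measures(users)
```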
3.3. Objective Measures from Neighborhood Audits
Objective neighborhood conditions were assessed through two separate audits. One identified streetlight outages and the level of street garbage in 72 of Boston’s 156 census tracts (46 percent) between June 1 and August 31, 2011. In total, 4,239 street segments were assessed, and 244 streetlight outages were identified, each attributed to the nearest address. Garbage was rated for each street block on a five-point scale, with higher scores indicating more and larger piles of garbage. More detail on this protocol is provided in Appendix B in the online journal.
In the second audit, a consulting group hired by the City of Boston’s Public Works Department assessed the quality of all of the city’s sidewalks between November 2009 and April 2012. The unit of analysis was each continuous stretch of sidewalk that ran from intersection to intersection (n = 27,388). For each sidewalk, the assessors noted the proportion of panels that required replacement because they were cracked or broken, and they subtracted this from the total. This generated a 0-to-100 measure of sidewalk quality (with 100 indicating a sidewalk with no panels requiring replacement).
Streetlight outages and sidewalks were each cross-referenced with the CRM database to identify reports regarding them. For streetlight outages, we sought to identify the date on which each was reported. This was defined as the earliest case of an outage reported on the street segment in question that was fixed by the city after the date an auditor noted the outage. 7 This was then used to create a series of dichotomous measures indicating whether the outage had been reported by a constituent within a certain time window (e.g., one month). 8 For sidewalks, all requests for sidewalk repair were joined to the nearest sidewalk polygon from the same road. We were able to exclude those created by city employees because an additional code was included with such cases. The count of constituent reports for every sidewalk was then tabulated. Of the 27,388 sidewalk polygons, 1,168 generated requests for repair (4 percent, range = 1–19 requests).
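The streetlight matching rule can be sketched in Python as below. This is an illustration under stated assumptions, not the authors' procedure: "one month" is taken as 30 days, the window is measured from the audit date, reports filed before the audit count as reported, and the case tuples are hypothetical.

```python
from datetime import date, timedelta

def earliest_relevant_report(audit_date, cases):
    """Return the earliest report date among cases fixed after the audit date.

    `cases` is a list of (report_date, fix_date) pairs for one street segment;
    fix_date is None when the case was never closed (hypothetical layout).
    """
    eligible = [reported for reported, fixed in cases
                if fixed is not None and fixed > audit_date]
    return min(eligible) if eligible else None

def reported_within(audit_date, cases, window_days=30):
    """Dichotomous indicator: was the audited outage reported within the window?"""
    first = earliest_relevant_report(audit_date, cases)
    return first is not None and first <= audit_date + timedelta(days=window_days)

audit = date(2011, 7, 1)
segment_cases = [
    (date(2011, 6, 20), date(2011, 6, 25)),  # fixed before the audit: a different outage
    (date(2011, 7, 10), date(2011, 7, 15)),  # fixed after the audit: eligible
]
within_month = reported_within(audit, segment_cases)
within_week = reported_within(audit, segment_cases, window_days=7)
```

Varying `window_days` yields the series of dichotomous measures described above.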
Because the three audit-based measures each described events or conditions on a single street segment within a neighborhood, multilevel models were run to create CBG-level measures (Raudenbush et al. 2004). These models controlled for microspatial characteristics of the street (e.g., zoning), and the second-level residuals were then used as CBG-level measures. Three outcome measures for these models were determined: the likelihood of a sidewalk generating one or more requests; the likelihood of a streetlight outage being reported within one month; and the continuous five-point measure of garbage. See Appendix C in the online journal for more detail on these models and the specification of outcome measures.
Two deviations from this approach are important to note. First, the number of outages per CBG was small for a multilevel model (244 outages in 127 CBGs), so the models were run instead with tracts as the second level (n = 56 tracts with outages). Each CBG then took the measure for its containing tract. Second, because sampling for the garbage audit occurred at the tract level, CBGs varied in the number of street segments that were rated. In order to be certain that neighborhood-level measures were reliable, the ensuing analysis was limited to the 196 CBGs with 10 or more street segment measures (see also Raudenbush and Sampson 1999).
3.4. Evaluating the Proposed Model of Civic Response Rate
Descriptive statistics for both objective measures of response rate and CRM-based measures proposed to estimate response rate, as well as the correlations among them, are reported in Table 3. All tabulated count variables had skewed distributions, with a long tail of CBGs that used the system extensively, leading us to log-transform them before correlational and regression analyses. As hypothesized, all variables indicating use of the system (general requests, bulk item pickups, all users, users reporting public issues, exemplary reporters) were strongly correlated (r = .36–.93, all p values < .001). Because of the very high correlation between all users and average users reporting public issues (r = .93 in the full sample and r = .95 in the subsample with values for all measures), the two were deemed to be the same measure. The “all users” measure was thus dropped from all subsequent analyses to avoid issues of multicollinearity.
Table 3. Descriptive Statistics for and Correlations between Proposed Indicators of Response Rate
Note: N = 541 for all measures except propensity to report streetlight outages within one month (n = 195). Descriptive statistics reported for all census block groups (CBGs); correlations including all CBGs with measures on both variables reported above the diagonal, correlations for those CBGs with values for all measures (n = 195) reported below the diagonal. See text for more details on the derivation of each measure.
Log-transformed before correlations to account for skewed distribution.
Deviation from regression equation controlling for key variables; see text for more details.
p < .10. *p < .05. **p < .01. ***p < .001.
Requests for sidewalk repairs and propensity to report streetlights were modestly correlated (r = .18, p < .05). Each also shared stronger correlations with those measures from the CRM database intended to measure concern for the public space (public reporters, exemplars, percentage of public issues reported by registered users) than those intended to measure engagement (general requests, bulk item pickups, all users). The reverse was true for requests for snowplows, as predicted. They were significantly positively correlated with the sidewalk measure (r = .14, p < .05) but not the streetlight outage measure (r = −.12, ns) and they had a stronger correlation with measures of engagement than with concern for public space.
Structural equation modeling was used to determine how well the proposed constructs fit the data. The model analyzed those 195 CBGs with a measure for propensity to report streetlights. The best-fitting model, depicted in Figure 2, had good fit (CFI = .95, SRMR = .06, χ²(df = 9) = 25.44, p < .01), and it was quite similar to the model proposed in the introduction to this section. The measures derived from the CRM system did indeed separate into the two proposed latent constructs, engagement and concern for public space. It is notable, however, that the two objective measures of civic response rate loaded on the latent construct of concern for public space (sidewalks: β = .34, p < .001; streetlight outages: β = .18, p < .05) but not on engagement.

Figure 2. Relationships between objective and CRM-derived measures of response rate with standardized parameters from the best-fitting structural equation model.
As with the models in analysis 1, the novelty of the various measures required that we take a partially exploratory approach, tweaking the theoretically based model to specify the best fit. Consequently, there were four alterations that bear mentioning:
(1) With the removal of the measure of all users, it was necessary to have average reporters of public issues load on both latent constructs (engagement: β = .50, p < .001; concern for the public space: β = .52, p < .001).
(2) The measure of general requests was also removed because its strong correlation with other variables made the factor structure unstable.
(3) The percentage of public calls from registered users was discarded because doing so strengthened the model’s fit.
(4) Total population was used as a control variable predicting average reporters of public issues (β = .13, p < .05) and exemplars (β = .16, p < .01).
Modification indices for the final model suggested that no significant bivariate relationships had been omitted.
3.5. Evaluating the Adjustment Factor
The results of the previous model suggest that the estimate of the civic response rate, and therefore the desired adjustment factor, is based on measures of concern for the public space. A composite measure for each CBG was created using the parameter estimates from Figure 2. We then established the efficacy of this measure as an adjustment factor by examining how well it improved the relationship between the raw measures from analysis 1 and objective measures of physical disorder, as indicated by street garbage. The analysis was performed in two parts. First, the raw counts of case types in each category (log-transformed to better approximate normality) were entered into five separate regressions predicting the level of street garbage in a CBG. Second, an adjustment factor was created for each count as an interaction with the civic response rate, which was then added to the corresponding regression. 9 This analysis was limited to residential neighborhoods (excluding regions dominated by institutions, parks, or downtown areas), as the predictive relationship between local behavior and loose litter would be most clear in these areas; in other areas, litter would be subject to dynamics that would not necessarily influence other components of physical disorder such as graffiti in the same way (n = 135 residential CBGs).
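The two-part comparison can be illustrated with a small Python sketch: one regression with the log count alone, and a second that adds the count-by-response interaction, with the change in R² as the quantity of interest. The data below are synthetic and constructed for illustration only, the helper `ols_r2` is a hypothetical stand-in for whatever statistical software was used, and the exact form of the adjustment factor is assumed to be a simple product term.

```python
def ols_r2(X, y):
    """Fit y on the columns of X (intercept column included) by OLS; return R^2."""
    k = len(X[0])
    # Normal equations (X'X) b = X'y, assembled as an augmented matrix.
    M = [[sum(r[i] * r[j] for r in X) for j in range(k)]
         + [sum(r[i] * yi for r, yi in zip(X, y))] for i in range(k)]
    # Gauss-Jordan elimination with partial pivoting.
    for col in range(k):
        pivot = max(range(col, k), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(k):
            if r != col:
                f = M[r][col] / M[col][col]
                M[r] = [a - f * b for a, b in zip(M[r], M[col])]
    coefs = [M[i][k] / M[i][i] for i in range(k)]
    fitted = [sum(b * x for b, x in zip(coefs, row)) for row in X]
    mean_y = sum(y) / len(y)
    ss_res = sum((yi - fi) ** 2 for yi, fi in zip(y, fitted))
    ss_tot = sum((yi - mean_y) ** 2 for yi in y)
    return 1 - ss_res / ss_tot

# Synthetic CBG-level values (not from the study): log counts of a case
# category, a civic response score, and a garbage rating built from both.
log_count = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0]
response = [1.0, -1.0, 0.5, -0.5, 1.5, -1.5]
garbage = [1.0 + 0.5 * x + 0.3 * x * z for x, z in zip(log_count, response)]

X_raw = [[1.0, x] for x in log_count]
X_adj = [[1.0, x, x * z] for x, z in zip(log_count, response)]

r2_raw = ols_r2(X_raw, garbage)
r2_adj = ols_r2(X_adj, garbage)
delta_r2 = r2_adj - r2_raw  # improvement attributable to the interaction term
```

Because the synthetic outcome is generated from the interaction itself, the adjusted model fits it exactly; real data would show a far more modest gain.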
The first set of regressions found that all but one of the raw measures (graffiti) significantly predicted levels of street garbage (complete details are shown in Table 4). The strongest relationships were with housing (B = .63, p < .001) and uncivil use of space (B = .38, p < .001). Big buildings (B = .21, p < .05) and trash (B = .18, p < .05) had more moderate relationships. The fit of all five regressions increased significantly with the introduction of the adjustment factor (again see Table 4), with the strongest improvement occurring for trash (ΔR² = .06, p < .01) and graffiti (ΔR² = .05, p < .05). Notably, the variance explained more than doubled for both trash and graffiti, which had the weakest initial relationships with street garbage.
Table 4. Comparison of Results from Regressions Using the Five Categories of Physical Disorder Derived from the CRM Database to Predict Objective Measures of Garbage, with and without the CRM-based Adjustment Factor
Note: N = 135 census block groups classified as residential and with measures of garbage for 10 or more street segments. All CRM-based variables were log-transformed before regressions. CRM = constituent relationship management.
p < .10. *p < .05. **p < .01. ***p < .001.
3.6. Construct Validity for the Composite Measures
As a last step, we evaluated the construct validity of these final measures by examining their relationship with other popular indicators of neighborhood conditions, drawn from three different data sources: median income, homeownership, and measures of ethnic composition from the U.S. Census Bureau’s ACS (2005–2009 estimates); survey measures of perceived physical disorder and collective efficacy (i.e., social cohesion and social control between neighbors) from the Boston Neighborhood Survey (BNS; 2008–2010 estimates, n = 3,428) 10 ; and reports of gun-related incidents from Boston’s 911 call record in 2011. Because the time points of these data sources vary, we analyze their relationship to the CRM-based measure for the most concurrent year: 2010 for the ACS and BNS and 2011 for 911. As before, we focus the analysis on residential neighborhoods, but in this case we analyze at the broader spatial scale of census tracts rather than block groups (n = 121 residential census tracts). We do so because the interpretation of the analysis depends in important ways on comparison with findings from previous studies, particularly Raudenbush and Sampson (1999), which were conducted on census tracts and clusters of tracts. In addition, because of the smaller sample size of the BNS compared with the Chicago study, the BNS has greater between-neighborhood reliability for tracts than for block groups. For the sake of brevity, we conducted this analysis on the higher order measures of private neglect and public denigration. Results for the five lower order measures as well as block groups are available upon request from the authors.
The measure of private neglect was lower where there was higher median income (r = −.59, p < .001), homeownership (r = −.36, p < .001), and collective efficacy (r = −.38, p < .001) and higher where there were greater black (r = .61, p < .001) and Hispanic (r = .27, p < .001) populations. It also co-occurred with gun-related incidents (r = .68, p < .001). Furthermore, it was higher where residents perceived more disorder (r = .44, p < .001). The measure of public denigration had largely similar relationships with these measures: it was lower in areas with more homeowners (r = −.49, p < .001), collective efficacy (r = −.48, p < .001), and a higher median income (r = −.39, p = .001). Public denigration was higher where there was a greater Hispanic population (r = .41, p < .001) and more gun-related incidents (r = .27, p < .01). It was also higher where residents perceived more disorder (r = .48, p < .001). The one unexpected finding was that it held no correlation with the proportion of black residents (r = −.05, ns).
These validation correlations are lower than those reported by Raudenbush and Sampson (1999:31) for survey-reported disorder. For example, public denigration correlates with perceived disorder at .48 in Boston but .71 in Chicago. However, at least four factors differ between the studies besides the method (observation vs. CRM for non-survey-based indicators of disorder), namely the items in the measure, the city, the reliability of the surveys, and the time period, making direct comparison difficult. It should be noted, though, that the correlations for structural characteristics are similar; for example, the correlation between private neglect and income in Boston is −.59, and in Chicago the correlation of observed disorder with poverty is .64. And the correlations for residential stability are −.36 in Boston and −.25 in Chicago. Moreover, the CRM correlations are on par with previous comparisons between perceived and objective disorder in other studies (Brown, Perkins, and Brown 2004; Franzini et al. 2008; Sampson and Raudenbush 2004; Taylor 2001).
Overall, the results suggest that it is possible to construct a measure from within the CRM database that adjusts counts of case types to better reflect neighborhood conditions, though there are differences between the two classes of physical disorder that should be noted. In particular, private neglect had a stronger relationship with street garbage, with two of its constituent metrics (housing and uncivil use) surpassing the threshold of about 15 percent shared variance typically seen between domains of physical disorder (Taylor 2001). The relationships between the indicators of public denigration and street garbage were a bit weaker, but the correlations with other indicators of disorder were of similar magnitude, and even stronger in some cases. This could be due to one of two possible explanations. The first is that issues of trash storage and graffiti are in fact less linked to patterns of litter than expected. The second is that these issues are more susceptible to reporter bias and potentially in ways that audits of natural patterns in deterioration, such as streetlight outages and sidewalk cracks, might not fully capture. The same norms that lead to garbage-laden streets might also be responsible for diminished motivation to report graffiti or other issues in the public space. If this is so, then the assumption that a neighborhood’s civic response rate, P2, is consistent for a given neighborhood across all case types is called into question. Future validation efforts should carefully evaluate the most effective measures both for objective comparison and internal adjustment, as these might differ depending on the particular set of conditions that are intended to be the focus, a theme we return to below.
4. Analysis 3: Assessing Reliability across Space and Time
Analyses 1 and 2 have provided a methodology for measuring physical disorder using the CRM system, but without a guideline for how such measures should be bounded in space and time. Thus far, measures have been developed for CBGs over the entire available time course (two years and four months). It is desirable, however, to assess measures for smaller time windows, allowing researchers to examine local conditions at more precise intervals, and facilitating longitudinal analysis. In addition to CBGs, it would be appropriate to determine the optimal time window for census tracts, the unit at which most urban research is conducted.
Determining an “optimal” time window for measurement requires a balance of two contrasting dimensions: smaller time windows are more precise but are more sensitive to random events. To do this, we must examine how consistent the multiple measures of a single neighborhood are for different time intervals (using the intraclass correlation coefficient [ICC]) and the ability to statistically distinguish between neighborhoods (using the reliability coefficient, λ); these characteristics can be assessed using multilevel models. The goal is to identify the smallest time interval for which measures within a neighborhood are sufficiently consistent and not overly sensitive to error or stochastic processes. Because the measures of interest are in fact composites that combine counts of cases with the measures of concern for the public space, the establishment of reliability requires two steps. First, we must identify a time interval for which all of the constituent measures (e.g., instances of housing issues) attain a desired threshold for reliability and ICC. Once the appropriate time interval for the constituent measures is determined, it must be confirmed that the same time interval is appropriate for the composite measure. Note, however, that step 1 is not possible for the measure of exemplar reporters, as they are defined by their behavior over the course of a complete year. For this reason, exemplars will always be calculated as the number of public reporters in a region obtaining exemplar status over the previous 365 days.
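The trailing-year rule for exemplars can be sketched as follows. The three-report threshold follows the definition used in analysis 2; the record layout, the `home_region` lookup, and the sample data are hypothetical, and the window is assumed to be the 365 days ending on (and including) the reference date.

```python
from collections import Counter
from datetime import date, timedelta

def exemplar_counts(public_reports, home_region, as_of, min_reports=3, lookback_days=365):
    """Count, per region, users with at least `min_reports` public-issue
    reports in the `lookback_days` ending on `as_of`."""
    window_start = as_of - timedelta(days=lookback_days)
    per_user = Counter(user for user, day in public_reports
                       if window_start < day <= as_of)
    counts = Counter()
    for user, n in per_user.items():
        if n >= min_reports and user in home_region:
            counts[home_region[user]] += 1
    return counts

# Hypothetical public-issue reports: (user_id, report_date).
reports = [
    ("u1", date(2011, 1, 10)), ("u1", date(2011, 5, 2)), ("u1", date(2011, 11, 20)),
    ("u2", date(2010, 6, 1)), ("u2", date(2011, 3, 3)), ("u2", date(2011, 4, 4)),
]
home_region = {"u1": "tract_1", "u2": "tract_1"}

end_of_2011 = exemplar_counts(reports, home_region, as_of=date(2011, 12, 31))
early_2011 = exemplar_counts(reports, home_region, as_of=date(2011, 2, 1))
```

Because status depends on the reference date, the same user can count as an exemplar for one window and not for another.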
The last question we seek to answer is that of longitudinal tracking. If the final time intervals are small enough, it would be possible to examine patterns of change across time. The multilevel models can assess the slope for a measure at both the global and neighborhood levels. If the reliability for the slope is high enough, the model is capable of discerning varying trajectories across neighborhoods, which could then be used in subsequent analyses.
4.1. Creating Measures for Spatiotemporal Windows
The temporal analysis utilizes the complete CRM database, including all requests for service received between March 1, 2010, and June 29, 2012. All requests are categorized by case type and include the date of the request and the address or intersection where services were to be rendered, allowing all requests to be geocoded to the appropriate census geographies.
The focal variables are those that constitute the composite measures of physical disorder, including both the raw counts of cases that reflect the five categories of physical disorder, and the measures of response rate. Drawing from analysis 1, the five categories of physical disorder were housing, uncivil use of space, big buildings, graffiti, and trash. On the basis of analysis 2, the response rate was calculated as the number of individuals reporting public issues, divided into two counts: those who made two or fewer calls in a year’s time and those who made three or more calls in a year’s time (i.e., exemplars).
Measures for each variable, excepting exemplars, were created for all CBGs and tracts for eight temporal windows—one, two, and three weeks, and one, two, three, four, and six months. For each, the original database was split into intervals of the given size, starting with March 1, 2010, and ending with the last complete interval. A count was then produced for each interval for each element in the given level of analysis (i.e., block group or tract). 11
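A minimal Python sketch of this windowing step for the month-based windows is given below. It assumes the archive dates stated above, treats the final calendar month as incomplete because the archive ends on June 29, and uses a hypothetical request layout of (region ID, request date); week-based windows would need day arithmetic instead.

```python
from collections import Counter
from datetime import date

START = date(2010, 3, 1)   # first day of the archive
END = date(2012, 6, 29)    # last day of the archive

def month_index(d):
    """Whole calendar months elapsed between START's month and d's month."""
    return (d.year - START.year) * 12 + (d.month - START.month)

def n_complete_intervals(window_months):
    """Number of full windows, assuming END's own month is incomplete."""
    full_months = month_index(END)  # months strictly before END's month
    return full_months // window_months

def interval_counts(requests, window_months):
    """Count requests per (region, interval), dropping the trailing partial window."""
    n = n_complete_intervals(window_months)
    counts = Counter()
    for region, day in requests:
        if START <= day <= END:
            k = month_index(day) // window_months
            if k < n:
                counts[(region, k)] += 1
    return counts

# Hypothetical requests: (region_id, request_date).
requests = [
    ("t1", date(2010, 3, 15)), ("t1", date(2010, 4, 30)), ("t1", date(2010, 5, 1)),
    ("t1", date(2012, 6, 1)), ("t2", date(2012, 4, 30)),
]
two_month = interval_counts(requests, window_months=2)
```

A `Counter` returns zero for region-interval pairs with no requests, which matters because empty intervals are still observations in the models that follow.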
4.2. Multilevel Models
Hierarchical linear modeling (Raudenbush et al. 2004) was used to compare the consistency of counts within a CBG over time. A natural-log link was used to account for the Poisson distributions of all outcome variables. The first-level equation predicted the outcome for a given time point relative to other measures for that region, and it included the number of time intervals elapsed since the start of the database, to estimate the rate and direction of change over time, and dummy variables controlling for seasonal effects, based on the month of the midpoint of the given time interval. The equation takes the following form:
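In conventional two-level notation, and as a sketch under stated assumptions rather than a verbatim formula, the first-level model for the count Y in time interval t of CBG j can be written as below. The subscripts, the symbol μ for the expected count, and the number of seasonal dummies (written here as S) are illustrative choices; β1 is the time coefficient referred to in the text.

```latex
Y_{tj} \mid \mu_{tj} \sim \mathrm{Poisson}(\mu_{tj}), \qquad
\ln(\mu_{tj}) = \beta_{0j} + \beta_{1j}\,\mathrm{Time}_{tj}
  + \sum_{s=1}^{S} \beta_{(s+1)j}\,\mathrm{Season}_{s,tj}
```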
The second-level equation was an intercepts-only model, estimating the average level of a measure for a neighborhood across time. In addition, the parameter relating time to changes in a measure, β1, was allowed to vary across CBGs, permitting the model to estimate different trajectories of change for different CBGs:
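Using conventional notation, with j indexing CBGs, β0j the CBG-specific intercept, β1j the CBG-specific time slope, γ the citywide averages, and u the CBG-level deviations (labels assumed here for illustration), these second-level equations can be sketched as:

```latex
\begin{aligned}
\beta_{0j} &= \gamma_{00} + u_{0j}, & \operatorname{Var}(u_{0j}) &= \tau_0 \\
\beta_{1j} &= \gamma_{10} + u_{1j}, & \operatorname{Var}(u_{1j}) &= \tau_1
\end{aligned}
```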
where τ0 is the measure of variation in the outcome measure between CBGs and τ1 is a measure of variation between CBGs in the linear relationship between time and the outcome variable. Furthermore, σ2 is a measure of the variation in the outcome measure within CBGs (i.e., differences within a CBG across time).
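The ICC is then calculated as the proportion of variation that lies between groups, which in the standard two-level formulation is
$$\mathrm{ICC} = \frac{\tau_0}{\tau_0 + \sigma^2}$$
Reliability is calculated as
$$\lambda = \frac{\tau_0}{\tau_0 + \sigma^2 / n}$$
where n is the number of observations per CBG. As we can see, this measure grows both with a greater ICC and also with more observations.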
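To make the dependence on the ICC and on n concrete, the sketch below computes both quantities under the standard two-level definitions (ICC as τ0 / (τ0 + σ²), reliability as τ0 / (τ0 + σ² / n)); these forms are assumed here rather than quoted from the source. The function names and the variance values are illustrative. The second form of the reliability, written in terms of the ICC, is algebraically equivalent to the first.

```python
def icc(tau0, sigma2):
    """Share of total variance lying between regions."""
    return tau0 / (tau0 + sigma2)

def reliability(tau0, sigma2, n):
    """Reliability of a region's mean over n repeated measures."""
    return tau0 / (tau0 + sigma2 / n)

def reliability_from_icc(icc_value, n):
    """Equivalent expression in terms of the ICC (Spearman-Brown form)."""
    return n * icc_value / (1 + (n - 1) * icc_value)

# Illustrative variance components chosen so that the ICC is .51.
tau0, sigma2 = 0.51, 0.49
example_icc = icc(tau0, sigma2)
rel_3 = reliability(tau0, sigma2, n=3)    # three observations per region
rel_8 = reliability(tau0, sigma2, n=8)    # more observations, same ICC
```

With an ICC of .51 and three observations the reliability is about .76; holding the ICC fixed and raising n to eight lifts it to about .89, which shows why a measure can be reliable overall even when any single interval is noisy.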
Variation across CBGs in the linear relationship between time and the outcome measure is assessed in two ways. First, the significance of the magnitude of τ1 is assessed using a χ² test. Second, its reliability is measured, in the standard form for a random slope, as
$$\lambda_1 = \frac{\tau_1}{\tau_1 + \sigma^2 / SS_{\mathrm{Time}}}$$
where $SS_{\mathrm{Time}}$ is the sum of squares for the measure of time.
4.3. Comparing Spatiotemporal Windows
The reliabilities and ICCs from the multilevel models described above are reported in Table 5 (CBGs) and Table 6 (tracts). As expected, the proportion of variation attributable to differences between both CBGs and tracts (measured by the ICC) increased monotonically as time windows became larger, because of both greater consistency and fewer measures per neighborhood. As would also be expected, ICCs were higher when comparing tracts than CBGs.
Table 5. ICCs and Reliabilities (λ) for Level (Intercept) and Cross-time Change (Slope) in Measures of Public Denigration and Private Neglect across Census Block Groups for Various Time Windows
Note: N’s vary on the basis of the number of time intervals possible for the 28-month period in the database, nested in 541 census block groups. All ICCs are significant at p < .001. ICC = intraclass correlation coefficient.
Table 6. ICCs and Reliabilities (λ) for Level (Intercept) and Cross-time Change (Slope) in Measures of Public Denigration and Private Neglect across Census Tracts for Various Time Windows
Note: N’s vary on the basis of the number of time intervals possible for the 28-month period in the database, nested in 156 census tracts. All ICCs are significant at p < .001. ICC = intraclass correlation coefficient.
A combination of the raw count and the measures of concern for the public space, calculated for 6-month windows only. See text for more details on construction.
The six measures varied in their consistency. Of the measures of physical disorder, housing, graffiti, and trash had the strongest reliabilities and highest ICCs. These differences seem largely attributable to the frequency of these categories. For example, there were three times as many events reflecting housing issues as events reflecting uncivil use of space. With a lower frequency, counts of the latter would be more stochastic and therefore less consistent at smaller time intervals. Interestingly, counts of public reporters, though far fewer in number than actual calls, featured greater consistency within a region than any of the measures of physical disorder.
All ICCs in Tables 5 and 6 were significant at p < .001. The intent here, however, is not to find significant between-region variation but to identify spatiotemporal windows at which a single measure is indicative of a region’s “actual” value on that measure. The ICC, in that case, is used as an evaluation of how strongly a single measure of a neighborhood correlates with all other measures of that neighborhood. If we elect .7 as a threshold for a reliable neighborhood-level measure, then there are acceptable spatiotemporal windows available for all of the measures apart from big buildings. For those measures with greater consistency, the options are many: housing, for example, could be measured at two-month intervals for tracts or four-month intervals for CBGs. For others, like uncivil use, there is a need for six-month intervals at the tract level, and no time interval satisfies this criterion for CBGs.
The slope reliabilities in Tables 5 and 6 indicate the ability of the model to distinguish between the trajectories of different regions over time. Variation in slopes across CBGs and tracts was significant at p < .05 (or some lower threshold) in nearly all models, with the exception of those for big buildings and those for public reporters at intervals longer than four months. This variation was somewhat more discernible in tract-level models.
We then examined whether these cross-time consistencies hold for the composite measures. For the sake of simplicity, this was done for all variables using six-month windows for tracts. Note that the generation of the composite measures requires the incorporation of the number of exemplars, measured for the full year preceding the last day of the given time window. Consequently, the first time window analyzed must be that which ends at or after the end of the 12th month of the available database, diminishing the number of measurements per tract. For this reason, this analysis does not examine change over time.
Similar to the above examples, multilevel models were run to examine the consistency of the composite measures across space and time. The reliabilities and ICCs from these are reported in the bottom row of Table 6. Across the board, reliabilities and ICCs were lower for the composite measures, but not alarmingly so. All measures (other than big buildings) maintained ICCs of about .6 or higher, and housing had an ICC greater than .7. Reliabilities were typically about .8.
Last, we replicated the analysis for the two higher order constructs, private neglect and public denigration. The statistical advantage is that the combination of multiple measures amplifies the number of cases in the average time interval, thereby enabling higher reliabilities and ICCs at smaller time windows. This is particularly important when considering a measure such as big buildings, which has a low reliability when measured on its own but might be fruitfully incorporated into a more comprehensive description of the neighborhood.
Reliabilities and ICCs for these higher order counts were higher than those of their constituent categories. For each, the criterion of ICC = .7 was attained at six-month intervals for CBGs and two-month intervals for tracts. (Complete results are available on request from the authors.) This remained largely consistent when they were combined with measures of concern for the public space to create composite measures, though the consistency in public denigration for small time windows was somewhat diminished. For tracts with two-month intervals, public denigration had an ICC of .44 and a reliability coefficient of .88. Private neglect had an ICC of .65 and a reliability coefficient of .94. For CBGs with six-month intervals, public denigration had an ICC of .51 and a reliability coefficient of .76. Private neglect had an ICC of .68 and a reliability coefficient of .87.
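The statistical advantage of pooling categories can be illustrated with a simple calculation. The sketch below assumes, purely for illustration, that counts in each category are independent Poisson variables; the analysis itself does not state this, and the function names are hypothetical. Under that assumption the relative noise of a count (its coefficient of variation) is the inverse square root of its expected value, so summing categories lowers the relative noise.

```python
import math


def poisson_cv(expected_count):
    """Coefficient of variation (sd / mean) of a Poisson count.

    For a Poisson variable the variance equals the mean, so the
    coefficient of variation is 1 / sqrt(mean).
    """
    if expected_count <= 0:
        raise ValueError("expected_count must be positive")
    return 1.0 / math.sqrt(expected_count)


def pooled_cv(expected_counts):
    """Coefficient of variation of the sum of independent Poisson counts.

    The sum of independent Poisson variables is itself Poisson, with
    mean equal to the sum of the component means.
    """
    return poisson_cv(sum(expected_counts))
```

With hypothetical expected counts of 4 and 12 events per window, the sparser category alone has a coefficient of variation of 0.5, while the pooled count of 16 has 0.25. The same logic applies to widening the time window or the geographic unit.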
5. Summary and Implications
In the present study, we sought to demonstrate how a citizen-initiated administrative database might act as “the eyes and ears of the city” in the spirit of Jane Jacobs (1961) while providing a low-cost, real-time measure of physical disorder. To accomplish this goal, we needed to address three major issues: (1) the lack of interpretable constructs, (2) the potential that the raw database might not objectively or accurately reflect real-world conditions, and (3) the need for criteria for reliability when bounding measures in space and time. Creating a set of theoretically guided factors first required an item-response model, in this case 28 case types that reflected deterioration or incivilities within a neighborhood. The subsequent factor analysis revealed five separate categories of physical disorder. It is worth noting that these constructs were extant in the data but that it was necessary to distinguish them from the noise surrounding them. Skipping forward to analysis 3, once these measures were fully developed, criteria for reliability were established both for one-time measures and cross-time trajectories using multilevel modeling.
In between these two steps, analysis 2 addressed the question of validity, which is a perhaps underappreciated concern for ecometric study. Neighborhood audit protocols are developed and administered to measure specific things as accurately as possible, meaning that they have an inherent validity for those items that they assess. In contrast, administrative data are the by-product of processes whose idiosyncrasies might bias their reflection of ground truth. The CRM database is the product of constituent reports and is therefore vulnerable to inconsistencies in reporting across neighborhoods. Because the nature of the bias was known, however, it was possible to account for it. The final methodology used indicators of civic response rate, derived from the CRM database itself, to systematically adjust raw measures to better reflect objective conditions. Reaching this point entailed considerable work, including two independent data collections and a lengthy set of analyses. Nonetheless, that investment of cost and effort would be necessary for any traditional protocol for measuring disorder, and in our case laid the groundwork for a methodology that can be reproduced at little cost both within Boston across time and in other cities with their own CRM systems. It is also worth noting that cities frequently conduct audit studies, so it is reasonable to assume that there will be an ongoing stream of potential sources of data from which to derive validation measures.
The final product was a multidimensional measure of physical disorder that is not only nearly costless to the researcher but also more comprehensive and precise than other measures currently available. Furthermore, the programming code published along with this paper (at the Boston Area Research Initiative’s Web site, http://www.bostonarearesearchinitiative.net/data-library.php) facilitates reproduction of the measure wherever similar databases exist. Given these apparent upsides to the use of administrative data, it seems appropriate to forward the following new, three-step process for carrying ecometrics into the age of big data:
Extract constructs by identifying item-specific models that are reflective of the theoretical concept of interest and then examining their underlying factor structure.
Validate the measure by identifying and adjusting for any bias the information source might impart to the data and examining it in conjunction with external data.
Establish reliability in the measure’s ability to track information across space and time.
With this methodology in hand, the opportunity before ecometric urban science is considerable, as there is a veritable trove of information on cities that sits largely untapped, of which CRM databases are but one example. Cities collect and now make available many other data points—such as tax assessments, building permits, zoning decisions, restaurant inspections, environmental assessments, housing code violations, pedestrian flows, and bicycle collisions, to name a few—each providing its own insights on the social and physical ecology of neighborhoods. Going further, there are private databases, such as Twitter, cell phone records, and Flickr photo collections, that are also geocoded and might be equally informative in building innovative measures of urban social processes. These various resources could be used to develop new versions of traditionally popular measures, as we have done here, or to explore new ones that have not been previously accessible. An illustration of the latter comes from our own analysis, in which a by-product of validation has provided two unanticipated behavioral measures—one related to civic engagement and the other capturing attitudes toward disorder in the public space. The potential of new forms of large-scale data underscores the central inspiration of this paper: as the volume of data on urban areas continues to grow and diversify, such data provide new and distinctive ways to measure neighborhood characteristics, often in ways previously unforeseen. These advances can be appropriated to shed light on some of the most salient themes in urban science, from the structure and function of the social organization, to the role of cognition and culture in generating local patterns, to the nascent examination of relationships between neighborhoods and the higher order social structure of the city.
Apart from its implications for ecometric science more broadly, the current methodology represents an advance for the direct measurement of physical disorder in urban neighborhoods. It incorporates a broad range of phenomena and is the first physical disorder measure to divide these items into independent subcategories, suggesting new avenues for research. For example, do the five subcategories relate differently to a neighborhood’s other social and demographic characteristics? If so, do they each reflect a different set of processes occurring within the neighborhood? Furthermore, what is the source of the higher order constructs suggested by analysis 1, private neglect and public denigration? Is it that their constituent types are all manifestations of the same social and behavioral patterns, or do they share other causal relationships that reinforce their correlation? It is crucial that we not overinterpret this single case and inappropriately reify these particular constructs. It will be necessary to confirm their consistency with data from other time points and cities, something that is likely to be possible with the continued proliferation of CRM systems throughout North America and Western Europe.
In addition, the measures enable a variety of analytical approaches that could prove useful in the extension of research surrounding “broken windows” and other theories of neighborhood well-being. All of the measures describe neighborhood conditions at the level of census tracts, and some can be used for CBGs. Future work could likely find ways to measure and interpret patterns of disorder for streets or even individual buildings. The measures can also be tracked across time, allowing analyses that evaluate not only what a neighborhood’s current level of physical disorder is, but whether it is on an upward or downward trajectory.
Finally, the CRM data are continuously generated as part of administrative operations. A new study with up-to-date data requires only a download and some data manipulation. In an effort to assist others in initiating such work, we will be publishing the computer code for constructing the measures developed in the current paper (along with the data, at the Boston Area Research Initiative’s Web site, http://www.bostonarearesearchinitiative.net/data-library.php). As CRM systems become more numerous around the world, typically in the form of 311 hotlines, this sort of measurement is becoming possible in a variety of cities. Some of these cities have established common standards for publishing CRM data, meaning that the data are not only being made readily available but are compatible in ways that would support cross-city comparisons.
6. Balancing Limitations and Opportunities
We have thus far focused predominantly on the opportunity presented by “big data,” but we must also take stock of the limitations that they carry and the challenges to be addressed. Indeed, the methodology presented here is only a first step—an illustration of what is possible—and future work will need to refine it further, particularly in terms of the validation process. We would likewise stress that traditional, well-established methods of urban data collection, such as community surveys and social observation (Sampson 2012), will continue to play an essential role in any future analysis. Claims to the contrary are merely “big data hubris,” as aptly put by Lazer et al. (2014). Each approach has its pros and cons, the balancing of which will depend on the research question. Surveys and observation are expensive and cannot realistically be carried out in real time, for example, but they can be calibrated to be representative of the population. In contrast, the CRM data analyzed here are cheap and in principle can be measured at very fine grained geographic scales and almost in real time, but issues of reliability and what the data are really measuring remain.
For example, complaints about big buildings were not particularly common in our database, making it difficult to measure that construct reliably. Techniques that aggregate cases at higher levels, by increasing the geographical range or the temporal window or, as we did here, combining multiple related constructs, will be critical. Future research is needed to examine these issues, especially in a context that can directly compare and contrast different methodologies of data collection, such as systematic social observation. Perhaps another factor is more important: although our validation process is promising, we still cannot be entirely certain that we have directly accessed the intended information, especially for those things that occur within private spaces or out of public observation. In particular, our measures of public denigration were not as closely correlated with other indicators of urban social structure as might be expected from past research. It may be, as we noted earlier, that the techniques for measuring and accounting for bias in this case were not sufficient to fully calibrate public denigration.
Another potential weakness is our working assumption that reporting bias is consistent across case types. Our data here seem to suggest that the situation is more nuanced, as reporting rates for streetlight outages and broken sidewalks were only moderately correlated. This finding points to important improvements for future versions of the measure, while also highlighting the need to tailor the validation process to the specific measure of interest. In some cases, like the ones presented here, there is a need to adjust for biases inherent in the data, and the objective measures necessary for doing so will need to be carefully constructed and measured.
In other cases, however, such a process of construct validation or bias adjustment may be less necessary or not applicable, even though reliability assessment by temporal and geographic scale remains at issue. For example, building permits and zoning approvals or variances are legal requirements for major building renovations and additions, meaning they should be largely objective in the information they provide. Allocation of city resources (e.g., beautification efforts or economic development) or distribution of the city budgets by amount and location are also largely “bias free” in their measurement and now widely available electronically. The availability of such measures could provide new insight into processes such as gentrification and inequality in the delivery of city services (e.g., Hwang and Sampson 2014).
Furthermore, in certain cases, the raw contents of the data are exactly what a researcher wants, and their face validity is sufficient to offer concrete interpretation. In such situations, researchers do not actually want to adjust for any biases. For example, a recent paper used noise complaints from New York City’s 311 hotline as a direct reflection of social conflict between neighbors (Legewie and Schaeffer 2015). Regardless of actual noise levels or norms of reactivity, each call in this analysis reflects an objective case of one neighbor asking the government to regulate the behavior of another. More generally, the electronic availability in many cities of citizen reporting systems offers a wide variety of domains (in our Boston data, 178 unique types of service calls) to test Black’s (1976) theory of the behavior of law and citizen initiation of government control.
Of course, the fundamental issues of measurement error and validity bear down at some level on all methodologies: survey reports can be skewed by other perceptual factors, including implicit judgments of race and class (Sampson and Raudenbush 2004), and observational work is dependent on interrater reliability that is always less than unity. The dominant investigator-driven research method is to conduct surveys or interviews, but even here, there is continuing controversy over the idea that researcher control leads to validity. For example, a recent critique argues that interviews are a weak basis for studying culture or inferring the motives for an individual’s behaviors (Jerolmack and Khan 2014, and responses). Although we would not go that far, our point is that assumptions must always be invoked in the analysis of social science data.
It remains significant, however, that administrative data are outside of a researcher’s direct control and that assumptions may sometimes be required that are uncheckable because of unobserved processes related to reporting or administrative filtering. It follows that validation is imperfect and should be viewed as a continuing process, one that will need to be undertaken for new administrative data sets that become available. Some might argue that these caveats obviate the usefulness of administrative data and other forms of “naturally occurring” digital information. Our position is to recognize both strengths and weaknesses of the new data being made available, using the most rigorous methods possible to address limitations. In defense of our approach, we would also note the significant advantages of the CRM data we analyzed relative to big data more generally. The CRM data are characterized by their richness and geographic precision, and their longitudinal nature permits long-term tracking. In addition, some of the content that is most difficult to validate stems from the fact that it cannot be measured by direct means and was thus previously unavailable, making it novel. If used properly, such data can broaden the range of questions that we can examine and the manner in which we do so. Our hope is that future efforts will capitalize on the advantages of large-scale administrative records and combine them in meaningful ways with survey and observational protocols.
In sum, instead of an either/or approach, the debate between those who believe that only data generated directly as part of the research process are valid and those who believe that administrative and other types of naturally occurring data can be of use pushes both sides to improve the quality of their research, which certainly can only lead to better science. Meanwhile, because of the increasing availability of big data at little or no cost and at unprecedented temporal and geographic scales, such data remain a resource to be tapped, and it is incumbent upon researchers to develop methodologies that do so in ways that fulfill the expectations of rigorous science.
7. Methodology, Theory, and the Future of Big Data
Although this paper was tailored to the specifics of ecometrics, the rise of big data illustrates the challenges facing computational social science writ large. There is a clear need to demonstrate what these novel data sources can measure and how constructed metrics are theoretically relevant. Furthermore, they must be demonstrated to be both reliable and valid in their measurement before modeling can begin. Unfortunately, modeling first seems to be the default in many current approaches that emphasize “econometrics” over “ecometrics” or simply the power to predict. However powerful predictive analytics may be, it does not answer the substantive questions about social processes and mechanisms that motivate most social scientists.
In this paper, therefore, we set out to accomplish a linked set of measurement goals that was rooted in substantive concerns. We grounded our study in a measure that is influential in urban research and theory, and we closely examined validity in a manner that goes beyond previous work. Though others have used supplementary data to give context to the patterns in Facebook or cell phone calls (Eagle, Pentland, and Lazer 2009; Kosinski, Stillwell, and Graepel 2013), ours is the only study that we know of that has gone a step further, using multiple internal measures to reduce measurement error and then validating this technique with external sources and substantive theory, an approach that was not simply data driven. Given the size and novelty of aptly termed “big” data, there is the temptation to allow such data to guide analysis and, in turn, dictate theory. Indeed, some have claimed that the era of big data will eliminate the need for theory, as it will be derived from the massive size of the data available. The former editor of Wired magazine was perhaps the most bold, claiming “the end of theory” and that “the data deluge makes the scientific method obsolete” (quoted in Pigliucci 2009). We strongly disagree. Purely data-driven approaches run the risk of producing models and algorithms that are overfit to the idiosyncrasies of a particular data set, leading to new theoretical models that are artifactual or just plain wrong, something that has been partially blamed for the failure to predict the crash of the housing bubble in 2008. Big data hubris is indeed a problem (Lazer et al. 2014). Accordingly, we have had theory take the lead throughout, determining the case types reflecting physical disorder and the measures of civic response rate.
A balance must nonetheless be maintained. As Lazer et al. (2009) rightfully pointed out, our current theories are not well suited to the complexity of information contained in these sorts of data and consequently are often unequipped to offer conjectures about them. In the current case, there was no model of disorder that was sufficiently articulated to predict a priori categories for the 28 indicators that we identified. Thus, there is something to be learned from these data about the causal dynamics that underpin disorder and its various manifestations, though such insights will of course be subject to the same rigorous evaluation required of any new theory. This “checks-and-balances” relationship between theory and empirics is instructive, and it will probably characterize the continued efforts of scientists to incorporate big data into their work, as well as to facilitate the emergence of a fully mature field of computational social science.
Appendix A
Case Types That Reflect an Issue in the Public Space and Counts in 2011
| Case Type | Count |
|---|---|
| Abandoned bicycle | 71 |
| Abandoned building | 103 |
| Abandoned vehicles | 2,233 |
| Bridge maintenance | 29 |
| Building inspection request | 822 |
| Catch basin | 13 |
| Construction debris | 101 |
| Empty litter basket | 292 |
| Exceeding terms of permit | 68 |
| Fire hydrant | 8 |
| General lighting request | 460 |
| Graffiti removal | 3,893 |
| Highway maintenance | 3,297 |
| Illegal auto body shop | 46 |
| Illegal dumping | 831 |
| Illegal occupancy | 263 |
| Illegal posting of signs | 116 |
| Illegal rooming house | 177 |
| Illegal use | 62 |
| Illegal vending | 32 |
| Improper storage of trash (barrels) | 1,745 |
| Install new lighting | 25 |
| Miscellaneous snow complaint | 1,407 |
| Missed trash/recycling/yard waste/bulk item | 6,211 |
| Missing sign | 671 |
| New sign, crosswalk or pavement marking | 976 |
| New tree requests | 831 |
| Overflowing or unkempt Dumpster | 149 |
| Park improvement requests | 3 |
| Park maintenance requests | 87 |
| Park safety notifications | 2 |
| Parking enforcement | 685 |
| Parking meter repairs | 139 |
| Parking on front/back yards (illegal parking) | 132 |
| Parks general request | 106 |
| Parks lighting issues | 5 |
| Pavement marking maintenance | 272 |
| Pick up dead animal | 1,374 |
| Pigeon infestation | 29 |
| PWD graffiti | 160 |
| Request for litter basket installation | 80 |
| Request for pothole repair | 4,603 |
| Request for snow plowing | 7,270 |
| Requests for street cleaning | 953 |
| Requests for traffic signal studies or reviews | 96 |
| Roadway repair | 306 |
| Rodent activity | 1,241 |
| Sidewalk cover/manhole | 3 |
| Sidewalk repair | 1,294 |
| Sidewalk repair (make safe) | 2,119 |
| Sign repair | 1,172 |
| Snow removal | 2,103 |
| Streetlight knock-downs | 476 |
| Streetlight outages | 8,127 |
| Traffic signal repair | 2,585 |
| Trash on vacant lot | 121 |
| Tree emergencies | 3,446 |
| Tree maintenance requests | 3,336 |
| Upgrade existing lighting | 15 |
Note: PWD = Public Works Department.
Acknowledgements
We thank the City of Boston’s Office of New Urban Mechanics and Department of Innovation and Technology for supporting our examination of government data; Jeremy Levine for assistance with the data; and the editor and reviewers of Sociological Methodology for helpful comments on earlier drafts. A version of this paper was presented at the annual meeting of the American Association for the Advancement of Science in Chicago, February 15, 2014.
Funding
Funding assistance was provided by the National Science Foundation (grant SMA 1338446), the John D. and Catherine T. MacArthur Foundation (grant 13-105766-000-USP), and the Radcliffe Institute for Advanced Study.