Social and Technical Trade-Offs in Data Science

Abstract

Data science isn't easy, especially when it involves working with social data. And although data science researchers and practitioners draw on methods such as machine learning that seem to automate much of the process, they are often confronted by difficult choices all along the way. Considered judgment is therefore crucial to data science. It guides the process of formalizing problems in such a way that data can be meaningfully used; it informs the selection, preparation, and cleaning of data, especially when data are biased or missing; and it underlies critical decisions regarding the computational methods brought to bear on data and the metrics used to measure performance.

Such choices are unavoidable when trying to put data to meaningful use. Moreover, in most cases, there are no “right” choices or even a “right” way to make these choices. Instead, researchers and practitioners almost always find themselves working through a number of difficult trade-offs between competing goals and values, trying to strike an appropriate balance for the task at hand. For instance, even when data scientists prioritize the accuracy of their models' predictions, they might also care how easily they can interpret their models. Sometimes the ability to explain what a model has learned will justify a seeming loss in performance, if human understanding is necessary for the perceived legitimacy of the model's conclusions.

This special issue illuminates the many considered decisions that researchers and practitioners make when working with data—and the ethical implications of these decisions. Difficult technical trade-offs often involve difficult social trade-offs. Unlike much of the critical commentary on data science to date, which has tended to treat data scientists as either ignorant of or insensitive to the ethical implications of their choices, this special issue highlights how researchers and practitioners make these consequential decisions, as well as the values that guide their decision making.

In its own way, each paper reveals that many of the normative concerns that scholars, policy makers, regulators, and advocates have about assumptions, bias, and errors in data science already register as pressing practical concerns for those who work in the field.

Neff et al., for example, provide a rich empirical account of the considerable amount of deliberation, interpretation, and conflict that attends data scientists' work in both academia and industry. Their ethnographic insights identify key parts of the data science process where researchers and practitioners struggle through confusion, ambiguity, and conflict. Shmueli complements Neff et al.'s contribution with an extended discussion of the many methodological challenges that researchers must address to leverage data effectively for causal behavioral research—and why these challenges introduce novel ethical concerns. Both articles pinpoint important opportunities to channel critical perspectives on “big data” into the everyday work of data scientists, showing that many social critiques map onto existing technical considerations.

Likewise, d'Alessandro et al. and Berendt et al. point to the many choices that confront practitioners as they proceed with the data science process, and to how each of these choices affects whether the resulting model will exhibit bias along the lines of race, gender, and other protected characteristics. d'Alessandro et al. walk through a well-known “process model” to help isolate specific decisions where practitioners can attempt to mitigate bias. Berendt et al., in turn, demonstrate why discrimination stems from the way humans interact with data and why even technical solutions to discrimination will always require human intervention and careful judgment.

Another pair of papers explores how challenging it can be to translate normative intuitions about the value of non-discrimination and diversity into formal technical definitions—and how these difficulties reveal important insights into tensions in common thinking about fairness. In evaluating recidivism-prediction instruments—tools designed to predict the likelihood that defendants will engage in criminal activity if released from prison—Chouldechova shows that disparate impact might be unavoidable, even when a model seems to perform equally well for members of different racial groups, if the groups do not recidivate at the same underlying rate. Drosou et al. delve into the meaning and value of diversity, which they distinguish from and relate to non-discrimination. In formalizing these notions and trade-offs technically, Chouldechova and Drosou et al. make the underlying social norms clearer and the difficult normative tensions far less easy to avoid.
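The tension between equal performance and unequal underlying rates can be illustrated with a small sketch. It relies only on the standard confusion-matrix definitions, from which the false positive rate can be written as FPR = p/(1 - p) × (1 - PPV)/PPV × (1 - FNR), where p is the rate at which a group recidivates. The sketch assumes that "performs equally well" means equal positive predictive value (PPV) and equal false negative rate (FNR) across groups, which is only one possible reading. The function name and all numbers below are hypothetical and are not taken from any paper in this issue.

```python
def implied_false_positive_rate(prevalence, ppv, fnr):
    """Return the FPR implied by a group's prevalence, PPV, and FNR.

    Follows from the confusion-matrix definitions:
        FPR = p / (1 - p) * (1 - PPV) / PPV * (1 - FNR)
    """
    if not 0 < prevalence < 1:
        raise ValueError("prevalence must be strictly between 0 and 1")
    if not 0 < ppv <= 1:
        raise ValueError("ppv must be in (0, 1]")
    if not 0 <= fnr <= 1:
        raise ValueError("fnr must be in [0, 1]")
    odds = prevalence / (1 - prevalence)
    return odds * (1 - ppv) / ppv * (1 - fnr)


# Hypothetical instrument: identical PPV and FNR for both groups.
shared_ppv = 0.6
shared_fnr = 0.35

# Hypothetical groups that differ only in their underlying rate.
fpr_group_a = implied_false_positive_rate(0.3, shared_ppv, shared_fnr)
fpr_group_b = implied_false_positive_rate(0.5, shared_ppv, shared_fnr)

print(f"Group A implied FPR: {fpr_group_a:.3f}")
print(f"Group B implied FPR: {fpr_group_b:.3f}")
```

Under these assumed numbers, the group with the higher underlying rate necessarily has the higher false positive rate, even though PPV and FNR are identical. Members of that group who would not reoffend are therefore more likely to be flagged as high risk.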

Finally, Zeide offers a case study of the way that data can unsettle crucial social and political dynamics, even when the goal is simply to streamline decision making. She looks at the role that data have begun to play in pedagogy, assessment, and educational decision making, demonstrating how technology designed to facilitate or improve education has begun to change its meaning, its democratic governance, and the parties deemed appropriate for its provision.

Understanding the ethical implications of data science in terms of trade-offs draws attention to the fact that researchers and practitioners already routinely encounter and work through these dilemmas, though the normative values at stake might not always be obvious to data scientists. Taken together, the articles in this special issue push normative and policy debates about data onto firmer technical footing, with benefits both for the quality of critiques of data science and for data science itself.

Footnotes

Cite this article as: Barocas S, boyd d, Friedler S, Wallach H (2017) Social and technical trade-offs in data science. Big Data 5:2, 71-72, DOI: 10.1089/big.2017.29020.stt.