Social Science Data as a Challenge for Contemporary History

Abstract

Christina von Hodenberg

For historians of the twentieth and twenty-first centuries, a relatively new type of source – social science data – not only opens new possibilities but also presents specific methodological challenges. Social science data are the raw materials collected since the end of the nineteenth century by empirical researchers and bureaucrats who sought to describe and understand their societies with scientific methods, and to advise policy makers and practitioners about ongoing changes. We define the social sciences here in the widest sense and include data collection by experts in the fields of, for example, psychology, ethnography, demography, oral history and social geography. These data are both quantitative and qualitative, from sources such as databases, questionnaires, censuses, photos, geodata and of course interviews, recorded in the form of transcripts, audio tapes or videos.¹

It is often only these sources which enable us to test the validity of historical master narratives on a macro level and get beyond the micro- and meso-level studies so predominant in much of historical research. Science-generated, mass-produced data can also present rare opportunities to investigate the history of under-researched and marginalized social groups who have left few traces – beyond basic population data – in state archives, such as the poor, women or the elderly. We have gathered an international group of historians and social scientists to discuss the advantages and problems in this emerging field. How widespread is the use of social science data for historical purposes in different settings worldwide, and how are access and ethical issues currently regulated? To what extent do historians link qualitative and quantitative data, and if they do not, what is holding them back? Are there any widely recognized standards for the re-analysis of social science data, and to what extent do historians rely on pragmatic choices in the absence of such standards? What is expected of historians by social scientists, and vice versa? Finally, how do historians assess the opportunities and risks posed by distant reading, machine learning and artificial intelligence methods which now enable us to process very large amounts of social science data at once?

The contributors to this roundtable discussion represent different scholarly communities, in both a methodological and geographical sense, to capture the wide spectrum of practices and the current status of the field within and beyond Europe. Jon Lawrence from Exeter University, United Kingdom, represents the field of British social and cultural history. His work centres on the re-analysis of qualitative data – mainly personal testimonies collected through social science interviews and questionnaires – to reconstruct popular attitudes to individualism, community and politics in the past. From Japan, we are joined by Reiko Hayashi, Director-General of the National Institute of Population and Social Security Research (IPSS). She is an expert on population statistics, in particular the historical development of population policy and demographic data. Historical statistics is also the discipline to which María Francisca Rengifo Streeter (Universidad Adolfo Ibáñez, Chile) belongs. She has applied machine learning tools to census data to understand the impact of educational policy on society. From Germany, we have Kerstin Brückweh (Europa-Universität Viadrina, Frankfurt an der Oder and Leibniz Institute for Research on Society and Space, Erkner). Brückweh uses social science data, including geodata, for research into contemporary history. Last but not least is the team of Eva Maria Gajek and Daria Tisch (Max Planck Institute for the Study of Societies, Cologne, Germany), whose fields are the cultural history of the economy and the historical sociology of income inequality.

In what follows, these six scholars debate the advantages and drawbacks of different approaches to social science data for historical research purposes. While a consensus emerges on some aspects – such as the need for improved data literacy skills among historians, and the imperative to reconstruct the context in which the data were originally produced – the responses also illustrate key differences in regard to the types of social science data available and accessed in the different countries and disciplines. The answers vary in relation to the research questions asked, the methods of analysis, the views on mixing quantitative and qualitative data and the promise of artificial intelligence tools to meaningfully dissect and digest ‘big data’ in the service of the historian.

The questions were posed by Christina von Hodenberg, a scholar of European History based at the German Historical Institute London.

1. What can we learn about history, using social science data as sources, which cannot be learned in any other way?

Kerstin Brückweh

Numbers have been used for describing and understanding societies, groups and other parts of the environment for a long time, but it was only in the nineteenth century that they started to become an ever-growing, important base for decision-making and societal self-descriptions. This was pushed even further in the twentieth and twenty-first centuries with the new possibilities of computational data processing and the internet. Despite the huge impact of statistics on planning of all kinds, historians today only rarely use statistical data in an elaborated way. They may refer to statistics and opinion polls, but hardly any history curricula – at least in Germany – offer classes on statistics for historians or teach statistical skills.² Students of history only seldom request such lessons. And, of course, social science data are not to be reduced to numbers. In our research group on the long history of 1989, we have re-used qualitative studies and especially interviews from urban studies, sociology and ethnography.³ These turned out to be very helpful in introducing different chronological layers into the historical analysis, especially the experiences of historical actors who are not to be found elsewhere in the archives, or who can only partly be included via retrospective methods such as oral history. I like to call these quiet voices, in contrast to those who speak out loud and have the social power to be heard. Examples of quiet voices may come from specific age groups (e.g. school pupils, the elderly), marginalized social groups (e.g. migrants, women) or particular spaces (e.g. council houses or large housing projects).⁴ Sometimes it is possible to re-interview the interviewees or to use parts of the original interviews as discussion openers for scholarly communication or for citizen science initiatives.⁵

I think that historians currently omit a large part of the description and analysis of past societies when statistics are neglected. This becomes even more problematic in an age of big data and associated tools such as text mining. We run the danger of funding tools that are not used by historians or used only with a limited understanding of their underlying mechanisms. I am not an expert in statistics, but I have studied the history of the social sciences and undergone a transformation myself: from writing a history of knowledge in censuses and surveys during the long twentieth century in Britain without using the survey data itself, to working in interdisciplinary teams and beginning to use such data. I am not advocating the use of quantitative over qualitative sources; instead, I propose more sophisticated ways of including quantitative sources in the qualitative writing of history. I do so because numbers are an important part of societies in the twentieth and twenty-first centuries.

Reiko Hayashi

I agree. Historical demography, developed in Japan around the middle of the twentieth century, opened a new way of looking at society. It uses historical data on birth, death, marriage and so on to describe societal change. This approach sheds light on reality, which is not affected by the conceptual bias held by later historians. I want to mention in particular the surveys of the poor which started to flourish in Japan during the 1930s, along with the communist ‘international’ movement.

I concur with Kerstin's view that historians who study population data have often been naïve with regard to the systems which created the data.

Jon Lawrence

As a historian of modern Britain, my focus has been on the analysis of qualitative data – i.e. personal testimony – collected and archived by historic British social science projects since the late 1930s. British social science was relatively slow to adopt ethnographic methods, or take an interest in how people understood their own lives, with the result that such material is limited prior to the late 1940s (the Mass Observation Archive is the principal exception, with rich, if often idiosyncratic, material from 1937 including extensive field notes from the project's intensive ‘Worktown’ study of Bolton).⁶ The special attraction of such social science testimony to the social or cultural historian is that it captures the contemporary voices of subjects who may otherwise have left few traces in the historical record, and therefore no account of how they saw the world and their place within it at the time. This is its great strength over retrospective methods such as oral history, which, despite its many virtues, cannot offer a similarly direct window on the contemporary meaning-making practices of otherwise marginalized groups. Of course, not all under-represented groups were of equal interest to social scientists, with the result that it is harder to find archived testimony from some marginalized groups who would now be of great interest to social historians, such as ethnic minorities or the disabled. Sometimes testimony from individuals in these groups is buried in the archive, but without identifying metadata because these identities were not seen as relevant to the original enquiry.

In any case, historians need to proceed with caution when re-analysing social science testimony. Much depends on their ability to contextualize what survives in the archive. This is one reason why I have always chosen to re-analyse testimony from place-based (or ‘community’) studies, where it is possible to situate respondents in time and place with some precision. It is also essential to know a good deal about the aims and methods of the original social science study. Historians cannot hope to read archived testimony ‘against the grain’ without a clear understanding of what that grain was – i.e. what the original researchers were looking for, how they did their research and how both these things may have shaped the testimony they collected. This is especially important for testimony collected before the 1960s, when we rarely have access to full transcriptions, let alone audio recordings, because researchers rarely made audio recordings in the field, relying instead on scribbled field notes written up more formally after the fact. The Philips compact cassette transformed the social science interview from the mid-1960s, making it easy to record interviews in full so long as the interview could be conducted somewhere relatively quiet (not always possible for subdisciplines such as industrial sociology or criminology). But archived testimony rarely includes these original recordings, making it harder to assess the status of transcriptions (e.g. whether they are selective or verbatim), or reconstruct the emotional context of the testimony. Testimony from social science interviews needs to be read critically, with close attention paid to the performative aspects of the interview itself, and how this, alongside the broader agenda of the original researchers, may have influenced respondents’ self-presentation.⁷ But read with due care, such testimony can nonetheless shed light on facets of subaltern subjectivities that have left few other traces in the archive.

María Francisca Rengifo Streeter

Social science data are key sources for producing historical evidence. They are indispensable data for measuring past phenomena and addressing a research problem based on data sources that sought, intentionally or unintentionally, to record those phenomena. They allow us to identify variables and to extract knowledge from multi-dimensional data, inquiring into their possible interrelationships.

For example, I worked with my colleague Gonzalo Ruz, a machine learning specialist, to apply machine learning techniques (clustering and decision trees) to Chilean census data for the years 1920 and 1930 and to educational statistics (reports of the Chilean Ministry of Justice, the Ministry of Education and the Statistical Yearbooks from 1895 to 1930) in order to examine the key factors related to the educational crisis of the 1920s. This multi-dimensional analysis of demographic and schooling data broken down by administrative division allowed us to identify relevant changes in the behaviour of a group of variables and their interrelationships throughout the country. We thus demonstrated that latent spatial and demographic factors explain the different rates of school expansion across Chile. The results show that school provision was different in rural and urban areas and that adult literacy was a predominant factor.⁸ This study considered a multi-dimensional dataset, allowing us to discover the temporal and spatial dynamics of a historical process, and ultimately to interrelate the global, regional and local scales of the process.
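For readers unfamiliar with the two techniques named above, the following is a minimal sketch of how clustering and a one-level decision tree operate on district-level records. It does not reproduce the study's data, variables or results: the six districts, the feature names (adult_literacy, urban_share), the growth figures and the 0.20 cut-off are all invented for illustration, and a real analysis would rely on an established statistical library and far larger tables.

```python
from statistics import mean

# Invented district-level records (NOT taken from the Chilean censuses):
# (district, adult literacy rate, urban share, school growth rate)
DISTRICTS = [
    ("A", 0.30, 0.10, 0.05),
    ("B", 0.35, 0.15, 0.08),
    ("C", 0.40, 0.45, 0.10),
    ("D", 0.70, 0.40, 0.30),
    ("E", 0.75, 0.70, 0.35),
    ("F", 0.80, 0.65, 0.40),
]
FEATURES = ["adult_literacy", "urban_share"]  # hypothetical variable names


def squared_distance(p, q):
    """Squared Euclidean distance between two feature vectors."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def kmeans(points, k, iterations=20):
    """Basic k-means; the first k points seed the centroids (deterministic)."""
    centroids = [tuple(p) for p in points[:k]]
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: each district joins its nearest centroid.
        labels = [
            min(range(k), key=lambda c: squared_distance(p, centroids[c]))
            for p in points
        ]
        # Update step: each centroid moves to the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:  # keep the old centroid if a cluster empties
                centroids[c] = tuple(mean(dim) for dim in zip(*members))
    return labels


def gini(labels):
    """Gini impurity of a list of 0/1 labels (0.0 means a pure group)."""
    if not labels:
        return 0.0
    share = sum(labels) / len(labels)
    return 1.0 - share ** 2 - (1.0 - share) ** 2


def best_split(rows, target):
    """One-level decision tree: the (impurity, feature, threshold)
    whose split leaves the two sides as pure as possible."""
    best = None
    for index, feature in enumerate(FEATURES):
        values = sorted({row[index] for row in rows})
        for low, high in zip(values, values[1:]):
            threshold = (low + high) / 2
            left = [t for row, t in zip(rows, target) if row[index] <= threshold]
            right = [t for row, t in zip(rows, target) if row[index] > threshold]
            impurity = (len(left) * gini(left) + len(right) * gini(right)) / len(rows)
            if best is None or impurity < best[0]:
                best = (impurity, feature, threshold)
    return best


rows = [(d[1], d[2]) for d in DISTRICTS]
# 0.20 is an arbitrary cut-off for "high" school growth in this toy example.
high_growth = [1 if d[3] > 0.20 else 0 for d in DISTRICTS]

cluster_labels = kmeans(rows, k=2)
impurity, feature, threshold = best_split(rows, high_growth)
print(cluster_labels)
print(feature, round(threshold, 2), impurity)
```

The clustering step groups districts by their overall profile without reference to any outcome, whereas the decision tree asks which single variable and threshold best separates districts with high school growth from the rest. On these invented numbers the two steps need not agree: a district can resemble one group in its profile while behaving like another in its outcome, which is the kind of discrepancy a historian would then have to explain from the sources.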

The potential of cross-sectional data is enormous. Working with social science data enables multivariate analysis, which has great explanatory potential, but involves significant challenges. For example, it requires us to elaborate long-term chronological series of data, to make decisions about the representativeness of the data, and to make assumptions that may increase the estimation error of a variable backwards through time. In turn, fascinatingly, a longitudinal dataset is a key analytical tool for establishing the temporality of a historical problem.

Daria Tisch and Eva Maria Gajek

Social science data can serve a dual purpose in historical analysis: both as a source for scientific analysis and as the object of study themselves. Rudolf Martin's Yearbooks are an interesting example. Martin, a former Prussian civil servant at the Home Office (Reichsamt des Innern), published the Jahrbücher des Vermögens und Einkommens der Millionäre (Yearbooks of the Wealth and Income of Millionaires) from 1912 to 1914. Martin's books provide detailed insights into the distribution of wealth among the elite in the German Reich of the early twentieth century. They consist of two parts: a list of the names, addresses, income and assets of the millionaires, followed by the subjects’ biographies in narrative form. Prior studies have already spot-checked Martin's data against tax records and have verified their accuracy.⁹

Martin's Yearbooks are a popular source not only in economic and social history but also in economics and the social sciences, for investigating the distribution of wealth and the social structure of millionaires as a group. The Jahrbücher have been used, for example, to study the recruitment and reproduction of economic elites.¹⁰ In recent years, teams of researchers have begun digitizing and transcribing the data to pursue different research questions. It is precisely the combination of the different types of social data published by Martin that opens up a plethora of questions relating to the regional and urban distribution of wealth, the relationship between income and wealth and gender relations, as well as social stratification and occupational affiliations. Furthermore, comparing these historical data with contemporary rich lists enables us to track changes in the composition of the wealth elite and the perpetuation of fortunes within family lines.¹¹

Martin's Yearbooks can also serve as an example for focusing on the generation of social data itself. Such an analysis can reveal a great deal about the techniques used to measure and explain social structures. This applies both to the organizing instruments of the list – its categories, arrangements and listing methods – and to the narrative description, in which aspects are emphasized, omitted, distorted and thus brought into a meaningful form. We also gain valuable insights into the socio-cultural and economic dynamics of the time.¹² When Martin published his data, he met with not only encouragement but also opposition from different actors. His data served as an argument in the contemporary tax debate and informed a political discussion about the protection and privacy of social data and the public's right to know about wealth distribution. Above all, it served to demonstrate to a wider public, through facts and figures, the emergence of great wealth since the era of industrialization. In this sense, the generation of social data also says something about how society uses such data to communicate about itself.¹³ Social data and their generation therefore offer worthwhile approaches to different disciplines of historical science, from the history of knowledge, cultural history and socio-economic history to political history.

2. In your field and country, how widespread is the use of social science data for historical purposes? To what extent are contemporary historians currently using (and possibly linking) qualitative and quantitative data produced by the social sciences? Are there obstacles to this practice, and what are they?

Reiko Hayashi

In Japan, historical demography expanded in the 1990s at the initiative of the late Akira Hayami. Scholars dealt with data from the Edo period (1600–1867) or before, using population registration data conserved at the municipal level or in temples.¹⁴ As historical studies moved on to more recent times, research now focuses on the Meiji era (1868–1911), the Taisho era (1912–1925) and the Showa era (1926–1988). These were followed by the Heisei era (1989–2018) and the present Reiwa era, which began in 2019. For these ‘modern’ periods, there are more data, as a modern statistics system was established during the Meiji period. There has already been a good amount of research to compile statistics published by the authorities and digitize them in Excel or other database formats. However, these compilations do not cover all the statistics published, so there is a need to go back to the printed version to get additional data. More original documents are publicly available in PDF format from the National Diet Digital Library or other data archives, but there are some copyright issues with reprinted material, and some historical materials have not yet been digitized.¹⁵

In order to sort and interpret data, scholars need knowledge accumulated by researchers. This gives rise to a new set of disciplines that are not necessarily nurtured by conventional humanities faculties, which are now experiencing budget cuts. I have noticed that new faculties, such as interdisciplinary sciences or information sciences, are more active in exploring historical data than the humanities, which tend to stick to the conventional qualitative approach.

To take population registration in the Edo period as an example, handwritten information needs to be deciphered into texts and then entered in a formatted data sheet. The writing used in the original documents is difficult to read, so some researchers are now using AI to recognize characters automatically, but the technology is still in its infancy.¹⁶ The Edo period population registration record also includes ‘failed to register’ data, indicating for example that people migrated out of a village. So, for any historical analysis, it is necessary to know specific information regarding registration practice.

In modern population data, there is a problem with under-registration, especially in the nineteenth century. Many historians have tried to estimate the extent of this issue. Nonetheless, demographers of later eras tend to have a prejudice that older periods had many statistical defects that could distort interpretation and the results of estimation and hence do not hold historical re-analysis in high esteem.

As for the cause of death data on which I work, the definition of ‘cause of death’ is in line with the international classification from 1899, but the interpretation of certifying doctors differs from international practice at the time and from what modern-day doctors diagnose. Thus, the format of the data makes it difficult to discern coherent, structural long-term trends in causes of death.¹⁷

Daria Tisch and Eva Maria Gajek

In historical research in Germany, quantitative analysis of social data was for a long time primarily an instrument of economic history, and occasionally also of social history. Most historians have tended to concentrate on qualitative research methods not only because the cultural turn has ignored the economic and material foundations of societies but also because there is a lack of training in quantitative skills at many universities. The use of quantitative data requires not only the strengthening of such skills but also the ability to place the data in long-term historical contexts and narratives.

In recent years, historians in Germany have increasingly begun to reflect on the use of social science data in a broader context.¹⁸ Many studies no longer use numerical data uncritically, as was the case in the 1990s, but reflect on and historicize them. For example, they ask about the knowledge regimes behind statistical data on wealth distribution¹⁹ or show the effects of tax data on political action.²⁰ The combination of quantitative and qualitative approaches is thus already being taken up, albeit sporadically: historical narratives draw not only on the informative value of the data themselves but also on the conditions of origin, the organizational structures and the historical contexts from which they emerged.

A mixed methods approach is still extremely rare in the field of history.²¹ However, there are a growing number of collaborative projects involving both historians and social scientists. This exchange helps researchers in both disciplines to reflect on their approaches, pose new research questions and conduct innovative research. The Martin Yearbooks are a good example as they have brought historians, economists and social scientists into dialogue. The same applies to the authors of these contributions, who started talking about the data and their creation and have since been collaborating to harness their full potential. Beyond the information on income and assets, the address details and Martin's biographical notes also open up new research opportunities. The latter in particular have so far only been used as a kind of lexical source, without scrutinizing the narrative. We therefore believe that it is crucial to bring together competences and skills in interdisciplinary projects. Such collaborations offer great opportunities for a comprehensive historical understanding of social phenomena based on social science data.

María Francisca Rengifo Streeter

In the Chilean context, the use of social science data is a fairly recent development and more common in the field of economic history; it is regrettably absent from several fields of study, including public policy.

With respect to statistical data – censuses and statistical yearbooks – the biggest obstacle lies in the difficulty of accessing these sources and understanding the data. First, in Chile, the census records for individuals have not been preserved, so we only have the information aggregated at the district level, the smallest territorial unit, for the nineteenth century and most of the twentieth century. Socio-economic surveys at household level began to be conducted in the mid-twentieth century, but only the results of the most recent ones are accessible to researchers. Moreover, although historical censuses have been digitized, there is no central repository for their data. In order to use them, each researcher has to build his or her own database, investing a great deal of work and time.

Second, though not in importance, it is essential to understand what was measured and how: in what specific contexts such data were collected and for what purpose.

To really understand the data, methodologically speaking, one needs to describe their emergence in order to identify their nature. This requires interdisciplinary sensitivity because it is necessary to provide the metadata and also be familiar with the methodological tools for the analysis.

A very important question is: Which particular techniques are appropriate for a given research question, considering the specificities of the data at hand?

Jon Lawrence

Historians have long made use of quantitative social science data in Britain, most obviously in historical demography, where historians such as E.A. Wrigley and Roger Schofield championed new methods, first developed in France, for producing robust accounts of population change from partial, pre-modern data.²² More recently, their successors have structured digitized household-level census returns to facilitate the historical analysis of social and economic change in Victorian and Edwardian Britain (household-level census data is closed for 100 years; we currently await the public release of the 1921 returns as a structured dataset).²³ In Britain, the reuse of qualitative social science data was pioneered by social scientists themselves, most notably by Mike Savage in his study Identities and Social Change.²⁴ Perhaps understandably, social scientists’ reuse of historical qualitative data has been principally directed towards two goals: first, the disciplinary history of the social sciences, and especially of social science methods; and second, to facilitate restudies of particular places or problems (where re-analysis of historic testimony allows researchers to establish a base line against which to measure the findings of their present-day restudy).²⁵ Equally understandably, historians such as Selina Todd, David Cowan and I have been principally interested in understanding the past for its own sake and have focused primarily on re-analyzing qualitative social science testimony as a uniquely rich source of contemporary personal testimony that can shed light on how people understood their own lives.²⁶ Historians usually seek to contextualize surviving testimony by relating it to socio-economic data about demography, living standards, housing etc., alongside any quantitative data created at case level within the original project. This tends to be richest where testimony was collected as part of a structured, questionnaire-based survey.

In recent years, sociologists have explored the potential of collecting qualitative data from members of longitudinal studies such as Britain's 1946, 1958 and 1970 birth cohort studies, where personal testimony can be related to rich quantitative data collected intermittently across an individual's life course.²⁷ This is something that historians are also now championing as a potential way to develop new, richer social histories of everyday life.²⁸ Nor is this the end of the story. Ambitious initiatives such as the American Voices Project, which has collected qualitative data on a mass scale from across the United States since the late 2010s, and developments in the analytical power of machine learning and Natural Language Processing, have encouraged social scientists to argue for the systematic collection of representative qualitative data.²⁹ If these ambitions come to fruition, future historians can hope to have access to vast collections of personal testimony collected from the general population rather than as a by-product of specific, often tightly focused, social science projects. Given how much has already been achieved by historians working with existing archived qualitative data, this is an exciting prospect.

Kerstin Brückweh

When I started my book on censuses, surveys and opinion polls in the United Kingdom, I had thought that I would write one chapter on source criticism and then move to analyzing the data in the following chapters of the book. It turned out differently because I became fascinated by the images and interpretations of the world inherent in the knowledge produced by social scientists, including the tools and methods they used. The source criticism turned into a history of knowledge and a book of its own.³⁰ Since then, whenever I hear someone using words like ‘raw data’, ‘objective’ or ‘representative’ data, I feel compelled to ask what they mean. I have learned that social scientists are cautious with these terms – at least among themselves.³¹ My relationship with the world of social scientists became even more complex when I entered a new field of research: the history of the transformation of (post-)socialist European states before, during and after 1989. During the 1990s, many social scientists from the so-called Western countries with only limited experience of research on socialist states began studying post-socialist countries. Their enthusiasm might have been driven by genuine curiosity for the changing world of the 1990s, or by research funding opportunities (both of which subsequently declined from the 2000s onwards).³² Although the history of social scientists and the social sciences in the 1990s needs further research, it can already be seen that these scientists followed their own agendas as they took decisions about methods, theories, staff and cooperation partners. All this influenced the knowledge they produced and thus may impact historians’ interpretations and narratives when they access this material for secondary analysis.

The history of knowledge is now an established approach,³³ and it shows that scientists themselves are powerful historical actors. In the case of Germany, it is important to remember the dual role of West German social scientists. They came to study the revolution of 1989/90 and its consequences, and at the same time, they were involved in transforming their academic disciplines in the former German Democratic Republic (GDR). Take the example of the Commission for Research on Social and Political Change in the New Federal States (KSPW). It was constituted to study social and political change in the former GDR, to provide an empirical and theoretical foundation for policy recommendations, to cooperate with social scientists in the so-called new federal states and to support young scholars there.³⁴ In 1996, when the KSPW's government funding ended, its members and other scientists had amassed a huge corpus of material, produced a wide-ranging body of knowledge, and contributed to the transformation of the social sciences in the former GDR. This raises questions of unequal power relations.³⁵

For me, the history of knowledge, as a form of extended source criticism, is a necessary step and precondition to the use of social science data. Of course, this takes up valuable research time and thus poses one of the challenges that we are discussing here. In the research group on the long history of 1989 that I led at the Leibniz Centre for Contemporary History in Potsdam, we decided to only select data for our analysis if they were properly historicized and contextualized. The process of secondary analysis included locating sources (often in the attics or basements of the primary researchers), convincing the primary researchers that their material was of historical interest, and sorting the material and making it usable for historians. A question that has still not been satisfactorily solved, in my opinion, is where such materials from the German social sciences are to be archived for future use.

3. Are there any widely recognized standards for the historical re-analysis of social science data at this point? To what extent do historians rely on pragmatic choices in the absence of such standards? Social scientists and historians follow different methodological protocols – does this result in specific challenges or opportunities?

María Francisca Rengifo Streeter

Standards should emerge from a necessary critique of the data source. How reliable is it? How representative are the data? How fragmentary are they, and what is their scope? In other words, we have to carry out a critical assessment of the information source. Data collected by a particular instrument for a specific study may subsequently be analyzed in an entirely different way and for a different purpose. It is therefore important not only to preserve the data, but also to document the manner in which they have been labelled and archived. Working with a database thus always requires specialized knowledge of the social science field it emerged from and of that field's curation practices.

Daria Tisch and Eva Maria Gajek

Currently, there are no universally accepted standards specifically for the historical re-analysis of social science data. Historians often rely on pragmatic choices and interdisciplinary approaches due to the absence of strict guidelines.

The lack of such standards presents both challenges and opportunities. One challenge is data interpretation. Social science data are often collected with the goals of their own time in mind, which might not align with those of present-day historical inquiries. This discrepancy requires historians to critically assess the purpose and context of data collection in order to avoid anachronistic conclusions.

However, the lack of standardized protocols offers historians the flexibility to make pioneering methodological choices, encouraging innovation and interdisciplinary research and thus enhancing the depth and breadth of historical inquiry. Open science practices, such as publishing social science data, further enrich this process by promoting transparency and collaboration. To return to Martin's Jahrbücher, researchers from different disciplines (economists, historians and sociologists) were brought together by their interest in these sources, and collaborated in digitizing the books and refining their research on the data. Their different perspectives on the same source paved the way for novel, transformative research questions.

Another challenge lies in the practices of data anonymization. While historians are heavily dependent on identifying individuals and naming names for their narratives, this is less relevant for the questions posed by social scientists, especially in quantitative research. Data on wealth and income require a high degree of sensitivity from the scholar. Even though our research focuses on a group that has by no means been ignored by historiography and has left behind numerous sources and testimonies, working with data on income and wealth, and on the topic of wealth in general, poses specific challenges. The saying ‘We don't talk about money’ is widespread in Germany and has long shaped research practices. To what extent does the public and academic interest in the very rich justify the publication of private wealth data, as well as the names and addresses of well-off individuals? This is an ethical question that not only influences societal discourse throughout history (i.e. the question of how much a society must, may and should know about the distribution of wealth) but also significantly impacts the research process itself (i.e. the question of whether researchers should publish the personal data of the super-rich).

Jon Lawrence

These disciplinary differences over the ethics of archiving testimony, and indeed secondary analysis as a whole, are important. Some social scientists would argue that personal testimony should be destroyed or embargoed after it has been used by its original collectors, since ‘informed consent’ cannot be extended to subsequent, ‘secondary’ analyses.³⁶ Clearly, if this interpretation of social science ethics becomes general, whole swathes of secondary analysis will become unavailable to future historians.

In my experience, historians bring the same professional standards to the analysis of qualitative social science data as to any other historical source. That is, they ask the key questions about provenance that they would ask of any source: how was it created, by whom, under what circumstances and for what purpose? Why has it survived in this form, what may have been lost and how does the archived material relate to: (a) everything once collected by the original project and (b) the wider social world within which this material was created? In this sense, historians share the social scientist's concern with representativeness, but they are trained to work with a much looser, non-statistical standard more suitable for assessing the typicality or otherwise of archival traces. Unlike the social scientist, they cannot create their own samples (except through oral history, where early practitioners did advocate a statistical model of representativeness).³⁷ When it comes to methods of analysis, historians are trained in the close reading of text. It is their disciplinary bread and butter, but they rarely adopt any one critical method. The hermeneutic tradition runs deep in modern social history, but historians tend to borrow pragmatically from different analytical methods in response to the challenges posed by different source types. There is no dominant standard for qualitative analysis among historians, and I am not convinced that it would be desirable to seek one.

Reiko Hayashi

For the population sciences, both historical and contemporary demographers use the same analytical framework to analyze population, birth, death and migration. Indicators, such as crude birth/death rate, life expectancy or total fertility rate, are defined in more or less similar ways. Just as migration indicators vary even in contemporary analysis, they also vary in historical studies.
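
As a hedged illustration of the indicators named above, the following Python sketch computes a crude birth or death rate and a total fertility rate under their standard textbook definitions (events per 1,000 mid-year population; the sum of age-specific fertility rates multiplied by the width of the age interval). The function names, the age groups and all figures are invented for illustration and do not come from any dataset discussed here.

```python
def crude_rate(events, midyear_population, per=1000):
    """Crude birth or death rate: events per `per` persons in the mid-year population."""
    if midyear_population <= 0:
        raise ValueError("mid-year population must be positive")
    return events / midyear_population * per


def total_fertility_rate(births_by_age, women_by_age, interval_years=5):
    """Total fertility rate from births and female population by age group.

    Each age-specific fertility rate is births divided by women in that
    age group; the TFR is their sum multiplied by the interval width.
    """
    if set(births_by_age) != set(women_by_age):
        raise ValueError("age groups must match in both tables")
    asfr = {
        group: births_by_age[group] / women_by_age[group]
        for group in births_by_age
    }
    return interval_years * sum(asfr.values())


# Hypothetical figures for a small historical parish (illustrative only).
births = {"15-19": 10, "20-24": 40}
women = {"15-19": 100, "20-24": 200}

print(crude_rate(150, 10_000))              # births per 1,000 population
print(total_fertility_rate(births, women))  # children per woman
```

The sketch leaves the harder historical problem untouched: whether the numerator and the denominator in a given source were counted for the same territory and the same population still has to be established by source criticism.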

Kerstin Brückweh

For the mixed methods secondary analysis of social science data, it seems that for a very long time, each scholar has had to start from scratch. However, right now, I see changes and new opportunities. Smaller bottom-up initiatives have begun to shape the field, pose questions and demand standards for their research. Our German working group on ‘Social Data and Contemporary History’ is one such initiative. It aims to bring together historians who face similar challenges, as well as colleagues from archival infrastructures and the social sciences. We debate specific topics, such as social inequality in history or longitudinal dataset analysis, and aim to coordinate our activities to influence ongoing debates on ethical and legal questions of access, archiving and anonymization. I also see recent changes from the top. Large funding programmes like the German National Research Data Infrastructures (NFDI) have been set up, and while the social sciences were quicker to organize (with RatSWD and KonsortSWD), historians have now followed suit. We have established the NFDI4Memory infrastructure to tackle questions of data management and serve the specific demands of researchers who ‘use historical methods or […] rely on data requiring historical contextualization.’³⁸ I see this as an opportunity to establish long-term structures for the field of history in Germany. We can finally embrace the digital age – in terms of both methods and digitized or digital-born sources.

4. What is your view on the skills and training that historians should acquire to put social science data to good use? How do you assess the opportunities and risks posed by historians’ use of statistical analysis, distant reading, machine learning and artificial intelligence methods in this respect?

Reiko Hayashi

Data literacy skills are needed. For example, researchers have to know how data were collected and what they cover. Then they have to calculate widely used indicators correctly. As I mentioned earlier, there are great opportunities for transcribing handwritten records into digital text or carrying out OCR on printed material (in PDF format). Visualization, including GIS (geographical information systems), is useful for better understanding the data. The risks of artificial intelligence are the same as in any other field: AI-based interpretations are produced in a black box. However, AI tools can offer useful or interesting additional information on how to look at data.

Daria Tisch and Eva Maria Gajek

To use social science data effectively, historians should develop a range of skills that blend traditional historical methods with statistical data analysis techniques. Training in statistical analysis, ‘big data’, and digital humanities tools would enable historians to navigate and analyze large datasets. In particular, this training would allow historians to handle vast quantities of text and data that were previously unmanageable, providing new insights into historical processes and social phenomena. For example, Martin's Jahrbücher could be linked to other well-established sources, such as the Deutsche Biographie, to create extensive datasets that can then be used to answer a plethora of research questions. Through this dual approach, social science data, such as Martin's Jahrbücher, provide a robust framework for historical analysis by offering both a direct window into the past and a meta-level for examining the role and reception of historical data in society.

María Francisca Rengifo Streeter

Historians will need interdisciplinary sensitivity and knowledge of methodological tools for data analysis. To avoid repeating myself, I would like to add a key point: for history, the main tension is temporality. By this, I mean that data are collected at one point in time, processed at another and perhaps used at yet another, and for specific goals.

Kerstin Brückweh

I agree about temporality. However, I also think that spatiality is a key point, especially when it comes to longitudinal sources. While comparable quantitative data may be available for longitudinal studies on an international, national or regional level, this may not be the case for qualitative data. If we want to combine both in order to arrive at a nuanced picture of the past, we cannot favour one kind of data over the other. For example, we have just started a small project, funded by NFDI4Memory, called ‘Geodata as Social Data for Historical Longitudinal Analyses?’ In it, we plan to explore the use of drones and GNSS (global navigation satellite system) rovers for deep mapping in contemporary historical research. Historical maps and plans from the scientific collections of the Leibniz Institute for Research on Society and Space, where I am based, and from other regional archives provide points of reference by showing the state of the landscape in the past.³⁹ The project intends to use drone overflights to record the current state of these spaces, focusing on issues of land use, property rights and access restrictions. In turn, the project investigates social relations and their changes as they manifest themselves spatially. Its aim is to explore to what extent historical social changes can be understood via spatial data. This approach is experimental and carries the risk of failure. Put differently: in our research group on ‘The long history of 1989’, we began with lots of ideas of how to combine qualitative and quantitative data, and we also tried some out. But frankly, many were impossible to realize within the output-oriented context of modern-day academia. It takes time to rethink approaches and to learn the skills for understanding statistics and methods in the digital age, and even more time to combine them for longitudinal studies. We need more time for creativity, and more room for failure, in today's scientific world.

Jon Lawrence

I am most intrigued by the potential of machine learning for transforming how historians work with qualitative social science data at scale. The key word here is scale. When we are working with collections of a few hundred interviews, it remains possible to use standard historical methods of exhaustive close reading. But when there are thousands, or tens of thousands of cases, this becomes practically impossible and also unnecessary. Machine learning tools can be trained (through annotation) to enhance the existing metadata of large qualitative data collections, so that individuals with characteristics or experiences of interest to the researcher can be quickly and accurately identified from the mass. This would be a classic example of distant and close reading working in tandem, but there may also be considerable scope to use distant reading techniques to better understand the distinctive patterns of different large-scale qualitative collections.⁴⁰ Certainly, historians will be ill-advised to ignore the powerful potential of these tools.
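
To make the idea of training a tool through annotation more concrete, here is a minimal Python sketch. It is not the method of any project mentioned in this discussion. It assumes a simple multinomial naive Bayes classifier with add-one smoothing, written with the standard library only, and the interview snippets, the labels and the class name `NaiveBayesTagger` are all invented for illustration. A handful of hand-annotated passages is used to suggest a thematic tag for unread passages, which a historian could then use to decide what to read closely.

```python
import math
import re
from collections import Counter, defaultdict


def tokenize(text):
    """Lower-case the text and split it into alphabetic word tokens."""
    return re.findall(r"[a-z]+", text.lower())


class NaiveBayesTagger:
    """Hypothetical helper: multinomial naive Bayes with add-one smoothing."""

    def __init__(self):
        self.word_counts = defaultdict(Counter)  # label -> word frequencies
        self.doc_counts = Counter()              # label -> annotated passages
        self.vocabulary = set()

    def train(self, annotated_passages):
        """Learn from (text, label) pairs produced by human annotators."""
        for text, label in annotated_passages:
            tokens = tokenize(text)
            self.doc_counts[label] += 1
            self.word_counts[label].update(tokens)
            self.vocabulary.update(tokens)

    def predict(self, text):
        """Return the most probable label for an unread passage."""
        tokens = [t for t in tokenize(text) if t in self.vocabulary]
        total_docs = sum(self.doc_counts.values())
        best_label, best_score = None, -math.inf
        for label in sorted(self.doc_counts):
            score = math.log(self.doc_counts[label] / total_docs)
            total_words = sum(self.word_counts[label].values())
            denominator = total_words + len(self.vocabulary)
            for token in tokens:
                score += math.log((self.word_counts[label][token] + 1) / denominator)
            if score > best_score:
                best_label, best_score = label, score
        return best_label


# Invented snippets standing in for annotated interview passages.
annotated = [
    ("We moved to the new estate after the factory closed", "migration"),
    ("We left the village and moved to the city for work", "migration"),
    ("My husband lost his job when the plant shut down", "unemployment"),
    ("I was out of work for two years after the closure", "unemployment"),
]

tagger = NaiveBayesTagger()
tagger.train(annotated)
print(tagger.predict("They moved away from the village"))
print(tagger.predict("He lost his job"))
```

A real collection would need far more annotated examples and a check of the suggested tags against passages the tool has not seen. The suggested tag is a finding aid for close reading, not a replacement for it.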

Footnotes

Funding

The authors received no financial support for the research, authorship and/or publication of this article.

Notes