Abstract
In this series of articles for Business Information Review, I have discussed the dramatic impact of technology and digitization on the world of information. At times, these articles have addressed the need for new skills and competencies from the Information Scientist; they have also introduced the impact of big data in science and society.
In this article, I use my recent experience in academic research to explain how the essential disciplines of good records management (RM) are a necessary foundation for big data research. Like the quality movement of the 1990s, embedding good RM in big data research will avoid costly rework, illuminating erroneous conclusions and advancing research. There is more that the Information Scientist can bring to big data: managing the unpublishing process and guiding the integration of documents via structured data. These are areas in which Information Scientists have the necessary background, discipline and experience.
Keywords
Old dog, new tricks?
Early in 2014, I started a new job moving back to a primarily research environment and much closer to the coalface – more doing, less managing. As might be imagined the change was profound.
Before starting work, I whimsically compared my experience to a 1960s film noir, The Swimmer. In the film, the star, at a transition point in his life, decides to swim across all the swimming pools in a valley in California; the plot is the backdrop to a series of vignettes of his life so far and the people he meets; the noir elements are that, as the story unfolds, so does the depth and depravity of the stories. My whimsy uses the framework of the plot, namely a journey through my scientific experiences, but without any of the darker elements. As we are all formed by and captured by our experience, it is an experiment in whether, and how, the experience, methods and ideas accumulated over 30 years are still applicable to cutting-edge contemporary research, particularly in the field of big data.
My own journey
My own entry into research, 30+ years ago, ‘started’ with very little formal training in Research Methods or Research Practice. As an experimentalist, I was drilled in the importance of my lab notebook, in which the librarians and archivists showed a keen interest and for which they laid down very clear rules on the accurate recording, and deletion, of my reported results. Both are good disciplines that have stayed with me to this day and have proved of great value in commercial situations as well as being the bedrock of patent challenge.
Later on, I worked on high reliability systems and on clinical submissions in the pharmaceutical sector. I explored the nascent technology to deliver solutions for automated systems building, document version management and automated document compilation and submission. Like the research and development activities in the pharmaceutical industry itself this was an exposure to two worlds – research and development. The two worlds are quite different: research is hunting for novelty which can later be confirmed whilst development is disciplined to produce consistency. Research has the second chance; data, often considered in development, has no room for error.
The missing world in research
Some 30 years ago, the dependencies of data, and of its interpretation, were far fewer and independent verification ‘of the facts’ was very much easier. Then, if I were to share data, it would be the raw data from experimentation; the analysis codes, and their assumptions, were simple. The ‘distance’ in data and code from measurement to conclusion was much shorter. Today our scientific research is built on a much more complicated and interdependent superstructure of different data sources and chains of analysis software, each containing a number of hidden assumptions.
There are many concerns in delivering substantive findings on these massive disparate data sets, summarized by Gina Kolata writing in the New York Times 1: the fear is that this avalanche of genetic and clinical data about people and how they respond to treatments will be hopelessly fragmented and impede the advance of medical science.
Whilst this comment is directed specifically at Translational Biomedical Research, it has relevance to all areas employing big data. To quote Trevor Garrett, a lead researcher on a Dutch national project 2: Unfortunately, a lot of data is filed away without much consideration given to format or future use. This data could be reused. One of the greatest weaknesses to any translational research study is lack of enough data to make definitive conclusions. Old data that cannot be compared to new data is opportunity lost. Scientists now have too much choice when it comes to data formats. In fact, it’s quite common for researchers to invent formats for each new technique and sometimes each experiment. This makes the work of integrating large data sets significantly more difficult. Lots of bioinformaticians toiling away in dark cubicles. This is at the very least inefficient, if not an outright insurmountable barrier. It could be better.
He then goes on to describe approaches to the standardization and mapping of medical concepts across different classifications integral to the Scientific Data initiative from the publishers of Nature. In their recent appeal contained in Our roadmap to engagement; your call, 3 they introduced the software tools and data conventions which improve definition and metadata for each data set.
However, here I discuss a much more prosaic concern: how can we ensure that the core processes have the quality and reliability needed to give us confidence in our results? In the experimental world, this requires the specification and control of the environment and its active elements. In the world of data, the corresponding controls come from the quality and reliability of the record handling. It means bringing the disciplines learned in pharmaceutical development into the conduct of academic research. It is a problem that we can address with existing practices.
Verifying data
Data are often considered as an absolute, factual quantity: evidence that is unambiguous and unarguable. The scientific method permits new theories, but every theory is, or should be, underpinned by good data. One of Einstein’s dictums was ‘I am utterly convinced that God does not play dice with the universe’, which is commonly interpreted as meaning that the natural world is a reliable and consistent foil for experimentation. Despite the difficulties of experimentation, the same experiment can, and should, produce the same result. Equally, a traditional experiment provides simple and uncontroversial evidence confirming the steps which describe its process.
Data, and data experiments, are more fickle. Hidden within each data set is a multiplicity of conditions, assumptions, transformations and edits leading up to the final data edition, and in big data the complexity of this working process could undermine the validity of any conclusion. It is a journey that requires the good disciplines of records management.
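As an illustration only, the chain of assumptions, transformations and edits described above could be written down as it happens rather than reconstructed afterwards. The sketch below assumes a small in-memory data set; the class `ProvenanceLog` and the steps `drop_missing` and `scale` are hypothetical names invented for this example, not part of any existing tool.

```python
import hashlib
import json
from dataclasses import dataclass, field


def digest(rows) -> str:
    """Fingerprint a list of JSON-serializable rows."""
    payload = json.dumps(rows, sort_keys=True).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()


@dataclass
class ProvenanceLog:
    """Hypothetical record of every step between raw data and final edition."""
    steps: list = field(default_factory=list)

    def apply(self, rows, func, note, **params):
        """Run one transformation and record what went in and what came out."""
        result = func(rows, **params)
        self.steps.append({
            "step": func.__name__,
            "note": note,              # the assumption behind the step
            "params": params,
            "input": digest(rows),
            "output": digest(result),
        })
        return result


def drop_missing(rows, key):
    """Hypothetical cleaning step: discard rows with no value for `key`."""
    return [r for r in rows if r.get(key) is not None]


def scale(rows, key, factor):
    """Hypothetical unit conversion: multiply `key` by `factor`."""
    return [{**r, key: r[key] * factor} for r in rows]


# Placeholder values, used only to exercise the log.
raw = [{"id": 1, "dose": 2.0}, {"id": 2, "dose": None}, {"id": 3, "dose": 0.5}]
log = ProvenanceLog()
clean = log.apply(raw, drop_missing, "rows without a dose are excluded", key="dose")
final = log.apply(clean, scale, "doses converted from g to mg", key="dose", factor=1000)
```

Each entry in `log.steps` names the step, states its assumption and fingerprints its input and output, so a later reader can check that the output of one step is exactly the input of the next.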
I think we are all familiar with Newton’s famous quotation (If I have seen further it is by standing on the shoulders of Giants) which recognizes the social structure in science and the contribution of the researchers whose work, and publications, have built the base for our current research.
The intensity and velocity of the pyramid of shoulders [the publications of giants] is not without its difficulties. Already this year there have been a number of significant retractions and challenges to recent seminal papers. Some have dark explanations of fraud but most represent the honest mistake or error inevitable within any large assembly of data.
As we enter the era of ‘big data’, validation or confirmation is not as simple as rerunning an experiment. Whilst data analyses are themselves experiments, they are often built upon a battery of assumptions embedded in code, settings and approximations. A data experiment is liable both to genuine errors or slips and to systemic errors caused by the embedded errors and approximations inherited from the pyramid of the work of others.
Backward to the Excel generation
The latest spat in popular literature comes from the recent 696-page bestseller by Thomas Piketty Capital in the Twenty-First Century. 4 In trying to understand its success, commentators explain that there is increasing interest in the subject of economic inequality in the United States. The book delivers a clear critique of recent governments, worldwide, in allowing the income gap to widen. Piketty’s assertions and his advocacy of the book are based on the data used in its development and the ‘collection of big data’. In the new paradigm of ‘big data’, he provided all his source data on his website.
Of course the thesis behind his work is controversial and crosses into politics. A trenchant and detailed critique has been published in the Financial Times. Here the critic, Chris Giles, 5 has interrogated the data used by Piketty and uncovered a number of data errors, which he published in an academic attempt to undermine the (political) conclusions of the work. Giles’ work was itself then criticized by supporters of Piketty’s theories, including for errors in his data. Unlike a true political debate, the criticisms were embedded in the data sets used and in the choice, change and omission of pertinent data elements. This represents an argument on data which masks an argument on ideas.
What do we need to do?
In earlier articles for Business Information Review (on ‘Information Morality’), I have referenced the difficulty of rerunning the experiments published by academic authors (James, 2013). This represented a critique of the scientific method behind ‘big data’ research. My thinking and observations in this paper are aimed not at the methods of big data but at the processes surrounding big data. They are concerns over the provenance and reproducibility of experiments with data, concerns over the data itself.
Of course my career in the pharmaceutical industry has developed within me a healthy respect for data quality. Managing the data lifecycle paying attention to governance, husbandry and provenance is for the industry the bedrock of useful information. The debate over Piketty’s ‘spreadsheet’ slips and errors is also reminiscent of the concerns over spreadsheets in every pharmaceutical development laboratory in the 1990s, a problem successfully met and solved routinely.
Training for big data
As with many areas of science we have seen, in the past, approaches which address some of these issues but in the face of the rapid advance of digitization, the development has been unbalanced. We have been working hard at addressing the issues under conditions of scarcity before finding ourselves in an era of overabundance. Digital progress is rapid – both as the problem and the solution – it is simply our attitudes and experience that need to be accelerated.
Earlier ‘out of the box’ articles have discussed the unbalancing of our approach caused by the rapid development of digital capabilities; my summary presented in Table 1 suggests the differences in approach from an era of scarcity to an era of abundance.
Table 1. From scarcity to abundance.
Three of these aspects of big data are the topic of this article: big data, authoring and understanding. Each of these has traditionally been the domain of the information scientist and there are existing techniques, familiar to the information scientist, which should be applied to the emergent discipline of big data.
Provenance and records management
There are very many digital tools that address the challenges of Records Management so, for those properly instructed in the principles of the discipline, it is relatively easy to introduce them into everyday practice in a way reminiscent of the quality movement. There the slogan Quality is Free emphasized the difference in attitude, and outcome, as quality was assured when the improvement was built into the (manufacturing) processes rather than added as an afterthought through external inspection.
Today systems for version management are available as ubiquitous, pervasive services in code development. Git, for example, delivers distributed version control and edition management; it is widely used in software development but has not yet become the environment of choice for data, our big data.
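The same idea can be brought to data with very little machinery. The sketch below is a minimal illustration, not a substitute for Git or any records system: it fingerprints every file in a data directory so that a later edition can be compared with the recorded one. The function names `file_digest`, `build_manifest` and `changed_files` are assumptions made for this example.

```python
import hashlib
from pathlib import Path


def file_digest(path: Path) -> str:
    """Return the SHA-256 hex digest of a file, read in chunks."""
    h = hashlib.sha256()
    with path.open("rb") as fh:
        for chunk in iter(lambda: fh.read(65536), b""):
            h.update(chunk)
    return h.hexdigest()


def build_manifest(data_dir: Path) -> dict:
    """Map each file's path (relative to data_dir) to its digest."""
    return {
        p.relative_to(data_dir).as_posix(): file_digest(p)
        for p in sorted(data_dir.rglob("*"))
        if p.is_file()
    }


def changed_files(data_dir: Path, manifest: dict) -> list:
    """List files added, removed or altered since the manifest was recorded."""
    current = build_manifest(data_dir)
    return sorted(
        name
        for name in set(current) | set(manifest)
        if current.get(name) != manifest.get(name)
    )
```

A manifest recorded when an analysis is run, and kept outside the data directory (for instance alongside the analysis code under version control), lets anyone confirm later that the data behind a conclusion are the data now on disk.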
Unpublishing and literature reviews
The ‘reversal’ of the support for authoring – unpublishing – is an interesting new development. For the typical academic, publication, particularly in high-impact publications, is the springboard to their career. It is a benign conspiracy with the publishers of those same journals. For investigative science, particularly in the data intensive disciplines, it is the first to publish that counts. It generates an arms race to be first, but with this speed comes danger.
Information scientists have been at the heart of literature review and the practices of the Cochrane review, the social process in which a comparative and integrated review of the literature surrounding a (medical) intervention is assessed. The Cochrane review promotes or relates scientific findings in the light of wider evidence and reproducibility. It is a collective framework to handle the pace and fragility of the findings from unverifiable big data. Indeed, to add to the pantheon of odd journal titles, there is a growing market for the journal of unpublishing to join the Ig Nobel Prizes.
From the essay to the database
As the volume of our literature expands, there is a corresponding decline in the use of the material. In my new area of research, there is a continual flood of policy documents, learned papers, strategies and white papers, all very interesting but collectively overwhelming. That they are delivered in the form of our Victorian forebears, as an essay, makes any abbreviated sense-making or targeted interpretation difficult. The PDF format, format oriented and content naive, inhibits progress.
From joint developments in Information and Computing Science, we have the ability to create structured documents using any of a number of specialized markup languages, and yet we make no progress. Our researchers remain condemned to the electronic equivalent of a table of open journals, with only a hypertext link to switch between each journal but nothing such as pdfx 6 which assists in the structuring of our electronic forest. Twenty years ago, I worked with our Information Scientists on Standard Generalized Markup Language (SGML) and on converting database output into documents; there has been little progress in the reverse direction.
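To make the contrast with the essay concrete, the sketch below shows what a structured document allows: pulling out only the passages of a given kind. It is a minimal illustration using Python's standard XML parser; the markup itself (the `paper` and `section` elements and the `role` attribute) is hypothetical and does not follow any published schema, and the text is placeholder content.

```python
import xml.etree.ElementTree as ET

# Hypothetical markup: element and attribute names are illustrative only.
PAPER = """
<paper>
  <title>Placeholder title</title>
  <section role="method">Method text.</section>
  <section role="finding">First finding.</section>
  <section role="finding">Second finding.</section>
</paper>
"""


def extract(xml_text: str, role: str) -> list:
    """Return the text of every section tagged with the given role."""
    root = ET.fromstring(xml_text.strip())
    return [
        s.text.strip()
        for s in root.iter("section")
        if s.get("role") == role and s.text
    ]


findings = extract(PAPER, "finding")
```

With content marked up in this way, a reader facing a flood of papers could ask for the findings alone, which is exactly the targeted interpretation that a format-oriented file does not support.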
The wheel of progress?
One of the disappointing aspects of working in information technology is the continual reinvention of the wheel, where every new facet of technology evolves from the ‘wild west’ pioneering phases and only after the same journey of experience settles into a maturity of operation. The same patterns underpin this journey with the slow adoption of the prosaic and pragmatic disciplines of standards, process and record/version management.
Summary
Today’s research community has access to an unimaginable wealth of information and data. The facilities to do things faster have improved, but our ability to do things better is much slower to appear in the mainstream.
Today’s big data science is expanding rapidly. Where the big data lead to evident and testable conclusions, this is not a problem; but where the findings are difficult to verify, we are in a dangerous era in which any real findings are swamped by false findings. Whilst technology has advanced apace, our practices have been tardy, especially in carrying well-established good practices into this new epoch of innovation. Good practice from Information Science relevant to big data includes records management, cataloguing and data (literature) review. Tools exist, but the gap lies in experience and education.
Perhaps it is time for the information scientist to intervene in the dangers of poorly managed data, time to rework Newton’s famous quotation: If I have seen further it is by building on the records management of giants.
