Abstract
Data governance and data literacy are two important building blocks in the knowledge base of information professionals involved in supporting data-intensive research, and both address data quality and research data management. Applying data governance to research data management processes and data literacy education helps in delineating decision domains and defining accountability for decision making. Adopting data governance is advantageous, because it is a service based on standardised, repeatable processes and is designed to enable the transparency of data-related processes and cost reduction. It is also useful, because it refers to rules, policies, standards; decision rights; accountabilities and methods of enforcement. Therefore, although it has received more attention in corporate settings and some of the skills related to it are already possessed by librarians, knowledge of data governance is foundational for research data services, especially as it appears on all levels of research data services, and is applicable to big data.
Introduction
Data-intensive science, coupled with mandates for data management plans and open data from research funders, has led to a growing emphasis on research data management both in academia and in academic libraries. The role of the latter is changing, so academic librarians are often integrated into the research process, primarily within the framework of research data services (RDSs) (Tenopir et al., 2015). Therefore, it comes as no surprise that supporting data-intensive research is a top trend in academic library work (ACRL, 2014; NMC, 2014). It is in focus especially because it offers a chance to change the present situation, where faculty and researchers regard the library not as a place of real-time research support, but only as a dispensary of books and articles (Jahnke et al., 2012).
Against this background, a review of the literature was conducted in order to identify and examine significant constituents of the knowledge base that is crucial for information professionals involved in supporting data-intensive research. The first constituent is data governance (DG), which is dealt with extensively, mainly in the corporate (business) sector, and is explored in this paper with the belief that bringing it into the picture will enable better RDSs. The second one is data literacy, about which there is a massive body of literature, including review articles (Koltay, 2015a, 2015b; MacMillan, 2014). Data literacy is closely related to research data services that include research data management (RDM). As the concept of RDSs itself and data literacy education are still evolving, their relationship to data governance requires examination that may lead to some kind of synthesis. The management of data quality is also inspected in order to determine to what extent it plays the role of an interface between these two constituents.
Accordingly, this paper is built on three core terms. Data governance can be defined as the exercise of decision making and authority that comprises a system of decision rights and accountabilities that is based on agreed-upon models, which describe who can take what actions, when and under what circumstances, using what methods (DGI, 2015a). While the various definitions of data literacy will be discussed below, we define it here as the ability to process, sort and filter vast quantities of information, which requires knowing how to search, how to filter and process, to produce and synthesize it (Johnson, 2012). This definition is in accordance with the idea, expressed by Schneider (2013), that the boundaries between information in information literacy and data in data literacy are blurring, because these boundaries have never been rigid.
Research data services consist of a wide spectrum of informational and technical services that a library offers to researchers in managing the full data life cycle (Tenopir et al., 2012).
Research data services and the paradigms of academic library management
A better understanding of the academic libraries’ role in the data-intensive environment can be obtained if we place them into the context of academic librarianship’s past and present development paradigms, outlined by Martell (2009). The first paradigm, called the ‘Ownership’ or ‘Collections’ paradigm, evolved after World War II and reached its zenith in the 1960s. It was built on the assumption that campus library systems would be able to collect all documents that could adequately satisfy the institutions’ scholarly and teaching needs. Such support allowed for a broad range of interpretations, but it proved to be unsustainable and was supplanted by the ‘Access’ paradigm that directed more attention to and made use of resource sharing from the late 1970s until the end of the 20th century. Widespread access to digital material, in particular the availability of electronic full text of serials, made ownership in its traditional sense impractical, so the ‘iAccess’ paradigm came into being. More recently, the emergence and growing prevalence of social media has created an opportunity to add a social dimension to iAccess, forging in this way the ‘sAccess’ paradigm.
While social media undoubtedly plays a role in Research 2.0, it is often difficult to disentangle the features that are induced by its presence from the influence of the growing importance of data. Social media influences academic libraries in many ways. It produces enormous quantities of (big) data that can be analysed, published and reused mainly by researchers in the social sciences (Boyd and Crawford, 2012). It also changes the ways in which research is done, even though the lack of trust in social media channels for scholarly communication lessens its impact (Nicholas et al., 2014). Therefore, it is a demanding task to define to what extent data-intensive research pertains to iAccess and to sAccess. In any case, both paradigms have influence on it to some degree.
Data governance in detail
As stated above, data governance is a subject of interest for the business sector. Therefore, it is rarely addressed by the LIS literature. A notable exception is the work of Krier and Strasser (2014) that focuses on data management in libraries.
A review of definitions of data governance by Smith (2007) clearly shows the close ties of DG to the business sector. Besides providing a set of definitions that relate it to companies, enterprises and business, Smith underlines that ‘the process of data governance is to exercise control over the data within a corporate alignment’.
It seems clear that the academic sector, librarianship, as well as library and information science should also pay attention to DG, although it has attracted attention mainly in the business sector. Even though rather implicitly, this need is asserted by DosSantos (2015), who points out that the role of the data governor must shift to be something more akin to a data librarian in order to make data governance the driving force behind business innovation, instead of being an impediment to data. This goal can be attained by delivering information technology as a service and by enabling the processes of locating and organizing the best available data.
The expression data governance could refer to organizational bodies; rules, policies, standards; decision rights; accountabilities and methods of enforcement. DG enables better decision making and protects the needs of stakeholders. It reduces operational friction and encourages the adoption of common approaches to data issues. Data governance also helps build standard, repeatable processes, reduce costs and increase effectiveness through coordination of efforts and by enabling transparency of processes. It is governed by the principles of integrity, transparency and auditability (DGI, 2015a).
DG also delineates decision domains, i.e. what decisions must be made to ensure effective management and use of the organization’s assets. It also defines the locus of accountability for decision making by defining who is entitled to make decisions in a given organization, and who is held accountable for the decision making related to data assets (Khatri and Brown, 2010; Weill and Ross, 2004). Seiner (2014: 2) adds to this that valid data governance may require identifying ‘people who informally already have a level of accountability for the data they define, produce and use to complete their jobs or functions’. One of the reasons for this is that correct and efficient governance depends as much on technology as on organizational culture, despite the fact that good governance technology makes data transparent, gives it accountability and helps identify areas where performance can be improved (ORACLE, 2015).
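The idea of decision domains and a locus of accountability can be made concrete with a small register. The following is only a minimal sketch under stated assumptions: the class names, role names and example entries are hypothetical illustrations and are not taken from Khatri and Brown (2010) or any other cited source.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple


@dataclass(frozen=True)
class DecisionRight:
    """One entry of a hypothetical decision-rights register."""
    domain: str       # decision domain, e.g. "data quality"
    action: str       # the decision that has to be made
    decided_by: str   # role entitled to make the decision
    accountable: str  # role held accountable for the outcome


@dataclass
class GovernanceRegister:
    rights: List[DecisionRight] = field(default_factory=list)

    def add(self, right: DecisionRight) -> None:
        self.rights.append(right)

    def who_decides(self, domain: str, action: str) -> Optional[str]:
        """Return the role entitled to decide, or None if nobody is assigned."""
        for right in self.rights:
            if right.domain == domain and right.action == action:
                return right.decided_by
        return None

    def unassigned(self, required: List[Tuple[str, str]]) -> List[Tuple[str, str]]:
        """List required (domain, action) pairs that have no decision right yet."""
        covered = {(r.domain, r.action) for r in self.rights}
        return [pair for pair in required if pair not in covered]


# Illustrative entries only; the roles are assumptions, not prescriptions.
register = GovernanceRegister()
register.add(DecisionRight("data quality", "set accuracy standard",
                           "data governance council", "data steward"))
register.add(DecisionRight("data access", "approve sharing request",
                           "principal investigator", "research office"))

required = [("data quality", "set accuracy standard"),
            ("metadata", "choose schema")]
gaps = register.unassigned(required)  # decisions that nobody is entitled to make
```

Listing the uncovered decisions, rather than only looking up existing ones, reflects the point that governance first has to establish which decisions must be made at all.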
Accountabilities, the main components of which are stewardship and standardization, are defined in a manner that introduces checks and balances between different teams, between those who create and collect information, others who manage it, those who use it, and those who introduce standards and compliance requirements (DGI, 2015b).
As stewardship appears in this list and is also present in several resources related to research data management (Bailey, 2015), and because it is sometimes used interchangeably with DG, some clarification is needed. Data stewardship is concerned with taking care of data assets that do not belong to the stewards themselves, thus data stewards represent the concerns of others, and ensure that data-related work is performed according to policies and practices as determined through governance. In contrast, data governance is an overall process that brings together cross-functional teams (including data stewards and/or data governors) to make interdependent rules or to resolve issues and to provide services to data stakeholders (Rosenbaum, 2010).
To be successful, data governance needs to have clear definitions of its objectives, processes and metrics. It has to create its own processes and standards. Besides defining the responsibilities attached to all data governance roles, communities of practice for governance, stewardship and information management have to be established. Change management processes also have to be instituted, and – last but not least – there have to be rewards for good data governance behaviour.
Data governance should not be optional, because it contributes to organizational success through repeatable and compliant practices. In the sense of managing, monitoring and measuring different aspects of an organization, governance can be related to managing information technology, people and other tangible resources. Data is everywhere, thus DG runs horizontally. Definitions of the data and how to use it are part of the data management process, while integrating data into the organization and establishing individuals to oversee the administration of data processes pertain to data governance. DG also must include metadata, unstructured data, registries, taxonomies and ontologies (Smith, 2007).
The traditional principles of DG also apply to big data. From among big data types, data from the Web and from social media, as well as machine-to-machine data deserve special attention. Big data governance is especially important in regard to the acceptable use of data (Soares, 2012). In environments where big data plays a substantial role, one of the most common data integration mistakes is underestimating data governance (ORACLE, 2015). Although big data integration differs from traditional data integration by many factors (Dong and Srivastava, 2013), it demonstrates the complexity and importance of data governance. Data integration itself can be defined as the combination of technical and business processes used to combine data from disparate sources into meaningful and valuable information. It helps to understand, cleanse, monitor, transform and deliver data, thus it supplies trusted data from a variety of sources (IBM, 2016). Data integration solves the problems related to combining data of varied provenance by presenting a unified view of these data (Lenzerini, 2002).
As Sarsfield (2009) put it, DG is like an elephant in a dark room. It can be perceived depending on where you touch it. If you touch its tail, it feels like a snake. If you touch one of its legs, it feels like a tree. Therefore, cross-functional perspectives on data governance vary, and we will take this variability into consideration to couple it with data quality and data literacy.
In research settings, the stakeholders of DG are researchers, research institutions, funders, publishers and the public at large. A good understanding of data governance also addresses researchers’ fear of lost rights and benefits. Governance structures are needed for managing human subjects-related data as well, because taking care of sensitive information requires not only establishing standards and norms of practice, but fostering culture change towards better data stewardship (Hartter et al., 2013). In addition to these functions, data governance in this environment enables proper access and sharing (Riley, 2015), even if data ownership is often ambiguous, because if someone has a stake in research data, it does not mean that they are owners of that data (Briney, 2015). Many DG skills, such as dealing with licensing terms and agreements, as well as knowledge about copyright are already possessed by librarians (Krier and Strasser, 2014).
Altogether, data governance is the starting point for managing data. A formal data governance program has to provide answers to questions, such as the availability and access possibilities, provenance, meaning and trustworthiness. As a shared responsibility among all constituents of an institution, it is also required to provide coordinated, cross-functional approaches and to facilitate best practices. It both prevents the misuse of institutional data assets and encourages more effective use of these same data assets by the institution itself (ECAR, 2015). Being knowledgeable about data governance’s nature is foundational for RDSs and well-developed data governance is one of the necessary conditions for open data (Weber et al., 2012), even though it is also one of the most challenging issues of data sharing (Krier and Strasser, 2014).
Data governance and managing data quality
Data governance also ‘guarantees that data can be trusted and that people can be made accountable for any adverse event that happens because of poor quality’ (Sarsfield, 2009: 38). In a similar vein, Khatri and Brown (2010) underline that governance includes establishing who in the organization holds decision rights for determining standards for data quality. Data management involves determining the actual criteria employed for data quality, while DG is about designating who should make these decisions. According to Seiner (2014), DG formalizes not only behaviour related to the definition, production and usage of data, but also its quality. Similarly, a White Paper by Information Builders (2014) emphasizes that data governance is a critical component of any data quality management strategy. Another White Paper, titled Successful Information Governance through High-Quality Data, underlines that the success of an information governance program depends on robust data quality that can be achieved if we reduce the proliferation of incorrect or inconsistent data by continuous analysis and monitoring (IBM, 2012).
Data quality is one of the cornerstones of the data-intensive paradigm of scientific research. This is true, even if it is difficult to appraise data, because appraisal requires deep disciplinary knowledge and manually appraising data sets is very time consuming and expensive, while automated approaches are in their infancy (Ramírez, 2011). In the academic sphere, the problem of data quality has been relatively well elaborated, thus an exhaustive further treatment of it is not needed. Nonetheless, let us repeat its most notable factors, which are availability and discoverability, trust and authenticity, acceptability, accuracy (comprising correctness and consistency), applicability, integrity, completeness, understandability and usability (IBM, 2012). It is also clear that research data services offered by academic libraries could play a critical role as data quality hubs on campus, by providing data quality auditing and verification services for the research communities (Giarlo, 2013). While caring for the availability of data would be a self-explanatory requirement, directed towards data librarians, being knowledgeable about the ways to assess the digital objects’ authenticity, integrity and accuracy over time would also be useful (Madrid, 2013). More recently, Zilinski and Nelson (2014) identified some other factors of data quality, namely coverage and relevance to the given research question, and format, comprising fields and units used, naming conventions, dates of creation and update. They also direct our attention to a set of quality control attributes that are akin to data governance. These attributes answer the question of whether quality control is explicitly outlined by examining who is in charge of checking for quality and what processes they use.
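The format-related factors and quality control attributes named above lend themselves to a simple automated check of a dataset's metadata record. The following is a minimal sketch only: the field names and the example record are hypothetical and do not reproduce any instrument from Zilinski and Nelson (2014).

```python
# Hypothetical field names, loosely following the factors listed above:
# units used, naming conventions, dates of creation and update, and who
# checks quality by which process.
REQUIRED_FIELDS = (
    "units",
    "naming_convention",
    "created",
    "updated",
    "quality_contact",
    "quality_process",
)


def missing_quality_metadata(record: dict) -> list:
    """Return the required fields that are absent or empty in a metadata record."""
    return [
        name
        for name in REQUIRED_FIELDS
        if not str(record.get(name) or "").strip()
    ]


# Invented example record: the update date is empty and no quality
# control process is documented.
record = {
    "units": "degrees Celsius",
    "naming_convention": "site_year_variable",
    "created": "2014-03-01",
    "updated": "",
    "quality_contact": "lab data steward",
}

gaps = missing_quality_metadata(record)
```

A check of this kind only establishes that quality-related metadata is present. Whether the documented values are correct still requires the disciplinary appraisal described above.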
Successful data governance depends not only on provisions related to roles in general, but also on responsibilities connected with appropriate data standards and managed metadata environments (Smith, 2007). Therefore, managing metadata is one of the key quality-related processes of data governance, because it enables – among other things – documenting the provenance of data, which helps to secure its quality (ORACLE, 2015).
Data governance, data quality and data literacy
To illustrate the importance of appropriate DG, we can take the case study presented by Soares (2012) about the unfortunate events surrounding the Mars Climate Orbiter. In 1999, a navigation error directed the Orbiter into a trajectory 170 kilometres lower than the intended altitude above Mars, because NASA’s engineers used English units (pounds) instead of the NASA-specified metric units (newtons). This relatively minor mistake resulted in a huge miscalculation in orbital altitude and in the loss of $328m. Had appropriate attention been paid to data governance principles and to the actual details, and had data literacy skills been mobilized, this accident could have been avoided.
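The kind of safeguard that a unit standard implies can be illustrated in a few lines. The following is a minimal sketch and not a reconstruction of any NASA software: the class, the unit labels and the sample values are hypothetical, while the conversion factor (1 pound-force = 4.4482216152605 newtons) is the exact standard value.

```python
from dataclasses import dataclass

# Exact conversion factor: 1 pound-force = 4.4482216152605 newtons.
LBF_TO_N = 4.4482216152605


@dataclass(frozen=True)
class Force:
    """A force value that always carries its unit (hypothetical helper)."""
    value: float
    unit: str  # "N" (newtons) or "lbf" (pounds-force)

    def to_newtons(self) -> float:
        if self.unit == "N":
            return self.value
        if self.unit == "lbf":
            return self.value * LBF_TO_N
        # Refuse to guess: an undeclared unit is a governance failure.
        raise ValueError(f"unknown unit: {self.unit!r}")


def total_newtons(readings) -> float:
    """Sum readings only after converting each one to the agreed unit."""
    return sum(reading.to_newtons() for reading in readings)


# Invented values: the same number means different things in different units.
readings = [Force(10.0, "N"), Force(10.0, "lbf")]
total = total_newtons(readings)

# Adding the bare numbers ignores the units and gives a misleading result.
naive_total = sum(reading.value for reading in readings)
```

Here the governance rule (all forces are reported in newtons) is enforced at the point where data from different producers is combined, which is where an unlabelled number would otherwise pass unnoticed.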
Even though data literacy is going through a gestation period (Carlson and Johnston, 2015), being data literate is beginning to be widely accepted as a crucial ability for information professionals involved in supporting data-intensive research (Koltay, 2015b; Qin and D’Ignazio, 2010; Schneider, 2013). On the other hand, the terminology in the field of data literacy is still not standardized. There is science data literacy (Qin and D’Ignazio, 2010) and research data literacy (Schneider, 2013). Carlson et al. (2011) argue for data information literacy because – according to their approach – it differs from a more restricted meaning of data literacy, i.e. the ability to read graphs and charts appropriately, drawing correct conclusions from data, and recognizing when data is being used in misleading or inappropriate ways. In the following, naming differences will be disregarded, and we will opt for the term data literacy, first of all because this term is simple and straightforward (Koltay, 2015a), while it does not seem to have the limitation mentioned by Carlson et al. (2011). Besides this, while the terms differ, definitions and competence lists show convergence. If we look at the development of data literacy’s definitions, we can see that Fosmire and Miller (2008) spoke simply about information literacy in the data world. Two years later, data literacy was defined plainly as the ability to understand, use and manage data (Qin and D’Ignazio, 2010). According to Calzada Prado and Marzal’s (2013) definition, data literacy enables individuals to access, interpret, critically assess, manage, handle and ethically use data.
As mentioned above, Johnson (2012) described data literacy in much more detail, defining it as the ability to process, sort and filter vast quantities of information, which requires knowing how to search, how to filter and process, to produce and synthesize. It is clear that these attributes are basically identical to the characteristics of information literacy as they appear in the well-known and widely accepted definition of information literacy, which comprises the abilities to recognize information need, identify, locate, evaluate, and use information to solve a particular problem (ALA, 1989). Nonetheless, it has to be added that – while information literacy seems essentially to enable us to efficiently process all types of information content (Badke, 2010) – the community of practice for data librarians differs from that of information literacy (Carlson and Johnston, 2015).
As to the similarities to information literacy, it has to be added that several authors emphasize it. The Australian and New Zealand Information Literacy Framework, edited by Alan Bundy (2004) states that information literate persons obtain, store and disseminate not only text, but data as well. Andretta et al. (2008) identified presenting, evaluating and interpreting qualitative and quantitative data as a learning outcome of information literacy. According to Hunt (2004), data literacy education should borrow heavily from information literacy education, even if the domain of data literacy is more fragmented than the field of information literacy. Schneider (2013) also defined data literacy as a component of information literacy.
Both the SCONUL (2011) Seven Pillars of Information Literacy model and the information literacy lens on the Vitae Researcher Development Framework (Vitae, 2011) stress that to identify which information could provide the best material to answer an information need, finding, producing and dealing with research data is important, as information literacy today encompasses not only published information but also the underlying data. This is in accordance with a broader interpretation of information literacy, which recognizes that the concept of information includes research data (RIN, 2011). Carlson et al. (2011) underline that expanding the scope of information literacy to include data management and curation is a logical development. Si et al. (2013) state that data-related services should be supported by professionals with excellent information literacy skills.
Although he does not refer to data literacy, Wang (2013) mentions that reference librarians frequently conduct information literacy sessions that educate users about the existing data resources for their specific study areas.
Calzada Prado and Marzal (2013) state that information literacy and data literacy form part of a scientific-investigative educational continuum, a gradual process of education that begins in school, is perfected and becomes specialized in higher education, and becomes part of lifelong learning. When suggesting a new framework for data literacy education, Maybee and Zilinski (2015) also point towards the close relationship between information literacy and data literacy.
Beyond definitions, by applying and analysing several information literacy standards, Calzada Prado and Marzal (2013: 126) identified a number of abilities, some of which clearly show their origin in the best-known definition of information literacy (ALA, 1989) and the Information Literacy Competency Standards for Higher Education (ACRL, 2000): determining when data is needed; accessing data sources appropriate to the information needed; recognizing source data value, types and formats; critically assessing data and its sources; knowing how to select and synthesize data and combine it with other information sources and prior knowledge; using data ethically; applying results to learning, decision making or problem solving.
They also emphasize the ability to identify the context in which data is produced and reused. By mentioning these two main components of the data lifecycle they are in line with contemporary views of information literacy that incorporate the understanding of how information is produced (ACRL, 2013).
Mandinach and Gummer (2013) identify data literacy as the ability to understand and use data effectively to inform decisions. With this, they give weight to data literacy’s role in supporting decision making. Therefore, they bring data literacy closer to data governance, recognizing that it may be tied to the world of business.
Data literacy, as it is understood by the Association of College and Research Libraries, focuses on understanding how to find and evaluate data, giving emphasis to the version of the given dataset and the person responsible for it, and does not neglect the questions of citing and ethical use of data (ACRL, 2013).
Taking all these definitions together, data literacy can be defined as a specific skill set and knowledge base, which empowers individuals to transform data into information and into actionable knowledge by enabling them to access, interpret, critically assess, manage and ethically use data (Koltay, 2015a).
Searle et al. (2015) identify data literacy as one of the RDS activities that support researchers in building the skills and knowledge required to manage data well. Therefore, we can say that data literacy is related to practically all processes that are covered by RDSs and that build the main framework of libraries’ involvement in supporting the data-intensive paradigm of research (Tenopir et al., 2014). RDSs are undoubtedly comprehensive, thus covering their aspects makes data literacy overarching and comprehensive.
When taking the closeness of data literacy to information literacy into consideration, it is intriguing to contemplate if there is such a thing as generic information literacy.
According to Carlson et al. (2011), data information literacy programs have to be aligned with current disciplinary practices and cultures. A bibliometric study by Pinto et al. (2014) shows that information literacy in both the health sciences and the social sciences has its own specific ‘personality’. In general, newer approaches to information literacy underline that information is used in different disciplinary contexts (Maybee and Zilinski, 2015). In this context, the case of chemical information literacy is especially interesting. Bawden and Robinson (2015) examined its history and found that – while chemical information literacy contains some generic elements – it is more strongly domain specific than any other subject. As Farrell and Badke (2015) underline, in order to meet the demand of the information age for skilled handlers of information, information literacy education must become situated within the socio-cultural practices of disciplines by an expanded focus on epistemology and metanarrative. Truly situated information literacy will therefore require that librarians or disciplinary faculty invite students into disciplines. Therefore, information literacy has to be understood as information practices belonging to a discipline.
Data literacy skills are also regarded as discipline specific (Carlson and Johnston, 2015). As to the required skills and abilities, data literate persons have to know how to select and synthesize data and combine it with other information sources and prior knowledge. They also have to recognize source data value and be familiar with data types and formats (Calzada Prado and Marzal, 2013). Other skills include knowing how to identify, collect, organize, analyse, summarize and prioritize data. Developing hypotheses, identifying problems, interpreting the data, and determining, planning, implementing, as well as monitoring courses of action also pertain to required skills and add the need for tailoring data literacy to specific uses (Mandinach and Gummer, 2013).
Ridsdale et al. (2015) set up a matrix of data literacy competencies with the intention of fostering an ongoing conversation about standards of data literacy and learning outcomes in data literacy education. Perhaps the most important activity in this matrix is quality evaluation, which includes assessing sources of data for trustworthiness and for errors or problems. Evaluation appears already when we collect data, and data interpretation clearly shows the mechanisms that also characterize information literacy. Even data visualization comprises evaluating and critically assessing graphical representations of data.
A pilot data literacy program offered at Purdue University was built around the following skills: planning; lifecycle models; discovery and acquisition; description and metadata; security and storage; copyright and licensing; sharing; management and documentation; visualizations; repositories; preservation; publication and curation (Carlson and Stowell Bracke, 2015).
The fact that data quality plays a distinguished role in data literacy is also demonstrated by Carlson et al. (2011), who compiled the perspectives of both faculty and students. Generally, faculty in this study expected their graduate students to be able to carry out data management and handling activities. Both major responsibilities and deficiencies in data management of graduate students included quality assurance. Quality assurance is seen as a blend of technical skills that materializes in familiarity with equipment, disciplinary knowledge and a metacognitive process that requires synthesis. Even though partly superseded by the Framework for Information Literacy for Higher Education (ACRL, 2015), data literacy can be seen through the prism of the Information Literacy Competency Standards for Higher Education (ACRL, 2000). Standard 3 of these Standards (Evaluate information critically) contains the requirement of understanding and critically assessing sources by determining if the given data is reputable and/or if the data repository or its members provide a level of quality control for its content.
As mentioned above, managing metadata is one of the key quality-related processes of data governance. At the same time, the appraisal of metadata is part of quality assurance that should be included in data literacy programs. Quality assurance in this context comprises utilising metadata to facilitate understanding of potential problems with data (Ridsdale et al., 2015).
Data literacy education has a dual purpose. The first one is rather self-explanatory, i.e. to ensure that students, faculty and researchers become data literate science workers. As Carlson and Johnston (2015) underline, we must raise awareness of data literacy among faculty, students and administrators by sending clear messages that address our stakeholders’ needs. Some of these messages could have their roots in business environments. Conveying corporate messages may even strengthen the credibility of such messages. The second goal is to educate information professionals (Qin and D’Ignazio, 2010; Schneider, 2013).
Imparting data literacy to faculty is hampered by the circumstance that educating them is a delicate issue. As Duncan et al. (2013) pointed out, faculty members rarely like to hear that they are doing something in the wrong way. Exner (2014) also confirms that it is not easy to reach faculty, especially if we do not understand their lives properly. Faculty members are busy, and being experts in their fields, they usually require different approaches to instruction than students (Carlson and Johnston, 2015).
Conclusion
Although familiarity with data governance has not received a lot of attention in academia, it brings substantial knowledge to the work of the data librarian. Despite differences between them, both data governance and data literacy are indispensable for managing data quality, thus – by their overarching nature – making use of them is a prerequisite of effective and efficient data management that substantiates research data services.
Making use of the lessons learnt from data governance could substantially enhance the effectiveness of research data management processes in academic libraries. The reasons for this are manifold. First, in delineating decision domains and defining accountability for decision making, applying practices adopted from data governance can improve data management in the library. Second, data governance is a service that is based on standardized, repeatable processes and is designed to enable the transparency of data-related processes and cost reduction, thus it can also be used in the academic library. Third, it refers to rules, policies, standards; decision rights; accountabilities and methods of enforcement. Therefore, it would serve as a pragmatic addition to already existing data quality principles, practices and tools of the library. Fourth, the practice of data governance can also be helpful in managing change and negotiating big data issues.
These lessons can speak for themselves and may be built into data literacy programs. It is important for the library profession to take this challenge seriously and acquire the skills needed to provide effective data literacy education, irrespective of the fact that its competencies extend beyond the knowledge and skills of a typical librarian, or a faculty member. Paying attention to the management of data quality (also taking data governance into consideration) is an important step towards making all our target audiences accept the library’s mission to provide research data services and to offer these services to their full satisfaction.
Footnotes
Declaration of conflicting interests
The author(s) declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.
Funding
The author(s) received no financial support for the research, authorship, and/or publication of this article.
