Abstract
The paper reports on an exploratory study of researchers’ needs for effective research data management at two Swedish universities, conducted in order to inform the ongoing development of research data services. Twelve researchers from diverse fields were interviewed, including biology, cultural studies, economics, environmental studies, geography, history, linguistics, media and psychology. The interviews were structured and guided by the Data Curation Profiles Toolkit developed at Purdue University, with added questions regarding subject metadata. The preliminary analysis indicates that research data management practices vary greatly among the respondents, and therefore so do the implications for research data services. The responses to the added questions on subject metadata indicate a need for services that guide researchers in describing their data sets with adequate metadata.
Introduction
In recent years there has been extensive discussion of scholarly research data and of the role of academic libraries and research institutions in providing services to support research data management (see, for example, Borgman, 2015). Libraries that do not yet offer support for research data may be at the stage of developing such services, as is the case at the two Swedish universities in focus in this paper. The libraries of Linnaeus University and Lund University have chosen to first identify the characteristics, requirements, needs and related issues of managing the different types of research data produced by researchers employed by the respective universities, which would then serve as a foundation for creating the most appropriate research data management services. The study is based on interviews with researchers, following the Data Curation Profiles (DCP) Toolkit (Carlson and Brandt, 2014). Based on the toolkit, a data curation profile is created for each of the interviewed researchers. The profile may be used by the researcher for research data management (RDM), and by libraries in developing support for RDM.
The outline of the paper is as follows. In the next section (Background), the context of research data services provided by libraries as seen in professional discussions, project reports and research literature is provided. This is followed by a section (Methodology) on the methods of the study. The results are presented and analysed in the next section (Results). The last section (Discussion) concludes the paper with a discussion on the implications of the results for developing research data services.
Background
Research data management and services in Sweden
In recent years several RDM projects and events have taken place at Swedish universities, many of them initiated by academic libraries and archives. Much of the focus has been on data sharing and open data, and lately also on preservation and archiving, in particular through e-archiving projects.
Several projects have investigated researchers’ experiences and attitudes regarding research data management. In 2007 libraries and archives at University of Gothenburg, Lund University and the Swedish University of Agricultural Sciences undertook a joint pre-study with the aim of exploring research data in open archives and university archives. The study comprised a survey on researchers’ attitudes towards publishing research data and looked specifically at the future roles of open archives and university archives (Björklund and Eriksson, 2007). The Swedish National Data Service (SND) performed a major study in 2008–2009 on practices of open access to and reuse of research data. The study comprised two surveys targeting professors and PhD students working at Swedish universities. The survey results pointed to the following: a number of barriers to sharing research data; unresolved issues regarding legal and ethical aspects; lack of resources to make research data available.
It concluded that researchers need to be trained in relation to research methods, digital research databases and accessible e-tools, as well as that funds should be made available for preparing research data to be shared and archived (Carlhed and Alfredsson, 2009).
SND participated in a similar project also involving university libraries at the University of Gothenburg, Lund University, Linköping University and Malmö University, focusing on researchers within Arts and Humanities. The participating libraries interviewed researchers about their attitudes towards publishing their research data. The project revealed a positive attitude in this respect, and identified the need for quality RDM systems and related professional support (Andersson et al., 2011).
In 2014 the Lund University Library undertook an investigation of the conditions for RDM within Humanities and Social Sciences at the university. It both looked at the organizational aspects for RDM, and brought in researchers’ opinions through a small survey (Åhlfeldt and Johnsson, 2014).
The above-mentioned projects were all early, exploratory projects on RDM, with a focus on data sharing. The conditions for the present project, with interviews based on Data Curation Profiles, have been somewhat different, as the debate on open data and RDM has become more animated over the last year. With the Data Curation Profiles we have been able to study researchers’ experiences on a more individual basis and closer to their everyday work.
At the national level it is the Swedish Research Council that is the major stakeholder in RDM, active particularly in funding related research infrastructures in Sweden, such as SND, Environment Climate Data Sweden (ECDS), Bioinformatics Infrastructure for Life Sciences (BILS) and the Max IV Laboratory which provides access to synchrotron X-rays. At the request of the Swedish government, in 2015 the Swedish Research Council proposed a national policy for open access to scientific information, including publications and research data (Swedish Research Council, 2015). The Swedish government’s stance on the proposal is expected to be made clear in the fall of 2016 when the research bill is anticipated. The policy aims to ensure open access to all Swedish scientific information that has been fully or partially funded by public funds, by the year 2025. In Sweden, research data are owned by the academic or research institution at which the researcher is employed (see Bohlin, 1997), and not by the researchers themselves as is the case in many countries, such as Finland. However, this is not very well known by Swedish researchers; they regard the data they have collected as their own (Swedish Research Council, 2015). The proposal suggests a model in which the responsibility for archiving and providing access would lie with different organizations. Academic and research institutions would have responsibility for archiving research data, and a national facility would be responsible for making the research data available by linking to the research data archived at the academic and research institutions.
In spite of the absence of an established national policy, there is a relatively strong spirit of professional development in RDM at Swedish universities and research infrastructures, as is particularly evident from the considerable number of events aimed at RDM research and training. The primary player is the Swedish National Data Service (SND). Apart from arranging training events in research data management, SND is also coordinating pilot projects on RDM at several Swedish universities during 2016.
Development of research data services – Linnaeus University
Research data services at Linnaeus University (LNU) have been in the planning stage since 2015, when a librarian working as a research strategist at the university library was chosen to serve as the representative for SND. Apart from taking part in SND’s training events, the LNU library has liaised with the Digital Humanities Initiative at LNU (Linnaeus University, 2016) since February 2016. The Humanities Data Curation pilot project was envisioned as one of the original eight pilot projects planned as part of this initiative. It was to be conducted by Digital Humanities researchers in collaboration with the LNU Library’s SND representative and the Lund University Library, with a particular focus on the survey reported here. The plan for further steps is under development.
Development of research data services – Lund University Library
At the Lund University Library, the Department of Scholarly Communication addresses RDM and the development of related services. Lund University Library is one of 26 libraries in the network of Lund University Libraries. Whereas most libraries provide a service to a particular faculty, department or centre, the University Library provides a service to the entire University (Lund University Libraries, 2016). There are a number of RDM-related projects coming up at Lund University, and library staff are involved in several of them. The university hosts several research infrastructures which are working in different ways with questions of data sharing, data preservation, etc. During the fall of 2015 Lund University Library conducted a survey of the university faculties regarding their experiences of and attitudes to RDM (Johnsson and Lassi, 2016). The survey, in which faculty managers and library staff were interviewed, showed that the respondents expect RDM to become more important in the future, and to become a part of researchers’ work tasks in a more structured way than at present. The results of the survey also showed the diversity of research data generated within a full university, and confirmed the need to investigate the characteristics of different types of research data within different fields. The Lund University Library is also involved in projects of research infrastructures at the university, such as the ICOS Carbon Portal (ICOS Carbon Portal, 2016).
Subject metadata in research data
Naming and organizing data and relationships among them can have ‘profound effects on the ability to discover, exchange, and curate data’ (Borgman 2015: 65). Standardized metadata schemes increase these abilities. While in many scientific areas metadata schemes are well used (although there are still gaps in a number of them), the question is raised here about the degree to which subject metadata are standardized. Subject searching (searching by topic or theme) is one of the most common and at the same time the most challenging type of searching in information systems. Subject index terms taken from standardized knowledge organization systems (KOS), like classification systems and subject headings systems, provide numerous benefits compared to free-text indexing: consistency through uniformity in term format and the assignment of terms, provision of semantic relationships among terms, support of browsing by provision of consistent and clear hierarchies (for a detailed overview see, for example, Lancaster 2003). In terms of cross-searching based on different metadata schemas using subject terms from different KOS, challenges like mapping across the KOS need to be addressed in order to meet the established objectives of quality controlled information retrieval systems like those provided by libraries.
Libraries provide quality subject access to resources described in library catalogues through, for example, classification schemes and subject headings; however, when it comes to research data, there seems to be a lack of controlled subject terms in metadata schemes. An exploratory analysis of 36 disciplinary metadata standards from the list provided by the Digital Curation Centre (2016) in the UK shows that 18 (50%) of them do not provide any subject metadata field. Of those that do, only a few offer clear guidelines on the use of controlled vocabulary, while the others leave the subject as a freely added keyword.
Examples of the former include the following. SDAC (Standard for Documentation of Astronomical Catalogues) provides three categories of keywords: (1) those given in the printed publication, (2) controlled ADC keywords (taken from controlled sets), and (3) mission name, a header which precedes the satellite name for data originating from a satellite mission. FGDC/CSDGM (Federal Geographic Data Committee Content Standard for Digital Geospatial Metadata) supports two categories of subject metadata: (1) theme keyword thesaurus, a formally registered thesaurus or a similar authoritative source of theme keywords, and (2) theme keyword, a common-use word or phrase used to describe the subject of the data set. The ClinicalTrials.gov Protocol Data Element Definitions advise, for the keyword element, the use of words or phrases that best describe the protocol, with a note to use Medical Subject Headings (MeSH) where appropriate.
Several of the others use Dublin Core’s dct:subject element, which, as in the previous examples, allows free keywords while recommending the use of a controlled vocabulary.
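The distinction between controlled subject terms and free keywords can be made concrete with a small sketch. The Python below is only an illustration under stated assumptions: the class names, the record title, the subject labels and the way the vocabulary name is prefixed to the value are all hypothetical and are not taken from any of the standards discussed, and the labels have not been checked against the vocabulary named.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple


@dataclass(frozen=True)
class SubjectTerm:
    """One subject value for a data set (hypothetical structure)."""
    label: str
    # Name of the controlled vocabulary the label is taken from;
    # None marks a free keyword.
    vocabulary: Optional[str] = None

    @property
    def is_controlled(self) -> bool:
        return self.vocabulary is not None


@dataclass
class DatasetRecord:
    """Minimal data set description holding only a title and subjects."""
    title: str
    subjects: List[SubjectTerm] = field(default_factory=list)

    def to_dct_subject(self) -> List[Tuple[str, str]]:
        """Flatten every subject into an (element, value) pair."""
        pairs = []
        for term in self.subjects:
            if term.is_controlled:
                value = f"{term.vocabulary}: {term.label}"
            else:
                value = term.label
            pairs.append(("dct:subject", value))
        return pairs

    def controlled_share(self) -> float:
        """Fraction of subject values that come from a controlled vocabulary."""
        if not self.subjects:
            return 0.0
        controlled = sum(1 for t in self.subjects if t.is_controlled)
        return controlled / len(self.subjects)


# Illustrative record: one controlled term and one free keyword.
record = DatasetRecord(
    title="Example interview study",
    subjects=[
        SubjectTerm("Data Curation", vocabulary="MeSH"),
        SubjectTerm("research data services"),
    ],
)

for element, value in record.to_dct_subject():
    print(element, "=", value)
```

Keeping the vocabulary name alongside each label is one possible way of preserving the information needed for later mapping across KOS; a free keyword carries no such information.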
Methodology
The research questions guiding the study are: What are the respondents’ current practices of research data management? What are the respondents’ current practices of using subject metadata to describe their data? What are the implications for developing research data services?
The research questions have been investigated through structured interviews with researchers, during which they were asked to fill out the interview worksheet of the Data Curation Profiles Toolkit, while the session was audio recorded. The researcher was asked to respond to the questions in the interview worksheet with a specific project in mind on themes such as sharing, ingestion to a repository and organization and description of data. The study design is further elaborated below.
Data Curation Profiles Toolkit
In this exploratory study we used the Data Curation Profiles (DCP) Toolkit, developed through a research project conducted by the Purdue University Libraries and the Graduate School of Library and Information Science at the University of Illinois Urbana-Champaign (Witt et al., 2009). The DCP Toolkit is envisioned as an aid to starting discussions between librarians and faculty and to planning data services that directly address the needs of researchers. The profiles are intended to give information about a particular data set and about the researcher’s practices in curating that data set.
The method consists of an interview template with questions concerning RDM of a specific data set. The interviewee is the researcher who has created the data set and the interviewer is a librarian or a member of a related profession. Also included in the DCP Toolkit are guidelines and instructions to the interviewer as well as instructions to the interviewee. The interviews are to result in data curation profiles for specific research areas or projects, and could be useful both to research support staff and to the researchers themselves.
The DCP template comprises 13 modules on RDM of a specific data set. The questions cover the nature of the specific data set, such as form, format, size, and bring up aspects like sharing, archiving, discovery, organization and description, linking, interoperability, and measuring impact. They are formulated in a generic way in order to allow their use in all kinds of scientific areas. In order to address metadata and subject access in particular, four questions were added to the interview worksheet in Module 6 – Organization and Description of Data, each followed by a 5-point Likert scale ranging from ‘not a priority’ to ‘high priority’, followed by the ‘I do not know’ option (please see Appendix for the modified version of the list of questions in Module 6): The ability to apply standardized subject classification to the data set. (Please list relevant controlled vocabularies next to this table); The ability for automatic suggestions of keywords for subject classification; The ability to add your own tags to the data set; and The ability to connect the data set to the keywords of the publications that are based on it.
After the development and launch of the DCP Toolkit, in 2011 about a dozen workshops on how to use the toolkit were held across the USA. As a result, several libraries have constructed data curation profiles which have been published in a public directory of the DCP (Carlson and Brandt, 2015).
In the years since 2011 several projects have used the DCP in the development of RDM services (Bracke, 2011; Brandt and Kim, 2014; Carlson and Bracke, 2013; McLure et al., 2014; Wright et al., 2013). The Cornell University Library used DCP when developing a research data registry at the university (Wright et al., 2013). The library performed eight interviews with researchers in a wide range of subject areas. In spite of considerable variety in researchers’ priorities, the project team was able to discern similarities among their needs. In another project at Purdue University, DCP was used to investigate the needs for data curation for researchers within agricultural science (Bracke, 2011). The project was focused on establishing the role which the subject librarian in Agricultural Science could take in data curation. The project performed interviews based on the DCP, and the identified data sets were put in a prototype data repository.
One project that used selected parts of the DCP was conducted at the Library of Colorado State University (McLure et al., 2014). The project team conducted a number of focus group interviews with researchers in order to explore what kind of data sets researchers had, how they managed their data, and what kind of support they would need in terms of RDM. Methodologically, the team also investigated the feasibility of adapting the DCP to focus groups. The findings showed that focus groups may be very useful for investigating general conditions for RDM and for spotting trends and behaviours among researchers. On the other hand, conducting individual interviews using the DCP gives more specific, granular detail on researchers’ data management (McLure et al., 2014).
Data collection
Data was collected through structured interviews guided by the DCP Toolkit’s interview worksheet, described above. The DCP interview template was translated into Swedish for the Swedish-speaking researchers at Lund University. At Linnaeus University all interviews were held in English, because one of the interviewers there was not a native Swedish speaker, but the respondents were allowed to reply in Swedish if they preferred. The respondents were recruited through email advertisements, personal contacts with researchers and a faculty survey (the latter only at Lund University).
In total 12 interviews were conducted – five at Linnaeus University and seven at Lund University. The interviews were conducted between 28 January and 28 April 2016 and lasted between 46 and 119 minutes. Respondents were at different stages of their research career, ranging from PhD candidate to professor. Further, the respondents were active in a wide range of research areas, such as archaeology, biology, business administration, film studies, library and information science, and media and journalism. In this exploratory study we could not cover all disciplines, e.g. informatics and chemistry. A follow-up study with a larger sample size would be able to cover more disciplines.
At the start of each interview, the respondents were presented with information about the study and the interview session, and were asked to read through and sign an informed consent form. Following the research ethics guidelines of the Swedish Research Council (2011) the informed consent form stated that all answers would be anonymized and that no risks of participating in the study could be predicted. Other information in the consent form included the aims of the study and the explanation that the data collected might be used for scientific publishing and in other studies, in which case all data would be anonymized.
The respondents were asked to fill out the DCP interview worksheet during the interview session, which was also audio recorded. The audio recordings served as an aid for the interviewers to capture any explanations or discussions that arose from the interview worksheet questions that might not have been written down in the worksheet. The interview worksheets were scanned and stored along with consent forms and audio recordings in an online collaborative environment, with access allowed only to the three authors of the paper.
Note that the DCP Toolkit recommends that two sessions be conducted with each researcher in order to allow time to learn from the discussions that arise from the interviews. In the reported study only one session per respondent was scheduled, in order to reduce the risk that potential respondents would decline the invitation because participation would take too much of their time.
Data processing
The responses from the interview worksheet were transferred into MS Excel. The respondents were anonymized in the MS Excel spreadsheet in order to facilitate the reuse of the data set. The audio recordings were used to take notes of statements that could be of interest in relation to the research questions, the statements that further explain something from the interview worksheet or those that were interesting examples of situations concerning RDM.
For each of the respondents at Lund University, a data curation profile was created based on the instructions provided by the DCP Toolkit. The audio recordings contributed to creating a set of recommendations tailored to each respondent concerning their current and future RDM practices. The recommendations section is a modification of the original data curation profile. The data curation profile was delivered to the researcher with an invitation to a follow-up meeting on the DCP. We followed the detailed guidelines and instructions for constructing the data curation profiles from the collected data. Together with the audio recordings, the paper-based interview worksheets provided rich material for creating the data curation profiles.
Data analysis
Guided by the research questions, the data analysis began with a simple numeric analysis of the number of responses to each question. These results were complemented by analysing the relevant modules of the audio recordings, aiming to find any explanatory statements or comments on the questions and their responses. Whereas the numerical data analysis was conducted for all the data, in this paper we focused on Modules 3 (Sharing), 5 (Ingestion), 6 (Metadata) and 7 (Discovery) for the qualitative analysis, as together these were deemed to cover many, but not all, of the aspects of RDM.
Results
As stated, the respondents conduct research in a wide range of scientific research areas, using a wide range of research methods to collect, process and analyse data. Also, the study is based on a small number of respondents, 12 persons in total. Therefore, the results are presented to show this broad array of experiences and descriptions.
Research data sets
The tools used by the respondents to generate data varied greatly and included audio recorder (3), camera (3), questionnaires (2), pen and paper (2), sensors, field notes and a plethora of software tools. Most of the tools required to utilize the respondents’ data are proprietary software that requires a purchase to use, such as Excel (4/12), SPSS (2), NVivo (1), Word (1) and MATLAB (1). The R software environment and programming language and the Genome browser are two examples of tools used, available under licences allowing free use for academic purposes.
The DCP interview guide asked the respondents to identify the stages that their data had passed through during their research project. All respondents identified at least two stages, commonly three. The first stage typically concerns raw data collected or obtained by the respondents. The second stage typically concerns some type of processing, e.g. quality checking or cleaning the data, or preliminary analysis, e.g. first coding. In total 11 out of 12 persons declared they had a third data stage. In the third data stage the data are often in a second form of analysis format, and in this phase the researchers start posing their research questions to the data, or start with the deeper form of analysis. Of the 12 data sets four were bound by privacy or confidentiality agreements, whereas five were not, and one respondent was unsure.
Attitudes towards, and incentives for, sharing data
Of the 12 respondents, two had deposited the data discussed in the interview in a repository. Among the other 10 respondents, eight stated that they were willing to deposit their data, while two were hesitant to share their data. The hesitancy was motivated by the fact that the data was sensitive, comprising interviews, and that the researcher had promised their respondents confidentiality. Further, 11 of the 12 respondents stated that their data would probably be of interest to others, suggesting, for example, public libraries, media companies, teachers, study participants, activists and non-government organizations. When speculating about the use that others could make of their data, respondents suggested addressing new research questions, developing educational programmes, conducting statistical analysis, and informing policy making. As to citation, nine of the 12 respondents would require a citation or an acknowledgement when others used their data.
The first data stage of the data management cycle was commonly seen as the best stage to share, by 10 out of 12; of these, five would share the data with anyone, and the other five would prefer to share with immediate collaborators or others in their field. Whether the respondents would share the data at the second data stage differed: three respondents stated that they would not share at that stage, while nine stated that they would share, but predominantly with immediate collaborators or researchers in the same field (5). The hesitancy to share data at this stage could perhaps be explained by the processing or initial analysis done to the data, having moved from raw data to another state which could possibly be more difficult for others to understand or use. Of the 12 respondents eight indicated that they would share their data after they had published results.
As stated above, in Sweden research data are owned by the academic or research institution at which the researcher was employed at the time of the data collection. Of the 12 respondents five reported knowing that their employers owned the data, whereas the answers to this question among the other seven respondents varied to include the researcher (4), the research community (1) and the general public (1). The research funders of the data collected in the respondents’ studies generally do not require data to be shared or deposited in a repository (11/12), while one person stated that there was such a requirement but that it would be illegal to share the data due to privacy protection.
The ability to see usage statistics of their data sets had a high priority (5/12) or medium priority (5) for the majority of the respondents. The majority of the respondents indicated that the ability to gather information about the people who have used the data set would be of high (6/12) or medium priority (3), and a few that it would be of low priority (3). Suggestions of other metrics or analytics that could be of interest to the respondents include citations (4), seeing publications based on the data set, and the country, academic or research institution, and time at which a data set has been used.
Metadata
The respondents organized and described their data sets in a wide variety of ways, including metadata schemes, codebooks, a paper filing system using plastic folders ordered geographically and by topic, and tables in an MS Excel file. Two respondents used standardized metadata schemes: one scheme is provided by a Swedish national infrastructure, and one is a domain-specific taxonomy. One respondent reported using a particularly wide range of different descriptions to cover the needs of others who may be interested in the data, including a standardized metadata scheme coupled with their own classifications to describe instruments and parameters measured, technical specifications of, for example, how often measurements were conducted, and a verbal description of how the data had been collected and processed.
The majority (7/12) considered that the existing organization and description are sufficient for others to understand and use the data. One respondent noted that there is a risk that codebooks are too detailed and complex for someone else to read, which could impede the understanding and correct use of the data.
The ability to apply standardized metadata from the respondents’ own fields was considered important by most (high priority by 8/11, medium priority by 2, low priority by 1). The ability to apply standardized subject classification to the data set was also considered of medium to high importance (high priority 5/11, medium priority 5, low priority 1).
Most (10) respondents requested more information about what was meant by the question ‘[t]he ability to apply standardized subject classification to the data set’. This could indicate that the question is designed with library and information science jargon; while researchers may be required to add keywords and other subject classification when submitting publications, they might not know the name for this activity. Respondents who elaborated on their responses in the worksheet indicated that subject classification is important for others to find their data sets and that standardization of the classification is needed to ensure findability. Also, a respondent described the importance of standardized KOS in archaeology, as places have changed names throughout history.
The ability for automatic suggestions of keywords for subject classification was also deemed as high to medium priority (high priority 6/11, medium priority 3, low priority 2). Out of the six respondents who rated this as having a high priority, four stressed the importance of them being able to make the final decision as to which suggestions to add to the data set. One respondent noted that automatic suggestions of keywords would be a good idea to start to think about classification; they generally do not think about keywords before it is required to add them to a publication.
Ability to add one’s own tags was considered the top priority (high priority 9/11, medium priority 2) regarding subject metadata features. One respondent justified the high priority by explaining that the development of the research area moves quicker than the classification systems, so tagging as a complement to standardized subject classification would increase findability and show the data set’s uniqueness. Another respondent suggested that tags could be useful to inform others about anomalies that might be interpreted as errors but are actually just outliers.
Ability to connect the data set to the keywords of the publications based on it was also considered rather important (high priority 5/11, medium priority 3, low priority 1, N/A or I do not know 1). Among the more hesitant respondents, one responded that it would just be confusing, and another stated that not everything that can be automated needs to be automated for its own sake; choosing keywords oneself could add a level of reflection to the process. With a more positive attitude, one respondent stated that the importance of this ability is more or less self-evident.
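The simple numeric analysis behind the figures for the four added subject metadata questions can be sketched as follows. This is a minimal illustration, not the procedure actually used in the study: the counts are copied from the text above, while the short question labels, the function name and the choice to rank by the number of high priority answers are assumptions made for the example. The counts reported for the last question add up to 10 rather than the stated 11, so the sketch records each total instead of assuming it.

```python
# Counts as reported in the text for the four added questions.
# The short question labels are paraphrases introduced for this sketch.
REPORTED = {
    "standardized subject classification": {"high": 5, "medium": 5, "low": 1},
    "automatic keyword suggestions": {"high": 6, "medium": 3, "low": 2},
    "own tags": {"high": 9, "medium": 2},
    "link to publication keywords": {
        "high": 5, "medium": 3, "low": 1, "n/a or do not know": 1,
    },
}
STATED_N = 11  # denominator given in the text for these questions


def summarize(counts, stated_n=STATED_N):
    """Return the share of high priority answers and the reported total."""
    total = sum(counts.values())
    return {
        "high_share": counts.get("high", 0) / stated_n,
        "reported_total": total,
        "matches_stated_n": total == stated_n,
    }


summary = {question: summarize(counts) for question, counts in REPORTED.items()}

# Rank the questions by the number of high priority answers (ties keep
# the order in which the questions are listed).
ranking = sorted(REPORTED, key=lambda q: REPORTED[q].get("high", 0), reverse=True)

for question in ranking:
    s = summary[question]
    print(
        f"{question}: high {REPORTED[question].get('high', 0)}/{STATED_N} "
        f"({s['high_share']:.0%}), reported total {s['reported_total']}"
    )
```

With so few respondents the shares are only descriptive; the ranking simply restates that the ability to add one’s own tags received the most high priority answers.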
Data ingestion into a repository
All (12/12) respondents reported a need to prepare the data to some extent before ingestion into a data repository, ranging from what the respondents viewed as hardly any work at all (2) for raw data, to a lot of work (1). The activities needed were reported to be related to making the data set understandable to others by preparing codebooks and metadata; filling gaps or fixing errors in the data; digitizing materials that are on paper; and removing identifying information from interviews. One respondent expanded on the process of making the data set understandable by adding sufficient metadata to it, stating that it takes a lot of time and effort, and needs to be balanced against how much work is needed for the data to be understandable and usable.
Most respondents (7/12) gave high priority to the ability to submit the data by themselves, while a few (4) thought it was low priority. Two respondents stated that preparing and submitting the data should be done by experts, one of them suggesting the library as a potential service provider. A respondent who gave high priority to submitting the data themselves explained that they wanted control of the ingestion process.
As to the automated submission process, the opinions ranged from ‘it is not possible’ (1) and ‘not a priority’ (5), to ‘medium priority’ (2) and ‘high priority’ (4). Two respondents stated that they would appreciate it as a time-saving factor, while noting that it would be possible only for some types of data, such as those that do not need to be anonymized or particularly secured for privacy reasons. One respondent stated that they were generally in favour of automating processes, but that for these purposes it seems a bit risky in case there are errors in the data set that they have not yet had time to fix before the automated submission process starts.
The possibility of batch upload to a repository was considered high priority by six of 12 respondents, medium priority by a few (2), and low and no priority by one person each; in addition, two people chose the N/A/I do not know option. As above, saving time was reported as an important reason for batch upload, as was the ability to create a collection of all the data (thousands of documents, photos and notes) in one set.
Discovery and use of the data
The ability for researchers within the same scientific area to easily find the data was ranked as high priority by all respondents, although one referred to metadata about the data rather than the data themselves, due to high confidentiality. The possibility for researchers outside the scientific area at hand to easily find the data was also considered a high priority (10 high priority, 2 medium; of the high, 1 answered for metadata in the same context as for the previous question). The possibility for the general public to easily find the data and the possibility to discover them via a search service were also considered important, although on average less important than the first two possibilities, which concern making the data available to researchers. For the general public the distribution was three high priority, three medium, four low, two not a priority; for discovery via a search service like Google the distribution of answers was five high priority, two medium, three low, two not a priority.
The replies varied greatly as to how the researchers envision the users would use the data, and included various types of discovery services, ranging from Google, databases like repositories or library catalogues, integrated cross-database search, to defining interfaces such as access through metadata, datasets divided into subject areas, search box with keywords, classification-based browsing interface.
The ability to connect data sets to visualization or analytical tools was highly desired by most respondents (7/12), but had a low (2) or no priority (1) for a few others. Allowing others to comment on or annotate their data sets had varying importance from high (4/12) and medium priority (3), to low (2) or no priority (3).
Regarding the ability to connect data with publications and other research output, most respondents reported a high (5/12) or medium priority (4), while a few gave this a low priority (2) or did not know (1). Support for web service APIs to give access to their data was seen as being either of high priority (4/12) or of low priority (3). Three respondents responded that they did not know or that this was not applicable for their data. The ability to connect or merge their data with other data sets was again reported as either a high (4/12) or medium priority (5), or not a priority at all (3).
Data preservation
Views on which parts of the data set would be most important to preserve over time varied greatly. Some respondents (4 out of 12) responded that everything is equally important, including documents from funders, project descriptions, data at different stages of the research process and publications. Other respondents (4) indicated that it would be sufficient to preserve the raw data. Lastly, some respondents (4) responded that one or a few stages of the processed data would be most important to preserve. The time period for which the respondents estimated that their data would be useful or valuable to others ranged from indefinitely (5) to 10–20 years (3), 5–10 years (1), 3–5 years (2). One person indicated that they did not know.
The ability to audit datasets to ensure structural integrity long-term was indicated to be of either high (6/12) or medium priority (3). Two respondents responded that this had a low priority, and one that they did not know.
Migrating datasets into new formats over time had a high (8/12) or medium priority (3), whereas only one respondent indicated that it had a low priority. Similarly, the need for a secondary storage site for datasets was indicated to be of high (8/12) or medium (2) priority, one respondent responding that it was not a priority, and one respondent indicating that they did not know. Having the secondary storage site at a different geographic location was not as prioritized, responses varying from high (5) to medium priority (4), one respondent giving it a low priority, and one responding that they did not know.
Concluding discussion
During the interviews, the respondents shared their experiences with collecting, processing, managing and storing data in their everyday work. They could easily give examples of situations that arise in their RDM practices and of the procedures they apply to their data. However, they had not always reflected upon why they worked in a certain manner and whether there were other ways of managing their data, for example concerning long-term preservation and sharing of data. As expected, some respondents had more elaborate and thorough protocols for RDM, predominantly in areas that require meticulous protocols such as environmental research based on observational data.
Generally the respondents are positive towards the idea of sharing research data, and they show a clear awareness about it. For most of the researchers, sharing and exchanging data with research colleagues is very common and natural, in particular when collaborating with others. This could possibly include sharing data for educational purposes, as open educational resources. If there are sensitive aspects to the research data, the researchers are also aware of this, not least if they have had to get permissions from an ethics board to collect the data. Concerning which types of research data to share, it seems least problematic for the respondents to share data from the first data stage, e.g. raw data. It seems the respondents are more hesitant about sharing data which has undergone some initial processing or analysis, corresponding to the second or third data stage. Those who are willing to share data from the second or third data stage would prefer to share with immediate collaborators or researchers in the same field only. There is a concern that the ‘early processed’ data may not be properly analysed or interpreted by an external researcher. In this context, libraries and other research support services may play an active role in providing guidance on how to share data in trustworthy and secure ways, including setting embargoes on data sets so that metadata can be available even though the data themselves are not.
As to data ingestion, the fact that most respondents need to prepare the data for their ingestion into a data repository means that in order to save the time of the researcher, a data management plan would need to be in place, which could help the researcher plan for the ingestion already in the planning phase of a project. Similarly, services from the library could be provided to support both the planning and the ingestion, which was suggested by a few respondents. Combining the possibility for researchers to submit data to a repository themselves with a service in which librarians deposit data on the researcher’s behalf would be useful, to facilitate sharing for researchers who want control over the ingestion phase, as well as those who do not have the time to do this themselves. By providing training in data ingestion, researchers could become more and more self-sufficient in sharing their data themselves.
As to data organization and description, services need to be provided that support organizing and describing data so that others can use them. Availability in multiple formats should also be supported. Standardized metadata, controlled vocabularies, automated suggestions of keywords in general, automated assignment of keywords from the related publication(s) in a repository, and the ability to add one’s own keywords should all be supported. Providing training to researchers in how to use metadata to correctly describe data also seems to be a valuable service. Such training could include the different characteristics and uses of subject metadata derived from controlled vocabularies compared to user-generated tags, and how to use the correct terms for the intended audience of the data set.
As to funders, there still does not seem to be a policy requiring researchers to draft a data management plan, to share, publish or deposit data in a repository, or, in a majority of cases, to preserve the data beyond the life of the funding. However, this may change soon depending on the Swedish government’s decisions regarding the policy suggested by the Swedish National Research Council, so it is a good time to start planning for these services. Provision of authorized access only would also need to be ensured, as a considerable portion of datasets may be bound by privacy or confidentiality concerns.
A long-term preservation policy, whether local or national, needs to take into consideration the resources required to archive vast amounts of data created at universities in Sweden. Although archives typically require some selection of resources to be archived, the respondents saw great value in all their materials being archived. It may also be relevant to preserve auxiliary resources related to a research data set, including data collection instruments, software and databases used and created during different stages of the research process. This could be addressed in future research.
Using the DCP Toolkit to gather data about researchers’ RDM and needs for service development was a valuable experience. Based on the interviews we created data curation profiles that can provide valuable guidance for the researchers, and the results of the interviews will be considered in our further research data service developments. In some of the interviews, the worksheet responses were quite brief, and thus hard to translate into something meaningful in the data curation profile. The audio recordings of the interviews provided extra information, which helped to make a richer profile. Also, we added an extra section to the data curation profile with personalized recommendations focusing on the research data management of future research projects. This was a particularly appreciated part of the data curation profile, according to communication and follow-up meetings with the respondents. As the construction of data curation profiles was a rather cumbersome process, we would want to simplify the process if we were to introduce data curation profiles as a regular service. This would require further investigation into the benefits and challenges of the DCP Toolkit from the researchers’ perspective as well as the library’s perspective. According to the website, the DCP Toolkit is in the planning stages of redesign, so the profile construction process may be made simpler in the future.
The next step of this project is to continue to analyse the qualitative data, i.e. the recordings of the interview sessions, which will provide complementary information to the DCP Toolkit questions, including explanations of the responses to the worksheet questions. Lund University Library is in the planning stages of a study evaluating a tool for data management plans, which will complement the results from this study concerning, for example, file types, volumes of data and needs for storage solutions.
Footnotes
Declaration of Conflicting Interests
The author(s) declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.
Funding
The author(s) received no financial support for the research, authorship, and/or publication of this article.
Appendix
Questions added to the original interview worksheet are indicated with an asterisk (*)
Module 6 – Organization and Description of Data
3. Please prioritize your need for the following types of services for your data.
