Abstract
Introduction:
Current guidelines for clinical biobanking have a strong focus on obtaining, handling, and storage of biospecimens. However, to allow for research tying biomarker analysis to clinical decision making, there should be more focus on collection of data on donor characteristics. Therefore, our aim was to develop a stepwise procedure to define a framework as a tool to help start the data collection process in clinical biobanking.
Materials and Methods:
The Radboud Biobank (RB) is a central clinical biobanking facility designed in accordance with the standards set by the Parelsnoer Institute, a Dutch national biobank originally initiated with eight different disease cohorts. To organize the information of these cohorts, we used our experience and knowledge in the field of biobanking and translational research to identify research domains and information categories to classify data. We extended this classification system to a stepwise procedure for defining a data collection framework and examined its utility for existing RB biobanks.
Results:
Our approach resulted in the definition of a three-step procedure: (1) Identification of research domains and relevant questions within the field that may benefit from biobank samples. (2) Identification of information categories and accompanying subcategories that are relevant for answering questions in identified research domains. (3) Reduction to an efficient framework based on essentiality and quality criteria. We showed the utility of the procedure for three existing RB biobanks.
Discussion:
We developed guidelines for the definition of a framework that supports the standardization of the biobank data collection process. Connecting the biobank database to pertinent information collected from the electronic health record will improve data quality and efficiency for both care and research. This is crucial when using the corresponding biospecimens for scientific research. Further, it also facilitates the combination of different clinical biobanks for a specific disease.
Introduction
General guidelines for the establishment of a clinical biobank (e.g., Organisation for Economic Co-operation and Development and International Society for Biological and Environmental Repositories guidelines3,4) place strong emphasis on obtaining, handling, and storage of biospecimens. 5 The National Cancer Research Institute Confederation of Cancer Biobanks established a data standard to enable biobanks to communicate about the samples they hold and so facilitate the formation of an integrated national network of biobanks. 6 Still, there are currently no recommendations available on the preparation of a data collection framework for clinical biobanking. The absence of such recommendations complicates data pooling and/or enrichment of different clinical biobanks.
Focus in biobanking needs to shift from biospecimens to comprehensive data collection.7–9 Recently, the focus on data has already increased, with special attention to metadata, the question whether and how data can be exchanged between biobanks, and whether data are FAIR (Findable, Accessible, Interoperable, and Reusable). 10 In 2012, Norlin and colleagues introduced the concept of Minimum Information About Biobank data Sharing (MIABIS) to facilitate and initiate collaborations between biobanks,11,12 in compliance with the aim of Biobanking and Biomolecular Resources Research Infrastructure to harmonize biobanking across Europe. Standardized metadata are proposed to describe the content of a biobank on an aggregate level, but MIABIS does not provide guidelines for the design of a framework for clinical data at the level of a disease-specific biobank. All in all, the fact that the original data have to be of high quality before data exchange even becomes relevant (“garbage in is garbage out”) is still undervalued.
Tools to improve data quality have previously been developed. A number of international groups are working on the harmonization of information on donor characteristics. The Public Population Project in Genomics (P3G) has put in effort to optimize the data collection process. 13 However, the P3G consortium focuses especially on the harmonization of data across biobanks worldwide. The approach of P3G to retrospectively harmonize established studies using the DataSHaPER approach already provides scientific utility.14,15 Further, the Observational Medical Outcomes Partnership Common Data Model found that disparate coding systems can be harmonized to a standardized vocabulary, allowing users to generate evidence from a wide variety of sources. It would also support collaborative research across data sources. 16 However, in many cases, retrospective harmonization leads to loss of detail, which might hamper the selection of samples and data of the research population of interest. To reduce the need for retrospective harmonization, a higher level of standardization of the prospective data collection process is necessary.
This calls for a concept of a data collection framework at the level of a single clinical biobank. Even though such a framework will be mostly disease specific, some general guiding principles are required for an optimal design (or architecture). In this work, we developed a stepwise procedure to obtain a data collection framework for a clinical biobank, aimed at acquiring sufficient data to answer relevant research questions in the field while maintaining an efficient and feasible data collection process. With this, we aim at providing a tool for all involved in the field of biobanking to help start the process of data collection. We assessed the procedure for its use in practice by applying it to three existing clinical biobanks of the Radboudumc. Many clinical biobanks already consider similar steps when organizing their data collection at the outset. Still, we believe we are the first to publish such steps in a scientific article, which will prevent researchers from reinventing the wheel.
Materials and Methods
The idea for our stepwise procedure arose from the initial need to structure information from the Parelsnoer Institute (PSI). PSI is a Dutch national biobank that was initiated in 2007 and is facilitated by all eight Dutch University Medical Centers. 17 It started out with eight Parels (Pearls), which are cohorts focused on different diseases, such as diabetes mellitus type I or inflammatory bowel disease. With regard to data, the Parels initially operated largely independently, leading to eight differently organized information models (IM). To structure the data, PSI commissioned the harmonization of these individual IMs to an independent IT company. This company classified the data elements by repurposing a pre-existing standard, HL7 v3, commonly used for clinical messaging between IT components in hospitals. This pre-existing standard was adapted as a binding guideline for the overall structure of the IM and classification of underlying data elements. During the project, it turned out that an IT-based IM did not meet the requirements for scientific analysis, mainly because any relevant clinical context, necessary for the interpretation of observations, was lost. For example, the question “What variants are present in the patient's DNA?” would be asked before the question whether the patient has actually been genetically tested. Over the course of 2015, PSI progressed toward a Detailed Clinical Model-based architecture, which was developed with (integration into) the clinical process and scientific analysis in mind.
Based on the activities of PSI, the Radboudumc in Nijmegen started its own clinical biobanking facility in 2012: the Radboud Biobank (RB). 18 Similar to PSI, the RB is made up of multiple disease-specific biobanks, but from the start its aim has been to coordinate the biobanks in a uniform manner. As management of the RB, we, thus, wished to be able to support new biobanks in the design of a data collection framework for their clinical biobank. To this end, we composed a multidisciplinary team including biomedical researchers, a clinician, laboratory staff, and data management professionals. Considering both our own discipline(s) and those of others, we worked to reach consensus in which the demands of all stakeholders are represented. We took the designed PSI IM as a starting point and extended it to a stepwise procedure for the design of a data collection framework for a clinical biobank. First of all, we listed general medical scientific questions that can be answered by using biospecimens and clinical data from a clinical biobank. To compose this list, we used our experience and knowledge gained in the design and implementation of biobanks that are part of PSI and/or the RB as well as our experience in conducting translational research. After presentation and discussion with end-users of clinical biobanks, the questions were categorized according to the classical frame of types of clinical questions used in evidence-based medicine. 19 This resulted in five research domains: Treatment (therapy), Prevention, Diagnosis, Prognosis (natural history), and Etiology or harm (causation).
Further, we identified five “umbrella” categories into which the information categories of existing data elements from PSI could be divided. To streamline the process of choosing information categories for a clinical biobank, each main information category was subsequently linked to relevant research domains. The subcategories (consisting of multiple items) within these information categories were used as units of data that together form the framework of the data collection. Finally, using the categories, we developed a practical checklist composed of critical questions to be posed for each item that is considered for the framework.
To test the applicability of the stepwise procedure in practice, we took three use cases for which a framework has been prepared: breast cancer, iron disorders, and the congenital malformation hypospadias. 20 These represent three relevant yet diverse diseases, for which we have both experience in building biobanks and the information necessary for the application of our proposed stepwise procedure.
Results
Our approach led to the identification of a three-step procedure to define a data collection framework (Fig. 1). These steps are elaborated below, followed by the assessment of the utility of the stepwise procedure.

Fig. 1. Stepwise procedure to define a data collection framework for a clinical biobank.
Step 1: identification of research domains and relevant questions
We identified five research domains that can benefit from the use of biospecimens and data from a clinical biobank (Table 1). Each research domain has its own requirements regarding the nature of the biospecimens to be collected and the accompanying clinical data. Some studies require an accurate standard classification of the disease, whereas for others the time of diagnosis is important. Moreover, there are studies that demand a standardized description of the signs and symptoms during the course of the disease, presenting either before diagnosis or thereafter. Although the actual research question is often not known in advance in the case of biobanking, identification of currently relevant questions within the research domains can help in defining a data collection framework. The relevance of a question is dependent on the present research gaps in the field and the focus of the clinical biobank. Note that for some questions, data of both patients and healthy controls are necessary.
Note to Table 1: Derived from the common clinical questions as generally agreed on in evidence-based medicine. 19
Step 2: identification of information categories and accompanying subcategories and items
We defined five information categories that meet the needs of most hypotheses in clinical research (Table 2). Each information category has its own subcategories (consisting of multiple items).
Notes to Table 2:
Text between brackets represents examples of possible items belonging to the subcategory.
See Table 1 for the different research domains.
Depending on the objective within the research domain.
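The hierarchy described in step 2 (information category, subcategory, item) can be sketched as a small data structure. This is an illustrative sketch only: the five category names and their Roman numerals are taken from the text, whereas the class names, the subcategory name "personal characteristics", and the grouping of items are assumptions and do not reproduce the contents of Table 2.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Subcategory:
    name: str
    items: List[str] = field(default_factory=list)


@dataclass
class InformationCategory:
    code: str  # Roman numeral used in the tables (I-V)
    name: str
    subcategories: List[Subcategory] = field(default_factory=list)

    def all_items(self) -> List[str]:
        """Flatten the items of all subcategories into one list."""
        return [item for sub in self.subcategories for item in sub.items]


# Category names come from the text. The subcategory name and its items are
# illustrative placeholders, not the contents of Table 2.
framework = [
    InformationCategory("I", "Donor identification", [
        Subcategory("personal characteristics",
                    ["year of birth", "gender", "ethnicity"]),
    ]),
    InformationCategory("II", "Disease characteristics"),
    InformationCategory("III", "Risk and prognostic factors"),
    InformationCategory("IV", "Treatment"),
    InformationCategory("V", "Course and outcome of disease"),
]
```

Keeping items grouped under subcategories, rather than in one flat list, mirrors the text's use of subcategories as the units of data that together form the framework.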
Donor identification
Donor identification includes basic personal characteristics, such as year of birth, gender, and ethnicity, that are more or less mandatory for selection in most instances (as inclusion/exclusion criteria). Further, a reliable identification number should be obtained at inclusion and, subsequently, pseudonymized with a unique code linked to the data of the donor. This will ensure both the privacy of the donor and efficient association between the clinical data and the biospecimens.
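One possible way to derive such a unique code is sketched below. The text does not prescribe a pseudonymization method; the use of a keyed hash (HMAC-SHA256), the function name `pseudonymize`, the `RB-` prefix, the key, and the identification numbers are all assumptions made for illustration.

```python
import hashlib
import hmac


def pseudonymize(identification_number: str, secret_key: bytes, length: int = 16) -> str:
    """Derive a stable pseudonym from a donor identification number.

    Assumption: a keyed hash (HMAC-SHA256) is used. The key has to be stored
    separately from the research database, so that data users cannot
    recompute the code from a known identification number.
    """
    digest = hmac.new(secret_key, identification_number.encode("utf-8"), hashlib.sha256)
    return "RB-" + digest.hexdigest()[:length].upper()


# Hypothetical key and identification numbers, for illustration only.
KEY = b"replace-with-a-securely-stored-key"
code_a = pseudonymize("123456789", KEY)
code_b = pseudonymize("987654321", KEY)

# The same code links a donor's clinical data to the donor's biospecimens.
clinical_data = {code_a: {"year_of_birth": 1970}}
biospecimens = {code_a: ["serum", "DNA"]}
```

Because the code is deterministic for a given key, repeated inclusions of the same donor map to the same pseudonym, which is what allows clinical data and biospecimens to be associated without exposing the original identification number.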
Disease characteristics
All information pertinent to the classification of the disease for which donors were included in the biobank should be incorporated in the framework. These will include: (1) The clinical diagnosis, preferably using a standard vocabulary (e.g., International Statistical Classification of Diseases and Related Health Problems, 10th revision or Systematized Nomenclature of Medicine—Clinical Terms21,22) and a standardized set of specific disease characteristics, such as subclasses/-phenotypes and phase; (2) Essential items from the medical history of a patient; (3) Diagnostic measurements, such as imaging, physical examinations, and laboratory tests; and (4) Co-morbid conditions at time of diagnosis, comprising a limited, standardized co-morbidity list, supplemented with an open “other co-morbidities” category.
For selection purposes, underlying data that lead to the diagnosis and variables that enable specification of disease are necessary. This will allow for differentiation between different patient types within a group with the same diagnosis. Further, for diagnostic studies, a reference may be needed for every donor to the biobank. This may be a reference test, gold standard, or an expert clinical diagnosis obtained during the course of the disease.
Risk and prognostic factors
Factors known from literature (evidence based) to have a substantial effect on the occurrence or course of the disease should be included in the framework, because of their potential confounding role in future studies. These factors may relate to genetic predisposition, lifestyle, or environment. To make research possible, factors are preferably measured according to international standards (e.g., Logical Observation Identifiers Names and Codes for lab tests 23), enabling comparisons between different studies and pooling of data for more extensive studies. The time of measurement is crucial. For etiologic studies, the period before disease onset is relevant, whereas the time of diagnosis is more relevant for prognostic research. When a valid and reliable description of a risk/prognostic factor is not possible, it is better not to include the factor in the framework, or to use a proxy instead. Details about family history may be important when associated with the onset or course of the disease, as a proxy for either heredity or shared environment. In some cases, standardized family trees are needed; in others, a standardized question about disease occurrence among first (and second) degree family members is sufficient.
Treatment
Details regarding treatment should be well documented, including dose and duration when medication is involved. These can be useful for identifying unexpected side effects or may be important as confounding factors. Of note, the type of treatment or dosage may change over time, so the treatment data should be updated regularly. Preferably, data on treatment would be captured according to international standards. However, there is currently no agreed general standard for the notion of treatment. This is mainly due to the fact that, although perfectly demarcated in a clinical sense, treatment potentially spans multiple medical disciplines and sources of information (e.g., medication, physiotherapy, dietary requirements). The only feasible approach as of now is to document certain subsets in the context of the patient's primary disease. For example, the Diabetes Parel requested their participants to bring either all their medication or a list from the pharmacist during their visit, so that medication use could be reported accurately despite the absence of a standard. 24
Course and outcome of disease
Disease course includes changes in the severity of the disease (from complete recovery to death), changes in functional measures, and changes in quality of life experienced by the donor. Typically, disease course varies over time and this variability is of specific interest in many studies. Details on the course and outcome of disease can be obtained by following donors, by follow-up visits, and/or by links to existing national registries. In many countries, several high-quality medical and socioeconomic registries are available that are of interest for linkage with biobanks, for example, cancer registries, registries of death certificates, and pathology archives. Availability of these registries does not mean that biobanks are already permanently linked to them. There is ample room for improvement of the efficiency and quality of these linkages, in conformity with statutory and consent obligations to participants. 25
Step 3: reduction to an efficient framework
After having drafted the first framework in step 2, step 3 consists of thorough reviewing before implementation in the biobank procedures. This review process comprises the critical assessment of all items and the subsequent reduction of these items to obtain an efficient framework. To this end, we developed a practical checklist. This checklist is composed of critical questions to be posed for each item that is considered for the framework (Table 3). The first part of these questions will help to discern whether a certain item X is essential for at least one of the information categories defined in step 2. The second part consists of two questions with regard to quality. To include item X, at least one of the essentiality criteria and both quality criteria should be met. It will be challenging to follow this procedure strictly, as it may be tempting to include variables for which a definite yes for the quality criteria cannot be given. In the long run, however, data collection for these factors will be untenable, whereas the probability that these appear to be useful is usually very low. It is much more beneficial to put all efforts into obtaining high-quality data on a limited set of factors that are pertinent for the specific disease. Very likely, these factors are also included in the standard clinical protocol for that disease (a protocol that describes which data should be collected and how this should be done); if not, adding these factors to the protocol should be seriously considered.
Note to Table 3: Questions are based on the information categories of Table 2. Note that the essentiality criteria only apply when the corresponding research domain is considered relevant in step 1.
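The inclusion rule of step 3 (at least one essentiality criterion and both quality criteria must be met) can be expressed as a short decision function. This is a sketch under stated assumptions: the wording of the checklist questions is in Table 3 and is not reproduced here, so the field names `quality_1` and `quality_2` are placeholders, the essentiality answers are simplified to one yes/no per research domain, and the candidate items are hypothetical.

```python
from dataclasses import dataclass
from typing import Dict, Set


@dataclass
class ItemAssessment:
    name: str
    # Essentiality answers, simplified here to one yes/no per research domain.
    essential_for: Dict[str, bool]
    # The two quality criteria; placeholder names, see Table 3 for the wording.
    quality_1: bool
    quality_2: bool


def include_item(item: ItemAssessment, relevant_domains: Set[str]) -> bool:
    """Return True if at least one applicable essentiality criterion and
    both quality criteria are met.

    An essentiality criterion only applies when its research domain was
    considered relevant in step 1.
    """
    essential = any(
        met for domain, met in item.essential_for.items()
        if domain in relevant_domains
    )
    return essential and item.quality_1 and item.quality_2


# Hypothetical candidate items for a biobank focused on diagnosis and prognosis.
relevant = {"Diagnosis", "Prognosis"}
candidates = [
    ItemAssessment("tumor stage", {"Prognosis": True}, True, True),
    ItemAssessment("dietary diary", {"Etiology": True}, True, True),
    ItemAssessment("self-reported stress", {"Prognosis": True}, True, False),
]
framework_items = [c.name for c in candidates if include_item(c, relevant)]
```

In this example only "tumor stage" is retained: "dietary diary" is essential only for a domain that was not selected in step 1, and "self-reported stress" fails one of the quality criteria.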
The stepwise procedure has been summarized in a checklist, which can be used freely by researchers who need to define a data collection framework for their clinical biobank (Supplementary Data; Supplementary Data are available online at www.liebertpub.com/bio).
Assessment of the utility of the procedure
We assessed the utility of the procedure for three different RB biobanks. Tables 4–6 show examples of data collection frameworks for clinical biobanks on breast cancer, iron disorders, and hypospadias. For three different types of research questions, items from different information categories were obtained. Combination of these items resulted in an efficient framework. However, note that more than three research questions may have to be identified to ensure that the framework is comprehensive as well.
Notes to Table 4 (breast cancer):
All items of the defined framework (except for biobank identification number and vital status) are also useful and relevant for clinical care. Therefore, they are obviously part of the EHR. With respect to the vital status, active follow-up and/or linking to the municipal administration and the cause of death registry are necessary.
Note that more than three research questions may have to be identified to obtain a comprehensive framework.
Applicable to healthy controls.
EHR, electronic health record; I, donor identification; II, disease characteristics; III, risk and prognostic factors; IV, treatment; V, course and outcome of disease; SNP, single nucleotide polymorphism.
Notes to Table 5 (iron disorders):
All items of the defined framework (except for biobank identification number and ethnicity) are also useful and relevant for clinical care. Therefore, they are obviously part of the EHR.
Note that more than three research questions may have to be identified to obtain a comprehensive framework.
Applicable to healthy controls.
BMI, body mass index; IRIDA, iron-refractory iron deficiency anemia; I, donor identification; II, disease characteristics; III, risk and prognostic factors; IV, treatment; V, course and outcome of disease.
Notes to Table 6 (hypospadias):
All items of the defined framework (except for biobank identification number and ethnicity) are also useful and relevant for clinical care. Therefore, they are obviously part of the EHR.
Note that more than three research questions may have to be identified to obtain a comprehensive framework.
Applicable to healthy controls.
I, donor identification; II, disease characteristics; III, risk and prognostic factors; IV, treatment; V, course and outcome of disease.
Discussion
We developed a stepwise procedure to guide researchers in their definition of a data collection framework when establishing a new clinical biobank, aimed at acquiring sufficient data to answer relevant research questions in the field while maintaining an efficient and feasible data collection process. First, relevant research domains, such as diagnosis or prognosis, should be identified. Subsequently, the information categories with accompanying subcategories and items to answer research questions in the selected domains need to be identified. As a final step, it is important to reduce the framework based on essentiality and quality. To test their practical use, the steps have been successfully applied to three existing clinical biobanks. These examples showed that because of the large overlap between items for questions in different research domains, the actual frameworks remain limited.
Input from various disciplines is required. Clinicians, research nurses, and researchers are involved in the process of defining the clinical data. In addition, laboratory technicians and researchers supply input regarding lab values and biospecimens to be collected. Other biobank professionals are responsible for the definition and composition of a unique ID for each participant (i.e., Biobank ID). Donor representatives may also contribute because they can indicate donor-relevant outcomes and the feasibility of additional measurements. An active role of donors in data collection (self-administered questionnaires, diaries, smart devices), with data uploaded to the database via a secure website or mobile application, may increase the potential for efficient enrichment of a database.
A practical guideline when designing a data collection framework is to stay close to the clinical recording standards. What is obligatory to record in the electronic health record (EHR) is probably important for clinical research as well, and vice versa. 26 By linking the EHR to the research database, the data collection for a clinical biobank can be fully integrated into routine patient care, saving a considerable amount of time as information only needs to be recorded once.27,28 Once the biobank database is connected to the EHR, a periodic synchronization is sufficient to complete and save the (coded) data in the database. If needed, research-specific information may be added. However, a prerequisite is that the data recorded in the EHR meet the quality standards for both healthcare and research. 29 With limited data quality, EHR linkage may result in retrieving incomplete or non-standardized data, which will lead to extra costs for cleaning and validating these data. These costs should be taken into account when deciding whether a certain item should be collected and, if so, whether it should be collected by using the EHR. Further, linkage of the EHR to the research database can also be a technologically complex endeavor, requiring extensive IT infrastructure (e.g., a data warehouse, querying tools, an appropriate Application Programming Interface). Still, a database linked to the EHR will generally lead to more patient inclusions, contain fewer measurement errors, and be easier to manage financially. Investing in a good organization of the EHR as well as of the biobank database may, therefore, benefit both patient care and research.
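The periodic synchronization described above could look roughly as follows. This is a simplified sketch, not a description of the RB infrastructure: both data stores are modeled as in-memory dictionaries keyed by the pseudonymized Biobank ID, and the function name `synchronize`, the item names, and the example records are hypothetical.

```python
from typing import Dict, List, Tuple

Record = Dict[str, object]


def synchronize(
    ehr_records: Dict[str, Record],
    biobank_db: Dict[str, Record],
    framework_items: List[str],
) -> List[Tuple[str, str]]:
    """Copy framework items from EHR records into the biobank database.

    Only donors already present in the biobank database are updated, and
    only items that belong to the framework are copied. Returns the
    (biobank_id, item) pairs that are missing in the EHR, so that they can
    be reviewed before the data are used for research.
    """
    missing = []
    for biobank_id, target in biobank_db.items():
        source = ehr_records.get(biobank_id, {})
        for item in framework_items:
            value = source.get(item)
            if value is None:
                missing.append((biobank_id, item))
            else:
                target[item] = value
    return missing


# Hypothetical records: the address is not a framework item and is not copied.
ehr = {"RB-001": {"diagnosis_code": "C50.9", "bmi": None, "address": "not copied"}}
db = {"RB-001": {}}
gaps = synchronize(ehr, db, ["diagnosis_code", "bmi"])
```

Returning the missing items instead of silently skipping them makes the cost of incomplete EHR data visible, which is relevant when deciding whether an item should be collected through the EHR at all.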
We are confident that our approach to determining a data collection framework for a clinical biobank is of high interest to all involved in biobanking. After all, careful design and proper management of data are crucial when using the corresponding biospecimens for scientific research. Moreover, the definition of a data collection framework is timely and will motivate the different stakeholders to join forces to initiate standardization of the donor characteristics, which will facilitate the combination of different clinical biobanks in the future.
Acknowledgments
The authors thank the collaborators within the Radboud biobanks “Aetiological research into Genetic and Occupational/Environmental Risk Factors for Anomalies in Children (AGORA),” “Breast Cancer” and “Iron” for their help in the compilation of the practical examples that have been used for the framework propositions in this article.
Author Disclosure Statement
No conflicting financial interests exist.