A Stepwise Procedure to Define a Data Collection Framework for a Clinical Biobank

Abstract

Introduction:

Current guidelines for clinical biobanking focus strongly on obtaining, handling, and storing biospecimens. However, to allow for research that ties biomarker analysis to clinical decision making, more attention should go to collecting data on donor characteristics. Therefore, our aim was to develop a stepwise procedure to define a framework as a tool to help start the data collection process in clinical biobanking.

Materials and Methods:

The Radboud Biobank (RB) is a central clinical biobanking facility designed in accordance with the standards set by the Parelsnoer Institute, a Dutch national biobank originally initiated with eight different disease cohorts. To organize the information of these cohorts, we used our experience and knowledge in the field of biobanking and translational research to identify research domains and information categories to classify data. We extended this classification system to a stepwise procedure for defining a data collection framework and examined its utility for existing RB biobanks.

Results:

Our approach resulted in the definition of a three-step procedure: (1) Identification of research domains and relevant questions within the field that may benefit from biobank samples. (2) Identification of information categories and accompanying subcategories that are relevant for answering questions in identified research domains. (3) Reduction to an efficient framework based on essentiality and quality criteria. We showed the utility of the procedure for three existing RB biobanks.

Discussion:

We developed guidelines for the definition of a framework that supports the standardization of the biobank data collection process. Connecting the biobank database to pertinent information collected from the electronic health record will improve data quality and efficiency for both care and research. This is crucial when using the corresponding biospecimens for scientific research. Further, it also facilitates the combination of different clinical biobanks for a specific disease.

Introduction

Biobanking has been identified as a key component of translational medicine. Clinical biobanks are focused on gaining study material for specific, disease-oriented research.¹ Standardized clinical biobanks comprising high-quality biospecimens and donor characteristics will greatly contribute to our understanding of complex, multifactorial diseases and are a prerequisite for the translation of biomarkers from bench to bedside.²

General guidelines for the establishment of a clinical biobank (e.g., Organisation for Economic Co-operation and Development and International Society for Biological and Environmental Repositories guidelines^3,4) place strong emphasis on obtaining, handling, and storage of biospecimens.⁵ The National Cancer Research Institute Confederation of Cancer Biobanks established a data standard to enable biobanks to communicate about the samples they hold and so facilitate the formation of an integrated national network of biobanks.⁶ Still, there are currently no recommendations available on the preparation of a data collection framework for clinical biobanking. The absence of such recommendations complicates data pooling and/or enrichment of different clinical biobanks.

Focus in biobanking needs to shift from biospecimens to comprehensive data collection.^7–9 Recently, the focus on data has already increased, with special attention to metadata, the question whether and how data can be exchanged between biobanks, and whether data are FAIR (Findable, Accessible, Interoperable, and Reusable).¹⁰ In 2012, Norlin and colleagues introduced the concept of Minimum Information About Biobank data Sharing (MIABIS) to facilitate and initiate collaborations between biobanks,^11,12 in compliance with the aim of Biobanking and Biomolecular Resources Research Infrastructure to harmonize biobanking across Europe. Standardized metadata are proposed to describe the content of a biobank on an aggregate level, but MIABIS does not provide guidelines for the design of a framework for clinical data at the level of a disease-specific biobank. All in all, the fact that the original data have to be of high quality before data exchange even becomes relevant (“garbage in is garbage out”) is still undervalued.

Tools to improve data quality have previously been developed. A number of international groups are working on the harmonization of information on donor characteristics. The Public Population Project in Genomics (P³G) has worked to optimize the data collection process.¹³ However, the P³G consortium focuses especially on the harmonization of data across biobanks worldwide. Its approach of retrospectively harmonizing established studies with the DataSHaPER method already provides scientific utility.^14,15 Further, work with the Observational Medical Outcomes Partnership Common Data Model showed that disparate coding systems can be harmonized to a standardized vocabulary, allowing users to generate evidence from a wide variety of sources and supporting collaborative research across data sources.¹⁶ However, in many cases, retrospective harmonization leads to loss of detail, which might hamper the selection of samples and data of the research population of interest. To reduce the need for retrospective harmonization, a higher level of standardization of the prospective data collection process is necessary.

This calls for a concept of a data collection framework at the level of a single clinical biobank. Even though such a framework will be mostly disease specific, some general guiding principles are required for an optimal design (or architecture). In this work, we developed a stepwise procedure to obtain a data collection framework for a clinical biobank, aimed at acquiring sufficient data to answer relevant research questions in the field while maintaining an efficient and feasible data collection process. With this, we aim at providing a tool for all involved in the field of biobanking to help start the process of data collection. We assessed the procedure for its use in practice by applying it to three existing clinical biobanks of the Radboudumc. Many clinical biobanks already consider similar steps when organizing their data collection at the outset. Still, we believe we are the first to publish such steps in a scientific article, which will prevent researchers from reinventing the wheel.

Materials and Methods

The idea for our stepwise procedure arose from the initial need to structure information from the Parelsnoer Institute (PSI). PSI is a Dutch national biobank that was initiated in 2007 and is facilitated by all eight Dutch University Medical Centers.¹⁷ It started out with eight Parels (Pearls), which are cohorts focused on different diseases, such as diabetes mellitus type I or inflammatory bowel disease. With regard to data, the Parels initially operated largely independently, leading to eight differently organized information models (IM). To structure the data, PSI commissioned the harmonization of these individual IMs to an independent IT company. This company classified the data elements by repurposing a pre-existing standard, HL7 v3, commonly used for clinical messaging between IT components in hospitals. This pre-existing standard was adapted as a binding guideline for the overall structure of the IM and classification of underlying data elements. During the project, it turned out that an IT-based IM did not meet the requirements for scientific analysis, mainly because any relevant clinical context, necessary for the interpretation of observations, was lost. For example, the question “What variants are present in the patient's DNA?” would be asked before the question whether the patient has actually been genetically tested. Over the course of 2015, PSI progressed toward a Detailed Clinical Model-based architecture, which was developed with (integration into) the clinical process and scientific analysis in mind.

Based on the activities of PSI, the Radboudumc in Nijmegen started its own clinical biobanking facility in 2012: the Radboud Biobank (RB).¹⁸ Similar to PSI, the RB is made up of multiple disease-specific biobanks, but from the start its aim has been to coordinate the biobanks in a uniform manner. As the management of the RB, we therefore wanted to support new biobanks in the design of a data collection framework for their clinical biobank. To this end, we composed a multidisciplinary team including biomedical researchers, a clinician, laboratory staff, and data management professionals. Considering both our own discipline(s) and those of others, we worked toward a consensus that represents the demands of all stakeholders. We took the designed PSI IM as a starting point and extended it to a stepwise procedure for the design of a data collection framework for a clinical biobank. First of all, we listed general medical scientific questions that can be answered by using biospecimens and clinical data from a clinical biobank. To compose this list, we used our experience and knowledge gained in the design and implementation of biobanks that are part of PSI and/or the RB as well as our experience in conducting translational research. After presenting the list to end-users of clinical biobanks and discussing it with them, we categorized the questions according to the classical frame of types of clinical questions used in evidence-based medicine.¹⁹ This resulted in five research domains: Treatment (therapy), Prevention, Diagnosis, Prognosis (natural history), and Etiology or harm (causation).

Further, we identified five “umbrella” categories into which the information categories of existing data elements from PSI could be divided. To streamline the process of choosing information categories for a clinical biobank, each main information category was subsequently linked to relevant research domains. The subcategories (consisting of multiple items) within these information categories were used as units of data that together form the framework of the data collection. Finally, using the categories, we developed a practical checklist composed of critical questions to be posed for each item that is considered for the framework.

To test the applicability of the stepwise procedure in practice, we took three use cases for which a framework has been prepared: breast cancer, iron disorders, and the congenital malformation hypospadias.²⁰ These represent three relevant yet diverse diseases, for which we have both experience in building biobanks and the information necessary for the application of our proposed stepwise procedure.

Results

Our approach led to the identification of the three-step procedure to define a data collection framework (Fig. 1). These steps are elaborated later, followed by the assessment of the utility of the stepwise procedure.

FIG. 1.

Stepwise procedure to define a data collection framework for a clinical biobank.

Step 1: identification of research domains and relevant questions

We identified five research domains that can benefit from the use of biospecimens and data from a clinical biobank (Table 1). Each research domain has its own requirements regarding the nature of the biospecimens to be collected and the accompanying clinical data. Some studies require an accurate standard classification of the disease, whereas for others the time of diagnosis is important. Moreover, there are studies that demand a standardized description of the signs and symptoms during the course of the disease, presenting either before diagnosis or thereafter. Although the actual research question is often not known in advance in the case of biobanking, identification of currently relevant questions within the research domains can help in defining a data collection framework. The relevance of a question is dependent on the present research gaps in the field and the focus of the clinical biobank. Note that for some questions, data of both patients and healthy controls are necessary.

Table 1.

Research Domains That Could Benefit from the Availability of Large Collections of Biospecimens from Well-Defined Donors

Research domain ^a	Objective and description
Treatment (therapy)	Contribution to precision (stratified) medicine by studying differential effectiveness of treatment in association with (genetic) biomarkers
Prevention	Assessment of the effectiveness of prophylactic treatment by monitoring biomarkers for the development of a specific disease
Prevention	Assessment of the utility of population surveillance by monitoring biomarkers for the development of a specific disease
Diagnosis	Assessment of the diagnostic value of biomarker X in a patient with suspected disease Y
	Improvement of clinical diagnostics by determining which phenotypical characteristics are associated with certain genotypes/biomarker values
	Improvement of laboratory diagnostics by determining which (genetic) biomarkers are associated with specific clinical characteristics
	Improvement in the classification of disease categories by searching for clusters of homogenous (genetic) biomarkers and phenotypical characteristics
Prognosis (natural history)	Assessment of the prognostic value of specific (genetic) biomarkers that are strongly associated with the course of a certain disease or a distinguishing health characteristic
Prognosis (natural history)	Improvement of prognosis by targeting (genetic) biomarkers for which a change has a favorable effect on the course of a certain disease or a distinguishing health characteristic
Etiology or harm (causation)	Elucidation of mechanisms of the development or the course of disease by studying the association of (genetic) biomarkers with other disease characteristics
	Searching for prevention targets by studying which (genetic) biomarkers are associated with the occurrence of a certain disease
	Searching for prevention targets by studying which combination of (genetic) biomarkers and personal and/or environmental characteristics are associated with the occurrence of a disease

^a Derived from the common clinical questions as generally agreed on in evidence-based medicine.¹⁹

Step 2: identification of information categories and accompanying subcategories and items

We defined five information categories that meet the needs of most hypotheses in clinical research (Table 2). Each information category has its own subcategories (consisting of multiple items).

Table 2.

Information Categories That Meet the Needs of Most Hypotheses in Clinical Research

Information category	Subcategories ^a	Type of research domain ^b
I. Donor identification	• Identification number	Obligatory for all research domains
I. Donor identification	• Demographics (year of birth, gender, ethnicity)	Obligatory for all research domains
II. Disease characteristics	• Diagnosis plus details of the disease (subclass, subphenotype, phase)	Obligatory for all research domains
	• Patient medical history
	• Diagnostic measurements (imaging, physical examination, laboratory)
	• Co-morbidity
III. Risk and prognostic factors	• Lifestyle and environmental exposures that are known to affect the onset or course of the disease	• Treatment (therapy)
	• Details on family history (standardized family tree investigations or questions about disease occurrence in family members)	• Prognosis (natural history)
		• Etiology or harm (causation)
IV. Treatment	• Details regarding treatment (type, dose, duration)	• Treatment (therapy)
IV. Treatment	• Co-medication	• Prognosis (natural history)
V. Course and outcome of disease	• Course of the disease (severity, complications)	• Prevention
	• Health outcomes (functionality, quality of life)	• Prognosis (natural history)
		• Etiology or harm (causation)^c

^a Text between brackets represents examples of possible items belonging to the subcategory.

^b See Table 1 for the different research domains.

^c Depending on the objective within the research domain.
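The link between information categories and research domains in Table 2 can be read as a simple lookup. The following is a minimal sketch of that reading, not part of the published procedure: the dictionary layout, the shortened domain names, and the function name `categories_for` are assumptions made for illustration, and the entry for category V ignores the nuance of footnote c (Etiology only depending on the objective).

```python
# Shortened names for the five research domains of Table 1 (assumed labels).
ALL_DOMAINS = frozenset(
    {"Treatment", "Prevention", "Diagnosis", "Prognosis", "Etiology"}
)

# Assumed encoding of Table 2: information category -> linked research domains.
# Categories I and II are obligatory for all research domains.
CATEGORY_DOMAINS = {
    "I. Donor identification": ALL_DOMAINS,
    "II. Disease characteristics": ALL_DOMAINS,
    "III. Risk and prognostic factors": frozenset(
        {"Treatment", "Prognosis", "Etiology"}
    ),
    "IV. Treatment": frozenset({"Treatment", "Prognosis"}),
    # Etiology applies here only depending on the objective (footnote c).
    "V. Course and outcome of disease": frozenset(
        {"Prevention", "Prognosis", "Etiology"}
    ),
}


def categories_for(domains):
    """Return the categories linked to at least one of the given domains."""
    wanted = set(domains)
    unknown = wanted - ALL_DOMAINS
    if unknown:
        raise ValueError(f"unknown research domain(s): {sorted(unknown)}")
    # Dict insertion order keeps the categories in the order I to V.
    return [cat for cat, linked in CATEGORY_DOMAINS.items() if linked & wanted]


if __name__ == "__main__":
    for category in categories_for(["Diagnosis", "Treatment"]):
        print(category)
```

Under this encoding, a biobank that only targets diagnostic questions would start from categories I and II, whereas a biobank that also targets prognostic questions would consider all five.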

Donor identification

Donor identification includes basic personal characteristics, such as year of birth, gender, and ethnicity, that are more or less mandatory for selection in most instances (as inclusion/exclusion criteria). Further, a reliable identification number should be obtained at inclusion and, subsequently, pseudonymized with a unique code linked to the data of the donor. This will ensure both the privacy of the donor and efficient association between the clinical data and the biospecimens.
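As an illustration of pseudonymization with a unique code, the sketch below issues a random code per donor and keeps the link to the identification number in a separate key table. It is a hypothetical example and does not describe the RB or PSI implementation: the class name `PseudonymRegistry`, the `BB-` prefix, and the in-memory storage are all assumptions.

```python
import secrets


class PseudonymRegistry:
    """Hypothetical key table linking identification numbers to pseudonyms.

    In practice the key table would be stored apart from the research data;
    an in-memory dict is used here purely for illustration.
    """

    def __init__(self):
        self._by_identifier = {}
        self._issued = set()

    def pseudonym_for(self, identifier: str) -> str:
        """Return the donor's existing pseudonym, or issue a new unique one."""
        if identifier in self._by_identifier:
            return self._by_identifier[identifier]
        code = "BB-" + secrets.token_hex(8)  # assumed code format
        while code in self._issued:  # regenerate in the unlikely case of a clash
            code = "BB-" + secrets.token_hex(8)
        self._issued.add(code)
        self._by_identifier[identifier] = code
        return code


if __name__ == "__main__":
    registry = PseudonymRegistry()
    code = registry.pseudonym_for("identification-number-1")
    # Clinical data and biospecimen records carry the same code, so they can
    # be associated without exposing the identification number.
    clinical_record = {"donor": code, "year_of_birth": 1970}
    specimen_record = {"donor": code, "material": "serum"}
    print(clinical_record["donor"] == specimen_record["donor"])
```

Because the code is random rather than derived from the identification number, it reveals nothing about the donor on its own; re-identification requires access to the key table.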

Disease characteristics

All information pertinent to the classification of the disease for which donors were included in the biobank should be incorporated in the framework. These will include: (1) The clinical diagnosis, preferably using a standard vocabulary (e.g., International Statistical Classification of Diseases and Related Health Problems, 10th revision or Systematized Nomenclature of Medicine—Clinical Terms^21,22) and a standardized set of specific disease characteristics, such as subclasses/-phenotypes and phase; (2) Essential items from the medical history of a patient; (3) Diagnostic measurements, such as imaging, physical examinations, and laboratory tests; and (4) Co-morbid conditions at time of diagnosis, comprising a limited, standardized co-morbidity list, supplemented with an open “other co-morbidities” category.

For selection purposes, underlying data that lead to the diagnosis and variables that enable specification of disease are necessary. This will allow for differentiation between different patient types within a group with the same diagnosis. Further, for diagnostic studies, a reference may be needed for every donor to the biobank. This may be a reference test, gold standard, or an expert clinical diagnosis obtained during the course of the disease.

Risk and prognostic factors

Factors known from literature (evidence based) to have a substantial effect on the occurrence or course of the disease should be included in the framework, because of their potential confounding role in future studies. These factors may relate to genetic predisposition, lifestyle, or environment. To make research possible, factors are preferably measured according to international standards (e.g., Logical Observation Identifiers Names and Codes for lab tests²³), enabling comparisons between different studies and pooling of data for more extensive studies. The time of measuring data is crucial. For etiologic studies, the period before disease onset is relevant, whereas the time of diagnosis is more relevant for prognostic research. When a valid and reliable description of a risk/prognostic factor is not possible, it is better to leave that factor out of the framework or to use a proxy instead. Details about family history may be important when associated with the onset or course of the disease, as proxy for either heredity or shared environment. In some cases, standardized family trees are needed; in others, a standardized question about disease occurrence among first (and second) degree family members is sufficient.

Treatment

Details regarding treatment should be well documented, including dose and duration when medication is involved. These can be useful for identifying unexpected side effects or may be important as confounding factors. Of note, the type of treatment or dosage may change over time, so the treatment data should be updated regularly. Preferably, data on treatment would be captured according to international standards. However, there is currently no agreed general standard for the notion of treatment. This is mainly due to the fact that, although perfectly demarcated in a clinical sense, treatment potentially spans multiple medical disciplines and sources of information (e.g., medication, physiotherapy, dietary requirements). The only feasible approach as of now is to document certain subsets in the context of the patient's primary disease. For example, the Diabetes Parel requested their participants to bring either all their medication or a list from the pharmacist during their visit, so that medication use could be reported accurately despite the absence of a standard.²⁴

Course and outcome of disease

Disease course includes changes in the severity of the disease (from complete recovery to death), changes in functional measures, and changes in quality of life experienced by the donor. Typically, disease course varies over time and this variability is of specific interest in many studies. Details on the course and outcome of disease can be obtained by following donors, by follow-up visits, and/or by links to existing national registries. In many countries, several high-quality medical and socioeconomic registries are available that are of interest for linkage with biobanks, for example, cancer registries, registries of death certificates, and pathology archives. Availability of these registries does not mean that biobanks are already permanently linked. There is ample room for improvement of the efficiency and quality of these linkages, in conformity with statutory and consent obligations to participants.²⁵

Step 3: reduction to an efficient framework

After having drafted the first framework in step 2, step 3 consists of thorough reviewing before implementation in the biobank procedures. This review process comprises the critical assessment of all items and the subsequent reduction of these items to obtain an efficient framework. To this end, we developed a practical checklist. This checklist is composed of critical questions to be posed for each item that is considered for the framework (Table 3). The first part of these questions will help to discern whether a certain item X is essential for at least one of the information categories defined in step 2. The second part consists of two questions with regard to quality. To include item X, at least one of the essentiality criteria and both quality criteria should be met. It will be challenging to follow this procedure strictly, as it may be tempting to include variables for which a definite yes for the quality criteria cannot be given. In the long run, however, data collection for these factors will be untenable, whereas the probability that these appear to be useful is usually very low. It is much more beneficial to put all efforts into obtaining high-quality data on a limited set of factors that are pertinent for the specific disease. Very likely, these factors are also included in the standard clinical protocol for that disease (a protocol that describes which data should be collected and how this should be done); if not, adding these factors to the protocol should be seriously considered.

Table 3.

Practical Checklist for Defining a Data Collection Framework

What are prerequisites to include an item X in the data collection framework of a clinical biobank set up for disease Y?
I. Is X essential^a:
• for the identification of donors with Y?
• for the inclusion of donors with Y in scientific research?
• for the classification of donors with Y?
• for the family history of Y necessary for this research study?
• as a factor with sufficient evidence that it contributes to the onset or the prognosis of Y?
• for the description of the treatment of Y?
• for the description of the course of disease Y?
• for the registration of health outcomes by donors with Y?
II. Does X meet the following requirements regarding quality:
• is there an internationally agreed definition and standard operationalization?
• is there a parameter with which the value of X can be measured that is valid, reproducible, and can be carried out with reasonable effort?
Variable X will be included in the framework if, with good argumentation, at least one of the items from question I and both items from question II can be answered with “yes.”

^a Questions are based on the information categories of Table 2. Note that the essentiality criteria only apply when the corresponding research domain is considered relevant in step 1.
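The inclusion rule of Table 3 (at least one essentiality criterion and both quality criteria answered with "yes") can be written down as a short predicate. This is a sketch under assumptions: the criterion labels, the `ItemAssessment` record, and the function name are invented for illustration, and the answers themselves still have to come from a reasoned review by the biobank team.

```python
from dataclasses import dataclass, field

# Assumed short labels for the eight essentiality questions of Table 3, part I.
ESSENTIALITY_CRITERIA = (
    "identification",
    "inclusion",
    "classification",
    "family_history",
    "risk_or_prognostic_factor",
    "treatment",
    "course",
    "health_outcomes",
)


@dataclass
class ItemAssessment:
    """Answers recorded for one candidate item X (hypothetical record type)."""

    name: str
    essential_for: set = field(default_factory=set)  # criteria answered "yes"
    has_agreed_definition: bool = False  # quality question 1
    has_valid_measurement: bool = False  # quality question 2


def include_in_framework(item: ItemAssessment) -> bool:
    """Apply the rule: at least one essentiality "yes" and both quality "yes"."""
    unknown = item.essential_for - set(ESSENTIALITY_CRITERIA)
    if unknown:
        raise ValueError(f"unknown essentiality criteria: {sorted(unknown)}")
    return bool(
        item.essential_for
        and item.has_agreed_definition
        and item.has_valid_measurement
    )


if __name__ == "__main__":
    candidates = [
        ItemAssessment("example item A", {"classification"}, True, True),
        ItemAssessment("example item B", {"course"}, False, True),
        ItemAssessment("example item C", set(), True, True),
    ]
    for candidate in candidates:
        print(candidate.name, include_in_framework(candidate))
```

The sketch does not model the footnote condition that an essentiality criterion only counts when its research domain was judged relevant in step 1; that filter would be applied before the answers are recorded.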

The stepwise procedure has been summarized in a checklist, which can be used freely by researchers who need to define a data collection framework for their clinical biobank (Supplementary Data; Supplementary Data are available online at www.liebertpub.com/bio).

Assessment of the utility of the procedure

We assessed the utility of the procedure for three different RB biobanks. Tables 4–6 show examples of data collection frameworks for clinical biobanks on breast cancer, iron disorders, and hypospadias. For three different types of research questions, items from different information categories were obtained. Combination of these items resulted in an efficient framework. However, note that more than three research questions may have to be identified to ensure that the framework is comprehensive as well.

Table 4.

Example of Defining a Data Collection Framework: Breast Cancer Biobank

Step 1—Identifying research domains and relevant questions^a	Question A: What is the prognostic value of SNP X for the 5-year risk of recurrence in premenopausal women with primary breast cancer? (Prognosis)	Question B: Is the efficacy (survival, metastasis, recurrence) of chemotherapy A dependent on the status of biomarker Y at initial diagnosis among premenopausal women with primary breast cancer? (Treatment)	Question C: What is the prevalence of SNP Z in women with primary breast cancer? (Diagnosis)
Step 2—Identifying information categories with subcategories and items	I. Identification number—biobank identification number	I. Identification number—biobank identification number	I. Identification number—biobank identification number
	I. Demographics—gender (female), year of birth	I. Demographics—gender (female), year of birth	I. Demographics—gender (female), year of birth
	II. Diagnosis—diagnosis, date of diagnosis (≥5 years ago), uni/bilateral, histology, tumor size, lymph node and hormone receptor status, histological grade	II. Diagnosis—diagnosis, date of diagnosis, uni/bilateral, histology, tumor size, lymph node and hormone receptor status, histological grade	II. Diagnosis—diagnosis, date of diagnosis, uni/bilateral, histology, tumor size, lymph node and hormone receptor status, histological grade
	II. Patient medical history—history of cancer	II. Patient medical history—history of cancer	II. Patient medical history—history of cancer
	III. Lifestyle and environmental exposures—menopausal status (premenopausal)	III. Lifestyle and environmental exposures—menopausal status (premenopausal)
	III. Details on family history—first-degree family history	III. Details on family history—first-degree family history
	IV. Details regarding treatment—details regarding treatment (surgery, radio/chemo/endocrine therapy)	IV. Details regarding treatment—details regarding treatment (surgery, radio/chemo/endocrine therapy)
	V. Course of the disease—relapse-free and overall survival after 5 years	V. Course of the disease—relapse-free and overall survival
Step 3—Reduction to an efficient framework	Combining the earlier lists leads to the following data collection framework for breast cancer, given the ambition to study the research questions as described in A, B, C:
	I. Donor identification: biobank identification number, gender,^b year of birth^b
	II. Disease characteristics: diagnosis, date of diagnosis, uni/bilateral, histology, tumor size, lymph node and hormone receptor status, histological grade
	III. Risk and prognostic factors: family history (first degree),^b menopausal status,^b history of cancer^b
	IV. Treatment: surgery (type of surgery, date of surgery), radiotherapy (dose, duration, date of start), chemo- and endocrine therapy (type, dose, date of start)
	V. Course and outcome of disease: relapse-free survival (recurrence yes/no, uni/bilateral, date of diagnosis; metastases yes/no, location, date of diagnosis), overall survival (date of death,^b cause of death^b)

All items of the defined framework (except for biobank identification number and vital status) are also useful and relevant for clinical care. Therefore, they are obviously part of the EHR. With respect to the vital status, active follow-up and/or linking to the municipal administration and the cause of death registry are necessary.

^a Note that more than three research questions may have to be identified to obtain a comprehensive framework.

^b Applicable to healthy controls.

EHR, electronic health record; I, donor identification; II, disease characteristics; III, risk and prognostic factors; IV, treatment; V, course and outcome of disease; SNP, single nucleotide polymorphism.
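The "combining the earlier lists" part of step 3 in Table 4 amounts to taking the union of the items listed per research question, category by category. The sketch below shows that mechanical part only, using a shortened subset of the Table 4 items; the function name and the data layout are assumptions, and the review of each item against the checklist of Table 3 is not modeled.

```python
def combine_item_lists(per_question):
    """Merge per-question {category: [items]} mappings without duplicates.

    Items keep the order in which they are first encountered.
    """
    framework = {}
    for items_by_category in per_question:
        for category, items in items_by_category.items():
            merged = framework.setdefault(category, [])
            for item in items:
                if item not in merged:
                    merged.append(item)
    return framework


if __name__ == "__main__":
    # Shortened subset of Table 4 (questions A, B, and C), for illustration.
    question_a = {
        "I": ["biobank identification number", "gender", "year of birth"],
        "II": ["diagnosis", "date of diagnosis", "history of cancer"],
        "III": ["menopausal status", "first-degree family history"],
        "V": ["relapse-free survival", "overall survival"],
    }
    question_b = {
        "I": ["biobank identification number", "gender", "year of birth"],
        "II": ["diagnosis", "date of diagnosis", "history of cancer"],
        "IV": ["surgery", "radiotherapy", "chemotherapy", "endocrine therapy"],
    }
    question_c = {
        "I": ["biobank identification number", "gender", "year of birth"],
        "II": ["diagnosis", "date of diagnosis"],
    }
    combined = combine_item_lists([question_a, question_b, question_c])
    for category in sorted(combined):
        print(category, combined[category])
```

A plain union does not reproduce editorial decisions visible in Table 4, such as listing history of cancer under risk and prognostic factors in the final framework; those remain a matter of judgment.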

Table 5.

Example of Defining a Data Collection Framework: Iron Disorder Biobank

Step 1—Identifying research domains and related questions^a	Question A: What is the prevalence of inherited ALAS2 mutations among patients diagnosed with myelodysplastic syndrome with ring sideroblasts (MDS-RS)? (Diagnosis)	Question B: What factors (genetic or environmental) can discriminate between TMPRSS6-heterozygotes with IRIDA and those without? (Etiology or harm)	Question C: Does the addition of vitamin C improve the response to oral iron treatment in IRIDA patients? (Treatment)
Step 2—Identifying information categories with subcategories and items	I. Identification number—biobank identification number	I. Identification number—biobank identification number	I. Identification number—biobank identification number
	I. Demographics—gender	I. Demographics—gender, year of birth, ethnicity	I. Demographics—gender, year of birth, ethnicity
	II. Diagnosis—diagnosis, subtype	II. Diagnosis—diagnosis, age at diagnosis, diagnostic tools used	II. Diagnosis—diagnosis, severity of disease
	II. Diagnostic measurements—ALAS2 genotype (if already available)	II. Diagnostic measurements—TMPRSS6 genotype (heterozygous), laboratory values (iron, hematological)	II. Patient medical history—symptoms
		II. Co-morbidity—hematological disease history	II. Diagnostic measurements—laboratory values (iron, hematological)
		III. Lifestyle and environmental exposures—BMI, nutrition, alcohol	III. Lifestyle and environmental exposures—nutrition, alcohol, smoking
		III. Details on family history—first-degree family history	IV. Details regarding treatment—type of treatment, dose, frequency, duration, previous treatments (oral/iv iron, transfusion)
		IV. Details regarding treatment—type of treatment, dose, frequency, duration
Step 3–Reduction to an efficient framework	Combining the earlier lists leads to the following data collection framework for iron disorders, given the ambition to study the research questions as described in A, B, C:
	I. Donor identification: biobank identification number, gender,^b year of birth,^b ethnicity^b
	II. Disease characteristics: diagnosis, subtype, diagnostic tools used, laboratory values (iron, hematological), age at diagnosis, symptoms, ALAS2 genotype, TMPRSS6 genotype, hematological disease history
	III. Risk and prognostic factors: BMI, nutrition, alcohol, smoking, first-degree family history^b
	IV. (Co-)Treatment: type of treatment, dose, frequency, duration, previous treatments

All items of the defined framework (except for biobank identification number and ethnicity) are also useful and relevant for clinical care. Therefore, they are obviously part of the EHR.

^a Note that more than three research questions may have to be identified to obtain a comprehensive framework.

^b Applicable to healthy controls.

BMI, body mass index; IRIDA, iron-refractory iron deficiency anemia; I, donor identification; II, disease characteristics; III, risk and prognostic factors; IV, treatment; V, course and outcome of disease.

Table 6.

Example of Defining a Data Collection Framework: Hypospadias Biobank

Step 1—Identifying research domains and related questions^a	Question A: Is placental insufficiency (using proxies, such as pre-eclampsia, multiple pregnancies, and birth weight) a risk factor for all subphenotypes of hypospadias? (Etiology or harm)	Question B: Which genes play a role in the development of common isolated hypospadias? (Etiology or harm)	Question C: Are the subphenotype of hypospadias, the related characteristics of the penis, and the presence of other birth defects predictive for the quality of life in adulthood? (Prognosis)
Step 2—Identifying information categories with subcategories and items	I. Identification number—biobank identification number	I. Identification number—biobank identification number	I. Identification number—biobank identification number
	I. Demographics—gender (male)	I. Demographics—gender, ethnicity	I. Demographics—gender
	II. Diagnosis—diagnosis, subphenotype	II. Diagnosis—diagnosis, subphenotype	II. Diagnosis—diagnosis, subphenotype
	II. Co-morbidity—diagnosis known syndromes and other birth defects	II. Co-morbidity—diagnosis known syndromes and other birth defects	II. Co-morbidity—diagnosis known syndromes and other birth defects
	III. Lifestyle and environmental exposures—pre-eclampsia during pregnancy, multiple pregnancies, birth weight, pregnancy duration, BMI during pregnancy, parity (from medical file, questionnaire, clinical registers)	III. Lifestyle and environmental exposures—pre-eclampsia during pregnancy, multiple pregnancies, birth weight	II. Diagnostic measurements—penile characteristics (e.g., chordee)
	III. Details on family history—family history of hypospadias (first/second degree)		III. Lifestyle and environmental exposures—social support (family situation, hospital visits)
			IV. Details regarding treatment—surgery, type of surgery, outcome of surgery
			V. Health outcomes—quality of life
Step 3—Reduction to an efficient framework	Combining the earlier lists leads to the following data collection framework for hypospadias, given the ambition to study the research questions as described in A, B, C:
	I. Donor identification: biobank identification number, gender,^b ethnicity^b
	II. Disease characteristics: diagnosis, subphenotype, diagnosis known syndromes and other birth defects, penile characteristics (e.g., chordee)
	III. Risk and prognostic factors: pre-eclampsia during pregnancy,^b multiple pregnancies,^b birth weight,^b social support^b (family situation, hospital visits), age at time of determination of quality of life,^b pregnancy duration,^b family history (first/second degree),^b BMI during pregnancy, parity^b
	IV. Treatment: treatment characteristics (surgery, type of surgery, outcome of surgery)
	V. Course and outcome of disease: quality of life^b

All items of the defined framework (except for biobank identification number and ethnicity) are also useful and relevant for clinical care. Therefore, they are obviously part of the EHR.

^a Note that more than three research questions may have to be identified to obtain a comprehensive framework.

^b Applicable to healthy controls.

I, donor identification; II, disease characteristics; III, risk and prognostic factors; IV, treatment; V, course and outcome of disease.

Discussion

We developed a stepwise procedure to guide researchers in their definition of a data collection framework when establishing a new clinical biobank, aimed at acquiring sufficient data to answer relevant research questions in the field while maintaining an efficient and feasible data collection process. First, relevant research domains, such as diagnosis or prognosis, should be identified. Subsequently, the information categories with accompanying subcategories and items to answer research questions in the selected domains need to be identified. As a final step, it is important to reduce the framework based on essentiality and quality. To test their practical use, the steps have been successfully applied to three existing clinical biobanks. These examples showed that because of the large overlap between items for questions in different research domains, the actual frameworks remain limited.
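The effect of this overlap can be illustrated with a minimal Python sketch. It assumes, purely for illustration, that the Step 2 item lists are held as dictionaries keyed by information category; the names `QUESTION_ITEMS` and `combine_items` are hypothetical, and the items are only a subset of those in Table 6. The sketch covers the combination part of Step 3 and does not attempt the reduction on essentiality and quality criteria, which requires expert judgment.

```python
# Subset of the Step 2 items for the hypospadias example (Table 6), per research question.
# The data structure and all names are illustrative assumptions, not part of the procedure.
QUESTION_ITEMS = {
    "A": {
        "I": ["biobank identification number", "gender"],
        "II": ["diagnosis", "subphenotype"],
        "III": ["pre-eclampsia during pregnancy", "multiple pregnancies", "birth weight"],
    },
    "B": {
        "I": ["biobank identification number", "gender", "ethnicity"],
        "II": ["diagnosis", "subphenotype"],
        "III": ["pre-eclampsia during pregnancy", "multiple pregnancies", "birth weight"],
    },
    "C": {
        "I": ["biobank identification number", "gender"],
        "II": ["diagnosis", "subphenotype", "penile characteristics"],
        "IV": ["surgery", "type of surgery", "outcome of surgery"],
        "V": ["quality of life"],
    },
}


def combine_items(question_items):
    """Merge per-question item lists into one de-duplicated list per category."""
    framework = {}
    for categories in question_items.values():
        for category, items in categories.items():
            merged = framework.setdefault(category, [])
            for item in items:
                if item not in merged:  # keep each item once, in first-seen order
                    merged.append(item)
    return framework


framework = combine_items(QUESTION_ITEMS)
listed = sum(len(items) for cats in QUESTION_ITEMS.values() for items in cats.values())
unique = sum(len(items) for items in framework.values())
print(f"{listed} listed items reduce to {unique} unique items")
```

In this subset, the 24 items listed across the three questions collapse to 13 unique items once combined, which is the kind of reduction the examples in Tables 4 to 6 show.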

Input from various disciplines is required. Clinicians, research nurses, and researchers are involved in the process of defining the clinical data. Laboratory technicians and researchers, in turn, supply input regarding lab values and biospecimens to be collected. Other biobank professionals are responsible for the definition and composition of a unique ID for each participant (i.e., Biobank ID). Donor representatives may also contribute because they can indicate donor-relevant outcomes and the feasibility of additional measurements. An active role of donors in data collection (self-administered questionnaires, diaries, smart devices), in which they upload data to the database via a secure website or mobile application, may increase the potential for efficient enrichment of a database.

A practical guideline when designing a data collection framework is to stay close to the clinical recording standards. What is obligatory to record in the electronic health record (EHR) is probably important for clinical research as well, and vice versa.^26 By linking the EHR to the research database, the data collection for a clinical biobank can be fully integrated into routine patient care, saving a considerable amount of time as information only needs to be recorded once.^27,28 Once the biobank database is connected to the EHR, a periodic synchronization is sufficient to complete and save the (coded) data in the database. If needed, research-specific information may be added. However, a prerequisite is that the data recorded in the EHR meet the quality standards for both healthcare and research.^29 With limited data quality, EHR linkage may result in retrieving incomplete or non-standardized data, which will lead to extra costs for cleaning and validating these data. These costs should be taken into account when deciding whether a certain item should be collected and, if so, whether it should be collected by using the EHR. Further, linkage of the EHR to the research database can also be a technologically complex endeavor, requiring extensive IT infrastructure (e.g., a data warehouse, querying tools, an appropriate application programming interface). Still, a database linked to the EHR will generally lead to more patient inclusions, contain fewer measurement errors, and be easier to manage financially. Investing in a good organization of the EHR as well as of the biobank database may, therefore, benefit both patient care and research.
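As a rough illustration of the periodic synchronization described above, the following Python sketch copies only the framework items from an EHR extract into the research database and reports which items are missing per donor, so that completeness can be reviewed before the data are used. Everything in it is a hypothetical assumption made for the example: the names `FRAMEWORK_ITEMS` and `synchronize`, the dictionary layout, the `RB-0001` style identifiers, and the example values. A real linkage would run through the institution's own data warehouse or interface and would also have to handle coding, consent, and access control, none of which is shown here.

```python
# Hypothetical subset of framework items to retrieve from the EHR.
FRAMEWORK_ITEMS = ["diagnosis", "subphenotype", "birth weight"]


def synchronize(ehr_extract, research_db, items=FRAMEWORK_ITEMS):
    """Copy framework items per donor from an EHR extract into the research database.

    Both arguments are plain dicts keyed by biobank ID (an assumed layout).
    Returns a dict mapping biobank ID to the framework items missing in the EHR.
    """
    missing = {}
    for biobank_id, record in ehr_extract.items():
        target = research_db.setdefault(biobank_id, {})
        for item in items:
            value = record.get(item)
            if value is None or value == "":
                # Incomplete EHR data: flag it instead of storing an empty value.
                missing.setdefault(biobank_id, []).append(item)
            else:
                target[item] = value  # overwrite with the latest EHR value
    return missing


# Made-up example records; items outside the framework are not copied.
ehr_extract = {
    "RB-0001": {"diagnosis": "hypospadias", "subphenotype": "distal",
                "birth weight": 3150, "phone number": "not copied"},
    "RB-0002": {"diagnosis": "hypospadias", "subphenotype": ""},
}
research_db = {}
missing = synchronize(ehr_extract, research_db)
print(missing)
```

Restricting the copy to the framework items keeps the research database limited to what was judged essential in Step 3, and the returned list of missing items gives a simple starting point for estimating the cleaning and validation costs mentioned above.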

We are confident that our approach to determine a data collection framework for a clinical biobank is of high interest to all involved in biobanking. After all, careful design and proper management of data are crucial when using the corresponding biospecimens for scientific research. Moreover, the definition of a data collection framework is timely and will motivate all different stakeholders to join forces to initiate standardization of the donor characteristics to facilitate the combination of different clinical biobanks in the future.

Footnotes

Acknowledgments

The authors thank the collaborators within the Radboud biobanks “Aetiological research into Genetic and Occupational/Environmental Risk Factors for Anomalies in Children (AGORA),” “Breast Cancer” and “Iron” for their help in the compilation of the practical examples that have been used for the framework propositions in this article.

Author Disclosure Statement

No conflicting financial interests exist.

References

1. Parodi. Biobanks: A definition. In: Mascalzoni (ed). Ethics, Law and Governance of Biobanking: National, European and International Approaches. Dordrecht, Netherlands: Springer Netherlands; 2015: 15–19.

2. Kinkorová. Biobanks in the era of personalized medicine: Objectives, challenges, and innovation. EPMA J, 2016; 7:4.

3. The Organisation for Economic Co-operation and Development. OECD Guidelines on Human Biobanks and Genetic Research Databases. Paris; 2009.

4. 2012 Best practices for repositories collection, storage, retrieval, and distribution of biological materials for research international society for biological and environmental repositories. Biopreserv Biobank, 2012; 10:79–161.

5. Davey Smith, Ebrahim, Lewis, et al. Genetic epidemiology and public health: Hope, hype, and future prospects. Lancet, 2005; 366:1484–1498.

6. Quinlan, Mistry, Bullbeck, et al. A data standard for sourcing fit-for-purpose biological samples in an integrated virtual network of biobanks. Biopreserv Biobank, 2014; 12:184–191.

7. Simeon-Dubach, Watson. Biobanking 3.0: Evidence based and customer focused biobanking. Clin Biochem, 2014; 47:300–308.

8. Quinlan, Groves, Jordan, et al. The informatics challenges facing biobanks: A perspective from a United Kingdom biobanking network. Biopreserv Biobank, 2015; 13:363–370.

9. Quinlan, Gardner, Groves, et al. A data-centric strategy for modern biobanking. Adv Exp Med Biol, 2015; 864:165–169.

10. Wilkinson, Dumontier, Aalbersberg, et al. The FAIR guiding principles for scientific data management and stewardship. Sci Data, 2016; 3:160018.

11. Norlin, Fransson, Eriksson, et al. A minimum data set for sharing biobank samples, information, and data: MIABIS. Biopreserv Biobank, 2012; 10:343–348.

12. Merino-Martinez, Norlin, van Enckevort, et al. Toward global biobank integration by implementation of the minimum information about biobank data sharing (MIABIS 2.0 Core). Biopreserv Biobank, 2016; 14:298–306.

13. Public Population Project in Genomics and Society. P³G. 2017. Available at: http://p3g.org (last accessed July 14, 2017).

14. Fortier, Burton, Robson, et al. Quality, quantity and harmony: The DataSHaPER approach to integrating data across bioclinical studies. Int J Epidemiol, 2010; 39:1383–1393.

15. Fortier, Doiron, Little, et al. Is rigorous retrospective harmonization possible? Application of the DataSHaPER approach across 53 large studies. Int J Epidemiol, 2011; 40:1314–1328.

16. Observational Health Data Sciences and Informatics. 2017. OMOP common data model. Available at: https://ohdsi.org/data-standardization/the-common-data-model (last accessed July 14, 2017).

17. Manniën, Ledderhof, Verspaget, et al. The Parelsnoer institute: A national network of standardized clinical biobanks in the Netherlands. Open J Bioresour, 2017; 4:3.

18. Manders, Siezen, Gazzoli, et al. Radboud biobank: A central facility for prospective clinical biobanking in the Radboud university medical center, Nijmegen. OA Epidemiol, 2014; 2:4.

19. Straus SE, Glasziou, Haynes. Evidence-Based Medicine, How to Practice and Teach EBM. 3rd ed. Edinburgh, United Kingdom: Churchill Livingstone; 2005.

20. Van Rooij, van der Zanden, Bongers, et al. AGORA, a data- and biobank for birth defects and childhood cancer. Birth Defects Res A Clin Mol Teratol, 2016; 106:675–684.

21. World Health Organization. 2016. International classification of diseases. Available at: http://who.int/classifications/icd/en (last accessed July 14, 2017).

22. SNOMED International. 2017. SNOMED CT: The global language of healthcare. Available at: http://snomed.org/snomed-ct (last accessed July 14, 2017).

23. Forrey, McDonald, DeMoor, et al. Logical observation identifier names and codes (LOINC) database: A public use set of codes and names for electronic reporting of clinical laboratory test results. Clin Chem, 1996; 42:81–90.

24. Van't Riet, Schram, Abbink, et al. The diabetes pearl: Diabetes biobanking in the Netherlands. BMC Public Health, 2012; 12:949.

25. Biolink. BBMRI rainbow project Biolink NL. Available at: http://biolink-nl.eu/Biolink_NL/home.html (last accessed July 14, 2017).

26. Douglas, Scheltens. Rethinking biobanking and translational medicine in the Netherlands: How the research process stands to matter for patient care. Eur J Hum Genet, 2015; 23:736–738.

27. Bowton, Field, Wang, et al. Biobanks and electronic medical records: Enabling cost-effective research. Sci Transl Med, 2014; 6:234cm3.

28. Boeckhout, Scheltens, Manders, et al. Patients to learn from. J Clin Transl Res, 2017; 3:1.

29. Richesson, Krischer. Data standards in clinical research: Gaps, overlaps, challenges and future directions. J Am Med Inform Assoc, 2007; 14:687–696.
