Informatics Management of Tumor Specimens in the Era of Big Data: Challenges and Solutions

Abstract

Biomedical data bear the potential to facilitate personalized diagnosis and precision treatment. In the era of Big Data, high-quality annotation of human specimens has become the primary mission of biobankers, especially for tumor biobanks with large amounts of “omics” and clinical data. However, the lack of agreed-upon standardization and the gap among heterogeneous databases make information application and communication a major challenge. International efforts are underway to develop national projects on informatics management. The aim of this review is to provide references in specimen annotation to regulate and take full advantage of biological and biomedical information. First, critical data categories that are vital for specimen applications, including sample attributes, clinical data, preanalytical variations, and analytical records, are systematically listed for subsequent data mining. Second, current standards and guidelines related to biospecimen information are reviewed, and proper standards for tumor biobanks are recommended. In particular, commonly-used approaches and functionalities of data management are summarized and discussed. This review highlights the importance of informatics management of tumor specimens, defines critical data types, recommends data standards, and presents the methodologies of data harmonization for biobankers to reach high quality annotation of biospecimens.

Introduction

Biobanks are repositories of biological samples and relevant data engaged in the advancement of precision medicine and biomedical research.¹ With the emergence of Big Data, massive phenotypic and “omics” data derived from biological samples have enabled more in-depth biomedical research, leading to a better understanding of human disease and the transformation of scientific discoveries into individually tailored bedside applications.^2–4 Biospecimen information combined with correlated clinical and analytical data is a vital determinant of sample value in scientific research.⁵ A consensus has been reached that the more accurate and extensive the annotation of human specimens, the more valuable and productive the scientific studies will become.⁶

However, the current situation of bioinformatics management of biobanks demands improvement. Data from multiple databases, including pathological systems, electronic medical records, radio- and chemotherapy systems, laboratory results, image reports, molecular sequencing, and research-generated data, are stored separately and are difficult to integrate and communicate.^7,8 Moreover, these databases are often in different formats, ranging from unstructured free text and semistructured forms to structured text fields, and some health care institutes still rely on paper records and manual data entry, which is time consuming and prone to recording errors.^9,10 Preanalytical variations, which occur during the whole life span of samples and are critical in creating high-quality samples, are difficult to track with current informatics infrastructures.¹¹

Guidelines such as the Best Practices from the International Society for Biological and Environmental Repositories (ISBER) state that the common objectives of bioinformatics management are to be compatible with both clinical and biospecimen data and to be capable of establishing biospecimen networks that can exchange information.¹² However, they do not provide detailed procedures, such as what information to collect, extract, transform, and integrate, and how this can be supported by specific IT systems.¹³ Consequently, individual biobanks collect inconsistent data, resulting in communication barriers that severely hinder the application of clinical samples. Information inconsistency has become one of the most challenging issues that biobankers face.

International efforts have been made to achieve consistent data and provide standardized programmatic access to multiple databases.^14,15 Based on this approach, this review aims to highlight bioinformatics management with an emphasis on specimen annotation. We summarize key data categories of specimen annotation for subsequent data mining, discuss current data standards and guidelines, and recommend methodologies of bioinformatics management built on reported experiences. To our knowledge, this is the first review article to publish such solutions for biospecimen annotation and data harmonization.

Critical Information Categories

Defining information categories (types of information) is the first step in specimen annotation. The published guidelines Biospecimen Reporting for Improved Study Quality (BRISQ) mainly focus on preanalytical factors influencing research results, while Minimum Information About BIobank data Sharing (MIABIS) does not contain data elements relevant to medical areas or data on individual subjects and samples.^16,17 Individual biobanks include information categories and their subcategories that meet the needs of most hypotheses in clinical research, but the content is far from enough to fully annotate the specimen.^18,19 According to previous reports and biobank practices,^19–22 a detailed list of data categories highly related to sample application is summarized in Table 1 and explained below.

Table 1.

Main Data Categories for Biospecimen Annotation

Data category	Data element	Description
Specimen attribute information	Barcode	One-of-a-kind code composed of numbers, along with a specific pattern of stripes that represents a particular product
	Storage center	The preservation center to which the sample belongs
	Sample ID	Unique sample identification number, it can be random, prefabricated, or regular coding
	Sample type	Values defining type of samples, such as plasma, serum, urine, frozen tissue, paraffin tissue, and so on.
	Study design	The type of study design, such as case–control, cohort, and so on.
	Collection mechanism	How the sample was taken, such as fine needle aspiration, surgical, biopsy, and so on.
	Stabilization method	The initial process by which biospecimens were stabilized during collection, such as snap freezing, RNAlater pretreated, fixation, and so on.
	Number of aliquots	Number of the samples to be packed
	Volume	The amount of each liquid sample
	Mass	The size or weight of each solid sample
	Concentration	The relative amount of a given substance contained within a solution, such as DNA concentration
	Array and locations of tissue chips	The distribution of tissue chips and their corresponding specimen information
	Anatomic tumor site	The part of the body from which the sample was taken
	Location (site-freezer-layer-rack-slot)	Specific location of sample storage
	Date of collection	The time when the sample was taken
	Date of storage	The time when the sample was stored
	Method of enrichment of relevant components	Method for separating the major components of a sample, such as centrifugation, microdissection, and so on.
Clinical data	Treatment history
	Past and current medication	Medication administration record, such as neo-adjuvant therapy
	Response to therapy	Therapeutic response, such as well tolerated
	Secondary tumors	Secondary tumors or metastases of the primary tumor, and multiple tumors
	Concomitant disease	A second illness occurring at the same time as a primary illness
	Diagnosis
	Gross diagnosis	Diagnosis obtained by gross examination
	Pathological diagnosis	Macroscopic and microscopic pathological evaluation
	Clinical diagnosis	Clinical evaluation based on anamnesis and physical examination
	Diagnosis code (ICD-10)	A code used to specify a medical diagnosis
	Date of diagnosis	The time when the patient was diagnosed
	Pathological examination
	Tumor morphology	Microscopic observation of the tumor and its matrix composition
	TNM staging	Cancer staging system describing the cancer extent
	Tumor grade	The description of a tumor based on how abnormal the tumor cell and the tumor tissue look under a microscope
	Inspection records
	Image reports	Image reports of the patients' imaging examination
	Hematologic examination results	The result of the patients' hematologic examination
	Tumor biomarkers	The result of the patients' tumor biomarker examination
	Epidemiologic data
	Age at time of specimen collection	The patients' age at the time the specimen was collected
	History of cancer disease	The history of the patients' other cancer disease
	Evidence for familial history of cancer	The evidence for familial history of cancer of the patients
	Exposure and risk factors	Any attribute, characteristic, or exposure of an individual that increases the likelihood of developing cancer
	Demographic data
	Gender	The patients' gender
	Place of residence	The place of residence of the patients
	Ethnicity	The ethnicity of the patients
	Place of origin	The birthplace of the patients
	Age	The patients' age
	Occupation	The job or profession of the patients
	Follow-up data
	Relapse date	The date of the recurrence of the cancer
	Relapse type	The type of the relapse, such as localized, distant
	Metastatic sites	The part of the body or the organs where metastasis occurs
	Date of death	The date of the patients' death
	Duration of global survival	The survival time of the patients after treatment
	Duration of survival without relapse	The survival time of the patients after treatment without relapse
	Last investigation date	The date on which the patients were last examined
Preanalytical variations	General
	Sample collector	How the specimens were collected
	Number of freeze–thaws	The number, estimate, or range thereof of thaw–refreeze events to which biospecimens were subjected before analysis
	Duration of thaw events	The amount of time or range thereof the biospecimens spent thawed before the final thaw before processing
	Time of last thaw	The time of the last thawing of the samples
	Temperature of thawing	The temperature at which biospecimens were kept between unfreezing and analysis
	Processing time	The time of the samples being processed
	Prealiquoting temperature	The temperature at which biospecimens were kept before aliquoting
	Cryopreservation method	The medium and temperature of the samples being cryopreserved
	Noncryopreservation method	The medium and temperature of the samples under noncryopreserved condition
	Storage duration (years)	The duration of storage
	Shipping temperature	Temperature maintained during shipping
	Solid
	Warm ischemia time	The length of time the specimen is only partially perfused due to vessel ligation during surgery, before complete removal
	Cold ischemia time	The time the biospecimen spends after complete removal from the patient but before being placed into fixative
	Liquid
	Centrifugation time	The time of centrifugation
	Precentrifugation temperature	The temperature at which the samples were kept before centrifugation
	Centrifugation temperature	The temperature at which the samples were centrifugated
	Centrifugation speed	The speed the sample was centrifugated
	Centrifugal times	The number of times the samples were centrifugated
	Second centrifugation time	The time of second centrifugation
Quality assessment results	Digitally scanned documents	A digitally scanned image associated with a sample, such as HE slides of tissues
	Composition assessment	Microscopic analysis of the tumor and its matrix composition, such as percentage of tumor/necrosis
	Purity, concentration, and integrity of derivatives	Results of the purity, concentration, and integrity of derivatives, such as RNA integrity number, DNA purity, hemolysis assessment, and so on.
	Percent viability	Percentage of living cells
	Biomarker expression levels	The expression of the biomarkers tested
Analytical records	Genomic level	Sequencing results of the samples, results of DNA methylation, copy number alterations
	Transcriptomic level	Results of alternative messenger RNA splicing, the expression levels of mRNA
	Proteomic level	Results of protein microarray, mass spectrometry

HE, hematoxylin-eosin; ICD-10, The International Statistical Classification of Diseases and Related Health Problems 10th Revision.

Attribute information

Attribute information serves to identify and track biospecimens.²⁰ It covers sample types, disease profiles, sample ID, date of collection, number of aliquots, sample volume, tumor sites, storage sites, location (site-freezer-layer-rack-slot), and more.¹² In addition, specimens derived from the refining of original specimens also generate a series of information, such as concentration and purity of nucleic acids, and array and locations of tissue chips, which must be recorded and saved under the original catalogs for further management and application.^21,23

Clinical data

As stated in the National Cancer Institute (NCI) Best Practices, “information linked to biospecimens may include demographic data, lifestyle factors, environmental and occupational exposures, cancer history, structured pathology data, additional diagnostic studies, information on initial staging procedure, treatment data, and any other data relevant to tracking a research participant”.²⁰

Based on published guidelines and individual experiences,^{8,19,22,24–30} seven main clinical datasets were included, namely, the treatment history demonstrating past and current medications, radiotherapy, and surgery; diagnostic descriptions revealing gross, pathological, and clinical diagnoses during treatment; pathological investigations stating cancer types, grades, and stages; inspection records indicating image and hematologic examination results; epidemiologic data illustrating familial history of cancer and exposure and risk factors; demographical data elucidating basic population information; and follow-up data showing relapse and survival details (Table 1).

Preanalytical variations

Preanalytical variations are defined as any variation taking place between the time of specimen collection or ischemia (for organs) and the time of sample analysis. They are essential for evidence-based practices and are necessary to provide effective and efficient interconnectivity and interoperability between biobanks for credentialing research.³¹ As ISBER's Best Practices point out, any information about the specimen being compromised in any way should be recorded and made available to the user.¹² Standard preanalytical options have been published by the ISBER Biospecimen Science Working Group for application to biospecimen quality control. They include seven Standard PRE-analytical Code (SPREC) variables each for liquid and solid tissue samples.^32,33 In addition, BRISQ also mentions critical preanalytical elements involving human specimens.¹⁶

According to these standards or guidelines, important variations in the life cycle of tumor samples are summarized in Table 1.
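As a small worked example of how two of the time-sensitive variables in Table 1 could be derived from recorded timestamps, the Python sketch below computes warm and cold ischemia times following the definitions given in the table. The function name and the three timestamp parameters are illustrative assumptions, not fields prescribed by SPREC or BRISQ.

```python
from datetime import datetime


def ischemia_minutes(ligation: datetime, removal: datetime, fixation: datetime) -> tuple[float, float]:
    """Return (warm, cold) ischemia times in minutes.

    Warm ischemia: vessel ligation until complete removal of the specimen.
    Cold ischemia: complete removal until the specimen is placed into fixative.
    The parameter names are hypothetical; a real system would map them to its own fields.
    """
    if not ligation <= removal <= fixation:
        raise ValueError("timestamps must be in the order ligation, removal, fixation")
    warm = (removal - ligation).total_seconds() / 60
    cold = (fixation - removal).total_seconds() / 60
    return warm, cold
```

Recording the raw timestamps rather than only the derived durations keeps the option of recomputing the values if a definition is later refined.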

Quality assurance results

In-house tests carried out by biorepositories to assess and control biospecimen quality are essential for qualified research.³⁴ Any results produced by evaluating the biomolecular analytes of the samples are quality assurance results.^32,35 These data reveal the actual quality status of the samples, thus allowing researchers to differentiate between the pros and cons when retrieving samples and further improve the quality of biomedical studies.³⁶

Analytical records

With the development of high-throughput methods for “omics” studies, a large number of physical samples have been analyzed and transformed into digital information. These abundant experimental records, combined with a clinicopathological database, can reveal potential biomarkers and clinical phenotypes directly, which significantly expand our understanding of carcinogenesis and guide personalized medicine.^37–39 In addition, by taking advantage of a stored analytical database, unnecessary tests and ineffective treatments can be avoided, thereby becoming cost effective to both the patients and health care practitioners.⁴⁰

Recommendations for Data Standardization

To minimize data heterogeneity, standards and guidelines have been established with instructions on how to create more consistent and standardized information to optimally annotate biospecimens. The SPREC is a seven-element code corresponding to the most critical preanalytical variables of fluid and solid biospecimens.³³ It lists the data fields and their abbreviations or codes. SPREC is a good precedent for data standards, but it only covers preanalytical variations and is not sufficient for complete data standardization. The BRISQ guidelines include general information for consistent documentation of classes of biospecimens and factors that could influence research results. The list is prioritized into three tiers according to relative importance. The third tier contains “additional items to report,” which includes information unlikely to influence research results.¹⁶ However, its data elements, such as ischemic time, disease status, and clinical characteristics of patients, are crucial in answering scientific questions. MIABIS is a dataset consisting of 52 attributes of minimum information facilitating the sharing of samples and data among biobanks on a global scale, with the limitation that it does not include data on medical, quality, and ethical levels.¹⁷ In addition, several data elements of MIABIS that describe biobanks and biobankers, such as contact phone and email, are not essential for specimen usage and can be omitted or marked as “additional items.”

A minimum data set and associated standard available in BRISQ, SPREC, and MIABIS have been included and published by World Health Organization/The International Agency for Research on Cancer.⁴¹ However, the data set only integrated and listed the data elements presented by the above standard and guidelines. A data standard for sourcing fit-for-purpose biological samples in an integrated virtual network of biobanks has been published based on the above guidelines.⁴² The data categories covering comprehensive information are appropriate for annotation of sample networks and large scale biobanks.

According to the data categories summarized in Table 1, combining the standardized formats of data items presented in the above standards and guidelines, we recommend standardized formats of data elements suitable and critical for tumor specimens and cancer research in this section. As shown in Table 2, existing guidelines or standards mainly focus on sample attributes and their associated preanalytical variations, with little reference to data related to clinical, prognostic, and analytical outcomes, which are vital resources for scientific research. The standardization of these data still needs further exploration and efforts. While BRISQ provides a referential standardized form for most of the data elements, SPREC provides only a limited number of preanalytical variables, and MIABIS focuses more on sample attribute information. It is suggested that BRISQ is more comprehensive in terms of data standardization.

Table 2.

Recommended Standardized Format of Critical Data Elements According to Standard PRE-analytical Code, Biospecimen Reporting for Improved Study Quality, and Minimum Information About BIobank Data Sharing

Data category	Data element	Standardized format	Source
Specimen attribute information	Barcode	Barcode	MIABIS
	Storage center	NA	NA
	Sample ID	Text identifier or barcode	MIABIS
	Sample type	Solid tissue, whole blood, serum, cells, etc.	BRISQ, MIABIS
	Collection type	Case–control, cohort, longitudinal, etc.	MIABIS
	Collection mechanism	Fine needle aspiration, preoperative blood draw, etc.	SPREC, BRISQ
	Number of aliquots	NA	NA
	Volume	The amount in each liquid biospecimen sample	BRISQ
	Mass	The approximate size or weight of solid biospecimen samples processed (e.g., cubes ∼0.5 cm on a side, 0.5 g)	BRISQ
	Concentration	NA	NA
	Anatomic tumor site	Organ(s) of origin or site of blood draw	BRISQ
	Location (site-freezer-layer-rack-slot)	NA	NA
	Date of collection	The time when the sample was taken	MIABIS
	Date of storage	Time between acquisition and use	BRISQ
	Method of enrichment of relevant components	Laser-capture microdissection of tissue, block selection for region of lesion, centrifugation of blood	BRISQ
Clinical data	Treatment history
	Past and current medication	Neoadjuvant therapy, other current or past medical treatments	BRISQ
	Response to therapy	NA	NA
	Secondary tumors	NA	NA
	Concomitant disease	NA	NA
	Diagnosis
	Gross diagnosis	NA	NA
	Pathological diagnosis	Macroscopic and microscopic pathological evaluation	BRISQ
	Clinical diagnosis	Clinical evaluation based on anamnesis and physical examination	BRISQ
	Diagnosis code (ICD-10)	NA	NA
	Date of diagnosis	Time between diagnosis and sampling	BRISQ
	Pathological examination
	Tumor morphology	NA	NA
	TNM staging	NA	NA
	Tumor grade	NA	NA
	Inspection records
	Image reports	NA	NA
	Hematologic examination results	NA	NA
	Tumor biomarkers	NA	NA
	Epidemiologic data
	Age at time of specimen collection	Time between diagnosis and sampling	BRISQ
	History of cancer disease	NA	NA
	Evidence for familial history of cancer	NA	NA
	Exposure and risk factors	Environmental factors (e.g., smoking status)	BRISQ
	Demographic data
	Gender	Value list	BRISQ
	Place of residence	NA	NA
	Ethnicity	NA	NA
	Place of origin	NA	NA
	Age	Number	BRISQ
	Occupation	NA	NA
	Follow-up data
	Relapse date	NA	NA
	Relapse type (localized, distant…)
	Metastatic sites
	Date of death
	Duration of global survival
	Duration of survival without relapse
	Last investigation date
Preanalytical variations	General
	Sample collector	Value list, e.g., sodium EDTA and sodium heparin	SPREC, BRISQ
	Number of freeze–thaws	The number, estimate, or range thereof of thaw–refreeze events to which biospecimens were subjected before analysis	BRISQ
	Duration of thaw events	The amount of time or range thereof the biospecimens spent thawed before the final thaw before processing	BRISQ
	Time of last thaw	The time or range of times between unfreezing and analysis	BRISQ
	Temperature of thawing	The temperature at which biospecimens were kept between unfreezing and analysis	BRISQ
	Processing time	NA	NA
	Prealiquoting temperature	NA	NA
	Cryopreservation method	Value list, including fixation medium and storage temperature	SPREC, BRISQ
	Noncryopreservation method	Value list, including fixation medium and storage temperature	SPREC, BRISQ
	Storage duration (years)	Value list, presented of container and temperature	SPREC, BRISQ
	Shipping temperature	Temperature maintained during shipping	BRISQ
	Solid
	Warm ischemia time	Value list, presented as ranges in minutes	SPREC
	Cold ischemia time	Value list, presented as ranges in minutes	SPREC
	Liquid
	Centrifugation time	Value list, presented as time between collection and processing in hours	SPREC
	Precentrifugation temperature	NA	NA
	Centrifugation temperature	Value list	SPREC
	Centrifugation speed	Value list	SPREC
	Centrifugal times	NA	NA
	Second centrifugation time	Value list, presented as time between centrifugation and storage in hours	SPREC
Quality assessment results	Digitally scanned documents (e.g., HE slides of tissues)	Any methods used to assess the quality of the biospecimens and the results	BRISQ
	Composition assessment (e.g., Percentage of tumor/necrosis)
	Purity, concentration, and integrity of derivatives (e.g., RNA integrity number, DNA purity, hemolysis assessment),
	Percent viability
	Biomarker expression levels
Analytical records	DNA or RNA sequence	NA	NA
	expression levels of mRNA or protein
	Gene and protein microarray
	DNA methylation
	Copy number alterations
	Alternative messenger RNA splicing

BRISQ, Biospecimen Reporting for Improved Study Quality; EDTA, ethylene diamine tetraacetic acid; MIABIS, Minimum Information About BIobank data Sharing; SPREC, Standard PRE-analytical Code.

Principal Technological Approaches and Functionalities

For methods of bioinformatics management, considerable efforts have been made by international organizations and individual biobanks. In 2003, the National Cancer Institute Center for Bioinformatics (NCICB) established the NCICB core infrastructure for biomedical informatics (caCORE), a robust infrastructure for data management and integration that supports advanced biomedical applications.⁴³ Subsequently, in 2004, the NCI launched the cancer Biomedical Informatics Grid (caBIG) to develop a federation of interoperable research information systems.^44–46 The caBIG platform was later changed to the NCI Genomic Data Commons, which aims to harmonize both the genomic and clinical data across programs and projects.⁴⁷ caTissue, which evolved from caBIG, was launched to capture and represent highly granular, hierarchically structured data for biospecimen processing, quality assurance, tracking, and annotation.⁴⁸ Individual biobanks have also attempted to develop biospecimen data management systems (BDMSs) to merge clinical data with biospecimen information for research purposes.^19,26 These projects and practices exemplify the implementation of data harmonization, programmatic access, and data integration.

In this section, essential approaches and functionalities of information management for attribute and external data are discussed and summarized. The architecture of the corresponding informatics management module is shown in Figure 1.

FIG. 1.

Technological architecture of data management. Clinical data were converted from free text to structured data by NLP. The structured data were then standardized into XML data using CDE tools. An ETL system was then applied to extract, transform, and load the uniform data into BDMSs. At the same time, external analytical data were imported to BDMSs and integrated with existing biospecimen-related data. Finally, all records were deidentified to nonhuman subject data according to data protection rules. BDMS, biospecimen data management system; CDE, common data element; ETL, extract, transform, and load; NLP, natural language processing; XML, extensible markup language.

Structuring and standardization of data

To deal with mass data stored in nonstandard forms, a growing number of published works have applied Natural Language Processing (NLP) to convert the free text into standard structured text fields.^19,48 NLP is any computer-based algorithm that handles, augments, and transforms natural language to extract target information so that it can be represented for computation.^14,29,49,50

Creating an NLP system involves exploiting three common related NLP resources: extraction tools, ontologies, and corpora.⁴⁹ Typical clinical NLP pipelines include the following tasks: (1) section identification, (2) medical named entity recognition, and (3) negation and assertion classification. Ontologies provide a knowledge base to reference how various concepts are related to one another, for example, a medical dictionary. Corpora are collections of clinical text that can be used to test an NLP system. As exemplified in the NCI programs, the cancer text information extraction system and the clinical annotation engine were constructed to meet NLP and clinical data annotation needs in caTissue Core.⁴⁸ Although the sensitivity of personal health information limits the availability of such text, the development of deidentified data sets has helped to address this challenge. Detailed methods of NLP in oncology have been introduced in previous reviews.^14,46,51
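The three pipeline tasks named above can be illustrated with a deliberately simple, rule-based Python sketch. It is not the method used by caTissue or any cited system: the section-header pattern, the three-term dictionary standing in for an ontology, and the negation cues are all hypothetical, and a production pipeline would rely on trained extraction tools and a curated vocabulary.

```python
import re

# Hypothetical mini-ontology: surface term -> normalized concept label.
ONTOLOGY = {
    "adenocarcinoma": "Adenocarcinoma",
    "metastasis": "Metastasis",
    "necrosis": "Necrosis",
}
# Hypothetical negation cues; a cue before a term in the same sentence negates it.
NEGATION_CUES = ("no ", "without ", "negative for ")
# Assumes section headers are upper-case words followed by a colon at line start.
SECTION_HEADER = re.compile(r"^([A-Z][A-Z ]+):\s*", re.MULTILINE)


def split_sections(report: str) -> dict[str, str]:
    """Task 1: split a report into {section name: section text}."""
    parts = SECTION_HEADER.split(report)
    # parts = [preamble, name1, text1, name2, text2, ...]
    return {parts[i].strip(): parts[i + 1].strip() for i in range(1, len(parts) - 1, 2)}


def extract_entities(text: str) -> list[dict]:
    """Tasks 2 and 3: dictionary lookup of entities plus a naive negation check."""
    findings = []
    for sentence in re.split(r"(?<=[.;])\s+", text):
        lowered = sentence.lower()
        for term, concept in ONTOLOGY.items():
            pos = lowered.find(term)
            if pos == -1:
                continue
            negated = any(cue in lowered[:pos] for cue in NEGATION_CUES)
            findings.append({"concept": concept, "negated": negated})
    return findings
```

Substring matching of this kind misses synonyms, misspellings, and long-range negation, which is why the resources listed above (extraction tools, ontologies, and corpora for testing) matter in practice.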

To promote data communication and sharing across different databases, NCI and many other institutes have supported a broad initiative to standardize vocabularies and ontologies to create the common data elements (CDEs) applied to cancer research data capture and reporting.^43,52–54 CDEs are combinations of precisely defined questions (variables) paired with a specified set of responses to the question that are common to multiple datasets or used across different studies.⁵⁵ Although there are no formal international specifications governing the construction or use of CDEs, the NCI's work on data vocabularies and ontologies can serve as a reference for data standardization and harmonization.
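To make the idea of a CDE concrete, the following Python sketch models one as a question paired with a closed set of permissible responses and validates incoming values against it. The class, the field names, and the example value set are assumptions for illustration only and are not taken from any NCI registry.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CommonDataElement:
    """A precisely defined question paired with its permitted responses."""
    name: str
    question: str
    permissible_values: frozenset

    def validate(self, response: str) -> str:
        """Return the normalized response, or raise if it is not permitted."""
        normalized = response.strip().lower()
        if normalized not in self.permissible_values:
            raise ValueError(f"{response!r} is not a permissible value for {self.name}")
        return normalized


# Hypothetical CDE; the value list mirrors the sample types named in Table 1.
SAMPLE_TYPE = CommonDataElement(
    name="sample_type",
    question="What type of biospecimen was collected?",
    permissible_values=frozenset({"plasma", "serum", "urine", "frozen tissue", "paraffin tissue"}),
)
```

Rejecting values outside the permitted set at entry time is what allows records from different sources to be compared without later manual reconciliation.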

Definition and integration of data sources

Data sources are defined by converting the hierarchy to valid extensible markup language (XML) as specified by the XML schema definition. XML makes it possible to connect registered data types in an implementation- and platform-agnostic manner.⁴⁴ An XML element generally includes a connection string, the type of Structured Query Language (SQL) engine, and a Transact-SQL statement, as well as a structural metadata scheme of a data table.^19,46 The biospecimen annotation is exported into an XML document, including the patient identification number. Afterward, the biospecimen annotation is linked to related clinical information using a tool that combines the clinical data with the corresponding biospecimen annotation by matching the patient identification number, that is, the patient's medical record number (MRN).^56,57
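The export-and-link step described above can be sketched in Python with the standard library's XML module. The element and attribute names (`biospecimens`, `biospecimen`, `mrn`, `sample_id`, `sample_type`) are hypothetical, and the sketch omits the connection string, SQL engine type, and metadata scheme that a real data source definition would carry.

```python
import xml.etree.ElementTree as ET


def export_annotation(samples: list[dict]) -> str:
    """Export biospecimen annotation to an XML string, keyed by patient MRN."""
    root = ET.Element("biospecimens")
    for sample in samples:
        node = ET.SubElement(root, "biospecimen", {"mrn": sample["mrn"]})
        for key in ("sample_id", "sample_type"):
            ET.SubElement(node, key).text = sample[key]
    return ET.tostring(root, encoding="unicode")


def link_clinical(xml_text: str, clinical_by_mrn: dict[str, dict]) -> list[dict]:
    """Attach clinical data to each exported biospecimen by matching the MRN."""
    linked = []
    for node in ET.fromstring(xml_text).iter("biospecimen"):
        record = {child.tag: child.text for child in node}
        record["mrn"] = node.get("mrn")
        # Stays None when no clinical record exists for this MRN.
        record["clinical"] = clinical_by_mrn.get(record["mrn"])
        linked.append(record)
    return linked
```

Keeping unmatched specimens in the output with an empty clinical field, rather than dropping them, makes gaps in the clinical source visible to the curator.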

As multiple databases are connected to the BDMSs, an extract, transform, and load (ETL) system is used to put these databases on a grid and to create topic-specific subsets of the source databases.^56,58–60 The system uses a series of toolkits to extract data from multiple heterogeneous databases (that are not optimized for analytics), transform the data into designed formats, and flexibly load the derived information into a single data warehouse.⁵⁶ The exact steps in that process might differ from one ETL tool to the next, but the end result is the same, namely, integrated data for analysis.
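A minimal ETL pass can be sketched with Python's built-in `sqlite3` module. The source table `pathology`, the warehouse table `diagnosis`, and all column names are hypothetical, and the transformation shown (trimming, lower-casing, and truncating timestamps to dates) is only one example of reshaping data into a designed format.

```python
import sqlite3


def run_etl(source: sqlite3.Connection, warehouse: sqlite3.Connection) -> int:
    """Copy diagnoses from a source database into a warehouse table; return rows loaded."""
    # Extract: read raw rows from the (hypothetical) pathology system table.
    rows = source.execute("SELECT mrn, dx_text, dx_date FROM pathology").fetchall()

    # Transform: drop incomplete rows, normalize text, keep only the date part.
    cleaned = [
        (mrn.strip(), dx.strip().lower(), (date or "")[:10])
        for mrn, dx, date in rows
        if mrn and dx
    ]

    # Load: write the uniform rows into the warehouse.
    warehouse.execute(
        "CREATE TABLE IF NOT EXISTS diagnosis (mrn TEXT, diagnosis TEXT, diagnosis_date TEXT)"
    )
    warehouse.executemany("INSERT INTO diagnosis VALUES (?, ?, ?)", cleaned)
    warehouse.commit()
    return len(cleaned)
```

In a real deployment each source system would have its own extract and transform functions feeding the same load step, which is where the differences between ETL tools mentioned above tend to show.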

Biospecimen and analytical data entry

Once the recruited patients are determined, their basic clinical information (such as patient demographics and diagnoses) stored in the Hospital Information System or the electronic health record is linked and saved within the BDMSs by the patient's MRN.^19,61 Empty fields are preset for subsequent entry of sample attribute data. At the time of specimen collection, the attribute data of the samples are entered and saved with the previous patient information.⁶² In most circumstances, the software also needs to be capable of tracking processing details and preanalytical events, for example, centrifugation conditions, freeze–thaw cycles, and storage temperature, and especially of recording times for time-sensitive properties to support evidence-based practices.⁵¹ Because analytical data arrive in a variety of formats, one good practice is to accept the source format and bulk import the data using a set of mapping tables that map columns in the source data (patient ID, time point, analytical result) to BDMS data elements.^60,62,63
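The mapping-table approach to bulk import can be sketched as follows in Python. The source column names (`PatientID`, `Timepoint`, `Result`) and the target element names are hypothetical; each laboratory file format would get its own mapping while the import routine stays unchanged.

```python
import csv
import io

# Hypothetical mapping table: column name in the laboratory file -> BDMS data element.
COLUMN_MAP = {"PatientID": "mrn", "Timepoint": "time_point", "Result": "analytical_result"}


def import_analytical(csv_text: str, column_map: dict[str, str]) -> list[dict]:
    """Read a delimited export and rename its columns to BDMS data elements."""
    reader = csv.DictReader(io.StringIO(csv_text))
    missing = set(column_map) - set(reader.fieldnames or [])
    if missing:
        raise KeyError(f"source file lacks mapped columns: {sorted(missing)}")
    # Columns that are not in the mapping table are ignored.
    return [{target: row[source] for source, target in column_map.items()} for row in reader]
```

Failing early when a mapped column is absent is a design choice of this sketch: it prevents a silently incomplete import when a laboratory changes its export layout.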

Specimen identification and positioning

A unique ID and a corresponding barcode linked to the sample are generated by dedicated toolsets according to specified templates.¹⁹ The barcode serves as an electronic tag that identifies the sample electronically. IDs are usually presented as numbers, while barcodes are preferentially stored as strings and numbers for pattern matching and are searched by an electronic scan. A storage management module that assigns storage locations (site-freezer-layer-rack-slot) and keeps track of a given specimen should also be maintained to support inventory management.^1,12,62,64
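As an illustration of template-based identifiers and structured locations, the Python sketch below generates sample IDs from an assumed template and parses a location string in the site-freezer-layer-rack-slot order used in this review. The `PREFIX-YYYY-NNNNNN` template and the hyphen-separated location encoding are hypothetical choices, not a published convention.

```python
import re
from typing import NamedTuple

# Pattern for the hypothetical ID template PREFIX-YYYY-NNNNNN, usable for scan validation.
SAMPLE_ID_PATTERN = re.compile(r"^[A-Z]+-\d{4}-\d{6}$")


class Location(NamedTuple):
    site: str
    freezer: str
    layer: int
    rack: int
    slot: int


def make_sample_id(prefix: str, year: int, serial: int) -> str:
    """Build a sample ID from the assumed template PREFIX-YYYY-NNNNNN."""
    return f"{prefix}-{year:04d}-{serial:06d}"


def parse_location(text: str) -> Location:
    """Parse 'site-freezer-layer-rack-slot' into a structured location."""
    site, freezer, layer, rack, slot = text.split("-")
    return Location(site, freezer, int(layer), int(rack), int(slot))
```

Storing the location as separate typed fields rather than a single string makes it straightforward to query, for example, every sample in one rack.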

Data security

To make data communication safe, there has been growing recognition of the need for data security and access control.⁶⁵

The European Union (EU) published the General Data Protection Regulation (GDPR), a law on data privacy and security for the protection of personal data.⁶⁶ Data protection can be assured by additional tools to encrypt key data, deidentify sensitive information, and support role authorization.^44,67 To protect donors' private information, it is mandatory to develop a deidentification process that removes the 18 identifiers specified by the Health Insurance Portability and Accountability Act (HIPAA), to limit exposure of patient identifiers to research staff.^25,46,68,69 After deidentification, randomly generated Universally Unique Identifiers linked to the original identifiers are created and stored to support the reidentification that is permissible under HIPAA.⁴⁶ In addition, security logging with defined privilege levels, such as administrators, visitors, and common users, must be set up to ensure access control. Data backups should also be performed routinely in case of database corruption.^12,63
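The deidentification and reidentification mechanism described above can be sketched in Python using randomly generated UUIDs. The four fields in `IDENTIFIER_FIELDS` are only an illustrative subset of the 18 HIPAA identifiers, and the in-memory `key_store` dictionary is a stand-in assumption for what would in practice be a separately secured, access-controlled store.

```python
import uuid

# Illustrative subset of direct identifiers; a real process must cover all 18 HIPAA identifiers.
IDENTIFIER_FIELDS = ("name", "mrn", "phone", "email")


def deidentify(record: dict, key_store: dict) -> dict:
    """Strip identifiers from a record and replace them with a random UUID."""
    token = str(uuid.uuid4())
    # The link between token and identifiers is kept apart from the research data.
    key_store[token] = {field: record[field] for field in IDENTIFIER_FIELDS if field in record}
    research_record = {k: v for k, v in record.items() if k not in IDENTIFIER_FIELDS}
    research_record["subject_uuid"] = token
    return research_record


def reidentify(token: str, key_store: dict) -> dict:
    """Recover the original identifiers; intended only for authorized roles."""
    return key_store[token]
```

Because the UUID is random rather than derived from the MRN, the research record alone reveals nothing about the donor; only access to the key store allows the link to be restored.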

Register and retrieval service

To support registration and application, a web-based, user-friendly interface composed of role-based perspectives is needed so that users (such as researchers, preliminary users, administrators, and honest brokers) can register and submit applications according to their needs.^46,70

In particular, indexing services used to retrieve the integrated information must be established to choose appropriate samples between different end points.⁵⁷ Pick lists preset for critical factors, such as sample type, collection date, pathological type, and medication condition, should be designed flexibly, with the standard Boolean constructs AND, OR, and NOT used to combine these constraints in the user interface.^46,56 The query results are preferably returned as a single row, presenting the necessary nonhuman structured data, including clinical key elements, biosample information, and biological data.⁶² A function to export these data, according to their content type, into comma-separated value files supported by popular statistical software such as SPSS, Excel, Stata, or R must be included.¹⁹
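To make the combination of pick-list constraints with AND, OR, and NOT concrete, here is a small Python sketch that builds a query from composable predicates and exports the hits as comma-separated values. The specimen fields and helper names (`field_equals`, `all_of`, `any_of`, `negate`, `export_csv`) are hypothetical and only illustrate the idea; a production system would normally push such filters into the database.

```python
import csv
import io


def field_equals(field, value):
    return lambda row: row.get(field) == value


def all_of(*predicates):  # Boolean AND
    return lambda row: all(p(row) for p in predicates)


def any_of(*predicates):  # Boolean OR
    return lambda row: any(p(row) for p in predicates)


def negate(predicate):  # Boolean NOT
    return lambda row: not predicate(row)


def export_csv(rows, columns):
    """Serialize the selected columns of the query hits as CSV text."""
    buffer = io.StringIO()
    writer = csv.DictWriter(
        buffer, fieldnames=columns, extrasaction="ignore", lineterminator="\n"
    )
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


# Hypothetical specimen records with pick-list style fields.
SPECIMENS = [
    {"barcode": "TB-0001", "sample_type": "tissue", "pathological_type": "adenocarcinoma", "treated": "no"},
    {"barcode": "TB-0002", "sample_type": "plasma", "pathological_type": "adenocarcinoma", "treated": "no"},
    {"barcode": "TB-0003", "sample_type": "tissue", "pathological_type": "squamous", "treated": "yes"},
]

# (tissue OR plasma) AND adenocarcinoma AND NOT treated
query = all_of(
    any_of(field_equals("sample_type", "tissue"), field_equals("sample_type", "plasma")),
    field_equals("pathological_type", "adenocarcinoma"),
    negate(field_equals("treated", "yes")),
)
hits = [row for row in SPECIMENS if query(row)]
csv_text = export_csv(hits, ["barcode", "sample_type"])
```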

Inventory tracking and temperature monitoring

The inventory module should have the capability of repository management. This includes assigning new virtual locations to the sample after collection, automated location tracking, and location editing when the sample location is changed or deleted.⁶² In the event of a container failure, the inventory management system must be able to handle mass movements and changes of specimen location.¹² Ideally, the system should also display the spatial utilization rate of each storage container so that biobankers can see at a glance how much space is in use.
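The two inventory capabilities mentioned above, mass relocation after a container failure and a per-container utilization rate, can be sketched in Python as follows. The `Inventory` class, its flat slot numbering, and the container names are simplifying assumptions for illustration; real repositories use a deeper location hierarchy.

```python
class Inventory:
    def __init__(self, capacities):
        # capacities: container name -> number of slots (hypothetical layout)
        self._capacities = dict(capacities)
        self._location = {}  # barcode -> (container, slot)

    def place(self, barcode, container, slot):
        if not 0 <= slot < self._capacities[container]:
            raise ValueError("slot out of range")
        if (container, slot) in self._location.values():
            raise ValueError("slot occupied")
        self._location[barcode] = (container, slot)

    def free_slots(self, container):
        used = {s for c, s in self._location.values() if c == container}
        return [s for s in range(self._capacities[container]) if s not in used]

    def utilization(self, container):
        """Fraction of slots in use, suitable for a space-usage display."""
        used = sum(1 for c, _ in self._location.values() if c == container)
        return used / self._capacities[container]

    def evacuate(self, failed, target):
        """Move every specimen out of a failed container in one operation."""
        to_move = sorted(b for b, (c, _) in self._location.items() if c == failed)
        free = self.free_slots(target)
        if len(free) < len(to_move):
            # Check capacity first so that a partial move never happens.
            raise RuntimeError("target container lacks free slots; nothing moved")
        moves = {}
        for barcode, slot in zip(to_move, free):
            self._location[barcode] = (target, slot)
            moves[barcode] = (target, slot)
        return moves


inv = Inventory({"F01": 4, "F02": 4})
inv.place("TB-0001", "F01", 0)
inv.place("TB-0002", "F01", 1)
inv.place("TB-0003", "F02", 0)
moves = inv.evacuate("F01", "F02")  # simulate failure of freezer F01
```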

Because changes in storage temperature are an important factor in determining specimen quality, it is preferable to keep records of the temperature data of both the carriers and rooms for subsequent sample quality assessment.⁷¹ This can be realized by binding to an interface of an external temperature monitoring system or by building a temperature monitoring function independently within the BDMSs, according to previous experience.
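One way a built-in monitoring function might turn raw temperature records into something usable for quality assessment is to extract excursion intervals, as in the Python sketch below. The function `find_excursions`, the sampling interval, and the limit of -70 °C are assumptions for illustration only; acceptable limits depend on the specimen type and the biobank's own procedures.

```python
from datetime import datetime, timedelta


def find_excursions(readings, limit_celsius):
    """Return (start, end, peak) for each run of readings above the limit.

    readings: time-ordered list of (timestamp, temperature) tuples.
    """
    excursions = []
    start = peak = last = None
    for timestamp, temp in readings:
        if temp > limit_celsius:
            if start is None:
                start, peak = timestamp, temp  # a new excursion begins
            peak = max(peak, temp)
            last = timestamp
        elif start is not None:
            excursions.append((start, last, peak))  # excursion has ended
            start = None
    if start is not None:
        # The log ended while the temperature was still above the limit.
        excursions.append((start, last, peak))
    return excursions


# Hypothetical freezer log sampled every 10 minutes.
t0 = datetime(2020, 1, 1, 0, 0)
temps = [-80.0, -79.0, -65.0, -60.0, -78.0, -81.0]
readings = [(t0 + timedelta(minutes=10 * i), temp) for i, temp in enumerate(temps)]
excursions = find_excursions(readings, limit_celsius=-70.0)
```

Each excursion can then be linked to the specimens stored in that container at the time, so that later analyses can account for the exposure.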

Operation log and documentation

An operation log describing any changes made to the system in any way must be recorded and saved as an important part of the BDMSs. This includes, but is not limited to, the details of the changes made, who made them, and the date and time of each change.¹² Ideally, any documents of the repository, including operating instructions, standard operating procedures, (digitally scanned) informed consent, and financial revenue, may also be archived electronically and stored within the BDMSs for administrative convenience.
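A minimal Python sketch of such an operation log is given below, recording what was changed, by whom, and when. The `LogEntry` fields and the `OperationLog` class are hypothetical; a production system would persist entries in durable, append-only storage rather than in memory.

```python
from dataclasses import dataclass
from datetime import datetime, timezone


@dataclass(frozen=True)
class LogEntry:
    timestamp: datetime  # date and time of the change (UTC)
    user: str            # who made the change
    action: str          # what kind of change was made
    target: str          # which record was affected
    old_value: object
    new_value: object


class OperationLog:
    def __init__(self):
        # Entries are only ever appended; there is no update or delete method.
        self._entries = []

    def record(self, user, action, target, old_value=None, new_value=None):
        entry = LogEntry(
            datetime.now(timezone.utc), user, action, target, old_value, new_value
        )
        self._entries.append(entry)
        return entry

    def history(self, target):
        """Return all logged changes for one record, oldest first."""
        return [entry for entry in self._entries if entry.target == target]


log = OperationLog()
log.record("alice", "move", "TB-0001", old_value="F01-0", new_value="F02-1")
log.record("bob", "update_field", "TB-0001", old_value="plasma", new_value="serum")
log.record("alice", "move", "TB-0002", old_value="F01-1", new_value="F02-2")
history = log.history("TB-0001")
```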

To provide a better reference for specimen annotation, the recommended priority of data management approaches is summarized in Table 3 depending on the size, the type, and the objective of the biobanks. Biobanks can refer to the corresponding approaches in data management according to their own situations.

Table 3.

The Priority of Technical Methods Recommended in Data Management

Methods	Importance^a	Maturity	Recommendation priority^a (individual biobank)	Recommendation priority^a (national biobank)	Recommendation priority^a (regional biospecimen network)
NLP	Very important	Medium	Best	Ideal	Mandatory
CDE	Very important	Low	Best	Ideal	Mandatory
XML	Very important	High	Ideal	Ideal	Mandatory
ETL	Important	Medium	Best	Ideal	Mandatory
Specimen information entry	Emergency	High	Mandatory	Mandatory	Mandatory
Analytical information import	Important	High	Mandatory	Mandatory	Mandatory
Specimen labeling	Emergency	High	Mandatory	Mandatory	Mandatory
Specimen positioning	Emergency	High	Mandatory	Mandatory	Mandatory
Deidentification	Very important	Medium	Mandatory	Mandatory	Mandatory
Role based privilege	Important	High	Ideal	Mandatory	Mandatory
Data backup	Very important	High	Mandatory	Mandatory	Mandatory
Retrieval service	Emergency	High	Mandatory	Mandatory	Mandatory
Registry service	Important	High	Ideal	Mandatory	Mandatory
Inventory tracking	Emergency	High	Mandatory	Mandatory	Mandatory
Temperature monitoring	Important	High	Ideal	Mandatory	Mandatory
Operation log	Important	High	Ideal	Mandatory	Mandatory
Documentation	Important	High	Best	Ideal	Mandatory

^a Importance degree: emergency > very important > important; priority degree: mandatory > ideal > best.

CDE, common data element; ETL, extract, transform, and load; NLP, natural language processing; XML, extensible markup language.

Conclusions and Future Perspectives

With the development of digital technology and the advent of Big Data, physical biobanks that store sample entities may gradually give way to electronic biorepositories that store electronic data, as the UK Biobank has done.⁷² In view of this, the challenge of managing biological sample data has become increasingly prominent. This review highlights the importance of specimen annotation and provides recommendations for daily specimen informatics management for biobanks. We summarized the important information categories for the annotation of tumor specimens and recommended a standardized data format to maintain data consistency, together with the main technical approaches and functionalities of biospecimen data management for biobanks. Based on these data management principles, each BDMS can be designed to best support the processes of a particular biorepository with its own unique workflows and dataset types.

We should acknowledge a few potential drawbacks of this review. First, because current standards for specimen information are limited, the recommended standardized format of data presentation is incomplete and requires further efforts in data standardization. Moreover, in consideration of the professional background of biobankers, the approaches to data management were not described in detail with specific tools and procedures. A wealth of professional literature and guidelines is available for reference, and professional system design engineers can consult the relevant references for further details.

In the future, biobanks are expected to achieve significant breakthroughs in precision medicine if they continue to strengthen the depth and breadth of data mining to facilitate open sharing. To achieve this goal, one of the first steps would be to construct a dynamic boutique clinical database with longitudinal clinical data based on a prospective design, that is, to collect matched tumor tissues, paracancerous tissues, metastatic foci, lymph nodes, peripheral blood, and related information from patients at different stages before, during, and after treatment. Researchers can use the data produced at different stages to evaluate and analyze drug efficacy and treatment effects longitudinally. The second step would be to promote global data standardization and open sharing, thus making possible horizontal data analysis based on large sample sizes from different regions and ethnic groups. Based on large sample databases, many complex medical problems and global health issues may be addressed comprehensively.

Taken together, a globally open, shared bio-database built on both depth and breadth of data would usher in a promising future for biobanking, defined by comprehensive metadata and extensive collaboration in medical research.

Footnotes

Authors' Contributions

P.-F.Z. drafted the article. W.-H.J. reviewed the article before submission. All authors read and approved the final article.

Acknowledgments

The authors thank Ye-zhu Hu, Shao-dan Zhang, and Ting Zhou for their unique experience and insights in information management. The authors are also grateful to the colleagues in the biospecimen and biobank industry for their efforts in the standardization of sample annotation.

Author Disclosure Statement

No conflicting financial interests exist.

Funding Information

This work was supported by the National Key Research and Development Program (grant no. 2016YFC1302704) and the Science and Technology Planning Project of Guangdong Province, China (grant no. 2019B030316031).

References

1. Vaught. Biobanking comes of age: The transition to biospecimen science. Annu Rev Pharmacol Toxicol, 2016; 56:211–228.

2. Scott, Caulfield, Borgelt, et al. Personal medicine—The new banking crisis. Nat Biotechnol, 2012; 30:141–147.

3. Rossi, Ceballos, Lu. Immune precision medicine for cancer: A novel insight based on the efficiency of immune effector cells. Cancer Commun (Lond), 2019; 39:34.

4. Hamburg, Collins. The path to personalized medicine. N Engl J Med, 2010; 363:301–304.

5. Ambrose, Freedman, Buetow, et al. Using patient-initiated study participation in the development of evidence for personalized cancer therapy. Clin Cancer Res, 2011; 17:6651–6657.

6. Litton. Biobank informatics: Connecting genotypes and phenotypes. Methods Mol Biol, 2011; 675:343–361.

7. Papatheodorou, Crichton, Morris, et al. A metadata approach for clinical data management in translational genomics studies in breast cancer. BMC Med Genomics, 2009; 2:66.

8. Spjuth, Krestyaninova, Hastings, et al. Harmonising and linking biomedical and clinical data across disparate data archives to enable integrative cross-biobank research. Eur J Hum Genet, 2016; 24:521–528.

9. Noor, Holmberg, Gillett, et al. Big Data: The challenge for small research groups in the era of cancer genomics. Br J Cancer, 2015; 113:1405–1412.

10. MacRury, Finlayson, Hussey-Wilson, et al. Development of a pseudo/anonymised primary care research database: Proof-of-concept study. Health Inform J, 2016; 22:113–119.

11. Vaught, Caboux, Hainaut. International efforts to develop biospecimen best practices. Cancer Epidemiol Biomarkers Prev, 2010; 19:912–915.

12. 2012 best practices for repositories collection, storage, retrieval, and distribution of biological materials for research international society for biological and environmental repositories. Biopreserv Biobank, 2012; 10:79–161.

13. Herrington, Mao, Parker, et al. Proteomic architecture of human coronary and aortic atherosclerosis. Circulation, 2018; 137:2741–2756.

14. Amin, Tsui, Borromeo, et al. PaTH: Towards a learning health system in the Mid-Atlantic region. J Am Med Inform Assoc, 2014; 21:633–636.

15. Jacobson, Becich, Bollag, et al. A federated network for translational cancer research using clinical data and biospecimens. Cancer Res, 2015; 75:5194–5201.

16. Moore, Kelly, Jewell, et al. Biospecimen reporting for improved study quality (BRISQ). Cancer Cytopathol, 2011; 119:92–101.

17. Norlin, Fransson, Eriksson, et al. A minimum data set for sharing biobank samples, information, and data: MIABIS. Biopreserv Biobank, 2012; 10:343–348.

18. Manders, Peters TMA, Siezen, et al. A stepwise procedure to define a data collection framework for a clinical biobank. Biopreserv Biobank, 2018; 16:138–147.

19. Eminaga, Ozgur, Semjonow, et al. Linkage of data from diverse data sources (LDS): A data combination model provides clinical data of corresponding specimens in biobanking information system. J Med Syst, 2013; 37:9975.

20. National Cancer Institute: Best practices for biospecimen resources; 2018. Available from: http://biospecimens.cancer.gov/bestpractices/ (accessed December 17, 2019).

21. Dowst, Pew, Watkins, et al. Acquire: An open-source comprehensive cancer biobanking system. Bioinformatics, 2015; 31:1655–1662.

22. Wade. Traits and types of health data repositories. Health Inf Sci Syst, 2014; 2:4.

23. Vaught, Henderson, Compton. Biospecimens and biorepositories: From afterthought to science. Cancer Epidemiol Biomarkers Prev, 2012; 21:253–255.

24. Ludvigsson, Andersson, Ekbom, et al. External review and validation of the Swedish national inpatient register. BMC Public Health, 2011; 11:450.

25. Freedman, Cantor, Merriman, et al. 2013 HIPAA changes provide opportunities and challenges for researchers: Perspectives from a Cancer Center. Clin Cancer Res, 2016; 22:533–539.

26. Langseth, Luostarinen, Bray, et al. Ensuring quality in studies linking cancer registries and biobanks. Acta Oncol, 2010; 49:368–377.

27. Leitsalu, Alavere, Tammesoo, et al. Linking a population biobank with national health registries-the estonian experience. J Pers Med, 2015; 5:96–106.

28. Rossille, Burgun, Pangault-Lorho, et al. Integrating clinical, gene expression, protein expression and preanalytical data for in silico cancer research. Stud Health Technol Inform, 2008; 136:455–460.

29. Segagni, Tibollo, Dagliati, et al. The ONCO-I2b2 project: Integrating biobank information and clinical data to support translational research in oncology. Stud Health Technol Inform, 2011; 169:887–891.

30. Foran, Chen, Chu, et al. Roadmap to a comprehensive clinical data warehouse for precision medicine applications in oncology. Cancer Inform, 2017; 16:1176935117694349.

31. Ellervik, Vaught. Preanalytical variables affecting the integrity of human biospecimens in biobanking. Clin Chem, 2015; 61:914–934.

32. Betsou, Barnes, Burke, et al. Human biospecimen research: Experimental protocol and quality control tools. Cancer Epidemiol Biomarkers Prev, 2009; 18:1017–1025.

33. Fay Betsou, Jamie Case, Rodrigo Chuaqui, et al. Standard PREanalytical code version 3.0. Biopreserv Biobank, 2018; 16:9–12.

34. Caixeiro, Lai, Lee. Quality assessment and preservation of RNA from biobank tissue specimens: A systematic review. J Clin Pathol, 2016; 69:260–265.

35. Wei, Simpson. Digital pathology and image analysis augment biospecimen annotation and biobank quality assurance harmonization. Clin Biochem, 2014; 47:274–279.

36. Vaught. Developments in biospecimen research. Br Med Bull, 2015; 114:29–38.

37. Auffray, Balling, Barroso, et al. Making sense of big data in health research: Towards an EU action plan. Genome Med, 2016; 8:71.

38. Marko-Varga. BioBanking—The Holy Grail of novel drug and diagnostic developments? J Clin Bioinform, 2011; 1:14.

39. Kato, Nishimura, Ikeda, et al. Developments for a growing Japanese patient population: Facilitating new technologies for future health care. J Proteomics, 2011; 74:759–764.

40. Suh, Sarojini, Youssif, et al. Tissue banking, bioinformatics, and electronic medical records: The front-end requirements for personalized medicine. J Oncol, 2013; 2013:368751.

41. Mendy, Caboux, Lawlor, et al. Common minimum technical standards and protocols for biobanks dedicated to cancer research. IARC Technical Publications, No. 44, 2017.

42. Quinlan, Mistry, Bullbeck, et al. A data standard for sourcing fit-for-purpose biological samples in an integrated virtual network of biobanks. Biopreserv Biobank, 2014; 12:184–191.

43. Covitz, Hartel, Schaefer, et al. caCORE: A common infrastructure for cancer informatics. Bioinformatics, 2003; 19:2404–2412.

44. Saltz, Oster, Hastings, et al. caGrid: Design and implementation of the core architecture of the cancer biomedical informatics grid. Bioinformatics, 2006; 22:1910–1916.

45. BIGSPW. The Cancer Biomedical Informatics Grid (caBIG): Infrastructure and applications for a worldwide research community. Stud Health Technol Inform, 2007; 129(Pt 1):330–334.

46. Crowley, Castine, Mitchell, et al. caTIES: A grid based system for coding and retrieval of surgical pathology reports and tissue specimens in support of translational research. J Am Med Inform Assoc, 2010; 17:253–264.

47. Jensen, Ferretti, Grossman, et al. The NCI Genomic Data Commons as an engine for precision medicine. Blood, 2017; 130:453–459.

48. McIntosh, Sharma, Mulvihill, et al. caTissue Suite to OpenSpecimen: Developing an extensible, open source, web-based biobanking management system. J Biomed Inform, 2015; 57:456–464.

49. Yim, Yetisgen, Harris, et al. Natural language processing in oncology: A review. JAMA Oncol, 2016; 2:797–804.

50. Soysal, Warner, Denny, et al. Identifying metastases-related information from pathology reports of lung cancer patients. AMIA Jt Summits Transl Sci Proc, 2017; 2017:268–277.

51. Segagni, Tibollo, Dagliati, et al. An ICT infrastructure to integrate clinical and molecular data in oncology research. BMC Bioinformatics, 2012; 13(Suppl. 4):S5.

52. Silva, Ball, Douglas. The Cancer Informatics Infrastructure (CII): An architecture for translating clinical research into patient care. Stud Health Technol Inform, 2001; 84(Pt 1):114–117.

53. Warzel, Andonaydis, McCurry, et al. Common data element (CDE) management and deployment in clinical trials. AMIA Annu Symp Proc, 2003:1048.

54. Saver, Warach, Janis, et al. Standardizing the structure of stroke clinical and epidemiologic research data: The National Institute of Neurological Disorders and Stroke (NINDS) Stroke Common Data Element (CDE) project. Stroke, 2012; 43:967–973.

55. Sheehan, Hirschfeld, Foster, et al. Improving the value of clinical research through the use of Common Data Elements. Clin Trials, 2016; 13:671–676.

56. Post, Kurc, Overcash, et al. A temporal abstraction-based extract, transform and load process for creating registry databases for research. AMIA Jt Summits Transl Sci Proc, 2011; 2011:46–50.

57. Olund, Lindqvist, Litton. BIMS: An information management system for biobanking in the 21st century. IBM Syst J, 2007; 46:171–182.

58. Pecoraro, Luzi, Ricci. Designing ETL tools to feed a data warehouse based on electronic healthcare record infrastructure. Stud Health Technol Inform, 2015; 210:929–933.

59. Denney, Long, Armistead, et al. Validating the extract, transform, load process used to populate a large clinical research database. Int J Med Inform, 2016; 94:271–274.

60. Bouzille, Jouhet, Turlin, et al. Integrating biobank data into a clinical data research network: The IBCB project. Stud Health Technol Inform, 2018; 247:16–20.

61. Chen, Wulff, Sholle, et al. Evaluating generalizability of a biospecimen informatics approach: Support for local requirements and best practices. AMIA Jt Summits Transl Sci Proc, 2018; 2017:55–62.

62. Nadkarni, Kemp, Parikh. Leveraging a clinical research information system to assist biospecimen data and workflow management: A hybrid approach. J Clin Bioinform, 2011; 1:22.

63. Bickerstaffe, Ranaweera, Endersby, et al. The Ark: A customizable web-based data management tool for health and medical research. Bioinformatics, 2017; 33:624–626.

64. Riondino, Ferroni, Spila, et al. Ensuring sample quality for biomarker discovery studies—Use of ICT tools to trace biosample life-cycle. Cancer Genomics Proteomics, 2015; 12:291–299.

65. Jayabalan, O'Daniel. Access control and privilege management in electronic health record: A systematic literature review. J Med Syst, 2016; 40:261.

66. EU General Data Protection Regulation. Available from: https://gdpr.eu/tag/gdpr/ (accessed October 9, 2020).

67. Kho, Cashy, Jackson, et al. Design and implementation of a privacy preserving electronic health record linkage tool in Chicago. J Am Med Inform Assoc, 2015; 22:1072–1080.

68. De Meyer, De Moor, Reed-Fourquet. Privacy protection through pseudonymisation in eHealth. Stud Health Technol Inform, 2008; 141:111.

69. Felmeister, Masino, Rivera, et al. The biorepository portal toolkit: An honest brokered, modular service oriented software tool set for biospecimen-driven translational research. BMC Genomics, 2016; 17(Suppl. 4):434.

70. , Gui, Yong. An introduction to hardware, software, and other information technology needs of biomedical biobanks. Methods Mol Biol, 2019; 1897:17–29.

71. Robb, Gulley, Fitzgibbons, et al. A call to standardize preanalytic data elements for biospecimens. Arch Pathol Lab Med, 2014; 138:526–537.

72. Bycroft, Freeman, Petkova, et al. The UK Biobank resource with deep phenotyping and genomic data. Nature, 2018; 562:203–209.