Abstract
Anonymization is a recognized process by which identifiers can be removed from identifiable data to protect an individual's confidentiality and is used as a standard practice when sharing data in biomedical research. However, a plethora of terms, such as coding, pseudonymization, unlinked, and deidentified, have been and continue to be used, leading to confusion and uncertainty. This article shows that this is a historic problem and argues that such continuing uncertainty regarding the levels of protection given to data risks damaging initiatives designed to assist researchers conducting cross-national studies and sharing data internationally. DataSHIELD and the creation of a legal template are used as examples of initiatives that rely on anonymization, but where the inconsistency in terminology could hinder progress. More broadly, this article argues that relying on vague notions of the anonymization process could damage the public's trust in research and the institutions that carry it out. Research participants whose lack of clear understanding of the research process is compensated for by trusting those carrying out the research may have that trust damaged if the level of protection given to their data does not match their expectations. One step toward ensuring understanding between parties would be consistent international use of clearly defined terminology, so that all those involved are clear on the level of identifiability of any particular set of data and, therefore, how those data can be accessed and shared.
Introduction
DataSHIELD (Data Aggregation Through Anonymous Summary-statistics from Harmonized Individual-levEL Databases, www.datashield.ac.uk) is a project that has been developed to enable greater data sharing in ways that protect the confidentiality of participant data. It has been used to conduct secure data analyses in core BioSHaRE projects: the Healthy Obese Project (HOP) and Environmental Determinants of Health.
One agreed approach to protection used by the majority of biobanks globally is to remove identifiers in their entirety or replace them with codes, if a link to the individual providing the data is still desired. Governments champion analyzing large datasets, “…to open up new seams of inquiry, improving the evidence base, and leading to better policies, better outcomes, and potentially economic growth…” 3 Institutions and funders encourage data sharing and may require plans for that dissemination, including details of processes used.4,5 However, there is an ongoing debate on the ability of such processes to provide adequate protection.6–8
It can be argued that removing identifiers is not stable enough to provide adequate protection, even when conducted to the highest standard. This is because there is no clarity or consensus on what anonymization, or anonymized, means in the international research context. Instead, the numerous and contradictory international, national, jurisdictional, and institutional codes, legislation, regulations, and guidance provide differing definitions and requirements for what data can be shared without a substantial risk of reidentifying the individual who provided those data. The processes by which identifiable data are protected may work at the local level, but may be inconsistent when seeking to work across jurisdictions or borders, possibly hindering international research. A lack of understanding could also cause researchers or others to make assumptions about levels of protection, leading to data exchanges that may be outside the expectation of participants.
This article argues that, regarding protection of identifiable data, two things are needed to better enable international data sharing. First, there need to be internationally agreed-on definitions of the processes by which data are protected for research purposes (e.g., anonymized, anonymization). Second, agreed-on definitions need to be used, as far as is possible, uniformly. This may be an impossibility, as terminology may be enshrined in national legislation, but the attempt needs to be made so that there can be a clear understanding of terms around the world. DataSHIELD will be used as an example of an international research initiative that relies on anonymization, but one that could be seriously jeopardized without clear agreements on terminology.
Inconsistent definitions and use of terminology within and across countries
As early as February 1998, the Human Genome Organization cited categories of data that, when used in consent forms for biomedical research, could inform potential participants whether those data would, “…identify the person, code the identity, or anonymize the identity so that a person could not be traced….” 9 Anonymization was also described as the full stripping of identifiers. These terms were taken up later by others, such as the International Council for Harmonization of Technical Requirements for Pharmaceuticals for Human Use (ICH). In its 2007 Harmonized Tripartite Guideline (ICH E-15 EMEA), it more formally sorted data and samples into four categories. 10 Identified data and samples contain personal identifiers, while coded data and samples have those identifiers replaced with one or more codes. Coding (single or double) allows the coded data to be linked with the personal identifiers, which is useful for clinical follow-up or for adding new information to the participant's or patient's record. Anonymized data and samples, “…are initially single or double coded, but where the link between the subjects' identifiers and the unique code(s) is subsequently deleted.” This deletion makes it, “…no longer possible to trace the data and samples back to individual subjects through the coding key(s).” “Anonymous” data and samples, “…are never labeled with personal identifiers when originally collected, neither is a coding key generated.” 10
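The four categories described above can be read as a decision rule over three facts about a dataset. The following is a minimal Python sketch of that reading; the class, function, and parameter names are illustrative assumptions and do not appear in the ICH guideline itself.

```python
from enum import Enum


class Category(Enum):
    IDENTIFIED = "identified"
    CODED = "coded"
    ANONYMIZED = "anonymized"
    ANONYMOUS = "anonymous"


def classify(has_identifiers: bool, was_coded: bool, key_exists: bool) -> Category:
    """Assign one of the four categories described in the text.

    has_identifiers: personal identifiers are still attached to the records.
    was_coded: identifiers were at some point replaced by one or more codes.
    key_exists: the key linking the code(s) to identifiers has not been deleted.
    (All three parameter names are hypothetical.)
    """
    if has_identifiers:
        return Category.IDENTIFIED
    if was_coded and key_exists:
        return Category.CODED
    if was_coded:
        return Category.ANONYMIZED  # coded first, link subsequently deleted
    return Category.ANONYMOUS  # never labeled, no coding key ever generated
```

Under this reading, the only difference between coded and anonymized data is whether the coding key still exists, which is exactly the distinction that varying national terms tend to blur.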
These four categories and their definitions take us neatly and clearly from the openly identifiable to the nonidentifiable and set out activities in which these categories of data and samples can be used. Unfortunately, even though this guideline was recommended for adoption by the three ICH regulatory bodies (the European Union, Japan, and the United States), the categories are not universally used across these jurisdictions. The US Health Insurance Portability and Accountability Act (HIPAA) Privacy Rule, section 45 CFR 164.514(a), uses “deidentified” for data where 18 specific identifiers have been removed and cites the resulting data as, “Health information that does not identify an individual and with respect to which there is no reasonable basis to believe that the information can be used to identify an individual….” 11
In Japan's recent amendment of The Act on the Protection of Personal Information the following is stated:
“Anonymized Information refers to any personal data, which is anonymized according to the standards prescribed by the new data protection authority…, should not contain particular descriptions or items that could be used to identify a person, and should be subject to an anonymization process that would render the data subsequently incapable of being restored to its original form.” 12
The current European data protection legislation (Directive 95/46/EC) does not provide a definition, but states in Recital 26 that, “…principles of protection shall not apply to data rendered anonymous in such a way that the data subject is no longer identifiable.” 13 Without a clear definition, the different member states have made their interpretations locally. A study of the national legislation and guidance across five participating BioSHaRE countries found “…considerable differences…” 14 A look at the six countries participating in the HOP shows these in detail (Table 1).
In 2005, Knoppers and Saginur presented a table of numerous terms used across the world for “coded,” which included “identifiable,” “reversibly deidentified,” and “unidentified.” 15 Similarly, terms used for “anonymized” included “irreversibly unlinked,” which conveys the need for permanently removing identifiers, but also “de-identified” and “nonidentifiable,” which can be interpreted to mean that the data could be reidentified if desired. The authors noted that, “…the proliferation of terminology to describe the identifiability of genetic data renders it difficult to share and use samples between jurisdictions as ethics review committees and researchers have no means of ensuring equivalency between the “labels” of identifiability.” 15
In contrast to the rapid progress in biomedical sciences, no agreement on definitions has been reached in the 10 years since Knoppers and Saginur's work. In the United Kingdom, the Information Commissioner's Office follows the national practice of using “pseudonymized” instead of “coded.” 16 French guidance (translated into English) states that, “‘[t]rue’ anonymization necessarily involves an (irreversible) loss of information…,” yet goes on to state that, “[i]t is still possible to correlate anonymized personal data, so an individual may be reidentified based on partial information when personal data are anonymized, but not deleted.” 17 This latter statement could be interpreted as discussing coded data, rather than anonymized data, and appears to confirm the belief that even anonymization does not mean “forever.” A recent study, advocating a common language for biobanking, notes that, “[a] Swedish respondent commented that in some Swedish basic legal documents, the term [Anonymized] is used for coded information.” 4,18
DataSHIELD
The difficulties of dealing with such inconsistencies come into play when data are actually being shared. Details on DataSHIELD are available elsewhere, 19 but in summary, it is a project that uses processes by which individual-level (or personally identifying) data can be interrogated at the source, without those data leaving the original site. Only nonidentifying data are shared with outside researchers (Fig. 1). Depending on the analysis conducted, these data can take the form of summary statistics, regression coefficients, and p-values, all of which are nondisclosive. 19

Fig. 1. In DataSHIELD, nonidentifying summary statistics and computer instructions are allowed to pass between computers, while individual-level data are retained on the study's local data computer behind firewalls. DataSHIELD, Data Aggregation Through Anonymous Summary-statistics from Harmonized Individual-levEL Databases.
There are currently varying ways to combine and analyze data from different databases. Individual-level meta-analysis requires each study to contribute data to a central resource, where it can be combined and analyzed. This can raise ethicolegal issues around the transfer of personal data, if the central resource is outside the security protections of the original institution where the data were held. A study-level meta-analysis asks each contributing study to first do the analysis and then contribute the anonymized results to a central resource. This avoids ethicolegal concerns around individual-level data, but, depending on the aim of the study, can severely restrict the scope of the analyses. It may also be more costly as, with every change in research question, the analyses need to be redone.
DataSHIELD takes a different approach. In one (simplified) example of how DataSHIELD can be used, a set of variables to be analyzed is agreed on by the participating studies. A separate database for these datasets is created at each study site; the data may include identifying and nonidentifying variables. A query is sent out from an analysis center (AC) to each of these databases. This “research question” is analyzed at each study site on the agreed-to variables. Only “answers” to the question, in the form of nonidentifying data, are returned to the AC. The process is repeated until a result is reached. In this way, separate and parallelized analyses can be undertaken simultaneously, using different studies and in different countries. As a result, “…although a full individual-level analysis is enabled, no identifying or sensitive information is physically moved, or even rendered temporarily visible, outside the original study in which the data were collected.” 20
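The exchange described above can be illustrated with a small Python sketch. It is not DataSHIELD's own interface; the names `SiteSummary`, `summarize_at_site`, and `pool` are hypothetical, and the example covers a single round of query and answer for a pooled mean and variance. Iterative model fitting would repeat the same exchange until the estimates settle.

```python
from dataclasses import dataclass
from typing import List, Sequence, Tuple


@dataclass(frozen=True)
class SiteSummary:
    """Aggregates a site releases to the analysis center; no individual rows."""
    n: int
    total: float
    total_sq: float


def summarize_at_site(values: Sequence[float]) -> SiteSummary:
    """Runs behind the study's own firewall on individual-level data."""
    return SiteSummary(
        n=len(values),
        total=sum(values),
        total_sq=sum(v * v for v in values),
    )


def pool(summaries: List[SiteSummary]) -> Tuple[float, float]:
    """Runs at the analysis center, using only the returned summaries."""
    n = sum(s.n for s in summaries)
    total = sum(s.total for s in summaries)
    total_sq = sum(s.total_sq for s in summaries)
    mean = total / n
    variance = (total_sq - n * mean * mean) / (n - 1)  # sample variance
    return mean, variance


# Two hypothetical sites; the raw lists never leave summarize_at_site.
site_a = summarize_at_site([1.0, 2.0, 3.0])
site_b = summarize_at_site([4.0, 5.0])
mean, variance = pool([site_a, site_b])  # mean 3.0, variance 2.5
```

The pooled result equals what a central analysis of all five values would give, even though the analysis center only ever sees three numbers per site.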
One example where DataSHIELD is being piloted is HOP, a core project of the BioSHaRE FP7 project (www.bioshare.eu). HOP study members are evaluating the prevalence of Healthy Obesity, assessing lifestyle “risk factors for Healthy Obesity and the clinical consequences of (Healthy) Obesity.” 21 According to the latest BioSHaRE-EU statistics, 11 studies from 8 countries contributed agreed-to variables where DataSHIELD was used as an analysis pilot. 2 As noted, DataSHIELD allows all personal data to be retained at the original institution, behind existing firewalls and other security systems, and only used within the consent given by research participants and according to any ethics or scientific approvals that were required.
As a proof of concept, an ethicolegal analysis was carried out on DataSHIELD. 20 It reviewed case law, data protection laws, and other applicable administrative guidance in the United Kingdom and concluded that DataSHIELD, “…provide[s] a flexible means of interrogating data, while protecting participants' confidentiality, in accordance with applicable legislation.” 20 However, this analysis brought into focus the difficulties raised earlier. DataSHIELD itself does not define anonymization; it follows best practices in epidemiology 22 and relies on the participating studies to follow local rules for data protection.
In the single-country analysis, DataSHIELD met the applicable rules and regulations in the United Kingdom for the use of “anonymized” and “anonymous” data, according to the UK Data Protection Act 1998, as interpreted by administrative and governmental bodies, such as the UK Information Commissioner's Office. 16 DataSHIELD, however, is being used in multinational studies.
As shown in Table 1, the legal and other definitions for how data should be prepared for sharing (e.g., coded or anonymized) may not match across projects. It is recognized that these excerpts are for the most part translations into English and therefore could be interpreted differently in the original language. Currently, DataSHIELD researchers expect each project to conform to local institutional and national requirements. Assuming an adequate translation, Table 1 shows that “an understanding of what one means” by these terms has enabled the research to be done. Allowing countries to make individual interpretations and using them in research practice, such as in the consenting process, may be seen as preferable to trying to force nations to agree on specific terms.
Research has shown that participants often do not completely understand or remember the consent process and instead put their trust in the institution and people carrying out the work. 23 Therefore, a reliance on the consent discussion or on forms may not be sufficient to inform potential participants of the measures that will be used to protect their data in their own country. It can be argued that it may be completely inadequate to explain how data will be shared internationally. In BioSHaRE, this work was done in European countries that are under a common data protection regime (notwithstanding individual interpretations of EU Directives). The transfer of data into countries without similar robust data protection legislation, such as the United States, may be more troubling for participants. By relying on the anonymization process, without clarity about what that actually means, initiatives such as DataSHIELD may be jeopardized because the foundation on which they are built is not stable. Relying on a common understanding may be more akin to building on shifting sands rather than on the solid and level surface that clear, agreed definitions would provide.
Why are we concerned with clarifying terminology?
Identifiability and willingness of people to share information about themselves, and how they see that data, are complex issues. Generally, a researcher may perceive an anonymized dataset as just data, while a participant may still regard it (if they think of it at all) as part of themselves. They may have differing expectations and understandings of what form the data takes and on how it will be used and shared. If a laptop is left on an airplane with an anonymized dataset from a biobank, the possible harms that might result from such a mistake will vary depending on the extent and nature of the data that has been disclosed. If everyone agreed on a definition for anonymized and recognized that it provided adequate protection, and that fact had been explained to, and was understood by, participants, then the potential fallout from such an event and how it might impact those biobank participants would be more understandable to them. However, the reporting of such mistakes may not be so nuanced and people may react badly. Should such events become more common, there may eventually be a backlash against research.
At this time, little research has been done on participants' understanding of what linkage means, in terms of linking biobank data with medical records, hospital admissions, prescribing records, etc. Researchers will know that linking together successive instances of care will help them better analyze, for example, the effectiveness of a particular intervention. Yet as more data are linked together, identifying any individual in a dataset becomes easier. As linkage starts to include data on potentially sensitive issues such as sexual preferences, psychological history, criminal convictions, and personal genetics, as well as the more commonly collected data such as on employment and education, it becomes more critical that individuals understand what is being done to their data when they are told that it has been “anonymized” and how it is being linked and used. This has the added benefit of allowing the public to add meaningfully to the debates around data use.
Without better public understanding of data use, the danger posed by continuing to use anonymization processes without clarification may lie not in willful reidentification, but in the erosion of public trust. There are examples of deidentification studies where experts have shown that they can reidentify individuals from large datasets7,24; the databases that have been used in these studies include hospital records, census records, large-scale genomic research studies, blogs, and online user ratings. Yet in these kinds of instances, specialist expertise and the will to reidentify were needed; this is not the case for most people accessing data. Data transfer agreements require researchers accessing potentially identifying data to agree not to reidentify, and this is accepted practice.
The number of breaches in data security will vary from country to country and reporting will most likely not be complete, but a recent review by Laurie and colleagues of harms in relation to health and biomedical data showed that evidence of direct harm was scant. 25 For the instances that were identified, “…careless or negligence conduct—through maladministration or human error—rather than intentional and willful abuse of data, [was] overwhelmingly the cause of harm/impact.” 25 This will be reassuring from the point of view of deliberate reidentification, but it does show that with the ever-growing reliance on the transfer of data, the linking of datasets, and the growing size of datasets, mistakes and misuse will happen. If the harm does not manifest itself in the revealing of a person's identity, the harm may be in the loss of trust individuals have in the institutions holding and sharing their data due to a lack of understanding of the processes involved.
In addition, different projects will need different approaches to protecting or sharing data. DataSHIELD, for example, recognizes that reporting small numbers of individual cases that meet certain criteria (i.e., relating to fewer than five instances of a particular condition in a specific population) could be potentially identifying. 22 Instead, such data are either not reported or are combined with other data to protect against possible reidentification. However, in the rare disease community, there may only be three or four individuals who meet certain criteria and the point of the investigation may be to identify variables that contribute to their condition. As science changes, we need to be flexible in our research strategies. Constants, such as agreed data categories, will allow us to better react to the changing nature of research, while still being able to give (understandable) assurances that necessary protections are in place.
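The small-numbers rule mentioned above can be sketched in Python as follows. The threshold of five is taken from the example in the text; the function names and the "other" bucket label are hypothetical, and this is a simplified illustration rather than DataSHIELD's actual disclosure control.

```python
from typing import Dict, Optional

MIN_CELL_COUNT = 5  # assumed threshold, from the "fewer than five" example


def suppress_small_cells(
    counts: Dict[str, int], threshold: int = MIN_CELL_COUNT
) -> Dict[str, Optional[int]]:
    """Report a count only if it meets the threshold; otherwise report None."""
    return {k: (c if c >= threshold else None) for k, c in counts.items()}


def merge_small_cells(
    counts: Dict[str, int], threshold: int = MIN_CELL_COUNT, label: str = "other"
) -> Dict[str, Optional[int]]:
    """Combine all below-threshold cells into one bucket.

    Assumes `label` is not already a key in `counts`. If the combined
    bucket is itself below the threshold, it is suppressed as well.
    """
    kept: Dict[str, Optional[int]] = {
        k: c for k, c in counts.items() if c >= threshold
    }
    pooled = sum(c for c in counts.values() if c < threshold)
    if pooled:
        kept[label] = pooled if pooled >= threshold else None
    return kept


counts = {"A": 12, "B": 3, "C": 4, "D": 7}
suppressed = suppress_small_cells(counts)  # B and C become None
merged = merge_small_cells(counts)         # B and C become "other": 7
```

A fixed threshold like this would hide every group in a rare disease study with only three or four eligible individuals, which is why the text argues for flexibility in how protections are applied.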
As noted, an ethicolegal analysis on DataSHIELD has only been conducted in the United Kingdom; so it is not possible to report definitively what difficulties arise when used in multinational studies. While the lack of agreed-on definitions may not halt international collaborative research, it could be argued that having them could help in the creation of international agreements for cross-border transfer of data. They could also potentially ease ethics approval processes, and perhaps help pave the way for international ethics equivalency, 26 as committee members would be reassured that they understand the level of protection being given to data.
Arriving at consistent terminology would help participants, patients, researchers, funders, and others understand how data are being used at any time in the research or dissemination process, in discussions, in consent materials, in lay summaries, or in journal articles. It would also assist those responsible for data protection to clearly see when lines have been crossed and sanctions could be imposed. The Nuffield Council on Bioethics and its Working Party on Health and Biological Data advocate that, in the United Kingdom, the government, “…introduce criminal penalties for deliberate misuse of data whether or not it results in demonstrable harm to individuals.” 27 The only way in which such sanctions will be possible will be if there are clear definitions of the categories of data used, or misused, in any particular situation.
In Europe, adding definitions to the draft General Regulation on Data Protection may be one way to put formal rules in place; these could require the adoption of the ICH E-15 EMEA definitions, as these have already been agreed by the EU, Japan, and the United States. 10 However, HIPAA has been cited as an example of the difficulties that may be encountered when relying on legislation. The Privacy Rule “…allows a covered entity to deidentify data by removing …18 elements that could be used to identify the individual or the individual's relatives, employers, or household members…. The covered entity also must have no actual knowledge that the remaining information could be used alone or in combination with other information to identify the individual who is the subject of the information.” 28 With this specific list of 18 categories of identifiers, including names, addresses, and telephone numbers, it attempts to ringfence personal data to protect it. Yet once any list is in place, there will inevitably be items that fall outside it.
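The limitation of a fixed list can be shown with a short Python sketch. The field names and the record are hypothetical, and the list holds only the three identifier types named above rather than all 18 in the Privacy Rule. The sketch also does not model the Rule's separate "no actual knowledge" condition.

```python
from typing import Any, Dict, FrozenSet

# Only the three identifier types named in the text; the Privacy Rule lists 18.
LISTED_IDENTIFIERS: FrozenSet[str] = frozenset({"name", "address", "telephone"})


def strip_listed(
    record: Dict[str, Any], listed: FrozenSet[str] = LISTED_IDENTIFIERS
) -> Dict[str, Any]:
    """Drop every field whose name is on the list; keep everything else."""
    return {k: v for k, v in record.items() if k not in listed}


# Hypothetical record, for illustration only.
record = {
    "name": "Jane Doe",
    "address": "1 Example Street",
    "telephone": "000-0000",
    "rare_diagnosis": "condition X",
    "occupation": "lighthouse keeper",
}
released = strip_listed(record)
# released keeps "rare_diagnosis" and "occupation": neither is on the list,
# yet together they might single out one person in a small community.
```

The function does exactly what the list requires and nothing more, so any identifying field that the list's authors did not anticipate passes through untouched.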
With the greater availability of datasets and the continuing desire to link them, anonymity can continue to be compromised. 6 Instead, discussions might better begin with international groups that can investigate current practices, propose common terminology, and solicit public comment through consultation. Such discussions should include a wide range of stakeholders so that different understandings can be taken into account. It would also be useful to involve those in other disciplines to reflect the increasing interdisciplinarity of the biomedical sciences.
The future for DataSHIELD
DataSHIELD continues to be developed and trialed in a variety of settings. The ethicolegal analysis conducted on DataSHIELD clarified questions related to its conformity, as well as the need for conformity by similar types of tools, with data protection rules in the United Kingdom and highlighted the need to conduct further such analyses around the cross-border transfer of data. Current plans consist of creating a “template” or checklist for projects to use when biobanks are approached to join a DataSHIELD analysis. The first step is to compare the criteria used in the UK analysis to the actual processes undertaken by the eight HOP biobanks to participate in DataSHIELD. This will include questions such as what institutional approvals were needed, what security provisions were in place (e.g., firewalls, data transfer agreements requiring researchers to not attempt to reidentify participants), and whether participants have agreed to the use of their anonymized data in research outside their own institution. This comparison will help check whether the UK analysis missed any important issues that should have been examined. Once a list of criteria is finalized, a template will be created, which will be piloted to ascertain its effectiveness. The definition of what “anonymized” data means for each study will be a key part of the development of the template. Such data could also assist in mapping the four key categories of data (identifiable, coded, anonymized, and anonymous) onto the terminologies being used around the world.
Conclusion
Some may argue that it does not matter whether data are termed coded, anonymized, or pseudonymized, as long as the parties involved understand what is meant by any particular use of any of these terms. However, it has been shown that participants often do not understand clearly to what they have consented 29 and instead they trust the researchers and institution conducting the work. 23 If researchers share categories of data, even inadvertently, which have a label that promises a certain level of protection and this is not the case, there is a potential that individuals' identifying data could be at risk. This could cause a break in the trust relationship that could cause harm to individuals or seriously damage the research process.
Groups are working on ways to establish good practice; the Global Alliance for Genomics and Health (GA4GH, www.genomicsandhealth.org), which has a mandate to encourage international data sharing, is currently working on a lexicon of definitions used in its policies to aid the public, participants, patients, researchers, funders, and others. However, because so many countries have definitions enshrined in primary and secondary legislation, it is difficult to see a future where there is complete agreement on terminologies. While groups such as the GA4GH and projects such as BioSHaRE cannot mandate change, they can set a scene where change might come gradually, through showing that consistency benefits everyone involved. The first step toward that goal needs to be a deeper examination of what anonymization practices mean in the context of international research and how potential participants understand these practices. With such information in hand, the biobanking community can begin to create a harmonized approach to international data sharing.
Acknowledgments
The research leading to these results was supported by the Biobank Standardisation and Harmonisation for Research Excellence in the European Union (BioSHaRE-EU) program, which received funding from the European Union Seventh Framework Programme (FP7/2007–2013) under grant agreement No. 261433.
Author Disclosure Statement
No conflicting financial interests exist.
