Abstract

Inappropriate use of genomic data poses specific risks because the data can be used to identify an individual. Because each person can be identified by the variations in their genome, even databases of deidentified data can be used, in combination with other databases, to reidentify individuals (Erlich and Narayanan, 2014; Bustamante and Shringarpure, 2015). As the technology progresses, correlations between phenotype and genotype will be more easily ascertained and associated with an individual. In fact, in a “vulnerability research” experiment, a team at MIT's Whitehead Institute was able to identify by full name 50 individuals who had participated in genomic studies, using only a computer, Internet access, and public resources. Other studies have shown that an individual can be identified even if only a distant relative submits DNA (Fearer, 2013). Harvard Medical School's Personal Genome Project (PGP) takes a different approach: it is based on the premise that guaranteeing privacy is impossible. When recruiting volunteers, PGP informs them of the benefits and risks of participating, which are potentially dramatic. PGP publishes participants' demographic and clinical information, their genomic sequences, and, if they wish, their names and headshots. George Church, the geneticist in charge, calls this open consent (Kupersmith, 2013).
There is a strong argument for collecting and/or archiving genomic data in large databases. It is widely believed that “big” data will propel research forward. Largely because of practical and financial barriers, no single organization or laboratory will collect data sets large enough to truly accelerate the science that is critical to understanding the genome. Thus, it is important that these scarce resources are used responsibly (National Institutes of Health, 2015).
These large databases bring challenges. It is not possible, and it may not be desirable, to aggregate these data in a single large centralized repository. Therefore, federated models are used and/or proposed by such entities as the Global Alliance for Genomics and Health. The federated model poses a significant challenge of its own: it is difficult to ascertain whether data are duplicative. There are two obvious ways to mitigate this problem. One would be to give everyone a unique identifier from a central generator of some sort, for example, the National Institutes of Health's Global Unique Identifier (National Institutes of Health, 2015). Another would be to put the participant at the center, much as in banking, and allow participants to control access to their health information in the context of their needs (Terry, 2013). Combinations of these models also offer interesting solutions.
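The first mitigation can be illustrated with a minimal sketch. It assumes a hypothetical central generator that derives a stable pseudonymous identifier from normalized identifying fields using a keyed hash, so that federated sites can compare identifiers rather than names. The function names, the choice of fields, and the key are illustrative assumptions; this is not the NIH Global Unique Identifier algorithm.

```python
import hashlib
import hmac
from collections import defaultdict


def make_participant_id(given_name, family_name, birth_date, key):
    """Derive a stable pseudonymous ID from normalized identifying fields.

    Hypothetical scheme for illustration only; not the NIH GUID algorithm.
    """
    # Normalize so that trivial formatting differences map to the same ID.
    normalized = "|".join(
        part.strip().lower() for part in (given_name, family_name, birth_date)
    )
    return hmac.new(key, normalized.encode("utf-8"), hashlib.sha256).hexdigest()


def find_duplicates(site_ids):
    """Map each ID held by more than one site to the sorted list of those sites."""
    sites_by_id = defaultdict(set)
    for site, ids in site_ids.items():
        for participant_id in ids:
            sites_by_id[participant_id].add(site)
    return {
        pid: sorted(sites)
        for pid, sites in sites_by_id.items()
        if len(sites) > 1
    }


# Assumption: the secret key is held only by the central generator.
KEY = b"demo-key-held-by-central-generator"

# Fictional participants; site_b records the first one with different formatting.
site_ids = {
    "site_a": {
        make_participant_id("Jane", "Example", "1980-01-15", KEY),
        make_participant_id("John", "Sample", "1975-06-30", KEY),
    },
    "site_b": {
        make_participant_id(" jane ", "EXAMPLE", "1980-01-15", KEY),
        make_participant_id("Maria", "Placeholder", "1990-11-02", KEY),
    },
}

duplicates = find_duplicates(site_ids)
```

In this sketch each site shares only the derived identifiers, so duplicate participants can be detected without exchanging names. A real scheme would also need to handle misspellings, name changes, and governance of the key, none of which are addressed here.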
Access to large sets of data will likely enable researchers to predict who might develop a condition and then appropriately personalize treatment (Collins and Varmus, 2015). Data from large populations increase the probability of finding correlations between genotype and phenotype. If reasonable privacy and security protections are in place and databases maintain transparency, the risks are manageable in light of the benefits gained through research.
Today, there is a plethora of privacy and security policies for various databases. For example, in Estonia, a government project is creating a database that includes genetic information and aims to involve three quarters of the country's population. It will be used in large-scale association studies (Frank, 2015). In another example, Kaiser Permanente's institutional review board oversees its collection (Kaiser Permanente, 2015). Mayo Clinic's Biobank privacy policy says that samples will not be stored with a name, address, birth date, social security number, or Mayo Clinic number. It also notes that, in the case of reidentification, the Genetic Information Nondiscrimination Act (GINA) of 2008 (Library of Congress, 2008) offers protection (Mayo Clinic, 2015).
The diversity of protections and rules is complex. Ultimately, privacy and security cannot be guaranteed. Despite these risks, large-scale sharing of genomic information and associated clinical information is essential to accelerate biomedical research. President Barack Obama announced the Precision Medicine Initiative (PMI) in his State of the Union address in 2015. Throughout 2015, the NIH, White House, and FDA have worked to flesh out what this effort will entail. A report was issued in September 2015 (National Institutes of Health, 2015). The PMI calls for a cohort of more than 1 million people, many of whom will be sequenced and all of whom will contribute health information, data from wearables, and environmental data. This cohort will form the foundation for a longitudinal study intended to answer many questions. Critical to all of this is the principle that people should be treated as partners. There must be a high degree of authentic engagement and transparency. The stakes are high for those who suffer. PMI and other large cohorts that collect genomic data are critical to alleviating this suffering.
