Abstract

“Billions and billions”—the phrase has come to be associated with astronomer and science popularizer Carl Sagan. It brings to mind the vast number of stars in the observable universe. But there is another universe, one that is even vaster, in its own way, than interstellar space. This universe is not “astronomical.” It is, rather, “genomical.” And as its immensity becomes better appreciated, it may become the adjective of choice in descriptions of the extraordinarily numerous.
The genomical universe is expanding in data space, say researchers at Cold Spring Harbor Laboratory and the University of Illinois at Urbana-Champaign. These researchers, projecting to the year 2025, compared genomics with three other major generators of Big Data: astronomy, YouTube, and Twitter. Their estimates show that genomics is a “four-headed beast”: it is either on par with or the most demanding of the domains in terms of data acquisition, storage, distribution, and analysis.
The researchers presented their findings July 7 in PLOS Biology, in a perspective article entitled, “Big Data: Astronomical or Genomical?” Besides quantifying the Big Data immensity of genomics, the article discussed the new technologies that will need to be developed to meet the computational challenges that genomics poses for the near future. Essentially, the article issued a call for a concerted, community-wide approach to the “genomical” challenges of the next decade.
All of the fields the team compared, from social media on the Internet to astronomy, are generating huge quantities of electronic data, on the order of tens to hundreds of petabytes per year. A petabyte is one quadrillion bytes, a 1 followed by 15 zeros; it is 1,000 times more bytes than a terabyte, roughly the amount of storage you might have on your home computer. And, says the team, all of the fields are on rapidly upward-sloping growth curves.
YouTube actually generates the most data right now, about 100 petabytes a year. But genomics is not far behind and growing much more rapidly. At the current rate, the quantity of genomics data produced daily is doubling every 7 months. By 2025, genomics output is projected to range between 2 and 40 exabytes per year, depending on the rate of doubling. One exabyte is the equivalent of 1,000 petabytes, about a million times more data than you can store on your home computer.
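The unit conversions and the doubling arithmetic above can be checked with a few lines of Python. This is a minimal sketch, not the researchers' projection model: it assumes decimal (SI) byte units and a perfectly steady doubling period, and the function names `projected_rate` and `annual_growth_factor` are hypothetical, introduced here only for illustration.

```python
# Decimal (SI) byte units, matching the definitions used in the article.
TERABYTE = 10**12
PETABYTE = 10**15  # a 1 followed by 15 zeros
EXABYTE = 10**18


def projected_rate(start_rate: float, months_elapsed: float,
                   doubling_months: float) -> float:
    """Return the data rate after `months_elapsed`, assuming (as a
    simplification) that it doubles steadily every `doubling_months`."""
    return start_rate * 2 ** (months_elapsed / doubling_months)


def annual_growth_factor(doubling_months: float) -> float:
    """Return how many times larger the rate is after 12 months."""
    return 2 ** (12 / doubling_months)


if __name__ == "__main__":
    print(EXABYTE // PETABYTE)  # petabytes per exabyte
    print(EXABYTE // TERABYTE)  # terabytes per exabyte
    # Growth multiplier over one year if the rate doubles every 7 months.
    print(round(annual_growth_factor(7), 2))
    # A normalized starting rate of 1.0 after two doubling periods.
    print(projected_rate(1.0, 14, 7))
```

Under these assumptions, a 7-month doubling period works out to a little more than a threefold increase per year. Because the article gives a range of 2 to 40 exabytes that depends on the doubling rate, the doubling period is left as a parameter rather than fixed.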
“Genomics clearly poses some of the most severe computational challenges facing us in the next decade,” wrote the authors of the PLOS Biology article. “Genomics is a ‘four-headed beast,’ considering the computational demands across the lifecycle of a dataset: acquisition, storage, distribution, and analysis.”
Like data that flows over the Internet, biological data that is the raw material of genomics is highly distributed. That means it's generated and consumed in many locations. Unlike Internet data, however, which is formatted according to a few standard protocols, genomic data is compiled in many different formats, a fact that threatens its broad intelligibility and utility.
“In human health, the major needs are driven by the realization that for precision medicine and similar efforts to be most effective, genomes and related ’omics data need to be shared and compared in huge numbers,” the authors continued. “If we do not commit as a scientific community to sharing now, we run the risk of establishing thousands of isolated, private data collections, each too underpowered to allow subtle signals to be extracted. More than anything else, connecting these resources requires trust among institutions, scientists, and the public to ensure the collections will be used for medical purposes and not to discriminate or penalize individuals because of their genetic makeup.”
One of the study's authors, Michael Schatz, an associate professor at the Simons Center for Quantitative Biology at Cold Spring Harbor Laboratory, is especially interested in the problem posed by obtaining hundreds of millions, even billions, of human full-length genome sequences. The problem, he said, is not really speed, which will grow rapidly and predictably. Rather, it is in figuring out how to align and represent different genomes so that they might be compared.
“The point of sequencing a billion genomes is not really to make a billion separate lists saying, ‘If you have these variants, you have the following risks.’ Of course, individuals will want to look at the list of DNA variants they possess. But the real power of having 1 billion human genomes comes from ways of comparing them and combining layers of analysis,” explained Dr. Schatz. “Our belief is, by combining all this information, patterns will emerge, in the same way that when Mendel grew tens of thousands of pea plants, at the dawn of genetics 150 years ago, he was able to formulate laws of inheritance by looking at patterns for how specific traits were inherited.”
“Genomics is a game-changing science in so many ways,” Dr. Schatz concluded. “My colleagues and I are saying that it's important to think about the future so that we are ready for it.”
