Special Issue on Data-Intensive Science

Abstract

What do you get when you gather approximately 100 people from different disciplines, different backgrounds, and different institutions into one workshop and ask them to identify issues, problems, and challenges? No matter what the topic, plenty of discussion will follow. In this case, the workshop was the NSF-supported Data-Intensive Science (DIS) Workshop held at Seattle Children's Research Institute on September 19–20, 2010. The imperative nature of this topic was further reinforced by a recent issue of Science (Feb. 11, 2011). The workshop participants included researchers, funding representatives, computer scientists, analysis and computational experts, statisticians, educators, business leaders, and policy makers. The end result? A thorough examination of the state of data-intensive biological discovery and the computing environment in which it exists, and the challenges it faces.

Challenges/Opportunities Identified During the Workshop

Data Management Challenges

Dataset Standardization and Access Challenges

Community Science Opportunities

Collaborative Funding Strategies

Public Participation Opportunities

Special thanks go to the National Science Foundation (Grant DBI-0969929), Seattle Children's Research Institute (hosting and complimentary support), Mary Ann Liebert Publishers (open access publication of this special issue), and Courtney MacNealy and Andrew Lowe (organizational support).

The analysis started before the participants even arrived. Invitees were encouraged to submit their thoughts on the DIS challenges before the workshop commenced, and a number of them did.

J. Qui focused on scalable programming and algorithms as a means to discuss cloud computing use. D. Hudak et al. contributed an assessment of data-intensive biomedical science from the Ohio Supercomputer Center. P. Bernstein submitted a concise assessment of the issues and potential solutions for data integration in DIS. W. Dubitzky discussed a path toward a knowledge management system to discover scientific knowledge from large-scale data. Y-K. Yu focused on DIS, using proteomics as an example field. Finally, M. McLennan described how HUBzero can help not just with the data but also with the interpretation of the data.

Summary of Policy Report

It is an exciting time for the field of DIS. It has incredible potential to transform hypothesis-driven science and to create data commons that can benefit all. The benefits will be long lasting, and as such, DIS requires a long-term science policy vision. The policy must be cohesive yet dynamic, with a firm grasp of the global impact it will have.

Summary of Technology Report

Scientists spend too much time manipulating data sets in order to meet the esoteric requirements of nonstandardized computing tools. The fragmented state of analysis leads to wasted resources (time and money). Standardization in both data gathering and data analysis is needed. Commonly used, flexible, and easily accessible analysis processes will support increased collaboration and better use of funding. Each of us can play an important role in these challenges. The funding agencies can provide increased support for the development of sustainable, scalable software and cyberinfrastructure. The research community at large can encourage increased support by recognizing and rewarding such efforts in a manner on par with its recognition of publications.

In addition, the cyberinfrastructure currently available is inadequate to the challenges of DIS. Although the lack of technologies capable of handling the data is a problem, removing policy barriers such as piecemeal funding will make it possible to fund the much-needed global system. Clouds may be a partial solution to these challenges; however, even the best cloud system requires surrounding infrastructure to transfer data, and education to inform users about the particular nuances of cloud computing.

Summary of Education Report

Education within and between disciplines such as the life sciences and computer programming or data analysis is needed in order to meet the challenges of DIS. No one person can be fully trained in the remarkable breadth of knowledge needed to navigate today's analysis landscape. Yet the potential of DIS can be met when many different groups work in tandem, each contributing its expertise and building on the strengths available. A standard language to support communication is a necessity. Scientists should have a basic understanding of data analysis and data management, and data analysis/computer experts should have a primer on scientific goals and models. This would ensure that development efforts are transparent to all members of the team. Funding agencies can enable these developments by recognizing the need for crossdisciplinary education and by providing support and leadership for it.

Summary of Communication Report

With DIS, communication between biologists, computer experts, statisticians, and many other scientists is now crucial to success. It cannot stop there, however: scientists and researchers are only one of the five constituencies that need to communicate effectively in order to help DIS reach its potential. The other four constituencies (funders and policymakers, students, general public, and industry) all play vital roles as a shift in the scientific paradigm takes place. Currently, communication is hampered by a lack of terminology standardization and crossknowledge of disciplines. Additionally, more funding and rewards for collaborative, multidisciplinary, data-centric, and open projects are needed. The scientists on their own are not able to surmount these barriers; however, with effective communication a change will take place.

Summary of Biology Report

There are a number of challenges to fully realizing DIS potential within the life sciences. First, there are many and varied data sources, which lead to fragmented, redundant, and incomplete data sets that are difficult to coalesce. Second, there are funding policies that reward the single PI and the length of that PI's publication list rather than results from collaborative science groups with technological or analytical development. Finally, a severe lack of standardization, both in terminology and in analysis methods, stymies the communication needed to build multidisciplinary groups. Steps must be taken to build the community mind-set that will make crossdisciplinary groups possible, both through education and standardization to allow better communication, and through funding policy changes to recognize the importance of data analysis and storage processes in today's sciences.

Summary of Bioinformatics Report

Bioinformatics is increasingly becoming DIS. Given the size of the data sets, it is becoming more and more difficult to manage, transfer, and analyze these data. Groups are looking toward cloud computing as a possible solution, but there are barriers to its adoption. The life sciences in particular have very heterogeneous data types and a lack of data standards. The latter makes it difficult for experts outside the life sciences, such as data analysis experts, to understand the issues and work with the data. Hence, analysis methods are often piecemeal pipelines, and adapting them to cloud computing is not easy. In addition, the storage and transfer of data to and from the cloud are cost- and time-prohibitive. To utilize cloud computing in research, bioinformaticians require a suite of tools that make the cloud simple and accessible. Additionally, bioinformaticians need to provide libraries of tools that run on the cloud but are easy for biologists to use and invoke locally.

The first workshop raised many important points about the state of the computational resources available to the scientific community for data-intensive biological discovery. It is with keen anticipation that we look forward to the second workshop, to be held May 16–17, 2011, in Bethesda, Maryland. The specific aims of the second workshop are to identify solutions and build a roadmap for the future. We will address the following general questions:

How can we further conceptualize common needs and solutions across the bioinformatics and other research communities that integrate data, software, and high-performance computing to better enable large-scale analysis and data management and address biological grand challenges?

What are emerging needs unique to these communities and how do we communicate requirements in compelling ways to those who fund and develop enabling capabilities?

How can the research communities capitalize on existing and planned high-performance computing resources to further data-intensive biological research and best leverage the cyberinfrastructure, capabilities, opportunities, and cost savings these resources may present?

The research community needs solutions to its data dilemmas, and with the right minds working together, solutions will be found.

Footnotes

Author Disclosure Statement

The author declares that no conflicting financial interests exist.