Abstract

Challenges/Opportunities Identified During the Workshop
Data Management Challenges
Dataset Standardization and Access Challenges
Community Science Opportunities
Collaborative Funding Strategies
Public Participation Opportunities
Special thanks go to the National Science Foundation (Grant DBI-0969929), Seattle Children's Research Institute (hosting and complimentary support), Mary Ann Liebert Publishers (open access publication of this special issue), and Courtney MacNealy and Andrew Lowe (organizational support).
The analysis started before the participants even arrived. Invitees were encouraged to submit their thoughts on the challenges of data-intensive science (DIS) before the workshop commenced, and a number of them did.
J. Qui focused on scalable programming and algorithms as a means to discuss cloud computing use. D. Hudak et al. contributed an assessment of data-intensive biomedical science from the Ohio Supercomputer Center. P. Bernstein submitted a concise assessment of the issues and potential solutions for data integration in DIS. W. Dubitzky discussed a path toward a knowledge management system to discover scientific knowledge from large-scale data. Y-K. Yu focused on DIS, using proteomics as an example field. Finally, M. McLennan described how HUBzero can help not just with the data, but also with the interpretation of the data.
Summary of Policy Report
It is an exciting time for the field of DIS. It has incredible potential to transform hypothesis-driven science and to create data commons that can benefit all. The benefits will be long lasting, and as such, DIS requires a long-term science policy vision. The policy must be cohesive yet dynamic, with a firm grasp of the global impact it will have.
Summary of Technology Report
Scientists spend too much time manipulating data sets in order to meet the esoteric requirements of nonstandardized computing tools. The fragmented state of analysis leads to wasted resources (time and money). Standardization in both data gathering and data analysis is needed. Commonly used, flexible, and easily accessible analysis processes will support increased collaboration and better use of funding. Each of us can play an important role in meeting these challenges. The funding agencies can provide increased support for the development of sustainable, scalable software and cyberinfrastructure. The research community at large can encourage increased support by recognizing and rewarding such efforts in a manner on par with its recognition of publications.
In addition, the cyberinfrastructure currently available is inadequate to the challenges of DIS. The lack of technologies capable of handling the data is a problem, but removing policy barriers such as piecemeal funding will make it possible to fund the much-needed global system. Clouds may be a partial solution to these challenges; however, even the best cloud system requires surrounding infrastructure to transfer data, and education to inform users about the particular nuances of cloud computing.
Summary of Education Report
Education within and between disciplines such as the life sciences and computer programming or data analysis is needed in order to meet the challenges of DIS. No one person can be fully trained in the remarkable breadth of knowledge needed to navigate today's analysis landscape. Yet the potential of DIS can be met when many different groups work in tandem to offer each other expertise and build on the strengths available. A standard language to support communication is a necessity. Scientists should have a basic understanding of data analysis and data management, and data analysis/computer experts should have a primer on scientific goals and models. This would ensure development efforts are transparent to all members of the team. Funding agencies can enable these developments by recognizing the need for crossdisciplinary education and providing support and leadership for it.
Summary of Communication Report
With DIS, communication between biologists, computer experts, statisticians, and many other scientists is crucial to success. It cannot stop there, however, as scientists and researchers are only one of the five constituencies that need to communicate effectively in order to help DIS reach its potential. The other four constituencies (funders and policymakers, students, the general public, and industry) all play vital roles as a shift in the scientific paradigm takes place. Currently, communication is hampered by a lack of terminology standardization and of crossknowledge of disciplines. Additionally, more funding and rewards for collaborative, multidisciplinary, data-centric, and open projects are needed. The scientists on their own are not able to surmount these barriers; however, with effective communication a change will take place.
Summary of Biology Report
There are a number of challenges to fully realizing the potential of DIS within the life sciences. First, there are many and varied data sources, which lead to fragmented, redundant, and incomplete data sets that are difficult to coalesce. Second, there are funding policies that reward the single PI and the length of their publication list rather than results from collaborative science groups with technological or analytical development. Finally, a severe lack of standardization, both in terminology and in analysis methods, stymies the communication needed to build multidisciplinary groups. Steps must be taken to build the community mind-set that will make crossdisciplinary groups possible, both through education and standardization to allow better communication, and through funding policy changes to recognize the importance of data analysis and storage processes in today's sciences.
Summary of Bioinformatics Report
Bioinformatics is increasingly becoming DIS. Given the size of the data sets, it is becoming more and more difficult to manage, transfer, and analyze these data. Groups are looking toward cloud computing as a possible solution, but there are barriers to its adoption. The life sciences in particular have very heterogeneous data types and a lack of data standards. The latter makes it difficult for experts outside the life sciences, such as data analysis experts, to understand the issues and work with the data. Hence, analysis methods are often piecemeal pipelines, and adapting them to cloud computing is not easy. In addition, the storage and transfer of data to and from the cloud is cost- and time-prohibitive. To utilize cloud computing in research, bioinformaticians require a suite of tools that make the cloud simple and accessible. Additionally, bioinformaticians need to provide libraries of tools that run on the cloud but are easy for biologists to use and invoke locally.
The first workshop raised many important points about the state of the computational resources available to the scientific community for data-intensive biological discovery. It is with keen anticipation that we look forward to the second workshop, to be held May 16–17, 2011, in Bethesda, Maryland. The specific aims of the second workshop are to identify solutions and build a roadmap for the future. We will address the following general questions:
How can we further conceptualize common needs and solutions across the bioinformatics and other research communities that integrate data, software, and high-performance computing to better enable large-scale analysis and data management and address biological grand challenges?
What are emerging needs unique to these communities and how do we communicate requirements in compelling ways to those who fund and develop enabling capabilities?
How can the research communities capitalize on existing and planned high-performance computing resources to further data-intensive biological research and best leverage cyberinfrastructure, capabilities, opportunities, and cost-savings these resources may present?
The research community needs solutions to its data dilemmas, and with the right minds working together, solutions will be found.
Author Disclosure Statement
The author declares that no conflicting financial interests exist.
