Abstract

Dear Editor:
The above steps are harder when data originate from different sources, when they are poorly documented, and when they are so large that moving them is impractical. A lot of technology is available to perform these integration steps; a short survey appears in Bernstein and Haas (2008). However, this technology is mostly targeted at professional information technology staff at large enterprises, not at the science community. That is, much of it is expensive and not open source, is primarily intended for use with relational databases and OLAP data warehouses, and does not support important scientific data types such as sequences and multidimensional arrays.
The problems associated with data integration in the sciences are nicely summarized in a report of a National Academies workshop in 2009 (Weidman and Arrison, 2010). The report includes experiences from major science projects in astronomy, biomedical computing, climate research, and high-energy physics. It also includes experiences of data management researchers who work on generic data-integration technology. The report is a good starting point for understanding the problems of data integration in the sciences and the promising solutions to them. Some of the improvements that were suggested are the following:
Provide incentives to researchers to make their data reusable by documenting, cleaning, and packaging it in a form that others can access. Incentives might include extra funding for the work and providing recognition to those who do it.
To simplify data discovery, develop data registries and better Web search to find structured data.
Develop data transformation libraries for data conversion and semantic integration.
Develop domain-specific repositories where data is archived, with search tools to browse them.
Footnotes
Author Disclosure Statement
The author declares that no conflicting financial interests exist.
