Data Integration for Data-Intensive Science

Dear Editor:

Performing analytic computation on data (scientific or otherwise) requires many preliminary, time-consuming steps to identify and integrate the relevant inputs. A typical sequence of steps is the following.

Data discovery—Find the data sets that are relevant to the question at hand. This may involve Web search and networking with other researchers, followed by deeper study to determine whether the data is really appropriate for the intended use.

Obtain access to data—Determine how to access the data in a desired format. This may involve retrieving the data or enabling the invocation of operations on the data in its home location. It may also involve legal issues, such as data ownership and privacy.

Convert data—Transform the data into a format that analytic tools can cope with. This may be as simple as loading a single densely populated numerical table from a spreadsheet into a database. But often the necessary transformations are more complex (e.g., for a sparsely populated spreadsheet with many tabs, irregular tables, and a mixture of text and numbers). The transformations may require extracting structured information from unstructured (e.g., text) or semistructured (e.g., XML or HTML) sources.

Semantic integration—Determine the meaning of data fields and transform them so they can be compared or combined. This may involve cleaning the data to eliminate inconsistencies, deal with missing values, and identify erroneous values. This is often the most time-consuming step of data integration.

Labeling data products—The result of each computation should be labeled with metadata, so the provenance of each data product can be traced and the result can be integrated in future scientific analyses. Examples of useful metadata are the version of the data sources and where to find them, the software that performed the computation, and the person responsible for the computation.
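The conversion, semantic integration, and labeling steps above can be sketched in a few lines of Python. This is only a minimal illustration under stated assumptions, not a method taken from the cited reports: the two sources, the field names (`site`, `station`, `temp_f`, `temp_c`), the `-999.0` missing-value sentinel, and the provenance keys are all hypothetical.

```python
import csv
import io
from datetime import datetime, timezone

# Hypothetical source A: CSV text with Fahrenheit readings and one missing value.
SOURCE_A = "site,temp_f\nA1,68.0\nA2,\nA3,212.0\n"
# Hypothetical source B: already-parsed records with Celsius readings.
SOURCE_B = [{"station": "B1", "temp_c": 21.5}, {"station": "B2", "temp_c": -999.0}]


def convert_a(text):
    """Convert step: parse CSV text into a list of dicts."""
    return list(csv.DictReader(io.StringIO(text)))


def integrate(rows_a, rows_b, sentinel=-999.0):
    """Semantic integration: map both sources onto one schema (site, temp_c)."""
    merged = []
    for row in rows_a:
        raw = row["temp_f"].strip()
        # Missing values become None rather than being silently dropped.
        temp_c = round((float(raw) - 32.0) * 5.0 / 9.0, 2) if raw else None
        merged.append({"site": row["site"], "temp_c": temp_c, "source": "A"})
    for row in rows_b:
        # Treating the sentinel as "missing" is an assumption about source B.
        temp_c = None if row["temp_c"] == sentinel else row["temp_c"]
        merged.append({"site": row["station"], "temp_c": temp_c, "source": "B"})
    return merged


def label(records, sources, software, person):
    """Labeling step: attach provenance metadata to the data product."""
    return {
        "data": records,
        "provenance": {
            "sources": sources,
            "software": software,
            "responsible": person,
            "created_utc": datetime.now(timezone.utc).isoformat(),
        },
    }


product = label(
    integrate(convert_a(SOURCE_A), SOURCE_B),
    sources=[{"name": "A", "version": "v1"}, {"name": "B", "version": "v3"}],
    software="integrate_demo 0.1",
    person="J. Doe",
)
```

Keeping missing values as explicit `None` entries, rather than discarding the rows, leaves the cleaning decision visible to whoever reuses the product, and the `provenance` record travels with the data so that later analyses can trace where it came from.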

The above steps are harder when data originates from different sources, when it is poorly documented, and when it is so large that moving it is impractical. A lot of technology is available to perform these integration steps; a short survey appears in Bernstein and Haas (2008). However, this technology is mostly targeted at professional information technology staff at large enterprises, not at the science community. That is, much of it is expensive and not open source, is primarily intended for use with relational databases and OLAP data warehouses, and does not support important scientific data types such as sequences and multidimensional arrays.

The problems associated with data integration in the sciences are nicely summarized in a report of a National Academies workshop in 2009 (Weidman and Arrison, 2010). The report includes experiences from major science projects in astronomy, biomedical computing, climate research, and high-energy physics. It also includes experiences of data management researchers who work on generic data-integration technology. The report is a good place to start for an understanding of the problems of data integration in the sciences and of promising solutions to them. Some of the improvements that were suggested are the following:

Provide incentives to researchers to make their data reusable by documenting, cleaning, and packaging it in a form that others can access. Incentives might include extra funding for the work and providing recognition to those who do it.

To simplify data discovery, develop data registries and better Web search to find structured data.

Develop data transformation libraries for data conversion and semantic integration.

Develop domain-specific repositories where data is archived, with search tools to browse them.
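As a rough illustration of the data registry idea in the suggestions above, the following Python sketch stores a few descriptive fields per data set and supports a simple keyword search. The class names, the fields, the `example://` locations, and the matching rule (every search term must appear among an entry's keywords or domain) are hypothetical choices made for this sketch, not a description of any existing registry.

```python
from dataclasses import dataclass, field


@dataclass
class DatasetEntry:
    """One registry record; all fields here are illustrative assumptions."""
    name: str
    domain: str
    location: str
    keywords: set = field(default_factory=set)


class Registry:
    def __init__(self):
        self._entries = []

    def register(self, entry):
        """Add a data set description to the registry."""
        self._entries.append(entry)

    def search(self, *terms):
        """Return names of entries whose keywords or domain contain every term."""
        wanted = {t.lower() for t in terms}
        hits = []
        for e in self._entries:
            haystack = {k.lower() for k in e.keywords} | {e.domain.lower()}
            if wanted <= haystack:
                hits.append(e.name)
        return hits


registry = Registry()
registry.register(
    DatasetEntry("sky-survey", "astronomy", "example://archive/sky",
                 {"photometry", "catalog"})
)
registry.register(
    DatasetEntry("ocean-temps", "climate", "example://archive/ocean",
                 {"temperature", "timeseries"})
)

# Find climate data sets that carry temperature measurements.
matches = registry.search("climate", "temperature")
```

A real registry would need richer metadata (formats, versions, access conditions) and a more forgiving search, but even this small structure shows how documented, packaged data becomes discoverable.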

Footnotes

Author Disclosure Statement

The author declares that no conflicting financial interests exist.

References

Bernstein P.A., Haas. 2008. Information integration in the enterprise. Commun ACM, 51:72–79.

Weidman, Arrison [rapporteurs]. 2010. Steps Toward Large-Scale Data Integration in the Sciences. The National Academies Press: Washington, DC.