Toward a System-Centric Global Knowledge Management Approach to Discovering (Organizing and Sharing) Scientific Knowledge from Large-Scale Data

Abstract

Dear Editor:

Question: Given a complex biological system X, what conditions must be met so that we can claim that X is sufficiently understood? What do the structures look like that would reflect our (sufficiently complete) knowledge about the biological system X? What exactly we mean by “sufficiently understood” is of course a challenging question in its own right. But let it suffice to say that this refers to some form of mechanistic understanding that facilitates robust explanations and predictions. Perhaps a Turing-like test in connection with computerized models is a good starting point to reflect on this challenge (Harel, 2005).

Take, for example, typical model organisms such as the tobacco mosaic virus (virus), Escherichia coli (prokaryote), Chlamydomonas reinhardtii, Saccharomyces cerevisiae, Arabidopsis thaliana, Drosophila melanogaster, and Mus musculus (eukaryotes). A model organism is a nonhuman species that is extensively studied to understand particular biological phenomena, with the expectation that discoveries made in the model organism will provide insight into the workings of other organisms. Arguably, at some point in the future we will be in a position to claim that we have understood these model organisms. If this is the case, what will this mean in terms of the knowledge structures that embody this knowledge? Where will the knowledge be stored? How will it be accessed and shared among scientists (and nonscientists)? How will it be used to answer concrete questions?

Thesis: Knowledge reflecting our understanding about a biological system X will reside in globally accessible scientific databases, information bases, and knowledge bases. We distinguish these systems as follows: Databases refer to data sets created either by biological experiments or by computational simulations of systems biology models. Clearly, as simulations are becoming more commonly used to generate and test hypotheses, the data volumes generated are likely to exceed those of conventional experiments by several orders of magnitude (Brito et al., 2004). Information bases refer to biological data that has been consolidated, integrated, aggregated, formatted, structured, etc., to support specific/automated queries or requests. Typical representatives of this kind of system are repositories storing biological sequences, structures, pathways, etc. Data warehouses, document (scientific papers) repositories, ontologies, and the results of pattern discoveries also fall in this category. Knowledge bases (also known as “executable models”) refer to systems that are capable of numeric or logical inference (“knowledge is action”). Biological systems dynamics models and expert systems are representatives of this type of system (Dubitzky and Azuaje, 2004). Incidentally, the result of data mining (also known as knowledge discovery in databases) activities is often referred to as “knowledge.” Knowledge, however, should be clearly distinguished from patterns (i.e., information), as knowledge is the full utilization of information and data, coupled with the potential of people's skills, competencies, ideas, intuitions, commitments, and motivations. Knowledge is something that can be executed to derive “new” outputs from some input. Data mining results that take an executable form, such as rules, decision trees, or artificial neural networks, can therefore be considered knowledge.
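The distinction drawn above can be illustrated with a minimal sketch. The gene names, the measurement values, and the threshold are invented for illustration only; the point is merely that data are raw values, information is a consolidated structure that supports queries, and knowledge is something executable that derives an output from an input.

```python
from statistics import mean

# Data: raw measurements (hypothetical expression levels per sample).
data = {
    "geneA": [2.0, 2.2, 1.8],
    "geneB": [9.5, 10.1, 10.4],
}

# Information: the data consolidated into a structure that supports queries.
information = {gene: mean(values) for gene, values in data.items()}


# Knowledge: an executable rule that maps an input to a "new" output.
def is_highly_expressed(gene: str, threshold: float = 5.0) -> bool:
    """Illustrative rule; the threshold is an arbitrary assumption."""
    return information[gene] > threshold


print(is_highly_expressed("geneA"), is_highly_expressed("geneB"))
```

In this reading, a mined pattern such as the table of mean values remains information, whereas the rule built on top of it is knowledge because it can be executed.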

Antithesis: The problem with this view is that it implies a serious fragmentation of the knowledge, as the knowledge will be distributed over dozens, hundreds, thousands, or more individual elements with little or no integration. It is doubtful that we could view such a structure as a coherent body of knowledge on a particular system.

Synthesis: Ultimately, the knowledge we have about a particular complex biological system will be stored in some form of computerized model (or multiple models that reflect multiple “versions” of this knowledge), which will be globally available through the Internet. This model will integrate data, information, and knowledge about the system in a coherent way so that it can be queried, shared, updated, and used to provide explanations and predictions. Such a model will be a kind of “one-stop-shop” or Google Earth-like system for a specific biological system.

The following outlines a vision of a system-centric global knowledge management approach to discovering (organizing and sharing) scientific knowledge from large-scale data.

Background

Complex systems (Science, 1999) are defined as systems with many interdependent parts that give rise to nonlinear and emergent properties determining their high-level functioning and behavior. Examples of complex systems in the life sciences include bee hives, bees themselves, the nervous and immune systems, biochemical interaction networks, degenerative diseases, ecosystems, and species populations. Complex systems are also the subject of a wide range of scientific and engineering disciplines other than the life sciences. Conventional experimental, statistical, and engineering techniques are limited in their ability to fully capture and predict the behavior of such complex systems. As a result, physics and other scientific and engineering areas have been constructing and deploying computer-based models and simulations to develop and study complex systems. Modeling and simulation (M&S) approaches rely heavily on concepts, methodologies, and tools from mathematics and computer science designed to integrate and process (often large volumes of) quantitative data obtained from the system under study. Recently, the life sciences have been adopting the M&S framework under a new discipline called systems biology (Science, 2002). Around the turn of the century there was no clear consensus as to what systems biology actually means. Today, the following definition is likely to meet acceptance by a considerable number of scientists: Systems biology refers to the quantitative analysis of the dynamic interactions among several components of a biological system and aims to understand the behavior of the system as a whole. R&D in systems biology involves the development and application of systems theory concepts for the study of complex biological systems through iteration over mathematical modeling, computational simulation, and biological experimentation. Systems biology could be viewed as a tool to increase understanding of biological systems, to develop more directed experiments, and finally to allow predictions.

For example, a systems biology investigation may integrate proteomic, transcriptomic, and metabolomic data to model and simulate the effects of a calorie-restriction intervention in mice or the response of humans to a specific drug. From a data integration perspective, systems biology could be viewed as a data integration “pipeline” flowing across four levels of integration (Klipp et al., 2005):

Data models. The first level of data integration requires the definition of data models and schemas for data representation, data storage, and data exchange.

Data query, information retrieval, and presentation. The second level of data integration deals with data queries and information retrieval, the connection of different data types (coming from different databases or data sources), and the visualization and presentation of data.

Data combination, correlation, and data analysis. The third level of data integration is about combining, correlating, and analyzing data from different sources, possibly relating to different levels of biological organization.

Data networks and models. The highest level of complexity in data integration occurs when data from different sources is combined and integrated into biological interaction networks and dynamic models representing biological components, systems, and processes.
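The four levels above can be sketched as a toy pipeline. This is only an illustration under stated assumptions: the `Measurement` fields, the component names, the sample values, and the correlation threshold are all hypothetical, and a simple correlation network stands in for the far richer interaction networks and dynamic models the fourth level refers to.

```python
from dataclasses import dataclass
from math import sqrt


# Level 1: a data model shared by all sources (field names are assumptions).
@dataclass(frozen=True)
class Measurement:
    component: str  # e.g., a gene or protein identifier
    kind: str       # e.g., "transcript" or "protein"
    sample: str
    value: float


# Level 2: query and retrieval across data types.
def query(records, component, kind):
    """Return the values for one component and data type, ordered by sample."""
    hits = [r for r in records if r.component == component and r.kind == kind]
    return [r.value for r in sorted(hits, key=lambda r: r.sample)]


# Level 3: combination and correlation of data from different sources.
def pearson(xs, ys):
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sqrt(sum((x - mx) ** 2 for x in xs))
    sy = sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)


# Level 4: integration of the results into a (very simple) interaction network.
def build_network(records, pairs, threshold=0.9):
    edges = {}
    for transcript, protein in pairs:
        r = pearson(query(records, transcript, "transcript"),
                    query(records, protein, "protein"))
        if abs(r) >= threshold:
            edges[(transcript, protein)] = r
    return edges


records = [
    Measurement("geneA", "transcript", "s1", 1.0),
    Measurement("geneA", "transcript", "s2", 2.0),
    Measurement("geneA", "transcript", "s3", 3.0),
    Measurement("protB", "protein", "s1", 2.1),
    Measurement("protB", "protein", "s2", 3.9),
    Measurement("protB", "protein", "s3", 6.0),
    Measurement("protC", "protein", "s1", 5.0),
    Measurement("protC", "protein", "s2", 1.0),
    Measurement("protC", "protein", "s3", 4.0),
]
network = build_network(records, [("geneA", "protB"), ("geneA", "protC")])
print(network)
```

Each level consumes the output of the one before it, which is why the text describes the process as a pipeline: without an agreed data model at the first level, the queries, correlations, and networks of the later levels cannot be assembled.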

The effort and activities involved in this data integration workflow for a particular systems biology problem may vary widely depending on the complexity of the problem. Here we focus specifically on those problems that share the following characteristics:

System-centric problem. The problem to be understood or studied is concerned with a clearly defined, concrete biological system. Examples include a particular ecosystem, population or species, organism, organ, or tissue, or a process or subsystem such as the cell cycle or the bile acid xenobiotic system.

Effort over many years. Understanding the system under study typically involves comprehensive activities over a long period of time, usually years or decades.

Effort involving large communities. Typically, large communities of researchers and developers from various disciplines participate in activities concerned with the investigation of the phenomenon or system in question. Often these communities are geographically widely dispersed. It is also common that large numbers of end-users or lay people become part of such a community.

The Problem

“Dislocation,” decontextualization and fragmentation of system-relevant data, information and knowledge

Knowledge management is the collection of processes that govern the creation, dissemination, and utilization of knowledge. In this context, the creation step refers to the data integration “pipeline,” which proceeds from experimental data (either from biological experiments or simulations) to the sharing and utilization of the underlying inputs/outputs of each of the data integration steps (from “raw” data to fully fledged systems dynamics models).

Currently, solutions to large-scale system-centered problems suffer from a serious lack of integration of the underlying human, data, information, and knowledge resources. Researchers who discover new insights disperse their results over a variety of journal and conference publications, biomedical databases, information bases, and knowledge bases. These are typically devoted to a general subject area (e.g., gene sequences, protein structures, etc.) as opposed to being exclusively dedicated to the system under study. Researchers wanting to get hold of relevant data, information, and knowledge invest considerable effort to locate and reintegrate the information in the context of the system under investigation. In other words, instead of being published, shared, and used in the context of the relevant structures of the system in question, the data, information, and knowledge are heavily fragmented, decontextualized, and physically distributed, only to be relocated, recontextualized, and reintegrated by those who need the information. From a knowledge management perspective, this is an extremely poor solution.

Heterogeneity of system-relevant data, information, and knowledge

For the complex biological systems under consideration here, there is often a great wealth of data, information, and knowledge. Typically, the conceptual, logical, and physical models and structures used to describe and format these vary widely. A system-centered approach to integrating all relevant data, information, and knowledge structures into a coherent scheme poses a number of challenges.

The Vision

A system-centric global knowledge management approach to discovering (organizing and sharing) scientific knowledge from large-scale data (SKM)

Key elements of an SKM approach include (see Fig. 1):

FIG. 1. Outline of components of SKM.

A semantically integrated heterogeneous knowledge space that enables users to deposit new knowledge about the underlying complex biological system. Importantly, such a knowledge space needs to be organized around a systems model of the underlying biological system, as this will enable an optimal computer-based approach directly supporting the scientific process and knowledge management tasks (e.g., sharing, pattern/knowledge discovery from data, inference, scientific discovery). Simply put, the scientific process consists of the linking of scientific information (data) and theory (knowledge). Another way to characterize the scientific “method” is as the body of techniques for investigating phenomena, acquiring new knowledge, or correcting and integrating previous knowledge. Furthermore, this knowledge space needs to be able to integrate and handle structured (e.g., databases), semistructured (e.g., natural language texts), and unstructured (e.g., signals) data, information, and knowledge.

Mechanisms facilitating the update or evolution of such a knowledge space (i.e., learning). Typically, such updates would be made in a manual or semiautomated fashion by users in response to newly discovered knowledge. Therefore, it is critical to provide mechanisms that facilitate interactive search and exploration of such a heterogeneous, complex-structured knowledge space.

Mechanisms facilitating automated knowledge discovery from the underlying knowledge space. This poses a novel challenge as such mechanisms would “see” a heterogeneously structured continuum (data, information, knowledge structures), which is organized around a system model of the underlying complex biological system.

Mechanisms that exploit the semantically integrated information of SKM knowledge spaces to facilitate functions and tasks like decision support, what-if queries, simulations, and problem solving. The current lack of a system-centered knowledge management approach to complex biological systems applications does not allow the efficient support and automation of such functions. Instead, users must assemble all relevant information and create ad hoc models of the underlying system.

A model and supporting computing technology for sharing the knowledge of the biological system among a globally dispersed community of researchers and users. This aspect is intrinsically covered by the SKM concept, which does away with the inflexible, inefficient, and fragmented knowledge handling currently used in the life sciences by organizing knowledge in a system-centered way around a globally accessible resource. Its technical realization requires an open, scalable computing platform that supports large, geographically distributed communities of scientists and other users.

Mechanisms facilitating the maintenance, versioning, and information provenance of an SKM knowledge space.
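A few of the elements listed above (organization around a system model, deposit of new knowledge, versioning, and provenance) can be sketched in a handful of lines. This is a hypothetical toy structure, not a proposed SKM design: the class and method names, the item keys, and the contributor labels are assumptions made for illustration, and semantic integration, automated discovery, and distributed access are not covered.

```python
from dataclasses import dataclass

KINDS = {"data", "information", "knowledge"}


@dataclass(frozen=True)
class Item:
    kind: str      # "data", "information", or "knowledge"
    content: str   # placeholder for a data set, record, or executable model
    source: str    # provenance: who or what contributed this version
    version: int


class KnowledgeSpace:
    """Toy knowledge space organized around the components of one system."""

    def __init__(self, system, components):
        self.system = system
        self.components = set(components)
        self._history = {}  # (component, key) -> list of Item versions

    def deposit(self, component, key, kind, content, source):
        # Every item must attach to a part of the system model.
        if component not in self.components:
            raise KeyError(f"{component!r} is not part of the {self.system} model")
        if kind not in KINDS:
            raise ValueError(f"unknown kind: {kind!r}")
        versions = self._history.setdefault((component, key), [])
        item = Item(kind, content, source, version=len(versions) + 1)
        versions.append(item)  # earlier versions are kept, never overwritten
        return item

    def latest(self, component):
        """Return the current version of every item attached to a component."""
        return {key: versions[-1]
                for (comp, key), versions in self._history.items()
                if comp == component}

    def history(self, component, key):
        """Return all versions of one item, oldest first."""
        return list(self._history.get((component, key), []))


space = KnowledgeSpace("cell cycle", ["G1", "S", "G2", "M"])
space.deposit("M", "spindle-model", "knowledge", "draft model", "lab-1")
space.deposit("M", "spindle-model", "knowledge", "revised model", "lab-2")
print(space.latest("M")["spindle-model"])
```

Rejecting deposits that do not map to a component of the system model is the design choice that makes the structure system-centric: contributions are contextualized at the moment they are added rather than reintegrated later by each reader.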

To further illustrate the SKM concept and the typical state of the art of the scientific process in the life sciences, we briefly outline a concrete scenario.

The cell cycle, a fundamental biological process or complex system, is central to numerous biological processes such as embryonic development, wound healing, tissue self-organization, and tumor growth. There is a huge body of knowledge about the cell cycle, but many aspects are still subject to intense research. At present, this knowledge is distributed across the globe and is held in the brains of humans, in articles of scientific journals, in a large number of biological databases (containing data from relevant as well as irrelevant biological experiments), information bases (containing consolidated information on genes, proteins, cellular compartments, biological pathways, etc.), and model and knowledge bases (containing mathematical and computational models about different aspects of the cell cycle). While cell cycle scientists often possess a holistic and sometimes rather comprehensive mental model of the cell cycle system, no concerted computer-supported management of data, information, and knowledge about the cell cycle exists. Hence, researchers are forced to add new knowledge about the cell cycle by entering pieces of information into a variety of computer-based resources and repositories, most of which contain vast amounts of information that is not related to the cell cycle at all. Likewise, cell cycle scientists are forced to repeatedly search, filter, and extract new information about the cell cycle via a laborious and time-consuming process. In this process of updating and using knowledge about the cell cycle, the researchers constantly map the information to the cell cycle model they hold in their minds. No sophisticated computerized solution exists that would directly support this process by facilitating a globally accessible, system-centered, and semantically integrated management of the data, information, and knowledge about the cell cycle.

Footnotes

Author Disclosure Statement

The author declares that no conflicting financial interests exist.

References

Brito R.M.M., Dubitzky, Rodrigues J.R. 2004. Protein folding and unfolding simulations: a new challenge for data mining. OMICS, 8:153–166.

Dubitzky, Azuaje 2004. Artificial Intelligence Methods and Tools for Systems Biology. Springer: New York.

Harel 2005. A Turing-like test for biological modeling. Nat Biotechnol, 23:495–496.

Klipp, Herwig, Kowald, Wierling, Lehrach 2005. Systems Biology in Practice: Concepts, Implementation and Application. Wiley: Malden, MA.

Science, 1999. Special edition: complex systems. 284:1–212.

Science, 2002. Special edition: systems biology. 295:1589–1780.