Intelligent data integration from heterogeneous relational databases containing incomplete and uncertain information

Abstract

The integration of incomplete and uncertain information has emerged as a crucial issue in many application domains, including data warehousing, data mining, data analysis, and artificial intelligence. This paper proposes a novel mediation-based approach for integrating these types of information from heterogeneous relational databases. We present in detail the different processes in the layered architecture of the proposed flexible mediator system. The integration process of our mediator relies on fuzzy logic and semantic similarity measures to integrate incomplete and uncertain information more effectively. We also define fuzzy views over the mediator’s global fuzzy schema to express incomplete and uncertain databases and to specify the mappings between this global schema and the sources. Moreover, our approach provides intelligent data integration, enabling efficient generation of cooperative answers from similar ones retrieved by the queried flexible wrappers. These answers contain information that is more detailed and complete than the information contained in the initial answers. Thorough experiments verify that our approach improves the performance of data integration under various configurations.

Keywords

Cooperative answers, intelligent data integration, mediator system, incomplete and uncertain information, fuzzy logic

1. Introduction

The integration of incomplete and uncertain information from heterogeneous relational databases is an important research topic. The intent of data integration is to enable the user to extract information from heterogeneous data sources using a unified model [1]. The data integration system (DIS) is considered as an interoperable information system, which copes with heterogeneity and changeability in data sources to be integrated.

The DIS can be classified into five categories [2]: (1) the multi-base system, which allows users to access the different data sources directly [3]; (2) the federation system, which provides a federated schema for integrating heterogeneous databases, called federated databases [4]; (3) the mediator system, which provides a virtual integration of data at query time [5]; (4) the data warehouse system, a materialized integration that feeds integrated data into data warehouses [6]; and (5) the hybrid system, which supports data integration using a combination of the virtual and materialized approaches [7].

Data integration systems will continue to exploit data from heterogeneous data sources. There are different types of heterogeneity between data sources, such as structural, semantic, and syntactic heterogeneity [8]. Furthermore, a data source often contains precise, incomplete, and uncertain information. Such heterogeneity in the nature of the information makes data integration even more challenging for the database systems community [9]. This paper focuses on the challenges that this heterogeneity poses to data integration.

There are many sources of incomplete and uncertain information, including random data sets, inherently imprecise data sets, data poorly documented, data entry errors, statistical measures, measurement errors, inaccurate human judgments, etc.

There are different ways to define incomplete and uncertain information [10]. In the incompleteness problem of relational databases, the information can be of two different types [11, 12]: missing information and disjunctive information. Missing, inapplicable, or, more precisely, unknown information of an attribute of a relational table is represented by the special value “NULL”, which means that the attribute has a value, but it is unknown. For example, the accepted paper of Maria is not included in the conference proceedings because Maria did not validate her registration. Disjunctive information can be represented by a finite set of possible worlds separated by the “OR” disjunction operator. It states that one of the possible worlds is the true value, but it is unknown which one. For example, Mr. Christian is a professor of computer science at the University of Aalborg or Aarhus, but we do not know precisely to which university he belongs.

On the other hand, uncertain information is related to the degree of the information’s accuracy. The concept of uncertainty covers terms such as inconsistency, ambiguity, probability, possibility, maybe, fuzzy, vague, and imprecise, which refer to the handling of data whose validity is in doubt or that are linked with forecasting or estimation [13]. In general, these aspects of uncertainty represent uncertain information as fuzzy information or probabilistic information [13, 14]. The fuzzy information of an attribute value can be defined by a set of possible states associated with a degree of truth taking its values in the interval [0, 1], or by a linguistic value such as high, medium, or low. For example, we talk about the high possibility of extending the paper submission deadline.

The second type of uncertain information concerns the probabilistic information, which associates with each attribute value a probability between 0 and 1 according to a known distribution for that attribute domain. For example, the likelihood that a manuscript of Maria will be accepted is 0.9.
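To make the four kinds of values concrete, they can be pictured as tagged values in a program. The following minimal Python sketch is only an illustration of the distinctions drawn above; the class names (`Missing`, `Disjunctive`, `Fuzzy`, `Probabilistic`) and the helper `is_precise` are hypothetical and are not part of the proposed system.

```python
from dataclasses import dataclass
from typing import Any, Optional, Tuple


@dataclass(frozen=True)
class Missing:
    """The attribute has a value, but it is unknown (relational NULL)."""


@dataclass(frozen=True)
class Disjunctive:
    """One of the alternatives is true, but it is unknown which one."""
    alternatives: Tuple[Any, ...]


@dataclass(frozen=True)
class Fuzzy:
    """A linguistic value, optionally with a degree of truth in [0, 1]."""
    label: str
    degree: Optional[float] = None


@dataclass(frozen=True)
class Probabilistic:
    """A value together with its probability in [0, 1]."""
    value: Any
    probability: float


def is_precise(value: Any) -> bool:
    """True for ordinary values, False for the four special kinds."""
    return not isinstance(value, (Missing, Disjunctive, Fuzzy, Probabilistic))


# The examples used in the text, encoded with the hypothetical classes.
university = Disjunctive(("Aalborg", "Aarhus"))   # Aalborg OR Aarhus
deadline_extension = Fuzzy("high")                # linguistic value only
acceptance = Probabilistic("accepted", 0.9)       # likelihood of 0.9
```

Keeping the kind of a value explicit, rather than encoding everything as a string, is what later allows each kind to be compared with a suitable measure.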

In general, uncertain information can be formalized using mathematical logic, such as probabilistic logic, possibilistic logic, and fuzzy logic, whose values lie between 0 and 1 [15, 16]. Probabilistic logic is frequently used for managing statistical measures and their predictions to obtain approximate answers, while fuzzy logic is more generic than other formalisms for representing different forms of uncertain information [15, 16]. In this article, we use fuzzy logic as an effective and sound formalism to integrate heterogeneous data under incompleteness and uncertainty.

Dealing with incompleteness and uncertainty in relational databases has attracted significant attention from researchers for many years. Some of the research works have focused on data representation and modeling using different mathematical theories and leveraging techniques, such as machine learning methods, neural networks, Petri Nets, ontology, etc. [17, 18, 19, 20]. Other studies have focused on query processing challenges. Fuzzy logic-based flexible queries and preferences-based skyline queries are the most used solutions to provide accurate results that best meet the user requirements [21, 22, 23].

While the querying of incomplete and uncertain databases has already been addressed, there is still a need for further research, in particular on the integration of incomplete and uncertain information from heterogeneous databases. This article extends an earlier version of this work with major enhancements to the data integration processes. The differences are summarized as follows. First, the previous work [24] does not consider the integration of disjunctive and probabilistic information. Second, Resnik’s similarity measure does not give good results when computing the semantic similarity between compounded concepts; moreover, the similarity value between two similar concepts can be greater than 1, which complicates the analysis of experimental results. Third, a new method is included to provide intelligent data integration via the generation of cooperative answers that are more detailed and complete than the initial ones. Fourth, important algorithms are proposed, and a flexible mediator system has been developed to validate the proposed approach for the integration of incomplete and uncertain information from heterogeneous relational databases (HRDB). Finally, to further evaluate the performance of our approach against the related works, thorough experiments have been conducted on three categories of databases.

The main contributions in this paper are listed as follows:

• We improve the flexible mediation approach to integrate disjunctive and probabilistic information.
• We introduce a new method for producing cooperative answers from similar ones.
• We propose some algorithms required in the data integration process.
• We propose a semantic similarity-based method for dealing with semantic conflicts during data integration processes and therefore guarantee meaningful data integration.
• We develop a flexible mediator system and provide extensive experiments.

The remainder of this paper is outlined as follows. We review grounding research in the integration of incomplete and uncertain information in Section 2. Section 3 provides our intelligent mediation approach, while in Section 4 we present in detail the flexible mediator system. The experiments and discussions are reported in Section 5. The final section draws a conclusion and suggests further research.
2. Grounding research

The most critical data integration problems are closely related to data sources heterogeneity, interoperability issues, and query processing complexity [25]. The heterogeneity of information’s nature is a thorny problem in the area of data analysis. The efficient integration of multiple sources containing incomplete and uncertain information may help resolve the incompleteness and uncertainty, yielding accurate results that better meet user requirements. Most of the data integration (DI) literature assumes that the integrated data are either incomplete or uncertain without identifying the different types of each nature of the information.

Some related works have only studied incompleteness in the DI field. Nikolaou et al. [26] proposed an ontology-based approach to define integrity constraints over data source schemas and specify the functional dependencies used to extend the database with missing information. Exploiting integrity constraints may be useful for overcoming portions of incompleteness in databases. However, such a method cannot ensure the integration of disjunctive information, which presents several possible values without indicating which one is true.

Another work proposed a bag semantics-based approach to investigate approximation schemas for computing precise answers to queries that have been proposed for bag semantics [27]. During the data integration process, one of the local schemata requires computing an intractable query, while the other remains tractable. The approach focused on the semantic-based query representation without considering the different issues within incomplete databases themselves.

Hannou et al. [28] proposed a pattern-based approach for extracting complete query answers from incomplete databases. The method involves evaluating a query over the data and extracting the completeness of the query answer from the corresponding pattern dataset. This method requires considerable computation and effort to define and incorporate completeness patterns for integrating incomplete databases.

The second group of related works focused on the problem of uncertain data integration. Jaradat et al. [29] proposed a best-effort data integration framework that copes with the challenges of uncertain data integration. The proposed framework relaxes traditional data integration by explicitly incorporating probability-based mappings into the data integration process. This solution is likely to be less precise, and therefore less efficient, when the information is of different natures.

A fuzzy RDF data model proposed in [30] attempts to combine fuzzy logic with the semantic web RDF (Resource Description Framework) model in order to deal with uncertain and semantic conflicts. This extended RDF model provides a new method for ensuring the mapping between RDF data and relational databases containing uncertain information. Despite the effectiveness of the uncertain data integration, such a method cannot ensure high efficiency in diverse scenarios that can occur in practice.

Gal et al. [31] proposed a learn-to-rerank algorithm, which aims to rerank a list of schema matches to put the best at the top of the data integration process. This method must use matching predictors as learning features to integrate data, and therefore it is less effective when the data sources contain more uncertain information.

In real-world applications, the information is subject to both incompleteness and uncertainty. While many efforts have been proposed to address these problems, to the best of our knowledge, there are few efforts regarding the issue of integration from heterogeneous relational databases (HRDB) containing both incomplete and uncertain information.

In 2005, Leone et al. [32] proposed INFOMIX, a novel system that supports incomplete and inconsistent information integration. The key idea of INFOMIX is that the integrated data must satisfy the integrity constraints defined on the global schema and the mapping between this schema and the data sources. Besides, the system uses logic-based methods that are sound and complete for answering user queries. We point out that this system is employed only in materialized integration, which aims to extract and transform data from sources and load the results into a data warehouse.

In our previous work [24], we have proposed a fuzzy logic-based flexible mediator architecture for integrating incomplete and uncertain information from HRDB. This approach is briefly described below, which helps set the stage for the description of the proposed approach in this article.

The flexible mediator architecture is split into three layers: flexible mediation layer, flexible wrappers layer, and data sources layer. The flexible mediation layer receives the initial query written in a common vocabulary language provided by the global fuzzy schema, representing unified querying support. The submitted query has been enhanced to integrate incomplete information, especially the missing information. For ensuring the mapping between the global fuzzy schema and the local schemata of different data sources, we have defined fuzzy views over the global schema, which are based on measuring Resnik’s similarity between elements of the global schema and those of the local schemas.

The flexible wrappers layer is a middle layer between the flexible mediation layer and the data sources layer. The query can be rewritten and decomposed to a set of sub-queries over the participating local sources through the fuzzy mappings of each flexible wrapper. Each queried wrapper sends its answers back to the flexible mediation layer, which provides a preprocessing phase for eliminating redundant answers. The retained answers will be collected and sorted in descending order according to their membership degrees, yielding approximate answers.
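The preprocessing step described above (eliminating redundant answers, then sorting by membership degree) can be sketched in a few lines of Python. This is a simplified illustration, assuming that an answer is a pair of a value tuple and a membership degree, that two answers are redundant when their value tuples are identical, and that the highest degree is kept; the function name `collect_answers` is hypothetical.

```python
from typing import Dict, Iterable, List, Tuple

Answer = Tuple[tuple, float]  # (attribute values, membership degree)


def collect_answers(wrapper_answers: Iterable[Iterable[Answer]]) -> List[Answer]:
    """Merge the answers of all wrappers, drop duplicates, sort by degree."""
    best: Dict[tuple, float] = {}
    for answers in wrapper_answers:
        for values, degree in answers:
            # Assumption: identical value tuples are redundant; keep the
            # occurrence with the highest membership degree.
            if values not in best or degree > best[values]:
                best[values] = degree
    # Descending order of membership degree.
    return sorted(best.items(), key=lambda item: item[1], reverse=True)


wrapper_1 = [(("Salmonella",), 0.4), (("Proteus",), 1.0)]
wrapper_2 = [(("Salmonella",), 0.9)]
ranked = collect_answers([wrapper_1, wrapper_2])
```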

3. Overview of our intelligent mediation approach

In this article, we adopt our flexible mediation architecture to improve the quality, interoperability, and efficiency of the data integration process. Our approach is able to efficiently integrate different types of incompleteness and uncertainty, such as missing, disjunctive, fuzzy, and probabilistic information. The handling of these kinds of information’s nature is a challenging problem in different data integration processes, such as global schema representation, data mapping, rewriting queries, etc.

Our intelligent mediation approach aims at dealing with these issues by using fuzzy logic that is a more pragmatic approach to imprecise data than recent trends towards probabilistic models. To be clear, Fig. 1 shows a general overview of the proposed intelligent mediation approach.

Figure 1.

Overview of the intelligent mediation approach.

In uncertain database modeling with fuzzy logic, each uncertain attribute value is related to a set of fuzzy predicates $P$. Each $P_{i}\in P$ is attached to a fuzzy set $F$ and has a membership degree $\mu_{P_{i}}$ in the range [0, 1] [33]. In our work, to define the metadata of the fuzzy predicates related to uncertain information, we first provide a global relational schema of the flexible mediator system. In the case of incomplete information, we retain its original values. The LAV (Local-As-View) mapping approach, in which the local schemas are defined as views over the preexisting global schema [34], is therefore well suited. The fuzzy logic-based schema matching involves fuzzy mappings, which are defined by views called fuzzy views. We now give the following formal definitions.

Definition 1. [Flexible mediator system]. A flexible mediator system is a triple of the form $(FS,LS,FM)$, where:

• $FS$ is the global fuzzy schema, which provides a virtual and mediated schema between the user and the different data sources. $FS$ is expressed as a relational schema associated with metadata of fuzzy predicates.
• $LS$ is a set of local schemas of the data sources to be integrated. We assume that $LS$ is specified in the relational model.
• $FM$ is a set of fuzzy mappings that establish the connection between the elements of $FS$ and those of $LS$.

Definition 2. [Fuzzy view]. A fuzzy view $FV=FQ(FS)$ can be created by the following SQL (structured query language) statement:

CREATE VIEW name-of-FV ($A(R)$) AS SELECT $A(T)$ FROM $T\in FS$ [WHERE ...] [GROUP BY ... | HAVING ...]

Here, the fuzzy view $FV$ of a relation $R\in LS$ is a virtual table defined by a fuzzy query $FQ$ over the global fuzzy schema $FS$. $FV$ is described over a set of attributes $A(R)$, which should correspond to the set of attributes $A(T)$, where $T\in FS$ denotes a subset of the relations in $FS$.
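As an illustration of the LAV idea behind Definition 2, the following Python sketch uses the standard `sqlite3` module to define a local relation as a view over a global relation. It covers only the structural part of the mapping (renaming global attributes to local ones); the fuzzy conditions and the similarity-based matching of names are not modeled. The table name `FS_Microbe`, the view name `Src1_Bacteria`, and the local attribute names `Name`, `Size`, and `Risk` are assumptions made for the example, while the two rows are taken from Table 2.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    -- A fragment of the global schema (hypothetical physical name).
    CREATE TABLE FS_Microbe (Title_Microbe TEXT, Length REAL, Danger REAL);
    INSERT INTO FS_Microbe VALUES ('Escherichia coli', 7, 15);
    INSERT INTO FS_Microbe VALUES ('Jejuni', NULL, 25);

    -- LAV: the local relation is described as a view over the global schema.
    CREATE VIEW Src1_Bacteria (Name, Size, Risk) AS
        SELECT Title_Microbe, Length, Danger FROM FS_Microbe;
    """
)

rows = conn.execute(
    "SELECT Name, Size, Risk FROM Src1_Bacteria ORDER BY Name"
).fetchall()
for row in rows:
    print(row)
```

The missing length of Jejuni stays NULL (`None` in Python) in the view, in line with the choice of retaining the original values of incomplete information.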

The fuzzy mapping between elements of schemata (i.e., name of tables, name of attributes, etc.) is based on a well-known semantic similarity measure, which aims to determine the semantic likeness between terms or text [35]. In our work, we rely on Wu and Palmer (Wup) [36] and cosine [37], two core knowledge-based semantic similarity measures.

Using fuzzy logic with semantic similarity measure in the data integration process alleviates the designer’s task compared to other semantic-based data integration approaches such as fuzzy RDF and fuzzy ontology.

At the flexible wrappers layer, query processing over the global fuzzy schema involves rewriting queries using fuzzy views. Moreover, each flexible wrapper provides query cooperation processing, which exploits the relationships between the rewritten queries and other queries that extract incomplete information (for more details, see Section 4.2.2).

The cooperative query answers are rewritten in terms of the global fuzzy schema and submitted to the flexible mediation layer. The latter now provides an answer management process to produce cooperative answers from similar ones.

The purpose of focusing on the cooperative feature between answers is twofold. First, we decrease the number of returned answers that still contain incomplete and uncertain information. Second, we reduce the degree of uncertainty and incompleteness in the answers.

4. Flexible mediator system

We present Flexible Mediator, an end-to-end data integration system, which allows the user to ask queries over integrated data from HRDB containing incomplete and uncertain information. The layered architecture of flexible mediator system is described as follows:

• Flexible mediation layer: it takes the user query $Q$ and outputs cooperative answers, which improve the performance of data integration. On this basis, we propose a new module called “answers management module” for generating this type of answer.
• Flexible wrappers layer: it consists of three main modules: the query-rewriting module, which rewrites the submitted query $Q$ in terms of the data source schema; the query cooperation module, which gets more information about the query; and the rewriting answers module, which rewrites the query answers in terms of the global fuzzy schema. The output of this last module is used as the input of the answers management module.
• Data sources layer: it includes heterogeneous relational databases (HRDB), which contain different types of incomplete and uncertain information. We rely on the LAV mapping approach [34] for ensuring the extensibility and interoperability of the flexible mediator system.

Figure 2.
General architecture of the flexible mediator system.

The integration of incomplete and uncertain databases has become an intrinsic requirement in many application domains, such as transportation planning, weather information, e-business, and bioinformatics. While the flexible mediator system is generic for many application domains, we focus on the food safety domain. The datasets in this field are a good example of scientific databases that include many uncertain and incomplete pieces of information [38, 39]. For example, the danger degree of some microbes is not explicitly specified, and the calories of some foods are represented as ranges instead of exact values. The use of the flexible mediator system in microbiology laboratories and industries provides reliable and efficient management of such datasets, taking into account the conditions of growth of microorganisms in food.
4.1 Flexible mediation layer

The flexible mediator system provides a form-based query interface that helps the user to pose queries and view the results in a single tabular structure. Using the query formulation form rather than writing SQL statements directly or in a natural language makes the system flexible enough and easy to use and can minimize or avoid errors such as typing mistakes.

Note that the main elements of the global fuzzy schema, such as tables, attributes, fuzzy predicates, etc., have been automatically extracted by SQL queries and loaded in the corresponding graphical components using Java programming language.

Figure 3.

The flexible mediator system.

In Fig. 3, the graphical user interface (GUI) of the flexible mediator guides the user in formulating a query according to the SQL statement syntax. The first part, “Researched objects”, allows the user to select the relational tables to be used in the “FROM” clause of the SQL query. The second part, “Suggested properties”, represents the “SELECT” clause, where the user can select all or some of the attributes to be displayed. The third part, “Conditions”, represents the “WHERE” clause, which allows the user to define the search conditions (restrictions). In this part, a condition consists of three components:

(1) A left operand, which defines an attribute (column) of a relational table to be compared.
(2) A comparison or search operator such as $>$, $<$, $=$, IN, NOT IN, NOT NULL, etc.
(3) A right operand, which is a value that can be entered directly in the text field or selected from the Textual-value list, including fuzzy predicates.
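The three parts of the form map naturally onto the clauses of an SQL statement. The short Python sketch below shows one way such a form could be turned into query text. It is not the system's actual implementation (which is written in Java and is not detailed here): the function `build_query` is hypothetical, conditions are assumed to be combined with AND, and no input validation or quoting is performed.

```python
from typing import List, Sequence, Tuple

Condition = Tuple[str, str, str]  # (left operand, operator, right operand)


def build_query(tables: Sequence[str],
                attributes: Sequence[str],
                conditions: Sequence[Condition]) -> str:
    """Assemble a query from the three parts of the form."""
    # "Suggested properties" -> SELECT clause (all attributes if none chosen).
    select = ", ".join(attributes) if attributes else "*"
    # "Researched objects" -> FROM clause.
    sql = f"SELECT {select} FROM {', '.join(tables)}"
    # "Conditions" -> WHERE clause; fuzzy predicates are kept as bare terms.
    if conditions:
        where = " AND ".join(f"{left} {op} {right}"
                             for left, op, right in conditions)
        sql += f" WHERE {where}"
    return sql


query = build_query(
    tables=["FS.Microbe FM"],
    attributes=["FM.Title_Microbe", "FM.Date_of_Discovery",
                "FM.Length", "FM.Danger"],
    conditions=[("FM.Length", "=", "Long"), ("FM.Danger", "=", "Medium")],
)
print(query)
```

With these inputs, the sketch produces the text of the fuzzy query used in Example 3.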

Before presenting the main module of the flexible mediation layer, we first define the global fuzzy schema.

Definition 3. [Global fuzzy schema] The global fuzzy schema $FS$ is a set of relations $R$ associated with metadata of fuzzy predicates $P$. $FS$ can be represented in the following form:

$\displaystyle FS=\{R_{1},R_{2},\dots,R_{n}\}\cup\{P(A_{1}),P^{\prime}(A_{2}),\dots,P^{m}(A_{m})\}$

Where each relation $R_{i}\in FS(i=1..n)$ is defined over a set of attributes $A(R_{i})$ . The attribute value of $A_{i}\in A(R_{i})$ can be of different natures: precise information that is typically defined based on the attribute domain denoted by $D(A_{i})$ , missing information represented by the NULL value, disjunctive information defined by a finite set of possible worlds, or uncertain information. The different uncertain attributes within FS are associated with metadata of fuzzy predicates $P(A_{1}),P^{\prime}(A_{2}),\dots,P^{m}(A_{m})$ , which is modeled by using Zadeh’s fuzzy logic [40].

Related works on fuzzy database modeling define an additional column that indicates the membership value of a tuple [41]. In contrast, our approach uses metadata that contains a set of fuzzy predicates with their membership functions.

Our global fuzzy schema $FS$ in the food safety domain consists of nine tables linked by relationships. For instance, the Food, Microbe, and Symptom tables are interrelated by a many-to-many relationship called “Infected”, while the Factor and Microbe tables are linked by the “to have” relationship. Each table has 10 columns on average.

Example 1. Consider a relation “Microbe” of our global fuzzy schema.

Microbe (Title_Microbe, Family_microbe, Date_of_Discovery, Discovery, Danger, Length, Diameter, Propagation).

In this relation, the domains D (Title_Microbe), D (Family_microbe), and D (Discovery) are ordinary sets, while D (Date_of_Discovery), D (Danger), D (Length), D (Diameter), and D (Propagation) can be described by fuzzy predicates. Note that some attributes of this relation can contain incomplete information.

The fuzzy predicate $P$ is expressed by linguistic values that are easy for human beings to understand, such as cheap or expensive, and is associated with a membership function $F$ whose membership degree lies in the interval [0, 1]. Furthermore, the fuzzy predicate $P$ can be modeled by a discrete or continuous membership function [40]. In the case of a discrete membership function, the fuzzy set is described as follows:

$\displaystyle F=\left\{\mu_{F}(u_{1})/u_{1},\mu_{F}(u_{2})/u_{2},\ldots,\mu_{F}(u_{n})/u_{n}\right\}$

where each $u_{i}$ belongs to a subset of the universe of discourse $U$. For instance, Escherichia coli bacteria can spread in the intestinal flora (IF) and the abdominal flora (AF). The truth degrees of these possible propagation areas are 0.95 and 0.85, respectively.

From example 1, the fuzzy set of the attribute, “Propagation” of Escherichia coli bacteria is then represented as follows: $F=\left\{0.95/IF,0.85/AF\right\}$ .
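A discrete fuzzy set of this kind maps each element to its membership degree, so a dictionary is a natural encoding. The Python sketch below stores the “Propagation” fuzzy set of Escherichia coli; the helper names `membership` and `alpha_cut` are hypothetical, and the alpha-cut (the elements whose degree reaches a given level) is a standard fuzzy-set notion added here only for illustration.

```python
from typing import Dict, Set

# F = {0.95/IF, 0.85/AF}: element -> membership degree.
propagation: Dict[str, float] = {"IF": 0.95, "AF": 0.85}


def membership(fuzzy_set: Dict[str, float], element: str) -> float:
    """Degree of an element; elements that are not listed have degree 0."""
    return fuzzy_set.get(element, 0.0)


def alpha_cut(fuzzy_set: Dict[str, float], alpha: float) -> Set[str]:
    """Elements whose membership degree is at least alpha."""
    return {u for u, mu in fuzzy_set.items() if mu >= alpha}


print(membership(propagation, "IF"))
print(alpha_cut(propagation, 0.9))
```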

Example 2. Consider the relation Microbe from Example 1. The attribute “Danger” has three fuzzy predicates called Low, Medium, and Hard. Figure 4 depicts the membership functions associated with these fuzzy predicates.

The definition of these membership functions is based on an interview with domain experts in food safety and on online documentation, such as the World Health Organization website (https://www.who.int/). The continuous membership function is defined as a quadruplet $(a,A,B,b)$, and the membership degree $\mu_{P}(x)$ of a value $x$ is given by:

• $\mu_{P}(x)=1$ for $A\leqslant x\leqslant B$
• $\mu_{P}(x)=0$ for $x\leqslant a$ or $x\geqslant b$
• $\mu_{P}(x)=\frac{A-x}{a}$ increases linearly for $a<x<A$
• $\mu_{P}(x)=\frac{x-B}{b}$ decreases linearly for $B<x<b$

For example, as pictured in Fig. 4, the membership function of the fuzzy predicate “Medium” is defined by (11, 14, 24, 27). In total, our metadata includes 321 fuzzy predicates. Table 1 gives part of the metadata related to the Microbe table of the global fuzzy schema.
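The four cases of the continuous membership function can be evaluated directly. The Python sketch below follows them literally as stated above, including the expressions $(A-x)/a$ and $(x-B)/b$ on the two intermediate intervals; the function name `membership` is hypothetical. For the “Medium” predicate, it reproduces the values 1 and 0.09 that Table 2 lists as Degree2 for danger values of 15 and 13. Entries of Table 1 whose last two parameters are 0 (such as “Hard”) are not covered, since their interpretation is not spelled out here.

```python
def membership(x: float, a: float, A: float, B: float, b: float) -> float:
    """Membership degree of x for a predicate given as (a, A, B, b),
    following the four cases exactly as written in the text."""
    if A <= x <= B:           # core of the predicate: full membership
        return 1.0
    if x <= a or x >= b:      # outside the interval (a, b): no membership
        return 0.0
    if x < A:                 # a < x < A
        return (A - x) / a
    return (x - B) / b        # B < x < b


MEDIUM = (11, 14, 24, 27)     # "Medium" danger, from Table 1

for danger in (11, 13, 15, 25):
    print(danger, round(membership(danger, *MEDIUM), 2))
```

A normalized trapezoid would instead use $(x-a)/(A-a)$ and $(b-x)/(b-B)$ on the two intermediate intervals; the sketch deliberately keeps the expressions as they are written in this section.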

Table 1

Part of metadata related to the microbe table of the global fuzzy schema

Table	Attributes	Fuzzy predicates	Membership function
Microbe	Danger	Low	(0, 0, 8, 11)
		Medium	(11, 14, 24, 27)
		Hard	(27, 30, 0, 0)
	Length	Short	(0, 0, 3, 5)
		Long	(5, 10, 0, 0)
	Diameter	Small	(0, 0, 5, 8)
		Big	(8, 10, 0, 0)

Figure 4.

Membership functions of fuzzy predicates of the attribute “Danger”.

For the sake of clarity, we shall give an example of user query $Q$ , which is used to describe thoroughly the different processes of data integration.

Example 3. Display name, date of discovery, length, and danger of long microbes with medium danger. This request is defined by the following fuzzy query over the global fuzzy schema $F S$ .

SELECT FM.Title_Microbe, FM.Date_of_Discovery, FM.Length, FM.Danger FROM FS.Microbe FM WHERE FM.Length = Long AND FM.Danger = Medium

This formal representation of the fuzzy query expresses the uncertain information through the fuzzy predicates ‘Long’ and ‘Medium’, which are not treated as strings (see Section 4.2).

4.1.1 Answers management module

After retrieving and rewriting the query answers at the flexible wrappers layer (see Section 4.2), the flexible mediator system awaits the answers from all queried wrappers. The answers management module combines similar answers and generates cooperative ones that are more detailed when accurate data are not available. To this end, we establish a semantic similarity-based method to identify the redundant answers among the heterogeneous ones. The method is based on two semantic similarity measures, Wup and cosine [36, 37], which tend to give accurate results and agree closely with human similarity judgments [25].

The Wup similarity is one of the most commonly used knowledge-based similarity measures and helps overcome structural and semantic conflicts [36]. It is widely used for computing word-to-word semantic similarity, and it can be extended to fuzzy predicates defined by linguistic terms (atomic terms). Wup similarity is based on measuring the depths of two concepts $C_{1}$ and $C_{2}$, along with the depth of their common subsumer $C$, in terms of IS-A links in the WordNet taxonomies [42, 36]. It is computed as follows [36]:

$\displaystyle\textit{Sim}_{\textit{wup}}(C_{1},C_{2})=\frac{2\cdot\textit{depth}(C)}{\textit{depth}(C_{1})+\textit{depth}(C_{2})}$ (1)
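Equation (1) can be tried out on a small hand-made IS-A hierarchy. The Python sketch below does not use WordNet; the toy taxonomy in `PARENT` is invented for illustration, the function names are hypothetical, and the depth of a concept is assumed to count the nodes on its path to the root (so the root has depth 1), which is a common convention.

```python
from typing import Dict, List

# Toy IS-A hierarchy (child -> parent); an assumption, not WordNet.
PARENT: Dict[str, str] = {
    "microorganism": "organism",
    "bacterium": "microorganism",
    "virus": "microorganism",
    "salmonella": "bacterium",
    "escherichia": "bacterium",
}


def path_to_root(concept: str, parent: Dict[str, str]) -> List[str]:
    """The concept followed by all its ancestors, up to the root."""
    path = [concept]
    while concept in parent:
        concept = parent[concept]
        path.append(concept)
    return path


def depth(concept: str, parent: Dict[str, str]) -> int:
    """Number of nodes from the concept up to the root (root = 1)."""
    return len(path_to_root(concept, parent))


def wup(c1: str, c2: str, parent: Dict[str, str]) -> float:
    """Wu and Palmer similarity, Eq. (1), in a single-rooted taxonomy."""
    ancestors2 = set(path_to_root(c2, parent))
    # Deepest common subsumer: first ancestor of c1 that also subsumes c2.
    subsumer = next(c for c in path_to_root(c1, parent) if c in ancestors2)
    return 2 * depth(subsumer, parent) / (depth(c1, parent) + depth(c2, parent))


print(wup("salmonella", "escherichia", PARENT))  # subsumer: bacterium
print(wup("salmonella", "virus", PARENT))        # subsumer: microorganism
```

With this depth convention the measure stays in (0, 1], and it equals 1 only when both concepts coincide.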

On the other hand, we used the Cosine similarity measure to compute the similarity between vectors of words or numbers. Cosine similarity has proven to be a robust metric in the information retrieval domain for determining how similar the documents are [43]. The cosine similarity of two vectors X, Y is given by the following formula [37]:

$\displaystyle\textit{Sim}_{\textit{cosine}}(X,Y)=\frac{X\cdot Y}{\|X\|\cdot\|Y\|}$ (2)

Where $\|X\|$ , $\|Y\|$ represent the magnitude of X and Y, respectively. The cosine measure not only concerns the compounded concepts that are composed of words but also determines the similarity between vectors of membership degrees or even numeric values of a set of attributes.
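The following Python sketch computes the cosine similarity in its standard form, in which the dot product is divided by the product of the two magnitudes. It handles both uses mentioned above: numeric vectors and compounded concepts. How compounded concepts are turned into vectors is not specified in this section, so the word-count (bag-of-words) encoding in `cosine_text` is an assumption, and both function names are hypothetical.

```python
import math
from collections import Counter
from typing import Sequence


def cosine(x: Sequence[float], y: Sequence[float]) -> float:
    """Cosine similarity of two equally long numeric vectors."""
    dot = sum(a * b for a, b in zip(x, y))
    norm_x = math.sqrt(sum(a * a for a in x))
    norm_y = math.sqrt(sum(b * b for b in y))
    if norm_x == 0 or norm_y == 0:
        return 0.0                      # convention for zero vectors
    return dot / (norm_x * norm_y)


def cosine_text(s: str, t: str) -> float:
    """Cosine similarity of two compounded concepts (assumed bag of words)."""
    counts_s = Counter(s.lower().split())
    counts_t = Counter(t.lower().split())
    vocabulary = sorted(set(counts_s) | set(counts_t))
    return cosine([counts_s[w] for w in vocabulary],
                  [counts_t[w] for w in vocabulary])


print(cosine([0.6, 1.0], [0.7, 1.0]))              # vectors of membership degrees
print(cosine_text("Escherichia coli", "E. coli"))  # one shared word out of two
```

For vectors with non-negative components, such as membership degrees or word counts, the result lies between 0 and 1.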

The similarity-based method aims at combining these two similarity measures. Such measurement can take values between 0 and 1, making its results easy to interpret and analyze.

Let $R=V_{1},V_{2},\dots,V_{n}$ and $R^{\prime}=V^{\prime}_{1},V^{\prime}_{2},\dots,V^{\prime}_{n}$ be two received answers, where $V_{i}$ and $V^{\prime}_{i}$ (for $i=1..n$) are values of the attributes $A_{i}$ and $A^{\prime}_{i}$, respectively. To compute the similarity between $R$ and $R^{\prime}$, we propose the Sim-Responses algorithm, which is applied to three parts of the answer. The first part concerns the values of attributes that are atomic concepts. We compute the Wup similarity between the atomic concepts of $R$ and $R^{\prime}$ (see lines 4–5). The second part of an answer contains the values of attributes that are compounded concepts; in this case, we use the cosine measure as a similarity metric (see lines 6–7). The third part contains the values of attributes that are defined by membership degrees or numeric values. This part of the answer can be seen as a vector of values. Hence, we use two vectors $V$ and $T$ to represent the third part of the answers $R$ and $R^{\prime}$, respectively (see lines 8–10). After this, we compute the cosine similarity between the vectors $V$ and $T$ (see line 11). All the similarities between the values of $R$ and $R^{\prime}$ are stored in a table called $S$. Finally, the Sim-Responses algorithm computes the average similarity, which represents the similarity value between $R$ and $R^{\prime}$ (see lines 12–16).

Table 2

Example of some heterogeneous answers

Answer	Title_microbe	Date_of_discovery	Length	Degree1	Danger	Degree2
$R_{1}$	Escherichia coli	0.5	7	0.6	15	1
$R_{2}$	Salmonella	0.5	11	1	13	0.09
$R_{3}$	Jejuni	0.5	NULL	0.5	25	0.03
$R_{4}$	Proteus	1885	[20, 70]	1	[24.5, 25]	0.32
$R_{5}$	E.coli	1885	6.5	0.7	14	1
$R_{6}$	Lari	1930 or 1936	NULL	0.5	13	0.09

Algorithm 1. Sim-Responses algorithm
Input: $R$, $R^{\prime}$: answers
Output: Sim: similarity between $R$ and $R^{\prime}$
begin
  $\textit{Sim}=0$; $S$: table of similarities $=\emptyset$; $V$, $T$: vectors of numbers $=\emptyset$; $j=1$;
  for each $V_{i}\in R$, $V^{\prime}_{i}\in R^{\prime}$ do
    if $V_{i}$ and $V^{\prime}_{i}$ are two atomic concepts then
      $S[j]=\textit{Sim}_{\textit{wup}}(V_{i},V^{\prime}_{i})$; $j++$;
    else if $V_{i}$ and $V^{\prime}_{i}$ are two compounded concepts then
      $S[j]=\textit{Sim}_{\textit{cosine}}(V_{i},V^{\prime}_{i})$; $j++$;
    else if $V_{i}$ or $V^{\prime}_{i}$ are numeric values or they are associated with a membership function then
      $V=V\cup V_{i}$; $T=T\cup V^{\prime}_{i}$;
  end for
  $S[j]=\textit{Sim}_{\textit{cosine}}(V,T)$;
  Sum, $k=0$; *Compute the sum of all similarities
  for $j=1$ to S.size do
    $\textit{Sum}=\textit{Sum}+S[j]$; $k=k+1$;
  end for
  $\textit{Sim}=\textit{Sum}/k$; *Compute the average of similarities
  return Sim;
end
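As an illustration, the three-part structure of the Sim-Responses algorithm can be sketched in Python. This is a minimal sketch under stated assumptions, not the implementation used in our system: `placeholder_wup` is a hypothetical stand-in for the Wup measure (which needs a WordNet backend), compounded concepts are compared as word-count vectors, a value is treated as compounded as soon as one of the two strings contains a space, and missing values are assumed to have already been replaced by membership degrees.

```python
import math

def cosine(x, y):
    """Cosine similarity between two equal-length numeric vectors."""
    dot = sum(a * b for a, b in zip(x, y))
    norm = math.sqrt(sum(a * a for a in x)) * math.sqrt(sum(b * b for b in y))
    return dot / norm if norm else 0.0

def bag_of_words_cosine(a, b):
    """Cosine similarity between compounded concepts seen as word-count vectors."""
    wa, wb = a.lower().split(), b.lower().split()
    vocab = sorted(set(wa) | set(wb))
    return cosine([wa.count(w) for w in vocab], [wb.count(w) for w in vocab])

def placeholder_wup(a, b):
    """Hypothetical stand-in for the Wup measure (assumption, not WS4J)."""
    return 1.0 if a.lower() == b.lower() else 0.0

def sim_responses(r, r_prime, wup=placeholder_wup):
    """Average similarity between two answers given as equal-length value tuples."""
    s, v, t = [], [], []  # S: similarity table; V, T: numeric parts of R and R'
    for vi, vpi in zip(r, r_prime):
        if isinstance(vi, str) and isinstance(vpi, str):
            if " " in vi or " " in vpi:      # compounded concepts
                s.append(bag_of_words_cosine(vi, vpi))
            else:                             # atomic concepts
                s.append(wup(vi, vpi))
        else:                                 # numeric values or membership degrees
            v.append(vi)
            t.append(vpi)
    if v:
        s.append(cosine(v, t))                # one similarity for the numeric part
    return sum(s) / len(s) if s else 0.0

# Illustrative (made-up) answers: one atomic concept, one compounded concept, two degrees.
same = sim_responses(("Salmonella", "food poisoning", 0.5, 1.0),
                     ("Salmonella", "food poisoning", 0.5, 1.0))
different = sim_responses(("Salmonella", "food poisoning", 1.0, 0.0),
                          ("Proteus", "water poisoning", 0.0, 1.0))
```

Passing a real Wup implementation through the `wup` parameter would leave the rest of the sketch unchanged.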

The answers $R$ and $R^{\prime}$ are similar if two conditions are both satisfied:

(1)

The similarity degree between $R$ and $R^{\prime}$ given by the Sim-Responses algorithm is at least 0.7. The experimental phase (see Section 5.2) shows that 0.7 is the most suitable threshold.

(2)

The similarity degrees between $R$ and the other answers are close to those between $R^{\prime}$ and the same answers. This condition is checked by computing the Euclidean distance between the two sets of similarity values.

Example 4. Consider the user query $Q$ from Example 3. After the query $Q$ is processed at the flexible wrappers layer, its answers are retrieved and rewritten in terms of the vocabulary of the global fuzzy schema (see Section 4.2.3). Each rewritten answer can be broken down into the attributes specified in the user query (there are four attributes: Title_Microbe, Date_of_Discovery, Length, and Danger). Furthermore, it contains additional attributes, whose values are the membership degrees of incomplete and uncertain query conditions (for more details, see Section 4.2.2). In this example, we have two additional attributes, Degree1 and Degree2, related to the query conditions length $=$ long and danger $=$ Medium, respectively. Table 2 shows some heterogeneous answers from different sources.

The similarities between the answers acquired through the use of the Sim-Responses algorithm are given in Table 3.

Table 3

Detailed similarities between some answers

Answers	Sim1	Sim2	Sim	Decision
$(R_{1},R_{2})$	0.333	0.729	0.531	No
$(R_{1},R_{3})$	0.181	0.180	0.180	No
$(R_{1},R_{4})$	0.190	0.420	0.305	No
$(R_{1},R_{5})$	0.960	0.480	0.720	Yes
$(R_{1},R_{6})$	0.191	0.795	0.493	No
$(R_{2},R_{3})$	0.182	0.170	0.176	No
$(R_{2},R_{4})$	0.050	0.583	0.317	No
$(R_{2},R_{5})$	0.301	0.720	0.511	No
$(R_{2},R_{6})$	0.191	0.135	0.163	No
$(R_{3},R_{4})$	0.316	0.231	0.273	No
$(R_{3},R_{5})$	0.180	0.119	0.239	No
$(R_{3},R_{6})$	0.315	0.244	0.279	No
$(R_{4},R_{5})$	0.190	0.417	0.304	No
$(R_{4},R_{6})$	0.666	0.320	0.493	No
$(R_{5},R_{6})$	0.192	0.790	0.491	No

As shown in Table 3, the ‘Sim1’ column reports the similarity between attribute values that are atomic or compounded concepts. For example, for the attribute ‘Title_Microbe’, Algorithm 1 applies the Wup similarity between atomic concepts, such as Salmonella, and the cosine measure between compounded concepts, such as Escherichia coli.

The second column, ‘Sim2’, reports the cosine similarity between vectors of numeric values. Each vector consists of three membership degrees: the membership degree of the attribute Date_of_discovery, which may contain incomplete information, and the membership degrees of the attributes Length and Danger, which are the values of Degree1 and Degree2, respectively. Note that if a compared value is precise, its membership degree equals 1, as for the date of discovery of the Proteus bacterium (see answer $R_{4}$, Table 2). For disjunctive information, such as the date of discovery of the Lari bacterium (see answer $R_{6}$, Table 2), the membership degree is given by Lemma 1 (see Section 4.2.2).

The ‘Sim’ column contains the average of the values in columns Sim1 and Sim2 and gives the similarity degree between answers. The rightmost column indicates whether the first condition is fulfilled (‘Yes’ for the pairs that meet it, ‘No’ otherwise). The first condition of similarity is satisfied for answers $R_{1}$ and $R_{5}$ (i.e., the similarity degree is 0.720 $\geqslant$ 0.7).

For the second condition, we evaluate whether the similarities between the answer $R_{1}$ and the other answers are close to those between $R_{5}$ and the same answers. We use the Euclidean distance between these similarities since it is a widely used distance metric for efficiently comparing numeric vectors [44].

For example, $\textit{Sim}(R_{1},R_{2})=0.531$ and $\textit{Sim}(R_{2},R_{5})=0.511$, so the difference between these similarities (0.020) is close to 0. Therefore, $R_{1}$ and $R_{5}$ are two similar answers, which represent the same bacterium, Escherichia coli, commonly known as E.coli.
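The two conditions can be checked together as in the following Python sketch, which uses the Sim values of Table 3 for $R_{1}$ and $R_{5}$ against the other answers. The `tolerance` bound on the Euclidean distance is a hypothetical cut-off introduced only for illustration; the text above does not fix a value for it.

```python
import math

# Sim values copied from Table 3: similarity of R1 and of R5 to the other answers.
sim_r1 = {"R2": 0.531, "R3": 0.180, "R4": 0.305, "R6": 0.493}
sim_r5 = {"R2": 0.511, "R3": 0.239, "R4": 0.304, "R6": 0.491}

def profile_distance(a, b):
    """Euclidean distance between two similarity profiles over the shared answers."""
    return math.sqrt(sum((a[k] - b[k]) ** 2 for k in a.keys() & b.keys()))

def are_similar(sim_value, a, b, alpha=0.7, tolerance=0.1):
    """Condition 1 (threshold alpha) and condition 2 (small profile distance).

    `tolerance` is an assumed value, not one given in the paper.
    """
    return sim_value >= alpha and profile_distance(a, b) <= tolerance

distance = profile_distance(sim_r1, sim_r5)      # about 0.062
decision = are_similar(0.720, sim_r1, sim_r5)    # Sim(R1, R5) = 0.720 from Table 3
```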

In the second phase of the answers management module, the cooperation process between similar answers is applied to each attribute, skipping the missing information:

(1)

Aggregate the precise and disjunctive values into one value that is disjunctive information.

(2)

Maintain the uncertain information that has the highest membership degree. If two values have the same membership degree, we take both as two possible values.

(3)

Delete the additional attributes that reflect the degrees of query conditions, such as Degree1 and Degree2 (see Table 2).

The cooperative answer is a more detailed response that aggregates the possible values from similar answers, taking into account the uncertain values having the highest membership degree. In this way, we can have greater integration of sources and better meet user needs.
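The cooperation steps above can be sketched in Python as follows. This is a minimal sketch under assumptions: each answer is a dictionary, a missing value is modelled as `None` (Table 2 shows the degree 0.5 in its place), and `degree_of` is a hypothetical mapping from an attribute to the additional attribute that stores its membership degree.

```python
def cooperate(answers, degree_of):
    """Merge similar answers attribute by attribute, skipping missing values.

    answers   : list of dicts (attribute -> value, None when missing)
    degree_of : attribute -> name of the additional attribute holding its degree
    """
    degree_attrs = set(degree_of.values())
    merged = {}
    for attr in answers[0]:
        if attr in degree_attrs:
            continue  # step 3: drop Degree1, Degree2, ...
        present = [a for a in answers if a[attr] is not None]
        if attr in degree_of:
            # step 2: keep the value(s) with the highest membership degree
            best = max((a[degree_of[attr]] for a in present), default=None)
            values = [a[attr] for a in present if a[degree_of[attr]] == best]
        else:
            # step 1: aggregate precise and disjunctive values
            values = [a[attr] for a in present]
        merged[attr] = list(dict.fromkeys(values))  # distinct values, order kept
    return merged

# Answers R1 and R5 from Table 2.
r1 = {"Title_microbe": "Escherichia coli", "Date_of_discovery": None,
      "Length": 7, "Degree1": 0.6, "Danger": 15, "Degree2": 1.0}
r5 = {"Title_microbe": "E.coli", "Date_of_discovery": 1885,
      "Length": 6.5, "Degree1": 0.7, "Danger": 14, "Degree2": 1.0}

merged = cooperate([r1, r5], {"Length": "Degree1", "Danger": "Degree2"})
```

Joining each list of values with " or " gives a disjunctive value such as "Escherichia coli or E.coli".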

Example 5. From Example 4, the results of the flexible mediator are shown in Table 4.

Table 4

Example of some final answers

Title_microbe	Date_of_discovery	Length	Danger
Escherichia coli or E.coli	1885	6.5	14 or 15
Salmonella	0.5	11	13
Jejuni	0.5	Null	25
Proteus	1885	[20, 70]	[24.5, 25]
Lari	1930 or 1936	Null	13

In Table 4, the first row is the cooperative answer generated from the similar answers $R_{1}$ and $R_{5}$. According to the cooperation process presented above, the value of the attribute ‘Title_Microbe’ is generated by aggregating two values, Escherichia coli and E.coli, from $R_{1}$ and $R_{5}$, respectively. Furthermore, the attribute Date_of_discovery receives a precise value, which is the value given in $R_{5}$, skipping the null value in $R_{1}$. The values of attributes Length and Danger are generated through step 2 of the cooperation process. The final step then deletes the additional attributes Degree1 and Degree2. The output of the answers management module is thus a set of answers, including cooperative ones. These results are sent directly to the user, concluding the query flow.

4.2 Flexible wrappers layer

This second layer of the flexible mediator provides support for query processing and for dealing with heterogeneity problems. The flexible wrapper is the basic building block of the flexible mediator. Figure 5 zooms in on the structure of the flexible wrapper.

Figure 5.

Overview of a flexible wrapper.

4.2.1 Query rewriting module

This module reformulates the user query so that it refers to the data sources by using fuzzy views, which are based on the LAV mapping approach.

Several query-rewriting algorithms have been proposed in the literature; the three best known are the Bucket algorithm, the Inverse-rules algorithm, and the MiniCon algorithm [45, 46, 47]. The first two do not fit well when handling queries involving cooperation with other queries [47]. The MiniCon algorithm scales up to a large number of views and outperforms the previous algorithms. It aims at finding the maximally-contained rewriting of a conjunctive query using a set of conjunctive views [47]. Hence, we use the MiniCon algorithm to define the fuzzy view-based query rewriting method. For this purpose, the creation of fuzzy views over the global fuzzy schema $FS$ for each local schema $LS$ is based on a 1-1 mapping between $FS$ and $LS$. This schema mapping provides a set of 1-1 mappings between the tables of $FS$ and $LS$, which in turn provide a set of 1-1 mappings between the columns of these tables. The creation of fuzzy views and the selection of the fuzzy view closest to the relational table of a user query are based on measuring semantic similarities: the most similar elements are those with the highest semantic similarity. Algorithm 2 computes and returns the name of the closest element.

Algorithm 2. Given-similar algorithm
Input: $A$: concept, $B$: a set of concepts
Output: Closer: concept
begin
  $k=|B|$ is the number of concepts; $M$: table of similarities $=\emptyset$;
  for $j=1$ to $k$ do
    if $A$ and $B_{j}\in B$ are two atomic concepts then
      $M[j]=\textit{Sim}_{\textit{wup}}(A,B_{j})$;
    else if $A$ and $B_{j}\in B$ are two compounded concepts then
      $M[j]=\textit{Sim}_{\textit{cosine}}(A,B_{j})$;
  end for
  $\textit{Closer}=B^{\prime}\in B$ where its similarity $M^{\prime}=\textit{Max}(M)$;
  return Closer;
end

The fuzzy view-based query-rewriting algorithm (see Algorithm 3) takes the user query $Q$ and a set of fuzzy views $FV$ as input and outputs the rewritten query $Q^{\prime}$. For each table $T$ of the user query, we first determine the closest table $ST$ in the fuzzy views (see lines 4–5). From $ST$, the algorithm computes the similarity between its attributes $A(ST)$ and those of table $T$ (see lines 6–8). Finally, from the list of similar tables $TL$ and the list of similar attributes $AL$, the algorithm uses the MiniCon mechanism to produce the rewritten query $Q^{\prime}$. The algorithm also stores the matchings between the user query $Q$ and the rewritten query $Q^{\prime}$ (such as $TL$ and $AL$), which the rewriting answers module uses to rewrite the retrieved answers in the vocabulary of the global fuzzy schema (see Section 4.2.3).

Algorithm 3. Fuzzy view-based query-rewriting algorithm
Input: $Q$: user query, $FV$: set of fuzzy views
Output: $Q^{\prime}$: rewritten query
begin
  $TL$: list of table names $=\emptyset$; $AL$: list of attribute names $=\emptyset$;
  for each table $T$ in $Q$ do
    $ST=$ Given-similar($T, FV$);
    $TL=TL\cup ST$;
    for $j=1$ to $|A(T)|$ (the number of attributes $A$ of the table $T$) do
      $SA=$ Given-similar($A_{j}(T),A(ST)$);
      $AL=AL\cup SA$;
    end for
  end for
  $Q^{\prime}=\textit{MiniCon}(TL,AL)$;
  return $Q^{\prime}$;
end
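The matching part of Algorithms 2 and 3 can be sketched in Python as follows. This is only an illustration: `string_similarity` is a plain string-matching stand-in (an assumption) for the Wup and cosine measures, the data structures are hypothetical, and the final MiniCon rewriting step is deliberately left out.

```python
from difflib import SequenceMatcher

def string_similarity(a, b):
    """Stand-in similarity in [0, 1]; the paper uses Wup and cosine measures instead."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

def given_similar(concept, candidates, sim=string_similarity):
    """Return the candidate with the highest similarity to `concept` (Algorithm 2)."""
    return max(candidates, key=lambda c: sim(concept, c))

def match_query(query_tables, fuzzy_views, sim=string_similarity):
    """Build the matchings TL (tables) and AL (attributes) computed before MiniCon.

    query_tables : table name -> attributes used in the user query
    fuzzy_views  : view name  -> attributes exposed by the fuzzy view
    """
    tl, al = {}, {}
    for table, attributes in query_tables.items():
        st = given_similar(table, list(fuzzy_views), sim)   # closest view
        tl[table] = st
        for attr in attributes:
            al[attr] = given_similar(attr, fuzzy_views[st], sim)
    return tl, al

# Attributes of Microorganism_view from Example 6.
views = {"Microorganism_view": ["Name", "Family", "Discovery_date",
                                "Discovered", "Length", "Danger", "Spread"]}
tl, al = match_query({"Microbe": ["Length", "Danger"]}, views)
```

Keeping `tl` and `al` as dictionaries also records the matchings that the rewriting answers module needs later.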

Combining fuzzy views, semantic similarity measures, and the MiniCon algorithm is what characterizes our approach to integrating incomplete and uncertain databases modeled by fuzzy logic.

Example 6: Continuing with the user query $Q$ presented in Example 3, we consider two fuzzy views over the global fuzzy schema: Microbio_Food_view and Microorganism_view.

CREATE VIEW Microbio_Food_view (TitleFood, Calorie, TypeFood, GroupFood, NameMicrobe, Lengthiness, Invention, Risk, Family_microbe) AS

SELECT F.Name, F.Calorie, F.Category, F.Family, M.Title_Microbe, M.Length, M.Discovery, M.Danger, M.family_microbe

FROM FS.Food F, FS.Microbe M

CREATE VIEW Microorganism_view (Name, Family, Discovery_date, Discovered, Length, Danger, Spread) AS

SELECT M.Title_Microbe, M.family_microbe, M.Date_of_Discovery, M.Discovery, M.Length, M.Danger, M.Propagation

FROM FS.Microbe M

From these views, the query-rewriting algorithm produces two queries, $Q^{\prime}_{1}$ and $Q^{\prime}_{2}$. The query $Q^{\prime}_{1}$ is presented as follows:

SELECT V1.NameMicrobe, V1.Lengthiness, V1.Risk FROM Microbio_Food_view V1 WHERE V1.Lengthiness = long and V1.Risk = Medium

The query $Q^{\prime}_{2}$ is presented as follows:

SELECT V2.Name, V2.Discovery_date, V2.Length, V2.Danger FROM Microorganism_view V2 WHERE V2.Length = long and V2.Danger = Medium

We can see that the rewritten query $Q^{\prime}_{1}$ does not cover all the attributes specified in the user query $Q$, such as the date of discovery of a microbe. The query cooperation module presented next addresses this issue.

4.2.2 Query cooperation module

The query cooperation module serves two purposes. First, it ensures the integration of incomplete and uncertain information by making the rewritten query cooperate with other queries that extract these types of information. Second, it executes the cooperative query, taking into account the membership degrees assigned to each incomplete and uncertain piece of information.

The rewritten query $Q^{\prime}$ may not cover all the attributes specified in $Q$. In addition, although the returned answers satisfy the conditions of query $Q^{\prime}$, some of the displayed attributes can contain incomplete and uncertain information. To tackle these challenges, we introduce the query cooperation-driven method, which consists of five main steps, as shown in Fig. 6.

Figure 6.

Query cooperation process.

We first detect missing attributes in the query $Q^{\prime}$ through fuzzy mapping. If $Q^{\prime}$ does not cover all the attributes specified in $Q$, the second step adds the missing attributes to $Q^{\prime}$ so that it retains the same representation as that given in the user query $Q$. The null value is used for these missing attributes.

Once $Q^{\prime}$ covers all the attributes specified in $Q$, we apply the third step of our method. We extract the conditions of $Q^{\prime}$ to capture its different parameters, focusing on the operands used in the definition of each query condition.

In the fourth step, using the information extracted in the previous step, $Q^{\prime}$ cooperates with disjunctive queries, whose conditions are connected by the ‘OR’ logical operator. We use the missing information, represented by a Null value, as a standard value for each additional query condition. Using disjunctive queries involves selecting all tuples that satisfy at least one of these conditions.

Example 7: Consider the rewritten queries $Q^{\prime}_{1}$ and $Q^{\prime}_{2}$ from Example 6. After $Q^{\prime}_{1}$ is processed with the first four steps of the query cooperation-driven method, its additional query is as follows:

SELECT V1.NameMicrobe, NULL AS Date_of_Discovery, V1.Lengthiness, V1.Risk FROM Microbio_Food_view V1 WHERE V1.Lengthiness IS NULL or V1.Risk IS NULL

Note that, in the ‘SELECT’ block of the above query, our method generates the missing attribute ‘Date_of_Discovery’ with a null value (using step 2). The second query, related to $Q^{\prime}_{2}$, is obtained after applying steps 1, 3, and 4:

SELECT V2.Name, V2.Discovery_date, V2.Length, V2.Danger FROM Microorganism_view V2 WHERE V2.Length IS NULL or V2.Danger IS NULL

The key idea behind using queries with missing information in their conditions is that they can retrieve more data while retaining the representation of the rewritten query, thus improving the quality of the rewritten query answers.
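The construction of an additional query (steps 2 and 4) can be sketched in Python as follows. The function and its parameters are hypothetical and serve only to illustrate how the null placeholders and the ‘OR’-connected conditions of Example 7 are assembled; building SQL by string concatenation is acceptable here only because all names come from the fuzzy views, not from user input.

```python
def additional_query(select_attrs, missing_attrs, condition_attrs, view, alias):
    """Build an additional query: NULL placeholders for missing attributes (step 2)
    and IS NULL conditions connected by OR (step 4)."""
    columns = [
        f"NULL AS {a}" if a in missing_attrs else f"{alias}.{a}"
        for a in select_attrs
    ]
    where = " OR ".join(f"{alias}.{a} IS NULL" for a in condition_attrs)
    return f"SELECT {', '.join(columns)} FROM {view} {alias} WHERE {where}"

# Additional query of Q'1 from Example 7.
q1_additional = additional_query(
    ["NameMicrobe", "Date_of_Discovery", "Lengthiness", "Risk"],
    {"Date_of_Discovery"},          # attribute not covered by Microbio_Food_view
    ["Lengthiness", "Risk"],        # attributes used in the conditions of Q'1
    "Microbio_Food_view",
    "V1",
)
```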

The last step of the query cooperation-driven method combines, via a union, the rewritten query with the additional one to generate a cooperative query ready to be executed. To execute such a cooperative query, the flexible wrapper assigns a membership degree to each incomplete and uncertain piece of information defined in the query conditions, according to the metadata of the fuzzy predicates. Due to space limitations, we skip the description of the algorithm Get-Degree (T, a, A, B, b), which simply implements the different membership degrees defined by the quadruplet $(a,A,B,b)$ of the membership function related to the attribute T (see Section 4.1).

In the preliminary version of this article, we assigned to missing information, represented by the null value, the average value of the membership function, which equals 0.5, because such an incomplete value may or may not correspond to the user’s need [24]. Here, we extend this solution to disjunctive and probabilistic information.

In the case of disjunctive information, we attempt to give importance to each possible value since one value may be true in the current situation and false in another one. Therefore, we distribute the membership degree over different possible values of the disjunctive information.

Lemma 1. If an attribute $A$ is represented by $n$ possible values, $A_{1}$ or $A_{2}\dots$ or $A_{n}$, then the membership degree of this disjunctive information is $1/n$ for each value.

On the other hand, probabilistic information can be associated with membership degrees that correspond to its probability values, which lie between 0 and 1.
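The degree assignment for missing, disjunctive, probabilistic, and precise values can be sketched in Python as follows. The representation of values (`None` for a null value, a list for disjunctive information, a dictionary of probabilities for probabilistic information) is an assumption made for this illustration.

```python
NULL_DEGREE = 0.5  # degree assigned to a missing (null) value

def disjunctive_degrees(values):
    """Lemma 1: each of the n possible values receives the degree 1/n."""
    n = len(values)
    return {v: 1 / n for v in values}

def degrees(value):
    """Map a stored value to its possible values and their membership degrees."""
    if value is None:                       # missing information
        return {None: NULL_DEGREE}
    if isinstance(value, (list, tuple)):    # disjunctive information
        return disjunctive_degrees(list(value))
    if isinstance(value, dict):             # probabilistic: value -> probability
        return dict(value)
    return {value: 1.0}                     # precise information

lari_date = degrees([1930, 1936])   # "1930 or 1936" from Table 2
proteus_date = degrees(1885)
missing_date = degrees(None)
```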

Example 8: From Examples 6 and 7, respectively, we take both the rewritten query $Q^{\prime}_{1}$ and its additional query to give the cooperative query $Q^{\prime\prime}_{1}$ as follows:

SELECT V1.NameMicrobe, 0.5 AS Date_of_Discovery, V1.Lengthiness, Get-Degree (Length, 5, 10, 0, 0) AS Degree1, V1.Risk, Get-Degree (Danger, 11, 14, 24, 27) AS Degree2 FROM Microbio_Food_view V1 UNION SELECT V1.NameMicrobe, 0.5 AS Date_of_Discovery, V1.Lengthiness, 0.5 AS Degree1, V1.Risk, 0.5 AS Degree2 FROM Microbio_Food_view V1 WHERE V1.Lengthiness IS NULL or V1.Risk IS NULL

This cooperative query, $Q^{\prime\prime}_{1}$, displays the name, date of discovery, lengthiness, and risk of long microbes having medium danger (via the first sub-query) and of microbes whose lengthiness or risk is missing (via the second sub-query). Using the Get-Degree algorithm, we give the membership degree to the fuzzy predicates ‘Long’ and ‘Medium’ for the attributes Lengthiness and Risk, respectively. These attributes correspond to Length and Danger, respectively, in the global fuzzy schema.

We can see that the missing attribute ‘Date_of_Discovery’ is associated with the value 0.5 as a completeness degree. Moreover, each query answer is represented by the following six attributes rather than four: NameMicrobe, Date_of_Discovery, Lengthiness, Degree1, Risk, and Degree2.

4.2.3 Rewriting answers module

The queried flexible wrapper receives its answers written in terms of the data source vocabulary, which may differ from that of the global fuzzy schema. Recall that the query-rewriting algorithm (see Algorithm 3) stores the matchings between the user query $Q$ and the rewritten query $Q^{\prime}$. We use these matchings to translate the table names and attribute names of the answers into the vocabulary of the global fuzzy schema. The rewriting answers process consists of four steps, as follows:

(1)
From the query-rewriting module, extract the matchings between Q and $Q^{\prime}$ .
(2)
Translate the table name of answers to the corresponding name in Q.
(3)
Skip the attributes that represent membership degrees, such as Degree1 and Degree2 from Example 8; they are left unchanged.
(4)
Translate each attribute name of answers to the corresponding name in Q.

The result of the rewriting answers module is a set of answers whose representation (table name and attribute name) corresponds to that indicated in the user query, providing transparent access to multiple data sources.
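The attribute translation of the rewriting answers process can be sketched in Python as follows. The sketch assumes that the stored matchings are available as a dictionary from source attribute names to global-schema names; the dictionary shown here is inferred from the headers of Tables 5 and 6, and the degree attributes are passed through untouched.

```python
def rewrite_answer(answer, attribute_matchings, degree_attrs=("Degree1", "Degree2")):
    """Rename source attributes to global-schema names; degree attributes are kept as-is."""
    rewritten = {}
    for name, value in answer.items():
        if name in degree_attrs:
            rewritten[name] = value                     # step 3: skipped, not translated
        else:
            rewritten[attribute_matchings.get(name, name)] = value   # step 4
    return rewritten

# Matchings assumed from the headers of Tables 5 and 6.
matchings = {"NameMicrobe": "Title_microbe", "Lengthiness": "Length", "Risk": "Danger"}

# Second answer of Table 5.
source_row = {"NameMicrobe": "Salmonella", "Date_of_discovery": 0.5,
              "Lengthiness": 11, "Degree1": 1, "Risk": 13, "Degree2": 0.09}
global_row = rewrite_answer(source_row, matchings)
```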

Example 9: We present in Table 5 some results of the cooperative query $Q^{\prime\prime}_{1}$ from Example 8.

Table 5
Examples of answering the cooperative query $Q^{\prime\prime}_{1}$

NameMicrobe	Date_of_discovery	Lengthiness	Degree1	Risk	Degree2
Escherichia coli	0.5	7	0.6	15	1
Salmonella	0.5	11	1	13	0.09
Jejuni	0.5	NULL	0.5	25	0.03

From Table 5, we apply our rewriting answers process, and the result is shown in Table 6.

Table 6
Result of rewriting answers presented in Table 5

Title_microbe	Date_of_discovery	Length	Degree1	Danger	Degree2
Escherichia coli	0.5	7	0.6	15	1
Salmonella	0.5	11	1	13	0.09
Jejuni	0.5	NULL	0.5	25	0.03

Table 7
Summary of the different databases

				Incompleteness rate		Uncertainty rate
DB	Total number of tables	Total number of records	Average number of attributes	Missing info %	Disjunctive info %	Fuzzy info %	Probabilistic info %
DB1	4	2307	9	24	20	–	–
DB2	7	3420	7	13	27	–	–
DB3	6	2353	9	–	–	27	21
DB4	5	3280	6	–	–	28	25
DB5	8	2475	8	23	10	21	7
DB6	7	3983	9	20	11	19	9

Except for the additional attributes, the structure of the rewritten answers (e.g., the names of attributes) matches the user query, as expected. Each flexible wrapper now sends its rewritten answers to the flexible mediation layer.

4.3 Data sources layer

The data sources layer represents the multiple heterogeneous relational databases that we want to integrate. Uncertain and incomplete information coexist in data sources: dealing with data uncertainty by removing records with uncertain information leads to incomplete query results, whereas incomplete information invites the insertion of inaccurate or uncertain values to fill in the missing data.

In our flexible mediator, we use relational databases (RDBs) as an example of data sources since they are commonly used in information systems and enterprises [48]. Moreover, RDBs are easy to administer and manipulate through a database management system (DBMS). In our previous work [24], we arbitrarily created five heterogeneous RDBs in the domain of food safety. In this article, we improve this layer by creating six databases classified into three categories. The first category contains two databases with only precise and incomplete information. The second contains two databases with only precise and uncertain information. The third contains two databases with precise, incomplete, and uncertain information. We adopt this categorization because few works have studied the integration of both incomplete and uncertain databases. Therefore, to provide a sound and complete experimental study, we compare our work with three classes of related work that follow this categorization.

On the other hand, the heterogeneous nature of the information gives rise to several semantic conflicts. Using semantic similarity measures in our intelligent mediation approach provides a promising way to cope with these heterogeneity challenges. Table 7 summarizes the different databases.

Table 7 shows the principal features of the databases used in our experiments, such as the total number of tables and records and the average number of attributes. We also present the percentages of occurrences of missing, disjunctive, fuzzy, and probabilistic information.

Recall that our flexible mediator system uses LAV-based fuzzy mappings between schemas, which keeps the system interoperable. Any source can be freely added to or removed from the data sources layer without harming the other sources. Adding a new source involves the automatic creation of its fuzzy views, whereas removing a data source requires removing its flexible wrapper, including its fuzzy mapping.

5. Experiments and performance evaluation

To evaluate the performance of the flexible mediator system, we performed extensive experiments, whose results are discussed below. Our system was developed in Java with the WS4J (WordNet Similarity for Java) library, which includes a set of semantic similarity measures, such as the Wup similarity [49]. All the experiments were performed on a PC with an Intel Core i7 CPU at 4.0 GHz and 8 GB of RAM.

Section 4.3 introduced the relational databases used in our experiments. These databases were built with the Oracle DBMS, release 11.2.0.2.0, which provides the ability to view tables back in time, superior compression of all types of data, and grid computing functions [50].

5.1 Experimental settings

To report the evaluation of the proposed approach, we used three categories of relational databases in our experiments (see Section 4.3). Generally, noise in datasets hinders most types of data analysis. For a good performance of data integration methods, it is necessary to remove identified noise, giving high-quality datasets. On the other hand, we apply the three popular performance measures: precision $(P)$ , recall $(R)$ , and F-measure $(F)$ . We first adjust these metrics in the context of data integration:

$\displaystyle P=\frac{\textit{Number of correctly integrated information}}{\textit{Number of integrated information}}$ (3)

$\displaystyle R=\frac{\textit{Number of correctly integrated information}}{\textit{Number of correct information}}$ (4)

$\displaystyle F=\frac{2\cdot P\cdot R}{R+P}$ (5)

These performance metrics indicate the effectiveness and accuracy of the data integration system. Precision $P$ is the ratio of the integrated information that is relevant to the user query. Recall $R$ is the ratio of the relevant information that is successfully integrated. The F-measure is the harmonic mean of $P$ and $R$.
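Equations (3) to (5) translate directly into the following Python sketch. The counts in the usage example are hypothetical and were chosen only to make the arithmetic easy to follow; they are not experimental results.

```python
def precision_recall_f(correctly_integrated, integrated, correct):
    """Precision (3), recall (4), and F-measure (5) from raw counts."""
    p = correctly_integrated / integrated if integrated else 0.0
    r = correctly_integrated / correct if correct else 0.0
    f = 2 * p * r / (p + r) if (p + r) else 0.0
    return p, r, f

# Hypothetical counts: 10 integrated items, 8 of them correct,
# out of 16 correct items in the ground truth.
p, r, f = precision_recall_f(correctly_integrated=8, integrated=10, correct=16)
```

With these counts, $P=0.8$ and $R=0.5$, so $F=2\cdot 0.8\cdot 0.5/1.3\approx 0.615$.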

Furthermore, these metrics require a ground truth, which has been defined as training data. In our work, producing a ground truth for each category of databases is a tedious task, and we invested considerable time and effort in generating these training data for a sound experimental study. In total, we ran 150 queries, yielding 3837 results, for three ground truths ($GT1$, $GT2$, and $GT3$), described in Table 8.

Table 8

Description of the ground truths

			Incompleteness rate		Uncertainty rate
Ground truth	Number of queries	Number of answers	Missing info %	Disjunctive info %	Fuzzy info %	Probabilistic info %
GT1	25	1281	19	48	–	–
GT2	25	989	–	15	23	20
GT3	100	1567	10	20	17	5

From Table 8, $GT1$ contains integrated information from only incomplete databases. It has been generated by executing 25 queries with a total of 1281 answers. $GT2$ comprises the result of the execution of 25 queries against databases that only contain uncertain information. According to the proposed approach, the content of $GT2$ can include incomplete information, especially the disjunctive information, which describes a part of cooperative answers. In the third ground truth $GT3$ , we have applied our approach by running 100 queries with 1567 answers in all. After that, we suggest three evaluation studies in order to evaluate the performance of our proposed approach.

5.2 Correctness and effectiveness

Our intelligent mediation approach aims to produce more detailed cooperative answers from similar ones. Determining similar answers plays a vital role in generating cooperative answers that are expected to achieve better performance. The closeness of answers is given by the similarity degree measured between them with the Sim-Responses algorithm. So, we need to find a suitable threshold $\alpha$ for the similarity degree, which indicates at the initial stage whether two answers are similar or not.

We start from the initial value $\alpha=0.5$ and increase it until we obtain the optimal threshold, which achieves acceptable results in terms of precision $(P)$, recall $(R)$, and F-measure $(F)$. For this analysis, we conducted several experiments, each using a different threshold value, for the three categories of data integration (DI). We performed 14 experiments on 25 queries for each of DI1 and DI2, and on 100 queries for DI3. We mainly focused on the last category, which is the one most closely related to our proposal. Due to space limitations, we present, for each category of data integration, only the averages of $P$, $R$, and $F$, namely $\textit{Avg}(P)$, $\textit{Avg}(R)$, and $\textit{Avg}(F)$, respectively. Table 9 shows the results of this study.

Table 9
Result of the correctness and effectiveness study

$\alpha$	Experiments of DI1			Experiments of DI2			Experiments of DI3
	$\textit{Avg}(P)$	$\textit{Avg}(R)$	$\textit{Avg}(F)$	$\textit{Avg}(P)$	$\textit{Avg}(R)$	$\textit{Avg}(F)$	$\textit{Avg}(P)$	$\textit{Avg}(R)$	$\textit{Avg}(F)$
0.50	0.52	0.50	0.51	0.52	0.50	0.51	0.61	0.53	0.57
0.52	0.52	0.50	0.51	0.52	0.50	0.51	0.61	0.53	0.57
0.54	0.54	0.50	0.52	0.52	0.50	0.51	0.65	0.55	0.60
0.56	0.57	0.54	0.55	0.53	0.63	0.58	0.69	0.61	0.65
0.58	0.71	0.81	0.76	0.64	0.70	0.67	0.72	0.70	0.71
0.60	0.71	0.90	0.80	0.64	0.70	0.67	0.78	0.85	0.81
0.62	0.89	0.97	0.93	0.75	0.85	0.80	0.79	0.85	0.82
0.64	0.89	0.97	0.93	0.82	0.85	0.84	0.80	0.87	0.83
0.66	0.89	0.97	0.93	0.90	0.97	0.93	0.83	0.87	0.85
0.68	0.89	0.97	0.93	0.90	0.97	0.93	0.85	0.90	0.88
0.70	0.90	0.97	0.94	0.91	0.97	0.94	0.92	0.98	0.95
0.72	0.90	0.97	0.94	0.91	0.97	0.94	0.92	0.98	0.95
0.74	0.90	0.97	0.94	0.91	0.97	0.94	0.92	0.98	0.95
0.76	0.90	0.97	0.94	0.91	0.97	0.94	0.92	0.98	0.95

We compare the results in Table 9. For the first category of data integration, DI1, the integration performs poorly for threshold values $\alpha$ between 0.50 and 0.60. $\textit{Avg}(P)$, $\textit{Avg}(R)$, and $\textit{Avg}(F)$ increase to 0.89, 0.97, and 0.93, respectively, when $\alpha$ lies between 0.62 and 0.68. When $\alpha\geqslant 0.70$, our approach achieves stable and high data integration performance. For the second category, DI2, $\textit{Avg}(P)$, $\textit{Avg}(R)$, and $\textit{Avg}(F)$ increase to 0.90, 0.97, and 0.93, respectively, when $\alpha$ lies between 0.66 and 0.68. Compared with the results for DI1, the integration of uncertain information requires more precision in determining the similarity between answers. On the other hand, the integration of both incomplete and uncertain information in DI3 gives its best result when $\alpha=0.7$, with $\textit{Avg}(P)$, $\textit{Avg}(R)$, and $\textit{Avg}(F)$ reaching 0.92, 0.98, and 0.95, respectively. When $\alpha$ exceeds 0.7, the performance remains stable. Hence, we stopped the experiments at $\alpha=0.76$ and concluded that $\alpha=0.7$ gives the best results in terms of correctness, effectiveness, and consistency of our approach. In subsequent evaluation studies, we therefore set the threshold $\alpha$ to 0.7.
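One way to read the threshold choice off Table 9 is sketched below in Python. The selection rule (the smallest $\alpha$ that reaches the highest $\textit{Avg}(F)$) is our simplified reading of the procedure described above, stated here as an assumption; the values are the DI3 column of Table 9.

```python
# Avg(F) for DI3, copied from Table 9 (threshold alpha -> Avg(F)).
avg_f_di3 = {
    0.50: 0.57, 0.52: 0.57, 0.54: 0.60, 0.56: 0.65, 0.58: 0.71,
    0.60: 0.81, 0.62: 0.82, 0.64: 0.83, 0.66: 0.85, 0.68: 0.88,
    0.70: 0.95, 0.72: 0.95, 0.74: 0.95, 0.76: 0.95,
}

def smallest_best_threshold(scores):
    """Return the smallest threshold that reaches the maximum score."""
    best = max(scores.values())
    return min(alpha for alpha, score in scores.items() if score == best)

alpha = smallest_best_threshold(avg_f_di3)
```

For the DI3 column, this rule returns 0.70, after which $\textit{Avg}(F)$ no longer changes.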

5.3 Efficiency

We study the efficiency of the proposed approach in terms of data integration time. Experiments were set up to consider two distinct scenarios:

(1)
Varying percentages of each piece of incomplete and uncertain information in databases.
(2)
Query complexity: The robustness of the data integration system is related to the query processing complexity.

In the first scenario, we vary the percentages of incomplete and uncertain information and measure the running time of the three data integration categories (DI1, DI2, and DI3). Figure 7 shows the experimental results.

Figure 7.
Efficiency with varying percentages of incomplete and uncertain information.

Overall, for the first and second categories of data integration (DI1 and DI2), the execution time remains stable when the ratio of incomplete or uncertain information increases. Moreover, the time cost gap between DI1 and DI2 stays quite small as the percentages of information increase, as expected. This reveals that our approach performs better when incompleteness and uncertainty are studied separately. On the other hand, the time cost for DI3 is also stable, with a small gap between DI3 and the other data integrations. This result means that increasing the amount of incomplete and uncertain information does not affect the efficiency of data integration.

The second scenario of the efficiency study concerns the complexity of queries, which has a great impact on the performance of data integration. We used 150 queries of different complexities, for example, conjunctive and disjunctive queries, flexible queries (queries with conditions over uncertain or incomplete information), queries with an aggregate function, and queries with join operators. As an example of a complex query, consider a request for information about high-calorie foods that can quickly become contaminated and produce dangerous microbes. This request can be expressed as the following flexible query.

SELECT F.*
FROM FS.Contamine C JOIN (FS.Food F, FS.Microbe M)
ON (C.idF = F.idF AND C.idM = M.idM)
WHERE F.Calorie = High AND C.Spread = Quick AND M.Danger = Hard
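As an illustration of how fuzzy conditions such as Calorie = High could be evaluated, the sketch below assigns each linguistic label a membership function and combines the three conditions with the minimum operator, a common choice for fuzzy conjunction. This is only a sketch under stated assumptions: the ramp-shaped membership functions, their breakpoints, the numeric encoding of Spread and Danger, the sample rows, and the 0.5 cutoff are all hypothetical and are not the definitions used by the flexible wrappers.

```python
def ramp_up(x, low, high):
    """Membership rising linearly from 0 at `low` to 1 at `high`."""
    if x <= low:
        return 0.0
    if x >= high:
        return 1.0
    return (x - low) / (high - low)


def satisfaction(row):
    """Degree to which a row satisfies High AND Quick AND Hard (min as AND)."""
    high = ramp_up(row["calorie"], 200, 400)   # hypothetical breakpoints
    quick = ramp_up(row["spread"], 0.3, 0.8)   # hypothetical breakpoints
    hard = ramp_up(row["danger"], 0.4, 0.9)    # hypothetical breakpoints
    return min(high, quick, hard)


# Hypothetical joined rows (food, contamination spread, microbe danger).
rows = [
    {"food": "cream cake", "calorie": 450, "spread": 0.9, "danger": 0.8},
    {"food": "lettuce", "calorie": 15, "spread": 0.7, "danger": 0.6},
]

# Keep rows whose satisfaction degree reaches a hypothetical cutoff of 0.5.
selected = [row["food"] for row in rows if satisfaction(row) >= 0.5]
```

With these assumed values, only the first row is retained, because the second row has a zero membership degree for High calories.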

Figure 8 shows the experimental results.

Figure 8.
Efficiency with various query complexities.

Overall, Fig. 8 shows that all three curves trend slightly upward as the complexity of queries increases. The efficiency of the approach reflects the robustness of the flexible wrappers against query complexity on the one hand and the accuracy of generating cooperative answers on the other hand. In DI1 and DI2, the execution time of the different queries is quite stable, while DI3 shows a slight rise. The reason is that integrating incomplete and uncertain information together requires more processing than integrating each type of information separately.

In summary, the intelligent data integration approach scales with both the amount of incomplete and uncertain information and the complexity of query processing while keeping computation fast.

5.4 Comparative performance

The third evaluation study compares the performance of our proposed approach with that of the relevant work in terms of precision, recall, and F-measure. We have stated earlier that very few works address data integration from incomplete and uncertain databases. Hence, we compare the performance of our approach with the most important work in each of three categories: the semantic-based approach proposed in [27] for incomplete data integration, the fuzzy RDF data model [30] for uncertain data integration, and the fuzzy mediation of our previous work [24] for integrating incomplete and uncertain information. We ran 50 queries for the comparison with the first two works and 100 for the comparison with our previous work. Figure 9 shows the average precision $\textit{Avg}(P)$, recall $\textit{Avg}(R)$, and F-measure $\textit{Avg}(F)$.

Figure 9.
Result of comparative performance.

Based on the results presented in Fig. 9, our approach outperforms the other approaches and gives the highest precision, recall, and F-measure, which are very close to one. In the integration of incomplete information, our work achieves high accuracy compared to the semantic-based integration approach [27], with $\textit{Avg}(P)=0.90$, $\textit{Avg}(R)=0.97$, and $\textit{Avg}(F)=0.94$. The semantic-based integration focuses on the semantic query representation without taking data modeling into account. Furthermore, our approach achieves higher uncertain data integration performance than the fuzzy RDF model [30]. The latter achieved lower accuracy, with $\textit{Avg}(P)=0.80$, $\textit{Avg}(R)=0.94$, and $\textit{Avg}(F)=0.87$.

Finally, the proposed approach performs better than our previous work [24]. Our intelligent data integration approach achieves a significant improvement in efficiency due to the cooperation between queries, which integrates more information, on the one hand, and between similar answers, which yields cooperative answers, on the other hand. These answers contain information that is more detailed and complete than the information contained in the approximate answers from our previous work [24]. This indicates that using similar answers to enhance the results is more successful than removing them.
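The idea of building one cooperative answer from two similar answers can be sketched as follows. This is a deliberately simplified illustration, not the method of the mediator: the similarity function below is a plain stand-in (the share of commonly known attributes with equal values) for the fuzzy and semantic similarity measures used in the approach, and all attribute names other than idF, as well as the sample answers, are hypothetical. The threshold of 0.7 is the value of $\alpha$ retained in the experiments.

```python
def similarity(a, b):
    """Share of attributes known in both answers that carry equal values.

    Stand-in measure for illustration only; missing values are None.
    """
    shared = [k for k in a if k in b and a[k] is not None and b[k] is not None]
    if not shared:
        return 0.0
    return sum(a[k] == b[k] for k in shared) / len(shared)


def cooperate(a, b, alpha=0.7):
    """Merge two answers when their similarity reaches alpha, else None."""
    if similarity(a, b) < alpha:
        return None
    merged = dict(a)
    for key, value in b.items():
        # Fill only the gaps; where both values are known, keep the one from a.
        if merged.get(key) is None:
            merged[key] = value
    return merged


# Hypothetical answers about the same food returned by two wrappers.
answer_1 = {"idF": 12, "name": "cream cake", "calorie": "High", "shelf_life": None}
answer_2 = {"idF": 12, "name": "cream cake", "calorie": None, "shelf_life": "Short"}

cooperative = cooperate(answer_1, answer_2)
```

Here the two answers agree on every attribute they both know, so they are merged into a single answer in which each one fills the other's missing value, rather than one of them being discarded.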

In summary, the experimental results show the effectiveness and efficiency of the proposed intelligent mediation approach, based on fuzzy logic and semantic similarity measures for integrating incomplete and uncertain information from HRDB.

6. Conclusion

Several existing methods have investigated the problems of incompleteness and uncertainty in data integration independently of each other. The contribution presented in this paper aims at correctly integrating both incomplete and uncertain information from heterogeneous relational databases. We have presented an intelligent data integration approach that provides cooperative answers that best meet the user requirements.

To examine the efficiency and effectiveness of our proposed approach, we have developed the flexible mediator system and performed extensive experiments. The results show that the approach improves data integration performance when the incompleteness and uncertainty problems are present independently or simultaneously. Although our solution gives clear performance improvements in data integration, there is still room for optimization. For instance, the metadata of fuzzy predicates could be enriched with linguistic modifiers, usually called hedges (e.g., more or less, very, etc.), to improve the descriptive power of incomplete and uncertain information.
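As a brief illustration of what such hedges do, the sketch below uses the classical operators from fuzzy set theory, in which "very" concentrates a membership degree by squaring it and "more or less" dilates it by taking its square root. Whether the mediator would adopt exactly these operators is an assumption; the function names and the sample degree are hypothetical.

```python
import math


def very(mu):
    """Concentration: 'very X' lowers intermediate membership degrees."""
    return mu ** 2


def more_or_less(mu):
    """Dilation: 'more or less X' raises intermediate membership degrees."""
    return math.sqrt(mu)


# Hypothetical membership degree of a food in the fuzzy set "High calorie".
mu_high = 0.64
mu_very_high = very(mu_high)
mu_more_or_less_high = more_or_less(mu_high)
```

Both operators leave the degrees 0 and 1 unchanged, so a hedge only reshapes the partial memberships of a fuzzy predicate.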

Integrating information from large amounts of data is another thorny issue in industry today, and it will remain a problem in the future. Our ongoing work therefore also aims to improve the proposed approach so that it maintains high efficiency when intelligently integrating large-scale databases whose data are incomplete and uncertain.

References

1. Nicklas, Schwarz and Mitschang, A schema-based approach to enable data integration on the fly, International Journal of Cooperative Information Systems 26(01) (2017), 1650010.

2. Doan, Halevy and Ives, Principles of data integration, Elsevier, 2012.

3. Rosenberg and Landers, An overview of MULTIBASE, North-Holland Publishers, New York, NY, 1982.

4. Sheth A.P. and Larson J.A., Federated database systems for managing distributed, heterogeneous, and autonomous databases, ACM Computing Surveys (CSUR) 22(3) (1990), 183–236.

5. Arens C.N.H.Y., Chee C.Y. and Knoblock C.A., Retrieving and integrating data from multiple information sources, International Journal of Cooperative Information Systems 2(2) (1993), 127–158.

6. Inmon W.H. and Kelley, Rdb – VMS: Developing a Data Warehouse, John Wiley and Sons Inc., 1993.

7. Hull and Zhou, A framework for supporting data integration using the materialized and virtual approaches, in: Proceedings of the 1996 ACM SIGMOD International Conference on Management of Data, 1996, pp. 481–492.

8. Majeed R.W., Stöhr M.R., Ruppert and Günther, Data Discovery for Integration of Heterogeneous Medical Datasets in the German Center for Lung Research (DZL), in: GMDS, 2018, pp. 65–69.

9. Feng, Huber, Glavic and Kennedy, Uncertainty annotated databases-a lightweight approach for approximating certain answers, in: Proceedings of the 2019 International Conference on Management of Data, 2019, pp. 1313–1330.

10. Soliman M.A., Ilyas I.F. and Ben-David, Supporting ranking queries on uncertain and incomplete data, The VLDB Journal 19(4) (2010), 477–501.

11. Chaparro, Zampetti, Moreno, Di Penta et al., Detecting missing information in bug descriptions, in: Proceedings of the 2017 11th Joint Meeting on Foundations of Software Engineering, 2017, pp. 396–407.

12. Greco, Molinaro and Trubitsyna, Algorithms for computing approximate certain answers over incomplete databases, in: Proceedings of the 22nd International Database Engineering & Applications Symposium, 2018, pp. 1–4.

13. Link and Prade, Relational database schema design for uncertain data, Information Systems 84 (2019), 88–110.

14. Chen, Griffin W.M. and Matthews H.S., Representing and visualizing data uncertainty in input-output life cycle assessment models, Resources, Conservation and Recycling 137 (2018), 316–325.

15. Gozhyj, Kalinina, Vysotska and Gozhyj, The method of web-resources management under conditions of uncertainty based on fuzzy logic, in: 2018 IEEE 13th International Scientific and Technical Conference on Computer Sciences and Information Technologies (CSIT), IEEE, Vol. 1, 2018, pp. 343–346.

16. Halpern J.Y., Reasoning about uncertainty, MIT Press, 2017.

17. Gao and Ji, Schema induction from incomplete semantic data, Intelligent Data Analysis 22(6) (2018), 1337–1353.

18. Kuchárik and Balogh, Modeling of Uncertainty with Petri Nets, in: Asian Conference on Intelligent Information and Database Systems, Springer, 2019, pp. 499–509.

19. Miao S.G.X., Gao and Liu, Incomplete data management: A survey, Frontiers of Computer Science 12(1) (2018), 4–25.

20. Zhou and Tang, A Note on Incomplete Information Modeling in the Evidence Theory, IEEE Access 7 (2019), 166410–166414.

21. Aggoune, A Fuzzy Querying Using Cooperative Answers and Proximity Measure, in: International conference on the Sciences of Electronics, Technologies of Information and Telecommunications, SETIT 2018. Smart Innovation, Systems and Technologies, B. M. and R. S, eds, Springer International Publishing, Cham, 2020, pp. 39–49.

22. Cuzzocrea, Greco, Larsen H.L., Saccà, Andreasen and Christiansen, Flexible Query Answering Systems: 13th International Conference, FQAS 2019, Amantea, Italy, July 2–5, 2019, Proceedings, Vol. 11529, Springer Nature, 2019.

23. Gulzar R.M.A.Q.X.Y., Alwan A.A. and Swidan M.B., SCSA: Evaluating skyline queries in incomplete data, Applied Intelligence 49(5) (2019), 1636–1657.

24. Aggoune, Towards a Flexible Mediator Architecture Using Fuzzy Logic for Integration of Incomplete and Uncertain Information, in: Proceedings of the 2nd Mediterranean Conference on Pattern Recognition and Artificial Intelligence, ACM, 2018, pp. 7–13.

25. Aggoune, Bouramoul and Kholladi M.K., Mediation system for dealing with semantic problems in databases, International Journal of Data Mining, Modelling and Management 9(2) (2017), 99–121.

26. Nikolaou, Grau B.C., Kostylev et al., Satisfaction and Implication of Integrity Constraints in Ontology-based Data Access, in: International Joint Conferences on Artificial Intelligence, 2019, pp. 1829–1835.

27. Console, Guagliardo and Libkin, On Querying Incomplete Information in Databases under Bag Semantics, in: International Joint Conferences on Artificial Intelligence, Vol. 17, 2017, pp. 993–999.

28. Hannou F.-Z., Amann and Baazizi M.-A., Explaining Query Answer Completeness and Correctness with Partition Patterns, in: International Conference on Database and Expert Systems Applications, Springer, 2019, pp. 47–62.

29. Jaradat, Halimeh A.A., Deraman and Safieddine, A best-effort integration framework for imperfect information spaces, International Journal of Intelligent Information and Database Systems 11(4) (2018), 296–314.

30. and Yan, Modeling fuzzy data with RDF and fuzzy relational database models, International Journal of Intelligent Systems 33(7) (2018), 1534–1554.

31. Gal, Roitman and Shraga, Heterogeneous data integration by learning to rerank schema matches, in: 2018 IEEE International Conference on Data Mining (ICDM), IEEE, 2018, pp. 959–964.

32. Leone, Greco, Ianni, Lio, Terracina et al., The INFOMIX system for advanced integration of incomplete and inconsistent data, in: Proceedings of the 2005 ACM SIGMOD International Conference on Management of Data, 2005, pp. 915–917.

33. Moura, Soares, Sampaio, Reiser et al., fGrid: Uncertainty variables modeling for computational grids using fuzzy logic, in: 2016 IEEE International Conference on Fuzzy Systems (FUZZ-IEEE), IEEE, 2016, pp. 2249–2256.

34. Halevy A.Y., Answering queries using views: A survey, The VLDB Journal 10(4) (2001), 270–294.

35. Mittal and Jain, Word sense disambiguation method using semantic similarity measures and owa operator, ICTACT Journal on Soft Computing 5(2) (2015).

36. and Palmer, Verb semantics and lexical selection, arXiv preprint cmp-lg/9406033 1 (1994).

37. Nguyen H.V. and Bai, Cosine Similarity Metric Learning for Face Verification, in: Computer Vision – ACCV 2010, Kimmel, Klette and Sugimoto, eds, Springer Berlin Heidelberg, Berlin, Heidelberg, 2011, pp. 709–720.

38. Andrew, Laura, Michael V.G., Natalie, Anthony et al., Guidance on communication of uncertainty in scientific assessments, European food safety authority and hart Journal 17(1) (2019), e05520.

39. Nicolau A.I., Barker G.C., Aprodu and Wagner, Relating the biotracing concept to practices in food safety, Food Control 29(1) (2013), 221–225.

40. Zadeh L.A., Fuzzy sets, in: Fuzzy sets, fuzzy logic, and fuzzy systems, World Scientific, 1996, pp. 394–432.

41. Hoa, A type-2 fuzzy relational database model, Journal of Research and Development on Information and Communication Technology, 2017.

42. Miller G.A., WordNet: An electronic lexical database, MIT Press, 1998.

43. Orkphol and Yang, Word sense disambiguation using cosine similarity collaborates with Word2vec and WordNet, Future Internet 11(5) (2019), 114.

44. Omran M.G., Engelbrecht A.P. and Salman, An overview of clustering methods, Intelligent Data Analysis 11(6) (2007), 583–605.

45. Levy, Rajaraman and Ordille, Query answering algorithms for information agents, 1996.

46. Duschka O.M., Genesereth M.R. and Levy A.Y., Recursive query plans for data integration, The Journal of Logic Programming 43(1) (2000), 49–73.

47. Pottinger and Halevy, MiniCon: A scalable algorithm for answering queries using views, The VLDB Journal 10(2–3) (2001), 182–198.

48. Romero and Vernadat, Enterprise information systems state of the art: Past, present and future trends, Computers in Industry 79 (2016), 3–13.

49. Finlayson, Java libraries for accessing the princeton wordnet: Comparison and evaluation, in: Proceedings of the Seventh Global Wordnet Conference, 2014, pp. 78–85.

50. Freeman R.G. and Nanda, Oracle Database 11g: New Features, McGraw-Hill/Oracle Press, 2008.

DB	Total number of tables	Total number of records	Average number of attributes	Incompleteness rate: missing info (%)	Incompleteness rate: disjunctive info (%)	Uncertainty rate: fuzzy info (%)	Uncertainty rate: probabilistic info (%)
DB1	4	2307	9	24	20	–	–
DB2	7	3420	7	13	27	–	–
DB3	6	2353	9	–	–	27	21
DB4	5	3280	6	–	–	28	25
DB5	8	2475	8	23	10	21	7
DB6	7	3983	9	20	11	19	9

Table 9
Result of the correctness and effectiveness study