Abstract
Schema matching is used for data integration, mediation, and conversion between heterogeneous sources. Nevertheless, mappings identified with an automatic or semi-automatic process can never be completely certain. In a process of concept alignment, it is necessary to manage uncertainty. In this paper, we present a fuzzy-based process of concept alignment for uncertainty management in the schema matching problem. The ultimate goal is to enable interoperability between different electronic health records. Data integration of health information is done through the mediation of our ubiquitous user model framework. Results look promising, and fuzzy theory proved to be a good fit for modeling uncertain schema matching. Fuzzy combined similarities can handle uncertainty in the schema matching process to enable interoperability between electronic health records, improving the quality of mappings and reducing human error in verifying the mappings.
Introduction
Health information of a patient is traditionally scattered among different stakeholders. Health systems provide and consume medical observations, clinical histories, medical reports, laboratory results of different types, radiology images, etc. Even in the digital age, handwritten documents remain prevalent throughout most of the health sector, despite all efforts to automate and share electronic health information among general practitioners, patients and other stakeholders. Healthcare is moving towards “solutions which support a continuous medical process and (i) include multiple healthcare professionals and institutions, (ii) utilize ubiquitous computing healthcare environments and (iii) embrace technological advances, typical of the domain of today’s pervasive software applications.” [1].
Many efforts have been made to automate medical information by developing Electronic Healthcare Records (EHRs), and standards have been proposed to enable sharing and reusing information amongst healthcare software systems. Nevertheless, health care information is very complex, and therefore interoperability between health systems is hard. Different formats and forms are used to capture a patient’s personal, social, family and medical history. Patient care involves documents of different natures: treatments, tests, progress notes, referrals, imaging, medical charts, and nursing notes, to mention just a few. Integration of patient information in EHRs entails interoperability problems. The health ecosystem is highly dynamic and medical stakeholders are autonomous. Each of these medical stakeholders is free to begin or leave the interaction with other health systems. Medical information is stored in structured formats, including databases, and in unstructured documents. These differences result in severe interoperability problems [2].
In order to achieve syntactic and semantic interoperability, two approaches are commonly used: adoption of standards and conversion/mediation between heterogeneous sources. The standards-based approach is the more widely used in the health domain. Major international organizations have made huge efforts towards standardization, but there is still no universal international consensus on standard adoption. Conversion and semantic mediation approaches between eHealth standards are always partial and difficult to maintain in an open and highly dynamic environment like the health ecosystem. A possible solution to overcome the limitations of the standardization and mediation approaches and leverage their advantages is to integrate elements of both: global standard adoption and conversion between major standards.
Schema matching is used for data integration, mediation, and conversion between heterogeneous sources. Nevertheless, mappings identified with an automatic or semi-automatic process can never be completely certain [3]. A human expert must determine the right matches between concepts or correct semi-automatic or automatic concept alignments. Schema mappings are inherently uncertain [4]. In a process of concept alignment, it is necessary to manage uncertainty. There are two basic approaches to managing uncertainty in schema matching: those based on probability theory and those based on fuzzy set theory.
In this paper, we present a fuzzy-based process of concept alignment for uncertainty management in the schema matching problem. The ultimate goal is to enable interoperability between different EHRs. Data integration of health information is done through the mediation of our ubiquitous user model framework [5, 6]. We provide a combined solution using global standard adoption and conversion between standards. Previous findings of this work were presented in [7, 8]. The hypothesis of this work is that fuzzy combined similarities can handle uncertainty in the schema matching process to enable interoperability between EHRs. A fuzzy-based process of concept alignment is expected to improve the quality of mappings and to reduce human error in verifying the mappings.
The rest of the paper is organized as follows. Efforts towards the integration of health information and EHRs interoperability are presented in Section 2. Related work in uncertainty schema matching is presented in Section 3. Our proposal to handle uncertainty for EHR Interoperability is presented in Section 4. Experiments and results in the application scenario are shown in Section 5. Conclusions are given in Section 6.
Electronic health records interoperability
EHRs are patients’ health information provided by clinicians, laboratories, hospitals, and other health care professionals and organizations. In the best scenario, patients’ information is stored according to a national or international standard. Formats can include relational databases and structured and unstructured storage. This results in a big interoperability problem [2]. The enhancement of interoperability among health systems is critical to enable the transformation of health care [9]. There are two main approaches to achieve syntactic and semantic interoperability: application of standards and mediation-based approaches. In the healthcare domain, interoperability is predominantly based on the application of standards [9]. Many national and international standards and guidelines have been proposed to support Health Information Systems (HIS). Nevertheless, there is no unanimous consensus regarding the adoption of standards. This means that translation between standards is frequently needed.
Semantic-based technologies and mediation approaches between representations are needed to cope with multiple representations and the lack of consensus on the adoption of standards.
The major international standardization organizations providing standard solutions for EHR interoperability are: the International Organization for Standardization (ISO) [10], the European Committee for Standardization [11], Health Level Seven [12], accredited by the American National Standards Institute (ANSI) [13], and Digital Imaging and Communications in Medicine [14].
Major technology players such as Google and Microsoft have also tried to contribute with personal health record (PHR) management systems [15]. Google Health offered users a Web-based system to manage health information in 2008, but it was retired in 2012 for lack of adoption. Microsoft HealthVault [16] is still active as a personal health record management system, but it did not achieve the expected user adoption [15].
Although many efforts have been made, there is no international consensus on standard adoption regarding document formats, terminology, and communication protocols in the eHealth community. On the other hand, conversion and semantic mediation approaches between eHealth standards are always partial and difficult to maintain in an open and highly dynamic environment like the health ecosystem. Current efforts point towards global standard adoption and conversion between major standards.
Uncertainty in schema matching for health information
The schema matching process is inherently uncertain; an automatic matching tool can never be certain to identify the correct correspondence between two elements [3]. Uncertainty management in schema matching is commonly done with two approaches: probability theory [19] and fuzzy set theory [20].
To our knowledge, only a few papers in the health domain discuss uncertainty in relation to EHRs.
Han et al. [17] discussed uncertainties faced by clinicians and patients in a process of diagnosis and treatment, and they proposed a conceptual taxonomy of uncertainty in health care domain. Although the authors do not refer to the uncertainty specifically in the application of technology, they do refer to the data obtained from different health sources. Hence, we used Han’s descriptions and taxonomy to try to understand uncertainty in health sources. The authors identify different types of uncertainty in health care. One of these types is uncertainty by source described as “incomplete information, inadequate understanding, or undifferentiated alternatives of equal attractiveness” [17]. These types of uncertainty can be related with conflicts that occur when trying to align models of different sources in any domain as described in [18] namely: a) Naming conflicts, b) Different graph structure, c) Different scope d) Different granularity e) Different focus.
Health information frequently has naming uncertainty: the same concept is used in two models with different names, or the same label defines different concepts. Some labels or concepts are incomplete in the sense that they are abbreviations. Names can also be ambiguous when labels are not meaningful in relation to the represented concept. Uncertainty in health information can also occur when two models describe relevant concepts with different granularity, scope or structure. Models can also adhere to different vocabularies or use different conventions.
Uncertain schema matching for electronic health records interoperability
Ubiquitous user model integration
As we reviewed in Section 2, health information of a patient is scattered, and different stakeholders provide and consume medical observations, clinical histories, medical reports, laboratory results of different types, radiology images, etc. Data integration of patient information in EHRs therefore entails interoperability problems. In order to cope with the aforementioned heterogeneity and dynamicity, interoperability mechanisms must require the least possible intervention and effort from the medical stakeholders while respecting the providers’ and consumers’ autonomy.
In previous works [5, 6] we presented a framework for ubiquitous user model interoperability. This framework can provide semantic mediation between eHealth stakeholders, enabling interoperability for sharing and reusing. The proposed framework enables interoperability between profile suppliers and consumers with a mixed approach that consists of a central ubiquitous user model ontology, which provides a formal representation of the user profile, and a process of concept alignment that automatically discovers the semantic mappings between the user models.
The central ubiquitous user model interoperability ontology (U2MIO) is a flexible representation of a ubiquitous user model that copes with the dynamicity of a distributed multi-application environment and provides mediation between profile suppliers and consumers. U2MIO can evolve over time to adapt the representation to the changing multi-application environment. The dynamic user profile structure ontology is based on the Simple Knowledge Organization System (SKOS) for the Web [21]. SKOS is a W3C recommendation for expressing the structure and content of concept schemes. Exchanging information about concept schemes described with a machine-readable standard like SKOS is easier and facilitates sharing and reusing.
The process of concept alignment automatically discovers the semantic mapping between the concepts of profile suppliers and consumers and the U2MIO ontology in order to interpret the information from heterogeneous sources and integrate them into a ubiquitous user model.
Ubiquitous user model interoperability ontology
The U2MIO represents a flexible user model profile that evolves over time according to the recommendations of the concept alignment. The ontology reuses the SKOS ontology [21], defining a central concept scheme for the ubiquitous user model and one concept scheme for each profile supplier or consumer. Semantic mappings between each stakeholder’s concept scheme and the central user model concept scheme are determined by the process of concept alignment. Semantic relations are set with SKOS properties. This representation supports interoperability by overcoming semantic differences and enables the participation of new stakeholders in the interoperability process without effort from the profile information provider or consumer. Figure 1 shows the interrelations between health stakeholders and the ubiquitous user model concept.

Figure 1. Interrelations between health stakeholders and ubiquitous user model.
In this section, we briefly describe the process of concept alignment as presented in previous works [22]. We present the modifications made to this process in order to handle uncertainty with a fuzzy-based concept alignment in Section 4.4.
The alignment process enables the interoperability between a document written or translated to XML (named source) with the ubiquitous user model (named target). The ubiquitous user model (u2m), represented in U2MIO ontology, provides articulation between heterogeneous sources, given the mappings from all individual concepts of the sources with the corresponding ubiquitous user model concepts. These mappings enable the interoperability between user models. The ubiquitous user model also provides articulation with possible EHR consumers: new or current stakeholders.
The input of the process of concept alignment is usually an instance in XML or JSON.
The process of concept alignment contemplates approximate matching because it not only considers exact match or disjoint (black or white); the matching can consider semantic relaxation suggesting possible answers to requests also including neighbors (hypernyms, hyponyms, meronyms) when no exact match is available. It can also be partial because finding all concepts of the source schema is not necessary to consider an alignment useful for this work.
Some source documents have concepts that are only useful for the provider, or have concepts with high matching difficulty. Typically, schema matching is performed at “design-time” to address semantic heterogeneity, but ubiquitous web applications may require “run-time” matching operations. New stakeholders appear, and changes to current stakeholders participating in the interoperability process can be discovered at run-time. In order to cope with this dynamicity, a continuous “design-time” and “run-time” matching strategy can be acceptable [23]. Commonly, the goal of the alignment process is to find the best match, so the cardinality of the output alignment is 1 : 1. Since the goal in this work is to verify how uncertainty affects the outcome, we allow 1 : N alignment. This means that the overall match result may relate one concept of the source schema to one or more elements of the target schema. This is explained further below. The outcome of each matching technique determines whether the relation between two concepts is Equivalent (=), Related (⊆, ⊇, ∩) or Independent (⊥). The ultimate action of the concept alignment is the integration of a new schema into the U2MIO ontology in order to enable sharing and reusing.
A mapping element is defined as a triple $\langle c_s, c_t, R \rangle$, where $c_s$ is a source concept, $c_t$ is a target concept, and $R$ is a semantic relation (e.g., Equivalent (=); Related (⊆, ⊇, ∩); Independent (⊥)) holding between the entities $c_s$ and $c_t$.
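As an illustration only, the mapping element triple can be represented as a small immutable record. The class names, the single `RELATED` value standing in for the three Related cases, and the example labels are assumptions made for this sketch, not part of the original framework.

```python
from dataclasses import dataclass
from enum import Enum


class Relation(Enum):
    """Semantic relations named in the text."""
    EQUIVALENT = "equivalent"
    RELATED = "related"          # covers the subset, superset and overlap cases
    INDEPENDENT = "independent"


@dataclass(frozen=True)
class MappingElement:
    source: str         # label of the source concept c_s
    target: str         # label of the target concept c_t
    relation: Relation  # semantic relation R holding between them


# Hypothetical labels, not taken from the paper's data.
example = MappingElement("birthyear", "birthDate", Relation.RELATED)
print(example)
```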
Two basic elements are available to find mappings between concepts: concept labels and hierarchical structures. Concept values can be used in conflict resolution when a concept is consumed and to determine if further transformation is necessary for interchangeability.
A concept scheme is considered as $(C, H_C)$, where $C$ is a set of concepts arranged in a subsumption hierarchy $H_C$. Concept values are not considered in this work. Each concept $c_s$ in the set of source concepts $C_S$ is defined by: a label
A concept on the target side $C_T$ is described by a set of labels included in the target
Similarity aggregation
It is commonly known in the schema matching research community that a combination of matching techniques improves the precision of the mappings found. Three matching techniques are used to determine the similarity between concept $c_s$ in $C_S$ and concept $c_t$ in $C_T$. Equation (1) is used to aggregate the outcome similarities of these three techniques.
The similarity $sim_{Dice}(c_s, c_t) \in [0, 1]$, based on the Dice coefficient [23], has the purpose of finding the lexical similarity between concept $c_s$ in $C_S$ and a concept $c_t$ in $U$. The Dice coefficient is a commonly used similarity measure between words. The main idea is to describe the compared words with two vectors $(a_1, a_2, \ldots, a_n)$ and $(b_1, b_2, \ldots, b_n)$ that represent, in this case, the bigrams of words A and B respectively. The general equation (2) incorporates the inner product, which represents the number of matches between the two vectors.
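A minimal sketch of a bigram-based Dice similarity is given below. It assumes the common set-based form, twice the number of shared bigrams divided by the total number of bigrams in both labels, which coincides with the vector form when the vectors are binary. Lowercasing the labels and the handling of very short labels are choices made for this sketch.

```python
def bigrams(label):
    """Return the set of character bigrams of a label (case-insensitive)."""
    s = label.lower()
    return {s[i:i + 2] for i in range(len(s) - 1)}


def dice_similarity(a, b):
    """Dice coefficient over character bigrams, in [0, 1]."""
    x, y = bigrams(a), bigrams(b)
    if not x and not y:
        # Labels shorter than two characters have no bigrams:
        # fall back to exact comparison (an assumption of this sketch).
        return 1.0 if a.lower() == b.lower() else 0.0
    return 2 * len(x & y) / (len(x) + len(y))


print(dice_similarity("night", "nacht"))  # only the bigram "ht" is shared
```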
Equation (3) calculates the longest common substring distance similarity $d_{lcs}(c_s, c_t) \in [0, 1]$, which tries to find whether one label is subsumed in the other. The semantic similarity $sim_{wordnet}(c_s, c_t) \in [0, 1]$ is based on the Wu and Palmer path length method [24], using WordNet [25] as an external resource, and tries to find the semantic similarity of the labels. Similarity for concepts in languages other than English can also be included if the proper lexical resource is available. Table 1 shows a summary of the similarity measures and their purpose.
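To illustrate the Wu and Palmer measure without depending on WordNet, the sketch below applies its usual definition, twice the depth of the deepest common ancestor divided by the sum of the depths of the two concepts, to a toy is-a hierarchy. The hierarchy and the convention that the root has depth 1 are assumptions of this example; the paper's process uses WordNet as the lexical resource.

```python
# Toy is-a hierarchy (child -> parent); hypothetical, not WordNet.
PARENT = {
    "person": "entity",
    "patient": "person",
    "clinician": "person",
    "document": "entity",
}


def path_to_root(concept):
    """Return [concept, parent, ..., root]."""
    path = [concept]
    while path[-1] in PARENT:
        path.append(PARENT[path[-1]])
    return path


def depth(concept):
    """Depth in the hierarchy, counting the root as depth 1."""
    return len(path_to_root(concept))


def wu_palmer(a, b):
    """Wu and Palmer similarity: 2 * depth(lcs) / (depth(a) + depth(b))."""
    ancestors_b = set(path_to_root(b))
    # The first ancestor of a that is also an ancestor of b is the deepest one.
    lcs = next((c for c in path_to_root(a) if c in ancestors_b), None)
    if lcs is None:
        return 0.0
    return 2 * depth(lcs) / (depth(a) + depth(b))


print(wu_palmer("patient", "clinician"))
```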
Table 1. Used matching techniques
In (3), $LCS(c_s, c_t)$ is a function that returns the longest subsequence common to the two concept labels $c_s$ and $c_t$, and $Length(string)$ returns the number of characters in string.
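The following sketch shows one way such a measure could be computed. Two points are assumptions, because equation (3) is not reproduced here: the common part is taken to be contiguous (a substring, following the name of the measure), and its length is normalized by the length of the shorter label, so that a label fully contained in another scores 1.

```python
def longest_common_substring(a, b):
    """Longest contiguous run of characters shared by a and b (case-insensitive)."""
    a, b = a.lower(), b.lower()
    best_len, best_end = 0, 0
    prev = [0] * (len(b) + 1)
    for i in range(1, len(a) + 1):
        curr = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            if a[i - 1] == b[j - 1]:
                # Extend the common run that ended at (i-1, j-1).
                curr[j] = prev[j - 1] + 1
                if curr[j] > best_len:
                    best_len, best_end = curr[j], i
        prev = curr
    return a[best_end - best_len:best_end]


def lcs_similarity(a, b):
    """Length of the common substring over the shorter label's length (assumed)."""
    if not a or not b:
        return 0.0
    return len(longest_common_substring(a, b)) / min(len(a), len(b))


print(longest_common_substring("birthyear", "birthDate"))
print(lcs_similarity("value", "valueString"))
```

Under this normalization, a generic label such as "value" obtains the maximum score against every target label that contains it, which is the kind of ambiguity the fuzzy aggregation is later meant to reduce.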
The max function in (1) returns the highest of these three similarity measures, $sim_0(c_s, c_t)$, which is used to determine the relation $R$ in the triple $\langle c_s, c_t, R \rangle$ of a mapping element according to the following criteria:
Equivalent (=): two concept elements $c_s$ and $c_t$ are equivalent iff $sim_0(c_s, c_t) \geq t_e$.
Related (⊆, ⊇, ∩): two concept elements $c_s$ and $c_t$ are related iff $0.9 > sim_0(c_s, c_t) \geq t_r$.
Independent (⊥): two concept elements $c_s$ and $c_t$ are independent iff $t_r > sim_0(c_s, c_t)$.
where $t_e = 0.9$ and $t_r = 0.6$.
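The criteria above can be sketched as follows, assuming that (1) is the plain maximum of the three scores, as the text describes. The function names and the input scores in the usage line are illustrative.

```python
T_E = 0.9  # equivalence threshold t_e from the text
T_R = 0.6  # relatedness threshold t_r from the text


def max_aggregation(sim_dice, d_lcs, sim_wordnet):
    """MAX aggregation: keep the highest of the three similarity scores."""
    return max(sim_dice, d_lcs, sim_wordnet)


def classify(sim0):
    """Map an aggregated similarity to the relation R."""
    if sim0 >= T_E:
        return "Equivalent"
    if sim0 >= T_R:
        return "Related"
    return "Independent"


# Hypothetical scores: a low Dice score is overridden by a full substring match.
print(classify(max_aggregation(0.25, 1.0, 0.4)))
```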
An approximate threshold, $\alpha$, is associated so that $\alpha = t_e = 0.9$. This means that in order to consider a concept to be equivalent (
The previously described similarity measures help to compare individual concepts and verify string similarity (Dice), label inclusion (longest substring) and label semantic equivalence (Wu and Palmer). Although the combined similarity measure (1) covers three aspects that help the process identify the best-suited concepts for alignment, many conflicts such as those described in [18] can occur when dealing with highly autonomous applications. Naming conflicts are frequent when the same label has different meanings in two models or different labels are used for the same concept. Applications also use complex types and different granularity to express the same concept, grouping or decomposing data. Labels in these cases are not easily recognized as meaningful; for example, a date decomposed into single integers with confusing descriptions like (y, m, d, m, s, f) for year, month, day, and so on. Lexical, structural and semantic similarity measures that do not consider internal structure constraints are not sufficient to deal with these conflicts.
Two-tier matching strategy
The process of concept alignment is based on a two-tier matching strategy that consists of two phases: element level matching and structure level matching.
Element level matching phase
In the element level matching phase, the concepts are directly compared to each other without considering the hierarchy structure and values. Given a concept $c_s$ of the source concept scheme $C_S$, the goal of element level matching is to find the best concept label $c_t$ from a set of candidate concepts for alignment in the target concept scheme $U$.
The mapping relations cannot be determined with the element level matching process alone, because the following outcomes are possible at this stage:
Case 1: Exactly one concept label of the target has sim = 1 or an equivalent relation. The best-suited concept is clearly defined, but the labels may be homonymous.
Case 2: More than one concept label of the target has sim = 1 or an equivalent relation, so in order to disambiguate which concept(s) in the target are best suited for alignment, context in the hierarchy must be considered.
Case 3: No concept label of the target is classified as equivalent and the best score is related.
Case 4: No concept label of the target is classified as equivalent and the best score is independent.
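The four outcomes can be told apart from the aggregated scores of one source concept against all target candidates. The sketch below assumes the thresholds $t_e = 0.9$ and $t_r = 0.6$ given earlier; the function name and the candidate labels are hypothetical.

```python
T_E = 0.9  # equivalence threshold
T_R = 0.6  # relatedness threshold


def element_level_case(scores):
    """Return the outcome case (1 to 4) for one source concept.

    `scores` maps each target label to its aggregated similarity.
    """
    equivalents = [label for label, s in scores.items() if s >= T_E]
    if len(equivalents) == 1:
        return 1  # a single equivalent candidate
    if len(equivalents) > 1:
        return 2  # several equivalent candidates: context is needed
    best = max(scores.values(), default=0.0)
    return 3 if best >= T_R else 4  # best score is related / independent


# Hypothetical candidates that all contain the generic substring "value".
print(element_level_case({"valueString": 1.0, "valueCode": 1.0, "gender": 0.2}))
```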
Structure level matching phase
The goal of the structure level matching step is to disambiguate the meaning of a word by analyzing its context, that is, the structure and meaning of the neighbor concepts in the same source document.
The ultimate goal of this process is to determine the best mappings between the concept c s of the source concept scheme C S and the best concept c t b from the set of labels C T of the target concept scheme U. From this phase, decision recommendations can be obtained for the inclusion of new concepts, sub-collections and collections in the ubiquitous user model concept scheme allowing it to evolve over time. The structure level matching provides reasoning on structure in order to verify or decline the results obtained in the element level matching phase. In this phase, the context of each cs of the source is analyzed. The sets of neighbors in the source include ancestors and siblings, and the set of neighbors in the target include the concepts of the most related collection in the ontology.
In this step, the similarity between the neighbor concepts in the source and the neighbors of the best-suited concept(s) in the target are calculated. After this step, a set of

Figure 2. Two-tier matching strategy of the process of concept alignment.
Uncertainty is always present in the schema matching process, as we stated in Section 3.2. A process of concept alignment must handle, or at least deal with, the uncertainty of schema matching. Uncertainty management in schema matching is commonly done with probability theory or fuzzy set theory. In this work we propose a fuzzy-based process of concept alignment.
As we described in Section 4.3.1, the element level matching process has four possible outcomes. In this work we are particularly interested in Case 2. The case in which several candidates share the same highest similarity as the outcome of the element level matching process is very frequent. From (1), we can deduce that if several concept labels share the same commonly used substring (e.g. “value”, “text”, “name”, “type”), the outcome will be Case 2. It can also be the case that one or more concept labels are identical but used with different meanings. This ambiguity causes uncertainty that the structure level matching does not always resolve correctly.
The main idea is to aggregate the outcome similarities of the three techniques presented in Section 4.3.1 in a different manner. Instead of using (1), which uses the max function to return the highest similarity, we use fuzzy combined similarity measures. To this end, we model the uncertainty based on the fuzzy set theory proposed in [4, 20]. In [4], Gal et al. demonstrate how fuzzy theory can be used to model the uncertainty of attribute correspondences and schema matching.
In the process of concept alignment, as we explained above, the similarity between each concept in the source and each concept in the target is calculated. Each matching technique yields a similarity matrix. Later, these similarity matrices must be combined to determine a final similarity matrix. Gal et al. [4] state that the similarity matrix is sufficient to represent the uncertainty involved in the matching process.
Handling uncertainty with fuzzy combined similarity measures
The calculation of the similarities between two concepts is the primary operation in the process of concept alignment. It supports schema matching for the purpose of data integration, sharing and reusing. Recently, several matchers using string, structural and/or semantic similarity techniques have been combined to increase the quality of mappings. As we described in the previous section, we can model the uncertainty involved in the process of concept alignment with the similarity matrix. Our proposal to model uncertainty is presented below.
Concept correspondences and similarity matrices
Let A be a mapping element defined as a triple $\langle c_s, c_t, R \rangle$, where $c_s$ is a source concept, $c_t$ is a target concept, and $R$ is a semantic relation (e.g., Equivalent (=); Related (⊆, ⊇, ∩); Independent (⊥)) holding between the entities $c_s$ and $c_t$.
The matching operation determines the alignment A′ for the pair of schemas X and U where X can be any concept scheme constructed from a profile supplier/consumer document and U will always be the ubiquitous user model concept scheme.
The model of uncertainty in schema matching described below is based on the work of [4].
Let X and U be schemas with n and n′ concepts. Let S = X × U be the set of all possible concept correspondences between X and U. S is a set of concept pairs ($c_s$, $c_t$). Let M (X, U) be an n × n′ similarity matrix over S, where $M_{i,j}$ represents the degree of similarity between the i-th concept of X and the j-th concept of U. $M_{i,j}$ is a real number in [0, 1].
As described in Section 4.3.1, our process of concept alignment uses a combination of matchers built on matching techniques. Given schemas X and U, M (X, U) is the set of three similarity matrices M (X, U). $S_M$ is a mapping that transforms, in this case, three similarity matrices into another similarity matrix.
The inputs of $S_M$ are the three similarity matrices obtained by the matching techniques summarized in Table 1.
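A minimal sketch of these two notions, using plain nested lists: one function builds an n × n′ similarity matrix from any pairwise matcher, and another combines several matrices element by element. The function names, the placeholder exact-match matcher and the numeric values are assumptions for illustration; passing `max` as the aggregation function corresponds to the MAX aggregation, and a fuzzy aggregation function could be passed in the same way.

```python
def similarity_matrix(source, target, matcher):
    """Build the n x n' matrix of matcher scores between source and target labels."""
    return [[matcher(cs, ct) for ct in target] for cs in source]


def combine(matrices, aggregate):
    """Apply `aggregate` element-wise to matrices of identical shape."""
    rows, cols = len(matrices[0]), len(matrices[0][0])
    return [
        [aggregate(*(m[i][j] for m in matrices)) for j in range(cols)]
        for i in range(rows)
    ]


def exact_match(a, b):
    """Placeholder matcher: 1.0 for identical labels (ignoring case), else 0.0."""
    return 1.0 if a.lower() == b.lower() else 0.0


m_exact = similarity_matrix(["gender", "city"], ["Gender", "address"], exact_match)

# Hypothetical outputs of three matchers for one source and two target concepts.
m_dice = [[0.2, 0.9]]
m_lcs = [[0.5, 0.1]]
m_wordnet = [[0.3, 0.95]]
print(combine([m_dice, m_lcs, m_wordnet], max))
```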
In the next section, we present our proposed fuzzy based aggregation approach to determine the combined similarity matrix as described above.
Fuzzy-Based similarity aggregation
Equation (1) is used to aggregate the outcome similarities of these three techniques in the process of concept alignment; from now on we call it MAX aggregation for short. In this section we present the fuzzy aggregation used to map the outcomes of string and semantic matchers while handling uncertainty. It consists of three main steps: (i) fuzzification, (ii) fuzzy inference and (iii) defuzzification. These steps are defined as follows.
The fuzzification step receives as inputs the three similarity measures described above: Dice similarity ($S_D$), longest common substring ($S_C$), and WordNet similarity ($S_W$). Each input is mapped to a set of fuzzy sets $A_i$ using the membership function representation $\mu_{A_i}(S) \in [0, 1]$, which expresses the degree of belonging of the input $S$ to the fuzzy set $A_i$, as shown in (4).
Particularly, we define each of these similarity measures to be partitioned in three fuzzy sets: low, medium and high, based on the conditions of aggregation explained in Section 4.3.1. Figure 3 shows the input membership functions of Dice similarity (S D ), longest common substring (S C ) and WordNet similarity (S W ). We implemented these linguistic terms with trapezoidal membership functions with standard four parameters as summarized in Table 2.
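The fuzzification step can be sketched with the standard four-parameter trapezoid (a, b, c, d): membership rises from a to b, equals 1 between b and c, and falls from c to d. The breakpoints below are placeholders chosen for illustration; they are not the values of Table 2.

```python
def trapezoid(x, a, b, c, d):
    """Standard trapezoidal membership function with a <= b <= c <= d."""
    if x < a or x > d:
        return 0.0
    if b <= x <= c:
        return 1.0
    if x < b:
        return (x - a) / (b - a)  # rising edge
    return (d - x) / (d - c)      # falling edge


# Hypothetical breakpoints for the three linguistic terms.
INPUT_SETS = {
    "low": (0.0, 0.0, 0.3, 0.5),
    "medium": (0.3, 0.5, 0.6, 0.8),
    "high": (0.6, 0.9, 1.0, 1.0),
}


def fuzzify(similarity):
    """Degree of membership of a similarity score in each linguistic term."""
    return {term: trapezoid(similarity, *p) for term, p in INPUT_SETS.items()}


print(fuzzify(0.7))
```

With these placeholder breakpoints, a score of 0.7 belongs partly to medium and partly to high, and not at all to low.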

Figure 3. Input trapezoidal membership function definitions: ($S_D$) Dice similarity, ($S_C$) longest common substring, and ($S_W$) WordNet similarity.
Table 2. Parameters used in the standardized membership functions
Then, the fuzzy inference step receives the input membership values and performs an inference operation in the fuzzy space in order to calculate the fuzzy value $\mu(C_S)$ of the output combined similarity ($C_S$). A set of fuzzy rules is defined to support the inference operation. The $k$-th fuzzy rule $R_k$ is denoted as (5), where $X_i \in \{low, medium, high\}$ and $Y_j \in \{Independent, Related, Equivalent\}$ represent the input (trapezoidal) and output (Gaussian) membership functions respectively, and the symbol ∧ represents the T-norm.
For this work, the min function is used as the T-norm. In addition, the Mamdani method [26] is used in this approach given that the rule base is easier to design, interpret and debug using linguistic variables.
Lastly, the defuzzification step calculates the crisp output value $C_S$. It uses the max function as the T-conorm to accumulate the activated terms, and the centroid method denoted in (6), where $C_{S,i}$ represents the $i$-th $C_S$ value and $\mu(C_{S,i})$ is the membership function of that value.
For our approach, we determine the linguistic term sets as: Independent, Related and Equivalent. Figure 4 shows the output membership functions defined for the combined similarity (C S ). Table 2 summarizes the parameters used in the standardized Gaussian membership functions (mean and standard deviation) employed in the fuzzy system.
Table 3 shows the fuzzy rules implemented that focus on giving less relevance to the longest common substring matching technique in order to disambiguate and correctly select exact matches when several candidates for the match share the same substring.
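The three steps can be put together in a compact Mamdani-style sketch: trapezoidal inputs, min as the T-norm, max as the T-conorm, Gaussian output terms and a discretized centroid. All numeric parameters and the small rule base below are placeholders invented for illustration; they are not the contents of Table 2 or Table 3, and the rule base is deliberately incomplete. The rules only mimic the stated intent of giving less weight to the longest common substring.

```python
import math


def trapezoid(x, a, b, c, d):
    if x < a or x > d:
        return 0.0
    if b <= x <= c:
        return 1.0
    return (x - a) / (b - a) if x < b else (d - x) / (d - c)


def gaussian(x, mean, sigma):
    return math.exp(-0.5 * ((x - mean) / sigma) ** 2)


# Hypothetical membership parameters.
INPUT_SETS = {"low": (0.0, 0.0, 0.3, 0.5),
              "medium": (0.3, 0.5, 0.6, 0.8),
              "high": (0.6, 0.9, 1.0, 1.0)}
OUTPUT_SETS = {"Independent": (0.0, 0.1),
               "Related": (0.5, 0.1),
               "Equivalent": (1.0, 0.1)}

# Hypothetical rules: (Dice term, LCS term, WordNet term) -> output term.
# None means the input is ignored by that rule.
RULES = [
    (("high", None, None), "Equivalent"),
    ((None, None, "high"), "Equivalent"),
    (("medium", None, "medium"), "Related"),
    (("low", "high", "low"), "Related"),  # a shared substring alone is weak evidence
    (("low", "low", "low"), "Independent"),
]


def fuzzify(x):
    return {term: trapezoid(x, *p) for term, p in INPUT_SETS.items()}


def combined_similarity(s_d, s_c, s_w, steps=200):
    inputs = [fuzzify(s_d), fuzzify(s_c), fuzzify(s_w)]
    strength = {term: 0.0 for term in OUTPUT_SETS}
    for antecedent, consequent in RULES:
        # Rule firing strength: min (T-norm) over the inputs the rule uses.
        firing = min(mu[t] for mu, t in zip(inputs, antecedent) if t is not None)
        # Accumulate per output term with max (T-conorm).
        strength[consequent] = max(strength[consequent], firing)
    num = den = 0.0
    for i in range(steps + 1):
        y = i / steps
        # Clip each output term at its strength, then take the max envelope.
        mu_y = max(min(strength[t], gaussian(y, *OUTPUT_SETS[t])) for t in OUTPUT_SETS)
        num += mu_y * y
        den += mu_y
    return num / den if den else 0.0  # centroid; 0.0 if no rule fired


print(combined_similarity(0.1, 1.0, 0.1))  # substring match only
print(combined_similarity(1.0, 1.0, 1.0))  # all matchers agree
```

In this toy configuration, a candidate supported only by a full substring match lands around the Related term, whereas the MAX aggregation would have scored it as 1.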

Figure 4. Output Gaussian membership function definitions: ($C_S$) combined similarity.
Table 3. Fuzzy rules base
In order to measure the efficiency and effectiveness of the matching/mapping systems, different metrics have been proposed in the literature [27].
In this work, the evaluation of the process of concept alignment focuses on: (i) the human effort required by the mapping designer to verify the correctness of the mappings, which is quantified with the metric overall; and (ii) the quality of the generated mappings, quantifying the proximity of the results generated by the process of concept alignment to those expected, with three known metrics: precision, recall, and f-measure [27]. These metrics provide a partial measure of the effectiveness of the process.
These metrics are based on the notions of true positives (TP), false positives (FP), true negatives (TN) and false negatives (FN).
A human expert provided a list of expected matches (Table 4) and evaluated the outcomes, deciding whether the semantic mapping relations found were correct and the recommendations made sense. Exact match relations correctly found by the process and good recommendations for concept or collection addition were considered TP. Wrong exact matches were listed as FP. When the process did not find a relevant exact match, a concept was improperly discarded, or a wrong recommendation was made, it was registered as FN. Properly discarded concepts were recorded as TN.
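For reference, the four metrics can be computed from these counts as sketched below. The formulas are the definitions commonly used in the schema matching literature, with overall taken as recall × (2 − 1/precision), which simplifies to (TP − FP)/(TP + FN); we assume this is the definition intended by [27]. The counts in the usage line are hypothetical and are not results from the experiments.

```python
def precision(tp, fp):
    return tp / (tp + fp) if tp + fp else 0.0


def recall(tp, fn):
    return tp / (tp + fn) if tp + fn else 0.0


def f_measure(tp, fp, fn):
    p, r = precision(tp, fp), recall(tp, fn)
    return 2 * p * r / (p + r) if p + r else 0.0


def overall(tp, fp, fn):
    """Recall * (2 - 1/precision), written as (TP - FP) / (TP + FN).

    It becomes negative when wrong matches outnumber correct ones, meaning
    that correcting the output costs more than matching by hand.
    """
    return (tp - fp) / (tp + fn) if tp + fn else 0.0


# Hypothetical counts for illustration only.
tp, fp, fn = 6, 1, 1
print(precision(tp, fp), recall(tp, fn), f_measure(tp, fp, fn), overall(tp, fp, fn))
```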
Table 4. Expected outcomes of the matching process
As we pointed out in Section 2, there are many standards for EHR content and there is no international consensus on their adoption. Efforts are needed to encourage global standard adoption and conversion between major standards. In Section 3, we also recognized that uncertainty is present in health information and that it is inherent and unavoidable in the schema matching process needed for data integration. The process of data integration of electronic health records must handle, or at least deal with, uncertainty.
Experiments
The goal of the experiments is to assess the performance of our fuzzy-based process of concept alignment evaluating the quality of the generated mappings. It is also important to determine the contribution of the fuzzy combined similarity measures to reduce human effort to rectify the mappings found by the automatic process. To this end, we tested the interoperability between EHRs through the mediation of ubiquitous user model.
We enhanced the ubiquitous user model interoperability ontology (UMIO) with the Fast Healthcare Interoperability Resources (FHIR) Specification, Patient-example.xml [28]. Subsequently, we tried to integrate data from the Microsoft HealthVault personal health record management system. Basic Demographic Information [29] concepts were aligned with FHIR patient concepts to enable interoperability between these two major standards. The Basic Demographic schema has eight concepts to be aligned with the FHIR patient schema, which has 62 concepts. This matching task is a good example for handling uncertainty: semantic meaning is difficult to interpret because the schemas exhibit naming uncertainty and contain concepts of different granularity.
Two experiments were executed to compare the performance of the process of concept alignment with the fuzzy-based process. The first experiment was performed with MAX similarity aggregation, see (1). The fuzzy combined similarity measure was considered in the second experiment. It is important to emphasize that in this work, we consider two cases of matching cardinality: 1 : 1 and 1 : N. This means that the overall match result may relate one concept of the source schema to one or more elements of the target schema. Several matches were allowed to identify how ambiguity affects the outcomes.
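To make the effect of the aggregation choice and of 1 : N cardinality concrete, the following sketch contrasts MAX aggregation with a weighted average that gives less weight to the longest common substring measure. Several things here are assumptions: the three similarity functions, their normalizations, the weights, the threshold, and the labels are all hypothetical, and the weighted average is only a stand-in for illustration. It is not the fuzzy combined similarity measure used in the second experiment, and it does not reproduce equation (1).

```python
from difflib import SequenceMatcher


def lcs_similarity(a: str, b: str) -> float:
    """Longest common substring length, normalized by the shorter label (assumed)."""
    if not a or not b:
        return 0.0
    match = SequenceMatcher(None, a, b, autojunk=False).find_longest_match(
        0, len(a), 0, len(b)
    )
    return match.size / min(len(a), len(b))


def ratio_similarity(a: str, b: str) -> float:
    """Character-level similarity ratio from difflib (assumed second technique)."""
    return SequenceMatcher(None, a, b, autojunk=False).ratio()


def token_similarity(a: str, b: str) -> float:
    """Jaccard overlap of underscore-separated tokens (assumed third technique)."""
    ta, tb = set(a.lower().split("_")), set(b.lower().split("_"))
    return len(ta & tb) / len(ta | tb)


def max_aggregation(a: str, b: str) -> float:
    # MAX aggregation: the most optimistic measure decides the score.
    return max(lcs_similarity(a, b), ratio_similarity(a, b), token_similarity(a, b))


def weighted_aggregation(a: str, b: str, weights=(0.2, 0.4, 0.4)) -> float:
    # Stand-in aggregation (NOT the paper's fuzzy measure): the longest
    # common substring measure receives the smallest weight.
    w_lcs, w_ratio, w_tok = weights
    return (
        w_lcs * lcs_similarity(a, b)
        + w_ratio * ratio_similarity(a, b)
        + w_tok * token_similarity(a, b)
    )


def candidates(source: str, targets: list, aggregate, threshold: float = 0.65) -> list:
    """1 : N matching: keep every target whose aggregated score reaches the threshold."""
    return [t for t in targets if aggregate(source, t) >= threshold]


# Hypothetical labels, chosen only to show the shared-substring ambiguity.
targets = ["country", "text", "birthDate"]
max_hits = candidates("country_text", targets, max_aggregation)
weighted_hits = candidates("country_text", targets, weighted_aggregation)
```

With these toy labels, MAX aggregation keeps both `country` and `text`, because each is fully contained in the source label and the substring measure alone reaches 1.0. The weighted stand-in keeps only `country`. This mirrors, in a simplified way, how allowing several matches exposes the ambiguity that an expert would then have to rectify.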
In order to evaluate the outcomes of the experiments, human experts provided the acceptable expected outcomes of the matching process shown in Table 4. As we described in Section 4.2, each concept $c_s$ in the set of source concepts $C_S$ is defined by a label
The possible outcomes for each concept will be compared to the resulting outcomes of the processes of concept alignment to determine its correctness. When a concept has two outcomes, both of them are considered correct but it is preferable to gain a mapping that enables reuse and sharing: add recommendation with
Results
Case 1: Interoperability between EHRs through the mediation of ubiquitous user model using MAX aggregation
The results of the process of concept alignment between the HL7 schema and the Basic HealthVault PHR are presented in the confusion matrix of Table 6. For this experiment, the MAX function was used for similarity aggregation. It is important to notice that Table 5 shows more than eight outcomes because some concepts in the source were mapped to more than one concept in the target. The eight concepts in the source were matched to 20 possible concepts. More details are provided in Table 5 and discussed below. A human expert must rectify each of these relations, and this effort must therefore be considered.
Concept alignment outcomes of MAX and Fuzzy based processes. Bold values represent wrong mappings
Resulting confusion matrix of the matching process between HL7 and Basic MHVault using MAX similarity aggregations
Comparing the expected outcomes from Table 4 with the outcomes of MAX similarity aggregation presented in Table 5, we can see that only two concepts in the source actually have a corresponding exact match in the target schema: gender and postcode. Of these, only the exact match for gender was found.
The concepts birthyear and country_text were related to incorrect concepts. In particular, the MAX aggregation causes all concepts containing the substring
A similar problem occurred with the country_value, country_family, and country_type concepts, since they were also related through wrong exact matches to concepts in the target that contain the substring
The country_version concept was correctly discarded.
Case 2: Interoperability between EHRs through the mediation of ubiquitous user model using fuzzy combined similarity measures
The fuzzy-based process of concept alignment was used to find the mappings between the HL7 schema and the Basic HealthVault PHR. The results are presented in the confusion matrix of Table 7. We can also compare the expected outcomes in Table 4 with the outcomes of the process with fuzzy similarity aggregation shown in Table 5. As we can see, the fuzzy-based process improved the outcomes significantly, given that only ten mappings were found (wrong mappings are presented in bold in Table 5). This implies much less effort for an expert rectifying these relations.
Resulting confusion matrix of the matching process between HL7 and Basic MHVault using Fuzzy similarity aggregation
Only one
Correct addition recommendations to the most related collection were made by the process establishing exact matches for the concepts birthyear, country_text, and country_value. In particular the fuzzy-based process was able to find the correct recommendation in spite of the ambiguity of all concepts that include the substring
Unfortunately, the fuzzy aggregation was not able to find the correct recommendations for the concepts country_family and country_type. However, the latter concept was only related to two out of four concepts in the target with the substring
The country_version concept was again correctly discarded. A wrong exact match was found for country_postcode (FP).
The efficiency and effectiveness of the processes of concept alignment for MAX aggregation and fuzzy aggregation are shown in Table 8.
Efficiency and effectiveness measuring results
The schema matching application example of FHIR vs Basic Demographic is particularly difficult because many labels have little meaning or contain commonly used substring (
On the other hand, the fuzzy process of concept alignment found the right recommendations for the concepts birthyear, country_text, and country_value. This process did not make the mistakes made by the MAX aggregation. The fuzzy combined similarity enables a smoother aggregation that disambiguates several homonyms.
The results in Table 8 show poor performance for both processes of concept alignment. Nevertheless, it is important to notice that a complex example was chosen for these experiments. In the element-level phase of the process of concept alignment with MAX aggregation, clearly wrong candidates for alignment were selected. These wrong choices propagate through the process and cause precision and recall to fall.
Although the precision of the fuzzy-based process is not high, it shows a clear improvement over the first experiment. Recall, on the other hand, improves even more markedly. The overall metric is at most 1 and can take negative values. “The greater the overall value is, the less effort the designer has to provide” [27]. When precision is less than 50%, overall is negative. The results reported in Table 8 show negative overall values for both processes. Even so, the fuzzy combined measure shows a significant improvement in comparison with MAX aggregation.
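The link between precision below 50% and a negative overall value can be made explicit. The derivation below is a sketch that assumes the usual definition of overall in terms of precision and recall, which is not restated in this section, and assumes that TP > 0 so that precision is non-zero.

```latex
\begin{align*}
\mathrm{Overall}
  &= \mathrm{Recall}\cdot\left(2 - \frac{1}{\mathrm{Precision}}\right)
   = \frac{TP}{TP + FN}\cdot\left(2 - \frac{TP + FP}{TP}\right)
   = \frac{TP - FP}{TP + FN}, \\[4pt]
\mathrm{Overall} < 0
  &\iff TP < FP
   \iff \mathrm{Precision} = \frac{TP}{TP + FP} < \frac{1}{2}.
\end{align*}
```

Under this assumption, overall can never exceed 1, since TP − FP ≤ TP + FN, and it becomes negative exactly when wrong exact matches outnumber correct outcomes, that is, when removing false positives would cost the designer more than the correct mappings save.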
We presented a fuzzy-based process of concept alignment to find semantic mappings between heterogeneous health stakeholders through the mediation of the ubiquitous user model framework. Uncertainty management in schema matching is commonly done with probability theory or fuzzy set theory; in this work, we followed the fuzzy approach. Two experiments were designed to compare the performance of the processes of concept alignment: a) using the MAX function for the aggregation of three similarity techniques, and b) using the fuzzy combined similarity measure. Although the efficiency and effectiveness results were low, the fuzzy combined similarities significantly improved the quality of the generated mappings and diminished the human effort needed to rectify the automatic mappings. The key idea is to give less relevance to the longest common substring matching technique in order to disambiguate and select the correct exact matches when several candidates for the match share the same substring. From this point of view, the results look promising and fuzzy theory proved to be a good fit for modeling uncertain schema matching. Fuzzy combined similarities can handle uncertainty in the schema matching process to enable interoperability between EHRs.
For future work, we are refining the fuzzy-based concept alignment. We are trying to implement multidimensional fuzzy sets to model uncertainty in schema matching.
Acknowledgment
The authors declare that there is no conflict of interest regarding the publication of this paper.
