Modeling fuzzy relational database in HBase

Abstract

With the rapid growth of massive data, a large number of business applications have begun to seek effective and scalable frameworks for data storage and processing. Against this background, emerging big data technologies, such as Hadoop-based systems that use the scalable distributed storage system HBase, have become available. Since most business data are currently stored in relational databases, and imprecise and uncertain information is widespread in real-world applications, there is growing interest in managing large-scale fuzzy relational data on Hadoop-based platforms. This paper concentrates on fuzzy information modeling in HBase. In particular, we investigate the formal transformation from the fuzzy relational data model to the HBase model and develop a set of mapping rules to assist in the transformation process. In addition, we present a generic approach to transform the fuzzy relational algebra into the fuzzy HBase manipulation language.

Keywords

HBase, fuzzy relational data, modeling, mapping

1 Introduction

The relational database model has proven to be very useful in database management systems and has been widely applied in business applications [1, 2]. The relational model is inherently effective for precise and unambiguous data, but in the real world an application often involves imperfect information, not just because of the unreliability of its source, but also because of its nature [3, 4]. For example, in many domains, such as environmental surveillance, market analysis and quantitative economics research [5], it is difficult to state all information with one hundred percent certainty. For this reason, there is a need for relational data models that can cope with imperfect information. Modeling fuzziness in data makes sense in cases where the nature of the information is uncertain and complete information is often not available [3, 6–9]. In order to manage fuzzy data in a relational database, possibility theory has been extensively applied to extend the database model, with numerous results [10–12].

In earlier business applications, the average size of a corporate database tended to be in the range of gigabytes (GB). With the advent of the big data era [13, 14], databases of multiple terabytes (TB) or even petabytes (PB) have become normal. For example, Facebook receives about 12 TB of compressed digital data per day [15]. As databases grow increasingly large, how to effectively store and process large-scale data has become a topic of growing attention [16–18]. Much research has focused on big data management in recent years, and various column-based storage systems such as Bigtable [19], Dynamo [20], etc., have been developed to tackle the problems posed by the storage of and access to huge amounts of data. Currently, HBase, an open-source, reliable and scalable database system designed for large-scale distributed data storage and high-performance computation, has attracted much attention in both academia and industry. Built with a distributed architecture on top of the Hadoop distributed file system, HBase can achieve high data availability, scale to billions of rows and millions of columns, and support fast real-time access to structured data [21].

In order to process large-scale data effectively, database administrators have to face the challenge of ensuring that their databases can easily communicate with column-based distributed databases (e.g., HBase). As such, the demands of applications equipped with scalable data storage systems are becoming increasingly significant. Considering that most business data are currently stored in relational databases, there is growing interest in managing large-scale uncertain relational data on column-based distributed database platforms. In [26], Vilaca et al. introduced a mapping approach from a relational table to an HBase table, and developed a distributed query engine for running SQL queries on top of a NoSQL datastore. In [23], mapping mechanisms from existing relational databases to column-based databases are proposed. Based on the proposed mechanisms, search engines to verify the translations are also provided [13, 22]. Unfortunately, although fuzzy values have been employed to model and handle imperfect information in relational databases, and a large number of existing fuzzy relational databases (FRDB) may be classified as legacy, relatively little work has been carried out on extending column-based databases towards the representation of fuzzy concepts. In particular, the reengineering of fuzzy relational databases in column-based databases, especially in HBase, remains unexplored. Yet there exist many legacy fuzzy relational databases that need modernization in order to remain compatible and competitive in the era of big data.

To bridge the gap between fuzzy relational and column-based databases, in this paper we take a first step toward consolidating fuzzy data management in HBase and study how to reengineer existing fuzzy relational databases in HBase. In particular, we are interested in finding the HBase schema that effectively describes an existing fuzzy relational schema. To accomplish this, we introduce the formal transformation from the fuzzy relational model to the HBase model, and develop a set of rules to assist in the transformation process. To allow for better and platform-independent sharing of data stored in fuzzy relational formats, we also present a generic approach to translate the fuzzy relational algebra into the fuzzy HBase manipulation language. To the best of our knowledge, this is the first effort on reengineering fuzzy relational databases in column-based databases.

The rest of the paper is organized as follows. Section 2 gives the preliminaries of the relational and HBase models. The formal definitions of the fuzzy relational and fuzzy HBase models are given in Section 3. The transformation from the fuzzy relational model to the HBase model is presented in Section 4. Section 5 introduces the principles for transforming the fuzzy relational algebra into the fuzzy HBase manipulation language, and Section 6 concludes the paper.

2 Basic knowledge

In this section, we will introduce the basic concepts of the relational database model and the Hadoop-based database model.

2.1 The relational database model

In the relational model of a database, all data are represented in terms of tuples, grouped into relations [6]. A relation can be viewed as a table with rows and columns, where each column corresponds to an attribute (a feature extracted from real-world things) and each row corresponds to a tuple, which represents a data object. In relational tables, there is a unique key for each row. A domain describes the finite set of possible values for a given attribute, and every value is atomic, i.e., the minimum meaningful unit of data. If the value of an attribute, or the values of a group of attributes, in a relation can uniquely identify a tuple among all other tuples, the attribute or attribute group is called a super key of the relation. A primary key uniquely identifies a tuple within a table, and a foreign key is a field in a relational table that matches the primary key column of another table and can be used to cross-reference tables. Constraints in relational databases restrict the data that can be stored in relations. They can apply to single attributes, to a tuple or to an entire relation. As introduced in [20], there are domain integrity constraints, entity integrity constraints, referential integrity constraints, etc.

The relational database model provides algebraic operations [20] as a basis for database manipulation languages. The primitive algebraic operators are set union (∪), set intersection (∩), set difference (–), Cartesian product (×), selection (σ), projection (π) and join (∞). For set union and set difference, the two relations involved must be union-compatible, i.e., they must have the same set of attributes. Because set intersection can be defined in terms of set difference, the two relations involved in a set intersection must also be union-compatible. In particular, let r and s be two union-compatible relations on the scheme R (A₁, A₂, …, A_n),

for the set union, we have r ∪ s = {t|t ∈ r ∨ t ∈ s}

for the set intersection, we have r ∩ s = {t|t ∈ r ∧ t ∈ s}

for the set difference, we have r - s = {t|t ∈ r ∧ t ∉ s}

The Cartesian product is a binary operation on relations. Let r and s be two relations on the schemas R and S respectively; then r × s = {t (R ∪ S) | t [R] ∈ r ∧ t [S] ∈ s}, i.e., the result of r × s is a relation on the schema R ∪ S, where each tuple is a combination of a tuple from r and a tuple from s.

The selection operation extracts from a table the tuples whose specified attribute values satisfy a given condition, and returns them as a new table. In particular, the selection of r based on a selection condition P specified by a Boolean expression is defined as follows: σ_P (r) = {t | t ∈ r ∧ P (t)}.

A projection is a unary operation written as π_S (r), where S is a set of attribute names. The result of π_S (r) is a relation on the schema S that includes only the columns of the relational table r that are given in S. In particular, let r be a relation on the scheme R (A₁, A₂, …, A_n); the projection of r over the attribute subset S (S ⊂ R) is defined as follows: π_S (r) = {t (S) | (∀ x) (x ∈ r (S) ∧ t = x [S])}.

The join operation is a binary operation on two relations. Let r(R) and s(S) be any two relations, and let p be a conditional predicate of the form AθB, where θ ∈ {<, =, >, ≤, ≥, ≠}, A ∈ R and B ∈ S. Then r $\underset{p}{\infty}$ s = {t (R ∪ S) | t [R] ∈ r ∧ t [S] ∈ s ∧ p (t [R], t [S])}, or equivalently r $\underset{p}{\infty}$ s = σ_p (r × s). The result of r $\underset{p}{\infty}$ s is a relation on the schema R ∪ S, where each tuple is a combination of a related tuple from r and a related tuple from s (the two combined tuples must satisfy the given condition p).
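The selection, projection and join operators can be illustrated with a minimal Python sketch. It assumes a relation is held in memory as a list of dicts (attribute name to atomic value); the helper names `select`, `project` and `theta_join`, the attribute `emp_sid` and the sample rows are hypothetical and serve only as illustration.

```python
from typing import Callable

Row = dict  # one tuple: attribute name -> atomic value


def select(r: list[Row], p: Callable[[Row], bool]) -> list[Row]:
    """sigma_P(r): keep the tuples of r that satisfy the predicate p."""
    return [t for t in r if p(t)]


def project(r: list[Row], attrs: list[str]) -> list[Row]:
    """pi_S(r): keep only the attributes in attrs and drop duplicate tuples."""
    seen, out = set(), []
    for t in r:
        key = tuple(t[a] for a in attrs)
        if key not in seen:
            seen.add(key)
            out.append(dict(zip(attrs, key)))
    return out


def theta_join(r: list[Row], s: list[Row],
               p: Callable[[Row, Row], bool]) -> list[Row]:
    """Join as sigma_p(r x s); attribute names of r and s are assumed disjoint."""
    return [{**tr, **ts} for tr in r for ts in s if p(tr, ts)]


# Hypothetical sample data.
school = [{"sid": "s1", "sname": "north"}, {"sid": "s2", "sname": "south"}]
employ = [{"emp_sid": "s1", "eid": "e1"}, {"emp_sid": "s1", "eid": "e2"}]

joined = theta_join(school, employ, lambda a, b: a["sid"] == b["emp_sid"])
names = project(joined, ["sname"])
```

Here `joined` pairs each school with its employment records, and `project` removes the duplicate school name that the join produces.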

2.2 The Hadoop-based database model

Hadoop is an open-source, reliable and scalable architecture for large-scale distributed data storage and high-performance computation. The Hadoop-based database model contains three main parts: Hadoop core, HBase and Hive [15]. Hadoop core consists of the Hadoop distributed file system (HDFS) for data storage and the map-reduce framework for data processing; HBase is a column-oriented database for storing massive data sets; and Hive provides a set of HQL interfaces used to store and query data in HBase (by calling the map-reduce framework).

From a logical point of view, data in HBase are organized in labeled tables [27]. HBase tables are made up of several HDFS files and blocks, each of which is replicated by Hadoop. HBase automatically partitions tables horizontally into regions. Each HBase table is stored as a multidimensional sparse map with rows and columns, each row having a sorting key and an arbitrary number of columns. Table cells are versioned; by default, the version is a timestamp auto-assigned by HBase at the time of cell insertion. Versions are stored in descending timestamp order, and each column can have several versions for the same row key. Each cell is tagged by column family and column name, so programs can identify what type of data item a given cell contains. A cell’s content is an uninterpreted array of bytes that is uniquely identified by “Table + Row-Key + Column-Family:Column + Timestamp”. Table rows are sorted by row key, which is also a byte array and serves as the table’s primary key. All table accesses are via the table primary key, and any scan of an HBase table results in a map-reduce job. Table 1 is an example of an HBase table. A request for the value of “cf1:c1” in the row “r1” with no timestamp specified returns the value with the most recent timestamp, t3, that is, r1cf1c1v2.
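The logical view described above (a sparse, sorted, versioned map) can be mimicked with nested Python dicts. This is only an in-memory sketch and does not use any HBase client API; the functions `put`, `get` and `scan`, the integer timestamps, and the older cell value are assumptions made for illustration.

```python
# row key -> "family:qualifier" -> timestamp -> value
Table = dict[bytes, dict[str, dict[int, bytes]]]


def put(table: Table, row: bytes, column: str, ts: int, value: bytes) -> None:
    """Store one versioned cell; absent cells take no space (sparse map)."""
    table.setdefault(row, {}).setdefault(column, {})[ts] = value


def get(table: Table, row: bytes, column: str, ts=None):
    """Return the cell at timestamp ts, or the newest version if ts is None."""
    versions = table.get(row, {}).get(column, {})
    if not versions:
        return None
    if ts is None:
        ts = max(versions)  # newest version wins by default
    return versions.get(ts)


def scan(table: Table):
    """Yield rows in ascending row-key order, as a table scan would."""
    for row in sorted(table):
        yield row, table[row]


table: Table = {}
put(table, b"r1", "cf1:c1", 1, b"r1cf1c1v1")  # hypothetical older version
put(table, b"r1", "cf1:c1", 3, b"r1cf1c1v2")  # version at t3, as in the text
latest = get(table, b"r1", "cf1:c1")
```

With no timestamp given, `get` returns the version stored at the highest timestamp, which mirrors the default read behaviour described for the cell “cf1:c1” of row “r1”.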

Map-reduce is a manipulation language model for processing large-scale data sets [28] in a parallel and distributed manner on a cluster. In short, a map-reduce manipulation language has two key functions, the map function and the reduce function, each of which is executed in parallel. In the map phase, the framework distributes map tasks across the nodes in the cluster. Each map task (managed by the mapper) processes the key/value pairs of the data fragment assigned to it by the framework and produces a set of intermediate key/value pairs. In the reduce phase, the reduce function (managed by the reducer) merges all intermediate values that share the same intermediate key.
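The two phases can be sketched in a few lines of single-process Python. This is not the Hadoop API: `run_map_reduce` is a hypothetical driver that runs the map tasks sequentially, whereas a real cluster would run them in parallel, and the sample fragments are invented.

```python
from collections import defaultdict
from itertools import chain


def run_map_reduce(fragments, map_fn, reduce_fn):
    """Apply map_fn to every key/value pair, group by key, then reduce."""
    # Map phase: each fragment would be handled by its own map task.
    intermediate = chain.from_iterable(
        map_fn(key, value) for fragment in fragments for key, value in fragment
    )
    # Shuffle: collect all intermediate values that share the same key.
    groups = defaultdict(list)
    for key, value in intermediate:
        groups[key].append(value)
    # Reduce phase: merge the values of each intermediate key.
    return {key: reduce_fn(key, values) for key, values in groups.items()}


# Hypothetical input: (row key, department) pairs split into two fragments.
fragments = [[("r1", "cs"), ("r2", "biology")], [("r3", "cs")]]
counts = run_map_reduce(
    fragments,
    map_fn=lambda row_key, dept: [(dept, 1)],
    reduce_fn=lambda dept, ones: sum(ones),
)
```

In this example the map function emits one `(department, 1)` pair per row, and the reduce function sums the ones for each department.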

3 Fuzzy relational and HBase data models

In this section, we will briefly introduce the basic notions of fuzzy relational and HBase data models.

3.1 Fuzziness in the relational data model

Notions of fuzzy relational models have been introduced in previous works such as [8, 24], which differ in minor aspects of expressiveness and notation. The formal definition of fuzzy relational models in this work abstracts the most important and common features in the literature.

A relational data model is fuzzy because of a lack of information. In general, there are two levels of fuzziness in fuzzy relational databases:

At the first level, relations (or tuples) may be fuzzy, i.e., they belong to the model (or table) with some possibility.

The second level concerns the fuzzy values of attributes of specific relations.

In order to model the first level of fuzziness in a fuzzy relational database, a possibility ρ (0 ≤ ρ ≤ 1) attached to a tuple is used to indicate the possibility that the tuple belongs to a table. To model the second level of fuzziness, a set of possible values of the attribute, specified with a possibility distribution, is used to indicate the possibility of each possible value. A tuple (or attribute value) is not declared when its possibility is 0, and ρ can be omitted when the possibility of a tuple (or attribute value) is 1.0. To make the presentation concrete, consider the fuzzy relation in Table 2, which refers to the student table [10] and in which ρ denotes the possibilities of tuples. In the first row of Table 2, for the student vincent, whose department and age are computer sciences and 30 respectively, the possibility of this tuple being a member of the student table is 0.8. The first row of Table 2 is thus an instance of the first kind of fuzziness in a fuzzy relational database. For the student lyot, whose department is biology, the age is unknown so far, i.e., he has a fuzzy value in the age attribute, which can be represented by a possibility distribution, for example, {28/0.7, 23/0.1, 32/0.2}. The attribute age is an instance of the second kind of fuzziness in a fuzzy relational database. In the following, we give the formal definitions of the fuzzy relational data model.

Definition 1. A fuzzy relational schema is a 4-tuple RS = (σ, υ, τ, ρ), where

σ is a finite set of distinct attributes;

υ is a finite set of mutually exclusive and exhaustive values called domains;

τ: σ ⟶ υ is a function that associates a domain with an attribute;

ρ is a function of type T⟶ [0,1] that associates a possibility (or possibility distributions) with a tuple (or attribute values) in RS.

Definition 2. A fuzzy relation r based on RS = (σ, υ, τ, ρ) is a fuzzy subset of the Cartesian product Dom(υ₁) × Dom(υ₂) × ... × Dom(υ_n) × Dom(u_r), where the domain Dom(υ_i) of attribute υ_i may be a fuzzy subset and Dom(u_r) ⊆ (0, 1].

Definition 3. Let Ω be a finite set of symbols denoting real-world relations, and η_Ω be the set of values over Ω. A database instance I of a fuzzy relational schema RS = (σ, υ, τ, ρ) is constituted by:

a finite set Ω_I of relation identifiers;

a mapping φ_I assigning a subset of σ to each relation in Ω_I;

a mapping χ_I assigning a value in η_ΩI to each relation in Ω_I;

a mapping δ_I assigning a possibility to each relation in Ω_I.

A fuzzy relational database model is based on the notions of fuzzy relational schema, fuzzy relation (table), fuzzy relational instance, integrity constraint and fuzzy relational algebra. An integrity constraint in a schema is a predicate over relations that expresses a constraint. By far the most widely used integrity constraint in relational databases is the referential integrity constraint. Its formal definition in the fuzzy relational model is as follows.

Definition 4. Let r and s be fuzzy relations on the schemes R and S with primary keys K1 and K2 respectively. The subset α of attributes of S is a foreign key referencing K1 in r if, for every tuple t in s, there is a tuple t’ in r such that t’[K1] = t[α]; that is, the reference satisfies π_α (s) ⊆ π_K1 (r).

In the following, the fuzzy algebraic operations in fuzzy relational databases are provided. Let r and s be two union-compatible fuzzy relations on the scheme RS = (σ, υ, τ, ρ). The fuzzy set union is defined as follows: r ∪ s = {t | (t ∈ r ∨ t ∈ s) ∧ max(u_r, u_s)}, where u_r and u_s are the possibilities of t in r and s respectively, u_r, u_s ∈ (0, 1]; that is, a tuple in the result carries the larger of its two possibilities. The fuzzy set intersection is defined as follows: r ∩ s = {t | t ∈ r ∧ t ∈ s ∧ min(u_r, u_s)}. The fuzzy set difference is defined as follows: r - s = {t | t ∈ r ∧ t ∉ s ∧ min(u_r, (1 - u_s))}. The Cartesian product and projection of fuzzy relations are the same as those in classical relational databases. Let r and s be two fuzzy relations on the fuzzy schemas RS and SS respectively; then r × s = {t (RS ∪ SS) | t [RS] ∈ r ∧ t [SS] ∈ s}. The fuzzy projection of r over the attribute subset S (S ⊂ R) is defined as follows: π_S (r) = {t (S) | (∀ x) (x ∈ r (S) ∧ t = x [S])}.
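The max/min combination of possibilities in the fuzzy set operations can be sketched as follows. The sketch assumes a fuzzy relation is a dict from tuples to possibilities in (0, 1], and, as a further assumption (the usual fuzzy-set reading, not stated in the definitions above), treats a tuple that is absent from a relation as having possibility 0 there. Function names and sample tuples are hypothetical.

```python
FuzzyRelation = dict[tuple, float]  # tuple -> possibility in (0, 1]


def fuzzy_union(r: FuzzyRelation, s: FuzzyRelation) -> FuzzyRelation:
    """A tuple in either relation is kept with max(u_r, u_s)."""
    return {t: max(r.get(t, 0.0), s.get(t, 0.0)) for t in r.keys() | s.keys()}


def fuzzy_intersection(r: FuzzyRelation, s: FuzzyRelation) -> FuzzyRelation:
    """A tuple in both relations is kept with min(u_r, u_s)."""
    return {t: min(r[t], s[t]) for t in r.keys() & s.keys()}


def fuzzy_difference(r: FuzzyRelation, s: FuzzyRelation) -> FuzzyRelation:
    """A tuple of r is kept with min(u_r, 1 - u_s); u_s is 0 if t is not in s."""
    result = {}
    for t, u_r in r.items():
        u = min(u_r, 1.0 - s.get(t, 0.0))
        if u > 0:  # tuples with possibility 0 are not declared
            result[t] = u
    return result


# Hypothetical sample relations over (name, department, age).
r = {("vincent", "cs", 30): 0.8, ("lyot", "biology", 28): 1.0}
s = {("vincent", "cs", 30): 0.5}
```

Under these assumptions, `fuzzy_union(r, s)` keeps vincent with possibility 0.8, `fuzzy_intersection(r, s)` keeps him with 0.5, and `fuzzy_difference(r, s)` keeps him with min(0.8, 1 - 0.5) = 0.5 while lyot is kept unchanged.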

In the fuzzy relational model, the fuzzy selection operation extracts from a table the tuples whose specified attribute values satisfy a given fuzzy selection condition P_f, and returns them as a new table. Let r(R) be a fuzzy relation and P_f a fuzzy selection condition specified by a fuzzy expression combining basic clauses of the form AθB, where θ ∈ {<_ω, =_ω, >_ω, ≤_ω, ≥_ω, ≠_ω} and ω is a threshold. Since the predicate P_f may be fuzzy, the fuzzy expression can be evaluated by using Zadeh’s extension principle [20]. The fuzzy selection can therefore be defined as follows: σ_Pf (r) = {t | t ∈ r ∧ P_f (t)}. Similarly, let r(R) and s(S) be two fuzzy relations and P_f a fuzzy selection condition of the form AθB, where A ∈ R and B ∈ S. Then the fuzzy join can be defined as follows: $r \underset{pf}{\infty} s = \{t (R \cup S) \mid t [R] \in r \land t [S] \in s \land P_{f} (t [R], t [S])\}$, or equivalently $r \underset{pf}{\infty} s = \sigma_{pf} (r \times s)$.
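One possible reading of a thresholded clause such as A =_ω B is sketched below. The text does not spell out how the comparison is evaluated, so the sketch makes an assumption: the degree to which two possibility distributions are equal is taken as the maximum over shared values of the minimum of the two possibilities (a common possibility-measure form of the extension principle), and a tuple is selected when that degree is at least ω. All function names are hypothetical.

```python
def as_distribution(value):
    """View a crisp value v as the distribution {v: 1.0}."""
    return value if isinstance(value, dict) else {value: 1.0}


def possibility_equal(a, b) -> float:
    """Degree of a = b: max over shared values x of min(pi_a(x), pi_b(x))."""
    a, b = as_distribution(a), as_distribution(b)
    return max((min(p, b[x]) for x, p in a.items() if x in b), default=0.0)


def fuzzy_select_eq(r, attr, target, omega):
    """Keep the tuples whose attr equals target with degree >= omega."""
    return [t for t in r if possibility_equal(t[attr], target) >= omega]


# Sample rows modeled on the student example; ages may be crisp or fuzzy.
students = [
    {"name": "vincent", "age": 30},
    {"name": "lyot", "age": {28: 0.7, 23: 0.1, 32: 0.2}},
]
maybe_28 = fuzzy_select_eq(students, "age", 28, omega=0.5)
```

With ω = 0.5 the query for age 28 selects lyot, whose distribution gives 28 a possibility of 0.7; raising ω to 0.8 would select no one.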

3.2 Fuzziness in the HBase data model

Similar to the classic HBase data model, the main part of a fuzzy HBase database (FHDB) consists of tables, row keys, column families, columns, values and timestamps, where row keys are unique and timestamps are automatically assigned by the database system for version management. In an FHDB, three levels of fuzziness occur in a fuzzy HBase table:

At the first level, the occurrences of column families or columns may be fuzzy, i.e., they belong to the model with some possibility.

At the second level, sets of rows associated with columns in a column family may be fuzzy, i.e., rows belong to the column family with some possibility.

The third level concerns the fuzzy values of specific columns.

In order to model the first level of fuzziness in HBase, a column family cf or column c is associated with a possibility ρ (0 ≤ ρ ≤ 1), and its name is written together with that possibility (in the form “name with ρ possibility”) to indicate the possibility that the column family or column belongs to an HBase table. To model the second level of fuzziness in HBase, a possibility column ρ is used to indicate the possibility that a row belongs to a column family. To model the third level of fuzziness in HBase, a set of possible values of the column, specified with a possibility distribution, is used to indicate the possibility of each possible value. A row (or column value) is not declared when its possibility is 0, and ρ can be omitted when the possibility of a row (or column value) is 1.0. Consider the fuzzy HBase table in Table 3, where ρ denotes the possibilities of rows belonging to the corresponding column family. “Employee with 0.8 possibility” is a column family with the first level of fuzziness. In the first row of Table 3, for the employee vinc, whose salary is 10k, the possibility of this row being a member of the column family “employee” is 0.5. The first row of Table 3 is thus an instance of the second kind of fuzziness in the fuzzy HBase table. For the employee vincent, the salary is unknown so far, i.e., he has a fuzzy value in the salary column, which can be represented by a possibility distribution, for example, {10k/0.5, 20k/0.3, 50k/0.2}. This salary column is an instance of the third kind of fuzziness in the fuzzy HBase table. The formal definitions of the fuzzy HBase data model are as follows.
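The three levels can be pictured with a small in-memory structure. The layout below is only an illustrative assumption (the row keys `r1` and `r2`, the key `rho`, and the helper names are hypothetical and are not taken from Table 3); it is not an HBase API.

```python
# Level 1: the column family itself carries a possibility.
family = {"name": "Employee", "possibility": 0.8}

rows = {
    # Level 2: the row belongs to the column family with possibility 0.5.
    "r1": {"name": "vinc", "salary": "10k", "rho": 0.5},
    # Level 3: the salary cell holds a possibility distribution.
    "r2": {"name": "vincent",
           "salary": {"10k": 0.5, "20k": 0.3, "50k": 0.2}},
}


def family_label(fam: dict) -> str:
    """Render the family name; the possibility is omitted when it is 1.0."""
    if fam["possibility"] == 1.0:
        return fam["name"]
    return f'{fam["name"]} with {fam["possibility"]} possibility'


def row_possibility(row: dict) -> float:
    """A missing possibility column means the row certainly belongs (1.0)."""
    return row.get("rho", 1.0)


def is_fuzzy_value(cell) -> bool:
    """A cell is fuzzy when it stores a distribution rather than one value."""
    return isinstance(cell, dict)
```

For instance, `family_label(family)` yields the label “Employee with 0.8 possibility”, and the row `r2` has no `rho` entry because its possibility is 1.0.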

Definition 5. A fuzzy HBase schema is a 7-tuple HS = (r, t, cf, c, υ, τ, ρ), where

r is a finite set of distinct row keys;

t is a finite set of timestamps;

cf is a finite set of distinct column families;

c is a finite set of columns;

υ is a finite set of mutually exclusive and exhaustive values called domains;

τ: c ⟶ υ is a function that associates a domain with a column;

ρ is a function of type T ⟶ [0,1] that associates a possibility (or possibility distributions) with a column family, column or values in HS.

Definition 6. Let Ω be a finite set of symbols denoting real-world objects, and η_Ω be the set of values over Ω. A database instance I of a fuzzy HBase schema HS = (r, t, cf, c, υ, τ, ρ) is constituted by:

a finite set Ω_I of row identifiers;

a mapping ς_I assigning a subset of cf to each row in Ω_I;

a mapping φ_I assigning a subset of c to each row in Ω_I;

a mapping χ_I assigning a value in η_ΩI to each cell in Ω_I;

a mapping δ_I assigning a possibility to each column family, column or row in Ω_I.

The main constraints in fuzzy HBase data models are domain integrity constraints and cell integrity constraints. A domain integrity constraint requires that column values be values in the corresponding domain. A cell integrity constraint requires that each nonempty cell have an identifying key whose value is unique and not null.

In the following, we introduce the operators of the fuzzy manipulation language in fuzzy HBase databases. The manipulation language in fuzzy HBase has two key phases (i.e., the map phase and the reduce phase). The main operators in fuzzy HBase are union, intersection, difference, table scan operator, file output operator, reduce sink operator, predicate filter operator, select operator, join operator, relative threshold filter operator and absolute threshold filter operator.

Let r and s be two union-compatible fuzzy HBase tables on the scheme RS = (σ, υ, τ, ρ), and let cell_c (r) and cell_c (s) be union-compatible cells of r and s over the column subset C (C ⊂ r, C ⊂ s) respectively. The union is defined as follows: r ∪ s = {t | ∪_c (t ∈ cell_c (r) ∨ t ∈ cell_c (s)) ∧ max(u_r, u_s)}, where u_r and u_s are the related possibilities, u_r, u_s ∈ (0, 1]. The intersection is defined as follows: r ∩ s = {t | ∩_c (t ∈ cell_c (r) ∧ t ∈ cell_c (s)) ∧ min(u_r, u_s)}. The difference is defined as follows: r - s = {t | t ∈ cell_c (r) ∧ t ∉ cell_c (s) ∧ min(u_r, (1 - u_s))}. Let r and s be two fuzzy tables on the fuzzy HBase schemas RS and SS respectively. The Cartesian product in fuzzy HBase models is defined as follows: r × s = {t (RS ∪ SS) | ∪ (t [RS] ∈ cell_c (r) ∧ t [SS] ∈ cell_c (s))}.

The table scan operator retrieves all rows (or cells) from the fuzzy HBase table specified in the argument column at the map phase, and the file output operator exports all results generated at the reduce phase. The reduce sink operator sends intermediate outputs (key/value pairs) to the reduce stage.

A predicate filter operator, written as χ_c (hs), where c is a column name, is an operator on the fuzzy HBase schema HS whose result includes the column of the fuzzy HBase table hs that is given in HS. In particular, let hs be a table on the fuzzy HBase scheme HS; the predicate filter of hs over a single column C (C ⊂ HS) is defined as follows: χ_c (hs) = {t (C) | (∀ x) (x ∈ hs (C) ∧ t = x [C])}.

Similar to fuzzy relational models, the select operator in fuzzy HBase models extracts from a table the rows whose specified column values satisfy a given fuzzy selection condition P_f, and returns them as a new table. Let hs be a fuzzy HBase table and P_f a fuzzy selection condition specified by a fuzzy expression combining basic clauses of the form AθB, where θ ∈ {<_ω, =_ω, >_ω, ≤_ω, ≥_ω, ≠_ω} and ω is a threshold. The select operator in fuzzy HBase models can then be defined as follows: σ_Pf (hs) = {t | t ∈ hs ∧ P_f (t)}. Let hr(HR) and hs(HS) be two fuzzy HBase tables and P_f a fuzzy selection condition of the form AθB, where A ∈ HR and B ∈ HS. Then the join operator, written as $\underset{pf}{\infty}$, in fuzzy HBase models can be defined as follows: $hr \underset{pf}{\infty} hs = \{t (HR \cup HS) \mid t [HR] \in hr \land t [HS] \in hs \land P_{f} (t [HR], t [HS])\}$.

A relative threshold filter operator, written as ϑ_ω (hs_c), where c is a set of column names and ω is the given threshold, is an operator on the fuzzy HBase schema HS whose result includes the given columns of the fuzzy HBase table hs that satisfy the relative threshold constraint in HS. In particular, let hs be a table on the fuzzy HBase scheme HS; the relative threshold filter over the column subset C (C ⊂ HS) is defined as follows: ϑ_ω (hs_c) = {t (C) | (∀ x) (x ∈ hs (C) ∧ t = x [C]) ∧ ρ_c ≥ ω}. An absolute threshold filter operator, written as Δ_ω (hs_s), where s is the set of all column names in the answers and ω is the given threshold, is an operator on the fuzzy HBase schema HS whose result includes the answer columns of the fuzzy HBase table hs that satisfy the overall threshold constraint in HS. The absolute threshold filter over the column subset S (S ⊂ HS) is defined as follows: Δ_ω (hs_s) = {t (S) | (∀ x) (x ∈ hs (S) ∧ t = x [S] ∧ Πρ_c ≥ ω)}.
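The difference between the two threshold filters can be seen in a short sketch. It assumes each cell is stored as a `(value, possibility)` pair, reads the relative filter as requiring every selected column's possibility ρ_c to reach ω, and reads the absolute filter as requiring the product Πρ_c over the selected columns to reach ω. The function names and sample rows are hypothetical.

```python
import math


def relative_threshold_filter(rows, columns, omega):
    """Keep rows in which each selected column has possibility >= omega."""
    return [
        {c: row[c][0] for c in columns}
        for row in rows
        if all(row[c][1] >= omega for c in columns)
    ]


def absolute_threshold_filter(rows, columns, omega):
    """Keep rows whose product of column possibilities is >= omega."""
    return [
        {c: row[c][0] for c in columns}
        for row in rows
        if math.prod(row[c][1] for c in columns) >= omega
    ]


# Hypothetical rows: column name -> (value, possibility).
rows = [
    {"name": ("vinc", 1.0), "salary": ("10k", 0.5)},
    {"name": ("lyot", 0.9), "salary": ("20k", 0.9)},
]
cols = ["name", "salary"]
```

With ω = 0.85 the relative filter keeps the second row, since both of its possibilities are 0.9, whereas the absolute filter rejects it because the product 0.9 × 0.9 falls below 0.85.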

4 Reengineering FRDB using FHDB

Relational databases lack sufficient power for handling large-scale data, while HBase has advantages in massive data storage and querying. Hence, it is worthwhile to reengineer traditional (fuzzy) relational databases in HBase for large-scale data processing.

In this section, we concentrate on the formal mapping approach for reengineering the fuzzy relational database in the fuzzy HBase database. In particular, the reengineering could be established by a series of mapping rules as follows.

Rule 1. For a fuzzy relational database model S, a column family named S is created in HBase.

Rule 2. For a fuzzy relational database model S, a column named Timestamp is created in HBase.

Rule 3. For each relation R in S, a common column prefix “R.”, named after R, is created in the column family S.

Rule 4. For the primary key attributes KA_i of relation R in the fuzzy relational database model S, a common row key column concatenating all (distinct) key attributes KA_i and a column named R . KA_i (where R. is the corresponding column prefix) are created in S.

Rule 5. For each deterministic attribute A in relation R of the fuzzy relational database model S, a column named R . A is created in the column family S.

Rule 6. For each foreign key attribute PA_i in relation R of the fuzzy relational database model S, a column named R . PA_i is created in the column family S.

Rule 7. For each fuzzy relation FR in a fuzzy relational database model S, if the relation belongs to the model S with some possibility, then a column named ρ . FR is created in the column family S.

Rule 8. For each fuzzy attribute FA with fuzzy occurrences in the fuzzy relation FR of S, a column named FR . FA with ρ possibility is created in the column family S.

Based on the mapping rules above, the fuzzy relational schema can be mapped into the HBase schema by the following procedure:

create the column family and the timestamp column for the fuzzy relational model S by applying Rules 1 and 2, respectively.

for each relation R without fuzziness, reengineer the relation, primary key, attributes and foreign keys of R by applying Rules 3, 4, 5 and 6, respectively.

for each fuzzy relation FR, reengineer the fuzzy relation and the fuzzy attributes of FR by applying Rules 7 and 8 respectively, and reengineer the deterministic primary key, deterministic attributes and foreign keys by applying Rules 4, 5 and 6, respectively.

reengineer the key attributes in S as the primary key (row key) of the HBase schema by applying Rule 4.
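The procedure above can be sketched as a small schema generator. The input format (a list of dicts with the keys `name`, `keys`, `attributes`, `foreign_keys`, `fuzzy` and `fuzzy_attributes`), the function name `map_schema`, the use of “+” to concatenate key attributes, and the sample relations (including the key `did`) are all assumptions made for illustration; the rule numbers in the comments refer to Rules 1 to 8.

```python
def map_schema(model_name, relations):
    """Derive a (fuzzy) HBase schema description from relation descriptions."""
    columns = ["Timestamp"]                          # Rule 2
    key_parts = []
    for rel in relations:
        prefix = rel["name"] + "."                   # Rule 3: column prefix
        for ka in rel.get("keys", []):               # Rule 4: key columns
            if ka not in key_parts:
                key_parts.append(ka)
            columns.append(prefix + ka)
        for a in rel.get("attributes", []):          # Rule 5
            columns.append(prefix + a)
        for fk in rel.get("foreign_keys", []):       # Rule 6
            columns.append(prefix + fk)
        if rel.get("fuzzy", False):                  # Rule 7: possibility column
            columns.append("ρ." + rel["name"])
        for fa, rho in rel.get("fuzzy_attributes", {}).items():  # Rule 8
            columns.append(f"{prefix}{fa} with {rho} possibility")
    return {
        "column_family": model_name,                 # Rule 1
        "row_key": "+".join(key_parts),              # Rule 4: concatenated keys
        "columns": columns,
    }


# Hypothetical deterministic model with three relations.
schema = map_schema("S", [
    {"name": "school", "keys": ["sid"], "attributes": ["sname"]},
    {"name": "employee", "keys": ["eid"], "attributes": ["salary"]},
    {"name": "employ", "foreign_keys": ["sid", "eid"], "attributes": ["period"]},
])

# Hypothetical fuzzy relation with one fuzzy attribute.
fuzzy_schema = map_schema("S", [
    {"name": "donor", "keys": ["did"], "fuzzy": True,
     "fuzzy_attributes": {"dname": 0.7}},
])
```

For the deterministic model, the generated row key is `sid+eid`; for the fuzzy relation, the schema additionally contains the possibility column `ρ.donor` and a column whose name carries the possibility of `dname`.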

We use the following examples to illustrate the reengineering procedure above.

Example 1. Let us consider the deterministic relational model shown in Fig. 1(a). Figure 1(b) shows the mapped HBase schema after the reengineering. From Fig. 1(a), we know that the relations (or tables) school and employee are deterministic relations; therefore, we first use Rule 3 to map the school, employee and employ tables. Then, Rule 2 is used to create the timestamp column in HBase. For their key attributes sid and eid, we use Rule 4 to reengineer them and use “sid+eid” as the row key in HBase. The attributes sname, salary and period are deterministic attributes, and therefore we use Rule 5 to reengineer them in HBase.

Example 2. Let us consider the fuzzy relational model shown in Fig. 2(a). Figure 2(b) shows the mapped HBase schema after the reengineering. Assume that the relational table donor is a fuzzy relation that belongs to the model with some possibility and has a fuzzy attribute only (i.e., dname); therefore, we first use Rules 7 and 8 to reengineer the donor table in HBase. Since the relational table school is a deterministic entity, we use Rule 3 to map the school table. The attribute dname is assumed to be a fuzzy attribute with fuzzy occurrences in S, hence we use Rule 8 to reengineer it in HBase. The relational table donate is a fuzzy table that belongs to the model with some possibility and has deterministic attributes only, so Rule 7 and Rule 5 are used respectively to map the donate table in HBase.

In the following, we propose an algorithm (Algorithm FRI2FHBI) which can transform fuzzy relational instances into fuzzy HBase instances.

Algorithm 1. FRI2FHBI(t)

01 while not end(t)
02   t_act = getTable(t)
03   S_act = getHBaseSchema(t_act, pk_act, fk_act, da_act, fa_act, ρ_act)
     //translating the schema of t_act into the fuzzy HBase schema based on Rules 1–8
04   for each tuple tuple_i in t_act
05     pk_i(tuple_i) = getValues(pk_act(tuple_i))
06     cellOfHBase(primary key column) = pk_i(tuple_i)
07     fk_i(tuple_i) = getValues(fk_act(tuple_i))
08     cellOfHBase(foreign key column) = fk_i(tuple_i)
09     da_i(tuple_i) = getValues(da_act(tuple_i))
10     cellOfHBase(deterministic attribute column) = da_i(tuple_i)
11     fa_i(tuple_i) = getValues(fa_act(tuple_i))
12     cellOfHBase(fuzzy attribute column) = fa_i(tuple_i)
13     ρ_i(tuple_i) = getValues(ρ_act(tuple_i))
14     cellOfHBase(possibility column) = ρ_i(tuple_i)
15   end for
16 end while

In FRI2FHBI, the function getHBaseSchema(t_act, pk_act, fk_act, da_act, fa_act, ρ_act) gets all primary keys pk_act, foreign keys fk_act, deterministic attributes da_act, fuzzy attributes fa_act, and possibilities ρ_act of t_act. The functions getTable(t) and getValues(k) get the next table to be processed and all values of k respectively, and cellOfHBase(k) writes the values to the corresponding cell of k.

Algorithm FRI2FHBI operates in two phases. In the first phase (lines 2–3), the fuzzy HBase schema is generated. In the second phase (lines 4–15), the contents of the fuzzy relational tables are migrated to the fuzzy HBase cells based on the generated fuzzy HBase schema. Line 3 constructs the fuzzy HBase table schema from the fuzzy relational tables according to mapping Rules 1–8. Line 4 repeatedly gets the next tuple to process. The values are then migrated from the fuzzy relational tables to the corresponding cells of the constructed fuzzy HBase tables at lines 6, 8, 10, 12 and 14.
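The two phases of FRI2FHBI can be illustrated with a minimal Python sketch. It is not the authors' implementation: HBase is replaced by nested dictionaries, the schema step of Rules 1–8 is reduced to a fixed, hypothetical layout (the prefixes pk, fk, da, fa and rho, and a row key built from the primary key, are assumptions), and the sample tuples are invented for illustration.

```python
from typing import Any

# Hypothetical column-family prefixes; the real layout would come from Rules 1-8.
FAMILIES = ("pk", "fk", "da", "fa")


def fri2fhbi(tables: dict[str, dict[str, Any]]) -> dict[str, dict[str, dict[str, Any]]]:
    """Copy fuzzy relational tuples into an in-memory stand-in for HBase tables."""
    hbase = {}
    for name, table in tables.items():          # lines 01-02: next table t_act
        schema = table["schema"]                # line 03: roles of the columns
        rows = {}
        for tup in table["tuples"]:             # line 04: next tuple
            # Assumption: the row key is the concatenated primary key.
            row_key = "|".join(str(tup[c]) for c in schema["pk"])
            cells = {}
            for family in FAMILIES:             # lines 05-12: keys and attributes
                for col in schema.get(family, []):
                    cells[f"{family}:{col}"] = tup[col]
            rho = schema.get("rho")             # lines 13-14: possibility column
            if rho is not None:
                cells[f"rho:{rho}"] = tup[rho]
            rows[row_key] = cells
        hbase[name] = rows
    return hbase


# Invented sample data: one fuzzy table (donor) and one deterministic table (school).
donor = {
    "schema": {"pk": ["did"], "fa": ["dname"], "rho": "p"},
    "tuples": [{"did": "d1", "dname": {"Peter": 0.8, "Pete": 0.3}, "p": 0.9}],
}
school = {
    "schema": {"pk": ["sid"], "da": ["sname"]},
    "tuples": [{"sid": "s1", "sname": "School A"}],
}
hbase = fri2fhbi({"donor": donor, "school": school})
```

In this sketch a fuzzy attribute value is stored as a possibility distribution (a dictionary from candidate values to degrees), while the tuple-level possibility goes into its own cell, mirroring the separate fuzzy attribute and possibility columns of the algorithm.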

Let us take the following example to illustrate the transformation from fuzzy relational database instances to fuzzy HBase database instances.

Example 3. Consider the relational tables shown in Fig. 3(a)–(e), where instances of the relations school, employee, donor, employ and donate are depicted by the relational tuples stored in the tables school, employee, donor, employ and donate, respectively. Fig. 3(f)–(g) shows the mapped HBase instances after the reengineering. For the transformation of the deterministic relational tables, we first generate the corresponding HBase schema (recall Example 1) and then migrate the related data from the tables school, employee and employ to the generated HBase table shown in Fig. 3(f). Similarly, for the transformation of the fuzzy relational tables, we first generate the corresponding HBase schema (recall Example 2) and then migrate the related data from the tables school, donor and donate to the generated HBase table shown in Fig. 3(g).

5 Manipulation language transformation

In this section, we discuss in detail how the fuzzy relational manipulation language (algebra) is transformed into the corresponding fuzzy HBase manipulation language. As introduced in Section 3, the algebraic operations in fuzzy relational algebra are fuzzy set operations, fuzzy selections, the fuzzy Cartesian product and fuzzy projections, and the manipulation language in fuzzy HBase has two key phases (i.e., the map phase and the reduce phase). The reengineering from fuzzy relational algebra to the fuzzy HBase manipulation language can be established by the following series of mapping rules.

Rule 9. For the fuzzy set operations, i.e., fuzzy set union r∪s, fuzzy set intersection r∩s and fuzzy set difference r−s over relations r and s in fuzzy relational algebra, let u_r and u_s be the related possibilities of r and s, and assume that the corresponding HBase cells of r and s are r (r_c1, r_c2, …, r_cn) and s (s_c1, s_c2, …, s_cn), respectively. Then r∪s can be mapped into the union of the corresponding HBase cells, and the related possibility is equal to max(u_r, u_s). r∩s can be mapped into the intersection of the corresponding HBase cells, and the related possibility is equal to min(u_r, u_s). r−s can be mapped into the set of the corresponding HBase cells that are in r (r_c1, r_c2, …, r_cn) but not in s (s_c1, s_c2, …, s_cn), and the related possibility is equal to min(u_r, (1 − u_s)).

Rule 10. For the fuzzy Cartesian product r×s or the fuzzy join $r \bowtie_{P_f} s$ (where P_f is a given fuzzy selection condition) over relations r and s in fuzzy relational algebra, assume that the corresponding HBase cells of r and s are r (r_c1, r_c2, …, r_cn) and s (s_c1, s_c2, …, s_cn), respectively. Then r×s can be mapped into the set of the corresponding HBase cells, depicted as rs (r_c1, r_c2, …, r_cn, s_c1, s_c2, …, s_cn), and $r \bowtie_{P_f} s$ can be mapped into the set of the corresponding HBase cells in r (r_c1, r_c2, …, r_cn) and s (s_c1, s_c2, …, s_cn) that satisfy the given fuzzy selection condition P_f.

Rule 11. For the fuzzy selection σ_Pf (r) over relation r, where P_f is a given fuzzy selection condition, assume that the corresponding HBase cells of r are r (r_c1, r_c2, …, r_cn). If the fuzzy selection condition occurs at the possibility attribute (i.e., the first level of fuzziness in relational databases), then σ_Pf (r) can be mapped into the absolute threshold filter operation Δ_pf (r (r_c1, r_c2, …, r_cn)). If the fuzzy selection condition occurs at the possibility attribute values (i.e., the second level of fuzziness in relational databases), then σ_Pf (r) can be mapped into the relative threshold filter operation ϑ_pf (r (r_c1, r_c2, …, r_cn)). If the fuzzy selection condition occurs at a single attribute, then σ_Pf (r) can be mapped into the predicate filter operation π_ci (h) on the corresponding column r_ci that satisfies the given fuzzy selection condition P_f. If the fuzzy selection condition occurs at multiple attributes, then σ_Pf (r) can be mapped into the set of the corresponding HBase cells in r (r_c1, r_c2, …, r_cn) that satisfy the given fuzzy selection condition P_f.

Rule 12. For the fuzzy projection π_s (r) over an attribute subset s of the relation r, assume that the corresponding HBase cells of r are r (r_c1, r_c2, …, r_cn). Then π_s (r) can be mapped into the select operator on the corresponding HBase columns r_ci.
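The possibility calculus of Rules 9 and 10 can be sketched in a few lines of Python. This is an illustration under stated assumptions rather than the paper's operators: a fuzzy relation is modelled as a dictionary from a tuple of cell values to its possibility, a cell absent from s is treated as having u_s = 0 in the difference, the product combines possibilities with min (Rule 10 does not state this), and the join condition is a crisp predicate over the concatenated cells. All function names and sample values are hypothetical.

```python
from itertools import product

# A fuzzy relation: {tuple_of_cell_values: possibility in [0, 1]}.
Relation = dict[tuple, float]


def fuzzy_union(r: Relation, s: Relation) -> Relation:
    """Rule 9: union of the cells, possibility max(u_r, u_s)."""
    return {k: max(r.get(k, 0.0), s.get(k, 0.0)) for k in r.keys() | s.keys()}


def fuzzy_intersection(r: Relation, s: Relation) -> Relation:
    """Rule 9: cells present in both relations, possibility min(u_r, u_s)."""
    return {k: min(r[k], s[k]) for k in r.keys() & s.keys()}


def fuzzy_difference(r: Relation, s: Relation) -> Relation:
    """Rule 9: possibility min(u_r, 1 - u_s); assumes u_s = 0 when the cell is absent from s."""
    out = {}
    for k, u_r in r.items():
        u = min(u_r, 1.0 - s.get(k, 0.0))
        if u > 0.0:                      # drop cells that are no longer possible
            out[k] = u
    return out


def fuzzy_product(r: Relation, s: Relation) -> Relation:
    """Rule 10: concatenated cells; combining possibilities with min is an assumption."""
    return {kr + ks: min(ur, us) for (kr, ur), (ks, us) in product(r.items(), s.items())}


def fuzzy_join(r: Relation, s: Relation, condition) -> Relation:
    """Rule 10: product restricted to the cells satisfying the join condition."""
    return {k: u for k, u in fuzzy_product(r, s).items() if condition(k)}


# Invented sample relations with exactly representable possibilities.
r = {("d1", "Peter"): 0.75, ("d2", "Ann"): 0.5}
s = {("d1", "Peter"): 0.25, ("d3", "Bob"): 1.0}
t = {("d1", "2016-03-01"): 0.5}

union = fuzzy_union(r, s)
inter = fuzzy_intersection(r, s)
diff = fuzzy_difference(r, s)
joined = fuzzy_join(r, t, lambda cells: cells[0] == cells[2])
```

Here the cell ("d1", "Peter") survives the difference with possibility min(0.75, 1 − 0.25) = 0.75, which shows that under this reading a cell occurring in s with a low possibility is not removed outright.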

In the following, we describe the detailed transformation process from fuzzy relational algebraic operations to the operators of the manipulation language in fuzzy HBase. In particular, the transformation can be accomplished by the following mapping operations:

According to the attribute names in the fuzzy relational algebraic operations, a table scan operation that scans the related columns in fuzzy HBase tables is generated in the map phase. If there exist fuzzy selection conditions occurring at single attributes, then a select operator and a subsequent reduce sink operation in fuzzy HBase, which passes the outputs of the mappers on as the inputs of the reducers, are generated in the map phase. Otherwise, a reduce sink operation after the table scan operation in fuzzy HBase is generated in the map phase.

Based on the fuzzy set union, intersection or difference in relational algebra, a fuzzy HBase union, intersection or difference operation that operates on the related HBase cells is generated.

Based on the fuzzy Cartesian product or fuzzy join in relational algebra, a fuzzy HBase Cartesian product or join operation that joins related HBase cells is generated in the reduce phase.

Based on a fuzzy selection condition occurring at a single attribute in relational algebra, a predicate filter operation that filters the related cells in fuzzy HBase tables with the assistance of the given predicates, and a reduce sink operation after the predicate filter operation, are generated in the map phase. Based on a fuzzy selection condition occurring at multiple attributes in relational algebra, a select operator based on the given fuzzy selection condition in fuzzy HBase is generated in the reduce phase. Based on a fuzzy selection condition occurring at the possibility attribute in relational algebra, an absolute threshold filter operation in fuzzy HBase is generated. Based on a fuzzy selection condition occurring at the possibility attribute values in relational algebra, a relative threshold filter operation in fuzzy HBase is generated.

In the following, we will illustrate the mapping plans of fuzzy relational algebra into the fuzzy HBase manipulation language.

Example 4. Consider the fuzzy relational tables donor and donate in Fig. 2. In a fuzzy relational database, if a user would like to know the donation date (dsd.date) of a donor whose name (d.dname) is Peter, under a given threshold of 0.1, he or she may use the following relational algebraic statement to obtain what he or she wants:

$$Q:\ \pi_{dsd.date}\left(\sigma_{\rho.d \ge 0.1}\left(donate \bowtie_{d.did = dsd.did} \left(\sigma_{d.dname = \text{Peter}}(donor)\right)\right)\right)$$

According to the transformation introduced above, the map phase executes the table scan (associated with the columns dsd.date, ρ.d, d.did, dsd.did and d.dname), predicate filter (associated with the selection “d.dname = 'Peter'”), relative threshold filter (associated with the threshold selection that filters the rows according to the corresponding relative possibilities) and reduce sink operations.

In the reduce phase, it executes the equijoin (associated with the fuzzy join operation “d.did = dsd.did”), absolute threshold filter (associated with the threshold selection that filters the rows according to the absolute possibilities), projection (associated with the projection of dsd.date) and file output operations. The detailed fuzzy HBase manipulation language of Q is shown in Fig. 4.
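The map and reduce phases of Q can be mimicked on in-memory rows with the following Python sketch. It is a simplified illustration, not the plan of Fig. 4: the function names, the column name rho.d (standing for ρ.d) and the sample rows are assumptions, the reduce sink is modelled as grouping rows by the join key, and the relative threshold filter of the map phase is left out because its definition is not restated in this section.

```python
from collections import defaultdict


def map_phase(donor, donate, name):
    """Table scan, predicate filter on d.dname, and reduce sink keyed by did."""
    sink = defaultdict(lambda: {"donor": [], "donate": []})
    for row in donor:                               # table scan of donor
        if row["d.dname"] == name:                  # predicate filter
            sink[row["d.did"]]["donor"].append(row)
    for row in donate:                              # table scan of donate
        sink[row["dsd.did"]]["donate"].append(row)
    return sink


def reduce_phase(sink, threshold):
    """Equijoin per key, absolute threshold filter on rho.d, projection of dsd.date."""
    output = []
    for key in sorted(sink):
        group = sink[key]
        for d in group["donor"]:
            for dsd in group["donate"]:             # equijoin d.did = dsd.did
                joined = {**d, **dsd}
                if joined["rho.d"] >= threshold:    # absolute threshold filter
                    output.append(joined["dsd.date"])   # projection
    return output


# Invented sample rows; rho.d is the possibility attached to a donor tuple.
donor_rows = [
    {"d.did": "d1", "d.dname": "Peter", "rho.d": 0.75},
    {"d.did": "d2", "d.dname": "Peter", "rho.d": 0.05},
    {"d.did": "d3", "d.dname": "Ann", "rho.d": 1.0},
]
donate_rows = [
    {"dsd.did": "d1", "dsd.date": "2016-03-01"},
    {"dsd.did": "d2", "dsd.date": "2016-04-01"},
    {"dsd.did": "d3", "dsd.date": "2016-05-01"},
]

dates = reduce_phase(map_phase(donor_rows, donate_rows, "Peter"), threshold=0.1)
```

Filtering on d.dname before the reduce sink keeps rows that cannot contribute to the result out of the join, which is the reason the single-attribute selection is placed in the map phase.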

6 Conclusion

In this paper, we investigated how to reengineer fuzzy relational databases by using HBase. We first introduced formalisms to capture the semantics of the fuzzy relational and HBase models. In order to reengineer fuzzy relational databases in HBase, we then developed a set of mapping rules to handle the transformation from fuzzy relational databases to fuzzy HBase databases. On this basis, we presented a generic approach to generate the fuzzy HBase manipulation language from fuzzy relational algebra, and illustrated the approach with representative examples.

Future research is geared towards both applicability and enhancement. We are currently working on a prototype to show the feasibility of the approach for large practical applications. In addition, we plan to accomplish the transformation of the fuzzy entity-relationship model to the NoSQL schema (e.g., the transformation of semantic concepts like inheritance, dependency, aggregation, etc.), and to work on introducing query optimizations as an enhancement of our methodology.

Acknowledgments

The authors would also like to express their gratitude to the anonymous reviewers for providing very helpful suggestions. The work was partially supported by the China Postdoctoral Science Foundation funded project (2015M581449 and 2016T90294), Heilongjiang Postdoctoral Fund (LBH-Z14089), Natural Science Foundation of Heilongjiang Province of China (QC2015067), and Fundamental Research Funds for the Central Universities (HIT.NSRIF.2017036).
