Abstract
With the rapid growth of data volumes, a large number of business applications have begun to seek effective and scalable frameworks for data storage and processing. Against this background, emerging technologies for big data have become available, such as Hadoop-based systems that use the scalable distributed storage system HBase. Since most business data are nowadays stored in relational databases, and information imprecision and uncertainty widely exist in real-world applications, there is an increasing willingness to manage large-scale fuzzy relational data on the Hadoop-based platform. This paper concentrates on fuzzy information modeling in HBase. In particular, we investigate the formal transformation from the fuzzy relational data model to the HBase model and develop a set of mapping rules to assist in the transformation process. In addition, we present a generic approach to transform the fuzzy relational algebra into the fuzzy HBase manipulation language.
Introduction
The relational database model has proven to be a very useful model in database management systems and has been widely applied in business applications [1, 2]. The relational database model is inherently effective for precise and unambiguous data, but in the real world an application often involves imperfect information, not only because of the unreliability of its source but also because of its nature [3, 4]. For example, in many domains, such as environmental surveillance, market analysis and quantitative economics research [5], it is difficult to state all information with one hundred percent certainty. For this reason, there is a need for relational data models which can cope with imperfect information. Modeling fuzziness in data makes sense in cases where the nature of the information is uncertain and complete information is often not available [3, 6–9]. In order to manage fuzzy data in a relational database, possibility theory has been extensively applied to extend the database model, which has resulted in numerous outcomes [10–12].
In earlier business applications, the average size of a corporate database tended to be in the range of gigabytes (GB). With the advent of the era of big data [13, 14], a database with a multi-terabyte (TB) or even petabyte (PB) data size has become normal. For example, Facebook receives about 12 TB of compressed digital data per day [15]. As databases become increasingly large, how to effectively store and process large-scale data has become a topic of growing attention [16–18]. Much research has focused on big data management in recent years, and various distributed storage systems such as Bigtable [19], Dynamo [20], etc., have been developed to tackle the problems posed by the storage and access of huge amounts of data. Currently, HBase, an open-source, reliable and scalable database system designed for large-scale distributed data storage and high-performance computation, has attracted much attention in both academia and industry. Built with a distributed architecture on top of the Hadoop distributed file system, HBase can achieve high data availability, scale to billions of rows and millions of columns, and support fast real-time access to structured data [21].
In order to effectively process large-scale data, database administrators have to face the challenge of ensuring that their databases can easily communicate with column-based distributed databases (e.g., HBase). As such, the demand for applications equipped with scalable data storage systems is becoming increasingly significant. Considering that most business data are nowadays stored in relational databases, there is an increasing willingness to manage large-scale uncertain relational data on the column-based distributed database platform. In [26], Vilaca et al. introduced a mapping approach from a relational table to an HBase table, and developed a distributed query engine for running SQL queries on top of a NoSQL datastore. In [23], mapping mechanisms from existing relational databases to column-based databases are proposed. Based on the proposed mechanisms, search engines to verify the translations are also provided [13, 22]. Unfortunately, although fuzzy values have been employed to model and handle imperfect information in relational databases, and a large number of the existing fuzzy relational databases (FRDB) may be classified as legacy, relatively little work has been carried out on extending column-based databases towards the representation of fuzzy concepts. In particular, the reengineering of fuzzy relational databases in column-based databases, especially in HBase, remains unexplored. Yet there exist many legacy fuzzy relational databases that are in need of modernization in order to be compatible and competitive in this new era of big data.
To bridge the gap between fuzzy relational and column-based databases, in this paper we take a step towards a fundamental consolidation of fuzzy data management in HBase, and study how to reengineer existing fuzzy relational databases in HBase. In particular, we are interested in finding the HBase schema that effectively describes the existing fuzzy relational schema. To accomplish this, we introduce the formal transformation from the fuzzy relational model to the HBase model, and develop a set of rules to assist in the transformation process. To allow for better and platform-independent sharing of data stored in fuzzy relational formats, we also present a generic approach to translate the fuzzy relational algebra into the fuzzy HBase manipulation language. To the best of our knowledge, this is the first effort on reengineering fuzzy relational databases in column-based databases.
The rest of the paper is organized as follows. Section 2 gives the preliminaries of relational and HBase models. The formal definitions of fuzzy relational and HBase models are given in Section 3. The transformation from the fuzzy relational model to the HBase model is presented in Section 4. Section 5 introduces the principles for transforming the fuzzy relational algebra into the fuzzy HBase manipulation language, and Section 6 concludes the paper.
Basic knowledge
In this section, we will introduce the basic concepts of the relational database model and the Hadoop-based database model.
The relational database model
In the relational model of a database, all data are represented in terms of tuples, grouped into relations [6]. A relation can be viewed as a table with rows and columns, where each column corresponds to an attribute (a feature extracted from real-world things) and each row corresponds to a tuple, which represents a data object. Each row in a relational table has a unique key. A domain describes the finite set of possible values for a given attribute, and every value is atomic, i.e., it is the minimum meaningful unit of data. If the value of an attribute or of an attribute group in a relation uniquely identifies a tuple among the other tuples, the attribute or attribute group is called a super key of the relation. A primary key uniquely specifies a tuple within a table, and a foreign key is a field in a relational table that matches the primary key column of another table, which can be used to cross-reference tables. Constraints restrict the data that can be stored in relations; they can apply to single attributes, to a tuple or to an entire relation. As introduced in [20], there are domain integrity constraints, entity integrity constraints, referential integrity constraints, etc.
The relational database model provides algebraic operations [20] as a basis for database manipulation languages. The primitive algebraic operators are the set union (∪), the set intersection (∩), the set difference (−), the Cartesian product (×), the selection (σ), the projection (π) and the join (⋈). For the set union and the set difference, the two relations involved must be union-compatible, i.e., they must have the same set of attributes. Because set intersection can be defined in terms of set difference, the two relations involved in a set intersection must also be union-compatible. In particular, let r and s be two union-compatible relations on the scheme R (A1, A2, …, An). For the set union, we have r ∪ s = {t | t ∈ r ∨ t ∈ s}; for the set intersection, we have r ∩ s = {t | t ∈ r ∧ t ∈ s}; and for the set difference, we have r − s = {t | t ∈ r ∧ t ∉ s}.
The Cartesian product is a binary operation on relations. Let r and s be two relations on the schemas R and S respectively; we have r × s = {t (R ∪ S) | t [R] ∈ r ∧ t [S] ∈ s}, i.e., the result of r × s is a relation on the schema R ∪ S, where a tuple is a combination of a tuple from r and a tuple from s. The selection operation extracts from a table the tuples whose specified attribute values satisfy a given condition, and returns them as a new table. In particular, the selection of r based on a selection condition P specified by a Boolean expression can be defined as follows: σP (r) = {t | t ∈ r ∧ P (t)}. A projection is a unary operation written as πS (r), where S is a set of attribute names. The result of the projection πS (r) is a relation on the schema S that only includes the columns of the relational table r which are given in S. In particular, let r be a relation on the scheme R (A1, A2, …, An); the projection of r over the attribute subset S (S ⊂ R) is defined as follows: πS (r) = {t (S) | (∀ x) (x ∈ r (S) ∧ t = x [S])}. The join operation is a binary operation on two relations. Let r(R) and s(S) be any two relations and let P be a conditional predicate of the form AθB, where θ ∈ {<, =, >, ≤, ≥, ≠}, A ∈ R and B ∈ S; then r ⋈ s = {t (R ∪ S) | t [R] ∈ r ∧ t [S] ∈ s ∧ P (t [R], t [S])}, or r ⋈ s = σP (r × s). The result of r ⋈ s is a relation on the schema R ∪ S, where a tuple is a combination of a related tuple from r and a related tuple from s (the two combined tuples from r and s must satisfy the given condition P).
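The operators above can be illustrated with a minimal Python sketch. The encoding of a relation as a set of frozen attribute/value pairs, the helper names, and the sample data are assumptions made for illustration only; they are not part of the relational model itself. The join is written as a selection over the Cartesian product, following r ⋈ s = σP (r × s).

```python
# Relations are modeled as sets of tuples of (attribute, value) pairs.
# This encoding and the sample data are illustrative assumptions.

def relation(*rows):
    """Build a relation from dict rows (duplicates collapse, as in set semantics)."""
    return {frozenset(row.items()) for row in rows}

def select(r, predicate):
    """sigma_P(r): keep the tuples that satisfy the predicate."""
    return {t for t in r if predicate(dict(t))}

def project(r, attrs):
    """pi_S(r): restrict every tuple to the attributes in attrs."""
    return {frozenset((a, v) for a, v in t if a in attrs) for t in r}

def product(r, s):
    """r x s: combine every tuple of r with every tuple of s.
    Attribute names of r and s are assumed to be disjoint."""
    return {tr | ts for tr in r for ts in s}

def join(r, s, predicate):
    """Theta-join expressed as a selection over the Cartesian product."""
    return select(product(r, s), predicate)

r = relation({"id": 1, "name": "ann"}, {"id": 2, "name": "bob"})
s = relation({"sid": 1, "dept": "cs"}, {"sid": 3, "dept": "bio"})

union = r | relation({"id": 3, "name": "cy"})        # set union
difference = r - relation({"id": 1, "name": "ann"})  # set difference
joined = join(r, s, lambda t: t["id"] == t["sid"])   # equijoin on id = sid
names = project(r, {"name"})
```

Because relations are plain Python sets here, union, intersection and difference come directly from the built-in set operators, which mirrors the set-theoretic definitions given above.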
The Hadoop-based database model
Hadoop is an open-source, reliable and scalable architecture for large-scale distributed data storage and high-performance computation. The Hadoop-based database model contains three main parts: Hadoop core, HBase and Hive [15]. Hadoop core consists of the Hadoop distributed file system (HDFS) for data storage and the map-reduce framework for data processing; HBase is a column-oriented database for storing massive data sets; and Hive provides a set of HQL interfaces used to store and query data in HBase (by calling the map-reduce framework).
From a logical point of view, data in HBase are organized in labeled tables [27]. HBase tables are made up of several HDFS files and blocks, each of which is replicated by Hadoop. HBase tables are automatically partitioned horizontally by HBase into regions. Each HBase table is stored as a multidimensional sparse map with rows and columns, each row having a sorting key and an arbitrary number of columns. Table cells are versioned; by default, the version is a timestamp auto-assigned by HBase at the time of cell insertion. Versions are stored in descending timestamp order, and each column can have several versions for the same row key. Each cell is tagged by column family and column name, hence programs can identify what type of data item a given cell contains. A cell’s content is an uninterpreted array of bytes which is uniquely identified by “Table + Row-Key + Column-Family:Column + Timestamp”. Table rows are sorted by row key, which is also a byte array and serves as the table’s primary key. All table accesses are via the table primary key, and any scan of an HBase table results in a map-reduce job. Table 1 is an example of an HBase table. A request for the value of “cf1:c1” in the row “r1” with no timestamp specified returns the value at timestamp t3, that is, r1cf1c1v2.
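The logical view described above, a sorted map from row key to column to timestamped versions, can be sketched in a few lines of Python. This is a toy in-memory model, not the HBase client API: the function names are hypothetical, strings stand in for byte arrays, integer timestamps stand in for t1 to t3, and the older version stored at timestamp 1 is invented for the example.

```python
# Toy model of the HBase logical layout:
# row key -> "family:qualifier" -> {timestamp: value}.
# The value at timestamp 1 is hypothetical; only the t3 value comes from the text.

table = {
    "r1": {
        "cf1:c1": {1: "r1cf1c1v1", 3: "r1cf1c1v2"},
    },
}

def get(table, row_key, column, timestamp=None):
    """Return a cell value; without a timestamp, the newest version is returned."""
    versions = table[row_key][column]
    if timestamp is None:
        timestamp = max(versions)  # newest version comes first in HBase
    return versions[timestamp]

def scan(table):
    """Yield rows in row-key order, mirroring the sorted-map view."""
    for row_key in sorted(table):
        yield row_key, table[row_key]

latest = get(table, "r1", "cf1:c1")              # newest version (timestamp 3)
older = get(table, "r1", "cf1:c1", timestamp=1)  # explicit older version
```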
Map-reduce is a manipulation language model for processing large-scale data sets [28]. It runs in a parallel and distributed fashion on a cluster. In short, a map-reduce manipulation language has two key functions, the map function and the reduce function, each of which is executed in parallel. In the map phase, the framework distributes map tasks across the nodes in the cluster. Each map task (managed by the mapper) processes the key/value pairs of the data fragment assigned to it by the framework and produces a set of intermediate key/value pairs. In the reduce phase, the reduce function (managed by the reducer) merges all intermediate values with the same intermediate key.
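The two phases can be made concrete with a small single-process sketch. It only imitates the data flow (map, group by intermediate key, reduce) and does not distribute any work; the function names and the department/salary records are hypothetical.

```python
from collections import defaultdict

def map_reduce(records, map_fn, reduce_fn):
    """Apply map_fn to every record, group the intermediate pairs by key,
    then apply reduce_fn to each group. Runs sequentially in one process."""
    groups = defaultdict(list)
    for record in records:
        for key, value in map_fn(record):
            groups[key].append(value)
    return {key: reduce_fn(key, values) for key, values in groups.items()}

# Hypothetical input: (department, salary) pairs.
records = [("cs", 10), ("bio", 20), ("cs", 30)]

def map_fn(record):
    """Emit one intermediate (key, value) pair per record."""
    dept, salary = record
    yield dept, salary

def reduce_fn(key, values):
    """Merge all intermediate values that share the same key."""
    return sum(values)

totals = map_reduce(records, map_fn, reduce_fn)
```

In a real cluster the grouping step is performed by the framework's shuffle between mappers and reducers; here a dictionary plays that role.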
Fuzzy relational and HBase data models
In this section, we will briefly introduce the basic notions of fuzzy relational and HBase data models.
Fuzziness in the relational data model
Notions of fuzzy relational models have been introduced in previous works such as [8, 24], which differ in minor aspects in expressiveness and notation. The formal definition of fuzzy relational models in this work abstracts with respect to the most important and common features in the literature.
A relational data model is fuzzy because of a lack of information. In general, there are two levels of fuzziness in fuzzy relational databases. At the first level, relations (or tuples) may be fuzzy, i.e., they belong to the model (or table) only with some possibility. The second level concerns the fuzzy values of attributes of particular relations.
In order to model the first level of fuzziness in a fuzzy relational database, a possibility ρ (0 ≤ ρ ≤ 1) attached to a tuple is used to indicate the possibility that the tuple belongs to a table. To model the second level of fuzziness, a set of possible values of the attribute, specified with a possibility distribution, is used to indicate the possibilities of all the possible values. A tuple (or attribute value) is not declared when its possibility is 0, and ρ can be omitted when the possibility of a tuple (or attribute value) is 1.0. To make the presentation concrete, consider the fuzzy relation in Table 2, which refers to the student table [10], where ρ denotes the possibilities of tuples. In the first row of Table 2, for the student vincent, whose department and age are computer sciences and 30 respectively, the possibility of this tuple being a member of the student table is 0.8. The first row of Table 2 is an instance of the first kind of fuzziness in a fuzzy relational database. For the student lyot, whose department is biology, the age is unknown so far, i.e., he has a fuzzy value in the age attribute, which can be represented by a possibility distribution, for example, {28/0.7, 23/0.1, 32/0.2}. The attribute age is an instance of the second kind of fuzziness in a fuzzy relational database. In the following, we give the formal definitions of the fuzzy relational data model.
A fuzzy relational schema RS = (σ, υ, τ, ρ) consists of the following components: σ is a finite set of distinct attributes; υ is a finite set of mutually exclusive and exhaustive values called domains; τ: σ ⟶ υ is a function that associates a domain with an attribute; ρ is a function of type T ⟶ [0,1] that associates a possibility (or possibility distributions) with a tuple (or attribute values) in RS.
a finite set ΩI of relation identifiers; a mapping φI assigning to each relation in a subset σ of ΩI; a mapping χI assigning a value in ηΩI to each relation in ΩI; a mapping δI assigning a possibility to each relation in ΩI.
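As a concrete reading of the two levels of fuzziness, the Table 2 example can be sketched as a small Python data structure. The class name, the field names, and the choice to encode a possibility distribution as a dictionary are assumptions for illustration; the tuple possibility of lyot is not stated in the text and is taken to be the default 1.0.

```python
from dataclasses import dataclass

@dataclass
class FuzzyTuple:
    """A tuple with a membership possibility rho (first level of fuzziness).
    An attribute value is either crisp or a dict {value: possibility}
    (second level of fuzziness)."""
    values: dict
    rho: float = 1.0  # an omitted possibility means 1.0

    def __post_init__(self):
        # Tuples with possibility 0 are not declared at all.
        if not 0.0 < self.rho <= 1.0:
            raise ValueError("rho must lie in (0, 1]")

def possibility_of(t, attr, value):
    """Possibility that attribute attr of tuple t takes the given value."""
    v = t.values[attr]
    if isinstance(v, dict):            # fuzzy value: possibility distribution
        return v.get(value, 0.0)
    return 1.0 if v == value else 0.0  # crisp value

vincent = FuzzyTuple(
    {"name": "vincent", "dept": "computer sciences", "age": 30}, rho=0.8)
lyot = FuzzyTuple(
    {"name": "lyot", "dept": "biology", "age": {28: 0.7, 23: 0.1, 32: 0.2}})
```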
A fuzzy relational database model is based on the notions of fuzzy relational schema, fuzzy relation (table), fuzzy relational instance, integrity constraint and fuzzy relational algebra. An integrity constraint in a schema is a predicate over relation expressing a constraint. By far the most used integrity constraint in relational databases is the referential integrity constraint. The formal definition of the referential integrity constraint in fuzzy relational model is as follows.
In the following, the fuzzy algebraic operations in fuzzy relational databases are provided. Let r and s be two union-compatible fuzzy relations on the scheme RS = (σ, υ, τ, ρ). The fuzzy set union is defined as follows: r ∪ s = {t | (t ∈ r ∨ t ∈ s) ∧ max(ur, us)}, where ur and us are the possibilities of t in r and s respectively, ur, us ∈ (0, 1]. The fuzzy set intersection is defined as follows: r ∩ s = {t | t ∈ r ∧ t ∈ s ∧ min(ur, us)}. The fuzzy set difference is defined as follows: r − s = {t | t ∈ r ∧ t ∉ s ∧ min(ur, (1 − us))}. The Cartesian product and projection of fuzzy relations are the same as the ones under classical relational databases. Let r and s be two fuzzy relations on the fuzzy schemas RS and SS respectively; we have r × s = {t (RS ∪ SS) | t [RS] ∈ r ∧ t [SS] ∈ s}. The fuzzy projection of r over the attribute subset S (S ⊂ R) is defined as follows: πS (r) = {t (S) | (∀ x) (x ∈ r (S) ∧ t = x [S])}.
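The max/min combination of possibilities can be sketched as follows, with a fuzzy relation encoded as a dictionary from tuples to possibilities. The encoding and the sample data are assumptions. For the difference, this sketch treats a tuple that is absent from s as having possibility 0 in s and drops results with possibility 0, which is one common reading of the definition rather than the only possible one.

```python
def fuzzy_union(r, s):
    """Every tuple of r or s, with the maximum of its possibilities."""
    return {t: max(r.get(t, 0.0), s.get(t, 0.0)) for t in r.keys() | s.keys()}

def fuzzy_intersection(r, s):
    """Tuples in both r and s, with the minimum of their possibilities."""
    return {t: min(r[t], s[t]) for t in r.keys() & s.keys()}

def fuzzy_difference(r, s):
    """Tuples of r with possibility min(ur, 1 - us).
    Assumption: us = 0 when the tuple is absent from s."""
    result = {}
    for t, ur in r.items():
        u = min(ur, 1.0 - s.get(t, 0.0))
        if u > 0.0:  # possibility 0 means the tuple is not declared
            result[t] = u
    return result

# Hypothetical union-compatible fuzzy relations over (name, age).
r = {("vincent", 30): 0.8, ("lyot", 28): 0.5}
s = {("vincent", 30): 0.5, ("ann", 25): 1.0}
```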
In the fuzzy relational model, the fuzzy selection operation extracts from a table the tuples whose specified attribute values satisfy a given fuzzy selection condition Pf, and returns them as a new table. Let r(R) be a fuzzy relation and let Pf be a fuzzy selection condition specified by a fuzzy expression combining basic clauses of the form AθB, where θ ∈ {<ω, =ω, >ω, ≤ω, ≥ω, ≠ω} and ω is a threshold. Since the predicate Pf may be fuzzy, the fuzzy expression can be evaluated by using Zadeh’s extension principle [20]. The fuzzy selection is then defined as follows: σPf (r) = {t | t ∈ r ∧ Pf (t)}. Similarly, let r(R) and s(S) be two fuzzy relations and let Pf be a fuzzy selection condition of the form AθB, where A ∈ R and B ∈ S. Then the fuzzy join can be defined as follows: r ⋈ s = {t (R ∪ S) | t [R] ∈ r ∧ t [S] ∈ s ∧ Pf (t [R], t [S])}, or r ⋈ s = σPf (r × s).
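A thresholded equality clause can be sketched as below. The text does not spell out how a clause such as A =ω B is evaluated, so this example assumes one standard choice from possibility theory: the possibility that two values are equal is the largest min(πA(x), πB(x)) over all x, and the clause holds when that possibility is at least ω. The function names and the cohort data are hypothetical.

```python
def as_distribution(value):
    """View a crisp value as a possibility distribution with a single entry."""
    return value if isinstance(value, dict) else {value: 1.0}

def poss_equal(a, b):
    """Assumed semantics: sup over x of min(pi_a(x), pi_b(x))."""
    da, db = as_distribution(a), as_distribution(b)
    return max((min(p, db[x]) for x, p in da.items() if x in db), default=0.0)

def fuzzy_select(relation, attr, target, omega):
    """Keep tuples whose attr equals target with possibility >= omega."""
    return [t for t in relation if poss_equal(t[attr], target) >= omega]

def fuzzy_join(r, s, attr_r, attr_s, omega):
    """Selection over the Cartesian product; attribute names assumed disjoint."""
    return [{**tr, **ts} for tr in r for ts in s
            if poss_equal(tr[attr_r], ts[attr_s]) >= omega]

students = [
    {"name": "vincent", "age": 30},
    {"name": "lyot", "age": {28: 0.7, 23: 0.1, 32: 0.2}},
]
cohorts = [{"cohort": "A", "cohort_age": 28}]  # hypothetical second relation

selected = fuzzy_select(students, "age", 28, omega=0.5)
joined = fuzzy_join(students, cohorts, "age", "cohort_age", omega=0.5)
```

Raising ω makes the selection stricter: with ω = 0.8 the tuple for lyot no longer qualifies for age 28, because its possibility is only 0.7.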
Fuzziness in the HBase data model
Similar to the classic HBase data model, the main part of a fuzzy HBase database (FHDB) consists of tables, row keys, column families, columns, values and timestamps, where row keys are unique and timestamps are automatically assigned by the database system for version management. In an FHDB, three levels of fuzziness occur in a fuzzy HBase table. At the first level, the occurrences of column families or columns may be fuzzy, i.e., they belong to the model only with some possibility. At the second level, sets of rows associated with columns in a column family may be fuzzy, i.e., rows belong to the column family only with some possibility. The third level concerns the fuzzy values of particular columns.
To model the first level of fuzziness in HBase, a column family cf or a column c is associated with a possibility ρ (0 ≤ ρ ≤ 1): its name is written together with the phrase “with ρ possibility”, which indicates the possibility that the column family or column belongs to an HBase table. To model the second level of fuzziness in HBase, a possibility column ρ is used to indicate the possibility that a row belongs to a column family. To model the third level of fuzziness in HBase, a set of possible values of the column, specified with a possibility distribution, is used to indicate the possibilities of all the possible values. A row (or column value) is not declared when its possibility is 0, and ρ can be omitted when the possibility of a row (or column value) is 1.0. Consider the fuzzy HBase table in Table 3, where ρ denotes the possibilities of rows belonging to the corresponding column family. “Employee with 0.8 possibility” is a column family with the first level of fuzziness. In the first row of Table 3, for the employee vinc whose salary is 10k, the possibility of this row being a member of the column family “employee” is 0.5. The first row of Table 3 is an instance of the second kind of fuzziness in the fuzzy HBase table. For the employee vincent, the salary is unknown so far, i.e., he has a fuzzy value in the salary column, which can be represented by a possibility distribution, for example, {10k/0.5, 20k/0.3, 50k/0.2}. This salary column is an instance of the third kind of fuzziness in the fuzzy HBase table. The formal definitions of the fuzzy HBase data model are as follows.
r is a finite set of distinct row keys; cf is a finite set of distinct column families; c is a finite set of columns; υ is a finite set of mutually exclusive and exhaustive values called domains; τ: c ⟶υ is a function that associates a domain with an attribute; ρ is a function of type T ⟶ [0,1] that associates a possibility (or possibility distributions) with a column family, column or values in HS.
a finite set ΩI of row identifiers; a mapping ςI assigning to each row in a subset cf of ΩI; a mapping φI assigning to each row in a subset c of ΩI; a mapping χI assigning a value in ηΩI to each cell in ΩI; a mapping δI assigning a possibility to each column family, column or row in ΩI.
The main constraints in fuzzy HBase data models are domain integrity constraints and cell integrity constraints. A domain integrity constraint requires that column values be values in the corresponding domain. A cell integrity constraint requires that each nonempty cell have an identifying key whose value is unique and not null.
In the following, we will introduce the operators of fuzzy manipulation languages in fuzzy HBase databases. Manipulation language in fuzzy HBase has two key phases (i.e., the map phase and the reduce phase). The main operators in fuzzy HBase include union, intersection, difference, table scan operator, file output operator, reduce sink operator, predicate filter operator, select operator, join operator, relative threshold filter operator, absolute threshold filter operator.
Let r and s be two union-compatible fuzzy HBase tables on the scheme RS = (σ, υ, τ, ρ), and let cellc(r) and cellc(s) be union-compatible cells of r and s over the column subset C (C ⊂ r, C ⊂ s) respectively. The union is defined as follows: r ∪ s = {t | ∪c (t ∈ cellc (r) ∨ t ∈ cellc (s)) ∧ max(ur, us)}, where ur and us are the related possibilities, ur, us ∈ (0, 1]. The intersection is defined as follows: r ∩ s = {t | ∩c (t ∈ cellc (r) ∧ t ∈ cellc (s)) ∧ min(ur, us)}. The difference is defined as follows: r − s = {t | t ∈ cellc (r) ∧ t ∉ cellc (s) ∧ min(ur, (1 − us))}. Let r and s be two fuzzy tables on the fuzzy HBase schemas RS and SS respectively; the Cartesian product in fuzzy HBase models is defined as follows: r × s = {t (RS ∪ SS) | ∪ (t [RS] ∈ cellc (r) ∧ t [SS] ∈ cellc (s))}. The table scan operator retrieves all rows (or cells) from the fuzzy HBase table specified in the argument column at the map phase, and the file output operator exports all results generated at the reduce phase. The reduce sink operator sends intermediate outputs (key-value pairs) to the reduce stage. A predicate filter operator, written as χc(hs), where c is a column’s name, is an operator on the fuzzy HBase schema HS that includes the column of the fuzzy HBase table hs which is given in HS. In particular, let hs be a table on the fuzzy HBase scheme HS; the predicate filter of hs over a single attribute C (C ⊂ HS) is defined as follows: χc (hs) = {t (C) | (∀ x) (x ∈ hs (C) ∧ t = x [C])}.
Similar to fuzzy relational models, the select operator in fuzzy HBase models extracts from a table the rows whose specified column values satisfy a given fuzzy selection condition Pf, and returns them as a new table. Let hs be a fuzzy HBase table and let Pf be a fuzzy selection condition specified by a fuzzy expression combining basic clauses of the form AθB, where θ ∈ {<ω, =ω, >ω, ≤ω, ≥ω, ≠ω} and ω is a threshold. The select operator in fuzzy HBase models can be defined as follows: σPf (hs) = {t | t ∈ hs ∧ Pf (t)}. Let hr(HR) and hs(HS) be two fuzzy HBase tables and let Pf be a fuzzy selection condition of the form AθB, where A ∈ HR and B ∈ HS. Then the join operator, written as hr ⋈ hs, can be defined in fuzzy HBase models as follows: hr ⋈ hs = σPf (hr × hs).
A relative threshold filter operator, written as ϑω (hsc), where c is a set of column names and ω is the given threshold, is an operator on the fuzzy HBase schema HS that includes the columns of the fuzzy HBase table hs which are given and satisfy the relative threshold constraints in HS. In particular, let hs be a table on the fuzzy HBase scheme HS; the relative threshold filter over the column subset C (C ⊂ HS) is defined as follows: ϑω (hsc) = {t (C) | (∀ x) (x ∈ hs (C) ∧ t = x [C]) ∧ ρc ≥ ω}. An absolute threshold filter operator, written as Δω (hss), where s is the set of all column names in the answers and ω is the given threshold, is an operator on the fuzzy HBase schema HS that includes the columns of the answers in the fuzzy HBase table hs which are given and satisfy the overall threshold constraint in HS. The absolute threshold filter over the column subset S (S ⊂ HS) is defined as follows: Δω (hss) = {t (S) | (∀ x) (x ∈ hs (S) ∧ t = x [S] ∧ Πρc ≥ ω)}.
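The difference between the two threshold filters can be sketched in Python. The row encoding (each column maps to a pair of value and possibility), the function names, and the sample rows are assumptions for illustration. The relative filter checks each selected column's possibility against ω, while the absolute filter checks the product of the possibilities, following ρc ≥ ω and Πρc ≥ ω above.

```python
from math import prod

def relative_threshold(rows, columns, omega):
    """Project rows onto columns and keep a row only if every
    selected column has possibility >= omega."""
    result = []
    for row in rows:
        cells = {c: row[c] for c in columns}
        if all(p >= omega for _, p in cells.values()):
            result.append(cells)
    return result

def absolute_threshold(rows, columns, omega):
    """Project rows onto columns and keep a row only if the product
    of the selected possibilities is >= omega."""
    result = []
    for row in rows:
        cells = {c: row[c] for c in columns}
        if prod(p for _, p in cells.values()) >= omega:
            result.append(cells)
    return result

# Hypothetical rows: column -> (value, possibility).
rows = [
    {"name": ("vinc", 1.0), "salary": ("10k", 0.5)},
    {"name": ("ann", 0.8), "salary": ("20k", 0.8)},
]

relative = relative_threshold(rows, ["name", "salary"], omega=0.7)
absolute = absolute_threshold(rows, ["name", "salary"], omega=0.7)
```

With ω = 0.7 the second row passes the relative filter, since both possibilities are 0.8, but fails the absolute filter, since their product is about 0.64. This illustrates why the absolute filter is the stricter of the two when several fuzzy columns are combined.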
Reengineering FRDB using FHDB
Relational databases lack sufficient power for handling large-scale data, while HBase has advantages in massive data storage and querying. Hence it is important to reengineer traditional (fuzzy) relational databases in HBase for large-scale data processing.
In this section, we concentrate on the formal mapping approach for reengineering the fuzzy relational database in the fuzzy HBase database. In particular, the reengineering could be established by a series of mapping rules as follows.
Based on the mapping rules above, the fuzzy relational schema can be mapped into the HBase schema by the following procedure:
(1) Create the column family and the timestamp column for the fuzzy relational model S by applying Rules 1 and 2, respectively.
(2) For each relation R that contains no fuzziness, reengineer the relation, primary key, attributes and foreign key of R by applying Rules 3, 4, 5 and 6, respectively.
(3) For each fuzzy relation FR, reengineer the fuzzy relation and the fuzzy attributes of FR by applying Rules 7 and 8, respectively, and reengineer the deterministic primary key, deterministic attributes and foreign key by applying Rules 4, 5 and 6, respectively.
(4) Reengineer the key attributes in S as the primary key of the HBase schema by applying Rule 4.
The following examples illustrate the reengineering process above.
In the following, we propose an algorithm (Algorithm FRI2FHBI) which transforms fuzzy relational instances into fuzzy HBase instances.
01 while not end(t)
02   tact = getTable(t)
03   Sact = getHBaseSchema(tact, pkact, fkact, daact, faact, ρact)
     // translate the schema of tact into the fuzzy HBase schema based on Rules 1–8
04   for each tuple tuple i in tact
05     pki(tuple i) = getValues(pkact(tuple i))
06     cellOfHBase(primary key column) = pki(tuple i)
07     fki(tuple i) = getValues(fkact(tuple i))
08     cellOfHBase(foreign key column) = fki(tuple i)
09     dai(tuple i) = getValues(daact(tuple i))
10     cellOfHBase(deterministic attribute column) = dai(tuple i)
11     fai(tuple i) = getValues(faact(tuple i))
12     cellOfHBase(fuzzy attribute column) = fai(tuple i)
13     ρi(tuple i) = getValues(ρact(tuple i))
14     cellOfHBase(possibility column) = ρi(tuple i)
15   end for
16 end while
In FRI2FHBI, the function getHBaseSchema(tact, pkact, fkact, daact, faact, ρact) gets all primary keys pkact, foreign keys fkact, deterministic attributes daact, fuzzy attributes faact, and possibilities ρact of tact. The functions getTable(t) and getValues(k) get the next table to be processed and all values of k, respectively, and cellOfHBase(k) writes the values to the corresponding cell of k.
Algorithm FRI2FHBI operates in two phases. In the first phase (lines 2–3), the fuzzy HBase schema is generated. In the second phase (lines 4–15), the contents of the fuzzy relational tables are migrated to the fuzzy HBase cells based on the generated fuzzy HBase schema. At line 3, the algorithm constructs the fuzzy HBase table schema from the fuzzy relational tables according to mapping Rules 1–8. At line 4, it repeatedly fetches the next tuple to process. The values are migrated from the fuzzy relational tables to the corresponding cells of the constructed fuzzy HBase tables at lines 6, 8, 10, 12 and 14.
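The two phases of the algorithm can be sketched in Python with plain dictionaries standing in for HBase tables. This is not an implementation of Rules 1–8: the sketch assumes, purely for illustration, that each relation becomes one dictionary of rows, that the primary key value becomes the row key, that every column is qualified by a column family named after the relation, and that the possibility is stored in a column called rho. The attribute sid is also hypothetical.

```python
def fri2fhbi(tables):
    """Migrate fuzzy relational tables into dict-based HBase-like tables.
    Layout assumption: store[table name][row key]["family:column"] = value."""
    store = {}
    for table in tables:                       # lines 01-02: next table
        family = table["name"]                 # line 03: schema (assumed layout)
        columns = table["fks"] + table["det_attrs"] + table["fuzzy_attrs"]
        rows = store.setdefault(family, {})
        for tup in table["tuples"]:            # line 04: next tuple
            row_key = str(tup[table["pk"]])    # lines 05-06: primary key
            row = rows.setdefault(row_key, {})
            for col in columns:                # lines 07-12: fk, crisp, fuzzy
                row[f"{family}:{col}"] = tup[col]
            # lines 13-14: possibility column; omitted rho means 1.0
            row[f"{family}:rho"] = tup.get("rho", 1.0)
    return store

student = {
    "name": "student",
    "pk": "sid",                               # hypothetical key attribute
    "fks": [],
    "det_attrs": ["name", "dept"],
    "fuzzy_attrs": ["age"],
    "tuples": [
        {"sid": 1, "name": "vincent", "dept": "computer sciences",
         "age": 30, "rho": 0.8},
        {"sid": 2, "name": "lyot", "dept": "biology",
         "age": {28: 0.7, 23: 0.1, 32: 0.2}},
    ],
}

store = fri2fhbi([student])
```

Fuzzy attribute values are copied as possibility distributions unchanged, so the third level of fuzziness in the target model carries the same information as the second level in the source model.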
Let us take the following example to illustrate the transformation from fuzzy relational database instances to fuzzy HBase database instances.
Manipulation language transformation
In this section, we will give a detailed discussion on how fuzzy relational manipulation language (algebra) is transformed into the corresponding fuzzy HBase manipulation language. As introduced in Section 3, the algebraic operations in fuzzy relational algebra are fuzzy set operations, fuzzy selections, fuzzy Cartesian product and fuzzy projections, and manipulation language in fuzzy HBase has two key phases (i.e., the map phase and the reduce phase). The reengineering from fuzzy relational algebra to fuzzy HBase manipulation language could be established by a series of mapping rules as follows.
In the following, we give the detailed transformation process from fuzzy relational algebraic operations to the operators of the manipulation language in fuzzy HBase. In particular, the transformation can be accomplished by the following mapping operations:
(1) According to the attribute names in the fuzzy relational algebraic operations, a table scan operation that scans the related columns in fuzzy HBase tables is generated in the map phase.
(2) If there exist fuzzy selection conditions on single attributes, then a select operator followed by a reduce sink operation, which passes the outputs of mappers on as the inputs of reducers, is generated in the map phase. Otherwise, a reduce sink operation after the table scan operation is generated in the map phase.
(3) Based on the fuzzy set union, intersection or difference in relational algebra, a fuzzy HBase union, intersection or difference operation that operates on the related HBase cells is generated.
(4) Based on the fuzzy Cartesian product or fuzzy join in relational algebra, a fuzzy HBase Cartesian product or join operation that joins the related HBase cells is generated in the reduce phase.
(5) Based on a fuzzy selection condition on a single attribute in relational algebra, a predicate filter operation that filters the related cells in fuzzy HBase tables with the assistance of the given predicates, and a reduce sink operation after the predicate filter operation, are generated in the map phase.
(6) Based on a fuzzy selection condition on multiple attributes in relational algebra, a select operator based on the given fuzzy selection condition is generated in the reduce phase.
(7) Based on a fuzzy selection condition on the possibility attribute in relational algebra, an absolute threshold filter operation in fuzzy HBase is generated.
(8) Based on a fuzzy selection condition on the possibility attribute values in relational algebra, a relative threshold filter operation in fuzzy HBase is generated.
In the following, we will illustrate the mapping plans of fuzzy relational algebra into the fuzzy HBase manipulation language.
According to the transformation introduced above, the map phase executes the table scan (associated with the columns dsd.date, ρ.d, d.did, dsd.did, d.dname), predicate filter (associated with the selection “d.dname = 'Peter'”), relative threshold filter (associated with the threshold selection that filters the rows according to their relative possibilities) and reduce sink operations.
The reduce phase executes the equijoin (associated with the fuzzy join operation “d.did = dsd.did”), absolute threshold filter (associated with the threshold selection that filters the rows according to their absolute possibilities), projection (associated with the projection of dsd.date) and file output operations. The detailed fuzzy HBase manipulation language of Q is shown in Fig. 4.
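The operator pipeline described for the two phases can be sketched as a single-process Python program. The rows of d and dsd, the dates, the name Mary, and both thresholds are hypothetical, since the query Q and its data are not reproduced here. The sketch also assumes that the possibility of a joined row is the product of the possibilities of its parts, in line with the product Πρc used by the absolute threshold filter.

```python
from collections import defaultdict
from math import prod

# Hypothetical rows; "rho" is the possibility column of each table.
d = [{"did": 1, "dname": "Peter", "rho": 0.9},
     {"did": 2, "dname": "Mary", "rho": 0.7}]
dsd = [{"did": 1, "date": "2016-01-05", "rho": 0.8},
       {"did": 1, "date": "2016-02-11", "rho": 0.4},
       {"did": 2, "date": "2016-03-02", "rho": 0.9}]

def map_phase(d_rows, dsd_rows, relative_omega):
    """Table scan, predicate filter, relative threshold filter, reduce sink."""
    sink = defaultdict(lambda: {"d": [], "dsd": []})
    for row in d_rows:                           # table scan on d
        if row["dname"] == "Peter" and row["rho"] >= relative_omega:
            sink[row["did"]]["d"].append(row)    # reduce sink keyed by did
    for row in dsd_rows:                         # table scan on dsd
        if row["rho"] >= relative_omega:
            sink[row["did"]]["dsd"].append(row)
    return sink

def reduce_phase(sink, absolute_omega):
    """Equijoin per key, absolute threshold filter, projection, output."""
    output = []
    for did in sorted(sink):
        for left in sink[did]["d"]:
            for right in sink[did]["dsd"]:       # equijoin d.did = dsd.did
                if prod([left["rho"], right["rho"]]) >= absolute_omega:
                    output.append(right["date"])  # projection on dsd.date
    return output

result = reduce_phase(map_phase(d, dsd, relative_omega=0.3), absolute_omega=0.5)
```

Grouping by the join column in the reduce sink means each reducer only sees rows that can join with each other, which is why the equijoin can be evaluated key by key in the reduce phase.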
Conclusion
In this paper, we investigated how to reengineer fuzzy relational databases by using HBase. We first introduced formalisms to capture the semantics of fuzzy relational and HBase models. In order to reengineer fuzzy relational databases in HBase, we then developed a set of mapping rules to handle the transformation from fuzzy relational databases to fuzzy HBase databases. On this basis, we presented a generic approach to generate the fuzzy HBase manipulation language from fuzzy relational algebra, and illustrated the approach with representative examples.
Future research is both geared towards applicability and enhancement. We are currently working on a prototype showing the feasibility of the approach for large practical applications. In addition, we plan to accomplish the transformation of the fuzzy entity-relationship model to the NoSQL schema (e.g., the transformation of semantic concepts like inheritance, dependency, aggregation, etc.), and work on introducing query optimizations as an enhancement of our methodology.
Acknowledgments
The authors would also like to express their gratitude to the anonymous reviewers for providing very helpful suggestions. The work was partially supported by the China Postdoctoral Science Foundation funded project (2015M581449 and 2016T90294), Heilongjiang Postdoctoral Fund (LBH-Z14089), Natural Science Foundation of Heilongjiang Province of China (QC2015067), and Fundamental Research Funds for the Central Universities (HIT.NSRIF.2017036).
