Processing of association rules with ontology in distributed NoSQL systems

Abstract

Many NoSQL systems have been developed to handle data elasticity in distributed environments. This is very useful for data mining techniques such as association rules, which generate a huge number of rules. To avoid manual post-processing when selecting the interesting rules, many researchers suggest integrating expert users’ knowledge by means of ontologies and rule patterns. Nevertheless, with NoSQL Big Data stores holding very large volumes of data, the number of generated rules is so large that any post-processing becomes complicated, especially in industrial settings. Moreover, any solution and its results have to be tested and checked in a real Big Data context. To deal with this issue, we use an adjusted approach based on ontology and rule patterns to reduce the NoSQL database context before generating any rule. We then conduct a real experiment on a distributed industrial MongoDB database to measure execution time and the number of generated rules. This work demonstrates the performance gain obtained by using association rules with ontology in NoSQL systems.

Keywords

Ontology, NoSQL, association rules, big data, MongoDB

1. Introduction

Nowadays, many industrial companies deploy distributed NoSQL systems to manage their valuable information systems. These companies are also interested in Data Mining [28]. Their goal is to take advantage of their information systems to extract new knowledge and thus obtain early feedback. This knowledge is generally hidden; it can be extracted from historical data to anticipate new data and predict future actions [9]. For instance, providers can use this knowledge to understand the general tendencies of their customers and then improve the satisfaction of their future clients. Many methods and techniques (predictive and descriptive) have been developed in Data Mining. Among them, the technique of association rules (AR) is very useful and practical because it is simple to understand and to implement [3]. Many algorithms have been developed to implement AR (see Section 2.1). However, their main common problem is the huge number of generated rules. To choose the interesting rules, manual post-processing is required, and it must be done by expert users of the business domain. The selected rules are considered effective knowledge. To spare users this laborious manual investigation, many proposals have been developed to find pertinent rules after generating all rules (see Section 2.2). Most of these proposals meet the requirements in several cases.

Nevertheless, the volume of business data is now increasing very quickly, especially in industrial domains. Many factories and industrial companies are equipped with digital acquisition systems that generate very large amounts of data. As a result, new Big Data systems such as NoSQL [27] and NewSQL [11] have become very popular. In addition, many distributed environments have been developed to support the storage and processing of Big Data, such as Cloud Computing, Hadoop and Spark [25]. In NoSQL systems, searching for frequent itemsets and generating all association rules become exceedingly difficult in terms of both storage and processing.

In this article, we use our approach to deal with this problem in NoSQL systems. Firstly, we propose to integrate users’ knowledge, expressed as an ontology, at an early step of the process. Then, we present a concrete experiment applied to a Big Data NoSQL context in order to evaluate and demonstrate the performance gain.

The paper is organized in six sections. The first section gives an introduction. In the second section, we go over the constraints of AR and related work, and then focus on the impact of these constraints and solutions for NoSQL systems. In the third section, we present our approach based on preselected rule patterns. An experiment is described in the fourth section; historical NoSQL data of an industrial company [41] are used, and a study is conducted to obtain the interesting target rules automatically. In the fifth section, the test results are discussed to validate the performance gain. A conclusion and perspectives close the paper in the last section.

2. Problem with association rules and related work

Data mining is a set of theories and techniques that can extract, represent and integrate knowledge. Its role is to look for hidden correlations in databases and transform them into knowledge. Data mining is applied in many domains: customer relationship management (CRM), Business Intelligence (BI), statistics, scientific research, biology, etc. [17].

Many methods and tools are developed to serve data mining needs. These methods can be classified by principle (supervised, unsupervised) [17], by objective (descriptive, predictive) [42], etc.

There are many data mining techniques [16], such as association rules, clustering, decision trees, neural networks, genetic algorithms, nearest neighbor, etc. Choosing an appropriate technique depends closely on the requirements and the nature of the data. In this paper, we are interested in the association rules technique.

2.1. Technique of association rules and its problem

The technique of association rules is an interesting topic in data mining. It detects associations or links between data (itemsets) in the form of rules ( $X \to Y$ ) that can give new results to users [16,47].

An association rule $X \to Y$ means that most transactions (records) that satisfy the premise X in a context (database) also satisfy the conclusion Y.

Each rule is evaluated by two measures: support and confidence.

A rule $X \to Y$ satisfies a support S if at least S% of transactions satisfy both X and Y.

A rule $X \to Y$ satisfies a confidence C if at least C% of the transactions that satisfy X also satisfy Y.

For example, for the rule “90% of customers who buy product A also buy product B, and 40% of all customers have bought both products”, we can say that the rule holds with a confidence of 90% and is supported by 40% of customers (support).

Note that support is a statistical measure, while confidence measures the strength of the rule. To be selected, a rule must have a support and a confidence greater than or equal to user-defined thresholds, called ${Sup}_{\min}$ and ${Conf}_{\min}$ respectively. The goal is to discover all rules that satisfy these two conditions.
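The two measures can be illustrated with a minimal Python sketch. The toy transactions and the function names `support` and `confidence` are assumptions made for illustration only; they are not taken from the experiment described later.

```python
from typing import FrozenSet, Iterable, List

Transaction = FrozenSet[str]


def support(itemset: Iterable[str], transactions: List[Transaction]) -> float:
    """Fraction of transactions that contain every item of `itemset`."""
    items = frozenset(itemset)
    if not transactions:
        return 0.0
    hits = sum(1 for t in transactions if items <= t)
    return hits / len(transactions)


def confidence(premise: Iterable[str], conclusion: Iterable[str],
               transactions: List[Transaction]) -> float:
    """Among transactions containing the premise, fraction that also contain the conclusion."""
    premise_support = support(premise, transactions)
    if premise_support == 0.0:
        return 0.0
    both = frozenset(premise) | frozenset(conclusion)
    return support(both, transactions) / premise_support


# Hypothetical toy context: 10 customers buying products A, B or C.
transactions = [frozenset(t) for t in (
    ["A", "B"], ["A", "B"], ["A", "B"], ["A", "B"],
    ["A"], ["B"], ["B"], ["C"], ["C"], ["C"],
)]

rule_support = support({"A", "B"}, transactions)          # 4 of 10 transactions
rule_confidence = confidence({"A"}, {"B"}, transactions)  # 4 of the 5 transactions with A
```

In this toy context the rule A → B would be kept for thresholds such as ${Sup}_{\min} = 40\%$ and ${Conf}_{\min} = 80\%$, and rejected if either threshold were raised.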

After selecting and preparing data, the process of rule extraction can be done in two steps [2]:

Looking for the frequent itemsets (patterns). This first step dominates the processing time.

Generating all association rules. This step is rather straightforward.

Many efficient algorithms have been proposed to extract association rules. The best known is the Apriori algorithm [10], which has been very influential. Later, many other scholars improved and optimized Apriori, and many variant algorithms have been presented, such as FP-Growth [15], Close [33], Closet [34], etc. Nevertheless, their major common drawback is the huge number of generated rules.

In the Apriori algorithm, all potential itemsets (sets of items or attributes of a transaction) in a context are checked. After building the trellis (the graph of all possible combinations of itemset subsets) for a context, if we have m itemsets in this trellis, then we will have $2^{m - 1}$ scan iterations to perform. These scans are necessary to calculate the support of each itemset and to mark it as frequent or not. So, the number of scans and the number of generated rules depend exponentially on the number of itemsets [47].
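The level-wise search for frequent itemsets can be sketched as follows. This is a simplified, in-memory illustration of the Apriori principle (one pass over the context per level, then pruning of candidates that have an infrequent subset); it is not the implementation used in the experiments, and the function name and toy data are assumptions.

```python
from itertools import combinations
from typing import Dict, FrozenSet, List


def apriori_frequent_itemsets(
    transactions: List[FrozenSet[str]], min_support: float
) -> Dict[FrozenSet[str], float]:
    """Return every itemset whose support is at least `min_support`."""
    n = len(transactions)
    frequent: Dict[FrozenSet[str], float] = {}
    # Level 1 candidates: every single item present in the context.
    candidates = {frozenset([item]) for t in transactions for item in t}
    k = 1
    while candidates:
        # One pass over the context per level to count candidate occurrences.
        counts = {c: 0 for c in candidates}
        for t in transactions:
            for c in candidates:
                if c <= t:
                    counts[c] += 1
        level = {c: cnt / n for c, cnt in counts.items() if cnt / n >= min_support}
        frequent.update(level)
        # Join frequent k-itemsets into (k+1)-candidates, then prune any
        # candidate that has an infrequent k-subset (the Apriori property).
        k += 1
        joined = {a | b for a in level for b in level if len(a | b) == k}
        candidates = {
            c for c in joined
            if all(frozenset(s) in level for s in combinations(c, k - 1))
        }
    return frequent


# Hypothetical toy context with three items.
toy = [frozenset(t) for t in (
    ["A", "B", "C"], ["A", "B"], ["A", "C"], ["B", "C"], ["A", "B", "C"],
)]
frequent = apriori_frequent_itemsets(toy, min_support=0.5)
```

Even in this small sketch, every level requires a full pass over the context, which is why the size of the context matters so much for the total cost.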

Consequently, post-processing is necessary, and it must be done by expert users in order to manually target the interesting rules, which can be considered effective knowledge. The post-processing must be efficient and adapted to both the user preferences and the data structure. However, in a Big Data context, this post-processing becomes very hard, and it would be beneficial to automate it.

2.2. Background and related work

In order to automate and optimize the post-processing, many approaches have been developed to find pertinent rules. Some approaches, like that of Silberschatz and Tuzhilin [38], proposed to decrease the number of generated rules by using interest measures, which can be objective or subjective:

Objective measures rely only on the data structure (algorithm). Many works, such as those of Piatetsky-Shapiro et al. [35], Bayardo and Agrawal [6], Hilderman and Hamilton [18], Tan et al. [43], Guillet and Hamilton [14], etc., have summarized these measures and compared their definitions and properties. However, objective measures offer only a partial answer to post-processing, because they are limited to data evaluation.

Subjective measures explicitly integrate the knowledge of experts or managers [36]. Approaches that integrate these measures differ mainly in their knowledge representation models.

Some authors proposed using templates to describe interesting and uninteresting rules [20]. Others used two representation models for the user’s beliefs [23]: General Impressions (GI) and Reasonably Precise Knowledge (RPK). A fuzzy-logic version of RPK was developed by Liu and Hsu [22] to select classification rules based on syntactic comparison. Another, more exact representation of the user’s knowledge using rules was developed by Padmanabhan and Tuzhilin [32], in which rule interest is defined by logical contradiction.

In 1995, Srikant and Agrawal [10] proposed to represent the user’s knowledge by General Association Rules (GAR) and to integrate knowledge through a hierarchical taxonomy of attributes. Introducing knowledge into the attribute structure made it possible to decrease the number of rules.

Later, in 1999, Liu et al. [24] developed this taxonomy into rule patterns, which can represent broad user knowledge about a database in a particular domain. These rule patterns define a characteristic form of interesting rules, which are then selected among all calculated rules.

In their French article “Vers la fouille de règles d’association guidée par des ontologies et des schémas de règles” (“Towards association rule mining guided by ontologies and rule patterns”) [26], the LINA–COD team (Claudia M. et al.) proposed a new approach that introduces the user’s knowledge into the extraction of association rules by using an ontology associated with rule patterns. This approach is detailed in the next section. Since then, ontology has been part of the association rules process, which can therefore benefit from its advantages. An ontology is a conceptual database that allows users to model knowledge and provides a common shared vocabulary. It allows users to understand, define, structure and standardize the semantics of the terms and concepts in a domain [12]. For a company, an ontology represents its memory and a reference vocabulary for its domains of interest [31]. Many classifications have been defined for ontologies [4,13,19].

As a result, most of the proposals listed above integrate the users’ knowledge gradually and efficiently, and they contribute to automating the post-processing. However, they still have to deal with the huge number of rules, because all possible association rules are generated. So, as the number of rules increases, the post-processing becomes hard and costly in time and space.

On the other hand, data volumes are continually increasing, and many Big Data concepts have emerged. For industrial companies, all historical data must be accumulated and safeguarded. Many of these companies have recently opted for Big Data systems such as NoSQL [27] and NewSQL [11], which are deployed in distributed environments.

At present, many scholars are interested in association rules with NoSQL systems. Most of them propose variant parallel algorithms based on the Map/Reduce technique [5,8,21,37,45,46]. However, the data mining process is becoming larger and larger, and generating association rules becomes very hard and complicated in a Big Data context such as NoSQL. The huge number of generated rules cannot be handled by the above proposals. To cope with this problem, we present our approach in the next section and then perform a real experiment with NoSQL to validate its performance.

3. Our proposal

We propose to use our approach to solve the issue of association rules in NoSQL systems.

Firstly, we build on the LINA–COD approach [26]. This approach is very efficient since it integrates users’ knowledge by using ontology coupled with rule patterns in its process. We then adapt this approach to the Big Data context.

As shown in Fig. 1, the LINA–COD team’s approach is based on three main elements:

a database from which all association rules are extracted;

an ontology which represents knowledge related to the database;

a set of rule patterns that link ontology concepts to interesting rules.

Fig. 1.

Representation of LINA–COD team’s approach [26].

Formally, an ontology is a set of concepts linked by relations of subsumption or conceptualization [26]. Each parent concept represents a generalization of its child concepts, and each child concept is a specialization of its parent; these relations yield a hierarchy of conceptual knowledge. So, an ontology is an efficient tool to check or validate new knowledge in a related domain. This is very important, because the ontology can take part in the data mining process by selecting, filtering and collecting knowledge (association rules, patterns, etc.).

Also, note that a rule pattern allows a supervised selection of association rules. It offers a way to express knowledge by using a model of the rules sought: $X_{1}, X_{2}, X_{3} \dots \to Y_{1}, Y_{2}, Y_{3} \dots$, where $X_{i}$ and $Y_{j}$ are constraints on concepts (ontology) or on attributes (database).

As a result, rule patterns combined with an ontology increase the ability to target only the interesting rules in post-processing.
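To make the idea of matching a rule against a rule pattern concrete, here is a minimal Python sketch. The small hierarchy, the item names and the matching semantics (each side of the rule must cover, and be covered by, the constraints of the pattern) are assumptions chosen for illustration; the approach in [26] may define matching differently.

```python
from typing import Dict, FrozenSet, Optional, Tuple

# Hypothetical subsumption hierarchy: child concept -> parent concept.
PARENT: Dict[str, Optional[str]] = {
    "Client": None, "Product": None, "Unit": None,
    "AGC": "Client", "TOT": "Client",
    "GNL": "Product", "C3": "Product",
    "U_L1": "Unit", "U_P2": "Unit",
}

Pattern = Tuple[Tuple[str, ...], Tuple[str, ...]]


def is_a(concept: str, ancestor: str) -> bool:
    """True if `concept` equals `ancestor` or is subsumed by it."""
    current: Optional[str] = concept
    while current is not None:
        if current == ancestor:
            return True
        current = PARENT.get(current)
    return False


def side_matches(items: FrozenSet[str], constraints: Tuple[str, ...]) -> bool:
    """Every constraint is met by some item, and every item meets some constraint."""
    return (
        all(any(is_a(i, c) for i in items) for c in constraints)
        and all(any(is_a(i, c) for c in constraints) for i in items)
    )


def rule_matches(premise: FrozenSet[str], conclusion: FrozenSet[str],
                 pattern: Pattern) -> bool:
    """Check both sides of a rule against the premise and conclusion constraints."""
    pattern_premise, pattern_conclusion = pattern
    return (side_matches(premise, pattern_premise)
            and side_matches(conclusion, pattern_conclusion))


pattern: Pattern = (("Client",), ("Product", "Unit"))
ok = rule_matches(frozenset({"AGC"}), frozenset({"GNL", "U_L1"}), pattern)
rejected = rule_matches(frozenset({"GNL"}), frozenset({"AGC"}), pattern)
```

Because the constraints are concepts, a single pattern covers every rule whose items specialize those concepts, which is what makes patterns more compact than listing rules one by one.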

However, in this approach, the whole context (database) is taken into account when exploring and searching for association rules, and only afterward are the rule patterns used to filter the interesting rules. Moreover, all possible itemsets in the context are considered during the search stage, regardless of whether these itemsets will later be retained by the rule patterns.

With the Apriori algorithm, many passes over the complete context are needed to calculate the support and confidence of each itemset. Variants such as FP-Growth [15], Close [33], etc. optimize exploration by reducing the number of context accesses, but this number remains very large, and exploration remains costly in time and space.

In NoSQL systems, it is even harder to take the entire context into account each time and to consider all possible itemsets. To avoid these constraints, we propose to filter the context at the beginning by keeping only the itemsets that appear in, and comply with, the rule patterns chosen by experts. These itemsets have to contain concepts (ontology) or attributes (context) that respect the NoSQL patterns (key/value combinations).

This limits the field of investigation: only the part of the context that contains itemsets appearing in the rule patterns is considered at the exploration step. Later, in the rule generation step, only the rules that comply with the chosen rule patterns are generated.

As shown in Fig. 2, a proposed schema is given for our approach with NoSQL systems.

Fig. 2.

Representation of our proposed approach.

In the following, we describe the phases of our proposal.

Phase 1 (Filtering the Big Data NoSQL Context).

We use rule patterns to filter the NoSQL context in order to keep only the instances that contain the items included in these rule patterns. For example, consider $C_{1}, C_{2} \to C_{3}$ as an interesting rule pattern for users, where $C_{1}, C_{2}$ and $C_{3}$ are concepts in the ontology. This rule pattern means that we have to consider all association rules whose premises satisfy $C_{1}$ and not $C_{2}$ , and whose conclusions satisfy $C_{3}$ . Through the relation between concepts and attributes, a concept is linked to one or more attributes. So, in a simple case (one-to-one relation between concepts and attributes), our rule pattern can be converted into a rule pattern of attributes such as $A_{1}, A_{2} \to A_{3}$ . To be accepted by NoSQL patterns, these attributes have to be exclusively keys or values. Consequently, we can filter the Big Data context to keep only the instances that contain the attributes $A_{1}$ , $A_{3}$ and not $A_{2}$ . We do the same for all other rule patterns in order to keep only their itemsets present in the context and to exclude any itemset not included in any rule pattern. Finally, we obtain a new restricted sub-context, which is used as the database context in the next step.
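The filtering of Phase 1 can be sketched in Python on in-memory documents. The function `filter_context`, its `required` and `excluded` parameters and the sample documents are hypothetical names introduced for illustration; the sketch follows the attribute-level description above (keep instances that contain $A_{1}$ and $A_{3}$ and not $A_{2}$) and additionally projects each instance onto the required attributes, which is an assumption.

```python
from typing import Dict, Iterable, List

Document = Dict[str, object]


def filter_context(
    documents: Iterable[Document],
    required: Iterable[str],
    excluded: Iterable[str] = (),
) -> List[Document]:
    """Keep documents holding all `required` keys and none of the `excluded` keys,
    projected onto the required keys only."""
    required_keys = set(required)
    excluded_keys = set(excluded)
    restricted: List[Document] = []
    for doc in documents:
        # Iterating a dict yields its keys, so these are key-level checks.
        if required_keys.issubset(doc) and excluded_keys.isdisjoint(doc):
            restricted.append({k: doc[k] for k in sorted(required_keys)})
    return restricted


# Hypothetical key/value instances of a NoSQL context.
docs = [
    {"A1": "x", "A3": "z", "A4": 1},     # kept, projected onto A1 and A3
    {"A1": "x", "A2": "y", "A3": "z"},   # dropped: contains the excluded key A2
    {"A3": "z"},                          # dropped: required key A1 is missing
]
restricted = filter_context(docs, required={"A1", "A3"}, excluded={"A2"})
```

The restricted list is the sub-context that the rule extraction step would then explore instead of the full collection.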

Phase 2 (Using the filtered database context).

At this phase, we apply to the restricted context the same method as described in the LINA–COD approach.

The database consists of a set of N transactions described through P attributes. Let $I = \{I_{1}, I_{2}, \dots, I_{p}\}$ be the set of attributes, called features (items), and $T = \{t_{1}, t_{2}, \dots, t_{n}\}$ the set of N transactions.

Each transaction $t_{i} = \{I_{1}, I_{2}, \dots, I_{m}\}$ is a subset of the attribute set I. The Apriori algorithm [3] (or any variant) extracts rules of the form $X \to Y$ , where X and Y are two disjoint sets of attributes.

An ontology is defined by a set of concepts $C = \{C_{1}, C_{2}, \dots, C_{o}\}$ and a set of relationships or properties $R = \{R_{1}, R_{2}, \dots, R_{r}\}$ . The concepts are hierarchically linked by a relationship of subsumption.

In this scenario, it is fundamental to be able to connect the database to ontology. Each concept of ontology is instantiated in database by a subset of records. A simple way to make this connection is to associate a concept directly to an attribute of the database. Other possibilities are also envisaged, like connecting a subset of attributes to a concept. Finally, a rule pattern can express knowledge about the form of the rules sought. The semantic extension of “general impression” allows joining in rule patterns not only constraints on attributes, but also constraints on concepts [26].

To recapitulate our proposal steps, an explaining diagram is given in Fig. 3.

Fig. 3.

Representation of our proposed approach diagram.

To use our approach with NoSQL systems, we need to modulate

In conclusion, our proposal is useful because it significantly reduces the cost in execution time and space from the early steps. It provides a substantial, general optimization of processing time and storage space, especially for Big Data.

In order to verify the effectiveness of this approach in a NoSQL context, we conduct a real experiment in the next section.

4. Experiment

In this section, we perform a real experiment to confirm the efficacy of our approach on a distributed NoSQL system. Firstly, we present the data used in the experiments and their Big Data nature, and we give an extract of the related ontology. Next, we introduce our distributed platform. Then, we present an application that simplifies the process for users, so that they can build new interesting rule patterns or choose from rule patterns previously recorded by experts.

Finally, we conduct our experiment with several interesting rule patterns chosen by users. These rule patterns can use concepts of the ontology or attributes of the database, and they serve as a guide for selecting association rules. We apply our approach to restrict the NoSQL context and select only the interesting itemsets, and then we run the Apriori algorithm on this restricted context. The frequent itemsets are discovered with a minimum support and a minimum confidence chosen beforehand by users as parameters. Note that since the same data and ontology are used in the experiment, the same number of pertinent association rules is found at the end whether or not our approach is used.

The results of these experiments are analyzed and interpreted in the next section.

4.1. Presentation of data and ontology

We use downstream activity data of an oil and gas company [41], which is composed of a head office and several production plants. Its information system encompasses all business activities (production, maintenance, human resources, finances and support). Each plant has a large database for each domain. The head office, meanwhile, has large consolidated databases (Data Marts), which were recently migrated from an Oracle relational system to NoSQL MongoDB version 2.6 [30]. Note that the data migration from Oracle to MongoDB was done using the approach provided by D. Dahmani et al. [7].

Also, note that MongoDB was chosen because it is a document store suited to the nature of the data. MongoDB is efficient and popular, and its current ranking [40] strongly supports this choice.

The industrial production is a strategic business domain regulated by international certifications and standards. This domain covers the following axes: production, stocks, prediction and programming of shipping, technique and laboratories management, security and installations, etc.

For our case study, we opt for industrial production data at the headquarters as shown in Fig. 4.

Fig. 4.

MongoDB databases (head office and units).

We have chosen these data for two reasons:

They are suitable for many data mining queries that users are interested in.

Their characteristics match the NoSQL nature: volume (terabytes), velocity (data received continuously from electronic capture systems), and variability (many data types) [27].

For the experiment, we use the shipping collection in the MongoDB database. This collection contains all data about products loaded onto boats at the different units and shipped to foreign clients. Users frequently focus their analysis on these data. An extract from this collection is shown in Table 1.

Table 1

An extract from the MongoDB shipping collection

Date	Code_Unit	Code_Boat	Code_Product	Quantity	Code_Client
01/01/01	U_L1	BCH	GNL	145632.59	AGC
01/01/01	U_L2	ABR	GNL	124235.40	TOT
01/01/01	U_P2	CQT	C3	58314.23	REB
01/01/01	U_P2	CQT	C4	47865.44	REB
02/01/01	U_L1	MCT	GNL	154314.66	AGC
02/01/01	U_P1	NRT	C3	94564.72	TPG
02/01/01	U_P1	NRT	C4	81966.26	REP
03/01/01	U_L2	SFG	GNL	162388.95	REP
03/01/01	U_L1	HPL	C5	3563.46	AGC
03/01/01	U_L2	DRV	C2	5213.62	IC
03/01/01	U_K1	GXF	GNL	25532.87	TPG
…

As shown in Fig. 5, an ontology was created to represent the knowledge and concepts related to this collection (units, products, clients, etc.). All relations between these concepts are defined. This ontology was implemented using the Protégé tool version 4.0.2 [39] and the OWL language [44].

Fig. 5.

Ontology related to the shipping collection in production database.

4.2. Configuring the distributed platform

Distributed environments are frequently used with NoSQL systems because their elasticity greatly improves the performance of data mining [7]. We use a distributed platform managed by MongoDB and composed of nodes (called shards). This platform does not require as many resources as other distributed platforms, e.g., Cloud IaaS, Hadoop, etc.

MongoDB relies on the following three processes [1]: a configuration server that stores the metadata of each shard, a mongos process that routes client requests to the appropriate shards and merges the results before sending them back to the client, and a mongod process that hosts and manages the data in a shard. As shown in Fig. 6, we use a platform containing 7 nodes:

A mongos server that manages and distributes data across the 5 shards (mongod).

A configuration server used by mongos to manage the configuration of shards and replication.

5 shard servers, each of which runs the mongod service and holds a data partition distributed by mongos.

All nodes are identical (HS21, Xeon quad core E5420, 3.5 GHz/1333 MHz, 16 MB RAM, 2×10 GB Disk). This configuration was chosen according to the MongoDB good practices document [1].

Fig. 6.

Distributed architecture MongoDB used in experiments.

Once the platform is prepared, we install a distributed MongoDB database according to the steps listed in the MongoDB sharding procedure [29]. Figure 7 displays information about our shards, showing the data distribution between the different shards.

Fig. 7.

Distribution of data between shards.

4.3. Performing experiments

In the following, we use our approach and distributed platform to carry out a real experiment of interest to the company’s users.

To make this work easier, we use our Java application called Prontodam (see Fig. 8), which implements the Apriori algorithm. This application allows users to connect to the MongoDB database, choose minimum thresholds for support and confidence, select an ontology, and build rule patterns from concepts or their related attributes following the NoSQL key/value model. With this application, a user can easily build new interesting rule patterns manually or choose them from a list of rule patterns already recorded by other experts (see Fig. 9). This lets users not only gradually build an interesting catalog of rule patterns, but also check and validate the association rules process.

Additionally, this application can be executed in two modes: (1) with our approach or (2) without it. These two modes allow us to compare the results and estimate the gain obtained by implementing our approach.

Fig. 8.

Some screenshots of the Prontodam application.

4.3.1. Example of a rule pattern used in experiment

Firstly, we give an example of a rule pattern chosen by users for the shipping collection. The users’ goal is to discover each client’s interest in the products delivered by the different units. Note that the same product differs from one unit to another, because its quality depends on specific parameters of the producing unit. For instance, the specific calorific power (SCP) is an important parameter for the Liquefied Natural Gas (LNG) product. After each boatload, a product quality certificate is issued by the company and checked by the client. Users would like to know which product quality each client is looking for. So, the following rule pattern is given: “The client is interested in such product produced by such unit”.

The rule pattern above means that users are interested in association rules of the following form:

“Client → Product, Unit”. This rule pattern is very useful for managers because it allows them to characterize the clients’ interests in order to satisfy their upcoming needs, and to recommend product quality levels to units according to the clients’ needs. Note that Client, Product and Unit are concepts in the ontology, and they can be translated into their equivalent attributes Code_Client, Code_Product and Code_Unit respectively.
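As an illustration, the translation from concepts to attributes can be turned into a MongoDB-style query filter and projection. The sketch below only builds the two dictionaries; the helper name `pattern_to_query` is hypothetical, and how Prontodam (a Java application) actually queries MongoDB is not described in this paper. With a driver such as PyMongo, dictionaries of this shape could be passed as the filter and projection arguments of `collection.find`.

```python
from typing import Dict, Tuple

# Concept -> attribute mapping stated in the text above.
CONCEPT_TO_ATTRIBUTE: Dict[str, str] = {
    "Client": "Code_Client",
    "Product": "Code_Product",
    "Unit": "Code_Unit",
}


def pattern_to_query(
    premise: Tuple[str, ...], conclusion: Tuple[str, ...]
) -> Tuple[Dict[str, dict], Dict[str, int]]:
    """Build a filter and a projection restricted to the pattern's attributes."""
    attributes = [CONCEPT_TO_ATTRIBUTE[c] for c in (*premise, *conclusion)]
    # Keep only documents in which every attribute of the pattern is present.
    query = {a: {"$exists": True} for a in attributes}
    # Return only the attributes of the pattern, without the document id.
    projection = {a: 1 for a in attributes}
    projection["_id"] = 0
    return query, projection


# Rule pattern "Client -> Product, Unit".
query, projection = pattern_to_query(("Client",), ("Product", "Unit"))
```

Restricting both the selected documents and the returned fields is one plausible way to obtain the reduced context on which the rule extraction then runs.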

Experts can simply use Prontodam to choose this rule pattern, or compose it if it does not exist, and everything else is done automatically. Figure 9 shows a Prontodam screenshot of the composition of our rule pattern. After that, users can launch the generation of association rules.

Fig. 9.

Composing a rule pattern from concepts and attributes in the Prontodam application.

In Fig. 10, we can see the results of executing our rule pattern with minimum values of 10.25% and 75% for support and confidence respectively.

Recall that, since the same data and ontology are used, the same number of rules is found at the end whether or not our approach is used.

In addition, we repeat the execution many times on the rule pattern “Client → Product, Unit” by:

Varying the minimum values for support and confidence. (Note that these values are fixed by our experts based on previous statistical studies.)

Alternately using our approach or not (to compare and calculate the gain).

The results and the gain in performance are discussed in the next section.

Fig. 10.

Extract from the results of Prontodam for the rule pattern “Client → Product, Unit”.

Table 2

Results of the rule pattern “Client → Product, Unit”

Handled instances, without our approach (no filter)	Handled instances, with our approach (filter)	Min. support	Min. confidence	Frequent itemsets, without our approach	Frequent itemsets, with our approach	Generated rules, without our approach (before filter)	Generated rules, without our approach (after filter)	Generated rules (pertinent rules), with our approach	Execution time, without our approach	Execution time, with our approach
3371905	529824	10.25%	75%	525	109	1516	19	19	59 m 33 s	5 m 22 s
		12.50%	70%	695	98	1569	15	15	1 h 1 m 13 s	4 m 57 s
		7.75%	80%	1071	159	3122	41	41	1 h 23 m 52 s	10 m 45 s
		4.50%	85%	1663	194	5045	68	68	1 h 48 m 19 s	21 m 49 s
		2.25%	65%	2156	321	9331	101	101	2 h 3 m 7 s	39 m 27 s

Table 3

Average results over 15 different rule patterns “Xi → Yi”

Handled instances, without our approach (no filter)	Handled instances, with our approach (filter)	Min. support	Min. confidence	Frequent itemsets, without our approach	Frequent itemsets, with our approach	Generated rules, without our approach (before filter)	Generated rules, without our approach (after filter)	Generated rules (pertinent rules), with our approach	Execution time, without our approach	Execution time, with our approach
3371905	759736	10.25%	75%	613	115	1802	29	29	58 m 51 s	5 m 30 s
		12.50%	70%	725	133	1933	32	32	1 h 3 m 23 s	5 m 26 s
		7.75%	80%	1056	173	4652	56	56	1 h 34 m 8 s	10 m 42 s
		4.50%	85%	1790	204	8449	109	109	1 h 52 m 34 s	19 m 55 s
		2.25%	65%	2058	355	14701	374	374	2 h 5 m 28 s	37 m 33 s

4.3.2. Using many rule patterns in the experiment

As in the previous example, users can use Prontodam to build many rule patterns “Xi → Yi” ( $X_{i}$ and $Y_{i}$ are concepts from the ontology that respect NoSQL patterns) and perform the same operation. We have 15 interesting rule patterns chosen by users.

Then, many executions can be performed on these 15 rule patterns by (1) varying the minimum values for support and confidence, and (2) alternately using our approach or not.

We give in Table 3 the results of this experiment, averaged over the 15 rule patterns. To reduce display, we do not give the number of frequent itemsets found, because the final goal is the number of generated rules and the execution time.

5. Results and discussion

In this section, we discuss the results of the experiments. Table 2 summarizes the results for the rule pattern given as an example (see Section 4.3.1): the data context size and the numbers of both frequent itemsets and generated rules. Note that the context is filtered with our approach, which reduces it from 3371905 to 529824 instances. Note also that the final goal is the number of generated rules obtained.

For a minimum support of 10.25% and a minimum confidence of 75%, only 109 frequent itemsets are found with our approach and 19 pertinent association rules are generated directly, with an execution time of 5 minutes and 22 seconds. Without our approach, 525 frequent itemsets are found and 1516 rules are generated, which are later filtered down to 19 pertinent rules, with an execution time of 59 minutes and 33 seconds.
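The gain for this first configuration can be quantified with a short calculation. The following Python sketch uses only the figures reported in Table 2 for a minimum support of 10.25% and a minimum confidence of 75%; the helper `to_seconds` and the variable names are introduced here for illustration.

```python
def to_seconds(hours: int = 0, minutes: int = 0, seconds: int = 0) -> int:
    """Convert a duration given in hours, minutes and seconds to seconds."""
    return hours * 3600 + minutes * 60 + seconds


# Figures from the first row of Table 2.
time_without = to_seconds(minutes=59, seconds=33)   # 3573 s
time_with = to_seconds(minutes=5, seconds=22)       # 322 s

speedup = time_without / time_with                  # how many times faster
context_reduction = 1 - 529824 / 3371905            # share of instances filtered out
itemset_reduction = 1 - 109 / 525                   # share of frequent itemsets avoided

print(f"speedup: {speedup:.1f}x")
print(f"context reduction: {context_reduction:.1%}")
print(f"frequent itemset reduction: {itemset_reduction:.1%}")
```

For this row, the execution is roughly 11 times faster, while about 84% of the instances and 79% of the frequent itemsets are never processed. The same calculation can be applied to the other rows of Tables 2 and 3.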

In Table 3, we give the results of the general experiment, which represent the average over the 15 rule patterns. Notice again that the same numbers of pertinent association rules are always found using ontology. However, with our approach, the context is filtered first, so the frequent itemsets and the target rules are obtained much faster, with a substantial reduction in all measures.

Finally, different graphs are presented in the following for the average of the 15 rule patterns.

The graph in Fig. 11 compares the number of frequent itemsets with and without our proposal for different support and confidence values, and shows the large difference between the two.

Fig. 11. Comparing the number of frequent itemsets.

Similarly, Fig. 12 shows the large difference between the numbers of rules generated directly with and without our approach for the three experiments.

Fig. 12. Comparing the number of generated rules.

Fig. 13 highlights the gain in execution time for this experiment. The difference in execution time is large, and as the context grows, the run time increasingly favors the proposed approach.

Fig. 13. Comparing execution time for the experiment.

6. Conclusion

In this article, we have discussed the problem of applying the association rules technique to Big Data, and especially to NoSQL systems. We have reviewed several proposals for dealing with the huge number of generated rules. Among them, we have chosen to integrate users’ knowledge explicitly by using ontology and rule patterns, where the most interesting rules are selected automatically after all association rules have been generated. With NoSQL systems, however, this is not efficient and needs some adjustments.

To deal with this issue, we have presented an adjustment that improves the generation of pertinent association rules in such NoSQL systems. We use ontology with rule patterns to preselect the data before any processing, so that only the data concerned by the users’ choices are retained at an early step. These data have to respect the key/value NoSQL model. Then, only the interesting rules are generated.

To demonstrate the efficiency of our proposal, we have carried out a real experiment on industrial NoSQL data. The results show that the adjustment significantly reduces the execution time and the number of frequent itemsets, and that only the interesting rules targeted by users are generated. Moreover, the benefit of the proposal grows as the context grows, which makes it well suited to NoSQL systems.

As future work, we plan to extend this work to a large-scale cloud environment and to apply it to NewSQL systems.

References

1. A MongoDB White Paper, MongoDB Operations Best Practices, MongoDB 2.6, pp. 5–20, 2018, https://www.mongodb.com/collateral/mongodb-operations-best-practices.

2. Agrawal et al., Fast algorithms for mining association rules in large databases, in: Proceedings of the 20th International Conference on Very Large Databases (VLDB 1994), Santiago de Chile, Chile.

3. Agrawal, Imielinski and Swami, Mining association rules between sets of items in large databases, in: Proceedings of the 12th ACM SIGMOD International Conference on Management of Data, 1993, pp. 207–216.

4. Ashburner et al., Gene ontology: Tool for the unification of biology, The Gene Ontology Consortium, Nature Genetics 25(1) (2000), 25–34. doi:10.1038/75556.

5. Barkhordari and Mahdi, Kavosh: An effective map-reduce-based association rule mining method, Journal of Big Data (2018).

6. R.J. Bayardo Jr. and Agrawal, Mining the most interesting rules, in: Proceedings of the Fifth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, ACM Press, 1999, pp. 145–154. doi:10.1145/312129.312219.

7. Dahmani, Rahal and Belalem, Improving the performance of data mining by using big data in cloud environment, in: Journal of Information & Knowledge Management, Vol. 15, World Scientific Publishing Co., 2016.

8. Daniele et al., SeaRum: A Cloud-Based Service for Association Rule Mining, 12th IEEE International Conference on Trust, 2013.

9. A.R. Ganguly et al., Knowledge discovery from sensor data for scientific applications, in: Learning from Data Streams, Springer, 2007, pp. 205–229. doi:10.1007/3-540-73679-4_13.

10. Grawal and Srikant, Mining generalized association rules, in: The 21st International Conference on Very Large Data Bases (VLDB’95), San Francisco, CA, pp. 407–419.

11. Grolinger et al., Data management in cloud environments: NoSQL and NewSQL data stores, Journal of Cloud Computing: Advances, Systems and Applications, a Springer Open Journal (2013), 1–22.

12. T.R. Gruber, Toward principles for the design of ontologies used for knowledge sharing, International Journal of Human-Computer Studies 43 (1993), 907–928. doi:10.1006/ijhc.1995.1081.

13. Guarino, Formal ontology and information systems, in: Proc. of FOIS’1998, Trento, Italy, 1998, pp. 3–15.

14. Guillet and Hamilton, Quality Measures in Data Mining, Studies in Computational Intelligence, Springer, 2007.

15. Han et al., Mining frequent patterns without candidate generation: A frequent-pattern tree approach, Data Mining and Knowledge Discovery 8 (2000), 53–87. doi:10.1023/B:DAMI.0000005258.31418.83.

16. Han, Kamber and Pei, Data Mining Concepts and Techniques, 3rd edn, Elsevier Inc., 2012.

17. Hastie, Tibshirani and Friedman, The Elements of Statistical Learning: Data Mining, Inference and Prediction, Springer-Verlag, New York, 2008.

18. R.J. Hilderman and H.J. Hamilton, Evaluation of interestingness measures for ranking discovered knowledge, in: Proceedings of the 5th Pacific-Asia Conference on Knowledge Discovery and Data Mining (PAKDD’01), Springer-Verlag, 2001, pp. 247–259.

19. Jurisica et al., Ontologies for Knowledge Management: An Information Systems Perspective, Knowledge and Information Systems, Vol. 6, Springer-Verlag, 2004, pp. 380–401.

20. Klemettinen, Mannila, Ronkainen, Toivonen and A.I. Verkamo, Finding interesting rules from large sets of discovered association rules, in: International Conference on Information and Knowledge Management (CIKM), 1994, pp. 401–407.

21. Kumar et al., Mining Association Rules from NoSQL Data Bases Using MapReduce Fuzzy Association Rule Mining Algorithm, 2017.

22. Liu and Hsu, Post-analysis of learned rules, in: Proceedings of the Thirteenth National Conference on Artificial Intelligence, Lecture Notes in Artificial Intelligence, AAAI Press/MIT Press, 1996, pp. 828–834.

23. Liu, Hsu and Chen, Using general impressions to analyze discovered classification rules, in: Proceedings of the Third International Conference on Knowledge Discovery and Data Mining (KDD-1997), Heckerman, Mannila, Pregibon and Uthurusamy, eds, AAAI Press, pp. 31–36.

24. Liu, Hsu, Wang and Chen, Visually aided exploration of interesting association rules, in: Proceedings of the Third Pacific-Asia Conference on Methodologies for Knowledge Discovery and Data Mining, Lecture Notes in Computer Science, Vol. 1574, Springer-Verlag, 1999, pp. 26–28.

25. Liu, Pacitti and Valduriez, A survey of scheduling frameworks in big data systems, International Journal of Cloud Computing, Inderscience Publishers 7(2) (2018), 103–128. doi:10.1504/IJCC.2018.093765.

26. Marinica, Guillet and Briand, Vers la Fouille de Règles D’association Guidée Par des Ontologies et des Schémas de Règles, LINA–COD Team, Ecole Polytechnique de l’Université de Nantes, France, QDC, 2008.

27. McCreary and Kelly, Making Sense of NoSQL, Manning Publications Co., 2014.

28. Moffitt and Vasarhelyi, AIS in an age of big data, Journal of Information Systems, American Accounting Association, Fall 27(2) (2013), 1–19.

29. MongoDB Documentation Project Release 2.6.4, pp. 607–695, 2018. http://docs.mongodb.org/manual/tutorial/.

30. MongoDB Documentation Project Release 2.6.4, pp. 5–49, 2018. http://docs.mongodb.org/manual/tutorial/.

31. Monticolo et al., “Ontodesign: A Domain Ontology for Building and Exploiting Project Memories in Product Design Projects”, SeT Laboratory, University of Technology UTBM, France, 2007, http://www.knowllence.com.

32. Padmanabhan and Tuzhilin, Unexpectedness as a Measure of Interestingness in Knowledge Discovery, Decision Support Systems, Vol. 27, Elsevier, 1999, pp. 303–318.

33. Pasquier et al., Discovering frequent closed itemsets for association rules, in: 7th Intl. Conf. on Database Theory, 1999.

34. Pei, Han, Mao, Nishio, Tang and Yang, CLOSET: An efficient algorithm for mining frequent closed itemsets, in: Proceedings of the ACM SIGMOD DMKD’00, Dallas, TX, 2002, pp. 21–30.

35. Piatetsky-Shapiro and W.J. Frawley (eds), Knowledge Discovery in Databases, AAAI Press Co. Publications, 1991, 539 pages.

36. Piatetsky-Shapiro and C.J. Matheus, The interestingness of deviations, in: Knowledge Discovery in Databases, Papers from AAAI Workshop (KDD’, U.M. Fayyad and Uthurusamy, eds, 1994, pp. 25–36.

37. Rathee and Kashyap, Adaptive-Miner: An efficient distributed association rule mining algorithm on Spark, Journal of Big Data (2018).

38. Silberschatz and Tuzhilin, What makes patterns interesting in knowledge discovery systems, IEEE Transactions on Knowledge and Data Engineering (1996), 970–974. doi:10.1109/69.553165.

39. Site Protégé, Tool for modeling Ontology, Stanford University, http://protege.stanford.edu/.

40. I.T. Solid, Ranking Database Management Systems, 2018, http://db-engines.com/en/ranking.

41. Sonatrach Company, http://www.sonatrach.com/en/.

42. Stephens and Pablo, Supervised and unsupervised data mining techniques for the life sciences, Technical Report, Oracle and Whitehead Institute, MIT, USA, 2003.

43. P.-N. Tan, Kumar and Srivastava, Selecting the right objective measure for association analysis, in: Information Systems, Vol. 29, Elsevier Science Ltd., 2004, pp. 293–313.

44. The W3C Web Ontology Language, www.w3.org, 2018.

45. Woo, Apriori-map/reduce algorithm, in: The International Conference on Parallel and Distributed Processing Techniques and Applications (PDPTA 2012), Las Vegas, 2012.

46. Woo and Lee, MapReduce example with HBase for association rule, in: Future Information Technology, Lecture Notes in Electrical Engineering, Park, Stojmenovic, Choi and Xhafa, eds, Vol. 276, Springer, Berlin, Heidelberg, 2014.

47. M.J. Zaki and J.R. Wagner Meira, Data Mining and Analysis: Fundamental Concepts and Algorithms, Cambridge University Press, 2014.