Mining association rules on Big Data through MapReduce genetic programming

Abstract

Association rule mining is one of the most important tasks for describing raw data. Although many efficient algorithms have been developed for this purpose, existing algorithms do not work well on huge volumes of data. The aim of this paper is to propose a new genetic programming algorithm for mining association rules in Big Data. The genetic operators of our proposal have been specifically designed to prevent the solutions from growing in complexity without improving their fitness values. Furthermore, a repairing operator is introduced to improve convergence. Additionally, to facilitate its application to real-world problems, a grammar has been included, which allows subjective knowledge to be introduced into the mining process and reduces the search space. Due to the growing interest in data gathering, a single implementation of the proposed algorithm is not sufficient, so different implementations (considering different architectures such as RMI, Hadoop and Spark) are required depending on the data size. All these adaptations obtain exactly the same solutions as the original algorithm since they differ only in the software architecture. The experimental study considers more than 75 datasets and 14 algorithms, and the results reveal that the proposed algorithm obtains excellent results for more than 12 quality measures. The scalability of the proposal is also analyzed by considering the three parallel implementations on high-dimensional datasets (3,000 million instances) and file sizes up to 800 GB.

Keywords

Association rules, Big Data, MapReduce, Hadoop, Spark

1. Introduction

As technology advances, high volumes of valuable data are generated in modern organizations. Nowadays, the extraction of knowledge from such massive raw data is a priority to support decision making. This has driven and motivated research on improving techniques for data analysis in such massive datasets, giving rise to the new buzzword Big Data [30]. This term encompasses a set of techniques to face the problems derived from the management and analysis of these huge quantities of data [10].

The extraction of patterns of interest that represent intrinsic and important properties of data plays an important role in data analysis. Nevertheless, the knowledge extracted by a single pattern might be meaningless, and a more descriptive analysis may be required. In this sense, the concept of association rules was proposed by Agrawal et al. [2] as a way of describing correlations within patterns of potential interest. Association rule mining was first described in the context of market basket analysis [40], where it was used to determine which products were bought together. Nowadays, though, the applications are not limited to market basket analysis, and more and more domains [16, 18, 26] are interested in using this kind of relationship.

The first approaches in the association rule mining field [1] were based on the Apriori algorithm. It is an exhaustive search algorithm that extracts associations of interest by dividing the problem into two sub-tasks: (1) finding patterns whose frequency of occurrence is greater than a minimum threshold; and (2) extracting association rules from the previously obtained patterns. Despite the fact that this first algorithm [2] worked well in many different fields, the number of applications has grown exponentially and this rapid increase in the size and number of datasets has given rise to some limitations. For instance, obtaining all the rules could be unfeasible if the dataset has a high number $k$ of different single items, producing $2^{k}-1$ patterns and $3^{k}-2^{k+1}+1$ rules to be analyzed and saved in main memory. In addition, real-world datasets include continuous features, so the high number of distinct values produces extremely large search spaces to be considered by traditional approaches [40].
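To make the size of this search space concrete, the following minimal sketch checks both closed-form counts by brute-force enumeration for small values of $k$. The function name is illustrative only and does not belong to any algorithm discussed in this paper.

```python
from itertools import combinations


def count_patterns_and_rules(k):
    """Count non-empty item-sets and rules X -> Y over k single items."""
    patterns = 0
    rules = 0
    for size in range(1, k + 1):
        for _itemset in combinations(range(k), size):
            patterns += 1
            # Each item-set of this size can be split into a non-empty
            # antecedent and a non-empty consequent in 2**size - 2 ways.
            rules += 2 ** size - 2
    return patterns, rules


for k in (2, 3, 10):
    patterns, rules = count_patterns_and_rules(k)
    assert patterns == 2 ** k - 1
    assert rules == 3 ** k - 2 ** (k + 1) + 1
```

Even for $k=10$ items there are already 1,023 patterns, which illustrates why exhaustive approaches become impractical as $k$ grows.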

In order to overcome existing drawbacks in the mining of association rules, many research studies [36] have focused on extracting these relationships by means of Evolutionary Algorithms (EAs). The use of EAs makes it possible to extract association rules in a single step, not requiring a previous subtask for mining frequent patterns. Multi-objective optimization [31, 32] has also been considered by different researchers to extract association rules by means of EAs [12, 20, 21]. Additionally, some researchers [22] have applied EAs for optimizing patterns in continuous domains, not requiring any previous discretization step. But even more important than all of this is the reduction in both the computational time and the memory requirements achieved by treating the pattern mining problem as a combinatorial optimization problem. Even though really efficient algorithms have been proposed for mining association rules, truly Big Datasets hamper the mining process. In the 1990s, 20 attributes were called a large-scale problem [27]. Nowadays, the number of attributes in many areas, e.g. gene analysis, can easily reach thousands or even millions [33]. Under these circumstances, new forms of processing data are needed to enhance the process of decision making and knowledge discovery when massive data are considered [14, 17]. Additionally, parallel computing is being applied in this field, by considering both multi-core processors and multiple computers through Remote Method Invocation (RMI). In this sense, Cano et al. [5] proposed the use of Graphics Processing Units (GPUs) to speed up the process of mining association rules. GPU computing allows thousands of cores to be used at the same time; however, it may not be enough when truly Big Data are considered. In this regard, MapReduce [6] has emerged as a paradigm to tackle Big Data. It uses multiple machines in a distributed way, enabling a higher level of parallelism.
Nevertheless, not only are huge new quantities of data available but also a massive number of decision variables, different mathematical properties of the data or even various types of constraints. Thus, this problem cannot be solved only by increasing the computational power or by using paradigms such as MapReduce. Hence, novel methods and algorithms based on new advances in distributed computing [27] have become a necessity [25].

The goal of this paper is therefore to propose a new efficient EA to extract association rules in Big Data. The baseline of this work is a new Grammar-Guided Genetic Programming algorithm to optimize Leverage, Support and Confidence, known as G3P-LSC. The proposed model makes use of a context-free grammar to encode the solutions, and it allows the search space to be restricted by adding syntax constraints, i.e. it enables expert knowledge to be introduced into the mining process. Furthermore, our proposal is designed to be as parallel as possible so that Big Data can be tackled, and its operators have been specifically designed to avoid getting lost in large search spaces as well as to maintain diversity in the solutions. In this regard, its genetic operators provide a reduced set of rules with high values for many different quality measures and few attributes, making them easier to understand from a user's perspective. Taking the proposed sequential algorithm G3P-LSC as a starting point, different approaches have finally been implemented considering both RMI and MapReduce (Hadoop and Spark). In order to analyze the scalability of the proposal and its parallel versions, the experimental study includes different data sizes, considering datasets with more than 3,000 million instances. G3P-LSC has been compared with 14 other algorithms using more than 75 datasets. The results show that the proposed G3P-LSC algorithm mines rules by optimizing the desired qualities, providing the user with rules of high interest. Finally, the proposed approach presents a good computational cost and a promising scalability when the size of the problem increases.

The rest of the paper is organized as follows. Section 2 presents the most relevant definitions and related work; Section 3 describes the proposed algorithm; Section 4 presents the datasets used in the experiments and the results; finally, some concluding remarks are outlined in Section 5.

2. Preliminaries

In this section, the association rule mining task is formally defined, and the MapReduce paradigm is analyzed.

2.1 Association rule mining

Association rule mining (ARM) [40] is considered as one of the most relevant tasks in unsupervised learning. It aims to discover accurate associations between item-sets of interest for the application domain. These associations have a descriptive nature, describing useful behaviors for the end user.

In a formal way, it is possible to define an association rule as follows [36]. Let $I=\{i_{1},i_{2},i_{3},\ldots,i_{n}\}$ be the set of items or features, and let $T=\{t_{1},t_{2},t_{3},\ldots,t_{m}\}$ be the set of all transactions in a dataset, where each transaction $t_{j}$ comprises a subset of items $\{i_{k},\ldots,i_{l}\},1\leqslant k,l\leqslant n$. An association rule is formally defined [1] as an implication of the form $X\to Y$ where $X\subset I$, $Y\subset I$, and $X\cap Y=\emptyset$. The meaning of an association rule [13] is that if the antecedent $X$ is satisfied for a specific transaction $t_{j}$, i.e. $X\subseteq t_{j}$, then it is highly probable that the consequent $Y$ is also satisfied for that transaction, i.e. $Y\subseteq t_{j}$. Nevertheless, in some scenarios the extraction of rules of the form $X\to Y$ may not be enough, and it may be interesting to extract rules such as $X\to\neg Y$. This kind of rule relates the presence of $X$ to the absence of $Y$ [13].

ARM obtains relationships from data where no prior information is known, so the extracted knowledge might sometimes be hard to quantify. In general, association rules represent data behavior and their interest is quantified by means of metrics that determine how representative a specific rule is within a dataset. A large number of quality measures have been defined for this aim [35], support and confidence being the two most widespread metrics in the literature [13]. The support of the item-set $X$ (see Eq. (1)) is defined [1, 40] as the number of transactions $t_{j}\in T$ that satisfy $X\subseteq t_{j}$.

$\displaystyle\text{Support}(X)=|\{t_{j}\in T:X\subseteq t_{j},\ t_{j}\subseteq I\}|$ (1)

In the same way, the support of an association rule $X\to Y$ (see Eq. (2)) is defined as the number of transactions from $T$ that satisfy both $X$ and $Y$ [1, 13, 40].

$\displaystyle\text{Support}(X\to Y)=|\{t_{j}\in T:X\subseteq t_{j}\wedge Y\subseteq t_{j},\ t_{j}\subseteq I\}|$ (2)

As for the confidence quality measure [1, 13, 40], it determines the strength of implication of the rule, so the higher its value, the more accurate the rule is. In a formal way, the confidence measure (see Eq. (3)) is defined as the proportion of transactions that satisfy both the antecedent $X$ and the consequent $Y$ among those transactions that contain the antecedent $X$ [1].

$\displaystyle\text{Confidence}(X\to Y)=\text{Support}(X\to Y)/\text{Support}(X)$ (3)

Even though these quality measures are extensively used in the field, they have some downsides [13, 36]. First, the confidence measure does not detect statistical independence or negative dependence between items. Second, item-sets with very high support are a source of misleading rules. To overcome these drawbacks, many researchers have proposed several measures for the selection of interesting rules, and leverage is one of the most appealing since it satisfies the three properties proposed by Piatetsky-Shapiro [28]. Leverage (see Eq. (4)) calculates how different the co-occurrence of the antecedent $X$ and consequent $Y$ is from what would be expected [1], i.e. from independence. With supports expressed as relative frequencies, this quality measure [36] takes values in the range $[-0.25, 0.25]$, and a value of zero denotes statistical independence between $X$ and $Y$.

$\displaystyle\text{Leverage}(X\to Y)=\text{Support}(X\to Y)-(\text{Support}(X)\times\text{Support}(Y))$ (4)
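As an illustration of Eqs. (1) to (4), the following minimal sketch computes the three measures on a toy dataset. It assumes that supports are normalized by the number of transactions (relative frequencies), which is the scale on which the leverage bounds hold; the function names and the toy transactions are illustrative only.

```python
def support(itemset, transactions):
    """Relative support: fraction of transactions containing every item."""
    itemset = frozenset(itemset)
    return sum(itemset <= t for t in transactions) / len(transactions)


def confidence(x, y, transactions):
    """Support of the rule divided by the support of the antecedent.

    Raises ZeroDivisionError if the antecedent never occurs.
    """
    return support(set(x) | set(y), transactions) / support(x, transactions)


def leverage(x, y, transactions):
    """Observed co-occurrence minus the co-occurrence expected under independence."""
    joint = support(set(x) | set(y), transactions)
    return joint - support(x, transactions) * support(y, transactions)


# Hypothetical market basket data, used only for illustration.
transactions = [frozenset(t) for t in (
    {"bread", "milk"},
    {"bread", "butter"},
    {"bread", "milk", "butter"},
    {"milk"},
)]

rule_support = support({"bread", "milk"}, transactions)
rule_confidence = confidence({"bread"}, {"milk"}, transactions)
rule_leverage = leverage({"bread"}, {"milk"}, transactions)
```

In this toy example the rule bread $\to$ milk is fairly frequent and reasonably confident, yet its leverage is slightly negative, which is exactly the kind of situation that support and confidence alone do not reveal.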

2.2 MapReduce

MapReduce [6] is a recent paradigm of distributed computing in which programs are composed of two main phases defined by the programmer: map and reduce. MapReduce assumes that the input and output are based on (key, value) pairs, which are also denoted as tuples $\langle k, v\rangle$. In the map phase, each mapper processes a subset of the input data and produces $\langle k, v\rangle$ pairs. Then, an intermediate step known as the shuffle phase merges all the values associated with the same key. For example, given three different pairs with the same key, i.e. $\langle k, v_{1}\rangle$, $\langle k, v_{2}\rangle$ and $\langle k, v_{3}\rangle$, the merging process will return $\langle k, \langle v_{1}, v_{2}, v_{3}\rangle\rangle$. Finally, the reducer takes this new list as input to produce the final values. It should be noted that all the map and reduce operations are run in a distributed way. The flowchart of a generic MapReduce framework is depicted in Fig. 1.

Figure 1.

Diagram of a generic MapReduce framework.
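The three phases can be mimicked on a single machine with a few lines of code. The sketch below is only a didactic simulation (it is neither Hadoop nor Spark code) that counts item occurrences; the chunking of the data and all function names are assumptions made for the example.

```python
from collections import defaultdict


def map_phase(chunk):
    """Mapper: emit one (item, 1) pair per item occurrence in its chunk."""
    return [(item, 1) for transaction in chunk for item in transaction]


def shuffle_phase(pairs):
    """Shuffle: merge all the values that share the same key."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return dict(grouped)


def reduce_phase(key, values):
    """Reducer: collapse the list of values of one key into a final value."""
    return key, sum(values)


# Two chunks of a toy dataset, one per (simulated) mapper.
chunks = [
    [["a", "b"], ["a"]],
    [["b", "c"], ["a", "c"]],
]
pairs = [pair for chunk in chunks for pair in map_phase(chunk)]
grouped = shuffle_phase(pairs)
counts = dict(reduce_phase(key, values) for key, values in grouped.items())
```

Here `grouped` plays the role of the merged list of values per key, and `counts` holds the final values produced by the reducers.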

Hadoop [15] is the de facto standard for MapReduce applications. Hadoop implements the MapReduce paradigm and provides a distributed filesystem known as the Hadoop Distributed File System (HDFS), which replicates file data across multiple storage nodes that can access the data concurrently. The main drawback of Hadoop is that it imposes an acyclic data flow graph, and there are applications that cannot be modeled efficiently using this kind of graph, such as iterative or interactive analysis [38]. Moreover, the communication among mappers and reducers is performed using disk. This can cause I/O problems when the number of (key, value) pairs is extremely large, the disk being the main bottleneck due to its slow read/write speed. All of this has hampered the modeling of efficient iterative algorithms on this platform. To solve these downsides, a novel solution known as Spark [38] has been proposed. This proposal is designed to be used in iterative and interactive algorithms. To speed up the processing of huge amounts of data, it introduces an abstraction called Resilient Distributed Datasets (RDDs). An RDD represents a read-only collection of objects partitioned across a set of machines and stored in main memory. It allows a dataset to be loaded into memory once and read multiple times without having to load it from disk in each iteration, as Hadoop does. Furthermore, the communication among mappers and reducers is performed in memory, which is much faster than the approach followed by Hadoop. One of the main strengths of Spark is its rich application programming interface, which provides a set of in-memory primitives that facilitate the modeling of algorithms.

3. Evolutionary algorithm based on grammars for mining association rules in Big Data

The main motivation of this work is to propose an EA based on grammars for mining association rules. This work has been designed to tackle massive amounts of data: its genetic operators make it possible to scale to huge search spaces without losing accuracy while maintaining diversity among solutions, and its repairing operator improves convergence in complex spaces. Additionally, the fitness function has been explicitly designed to find frequent and reliable rules whose antecedent and consequent are not independent. To accomplish this, two different populations and three genetic operators as well as a grammar have been used. Also, due to the growing interest in data gathering, a single universal implementation of the proposed algorithm is not sufficient, so different adaptations have been developed depending on the data size and the architecture used, all of which are fully described in the following sections. Finally, it is worth noting that all of these adaptations obtain exactly the same solutions; the only difference among them is the software architecture used.

3.1 Baseline

The baseline of this work is a new EA, known as G3P-LSC, that makes use of a context-free grammar to constrain the search space. Many authors have explored the use of grammars in pattern mining, achieving excellent results both in introducing subjective external knowledge into the mining process and in restricting the search space by means of syntax constraints [19]. A major strength of using grammars is their adaptability to represent solutions with different forms and features, in such a way that a simple change in the grammar is able to produce completely different solutions. However, the user should be cautious when using grammars in pattern mining since reducing the search space may cause the loss of highly interesting solutions, e.g. those that do not satisfy the constraints imposed by the grammar. Besides, another major limitation of using grammars is the possibility of bloating, by which the trees expand without control and their complexity increases. All these downsides are overcome in the proposed G3P-LSC, as described below.

Figure 2.

Context-free grammar defined by G3P-LSC.

3.1.1 Encoding

G3P-LSC represents each solution as a derivation syntax tree encoded by means of a set of production rules from the context-free grammar shown in Fig. 2. The grammar is defined as a four-tuple $(\Sigma_{N},\Sigma_{T},P,S)$ where $\Sigma_{N}$ and $\Sigma_{T}$ represent the alphabets of non-terminal and terminal symbols, respectively, which have no common elements, i.e. $\Sigma_{N}\cap\Sigma_{T}=\emptyset$.

Terminal symbols are literals of the grammar and cannot be changed using the rules of the grammar. For example, the value of an attribute does not change even when the rules of the grammar are modified. Additionally, terminal symbols do not appear on the left-hand side of any production rule. On the contrary, non-terminal symbols are lexical elements used to form a grammar, and they can be replaced to produce different solutions. Non-terminal symbols may appear on both the left-hand and right-hand sides of the production rules. In order to encode a solution, a number of production rules from the set $P$ are applied beginning from the start symbol denoted by $S$. A production rule is defined as $\alpha\to\beta$ where $\alpha\in\Sigma_{N}$, and $\beta\in\{\Sigma_{T}\cup\Sigma_{N}\}^{*}$. After applying the production rules, a derivation syntax tree is obtained for each solution, where internal nodes contain only non-terminal symbols, and leaves contain only terminal symbols.

To generate each solution, the derivation syntax tree is obtained by applying a series of derivation steps from the start symbol of the grammar. From this symbol, the algorithm applies production rules belonging to the set $P$ until a valid derivation chain is reached. Additionally, in order to avoid bloating, which is one of the main problems of using grammars, a maximum number of derivations is determined beforehand as an input parameter. In this regard, there is a maximum length that no rule can exceed. Genetic operators are aware of this maximum length, so they are specifically designed to avoid an uncontrolled growth of the solutions.

3.1.2 Evaluation procedure

In any evolutionary approach, the evaluation process is a cornerstone since it is responsible for assigning a fitness value that determines how promising each rule is for a specific aim. The proposed fitness function $F$ of a solution (rule) $R\equiv X\to Y$ is the product of support, confidence and leverage, i.e. $F(R)=\text{support}(R)\times\text{confidence}(R)\times\text{leverage}(R)$. This fitness function takes values in the range $[-0.25, 0.25]$. Support and confidence are the most widespread measures in association rule mining and they are related in such a way that the confidence value of a rule cannot be lower than its support [36]. It means that high support values imply high confidence values but, in those datasets where the maximum feasible support value is not too high, confidence needs to be maximized and, therefore, both metrics should be considered at the same time. Even though these two quality measures are extensively used in the field, they should be considered together with additional quality measures [35]. In this regard, leverage appears as a good metric to determine how far the co-occurrence of the antecedent and consequent of a rule is from independence. A major feature of this quality measure with regard to similar metrics [36] is that its values have predefined lower and upper bounds, i.e. $[-0.25, 0.25]$, a value of zero denoting statistical independence between antecedent and consequent.

The fitness function has been precisely designed to avoid frequent (support) and reliable (confidence) rules whose antecedent and consequent are independent. Since the fitness is the product of these metrics, the overall fitness value is zero whenever leverage is zero.
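The behavior of this product can be checked with a small sketch that computes $F(R)$ directly from the relative supports of the antecedent, the consequent and the rule. The function signature and the convention of returning zero for a rule whose antecedent never occurs are assumptions made for the example, not details taken from G3P-LSC.

```python
def fitness(sup_x, sup_y, sup_xy):
    """F(R) = support(R) * confidence(R) * leverage(R), from relative supports."""
    if sup_x == 0:
        # Assumption: a rule whose antecedent never occurs gets a neutral fitness.
        return 0.0
    conf = sup_xy / sup_x
    lev = sup_xy - sup_x * sup_y
    return sup_xy * conf * lev


# X and Y occur in half of the transactions each.
independent = fitness(0.5, 0.5, 0.25)  # co-occurrence equals the product of supports
dependent = fitness(0.5, 0.5, 0.5)     # X and Y always appear together
negative = fitness(0.5, 0.5, 0.1)      # X and Y rarely appear together
```

An independent rule obtains a fitness of zero however frequent it is, a positively dependent rule obtains a positive value, and a negatively dependent rule is penalized with a negative value.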

Finally, it is important to highlight that the evaluation procedure is carried out in a sequential way, where the whole dataset has to be read from one processor, evaluating each rule of the main population in each instance.

3.1.3 Algorithm

The G3P-LSC algorithm proposed as baseline for mining association rules in Big Data environments is depicted in Listing 1. It starts by encoding rules (line 1, Listing 1) by using the context-free grammar defined in Fig. 2 and the extracted metadata. Additionally, the maximum feasible length of this grammar is set to the number of attributes in data, so it is possible to obtain rules comprising the whole set of features.

After selecting a set of individuals to work as parents (line 7, Listing 1), the next step is to apply the crossover operator with a certain probability. If the crossover operator is applied, two offspring are generated, which may be independently mutated with a certain probability. On the contrary, if the crossover operator is not applied, then the two parents may be separately mutated with a certain probability (see lines 7 to 16, Listing 1). A major feature of G3P-LSC is elitism, by which the best solutions are guaranteed a place in the next generation. This is especially important in mining association rules since both crossover and mutation are highly disruptive operators and may cause the loss of really promising solutions. This evolutionary process is repeated for a number of generations specified by the user.

Listing 1: G3P-LSC sequential algorithm
1:	Initialize a random population of N rules as $P_{0}$
2:	auxiliary_population $\leftarrow\emptyset$
3:	for i $=$ 0 to NumberOfGenerations do
4:	  offspring $\leftarrow\emptyset$
5:	  evaluate_rules( $P_{i}$ )
6:	  Maintain elitism using auxiliary_population
7:	  Apply BetterSelector to $P_{i}$
8:	  for each pair in $P_{i}$ do
9:	    if Rand_number(0, 1) $<P_{\textit{cro}}$ then
10:	      pair $\leftarrow$ Apply crossover and repairing operator (pair)
11:	    end if
12:	    for each individual of the pair do
13:	      if Rand_number(0, 1) $<P_{\textit{mut}}$ then
14:	        individual $\leftarrow$ Apply mutation and repairing operator (individual)
15:	      end if
16:	    end for
17:	    offspring $\leftarrow$ offspring $+$ pair
18:	  end for
19:	  $P_{i+1}$ $\leftarrow$ offspring $+$ auxiliary_population
20:	end for

3.1.4 Genetic operators

The crossover genetic operator works by interchanging a random sub-tree between two parents, whereas the mutation operator applies changes to attributes from the antecedent and the consequent. These are highly disruptive operators, giving rise to solutions whose fitness values vary greatly from those of the original solutions (parents). This issue is caused by the ARM problem itself, since simply changing a single attribute in a rule may produce invalid solutions or completely different leverage, support and confidence values. In this regard, G3P-LSC also proposes a repairing operator used to improve the algorithm's performance by modifying invalid rules. The main idea is simple: this operator checks whether the antecedent and consequent of the rule include common items, and this check is applied to each of the resulting solutions. It is important to highlight that, according to the formal definition of association rules provided in Section 2.1, the antecedent and consequent cannot include the same items, i.e. $X\cap Y=\emptyset$ must hold. Thus, a rule is considered invalid if the antecedent and the consequent share common items, and the repairing operator works by removing those repeated items to provide rules satisfying $X\cap Y=\emptyset$. In general, invalid elements are randomly removed from either the antecedent or the consequent, with the only constraint that the resulting solution satisfies $X\neq\emptyset$ and $Y\neq\emptyset$.
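A minimal sketch of such a repairing step is given below. It represents the antecedent and consequent as plain sets of items instead of derivation trees, and it returns `None` when no valid repair exists; both choices, as well as the function name, are simplifying assumptions and not the actual G3P-LSC implementation.

```python
import random


def repair(antecedent, consequent, rng=random):
    """Remove shared items so that X and Y are disjoint and both non-empty."""
    x, y = set(antecedent), set(consequent)
    for item in sorted(x & y):
        # A side may only lose the item if it keeps at least one other item.
        candidates = [side for side in (x, y) if len(side) > 1]
        if not candidates:
            return None  # both sides consist only of this shared item
        rng.choice(candidates).discard(item)
    return x, y


# Items "b" and "c" appear on both sides of this hypothetical rule.
repaired = repair({"a", "b", "c"}, {"b", "c", "d"}, random.Random(0))
```

Each shared item is dropped from exactly one randomly chosen side, so no item disappears from the rule as a whole and the repaired rule satisfies the formal definition of Section 2.1.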

3.2 Scaling G3P-LSC using parallel and distributed computing

Building on the same idea as G3P-LSC, different parallel and distributed computing architectures have been used to speed up the mining of association rules in Big Data environments. In these implementations, the evaluation procedure is the only parallelized phase since it has been shown to be the most time consuming [5]. Each version follows a different kind of implementation, although all the versions share the same idea and therefore obtain the same results.

3.2.1 RMI version

The first parallel version of the G3P-LSC algorithm is based on a master/slave architecture that uses RMI to communicate the master process with each slave. This version, known as G3P-LSC RMI, uses multiple threads and different processes that are distributed among a cluster of machines (see Listing 2).

Listing 2: G3P-LSC RMI: rules are distributed among slaves
function evaluate_rules( $P_{i}$ )
1:	Split $P_{i}$ in subPopulations // As many as slaves
2:	for each subPopulation in subPopulations do
3:	  $\textit{Slave}_{j}$ .evaluate(subPopulation)
4:	end for
end function

Focusing on the master process, only a single master procedure is used since it is the coordination point. The master process is almost the same as the baseline approach shown in Listing 1, the only difference being the function evaluate_rules. In this case, this function splits the main population into as many subpopulations as there are slaves (see Listing 2). The aim of each slave is to evaluate its subpopulation of rules on the whole dataset. Furthermore, each slave is located in a different computer node, enabling parallel and distributed computing, and achieving a better performance.
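The division of work in Listing 2 can be sketched as follows. A local thread pool stands in for the remote slaves, so this is only an illustration of how the population is split and how each worker scans the whole dataset; it does not use RMI, and every name in it is hypothetical. For brevity, each rule is evaluated only by the absolute support of $X\cup Y$.

```python
from concurrent.futures import ThreadPoolExecutor


def split(population, n_slaves):
    """Split the population into n_slaves sub-populations of similar size."""
    return [population[i::n_slaves] for i in range(n_slaves)]


def slave_evaluate(sub_population, dataset):
    """Each slave evaluates only its own rules, but over the WHOLE dataset."""
    return {
        rule: sum((rule[0] | rule[1]) <= transaction for transaction in dataset)
        for rule in sub_population
    }


def evaluate_rules(population, dataset, n_slaves=2):
    """Master side: distribute sub-populations and gather the results."""
    results = {}
    with ThreadPoolExecutor(max_workers=n_slaves) as pool:
        jobs = pool.map(lambda sub: slave_evaluate(sub, dataset),
                        split(population, n_slaves))
        for partial in jobs:
            results.update(partial)
    return results


# Toy data: a rule is a pair (antecedent, consequent) of frozensets.
dataset = [frozenset("ab"), frozenset("abc"), frozenset("c")]
population = [
    (frozenset("a"), frozenset("b")),
    (frozenset("a"), frozenset("c")),
    (frozenset("c"), frozenset("b")),
]
evaluated = evaluate_rules(population, dataset, n_slaves=2)
```

Note that every worker receives the complete `dataset`, which mirrors the need to make the whole dataset available to each slave in this architecture.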

Although RMI provides a high-level programming interface, the load balancing, fault tolerance and the coordination among slaves are very troublesome to manage. In this approach, the dataset has to be replicated in each slave, provoking both high network and I/O activity. Even when this implementation works well for large datasets, the use of truly Big Data hampers the process of replicating data. Some researchers have proposed distributed file systems to solve this issue. However, it is not enough when truly Big Data is considered since each slave process has to read the whole dataset and this operation becomes impossible if the file size is big enough.

3.2.2 MapReduce versions

These versions have been implemented on two different MapReduce architectures (Apache Hadoop and Apache Spark). Both implementations share the same approach, although slight adaptations have been required for each architecture. They require three different processes: (1) the driver, which runs the main code of the G3P-LSC baseline, except that the evaluate_rules function has been implemented using MapReduce, so that in each generation of the evolutionary process a MapReduce phase is required to evaluate the main population; (2) mappers, in which the main population of rules is evaluated; and (3) reducers, which collect the data produced by the mappers and return the evaluated main population for the whole dataset.

Listing 3: G3P-LSC MapReduce
function evaluate_rules(population)
1:	MapReduce to evaluate rules in population
end function
procedure evaluatorMapper(instance)
1:	for each rule in population do
2:	  measures $\leftarrow$ rule.evaluate(instance)
3:	  emit(rule, measures)
4:	end for
end procedure
procedure reducer(rule, measures)
1:	finalMeasure $\leftarrow$ 0
2:	for each measure in measures do
3:	  finalMeasure $\leftarrow$ finalMeasure $+$ measure
4:	end for
5:	emit(rule, finalMeasure)
end procedure

In the mapper phase (see evaluatorMapper procedure, Listing 3) each mapper receives as input a subset of the dataset and the main population. A group of (key, value) pairs is generated by each mapper, where the key is the individual (rule or solution within the population set), and the value is a tuple consisting of the support values of the antecedent, the consequent and the rule. Hence, each mapper produces as many (key, value) pairs as there are individuals in the main population, and the number of (key, value) pairs in each generation is the number of mappers multiplied by the number of individuals in the main population. The reducer phase, on the contrary, receives these (key, value) pairs and calculates the total sums of the supports of the antecedent, consequent and rule for each individual. The reducer therefore produces as many (key, value) pairs as there are individuals in the main population (see reducer procedure, Listing 3). At the end of this phase, the output is the evaluated main population. Finally, the driver continues the evolutionary procedure as in the baseline of G3P-LSC. The procedure followed by the MapReduce versions is illustrated in Fig. 3, where the input of the MapReduce phase is the dataset and the main population (shaded rectangle). The output is the evaluated main population, which is returned to the driver.
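The evaluation just described can be simulated on a single machine as follows. The sketch follows the textual description, with one (key, value) pair per rule and per mapper; it is not Hadoop or Spark code, the shuffle is imitated with a dictionary, and all names and the toy data are assumptions made for the example.

```python
from collections import defaultdict


def evaluator_mapper(partition, population):
    """Emit one (rule, (sup_x, sup_y, sup_xy)) pair per rule for this partition."""
    pairs = []
    for rule in population:
        x, y = rule
        sup_x = sum(x <= t for t in partition)
        sup_y = sum(y <= t for t in partition)
        sup_xy = sum((x | y) <= t for t in partition)
        pairs.append((rule, (sup_x, sup_y, sup_xy)))
    return pairs


def reducer(rule, measures):
    """Sum the partial counts of one rule over all partitions."""
    return rule, tuple(sum(column) for column in zip(*measures))


def evaluate_rules(partitions, population):
    """Driver side: run the mappers, group by rule, then run the reducers."""
    grouped = defaultdict(list)  # stands in for the shuffle phase
    for partition in partitions:
        for rule, measures in evaluator_mapper(partition, population):
            grouped[rule].append(measures)
    return dict(reducer(rule, measures) for rule, measures in grouped.items())


# Toy dataset split into two partitions, i.e. two simulated mappers.
partitions = [
    [frozenset("ab"), frozenset("a")],
    [frozenset("abc"), frozenset("c")],
]
rule = (frozenset("a"), frozenset("b"))
evaluated = evaluate_rules(partitions, [rule])
```

Because the partial counts are simply added, the totals are identical to those obtained by scanning the whole dataset sequentially, which is consistent with all versions returning the same solutions.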

Figure 3.

Flowchart for each generation in G3P-LSC versions based on MapReduce. It receives the main population and the dataset as inputs, and it produces as output the evaluated main population. The main population is returned to the driver.

Although the two implementations (Hadoop and Spark) share the same approach, some slight differences exist, which stem from the structure of each platform. The Hadoop version uses disk as the means of communication. Thus, communication between driver and mappers, mappers and reducers, and reducers and driver is performed using disk. Furthermore, Apache Hadoop does not provide any way of keeping the dataset in main memory, so each generation has to read the dataset from disk. It should be noted that the process followed by this algorithm is exactly the same in each generation, so the dataset is loaded in each generation and the outputs are written to disk. In cases where the dataset is big enough, the reading process can be very time consuming, so reading it once, keeping it in main memory and using it multiple times can be more efficient. On the other hand, Apache Spark allows processes to communicate using main memory, enabling faster communication. The main difference between the Hadoop and Spark versions is that, in the first generation of the Spark version, the whole dataset is loaded into main memory using an RDD. It is split across the cluster, and each mapper accesses a different partition (sub-dataset). Unlike in the Hadoop version, the dataset is loaded into memory only once, so it does not need to be loaded in each generation and the global performance can be substantially improved. Moreover, the communication among mappers and reducers is performed using memory rather than disk, which is much faster. Both versions return exactly the same set of rules as the previous versions. In order to clarify these points, all the source code is available at http://www.uco.es/grupos/kdis/wiki/G3PLSC.

4. Experiments

The aim of this section is to study the performance of the G3P-LSC algorithm and its versions on different parallel architectures when different data dimensionalities are considered. The goal of this study is three-fold:

1. To prove the quality of the obtained solutions. In this sense, a comparative study has been carried out by using exhaustive search algorithms to prove that our proposal is able to discover global optimum solutions in a reduced amount of time. Additionally, a set of EAs for mining association rules has also been considered in this experimental study, and a comparative analysis of different quality measures has been carried out. A set of statistical tests [8, 9, 10, 11] has been applied to compare the differences.
2. To analyze the scalability of the proposed algorithm when different parallel implementations are considered. Truly big datasets, both synthetic and real-world, are used in this section to prove scalability.
3. To show the interest of using grammars to restrict the search space and to include external subjective knowledge that helps in the process of mining association rules.

All the experiments have been run on an HPC cluster comprising 16 computing nodes, with two Intel E5-2620 microprocessors at 2 GHz and 64 GB DDR memory. The cluster operating system was Linux CentOS 6.3. As for the specific details of the software used, the experiments have been run on Hadoop 2.6.0 and Spark 1.6.0.

4.1 Study of the grammar

As previously described, the grammar considered in this approach is defined in Fig. 2. Considering this grammar $G$ , the following language is obtained: $L(G)=\{\textit{condition}\ (\textit{condition})^{*}\ \textit{condition}\ (\textit{condition})^{*}\}$ , where the first part of the language, i.e. $\{\textit{condition}$ $(\textit{condition})^{*}\}$ , corresponds to the antecedent of the rule, whereas the second part represents the consequent of the rule.

According to the grammar $G$ (see Fig. 2), the shortest rule includes a single item in the antecedent and a single item in the consequent. Hence, the minimum tree has a depth of 4 and a total of 7 internal nodes (one of them being the starting symbol Rule), and it requires 7 derivations to be created. Additionally, the maximum derivation size is fixed to the number of attributes in data, so the maximum value depends on the dataset to be used. For a dataset comprising 3 features, the number of internal nodes in the tree is 9, requiring 10 derivations; for a dataset comprising 4 features, there are 11 internal nodes, requiring 13 derivations; and so on. As a result, the maximum number of internal nodes is equal to $3+2n$ , $n$ being the number of features in data and satisfying $n>1$ ; whereas the number of derivations required to create the final tree is $3n+1$ . As for the depth of the tree, it should be highlighted that it remains the same for any number of features. This implies that the size of the resulting trees increases in width, not in depth.
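
As a quick check of the two expressions above, the following minimal sketch evaluates them for the values quoted in the text. The function names are hypothetical and only the formulas $3+2n$ and $3n+1$ come from this section.

```python
def max_internal_nodes(n_features: int) -> int:
    """Maximum number of internal nodes of a derivation tree: 3 + 2n (n > 1)."""
    if n_features < 2:
        raise ValueError("the expression is stated for n > 1")
    return 3 + 2 * n_features


def max_derivations(n_features: int) -> int:
    """Number of derivations needed to build the largest tree: 3n + 1."""
    if n_features < 2:
        raise ValueError("the expression is stated for n > 1")
    return 3 * n_features + 1


for n in (2, 3, 4):
    print(n, max_internal_nodes(n), max_derivations(n))
```

For $n=2$ both expressions give 7, which coincides with the minimum tree (one item in the antecedent and one in the consequent); for $n=3$ and $n=4$ they give (9, 10) and (11, 13), as stated above.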

4.2 Computational complexity

An analysis of the computational complexity is essential to determine the efficiency of the proposed approach. In this sense, we analyze each of its main procedures: encoding criterion, evaluator procedure and genetic operators. According to all these procedures, it is possible to determine the computational complexity of the whole algorithm.

As for the computational complexity of the encoding criterion, it depends on the number of derivations required to form the tree, which in turn depends on the number of attributes in data as described in Section 4.1, i.e. $3n+1$ , $n$ being the number of features in data. Hence, the final complexity of the encoding criterion is determined by the number of times (individuals) the derivation process is carried out, resulting in $O(3n\times m+m)$ , $n$ denoting the number of features in data, and $m$ standing for the number of individuals to be created. Concerning the evaluator procedure, its complexity depends on the number $m$ of individuals, $t$ instances in data and $n$ attributes. Mathematically, the final complexity is defined as $O(m\times t\times n)$ . Finally, the computational complexity of the three genetic operators depends on both the derivation tree size (number of derivations, i.e. $3n+1$ , for a dataset with $n$ attributes or features) and the number of individuals $m$ . In consequence, the final complexity order is defined as $O(3n\times m+m)$ .

Figure 4.

Analysis of the convergence of the three proposed genetic operators.

Analyzing the computing requirements of each procedure, note that the number $m$ of individuals is fixed beforehand, so it is considered a constant with complexity $O(1)$ . Additionally, all the procedures are repeated as many times as the predefined number of generations, which is also a constant. Therefore, bearing all these issues in mind, the resultant computational complexity of the complete algorithm is defined as $O(n\times t)$ . Thus, the complexity of the proposed approach is linear with regard to the number of instances and the number of attributes. It is important to highlight that all the evolutionary approaches used in the experimental stage present the same computational complexity, being linear with regard to the number of instances and attributes in data. On the contrary, exhaustive search approaches require the whole search space to be generated, and there are $2^{n}-1$ different patterns in data when $n$ attributes are considered. Thus, their computational complexity is exponential with regard to the number of attributes in data, and it becomes prohibitively expensive for really large datasets.
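
The reasoning above can be written out step by step. In the sketch below, $g$ denotes the predefined number of generations; this symbol is introduced here only for illustration and does not appear in the original analysis.

```latex
\begin{align*}
T_{\text{generation}}(n,t,m)
  &= \underbrace{O(3nm+m)}_{\text{encoding}}
   + \underbrace{O(m\,t\,n)}_{\text{evaluation}}
   + \underbrace{O(3nm+m)}_{\text{genetic operators}} \\
  &= O\bigl(m\,n\,(t+6)+2m\bigr) = O(m\,n\,t), \\
T_{\text{total}}(n,t)
  &= g\cdot O(m\,n\,t) = O(n\times t)
   \quad\text{for constant } m \text{ and } g, \\
\lvert\text{patterns explored by exhaustive search}\rvert
  &= 2^{n}-1 .
\end{align*}
```

The evaluation term dominates because it is the only one that grows with the number of instances $t$.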

4.3 Analysis of the genetic operators

In this section, the usefulness of the proposed genetic operators is analyzed. The aim is to demonstrate that the fitness function improves when the three proposed genetic operators are considered at the same time. To this end, we analyze how the average fitness function improves along the generations when considering the same algorithm with, and without, the proposed genetic operators.

Table 1
Datasets considered for the experimental study

Dataset	Attributes	(R/I/N)	File size (MB)	Instances
Bolts	8	(2/6/0)	0.20	40
Breast Cancer Wisc ${}^{1}$	10	(0/10/0)	0.10	699
Stock price	10	(10/0/0)	0.06	950
Tic-Tac-Toe ${}^{1}$	9	(0/0/9)	0.02	958
Statlog ${}^{1}$	20	(0/7/13)	1.50	1,000
Flare ${}^{2}$	11	(0/0/11)	0.02	1,066
Car ${}^{2}$	6	(0/0/6)	0.04	1,728
Chess ${}^{1}$	36	(0/0/36)	0.40	3,196
Texture ${}^{2}$	40	(40/0/0)	1.50	5,500
Optdigits ${}^{2}$	64	(0/64/0)	0.80	5,620
Satimage ${}^{2}$	36	(0/36/0)	0.70	6,435
Marketing ${}^{2}$	13	(0/13/0)	0.10	6,876
Thyroid ${}^{2}$	21	(6/15/0)	0.40	7,200
Ring ${}^{2}$	20	(20/0/0)	0.70	7,400
Twonorm ${}^{2}$	20	(20/0/0)	1.20	7,400
Mushroom ${}^{1}$	22	(0/0/22)	0.20	8,124
Coil2000 ${}^{1}$	85	(0/85/0)	1.80	9,822
PenBased ${}^{2}$	16	(0/16/0)	0.60	10,992
Nursery ${}^{2}$	8	(0/0/8)	1.20	12,690
Magic ${}^{1}$	10	(10/0/0)	1.50	19,020
Letter ${}^{2}$	16	(0/16/0)	0.70	20,000
UJIIndoorLoc ${}^{1}$	529	(2/527/0)	45.00	21,048
House16H ${}^{2}$	17	(10/7/0)	3.80	22,784
Grammatical ${}^{1}$	100	(100/0/0)	56.00	27,965
ChessKrKp ${}^{1}$	6	(0/0/6)	0.70	28,056
Adult ${}^{1}$	14	(6/0/8)	3.00	48,842
Statlog (shuttle) ${}^{1}$	10	(0/10/0)	1.60	58,000
Connect4 ${}^{1}$	42	(0/0/42)	11.00	67,557
ColorTexture ${}^{2}$	17	(16/1/0)	11.00	68,040
ColorHistogram ${}^{2}$	33	(32/1/0)	20.00	68,040
Fars ${}^{2}$	29	(5/0/24)	11.00	100,968
Census ${}^{2}$	41	(1/12/28)	60.00	299,284
Epsilon	2000	(2000/0/0)	11,000.00	500,000
Covtype ${}^{1}$	54	(0/10/44)	72.00	581,012
Transactions90k ${}^{2}$	3	(0/3/0)	11.00	855,367
Poker ${}^{2}$	10	(0/10/0)	25.00	1,025,010
US Census Data 1990 ${}^{1}$	68	(0/0/68)	328.00	2,458,285
SUSY ${}^{1}$	18	(18/0/0)	2,000.00	5,000,000
HEPMASS ${}^{1}$	28	(28/0/0)	7,000.00	10,500,000
HIGGS ${}^{1}$	28	(28/0/0)	7,300.00	11,000,000
Protein structure ${}^{3}$	631	(77/462/92)	62,000.00	34,890,838

${}^{1}$ UCI repository: https://archive.ics.uci.edu/ml/datasets.html. ${}^{2}$ KEEL repository: http://keel.es. ${}^{3}$ LIBSVM repository: http://www.csie.ntu.edu.tw/~cjlin/libsvmtools/.

According to the results illustrated in Fig. 4, the worst results are obtained when the algorithm includes only a single genetic operator (either crossover or mutation). In fact, the behavior of the algorithm is almost the same for these two genetic operators, and only a slight difference is observed close to generation 1,000 (the crossover operator converges faster than the mutation operator). However, when the repairing operator comes into play, the results obtained by each of the previously analyzed genetic operators (mutation and crossover) improve, and both operators achieve a higher convergence. Finally, mutation and crossover are considered at the same time, with and without the repairing operator. Combining the two main genetic operators (mutation and crossover) produces a huge improvement in the convergence of the algorithm. In fact, the resulting average fitness values are one order of magnitude better than those obtained when the genetic operators were considered in isolation. These results are even better when the three genetic operators (crossover, mutation and repairing) are considered at the same time.

To sum up, both the crossover and mutation genetic operators play an important role in the convergence of the algorithm and they should be considered at the same time. Including the repairing operator in the proposed algorithm slightly improves the obtained results, and this improvement is higher when the number of generations increases.

4.4 Datasets included in the experimental study

This experimental section considers a large number of datasets (see Table 1) comprising both synthetic and real-world datasets. The goal of these studies is to analyze the performance of different algorithms for mining association rules on a high number of datasets and to prove the scalability of the proposed approach. Thus, the used datasets depend on the type of experiment as described below.

In the first analysis, only real-world datasets have been considered since the quality of the solutions is studied. In this sense, a total of 41 real-world datasets have been analyzed, whose main characteristics are shown in Table 1. Here, the label Attributes (R/I/N) stands for the number of Real, Integer, and Nominal features included in each dataset; File size denotes the size of each dataset on disk; whereas Instances indicates the number of transactions within each dataset. It should be pointed out that all these datasets are freely available and the download source is specified for each one. Finally, the second study considers 35 synthetic datasets so that the scalability of the proposal can be analyzed. Synthetic datasets are required since the number of both instances and attributes can be changed to illustrate the performance of the different implementations. In these datasets, the number of instances ranges from 1 $\times$ 10 ${}^{5}$ to 3 $\times$ 10 ${}^{9}$ , the number of continuous and discrete attributes ranges from 8 to 48, with values following a Gaussian distribution throughout the whole set of instances, and the file size varies from 46 MB to 804 GB.

4.5 Sequential algorithms and set up

In these experimental studies, 14 different non-parallel algorithms have been analyzed, comprising both exhaustive search and evolutionary approaches. All these algorithms have been selected according to their efficiency and significance within the association rule mining field. The aim of this first study is therefore to analyze the resulting set of solutions in terms of quality measures. Each of these non-parallel algorithms is briefly described as follows:

1) Apriori [2]: first algorithm for mining association rules, which is based on an exhaustive search methodology. It exploits the search space by means of the downward closure property.
2) Eclat [39]: it employs a depth-first strategy by extending prefixes of candidate itemsets.
3) GENAR [23]: genetic algorithm for mining quantitative association rules without a previous discretization step. Each solution is encoded by using the minimum and maximum intervals of each numerical attribute.
4) GAR [24]: extension of the GENAR [23] algorithm. Each solution is encoded by using all the attributes within data.
5) EARMGA [37]: genetic algorithm for mining association rules in continuous domains. It does not require a minimum predefined support threshold.
6) Alatasetal [3]: genetic algorithm for mining quantitative association rules. This algorithm is able to extract both positive and negative relationships between itemsets.
7) G3PARM [19]: grammar-guided genetic programming algorithm for mining different types of association rules by means of a predefined grammar.
8) MOEA_Ghosh [12]: multi-objective genetic algorithm that extracts useful and interesting association rules. It is based on three measures: comprehensibility, interestingness and accuracy.
9) MOPNAR [21]: multi-objective evolutionary algorithm that mines positive and negative quantitative association rules. It looks for a good trade-off between comprehensibility, lift and performance (product of support and certainty factor).
10) MODENAR [4]: multi-objective differential evolutionary algorithm based on the algorithm proposed in [3]. It weights four quality measures: support, confidence, comprehensibility and amplitude of the attributes.
11) ARMMGA [29]: multi-objective evolutionary algorithm based on EARMGA [37]. It looks for a good trade-off between support and confidence.
12) NSGA-G3P [20]: multi-objective version of the G3PARM [19] algorithm. It is based on the well-known NSGA-II multi-objective algorithm [7].
13) SPEA-G3P [20]: multi-objective version of the G3PARM [19] algorithm. It is based on the well-known SPEA multi-objective algorithm [36].
14) QAR-CIP-NSGA-II [22]: multi-objective evolutionary algorithm that extends the well-known NSGA-II algorithm [7]. It performs an evolutionary learning of the intervals of continuous attributes.

All the configurations are those provided by the original authors. A summary table can be found in the supplementary material.

Table 2
Comparison of efficiency and effectiveness among Apriori, Eclat and G3P-LSC

Dataset	Algorithm	Fitness function	Ratio time
		Average	Maximum
Tic-Tac-Toe	G3P-LSC	0.006	0.021	–
	Apriori	0.006	0.021	28.390
	Eclat			32.615
Flare	G3P-LSC	0.066	0.066	–
	Apriori	0.066	0.066	19.910
	Eclat			22.270
Car	G3P-LSC	0.008	0.033	–
	Apriori	0.008	0.033	1.4
	Eclat			1.29
Chess	G3P-LSC	0.095	0.146	–
	Apriori	0.141	0.146	158.22
	Eclat			142.45
Mushroom	G3P-LSC	0.117	0.121	–
	Apriori	0.118	0.121	75.95
	Eclat			61.47
Nursery	G3P-LSC	0.074	0.017	–
	Apriori	0.074	0.017	18.46
	Eclat			22.41
ChessKrKp	G3P-LSC	0.001	0.004	–
	Apriori	0.001	0.004	19.708
	Eclat			31.12
Connect4	G3P-LSC	0.067	0.074	–
	Apriori	0.074	0.074	398.84
	Eclat			230.27
USCensus1990	G3P-LSC	0.123	0.137	–
	Apriori	0.136	0.137	1,230,045
	Eclat			1,680,098

4.6 Quality evaluation on sequential algorithms

The aim of this study is to analyze the quality of the solutions obtained by G3P-LSC. First, our proposal is compared to exhaustive search algorithms, so only datasets comprising nominal attributes are considered. No minimum quality threshold has been applied, so every solution present in the dataset is obtained. In this regard, the best solution found by Apriori or Eclat represents the best possible solution within the dataset since all the existing rules are discovered by these algorithms. Every solution is ranked by the aforementioned fitness function ( $F$ ( $R$ ) $=$ support( $R$ ) $\times$ confidence( $R$ ) $\times$ leverage( $R$ )) and the top twenty rules are analyzed. Similarly, the G3P-LSC algorithm returns the best twenty rules discovered along its evolutionary process, so the aim is to check whether these solutions are good enough.
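
The ranking criterion can be illustrated with a short sketch. It assumes the usual definitions confidence( $R$ ) $=$ support( $R$ ) / support(antecedent) and leverage( $R$ ) $=$ support( $R$ ) $-$ support(antecedent) $\times$ support(consequent); the function names and the count-based interface are hypothetical.

```python
from typing import Dict, List, Tuple


def fitness(n_antecedent: int, n_consequent: int, n_rule: int, n_total: int) -> float:
    """F(R) = support(R) * confidence(R) * leverage(R), computed from match counts."""
    supp_a = n_antecedent / n_total
    supp_c = n_consequent / n_total
    support = n_rule / n_total
    confidence = support / supp_a if supp_a > 0 else 0.0
    leverage = support - supp_a * supp_c
    return support * confidence * leverage


def top_rules(scores: Dict[str, float], k: int = 20) -> List[Tuple[str, float]]:
    """Return the k rules with the highest fitness (k = 20 in this study)."""
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)[:k]


# Toy example: antecedent matches 50 of 100 records, consequent 40, both 30.
# support = 0.3, confidence = 0.6, leverage = 0.3 - 0.5 * 0.4 = 0.1
print(fitness(50, 40, 30, 100))
```

Because leverage is negative for negatively correlated items, such rules receive a negative fitness and fall to the bottom of the ranking.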

Table 2 shows the results, where the average fitness function represents the average obtained by the top 20 rules, whereas the maximum fitness function stands for the best value $F$ ( $R$ ) of any association rule $R$ within the set of discovered rules. Finally, Ratio time indicates how many times faster G3P-LSC is than Apriori and Eclat, and it is calculated as the runtime of Apriori (or Eclat) divided by the runtime required by G3P-LSC. The results in Table 2 show that G3P-LSC discovers the best rule (maximum fitness function value) in all the datasets, so the proposed algorithm is able to converge to the global optimum. Additionally, if the set of top rules is analyzed, G3P-LSC obtains really promising results since the average value of its resulting set of rules is close to the global optimum (recall that the set of top rules for Apriori and Eclat includes the best rules within each dataset).

Table 3
Comparison of the individual measures that form the fitness function for Apriori, Eclat and G3P-LSC

Dataset	Algorithm	Fitness function
		Support	Confidence	Leverage
Tic-Tac-Toe	G3P-LSC	0.235	0.600	0.047
	Apriori	0.235	0.600	0.047
	Eclat
Flare	G3P-LSC	0.310	1.000	0.214
	Apriori	0.309	1.000	0.213
	Eclat
Car	G3P-LSC	0.173	0.884	0.048
	Apriori	0.173	0.885	0.048
	Eclat
Chess	G3P-LSC	0.589	0.978	0.165
	Apriori	0.611	1.000	0.107
	Eclat
Mushroom	G3P-LSC	0.559	0.979	0.213
	Apriori	0.556	1.000	0.213
	Eclat
Nursery	G3P-LSC	0.167	0.849	0.108
	Apriori	0.018	0.900	0.107
	Eclat
ChessKrKp	G3P-LSC	0.087	0.512	0.032
	Apriori	0.082	0.546	0.032
	Eclat
Connect4	G3P-LSC	0.646	0.983	0.108
	Apriori	0.641	1.000	0.116
	Eclat
USCensus1990	G3P-LSC	0.654	0.982	0.195
	Apriori	0.734	0.997	0.187
	Eclat

Table 4

Average ranking obtained after running each algorithm 10 times on the whole set of real-world datasets. Different well-known quality measures have been calculated. Values in bold typeface represent the best ranking for each quality measure

Measure	(1)	(2)	(3)	(4)	(5)	(6)	(7)	(8)	(9)	(10)	(11)	(12)	(13)
CF	4.167	9.083	5.283	11.267	9.750	5.750	9.117	3.400	6.633	8.500	7.817	7.400	2.833
Confidence	8.100	8.750	5.950	1.883	9.933	2.900	10.967	9.167	10.367	6.883	2.883	4.850	8.366
Cosine	4.467	5.733	9.733	8.900	11.383	3.600	7.867	12.167	9.233	4.750	1.833	4.300	7.033
Gain	2.933	6.750	6.717	9.850	8.783	9.400	5.383	8.267	3.367	8.417	10.517	9.117	1.500
Laplace	1.267	6.663	9.183	7.433	11.300	3.833	8.300	12.300	9.033	5.483	2.550	4.200	9.433
LeastContradiction	4.933	5.367	8.717	7.967	9.867	3.350	11.167	10.800	11.133	5.150	1.717	3.900	6.933
Leverage	1.067	5.383	8.100	9.067	10.117	8.617	6.800	8.467	5.417	6.800	9.783	8.483	2.900
Lift	3.950	7.133	6.983	8.300	9.867	8.650	3.767	12.167	2.833	7.633	9.600	8.383	1.733
NetConf	2.033	6.200	7.950	10.617	9.617	9.800	7.617	2.633	5.900	7.533	10.150	8.817	2.133
Pearson	1.700	6.333	8.233	11.200	10.500	7.017	6.233	10.017	5.700	6.900	7.000	8.200	1.966
Support	6.250	5.617	9.983	8.000	11.367	3.117	7.933	9.067	9.600	4.817	1.617	3.267	10.366
YulesQ	2.567	6.617	7.533	10.683	9.983	8.233	8.483	2.217	6.167	7.117	9.600	8.433	3.366
Zhang	4.283	8.833	5.250	10.683	9.500	6.050	8.067	10.667	5.633	8.483	7.133	4.283	2.133

(1) G3P-LSC	(4) EARMGA	(7) MOEA_Ghosh	(10) ARMMGA	(12) SPEA-G3P
(2) GAR	(5) Alatasetal	(8) MOPNAR	(11) NSGA-G3P	(13) QAR-CIP-NSGA-II
(3) GENAR	(6) G3PARM	(9) MODENAR

Finally, the average values of each of the quality measures that form the fitness function (support, confidence and leverage) are shown in Table 3. As shown, the results obtained by G3P-LSC are quite similar to those obtained by the exhaustive search approaches.

To check whether there is a statistically significant difference in the average fitness function values, a Wilcoxon signed rank test has been used. A $p$ -value of 0.1814 has been obtained, so it is possible to assert that, at a significance level of $\alpha=$ 0.01, there is no significant difference between the exhaustive search algorithms and G3P-LSC on the average set of solutions. It is interesting to note that the global optimum was attained in all datasets (see Table 2), which indicates that the algorithm converges to the global optimum on the datasets analyzed. Finally, regarding runtime, our proposal performs better than Apriori and Eclat, and the difference is even larger when large datasets are considered (see USCensus1990).

Once it has been shown that G3P-LSC converges well to the global optimum, a comparative study among different EAs has been performed by considering continuous and discrete attributes. In this experimental study, all the datasets described in Table 1 have been used, and each EA has been run 10 times on each dataset. Note that EAs are non-deterministic algorithms, so the results shown are the averages over these executions. Table 4 shows the average ranking obtained for different quality measures after running all the algorithms on all the datasets. It should be noted that these algorithms were run in their original versions, so each one optimizes its own fitness function. Due to space limitations, only the ranking table is shown in this paper, although the whole set of results can be checked in the supplementary material. Different quality measures have been selected to quantify the quality of the rules. In Table 4, each column represents a different EA, whereas each row represents a different quality measure.
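
An average ranking of this kind can be computed as sketched below. The sketch assumes that, for a given quality measure, a higher value is better and that tied algorithms share the mean of the tied positions; both choices, as well as the function names, are assumptions for illustration rather than details taken from the study.

```python
from typing import List, Sequence


def rank_row(values: Sequence[float], higher_is_better: bool = True) -> List[float]:
    """Rank the algorithms on one dataset (1 = best); ties get the mean rank."""
    order = sorted(range(len(values)), key=lambda i: values[i],
                   reverse=higher_is_better)
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of algorithms tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for pos in range(i, j + 1):
            ranks[order[pos]] = mean_rank
        i = j + 1
    return ranks


def average_ranks(results: Sequence[Sequence[float]]) -> List[float]:
    """results[d][a] is the measure value of algorithm a on dataset d."""
    n_datasets = len(results)
    sums = [0.0] * len(results[0])
    for row in results:
        for a, r in enumerate(rank_row(row)):
            sums[a] += r
    return [s / n_datasets for s in sums]


# Toy example with 3 algorithms and 2 datasets (made-up values).
print(average_ranks([[0.9, 0.5, 0.1], [0.2, 0.8, 0.1]]))
```

Each cell of a ranking table such as Table 4 would correspond to one entry of the list returned by `average_ranks` for one quality measure.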

Table 5

$F_{F}$ Friedman’s values to test whether the null hypothesis $H_{0}$ that all the algorithms behave equally for each quality measure can be rejected or not, considering the critical interval $F_{0.01,29,348}=$ 2.23

Measure	$F_{F}$	$H_{0}$
CF	23.08	Rejected
Confidence	47.06	Rejected
Cosine	58.03	Rejected
Gain	35.57	Rejected
Laplace	85.59	Rejected
LeastContradiction	59.26	Rejected
Leverage	26.62	Rejected
Lift	49.87	Rejected
NetConf	46.21	Rejected
Pearson	33.91	Rejected
Support	53.80	Rejected
YulesQ	31.81	Rejected
Zhang	27.18	Rejected

As illustrated in Table 4, G3P-LSC obtains the best results for the Leverage, NetConf, Pearson and Laplace quality measures. For the Confidence quality measure, the ranking values indicate that G3P-LSC does not rank well. However, an analysis of the confidence values for each run shows that all of them are above 0.9 and, in some specific datasets, greater than 0.95. Thus, it is possible to assert that the results of G3P-LSC for this quality measure are good enough. In the same way, the ranking values for support seem to be worse than those of other algorithms. It is worth noting that extremely high support values, as obtained by some algorithms, imply misleading rules according to some of the quality measures. In fact, maximum support values, i.e., 1.00, imply misleading rules since they do not provide unknown information about the dataset. As the rankings demonstrate (see Table 4), the rules obtained by G3P-LSC present values for the interestingness measures that are better than or similar to those obtained by the other algorithms analyzed.

Table 6

Statistical differences among distinct measures and algorithms according to the Bonferroni-Dunn test. G3P-LSC achieves the highest number of significant differences and almost never loses. QAR-CIP-NSGA-II is the algorithm most similar to our proposal

G3P-LSC vs	#Wins	#Losses	#Draws
GAR	8	0	5
GENAR	9	0	4
EARMGA	10	1	2
Alatasetal	12	0	1
G3PARM	7	1	5
MOEA_Ghosh	9	0	4
MOPNAR	8	0	5
MODENAR	7	0	6
ARMMGA	9	0	4
NSGA-G3P	7	2	4
SPEA-G3P	3	1	9
QAR-CIP-NSGA-II	2	0	11

In order to check whether the differences obtained by the ranking analysis are statistically significant, a Friedman test [10, 11] is carried out for each quality measure. In this regard, we consider the null hypothesis $H_{0}$ that all algorithms perform equally. Table 5 shows whether the null hypothesis is rejected, considering a critical interval $F_{0.01,29,348}=2.23$ . According to the results of the Friedman statistical test, the null hypothesis is rejected for each of the analyzed quality measures since $F_{F}$ obtains the following values: 23.08 for CF; 47.06 for Confidence; 58.03 for Cosine; 35.57 for Gain; 85.59 for Laplace; 59.26 for LeastContradiction; 26.62 for Leverage; 49.87 for Lift; 46.21 for NetConf; 33.91 for Pearson; 53.80 for Support; 31.81 for YulesQ; and 27.18 for Zhang. Thus, none of the $F_{F}$ values belongs to the critical interval $F_{0.01,29,348}=2.23$ , so it is not possible to statistically assert that all algorithms behave equally. In this regard, a Bonferroni-Dunn test [8] has been considered to determine the statistical differences among the algorithms under study. According to the Bonferroni-Dunn test, the obtained critical difference is 3.36 for $\alpha=$ 0.01. Table 6 shows the number of quality measures in which G3P-LSC is better than each of the other algorithms (#Wins), worse (#Losses) or not significantly different (#Draws). Alatasetal is the algorithm that differs most from G3P-LSC, since our proposal obtains 12 wins out of a total of 13. The highest number of losses is obtained against NSGA-G3P. Although G3P-LSC loses twice against NSGA-G3P, it also obtains 7 wins against it, so G3P-LSC still behaves better overall. The most similar algorithm is QAR-CIP-NSGA-II, with 11 draws. However, G3P-LSC obtains 2 wins and no losses against it.

As a result, G3P-LSC behaves statistically better than the other algorithms for most quality measures, and only in very few cases does another algorithm behave statistically better than G3P-LSC.
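
The statistics used in this comparison can be sketched as follows, assuming the standard formulation of the Friedman test with the $F$-distributed correction and of the Bonferroni-Dunn critical difference. The function names are hypothetical. The value of 30 for the number of ranked cases is also an assumption: it is inferred here from the 348 degrees of freedom reported above ( $12\times 29$ for 13 algorithms) and is not stated explicitly in the text.

```python
from math import sqrt
from statistics import NormalDist
from typing import Sequence


def friedman_chi2(avg_ranks: Sequence[float], n_cases: int) -> float:
    """Friedman statistic computed from the average ranks of k algorithms."""
    k = len(avg_ranks)
    return 12 * n_cases / (k * (k + 1)) * (
        sum(r * r for r in avg_ranks) - k * (k + 1) ** 2 / 4)


def friedman_ff(avg_ranks: Sequence[float], n_cases: int) -> float:
    """F-distributed variant with (k-1) and (k-1)(N-1) degrees of freedom."""
    k = len(avg_ranks)
    chi2 = friedman_chi2(avg_ranks, n_cases)
    return (n_cases - 1) * chi2 / (n_cases * (k - 1) - chi2)


def bonferroni_dunn_cd(k: int, n_cases: int, alpha: float) -> float:
    """Critical difference for comparing k-1 algorithms against a control."""
    q_alpha = NormalDist().inv_cdf(1 - alpha / (2 * (k - 1)))
    return q_alpha * sqrt(k * (k + 1) / (6 * n_cases))


def outcome(rank_control: float, rank_other: float, cd: float) -> str:
    """Win/loss/draw of the control algorithm (lower rank is better)."""
    diff = rank_other - rank_control
    if diff > cd:
        return "win"
    if diff < -cd:
        return "loss"
    return "draw"


cd = bonferroni_dunn_cd(k=13, n_cases=30, alpha=0.01)  # n_cases=30 is assumed
print(round(cd, 2))
print(outcome(1.067, 10.117, cd))  # Leverage: G3P-LSC vs Alatasetal (Table 4)
```

Under these assumptions the critical difference comes out close to the 3.36 reported above, and a rank gap such as the one between G3P-LSC and Alatasetal for Leverage in Table 4 counts as a win.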

Figure 5.

Context-free grammar modified to obtain negative items.

Figure 6.

Context-free grammar modified to obtain rules having a single item in both the antecedent and consequent.

4.7 Different grammars for G3P-LSC

The aim of this section is to demonstrate how the grammar can be modified to achieve different results (see Table 7). All these quantitative association rules have been obtained on the Stock dataset by running the proposed G3P-LSC algorithm with a different grammar each time. Taking the original grammar (see Fig. 2), the following rule is obtained: IF Company1 IN [30.46, 58.83] THEN Company10 IN [42.67, 61.01]. As shown, this rule describes continuous patterns by means of enclosed values (lower and upper bounds). One of the main advantages of using grammars is the ability to introduce syntax constraints and to apply external knowledge to the mining process. In this regard, for some specific domains, it is possible to require specific rules, determining the position of an attribute or even the range in which it is defined.

Table 7
Rules obtained by G3P-LSC when different grammars are considered on the stock dataset

Type of grammar	Rule	Support	Confidence	Leverage
Original	IF Company1 IN [17.38, 40.73] THEN Company5 IN [59.41, 93.54]	0.51	0.98	0.23
Positive and negative rules	IF Company1 IN [42.023, 60.05] AND Company3 IN [34.83, 56.31] THEN Company4 NOT IN [28.29, 57.96]	0.54	0.99	0.23
Only one item in the antecedent and consequent	IF Company4 IN [58.67, 93.73] THEN Company1 IN [42.51, 60.16]	0.54	0.98	0.22

As an example, Table 7 shows some rules obtained for the Stock dataset when different restrictions are defined. For instance, it is possible to include not only positive but also negative associations, so a simple change in the grammar (see Fig. 5) makes it possible to obtain rules such as the one depicted in the second row. The consequent of this rule denotes that Company4 does not take a value in the range [28.29, 57.96].

Finally, the third example is obtained from a grammar (see Fig. 6) that only produces rules having a single item in both the antecedent and the consequent. Considering this grammar, an example of a rule obtained from the Stock dataset is IF Company4 IN [58.67, 93.73] THEN Company1 IN [42.51, 60.16].
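
To make the three kinds of rules concrete, the sketch below models interval conditions with an optional negation and evaluates a rule on a handful of records. The classes, the helper names and the toy records are hypothetical and are not the data structures of G3P-LSC; only the rule itself is taken from the second row of Table 7.

```python
from dataclasses import dataclass
from typing import Dict, Sequence, Tuple

Record = Dict[str, float]


@dataclass(frozen=True)
class Condition:
    attribute: str
    low: float
    high: float
    negated: bool = False  # True encodes "NOT IN [low, high]"

    def holds(self, record: Record) -> bool:
        inside = self.low <= record[self.attribute] <= self.high
        return inside != self.negated


@dataclass(frozen=True)
class Rule:
    antecedent: Tuple[Condition, ...]
    consequent: Tuple[Condition, ...]

    def antecedent_holds(self, record: Record) -> bool:
        return all(c.holds(record) for c in self.antecedent)

    def consequent_holds(self, record: Record) -> bool:
        return all(c.holds(record) for c in self.consequent)

    def is_single_item(self) -> bool:
        """Syntax constraint in the spirit of the grammar of Fig. 6."""
        return len(self.antecedent) == 1 and len(self.consequent) == 1


def measures(rule: Rule, data: Sequence[Record]) -> Tuple[float, float, float]:
    """Return (support, confidence, leverage) of a rule on a dataset."""
    n = len(data)
    n_a = sum(rule.antecedent_holds(r) for r in data)
    n_c = sum(rule.consequent_holds(r) for r in data)
    n_ac = sum(rule.antecedent_holds(r) and rule.consequent_holds(r) for r in data)
    support = n_ac / n
    confidence = n_ac / n_a if n_a else 0.0
    return support, confidence, support - (n_a / n) * (n_c / n)


# Rule from the second row of Table 7, evaluated on made-up records.
rule = Rule(
    antecedent=(Condition("Company1", 42.023, 60.05),
                Condition("Company3", 34.83, 56.31)),
    consequent=(Condition("Company4", 28.29, 57.96, negated=True),),
)
toy = [
    {"Company1": 50.0, "Company3": 40.0, "Company4": 70.0},
    {"Company1": 50.0, "Company3": 40.0, "Company4": 30.0},
    {"Company1": 10.0, "Company3": 40.0, "Company4": 70.0},
    {"Company1": 10.0, "Company3": 40.0, "Company4": 40.0},
]
print(measures(rule, toy))
```

In a grammar-guided approach the constraints are enforced when individuals are generated, so a check such as `is_single_item` would only be needed here because the sketch builds rules by hand.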

4.8 Scalability on parallel implementations

The aim of this study is to analyze the performance of the different proposed implementations when the number of both attributes and instances increases. In the first part of this study, a set of synthetic datasets has been created to analyze how the number of both instances and attributes affects the algorithms’ performance. Note that all these implementations obtain the same results, the parallelization approach being the only difference among them. The results in Fig. 7 illustrate that the baseline implementation of G3P-LSC is the most efficient when a small number of instances is considered. However, as the number of instances keeps growing (up to 6 $\times$ 10 ${}^{5}$ ), its performance begins to degrade and some kind of parallelization is needed to improve the runtime. Although Big Data architectures provide a way to improve runtime using distributed computing, they are totally unsuitable for small datasets, as shown. This is due to the high communication and scheduling costs of the platform, which are not justified when the data are small. Thus, another kind of parallelization is needed. RMI has been considered as a simple way of parallelizing the algorithm, and it achieves excellent results when the number of instances ranges from 6 $\times$ 10 ${}^{5}$ to 9 $\times$ 10 ${}^{5}$ , outperforming the sequential approach in this range. Up to 6 $\times$ 10 ${}^{5}$ examples, Spark’s performance is almost the same as that of RMI, while the performance of Hadoop is far from being the best. This demonstrates that the sequential baseline approach and the RMI version are appropriate for small datasets, unlike the Big Data implementations, which are useful for truly large datasets. Hadoop is the worst since it has to read the whole dataset from disk in each generation and write every communication between mappers and reducers to disk, which hampers the overall performance.

Continuing the experimental study, the number of instances is increased from 9 $\times$ 10 ${}^{5}$ to 1 $\times$ 10 ${}^{9}$ . Figure 8 illustrates that a sequential implementation is meaningless when datasets with millions of instances are considered, and its runtime increases exponentially. This baseline implementation is unfeasible when file sizes in the order of GB are considered, since it takes more than 300 days to mine a dataset with 1 $\times$ 10 ${}^{9}$ instances.

Figure 7.

Runtime of different implementations when they are run on datasets comprising 48 attributes and a number of instances that varies from 1 $\times$ 10 ${}^{5}$ to 9 $\times$ 10 ${}^{5}$ . The file size varies from 28 MB to 248 MB.

Although RMI obtains the same results using only 15% of the sequential runtime, it is not enough when huge datasets are considered. Spark’s implementation obtains the best results since the datasets can be stored in main memory. On the other hand, the implementation based on Hadoop obtains worse results than Spark. Next, the number of instances is increased from 1 $\times$ 10 ${}^{9}$ to 3 $\times$ 10 ${}^{9}$ , with file sizes ranging from 275 GB to 804 GB. Figure 9 only shows Hadoop and Spark since both have proved to obtain the best performance on Big Data. It illustrates that Spark achieves better performance until the dataset no longer fits in main memory ( $\approx$ 790 GB).

Figure 8.

Runtime of the different implementations on datasets comprising 48 attributes and a number of instances that varies from $1 \times 10^{6}$ to $1 \times 10^{9}$. The file size varies from 275 MB to 275 GB.

Figure 9.

Runtime of the different implementations on datasets comprising $1 \times 10^{9}$ to $3 \times 10^{9}$ instances and 48 attributes. The file size varies from 275 GB to 804 GB.

At this point, Spark begins to use both disk and memory as a cache (because of hardware limitations), so it behaves almost the same as Hadoop, having to read almost the whole dataset from disk in each generation. Thus, Spark is appropriate for handling an extremely large number of examples when the data can be stored in main memory. Hadoop, on the other hand, is well suited to huge datasets that cannot be stored in main memory.
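The observations of this scalability study can be condensed into a rough rule of thumb. The following Python sketch is only an illustration: the function name `suggest_implementation` and its parameters are hypothetical, the instance thresholds are the ones observed on the 48-attribute synthetic datasets, and the memory limit must be supplied by the user (about 790 GB in the cluster used here). Other hardware or other datasets may shift all of these boundaries.

```python
def suggest_implementation(n_instances: int,
                           dataset_gb: float,
                           cluster_memory_gb: float) -> str:
    """Suggest an implementation of the algorithm for a given data size.

    Assumption: thresholds are taken from the synthetic experiments
    (48 attributes) and are not general guarantees.
    """
    # Small datasets: communication and scheduling overhead is not justified.
    if n_instances < 600_000:
        return "sequential"
    # Medium datasets: simple parallelization through RMI performed well.
    if n_instances <= 900_000:
        return "RMI"
    # Large datasets that fit in main memory: Spark obtained the best runtime.
    if dataset_gb <= cluster_memory_gb:
        return "Spark"
    # Datasets that do not fit in main memory: Spark behaves like Hadoop,
    # and Hadoop is designed to work from disk.
    return "Hadoop"


if __name__ == "__main__":
    # Hypothetical usage with the approximate memory limit reported above.
    for n, gb in [(100_000, 0.028), (800_000, 0.22), (10**9, 275.0), (3 * 10**9, 804.0)]:
        print(n, gb, suggest_implementation(n, gb, cluster_memory_gb=790.0))
```

The function deliberately ignores the number of attributes and the number of nodes, both of which also influence the runtime, as the following studies show.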

Figure 10.

Runtime of the different implementations on datasets comprising $1 \times 10^{9}$ instances and a number of attributes that varies from 8 to 48. The file size varies from 45 GB to 275 GB.

Figure 11.

Runtime of the parallel approaches when the number of nodes decreases. A dataset comprising $1 \times 10^{9}$ instances with 48 attributes and a file size of 275 GB is used.

Additionally, an analysis was carried out to study how the number of attributes affects performance. In this regard, Fig. 10 illustrates that the runtime of the algorithms grows linearly with the number of attributes. Note that the higher the number of attributes, the larger the search space, so the convergence time increases linearly with the number of attributes.

In a third study, we examine how relevant the cluster capacity is when deciding which implementation should be used. Figure 11 illustrates how performance is affected by the cluster capabilities. RMI is the implementation that improves least when the number of nodes increases, since each node has to evaluate the whole dataset. Apache Hadoop and Apache Spark, however, benefit because a greater number of nodes means a greater level of parallelism, since each node evaluates only a subset of the dataset.
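To make the data-partitioned evaluation concrete, the sketch below mimics a map and a reduce phase in plain Python, without Hadoop or Spark. It is a simplified illustration under stated assumptions: rules are represented as pairs of hypothetical predicate functions, and support and confidence are used only as familiar stand-ins for the quality measures, not as the actual fitness function of G3P-LSC. Each mapper counts rule matches on its own partition and the reducer adds the partial counts, so the totals do not depend on how the dataset is split.

```python
from functools import reduce
from typing import Callable, Dict, List, Sequence, Tuple

Row = Dict[str, int]
Predicate = Callable[[Row], bool]
Rule = Tuple[Predicate, Predicate]   # (antecedent, consequent), hypothetical encoding
Counts = Tuple[int, int, int]        # (antecedent matches, rule matches, rows seen)


def map_partition(rules: Sequence[Rule], partition: Sequence[Row]) -> List[Counts]:
    """Map phase: count, for every rule, its matches in one data partition."""
    partial = []
    for antecedent, consequent in rules:
        n_a = sum(1 for row in partition if antecedent(row))
        n_ac = sum(1 for row in partition if antecedent(row) and consequent(row))
        partial.append((n_a, n_ac, len(partition)))
    return partial


def reduce_counts(left: List[Counts], right: List[Counts]) -> List[Counts]:
    """Reduce phase: add the partial counts of two partitions rule by rule."""
    return [(a1 + a2, ac1 + ac2, n1 + n2)
            for (a1, ac1, n1), (a2, ac2, n2) in zip(left, right)]


def evaluate(rules: Sequence[Rule],
             partitions: Sequence[Sequence[Row]]) -> List[Tuple[float, float]]:
    """Return (support, confidence) per rule from the aggregated counts."""
    totals = reduce(reduce_counts, (map_partition(rules, p) for p in partitions))
    return [(n_ac / n, n_ac / n_a if n_a else 0.0) for n_a, n_ac, n in totals]


if __name__ == "__main__":
    rows = [{"a": 1, "b": 1}, {"a": 1, "b": 0}, {"a": 0, "b": 1}, {"a": 1, "b": 1}]
    rules = [(lambda r: r["a"] == 1, lambda r: r["b"] == 1)]
    # One partition (sequential view) versus two partitions (distributed view).
    print(evaluate(rules, [rows]))
    print(evaluate(rules, [rows[:2], rows[2:]]))
```

Because only integer counts travel from the mappers to the reducer, adding nodes reduces the number of rows each node scans, which is consistent with the behavior observed for Hadoop and Spark in Fig. 11.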

Once it has been shown that the implementations based on Spark and Hadoop obtain attractive runtimes on synthetic datasets, they were run on real-world datasets. No comparison with sequential approaches was considered, since it was previously shown that these algorithms cannot be run efficiently on such large datasets. Table 8 shows the results obtained for a set of real-world Big Data datasets. As can be seen, the behavior is the same as in the previous studies. Spark achieves better performance since the whole dataset can be stored in main memory. It achieves the same results as Hadoop in only about half of the time, and in some cases Spark needs only 13% of the time required by Hadoop. The weakness of Hadoop is also evident: it needs almost the same time to process several GB (Epsilon dataset) as several MB (Poker dataset). This is because much of the time is spent orchestrating the platform and not on the main computations; thus, Hadoop always needs a certain amount of time to coordinate the platform, independently of the size of the data. Spark, however, does not require this large orchestration time.

Table 8

Runtime in hours required by the Spark and Hadoop implementations on large real-world datasets. All these datasets could be stored in main memory

Dataset	Hadoop	Spark
Epsilon	8.37	1.42
Poker	7.65	0.01
US Census 1990	7.50	0.83
Susy	14.31	8.20
Heap mass	33.33	14.25
Higgs	24.46	12.33
Protein structure prediction	37.76	18.44
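The relative cost of both platforms can be read directly from Table 8. As a small worked example, the following Python snippet computes, for each dataset, the percentage of the Hadoop runtime that Spark requires. The runtimes are copied from the table; the variable and function names are illustrative only.

```python
# Runtimes in hours, copied from Table 8 as (Hadoop, Spark).
runtimes_hours = {
    "Epsilon": (8.37, 1.42),
    "Poker": (7.65, 0.01),
    "US Census 1990": (7.50, 0.83),
    "Susy": (14.31, 8.20),
    "Heap mass": (33.33, 14.25),
    "Higgs": (24.46, 12.33),
    "Protein structure prediction": (37.76, 18.44),
}


def spark_share(hadoop_hours: float, spark_hours: float) -> float:
    """Percentage of the Hadoop runtime that Spark needs."""
    return 100.0 * spark_hours / hadoop_hours


shares = {name: spark_share(hadoop, spark)
          for name, (hadoop, spark) in runtimes_hours.items()}

if __name__ == "__main__":
    for name, share in shares.items():
        print(f"{name}: Spark needs {share:.2f}% of the Hadoop runtime")
```

The ratio is far from constant across datasets, which reflects the fixed orchestration time of Hadoop: the smaller the dataset, the larger the share of the Hadoop runtime that is spent on coordination rather than on computation.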

5. Concluding remarks

In this work, a grammar-based genetic programming algorithm for mining association rules in Big Data, known as G3P-LSC, has been proposed. The novelty of this work is a new evolutionary approach that can be run in a distributed way, enabling the use of parallel paradigms such as MapReduce. Our proposal provides a reduced set of rules that are easy to understand and achieve good values for many quality measures. The main aim of G3P-LSC is to optimize a set of well-known quality measures in the association rule mining field.

Currently, due to the increasing interest in data storage, a single universal implementation of an algorithm is unfeasible, and different adaptations are needed depending on the data size. In this sense, the proposed algorithm has been implemented on different architectures, including a sequential approach, RMI and MapReduce. It should be noted that all the implementations return exactly the same results and the only change is the architecture.

When the results are compared with those of 14 other algorithms on more than 75 datasets, the proposed G3P-LSC algorithm mines rules with better values for the interestingness measures and with few attributes, providing the user with high-quality rules. As a grammar is used, the user can even specify which set of attributes must appear in the solutions, constraining the search space to the rules of interest. Finally, the proposed approach presents a good computational cost on all datasets and good scalability when the size of the problem increases.

5.1 Future work

As future work, some parallel search models could be adapted to the extraction of association rules in Big Data. Their application is not trivial; these paradigms have not yet been studied in the association rule mining field, and further research is encouraged. The most well-known parallel models [34] are considered next. Island models or even cellular EAs could be studied to mine association rules in a parallel way. Indeed, hybrid models could improve the convergence of the previous models, where some relaxations of the original methods could further improve the diversity.

Acknowledgments

This work was supported by the Spanish Ministry of Economy and Competitiveness under the project TIN2014-55252-P, and FEDER funds. This work is also supported by the Juan de la Cierva Formacion post-doctoral grant, reference FJCI-2015-23560.

References

1. Aggarwal, Han. Frequent pattern mining. Springer International Publishing 2014.
2. Agrawal, Imielinski, Swami. Mining association rules between sets of items in large databases. SIGMOD Rec 1993; 22(2): 207-216.
3. Alatas, Akin. An efficient genetic algorithm for automated mining of both positive and negative quantitative association rules. Soft Computing 2006; 10(3): 230-237.
4. Alatas, Akin, Karci. MODENAR: Multi-objective differential evolution algorithm for mining numeric association rules. Applied Soft Computing 2008; 8: 646-656.
5. Cano, Luna, Ventura. High performance evaluation of evolutionary-mined association rules on GPUs. The Journal of Supercomputing 2013; 66(3): 1438-1461.
6. Dean, Ghemawat. MapReduce: Simplified data processing on large clusters. Communications of the ACM – 50th Anniversary Issue 1958–2008, 2008; 51(1): 107-113.
7. Deb, Pratap, Agarwal, Meyarivan. A fast and elitist multiobjective genetic algorithm: NSGA-II. IEEE Transactions on Evolutionary Computation 2002; 6(2): 182-197.
8. Demsar. Statistical comparisons of classifiers over multiple data sets. Journal of Machine Learning Research 2005; 7: 1-30.
9. García, Fernández, Luengo, Herrera. Advanced nonparametric tests for multiple comparisons in the design of experiments in computational intelligence and data mining: Experimental analysis of power. Information Sciences, Special Issue on Intelligent Distributed Information Systems 2010; 180(10): 2044-2064.
10. García, Herrera. An extension on statistical comparisons of classifiers over multiple data sets for all pairwise comparisons. Journal of Machine Learning Research 2008; 9: 2677-2694.
11. García, Molina, Lozano, Herrera. A study on the use of non-parametric tests for analyzing the evolutionary algorithms’ behaviour: A case study. Journal of Heuristics 2009; 15(6): 617-644.
12. Ghosh, Nath. Multi-objective rule mining using genetic algorithms. Information Sciences 2004; 163(1–3): 123-133.
13. Han, Kamber. Data mining: Concepts and techniques. Morgan Kaufmann 2011.
14. Kyriklidis, Dounias. Evolutionary computation for resource leveling optimization in project management. Integrated Computer-Aided Engineering 2016; 23(2): 173-184.
15. Lam. Hadoop in action. Manning Publications Co, Greenwich, CT, USA 1st edition, 2010.
16. Novel alarm correlation analysis system based on association rules mining in telecommunication networks. Information Sciences 2010; 180(16): 2960-2978.
17. Luna, Cano, Pechenizkiy, Ventura. Speeding-up association rule mining with inverted index compression. IEEE Transactions on Cybernetics 2016; 46(12): 3059-3072.
18. Luna, Romero, Ventura. An evolutionary algorithm for the discovery of rare class association rules in learning management systems. Applied Intelligence 2015; 42(3): 501-513.
19. Luna, Romero, Ventura. Design and behavior study of a grammar-guided genetic programming algorithm for mining association rules. Knowledge and Information Systems 2012; 32(1): 53-76.
20. Luna, Romero, Ventura. Grammar-based multi-objective algorithms for mining association rules. Data & Knowledge Engineering 2013; 86: 19-37.
21. Martin, Rosete, Alcala-Fdez, Herrera. A new multiobjective evolutionary algorithm for mining a reduced set of interesting positive and negative quantitative association rules. IEEE Transactions on Evolutionary Computation 2014; 18(1): 54-69.
22. Martín, Rosete, Alcalá-Fdez, Herrera. QAR-CIP-NSGA-II: A new multi-objective evolutionary algorithm to mine quantitative association rules. Information Sciences 2014; 258: 1-28.
23. Mata, Alvarez, Riquelme. Mining numeric association rules with genetic algorithms. In Proceedings of the 5th International Conference on Artificial Neural Networks and Genetic Algorithms, ICANNGA 2001, Taipei, Taiwan 2001; 264-267.
24. Mata, Alvarez, Riquelme. Discovering numeric association rules via evolutionary algorithm. In Proceedings of the 6th Pacific-Asia Conference on Advances in Knowledge Discovery and Data Mining, PAKDD 2002, Taipei, Taiwan 2002; 40-51.
25. Mencía, Sierra, Mencía, Varela. Genetic algorithms for the scheduling problem with arbitrary precedence relations and skilled operators. Integrated Computer-Aided Engineering 2016; 23(3): 269-285.
26. Ordoñez, Ezquerra, Santana. Constraining and summarizing association rules in medical data. Knowledge and Information Systems 2006; 9(3): 259-283.
27. Pan, Tian, Zhang. A region division based diversity maintaining approach for many-objective optimization. Integrated Computer-Aided Engineering 2017; 24(3): 1-18.
28. Piatetsky-Shapiro. Discovery, analysis and presentation of strong rules. In Knowledge Discovery in Databases. Piatetsky-Shapiro, Frawley (eds). AAAI Press 1991; 229-248.
29. Qodmanan, Nasiri, Minaei-Bidgoli. Multi objective association rule mining with genetic algorithm without specifying minimum support and minimum confidence. Expert Systems with Applications 2011; 38: 288-298.
30. Ramírez-Gallego, Fernández, García, Chen, Herrera. Big Data: Tutorial and guidelines on information and process fusion for analytics algorithms with MapReduce. Information Fusion 2018; 42: 51-61.
31. Rostami, Neri. Covariance matrix adaptation pareto archived evolution strategy with hypervolume-sorted adaptive grid algorithm. Integrated Computer-Aided Engineering 2016; 23(4): 313-329.
32. Rostami, Neri, Epitropakis. Progressive preference articulation for decision making in multi-objective optimisation problems. Integrated Computer-Aided Engineering 2017; 24(4): 315-335.
33. Sabar, Abawajy, Yearwood. Heterogeneous cooperative co-evolution memetic differential evolution algorithm for big data optimization problems. IEEE Transactions on Evolutionary Computation 2017; 21(2): 315-327.
34. Sudholt. Parallel evolutionary algorithms. In Springer Handbook of Computational Intelligence 2015; 929-959.
35. Tan, Kumar. Interestingness measures for association patterns: A perspective. In Proceedings of the Workshop on Postprocessing in Machine Learning and Data Mining, KDD’00, New York, USA 2000.
36. Ventura, Luna. Pattern mining with evolutionary algorithms. Springer International Publishing 2016.
37. Yan, Zhang. Genetic algorithm-based strategy for identifying association rules without specifying actual minimum support. Expert Systems with Applications 2009; 36: 3066-3076.
38. Zaharia, Chowdhury, Franklin, Shenker, Stoica. Spark: Cluster computing with working sets. In Proceedings of the 2nd USENIX Conference on Hot Topics in Cloud Computing, HotCloud’10, Berkeley, CA, USA 2010.
39. Zaki. Scalable algorithms for association mining. IEEE Transactions on Knowledge and Data Engineering 2000; 12(3): 372-390.
40. Zhang. Association rule mining: Models and algorithms. Springer Berlin/Heidelberg 2002.
