GPU-based swarm intelligence for Association Rule Mining in big databases

Abstract

Association Rule Mining (ARM) is a fundamental data mining task that is time-consuming on big datasets. Thus, developing new scalable algorithms for this problem is desirable. Recently, Bee Swarm Optimization (BSO)-based metaheuristics were shown to be effective in reducing the time required for ARM. However, these approaches were applied only to small or medium-scale databases. To perform ARM on big databases, a promising approach is to design parallel algorithms using the massively parallel threads of a GPU. While some GPU-based ARM algorithms have been developed, they only benefit from GPU parallelism during the evaluation step of solutions obtained by the BSO metaheuristic. This paper improves this approach by parallelizing the other steps of the BSO process (diversification and intensification). Based on these ideas, three novel algorithms are presented: i) DRGPU (Determination of Regions on GPU), ii) SAGPU (Search Area on GPU), and iii) ALLGPU (All steps on GPU). These solutions are analyzed and empirically compared on benchmark datasets. Experimental results show that ALLGPU outperforms the three other approaches in terms of speed up. Moreover, results confirm that ALLGPU outperforms the state-of-the-art GPU-based ARM approaches on big ARM databases such as the Webdocs dataset. Furthermore, ALLGPU is extended to mine big frequent graphs and results demonstrate its superiority over the state-of-the-art D-Mine algorithm for frequent graph mining on the large Pokec social network dataset.

Keywords

ARM, big databases, BSO

1. Introduction and background

1.1 General concepts

Association Rule Mining (ARM) is a fundamental data mining task, which consists of discovering hidden patterns in transaction databases. It is applied in many real world problems such as: Constraint Programming [23, 24], Information Retrieval [25] and Business Intelligence [26, 27]. The ARM problem is defined as follows [10]. Let $I=\{i_{1},i_{2},\ldots,i_{n}\}$ , be a set of $n$ different items (symbols) or attribute values. Moreover, let $T=\{t_{1},t_{2},\ldots,t_{m}\}$ be a set of transactions representing a transactional database, where $t_{i}\subseteq I$ for $i\in\{1,2,\dots m\}$ . An association rule is an implication of the form $X\rightarrow Y$ , where $X\subset I$ , $Y\subset I$ , and $X\cap Y=\emptyset$ . The itemset $X$ of a rule $X\rightarrow Y$ , is called the antecedent, while the itemset $Y$ is called the consequent. Two fundamental measures are commonly used to assess how interesting association rules are, which are the support and confidence [10]. The support of an itemset $I^{\prime}\subseteq I$ is the number of transactions containing $I^{\prime}$ . The support of a rule $X\rightarrow Y$ is the support of the set $X\cup Y$ , and its confidence is defined as: $\frac{\textit{support}(X\cup Y)}{\textit{support}(X)}$ .
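The support and confidence measures defined above can be illustrated with a minimal Python sketch. The function names `support` and `confidence` and the toy database `T` are assumptions introduced for illustration only; they do not come from the original work.

```python
from typing import FrozenSet, Iterable, Sequence

Transaction = FrozenSet[str]


def support(itemset: Iterable[str], transactions: Sequence[Transaction]) -> int:
    """Number of transactions that contain every item of `itemset`."""
    items = frozenset(itemset)
    return sum(1 for t in transactions if items <= t)


def confidence(antecedent: Iterable[str], consequent: Iterable[str],
               transactions: Sequence[Transaction]) -> float:
    """support(X u Y) / support(X); returns 0.0 when X never occurs."""
    sup_x = support(antecedent, transactions)
    if sup_x == 0:
        return 0.0
    union = frozenset(antecedent) | frozenset(consequent)
    return support(union, transactions) / sup_x


# Hypothetical toy database with four transactions over the items a, b, c.
T = [frozenset("abc"), frozenset("ab"), frozenset("ac"), frozenset("bc")]

rule_support = support({"a", "b"}, T)          # support of the rule a -> b
rule_confidence = confidence({"a"}, {"b"}, T)  # support(ab) / support(a)
```

In this toy database, the itemset {a, b} appears in two transactions and {a} in three, so the rule a → b has support 2 and confidence 2/3.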

In real life, transactional databases can be very large. Database size has a major impact on the runtime of traditional ARM algorithms such as Apriori [10] and FP-Growth [5], which were originally evaluated on small or medium-size datasets. These algorithms may run extremely slowly on very large datasets. To address this issue, metaheuristic approaches have been designed, which can find approximate solutions to difficult ARM problems in less time. Swarm Intelligence is a promising category of such metaheuristics, which includes approaches such as Ant Colony Optimization (ACO), Particle Swarm Optimization (PSO) and Bee Swarm Optimization (BSO). These approaches have been adapted for solving the ARM problem in the PSOARM [11], $\textit{ACO}_{R}$ [12], Penguins Search Optimization [36] and BSO-ARM [28] algorithms, to name a few. Swarm Intelligence algorithms, like any population-based metaheuristic, are composed of three steps: diversification, intensification and evaluation. The complexity of each step depends on both the problem to be addressed and the behavior of the metaheuristic. In previous studies, it was shown that BSO-ARM outperforms several state-of-the-art metaheuristic-based ARM algorithms both in terms of runtime and rule quality [28]. BSO-ARM simulates the behavior of bees, where the diversification step is performed by a strategy called determination of the search area. The aim of this strategy is to generate the regions that the bees have to explore. In the intensification step, each bee intensively explores its region by using a neighborhood search. Finally, the evaluation step selects the best solution to be considered for the next iteration. This process is repeated until a predefined maximum number of iterations has been performed. Besides ARM, some metaheuristics have been applied to other problems such as high-utility itemset mining [6] and biological space-seeds [37].

Although metaheuristics considerably improve performance compared to exact algorithms, they still show long execution times when dealing with big databases. Parallel programming is currently the most effective option for speeding up such algorithms. In this context, many HPC-based approaches have recently been developed for ARM, which can be categorized into two classes. The first class consists of algorithms using cluster-based architectures. Different versions of parallel Apriori have been proposed in [21, 17, 4, 19], of FP-Growth in [3], and of algorithms exploiting metaheuristics in [33, 34]. The second class consists of algorithms using emerging cost-effective parallel architectures such as GPUs (Graphics Processing Units) and FPGAs (Field-Programmable Gate Arrays). This paper considers the GPU architecture, which is presented next.

1.2 The GPU architecture

GPUs are graphics cards, which were historically designed for video games and entertainment. Since 2007, GPUs have been used as an efficient computing tool for various applications [1, 9]. The GPU framework is mainly composed of two hosts, as presented in Fig. 1.

Figure 1.

GPU architecture.

The CPU host contains a processor and a main memory. The GPU host is logically composed of several blocks of threads. Each block may be seen as a collection of threads that can access a shared memory. All the blocks can also access some constant and global memories. Physically, the threads of a block are organized into warps of 32, 64 or 128 threads. To design any GPU-based parallel approach, three constraints must be taken into account [15]: i) CPU/GPU communication and data transfer should be minimized, ii) threads should be mapped to data efficiently so as to respect memory constraints and to limit thread divergence, and iii) memory overhead should be managed.

Figure 2.

The BSO-ARM approach.

1.3 Bees Swarm Optimization for Association Rules Mining (BSO-ARM)

The BSO-ARM process is performed as follows. It starts by generating a random initial solution named Sref, from which the search area of each bee is determined by using one of three possible diversification strategies, i) Modulo, ii) Next, and iii) Syntactic. Each bee explores its region during what is called the intensification step, using a neighborhood search method. All the bees communicate via the queen bee that selects the best solution as the reference solution for the next iteration. This process is repeated until a predefined maximum number of iterations has been completed. Figure 2 illustrates the BSO-ARM approach. In BSO-ARM [28], a solution $S$ to a problem is encoded as a vector of $n$ elements, where $n$ is the number of all items whose positions are defined as follows:

(1)
$S[i]=0$ if the item $i$ does not belong to the solution $S$ .
(2)
$S[i]=1$ if the item $i$ belongs to the antecedent in $S$ .
(3)
$S[i]=2$ if the item $i$ belongs to the consequent in $S$ .
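As a small illustration of this encoding, the following Python sketch decodes a solution vector into its antecedent and consequent. The function name `decode` and the use of 0-based item positions are assumptions made for this example.

```python
from typing import List, Sequence, Tuple


def decode(solution: Sequence[int]) -> Tuple[List[int], List[int]]:
    """Split an encoded solution into (antecedent, consequent) item positions.

    Positions are 0-based here (an assumption of this sketch):
    0 = item absent, 1 = item in the antecedent, 2 = item in the consequent.
    """
    antecedent = [i for i, v in enumerate(solution) if v == 1]
    consequent = [i for i, v in enumerate(solution) if v == 2]
    return antecedent, consequent


# The vector 1121101 encodes a rule over seven items.
antecedent, consequent = decode([1, 1, 2, 1, 1, 0, 1])
```

For the vector 1121101, items 0, 1, 3, 4 and 6 form the antecedent, item 2 is the consequent, and item 5 does not appear in the rule.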

In the following, the different steps of BSO-ARM are presented.

1.3.1 Intensification step

Many intensification strategies have been proposed in the literature. Without loss of generality, this work focuses on two of the best strategies, namely SLS (Simple Local Search) and TS (Tabu Search). The former consists of randomly and iteratively changing a bit of a given solution $S$ . This simple operation can create $n$ neighbors, where $n$ is the solution size, and it generates only admissible solutions. At each step, only the best solution is kept, which then serves as the input for the following iteration. TS applies the same process as SLS, except that a tabu list is used to avoid already explored solutions. At each iteration, the best solution found is saved in the tabu list. When a new solution is found to be in the tabu list already, the algorithm moves to another solution.
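The difference between SLS and TS can be sketched in Python as follows. This is a simplified illustration under several assumptions: the function `local_search`, the toy fitness function, and the choice to draw one random replacement value per position are all hypothetical, and admissibility checks are omitted. Passing a set as `tabu` gives a tabu-like behavior, while `tabu=None` gives the simple local search.

```python
import random

VALUES = (0, 1, 2)  # absent, antecedent, consequent


def local_search(start, fitness, iterations, rng, tabu=None):
    """Move repeatedly to the best neighbor obtained by changing one position.

    When `tabu` is a set, neighbors already stored in it are skipped and each
    newly selected solution is recorded (tabu-search-like variant).
    """
    current = list(start)
    for _ in range(iterations):
        candidates = []
        for pos in range(len(current)):
            neighbor = list(current)
            # Random change of one position to a different value.
            neighbor[pos] = rng.choice([v for v in VALUES if v != current[pos]])
            if tabu is not None and tuple(neighbor) in tabu:
                continue  # already explored: move on to another solution
            candidates.append(neighbor)
        if not candidates:
            break
        current = max(candidates, key=fitness)  # keep only the best neighbor
        if tabu is not None:
            tabu.add(tuple(current))
    return current


def toy_fitness(solution):
    """Hypothetical fitness: number of items placed in the antecedent."""
    return solution.count(1)


rng = random.Random(0)
visited = set()
best = local_search([0, 0, 0, 0, 0], toy_fitness, iterations=4, rng=rng, tabu=visited)
```

Because a selected solution is never taken from the tabu set, the four iterations above record four distinct solutions in `visited`.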

1.3.2 Diversification step

This step consists of applying a strategy to determine the search area. Given a reference solution, Sref, and a colony of $K$ bees, the search area selection strategy defines $K$ search spaces, that is one for each bee. The three following strategies have been considered in BSO-ARM [28].

•
Modulo strategy: The $k^{th}$ bee builds its search area by successively changing bits at position $k+i\times\textit{Flip}$ in the reference solution Sref, where $i$ is varied from 0 to $n-1$ , and Flip is a predefined parameter. This strategy can be used iff the number of bees is more than $\frac{n}{\textit{Flip}}$ . If the distance between solutions equals the number of bits, then the distance between the bees and the reference solution equals $\frac{n}{\textit{Flip}}$ .
•
Next strategy: A predefined number Flip is used, where the $k^{th}$ bee changes Flip consecutive bits in Sref, starting from the $(k-1)^{th}$ bit. For example, if $\textit{Flip}=3$ , the first bee changes bits 0, 1 and 2, the second bee changes bits 1, 2 and 3, and so on. This strategy is called Next because at each step, the $k^{th}$ bee changes the $(k-1)^{th}$ bit and the next $\textit{Flip}-1$ bits. The distance between these solutions and Sref equals Flip.
•
Syntactic strategy: This strategy defines a weight for each solution. The weight of a solution, $S=a_{0}a_{1}a_{2}\ldots a_{n-1}$ , denoted as $W(S)$ , is defined as: $W(S)=\sum_{i=0}^{n-1}a_{i}$ , where $n$ is the size of the solution. The algorithm of this strategy can be summarized as follows.

(1)
If an item $t$ belongs to the antecedent of $S1$ and either appears in the consequent of $S2$ or does not belong to $S2$ , there is a gap of one between $S1$ and $S2$ .
(2)
If the item $t$ belongs to the consequent of $S1$ and does not belong to $S2$ , then there is a gap of two between $S1$ and $S2$ .

First, $W(\textit{Sref})$ is calculated. Then, the $k^{th}$ bee changes consecutive bits of Sref, starting from the $k^{th}$ bit of Sref. The $k^{th}$ bee stops this process when it obtains a solution $S$ that satisfies the constraint given by Eq. (1) or Eq. (2):

$\displaystyle W(S)=W(\textit{Sref})-\textit{Distance}$ (1)

$\displaystyle W(S)=W(\textit{Sref})+\textit{Distance}$ (2)
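The weight $W(S)$ and the stopping test of the Syntactic strategy can be written in a few lines of Python. The names `weight` and `satisfies_distance` are assumptions of this sketch, which only checks the two constraints and does not reproduce the full search.

```python
from typing import Sequence


def weight(solution: Sequence[int]) -> int:
    """W(S): sum of the encoded values a_0 + a_1 + ... + a_{n-1}."""
    return sum(solution)


def satisfies_distance(candidate: Sequence[int], sref: Sequence[int],
                       distance: int) -> bool:
    """True when W(S) = W(Sref) - Distance or W(S) = W(Sref) + Distance."""
    w_ref = weight(sref)
    return weight(candidate) in (w_ref - distance, w_ref + distance)


SREF = [1, 1, 2, 1, 1, 0, 1]        # W(Sref) = 7
candidate = [0, 1, 2, 1, 1, 0, 1]   # first position changed from 1 to 0
stop = satisfies_distance(candidate, SREF, distance=1)
```

Here the candidate has weight 6, so it satisfies Eq. (1) for a distance of 1 and the bee would stop.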

1.4 Contribution

The goal of this paper is to address scalability issues of ARM for big datasets using swarm intelligence metaheuristics. An approach is developed that exploits the GPU architecture for mining big transactional databases using swarm intelligence based metaheuristics. The main contribution is to exploit the power of GPU parallelism for all steps of the metaheuristic process, in contrast to existing approaches such as CudaApriori [15], SEGPU [30], MEGPU [31] and TDGPU [32], which only use the GPU for the evaluation step and perform the remaining steps sequentially on the CPU.

It should be noted that a previous work [16] proposed running swarm intelligence based metaheuristics on a GPU. However, that study solves a different problem: it proposes a generic framework for solving optimization problems with a GPU-based local search strategy.

The contributions of the paper are as follows:

(1)
This paper investigates various GPU-parallelism strategies to improve the performance of Bee Swarm Optimization (BSO) for the ARM problem in the context of big databases.
(2)
Three approaches are proposed, namely DRGPU (Determination of Regions on GPU), SAGPU (Search Area on GPU) and ALLGPU (All steps on GPU).
(3)
To validate the proposed approaches, an experimental evaluation has been carried out using medium, large, very large and big datasets. Results show a considerable speed up on the challenging big Webdocs dataset. Results also reveal the superiority of the fourth approach (ALLGPU) over existing GPU-based ARM approaches.
(4)
A case study in social network analysis shows that ALLGPU outperforms DMine [38] to discover interesting rules from the big Pokec social graph that contains 1.63 million nodes.

The remainder of the paper is organized as follows. Section 2 discusses related work about recent parallel ARM algorithms. Section 3 describes the proposed solutions, and a theoretical study is presented in Section 4. A performance evaluation is presented in Section 5. Finally, Section 6 draws a conclusion.

2. Related work

Many solutions have been proposed in the ARM literature. In this related work section, only GPU-based ARM approaches are discussed. The first GPU-based ARM algorithms were introduced by Fang et al. in [18]. Two parallel versions of the Apriori algorithm [10] have been proposed, called PBI (Pure Bitmap Implementation) and TBI (Trie Bitmap Implementation). PBI represents transactions and itemsets using a bitmap data structure. The bitmap for itemsets is a $(n*m)$ bit matrix where $n$ is the number of itemsets and $m$ is the number of items. In this representation, $\textit{bit}(i,j)=1$ if the itemset $i$ contains the item $j$ . Otherwise, $\textit{bit}(i,j)=0$ . Similarly, the structure to store transactions is a $(n*m)$ bit matrix, where $n$ is the number of itemsets and $m$ is the number of transactions. In this structure, $\textit{bit}(i,j)=1$ if the transaction $j$ contains the itemset $i$ . Otherwise, $\textit{bit}(i,j)=0$ . In another study [7], the GPU-FPM algorithm was proposed. It is an Apriori-like algorithm that uses a vertical database representation to overcome the memory limitation of a GPU. It uses a Mempack structure to store different types of data.
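To make the idea of bitmap-based support counting concrete, the following Python sketch stores one bitmap per item, where bit $j$ is set when transaction $j$ contains the item, and counts the support of an itemset with bitwise AND operations. This is only an illustration of the general principle; it is not the PBI, TBI or GPU-FPM data structure, and the names `item_bitmaps` and `bitmap_support` are assumptions.

```python
from typing import Dict, FrozenSet, Iterable, Sequence


def item_bitmaps(transactions: Sequence[FrozenSet[str]],
                 items: Iterable[str]) -> Dict[str, int]:
    """Build one integer bitmap per item; bit j is 1 if transaction j has it."""
    return {
        item: sum(1 << j for j, t in enumerate(transactions) if item in t)
        for item in items
    }


def bitmap_support(itemset: Iterable[str], bitmaps: Dict[str, int],
                   num_transactions: int) -> int:
    """Support = number of set bits in the AND of the item bitmaps."""
    mask = (1 << num_transactions) - 1  # start with all transactions selected
    for item in itemset:
        mask &= bitmaps[item]
    return bin(mask).count("1")


# Hypothetical toy database.
T = [frozenset("abc"), frozenset("ab"), frozenset("ac"), frozenset("bc")]
BITMAPS = item_bitmaps(T, "abc")
support_ab = bitmap_support("ab", BITMAPS, len(T))
```

With this layout, the database is scanned once to build the bitmaps, and each support count afterwards reduces to a few bitwise operations, which is the kind of regular work that maps well onto GPU threads.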

Adil and Sadaf proposed a new Apriori algorithm for the GPU architecture [14]. It is applied in two steps. First, the generation of itemsets is performed on the GPU host. Each thread block computes the support of a set of itemsets. Then, the generated itemsets are sent back to the CPU to generate the rules corresponding to each itemset and to determine the confidence of each rule. The main drawback of this algorithm is the high cost of CPU/GPU communications.

In [13], the authors proposed a parallel version of the DCI algorithm where the intersection and computation operators (which represent the most frequent operations of DCI) are parallelized. Two strategies have been proposed: i) the transaction-wise approach $t w$ , and ii) the candidate-wise approach $c w$ . In $t w$ , all GPU cores work on the same candidate and each thread is in charge of a portion of the data. Alternatively, in $c w$ , many candidates are handled by the GPU simultaneously, and a contiguous subsequence of candidates is assigned to each thread block. In [2], the GPApriori algorithm has been proposed using two data structures in order to improve itemset counting. In [8], a new GPU-based Apriori algorithm has been developed on Micron’s Automata Processor (AP). At each pass of the algorithm, some candidate itemsets are sent to the AP for matching and counting frequent itemsets. Moreover, the multiple-entry NFA structure was proposed to handle variable-size itemsets.

In [20], the Bit_Q_Apriori algorithm, which simplifies the process of candidate generation and support counting, is proposed. Unlike the Apriori algorithm, which generates k-candidates by combining two (k-1)-frequent itemsets, the Bit_Q_Apriori algorithm generates k-candidates by joining 1-frequent itemsets and (k-1)-frequent itemsets. The bitset structure is used to store transaction identifiers that correspond to each candidate. Therefore, support counting can be implemented by using boolean operators, instead of scanning the database. Some researchers have recently introduced bio-inspired approaches for ARM in [35, 30, 31]. In [35], an evolutionary algorithm is proposed to solve the ARM problem on GPUs. First, the rule is saved in constant memory. Then, the consequent and the antecedent of the rule are evaluated concurrently. Each thread $j$ of the $i^{th}$ block matches the expression $j$ (antecedent or consequent) against the $i^{th}$ transaction. In the SEGPU algorithm presented in [30], the evaluation of a single itemset is done on GPU where each block checks if the current itemset includes one part of the transactional database. Finally, in MEGPU presented in [31], the evaluation of multiple itemsets is done on GPU where each block processes a single itemset. Every thread then checks the current itemset against the set of transactions assigned to it.

This brief review of related work shows that the existing approaches have attempted to improve ARM algorithms by only parallelizing the evaluation step on GPU. The intuition behind this is that this operation is the most time-consuming for the CPU. However, for big transactional databases, diversification and intensification may also be time-consuming. The main contribution of this work is to investigate several GPU parallelism strategies that take into account all the constraints of the GPU architecture and of CPU/GPU communications, which is relevant when dealing with big transactional databases.

3. Proposed solution

This section presents the proposed method, which is inspired by the BSO-ARM algorithm. The section is divided into subsections corresponding to the three steps performed by the proposed method: diversification (also known as determination of regions on GPU), intensification, and evaluation. The main difference between the proposed approach and BSO-ARM is that the three steps are designed to be run on a GPU. For the diversification step, it will be shown that the regions explored in parallel and sequentially are the same. Based on these GPU implementations of the three steps, four different versions of BSO for ARM will be presented.

3.1 Diversification

In the GPU-based BSO-ARM algorithm, each bee independently explores its region. In this paper, to take advantage of all blocks of the GPU, each block is associated with a bee for exploring its own region. To do this, the task of determining the search area is divided into several fragments (subtasks), each consisting of creating a region for a bee. That is, the threads of the $i^{th}$ block contribute to the determination of the region for the $i^{th}$ bee. Every block modifies the reference solution according to the considered strategy. Consequently, the number of threads is proportional to $n$ , the number of bits in the reference solution. The following paragraphs describe how the three strategies for the determination of the search area that were discussed in Section 1.3 are parallelized for a GPU.

3.1.1 GPU implementation of the Modulo Strategy

Correctness of the Definition

The Modulo strategy on GPU is correct in the sense that it produces the same result as on a single-processor architecture. Formally, let $P_{i}$ denote the following proposition: the $i^{th}$ bee modifies the $(((i-1)\times\textit{Flip})+1)^{th}$ bit of the reference solution.

Proof. The proposition can be proved by induction on $i$ , as follows:

(1)
Initialization: The proposition is trivially true for $i=1$ .
(2)
Implication: Assume that $P_{i}$ is true for $i=1,2,\ldots,n$ and prove that $P_{n+1}$ is true. Since $P_{n}$ is true, the $n^{th}$ bee modifies the bit $((n-1)\times\textit{Flip})+1$ . At each step of the Modulo strategy, a jump of Flip bits is performed to obtain the next bee. Consequently, the $(n+1)^{th}$ bee modifies the bit $(((n-1)\times\textit{Flip})+1)+\textit{Flip}$ , that is, the bit $(n\times\textit{Flip})+1$ .

Let’s consider $\textit{Sref}=1121101$ , $\textit{Flip}=2$ and $K=4$ . The regions are determined as follows:

(1)
The first bee modifies bit 1, which corresponds to $((1-1)\times\textit{Flip})+1$ .
(2)
The second bee modifies bit 3, which corresponds to $((2-1)\times\textit{Flip})+1$ .
(3)
The third bee modifies bit 5, which corresponds to $((3-1)\times\textit{Flip})+1$ .
(4)
The fourth bee modifies bit 7, which corresponds to $((4-1)\times\textit{Flip})+1$ .
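The example above can be reproduced with a short sequential Python sketch. Each loop iteration plays the role of one GPU block and depends only on its own index $i$, which is why the parallel and sequential versions produce the same regions. The functions `modulo_bit` and `modulo_regions`, the bounds guard, and the deterministic `change` function (the strategy itself uses a random change) are assumptions made for illustration.

```python
from typing import Callable, List, Sequence


def modulo_bit(i: int, flip: int) -> int:
    """1-based index of the bit modified by the i-th bee: ((i-1) * Flip) + 1."""
    return (i - 1) * flip + 1


def modulo_regions(sref: Sequence[int], flip: int, k: int,
                   change: Callable[[int], int]) -> List[List[int]]:
    """Build the K bees; iteration i corresponds to the work of block i."""
    bees = []
    for i in range(1, k + 1):
        bee = list(sref)                 # copy of the reference solution
        bit = modulo_bit(i, flip)
        if bit <= len(bee):              # guard added for this sketch
            bee[bit - 1] = change(bee[bit - 1])
        bees.append(bee)
    return bees


SREF = [1, 1, 2, 1, 1, 0, 1]
# Hypothetical deterministic change: cycle through the values 0, 1, 2.
BEES = modulo_regions(SREF, flip=2, k=4, change=lambda v: (v + 1) % 3)
MODIFIED = [modulo_bit(i, 2) for i in range(1, 5)]
```

With $\textit{Flip}=2$ and $K=4$, `MODIFIED` contains the bits 1, 3, 5 and 7, and each bee differs from Sref in exactly that bit.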

Implementation.

In this strategy, one thread in each block modifies a bit of Sref, while the remaining threads of the same block copy the remaining bits of Sref. That is, the thread $ID=(i-1)\times\textit{Flip}+1$ of the $i^{th}$ block modifies the bit $ID=(i-1)\times\textit{Flip}+1$ of Sref in order to generate the $i^{th}$ bee. The pseudocode of this approach for GPU is given in Algorithm 3.1. Line 1 of this algorithm returns the identity of the current thread. In Line 2, the procedure Copy copies the bit given in the second argument into the first argument. In Line 3, procedure Change allows a random change of its input bit. The structure, Bees, is a matrix of size $(K\times n)$ , where $K$ is the number of bees and $n$ is the number of items. Consequently, for this algorithm, the number of blocks is set to $K$ and the number of threads per block is set to $n$ .

GPU Kernel of the Modulo Strategy
1: $\textit{idx}\leftarrow\textit{blockIdx.x}\times\textit{blockDim.x}+\textit{threadIdx.x}$
2: if $\textit{idx}=((\textit{blockIdx.x}-1)\times\textit{Flip})+1$ then
3:     $\textit{Copy}(\textit{Bees}[\textit{blockIdx.x}][\textit{idx}],\textit{Change}(\textit{Sref}[\textit{idx}]))$
4: else
5:     $\textit{Copy}(\textit{Bees}[\textit{blockIdx.x}][\textit{idx}],\textit{Sref}[\textit{idx}])$
6: end if

3.1.2 GPU implementation of the Next Strategy

(Correctness of the Definition).

The Next strategy on GPU is correct in terms of producing the same result as the sequential version executed on a single-processor architecture. That is, in the two cases, the $i^{th}$ bee modifies the bits $[i,((\textit{Flip}+i)-1)]$ of the Reference Solution.

Proof. Let $P_{i}$ be the proposition: “the $i^{th}$ bee modifies the bits $[i,((\textit{Flip}+i)-1)]$ of the reference solution”. This proposition may be proved by induction on $i$ as follows:

(1)
Initialization: The proposition is trivially true for $i=1$ .
(2)
Implication: Assume that $P_{i}$ is true for $i=1,2,\ldots,n$ and prove that $P_{n+1}$ is true. Given that $P_{n}$ is true, the $n^{th}$ bee modifies the bits $[n,((\textit{Flip}+n)-1)]$ of the reference solution. In this strategy, the window of Flip modified bits is shifted by one position to obtain the next bee. So the $(n+1)^{th}$ bee modifies the bits $[n+1,((\textit{Flip}+n)-1)+1]$ , that is, $[n+1,((\textit{Flip}+n+1)-1)]$ , which corresponds to $P_{n+1}$ .

Consider the previous example with the same parameters. The regions are determined as follows:

(1)
The first bee modifies bits 1 and 2, which correspond to $[1,\textit{Flip}+1-1]$ .
(2)
The second bee modifies bits 2 and 3, which correspond to $[2,\textit{Flip}+2-1]$ .
(3)
The third bee modifies bits 3 and 4, which correspond to $[3,\textit{Flip}+3-1]$ .
(4)
The fourth bee modifies bits 4 and 5, which correspond to $[4,\textit{Flip}+4-1]$ .
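The per-thread decision of the Next strategy can be emulated on the CPU with the following Python sketch, in which two nested loops stand in for the grid of blocks and threads. Block and thread indices are 1-based here to match the proof above (an assumption of this sketch; CUDA indices start at 0), and the names `next_kernel`, `launch_next` and the deterministic `change` function are hypothetical.

```python
from typing import Callable, List, Optional, Sequence


def next_kernel(block: int, thread: int, sref: Sequence[int], flip: int,
                bees: List[List[Optional[int]]],
                change: Callable[[int], int]) -> None:
    """Work of a single thread: modify the bit if it lies in [i, Flip + i - 1],
    otherwise copy it unchanged from the reference solution."""
    if block <= thread <= flip + block - 1:
        bees[block - 1][thread - 1] = change(sref[thread - 1])
    else:
        bees[block - 1][thread - 1] = sref[thread - 1]


def launch_next(sref: Sequence[int], flip: int, k: int,
                change: Callable[[int], int]) -> List[List[Optional[int]]]:
    """Emulate K blocks of n threads; every (block, thread) pair is independent."""
    n = len(sref)
    bees: List[List[Optional[int]]] = [[None] * n for _ in range(k)]
    for block in range(1, k + 1):
        for thread in range(1, n + 1):
            next_kernel(block, thread, sref, flip, bees, change)
    return bees


SREF = [1, 1, 2, 1, 1, 0, 1]
BEES = launch_next(SREF, flip=2, k=4, change=lambda v: (v + 1) % 3)
CHANGED = [[j + 1 for j in range(len(SREF)) if bee[j] != SREF[j]] for bee in BEES]
```

Since no thread reads a value written by another thread, the order in which the pairs are processed does not matter, and `CHANGED` lists the bits [1, 2], [2, 3], [3, 4] and [4, 5] for the four bees.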

Implementation. In this strategy, Flip threads of each block each modify one bit of Sref, and the remaining threads of the same block copy the remaining bits of Sref. Specifically, the threads with $ID\in[i,((\textit{Flip}+i)-1)]$ , where $i$ is the index of the block, modify the bits assigned to them. The pseudocode of this approach on a GPU is given in Algorithm 2.

GPU Kernel of Next Strategy
1: $\textit{idx}\leftarrow\textit{blockIdx.x}\times\textit{blockDim.x}+\textit{threadIdx.x}$
2: if $\textit{idx}\geqslant\textit{blockIdx.x}$ and $\textit{idx}\leqslant\textit{Flip}+\textit{blockIdx.x}-1$ then
3:     $\textit{Copy}(\textit{Bees}[\textit{blockIdx.x}][\textit{idx}],\textit{Change}(\textit{Sref}[\textit{idx}]))$
4: else
5:     $\textit{Copy}(\textit{Bees}[\textit{blockIdx.x}][\textit{idx}],\textit{Sref}[\textit{idx}])$
6: end if

3.1.3 GPU implementation of the syntactic strategy

Correctness of the Syntactic strategy on GPU. This strategy is trivially correct on GPU. Indeed, each block of threads successively generates potential solutions until one solution satisfies the distance criterion. Three types of threads are considered in every block.

(1)
A thread performs a modification of a bit in the reference solution.
(2)
$(n-1)$ threads copy the remaining bits of the reference solution.
(3)
A thread calculates the weight of the generated solution.

Implementation. In this strategy, one thread of each block modifies one bit of the reference solution Sref. The remaining $n-1$ threads of the same block copy the remaining bits of Sref, and a last thread calculates, in parallel, the weight of the generated solution. This process is repeated until the distance between the current solution and the reference solution satisfies the distance parameter. The pseudocode of this approach for GPU is given in Algorithm 3.1.3.

GPU Kernel of Syntactic Strategy
1: $\textit{idx}\leftarrow\textit{blockIdx.x}\times\textit{blockDim.x}+\textit{threadIdx.x}$
2: while $d\neq\textit{Distance}$ do
3:     $r\leftarrow\textit{rand}(n)$
4:     if $\textit{idx}=r$ then
5:         $\textit{Copy}(\textit{Bees}[\textit{blockIdx.x}][\textit{idx}],\textit{Change}(\textit{Sol}[\textit{idx}]))$
6:     else if $\textit{idx}=n+1$ then
7:         $d\leftarrow W(\textit{Bees}[\textit{blockIdx.x}],\textit{Sref})$
8:     else
9:         $\textit{Copy}(\textit{Bees}[\textit{blockIdx.x}][\textit{idx}],\textit{Sol}[\textit{idx}])$
10:     end if
11: end while

In the pseudocode of Algorithm 3.1.3, $W$ is the weight that is calculated using Eq. (1), and Sol is the current solution in the search tree (initialized to Sref).

3.2 The intensification step

The intensification step (SLS or TS) ensures that each region is explored by one bee. To parallelize this operation on a GPU, each block of threads is associated to a region, and each thread of a block modifies one bit of the reference solution. Indeed, for $n$ items, $(n\times 2)$ neighbors can be generated so $(n\times 2)$ threads should be launched. Each thread $j$ copies the current solution Sol in the new array called Neighbor, and it modifies the $i^{th}$ bit to 1 or 0, where $i=2\times j$ , or $i=(2\times j)+1$ . The pseudocode of this strategy on GPU is given in Algorithm 3.2.

GPU Kernel of Local Search Strategy
1: $\textit{idx}\leftarrow\textit{blockIdx.x}\times\textit{blockDim.x}+\textit{threadIdx.x}$
2: for ( $i\leftarrow 0$ ; $i\leqslant n$ ; $i++$ ) do
3:     if $\textit{idx}=i/2$ or $\textit{idx}=(i/2)+1$ then
4:         $\textit{Copy}(\textit{Neighbor}[i],\textit{Bees}[\textit{blockIdx.x}][i])$
5:     else
6:         $\textit{Copy}(\textit{Neighbor}[i],\textit{Copy}(\textit{Bees}[\textit{blockIdx.x}][i]))$
7:     end if
8: end for
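One way to see why $n$ items yield $(n\times 2)$ neighbors is that, with the three-valued encoding, every position can take two values other than its current one. The Python sketch below maps a thread index $t\in[0,2n)$ to a position and a replacement value. This particular mapping, and the names `neighbor` and `neighborhood`, are assumptions of this sketch and may differ from the indexing used in the kernel above.

```python
from typing import List, Sequence

VALUES = (0, 1, 2)  # absent, antecedent, consequent


def neighbor(sol: Sequence[int], t: int) -> List[int]:
    """Neighbor built by emulated thread t: position t // 2 receives the
    (t % 2)-th value that differs from its current value."""
    pos, which = divmod(t, 2)
    alternatives = [v for v in VALUES if v != sol[pos]]
    out = list(sol)          # copy the current solution
    out[pos] = alternatives[which]
    return out


def neighborhood(sol: Sequence[int]) -> List[List[int]]:
    """All 2n neighbors; each one could be produced by an independent thread."""
    return [neighbor(sol, t) for t in range(2 * len(sol))]


SOL = [1, 1, 2, 1, 1, 0, 1]
NEIGHBORS = neighborhood(SOL)
```

For the seven-item solution above, this produces 14 distinct neighbors, each differing from the current solution in exactly one position.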

3.3 The evaluation step

According to prior studies [30, 31], the most CPU-intensive operation for the ARM problem using swarm intelligence techniques is fitness computation, which requires scanning the database multiple times. Thus, parallelizing this step on a GPU could intuitively reduce the overall runtime. To parallelize this step, multiple rules are simultaneously evaluated on the GPU. Every block of threads evaluates one rule, so there are as many blocks as rules, and the threads of a block collaboratively calculate the fitness of that rule. Transactions are subdivided into subsets and each subset is associated with exactly one thread. Hence, each thread processes only its corresponding subset of transactions. After that, a sum reduction is applied to aggregate the fitness value. Such a strategy attempts to benefit from the massively parallel power of the GPU by launching a large number of threads per rule and to reduce CPU/GPU communications. The general procedure of the slave kernel is given in Algorithm 3.3.

GPU Kernel for evaluation of multiple association rules
1: $\textit{idx}\leftarrow\textit{blockIdx.x}\times\textit{blockDim.x}+\textit{threadIdx.x}$
2: Compare the solution $\textit{Buff}[\textit{blockIdx.x}]$
3: for $i=0$ to $l$ transactions do
4:     if $\textit{Buff}[\textit{blockIdx.x}]\in t_{((i\times\textit{blockDim.x})+\textit{idx})}$ then
5:         $\textit{count}[\textit{blockIdx.x}][i]\leftarrow 1$
6:     else
7:         $\textit{count}[\textit{blockIdx.x}]\leftarrow 0$
8:     end if
9: end for
10: $\textit{fitness}(\textit{Buff}[\textit{blockIdx.x}])\leftarrow\textit{Sum}\_\textit{Reduction}(\textit{count}[\textit{blockIdx.x}])$
11: $\textit{cudaMemcpy}(\textit{fitness}(\textit{Buff}[\textit{blockIdx.x}]),\textit{cudaMemcpyDeviceToHost})$
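The work of one block during evaluation can be emulated on the CPU as follows: each emulated thread counts the matches in its own subset of transactions, and the partial counts are then combined by a pairwise (tree) sum reduction. The strided assignment of transactions to threads, the function name `block_support` and the toy database are assumptions of this Python sketch, which counts the support of a single itemset rather than a full fitness value.

```python
from typing import FrozenSet, Iterable, Sequence


def block_support(rule_items: Iterable[str],
                  transactions: Sequence[FrozenSet[str]],
                  threads: int) -> int:
    """Support of one itemset computed the way a single block would do it."""
    items = frozenset(rule_items)
    # Step 1: thread `tid` scans transactions tid, tid + threads, ...
    partial = [
        sum(1 for t in transactions[tid::threads] if items <= t)
        for tid in range(threads)
    ]
    # Step 2: pairwise sum reduction; the stride doubles at every pass.
    stride = 1
    while stride < threads:
        for tid in range(0, threads, 2 * stride):
            if tid + stride < threads:
                partial[tid] += partial[tid + stride]
        stride *= 2
    return partial[0]


# Hypothetical toy database.
T = [frozenset("abc"), frozenset("ab"), frozenset("ac"), frozenset("bc")]
support_ab = block_support("ab", T, threads=3)
```

The result does not depend on the number of emulated threads, since every transaction is assigned to exactly one thread and the reduction adds every partial count exactly once.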

3.4 Four novel GPU-based approaches for ARM using BSO

The previous subsections have presented GPU implementations of the three main steps of the BSO process. This subsection explains how these steps are combined to create four versions of the proposed GPU-based BSO method for ARM in big databases.

3.4.1 The DRGPU algorithm

The first algorithm is called DRGPU (Determination of Regions). In this algorithm the reference solution is first created by the CPU, which then sends it to the GPU. Afterwards, region determination is performed on the GPU. Each generated bee is transmitted to the CPU, which explores its own region thanks to the neighborhood computation strategy. Each solution is then evaluated and the best one is considered as the reference solution for the next iteration. This process is repeated until the maximum number of iterations is reached. Figure 3 shows the framework of the proposed approach.

Figure 3.

DRGPU framework.

3.4.2 The SAGPU algorithm

The second algorithm, called SAGPU (Search Area on GPU), creates the reference solution and performs region determination on the CPU. After that, each generated bee is sent to the GPU, where the local search of each region is done in parallel. The generated solutions are then sent to the CPU for evaluation, and the best solution is selected as the reference solution for the next iteration. This process is repeated until the maximum number of iterations is reached. Figure 4 shows the framework of the proposed approach.

Figure 4.

SAGPU framework.

3.4.3 The MEGPU algorithm

The third algorithm, called MEGPU (Multiple Evaluation on GPU), performs reference solution creation, region determination and local search on the CPU. Then, each generated bee is sent to the GPU for the evaluation where multiple solutions are evaluated in parallel. The fitness of each solution is then sent to the CPU for the dancing step. The best solution becomes the reference solution for the next iteration. This process is repeated until the maximum number of iterations is reached. Figure 5 shows the framework of the proposed approach.

Figure 5.

MEGPU framework.

3.4.4 The ALLGPU algorithm

The fourth algorithm, named ALLGPU (All steps on GPU), creates the reference solution on the CPU and then sends it to the GPU. The determination of regions and neighborhood computation are performed on the GPU. After that, each block of threads sends the generated solutions to the global memory of the GPU. Afterwards, each block processes one solution from the global memory and evaluates it in parallel. A designated thread of each block obtains the best solution from the shared memory. The best solution of each block is sent to the global memory, where a designated thread obtains the best solution to be considered as the reference solution for the next iteration. This process is repeated on the GPU until the maximum number of iterations is reached. Figure 6 shows the framework of the proposed approach.

Figure 6.

ALLGPU framework.

4. Analysis of the proposed approaches

Having presented the four versions of the proposed approach, the next three subsections asymptotically analyze each algorithm in terms of (1) CPU/GPU communications, (2) thread synchronization costs and (3) thread divergence. Finally, the time complexity of each proposed algorithm is given and compared to the sequential version.

DRGPU. At each pass, the reference solution must be transmitted from CPU to GPU. Thus, if the number of iterations is set to IMAX, and the solution size is $N$ Bytes (where $N$ is the number of items), then the CPU/GPU communication cost is $\textit{IMAX}\times N$ Bytes.

SAGPU. At each pass, $K$ solutions must be sent to the GPU. So the CPU/GPU communication cost is $\textit{IMAX}\times K\times N$ Bytes.

MEGPU. First, $M$ transactions are transmitted to the GPU. The size of the data transmitted is $(M\times N)$ Bytes. Then the set of generated neighbors is sent simultaneously to the GPU. If the number of neighbors is $\Delta$, then the total CPU/GPU communication cost is $[(M\times\Delta)+(\Delta^{2}\times K)]\times\textit{IMAX}$ Bytes. With $\Delta=N$, this is the expression reported for MEGPU in Table 1.

ALLGPU. For this algorithm, CPU/GPU communication only consists of transferring the Reference Solution to the GPU. Therefore, the CPU/GPU communication cost is $N$ Bytes.
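To compare these communication costs for a given parameter setting, the four expressions above can be evaluated directly. The following Python sketch simply transcribes them (the MEGPU entry follows the expression in $\Delta$ given in the text). The function name `communication_cost` and the sample values are hypothetical.

```python
def communication_cost(imax: int, n: int, k: int, m: int, delta: int) -> dict:
    """CPU/GPU communication cost in Bytes, as stated in the analysis above."""
    return {
        "DRGPU": imax * n,                            # one reference solution per pass
        "SAGPU": imax * k * n,                        # K solutions per pass
        "MEGPU": (m * delta + delta ** 2 * k) * imax,  # transactions and neighbors
        "ALLGPU": n,                                  # reference solution sent once
    }
```

For instance, `communication_cost(imax=10, n=4, k=3, m=5, delta=2)` gives 40, 120, 220 and 4 Bytes for DRGPU, SAGPU, MEGPU and ALLGPU, respectively, which shows that only the ALLGPU cost is independent of IMAX.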

4.1 GPU-block synchronization

DRGPU. At each iteration, every GPU block generates one region. Region determination is completed when all blocks have determined their regions. The GPU-block synchronization of this approach is thus 1 at each pass. The total GPU-block synchronization of this approach is IMAX.

SAGPU. The local search of each region is established on GPU, where each block explores one region. The GPU-block synchronization of this approach is 1 at each iteration, which corresponds to the time where all blocks have explored their own regions. So the total GPU-block synchronization of this approach is IMAX points.

MEGPU. This strategy attempts to benefit from the massively parallel power of GPU by launching a large number of threads per rule to reduce CPU/GPU communication. Consequently, the threads of the same block must be synchronized after each iteration (to deal with the sum reduction operation). Thus, $N\times K\times\textit{IMAX}$ points of synchronization are required during all the lifetime of the approach.

ALLGPU. All tasks are implemented on GPU. Therefore, the total GPU-block synchronization is computed by summing the synchronization caused by the determination of regions (IMAX points), the local search (IMAX points), the evaluation process ( $N\times K\times\textit{IMAX}$ points), and the dancing step (IMAX). The total GPU-block synchronization of this approach is ( $\textit{IMAX}\times[3+K\times N]$ ) points.
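As a quick check of the ALLGPU total, the four contributions listed above can be added term by term (a worked restatement of the counts in this subsection, with no new assumptions):

$\displaystyle\underbrace{\textit{IMAX}}_{\text{regions}}+\underbrace{\textit{IMAX}}_{\text{local search}}+\underbrace{N\times K\times\textit{IMAX}}_{\text{evaluation}}+\underbrace{\textit{IMAX}}_{\text{dancing}}=\textit{IMAX}\times[3+K\times N].$

For instance, with the purely illustrative values $\textit{IMAX}=100$, $K=10$ and $N=999$, this gives $100\times(3+9990)=999300$ points, compared with $999\times 10\times 100=999000$ points for MEGPU. The evaluation term therefore dominates the synchronization cost of ALLGPU.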

4.2 Thread divergence

Threads of the same block should execute the same instruction at the same time. Thread divergence occurs when distinct threads from the same warp execute different instructions at the same time.

DRGPU. Region determination is performed on the GPU, where each thread has to generate one region for a bee. All threads of a given block have to execute the same instruction. Consequently, there is no thread divergence between threads of the same block.

SAGPU. The local search is performed on the GPU, where each thread of a given block generates one solution. All threads of such a block have to execute the same instruction at the same time. Hence, there is no thread divergence between threads of the same block in this approach.

MEGPU. The transactions usually differ in terms of the number of items. To evaluate a single rule on the GPU, each thread has to scan all the items of the rule and compare them to the transaction it is mapped to. The comparison process of a thread stops as soon as it does not find a given item of the considered rule in the transaction that it checks. Thread divergence ( $TD$ ) can be computed according to the number of comparisons done by the different threads (see Eq. (3)).

$\displaystyle TD=\max\{\max\{t_{(r*w)+i}-t_{(r*w)+j}\}\}\mid(i,j,r)\in[1\ldots w]^{3}.$ (3)

where $t_{(r*w)+i}$ is the size of the $(r*w)+i^{th}$ transaction assigned to the $i^{th}$ thread, which is allocated to the $r^{th}$ warp, and $w$ is the number of warps.

In the worst case, transactions are highly different in size. Thus, it is possible to find on the same warp one transaction containing all items and another one containing only a single item. In this case, thread divergence can be approximated to the maximal number of items minus one for very large datasets (Eq. (4)).

$\displaystyle\lim_{M\to+\infty}TD(M)=N-1.$ (4)

The thread divergence count of this approach is thus $N-1$ .
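Eq. (3) can be read as the largest gap in transaction size between two threads of the same warp, maximized over all warps. The Python sketch below computes this quantity under the assumption, suggested by the index $(r*w)+i$, that transactions are assigned to warps in consecutive groups of `w`. The function name `thread_divergence` and the sample sizes are hypothetical.

```python
from typing import List


def thread_divergence(transaction_sizes: List[int], w: int) -> int:
    """Largest within-warp difference in transaction size (reading of Eq. (3)).

    Assumption: transactions are mapped to warps in consecutive
    groups of `w` (the last group may be shorter).
    """
    worst = 0
    for start in range(0, len(transaction_sizes), w):
        warp = transaction_sizes[start:start + w]
        # max over (i, j) of t_i - t_j inside one warp is max(warp) - min(warp)
        worst = max(worst, max(warp) - min(warp))
    return worst
```

For the sizes `[3, 5, 4, 4, 1, 9, 2, 2]` and `w = 4`, the two warps have gaps of 2 and 8, so the function returns 8. A warp holding one transaction with all $N=10$ items and one with a single item gives $N-1=9$, in line with Eq. (4).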

ALLGPU. All tasks are performed on the GPU, so the total thread divergence is computed by summing the thread divergence counts caused by region determination (0), local search (0), the evaluation process ( $N-1$ ), and the dancing step (0). The total thread divergence count of the proposed approach is thus ( $N-1$ ).

4.3 Time complexity

The time complexity of any bio-inspired approach depends on the empirical parameters used in the searching process. As the proposed algorithms are based on BSO, their computational complexity depends on the following parameters:

•
IMAX is the maximum number of iterations to be performed by the algorithm.
•
$K$ is the number of regions or bees in the colony.
•
$N$ is the number of items (i.e. solution size).
•
$M$ is the number of transactions.

The complexity cost of the sequential version is in $O(\textit{IMAX}\times K\times N\times M)$ [28].

The complexity of DRGPU is $O(\textit{IMAX}\times N\times M)$.

Proof. In this approach, the region determination process is distributed among the blocks, each of which determines one region. The determination of regions is then performed in $O(1)$ . The complexity of the sequential approach is thus divided by the number of regions, $K$ , which gives $O(\textit{IMAX}\times N\times M)$ .

The complexity of SAGPU is $O(\textit{IMAX}\times K\times M)$.

Proof. The exploration of the different regions in SAGPU is distributed among the blocks, each of which is used to explore one region. The local search is then performed in $O(1)$ , and the complexity of the sequential approach is divided by the number of neighbors, $N$ , which gives $O(\textit{IMAX}\times K\times M)$ .

The complexity of MEGPU is:

$\displaystyle O(\textit{IMAX}\times K\times N)$

Proof. The evaluation in MEGPU is distributed among the blocks, such that each block evaluates one solution. Thus, evaluation is performed in $O(1)$ time. The complexity of the sequential approach is then divided by the number of transactions, $M$ , which gives $O(\textit{IMAX}\times K\times N)$ .

The complexity of ALLGPU is $O(\textit{IMAX})$.

Proof. With ALLGPU, all BSO operations are performed on the GPU in IMAX iterations. The complexity is hence $O(\textit{IMAX})$ .

4.4 Discussion

The comparison of the four approaches, presented in this section, is summarized in Table 1.

Table 1
Summary of the analysis

Issue	DRGPU	SAGPU	MEGPU	ALLGPU
CPU/GPU communication (Bytes)	$\textit{IMAX}\times N$	$\textit{IMAX}\times K\times N$	$[(M\times N)+(N^{2}\times K)]\times\textit{IMAX}$	$N$
GPU-block synchronization (points)	IMAX	IMAX	$\textit{IMAX}\times K\times N$	$\textit{IMAX}\times[3+K\times N]$
Thread divergence	$0$	$0$	$N-1$	$N-1$
Time complexity	$O(\textit{IMAX}\times N\times M)$	$O(\textit{IMAX}\times K\times M)$	$O(\textit{IMAX}\times K\times N)$	$O(\textit{IMAX})$

The table shows that every approach has advantages and disadvantages when compared from a theoretical perspective. For instance, the first two approaches cope well with GPU computing challenges (low synchronization cost and no thread divergence) but have a high time complexity, while ALLGPU reduces the time complexity but has a high thread divergence and synchronization cost.
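Read against the sequential bound $O(\textit{IMAX}\times K\times N\times M)$ , the time-complexity row of Table 1 corresponds to the following theoretical reduction factors. This is only a restatement of the bounds of Section 4.3: it ignores constant factors as well as the communication, synchronization and divergence costs listed in the other rows.

$\displaystyle\frac{\textit{IMAX}\times K\times N\times M}{\textit{IMAX}\times N\times M}=K,\qquad\frac{\textit{IMAX}\times K\times N\times M}{\textit{IMAX}\times K\times M}=N,\qquad\frac{\textit{IMAX}\times K\times N\times M}{\textit{IMAX}\times K\times N}=M,\qquad\frac{\textit{IMAX}\times K\times N\times M}{\textit{IMAX}}=K\times N\times M.$

The four ratios are those of DRGPU, SAGPU, MEGPU and ALLGPU, respectively. Since $M$ is typically much larger than $K$ and $N$ on big databases, parallelizing the evaluation step yields the largest single reduction.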

5. Performance evaluation

Several experiments have been carried out to evaluate the proposed solution. We first compare the proposed approaches with each other to select the best approach. Then, the best approach (ALLGPU) is compared with state-of-the-art GPU-based approaches. Finally, ALLGPU is extended for frequent graph mining to process big database instances. The proposed approaches have been implemented using the C-CUDA language. Experiments have been carried out on a CPU host coupled with a GPU device. The CPU host is a 64-bit quad-core Intel Xeon E5520 with a clock speed of 2.27 GHz. The GPU device is an Nvidia Tesla C2075 with 448 CUDA cores (14 multiprocessors with 32 cores each) and a clock speed of 1.15 GHz. It has 2.8 GB of global memory, 49.15 KB of shared memory, and a warp size of 32. Both the CPU and GPU are used in single precision. The measure used in this study is the speed-up, given by Eq. (5).

$\displaystyle SP_{k}=\frac{T_{1}}{T_{k}},$ (5)

where $T_{i}$ is the CPU-time on $i$ processors.
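Eq. (5) translates into a one-line helper. The sketch below is only an illustration: the function name `speed_up` and the timings in the example are hypothetical and are not measurements from the experiments.

```python
def speed_up(t1: float, tk: float) -> float:
    """Speed-up SP_k = T_1 / T_k of Eq. (5).

    t1: run time on one processor (sequential version).
    tk: run time on k processors (parallel version).
    """
    if tk <= 0:
        raise ValueError("tk must be positive")
    return t1 / tk
```

For example, a sequential run of 300 seconds and a parallel run of 2 seconds give `speed_up(300.0, 2.0) == 150.0`.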

The experiments have been performed on a set of benchmark datasets, described in Table 2. This includes large instances (Connect, containing 100000 transactions and 999 items, and BMP-POS, containing 515597 transactions and 1657 items), the big WebDocs instance, containing 1.69 million transactions and 526765 items, and a big graph instance, Pokec, which represents a social network with 1.63 million nodes and 269 items.

Table 2

Description of the datasets

Instance	Number of transactions	Number of items
Connect	100000	999
BMP-POS	515597	1657
WebDocs	1.69 million	526765
Pokec	1.63 million (nodes)	269

5.1 Comparison of the proposed approaches

In the first experiment, the performance of the four proposed algorithms is compared on the large instances (Connect and BMP-POS) described above. An initial parameter setting phase was carried out to find the best parameter settings for each algorithm. These settings have then been used for the rest of the experiment. The performance of DRGPU depends on the strategy used for region determination (Modulo, Next, and Syntactic). SAGPU depends on the strategy used for the neighborhood computation process (SLS: “Stochastic Local Search” [28] and TS: “Tabu Search” [29]). MEGPU depends on the strategy used to reduce thread divergence (BR (Blocks-based Reordering), TR (Transactions-based Reordering), and TRMV (Transactions-based Reordering with Median Value) [32]). These strategies are adapted for our use in order to select the best parameters of MEGPU. Table 3 presents the results of the proposed approaches (DRGPU, SAGPU, and MEGPU) using different strategies for each algorithm. By varying the number of iterations from 100 to 1000, it is found that the best strategy for each algorithm is:

•
the Next strategy for DRGPU,
•
SLS method for SAGPU, and,
•
TRMV for MEGPU.
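This selection can be reproduced from the speed-ups reported in Table 3. The Python sketch below uses only the 1000-iteration rows of that table as an illustration, whereas the selection in the text considers all iteration counts. The names `TABLE3_AT_1000` and `best_strategies` are hypothetical.

```python
# Speed-ups at 1000 iterations, copied from Table 3.
TABLE3_AT_1000 = {
    "Connect": {
        "DRGPU": {"Modulo": 123, "Next": 150, "Syntactic": 122},
        "SAGPU": {"SLS": 140, "TS": 129},
        "MEGPU": {"BR": 112, "TR": 119, "TRMV": 142},
    },
    "BMP-POS": {
        "DRGPU": {"Modulo": 166, "Next": 179, "Syntactic": 167},
        "SAGPU": {"SLS": 175, "TS": 170},
        "MEGPU": {"BR": 162, "TR": 165, "TRMV": 178},
    },
}


def best_strategies(per_dataset: dict) -> dict:
    # For every dataset and algorithm, keep the strategy with the highest speed-up.
    return {
        dataset: {alg: max(scores, key=scores.get) for alg, scores in algs.items()}
        for dataset, algs in per_dataset.items()
    }
```

On both datasets, `best_strategies(TABLE3_AT_1000)` selects Next for DRGPU, SLS for SAGPU and TRMV for MEGPU, in agreement with the list above.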

Table 3
Speed up of the proposed approaches on Connect and BMP-POS using different numbers of iterations

Datasets	Iteration	DRGPU			SAGPU		MEGPU
		Modulo	Next	Syntactic	SLS	TS	BR	TR	TRMV
Connect	100	82	85	81	84	82	70	71	82
	200	91	93	89	90	89	75	77	89
	500	99	100	99	108	99	100	100	104
	800	110	118	109	115	110	108	108	123
	1000	123	150	122	140	129	112	119	142
BMP-POS	100	118	119	117	119	118	120	121	123
	200	127	128	126	131	128	131	132	138
	500	138	150	140	149	140	142	145	151
	800	145	160	147	155	143	145	149	155
	1000	166	179	167	175	170	162	165	178

The next experiment combines the previous best strategies in ALLGPU. Figures 7 and 8 show the speed up of the proposed approaches on the Connect and BMP-POS datasets, respectively. The results show that ALLGPU outperforms the three other approaches for all instances used. Thus, only ALLGPU is considered in the remaining experiments.

Figure 7.
Comparison of the proposed approaches in terms of speed up on Connect.

Figure 8.
Comparison of the proposed approaches in terms of speed up on BMP-POS.

Figure 9.
ALLGPU algorithm vs GPU-based ARM algorithms on the WebDocs instance (speed up).

5.2 ALLGPU vs. GPU-based ARM approaches

In the following, ALLGPU is compared to some state-of-the-art GPU-based ARM algorithms. Figure 9 presents the speed-up of the ALLGPU, SEGPU [30], GAGPU [35] and PEMS [31] approaches on the big WebDocs instance when generating 500000 association rules.

The figure shows that our approach outperforms all the compared GPU-based ARM approaches in terms of speed up. This is due to its higher degree of GPU parallelism compared to the existing GPU-based approaches, which only perform the evaluation of solutions on the GPU.

5.3 Comparing ALLGPU and DMine

The last experiment aims to extend ALLGPU to deal with big graph instances. For this purpose, ALLGPU is adapted to mine frequent graphs. We compare ALLGPU and DMine [38], the most recent algorithm for mining frequent graphs. Figure 10 presents the runtime of both approaches (ALLGPU and DMine) for extracting association rules from the big graph instance Pokec, which includes 1.63 million nodes and 269 different items.

Figure 10.

ALLGPU algorithm vs DMine on Pokec for different minimum support values (seconds).

By varying the minimum support from 100% to 20%, it is observed that ALLGPU is more stable than DMine. The figure shows that DMine outperforms ALLGPU for high minimum support values, while ALLGPU becomes a bit faster for low values. These results are due to the fact that ALLGPU exploits GPU parallelism to explore the large rule space obtained for low minimum support values.

6. Conclusion

This paper has proposed novel approaches for association rule mining that rely on GPU parallelism. Four versions of the proposed approach have been designed by considering, on the one hand, GPU computing challenges such as thread divergence, CPU/GPU communication and GPU-block synchronization, and on the other hand, the time complexity of the approach compared to the sequential version.

To numerically assess the contribution, several tests have been carried out on large and big datasets and a big graph dataset. Results show that the designed ALLGPU approach outperforms the three other approaches in terms of speed up for large instances. For the big WebDocs instance, results confirm the usefulness of ALLGPU compared to state-of-the-art GPU-based ARM approaches. Moreover, when dealing with the big Pokec social graph dataset, results show that using ALLGPU to mine frequent graphs outperforms the recent DMine algorithm. In future work, we plan to extend ALLGPU to be used with other swarm intelligence based approaches such as Particle Swarm Optimization, Bat Swarm Optimization, and Ant Colony Optimization.

References

1. Kaur and Jindal, Content based Image Retrieval with Graphical Processing Unit, in: Int. Conf. on Recent Trends in Information, Telecommunication and Computing, ITC, 2014.
2. Zhang and Jason, Gpapriori: Gpu-accelerated frequent itemset mining, in: Cluster Computing, IEEE International Conference on, 2011.
3. Wang, Zhang and Chang E.Y., Pfp: parallel fp-growth for query recommendation, in: Proceedings of the ACM Conference on Recommender Systems, 2008, pp. 107–114.
4. Cryans J.D., Rattich and Champagne, Adaptation of APriori to MapReduce to build a warehouse of relations between named entities across the web, in: Advances in Databases Knowledge and Data Applications, 2010, pp. 185–189.
5. Han, Pei and Yin, Mining frequent patterns without candidate generation, ACM SIGMOD Record 29(2) (2000), 1–12.
6. J.M.T., Zhan and Lin J.C.W., An ACO-based approach to mine high-utility itemsets, Knowledge-Based Systems 116(15) (2015), 102–113.
7. Zhou, Kun-Ming and Bin-Chang, Parallel frequent patterns mining algorithm on GPU, in: Systems Man and Cybernetics (SMC), IEEE International Conference on, 2010.
8. Wang, Fox J.J., Stan M.R. and Skadron, Association rule mining with the Micron Automata Processor, in: Parallel and Distributed Processing Symposium, 2015, pp. 689–699.
9. Nobile M.S., Cazzaniga, Besozzi and Mauri, GPU-accelerated simulations of mass-action kinetics models with cupSODA, The Journal of Supercomputing, 2014, 1–8.
10. Agrawal, Imielinski and Swami A.N., Mining association rules between sets of items in large databases, in: Proceedings of the ACM SIGMOD International Conference on Management of Data, 1993, pp. 207–216.
11. Kuo R.J., Chie M.C. and Chiu Y.T., Application of particle swarm optimization to association rule mining, Applied Soft Computing 11(1) (2011), 326–336.
12. Kuo R.J. and Shih C.W., Association rule mining through the ant colony system for national health insurance research database in Taiwan, Journal of Computers and Mathematics with Applications, 2007, 1303–1318.
13. Claudio and Orlando, gpudci: Exploiting gpus in frequent itemset mining, in: Parallel, Distributed and Network-Based Processing, 20th Euromicro International Conference on, 2012.
14. Adil S.H. and Sadaf, Implementation of association rule mining using CUDA, in: Emerging Technologies, ICET International Conference on, 2009.
15. Ryoo, Rodrigues C.I., Stone S.S., Stratton J.A., Ueng S.Z., Baghsorkhi S.S. and Wen-mei W.H., Program optimization carving for GPU computing, Journal of Parallel and Distributed Computing 68(10) (2008), 1389–1401.
16. Van Luond, Melab and El-Ghazali, GPU Computing for Parallel Local Search Metaheuristics, IEEE Transactions on Computers, Institute of Electrical and Electronics Engineers 62(1) (2013), 173–185.
17. Ravi V.T. and Agrawal, Performance issues in parallelizing data-intensive applications on a multi-core cluster, in: Proceedings of the 9th IEEE/ACM International Symposium on Cluster Computing and the Grid, 2009, pp. 308–315.
18. Fang et al., Frequent itemset mining on graphics processors, in: Proceedings of the Fifth International Workshop on Data Management on New Hardware, 2009.
19. Jiang, Ravi V.T. and Agrawal, A Map-Reduce system with an alternate API for multi-core environments, in: Proceedings of the 10th IEEE/ACM International Conference on Cluster, Cloud and Grid Computing, 2010, pp. 84–93.
20. Zhu, Zhou, Wang, Qiu and Wang, A real-time FPGA-Based accelerator for ECG analysis and diagnosis using association-rule mining, ACM Transactions on Embedded Computing Systems 15(2) (2016), 25.
21. Zhou and Huang, An improved parallel association rules algorithm based on MapReduce framework for big data, in: Fuzzy Systems and Knowledge Discovery, 11th International Conference on, 2014, pp. 284–288.
22. Djenouri and Comuzzi, Combining Apriori heuristic and bio-inspired algorithms for solving the frequent itemsets mining problem, Information Sciences 420 (2017), 1–15.
23. Djenouri, Habbas and Djenouri, Data mining-based decomposition for solving the MAXSAT problem: Toward a new approach, IEEE Intelligent Systems 32(4) (2017), 48–58.
24. Djenouri, Habbas, Djenouri and Fournier-Viger, Bee swarm optimization for solving the MAXSAT problem using prior knowledge, Soft Computing, 2017, 1–18.
25. Djenouri, Belhadi and Belkebir, Bees swarm optimization guided by data mining techniques for document information retrieval, Expert Systems with Applications 94 (2018), 126–136.
26. Djenouri, Belhadi and Fournier-Viger, Extracting useful knowledge from event logs: A frequent itemset mining approach, Knowledge-Based Systems 139 (2018), 132–148.
27. Djenouri, Drias and Bendjoudi, Pruning irrelevant association rules using knowledge mining, International Journal of Business Intelligence and Data Mining 9(2) (2014), 112–144.
28. Djenouri, Drias and Habbas, Bees swarm optimisation using multiple strategies for association rule mining, International Journal of Bio-Inspired Computation 6(4) (2014), 239–249.
29. Djenouri, Drias and Chemchem, A hybrid bees swarm optimization and tabu search algorithm for association rule mining, in: Nature and Biologically Inspired Computing World Congress, 2013, pp. 120–125.
30. Djenouri, Bendjoudi, Mehdi, Nouali-Taboudjemat and Habbas, Parallel association rules mining using GPUS and bees behaviors, in: Soft Computing and Pattern Recognition, 6th International Conference of, 2014, pp. 401–405.
31. Djenouri, Bendjoudi, Mehdi, Nouali-Taboudjemat and Habbas, GPU-based bees swarm optimization for association rules mining, The Journal of Supercomputing 71(4) (2015), 1318–1344.
32. Djenouri, Bendjoudi, Mehdi and Habbas, Reducing thread divergence in GPU-based bees swarm optimization applied to association rule mining, Concurrency and Computation: Practice and Experience 29(9) (2017), e3836.
33. Djenouri, Habbas and Belhadi, How to exploit high performance computing in population-based metaheuristics for solving association rule mining problem, Distributed and Parallel Databases 36(2) (2018), 369–397.
34. Djenouri, Bendjoudi, Djenouri and Habbas, Parallel BSO algorithm for association rules mining using master/worker paradigm, in: International Conference on Parallel Processing and Applied Mathematics, 2015, pp. 258–268.
35. Djenouri and Drias, Parallel Bees Swarm Optimization for Association Rules Mining Using GPU Architecture, in: International Conference in Swarm Intelligence, 2014, pp. 50–57.
36. Gheraibia, Moussaoui, Djenouri, Kabir and Yin P.Y., Penguins search optimisation algorithm for association rules mining, Journal of Computing and Information Technology 24(2) (2016), 165–179.
37. Gheraibia, Moussaoui, Djenouri, Kabir, Yin P.Y. and Mazouzi, Penguin search optimisation algorithm for finding optimal spaced seeds, International Journal of Software Science and Computational Intelligence 7(2) (2015), 85–99.
38. Yuan, Wang J.Y. and Chen, Efficient distributed subgraph similarity matching, The VLDB Journal 24(3) (2015), 369–394.
