A novel algorithm for searching frequent gradual patterns from an ordered data set

Abstract

Mining frequent simultaneous attribute co-variations in numerical databases is also known as the frequent gradual pattern mining problem. Few efficient algorithms for automatically extracting such patterns have been reported in the literature, and they differ mainly in the variation semantics used. However, in applications with temporal order relations, these algorithms fail to generate correct frequent gradual patterns because they do not take the temporal constraint into account in the mining process. In this paper, we propose an approach for extracting frequent gradual patterns for which the ordering of supporting objects matches the temporal order. This approach considerably reduces the number of gradual patterns extracted from an ordered data set. The experimental results show the benefits of our approach.

Keywords

Data mining, pattern mining, gradual pattern, temporal data, itemset, closed itemsets

1. Introduction

Gradual patterns, which capture order correlations of the form “The more/less X, the more/less Y”, play an important role in many real-world applications where numerical data must be handled. Data mining algorithms are commonly used to extract such patterns automatically [5, 8, 10, 17].

Condensed representations are sometimes used to reduce the number of extracted patterns, as in [2], which proposes an approach for mining frequent closed gradual patterns as a post-processing step of the approach in [8]. Because the closure is computed after mining, this approach brings no reduction in runtime or memory and thus provides no added value when running the algorithms. The authors of [10] tackle this issue by proposing an algorithm known as GLCM that reduces the number of patterns during the mining process. GLCM extends the idea developed in the LCM algorithm [25] and allows the efficient computation of gradual itemsets over large real-world databases, with a time complexity linear in the number of closed frequent gradual patterns and a memory complexity constant w.r.t. the number of closed frequent gradual patterns.

In [20], ParaMiner, a generic and parallel algorithm for closed pattern mining is proposed. This algorithm outperforms the state of the art gradual pattern mining algorithm on the problem of mining gradual itemsets. It is built on the principle of pattern enumeration in strongly accessible set systems and its efficiency is the result of a dataset reduction technique called EL-reduction, combined with a technique for performing data set reduction in a parallel execution on a multi-core architecture. ParaMiner is currently the most efficient algorithm for mining frequent closed gradual patterns from large numerical databases.

Although the gradual pattern mining algorithms reported in the literature extract gradual patterns efficiently, they do not assume any temporal constraint on the data. However, there are applications with a temporal order among objects (or rows), i.e. objects have a temporal meaning. An example is the paleoecological data described in [17]. In that context, only the patterns whose order of concordant objects (rows) respects the temporal order can be taken into account; other patterns are irrelevant. Thus, the reported algorithms are not suitable for extracting gradual patterns under temporal constraints on rows (or transactions).

Most recently, [17] exploits the principle given in [5] and uses an Apriori algorithm to extract frequent gradual patterns by taking the temporal constraint into account during the mining process. The authors apply their approach to paleoecological data and show, through the extracted gradual patterns, the interest of searching for gradual patterns in an ordered data set. However, this approach is limited to finding gradualities between consecutive objects only and does not benefit from the efficiency of the techniques proposed in ParaMiner [20].

We therefore propose herein an approach to extract frequent closed gradual patterns with a temporal constraint on objects. Our approach extends the previous work proposed in [17] by overcoming the two limitations mentioned above: it extracts the gradualities between distant objects and exploits the efficiency of the techniques proposed in ParaMiner [20]. This approach exploits the encoding proposed in the ParaMiner algorithm and uses the generic algorithm that computes the closure of a pattern by augmenting it with elements from the intersection of the transactions in its support set. In fact, in the ParaMiner algorithm, the closure of a pattern $s$ is the intersection of the set of transactions including $s$ , from which we remove the descending variations of the attributes before the first attribute with an ascending variation. This is possible because of the symmetry of the gradual pattern mining problem; however, this symmetry is no longer valid with the temporal constraint, which is why we have adopted the generic algorithm for computing the closure of a given gradual pattern. When integrated into ParaMiner, our approach reduces the number of extracted patterns by pruning irrelevant patterns during the mining process.

The paper is organized as follows: after introducing the notion of gradual patterns in Section 2, Section 3 describes our approach to mining frequent closed gradual patterns under temporal constraint. Before concluding, experimental results on two data sets are presented and compared to ParaMiner in Section 4.

2. Gradual itemsets and related works

In this section, we define the notion of gradual itemsets (patterns), and illustrate it on a sample data set.

2.1 Preliminary definitions

We assume that we are given a data set $\Delta$ containing a set of objects $\mathcal{T}=\{t_{1},\dots,t_{n}\}$ that define a relation on an attribute set $\mathcal{I}=\{i_{1},\ldots,i_{m}\}$ with numerical values. $\forall t\in\mathcal{T}$ , $t[i]$ denotes the value of $t$ over attribute $i$ .

Table 1
$\Delta$ : Ordered numerical data set

tid	$i_{1}$	$i_{2}$	$i_{3}$	$i_{4}$	$i_{5}$	$i_{6}$	$i_{7}$
$t_{1}$	84	61	7	0	1	8	0
$t_{2}$	116	36	4	1	11	2	31
$t_{3}$	90	52	2	3	5	1	13
$t_{4}$	124	34	1	5	12	7	36
$t_{5}$	102	49	0	6	7	10	17
$t_{6}$	135	17	0	1	18	3	62
$t_{7}$	106	40	3	1	9	0	18

The gradual itemsets extracted from $\Delta$ are of the form “more/less $i_{1}$ , $\ldots$ , more/less $i_{k}$ ” $(k\leqslant m)$ . These gradual itemsets are defined on a subset of $\mathcal{T}$ whose elements are arranged in increasing or decreasing order. Each attribute will hereafter be considered twice: once to indicate its increase and once to indicate its decrease, using the $\geqslant$ and $\leqslant$ operators respectively. This leads us to consider a new kind of item, called a gradual item.

Let us illustrate the notion of gradual itemsets using the numerical data set given in Table 1. This data set contains seven objects, on which a temporal order relation is defined, and seven attributes.

Definition 1 (Gradual Item) Let $\Delta$ be a data set defined on a numerical attribute set $\mathcal{I}$ . A gradual item is defined in the form $i^{*}$ , where $i$ is an attribute of $\mathcal{I}$ and $*\in\{\geqslant,\leqslant\}$ is a comparison operator.

For instance, considering the data set of Table 1, $i_{1}^{\geqslant}$ (respectively $i_{1}^{\leqslant}$ ) is a gradual item meaning that the values of attribute $i_{1}$ are increasing (respectively decreasing).

A gradual itemset is thus defined as follows:

Definition 2 (Gradual Itemset) A gradual itemset $s=(i_{1}^{*_{1}},\ldots,i_{k}^{*_{k}})$ is a non-empty set of gradual items.

For example, $\{i_{1}^{\geqslant},i_{3}^{\leqslant}\}$ is a gradual itemset meaning that “the more the values of attribute $i_{1}$ increase, the more the values of attribute $i_{3}$ decrease”.

A gradual itemset imposes a variation constraint on several attributes simultaneously.

2.2 Discovering frequent gradual itemsets

The support (frequency) of a gradual itemset in a database amounts to the extent to which a gradual pattern appears in a given database. Several support definitions have been proposed in the literature [12, 5, 14, 8], showing that gradual itemsets can follow different semantics. The choice thereof generally depends on the application.

In this paper, we consider the variation semantics proposed in [8], which is implemented in ParaMiner, and adapt it to the temporal constraint. This makes it possible to discover a new kind of gradual itemsets, referred to here as temporal gradual itemsets, which are gradual itemsets whose longest list of transactions respects the temporal order.

Definitions of important notions about the gradual itemset proposed in [8] are recalled hereafter.

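Definition 3 (List of Objects) Let $s=(i_{1}^{*_{1}},\ldots,i_{k}^{*_{k}})$ be a gradual itemset. A list of objects $L=\langle t_{1},\ldots,t_{n}\rangle$ respects $s$ if $\forall j\in[1,n-1],\forall p\in[1,k]$ , $t_{j+1}[i_{p}]*_{p}t_{j}[i_{p}]$ is satisfied, i.e. the values of $i_{p}$ do not decrease along $L$ when $*_{p}$ is $\geqslant$ and do not increase when $*_{p}$ is $\leqslant$ .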

Note that there may be several lists of objects respecting $s$ [8].
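
For illustration, Definition 3 can be checked mechanically on Table 1. The minimal Python sketch below is not taken from [8] or [20]: the names TABLE_1, holds and respects, and the encoding of a gradual item as an (attribute index, operator) pair, are assumptions made for this example. Following the convention used in the examples of this section, ">=" stands for an increase and "<=" for a decrease.

```python
# Table 1, keyed by object id; each row lists the values of i1..i7.
TABLE_1 = {
    1: [84, 61, 7, 0, 1, 8, 0],
    2: [116, 36, 4, 1, 11, 2, 31],
    3: [90, 52, 2, 3, 5, 1, 13],
    4: [124, 34, 1, 5, 12, 7, 36],
    5: [102, 49, 0, 6, 7, 10, 17],
    6: [135, 17, 0, 1, 18, 3, 62],
    7: [106, 40, 3, 1, 9, 0, 18],
}


def holds(before, after, op):
    """True if moving from `before` to `after` matches the gradual item."""
    # ">=" encodes "the more" (no decrease), "<=" encodes "the less".
    return before <= after if op == ">=" else before >= after


def respects(tids, itemset, data=TABLE_1):
    """Check every consecutive pair of `tids` against every gradual item."""
    return all(
        holds(data[a][attr - 1], data[b][attr - 1], op)
        for a, b in zip(tids, tids[1:])
        for attr, op in itemset
    )


S1 = [(1, ">="), (2, "<="), (4, "<="), (5, ">="), (7, ">=")]
S2 = [(1, ">="), (3, "<=")]
print(respects([3, 7, 2, 6], S1))  # True
print(respects([5, 4, 6], S1))     # True
print(respects([1, 2], S1))        # False: i4 rises from 0 to 1
print(respects([1, 2], S2))        # True
```

Only consecutive objects of the list are compared, which is sufficient because both comparison operators are transitive.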

By considering the data set of Table 1 and the pattern $s_{1}=$ $(i_{1}^{\geqslant},i_{2}^{\leqslant},i_{4}^{\leqslant},i_{5}^{\geqslant},i_{7}^{\geqslant})$ , the set of all the lists of objects respecting $s_{1}$ is $G_{s_{1}}$ $=\{\langle t_{3},t_{7},t_{2},t_{6}\rangle$ , $\langle t_{5},t_{4},t_{6}\rangle\}$ .

Definition 4 (Support) Let $G_{s}=\langle L_{1},\ldots,L_{n}\rangle$ be the set of all the lists of objects respecting a gradual itemset $s$ . Thus $\textit{Supp(s)}=\frac{\textit{max}\{|L_{i}|,L_{i}\in G_{s}\}}{|\Delta|}$ .

The longest list from $G_{s_{1}}$ is $\langle t_{3},t_{7},t_{2},t_{6}\rangle$ of size $4$ . Thus $\textit{Supp}(s_{1})=\frac{4}{7}=$ 0.57, meaning that 57% of all objects can be ordered consecutively according to $s_{1}$ .
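
The support of Definition 4 can be sketched as a longest-path computation in the precedence graph of [8]. The Python code below is an illustrative sketch only, not the implementation of [8] or [20]; the names TABLE_1, holds and support are hypothetical, and the recursion assumes that no two objects tie on every attribute of the itemset, so that the precedence graph is acyclic (this holds for $s_{1}$ on Table 1).

```python
from functools import lru_cache

# Table 1, keyed by object id; each row lists the values of i1..i7.
TABLE_1 = {
    1: [84, 61, 7, 0, 1, 8, 0],
    2: [116, 36, 4, 1, 11, 2, 31],
    3: [90, 52, 2, 3, 5, 1, 13],
    4: [124, 34, 1, 5, 12, 7, 36],
    5: [102, 49, 0, 6, 7, 10, 17],
    6: [135, 17, 0, 1, 18, 3, 62],
    7: [106, 40, 3, 1, 9, 0, 18],
}


def holds(before, after, op):
    # ">=" encodes an increase (no decrease), "<=" a decrease (no increase).
    return before <= after if op == ">=" else before >= after


def support(itemset, data=TABLE_1):
    """Length of the longest list respecting `itemset`, divided by |data|."""
    tids = tuple(data)

    def precedes(a, b):
        # Edge a -> b of the precedence graph: b may directly follow a.
        return a != b and all(
            holds(data[a][attr - 1], data[b][attr - 1], op)
            for attr, op in itemset
        )

    @lru_cache(maxsize=None)
    def longest_from(a):
        # Number of objects on the longest list starting at object a.
        return 1 + max(
            (longest_from(b) for b in tids if precedes(a, b)), default=0
        )

    return max(longest_from(a) for a in tids) / len(tids)


S1 = [(1, ">="), (2, "<="), (4, "<="), (5, ">="), (7, ">=")]
print(round(support(S1), 2))  # 0.57
```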

Referring to the previous example from Table 1, $t_{1}$ and $t_{2}$ can be ordered with respect to $s_{2}=(i_{1}^{\geqslant},i_{3}^{\leqslant})$ as $t_{1}[i_{1}]\leqslant t_{2}[i_{1}]$ and $t_{1}[i_{3}]\geqslant t_{2}[i_{3}]$ .

This order is only a partial order. For example, consider $t_{2}$ and $t_{3}$ of Table 1: they cannot be ordered according to $s_{2}$ . In fact, the pattern $s_{2}$ is not relevant when explaining the variations between $t_{2}$ and $t_{3}$ , and more generally between all transaction pairs that it cannot order. Conversely, a gradual pattern is relevant when explaining the variations occurring in the transactions that it can order. The support definition that we consider in this paper goes further and focuses on the size of the longest lists of objects that can be ordered according to a gradual itemset. The intuition is that such patterns are supported by long continuous variations in the data (long periods of co-evolution between paleoecological indicators in the case of paleoecological data, as mentioned by [17]); extracting such continuous variations is particularly desirable in order to better understand the data.

Definition 5 (Complementary Gradual Itemset) Let $s=(i_{1}^{*_{1}},\ldots,i_{k}^{*_{k}})$ be a gradual itemset, and $c$ be a function such that “ $c(\geqslant)=\leqslant$ and $c(\leqslant)=\geqslant$ ”. Then $c(s)=(i_{1}^{*^{c}_{1}},\ldots,i_{k}^{*^{c}_{k}})$ is the complementary (symmetric) gradual itemset of $s$ and is defined as $\forall j\in[1..k],*^{c}_{j}=c(*_{j})$ .

The complementary gradual itemset of $(i_{1}^{\geqslant},i_{3}^{\leqslant})$ is $(i_{1}^{\leqslant},i_{3}^{\geqslant})$ .

Proposition 6 $\textit{Supp(s)}=\textit{Supp(c(s))}$ .

Proposition 6 avoids unnecessary computations, as generating only half of the gradual itemsets is sufficient to automatically deduce the other ones. This means that, for each gradual itemset, there is a symmetric gradual itemset having the same support.
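
Proposition 6 can be observed on Table 1 with the following minimal Python sketch. It is an illustration under assumptions rather than the procedure of [8]: complement and support are hypothetical helper names, a gradual item is encoded as an (attribute index, operator) pair, and the support routine (repeated here so that the block runs on its own) assumes an acyclic precedence graph.

```python
from functools import lru_cache

# Table 1, keyed by object id; each row lists the values of i1..i7.
TABLE_1 = {
    1: [84, 61, 7, 0, 1, 8, 0],
    2: [116, 36, 4, 1, 11, 2, 31],
    3: [90, 52, 2, 3, 5, 1, 13],
    4: [124, 34, 1, 5, 12, 7, 36],
    5: [102, 49, 0, 6, 7, 10, 17],
    6: [135, 17, 0, 1, 18, 3, 62],
    7: [106, 40, 3, 1, 9, 0, 18],
}


def holds(before, after, op):
    # ">=" encodes an increase (no decrease), "<=" a decrease (no increase).
    return before <= after if op == ">=" else before >= after


def support(itemset, data=TABLE_1):
    """Longest list respecting `itemset` (no temporal constraint) / |data|."""
    tids = tuple(data)

    def precedes(a, b):
        return a != b and all(
            holds(data[a][attr - 1], data[b][attr - 1], op)
            for attr, op in itemset
        )

    @lru_cache(maxsize=None)
    def longest_from(a):
        return 1 + max(
            (longest_from(b) for b in tids if precedes(a, b)), default=0
        )

    return max(longest_from(a) for a in tids) / len(tids)


def complement(itemset):
    """Swap every operator, as the function c of Definition 5 does."""
    swap = {">=": "<=", "<=": ">="}
    return [(attr, swap[op]) for attr, op in itemset]


S2 = [(1, ">="), (3, "<=")]
print(complement(S2))                          # [(1, '<='), (3, '>=')]
print(support(S2) == support(complement(S2)))  # True
```

The equality holds because reversing any list that respects $s$ yields a list that respects $c(s)$ .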

A gradual itemset is said to be frequent if its support is greater than or equal to a user-defined threshold. The problem of mining frequent gradual itemsets is to find the complete set of frequent gradual itemsets in a given database $\Delta$ containing numerical items, with respect to a minimum threshold known as minSupp.

The approach described above allows closed frequent gradual patterns in the numerical database to be automatically extracted. However, this approach does not assume any temporal constraint between objects in the database, which makes it unsuitable for a database whose objects follow a temporal order relation.

With the temporal constraint, for a given numerical database $\Delta$ containing a set of objects, finding all of the frequent gradual itemsets whose lists of corresponding objects respect the temporal order is a new problem reported in [17]. For example, in Table 1, the gradual itemset $s_{1}$ is not relevant in the temporal context because the lists of objects respecting $s_{1}$ , namely $L_{1}=\langle t_{3},t_{7},t_{2},t_{6}\rangle$ and $L_{2}=\langle t_{5},t_{4},t_{6}\rangle$ , do not respect the temporal order ( $t_{7}$ precedes $t_{2}$ in $L_{1}$ and $t_{5}$ precedes $t_{4}$ in $L_{2}$ ). The gradual itemset $s_{2}=(i_{1}^{\geqslant},i_{3}^{\leqslant})$ is an interesting candidate, as one of its lists of objects, $\langle t_{1},t_{2},t_{4},t_{6}\rangle$ , respects the temporal order. Its relevance depends, in this case, only on the user-defined minimum support threshold.
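
The difference between $s_{1}$ and $s_{2}$ in the temporal context can be sketched by restricting the search to lists whose object ids increase. The Python code below is a minimal illustration, assuming that the row order of Table 1 is the temporal order; the names pair_ok and longest_temporal_list are hypothetical and the dynamic programme is not the authors' implementation.

```python
# Table 1, keyed by object id (assumed to be the temporal order).
TABLE_1 = {
    1: [84, 61, 7, 0, 1, 8, 0],
    2: [116, 36, 4, 1, 11, 2, 31],
    3: [90, 52, 2, 3, 5, 1, 13],
    4: [124, 34, 1, 5, 12, 7, 36],
    5: [102, 49, 0, 6, 7, 10, 17],
    6: [135, 17, 0, 1, 18, 3, 62],
    7: [106, 40, 3, 1, 9, 0, 18],
}


def pair_ok(a, b, itemset, data):
    """True if object b may follow object a according to every gradual item."""
    for attr, op in itemset:
        before, after = data[a][attr - 1], data[b][attr - 1]
        # ">=" encodes an increase (no decrease), "<=" a decrease.
        if not (before <= after if op == ">=" else before >= after):
            return False
    return True


def longest_temporal_list(itemset, data=TABLE_1):
    """Length of the longest list respecting `itemset` with increasing ids."""
    tids = sorted(data)
    best = {}  # best[b]: length of the longest such list ending at object b
    for b in tids:
        best[b] = 1 + max(
            (best[a] for a in tids if a < b and pair_ok(a, b, itemset, data)),
            default=0,
        )
    return max(best.values())


S1 = [(1, ">="), (2, "<="), (4, "<="), (5, ">="), (7, ">=")]
S2 = [(1, ">="), (3, "<=")]
print(longest_temporal_list(S1))  # 2
print(longest_temporal_list(S2))  # 4, e.g. <t1, t2, t4, t6>
```

Because candidate predecessors are restricted to earlier objects, the graph explored is acyclic by construction and a single left-to-right pass suffices.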

In this paper, we propose to integrate the temporal constraint into the ParaMiner algorithm in order to automatically extract the frequent gradual itemsets within the mining process. This approach exploits the encoding of the gradual itemset mining problem proposed in [20] and uses the principle of the conventional algorithms for solving pattern mining problems in order to overcome the problem posed by the non-symmetrical patterns respecting the temporal order. Our approach is a constraint-based pattern mining approach that is different from the one proposed in [20], which prevents use of the algorithm to solve conventional pattern mining problems.

2.3 Closed gradual itemsets

Closed itemsets are the key to obtaining concise representation of patterns without loss of information [19, 23, 4, 3]. This notion of closure was introduced for the first time when extracting gradual patterns in [2], where the authors propose a pair of functions $(f,g)$ defining a closure operator [11, 18] for the gradual patterns. Given a set of lists of transactions $\mathcal{L}$ from a database, $f$ returns the gradual pattern $s$ respecting all lists of transactions in $\mathcal{L}$ , while the function $g$ returns the set of maximal lists of transactions $\mathcal{L}$ which respects the variations of all gradual items in $s$ . A gradual pattern $s$ is said to be closed if we have $f(g(s))=s$ .

In our approach, the closure is seen as another constraint and is efficiently combined with temporal constraint to mine closed gradual patterns whose lists of objects respect the temporal order.

2.4 Related works

As mentioned above, several works have addressed the gradual pattern mining problem. Some propose algorithms to efficiently extract gradual patterns [8, 7, 14, 20, 10] from numerical data sets, while others focus on extracting other variants of gradual patterns from different types of data [21, 15, 22].

In [8], the authors propose a method based on the Apriori [1] principle to extract gradual itemsets efficiently from large databases. This method takes advantage of a binary representation of the lattice structure and uses binary matrices (adjacency matrices) to represent how tuples are ordered with regard to a gradual itemset. The data are represented through a graph whose nodes are the objects in the data and whose edges express the precedence relationships derived from the considered itemset. This graph is called the precedence graph. [7] propose an approach based on conflict sets for extracting gradual itemsets. This approach computes the support of gradual itemsets in a level-wise process that consists in discarding, at each level, the rows whose so-called conflict set is maximal, i.e. the rows that prevent the maximal number of rows from being sorted. In [14], the authors propose to compute the support of a gradual itemset by using the Kendall tau ranking correlation coefficient. In fact, instead of evaluating the support as the length of the longest path in the precedence graph as in [8], the authors consider the number of pairs of objects that are correctly ordered (concordant and discordant pairs) with respect to the gradual itemset. [9] propose an approach based on the LCM algorithm [25] to efficiently compute frequent closed gradual itemsets with a time complexity linear in the number of extracted patterns and a memory complexity constant w.r.t. the number of extracted patterns.

In order to tackle the time-consuming problem, the authors of [16] exploit the multiple processors and cores that are now available on computers to enhance the performances of the GRITE algorithm proposed by [8]. Following the same idea, [20] proposes a generic and parallel algorithm for closed gradual patterns mining based on a data set reduction technique called EL-reduction, combined with a technique for performing data set reduction in a parallel execution on a multi-core architecture.

In [15], the authors introduce emerging gradual patterns, which are defined as gradual patterns that describe a data set by contrast to a reference data set, i.e. they occur in one data set but not in another. [24] extends the notion of gradual pattern to the case in which the co-variations are expressed between attributes of different database relations. The authors hence propose the relational gradual pattern concept, which makes it possible to examine the correlations between attributes from a graduality point of view in multi-relational data. [22] are interested in mining gradual rules from stream data. Because data streams are more complex to handle (the data must be processed on the fly and can be seen only once), the authors propose an approach based on OWA (Ordered Weighted Aggregation) operators [26, 27] and B-Trees in order to speed up the mining process.

Most recently, [21] addresses the spatial gradual pattern mining problem with an application to the measurement of potentially avoidable hospitalizations. In [13], the authors propose a sequential pattern mining based approach for efficient extraction of frequent gradual patterns with their corresponding sequence of tuples.

3. Mining closed gradual itemsets under temporal constraint

This section presents our extraction process for discovering, from a numerical database, the frequent closed gradual itemsets whose lists of objects follow a temporal order.

3.1 Graduality under the temporal constraint

The semantics considered in our study differs from that reported in [20] on the following points:

•
The support of a gradual item is not always 100%. In fact, in a conventional context [8], it is always possible to order all of the objects by one column. This is not always possible in the temporal context. For example, none of the gradual items extracted from Table 1 has a support equal to 100% under temporal constraint.
•
The gradual itemsets sought here are not symmetrical, unlike conventional gradual itemsets. Thus, Proposition 6 is no longer valid, and every gradual itemset must be searched for together with its complementary itemset.
•
The computation of the closure of a gradual itemset is defined in a much simpler way than that proposed in [20]. This is due to the fact that the gradual itemsets are not symmetrical in the temporal context.
•
The extraction process is faster and the number of extracted patterns is smaller because taking into account the temporal constraint reduces the number of object paths to be explored in the data set.

The paragraphs below give a formal definition of the problem of mining frequent gradual patterns under temporal constraint. This is then illustrated by an example using the database of Table 1, and shows that the semantics of graduality of the state of the art are not suitable.

Definition 7 (Sequence of objects) Let $L=\langle t_{a_{1}},\ldots,t_{a_{n}}\rangle$ be a list of objects. $L$ is a sequence of objects if ${a_{1}}\leqslant{a_{2}}\leqslant\ldots\leqslant{a_{n}}$ holds.

Definition 8 (Temporal constraint) Let $s$ be a gradual itemset. $s$ respects the temporal constraint if the longest list of objects respecting $s$ is a sequence of objects.

Definition 9 (Sibling gradual itemset) Let $s=\{i_{1}^{*_{1}},\ldots,i_{k}^{*_{k}}\}$ be a gradual itemset. $s^{\prime}=\{i_{1}^{*^{\prime}_{1}},\ldots,i_{k}^{*^{\prime}_{k}}\}$ is a sibling gradual itemset of $s$ , noted Sibling(s), if and only if $s^{\prime}\neq s$ , i.e. $s^{\prime}$ is defined on the same attributes $i_{1},\ldots,i_{k}$ as $s$ and $*^{\prime}_{p}\neq *_{p}$ for at least one $p\in[1,k]$ .

Proposition 10 Let $s=\{i_{1}^{*_{1}},\ldots,i_{k}^{*_{k}}\}$ be a gradual itemset. $s$ admits $2^{k}-1$ sibling gradual itemsets.

It is immediate that the complementary gradual itemset $c(s)$ of a gradual itemset $s$ is one of its siblings.
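
As an illustration of Definition 9 and Proposition 10, the siblings of a gradual itemset can be enumerated by trying every assignment of operators to its attributes. The Python sketch below uses a hypothetical helper named siblings and encodes a gradual item as an (attribute index, operator) pair; both choices are assumptions made for this example.

```python
from itertools import product


def siblings(itemset):
    """All itemsets on the same attributes that differ in some operator."""
    attrs = [attr for attr, _ in itemset]
    original = tuple(itemset)
    result = []
    for ops in product((">=", "<="), repeat=len(attrs)):
        candidate = tuple(zip(attrs, ops))
        if candidate != original:
            result.append(candidate)
    return result


S2 = [(1, ">="), (3, "<=")]
for sibling in siblings(S2):
    print(sibling)
# ((1, '>='), (3, '>='))
# ((1, '<='), (3, '>='))
# ((1, '<='), (3, '<='))
```

For $k=2$ the enumeration returns $2^{2}-1=3$ siblings, and the complementary itemset $(i_{1}^{\leqslant},i_{3}^{\geqslant})$ is among them.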

Let $\mathcal{I}=\{i_{1},\ldots,i_{m}\}$ be a set of attributes and $\mathcal{T}=\{t_{1},\ldots,t_{n}\}$ be a set of objects, where each object $t_{j}$ ( $j\in[1,n]$ ) stores a numerical value for every attribute in $\mathcal{I}$ . For mining closed gradual itemsets under temporal constraint, a temporal constraint on the variations of attributes is integrated into the encoding proposed in [20]. This encoding is modified as follows:

$\mathcal{A}=\{i_{1}^{\geqslant},i_{1}^{\leqslant},\ldots,i_{m}^{\geqslant},i_{m}^{\leqslant}\}$ is the set of attribute variations. In the new data set $\Delta^{\prime}$ , there are as many transactions as there are pairs of objects $(t_{j},t_{j^{\prime}})$ , $t_{j},t_{j^{\prime}}\in\mathcal{T}$ , with $j,j^{\prime}\in[1,n]$ and $j<j^{\prime}$ . $T_{(t_{j},t_{j^{\prime}})}$ denotes the transaction that contains the variation for every attribute in $\mathcal{A}$ between the records $t_{j}$ and $t_{j^{\prime}}$ : for every attribute $i\in\mathcal{I}$ , we have:

•
$i^{\geqslant}\in T_{(t_{j},t_{j^{\prime}})}\Longleftrightarrow t_{j}[i]\leqslant t_{j^{\prime}}[i]$
•
$i^{\leqslant}\in T_{(t_{j},t_{j^{\prime}})}\Longleftrightarrow t_{j}[i]\geqslant t_{j^{\prime}}[i]$

$j<j^{\prime}$ imposes the temporal constraint on the attribute variations in $\mathcal{A}$ and is an optimization compared to the encoding in [20].
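
The encoding described above can be sketched in a few lines of Python. This is a minimal illustration rather than the code of ParaMiner [20]: the function name encode and the representation of a gradual item as an (attribute index, operator) pair are assumptions made for this example.

```python
# Table 1, keyed by object id; each row lists the values of i1..i7.
TABLE_1 = {
    1: [84, 61, 7, 0, 1, 8, 0],
    2: [116, 36, 4, 1, 11, 2, 31],
    3: [90, 52, 2, 3, 5, 1, 13],
    4: [124, 34, 1, 5, 12, 7, 36],
    5: [102, 49, 0, 6, 7, 10, 17],
    6: [135, 17, 0, 1, 18, 3, 62],
    7: [106, 40, 3, 1, 9, 0, 18],
}


def encode(data):
    """Build one transaction of gradual items per pair (j, j') with j < j'."""
    delta = {}
    tids = sorted(data)
    for pos, j in enumerate(tids):
        for jp in tids[pos + 1:]:  # only later objects: temporal constraint
            items = set()
            pairs = zip(data[j], data[jp])
            for attr, (before, after) in enumerate(pairs, start=1):
                if before <= after:
                    items.add((attr, ">="))
                if before >= after:
                    items.add((attr, "<="))  # equal values yield both items
            delta[(j, jp)] = frozenset(items)
    return delta


DELTA = encode(TABLE_1)
print(len(DELTA))            # 21 transactions for 7 objects
print(sorted(DELTA[(1, 2)]))
# [(1, '>='), (2, '<='), (3, '<='), (4, '>='), (5, '>='), (6, '<='), (7, '>=')]
```

With $n$ objects the encoding contains $n(n-1)/2$ transactions, half as many as an encoding over all ordered pairs with $j\neq j^{\prime}$ would.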

The encoding of Table 1 is given in Table 2.

Table 2
$\Delta^{\prime}$ : Encoding of the database in $\Delta$

$T_{(t_{1},t_{2})}$ $\{∼{}i_{1}^{\geqslant},∼{}i_{2}^{\leqslant},∼{}i_{3}^{\leqslant},∼{}i_{4}^{% \geqslant},∼{}i_{5}^{\geqslant},∼{}i_{6}^{\leqslant},∼{}i_{7}^{\geqslant}∼{}\}$

$\ldots$ $\ldots$

$T_{(t_{1},t_{7})}$ $\{∼{}i_{1}^{\geqslant},∼{}i_{2}^{\leqslant},∼{}i_{3}^{\leqslant},∼{}i_{4}^{% \geqslant},∼{}i_{5}^{\geqslant},∼{}i_{6}^{\leqslant},∼{}i_{7}^{\geqslant}∼{}\}$

$T_{(t_{2},t_{3})}$ $\{∼{}i_{1}^{\leqslant},∼{}i_{2}^{\geqslant},∼{}i_{3}^{\leqslant},∼{}i_{4}^{% \geqslant},∼{}i_{5}^{\leqslant},∼{}i_{6}^{\leqslant},∼{}i_{7}^{\leqslant}∼{}\}$

$\ldots$ $\ldots$

$T_{(t_{2},t_{7})}$ $\{∼{}i_{1}^{\leqslant},∼{}i_{2}^{\geqslant},∼{}i_{3}^{\leqslant},∼{}i_{4}^{% \leqslant},∼{}i_{4}^{\geqslant},∼{}i_{5}^{\leqslant},∼{}i_{6}^{\leqslant},∼{}i% _{7}^{\leqslant}\}$

$\ldots$ $\ldots$

$T_{(t_{4},t_{5})}$ $\{∼{}i_{1}^{\leqslant},∼{}i_{2}^{\geqslant},∼{}i_{3}^{\leqslant},∼{}i_{4}^{% \geqslant},∼{}i_{5}^{\leqslant},∼{}i_{6}^{\geqslant},∼{}i_{7}^{\leqslant}\}$

$\ldots$ $\ldots$

$T_{(t_{4},t_{7})}$ $\{∼{}i_{1}^{\leqslant},∼{}i_{2}^{\geqslant},∼{}i_{3}^{\geqslant},∼{}i_{4}^{% \leqslant},∼{}i_{5}^{\leqslant},∼{}i_{6}^{\leqslant},∼{}i_{7}^{\leqslant}\}$

$T_{(t_{5},t_{6})}$ $\{∼{}i_{1}^{\geqslant},∼{}i_{2}^{\leqslant},∼{}i_{3}^{\leqslant},∼{}i_{3}^{% \geqslant},∼{}i_{4}^{\leqslant},∼{}i_{5}^{\geqslant},∼{}i_{6}^{\leqslant},∼{}i% _{7}^{\geqslant}\}$

$T_{(t_{5},t_{7})}$ $\{∼{}i_{1}^{\leqslant},∼{}i_{2}^{\leqslant},∼{}i_{3}^{\geqslant},∼{}i_{4}^{% \leqslant},∼{}i_{5}^{\geqslant},∼{}i_{6}^{\leqslant},∼{}i_{7}^{\geqslant}\}$

$T_{(t_{6},t_{7})}$ $\{∼{}i_{1}^{\leqslant},∼{}i_{2}^{\geqslant},∼{}i_{3}^{\geqslant},∼{}i_{4}^{% \leqslant},∼{}i_{4}^{\geqslant},∼{}i_{5}^{\leqslant},∼{}i_{6}^{\leqslant},∼{}i% _{7}^{\leqslant}\}$

With this encoding, the support of a given gradual pattern $s=(i_{1}^{*_{1}},\ldots,i_{k}^{*_{k}})$ is the size of the longest tid path [20] in the tid support set of $s$ . For example, the longest tid path in $\{\langle t_{1},t_{2}\rangle,\langle t_{1},t_{3}\rangle,\langle t_{1},t_{4}\rangle,\langle t_{2},t_{3}\rangle,\langle t_{2},t_{4}\rangle,\langle t_{3},t_{4}\rangle\}$ is $\{\langle t_{1},t_{2}\rangle,\langle t_{2},t_{3}\rangle,\langle t_{3},t_{4}\rangle\}$ of size 3.
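
The longest tid path of the example above can be computed with a simple dynamic programme. The Python sketch below is an illustration under the assumption that every pair $\langle t_{j},t_{j^{\prime}}\rangle$ satisfies $j<j^{\prime}$ , as guaranteed by the temporal encoding; the function name longest_tid_path is hypothetical and the code is not the routine used in [20].

```python
def longest_tid_path(pairs):
    """Number of pairs on the longest chain <a,b>, <b,c>, ... in `pairs`."""
    best = {}  # best[t]: number of pairs on the longest path ending at t
    # Sorting by target id ensures that every pair entering an object is
    # processed before any pair leaving it (valid because a < b in each pair).
    for a, b in sorted(pairs, key=lambda pair: pair[1]):
        best[b] = max(best.get(b, 0), best.get(a, 0) + 1)
    return max(best.values(), default=0)


EXAMPLE = {(1, 2), (1, 3), (1, 4), (2, 3), (2, 4), (3, 4)}
print(longest_tid_path(EXAMPLE))  # 3
```

A path made of 3 pairs orders 4 objects, which is the quantity compared with the data set size in Definition 4.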

In the temporal context, the closure of a gradual pattern $s$ is simply the intersection of the transactions in $\Delta^{\prime}$ that contain it. $\Delta^{\prime}[s]$ denotes the set of transactions of $\Delta^{\prime}$ containing $s$ .

3.2 Algorithm

Algorithm 1 computes the unique closure of every gradual itemset respecting the temporal order. It is defined in a much simpler way than that proposed in [20].

Algorithm 1: Mining closed gradual itemsets from an ordered data set
Input: An ordered data set $\Delta$ containing a set of objects $\{t_{1},\ldots,t_{n}\}$ defining a relation on an attribute set $\mathcal{I}=\{i_{1},\ldots,i_{m}\}$ with numerical values
Output: All frequent closed gradual itemsets
Begin
    $k\leftarrow 1$
    $\Delta^{\prime}\leftarrow\emptyset$ /* encoding of data set $\Delta$ */
    $\Gamma\leftarrow\emptyset$ /* the set of frequent closed gradual itemsets */
    while $k<n$ do
        $j\leftarrow k+1$
        while $j\leqslant n$ do
            $T_{(t_{k},t_{j})}\leftarrow\emptyset$
            for each $i\in\mathcal{I}$ do
                if $t_{k}[i]\leqslant t_{j}[i]$ then $T_{(t_{k},t_{j})}\leftarrow T_{(t_{k},t_{j})}\cup\{i^{\geqslant}\}$
                if $t_{k}[i]\geqslant t_{j}[i]$ then $T_{(t_{k},t_{j})}\leftarrow T_{(t_{k},t_{j})}\cup\{i^{\leqslant}\}$
            $\Delta^{\prime}\leftarrow\Delta^{\prime}\cup\{T_{(t_{k},t_{j})}\}$
            $j\leftarrow j+1$
        $k\leftarrow k+1$
    for each frequent gradual itemset $s$ extracted from $\Delta^{\prime}$ do
        $q_{max}\leftarrow\cap\Delta^{\prime}[s]$
        $\Gamma\leftarrow\Gamma\cup\{q_{max}\}$
    return $\Gamma$
End
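
To make the steps of Algorithm 1 concrete, the following Python sketch runs them end to end on Table 1. It is a deliberately naive illustration and not the implementation evaluated in Section 4: the enumeration of candidate itemsets is brute force instead of the ParaMiner enumeration [20], the names encode, longest_list and mine_closed are hypothetical, and the frequency threshold min_objects is assumed to be expressed as a number of objects (a path of $p$ pairs orders $p+1$ objects).

```python
from itertools import combinations

# Rows t1..t7 of Table 1 in temporal order; columns are i1..i7.
TABLE_1 = [
    [84, 61, 7, 0, 1, 8, 0],
    [116, 36, 4, 1, 11, 2, 31],
    [90, 52, 2, 3, 5, 1, 13],
    [124, 34, 1, 5, 12, 7, 36],
    [102, 49, 0, 6, 7, 10, 17],
    [135, 17, 0, 1, 18, 3, 62],
    [106, 40, 3, 1, 9, 0, 18],
]


def encode(rows):
    """Encoding step: one transaction per pair of objects (k, j), k < j."""
    delta = {}
    for k in range(len(rows)):
        for j in range(k + 1, len(rows)):
            items = set()
            pairs = zip(rows[k], rows[j])
            for attr, (before, after) in enumerate(pairs, start=1):
                if before <= after:
                    items.add((attr, ">="))
                if before >= after:
                    items.add((attr, "<="))
            delta[(k + 1, j + 1)] = frozenset(items)
    return delta


def longest_list(pairs):
    """Number of objects on the longest chain of pairs (0 if no pair)."""
    best = {}
    for a, b in sorted(pairs, key=lambda pair: pair[1]):
        best[b] = max(best.get(b, 1), best.get(a, 1) + 1)
    return max(best.values(), default=0)


def mine_closed(rows, min_objects):
    """Closures of all itemsets whose longest list has >= min_objects."""
    delta = encode(rows)
    items = sorted(set().union(*delta.values()))
    closed = set()
    for size in range(1, len(items) + 1):
        for candidate in combinations(items, size):
            s = frozenset(candidate)
            tids = [pair for pair, t in delta.items() if s <= t]
            if tids and longest_list(tids) >= min_objects:
                # Closure: intersection of the transactions containing s.
                closed.add(frozenset.intersection(*(delta[p] for p in tids)))
    return closed


RESULT = mine_closed(TABLE_1, min_objects=4)
TARGET = frozenset({(1, ">="), (2, "<="), (5, ">="), (7, ">=")})
print(TARGET in RESULT)  # True: this is the closure of {i7>=}
```

On Table 1, the attributes $i_{1}$ , $i_{5}$ and $i_{7}$ rank the objects identically and $i_{2}$ ranks them in the reverse order, which is why the closure of $\{i_{7}^{\geqslant}\}$ contains these four gradual items.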

Corollary 11 Algorithm 1 is correct and complete for frequent closed gradual itemset mining under temporal constraint.

Proof. Proof of this is clear since Algorithm 1 is a simplification of the classical algorithm to compute closed patterns given in [6], which has been shown to compute the set of closed patterns. $\Box$

Lemma 12 Let $s$ be a frequent gradual itemset respecting the temporal order constraint. Let $s^{\prime}$ be a sibling gradual itemset of $s$ and ${L_{s}}$ (respectively $L_{s^{\prime}}$ ) be the longest list of objects respecting $s$ (respectively $s^{\prime}$ ). Then $|L_{s}\cap L_{s^{\prime}}|\leqslant 1$ .

Proof.

•
Let $s$ be a frequent gradual itemset respecting the temporal order constraint and $s^{\prime}$ be a sibling gradual itemset of $s$ .
•
Let ${L_{s}}$ (respectively $L_{s^{\prime}}$ ), the longest list of objects respecting $s$ (respectively $s^{\prime}$ )

Suppose that $|L_{s}\cap L_{s^{\prime}}|>1$ . This means that there are at least two objects $t_{j}$ and $t_{j^{\prime}}$ belonging to both ${L_{s}}$ and $L_{s^{\prime}}$ . Since $s^{\prime}$ is a sibling gradual itemset of $s$ , there exist at least two complementary gradual items $i^{*_{k}}$ and $i^{c(*_{k})}$ such that $s$ (respectively $s^{\prime}$ ) contains $i^{*_{k}}$ (respectively $i^{c(*_{k})}$ ). This means that if $t_{j}$ comes before $t_{j^{\prime}}$ in ${L_{s}}$ , then $t_{j^{\prime}}$ comes before $t_{j}$ in $L_{s^{\prime}}$ . This contradicts the fact that the objects are ordered. Therefore, $|L_{s}\cap L_{s^{\prime}}|\leqslant 1$ . $\Box$

Proposition 13 Let $s$ be a frequent gradual itemset respecting the temporal order constraint extracted from a numerical database $\Delta$ with respect to a minimum support threshold minSupp. If $\textit{minSupp}>\frac{|\Delta|}{2}$ , then none of the sibling gradual itemsets of $s$ is frequent.

Proof. Let $s$ be a frequent gradual itemset respecting the temporal order constraint extracted from a database $\Delta$ with respect to a minimum support threshold minSupp. As $\textit{minSupp}>\frac{|\Delta|}{2}$ and $s$ is frequent, then $\textit{Supp(s)}>\frac{|\Delta|}{2}$ . Let $s^{\prime}$ be a sibling itemset of $s$ , by using Lemma 12, we obtain $\textit{Supp(s')}<\textit{minSupp}$ . Thus $s^{\prime}$ is not frequent. $\Box$

Proposition 13 prevents unnecessary computations within the mining process, as a gradual itemset and its sibling itemsets cannot both be frequent if the minimum support threshold is greater than half of the data set size.

Figure 1.
Comparative evaluation of our approach vs the original Paraminer and the approach proposed in [17] (on a real data set of 111 objects and 40 attributes).

As mentioned in the introduction, a first approach for extracting frequent closed gradual itemsets from an ordered data set has been proposed in [17]. However, this approach cannot extract gradual itemsets between distant objects. For instance, the gradual itemset $s_{1}=(i_{1}^{\geqslant},i_{2}^{\leqslant},i_{4}^{\leqslant},i_{5}^{\geqslant},i_{7}^{\geqslant})$ extracted from the ordered numerical data set $\Delta$ using our approach cannot be extracted using the approach proposed in [17]. In fact, when considering the graduality between consecutive objects, the gradual item $i_{4}^{\leqslant}$ is not satisfied between the objects $t_{1}$ and $t_{2}$ ( $t_{1}[i_{4}]<t_{2}[i_{4}]$ ) or between the objects $t_{3}$ and $t_{4}$ ( $t_{3}[i_{4}]<t_{4}[i_{4}]$ ), although the gradual items $i_{1}^{\geqslant},i_{2}^{\leqslant},i_{5}^{\geqslant},i_{7}^{\geqslant}$ are satisfied between these objects. Our approach thus extracts from an ordered data set other forms of gradual itemsets that cannot be extracted by [17].

4. Experimental study

This section presents our experimental study on a real world data set of paleoecological indicators [17] and on a synthetic data set. The efficiency of the original ParaMiner algorithm is firstly compared with the new algorithm integrating the temporal constraint in terms of the number of extracted patterns. We also compare our proposed approach that we call LongOrdGradual with the one proposed by [17] called here ConsGradual in terms of the number of extracted patterns.

Figure 1 shows the number of extracted frequent closed gradual patterns with support variation for a paleoecological data set containing 111 objects and 40 attributes (items) [17]. Focus is placed on the variation of frequent closed gradual patterns respecting the temporal order with regard to conventional ones extracted using Paraminer according to the minimum support value minSupp. We can see that the number of patterns extracted with LongOrdGradual is generally smaller than the number of patterns extracted with Paraminer. This is because LongOrdGradual takes into consideration the temporal constraint during the mining process.

We observe from Fig. 1 that the proposed approach LongOrdGradual extracts more patterns than the one proposed in [17]. Thus, it extracts gradual itemsets that cannot be captured by [17]. The interest of such patterns whose sequence of objects respects the temporal order has been shown in [17]. Another important difference is that [17] extracts only the gradualities for which the attribute values do not stay constant between the objects. In Fig. 1, the curve called “Number of reduced patterns” shows the percentage of patterns pruned by LongOrdGradual compared to ParaMiner.

Figure 2 shows the computational time taken by our algorithm and the other algorithms. We can see that the approach proposed in [17] has shorter execution times than the other approaches, as it extracts fewer patterns (see Fig. 1 for the number of extracted gradual patterns). The runtimes of LongOrdGradual are slightly shorter than those of the ParaMiner algorithm.

Figure 2.

Comparative evolution of the computation time of our approach vs the original Paraminer and the approach proposed in [17] (on a real data set of 111 objects and 40 attributes).

Figure 3.

Comparative evaluation of our approach vs the original Paraminer and the approach proposed in [17] (on a synthetic data set of 108 objects and 30 attributes).

Figure 4.

Comparative evolution of the computation time of our approach vs the original Paraminer and the approach proposed in [17] (on a synthetic data set of 108 objects and 30 attributes).

Figure 5.

Comparative evaluation of our approach vs the original Paraminer and the approach proposed in [17] (on a synthetic data set of 1000 objects and 20 attributes).

In Fig. 3, we compare LongOrdGradual with both Paraminer and ConsGradual on a synthetic data set. The comparison is made on the number of closed frequent gradual patterns returned by each approach as the minimum support threshold (minSupp) varies. The synthetic data set, which contains 108 objects and 30 attributes, was produced with the same modified version of the IBM Synthetic Data Generator for Association and Sequential Patterns as the one used in [8]. This experiment shows that LongOrdGradual extracts fewer patterns than ConsGradual (the curve called “ConsGradual” in Fig. 3). Indeed, although both approaches integrate the temporal constraint during the mining process, they do not extract the same forms of patterns. On this synthetic data set, the number of patterns extracted with LongOrdGradual is also lower than the number extracted using Paraminer.

Figure 6.

Comparative evaluation of our approach vs the original Paraminer and the approach proposed in [17] on the synthetic data sets with the variation of objects number.

Figure 7.

Comparative evaluation of our approach vs the original Paraminer and the approach proposed in [17] on the synthetic data sets with the variation of attributes number.

Figure 4 shows how the runtime of each approach for discovering gradual patterns on a synthetic data set evolves with the minimal support. Our approach runs faster than the other ones, as it extracts fewer patterns (see Fig. 3).

Figure 5 shows the number of extracted patterns on a data set with a larger number of objects (1000 objects). On this data set, our approach extracts fewer patterns than the other approaches. In most cases, techniques for extracting gradual knowledge are applied to databases containing a small number of objects and attributes. Moreover, the synthetic data sets generated by IBM Synthetic Data Generation Code for Associations and Sequential Patterns are very dense, so a huge number of gradual patterns can be extracted and computation times can be longer. Experimental results obtained on these synthetic data sets show that the number of patterns extracted by our approach is generally smaller than the number extracted by the other algorithms, and that the difference is very large (see Figs 3 and 5).

Figure 6 shows, for 50 attributes and a minimal support minSupp set to 0.1, the variation in computation time taken by each algorithm as the size of the data set increases (variation of the number of objects). In this figure, we can see that computation times increase with the number of objects, except for the approach proposed in [17], whose runtimes slightly decrease as the number of objects increases. Nevertheless, our approach takes less time than Paraminer on these data sets.

Figure 7 shows, for a data set containing 1000 objects and with a minimal support set to 0.1, the variation in computation time taken by each approach to extract gradual patterns as the number of attributes varies. From this figure, we observe that the computation time increases with the number of attributes. Indeed, the number of extracted gradual patterns grows with the number of attributes, which in turn increases the computation time. The Paraminer algorithm takes longer than the other algorithms.

5. Conclusion

In this paper, we propose an approach for the automatic extraction of frequent closed gradual patterns when the ordering of supporting objects follows a temporal order. We show that, in this context, taking the temporal constraint into account during the mining process allows us to significantly reduce the number of extracted patterns by eliminating patterns containing lists of inconsistent objects, i.e., lists of objects that do not respect the temporal order. An algorithm dedicated to the extraction of frequent closed gradual patterns under temporal constraint has been proposed and implemented. The experiments carried out on the real-world data set made it possible to extract other forms of patterns from ordered data sets by considering the gradualities between distant objects. In this temporal context, and unlike current approaches, it would be interesting to consider the cases in which an attribute has equal values between two objects (i.e., the values are neither increasing nor decreasing) in order to select the most suitable results. We also plan to take variation strength into account during the mining process.

Acknowledgments

The authors would like to thank the Auvergne-Rhône-Alpes region and the European Union for their financial support through the European Regional Development Fund (ERDF). The authors would also like to thank the authors of Paraminer for making their algorithm available.

References

1. Agrawal, Mannila, Srikant, Toivonen and Verkamo A.I., Fast discovery of association rules, In Advances in Knowledge Discovery and Data Mining, AAAI/MIT Press, 1996, pp. 307–328.

2. Ayouni, Laurent, Ben-Yahia and Poncelet, Mining closed gradual patterns, In ICAISC, 2010, pp. 267–274.

3. Ben-Yahia, Gasmi and Mephu-Nguifo, A new generic basis of “factual” and “implicative” association rules, Intelligent Data Analysis 13(4) (2009), 633–656.

4. Ben-Yahia, Hamrouni and Mephu-Nguifo, Frequent closed itemset based algorithms: a thorough structural and analytical survey, SIGKDD Explorations 8(1) (2006), 93–104.

5. Berzal, Cubero J.C., Sánchez, Miranda M.A.V. and Serrano, An alternative approach to discover gradual dependencies, IJUFKS 15(5) (2007), 559–570.

6. Boley, Horváth, Poigné and Wrobel, Listing closed sets of strongly accessible set systems with application to data mining, TCS 411(3) (2010), 691–700.

7. Di-Jorio, Laurent and Teisseire, Fast extraction of gradual association rules: A heuristic based method, In CSTST, 2008, pp. 205–210.

8. Di-Jorio, Laurent and Teisseire, Mining frequent gradual itemsets from large databases, In Advances in Intelligent Data Analysis VIII. IDA, 2009, pp. 297–308.

9. T.D.T., Laurent and Termier, PGLCM: efficient parallel mining of closed frequent gradual itemsets, In ICDM, 2010, pp. 138–147.

10. T.D.T., Termier, Laurent, Négrevergne, Tehrani B.O. and Amer-Yahia, PGLCM: efficient parallel mining of closed frequent gradual itemsets, KAIS 43(3) (2015), 497–527.

11. Ganter and Wille, Formal Concept Analysis – Mathematical Foundations, Springer, 1999.

12. Hüllermeier, Association rules for expressing gradual dependencies, In PKDD, 2002, pp. 200–211.

13. Jabbour, Lonlac and Saïs, Mining gradual itemsets using sequential pattern mining, In FUZZ-IEEE, 2019, pp. 138–143.

14. Laurent, Lesot and Rifqi, GRAANK: exploiting rank correlations for extracting gradual itemsets, In FQAS, 2009, pp. 382–393.

15. Laurent, Lesot and Rifqi, Mining emerging gradual patterns, In Conference of the International Fuzzy Systems Association and the European Society for Fuzzy Logic and Technology (IFSA-EUSFLAT), 2015.

16. Laurent, Négrevergne, Sicard and Termier, Pgp-mc: Towards a multicore parallel approach for mining gradual patterns, In DASFAA, Part I, 2010, pp. 78–84.

17. Lonlac, Miras, Beauger, Mazenod, Peiry J.-L. and Mephu-Nguifo, An approach for extracting frequent (closed) gradual patterns under temporal constraint, In FUZZ-IEEE, 2018, pp. 878–885.

18. Mephu-Nguifo, Galois lattice: A framework for concept learning-design, evaluation and refinement, In Sixth International Conference on Tools with Artificial Intelligence, ICTAI ’94, New Orleans, Louisiana, USA, 6–9 November 1994, pp. 461–467.

19. Mephu-Nguifo and Njiwoua, Using lattice-based framework as a tool for feature extraction, In Machine Learning: ECML-98, 10th European Conference on Machine Learning, Chemnitz, Germany, 21–23 April 1998, Proceedings, 1998, pp. 304–309.

20. Négrevergne, Termier, Rousset and Méhaut, Paraminer: a generic pattern mining algorithm for multi-core architectures, DMKD 28(3) (2014), 593–633.

21. Ngo, Georgescu, Laurent, Libourel and Mercier, Mining spatial gradual patterns: Application to measurement of potentially avoidable hospitalizations, In Tjoa A.M., Bellatreche, Biffl, van Leeuwen and Wiedermann, editors, SOFSEM: Theory and Practice of Computer Science, 2018, pp. 596–608.

22. Nin, Laurent and Poncelet, Speed up gradual rule mining from stream data! A b-tree and owa-based approach, J Intell Inf Syst 35(3) (2010), 447–463.

23. Pasquier, Bastide, Taouil and Lakhal, Discovering frequent closed itemsets for association rules, In ICDT, 1999, pp. 398–416.

24. Phan, Ienco, Malerba, Poncelet and Teisseire, Mining multi-relational gradual patterns, In Proceedings of the 2015 SIAM International Conference on Data Mining, Vancouver, BC, Canada, 30 April–2 May 2015, pp. 846–854.

25. Uno, Kiyomi and Arimura, LCM ver. 2: Efficient mining algorithms for frequent/closed/maximal itemsets, In FIMI, ICDM Workshop, 2004.

26. Yager R.R., On ordered weighted averaging aggregation operators in multicriteria decisionmaking, IEEE Trans. Systems, Man, and Cybernetics 18(1) (1988), 183–190.

27. Yager R.R., Families of owa operators, Fuzzy Sets and Systems 59(1) (1993), 125–148.

$T_{(t_{1},t_{2})}$	$\{∼{}i_{1}^{\geqslant},∼{}i_{2}^{\leqslant},∼{}i_{3}^{\leqslant},∼{}i_{4}^{% \geqslant},∼{}i_{5}^{\geqslant},∼{}i_{6}^{\leqslant},∼{}i_{7}^{\geqslant}∼{}\}$
$\ldots$	$\ldots$
$T_{(t_{1},t_{7})}$	$\{∼{}i_{1}^{\geqslant},∼{}i_{2}^{\leqslant},∼{}i_{3}^{\leqslant},∼{}i_{4}^{% \geqslant},∼{}i_{5}^{\geqslant},∼{}i_{6}^{\leqslant},∼{}i_{7}^{\geqslant}∼{}\}$
$T_{(t_{2},t_{3})}$	$\{∼{}i_{1}^{\leqslant},∼{}i_{2}^{\geqslant},∼{}i_{3}^{\leqslant},∼{}i_{4}^{% \geqslant},∼{}i_{5}^{\leqslant},∼{}i_{6}^{\leqslant},∼{}i_{7}^{\leqslant}∼{}\}$
$\ldots$	$\ldots$
$T_{(t_{2},t_{7})}$	$\{∼{}i_{1}^{\leqslant},∼{}i_{2}^{\geqslant},∼{}i_{3}^{\leqslant},∼{}i_{4}^{% \leqslant},∼{}i_{4}^{\geqslant},∼{}i_{5}^{\leqslant},∼{}i_{6}^{\leqslant},∼{}i% _{7}^{\leqslant}\}$
$\ldots$	$\ldots$
$T_{(t_{4},t_{5})}$	$\{∼{}i_{1}^{\leqslant},∼{}i_{2}^{\geqslant},∼{}i_{3}^{\leqslant},∼{}i_{4}^{% \geqslant},∼{}i_{5}^{\leqslant},∼{}i_{6}^{\geqslant},∼{}i_{7}^{\leqslant}\}$
$\ldots$	$\ldots$
$T_{(t_{4},t_{7})}$	$\{∼{}i_{1}^{\leqslant},∼{}i_{2}^{\geqslant},∼{}i_{3}^{\geqslant},∼{}i_{4}^{% \leqslant},∼{}i_{5}^{\leqslant},∼{}i_{6}^{\leqslant},∼{}i_{7}^{\leqslant}\}$
$T_{(t_{5},t_{6})}$	$\{∼{}i_{1}^{\geqslant},∼{}i_{2}^{\leqslant},∼{}i_{3}^{\leqslant},∼{}i_{3}^{% \geqslant},∼{}i_{4}^{\leqslant},∼{}i_{5}^{\geqslant},∼{}i_{6}^{\leqslant},∼{}i% _{7}^{\geqslant}\}$
$T_{(t_{5},t_{7})}$	$\{∼{}i_{1}^{\leqslant},∼{}i_{2}^{\leqslant},∼{}i_{3}^{\geqslant},∼{}i_{4}^{% \leqslant},∼{}i_{5}^{\geqslant},∼{}i_{6}^{\leqslant},∼{}i_{7}^{\geqslant}\}$
$T_{(t_{6},t_{7})}$	$\{∼{}i_{1}^{\leqslant},∼{}i_{2}^{\geqslant},∼{}i_{3}^{\geqslant},∼{}i_{4}^{% \leqslant},∼{}i_{4}^{\geqslant},∼{}i_{5}^{\leqslant},∼{}i_{6}^{\leqslant},∼{}i% _{7}^{\leqslant}\}$
