Frequent similar pattern mining using non Boolean similarity functions

Abstract

There are many problems where the objects under study are described by mixed data (numerical and non numerical features), and similarity functions other than exact matching are usually employed to compare them. Some algorithms for mining frequent patterns allow the use of Boolean similarity functions different from exact matching. However, they do not allow the use of non Boolean similarity functions. Transforming a non Boolean similarity function into a Boolean one, and then applying the previous algorithms for mining frequent patterns, could lead to the loss of some patterns and, even worse, to the generation of other patterns that should not be considered frequent similar patterns. In this paper, we extend frequent similar pattern mining by allowing the use of non Boolean similarity functions. We propose several properties for pruning the search space of frequent similar patterns and a data structure for computing the frequency of candidate patterns. We also propose three algorithms for mining frequent patterns using non Boolean similarity functions. Experimental results show the efficiency and efficacy of the algorithms. The proposed algorithms obtain better patterns for classification than those obtained by traditional frequent pattern miners and by miners using Boolean similarity functions.

Keywords

Data mining, Frequent patterns, Similarity functions, Mixed data

1 Introduction

Frequent pattern mining has become an important task in data mining research [1, 2]. Frequent patterns have been used in many applications to discover useful information, such as user profiles [3], malicious software instructions [4], risk factors [5], human behaviours [6], etc. Also, frequent pattern mining is usually the first and most expensive step of association rule mining [7], another well known and widely studied data mining task. Moreover, frequent patterns have been useful for other tasks like classification [8] and clustering [9].

A frequent pattern is a combination of feature values that appears in a dataset with a frequency greater than or equal to a user-specified frequency threshold.

In early research on frequent pattern mining [10], the objects under study were described through Boolean features. Despite the simplicity of the feature domain, frequent pattern mining is a difficult task, since the search space grows exponentially with the number of features. Other situations make the frequent pattern mining task even more difficult: I) In real world problems, datasets usually contain objects described by mixed data (numerical and non numerical features). II) In real life, objects almost never match exactly. III) Not all similarity functions are Boolean. There are many problems in which the available similarity function is not Boolean. For example, non Boolean similarity functions have been used to compare documents [11], for modeling semantic relations [12, 13], for differential diagnosis of glaucoma, for comparing somatotypes, and for prognosis of the rehabilitation of patients with cleft lip and palate [14].

All algorithms for mining frequent similar patterns reported in the literature (ObjectMiner [15], STreeDC-Miner [16], STreeNDC-Miner [16], RP-Miner [17] and CFSP-Miner [18]) have focused on Boolean similarity functions. Therefore, we have two options for mining frequent similar patterns with non Boolean similarity functions: transforming the non Boolean similarity functions into Boolean similarity functions; or developing new algorithms for mining frequent similar patterns that use non Boolean similarity functions. However, transforming the non Boolean similarity function into a Boolean one could lead to the loss of some frequent similar patterns and/or to the generation of frequent similar patterns that should not be generated.

In the present work, we focus on frequent similar pattern mining using non Boolean similarity functions. Thus, STreeDC-Miner [16], STreeNDC-Miner [16] and RP-Miner [17], which use Boolean similarity functions, are extended to use non Boolean similarity functions. Preliminary results on this topic were presented in the short conference paper [19]. The main differences between the current paper and the short conference paper are: I) we formalize and prove the theoretical properties on which our frequent similar pattern mining algorithms are based, II) we provide more details about the algorithm STree^*DC-Miner [19], for using non Boolean similarity functions that hold the f_S-downward closure property, and III) based on the theoretical properties, we introduce two new algorithms, STree^*NDC-Miner and RP^*-Miner, for using non Boolean similarity functions that do not hold the f_S-downward closure property.

2 Basic concepts and results

Let Ω = {O₁, O₂, …, O_n} be a dataset. Each object O_i is described by a set of features R = {r₁, r₂, …, r_m} and represented as a tuple (v₁, v₂, …, v_m) where v_j ∈ D_j (D_j is the domain of the feature r_j, 1 ≤ j ≤ m). A subdescription of an object O for a subset of features S ⊆ R, denoted by I_S (O), is the description of O in terms of the features in S; O [r] denotes the value of the feature r ∈ R in O; and f_S : Ω × Ω → [0, 1] is a similarity function between O and O′ using their subdescriptions I_S (O) and I_S (O′) respectively. For simplicity we will write f_S (O, O′) instead of f_S (I_S (O) , I_S (O′)). Two examples of similarity functions are:

$f_{S} (O, O^{'}) = \prod_{r \in S} C_{r} (O [r], O^{'} [r])$ (1)

$f_{S} (O, O^{'}) = \frac{\sum_{r \in S} C_{r} (O [r], O^{'} [r])}{| S |}$ (2)

where C_r : D_r × D_r → [0, 1] is a comparison function between the values of the feature r. Three examples of comparison functions are:

$C_{r} (x, y) = \begin{cases} 1 & \text{if } x = y \\ 0 & \text{otherwise} \end{cases}$ (3)

$C_{r} (x, y) = 1 - \frac{| x - y |}{{Max}_{r} - {Min}_{r}}$ (4)

$C_{r} (x, y) = \begin{cases} 1 & \text{if } 1 - \frac{| x - y |}{{Max}_{r} - {Min}_{r}} \geq α \\ 0 & \text{otherwise} \end{cases}$ (5)

where ${Max}_{r} = \max_{O \in Ω} O [r]$ , ${Min}_{r} = \min_{O \in Ω} O [r]$ and α ∈ [0, 1].

Notice that the similarity function (1) can be Boolean (if ∀r ∈ S, C_r is Boolean) or non Boolean (if ∃r ∈ S, C_r is non Boolean). The similarity function (2) is non Boolean.
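As an illustration, the similarity functions (1) and (2) and the comparison functions (3), (4) and (5) can be sketched in Python as follows. This is only a minimal sketch: the function names, the dictionary-based object representation and the example objects are assumptions made for illustration, and the numerator of (4) is assumed to be the absolute difference of the two values.

```python
from math import prod


def c_equal(x, y):
    """Comparison function (3): 1 on exact match, 0 otherwise."""
    return 1.0 if x == y else 0.0


def make_c_range(min_r, max_r):
    """Comparison function (4); the absolute difference is an assumption."""
    def compare(x, y):
        return 1.0 - abs(x - y) / (max_r - min_r)
    return compare


def make_c_threshold(min_r, max_r, alpha):
    """Comparison function (5): (4) thresholded at alpha, hence Boolean."""
    graded = make_c_range(min_r, max_r)

    def compare(x, y):
        return 1.0 if graded(x, y) >= alpha else 0.0
    return compare


def sim_product(o1, o2, S, C):
    """Similarity function (1): product of the per-feature comparisons."""
    return prod(C[r](o1[r], o2[r]) for r in S)


def sim_mean(o1, o2, S, C):
    """Similarity function (2): mean of the per-feature comparisons."""
    return sum(C[r](o1[r], o2[r]) for r in S) / len(S)


# Hypothetical objects with one numerical and one non numerical feature.
o1 = {"age": 30, "sex": "F"}
o2 = {"age": 40, "sex": "F"}
S = ["age", "sex"]
graded_C = {"age": make_c_range(20, 60), "sex": c_equal}
boolean_C = {"age": make_c_threshold(20, 60, 0.9), "sex": c_equal}

print(sim_product(o1, o2, S, graded_C))   # 0.75
print(sim_mean(o1, o2, S, graded_C))      # 0.875
print(sim_product(o1, o2, S, boolean_C))  # 0.0
```

With a non Boolean comparison function for the numerical feature, (1) yields a graded value; with the thresholded comparison (5) the same function (1) becomes Boolean.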

Let I_S (O) be a subdescription, O ∈ Ω, ∅ ≠ S ⊆ R, and f_S be a Boolean similarity function; then the frequency of I_S (O) in Ω for f_S was defined in [16] as:

$f_{S} freq (O) = \frac{| \{O^{'} \in Ω \mid f_{S} (O, O^{'}) = 1\} |}{| Ω |}$ (6)

Thus, using a Boolean similarity function, each subdescription I_S (O′) contributes to the frequency of another subdescription I_S (O) if f_S (O, O′) = 1, i.e. if O and O′ are similar according to f_S. Following this idea, for non Boolean similarity functions we define the frequency of I_S (O) in Ω for f_S as:

$f_{S} freq (O) = \frac{\sum_{O^{'} \in Ω} f_{S} (O, O^{'})}{| Ω |}$ (7)

Notice that if f_S were a Boolean similarity function, (7) and (6) would be equal.
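The relation between the two frequency definitions can be checked with a small Python sketch. The dataset (a list of numbers standing for one-feature subdescriptions), the two similarity functions and all names are hypothetical and serve only to illustrate (6) and (7).

```python
def freq_boolean(o, dataset, sim):
    """Frequency (6): fraction of objects whose similarity to o equals 1."""
    return sum(1 for other in dataset if sim(o, other) == 1) / len(dataset)


def freq_general(o, dataset, sim):
    """Frequency (7): sum of the similarities to o divided by |dataset|."""
    return sum(sim(o, other) for other in dataset) / len(dataset)


def exact(x, y):
    """Boolean similarity: exact matching."""
    return 1.0 if x == y else 0.0


def graded(x, y):
    """Non Boolean similarity on values whose range is assumed to be [1, 4]."""
    return 1.0 - abs(x - y) / 3.0


dataset = [1, 1, 2, 4]

# For a Boolean similarity function both definitions coincide.
print(all(freq_boolean(o, dataset, exact) == freq_general(o, dataset, exact)
          for o in dataset))

# For a non Boolean similarity function, (7) also counts partial matches.
print(freq_boolean(1, dataset, graded), freq_general(1, dataset, graded))
```

In this toy dataset the value 1 has frequency 0.5 under (6), while under (7) the partially similar value 2 raises its frequency to about 0.67.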

Also, we say that I_S (O) is a frequent similar pattern in Ω, also called f_S-frequent subdescription, if its frequency is greater than or equal to a frequency threshold minFreq. Given a dataset of objects Ω described by a set of features R, a non Boolean similarity function f_S, and a frequency threshold minFreq, the frequent similar pattern mining problem consists in finding all frequent similar patterns in Ω using the non Boolean similarity function to compute the frequency of the subdescriptions by using (7).

3 Pruning properties

The downward closure property (dcp) is used in frequent itemset mining for pruning the search space [20]. This property ensures that all supersets of a non frequent itemset are also non frequent itemsets. An analogous dcp, in the context of frequent similar pattern mining, states that all superdescriptions of a non f_S-frequent subdescription are also non f_S-frequent subdescriptions. This property is called the f_S-downward closure property (f_S-dcp). Given a non Boolean similarity function, we can define the dcp for non Boolean similarity functions as follows:

Property 1. (f_S-downward closure) Given a dataset Ω, and a non Boolean similarity function f_S; f_S fulfills the f_S-downward closure iff ∀O, S₁, S₂; O ∈ Ω; ∅ ≠ S₁ ⊆ S₂ ⊆ R [f_{S₁}freq (O) < minFreq] ⇒ [f_{S₂}freq (O) < minFreq].

However, this property, unlike the dcp for frequent itemset mining, is not always true. Its fulfillment depends on whether the frequency and similarity function are monotonic.

Property 2. (Monotony of the frequency) Given a dataset Ω and a non Boolean similarity function f_S; f_S fulfills the monotony of the frequency iff ∀O, S₁, S₂; O ∈ Ω [∅ ≠ S₁ ⊆ S₂ ⊆ R] ⇒ [f_{S₁}freq (O) ≥ f_{S₂}freq (O)].

Definition 1. (Monotonic similarity function) Given a dataset Ω and a non Boolean similarity function f_S; f_S is non increasing monotonic iff ∀O, O′, S₁, S₂; O, O′ ∈ Ω, [∅ ≠ S₁ ⊆ S₂ ⊆ R] ⇒ [f_{S₁}(O, O′) ≥ f_{S₂}(O, O′)].

The similarity function (1) from section 2 is a non increasing monotonic similarity function, while the similarity function (2) is not.
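A brute-force check on a toy example illustrates this difference. In the sketch below, the per-feature comparison values C_r (O[r], O′[r]) for a single pair of objects are given directly as a hypothetical dictionary; the function names are illustrative only.

```python
from itertools import combinations
from math import prod


def sim_product(values, S):
    """Similarity function (1) from precomputed per-feature comparisons."""
    return prod(values[r] for r in sorted(S))


def sim_mean(values, S):
    """Similarity function (2) from precomputed per-feature comparisons."""
    return sum(values[r] for r in sorted(S)) / len(S)


def is_non_increasing(sim, values):
    """Check Definition 1 for one pair of objects over all S1 <= S2."""
    features = sorted(values)
    subsets = [frozenset(c)
               for k in range(1, len(features) + 1)
               for c in combinations(features, k)]
    return all(sim(values, s1) >= sim(values, s2)
               for s1 in subsets for s2 in subsets if s1 <= s2)


# Hypothetical comparison values C_r(O[r], O'[r]) for three features.
values = {"a": 0.2, "b": 1.0, "c": 0.5}

print(is_non_increasing(sim_product, values))  # True
print(is_non_increasing(sim_mean, values))     # False
```

The product can only shrink when a factor in [0, 1] is added, whereas the mean over {a, b} is 0.6, which exceeds the mean 0.2 over {a}.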

The dependencies among the dcp, the monotony of the frequency and the monotony of the similarity function can be expressed as follows:

Proposition 1. Given a dataset Ω≠ ∅, and a non Boolean similarity function f_S; if f_S is a non increasing monotonic similarity function, then f_S fulfills the monotony of the frequency.

Proof 1. If f_S is a non increasing monotonic similarity function then, ∀O, O′, S₁, S₂; O, O′ ∈ Ω;

$[\emptyset \neq S_{1} \subseteq S_{2} \subseteq R] \Rightarrow [f_{S_{1}} (O, O^{'}) \geq f_{S_{2}} (O, O^{'})]$ (8)

From (8), ∀O, S₁, S₂; O ∈ Ω

$[\emptyset \neq S_{1} \subseteq S_{2} \subseteq R] \Rightarrow [\sum_{O^{'} \in Ω} f_{S_{1}} (O, O^{'}) \geq \sum_{O^{'} \in Ω} f_{S_{2}} (O, O^{'})]$ (9)

Consequently, from (9), ∀O, S₁, S₂; O ∈ Ω

$[\emptyset \neq S_{1} \subseteq S_{2} \subseteq R] \Rightarrow [\frac{\sum_{O^{'} \in Ω} f_{S_{1}} (O, O^{'})}{| Ω |} \geq \frac{\sum_{O^{'} \in Ω} f_{S_{2}} (O, O^{'})}{| Ω |}]$ (10)

Therefore ∀O, S₁, S₂; O ∈ Ω [∅ ≠ S₁ ⊆ S₂ ⊆ R] ⇒ [f_{S₁}freq (O) ≥ f_{S₂}freq (O)]. ■

Proposition 2. Given a dataset Ω≠ ∅ and a non Boolean similarity function f_S; If f_S fulfills the monotony of the frequency, then f_S fulfills the f_S-downward closure.

Proof 2. If f_S fulfills the monotony of the frequency then ∀O, S₁, S₂; O ∈ Ω [∅ ≠ S₁ ⊆ S₂ ⊆ R] ⇒ [f_{S₁}freq (O) ≥ f_{S₂}freq (O)]. Therefore, ∀O, S₁, S₂; O ∈ Ω:

$[\emptyset \neq S_{1} \subseteq S_{2} \subseteq R] \Rightarrow [[minFreq > f_{S_{1}} freq (O)] \Rightarrow [minFreq > f_{S_{2}} freq (O)]]$ (11)

Thus, ∀O, S₁, S₂; O ∈ Ω; ∅ ≠ S₁ ⊆ S₂ ⊆ R; [f_{S₁}freq (O) < minFreq] ⇒ [f_{S₂}freq (O) < minFreq]. ■

Proposition 3. Given a dataset Ω≠ ∅ and a non Boolean similarity function f_S; if f_S is a non increasing monotonic similarity function, then f_S fulfills the f_S-downward closure.

Proof 3. Based on Propositions 1 and 2 the proof is immediate. ■

For non Boolean similarity functions, we define the following concept which will be used for pruning.

Definition 2. (f_S-interesting pattern) Given a dataset Ω and a non Boolean similarity function f_S; a subdescription I_S (O), O ∈ Ω, is an f_S-interesting pattern iff I_S (O) is an f_S-frequent subdescription or it contributes to the frequency of another f_S-frequent subdescription I_S (O′), O′ ∈ Ω.

Conversely, a subdescription I_S (O), O ∈ Ω, is considered a non f_S-interesting pattern if f_S freq (O) < minFreq and ∀O′; O′ ∈ Ω; I_S (O′) ≠ I_S (O) [f_S freq (O′) ≥ minFreq] ⇒ [f_S (O′, O) = 0].

Proposition 4. Given a dataset Ω and a non increasing monotonic non Boolean similarity function f_S; if a subdescription I_S (O) is a non f_S-interesting pattern, then all superdescriptions of I_S (O) are non f_S-interesting patterns.

Proof 4. If I_S (O) is a non f_S-interesting pattern, then

$f_{S} freq (O) < minFreq$ (12)

and

$\begin{matrix} \forall O^{'}; O^{'} \in Ω; I_{S} (O^{'}) \neq I_{S} (O) \\ [f_{S} freq (O^{'}) \geq minFreq] \Rightarrow [f_{S} (O^{'}, O) = 0] \end{matrix}$ (13)

By Proposition 3, since f_S is a non increasing monotonic non Boolean similarity function, f_S fulfills the f_S-dcp. Therefore, based on (12) we have

$\begin{matrix} \forall S^{'}; S \subseteq S^{'} \subseteq R \\ [f_{S} freq (O) < minFreq] \Rightarrow [f_{S^{'}} freq (O) < minFreq] \end{matrix}$ (14)

From the non increasing monotony of the similarity function f_S and (13):

$\begin{matrix} \forall O^{'}, S^{'}; O^{'} \in Ω; S \subseteq S^{'} \subseteq R; I_{S} (O^{'}) \neq I_{S} (O) \\ [f_{S^{'}} freq (O^{'}) \geq minFreq] \Rightarrow \\ [f_{S} freq (O^{'}) \geq minFreq] \Rightarrow \\ [f_{S^{'}} (O^{'}, O) \leq f_{S} (O^{'}, O) = 0] \end{matrix}$ (15)

In addition, it is obvious that:

$\begin{matrix} \forall O^{'}, S^{'}; O^{'} \in Ω; S \subseteq S^{'} \subseteq R \\ [I_{S} (O^{'}) \neq I_{S} (O)] \Rightarrow [I_{S^{'}} (O^{'}) \neq I_{S^{'}} (O)] \end{matrix}$ (16)

Therefore, from (15) and (16):

$\begin{matrix} \forall O^{'}, S^{'}; O^{'} \in Ω; S \subseteq S^{'} \subseteq R; I_{S^{'}} (O^{'}) \neq I_{S^{'}} (O) \\ [f_{S^{'}} freq (O^{'}) \geq minFreq] \Rightarrow \\ [f_{S} freq (O^{'}) \geq minFreq] \Rightarrow \\ [f_{S^{'}} (O^{'}, O) = 0] \end{matrix}$ (17)

Finally, from (14) and (17) we obtain that ∀S′; S ⊆ S′ ⊆ R:

$f_{S^{'}} freq (O) < minFreq$

and

$\begin{matrix} \forall O^{'}; O^{'} \in Ω; I_{S^{'}} (O^{'}) \neq I_{S^{'}} (O) \\ [f_{S^{'}} freq (O^{'}) \geq minFreq] \Rightarrow [f_{S^{'}} (O^{'}, O) = 0] \end{matrix}$

That is, all superdescriptions of a non f_S-interesting pattern I_S (O) are non f_S-interesting patterns. ■

Notice that, those non f_S-interesting patterns neither are frequent similar patterns nor contribute to the frequency of any frequent similar pattern. Additionally, from Proposition 4, all superdescriptions of a non f_S-interesting pattern are not frequent similar patterns. Therefore, all non f_S-interesting patterns can be pruned without losing frequent similar patterns.
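The distinction between frequent, interesting and non interesting patterns can be illustrated with a short Python sketch. It works on hypothetical one-feature subdescriptions (plain numbers), assumes a comparison function in the style of (4) with a value range of 10, and uses function names that are not part of the proposed algorithms.

```python
def sim(x, y):
    """Hypothetical non Boolean similarity on values ranging over [0, 10]."""
    return 1.0 - abs(x - y) / 10.0


def frequency(p, descriptions):
    """Frequency (7) of subdescription p over all object subdescriptions."""
    return sum(sim(p, q) for q in descriptions) / len(descriptions)


def non_interesting(descriptions, min_freq):
    """Subdescriptions that are not frequent and have similarity 0 to every
    frequent subdescription, so they can be pruned."""
    distinct = set(descriptions)
    frequent = {p for p in distinct if frequency(p, descriptions) >= min_freq}
    return {p for p in distinct - frequent
            if all(sim(q, p) == 0 for q in frequent)}


# Five objects: three share the value 0, one has 5 and one has 10.
descriptions = [0, 0, 0, 5, 10]

print(frequency(0, descriptions))           # 0.7
print(non_interesting(descriptions, 0.65))  # {10}
```

With minFreq = 0.65 only the value 0 is frequent. The value 5 is not frequent but contributes 0.5 to the frequency of 0, so it is interesting; the value 10 neither is frequent nor contributes to any frequent pattern.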

4 Proposed data structure

In order to compute the frequency and to reduce the number of similarity function evaluations for non Boolean similarity functions, we propose a data structure that we name STree^*. Given a set of features S ⊆ R, a ${STree}_{S}^{*}$ is a tree where each branch from the root to a leaf represents a subdescription P of an object O with respect to the set of features S (P = I_S (O)). Each leaf contains:

$P . O$ : Set of objects in Ω having a subdescription equal to P.

$P . S$ : Set of the pairs (P′, f_S (P′, P)), such that, P ≠ P′ and f_S (P′, P) ≠0.

$P . \tilde{c}$ : The similarity evaluation resulting from the partial occurrences of the subdescription P in the collection. If f_S (P′, P) ≠ 0 then P′ is a partial occurrence of P with a similarity value f_S (P′, P), so that $P . \tilde{c} = \sum_{O^{'} \in Ω, P^{'} = I_{S} (O^{'}), P^{'} \neq P} f_{S} (P^{'}, P)$ . However, the similarity (sim) between every repetition of P′ and P is always the same, so the sum of the similarities between the repetitions of P′ and P is equal to $sim * | P^{'} . O |$ , where $| P^{'} . O |$ is the number of repetitions of P′. Thus, the number of similarity function evaluations needed to compute $P . \tilde{c}$ can be reduced as follows: $P . \tilde{c} = \sum_{(P^{'}, sim) \in P . S} sim * | P^{'} . O |$

Given a ${STree}_{S}^{*}$ , the frequency of a subdescription I_S (O) can be computed as:

$f_{S} freq (O) = \frac{| I_{S} (O) . O | + I_{S} (O) . \tilde{c}}{| Ω |}$ (18)

In this expression $| I_{S} (O) . O |$ is the number of repetitions of the subdescription I_S (O) in Ω; since the similarity between these repetitions is 1, it is unnecessary to compute it.

On the other hand, if the non Boolean similarity function is non increasing monotonic then the number of similarity function evaluations can be reduced even more.

Proposition 5. Given a dataset Ω and a non increasing monotonic non Boolean similarity function f_S, if $I_{S} (O) . S = \{(I_{S} (O^{'}), f_{S} (O, O^{'})) \mid O^{'} \in Ω \land I_{S} (O^{'}) \neq I_{S} (O) \land f_{S} (O, O^{'}) > 0\}$ then $\forall \hat{S}$ , $\hat{S} \supset S$ : $f_{\hat{S}} (O, O^{'}) > 0$ only if $(I_{S} (O^{'}), f_{S} (O, O^{'})) \in (I_{S} (O) . S \cup \{(I_{S} (O), 1)\})$ .

Proof 5. The proof is immediate by contradiction. Suppose that there are $\emptyset \neq S \subset \hat{S}$ , O, O′ such that $f_{\hat{S}} (O, O^{'}) > 0$ and $(I_{S} (O^{'}), f_{S} (O, O^{'})) \notin (I_{S} (O) . S \cup \{(I_{S} (O), 1)\})$ . Since the pair is not in this set, f_S (O, O′) = 0, and therefore $f_{S} (O^{'}, O) < f_{\hat{S}} (O, O^{'})$ . However, this is a contradiction because f_S is non increasing monotonic, and then ∀S, $\hat{S} \subseteq R$ ; O, O′ ∈ Ω; $[\emptyset \neq S \subset \hat{S}] \Rightarrow [f_{S} (O^{'}, O) \geq f_{\hat{S}} (O^{'}, O)]$ . Consequently, $(I_{S} (O^{'}), f_{S} (O, O^{'})) \in (I_{S} (O) . S \cup \{(I_{S} (O), 1)\})$ . ■

By Proposition 5, let f_S be a non increasing monotonic non Boolean similarity function and let $I_{\hat{S}} (O)$ be a superdescription of I_S (O). Given the set $I_{S} (O^{'}) . S$ for all I_S (O′), O′ ∈ Ω, the number of similarity function evaluations can be reduced, since we do not need to compute the similarity between $I_{\hat{S}} (O)$ and all superdescriptions $I_{\hat{S}} (O^{'})$ . Only the similarity between $I_{\hat{S}} (O)$ and the superdescriptions $I_{\hat{S}} (O^{'})$ of I_S (O′) such that $(I_{S} (O), f_{S} (O, O^{'})) \in I_{S} (O^{'}) . S \cup \{(I_{S} (O^{'}), 1)\}$ needs to be computed.

Consequently, the frequency of a subdescription $I_{\hat{S}} (O)$ can be computed as follows:

$f_{\hat{S}} freq (O) = \frac{| I_{\hat{S}} (O) . O | + \sum_{\substack{I_{\hat{S}} (O^{'}) \in Ω, \hat{S} \supset S \\ (I_{S} (O), f_{S} (O, O^{'})) \in I_{S} (O^{'}) . S}} f_{\hat{S}} (O, O^{'}) * | I_{\hat{S}} (O^{'}) . O |}{| Ω |}$

In this expression the second term of the sum in the numerator is the sum of the similarities different from 0, between $I_{\hat{S}} (O)$ and the other subdescriptions in Ω that are not identical to $I_{\hat{S}} (O)$ . All this information is currently stored in the STree^* data structure.

In general a ${STree}_{S}^{*}$ is built as follows:

Building the empty ${STree}_{S}^{*}$ , which does not contain any branches.

Adding all object subdescriptions that the ${STree}_{S}^{*}$ will contain. If, for a description of an object O with respect to the set of features S (I_S (O)), there is not already a branch from the root to a leaf in the ${STree}_{S}^{*}$ representing this subdescription, then a new branch is added. The addition of a subdescription I_S (O) to the ${STree}_{S}^{*}$ ends with the addition of the object O to the set $I_{S} (O) . O$ .

Computing the set $I_{S} (O) . S$ from the subdescriptions contained in the data structure for each subdescription I_S (O) contained in the ${STree}_{S}^{*}$ .

Computing $I_{S} (O) . \tilde{c}$ from the sets $S$ of the subdescriptions contained in the data structure for each subdescription I_S (O) contained in the ${STree}_{S}^{*}$ .

When the ${STree}_{S}^{*}$ has been built, the frequency of each subdescription I_S (O) contained in the data structure can be computed by using (18).
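The construction steps and the frequency computation (18) can be sketched in Python as follows. This is a simplified sketch under stated assumptions, not the data structure itself: the tree is replaced by a flat dictionary keyed by subdescription, and the names `Leaf`, `build_stree` and `frequency`, as well as the example similarity function, are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Leaf:
    objects: list = field(default_factory=list)  # P.O: indices of the objects
    similar: dict = field(default_factory=dict)  # P.S: {P': f_S(P', P)}, P' != P
    c_tilde: float = 0.0                         # P.c~: partial occurrences


def build_stree(dataset, S, sim):
    """Flat stand-in for an STree*_S: maps each subdescription to its leaf."""
    leaves = {}
    # Step 2: add every object to the leaf of its subdescription.
    for index, obj in enumerate(dataset):
        key = tuple(obj[r] for r in S)
        leaves.setdefault(key, Leaf()).objects.append(index)
    # Step 3: one similarity evaluation per pair of distinct subdescriptions.
    for p, leaf in leaves.items():
        for q in leaves:
            if q != p:
                value = sim(q, p)
                if value != 0:
                    leaf.similar[q] = value
    # Step 4: weight each similarity by the number of repetitions of P'.
    for leaf in leaves.values():
        leaf.c_tilde = sum(value * len(leaves[q].objects)
                           for q, value in leaf.similar.items())
    return leaves


def frequency(leaves, key, n_objects):
    """Equation (18): (|P.O| + P.c~) / |dataset|."""
    leaf = leaves[key]
    return (len(leaf.objects) + leaf.c_tilde) / n_objects


def sim(p, q):
    """Hypothetical similarity for one numerical feature with range 10."""
    return 1.0 - abs(p[0] - q[0]) / 10.0


dataset = [{"x": 0}, {"x": 0}, {"x": 5}, {"x": 10}]
leaves = build_stree(dataset, ["x"], sim)
print(frequency(leaves, (0,), len(dataset)))   # 0.625
print(frequency(leaves, (10,), len(dataset)))  # 0.375
```

Because the two objects with x = 0 share one leaf, their similarity to the other subdescriptions is evaluated once and then multiplied by the number of repetitions.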

5 Frequent similar pattern mining algorithms

In the following subsections, the three frequent similar pattern mining algorithms proposed for using non Boolean similarity functions are described.

5.1 STree^*DC-Miner algorithm

The STree^*DC-Miner algorithm (Algorithm 1) allows mining frequent similar patterns using non Boolean similarity functions that hold the f_S-dcp.

The STree^*DC-Miner algorithm is based on the following points:

Assuming that f_S holds the f_S-dcp. As a consequence:

Non interesting patterns are pruned. For this, the search space is explored starting from the interesting patterns described by only one feature and then by means of successive expansions, in which a feature and a feature value are added to each interesting pattern. For each expansion of an interesting pattern, we verify whether the expansion is a frequent similar pattern; if so, larger patterns are explored.

The frequencies of expansions of patterns that are interesting but not frequent similar patterns are not computed, since only superdescriptions of frequent similar patterns can be frequent similar patterns. Additionally, there is no need to compute the similarity between expansions of patterns that are interesting but not frequent similar patterns.

If the similarity of a subdescription P to another subdescription P′ is equal to 0, there is no need to compute the similarity of a superdescription $\hat{P}$ of P to a superdescription $\hat{P^{'}}$ of P′.

Treating identical subdescriptions as a single subdescription (using ${STree}_{S}^{*}$ ) allows STree^*DC-Miner to reduce the number of similarity function evaluations.

Let ≺ be a linear order on R and f_S a non increasing monotonic non Boolean similarity function. We consider that:

A subset of features S is expandable, if S = ∅ or there is at least one frequent similar pattern I_S (O).

A subset of features $\hat{S}$ is a direct expansion of S, if S is expandable, $\hat{S} = S \cup \{r\}$ , r ∈ R and ∀r′ ∈ S, r′ ≺ r.

A subset of features $\hat{\hat{S}}$ is an expansion of S, iff $\hat{\hat{S}}$ is a direct expansion of S, or there is an $\hat{S}$ such that $\hat{S}$ is a direct expansion of S and $\hat{\hat{S}}$ is an expansion of $\hat{S}$ .
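The expansion definitions above can be sketched in Python as follows. In this sketch a feature set is a tuple kept in the order ≺, R is a list in that same order, and the `expandable` predicate is a hypothetical placeholder (always true by default) for the check that S is empty or has at least one frequent similar pattern.

```python
def direct_expansions(S, R):
    """All S + (r,) where every feature already in S precedes r in R."""
    start = R.index(S[-1]) + 1 if S else 0
    return [S + (r,) for r in R[start:]]


def all_expansions(S, R, expandable=lambda s: True):
    """Every expansion of S, generated depth-first through direct expansions."""
    found = []
    if not expandable(S):
        return found
    for expanded in direct_expansions(S, R):
        found.append(expanded)
        found.extend(all_expansions(expanded, R, expandable))
    return found


R = ["a", "b", "c"]
print(direct_expansions(("b",), R))  # [('b', 'c')]
print(all_expansions((), R))
```

Because a feature can only be added after all features already in S, each non-empty subset of R is generated at most once; with every subset expandable, the three features above yield exactly the 7 non-empty subsets.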

The STree^*DC-Miner algorithm (Algorithm 1) starts with an empty set of features $\hat{S}$ and a null ${STree}_{S}^{*}$ .

5.2 STree^*NDC-Miner algorithm

The STree^*NDC-Miner algorithm (Algorithm 2) allows mining frequent similar patterns using non Boolean similarity functions that do not hold the f_S-dcp.

If a non Boolean similarity function does not hold the f_S-dcp then we cannot ensure that non interesting patterns can be pruned without losing frequent similar patterns. Also, the similarity from all subdescriptions P to all subdescriptions P′, such that P ≠ P′, should be computed; otherwise, some frequent similar patterns could be lost. Therefore, in order to avoid the loss of frequent similar patterns, the frequent similar patterns must be searched for all S ⊆ R, S ≠ ∅. However, STree^*NDC-Miner prunes this search based on the following points:

Considering identical subdescriptions as a unique subdescription reduces the number of similarity function evaluations.

The search space is explored from subdescriptions with all features to subdescriptions with only one feature, by means of successive reductions removing one feature each time. Notice that, given a subdescription P, the number of repetitions of P is equal to the sum of the number of repetitions for each superdescription $\hat{P}$ of P, such that P is obtained by removing the same feature from $\hat{P}$ . Therefore, in the STree^* structure it is enough to store the number of objects associated with each subdescription into a new field $\bar{\bar{c}}$ instead of storing the list of objects $P . O$ . This also reduces the computational effort and memory requirements.
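The counting argument in the second point can be illustrated with a short Python sketch. The dictionary of repetition counts, the feature positions and the function name are hypothetical; the sketch only shows that the counts for a reduced feature set follow from the counts of its superdescriptions without revisiting the objects.

```python
from collections import Counter


def reduce_counts(counts, position):
    """Repetition counts after removing the feature at `position`.

    `counts` maps each subdescription (a tuple of values) to its number of
    repetitions; superdescriptions that collapse to the same reduced
    subdescription have their counts added.
    """
    reduced = Counter()
    for description, repetitions in counts.items():
        reduced[description[:position] + description[position + 1:]] += repetitions
    return reduced


# Hypothetical counts for subdescriptions over the features (colour, size).
counts = {("red", "S"): 2, ("red", "L"): 1, ("blue", "S"): 3}

by_colour = reduce_counts(counts, 1)  # remove the feature "size"
print(by_colour[("red",)], by_colour[("blue",)])  # 3 3
```

Only the integer counts are needed at each level, which is why a counter field can replace the list of objects.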

Let ≺ be a linear order on R and f_S a non Boolean similarity function that does not hold the f_S-dcp. We consider that:

A subset of features S is reducible, if S ≠ ∅.

A subset of features $\check{S}$ is a direct reduction of S, if S is reducible, $\check{S} = S \setminus \{r\}$ , $\check{S} \neq \emptyset$ , r ∈ R and ∀r′ ∈ (R \ S), r′ ≺ r.

A subset of features $\check{\check{S}}$ is a reduction of S, if $\check{\check{S}}$ is a direct reduction of S, or there is an $\check{S}$ such that $\check{S}$ is a direct reduction of S and $\check{\check{S}}$ is a reduction of $\check{S}$ .
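The reduction definitions above can be sketched in Python as follows. The sketch assumes that the removed feature r belongs to S and that R is given as a list in the order ≺; the function names are illustrative only.

```python
def direct_reductions(S, R):
    """All non-empty S - {r} such that every feature already removed from R
    precedes r. S is a frozenset of features; R is a list in the order."""
    if len(S) <= 1:
        return []  # removing a feature would leave the empty set
    position = {r: i for i, r in enumerate(R)}
    last_removed = max((position[r] for r in R if r not in S), default=-1)
    return [S - {r} for r in R if r in S and position[r] > last_removed]


def all_reductions(S, R):
    """Every reduction of S, generated depth-first through direct reductions."""
    found = []
    for reduced in direct_reductions(S, R):
        found.append(reduced)
        found.extend(all_reductions(reduced, R))
    return found


R = ["a", "b", "c"]
reductions = all_reductions(frozenset(R), R)
print(len(reductions), len(set(reductions)))  # 6 6
```

Since features are removed in increasing order, every proper non-empty subset of R is reached exactly once when starting from R, so no feature set is analyzed twice.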

The STree^*NDC-Miner algorithm (Algorithm 2) starts from the whole set of features $\hat{S} = R$ and a null ${STree}_{S}^{*}$ .

5.3 RP^*-Miner algorithm

For a non Boolean similarity function which does not hold the f_S-dcp, some superdescriptions of the non frequent similar patterns could be frequent similar patterns. Therefore, if all non f_S-interesting patterns are removed (like STree^*DC-Miner does), then some frequent similar patterns could be lost. On the other hand, if the search space of frequent similar patterns is exhaustively explored (like STree^*NDC-Miner does), then all frequent similar patterns are found, but this process is very expensive. We propose the RP^*-Miner algorithm as an alternative.

The general idea of RP^*-Miner is as follows: First, all frequent similar patterns with only one feature are obtained. Then, each frequent similar pattern is successively expanded by adding a feature and a feature value. It is important to highlight that a pattern with k features could be obtained by expanding different patterns with k - 1 features. In our expansion process, if a pattern has been previously obtained then the repeated pattern is neither analyzed, nor expanded.
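The general idea can be sketched in Python as follows. This is a loose illustration under several assumptions and not Algorithm 3: patterns are tuples of (feature, value) pairs sorted by feature, the similarity is function (2) with exact-match comparisons, frequencies are computed directly with (7) instead of using an STree^*, and all names are hypothetical.

```python
def sim_mean(pattern, obj):
    """Similarity (2) with exact-match comparisons over the pattern features."""
    return sum(1.0 for r, v in pattern if obj[r] == v) / len(pattern)


def frequency(pattern, dataset):
    """Frequency (7) of a pattern in the dataset."""
    return sum(sim_mean(pattern, obj) for obj in dataset) / len(dataset)


def mine_sketch(dataset, features, min_freq):
    analyzed = set()  # W: patterns that have already been analyzed
    frequent = {}
    # Candidate patterns with only one feature.
    level = [((r, obj[r]),) for obj in dataset for r in features]
    while level:
        next_level = []
        for pattern in level:
            if pattern in analyzed:
                continue  # repeated pattern: neither analyzed nor expanded
            analyzed.add(pattern)
            freq = frequency(pattern, dataset)
            if freq < min_freq:
                continue  # only frequent similar patterns are expanded
            frequent[pattern] = freq
            used = {r for r, _ in pattern}
            for obj in dataset:
                if all(obj[r] == v for r, v in pattern):
                    for r in features:
                        if r not in used:
                            expanded = pattern + ((r, obj[r]),)
                            next_level.append(tuple(sorted(expanded)))
        level = next_level
    return frequent


dataset = [{"a": 1, "b": 1}, {"a": 1, "b": 2}]
found = mine_sketch(dataset, ["a", "b"], 0.6)
for pattern, freq in found.items():
    print(pattern, freq)
```

In this toy run the pattern (b = 1) has frequency 0.5 and is not frequent, while its superdescription (a = 1, b = 1) has frequency 0.75, which shows why the f_S-dcp cannot be used for pruning with similarity function (2).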

The RP^*-Miner algorithm (Algorithm 3) starts with an empty set of features $\hat{S}$ , an empty set of analyzed patterns W, and a null ${STree}_{S}^{*}$ .

6 Experimental results

In this section, we show the performance of the proposed algorithms. For comparing the performance of our algorithms we use the following measures:

Efficiency of an algorithm in terms of its runtime. Less runtime implies more efficiency.

Efficacy of an algorithm in terms of the number of frequent similar patterns obtained. More patterns implies more efficacy.

Quality of the patterns found by an algorithm in terms of the accuracy achieved by a simple supervised classifier based on frequent patterns. This classifier, in the training phase, obtains the set of frequent similar patterns from each class and removes all frequent similar patterns that appear in more than one class, keeping only patterns that represent objects from a single class. In the classification phase, each object of the test collection is classified into the class where there are more frequent similar patterns similar to its subdescriptions. The accuracy is measured as the percentage of objects correctly classified. For each collection and for each value for the minFreq parameter, we repeated the experiment 10 times, randomly selecting 50% of the collection for training and using the remaining objects for testing.

Table 1 gives a description of the collections used in our experiments.

Table 1
Description of the datasets used in the experiments

Ω	Objects	Features	Classes
Diabetes	178	8	2
Liver Disorders	345	6	2
Iris	150	4	3

In section 6.1, the proposed algorithm STree^*DC-Miner is compared to STreeDC-Miner using a non Boolean similarity function that holds the f_S downward closure property. For applying STreeDC-Miner the similarity function was Booleanized. In section 6.2, STree^*DC-Miner, STree^*NDC-Miner and RP^*-Miner are compared using a non Boolean similarity function that does not hold the f_S-dcp.

6.1 Similarity function holding the f_S-dcp

For this experiment, as a non Boolean similarity function that holds the f_S-dcp we use the similarity function (1) from section 2 jointly with the comparison function (4) for numerical features and the equality (3) for non numerical features. We use the similarity function (1) using the comparison function (5) with α = 0.9 for numerical features and the equality (3) for non numerical features, as Boolean similarity function that holds the f_S downward closure property. Notice that the function (1) by using (5) for numerical features, and the equality (3) for non numerical features, is a Booleanization of the non Boolean similarity function obtained by using (1) and (4).
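The effect of such a Booleanization on the set of frequent similar patterns can be illustrated with a toy Python sketch. It uses a single hypothetical numerical feature with range [0, 10], the comparison functions (4) (assuming the absolute difference in the numerator) and (5) with α = 0.9, and three made-up values; it does not reproduce the experiments reported here.

```python
def graded(x, y, min_r=0.0, max_r=10.0):
    """Comparison function (4) for a feature with the assumed range [0, 10]."""
    return 1.0 - abs(x - y) / (max_r - min_r)


def booleanized(x, y, alpha=0.9):
    """Comparison function (5): Booleanization of (4) with threshold alpha."""
    return 1.0 if graded(x, y) >= alpha else 0.0


def frequent_values(values, compare, min_freq):
    """Values whose frequency (7) reaches min_freq; one feature only, so the
    similarity (1) reduces to the comparison function."""
    n = len(values)
    return {x for x in values
            if sum(compare(x, y) for y in values) / n >= min_freq}


values = [0.0, 0.5, 10.0]

# The Booleanization reports 0.0 as frequent although its graded frequency is 0.65.
print(frequent_values(values, graded, 0.66))
print(frequent_values(values, booleanized, 0.66))

# The Booleanization loses 10.0, whose graded frequency is 0.35.
print(frequent_values(values, graded, 0.34))
print(frequent_values(values, booleanized, 0.34))
```

Depending on the threshold, the Booleanized function can both add patterns that are not frequent under the original function and drop patterns that are.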

In this experiment, for mining frequent similar patterns with the non Boolean similarity function we use STree^*DC-Miner. For mining frequent similar patterns with the Boolean similarity function we only use STreeDC-Miner: although ObjectMiner, STreeDC-Miner, STreeNDC-Miner and RP-Miner obtain the same set of frequent similar patterns (they have the same efficacy), STreeDC-Miner is the most efficient [16].

Figure 1 shows the runtime of STree^*DC-Miner and STreeDC-Miner for several values of the minFreq threshold for the collections Diabetes, Liver Disorders and Iris. From this figure, we can see that STreeDC-Miner is more efficient than STree^*DC-Miner. This result can be explained by the following reasons:

STreeDC-Miner uses arithmetic operations with integer numbers (similarity values are 0 or 1), which are faster than the floating-point arithmetic operations used by STree^*DC-Miner (similarity values are in [0, 1]).

The number of similarity function evaluations made by STreeDC-Miner is less than the number of similarity function evaluations made by STree^*DC-Miner (see Fig. 2).

The number of frequent similar patterns obtained by STreeDC-Miner is smaller than the number of frequent similar patterns obtained by STree^*DC-Miner (see Fig. 3).

Fig.1

Runtime using a non Boolean similarity function, which holds the f_S downward closure property and a Booleanization of this function, for (a) Diabetes, (b) Liver Disorders and (c) Iris.

Fig.2

Number of similarity function evaluations using a non Boolean similarity function, which holds the f_S-dcp and a Booleanization of this function, for (a) Diabetes, (b) Liver Disorders and (c) Iris.

Fig.3

Number of frequent similar patterns using a non Boolean similarity function, which holds the f_S-dcp and a Booleanization of this function, for (a) Diabetes, (b) Liver Disorders and (c) Iris.

The quality of the set of frequent similar patterns obtained by using the non Boolean similarity function and its Booleanization is shown in Table 2. As a reference, we included the quality of the set of frequent similar patterns obtained using the equality as similarity function, like in the traditional approach. For each collection and for all the algorithms used for mining frequent similar patterns, the classification was done by testing different values of the minFreq threshold from minFreq = 0.1 until minFreq = 0.9 with increments of 0.1.

From Table 2, we can observe that the classification accuracies achieved by using the frequent similar patterns obtained by our algorithm (STree^*DC-Miner) are better than the classification accuracies achieved using the frequent similar patterns obtained by STreeDC-Miner using a Booleanization of this function. These results confirm the negative effect of transforming a non Boolean similarity function into a Boolean one.

It can be also noticed from Table 2 that using the equality, as in the traditional approach, the classification accuracies reached by the set of frequent patterns are lower than the classification accuracies reached by using the set of frequent similar patterns obtained using either Boolean or non Boolean similarity functions (different from the equality).

Table 2

Quality of the sets of frequent similar patterns using a non Boolean similarity function, which holds the f_S-dcp, a Booleanization of this function, and the Traditional Approach

Ω	minFreq	STreeDC-Miner	Traditional Approach	STree^*DC-Miner
Diabetes	0.10	71.9	0.0	74.0
	0.20	73.4	0.0	74.1
	0.30	73.2	0.0	74.1
	0.40	72.2	0.0	74.0
	0.50	70.5	0.0	74.1
	0.60	69.7	0.0	73.8
	0.70	62.4	0.0	73.4
	0.80	25.1	0.0	71.3
	0.90	0.0	0.0	55.5
Max accuracies		73.2	0.0	74.1
Liver Disorders	0.10	53.4	32.0	55.3
	0.20	53.2	20.8	55.2
	0.30	50.0	13.3	55.3
	0.40	49.5	2.3	54.9
	0.50	43.6	0.0	54.1
	0.60	33.2	0.0	53.5
	0.70	14.3	0.0	55.4
	0.80	0.0	0.0	57.1
	0.90	0.0	0.0	6.3
Max accuracies		53.4	32.0	57.1
Iris	0.10	88.3	64.9	92.3
	0.20	84.7	42.7	92.3
	0.30	74.0	23.9	92.0
	0.40	60.1	17.9	92.3
	0.50	30.1	15.1	92.3
	0.60	15.6	7.5	90.8
	0.70	3.1	1.3	85.1
	0.80	0.0	0.0	55.9
	0.90	0.0	0.0	0.0
Max accuracies		88.3	64.9	92.3

Additionally, Table 2 shows that the classification accuracy decreases as the minFreq threshold increases, because the number of frequent patterns also decreases. However, the use of non Boolean similarity functions allows our algorithm to find more frequent similar patterns, even for high values of the minFreq threshold; therefore, the accuracy decreases more slowly than when using the frequent patterns found by algorithms that use Boolean similarity functions. Moreover, when equality was used as the similarity function, the number of frequent patterns found for the Diabetes collection was too low even for small values of the minFreq threshold. Therefore, for this collection, the classifier was unable to correctly classify any object in the testing set.

6.2 Similarity functions not holding the f_S-dcp

We used the similarity function (2) from section 2 as a non Boolean similarity function that does not hold the f_S-dcp. For this similarity function we used the comparison function (4) for numerical features and the equality (3) for non numerical features.

Figure 4 shows the runtime of STree^*DC-Miner, RP^*-Miner and STree^*NDC-Miner for several values of the minFreq threshold over the Diabetes, Liver Disorders and Iris datasets.

Fig.4

Runtime using a non Boolean similarity function, which holds the f_S downward closure property and a Booleanization of this function, for (a) Diabetes, (b) Liver Disorders and (c) Iris.

For all collections, the fastest algorithm was STree^*DC-Miner, followed by RP^*-Miner. The runtime of STree^*NDC-Miner was almost always greater than the runtimes of STree^*DC-Miner and RP^*-Miner. This is a consequence of the number of frequent similar patterns obtained (see Table 3) and the number of similarity function evaluations (see Figure 5). Nevertheless, the runtime of STree^*NDC-Miner was always less than 76 seconds for the tested collections.

Table 3

Number of frequent similar patterns using a non Boolean similarity function, which does not hold the f_S-dcp

Ω	minFreq	STree^*DC-Miner	RP^*-Miner	STree^*NDC-Miner
Diabetes	0.80	211592	211592	226225
	0.82	154652	154652	168696
	0.84	93999	93999	105711
	0.86	42754	42899	50994
	0.88	7998	8163	10355
	0.90	448	809	873
	0.92	0	0	0
	0.94	0	0	0
	0.96	0	0	0
	0.98	0	0	0
Liver Disorders	0.80	256	256	267
	0.82	232	232	239
	0.84	207	207	219
	0.86	161	161	185
	0.88	120	120	138
	0.90	61	61	67
	0.92	0	0	0
	0.94	0	0	0
	0.96	0	0	0
	0.98	0	0	0
Iris	0.80	29	45	65
	0.82	6	21	21
	0.84	4	4	4
	0.86	1	1	1
	0.88	0	0	0
	0.90	0	0	0
	0.92	0	0	0
	0.94	0	0	0
	0.96	0	0	0
	0.98	0	0	0

Fig.5

It is important to highlight that STree^*NDC-Miner always found all frequent similar patterns, since it follows an exhaustive strategy, while STree^*DC-Miner, which assumes that the similarity function holds the f_S-dcp, could not find all of them. On the other hand, RP^*-Miner, due to its relaxed prune, generally found more frequent similar patterns than STree^*DC-Miner.
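The difference between an exhaustive strategy and a search that prunes under the downward closure assumption can be sketched on a toy example. The sketch below is not any of the proposed algorithms: patterns are reduced to plain feature subsets, the frequencies in `TOY_FREQ` are made up and deliberately violate downward closure, and the relaxed prune of RP^*-Miner is not modelled. All names are hypothetical.

```python
from itertools import combinations

FEATURES = ("a", "b", "c")

# Made-up frequencies that do NOT satisfy downward closure:
# {a} is infrequent at minFreq = 0.8, yet its superset {a, b} is frequent.
TOY_FREQ = {
    frozenset("a"): 0.50, frozenset("b"): 0.90, frozenset("c"): 0.90,
    frozenset("ab"): 0.85, frozenset("ac"): 0.40, frozenset("bc"): 0.90,
    frozenset("abc"): 0.82,
}


def exhaustive(min_freq):
    """Evaluate every non-empty feature subset (no pruning)."""
    found = set()
    for k in range(1, len(FEATURES) + 1):
        for combo in combinations(FEATURES, k):
            subset = frozenset(combo)
            if TOY_FREQ[subset] >= min_freq:
                found.add(subset)
    return found


def levelwise_with_pruning(min_freq):
    """Level-wise search that evaluates a candidate of size k only if
    all of its subsets of size k-1 were found frequent."""
    level = {frozenset([f]) for f in FEATURES
             if TOY_FREQ[frozenset([f])] >= min_freq}
    found = set(level)
    k = 2
    while level:
        next_level = set()
        for combo in combinations(FEATURES, k):
            subset = frozenset(combo)
            parents_frequent = all(subset - {f} in level for f in subset)
            if parents_frequent and TOY_FREQ[subset] >= min_freq:
                next_level.add(subset)
        found |= next_level
        level = next_level
        k += 1
    return found


all_patterns = exhaustive(0.8)
pruned_patterns = levelwise_with_pruning(0.8)
missed = all_patterns - pruned_patterns
print(len(all_patterns), len(pruned_patterns), sorted(map(sorted, missed)))
```

In this toy setting the pruned search never evaluates {a, b} or {a, b, c}, because {a} is infrequent, so it reports fewer patterns than the exhaustive search. This is the same kind of gap as the one between the STree^*DC-Miner and STree^*NDC-Miner columns of Table 3.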

The quality of the sets of frequent similar patterns obtained by RP^*-Miner, STree^*DC-Miner and STree^*NDC-Miner using the non Boolean similarity function that does not hold the f_S-dcp is shown in Table 4. Again, we included the quality of the set of frequent patterns obtained using equality as the similarity function, as in the traditional approach. For each collection and for all the algorithms used for mining frequent similar patterns, the classification was done by testing different values of the minFreq threshold, from minFreq = 0.80 to minFreq = 0.98 in increments of 0.02.

Table 4

Quality of the sets of frequent similar patterns using a non Boolean similarity function, which does not hold the f_S-dcp

Ω	minFreq	STree^*DC-Miner	RP^*-Miner	STree^*NDC-Miner
Diabetes	0.80	76.10	76.10	76.46
	0.82	76.26	76.24	76.58
	0.84	76.12	76.30	76.36
	0.86	73.20	73.20	76.30
	0.88	73.20	73.20	75.70
	0.90	73.20	72.76	73.20
	0.92	36.10	73.20	73.20
	0.94	0.00	73.20	73.20
	0.96	0.00	36.10	36.10
	0.98	0.00	0.00	0.00
Max accuracies		76.26	76.30	76.58
Liver Disorders	0.80	45.52	40.00	45.42
	0.82	45.07	45.07	52.69
	0.84	42.49	42.49	52.99
	0.86	43.38	43.38	46.77
	0.88	40.40	40.40	40.85
	0.90	39.05	46.77	44.23
	0.92	35.42	58.01	39.05
	0.94	3.63	39.05	36.32
	0.96	0.00	35.42	35.42
	0.98	0.00	0.00	0.00
Max accuracies		45.07	45.07	52.99
Iris	0.80	86.53	86.53	90.80
	0.82	85.33	85.33	90.93
	0.84	76.40	76.40	90.00
	0.86	69.07	79.33	89.60
	0.88	63.07	74.67	86.53
	0.90	54.93	71.60	84.27
	0.92	41.33	69.47	60.67
	0.94	36.80	42.40	34.13
	0.96	33.33	34.13	33.47
	0.98	13.33	33.33	33.33
Max accuracies		86.53	86.53	90.93

In most cases, the quality of the set of frequent similar patterns found by STree^*NDC-Miner was better than the quality of the sets of frequent similar patterns found by STree^*DC-Miner and RP^*-Miner. In fact, the best classification accuracies (cells with bold text in Table 4) were obtained by STree^*NDC-Miner, followed by RP^*-Miner.
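As a small illustration of how a "Max accuracies" entry relates to the threshold sweep, the following sketch rebuilds the minFreq grid of Table 4 and picks the best accuracy for one column. The accuracy values are copied from the Iris rows of the STree^*NDC-Miner column of Table 4; the variable names are hypothetical.

```python
# minFreq grid used for Table 4: 0.80 to 0.98 in increments of 0.02.
# Each value is computed from an integer index and rounded, so that
# floating-point error does not accumulate across steps.
thresholds = [round(0.80 + 0.02 * i, 2) for i in range(10)]

# Accuracies reported in Table 4 for Iris with STree^*NDC-Miner.
iris_ndc = [90.80, 90.93, 90.00, 89.60, 86.53, 84.27, 60.67, 34.13, 33.47, 33.33]

# Pair each accuracy with its threshold and keep the pair with the highest accuracy.
best_acc, best_thr = max(zip(iris_ndc, thresholds))
print(best_thr, best_acc)
```

For this column the maximum is reached at minFreq = 0.82, which matches the 90.93 shown in the corresponding "Max accuracies" row.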

7 Conclusions

In this paper, we focused on frequent similar pattern mining using non Boolean similarity functions. We proposed and proved several properties that allow pruning the search space of frequent similar patterns when a non Boolean similarity function is used. We also proposed three novel algorithms: STree^*DC-Miner, RP^*-Miner and STree^*NDC-Miner. The STree^*DC-Miner algorithm assumes that the similarity function holds the f_S downward closure property and uses it to prune the search space of frequent similar patterns. The STree^*NDC-Miner algorithm does not assume that the similarity function holds any property; therefore, it does not prune the search space. Finally, the RP^*-Miner algorithm performs a relaxed prune of the search space of frequent similar patterns. In this way, RP^*-Miner finds more frequent similar patterns than STree^*DC-Miner with less computational effort than STree^*NDC-Miner.

From our experiments, we can conclude that, for problems where the similarity function holds the f_S downward closure property, our proposed STree^*DC-Miner algorithm obtains a set of frequent similar patterns with better quality than the sets obtained when the similarity function is Booleanized. This confirms the negative effect of transforming a non Boolean similarity function into a Boolean one for computing frequent similar patterns with the miners reported in the literature. Moreover, the quality of the frequent similar patterns obtained by STree^*DC-Miner is also better than the quality of the frequent patterns obtained using equality as the similarity function (traditional approach).

For problems where the similarity function does not hold the f_S downward closure property, we can conclude that RP^*-Miner finds more frequent similar patterns, with better quality, than STree^*DC-Miner. For this kind of function, STree^*NDC-Miner is the only algorithm that finds all frequent similar patterns; however, RP^*-Miner is faster.

As future work, we plan to extend closed frequent similar pattern mining and association rule mining to non Boolean similarity functions and non-monotonic similarity functions.

Footnotes

This is an extension of the STree structure proposed in [].

References

1. Han, Cheng, Xin and Yan, Frequent pattern mining: Current status and future directions, Data Mining and Knowledge Discovery 15 (2007), 55–86.

2. Fernández, Gómez, Lecumberry, Pardo and Ramírez, Pattern recognition in Latin America in the big data era, Pattern Recognition 48 (2015), 1185–1196.

3. Chiu C.Y., Yeh C.T. and Lee, Frequent Pattern Based User Behavior Anomaly Detection for Cloud System, In Proceedings of the Conference on Technologies and Applications of Artificial Intelligence (TAAI), Taiwan, 2013.

4. Fan, Ye and Chen, Malicious sequential pattern mining for automatic malware detection, Expert Systems with Applications 52 (2016), 16–25.

5. Nahar, Imam, Tickle K.S. and Chen Y.P., Association rule mining to detect factors which contribute to heart disease in males and females, Expert Systems with Applications 40 (2013), 1086–1093.

6. Wen, Zhong and Wang, Activity recognition with weighted frequent patterns mining in smart environments, Expert Systems with Applications 42 (2015), 6423–6432.

7. Kotsiantis and Kanellopoulos, Association rules mining: A recent overview, International Transactions on Computer Science and Engineering 32 (2006), 71–82.

8. Hernández-León, Carrasco-Ochoa J.A., Martínez-Trinidad J.F. and Hernández-Palancar, Classification based on specific rules and inexact coverage, Expert Systems with Applications 39 (2012), 11203–11211.

9. Beil, Ester and Xu, Frequent term-based text clustering, In Proceedings of the 2002 ACM SIGKDD International Conference on Knowledge Discovery in Databases (KDD02), Edmonton, Canada, 2002, pp. 436–442.

10. Agrawal, Imielinski and Swami, Mining associations between sets of items in massive databases, In Buneman and Jajodia (eds), Proceedings of the ACM SIGMOD International Conference on Management of Data, Washington DC, 1993, pp. 207–216.

11. Yates R.B. and Neto B.R., Modern Information Retrieval, Addison-Wesley, New York, 1999.

12. Zhang, Wang Y.J., Cui and Cong, Semantic similarity based on compact concept ontology, In Proceedings of the 17th International Conference on World Wide Web (WWW ’08), ACM, New York, NY, USA, 2008.

13. Janowicz and Wilkes, SIM-DL_A: A Novel Semantic Similarity Measure for Description Logics Reducing Inter-Concept to Inter-Instance Similarity, In Proceedings of the 6th Annual European Semantic Web Conference (ESWC2009), LNCS 5554, Springer Verlag, Berlin, Germany, 2009, pp. 353–367.

14. Ortiz-Posadas M.R., The Logical Combinatorial Approach Applied to Pattern Recognition in Medicine, New Trends and Advanced Methods in Interdisciplinary Mathematical Sciences, Springer, Cham, 2017, pp. 169–188.

15. Dánger, Ruiz-Shulcloper and Berlanga, Objectminer: A New Approach for Mining Complex Objects, In Proceedings of the Sixth International Conference on Enterprise Information Systems, Oporto, Portugal, 2004, pp. 42–47.

16. Rodríguez-González A.Y., Martínez-Trinidad J.F., Carrasco-Ochoa J.A. and Ruiz-Shulcloper, Mining frequent patterns and association rules using similarities, Expert Systems with Applications 40 (2013), 6823–6836.

17. Rodríguez-González A.Y., Martínez-Trinidad J.F., Carrasco-Ochoa J.A. and Ruiz-Shulcloper, RP-Miner: A relaxed prune algorithm for frequent similar pattern mining, Knowledge and Information Systems 27 (2011), 451–471.

18. Rodríguez-González A.Y., Lezama, Iglesias-Alvarez C.A., Martínez-Trinidad J.F., Carrasco-Ochoa J.A. and Muños de Cote, Closed frequent similar pattern mining: Reducing the number of frequent similar patterns without information loss, Expert Systems with Applications 96 (2018), 271–283.

19. Rodríguez-González A.Y., Martínez-Trinidad J.F., Carrasco-Ochoa J.A. and Ruiz-Shulcloper, Using Non Boolean Similarity Functions for Frequent Similar Pattern Mining, In Proceedings of the 23rd Canadian Conference on Artificial Intelligence 2010 (AI 2010), LNCS 6085, Springer Verlag, Berlin, Germany, 2010, pp. 374–378.

20. Agrawal and Srikant, Fast Algorithms for Mining Association Rules in Large Databases, In Proceedings of 20th International Conference on Very Large Data Bases, Morgan Kaufmann, Santiago de Chile, Chile, 1994, pp. 487–499.

Table 1

Description of the datasets used in the experiments

Ω	Objects	Features	Classes
Diabetes	178	8	2
Liver Disorders	345	6	2
Iris	150	4	3