A novel approach for hiding sensitive utility and frequent itemsets

Abstract

Data is shared among different organizations for mutual benefit. Data mining techniques are utilized to discover valuable knowledge for decision-making. However, data mining also threatens to disclose sensitive information. Thus, sensitive knowledge should be concealed before the data is released. Previous works address either the association rule hiding problem or the utility itemset hiding problem. This paper focuses on protecting sensitive utility and frequent itemsets, and a sanitization approach named HUFI is presented. The sensitive itemsets are hidden by reducing their support or utility below the minimum thresholds. For a sensitive itemset, the concept of maximum boundary value is introduced to determine the hiding strategy. Then, a transaction supporting the minimal number of non-sensitive itemsets is selected to be sanitized. In such a transaction, a weight is assigned to each item contained in the sensitive itemset, and the item with the highest weight is selected to be modified. We compared HUFI with state-of-the-art algorithms on various databases. The experimental results show that HUFI outperforms the other algorithms in minimizing the side effects on non-sensitive knowledge and in maintaining the database quality after the sanitization process. In addition, the impact of database density on the sanitization approaches is observed.

Keywords

Sensitive utility and frequent itemsets, sanitization, side effects, maximum boundary value

1. Introduction

Data mining is the process of extracting potentially useful information and knowledge from large amounts of data to help users make effective decisions. Different techniques are used to find different kinds of knowledge. According to the kind of knowledge of interest, data mining tasks can be divided into summarization, classification, clustering, association and trend analysis [18]. Association rule mining [23] is the most commonly used technique; it relies on the support and confidence framework to detect relationships among items in a transaction database. The support measure is used because highly frequent itemsets are assumed to be of interest to users. However, the frequency of an itemset may not be a sufficient indicator of interestingness, because it only reflects the number of transactions that support the itemset. It does not reveal the utility of an itemset, which can be measured as profit. Thus, utility mining was proposed [13]. Previous works focus on either association rule mining or utility mining. In order to find the itemsets with both high utility and high support, utility and frequent itemset mining is presented.

In the current competitive environment, data is shared among different organizations for mutual benefit during business collaborations. However, data sharing brings the risk of disclosing sensitive knowledge. O’Leary [5] first pointed out that knowledge discovery can be a threat to database security. Clifton and Marks [4] then described a scenario to illustrate the threat of data mining and argued that confidential data should be hidden before the data is released. To address this issue, sensitive knowledge can be hidden by transforming the original database into a sanitized one through specific privacy strategies; this hiding process is called data sanitization. In recent years, Privacy Preserving Data Mining (PPDM) has become an important research direction. In this paper, we focus on Privacy Preserving Utility and Frequent itemset Mining (PPUFM). The contributions of our work are summarized as follows. Firstly, the concept of maximum boundary value is proposed to determine whether the support or the utility hiding strategy is used to hide a sensitive itemset. Secondly, the transaction supporting the minimal number of non-sensitive itemsets is selected to be sanitized. Experimental results show that the proposed algorithm outperforms the other algorithms in terms of maintaining the database quality and minimizing the side effects on non-sensitive knowledge.

The remainder of this paper is organized as follows. Section 2 provides a review of related research. Some preliminary definitions are described and the sanitization problem is formulated in Section 3. Section 4 presents a novel hidden approach for concealing the sensitive utility and frequent itemsets. In Section 5, the proposed approach is compared with the state of the art algorithms. Finally, the conclusions are made in Section 6.

2. The related works

There have been a great number of works on sensitive knowledge hiding. Atallah et al. [16] first proved that optimal sanitization is NP-hard and proposed a heuristic approach to solve the security problem. Following this work, many approaches were presented. Dasseni and Verykios [6, 31] proposed three strategies and five sanitization approaches for hiding sensitive association rules. Algorithm 1.a hides a sensitive rule by increasing the support of the rule’s antecedent. Algorithm 1.b conceals a sensitive rule by decreasing the frequency of the rule’s consequent. Algorithm 2.a reduces the support of a sensitive rule until its support or confidence is below the given thresholds. Algorithms 2.b and 2.c hide sensitive rules by reducing the support of their generating itemsets. All five algorithms assume that the generating itemsets of the sensitive rules are disjoint. Thus, they cannot deal with overlapping rules. Oliveira and Zaïane [27] presented a one-scan sanitization approach named SWA for hiding sensitive association rules. In addition, a disclosure threshold is assigned to each rule to improve the balance between knowledge discovery and privacy.

Amiri [1] presented three heuristic approaches for hiding sensitive itemsets. The Aggregate approach removes whole transactions from the database. The Disaggregate approach deletes some items from the transactions to hide the sensitive itemsets. The Hybrid approach is the combination of the previous two hiding approaches. Wu et al. [35] developed a template-based algorithm to protect sensitive rules. All classes of modifications are recorded in a template, and the hiding method that produces the lowest side effects is selected to hide the rules. Hong et al. [29] borrowed the concept of Term Frequency and Inverse Document Frequency (TF-IDF) from text mining for hiding sensitive itemsets. A greedy algorithm was presented. It calculates the SIF-IDF value of each transaction, and the transaction with the highest value is sanitized. However, the algorithm scales poorly.

Most of the existing methods address the sanitization problem with heuristic strategies. Sun and Yu [33, 34] proposed a border-based [10] approach for hiding sensitive itemsets. The algorithm focuses on preserving the border of the non-sensitive itemsets instead of all the non-sensitive frequent itemsets. Moustakides and Verykios [8, 9] proposed two approaches, called MaxMin1 and MaxMin2, for concealing sensitive itemsets, which are based on the border theory and the maxmin criterion in decision theory. Both algorithms perform sanitization by minimizing the side effects on the positive border itemsets. Gkoulalas-Divanis and Verykios [3] developed a novel border-based approach to protect sensitive knowledge. The original database is minimally extended by inserting a set of synthetic transactions, which is based on a constraint satisfaction problem (CSP) formulation. Le et al. [11, 12] proposed two approaches, named HCSRIL and AARHIL, on the basis of the lattice theory to conceal sensitive rules. AARHIL improves the selection of victim transactions in comparison with HCSRIL. Thus, it outperforms HCSRIL in terms of side effects and execution time.

Shah and Asghar [24] adopted a genetic algorithm (GA) for privacy preservation in association rule mining. The selection of the victim transactions depends on the fitness value. Cheng et al. [19, 20, 21, 22] proposed four algorithms for protecting sensitive knowledge. The BRDA approach hides the sensitive rules by removing some items on the basis of the positive and negative border rules. The EMO-based, EMO-AddItem and EMO algorithms are based on evolutionary multi-objective optimization (EMO). EMO-based and EMO-AddItem focus on protecting sensitive association rules. The EMO-based algorithm utilizes NSGA-II to find appropriate transactions for sanitization. Some items are removed from the identified transactions to reduce the support and confidence of the sensitive rules. EMO-AddItem adopts the HypE algorithm to insert some items into the database. In the EMO algorithm, NSGA-II and HypE are used to drive the evolution forward, and suitable items are identified and removed from the database to conceal the sensitive itemsets. However, hiding approaches based on evolutionary algorithms usually generate spurious rules.

All above algorithms focus on protecting association rules or frequent itemsets. Yeh and Hsu [15] proposed two approaches, named HHUIF and MCISF, for privacy preserving utility mining. HHUIF algorithm removes the items with the maximal utility. MCISF algorithm considers the conflict count during the sanitization process. Yun and Kim [30] presented an algorithm named FPUTT to improve the efficiency of the algorithm HHUIF by adopting a tree structure. However, the side effects of FPUTT are the same as those of HHUIF. Lin et al. [14] designed two algorithms called MSU-MAU and MSU-MIU to protect the high utility itemsets. For a sensitive itemset, both algorithms identify the victim transaction in which the utility of the itemset is maximal.

Rajalaxmi and Natarajan [25] were the first to present two approaches for protecting sensitive utility and frequent itemsets. Both algorithms hide a sensitive itemset by decreasing its support below the minimum support threshold. Then, the utility of the itemset is checked. If it is still higher than the given utility threshold, the utility of the itemset is reduced until it falls below the threshold. Since both the support and the utility of an itemset must be decreased below the given minimum thresholds, the damage to the non-sensitive knowledge and to the database quality is large. To tackle this problem, a novel algorithm is proposed in this paper. The concept of maximum boundary value is presented, which is used to determine whether the support or the utility reduction strategy is adopted to hide a sensitive itemset. Therefore, the sanitization is performed with flexibility, and the side effects are reduced.

3. Preliminary

In this section, we firstly introduce some basic notions about frequent itemsets [23] and utility itemsets mining [17, 26]. Then, the sanitization problem is formulated.

3.1 Basic definitions

Let $I=\{i_{1},i_{2},\ldots,i_{m}\}$ be a set of distinct items and $D=\{T_{1},T_{2},\ldots,T_{n}\}$ be a transaction database, where $T_{i}$ represents a transaction containing a set of items in $I$. Each transaction is assigned a unique identifier TID. A collection of one or more items is called an itemset. A transaction $T$ is said to support an itemset $X$ if $X\subseteq T$. An itemset containing $k$ items is called a $k$-itemset. The support count is an important property of an itemset; it refers to the number of transactions that contain the itemset. For an itemset $X$, the support count of $X$ is defined as $\textit{supc}(X)=|\{T_{i}\mid X\subseteq T_{i},T_{i}\in D\}|$. The support of $X$ is the ratio of the transactions supporting $X$ to all the transactions in database $D$. It is computed as follows:

$\displaystyle\textit{sup}(X)=\textit{supc}(X)/|D|$ (1)

where $|D|$ is the cardinality of database $D$ . An itemset is frequent if its support is greater than or equal to a predetermined minimum support threshold, which is denoted by minsup.
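The support definitions can be illustrated with a minimal Python sketch. The transactions are those of the sample database in Table 1 with the quantities dropped, since support ignores them; the function names `supc`, `sup` and `is_frequent` are chosen here for illustration only.

```python
# Sample database of Table 1a as TID -> set of items (quantities omitted).
DB = {
    1: {3, 5}, 2: {2, 4, 5}, 3: {1, 3, 5}, 4: {1, 4, 5}, 5: {3, 5},
    6: {1, 2}, 7: {2, 4, 5}, 8: {1, 3, 4, 5}, 9: {1, 2}, 10: {2, 3, 5},
}

def supc(itemset, db):
    """Support count: number of transactions containing every item of `itemset`."""
    x = set(itemset)
    return sum(1 for items in db.values() if x <= items)

def sup(itemset, db):
    """Support as in Eq. (1): support count divided by |D|."""
    return supc(itemset, db) / len(db)

def is_frequent(itemset, db, minsup):
    """An itemset is frequent if its support is at least minsup."""
    return sup(itemset, db) >= minsup

print(supc({2, 4}, DB), sup({2, 4}, DB))  # 2 0.2
```

With minsup set to 0.2, the itemset {2, 4} is therefore exactly on the threshold and counts as frequent.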

Table 1

A sample transaction database

(a) Database table
TID	Transaction (item,iu)	TID	Transaction (item,iu)
$T_{1}$	(3,18) (5,1)	$T_{6}$	(1,1) (2,1)
$T_{2}$	(2,6) (4,1) (5,1)	$T_{7}$	(2,10) (4,1) (5,1)
$T_{3}$	(1,2) (3,1) (5,1)	$T_{8}$	(1,3) (3,25) (4,3) (5,1)
$T_{4}$	(1,1) (4,1) (5,1)	$T_{9}$	(1,1) (2,1)
$T_{5}$	(3,4) (5,2)	$T_{10}$	(2,6) (3,2) (5,2)

(b) External utility table
Item	1	2	3	4	5
eu	3	10	1	6	5

Definition 1. $iu(i_{p},T_{q})$ is the internal utility of item $i_{p}$ in transaction $T_{q}$ . It reflects the quantity value of an item in a transaction. For example, in Table 1a, $iu(5,T_{1})=1$ and $iu(3,T_{8})=25$ .

Definition 2. $eu(i_{p})$ is the external utility of item $i_{p}$ . The value reflects the importance of item $i_{p}$ . For example, in Table 1b, $eu(4)=6$ .

Definition 3. $u(i_{p},T_{q})$ denotes the utility of item $i_{p}$ in transaction $T_{q}$. It is computed as $iu(i_{p},T_{q})*eu(i_{p})$. For example, in Table 1, $u(3,T_{5})=iu(3,T_{5})*eu(3)=4*1=4$.

Definition 4. $u(X,T_{q})$ is the utility of itemset $X$ in transaction $T_{q}$ . It is calculated as follows:

$\displaystyle u(X,T_{q})=\sum_{i_{p}\in X}u(i_{p},T_{q})$ (2)

For example, $u(\{1,3\},T_{8})=u(1,T_{8})+u(3,T_{8})=9+25=34$ , in Table 1.

Definition 5. $u(X)$ is the utility of itemset $X$ . It is defined as:

$\displaystyle u\left(X\right)=\sum\limits_{X\subseteq T_{q}\wedge T_{q}\in D}{u\left({X,T_{q}}\right)}$ (3)

For example, $u(\{1,3\})=u(\{1,3\},T_{3})+u(\{1,3\},T_{8})=7+34=41$ .

Definition 6. $tu(T_{q})$ is the utility of transaction $T_{q}$ . It is defined by:

$\displaystyle tu(T_{q})=\sum\limits_{i_{p}\in T_{q}}u(i_{p},T_{q})$ (4)

For example, $tu(T_{8})=u(1,T_{8})+u(3,T_{8})+u(4,T_{8})+u(5,T_{8})=9+25+18+5=57$ .

An itemset is high utility if its utility is greater than or equal to the user-specified minimum utility threshold, which is denoted by minutil. Otherwise, the itemset is low utility. Utility and frequent itemset mining is to find all the itemsets whose utility and support are beyond the given minimum utility and support thresholds, respectively.
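As a rough illustration of Definitions 1 to 6, the following Python sketch encodes Table 1 and mines the utility and frequent itemsets by brute-force enumeration. The enumeration is only an assumption made to keep the example short; it is not the mining method used in the paper, and the helper names are hypothetical.

```python
from itertools import combinations

# Table 1: external utilities and transactions as {item: internal utility}.
EU = {1: 3, 2: 10, 3: 1, 4: 6, 5: 5}
DB = {
    1: {3: 18, 5: 1}, 2: {2: 6, 4: 1, 5: 1}, 3: {1: 2, 3: 1, 5: 1},
    4: {1: 1, 4: 1, 5: 1}, 5: {3: 4, 5: 2}, 6: {1: 1, 2: 1},
    7: {2: 10, 4: 1, 5: 1}, 8: {1: 3, 3: 25, 4: 3, 5: 1},
    9: {1: 1, 2: 1}, 10: {2: 6, 3: 2, 5: 2},
}

def u_item(item, tid):
    """Definition 3: internal utility times external utility."""
    return DB[tid][item] * EU[item]

def u_in_tx(itemset, tid):
    """Definition 4, Eq. (2): utility of an itemset in one transaction."""
    return sum(u_item(i, tid) for i in itemset)

def tids_of(itemset):
    """TIDs of the transactions that support the itemset."""
    return [tid for tid, tx in DB.items() if all(i in tx for i in itemset)]

def u(itemset):
    """Definition 5, Eq. (3): utility of an itemset in the database."""
    return sum(u_in_tx(itemset, tid) for tid in tids_of(itemset))

def tu(tid):
    """Definition 6, Eq. (4): utility of a transaction."""
    return sum(u_item(i, tid) for i in DB[tid])

def mine(minsup, minutil):
    """Brute-force search for itemsets meeting both thresholds."""
    found = {}
    for k in range(1, len(EU) + 1):
        for cand in combinations(sorted(EU), k):
            count = len(tids_of(cand))
            if count / len(DB) >= minsup and u(cand) >= minutil:
                found[frozenset(cand)] = (u(cand), count)
    return found

print(u_in_tx({1, 3}, 8), u({1, 3}), tu(8))  # 34 41 57
```

The printed values agree with the worked examples of Definitions 4 to 6.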

3.2 The formal description of the sanitization problem

The sanitization problem can be formulated as follows:

Let $D$ be a transaction database and $P$ be the set of utility and frequent itemsets that can be discovered from $D$ with the given minsup and minutil thresholds. Let $S P$ be a set of sensitive itemsets that should be concealed, and $SP\subset P$ . $\sim SP$ denotes the set of non-sensitive itemsets, and $\sim SP\cup SP=P$ . Let $S T$ be a set of sensitive transactions that contain at least one sensitive itemset. The sanitization problem is to transform the original database $D$ into sanitized database $D^{\prime}$ such that only the non-sensitive itemsets can be mined from $D^{\prime}$ . The sanitization process often causes the following three side effects [2, 7, 32].

1) Hiding failure (HF): the portion of the sensitive itemsets that are not hidden after the database sanitization. The hiding failure is computed as follows:

$\displaystyle HF=\frac{\left|{SP(D^{\prime})}\right|}{\left|{SP(D)}\right|}$ (5)

where $\left|{SP(D)}\right|$ and $\left|{SP(D^{\prime})}\right|$ are the numbers of sensitive itemsets mined from the original database $D$ and the sanitized database $D^{\prime}$, respectively.
2) Missing cost (MC): the portion of the non-sensitive itemsets that are falsely hidden after the hiding process. The missing cost is measured as follows:

$\displaystyle MC=\frac{\left|{\sim SP(D)-\sim SP(D^{\prime})}\right|}{\left|{\sim SP(D)}\right|}$ (6)

where $\left|{\sim SP(D)}\right|$ and $\left|{\sim SP(D^{\prime})}\right|$ denote the number of non-sensitive itemsets discovered from the original database $D$ and result database $D^{\prime}$ respectively.
3) Database utility difference (Diff): the dissimilarity between the utility of the original database $D$ and that of the result database $D^{\prime}$. The utility difference is calculated as follows:

$\displaystyle\textit{Diff}=\frac{\sum\nolimits_{T_{i}\in D}tu(T_{i})-\sum\nolimits_{T_{i}\in D^{\prime}}tu(T_{i})}{\sum\nolimits_{T_{i}\in D}tu(T_{i})}$ (7)

where $tu(T_{i})$ denotes the utility of transaction $T_{i}$. $\sum\nolimits_{T_{i}\in D^{\prime}}tu(T_{i})$ and $\sum\nolimits_{T_{i}\in D}tu(T_{i})$ are the utilities of databases $D^{\prime}$ and $D$, respectively. This measure depicts the database quality.
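The three side-effect measures can be sketched in a few lines of Python. Hiding failure is taken here as the share of the original sensitive itemsets that can still be mined from $D^{\prime}$, following the textual definition. The sample values describe the intermediate state reached in the example of Section 4.4 after item 4 is removed from $T_{2}$; they were recomputed from Table 1 for this illustration and are not results reported by the paper.

```python
def hiding_failure(sp_before, sp_after):
    """Share of sensitive itemsets that are still minable after sanitization."""
    return len(sp_after) / len(sp_before)

def missing_cost(nsp_before, nsp_after):
    """Share of non-sensitive itemsets lost by the sanitization, Eq. (6)."""
    return len(nsp_before - nsp_after) / len(nsp_before)

def utility_difference(total_before, total_after):
    """Relative loss of total database utility, Eq. (7)."""
    return (total_before - total_after) / total_before

fs = frozenset
sp_before = {fs({2, 4}), fs({1, 3, 5})}
sp_after = {fs({1, 3, 5})}                 # {2, 4} is hidden, {1, 3, 5} not yet
nsp_before = {fs({2}), fs({3}), fs({5}), fs({2, 5}), fs({3, 5}),
              fs({4, 5}), fs({2, 4, 5})}
nsp_after = nsp_before - {fs({4, 5}), fs({2, 4, 5})}

print(hiding_failure(sp_before, sp_after))   # 0.5
print(missing_cost(nsp_before, nsp_after))   # 2 of 7 itemsets are lost
print(utility_difference(400, 394))          # 0.015
```

The total utility of the sample database is 400, and removing one unit of item 4 (external utility 6) lowers it to 394.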

The goal of privacy preserving utility and frequent itemset mining is not only to hide all the sensitive itemsets but also to minimize the side effects on non-sensitive knowledge and the integrity of the original database. A sensitive itemset can be hidden by modifying some items until its utility or the support is below the given thresholds. The modified item for hiding a sensitive itemset is called a victim item, which is denoted by $I_{\textit{vic}}$ . Correspondingly, the transaction containing a victim item is a victim transaction, which is denoted by $T_{\textit{vic}}$ .
4. A novel approach for hiding utility and frequent itemsets

In this section, we propose a sanitization approach named HUFI for hiding sensitive utility and frequent itemsets. To do that, we first present two hiding strategies based on reducing the support or utility of the sensitive itemsets. Then the concept of maximum boundary value is introduced to determine the hiding strategy. After that we describe the proposed algorithm in detail. Finally, a sample is given to illustrate the hiding process.

4.1 The hiding strategies

Definition 7. Given a transaction database $D$ and a sensitive itemset $X$ , in order to decrease the utility of $X$ below the utility threshold minutil, the minimal utility to be reduced is determined by:

$\displaystyle\textit{diffu}=u(X)-\textit{minutil}+1$ (8)

Definition 8. Given a transaction database $D$ and a sensitive itemset $X$ , in order to decrease the support of $X$ below the support threshold minsup, the minimal support count to be reduced is defined by:

$\displaystyle\textit{diffs}=\textit{supc}(X)-\left\lceil{\textit{minsup}\ast\left|D\right|}\right\rceil+1$ (9)

where $\left|D\right|$ is the cardinality of the database, and $\textit{supc}(X)$ refers to the support count of $X$ .
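Eqs. (8) and (9) translate directly into code. The sketch below is a minimal Python version; converting minsup to an exact fraction before taking the ceiling is a design choice of this example (an assumption, not part of the paper) that avoids floating-point surprises such as 0.1 * 30 evaluating to slightly more than 3.

```python
import math
from fractions import Fraction

def diffu(u_x, minutil):
    """Eq. (8): minimal utility to remove so that u(X) drops below minutil."""
    return u_x - minutil + 1

def diffs(supc_x, minsup, db_size):
    """Eq. (9): minimal support count to remove so that sup(X) drops below minsup."""
    min_count = math.ceil(Fraction(str(minsup)) * db_size)
    return supc_x - min_count + 1

# Itemset {2, 4} of the sample database: supc = 2, u = 172, |D| = 10.
print(diffs(2, 0.2, 10), diffu(172, 50))  # 1 123
```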

Starting from the relationship between the support and the utility of an itemset, we develop two strategies to hide a sensitive itemset $X$ .

Support reduction strategy: remove a victim item $I_{\textit{vic}}$ from a victim transaction $T_{\textit{vic}}$ that supports $X$ , and $I_{\textit{vic}}\in X$ .

Utility reduction strategy: modify the internal utility of a victim item $I_{\textit{vic}}$ according to the following formula.

$\displaystyle iu(I_{\textit{vic}},T_{\textit{vic}})=\left\{{\begin{array}[]{ll}iu(I_{\textit{vic}},T_{\textit{vic}})-\textit{diu}&\textit{diu}<iu(I_{\textit{vic}},T_{\textit{vic}})\\ 0&\textit{diu}\geqslant iu(I_{\textit{vic}},T_{\textit{vic}})\\ \end{array}}\right.$ (10)

where diu refers to the internal utility that should be decreased. It is calculated as $\textit{diu}=\left\lceil{\textit{diffu}/eu(I_{\textit{vic}})}\right\rceil$ . $eu(I_{\textit{vic}})$ is the external utility of $I_{\textit{vic}}$ , and $iu(I_{\textit{vic}},T_{\textit{vic}})$ is the internal utility of item $I_{\textit{vic}}$ in transaction $T_{\textit{vic}}$ .
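The update of Eq. (10) can be written as a small Python function. This is a sketch under the assumption that utilities are integers, so the ceiling is computed with integer arithmetic; the function name is hypothetical.

```python
def reduce_internal_utility(iu, eu, diffu):
    """Return the new internal utility of the victim item according to Eq. (10).

    A result of 0 means that the item is removed from the transaction.
    """
    diu = -(-diffu // eu)          # ceil(diffu / eu) for positive integers
    return iu - diu if diu < iu else 0

# Item 2 in T7 of the sample database: iu = 10, eu = 10.
print(reduce_internal_utility(10, 10, 25))   # diu = 3  -> 7
print(reduce_internal_utility(10, 10, 123))  # diu = 13 -> 0 (item removed)
```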

Property 1. Let $X$ be a sensitive utility and frequent itemset, the set $S T$ be the sensitive transactions of $X$ . The victim item used for hiding $X$ is $I_{\textit{vic}}$ , and the transaction containing $I_{\textit{vic}}$ is a victim transaction $T_{\textit{vic}}$ . If $I_{\textit{vic}}$ is removed from $T_{\textit{vic}}$ , the utility of $X$ is decreased by $u(X,T_{\textit{vic}})$ .

Proof. In order to hide the itemset $X$ , the victim item contained in $X$ is removed from $T_{\textit{vic}}$ . Since the victim transaction supporting $X$ is contained in set $S T$ and $I_{\textit{vic}}\in X$ , the itemset $X$ is not supported by the transaction $T_{\textit{vic}}$ after the deletion operation. Thus, the utility of $X$ is decreased to $u(X)-u(X,T_{\textit{vic}})$ , and the sensitive transactions of $X$ are updated to $ST-T_{\textit{vic}}$ .

Based on the two hiding strategies, the following problems should be addressed.

• Which transactions are selected for sanitization?

• Which item is selected to be modified in an identified transaction?

For the first question, we only consider the sensitive transactions for sanitization, because the non-sensitive transactions have no effect on hiding the sensitive itemsets. Moreover, the transaction that affects the minimum number of utility and frequent itemsets is selected to be a victim transaction $T_{\textit{vic}}$ , which effectively minimizes the impact on non-sensitive itemsets. For the second question, we assign a weight to each sensitive item $S I$ in a victim transaction $T_{\textit{vic}}$ according to Eq. (11).

$\displaystyle w(SI)=\textit{SPC}(SI,T_{\textit{vic}})+\frac{1}{\textit{NSPC}(SI,T_{\textit{vic}})}$ (11)

where $\textit{SPC}(SI,T_{\textit{vic}})$ and $\textit{NSPC}(SI,T_{\textit{vic}})$ denote the numbers of sensitive itemsets and non-sensitive itemsets, respectively, that are affected by modifying the item $SI$ in transaction $T_{\textit{vic}}$. An item with a higher weight is preferred for modification, because modifying it affects more sensitive itemsets and fewer non-sensitive itemsets. In order to determine which strategy is adopted to hide a sensitive itemset, we propose the concepts of maximum boundary value and minimum boundary value.
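The weight of Eq. (11) can be sketched as follows. The itemsets are those mined from the sample database with minsup 0.2 and minutil 50, and the transaction is $T_{2}$. Returning infinity when no non-sensitive itemset is affected is an assumption of this sketch, since Eq. (11) is undefined for $\textit{NSPC}=0$.

```python
from fractions import Fraction

fs = frozenset
SENSITIVE = [fs({2, 4}), fs({1, 3, 5})]
NON_SENSITIVE = [fs({2}), fs({3}), fs({5}), fs({2, 5}), fs({3, 5}),
                 fs({4, 5}), fs({2, 4, 5})]

def weight(item, tx_items, sensitive, non_sensitive):
    """Eq. (11): SPC + 1 / NSPC for one item of a victim transaction."""
    spc = sum(1 for s in sensitive if item in s and s <= tx_items)
    nspc = sum(1 for y in non_sensitive if item in y and y <= tx_items)
    if nspc == 0:
        return float("inf")        # assumption: no non-sensitive itemset is harmed
    return spc + Fraction(1, nspc)

t2 = {2, 4, 5}
print(weight(2, t2, SENSITIVE, NON_SENSITIVE))  # 4/3
print(weight(4, t2, SENSITIVE, NON_SENSITIVE))  # 3/2
```

Item 4 obtains the higher weight because it occurs in fewer non-sensitive itemsets supported by $T_{2}$.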

Definition 9. Given a sensitive itemset $X$ , the sensitive transactions of $X$ are $ST=\{ST_{1},ST_{2},...,ST_{n}\}$ . The sensitive transactions are sorted in descending order of $u(X,ST_{i})$ . The maximum boundary value $Bd_{\max}$ and minimum boundary value $Bd_{\min}$ are defined by:

$\displaystyle\left\{{\begin{array}[]{l}Bd_{\max}=\sum\nolimits_{k=1}^{\textit{diffs}}u(X,ST_{k})\\ Bd_{\min}=\sum\nolimits_{k=n-\textit{diffs}+1}^{n}u(X,ST_{k})\\ \end{array}}\right.$ (12)

where $u(X,ST_{k})$ refers to the utility of $X$ in the transaction $ST_{k}$ . diffs is the minimum support count of $X$ that needs to be reduced. From Eq. (12), we have that $Bd_{\max}$ is the utility of $X$ in the top diffs transactions, and $Bd_{\min}$ is the utility of $X$ in the bottom diffs transactions.
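A minimal Python sketch of Eq. (12) is given below. It takes the utilities of $X$ in its sensitive transactions and the value of diffs; the function name is chosen for illustration.

```python
def boundary_values(utilities, diffs):
    """Return (Bd_max, Bd_min) of Eq. (12).

    `utilities` holds u(X, ST_k) for every sensitive transaction of X.
    """
    ranked = sorted(utilities, reverse=True)
    bd_max = sum(ranked[:diffs])                  # top `diffs` transactions
    bd_min = sum(ranked[len(ranked) - diffs:])    # bottom `diffs` transactions
    return bd_max, bd_min

# Itemset {2, 4} in the sample database: u = 66 in T2 and 106 in T7, diffs = 1.
print(boundary_values([66, 106], 1))  # (106, 66)
```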

For a sensitive itemset $X$ , using the support reduction strategy to hide $X$ is denoted as A, and using the utility reduction strategy to hide $X$ is denoted as B.

Property 2. A is more efficient than B, and the side effects caused by A and B are the same if $Bd_{\max}\leqslant\textit{diffu}$.

Proof. In order to hide an itemset $X$, the utility or support of $X$ should be decreased until it is below minutil or minsup. Here diffu and diffs denote the minimum utility and the minimum support count that should be reduced, respectively. Since $Bd_{\max}\leqslant\textit{diffu}$ and $Bd_{\max}=\sum\nolimits_{k=1}^{\textit{diffs}}u(X,ST_{k})$, removing $X$ from any diffs sensitive transactions reduces its utility by at most $Bd_{\max}$, so diffu is still larger than or equal to zero when diffs drops to zero. Therefore, hiding the itemset $X$ only requires reducing its support below minsup. Besides, with the support reduction strategy the victim item is directly removed from the victim transaction, whereas with the utility reduction strategy we must first check whether the victim item is deleted or its internal utility is decreased. Thus, the utility reduction strategy is more complex than the support reduction strategy. In addition, both hiding strategies select the victim item and transaction in the same way. Therefore, the support reduction strategy is more efficient than the utility reduction strategy, and the two strategies result in the same side effects.

Property 3. A causes more side effects than B if $Bd_{\min}>\textit{diffu}$ .

Proof. Since $Bd_{\min}=\sum\nolimits_{k=n-\textit{diffs}+1}^{n}{u(X,ST_{k})}>\textit{diffu}$, when $u(X)$ has been decreased below minutil, $\textit{sup}(X)$ is still at least minsup. Thus, hiding the itemset $X$ only requires reducing its utility below minutil. Assume that $Y$ is a non-sensitive itemset that contains the victim item $I_{\textit{vic}}$ and is supported by the identified transaction $T_{\textit{vic}}$. If we adopt the support reduction strategy to conceal the itemset $X$, the support count of $Y$ is decreased by one, and the utility of $Y$ is updated to $u(Y)-u(Y,T_{\textit{vic}})$. However, if we use the utility reduction strategy to hide $X$, the support of $Y$ is unchanged and the utility of $Y$ is updated to $u(Y)-\left\lceil{\textit{diffu}/eu(I_{\textit{vic}})}\right\rceil\ast eu(I_{\textit{vic}})$ when $\textit{diffu}\leqslant u(I_{\textit{vic}},T_{\textit{vic}})-eu(I_{\textit{vic}})$. Since $u(Y,T_{\textit{vic}})>\left\lceil{\textit{diffu}/eu(I_{\textit{vic}})}\right\rceil\ast eu(I_{\textit{vic}})$, the support reduction strategy causes more side effects on the non-sensitive knowledge than the utility reduction strategy.

From Properties 2 and 3, we calculate the maximum boundary value and the minimum boundary value in advance to select the better hiding strategy for concealing a given sensitive itemset $X$. The selection rules are as follows:

$\displaystyle\left\{{\begin{array}[]{ll}\textit{support reduction strategy}&Bd_{\max}\leqslant\textit{diffu}\\ \textit{utility reduction strategy}&Bd_{\min}>\textit{diffu}\\ \textit{utility reduction strategy}&Bd_{\max}>\textit{diffu}\wedge Bd_{\min}\leqslant\textit{diffu}\\ \end{array}}\right.$ (13)

If $Bd_{\max}>\textit{diffu}\wedge Bd_{\min}\leqslant\textit{diffu}$, we adopt the utility reduction strategy to hide $X$, because the side effects on non-sensitive knowledge can be reduced by decreasing the utility of $X$. Furthermore, since $Bd_{\max}=\sum\nolimits_{k=1}^{\textit{diffs}}u(X,ST_{k})$ and $Bd_{\min}=\sum\nolimits_{k=n-\textit{diffs}+1}^{n}u(X,ST_{k})$ are computed on transactions sorted in descending order of utility, $Bd_{\max}$ is greater than or equal to $Bd_{\min}$. Hence $Bd_{\max}>\textit{diffu}$ whenever $Bd_{\min}>\textit{diffu}$, and the last two cases of Eq. (13) can be merged. Therefore, the selection of the hiding strategy can be determined by:

$\displaystyle\left\{{{\begin{array}[]{ll}\textit{support reduction strategy}&Bd_{\max}\leqslant\textit{diffu}\\ \textit{utility reduction strategy}&Bd_{\max}>\textit{diffu}\\ \end{array}}}\right.$ (14)
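The selection rule of Eq. (14) can be sketched in Python as below. The function receives the utilities of $X$ in its sensitive transactions, the minimum support count $\left\lceil{\textit{minsup}\ast\left|D\right|}\right\rceil$ and minutil; the name `choose_strategy` and the string labels are assumptions of this example. The values for {1, 3, 5} were computed from Table 1 for this illustration.

```python
def choose_strategy(utilities, min_count, minutil):
    """Apply Eq. (14) to one sensitive itemset X.

    `utilities` holds u(X, T) for every sensitive transaction T of X,
    so len(utilities) is supc(X) and sum(utilities) is u(X).
    """
    diffs = len(utilities) - min_count + 1
    diffu = sum(utilities) - minutil + 1
    bd_max = sum(sorted(utilities, reverse=True)[:max(diffs, 0)])
    return "support" if bd_max <= diffu else "utility"

# Sample database, minimum support count 2, minutil = 50.
print(choose_strategy([66, 106], 2, 50))  # {2, 4}:    Bd_max = 106 <= 123 -> support
print(choose_strategy([12, 39], 2, 50))   # {1, 3, 5}: Bd_max = 39  >  2   -> utility
```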

For each sensitive itemset, the hiding strategy is determined beforehand. Given the sensitive itemsets $SP=\{SP_{1},SP_{2},...,SP_{m}\}$, let $c$ be the number of sensitive itemsets hidden by the utility reduction strategy. The ratio of the utility reduction strategy, denoted as $R_{u}$, is determined by:

$\displaystyle R_{u}=\frac{c}{m}$ (15)

where $m$ is the number of sensitive itemsets. Correspondingly, the ratio of the support reduction strategy is denoted as $R_{s}$ , and $R_{u}+R_{s}=1$ .

When a sensitive itemset is hidden by reducing its support below the threshold minsup, there is no need to further decrease its utility. Likewise, after the utility of an itemset has been reduced below minutil, there is no need to decrease its support, since the itemset is no longer significant. Thus, we can hide a sensitive itemset by decreasing either its support or its utility. Note that reducing the support of an itemset also decreases its utility, and that the utility reduction strategy may also decrease the support of an itemset, namely when the victim item is deleted. However, this does not guarantee that both the utility and the support fall below the given thresholds.

4.2 The data structures

The previous approaches have to perform frequent database scans. Thus, we define the following three structures to speed up the sanitization process.

Definition 10. Given a transaction $t$ , Item list ( $I$ -list) stores information of items in $t$ . Each item $i$ in $I$ -list is a triple:

$\displaystyle i=\left\langle{\textit{Item},\textit{InUtility},\textit{Utility}}\right\rangle$

where Item is $i$, InUtility is the internal utility of $i$ in $t$, and Utility is the utility of $i$ in $t$. For example, in Table 1, the $I$-list of $T_{1}$ is $\{\langle 3,18,18\rangle,\langle 5,1,5\rangle\}$.

Definition 11. Given a database $D$ , a set of sensitive itemsets $SP=\{SP_{1},SP_{2},...,SP_{m}\}$ , Transaction table ( $T$ -table) has information of sensitive transactions in $D$ . Each transaction $t$ in $T$ -table has four elements:

$\displaystyle t=\left\langle{\textit{TID},\textit{SID},\textit{NSID},\textit{I-list}}\right\rangle$

where TID is the unique identifier of $t$ , SID and NSID are the sensitive itemsets and non-sensitive itemsets supported by $t$ , respectively. $I$ -list is the item list of $t$ .

Definition 12. Given a database $D$ , a set of itemsets $P=\{P_{1},P_{2},...,P_{n}\}$ , High Utility and Frequent Itemset-table (HUFI-table) has the information of utility and frequent itemsets discovered from $D$ . Each itemset $P_{i}$ in HUFI-table has five elements:

$\displaystyle P_{i}=\left\langle{\textit{IID},\textit{Items},\textit{HUFI-utility},\textit{Supc},\textit{TIDs}}\right\rangle$

where IID is the unique identifier of $P_{i}$ , Items is a list of items contained in $P_{i}$ , HUFI-utility is the utility of $P_{i}$ , Supc refers to the support count of $P_{i}$ , TIDs indicates the transactions supporting $P_{i}$ in $D$ .

The proposed algorithm uses these structures to conceal the sensitive itemsets. Constructing the T-table and the HUFI-table requires two database scans. The sanitization process is then performed on the two tables. Once a victim item is modified, the T-table and the HUFI-table are updated accordingly. After all the sensitive itemsets are hidden, the result database is obtained from the T-table.
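One possible in-memory layout of the three structures is sketched below with Python dataclasses. The class and field names mirror Definitions 10 to 12 but are otherwise assumptions of this example, as is the choice to store SID and NSID as lists of itemset identifiers taken from the HID column of Table 2.

```python
from dataclasses import dataclass

@dataclass
class ItemEntry:            # one triple of the I-list (Definition 10)
    item: int
    in_utility: int
    utility: int

@dataclass
class TTableRow:            # one row of the T-table (Definition 11)
    tid: int
    sid: list               # identifiers of supported sensitive itemsets
    nsid: list              # identifiers of supported non-sensitive itemsets
    i_list: list            # list of ItemEntry

@dataclass
class HUFIRow:              # one row of the HUFI-table (Definition 12)
    iid: int
    items: frozenset
    hufi_utility: int
    supc: int
    tids: set

EU = {1: 3, 2: 10, 3: 1, 4: 6, 5: 5}   # external utilities of Table 1b

def build_i_list(tx, eu):
    """Build the I-list of a transaction given as {item: internal utility}."""
    return [ItemEntry(i, q, q * eu[i]) for i, q in sorted(tx.items())]

# T2 supports the sensitive itemset {2, 4} (HID 4) and five non-sensitive ones.
row_t2 = TTableRow(tid=2, sid=[4], nsid=[1, 3, 5, 7, 9],
                   i_list=build_i_list({2: 6, 4: 1, 5: 1}, EU))
row_24 = HUFIRow(iid=4, items=frozenset({2, 4}), hufi_utility=172,
                 supc=2, tids={2, 7})
print(build_i_list({3: 18, 5: 1}, EU))
```

The last line prints the $I$-list of $T_{1}$, which contains the triples (3, 18, 18) and (5, 1, 5).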

4.3 Algorithm description

Algorithm: HUFI

Input: The transaction database $D$, the minimum utility threshold minutil, the minimum support threshold minsup, the sensitive utility and frequent itemsets $SP=\{SP_{1},SP_{2},...,SP_{n}\}$.

Output: The sanitized database $D^{\prime}$

1. for each $SP_{i}\in SP$
2.     $\textit{diffs}=\textit{supc}(SP_{i})-\left\lceil{\textit{minsup}\ast\left|D\right|}\right\rceil+1$
3.     $\textit{diffu}=u(SP_{i})-\textit{minutil}+1$
4.     Find the sensitive transactions $ST$ of $SP_{i}$
5.     Calculate the maximum boundary value $Bd_{\max}$
6.     if $Bd_{\max}\leqslant\textit{diffu}$
7.         Support reduction strategy
8.     else
9.         Utility reduction strategy
10.     end if
11.     while $\textit{diffs}>0$ and $\textit{diffu}>0$
12.         $T_{\textit{vic}}=\textit{argmin}_{T\in ST}\textit{NSPC}(T)$
13.         for each $SI\in T_{\textit{vic}}\wedge SI\in SP_{i}$
14.             $w(SI)=\textit{SPC}(SI,T_{\textit{vic}})+1/\textit{NSPC}(SI,T_{\textit{vic}})$
15.         end for
16.         $I_{\textit{vic}}=\textit{argmax}_{SI\in SP_{i}}w(SI)$
17.         if using the support reduction strategy
18.             $\textit{supc}(SP_{i})=\textit{supc}(SP_{i})-1$
19.             $u(SP_{i})=u(SP_{i})-u(SP_{i},T_{\textit{vic}})$
20.         else
21.             $\textit{diu}=\left\lceil{\textit{diffu}/eu(I_{\textit{vic}})}\right\rceil$
22.             if $\textit{diu}\geqslant iu(I_{\textit{vic}},T_{\textit{vic}})$
23.                 $\textit{supc}(SP_{i})=\textit{supc}(SP_{i})-1$
24.                 $u(SP_{i})=u(SP_{i})-u(SP_{i},T_{\textit{vic}})$
25.             else
26.                 $u(SP_{i})=u(SP_{i})-\textit{diu}\ast eu(I_{\textit{vic}})$
27.             end if
28.         end if
29.         Update the itemsets containing the victim item
30.     end while
31. end for

In this subsection, the pseudo-code of the proposed approach is presented in Algorithm HUFI. The designed algorithm hides the sensitive itemsets one by one. Firstly, the minimum support count and utility that should be reduced, diffs and diffu, are calculated. Then the sensitive transactions $ST$ that support the sensitive itemset $SP_{i}$ are generated. The sensitive transactions are sorted in descending order of $u(SP_{i},ST_{k})$ to compute the maximum boundary value $Bd_{\max}$ according to Eq. (12). Next, the hiding strategy is selected on the basis of $Bd_{\max}$. If $Bd_{\max}$ is higher than diffu, the utility reduction strategy is adopted. Otherwise, the support reduction strategy is utilized to hide $SP_{i}$. After the hiding strategy is determined, the transaction supporting the lowest number of non-sensitive itemsets is selected to be sanitized, where the number of non-sensitive itemsets supported by a transaction $T$ is denoted as $\textit{NSPC}(T)$. Then, each sensitive item in the identified transaction is assigned a weight according to Eq. (11). The item with the highest weight is chosen as the victim item. If the support reduction strategy is adopted to hide $SP_{i}$, the victim item is directly removed from the victim transaction, and the support and utility of $SP_{i}$ are updated accordingly. If the utility reduction strategy is used, we must decide whether the victim item is deleted or its internal utility is decreased. We calculate the internal utility that needs to be reduced, which is denoted as diu. If diu is greater than or equal to $iu(I_{\textit{vic}},T_{\textit{vic}})$, the victim item is deleted. Otherwise, its internal utility is decreased by diu. Correspondingly, the itemsets containing the victim item are updated. The sanitization process goes on until either the utility or the support of the sensitive itemset $SP_{i}$ is below the given thresholds.
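The procedure described above can be approximated by the following Python sketch on the sample database of Table 1. It is not the authors' implementation: it works on plain dictionaries instead of the T-table and HUFI-table, mines the itemsets by brute force, keeps the set of non-sensitive itemsets fixed during sanitization, breaks ties by taking the first candidate, and treats a weight with $\textit{NSPC}=0$ as infinite. All of these are assumptions made for brevity.

```python
import math
from itertools import combinations

EU = {1: 3, 2: 10, 3: 1, 4: 6, 5: 5}
DB = {
    1: {3: 18, 5: 1}, 2: {2: 6, 4: 1, 5: 1}, 3: {1: 2, 3: 1, 5: 1},
    4: {1: 1, 4: 1, 5: 1}, 5: {3: 4, 5: 2}, 6: {1: 1, 2: 1},
    7: {2: 10, 4: 1, 5: 1}, 8: {1: 3, 3: 25, 4: 3, 5: 1},
    9: {1: 1, 2: 1}, 10: {2: 6, 3: 2, 5: 2},
}

def contains(tx, itemset):
    return all(i in tx for i in itemset)

def u_tx(itemset, tx):
    return sum(tx[i] * EU[i] for i in itemset)

def supporting(itemset, db):
    return [tid for tid, tx in db.items() if contains(tx, itemset)]

def u(itemset, db):
    return sum(u_tx(itemset, db[t]) for t in supporting(itemset, db))

def mine(db, min_count, minutil):
    """Brute-force search for utility and frequent itemsets (illustration only)."""
    return {frozenset(c)
            for k in range(1, len(EU) + 1)
            for c in combinations(sorted(EU), k)
            if len(supporting(c, db)) >= min_count and u(c, db) >= minutil}

def hufi(db, sensitive, minsup, minutil):
    db = {tid: dict(tx) for tid, tx in db.items()}       # sanitize a copy
    min_count = math.ceil(minsup * len(db))
    non_sensitive = mine(db, min_count, minutil) - set(sensitive)
    for x in sensitive:
        st = supporting(x, db)
        diffs = len(st) - min_count + 1
        diffu = u(x, db) - minutil + 1
        ranked = sorted((u_tx(x, db[t]) for t in st), reverse=True)
        use_support = sum(ranked[:max(diffs, 0)]) <= diffu   # Eq. (14)
        while diffs > 0 and diffu > 0:
            st = supporting(x, db)
            t_vic = min(st, key=lambda t: sum(
                1 for y in non_sensitive if contains(db[t], y)))
            tx = db[t_vic]

            def w(i):                                        # Eq. (11)
                spc = sum(1 for s in sensitive if i in s and contains(tx, s))
                nspc = sum(1 for y in non_sensitive if i in y and contains(tx, y))
                return spc + (1 / nspc if nspc else math.inf)

            i_vic = max(sorted(x), key=w)
            diu = math.ceil(diffu / EU[i_vic])
            if use_support or diu >= tx[i_vic]:
                del tx[i_vic]                                # remove the victim item
            else:
                tx[i_vic] -= diu                             # Eq. (10)
            diffs = len(supporting(x, db)) - min_count + 1
            diffu = u(x, db) - minutil + 1
    return db

SP = [frozenset({2, 4}), frozenset({1, 3, 5})]
sanitized = hufi(DB, SP, 0.2, 50)
print(sanitized[2], sanitized[3])
```

Under these assumptions the sketch removes item 4 from $T_{2}$ to hide {2, 4} and lowers the quantity of item 1 in $T_{3}$ from 2 to 1 to hide {1, 3, 5}; a different tie-breaking rule may yield a different sanitized database.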

The complexity of the proposed algorithm is $O(\left|{SP}\right|(t\ast p+K\ast(t+m\ast q+q)))$, where $\left|{SP}\right|$ is the number of sensitive itemsets in $SP$, $t$ is the number of sensitive transactions of the currently hidden itemset $SP_{i}$, $p$ is the maximal number of items contained in a sensitive transaction, $K$ is the number of iterations required to hide $SP_{i}$, $m$ is the number of items in $SP_{i}$, and $q$ is the maximal number of utility and frequent itemsets supported by a victim transaction. Determining the hiding strategy of $SP_{i}$ takes $O(t\ast p)$ time, and hiding $SP_{i}$ takes $O(K\ast(t+m\ast q+q))$ time.

4.4 Example

An example is given to illustrate the proposed hiding algorithm. Consider a transaction database displayed in Table 1. The minimum support threshold is set at 0.2, and the minimum utility threshold is set at 50. The utility and frequent itemsets are mined, and the results are listed in Table 2. The sensitive itemsets, {2, 4} and {1, 3, 5}, are identified in boldface in Table 2. The proposed hiding algorithm operates as follows.

Table 2
Derived utility and support itemsets

HID	Itemset	Utility	Support count	HID	Itemset	Utility	Support count
1	2	240	5	6	3 5	85	5
2	3	50	5	7	4 5	56	4
3	5	50	8	8	1 3 5	51	2
4	2 4	172	2	9	2 4 5	182	2
5	2 5	240	3

To hide itemset {2, 4}, we first calculate the minimum support count and the minimum utility that must be reduced: $\textit{diffs}=\textit{supc}(\{2,4\})-\left\lceil{0.2\ast 10}\right\rceil+1=1$ and $\textit{diffu}=u(\{2,4\})-50+1=123$. Next, the sensitive transactions that support {2, 4} are generated, giving $ST=\{T_{2},T_{7}\}$. Since $u(\{2,4\},T_{2})=66$ and $u(\{2,4\},T_{7})=106$, sorting $ST$ in descending order of $u(\{2,4\},ST_{j})$ yields $ST=\{T_{7},T_{2}\}$, and the maximum boundary value computed from the sorted transactions is $Bd_{\max}=\sum\nolimits_{j=1}^{\textit{diffs}}u(\{2,4\},ST_{j})=106$. Because $Bd_{\max}<\textit{diffu}$, the support reduction strategy is selected to hide {2, 4}. After the hiding strategy is determined, we select the victim transaction $T_{\textit{vic}}$, that is, the transaction supporting the minimum number of non-sensitive itemsets. Since $\textit{NSPC}(T_{2})=5$ and $\textit{NSPC}(T_{7})=5$, the victim transaction is selected randomly. Let us assume that the victim transaction is $T_{2}$. Then the sensitive items in $T_{2}$ are assigned weights, and the item with the maximal weight becomes the victim item $I_{\textit{vic}}$. According to Eq. (11), we have $w(2)=1+1/3=4/3$ and $w(4)=1+1/2=3/2$. In this case, we select item 4 as the victim item and remove it from $T_{2}$. Next, the update operation is performed. As a result, $\textit{supc}(\{2,4\})=1$, which is below the minimum support count $\left\lceil{0.2\ast 10}\right\rceil=2$, and the itemset {2, 4} is hidden successfully. However, the non-sensitive itemsets {4, 5} and {2, 4, 5} are falsely concealed.
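The victim-item selection in this step can be illustrated with a short Python sketch. It rests on two assumptions that are not stated in this section. First, the weight is taken to be the number of sensitive itemsets containing the item plus the reciprocal of the number of non-sensitive itemsets containing it, a form inferred from the arithmetic $w(2)=1+1/3$ and $w(4)=1+1/2$ rather than quoted from Eq. (11). Second, the itemsets supported by $T_{2}$ are inferred from Table 2 together with $\textit{NSPC}(T_{2})=5$. All function and variable names are hypothetical.

```python
from fractions import Fraction
import math


def item_weight(item, sensitive_itemsets, non_sensitive_itemsets):
    """Assumed weight: (#sensitive itemsets with item) + 1 / (#non-sensitive itemsets with item)."""
    n_sensitive = sum(1 for s in sensitive_itemsets if item in s)
    n_non_sensitive = sum(1 for s in non_sensitive_itemsets if item in s)
    if n_non_sensitive == 0:
        # The item touches no non-sensitive itemset, so removing it is "free".
        return math.inf
    return n_sensitive + Fraction(1, n_non_sensitive)


def pick_victim_item(target, sensitive_itemsets, non_sensitive_itemsets):
    """Return the item of `target` with the highest weight, plus all weights."""
    weights = {
        item: item_weight(item, sensitive_itemsets, non_sensitive_itemsets)
        for item in sorted(target)
    }
    return max(weights, key=weights.get), weights


# Itemsets assumed to be supported by T2 (inferred, see the lead-in).
sensitive_in_t2 = [{2, 4}]
non_sensitive_in_t2 = [{2}, {5}, {2, 5}, {4, 5}, {2, 4, 5}]

victim, weights = pick_victim_item({2, 4}, sensitive_in_t2, non_sensitive_in_t2)
print(victim, weights)
```

Under these assumptions the sketch reproduces $w(2)=4/3$ and $w(4)=3/2$ and selects item 4. Exact fractions are used so that weights can be compared without floating-point rounding.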

We then hide the next sensitive itemset, {1, 3, 5}, for which diffs and diffu are 1 and 2, respectively. The sensitive transactions of {1, 3, 5} are identified as $ST=\{T_{3},T_{8}\}$. We sort $ST$ and compute the maximum boundary value $Bd_{\max}$. Since $Bd_{\max}=\sum\nolimits_{j=1}^{\textit{diffs}}u(\{1,3,5\},ST_{j})>\textit{diffu}$, the utility reduction strategy is adopted to hide {1, 3, 5}. The victim transaction is then identified as follows. The non-sensitive itemsets supported by transaction $T_{8}$ are {3}, {5} and {3, 5}; the itemset {4, 5} is not counted because it has already been falsely concealed. Thus, $\textit{NSPC}(T_{8})=3$. Since $\textit{NSPC}(T_{3})=3$ as well, the victim transaction is selected randomly; assume that $T_{3}$ is chosen. Next, a weight is assigned to each sensitive item in $T_{3}$. Since $w(1)=1+1/0=\infty$, $w(3)=1+1/2=3/2$ and $w(5)=1+1/2=3/2$, item 1 is the victim item $I_{\textit{vic}}$. Then, we check whether to delete item 1 or to decrease its internal utility. Because $\textit{diu}=\left\lceil{2/3}\right\rceil=1<iu(1,T_{3})$, the decreasing operation is performed: $iu(1,T_{3})$ is decreased to 1 and $u(\{1,3,5\})$ is updated to 48, which is less than minutil. Therefore, {1, 3, 5} is concealed. Overall, both sensitive itemsets are hidden, and the non-sensitive itemsets {4, 5} and {2, 4, 5} are hidden by mistake.
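The choice between deleting the victim item and decreasing its internal utility can be sketched as follows. The sketch assumes that diu is obtained by dividing diffu by the external utility of the victim item and rounding up, which is consistent with $\textit{diu}=\lceil 2/3\rceil$ in the example if the external utility of item 1 is 3. That value and the starting internal utility of 2 are assumptions made for illustration, since Table 1 is not reproduced in this section. The function name `reduce_utility` is hypothetical.

```python
import math


def reduce_utility(diffu, external_utility, internal_utility):
    """Decide between deletion and decrease for the victim item (illustrative sketch).

    Returns (operation, new internal utility, utility removed from the itemset).
    """
    # Assumed definition: internal utility that must be removed to cover diffu.
    diu = math.ceil(diffu / external_utility)
    if diu >= internal_utility:
        # Not enough internal utility to absorb the reduction: delete the item.
        return "delete", 0, external_utility * internal_utility
    # Otherwise lower the internal utility by diu and keep the item.
    return "decrease", internal_utility - diu, external_utility * diu


# Assumed values for item 1 in T3: external utility 3, internal utility 2.
operation, new_iu, removed = reduce_utility(diffu=2, external_utility=3, internal_utility=2)
print(operation, new_iu, 51 - removed)
```

With these assumed values the internal utility drops to 1 and the utility of {1, 3, 5} falls from 51 to 48, matching the figures in the example.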

5. Experiments

In this section, we compare the proposed HUFI algorithm with the MCRSU and MSMU algorithms [25] in terms of execution time, hiding failure, missing cost and database utility difference. The experiments were divided into two phases. In the first phase, the EFIM algorithm [28] was used to mine the high utility itemsets, and the itemsets that appear frequently were then generated from these high utility itemsets. In the second phase, sensitive itemsets were generated randomly, and the sanitization approaches were applied to hide them. All algorithms were implemented in Java and run on a machine with an Intel Xeon E5-2360 2.8 GHz CPU and 8 GB of RAM. We carried out the experiments on four real datasets [36], which are listed in Table 3. Density is measured as the average transaction length divided by the number of items. For all datasets, the internal utilities of items in transactions were generated using a uniform distribution over the interval [1, 10], and the external utilities of items were generated using a Gaussian (normal) distribution.

Table 3
The characteristics of the four datasets

Dataset	No. of transactions	No. of items	Avg. trans. length	Density (%)
Chess	3196	75	40.2	53.6
Bms_2	77512	3340	4.6	0.13
Retail	88162	16470	10.3	0.062
Chainstore	1112949	46086	7.2	0.016
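As a quick check of the density definition, the following Python sketch recomputes the density column from the average transaction length and the number of items in Table 3. The helper name `density_percent` is hypothetical. The reported values appear to be rounded or truncated, so the recomputed figures can differ in the last digit.

```python
# (transactions, items, average transaction length, reported density in %) from Table 3
DATASETS = {
    "Chess": (3196, 75, 40.2, 53.6),
    "Bms_2": (77512, 3340, 4.6, 0.13),
    "Retail": (88162, 16470, 10.3, 0.062),
    "Chainstore": (1112949, 46086, 7.2, 0.016),
}


def density_percent(avg_length, n_items):
    """Density in percent: average transaction length divided by the number of items."""
    return 100.0 * avg_length / n_items


for name, (_, n_items, avg_length, reported) in DATASETS.items():
    computed = density_percent(avg_length, n_items)
    print(f"{name}: computed {computed:.3f}%, reported {reported}%")
```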

Table 4
The parameter settings of the four datasets

Chess			Bms_2
Sensitive itemsets size	$\varepsilon$ (%)	minsup (%)	Sensitive itemsets size	$\varepsilon$ (%)	minsup (%)
varied	20.8	37.5	varied	0.138	0.129
10	varied	40.6	20	varied	0.129
10	20.8	varied	20	0.138	varied
Retail			Chainstore
Sensitive itemsets size	$\varepsilon$ (%)	minsup (%)	Sensitive itemsets size	$\varepsilon$ (%)	minsup (%)
varied	0.018	0.113	varied	0.005	0.089
20	varied	0.113	20	varied	0.089
20	0.018	varied	20	0.005	varied

To better demonstrate the performance of the proposed HUFI algorithm, we conducted a series of experiments using various parameter settings, which are displayed in Table 4. Sensitive itemsets size denotes the number of sensitive itemsets, and $\varepsilon$ refers to the relative minimum utility threshold, which is calculated as follows:

$\displaystyle\varepsilon=\frac{\textit{minutil}}{\sum\nolimits_{T_{q}\in D}tu(T_{q})}$ (16)

where minutil is the minimum utility threshold, and $\sum\nolimits_{T_{q}\in D}tu(T_{q})$ is the total utility of database $D$. The minsup listed in Table 4 represents the minimum support threshold. Since all the sanitization algorithms hide the sensitive itemsets completely, the results for hiding failure are not shown in this section.
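Eq. (16) can be applied in both directions, as the following Python sketch shows. The function names and the transaction utilities are hypothetical and are not taken from the experiments. Note that Table 4 reports $\varepsilon$ as a percentage, whereas the sketch works with a plain ratio.

```python
def relative_utility_threshold(minutil, transaction_utilities):
    """Eq. (16): minutil divided by the total utility of the database."""
    return minutil / sum(transaction_utilities)


def absolute_utility_threshold(epsilon, transaction_utilities):
    """Inverse of Eq. (16): recover minutil from a relative threshold."""
    return epsilon * sum(transaction_utilities)


# Hypothetical transaction utilities tu(T_q); not taken from the paper.
tu = [100, 150, 250]
eps = relative_utility_threshold(50, tu)
print(f"epsilon = {eps:.1%}")
print(absolute_utility_threshold(eps, tu))
```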

Figure 1.

Execution time at various sensitive itemsets sizes.

5.1 Execution time

Figure 1 shows the runtime of each algorithm on the four datasets as the sensitive itemsets size increases. The runtime of all algorithms increases with the sensitive itemsets size. The reason is that when the number of sensitive itemsets rises, the total utility of the sensitive itemsets also increases, so more computation is required to identify the victim items to be modified. The results in Fig. 1 indicate that the MSMU algorithm outperforms the HUFI and MCRSU algorithms. The MSMU and MCRSU algorithms first sort the sensitive transactions in ascending order of the number of sensitive itemsets supported by each transaction, and then select the victim transaction on the basis of the sorted sensitive transactions. For the MSMU algorithm, once a victim transaction is determined, the item with the minimum support and maximum utility is selected as the victim item. The MCRSU algorithm uses the conflict ratio to identify the victim item, and the HUFI algorithm selects the item that affects the minimal number of non-sensitive itemsets and the maximal number of sensitive itemsets. Thus, the MSMU algorithm takes less time than the other algorithms.

In addition, for a sensitive itemset $X$, both the MSMU and MCRSU algorithms consider $X$ hidden only if its utility and support are below minutil and minsup, respectively. Their strategy is first to reduce the support of $X$ below minsup and then to check whether the utility of $X$, denoted as $u(X)$, is below minutil. If $u(X)$ is still higher than minutil, the utility of $X$ is reduced until it falls below minutil; otherwise, $X$ is already hidden. In contrast, the HUFI algorithm stops sanitizing $X$ as soon as either its support or its utility is below the corresponding threshold. Thus, the MCRSU algorithm takes more time than the HUFI algorithm. Note that in Fig. 1c the HUFI algorithm has the lowest efficiency because the Retail dataset is very sparse, so HUFI spends much more time selecting the victim transaction than the other sanitization algorithms.

Figure 2.

Execution time at various utility thresholds.

The execution times of the three algorithms under various utility thresholds $\varepsilon$ are plotted in Fig. 2. The MSMU algorithm has the best performance owing to its hiding method, and MCRSU performs better than HUFI on the Retail and Chainstore datasets. The reason is that these two datasets are much sparser than the other datasets. In addition, HUFI hides a sensitive itemset $X$ on the basis of the maximum boundary value $Bd_{\max}$; that is, $Bd_{\max}$ determines whether the support reduction strategy or the utility reduction strategy is adopted to hide $X$. Thus, as the utility threshold $\varepsilon$ increases, the ratio of itemsets hidden with the utility reduction strategy, $R_{u}$, also rises. For example, in Fig. 2b, when the relative minimum utility threshold is 0.104%, $R_{u}$ is 28%, whereas when the threshold is 0.138%, $R_{u}$ increases to 64%.

The execution times of the three algorithms under various minimum support thresholds were also compared. As shown in Fig. 3, the MSMU algorithm has the best results on all datasets. The HUFI algorithm outperforms MCRSU on all datasets except Retail and Chainstore, because HUFI takes a lot of time to select the victim transactions. It can also be observed that the ratio of itemsets hidden with the support reduction strategy, $R_{s}$, increases with minsup for the HUFI algorithm, which is a consequence of the hiding method based on the maximum boundary value. For example, in Fig. 3a, when minsup is 40.6%, no sensitive itemsets are hidden using the support reduction strategy, whereas when minsup increases to 53.1%, $R_{s}$ increases to 19%. In addition, the runtime increases suddenly when minsup is 0.147% in Fig. 3c. This is because the sensitive itemsets are selected randomly for each minsup, and the utility and support of the sensitive itemsets selected with $\textit{minsup}=0.147\%$ are higher than those of the other sensitive itemsets.

Figure 3.

Execution time at various minimum support thresholds.

Figure 4.

MC at various sensitive itemsets sizes.

Figure 5.

MC at various utility thresholds.

5.2 Missing cost

The missing cost (MC) of the three algorithms under various sensitive itemsets sizes for the four datasets is displayed in Fig. 4. The missing cost generally increases as the sensitive itemsets size increases. This is reasonable because, when the sensitive itemsets size increases, the total utility that must be removed from the sensitive itemsets also increases, and the side effects on the non-sensitive itemsets become correspondingly more serious. Fig. 4 also shows that the HUFI algorithm produces the fewest missing itemsets on each dataset. By applying the maximum boundary value to select the hiding strategy, HUFI effectively reduces the side effects on non-sensitive itemsets. In contrast, the MCRSU and MSMU algorithms directly hide the sensitive itemsets using the support reduction strategy without considering the impact on the non-sensitive knowledge. Moreover, they continue to sanitize a sensitive itemset until both its utility and its support are below the respective thresholds. Thus, the proposed HUFI algorithm has the best performance in minimizing the missing cost.

The missing costs of the three algorithms under various utility thresholds and support thresholds are displayed in Figs 5 and 6, respectively. Figure 5 shows that the HUFI algorithm achieves the lowest missing cost among the three algorithms on all datasets. In particular, in Fig. 5a, when the utility threshold is 21.1%, the MC of HUFI is 20.7%, while the MC of MCRSU and MSMU reaches 92.8% and 98.3%, respectively. This is a result of the hiding method of the HUFI algorithm. In addition, the MC on the Chess dataset is much higher than that on the Retail dataset. This is reasonable because Chess is much denser than Retail and the average support of the itemsets in Chess is higher; as a result, the damage to the non-sensitive knowledge in Chess is larger. The results in Fig. 6 show that the proposed HUFI algorithm also achieves better performance than the other algorithms under various minsup values. The reason is that HUFI applies the concept of maximum boundary value to select a better hiding strategy for concealing sensitive itemsets, which makes it more flexible than the MCRSU and MSMU algorithms.

Figure 6.

MC at various minimum support thresholds.

Figure 7.

Diff at various sensitive itemsets sizes.

Figure 8.

Diff at various utility thresholds.

Figure 9.

Diff at various minimum support thresholds.

5.3 Database utility difference

Database utility difference (Diff) measures the utility difference between the original and the sanitized database; it reveals the amount of utility that is lost in the sanitization process. A lower Diff indicates that less information is lost. Figure 7 shows the results of the HUFI, MCRSU and MSMU algorithms at various sensitive itemsets sizes for the four datasets. Diff increases as the number of sensitive itemsets increases, which is reasonable because more sensitive itemsets are hidden and more information is lost. Fig. 7 also shows that the proposed HUFI algorithm achieves better performance than the state-of-the-art algorithms MCRSU and MSMU. This is because the maximum boundary value is used as a criterion to select an appropriate hiding strategy in HUFI. The MSMU algorithm has the worst performance in terms of Diff, because it uses only the minimum support value and the maximum utility value to choose the victim items and does not take the impact on the non-sensitive itemsets into consideration. In addition, the Diff on the Chess dataset is much higher than that on the other datasets. Chess is a very dense dataset, and the average support of the itemsets in a dense dataset is higher. Thus, modifying transactions in the Chess dataset causes more information loss during the sanitization process.

The database utility differences of the three algorithms for various $\varepsilon$ and various minsup are shown in Figs 8 and 9, respectively. In both figures, the proposed HUFI algorithm outperforms the other algorithms. The reason is that HUFI, by using the maximum boundary value, greatly reduces the side effects on the resulting database. Besides, HUFI selects as the victim transaction a transaction that affects the minimum number of non-sensitive utility and frequent itemsets. In such a transaction, an item that affects the maximum number of sensitive itemsets and the minimum number of non-sensitive itemsets is selected as the victim item. Thus, the impact on the database is reduced. It is interesting to observe that Diff increases suddenly in Fig. 9c when minsup is 0.147%. The reason is that both the utility and the support count of the sensitive itemsets selected at this threshold are higher, so the damage to the database quality is larger.

6. Conclusions

In this paper, a hiding algorithm named HUFI is proposed to conceal sensitive utility and frequent itemsets. The algorithm uses the concept of maximum boundary value to determine whether the support reduction or the utility reduction strategy is adopted to hide a sensitive itemset. HUFI selects a transaction that supports the minimum number of non-sensitive itemsets to be sanitized. In such a transaction, the item with the highest weight is selected as the victim item; a high weight indicates that the item affects the maximal number of sensitive itemsets and the minimal number of non-sensitive itemsets. The performance of the proposed algorithm is compared with the state-of-the-art algorithms MCRSU and MSMU on real databases. The experimental results demonstrate that HUFI performs better than the other algorithms in maintaining database quality and minimizing the side effects on non-sensitive knowledge. Another advantage of HUFI is that it provides more flexibility in hiding sensitive itemsets. In addition, it is observed that the density of a database has a great impact on the performance of sensitive knowledge hiding.

In the future, we will focus on utility and frequent itemset mining. Existing algorithms spend a lot of time mining the utility and frequent itemsets, and the main challenge is how to design pruning strategies that effectively reduce the search space.

Acknowledgments

This work is partially supported by NSF-China and Guangdong Province Joint Project (Grant No. U1301252); National Natural Science Foundation of China (Grant No. 61272543).

References

1. Amiri, Dare to share: Protecting sensitive knowledge with data sanitization, Decision Support Systems 43(1) (2007), 181–191.

2. Gkoulalas-Divanis, Haritsa and Kantarcioglu, Privacy issues in association rule mining, Frequent Pattern Mining, Springer International Publishing, 2014, 369–401.

3. Gkoulalas-Divanis and Verykios V.S., Exact knowledge hiding through database extension, IEEE Transactions on Knowledge and Data Engineering 21(5) (2009), 699–713.

4. Clifton and Marks, Security and privacy implications of data mining, in: Proceedings of the ACM SIGMOD Workshop on Research Issues on Data Mining and Knowledge Discovery, 1996, pp. 15–19.

5. O’Leary D.E., Knowledge discovery as a threat to database security, in: Proceedings of the 1st International Conference in Knowledge Discovery and Database, 1991, pp. 107–516.

6. Dasseni, Verykios V.S., Elmagarmid A.K. and Bertino, Hiding association rules by using confidence and support, in: Proceedings of the 4th International Workshop on Information Hiding, 2001, pp. 369–383.

7. Lee and Chen Y.C., Protecting sensitive knowledge in association patterns mining, Wiley Interdisciplinary Reviews: Data Mining and Knowledge Discovery 2(1) (2012), 60–68.

8. Moustakides G.V. and Verykios V.S., A Max-Min approach for hiding frequent itemsets, in: Proceedings of the 6th International Conference on Data Mining, 2006, pp. 502–506.

9. Moustakides G.V. and Verykios V.S., A MaxMin approach for hiding frequent itemsets, Data and Knowledge Engineering 65(1) (2008), 75–89.

10. Mannila and Toivonen, Levelwise search and borders of theories in knowledge discovery, Data Mining and Knowledge Discovery 1(3) (1997), 241–258.

11. H.Q., Arch Int and Arch Int, Association rule hiding based on intersection lattice, Mathematical Problems in Engineering 2013 (2013), 1–11.

12. H.Q., Arch Int, Nguyen H.X. and Arch Int, Association rule hiding in risk management for retail supply chain collaboration, Computers in Industry 64(7) (2013), 776–784.

13. Yao, Hamilton H.J. and Butz C.J., A foundational approach to mining itemset utilities from databases, in: Proceedings of the 4th SIAM International Conference on Data Mining, 2004, pp. 482–486.

14. Lin J.C.W., T.Y., Fournier-Viger, Lin, Zhan and Voznak, Fast algorithms for hiding sensitive high-utility itemsets in privacy-preserving utility mining, Engineering Applications of Artificial Intelligence 55(C) (2016), 269–284.

15. Yeh J.S. and Hsu P.C., HHUIF and MSICF: Novel algorithms for privacy preserving utility mining, Expert Systems with Applications 37(7) (2010), 4779–4786.

16. Atallah, Bertino, Elmagarmid, Ibrahim and Verykios, Disclosure limitation of sensitive rules, in: Proceedings of the 1999 Workshop on Knowledge and Data Engineering Exchange, 1999, pp. 45–52.

17. Liu and Qu, Mining high utility itemsets without candidate generation, in: Proceedings of the 21st ACM International Conference on Information and Knowledge Management, 2012, pp. 55–64.

18. Chen M.S., Han and Yu P.S., Data mining: an overview from a database perspective, IEEE Transactions on Knowledge and Data Engineering 8(6) (1996), 866–883.

19. Cheng, Lin C.W. and Pan J.S., Use HypE to hide association rules by adding items, PLOS One 10(6) (2015), e0127834.

20. Cheng, Pan J.S. and Lin C.W., Privacy preserving association rule mining using binary encoded NSGA-II, in: Proceedings of the Pacific-Asia Conference on Knowledge Discovery and Data Mining, 2014, pp. 87–99.

21. Cheng, Chun Wei, Jeng Shyang and Ivan, Manage the tradeoff in data sanitization, IEICE Transactions on Information and Systems 98(10) (2015), 1856–1860.

22. Cheng, Ivan, Jeng Shyang, Chun Wei and Roddick J.F., Hide association rules with fewer side effects, IEICE Transactions on Information and Systems 98(10) (2015), 1788–1798.

23. Agrawal and Srikant, Fast algorithms for mining association rules, in: Proceedings of the 20th International Conference on Very Large Data Bases, 1994, pp. 487–499.

24. Shah R.A. and Asghar, Privacy preserving in association rules using a genetic algorithm, Turkish Journal of Electrical Engineering and Computer Sciences 22(2) (2014), 434–450.

25. Rajalaxmi R.R. and Natarajan A.M., Effective sanitization approaches to hide sensitive utility and frequent itemsets, Intelligent Data Analysis 16(6) (2012), 933–951.

26. Krishnamoorthy, Pruning strategies for mining high utility itemsets, Expert Systems with Applications 42(5) (2015), 2371–2381.

27. Oliveira S.R.M. and Zaïane O.R., Protecting sensitive knowledge by data sanitization, in: Proceedings of the 3rd International Conference on Data Mining, 2003, pp. 613–616.

28. Zida, Fournier-Viger, Lin C.W. and Tseng V.S., EFIM: A highly efficient algorithm for high-utility itemset mining, in: Mexican International Conference on Artificial Intelligence, 2015, pp. 530–546.

29. Hong T.P., Lin C.W., Yang K.T. and Wang S.L., Using TF-IDF to hide sensitive itemsets, Applied Intelligence 38(4) (2013), 502–510.

30. Yun and Kim, A fast perturbation algorithm using tree structure for privacy preserving utility mining, Expert Systems with Applications 42(3) (2015), 1149–1165.

31. Verykios V.S., Elmagarmid A.K., Bertino and Saygin, Association rule hiding, IEEE Transactions on Knowledge and Data Engineering 16(4) (2004), 434–447.

32. Verykios V.S., Association rule hiding methods, Wiley Interdisciplinary Reviews: Data Mining and Knowledge Discovery 3(3) (2013), 28–36.

33. Sun and Yu P.S., A border-based approach for hiding sensitive frequent itemsets, in: Proceedings of the 5th International Conference on Data Mining, 2005, pp. 426–433.

34. Sun and Yu P.S., Hiding sensitive frequent itemsets by a border-based approach, Journal of Computing Science and Engineering 1(1) (2007), 74–94.

35. Y.H., Chiang C.M. and Chen A.L., Hiding sensitive association rules with limited side effects, IEEE Transactions on Knowledge and Data Engineering 19(1) (2007), 29–42.

36. SPMF: An open-source data mining library, http://www.philippe-fournier-viger.com/spmf/index.php?link=datasets.php.
