An efficient algorithm for hiding sensitive-high utility itemsets

Abstract

Privacy-preserving utility itemset mining is the process of hiding the sensitive-high utility itemsets (SHUIs) that appear in an original database so that they cannot be discovered in the sanitized database. The purpose of an SHUI hiding algorithm is to conceal the set of SHUIs while minimizing the side effects caused by the data distortion process. In this paper, a novel algorithm named EHSHUI (An Efficient Algorithm for Hiding Sensitive-high utility Itemsets) is proposed to minimize the side effects of the sanitization process. The proposed algorithm includes three heuristic steps: (1) the transaction in which the SHUI achieves maximal utility among the transactions containing it is chosen as the victim transaction; (2) the item whose modification has the least impact on non-SHUIs is selected as the victim item; and (3) the exact amount of utility to remove is calculated and the internal utility of the victim item in the victim transaction is reduced by that amount. This strategy identifies exactly which item and which transaction to modify, so that it minimizes the impact on non-SHUIs, the data distortion, and the time spent accessing the database. The experimental results show that the proposed algorithm achieves higher performance and lower side effects than state-of-the-art algorithms.

Keywords

High utility mining, high utility itemset, sensitive-high utility itemset hiding, privacy-preserving utility mining

1. Introduction

With the development of data storage technology and the widespread use of internet applications, data mining has become an extremely important topic in computer science. Data mining techniques, especially pattern mining [4, 18, 25, 3, 5, 8, 12, 21] and association rule mining [1, 2, 16, 24], aim to discover useful knowledge implicit in a database. Because of its many applications, frequent itemset mining has become a classical pattern mining problem and is applied in many fields. Its purpose is to discover the itemsets that satisfy a given minimal support threshold in a given database. Although it has proven very useful in many applications, it is limited because it does not consider the role of the data items in the database [2]. In practice, itemset mining becomes more meaningful when the value of the data items is considered together with their frequency. For example, a sales database stores the quantity of each item sold in each transaction, and the profit of each item is also of interest.

To overcome the limitation of frequent itemset mining, a pattern mining method was proposed that mines itemsets together with their quantities and profits. It is known as high utility itemset (HUI) mining. Yao et al. [26] proposed the first approach for discovering high utility itemsets. Many more efficient algorithms [25, 5, 8, 12, 21, 13, 6, 29, 20] based on the approach proposed in [26] have since been published.

These days, sharing data provides mutual benefits for collaborating organizations but increases the risk of leaking sensitive knowledge. High utility itemset mining techniques allow systematic discovery of knowledge from huge databases. Knowledge that supports decision making is said to be sensitive knowledge, and it can be inferred from sensitive-high utility itemsets (SHUIs). Sharing data for mining may therefore reveal sensitive knowledge [15, 9]. The need for privacy [7] prompted the growth of privacy-preserving high utility mining techniques. To deal with privacy concerns, SHUI hiding algorithms have been proposed to sanitize the original database in such a way that the sensitive knowledge is concealed. The aim of such an algorithm is to modify data items in specific transactions in order to reduce the utility of each SHUI below the minimal utility threshold. SHUIs then cannot be discovered from the sanitized database by unauthorized users at the same minimal utility threshold. This process usually causes side effects, such as non-SHUIs lost through the hiding process or reduced similarity between the original database and the sanitized database.

The target of an SHUI hiding algorithm is to hide every SHUI with minimal side effects. Its performance depends on the strategy for choosing the victim transaction and the victim item for data modification. Yeh et al. [27] were the first to propose a methodology for hiding SHUIs, comprising the HHUIF (Hiding High Utility Item First) and MSICF (Maximum Sensitive Itemsets Conflict First) algorithms. The technique used in [27] is to reduce the internal utility of a victim item in a victim transaction so that the utility of the SHUI falls below the minimal utility threshold. Building on this idea, more efficient algorithms [17, 10, 11, 19] were then proposed. In these algorithms, the authors proposed heuristic strategies that select the victim item based on its utility: the item having the highest utility among the sensitive items is selected for modification. This strategy does not evaluate exactly how many non-SHUIs would be lost by the sanitization process. Thus, the chosen victim item is not necessarily the best one for minimizing the side effects.

This paper proposes a heuristic algorithm for hiding SHUIs that specifies exactly the victim transaction and the victim item for data modification. The victim item is chosen so that, where possible, a single modification in the victim transaction suffices and that modification has the least impact on non-SHUIs. Two theorems are given to compute exactly the minimal utility that must be removed from the victim item. The experimental results show that the proposed algorithm performs better than the algorithms proposed in [27, 11]. The remainder of the paper is organized as follows: Section 2 introduces the related work, Section 3 proposes the SHUI hiding algorithm named EHSHUI, Section 4 describes the experimental results, and the last section concludes.

2. Related works

2.1 High utility itemset mining

An HUI mining algorithm aims to discover, with low CPU time, the itemsets whose utility is no less than a user-given minimal utility threshold. The first HUI mining model was proposed by Yao et al. [26], who defined two measures of the utility of an itemset: internal utility (transaction utility) and external utility. Liu et al. proposed a two-phase HUI mining algorithm [14]. In the first phase, the authors applied a closure property named TWU (Transaction-Weighted-Utilization) to prune the search space when generating candidate itemsets. The second phase identified the HUIs among the candidate itemsets. The Two-phase method was later improved by the algorithms in [5, 22, 21]. However, the set of high-TWU itemsets is larger than the set of high utility itemsets. Consequently, the large search space led to high running times for these algorithms.

To overcome this drawback, Liu et al. [12] proposed a new data structure named utility-list and the HUI-Miner algorithm, which mines HUIs without generating candidate itemsets. The experimental results showed that HUI-Miner outperformed the Two-phase algorithms. Using the same idea as HUI-Miner, Fournier-Viger et al. [6] proposed the FHM algorithm. By analyzing which items appear together in the same transactions, FHM restricts the joins between utility-lists when calculating the utility of itemsets, and so requires less CPU time than HUI-Miner.

Zida et al. [28] proposed the EFIM (A Highly Efficient Algorithm for High-Utility Itemset Mining) algorithm. To prune the search space, the authors defined two upper bounds named sub-tree utility and local utility. In addition, a fast calculation method, denoted FUC (Fast Utility Counting), was applied to compute these upper bounds in linear time and space. Moreover, EFIM applies database projection and transaction merging techniques in order to minimize the cost of database scans. The experiments indicated that EFIM was more efficient than previous algorithms.

In the remainder of this section, we present the relevant concepts, which were defined in [26, 13]:

Definition 1 (Transaction database): Let $I=\{{{i_{1}},{i_{2}},\ldots,{i_{m}}}\}$ be a finite set of items, where each item ${i_{l}}\in I$ has an external utility $p({i_{l}})$ . An itemset $X=\left\{{{x_{1}},{x_{2}},\ldots,{x_{k}}}\right\}$ is a set of $k$ distinct items, where ${x_{j}}\in I,1\leqslant j\leqslant k$ , and $k$ is the length of $X$ . A transaction database is a set of transactions $D=\left\{{{T_{1}},{T_{2}},\ldots,{T_{n}}}\right\}$ , where each transaction ${T_{c}}\subseteq I,1\leqslant c\leqslant n$ has a unique identifier, called its Tid. Each item ${i_{p}}$ in a transaction ${T_{c}}$ is associated with a quantity $q({i_{p}},{T_{c}})$ , which is the number of units of item ${i_{p}}$ appearing in the transaction ${T_{c}}$ .

Table 1
Transaction dataset D

Tid	Transaction	Tid	Transaction
T1	A (4), C (1), E (4), F (2), G (1)	T6	B (1), F (1), H (2)
T2	D (2), E (4), F (3)	T7	D (1), E (1), F (4), G (1), H (1)
T3	A (1), B (3), D (2), E (5), F (1)	T8	B (1), D (1), F (1)
T4	D (1), E (2), F (6)	T9	B (4), D (4), G (10)
T5	A (3), B (2), C (1), E (1)

Table 2

External utility of transaction dataset D

Item	A	B	C	D	E	F	G	H
Utility	3	5	1	1	2	1	2	1

For example, $q(A,T_{1})=4$ and $p(A)=3$ ; $q(C,T_{1})=1$ and $p(C)=1$ .

Definition 2 (Item utility in a transaction): The utility of an item $i$ in a transaction $T_{c}$ , denoted as $u(i,T_{c})$ , is defined as:

$\displaystyle u(i,T_{c})=q(i,T_{c})\times p(i)$

For example, $u(A,T_{1})=q(A,T_{1})\times p(A)=4\times 3=12$ ; $u(C,T_{1})=q(C,T_{1})\times p(C)=1\times 1=1$ .

Definition 3 (Itemset utility in a transaction): The utility of an itemset $X$ in a transaction $T_{c}$ , denoted as $u(X,T_{c})$ , is defined as:

$\displaystyle u(X,{T_{c}})=\sum\nolimits_{i\in X}{u(i,{T_{c}})}$

For example, $u(\{AC\},T_{1})=u(A,T_{1})+u(C,T_{1})=13$ ; $u(\{AC\},T_{5})=u(A,T_{5})+u(C,T_{5})=10$ .

Definition 4 (Itemset utility in database): The utility of an itemset $X$ in transaction database $D$ , denoted as $u(X)$ , is defined as:

$\displaystyle u(X)=\sum\nolimits_{X\subseteq{T_{c}}\wedge{T_{c}}\in D}{u(X,{T_{c}})}$

For example, $u(\{AC\})=u(\{AC\},T_{1})+u(\{AC\},T_{5})=13+10=23$ .
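The three utility definitions above can be checked mechanically against Tables 1 and 2. The following Python sketch is only an illustration: the dictionary layout and the function names `item_utility`, `itemset_utility_in`, and `itemset_utility` are assumptions made here, not part of the paper.

```python
# External utilities p(i) from Table 2 and quantities q(i, T_c) from Table 1.
PRICE = {"A": 3, "B": 5, "C": 1, "D": 1, "E": 2, "F": 1, "G": 2, "H": 1}
DB = {
    "T1": {"A": 4, "C": 1, "E": 4, "F": 2, "G": 1},
    "T2": {"D": 2, "E": 4, "F": 3},
    "T3": {"A": 1, "B": 3, "D": 2, "E": 5, "F": 1},
    "T4": {"D": 1, "E": 2, "F": 6},
    "T5": {"A": 3, "B": 2, "C": 1, "E": 1},
    "T6": {"B": 1, "F": 1, "H": 2},
    "T7": {"D": 1, "E": 1, "F": 4, "G": 1, "H": 1},
    "T8": {"B": 1, "D": 1, "F": 1},
    "T9": {"B": 4, "D": 4, "G": 10},
}


def item_utility(item, tid):
    """u(i, T_c) = q(i, T_c) * p(i)."""
    return DB[tid][item] * PRICE[item]


def itemset_utility_in(itemset, tid):
    """u(X, T_c): sum of the item utilities of X in one transaction."""
    return sum(item_utility(i, tid) for i in itemset)


def itemset_utility(itemset):
    """u(X): sum of u(X, T_c) over the transactions that contain X."""
    return sum(itemset_utility_in(itemset, tid)
               for tid, t in DB.items() if set(itemset).issubset(t))


print(item_utility("A", "T1"))         # 12
print(itemset_utility_in("AC", "T5"))  # 10
print(itemset_utility("AC"))           # 23
```

Item names are single characters here, so a string such as `"AC"` can stand for the itemset $\{A, C\}$.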

Definition 5 (High utility itemset): Let $\varepsilon$ be a given minimum utility threshold. An itemset $X$ is said to be a high utility itemset if the utility of $X$ is not less than $\varepsilon$ . Let HUIs be the set of high utility itemsets; then $\textit{HUIs}=\{X|X\subseteq I,u(X)\geqslant\varepsilon\}$

For example, Table 3 lists the high utility itemsets mined from the data set given in Tables 1 and 2 when the minimum utility threshold is set to $\varepsilon=30$ .

Table 3

The set HUIs mined from D with $\varepsilon=30$

Itemset	Utility	Itemset	Utility	Itemset	Utility	Itemset	Utility
B	55	BD	47	EF	48	BDG	44
E	34	BE	37	ABE	49	DEF	44
AB	37	BG	40	ACE	33	ABDE	30
AE	44	DE	30	AEF	36	ABDEF	31
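Because the example database has only eight items, Table 3 can be reproduced by exhaustive enumeration. The sketch below is a brute-force check written for this illustration only; it is not one of the mining algorithms cited in this section, and the names `utility` and `mine_huis` are assumptions.

```python
from itertools import combinations

PRICE = {"A": 3, "B": 5, "C": 1, "D": 1, "E": 2, "F": 1, "G": 2, "H": 1}
DB = {
    "T1": {"A": 4, "C": 1, "E": 4, "F": 2, "G": 1},
    "T2": {"D": 2, "E": 4, "F": 3},
    "T3": {"A": 1, "B": 3, "D": 2, "E": 5, "F": 1},
    "T4": {"D": 1, "E": 2, "F": 6},
    "T5": {"A": 3, "B": 2, "C": 1, "E": 1},
    "T6": {"B": 1, "F": 1, "H": 2},
    "T7": {"D": 1, "E": 1, "F": 4, "G": 1, "H": 1},
    "T8": {"B": 1, "D": 1, "F": 1},
    "T9": {"B": 4, "D": 4, "G": 10},
}


def utility(itemset, db, price):
    """u(X): total utility of X over the transactions containing X."""
    return sum(sum(t[i] * price[i] for i in itemset)
               for t in db.values() if set(itemset).issubset(t))


def mine_huis(db, price, eps):
    """Enumerate every non-empty itemset and keep those with u(X) >= eps."""
    items = sorted(price)
    huis = {}
    for k in range(1, len(items) + 1):
        for combo in combinations(items, k):
            u = utility(combo, db, price)
            if u >= eps:
                huis["".join(combo)] = u
    return huis


huis = mine_huis(DB, PRICE, eps=30)
print(len(huis))      # 16
print(huis["BDG"])    # 44
print(huis["ABDEF"])  # 31
```

Enumeration costs $2^{|I|}$ utility computations, which is why practical miners rely on pruning bounds such as TWU.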

Definition 6 (The utility of a transaction): The utility of a transaction $T_{c}$ is denoted as $tu(T_{c})$ and defined as:

$\displaystyle tu(T_{c})=\sum\nolimits_{i\in T_{c}}{u(i,T_{c})}$

2.2 Sensitive-high utility itemset hiding

High utility itemsets discovered from a database are useful to the data owner. A high utility itemset that can be used to support the decision-making process is said to be an SHUI. To protect this sensitive information from disclosure, the data owner has to protect the SHUIs from unauthorized miners when sharing data outside his/her company.

Hiding SHUIs is a process that modifies the internal utility of some items in order to transform an original database into a sanitized database from which the SHUIs cannot be mined. Modifying the database to hide SHUIs causes side effects on the HUIs and on the database itself; the algorithm that causes fewer side effects is the better one. Yeh et al. [27] defined three measures of the performance of hiding algorithms: Hiding Failure (HF), Missing Cost (MC), and the Difference between the original and sanitized databases (DIF).

The target of the proposed algorithms is to hide SHUIs while minimizing the side effects. Yeh et al. [27] were the first to propose heuristic algorithms for hiding SHUIs, named HHUIF and MSICF. The main idea of both algorithms is to minimize side effects by selecting an appropriate victim item for database modification. The victim item chosen by HHUIF is the item that has maximal utility among the sensitive items in an SHUI, while the victim item selected by MSICF is the item that has maximal frequency among the sensitive items of all SHUIs. Although the algorithms in [27] hide SHUIs with low HF, they cause high MC and DIF because they do not determine exactly the minimal utility that must be removed to hide an SHUI. As a result, data modification may continue after an SHUI has already been hidden. Moreover, if the utility of an SHUI is equal to the minimal utility threshold, these algorithms cannot hide it. Selvaraj et al. [17] proposed an improvement named MHIS (Modified HHUIF algorithm with Item Selector). When an SHUI contains more than one item of maximal utility, MHIS gives priority to modifying the item with the higher frequency.

A novel method that hides SHUIs by adding pseudo transactions to the database was proposed by Lin et al. [10]. The authors applied a genetic algorithm (GA) to compute the exact number of additional transactions and the set of items in each of them. The experiments show that it is more efficient than previous methods. However, this method creates new HUIs (ghost HUIs, that is, itemsets that are non-HUIs in the original database but HUIs in the distorted database).

In 2016, Lin et al. [11] proposed two heuristic algorithms: MSU-MAU (Maximum Sensitive Utility-Maximum item Utility) and MSU-MIU (Maximum Sensitive Utility-Minimum item Utility). MSU-MAU chooses as victim transaction the transaction in which the SHUI achieves maximal utility, and as victim item the item having maximal utility among the sensitive items. MSU-MIU selects the victim transaction in the same way as MSU-MAU, but chooses as victim item the item having minimal utility among the sensitive items. Experimental results indicate that these algorithms perform better than the algorithms proposed in [27]. However, the drawback of the algorithms proposed in [27] is not resolved by MSU-MAU and MSU-MIU.

To measure exactly how much the database differs after sanitization, Lin et al. [11] defined three measures: Database structure similarity (DSS), Database utility similarity (DUS), and Itemsets utility similarity (IUS). These measures are very important for evaluating how closely the sanitized database resembles the original database.

Given a database and a minimal utility threshold, the definitions of the side effects of SHUI hiding algorithms proposed in [27, 11] are restated as follows:

Definition 7 (HF – Hiding Failure): The hiding failure of an SHUI hiding algorithm is the ratio between the number of SHUIs remaining in the sanitized database (SHUIs') and the number of SHUIs in the original database, namely:

$\displaystyle\text{HF}=\frac{|\text{SHUIs'}|}{|\text{SHUIs}|}$

Because of database sanitization, some of the non-sensitive high utility itemsets in the original database cannot be discovered from the sanitized database. This effect is called the missing cost and is defined in Definition 8.

Definition 8 (MC – Missing Cost): The missing cost of an SHUI hiding algorithm is the proportion of non-sensitive high utility itemsets in the original database that are mistakenly hidden by the hiding process, and is defined as follows:

$\displaystyle\text{MC}=\frac{|\text{non-SHUIs}-\text{non-SHUIs'}|}{|\text{non-SHUIs}|}$

Let $tp^{D}$ be the set of transaction patterns in the original database D and $tp^{D^{\prime}}$ be the set of transaction patterns in the sanitized database D’. Let $\textit{freq}(tp_{k}^{D})$ be the frequency of the k-th pattern in the original database D and $\textit{freq}(tp_{k}^{D^{\prime}})$ be the frequency of the k-th pattern in the sanitized database D’.

Definition 9 (DSS – Database structure similarity): Database structure similarity measures the structural difference between the original database D and the sanitized database D’, and is defined as follows:

$\displaystyle\text{DSS}=\sqrt{\sum\limits_{k=1}^{|tp_{k}^{D}\cup tp_{k}^{D^{\prime}}|}{(\textit{freq}(tp_{k}^{D})-\textit{freq}(tp_{k}^{D^{\prime}}))^{2}}}$

Definition 10 (DUS – Database utility similarity): Database utility similarity is the ratio between the total utility of the sanitized database and that of the original database, and is defined as follows:

$\displaystyle\text{DUS}=\frac{\sum\nolimits_{T_{c}\in D^{\prime}}{tu({T_{c}})}}{\sum\nolimits_{T_{c}\in D}{tu({T_{c}})}}$

where $tu(T_{c})$ is the utility of transaction $T_{c}$ .

Definition 11 (IUS – Itemsets utility similarity): Itemsets utility similarity is the ratio between the total utility of the HUIs in the sanitized database and that of the HUIs in the original database, and is defined as follows:

$\displaystyle\text{IUS}=\frac{\sum\nolimits_{X\in\textit{HUIs'}}{u(X)}}{\sum\nolimits_{X\in\textit{HUIs}}{u(X)}}$
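To make the five measures concrete, the sketch below evaluates them for one hand-made sanitization of the example database: $q(G,T_{9})$ is lowered from 10 to 4 in order to hide the itemset $\{B,G\}$ at $\varepsilon=30$. Several things here are assumptions of this illustration and not taken from [27, 11]: the choice of $\{B,G\}$ as the only SHUI, the brute-force miner, and the reading of a "transaction pattern" as the set of items in a transaction.

```python
from collections import Counter
from itertools import combinations
from math import sqrt

PRICE = {"A": 3, "B": 5, "C": 1, "D": 1, "E": 2, "F": 1, "G": 2, "H": 1}
D = {
    "T1": {"A": 4, "C": 1, "E": 4, "F": 2, "G": 1},
    "T2": {"D": 2, "E": 4, "F": 3},
    "T3": {"A": 1, "B": 3, "D": 2, "E": 5, "F": 1},
    "T4": {"D": 1, "E": 2, "F": 6},
    "T5": {"A": 3, "B": 2, "C": 1, "E": 1},
    "T6": {"B": 1, "F": 1, "H": 2},
    "T7": {"D": 1, "E": 1, "F": 4, "G": 1, "H": 1},
    "T8": {"B": 1, "D": 1, "F": 1},
    "T9": {"B": 4, "D": 4, "G": 10},
}


def mine(db, eps):
    """Brute-force HUI mining; returns {frozenset(itemset): utility}."""
    out = {}
    for k in range(1, len(PRICE) + 1):
        for combo in combinations(sorted(PRICE), k):
            u = sum(sum(t[i] * PRICE[i] for i in combo)
                    for t in db.values() if set(combo).issubset(t))
            if u >= eps:
                out[frozenset(combo)] = u
    return out


def tu(t):
    """Transaction utility tu(T_c)."""
    return sum(q * PRICE[i] for i, q in t.items())


def side_effects(db, db2, shuis, eps):
    """HF, MC, DSS, DUS and IUS of sanitized db2 relative to db."""
    huis, huis2 = mine(db, eps), mine(db2, eps)
    non_s, non_s2 = set(huis) - shuis, set(huis2) - shuis
    f1 = Counter(frozenset(t) for t in db.values())   # pattern frequencies in D
    f2 = Counter(frozenset(t) for t in db2.values())  # pattern frequencies in D'
    return {
        "HF": len(shuis & set(huis2)) / len(shuis),
        "MC": len(non_s - non_s2) / len(non_s),
        "DSS": sqrt(sum((f1[p] - f2[p]) ** 2 for p in set(f1) | set(f2))),
        "DUS": sum(tu(t) for t in db2.values()) / sum(tu(t) for t in db.values()),
        "IUS": sum(huis2.values()) / sum(huis.values()),
    }


D2 = {tid: dict(t) for tid, t in D.items()}  # sanitized copy D'
D2["T9"]["G"] = 4                            # u({B,G}) drops from 40 to 28
m = side_effects(D, D2, {frozenset("BG")}, eps=30)
print(m)
```

In this example the SHUI is hidden (HF = 0), no other HUI is lost (MC = 0), no transaction changes its set of items (DSS = 0), and DUS and IUS equal 159/171 and 587/639 respectively.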

3. Algorithm proposal

3.1 Sensitive-high utility itemset hiding strategy

Definition (Sensitive high utility itemset): Let $\text{S}\in\text{HUIs}$ be a sensitive high utility itemset in database D. The set of sensitive high utility itemsets is denoted by SHUIs, and is defined as:

$\displaystyle\text{SHUIs}=\{S\mid S\in\text{HUIs}\text{ and }S\text{ is used to support decision making}\}$

We have $\text{SHUIs}\subseteq\text{HUIs}$ .

A sensitive high utility itemset $S_{i}\in\text{SHUIs}$ is hidden if its utility is less than the minimal utility threshold, namely:

$\displaystyle u(S_{i})<\varepsilon$

Let $du$ be the minimal amount by which the utility of an itemset $S_{i}$ must be reduced in order to hide it, computed as $du=u(S_{i})-\varepsilon+1$ . $S_{i}$ is hidden if and only if $du\leqslant 0$ . To hide an itemset $S_{i}$ , the internal utility of an item $i_{k}\in S_{i}$ must be reduced in some transactions containing $S_{i}$ so that, after the modification, $u(S_{i})<\varepsilon$ . The target of the hiding process is to specify exactly one transaction and one item for each modification such that the side effects are minimal.

Definition (Victim item): The victim item, denoted by $i_{\textit{vic}}$ , is an item of a sensitive high utility itemset $S_{i}$ such that modifying $i_{\textit{vic}}$ in a transaction causes minimal side effects.

Definition (Victim transaction): The victim transaction, denoted by $T_{\textit{vic}}$ , is a transaction containing the sensitive high utility itemset $S_{i}$ such that modifying the internal utility of $i_{\textit{vic}}\in S_{i}$ in $T_{\textit{vic}}$ causes minimal side effects.

The hiding process includes three steps: (1) Victim transaction specification: $T_{\textit{vic}}$ , (2) Victim item specification: $i_{\textit{vic}}$ , and (3) Victim item modification: reduce the internal utility of $i_{\textit{vic}}$ until $du\leqslant 0$ .

Two cases arise in step (3). If $u(i_{\textit{vic}},T_{\textit{vic}})>du$ , the internal utility of item $i_{\textit{vic}}$ in $T_{\textit{vic}}$ is reduced by just enough to make $du\leqslant 0$ , so a single modification of $i_{\textit{vic}}$ hides $S_{i}$ . Otherwise, the item $i_{\textit{vic}}$ is removed from $T_{\textit{vic}}$ . The amount of internal utility to remove in each case is given by Theorems 1 and 2.

Theorem 1. If $u(i_{\textit{vic}},T_{\textit{vic}})>du$ and $i_{\textit{vic}}\in S_{i}$ , then the SHUI $S_{i}$ is hidden when $q(i_{\textit{vic}},T_{\textit{vic}})$ is reduced by $\textit{dec}=\left\lceil{\frac{{du}}{{p({i_{\textit{vic}}})}}}\right\rceil$ , which is the smallest integer reduction that suffices; that is, $q({i_{\textit{vic}}},{T_{\textit{vic}}})$ is updated to $q({i_{\textit{vic}}},{T_{\textit{vic}}})-\left\lceil{\frac{{du}}{{p({i_{\textit{vic}}})}}}\right\rceil$ .

Proof. Let setT be the set of transactions containing the SHUI $S_{i}$ before $i_{\textit{vic}}$ is modified in $T_{\textit{vic}}$ . It is calculated as follows:

$\displaystyle\textit{setT}=\{{{T_{c}}|{S_{i}}\subseteq{T_{c}}\wedge{T_{c}}\in D}\}$ (1)

$\displaystyle\Rightarrow u({S_{i}})=\sum\nolimits_{{T_{c}}\in\textit{setT}}{u({S_{i}},{T_{c}})}$ (2)

$\displaystyle=\sum\nolimits_{{T_{c}}\in\textit{setT}\backslash{T_{\textit{vic}}}}{u({S_{i}},{T_{c}})}+u({S_{i}},{T_{\textit{vic}}})$ (3)

$\displaystyle=\sum\nolimits_{{T_{c}}\in\textit{setT}\backslash{T_{\textit{vic}}}}{u({S_{i}},{T_{c}})}+\sum\nolimits_{i\in{S_{i}}\backslash{i_{\textit{vic}}}}{u(i,{T_{\textit{vic}}})}+u({i_{\textit{vic}}},{T_{\textit{vic}}})$ (4)

$\displaystyle=\sum\nolimits_{{T_{c}}\in\textit{setT}\backslash{T_{\textit{vic}}}}{u({S_{i}},{T_{c}})}+\sum\nolimits_{i\in{S_{i}}\backslash{i_{\textit{vic}}}}{u(i,{T_{\textit{vic}}})}+q({i_{\textit{vic}}},{T_{\textit{vic}}})\times p({i_{\textit{vic}}})$ (5)

where

$\displaystyle du=u({S_{i}})-\varepsilon+1$ (6)

$\displaystyle\Rightarrow du=\sum\nolimits_{{T_{c}}\in\textit{setT}\backslash{T_{\textit{vic}}}}{u({S_{i}},{T_{c}})}+\sum\nolimits_{i\in{S_{i}}\backslash{i_{\textit{vic}}}}{u(i,{T_{\textit{vic}}})}+q({i_{\textit{vic}}},{T_{\textit{vic}}})\times p({i_{\textit{vic}}})-\varepsilon+1$ (7)

Let dec be the minimal amount by which the quantity of $i_{\textit{vic}}$ must be reduced to hide $S_{i}$ . To obtain $du\leqslant 0$ we subtract dec from $q({i_{\textit{vic}}},{T_{\textit{vic}}})$ . After this decrease, the value of $du$ must satisfy:

$\displaystyle\sum\nolimits_{{T_{c}}\in\textit{setT}\backslash{T_{\textit{vic}}}}{u({S_{i}},{T_{c}})}+\sum\nolimits_{i\in{S_{i}}\backslash{i_{\textit{vic}}}}{u(i,{T_{\textit{vic}}})}+(q({i_{\textit{vic}}},{T_{\textit{vic}}})-\textit{dec})\times p({i_{\textit{vic}}})-\varepsilon+1\leqslant 0$ (8)

$\displaystyle\Leftrightarrow\sum\nolimits_{{T_{c}}\in\textit{setT}\backslash{T_{\textit{vic}}}}{u({S_{i}},{T_{c}})}+\sum\nolimits_{i\in{S_{i}}\backslash{i_{\textit{vic}}}}{u(i,{T_{\textit{vic}}})}+q(i_{\textit{vic}},T_{\textit{vic}})\times p(i_{\textit{vic}})-\varepsilon+1-\textit{dec}\times p(i_{\textit{vic}})\leqslant 0$ (9)

Applying (7) to (9), we have:

$\displaystyle du-\textit{dec}\times p({i_{\textit{vic}}})\leqslant 0$ (10)

$\displaystyle\Rightarrow\textit{dec}\geqslant\frac{{du}}{{p({i_{\textit{vic}}})}}$ (11)

Because $q(i_{\textit{vic}},T_{\textit{vic}})\in\mathbb{N}$ and $\textit{dec}>0$ , the smallest admissible value is

$\displaystyle\textit{dec}=\left\lceil{\frac{{du}}{p(i_{\textit{vic}})}}\right\rceil$ (12)
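As a worked instance of Theorem 1, consider hiding $\{B,G\}$ from Table 3 with $\varepsilon=30$. Only $T_{9}$ contains both items, so it is the victim transaction, and the sketch below takes $G$ as the victim item. The choice of itemset and victim item and the helper `ceil_div` are assumptions of this illustration.

```python
def ceil_div(a, b):
    """Integer ceiling of a / b for positive b, without floating point."""
    return -(-a // b)


eps = 30
price = {"B": 5, "G": 2}  # external utilities from Table 2
t9 = {"B": 4, "G": 10}    # quantities of B and G in T9 (Table 1)

# T9 is the only transaction containing {B, G}, so u({B,G}) = u({B,G}, T9).
u_bg = sum(t9[i] * price[i] for i in "BG")  # 4*5 + 10*2 = 40
du = u_bg - eps + 1                         # 11

# Theorem 1 applies because u(G, T9) = 20 > du.
dec = ceil_div(du, price["G"])              # ceil(11 / 2) = 6
u_after = u_bg - dec * price["G"]           # 28, below eps: hidden
u_one_less = u_bg - (dec - 1) * price["G"]  # 30, not below eps: still an HUI

print(du, dec, u_after, u_one_less)  # 11 6 28 30
```

Reducing the quantity by one unit less leaves the utility at exactly $\varepsilon$, which illustrates that the ceiling gives the minimal reduction.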

Theorem 2. If $i_{\textit{vic}}\in S_{i}$ is deleted from $T_{\textit{vic}}$ , then $u(S_{i})$ is reduced by $u({S_{i}},{T_{\textit{vic}}})$ , and therefore $du$ is also reduced by $u(S_{i},T_{\textit{vic}})$ ; that is, $u(S_{i})$ is updated to $u(S_{i})-u(S_{i},T_{\textit{vic}})$ and $du$ is updated to $du-u(S_{i},T_{\textit{vic}})$ .

Proof. Let $\textit{setT}_{in}$ be the set of transactions containing $S_{i}$ before the database is modified. It is computed as follows:

$\displaystyle\textit{setT}_{in}=\left\{{{T_{c}}|{S_{i}}\subseteq{T_{c}}\wedge{T_{c}}\in D}\right\}$ (13)

$\displaystyle\Rightarrow u(S_{i})_{in}=\sum\nolimits_{T_{c}\in\textit{setT}_{in}}u(S_{i},T_{c})$ (14)

$\displaystyle\Rightarrow du_{in}=\left({\sum\nolimits_{{T_{c}}\in\textit{setT}_{in}}{u({S_{i}},{T_{c}})}}\right)-\varepsilon+1$ (15)

Let $\textit{setT}_{\textit{out}}$ be the set of transactions containing $S_{i}$ after removing $i_{\textit{vic}}\in S_{i}$ from $T_{\textit{vic}}$ . We have:

$\displaystyle\textit{setT}_{\textit{out}}=\textit{setT}_{in}\backslash{T_{\textit{vic}}}$ (16)

$\displaystyle\Rightarrow u(S_{i})_{\textit{out}}=\sum\nolimits_{T_{c}\in\textit{setT}_{\textit{out}}}{u(S_{i},T_{c})}$ (17)

By (16) and (17):

$\displaystyle u(S_{i})_{\textit{out}}=\sum\nolimits_{{T_{c}}\in\textit{setT}_{in}\backslash{T_{\textit{vic}}}}{u({S_{i}},{T_{c}})}$ (18)

$\displaystyle\Leftrightarrow u(S_{i})_{\textit{out}}={\sum\nolimits_{{T_{c}}\in\textit{setT}_{in}}}{u({S_{i}},{T_{c}})}-u({S_{i}},{T_{\textit{vic}}})$ (19)

$\displaystyle\Rightarrow du_{\textit{out}}=u{({S_{i}})_{\textit{out}}}-\varepsilon+1$ (20)

By (19):

$\displaystyle du_{\textit{out}}={\sum\nolimits_{{T_{c}}\in\textit{setT}_{in}}}{u({S_{i}},{T_{c}})}-u({S_{i}},{T_{\textit{vic}}})-\varepsilon+1$ (21)

$\displaystyle\Leftrightarrow du_{\textit{out}}=d{u_{in}}-u({S_{i}},{T_{\textit{vic}}})$ (22)
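Theorem 2 can be checked on the example data with the itemset $\{E,F\}$ and $\varepsilon=30$. In the sketch below, `DS` holds the five transactions of Table 1 that contain both items; taking $T_{2}$ as victim transaction and $F$ as victim item is an assumption made for this illustration.

```python
PRICE = {"A": 3, "B": 5, "C": 1, "D": 1, "E": 2, "F": 1, "G": 2, "H": 1}
# Transactions of Table 1 that contain both E and F.
DS = {
    "T1": {"A": 4, "C": 1, "E": 4, "F": 2, "G": 1},
    "T2": {"D": 2, "E": 4, "F": 3},
    "T3": {"A": 1, "B": 3, "D": 2, "E": 5, "F": 1},
    "T4": {"D": 1, "E": 2, "F": 6},
    "T7": {"D": 1, "E": 1, "F": 4, "G": 1, "H": 1},
}
S = ("E", "F")
eps = 30


def u_in(t):
    """u(S, T): utility of S in transaction t, or 0 if t does not contain S."""
    if not all(i in t for i in S):
        return 0
    return sum(t[i] * PRICE[i] for i in S)


u_before = sum(u_in(t) for t in DS.values())  # 10 + 11 + 11 + 10 + 6 = 48
du_in = u_before - eps + 1                    # 19

t_vic, i_vic = "T2", "F"
# u(F, T2) = 3 <= du_in, so the item has to be removed (second case).
removed = u_in(DS[t_vic])                     # u(S, T_vic) = 11
del DS[t_vic][i_vic]

u_after = sum(u_in(t) for t in DS.values())   # 48 - 11 = 37
du_out = u_after - eps + 1                    # 8

print(du_in, removed, du_out)  # 19 11 8
```

The printed values satisfy $du_{\textit{out}}=du_{in}-u(S_{i},T_{\textit{vic}})$, and since $du_{\textit{out}}>0$ a further modification is still needed before $\{E,F\}$ is hidden.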

Choosing the right SHUI to hide first helps to reduce the side effects of an SHUI hiding algorithm. In this paper, we give priority to the SHUI that has maximal frequency among the itemsets in SHUIs.

The frequency of a sensitive HUI $S_{i}$ is the number of itemsets in SHUIs that contain $S_{i}$ , and is defined in Definition 16.

Definition 16. The frequency of $S_{i}$ in SHUIs, denoted as ${f_{\textit{SHUIs}}}({S_{i}})$ , is the number of proper supersets of $S_{i}$ in SHUIs, and is calculated as:

$\displaystyle{f_{\textit{SHUIs}}}({S_{i}})=\left|\{{S_{k}}\in\textit{SHUIs}\mid{S_{i}}\subset{S_{k}},\ 1\leqslant k\leqslant\left|{\textit{SHUIs}}\right|,\ k\neq i\}\right|$
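The frequency above is a count of proper supersets, which the following sketch computes and then uses to order the SHUIs. The set of sensitive itemsets used here is hypothetical: it is drawn from Table 3 purely for illustration, and the function name `shui_frequency` is an assumption.

```python
def shui_frequency(s, shuis):
    """f_SHUIs(S_i): number of itemsets in shuis that are proper supersets of s."""
    s = frozenset(s)
    return sum(1 for other in shuis if s < frozenset(other))


# Hypothetical choice of sensitive itemsets, all taken from Table 3.
shuis = ["ABDE", "BG", "ABE", "AB"]

freq = {s: shui_frequency(s, shuis) for s in shuis}
print(freq)  # {'ABDE': 0, 'BG': 0, 'ABE': 1, 'AB': 2}

# Hiding order: highest frequency first.
order = sorted(shuis, key=lambda s: shui_frequency(s, shuis), reverse=True)
print(order[:2])  # ['AB', 'ABE']
```

Hiding $\{A,B\}$ first lowers the utility of its supersets $\{A,B,E\}$ and $\{A,B,D,E\}$ as well, which is the motivation for processing high-frequency itemsets first.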

3.2 Algorithm proposal: EHSHUI

The EHSHUI algorithm includes three main stages. First, the algorithm chooses as victim transaction the transaction containing the SHUI $S_{i}$ in which $S_{i}$ achieves maximal utility. Second, a victim item is selected among the sensitive items such that modifying it causes the fewest lost HUIs. Finally, the internal utility of the victim item is reduced by just enough to hide the SHUI. The details of specifying the victim transaction and the victim item are presented in Algorithms 1 and 2, respectively. The complete hiding process is presented in the EHSHUI algorithm (Algorithm 3).

FindVictimTransaction
Input: DSi: projected database; $S_{i}$ : the sensitive itemset to be hidden
Output: $T_{\textit{vic}}$ : victim transaction
$\textit{maxUtility}=0$
for each $T_{i}\in\textit{DSi}$ do
    if $u(S_{i},T_{i})>\textit{maxUtility}$ then
        $\textit{maxUtility}=u(S_{i},T_{i})$
        $T_{\textit{vic}}=T_{i}$
return $T_{\textit{vic}}$

Algorithm 1 scans every transaction in DSi to find the transaction $T_{\textit{vic}}$ such that $T_{\textit{vic}}=\textit{argmax}_{T_{i}\in\textit{DSi}}\{u(S_{i},T_{i})\}$ . Modifying $i_{\textit{vic}}$ in $T_{\textit{vic}}$ reduces $u(S_{i})$ quickly. It therefore minimizes the number of database scans and, consequently, the CPU time.
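A direct Python transcription of Algorithm 1 is sketched below, together with a simple projection step. The function names `project` and `find_victim_transaction` and the dictionary-based database layout are assumptions of this sketch. As in the pseudocode, the comparison is strict, so the first transaction that reaches the maximum is returned when several tie.

```python
PRICE = {"A": 3, "B": 5, "C": 1, "D": 1, "E": 2, "F": 1, "G": 2, "H": 1}
DB = {
    "T1": {"A": 4, "C": 1, "E": 4, "F": 2, "G": 1},
    "T2": {"D": 2, "E": 4, "F": 3},
    "T3": {"A": 1, "B": 3, "D": 2, "E": 5, "F": 1},
    "T4": {"D": 1, "E": 2, "F": 6},
    "T5": {"A": 3, "B": 2, "C": 1, "E": 1},
    "T6": {"B": 1, "F": 1, "H": 2},
    "T7": {"D": 1, "E": 1, "F": 4, "G": 1, "H": 1},
    "T8": {"B": 1, "D": 1, "F": 1},
    "T9": {"B": 4, "D": 4, "G": 10},
}


def project(db, s_i):
    """DSi: the transactions of db that contain every item of s_i."""
    return {tid: t for tid, t in db.items() if set(s_i).issubset(t)}


def find_victim_transaction(projected, s_i, price):
    """Return the Tid in which s_i has maximal utility (first one on ties)."""
    max_utility, t_vic = 0, None
    for tid, t in projected.items():
        u = sum(t[i] * price[i] for i in s_i)  # u(S_i, T_i)
        if u > max_utility:
            max_utility, t_vic = u, tid
    return t_vic


print(find_victim_transaction(project(DB, "EF"), "EF", PRICE))  # T2
print(find_victim_transaction(project(DB, "BG"), "BG", PRICE))  # T9
```

For $\{E,F\}$ both $T_{2}$ and $T_{3}$ have utility 11; $T_{2}$ is returned because it is scanned first.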

FindVictimItem
Input: $S_{i}$ : sensitive itemset; HUIs: high utility itemsets; $T_{\textit{vic}}$ : victim transaction; $\varepsilon$ : minimal utility threshold
Output: $i_{\textit{vic}}$ : victim item
1: if $|S_{i}|==1$ then
2:     $i_{\textit{vic}}=S_{i}$
3: else
4:     $\textit{CanItems}=\{x_{l}|u(x_{l},T_{\textit{vic}})>du,x_{l}\in S_{i}\}$
5:     if $\textit{CanItems}\neq\emptyset$ then
6:         if $|\textit{CanItems}|==1$ then
7:             $i_{\textit{vic}}=\textit{CanItems}$
8:         else
9:             Set $\textit{MinLoss}=\textit{Integer.MaxValue}$
10:             for each item $x_{l}\in\textit{CanItems}$ do
11:                 Set $\textit{CountLoss}=0$
12:                 for each $X\in\textit{HUIs}|x_{l}\subset X\wedge X\subseteq T_{\textit{vic}}$ do
13:                     Compute $u(X)$ in case of modifying $x_{l}$ at $T_{\textit{vic}}$
14:                     if $u(X)<\varepsilon$ then
15:                         $++\textit{CountLoss}$
16:                 if $\textit{CountLoss}<\textit{MinLoss}$ then
17:                     $i_{\textit{vic}}=x_{l}$
18:                     $\textit{MinLoss}=\textit{CountLoss}$
19:     else
20:         $i_{\textit{vic}}=\textit{argmin}_{x_{l}}\{u(x_{l},T_{\textit{vic}}),x_{l}\in S_{i}\}$
21: return $i_{\textit{vic}}$

Algorithm 2: If $|S_{i}|>1$ then $i_{\textit{vic}}$ must be one of the items of $S_{i}$ . The victim item $i_{\textit{vic}}$ is selected among the sensitive items such that decreasing $q(i_{\textit{vic}},T_{\textit{vic}})$ leads to $u(S_{i})<\varepsilon$ . First, Algorithm 2 selects the candidate items (CanItems) of $S_{i}$ , which are those having $u(x_{l},T_{\textit{vic}})>du$ . This aims to avoid deleting $i_{\textit{vic}}$ from $T_{\textit{vic}}$ , because removing $i_{\textit{vic}}$ also removes from $T_{\textit{vic}}$ every non-sensitive itemset that contains $i_{\textit{vic}}$ and therefore leads to high side effects. The process is:

1. If $\textit{CanItems}\neq\emptyset$ then the algorithm scans the candidate items and counts, for each one, the number of HUIs that would be mistakenly hidden if it were modified. The item with the minimum number of mistakenly hidden HUIs is chosen as $i_{\textit{vic}}$ (lines 5–18).
2. Otherwise, when $\textit{CanItems}=\emptyset$ , the victim item is the item of $S_{i}$ having minimal utility in the victim transaction (line 20).
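The victim item selection can be sketched in Python as follows. This is an illustration under stated assumptions rather than the authors' implementation: HUIs are passed as a dictionary from itemset to utility (the values of Table 3), the threshold and $du$ are explicit parameters, and the utility of $X$ after a trial modification is obtained by subtracting the utility lost in $T_{\textit{vic}}$. Following the pseudocode literally, $S_{i}$ itself is counted among the lost itemsets for every candidate, which does not change the ranking.

```python
PRICE = {"A": 3, "B": 5, "C": 1, "D": 1, "E": 2, "F": 1, "G": 2, "H": 1}
# HUIs of Table 3 (eps = 30), keyed by frozenset of items.
HUIS = {frozenset(k): v for k, v in {
    "B": 55, "E": 34, "AB": 37, "AE": 44, "BD": 47, "BE": 37, "BG": 40,
    "DE": 30, "EF": 48, "ABE": 49, "ACE": 33, "AEF": 36, "BDG": 44,
    "DEF": 44, "ABDE": 30, "ABDEF": 31}.items()}


def find_victim_item(s_i, huis, t, price, eps, du):
    """Pick the item of s_i to modify in victim transaction t (item -> quantity)."""
    items = sorted(s_i)
    if len(items) == 1:
        return items[0]
    # Candidates: items that can absorb the whole reduction du by themselves.
    can_items = [x for x in items if t[x] * price[x] > du]
    if not can_items:
        # No single item suffices: take the item with minimal utility in t.
        return min(items, key=lambda x: t[x] * price[x])

    def count_loss(x):
        dec = -(-du // price[x])  # ceil(du / p(x)), see Theorem 1
        lost = 0
        for X, u_x in huis.items():
            if x in X and X.issubset(t):
                if t[x] > dec:  # quantity of x is lowered by dec
                    drop = dec * price[x]
                else:           # x disappears from t, so X loses u(X, t)
                    drop = sum(t[i] * price[i] for i in X)
                if u_x - drop < eps:
                    lost += 1
        return lost

    return min(can_items, key=count_loss)  # first minimum, as in lines 16-18


T9 = {"B": 4, "D": 4, "G": 10}
T2 = {"D": 2, "E": 4, "F": 3}
print(find_victim_item("BG", HUIS, T9, PRICE, eps=30, du=11))  # G
print(find_victim_item("EF", HUIS, T2, PRICE, eps=30, du=19))  # F
```

For $\{B,G\}$ in $T_{9}$, lowering $B$ would push both $\{B,G\}$ and $\{B,D,G\}$ below the threshold, whereas lowering $G$ only affects $\{B,G\}$, so $G$ is selected. For $\{E,F\}$ in $T_{2}$ no item has utility above $du=19$, so the minimal-utility item $F$ is returned.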

EHSHUI (Hiding sensitive-high utility itemsets)
Input: D: original database; HUIs: high utility itemsets; SHUIs: sensitive high utility itemsets; $\varepsilon$ : minimal utility threshold
Output: D’: sanitized database
1: Compute $f_{\textit{SHUIs}}(S_{i}),1\leqslant i\leqslant|\textit{SHUIs}|\wedge S_{i}\in\textit{SHUIs}$
2: Sort SHUIs in decreasing order of $f_{\textit{SHUIs}}(S_{i})$
3: for each $S_{i}\in\textit{SHUIs}$ do
4:     $\textit{DSi}=\textit{projectData}(D,S_{i})$
5:     $du=u(S_{i})-\varepsilon+1$
6:     while $du>0$ do
7:         $T_{\textit{vic}}=\textit{findVictimTransaction}(\textit{DSi},S_{i})$
8:         $i_{\textit{vic}}=\textit{findVictimItem}(S_{i},T_{\textit{vic}})$
9:         if $u(i_{\textit{vic}},T_{\textit{vic}})>du$ then
10:             $\textit{dec}=\left\lceil{\frac{{du}}{{p({i_{\textit{vic}}})}}}\right\rceil$
11:             $q(i_{\textit{vic}},T_{\textit{vic}})=q(i_{\textit{vic}},T_{\textit{vic}})-\textit{dec}$
12:             $du=0$
13:         else
14:             $du=du-u(S_{i},T_{\textit{vic}})$
15:             remove $i_{\textit{vic}}$ from $T_{\textit{vic}}$
16:         Update (D)

Algorithm 3: First, the algorithm computes the frequency $f_{\textit{SHUIs}}(S_{i})$ of every SHUI in line 1. Then, it sorts the SHUIs in descending order of $f_{\textit{SHUIs}}(S_{i})$ in line 2. This gives first priority to the itemset with the highest frequency, in order to reduce the amount of data modification.

Next, the algorithm processes each $S_{i}\in\textit{SHUIs}$ in turn and executes the hiding process (line 3). For each $S_{i}$ , the algorithm executes the following steps:

1. Generate a projection of the database (DSi) consisting of the transactions that contain $S_{i}$ (line 4), in order to reduce the time spent accessing the database when finding the victim item.
2. Compute the utility difference ( $du$ ) between $S_{i}$ and the minimal utility threshold (line 5). This is the minimal utility that must be removed in order to hide $S_{i}$ .
3. While $du>0$ , repeat the process of selecting the victim transaction, selecting the victim item, and modifying the data. The strategy for specifying $T_{\textit{vic}}$ and $i_{\textit{vic}}$ determines the side effects. The specification of $T_{\textit{vic}}$ (line 7) is presented in Algorithm 1, and the selection of $i_{\textit{vic}}$ (line 8) is presented in Algorithm 2. After specifying $T_{\textit{vic}}$ and $i_{\textit{vic}}$ , the algorithm computes the internal utility of $i_{\textit{vic}}$ that must be removed to hide $S_{i}$ . There are two cases:

(a) Reduce the internal utility of $i_{\textit{vic}}$ in $T_{\textit{vic}}$ : If $u(i_{\textit{vic}},T_{\textit{vic}})>du$ then the quantity of $i_{\textit{vic}}$ in $T_{\textit{vic}}$ is reduced by $\textit{dec}=\left\lceil{\frac{{du}}{{p({i_{\textit{vic}}})}}}\right\rceil$ . The itemset $S_{i}$ is then hidden (line 10, by Theorem 1) and $q(i_{\textit{vic}},T_{\textit{vic}})$ is updated to $q(i_{\textit{vic}},T_{\textit{vic}})-\textit{dec}$ (line 11). If $q(i_{\textit{vic}},T_{\textit{vic}})$ is equal to dec then $i_{\textit{vic}}$ is removed from $T_{\textit{vic}}$ . Finally, since $S_{i}$ is hidden, $du$ is set to the flag value 0 to finish the hiding process.
(b) Remove $i_{\textit{vic}}$ from $T_{\textit{vic}}$ : If $u(i_{\textit{vic}},T_{\textit{vic}})\leqslant du$ , then $du$ is updated to $du-u(S_{i},T_{\textit{vic}})$ and $i_{\textit{vic}}$ is removed from $T_{\textit{vic}}$ (lines 14–15, by Theorem 2).

The EHSHUI algorithm is finished when every SHUI is hidden.
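The whole procedure can be put together in a compact Python sketch. It is a simplified illustration and not the authors' implementation: HUIs are mined by brute force, the projection step is replaced by a scan that skips transactions not containing $S_{i}$, the loss count recomputes utilities on a trial copy of the database, and the two sensitive itemsets $\{B,G\}$ and $\{E,F\}$ are a hypothetical choice.

```python
from itertools import combinations

PRICE = {"A": 3, "B": 5, "C": 1, "D": 1, "E": 2, "F": 1, "G": 2, "H": 1}
DB = {
    "T1": {"A": 4, "C": 1, "E": 4, "F": 2, "G": 1},
    "T2": {"D": 2, "E": 4, "F": 3},
    "T3": {"A": 1, "B": 3, "D": 2, "E": 5, "F": 1},
    "T4": {"D": 1, "E": 2, "F": 6},
    "T5": {"A": 3, "B": 2, "C": 1, "E": 1},
    "T6": {"B": 1, "F": 1, "H": 2},
    "T7": {"D": 1, "E": 1, "F": 4, "G": 1, "H": 1},
    "T8": {"B": 1, "D": 1, "F": 1},
    "T9": {"B": 4, "D": 4, "G": 10},
}

def u_in(s, t):
    """u(S, T): utility of s in transaction t, or 0 if t does not contain s."""
    return sum(t[i] * PRICE[i] for i in s) if set(s).issubset(t) else 0

def u(s, db):
    """u(S): utility of s in the whole database."""
    return sum(u_in(s, t) for t in db.values())

def mine(db, eps):
    """Brute-force stand-in for a real HUI miner."""
    items = sorted(PRICE)
    cands = (frozenset(c) for k in range(1, len(items) + 1)
             for c in combinations(items, k))
    return {c for c in cands if u(c, db) >= eps}

def find_victim_transaction(db, s):
    best, t_vic = 0, None
    for tid, t in db.items():  # Algorithm 1 (first maximum wins)
        if u_in(s, t) > best:
            best, t_vic = u_in(s, t), tid
    return t_vic

def find_victim_item(db, s, huis, tid, du, eps):
    t = db[tid]
    if len(s) == 1:
        return next(iter(s))
    cand = sorted(x for x in s if t[x] * PRICE[x] > du)
    if not cand:
        return min(sorted(s), key=lambda x: t[x] * PRICE[x])

    def loss(x):  # HUIs that fall below eps if x is lowered in t
        trial = dict(db)
        trial[tid] = dict(t)
        trial[tid][x] -= -(-du // PRICE[x])
        if trial[tid][x] == 0:
            del trial[tid][x]
        return sum(1 for X in huis
                   if x in X and X.issubset(t) and u(X, trial) < eps)

    return min(cand, key=loss)

def ehshui(db, shuis, eps):
    db = {tid: dict(t) for tid, t in db.items()}  # sanitize a copy: D'
    huis = mine(db, eps)
    freq = lambda s: sum(1 for o in shuis if s < o)
    for s in sorted(shuis, key=freq, reverse=True):
        du = u(s, db) - eps + 1
        while du > 0:
            tid = find_victim_transaction(db, s)
            x = find_victim_item(db, s, huis, tid, du, eps)
            t = db[tid]
            if t[x] * PRICE[x] > du:
                t[x] -= -(-du // PRICE[x])  # Theorem 1: subtract ceil(du / p)
                du = 0
            else:
                du -= u_in(s, t)            # Theorem 2, computed before removal
                del t[x]
            if t.get(x) == 0:               # quantity reached zero: drop item
                del t[x]
    return db

EPS = 30
SHUIS = [frozenset("BG"), frozenset("EF")]  # hypothetical sensitive itemsets
D2 = ehshui(DB, SHUIS, EPS)
print(u(frozenset("BG"), D2), u(frozenset("EF"), D2))  # 28 29
```

On this input the sketch lowers $q(G,T_{9})$ from 10 to 4, removes $F$ from $T_{2}$, and lowers $q(E,T_{3})$ from 5 to 1, after which both sensitive itemsets fall below $\varepsilon=30$.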
4. Experiment results

4.1 Experiment data and system description

Table 4
The data sets description

Databases	$\left\|{D}\right\|$	$\left\|{I}\right\|$	Average length	Maximum length
T1000_200_40	1,000	200	20.3	40
Foodmart	4,141	1,559	4	11
Retail	88,162	16,470	10.3	76
Mushroom	8,124	120	23	23

The EHSHUI algorithm was evaluated on four databases:

One randomly generated database: T1000_200_40 (generated by a Java program that we built).

Three real databases: retail, foodmart, and mushroom, which are publicly available in [23]. These databases are widely used in experiments on pattern mining and on privacy preservation in pattern mining. The details of the databases are presented in Table 4.

System: Core i5 2.4 GHz CPU, 8 GB RAM, Windows 10.

Sets of sensitive-high utility itemsets: for each database, the SHUIs are randomly selected from the set of HUIs mined by the FHM algorithm [6].
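The random selection of SHUIs can be sketched as follows. This is only an illustration under assumptions: the function name, the fixed seed, and the rule of rounding the number of selected itemsets up are hypothetical choices, and the HUIs are assumed to have been mined beforehand (for example by FHM) and passed in as frozensets of item names.

```python
import math
import random


def select_sensitive(huis, percent, seed=0):
    """Randomly pick `percent` % of the mined HUIs as sensitive itemsets."""
    if not 0 < percent <= 100:
        raise ValueError("percent must be in (0, 100]")
    if not huis:
        return []
    # Sort first so that the sample depends only on the seed, not on set ordering.
    ordered = sorted(huis, key=sorted)
    k = max(1, math.ceil(len(ordered) * percent / 100))
    return random.Random(seed).sample(ordered, k)


# Hypothetical input: 200 two-item HUIs, of which 5% are marked as sensitive.
huis = [frozenset({f"i{n}", f"i{n + 1}"}) for n in range(200)]
shuis = select_sensitive(huis, percent=5)
print(len(shuis))  # 10
```

Fixing the seed makes the selected set reproducible, so that all compared algorithms can be run against the same SHUIs.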

Figure 1. Miss cost with various sensitive itemsets.

Figure 2. Runtime under various sensitive itemsets.

Figure 3. DSS under various sensitive itemsets.

Figure 4. DUS under various sensitive itemsets.

Figure 5. IUS under various sensitive itemsets.

4.2 Result comparison

The EHSHUI algorithm is compared with four previous algorithms, HHUIF, MSICF, MSU-MAU, and MSU-MIU, proposed in [27, 11]. The results are as follows:

1. Missing cost: Figure 1 depicts the missing cost (MC) caused by the hiding process. The experimental results indicate that EHSHUI achieves a lower MC than HHUIF, MSICF, MSU-MAU, and MSU-MIU. This is because EHSHUI calculates exactly the minimal internal utility value by which the victim item must be modified in the victim transaction. The EHSHUI algorithm therefore overcomes the drawbacks of [27, 11] indicated in the related works.
2. Run time: The results in Fig. 2 show that the run times of EHSHUI, MSU-MAU, and MSU-MIU are much lower than those of HHUIF and MSICF, because EHSHUI, MSU-MAU, and MSU-MIU use a projected database to reduce database-scanning time when specifying the victim transaction and victim item. The run time of EHSHUI is equal to or slightly higher than that of MSU-MAU and MSU-MIU, because EHSHUI needs more time to evaluate the minimal MC when comparing the side effects of the candidate sensitive item modifications (this depends on the number of HUIs).
3. Database structure similarity: In Fig. 3, the database structure similarity (DSS) of EHSHUI and of HHUIF, MSICF, MSU-MAU, and MSU-MIU is plotted against the percentage of SHUIs for each database. Figure 3 shows that EHSHUI causes less data distortion than the others: as the percentage of SHUIs increases from 1% to 5%, EHSHUI achieves a higher DSS than the other algorithms.
4. Database utility similarity: The results in Fig. 4 show that the database utility similarity (DUS) between the sanitized database and the original database is higher for EHSHUI than for HHUIF, MSICF, MSU-MAU, and MSU-MIU. This indicates that EHSHUI preserves database utility better than the others on every experimental database.
5. Itemset utility similarity: The itemset utility similarity (IUS) between the original database and the sanitized database indicates the truthfulness of the sanitized database, so an algorithm that yields a higher IUS is better. The experimental results in Fig. 5 show that the IUS achieved by EHSHUI is higher than that of the other algorithms for every database and percentage of SHUIs. EHSHUI achieves this result because it specifies the victim transaction and victim item exactly.
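To make the compared measures concrete, three of them can be sketched in Python as follows. The formulas used here are assumptions based on the definitions commonly used in the privacy-preserving utility mining literature and should be checked against the definitions given earlier in the paper: MC is taken as the fraction of non-sensitive HUIs that are no longer HUIs after sanitization, DUS as the ratio of total database utility after and before sanitization, and IUS as the ratio of the summed utilities of the HUIs after and before sanitization. DSS is omitted, and all function names are hypothetical.

```python
def miss_cost(huis_before, huis_after, sensitive):
    """Fraction of non-sensitive HUIs of the original database lost after sanitization."""
    non_sensitive = set(huis_before) - set(sensitive)
    if not non_sensitive:
        return 0.0
    lost = non_sensitive - set(huis_after)
    return len(lost) / len(non_sensitive)


def database_utility(db, profit):
    """Total utility of a database: sum of quantity * unit profit over all items."""
    return sum(q * profit[i] for t in db.values() for i, q in t.items())


def dus(db_before, db_after, profit):
    """Database utility similarity: utility of sanitized db / utility of original db."""
    return database_utility(db_after, profit) / database_utility(db_before, profit)


def ius(huis_before, huis_after):
    """Itemset utility similarity: summed HUI utilities after / before sanitization."""
    return sum(huis_after.values()) / sum(huis_before.values())


# Hypothetical toy data: HUIs are given as {itemset: utility} dictionaries.
profit = {"a": 5, "b": 2, "c": 1}
original = {"T1": {"a": 2, "b": 1, "c": 3}, "T2": {"a": 1, "b": 4}, "T3": {"b": 2, "c": 1}}
sanitized = {"T1": {"a": 2, "b": 1, "c": 3}, "T2": {"a": 1, "b": 1}, "T3": {"b": 2, "c": 1}}
huis_before = {frozenset("ab"): 25, frozenset("a"): 15, frozenset("b"): 14}
huis_after = {frozenset("a"): 15}

print(miss_cost(huis_before, huis_after, [frozenset("ab")]))  # 0.5
print(dus(original, sanitized, profit))
print(ius(huis_before, huis_after))
```

Under these assumed definitions, a lower MC and higher DUS and IUS values all indicate that the sanitized database stays closer to the original one.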

5. Conclusion

Privacy-preserving data mining has become increasingly important in today's research and marketing based on information analysis. The aim of privacy-preserving HUI mining is to conceal selected SHUIs so that they cannot be revealed by any HUI mining algorithm. The main idea of a SHUI hiding algorithm is to hide every SHUI with the lowest possible impact on the number of HUIs while keeping the quality of the data intact. This paper proposes a heuristic algorithm for hiding SHUIs, named EHSHUI, based on three heuristic steps. In the first step, the transaction on which the SHUI $S_{i}$ gains maximal utility among the transactions containing $S_{i}$ is specified as the victim transaction. In the second step, the impact on non-SHUIs is measured for each sensitive item as if its internal utility were decreased in the victim transaction, and the item that causes the least impact on non-SHUIs is selected as the victim item. The final step reduces the internal utility of the victim item in the victim transaction. If the SHUI is not hidden by reducing this utility, the algorithm deletes the victim item from the victim transaction. The experiments were executed on four popular datasets with five levels of SHUIs. Four significant algorithms, HHUIF, MSICF, MSU-MAU, and MSU-MIU, were selected for comparison with EHSHUI. The results indicate that the heuristic strategy allows EHSHUI to outperform the previous works on four side effects: MC, DSS, DUS, and IUS.

The findings of this study can improve the quality of collaboration between companies. First, hiding all SHUIs guarantees that data can be shared safely without disclosing sensitive knowledge. Moreover, the higher quality of the sanitized database motivates companies to share data and to discover valuable knowledge from the combined database.

One limitation should be pointed out, which could be addressed in future research. The running time of the proposed algorithm increases with the number of HUIs, because the second step must compute the number of impacted non-SHUIs for every sensitive item. Improving the efficiency of this step is therefore left for future work.

References

1. Agrawal, Imieliński and Swami, Mining association rules between sets of items in large databases, in: ACM SIGMOD Record, ACM, Vol. 22, 1993, pp. 207–216.

2. Agrawal, Srikant et al., Fast algorithms for mining association rules, in: Proc. 20th Int. Conf. Very Large Data Bases, VLDB, Vol. 1215, 1994, pp. 487–499.

3. Borzemski, Internet path behavior prediction via data mining: conceptual framework and case study, J. UCS 13(2) (2007), 287–316.

4. Cristofor and Simovici D.A., Galois connections and data mining, Journal of Universal Computer Science 6(1) (2000), 60–73.

5. Erwin, Gopalan R.P. and Achuthan, Efficient mining of high utility itemsets from large datasets, in: Pacific-Asia Conference on Knowledge Discovery and Data Mining, Springer, 2008, pp. 554–561.

6. Fournier-Viger, Wu C.-W., Zida and Tseng V.S., FHM: Faster high-utility itemset mining using estimated utility co-occurrence pruning, in: International Symposium on Methodologies for Intelligent Systems, Springer, 2014, pp. 83–92.

7. Georgiadis, Polatidis, Mouratidis and Pimenidis, A method for privacy-preserving collaborative filtering recommendations, Journal of Universal Computer Science 23(2) (2017), 146–166.

8. Hong T.-P., Lee C.-H. and Wang S.-L., Effective utility mining with the measure of average utility, Expert Systems with Applications 38(7) (2011), 8259–8265.

9. Le H.Q., Arch-Int, Nguyen H.X. and Arch-Int, Association rule hiding in risk management for retail supply chain collaboration, Computers in Industry 64(7) (2013), 776–784.

10. Lin C.-W., Hong T.-P., Wong J.-W., Lan G.-C. and Lin W.-Y., A GA-based approach to hide sensitive high utility itemsets, The Scientific World Journal, 2014.

11. Lin J.C.-W., Wu T.-Y., Fournier-Viger, Lin, Zhan and Voznak, Fast algorithms for hiding sensitive high-utility itemsets in privacy-preserving utility mining, Engineering Applications of Artificial Intelligence 55 (2016), 269–284.

12. Liu and Qu, Mining high utility itemsets without candidate generation, in: Proceedings of the 21st ACM International Conference on Information and Knowledge Management, ACM, 2012, pp. 55–64.

13. Liu, Liao W.-k. and Choudhary, A fast high utility itemsets mining algorithm, in: Proceedings of the 1st International Workshop on Utility-based Data Mining, ACM, 2005, pp. 90–99.

14. Liu, Liao W.-k. and Choudhary, A two-phase algorithm for fast discovery of high utility itemsets, in: Pacific-Asia Conference on Knowledge Discovery and Data Mining, Springer, 2005, pp. 689–695.

15. Quoc Le, Arch-Int and Arch-Int, Association rule hiding based on intersection lattice, Mathematical Problems in Engineering, 2013.

16. Sahoo, Das A.K. and Goswami, An efficient approach for mining association rules from high utility itemsets, Expert Systems with Applications 42(13) (2015), 5754–5778.

17. Selvaraj and Kuthadi V.M., A modified hiding high utility item first algorithm (HHUIF) with item selector (MHIS) for hiding sensitive itemsets, 2013.

18. Suzuki, Data mining methods for discovering interesting exceptions from an unsupervised table, J. UCS 12(6) (2006), 627–653.

19. Trieu V.H., Ngoc C.T., Le Quoc and Si N.N., Algorithm for hiding high utility sensitive association rule based on intersection lattice, in: 2018 1st International Conference on Multimedia Analysis and Pattern Recognition (MAPR), IEEE, 2018, pp. 1–6.

20. Trieu V.H., Ngoc C.T., Le Quoc and Thanh L.N., HHUSI: An efficient algorithm for hiding sensitive high utility itemsets, in: International Conference on Industrial Networks and Intelligent Systems, Springer, 2018, pp. 145–154.

21. Tseng V.S., Shie B.-E., Wu C.-W. and Yu P.S., Efficient algorithms for mining high utility itemsets from transactional databases, IEEE Transactions on Knowledge and Data Engineering 25(8) (2013), 1772–1786.

22. Tseng V.S., Wu C.-W., Shie B.-E. and Yu P.S., UP-Growth: an efficient algorithm for high utility itemset mining, in: Proceedings of the 16th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, ACM, 2010, pp. 253–262.

23. Viger P.F., An open-source data mining library, 2008 (accessed 4/2019).

24. Tran, Hong T.-P. and Le Minh, Using soft set theory for mining maximal association rules in text data, J. UCS 22(6) (2016), 802–821.

25. Yao and Hamilton H.J., Mining itemset utilities from transaction databases, Data & Knowledge Engineering 59(3) (2006), 603–626.

26. Yao, Hamilton H.J. and Butz C.J., A foundational approach to mining itemset utilities from databases, in: Proceedings of the 2004 SIAM International Conference on Data Mining, SIAM, 2004, pp. 482–486.

27. Yeh J.-S. and Hsu P.-C., HHUIF and MSICF: novel algorithms for privacy preserving utility mining, Expert Systems with Applications 37(7) (2010), 4779–4786.

28. Zida, Fournier-Viger, Lin J.C.-W., Wu C.-W. and Tseng V.S., EFIM: a highly efficient algorithm for high-utility itemset mining, in: Mexican International Conference on Artificial Intelligence, Springer, 2015, pp. 530–546.

29. Zida, Fournier-Viger, Lin J.C.-W., Wu C.-W. and Tseng V.S., EFIM: a fast and memory efficient algorithm for high-utility itemset mining, Knowledge and Information Systems 51(2) (2017), 595–625.
