Mining multi-relational high utility itemsets from star schemas

Abstract

Mining high utility itemsets is an interesting research problem in data mining and knowledge discovery. Most high utility itemset discovery algorithms seek patterns in a single table, but few are dedicated to processing data stored using a multi-dimensional model. In this paper, the problem of mining high utility itemsets in multi-relational databases is investigated, and two algorithms, RHUI-Mine and RHUI-Growth, are proposed for star schema-based data warehouses. In the RHUI-Mine algorithm, the search space is traversed in a level-wise manner, and an item index and transaction index are proposed to represent item and transaction information, respectively. The RHUI-Growth algorithm traverses the search space recursively using a pattern growth approach, and a dimensional tree and relational tree are used to compress the original data. Neither algorithm materializes the join operation between tables, thus making use of the star schema properties. Experiments show that both RHUI-Mine and RHUI-Growth are effective approaches for mining high utility itemsets in multi-relational data.

Keywords

Multi-relational high utility itemsets, star schema, item index, transaction index, dimensional tree, relational tree

1. Introduction

High utility itemset (HUI) mining [15, 18] is an important data mining problem that addresses the limitations of frequent itemset mining [7, 20] by introducing interestingness measures that reflect both statistical significance and user expectations. Various algorithms for efficiently mining HUIs in large databases have been presented [22, 28]. To the best of our knowledge, all existing HUI mining algorithms are restricted to cases where information is compiled in a single relation. However, information may be dispersed among different tables and, in some cases, the tables may reside in different physical locations. This imposes the need to explore more complex patterns and corresponding data mining techniques.

Multi-relational data mining (MRDM) [9] is concerned with the discovery of models and patterns from databases with multiple relational tables. MRDM is a popular data mining approach for multi-relational itemsets [4], multi-relational classification [3], and multi-relational clustering [6]. However, to the best of our knowledge, no algorithm for HUI discovery from relational data has yet been developed.

The straightforward approach to mining multiple tables is to join them all together before mining and apply an existing and efficient single-table algorithm. However, when multiple tables are joined, the resulting table has many more columns and rows, and this adversely affects the cost of mining algorithms.

The join-then-mine approach leads to high computational complexity. The number of columns in the joined table will be close to the sum of the number of columns in the individual tables. As the performance of HUI mining is very sensitive to the number of columns (items), mining the resulting table can take much longer than mining the individual tables. Moreover, in large applications, the join of all related tables cannot always be realistically computed because of the explosion in many-to-many relationships, large-dimension tables, and the distributed nature of the data.

In this paper, we propose two algorithms, RHUI-Mine and RHUI-Growth, for mining multi-relational HUIs in data stored following a star schema. First, we define the problem of multi-relational HUI mining. To improve the performance of utility mining, the proposed methods represent information using indices and trees. These representations can be used to generate multi-relational HUIs without joining the tables before mining. Both algorithms are composed of two stages: the first stage involves processing each dimension table, and the second stage combines all local information into global results across multiple tables. Experimental results show that our algorithms exhibit very interesting performance regarding efficiency and memory consumption.

The remainder of this paper is organized as follows: In Section 2, we define the problem of mining HUIs in multi-relational data. In Section 3, we analyze related work. In Sections 4 and 5, respectively, we describe the proposed RHUI-Mine and RHUI-Growth algorithms in detail. In Section 6, we present and analyze the experimental results. Finally, we present our conclusions in Section 7.

2. Problem definition

2.1 Star schema

A data warehouse (DW) is a database that is maintained separately from an organization’s operational database for the purpose of decision support. DWs provide integrated, enterprise-wide, historical data and focus on providing support for decision makers with respect to data modeling and analysis [23]. For data modeling, a DW is built using a multi-dimensional data model that consists of fact and dimension tables. Dimensions are various perspectives that are used to analyze data. Dimension tables are used to describe dimensions; they contain dimension transaction identifiers, values, and attributes. Fact tables contain identifiers for dimension tables in addition to measurable facts that data analysts may wish to examine. The simplest and most-used multi-dimensional model is the star schema, which consists of a fact table at the center of multiple dimension tables. As there have been several studies on mining frequent itemsets in this model [17, 25], we concentrate on the algorithms for discovering HUIs in a star schema.

Let ${\bm{I}}=\{i_{1},i_{2},\ldots,i_{m}\}$ be a finite set of distinct items. In the case of a relational database, an item corresponds to a proposition of the form (attribute, value). A set of items is denoted as an itemset, which corresponds to a transaction in the database. An itemset with $k$ items is called a $k$ -itemset.

Let ${\bm{S}}=(D_{1},D_{2},\ldots,D_{n},\textit{FT})$ be a DW modeled as a star schema, with $D_{i}$ corresponding to each dimension table and FT to the fact table. Each dimension table $D_{i}$ ($1\leqslant i\leqslant n$) is a set of transactions. For a transaction $T=(\textit{tid},X)$, tid is a transaction identifier (ID) and $X$ is an itemset. FT has the attributes (tid$_{1}$, tid$_{2}$, …, tid$_{n}$), where tid$_{i}$ ($1\leqslant i\leqslant n$) is the tid of dimension table $D_{i}$. In the more general case, FT also contains some other attributes called facts or measures.

Figure 1.

Example of a star schema data warehouse.

Figure 1 illustrates a star schema with three dimensions (Customer, Product, and Store) and a fact table linking all the dimensions. The three dimension tables, Customer, Product, and Store, record information on customers, products, and stores, respectively. To ensure uniqueness, we refer to an attribute by concatenating it with the table name. For example, we denote attribute City of table Customer by Customer.City, which is different from attribute City in table Store (denoted by Store.City). The fact table stores information on all purchase activities. In addition to the three identifiers referencing the three dimension tables, the purchase number is recorded by the “count” measure in the fact table, and the time of purchase is also stored.

Figure 2.

Conceptual representation of the example star schema.

The conceptual representation of Fig. 1 is shown in Fig. 2, where C, P, and S denote the dimension tables Customer, Product, and Store, respectively, and $x_{i}$, $y_{i}$, and $z_{i}$ denote the possible items of C, P, and S, respectively. Each item has the form attribute $=$ value. For example, $x_{1}$ corresponds to Customer.City $=$ Beijing.

If we organize the rows with the same customer and the same time into one transaction, the fact table can be conceptually represented by Table 1.

Table 1

Conceptual representation of the fact table in the example data warehouse

tid ${}_{F}$	Transactions	TU
$F_{1}$	( $c_{1}$ ,3) ( $p_{1}$ ,1) ( $p_{2}$ ,1) ( $p_{3}$ ,1) ( $s_{1}$ ,1) ( $s_{3}$ ,1) ( $s_{4}$ ,1)	149
$F_{2}$	( $c_{2}$ ,3) ( $p_{1}$ ,2) ( $p_{2}$ ,1) ( $s_{2}$ ,2) ( $s_{3}$ ,1)	181
$F_{3}$	( $c_{3}$ ,4) ( $p_{2}$ ,3) ( $p_{4}$ ,1) ( $s_{2}$ ,1) ( $s_{4}$ ,3)	182
$F_{4}$	( $c_{2}$ ,3) ( $p_{1}$ ,2) ( $p_{3}$ ,1) ( $s_{2}$ ,2) ( $s_{4}$ ,1)	179
$F_{5}$	( $c_{4}$ ,1) ( $p_{1}$ ,1) ( $s_{1}$ ,1)	65

2.2 Multi-relational high utility itemsets

We are interested in mining HUIs from the star schema structure. In particular, we examine the problem of determining all HUIs in table $R$ resulting from a natural join of all the given tables ($\textit{FT}\bowtie D_{1}\bowtie D_{2}\bowtie\ldots\bowtie D_{n}$). The join conditions are given by FT.tid$_{1}$ $=$ tid$_{1}$, FT.tid$_{2}$ $=$ tid$_{2}$, $\ldots$, FT.tid$_{n}$ $=$ tid$_{n}$, where tid$_{i}$ ($1\leqslant i\leqslant n$) is the tid of dimension table $D_{i}$. In the following discussions, the utility of a multi-relational HUI (MRHUI) always refers to the utility of the itemset in table $R$. Because the fact table records the relationship between the dimension tables and itself, and we do not perform the actual join operation before mining, FT replaces $R$ in the proposed algorithms.

Definition 1 (item utility in a transaction). Let $D_{i}$ be a dimension table and DT ${}_{i}$ a tid of $D_{i}$ . The internal utility $q(i_{p}$ , DT ${}_{i})$ represents the quantity of item $i_{p}$ in transaction DT ${}_{i}$ , the external utility $p(i_{p})$ is the unit profit value of item $i_{p}$ , and the utility of item $i_{p}$ in transaction DT ${}_{i}$ is defined as $u(i_{p}$ , DT ${}_{i})=p(i_{p})\times q(i_{p}$ , DT ${}_{i})$ .

Definition 2 (itemset utility of a single relation). Let $D_{i}$ be a dimension table, DT ${}_{i}$ be a tid of $D_{i}$ , and $F_{d}$ be a tid of FT. The utility of itemset $X$ in transaction DT ${}_{i}$ is defined as $u(X,DT_{i})=\sum_{i_{p}\in\,X\wedge\,X\subseteq\,DT_{i}\,}{u(i_{p},DT_{i})}$ . Given DT ${}_{i}\in F_{d}$ , $u(X$ , $F_{d})=u(X$ , DT ${}_{i})$ , otherwise $u(X,F_{d})=$ 0. The utility of itemset $X$ for a single relation in FT is defined as $u(X)=\sum_{\,F_{d}\in\,FT}{u(X,F_{d})}$ .

Definition 3 (itemset utility of a multi-relation). Let $X_{1}$, $X_{2}$, $\ldots$, $X_{m}$ be itemsets in dimensions $D_{1}$, $D_{2}$, $\ldots$, $D_{m}$, respectively, and DT$_{i}$ be a tid of $D_{i}$ ($1\leqslant i\leqslant m$). The utility of itemset $X_{1}\cup X_{2}\cup\ldots\cup X_{m}$ in FT is defined as $u(X_{1}\cup X_{2}\cup\ldots\cup X_{m})=\sum_{F_{d}\in\textit{FT}}\sum_{i=1}^{m}u(X_{i},F_{d})$, where $F_{d}$ is a tid of FT; consistent with the worked example in this section, only facts that contain every $X_{i}$ contribute to the sum.

The transaction utility (TU) of transaction $T_{d}$ is defined as TU( $T_{d})=u(T_{d}$ , $T_{d}$ ), where $T_{d}$ is a tid in a dimension table or fact table.

For the MRHUI mining problem, we need to determine those itemsets that contribute significantly to the total profit. We quantify significance using a metric called the minimum utility threshold $\delta$ . By specifying this measure, users can define what constitutes a significant contribution to the total profit according to their requirements. The minimum utility threshold $\delta$ is given as a percentage of the total transaction utility values in FT, whereas the minimum utility value is defined as

$\displaystyle\textit{min\_util}=\delta\times\sum_{F_{d}\in\textit{FT}}\textit{TU}(F_{d})$

An itemset $X$ is an MRHUI if $u(X)\geqslant$ min_util; otherwise, it is called a multi-relational low utility itemset (MRLUI). Given a DW ${\bm{S}}$ modeled as a star schema, the task of MRHUI mining is to determine all itemsets that have utilities that are greater than or equal to min_util in fact table FT of ${\bm{S}}$ .

Similar to HUI mining in a single table, the main challenge is that the itemset utility does not have the downward closure property. In this paper, we also use the property of transaction-weighted downward closure proposed by Liu et al. [12] and define the transaction-weighted utilization.

Definition 4. The transaction-weighted utilization (TWU) of itemset $X$ is defined as $\textit{TWU(X)}=\sum{\textit{TU}(F_{d})}$ , where $F_{d}$ is a tid in the fact table that contains $X$ .

$X$ is a high transaction-weighted utilization itemset (HTWUI) if TWU( $X$ ) $\geqslant$ min_util; otherwise, it is called a low transaction-weighted utilization itemset (LTWUI). An HTWUI (or LTWUI) with $k$ items is called a $k$ -HTWUI (or $k$ -LTWUI).

It has been proved [12] that any subset of an HTWUI is also an HTWUI. This property, also referred to as transaction-weighted downward closure, can be used to prune the supersets of LTWUIs and thus reduce the search space.
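The pruning argument can be written out in two steps. The following short derivation is added for clarity; the notation $F(X)$ for the set of fact tids that contain $X$ is introduced here only for this purpose, and the derivation assumes that all item utilities are non-negative.

```latex
u(X)=\sum_{F_{d}\in F(X)}u(X,F_{d})
    \leqslant\sum_{F_{d}\in F(X)}\textit{TU}(F_{d})=\textit{TWU}(X),
\qquad
Y\subseteq X\;\Rightarrow\;F(X)\subseteq F(Y)
            \;\Rightarrow\;\textit{TWU}(X)\leqslant\textit{TWU}(Y).
```

Hence, if TWU($Y$) $<$ min_util, every superset $X$ of $Y$ satisfies $u(X)\leqslant$ TWU($X$) $<$ min_util and can be discarded without computing its utility.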

To explain the aforementioned concepts, we join all tables of the example DW shown in Fig. 2. The results are presented in Table 2 and the profit of the different items in the three dimension tables is given in Table 3.

For convenience, we write an itemset { $y_{1}$ , $z_{2}$ } as $y_{1}z_{2}$ . In the example DW, the utility of item $y_{1}$ in transaction $F_{2}$ is $u(y_{1}$ , $F_{2})$ $=$ 12 $\times$ 2 $=$ 24, that of itemset $y_{1}z_{2}$ in transaction $F_{2}$ is $u(y_{1}z_{2}$ , $F_{2})$ $=$ $u(y_{1}$ , $F_{2}$ ) $+$ $u(z_{2}$ , $F_{2})$ $=$ 24 $+$ 30 $=$ 54, and that of $y_{1}z_{2}$ in the natural join result table is $u(y_{1}z_{2})$ $=$ $u(y_{1}z_{2}$ , $F_{2})+u(y_{1}z_{2}$ , $F_{4})$ $=$ 54 $+$ 54 $=$ 108. Given min_util $=$ 350, as $u(y_{1}z_{2})<$ min_util, $y_{1}z_{2}$ is not an MRHUI. The transaction utility of $F_{5}$ is TU( $F_{5})$ $=$ $u(x_{3}$ , $x_{4}$ , $x_{7}$ , $y_{1}$ , $y_{4}$ , $y_{7}$ , $z_{1}$ , $z_{5}$ , $z_{7}$ , $F_{5})=$ 65 and the utility of the other transactions is listed in the third column of Tables 1 and 2. As the TWU of $y_{1}z_{2}$ is TWU( $y_{1}z_{2})=$ TU( $F_{2})$ $+$ TU( $F_{4})$ $=$ 360, $y_{1}z_{2}$ is an HTWUI.

Table 2
Results of joining all tables of the example DW

tid ${}_{F}$	Transactions	TU
$F_{1}$	( $x_{1}$ ,3) ( $x_{4}$ ,3) ( $x_{6}$ ,3) ( $y_{1}$ ,1) ( $y_{4}$ ,1) ( $y_{7}$ ,1) ( $y_{2}$ ,1) ( $y_{5}$ ,1) ( $y_{8}$ ,1) ( $y_{3}$ ,1) ( $y_{6}$ ,1) ( $y_{9}$ ,1) ( $z_{1}$ ,1) ( $z_{5}$ ,1) ( $z_{7}$ ,1) ( $z_{3}$ ,1) ( $z_{6}$ ,1) ( $z_{9}$ ,1) ( $z_{4}$ ,1) ( $z_{5}$ ,1) ( $z_{10}$ ,1)	149
$F_{2}$	( $x_{1}$ ,3) ( $x_{5}$ ,3) ( $x_{7}$ ,3) ( $y_{1}$ ,2) ( $y_{4}$ ,2) ( $y_{7}$ ,2) ( $y_{2}$ ,1) ( $y_{5}$ ,1) ( $y_{8}$ ,1) ( $z_{2}$ ,2) ( $z_{6}$ ,2) ( $z_{8}$ ,2) ( $z_{3}$ ,1) ( $z_{6}$ ,1) ( $z_{9}$ ,1)	181
$F_{3}$	( $x_{2}$ ,4) ( $x_{5}$ ,4) ( $x_{8}$ ,4) ( $y_{2}$ ,3) ( $y_{5}$ ,3) ( $y_{8}$ ,3) ( $y_{3}$ ,1) ( $y_{6}$ ,1) ( $y_{10}$ ,1) ( $z_{2}$ ,1) ( $z_{6}$ ,1) ( $z_{8}$ ,1) ( $z_{4}$ ,3) ( $z_{5}$ ,3) ( $z_{10}$ ,3)	182
$F_{4}$	( $x_{1}$ ,3) ( $x_{5}$ ,3) ( $x_{7}$ ,3) ( $y_{1}$ ,2) ( $y_{4}$ ,2) ( $y_{7}$ ,2) ( $y_{3}$ ,1) ( $y_{6}$ ,1) ( $y_{9}$ ,1) ( $z_{2}$ ,2) ( $z_{6}$ ,2) ( $z_{8}$ ,2) ( $z_{4}$ ,1) ( $z_{5}$ ,1) ( $z_{10}$ ,1)	179
$F_{5}$	( $x_{3}$ ,1) ( $x_{4}$ ,1) ( $x_{7}$ ,1) ( $y_{1}$ ,1) ( $y_{4}$ ,1) ( $y_{7}$ ,1) ( $z_{1}$ ,1) ( $z_{5}$ ,1) ( $z_{7}$ ,1)	65

Table 3

Profit table

(a) Profit of the dimension customer
Item	$x_{1}$	$x_{2}$	$x_{3}$	$x_{4}$	$x_{5}$	$x_{6}$	$x_{7}$	$x_{8}$
Profit	5	3	2	3	4	1	8	2

(b) Profit of the dimension product
Item	$y_{1}$	$y_{2}$	$y_{3}$	$y_{4}$	$y_{5}$	$y_{6}$	$y_{7}$	$y_{8}$	$y_{9}$	$y_{10}$
Profit	12	9	7	4	2	1	6	3	5	8

(c) Profit of the dimension store
Item	$z_{1}$	$z_{2}$	$z_{3}$	$z_{4}$	$z_{5}$	$z_{6}$	$z_{7}$	$z_{8}$	$z_{9}$	$z_{10}$
Profit	9	15	8	2	16	4	5	6	10	3
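The quantities used in the worked example of this section can be checked with a short Python sketch. It is only an illustration under stated assumptions: the contents of the dimension tables are reconstructed by matching Table 1 against Table 2 (Fig. 2 is assumed to agree with them), and the names PROFIT, DIMS, FACTS, and all function names are hypothetical. The sketch expands each fact on the fly, so the joined table is never stored.

```python
# Profit table (Table 3).
PROFIT = {
    "x1": 5, "x2": 3, "x3": 2, "x4": 3, "x5": 4, "x6": 1, "x7": 8, "x8": 2,
    "y1": 12, "y2": 9, "y3": 7, "y4": 4, "y5": 2, "y6": 1, "y7": 6, "y8": 3,
    "y9": 5, "y10": 8,
    "z1": 9, "z2": 15, "z3": 8, "z4": 2, "z5": 16, "z6": 4, "z7": 5, "z8": 6,
    "z9": 10, "z10": 3,
}

# Dimension tables, tid -> itemset (assumption: inferred from Tables 1 and 2).
DIMS = {
    "c1": {"x1", "x4", "x6"}, "c2": {"x1", "x5", "x7"},
    "c3": {"x2", "x5", "x8"}, "c4": {"x3", "x4", "x7"},
    "p1": {"y1", "y4", "y7"}, "p2": {"y2", "y5", "y8"},
    "p3": {"y3", "y6", "y9"}, "p4": {"y3", "y6", "y10"},
    "s1": {"z1", "z5", "z7"}, "s2": {"z2", "z6", "z8"},
    "s3": {"z3", "z6", "z9"}, "s4": {"z4", "z5", "z10"},
}

# Fact table (Table 1): fact tid -> list of (dimension tid, quantity).
FACTS = {
    "F1": [("c1", 3), ("p1", 1), ("p2", 1), ("p3", 1),
           ("s1", 1), ("s3", 1), ("s4", 1)],
    "F2": [("c2", 3), ("p1", 2), ("p2", 1), ("s2", 2), ("s3", 1)],
    "F3": [("c3", 4), ("p2", 3), ("p4", 1), ("s2", 1), ("s4", 3)],
    "F4": [("c2", 3), ("p1", 2), ("p3", 1), ("s2", 2), ("s4", 1)],
    "F5": [("c4", 1), ("p1", 1), ("s1", 1)],
}


def fact_items(fid):
    """Yield the (item, quantity) pairs of one fact without storing the join."""
    for dim_tid, qty in FACTS[fid]:
        for item in DIMS[dim_tid]:
            yield item, qty


def contains(fid, itemset):
    """True if every item of the itemset occurs in the fact."""
    return set(itemset) <= {item for item, _ in fact_items(fid)}


def utility_in_fact(itemset, fid):
    """u(X, F_d); zero when the fact does not contain all of X."""
    if not contains(fid, itemset):
        return 0
    return sum(PROFIT[i] * q for i, q in fact_items(fid) if i in itemset)


def utility(itemset):
    """u(X) over the whole fact table."""
    return sum(utility_in_fact(itemset, f) for f in FACTS)


def transaction_utility(fid):
    """TU(F_d): utility of all items of the fact."""
    return sum(PROFIT[i] * q for i, q in fact_items(fid))


def twu(itemset):
    """TWU(X): sum of TU over the facts that contain X."""
    return sum(transaction_utility(f) for f in FACTS if contains(f, itemset))
```

With these definitions, `utility({"y1", "z2"})` evaluates to 108 and `twu({"y1", "z2"})` to 360, so for min_util $=$ 350 the itemset $y_{1}z_{2}$ is an HTWUI but not an MRHUI, in agreement with the example.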

3. Related work

In the following subsections, we discuss research on multi-relational itemset mining and HUI mining.

3.1 Multi-relational itemset mining

MRDM approaches look for patterns that involve multiple tables (relations) from a relational database [9]. In recent years, research has been conducted on multi-relational itemset mining from a star schema.

In [4], a level-wise candidate generation-and-test algorithm for mining multi-relational itemsets from a star schema was proposed. Similar to the Apriori algorithm [1], Crestana-Jensen and Soparkar’s algorithm generates frequent itemsets in separate tables and then merges the results. Although the joined table is computed without being materialized, this algorithm still suffers from candidate explosion as the number of dimensions, attributes, and values increases.

Ng et al. presented an approach for mining multi-relational itemsets by binding multiple dimension tables [14]. Specifically, this algorithm performs the operations of binding two dimensions and then mining frequent itemsets iteratively until all dimensions are processed without joining.

Unlike the aforementioned works, StarFP-Growth [16] is a multi-relational itemset mining algorithm over a star schema based on the pattern growth methodology [8]. For StarFP-Growth, a local dimension tree is first constructed for each dimension, and then a global tree is built by combining the dimension trees. Similar to FP-Growth [8], frequent itemsets across different dimensions are discovered from the global super tree. Silva and Antunes extended StarFP-Growth to the data stream environment [17].

Nagao and Seki proposed an algorithm for mining closed itemsets from multi-relational data [13]. Their algorithm performs parallel mining on multi-core processors. To realize load-balancing, node-wise parallelization and parallel-for based parallelization are used to optimize the task-parallelism in the problem of closed itemset discovery from multi-relational data.

3.2 High utility itemset mining

HUI mining, an extension of frequent itemset mining, is an area of active research in data mining. Many algorithms have been proposed for mining HUIs.

The basic concepts of HUI mining were outlined by Yao et al. [27]. Because their approach, called mining with expected utility (MEU), cannot use downward closure to reduce the number of candidate itemsets, they proposed a heuristic approach to predict whether an itemset should be added to the candidate set. However, the prediction typically gives an overestimate, especially in the initial stages. Moreover, the approach to examining candidates is impractical in terms of both processing cost and memory requirements if the number of items is large or the utility threshold is low.

The Two-Phase algorithm [12] was developed to determine HUIs using a downward closure property. This is employed in phase I to reduce the number of candidates. In phase II, only one extra database scan is required to identify the HUIs. Although this algorithm effectively reduces the search space and captures a complete set of HUIs, it still generates too many candidates and requires multiple database scans, especially when mining dense datasets and long patterns, much like the Apriori algorithm for mining frequent itemsets.

Several methods of generating candidates efficiently in phase I and avoiding multiple database scans have been proposed, including the projection-based approach [10], an approach based on maximal itemsets [11], and an approach based on a binary partition for itemset expansion [19]. Of these new approaches, tree-based algorithms, such as CTU-Tree [5], IHUP-Tree [2], and UP-Tree [22], have been shown to be very efficient for mining HUIs. These algorithms comprise three steps: (1) tree construction, (2) generation of candidate HUIs from the trees using the pattern-growth approach, and (3) identification of HUIs from the set of candidates.

Recently, we proposed a new HUI mining algorithm, named IHUI-Mine [21], based on the subsume index structure [20]. By recording information about the co-occurrence of itemsets, the subsume index is an efficient data structure for frequent itemset mining [24]. In IHUI-Mine, this structure was also found to be efficient for HUI mining.

4. Mining algorithm based on the item and transaction index

In this section, we propose the MRHUI mining (RHUI-Mine) algorithm. In this algorithm, we exploit the characteristics of tables in a star schema and do not perform any join operations. Furthermore, we exploit the following pruning strategy: if $X\cup Y$ is an HTWUI, where $X$ is from table $D_{i}$ and $Y$ is from table $D_{j}$ , $X$ must be an HTWUI and $Y$ must be an HTWUI. Thus, the RHUI-Mine algorithm is composed of two stages. In the first stage, we mine the HTWUIs locally for each dimension table. Only relevant information is scanned. In the second stage, global HUIs across multiple tables are discovered based on local HTWUIs.

4.1 Data structure

To represent the transaction data required for mining MRHUIs in a compact form, we design two data structures called the itemset index (SI) and transaction index (TI) for itemsets and transactions, respectively.

Definition 5. Let $X$ be an itemset in dimension table $D_{i}$ ( $1\leqslant i\leqslant n$ ) and SI( $X$ ) be the itemset index of $X$ . SI( $X$ ) is defined as a tuple $<$ st, TWU $>$ , where st is the set of transaction IDs in $D_{i}$ containing $X$ and TWU denotes TWU( $X$ ).

Consider itemset $y_{3}y_{6}$ in dimension table $P$, for example. Because $y_{3}y_{6}$ is contained in transactions $p_{3}$ and $p_{4}$, SI($y_{3}y_{6}$).st is {$p_{3}$, $p_{4}$}. To calculate the TWU of the itemset in the fact table, TI is defined as follows:

Definition 6. Let $t$ be a tid in dimension table $D_{i}$ (1 $\leqslant i\leqslant n$ ) and TI( $t$ ) $=$ $<$ sf, std $>$ , where sf is the set of transaction IDs containing $t$ in the fact table and std is the set of transaction IDs in $D_{j}(j\neq i)$ co-occurring with $t$ in the fact table.

Consider $p_{3}$ again. Because there are two transactions ($F_{1}$ and $F_{4}$) in FT that contain $p_{3}$, TI($p_{3}$).sf is {$F_{1}$, $F_{4}$}. In these two transactions, $c_{1}$ and $c_{2}$ in C and $s_{1}$, $s_{2}$, $s_{3}$, and $s_{4}$ in S co-occur with $p_{3}$; thus, TI($p_{3}$).std is {$c_{1}$, $c_{2}$, $s_{1}$, $s_{2}$, $s_{3}$, $s_{4}$}. Therefore, TI($p_{3}$) $=$ $<$ {$F_{1}$, $F_{4}$}, {$c_{1}$, $c_{2}$, $s_{1}$, $s_{2}$, $s_{3}$, $s_{4}$} $>$. Additionally, we have TI($p_{4}$) $=$ $<$ {$F_{3}$}, {$c_{3}$, $s_{2}$, $s_{4}$} $>$, and so SI($y_{3}y_{6}$).TWU $=$ TU($F_{1}$) $+$ TU($F_{3}$) $+$ TU($F_{4}$) $=$ 149 $+$ 182 $+$ 179 $=$ 510. Therefore, SI($y_{3}y_{6}$) $=$ $<$ {$p_{3}$, $p_{4}$}, 510 $>$.
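The two index structures can be illustrated with a small Python sketch that rebuilds the values of this example. It is a sketch under assumptions rather than the implementation used in the paper: the names TI, SI, TU, FACTS, PRODUCT, `transaction_index`, and `itemset_index` are hypothetical, the fact table is reduced to the set of dimension tids referenced by each fact, and the contents of the Product dimension are inferred from Tables 1 and 2.

```python
from collections import namedtuple

TI = namedtuple("TI", ["sf", "std"])   # transaction index <sf, std>
SI = namedtuple("SI", ["st", "twu"])   # itemset index <st, TWU>

# Transaction utilities of the facts (third column of Table 1).
TU = {"F1": 149, "F2": 181, "F3": 182, "F4": 179, "F5": 65}

# Fact table: fact tid -> dimension tids it references (quantities omitted).
FACTS = {
    "F1": {"c1", "p1", "p2", "p3", "s1", "s3", "s4"},
    "F2": {"c2", "p1", "p2", "s2", "s3"},
    "F3": {"c3", "p2", "p4", "s2", "s4"},
    "F4": {"c2", "p1", "p3", "s2", "s4"},
    "F5": {"c4", "p1", "s1"},
}

# Product dimension: tid -> itemset (assumption: inferred from Tables 1 and 2).
PRODUCT = {
    "p1": {"y1", "y4", "y7"}, "p2": {"y2", "y5", "y8"},
    "p3": {"y3", "y6", "y9"}, "p4": {"y3", "y6", "y10"},
}


def transaction_index(tid, dimension):
    """TI(t): facts containing t, and tids of other dimensions co-occurring with t."""
    sf = {f for f, tids in FACTS.items() if tid in tids}
    std = set()
    for f in sf:
        std |= FACTS[f] - set(dimension)   # drop tids of t's own dimension
    return TI(sf, std)


def itemset_index(itemset, dimension):
    """SI(X): dimension tids containing X, and TWU(X) in the fact table."""
    st = {tid for tid, items in dimension.items() if set(itemset) <= items}
    facts = set()
    for tid in st:
        facts |= transaction_index(tid, dimension).sf
    return SI(st, sum(TU[f] for f in facts))
```

Calling `transaction_index("p3", PRODUCT)` and `itemset_index({"y3", "y6"}, PRODUCT)` reproduces TI($p_{3}$) and SI($y_{3}y_{6}$) as given in the example. Taking the union of the sf sets before summing TU ensures that a fact containing several matching dimension tids is counted once.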

4.2 HTWUI generation in dimension tables

In our RHUI-Mine algorithm, HTWUIs and their SIs in each dimension table are discovered first; this method is described in Algorithm 1.

In Algorithm 1, high TWU single items are determined in Step 1. CHUI ${}_{k}$ is used to denote the set of HTWUIs with length $k$ . Step 2 calculates the SI of each itemset in CHUI ${}_{1}$ . The initial value of $k$ is set to one in Step 3. The main loop discovers all HTWUIs using a candidate generation-and-test methodology (Steps 4–15). Step 5 calls the function candidate_gen to generate candidates with length ( $k+1$ ). For each new candidate $C$ , the loop (Steps 6–12) scans the dimension table to update SI( $C$ ). In Step 13, only candidates whose TWU is greater than or equal to min_util are maintained as HTWUIs in the dimension table for further use. Finally, Step 16 returns all of the discovered local HTWUIs.

Algorithm 1	Function DHUI_gen ( $D$ , min_util)
Input	$D$ : a dimension table; min_util: minimum utility value
Output	DHUI: All HTWUIs in $D$
1	Store all 1-HTWUIs in CHUI ${}_{1}$ ;
2	Calculate SI of each 1-HTWUI;
3	$k$ = 1;
4	while CHUI ${}_{k}\neq\emptyset$ do
5	CS ${}_{k+1}$ $=$ candidate_gen(CHUI ${}_{k}$ , min_util);
6	for each transaction $t\in D$ do
7	for each itemset $C\in$ CS ${}_{k+1}$ do
8	if $C\subseteq t$ then
9	Update SI( $C$ );
10	end if
11	end for
12	end for
13	CHUI${}_{k+1}$ $=$ {$C\in$ CS${}_{k+1}$ $\mid$ TWU($C$) $\geqslant$ min_util};
14	$k$ ++;
15	end while
16	return DHUI $=$ $\cup_{k}$ CHUI ${}_{k}$ ;
Function	candidate_gen(CHUI ${}_{k}$ , min_util)
1	for each itemset $h_{m}\in\textit{CHUI}_{k}$ do
2	for each itemset $h_{n}\in\textit{CHUI}_{k}$ do
3	if ($h_{m}[1]==h_{n}[1]\wedge h_{m}[2]==h_{n}[2]\wedge\ldots\wedge h_{m}[k-1]==h_{n}[k-1]\wedge h_{m}[k]\prec h_{n}[k]$) then
4	$C=h_{m}[1]h_{m}[2]\ldots h_{m}[k-1]h_{m}[k]h_{n}[k]$ ;
5	if all subsets of $C$ with length $k$ are in CHUI ${}_{k}$ then
6	Put $C$ into CS ${}_{k+1}$ ;
7	end if
8	end if
9	end for
10	end for
11	return CS ${}_{k+1}$ ;

With the support of transaction-weighted downward closure [12], the function candidate_gen generates the ( $k+1$ )-HTWUI candidates by joining two $k$ -HTWUIs that share the common ( $k-1$ ) prefix. Note that items within any itemset are in lexicographical order, denoted by “ $\prec$ ”. For example, for a $k$ -itemset $h_{i}$ , this means that the items are sorted such that $h_{i}[1]\prec h_{i}[2]\prec\ldots\prec h_{i}[k]$ . This lexicographical order simply ensures that no duplicates are generated.
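As an illustration, Algorithm 1 and the function candidate_gen can be sketched in Python. This is a minimal sketch, not the authors' implementation: itemsets are represented as sorted tuples, the names TU, FACTS_OF, PRODUCT, and `twu_and_st` are hypothetical, FACTS_OF holds TI($t$).sf for the Product dimension of the running example, and SI is recomputed by scanning the dimension table instead of being updated incrementally.

```python
from itertools import combinations

TU = {"F1": 149, "F2": 181, "F3": 182, "F4": 179, "F5": 65}

# TI(t).sf for each tid of the Product dimension (from Table 1).
FACTS_OF = {
    "p1": {"F1", "F2", "F4", "F5"}, "p2": {"F1", "F2", "F3"},
    "p3": {"F1", "F4"}, "p4": {"F3"},
}

# Product dimension (assumption: inferred from Tables 1 and 2).
PRODUCT = {
    "p1": {"y1", "y4", "y7"}, "p2": {"y2", "y5", "y8"},
    "p3": {"y3", "y6", "y9"}, "p4": {"y3", "y6", "y10"},
}


def twu_and_st(itemset, dimension):
    """Return (TWU, st) of an itemset by scanning the dimension table."""
    st = {tid for tid, items in dimension.items() if set(itemset) <= items}
    facts = set().union(*(FACTS_OF[t] for t in st))
    return sum(TU[f] for f in facts), st


def candidate_gen(chui_k):
    """Join k-HTWUIs sharing a (k-1)-prefix and apply the subset check."""
    level = set(chui_k)
    candidates = []
    for hm in chui_k:
        for hn in chui_k:
            # Same prefix, and the last item of hm precedes that of hn.
            if hm[:-1] == hn[:-1] and hm[-1] < hn[-1]:
                c = hm + (hn[-1],)
                if all(s in level for s in combinations(c, len(hm))):
                    candidates.append(c)
    return candidates


def dhui_gen(dimension, min_util):
    """Level-wise discovery of all HTWUIs of one dimension table."""
    items = sorted({i for t in dimension.values() for i in t})
    candidates = [(i,) for i in items]
    dhui = {}
    while candidates:
        level = []
        for c in candidates:
            twu, st = twu_and_st(c, dimension)
            if twu >= min_util:            # keep HTWUIs only (Step 13)
                dhui[c] = (st, twu)
                level.append(c)
        candidates = candidate_gen(level)
    return dhui
```

With min_util $=$ 350, `dhui_gen(PRODUCT, 350)` returns 17 local HTWUIs (eight 1-itemsets, seven 2-itemsets, and two 3-itemsets), including $y_{3}y_{6}$ with SI $=$ $<$ {$p_{3}$, $p_{4}$}, 510 $>$. Python's string order is used as the order $\prec$; any fixed total order serves the same purpose of avoiding duplicates.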

4.3 MRHUI discovery

Algorithm 1 implements the first stage and provides all the local HTWUIs required for the second stage. In the second stage, we use the proposed SI and TI structures to mine the data without joining the tables. The RHUI-Mine pseudo-code is given in Algorithm 2.

Algorithm 2	RHUI-Mine
Input	S: a DW with a star schema; min_util: minimum utility value
Output	MRHUIs
1	Scan the fact table FT once, and calculate TU of each transaction in FT;
2	for each dimension table $D_{i}$ do
3	Calculate TI of each transaction in $D_{i}$ ;
4	DHTWUI ${}_{i}=$ DHUI_gen ( $D_{i}$ , min_util);
5	end for
6	Put all elements of DHTWUI ${}_{1}$ into CHS;
7	for ( $i=$ 2; $i\leqslant n$ ; $i++$ ) do
8	RHTWUI_gen(CHS, DHTWUI ${}_{i}$ , min_util);
9	end for
10	for each itemset is $\in$ CHS do
11	Scan the transactions according to SI and TI to calculate $u$ (is);
12	if $u$ (is) $\geqslant$ min_util then
13	Put is into MRHUIS;
14	end if
15	end for
16	Output MRHUIS;
Procedure	RHTWUI_gen(CHS, DHTWUI, min_util)
1	$k=$ 1;
2	while CHS ${}_{k}\neq\emptyset$ do
3	for each itemset $h_{m}\in$ CHS ${}_{k}$ do
4	for each itemset $h_{n}\in$ DHTWUI ${}_{k}$ do
5	if $\exists s\in$ SI($h_{m}$).st and $t\in$ SI($h_{n}$).st such that TI($s$).sf $\cap$ TI($t$).sf $\neq\emptyset$ then
6	if ($h_{m}[1]==h_{n}[1]\wedge h_{m}[2]==h_{n}[2]\wedge\ldots\wedge h_{m}[k-1]==h_{n}[k-1]\wedge h_{m}[k]\prec h_{n}[k]$) then
7	$C=h_{m}[1]h_{m}[2]\ldots h_{m}[k-1]h_{m}[k]h_{n}[k]$;
8	if all subsets of $C$ with length $k$ are in CHS${}_{k}$ or DHTWUI${}_{k}$ then
9	SI($C$).TWU $=\sum_{ft\in\textit{TI}(s).\textit{sf}\,\cap\,\textit{TI}(t).\textit{sf}}\textit{TU}(ft)$;
10	if SI( $C)$ .TWU $\geqslant$ min_util then
11	SI( $C)$ .st= SI( $h_{m})$ .st $\cup$ SI( $h_{n})$ .st;
12	Put $C$ into CHS;
13	end if
14	end if
15	end if
16	end if
17	end for
18	end for
19	$k$ ++;
20	end while

In Algorithm 2, the fact table FT is first scanned once to determine the TU of each transaction (Step 1). The loop of Steps 2–5 enumerates the dimension tables individually. For each dimension table $D_{i}$ , the TI of each transaction is calculated (Step 3), and Algorithm 1 is called to discover all HTWUIs (denoted by DHTWUI ${}_{i})$ in $D_{i}$ (Step 4). We use CHS to denote all HTWUIs across multiple dimension tables. CHS is initialized as the HTWUIs of the first dimension table (Step 6). The loop generates CHS by calling the procedure RHTWUI_gen until all dimensions have been processed without joining (Steps 7–9). Using the SI and TI structure, we can calculate the actual utility of each HTWUI by scanning the associated transactions rather than all of the dimension tables (Steps 10–15). Finally, Step 16 outputs all of the discovered MRHUIs.

The aim of procedure RHTWUI_gen is to discover HTWUIs across all dimension tables by considering CHS and DHTWUI of one dimension table, and the results are added to CHS. This procedure follows a candidate generation-and-test approach that is similar to function candidate_gen in Algorithm 1, except for two aspects. First, before combining two itemsets from CHS and DHTWUI, Step 5 checks whether they co-occur in the fact table according to SI and TI. Second, after a new candidate $C$ has been generated, Step 9 updates SI( $C)$ .TWU and, if this value is greater than or equal to min_util, SI( $C)$ .st is calculated for further use (Step 11). In this algorithm, CHS ${}_{k}$ and DHTWUI ${}_{k}$ are used to denote the set of HTWUIs of length $k$ in CHS and DHTWUI.
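The core of the second stage, namely combining two local HTWUIs through the fact table and then verifying the actual utility, can be sketched as follows. This is a simplified illustration and not the procedure RHTWUI_gen itself: it combines one itemset from the Product dimension with one from the Store dimension, it aggregates the shared facts over all pairs ($s$, $t$) rather than a single pair, and a fact that references several dimension tids containing the same itemset contributes once per tid. All names (TU, PROFIT, PRODUCT, STORE, FACTS, `st`, `sf`, `facts_of`, `combine`, `utility`) are hypothetical, and the dimension contents are inferred from Tables 1 and 2.

```python
TU = {"F1": 149, "F2": 181, "F3": 182, "F4": 179, "F5": 65}

PROFIT = {
    "y1": 12, "y2": 9, "y3": 7, "y4": 4, "y5": 2, "y6": 1, "y7": 6, "y8": 3,
    "y9": 5, "y10": 8,
    "z1": 9, "z2": 15, "z3": 8, "z4": 2, "z5": 16, "z6": 4, "z7": 5, "z8": 6,
    "z9": 10, "z10": 3,
}

PRODUCT = {
    "p1": {"y1", "y4", "y7"}, "p2": {"y2", "y5", "y8"},
    "p3": {"y3", "y6", "y9"}, "p4": {"y3", "y6", "y10"},
}
STORE = {
    "s1": {"z1", "z5", "z7"}, "s2": {"z2", "z6", "z8"},
    "s3": {"z3", "z6", "z9"}, "s4": {"z4", "z5", "z10"},
}

# Fact table restricted to the two dimensions: fact tid -> {dimension tid: quantity}.
FACTS = {
    "F1": {"p1": 1, "p2": 1, "p3": 1, "s1": 1, "s3": 1, "s4": 1},
    "F2": {"p1": 2, "p2": 1, "s2": 2, "s3": 1},
    "F3": {"p2": 3, "p4": 1, "s2": 1, "s4": 3},
    "F4": {"p1": 2, "p3": 1, "s2": 2, "s4": 1},
    "F5": {"p1": 1, "s1": 1},
}


def st(itemset, dimension):
    """SI(X).st: tids of the dimension whose itemset contains X."""
    return {t for t, items in dimension.items() if set(itemset) <= items}


def sf(tid):
    """TI(t).sf: facts that reference dimension tid t."""
    return {f for f, row in FACTS.items() if tid in row}


def facts_of(itemset, dimension):
    """All facts in which the itemset occurs."""
    return set().union(*(sf(t) for t in st(itemset, dimension)))


def combine(x, dim_x, y, dim_y, min_util):
    """Return (shared facts, TWU) if the union of x and y is an HTWUI, else None."""
    shared = facts_of(x, dim_x) & facts_of(y, dim_y)
    twu = sum(TU[f] for f in shared)
    return (shared, twu) if shared and twu >= min_util else None


def utility(parts, shared):
    """Actual utility over the shared facts only (no scan of the full tables)."""
    total = 0
    for f in shared:
        for itemset, dimension in parts:
            for t in st(itemset, dimension):
                if t in FACTS[f]:
                    total += FACTS[f][t] * sum(PROFIT[i] for i in itemset)
    return total
```

For $y_{1}$ and $z_{2}$ with min_util $=$ 350, `combine` returns the shared facts {$F_{2}$, $F_{4}$} with TWU 360, and `utility` over those two facts gives 108, so the combined itemset is kept as an HTWUI but rejected as an MRHUI. Because only the shared facts are visited, the verification touches the transactions indicated by SI and TI rather than the whole database.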

4.4 Complexity analysis

Let ${\bm{S}}$ $=$ ( $D_{1}$ , $D_{2}$ , …, $D_{n}$ , FT) be a DW modeled as a star schema, with $D_{i}$ corresponding to each dimension table and FT to the fact table. If there are $m$ facts in FT, the size of the fact table is $m\times n$ . A dimension table $D_{i}$ with $r_{i}$ rows and $c_{i}$ columns has size $r_{i}\times c_{i}$ . Thus, the size of ${\bm{S}}$ is $m\times n+\sum_{i=1}^{n}{(r_{i}\times c_{i})}$ . Once the size of ${\bm{S}}$ has been determined, the TI of each transaction in the dimension tables can also be determined.

4.4.1 Calculation of SI for each 1-HTWUI

This operation consists of two steps. First, for a transaction $T$ in dimension table $D_{i}$, we must determine which facts contain $T$. The cost of this step is $O(m\times n\times r_{i})$. Second, for an item $\mathit{it}$ in dimension table $D_{i}$, we must determine which transactions contain $\mathit{it}$. The cost of this step is $O(r_{i}\times c_{i})$. Thus, calculating the TWU of each item requires $O(\sum_{i=1}^{n}(m\times n\times r_{i}+r_{i}\times c_{i}))$ time.

4.4.2 HTWUI generation in dimension tables

We consider dimension table $D_{i}$ as an example; the results for the other dimension tables are the same. This operation is composed of two steps: HTWUI concatenation and SI computation. For the first step, two $k$-HTWUIs with the same first ($k-1$) items are merged to generate candidate HTWUIs of length ($k+1$). Each merge operation requires ($k-1$) equality comparisons. In the best-case scenario, every merge step produces a valid ($k+1$)-HTWUI, and the cost of this procedure is $O\left(\sum_{k=2}^{c_{i}-1}(k-1)\,|\textit{CS}_{k+1}|\right)$. In the worst-case scenario, every pair of $k$-HTWUIs must be merged, and the cost is $O\left(\sum_{k=2}^{c_{i}-1}(k-1)\,|\textit{CHUI}_{k}|^{2}\right)$. For the second step, we must check whether an itemset is contained in a transaction, which requires $O\left(r_{i}\times\sum_{k=1}^{c_{i}}\binom{c_{i}}{k}\right)$ time.

4.4.3 MRHUI discovery

This operation can also be divided into two steps: HTWUI concatenation across multiple dimension tables and HUI identification. For the first step, the cost is $O\left(m\times\sum_{k=1}^{n}\binom{n}{k}\right)$. For the second step, given an HTWUI $X$, $u(X)$ is calculated by scanning the associated transactions. In the worst-case scenario, all facts in FT contain $X$. Thus, the cost of this HUI filtering is $O(m\times|\textit{SH}|)$, where SH is the set of all HTWUIs.
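The binomial sums that appear in the cost expressions of Sections 4.4.2 and 4.4.3 have closed forms, which make the exponential worst case explicit. The following identities follow from the binomial theorem and are added here only as a reading aid.

```latex
\sum_{k=1}^{c_{i}}\binom{c_{i}}{k}=2^{c_{i}}-1,
\qquad
\sum_{k=1}^{n}\binom{n}{k}=2^{n}-1 .
```

The SI computation for dimension table $D_{i}$ is therefore bounded by $O(r_{i}\times(2^{c_{i}}-1))$, and the HTWUI concatenation across dimension tables by $O(m\times(2^{n}-1))$.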

4.5 Comparison with related work

We compare the RHUI-Mine algorithm with a typical multi-relational frequent itemset mining algorithm proposed in [4], which also traverses the search space level by level in DWs using the star schema.

Although the joined table is not materialized, it is computed by the algorithm in [4]. In contrast, the RHUI-Mine algorithm avoids the computation of the joined table.

Furthermore, as pointed out by the authors of [4], their algorithm does not perform any pruning from one pass to the next. Thus, there could be many candidate itemsets, leading to high computational cost. In contrast, by using an itemset index and a transaction index, the RHUI-Mine algorithm performs the HTWUI calculation and the real utility verification from the index information, rather than by examining the entire database.

5. Mining algorithm based on a tree structure

In this section, we propose another MRHUI mining algorithm, RHUI-Growth, which uses the recursive pattern growth approach. Similar to the RHUI-Mine algorithm presented in Section 4, the RHUI-Growth algorithm consists of a dimension stage and a global stage. In the dimension stage, a dimensional tree is constructed for each dimension, after which the corresponding tables are discarded. The trees already take into account the TWU and incorporate the transaction IDs. In the global stage, the relational tree that represents the entire star schema is constructed by aggregating the dimensional trees based on the fact table. Finally, this tree is mined using the known pattern growth method.

5.1 Dimensional tree structure

We first construct a dimensional HUI (DHUI)-tree for each dimension. Specifically, DHUI-tree $T$ is the following tree structure:

(1) It consists of one root labeled “null”, denoted by T.root; a set of item-prefix subtrees as the children of the root, denoted by T.tree; and a transaction table, denoted by T.tr. The transaction table keeps track of the path corresponding to each tid and stores the last node of that path. This structure assists the global mining. If we wish to know which 1-HTWUIs belong to a tid, we follow the link in that table to determine the last node, and then climb through its parents until we reach the root node. The items of the nodes on this path are the 1-HTWUIs of that transaction.
(2) Each node $N$ in the item-prefix subtree consists of four fields, N.item, N.util, N.children, and N.parent, where N.item is the item $N$ represents; N.util is TWU(N.item) in the fact table; N.children links to the children nodes of $N$ (or to “null” if there are none); and N.parent links to the parent node of $N$.
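The lookup through the transaction table can be sketched as follows. This is only an illustration under stated assumptions: the class `Node`, the function `items_of_tid`, the dictionary used for T.tr, and all items, utilities, and tids are hypothetical.

```python
class Node:
    """One DHUI-tree node with the four fields named in the text."""
    def __init__(self, item, util, parent=None):
        self.item = item
        self.util = util
        self.parent = parent
        self.children = {}                # item -> child Node
        if parent is not None:
            parent.children[item] = self

def items_of_tid(tr, tid):
    """Climb from the last node stored for tid up to the root."""
    items = []
    node = tr[tid]
    while node.parent is not None:        # the root has no parent
        items.append(node.item)
        node = node.parent
    return items[::-1]                    # root-to-leaf order

# Hand-built toy tree: root -> a -> {b, c}.
root = Node(None, 0)
a = Node("a", 30, root)
b = Node("b", 20, a)
c = Node("c", 10, a)
tr = {1: b, 2: c, 3: a}                   # tid -> last node of its path
print(items_of_tid(tr, 1))
```

Storing only the last node per tid keeps the table small, at the price of one climb of at most $c_{i}$ steps per lookup.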

The DHUI-tree is constructed using the pattern growth approach and can be completed with two scans of the database. During the first pass over the database, the algorithm calculates the TWU value of each item.

During the second scan of the database, transactions are inserted into the DHUI-tree. Initially, the tree is created with a root. When a transaction is retrieved, low TWU items are removed from the transaction, because no itemset that contains a low TWU item can be an HUI. The remaining items are then processed in turn: an update node operation is performed if the current node (initially the root) contains a child with the item to be inserted; otherwise, an insert node operation is performed. This continues until every 1-HTWUI in the current transaction has been processed. Note that items within one dimension are ordered in TWU descending order.
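The two scans can be sketched as follows. The sketch makes a simplifying assumption that is not in the paper: each dimension row is given together with a transaction utility value, whereas in the algorithm the TWU of an item is obtained from the fact table. The function name `build_dhui_tree`, the dictionary-based nodes, and the sample rows are hypothetical, and ties in TWU are broken by item name.

```python
from collections import defaultdict

def build_dhui_tree(rows, min_util):
    """rows: {tid: (items, tu)}; returns (root, tr, twu)."""
    # First scan: accumulate the TWU of every item.
    twu = defaultdict(int)
    for items, tu in rows.values():
        for item in items:
            twu[item] += tu
    root = {"item": None, "util": 0, "parent": None, "children": {}}
    tr = {}
    # Second scan: drop low TWU items, sort by descending TWU, insert.
    for tid, (items, _) in rows.items():
        kept = sorted((i for i in items if twu[i] >= min_util),
                      key=lambda i: (-twu[i], i))
        node = root
        for item in kept:
            child = node["children"].get(item)
            if child is None:             # insert node operation
                child = {"item": item, "util": twu[item],
                         "parent": node, "children": {}}
                node["children"][item] = child
            # otherwise the existing node is reused (shared prefix)
            node = child
        tr[tid] = node                    # last node of the path of this tid
    return root, tr, dict(twu)

rows = {1: (["a", "b"], 10), 2: (["a", "c"], 20), 3: (["b", "d"], 5)}
root, tr, twu = build_dhui_tree(rows, min_util=15)
print(sorted(root["children"]))
```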

For example, given min_util $=$ 350, the DHUI-tree of dimension table C is shown in Fig. 3.

Figure 3.
DHUI-tree for dimension table C.

Note that items $x_{2}$, $x_{3}$, $x_{4}$, $x_{6}$, and $x_{8}$ are not shown in Fig. 3 because they were deleted as 1-LTWUIs.

5.2 Relational tree structure

To discover HUIs across multiple dimension tables, a relational HUI (RHUI)-tree $T$ is defined as follows:

(1)
It consists of one root labeled “null”, denoted by T.root; a set of item-prefix subtrees as the children of the root, denoted by T.tree; and an HTWUI-header table, denoted by T.header.
(2)
Each node $N$ in the item-prefix subtree consists of five fields, N.item, N.util, N.nodelink, N.children, and N.parent, where N.item is the item $N$ represents; N.util is TWU(N.item); N.nodelink links to the next node in $T$ that carries the same N.item (or to “null” if there are none); N.children denotes the children nodes of $N$ (or “null” if there are none); and N.parent is the parent node of $N$ .

Let $T$ be an RHUI-tree. A conditional pattern of itemset $X$ is defined as CP($X$) $=$ ($Y$: util), where $Y$ is the set of items on the path from T.root to $X$, and util is TWU($X$) on this path. The set of all conditional patterns of $X$ is called the conditional pattern base of $X$, denoted by CPB($X$). The RHUI-tree constructed from CPB($X$) is called the conditional RHUI-tree of $X$, denoted by CT($X$). Similar to the FP-Growth algorithm [8], both the conditional pattern base and the conditional RHUI-tree are used for the recursive enumeration of HTWUIs.
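A conditional pattern base can be collected by following the node links of an item and climbing to the root from each node, as in FP-Growth. The sketch below assumes a single-item $X$ and a header table that maps an item to the first node carrying it; the names `Node` and `conditional_pattern_base` and the toy tree are hypothetical.

```python
class Node:
    """RHUI-tree node reduced to the fields needed for this sketch."""
    def __init__(self, item, util, parent=None):
        self.item = item
        self.util = util
        self.parent = parent
        self.nodelink = None              # next node carrying the same item

def conditional_pattern_base(header, item):
    """Collect (prefix path, util) for every node that carries item."""
    base = []
    node = header.get(item)
    while node is not None:
        path = []
        up = node.parent
        while up is not None and up.item is not None:
            path.append(up.item)
            up = up.parent
        if path:                          # nodes directly under the root add nothing
            base.append((path[::-1], node.util))
        node = node.nodelink
    return base

# Toy tree with two branches: root -> p -> s -> c and root -> s -> c.
root = Node(None, 0)
p1 = Node("p", 50, root)
s1 = Node("s", 50, p1)
c1 = Node("c", 30, s1)
s2 = Node("s", 20, root)
c2 = Node("c", 20, s2)
c1.nodelink = c2
header = {"c": c1}
cpb = conditional_pattern_base(header, "c")
print(cpb)
```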

The construction of an RHUI-tree is very similar to the construction of a DHUI-tree. However, there are three differences:

• First, no initial scan of any table is needed to calculate the TWUs, because they have already been calculated and stored in each dimensional tree.
• Second, a fact is a set of tids; therefore, each fact must be denormalized before ordering the items or inserting them into the tree. A denormalized fact is an itemset with the items corresponding to its tids. Through the transaction table of each DHUI-tree, we can obtain the path corresponding to the itemset of each tid. Furthermore, according to the anti-monotone property, if an itemset is not an HTWUI, no other itemset containing it will be an HTWUI. Thus, we only check the 1-HTWUIs in the transactions of each tid, which ensures that the final tree has only the 1-HTWUIs of each dimension. Table 4 presents the result of denormalizing each fact given min_util $=$ 350. In this denormalized table, items in one dimension are sorted in TWU descending order.
• Third, we order items in the RHUI-tree by descending order of the lowest TWU of all dimensions. Dimensions with the same lowest TWUs are ordered alphabetically, and items of a dimension are ordered in TWU descending order. Note that with this ordering, items from multiple dimensions are not intermixed.
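Denormalizing a fact can be sketched as follows. For brevity, the sketch replaces the climb through each DHUI-tree by a plain lookup table from tid to the 1-HTWUIs of that transaction; the function `denormalize` and the tids `p1`, `s1`, and `c1` are hypothetical, while the item lists mirror the last row of Table 4.

```python
def denormalize(fact, dim_order, htwuis_by_tid):
    """Replace each tid of a fact by the 1-HTWUIs of that dimension transaction."""
    itemset = []
    for dim in dim_order:                         # one dimension after another
        tid = fact[dim]
        itemset.extend(htwuis_by_tid[dim][tid])   # already in TWU descending order
    return itemset

# Hypothetical tids; the item lists mirror the last row of Table 4.
htwuis_by_tid = {
    "P": {"p1": ["y1", "y4", "y7"]},
    "S": {"s1": ["z5"]},
    "C": {"c1": ["x7"]},
}
fact = {"P": "p1", "S": "s1", "C": "c1"}
row = denormalize(fact, ["P", "S", "C"], htwuis_by_tid)
print(row)
```

Because the dimensions are visited in a fixed order, items from different dimensions are never intermixed in the resulting itemset.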

Table 4
Denormalized facts

Itemsets$_{P}$	Itemsets$_{S}$	Itemsets$_{C}$	TU
($y_{1}$,1) ($y_{4}$,1) ($y_{7}$,1) ($y_{2}$,1) ($y_{5}$,1) ($y_{8}$,1) ($y_{3}$,1) ($y_{6}$,1)	($z_{6}$,1) ($z_{5}$,2) ($z_{4}$,1) ($z_{10}$,1)	($x_{1}$,3)	100
($y_{1}$,2) ($y_{4}$,2) ($y_{7}$,2) ($y_{2}$,1) ($y_{5}$,1) ($y_{8}$,1)	($z_{6}$,2) ($z_{2}$,2) ($z_{8}$,2)	($x_{5}$,3) ($x_{1}$,3) ($x_{7}$,3)	159
($y_{2}$,3) ($y_{5}$,3) ($y_{8}$,3) ($y_{3}$,1) ($y_{6}$,1)	($z_{6}$,1) ($z_{5}$,3) ($z_{2}$,1) ($z_{8}$,1) ($z_{4}$,3) ($z_{10}$,3)	($x_{5}$,4)	154
($y_{1}$,2) ($y_{4}$,2) ($y_{7}$,2) ($y_{3}$,1) ($y_{6}$,1)	($z_{6}$,2) ($z_{5}$,1) ($z_{2}$,2) ($z_{8}$,2) ($z_{4}$,1) ($z_{10}$,1)	($x_{5}$,3) ($x_{1}$,3) ($x_{7}$,3)	174
($y_{1}$,1) ($y_{4}$,1) ($y_{7}$,1)	($z_{5}$,1)	($x_{7}$,1)	46

Table 5
1-HTWUIs and their TWUs

Item	$y_{1}$	$y_{4}$	$y_{7}$	$y_{2}$	$y_{5}$	$y_{8}$	$y_{3}$	$y_{6}$	$z_{6}$	$z_{5}$	$z_{2}$	$z_{8}$	$z_{4}$	$z_{10}$	$x_{5}$	$x_{1}$	$x_{7}$
TWU	574	574	574	512	512	512	510	510	691	575	542	542	510	510	542	509	425

Figure 4.
Example RHUI-tree.

Given min_util $=$ 350, the 1-HTWUIs and their TWUs are listed in Table 5, where items within the same dimension are sorted in TWU descending order. The items with the lowest TWUs in dimensions C, P, and S are $x_{7}$ : 425, $y_{6}$ : 510, and $z_{10}$ : 510, respectively. Thus, in the RHUI-tree, items in dimension P are always listed before those in dimension S, and items in dimension S are always listed before those in dimension C. In the denormalized facts given in Table 4, the three dimensions are also listed in this order, and 1-LTWUIs are not considered. Thus, the TU of each transaction is also updated. The example RHUI-tree is shown in Fig. 4.
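The ordering rule can be checked against the values of Table 5 with a short sketch. The function name `global_order` is hypothetical, and keeping items with equal TWU in their listed order is an assumption, since the tie-breaking rule within a dimension is not stated.

```python
# TWUs of the 1-HTWUIs per dimension, copied from Table 5.
twu = {
    "P": {"y1": 574, "y4": 574, "y7": 574, "y2": 512,
          "y5": 512, "y8": 512, "y3": 510, "y6": 510},
    "S": {"z6": 691, "z5": 575, "z2": 542, "z8": 542, "z4": 510, "z10": 510},
    "C": {"x5": 542, "x1": 509, "x7": 425},
}

def global_order(twu):
    # Dimensions: descending lowest TWU, ties broken alphabetically.
    dims = sorted(twu, key=lambda d: (-min(twu[d].values()), d))
    order = []
    for d in dims:
        # Items inside a dimension: descending TWU (stable for ties).
        order.extend(sorted(twu[d], key=lambda i: -twu[d][i]))
    return dims, order

dims, order = global_order(twu)
print(dims)
print(order)
```

Dimensions P and S share the lowest TWU of 510, so the alphabetical tie-break places P first, which agrees with the order stated above.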
5.3 The RHUI-Growth algorithm

In this subsection, we describe the RHUI-Growth pseudo-code presented in Algorithm 3.

Initially, the RHUI-Growth algorithm constructs dimension trees for each dimension (Step 1). Step 2 builds the RHUI-tree based on all the dimension trees. Procedure RHTWUI_growth is then called to discover HTWUIs in Step 3. Finally, Step 4 calculates the actual utility of each HTWUI and outputs all the discovered HUIs. Considering the itemset and its conditional tree as parameters, procedure RHTWUI_growth is similar to the methods used in other tree-based algorithms [2, 18].
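The single-branch case of procedure RHTWUI_growth can be sketched as follows. The sketch assumes that only non-empty subsets of the branch are output and that the branch is given as a list of (item, TWU) pairs; the function name `single_branch_patterns` and the numeric values are hypothetical.

```python
from itertools import combinations

def single_branch_patterns(x, branch):
    """branch: list of (item, twu) from root to leaf; x: tuple of suffix items."""
    out = {}
    for r in range(1, len(branch) + 1):
        for subset in combinations(branch, r):
            items = tuple(i for i, _ in subset) + tuple(x)
            # The TWU of the pattern is the smallest TWU among the chosen nodes.
            out[items] = min(t for _, t in subset)
    return out

# Hypothetical conditional tree of x7 consisting of the single branch y1 -> z5.
patterns = single_branch_patterns(("x7",), [("y1", 220), ("z5", 46)])
print(patterns)
```

A branch with $b$ nodes yields $2^{b}-1$ patterns, so no further recursion is needed once the conditional tree has collapsed to one path.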

5.4 Complexity analysis

Similar to Subsection 4.4, let $\bm{S}=(D_{1}, D_{2}, \ldots, D_{n}, \textit{FT})$ be a DW modeled as a star schema. In $\bm{S}$, $D_{i}$ is a dimension table with $r_{i}$ rows and $c_{i}$ columns, and FT is the fact table with $m$ facts.

5.4.1 TWU calculation for each item

This operation is similar to the operation of calculating SI for each 1-HTWUI in the RHUI-Mine algorithm (see Subsection 4.4). The cost of this operation is also $O(\sum_{i=1}^{n}{(m\times n\times r_{i}+r_{i}\times c_{i})})$ .

5.4.2 Construction of dimensional trees

We again consider the operation in one dimension table $D_{i}$ as an example; the results for the other dimension tables are similar. The complexity of this operation can also be analyzed in two steps. First, each transaction in $D_{i}$ is sorted in TWU descending order. For one transaction in $D_{i}$, there are at most $c_{i}$ 1-HTWUIs. Using quick sort to arrange the items in this transaction, the cost is $O(c_{i}\times\log c_{i})$. As there are $r_{i}$ transactions in $D_{i}$, the cost of processing every transaction is $O(r_{i}\times c_{i}\times\log c_{i})$. Second, the transactions of $D_{i}$ are inserted into a dimensional tree $T_{i}$ one by one. In this step, the length of one branch is bounded by $c_{i}$. As there are $r_{i}$ transactions, the cost of this step is $O(c_{i}\times r_{i})$. Based on the above discussion, the construction of each dimensional tree requires $O(r_{i}\times c_{i}\times\log c_{i}+c_{i}\times r_{i})$ time.
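Since $\log c_{i}\geq 1$ whenever $c_{i}\geq 2$ (taking logarithms to base 2), the insertion term is dominated by the sorting term. Under this assumption, the bound for one dimensional tree simplifies as follows; the summed form over all $n$ dimensions is an extrapolation of the per-dimension bound rather than a result stated above:

$$O(r_{i}\times c_{i}\times\log c_{i}+c_{i}\times r_{i})=O(r_{i}\times c_{i}\times\log c_{i}),\qquad \sum_{i=1}^{n}O(r_{i}\times c_{i}\times\log c_{i})=O\left(\sum_{i=1}^{n}r_{i}\times c_{i}\times\log c_{i}\right).$$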

Algorithm 3	RHUI-Growth
Input	S: a DW with a star schema; min_util: minimum utility value
Output	MRHUIs
1	Construct a dimensional tree for each dimension table.
2	Construct RHUI-tree $T$ based on all dimensional trees.
3	RHTWUI_growth($\emptyset$, $T$);
4	Output all MRHUIs by calculating the utility of each HTWUI.
Procedure	RHTWUI_growth($X$, CT($X$))
1	if CT($X$) only contains a single branch $B$ then
2	  for each subset $Y$ of the set of items in $B$ do
3	    Output $Y\cup X$, with TWU($Y\cup X$) equal to the smallest TWU of the nodes in $Y$;
4	  end for
5	else for each $i\in$ CT($X$).header do
6	    RHTWUI_growth($X\cup\{i\}$, CT($X\cup\{i\}$));
7	  end for
8	end if

5.4.3 Construction of relational trees

When constructing a relational tree, the dimension tables are processed one by one. We take dimension table $D_{i}$ as an example, and let $T_{i}$ be the dimensional tree constructed from $D_{i}$ and $T_{i}$.tr be the transaction table of $T_{i}$. For this dimension table, the fact table is first scanned. When a transaction $T$ in $D_{i}$ is referenced by a fact in FT, it must be located in $T_{i}$.tr. As the transaction identifiers in $T_{i}$.tr are sorted in TWU descending order, this operation requires $O(\log r_{i})$ time in the worst case. The items of $T$ are then inserted into the relational tree one by one. In the worst-case scenario, all items contained in $T$ are 1-HTWUIs, requiring $O(\log r_{i}+c_{i})$ time. Because facts in FT are scanned one by one, $O(m\times n\times(\log r_{i}+c_{i}))$ time is required for one dimension table $D_{i}$. As there are $n$ dimension tables, the total time for constructing the relational tree is $O\left(m\times n\times\sum_{i=1}^{n}(\log r_{i}+c_{i})\right)$.

5.4.4 Generation of HTWUIs

Let $T$ be the relational tree. For an item $h$ in the HTWUI-header table T.header, the cost of generating all HTWUIs containing $h$ is $O\left(\sum_{X\ \text{is an HTWUI in}\ \textit{CT}(h)}(h\cup X)\right)$. Considering all elements in the header table, the operation of generating HTWUIs requires $O\left(\sum_{h\in\textit{T.header}}\sum_{X\ \text{is an HTWUI in}\ \textit{CT}(h)}(h\cup X)\right)$ time.

5.4.5 Discovering all HUIs

This operation is similar to MRHUI discovery in the RHUI-Mine algorithm (see Subsection 4.4). The cost of this HUI filtering is $O(m\times|\textit{SH}|)$ , where SH is the set of all HTWUIs.
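The filtering step can be sketched as follows, assuming that each fact has already been denormalized into a mapping from item to its utility in that fact. The function name `filter_huis` and the sample data are hypothetical.

```python
def filter_huis(htwuis, facts, min_util):
    """facts: list of {item: utility}; htwuis: iterable of item tuples."""
    huis = {}
    for x in htwuis:
        # One scan over the m facts per candidate gives O(m * |SH|) overall.
        u = sum(sum(f[i] for i in x) for f in facts if all(i in f for i in x))
        if u >= min_util:
            huis[x] = u
    return huis

facts = [{"a": 5, "b": 10}, {"a": 5, "c": 3}, {"a": 10, "b": 20}]
htwuis = [("a",), ("a", "b"), ("c",)]
huis = filter_huis(htwuis, facts, min_util=20)
print(huis)
```

Candidates whose actual utility falls below min_util, such as `("c",)` here, are discarded even though they passed the TWU test.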

Figure 5.

Foodmart star schema.

5.5 Comparison with related work

We compare the RHUI-Growth algorithm with the FP-tree-based algorithm MultiClose [26] for mining closed itemsets from relational data in a star schema.

First, MultiClose converts dimension tables to a vertical data format and uses a heuristic technique to compute the supports of itemsets. RHUI-Growth does not use a vertical data format, as we are concerned with utility values rather than itemset supports.

Second, FP-tree is used together with a two-level hash table in MultiClose. RHUI-Growth exploits two kinds of tree structures, a dimensional tree and a relational tree. For each dimension table, a transaction table is used rather than an item table, which differentiates the dimensional tree from FP-tree. For global relational tree building according to the dimensional trees of RHUI-Growth, the TWU is recorded in each node instead of the support.

Finally, MultiClose discovers the resulting itemsets using local mining and global mining. That is, closed itemsets of the dimension tables are first discovered, and then closed itemsets of different dimension tables are combined to closed itemsets across multiple dimension tables. In RHUI-Growth, no local mining operation is needed. Information from the dimension tables is first compressed into dimensional trees, and then a global relational tree is constructed according to the dimensional trees. MRHUIs are only mined in the final relational tree.

6. Performance evaluation

In this section, we evaluate the performance of our RHUI-Mine and RHUI-Growth algorithms. To the best of our knowledge, there are no other algorithms for mining HUIs from multi-relational data. Thus, we compare our algorithms with two state-of-the-art HUI mining algorithms, IHUP [2] and UP-Growth [22], for a single relation.

6.1 Experimental environment and datasets

The experiments were performed on a 2.40 GHz CPU with 2 GB memory running Windows 7. Our programs were written in C++.

Figure 6.

Execution time on 1000 customers from the Foodmart database.

The Foodmart database (Microsoft Corporation) was used to evaluate the performance of the algorithms. The database is from Microsoft SQL Server 2005. Foodmart contains the utility of each item (i.e., unit profit times quantity) in the original data. The star schema DW shown in Fig. 5, which consists of three dimension tables and one fact table, was used for our experiments. In this star schema, there are 10281, 1560, and 24 records in the Customer, Product, and Store dimension tables, respectively, and there are 164558 records in the fact table.

As this was the only database used for evaluation, we derived three datasets from it, containing the first 1000 customers, the first 6000 customers, and all 10281 customers, respectively. For example, to form the database containing the first 1000 customers, we retained all records relating to these customers and deleted all other information from the original Foodmart database.

Because IHUP and UP-Growth cannot run directly on Foodmart, we performed a natural join of the experimental data before running these two single-table algorithms. Thus, the runtime and memory usage of IHUP and UP-Growth discussed in Subsections 6.2 and 6.3 were recorded on the table that resulted from a natural join of all the relational tables. As the time and memory required to prepare the joined table were negligible, the total runtime and memory usage of IHUP and UP-Growth shown in Figs 6, 7, 9, and 10 are the results for the joined single table only.

6.2 Evaluation of the execution time

We first evaluate the execution time over the Foodmart database. Over the three partitioned databases, we compare the execution time of the four algorithms under various minimum utility thresholds. The lower the minimum utility threshold, the larger the number of MRHUIs, and thus the longer the runtime. In this set of experiments, we terminated the mining task once its runtime exceeded 10000 s. The comparison results are shown in Figs 6–8.

As shown in Fig. 6, for 1000 customers from the Foodmart database, RHUI-Growth is clearly faster than the other three algorithms for minimum utility thresholds between 0.05% and 0.30%. On average, RHUI-Growth is two orders of magnitude faster than IHUP, and an order of magnitude faster than UP-Growth and RHUI-Mine. We can also observe that both IHUP and UP-Growth outperform the RHUI-Mine algorithm at higher minimum utility thresholds; however, their execution times drastically exceed that of RHUI-Mine when the minimum utility threshold is below 0.25% and 0.15%, respectively. This is because the joined table of the database of 1000 customers is small, meaning that the processing cost of the single-table algorithms is acceptable when the minimum utility threshold is high.

Figure 7.

Execution time on 6000 customers from the Foodmart database.

As shown in Fig. 7, on the medium-sized database of 6000 customers, both RHUI-Mine and RHUI-Growth outperform the two single-table HUI mining algorithms. On average, RHUI-Growth is 6.37 times faster than RHUI-Mine. The runtime of IHUP and UP-Growth exceeded 10000 s for minimum utility thresholds of 0.1% and 0.05%, respectively.

Figure 8.

Execution time on all 10281 customers in the Foodmart database.

Figure 8 shows the execution time for the database of all 10281 customers. Neither IHUP nor UP-Growth could generate a result on the single joined table within 10000 s under any minimum utility threshold. Therefore, the performance of these algorithms is not plotted in Fig. 8. These results demonstrate that existing HUI mining algorithms for a single relation are very sensitive to the number of columns (items), and that the join-then-mine approach leads to high computational complexity. For the two proposed algorithms, RHUI-Growth is always faster than RHUI-Mine, especially when the minimum utility threshold is small. Compared with the execution time of RHUI-Mine, which increases significantly as the minimum utility threshold decreases, RHUI-Growth demonstrates a relatively steady execution time. RHUI-Growth is an order of magnitude faster than RHUI-Mine when the minimum utility threshold is lower than 0.05%. As with the HUI algorithms for a single relation [2, 18], the tree-based algorithm (RHUI-Growth) is consistently more efficient than the level-wise algorithm (RHUI-Mine) for multi-relational data in our experiments.

6.3 Evaluation of memory usage

We also compare the memory usage of the four algorithms for the three databases described in Subsection 6.1. The results are shown in Figs 9–11.

Figure 9.

Memory usage on 1000 customers from the Foodmart database.

Figure 9 shows the memory comparison results on the database of 1000 customers. RHUI-Mine consumes more memory than the two single-table algorithms. The high memory usage of RHUI-Mine is caused by the multiple database scans inherited from the level-wise candidate generation-and-test methodology. Another interesting result is that, although RHUI-Growth is superior to UP-Growth in terms of memory consumption when the minimum utility threshold is lower than 0.1%, UP-Growth still outperforms RHUI-Growth under certain circumstances. This is because the pruning strategies [22] used by UP-Growth work well on this small database.

Figure 10.

Memory usage on 6000 customers from the Foodmart database.

As shown in Fig. 10, on the database of 6000 customers, RHUI-Growth consumes less memory than the other three algorithms under all minimum utility thresholds. Although IHUP and UP-Growth outperform RHUI-Mine in terms of memory usage in most cases, not all of their memory usage results are plotted because their execution times exceeded 10000 s, as described in Subsection 6.2.

Figure 11.

Memory usage on all 10281 customers in the Foodmart database.

Figure 11 shows the memory usage for the database of all 10281 Foodmart customers. For the same reason discussed in Subsection 6.2, the memory usage of the two single-table algorithms is not plotted. On average, RHUI-Growth uses around 8.57 times less memory than RHUI-Mine. This indicates that the tree structure can represent useful information in a very compressed form, because transactions have many items in common. By using path overlapping (prefix sharing), tree structures can save a great deal of space.

It is also interesting to observe that the memory consumption of RHUI-Growth does not increase monotonically as the minimum utility threshold decreases. For example, on the database of all 10281 customers, the memory consumption of RHUI-Growth decreases and then increases when the minimum utility threshold is between 0.2% and 0.1%. This is because the set of 1-HTWUIs changes under different minimum utility thresholds, which leads to a different sorting order of 1-HTWUIs. Thus, the compression effects of the tree structures on the original databases are different.

7. Conclusions

In this paper, we examined the problem of mining data stored in decentralized tables and defined the problem of MRHUI mining over a star schema-based DW. We proposed two algorithms, RHUI-Mine and RHUI-Growth, for discovering HUIs from multi-relational data. The RHUI-Mine algorithm uses a candidate generation-and-test scheme, whereas the RHUI-Growth algorithm follows the pattern growth methodology. Both proposed algorithms mine the star schema directly instead of performing a join of the tables, and therefore require much less time to prepare the data in the pre-processing step. Experimental results show that the proposed algorithms are effective.

A star schema can be considered as the building block for a snowflake schema. Hence, our proposed technique can also be extended to the snowflake structure.

Acknowledgments

We thank the anonymous reviewers for their very useful comments and suggestions. This study was partly supported by the Beijing Natural Science Foundation (4162022), High Innovation Program of Beijing (2015000026833ZK04), Beijing Municipal Science and Technology Commission (D161100005216002), and Beijing Talents Project.

References

1. Agrawal and Srikant, Fast algorithms for mining association rules, in: The 20th International Conference on Very Large Data Bases, 1994, pp. 487–499.
2. Ahmed C.F., Tanbeer S.K., Jeong B.-S. and Lee Y.-K., Efficient tree structures for high utility pattern mining in incremental databases, IEEE Transactions on Knowledge and Data Engineering 21 (2009), 1708–1721.
3. Bina, Schulte, Crawford, Qian and Xiong, Simple decision forests for multi-relational classification, Decision Support Systems 54 (2013), 1269–1279.
4. Crestana-Jensen and Soparkar, Frequent itemset counting across multiple tables, in: The 4th Pacific-Asia Conference on Knowledge Discovery and Data Mining, 2000, pp. 49–61.
5. Erwin, Gopalan R.P. and Achuthan N.R., CTU-Mine: An efficient high utility itemset mining algorithm using the pattern growth approach, in: The 7th IEEE International Conference on Computer and Information Technology, 2007, pp. 71–76.
6. Fonseca N.A., Costa V.S. and Camacho, Conceptual clustering of multi-relational data, in: The 21st International Conference on Inductive Logic Programming, 2011, pp. 145–159.
7. Hamrouni, Key roles of closed sets and minimal generators in concise representations of frequent patterns, Intelligent Data Analysis 16 (2012), 581–631.
8. Han, Pei, Yin and Mao, Mining frequent patterns without candidate generation: a frequent-pattern tree approach, Data Mining and Knowledge Discovery 8 (2004), 53–87.
9. Knobbe A.J., Multi-Relational Data Mining, IOS Press, 2006.
10. Lan G.-C., Hong T.-P. and Tseng V.S., An efficient projection-based indexing approach for mining high utility itemsets, Knowledge and Information Systems 38 (2014), 85–107.
11. Lin M.-Y., T.-F. and Hsueh S.-C., High utility pattern mining using the maximal itemset property and lexicographic tree structures, Information Sciences 215 (2012), 1–14.
12. Liu, Liao W.-K. and Choudhary A.N., A two-phase algorithm for fast discovery of high utility itemsets, in: The 9th Pacific-Asia Conference on Advances in Knowledge Discovery and Data Mining, 2005, pp. 689–695.
13. Nagao and Seki, Towards parallel mining of closed patterns from multi-relational data, in: IEEE 8th International Workshop on Computational Intelligence and Applications, 2015, pp. 103–108.
14. E.K.K., A.W.-C. and Wang, Mining association rules from stars, in: The 2002 IEEE International Conference on Data Mining, 2002, pp. 322–329.
15. Shao, Yin, Liu and Cao, Actionable combined high utility itemset mining, in: The 29th AAAI Conference on Artificial Intelligence, 2015, pp. 4206–4207.
16. Silva and Antunes, Pattern mining on stars with FP-growth, in: The 7th International Conference on Modeling Decisions for Artificial Intelligence, 2010, pp. 1783–1814.
17. Silva and Antunes, Multi-relational pattern mining over data streams, Data Mining and Knowledge Discovery 29 (2015), 1783–1814.
18. Song, Liu and Li, Mining high utility itemsets by dynamically pruning the tree structure, Applied Intelligence 40 (2014), 29–43.
19. Song, Wang and Li, Binary partition for itemsets expansion in mining high utility itemsets, Intelligent Data Analysis 20 (2016), 915–931.
20. Song, Yang and Xu, Index-BitTableFI: An improved algorithm for mining frequent itemsets, Knowledge-Based Systems 21 (2008), 507–513.
21. Song, Zhang and Li, A high utility itemset mining algorithm based on subsume index, Knowledge and Information Systems 49 (2016), 315–340.
22. Tseng V.S., Shie B.-E., C.-W. and Yu P.S., Efficient algorithms for mining high utility itemsets from transactional databases, IEEE Transactions on Knowledge and Data Engineering 25 (2013), 1772–1786.
23. Vaisman and Zimányi, Data Warehouse Systems, Springer, 2014.
24. Coenen and Hong T.-P., Mining frequent itemsets using the N-list and subsume concepts, International Journal of Machine Learning and Cybernetics 7 (2016), 253–265.
25. Wang S.-L., Hong T.-P., Tsai Y.-C. and Kao H.-Y., Hiding sensitive association rules on stars, in: The 2010 IEEE International Conference on Granular Computing, 2010, pp. 505–508.
26. L.-J. and Xie K.-L., A novel algorithm for frequent itemset mining in data warehouses, Journal of Zhejiang University Science A 7 (2006), 216–224.
27. Yao, Hamilton H.J. and Butz C.J., A foundational approach to mining itemset utilities from databases, in: The 4th SIAM International Conference on Data Mining, 2004, pp. 482–486.
28. Zhang and Deng Z.-H., Mining summarization of high utility itemsets, Knowledge-Based Systems 84 (2015), 67–77.
