Abstract
Mining high utility itemsets is an interesting research problem in data mining and knowledge discovery. Most high utility itemset discovery algorithms seek patterns in a single table, but few are dedicated to processing data stored using a multi-dimensional model. In this paper, the problem of mining high utility itemsets in multi-relational databases is investigated, and two algorithms, RHUI-Mine and RHUI-Growth, are proposed for star schema-based data warehouses. In the RHUI-Mine algorithm, the search space is traversed in a level-wise manner, and an item index and transaction index are proposed to represent item and transaction information, respectively. The RHUI-Growth algorithm traverses the search space recursively using a pattern growth approach, and a dimensional tree and relational tree are used to compress the original data. Neither algorithm materializes the join operation between tables, thus making use of the star schema properties. Experiments show that both RHUI-Mine and RHUI-Growth are effective approaches for mining high utility itemsets in multi-relational data.
Keywords
Introduction
High utility itemset (HUI) mining [15, 18] is an important data mining problem that addresses the limitations of frequent itemset mining [7, 20] by introducing interestingness measures that reflect both statistical significance and user expectations. Various algorithms for efficiently mining HUIs in large databases have been presented [22, 28]. To the best of our knowledge, all existing HUI mining algorithms are restricted to cases where information is compiled in a single relation. However, information may be dispersed among different tables and, in some cases, the tables may reside in different physical locations. This imposes the need to explore more complex patterns and corresponding data mining techniques.
Multi-relational data mining (MRDM) [9] is concerned with the discovery of models and patterns from databases with multiple relational tables. MRDM is a popular data mining approach for multi-relational itemsets [4], multi-relational classification [3], and multi-relational clustering [6]. However, to the best of our knowledge, no algorithm for HUI discovery from relational data has yet been developed.
The straightforward approach to mining multiple tables is to join them all together before mining and apply an existing and efficient single-table algorithm. However, when multiple tables are joined, the resulting table has many more columns and rows, and this adversely affects the cost of mining algorithms.
The join-then-mine approach leads to high computational complexity. The number of columns in the joined table will be close to the sum of the number of columns in the individual tables. As the performance of HUI mining is very sensitive to the number of columns (items), mining the resulting table can take much longer than mining the individual tables. Moreover, in large applications, the join of all related tables cannot always be realistically computed because of the explosion in many-to-many relationships, large-dimension tables, and the distributed nature of the data.
In this paper, we propose two algorithms, RHUI-Mine and RHUI-Growth, for mining multi-relational HUIs in data stored following a star schema. First, we define the problem of multi-relational HUI mining. To improve the performance of utility mining, the proposed methods represent information using indices and trees. These representations can be used to generate multi-relational HUIs without joining the tables before mining. Both algorithms are composed of two stages: the first stage involves processing each dimension table, and the second stage combines all local information into global results across multiple tables. Experimental results show that our algorithms exhibit very interesting performance regarding efficiency and memory consumption.
The remainder of this paper is organized as follows: In Section 2, we define the problem of mining HUIs in multi-relational data. In Section 3, we analyze related work. In Sections 4 and 5, respectively, we describe the proposed RHUI-Mine and RHUI-Growth algorithms in detail. In Section 6, we present and analyze the experimental results. Finally, we present our conclusions in Section 7.
Problem definition
Star schema
A data warehouse (DW) is a database that is maintained separately from an organization’s operational database for the purpose of decision support. DWs provide integrated, enterprise-wide, historical data and focus on providing support for decision makers with respect to data modeling and analysis [23]. For data modeling, a DW is built using a multi-dimensional data model that consists of fact and dimension tables. Dimensions are various perspectives that are used to analyze data. Dimension tables are used to describe dimensions; they contain dimension transaction identifiers, values, and attributes. Fact tables contain identifiers for dimension tables in addition to measurable facts that data analysts may wish to examine. The simplest and most-used multi-dimensional model is the star schema, which consists of a fact table at the center of multiple dimension tables. As there have been several studies on mining frequent itemsets in this model [17, 25], we concentrate on the algorithms for discovering HUIs in a star schema.
Let
Let
Example of a star schema data warehouse.
Figure 1 illustrates a star schema with three dimensions (Customer, Product, and Store) and a fact table linking all the dimensions. The three dimension tables, Customer, Product, and Store, record information on customers, products, and stores, respectively. To ensure uniqueness, we refer to an attribute by concatenating it with the table name. For example, we denote attribute City of table Customer by Customer.City, which is different from attribute City in table Store (denoted by Store.City). The fact table stores information on all purchase activities. In addition to the three identifiers referencing the three dimension tables, the purchase number is recorded by the “count” measure in the fact table, and the time of purchase is also stored.
Conceptual representation of the example star schema.
The conceptual representation of Fig. 1 is shown in Fig. 2, where C, P, and S denote the dimension tables Customer, Product, and Store, respectively, and
If we organize the rows with the same customer and the same time into one transaction, the fact table can be conceptually represented by Table 1.
Conceptual representation of the fact table in the example data warehouse
We are interested in mining HUIs from the star schema structure. In particular, we examine the problem of determining all HUIs in table
The transaction utility (TU) of transaction
For the MRHUI mining problem, we need to determine those itemsets that contribute significantly to the total profit. We quantify significance using a metric called the minimum utility threshold
An itemset
Similar to HUI mining in a single table, the main challenge is that the itemset utility does not have the downward closure property. In this paper, we also use the property of transaction-weighted downward closure proposed by Liu et al. [12] and define the transaction-weighted utilization.
It has been proved [12] that any subset of an HTWUI is also an HTWUI. This property, also referred to as transaction-weighted downward closure, can be used to prune the supersets of LTWUIs and thus reduce the search space.
To explain the aforementioned concepts, we join all tables of the example DW shown in Fig. 2. The results are presented in Table 2 and the profit of the different items in the three dimension tables is given in Table 3.
For convenience, we write an itemset {
Results of joining all tables of the example DW
Results of joining all tables of the example DW
Profit table
In the following subsections, we discuss research on multi-relational itemset mining and HUI mining.
Multi-relational itemset mining
MRDM approaches look for patterns that involve multiple tables (relations) from a relational database [9]. In recent years, research has been conducted on multi-relational itemset mining from a star schema.
In [4], a level-wise candidate generation-and-test algorithm for mining a multi-relational itemset from a star schema was proposed. Similar to the Apriori algorithm [1], Crestana-Jensen and Soparkar’s algorithm generates frequent itemsets in separate tables and then merges the results. Although the joined table is computed instead of being materialized, this algorithm still suffers from the problem of candidate explosion as the number of dimensions, attributes, and values increases.
Ng et al. presented an approach for mining multi-relational itemsets by binding multiple dimension tables [14]. Specifically, this algorithm performs the operations of binding two dimensions and then mining frequent itemsets iteratively until all dimensions are processed without joining.
Unlike the aforementioned works, StarFP-Growth [16] is a multi-relational itemset mining algorithm over a star schema based on the pattern growth methodology [8]. For StarFP-Growth, a local dimension tree is first constructed for each dimension, and then a global tree is built by combining the dimension trees. Similar to FP-Growth [8], frequent itemsets across different dimensions are discovered from the global super tree. Silva and Antunes extended StarFP-Growth to the data stream environment [17].
Nagao and Seki proposed an algorithm for mining closed itemsets from multi-relational data [13]. Their algorithm performs parallel mining on multi-core processors. To realize load-balancing, node-wise parallelization and parallel-for based parallelization are used to optimize the task-parallelism in the problem of closed itemset discovery from multi-relational data.
High utility itemset mining
HUI mining, an extension of frequent itemset mining, is an area of active research in data mining. Many algorithms have been proposed for mining HUIs.
The basic concepts of HUI mining were outlined by Yao et al. [27]. Because their approach, called mining with expected utility (MEU), cannot use downward closure to reduce the number of candidate itemsets, they proposed a heuristic approach to predict whether an itemset should be added to the candidate set. However, the prediction typically gives an overestimate, especially in the initial stages. Moreover, the approach to examining candidates is impractical in terms of both processing cost and memory requirements if the number of items is large or the utility threshold is low.
The Two-Phase algorithm [12] was developed to determine HUIs using a downward closure property. This is employed in phase I to reduce the number of candidates. In phase II, only one extra database scan is required to identify the HUIs. Although this algorithm effectively reduces the search space and captures a complete set of HUIs, it still generates too many candidates and requires multiple database scans, especially when mining dense datasets and long patterns, much like the Apriori algorithm for mining frequent itemsets.
Several methods of generating candidates efficiently in phase I and avoiding multiple database scans have been proposed, including the projection-based approach [10], an approach based on maximal itemsets [11], and an approach based on a binary partition for itemset expansion [19]. Of these new approaches, tree-based algorithms, such as CTU-Tree [5], IHUP-Tree [2], and UP-Tree [22], have been shown to be very efficient for mining HUIs. These algorithms comprise three steps: (1) tree construction, (2) generation of candidate HUIs from the trees using the pattern-growth approach, and (3) identification of HUIs from the set of candidates.
Recently, we proposed a new HUI mining algorithm, named IHUI-Mine [21], based on the subsume index structure [20]. By recording information about the co-occurrence of itemsets, the subsume index is an efficient data structure for frequent itemset mining [24]. In IHUI-Mine, this structure was also found to be efficient for HUI mining.
Mining algorithm based on the item and transaction index
In this section, we propose the MRHUI mining (RHUI-Mine) algorithm. In this algorithm, we exploit the characteristics of tables in a star schema and do not perform any join operations. Furthermore, we exploit the following pruning strategy: if
Data structure
To represent the transaction data required for mining MRHUIs in a compact form, we design two data structures called the itemset index (SI) and transaction index (TI) for itemsets and transactions, respectively.
Consider itemset
We still consider
HTWUI generation in dimension tables
In our RHUI-Mine algorithm, HTWUIs and their SIs in each dimension table are discovered first; this method is described in Algorithm 1.
In Algorithm 1, high TWU single items are determined in Step 1. CHUI
With the support of transaction-weighted downward closure [12], the function candidate_gen generates the (
MRHUI discovery
In Algorithm 1, the first stage provides all the local HTWUIs required for the second stage. In the second stage, we use the proposed SI and TI structure to mine the data without joining the tables. The RHUI-Mine pseudo-code is described in Algorithm 2.
In Algorithm 2, the fact table FT is first scanned once to determine the TU of each transaction (Step 1). The loop of Steps 2–5 enumerates the dimension tables individually. For each dimension table
The aim of procedure RHTWUI_gen is to discover HTWUIs across all dimension tables by considering CHS and DHTWUI of one dimension table, and the results are added to CHS. This procedure follows a candidate generation-and-test approach that is similar to function candidate_gen in Algorithm 1, except for two aspects. First, before combining two itemsets from CHS and DHTWUI, Step 5 checks whether they co-occur in the fact table according to SI and TI. Second, after a new candidate
Complexity analysis
Let
Calculation of SI for each 1-HTWUI
This operation consists of two steps. First, for a transaction
HTWUI generation in dimension tables
We consider dimension table
MRHUI discovery
This operation can also be divided into two steps: HTWUI concatenation across multiple dimension tables and HUI identification. For the first step, the cost is
Comparison with related work
We compare the MRHUI algorithm with a typical multi-relational frequent itemset mining algorithm proposed in [4], which also traverses the search space level by level in DWs using the star schema.
Although the joined table is not materialized, it is computed by the algorithm in [4]. In contrast, our MRHUI algorithm avoids the computation of the joined table.
Furthermore, as pointed out by the authors of [4], their algorithm does not perform any pruning from one pass to the next. Thus, there could be many candidate itemsets, leading to high computational cost. On the contrary, by using an itemset index and transaction index, the MRHUI algorithm speeds up the HTWUI calculation and real utility verification from the index information, rather than by examining the entire database.
Mining algorithm based on a tree structure
In this section, we propose another MRHUI mining algorithm, RHUI-Growth, using the recursive pattern growth approach. Similar to the RHUI-Mine algorithm presented in Section 4, the RHUI-Growth algorithm consists of a dimension stage and a global stage; that is, after constructing a dimensional tree for each dimension, the corresponding tables are discarded. The trees already take into account the TWU and incorporate the transaction IDs. The relational tree that represents the entire star schema is then constructed by aggregating the dimension trees based on the fact table. Finally, this tree is mined using the known pattern growth method.
Dimensional tree structure
We first construct a dimensional HUI (DHUI)-tree for each dimension. Specifically, DHUI-tree
It consists of one root labeled “null”, denoted by T.root; a set of item-prefix subtrees as the children of the root, denoted by T.tree; and a transaction table, denoted by T.tr. The transaction table keeps track of the path corresponding to each tid and stores the last node of that path. This structure assists the global mining. If we wish to know which 1-HTWUI belongs to a tid, we follow the link in that table to determine the last node, and then climb through its parents until we reach the root node. The items of the nodes in the path we consider are the 1-HTWUIs of that transaction. Each node
The DHUI-tree is constructed using the pattern growth approach and can be completed with two scans of the database. During the first pass over the database, the algorithm calculates the TWU value of each item.
During the second scan of the database, transactions are inserted into the DHUI-tree. Initially, the tree is created with a root. When a transaction is retrieved, low TWU items are removed from the transaction, because only supersets of HTWUIs can potentially become HUIs. An update node operation is then performed if the current root node contains a child with the item to be inserted; otherwise, an insert node operation is performed until every HTWUI in the current transaction has been processed. Note that items within one dimension are ordered in TWU descending order.
For example, given min_util
DHUI-tree for dimension table C.
Note that items
To discover HUIs across multiple dimension tables, a relational HUI (RHUI)-tree
It consists of one root labeled “null”, denoted by T.root; a set of item-prefix subtrees as the children of the root, denoted by T.tree; and an HTWUI-header table, denoted by T.header. Each node
Let
The construction of an RHUI-tree is very similar to the construction of a DHUI-tree. Despite this, there are three differences:
First, it is not necessary for the first scan of any table to calculate the TWUs. They have already been calculated and stored in each dimension tree. Second, a fact is a set of tids; therefore, each fact must be denormalized before ordering the items or inserting them into the tree. A denormalized fact is an itemset with the items corresponding to its tids. Through the branch table of each DHUI-tree, we can obtain the path corresponding to the itemset of each tid. Furthermore, according to the anti-monotone property, if an itemset is not HTWUI, no other itemset containing it will be HTWUI. Thus, we only check the 1-HTWUIs in the transactions of each tid, which ensures that the final tree has only the 1-HTWUIs of each dimension. Table 4 presents the result of denormalizing each fact given min_util
Third, we order items in the RHUI-tree by descending order of the lowest TWU of all dimensions. Dimensions with the same lowest TWUs are ordered alphabetically and items of a dimension are ordered in TWU descending order. Note that with this ordering, items from multiple dimensions are not intermixed.
Denormalized facts
1-HTWUIs and their TWUs
Example RHUI-tree.
Given min_util
In this subsection, we describe the RHUI-Growth pseudo-code presented in Algorithm 3.
Initially, the RHUI-Growth algorithm constructs dimension trees for each dimension (Step 1). Step 2 builds the RHUI-tree based on all the dimension trees. Procedure RHTWUI_growth is then called to discover HTWUIs in Step 3. Finally, Step 4 calculates the actual utility of each HTWUI and outputs all the discovered HUIs. Considering the itemset and its conditional tree as parameters, procedure RHTWUI_growth is similar to the methods used in other tree-based algorithms [2, 18].
Complexity analysis
Similar to Subsection 4.4, let
TWU calculation for each item
This operation is similar to the operation of calculating SI for each 1-HTWUI in the RHUI-Mine algorithm (see Subsection 4.4). The cost of this operation is also
Construction of dimensional trees
We again consider the operation in one dimension table
Construction of relational trees
When constructing a relational tree, the dimension tables should be processed one by one. We take dimension table
Generation of HTWUIs
Let
Discovering all HUIs
This operation is similar to MRHUI discovery in the RHUI-Mine algorithm (see Subsection 4.4). The cost of this HUI filtering is
Foodmart star schema.
We compare the RHUI-Growth algorithm with the FP-tree-based algorithm MultiClose [26] for mining closed itemsets from relational data in a star schema.
First, MultiClose converts dimension tables to a vertical data format and uses a heuristic technique to compute the supports of itemsets. RHUI-Growth does not use a vertical data format, as we are concerned with utility values rather than itemset supports.
Second, FP-tree is used together with a two-level hash table in MultiClose. RHUI-Growth exploits two kinds of tree structures, a dimensional tree and a relational tree. For each dimension table, a transaction table is used rather than an item table, which differentiates the dimensional tree from FP-tree. For global relational tree building according to the dimensional trees of RHUI-Growth, the TWU is recorded in each node instead of the support.
Finally, MultiClose discovers the resulting itemsets using local mining and global mining. That is, closed itemsets of the dimension tables are first discovered, and then closed itemsets of different dimension tables are combined to closed itemsets across multiple dimension tables. In RHUI-Growth, no local mining operation is needed. Information from the dimension tables is first compressed into dimensional trees, and then a global relational tree is constructed according to the dimensional trees. MRHUIs are only mined in the final relational tree.
Performance evaluation
In this section, we evaluate the performance of our RHUI-Mine and RHUI-Growth algorithms. To the best of our knowledge, there are no other algorithms for mining HUIs from multi-relational data. Thus, we compare our algorithms with two state-of-the-art HUI mining algorithms, IHUP [2] and UP-Growth [22], for a single relation.
Experimental environment and datasets
The experiments were performed on a 2.40 GHz CPU with 2 GB memory running Windows 7. Our programs were written in C++.
Execution time on 1000 customers from the Foodmart database.
The Foodmart database (Microsoft Corporation) was used to evaluate the performance of the algorithms. The database is from Microsoft SQL Server 2005. Foodmart contains the utility of each item (i.e., unit profit times quantity) in the original data. The star schema DW shown in Fig. 5, which consists of three dimension tables and one fact table, was used for our experiments. In this star schema, there are 10281, 1560, and 24 records in the Customer, Product, and Store dimension tables, respectively, and there are 164558 records in the fact table.
As this was the only database for evaluation, we divided it into three parts containing the first 1000 customers, the first 6000 customers, and all 10281 customers. For example, to form the database containing the first 1000 customers, we retained all records relating to these customers and deleted all other information from the original Foodmart database.
Because IHUP and UP-Growth cannot run directly on Foodmart, we performed a natural join of the experimental data before running these two single-table algorithms. Thus, the runtime and memory usage of IHUP and UP-Growth discussed in Subsections 6.2 and 6.3 were recorded on the table that resulted from a natural join of all the relational tables. As the time and memory required to prepare the joined table were negligible, the total runtime and memory usage of IHUP and UP-Growth shown in Figs 6, 7, 9, and 10 are the results for the joined single table only.
We first evaluate the execution time over the Foodmart database. Over the three partitioned databases, we compare the execution time of the four algorithms under various minimum utility thresholds. The lower the minimum utility threshold, the larger the number of MRHUIs, and thus the longer the runtime. In this set of experiments, we terminated the mining task once its runtime exceeded 10000 s. The comparison results are shown in Figs 6–8.
As shown in Fig. 6, for 1000 customers from the Foodmart database, RHUI-Growth is obviously superior to the other three algorithms for minimum utility thresholds between 0.05% and 0.30%. On average, RHUI-Growth is two orders of magnitude faster than IHUP, and an order of magnitude faster than UP-Growth and RHUI-Mine. We can also observe that both IHUP and UP-Growth outperform the RHUI-Mine algorithm at higher minimum utility thresholds; however, their execution times drastically exceed that of RHUI-Mine when the minimum utility threshold is below 0.25% and 0.15%, respectively. This is because the joined table of the database of 1000 customers is small, meaning that the processing cost of the single-table algorithms is acceptable when the minimum utility threshold is high.
Execution time on 6000 customers from the Foodmart database.
As shown in Fig. 7, on the medium-sized database of 6000 customers, both RHUI-Mine and RHUI-Growth outperform the two single-table HUI mining algorithms. On average, RHUI-Growth is 6.37 times faster than RHUI-Mine. The runtime of IHUP and UP-Growth exceeded 10000 s for minimum utility thresholds of 0.1% and 0.05%, respectively.
Execution time on all 10281 customers in the Foodmart database.
Figure 8 shows the execution time for the database of all 10281 customers. Neither IHUP nor UP-Growth could generate a result on the single joined table within 10000 s under any minimum utility threshold. Therefore, the performance of these algorithms is not plotted in Fig. 8. These results demonstrate that existing HUI mining for a single relation is very sensitive to the number of columns (items) and the join-then-mine approach leads to high computational complexity. For the two proposed algorithms, RHUI-Growth is always faster than RHUI-Mine, especially when the minimum utility threshold is small. Compared with the execution time of RHUI-Mine, which significantly increases with decreasing minimum utility threshold, RHUI-Growth demonstrates a relatively steady execution time. RHUI-Growth is an order of magnitude faster than RHUI-Mine when the minimum threshold is lower than 0.05%. Similar to the HUI algorithms for a single relation [2, 18], it is true that the tree-based algorithm (RHUI-Growth) is always more efficient than the level-wise algorithm (RHUI-Mine) for multi-relational data.
We also compare the memory usage of the four algorithms for the three databases described in Subsection 6.1. The results are shown in Figs 9–11.
Memory usage on 1000 customers from the Foodmart database.
Figure 9 shows the memory comparison results on the database of 1000 customers. RHUI-Mine consumes more memory than the two single-table algorithms. The large amount of memory of RHUI-Mine is caused by multiple database scans inherited from the level-wise candidate generation-and-test methodology. Another interesting result is that, although RHUI-Growth is superior to UP-Growth in terms of memory consumption when the minimum utility threshold is lower than 0.1%, UP-Growth still outperforms RHUI-Growth under certain circumstances. This is because the pruning strategies [22] used by UP-Growth work well on this small database.
Memory usage on 6000 customers from the Foodmart database.
As shown in Fig. 10, on the database of 6000 customers, RHUI-Growth consumes less memory than the other three algorithms under all minimum utility thresholds. Although IHUP and UP-Growth outperform RHUI-Mine in terms of memory usage in most cases, not all of their memory usage results are plotted because their execution times exceeded 10000 s, as described in Subsection 6.2.
Memory usage on all 10281 customers in the Foodmart database.
Figure 11 shows the memory usage for the database of all 10281 Foodmart customers. For the same reason discussed in Subsection 6.2, the memory usage of the two single-table algorithms is not plotted. On average, RHUI-Growth uses around 8.57 times less memory than RHUI-Mine. This indicates that the tree structure can represent useful information in a very compressed form, because transactions have many items in common. By using path overlapping (prefix sharing), tree structures can save a great deal of space.
It is also interesting to observe that the memory consumption of RHUI-Growth does not increase with the decrease of the minimum utility threshold. For example, on the database of all 10128 customers, the memory consumption of RHUI-Growth decreases and then increases when the minimum utility threshold is between 0.2% and 0.1%. This is because the TWU of 1-HTWUIs changes under different minimum utility thresholds, which leads to a different sorting order of 1-HTWUIs. Thus, the compression effects of the tree structures on the original databases are different.
In this paper, we examined the problem of mining data stored in decentralized tables and defined the problem of MRHUI mining over a star schema-based DW. We proposed two algorithms, RHUI-Mine and RHUI-Growth, for discovering HUIs from multi-relational data. The RHUI-Mine algorithm uses a candidate generation-and-test scheme, whereas the RHUI-Growth algorithm follows the pattern growth methodology. Both proposed algorithms mine the star schema directly instead of performing a join of the tables, and therefore require much less time to prepare the data in the pre-processing step. Experimental results show that the proposed algorithms are effective.
A star schema can be considered as the building block for a snowflake schema. Hence, our proposed technique can also be extended to the snowflake structure.
Footnotes
Acknowledgments
We thank the anonymous reviewers for their very useful comments and suggestions. This study was partly supported by the Beijing Natural Science Foundation (4162022), High Innovation Program of Beijing (2015000026833ZK04), Beijing Municipal Science and Technology Commission (D161100005216002), and Beijing Talents Project.
