Mining spatial high-average utility co-location patterns from spatial data sets

Abstract

A spatial co-location pattern is a non-empty subset of spatial features whose instances are frequently located together in a spatial neighborhood. Traditional spatial co-location pattern mining is based mainly on the frequency of a pattern and does not distinguish the importance or value of the individual spatial features within it. Spatial high utility co-location pattern mining addresses this problem, but it does not consider the effect of pattern length on utility. In general, the utility of a pattern increases as its length increases, so the evaluation criterion of high utility co-location mining is unfair to short patterns. To solve this problem, this paper first considers the utility and the length of a co-location pattern together and proposes a more reasonable measure, the High-Average Utility Co-location Pattern (HAUCP). We then propose a basic algorithm, based on the extended average utility ratio of co-location patterns, to mine all HAUCPs; it overcomes the problem that the average utility ratio of patterns does not satisfy the downward closure property. Next, we develop an improved algorithm based on the local extended average utility ratio, which effectively reduces the search space of the basic algorithm and improves mining efficiency. Finally, the practicability and robustness of the proposed method are verified on real and synthetic data sets. Experimental results show that the proposed algorithms can find HAUCPs from spatial data sets effectively and efficiently.

Keywords

Spatial data mining, spatial co-location pattern, HAUCP (High-Average Utility Co-location Pattern), average utility ratio

1. Introduction

With the continuous development and spread of technologies such as the Internet and wireless communication, large amounts of data containing spatial location information are constantly being collected, and the demand for spatial data mining techniques is steadily increasing. As an active research direction in spatial data mining, spatial co-location pattern mining has received growing attention. A spatial co-location pattern is a collection of spatial features whose instances are frequently co-located in a spatial neighborhood. For example, West Nile virus often occurs in mosquito-infested areas where poultry is raised; hospitals, drug stores, and flower shops often appear side by side. In recent years, Huang [1, 2], Yoo [3, 4], Wang [5, 6], Tran [7, 8], and Yang [9, 10, 11] have proposed multiple algorithms for mining co-location patterns. In these algorithms, the participation index is used to measure the prevalence of a co-location pattern, and the importance or value of each feature is not distinguished. These traditional methods therefore tend to overlook patterns that occur infrequently but are nevertheless very useful. Spatial high utility co-location pattern mining largely compensates for the information missed by prevalent co-location pattern mining. However, it considers only the utility of the co-located features and ignores the impact of pattern length. Normally, the more features a pattern contains, the greater its utility, so long patterns are more likely to become high utility patterns. It is therefore unfair to measure patterns of different lengths against the same minimum utility ratio threshold, and doing so can easily lead to mining uninformative long patterns.
For example, in the cultivation of commercial crops, the profit of each individual crop may be very low, yet a combination of many crops still yields a high total utility. Such a combination qualifies as a high utility pattern only because of its length, which makes it a meaningless pattern for growers. Therefore, the High-Average Utility Co-location Pattern (HAUCP) is a more reasonable criterion.

The main contributions of this paper are as follows: (1) The concept of HAUCPs is proposed, and the average utility ratio is used as the interestingness measure. (2) The downward closure property of the extended average utility ratio is analyzed, and a basic algorithm for mining HAUCPs based on the extended average utility ratio is designed. (3) We analyze the performance of the basic algorithm and propose an improved algorithm based on the local extended average utility ratio to further improve efficiency. (4) We conduct extensive experiments on real and synthetic data sets to demonstrate the practicability, efficiency, and robustness of the proposed algorithms.

The rest of the paper is structured as follows: Section 2 introduces the related work. Section 3 gives the definitions related to HAUCPs. Section 4 presents the HAUCP mining algorithms. Section 5 reports the experimental evaluation. Section 6 summarizes the paper.

2. Related work

Spatial co-location pattern mining is an important research direction in spatial data mining. It was first proposed in [1], which used the minimum participation ratio (called the Participation Index, PI) as the interestingness measure of a spatial co-location pattern. Since the PI satisfies the downward closure property, this measure can be used to prune candidate patterns effectively. In recent years, much of the research on spatial co-location pattern mining has focused on two aspects: pruning candidate patterns and optimizing the computation of table instances.

After Yao et al. [12] introduced high utility into frequent itemset mining in transaction databases, high utility pattern mining received extensive attention, and many high utility pattern mining methods now exist for transaction databases. Liu et al. [13] introduced a weighted transaction concept with the downward closure property and proposed an effective pruning algorithm. Ahmed et al. [14] proposed the IHUP algorithm, which effectively avoids multiple scans of the database. The UP-Growth algorithm proposed in [15] uses a tree structure and performs better than IHUP. Yin et al. [16] combined utility with sequential patterns and proposed the US-pan algorithm to mine high utility sequential patterns. Yang et al. [17] introduced the concept of utility into spatial co-location pattern mining, defined the pattern utility and the pattern utility ratio (PUR), and proposed an extended pruning algorithm (EPA), but did not fully consider the differences between features and between different instances of the same feature. To address this problem, Wang et al. [18] further considered the utility of different instances of the same feature and proposed a new utility participation index (UPI) as the interestingness measure of high utility co-location patterns. However, neither [17] nor [18] takes into account the effect of pattern length on pattern utility.

High-average utility pattern mining is also called high-average utility itemset (HAUI) mining. It was first proposed in [19]. An itemset is called a high-average utility itemset if its utility divided by the number of its items is not less than the user-defined minimum average utility threshold. To improve the performance of mining HAUIs, many HAUI mining methods for transaction databases have been proposed [20, 21, 22, 23, 24, 25, 26, 27, 28]. Lin et al. [20] proposed a HAUP tree structure and a HAUP growth algorithm for mining high-average utility itemsets. The projection-based PAI algorithm [21] leverages an innovative pruning technique. The HAUI-tree approach [22] mines HAUIs using an indexing table structure, which significantly reduces the total number of candidates involved. A more efficient algorithm, HAUI-Miner [23], was proposed with two pruning strategies to mine HAUIs based on a compressed average-utility (AU) list structure. Lin et al. [24] then developed three pruning strategies to speed up the mining of HAUIs. They also presented an updating algorithm called FUP-HAUIMD [25] to maintain the discovered HAUIs under transaction deletion; it can update the discovered HAUIs without rescanning the database every time. Lin et al. [26] addressed the limitation of high utility sequential pattern mining from uncertain databases and presented a probabilistic high-average utility sequential pattern mining framework for discovering the set of probabilistic high-average utility sequential patterns from uncertain databases.
In the meantime, they also proposed an efficient framework called PRE-HAUIMI for transaction insertion in dynamic databases, which relies on average-utility-list (AUL) structures [27]. They further addressed the limitation of previous potential high-utility sequential pattern mining and presented a framework for discovering the set of potentially high average-utility sequential patterns (PHAUSPs) from an uncertain data set by considering the size of a sequence, which provides a fairer measure of patterns than previous work. Lin et al. [28] introduced a level-wise algorithm named High Average-Utility Itemset Mining with Multiple Minimum Average-Utility thresholds, which relies on a novel transaction-maximum utility downward closure (TMUDC) property and the concept of least minimum average-utility (LMAU) to mine high average-utility itemsets (HAUIs). Because of the autocorrelation of spatial data and the complexity of spatial data types and spatial relationships, extracting high-average utility co-location patterns (HAUCPs) from a spatial data set is more difficult than extracting the corresponding patterns from a transaction database. In this paper, we introduce the concept of high-average utility into spatial co-location pattern mining for the first time and study the related issues.

3. Related concepts and definitions

This section first introduces the basic concepts of traditional co-location pattern mining and a classic star neighbor materialization model for spatial data. Then, the concepts related to HAUCP mining are defined.

3.1 Traditional co-location pattern mining

In a spatial data set, different spatial features represent different kinds of things in space. The spatial feature set is the collection of these kinds of things, denoted as $F={\{}f_{1},f_{2},\ldots,f_{n}{\}}$, for example $F=$ {restaurant, flower shop, pharmacy, hospital} or $F=$ {school, cafeteria, supermarket}. An object at a specific position is called a spatial instance, and the set of all instances is called the instance set, recorded as $S=S_{1}\cup S_{2}\cup\ldots\cup S_{n}$, where $S_{i}(1\leqslant i\leqslant n)$ is the set of instances of the spatial feature $f_{i}$. To distinguish the instances of each feature, each instance is given a unique number. The information of a spatial instance usually consists of (feature type, instance ID, spatial location). As shown in Fig. 1, there are five spatial features A, B, C, D, and E. The spatial feature A has five instances A1, A2, A3, A4, and A5; B has four instances B1, B2, B3, and B4; C has five instances C1, C2, C3, C4, and C5; D has four instances D1, D2, D3, and D4; and E has three instances E1, E2, and E3. If the Euclidean distance between two spatial instances is less than or equal to a user-given distance threshold $d$, the two instances are said to be neighbors. For ease of description, spatial instances that satisfy the neighborhood relationship $R$ are connected by a solid line in Fig. 1, such as A5 and D3.

Figure 1.

An example of spatial features and instances.

A spatial co-location pattern $c$, recorded as $c={\{}f_{1},f_{2},\ldots,f_{k}{\}}$, is a subset of the spatial feature set $F$, and the number of features in $c$ is called the size of the pattern. For example, the size of the pattern {A, B, E} in Fig. 1 is 3. A spatial instance set $I={\{}i_{1},i_{2},\ldots,i_{k}{\}}$ is called a clique instance if any two instances in $I$ satisfy the spatial neighborhood relationship. If a clique instance $I$ contains all the features in $c$ and no proper subset of $I$ contains all the features of $c$, then $I$ is called a row instance of co-location pattern $c$, marked as row_instance ( $c$ ). The set of all row instances of $c$ is called the table instance of $c$, marked as table_instance ( $c$ ). As shown in Fig. 1, the table instance of the pattern {A, C, D, E} is {{A3, C3, D1, E2}}.
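The clique, row instance, and table instance definitions can be traced with a small brute-force sketch. It is illustrative only and is not the mining procedure used later in the paper. The neighbor pairs are transcribed from the star neighborhoods that the paper lists for the Fig. 1 data set; the helper names `is_clique` and `table_instance`, and the convention that an instance name is its feature letter followed by a number, are assumptions made for this example.

```python
from itertools import combinations, product

# Neighbor pairs of the Fig. 1 data set (transcribed from the paper's star neighborhoods).
PAIRS = (
    "A1-B2 A1-E3 A2-B2 A2-C2 A2-E3 A3-B3 A3-C3 A3-C4 A3-D1 A3-E2 "
    "A4-B4 A4-E1 A5-C5 A5-D3 B1-D2 B2-C2 B2-D2 B2-E3 B3-C3 B3-C4 "
    "B4-E1 C3-D1 C3-E2 D1-E2 D2-E3"
).split()
NEIGHBORS = {frozenset(p.split("-")) for p in PAIRS}
# Instances without any neighbor cannot appear in a row instance, so they are omitted.
INSTANCES = sorted({i for pair in NEIGHBORS for i in pair})


def is_clique(instances):
    """True if every two instances in the tuple are neighbors."""
    return all(frozenset(pair) in NEIGHBORS for pair in combinations(instances, 2))


def table_instance(pattern):
    """All row instances of `pattern`: one instance per feature, forming a clique."""
    pools = [[i for i in INSTANCES if i[0] == f] for f in pattern]
    return [row for row in product(*pools) if is_clique(row)]


print(table_instance(("A", "C", "D", "E")))  # [('A3', 'C3', 'D1', 'E2')]
print(len(table_instance(("A", "B"))))       # 4
```

Picking exactly one instance per feature enforces the row instance condition that no proper subset covers all features of the pattern.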

3.2 Star neighbor materialization model

The star neighbor materialization model for spatial data was proposed by Yoo et al. [4] and is the basis of the classic join-less mining algorithm. The following introduces the definitions of star neighborhoods and star instances.

Definition 1. Star Neighborhood: Given a spatial instance $i_{i}\in S$ whose feature type is $f_{i}\in F$, the star neighborhood of $i_{i}$ is the set of instances $SN={\{}i_{j}\in S|i_{j}=i_{i}\vee(f_{j}>f_{i}\wedge R(i_{j},i_{i})){\}}$, where $f_{j}$ is the feature type of $i_{j}$, “ $>$ ” denotes the lexicographic order of the feature names, and $R$ is the neighbor relationship.

Definition 2. Star Instance: Let $I={\{}i_{1},i_{2},\ldots,i_{k}{\}}\subseteq S$ be a set of spatial instances whose feature types are { $f_{1},f_{2},\ldots,f_{k}$ }. If all instances in $I$ are neighbors to the first instance $i_{1}$ , $I$ is called a star instance of co-location pattern $c={\{}f_{1},f_{2},\ldots,f_{k}{\}}$ .

The star neighborhoods of all instances in a spatial data set form the star neighborhood materialization of the data set. According to Definition 2, a star instance { $i_{1},i_{2},\ldots,i_{k}$ } of co-location pattern { $f_{1},f_{2},\ldots,f_{k}$ } can be collected from the star neighborhoods whose central feature type is $f_{1}$. All star neighborhoods in the spatial data set of Fig. 1 are listed in Table 1, where each star neighborhood is given a serial number $T_{q}$.

Table 1
Star neighborhoods of the spatial data set in Fig. 1

Serial number	Central instance	Neighbor instances
T ${}_{1}$	A1	B2, E3
T ${}_{2}$	A2	B2, C2, E3
T ${}_{3}$	A3	B3, C3, C4, D1, E2
T ${}_{4}$	A4	B4, E1
T ${}_{5}$	A5	C5, D3
T ${}_{6}$	B1	D2
T ${}_{7}$	B2	C2, D2, E3
T ${}_{8}$	B3	C3, C4
T ${}_{9}$	B4	E1
T ${}_{10}$	C3	D1, E2
T ${}_{11}$	D1	E2
T ${}_{12}$	D2	E3
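The rows of Table 1 can be reproduced from the list of neighbor pairs. The following is a minimal sketch under the same naming assumption as the table (feature letter plus number); the function name `star_neighborhoods` is hypothetical, and, as in Table 1, the central instance itself is left implicit and instances with no later-ordered neighbor are not listed.

```python
from collections import defaultdict

# Neighbor pairs of the Fig. 1 data set (the pairs that Table 1 encodes).
PAIRS = (
    "A1-B2 A1-E3 A2-B2 A2-C2 A2-E3 A3-B3 A3-C3 A3-C4 A3-D1 A3-E2 "
    "A4-B4 A4-E1 A5-C5 A5-D3 B1-D2 B2-C2 B2-D2 B2-E3 B3-C3 B3-C4 "
    "B4-E1 C3-D1 C3-E2 D1-E2 D2-E3"
).split()


def star_neighborhoods(pairs):
    """Map each central instance to its neighbors whose feature sorts later."""
    star = defaultdict(list)
    for pair in pairs:
        a, b = sorted(pair.split("-"))   # a carries the lexicographically smaller feature
        if a[0] != b[0]:                 # Definition 1 requires a strictly greater feature
            star[a].append(b)
    return {center: sorted(star[center]) for center in sorted(star)}


sn = star_neighborhoods(PAIRS)
print(len(sn))    # 12
print(sn["A3"])   # ['B3', 'C3', 'C4', 'D1', 'E2']
print(sn["B2"])   # ['C2', 'D2', 'E3']
```

Because each pair is stored only under the instance with the smaller feature, every neighbor relationship is materialized exactly once.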

Figure 2.

Table instances of some spatial co-location patterns of the spatial data set in Fig. 1.

3.3 High-Average Utility Co-location Pattern (HAUCP)

Definition 3. Internal Utility: The number of distinct instances of a spatial feature $f_{i}$ in the table instance of pattern $c$ is defined as the internal utility of $f_{i}$ in $c$, denoted as $q(f_{i},c)=|\pi_{f_{i}}(\textit{table{\_}instance}(c))|$, where $\pi$ is the projection operation. In particular, if $c={\{}f_{i}{\}}$, then $q(f_{i},c)$ is the number of all instances of feature $f_{i}$ in the data set.

Taking Fig. 1 as an example, Fig. 2 shows the table instance of co-location $c=$ {A, B, C}; the internal utility of A in $c$ is $q(\text{A},c)=|\pi_{\text{A}}(\textit{table{\_}instance}(c))|=2$ .

Definition 4. External Utility: Each feature in a data set (such as a restaurant) has its own degree of importance (such as the profit of the restaurant). We call the degree of importance of a feature $f_{i}$ its external utility, denoted as $v(f_{i})$ .

The external utilities corresponding to the features of the spatial data set in Fig. 1 are shown in Table 2.

Table 2
External utilities of five features in Fig. 1

Feature	A	B	C	D	E
External Utility $v(f_{i})$	5	4	3	1	2

Definition 5. Utility of Feature in a Co-location Pattern: For a size $k$ co-location pattern $c$, we define the product of the external utility of feature $f_{i}$ and its internal utility in $c$ as the utility of feature $f_{i}$ in $c$, recorded as:

$\displaystyle u(f_{i},c)=v({f_{i}})\times q({f_{i},c})$ (1)

For the data set in Fig. 1, the utility of feature A in co-location pattern {A, B} is: $u(\text{A},{\{}\text{A},\text{B}{\}})=v(\text{A})\times q(\text{A},{\{}\text{A},\text{B}{\}})=5\times 4=20$ .

Definition 6. Average Utility of a Co-location Pattern: For a size $k$ co-location pattern $c={\{}f_{1},f_{2},\ldots,f_{k}{\}}$, we define the ratio of the sum of the utilities of all features in $c$ to the length of $c$ as the average utility of $c$, recorded as:

$\displaystyle au(c)=\frac{\sum\limits_{f_{i}\in c}{u({f_{i},c})}}{|c|}=\frac{\sum\limits_{f_{i}\in c}{u({f_{i},c})}}{k}$ (2)

For the data set in Fig. 1, the average utility of co-location pattern $c={\{}\text{A},\text{B}{\}}$ is $au(c)=(u(\text{A},c)+u(\text{B},c))/2=(5\times 4+4\times 3)/2=16$ .

Definition 7. Total Utility of a Spatial Data Set: In a spatial data set $D$ , suppose the feature set is $F={\{}f_{1},f_{2},\ldots,f_{n}{\}}$ , then the total utility of $D$ is the sum of the utilities of all features in $D$ , which is recorded as:

$\displaystyle U(D)=\sum\limits_{f_{i}\in F}{v(f_{i})\times q(f_{i},\{f_{i}\})}$ (3)

For the data set $D$ in Fig. 1, the total utility of $D$ is $U(D)=v(\text{A})\times q(\text{A},{\{}\text{A}{\}})+v(\text{B})\times q(\text{B},{\{}\text{B}{\}})+v(\text{C})\times q(\text{C},{\{}\text{C}{\}})+v(\text{D})\times q(\text{D},{\{}\text{D}{\}})+v(\text{E})\times q(\text{E},{\{}\text{E}{\}})=5\times 5+4\times 4+3\times 5+1\times 4+2\times 3=66$ .

Definition 8. Average Utility Ratio of a Co-location Pattern: In a spatial data set $D$ with feature set $F={\{}f_{1},f_{2},\ldots,f_{n}{\}}$ and total utility $U(D)$, let $au(c)$ be the average utility of a co-location pattern $c={\{}f_{1},f_{2},\ldots,f_{k}{\}}$. We define the average utility ratio of $c$ as the ratio of the average utility of $c$ to the total utility of the data set, i.e.,

$\displaystyle\lambda(c)=\frac{au(c)}{U(D)}$ (4)
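Definitions 3 to 8 can be checked end to end on the running example. The sketch below is a brute-force illustration, not the paper's algorithm: it rebuilds table instances from the neighbor pairs of the Fig. 1 data set, takes the instance counts from Section 3.1 and the external utilities from Table 2, and evaluates Eqs (1) to (4). All function names are assumptions introduced here.

```python
from itertools import combinations, product

PAIRS = (
    "A1-B2 A1-E3 A2-B2 A2-C2 A2-E3 A3-B3 A3-C3 A3-C4 A3-D1 A3-E2 "
    "A4-B4 A4-E1 A5-C5 A5-D3 B1-D2 B2-C2 B2-D2 B2-E3 B3-C3 B3-C4 "
    "B4-E1 C3-D1 C3-E2 D1-E2 D2-E3"
).split()
NEIGHBORS = {frozenset(p.split("-")) for p in PAIRS}
COUNT = {"A": 5, "B": 4, "C": 5, "D": 4, "E": 3}   # instances per feature (Fig. 1)
V = {"A": 5, "B": 4, "C": 3, "D": 1, "E": 2}       # external utilities (Table 2)


def table_instance(pattern):
    """Row instances: one instance per feature, pairwise neighbors."""
    pools = [[f"{f}{n}" for n in range(1, COUNT[f] + 1)] for f in pattern]
    return [row for row in product(*pools)
            if all(frozenset(pr) in NEIGHBORS for pr in combinations(row, 2))]


def internal_utility(feature, pattern):
    """q(f, c): number of distinct instances of f in the table instance of c."""
    idx = pattern.index(feature)
    return len({row[idx] for row in table_instance(pattern)})


def average_utility(pattern):
    """au(c), Eq. (2): sum of v(f) * q(f, c) over the features, divided by |c|."""
    return sum(V[f] * internal_utility(f, pattern) for f in pattern) / len(pattern)


U_D = sum(V[f] * internal_utility(f, (f,)) for f in V)   # Eq. (3)


def average_utility_ratio(pattern):
    """lambda(c), Eq. (4)."""
    return average_utility(pattern) / U_D


print(U_D)                                    # 66
print(internal_utility("A", ("A", "B", "C")))  # 2
print(average_utility(("A", "B")))            # 16.0
print(average_utility(("B", "C")))            # 8.5
print(average_utility(("A", "B", "C")))       # 9.0
```

For a size 1 pattern the clique condition is vacuous, so `internal_utility` returns the full instance count, which matches the special case in Definition 3.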

Lemma 1. The average utility ratios of co-location patterns do not meet the downward closure property.

Proof. We prove it by a counter-example. For the spatial data set in Fig. 1, $au({\{}\text{A},\text{B}{\}})=16$, $au({\{}\text{B},\text{C}{\}})=8.5$, and $au({\{}\text{A},\text{B},\text{C}{\}})=9$, so $\lambda({\{}\text{A},\text{B}{\}})=16/66=8/33$, $\lambda({\{}\text{B},\text{C}{\}})=8.5/66$, and $\lambda({\{}\text{A},\text{B},\text{C}{\}})=9/66$. Here $\lambda({\{}\text{A},\text{B}{\}})>\lambda({\{}\text{A},\text{B},\text{C}{\}})$, but $\lambda({\{}\text{B},\text{C}{\}})<\lambda({\{}\text{A},\text{B},\text{C}{\}})$; that is, a pattern can have a larger average utility ratio than one of its subsets. So, the average utility ratios of co-location patterns do not meet the downward closure property.
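The two values that drive the counter-example can be derived step by step. Reading the neighbor pairs off Table 1, the table instance of {B, C} is {{B2, C2}, {B3, C3}, {B3, C4}} and that of {A, B, C} is {{A2, B2, C2}, {A3, B3, C3}, {A3, B3, C4}}, which gives the internal utilities used below (this derivation is a reader's check, not part of the original proof):

```latex
\begin{align*}
au(\{\text{B},\text{C}\})
  &= \frac{v(\text{B})\,q(\text{B},\{\text{B},\text{C}\})+v(\text{C})\,q(\text{C},\{\text{B},\text{C}\})}{2}
   = \frac{4\times 2+3\times 3}{2} = 8.5,\\
au(\{\text{A},\text{B},\text{C}\})
  &= \frac{5\times 2+4\times 2+3\times 3}{3} = 9,\\
\lambda(\{\text{B},\text{C}\})
  &= \frac{8.5}{66}\approx 0.129
   \;<\; \lambda(\{\text{A},\text{B},\text{C}\}) = \frac{9}{66}\approx 0.136 .
\end{align*}
```

Any threshold lying between these two ratios would therefore accept {A, B, C} while rejecting its subset {B, C}, which is exactly the situation that downward closure rules out.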

Lemma 2. For a co-location $c$ , $0\leqslant\lambda(c)\leqslant 0.5$ .

Proof. When no two instances of the features in $c$ satisfy the spatial neighbor relationship, no instance participates in the table instance of $c$; the utility of every feature in $c$ is then 0, so $\lambda(c)=0$, which is the minimum of $\lambda(c)$. Conversely, when all instances of every feature in $c$ participate in the table instance of $c$, $\lambda(c)$ attains its maximum, which is at most 1/ $k$ ( $k$ is the size of $c$ ). This bound is largest when $k=2$. In summary, $0\leqslant\lambda(c)\leqslant 0.5$ . (In spatial co-location pattern mining, the minimum size of a mined pattern is 2, because size 1 patterns are meaningless.)
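The upper bound can be written out explicitly. The chain below is a sketch that assumes all external utilities are non-negative, which the paper's examples satisfy but which is not stated as a formal condition; it uses $q(f_{i},c)\leqslant q(f_{i},\{f_{i}\})$, since at most all instances of a feature can participate.

```latex
\begin{align*}
au(c) &= \frac{1}{k}\sum_{f_{i}\in c} v(f_{i})\,q(f_{i},c)
        \leqslant \frac{1}{k}\sum_{f_{i}\in c} v(f_{i})\,q(f_{i},\{f_{i}\})
        \leqslant \frac{1}{k}\sum_{f_{i}\in F} v(f_{i})\,q(f_{i},\{f_{i}\})
        = \frac{U(D)}{k},\\
\lambda(c) &= \frac{au(c)}{U(D)} \leqslant \frac{1}{k} \leqslant \frac{1}{2}
        \qquad (k\geqslant 2).
\end{align*}
```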

Definition 9. High-Average Utility Co-location Pattern (HAUCP): In a spatial data set, if the average utility ratio $\lambda(c)$ of a co-location pattern $c$ is not less than a given average utility ratio threshold $\zeta$ , then the pattern $c$ is called a HAUCP in the spatial data set.

Based on the above definitions, HAUCP mining can be defined as follows: given a spatial data set $D$ with total utility $U(D)$ and an average utility ratio threshold $\zeta$, HAUCP mining discovers all co-location patterns whose average utility ratios are not less than $\zeta$ .

4. Mining algorithm

In traditional prevalent co-location pattern mining, the prevalence measure $PI$ satisfies the downward closure property, which reduces the number of candidates and improves the efficiency of the algorithm. However, Lemma 1 shows that the average utility ratio of a co-location pattern does not satisfy the downward closure property, so it cannot be used directly as a criterion for pruning low average utility patterns. In this section, we introduce the concept of the extended average utility and propose a basic algorithm for mining all HAUCPs based on the downward closure property of the extended average utility ratio. We then analyze the performance of the basic algorithm, tighten the extended average utility, and on this basis propose an improved algorithm for mining HAUCPs.

4.1 Basic algorithm

Table 3
Total utilities of star neighborhoods of the spatial data set in Fig. 1

Serial number	Central instance	Neighbor instances	Twu
T ${}_{1}$	A1	B2, E3	11
T ${}_{2}$	A2	B2, C2, E3	14
T ${}_{3}$	A3	B3, C3, C4, D1, E2	18
T ${}_{4}$	A4	B4, E1	11
T ${}_{5}$	A5	C5, D3	9
T ${}_{6}$	B1	D2	5
T ${}_{7}$	B2	C2, D2, E3	10
T ${}_{8}$	B3	C3, C4	10
T ${}_{9}$	B4	E1	6
T ${}_{10}$	C3	D1, E2	6
T ${}_{11}$	D1	E2	3
T ${}_{12}$	D2	E3	3

Definition 10. Total Utility of Star Neighborhood: The total utility of a star neighborhood $T_{q}$ is defined as the sum of the external utility of the corresponding feature of each instance in the star neighborhood, recorded as:

$\displaystyle\textit{twu}({T}_{q})=\sum\limits_{f_{i}\in T_{q}}{v(f_{i})}$ (5)

For example, in Table 3, the total utility of star neighborhood $T_{1}$ is $\textit{twu}(T_{1})=v(A)+v(B)+v(E)=5+4+2=11$ .

Definition 11. Extended Average Utility of Pattern: For any pattern $c={\{}f_{1},f_{2},\ldots,f_{k}{\}}$, its extended average utility, expressed as $\textit{utwu}(c)$, is defined as the sum of the total utilities of all star neighborhoods containing $c$, recorded as:

$\displaystyle\textit{utwu}(c)=\sum\limits_{c\subseteq T_{q}}{\textit{twu}(T_{q})}$ (6)

For example, in Table 3: $\textit{utwu}({\{}\text{A}{\}})=\textit{twu}(T_{1})+\textit{twu}(T_{2})+\textit{twu}(T_{3})+\textit{twu}(T_{4})+\textit{twu}(T_{5})=63$ , $\textit{utwu}({\{}\text{A},\text{B}{\}})=\textit{twu}(T_{1})+\textit{twu}(T_{2})+\textit{twu}(T_{3})+\textit{twu}(T_{4})=54$ .
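Equations (5) and (6) translate directly into code. The following is a minimal sketch over the star neighborhoods of Table 3 and the external utilities of Table 2; the representation of a star neighborhood as a list whose first element is the central instance, and the names `twu` and `utwu` as Python functions, are choices made for this illustration. A pattern is taken to be contained in a star neighborhood when every feature of the pattern occurs among its instances.

```python
V = {"A": 5, "B": 4, "C": 3, "D": 1, "E": 2}   # external utilities (Table 2)

# Star neighborhoods T1..T12, each written as [central instance, neighbors...].
STAR = [
    ["A1", "B2", "E3"], ["A2", "B2", "C2", "E3"],
    ["A3", "B3", "C3", "C4", "D1", "E2"], ["A4", "B4", "E1"],
    ["A5", "C5", "D3"], ["B1", "D2"], ["B2", "C2", "D2", "E3"],
    ["B3", "C3", "C4"], ["B4", "E1"], ["C3", "D1", "E2"],
    ["D1", "E2"], ["D2", "E3"],
]


def twu(t):
    """Eq. (5): sum of the external utilities of all instances in the star neighborhood."""
    return sum(V[inst[0]] for inst in t)


def utwu(pattern):
    """Eq. (6): sum of twu over the star neighborhoods that contain every feature of the pattern."""
    return sum(twu(t) for t in STAR if set(pattern) <= {inst[0] for inst in t})


print(twu(STAR[0]))       # 11
print(utwu({"A"}))        # 63
print(utwu({"A", "B"}))   # 54
```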

Definition 12. Extended Average Utility Ratio of Pattern: For any pattern $c={\{}f_{1},f_{2},\ldots,f_{k}{\}}$ , the extended average utility ratio of $c$ is defined as the ratio of extended average utility of $c$ to the total utility $U(D)$ of the spatial data set $D$ , which is recorded as:

$\displaystyle\mu(c)=\frac{\textit{utwu}(c)}{U(D)}$ (7)

If the $\mu(c)$ is greater than or equal to a given threshold $\zeta$ , the co-location pattern $c$ is called an over-estimated HAUCP.

Lemma 3. If $c$ is a subset of $c^{\prime}$ , then the extended average utility ratio of $c^{\prime}$ is no more than the extended average utility ratio of $c$ .

Proof. Because $c$ is a subset of $c^{\prime}$, every star neighborhood containing $c^{\prime}$ must also contain $c$. According to Definition 11, $\textit{utwu}(c^{\prime})=\Sigma_{c^{\prime}\subseteq T_{q}}\textit{twu}(T_{q})\leqslant\Sigma_{c\subseteq T_{q}}\textit{twu}(T_{q})=\textit{utwu}(c)$. Dividing both sides by $U(D)$ gives $\mu(c^{\prime})\leqslant\mu(c)$ .

For example, in Table 3 $\textit{utwu}({\{}\text{A}{\}})=$ 63, $\textit{utwu}({\{}\text{A},\text{B}{\}})=$ 54, $\textit{utwu}({\{}\text{A},\text{B}{\}})<\textit{utwu}({\{}\text{A}{\}})$ .

Lemma 4. If a pattern is not an over-estimated HAUCP, then this pattern must not be a HAUCP.

Proof. Let $c$ be any co-location pattern of the spatial data set $D$. According to Definition 9, $c$ is a HAUCP if $\lambda(c)\geqslant\zeta$. In that case, $\zeta\times U(D)\leqslant au(c)=\Sigma_{f_{i}\in c}u(f_{i},c)/|c|=\Sigma_{f_{i}\in c}v(f_{i})\times q(f_{i},c)/|c|\leqslant\Sigma_{c\subseteq T_{q}}\textit{twu}(T_{q})=\textit{utwu}(c)$, so $\mu(c)\geqslant\zeta$ and $c$ is an over-estimated HAUCP. By contraposition, a pattern that is not an over-estimated HAUCP cannot be a HAUCP.

For example, in Table 3, $\textit{utwu}({\{}\text{A},\text{B}{\}})=$ 54, $au({\{}\text{A},\text{B}{\}})=(5\times 4+4\times 3)/2=$ 16, $\textit{utwu}({\{}\text{A},\text{B}{\}})>au({\{}\text{A},\text{B}{\}})$ . If {A, B} is not an over-estimated HAUCP, then {A, B} must not be a HAUCP.

According to Lemmas 3 and 4, patterns that cannot become HAUCPs can be removed in advance on the basis of the extended average utility of co-location patterns. Algorithm 1 is our basic algorithm for mining all HAUCPs.

Algorithm 1: Basic algorithm
Input:
$F={\{}f_{1},\ldots,f_{n}{\}}$ : a set of spatial feature types
$S$ : a set of spatial instances
$V$ : a two-dimensional table for external utility of each $f_{i}\in F$
$R$ : a spatial neighbor relationship
$\zeta$ : an average utility ratio threshold
Output:
A set of HAUCPs.
Variables:
$SN={\{}SN(f_{1}),\ldots,{SN}(f_{n}){\}}$ : a set of star neighbor instances of features $f_{i}$
$U(D)$ : the total utility of the spatial data set
Twu: the total utility of the star neighbors
$k$ : co-location size
$C_{k}$ : a set of size $k$ candidates
$HC_{k}$ : a set of size $k$ over-estimated HAUCPs
$SI_{k}$ : a set of star instances of size $k$ candidates
$CI_{k}$ : a set of clique instances of size $k$ candidates
$UA_{k}$ : a set of size $k$ HAUCPs
Method:
1) $SN=$ gen_star_neighborhoods ( $F$ , $S$ , $R$ )
2) $U(D)=$ get_utility ( $F$ , $S$ , $V$ )
3) $\textit{Twu}=\textit{gen{\_}Twu}({SN},V)$
4) ${UA}_{1}=F$ ; ${HC}_{1}=F$ ; $k=$ 2
5) while (not empty ${HC}_{k-1}$ ) do
6)     $C_{k}=$ gen_candidate ( ${HC}_{k-1},k$ )
7)     for each $c\in C_{k}$ do
8)         $\textit{utwu}(c)=\textit{getTwu}(c)$ ; $\mu(c)=\textit{utwu}(c)/U(D)$
9)         if $\mu(c)\geqslant\zeta$ then put $c$ into $HC_{k}$
10)    for $i$ in 1 to $n$ do
11)        for $t\in{SN}_{f_{i}}$ where $f_{i}={cf}_{1}$ , ${cf}_{1}$ is the first feature of ${HC}_{k}({cf}_{1},\ldots,{cf}_{k})$
12)            ${SI}_{k}=$ filter_star_instances ( ${HC}_{k},t$ )
13)    end do
14)    if $k=$ 2 then ${CI}_{k}={SI}_{k}$
15)    else ${CI}_{k}=$ filter_clique_instances ( ${HC}_{k},{SI}_{k}$ )
16)    ${UA}_{k}=$ select_Hau_co-location ( ${HC}_{k},{CI}_{k},{\zeta}$ )
17)    $k++$
18) end do
19) return $\cup({UA}_{2},\ldots,{UA}_{k})$
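The control flow of Algorithm 1 can be sketched compactly in Python on the Fig. 1 example. This is an illustrative reading of the pseudocode, not the authors' implementation: the function names, the tuple representation of patterns, and the threshold value 0.2 are assumptions, and the clique check tests neighbor pairs directly instead of looking up sub-instances in the table instances of smaller patterns.

```python
from itertools import combinations, product

PAIRS = (
    "A1-B2 A1-E3 A2-B2 A2-C2 A2-E3 A3-B3 A3-C3 A3-C4 A3-D1 A3-E2 "
    "A4-B4 A4-E1 A5-C5 A5-D3 B1-D2 B2-C2 B2-D2 B2-E3 B3-C3 B3-C4 "
    "B4-E1 C3-D1 C3-E2 D1-E2 D2-E3"
).split()
NEIGHBORS = {frozenset(p.split("-")) for p in PAIRS}
COUNT = {"A": 5, "B": 4, "C": 5, "D": 4, "E": 3}   # instances per feature (Fig. 1)
V = {"A": 5, "B": 4, "C": 3, "D": 1, "E": 2}       # external utilities (Table 2)
U_D = sum(V[f] * COUNT[f] for f in V)              # step 2, Eq. (3): 66

# Step 1: star neighborhoods stored as [center, neighbors with a later feature...].
STAR = {}
for p in PAIRS:
    a, b = sorted(p.split("-"))
    STAR.setdefault(a, [a]).append(b)


def utwu(pattern):
    """Steps 3 and 8: extended average utility, Eqs (5) and (6)."""
    return sum(sum(V[i[0]] for i in t) for t in STAR.values()
               if set(pattern) <= {i[0] for i in t})


def table_instance(pattern):
    """Steps 10-15: collect star instances, then keep those that are cliques."""
    rows = []
    for center, t in STAR.items():
        if center[0] != pattern[0]:
            continue
        pools = [[i for i in t[1:] if i[0] == f] for f in pattern[1:]]
        for rest in product(*pools):
            if all(frozenset(pr) in NEIGHBORS for pr in combinations(rest, 2)):
                rows.append((center, *rest))
    return rows


def mine_haucps(zeta):
    hc, found, k = [(f,) for f in sorted(V)], {}, 2   # step 4
    while hc:                                         # step 5
        prev = set(hc)
        cands = sorted({tuple(sorted(set(a) | set(b))) for a in hc for b in hc
                        if len(set(a) | set(b)) == k})            # step 6
        hc = [c for c in cands
              if all(s in prev for s in combinations(c, k - 1))   # subset filter
              and utwu(c) / U_D >= zeta]                          # steps 7-9
        for c in hc:                                              # step 16
            rows = table_instance(c)
            au = sum(V[f] * len({r[j] for r in rows}) for j, f in enumerate(c)) / k
            if au / U_D >= zeta:
                found[c] = au
        k += 1
    return found


print(mine_haucps(0.2))   # {('A', 'B'): 16.0, ('A', 'C'): 13.5}
```

With the assumed threshold 0.2, only {A, B} and {A, C} reach an average utility of at least 0.2 × 66 = 13.2; {A, E}, with an average utility of 13, falls just short.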

The basic algorithm is explained as follows:

Steps 1–4 (initialization): Given an input spatial data set $D$ and a neighbor distance threshold, all neighboring instance pairs are found with a geometric method and grouped into star neighborhoods. The total utility $U(D)$ of the data set and the total utility of each star neighborhood are calculated according to Eqs (3) and (5), and the size 1 patterns are treated as over-estimated HAUCPs.

Steps 5–9 generate the size $k$ ( $k>1$ ) candidate patterns from the size $k-1$ over-estimated HAUCPs, and then select the size $k$ over-estimated HAUCPs from those candidates. Here, attribute-level filtering is applied to the co-location patterns: if any subset of a candidate is not over-estimated, the candidate is pruned.

Steps 10–13 collect the star instances of the over-estimated HAUCPs from the star neighborhoods. The feature type of the central instance of a star neighborhood must be the same as the first feature of the over-estimated HAUCP. For example, instances of the over-estimated pattern {B, C} are collected from the star neighborhoods of feature B.

Steps 14–15: Because the spatial neighbor relationship is symmetric, a size 2 star instance is already a clique instance, so Step 14 is applied directly. For size 3 and above, Step 15 checks whether each star instance is a clique instance. For example, to test whether the star instance {A2, B2, C2} of the over-estimated pattern {A, B, C} is a clique, we check whether the sub-instance {B2, C2}, obtained by removing A2, is in the table instance of pattern {B, C}.

Step 16 extracts the size $k$ HAUCPs, i.e., the over-estimated patterns whose average utility ratio is not less than the threshold $\zeta$ .

The time complexity of Algorithm 1 depends on the number of candidate HAUCPs, which in turn depends on the distance threshold, the number of features, the number of instances, and the average utility ratio threshold. For each candidate HAUCP that has not been pruned, the star neighborhoods $SN$ must be scanned to determine the table instance of the pattern, and the average utility of the pattern is then calculated from the table instance to decide whether it is a HAUCP. Let $T_{BA}$ be the time cost of the algorithm, let $T_{BA}(2)$ denote the time cost of finding the size 2 HAUCPs, and let $T_{BA}(k)$ denote the time cost of finding the size $k$ ( $k>2$ ) HAUCPs. Then $T_{BA}(k)=T_{\textit{gen\_candidate}}({HC}_{k-1})+|C_{k}|\times T_{\textit{getTwu}}(c)+T_{\textit{filter\_clique\_instances}}({HC}_{k})$, where $T_{\textit{gen\_candidate}}({HC}_{k-1})$ is the cost of generating the size $k$ candidate HAUCPs from the size $k-1$ over-estimated patterns ${HC}_{k-1}$, $T_{\textit{getTwu}}(c)$ is the cost of calculating the extended average utility of a size $k$ candidate $c$, and $T_{\textit{filter\_clique\_instances}}({HC}_{k})$ is the cost of filtering the size $k$ HAUCPs from the size $k$ over-estimated patterns ${HC}_{k}$, which is also the most time-consuming part of the algorithm. Assuming that the average number of instances of each feature is $s$, $T_{\textit{filter\_clique\_instances}}$ is about $O(|HC_{k}|\times k\times s)$. Evidently, the number of candidate patterns is the key factor limiting the efficiency of the algorithm.

4.2 Improved algorithm

Because the extended average utility used in the basic algorithm is a loose upper bound, its pruning effect is limited. To prune candidate patterns more effectively, in this section we tighten the extended average utility of a pattern so that it is closer to the actual average utility of the pattern.

Table 4
Maximum utilities of star neighborhoods of the spatial data set in Fig. 1

Serial number	Central instance	Neighbor instances	umtwu
T ${}_{1}$	A1	B2, E3	5
T ${}_{2}$	A2	B2, C2, E3	5
T ${}_{3}$	A3	B3, C3, C4, D1, E2	6
T ${}_{4}$	A4	B4, E1	5
T ${}_{5}$	A5	C5, D3	5
T ${}_{6}$	B1	D2	4
T ${}_{7}$	B2	C2, D2, E3	4
T ${}_{8}$	B3	C3, C4	6
T ${}_{9}$	B4	E1	4
T ${}_{10}$	C3	D1, E2	3
T ${}_{11}$	D1	E2	2
T ${}_{12}$	D2	E3	2

Definition 13. Utility of Feature in Star Neighborhood: We define the product of the external utility of feature $f_{i}$ and the number of distinct instances of $f_{i}$ appearing in a star neighborhood $T_{q}$, written $q(f_{i},T_{q})$, as the utility of $f_{i}$ in $T_{q}$, denoted as:

$\displaystyle u(f_{i},T_{q})=v(f_{i})\times q(f_{i},T_{q})$ (8)

Definition 14. Maximum Utility of Star Neighborhood: For a star neighborhood $T_{q}$ , $\textit{umtwu}(T_{q})$ represents the maximum utility in $T_{q}$ , which is recorded as:

$\displaystyle\textit{umtwu}(T_{q})=\max\{u(f_{i},T_{q}),f_{i}\in T_{q}\}$ (9)

For example, in the star neighborhood $T_{1}$ of Table 4, $\textit{umtwu}(T_{1})=\max{\{}u(\text{A},T_{1}),u(\text{B},T_{1}),u(\text{E},T_{1}){\}}=\max{\{}5,4,2{\}}=5$ .

Definition 15. Local Extended Average Utility of Pattern: The local extended average utility of a pattern $c={\{}f_{1},f_{2},\ldots,f_{k}{\}}$, expressed as $\textit{autwu}(c)$, is defined as the sum of the maximum utilities of all star neighborhoods containing $c$, recorded as:

$\displaystyle\textit{autwu}(c)=\sum\limits_{c\subseteq{T}_{q}}{\textit{umtwu}(T_{q})}$ (10)

For example, in Table 4: $\textit{autwu}({\{}\text{A}{\}})=\textit{umtwu}(T_{1})+\textit{umtwu}(T_{2})+\textit{umtwu}(T_{3})+\textit{umtwu}(T_{4})+\textit{umtwu}(T_{5})=5+5+6+5+5=26$ , $\textit{autwu}({\{}\text{A},\text{B},\text{C}{\}})=\textit{umtwu}(T_{2})+\textit{umtwu}(T_{3})=5+6=11$ .
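Equations (8) to (10) can be evaluated with a few lines of Python. The sketch below reuses the star neighborhoods of Table 4 and the external utilities of Table 2; the list representation and the function names are assumptions for illustration. It reproduces the last column of Table 4 and the two sums above.

```python
from collections import Counter

V = {"A": 5, "B": 4, "C": 3, "D": 1, "E": 2}   # external utilities (Table 2)

# Star neighborhoods T1..T12, each written as [central instance, neighbors...].
STAR = [
    ["A1", "B2", "E3"], ["A2", "B2", "C2", "E3"],
    ["A3", "B3", "C3", "C4", "D1", "E2"], ["A4", "B4", "E1"],
    ["A5", "C5", "D3"], ["B1", "D2"], ["B2", "C2", "D2", "E3"],
    ["B3", "C3", "C4"], ["B4", "E1"], ["C3", "D1", "E2"],
    ["D1", "E2"], ["D2", "E3"],
]


def umtwu(t):
    """Eq. (9): largest per-feature utility v(f) * q(f, T_q) inside one star neighborhood."""
    counts = Counter(inst[0] for inst in t)    # q(f, T_q) for each feature present
    return max(V[f] * n for f, n in counts.items())


def autwu(pattern):
    """Eq. (10): sum of umtwu over the star neighborhoods containing every feature of the pattern."""
    return sum(umtwu(t) for t in STAR if set(pattern) <= {inst[0] for inst in t})


print([umtwu(t) for t in STAR])     # [5, 5, 6, 5, 5, 4, 4, 6, 4, 3, 2, 2]
print(autwu({"A"}))                 # 26
print(autwu({"A", "B"}))            # 21
print(autwu({"A", "B", "C"}))       # 11
```

In $T_{3}$ and $T_{8}$ the maximum is 6 rather than the utility of the central feature, because feature C contributes two instances (C3 and C4) at external utility 3 each.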

Definition 16. Local Extended Average Utility Ratio of Pattern: For a pattern $c=\{f_{1},f_{2},\ldots,f_{k}\}$, the ratio of its local extended average utility to $U(D)$ is called the local extended average utility ratio of $c$, denoted $w(c)$, that is:

$\displaystyle w(c)=\frac{\textit{autwu}(c)}{U(D)}$ (11)

If $w(c)$ is greater than or equal to a given threshold $\zeta$ , the pattern $c$ is called an approximate HAUCP.

Lemma 5. If a pattern $c$ is a subset of pattern $c^{\prime}$ , then the local extended average utility ratio of $c^{\prime}$ is no more than the local extended average utility ratio of $c$ .

Proof. Because $c$ is a subset of $c^{\prime}$, every star neighborhood containing $c^{\prime}$ must also contain $c$. According to Definition 15, we get $\textit{autwu}(c^{\prime})=\sum_{c^{\prime}\subseteq T_{q}}\textit{umtwu}(T_{q})\leqslant\sum_{c\subseteq T_{q}}\textit{umtwu}(T_{q})=\textit{autwu}(c)$. Dividing both sides by $U(D)$ gives $w(c^{\prime})\leqslant w(c)$.

For example, in Table 4, $\textit{autwu}(\{\text{A}\})=26$, $\textit{autwu}(\{\text{A},\text{B}\})=21$ and $\textit{autwu}(\{\text{A},\text{B},\text{C}\})=11$, so $\textit{autwu}(\{\text{A},\text{B},\text{C}\})\leqslant\textit{autwu}(\{\text{A},\text{B}\})\leqslant\textit{autwu}(\{\text{A}\})$.

Lemma 6. If a pattern is not an approximate HAUCP, then this pattern must not be a HAUCP.

Proof. Let $c$ be any pattern of the spatial data set $D$. According to Definition 9, $c$ is a HAUCP if ${au}(c)/U(D)\geqslant\zeta$. In that case, $\zeta\times U(D)\leqslant{au}(c)=\sum_{f_{i}\in c}{au}(f_{i},c)/|c|=\sum_{f_{i}\in c}v(f_{i})\times q(f_{i},c)/|c|\leqslant\sum_{c\subseteq T_{q}}\max\{u(f_{i},T_{q}),f_{i}\in T_{q}\}=\sum_{c\subseteq T_{q}}\textit{umtwu}(T_{q})=\textit{autwu}(c)$, so $w(c)=\textit{autwu}(c)/U(D)\geqslant\zeta$ and $c$ is also an approximate HAUCP. By contraposition, a pattern that is not an approximate HAUCP cannot be a HAUCP.

For example, in Table 4, $\textit{autwu}(\{\text{A},\text{B}\})=21$ and ${au}(\{\text{A},\text{B}\})=(5\times 4+4\times 3)/2=16$, so $\textit{autwu}(\{\text{A},\text{B}\})>{au}(\{\text{A},\text{B}\})$. If {A, B} is not an approximate HAUCP, then {A, B} cannot be a HAUCP.

Based on Lemma 5 and Lemma 6, an improved algorithm is designed as follows.

Algorithm 2: Improved Algorithm
Input:
$F={\{}f_{1},\ldots,f_{n}{\}}$ : a set of spatial feature types
$S$ : a set of spatial instances
$V$ : a two-dimensional table for external utility of each $f_{i}\in F$
$R$ : a spatial neighbor relationship
$\zeta$ : an average utility ratio threshold
Output:
A set of HAUCPs.
Variables:
${SN}={\{}{SN}(f_{1}),\ldots,{SN}(f_{n}){\}}$ : a set of star neighbor instances of features $f_{i}$
$U(D)$ : the total utility of the spatial data set
Twu: the total utility of star neighbors
$k$ : co-location size
$C_{k}$ : a set of size $k$ candidate
${AC}_{k}$ : a set of size $k$ approximate HAUCPs
${SI}_{k}$ : a set of star instances of size $k$ candidates
${CI}_{k}$ : a set of clique instances of size $k$ candidates
${UA}_{k}$ : a set of size $k$ HAUCPs
Method:
1) $SN=$ gen_star_neighborhoods ( $F$ , $S$ , $R$ )
2) $U(D)=$ get_utility ( $F$ , $S$ , $V$ )
3) $\textit{Twu}=\textit{gen{\_}Twu}({SN},V)$
4) ${UA}_{1}=F$ ; ${AC}_{1}=F$ ; $k=2$
5) while (not empty ${AC}_{k-1}$ ) do
6) $C_{k}=$ gen_candidate ( ${AC}_{k-1}$ , $k$ )
7) for each $c\in C_{k}$ do
8) $\textit{autwu}(c)=\textit{getAuTwu}(c)$ ; $w(c)=\textit{autwu}(c)/U(D)$
9) if $w(c)\geqslant\zeta$ then put $c$ into $AC_{k}$
10) for $i$ in 1 to $n$ do
11) for $t\in{SN}_{f_{i}}$ where $f_{i}={cf}_{1}$ , ${cf}_{1}$ is the first feature of ${AC}_{k}({cf}_{1},\ldots,{cf}_{k})$
12) ${SI}_{k}=$ filter_star_instances ( ${AC}_{k},t$ )
13) end do
14) if $k=$ 2 then ${CI}_{k}={SI}_{k}$
15) else ${CI}_{k}=$ filter_clique_instances ( ${AC}_{k},{SI}_{k}$ )
16) ${UA}_{k}=$ select_Hau_co-location ( ${AC}_{k},{CI}_{k},{\zeta}$ )
17) $k++$
18) end while
19) return $\cup({UA}_{2},\ldots,{UA}_{k})$

Algorithm 2 is explained as follows:

Steps 1–4 (initialization) are the same as in Algorithm 1. Steps 5–9 generate the size-$k$ ($k>1$) candidate patterns from the size-$(k-1)$ approximate HAUCPs, and then select the size-$k$ approximate HAUCPs from these candidates. Steps 10–13 collect the star instances of each approximate HAUCP from the star neighborhood set; the feature type of the central instance of the star neighborhood must be the same as the first feature of the approximate HAUCP. Because the spatial neighbor relationship is symmetric, a size-2 star instance is already a clique instance, so Step 14 is performed directly for size 2. For size 3 and above, Step 15 checks whether each star instance is a clique instance. Step 16 extracts all HAUCPs that meet the threshold $\zeta$.
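The candidate generation and pruning of Steps 5–9 can be sketched in Python as follows. This is a simplified illustration under stated assumptions, not the authors' Java implementation: it omits the star and clique instance filtering of Steps 10–16, tests containment on feature types only, and uses an Apriori-style join as a guess at what the candidate generation step does. The total utility `U_D = 100` and the threshold `ZETA = 0.12` are hypothetical values chosen for the example.

```python
from itertools import combinations

# Rows of Table 4 as (feature types present in the neighborhood, umtwu).
TABLE4 = [
    ({"A", "B", "E"}, 5), ({"A", "B", "C", "E"}, 5),
    ({"A", "B", "C", "D", "E"}, 6), ({"A", "B", "E"}, 5),
    ({"A", "C", "D"}, 5), ({"B", "D"}, 4), ({"B", "C", "D", "E"}, 4),
    ({"B", "C"}, 6), ({"B", "E"}, 4), ({"C", "D", "E"}, 3),
    ({"D", "E"}, 2), ({"D", "E"}, 2),
]
U_D = 100    # HYPOTHETICAL total utility U(D) of the data set
ZETA = 0.12  # HYPOTHETICAL average utility ratio threshold


def autwu(pattern):
    """Eq. (10), with containment tested on feature types (an assumption)."""
    return sum(value for features, value in TABLE4 if set(pattern) <= features)


def gen_candidates(prev_level, k):
    """Build size-k candidates whose size-(k-1) subsets all survived.

    Lemma 5 justifies discarding any candidate with a pruned subset.
    """
    prev = set(prev_level)
    features = sorted({f for pattern in prev for f in pattern})
    candidates = set()
    for pattern in prev:
        for f in features:
            if f not in pattern:
                c = frozenset(pattern | {f})
                if all(frozenset(s) in prev for s in combinations(c, k - 1)):
                    candidates.add(c)
    return candidates


def approximate_haucps(features, total_utility, zeta):
    """Level-wise search for approximate HAUCPs (Steps 5-9 only)."""
    levels = {1: {frozenset(f) for f in features}}  # AC_1 = F
    k = 2
    while levels[k - 1]:
        candidates = gen_candidates(levels[k - 1], k)
        levels[k] = {c for c in candidates if autwu(c) / total_utility >= zeta}
        k += 1
    return {size: pats for size, pats in levels.items() if size >= 2 and pats}


result = approximate_haucps("ABCDE", U_D, ZETA)
for size, patterns in sorted(result.items()):
    print(size, sorted("".join(sorted(p)) for p in patterns))
```

With these hypothetical settings, {A, D} is pruned at size 2, so no size-3 candidate containing both A and D is ever generated. In the full algorithm, the surviving patterns would still have to pass the exact average utility test of Step 16.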

Lemma 7. For a pattern $c$ , there is an inequality ${au}(c)\leqslant\textit{autwu}(c)\leqslant\textit{utwu}(c)$ .

Proof. $\textit{utwu}(c)$ is the sum of the utilities of all features in every star neighborhood containing $c$, whereas $\textit{autwu}(c)$ is the sum of only the maximum feature utility in each of those star neighborhoods. Since the maximum of a set of non-negative utilities never exceeds their sum, $\textit{autwu}(c)\leqslant\textit{utwu}(c)$, and ${au}(c)\leqslant\textit{autwu}(c)$ was proved in Lemma 6. Therefore ${au}(c)\leqslant\textit{autwu}(c)\leqslant\textit{utwu}(c)$ holds.
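As a small worked check of the per-neighborhood inequality behind Lemma 7, consider $T_{1}$, whose feature utilities are 5, 4 and 2 according to the example after Definition 14. The check assumes that the total utility of a star neighborhood is the sum of its feature utilities, as the proof describes.

```latex
\[
\textit{umtwu}(T_{1}) = \max\{5,4,2\} = 5
\;\leqslant\;
5 + 4 + 2 = 11 = \sum_{f_{i}\in T_{1}} u(f_{i},T_{1})
\]
```

Summing this inequality over all star neighborhoods that contain $c$ gives $\textit{autwu}(c)\leqslant\textit{utwu}(c)$.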

The two algorithms proposed in this paper differ only in the number of candidate patterns they generate. By Lemma 7, the local extended average utility is a tighter upper bound than the extended average utility for any co-location pattern $c$, so Algorithm 2 prunes at least as many candidate patterns as Algorithm 1. In HAUCP mining, searching for table instances is the most time-consuming step, and Algorithm 2 reduces this work by reducing the number of candidate patterns. Therefore, Algorithm 2 is expected to be more efficient than Algorithm 1.

5. Experimental results

5.1 Experimental environment

The HAUCP mining algorithm and the spatial high utility co-location pattern mining algorithm EPA [13] are both written in Java (https://github.com/llixiaoshou/HAUCP). The experimental environment is: (1) System: Windows 10, (2) CPU: Intel(R) Core(TM) i7-8700K, @3.70GHz, (3) Memory: 16GB.

5.2 Experimental data

In the experiments, we used two types of spatial data sets: synthetic and real. The real data sets are the “Gong Shan vegetation” data set 1, the “Shanghai POI” data set 2, the “Lasvegas” data set 3, the “Beijing POI” data set 4, and the “Toronto” data set 5. The data distributions of the 5 real data sets are shown in Fig. 3a–e, and their properties are listed in Table 5. Synthetic data sets are generated using a spatial data generator similar to that of [14]. In the following experiments, $F$ is the number of features, $d$ is the neighbor distance threshold, $n$ is the total number of instances, and $\zeta$ is the pattern utility ratio threshold.

Table 5
A summary of the 5 real data sets

Name of data sets	Number of instances	Number of features	Distribution
Data set 1 (Gong Shan vegetation)	13349	25	Dense, cluster
Data set 2 (Shanghai POI)	14446	23	Dense, zonal $+$ concentrated
Data set 3 (Lasvegas)	9713	26	Sparse, uniform
Data set 4 (Beijing POI)	25276	12	Concentrated dense
Data set 5 (Toronto)	10057	52	Sparse, uniform

Figure 3.

The distribution of the 5 real data sets.

5.3 The quality of mining results

We aim at finding the high utility co-locations whose instances are frequently located together in geographic space and have high utilities. To measure the rationality of a mined pattern $c$ , a quality measurement $Q(c)$ of $c$ is proposed in literature [14]. The formula is as follows:

$\displaystyle Q(c)=\frac{\sum\limits_{f_{i}\in c}{u(f_{i},c)}}{\sum\limits_{f_{i}\in c}{U(f_{i},D)}}$ (12)

where $u(f_{i},c)$ is the total utility of feature $f_{i}$ in pattern $c$, and $U(f_{i},D)$ is the total utility of feature $f_{i}$ in data set $D$. A higher value of $Q(c)$ means that the features of $c$ contribute a larger share of their utility to the pattern, so their average influence in the pattern is higher. Conversely, a lower value of $Q(c)$ means that the average participation utility of the features in $c$ is smaller, so $c$ can be considered less interesting. In addition, the co-location instances of a pattern reflect the prevalence of the pattern in the spatial data set, which is of great significance to the usability of the mining results.
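Eq. (12) can be written as a short Python function. The example values below are illustrative only: the pattern utilities of A and B are taken from the {A, B} example in Section 4 ($5\times 4$ and $4\times 3$), while the data set utilities assume that Table 4 lists every instance (five instances of A and four of B) and that $U(f_{i},D)$ is the external utility multiplied by the instance count. The function name `quality` is hypothetical.

```python
def quality(pattern_utility, dataset_utility):
    """Eq. (12): Q(c) = sum_f u(f, c) / sum_f U(f, D) over the features of c.

    pattern_utility maps each feature of c to u(f, c);
    dataset_utility maps each feature to U(f, D).
    """
    features = pattern_utility.keys()
    numerator = sum(pattern_utility[f] for f in features)
    denominator = sum(dataset_utility[f] for f in features)
    return numerator / denominator


# Illustrative values for c = {A, B}; the data set utilities are ASSUMPTIONS.
u_in_pattern = {"A": 5 * 4, "B": 4 * 3}   # u(A, c) = 20, u(B, c) = 12
u_in_dataset = {"A": 5 * 5, "B": 4 * 4}   # U(A, D) = 25, U(B, D) = 16 (assumed)
print(round(quality(u_in_pattern, u_in_dataset), 3))
```

Under these assumed values, $Q(c)=32/41$, which indicates that most of the utility of A and B participates in the pattern.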

The EPA algorithm proposed in [13] is the only existing algorithm for mining spatial high utility co-location patterns from data sets whose spatial features carry utilities, and the HAUCP mining algorithm proposed in this paper mines HAUCPs from the same kind of data. Therefore, the experiments in this part compare the effectiveness and rationality of the patterns mined by the two methods in terms of $Q(c)$ and prevalence. The HAUCP mining method used here is the basic algorithm of Section 4.1, because the basic and improved algorithms discover the same results. The experimental data set is a synthetic data set with $F=$ 25 and $n=$ 12000.

Figure 4.

Effect of $k$ in mined top- $k$ patterns under different mining methods.

(A) Effect of $k$ in top- $k$ mined patterns

In Fig. 4, we set $d=$ 20 m and $\zeta=$ 0.05. (In real life, the values of $d$ and $\zeta$ are selected according to the needs of users. The selection of $d$ and $\zeta$ in this paper is random, and the experimental conclusions obtained are the same.) In Fig. 4a, the horizontal axis represents the $k$ in the mined top- $k$ patterns, and the vertical axis represents the average $Q(c)$ value of the mined top- $k$ patterns. In Fig. 4b, the vertical axis represents the average prevalence of the mined top- $k$ patterns. We can see that the average $Q(c)$ values and average prevalence values of the top- $k$ patterns mined by the basic algorithm are no less than those obtained by EPA, and the greater the $k$ value, the greater the advantage of our algorithm. This is because EPA does not consider the effect of the pattern length on the utility. Generally, the longer the pattern, the greater its utility and the lower its prevalence, which results in a large number of long patterns being judged as high utility patterns. The basic algorithm, in contrast, takes the effect of pattern length on pattern utility into account, so the results of HAUCP mining are not only of higher quality but also of better prevalence.

(B) Effect of the pattern utility ratio threshold

In Fig. 5, we set $d=$ 20 m, and the utility ratio threshold increases from 0.05 to 0.09. In Fig. 5a, the horizontal axis represents the utility ratio threshold, and the vertical axis represents the average $Q(c)$ of the top-10 patterns mined by the different methods. In Fig. 5b, the horizontal axis is also the utility ratio threshold, and the vertical axis is the average prevalence of the top-10 patterns mined by the different methods. Since the basic algorithm takes into account the effect of the pattern length on the pattern utility, the average $Q(c)$ and average prevalence of its top-10 mined patterns are larger than those of EPA. As the utility ratio threshold increases, the number of patterns judged to be valuable decreases, so the average $Q(c)$ values and the average prevalence of the two algorithms decrease. When the utility ratio threshold is between 0.07 and 0.09, the average $Q(c)$ and average prevalence remain unchanged, because the top-10 patterns in the mining results of the two algorithms remain unchanged.

Figure 5.

Effect of the pattern utility ratio threshold.

Figure 6.

Effect of the distance threshold.

Figure 7.

Effect of the pattern utility ratio threshold.

(C) Effect of distance threshold

In Fig. 6, we set $\zeta=$ 0.05, and the distance threshold increases from 15 m to 25 m. In Fig. 6a, the horizontal axis represents the distance threshold, and the vertical axis represents the average $Q(c)$ of the top-10 patterns mined by the different methods. In Fig. 6b, the horizontal axis represents the distance threshold, and the vertical axis represents the average prevalence of the top-10 patterns mined by the different methods. From Fig. 6, we can see that the average $Q(c)$ values and average prevalence values of both algorithms increase as the distance threshold increases, because a larger distance threshold yields more table instances. At the same time, the result of the basic algorithm is always better than that of EPA.

5.4 The efficiency

In this section, we compare the efficiency of the basic method and the improved algorithm (denoted AUP) on the 5 real data sets and synthetic data sets.

(A) Effect of the pattern utility ratio

In the real data set 1, we set $d=$ 2000 m and the utility ratio threshold increases from 0.1 to 0.3. In the real data set 2, we set $d=$ 300 m and the utility ratio threshold increases from 0.05 to 0.1. In the real data set 3, we set $d=$ 150 m and the utility ratio threshold increases from 0.05 to 0.1. In the real data set 4, we set $d=$ 350 m and the utility ratio threshold increases from 0.05 to 0.1. In the real data set 5, we set $d=$ 100 m and the utility ratio threshold increases from 0.15 to 0.3. In Fig. 7a, c, e, g, and i, we can see that the number of candidate patterns of AUP is less than that of the basic algorithm. In Fig. 7b, d, f, h, and j, since AUP can prune more low utility patterns in advance, its running time is lower than that of the basic algorithm. At the same time, as the utility ratio threshold increases, both algorithms filter out more low utility patterns in advance, so the number of candidate patterns and the overall running time of the two algorithms decrease accordingly. Considering the distributions of the five data sets, we can see that AUP has better pruning effects on sparse and uniform data sets.

(B) Effect of distance threshold $d$

For this experiment, we set $\zeta=$ 0.1 and increase the distance threshold from 1800 m to 2300 m for data set 1, set $\zeta=$ 0.05 and increase the distance threshold from 250 m to 300 m for data set 2, set $\zeta=$ 0.1 and increase the distance threshold from 100 m to 175 m for data set 3, set $\zeta=$ 0.1 and increase the distance threshold from 300 m to 400 m for data set 4, and set $\zeta=$ 0.1 and increase the distance threshold from 50 m to 90 m for data set 5. In Fig. 8a, c, e, g, i, AUP discards more patterns that are not high-average utility co-location patterns, which makes its number of candidate patterns smaller than that of the basic algorithm. In Fig. 8c, e, g, i, the difference in the number of candidate patterns between the two algorithms is more obvious. In Fig. 8b, d, f, h, j, since AUP can prune more low utility patterns in advance, its running time is lower than that of the basic algorithm. In Fig. 8d, f, h, j, AUP runs much faster than the basic algorithm. Considering the distributions of the 5 data sets, AUP is more effective on sparse and uniform data sets. A larger value of $d$ means more spatial neighbor relationships, so more spatial instances can form cliques. As $d$ increases, the numbers of candidate patterns and the runtimes of both algorithms therefore increase accordingly.

Figure 8.

Effect of distance threshold $d$ .

(C) Effect of the number of features

This experiment uses synthetic data sets with $d=$ 20, $n=$ 100000, and $\zeta=$ 0.05. As shown in Fig. 9a, as the number of features increases, the number of candidates of the two algorithms first increases and then decreases. Similarly, in Fig. 9b, the running time of the two algorithms first increases and then decreases. This is because the total number of instances is fixed, so as the number of features keeps increasing, the number of instances per feature decreases, and the average size of candidates and the average number of neighbors of each instance decrease as well. Since AUP can prune more low utility patterns in advance, its number of candidate patterns and its running time are lower than those of the basic algorithm. When $F=$ 30, the difference between the pruning effects of the two algorithms is the most obvious, and AUP filters out about 16.8 times as many patterns as the basic algorithm.

Figure 9.

Effect of the number of features.

(D) Effect of the number of total instances

In the synthetic data set, we set $\zeta=$ 0.1, $F=$ 50, and $d=$ 20 m. From Fig. 10, we can see that as the total number of instances $n$ increases in a fixed space, the number of neighbor instances increases, and so does the number of clique instances. As a result, the distance calculations and join operations between instances increase, so both algorithms have more candidates and longer runtimes. However, because AUP can prune more candidate patterns, its number of candidate patterns is smaller and its efficiency is higher than that of the basic algorithm. When the number of instances $n\geqslant$ 50000, the pruning effect of AUP is significantly better than that of the basic algorithm. When the number of instances reaches 70000, the efficiency of the improved algorithm is 6.2 times that of the basic algorithm.

Figure 10.

Effect of the number of total instances.

6. Conclusions

This paper redefines the interestingness measure of patterns and, for the first time, introduces the average utility into the spatial co-location pattern mining framework. Because the average utility ratio does not satisfy the downward closure property, we over-estimate the average utility ratio of a pattern and propose a basic algorithm for HAUCP mining. To improve the efficiency of the basic algorithm, this paper further tightens the extended average utility used in the basic algorithm and proposes an improved algorithm based on the local extended average utility ratio. Extensive experimental results show that the improved algorithm can efficiently mine HAUCPs. However, the utility ratio threshold in this paper is given by the user, so the results depend heavily on the user. Setting an appropriate minimum utility ratio threshold is not easy. The most common method is to try different thresholds repeatedly and choose the most suitable one from the results of multiple attempts. This way of setting the utility ratio threshold is not only tedious but also hard to control. Therefore, in future work, based on the research in this paper, we will consider the problem of mining the top- $k$ HAUCPs.

Acknowledgments

This work is supported by the Yunnan Provincial Major Science and Technology Special Plan Projects (202202AD080003), National Natural Science Foundation of China (61966036, 41906148, 61662086), and the Project of Innovative Research Team of Yunnan Province (2018HC019).

References

1. Huang, Shekhar and Xiong, Discovering colocation patterns from spatial data sets: a general approach, IEEE Transactions on Knowledge and Data Engineering 16 (2004), 1472–1485.

2. Shekhar and Huang, Discovering spatial co-location patterns: a summary of results, in: Proceedings of the 7th International Symposium on Advances in Spatial and Temporal Databases, 2001, pp. 236–256.

3. Yoo J.S. and Shekhar, A Partial Join Approach for Mining Co-location Patterns, in: Proc. of ACM International Symposium on Advances in Geographic Information Systems (ACM-GIS), 2004, pp. 241–249.

4. Yoo J.S. and Shekhar, A Join-less Approach for Co-location Pattern Mining: A Summary of Results, in: Proc. of 5th IEEE International Conference on Data Mining (ICDM), 2005, pp. 813–816.

5. Bao and Wang, A clique-based approach for co-location pattern mining, Information Sciences 490 (2019), 244–264.

6. Wang, Bao, Zhou et al., Mining maximal sub-prevalent co-location patterns, World Wide Web 22 (2019), 1971–1997.

7. Tran and Wang, A spatial co-location pattern mining framework insensitive to prevalence thresholds based on overlapping cliques, Distributed and Parallel Databases, doi: 10.1007/s10619-021-07333-2.

8. Tran, Wang, Chen et al., MCHT: A maximal clique and hash table-based maximal prevalent co-location pattern mining algorithm, Expert Systems with Applications, doi: 10.1016/j.eswa.2021.114830.

9. Yang, Wang and Wang, A Parallel Spatial Co-location Pattern Mining Approach Based on Ordered Clique Growth, in: Proc. of the 23rd International Conference on Database Systems for Advanced Applications (DASFAA), 2018, pp. 734–742.

10. Yang, Wang et al., Efficient discovery of co-location patterns from massive spatial datasets with or without rare features, Knowledge and Information Systems, 2021, 1365–1395.

11. Yang, Wang and Zhou, SCPM-CR: A Novel Method for Spatial Co-location Pattern Mining with Coupling Relation Consideration, in: IEEE Transactions on Knowledge and Data Engineering. doi: 10.1109/TKDE.2021.3060119.

12. Yao, Hamilton H.J. and Butz C.J., A Foundational Approach to Mining Itemset Utilities from Databases, in: Proc. of the SIAM International Conference on Data Mining, 2004, pp. 211–225.

13. Liu, Liao W.K. and Choudhary, A Two Phase Algorithm for Fast Discovery of High Utility of Itemsets, in: Proc. of the 9th Pacific Asia Conf. Knowledge Discovery and Data Mining (PAKDD), 2005, pp. 689–695.

14. Ahmed C.F., Tanbeer S.K., Jeong and Lee, Efficient tree structures for high utility pattern mining in incremental databases, IEEE Transactions on Knowledge and Data Engineering 21 (2009), 1708–1721.

15. Tseng V.S., C.W., Shie B.E. and Yu P.S., UP-Growth: An Efficient Algorithm for High Utility Itemsets Mining, in: Proc. of the 16th ACM SIGKDD Conf. Knowledge Discovery and Data Mining (KDD), 2010, pp. 253–262.

16. Yin, Zheng and Cao, US-pan: an efficient algorithm for mining high utility sequential patterns, in: Proc. of the 18th ACM SIGKDD Conf. Knowledge Discovery and Data Mining (KDD), 2012, pp. 660–668.

17. Yang, Wang, Bao and Lu, A framework for mining spatial high utility co-location patterns, in: Proc. of the 12th International Conference on Fuzzy Systems and Knowledge Discovery (FSKD), 2015, pp. 595–601.

18. Wang, Jiang, Chen and Fang, Efficiently mining high utility co-location patterns from spatial data sets with instance-specific utilities, in: Proc. of the 22nd International Conference on Database Systems for Advanced Applications (DASFAA), 2017, pp. 458–474.

19. Hong, Lee and Wang, Mining high average-utility itemsets, in: Proc. of the IEEE International Conference on Systems, Man and Cybernetics (SMC), 2009, pp. 2526–2530.

20. Lin C.W., Hong T.P. and Lu W.H., Efficiently mining high average utility itemsets with a tree structure, Lect. Notes Comput. Sci, 2010, 131–139.

21. Lan G.C., Hong T.P. and Tseng V.S., Efficiently mining high average-utility itemsets with an improved upper-bound strategy, Int. J. Inform. Technol. Decis 11 (2012), 1009–1030.

22. Nguyen H.T. and Hong T.P., A new method for mining high average utility itemsets, Lect. Notes Comput. Sci 8838 (2014), 33–42.

23. Lin J.C., Fournier-Viger, Hong T.P., Zhan and Voznak, An efficient algorithm to mine high average-utility itemsets, Advanced Engineering Informatics 30 (2016), 233–243.

24. Lin J.C., Fournier-Viger, Hong T.P., J.H. and Vo, A fast algorithm for mining high average-utility itemsets, Applied Intelligence the International Journal of Artificial Intelligence Neural Networks & Complex Problem Solving Technologies, 2017, 331–346.

25. Lin J.C., S, Fournier-Viger, Youcef and G, Maintenance algorithm for high average-utility itemsets with transaction deletion, Applied Intelligence: The international journal of artificial intelligence, Neural Networks, and Complex Problem-Solving Technologies 48 (2018), 3691–3706.

26. Lin J.C., J.M., Fournier-Viger, Hong T.P. and Li, Efficient Mining of High Average-Utility Sequential Patterns from Uncertain Databases, in: Proc. of the IEEE International Conference on Systems, Man and Cybernetics (SMC), 2019, pp. 1989–1994.

27. Lin J.C., Pirouz, Djenouri, Cheng C.F. and Ahmed, Incrementally updating the high average-utility patterns with pre-large concept, Applied Intelligence, 2020, 3788–3807.

28. Lin J.C., Pirouz, Zhang and Fournier-Viger, High average-utility sequential pattern mining based on uncertain databases, Knowledge and Information Systems 62 (2020), 1199–1228.
