Abstract
Association rule algorithm has always been a research hotspot in the field of data mining, in the context of today’s big data era, in order to efficiently obtain association rules and effectively update them, based on the original fast update pruning (FUP) algorithm, an association rule incremental update algorithm (FBSCM) based on sorting compression matrix is proposed to solve the shortcomings of frequent scanning of transaction datasets. Firstly, The algorithm maps the transaction dataset as a Boolean matrix, and changes the storage mode of the matrix(that is, adding two columns and a row vector); Secondly, the matrix is compressed many times during the generation of frequent k-itemset; After that, the items in the matrix are sorted incrementally according to the support degree of the itemset; Finally, the original string comparison operation is replaced by the vector product of each column of the matrix. Experimental results and analysis show that the FBSCM algorithm has higher temporal performance than the traditional FUP algorithm in different incremental dataset sizes, different minimum support thresholds and different feature datasets, especially when the incremental transaction volume is large or the minimum support degree is small.
Introduction
Data mining has significant applications in many fields [1–5]. Among them, association rule mining, as one of widely used technologies of data mining [6], can find the relationship of interest from massive data [7] and mine meaningful association relationship [8]. However, with the passage of time, the existing association rule information will be invalid due to the change of influencing factors, so in the new state, it is necessary to use the existing mining results to perform corresponding addition and deletion operations, so as to quickly update the association rule base, and the association rule incremental update algorithm came into being. At present, the research methods of this algorithm are mainly divided into two categories. One is mining based on tree structure [9, 10], and the other is iterative-based mining algorithms [11, 12]. Among them, the FUP [Fast Update Pruning] algorithm proposed by Cheung et al. is one of the most classic and practical algorithms, but there are still shortcomings in multiple scans of transactional data sets and inability to process massive data.
To solve the shortcomings of multiple scan transactional datasets in the FUP algorithm, the researchers proposed corresponding improved algorithms. The simple and efficient incremental update algorithm proposed by Chen et al. SFUA can quickly update the association rules when the newly added database is relatively small [13]. Based on the distributed environment, the CMFUP algorithm proposed by Geng et al. [14] and the MRFUP algorithm proposed by Zhu et al. [15] have higher correlation rule mining efficiency in processing large data sets. In addition, some researchers have proposed improved methods based on Boolean matrices for mining frequent itemset. The key idea of such methods is to convert the transaction dataset into a 0-1 matrix, and replace the inefficient string comparison operation with the “AND operation”, which greatly shortens the calculation time of the support of frequent itemset, thereby improving the efficiency of algorithm operations. Liu et al. [16] proposed a new association rule algorithm, ABBM, which uses the Boolean vector “relational calculus” method to discover frequent itemset because all transaction data is stored in bits, so less memory space is required. Yu et al. [17] proposed an improved Apriori algorithm, which replaces the original transaction database with a Boolean matrix. This approach makes it easier to delete infrequent itemset by using the “AND operation” and random access characteristics of arrays. The algorithm can also be mined on the Hadoop platform, which exponentially improves the efficiency of the algorithm. From the aspects of matrix storage mode, matrix compression, and efficient calculation of support, a variety of improvement schemes [18, 19] are proposed, which effectively improves the efficiency of the algorithm. However, these algorithms are all proposed in the context of static databases, and there are not many ways to start from the matrix improvement for dynamic databases.
Zhou et al. [20] proposed a new association rule incremental algorithm (FBCM), based on the improvement idea of Boolean matrix. The algorithm operates on two compressed Boolean matrices to achieve efficient update of association rules. Related experiments show that the algorithm can greatly reduce the computational overhead and computation amount, effectively improve the update efficiency of association rules. However, the approach still requires multiple scans of the matrix and does not reduce the number of candidate sets.
Due to the classic incremental association rule mining algorithm in data mining technology, the FUP algorithm has the following two disadvantages, for example, although the use of the original association rule mining information can reduce the calculation of the transaction itemset part, it still needs to scan the transaction database frequently, which increases the amount of calculation; It takes a long time to use the string comparison function to complete the statistical itemset support. This paper aims to solve the above shortcomings by proposing an incremental association rule mining algorithm (FBSCM) based on sorted compression matrix. FBSCM further improves the mining efficiency of the algorithm by converting the Boolean matrix of the transactional data set, effectively compressing the rows and columns of the matrix, bitwise and arithmetic, and matrix sorting. Finally, the design experiment is used for comparative analysis, and the performance of the algorithm before and after the improvement in processing different incremental dataset sizes, different support thresholds and different data sets is compared.
Improved algorithm
Definitions and related properties
Definitions
Transaction matrix transformation mode
Transaction matrix transformation mode
where the row represents a transaction T i = (di1di2di3 …… d in ), the column represents the item I j = (d1jd2jd3j …… d mj ), and we have
In the association rule mining algorithm based on matrix improvements, the compression strategy of the matrix depends on the following properties:
FBSCM algorithm
As brought up earlier, FUP as an iterative mining algorithm has a couple of disadvantages, such as multiple scanning of database and generation of a large number of invalid candidate sets. Our proposed FBSCM addresses these issues from three aspects:
(1) Storage of the matrix
FBSCM adds three vectors at the end of the generated Boolean matrix to reduce multiple scans of the matrix. Among them, the vector ω is used as the weight, representing the frequency of each transaction in the transaction dataset, and effectively compressing the non-unique transactions in the transaction dataset to ensure that each transaction information in the matrix is not repeated. The vector n is considered to count the number of candidates contained in each transaction. The vector m is used to count the size of the number of supports corresponding to each term.
(2) Matrix compression
Column compression is performed on the matrix. The sum of the rows in the matrix is recalculated, and the rows of the matrix are compressed. According to the properties 1 and 2, the frequency of infrequent itemset is reduced by continuous compression, so that the matrix can achieve maximum simplification, which can maximize the utilization of storage space and improve the operation efficiency of the algorithm.
(3) The orderliness of the itemset
In the Apriori algorithm, the more supportive the frequent 1-itemset and the self-connection to generate the more frequent 2-itemset has the higher probability of success. Drawing on the idea of itemset ordering in the Apriori algorithm, in the mining of incremental association rules, the columns of the merged matrix are arranged in increments according to the frequent 1-itemset support count m. And the subsequent connection operation will produce fewer candidate combinations (the number of candidate 2-itemset generated
FBSCM consists of the following key steps. We assume that the original dataset is DB, and the new dataset is db. DB+db is the updated transaction dataset.
Scans dataset db for supported updates to all 1-itemset X belonging to L1 (DB) in db support (XDB∪db). X is checked once the scan is complete. If support (XDB∪db) ⩾ minsup × (|DB + db|), it is added to the L1 (DB + db). Count the number of supports support (X db ) for 1-itemset X in db that is not part of the L1 (DB + db) and if support (X db ) ⩾ minsup × (|DB + db|) which is added to the L1 (DB + db). At this point, L1 (DB + db) is generated.
Where,
Column compression is performed on matrix matrix DB and matrix db according to L1 (DB + db) and the property 1, respectively. The Boolean matrix of DB and db is then combined to generate a new matrix matrixDB+db. The matrix is compressed accordingly according to the property 2. Each column m is updated, after which the columns of the matrix are sorted in ascending order supporting the number m.

FBSCM algorithm flowchart.
Candidate 2-itemset is generated by frequently connecting 1-itemset to itself, where efficient calculation of the support of each candidate combination is performed by “AND operation” of each vector of the matrix. Items that do not meet the minimum support threshold set are removed from the candidate 2-itemset, resulting in a frequent 2-items set. Repeat this process until the frequent (k + 1)-itemset can no longer be obtained, at which point the algorithm finally wants to get the frequent k-itemset.
This section explains the flow of FBSCM by analyzing an example. Set the minimum support threshold s = 0.33. The number of transactions in the original transaction dataset DB is |DB|=6, as shown in Table 2. The number of entries is 5, and the minimum number of db supports is 2. The increased number of transactions in the transaction dataset db is |db|=3, as shown in Table 3, the number of entries is 4, and the minimum number of supported db is 1. The minimum number of supports for DB+db is 3. Set sup as the count for the supported numbers.
The original transaction dataset DB
The original transaction dataset DB
Add transaction dataset
Frequent itemsets on DB are:
The first step of the algorithm is to map DB and db as corresponding Boolean matrices, as shown in Table 4 and Table 5.
DB’s Boolean matrix
DB’s Boolean matrix
db ‘s Boolean matrix
First, scan db to update the number of supports for all 1-itemset belonging to L1 (DB), A . sup (DB + db) = 6, B . sup (DB + db) = 7, C . sup (DB + db) = 6, E . sup (DB + db) = 3, D . sup (DB + db) = 2 <3, so D is an infrequent 1-itemset, so L1 (DB+ db) = { A, B, C, E }.
According to L1 (DB + db) and the property 1, the 1-itemset {D} in the DB Boolean matrix does not need to participate in subsequent iterations. It can be deleted by column compression of db’s Boolean matrix, recalculated the number of occurrences of “1” in each row of the matrix, and updated the values of the array n, w, m, while the Boolean matrix of db does not need to change. The compressed DB matrix is shown in Table 6.
Boolean matrix after the first compression of DB
Boolean matrix after the first compression of DB
The Boolean matrix of DB and db is then combined to generate a new matrix. The merged matrix is row compressed according to the property 3, the item support number m of each column of the matrix is recalculated, and the columns of the matrix are sorted in ascending order according to the number of supports. The combined and sorted matrix is shown in Table 7.
DB + db combined and sorted Boolean matrix
Before mining the frequent 2-itemset, it is necessary to compress the corresponding rows and columns of the matrix according to the properties 1 and 2. Similarly, in the iterative generation process of the frequent k-itemset afterwards, the matrix also needs to be compressed according to the compression strategy. Then use the sorted compression matrix to get the frequent 2-itemset, the frequent 1-itemset mined earlier is L1 (DB+ db) = { E, A, B, C }, the candidate 2-itemset of DB+db is generated by the L1 (DB + db) self-join, so the candidates are EA, EB, EC, AB, AC, BC.
Take E ∧ A = 100001,
Compressed DB+db Boolean matrix
Compressed DB+db Boolean matrix
The Candidate 3-itemset for DB+db is generated by L2 (DB + db) self-join, with only ABC candidates. Take A ∧ B ∧ C = 110,
Eventually, the frequent itemset is obtained as L (DB+ db) = {{ E } , { A } , { B } , { C } , { E } , { EB } , { AB } , { AC } , { BC } , { ABC }}.
Performance analysis
The parameters used for the time complexity analysis of the FBSCM algorithm and the FUP algorithm are shown in the following Table 9:
Parameter list
Parameter list
The complexity of the time required to scan the original transactional database DB and the incremental transactional database db into a Boolean matrix is O (T × R); By adding two columns and a row array to the Boolean matrix to change the storage mode of the matrix, the space complexity consumed is O ((R + 1) × (I + 2)): For the compressed and merged matrix column vectors to support ascending order by itemset, a fast sort algorithm is adopted, and the time complexity consumed is O (I k log I k ); In the subsequent mining of frequent 1-itemset, only affected by the compressed matrix, a Boolean matrix D1 will be generated, and the time complexity consumed is O (R1 × I1); When k ⩾ 2, the time complexity required to generate candidate k-itemset C k is O (|Lk-1|2), the final acquired frequent k-itemset depends on the Boolean matrix size and C k , so the required complexity is O (|C k | × I k ) + O (|Lk-1| × Ik-1). Therefore, the total time complexity of the FBSCM algorithm is:
The time complexity analysis of the traditional FUP algorithm is as follows: firstly, the time complexity consumed to obtain frequent 1-itemset is O (T × R); when 2 ⩽ k ⩽ K, the time complexity required to generate candidate k-itemset C k is O (|Lk-1|2); The process of generating frequent k-itemset has operations such as calculating the number of itemset support and pruning, and the time complexity needs O (|C k | × R) and O (|C k | × k × (k - 2)), respectively. Therefore, the total time complexity required by the traditional FUP algorithm is:
Poor results for these two algorithms:

Partial Groceries dataset presentation.
From the above analysis, it can be seen that the values of both I k and R k are less than R, therefore,
|C k | is the number of candidate k-itemset, generated by frequent (k-1)-itemset Lk-1, therefore, |Lk-1| is less than |C k |, available:
In summary, T FUP - T FBSCM > 0. The time complexity of the FBSCM algorithm is theoretically superior to that of the FUP algorithm.
The experimental environment of this paper is carried out on a PC with Win10 64-bit system, 8 G memory, 4 G graphics card, and i5-8250U CPU @ 1.60 GHz.
In this experiment, four datasets are selected according to the density of data samples and the size of transaction volume, and the experimental data sets are shown in Table 10:
Experimental dataset table
Experimental dataset table
Among them, the Groceries dataset (shown in Fig. 2) is composed of the transaction list of department stores, the Mushroom dataset (shown in Fig. 3) comes from the UCI machine learning library, and the T40I10D100K dataset (shown in Fig. 4) and the T10I4D100K dataset (shown in Fig. 5) are the datasets used by Cheung et al. when proposing the FUP algorithm.

Partial Mushroom dataset presentation.

Partial T40I10D100K dataset presentation.

Partial T10I4D100K dataset presentation.
For the association rule mining algorithm, experimental comparison is normally carried out from the three aspects of setting transaction datasets of different sizes, setting different minimum support thresholds, and setting transaction datasets with different characteristics. Therefore, in this paper, the single-variable method will be selected for experimental comparison: First, take the incremental data amount as a single variable, and compare the changes of each algorithm under the same dataset. Second, the support threshold is used as a single variable to compare the changes of each algorithm under the same dataset. Finally, the effect of the support threshold on the efficiency of the algorithm is obtained. Third, the data characteristics of the dataset are used as a single variable, and the changes of each algorithm under the same transaction data volume size and minimum support threshold are compared.
In this experiment, we can finally obtain the influence of incremental transaction data size, minimum support threshold, and data characteristics of the dataset on the efficiency of the algorithm.
Experimental results and analysis
Algorithm comparison under different incremental datasets
The original transaction dataset DB will randomly select 4500 records from the data-intensive much-related Mushroom dataset, while the incremental dataset db will randomly select 500–3500 records (step size is 500) from the remaining data. Given the minimum support min_sup = 0.4, the running time of FUP, FBCM and FBSCM under different incremental dataset sizes is compared, as shown in Fig. 6:

Comparison under different volumes of dataset.
The charts in Fig. 6 are a simulated implementation of the matrix-based incremental algorithm proposed in Refs. 20, under different incremental dataset data volumes. The abscissa represents the incremental transaction data volume, taking 500 as the step size, changing to 3500. As can be seen from the Fig. 6, while keeping the min_sup = 0.4 unchanged, as the amount of data in the new transaction dataset continues to increase, the workload of scanning the transaction dataset increases greatly, and the running time also increases. In terms of the speed of mining frequent itemset, FUP is the slowest, and FBSCM has the highest execution efficiency and the slowest time increase. Obviously, FBSCM has the highest execution efficiency due to the use of sorted compression matrix to mine frequent patterns. Moreover, FBSCM effectively narrows the search scope of understanding and avoids the generation of invalid candidate sets on the basis of making frequent itemset complete, which not only reduces the computational overhead, but also accelerates the mining speed of frequent itemset. After the frequent 1-itemset is obtained, the compressed matrix is sorted in ascending order according to the order of the itemset, which avoids the generation of invalid candidate sets. And in the process of mining k-itemset, the support of the itemset is efficiently counted through “AND operation”, which further increases the efficiency of the algorithm.
We use the data-intensive Mushroom dataset for this experiment. The original transaction dataset DB will randomly extract 6000 records from the Mushroom, the incremental dataset db will randomly extract 2000 data from the remaining data. The experiment takes different minimum support thresholds as the abscissa, and compares the running time of the algorithm in different minimum support min_sup cases. The results are shown in Fig. 7.

Comparison with different minimum support.
The abscissa in Fig. 7 represents the minimum degree of support, taking 0.02 as the step size, increasing from 0.25 to 0.41, and the ordinate represents the change in the running time of each algorithm, which is in milliseconds. As can be seen from the Fig. 7, when the support threshold changes to 0.41, the running time of FBSCM is 32.4 ms, which is 1.6 times faster than the run time of FUP 51.9 ms. The running time of FBCM based on matrix-based improvements is 43.3 ms. When the support threshold is 0.41, because the transaction dataset is updated after the frequent itemset will not be much, FUP can reduce the number of scans of the original transaction dataset, and the running time is not be very high. For FBCM based on matrix improvement and FBSCM, at this point, even if there is not a lot of frequent itemset generation, it is still necessary to map the transaction dataset to a Boolean matrix, the time consumed by this step is not reduced. As a result, FBSCM improvement does not open a significant gap with FUP. When the support threshold is at 0.25, the running time of FBSCM is 77 ms, which is nearly 10 times faster than the running time of FUP (765.5 ms). Given the same amount of data in the dataset, the time taken by FBSCM is always at the lowest level of the three algorithms, and the processing time of the three algorithms is gradually decreasing with the increase of support. This experiment shows that FBSCM has a significant advantage in the case of frequent itemset mining with small threshold (that is, large computational volume).
Given the incremental transaction dataset size |db|=1000, and the minimum support threshold min_sup, the operating efficiency of FUP, FBCM and FBSCM under the dataset Mushroom and Groceries is compared, and the experimental results are shown in Fig. 8.

Comparison between algorithm under different feature dataset.
Figure 8 shows the operation efficiency comparison chart of FUP, FBCM and FBSCM algorithms under the small and dense Mushroom dataset, the small and sparse Groceries dataset, the large and dense T40I10D100K dataset and the large and sparse T10I4D100K dataset, respectively, and the incremental data sets were obtained by sampling 1000 pieces in the original dataset. The detailed experimental data corresponding to the experimental results are shown in Table 11.
Detailed experimental data
From the efficiency ratio of Table 11 (FUP/FBSCM) and (FBCM/FBSCM), it can be seen that the data relationship, whether dense or sparse or transaction volume, regardless of size, does not have much impact on the operation efficiency of FBSCM, because the mining rate of FBSCM under the above four datasets is about 3 to 5 times higher than that of FUP. Compared with FBCM with the same matrix improvement idea, the improvement by FBSCM is about 1-2 times, and the optimization situation is more ideal, thanks to the use of sorted compression matrix. The large amount of overhead generated by FUP scanning the entire transaction dataset multiple times during the mining process has become the performance bottleneck of the algorithm, which makes the efficiency of the algorithm not well improved.
In this paper, an improved algorithm based on the sorted compression matrix, FBSCM is proposed. Compared with FUP that needs multiple scans for transaction dataset and generates invalid candidate sets, FBSCM scans the transaction data only once by mapping the transaction dataset to a Boolean matrix and effectively avoids the generation of invalid candidate sets by sorting the Boolean matrix. Through performance analysis and experimental results, the results show that FBSCM gives better performance in terms of runtime performance, especially if the minimum support threshold is low or the incremental transaction volume is large. In the future we will investigate new approaches to improve the memory consumption performance.
