Abstract
With the development of the automobile insurance industry, identifying automobile insurance fraud from massive data has become particularly important. The purpose of this paper is to improve automobile insurance fraud management and explore the application of data mining technology in automobile insurance fraud identification. To this end, an Apriori algorithm based on simulated annealing genetic fuzzy C-means (SAGFCM-Apriori) is proposed. The SAGFCM-Apriori algorithm combines fuzzy theory with association rule mining, expanding the application scope of the Apriori algorithm. Because the clustering centers of the traditional fuzzy C-means (FCM) algorithm easily fall into a local optimum, the simulated annealing genetic (SAG) algorithm is used to optimize them. The SAG-optimized FCM (SAGFCM) generates fuzzy membership degrees and thereby introduces fuzzy data into the Apriori algorithm. The Apriori algorithm itself is improved to reduce the time spent on rule mining. Empirical studies on several data sets demonstrate that optimizing FCM with SAG effectively avoids the local optimum problem, improves clustering accuracy, and enables SAGFCM-Apriori to obtain better fuzzy data during data preprocessing. Moreover, the proposed algorithm reduces the mining time of association rules and improves mining efficiency. Finally, the SAGFCM-Apriori algorithm is applied to automobile insurance fraud identification, and automobile insurance fraud data are mined to obtain fuzzy association rules that can identify fraudulent claims.
Introduction
With the popularization of private vehicles, many industries related to automobiles have developed vigorously. The automobile service industry, including automobile sales, maintenance, finance, and insurance, has grown significantly, and the automobile insurance business in particular has made great progress. Since the insurance business was launched in China in 1980, automobile insurance has accounted for more than 50% of the domestic insurance business. It is an essential category in the insurance industry, even known as the “food insurance” of the insurance industry [1]. However, automobile insurance fraud is common; it not only harms the interests of insurance companies but is also unfair to honest policyholders. There is an urgent need for intelligent, fast, and automated methods to obtain valuable information about automobile insurance fraud from automobile insurance fraud databases.
Association rule mining algorithm is a commonly used data mining algorithm [2]. Youcef Djenouri et al. [3] proposed an efficient parallel algorithm named CGPUGA. It is a genetic algorithm that runs on clusters of GPUs to efficiently discover diversified association rules. It benefits from cluster computing to generate rules. José María Luna et al. [4] proposed a series of algorithms based on the MapReduce framework and the Hadoop open-source implementation in order to establish new efficient pattern mining algorithms to work in big data. Results revealed the interest of applying MapReduce versions when complex problems were considered, and also the unsuitability of this paradigm when dealing with small data. Feng Feng et al. [5] pointed out and rectified some problems of soft set based association rule mining, refined and proposed relevant concepts. Further detailed insights were provided for soft set based association rule mining. Feng Feng et al. [6] developed a new approach to maximal association rule mining using logical formulas over soft sets in order to overcome the problem of meaningless rules in the rules obtained by traditional rule mining techniques. Fuzzy association rule mining is a main research direction of data mining. Lv [7] proposed a list based fuzzy frequent item mining algorithm and a list based fuzzy consistent association rule mining algorithm. Both algorithms use two pruning strategies to reduce spatial search. Sushil et al. [8] used fuzzy association rule mining to evaluate and predict students’ performance at the end of the semester. In this way, students who need personal attention to reduce failure rates and take appropriate action for next semester’s exams are identified. Mohamed et al. [9] proposed a new neutrosophic association rule algorithm. This algorithm uses a new way to generate association rules by dealing with membership, uncertainty and non-membership functions of items. 
Effective decisions are made by considering all fuzzy association rules. Wu et al. [10] optimized the classical Apriori algorithm. Firstly, Principal Component Analysis is used to optimize multi-source parameters and then extended to transactions with fuzzy attributes. Finally, the traditional IEC ratio is combined as the feature quantity to extract the rule together. Xu et al. [11] proposed a spatial association rule mining method that takes fuzzy attributes into account. The method introduces the theory of fuzzy sets and transforms fuzzy spatial attributes into fuzzy values represented by membership functions. In addition, the improved FP-Growth fuzzy association rule mining algorithm is used to extract association rules. Régis Pierrard et al. [12] proposed a fuzzy adaptive closure algorithm that relies on closure of item sets. This algorithm has a small number of traversals for the data sets and is applicable to the case where variables are related. To improve the accuracy of transformer diagnosis, Guo et al. [13] established a diagnosis model based on fuzzy association rules combined with case-based reasoning (CBR) to evaluate the failure types, fault locations, and cause of breakdown in power transformers. Zhang et al. [14] proposed a mining method based on fuzzy genetic algorithm for association rules. This method fuses association rules with fuzzy genetic algorithm by constructing a mining model. Hou et al. [15] proposed a fuzzy association rule mining algorithm based on master-slave architecture combined with genetic algorithm. This algorithm can reduce the complexity of fitness evaluation. Moreover, the acceleration degree of this algorithm can be close to linear increase when the modern number is large. Zhang et al. [16] proposed a quantitative data mining algorithm based on improved multi-level fuzzy association rules. Compared with other algorithms, this algorithm has prominent advantages in mining accuracy and operation time. Kang et al. [17] used competitive aggregation algorithm to cluster the processed data and FP-Tree algorithm to mine association rules during data mining.
One of the most commonly used algorithms for mining association rules is the Apriori algorithm first proposed by Rakesh Agrawal and Ramakrishnan Srikant [18]. But this algorithm cannot process quantitative data. To solve this problem, some researchers propose to combine fuzzy C-means clustering (FCM) algorithm with Apriori algorithm. Dang Qinhua [19] proposed an improved clustering algorithm DFCM. Based on this method, the Apriori algorithm is improved for the temperature control link of the decomposition furnace in the predecomposition system. Sowan et al. [20] studied two kinds of fuzzy association rule mining models for improving prediction performance. One is the FCM-Apriori model, the other is the FCM-MSapriori model. The two models are used to predict the relatively similar performance of the road. Dhiraj et al. [21] discovered and briefly discussed several different techniques and algorithms for mining fuzzy association rules. In addition, the current trend and future research scope were introduced from the field of fuzzy association rule mining. Zhang [22] proposed a fuzzy association rule mining algorithm. This algorithm combines the adaptive kernel-based FCM (KFCM) clustering algorithm based on fruit fly algorithm with the Apriori algorithm based on location storage. The effectiveness and feasibility of the algorithm are also proved. Yan et al. [23] introduced the fuzzy C-means algorithm into the association rule algorithm and proposed a FARMA algorithm to mine property insurance data. And the cross-selling strategies of property insurance customers were analyzed. Akbar et al. [24] used the fuzzy C-means clustering algorithm and Apriori algorithm to assess forest fire risk in western Iran. Chen et al. [25] proposed an intuitionistic fuzzy vector association rule mining method based on dual fuzzy simulation. 
The dual fuzzy idea is introduced to improve the traditional intuitionistic fuzzy set, and a two-layer association rule mining framework is established.
Some researchers also apply association rule mining algorithms to insurance fraud detection. Shen [26] designed an anti-fraud system scheme of automobile insurance and an anti-fraud system model of automobile insurance. The Apriori algorithm of association rules in data mining technology is used to mine the fraudulent behaviors of automobile insurance regularly. Zhou [27] proposed the architecture of intelligent medical insurance audit system based on BP neural network and association rules, and then designed the support base of intelligent medical insurance audit system. Aayushi et al. [28] analyzed the results of rule-based mining and applied organizational decision rules and k-means clustering to periodic claims outlier anomaly detection. They also applied association rule mining based on Gaussian distribution to disease-based outlier anomaly detection. Du [1] divided the sample set and found out the frequent item sets of black samples based on Apriori algorithm and FP-growth algorithm. The frequent item sets were verified in the white sample. Association rules between frequent item sets were mined to identify fraud claims, and finally, the fraud rate of gray sample set was estimated.
Fraud damages the reputation of the industry and severely affects the healthy and sustainable development of the insurance industry. Research on automobile insurance fraud therefore has economic value and practical significance for both policyholders and insurance companies. However, current domestic insurance fraud identification technology remains immature, and the management practices related to fraud identification and fraud rate estimation are underdeveloped, so research on insurance fraud identification faces numerous difficulties. The motivation of this paper is to propose a method that can mine automobile insurance fraud information and identify fraudulent claims in automobile insurance big data. Identifying automobile insurance fraud is a data mining process, in which association rule mining can uncover the hidden information in automobile insurance data and identify fraudulent claims more effectively. The Apriori algorithm is one of the most commonly used algorithms for mining association rules. However, it can only mine Boolean data, and its efficiency is low on large databases. To overcome these limitations, fuzzy attributes are introduced into association rule mining, and an Apriori algorithm based on simulated annealing genetic fuzzy C-means (SAGFCM-Apriori) is proposed in this study. The traditional Apriori algorithm must partition numeric data and convert them into Boolean data, because it can only process Boolean data. Such crisp partitioning is rigid and leads to sample attribution problems at the partition boundaries. Therefore, the FCM algorithm optimized by the simulated annealing genetic algorithm is first used to divide the numerical data into fuzzy partitions. The fuzzy version of the original data is obtained, and the rigid partition problem is solved. The SAG algorithm also mitigates the tendency of the FCM clustering centers to fall into a local optimum. Secondly, the improved Apriori algorithm is employed to mine the association rules in the fuzzy data set. The connection and pruning process is improved to reduce the time the Apriori algorithm spends mining association rules, and the I/O load of the algorithm is reduced accordingly. Finally, examples are provided to verify the effectiveness of the proposed algorithm. The association rules of fraudulent claims are extracted to identify hidden automobile insurance fraud claims and provide a reference for fraud identification. The research results have practical value for insurance companies, mainly in reducing the cost of fraudulent claims and improving the accuracy of claim review.
FCM algorithm and optimization
The FCM algorithm
Fuzzy C-means clustering is a widely used clustering method. It was developed from the hard C-means (HCM) clustering algorithm and is based on fuzzy set theory and the optimization of an objective function. FCM takes the minimization of the within-cluster sum of squared distances as its criterion, and uses membership degrees to express the degree to which each data point belongs to each cluster. Suppose n data instances $X = \{X_i \mid X_i \in R,\ i = 1, 2, \cdots, n\}$ are divided into c ($2 \le c \le n$) categories. The clustering center of each cluster is obtained such that the value function of the non-similarity index is minimized. $\{A_1, A_2, \cdots, A_c\}$ represents the corresponding categories, and $\{v_1, v_2, \cdots, v_c\}$ is the set of clustering centers of all categories. $U = (\mu_{ik} \mid i = 1, 2, \cdots, n;\ k = 1, 2, \cdots, c)$ is the membership matrix, where $\mu_{ik}$ represents the membership of sample $X_i$ in category $A_k$. The objective function is shown in Equation (1) [29].
For the singular case, $I_k = \{i \mid 2 \le c \le n;\ d_{ik} = 0\}$ is assumed, where $d_{ik}$ denotes the distance between sample $X_i$ and clustering center $v_k$, and $\mu_{ik} = 1$ holds for all $i \in I_k$.
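For reference, the standard FCM formulation from the literature can be written as below. This is a sketch under stated assumptions rather than a reproduction of the paper's typesetting: the fuzzifier exponent is written as $m$ ($m > 1$), $d_{ik}$ is taken to be the Euclidean distance, and the equation numbers are inferred from the way the text later uses Equations (3) and (4) for the membership and clustering center updates.

```latex
\begin{align}
J(U, V) &= \sum_{i=1}^{n} \sum_{k=1}^{c} \mu_{ik}^{m}\, d_{ik}^{2},
  \qquad d_{ik} = \lVert X_i - v_k \rVert \tag{1} \\
\sum_{k=1}^{c} \mu_{ik} &= 1, \qquad \mu_{ik} \in [0, 1] \tag{2} \\
\mu_{ik} &= \left[ \sum_{j=1}^{c}
  \left( \frac{d_{ik}}{d_{ij}} \right)^{2/(m-1)} \right]^{-1} \tag{3} \\
v_k &= \frac{\sum_{i=1}^{n} \mu_{ik}^{m} X_i}{\sum_{i=1}^{n} \mu_{ik}^{m}} \tag{4}
\end{align}
```

Under this formulation, Equation (3) is undefined when some $d_{ik} = 0$, which is why the singular case is handled separately by setting the corresponding membership to 1.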
Clustering centers and membership degrees are updated alternately by Equations (3) and (4) until the algorithm converges. At convergence, the clustering centers of all categories and the membership degree of each sample in each category are obtained, and the fuzzy clustering division is complete. Although FCM searches quickly, it is a local search algorithm and is very sensitive to the initial values of the clustering centers. If the initial values are not chosen properly, it converges to a local minimum. Improvements addressing this issue are described below.
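The alternating update described above can be sketched in a few lines of NumPy. This is a minimal illustration, not the authors' implementation: it assumes the standard FCM update rules with Euclidean distance and a fuzzifier `m = 2`, and the function names, the random membership initialization, and the stopping tolerance are all hypothetical choices.

```python
import numpy as np

def update_centers(X, U, m):
    """Center update: membership-weighted mean of the samples."""
    Um = U ** m                                    # shape (n, c)
    return (Um.T @ X) / Um.sum(axis=0)[:, None]    # shape (c, p)

def update_memberships(X, V, m, eps=1e-12):
    """Membership update from sample-to-center distances."""
    d = np.linalg.norm(X[:, None, :] - V[None, :, :], axis=2)  # (n, c)
    d = np.fmax(d, eps)                            # guard the d_ik = 0 case
    inv = d ** (-2.0 / (m - 1.0))
    return inv / inv.sum(axis=1, keepdims=True)    # rows sum to 1

def objective(X, U, V, m):
    """Sum of membership-weighted squared distances."""
    d = np.linalg.norm(X[:, None, :] - V[None, :, :], axis=2)
    return float(((U ** m) * d ** 2).sum())

def fcm(X, c, m=2.0, max_iter=200, tol=1e-6, seed=0):
    rng = np.random.default_rng(seed)
    U = rng.random((X.shape[0], c))                # assumed random initialization
    U /= U.sum(axis=1, keepdims=True)
    for _ in range(max_iter):
        V = update_centers(X, U, m)
        U_new = update_memberships(X, V, m)
        converged = np.abs(U_new - U).max() < tol
        U = U_new
        if converged:
            break
    V = update_centers(X, U, m)
    return U, V, objective(X, U, V, m)
```

Because the result depends on the starting memberships, running `fcm` with different `seed` values on harder data can yield different objective values, which is the sensitivity to initialization discussed in the text.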
As noted, FCM is very sensitive to the initial values of the clustering centers and may fall into a local optimal solution, which degrades the clustering results. Therefore, the simulated annealing algorithm is combined with the genetic algorithm to form the simulated annealing genetic (SAG) algorithm, which is then applied to FCM.
The simulated annealing algorithm has long been applied successfully to combinatorial optimization. Its basic idea is to find the minimum solution by simulating the annealing process of high-temperature objects in nature. First, an initial solution is generated as the current solution. A new solution is accepted as the current solution when its evaluation function value is less than that of the current solution; otherwise, it is accepted as the current solution only with a certain probability. Iterating in this way allows the algorithm to escape local optimal solutions.
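The acceptance rule described above can be illustrated with a short sketch. It assumes the usual Metropolis form, in which a worse solution is accepted with probability exp(-Δ/T), together with a geometric cooling schedule; the function names, the one-dimensional search space, and all parameter defaults are hypothetical and chosen only for illustration.

```python
import math
import random

def metropolis_accept(f_new, f_cur, T, rng):
    """Always accept an improvement; accept a worse solution
    with probability exp(-(f_new - f_cur) / T)."""
    if f_new < f_cur:
        return True
    return rng.random() < math.exp(-(f_new - f_cur) / T)

def simulated_annealing(f, x0, T0=10.0, T_end=1e-3, k=0.95, step=1.0, seed=0):
    rng = random.Random(seed)
    x, fx = x0, f(x0)
    best_x, best_f = x, fx
    T = T0
    while T > T_end:
        x_new = x + rng.uniform(-step, step)   # random neighbour of the current solution
        f_new = f(x_new)
        if metropolis_accept(f_new, fx, T, rng):
            x, fx = x_new, f_new
            if fx < best_f:                    # remember the best solution seen so far
                best_x, best_f = x, fx
        T *= k                                 # geometric cooling
    return best_x, best_f
```

At high temperature almost any move is accepted, so the search can leave a local minimum; as T decreases the rule becomes increasingly greedy.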
The genetic algorithm is a computational model based on natural selection and genetic mechanisms in the biological world. It simulates the process of biological evolution in nature to search for the optimal solution. Firstly, the genetic algorithm encodes the problem parameters into chromosomes. Secondly, selection, crossover, mutation and similar chromosome operations are carried out repeatedly. Finally, chromosomes that meet the target conditions are produced. However, the genetic algorithm is prone to premature convergence and may fall into a local optimal solution.
The SAG algorithm first generates the initial population. The crossover and mutation operators are then applied to the population. Finally, the simulated annealing operation is applied, and a replication strategy based on the Metropolis acceptance criterion is used to generate the next generation. The acceptance criterion lets the best individuals in the population enter the next generation and accepts inferior solutions with a certain probability. This preserves the diversity of the population and helps the population evolve toward the global optimal solution. The combination of the simulated annealing algorithm and the genetic algorithm is applied to FCM cluster analysis. The two algorithms complement each other and effectively overcome the premature convergence of genetic algorithms [30]. In addition, SAGFCM can design the genetic encoding and fitness function according to the clustering problem so as to converge effectively and quickly to the global optimal solution. Thus, the tendency of FCM to converge to a local optimum is overcome.
The steps of the SAGFCM algorithm are as follows:
1) Initialize the control parameters: population size sizepop, maximum number of generations MAXGEN, crossover probability $P_c$, mutation probability $P_m$, initial temperature $T_0$, temperature cooling coefficient k, and termination temperature $T_{end}$.
2) Randomly initialize c clustering centers and generate the initial population Chrom. For each clustering center, Equation (3) is used to calculate the membership of each sample and the fitness value $f_i$ of each individual, where $i = 1, 2, \cdots, sizepop$.
3) Set the loop count variable gen = 0.
4) Carry out selection, crossover, mutation and other genetic operations on the population Chrom. For the newly generated individuals, Equations (3) and (4) are used to calculate the membership of each sample, the c clustering centers, and the fitness value of each individual.
5) If $T_i < T_{end}$, the algorithm ends and the global optimal solution is returned. Otherwise, perform the cooling operation $T_{i+1} = kT_i$ and proceed to step 3.
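The overall loop structure of these steps can be sketched as follows. This is an illustrative sketch under several assumptions that the text does not specify: individuals are encoded as real-valued center matrices, the FCM objective serves as the fitness (lower is better), selection is a binary tournament, crossover is arithmetic, mutation is Gaussian, and a child replaces its parent according to the Metropolis rule. The parameter names follow the text; everything else, including the default values, is hypothetical.

```python
import numpy as np

def memberships(X, V, m=2.0, eps=1e-12):
    """Membership update from distances; also returns the distances."""
    d = np.fmax(np.linalg.norm(X[:, None, :] - V[None, :, :], axis=2), eps)
    inv = d ** (-2.0 / (m - 1.0))
    return inv / inv.sum(axis=1, keepdims=True), d

def fcm_step(X, V, m=2.0):
    """One membership + center update; returns the new centers and their objective."""
    U, _ = memberships(X, V, m)
    Um = U ** m
    V_new = (Um.T @ X) / Um.sum(axis=0)[:, None]
    U_new, d = memberships(X, V_new, m)
    return V_new, float(((U_new ** m) * d ** 2).sum())

def sagfcm(X, c, sizepop=10, MAXGEN=5, Pc=0.7, Pm=0.1,
           T0=100.0, k=0.8, T_end=1.0, m=2.0, seed=0):
    rng = np.random.default_rng(seed)
    lo, hi = X.min(axis=0), X.max(axis=0)
    # Initial population: random centers, refined and scored once
    pop, J = [], []
    for _ in range(sizepop):
        V, j = fcm_step(X, rng.uniform(lo, hi, size=(c, X.shape[1])), m)
        pop.append(V)
        J.append(j)
    best = int(np.argmin(J))
    best_V, best_J = pop[best].copy(), J[best]
    T = T0
    while T >= T_end:                          # annealing loop
        for gen in range(MAXGEN):              # genetic loop at a fixed temperature
            for i in range(sizepop):
                a, b = rng.integers(0, sizepop, size=2)
                mate = pop[a] if J[a] < J[b] else pop[b]   # tournament selection
                child = pop[i].copy()
                if rng.random() < Pc:          # arithmetic crossover
                    w = rng.random()
                    child = w * child + (1.0 - w) * mate
                mask = rng.random(child.shape) < Pm        # Gaussian mutation
                child = child + mask * rng.normal(0.0, 0.1 * (hi - lo), child.shape)
                child, j_new = fcm_step(X, child, m)
                # Metropolis replacement of the parent by the child
                if j_new < J[i] or rng.random() < np.exp(-(j_new - J[i]) / T):
                    pop[i], J[i] = child, j_new
                if j_new < best_J:             # keep the best individual ever seen
                    best_V, best_J = child.copy(), j_new
        T *= k                                 # cooling: T_{i+1} = k * T_i
    U, _ = memberships(X, best_V, m)
    return U, best_V, best_J
```

Tracking the best individual separately is a design choice in this sketch: it ensures that a good solution is never lost when the Metropolis rule accepts a worse child.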
Fuzzy association rule algorithm
Apriori algorithm and improvement
At present, there are many algorithms for mining association rules, among which the Apriori algorithm is the most influential; most other algorithms are extensions or variants of it. Table 1 compares the advantages and disadvantages of the commonly used algorithms for mining association rules. By comparison, the Apriori algorithm is more suitable for mining association rules of automobile insurance fraud. The Apriori algorithm mines frequent item sets for Boolean association rules. All item sets whose support degree is greater than the minimum support degree are called frequent item sets. The algorithm uses an iterative, layer-by-layer search: in the first scan of database D, the support degree of each data item in the item set I is calculated; the frequent 1-item sets $L_1$ consist of the items whose support is greater than the minimum support degree; the frequent 2-item sets $L_2$ are then generated from $L_1$; in the subsequent kth scan, the frequent (k-1)-item sets $L_{k-1}$ generated by the (k-1)th scan are taken as the seed sets and connected to generate the candidate k-item sets $C_k$; the data set is then scanned to calculate the support degrees of all candidates in $C_k$; the frequent k-item sets $L_k$ are formed from the candidates whose support degree is greater than the minimum support, and $L_k$ serves as the seed set for the next scan. The process is repeated until no new frequent item sets are created. Strong association rules, which must satisfy both the minimum support and the minimum confidence, are finally generated from the frequent item sets.
In the Apriori algorithm, the core idea of finding frequent k-item sets relies on the following two basic properties: 1) every subset of a frequent item set must itself be frequent; 2) every superset of a non-frequent item set must itself be non-frequent.
Comparison of advantages and disadvantages of commonly used association rule mining algorithms
In the kth cycle, the Apriori algorithm first generates the set $C_k$ of candidate k-item sets. Each item set in $C_k$ is generated by a (k-2)-connection between two frequent item sets that belong to $L_{k-1}$ and differ in only one item. The item sets in $C_k$ are candidates for frequent item sets, and the final frequent item sets $L_k$ must be a subset of $C_k$. Each element of $C_k$ must be validated against the transaction database to decide whether it is added to $L_k$. This validation is the performance bottleneck of the algorithm, because it requires multiple scans of a potentially large database. For example, if the longest frequent item set contains 10 items, the database must be scanned 10 times, which imposes a large I/O load.
The Apriori algorithm scans the database multiple times and generates a large number of candidate item sets during the iteration. These two problems form the performance bottleneck of the algorithm. The algorithm is therefore modified as follows to improve its efficiency.
A unique serial number TID is defined for each data transaction t in database D. A k-item set $R_k$ is defined as $R_k = \langle X_k, \mathrm{TIDS}(X_k) \rangle$, where $X_k = (i_{j_1}, i_{j_2}, \cdots, i_{j_k})$, $i_{j_1}, i_{j_2}, \cdots, i_{j_k} \in I$, $j_1 < j_2 < \cdots < j_k$, and $\mathrm{TIDS}(X_k)$ is the set of serial numbers of all transactions t in the database that contain $X_k$. The support degree of a k-item set is $\mathrm{support}(R_k) = (|\mathrm{TIDS}(X_k)| / |D|) \times 100\%$, so the support number of $R_k$ is $\mathrm{supnum}(R_k) = |D| \times \mathrm{support}(R_k) = |\mathrm{TIDS}(X_k)|$.
The improved Apriori algorithm still adopts the iterative method of layer-by-layer search. The connection and pruning operations in the iterative process are defined as follows:
Connection: assume two (k-1)-item sets $L_{k-1}(i) = \langle X_{k-1}, \mathrm{TIDS}(X_{k-1}) \rangle \in L_{k-1}$ and $L_{k-1}(j) = \langle Y_{k-1}, \mathrm{TIDS}(Y_{k-1}) \rangle \in L_{k-1}$, with $i < j$. If the first k-2 items of $X_{k-1}$ and $Y_{k-1}$ are equal, that is, $X_{k-1}[k-2] \equiv Y_{k-1}[k-2]$, and $X_{k-1}[k-1] < Y_{k-1}[k-1]$, which guarantees that no item set is generated twice, then the two (k-1)-item sets are connected: $L_{k-1}(i) \bowtie L_{k-1}(j) = \langle X_{k-1} \cup Y_{k-1}, \mathrm{TIDS}(X_{k-1}) \cap \mathrm{TIDS}(Y_{k-1}) \rangle = \langle X_k, \mathrm{TIDS}(X_k) \rangle = R_k \in C_k$, where $C_k$ is a superset of the frequent k-item sets. Otherwise, the connection is not performed, because the resulting item set would be either a duplicate or a non-frequent item set. The amount of computation is thereby reduced.
Pruning: calculate the support number of each k-item set. According to the above, $\mathrm{supnum}(R_k) = |\mathrm{TIDS}(X_k)|$. The calculation does not require another scan of database D, which avoids I/O operations and improves the efficiency of the algorithm. If $\mathrm{supnum}(R_k) \ge \mathrm{minsupnum}$, then $\langle X_k, \mathrm{TIDS}(X_k) \rangle \in L_k$, that is, the k-item set $R_k$ is a frequent k-item set; otherwise, $R_k$ is removed from $C_k$.
The Apriori algorithm is improved during connection and pruning. The improved algorithm only performs one database scan when generating frequent 1-item sets, and no database scan is required in subsequent iterations. The improved algorithm reduces the I/O load and greatly increases the discovery speed of frequent item sets. In addition, the iterative process of the algorithm does not require complicated calculation. That means the connection of item sets can be completed using only the union and intersection operations of the sets. This makes the algorithm easy to implement.
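The TID-set idea described above can be sketched as follows. This is a minimal illustration rather than the authors' code: it assumes items are comparable values such as strings, represents item sets as sorted tuples, and uses hypothetical names (`tidset_apriori`, `min_supnum`). The database is read once to build the TID sets of single items; every later support number is the size of a set intersection.

```python
def tidset_apriori(transactions, min_supnum):
    """Return {item set (sorted tuple): support number} for all frequent item sets."""
    # Single database scan: TID set of every individual item
    tids = {}
    for tid, t in enumerate(transactions):
        for item in t:
            tids.setdefault((item,), set()).add(tid)
    level = {x: s for x, s in tids.items() if len(s) >= min_supnum}
    frequent = dict(level)
    while level:
        keys = sorted(level)
        nxt = {}
        for i, x in enumerate(keys):
            for y in keys[i + 1:]:
                # Connection: the first k-2 items must match. Keys are sorted,
                # so once the prefix differs no later key can share it.
                if x[:-1] != y[:-1]:
                    break
                common = level[x] & level[y]        # TIDS(X) ∩ TIDS(Y)
                # Pruning: support number is |TIDS|, no rescan of the database
                if len(common) >= min_supnum:
                    nxt[x + (y[-1],)] = common
        frequent.update(nxt)
        level = nxt
    return {itemset: len(s) for itemset, s in frequent.items()}
```

For a toy database such as `[{"a", "b", "c"}, {"a", "b"}, {"a", "c"}]`, calling `tidset_apriori(db, 2)` reads the transactions once and derives every larger item set from intersections alone. The trade-off is memory: the TID sets of all frequent item sets at the current level must be held at the same time.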
The traditional Apriori algorithm has two disadvantages that directly affect its efficiency. The first is that the transaction data set must be scanned many times when looking for frequent item sets, which requires a heavy I/O load and a lot of computation time. The second is that repeated scanning of the database can produce a large number of useless candidate sets. In addition, the traditional fuzzy association rule mining algorithm requires domain experts to give the fuzzy membership functions and divide the fuzzy partitions in advance [31]. For massive databases with no prior knowledge, it is difficult and laborious for experts to specify the fuzzy sets in advance. For these reasons, an improved Apriori algorithm based on the SAGFCM algorithm (SAGFCM-Apriori) is proposed. SAGFCM-Apriori is an association rule mining algorithm with fuzzy concepts, and it resolves the rigid boundary partition problem of the Apriori algorithm. Some definitions of fuzzy association rules are as follows [21]:
Definition 1: assume that $I = \{i_1, i_2, \cdots, i_n\}$ is a set of n different items. Each record t in database $D = \{t_1, t_2, \cdots, t_m\}$ is a set of items in I (a subset of the data items in I), and t has a unique identifier TID. If $X \subseteq I$ and $X \subseteq t$, then record t is said to contain X.
Definition 2: assume that $R = \{r_1, r_2, \cdots, r_h\}$ is a set of h distinct fuzzy intervals. If $A = \{a_1, a_2, \cdots, a_q\} \subseteq R$, then A is a set of fuzzy intervals in R.
Definition 3: $\mu_A(X) = \bigwedge_i \mu_{a_i}(x_i)$ represents the membership degree of X in fuzzy set A. Every $x_i$ has a unique $a_j$ in A that corresponds to it.
Definition 4: a fuzzy association rule is defined as an implication of the form $X\_A \Rightarrow Y\_B$, where $X \subseteq I$, $Y \subseteq I$, $X \cap Y = \emptyset$, $A \subseteq R$, $B \subseteq R$, and $a_i$ and $b_j$ are the fuzzy intervals corresponding to $x_i$ and $y_j$, respectively.
Definition 5: the support degree and confidence degree of the fuzzy association rule $X\_A \Rightarrow Y\_B$ are calculated as Equations (5) and (6):
The Cartesian product $A \times B$ of two fuzzy sets A and B is usually defined by the membership function $(X, Y) \mapsto \mu_A(X) \otimes \mu_B(Y)$ [32], where $\mu_A(X)$ denotes the membership degree of X in fuzzy set A, $\mu_B(Y)$ denotes the membership degree of Y in fuzzy set B, $\otimes$ is a t-norm, and $|D|$ is the total number of records in the database.
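To make these quantities concrete, the sketch below computes a fuzzy support and confidence for a single rule. It rests on assumptions: it uses the common definitions support = Σ_t μ_A(X) ⊗ μ_B(Y) / |D| and confidence = Σ_t μ_A(X) ⊗ μ_B(Y) / Σ_t μ_A(X), takes the minimum as both the ∧ of Definition 3 and the t-norm ⊗, and all function and item names are hypothetical.

```python
import numpy as np

def itemset_membership(FD, items):
    """mu_A(X) for every record: minimum over the memberships of the chosen fuzzy items."""
    return np.min(np.stack([FD[name] for name in items]), axis=0)

def fuzzy_support(FD, antecedent, consequent, tnorm=np.minimum):
    mu_a = itemset_membership(FD, antecedent)
    mu_b = itemset_membership(FD, consequent)
    return float(tnorm(mu_a, mu_b).sum() / len(mu_a))    # divide by |D|

def fuzzy_confidence(FD, antecedent, consequent, tnorm=np.minimum):
    mu_a = itemset_membership(FD, antecedent)
    mu_b = itemset_membership(FD, consequent)
    return float(tnorm(mu_a, mu_b).sum() / mu_a.sum())

# Hypothetical fuzzy database: one membership vector (over 4 records) per fuzzy item
FD = {
    "x1=short": np.array([0.9, 0.8, 0.1, 0.2]),
    "x2=large": np.array([0.7, 0.9, 0.2, 0.1]),
    "fraud=yes": np.array([1.0, 1.0, 0.0, 0.0]),
}
sup = fuzzy_support(FD, ["x1=short", "x2=large"], ["fraud=yes"])
conf = fuzzy_confidence(FD, ["x1=short", "x2=large"], ["fraud=yes"])
```

Passing `tnorm=np.multiply` instead would use the product t-norm; which t-norm is appropriate is a modelling choice that this sketch does not settle.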
The first step in mining fuzzy association rules is to fuzzify the data in the database, which yields the fuzzified version of the original data to be processed [33]. Both the fuzzification and the generation of the fuzzy data set are carried out by the SAGFCM preprocessing technique described above. The fuzzy database is then mined using the improved Apriori algorithm. The process of mining fuzzy association rules is shown in Fig. 1.

Fuzzy association rule mining flow chart.
The calculation steps of the SAGFCM-Apriori algorithm are as follows:
1) The original database D is scanned by the SAGFCM algorithm and the data attribute values are classified. Boolean data are stored in database BD, and numeric data are stored in database SD.
2) The clustering calculation is carried out. For Boolean data, the fuzzy membership is μ = 1 or μ = 0. For numeric data, each numeric attribute in database SD is converted to a fuzzy record according to the fuzzy partition. Each fuzzy record contains the fuzzy attributes of the data and the corresponding fuzzy membership μ (μ ∈ [0, 1]).
3) The intermediate data set D1 is generated. D1 is updated iteratively until all attribute data in attribute set A are fuzzified, and the fuzzy version FD of the data set is obtained.
4) The fuzzy support degrees of all fuzzy 1-item sets are calculated in FD, and all fuzzy frequent 1-item sets are obtained.
5) All fuzzy frequent 1-item sets are combined to obtain the fuzzy candidate 2-item sets, and the fuzzy support degrees of all fuzzy candidate 2-item sets are calculated. The candidates whose support degrees are less than the minimum support are removed, and the fuzzy frequent 2-item sets are obtained.
6) Similar steps are performed until all fuzzy frequent k-item sets are found.
7) The strong fuzzy association rules whose confidence is not less than the minimum confidence given by the user are generated from all fuzzy frequent item sets.
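The level-wise part of this procedure, from fuzzy frequent 1-item sets upward, can be sketched as follows. This is an illustration under assumptions: the fuzzy data set FD is represented as a dictionary mapping hypothetical `(attribute, fuzzy label)` pairs to membership vectors, the minimum is used to combine memberships, and two fuzzy labels of the same attribute are never joined into one item set, a common convention that the text does not state explicitly.

```python
import numpy as np

def fuzzy_frequent_itemsets(FD, min_support):
    """FD: {(attribute, label): membership vector}. Returns {item set: fuzzy support}."""
    n = len(next(iter(FD.values())))
    # Fuzzy frequent 1-item sets: support is the mean membership over all records
    level = {(item,): np.asarray(mu, dtype=float) for item, mu in FD.items()}
    level = {k: mu for k, mu in level.items() if mu.sum() / n >= min_support}
    frequent = {k: float(mu.sum() / n) for k, mu in level.items()}
    while level:
        keys = sorted(level)
        nxt = {}
        for i, x in enumerate(keys):
            for y in keys[i + 1:]:
                if x[:-1] != y[:-1]:
                    break                      # sorted keys: shared prefix has ended
                if x[-1][0] == y[-1][0]:
                    continue                   # assumed rule: one label per attribute
                mu = np.minimum(level[x], level[y])
                if mu.sum() / n >= min_support:
                    nxt[x + (y[-1],)] = mu
        frequent.update({k: float(mu.sum() / n) for k, mu in nxt.items()})
        level = nxt
    return frequent
```

Because each item set carries its membership vector forward, the support of a longer item set is obtained from an element-wise minimum rather than from another pass over the raw data, mirroring the TID-set idea of the improved Apriori algorithm.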
Effectiveness analysis of SAGFCM algorithm
To test the optimization effect of the SAG algorithm on the FCM algorithm, the following data sets are used in this paper: the IRIS data set, the Wine data set, a partial data set of automobile insurance claims, and a two-dimensional random data set containing 400 data points. These data are processed by the FCM algorithm [24] and the SAGFCM algorithm respectively. The results are shown in Table 2 and Figs. 2–5. The IRIS data set is derived from the characteristics of the iris flower. It contains 150 rows, each containing four attributes, divided into 3 categories. The Wine data set is based on the characteristics of wine. It contains 178 rows, each containing 13 attributes, divided into 3 categories. Its first two dimensions are selected as the test data. The partial data set of automobile insurance claims contains 400 rows, each of which contains 3 numerical attributes, and is divided into 4 categories. The randomly generated data are divided into 4 categories. In Figs. 2–5, the triangle points in each figure represent the clustering centers. For the IRIS data set and the partial data set of automobile insurance claims in Figs. 2 and 4, the first two dimensions are used as the horizontal and vertical coordinates.
Comparison of objective function optimal values of FCM algorithm and SAGFCM algorithm

SAGFCM clustering result of IRIS data set.

SAGFCM clustering result of Wine data set.

SAGFCM clustering result for Partial data set of automobile insurance claims.

SAGFCM clustering result for Random data set.
Table 2 shows that the optimal value of the objective function obtained by the SAGFCM algorithm is less than that obtained by the FCM algorithm; in other words, SAGFCM finds a better solution. In addition, as the data volume increases, the optimization effect of the SAG algorithm becomes more pronounced. The main reason is that the plain FCM algorithm tends to converge to a local optimal solution when processing large data, whereas the hybrid algorithm formed by combining the simulated annealing algorithm and the genetic algorithm can effectively overcome this problem. Figures 2–5 show the clustering results of each data set using the SAGFCM algorithm. Different points represent different categories, and triangular points represent cluster centers. The data sets in Figs. 2 and 3 are divided into 3 categories, and the data sets in Figs. 4 and 5 are divided into 4 categories. The clustering boundaries of the data sets in Figs. 2 and 5 are clear; there are individual abnormal points in Figs. 3 and 4, but the overall clustering effect is good.
The clustering accuracy of SAGFCM is compared with that of other common clustering algorithms to further verify the optimization effect of the SAG algorithm on the FCM algorithm [34]. IRIS and Wine, two data sets that contain classification labels, are used as test data sets. These two data sets are clustered with K-means, K-medoids, FCM [24] and SAGFCM, and the clustering accuracy is calculated. The results are shown in Table 3. They show that the fuzzy clustering algorithms (FCM, SAGFCM) achieve higher clustering accuracy than the hard clustering algorithms, and that SAGFCM achieves the highest accuracy among the compared algorithms. This indicates that the SAG algorithm has an optimization effect on the FCM algorithm.
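The text does not state how clustering accuracy is computed. One common convention, assumed here purely for illustration, is to map cluster indices to class labels by the permutation that maximizes agreement and then report the fraction of correctly assigned samples. The function name is hypothetical, and the exhaustive search over permutations is only practical for a small number of clusters such as the 3 classes of IRIS and Wine.

```python
from itertools import permutations

import numpy as np

def clustering_accuracy(y_true, y_pred, c):
    """Best agreement between cluster indices and class labels (both in 0..c-1)
    over all one-to-one relabelings of the clusters."""
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    best = 0.0
    for perm in permutations(range(c)):
        mapped = np.array([perm[k] for k in y_pred])   # relabel clusters
        best = max(best, float((mapped == y_true).mean()))
    return best
```

For a fuzzy clustering result, `y_pred` would be obtained by assigning each sample to the cluster with the largest membership.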
Comparison table of clustering accuracy of each algorithm under different data sets
Before fraud identification, data sets of different sizes are used to verify the improvement achieved by the improved Apriori method proposed in this paper. The data sets are processed by the Apriori algorithm and the improved Apriori algorithm respectively, and the running times are recorded. The result is shown in Fig. 6. The abscissa in Fig. 6 is the scale of the test data: 20, 40, 60, 80 and 100 rows. The test shows that the improved Apriori algorithm does not change the final result and reduces the running time. In addition, the improvement becomes more pronounced as the data size increases.

Mining time comparison between Apriori algorithm and improved Apriori algorithm.
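For orientation, the kind of measurement described above can be sketched as follows. This is the textbook Apriori procedure with a simple wall-clock timer, not the improved variant proposed in the paper. The function name, the toy transactions, and the use of `time.perf_counter` are assumptions made for illustration.

```python
import time
from itertools import combinations

def apriori(transactions, min_support):
    """Return {frozenset(items): support} for all frequent item sets."""
    tx = [frozenset(t) for t in transactions]
    n = len(tx)
    # Level-1 candidates: every single item that occurs in the data.
    candidates = {frozenset([item]) for t in tx for item in t}
    frequent = {}
    k = 1
    while candidates:
        # Count the support of each candidate by scanning the transactions.
        level = {}
        for cand in candidates:
            sup = sum(cand <= t for t in tx) / n
            if sup >= min_support:
                level[cand] = sup
        frequent.update(level)
        # Join step: merge frequent k-item sets into (k+1)-item candidates.
        k += 1
        joined = {a | b for a in level for b in level if len(a | b) == k}
        # Prune step: keep a candidate only if all its k-item subsets are frequent.
        candidates = {c for c in joined
                      if all(frozenset(s) in level for s in combinations(c, k - 1))}
    return frequent

# Hypothetical toy transactions; the item labels are placeholders.
rows = [{"A12", "A13"}, {"A12", "A13", "A7"}, {"A12"}, {"A12", "A13"}]
start = time.perf_counter()
result = apriori(rows, min_support=0.7)
elapsed = time.perf_counter() - start
print(result, f"{elapsed:.6f} s")
```

Running two implementations on the same inputs and comparing both `result` and `elapsed` corresponds to the check described in the text: identical frequent item sets, shorter running time.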
MATLAB R2017a is used to mine fraud-identification association rules from the automobile insurance claim settlement data collected from an insurance company. The data cover 700 claims. Missing values are filled by linear interpolation. All claim settlement samples have been preliminarily screened by the management system of the insurance company. The data contain 342 samples of fraudulent claims and 358 samples of non-fraudulent claims.
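A minimal sketch of linear interpolation for missing values is given below. The text does not specify along which ordering of the records the interpolation is performed, so the sketch simply assumes an ordered sequence of numeric values with gaps marked as `None`. The function name and sample values are hypothetical.

```python
def fill_linear(values):
    """Fill None gaps by linear interpolation between the nearest known neighbours."""
    filled = list(values)
    known = [i for i, v in enumerate(filled) if v is not None]
    for left, right in zip(known, known[1:]):
        step = (filled[right] - filled[left]) / (right - left)
        for i in range(left + 1, right):
            filled[i] = filled[left] + step * (i - left)
    # Gaps before the first or after the last known value are left as None.
    return filled

print(fill_linear([10.0, None, None, 16.0, None, 20.0]))
# → [10.0, 12.0, 14.0, 16.0, 18.0, 20.0]
```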
The automobile insurance claim settlement data of the insurance company includes 11 characteristic indicators, which are used in this paper to identify fraudulent claims, as shown in Table 4.
Specific contents of the characteristic indicators
First, the data are scanned and divided into numerical data sets and Boolean data sets [35], and each kind is then processed separately. When an attribute is Boolean, its fuzzy membership function takes the value μ = 1 or μ = 0, so Boolean data can be divided directly into fuzzy partitions to obtain fuzzy versions of the Boolean attribute values. For numerical data, the SAGFCM algorithm is used to generate the membership matrix. The data are divided into fuzzy record sets by fuzzy partition. Each fuzzy record contains the fuzzy attributes of the data and the corresponding fuzzy membership function μ (μ ∈ [0, 1]). Where necessary, a threshold is set for μ so that only numerical attribute data whose membership exceeds the threshold are retained. With the SAGFCM algorithm, any original data set with Boolean or numerical attributes can thus be transformed into a fuzzy set with fuzzy attributes and fuzzy records. Table 5 shows the attributes and attribute values of the original data sets.
Some original data attributes and their values
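The membership computation described above can be illustrated with the standard FCM membership formula for a scalar value, given cluster centers that are assumed to have been found already (for example by SAGFCM). The fuzzifier m = 2, the center values, the threshold, and all function names are assumptions made for this sketch.

```python
def fcm_memberships(x, centers, m=2.0):
    """Standard FCM membership of scalar x in each cluster (values sum to 1)."""
    dists = [abs(x - c) for c in centers]
    # A point lying exactly on a center belongs fully to that cluster.
    if 0.0 in dists:
        return [1.0 if d == 0.0 else 0.0 for d in dists]
    power = 2.0 / (m - 1.0)
    return [1.0 / sum((d_i / d_k) ** power for d_k in dists) for d_i in dists]

def boolean_membership(value):
    """Boolean attributes map directly to membership 1 or 0."""
    return 1.0 if value else 0.0

def above_threshold(memberships, threshold):
    """Keep only the partitions whose membership exceeds the threshold."""
    return {i: u for i, u in enumerate(memberships) if u > threshold}

# Hypothetical centers for one numerical attribute.
mu = fcm_memberships(4.0, centers=[2.0, 6.0, 12.0])
print(mu)
print(above_threshold(mu, threshold=0.3))
```

In this toy case the value lies midway between the first two centers, so it receives equal membership in both and only a small membership in the third.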
In the original data, x1, x2 and x3 are numerical data. SAGFCM is used to fuzzify them, and the results are shown in Table 6. During fuzzification, the SAGFCM algorithm clusters x1 into three types: short-term, medium-term and long-term. x2 is clustered into three types: small quantity, normal quantity and large quantity. x3 is clustered into three types: low frequency, moderate frequency and high frequency. Each indicator of each sample corresponds to a vector of membership function values. The component of this vector with the largest value identifies the fuzzy partition to which the sample indicator belongs.
Partial results of fuzzification of numerical data
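The maximum-membership assignment described above can be sketched in a few lines. The partition labels for x1 come from the text; the membership values and the function name are hypothetical.

```python
X1_LABELS = ("short-term", "medium-term", "long-term")

def fuzzy_partition(memberships, labels):
    """Return the label of the partition with the largest membership value."""
    best = max(range(len(memberships)), key=memberships.__getitem__)
    return labels[best]

# Hypothetical membership vector of one sample for indicator x1.
print(fuzzy_partition([0.12, 0.71, 0.17], X1_LABELS))
# → medium-term
```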
The remaining data are fuzzified in the same way as Boolean data. All data are thus fuzzified and converted into binary data, as shown in Table 7. For example, attribute x10 is split into three binary attributes: whether it is the first site (A21), whether it is the supplementary survey site (A22), and whether it is the no survey site (A23).
Partial binary data sets
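The split of x10 into A21, A22 and A23 amounts to a one-hot encoding. The sketch below assumes hypothetical category codes (`"first"`, `"supplementary"`, `"none"`) for the raw values of x10; only the binary attribute names A21 to A23 come from the text.

```python
# Hypothetical raw category codes mapped to the binary attributes named in the text.
X10_COLUMNS = {"first": "A21", "supplementary": "A22", "none": "A23"}

def encode_x10(value):
    """One-hot encode a raw x10 value into the binary attributes A21, A22, A23."""
    return {column: int(value == category) for category, column in X10_COLUMNS.items()}

print(encode_x10("supplementary"))
# → {'A21': 0, 'A22': 1, 'A23': 0}
```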
The obtained binary data sets are divided into fraud claim data sets and non-fraud claim data sets according to whether each sample is a fraud claim. On the basis of multiple tests, the minimum support is set to 0.7 and the minimum confidence is set to 0.7. The fraud data sets are fed to the improved Apriori algorithm to obtain the frequent item sets of the fraud samples and their supports. The non-fraud claim data sets are fed to the algorithm in the same way to obtain the frequent item sets of the non-fraud samples and their supports. The frequent item sets of the non-fraud samples are used to verify those of the fraud samples. For convenience, the frequent item sets of the fraud samples are sorted and numbered according to their supports in the fraud samples, and the i-th item set is denoted F_i. A frequent item set of the fraud samples is considered valid when Support(F_i | fraud) > Support(F_i | non-fraud). According to Table 8, all frequent item sets of the fraud samples are valid except {A19}.
Support of frequent item sets of fraud samples in fraud samples and non-fraud samples
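The validity criterion Support(F_i | fraud) > Support(F_i | non-fraud) can be checked as in the following sketch. The rows are hypothetical toy samples, each represented as the set of binary attributes that equal 1. The function names are illustrative.

```python
def support(itemset, rows):
    """Fraction of rows that contain every item of the item set."""
    items = frozenset(itemset)
    return sum(items <= row for row in rows) / len(rows)

def is_valid(itemset, fraud_rows, non_fraud_rows):
    """An item set is valid if it is more frequent among fraud samples."""
    return support(itemset, fraud_rows) > support(itemset, non_fraud_rows)

# Hypothetical toy samples, not taken from the paper's data.
fraud_rows = [{"A12", "A13"}, {"A12", "A13", "A7"}, {"A12", "A19"}, {"A12", "A13"}]
non_fraud_rows = [{"A12", "A19"}, {"A12", "A13"}, {"A19"}, {"A12", "A19"}]

print(is_valid({"A12", "A13"}, fraud_rows, non_fraud_rows))  # → True
print(is_valid({"A19"}, fraud_rows, non_fraud_rows))         # → False
```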
Association rules can be mined from frequent item sets. However, in this empirical study, association rules among the indicators themselves are of little interest. Instead, the relationship between a frequent item set and the probability of fraud is defined as an association rule. According to Bayes' theorem, the probability P(fraud | F_i) that a sample is a fraudulent claim sample when the frequent item set F_i appears can be obtained as Equation (7):
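For reference, the general two-class form of Bayes' theorem for this quantity is written out below. Identifying P(F_i | fraud) and P(F_i | non-fraud) with the supports Support(F_i | fraud) and Support(F_i | non-fraud), and taking the priors from the class proportions of the samples, are assumptions of this sketch; the paper's own Equation (7) may be expressed differently.

```latex
P(\text{fraud} \mid F_i) =
\frac{P(F_i \mid \text{fraud})\, P(\text{fraud})}
     {P(F_i \mid \text{fraud})\, P(\text{fraud})
      + P(F_i \mid \text{non-fraud})\, P(\text{non-fraud})}
```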
Frequent item sets of fraud samples and their fraud rates
Some fuzzy association rules can be obtained from Table 9. When index A12 is satisfied, the fraud rate is only 0.4976. When indexes A12 and A13 are both satisfied, the fraud rate increases to 0.5172. When indexes A7, A12 and A13 are all satisfied, the fraud rate is 0.6154. That is, when these three indexes hold simultaneously, the probability that the sample is a fraudulent claim sample is 61.54%, which exceeds 50%, so there is reason to believe that the sample is a fraudulent claim. In other words, {A7,A12,A13} is a significant fuzzy association rule. Similarly, {A12,A13,A15} and {A12,A13,A19} are also significant fuzzy association rules.
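The fraud rate of an item set and the 50% significance check can be sketched as follows. The sketch assumes that the supports in the fraud and non-fraud samples play the role of the conditional probabilities in Bayes' theorem and that the priors are the class proportions of the 342 fraudulent and 358 non-fraudulent samples. The function names and the support values in the example are hypothetical.

```python
def fraud_rate(sup_fraud, sup_non_fraud, n_fraud=342, n_non_fraud=358):
    """Posterior probability of fraud given an item set, via Bayes' theorem."""
    p_fraud = n_fraud / (n_fraud + n_non_fraud)   # prior from class proportions
    p_non_fraud = 1.0 - p_fraud
    numerator = sup_fraud * p_fraud
    return numerator / (numerator + sup_non_fraud * p_non_fraud)

def is_significant(sup_fraud, sup_non_fraud, threshold=0.5):
    """A rule is treated as significant when its fraud rate exceeds the threshold."""
    return fraud_rate(sup_fraud, sup_non_fraud) > threshold

# Hypothetical supports of one item set in the fraud and non-fraud samples.
print(fraud_rate(0.8, 0.5))
print(is_significant(0.8, 0.5))
```

When the two supports are equal, the posterior reduces to the prior 342/700, so an item set only raises the fraud rate above the base rate if it is more frequent among the fraud samples.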
According to the identified association rules, automatically underwritten private automobile insurance claims are highly likely to be fraudulent. {A7,A12,A13} can be interpreted as follows: private automobile insurance claims that combine a high number of historical accidents with automatic underwriting are highly likely to be fraudulent. From the insurer's perspective, manual underwriting is advisable, for private automobile insurance in particular, to prevent clients with greater historical risk from making fraudulent claims. As indicated by {A12,A13,A15}, a claim involving auto inspection exemption and automatically underwritten private automobile insurance is more likely to be fraudulent. Insurance companies should therefore pay particular attention to claims made by owners of private cars that are exempt from inspection. As suggested by {A12,A13,A19}, an automatically underwritten private automobile insurance claim in which the driver at the time of the accident was male is more likely to be fraudulent. Insurance companies should take extra care when settling such claims.
In this study, a fuzzy association rule algorithm, SAGFCM-Apriori, is proposed to mine the potential association rules in automobile insurance fraud events, so as to improve the management mechanism of automobile insurance fraud and promote the research of data mining technology in insurance fraud identification. The SAGFCM-Apriori algorithm applies fuzzy theory to association rule mining. In the data fuzzification step, the simulated annealing genetic algorithm is used to optimize FCM. Consequently, the tendency of FCM to converge from its initial clustering centers to a local optimum is overcome, and the influence of classification error on the final rule mining is reduced. In the process of mining fuzzy association rules, an improved version of the Apriori algorithm is proposed, which reduces the I/O load of the algorithm and speeds up the discovery of frequent item sets. Different data sets are employed in the empirical study to verify the effects of SAGFCM-Apriori on data fuzzification and on rule mining efficiency. The results demonstrate that optimization by the simulated annealing genetic algorithm improves the clustering accuracy of FCM, and that SAGFCM improves the value of the objective function. Moreover, the running time of the improved Apriori algorithm is reduced, and the efficiency gain becomes more pronounced as the data size increases. The efficiency with which SAGFCM-Apriori mines association rules is thus effectively improved. SAGFCM-Apriori is then adopted to mine association rules for automobile insurance fraud claims. Through correlation analysis of the fraud claim data set, 15 valid frequent item sets of fraud claims are mined, and fuzzy association rules identifying fraud claim samples are obtained from these frequent item sets.
The results of association rule mining reveal that a male driver, a high number of historical accidents, auto inspection exemption, private car, and automatic underwriting are the crucial characteristics appearing in the association rules.
Considering that data mining involves huge databases, the mining efficiency of association rules can be further improved in future research. In addition, fraudulent cases account for a relatively low proportion of all insurance claim cases. It is therefore also worth exploring how to mine higher-quality association rules from sparse fraud claim data and how to further verify the validity and correctness of the association rules.
Acknowledgment
This work was supported in part by the Natural Science Foundation of Shandong Province under Grant ZR2020MF033, the National Bureau of Statistics of China under Grant 2019LZ10, the National Natural Science Foundation of China under Grant 61502280, the General Project of the Science and Technology Plan of the Beijing Municipal Commission of Education under Grant KM202010017001, and the Shandong University of Science and Technology Postgraduate Research and Innovation Project under Grant YC20210230. The authors thank the African Institute for Mathematical Science (AIMS) for its support and help.
