Research on automobile insurance fraud identification based on fuzzy association rules

Abstract

With the development of the automobile insurance industry, identifying automobile insurance fraud from massive data has become particularly important. The purpose of this paper is to improve automobile insurance fraud management and to explore the application of data mining technology in automobile insurance fraud identification. To this end, an Apriori algorithm based on simulated annealing genetic fuzzy C-means (SAGFCM-Apriori) is proposed. The SAGFCM-Apriori algorithm combines fuzzy theory with association rule mining, expanding the application scope of the Apriori algorithm. Because the clustering centers of the traditional fuzzy C-means (FCM) algorithm easily fall into a local optimum, the simulated annealing genetic (SAG) algorithm is used to optimize them. The SAG-optimized FCM (SAGFCM) generates fuzzy membership degrees and thereby introduces fuzzy data into the Apriori algorithm. The Apriori algorithm itself is improved to reduce the time spent mining rules. Empirical studies on several data sets demonstrate that optimizing FCM with SAG effectively avoids the local optimum problem, improves clustering accuracy, and enables SAGFCM-Apriori to obtain better fuzzy data during data preprocessing. Moreover, the proposed algorithm reduces the mining time of association rules and improves mining efficiency. Finally, the SAGFCM-Apriori algorithm is applied to automobile insurance fraud identification, and automobile insurance fraud data are mined to obtain fuzzy association rules that can identify fraudulent claims.

Keywords

Fuzzy association rules, fuzzy clustering, data mining, automobile insurance fraud

1 Introduction

With the popularization of private vehicles, many industries related to automobiles have developed vigorously. The automobile service industry, including automobile sales, maintenance, finance, and insurance, has grown significantly, and the automobile insurance business in particular has made great progress. Since the insurance business was launched in China in 1980, automobile insurance has accounted for more than 50% of the domestic insurance business. It is an important category in the insurance industry, even known as the “food insurance” of the insurance industry [1]. However, automobile insurance fraud is common; it not only harms the interests of insurance companies but is also unfair to honest policyholders. There is an urgent need for intelligent, fast, and automated methods to obtain valuable information about automobile insurance fraud from automobile insurance fraud databases.

Association rule mining algorithm is a commonly used data mining algorithm [2]. Youcef Djenouri et al. [3] proposed an efficient parallel algorithm named CGPUGA. It is a genetic algorithm that runs on clusters of GPUs to efficiently discover diversified association rules. It benefits from cluster computing to generate rules. José María Luna et al. [4] proposed a series of algorithms based on the MapReduce framework and the Hadoop open-source implementation in order to establish new efficient pattern mining algorithms to work in big data. Results revealed the interest of applying MapReduce versions when complex problems were considered, and also the unsuitability of this paradigm when dealing with small data. Feng Feng et al. [5] pointed out and rectified some problems of soft set based association rule mining, refined and proposed relevant concepts. Further detailed insights were provided for soft set based association rule mining. Feng Feng et al. [6] developed a new approach to maximal association rule mining using logical formulas over soft sets in order to overcome the problem of meaningless rules in the rules obtained by traditional rule mining techniques. Fuzzy association rule mining is a main research direction of data mining. Lv [7] proposed a list based fuzzy frequent item mining algorithm and a list based fuzzy consistent association rule mining algorithm. Both algorithms use two pruning strategies to reduce spatial search. Sushil et al. [8] used fuzzy association rule mining to evaluate and predict students’ performance at the end of the semester. In this way, students who need personal attention to reduce failure rates and take appropriate action for next semester’s exams are identified. Mohamed et al. [9] proposed a new neutrosophic association rule algorithm. This algorithm uses a new way to generate association rules by dealing with membership, uncertainty and non-membership functions of items. 
Effective decisions are made by considering all fuzzy association rules. Wu et al. [10] optimized the classical Apriori algorithm. Firstly, Principal Component Analysis is used to optimize multi-source parameters and then extended to transactions with fuzzy attributes. Finally, the traditional IEC ratio is combined as the feature quantity to extract the rule together. Xu et al. [11] proposed a spatial association rule mining method that takes fuzzy attributes into account. The method introduces the theory of fuzzy sets and transforms fuzzy spatial attributes into fuzzy values represented by membership functions. In addition, the improved FP-Growth fuzzy association rule mining algorithm is used to extract association rules. Régis Pierrard et al. [12] proposed a fuzzy adaptive closure algorithm that relies on closure of item sets. This algorithm has a small number of traversals for the data sets and is applicable to the case where variables are related. To improve the accuracy of transformer diagnosis, Guo et al. [13] established a diagnosis model based on fuzzy association rules combined with case-based reasoning (CBR) to evaluate the failure types, fault locations, and cause of breakdown in power transformers. Zhang et al. [14] proposed a mining method based on fuzzy genetic algorithm for association rules. This method fuses association rules with fuzzy genetic algorithm by constructing a mining model. Hou et al. [15] proposed a fuzzy association rule mining algorithm based on master-slave architecture combined with genetic algorithm. This algorithm can reduce the complexity of fitness evaluation. Moreover, the acceleration degree of this algorithm can be close to linear increase when the modern number is large. Zhang et al. [16] proposed a quantitative data mining algorithm based on improved multi-level fuzzy association rules. Compared with other algorithms, this algorithm has prominent advantages in mining accuracy and operation time. Kang et al. [17] used a competitive aggregation algorithm to cluster the processed data and the FP-Tree algorithm to mine association rules during data mining.

One of the most commonly used algorithms for mining association rules is the Apriori algorithm first proposed by Rakesh Agrawal and Ramakrishnan Srikant [18]. But this algorithm cannot process quantitative data. To solve this problem, some researchers propose to combine fuzzy C-means clustering (FCM) algorithm with Apriori algorithm. Dang Qinhua [19] proposed an improved clustering algorithm DFCM. Based on this method, the Apriori algorithm is improved for the temperature control link of the decomposition furnace in the predecomposition system. Sowan et al. [20] studied two kinds of fuzzy association rule mining models for improving prediction performance. One is the FCM-Apriori model, the other is the FCM-MSapriori model. The two models are used to predict the relatively similar performance of the road. Dhiraj et al. [21] discovered and briefly discussed several different techniques and algorithms for mining fuzzy association rules. In addition, the current trend and future research scope were introduced from the field of fuzzy association rule mining. Zhang [22] proposed a fuzzy association rule mining algorithm. This algorithm combines the adaptive kernel-based FCM (KFCM) clustering algorithm based on fruit fly algorithm with the Apriori algorithm based on location storage. The effectiveness and feasibility of the algorithm are also proved. Yan et al. [23] introduced the fuzzy C-means algorithm into the association rule algorithm and proposed a FARMA algorithm to mine property insurance data. And the cross-selling strategies of property insurance customers were analyzed. Akbar et al. [24] used the fuzzy C-means clustering algorithm and Apriori algorithm to assess forest fire risk in western Iran. Chen et al. [25] proposed an intuitionistic fuzzy vector association rule mining method based on dual fuzzy simulation. 
The dual fuzzy idea is introduced to improve the traditional intuitive fuzzy set, and the framework of the two-layer association rule mining method is established.

Some researchers also apply association rule mining algorithms to insurance fraud detection. Shen [26] designed an anti-fraud system scheme of automobile insurance and an anti-fraud system model of automobile insurance. The Apriori algorithm of association rules in data mining technology is used to mine the fraudulent behaviors of automobile insurance regularly. Zhou [27] proposed the architecture of intelligent medical insurance audit system based on BP neural network and association rules, and then designed the support base of intelligent medical insurance audit system. Aayushi et al. [28] analyzed the results of rule-based mining and applied organizational decision rules and k-means clustering to periodic claims outlier anomaly detection. They also applied association rule mining based on Gaussian distribution to disease-based outlier anomaly detection. Du [1] divided the sample set and found out the frequent item sets of black samples based on Apriori algorithm and FP-growth algorithm. The frequent item sets were verified in the white sample. Association rules between frequent item sets were mined to identify fraud claims, and finally, the fraud rate of gray sample set was estimated.

Fraud will lead to a decrease in the reputation of the industry. It also severely affects the healthy and sustainable development of the insurance industry. The research on automobile insurance fraud has economic value and practical significance for both the policyholders and the insurance companies. However, the current domestic insurance fraud identification technology remains immature, and the management practices related to fraud identification and fraud rate estimation are underdeveloped. The research on insurance fraud identification faces numerous difficulties. The motivation of this paper is to propose a method that can mine automobile insurance fraud information and identify automobile insurance fraud claims in automobile insurance big data. The process of identifying automobile insurance fraud is a process of data mining, in which association rule mining can dig the hidden information in automobile insurance data and more effectively identify automobile insurance fraud claims. The Apriori algorithm is one of the most commonly used algorithms for mining association rules. However, it can only mine Boolean data, with a low efficiency when facing a large database. There are some limitations in the Apriori algorithm. Thus, a fuzzy attribute is introduced into association rule mining to overcome these problems. An Apriori algorithm based on simulated annealing genetic fuzzy C-means (SAGFCM-Apriori) is proposed in this study. The traditional Apriori algorithm needs to partition numeric data and convert them into Boolean data because only Boolean data can be used for the Apriori algorithm. This way will make the partition stiff, leading to sample attribution problems at the partition boundaries. On this basis, the FCM algorithm optimized by simulated annealing genetic algorithm is firstly used to divide the numerical data into fuzzy partitions. The fuzzy version of the original data is obtained, and the partition stiff problem is solved. 
SAG algorithm also can improve the situation of FCM clustering center falling into local optimal. Secondly, the improved Apriori algorithm is employed to mine the association rules in the fuzzy data set. The process of connection-pruning is improved to reduce the time of mining association rules for the Apriori algorithm. Thus, the I/O load on the algorithm is reduced. Finally, the examples are provided to verify the effectiveness of the proposed algorithm. The association rules of fraudulent claims are extracted to identify hidden automobile insurance fraud claims and provide a reference for fraud identification. The research results have significant practical value for insurance companies and are mainly beneficial to save the cost of fraud claims and improve the accuracy of the review.

2 FCM algorithm and optimization

2.1 The FCM algorithm

The fuzzy C-means (FCM) clustering algorithm is a widely used clustering method. It was developed from the hard C-means (HCM) clustering algorithm and is a data clustering method based on fuzzy set theory and the optimization of an objective function. FCM takes the minimization of the sum of squared distances within each partition as its criterion and uses membership degrees to describe the degree to which each data instance belongs to each cluster. The n data instances X = {X_i | X_i ∈ R^m (i = 1, 2, …, n)} are divided into c (2 ≤ c ≤ n) categories, and the clustering center of each category is obtained such that the value function of the non-similarity index is minimized. {A₁, A₂, …, A_c} represents the corresponding categories; {v₁, v₂, …, v_c} is the set of clustering centers of all categories. U = (μ_ik | i = 1, 2, …, n; k = 1, 2, …, c) is the membership matrix, where μ_ik represents the membership of sample X_i for category A_k. The objective function is shown in Equation (1) [29]:

$J_{b} (U, v) = \sum_{i = 1}^{n} \sum_{k = 1}^{c} μ_{ik}^{b} d_{ik}^{2},$ (1)

where $d_{ik} = d (x_{i}, v_{k}) = \sqrt{\sum_{j = 1}^{m} (x_{ij} - v_{kj})^{2}}$ is the Euclidean distance between the ith sample X_i and the kth clustering center v_k, m is the number of sample characteristics, and b ∈ (1, +∞) is the fuzzy weighting coefficient, which controls the degree of fuzziness of the fuzzy partition. The higher the value of b, the fuzzier the partition. The purpose of FCM is to find the classification that minimizes the objective function value J_b. The sum of the membership degrees of one sample over all categories must be 1, that is, Equation (2) must be satisfied:

$\sum_{k = 1}^{c} μ_{ik} = 1, i = 1, 2, \cdots, n .$ (2)

Equations (3) and (4) [29] are used to calculate the membership of sample X_i for category A_k and the c clustering centers {v_k}, respectively:

$μ_{ik} = \frac{1}{\sum_{j = 1}^{c} (\frac{d_{ik}}{d_{ij}})^{\frac{2}{b - 1}}} ,$ (3)

$v_{k} = \frac{\sum_{i = 1}^{n} (μ_{ik})^{b} x_{i}}{\sum_{i = 1}^{n} (μ_{ik})^{b}} .$ (4)

Equation (3) is undefined when a sample coincides with a clustering center. Let I_k = {i | 1 ≤ i ≤ n; d_ik = 0}; for every i ∈ I_k, μ_ik = 1 is set, and the membership of that sample for every other category is 0.

Clustering centers and membership degrees are updated repeatedly by Equations (3) and (4). When the algorithm converges, the clustering centers of all categories and the membership degrees of each sample for each category are obtained, and the fuzzy clustering division is complete. Although FCM searches quickly, it is a local search algorithm and is very sensitive to the initial values of the clustering centers: if they are not chosen properly, it converges to a local minimum. This issue is addressed in the next subsection.
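The alternating update of Equations (3) and (4) can be sketched in a few lines of Python. This is a minimal illustration, not the implementation used in the experiments: the function name `fcm`, the random choice of initial centers from the data, the stopping tolerance `tol` and the iteration cap `max_iter` are all assumptions made for the example.

```python
import math
import random


def fcm(data, c, b=2.0, max_iter=100, tol=1e-6, seed=0):
    """Fuzzy C-means on a list of equal-length numeric tuples.

    Returns (centers, memberships, objective value J_b).
    """
    rng = random.Random(seed)
    n, m = len(data), len(data[0])
    # Assumed initialisation: c distinct samples serve as starting centers.
    centers = [list(x) for x in rng.sample(data, c)]
    exponent = 2.0 / (b - 1.0)
    prev_j = float("inf")
    for _ in range(max_iter):
        # Membership update, Equation (3)
        u = []
        for x in data:
            d = [math.dist(x, v) for v in centers]
            if 0.0 in d:
                # Sample coincides with a center: full membership there.
                hit = d.index(0.0)
                u.append([1.0 if k == hit else 0.0 for k in range(c)])
            else:
                u.append([1.0 / sum((d[k] / d[j]) ** exponent for j in range(c))
                          for k in range(c)])
        # Center update, Equation (4)
        for k in range(c):
            w = [u[i][k] ** b for i in range(n)]
            centers[k] = [sum(w[i] * data[i][j] for i in range(n)) / sum(w)
                          for j in range(m)]
        # Objective function, Equation (1)
        j_b = sum(u[i][k] ** b * math.dist(data[i], centers[k]) ** 2
                  for i in range(n) for k in range(c))
        if abs(prev_j - j_b) < tol:
            break
        prev_j = j_b
    return centers, u, j_b
```

Because the result depends on the starting centers, running the sketch with different `seed` values on harder data can end in different local minima, which is the sensitivity discussed above.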

2.2 Simulated annealing genetic algorithm to optimize FCM algorithm

FCM is very sensitive to the selection of clustering center initial value. If the initial value is not chosen properly, it will fall into the local optimal solution. Thus, the clustering results are affected. Therefore, simulated annealing algorithm is combined with genetic algorithm to generate the simulated annealing genetic (SAG) algorithm. SAG algorithm is applied to FCM.

The simulated annealing algorithm has long been applied successfully to combinatorial optimization. Its basic idea is to find the minimum solution by simulating the annealing process of high-temperature objects in nature. First, an initial solution is generated as the current solution. A new solution is then generated; it is accepted as the current solution when its evaluation function value is less than that of the current solution, and otherwise it is accepted with a certain probability. Iterating in this way, and occasionally accepting worse solutions, prevents the algorithm from being trapped in a local optimal solution.
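The acceptance rule described above can be illustrated with a generic one-dimensional minimizer. The sketch below is an assumption-laden example rather than part of the proposed method: the neighbourhood move (a uniform step in [-1, 1]), the geometric cooling schedule and all default parameter values are chosen only for illustration.

```python
import math
import random


def simulated_annealing(f, x0, t0=10.0, k=0.9, t_end=1e-3,
                        steps_per_temp=50, seed=0):
    """Minimise f starting from x0; returns (best x, best f value)."""
    rng = random.Random(seed)
    current, f_cur = x0, f(x0)
    best, f_best = current, f_cur
    t = t0
    while t > t_end:
        for _ in range(steps_per_temp):
            cand = current + rng.uniform(-1.0, 1.0)  # assumed neighbourhood move
            f_cand = f(cand)
            # Accept improvements always, worse solutions with
            # probability exp(-(f_cand - f_cur) / t).
            if f_cand < f_cur or rng.random() < math.exp((f_cur - f_cand) / t):
                current, f_cur = cand, f_cand
                if f_cur < f_best:
                    best, f_best = current, f_cur
        t *= k  # cooling
    return best, f_best
```

At high temperature almost every candidate is accepted, so the search can leave a poor basin; as the temperature falls the rule becomes nearly greedy.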

The genetic algorithm is a computational model based on natural selection and genetic mechanisms in the biological world. It simulates the process of biological evolution in nature to search for the optimal solution. Firstly, the genetic algorithm encodes the problem parameters into chromosomes. Secondly, selection, crossover, mutation and similar chromosome exchange operations are carried out repeatedly. Finally, chromosomes that meet the target conditions are produced. However, the genetic algorithm is prone to premature convergence and can fall into a local optimal solution.

The SAG algorithm first generates the initial population. Then the crossover operator and mutation operator are applied to the population. Finally, the simulated annealing operation is applied to the population, and the replication strategy of the Metropolis acceptance criterion is used to generate the next generation. The acceptance criterion lets the best individuals in the population enter the next generation while accepting worse solutions with a certain probability. This preserves the diversity of the population and helps it find the global optimal solution. The combination of the simulated annealing algorithm and the genetic algorithm is applied to FCM cluster analysis. The two algorithms complement each other and effectively overcome the premature convergence of genetic algorithms [30]. In addition, SAGFCM can design the genetic encoding and fitness function according to the clustering problem so as to converge effectively and quickly to the global optimal solution. Thus, the problem that FCM converges easily to a local optimum is overcome.

The steps of the SAGFCM algorithm are as follows:

(1) Initialize the control parameters: population size sizepop, maximum number of evolution generations MAXGEN, crossover probability P_c, mutation probability P_m, initial annealing temperature T₀, temperature cooling coefficient k, and termination temperature T_end.

(2) Randomly initialize c clustering centers and generate the initial population Chrom. For each clustering center, Equation (3) is used to calculate the membership of each sample, and the fitness value f_i of each individual is calculated, where i = 1, 2, …, sizepop.

(3) Set the loop count variable gen = 0.

(4) Selection, crossover, mutation and other genetic operations are carried out on the population Chrom. For the newly generated individuals, Equations (3) and (4) are used to calculate the membership of each sample, the c clustering centers, and the fitness value $f_{i}^{'}$ of each individual. If $f_{i} > f_{i}^{'}$ , the old individual is replaced by the new one; otherwise, the new individual is accepted with probability $P = \exp [(f_{i} - f_{i}^{'}) / T]$ and the old one is discarded.

(5) If T_i < T_end, the algorithm ends and the global optimal solution is returned. Otherwise, perform the cooling operation T_{i+1} = kT_i and return to step (3).
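The overall loop of these steps can be sketched as follows. This is a deliberately simplified illustration and not the authors' implementation: the selection operator and the Equation (4) refinement of new individuals are omitted, the crossover and mutation operators are simple stand-ins, the inner loop over `maxgen` generations is an assumption about how gen and MAXGEN are used, and the fitness is taken to be the objective J_b (smaller is better). The names `objective` and `sag_search` are hypothetical.

```python
import math
import random


def objective(centers, data, b=2.0):
    """J_b of Equation (1), with memberships derived from Equation (3)."""
    total = 0.0
    for x in data:
        d = [max(math.dist(x, v), 1e-12) for v in centers]  # guard against d = 0
        for k in range(len(centers)):
            u = 1.0 / sum((d[k] / dj) ** (2.0 / (b - 1.0)) for dj in d)
            total += u ** b * d[k] ** 2
    return total


def sag_search(data, c, sizepop=10, maxgen=10, pc=0.7, pm=0.1,
               t0=100.0, k=0.8, t_end=1.0, seed=0):
    rng = random.Random(seed)
    dim = len(data[0])
    lo = [min(x[j] for x in data) for j in range(dim)]
    hi = [max(x[j] for x in data) for j in range(dim)]
    # Each individual encodes c candidate clustering centers.
    chrom = [[[rng.uniform(lo[j], hi[j]) for j in range(dim)] for _ in range(c)]
             for _ in range(sizepop)]
    fit = [objective(ind, data) for ind in chrom]
    best_fit = min(fit)
    best = chrom[fit.index(best_fit)]
    t = t0
    while t >= t_end:
        for _ in range(maxgen):  # assumed inner loop over generations
            for i in range(sizepop):
                child = [v[:] for v in chrom[i]]
                if rng.random() < pc:  # crossover: copy one center from a mate
                    mate = chrom[rng.randrange(sizepop)]
                    pos = rng.randrange(c)
                    child[pos] = mate[pos][:]
                for v in child:  # mutation: resample a coordinate
                    for j in range(dim):
                        if rng.random() < pm:
                            v[j] = rng.uniform(lo[j], hi[j])
                f_new = objective(child, data)
                # Metropolis rule: keep improvements, sometimes accept worse.
                if f_new < fit[i] or rng.random() < math.exp((fit[i] - f_new) / t):
                    chrom[i], fit[i] = child, f_new
                    if f_new < best_fit:
                        best, best_fit = child, f_new
        t *= k  # cooling operation
    return best, best_fit
```

The centers returned by such a search could then be handed to a standard FCM run as its initial values, which is one plausible way to combine the two stages.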

3 Fuzzy association rule algorithm

3.1 Apriori algorithm and improvement

At present, there are many algorithms for mining association rules. Among them Apriori algorithm is the most influential. Most other algorithms are generally extensions or variants of Apriori algorithm. Table 1 shows the comparison of the advantages and disadvantages of the currently commonly used algorithms for mining association rules. In contrast, the Apriori algorithm is more suitable for mining association rules of automobile insurance fraud. Apriori algorithm is an algorithm for mining frequent item sets of Boolean association rules. In Apriori algorithm, all item sets with support degree greater than minimum support degree are called frequent item sets. This algorithm uses an iterative method of layer-by-layer search: the support degree for each data item in item sets I is calculated at the first scan of database D; the frequent 1-item sets L₁ are composed of the items that satisfy the support greater than the minimum support degree; then the frequent 2-item sets L₂ are generated from the frequent 1-item sets L₁; in the subsequent kth scan, the frequent (k-1)-item sets L_k-1 generated by the (k - 1)th scan are first taken as the seed sets and connected to generate the potential candidate k-item sets C_k; then the data sets are scanned to calculate the support degrees of all items in the candidate k-item sets C_k; the frequent k-item sets L_k are formed by finding the items satisfying the support degree greater than the minimum support, and it is taken as the subset of the next scan. The process is repeated until no new frequent item sets are created. Strong association rules are eventually generated from frequent item sets. These strong association rules must satisfy minimum support and minimum confidence. 
In the Apriori algorithm, the core idea of finding frequent k-item sets is to make use of the following two basic properties: 1) every non-empty subset of a frequent item set must also be a frequent item set; 2) every superset of a non-frequent item set must also be a non-frequent item set.

Table 1
Comparison of advantages and disadvantages of commonly used association rule mining algorithms

Name	Apriori	FP-growth
Advantages	(1) The iterative method of layer-by-layer search is adopted, and the algorithm is simple and clear;	The database scan time is less, and the mining efficiency is higher.
	(2) It is suitable for sparse data sets and can produce relatively small candidate sets.
Disadvantages	Too many scans of the database, so it may be slower on large data sets.	Can only find frequent itemsets, unable to mine the association rules.

In the kth cycle, the Apriori algorithm first generates the set C_k of candidate k-item sets. Each item set in C_k is generated by making a (k-2)-connection between two frequent item sets that belong to L_k-1 and differ in only one item. Item sets in C_k are candidates for frequent item sets, and the final frequent item sets L_k must be a subset of C_k. Each element of C_k needs to be validated against the transaction database to decide whether it is added to L_k. This validation is the performance bottleneck of the algorithm, because it requires multiple scans of a potentially large database: if the largest frequent item set contains 10 items, the database needs to be scanned 10 times, which incurs a large I/O load.
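The layer-by-layer search of the classical algorithm can be sketched as below. It is a minimal illustration under stated assumptions: transactions are Python sets held in memory, support is a fraction of the number of transactions, and an item set counts as frequent when its support is at least `min_support`. Each call to the inner `support` function walks over every transaction, which corresponds to the repeated database scans identified above as the bottleneck.

```python
from itertools import combinations


def apriori(transactions, min_support):
    """Return {frozenset: support} for all frequent item sets."""
    transactions = [frozenset(t) for t in transactions]
    n = len(transactions)

    def support(itemset):
        # One full pass over the database per candidate.
        return sum(itemset <= t for t in transactions) / n

    items = sorted({i for t in transactions for i in t})
    level = {frozenset([i]): support(frozenset([i])) for i in items}
    level = {s: sup for s, sup in level.items() if sup >= min_support}
    frequent = {}
    k = 1
    while level:
        frequent.update(level)
        k += 1
        # Connection step: unions of two frequent (k-1)-item sets of size k.
        candidates = {a | b for a in level for b in level if len(a | b) == k}
        # Pruning step: every (k-1)-subset must itself be frequent.
        candidates = {cand for cand in candidates
                      if all(frozenset(s) in level
                             for s in combinations(cand, k - 1))}
        level = {cand: support(cand) for cand in candidates}
        level = {cand: sup for cand, sup in level.items() if sup >= min_support}
    return frequent
```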

The Apriori algorithm needs to scan the database multiple times and generates a large set of candidate items during the iteration. These problems form the performance bottleneck of the algorithm. The algorithm is therefore modified as follows to improve its operating efficiency.

A unique serial number TID is defined for each data transaction t in database D. A k-item set R_k is defined as R_k = 〈X_k, TIDS (X_k)〉, where X_k = (i_{j₁}, i_{j₂}, …, i_{j_k}), i_{j₁}, i_{j₂}, …, i_{j_k} ∈ I, j₁ < j₂ < … < j_k, and TIDS (X_k) is the set of serial numbers of all transactions t in the database that contain X_k. According to the support degree of k-item sets, support (R_k) = (|TIDS (X_k)| / |D|) × 100%, the support number of the k-item set R_k is supnum = |D| × support (R_k) = |TIDS (X_k)|.
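A small sketch may make the TIDS representation concrete. The helper names `build_tidsets` and `support_number` are hypothetical, and the database is assumed to be a dictionary that maps each TID to the items of that transaction.

```python
def build_tidsets(database):
    """Map each item to the set of TIDs of the transactions containing it."""
    tids = {}
    for tid, items in database.items():
        for item in items:
            tids.setdefault(item, set()).add(tid)
    return tids


def support_number(tids, itemset):
    """supnum = |TIDS(X_k)| for a non-empty item set, by intersecting TID sets."""
    return len(set.intersection(*(tids.get(item, set()) for item in itemset)))
```

Once `build_tidsets` has made its single pass over the data, the support number of any item set follows from set intersections alone, and the support degree is that count divided by |D|.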

The improved Apriori algorithm still adopts the iterative method of layer-by-layer search. The connection and pruning operations in the iterative process are defined as follows:

Connection: assume two (k-1)-item sets L_k-1 (i) = 〈X_k-1, TIDS (X_k-1)〉 ∈ L_k-1 and L_k-1 (j) = 〈Y_k-1, TIDS (Y_k-1)〉 ∈ L_k-1, with i < j. If the first k-2 items of X_k-1 and Y_k-1 are equal, that is, X_k-1 [k - 2] ≡ Y_k-1 [k - 2], and X_k-1 [k - 1] < Y_k-1 [k - 1], which guarantees that no item set is generated twice, then the two (k-1)-item sets are connected: L_k-1 (i) ⋈ L_k-1 (j) = 〈X_k-1 ∪ Y_k-1, TIDS (X_k-1) ∩ TIDS (Y_k-1)〉 = 〈X_k, TIDS (X_k)〉 = R_k ∈ C_k, where C_k is a superset of the frequent k-item sets. Otherwise, the connection is not performed, because the resulting item set would be either a duplicate or a non-frequent item set. This reduces the amount of computation.

Pruning: calculate the support number of each k-item set. As defined above, supnum = |TIDS (X_k)|, so the calculation does not require another scan of database D. This avoids I/O operations and improves the efficiency of the algorithm. If supnum (R_k) ≥ minsupnum, then 〈X_k, TIDS (X_k)〉 ∈ L_k, that is, the k-item set R_k is a frequent k-item set; otherwise, R_k is removed from C_k.

The Apriori algorithm is improved during connection and pruning. The improved algorithm only performs one database scan when generating frequent 1-item sets, and no database scan is required in subsequent iterations. The improved algorithm reduces the I/O load and greatly increases the discovery speed of frequent item sets. In addition, the iterative process of the algorithm does not require complicated calculation. That means the connection of item sets can be completed using only the union and intersection operations of the sets. This makes the algorithm easy to implement.
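The connection and pruning operations defined above can be sketched as follows. The function name `tid_apriori` and the data layout (a dictionary from TID to items, with item sets stored as sorted tuples) are assumptions for this example; items are assumed to be mutually comparable, for instance strings.

```python
def tid_apriori(database, min_supnum):
    """Return {item set (sorted tuple): TIDS} for all frequent item sets."""
    # Single database scan: TIDS of every 1-item set.
    tids = {}
    for tid, items in database.items():
        for item in items:
            tids.setdefault((item,), set()).add(tid)
    level = {x: t for x, t in tids.items() if len(t) >= min_supnum}
    frequent = {}
    while level:
        frequent.update(level)
        keys = sorted(level)
        nxt = {}
        for a in range(len(keys)):
            for b in range(a + 1, len(keys)):
                x, y = keys[a], keys[b]
                if x[:-1] != y[:-1]:
                    # Keys are sorted, so no later key shares the prefix either.
                    break
                # Connection: union of items, intersection of TID sets.
                common = level[x] & level[y]
                # Pruning: the support number is just the size of the TID set.
                if len(common) >= min_supnum:
                    nxt[x + (y[-1],)] = common
        level = nxt
    return frequent
```

After the first loop the original database is never read again; every later level is built from the TID sets of the previous one.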

3.2 Apriori algorithm based on SAGFCM algorithm

The traditional Apriori algorithm has two disadvantages that directly affect its efficiency. First, the transaction data set must be scanned many times when looking for frequent item sets, which requires a heavy I/O load and a lot of computation time. Second, the repeated scanning can produce a large number of useless candidate sets. In addition, the traditional fuzzy association rule mining algorithm requires domain experts to give the fuzzy membership functions and divide the fuzzy partitions in advance [31]. For a massive database with no prior knowledge, having experts define the fuzzy sets in advance is difficult and laborious. Therefore, an improved Apriori algorithm based on the SAGFCM algorithm (SAGFCM-Apriori) is proposed. SAGFCM-Apriori is an association rule mining algorithm with fuzzy concepts, and it resolves the sharp boundary problem that arises when the Apriori algorithm partitions numeric data. Some definitions of fuzzy association rules are as follows [21]:

Definition 1: assume that I = {i₁, i₂, ·· · , i_n} is a set of items made up of n different items. Each record t in database D = {t₁, t₂, ·· · , t_m} is a set of items in I (a subset of data items in I), and t has a unique identifier TID. If sets X ⊆ I and X ⊆ t, then record t is said to contain sets X.

Definition 2: assume that R = {r₁, r₂, ·· · , r_h} is a set of h distinct fuzzy intervals. If A = {a₁, a₂, ·· · , a_q} ⊆ R, then A is a set of fuzzy intervals in R.

Definition 3: μ_A (X) = ∧ μ_{a_i} (x_i) represents the membership degree of set X to fuzzy set A. Every x_i has a unique a_j in A that corresponds to it.

Definition 4: a fuzzy association rule is defined as an implication of the form X _ A ⇒ Y _ B, where X ⊆ I, Y ⊆ I, X ∩ Y = ∅, A ⊆ R, B ⊆ R, and a_i and b_j are the fuzzy intervals corresponding to x_i and y_j, respectively.

Definition 5: the support degrees and confidence degrees of the fuzzy association rule X _ A ⇒ Y _ B are calculated as Equations (5) and (6): $support (X \Rightarrow Y) = \frac{\sum {μ_{A} (X) \otimes μ_{B} (Y)}}{| D |},$ (5) $confidence (X \Rightarrow Y) = \frac{\sum {μ_{A} (X) \otimes μ_{B} (Y)}}{\sum μ_{A} (X)} .$ (6)

The Cartesian product A × B of two fuzzy sets A and B is usually defined by the membership function (X, Y) ↦ μ_A (X) ⊗ μ_B (Y) [32], where μ_A (X) denotes the membership degree of set X to fuzzy set A, μ_B (Y) denotes the membership degree of set Y to fuzzy set B, ⊗ is a t-norm, and |D| is the total number of records in the database.
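Equations (5) and (6) can be evaluated directly once the per-record memberships are known. The sketch below assumes that `mu_a[i]` and `mu_b[i]` hold μ_A (X) and μ_B (Y) for record i, that the antecedent has non-zero total membership, and that the minimum is used as the default t-norm; the function name and the membership values in the usage lines are made up for illustration.

```python
def fuzzy_support_confidence(mu_a, mu_b, t_norm=min):
    """Fuzzy support and confidence of a rule, Equations (5) and (6)."""
    joint = sum(t_norm(a, b) for a, b in zip(mu_a, mu_b))
    support = joint / len(mu_a)      # divide by |D|
    confidence = joint / sum(mu_a)   # divide by the antecedent's total membership
    return support, confidence


# Four hypothetical records.
mu_a = [0.75, 0.25, 1.0, 0.0]
mu_b = [0.5, 0.5, 0.5, 0.25]
print(fuzzy_support_confidence(mu_a, mu_b))                      # (0.3125, 0.625)
print(fuzzy_support_confidence(mu_a, mu_b, lambda a, b: a * b))  # (0.25, 0.5)
```

The second call swaps in the product t-norm, which gives smaller joint memberships than the minimum and therefore lower support and confidence for the same data.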

The first step in mining fuzzy association rules is to fuzzify the data in the database, which yields the fuzzified version of the original data to be processed [33]. This preprocessing is handled by the SAGFCM algorithm described above. Then, the fuzzy database is mined using the improved Apriori algorithm. The process of mining fuzzy association rules is shown in Fig. 1.

Fig. 1

Fuzzy association rule mining flow chart.
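The fuzzification stage of the flow can be illustrated for a single numeric attribute. In this sketch the clustering centers are assumed to be already available (for example from a clustering run), the membership of a value in each fuzzy interval follows Equation (3) in one dimension, and the centers are assumed to be distinct. The function name `fuzzify` and the labels "low" and "high" are hypothetical.

```python
def fuzzify(value, centers, labels, b=2.0):
    """Membership of one numeric value in each fuzzy interval, per Equation (3)."""
    d = [abs(value - v) for v in centers]
    if 0.0 in d:
        # The value sits exactly on a center: full membership there.
        return {lab: (1.0 if dk == 0.0 else 0.0) for lab, dk in zip(labels, d)}
    exponent = 2.0 / (b - 1.0)
    return {lab: 1.0 / sum((dk / dj) ** exponent for dj in d)
            for lab, dk in zip(labels, d)}


print(fuzzify(15.0, [10.0, 20.0], ["low", "high"]))  # {'low': 0.5, 'high': 0.5}
```

A value halfway between two centers belongs equally to both intervals, instead of being forced to one side of a hard boundary.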

The calculation steps of SAGFCM-Apriori algorithm are as follows:

The original database D is scanned by the SAGFCM algorithm and the data attribute values are classified. Boolean data are stored in database BD, and numeric data are stored in database SD. Then the clustering calculation is carried out. For Boolean data, the fuzzy membership is μ = 1 or μ = 0. For numeric data, each numeric attribute in database SD is converted to a fuzzy record according to the fuzzy partition. Each fuzzy record contains the fuzzy attributes of the data and the corresponding fuzzy membership degree μ (μ ∈ [0, 1]). Then the intermediate data set D₁ is generated. D₁ is updated iteratively until all attribute data in attribute set A are fuzzified, and the fuzzy version FD of the data set is obtained.

The fuzzy support degrees of all fuzzy 1-item sets are calculated in FD, and all fuzzy frequent 1-item sets are obtained.

All fuzzy frequent 1-item sets are combined to obtain fuzzy candidate 2-item sets.

The fuzzy support degrees of all fuzzy candidate 2-item sets are calculated. Then the fuzzy candidate 2-item sets whose support degrees are less than the minimum support are removed, and the fuzzy frequent 2-item sets are obtained.

Perform similar steps until all fuzzy frequent k-item sets are found.

The strong fuzzy association rules not less than the minimum confidence given by the user are generated from all fuzzy frequent item sets.
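The frequent item set stages of these steps can be sketched on an already fuzzified data set. The example makes several assumptions that are not taken from the paper: FD is a list of dictionaries mapping a fuzzy item, written as an (attribute, interval) pair, to its membership; the minimum is used as the t-norm; and two intervals of the same attribute are never combined. Supports are recomputed from FD here for brevity, so the TIDS optimisation is not shown, and the attribute names in the usage example are invented.

```python
from itertools import combinations


def fuzzy_apriori(fd, min_support):
    """Return {tuple of fuzzy items: fuzzy support} for frequent item sets."""
    n = len(fd)

    def support(itemset):
        # Equation (5) with the minimum t-norm over all items of the set.
        return sum(min(rec.get(item, 0.0) for item in itemset) for rec in fd) / n

    items = sorted({item for rec in fd for item in rec})
    level = {}
    for item in items:
        s = support((item,))
        if s >= min_support:
            level[(item,)] = s
    frequent = {}
    while level:
        frequent.update(level)
        nxt = {}
        for x, y in combinations(sorted(level), 2):
            if x[:-1] != y[:-1]:
                continue  # connect only item sets that share the first k-2 items
            if x[-1][0] == y[-1][0]:
                continue  # skip two fuzzy intervals of the same attribute
            cand = x + (y[-1],)
            s = support(cand)
            if s >= min_support:
                nxt[cand] = s
        level = nxt
    return frequent


# Hypothetical fuzzified records with two attributes.
fd = [
    {("age", "young"): 0.75, ("age", "old"): 0.25, ("claim", "high"): 1.0},
    {("age", "young"): 0.5, ("age", "old"): 0.5, ("claim", "high"): 0.5},
    {("age", "young"): 0.0, ("age", "old"): 1.0, ("claim", "high"): 0.25},
    {("age", "young"): 1.0, ("age", "old"): 0.0, ("claim", "high"): 0.75},
]
result = fuzzy_apriori(fd, 0.5)
```

Strong rules would then be read off the frequent item sets by dividing supports, as in Equation (6), and keeping those that reach the minimum confidence.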

4 Empirical analysis

4.1 Effectiveness analysis of SAGFCM algorithm

To test the optimization effect of the SAG algorithm on the FCM algorithm, four data sets are used in this paper: the IRIS data set, the Wine data set, a partial data set of automobile insurance claims, and a two-dimensional random data set containing 400 data points. These data are fed into the FCM algorithm [24] and the SAGFCM algorithm respectively. The results are shown in Table 2 and Figs. 2–5. The IRIS data set is derived from the characteristics of the iris flower. It contains 150 rows, each with four attributes, divided into 3 categories. The Wine data set is based on the characteristics of wine. It contains 178 rows, each with 13 attributes, divided into 3 categories; its first two dimensions are selected as the test data. The partial data set of automobile insurance claims contains 400 rows, each with 3 numerical attributes, divided into 4 categories. The randomly generated data are divided into 4 categories. In Figs. 2–5, the triangle points represent the clustering centers. For the IRIS data set and the partial data set of automobile insurance claims, Figs. 2 and 4 use the first two dimensions as the horizontal and vertical coordinates.

Table 2
Comparison of objective function optimal values of FCM algorithm and SAGFCM algorithm

Algorithm	IRIS	Wine	Partial data set of automobile insurance claims	Random data
FCM	29.0739	27.6713	2.6564	3.5244
SAGFCM	29.0736	27.6701	2.6371	3.4566

Fig. 2

SAGFCM clustering result of IRIS data set.

Fig. 3

SAGFCM clustering result of Wine data set.

Fig. 4

SAGFCM clustering result for Partial data set of automobile insurance claims.

Fig. 5

SAGFCM clustering result for Random data set.

Table 2 shows that the optimal value of the objective function obtained by the SAGFCM algorithm is smaller than that obtained by the FCM algorithm on every data set. Because the objective function is minimized, the SAGFCM result is the better one. In addition, the optimization effect of the SAG algorithm becomes more obvious as the data volume increases: the reduction is 0.0003 for IRIS (150 rows) and 0.0012 for Wine (178 rows), but 0.0193 for the automobile insurance claims data (400 rows) and 0.0678 for the random data. The main reason is that the plain FCM algorithm easily converges to a local optimum when processing larger data sets, whereas the hybrid of the simulated annealing algorithm and the genetic algorithm can effectively overcome this problem. Figures 2–5 show the clustering results of each data set using the SAGFCM algorithm. Different point markers represent different categories, and triangular points represent cluster centers. The data sets in Figs. 2 and 3 are divided into 3 categories, and the data sets in Figs. 4 and 5 are divided into 4 categories. The clustering boundaries of the data sets in Figs. 2 and 5 are clear. There are individual abnormal points in Figs. 3 and 4, but the overall clustering effect is good.
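To make the comparison in Table 2 concrete, the sketch below evaluates a fuzzy clustering objective for two candidate sets of cluster centers. It assumes the standard FCM objective, the sum over points and centers of u^m times the squared Euclidean distance, with fuzzifier m = 2 and the standard membership update. The SAG search itself is not shown, and the function names and toy data are hypothetical.

```python
def sq_dist(point, center):
    """Squared Euclidean distance between two equal-length tuples."""
    return sum((a - b) ** 2 for a, b in zip(point, center))

def fcm_memberships(points, centers, m=2.0):
    """Standard FCM membership update: one row per point, one column per center."""
    rows = []
    for p in points:
        d2 = [sq_dist(p, c) for c in centers]
        if 0.0 in d2:
            # The point coincides with a center: give that center full membership.
            hits = [1.0 if d == 0.0 else 0.0 for d in d2]
            rows.append([h / sum(hits) for h in hits])
        else:
            inv = [d ** (-1.0 / (m - 1.0)) for d in d2]
            rows.append([v / sum(inv) for v in inv])
    return rows

def fcm_objective(points, centers, m=2.0):
    """Sum of u^m * squared distance over all point/center pairs (smaller is better)."""
    u = fcm_memberships(points, centers, m)
    return sum(
        u[j][i] ** m * sq_dist(p, c)
        for j, p in enumerate(points)
        for i, c in enumerate(centers)
    )

# Toy one-dimensional data with two obvious groups.
points = [(0.0,), (1.0,), (9.0,), (10.0,)]
good = fcm_objective(points, [(0.5,), (9.5,)])   # centers inside the two groups
poor = fcm_objective(points, [(4.0,), (6.0,)])   # centers stuck between the groups
print(good, poor)
```

Centers placed inside the groups give a much smaller objective value than centers placed between them, which is the sense in which a lower value in Table 2 indicates a better clustering.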

To further verify the optimization effect of the SAG algorithm on the FCM algorithm, the clustering accuracy of SAGFCM is compared with that of other common clustering algorithms [34]. IRIS and Wine, the two data sets that contain classification labels, are used as test data sets. They are clustered with K-means, K-medoids, FCM [24] and SAGFCM, and the clustering accuracy of each algorithm is calculated. The results are shown in Table 3. The fuzzy clustering algorithms (FCM and SAGFCM) achieve higher accuracy than the hard clustering algorithms (K-means and K-medoids), and SAGFCM achieves the highest accuracy on both data sets. This indicates that fuzzy clustering outperforms hard clustering on these data and that the SAG algorithm improves the FCM algorithm.

Table 3

Comparison table of clustering accuracy of each algorithm under different data sets

Data Sets	K-means	K-medoids	FCM	SAGFCM
IRIS	0.8933	0.8867	0.9000	0.9067
Wine	0.5730	0.5963	0.6798	0.6910
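The paper does not state how clustering accuracy is computed. One common convention, assumed here, is to map each cluster to a distinct class in the way that maximizes the number of matches and to report the matched fraction. The sketch below implements that convention by brute force over permutations, which is practical for the three classes of IRIS and Wine. The function name and toy labels are hypothetical.

```python
from itertools import permutations

def clustering_accuracy(true_labels, cluster_ids):
    """Best-match accuracy; assumes there are no more clusters than classes."""
    classes = sorted(set(true_labels))
    clusters = sorted(set(cluster_ids))
    best = 0
    # Try every one-to-one assignment of clusters to classes.
    for perm in permutations(classes, len(clusters)):
        mapping = dict(zip(clusters, perm))
        hits = sum(mapping[c] == t for c, t in zip(cluster_ids, true_labels))
        best = max(best, hits)
    return best / len(true_labels)

# Toy example: cluster ids are arbitrary, and one sample is misplaced.
true_labels = ["a", "a", "b", "b", "c", "c"]
cluster_ids = [2, 2, 0, 0, 1, 0]
accuracy = clustering_accuracy(true_labels, cluster_ids)
```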

4.2 Effectiveness analysis of the improved Apriori algorithm

To verify the improved Apriori method proposed in this paper, data sets of different sizes are tested before fraud identification. Each data set is processed by the Apriori algorithm and by the improved Apriori algorithm, and the running time is recorded. The results are shown in Fig. 6. The abscissa in Fig. 6 is the size of the test data: 20, 40, 60, 80 and 100 rows. The tests show that the improved Apriori algorithm produces the same final results while reducing the running time. In addition, the improvement becomes more obvious as the data size increases.

Fig. 6

Mining time comparison between Apriori algorithm and improved Apriori algorithm.

4.3 The processing of the automobile insurance fraud data

MATLAB R2017a is used to mine the association rules for fraud identification from the automobile insurance claim settlement data collected from an insurance company. The data set covers 700 claims. Missing values are filled in by linear interpolation. All claim settlement samples have been preliminarily screened by the management system of the insurance company. The data set contains 342 samples of fraudulent claims and 358 samples of non-fraudulent claims.

The automobile insurance claim settlement data include 11 characteristic indicators (x₁ to x₁₁), which are used in this paper to identify fraudulent claims, together with the class label x₁₂, as shown in Table 4.

Table 4
Specific contents of the characteristic indicators

Types of indicators	Characteristic indicators	Content of indicators
Numerical data	x ₁	The time interval between the date of accident and the date of termination (month)
	x ₂	The number of photos from the scene of the accident used to determine the loss
	x ₃	Historical number of accidents (excluding this one)
Boolean data	x ₄	Nature (office, enterprise or private)
	x ₅	Automatic underwriting or not
	x ₆	Be approved for transfer or not
	x ₇	Vehicle inspection (exempted, inspected or not inspected)
	x ₈	Report the crime at the scene or not
	x ₉	The gender of the driver at the time of the accidents
	x ₁₀	Type of survey (first site, supplementary survey site or no survey site)
	x ₁₁	Type of repair shop for the subject matter insured (first class plant, second class plant, third class plant or special service station)
	x ₁₂	Fraudulent claim or not

First, the data are scanned and divided into a numerical data set and a Boolean data set [35], and each is then processed separately. For a Boolean attribute, the fuzzy membership is μ = 1 or μ = 0, so Boolean data can be divided directly into fuzzy partitions to obtain fuzzy versions of the Boolean attribute values. For numerical data, the SAGFCM algorithm is used to generate the membership matrix, and the data are divided into fuzzy record sets according to the fuzzy partition. Each fuzzy record contains the fuzzy attributes of the data and the corresponding fuzzy membership degree μ (μ ∈ [0, 1]). Where necessary, a threshold can be set for μ so that only numerical attribute data whose membership exceeds the threshold are retained. With the SAGFCM algorithm, any original data set with Boolean or numerical attributes can be transformed into a fuzzy set with fuzzy attributes and fuzzy records. Table 5 shows some attributes and attribute values of the original data set.

Table 5

Some original data attributes and their values

ID	x ₁	x ₂	x ₃	x ₄	x ₅	x ₆	x ₇	x ₈	x ₉	x ₁₀	x ₁₁	x ₁₂
1	0	16	2	enterprise	no	no	inspected	yes	0	first site	second class plant	no
2	1	15	7	private	yes	no	exempted	yes	1	first site	first class plant	no
3	3	6	1	private	yes	no	exempted	no	1	no survey site	first class plant	no
4	0	8	1	private	yes	no	exempted	no	1	no survey site	second class plant	yes
5	7	0	2	private	yes	no	exempted	no	1	supplementary survey site	first class plant	no

In the original data, x₁, x₂ and x₃ are numerical. SAGFCM is used to fuzzify them, and the results are shown in Table 6. During fuzzification, the SAGFCM algorithm clusters x₁ into three types: short-term, medium-term and long-term. x₂ is clustered into small quantity, normal quantity and large quantity, and x₃ into low frequency, moderate frequency and high frequency. Each indicator of each sample corresponds to a vector of membership values. The component with the largest value identifies the fuzzy partition to which the sample indicator belongs.

Table 6

Partial results of fuzzification of numerical data

ID	x ₁	x ₂	x ₃
1	(0.77,0.14,0.09)	(0.72,0.22,0.06)	(0.39,0.41,0.20)
2	(0.95,0.03,0.02)	(0.65,0.28,0.07)	(0.45,0.30,0.26)
3	(0.54,0.32,0.14)	(0.22,0.97,0.01)	(0.01,0.96,0.03)
4	(0.77,0.14,0.09)	(0.16,0.80,0.04)	(0.01,0.96,0.03)
5	(0.10,0.70,0.20)	(0.21,0.71,0.08)	(0.39,0.41,0.20)
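The maximum-membership assignment described above can be sketched as follows. The snippet turns each membership vector of Table 6 into a one-hot triple. Reading the resulting nine values as columns A1 to A9 of Table 7 (x₁ as A1–A3, x₂ as A4–A6, x₃ as A7–A9) is an inference from comparing the two tables rather than something the paper states, and the function names are hypothetical.

```python
def hard_partition(memberships):
    """One-hot vector marking the component with the largest membership."""
    best = max(range(len(memberships)), key=lambda i: memberships[i])
    return [1 if i == best else 0 for i in range(len(memberships))]

def binarize_sample(x1, x2, x3):
    """Concatenate the one-hot triples of x1, x2 and x3 into nine binary values."""
    return hard_partition(x1) + hard_partition(x2) + hard_partition(x3)

# Membership vectors of samples 1 and 5 as listed in Table 6.
sample_1 = binarize_sample((0.77, 0.14, 0.09), (0.72, 0.22, 0.06), (0.39, 0.41, 0.20))
sample_5 = binarize_sample((0.10, 0.70, 0.20), (0.21, 0.71, 0.08), (0.39, 0.41, 0.20))
print(sample_1)
print(sample_5)
```

For these two samples the output agrees with the first nine columns of the corresponding rows of Table 7.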

The remaining data are fuzzified according to the method for Boolean data. All data are thus fuzzified and converted into binary data, as shown in Table 7. For example, attribute x₁₀ is divided into three binary attributes: whether it is the first site (A21), whether it is the supplementary survey site (A22), and whether it is the no survey site (A23).

Table 7

Partial binary data sets

ID	A1	A2	A3	A4	A5	A6	A7	A8	A9	A10	A11	A12	A13	A14
1	1	0	0	1	0	0	0	1	0	0	1	0	0	0
2	1	0	0	1	0	0	1	0	0	0	0	1	1	0
3	1	0	0	0	1	0	0	1	0	0	0	1	1	0
4	1	0	0	0	1	0	0	1	0	0	0	1	1	0
5	0	1	0	0	1	0	0	1	0	0	0	1	1	0
ID	A15	A16	A17	A18	A19	A20	A21	A22	A23	A24	A25	A26	A27	A28
1	0	1	0	1	0	1	1	0	0	0	1	0	0	0
2	1	0	0	1	1	0	1	0	0	1	0	0	0	0
3	1	0	0	0	1	0	0	0	1	1	0	0	0	0
4	1	0	0	0	1	0	0	0	1	0	1	0	0	1
5	1	0	0	0	1	0	0	1	0	1	0	0	0	0
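The conversion of a categorical attribute into binary attributes can be illustrated with x₁₀, whose three values map to A21, A22 and A23 as described above. This is a minimal sketch, and the helper name `one_hot` is hypothetical.

```python
# Category order follows the A21, A22, A23 order given in the text.
SURVEY_TYPES = ("first site", "supplementary survey site", "no survey site")

def one_hot(value, categories):
    """Binary membership (1 or 0) of a categorical value in each category."""
    if value not in categories:
        raise ValueError(f"unknown category: {value!r}")
    return [1 if value == category else 0 for category in categories]

# x10 values of samples 1, 3 and 5 in Table 5.
a21_a23_sample_1 = one_hot("first site", SURVEY_TYPES)
a21_a23_sample_3 = one_hot("no survey site", SURVEY_TYPES)
a21_a23_sample_5 = one_hot("supplementary survey site", SURVEY_TYPES)
```

The three results agree with columns A21 to A23 of samples 1, 3 and 5 in Table 7.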

The binary data set is divided into a fraud claim data set and a non-fraud claim data set according to whether each sample is a fraudulent claim. On the basis of multiple tests, the minimum support degree is set to 0.7, and the minimum confidence degree is set to 0.7. The fraud data set is processed by the improved Apriori algorithm to obtain the frequent item sets of the fraud samples and their support degrees. The non-fraud claim data set is processed in the same way to obtain the support degrees of these item sets in the non-fraud samples, which are used to verify the frequent item sets of the fraud samples. For convenience, the frequent item sets of the fraud samples are sorted by their support degrees in the fraud samples and labeled F_i. A frequent item set of the fraud samples is considered valid when Support (F_i|fraud) > Support (F_i|non-fraud). According to Table 8, all frequent item sets of the fraud samples are valid except {A19}.

Table 8

Support of frequent item sets of fraud samples in fraud samples and non-fraud samples

Frequent item sets	Label of frequent item set	Support (F_i|fraud)	Support (F_i|non-fraud)
{A13}	F ₁	0.9591	0.8687
{A12}	F ₂	0.8947	0.8631
{A12,A13}	F ₃	0.8772	0.7821
{A19}	F ₄	0.8216	0.8464
{A15}	F ₅	0.8129	0.8045
{A7}	F ₆	0.8070	0.5391
{A13,A15}	F ₇	0.7895	0.7039
{A13,A19}	F ₈	0.7870	0.7346
{A7,A13}	F ₉	0.7807	0.4581
{A12,A19}	F ₁₀	0.7310	0.7263
{A12,A15}	F ₁₁	0.7251	0.6872
{A12,A13,A15}	F ₁₂	0.7164	0.6285
{A12,A13,A19}	F ₁₃	0.7164	0.6592
{A7,A12}	F ₁₄	0.7130	0.4777
{A7,A12,A13}	F ₁₅	0.7018	0.4190
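The validity criterion can be checked directly against the values in Table 8. The sketch below copies the table into a dictionary and keeps the item sets whose support in the fraud samples exceeds their support in the non-fraud samples. The variable and function names are hypothetical.

```python
# Label -> (item set, support in fraud samples, support in non-fraud samples), from Table 8.
TABLE_8 = {
    "F1": (("A13",), 0.9591, 0.8687),
    "F2": (("A12",), 0.8947, 0.8631),
    "F3": (("A12", "A13"), 0.8772, 0.7821),
    "F4": (("A19",), 0.8216, 0.8464),
    "F5": (("A15",), 0.8129, 0.8045),
    "F6": (("A7",), 0.8070, 0.5391),
    "F7": (("A13", "A15"), 0.7895, 0.7039),
    "F8": (("A13", "A19"), 0.7870, 0.7346),
    "F9": (("A7", "A13"), 0.7807, 0.4581),
    "F10": (("A12", "A19"), 0.7310, 0.7263),
    "F11": (("A12", "A15"), 0.7251, 0.6872),
    "F12": (("A12", "A13", "A15"), 0.7164, 0.6285),
    "F13": (("A12", "A13", "A19"), 0.7164, 0.6592),
    "F14": (("A7", "A12"), 0.7130, 0.4777),
    "F15": (("A7", "A12", "A13"), 0.7018, 0.4190),
}

def valid_item_sets(table):
    """Labels whose support among fraud samples exceeds that among non-fraud samples."""
    return [label for label, (_, s_fraud, s_non) in table.items() if s_fraud > s_non]

valid = valid_item_sets(TABLE_8)
invalid = [label for label in TABLE_8 if label not in valid]
print(len(valid), invalid)
```

Running the check leaves 14 valid item sets and rejects only F₄ = {A19}.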

Association rules can be mined from frequent item sets. In this empirical study, however, association rules among the indicators themselves are of little interest. Instead, the relationship between a frequent item set and the probability of fraud is defined as an association rule. According to Bayes' theorem, the probability P (fraud|F_i) that a sample is a fraudulent claim sample when frequent item set F_i appears is given by Equation (7):

$$P(\text{fraud} \mid F_{i}) = \frac{P(F_{i} \mid \text{fraud})\, P(\text{fraud})}{P(F_{i})} = \frac{N_{F_{i}}(\text{fraud})}{N_{F_{i}}(\text{fraud}) + N_{F_{i}}(\text{non-fraud})}, \tag{7}$$

where $N_{F_{i}}(\text{fraud})$ is the number of fraud samples that satisfy F_i, and $N_{F_{i}}(\text{non-fraud})$ is the number of non-fraud samples that satisfy F_i. P (F_i|fraud) is the probability that frequent item set F_i appears in a sample given that the sample is a fraud sample. P (fraud) is the probability that a sample is a fraud sample. P (F_i) is the probability that frequent item set F_i appears in a sample. According to Eq. (7), the probability that a sample is a fraud claim sample when each frequent item set appears, referred to as the fraud rate, is obtained as shown in Table 9.

Table 9

Frequent item sets of fraud samples and their fraud rates

Frequent item sets	Number of frequent item sets	Fraud rates
{A7}	F ₆	0.5885
{A12}	F ₂	0.4976
{A13}	F ₁	0.5133
{A15}	F ₅	0.4912
{A7,A12}	F ₁₄	0.5880
{A7,A13}	F ₉	0.6195
{A12,A13}	F ₃	0.5172
{A12,A15}	F ₁₁	0.5020
{A12,A19}	F ₁₀	0.5056
{A13,A15}	F ₇	0.5172
{A13,A19}	F ₈	0.5056
{A7,A12,A13}	F ₁₅	0.6154
{A12,A13,A15}	F ₁₂	0.5213
{A12,A13,A19}	F ₁₃	0.5094
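The fraud rates in Table 9 can be recomputed from the supports in Table 8 and the class sizes given in Section 4.3 (342 fraud and 358 non-fraud samples). In Eq. (7), P(F_i|fraud)·P(fraud) equals N_{F_i}(fraud)/N and P(F_i) equals [N_{F_i}(fraud) + N_{F_i}(non-fraud)]/N, so the total sample count N cancels. The sketch below recovers the counts by rounding support times class size, which is an assumption made because the published supports are rounded to four decimals. The function name is hypothetical.

```python
N_FRAUD = 342       # fraud samples (Section 4.3)
N_NON_FRAUD = 358   # non-fraud samples (Section 4.3)

def fraud_rate(support_fraud, support_non_fraud,
               n_fraud=N_FRAUD, n_non_fraud=N_NON_FRAUD):
    """Eq. (7): share of fraud samples among all samples that satisfy the item set."""
    hits_fraud = round(support_fraud * n_fraud)
    hits_non_fraud = round(support_non_fraud * n_non_fraud)
    return hits_fraud / (hits_fraud + hits_non_fraud)

rate_a7 = fraud_rate(0.8070, 0.5391)          # {A7}
rate_a12 = fraud_rate(0.8947, 0.8631)         # {A12}
rate_a7_12_13 = fraud_rate(0.7018, 0.4190)    # {A7,A12,A13}
print(round(rate_a7, 4), round(rate_a12, 4), round(rate_a7_12_13, 4))
```

For {A7}, the counts are 276 fraud and 193 non-fraud samples, giving 276/469 ≈ 0.5885, which matches Table 9. The other two item sets give 0.4976 and 0.6154, also in agreement with the table.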

Some fuzzy association rules can be obtained from Table 9. When the index A12 is satisfied, the fraud rate is only 0.4976. When the indexes A12 and A13 are both satisfied, the fraud rate increases to 0.5172. When the indexes A7, A12 and A13 are all satisfied at the same time, the fraud rate is 0.6154. That is to say, when the three indexes A7, A12 and A13 are satisfied simultaneously, the probability that the sample is a fraudulent claim sample is 61.54%, which is greater than 50%. There is therefore reason to believe that such a sample is a fraudulent claim. In other words, {A7,A12,A13} is a significant fuzzy association rule. Similarly, {A12,A13,A15} and {A12,A13,A19} are also significant fuzzy association rules.

According to the association rules identified, automatically underwritten private automobile insurance claims are highly likely to be fraudulent. {A7,A12,A13} can be interpreted as follows: private automobile insurance claims linked to a high number of historical accidents and automatic underwriting are highly likely to be fraudulent. From the insurer's perspective, it is advisable to carry out manual underwriting to prevent clients with greater historical risk from making fraudulent claims, particularly for private automobile insurance. As indicated by {A12,A13,A15}, a claim linked to vehicle inspection exemption and the automatic underwriting of private automobile insurance is more likely to be fraudulent. Insurance companies should therefore pay particular attention to claims made by owners of private cars that are exempt from inspection. As suggested by {A12,A13,A19}, private automobile insurance claims linked to automatic underwriting and a male driver at the time of the accident are more likely to be fraudulent. Insurance companies should take extra care when settling claims on automatically underwritten private cars whose driver at the time of the accident was male.

5 Conclusion

In this study, a fuzzy association rule algorithm, SAGFCM-Apriori, is proposed to mine the potential association rules in automobile insurance fraud events, so as to improve the management of automobile insurance fraud and promote research on data mining technology in insurance fraud identification. The SAGFCM-Apriori algorithm applies fuzzy theory to association rule mining. In the data fuzzification step, the simulated annealing genetic algorithm is used to optimize FCM. Consequently, the tendency of the FCM clustering centers to converge to a local optimum is overcome, and the influence of classification error on the final rule mining is reduced. For the mining of fuzzy association rules, an improved version of the Apriori algorithm is proposed, which reduces the I/O load of the algorithm and speeds up the discovery of frequent item sets. Different data sets are employed in the empirical study to verify the effects of SAGFCM-Apriori on data fuzzification and on rule mining efficiency. The results demonstrate that optimization by the simulated annealing genetic algorithm improves the clustering accuracy of FCM and that SAGFCM achieves a better objective function value. Moreover, the running time of the improved Apriori algorithm is reduced, and the efficiency gain becomes more significant as the data size increases. The efficiency of mining association rules with SAGFCM-Apriori is thus effectively improved. SAGFCM-Apriori is then adopted to mine the association rules for automobile insurance fraud claims. Through analysis of the fraud claim data set, 15 frequent item sets of fraud claims are mined, 14 of which are valid, and fuzzy association rules that identify fraud claim samples are obtained from the frequent item sets. The results of association rule mining reveal that a male driver, more historical accidents, vehicle inspection exemption, a private car, and automatic underwriting are the crucial characteristics appearing in the association rules.

Considering that data mining involves huge databases, the mining efficiency of association rules can be further improved in future research. In addition, fraudulent cases account for a relatively low proportion of all insurance claims in practice. It is therefore also worth exploring how to mine higher-quality association rules from sparse fraud claim data and how to further verify the validity and correctness of the association rules.

Acknowledgment

This work was supported in part by the Natural Science Foundation of Shandong Province under Grant ZR2020MF033, the National Bureau of Statistics of China under Grant 2019LZ10, the National Natural Science Foundation of China under Grant 61502280, the General Project of the Science and Technology Plan of the Beijing Municipal Commission of Education under Grant KM202010017001, and the Shandong University of Science and Technology Postgraduate Research and Innovation Project under Grant YC20210230. The authors thank the African Institute for Mathematical Sciences (AIMS) for their support and help.

References

1. X.Y., Association analysis for fraudulent claims in auto insurance market, Master's Dissertation, Lanzhou University, (2019).
2. , Zhang J.L. and Liu X.L., An association rule mining algorithm based on mutation mechanism and QPSO, Journal of Shandong University of Science and Technology (Natural Science) 39(02) (2020), 95–104.
3. Djenouri, Belhadi, Fournier-Viger and Fujita, Mining diversified association rules in big datasets: A cluster/GPU/genetic approach, Information Sciences 459 (2018), 117–134.
4. Luna J.M., Padillo, Pechenizkiy and Ventura, Apriori Versions Based on MapReduce for Mining Frequent Patterns on Big Data, IEEE Transactions on Cybernetics 48 (2018), 2851–2865.
5. Feng, Cho, Pedrycz, Fujita and Herawan, Soft set based association rule mining, Knowledge-Based Systems 111 (2016), 268–282.
6. Feng, Wang, Yager Y.Y., Alcantud J.C.R. and Zhang, Maximal association analysis using logical formulas over soft sets, Expert Systems with Applications 159 (2020), 113557.
7. X.B., Research on Fuzzy Association Rule Mining, Master's Dissertation, Harbin Institute of Technology (2016).
8. Sushil K.V., Thakur R.S. and Shailesh, Fuzzy Association Rule Mining based Model to Predict Students' Performance, International Journal of Electrical and Computer Engineering (IJECE) 4(7) (2017), 2223–2231.
9. Mohamed A.B., Mai, Florentin and Victor, Neutrosophic Association Rule Mining Algorithm for Big Data Analysis, Symmetry 10(4) (2018), 106–124.
10. Z.Y., Dong, Wang J.Y., Wang and Li J.Z., Fault Diagnosis of Power Transformer Based on Association Rules Mining, High Voltage Apparatus 55(08) (2019), 157–163.
11. D.H., Li H.W., Zhang T.Y., Fan and Zhu, A Method of Spatial Association Rule Mining Considering Fuzzy Attributes, Journal of Geomatics Science and Technology 33(03) (2016), 313–318.
12. Pierrard, Poli J.P. and Hudelot, A Fuzzy Close Algorithm for Mining Fuzzy Association Rules, International Conference on Information Processing & Management of Uncertainty in Knowledge-based Systems (2018), 88–99.
13. Guo C.X., Wang, Wu Z.Y., et al., Transformer failure diagnosis using fuzzy association rule mining combined with case-based reasoning, Transformer Failure Diagnosis Using Fuzzy Association Rule Mining Combined with Case-based Reasoning 14(11) (2020), 2202–2208.
14. Zhang and Liu W.J., Research and improvement of association rule mining technology based on fuzzy genetic algorithm, Modern Electronics Technique 14 (2017), 31–33+37.
15. Hou and Liu, A Fuzzy Association Rule Mining Algorithm Based on Master-slave Architecture and GA, Control Engineering of China 24(2) (2017), 226–282.
16. Zhang D.X. and Zhang Y.J., Quantitative data mining algorithm based on improved multi-level fuzzy association rules, Application Research of Computers 36(12) (2017), 1–6.
17. Kang Y.W., Zou, Guo W.M. and Duan S.T., Operation Optimization of Desulfurization System based on Fuzzy Association Rules, Journal of Engineering for Thermal Energy and Power 35(07) (2020), 145–151.
18. Agrawal and Srikant, Fast Algorithms for Mining Association Rules, Proceedings of 20th International Conference on Very Large Data Bases, VLDB (1994), 487–499.
19. Dang Q.H., The Research and Application on Mining Model of Fuzzy Association Rules, Master's Dissertation, Zhengzhou University, (2011).
20. Sowan, Dahal, Hossain M.A., et al., Fuzzy association rule mining approaches for enhancing prediction performance, Expert Systems with Applications 40(17) (2013), 6928–6937.
21. Dhiraj and C. Dr., Vinay, A Survey on Different Fuzzy Association Rule Mining Techniques, International Journal For Technological Research In Engineering 9(2) (2015), 2001–2007.
22. Zhang Y.R., Kernel-based Adaptive Fuzzy C-means Clustering Algorithm Based on Fruit Fly Algorithm and Association Rule Mining, Xi’an University of Technology (2017).
23. Yan, Sun H.T. and Li Y.Q., Association Rules based on Fuzzy Theory and Its Application in Cross-selling of Property Insurance, Financial Theory & Practice 05 (2017), 100–105.
24. Ali A.J., Ali and Heydar, Evaluation of forest fire risk using the Apriori algorithm and fuzzy c-means clustering, Journal of Forest Science 63(8) (2017), 370–380.
25. Chen, Fan P.P. and Chen D.P., Intuitionistic fuzzy vector association rules mining based on dual fuzzy simulation, Computer Integrated Manufacturing Systems 26(07) (2020), 1875–1886.
26. Shen, The Application of the anti fraud system for Automobile Insurance, Master's Dissertation, Hunan University, (2014).
27. Zhou Y.R., Research on Intelligent Medical Insurance Audit System Based on BP Neural Network and Association Rules, Master's Dissertation, Zhejiang Sci-Tech University, (2017).
28. Aayushi, Anu and Anuja, Fraud detection and frequent pattern matching in insurance claims using data mining techniques, Tenth International Conference on Contemporary Computing, IEEE Computer Society (2017), 1–7.
29. , Shi, Wang and Hu, Research on Intelligent Medical Insurance Audit System Based on BP Neural Network and Association Rules, Beihang University Press, Beijing, (2015), 178–196.
30. Yan, Ou Z.C. and Liu, Research on UBI Auto Insurance Pricing Model Based on Adaptive SAPSO to Optimize the Fuzzy Controller, International Journal of Fuzzy Systems 22(2) (2020), 491–503.
31. Qin Z.C., Chen G.B., Li, Sun and Fu, Evaluation model and application of rock burst coupling based on set pair analysis and interval triangular fuzzy number, Journal of Shandong University of Science and Technology (Natural Science) 38(01) (2019), 16–24.
32. Hüllermeier and Beringer, Mining implication-based fuzzy association rules in databases, Intelligent Systems for Information Processing (2003), 327–337.
33. Yan, Liu J.H., Liu, Li M.X. and Qi, Payments Per Claim Model of Outstanding Claims Reserve Based on Fuzzy Linear Regression, International Journal of Fuzzy Systems 21(06) (2019), 1950–1960.
34. Liu W.Y., Lin Y.L., Li K.W. and Lei Y.X., A Novel Unbalanced Data Classification Method for Software Defects, Journal of Shandong University of Science and Technology (Natural Science) 40(2) (2021), 84–94.
35. Yan, Sun H.T. and Liu, Study of Fuzzy Association Rules and Cross-selling toward Property Insurance Customers Based on FARMA, Journal of Intelligent & Fuzzy Systems 31(06) (2016), 2789–2794.
