Improved ELM optimization model for automobile insurance fraud identification based on AFSA

Abstract

With the rapid development of China’s insurance industry, insurance fraud incidents are also increasing, especially in auto insurance. This paper therefore studies a vehicle insurance fraud identification model based on the extreme learning machine (ELM). Because the initial connection weights and hidden layer neuron thresholds of the ELM are generated randomly, the recognition results are unstable and the accuracy suffers. To address this, the artificial fish swarm algorithm (AFSA) is used to optimize the model parameters, and the step size, visual field and crowding degree of the artificial fish swarm are improved adaptively. First, principal component analysis is used to generate the input vector of the ELM model for vehicle insurance fraud. Then the weights and thresholds of the ELM model are optimized by the improved artificial fish swarm algorithm. Finally, the model is applied to vehicle insurance fraud identification. The empirical analysis shows that the optimized model has a smaller recognition error and higher recognition stability than the traditional ELM classification model.

Keywords

Vehicle insurance fraud, artificial fish swarm algorithm, extreme learning machine

1. Introduction

In recent years, as the quality and efficiency of China’s economic development have continued to improve, China’s insurance industry has grown rapidly [1]. Premiums for automobile insurance, the largest line of property insurance in China, are also increasing year by year. According to the insurance statistics report issued by the China Insurance Regulatory Commission for 2018, from January to April the original premium income of property insurance companies increased by 16.13% year on year, a growth rate 4.26 percentage points higher than in the same period of the previous year. Of this total, auto insurance business amounted to 255.702 billion RMB, an increase of 6.59% year on year, accounting for 63.16% of the total business of property insurance companies. As the volume of auto insurance grows, the amount of auto insurance claims and the number of auto insurance fraud cases are also increasing. Insurance fraud not only disturbs the normal order of the insurance industry, but also infringes on the interests of insurance institutions, thus affecting the pricing strategy of insurance companies and damaging the order of the market [2]. Therefore, it is very important to establish a scientific system for identifying insurance fraud.

Research on insurance fraud identification is of great significance to society. Traditionally, insurance fraud detection has depended largely on audits and expert judgment. However, manual detection of fraud cases is costly and inefficient. To identify fraud more effectively, many insurance companies use data mining technology to identify and predict fraud and to mine fraud rules. Data mining is increasingly regarded as a key means of fraud detection [3, 4, 5]. In recent years, numerous scholars have studied data mining for insurance fraud identification [6]. Verma et al. proposed a method to detect outliers in insurance claims by using data mining technology [7]. Rawte and Anuradha proposed a hybrid data mining method that combines supervised and unsupervised learning to detect fraudulent medical insurance claims [8]. Yaram et al. proposed a machine learning algorithm for document clustering and fraud detection [9]. Bhowmik et al. adopted Naive Bayesian classification and a decision tree algorithm to address automobile insurance fraud [10]. Ye, taking China’s motor vehicle insurance as an example, proposed using a BP neural network to detect insurance fraud [11]. Li et al. proposed a random forest method with potential nearest neighbors based on principal component analysis for the identification of automobile insurance fraud [12]. Tang and Mo proposed a vehicle insurance anti-fraud detection system model using data mining techniques such as the support vector machine and the Apriori algorithm [13]. Yan et al. put forward a vehicle insurance fraud identification model that uses the ant colony algorithm to optimize a random forest. This model classifies and predicts automobile insurance claim data more effectively, mines fraud rules, and has better accuracy and robustness [14].

At present, factor analysis, regression analysis, the logit model and the probit model are used worldwide to analyze and identify insurance fraud. These methods try to identify quantifiable indicators that affect insurance fraud and to test for existing fraud in the insurance market through the established model [15]. Early insurance fraud identification was based on mathematical statistics, and the main idea of the algorithms was “model and regression analysis”. Artís and Montserrat Guillén used the logit method to identify and analyze motor vehicle claim data in Spain. Although the method is effective, it cannot handle missing data. To deal with missing data better, Caudill used a multivariate logit method to study fraud identification. Although this model improves considerably on the original logit method, it places high demands on the data, which is its main shortcoming as a recognition model [16]. In this paper, a fraud model is established by using the neural network-based Extreme Learning Machine (ELM) proposed by Huang. The ELM algorithm randomly generates the connection weights between the input layer and the hidden layer and the neuron thresholds of the hidden layer; once the number of hidden layer neurons is set, a unique optimal solution is obtained without further adjustment during training. The ELM is a simple algorithm for training single-hidden-layer feedforward neural networks; it can learn faster than traditional gradient descent methods such as back-propagation (BP) and may also achieve better generalization. However, the random parameters generated by the ELM algorithm can degrade the generalization performance of the network. To improve the prediction accuracy, it is often necessary to increase the number of hidden layer nodes, but too many hidden layer nodes increase the complexity of the network and make it prone to overfitting [17].

The artificial fish swarm algorithm (AFSA) was first proposed by Dr. Li Xiaolei in 2002. AFSA is a swarm intelligence optimization algorithm based on fish swarm behavior. It simulates the preying, swarming, chasing and random behaviors of fish to search for the optimum over the whole domain. In recent years, many scholars have applied the artificial fish swarm algorithm to the optimization of ELM. Ge et al. used the fish swarm algorithm to optimize ELM and establish a prediction model, and then used this model to study capacitance prediction for power capacitor banks. The experimental results show that the prediction accuracy of the ELM model optimized by the fish swarm algorithm is clearly higher than that of the PSO-ELM, GA-ELM and DE-ELM models [18]. Zhou et al. used an improved artificial fish swarm algorithm to optimize ELM and establish a prediction model, and then applied the model to the assisted diagnosis of breast tumors. The experimental results show that the ELM prediction model optimized by the improved artificial fish swarm algorithm performs better than both the ELM model optimized by the original artificial fish swarm algorithm and the plain ELM model [19]. Lin et al. constructed a better classifier by combining the ELM classifier with the artificial fish swarm optimization algorithm. The experimental results show that the accuracy and efficiency of this new classifier are much higher than those of the traditional ELM classifier [20]. To address the shortcomings of ELM described above, this paper uses an improved artificial fish swarm algorithm to optimize the weights and thresholds of ELM, and establishes the IAFSA-ELM model for vehicle insurance fraud identification. Through adaptive improvement of the step size, visual field and crowding degree of the artificial fish, the swarm can find the global optimal solution more quickly and escape local extrema.

2. Artificial fish swarm algorithm

The artificial fish swarm algorithm simulates the preying, swarming, chasing, random and other behaviors of fish to search for the optimum over the whole domain. Each behavior is designed to help the fish find the highest concentration of food faster.

Preying behavior: This is the most basic survival behavior of fish. It is generally believed that fish choose their direction by perceiving the food concentration in the water through vision or taste. Swarming behavior: Swarming is a survival mode formed during the evolution of fish, which lets them prey collectively and avoid enemies. Chasing behavior: When one or several fish find food, nearby fish follow them, which in turn causes more distant fish to follow. Random behavior: Fish swim randomly in the water in order to search for food or companions over a wider range. Artificial fish switch between these behaviors according to their perception of the environment.

2.1 Traditional artificial fish swarm algorithm

The parameters are as follows. The range of vision of a fish is visual, its moving step is step, the congestion factor is $\lambda$ , and the largest number of preying attempts is try_number. $r$ is a random number between $-1$ and 1. $d_{i,j}=\left\|{X_{i}-X_{j}}\right\|$ is the distance between two individual artificial fish.

Let the current position be $X=(x_{1},x_{2},\ldots,x_{n})$ . The food concentration at $X_{i}$ is $Y_{i}$ , that is, the value of the objective function. Fish usually have the following behaviors:

Preying behavior: Let $X_{i}$ be the current state of the artificial fish. The fish observes a position $X_{j}$ within its field of view. If the concentration of food at $X_{j}$ is greater than at $X_{i}$ , the fish moves one step forward in the direction of $X_{j}$ ; otherwise, it chooses a new position $X_{j}$ and checks again whether the advance condition is met. If the advance condition is still not met after try_number attempts, the fish moves one step at random.

$\displaystyle X_{\textit{next}}=X_{i}+\frac{X_{j}-X_{i}}{\left\|{X_{j}-X_{i}}\right\|}\times\textit{Step}\times r$ (1)

where $X_{\textit{next}}$ is the next position of the fish.

Swarming behavior: Let $X_{c}$ be the center position of the companions in the current view, $n_{i}$ the number of artificial fish in the current view, and $Y_{c}$ the food concentration at the center. If $\frac{Y_{c}}{n_{i}}>\lambda Y_{i}$ , the companion center has more food (a higher fitness function value) and is not very crowded, so the fish moves one step toward the companion center. Otherwise, it executes the preying behavior.

Chasing behavior: Let $X_{i}$ be the current state of the artificial fish, $n_{i}$ the number of artificial fish in the current view, and $Y_{j}$ the maximum food concentration among the artificial fish in the current view, attained by the companion $X_{j}$ . If $\frac{Y_{j}}{n_{i}}>\lambda Y_{i}$ , the companion $X_{j}$ has more food (a higher fitness function value) and its surroundings are not very crowded, so the fish moves one step toward $X_{j}$ . Otherwise, it executes the preying behavior.

Random behavior: Artificial fish swim randomly in the water, in effect seeking food or companions over a larger range.
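The preying move and the crowding test described above can be sketched in a few lines of Python. This is only an illustrative sketch under stated assumptions, not the authors' implementation: the function names `prey` and `not_crowded` are hypothetical, the candidate $X_{j}$ is sampled per coordinate inside a box of half-width visual, the random factor is drawn from [0, 1), and the objective is maximized.

```python
import math
import random

def prey(x_i, food, visual, step, try_number, rng):
    """One preying move of a single artificial fish (maximisation)."""
    y_i = food(x_i)
    for _ in range(try_number):
        # Assumption: sample a candidate X_j inside a box of half-width `visual`.
        x_j = [x + visual * rng.uniform(-1.0, 1.0) for x in x_i]
        if food(x_j) > y_i:
            dist = math.dist(x_j, x_i)
            if dist == 0.0:
                continue
            r = rng.random()  # assumption: r in [0, 1)
            # X_next = X_i + r * step * (X_j - X_i) / ||X_j - X_i||
            return [a + r * step * (b - a) / dist for a, b in zip(x_i, x_j)]
    # No better candidate after try_number attempts: take one random step.
    return [x + step * rng.uniform(-1.0, 1.0) for x in x_i]

def not_crowded(y_target, n_i, y_i, crowding):
    """Swarming / chasing test: Y_target / n_i > lambda * Y_i."""
    return n_i > 0 and y_target / n_i > crowding * y_i
```

For swarming, `y_target` would be $Y_{c}$; for chasing, it would be $Y_{j}$. When the test fails, the fish falls back to `prey`.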

2.1.1 Traditional artificial fish swarm algorithm

Figure 1.

Flow chart of traditional artificial fish swarm algorithm.

2.1.2 Parameter analysis of traditional artificial fish swarm algorithm

• visual and step: The field of vision determines the search range of the artificial fish, and the step size determines the convergence speed and accuracy. If the field of vision is large, the global search ability of the artificial fish is strong, the fish can jump out of local extrema, and the chasing and swarming behaviors are prominent. If the field of vision is small, the local search ability is stronger, and the preying and random behaviors are more prominent. In the early stage of iteration, the convergence rate increases with the step size; in the late stage of convergence, however, a large step makes the artificial fish oscillate back and forth near the optimal value, so that it is difficult to approach the optimal value accurately [21].
• Population size and try_number: The more artificial fish there are, the more intelligent the swarm is, the faster the convergence, the higher the precision, and the stronger the ability to jump out of local extrema. At the same time, a larger population greatly increases the computational cost of the algorithm. The more attempts an artificial fish makes, the stronger its preying ability and the faster the convergence. However, when local extrema are prominent, it is easy to miss the global extremum and thus the global optimum.
• $\lambda$ : The crowding factor is the degree of allowable crowding within the visual field of the artificial fish [22]. It is introduced to avoid overcrowding and falling into local extrema.

2.2 Improved artificial fish swarm algorithm

2.2.1 Improvement of step size

In the process of predation, traditional artificial fish move with a fixed step length. If the step is large, the accuracy of the solution is low, and in the late stage of evolution the swarm fluctuates around the optimal solution. If the step is small, the solution is accurate but convergence is too slow, and the swarm easily falls into a local extremum. In this paper, the step parameters are adjusted adaptively according to the number of iterations. A large step at the beginning of the iteration accelerates convergence, and the step parameter then decreases gradually as the number of iterations increases [23]. According to the different behaviors of the fish, two different steps are defined: one for the preying and random behaviors, and one for the swarming and chasing behaviors.

The step of the preying behavior and the random behavior is updated as follows:

$\displaystyle\textit{step}_{i}^{t+1}=w_{i}\textit{step}_{i}^{t}+f_{i}(x_{i}-x_{i}^{t})+(1-f_{i})(x_{*}-x_{i}^{t})$ (2)

The location of preying and random behavior is updated as follows:

$\displaystyle x_{i}^{t}=x_{i}^{t-1}+\delta\times\textit{step}_{i}^{t}$ (3)

Here $f_{i}\in[0,1]$ is an adjustment parameter, $\textit{step}^{0}$ is the initial step size, $x_{*}$ is the global optimal solution, $x_{i}$ is the individual optimal solution of the $i$th individual, and $\delta$ is a random number between 0 and 1.

The choice of the weight $w$ directly affects the convergence of the algorithm. A larger inertia weight helps the fish jump out of local minima and favors global search, while a smaller weight strengthens local search and thus speeds up convergence. For this reason, this paper reduces the weight $w$ adaptively as the number of iterations grows.

$\displaystyle w_{i}=w_{\max}-(w_{\max}-w_{\min})\times\frac{t}{T\_\max}$ (4)

where $t$ is the current iteration and $T\_\max$ is the maximum number of iterations.

The step of the swarming behavior and the chasing behavior is updated as follows:

$\displaystyle\textit{step}^{k}=\left\{{{\begin{array}[]{ll}\eta*\textit{step}^{k-1}&\textit{step}^{k-1}>1/2*\textit{step}^{0}\\ 1/2*\textit{step}^{0}&\textit{step}^{k-1}\leqslant 1/2*\textit{step}^{0}\\ \end{array}}}\right.$ (5)

where $\eta\in[0,1]$ is an adjustment parameter, $\textit{step}^{0}$ is the initial step size, $k=1,2,3,\ldots,N$ is the iteration number, and $N$ is the maximum number of iterations.
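As a rough illustration of Eqs. (2), (4) and (5), the step updates can be written as three small Python functions. The function names and the default values `w_max=0.9` and `w_min=0.4` are assumptions made only for this sketch; the paper does not state the values it uses.

```python
def inertia_weight(t, t_max, w_max=0.9, w_min=0.4):
    """Eq. (4): weight decreasing linearly from w_max to w_min."""
    return w_max - (w_max - w_min) * t / t_max

def prey_step(step, x_now, x_best_own, x_best_all, w, f):
    """Eq. (2): step for preying/random behavior, applied per coordinate.

    step, x_now, x_best_own and x_best_all are equal-length lists holding
    step_i^t, x_i^t, the individual optimum x_i and the global optimum x_*.
    """
    return [w * s + f * (p - x) + (1.0 - f) * (g - x)
            for s, x, p, g in zip(step, x_now, x_best_own, x_best_all)]

def swarm_step(step_prev, step0, eta):
    """Eq. (5): shrink the swarming/chasing step by eta, floored at step0 / 2."""
    return eta * step_prev if step_prev > 0.5 * step0 else 0.5 * step0
```

With Eq. (3), the new position would then be obtained by adding $\delta$ times the result of `prey_step` to the current position.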

2.2.2 Improvement of visual field and overcrowding

According to the parameter analysis, the larger the field of view, the stronger the global search ability; and when a minimum is sought, the smaller the crowding factor, the stronger the global convergence ability. Therefore, in the early stage of the algorithm, selecting a larger field of view and a smaller crowding factor can effectively improve the algorithm’s global search ability and convergence speed.

visual can be adjusted according to Eq. (6):

$\displaystyle\left\{{{\begin{array}[]{ll}\textit{visual}=\textit{visual}*a+0.5&\textit{visual}>3\\ \textit{visual}=3&\textit{visual}\leqslant 3\\ a=e^{-30*\left(\frac{t}{T\_\max}\right)^{s}}\\ \end{array}}}\right.$ (6)

Figure 2.

Variation curve of adjusting parameter a.

where $s$ is an integer in the range 2 to 10. Fig. 2 shows the variation curve of the adjusting parameter $a$ when $s$ is 3, 5 and 10. From Fig. 2, we can see that $a$ is 1 at the beginning of the iteration, which ensures that the initial field of view of the swarming and chasing behaviors is large. As the number of iterations increases, the field of vision first increases and then decreases until a threshold is reached, after which it remains unchanged. In this way, the global search ability of the artificial fish is enhanced in the early stage of iteration, and the local search ability is improved in the later stage.

The crowding factor indicates the degree of allowable crowding. When the crowding factor is between 0 and 1, the larger it is, the smaller the allowable crowding, which favors convergence to the global optimum at a slight cost in accuracy. When the crowding factor is larger than 1, the smaller it is, the smaller the allowable crowding, which also favors global convergence but likewise reduces the precision of the solution [22].

$\displaystyle\left\{{{\begin{array}[]{ll}\lambda=\lambda*a+0.5&\lambda<1\\ \lambda=1&\lambda\geqslant 1\\ \end{array}}}\right.$ (7)
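The updates in Eqs. (6) and (7) are simple enough to check by hand. The following Python sketch (with hypothetical function names) evaluates the adjusting parameter $a$ and applies one update of visual and $\lambda$; it is meant as a reading aid under the stated equations rather than a reference implementation.

```python
import math

def adjust_a(t, t_max, s):
    """a = exp(-30 * (t / t_max) ** s), with s an integer from 2 to 10."""
    return math.exp(-30.0 * (t / t_max) ** s)

def update_visual(visual, a):
    """Eq. (6): scale and shift while visual > 3, otherwise hold at 3."""
    return visual * a + 0.5 if visual > 3 else 3.0

def update_crowding(lam, a):
    """Eq. (7): scale and shift while lambda < 1, otherwise hold at 1."""
    return lam * a + 0.5 if lam < 1 else 1.0
```

At `t = 0` the function `adjust_a` returns 1, so the first updates simply add 0.5; as `t` approaches `t_max`, `a` falls toward zero and visual drops to its floor of 3.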

2.2.3 Pseudocode of improved algorithm

The pseudocode of the main program of the IAFSA is as follows.

Algorithm 1: Pseudocode of the main program of the IAFSA

Input: The population size fishnum, the maximum number of iterations $T\_\max$ and the maximum number of attempts try_number.

Output: BestX and BestY.

1. Initialize AF. BestX $\leftarrow\emptyset$ ; BestY $\leftarrow\emptyset$
2. while $t<T\_\max$
3.     for i = 1: fishnum
4.         [Xi1, Yi1] = AF_swarm();
5.         [Xi2, Yi2] = AF_follow();
6.     end
7.     [Ymax, index] = max(Y);
8.     if Ymax > BestY
9.         BestY = Ymax;
10.     else
11.         BestY(t) = BestY(t - 1);
12.     end
13.     $t=t+1$ ;
14. end

The pseudocode for the preying behavior and random behavior of the IAFSA is as follows.

Algorithm 2: Pseudocode for preying behavior and random behavior of the IAFSA

Input: The current position $X_{i}^{t}$ of the AF, the index $i$ of the AF, the historical optimum location $X_{i}$ of the AF, the current optimal position $X_{*}$ of all AFs, and the current iteration number $t$ .

Output: Xnext and Ynext.

1. Xnext $\leftarrow\emptyset$ ;
2. for k = 1: try_number
3.     $w_{i}=w_{\max}-(w_{\max}-w_{\min})*t/T\_\max$ ;
4.     $\text{step}_{i}^{t+1}=w_{i}*\text{step}_{i}^{t}+f_{i}*(X_{i}-X_{i}^{t})+(1-f_{i})*(X_{*}-X_{i}^{t})$ ;
5.     Xnest = $X_{i}^{t}+\delta*\text{step}_{i}^{t+1}$ ;
6.     Ynest = AF_foodconsistence(Xnest);
7.     if Yi < Ynest
8.         Xnext = $X_{i}^{t}$ + rand*step*(Xnest - $X_{i}^{t}$ )/norm(Xnest - $X_{i}^{t}$ );
9.         $X_{i}^{t}$ = Xnext;
10.         break;
11.     end
12. end
13. % Random behavior
14. if isempty(Xnext)
15.     $w_{i}=w_{\max}-(w_{\max}-w_{\min})*t/T\_\max$ ;
16.     $\text{step}_{i}^{t+1}=w_{i}*\text{step}_{i}^{t}+f_{i}*(X_{i}-X_{i}^{t})+(1-f_{i})*(X_{*}-X_{i}^{t})$ ;
17.     Xnext = $X_{i}^{t}+\delta*\text{step}_{i}^{t+1}$ ;
18. end

The pseudocode for the swarming behavior of the IAFSA is as follows.

Algorithm 3: Pseudocode for swarming behavior of the IAFSA

Input: The current position $X_{i}^{t}$ of the AF, the index $i$ of the AF, the historical optimum location $X_{i}$ of the AF, and the current iteration number $t$ .

Output: Xnext and Ynext.

1. Xnext $\leftarrow\emptyset$ ;
2. if step > 1/2*step $^{0}$
3.     step = $\eta$ *step;
4. else
5.     step = 1/2*step $^{0}$ ;
6. end
7. if visual > 3
8.     visual = visual*a + 0.5;
9. else
10.     visual = 3;
11. end
12. if $n_{i}>0$ and $Y_{c}/n_{i}>\lambda Y_{i}$
13.     Xnext = $X_{i}^{t}$ + rand*step*( $X_{c}-X_{i}^{t}$ )/norm( $X_{c}-X_{i}^{t}$ );
14. else
15.     Preying behavior
16. end

The pseudocode for the chasing behavior of the IAFSA is as follows.

Algorithm 4: Pseudocode for chasing behavior of the IAFSA

Input: The current position $X_{i}^{t}$ of the AF, the index $i$ of the AF, the historical optimum location $X_{i}$ of the AF, and the current iteration number $t$ .

Output: Xnext and Ynext.

1. Xnext $\leftarrow\emptyset$ ;
2. if step > 1/2*step $^{0}$
3.     step = $\eta$ *step;
4. else
5.     step = 1/2*step $^{0}$ ;
6. end
7. if visual > 3
8.     visual = visual*a + 0.5;
9. else
10.     visual = 3;
11. end
12. if $n_{i}>0$ and $Y_{j}/n_{i}>\lambda Y_{i}$
13.     Xnext = $X_{i}^{t}$ + rand*step*( $X_{j}-X_{i}^{t}$ )/norm( $X_{j}-X_{i}^{t}$ );
14. else
15.     Preying behavior
16. end

3. Extreme learning machine algorithm

ELM is a feedforward neural network with a single hidden layer. It consists of an input layer, a hidden layer and an output layer, and the neurons of adjacent layers are fully connected. There are $n$ neurons in the input layer corresponding to $n$ input variables, $m$ neurons in the output layer corresponding to $m$ output variables, and $l$ neurons in the hidden layer. Let the connection weights between the input layer and the hidden layer be $w_{l\times n}$ , and the connection weights between the hidden layer and the output layer be $\beta_{l\times m}$ . The input matrix of the training set with $Q$ samples is $X_{n\times Q}$ and the output matrix is $Y_{m\times Q}$ . The thresholds of the hidden layer neurons are $b_{l\times 1}$ , and the excitation function of the hidden layer neurons is $g(x)$ . Then the output ${T}$ of the network is:

$\displaystyle T=\left[{{\begin{array}[]{cccc}{t_{1}},&{t_{2}},&\ldots,&{t_{Q}}\\ \end{array}}}\right]$ (8)

$\displaystyle t_{j}=\left[{{\begin{array}[]{*{20}c}{t_{1j}}\\ {t_{2j}}\\ {\ldots}\\ {t_{mj}}\\ \end{array}}}\right]_{m\times 1}=\left[{{\begin{array}[]{*{20}c}{\sum\limits_{i=1}^{l}{\beta_{i1}g(w_{i}x_{j}+b_{i})}}\\ {\sum\limits_{i=1}^{l}{\beta_{i2}g(w_{i}x_{j}+b_{i})}}\\ {\ldots}\\ {\sum\limits_{i=1}^{l}{\beta_{im}g(w_{i}x_{j}+b_{i})}}\\ \end{array}}}\right]_{m\times 1}(j=1,2,\ldots,Q)$ (9)

where $w_{i}=\left[{{\begin{array}[]{cccc}{w_{i1}},&{w_{i2}},&\ldots,&w_{in}\\ \end{array}}}\right]$ and $x_{j}=\left[{{\begin{array}[]{cccc}{x_{1j}},&{x_{2j}},&\ldots,&x_{nj}\\ \end{array}}}\right]^{\text{T}}$ . Eq. (9) can then be expressed as $H\beta=T^{\text{T}}$ , where $T^{\text{T}}$ is the transpose of the network output matrix and $H$ is the hidden layer output matrix of ELM.
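To make Eq. (9) and the relation $H\beta=T^{\text{T}}$ concrete, the forward pass can be sketched in plain Python. The sketch assumes a sigmoid excitation function $g$ (the text leaves $g$ unspecified) and stores the samples as rows, i.e. the transpose of $X_{n\times Q}$; the function names are hypothetical.

```python
import math

def sigmoid(z):
    """Assumed excitation function g; the paper leaves g unspecified."""
    return 1.0 / (1.0 + math.exp(-z))

def hidden_output(X, W, b, g=sigmoid):
    """H[j][i] = g(w_i . x_j + b_i) for sample j and hidden neuron i.

    X: Q samples of length n, W: l weight rows of length n, b: l thresholds.
    """
    return [[g(sum(w_k * x_k for w_k, x_k in zip(w_i, x_j)) + b_i)
             for w_i, b_i in zip(W, b)]
            for x_j in X]

def elm_output(X, W, b, beta, g=sigmoid):
    """Rows of H * beta, i.e. T^T: one row of m outputs per sample."""
    H = hidden_output(X, W, b, g)
    m = len(beta[0])
    return [[sum(h_i * beta[i][k] for i, h_i in enumerate(h_row))
             for k in range(m)]
            for h_row in H]
```

In ELM training, $W$ and $b$ are fixed and $\beta$ is usually obtained from $H\beta=T^{\text{T}}$ by a least-squares solve; that step is omitted here to keep the sketch free of external libraries.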

4. Principle of ELM optimization by the improved artificial fish swarm algorithm

In this paper, an improved artificial fish swarm algorithm (IAFSA) is proposed to optimize the ELM fraud recognition model, giving the IAFSA-ELM model. The model overcomes shortcomings of the ELM model such as poor generalization performance and low precision caused by its random initial weights. Considering the strong prediction ability of ELM and the excellent optimization ability of the artificial fish swarm algorithm, the improved artificial fish swarm algorithm is combined with the ELM algorithm.

IAFSA-ELM algorithm flow:

1. Preprocess the data and determine the ELM network topology;
2. Set the artificial fish population parameters;
3. The artificial fish begin to optimize the ELM;
4. Each artificial fish chooses and executes a behavior, and the algorithm checks whether the convergence condition is reached;
5. The artificial fish swarm outputs the optimization result, which is used as the initial weights and thresholds of the network, and the ELM network is run with them;
6. Output the final result.

Figure 3.
Flow chart of the IAFSA-ELM algorithm.
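One detail the flow above leaves implicit is how a fish position maps to ELM parameters. A possible encoding, shown here purely as an assumption and not taken from the paper, is to flatten the $l\times n$ input weights and the $l$ thresholds into a single vector; the helper name `decode_fish` is hypothetical.

```python
def decode_fish(position, n_inputs, n_hidden):
    """Split a flat fish position into ELM input weights W and thresholds b.

    Assumed layout: the first n_hidden * n_inputs entries are the rows of W,
    and the last n_hidden entries are the hidden layer thresholds.
    """
    expected = n_hidden * n_inputs + n_hidden
    if len(position) != expected:
        raise ValueError(f"expected {expected} values, got {len(position)}")
    W = [position[i * n_inputs:(i + 1) * n_inputs] for i in range(n_hidden)]
    b = position[n_hidden * n_inputs:]
    return W, b
```

Under this layout, the food concentration of a fish would be a score of the ELM built from the decoded `W` and `b`, for example its classification accuracy on the training data.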

5. Empirical analysis

To verify the validity of the proposed algorithm in vehicle insurance fraud identification, this paper uses the historical claims of a vehicle insurance company as an example for fraud identification analysis [24]. Because a claim is either honest or fraudulent, the task can be treated as a binary classification problem. In this paper, 500 historical auto insurance records of one insurance company are selected, of which 350 are honest claims and 150 are fraudulent claims.

5.1 Selection of vehicle claim index

Before mining the vehicle insurance fraud features, we need to select the appropriate evaluation indicators. According to the information of the insured, we select 15 independent variables as the index system of insurance fraud identification.

In this paper, the 15 fraud indicators are divided into three groups of factors: 1) Driver factors: driving age (years), gender of the driver at the time of the accident, and number of historical claims (not including the current one). 2) Vehicle factors: nature of the vehicle, vehicle channel source, whether the guarantee is automatic, and whether the transfer of ownership has been corrected. 3) Other factors: the interval between the insurance period and the insurance period (months), vehicle inspection, whether the case was reported on the spot, survey type, number of hours reported, target reported man-hour fee, target repair plant type, and number of loss-assessment photos. Each variable type is described in Table 1.

Table 1
Data set index description

Variable	Type	Variable	Type
Channel source X1	Classified variable	Use nature X2	Classified variable
Nature of vehicle X3	Classified variable	Automatic security X4	Boolean variable
Whether to correct the transfer of ownership X5	Boolean variable	Vehicle inspection X6	Classified variable
Is there a report on the scene X7	Boolean variable	Risk driver gender X8	Boolean variable
Survey type X9	Classified variable	Target repair plant type X10	Classified variable
Number of reported replacement parts X11	Discrete variable	Maximum standard spare parts survey reported amount X12	Discrete variable
Target reported man-hour fee X13	Discrete variable	Number of photographs with constant loss X14	Discrete variable
Historical risk number X15	Discrete variable

We next explain why these 15 indicators were selected by examining the different means of vehicle insurance fraud.

• Changing the driver. The drivers in such cases are mainly drunk drivers, unlicensed drivers and other illegal drivers. After the accident, they find someone to replace them and then claim compensation from the insurance company. Most of these cases occur at night in remote areas, with very few or no witnesses, and the amount of loss is relatively high. Therefore, it is necessary to select the relevant information of the driver and the target claim amount as fraud identification indexes.

• Accident first, insurance later. The accident date in such fraud is very close to the expiration date of the insurance. It is mainly carried out by forging the date of the accident or the date of insurance. Therefore, it is necessary to determine the expiration period of the insurance for the vehicle at the time of the accident.

• Provision of false claim materials. Examples include forging or altering invoices for vehicle repairs and forging traffic accident responsibility certificates issued by the public security traffic police. Therefore, it is necessary to verify the actual situation of the vehicle (the type of vehicle), the accident record and the certificate establishing whether an accident occurred.

• Forging the accident scene. The policyholder usually exaggerates the severity of the accident. For example, undamaged parts of the vehicle are replaced with old parts to inflate the loss for a fraudulent claim. Therefore, it is necessary to check whether the accident was reported on the spot.

• Over-insurance. The insured amount is higher than the actual value of the vehicle, so when an insured accident occurs, the insured can obtain compensation that exceeds the actual value of the vehicle. Therefore, it is necessary to determine the actual amount of insurance coverage of the vehicle at the time of the accident.

• One accident, multiple false claims. The policyholder files multiple claims with one or more insurers for a single accident. The policyholder enters into insurance contracts with several insurers and intentionally conceals the duplicate coverage. Therefore, it is necessary to verify the number of historical claims of the insured [25].

Table 2
Stratification of classified variables

Variable	Coding
Channel source	The car company is set to 0, the traditional channel to 1, the agency channel to 2, the new channel to 3, and comprehensive development to 4
Use nature	1 for business and 0 for non-business
Nature of vehicle	Enterprise vehicle is set to 1, private vehicle is set to 2, agency vehicle is set to 0
Guarantee	Yes is set to 1, no is set to 0
Correct transfer of ownership	Yes is set to 1, no is set to 0
Vehicle inspection situation	Untested is set to 0, tested to 1, exempt to 2
Scene report	Yes is set to 1, no is set to 0
Survey type	The first site is set to 1, the unsurveyed site is set to 0, and the replenishment site is set to 2
Target repair plant type	1 for first class plant, 2 for second class plant, 3 for third class plant, 0 for special service station
Whether or not to cheat	Yes is set to 2, no is set to 1

From the description of the data variables of auto insurance claims in Table 1, we can see that there are non-numerical classified variables and Boolean variables. We therefore need to stratify and quantify these data according to the number of levels of each index variable. The resulting coding is shown in Table 2.
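The coding in Table 2 amounts to a lookup from category labels to integers. A minimal Python sketch of such a lookup for four of the variables is given below; the dictionary keys and field names are English labels chosen for this illustration only and are not identifiers from the original data set.

```python
# Integer codes copied from Table 2; the label strings are hypothetical.
CODES = {
    "channel_source": {"car company": 0, "traditional": 1, "agency": 2,
                       "new channel": 3, "comprehensive development": 4},
    "use_nature": {"non-business": 0, "business": 1},
    "nature_of_vehicle": {"agency": 0, "enterprise": 1, "private": 2},
    "vehicle_inspection": {"untested": 0, "tested": 1, "exempt": 2},
}

def encode(record):
    """Map the categorical fields of one claim record to their integer codes."""
    return [CODES[field][record[field]] for field in CODES]

example = {
    "channel_source": "agency",
    "use_nature": "business",
    "nature_of_vehicle": "private",
    "vehicle_inspection": "tested",
}
print(encode(example))
```

An unknown label raises a `KeyError`, which makes unexpected categories in the raw data visible instead of silently coding them.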

5.2 Principal component analysis based on SPSS

In recent years, the global insurance industry has developed rapidly. Many insurance claims occur every year, so the volume of claim data is particularly large. In general, the information of each applicant differs from claim to claim. For automobile insurance, the automobile type, insurance amount, insurance expiration time, insured information, historical claims and other factors all affect insurance fraud. This is why vehicle insurance claim data have high dimensionality. In addition, there is a certain correlation between the selected indicators (variables). For example, the type of automobile is related to the amount of insurance, since the insured amount of high-end automobiles is generally higher; likewise, maintenance costs differ greatly between types of repair shop. All these factors increase the difficulty of insurance fraud research. Although some indicators are correlated, they are extremely important for identifying insurance fraud, and therefore all indicators should be retained. Principal component analysis (PCA) is a commonly used multivariate statistical method for dimensionality reduction. It reduces the dimensionality of a multi-dimensional feature matrix, thereby reducing the complexity of the data, while the reduced data retain the main information of the original data [24]. Therefore, this paper uses principal component analysis to reduce the dimension of the data, retaining the important information of the original data while reducing the difficulty of computation.

SPSS software has powerful data processing and statistical mining capabilities, and can standardize data, calculate eigenvalues and extract principal components. Therefore, the 500 groups of data collected in this paper are normalized and then input into SPSS 22.0 for principal component analysis (PCA). The results of the analysis are shown in Table 3.

Table 3
Stratification of classified variables

Component	Contribution rate (%)	Cumulative contribution (%)	Component	Contribution rate (%)	Cumulative contribution (%)
1	13.069	13.069	9	5.647	75.516
2	10.539	23.608	10	5.238	80.754
3	9.322	32.929	11	5.022	85.776
4	8.587	41.516	12	4.543	90.320
5	8.064	49.580	13	3.964	94.284
6	7.484	57.064	14	3.307	97.590
7	6.647	63.711	15	2.410	100.00
8	6.158	69.869

When the first ten principal components are extracted, their cumulative contribution reaches 80.754%, so they cover more than 80 percent of the information in the original data. Therefore, the first 10 principal components are selected as the new sample for analyzing the influencing factors of automobile insurance fraud identification, which achieves the purpose of reducing the dimension of the raw data.
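The selection rule can be illustrated with a minimal Python sketch (the analysis in this paper itself is carried out in SPSS). The contribution rates are copied from Table 3; the function name `components_needed` and its `threshold` argument are illustrative assumptions rather than part of the original procedure.

```python
from itertools import accumulate

# Contribution rates (%) of the 15 principal components, copied from Table 3.
contribution = [13.069, 10.539, 9.322, 8.587, 8.064, 7.484, 6.647, 6.158,
                5.647, 5.238, 5.022, 4.543, 3.964, 3.307, 2.410]


def components_needed(rates, threshold):
    """Smallest number of leading components whose cumulative
    contribution (in %) reaches the given threshold (hypothetical helper)."""
    for k, total in enumerate(accumulate(rates), start=1):
        if total >= threshold:
            return k
    raise ValueError("threshold is never reached")


n_components = components_needed(contribution, 80.0)
print(n_components)
```

A higher threshold keeps more components and therefore more of the original information, at the cost of a larger input vector for the ELM model.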

5.3 Fraud identification results

K-fold cross-validation is commonly used to evaluate the performance of different models. In K-fold cross-validation, the training data are randomly divided into K subsets (folds), of which K-1 are used for model training and the remaining one is used for testing. After repeating the process K times, so that each fold serves once as the test set, K models and their performance evaluations are obtained. This makes the evaluation less sensitive to how the data happen to be partitioned.

In K-fold cross-validation, the value of K is generally 10. If the training set is relatively small, the value of K can be increased, so that more data are used for training in each iteration and the performance estimate has a smaller bias. However, increasing K lengthens the running time of cross-validation and makes the training blocks highly similar to one another, which weakens the effect of cross-validation. If the data set is large, a smaller K can be chosen to reduce the cost of repeated computation over the different data blocks while still leaving a large amount of training data. In this paper, the training data set is divided into 10 blocks. In each of the 10 iterations, 9 blocks are used for training and the remaining one is used for model evaluation.
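The partitioning scheme described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `k_fold_indices` and the fixed random seed are hypothetical, and the sketch produces index lists only, without training any model.

```python
import random


def k_fold_indices(n_samples, k, seed=0):
    """Yield (train, test) index lists for K-fold cross-validation.

    The sample indices are shuffled once and dealt into k folds;
    each fold is used exactly once as the test set.
    """
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::k] for i in range(k)]
    for i in range(k):
        test = folds[i]
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, test


# 500 claims and 10 folds, as in this paper.
splits = list(k_fold_indices(500, 10))
for fold_number, (train, test) in enumerate(splits, start=1):
    print(fold_number, len(train), len(test))
```

With 500 claims and K = 10, every iteration trains on 450 claims and evaluates on the remaining 50.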

In this paper, 500 claim records are used to carry out 10-fold cross-validation experiments. The recognition results are compared across four models: the ELM recognition model (ELM), the ELM recognition model optimized by the traditional artificial fish swarm algorithm (AFSA-ELM), the ELM recognition model optimized by the improved artificial fish swarm algorithm (IAFSA-ELM) and the ELM recognition model optimized by a genetic algorithm (GA-ELM).

The parameters of the artificial fish swarm are as follows: the visual field is 0.3, the step size is 0.25, the crowding degree is 0.618, the maximum number of iterations is 50, the number of fish is 50, and the maximum number of attempts is 3. For the improved artificial fish swarm, the initial step is 0, [...] is 0.85. The genetic algorithm parameters are as follows: the maximum number of iterations is 100, the population size is 40, the crossover probability is 0.7, and the mutation probability is 0.1.

Table 4
Ten-fold cross-validation experiment of the four models

Ten-fold	Test set accuracy%				Time(s)
	ELM	GA-ELM	AFSA-ELM	IAFSA-ELM	ELM	GA-ELM	AFSA-ELM	IAFSA-ELM
1	82	92	92	94	1.68s	119.10s	270.72s	118.20s
2	80	86	84	88	1.67s	98.79s	277.56s	156.72s
3	90	92	92	96	1.74s	105.72s	240.75s	133.41s
4	90	92	92	92	1.69s	94.82s	281.01s	125.67s
5	84	86	82	90	1.62s	85.65s	291.20s	127.25s
6	86	94	92	94	1.65s	75.62s	266.58s	132.68s
7	80	84	80	86	2.13s	113.12s	294.92s	125.63s
8	82	76	86	90	1.67s	106.15s	255.53s	118.02s
9	74	88	92	88	1.77s	93.36s	203.42s	124.24s
10	78	96	88	96	2.01s	100.92s	219.39s	111.21s
Max	90	96	92	96	2.13s	119.10s	294.92s	156.72s
Min	74	76	80	86	1.62s	75.62s	203.42s	111.21s
Mean	82.6	88.6	88	91.4	1.763s	100.325s	260.108s	127.30s

The experimental results of the four models on the 500 data sets are given in Table 4. Comparing the training times of the four models, the ELM model takes the least time, because its initial weights are generated randomly rather than computed by an optimization algorithm. This shortens the training time, but the recognition accuracy suffers. According to Table 4, the mean training time of the IAFSA-ELM model is 127.30 s, compared with 260.108 s for the AFSA-ELM model. Because the improved artificial fish swarm algorithm adaptively adjusts parameters such as the visual field, step size and crowding degree, it also achieves higher accuracy when optimizing the ELM model for classification (a mean of 91.4% versus 88% for AFSA-ELM).
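As a cross-check, the summary rows for the accuracy columns of Table 4 can be recomputed from the per-fold values. The following Python sketch copies the accuracies from Table 4; the names `accuracy` and `summarize` are illustrative only.

```python
from statistics import mean

# Per-fold test set accuracy (%) copied from Table 4.
accuracy = {
    "ELM":       [82, 80, 90, 90, 84, 86, 80, 82, 74, 78],
    "GA-ELM":    [92, 86, 92, 92, 86, 94, 84, 76, 88, 96],
    "AFSA-ELM":  [92, 84, 92, 92, 82, 92, 80, 86, 92, 88],
    "IAFSA-ELM": [94, 88, 96, 92, 90, 94, 86, 90, 88, 96],
}


def summarize(values):
    """Return (max, min, mean) of the per-fold accuracies of one model."""
    return max(values), min(values), round(mean(values), 1)


summary = {name: summarize(values) for name, values in accuracy.items()}
for name, (highest, lowest, average) in summary.items():
    print(f"{name:10s} max={highest} min={lowest} mean={average}")
```

The spread between the maximum and minimum gives a rough indication of how stable each model is across folds.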

To further illustrate the recognition performance of the four models, 250 cases were randomly selected as test data and the remaining 250 cases were used as training data. Of the 250 test cases, 91 were fraud claims and 159 were honest claims.

Figure 4.

Results of recognition and classification of four models.

The recognition and classification results of the four models on the prediction samples are shown in Fig. 4a–d. In Fig. 4, the number 2 represents fraud claims and the number 1 represents honest claims. The symbol * represents the actual class of a case, and the symbol O represents the value predicted by the model: when O equals 1, the model predicts that the case is an honest claim, and when O equals 2, the model predicts that the case is a fraudulent claim. The prediction results show that the ELM model optimized by the improved artificial fish swarm algorithm gives predictions closer to the true values, so the improved artificial fish swarm algorithm has an advantage in fraud identification.

Table 5

Fraud identification forecast table

	Predicted value
	Predicted to be 2	Predicted to be 1	Total
Fraudulent claims (2)	A (correct prediction)	B (prediction error)	A + B
Honest claims (1)	C (prediction error)	D (correct prediction)	C + D
Total	A + C	B + D	A + B + C + D

Table 6

Prediction results of four models

	ELM	AFSA-ELM	GA-ELM	IAFSA-ELM
A	70	62	65	76
B	21	29	26	15
C	52	2	2	13
D	107	157	157	146
Accuracy (A + D)/(A + B + C + D)	70.8%	87.6%	88.8%	88.8%

Based on Table 5, the sensitivity and specificity of a model are defined as follows: sensitivity is A/(A + B), and specificity is D/(C + D). Table 6 shows the results of each of the four models. From Table 6, we can see that the AFSA-ELM model accurately identifies only 62 fraud claims and predicts 29 fraud claims as honest claims, so its ability to identify fraud is poor; fraud that goes unidentified harms the development of the insurance industry. The ELM model predicts 52 honest claims as fraud claims, which is not helpful to the insurance industry either. In contrast, the IAFSA-ELM model identifies the majority of fraud claims (76 of 91), so it can help the insurance industry to identify fraud claims to some extent. The accuracy of the IAFSA-ELM and GA-ELM models is 88.8%, while that of the ELM and AFSA-ELM models is only 70.8% and 87.6%, respectively, which shows that the IAFSA-ELM model has some advantages. To further compare the GA-ELM and IAFSA-ELM models, we use SPSS 22.0 to draw ROC curves from the Table 6 data (Fig. 5).
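The three measures defined from Table 5 can be evaluated directly from the counts in Table 6. The Python sketch below does this; the counts are copied from Table 6, while the names `counts` and `metrics` are illustrative assumptions.

```python
# Confusion-matrix counts (A, B, C, D) copied from Table 6.
counts = {
    "ELM":       (70, 21, 52, 107),
    "AFSA-ELM":  (62, 29, 2, 157),
    "GA-ELM":    (65, 26, 2, 157),
    "IAFSA-ELM": (76, 15, 13, 146),
}


def metrics(a, b, c, d):
    """Return (sensitivity, specificity, accuracy) as fractions in [0, 1].

    a: fraud predicted as fraud      b: fraud predicted as honest
    c: honest predicted as fraud     d: honest predicted as honest
    """
    sensitivity = a / (a + b)
    specificity = d / (c + d)
    accuracy = (a + d) / (a + b + c + d)
    return sensitivity, specificity, accuracy


results = {name: metrics(*abcd) for name, abcd in counts.items()}
for name, (sens, spec, acc) in results.items():
    print(f"{name:10s} sensitivity={sens:.2%} "
          f"specificity={spec:.2%} accuracy={acc:.2%}")
```

Reporting sensitivity and specificity alongside accuracy matters here because two models with the same accuracy can make very different kinds of errors.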

Figure 5.

ROC curves of each model.

Figure 6.

Results of recognition and classification of two models.

To further illustrate the recognition performance of each model, this paper also considers the sensitivity and specificity of each model and the ROC curve of its recognition results. Generally speaking, in fraud detection, honest transactions account for the majority of cases, while fraudulent transactions account for only a small proportion. This is called class imbalance. If the class imbalance is serious, the classifier cannot meet the classification requirements, because it over-fits the majority class and under-fits the minority class. In this paper, sensitivity and specificity are combined graphically in the ROC curve, and the area under the ROC curve (AUC) of each model is then compared. The model with the largest AUC has the best recognition performance.

The ROC curve depicts the relationship between sensitivity and specificity in diagnosis. Because it allows an evaluation system to be analyzed more comprehensively and objectively, it has been widely studied and applied in experimental medicine, clinical epidemiology, biostatistics, radiology, data mining and pattern recognition [26].
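For readers who wish to reproduce this kind of analysis outside SPSS, the construction of an ROC curve and its AUC can be sketched in Python as follows. The labels and scores below are hypothetical values invented for illustration, not data from this study; in the sketch, fraud is coded as 1 and honest as 0, and the function names `roc_curve` and `auc` are assumptions.

```python
def roc_curve(labels, scores):
    """ROC points (1 - specificity, sensitivity), obtained by sweeping
    the decision threshold from the highest score to the lowest."""
    positives = sum(labels)
    negatives = len(labels) - positives
    ranked = sorted(zip(scores, labels), reverse=True)
    points = [(0.0, 0.0)]
    tp = fp = 0
    for i, (score, label) in enumerate(ranked):
        if label == 1:
            tp += 1
        else:
            fp += 1
        # Emit a point only after the last case of a group of tied scores.
        if i + 1 == len(ranked) or ranked[i + 1][0] != score:
            points.append((fp / negatives, tp / positives))
    return points


def auc(points):
    """Area under the ROC curve by the trapezoidal rule."""
    return sum((x1 - x0) * (y0 + y1) / 2
               for (x0, y0), (x1, y1) in zip(points, points[1:]))


# Hypothetical example: 1 = fraud, 0 = honest; scores are model outputs.
example_labels = [1, 1, 0, 1, 0, 0, 1, 0]
example_scores = [0.95, 0.80, 0.70, 0.65, 0.40, 0.35, 0.30, 0.10]
example_auc = auc(roc_curve(example_labels, example_scores))
print(example_auc)
```

The AUC equals the probability that a randomly chosen fraud case receives a higher score than a randomly chosen honest case, which is why it is less affected by class imbalance than raw accuracy.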

The ROC curve is judged by the following standard: an AUC below 0.5 indicates that the model has no recognition value; an AUC of 0.5–0.7 indicates low recognition accuracy; an AUC of 0.7–0.9 indicates good recognition accuracy; and an AUC above 0.9 indicates the best recognition performance [27]. The more convex the ROC curve is and the closer it lies to the upper left corner, the greater the recognition value of the model, which makes different models easy to compare. As shown in Fig. 5, the AUC of the IAFSA-ELM model is 0.877, greater than that of the GA-ELM model (0.851), the AFSA-ELM model (0.834) and the ELM model (0.721). Therefore, although the IAFSA-ELM and GA-ELM models have the same overall recognition rate, the IAFSA-ELM model performs better.

Computed from Table 6, the ELM model has a low specificity (67.30%) and a sensitivity of 76.92%, while the IAFSA-ELM model has a higher sensitivity (83.52%) and specificity (91.82%). Compared with the traditional AFSA-ELM model, the IAFSA-ELM model has higher recognition accuracy (88.8% versus 87.6%), sensitivity (83.52% versus 68.13%) and AUC (0.877 versus 0.834), although the specificity of the AFSA-ELM model is higher (98.74%). This further suggests that adaptively regulating the step length, crowding degree and visual field of the artificial fish swarm improves its optimization ability, so that the swarm can find the global optimal solution more quickly.

To further illustrate the effectiveness of the improved model, its prediction results are compared with those of the binary logit regression model. To better judge the effectiveness of the two models, the last 50 cases were used as test data. The results are shown in Fig. 6.

Figure 7.

ROC curves of each model.

From Fig. 6, the accuracy of the binary logit model is 86% and that of the improved recognition model is 92%, so the improved recognition model is better than the binary logit regression model in overall recognition accuracy. To verify the specificity of the two models, their predicted results were analyzed by ROC curve (Fig. 7). The AUC of the regression model was 0.856 and the AUC of the improved extreme learning machine was 0.871. So the improved algorithm is better than the binary logit regression model in both recognition specificity and accuracy.

6. Conclusion

In China’s insurance market, automobile insurance is the largest line of property insurance [28]. With the increasing number of cars in society, the amount of insurance claims for all kinds of vehicle accidents is also increasing, and insurance fraud in the course of vehicle maintenance and claim settlement is on the rise, becoming one of the biggest threats to the current development of the insurance industry. Therefore, an effective method for identifying vehicle insurance fraud is urgently needed, one that uncovers potentially fraudulent customers by judging from their claim data whether a claim is fraudulent, so that appropriate measures can be taken to prevent fraud in advance.

In this paper, principal component analysis is applied to the fraud claim data, and the extracted components are used as the input variables of the fraud prediction model. To address the deficiencies of the ELM model in recognition and classification problems, a recognition model based on an ELM optimized by an improved artificial fish swarm algorithm is proposed. Considering the recognition and classification ability of the ELM and the search and optimization characteristics of the artificial fish swarm algorithm, the two are combined: the improved artificial fish swarm algorithm optimizes the initial weights of the ELM, so as to overcome the neural network's slow convergence and its tendency to fall into local minima. The traditional artificial fish swarm algorithm uses a fixed step size. As the number of iterations increases, the fish swarm moves closer to the global optimal value, and if the step size stays fixed, the swarm becomes unstable around the optimal value. This paper therefore proposes a step size that varies with the number of iterations. Because different behaviors of the artificial fish swarm require different visual fields, this paper also proposes two parallel visual fields, which overcomes the shortcomings of the traditional artificial fish swarm algorithm, namely its slow convergence and its difficulty in converging to the global optimal solution. In the final empirical analysis, the improved artificial fish swarm algorithm is compared with the traditional artificial fish swarm algorithm and the traditional genetic algorithm on a given data set to verify the superiority and effectiveness of the ELM model optimized by the improved artificial fish swarm algorithm.

Acknowledgments

This work was financially supported by the Project of National Natural Science Foundation of China (No. 61502280, 61472228), the Project of Qingdao Applied Basic Research of Qingdao (special youth project, No. 14-2-4-55-jch).

References

1. Yan, Wang, Liu et al., Financial early warning of non-life insurance company based on RBF neural network optimized by genetic algorithm, Concurrency & Computation Practice & Experience 6 (2017), e4343.

2. You B.N., Risk prevention and handling of Insurance fraud-taking vehicle insurance fraud as the path, Shanghai Insurance 3 (2018), 13–18.

3. Hassan A.K.I. and Abraham, Modeling Insurance Fraud Detection Using Imbalanced Data Classification. Advances in Nature and Biologically Inspired Computing, Springer International Publishing, 2016.

4. Yan, Sun H.T., Liu and Chen, An integrated method based on hesitant fuzzy theory and RFM model to insurance customers’ segmentation and lifetime value determination, Journal of Intelligent & Fuzzy Systems 35 (2018), 159–169.

5. Sithic H.L. and Balasubramanian, Survey of insurance fraud detection using data mining techniques, International Journal of Innovative Technology & Exploring Engineering 3 (2013), 62–65.

6. Yan, Sun H.T. and Liu, Study of fuzzy association rules and cross-selling toward property insurance customers based on FARMA, Journal of Intelligent & Fuzzy Systems 31 (2016), 2789–2794.

7. Verma, Taneja and Arora, Fraud detection and frequent pattern matching in insurance claims using data mining techniques, Tenth International Conference on Contemporary Computing, IEEE Computer Society, 2017, pp. 1–7.

8. Rawte and Anuradha, Fraud detection in health insurance using data mining techniques, International Conference on Communication, Information & Computing Technology, IEEE, 2015, pp. 1–5.

9. Yaram, Machine learning algorithms for document clustering and fraud detection, in: International Conference on Data Science & Engineering, IEEE, 2017.

10. Bhowmik, Detecting auto insurance fraud by data mining techniques, Journal of Emerging Trends in Computing and Information Sciences 4 (2011), 371–377.

11. M.H., Research on insurance fraud identification based on BP neural network a case study of China motor vehicle insurance claim, Insurance Research 3 (2011), 79–86.

12. Y.Q., Yan, Liu and Li M.Z., A principle component analysis-based random forest with the potential nearest neighbor method for automobile insurance fraud identification, Applied Soft Computing 70 (2018), 1000–1009.

13. Tang and Mo Y.W., Construction of vehicle insurance anti-fraud system based on data mining technology, Shanghai Insurance 11 (2013), 39–42+63.

14. Yan Y.Q. and Sun H.T., Research on vehicle insurance fraud identification based on ant colony optimization stochastic forest model, Insurance Research 6 (2017), 114–127.

15. Cui and Luan P.P., A review of research on insurance fraud, Director of Economic Research 6 (2013), 112–113.

16. Caudill S.B., Ayuso and Guillén, Fraud detection using a multinomial logit model with missing information, Journal of Risk & Insurance 72 (2005), 539–550.

17. Zhang and Li, A new model of water quality evaluation based on particle swarm optimization extreme learning machine, Environmental Science and Technology 39 (2016), 135–139.

18. W.H., Wang Y.S., Wang M.G., Yan P.L. and Wu, Research on capacity prediction of power capacitor bank based on fish swarm optimization (ELM), Power Capacitor and Reactive Power Compensation 39 (2018), 11–15+32.

19. Zhou H.P. and Yuan, Application of improved fish swarm optimization (ELM) in breast tumor auxiliary diagnosis, Computer Engineering and Science 39 (2017), 2145–2152.

20. Lin H.W., Sneeuw N.I.C.O., Optimization of image classification method based on fish swarm algorithm for ultimate learning machine, Journal of Agricultural Machinery 10 (2017), 156–164.

21. Tian H.L., H.G. and Xu B.H., Prediction of support vector machine based on improved artificial fish swarm algorithm, Computer Engineering 39 (2013), 222–225.

22. Zhang Y.J., Z.W. and Feng Z.H., An improved artificial fish swarm algorithm based on dynamic parameter adjustment, Journal of Hunan University (Natural Science Edition) 39 (2012), 77–82.

23. Liang Y.M. and Pei X.H., Particle swarm optimization artificial fish swarm optimization, Computer Simulation 33 (2016), 213–217+281.

24. Y.Q., Yan and Sun H.T., Construction and research of vehicle insurance fraud identification model based on ultimate learning machine, Science and Technology and Economy 30 (2017), 96–100.

25. Shao Z.G., Analysis on early warning index of automobile insurance anti-fraud in insurance company, Modern Economic Information 5 (2017), 356.

26. Wei X.X. and Zhou Y.Q., Performance evaluation method for two classes of classification problems based on ROC curves, Computer Technology and Development 20 (2010), 47–50.

27. Luo S.W., C.P., Zhang L.P. and Chen, ROC curve was used to evaluate the diagnostic value of CEA, CYFRA21-1, SCC in non-small cell lung cancer, Chongqing Medicine 40 (2011), 250–252+255.

28. Zhang, Tayal et al., Auto insurance fraud detection using unsupervised spectral ranking for anomaly, Journal of Finance & Data Science 2 (2016), 58–75.
