Causation analysis model: Based on AHP and hybrid Apriori-Genetic algorithm

Abstract

This paper presents a causation analysis model for traffic accidents. A traffic accident results from the interaction of various factors. Considering the multi-dimensional and multi-layer characteristics of traffic accident data, a model based on the historical traffic accident data of the city of Guiyang in 2015 was built to find the main causes and potential rules of traffic accidents. The model starts from four main dimensions, namely the drivers, the vehicles, the time-address and the environment, and uses an approach based on AHP and a hybrid Apriori-Genetic algorithm to mine the causes of accidents. First, the analytic hierarchy process (AHP) is used to rank the importance of the factors influencing accidents. On the basis of objective analysis, the influencing factors are quantified and the main influencing factors are selected. Then the genetic algorithm combined with Apriori is used to analyze the main influencing factors and find the expected association rules. The experimental results show that the model can improve the accuracy of mining and find more expected association rules. Finally, the hybrid algorithm is parallelized to reduce its running time, which gives the model good application potential.

Keywords

Traffic accident, causation analysis, AHP, Apriori, genetic algorithm

1 Introduction

In recent years, with the rapid growth in the number of automobiles and drivers in China, the pressure on road traffic has grown greatly and the trend of traffic accidents has intensified [1]. At the same time, China is one of the countries with the largest number of deaths caused by traffic accidents. The latest official figures published by China’s Ministry of Public Security show that 187781 traffic accidents happened in China in 2015 [2, 3]. As traffic accidents occur, historical accident data gradually accumulates. These data can be used for targeted statistical analysis and further mining research. In order to explore the causes of traffic accidents, data mining is applied to the historical accident data in the hope of finding potential, deep rules and patterns, so as to provide decision support for the prevention of traffic accidents. Because many factors influence the occurrence of traffic accidents, and the historical accident data contains many complicated fields and much redundant information, causation analysis is difficult to carry out. For this reason, the analytic hierarchy process (AHP) is introduced into the data preprocessing.

The main factors influencing accidents can be selected by AHP. AHP [8, 9, 23, 24], proposed by Saaty, is applied in network system theory and multi-objective comprehensive evaluation. It is a systematic analysis method that combines qualitative and quantitative analysis. The method divides a complex problem into several layers, each containing several factors. After an in-depth analysis of the nature of the problem and the related influencing factors, a clear hierarchical structure chart can be drawn. A judgment matrix is then established for each group of factors by pairwise comparison. The weights of the factors are obtained by calculating the eigenvalues and eigenvectors of the judgment matrix, and the optimal scheme is selected according to the weights.

To improve the accuracy of mining, the Apriori algorithm and the genetic algorithm are used together in association analysis. The Apriori algorithm is the original algorithm for mining frequent sets for boolean association rules [10, 26]. Apriori uses an iterative method called layer-by-layer search, in which the frequent k-itemsets are used to explore the (k+1)-itemsets until no larger frequent sets can be found. From the frequent sets, association rules of the form A => B can be obtained. Each rule is measured by two parameters, support and confidence.
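The layer-by-layer search can be illustrated with a minimal Python sketch. The function names `apriori` and `confidence`, the toy transactions and the support threshold are assumptions made for illustration only; they are not the implementation or the data used in this paper.

```python
from itertools import combinations


def apriori(transactions, min_support):
    """Return {frozenset: support} for every frequent itemset (level-wise search)."""
    sets = [frozenset(t) for t in transactions]
    n = len(sets)
    # Level 1 candidates: every single item that occurs in the data.
    current = {frozenset([item]) for t in sets for item in t}
    frequent = {}
    k = 1
    while current:
        # Count each candidate and keep those that reach the support threshold.
        level = {}
        for cand in current:
            support = sum(1 for t in sets if cand <= t) / n
            if support >= min_support:
                level[cand] = support
        frequent.update(level)
        # Join frequent k-itemsets into (k+1)-candidates and prune any
        # candidate that has an infrequent k-subset.
        current = set()
        for a, b in combinations(list(level), 2):
            union = a | b
            if len(union) == k + 1 and all(
                frozenset(sub) in level for sub in combinations(union, k)
            ):
                current.add(union)
        k += 1
    return frequent


def confidence(frequent, antecedent, consequent):
    """Confidence of the rule antecedent => consequent from stored supports."""
    a = frozenset(antecedent)
    return frequent[a | frozenset(consequent)] / frequent[a]


# Hypothetical toy transactions, not taken from the Guiyang data set.
transactions = [
    {"male", "time3", "rear-end"},
    {"male", "time3", "rear-end"},
    {"male", "time1"},
    {"female", "time3", "rear-end"},
]
frequent = apriori(transactions, min_support=0.5)
for itemset, support in sorted(frequent.items(), key=lambda kv: -kv[1]):
    print(sorted(itemset), support)
```

Storing the support of every frequent itemset lets the confidence of any rule be read off without another pass over the data.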

The genetic algorithm searches for the global optimal solution by simulating Darwin’s theory of natural evolution. Its initial population is composed of randomly generated rules [11, 12, 27, 28]. After each chromosome (rule) is encoded and the user supplies the fitness function, the rules are optimized over the generations according to the principle of survival of the fittest. The final population is composed of the fittest rules in the current population and the descendants of these rules. Descendants are created by genetic operations such as crossover and mutation.

This paper focuses on the causes of urban traffic accidents. The historical traffic accident data of the city of Guiyang in 2015 is selected as the basis of analysis. The model combines AHP and the hybrid Apriori-Genetic algorithm to analyze the data. First, the weights of the influencing factors are determined by AHP, which selects the main influencing factors, removes the secondary ones and simplifies the computation. Then, Apriori is used to associate the main influencing factor fields. Finally, the search results are optimized by the genetic algorithm. The model can explore the combined effects of the main influencing factors and the traffic safety details behind them, such as the influence of road conditions and other aspects of the traffic environment on accidents. It can also improve the value of the data.

2 Related work

With the accumulation of traffic accident data, how to model the data and quickly extract useful information from it has become a focus of research. The Apriori algorithm can produce a large number of association rules and adapts well to exploring the causes of traffic accidents [5–7]. It uses support to find frequent sets and discovers association rules based on confidence. However, if the Apriori algorithm is used directly, many useless and repeated frequent items and association rules are generated.

In order to find the causes of traffic accidents accurately, it is necessary to improve and optimize the Apriori algorithm so that it can be better applied to traffic accident research. Many scholars have worked on this. Jianfeng Xi, Zhonghao Zhao et al. proposed a method for researching accident causes [13] based on an AHP-Apriori algorithm. Given the numerous and complex fields in traffic accident data, the analytic hierarchy process is used to determine which factors or attributes of an accident have a greater weight. Association rules are then obtained through association analysis of the main influencing attributes in the data. Finally, the driver’s psychological and behavioral factors are introduced to analyze the association rules and verify their credibility. Although the analytic hierarchy process can screen out the main influencing factor fields of traffic accidents, the Apriori algorithm used directly still produces many frequent sets and useless association rules. Moreover, using only support and confidence as measures biases the algorithm.

S. Ghosh, S. Biswas et al. proposed frequent pattern mining based on the genetic algorithm [14]. They introduced the genetic algorithm into association analysis to improve the mining process and reduce the time complexity of the global search. The method is simple and efficient, and performs better on larger data sets. Surendra Kumar Chadokar proposed a hybrid association rule and genetic algorithm for network communication [15]. He used the Apriori algorithm to process network communication data and obtain the frequent sets. The frequent sets were then passed through the genetic algorithm to obtain fewer and better rules. A comparison of the execution time and of the number of frequent items and rules between the Apriori algorithm and the hybrid algorithm indicated that the hybrid algorithm reduces the execution time and yields fewer frequent items and useful rules than the Apriori algorithm.

Because a simple genetic algorithm can hardly meet the needs of data mining and is random and error-prone, Haiyan Ren and Ke Luo [17] combined the genetic algorithm with the Apriori algorithm and proposed a hybrid association rule mining algorithm that improves the accuracy of the genetic algorithm in classification mining.

Sanat Jain and Swati Kabra [16] proposed an optimized association rule mining algorithm and introduced the mining of positive association rules [19, 20] and negative association rules [21, 22]. The Apriori algorithm, based on support and confidence, is used to mine the valid positive and negative association rules, which are then optimized by the genetic algorithm. This method can reduce the search space and uses the correlation coefficient of each association rule to judge whether the rule is appropriate for further mining.

Pankaj Sharma and Sandeep Tiwari proposed an algorithm that uses a mutated artificial bee colony algorithm to optimize association analysis [18]. This method is based on the foraging behavior of bees and can produce high-quality frequent sets from large data sets. In addition, whereas the traditional association rule algorithm does not consider negatively correlated attributes in a rule, the artificial bee colony algorithm can identify which rules contain negatively correlated attributes. The proposed algorithm is compared with the KNN algorithm and the standard artificial bee colony algorithm. It is concluded that the algorithm proposed by the authors improves the classification accuracy, but the improvement is limited. Other research methods in traffic accident analysis [29, 30] provide different solution mechanisms.

3 Proposed methodology

3.1 Data presentation

The original historical data of traffic accidents in the city of Guiyang in 2015 (56651 accidents in total) [4] contains 21 fields, including accidenttime, accidentaddr, driver1fault, driver2fault, sex1, sex2, carcolor1, carcolor2, brith1, brith2 and Driving experience, together with 11 fields about weather and about penalty points for traffic violations. The data records information about both drivers involved. The fields driver1fault and driver2fault record which driver bears the main responsibility for the accident. Records with “driver 1 fully responsible” account for 94.94% and records with “driver 1 and driver 2 are equally responsible” account for 5.06%. The statistical analysis is shown in Fig. 1.

Therefore, the research focuses on the case of “driver 1 fully responsible” and the related data fields. The case of “driver 1 and driver 2 are equally responsible” has less data, involves more fields and has more complex relationships, so it is outside the scope of the current study and is left for future work. In the accident data, not all attribute fields are associated with the cause of the accident; examples are the license plate number, the driver’s license number and the accident number. Once these unrelated identifier fields are removed, the remaining attribute fields may be related to the factors influencing the accident.

Fig. 1

The proportion of accident type.

3.2 Analytic hierarchy process

Because there are too many factors in the traffic accident data, if the association rules algorithm is used directly, a lot of useless and repeated frequent items and association rules will be generated. Therefore, the analytic hierarchy process is applied to select attributes.

AHP has a clear structure. It makes the decision process logical, mathematical and model-based, combines qualitative and quantitative analysis to locate the key of the problem quickly and accurately, and ranks the weights of the factors at all levels. AHP is suitable for decision analysis of complex multi-criteria and multi-objective problems, especially for scheme evaluation and decision making. In the traffic accident data, all the fields can be divided into four dimensions: the driver class, the vehicle class, the time-place class and the environment class. The system hierarchy is shown in Fig. 2.

Fig. 2

The system hierarchy.

Combining AHP with expert knowledge, the judgment matrices between the upper and lower levels are constructed according to the nine-point ratio scale of AHP, which is shown in Table 6 [23]. Then the consistency of each matrix is tested.

Table 1

The judgment matrix of target layer and middle layer G-C

G	C1	C2	C3	C4
C1	1	5	3	2
C2	1/5	1	1/2	1/4
C3	1/3	2	1	1/2
C4	1/2	4	2	1

Table 2

The judgment matrix of middle layer and scheme layer C1-S

C1	S1	S2	S3	S4
S1	1	5	3	2
S2	1/5	1	1/3	1/2
S3	1/3	3	1	2
S4	1/2	2	1/2	1

Table 3

The judgment matrix of middle layer and scheme layer C2-S

C2	S5	S6	S7	S8
S5	1	1	1/2	1/3
S6	1	1	1/3	1/3
S7	2	3	1	1/2
S8	3	3	2	1

Table 4

The judgment matrix of middle layer and scheme layer C3-S

C3	S9	S10	S11	S12
S9	1	3	3	4
S10	1/3	1	1/2	2
S11	1/3	2	1	3
S12	1/4	1/2	1/3	1

Table 5

The judgment matrix of middle layer and scheme layer C4-S

C4	S13	S14	S15
S13	1	5	3
S14	1/5	1	1/3
S15	1/3	3	1

Table 6

The nine-point ratio scale of AHP

Relative weight of evaluation index A / B	Definition	Explanation
1	Equally important	A and B contribute equally to the goal
3	A little more important	A is a little more important than B
5	Important	A is more important than B
7	Obviously important	A is significantly more important than B
9	Extremely important	A is extremely more important than B
2,4,6,8	Intermediate importance	The scale value corresponding to the intermediate state

Tables 1–5 show the judgment matrices between the upper and lower levels.

Next, each matrix is checked for consistency. The maximum eigenvalue λ_max of each matrix, its corresponding eigenvector, and the corresponding CI and CR values are calculated. The maximum eigenvalues of the matrices in Tables 1–5 are:

λ_max1 = 4.0211, λ_max2 = 4.1074, λ_max3 = 4.0458, λ_max4 = 4.0310, λ_max5 = 3.0385

The CI and CR are calculated as follows [8]: $CI = \frac{λ_{\max} - n}{n - 1}$ (1) $CR = \frac{CI}{RI}$ (2)

The value of RI is taken from the reference table of the mean random consistency index. When the order n of the matrix is 3, RI is 0.58; when n is 4, RI is 0.90 [23]. For each matrix, if the calculated CR value is less than 0.1, the matrix passes the consistency test and the next step can be carried out.

For the G-C matrix, CI = (4.0211 − 4) / (4 − 1) = 0.007 and CR = CI / 0.90 = 0.0078 < 0.1, so the G-C matrix passes the consistency check. Table 7 lists the results and the intermediate values calculated in our model. ω_k1 to ω_k4 are the components of the normalized eigenvector corresponding to the maximum eigenvalue of each matrix. The weights of (C1, C2, C3, C4) in the G-C judgment matrix are (0.4773, 0.0809, 0.1539, 0.2880).

Table 7

The result and intermediate values in calculation

	G-C	C1-S	C2-S	C3-S
ω _k1	0.4773	0.4909	0.1377	0.4673
ω _k2	0.0809	0.0863	0.1258	0.1601
ω _k3	0.1539	0.2483	0.2879	0.2772
ω _k4	0.2880	0.1745	0.4486	0.0954
λ _max	4.0211	4.1074	4.0458	4.0310
CI	0.007	0.0358	0.0153	0.0103
CR	0.0078	0.039	0.017	0.0115
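The consistency check can be reproduced with a short Python sketch. It uses plain power iteration to approximate the principal eigenpair, which is an assumption of this illustration and not necessarily the numerical method used by the authors; the function names are hypothetical, and the `RI` table holds only the two values quoted in the text.

```python
# Mean random consistency index for the matrix orders quoted in the text.
RI = {3: 0.58, 4: 0.90}


def principal_eigen(matrix, iterations=200):
    """Approximate the largest eigenvalue and its normalized eigenvector."""
    n = len(matrix)
    w = [1.0 / n] * n
    lam = float(n)
    for _ in range(iterations):
        v = [sum(matrix[i][j] * w[j] for j in range(n)) for i in range(n)]
        lam = sum(v)              # w sums to 1, so sum(A w) estimates lambda_max
        w = [x / lam for x in v]  # renormalize so the weights sum to 1
    return lam, w


def consistency(matrix):
    """Return (lambda_max, weights, CI, CR) following Equations (1) and (2)."""
    n = len(matrix)
    lam, w = principal_eigen(matrix)
    ci = (lam - n) / (n - 1)
    cr = ci / RI[n]
    return lam, w, ci, cr


# Judgment matrix G-C from Table 1.
G_C = [
    [1, 5, 3, 2],
    [1 / 5, 1, 1 / 2, 1 / 4],
    [1 / 3, 2, 1, 1 / 2],
    [1 / 2, 4, 2, 1],
]

lam, w, ci, cr = consistency(G_C)
print(round(lam, 4), [round(x, 4) for x in w], round(ci, 4), round(cr, 4))
```

The same function can be applied to the matrices in Tables 2–5, with the 3×3 matrix C4-S using RI = 0.58.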

Finally, the weight of each attribute in the scheme layer relative to the goal layer is obtained. The fields whose weight is greater than a certain threshold are selected as the main factors that affect traffic accidents. From Table 8, we can see that a threshold between 0.0376 (0.1539*0.2772) and 0.04 (0.4773*0.2483) gives an appropriate partition, so that the main factor fields can be selected easily. If the threshold is too large, too few fields are selected, resulting in fewer association rules. Conversely, if the threshold is too small, too many fields are selected, which defeats the purpose of filtering the main factor fields.

Table 8

The weight values of each field attribute in relation to the goal layer

Criterion layer	Weight	Scheme layer	Weight
		Driving years	0.4909
Driver	0.4773	Driver gender	0.0863
		Driver age	0.2483
		Way of training	0.1745
		Car brand 1	0.1377
Vehicle	0.0809	Car brand 2	0.1258
		Car color 1	0.2879
		Car color 2	0.4486
		The day	0.2772
Time-Address	0.1539	The month	0.1601
		The time	0.4673
		The address	0.0954
		Weather condition	0.6369
Environment	0.2880	Temperature	0.1047
		Wind condition	0.2583

3.3 Hybrid Apriori-Genetic algorithm

Combining the advantages of the genetic algorithm and Apriori, this paper designs a hybrid genetic association rule mining algorithm. Apriori, the classic algorithm in association rule mining, is used to find the frequent itemsets in the traffic accident data. The frequent itemsets are encoded as chromosomes and form the initial population of the genetic algorithm, and the fitness value of each chromosome is calculated according to the predefined fitness function (or evaluation function). Chromosomes with high fitness values are chosen to reproduce, and a new generation is produced by genetic operations (selection, crossover, mutation). Through continuous evolution over the generations, the population finally converges to a group of individuals with the highest fitness, or the number of iterations reaches a preset threshold. The result, the optimal classification rule set, is then output.

The flow chart of the hybrid algorithm is shown in Fig. 3:

Fig. 3

The flow chart of the hybrid algorithm.

3.3.1 Design of the encoding

For the traffic accident data set, the factors that affect the traffic accident are taken as the rule antecedent and the type of cause of the accident as the consequent. Rules of the form “driving age, age, training school, time and other fields => driver1fault” are expected to be found. If an attribute in the rule (such as driving age) has n categories, an x-bit binary code is used to represent the attribute, where x and n satisfy: $x = \min \{ x \mid 2^{x} > n \}$ (3)

The consequent of the rule (driver1fault) is taken as the classification attribute. It represents the type of cause of the accident, of which there are 9 kinds. It is represented in binary in the same way as the attributes in the rule antecedent. To facilitate the subsequent calculation, the fields in the raw data are preprocessed. Table 10 describes the classification and labels of the type of cause of the accident, and Table 9 those of driving age.

Table 9

The classification and labels of driving age

Label	Meaning
Driving experience 1	0–4 years
Driving experience 2	5–11 years
Driving experience 3	12–19 years
Driving experience 4	20 years and more

Table 10

The classification and labels of accident type

Label	Meaning
1	Rear-end
2	Driving against traffic
3	Reversing
4	Not shifting to a low gear or not applying the parking brake when parking, causing the vehicle to slide
5	Opening or closing a door
6	Violating traffic signals
7	Not giving way according to the rule
8	Other circumstances in which the driver is fully responsible under the law
9	Not matching any of the previous 8 items, or both sides have the above situations

The classification and labels of other fields are shown in Table 11. From the frequent itemsets obtained by the Apriori algorithm, the items that contain both feature attributes and the classification attribute are selected. The chosen itemsets are then encoded as the initial population. For example, suppose the frequent item [Driving age = ‘Driving experience 1’, Driver training = ‘school training’, driver1fault = ‘1’] is obtained. Driver training has two categories, ‘school training’ and ‘self training’, so by formula (3) x should be 2: ‘school training’ corresponds to ‘01’ and ‘self training’ corresponds to ‘10’. Driving age has four categories, so x is 3: ‘Driving experience 1’ is encoded as ‘001’, ‘Driving experience 2’ as ‘010’, ‘Driving experience 3’ as ‘011’ and ‘Driving experience 4’ as ‘100’. driver1fault = ‘1’ is encoded as ‘0001’. Each attribute that does not appear in the frequent item corresponds to a binary string of x zeros, where x depends on the number of categories of that attribute. In the program, a list is constructed to store the binary chromosome corresponding to a frequent item. The length of the list is 23, corresponding to seven feature attributes and one classification attribute. The code of each attribute field is stored in the list in a fixed order.

Table 11

The classification and labels of some fields

No.	Feature attribute	Data value	Corresponding label
S1	Age	0–24, 25–35, 36–47, 48–53, 54-above	age1, age2, age3, age4, age5
S3	Sex	male, female	male, female
S4	way of training	self training, school training	self training, school training
S11	Time	0–7am, 8–12am, 13–18pm, 19–23pm	time1, time2, time3, time4
S13	Weather	rain, rain turned cloudy or overcast, cloudy turned overcast, cloudy or cloudy turned sunny, overcast turned rain or cloudy turned rain, overcast turned sunny or cloudy	Weather1, Weather2, Weather3, Weather4, Weather5, Weather6
S15	Wind direction	The east wind turns south wind, The northeast wind turns south winds, The northeast wind turns southeast wind, The northeast wind turns east wind, The southeast wind turns south wind, The south wind turns southeast wind	Wind1, Wind2, Wind3, Wind4, Wind5, Wind6
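The encoding scheme of Section 3.3.1 can be sketched in Python as follows. The attribute names, their order in the chromosome and the 1-based category indices are assumptions made for this illustration; only the category counts and the example codes come from the text.

```python
# (attribute name, number of categories). Names and order are assumed here;
# the counts follow Tables 9-11 and the nine accident types of Table 10.
ATTRIBUTES = [
    ("driving_age", 4),
    ("sex", 2),
    ("age", 5),
    ("training", 2),
    ("time", 4),
    ("weather", 6),
    ("wind", 6),
    ("driver1fault", 9),
]


def bits_needed(n):
    """Smallest x with 2**x > n, as in formula (3)."""
    x = 1
    while 2 ** x <= n:
        x += 1
    return x


def encode(item):
    """Encode a frequent item as a binary chromosome string.

    `item` maps an attribute name to its 1-based category index; attributes
    that are absent from the item are encoded as all zeros.
    """
    parts = []
    for name, n_categories in ATTRIBUTES:
        width = bits_needed(n_categories)
        parts.append(format(item.get(name, 0), "0{}b".format(width)))
    return "".join(parts)


# The example from the text: 'Driving experience 1', 'school training'
# (assumed to be category 1, matching the code '01'), driver1fault = '1'.
chromosome = encode({"driving_age": 1, "training": 1, "driver1fault": 1})
print(chromosome, len(chromosome))
```

With these category counts the per-attribute widths are 3, 2, 3, 2, 3, 3, 3 and 4 bits, which add up to the chromosome length of 23 stated above.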

3.3.2 Definition of the fitness function

The fitness function evaluates the ability of an individual to adapt to the environment and is the basis for natural selection. Because each chromosome can be seen as a rule of the form “driving age, age, the way of training, time, weather and other fields => driver1fault”, the chromosome can be evaluated by support, confidence, coverage and other metrics. The support and confidence of a rule reflect its usefulness and certainty, and the coverage expresses the proportion of records with the consequent that the rule covers. Taking these into account together, the fitness function F(r) is calculated by Equation (4). $F (r) = a * S (r) + b * C (r) + c * R (r)$ (4)

The variable r represents the rule; a, b and c are constant coefficients, each in the range [0, 1]. S(r) is the support of the rule, C(r) is the confidence of the rule and R(r) is the coverage of the rule.

Let N be the number of records in the entire data set. Let C denote the remaining attributes of the rule after the driver1fault attribute is removed, and let R_C be the number of occurrences of C in the data set. Let D denote the driver1fault attribute of the rule, and let R_D be the number of occurrences of D in the data set. The number of records in which C and D occur together is written R_C ∪ R_D. S(r) is then defined as [10]: $S (r) = \frac{R_{C} \cup R_{D}}{N}$ (5)

C(r) is defined as [10]: $C (r) = \frac{R_{C} \cup R_{D}}{R_{C}}$ (6)

R(r) is defined as [25]: $R (r) = \frac{R_{C} \cup R_{D}}{R_{D}}$ (7)

The values of the constant coefficients a, b and c are adjusted by the user as required, which changes the emphasis of the rule evaluation and steers the evolution in the desired direction.
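Equations (4)–(7) can be written as a small Python sketch. Representing records as dictionaries, the function names and the toy records are assumptions for illustration; they are not the authors' code or data.

```python
def matches(record, conditions):
    """True if the record has every attribute value listed in conditions."""
    return all(record.get(key) == value for key, value in conditions.items())


def rule_metrics(records, antecedent, consequent):
    """Return (support, confidence, coverage) as in Equations (5)-(7)."""
    n = len(records)
    r_c = sum(1 for rec in records if matches(rec, antecedent))
    r_d = sum(1 for rec in records if matches(rec, consequent))
    r_cd = sum(
        1 for rec in records
        if matches(rec, antecedent) and matches(rec, consequent)
    )
    support = r_cd / n
    confidence = r_cd / r_c if r_c else 0.0
    coverage = r_cd / r_d if r_d else 0.0
    return support, confidence, coverage


def fitness(records, antecedent, consequent, a=1.0, b=1.0, c=1.0):
    """F(r) = a*S(r) + b*C(r) + c*R(r), Equation (4)."""
    s, conf, cov = rule_metrics(records, antecedent, consequent)
    return a * s + b * conf + c * cov


# Hypothetical toy records, not taken from the Guiyang data set.
records = [
    {"sex": "male", "time": "time3", "driver1fault": "1"},
    {"sex": "male", "time": "time3", "driver1fault": "1"},
    {"sex": "male", "time": "time1", "driver1fault": "7"},
    {"sex": "female", "time": "time3", "driver1fault": "1"},
]
rule_if = {"sex": "male", "time": "time3"}
rule_then = {"driver1fault": "1"}
print(rule_metrics(records, rule_if, rule_then))
print(fitness(records, rule_if, rule_then))
```

With a = b = c = 1 the fitness is simply the sum of the three metrics, so its maximum possible value is 3.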

3.3.3 Definition of the genetic operators

The selection operator

The selection operation uses the roulette wheel algorithm. The specific process is described below:

For each chromosome in the population, its fitness value is calculated, and all fitness values are laid out on a disc so that the area of each sector is proportional to the fitness value. The larger the sector of an individual (its fitness), the greater its probability of being selected when the wheel is spun. Assuming that the size of the initial population is p, p random numbers between 0 and 1 are generated in turn, and for each one the chromosome whose sector contains the random number is selected.
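Roulette wheel selection can be sketched in Python as follows. The function name and the `rng` parameter are assumptions of this illustration, and the sketch assumes non-negative fitness values with a positive total.

```python
import bisect
import itertools
import random


def roulette_select(population, fitnesses, rng=random):
    """Draw len(population) individuals with probability proportional to fitness."""
    # Cumulative fitness marks the right edge of each sector on the wheel.
    cumulative = list(itertools.accumulate(fitnesses))
    total = cumulative[-1]
    selected = []
    for _ in range(len(population)):
        spin = rng.random() * total  # a random number in [0, 1) scaled to the wheel
        index = bisect.bisect_right(cumulative, spin)
        # Guard against a floating-point spin landing exactly on the last edge.
        selected.append(population[min(index, len(population) - 1)])
    return selected


# Hypothetical chromosomes and fitness values.
population = ["rule_a", "rule_b", "rule_c"]
fitnesses = [1.72, 1.32, 1.91]
print(roulette_select(population, fitnesses, random.Random(0)))
```

Passing a seeded `random.Random` instance makes a run repeatable, which is convenient when comparing algorithm variants.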

The crossover operator

The crossover probability is set to 0.6. In order to speed up the evolution of the population without destroying its genetic diversity, after the two parents are selected by the selection operator, single-point crossover is performed k times, each time at a randomly generated crossover position. The substrings before and after the crossover position are exchanged between the two parents to form two new individuals, so a total of 2k individuals are produced. To find a better rule set when mining the traffic accident data, the newly generated individuals are sorted by fitness, and those whose fitness exceeds the fitness threshold are selected from the 2k individuals and added to the result. These selected individuals are also added to the original population to form a new population. In this way, the genes of both parents are preserved and the performance of the individuals in the population improves greatly during evolution.

For each generated rule, if the sum of support, confidence and coverage is greater than a certain threshold, the rule is suitable for the next round of evolution. For the rules in the final result, the validity of each rule can be judged by its accompanying support, confidence and coverage.
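The k-fold single-point crossover with fitness filtering can be sketched in Python as below. The function names, the `rng` parameter and the toy fitness function are assumptions for illustration, and the sketch leaves out the crossover probability check for brevity.

```python
import random


def single_point_crossover(parent_a, parent_b, point):
    """Swap the tails of two equal-length bit strings at the given position."""
    child_1 = parent_a[:point] + parent_b[point:]
    child_2 = parent_b[:point] + parent_a[point:]
    return child_1, child_2


def crossover_k_times(parent_a, parent_b, k, fitness, threshold, rng=random):
    """Produce 2k children and keep those whose fitness exceeds the threshold."""
    children = []
    for _ in range(k):
        # Pick a cut strictly inside the string so both parents contribute.
        point = rng.randint(1, len(parent_a) - 1)
        children.extend(single_point_crossover(parent_a, parent_b, point))
    children.sort(key=fitness, reverse=True)
    return [child for child in children if fitness(child) > threshold]


def count_ones(chromosome):
    """Toy fitness used only for this example: the number of 1 bits."""
    return chromosome.count("1")


kept = crossover_k_times("0011", "1100", k=3, fitness=count_ones,
                         threshold=1, rng=random.Random(0))
print(kept)
```

In the hybrid algorithm the fitness argument would be F(r) from Equation (4) evaluated on the decoded rule.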

The mutation operator

The mutation probability is set adaptively as follows (P_m denotes the mutation probability):

If (the fitness value of individual > the average fitness of the population)

Then {P_m is small or close to zero;}

Else {P_m is relatively large;}
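The adaptive rule can be sketched in Python. The concrete probabilities `pm_low` and `pm_high` are hypothetical values chosen for this illustration, since the text only states that P_m is small for fit individuals and relatively large otherwise; bit-flip mutation is likewise an assumed choice of operator.

```python
import random


def mutation_probability(fitness_value, average_fitness,
                         pm_low=0.001, pm_high=0.1):
    """Small P_m for above-average individuals, larger P_m otherwise."""
    if fitness_value > average_fitness:
        return pm_low
    return pm_high


def mutate(chromosome, pm, rng=random):
    """Flip each bit of a binary string independently with probability pm."""
    flipped = []
    for bit in chromosome:
        if rng.random() < pm:
            flipped.append("1" if bit == "0" else "0")
        else:
            flipped.append(bit)
    return "".join(flipped)


pm = mutation_probability(fitness_value=1.2, average_fitness=1.5)
print(pm, mutate("00100000010000000000001", pm, random.Random(0)))
```

Keeping P_m low for fit individuals protects good rules, while the higher rate for weak individuals maintains diversity in the population.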

3.4 Parallelization of hybrid algorithm

In order to improve the running speed of the hybrid algorithm, the hybrid algorithm is parallelized. The idea of the Apriori algorithm is that the frequent k-itemsets L_k are generated by scanning the database repeatedly, and the candidate set C_(k+1) is generated by self-joining L_k. The occurrences of every candidate of length k+1 in C_(k+1) are counted in the transaction set DB, and the process repeats until no more frequent itemsets can be found. Clearly, the database must be scanned once each time an L_k is found, and the candidate sets are huge. Similarly, in the genetic algorithm each new rule produced in an iteration requires one scan of the database to calculate its fitness. Therefore, when the candidate itemsets are large, the hybrid algorithm spends a large amount of time on system I/O. In the implementation, multiple processes are opened to read the transaction set DB (because multithreading in Python is limited, Python provides the multiprocessing module for parallel processing with multiple processes). The algorithm is described below:

Input: The preprocessed traffic accident data set DB, a frequent item or rule r

Output: Frequency of r in DB

Begin

Initialize parameters: BlockSize = fixed value, num = fixed value, shared list array = [0, ..., 0] of length num;

Initialize the synchronization lock lock;

For i in range(num):

Open process i, which executes the function found(i, array, lock, DB, r);

Wait for every process to finish and return to the main process, then merge the return values.

End

Table 12
The results of association rules found by the hybrid algorithm

Rule	Fitness value	Support	Confidence	Coverage
‘male’, ‘time3’=> ‘1’	1.72	0.28	0.71	0.73
‘Driving experience 1’, ‘male’=> ‘1’	1.71	0.32	0.88	0.51
‘school training’=> ‘1’	1.66	0.37	0.79	0.54
‘male’, ‘rain’=> ‘1’	1.73	0.44	0.67	0.62
‘self training’=> ‘7’	1.91	0.41	0.83	0.67
‘school training’, ‘time 3’=> ‘7’	1.55	0.33	0.75	0.47
‘school training’, ‘male’, ‘age 2’=> ‘7’	1.32	0.25	0.71	0.36

The function found(i, array, lock, DB, r) is described below:

Process i acquires the lock on array, so other processes are blocked and cannot read the value of array;

Take the maximum value max in array, update array[i] = max + BlockSize, and then release the lock;

If (max + BlockSize) < size of DB:

Read the data from max to max + BlockSize in DB, and count the frequency count of r in this region;

Else:

Read the data from max to the end of DB, and count the frequency count of r in this region;

Return count;

Similarly, the fitness value of the rule can be derived from Equations (4)–(7) after the corresponding frequencies are calculated.
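The block-wise counting can be sketched with Python's multiprocessing module. This is a simplified variant and not the procedure above: instead of a shared array and a lock through which processes claim blocks, it splits DB into fixed blocks in advance and hands them to a `Pool`. It is written for Python 3, whereas the experiments used Python 2.7, and the function names and the record format (dictionaries) are assumptions.

```python
from multiprocessing import Pool


def split_blocks(db, block_size):
    """Cut the record list into consecutive blocks of at most block_size records."""
    return [db[start:start + block_size] for start in range(0, len(db), block_size)]


def count_in_block(block, rule):
    """Count the records in one block that contain every attribute value of rule."""
    return sum(
        1 for record in block
        if all(record.get(key) == value for key, value in rule.items())
    )


def parallel_count(db, rule, block_size=10000, num=4):
    """Count the frequency of rule in db using num worker processes."""
    blocks = split_blocks(db, block_size)
    with Pool(processes=num) as pool:
        counts = pool.starmap(count_in_block, [(block, rule) for block in blocks])
    # Merge the per-block results in the main process.
    return sum(counts)
```

On platforms that start workers by spawning a new interpreter, `parallel_count` should be called from code guarded by `if __name__ == "__main__":`. Static blocks avoid the lock entirely, at the cost of less even load balancing than the dynamic scheme described above.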

4 Experimental result analysis

4.1 Experimental design

From the data in Table 8, the weight of Driver in the criterion layer relative to Accident in the goal layer is 0.4773, while the weight of Driving years in the scheme layer relative to Driver in the criterion layer is 0.4909. So the weight of Driving years relative to Traffic Accident is 0.4773*0.4909 = 0.2343. After the weight of each field in the scheme layer relative to Traffic Accident is calculated, the fields whose weight is greater than the threshold are selected as the main factors of the accident.
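The weight of each field relative to the goal layer is the product of its criterion weight and its scheme weight. The following Python sketch applies this to the values of Table 8; the dictionary layout and function names are assumptions for illustration, and the threshold is left as a parameter.

```python
# Criterion-layer and scheme-layer weights copied from Table 8.
CRITERIA = {
    "Driver": 0.4773,
    "Vehicle": 0.0809,
    "Time-Address": 0.1539,
    "Environment": 0.2880,
}
SCHEME = {
    "Driver": {"Driving years": 0.4909, "Driver gender": 0.0863,
               "Driver age": 0.2483, "Way of training": 0.1745},
    "Vehicle": {"Car brand 1": 0.1377, "Car brand 2": 0.1258,
                "Car color 1": 0.2879, "Car color 2": 0.4486},
    "Time-Address": {"The day": 0.2772, "The month": 0.1601,
                     "The time": 0.4673, "The address": 0.0954},
    "Environment": {"Weather condition": 0.6369, "Temperature": 0.1047,
                    "Wind condition": 0.2583},
}


def global_weights(criteria, scheme):
    """Weight of each field relative to the goal: criterion weight * scheme weight."""
    return {
        field: criteria[criterion] * weight
        for criterion, fields in scheme.items()
        for field, weight in fields.items()
    }


def select_fields(weights, threshold):
    """Fields whose global weight exceeds the threshold, largest first."""
    chosen = [field for field, weight in weights.items() if weight > threshold]
    return sorted(chosen, key=lambda field: weights[field], reverse=True)


weights = global_weights(CRITERIA, SCHEME)
for field in sorted(weights, key=weights.get, reverse=True):
    print(field, round(weights[field], 4))
```

Printing the ranked weights makes it easy to see how the set of selected fields changes as the threshold passed to `select_fields` is raised or lowered.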

Finally, the experiment selected driving age, sex, age, way of training, time, weather condition and wind direction as the main factors in the cause of traffic accidents. The data of these fields and of driver1fault serve as input for the hybrid Apriori-Genetic algorithm. The data is first preprocessed by denoising, discretization, concept hierarchy generation and other steps. Part of the preprocessing results are shown in Table 11. The traditional Apriori algorithm, the simple genetic algorithm and the hybrid Apriori-Genetic algorithm are each applied to the historical traffic accident data of Guiyang in 2015, and the mining results are compared to see which performs better.

After data preprocessing, the range of the ‘Age’ field in the data set is {0–24, 25–35, 36–47, 48–53, 54 and above}, marked with {age1, age2, age3, age4, age5} respectively; the value of ‘way of training’ is {self training, school training}; the value of ‘sex’ is {male, female}; the ‘time’ field is marked with {time1, time2, time3, time4} after discretization and hierarchical processing; the range of the ‘weather’ field after hierarchical processing is {Weather1, Weather2, Weather3, Weather4, Weather5, Weather6}; and the range of the wind direction after hierarchical processing is {Wind1, Wind2, Wind3, Wind4, Wind5, Wind6}. The handling of driving age is shown in Table 9 and the handling of the accident type in Table 10.

4.2 Mining results about classification rule

The constant coefficient a, b, c in the fitness function F(r) (see equation 4) all is 1. After the vacancies in original data is deleted, data preprocessing and other operations have done, there is remaining 42389 data in the set. The apriori-Genetic algorithm is used to deal with the pre-processed data and the support threshold is set to 0.1. The association rules which has a higher value of fitness are find by the hybrid algorithm are list in the Table 12. The corresponding fitness, support, confidence, and coverage are followed by the rule.

The fitness value of first rule-‘male, time3 => 1’ is 1.72, and its support is 0.28, confidence is 0.71, coverage is 0.73. It can be knew from the rule that if the driver is male and the time is the time3 (13–18 pm), the accident with the type ‘1’ (rear-end) is often happened. And the rule has a high value of coverage. The fitness value of the rule-‘male’, ‘rain’ => ‘1’ is 1.73. The rule shows that if the driver is male and the weather is rainy day, the accident with the type ‘1’ (rear-end) is also often happened. It has higher support, but the confidence and coverage is smaller.

The rule ‘self training => 7’ shows that accidents of type ‘7’ (not giving way according to the rules) often happen when the driver was self-trained. When the driver was trained at a school and the time is time3, the probability of an accident of type ‘7’ is also higher.

Through detailed analysis of the resulting traffic accident rule set, reliable and meaningful rules can be found, which is of real value for the targeted prevention and scientific management of traffic accidents.

4.3 Compared with Apriori algorithm and genetic algorithm

The Apriori algorithm is applied to the data whose fields were selected by the analytic hierarchy process, and the association rules are obtained directly. The simple genetic algorithm is applied to the same AHP-selected data. Its initial population is randomly generated, and the population size equals that of the hybrid Apriori-Genetic algorithm. The fitness function is the same as Equation 4, with a = b = c = 1. The genetic operators are designed in the same way as in the hybrid Apriori-Genetic algorithm.

The algorithms were implemented and run on an Intel Core i5-4200H @ 2.80 GHz (2.79 GHz) processor with 8 GB of memory, under Windows 10, using Python 2.7. The experimental results of the different algorithms are compared under different support thresholds and different numbers of generations.

Figure 4a shows the number of expected accident rules obtained by the hybrid algorithm and the Apriori algorithm under different support thresholds. An expected rule is defined as a rule whose fitness is greater than 1.0, which guarantees the reliability of the rule. The number of generations of the hybrid algorithm is 100. Figure 4b shows the number of expected accident rules generated by the hybrid algorithm and the simple genetic algorithm for different numbers of generations, with the support threshold set to 0.1 for both.

Because the hybrid algorithm uses support, confidence and coverage together as the evaluation index of a rule, whereas the Apriori algorithm searches without such guidance, the hybrid algorithm finds more expected rules that meet user expectations at the same support threshold and generates fewer useless rules. When the number of generations is small, the simple genetic algorithm finds fewer expected rules than the hybrid algorithm because of its random search. As the number of generations increases, however, a larger part of the search space is explored, the numbers of expected rules found by the two algorithms converge, and the difference between them becomes small.
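The counting of expected rules behind Figure 4 amounts to a threshold filter on fitness. The sketch below illustrates it; the function name `expected_rules` and the dictionary layout are hypothetical, the first two fitness values are the ones reported in Section 4.2, and the third rule is invented purely to show a rule being rejected.

```python
def expected_rules(scored_rules, min_fitness=1.0):
    """Keep rules whose fitness exceeds min_fitness, best first."""
    kept = [r for r in scored_rules if r["fitness"] > min_fitness]
    return sorted(kept, key=lambda r: r["fitness"], reverse=True)


scored_rules = [
    {"rule": "male, time3 => 1", "fitness": 1.72},  # reported in Section 4.2
    {"rule": "male, rain => 1", "fitness": 1.73},   # reported in Section 4.2
    {"rule": "female, time1 => 7", "fitness": 0.85},  # hypothetical, for illustration
]

kept = expected_rules(scored_rules)
print(len(kept), [r["rule"] for r in kept])
```

The comparison is strict, so a rule with a fitness of exactly 1.0 would not count as expected under this reading of the definition.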

Fig. 4

(a) The number of association rules obtained by the Apriori algorithm and the hybrid GA algorithm. (b) The number of association rules obtained by the genetic algorithm and the hybrid GA algorithm.

Figure 5a compares the execution time of the hybrid algorithm and the Apriori algorithm under different support thresholds. Figure 5b compares the execution time of the hybrid algorithm and the simple genetic algorithm under the same support threshold (0.1) and different numbers of generations.

Fig. 5

(a) The runtime of the Apriori algorithm and the hybrid GA algorithm. (b) The runtime of the genetic algorithm and the hybrid GA algorithm.

It can be seen from the figure that the hybrid algorithm runs faster than the simple genetic algorithm when the number of generations is not large, but it takes longer than the Apriori algorithm.

Figures 6a and 6b compare the execution time of the parallelized hybrid algorithm with that of the other algorithms. The parallel hybrid algorithm runs faster than the simple genetic algorithm, and its execution time is clearly lower than that of the hybrid algorithm without parallelization.
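This section does not describe how the hybrid algorithm was parallelized. One common option, shown here only as a hedged sketch, is to evaluate the fitness of the rules in a population concurrently, since each evaluation is independent of the others. All names below are hypothetical, the fitness is assumed to be the weighted sum of support, confidence and coverage, and a thread pool is used only so that the sketch runs anywhere; for CPU-bound fitness evaluation in Python a process pool (`concurrent.futures.ProcessPoolExecutor`) would be the more natural choice.

```python
from concurrent.futures import ThreadPoolExecutor


def rule_fitness(rule, records, a=1.0, b=1.0, c=1.0):
    """Assumed fitness: a*support + b*confidence + c*coverage."""
    antecedent, consequent = rule
    if not records:
        return 0.0
    n_x = n_y = n_xy = 0
    for items, label in records:
        covers = antecedent <= items
        hit = label == consequent
        n_x += covers
        n_y += hit
        n_xy += covers and hit
    support = n_xy / len(records)
    confidence = n_xy / n_x if n_x else 0.0
    coverage = n_xy / n_y if n_y else 0.0
    return a * support + b * confidence + c * coverage


def evaluate_population(population, records, max_workers=4):
    """Score every rule concurrently; results keep the population order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(lambda rule: rule_fitness(rule, records), population))


# Toy data, invented for illustration only.
records = [
    ({"male", "time3"}, "1"),
    ({"male", "time3"}, "1"),
    ({"male", "time3"}, "7"),
    ({"female", "time1"}, "1"),
    ({"male", "time2"}, "7"),
]
population = [
    (frozenset({"male", "time3"}), "1"),
    (frozenset({"male"}), "7"),
]
scores = evaluate_population(population, records)
print(scores)
```

Because the evaluations share no state, the concurrent result is identical to a sequential loop over the population; only the wall-clock time changes.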

Fig. 6

(a) Comparison of running time between the Apriori algorithm and the parallel hybrid GA algorithm. (b) Comparison of running time between the genetic algorithm and the parallel hybrid GA algorithm.

5 Conclusion

This paper focuses on the mining of traffic accident data. In view of the multi-dimensional and multi-layer characteristics of traffic accident data, an accident causation analysis model based on AHP and the hybrid Apriori-Genetic algorithm is proposed to mine the causes of accidents. The hybrid algorithm is compared with existing algorithms for processing traffic accident data and is also parallelized. The experimental results show that the model improves the accuracy of mining, finds more expected association rules, and has good application potential. With this model we found some meaningful traffic accident patterns, but the construction of the judgment matrix in AHP is always subjective. How to eliminate this source of error will be the subject of future work.

Acknowledgments

This work was supported by the National Natural Science Foundation of China under grants No. 61772553 and No. 61379058, the Technology Plan Project of Hunan Province under grant No. 2015TP2017, and the Fundamental Research Funds for the Central Universities of Central South University under grant No. 2016zzts359.

References

1. Ministry of Transport of the People’s Republic of China, China Statistical Yearbook of Transportation, Ministry of Transport, ed., People’s Communications Press, Beijing, 2015, pp. 11–15.

2. Ministry of Public Security of China, The total number of traffic accidents occurred, The State Council of China, http://data.stats.gov.cn/search.htmls=traffic_accident, 2015.

3. Ministry of Public Security of China, The Statistical Annual Report on Road Traffic Accidents of the People’s Republic of China (2015 Year), The State Council of China, http://www.mps.gov.cn/n2256342/index.html, 2016.

4. Guiyang Public Security Bureau, The big data contest of Guiyang traffic, Guiyang municipal government, http://jjzd.gygov.gov.cn/art//1/15/art_9_878350.html, 2015.

5. Beshah and Hill, Mining road traffic accident data to improve safety: Role of road-related factors on accident severity in Ethiopia, AAAI Spring Symposium: Artificial Intelligence for Development 24 (2010), 1173–1181.

6. Marukatat, Structure-based rule selection framework for association rule mining of traffic accident data, International Conference on Computational and Information Science 5 (2006), 231–239.

7. Murat Y.S., Modelling traffic accident data by cluster analysis approach, Teknik Dergi 20(3) (2009), 4759–4777.

8. Roshamida A.J., Amir M.A.U. and Zulkifly M.R., Risk assessment of dry bulk cargo operations using Analytic Hierarchy Process (AHP) method, IEEE Trans on Information and Communication Technology (2016), 146–159.

9. Deng, Pan, Shen and Gui, Credit distribution for influence maximization in online social networks with node features, Journal of Intelligent and Fuzzy Systems 31(2) (2016), 979–990.

10. Agrawal, Imielinski and Swami, Mining association rules between sets of items in large databases, ACM SIGMOD Record 22(2) (1993), 207–216.

11. Whitley L.D., The GENITOR algorithm and selection pressure: Why rank-based allocation of reproductive trials is best, International Computer Games Association 89 (1989), 116–123.

12. Deng, Li, Dong and Ota, Finding overlapping communities based on Markov chain and link clustering, Peer-to-Peer Networking and Applications 10(2) (2017), 411–420.

13. and Zhao, A traffic accident causation analysis method based on AHP-apriori, Procedia Engineering 137 (2016), 103–110.

14. Ghosh, Biswas and Sarkar, Mining frequent itemsets using genetic algorithm, International Journal of Artificial Intelligence & Applications 1(4) (2010), 133–143.

15. Chadokar S.K., Singh and Singh, Optimizing network traffic by generating association rules using Hybrid Apriori-Genetic algorithm, IEEE Trans on Wireless and Optical Communications Networks (2013), 1–5.

16. Jain and Kabra, Mining & optimization of association rules using effective algorithm, International Journal of Emerging Technology and Advanced Engineering 2(4) (2012), 281–285.

17. Ren and Luo, Research on GA and association rules applying in mining of classification, Computer Engineering and Applications 47(17) (2011), 131–133.

18. Zhou J.L. and Shia Y.-B., A hybrid fuzzy FTA-AHP method for risk decision-making in accident emergency response of work system, Journal of Intelligent & Fuzzy Systems 29(4) (2015), 1381–1391.

19. Kishor and Porika, An efficient approach for mining positive and negative association rules from large transactional databases, IEEE Trans on Inventive Computation Technologies 1 (2016), 1–5.

20. Naredi and Deshmukh R.A., Improved extraction of quantitative rules using best M positive negative association rules algorithm, IEEE Trans on Electronics, Computing and Communication Technologies (2015), 17–25.

21. Ravi and Khare, EO-ARM: An efficient and optimized k-map based positive-negative association rule mining technique, IEEE Trans on Circuit, Power and Computing Technologies (2014), 1723–1727.

22. Doshi and Roy, Enhanced data processing using positive negative association mining on AJAX data, IEEE Trans on Systems, Communication and Information Technology Applications (2014), 386–390.

23. Mahmoudzadeh and Bafandeh, A new method for consistency test in fuzzy AHP, Journal of Intelligent & Fuzzy Systems 25(2) (2013), 457–461.

24. Xie, Ning, Wang, Xie, Cao, Xie and Wen, Recover corrupted data in sensor networks: A matrix completion solution, IEEE Transactions on Mobile Computing 16(5) (2017), 1434–1448.

25. Nair J.J. and Thomas, Improvised apriori with frequent subgraph tree for extracting frequent subgraphs, Journal of Intelligent & Fuzzy Systems 32(4) (2017), 3209–3219.

26. Lin J.C.-W. et al., A fast algorithm for mining fuzzy frequent itemsets, Journal of Intelligent & Fuzzy Systems 29(6) (2015), 2373–2379.

27. Lacerda, Estefane G.M., de Carvalho A.C.P.L.F. and Teresa B.L., Model selection via genetic algorithms for RBF networks, Journal of Intelligent & Fuzzy Systems 13(4) (2002), 111–122.

28. McClintock, Lunney and Hashim, A genetic algorithm environment for star pattern recognition, Journal of Intelligent & Fuzzy Systems 6(1) (1998), 3–16.

29.

Huang

, Yin

, Schwebel

D.C.