Abstract
Accurate and rapid prediction of coal and gas outbursts is essential for preventing accidents and protecting the environment. This paper presents a novel feature selection and outburst classification framework that identifies effective candidate features and improves classification accuracy. First, Apriori is applied to extract preliminary association rules from the sample data and attribute features of coal and gas outbursts, yielding the sample data and features that are effective for outburst prediction. Second, to reduce the redundancy of the strong association rules obtained from Apriori, Boruta is applied to select all highly relevant features based on these rules. Third, Random Forest (RF) is used to assign weights to the selected features according to their importance to outbursts. Based on the resulting high-quality sample data and optimal features, a KNN model whose parameters are tuned by Bayesian Optimization (BO) is used to predict coal and gas outbursts. The experimental results show that the proposed Apriori-Boruta feature selection model obtains significant sample data, and that the proposed optimized RF-KNN classifier achieves better performance than traditional prediction models in terms of the number of optimal features and prediction accuracy.
Introduction
Coal and gas outbursts not only endanger safe production and personal safety in coal mines, but also restrict the production capacity of the mine [1]. In addition, gas exposed to fire can cause an explosion, which severely affects the underground environment and causes air pollution [2]. It is therefore very important to study the accurate and rapid prediction of coal and gas outbursts. Outbursts are affected by many influencing factors, and the relationships among these factors are very complex. It is uncertain which factors play an important part in the outburst process, and the outburst mechanism is difficult to explain. Therefore, before a classifier is built, a correlation analysis of the influencing factors is necessary: it can eliminate unimportant factors and noisy sample data, identify the key factors, and provide the corresponding association rules to the subsequent classifier, improving the accuracy and reliability of outburst prediction. Accordingly, this paper focuses on the correlation analysis of influencing factors and the optimized design of the classifier for coal and gas outburst prediction.
In an outburst prediction model, the analysis of influencing factors and the design of the classification model are the two basic steps. The quality of the factor analysis is very important to the performance of the classification model. Extracting effective features benefits the classification model and is the premise of accurate and efficient outburst prediction. Because of the complexity of the outburst process and the diversity and uncertainty of the influencing factors, these factors are interrelated and coupled, and there exist nonlinear interactions among them. An outburst prediction model therefore focuses on identifying the main controlling factors affecting outbursts and analyzing the mutual relationships among them.
A full understanding of the main controlling factors contributing to outbursts is significant for coal and gas outburst prediction. However, these factors affect each other, which significantly increases the difficulty of prediction. Correlation analysis methods include association rule analysis and feature selection. Association rule analysis can find frequent interdependences and associations between features of the sample data. Two or more of the variables influencing coal and gas outbursts are often associated, and association rule mining can find such relationships in the sample data and attribute features, thereby determining the association rules among the main factors affecting outbursts. The association rules between influencing factors and outbursts, and among the influencing factors themselves, are regarded as knowledge rules for outburst prediction and are used to judge the outburst state. However, current association analysis algorithms such as Apriori [3–6], FP-Growth [7], Eclat [8] and grey relation analysis (GRA) [9] generally produce redundant rules: many association rules provide the same information, which significantly reduces the efficiency and accuracy of classifiers. Deleting some association rules does not affect the integrity of the overall information, so it is necessary to explore a new method that meets the needs of mining association rules among outburst influencing factors. FP-Growth has some disadvantages. For example, when the tree has too many child nodes, or when a tree containing only a single prefix is generated, the efficiency of the algorithm is greatly reduced. FP-Growth needs to recursively generate conditional databases and conditional FP-trees, so its memory overhead is large. It can only be used to mine single-dimensional Boolean association rules.
The Eclat algorithm borrows the idea of an inverted index, but it uses the index for data statistics rather than for fast search: the inverted index is used to quickly build frequent itemsets, and the complexity is very high. The Apriori algorithm adopts an iterative, level-by-level search. The algorithm is simple and clear, requires no complex theoretical derivation, and is very easy to implement. Given the characteristics of the attribute features and sample data of coal and gas outbursts, Apriori is used to extract association rules among the influencing factors, and a feature selection method is then introduced to eliminate the redundancy among influencing factors that Apriori cannot remove.
Feature selection finds the most significant features and removes the remaining ones. Existing feature selection methods include filter, wrapper and hybrid methods. The wrapper method depends on the classification algorithm: it takes the classifier as part of the feature subset evaluation and relies on it to evaluate each feature subset. It can effectively measure the influence of interactions among multiple features on the target and gives good classification results, but its computational complexity is very high. In the wrapper method, the classifier is used as a black box that returns a feature ranking, so any classifier that provides a feature ranking can be used. For practical reasons, the classifier used should be both computationally efficient and simple, and preferably have no user-defined parameters. Wrapper methods such as Sequential Forward Search (SFS) [10], Sequential Backward Search (SBS) [10], Genetic Algorithm (GA) [11], Whale Optimization (WO) [12], SVM-RFE [13] and RF [14] have been applied to feature selection in many fields. Boruta [15] is a wrapper feature selection model. However, the classification performance of the wrapped classifier affects the feature selection and the prediction accuracy, so it is necessary to find an effective classifier. The RF classifier is suitable for producing reliable and robust estimates of feature importance, so Boruta based on RF is introduced to evaluate the influencing factors of coal and gas outbursts; it is able to identify the optimal feature subset that is highly related to the class labels. On this basis, the choice of sensitive statistical characteristics as the basis of the subsequent outburst classifier is further studied.
Therefore, this paper proposes a novel method, Apriori-Boruta, that combines Apriori association rule analysis with Boruta feature selection to select sensitive and effective statistical characteristics and sample data for the outburst classifier.
For coal and gas outburst classification, various intelligent techniques such as BP neural networks [16, 17], SVM [18], ELM [19], PNN [20, 21] and RF [22] have been applied successfully in this field; each has advantages and disadvantages. However, some problems remain. The KNN classifier [23] gives the same weight to all neighboring samples and does not consider the distance between the test sample and the training samples. Moreover, when searching for nearest neighbors, measuring similarity and calculating distances between samples, all features are given the same weight without considering differences in their importance, which leads to instability and low accuracy of the final classification. The effects of the individual factors on outbursts differ, and feature weights can express the importance of each feature for classification: features that are important to classification are assigned higher weights, and features that are unimportant are assigned lower weights. Weighting the features by importance makes it easier for the KNN classifier to divide the sample space. Therefore, according to the characteristics of coal and gas outbursts, the optimization of the KNN classifier is another problem that needs to be studied. To address the above problems, a novel KNN classifier called RF-KNN, based on RF feature weighting, is proposed in this paper to improve the performance of KNN.
The contribution of this paper is an Apriori-Boruta feature selection method combined with an RF-KNN classifier model. First, a novel Apriori-Boruta method is proposed to select sensitive features and sample data as the basis of subsequent analysis. Second, a novel RF-KNN classifier model is presented, in which RF feature weighting is introduced to optimize the KNN classifier and improve its prediction accuracy. Third, Bayesian optimization is introduced to tune the parameters of RF-KNN, which further improves the accuracy and efficiency of coal and gas outburst prediction.
The rest of this paper is organized as follows. Section 2 describes the theory and algorithms used in this paper and proposes a new feature selection framework and an optimized classifier model based on them. Section 3 provides the experimental design and parameter settings. Section 4 conducts multi-level verification and comparison experiments with related work on an actual dataset. Section 5 summarizes the paper and discusses future work.
Theory and method
This paper focuses on an effective correlation analysis method for the influencing factors and on the design of a classification model with strong generalization ability for coal and gas outbursts. It proposes a novel coupled model that consists of data preprocessing, Apriori association rule extraction combined with Boruta feature selection, and an RF-KNN classifier that combines RF feature weighting with BO parameter tuning. The results for feature selection and classifier optimization show the effectiveness, adaptability and superiority of the proposed approach, which obtains the effective association rules that affect coal and gas outbursts and improves the accuracy and efficiency of outburst classification. The flowchart is shown in Fig. 1.

Flowchart of Apriori-Boruta and RF-KNN model.
Apriori method
Association rule mining [4] finds the relevance between different items in the same events; that is, it obtains the subsets of items or attributes that often appear together in the events and their relevance. Its main purpose is to discover the relevance between the internal structural features of the sample data. Mathematically, the rules are denoted as follows. Given two disjoint non-empty itemsets X and Y, if there is a logical implication X→Y between them, then X→Y is an association rule, where X is the left-hand side (antecedent) of the rule and Y is the right-hand side (consequent). The purpose of association rule mining is to derive the associations between items in an itemset by analyzing the relationships hidden in the record set. The effectiveness of a rule is usually measured by its support and confidence. The support is calculated as shown in (1), and the confidence as shown in (2).
A rule X→Y has support s if s is the percentage of transactions in the transaction set D that contain X∪Y, and confidence c if c is the percentage of transactions containing X that also contain Y.
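For reference, the usual way of writing these two measures is given below. This is the standard textbook formulation, written with the symbols X, Y and D used above; T (a single transaction) is an assumed symbol, and the numbering follows the references to (1) and (2) in the text.

```latex
\begin{align}
\operatorname{support}(X \rightarrow Y) &= P(X \cup Y)
  = \frac{\lvert \{\, T \in D : X \cup Y \subseteq T \,\} \rvert}{\lvert D \rvert} \tag{1} \\
\operatorname{confidence}(X \rightarrow Y) &= P(Y \mid X)
  = \frac{\operatorname{support}(X \cup Y)}{\operatorname{support}(X)} \tag{2}
\end{align}
```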
The core idea of the Apriori algorithm [22] is to find frequent itemsets through candidate generation and downward-closure checking, using an iterative, level-by-level search. The algorithm searches breadth-first, uses the frequent (k-1)-itemsets to search for k-itemsets, and finally calculates the confidence of the rules derived from the frequent itemsets to obtain association rules. The screening process for evaluation indexes based on Apriori is shown in Fig. 2. The specific steps are as follows:

Flowchart of Apriori model.
1. Let Ck denote the candidate k-itemsets and Lk the frequent k-itemsets, that is, the itemsets whose support is not lower than the user-defined threshold. All items in the transaction set are taken as the candidate 1-itemsets C1. The support of each candidate is counted, all items whose support is lower than the threshold are deleted, and the frequent 1-itemsets L1 are generated.
2. Use the join step to generate the candidates to be pruned, and use the pruning step to obtain the candidate 2-itemsets C2.
3. Traverse the transaction set again to obtain the support of each candidate, delete all candidates whose support is lower than the threshold, and obtain the frequent 2-itemsets L2.
4. Iterate the above process until the candidate set Ck is empty, that is, until no frequent k-itemsets Lk can be found.
5. Generate association rules from the frequent itemsets and calculate the confidence of each rule. If the confidence is greater than the minimum confidence, the rule is added to the final rule set.
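The steps above can be sketched in a few lines of Python. This is a minimal illustration rather than the implementation used in the paper: the function names `apriori` and `association_rules`, the item codes in the usage example, and the choice to treat both thresholds as inclusive are all assumptions made for the example.

```python
from itertools import combinations


def apriori(transactions, min_support):
    """Return {itemset: support} for every frequent itemset (level-wise search)."""
    transactions = [frozenset(t) for t in transactions]
    n = len(transactions)

    def support(itemset):
        # Fraction of transactions that contain every item of the itemset.
        return sum(itemset <= t for t in transactions) / n

    # Frequent 1-itemsets (L1): count each single item and apply the threshold.
    items = {item for t in transactions for item in t}
    current = {}
    for item in items:
        s = support(frozenset([item]))
        if s >= min_support:
            current[frozenset([item])] = s

    frequent = dict(current)
    k = 2
    while current:
        prev = set(current)
        # Join step: merge pairs of frequent (k-1)-itemsets into k-itemsets.
        candidates = {a | b for a in prev for b in prev if len(a | b) == k}
        # Prune step: every (k-1)-subset of a candidate must itself be frequent.
        candidates = {
            c for c in candidates
            if all(frozenset(sub) in prev for sub in combinations(c, k - 1))
        }
        # Scan the transactions again and keep the candidates that pass (Lk).
        current = {}
        for c in candidates:
            s = support(c)
            if s >= min_support:
                current[c] = s
        frequent.update(current)
        k += 1
    return frequent


def association_rules(frequent, min_confidence):
    """Return (lhs, rhs, support, confidence) tuples that reach min_confidence."""
    rules = []
    for itemset, supp in frequent.items():
        for r in range(1, len(itemset)):
            for lhs in combinations(itemset, r):
                lhs = frozenset(lhs)
                # Downward closure guarantees lhs is also in `frequent`.
                confidence = supp / frequent[lhs]
                if confidence >= min_confidence:
                    rules.append((lhs, itemset - lhs, supp, confidence))
    return rules


# Hypothetical discretized records; the codes are illustrative only.
example = [
    {"A1", "B2", "outburst"},
    {"A1", "B2", "outburst"},
    {"A1", "C1"},
    {"B2", "outburst"},
    {"A1", "B2", "outburst"},
]
example_frequent = apriori(example, min_support=0.4)
example_rules = association_rules(example_frequent, min_confidence=0.8)
```

The pruning step relies on downward closure: a k-itemset can only be frequent if all of its (k-1)-subsets are frequent, which is what keeps the candidate sets small.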
Boruta [24] is a wrapper method built around the RF classifier. RF is an ensemble method that provides estimates of feature importance and performs classification by the voting of multiple unbiased weak classifiers, namely decision trees, each constructed independently on a different bagged sample of the training set. The loss of classification accuracy caused by randomly permuting the values of an attribute among the objects is taken as the importance measure of that attribute. Boruta uses RF to guide the feature selection process: the discriminative power of a feature is determined by measuring the loss of classification accuracy. Each tree in the forest identifies the target category independently, and the mean and standard deviation of the accuracy loss are calculated over the trees. The Z-score, obtained by dividing the mean accuracy loss by its standard deviation, is used instead of the mean loss alone to measure the importance of a single feature, because the Z-score also takes into account the fluctuation of the accuracy loss among the decision trees in the RF.
The Z-score cannot be used directly to measure feature importance, because its distribution is not necessarily standard normal. Boruta solves this problem by constructing a corresponding shadow feature for each original feature and using the Z-scores of these shadow features as a reference; it evaluates the importance of each feature through repeated iterations. If the importance of an original feature is significantly higher than that of the shadow features, the original feature is important; if it is significantly lower, the original feature is unimportant. Here the original features are the candidates to be selected, while the shadow features are generated from the original features according to the following rules. First, random interference terms are added to the original features to generate extended features. Second, the shadow features are generated by sampling from the extended features, and they are regenerated in each iteration. Algorithm 1 is as follows:
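The following is not Algorithm 1 itself but a minimal sketch of the shadow-feature comparison it is built on. To stay self-contained it makes two loudly stated simplifications: shadow features are produced by simply shuffling each column (the interference step described above is omitted), and the RF Z-score is replaced by the absolute Pearson correlation with the label as a stand-in importance score. The function names are hypothetical.

```python
import random


def pearson_abs(xs, ys):
    """Absolute Pearson correlation; a stand-in for the RF Z-score importance."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0
    return abs(cov) / (var_x * var_y) ** 0.5


def make_shadows(columns, rng):
    """Shuffle a copy of every column, destroying its relation to the label."""
    shadows = []
    for column in columns:
        shadow = list(column)
        rng.shuffle(shadow)
        shadows.append(shadow)
    return shadows


def shadow_hits(columns, labels, n_iter=20, seed=0):
    """Count, per feature, how often it beats the best shadow feature."""
    rng = random.Random(seed)
    hits = [0] * len(columns)
    for _ in range(n_iter):
        # Shadows are regenerated in every iteration.
        shadows = make_shadows(columns, rng)
        threshold = max(pearson_abs(s, labels) for s in shadows)
        for j, column in enumerate(columns):
            if pearson_abs(column, labels) > threshold:
                hits[j] += 1
    return hits
```

In the full Boruta procedure, the hit counts are then compared against a binomial distribution to decide whether a feature is confirmed, rejected or still tentative; that statistical test is left out of this sketch.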
Apriori can not only mine the associations between coal and gas outbursts and the influencing factors, but also obtain effective sample data. However, because the influencing factors and sample data have complex and varied correlations, Apriori mining cannot remove redundancy, and its output cannot be used directly as the knowledge and rules for recognition. Feature selection selects the most effective features from the original features to reduce the dimension of the dataset while preserving the physical meaning of the original attributes, which gives it an advantage in readability and interpretability. By removing redundant attributes, an optimal feature subset is chosen from the feature set. Boruta is an all-relevant wrapper feature selection method that attempts to find all features carrying information usable for prediction, instead of finding only a subset of features that produces the minimum classifier error, as most traditional wrapper algorithms do. Boruta finds all features that are relevant to the decision variable, whether strongly or weakly, which makes it well suited to determining the best feature subset. Boruta directly determines the number of features by finding all the features associated with the category in the candidate feature set; it thereby finds the correlation between coal and gas outbursts and the influencing factors and determines the main factors affecting outbursts. This ability to find all relevant features compensates for the fact that Apriori cannot automatically give the optimal feature subset. Therefore, we combine Apriori and Boruta. The flowchart is shown in Fig. 3.

Flowchart of Apriori-Boruta model.
RF feature weighting model
RF [25–30] can select both sample data and attribute features. The features related to the classification are selected according to their importance in the RF. Because of the inherent randomness of RF, the model may assign different importance weights to the features on each run. The model is therefore trained many times: each time a certain number of features is selected and the intersection with the previously selected features is retained, and after a certain number of runs we obtain the features that contribute most to the classifier. RF assesses feature importance by measuring how much each feature contributes in each tree, then averaging and comparing the contributions of the different features. Two contribution measures are commonly used as evaluation indicators: the Gini index and the out-of-bag (OOB) error rate. The OOB measure is the mean reduction in classification accuracy on the out-of-bag data after the values of a variable are randomly perturbed, compared with before the perturbation, and the Gini index measures the impurity of a data partition or training set. Gini impurity represents the probability that a randomly selected sample is misclassified in a subset [39–41].
We use VIM to denote the variable importance measure and GI to denote the Gini index. Assuming there are C features x1, x2, ..., xC, we need to calculate the Gini-based importance score VIMj for each feature xj, which is the average change in node-splitting impurity attributable to feature j over all decision trees. The Gini index is computed as follows:
Here k denotes the k-th category, and pmk denotes the proportion of category k in node m. The importance of feature xj at node m, namely the change in the Gini index before and after node m is split, is as follows:
GIl and GIr denote the Gini indices of the two new nodes after the split. If the nodes at which feature xj appears in decision tree i form the set M, then the importance of xj in tree i is:
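For reference, a commonly used form of the three quantities just described, written with the symbols defined above, is sketched below. K (the number of categories) is an assumed symbol, and the exact normalization used by the authors may differ.

```latex
\begin{align}
GI_m &= \sum_{k=1}^{K} p_{mk}\,(1 - p_{mk}) = 1 - \sum_{k=1}^{K} p_{mk}^{2} \\
VIM_{jm} &= GI_m - GI_l - GI_r \\
VIM_{ij} &= \sum_{m \in M} VIM_{jm}
\end{align}
```

Averaging VIM_{ij} over all trees i then gives the overall score VIM_j for feature x_j.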
For every decision tree, the features are sampled, the Gini index is calculated at each split, and the whole splitting process is carried out; once a tree is built, the importance contributed at each of its nodes is obtained. Using the Gini index as the measure of feature relevance, a relevance ranking is generated for each tree, and the values are averaged over the trees to obtain the ranking of feature importance. This paper takes the importance scores produced during RF construction as the index of feature importance, averages them, and compares the contributions of the individual features. For a decision tree t in an RF, the number of correctly classified out-of-bag samples is denoted countt,ini. A new test set for t is obtained by randomly permuting the values of an arbitrary feature v in the out-of-bag data; t is then used to classify the permuted data, and the number of correctly classified samples is denoted countt,v. The importance of feature v can then be derived from the following formula.
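One common way of writing this permutation-based importance, using the counts defined above, is shown below. It is given as an assumed form for orientation; some formulations additionally divide each term by the number of out-of-bag samples of tree t.

```latex
\begin{equation}
VIM(v) = \frac{1}{N} \sum_{t=1}^{N} \left( \mathrm{count}_{t,\mathrm{ini}} - \mathrm{count}_{t,v} \right)
\end{equation}
```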
Here N is the number of decision trees in the RF. RF has a unique advantage in dealing with interactions between features. Different influencing factors have different importance for outbursts: some features are important for one class but less important or even completely irrelevant for another, so features should be weighted according to their contribution to the target classification. Usually a weight coefficient is used to reflect these differences. Weighting features by importance changes the distribution of the sample data, making samples of the same class more compact and samples of different classes more separated, which makes it easier for the classifier to divide the sample space. KNN does not need to build a prediction model; the prediction for a test sample is determined directly by its k nearest training samples. However, KNN is very sensitive to the choice of input data: all features are given the same importance, and unimportant features are treated the same as important ones, which affects the selection of the nearest neighbors.
KNN can deal with most of the nonlinear and multi-modal problems in industrial processes. The distance or similarity between the test samples and the training samples is calculated; data of the same class have high similarity, while data of different classes have low similarity. For example, let r1, r2, ..., rn be the training dataset with n points, where each point rj has d dimensions rj1, rj2, ..., rjd, and let s1, s2, ..., sm be the test dataset in the same d-dimensional space. The goal of KNN is to find, for every si ∈ s, its k nearest data points in r, and then to determine the class of si from the classes of these k points. KNN generally uses majority voting as the classification decision rule: the majority class among the k neighbors of an input instance determines its class, which corresponds to empirical risk minimization. RF is used to assign weights to the features according to their importance to outburst classification. Each feature receives a weight coefficient that carries information about its correlation with the target, so that features weakly correlated with the target receive low weights. This paper therefore uses RF feature weights to make up for this defect of KNN. To improve accuracy, the correlation with the labels is incorporated into the feature weight coefficients, so that the nearest neighbors found are closer to the test sample in terms of label information. The procedure of RF feature weighting for KNN is as follows:
1. The RF is used to weight the features to obtain a new feature set.
2. For a test sample xi, the Euclidean distance d(xi, xt) between xi and each training sample xt is calculated. According to the distance, the k + 1 nearest samples of xt, namely xt,1, xt,2, ..., xt,k+1, are found in the training set.
3. Among the k + 1 neighboring samples, select the one with the largest distance from xt and set it as xt,k+1, with the corresponding distance d(xi, xt,k+1). Use d(xi, xt,k+1) to normalize the distances between the other k neighboring samples and xt.
4. The normalized distance D(xt, xi) is transformed by a Gaussian kernel function into a similarity probability p(xt, xi).
5. From the similarity probabilities p(xt, xi) between xt and its k nearest neighbors, the posterior probability p(li | xi) that xt belongs to class li (i = 1, 2, ..., r) is obtained.
Here KNN(xt) denotes the classification result of the weighted KNN method for the test sample. The weighted KNN classifier gives different weights to the neighboring samples according to their similarity to the test sample, so that more similar training samples have more influence on the result. This weakens the sensitivity to the choice of k and strengthens the robustness of the classification results.
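The procedure above can be sketched as follows. This is a hedged illustration under several assumptions that are not taken from the paper: the feature weights are applied inside the squared Euclidean distance, the Gaussian kernel has the form exp(-D²/2), and the posterior is the normalized sum of kernel similarities per class. The names `weighted_distance` and `weighted_knn_predict` are hypothetical, and the weights are passed in directly instead of being estimated by an RF.

```python
import math
from collections import defaultdict


def weighted_distance(a, b, weights):
    """Euclidean distance with one (assumed RF-derived) weight per feature."""
    return math.sqrt(sum(w * (x - y) ** 2 for w, x, y in zip(weights, a, b)))


def weighted_knn_predict(train_x, train_y, sample, weights, k=3):
    """Return (predicted_label, posterior) for one test sample.

    Requires at least k + 1 training samples.
    """
    # Step 2: distances from the test sample to every training sample,
    # then keep the k + 1 nearest ones.
    distances = sorted(
        ((weighted_distance(sample, x, weights), label)
         for x, label in zip(train_x, train_y)),
        key=lambda pair: pair[0],
    )
    neighbours = distances[:k + 1]

    # Step 3: the (k + 1)-th neighbour only supplies the normalizing distance.
    reference = neighbours[-1][0]

    # Steps 4 and 5: Gaussian kernel on the normalized distance, summed per class.
    posterior = defaultdict(float)
    for distance, label in neighbours[:k]:
        normalized = distance / reference if reference > 0 else 0.0
        posterior[label] += math.exp(-normalized ** 2 / 2)  # assumed kernel form

    total = sum(posterior.values())
    posterior = {label: p / total for label, p in posterior.items()}
    return max(posterior, key=posterior.get), posterior
```

Because closer neighbours contribute larger kernel values, a single distant neighbour changes the posterior less than it would under plain majority voting, which is the effect described in the text.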
Dataset description and preprocessing
The experimental data come from a mine in Henan Province. With reference to a previous study [31], this paper establishes a sample database of 48 samples, of which 28 are coal and gas outburst accidents and 20 are non-outburst cases. The coal and gas outburst prewarning indexes include gas pressure (1), initial velocity of gas output (2), initial velocity of gas emission (3), coal seam firmness coefficient (4), structural coal thickness (5) and fault structure complexity (6), together with the outburst classification result (7). Coal and gas outburst prediction is a binary classification problem.
To make the experimental results objective, 10-fold cross-validation [32, 33] is used to verify the classification performance, and each attribute value is normalized to [0, 1].
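The two preprocessing steps can be illustrated with a short sketch. It assumes min-max scaling for the normalization to [0, 1] and a shuffled, unstratified split for the folds; the paper does not state either detail, and the function names are hypothetical.

```python
import random


def min_max_normalize(rows):
    """Scale every column of a list of numeric rows to the range [0, 1]."""
    columns = list(zip(*rows))
    lows = [min(c) for c in columns]
    highs = [max(c) for c in columns]
    return [
        [(v - lo) / (hi - lo) if hi > lo else 0.0
         for v, lo, hi in zip(row, lows, highs)]
        for row in rows
    ]


def k_fold_indices(n_samples, n_folds=10, seed=0):
    """Yield (train_indices, test_indices) pairs for shuffled k-fold CV."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    # Deal the shuffled indices into n_folds nearly equal folds.
    folds = [indices[i::n_folds] for i in range(n_folds)]
    for i, test in enumerate(folds):
        train = [j for f, fold in enumerate(folds) if f != i for j in fold]
        yield train, test


# With 48 samples, eight folds hold 5 test samples and two folds hold 4.
example_splits = list(k_fold_indices(48, n_folds=10))
```

In practice the scaling bounds would usually be computed on the training folds only and then applied to the held-out fold, so that no information from the test samples leaks into training.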
Experimental results and analysis
Association rule analysis of coal and gas outburst influencing factors
The six attribute indexes are first discretized according to the rules formulated by the state, expert experience and related references [34, 35]. The continuous attribute values in the database are discretized and coded; the results of preprocessing and partial coding are shown in Table 1. A correlation analysis experiment is then conducted to identify, quantitatively and qualitatively, which input factors most affect the model results and to what extent these factors affect each other, so as to understand the relative importance of the influencing factors to the predicted outcome and the degree of dependence among them. The resulting Apriori association rules for coal and gas outbursts are shown in Table 2.
Raw data for coal and gas outburst
The results of preprocessing and partial coding
In this paper we use the Apriori method to extract strong association rules between the influencing factors. In the experiment, the minimum support is set to 20% and the minimum confidence to 80%, according to the literature and expert experience. From Table 3, among the two-factor high-frequency combinations, rule 4 has the highest support, 0.37: if the initial velocity of gas output is within the interval (5.1-10) and the fault structure complexity is within the interval (1.1-2), coal and gas outburst is more likely to occur. Rules 2, 5 and 6 cover three further combinations: the initial velocity of gas emission within the interval (10.1-15) and the fault structure complexity within the interval (1.1-2); the fault structure complexity within the interval (1.1-2) and the coal firmness coefficient within the interval (0.61-0.85); and the initial velocity of gas release within the interval (0.5-1) and the fault structure complexity within the interval (1.1-2). The probability of outburst is the same for these three combinations, each with a support of 0.34. The probability of outburst associated with the combination of initial velocity of gas output and fault structure complexity is therefore the highest, higher than that of the combinations involving gas pressure, initial velocity of gas emission and coal firmness coefficient.
Apriori association rules of coal and gas outburst
From the four-factor high-frequency combinations in Table 2, association rules 8, 9, 10 and 11 show that if the initial velocity of gas emission is within the interval (0.5-1), the initial velocity of gas output is within the interval (5.1-10), the fault structure complexity is within the interval (1.1-2) and the coal firmness coefficient is within the interval (0.61-0.85), the probability of outburst is 0.24. If the initial velocity of gas emission is within the interval (0.5-1), the fault structure complexity is within the interval (1.1-2) and the coal firmness coefficient is within the interval (0.61-0.85), the probability of outburst is 0.20. Combinations that include the initial velocity of gas output therefore carry a higher outburst probability than combinations of the initial velocity of gas emission with the other factors.
The mined association rules show that gas pressure, initial velocity of gas output, initial velocity of gas emission, coal firmness coefficient and fault structure complexity are closely related to outbursts, and they also indicate to what extent these parameters affect outbursts when the interactions among factors are considered. When the initial velocity of gas output is within the interval (5.1-10) and the fault structure complexity is within the interval (1.1-2), when the coal firmness coefficient is within the interval (0.61-0.85) and the initial velocity of gas emission is within the interval (10.1-15), or when the initial velocity of gas emission is within the interval (0.5-1), the risk of gas outburst is greater, so outburst prevention measures are necessary under such mining conditions.
According to rule 4, outburst occurs under the following conditions: the gas pressure is within the interval (5.1-10) and the fault structure complexity is within the interval (1.1-2); the initial velocity of gas emission is within the interval (10.1-15) and the coal firmness coefficient is within the interval (0.61-0.85); the initial velocity of gas emission is within the interval (0.5-1) and the coal firmness coefficient is within the interval (0.61-0.85); or the initial velocity of gas emission is within the interval (0.5-1) and the fault structure complexity is within the interval (1.1-2). According to association rule 10, when outburst occurs with the initial velocity of gas emission within the interval (10.1-15) and the coal firmness coefficient within the interval (5.1-10), the fault structure complexity can be predicted to be within the interval (1.1-2) and the gas pressure within the interval (0.5-1), so prevention should focus on the initial velocity of gas emission and the coal firmness coefficient. According to association rule 11, when outburst occurs with the initial velocity of gas emission within the interval (10.1-15) and the gas pressure within the interval (0.5-1), the fault structure complexity can be predicted to be within the interval (1.1-2) and the initial velocity of gas output within the interval (5.1-10), so prevention should focus on the initial velocity of gas emission and the gas pressure. Rule 8 shows that when the initial velocity of gas emission is within the interval (10.1-15) and the coal firmness coefficient is within the interval (0.61-0.85), the fault structure complexity is within the interval (1.1-2) and the gas pressure is within the interval (0.5-1). According to rule 9, when the gas pressure is within the interval (0.5-1) and the initial velocity of gas output is within the interval (5.1-10), the coal firmness coefficient is within the interval (0.61-0.85) and the fault structure complexity is within the interval (1.1-2).
To improve the prediction accuracy of coal and gas outburst and to prevent outbursts, it is necessary to focus on the weak links and to draw on practical knowledge of coal and gas outburst. The mining results above can be used to analyze the outburst situation, find its weak links, and put forward improvement measures and plans for them. In particular, the fault structure complexity is a very important indicator of outburst and is strongly independent of the other influencing factors, followed by the coal firmness coefficient. Generally, an outburst occurs more easily when these indicators are combined with others, and the probability of outburst varies with the combination of indicators.
Tables 4–6 show the performance comparison of the classification process with the proposed methods using different classifiers. Our experimental procedure is as follows. First, we create single models (Apriori and Boruta) with different classifiers to compare the prediction performance of the single models. Second, we generate the hybrid models Apriori-Boruta, Apriori-RF, Boruta-RF and Apriori-Boruta+RF. Among these, the Apriori-Boruta+RF model with the KNN classifier achieves very good classification performance compared with the other models.
Comparison with different feature selection using SVM
Comparison with different feature selection using NB
Comparison with different feature selection using KNN
For the SVM classifier, the accuracy evaluation shows that feature selection alone gives an overall accuracy of 0.88 with a feature dimension of 3, while association rule mining alone gives a classification accuracy of 0.88 with a feature dimension of 6. The combination of association rules and feature selection gives the best results: its highest accuracy is 0.93 with a dimension of 3, which exceeds the 0.88 obtained by either feature selection or association rule mining alone. The combined method uses Boruta feature selection to further reduce redundancy and improve accuracy on top of the knowledge extracted by the association rules. However, the accuracy of the Apriori and Boruta combination with RF feature weighting is 0.90, so RF weighting does not improve the classification accuracy significantly for SVM.
For the NB classifier, the highest classification accuracy is 0.82 when feature selection is used alone. With association rules alone, the accuracy is 0.75 and the precision is 88%. The accuracy of both Apriori-Boruta and Apriori-Boruta+RF is 0.82, the same as that of Boruta and Boruta-RF, so NB performs worst and is not suitable for predicting coal and gas outburst.
For the KNN classifier, the association rule mining algorithm gives an accuracy of 0.88 with a dimension of 6. Boruta feature selection reaches at best 0.89 with a dimension of 3, which is smaller than the dimension obtained with Apriori. The accuracy of the combination of association rules and feature selection is 0.93, which is higher than that of Apriori or Boruta alone. The reason is that the rules extracted by the association rule algorithm alone, and the features obtained by Boruta alone, still contain redundancy, so neither improves the classification much by itself; only their combination produces good results.
However, after RF feature weighting is applied, the combination of association rule mining and feature selection improves the classification accuracy significantly. The overall classification accuracy of the combined method reaches 0.95, which is higher than the accuracy of Apriori-Boruta alone and of the other methods with classifiers such as SVM and NB. The dimension is reduced and the classification accuracy is improved.
In summary, Apriori-Boruta delivers good coal and gas outburst prediction performance with an appropriate number of features and sample data. Different classifiers also perform differently: the combination of RF feature weighting and the KNN classifier produces excellent results, whereas RF feature weighting does not improve other classifiers such as SVM and NB. The reason is that SVM and NB are not sensitive to weighted features, so we use KNN as the classifier. Through the experiments, the universality of Apriori-Boruta-RF-KNN is further verified, which also demonstrates the effectiveness of Apriori-Boruta and RF-KNN in coal and gas outburst prediction.
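The sensitivity of KNN to feature weights can be illustrated with a small sketch. It assumes, as one plausible reading, that each feature weight scales the squared difference inside a Euclidean distance; the exact weighting scheme is not specified in this section. The training points, the query and the weight vector `RF_LIKE_WEIGHTS` are hypothetical stand-ins for RF-derived importances, not values from the paper.

```python
import math
from collections import Counter
from typing import List, Sequence


def weighted_distance(a: Sequence[float], b: Sequence[float],
                      weights: Sequence[float]) -> float:
    """Euclidean distance in which each squared difference is scaled by a weight."""
    return math.sqrt(sum(w * (x - y) ** 2 for x, y, w in zip(a, b, weights)))


def knn_predict(train_x: List[Sequence[float]], train_y: List[int],
                query: Sequence[float], weights: Sequence[float], k: int = 3) -> int:
    """Majority vote among the k training samples closest to `query`."""
    ranked = sorted(zip(train_x, train_y),
                    key=lambda pair: weighted_distance(pair[0], query, weights))
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0]


# Hypothetical samples scaled to [0, 1]; label 1 = outburst, 0 = no outburst.
# The first two features are informative, the third behaves like noise.
TRAIN_X = [[0.9, 0.8, 0.1], [0.8, 0.9, 0.2], [0.1, 0.2, 0.9], [0.2, 0.1, 0.8]]
TRAIN_Y = [1, 1, 0, 0]
QUERY = [0.6, 0.6, 0.9]

EQUAL_WEIGHTS = [1.0, 1.0, 1.0]    # plain KNN
RF_LIKE_WEIGHTS = [0.5, 0.4, 0.1]  # assumed importances, not taken from the paper

print("unweighted:", knn_predict(TRAIN_X, TRAIN_Y, QUERY, EQUAL_WEIGHTS, k=3))
print("weighted:  ", knn_predict(TRAIN_X, TRAIN_Y, QUERY, RF_LIKE_WEIGHTS, k=3))
```

In this toy setting the unweighted vote is dominated by the noisy third feature, while down-weighting that feature moves the nearest neighbors to the samples that agree on the informative features. Classifiers that do not rely on raw distances in the same way may benefit less from such rescaling.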
To verify the performance of different classifiers under different feature dimensions, Fig. 4 shows the classification accuracy of KNN, RF, NB and SVM individually. SVM and KNN give a higher accuracy than NB or RF. The results show that too many redundant features reduce the accuracy. The experiment in this section compares the four classifier algorithms across feature dimensions. For all classifiers, the accuracy increases significantly while the feature dimension is below 3 and does not change significantly once the dimension exceeds 3. A feature dimension of 3 therefore retains most of the classification information while offering higher efficiency and a better classification effect.

Comparison of different feature dimensions using different classifiers.
In the experiment, we compare the performance with different numbers of features and find that the highest classification accuracy with both SVM and KNN is 0.88, followed by RF and NB. SVM and KNN thus perform better than NB and RF. As the number of features increases, the classification accuracy first rises gradually; once the peak is reached, it remains essentially unchanged or declines. For KNN, SVM, NB and RF, the recognition accuracy increases gradually as the feature dimension grows, but the increase is not large. The accuracy is highest, at 0.88, when the feature dimension is 3, which shows that the algorithm proposed in this paper achieves better recognition performance in a lower dimension.
In addition, combinations of feature selection methods and classifiers differ greatly in classification performance, which depends on the performance of the classifier itself and on its preference for a feature subset. In practical applications, the classifier and the feature selection method should be chosen according to the distribution characteristics of the data samples.
To evaluate the effectiveness of the proposed model, Table 7 compares its accuracy, precision, F1-measure and running time with those of other coal and gas outburst prediction models reported in the literature. Table 7 shows that the proposed model has the highest degree of fitting with the expected output, and that its accuracy and other indicators are better than those of the other feature selection and feature extraction methods in the current literature. The indicators of the proposed model are the highest, with an accuracy of 0.95. This shows that different feature selection or feature extraction methods combined with different classifiers produce different effects, so appropriate methods must be chosen according to the characteristics of the data distribution. Table 7 also shows that the average accuracy of the other feature extraction methods combined with different classifiers is between 0.81 and 0.89. The ELM and PNN models based on KPCA feature preprocessing ignore the linear feature information in the features, so their accuracy and precision are lower than those of our method. Another reference extracts linear feature components but ignores non-linear feature components. A further reference extracts the common components of the features by FA, but FA can only evaluate the features comprehensively and cannot effectively distinguish the relationships among the information they contain; it also places requirements on the amount and composition of the data. Overall, the accuracy, precision, specificity and sensitivity of the proposed model are better than those of the other models in the literature, and the best prediction results are obtained.
Where the LDA information in the coal and gas outburst features is ignored, the prediction indexes are likewise lower than those of our proposed model. The effect of PCA is the worst, and the effect of linear discriminant analysis is moderate. In one reference, FA is used to extract the common components of the features and is then combined with an RF classifier to predict outburst; because of the limitations of FA feature extraction and of the RF classifier, the classification accuracy is only 0.84. These comparative experiments further show that the proposed correlation analysis methods can extract the key feature information of coal and gas outburst and thereby improve the accuracy of the classification algorithms.
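For reference, the indicators used in this comparison (accuracy, precision, sensitivity, specificity and F1-measure) can all be derived from the four counts of a binary confusion matrix. The sketch below uses the standard definitions; the label vectors `Y_TRUE` and `Y_PRED` and the function name `binary_metrics` are hypothetical and do not reproduce any value in Table 7.

```python
from typing import Dict, Sequence


def binary_metrics(y_true: Sequence[int], y_pred: Sequence[int]) -> Dict[str, float]:
    """Standard binary classification metrics; label 1 denotes an outburst."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)

    # Guard each ratio against an empty denominator.
    precision = tp / (tp + fp) if tp + fp else 0.0
    sensitivity = tp / (tp + fn) if tp + fn else 0.0  # also called recall
    specificity = tn / (tn + fp) if tn + fp else 0.0
    f1 = (2 * precision * sensitivity / (precision + sensitivity)
          if precision + sensitivity else 0.0)
    return {
        "accuracy": (tp + tn) / len(y_true),
        "precision": precision,
        "sensitivity": sensitivity,
        "specificity": specificity,
        "f1": f1,
    }


# Hypothetical ground truth and predictions for ten working-face samples.
Y_TRUE = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
Y_PRED = [1, 1, 1, 0, 0, 0, 0, 0, 0, 1]

metrics = binary_metrics(Y_TRUE, Y_PRED)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

For outburst prediction, sensitivity is of particular interest because a missed outburst (a false negative) is usually more costly than a false alarm.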
Comparison with prediction algorithms in literatures
Compared with the existing feature extraction or feature selection methods combined with different classifiers for coal and gas outburst in the current literature, the model proposed in this paper has a better recognition effect. The main reasons are as follows. For correlation analysis and feature selection, the features and sample data we extract fully consider the relationships among the influencing factors. In the design and optimization of the classification model, we use RF to assign weights to different features and exploit the efficiency of Bayesian optimization to tune the parameters of the KNN classifier.
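Tuning the KNN parameters amounts to maximizing a validation score over the parameter space. The sketch below shows such an objective, the leave-one-out accuracy as a function of the number of neighbors k, on hypothetical one-dimensional data with two deliberately noisy labels. For brevity it scores a small candidate grid exhaustively; this is a plain stand-in and not Bayesian optimization, which would instead fit a surrogate model to past evaluations and choose the next candidate adaptively. All names and values here are assumptions for illustration.

```python
from collections import Counter
from typing import List, Tuple

Sample = Tuple[float, int]  # (feature value, label)

# Hypothetical 1-D samples; the labels at 0.24 and 0.76 are deliberately noisy.
DATA: List[Sample] = [
    (0.10, 0), (0.20, 0), (0.24, 1), (0.30, 0), (0.38, 0),
    (0.60, 1), (0.70, 1), (0.76, 0), (0.80, 1), (0.88, 1),
]


def knn_label(train: List[Sample], query: float, k: int) -> int:
    """Majority label among the k training samples nearest to `query`."""
    nearest = sorted(train, key=lambda s: abs(s[0] - query))[:k]
    return Counter(label for _, label in nearest).most_common(1)[0][0]


def loo_accuracy(data: List[Sample], k: int) -> float:
    """Leave-one-out accuracy: the objective a parameter optimizer would maximize."""
    hits = 0
    for i, (x, y) in enumerate(data):
        rest = data[:i] + data[i + 1:]  # hold out sample i
        hits += int(knn_label(rest, x, k) == y)
    return hits / len(data)


# Exhaustive scoring of a small odd-k grid (stand-in for an adaptive search).
CANDIDATE_K = (1, 3, 5, 7)
scores = {k: loo_accuracy(DATA, k) for k in CANDIDATE_K}
best_k = max(CANDIDATE_K, key=lambda k: scores[k])  # first maximum wins on ties
print(scores, "best k:", best_k)
```

With k = 1 the noisy labels mislead their neighbors, while larger k smooths them out. An adaptive optimizer becomes worthwhile when several parameters (for example k, the distance metric and the feature weights) are tuned jointly and each evaluation is expensive.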
The coal and gas outburst classification model proposed in this paper has been verified only on sample data from a limited number of monitored working faces of one coal mine. Because actual conditions differ from mine to mine, the outburst process and the influencing factors also differ. Future work should improve the model's overall performance across a wider range of coal mines.
This paper presents a novel feature extraction method that combines Apriori and Boruta to select the most sensitive features and sample data. The correlation analysis proceeds in two steps: candidate features and sample data are identified using Apriori association rules, and a subset is then selected by applying Boruta feature selection. Furthermore, a modified KNN is proposed to achieve highly accurate prediction: RF feature weighting is used to enhance the classification effect of KNN for coal and gas outburst, which improves the prediction accuracy and significantly reduces the runtime. The selected features are evaluated with state-of-the-art optimized classifiers such as SVM, KNN and NB on real coal and gas outburst sample data. According to the results, the prediction accuracy of the combined Apriori-Boruta and RF-KNN model is significantly better than that of the other models.
The proposed model using Apriori-Boruta and RF-KNN gives higher prediction accuracy for coal and gas outburst. In future work, we will optimize the confidence and support thresholds of the association rule analysis, study how to choose the best combination of support and confidence, and modify existing classifiers to further improve coal and gas outburst prediction.
Data availability statement
Some or all data, models, or code generated or used during the study are available from the corresponding author by request.
Declaration of competing interest
The authors declare no competing financial interest.
Acknowledgments
This research was supported by the National Natural Science Foundation of China (U1704242).
Our deepest gratitude goes to the anonymous reviewers for their careful work and thoughtful suggestions that have helped improve this paper substantially.
