Abstract
To minimize over-fitting and related problems caused by the high dimensionality of the input data in software defect prediction, the attributes are often optimized using feature selection techniques. However, the comparative performance of these selection techniques in combination with machine learning algorithms remains largely unexplored for web applications. In this work, we investigate the best possible combination of feature selection techniques and machine learning algorithms, with the sample space chosen from the open source Apache Click and Rave data sets. Our results are based on 945 defect prediction models derived from parametric, non-parametric and ensemble-based machine learning algorithms, for which the metrics are derived from various filter and threshold-based ranking techniques. Friedman and Nemenyi post-hoc statistical tests are adopted to identify the performance differences among these models. We find that filter-based feature selection in combination with ensemble-based machine learning algorithms not only emerges as the best strategy but also yields a maximum feature set redundancy of 94%, with little or no compromise on the performance index.
Introduction
Defects are often what hinder progress and development in software applications [1, 2]. Defects can have many sources, such as human factors, poor design, unrealistic development time frames, communication failures, poor coding practices, use of incompatible third-party tools, and others [3]. Identifying the defects that arise from these factors is important, hence the emphasis on defect prediction models [4]. Such models assist in strategic planning, allocation and effective utilization of resources. The most widely accepted methodology is to identify an optimized set of metrics and use them in machine learning (ML) algorithms [5, 6]. The accuracy and efficiency of such models depend on the nature of the data set, the metrics selection technique and the ML algorithm itself, which are in fact intimately related to each other [7]. We note that while ML algorithms are based on certain variational and/or conditional propagation methods, for feature selection there exist three major techniques: the wrapper, ranker, and filter-based methods. The filter methods optimize the features independently of the classifier. These are based on distance approaches, on minimal correlation as in feature subset selection methods (FM) [8], or on maximal relevance as in threshold-based feature ranking (FR) methods [9]. On the other hand, the wrapper technique utilizes the ML models to select the feature subset, and hence the results obtained are classifier specific. It has also been argued that, since correlation-based feature selection makes use of all the training data at once, it can give better results than the wrapper on small data sets [8]. Hence, in this study, we have excluded wrapper techniques.
Given this complexity, we attempt to analyze the inter-dependency of these factors on defect prediction in Apache web applications. In this study we attempt to identify an optimal input metric set which would maximize the goodness of the data fit, yielding a good classification performance index [6, 12]. This is because a high dimensional input not only increases the computation time but also leads to over-fitting. Besides, in the common situation where the number of non-fault-prone modules is much larger than the number of fault-prone modules, the problem of class imbalance arises. Thus, the chosen input set may lead the machine to learn the given data too specifically and to incorrectly classify data not used in the training process [13]. Furthermore, the skewed distribution of the fault-prone and non-fault-prone modules could also lead to misclassifications if appropriate statistical measures are ignored.
A range of ML models has been used to predict fault-prone modules by utilizing object-oriented (OO) metrics [14–18]. These models have certainly provided insights into improving the quality of software with the effective utilization of resources. However, the accuracy and efficiency of defect prediction depend upon the selection of metrics and the data quality. The FM and FR techniques are most commonly used in input refinement [10] and have become an indispensable tool in software development [2]. The correlation between metrics and fault proneness has been addressed in many models [19–26]. For instance, Wang et al. [9] used ensemble feature selection techniques and found that ensembles of a few rankers were quite effective in defect prediction. Khoshgoftaar et al. [13] empirically investigated a number of feature selection techniques, including filter, wrapper and ranking, on various software data sets and found that 85% of the features were redundant. Novakovic [11] compared several feature ranking methods on two real data sets taken from the UCI repository of ML databases. The author finds that ranking methods with different supervised learning algorithms yield different results for balanced accuracy and that the classifier accuracy is sensitive to the choice of ranking indices. On the other hand, Song et al. [27] applied two wrapper approaches in their proposed model and concluded that feature selection techniques are inter-dependent with the ML algorithms, as well as with the nature of the data sets. However, little was discussed about the performance of ML algorithms with a reduced feature set in comparison with the whole feature set. Several reasons may restrict such an attempt. For instance, in certain bio-informatics case studies the specific characteristics of the data set often make it difficult to build classification models using all the attributes.
There also exist cases where a pattern of attributes may not be domain specific, such as binary attributes which appear in text classification. In contrast, in this work we numerically compare the models built with selected feature sets against those built with the whole feature set. To the best of our knowledge, little emphasis has been given to an extensive comparison of the performance of various ML techniques on widely used web applications like Apache.
We analyze the relationship between OO metrics and ML techniques using web applications. The performance of nine ML techniques is compared for defect prediction in web application classes. In addition, using statistical methods, we test the differences among the ML techniques on various releases of the Apache Click and Apache Rave data sets. From a feature space of eighteen OO metrics [28, 29], the features are extracted using eight FM and seven FR techniques. These are then fed as input to nine ML algorithms, the performance of which is determined by means of the "Area Under the Curve" (AUC). Thus, the results are based on 945 defect prediction models. Further, we include the uniqueness measure of each classifier using Friedman and Nemenyi post-hoc analysis. Primarily, the following research questions (RQs) are addressed in this study:
RQ-1. What are the features selected by FM and FR techniques for the Apache Click and Rave and, what is the redundancy?
RQ-2. What is the performance of the defect prediction models with reduced feature sets and to what extent are these at variance in comparison to that obtained with whole features, as input?
RQ-3. Which among the various parametric, non-parametric and ensemble-based ML algorithms, emerge as the best predictive technique in identifying the defect prone classes using FM and FR techniques and, what is the comparative performance?
RQ-4. Which pair of ML techniques are significantly different from each other for defect prediction using FM and FR techniques?
Systematically, to address the above RQs, we first provide a brief description of our research methodology in Section 2. In Section 3, we discuss our results, compare them with the earlier works and briefly state the limitation of the study. We summarize and conclude our major results in Section 4.
Research methodology
The metric suite serves as input to eight FM and seven FR techniques. Both the reduced subsets and the whole feature set serve as input to defect prediction models built by applying nine ML techniques. Thereafter, the models are subjected to ten-fold cross-validation and their performance is evaluated using AUC.
Variables, data collection and application selection
The independent variables of our models are given in Table 1, and the dependent variable is the presence of a defect in a class. A metric suite of the OO metrics and defect data is extracted from the change-log reports using an in-house developed Defect Collection and Reporting System (DCRS) module [31]. Our choice of the Apache data sets was based on the application size and the programming language used. We find these open source Java-based applications fit for the study as they have more than 300 classes and multiple release versions. While Apache Click provides a web framework which is relatively easy to learn and helps in understanding the client style programming model, the Rave project provides a social mash-up engine to support web widgets for internet as well as intranet applications.
List of independent variables used in the study
In general, the structure of a feature selection algorithm comprises (i) generation of a subset and its evaluation, (ii) definition of control criteria and (iii) validation of the results. On successful execution, the hypothesized subset is validated by a classifier of choice on the real data. Given that the original data set has N features, the subset search procedure has $2^N$ candidate subsets to consider. Here, we derive the feature subset selection using eight search space algorithms: (i) Best First (BF), (ii) Exhaustive Search (ES), (iii) Genetic Search (GS), (iv) Greedy Stepwise Search (GSS), (v) Linear Forward (LF), (vi) Random Search (RS), (vii) Scatter Search (SS) and (viii) Subset Size Forward Selection (SSFS). On the other hand, the FR techniques evaluate each attribute according to specified criteria. The ordered list reduces the feature selection search space for a given threshold value so as to discard the redundant and irrelevant features from a given feature space. Besides, it also helps in limiting the storage requirements and increases the algorithm speed. In this work, we adopt the widely used ranking methods which are based either on statistics (Chi-square, OneR and ReliefF) or on entropy (Information Gain, Symmetrical Uncertainty and Gain Ratio) paradigms. The seven feature ranking techniques evaluated in this study are: (i) Chi-Squared (CS) [19], (ii) Gain Ratio (GR) [6], (iii) Information Gain (IG), (iv) Filtered Attribute Evaluation (FAE), (v) OneR, (vi) ReliefF [32] and (vii) Symmetrical Uncertainty (SU). We have also considered a case with no feature selection (referred to as NONE) so as to compare the effect of the FM and FR techniques on the classification accuracy.
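The subset search described above can be illustrated with a small sketch. It is not the Weka implementation used in the study: it assumes a simplified correlation-based merit built on Pearson correlation, a plain greedy forward search, and made-up toy metric values. The helper names (`pearson`, `cfs_merit`, `greedy_forward`) are hypothetical.

```python
from math import sqrt

# With N = 18 metrics an exhaustive search would face 2**18 candidate subsets.
n_candidate_subsets = 2 ** 18

def pearson(x, y):
    """Pearson correlation of two equal-length lists (0.0 if either is constant)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return 0.0 if sxx == 0 or syy == 0 else sxy / sqrt(sxx * syy)

def cfs_merit(subset, columns, target):
    """Simplified merit: rewards relevance to the class, penalizes inter-correlation."""
    k = len(subset)
    r_cf = sum(abs(pearson(columns[f], target)) for f in subset) / k
    pairs = [(a, b) for i, a in enumerate(subset) for b in subset[i + 1:]]
    r_ff = (sum(abs(pearson(columns[a], columns[b])) for a, b in pairs) / len(pairs)
            if pairs else 0.0)
    return k * r_cf / sqrt(k + k * (k - 1) * r_ff)

def greedy_forward(columns, target):
    """Add the feature that most improves the merit; stop when nothing improves it."""
    selected, best = [], 0.0
    remaining = sorted(columns)
    while remaining:
        score, feat = max((cfs_merit(selected + [f], columns, target), f)
                          for f in remaining)
        if score <= best:
            break
        selected.append(feat)
        remaining.remove(feat)
        best = score
    return selected, best

# Toy data (illustrative only): RFC duplicates WMC, NOC is unrelated to the defect label.
columns = {
    "WMC": [1, 2, 3, 4, 5, 6, 7, 8],
    "RFC": [2, 4, 6, 8, 10, 12, 14, 16],
    "NOC": [1, 0, 1, 0, 1, 0, 1, 0],
}
target = [0, 0, 0, 0, 1, 1, 1, 1]
selected, merit = greedy_forward(columns, target)
print(n_candidate_subsets, selected, round(merit, 3))
```

In this toy case the search stops after one feature, because the duplicated and the unrelated metric do not raise the merit. This mirrors how a filter can discard most of a strongly correlated metric suite.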
Machine learning techniques
The nine ML algorithms used in this study are (i) Naive Bayes (NB), (ii) Logistic Regression (LR), (iii) Multilayer Perceptron (MLP), (iv) Decision Tree (DT), (v) Random Forest (RF), (vi) Bagging (BAG), (vii) J48, (viii) AdaBoost (AB) and (ix) LogitBoost (LB). The LR and NB techniques are statistical in nature and MLP represents a neural network solution. The DT and J48 are parametrized tree methods, while RF, BAG, LB and AB are ensemble-based methods. The Weka 3.7 suite was used to build the predictive models [33].
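As a quick check of the experimental design, the count of 945 models follows from crossing the seven release pairs with the fifteen selection techniques and the nine classifiers. The sketch below only enumerates the combinations using the abbreviations given in the text; it does not build any model, and the NONE baseline is assumed to be counted separately.

```python
from itertools import product

releases = ["Click 2.0-2.1", "Click 2.1-2.2", "Click 2.2-2.3",
            "Rave 0.12-0.13", "Rave 0.16-0.17",
            "Rave 0.20.1-0.21.1", "Rave 0.22.1-0.23"]
fm = ["BF", "ES", "GS", "GSS", "LF", "RS", "SS", "SSFS"]          # subset search
fr = ["CS", "GR", "IG", "FAE", "OneR", "ReliefF", "SU"]           # ranking
ml = ["NB", "LR", "MLP", "DT", "RF", "BAG", "J48", "AB", "LB"]    # classifiers

# One model per (release pair, selection technique, classifier) combination.
models = list(product(releases, fm + fr, ml))
print(len(models))
```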
Performance indicator and validation method
For class-imbalanced data sets, the Receiver Operating Characteristic (ROC) curve is the most widely used performance indicator [34]. The predictive capability of a model is summarized by the Area Under the Curve (AUC). We note that the AUC measure is insensitive to noise and imbalance in the data set. Furthermore, we apply ten-fold cross-validation to the models [35] so that the reported performance does not depend on a single train/test split.
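For readers unfamiliar with the measure, the following sketch computes the AUC as the probability that a randomly chosen defective class receives a higher score than a randomly chosen clean one, and assigns samples to ten folds. The stratified assignment is an assumption made for illustration, and the function names are hypothetical; the study itself relied on Weka for both steps.

```python
def auc(labels, scores):
    """AUC as the fraction of (defective, clean) pairs ranked correctly; ties count 0.5."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def stratified_folds(labels, k=10):
    """Assign sample indices to k folds so each fold keeps a similar class ratio."""
    folds = [[] for _ in range(k)]
    for cls in sorted(set(labels)):
        members = [i for i, y in enumerate(labels) if y == cls]
        for position, index in enumerate(members):
            folds[position % k].append(index)
    return folds

# Toy example: 20 clean classes and 10 defective ones (illustrative values only).
toy_labels = [0] * 20 + [1] * 10
toy_folds = stratified_folds(toy_labels, k=10)
print(auc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8]))
print([len(fold) for fold in toy_folds])
```

Because the measure depends only on the ordering of scores, a classifier that assigns the same score to every class obtains an AUC of 0.5, which is the "random selection" level referred to in the results.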
Statistical testing
The statistical difference between the ML techniques is assessed using the non-parametric Friedman test [37], which ranks a set of techniques over multiple data sets. The test statistic is defined as

$$\chi_F^2 = \frac{12n}{k(k+1)} \left[ \sum_{j=1}^{k} R_j^2 - \frac{k(k+1)^2}{4} \right] \qquad (1)$$

where n is the number of data sets, k the number of compared techniques and $R_j$ the average rank of the j-th technique (j = 1, 2, ..., k). When the Friedman statistic falls in the critical region, which is obtained from the $\chi^2$ table at a specific level of significance with (k - 1) degrees of freedom, the null hypothesis is rejected. Such a case indicates differences among the compared techniques. Following rejection of the null hypothesis, the Friedman test is followed by the non-parametric Nemenyi test [38], which performs pairwise performance comparisons of the ML techniques using the Critical Difference (CD). The CD is calculated as

$$CD = q_\alpha \sqrt{\frac{k(k+1)}{6n}} \qquad (2)$$

where $q_\alpha$ is the critical value at significance level α for k techniques.
The methodology is well suited to the present study, as (i) a particular data set from a version is filtered using several filter/ranking techniques, (ii) the dependent parameter represents ordinal data, and (iii) not all data distributions are normal. Following Demšar [39], the Friedman test is taken at a significance level of 0.05 and has been computed using SPSS software [40].
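A minimal sketch of the two test quantities follows. It assumes the standard forms of the Friedman statistic (in terms of average ranks) and of the Nemenyi critical difference as given by Demšar, and uses hypothetical function names. The study computed these values with SPSS.

```python
from math import sqrt

def friedman_chi2(avg_ranks, n):
    """Friedman statistic from the average ranks of k techniques over n data sets."""
    k = len(avg_ranks)
    return 12 * n / (k * (k + 1)) * (sum(r * r for r in avg_ranks)
                                     - k * (k + 1) ** 2 / 4)

def nemenyi_cd(k, n, q_alpha):
    """Critical difference between average ranks for the Nemenyi post-hoc test."""
    return q_alpha * sqrt(k * (k + 1) / (6 * n))

# Nine classifiers, q_alpha = 3.10 at the 0.05 level; 23 (FM) and 31 (FR) data sets.
cd_fm = nemenyi_cd(k=9, n=23, q_alpha=3.10)
cd_fr = nemenyi_cd(k=9, n=31, q_alpha=3.10)
print(round(cd_fm, 2), round(cd_fr, 2))
```

Since the CD shrinks with the square root of the number of data sets, the larger FR sample leads to the smaller critical difference.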
RQ-1. What are the features selected by FM and FR techniques for the Apache Click and Rave and, what is the redundancy?
Feature selection using FM
The results of the FM techniques for each release of the Apache Click and Rave data sets are shown in Table 2. For brevity, the FM techniques which yield the same set of features are grouped for each version. From the seven versions, we find 23 distinct feature sets. The FM techniques select as few as one feature in Rave 0.16-0.17 and a maximum of ten in Rave 0.20.1-0.21.1, which implies a redundancy of 94% and 44%, respectively. This not only reduces the computational effort but also the dimensionality, thereby decreasing the vulnerability to over-fitting. The BF, ES and GSS techniques find a similar set of features for all versions, except for Click 2.1-2.2. In most cases, at least NPM or WMC is consistently selected by these three techniques. Similarly, LF and SSFS find a similar set of features for five of the seven versions. On the other hand, RS finds a totally different feature set with respect to all others, except for two versions. For instance, while all other FM techniques select NPM for Rave 0.16-0.17, RS selects a different metric, CBM. Likewise, for Rave 0.20.1-0.21.1, RS selects eight features.
Metrics selected using FM and FR
For a given version, although the FM techniques yield different feature sets, it is found that ≃ 90% of the features are the same. For instance, six features are selected for Click 2.2-2.3, implying a redundancy of 66.7%. Of the six features selected by all FM techniques, five (CBO, Ce, NPM, DAM and CAM) are common. In general, the features selected most often across the Apache data sets are WMC, NPM, LCOM3, DAM and CAM. Moreover, none of the FM techniques finds Ca to be a relevant feature, and CBM and MOA are selected only once, both by RS in the Rave data set.
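The grouping of techniques by identical feature sets, and the extraction of the features they share, can be sketched as follows. The five common metrics are those named above for Click 2.2-2.3; the sixth metric assigned to each technique, and the choice of techniques shown, are hypothetical placeholders rather than values taken from Table 2.

```python
from collections import defaultdict

def group_by_feature_set(selection):
    """Group techniques that selected exactly the same set of features."""
    groups = defaultdict(list)
    for technique, features in selection.items():
        groups[frozenset(features)].append(technique)
    return dict(groups)

def common_features(selection):
    """Features selected by every technique."""
    return set.intersection(*(set(f) for f in selection.values()))

shared = ["CBO", "Ce", "NPM", "DAM", "CAM"]
selection = {                      # sixth feature per technique is illustrative only
    "BF": shared + ["WMC"],
    "ES": shared + ["WMC"],
    "RS": shared + ["LCOM3"],
}
groups = group_by_feature_set(selection)
common = common_features(selection)
print(len(groups), sorted(common))
```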
In Table 2, we also show the features selected by the FR techniques by applying the threshold formula $\log_2(n)$, where n (= 18) is the total number of features in the whole data set. The first four features are selected from the ranked list, thereby fixing the FR redundancy at 78%. Our results show that at least two of the FR techniques find a similar set of ranked features in all versions. The CS, SU, IG and FAE techniques select a similar set of metrics (≃ 90%) across all seven data sets. The GR technique also selects similar features, except for Rave 0.12-0.13. Further, we find that OneR and ReliefF always select a unique set of features. Similarly, for most of the versions, CS and FAE determine a similar set of features with RFC and LCOM (and/or LCOM3) in common. On the other hand, ReliefF finds a unique ranked feature set in all versions, with CAM being common irrespective of the version. The features that are selected most often by the FR techniques are WMC, RFC, LCOM3, AMC and CAM. However, unlike the FM techniques, FR gives moderate weight to all features, including Ca, CBM and MOA. Significantly, among all features, CAM, LCOM3 and RFC were found to be the most relevant in FR.
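The threshold rule and the redundancy figures quoted in this section can be reproduced with a few lines. The sketch assumes that the threshold is rounded down to an integer and that redundancy is the fraction of the eighteen metrics that is discarded; the ranking scores are made up, and the function names are hypothetical.

```python
from math import log2

def redundancy(n_selected, n_total=18):
    """Fraction of the original features that is discarded."""
    return 1 - n_selected / n_total

def top_ranked(scores):
    """Keep the first floor(log2(n)) features of a ranked list of n features."""
    cutoff = int(log2(len(scores)))
    ranked = sorted(scores, key=scores.get, reverse=True)
    return ranked[:cutoff]

# Hypothetical ranking scores for 18 placeholder metrics m1..m18.
scores = {f"m{i}": 1.0 / i for i in range(1, 19)}
kept = top_ranked(scores)
print(kept, round(100 * redundancy(len(kept)), 1))
print(round(100 * redundancy(1), 1), round(100 * redundancy(10), 1))
```

With eighteen metrics the rule keeps four features (77.8% redundancy), while the FM extremes of one and ten selected features correspond to 94.4% and 44.4%.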
Summary
We find that there exists a cluster of FM and FR techniques which yield similar sets of features. The redundancy in feature selection varies between 44% and 94% for FM, while for FR the redundancy is 78%. Besides, the FM techniques find WMC, NPM, LCOM3, DAM and CAM to be the most commonly selected features, and FR finds WMC, RFC, LCOM3, AMC and CAM to be the most relevant. This implies that WMC, LCOM3 and CAM have good discriminant power and hence constitute a solid basis for defect prediction in web applications. However, our finding does not rule out the significance of other metrics, since there exist cases where metrics such as NOC, RFC, NPM and others were frequently selected [41].
RQ-2. What is the performance of the defect prediction models with reduced feature sets and to what extent are these at variance in comparison to that obtained with whole features as input?
Defect prediction with features selected using FM
In Table 3 we list the ten-fold cross-validated results of the nine ML techniques, with the features selected by the FM techniques. The ML technique resulting in the best AUC for a given data set is highlighted in bold. For Click 2.0-2.1, the AUC values fall in the range [0.54 (DT), 0.75 (RF)] with all features included. The BF, ES, GS, GSS, LF, SS and SSFS techniques select three features, while RS selects five. The AUC values for both sets were found in the range [0.52 (DT), 0.69 (RF)]. Hence, with the use of FM techniques, only 16% and 28% of the total features were selected, resulting in a significant decrease in the complexity of the models. However, the resulting variance in the AUC values of the defect prediction models is within ±3% in comparison with the results obtained with all features. The RF yields the maximum AUC values, and the LR, NB, MLP and J48 algorithms show an increase in the AUC values with features selected using the various FM techniques. On average, the variance in the AUC is only ±3% for Click 2.1-2.2, with only 28% to 39% of the features selected using the various FM techniques. The maximum AUC value was obtained for the RF and Bagging techniques (AUC = 0.78), irrespective of the search algorithm. For Click 2.2-2.3, the AUC values fall in the range [0.71, 0.82], except for DT (0.58). With only 33% of the features selected using the various FM techniques, the AUC variance is ±2%, except for RF (5% decrease). The maximum AUC of 0.81 is obtained by AB, while DT yields the lowest predictive performance with 0.60. The marginal variation in the AUC values across the different FM techniques is due to the strong similarity of the feature sets, where five of the six selected features are in common.
The AUC values with all and the FM selected features for Click and Rave.
The range of AUC values with all features included for the Rave 0.12-0.13 version is [0.56, 0.67]. With 50% to 56% of the features selected using the various FM techniques, the variance in the AUC values is not more than ±3%. However, unlike the Click versions, the maximum AUC is obtained for the LR and MLP classifiers, which remains consistent for calculations with and without the FM techniques. For Rave 0.16-0.17, with all features the AUC range is [0.50, 0.60], with the maximum being obtained by RF and LR, and the minimum by the tree-based methods. Significantly, only one feature is selected by all FM search techniques, bringing a redundancy of 94%. The performance values of all other classifiers remain more or less consistent, with a variance of not more than ±2%. For RF, the change in the AUC value is -0.11 with respect to the calculations performed with all features as input. That only one feature is selected in this version, and that the AUC values are relatively poor, suggests that the metrics are strongly correlated. The range of AUC for Rave 0.20.1-0.21.1 with all features is found to be [0.45, 0.55], which with FM-determined features becomes [0.39, 0.60]. These values, more or less, represent a random selection. With only WMC included as the input feature set, LR shows an enhancement in the AUC value of 0.07 in comparison to the classification performance with all features included, while with the eight features determined by the RS search algorithm the performance is reduced to 0.39, i.e., a ≃ 14% decrease. More importantly, for this version of Rave, the predictions carry no relevance, as the values are no better than a random selection. The AUC values determined for Rave 0.22.1-0.23 with all features included are found in the range [0.57, 0.70]. With feature selection, the range remains more or less similar, with a ≃ 2% decrease in the maximum value.
It may be noted that irrespective of the search algorithms, the maximum AUC value is determined by the Bagging classifier with a variance of ±2%.
The AUC performance of the defect prediction models built using features selected with ranking techniques is given in Table 4. For Click 2.0-2.1, with features selected using the various ranking techniques, the AUC values of the defect prediction models lie in the range [0.47, 0.71], against the range [0.54 (DT), 0.75 (RF)] obtained with all features included. The best performance emerges for AB and LB (0.71), with features ranked by the ReliefF algorithm. The AUC value is within ±7% variance in comparison with the models built using all the features for this version. For Click 2.1-2.2, the predictive performance of the models built using all the features resulted in AUC values ranging from 0.72 (J48) to 0.78 (AB, LB). However, with ranked features the AUC values are in the range [0.58, 0.77], with Bagging being the best performer with features selected by the CS, IG and FAE algorithms. The tree-based algorithms perform less effectively with the ranked features. For Click 2.2-2.3, the range of AUC with all features is [0.58 (DT), 0.82 (RF)]. With ranked features, both LB and AB reach an AUC value of 0.80 with ReliefF. For Rave 0.12-0.13, the AUC ranged over the interval [0.56 (DT), 0.67 (MLP, LR)]. The features selected using the CS, IG, SU and FAE techniques provided the best AUC value of 0.67 for MLP and LB. The AUC values for Rave 0.16-0.17 and Rave 0.20.1-0.21.1 were found in the ranges [0.50, 0.60] and [0.45, 0.55], respectively. For both data sets, the tree-based algorithms provided the lowest values, and the ensemble methods resulted in the best values. It is also found that for these two Rave data sets, the ranked features could not significantly enhance the AUC values with respect to the values obtained with the whole set of features. For Rave 0.16-0.17 the maximum enhancement in the AUC values is found to be 4%, which is obtained by MLP and LB with input features determined by SU.
On the other hand, ReliefF in combination with AB enhances the classifier performance by a similar percentage for the Rave 0.20.1-0.21.1 data set. For Rave 0.22.1-0.23, the AUC values were found to lie in the range [0.57 (DT), 0.70 (BAG)] with all features. With ranked features from the different algorithms, we find that there was either a decrease or no improvement in the AUC values, irrespective of the prediction model. The J48 method showed the poorest performance in the AUC with ranked features, with values decreased by 16% to 17%. On the other hand, the ensemble methods showed a relatively low variance in their AUC values with respect to those obtained when all features were considered. The Bagging method with the GR and ReliefF ranking techniques yielded the best AUC values for this data set. In short, we find that a well optimized AUC value for defect prediction appears to result from the ensemble models with features ranked by the ReliefF algorithm.
The AUC values with all and FR selected features for Click and Rave.
The AUC values across the seven versions of Apache Click and Rave were found to lie in the range 0.45 to 0.82. The best AUC values were obtained by Bagging and the lowest by J48. Using various search techniques in combination with correlation-based FM techniques, redundancies of 44.4% to 94.4% were found across the Apache versions. Correspondingly, the AUC values with reduced features were found to vary only within ±0.08 in comparison with all features. The AB technique yielded the best AUC value (0.81) with 66.7% feature redundancy. On the other hand, the classifier performance of the tree-based algorithms (DT and J48) was relatively poor. With FR, the feature redundancy is fixed at 77.8% by the threshold. The LB classifier in combination with ReliefF yielded the best AUC value (0.80). It may be noted that the classifier performances with ReliefF yielded the minimum variance in AUC. A variance as high as 0.13 was found for Gain Ratio and OneR. In general, we find an increase in AUC values with FM techniques, but not with FR. Also, we find that the classifiers which yield the best and the lowest AUC values using the various FM techniques remain more or less the same whether all features or the reduced features are used. However, for ranking techniques we find that the AUC values are sensitive to both the classifier and the FR technique.
RQ-3. Which among the various parametric, non-parametric and ensemble-based ML algorithms, emerge as the best predictive technique in identifying defect prone classes using FM and FR techniques and what is the comparative performance?
The Friedman test is conducted to confirm that the observed performance differences among the predictive models are not random. The null hypothesis is that all ML classifiers are equivalent and so their ranks should be equal. The inputs to the Friedman test are the data from Tables 3 and 4. While a total of 23 data sets is considered to evaluate the performance of the ML algorithms using the FM techniques, 31 data sets form the input for the FR techniques. The Friedman test yielded a test statistic of 10.04 for the nine ML algorithms with the FM techniques and 9.38 with the FR techniques. These values were compared with the critical value of the F-distribution with (8, 176) and (8, 240) degrees of freedom for the FM and FR techniques, respectively, which was found to be 1.99 [36]. Thus, the null hypothesis is rejected in both cases. The average ranks of each classifier, subjected to FM and FR techniques, are collected in Table 5. As evident from Table 5, the superiority of the Bagging technique is noteworthy for both FM and FR. The first five averaged ranks of the ML algorithms using FM are as follows: BAG > MLP > LR > RF > LB. For the models built using FR feature sets, the average ranks were in the following order: BAG > LB > RF > MLP > LR. Clearly, the same five methods occupy the top ranks regardless of whether FM or FR feature selection is used. Interestingly, the last four ML algorithms were also found to be the same for both techniques. The average ranks of DT and J48 were the highest, implying the poorest performance. Since the Friedman tests detected a difference in classifier performances, investigations were extended to the post-hoc Nemenyi analysis so as to identify which pairs of classifiers differ significantly.
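The ranking step behind this analysis can be sketched as follows. Rank 1 is given to the highest AUC within a data set and ties share the mean rank. The degrees of freedom quoted above match the F-distributed variant of the Friedman statistic (commonly attributed to Iman and Davenport), so the sketch assumes that variant. The AUC rows are toy values and the function names are hypothetical.

```python
def rank_row(aucs):
    """Rank classifiers within one data set: 1 = highest AUC, ties share the mean rank."""
    order = sorted(range(len(aucs)), key=lambda i: -aucs[i])
    ranks = [0.0] * len(aucs)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and aucs[order[j + 1]] == aucs[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for position in range(i, j + 1):
            ranks[order[position]] = mean_rank
        i = j + 1
    return ranks

def average_ranks(table):
    """Average rank of each classifier (column) over all data sets (rows)."""
    per_row = [rank_row(row) for row in table]
    return [sum(column) / len(table) for column in zip(*per_row)]

def iman_davenport(chi2, k, n):
    """F-distributed variant of the Friedman statistic and its degrees of freedom."""
    f_stat = (n - 1) * chi2 / (n * (k - 1) - chi2)
    return f_stat, (k - 1, (k - 1) * (n - 1))

toy_table = [[0.75, 0.54, 0.75],      # illustrative AUCs: rows are data sets,
             [0.60, 0.50, 0.70]]      # columns are three classifiers
print(average_ranks(toy_table))
print(iman_davenport(10.0, k=9, n=23)[1], iman_davenport(10.0, k=9, n=31)[1])
```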
Friedman analysis
Table 6 shows the outcome of the Nemenyi post-hoc tests for the classifier comparisons based on the FM and FR techniques, respectively. Note that when the difference between the averaged ranks of two classifiers is smaller than the CD, the difference in their performance is marked as insignificant. Using equation (2) with the critical value $q_\alpha$ = 3.10 (nine classifiers, α = 0.05), we find the CD value to be 2.51 for the FM and 2.15 for the FR techniques. Two classifiers whose rank difference exceeds the CD value are, by definition, significantly different, which on relative grounds helps one to discard the weaker classification method with respect to the best performing ones. Such differences are indicated with bold numerals in Table 6. It is evident that the classifiers with FM and FR inputs form two performance clusters, namely (i) BAG, MLP, LB, RF and LR, and (ii) DT and J48. However, it may be noted that NB in combination with FR techniques performs better than its FM counterpart. Moreover, the performance analysis of the fault prediction models on the Apache data sets, using nine classifiers employing FM and FR techniques, shows that the Bagging technique has better accuracy and precision. To sum up the whole of our analysis, we find that the ensemble-based methods yield better results than their non-parametric counterparts. The performance of LR being on par with the ensemble methods is surprising and needs further detailed investigation.
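The pairwise decision rule can be written compactly. The average ranks below are made-up placeholders, not the values of Table 5, and the function name is hypothetical. Only the CD of 2.51 is taken from the text.

```python
from itertools import combinations

def significant_pairs(avg_rank, cd):
    """Pairs of classifiers whose average ranks differ by more than the CD."""
    ordered = sorted(avg_rank, key=avg_rank.get)      # best (lowest) rank first
    return [(a, b) for a, b in combinations(ordered, 2)
            if abs(avg_rank[a] - avg_rank[b]) > cd]

# Illustrative average ranks only.
toy_ranks = {"BAG": 2.5, "MLP": 3.9, "J48": 7.2, "DT": 7.8}
pairs = significant_pairs(toy_ranks, cd=2.51)
print(pairs)
```

In this toy setting BAG and MLP are indistinguishable from each other, as are J48 and DT, while every cross-cluster pair differs significantly, which is the kind of two-cluster picture described above.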
Nemenyi pairwise comparison of the classifiers using features selected by FM (CD = 2.51) and FR (CD = 2.15)
There exist numerous studies which investigate the effectiveness of feature selection on the performance of defect prediction models. Shivaji et al. [21] explored the impact of FR and wrapper methods. The authors not only found a redundancy of 90%, but also that feature selection improved the prediction performance. This is consistent with our finding that a redundancy as high as 94% was obtained using FM techniques. We also find that the nature of the data set plays an important role, which is consistent with the findings of Song et al. [27]. Afzal et al. [22] also investigated various feature selection techniques, including FM and FR, and concluded that IG and ReliefF select fewer attributes without degrading the performance of the software prediction model. Our findings are similar for the FM methods, and ReliefF with Bagging performed best among the FR techniques. Yu et al. [24] also concluded that feature selection performs better than the other methods of their choice in improving defect prediction performance. He et al. [25] conducted an empirical study on software defect prediction with a simplified metric set using feature subset evaluation with GS. Their study concluded that NB performed better among the ML techniques and that CBO, LOC, LCOM, RFC and Ce were the top five OO metrics that could be used in defect prediction models. However, our results show WMC, NPM, LCOM3, DAM and CAM to be the most relevant features using FM, while FR finds WMC, RFC, LCOM3, AMC and CAM to be the most relevant. Our study validates the findings of Wang et al. and Gao et al. [43] that using FR methods had little significance for the final performance outcome. However, our findings suggest that, in comparison to FR, FM tends to show better performance in combination with ensemble-based ML techniques.
Different from the previous studies, this work empirically validates the effectiveness of various search techniques in combination with FM, FR and ML techniques.
Threats to validity
We emphasize that our results may not generalize, as they depend on a large number of project- and environment-specific variables. This is quite evident from the variance in the results we obtain despite both Click and Rave being Apache projects. This affirms major differences between software projects and indicates that defect indicators that work well in one project may be less useful in another [41, 42]. Another possible confounding factor is bias. The appropriate set of OO metrics may vary with the objective of the study. However, the ten-fold cross-validation and AUC methodology minimize bias and are therefore effective in strengthening our conclusions. Although the feature selection and the ML techniques have a concrete mathematical foundation, the quality of the data set is less well known. Hence, rigorous statistical analyses have to be performed, which we propose as future work. Also of significance is the selection of the nine classifiers. Our choices are guided by the objective of establishing a meaningful balance between well established, novel and popular techniques, which include statistical, non-parametric and ensemble-based methods. The validity of the conclusion is restricted to the outcomes given by the Friedman and Nemenyi post-hoc analysis.
Conclusion
An important step for software practitioners is to identify the optimal metrics set required for data collection, modeling and analysis. For this purpose, the most accepted methodology is feature selection, which eliminates redundancy. Among the many existing approaches, we study the defect prediction capability of the features selected using eight FM and seven FR techniques, which are fed as input to nine ML techniques. Our study, which comprises 945 defect prediction models, shows that redundancy in feature selection as high as 94% can be achieved in the Apache data sets, with little change in the performance index. In comparison to FR, FM tends to show better performance in combination with ensemble-based ML techniques. We also find that WMC, LCOM3 and CAM form a good basis for defect prediction in web applications, although we do not rule out the significance of other metrics. Interestingly, in accordance with the results of Lessmann et al. [4], we find that the AUC performance remains relatively insensitive to the feature selection. In FM, we find that RS yields a distinct feature set, while in FR ReliefF selects a unique feature set. The latter seems to be a better choice of ranking algorithm when used with the Bagging technique. Overall, we conclude that ML models based on ensemble methods in combination with features deduced using FM techniques yield a better framework for identifying faults in the Apache data sets. We also propose that a similar strategy may be adopted for projects similar in nature.
