Test effort estimation and prediction of traditional and rapid release models using machine learning algorithms

Abstract

Recently, many software companies have shifted from the traditional multi-month release cycle to shorter release cycles. The evolution and transition of release cycles may affect the test effort in a system. This paper analyses 25 traditional releases containing 1210 classes and 69 rapid releases containing 2616 classes of four Open Source Java systems. Correlations between 48 Object Oriented (OO) metrics and 2 test metrics were evaluated to identify the best indicators of test effort. The results show that (i) the correlation between OO and test metrics holds irrespective of the release model, (ii) the test effort required in Rapid Release (RR) models (shorter release cycles) is slightly higher than in Traditional Release (TR) models, and (iii) out of 18 machine learning algorithms, the instance-based algorithms IBK and K star, followed by Multi-Layer Perceptron (MLP) and additive regression, predict the test effort in classes accurately.

Keywords

Release cycles; machine learning; prediction; software metrics; test effort

1 Introduction

Agile methodologies like XP instituted the notion of shorter (faster, or rapid) release cycles and advocate their benefits for both companies and customers [13, 32]. Soaring market competition has forced many software companies to adopt shorter release cycles and release their products within a span of weeks or days [35]. Shorter release cycles offer many advantages to organizations as well as end users [32]. They enable faster customer feedback, thereby allowing companies to schedule their succeeding releases more easily, and they do not force the developer to complete an entire feature in one go. Components can be published in incremental releases, allowing the developer to pay more attention to quality assurance [26] and resulting in faster bug detection and correction [6]. Moreover, developers are not rushed to complete features because of an upcoming release date and can concentrate on quality assurance every 6 weeks instead of every couple of months. Customers also benefit, as they obtain new features, bug fixes and security updates faster. Consequently, shorter release cycles have been adopted in many software and embedded domains [14]. Mozilla Firefox migrated to the Rapid Release (RR) concept after facing huge competition from Google Chrome. It shifted from its Traditional Release (TR) model of one year for a major release to 6 weeks from version 5.0 [34]. A recent study [32] analyzed release patterns in the mobile domain and found that frequently updated mobile applications (i.e. shorter release cycles) on the Google Play store were highly favored by users, irrespective of their high update frequency. In fact, updated versions were accepted more quickly by users worldwide [26]. On the other hand, organizations lack time to stabilize their platforms, resulting in increased customer support costs due to frequent upgrades [5].

However, transitions to shorter release cycles may affect the test effort of an application. Therefore, an investigation in this domain would yield new facets in research.

2 Related work

With the rise of agile methodologies, more and more projects are shifting to faster and shorter release cycles [4, 34]. Otte et al. [35] analyzed Open Source projects and found that more than 50 percent of the projects published at least one release per month. Kuppuswami et al. [30] and Marschall [25] both evaluated the development effort required in shorter release cycles and found contradicting results. A similar trend was observed for bug fix time. While some studies reported faster bug fix times in shorter release cycles [5, 6], Baysal et al. [27] found the median time to fix bugs to be around 2 weeks faster in TR systems. The current work analyzes and compares the test effort required in TR and RR models using OO and test metrics. To the best of our knowledge, no study to date has analyzed release models in this way.

Bruntink and Van Deursen [21, 22] and Leon et al. [24] analyzed testability using OO metrics and found a positive relationship between OO and test metrics, but they did not use the results to predict test effort in classes. Aggarwal et al. [15] and Singh and Saha [37] used OO metrics to predict the test effort in classes using neural networks. Badri et al. [16, 17] investigated the relationship between a single OO metric, Lack of Cohesion in Methods (LCOM), and JUnit test cases and found positive results. Badri et al. [19] used OO metrics and predicted test effort using linear regression. Later, Toure et al. [7] analyzed unit test effort using 5 OO metrics and further applied Principal Component Analysis (PCA) to learn the orthogonal dimensions captured by their selected suite of unit test case metrics. In a recent study, Badri et al. [20] investigated test code size using use case metrics and 6 machine learning algorithms. Toure et al. [8] analyzed test metrics derived from the JUnit framework to identify test effort using 3 machine learning techniques (univariate logistic regression, univariate linear regression and multinomial logistic regression).

The current study analyzes testability using 48 OO metrics, analyzes their correlation and validates the results using non-parametric tests. The results are further used to predict test effort using 18 machine learning algorithms. The use of a wide range of OO metrics and machine learning algorithms gives the current study an edge over other published work in this field.

3 Experimental study design

This section presents the experimental study design followed in the paper.

3.1 Data collection

Initially, the Apache Software Foundation [2] was explored and projects written in the Java language were kept in the pool for consideration. Projects meeting the selection criteria (Section 3.2) were then shortlisted for the study. In order to eliminate bias and increase generalizability, Simple Random Sampling was applied and four datasets were finally selected for the analysis. The artifacts of the selected datasets were obtained from GitHub [9] and Jira [12] (the source and bug repositories respectively). The characteristics of the studied projects are presented in Table 1.

Table 1
Characteristics of projects studied

Projects	Category	Date of first release	Versions (TR)	Versions (RR)	Test classes (TR)	Test classes (RR)
Avro¹	Big-data	14/07/2009	3	14	31	93
Hive²	Database	30/04/2009	5	16	137	263
Jclouds³	Cloud	4/5/2011	9	21	758	1527
Zookeeper⁴	Database	13/11/2007	8	18	284	733
Total			25	69	1210	2616

1. www.avro.apache.org, 2. www.hive.apache.org, 3. www.jclouds.apache.org, 4. www.zookeeper.apache.org. Note: Words ‘Release’ and ‘Version’ are used synonymously in the study.

3.2 Selection criteria

The following criteria and assumptions were applied while selecting the projects:

Projects must be in active state and their last activity date must not be later than February 2017

Release cycles of the projects must have evolved from longer release cycles (fewer than six releases per year) to shorter cycles (more than six releases per year) somewhere in their lifecycle, so that the change can be studied

Projects must have a development lifecycle of more than five years

Projects with fewer than six releases* in a year were counted under the TR model phase, and projects with more than six releases in a year were considered to follow the RR model phase.

*The threshold of six releases/year is an attempt to set a standard for analysis and has been used in studies in this field [5, 6]. It can be adjusted if required for other analyses.

The study also calculated the time between releases for each dataset. In the box plot of Fig. 1, the line that divides each box into two marks the median time between releases. It can be observed from Fig. 1 that the time between releases is far higher for TRs than for RRs. The study also presents the evolution of the releases of the Zookeeper dataset over time in Fig. 2. Graphs of the other datasets can be obtained likewise.

Fig.1

Distribution of release cycle lengths (in days) for TRs and RRs, with the release model (TR or RR) on the X axis and days since previous release on the Y axis.

Fig.2

Evolution of releases in dataset ‘Zookeeper’.

3.3 Object oriented metrics

The study gathered 48 OO metrics for each version of the selected datasets. These metrics were extracted using an Open Source tool, IntelliJ IDEA [10]. This tool takes Java classes as input and produces OO metrics. Details about the tool and the metrics can be obtained from www.jetbrains.com. Table 2 presents the OO metrics used in the study.

Table 2
Object Oriented metrics used in the study

Type	Metrics
Size	LOC, STAT, SLOC, CLOC, Command, Jloc, Jf, Jm, Cons, Com_rat, Tcom_rat, TODO
Complexity	Osmax, Osavg, WMC, NOAC, Ocmax, Ocavg, Opavg, CSO, CSOA, CSA, Query, NAAC, NOIC, NOOC
Coupling	RFC, MPC, CBO
Inheritance	DIT, Dcy, Dcy^, Dpt, Dpt^, Cyclic, Level, Level^*, Pdcy, PDpt, Sub, Inner
Halstead	N, E, V, D, B, n
Cohesion	LCOM

3.4 Test metrics

The test metrics studied in this paper include the Number of Test Cases (NOTC). This metric is computed by counting the number of times a JUnit ‘assert’ method is invoked in the code of a test class. The JUnit framework provides the tester with multiple ‘assert’ methods such as ‘assertEquals’, ‘assertArrayEquals’, ‘assertTrue’, ‘assertFalse’, ‘assertSame’ and ‘assertNotSame’. These methods work as follows: the parameters passed to the method are checked for compliance with a condition that depends on the particular variant. For instance, ‘assertEquals’ tests whether its two parameters are equal. If the parameters do not meet the condition, the framework raises an exception indicating that the test has failed. Testers use the set of JUnit ‘assert’ methods to detect deviations from the expected behavior of the class, so counting the invocations of these methods helps in identifying the test cases written in the code. In Open Source projects, the naming convention followed by most projects for test classes is the Java class name followed or preceded by the word ‘test’, for instance AppendFileTest, QuorumTest, TestDriver. Consequently, out of all the Java classes, only these test classes were mined for ‘assert’ methods and kept for further analysis.
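The counting rule described above can be illustrated with a short Python sketch. It is only an approximation under stated assumptions: the regular expression `ASSERT_CALL`, the helper names `is_test_class` and `count_notc`, and the sample Java source are all hypothetical and not taken from the study, and the pattern does not skip asserts that appear inside comments or string literals.

```python
import re

# Assumed pattern: any call whose name starts with "assert" followed by an
# upper-case letter, e.g. assertEquals(...), Assert.assertTrue(...).
# It does not filter out comments or string literals.
ASSERT_CALL = re.compile(r"\bassert[A-Z]\w*\s*\(")


def is_test_class(class_name: str) -> bool:
    """Naming convention from the text: class name preceded or followed by 'Test'."""
    return class_name.startswith("Test") or class_name.endswith("Test")


def count_notc(java_source: str) -> int:
    """Count JUnit 'assert' invocations in the source code of one test class."""
    return len(ASSERT_CALL.findall(java_source))


# Hypothetical test class used only to exercise the sketch.
sample = """
public class QuorumTest {
    @Test
    public void testLeader() {
        assertEquals(3, quorum.size());
        assertTrue(quorum.hasLeader());
        Assert.assertNotSame(a, b);
    }
}
"""

if is_test_class("QuorumTest"):
    print(count_notc(sample))
```

A parser-based count would be more robust than a regular expression, but the sketch is enough to show how NOTC follows from the ‘assert’ invocations in a test class.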

3.5 Statistical analysis techniques

The current study implements non-parametric tests mentioned in Table 3 to validate the results.

Table 3
Brief description of statistical tests used in the study

Test	Brief description
Spearman’s Test [3]	It is a non-parametric test that measures the statistical dependency between two ranked variables
Friedman’s Test [23]	It is a non-parametric test to estimate the difference in distributions of more than two quantitative variables
Nemenyi Test [11, 31]	It is a post hoc test (advocated after Friedman’s Test) that makes pair-wise comparisons of the performance of various techniques

4 Framework for test effort estimation and prediction for TR and RR models

Figure 3 presents the proposed framework for test effort estimation and prediction. This framework has two major sections: one for test effort estimation for TR and RR models and the other for test effort prediction. These are discussed in Sections 4.1 and 4.2.

Fig.3

Framework for test effort estimation and prediction for TR and RR models.

4.1 Test effort estimation

Test effort estimation for the TR and RR models is obtained in the following manner:

Correlation: Identify the correlation between OO metrics and test metrics (LOC, NOTC) to gauge test effort.

Differentiation: Check for any difference in correlation of these metrics in TR and RR models.

Filtration: Filter out metrics (OO) best correlated with test metrics for both TR and RR models.

Categorization: Categorize these metrics to identify unit test effort. Repeat this step for TR and RR models individually. Compare the test effort of TR and RR models.

4.1.1 Correlation

The study starts the data analysis by calculating the association between the metrics using correlations. Correlations are helpful since they can reflect a predictive relationship between variables that can be exploited in practice. The study used Spearman’s correlation since the data are not normally distributed (as checked with the Shapiro-Wilk test [33]). The coefficient r_s is computed for the dependent variables NOTC and LOC against the 47 other independent variables gathered through IntelliJ IDEA [10]. The coefficient r_s ranges from -1 to +1: values closer to +1 indicate a positive correlation, values closer to -1 indicate a negative correlation, and values near zero indicate a poor correlation, i.e. a lack of association between the variables. The study also checked the p-values of the correlational results in order to confirm their significance. A significance level of 0.05 was assumed, and correlation coefficients with p-values > 0.05 were ignored. The null hypothesis thus created for Spearman correlation was

There exists no association (monotonic) among the two variables (Independent and Dependent)
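As a rough illustration of how r_s is obtained, the following Python sketch ranks both variables (giving tied values their average rank) and computes the Pearson correlation of the ranks. The function names and the sample values are hypothetical, the p-value check is omitted, and the study itself does not state which software computed the coefficient; in practice a library routine such as `scipy.stats.spearmanr`, which also returns a p-value, could be used instead.

```python
from math import sqrt


def average_ranks(values):
    """Rank values from 1..n; tied values receive the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # average of the 1-based positions i+1 .. j+1
        for pos in range(i, j + 1):
            ranks[order[pos]] = avg
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation; assumes neither input is constant."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def spearman(x, y):
    """Spearman's r_s: Pearson correlation of the rank-transformed data."""
    return pearson(average_ranks(x), average_ranks(y))


# Hypothetical per-class values of one OO metric (WMC) and the test metric NOTC.
wmc = [3, 10, 7, 25, 14]
notc = [1, 6, 4, 20, 9]
print(spearman(wmc, notc))
```

Because the coefficient depends only on ranks, any monotonically increasing relationship between the two metrics yields a value of +1, regardless of whether it is linear.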

In order to test the above-mentioned hypothesis, Spearman’s coefficient of correlation was calculated between each object oriented metric (independent variable) and the test metric NOTC and size metric LOC (dependent variables). This step was repeated for each TR and RR version of the selected datasets. The results depicted a strong positive relationship between the test metric and size metrics. Metrics such as STAT, SLOC, N, V, E, D and n had the highest correlation with test metric NOTC in both TR and RR models. In terms of unit testing, a larger class bears more parameters, attributes and methods, and a stronger correlation between NOTC and size metrics points towards a greater effort during testing. A class with higher coupling requires more testing than a class with lower coupling, since more classes depend on each other and such classes require larger test suites. The same holds for higher complexity. Coupling metrics MPC and RFC and complexity metrics WMC, OSmax, OSavg and NOAC were strongly correlated with NOTC, indicating that test effort would increase if the values of these metrics increase. Cohesion metric LCOM was moderately correlated with NOTC, while inheritance metrics were the least correlated (with NOTC) of the six categories of metrics (Table 2) referred to in this paper. Consequently, it may be assumed that inheritance, though related, is not a strong predictor of test effort in a class when compared to other metrics.

Similarly, size metrics and Halstead metrics had a strong positive correlation with LOC. Complexity metrics CSOA, CSO, OCmax, OCavg, OSmax, OSavg and NOAC and coupling metrics CBO and MPC also had a significantly positive correlation with LOC. Inheritance metrics Dcy, DIT, Level and Dpt were moderately correlated, with a mean correlational value of 3.57 and p-value < 0.05. The results are expected since all these metrics are derived from the lines of code of the class. A greater LOC would mean greater coupling, complexity, inheritance and size of the application. The results indicate that a strong correlation does exist among several variables, which leads the study to reject the null hypothesis and accept the alternative hypothesis that there exists an association between the independent and dependent variables.

4.1.2 Differentiation

The variability in correlation among metrics is assessed through the Friedman test. The Friedman test is a non-parametric test for differences between groups; it ranks the variables, giving the best performing variable rank 1, the next best rank 2 and so on [11]. For this study, the Friedman test allocates a mean rank to each metric based on its correlational values, which are obtained through Spearman correlation. The test then compares the ranks thus obtained and calculates the statistic using Equation 1 [29]: $\chi^{2} = \frac{12}{n k (k + 1)} \sum_{i = 1}^{k} R_{i}^{2} - 3 n (k + 1)$ (1)

where n = number of datasets, k = number of variables being ranked, and R_i = sum of the ranks allocated to the i-th variable over the n datasets.

The results will be checked at the 0.05 significance level and the null hypothesis for the test can be stated as follows:

There exists no significant difference between correlations of the metrics
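To make the Friedman statistic concrete, the sketch below ranks k variables within each of n datasets (rank 1 for the highest correlation), sums the ranks per variable and evaluates the statistic in its basic form without a tie correction. The study ran the test in SPSS; the function names and the small table of correlation values here are hypothetical and serve only as an illustration.

```python
def average_ranks(values):
    """Rank values from 1..n; tied values receive the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for pos in range(i, j + 1):
            ranks[order[pos]] = avg
        i = j + 1
    return ranks


def friedman_statistic(table):
    """table[d][v] holds the correlation of variable v on dataset d.

    Returns (chi-square statistic, mean rank per variable).
    Basic formula only; no correction for ties is applied.
    """
    n = len(table)      # number of datasets
    k = len(table[0])   # number of variables being ranked
    rank_sums = [0.0] * k
    for row in table:
        # Negate so that the largest correlation receives rank 1.
        for v, rank in enumerate(average_ranks([-value for value in row])):
            rank_sums[v] += rank
    chi_sq = 12.0 / (n * k * (k + 1)) * sum(r * r for r in rank_sums) - 3.0 * n * (k + 1)
    return chi_sq, [r / n for r in rank_sums]


# Hypothetical correlations of three metrics on four releases.
table = [
    [0.90, 0.50, 0.20],
    [0.80, 0.60, 0.10],
    [0.70, 0.40, 0.30],
    [0.95, 0.50, 0.25],
]
chi_sq, mean_ranks = friedman_statistic(table)
print(chi_sq, mean_ranks)
```

In this toy table every release ranks the three metrics in the same order, so the statistic reaches its maximum possible value of n(k - 1) = 8.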

The test was run in SPSS on the correlational results obtained between the OO metrics and the dependent test metric NOTC. The results of the TR phase and RR phase are presented separately in Tables 4 and 5 respectively. It may be observed that the significance values of the Friedman test for both TR (Table 4) and RR models (Table 5) were less than the threshold value, 0.05. Hence, the study rejects the null hypothesis and accepts the alternative hypothesis that there exists a significant difference between the correlational values of the shortlisted metrics. Post hoc analysis after the Friedman test [16] is advisable if the results obtained are significant. The Wilcoxon signed-rank test for pairwise comparison is commonly applied after the Friedman test, but it does not account for the family-wise error unless the Bonferroni correction is performed. Demsar [11] and Lessmann et al. [31], on the other hand, propose the Nemenyi test for pairwise comparison after the Friedman test since it takes care of the family-wise error as well. The Nemenyi test compares all the classifiers with each other. The performance of two classifiers is assumed to be significantly different if their mean ranks differ by at least the critical difference.

Table 4

Result of Friedman’s test on traditional releases

Variable	Mean Rank	Variable	Mean Rank	Variable	Mean Rank
LOC	4.69	CLOC	20.09	CSA	30.88
N	4.94	CSO	24.72	NOIC	30.88
E	5.66	RFC	25.63	Jm	31.22
STAT	6.19	Query	27.28	NOOC	31.69
V	7.03	Inner	27.69	Cons	31.75
SLOC	7.63	Dpt	28	OPavg	32
MPC	10.22	CBO	28.31	Level^*	32.19
D	10.44	Dpt^*	28.41	Level	32.31
WMC	13.56	Cyclic	28.63	SUB	32.38
OSmax	13.88	CSOA	28.88	Dcy	32.53
n	14.31	PDpt	29.31	Dcy^*	32.59
NOAC	14.88	OCmax	29.34	DIT	33.03
OSavg	16.03	JLOC	29.69	Jf	33.34
LCOM	17.63	Pdcy	29.81	COMRAT	33.72
Command	18.16	OCavg	30.44	Jf	30.76
B	18.56	NAAC	30.5	Tcom-rat	31.79
Test Statistics
N = 25, Chi square = 463, df = 47, Asymp. Sig. = 0

Table 5

Result of Friedman’s test on rapid releases

Variable	Mean Rank	Variable	Mean Rank	Variable	Mean Rank
LOC	4.94	Command	20.66	PDcy	31.8
SLOC	5.91	CSO	21.69	NOOC	32.3
STAT	6.78	CLOC	22.94	NOIC	32.3
N	7.81	CSOA	23.5	PDpt	32.5
D	8.5	CBO	23.81	NAIC	32.6
E	8.5	OCmax	25.41	OPavg	32.7
V	9.47	Dcy	27.28	Cons	33.1
MPC	9.78	Dcy^*	27.78	Dpt^*	33.5
WMC	12.72	Query	28.41	DIT	33.8
n	13.91	OCavg	28.41	Inner	34.1
OSmax	14.59	JLOC	29.09	Dpt	34.1
NOAC	15	Level	29.5	Cyclic	34.1
B	15.41	CSA	29.69	Jf	34.1
OSavg	16.56	Jm	29.88	COMRAT	35.9
RFC	17.72	Level^*	29.94	Sub	35.8
LCOM	18.06	NAAC	30.94	Tcom-rat	39.2
Test Statistics
N = 69, Chi square = 594.2, df = 47, Asymp. Sig. = 0

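This Critical Difference (CD) is computed by Equation 2: $CD = q_{\alpha} \sqrt{\frac{k (k + 1)}{6 n}}$ (2), where k is the number of variables being compared, n is the number of datasets and $q_{\alpha}$ is the critical value at significance level $\alpha$ [11].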

The CD of both the TR and RR models was computed and the Nemenyi test was applied at the 95% confidence level. If the difference between the mean ranks of two classifiers is less than the CD, it is assumed that there is no significant difference between them at the 95% confidence level; otherwise, the difference is significant [11]. The CD for TR and RR was observed as 3.967 and 2.388 respectively. In TR models, the metric LOC outperformed all other metrics in terms of correlation (with NOTC). N, E, STAT, SLOC and V stood next, but all of them were statistically indistinguishable from LOC, as observed in Fig. 5. Coupling metric MPC and complexity metrics WMC and NOAC not only ranked near the top but also showed a statistically significant difference. Metrics LOC, WMC, NOAC and MPC were finally observed as the statistically most distinguished and correlated set of metrics with test metric NOTC in both TR and RR models. Since the metrics ranked below the top 15 were less powerful in terms of correlation, the graphs displaying the results of the Nemenyi test show only the top 15 metrics; the graphs for RR and TR models are presented in Figs. 4 and 5 respectively.
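The comparison rule of the Nemenyi test can be sketched in a few lines of Python. The function names are hypothetical, and the critical value `q_alpha` is left as a parameter because the value used in the study is not stated. The sketch assumes k variables compared over n releases. Since the CD scales with 1/sqrt(n), the ratio of the two reported CDs (3.967 / 2.388, about 1.661) agrees with sqrt(69 / 25) for the 25 TR and 69 RR releases. The mean ranks in the usage lines are taken from Table 4 purely to illustrate the rule; the study may have re-ranked the top 15 metrics before applying the test.

```python
from math import sqrt


def critical_difference(q_alpha, k, n):
    """Nemenyi critical difference for k variables compared over n datasets."""
    return q_alpha * sqrt(k * (k + 1) / (6.0 * n))


def significantly_different(mean_rank_a, mean_rank_b, cd):
    """Two variables differ significantly if their mean ranks differ by at least CD."""
    return abs(mean_rank_a - mean_rank_b) >= cd


# The ratio of CDs depends only on the number of releases (25 TR vs 69 RR);
# q_alpha = 1.0 and k = 15 are placeholders that cancel out in the ratio.
ratio = critical_difference(1.0, 15, 25) / critical_difference(1.0, 15, 69)
print(round(ratio, 3))

# Illustration with the reported TR critical difference and Table 4 mean ranks.
cd_tr = 3.967
print(significantly_different(4.69, 4.94, cd_tr))   # LOC vs N
print(significantly_different(4.69, 13.56, cd_tr))  # LOC vs WMC
```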

Fig.4

Graph displaying results of Nemenyi Test for RR.

Fig.5

Graph displaying results of Nemenyi Test for TR.

Table 6

Mean correlational values of highly correlated OO metrics (only significant value (p-value < 0.05) considered)

		Avro				Hive				Jclouds				Zookeeper
		LOC		NOTC		LOC		NOTC		LOC		NOTC		LOC		NOTC
Metrics∖Release Models		TR	RR	TR	RR	TR	RR	TR	RR	TR	RR	TR	RR	TR	RR	TR	RR
Size	STAT	0.99	1.00	0.78	0.74	0.90	0.99	0.59	0.63	0.92	0.92	0.54	0.61	0.97	0.97	0.52	0.57
	SLOC	0.99	0.98	0.76	0.73	0.91	0.98	0.57	0.61	0.98	0.99	0.54	0.60	0.99	0.99	0.51	0.58
Coupling	MPC	0.81	0.72	0.71	0.56	0.69	0.76	0.71	0.63	0.65	0.64	0.49	0.54	0.55	0.68	0.49	0.57
	RFC	0.41	0.66	0.36	0.38	0.18	0.51	0.13	0.36	0.61	0.60	0.33	0.38	–	0.42	–	0.13
Complexity	OSmax	0.75	0.69	0.50	0.36	0.44	0.71	0.55	0.54	0.48	0.55	0.35	0.41	0.71	0.66	0.39	0.35
	OSavg	–	–	–	–	0.17	0.40	0.58	0.63	0.36	0.45	0.28	0.16	0.51	0.51	0.48	0.35
	WMC	0.94	0.96	0.76	0.59	0.81	0.81	0.53	0.28	0.86	0.85	0.50	0.58	0.89	0.88	0.49	0.45
	NOAC	0.80	0.92	0.72	0.00	0.81	0.82	0.36	0.24	0.82	0.81	0.52	0.58	0.72	0.68	0.48	0.44
Halstead	N	0.94	0.94	0.77	0.78	0.89	0.97	0.58	0.59	0.95	0.95	0.52	0.58	0.85	0.90	0.69	0.51
	E	0.90	0.95	0.78	0.77	0.88	0.95	0.54	0.57	0.94	0.94	0.52	0.60	0.85	0.89	0.66	0.49
	V	0.87	0.97	0.76	0.73	0.90	0.97	0.56	0.59	0.64	0.95	0.50	0.56	0.85	0.90	0.63	0.49
	D	0.76	0.90	0.80	0.76	0.81	0.90	0.32	0.54	0.89	0.88	0.51	0.62	0.82	0.86	0.65	0.50
	B	0.73	0.64	0.72	0.71	0.65	0.75	0.24	0.47	0.51	0.45	0.35	0.39	0.50	0.67	0.52	0.36
	n	0.75	0.91	0.87	0.57	0.83	0.90	0.35	0.43	0.78	0.79	0.29	0.26	0.80	0.83	0.37	0.42
Cohesion	LCOM	–	0.52	0.27	0.20	–	0.40	0.24	0.26	–	0.37	0.29	0.32	–	0.33	0.20	0.22

– Result was not statistically significant.

The results obtained after statistical validation using non-parametric tests indicate that the top correlated metrics (OO metrics with test metrics) were the same in both TR and RR models, indicating that the correlation between these metrics is insensitive to the frequency of release cycles.

4.1.3 Filtration

OO metrics most strongly correlated with test metrics were filtered out (obtained as a result of Friedman’s test) and will be referred to as the ‘reduced set’ in the study. Mean correlational values of the reduced set were calculated to further analyze their correlation with test metrics.

Since a rise in size (in terms of methods, parameters or attributes, or simply in terms of code) affects the test effort of an application, the correlation between test metrics and Halstead metrics was the highest. In fact, a larger application would require more test effort irrespective of the release cycles. Complexity metrics WMC, NOAC, OSmax and OSavg reflected a slightly stronger positive correlation in TR models as compared to RR models. This suggests that the complexity of an application may rise if releases are less frequent or, in other words, may reduce if releases are frequent. Consequently, an increased complexity will directly affect the test effort required. Coupling metric RFC, on the other hand, shows stronger results in RR models, indicating that frequent releases may increase coupling, but the other coupling metric MPC does not follow the trend. Therefore, coupling and release lengths cannot be linked in the analysis.

This study of correlation between test and OO metrics directs the study towards the best predictors of test effort in a class. The results thus obtained will be used to estimate the unit test effort of TRs and RRs in the next section.

4.1.4 Categorization

The results of the Nemenyi test highlight that LOC, MPC, WMC and NOAC are the statistically most significant set of OO metrics among the reduced set of 15 metrics. These four metrics and the test metric NOTC are used in this section to create five categories that help in assessing the test effort in TR and RR models. Using a mean value based approach, five conditions and five categories are created. Ranging from very low to very high, these categories label the classes of both TR and RR models. This approach has been implemented earlier in [19], but that work does not analyze the test effort for different release models. Moreover, the set of metrics used was different.

Based on five conditions mentioned in Table 7, five categories were summarized as follows:

Table 7
Shortlisted OO metrics

Metrics	Mean Values	Condition
LOC	mLOC	i. LOC>mLOC
NOTC	mNOTC	ii. NOTC>mNOTC
MPC	mMPC	iii. MPC>mMPC
NOAC	mNOAC	iv. NOAC>mNOAC
WMC	mWMC	v. WMC>mWMC

Category 1: Comprises classes that satisfy all the conditions, i.e. LOC greater than mean LOC (large size), NOTC greater than mean NOTC (large number of test cases), MPC greater than mean MPC (largely coupled), NOAC greater than mean NOAC (highly complex) and WMC greater than mean WMC. Such classes require a great amount of test effort and are therefore allocated to the ‘Very High’ test effort level and ranked under category 1.

Category 2: Comprises classes satisfying any four of the five mentioned conditions. Such classes are allocated category 2 and kept under test effort level ‘High’.

Category 3: Comprises classes satisfying any three of the five mentioned conditions. Such classes are allocated category 3 and kept under test effort level ‘Medium’.

Category 4: Comprises classes satisfying any two of the five mentioned conditions. Such classes are allocated category 4 and kept under test effort level ‘Low’.

Category 5: Comprises classes satisfying any one of the conditions, indicating lesser complexity, coupling or code and therefore requiring lesser effort to test. Such classes are kept under test effort level ‘Very Low’ with category 5.
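The five conditions and categories can be expressed as a short Python sketch. Several details are assumptions rather than statements from the study: the means are taken over whatever list of classes is passed in (the text does not say whether means are computed per release or per dataset), classes that satisfy none of the conditions are returned as `None` because the categories above do not cover them, and the function name and sample values are hypothetical.

```python
from statistics import mean

METRICS = ("LOC", "NOTC", "MPC", "NOAC", "WMC")

# Number of satisfied conditions -> (category, unit test effort level).
LEVELS = {
    5: (1, "Very High"),
    4: (2, "High"),
    3: (3, "Medium"),
    2: (4, "Low"),
    1: (5, "Very Low"),
}


def categorize(classes):
    """classes: list of dicts mapping each metric name to its value for one class.

    A condition is satisfied when the class value exceeds the mean of that
    metric over the given classes. Returns one (category, level) per class,
    or None for a class that satisfies no condition (not covered by the text).
    """
    means = {m: mean(c[m] for c in classes) for m in METRICS}
    labels = []
    for c in classes:
        satisfied = sum(1 for m in METRICS if c[m] > means[m])
        labels.append(LEVELS.get(satisfied))
    return labels


# Hypothetical metric values for three classes.
sample_classes = [
    {"LOC": 500, "NOTC": 40, "MPC": 30, "NOAC": 20, "WMC": 50},
    {"LOC": 100, "NOTC": 5, "MPC": 4, "NOAC": 3, "WMC": 8},
    {"LOC": 300, "NOTC": 9, "MPC": 20, "NOAC": 6, "WMC": 14},
]
print(categorize(sample_classes))
```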

This categorization, summarized in Table 8, helps in identifying the effort required in writing and testing unit test cases. Each class of the TRs and RRs was checked and categorized based on the conditions met. Fig. 6 represents the percentage of classes of both TRs and RRs falling under each category. It can be observed that, though the trend of distribution differs across datasets, on average test effort levels seem to go slightly up in RRs compared to TRs. Jclouds was the only dataset where the classes falling in categories 1 and 2 were considerably higher in TRs than in RRs. This pattern was investigated and it was found that Jclouds had the highest LOC and the maximum number of TRs and RRs, resulting in a rise of classes in categories 1 and 2. Considering the results, it can be assumed that the test effort of releases depends a lot on the size of the application, and a bigger application will always require greater effort irrespective of its release cycles. Further, it was also observed that the classes falling into categories 1 and 5 were relatively fewer compared to the other categories, indicating that fewer classes require very high or very low effort in both release models.

Fig.6

Categorization of classes into levels of test.

Table 8

Test effort categorization into five classes

Cat.	Classes	Unit test effort
1	All 5 conditions satisfied	Very High
2	Any 4 conditions satisfied	High
3	Any 3 conditions satisfied	Medium
4	Any 2 conditions satisfied	Low
5	Any 1 condition satisfied	Very low

Overall, however, the analysis draws the results in favor of TRs, and it can be stated that TR models require relatively less test effort than RR models.

4.2 Test effort prediction

Prediction is a computationally hard problem. Machine learning techniques can quickly and efficiently produce solutions to such problems and have therefore been used extensively by studies in this domain [1, 37–38]. This work uses 18 machine learning algorithms to predict test effort in classes. These include seven function classifiers, two tree based classifiers, two inductive rule based classifiers, four meta classifiers and three lazy classifiers. A brief description of these algorithms is presented in Table 9. The test effort prediction was made in the following three ways:

Table 9
Brief description of machine learning techniques used in the study

Machine learning classifier	Abbreviation	Brief description
Gaussian Processes	GP	It is a probabilistic model that defines a distribution of random variables over functions with a continuous domain
Isotonic Reg.	IR	It searches for a monotonic (never decreasing) function that minimizes the mean squared error on the data
Multiple Linear Reg.	MLR	It describes the relationship between independent and dependent variables by analyzing the fit of the data to an equation of the form Y = a + bX
MLP	MLP	It belongs to the family of feedforward artificial neural networks, containing more than one layer between its input and output layers
Pace Reg.	PR	It analyzes the effect of each variable and provides better results than LR
Simple linear Reg.	SLR	When MLR has just one predictor variable, it is referred to as SLR
SMO	SMO	It solves the optimization problem encountered while training a support vector machine
IBK	IBK	It is an instance-based learner that uses the k nearest training instances for prediction
K Star	K^*	It is an instance-based classifier that uses an entropy-based measure
LWL	LWL	Often known as ‘just in time’ learning, it is an instance-based classifier that predicts by approximating a local model around the current point of interest
Additive Reg	AR	It makes predictions by accumulating the predictions of each estimator and improves the base regression technique
Bagging	GP	It trains each classifier of the ensemble on a random redistribution of the training set
RandomSubSpace	RS	It trains the estimators on random subsets of the features, rather than on the entire feature list, to reduce correlation
Reg by discretization	RD	It trains a classifier on a copy of the data in which the class attribute has been discretized
Decision Table	DT	It is a multi-flow supervised classifier
M5Rules	M5R	It uses a separate-and-conquer technique to produce decision lists for regression models
Decision Stump	DS	It is a one-level decision tree that predicts using a single input feature
M5P	M5P	It is a restructured form of Quinlan’s M5 algorithm that induces trees for regression models
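
As an illustration of the simplest entry in the table, the equation Y = a + bX with a single predictor can be fitted by ordinary least squares. The Python sketch below is a minimal illustration and not the WEKA implementation used in the study; the function name and the sample values are hypothetical.

```python
def fit_simple_linear(x, y):
    """Least-squares estimates (a, b) for the model Y = a + b*X."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two equally long samples with at least 2 points")
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # Sum of squared deviations of the predictor.
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    if sxx == 0:
        raise ValueError("predictor is constant; slope is undefined")
    # Sum of cross deviations between predictor and response.
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    b = sxy / sxx
    a = mean_y - b * mean_x
    return a, b


# Hypothetical values: one OO metric (x) against one test metric (y).
a, b = fit_simple_linear([1.0, 2.0, 3.0], [3.0, 5.0, 7.0])
print(f"Y = {a:.2f} + {b:.2f} * X")
```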

Using all OO metrics (All)

Using the highest correlated OO metrics (Reduced set)

Using all except reduced set (Remaining set)

This three-way process not only helps in prediction but also helps to assess the predictive capability of these metrics. The Coefficient of Correlation (CCOFF) and the residual error, in terms of Root Mean Squared Error (RMSE), were calculated (using WEKA [36]) for the selected datasets and are reported in Table 10. Residual errors arise from the difference between the actual known responses and the predicted responses. Models with high CCOFF and low residual errors are desired for efficient prediction. It was observed that:

Instance-based lazy learners IBK and K Star were the most accurate in predicting test effort, with CCOFF close to 1 and RMSE less than 5 (using the reduced set). Because lazy learners compute the target function individually for every query, such instance-based learners can generate quality results for datasets containing few features.

In MLP, the existence of more than one layer between the input and the output enables complex calculations and enhances the adaptive capability of the network to learn functions. The cross-validation results for MLP were therefore best (CCOFF more than 0.9 and RMSE less than 10) in all the datasets and for all the metric sets.

AR accumulates the prediction results of each estimator to improve the base regression technique, resulting in efficient predictions. However, the mean squared error of this algorithm increased for the Remaining set of the Hive and Jclouds datasets, thereby making it less accurate than MLP. Other techniques such as GP, LWL, PR and M5P gave good predictions, but their mean squared errors were considerably higher than those of IBK, K*, MLP and AR. RD gave the worst results among all the selected machine learning techniques.
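
The two measures behind these observations can be computed directly from actual and predicted responses. The Python sketch below assumes that CCOFF is the Pearson correlation between actual and predicted values; the function names and the sample values are illustrative and are not taken from the study.

```python
import math


def rmse(actual, predicted):
    """Root mean squared error between known and predicted responses."""
    if len(actual) != len(predicted) or not actual:
        raise ValueError("inputs must be non-empty and of equal length")
    squared = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(sum(squared) / len(squared))


def ccoff(actual, predicted):
    """Pearson correlation between known and predicted responses."""
    n = len(actual)
    if n != len(predicted) or n < 2:
        raise ValueError("need two equally long samples with at least 2 points")
    mean_a = sum(actual) / n
    mean_p = sum(predicted) / n
    cov = sum((a - mean_a) * (p - mean_p) for a, p in zip(actual, predicted))
    var_a = sum((a - mean_a) ** 2 for a in actual)
    var_p = sum((p - mean_p) ** 2 for p in predicted)
    if var_a == 0 or var_p == 0:
        raise ValueError("correlation is undefined for a constant sample")
    return cov / math.sqrt(var_a * var_p)


# Hypothetical test-effort values for four classes.
actual = [10.0, 20.0, 30.0, 40.0]
predicted = [12.0, 18.0, 33.0, 39.0]
print("CCOFF:", round(ccoff(actual, predicted), 3))
print("RMSE:", round(rmse(actual, predicted), 3))
```

A model is preferred when the first value is high and the second is low, which is the criterion applied to the results in Table 10.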

Table 10

Test effort prediction using machine learning algorithms for the selected datasets in the study

		AVRO						HIVE
		ALL		REDUCED SET		REMAINING SET		ALL		REDUCED SET		REMAINING SET
		CCOFF	RMSE	CCOFF	RMSE	CCOFF	RMSE	CCOFF	RMSE	CCOFF	RMSE	CCOFF	RMSE
FUNCTIONS	GP	0.98	5.93	0.83	7.45	0.98	5.94	0.98	24.36	0.94	24.15	0.97	25.22
	IR	0.87	5.23	0.86	6.52	0.82	7.07	0.92	18.56	0.92	18.56	0.73	32.66
	MLR	0.92	4.82	0.79	7.21	0.82	4.45	0.97	10.23	0.95	15.23	0.85	24.94
	MLP	0.99	1.37	0.99	2.36	0.99	1.81	0.99	7.04	0.97	11.02	0.97	16.33
	PR	0.88	10.17	0.84	12.58	0.67	15.06	0.98	8.59	0.94	15.48	0.78	29.66
	SLR	0.79	7.58	0.79	7.56	0.68	9.56	0.81	27.56	0.81	27.49	0.73	32.19
	SMO	0.96	3.44	0.95	7.56	0.93	2.34	0.95	14.09	0.92	19.00	0.78	30.66
LAZY	IBK	1.00	3.01	1.00	2.36	1.00	2.21	1.00	7.66	1.00	8.99	1.00	9.21
	K^*	1.00	3.25	1.00	2.34	1.00	2.23	1.00	6.96	1.00	9.63	1.00	9.87
	LWL	0.96	3.32	0.95	3.96	0.96	5.25	0.92	19.56	0.89	21.58	0.90	22.69
META	AR	0.99	1.15	0.99	2.36	0.99	1.15	0.98	9.34	0.98	8.69	0.90	21.56
	GP	0.76	9.82	0.81	8.31	0.73	9.33	0.82	31.78	0.80	32.66	0.77	34.59
	RS	0.87	9.24	0.90	7.14	0.82	10.32	0.98	32.87	0.93	29.66	0.76	39.44
	RD	0.94	3.36	0.95	3.84	0.88	29.66	0.72	32.25	0.48	41.28	0.46	39.86
RULES	DT	0.84	6.55	0.90	5.33	0.85	6.55	0.93	17.36	0.86	21.99	0.99	6.23
	M5R	0.84	6.68	0.94	5.25	0.86	6.23	0.86	23.11	0.82	26.33	0.80	28.79
TREES	DS	0.82	7.12	0.75	8.69	0.81	7.19	0.86	24.16	0.86	24.07	0.73	32.26
	M5P	0.84	6.25	0.89	7.23	0.82	6.23	0.86	23.03	0.82	26.44	0.80	27.48
		JCLOUDS						ZOOKEEPER
		ALL		REDUCED SET		REMAINING SET		ALL		REDUCED SET		REMAINING SET
		CCOFF	RMSE	CCOFF	RMSE	CCOFF	RMSE	CCOFF	RMSE	CCOFF	RMSE	CCOFF	RMSE
FUNCTIONS	GP	0.96	12.02	0.91	13.10	0.58	20.60	0.96	7.60	0.95	13.96	0.89	11.88
	IR	0.90	10.71	0.90	10.70	0.32	22.68	0.68	13.82	0.67	14.52	0.68	13.82
	MLR	0.93	9.13	0.91	10.38	0.49	21.19	0.80	11.35	0.74	14.52	0.65	14.38
	MLP	0.98	5.06	0.97	8.52	0.57	20.14	0.96	8.15	0.93	15.26	0.93	7.87
	PR	0.93	9.07	0.91	10.29	0.46	21.63	0.85	10.17	0.74	12.58	0.62	15.06
	SLR	0.79	14.79	0.79	14.79	0.31	23.09	0.51	16.28	0.51	16.28	0.40	17.38
	SMO	0.81	15.90	0.83	14.96	0.46	21.80	0.76	12.91	0.57	16.64	0.49	17.67
LAZY	IBK	1.00	6.98	1.00	5.69	0.92	7.43	1.00	7.58	1.00	6.58	0.99	0.98
	K^*	1.00	5.69	1.00	4.56	0.90	6.39	1.00	6.35	1.00	6.33	0.99	0.80
	LWL	0.86	12.31	0.87	12.03	0.57	20.28	0.91	7.68	0.75	12.55	0.68	13.99
META	AR	0.96	6.42	0.96	6.46	0.66	18.51	0.88	8.95	0.87	11.90	0.83	9.76
	GP	0.69	20.36	0.72	20.38	0.59	20.36	0.79	14.05	0.72	15.71	0.69	15.69
	RS	0.80	19.03	0.75	19.31	0.51	23.16	0.77	16.58	0.67	17.63	0.51	18.08
	RD	0.83	13.51	0.82	13.89	0.37	22.57	0.63	14.71	0.63	14.71	0.44	17.00
RULES	DT	0.72	16.99	0.60	19.48	0.47	21.45	0.40	17.47	0.79	11.53	0.40	17.41
	M5R	0.91	9.95	0.92	9.63	0.80	14.71	0.82	10.96	0.66	14.29	0.49	16.52
TREES	DS	0.81	14.24	0.81	14.24	0.33	22.97	0.65	14.04	0.64	15.36	0.65	14.04
	M5P	0.91	9.59	0.92	9.57	0.61	19.47	0.80	12.20	0.75	14.29	0.73	13.21

It can also be observed from Table 10 that prediction using the reduced set of 15 metrics was as good as prediction using all 48 OO metrics together. Furthermore, prediction using the remaining set of 33 metrics was found to be significantly poorer. This signifies the importance and predictive capability of the reduced set of metrics.
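
The three-way comparison can be illustrated with a small instance-based learner. The Python sketch below is a plain k-nearest-neighbour regressor in the spirit of IBK, evaluated by leave-one-out on different column subsets. It is not WEKA's implementation and does not reproduce the study's validation setup; the data, the column indices and the helper names are hypothetical.

```python
import math


def knn_predict(train_x, train_y, query, k=1):
    """Average the targets of the k training rows closest to the query."""
    ranked = sorted(
        range(len(train_x)),
        key=lambda i: math.dist(train_x[i], query),
    )
    nearest = ranked[:k]
    return sum(train_y[i] for i in nearest) / len(nearest)


def leave_one_out_rmse(rows, targets, columns, k=1):
    """RMSE of k-NN predictions when each row is held out in turn."""
    # Keep only the selected metric columns.
    projected = [[row[c] for c in columns] for row in rows]
    squared_errors = []
    for i in range(len(projected)):
        train_x = projected[:i] + projected[i + 1:]
        train_y = targets[:i] + targets[i + 1:]
        prediction = knn_predict(train_x, train_y, projected[i], k)
        squared_errors.append((targets[i] - prediction) ** 2)
    return math.sqrt(sum(squared_errors) / len(squared_errors))


# Hypothetical data: columns 0 and 1 track the target, column 2 does not.
rows = [
    [1.0, 2.0, 9.0],
    [2.0, 4.0, 1.0],
    [3.0, 6.0, 7.0],
    [4.0, 8.0, 2.0],
    [5.0, 10.0, 8.0],
    [6.0, 12.0, 3.0],
]
targets = [10.0, 20.0, 30.0, 40.0, 50.0, 60.0]

metric_sets = {"All": [0, 1, 2], "Reduced set": [0, 1], "Remaining set": [2]}
for name, columns in metric_sets.items():
    print(name, round(leave_one_out_rmse(rows, targets, columns), 2))
```

In this toy data the error of the uninformative column is larger than that of the informative columns, which mirrors the kind of contrast the three-way process is meant to expose.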

5 Threats to validity

Conclusion validity: This threat deals with the statistical validity of the results. The use of non-parametric tests such as Friedman’s test, the Nemenyi test and Spearman’s test reduces bias, since they are free from distributional preconditions. The threshold value α = 0.05, a widely accepted cut-off, further ensures confidence in the results obtained in the study.

Construct validity: The OO metrics studied in the paper are extensively researched and established metrics. Moreover, they have been computed using the IntellijIdea tool [10]. This reduces the threat for the independent variables. The dependent variable has been calculated using the JUnit framework, which assures the accuracy of measurement of this variable too.

Internal validity: The ‘causal effect’ of each OO metric on testability is difficult to establish and is beyond the scope of this study. A threat to internal validity therefore exists in the study.

External validity: The datasets used in the study belong to Open Source repositories, which allow free access and ease of replication. The results obtained in the study are, furthermore, consistent with the findings of earlier studies on release models and test effort prediction. However, to increase generalizability, more studies on varied platforms are advisable.

6 Conclusions

Evolution and transition of release cycles may affect the test effort required in a system. To analyze this, the test effort required in Traditional Release (TR) models was compared with the test effort required in Rapid Release (RR) models. For this, 25 traditional releases containing 1210 classes and 69 rapid releases containing 2616 classes were analyzed. On investigating the correlation between 2 test metrics and 48 Object Oriented (OO) metrics of both TR and RR models individually, it was observed that the OO metrics highly correlated with test metrics remain the same for both models and do not vary with the changing release cycles. However, the test effort was found to be slightly higher in RR models, indicating that shorter release cycles might require more test effort in the system. In addition, we empirically identified the top 15 OO metrics that were highly correlated with test metrics and used them to predict test effort.

The results of the shortlisted set of 15 OO metrics were as good as the results of all 48 OO metrics taken together, signifying the predictive capability of the shortlisted metrics. It was also observed that the instance-based machine learning algorithms IBK and K Star gave the best results, followed by Multi-Layer Perceptron (MLP) and Additive Regression, for test effort prediction in classes. However, the study suggests a deeper analysis on larger datasets before generalizing the results.

References

1. Kaur and Kaur, Statistical comparison of modelling methods for software maintainability prediction, International Journal of Software Engineering and Knowledge Engineering 23(6) (2013), 743–774.

2. Apache Software Foundation (ASF), http://www.apache.org/ .

3. Croux and Dehon, Influence functions of the Spearman and Kendall correlation measures, Statistical Methods and Applications 19(4) (2010), 497–515.

4. Gamma, Agile, open source, distributed, and on-time inside the eclipse development process, International Conference on Software, (Keynote) 2005.

5. Khomh, Adams, Dhaliwal and Zou, Understanding the impact of rapid releases on software quality, Empirical Software Engineering 20(2) (2015), 336–373.

6. Khomh, Dhaliwal, Zou and Adams, Do faster releases improve software quality? An empirical case study of Mozilla Firefox, Proceedings of the 9th Working Conference on Mining Software Repositories 2012, pp. 179–188.

7. Toure, Badri and Lamontagne, A metrics suite for JUnit test code: A multiple case study on open source software, Journal of Software Engineering Research and Development 2(1) (2014), 14–26.

8. Toure, Badri and Lamontagne, Predicting different levels of the unit testing effort of classes using source code metrics: A multiple case study on open-source software, Innovations in Systems and Software Engineering 14(1) (2018), 15–46.

9. Github, http://ww.github.com .

10. IntellijIdea, https://www.jetbrains.com/idea .

11. Demsar, Statistical comparisons of classifiers over multiple data sets, Journal of Machine Learning Research 7 (2006), 1–30.

12. Jira, http://www.jira.atlassian.org .

13. Beck, Extreme programming explained: Embrace change, Addison-Wesley Professional, 2000.

14. Stewart K.J., Darcy D.P. and Daniel S.L., Observations on patterns of development in open source software projects, ACM SIGSOFT Software Engineering Notes 30(4) (2005), 1–5.

15. Aggarwal K.K., Singh, Kaur and Malhotra, Empirical study of object-oriented metrics, Journal of Object Technology 5(8) (2006), 149–173.

16. Badri, Badri and Toure, An empirical analysis of lack of cohesion metrics for predicting testability of classes, International Journal of Software Engineering and Its Applications 5(2) (2011), 69–86.

17. Badri, Badri and Toure, Exploring empirically the relationship between lack of cohesion and testability in object-oriented systems, International Conference on Advanced Software Engineering and Its Applications 2010, pp. 78–92.

18. Badri and Toure, Empirical analysis of object-oriented design metrics for predicting unit testing effort of classes, Journal of Software Engineering and Applications 5(7) (2012), 513–524.

19. Badri, Toure and Lamontagne, Predicting unit test effort levels of classes: An exploratory study based on multinomial logistic regression modeling, Procedia Computer Science 62 (2015), 529–538.

20. Badri, Badri, Flageol and Toure, Investigating the Accuracy of Test Code Size Prediction using Use Case Metrics and Machine Learning Algorithms: An Empirical Study, International Conference on Machine Learning and Soft Computing 2017, pp. 25–33.

21. Bruntink and Van Deursen, An empirical study into class testability, Journal of Systems and Software 79(9) (2006), 1219–1232.

22. Bruntink and Van Deursen, Predicting class testability using object-oriented metrics, International Workshop on Source Code Analysis and Manipulation (2004), pp. 136–145.

23. Friedman, A comparison of alternative tests of significance for the problem of m rankings, The Annals of Mathematical Statistics 11(1) (1940), 86–92.

24. Leon, Van Deursen, Zaidman and Bruntink, On the Interplay Between Software Testing and Evolution and its Effect on Program Comprehension, (2008), 173–202.

25. Marschall, Transforming a six month release cycle to continuous flow, in Proceedings of the Conference of AGILE 2007, pp. 395–400.

26. Mäntylä M.V., Adams, Khomh, Engström and Petersen, On rapid releases and software testing: A case study and a semi-systematic literature review, Empirical Software Engineering 20(5) (2015), 1384–1425.

27. Baysal, Davis and Godfrey M.W., A tale of two browsers, in Proceedings of the 8th Working Conference on Mining Software Repositories 2011, pp. 238–241.

28. Petersen and Wohlin, A comparison of issues and advantages in agile and incremental development between state of the art and an industrial case, Journal of Systems and Software 82(9) (2009), 1479–1490.

29. Malhotra, An empirical framework for defect prediction using machine learning techniques with Android software, Applied Soft Computing 49 (2016), 1034–1050.

30. Kuppuswami, Vivekanandan, Ramaswamy and Rodrigues, The effects of individual XP practices on software development effort, ACM SIGSOFT Software Engineering Notes 28(6) (2003), 6–6.

31. Lessmann, Baesens, Mues and Pietsch, Benchmarking classification models for software defect prediction: A proposed framework and novel findings, IEEE Transactions on Software Engineering 34(4) (2008), 485–496.

32. McIlroy, Ali and Hassan A.E., Fresh apps: An empirical study of frequently-updated mobile apps in the Google play store, Empirical Software Engineering 21(3) (2016), 1346–1370.

33. Shapiro S.S. and Wilk M.B., An analysis of variance test for normality, Biometrika 52(3) (1965), 591–611.

34. Shankland, Google ethos speeds up Chrome release cycle, 2010, http://cnet.co/wlS24U .

35. Otte, Moreton and Knoell H.D., Applied quality assurance methods under the open source development model, in Proceedings of the 32nd Annual IEEE International Computer Software and Applications Conference (COMPSAC) 2008, pp. 1247–1252.

36. Waikato Environment for Knowledge Analysis, www.cs.waikato.ac.nz/ml/weka/ .

37. Singh and Saha, Predicting testability of eclipse: A case study, Journal of Software Engineering 4(2) (2010), 122–136.

38. Zhou, Leung, Song, Zhao, Lu, Chen and Xu, An in-depth investigation into the relationships between structural metrics and unit testability in object-oriented systems, Science China Information Sciences 55(12) (2012), 2800–2815.

Table 1
Characteristics of projects studied

Projects	Category	Date of first release	Versions (TR)	Versions (RR)	Test Classes (TR)	Test Classes (RR)
Avro¹	Big-data	14/07/2009	3	14	31	93
Hive²	Database	30/04/2009	5	16	137	263
Jclouds³	Cloud	4/5/2011	9	21	758	1527
Zookeeper⁴	Database	13/11/2007	8	18	284	733
Total			25	69	1210	2616

Spearman’s Test [3]	It is a non-parametric test that measures the statistical dependence between two ranked variables
Friedman’s Test [23]	It is a non-parametric test to estimate the difference in distributions of more than two quantitative variables
Nemenyi Test [11, 31]	It is a post hoc test (advocated after Friedman’s Test) that makes pair-wise comparison of performance of various techniques
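
To make the first of these tests concrete, Spearman's coefficient can be obtained as the Pearson correlation of rank-transformed data, with tied values sharing the mean of their ranks. The Python sketch below is a minimal illustration; the function names and the metric values are hypothetical and are not the study's data.

```python
import math


def average_ranks(values):
    """Rank values from 1 to n, giving tied values the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    pos = 0
    while pos < len(order):
        end = pos
        # Extend the run while the next value equals the current one.
        while end + 1 < len(order) and values[order[end + 1]] == values[order[pos]]:
            end += 1
        mean_rank = (pos + end) / 2 + 1
        for j in range(pos, end + 1):
            ranks[order[j]] = mean_rank
        pos = end + 1
    return ranks


def spearman(x, y):
    """Spearman's rho: Pearson correlation of the rank-transformed samples."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two equally long samples with at least 2 points")
    rx, ry = average_ranks(x), average_ranks(y)
    n = len(rx)
    mean_x, mean_y = sum(rx) / n, sum(ry) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(rx, ry))
    var_x = sum((a - mean_x) ** 2 for a in rx)
    var_y = sum((b - mean_y) ** 2 for b in ry)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sample")
    return cov / math.sqrt(var_x * var_y)


# Hypothetical values of one OO metric and one test metric for five classes.
oo_metric = [120, 340, 90, 560, 210]
test_metric = [15, 48, 12, 80, 20]
print("Spearman rho:", spearman(oo_metric, test_metric))
```

Because only ranks are used, the coefficient is insensitive to the scale of either metric, which is why no normality assumption is needed.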

Table 7
Shortlisted OO metrics

Metrics	Mean Values	Condition
LOC	mLOC	i. LOC>mLOC
NOTC	mNOTC	ii. NOTC>mNOTC
MPC	mMPC	iii. MPC>mMPC
NOAC	mNOAC	iv. NOAC>mNOAC
WMC	mWMC	v. WMC>mWMC
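
Each condition in Table 7 compares a class's metric value with the mean value of that metric. The Python sketch below only reports which conditions hold for a class; how the conditions are combined into categories is not restated here, so no category labels are assigned. The class records and helper names are hypothetical.

```python
SHORTLISTED = ("LOC", "NOTC", "MPC", "NOAC", "WMC")


def metric_means(classes):
    """Mean value of each shortlisted metric over all classes (mLOC, ...)."""
    return {m: sum(c[m] for c in classes) / len(classes) for m in SHORTLISTED}


def satisfied_conditions(cls, means):
    """Names of the metrics for which the class exceeds the mean value."""
    return [m for m in SHORTLISTED if cls[m] > means[m]]


# Hypothetical metric values for two classes.
classes = [
    {"name": "A", "LOC": 100, "NOTC": 2, "MPC": 10, "NOAC": 1, "WMC": 5},
    {"name": "B", "LOC": 300, "NOTC": 6, "MPC": 30, "NOAC": 3, "WMC": 25},
]
means = metric_means(classes)
for cls in classes:
    print(cls["name"], satisfied_conditions(cls, means))
```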