Abstract
Traditional mutation testing is a powerful technique to evaluate the quality of test suites. Unfortunately, it is not yet widely used due to the problems of a large number of generated mutants, limited realism (mutants not necessarily reflect real software defects), and equivalent mutants problem. Higher order mutation (HOM) testing has been proposed to overcome these limitations of first order mutation testing. We present an empirical evaluation of our approach to higher order mutation testing. We apply different multi-objective optimization algorithms (including one modified by us), as well as our classification of HOMs, proposed objectives and fitness functions. We search for “High Quality and Reasonable HOMs” able to replace all of its constituent FOMs without scarifying test effectiveness and to reflect complex defects requiring more than one change to correct them. Our approach leads to: 1) reduced cost of mutation testing due to reduced number of HOMs, 2) harder to kill mutants (which mimic harder to find defects), 3) reduced cost of mutation testing as it does not waste resources for creating easy-to-kill mutants. Furthermore, we establish a relevant upper bound on mutation order in higher order mutation testing and thus reduce the cost of mutation even further.
Keywords
Introduction
Mutation testing (called also First Order Mutation Testing (FOMT)) is used to evaluate the fault detection capability of the test suits by inserting changes into the original program to generate mutants, and then checking whether the given set of test cases is good enough to detect the difference between original program and its mutants or not.
In mutation testing, we can use Mutation Score (MS) or Mutation Score Indicator (MSI) to evaluate quality of a set of test cases. MS is the ratio of killed mutants to the difference between all generated mutants and equivalent mutants [1, 4], while MSI is the ratio of killed mutants to all generated mutants [12–14, 16]. Value of above mentioned scores lies between 0 and 1. A low score means that the majority of faults cannot be detected accurately by the test set. A higher score indicates that most of the injected faults have been identified with this particular test set. When score is close to 1, we can say that the given set of test cases is a good one. If score is zero, there is no test case that can kill the mutants.
Although Mutation Testing (MT) has been introduced as an automated technique to assess the quality of the test cases, the problem is a large number of generated first order mutants (FOMs). Furthermore, most of mutants are simple, often easily to detect and do not denote realistic faults [8, 21]. Besides, one of mutation testing problems is the generation of too many equivalent mutants [5, 21], which have the same semantic meaning as the original program under test.
There are many different approaches which have been proposed for overcoming the problems of MT [21] including second order mutation testing [16, 30] and higher order mutation testing [5–7, 11] in general. The main goal of higher order mutation testing is to generate higher order mutants (HOMs) that can be used to improve the effectiveness of mutation testing. Instead of using only one simple change as traditional mutation testing, higher order mutation testing uses more complex changes to generate mutants by applying two or more mutation operators. Higher Order Mutation Testing (HOMT), an idea of Jia and Harman [5, 7], not only is the unique approach able to address all of the three mutation testing problems simultaneously, but also is the only one to deal with the problem of realism of injected defects as there is an empirical evidence that real defects require more than one change to fix them [29].
Still there is little research in the field of applying multi-objective optimization algorithms to search for valuable HOMs, e.g., Strongly Subsuming and Coupled HOMs [5, 7], although this is a promising approach. It allows to find higher order mutants that represent more realistic complex faults, which are harder to kill than first order mutants. There are three papers, by Harman et al. [5] and Langdon et al. [10, 11], which focused on multi-objective higher order mutation testing with genetic programming. Haider et al. [3] proposed fuzzy based optimization approach in combination with all path coverage criterion to safely reduce a test suite to a single solution and they found that the approach significantly reduces the test suite to a precise test suite.
In this paper, we investigate applying multi-objective optimization algorithms for finding “High Quality and Reasonable HOMs” (one of 11 HOM categories in our HOMs classification described in Section 2). Our main goal is to carry out an empirical evaluation of the proposed approach, which can be used to construct difficult to kill and more realistic HOMs, especially “High Quality and Reasonable HOMs”, as well as to reduce the number of generated higher order mutants. We want to bring out some useful findings for applying multi-objective optimization algorithms in the area of higher order mutation testing for overcoming the limitations of traditional mutation testing. As a result, it could be used to improve the mutation testing effectiveness in general.
The rest of the paper is organized as follows. Section 2 presents the landscape of the reported research on higher order mutation testing applying multi-objective optimization algorithms. The kinds of higher order mutants based on our HOMs classification also are included in this section. Section 3 describes the posed research questions that the study will answer. The next section is used to shortly present the selected algorithms, programs under test and supporting tool. Section 5 shows the experimental results needed to answer the posed research question as well as to bring out some useful findings. The last section presents conclusions, discussions of threats to validity and proposition of future works.
Background
As we mentioned in Section 1, still there is little research on applying multi-objective optimization algorithms in higher order mutation testing.
In fact, there are many optimization problems, which have more than one objective function and the objective functions are conflicting, to some extent, preventing simultaneously the simple optimization of each objective. Hence, multi-objective optimization algorithms have been devised for solving optimization problems and making the decisions that satisfy multiple objectives.
Harman et al. [5] and Langdon et al. [10, 11] are the first authors who focused on applying multi-objective optimization algorithm with genetic programming in HOMT. In order to produce better mutants, they suggested inserting “semantically close” faults instead of inserting “syntactically close” faults to the original program under test. They then applied a multi-objective optimization algorithm (NSGAII) with genetic programming in the area of higher order mutation testing and the obtained results demonstrated that this approach is able to find harder to kill higher order mutants that represent more realistic complex faults.
In our previous works [22–25], we proposed a new classification of HOMs to cover all of the available cases of generated HOMs, as well as we described our objectives and fitness functions. Our HOMs classification consists of eleven kinds of HOMs which are illustrated in Table 1. The notations are explained below: H1: Live (potentially equivalent) Mutant) H2: Non-Quality, Un-Reasonable and With New TCs H3: Non-Quality, Un-Reasonable and With Mixed TCs H4: Non-Quality, Reasonable and With New TCs H5: Non-Quality, Reasonable and With Mixed TCs H6: Non-Quality, Reasonable and With Old TCs H7: H8: Quality, Reasonable and With Mixed TCs H9: Quality, Reasonable and With Old TCs H10: Quality, Un-Reasonable and With Old TCs H11: Quality, Un-Reasonable and With Mixed TCs
We performed the empirical evaluation of multi-objective optimization algorithms in HOMT. Different from Langdon et al. [5, 11], who have suggested directly inserting more than one fault into a program under test to generate higher order mutants, we introduced another method as follows. First, we generate the set of available first order mutants and then generate valuable higher order mutants by combining two or more first order mutants.
Our experimental results are as follows: the total number of generated HOMs is small (about 30% in comparison to number of FOMs), the mean ratio of reasonable HOMs (subsuming HOMs) to all found HOMs is about 54%, while the mean ratio of “High Quality and Reasonable HOMs” (strongly subsuming and coupled HOMs) to all found reasonable HOMs (subsuming HOMs) is fairly high (around 8%) [23].
It is worth emphasizing that higher order mutation testing in general and the presented approach in particular is able to address all of the three main problems of mutation testing (a large number of generated mutants, limited realism of artificial defects and equivalent mutant problem) simulataneously [20]. In this paper, we build upon our previous research [22–25] and collect empirical evidence to answer new research questions (e.g., RQ1, RQ2, RQ3.2, RQ3.5) posed in Section 3 on a basis of extended number of programs under test (e.g., in [23] we used three programs under test, while in this paper we use five, see Table 2). These answers are essential to extend our understanding of higher-order mutation testing as a viable alternative to traditional first order mutation testing.
Research questions
We pose the following Research Questions (RQs) that the study will answer.
We split RQ3 into 5 sub-questions to answer: RQ3.1. What are the ratios of the number of HOMs in the identified mutant categories to all generated HOMs? The goal of the RQ3.1 is to investigate the number (distribution) of HOMs in the identified mutant categories. RQ3.2. What are the ratios of “not good” HOMs that are generated? “Not good” HOM mean “Non-quality and Un-reasonable HOM” (H2 and H3) which are easier to kill than their constituent FOMs and there is lack of test cases that can kill simultaneously HOM and its constituent FOMs. RQ3.3. What are the ratios of the number of HOMs in the mutant order to all generated HOMs? By means of this question, we want to obtain the percentage of generated HOMs in the mutant order. RQ3.4. What are the ratios of “High Quality and Reasonable HOMs” (H7) to all generated HOMs in the mutant order? This question focuses on a kind of mutants being of special interest. The aim is to obtain the frequency of generating “High Quality and Reasonable HOMs” (H7). Such mutants not only reflect harder to kill mutants, which reflect realistic, complex faults but also could be used to replace all of its constituent FOMs. RQ3.5. What are the ratios of “live (potentially equivalent) mutants” (H1) to all generated HOMs in the mutant order? Live mutants can be “really-equivalent mutants” or “difficult-to-kill mutant”. They cannot be killed by the given set of test cases which are included in the selected project under test (but they perhaps could be killed by some new test cases). Reduction of Live (potentially equivalent) mutants leads to reduced mutation testing execution cost of the given set of test cases on those live mutants. Answering this question (together with the answering the question RQ3.4) may help to find the relevant order of higher order mutation testing.
By the means of RQ5, we want to evaluate which of the proposed approaches using the set of first order mutants to generate higher order mutants is better suited to drive development of new test cases by using MSI and, as a result, to improve the software quality.
Algorithms, projects under test and supporting tool
Five selected multi-objective optimization algorithms [23] are briefly described below.
There are five selected Projects Under Test (PUTs): BeanBin, Barbecue, JWBF, CommonsChain 1.2 and CommonsValidiator 1.4.1, which are five real-world, open source projects downloaded from SourceForge (https://sourceforge.net/). Table 2 shows the projects selected for the experiment along with their number of classes (NOC), lines of code (LOC) and number of test cases (NOT). For each project under test, we run the HOMT process 5 times for each selected algorithm.
Our supporting tool is Judy [16, 17], a mutation testing tool, which has been written in Java and for Java. Judy is easy to configure and use. It allows to configure and launch mutation testing via command line in Windows, Linux and Mac OS X. Judy supports a large set of mutation operators for first order mutation testing, as well as higher order mutation testing such as: mutants generation, mutants execution and mutation analysis. Judy also supports build-in cluster for distributed computations. The list of some mutation operators available in Judy is presented in [16, 17] as well as in [23].
Results and analysis
Answer to RQ1
Table 3 shows the mean number of FOMs which were generated by applying all Judy operators to produce mutants, and the mean number of total generated HOMs (from the set of generated FOMs) for different PUTs by applying the selected multi-objective optimization algorithms.
The first step of our experimental procedure is to generate all possible FOMs, and our purpose is to compare the number of generated FOMs with the number of generated HOMs. From the set of FOMs, valuable HOMs were generated and evaluated by applying multi-objective optimization algorithms, guided by our objectives and fitness functions. According to the results which are presented in Table 3, The total number of generated HOMs of order 2 to 15 was reduced to about 22% of the number of FOMs. While, according to results of Langdon et al. [10, 11], the number of generated HOMs grows exponentially with order. This is because we use the proposed objectives and fitness functions to focus on constructing the valuable (high quality and reasonable) HOMs instead of generating all possible HOMs.
Our approach leads to reduced cost of mutation testing by reducing the number of HOMs.
The mean values of mutation score indicator (MSI) in our experimentation and in Langdon et al.’s [10, 11] experimentation after higher order mutation testing execution are presented in Table 4.
The experimental results indicate that our approach seems to be better than the approach of Langdon et al. [10, 11] in terms of generating difficult-to-kill higher order mutants by applying multi-objective optimization algorithm. This is because the number of live (potentially equivalent) mutants, which cannot be killed by the given test suite (albeit could be killed by new and high quality test cases), in our approach (about 32%) is much larger than in Langdon et al.’s (about 1%).
Our approach leads to hard to kill mutants, which mimic harder to find defects and drive development of higher quality test cases.
The results presented in Fig. 1 indicate that the majority of generated HOMs are
The mean proportion of
According to the experimental results (see Fig. 1), the mean percentage of Non-quality and Un-reasonable HOMs (H2 and H3, which are easier to kill than their constituent FOMs and there are no any test cases which can kill them and simultaneously all their constituent FOMs) is not large, around 5.5% in total.
Our approach do not waste computational resources for creating mutants, which are easy to kill by most test cases, and addresses the problem of cost of mutation testing.
Table 5 shows the mean ratios of number of HOMs (H1-H11) of a particular order (2-15) to all generated HOMs (Column 2) as well as the mean ratios of High quality - Reasonable HOMs (H7) and live HOMs (H1) to all produced HOMs in each individual order (Column 3 and 4). The results indicate that generally for lower orders, the number of generated HOMs is larger than for higher orders.
Obtained empirical results also show that we can find many “High Quality and Reasonable HOMs” (H7) from the 2nd - order to the 5th - order. For the 6th - order, as well as for higher orders, generated H7 are rare. There is lack of H7 in many cases (see Table 5 and Fig. 2). As a result, we may conclude that higher order mutation up to the 5th - order can be rewarding wrt. searching for H7 by applying multi-objective optimization algorithms. While the ratio of Live (potentially equivalent) mutants (H1) to total number of HOMs is large, 22% to 55%, for every order (see Table 5 and Fig. 2).
Five (5) is a relevant upper bound on mutation order in higher order mutation testing as the ratio of “High Quality and Reasonable HOMs” (H7) to total number of HOMs is relatively high for orders between 2 and 5. This ratio is low, close to zero, for orders higher than 5, while the ratio of live (and potentially equivalent) mutants to the total number of HOMs is large for every order. Our finding suggests that a possible way to reduce the cost of mutation is by a limited order of HOMs.
Figure 3 shows the percentage of “High Quality and Reasonable HOMs” (H7) compared with total generated HOMs and the number of Reasonable HOMs (H4-H9). According to the results of five applied algorithms, the mean ratio of H7 to all found Reasonable HOMs (H4-H9) is about 8%. This number is relatively high because the proportion of all Reasonable HOMs to all generated HOMs is quite a large (see Table 6), while the ratios of the number of H1 to the total number of HOMs are more or less the same for five algorithms.
This indicates that we can find the mutants that are harder to kill and more realistic (reflecting real, complex faults) than FOMs by applying multi-objectives optimization algorithm. Especially “High Quality and Reasonable HOMs” can be used to replace the set of their constituent FOMs without lost of testing effectiveness. In addition, our experimental results indicate that the mean ratio of the number of TCs which kill generated HOMs to the number of TCs which kill their constituent FOMs is about
Our approach leads to reduced cost of mutation testing, increases realism of mutants, generates harder-to-kill mutants than FOMs and drives development of higher quality test cases at the same time.
According to our experimental results, the eMOEA algorithm is the best in terms of constructing Reasonable HOMs (H4-H9) (see Table 6), while the eNSGAII algorithm is the best for searching for “High Quality and Reasonable HOMs”. Approximately 12% of Reasonable HOMs, which were found by eNSGAII algorithm, are classified as “High Quality and Reasonable HOMs” (see Fig. 3).
In 2004, McConnell, who has been two-time winner of the Software Development Magazine Jolt Award, came to the following conclusions based on the results of his study [18]: “Industry average experience is about 1–25 errors per 1000 lines of code for delivered software” and “The Applications Division at Microsoft experiences about 10–20 defects per 1000 lines of code during in-house testing and 0.5 defects per 1000 lines of code in released product”. Hence, we may speculate that in the complete versions of the software projects, a single line of code rarely has more than one error. From that, we propose an approach to modify the multi-objective optimization algorithm applied to construct higher order mutants. Instead of creating a random initial list of HOMs (parent population) from list of FOMs, we create an initial list of HOMs by combining n FOMs guided by the rule “apply no more than one mutation operator to each line of code”. We choose the eNSGAII algorithm to modify, because it is the best algorithm in terms of constructing the “High Quality and Reasonable HOMs”. The modified algorithm is named
Results of the empirical comparison of two algorithms (eNSGAII and eNSGAII-DiffLOC) are shown in Table 7. The notations are explained below: NoM is the ratio of the number of generated HOMs to the number of FOMs. NoT is the ratio of the number of TCs which kill generated HOMs to the number of TCs which kill their constituent FOMs. NoR is the ratio of generated “reasonable HOMs” to all generated HOMs. NoH7 is the ratio of H7 (“High Quality and Reasonable HOMs”) to all generated HOMs. NoH1 is the ratio of H1 (live (potentially equivalent) HOMs) to all generated HOMs.
The obtained results indicate that eNSGAII-DiffLOC is slightly better than the original eNSGAII algorithm in terms of mutant reduction (slightly smaller NoM of eNSGAII-DiffLOC), generates harder-to-kill mutant (slightly smaller NoT and bigger NoR of eNSGAII-DiffLOC) and is able to construct “High Quality and Reasonable HOMs” (slightly higher NoH7 of eNSGAII-DiffLOC).
On the other hand, slightly larger number of Live (potentially equivalent) mutants (NoH1) can be seen as a disadvantage, due to the fact that H1 mutants include, to some extent, equivalent mutants. However, H1 mutants include also “difficult-to-kill mutants”, which are valuable.
The proposed eNSGAII-DiffLOC seems to be slightly better than original eNSGAII algorithm in terms of mutant reduction, generating harder-to-kill mutant and constructing “High Quality and Reasonable HOMs”.
To answer the RQ5, we have performed an empirical study, in which HOMs are generated in three ways. In the first way (HOMT1), HOMs are generated by combining FOMs from the set of all generated FOMs. In the second way (HOMT2), we delete first live FOMs from set of generated FOMs, and then create HOMs by combining FOMs from the set of killed FOMs. And the last (HOMT3), first delete all of easy-to-kill FOMs, which were killed by all of given TCs, from set of generated FOMs, then create HOMs by combining FOMs from the remaining set of FOMs (named set of not-easy-to-kill FOMs). Then we use mutation score indicator (MSI) as the indicator of usefulness of higher order mutation in driving development of TCs. The live mutants, which cannot be killed by the given test suite, make up a significant part of generated mutants and may drive the development of new test cases. The experimental results are shown in Table 8, in which FOMT is implementation of first order mutation testing.
The results presented in Table 8 indicate that the given sets of TCs of PUTs have lower MSI in first order mutation testing. It means that there are many live FOMs and the given sets of TCs are not good enough to detect the difference between original program and their mutants and, therefore, need to be improved following the results of mutation analysis based on the FOMT strategy. The numbers of live FOMs makes up from 53% to 87% of generated mutants. Only a small number of FOMs were killed by the given sets of TCs. In the case of live FOMs, we have to check whether the live FOMs are equivalent mutants or not, but it often involves additional human effort [16]. If mutants are not equivalent, developers or testers need to create new TCs, as in Test-Driven Development (TDD) or Continuous Test-Driven Development (CTDD) [15], and check whether they are able to kill live FOMs or not. If live FOMs are equivalent mutants, TCs, which can kill them, do not exist. The most striking result is that the HOMT2 strategy appeared to be useless as it gives a false impression that TCs are of high quality (MSI is equal or close to 100%) and the usefulness of HOMT2 is strongly limited, i.e., opportunities of test case improvement guided by results of HOMT2 mutation analysis are rare if any. Almost all of higher degree mutants, which were constructed by combining the killed FOMs, are also killed. This indicates that, combining first order killed mutants to create higher degree mutants is not a good way to evaluate and improve the quality of given set of test cases because the generated HOMs are easy to kill. The HOMT1 and HOMT3 strategies seem to be better and offer more opportunities to improve the quality of given set of test cases, as MSI (and the number of killed mutants) decreased in comparison to HOMT2.
We should not use first order killed mutants to create difficult (but possible) to kill higher order mutants. Furthermore, using not-easy-to-kill mutants to generate higher order mutants seems to be a promising method to improve the quality of TCs.
In this paper, we have performed the experimental evaluation of the effect of applying multi-objective optimization algorithms in the area of higher order mutation testing based on our proposed HOMs classification, objectives and fitness functions using Judy mutation testing tool for Java and open source projects. The results indicate that our approach can be used to construct the difficult to kill and more realistic HOMs, especially “High Quality and Reasonable HOMs”, as well as to reduce the number of generated HOMs. As a result, the approach can be used to improve the mutation testing effectiveness in general.
We have empirically evaluated our approach to HOMT presented some findings, which can be useful from the point of view of the effectiveness of mutation testing. Our approach can lead to reduced cost of mutation testing, can produce hard to kill mutants (which mimic hard to find defects and drive development of high quality test cases), suggests that there is an upper bound on the order of mutants, which allows to reduce the cost of mutation even further and do not waste computational resources for creating mutants, which are easy to kill by most test cases. The proposed approach, being a HOMT approach, also addresses the problem of realism of mutants. eNSGAII-DiffLOC, being a modified version of eSNGAII, seems to be a good starting point for further research and development of algorithms, which are better in terms of mutant reduction, harder-to-kill mutant generation and “High Quality and Reasonable HOMs” construction.
Further research is recommended because using only 5 selected projects under test (PUTs) may not be a representative sample of all Java programs. In addition, the obtained results may also depend on programming language, applied mutation tool, mutation operators and algorithms. Beside to this, the number of analyzed projects under test is too small to derive more firm conclusions using, even using robust statistical methods [9]. Hence, further research should focus on collecting more data. Furthermore, the obtained differences in the analyzed projects under test are small so new modifications and algorithms should be searched for and empirically evaluated. It would be also interesting to combine the proposed approach with a technique that utilizes a data-flow analysis to decrease the number of mutation points and consequently to reduce the number of higher order mutants [2], as well as to compare the outcomes of software reliability models applied to tests [27] with mutation testing.
