Abstract
BACKGROUND:
Software quality prediction models play a crucial role in identifying vulnerable software components during the early stages of development, thereby optimizing resource allocation and enhancing overall software quality. While various classification algorithms have been employed for developing these prediction models, most studies have relied on default hyperparameter settings, leading to significant variability in model performance. Tuning the hyperparameters of classification algorithms can enhance the predictive capability of quality models by identifying settings that improve accuracy and effectiveness.
METHOD:
This systematic review examines studies that have utilized hyperparameter tuning techniques to develop prediction models in the software quality domain. The review covers diverse areas such as defect prediction, maintenance estimation, change impact prediction, reliability prediction, and effort estimation, as these domains demonstrate the wide applicability of common learning algorithms.
RESULTS:
This review identified 31 primary studies on hyperparameter tuning for software quality prediction models. The results demonstrate that tuning the parameters of classification algorithms enhances the performance of prediction models. Additionally, the study found that certain classification algorithms exhibit high sensitivity to their parameter settings, achieving optimal performance when tuned appropriately. Conversely, certain classification algorithms exhibit low sensitivity to their parameter settings, making tuning unnecessary in such instances.
CONCLUSION:
Based on the findings of this review, the study concludes that the predictive capability of software quality prediction models can be significantly improved by tuning their hyperparameters. To facilitate effective hyperparameter tuning, we provide practical guidelines derived from the insights obtained through this study.
Introduction
Software systems have experienced unprecedented growth in size and complexity, presenting significant challenges in terms of building high quality software while minimizing costs [34]. To address these challenges, the early prediction of defect-prone components, maintenance requirements, effort estimation, and reliability has become crucial. Predictive modeling, a technique that generates models to forecast future outcomes, has gained popularity in software engineering as a means to address these needs [43]. By analyzing historical and current data, predictive modeling extracts predictive rules that can be applied to future data, enabling effective planning and resource allocation. An illustration of predictive modeling in software engineering is given in Fig. 1. Various classification/learning algorithms, including statistical methods, machine learning, and evolutionary algorithms, are utilized for developing these models [36, 14].
Software quality prediction modeling.
The performance of predictive models depends on several factors, and researchers have been actively working to improve their effectiveness. One crucial factor influencing model performance is the choice of hyperparameter values in the learning algorithms used to build the prediction models [7]. Hyperparameters influence the behavior and performance of the learning algorithm. For example, the hyperparameters of a neural network include the learning rate (
It has been observed that different parameter values used during the construction of classifiers can result in significant variance in performance [7]. To address this issue, researchers have explored parameter tuning techniques to identify the optimal settings within the parameter space of classification algorithms. By fine-tuning the hyperparameters, researchers aim to minimize variance and improve the performance of classifiers. Parameter tuning involves systematically exploring different combinations of hyperparameter values to find the most suitable configuration. Several studies have demonstrated that models developed using optimal parameter settings exhibit improved performance compared to those built with default parameter values.
The problem of hyperparameter tuning has received significant attention from researchers, leading to a variety of proposed approaches. One widely adopted approach is Grid Search, which evaluates the model for every combination of a predefined set of hyperparameter values [21]. It is a brute-force method that exhaustively searches the specified grid and identifies the best configuration according to a chosen evaluation metric. Another popular technique is Random Search, which randomly samples hyperparameter values from predefined ranges [12]. Each configuration is drawn independently, and the search space is sampled over a fixed number of iterations. Random Search does not steer towards promising regions; its advantage is that, for the same evaluation budget, it tries more distinct values of each hyperparameter than a grid does, so it often finds optimal or near-optimal configurations more quickly when only a few hyperparameters strongly affect performance. This makes it a favorable choice when computational resources are limited.
Bayesian Optimization is another noteworthy approach that combines prior knowledge with observed performance to guide the search for optimal hyperparameters [15]. It fits a probabilistic model of the unknown performance function and uses it to suggest promising regions for further exploration. Evolutionary Algorithms, such as Genetic Algorithms, mimic the process of natural selection to iteratively search for optimal hyperparameter configurations [10, 21]. These algorithms maintain a population of candidate solutions and use genetic operators, such as crossover and mutation, to generate new configurations with potentially improved performance.
More recently, machine learning-based approaches have gained attention. These techniques use models such as Artificial Neural Networks or Gaussian Processes [30] to capture the relationship between hyperparameters and performance. By learning from previous evaluations, these approaches can predict promising configurations and guide the search towards better solutions. In addition to these general approaches, researchers have proposed hybrid methods that combine multiple techniques [21, 44]. For example, the combination of Grid Search with local search methods, such as Gradient Descent or Simulated Annealing, can provide a balance between exhaustive exploration and fine-grained optimization.
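The contrast between Grid Search and Random Search can be illustrated with a minimal sketch. It assumes a hypothetical `validation_score` function that stands in for training a model and scoring it on validation data; the hyperparameter names (`learning_rate`, `depth`), their ranges, and the shape of the score function are illustrative assumptions, not values taken from any primary study.

```python
import itertools
import random


def validation_score(learning_rate, depth):
    """Hypothetical stand-in for training a model and scoring it on validation data.

    The peak is placed at learning_rate=0.1, depth=6 purely for illustration.
    """
    return 1.0 - (learning_rate - 0.1) ** 2 - 0.01 * (depth - 6) ** 2


def grid_search(grid):
    """Evaluate every combination of the listed values and keep the best one."""
    names = list(grid)
    best_params, best_score = None, float("-inf")
    for values in itertools.product(*(grid[name] for name in names)):
        params = dict(zip(names, values))
        score = validation_score(**params)
        if score > best_score:
            best_params, best_score = params, score
    return best_params, best_score


def random_search(ranges, n_iter, seed=0):
    """Draw n_iter independent configurations uniformly from the given ranges."""
    rng = random.Random(seed)
    best_params, best_score = None, float("-inf")
    for _ in range(n_iter):
        params = {
            "learning_rate": rng.uniform(*ranges["learning_rate"]),
            "depth": rng.randint(*ranges["depth"]),
        }
        score = validation_score(**params)
        if score > best_score:
            best_params, best_score = params, score
    return best_params, best_score


grid = {"learning_rate": [0.01, 0.1, 0.5], "depth": [2, 6, 10]}
ranges = {"learning_rate": (0.01, 0.5), "depth": (2, 10)}

grid_best, grid_score = grid_search(grid)                # 3 x 3 = 9 evaluations
rand_best, rand_score = random_search(ranges, n_iter=9)  # same budget of 9
print("grid:", grid_best, round(grid_score, 4))
print("random:", rand_best, round(rand_score, 4))
```

With the same budget of nine evaluations, the grid only ever tries three distinct values per hyperparameter, whereas the random sampler tries up to nine. Which of the two finds the better configuration depends on the problem and on how the grid is chosen.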
Overall, the field of hyperparameter tuning has witnessed significant advancements, leading to the development of various techniques and approaches. Each approach offers distinct advantages and trade-offs in terms of search efficiency, adaptability to different problem domains, and robustness to noise or limited computational resources. Selecting an appropriate hyperparameter tuning approach depends on the specific requirements and constraints of the given problem.
The primary goal of the research is to conduct a comprehensive analysis of hyperparameter tuning techniques employed for software quality prediction models. Specifically, we aim to:
Identify and review studies that have applied parameter tuning techniques in the development of prediction models in software quality-related domains.
Evaluate the effectiveness of parameter tuning in improving the performance of prediction models.
Analyze the impact of parameter tuning on the performance of various learning algorithms used in software quality prediction.
Provide guidelines and recommendations for practitioners and researchers regarding the application of parameter tuning techniques to enhance the predictive capability of software quality prediction models.
The subsequent sections of this research paper are structured as follows: Section 2 presents the research questions that guide our systematic review and outlines the criteria used for selecting primary studies. Section 3 presents the outcomes of the selected studies and answers the research questions of the study. Section 4 provides practical guidelines and recommendations for practitioners and researchers. Section 5 discusses the limitations of the research, while Section 6 concludes the paper by summarizing the key findings and implications of our research.
This systematic review follows the established guidelines for systematic reviews [4, 32], and the process is depicted in Fig. 2. The review began with the identification of the need for a systematic review, followed by the formulation of research questions based on the underlying motivations. A comprehensive search strategy was then devised to locate relevant primary studies. In the data extraction phase, pertinent information was extracted from the primary studies to address the research questions. Finally, data synthesis was performed to consolidate the findings and derive the results of this review.
Identify the need for systematic review
The majority of classification algorithms utilized in software quality prediction models employ hyperparameters that significantly impact the performance of the predictors. A literature analysis conducted by Tantithamthavorn et al. [7] revealed that 87% of the commonly used classification algorithms for software defect prediction necessitate the configuration of at least one hyperparameter setting. Given this context, it becomes imperative to conduct a comprehensive review of studies that have focused on tuning hyperparameters of classification algorithms when constructing software quality prediction models. This review aims to investigate the following aspects:
The performance improvement observed in tuned models compared to untuned models.
The classification techniques that display high sensitivity to their hyperparameters.
The commonly employed parameter optimization methods within the context of software quality prediction models.
The associated overheads and complexities involved in the parameter tuning process.
The prevalence and popularity of parameter tuning in software quality prediction models.
The guidance and recommendations provided by researchers regarding parameter tuning approaches.
Research questions
The primary objective of this review is to investigate the utilization of parameter tuning techniques in the construction of software prediction models and examine their influence on model performance. The motivations outlined in section 2.1 serve as the foundation for formulating the research questions that guide this study. Table 1 presents a comprehensive overview of the research questions addressed in this review.
To address RQ1, the study meticulously analyzed the primary studies to identify the diverse software quality attributes for which parameters were tuned. Additionally, RQ2 delves into the parameter tuning techniques employed in these studies, while RQ3.1 focuses specifically on the hyperparameters of classification algorithms that underwent tuning. Furthermore, RQ3.2 explores the impact of parameter tuning on model performance, and RQ3.3 seeks to identify the most effective parameter tuning techniques employed in the reviewed studies. In pursuit of a thorough analysis, RQ4 assesses the strengths and weaknesses associated with parameter tuning in software prediction models. This examination enables us to provide valuable guidelines to researchers regarding the optimal tuning of classification algorithm parameters.
By addressing these research questions, this review aims to shed light on the current practices, challenges, and best practices surrounding parameter tuning in the development of software prediction models. The insights obtained from this analysis will serve as a valuable resource for researchers and practitioners alike, fostering improvements in the field of software engineering.
Research questions
The objective of this study is to conduct a comprehensive review of parameter tuning techniques applied in software quality prediction models. Specifically, the focus of this review is on the following software attributes:
Defect proneness
Maintenance proneness
Effort estimation
Reliability
To ensure a comprehensive selection of primary studies, a meticulous search strategy was devised. Synonyms and alternate terms associated with the aforementioned software attributes, prediction, and tuning were identified from the existing literature. These terms were combined using Boolean expressions, utilizing “OR” to merge similar terms and “AND” to combine the main search terms. The resulting search string used for the selection of primary studies is as follows:
Software AND (fault OR defect OR bug OR error OR vulnerability OR change OR maintenance OR effort OR quality OR reliability) AND (proneness OR prone OR prediction OR probability) AND ((Parameter OR Hyperparameter) AND (tuning OR optimization OR selection OR determination))
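The Boolean structure of the search string above can be sketched in a few lines of code, for example to pre-screen exported titles or abstracts. This is an illustrative sketch only: the function name `matches_search_string` and the whole-word, case-insensitive matching rule are assumptions, and the digital libraries themselves may apply stemming or field-specific matching that differs from this.

```python
import re

# Term groups copied from the search string: groups are combined with AND,
# terms inside a group are combined with OR.
SEARCH_GROUPS = [
    ["software"],
    ["fault", "defect", "bug", "error", "vulnerability", "change",
     "maintenance", "effort", "quality", "reliability"],
    ["proneness", "prone", "prediction", "probability"],
    ["parameter", "hyperparameter"],
    ["tuning", "optimization", "selection", "determination"],
]


def matches_search_string(text):
    """Return True if the text satisfies every AND group (assumed whole-word match)."""
    words = set(re.findall(r"[a-z]+", text.lower()))
    return all(any(term in words for term in group) for group in SEARCH_GROUPS)


# Hypothetical titles used only to exercise the function.
print(matches_search_string("Hyperparameter tuning for software defect prediction"))
print(matches_search_string("A survey of software testing tools"))
```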
The search was conducted across several reputable digital libraries, including:
IEEE Xplore
ScienceDirect
ACM Digital Library
SpringerLink
Wiley Online Library
Google Scholar
Web of Science
Executing the search string on these electronic databases enabled the identification of relevant studies. In addition, a thorough examination of the references cited within these studies was conducted to identify any further pertinent research. Furthermore, select studies analyzing parameter tuning in the domain of software engineering were also considered to provide additional insights. Following a meticulous evaluation of the identified studies, a comprehensive set of 31 studies was selected as primary studies. These studies were deemed to meet the rigorous criteria outlined in the research, ensuring their relevance, credibility, and suitability for inclusion in this review.
Data extraction and data synthesis
To ensure systematic data collection and effectively address the formulated research questions, a comprehensive data extraction form was meticulously designed. The data extraction form encompasses the following fields:
Title of the study
Names of the author(s)
Publication year
Publication details
Datasets
Quality attribute
Learning/Classification algorithm(s)
Supplementary algorithm(s) employed
Parameters tuned/optimized
Parameter tuning technique(s)
Observed performance improvement upon parameter tuning
Strengths and weaknesses of the tuning techniques
Overall results of the study
Each primary study underwent a thorough review process to extract relevant data as per the structured form, ensuring the accurate representation of key information. The extracted data was then meticulously recorded and organized within a spreadsheet for further analysis and synthesis.
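One way to keep such an extraction form consistent across studies is to encode it as a typed record and write it to a spreadsheet-compatible file. The sketch below is an assumption about how this could be done, not a description of the tooling used in this review; the class name `ExtractionRecord`, the field identifiers, and the example values are all hypothetical.

```python
import csv
import io
from dataclasses import asdict, dataclass, fields


@dataclass
class ExtractionRecord:
    """One row of the data extraction form (field names are illustrative)."""
    title: str
    authors: str
    publication_year: int
    publication_details: str
    datasets: str
    quality_attribute: str
    learning_algorithms: str
    supplementary_algorithms: str
    parameters_tuned: str
    tuning_techniques: str
    performance_improvement: str
    strengths_and_weaknesses: str
    overall_results: str


def to_csv(records):
    """Serialize records to CSV text with one column per form field."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=[f.name for f in fields(ExtractionRecord)])
    writer.writeheader()
    for record in records:
        writer.writerow(asdict(record))
    return buffer.getvalue()


# Placeholder entry, not a real primary study.
example = ExtractionRecord(
    title="(example title)", authors="(example authors)", publication_year=2020,
    publication_details="(venue)", datasets="(datasets)", quality_attribute="defect",
    learning_algorithms="SVM", supplementary_algorithms="none",
    parameters_tuned="C; gamma", tuning_techniques="grid search",
    performance_improvement="(reported change)", strengths_and_weaknesses="(notes)",
    overall_results="(summary)",
)
csv_text = to_csv([example])
print(csv_text)
```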
The collected data derived from the primary studies serves as the foundation for formulating comprehensive responses to the research questions. Through a meticulous process of data synthesis, involving the summarization of pertinent facts and figures, the collected information is distilled and analyzed to generate meaningful insights and findings. This synthesis enables a comprehensive understanding of the parameter tuning techniques employed in software quality prediction models and their impact on model performance.
By following this rigorous data extraction and synthesis process, the study ensures the reliability and integrity of the findings, providing valuable insights into the effectiveness of parameter tuning in the context of software quality prediction models.
Results and discussion
This section encompasses the presentation of the results derived from the selected studies included in this systematic review. Firstly, we provide a concise overview of the chosen studies, outlining their key characteristics and contributions. Subsequently, we meticulously address each research question, drawing upon the findings extracted from the selected studies to provide comprehensive answers. Furthermore, we engage in a detailed discussion and interpretation of the results, aiming to derive meaningful conclusions and insights from the collected data.
Description of primary studies
A total of 31 studies were identified wherein parameter tuning techniques were applied in the construction of software prediction models. The details of these selected studies are presented in Table 2, offering comprehensive insights into the methodologies and outcomes of each study. To provide a temporal perspective, Fig. 3 illustrates the distribution of these studies from the year 2010 to mid-2023. The graphical representation highlights that only a limited number of studies have employed parameter tuning techniques within this timeframe. Notably, the majority of these studies predominantly utilized machine learning algorithms as the foundation for their prediction models. The analysis of this distribution showcases the relatively recent emergence of parameter tuning techniques in the realm of software prediction models.
RQ1: What are the categories of quality attributes where parameter tuning is being done?
While the primary focus of this systematic review is software quality, we have also incorporated studies on effort prediction due to the shared factors that influence their respective prediction models. Figure 4 shows the percentage of quality attributes where tuning is employed in the primary studies. Table 3 provides an overview of the attributes for which the parameters of learning algorithms have been optimized in the selected studies. A closer examination of the table reveals that only a limited number of studies in each category have employed parameter tuning techniques. Consequently, it becomes imperative to evaluate the impact of parameter tuning on the performance of prediction models within these specific categories.
Selected primary studies
Quality attributes where parameter tuning is being done
Year-wise distribution of studies.
Parameter tuning techniques applied in prediction models
Percentage of quality attributes considered in studies.
Studies using different parameter tuning techniques.
Researchers have employed a diverse range of parameter tuning techniques in the selected studies, as listed in Table 4. Figure 5 shows the distribution of studies across these techniques. Multisearch and Caret, the default parameter tuning options provided by Weka [8] and the R Caret package [26] respectively, both employ the grid search technique. Counting Multisearch and Caret as variants of grid search, 13 studies use grid search, making it the most widely used parameter tuning technique within the selected studies. Differential evolution is the next most prominent approach. Genetic algorithms and their variants have also been used as tuning techniques in several studies.
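For readers unfamiliar with differential evolution, the following is a minimal sketch of the common DE/rand/1/bin variant applied to two continuous hyperparameters. It is not the implementation used in any primary study. The objective `validation_error`, the hyperparameter bounds, and the control settings (`pop_size`, `generations`, `f`, `cr`) are illustrative assumptions; in practice the objective would train and validate a model.

```python
import random


def validation_error(params):
    """Hypothetical stand-in for the validation error of a model built with params."""
    cost, gamma = params
    return (cost - 10.0) ** 2 / 100.0 + (gamma - 0.5) ** 2


def differential_evolution(objective, bounds, pop_size=20, generations=50,
                           f=0.8, cr=0.9, seed=0):
    """Minimize objective over the box given by bounds using DE/rand/1/bin."""
    rng = random.Random(seed)
    dim = len(bounds)
    # Start from a population of random configurations inside the bounds.
    population = [[rng.uniform(lo, hi) for lo, hi in bounds] for _ in range(pop_size)]
    errors = [objective(ind) for ind in population]
    for _ in range(generations):
        for i in range(pop_size):
            # Pick three distinct individuals other than the current one.
            r1, r2, r3 = rng.sample([j for j in range(pop_size) if j != i], 3)
            forced = rng.randrange(dim)  # at least one gene comes from the mutant
            trial = []
            for d in range(dim):
                if d == forced or rng.random() < cr:
                    value = population[r1][d] + f * (population[r2][d] - population[r3][d])
                    lo, hi = bounds[d]
                    value = min(max(value, lo), hi)  # clip to the search range
                else:
                    value = population[i][d]
                trial.append(value)
            trial_error = objective(trial)
            # Greedy selection: the trial replaces its parent only if it is no worse.
            if trial_error <= errors[i]:
                population[i], errors[i] = trial, trial_error
    best = min(range(pop_size), key=errors.__getitem__)
    return population[best], errors[best]


bounds = [(0.1, 100.0), (0.001, 1.0)]
best_params, best_error = differential_evolution(validation_error, bounds)
print(best_params, best_error)
```

Unlike grid search, the candidates evaluated in each generation depend on the configurations that survived earlier generations, so the evaluation budget concentrates where earlier results were good.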
RQ3: What is the effect of parameter tuning on the performance of software quality prediction models?
This section presents the effect of parameter tuning on the performance of the predictors. The research question is divided into sub-questions, which are addressed in the following subsections.
RQ3.1: What are the learning techniques whose parameters have been tuned?
Table 5 lists the learning techniques whose parameters have been optimized. Support vector machines, k-nearest neighbours, random forests, neural networks, classification and regression trees, and decision trees are the most widely used in the studies.
Learning algorithms where parameters are optimized
Improvement of performance of prediction models – study wise
The primary objective of parameter tuning is to enhance the performance of prediction models. The impact of tuning the parameters of learning algorithms on the performance of prediction models, as observed in the selected studies, has been analyzed and consolidated in Table 6. The results of most of the studies demonstrate a noticeable improvement in the performance of software quality prediction models following parameter tuning. We have also sought to ascertain the statistical characteristics of the results obtained from the primary studies. While a variety of performance measures were employed across the selected studies, certain commonly used measures such as accuracy, area under the receiver operating characteristic curve (AUC), precision, sensitivity, and F-measure were identified. To facilitate comparison, Table 7 presents the minimum (Min.), maximum (Max.), mean, median, and standard deviation (std.) values of these performance measures for models constructed with both default parameters and tuned parameters.
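The summary statistics of the kind reported in Table 7 can be computed as in the sketch below. The accuracy values are hypothetical placeholders and are not taken from the primary studies. The use of the sample standard deviation (`statistics.stdev`) rather than the population form is also an assumption, since the text does not state which was used.

```python
import statistics


def summarize(values):
    """Return the five summary statistics for one performance measure."""
    return {
        "min": min(values),
        "max": max(values),
        "mean": statistics.mean(values),
        "median": statistics.median(values),
        "std": statistics.stdev(values),  # sample standard deviation (assumed)
    }


# Hypothetical accuracy values, NOT data from the reviewed studies.
default_accuracy = [0.62, 0.70, 0.75, 0.81, 0.68]
tuned_accuracy = [0.71, 0.78, 0.80, 0.86, 0.77]

default_summary = summarize(default_accuracy)
tuned_summary = summarize(tuned_accuracy)
for name, summary in (("default", default_summary), ("tuned", tuned_summary)):
    print(name, {key: round(value, 3) for key, value in summary.items()})
```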
To further illustrate the impact of parameter tuning on performance, box plots for accuracy and precision are depicted in Figs 6 and 7 respectively. These visual representations clearly demonstrate the reasonable improvements achieved through parameter tuning, underscoring its efficacy in enhancing the overall performance of software quality prediction models.
RQ3.3: What are the most effective parameter tuning techniques?
The majority of the selected studies focused on the application of a single parameter tuning technique and did not conduct comparative analyses of different tuning techniques. However, a few studies stand out for their efforts in directly comparing the performance of various tuning techniques. For instance, PS6 conducted a comparative study between differential evolution and grid search for parameter tuning in defect predictors. The findings of this study demonstrated that differential evolution outperformed grid search in terms of optimizing the performance of defect predictors. Similarly, PS11 investigated the impact of grid search and particle swarm optimization on parameter tuning. The results of this study revealed that both techniques exhibited generally positive effects on parameter optimization, without significant differences in performance. Furthermore, the results presented in PS20 indicated that differential evolution outperformed simulated annealing in terms of performance improvement. This finding strengthens the evidence supporting the efficacy of differential evolution as a superior parameter tuning technique. The results of PS30 indicate that Harmony Search shows better performance compared with the traditional optimization methods (Grid Search, Random Search, Tabu Search and Genetic Algorithm).
Boxplot of accuracy values.
Box plot of precision values.
Values of performance measures of models with default and tuned parameters settings
Among the various parameter tuning techniques, grid search, random search, genetic algorithm and differential evolution emerged as particularly popular choices among researchers, with multiple studies incorporating these techniques into their experimentation.
The majority of studies report an improvement of at least 30 percent in the performance of the predictors following parameter tuning. Notably, even the learning algorithms that initially performed poorly exhibited significant performance enhancements when their parameters were properly tuned, often surpassing the performance of the top-performing algorithms. This highlights the potential for parameter tuning to address the limitations of underperforming models.
However, it is important to acknowledge the potential weaknesses associated with parameter tuning techniques. Studies have identified two key concerns:
Additional Computational Cost: The pursuit of optimal parameter settings requires additional computational resources and time. Tuning parameters often involves exhaustive search or optimization algorithms, which can significantly increase the computational burden of model training and evaluation. Researchers must carefully consider the trade-off between the potential performance improvement and the computational cost incurred.
Risk of Overfitting: The tuning process introduces the risk of overfitting, wherein the model becomes excessively tailored to the training data and loses its ability to generalize well to new, unseen data. Fine-tuning parameters can result in models that exhibit high accuracy on the training set but perform poorly on unseen data. This highlights the need for cautious parameter tuning to strike a balance between model complexity and generalizability.
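A common safeguard against the overfitting risk described above is to select hyperparameters on a validation split and report performance on a separate test split that played no part in tuning. The sketch below illustrates this protocol with a small k-nearest-neighbour classifier on synthetic data. The data generator, the split sizes, and the candidate values of `k` are all illustrative assumptions.

```python
import random


def make_data(n, rng):
    """Synthetic two-class data: the label depends on x plus noise (illustrative only)."""
    data = []
    for _ in range(n):
        x = rng.uniform(0.0, 1.0)
        label = 1 if x + rng.gauss(0.0, 0.15) > 0.5 else 0
        data.append((x, label))
    return data


def knn_predict(train, x, k):
    """Majority vote among the k training points closest to x."""
    neighbours = sorted(train, key=lambda point: abs(point[0] - x))[:k]
    votes = sum(label for _, label in neighbours)
    return 1 if 2 * votes > k else 0


def accuracy(train, evaluation, k):
    """Fraction of evaluation points classified correctly."""
    hits = sum(knn_predict(train, x, k) == label for x, label in evaluation)
    return hits / len(evaluation)


rng = random.Random(42)
data = make_data(300, rng)
train, validation, test = data[:180], data[180:240], data[240:]

# Tune k using the validation split only; the test split is never consulted here.
candidate_ks = [1, 3, 5, 9, 15, 25]  # odd values avoid tied votes
best_k = max(candidate_ks, key=lambda k: accuracy(train, validation, k))

# Report performance once, on data that played no part in choosing k.
test_accuracy = accuracy(train, test, best_k)
print("chosen k:", best_k)
print("validation accuracy:", accuracy(train, validation, best_k))
print("test accuracy:", test_accuracy)
```

The validation accuracy of the chosen configuration tends to be optimistic because it was used for selection, which is why the test figure is the one to report.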
RQ5: What are the guidelines given in studies that a researcher should keep in mind while tuning the parameters?
The research findings from various studies have shed light on several important aspects related to parameter tuning in software prediction models:
Different parameter settings exhibit significant variance in performance, indicating that default parameter settings, while relatively satisfactory, are far from optimal for individual problem instances. Tuning parameters can lead to improvements on average, but can still fall short of optimality for specific instances.
A substantial majority (87%, or 26 out of 30) of the most commonly used classification techniques require at least one parameter setting. This underscores that selecting optimal parameter settings is an important experimental design choice for defect prediction models.
Parameter tuning has the potential to alter the comparative rankings of data mining algorithms, emphasizing its impact on model performance evaluation.
Certain algorithms, such as decision tree, support vector machine and random forest, display high sensitivity to parameter optimization, suggesting the need for careful tuning to achieve optimal results.
Neural network-based algorithms require a high-performance computing system to handle the tuning process effectively, reflecting their computational demands.
Despite the significant findings highlighting the importance of parameter tuning, a significant number of researchers and practitioners still overlook the tuning of classification algorithm parameters in software prediction models. This reluctance can be attributed to the following factors:
Time constraints
Computational overhead
Lack of awareness of its significance
To address this gap, we propose guidelines for software researchers and practitioners when applying parameter tuning techniques:
Assess the sensitivity of the classification algorithm to its parameters: Different algorithms exhibit varying levels of sensitivity to their hyperparameters. Understanding this sensitivity is crucial for determining the impact of parameter settings on model performance.
High sensitivity: If the classification technique is highly sensitive to its parameters, parameter tuning becomes imperative to achieve optimal or near-optimal settings. In such cases, researchers should employ the most suitable parameter tuning technique to fine-tune the hyperparameters.
Medium sensitivity: If the classification technique demonstrates moderate sensitivity to its parameters, the impact of parameter settings on model performance is moderate. If practitioners face constraints in tuning the parameters, default parameter tuning methods provided by data mining tools or packages can be considered. For example, the Weka tool offers Multisearch as its default parameter tuning method, while the Caret package in R provides automatic parameter tuning.
Low sensitivity: In instances where the classification technique exhibits low sensitivity to its parameters, the impact of parameter settings on model performance is minimal. In such cases, practitioners may choose to disregard parameter tuning if they face constraints or limitations.
To facilitate decision making regarding parameter tuning, Fig. 8 presents a flowchart depicting the process based on the sensitivity of the classification technique.
These guidelines aim to support software researchers and practitioners in making informed decisions when it comes to parameter tuning, considering the specific characteristics and sensitivity of the classification algorithms employed. By following these guidelines, researchers and practitioners can enhance the performance of software prediction models and optimize their resource allocation effectively.
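The sensitivity-based guidelines above can be expressed as a small decision function. This is a sketch of the rules as stated in the text and may not match every branch of the flowchart in Fig. 8. The function name `recommend_tuning` and the returned wording are hypothetical. The two branches for practitioners without constraints at medium and low sensitivity are assumptions, since the guidelines only state what to do under constraints.

```python
def recommend_tuning(sensitivity, constrained):
    """Map algorithm sensitivity and resource constraints to a tuning recommendation.

    sensitivity: "high", "medium", or "low"
    constrained: True if time or computing resources are limited
    """
    if sensitivity not in ("high", "medium", "low"):
        raise ValueError("sensitivity must be 'high', 'medium', or 'low'")
    if sensitivity == "high":
        # Stated in the guidelines: tuning is imperative.
        return "tune with the most suitable dedicated technique"
    if sensitivity == "medium":
        if constrained:
            # Stated in the guidelines: fall back on tool-provided tuning.
            return "use built-in tuning (e.g. Weka Multisearch or R Caret)"
        # Assumption: without constraints, dedicated tuning is still worthwhile.
        return "tune with a dedicated technique"
    if constrained:
        # Stated in the guidelines: tuning may be skipped at low sensitivity.
        return "skip tuning and keep default settings"
    # Assumption: without constraints, tuning is optional at low sensitivity.
    return "tuning optional; built-in tuning is sufficient"


for level in ("high", "medium", "low"):
    for limited in (True, False):
        print(level, limited, "->", recommend_tuning(level, limited))
```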
Limitations
In this comprehensive systematic review, we diligently examined all relevant studies sourced from the specified digital libraries, focusing on those that employed parameter tuning techniques for constructing software quality predictors. While our search was conducted exhaustively, it is important to acknowledge the possibility that a relevant study may have been inadvertently omitted, despite our rigorous efforts. Moreover, it is noteworthy that the number of studies directly comparing different parameter tuning techniques remains limited. Consequently, the determination of an unequivocal “best” technique is currently inconclusive, warranting further investigation and comparative analyses in future research.
To ensure the integrity and reliability of our findings, we have made the assumption that the results obtained from all primary studies included in our review are impartial. However, it is crucial to recognize that any bias or potential limitations within the primary studies could pose a threat to the validity of our findings and conclusions. We have undertaken extensive measures to mitigate this risk, including rigorous selection criteria and thorough analysis of the selected studies. Nonetheless, the potential for inherent biases or limitations in the primary studies remains a factor that should be considered in the interpretation and generalization of our findings.
Flow chart to determine tuning method.
In this research paper, we conducted a meticulous systematic literature review to examine the utilization of parameter tuning techniques in software quality prediction models, specifically focusing on defect, maintenance, reliability, and effort attributes of the software. The key objectives of this study were as follows:
First, we performed an extensive search within digital libraries and identified 31 primary studies that implemented parameter tuning techniques within the context of software quality prediction models.
Second, we carefully extracted and synthesized data from these primary studies. We summarized the characteristics of the primary studies based on various quality attributes, parameter tuning techniques, classification algorithms, and their respective hyperparameters.
Third, we conducted a comprehensive analysis of the primary study results to evaluate the impact of parameter tuning on the performance of software quality prediction models. Additionally, we analyzed studies that employed multiple parameter tuning techniques to determine the most effective approach.
Fourth, we provided a thorough examination of the strengths and weaknesses associated with tuning the hyperparameters of classification algorithms.
Finally, we presented guidelines and recommendations that software practitioners should consider when tuning parameters for their prediction models.
The main findings derived from the selected primary studies are as follows:
A limited number of studies have specifically addressed parameter tuning settings in the realm of software quality prediction, resulting in untuned models that are far from optimal in terms of performance.
Tuned models consistently exhibited improved prediction capabilities and demonstrated stability comparable to untuned models.
Grid search, differential evolution, genetic algorithm-based and hybrid parameter tuning techniques emerged as the most commonly employed and effective methods.
The parameters of classification algorithms such as Support Vector Machine, k-nearest neighbor, Random Forest, Neural Networks, Classification and Regression Trees (CART), and Random Forest Classification were frequently subjected to tuning.
Notably, Neural Networks, Instance-based Learning with parameter k (IBk), Support Vector Machine, Decision Tree, Random Forest and Logistic Regression exhibited high sensitivity to parameter tuning.
Linear Regression, Regression Trees (RTs) and Bagging exhibited low sensitivity to parameter tuning.
Parameter tuning was observed to significantly enhance the performance of underperforming classification techniques, leading to notable changes in their ranking.
Based on the results obtained, we strongly recommend that software practitioners adopt parameter tuning techniques when constructing their prediction models. For practitioners facing time constraints, the default parameter tuning methods provided by tools such as Weka (Multisearch) and R (Caret) can serve as a valuable starting point. Furthermore, we urge researchers to conduct comparative studies to further evaluate the effectiveness of different parameter tuning techniques. We also encourage the exploration of parameter tuning techniques on a broader range of classification algorithms to expand the understanding of their impact.
By implementing these recommendations and incorporating parameter tuning techniques, software practitioners and researchers can enhance the performance and reliability of their software quality prediction models, thereby advancing the field of software engineering.
