Abstract
The software engineering community produces data that can be analyzed to enhance the quality of future software products, and data regarding software defects can be used by data scientists to create defect predictors. However, sharing such data raises privacy concerns, since sensitive software features are usually considered as business assets that should be protected in accordance with the law. Early research efforts on protecting the privacy of software data found that applying conventional data anonymization to mask sensitive attributes of software features degrades the quality of the shared data. In addition, data produced by such approaches is not immune to attacks such as inference and background knowledge attacks. This research proposes a new approach to share protected release of software defects data that can still be used in data science algorithms. We created a generalization (clustering)-based approach to anonymize sensitive software attributes. Tomek link and AllNN data reduction approaches were used to discard noisy records that may affect the usefulness of the shared data. The proposed approach considers diversity of sensitive attributes as an important factor to avoid inference and background knowledge attacks on the anonymized data, therefore data discarded is removed from both defective and non-defective records. We conducted experiments conducted on several benchmark software defect datasets, using both data quality and privacy measures to evaluate the proposed approach. Our findings showed that the proposed approach outperforms existing well-known techniques using accuracy and privacy measures.
Introduction
General overview
In the real world, it is quite important to identify software defects, or potential causes of such defects using data from previous software projects. However, information about such projects is usually “classified information” that is not shared with researchers or the public, especially for mission critical applications where defected files, modules, or even the number of lines of codes in such applications cannot be shared with the public. In this case, if shared it could potentially provide enough information for adversaries to attack those systems. In addition, there are business reasons for not sharing such data with the public.
Researchers usually create defect prediction models based on many public datasets. However, such datasets contain flaws and are not domain specific, meaning that they do not usually work well when applied to predict or detect defects in real-world systems. For instance, it is not useful to predict errors in manned spacecraft software using a data mining model that was prepared for a completely different type of system.
Another solution is to test defect prediction models on synthetic datasets, which is insufficient compared to testing on real systems. There is a new trend to disclose software datasets after scrambling specific sensitive attributes from those datasets. Yet again, masking only direct identifiers is not enough to protect data, as indirect identifiers are as significant as direct identifiers for adversaries wishing to initiate attacks. As an example, such datasets usually contain sensitive attributes such as the line of codes (LOC), which is usually associated with the effort needed to discover bugs in specific modules. When released, it could affect the business of an organization, as many third parties could discover various important metrics such as its coding schemes, potential vulnerabilities in the code, etc.
Motivation example
An example is shown in Tables 1 and 2. Data in Table 1 is considered private and sharing it would disclose sensitive attributes such as the LOC. Table 2 provides an anonymized version of the data in Table 1. When creating an anonymized release of the data, we remove the direct identifiers; however, doing so does not always guarantee that the release of this data will not be a target for specific types of attacks by adversaries or third party, including background knowledge attacks. For instance, even if we hide attributes such as file ID or name when releasing Table 2, some adversaries may still be able to infer the values of sensitive attributes, if they have background information about specific files. For example, if an attacker knows the number of children or weighted methods per class metric for file C_4, then he can be certain that the number of LOCs in that file is between 260 and 280 by having access to the data in the encrypted Table 2. One reason for this is that the attribute LOC is not diverse enough to prevent such types of attack.
Private dataset
Private dataset
This is just one example of how an attacker can identify the values of sensitive attributes. Tables 1 and 2 contain several types of attribute. For example, file ID is considered a direct identifier that needs to be removed when scrambling this dataset. Indirect identifiers, also called quasi identifiers, are another example of auxiliary information used by adversaries to reveal the values of sensitive attributes. Such identifiers are usually used in combination with external information to infer the values of sensitive attributes such as LOC.
While data scrambling or anonymization enhances the privacy of datasets when shared, it also leads to information loss. This trade-off between privacy and utility is not usually considered when creating anonymization algorithms. The existing works are mostly randomizing data; as such, the resulting anonymized data loses its statistical distribution.
Released (anonymized data)
While sharing data is also a major problem due to privacy regulations, including the General Data Protection Regulation (GDPR), regulations such as GDPR have recognized privacy-enhancing technologies. Data that is properly anonymized, thereby anonymous, does not fall under many GDPR restrictions [52].
This study introduces a novel privacy-preserving defect prediction approach which anonymizes data using a generalization technique that also maintains the diversity of the anonymized defect datasets. Furthermore, the approach reduces the size of the original dataset by employing an entropy-preserving measure to remove noisy records after anonymization. We found no evidence of an existing method that uses generalization techniques with data reduction to increase data usefulness after anonymization, where the latter is a major limitation of many of the existing techniques.
The aim of this research is to create a new generalization approach to share defect related metrics for software projects, while at the same time preserving sensitive attributes present in datasets containing those metrics. The approach has several practical implications in real life. Each company has specific privacy requirements, so our approach calibrates this by allowing users to identify the required level of privacy. Companies that have high restriction on their data can anonymize at a more generalized level, allowing for better privacy assurance. Furthermore, the quality of the shared data can be enhanced through the proposed record reduction mechanism which aims to discard less important records that mislead classification algorithms.
This research examines the use of generalization techniques with a novel data reduction approach that relies on information theory, specifically entropy, to anonymize defect-related data and use the resulting data in tasks such as classification. The approach tests both the privacy and the utility of the proposed anonymization. Additionally, since the data reduced is diverse, the resulting data needs to be immune to specific types of privacy attacks such as background knowledge attacks. Our approach will answer the following research question. Can we share anonymized release of software datasets using our diversity preserving technique while at the same time preserving both privacy and utility of the released data?
This research is significant because it will enable companies to share their defect metrics data without privacy concerns; therefore they will be able to share their data with other companies even to create cross-company defect prediction (CCDP) models. It has been shown that defect prediction models are as effective as within-company defect prediction [29, 30, 51].
The proposed solution is innovative in the following aspects: 1) a generalization approach that provides both strong privacy guarantee and better preservation of data properties useful for subsequent analysis; 2) an intelligent tuning mechanism that makes this approach adaptive to different contexts including the privacy risks of a and its importance in analysis tasks; 3) no extra computation overhead for the data analyzer, i.e., the data analyzer can conduct data analysis directly on the anonymized data without the need of decryption. The second innovation of our approach is that it uses an intelligent tuning mechanism to make our solution adaptive to different contexts and domains including the privacy risk of a field its importance in analysis tasks. This can be done by tuning the privacy and utility budget based on the domain
Research background
Generally, the data owner stores data in table form, which mainly consists of different types of feature that can be classified into: explicit features, which explicitly identify a record in the table such as social security number (SSN), and sensitive attributes which indicate private specific information about a record such as salary; and quasi identifier attributes (QID), which is group of attributes that might identify a specific record. This classification of features is also applied to software data where some features are considered very sensitive such as the LOC and the effort related metrics.
Anonymization techniques fall into various categories, including generalization, suppression, anatomization, permutation, and perturbation. Generalization is the opposite of suppression and replaces QID or sensitive attributes with values that are more general; for example, it may replace engineer with job. On the other hand, rather than replacing the values of anatomization and permutation attributes, they concentrate on de-associating the relationship between QID attributes and sensitive attributes. Perturbation may reform the data by swapping values, adding noise, or generating synthetic data based on some statistical information of the raw data.
The anonymization process means hiding sensitive attributes that could be used to identify objects. Many techniques and tools can be used to anonymize data, but one of the main issues that needs to be considered when anonymizing data is the preservation of the record values and the contribution of features, in order to carry out a productive analysis on the data (e.g. creating defect prediction models). Therefore, the analytical results gained from the anonymized data will be similar to the result gained from un-anonymized data. It is essential, but not straightforward, to achieve such a tradeoff between privacy and utility so that the anonymized data can still be analyzed.
Anonymization algorithms differ with the noise level added and the task to which the anonymized data can be applied. While some algorithms such as Black Marker deteriorate the data and make it useless to researchers, others can partially preserve the structure and utility of the data. Most anonymization algorithms depend mainly on anonymizing identification attributes, which is not always enough to avoid all types of privacy attacks. The attacker may infer confidential information from external or auxiliary information, and therefore absolute privacy cannot be achieved in the presence of auxiliary information.
As mentioned above, it is vital to preserve data utility after anonymization, so it is quite important to ask researchers about their intentions, needs and tasks when studying software defect related matters. This is needed to release data that meets their needs and the tasks to which the data can be applied and create new anonymization techniques that preserve the statistical characteristics of the original data. For instance, when anonymizing data it can be applied for specific mining algorithms (e.g. classification), but not for others (e.g. clustering).
Research on data anonymization has been conducted in many domains including medical, social and financial. There is also a difference between anonymization and encryption. In encryption the main objective is to hide information, whereas in anonymization the objective is to create a data release that can be used by researchers, while at the same time maintaining its privacy. Applying privacy preserving techniques to software engineering has been used not only in defect prediction but also in many other objectives such preserving the privacy of personal data when collecting requirements and developing systems [8].
Anonymization, in the context of defect prediction, means hiding attributes that can be used to identify objects in defect datasets or organizations that generate the data. Beckers and Heisel [8] propose an approach to preserve the privacy of individual data collected during the system development process. Their approach creates privacy patterns to express and analyze different privacy goals such as anonymity, pseudonymity, unlinkability and unobservability by applying specific privacy mechanisms to the software engineering process to identify incomplete privacy requirements.
In many countries, privacy laws do not allow the use of actual data to test software products; therefore many projects use fake data to test their code. However, using fake data does not generally cover the operational conditions that arise when using the developed software in real-life environments. Specifically, good test case coverage cannot be achieved using fake data. Grechanik et al. [24] propose an approach to share realistic data that can be used by software testers to test their code. They applied a
In a related work by Li et al. [31] an approach called TestIng taSks (PISTIS)/protecting and minimizing databases for software was introduced. This approach relies on a weight-based data clustering algorithm that clusters data point to achieve the objective of reducing the test space while covering as many test paths as possible. Each cluster is represented by a cluster centroid, which characterizes the general population. In addition, it becomes difficult for an adversary to infer useful information when only a cluster centroid is represented instead of the entire cluster. Experimental results have shown that it is still possible to achieve the required level of test coverage on the anonymized data.
Privacy preserving software testing has been utilized in other testing approaches, such as regression testing, which uses a test selection (RTS) technique with control-flow graphs (CFGs) to model web service interactions. To test these techniques, service providers need to share their CFGs with participants. The latter includes sensitive implementation details that cannot be always shared with participants. Ruth’s [44] study proposes a technique to sanitize individual nodes and alter the shape of the CFG to protect its sensitive details while at the same time keeping it effective for conducting testing.
Senarath and Arachchilage [46] address the issue of applying privacy requirements by developers and why they cannot embed privacy in software systems, and conclude that most developers are not even aware of personal privacy issues when implementing software systems. They provide several guidelines that should be applied by organisations to enforce privacy requirements and recommend that privacy guidelines should be made as simple and clear as possible. They also suggest that formal training procedures should be applied and used as guidelines in creating software systems. In the software engineering community, developers’ lack of knowledge regarding privacy affects their opinion which interferes in the way they embed privacy into designs. They also state that privacy guidelines should provide formal mechanisms for evaluation. Finally, they advise that the way in which privacy in software systems should be implemented at the technical level must be explicitly specified.
Peters and Menzies [40] proposed a privacy preserving defect prediction approach, MORPH, an anonymization-based technique which anonymizes data through a mutation function that moves data points based on random distance considering class boundaries. It was tested on several classification algorithms such as Random Forests, Bayes Net, and Logistic Regression and it was found that its main limitation is the difficulty in achieving a balance between preserving class boundary when noise is added and the required level of privacy; this solely depends on the distribution of the defect prediction dataset.
Limitations of the existing techniques and our contributions
The existing techniques, such as MORPH, have a number of limitations. In general, they do not guarantee that an instance does not move across the boundary between the original instance and instances of another class, which means that it may be difficult to keep the MORPHed value close to the original to maintain the utility of the dataset. Our approach, on the other hand, has the following contributions:
This research examines using generalization techniques with a novel data reduction diversity preserving approach that relies on information theory to anonymize defects related data. The proposed approach tests both the privacy and the utility of the proposed anonymization. Our results show that we can share anonymized release of the datasets using our diversity preserving data reduction technique while at the same time preserving both the privacy and the accuracy of the released data.
The rest of this paper is organized as follows: Section 2 discusses the related work; Section 3 presents the methodology of our approach; the experiment and evaluation is presented in Section 4; the discussion of the results are shown in Section 5; threats to the validity are presented in Section 6; and Section 7 concludes the work and discusses the potential of future work.
Privacy preserving analytics in software engineering
Data analytics in software engineering has been applied over the years for many objectives such as defect prediction, but one of the main challenges has been data privacy. Moparthi and Geethanjali [37] proposed a multi-party ensemble-based defect prediction approach on multiple associated products which privately discovers and uses defect patterns for prediction. They encoded discovered patterns using string encoding functions by dividing data analysis sites into masters and slaves, then a public key function was used for transmitting data between sites. Once the sensitive patterns were discovered in the master site, they were stored in a matrix and then sent randomly to slave sites. They repeated the process until all instances were processed and tested using classification accuracy measure; however, there was no privacy-based evaluation of the approach. In addition, the authors did not describe the encoding functions and public key infrastructure used to implement the system.
Peters and Menzies [40] propose a privacy preserving defect prediction approach-MORPH. MORPH is a Anonymization-based technique which anonymizes data through a mutation function that moves data points based on a random distance considering class boundaries. MORPH was tested on several classification algorithms such as Random Forests, Bayes Net, and Logistic Regression. The main limitation for MORPH is the difficulty of achieving a balance between preserving class boundary when the noise is added and the required level of privacy; this solely depends on the distribution of the defect prediction dataset.
Peters et al. [41] continued this line of research by proposing a new algorithm, CLIFF, which is an instance pruner that finds attribute subranges in the data that are more present in one class than in other classes. They propose that after anonymization; a dataset can be pruned using CLIFF in order to mitigate inference attacks that benefit from large data size. CLIFF utilizes a power function proposed by Jalali et al. [28] to prune less useful attributes. As such, the power of each attribute’s value needs to be calculated for each class. The power values depend mainly on the frequency of attribute values with a specific class. This can be done at the row level, then rows with the highest power are kept in the dataset, while rows with less power are pruned from the dataset. Experimental results have shown that combining MORPH with CLIFF leads to significantly better privacy compared with only using MORPH. While MORPH can work well on large datasets, it cannot be applied effectively to small datasets. LACE2 was proposed by Peters et al. [42] to reduce the amount of data shared when multi-party data sharing is needed. Data can be incrementally added by parties to the repository, when it is not similar to the current data in the repository, thereby avoiding data redundancy. The obfuscation approach used was quite similar to MORPH; however the authors mention that their new approach yielded better privacy than their previous approaches without affecting data utility.
Li et al. [33] proposed a multiple sources and privacy preservation approach for heterogeneous defect prediction. Their approach selects the most similar projects to the one under testing by joining the target project and a well-selected set of other sources in a discriminative common subspace, where the target and sources have similar data distributions and favorable classification ability. A privacy preserving algorithm was designed based on double obfuscation (SRDO). The authors found that SRDO achieved better privacy and utility than CLIFF and MORPH.
Fu et al. [21] proposed a new approach called pattern preserving tree that perturbed the original software engineering data considering the patterns existing in the original tree. Unlike the existing approaches, the new approach perturbs data while paying attention to the patterns that exist in the original tree. It generates a new regression tree from the perturbated data, which can then be shared with researchers.
A privacy model for behavior preserving testing and debugging of software systems was introduced by Budi et al. [13]. Their approach handled one of the main limitations of
A number of similar studies have been carried out regarding privacy preserving testing of database applications by generating a mock dataset that retains the statistical character of the original dataset [53]. The objective of creating a mock dataset is to have an almost identical structure as the original data in which table schema description and view indices are retained and can be used to estimate the performance of the original dataset. The authors showed that a mock database generated from this can simulate the live environment for testing purpose, while maintaining the privacy of data in the original database.
Wang et al. [53] went on to propose a model for quantifying the privacy leakage and application performance metrics differences when generating synthetic datasets. The analysis conducted in their paper shows that designing an efficient privacy preserving approach to measure performance issues is a challenging problem.
Recently a number of works have been conducted utilizing differential privacy to conduct privacy preserving analytics of software systems including Chen et al. [15] who propose an approach called DP-Share, which applies the traditional data mining process to software engineering datasets, such as sampling and classification. Once the model is created, Laplacian noise is added to satisfy the requirements of differential privacy. Nine software projects were used during the experiments and several performance measures were used to evaluate the privacy and utility of the proposed technique. The results showed that after privacy and utility analysis DP-Share can achieve better performance than a baseline method.
Other related works focus on using privacy preserving techniques in effort estimation. For example, Qi et al. [43] show the importance of preserving the privacy features of effort estimation. They highlight the many indicators that reflect sensitive features in the dataset, such as the properties of the product or the image of a team, e.g. function points count (FP), line of code (LOC) and number of function point (size). The authors conclude that privacy threats can target their effort data; therefore they propose an interval covering based subclass division (ICSD) strategy to obfuscate data. ICSD can divide the target data into several subclasses by extracting a new attribute (i.e. class label) from the effort data. The obtained class label is beneficial for maintaining the distribution of the target data after obfuscation. The authors proposed a manifold learning based bi-directional data obfuscation (MLBDO) algorithm, which uses two nearest neighbors, selected respectively from the previous and next subclasses by utilizing the manifold learning based nearest neighbor selector, as the disturbances to obfuscate the target sample. The approach was called ICSD&MLBDO. Experimental results on seven public effort datasets showed that: 1) ICSD&MLBDO enhanced the privacy and maintained the utility of obfuscated data, and 2) ICSD&MLBDO can achieve better privacy and utility than the compared privacy-preserving methods.
Privacy preserving data analytics
Privacy preserving methods and techniques focus on modifying data to hide direct identifiers and make it difficult to identify individual entities by inferring relationships between indirect identifiers and external information [1]. It has been shown that removing direct identifiers such as name and email only is not effective in avoiding all privacy related attacks [7]. There is another category of attributes called indirect identifiers which are not explicit identifiers but can still be combined with other external attributes to identify individuals. This type of attack is known as a record linkage attack (Benjamin et al., 2010). An example of such attributes is the QID set [5-digit ZIP, gender, date of birth]. It has been shown that about 87% of the residents of the United States can be identified when this set of attributes is publicly available [23].
Privacy preserving data mining (PPDM) focuses on transforming data to create a private version of it, but preserving its statistical attributes. Mendes and Vilela [36] surveyed existing privacy preserving data mining techniques and classified them into data collection, data publishing, data distribution and output-based approaches. Data collection techniques include additive [39] and multiplicative noise methods [34].
Data anonymization techniques are divided into many categories including generalization, perturbation, etc. In generalization the values of quasi identifiers are replaced with more general values to avoid inference and record linkage attacks. Introducing similarity between records through k-anonymity is a very common approach to releasing generalized anonymized datasets [50]. In
Multiple data sharing techniques target distributed data storage systems. For instance, in secure sum, secure product, and secure set union the objective is to calculate some mathematical functions over two or more sites, without violating the privacy of individual sites [16]. There are several metrics for measuring privacy and utility after applying privacy preserving algorithms. Privacy measures include confidence level [5], average conditional entropy [4], variance [39], and hidden failure [38]. Utility measures include minimal distortion [45], loss metrics [27], information loss [54], misses cost [39], and misclassification error [39].
Several other approaches included in the literature inherit the characteristics of the previous techniques to create a privatized version of the datasets. Randomization approaches have been classified into numerical and itemset randomization [11] and do not need prior knowledge about all records and correlations patterns inherent in the data. Randomization approaches treat all records irrespective of their local density; therefore outlier records are more susceptible to adversarial attacks than records in more dense regions [1]. Evfimievski et al. [20] utilize a randomization-based privacy preserving approach for identifying support and variance in association rules. Approaches such as noise addition, data swapping, and synthetic data generation can be classified under randomization-based anonymization. Soria-Comas et al. [49] introduce a micro-aggregation privacy model. Unlike the traditional
Differential privacy is a privacy model that provides worst-case privacy guarantee regardless of what attackers know [19]. It protects statistical aggregate queries to prevent attackers from inferring sensitive information about individual records based on subsequent queries. The assumption is that if an adversary has full or partial background knowledge about a group of individuals in a dataset, with the exception of a single record, he cannot re-identify that record. Privacy is achieved by adding enough noise that follows a specific distribution called Laplace or double exponential. In differential privacy, the level of privacy can be controlled by a data publisher who decides the amount of noise (i.e., privacy budget) to be introduced to the dataset using a
Research methodology
Overall research design
Figure 1 shows the architecture of our proposed approach. In a real environment our method works by obtaining an
The architecture of the proposed approach.
Importance of features of the anonymized data was used to reduce the size of data while preserving data privacy and diversity. This approach generates anonymized datasets with defects and no-defects records which can then be shared with data recipients, who may use it to perform defect prediction. Defect prediction was applied to the original data and a comparison was conducted before and after anonymization. The methodology is evaluated based on several utility and privacy measures as shown in our experiments.
Software defect data
Software data with defective software modules, components, or files is utilized by companies to manage the software testing process using valuable historical data. However, sometimes such data does not exist or is not sufficient, in which case an organization can use data from other organizations. Table 3 shows some of the characteristics or quasi identifiers that may be found in software data sets.
Sample quasi identifiers in software defect data
Sample quasi identifiers in software defect data
In our experiment we used a data repository which contains different datasets generated by NASA [12] and is renowned in test research. All the datasets include attributes related to code.
To explain our approach, we will use the following motivation example. Table 4 shows a sample of an original dataset that contains defects. The example will be used to drive our description of the methodology steps and our contributions.
Private software defect dataset
Clustering based anonymization (min 2 instances per cluster)
The algorithm then runs an insensitive micro aggregation method to divide data in each
Our example in Table 5 shows the resulting dataset when anonymizing using a clustering-based anonymization and the taxonomy shown in Fig. 3. Here LOC is considered a sensitive feature to be protected. The last column is the class attribute, and all other features are considered quasi identifiers therefore anonymized.
Generalization example in which the original LOC counts are clustered (left), then generalized based on cluster mean (right).
In reality, clustering-based anonymization is carried out using clustering algorithms such as
Using a domain taxonomy to anonymize defect data.
We relied on information theory, in which entropy is a measure of the uncertainty associated with a random variable, to accomplish the data reduction step. This term usually refers to Shannon entropy, which quantifies the expected value of the information contained in a message, usually in units such as bits [47]. Entropy has been utilized in areas such as natural language processing. It is a measure of the average information content which is missing when the value of a random variable is not known. Overall, entropy, when a defected code is conditioned on one or (more) features, reveals the degree of uncertainty about the occurrence of a defect when such feature(s) are observed. This form of entropy is called conditional entropy. The conditional entropy
It has been reported that conditional entropy is one of the best preforming entropy measures in signal compression algorithms [48]. In general, the more non-uniform occurrence of a feature,
Points in the space representing defective and non-defective records.
For example, Fig. 4 shows many defective (green circles) and benign code records represented as points in the space. Records belonging to different classes, but close to each other, yield the highest entropy, since the values of features of those records have almost uniform distribution. Data scientists call these instances Tomek links, which are pairs of very close instances, but of opposite classes. Removing such instances increases the space between the two classes, facilitating the classification process. Additionally, it preserves the diversity of the original dataset by removing records from both classes. Formally, a Tomek link is defined between records
Thus,
Using Tomek links for data reduction.
Removing Tomek links is not enough since there will be low ranked records within the boundaries of the class itself, therefore we utilized another reduction technique, which was proposed by Alejo et al. [6]. Each instance
In a related work, Peters et al. [42], utilized the near-miss approach for data obfuscation which is shown in Fig. 6. Specifically, this approach selects samples from the majority class (non-defective components) for which the average distance of the
Entropy-based ranking for the private dataset
This approach utilizes the following obfuscation function
nearest neighbor of
If the clustering step is conducted, one potential clustering result is shown in Table 7. This table contains three clusters, with
Discarding records with the lowest entropy rank (highest entropy)
Near-miss data reduction approach.
Once the data is anonymized it will be shared with data recipients. Our approach will calibrate this by allowing users to identify the required level of generalization, where higher
Resulting data maintaining both diversity and privacy (2 clusters
2) each)
Resulting data maintaining both diversity and privacy (2 clusters
Suppose a point z is added, and its distance to P is in between xi and
Pseudo data per cluster is generated using singular value decomposition (SVD). SVD is applied to the anonymized clustered data set. It decomposes a matrix
Whereas
The number of first
Data analysis and interpretation
In theory, the ideal situation for any privacy algorithm is to achieve the highest level of privacy and utility. However, the objective of such algorithms has been always to reach an acceptable tradeoff between privacy and utility as shown in Fig. 7.
Privacy
To measure the privacy level of anonymized datasets, we use conditional privacy [3], This measure depends on the notion of mutual information between the raw and anonymized records at
Where
If
Relationship between privacy and accuracy.
We determine whether the anonymization techniques produce satisfactory accuracy values in terms of differentiating between files that contains defects and those that are defect free. Thus, if we apply a classification model on the original dataset, the results should not be very different if the same model is applied to the anonymized dataset. Accuracy is measured based on the following formula
In addition to accuracy, we used other measures to evaluate the effectiveness of our approach. Both recall (
To run our experiments, we implemented several components of our approach in MATLAB and Python, with the anonymization approach implemented in MATLAB and the data reduction approach implemented in Python. We also used the implementation of MORPH, CLIFF and LACE2 in Python to conduct a comparison with other approaches. The experiments were conducted on 11 datasets using the PROMISE dataset repository which contains different datasets generated by NASA (Boetticher, 2007). As mentioned previously, this is a well-known repository that has been used to test research on software engineering. Table 9 shows the number of records, features, and labels and the percentage of bugs in each dataset. These datasets have been used recently by Peters et al. [4] to test their approach on privacy preserving defect prediction.
Datasets used in the experiments
Datasets used in the experiments
Several types of experiment were conducted on each dataset as follows:
The first experiment aimed to measure the contribution of each feature to the overall privacy level. The objective of this experiment was to examine if some features contribute greater than others to overall privacy. This experiment was conducted by varying the number of instances per cluster and then measuring the overall privacy level. In the second experiment we measured the effect of data reduction on the privacy values. This experiment was conducted by varying the number of instances per cluster. The objective of this experiment is to examine if removing some instances from the dataset makes the data more or less vulnerable to privacy threats. The third experiment measured the utility of the resulting data in terms of different metrics. This experiment had two objectives: first to examine how perturbing data affects the values of accuracy when the resulting data is tested and second, to examine if data reduction can enhance the classification accuracy when the resulting data is used for classification. The final experiment compared our approach with the existing techniques. We conducted this experiment by measuring both privacy and utility of the resulting data and comparing it with the original data.
The first experiment measured privacy per feature for each dataset, by varying the number of instances per cluster, then measuring the privacy values per feature, and finally calculating the overall privacy values. The results of this experiment are shown in Fig. 8.
Privacy values for all the datasets used.
From Fig. 8 we can summarize the following:
The
The second dataset (
The third experiment was conducted on the
The
The
The
The
This experiment was conducted on the
The
The
The characteristics of the
Number of instances before and after reduction for all datasets used.
Privacy values before and after reduction for all datasets used.
The second experiment measured the effects of reducing the size of the training dataset and measuring how such a reduction affects the total privacy. For the reduction step, and given the small number of defective instances, we used Tomek link to remove the instances of the majority class only. AllNN was also used to remove the instances of the minority class, and to reduce the number of removed instances we used the
Privacy was measured before and after data reduction as shown in Fig. 10. This experiment also measured privacy per number of instances.
It was noted that the
For the
Since the
We noticed that the
Since the number of defects in the
For
For
Since the
For the
We reduced the size of the
Since the
Utility results
The final experiment on this dataset measured utility under different scenarios. The original dataset was divided into training data and testing data, then three training datasets were used to create three classification models, using the original dataset, the dataset before reduction and the dataset after reduction. The same testing data was used to test each of those three models. The subsections below describe the results for each dataset.
4.2.3.1. CM1 dataset
Figure 11 measures classification accuracy of the original dataset, the anonymized dataset before reduction, and the anonymized dataset after reduction for the CM1 dataset.
Utility metrics (CM1 dataset).
We can make several observations from the above results. First, our approach preserved the utility of the original dataset, whose overall recall was 89%. When the same testing dataset was utilized on the model created on the anonymized data, without any reduction, we attained about 81% rcall. However, when the reduction step was implemented and the model retrained using the reduced data, we observed an increase in the accuracy value as well as other classification metrics. This included higher TP rate, higher precision, recall, and
4.2.3.2. KC1 dataset
Figure 12 shows the classification accuracy of the original dataset, the anonymized dataset before reduction, and the anonymized dataset after reduction for the KC1 dataset. Although there is some similarity with the results of the previous dataset, it is observed that “after data reduction” had more advantage regarding the accuracy values than before reduction. The reduction step also reduced the FP rate for the positive class (the defective class).
Utility metrics (KC1 dataset).
Utility metrics (KC2 dataset).
4.2.3.3. KC2 dataset
The following confusion matrices show the results of classification metrics when the same testing data was used as input to classification models created based on the original data, data before reduction and data after reduction. The results are reported in Fig. 13.
We can summarize the results shown in Fig. 13 as follows: First, there was a significant difference in the values of classification metrics between the data before and after anonymization. That is, the classification results were affected when the data were anonymized. However, this happened only before data reduction. After data reduction the values of classification metrics increased relatively. This experiment proved the advantage of data reduction on the classification results. Second, the false positive values were smaller after data reduction, which also shows that reduction also improved the classification results for the minority class.
4.2.3.4. KC3 dataset
Figure 14 measures the classification accuracy of the original dataset, the anonymized dataset before reduction, and the anonymized dataset after reduction for the KC3 dataset. Reduction of the dataset led to better results in terms of many classification metrics. Nevertheless, we noticed a minor increase in misclassifying non-defective records as defective as shown when comparing 17. One potential reason for this is that after data reduction, fewer similar instances are produced which happens with other experiments, however, since the features were diverse in this dataset, such an effect became clearer and eventually led to an increasing of the miscalculation rate of negative examples.
Utility metrics (KC3 dataset).
Utility metrics (MC1 dataset).
4.2.3.5. MC1 dataset
Figure 15 highlights the classification accuracy of the original dataset, the anonymized dataset before reduction, and the anonymized dataset after reduction for the MC1 dataset.
Figure 15 demonstrates several interesting patterns. First, in terms of accuracy utility for the after-data reduction case achieved results close to the original dataset. Although the results for the after-data reduction were better than the before data reduction, the false positive rate for the minority class as slightly higher after data reduction. However, this came at the cost of a lower true positive rate for the majority class.
4.2.3.6. MC2
Figure 16 indicates the classification accuracy of the original dataset, the anonymized dataset before reduction, and the anonymized dataset after reduction for the MC2 dataset. The results for this dataset clearly indicate the benefits of our data reduction approach, particularly when the dataset is of a small size but with a diverse number of sensitive features. The recall was around 67% compared to 60% when no reduction was applied. This dataset yielded lower accuracy even compared to the original dataset. Although the dataset was more balanced than the previous datasets, it gained less classification accuracy. However, the main objective of our approach was not to gain higher accuracy, but to decrease the differences in accuracy between the original and anonymized data.
Utility metrics (MC2 dataset).
Utility metrics (MW1 dataset).
4.2.3.7. MW1
Figure 17 shows the classification accuracy of the original dataset, the anonymized dataset before reduction, and the anonymized dataset after reduction for the MW1 dataset. The overall results show the benefits of the data reduction step in increasing the values of several classification metrics. Compared to before data reduction this yielded 78% recall, this value increased after data reduction, reaching around 81%.
4.2.3.8. PC1
Figure 18 demonstrates the classification accuracy of the original dataset, the anonymized dataset before reduction, and the anonymized dataset after reduction for the PC1 dataset. There were no significant differences between classification metrics for before and after anonymization. In fact, it is observed that there were very minor differences between the values of accuracy before and after anonymization. However, there is an increase in the values of false positives after anonymization than before anonymization, which is explained by the very few instances of the minority class.
Utility metrics (PC1 dataset).
Utility metrics (PC2 dataset).
4.2.3.9. PC2
The final experiment measured the utility of the resulting data in terms of several classification metrics as shown in Fig. 19. There were no significant differences between the accuracy values of the original dataset and the anonymized datasets. In addition, data reduction yielded better results than before reduction. It is also noted that the after-data reduction yielded a lower FP rate than the original datasets.
4.2.3.10. PC3
Figure 20 displays the classification accuracy of the original dataset, the anonymized dataset before reduction, and the anonymized dataset after reduction for the PC3 dataset. Two main observations can be drawn from the results of this aspect of the experiments, as shown in Fig. 20. First, the result clearly indicates the benefits of reducing the size of the dataset, although the difference is not as significant as in the previous experiments.
Utility metrics (PC3 dataset).
4.2.3.11. PC4
Figure 21 measures the classification accuracy of the original dataset, the anonymized dataset before reduction, and the anonymized dataset after reduction for the CM1 dataset. The recall of the original dataset was around 90% compared to 88% when the reduced dataset was used and 87% when the non-reduced dataset was used. This shows that there are no significant differences between before and after data anonymization. However, this is associated with higher values of accuracy for the after-data reduction case.
We compared our approach with three existing techniques. All such techniques obfuscate the data by morphing its values via a randomization function between-class instance. Although the MORPHed value can ensure the dataset is sufficiently privatized, it does not always guarantee that an instance does not move across the boundary between the original instance and instances of another class, which means that it may be difficult to keep the MORPHed value close enough to the original to maintain the utility of the dataset.
Comparison with existing approaches – accuracy-based comparison
Comparison with existing approaches – accuracy-based comparison
Utility metrics (PC4 dataset).
We compared the results achieved by the MORPH technique and its successors with our approach. To conduct a fair comparison, the training part of each dataset was anonymized using those techniques and our technique, and then the same testing dataset was applied to each approach. The results are reported in Table 10. The accuracy values clearly show that our approach achieves significantly higher accuracy values. Of the existing approaches LACE2 achieves better results. LACE 2 also applies a data reduction technique that removes the noisy instances.
The final experiment also compared our approach with other existing techniques such as CLIFF, LACE 2, and MORPH, in terms of privacy. The experiment was conducted by varying the number of instances per cluster. It should be noted that this value varies from one dataset to another as shown in Table 11. The results shown in Table 11 clearly indicate that our approach yields better privacy for most of the datasets, even on small number of instances per cluster. This result is noticed for our approaches before and after data reduction as demonstrated in the table. It is important to mention that while our approach inherits some of the characteristics of the K-anonymity approach, applying SVD per cluster enhanced the level of privacy for each of the resulting clusters and eventually the privacy values of each dataset.
Privacy-based comparison with the existing techniques
In this experiment, we tested the execution time of our anonymization approach and compare it to other approaches. On the first three datasets, the results can be generalized to the rest of datasets. The results clearly indicate that our approach does not add an extra overhead to the anonymization process due to the clustering step. It can be noticed that our approach is competitive to other approaches. It achieves comparable or close results to other approaches as demonstrated in Fig. 22.
Efficiency of the anonymization process.
When measuring the time in second, our approach anonymizes all three datasets in less than 0.35 seconds. The results of other approaches are close to ours, with minor advantage to CLIFF approach. However this comes at the privacy and accuracy budget.
Results discussion
In this research we proposed, implemented, and tested an approach for preserving the privacy of software data when sharing it publicly. The work contributes to the field from two perspectives. First, we show that shared data can still be used to identify defects with an accuracy that is quite close to that achieved from the original data. Second, we introduced an approach for reducing noisy records that may affect the utility of the shared data. We also showed that removing such records significantly improves accuracy values after anonymization. In addition, we proved that our reduction approach has no effect on the achieved privacy. We compared our approach with several well-known approaches such as LACE2, CLIFF and MORPH and showed that our approach outperforms those techniques in terms of both privacy and utility.
Unlike the other techniques, our approach does not add random noise using the features of nearest neighbor instances of the opposite class. In fact, this significantly reduced the accuracy when the anonymized data was tested using approaches such as LACE and CLIFF. We believe that this increases the similarity between instances of different classes after anonymization. Accordingly, approaches such as CLIFF, LACE and MORPH result in more noise being introduced to the data.
In addition, approaches such as LACE remove instances that are similar to the existing ones. This negatively affects the utility of the resulting data, since it introduces similarity of instances from different classes, which misleads the classifier used to test the anonymized data. Privacy-wise, this also increases the number of unique instances that are not similar to any of the existing instances, which may boost the number and types of background knowledge attacks. Instead, our approach removes similar instances per class that mislead the classifiers and most of the time misclassified by many classifiers. Then to avoid background knowledge and mapping attacks, we increased the similarity after anonymization using a generalization-based anonymization approach and perturbed data using a well-known SVD to add noise per cluster and per class.
We used NASA public datasets to test our approach. We conducted several types of experiment, with the first set of experiments testing our approach from the privacy perspective. The following are the major findings related to privacy:
The data reduction approach does not affect the values of privacy. Whereas one may argue that discarding some instances may increase the probability of data breach, we showed that increasing the number of instances per cluster can control this. In fact, when increasing the number of instances per cluster we were able to reach analogous privacy values compared to data before reduction. Data reduction has no effects on diversity since we removed instances from both classes. In addition, we tried to remove instances that negatively affect utility not privacy. While our results prove that privacy values are affected by size of dataset, we showed that it is still possible to control this on small datasets by increasing the The results also indicate that there were no significant differences between attributes contribution to the overall privacy values.
The following are the major findings related to accuracy of the resulting data after anonymization
We showed that accuracy values after anonymization are comparable to those values before anonymization. We showed that data reduction significantly increases the TP rate, reduced the FP rate and eventually increases the accuracy values. We showed that compared to other approaches, our approach leads to higher accuracy values. We demonstrated this by feeding the same testing data to our approach and other approaches. This result is explained by randomization technique used by those approaches which adds noise by taking the most similar instances of the opposite class and uses the features of that instance in adding a random noise. Researchers have criticized this and showed that it is not reasonable to control the amount of noise without reducing the quality of the resulting dataset.
Algorithms implemented in this research can be mainly applied to data regarding software products’; therefore applying our approach to other domains is out of the scope of this work and could be considered for future work. We will focus on using our approach for defect prediction, which is considered a classification problem. Evaluating the proposed methods using other data mining algorithms such as clustering is beyond the scope of our work, which only focuses on privacy related issues.
Conclusions and future work
In this paper, we introduced an approach to perturb sensitive attributes of software defect data as it has been shown that such data is vulnerable to different types of attack when shared without anonymization. We showed that our anonymization approach produces data that inherits the statistical characteristics of the original data. Instead of applying simple randomization approaches, we applied a hybrid approach to reduce the size of the original dataset and then anonymize it. The main objective of this approach was to remove records that mostly affect the utility of the produced data. First, sensitive attributes were identified; we then identified then discarded records using two approaches, Tomek links and AllNN. Unlike the existing techniques, which only consider removing low ranked records, we focused on removing similar records that were close to the decision boundary and had different classes. Those records are most likely misclassified after anonymization. We showed that privacy values are not affected after anonymization. In addition, we showed a significant improvement in the accuracy value after data reduction. Finally, we compared the results of our experiments with several existing techniques and showed that our approach outperformed many of those approaches.
The proposed approach could be extended to address its limitations and enhance its effectiveness. First, it could cover not only numerical but also textual attributes. It has been shown that it is not only the code that is vulnerable to attacks but also other important documents such as requirement documents, thus it is important to extend our approach to consider textual features, for instance information about platforms, operating environments, and server types are all sensitive attributes that needs to be anonymized. However, as opposed to numerical features, textual features require approaches that preserve the semantic-based characteristics of the proposed approach.
Moreover, we plan to test our approach on newer datasets, other than the NASA dataset. We also plan to apply other techniques such as differential privacy for anonymization of software engineering data.
