Integration of cluster analysis and granular computing for imbalanced data classification: A case study on prostate cancer prognosis in Taiwan

Abstract

This paper proposes a particle swarm K-means optimization (PSKO)-based granular computing (GrC) model to preprocess skewed class distribution in order to enhance the classification accuracy for the class imbalance problem. The GrC model obtains knowledge from information granules rather than from numerical data. It also processes multi-dimensional and sparse data by using singular value decomposition and latent semantic indexing (LSI). The data possessing features of multiple dimensions and scarcity can be preprocessed using LSI in order to reduce the number of data dimensions as well as records. Ten benchmark data sets are employed to demonstrate the effectiveness of the proposed model. Experiment results indicate that the proposed model has better classification performance with both imbalanced and balanced data. In addition, the computational result for prostate cancer prognosis reveals that the proposed model really can support physicians in judging the condition of prostate cancer patients with a more accurate survival rate estimation.

Keywords

Prostate cancer; granular computing; particle swarm K-means optimization; class imbalance; classification

1 Introduction

When learning imbalanced or skewed data, in which almost all instances are labeled as one class, while a few instances are labeled as other classes, traditional data mining approaches such as neural networks (NN), decision trees (DT), and support vector machines (SVM) tend to produce high accuracy over the majority class but poor predictive accuracy over the minority class. However, this minority class is usually the important one, like medical diagnoses examples or abnormal products of finished-goods inspection data. A variety of methods have been proposed to cope with imbalanced data problems, such as methods of sampling [1, 2], adjusting the cost-matrices, and moving the decision thresholds [2]. However, these techniques have some disadvantages. For example, the computational load is increased and overtraining may occur owing to the replicated samples in the case of over-sampling. Under-sampling does not take into account all available training data, which results in a loss of available information.

Another approach to dealing with imbalanced data is the use of granular computing. Granular computing represents the data pattern in some subsets called information granules (IG). An IG is a group of objects which have similar functions and are indistinguishable. In imbalanced data, normal data or the majority data have more similar functions, while the minority class has a more unique condition. By constructing the IGs based on data similarity, the number of IGs in the majority class is reduced to less than those in the minority class. By considering the IGs, the proportion of the minority class can be increased. This addresses the imbalance condition and helps classifiers.

The main purpose of this paper is to develop a novel granular computing methodology for tackling imbalanced data. This proposed method extracts information from IGs for discrete or continuous data by employing a metaheuristic-based method and sub-attributes. Then, a singular value decomposition and Latent Semantic Indexing (LSI) are applied to process multi-dimensional and sparse data. These procedures help the classifier to classify the data.

Furthermore, the proposed algorithm is applied to prostate cancer data. Prostate cancer is one of the most common causes of death in men in most industrialized countries [3]. Many studies have addressed the diagnosis of prostate cancer, but only a few have examined prognosis methods. Some studies have evaluated the use of artificial NN to increase the prostate cancer detection rate and reduce unnecessary biopsies by using a neuro-fuzzy system based on both serum data (total prostate-specific antigen, tPSA; percent free PSA) and clinical data (age) to enhance the ability of tPSA to discriminate prostate cancer. However, none of these studies explores prognosis for prostate cancer. In addition, these methods cannot handle data with imbalanced characteristics. Therefore, this paper proposes an algorithm that can perform better with imbalanced datasets like prostate cancer data.

In this paper, the proposed algorithms are compared with the cluster-based GrC model and the improved cluster-based GrC model of the IG-based method. Finally, real medical inspection data of prostate cancer in Taiwan is used to evaluate the effectiveness of the proposed method.

The remainder of this paper is arranged as follows. Section two presents a survey of literature related to this paper. Section three proposes the developed model, while the experimental results are provided in Section four. Section five shows the case study for the prostate cancer prognosis system. Finally, concluding remarks are given in Section six.

2 Literature survey

This section briefly discusses the related background of this paper, including prostate cancer, classification, granular computing, class imbalance problems and the particle swarm optimization algorithm.

2.1 Prostate cancer

Prostate cancer is a significant cause of morbidity and mortality in Western countries. According to autopsy data, approximately 42% of men with prostate cancer over the age of 50 die of other causes [4]. In the United States, prostate cancer is the most common cancer and the second most common cause of cancer-related death [5]. In 2011, an estimated 240,890 new cases and 33,720 deaths caused by prostate cancer were recorded. Similar data have been reported in Europe and Canada. In Taiwan, the morbidity of prostate cancer is one fifth of male cancer patients, and the mortality is one seventh. According to the data published by the Department of Health, Executive Yuan, there were 3,603 cases and 1,052 deaths caused by prostate cancer in 2008, and the mean age of prostate cancer patients was 75. This indicates that older men have a higher risk of developing prostate cancer. Miller et al. [6] identified 24,405 men with lower-risk prostate cancer using data from 13 Surveillance, and discovered that 55% of these men were potentially overtreated in an appropriate initial expectant management. The survival rate of patients after surgical treatment has been previously evaluated using competing-risk analysis. However, the diagnosis of prostate cancer (PCa) does not assess the combined effect of age and comorbidities in patients with high-risk PCa.

In the initial screening process, doctors usually use digital rectal examination (DRE) and test for prostate specific antigen (PSA). If the DRE or PSA result is abnormal, the doctor will advise the patient to undergo biopsy testing. If adenocarcinoma of the prostate is confirmed by microscopic examination of the biopsy specimen, the next treatment will be decided. Treatment options include definitive, curative, systemic, palliative, or salvage therapy.

Due to the high number of prostate cancer patients, many studies have tried to develop an early diagnosis system without biopsy testing. Proposed expert systems have been developed based on patient data including weight, height, body mass index (BMI), prostate-specific antigen (PSA), Free PSA, age, prostate volume, density, smoking, systolic, diastolic, pulse, and Gleason score [7]. Classification or forecasting algorithms were then proposed to analyze these data. Keles et al. [3] proposed a neuron-fuzzy classification algorithm (NEFCLASS). This algorithm aims to determine whether a patient has prostate cancer or benign prostatic hyperplasia (BPH). Chen and Lin [8] focused on identifying principal genes and then used these genes to classify cancers either by classifiers such as support vector machines (SVM) or back-propagation neural networking (BPNN) to extract significant samples. Furthermore, an artificial neural network algorithm was also proposed [9]. Although these algorithms do not give an absolutely correct diagnosis, their results can provide more information for doctors before they undertake further medical treatment.

2.2 Classification

Classification aims to build a model of a class label by training on a data set such that the model can be used to classify new data whose class labels are unknown [10]. In recent years, many approaches such as artificial intelligence, neural networks, rough sets, fuzzy sets, and many others have been applied to classification [11]. Classification problems have numerous practical applications. For instance, patients can be classified into disease groups on the basis of their symptoms for medical diagnosis [12–15]. Through examination of the physical characteristics of objects or individuals, they can be classified into appropriate classes [16, 17]. Letter recognition is also one of the best examples in this field. Moreover, in the field of human resources management, enterprises use classification tools to assign personnel to appropriate occupation groups according to their qualifications [18]; production systems management and technical diagnosis monitor the operation of complex production systems for fault diagnosis purposes [19]; marketing research is aimed at customer satisfaction measurement, analysis of the characteristics of different groups of customers, development of market penetration strategies, etc. [20]; and the field of financial management and economics is mainly focused on business failure prediction, credit risk assessment for firms and consumers, stock evaluation and classification, country risk assessment, bond rating, etc. [11]. Other research areas, such as environmental and energy management and ecology, analyze and measure the environmental impact of different energy policies and investigate the efficiency of energy policies at a national level [21].

2.3 Class imbalance problems

Imbalanced data can be found in many applications, and a number of approaches have been proposed to cope with imbalanced data sets. A basic approach is sampling. Sampling methods reduce the imbalance in the data set by removing (down-sampling) instances from the majority class or duplicating (up-sampling) instances from the minority class until the majority and minority classes are balanced [1, 2]. The second approach is to adjust the cost (weight) of each class [1]. The third approach adapts the decision threshold to impose a bias toward the minority class. These approaches have been applied in many studies; however, they also have some drawbacks. For instance, replicating instances increases the computational load, while removing instances may discard important information in the data.
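The two basic sampling strategies can be illustrated with a minimal sketch. It is not the procedure used in [1, 2]; the function name `random_resample`, the toy data, and the choice to balance every class to the smallest (or largest) class size are assumptions made for illustration only.

```python
import random
from collections import Counter

def random_resample(X, y, mode="under", seed=0):
    """Balance a labeled dataset by random down- or up-sampling.

    mode="under": shrink every class to the size of the smallest class.
    mode="over":  grow every class to the size of the largest class by
                  duplicating randomly chosen instances.
    """
    rng = random.Random(seed)
    by_class = {}
    for features, label in zip(X, y):
        by_class.setdefault(label, []).append(features)

    sizes = [len(rows) for rows in by_class.values()]
    target = min(sizes) if mode == "under" else max(sizes)

    X_out, y_out = [], []
    for label, rows in by_class.items():
        if len(rows) > target:
            # Down-sampling: keep a random subset, discarding the rest.
            chosen = rng.sample(rows, target)
        else:
            # Up-sampling: append random duplicates until the target is met
            # (no duplicates are added when the class is already at target).
            extra = [rng.choice(rows) for _ in range(target - len(rows))]
            chosen = rows + extra
        X_out.extend(chosen)
        y_out.extend([label] * target)
    return X_out, y_out

# Toy data (hypothetical): 8 majority instances and 2 minority instances.
X = [[i] for i in range(10)]
y = [0] * 8 + [1] * 2
X_under, y_under = random_resample(X, y, mode="under")
X_over, y_over = random_resample(X, y, mode="over")
print(Counter(y_under), Counter(y_over))
```

The sketch makes both drawbacks visible: down-sampling discards six of the eight majority instances, while up-sampling doubles the majority-sized class count with exact copies of only two distinct minority instances.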

Further studies have improved these basic approaches. Sáez et al. [22] improved the sampling method by combining an iterative ensemble-based noise filter with Synthetic Minority Over-sampling Techniques (SMOTE-IPF). Krawczyk et al. [23] employed evolutionary under-sampling with a boosting method to improve the sampling method. The proposed algorithm was also applied to solving the classification of cancer data. Further extensions of the sampling method were studied by Charte et al. [24]. Ramentol et al. [25] applied a fuzzy rough set theory to develop a fuzzy-rough ordered weighted average approach for imbalanced classification. Embedding fuzzy theory in classification algorithms has also been conducted in some previous researches [26]. In addition, some improvements were made based on Granular Computing (GrC) [27].

2.4 Granular computing

Granular computing (GrC) is an innovative information processing computing model. It is a collective term referring to theories, methodologies, techniques, and tools for the analysis of information granules (IGs), e.g., groups, classes, intervals or clusters, encountered in problem solving [28–31]. Generally speaking, GrC is the processing of complex information entities, called information granules, which arise in the process of data abstraction and derivation of knowledge from information. The idea of information granularity has been explored in a number of fields such as rough sets, fuzzy sets, cluster analysis, databases, machine learning and data mining [32].

The main issues in granular computing are how to construct the IGs and to describe IGs [28]. The process of constructing IGs was first proposed by Zadeh [30]. In order to construct IGs more efficiently and more feasibly, some approaches such as the Self Organizing Map (SOM) network, Fuzzy C-means (FCM), rough sets, and Fuzzy Adaptive Resonance Theory (ART) have been proposed [28, 32]. They aim to divide IGs into different levels of granularity [28, 32]. Level of granularity is correlated to the data variants. More detailed information requires smaller IGs. In the process of representing IGs and determining the level of granularity, Bargiela and Pedrycz [32] proposed a “hyperbox” and “inclusion and compatibility” to measure IGs. The Granular computing model copies the human instinct in information processing. Thus, it can improve classification performance in imbalanced data [33]. In classifying a dataset, IGs represent a collection of objects arranged based on their similarity, functional adjacency and indistinguishability [34].

2.5 Particle swarm optimization

Particle swarm optimization (PSO) is a modern evolutionary algorithm based on swarm behavior [35]. The algorithm forages for the optimal solution by moving its particles in the solution space [36]. Compared with other artificial intelligence techniques such as genetic algorithm (GA), tabu search (TS), or simulated annealing (SA), PSO is much faster [36]. Kennedy [35] investigated the performance of particle swarm optimization incorporating various neighborhood topologies on different types of problems. The results showed that PSO performance is significantly influenced by the parameter settings. In some cases it is also trapped in local optima [37]. However, it can still obtain high-quality solutions within a shorter calculation time, and with more stable convergence characteristics, than other stochastic methods [37, 38].
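For readers unfamiliar with PSO, the following is a minimal sketch of the widely used global-best variant with an inertia weight. It is a generic textbook form and not necessarily the variant used in this paper; the function name `pso_minimize`, the sphere test function, and the default values for `w`, `c1`, and `c2` are illustrative assumptions.

```python
import random

def pso_minimize(f, dim, n_particles=30, iters=100, w=0.7, c1=1.5, c2=1.5,
                 lo=-5.0, hi=5.0, seed=0):
    """Global-best PSO: w is the inertia weight, c1 and c2 are the
    cognitive and social learning factors (all values are illustrative)."""
    rng = random.Random(seed)
    pos = [[rng.uniform(lo, hi) for _ in range(dim)] for _ in range(n_particles)]
    vel = [[0.0] * dim for _ in range(n_particles)]
    pbest = [p[:] for p in pos]              # best position seen by each particle
    pbest_val = [f(p) for p in pos]
    g = min(range(n_particles), key=lambda k: pbest_val[k])
    gbest, gbest_val = pbest[g][:], pbest_val[g]   # best position seen by the swarm

    for _ in range(iters):
        for k in range(n_particles):
            for d in range(dim):
                r1, r2 = rng.random(), rng.random()
                # Velocity = inertia + pull toward personal best + pull toward global best.
                vel[k][d] = (w * vel[k][d]
                             + c1 * r1 * (pbest[k][d] - pos[k][d])
                             + c2 * r2 * (gbest[d] - pos[k][d]))
                pos[k][d] += vel[k][d]
            val = f(pos[k])
            if val < pbest_val[k]:
                pbest[k], pbest_val[k] = pos[k][:], val
                if val < gbest_val:
                    gbest, gbest_val = pos[k][:], val
    return gbest, gbest_val

def sphere(x):
    """Simple convex test function with its minimum of 0 at the origin."""
    return sum(v * v for v in x)

best_pos, best_val = pso_minimize(sphere, dim=2)
print(best_pos, best_val)
```

The three terms of the velocity update show why the parameter settings matter: the inertia weight controls how far particles keep exploring, while the two learning factors control how strongly they are drawn back to known good positions.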

2.6 Latent semantic indexing

High dimensional data consists of a large number of features which usually contain redundant and irrelevant information. Therefore, some dimensional reduction techniques have been proposed to reduce the dimensions without changing the data pattern. Latent Semantic Indexing (LSI) is one information retrieval method that can automatically model term-term inter-relationships to improve the retrieval outcome. LSI examines the similarity of the “contexts” in which words appear, and creates a reduced-dimension feature-space where words that occur in similar contexts are near each other. LSI applies a method, singular value decomposition (SVD), from linear algebra, to discover important associative relationships.
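The dimension reduction performed by LSI can be written compactly in standard SVD notation. The symbols below ($A$, $U$, $\Sigma$, $V$, $k$, $d$) are the usual textbook ones and are not taken from this paper; $A$ is assumed to be a term-by-document matrix with documents as columns.

```latex
% Full SVD of the term-by-document matrix A
A = U \Sigma V^{T}

% Rank-k truncation: keep only the k largest singular values
A_k = U_k \Sigma_k V_k^{T}, \qquad k < \operatorname{rank}(A)

% A document column vector d is mapped to the reduced k-dimensional space by
\hat{d} = \Sigma_k^{-1} U_k^{T} d
```

By the Eckart-Young theorem, $A_k$ is the best rank-$k$ approximation of $A$ in the Frobenius norm, which is why the truncation reduces the dimensions while preserving the dominant data pattern.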

3 Methodology

This section presents the proposed algorithm in four parts: (1) data preprocessing, (2) granular construction, (3) feature extraction and knowledge discovery, and (4) classification for imbalanced data. The first part, data preprocessing, includes the missing value processing and data normalization. The majority class data is then reduced using granular construction in the second part. The output of this part is a balanced dataset. Dimension reduction and feature extraction are then implemented. In this paper, the dimension reduction is performed by LSI algorithm. The last part is the data classification. Figure 1 illustrates the proposed algorithm.

A concise procedure is explained as follows:

Data preprocessing

Data collection

Select benchmark data sets with unbalanced characteristics and collect cogent prostate cancer data.

Data preprocessing

Delete missing values and implement normalization for raw data.

Granular construction

Granularity selection criteria

Determine the thresholds of H-index and U-ratio.

Determine the level of granularity for IGs.

The number of IGs is determined by H-index as well as U-ratio.

Execute granular construction.

The construction of IGs is achieved by clustering techniques.

Compute H-index and U-ratio of IGs.

The H-index measures the consistency of the class within each IG and is defined as follows: $\text{H-index} = \frac{1}{i} \sum_{j=1}^{i} \frac{n_{j}}{m_{j}}$ (1)

where i is the number of all IGs, m_j represents the number of all objects in granule j, and n_j is the number of objects in granule j possessing the majority class.

U-ratio is the proportion of undistinguishable granules to all IGs. In this paper, U-ratio is defined as follows: $\text{U-ratio} = u / i$ (2)

where u represents the number of undistinguishable granules and i represents the quantity of all IGs.

Check whether the criteria are satisfied or not

If the H-index is larger than or equal to the threshold of H-index and U-ratio is smaller than or equal to the threshold of U-ratio, the answer is “Yes.” Go to Step 6. Otherwise the answer is “No.” Repeat Steps 5 – 7 until criteria are satisfied.

Rewrite attributes

Divide value interval of attributes into overlapping and non-overlapping areas and sub-attributes.

Feature extraction and knowledge acquisition

Analysis of sub-attributes

Implement singular value decomposition (SVD) in transforming the original feature space to a smaller feature space in order to reduce the dimensionality.

Feature extraction

Determine the optimal number of features by evaluating efficiency and accuracy.

Sub-attribute reduction

Reduce the number of dimensions of the sub-attributes to the optimal number.

Classification

Implement classifiers and calculate the classification accuracy.

Validation

Validate the classification performance. If the performance is acceptable, terminate the procedure. Otherwise, repeat Steps 9 – 13.

Classification

Apply classification method, such as neural network or decision tree, to classify the data after granularity.
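The H-index and U-ratio check in the granular construction steps above can be sketched as follows. Two points are assumptions rather than statements from the source: the "majority class" is taken to be the most frequent class inside each granule, and a granule is treated as undistinguishable when its two most frequent classes are tied. The function names, the toy granules, and the threshold values are hypothetical.

```python
from collections import Counter

def h_index(granules):
    """Average, over all granules, of n/m: the share of objects in a granule
    that belong to that granule's most frequent class (assumption)."""
    purities = []
    for labels in granules:
        m = len(labels)                              # objects in the granule
        n = Counter(labels).most_common(1)[0][1]     # objects in its top class
        purities.append(n / m)
    return sum(purities) / len(granules)

def u_ratio(granules):
    """Share of granules whose top two classes are tied, used here as a
    stand-in for 'undistinguishable' granules (assumption)."""
    u = 0
    for labels in granules:
        counts = sorted(Counter(labels).values(), reverse=True)
        if len(counts) > 1 and counts[0] == counts[1]:
            u += 1
    return u / len(granules)

def criteria_satisfied(granules, h_threshold, u_threshold):
    """Accept the granularity level when the H-index is high enough and the
    U-ratio is low enough."""
    return h_index(granules) >= h_threshold and u_ratio(granules) <= u_threshold

# Each granule is given as the list of class labels of its member objects.
granules = [[0, 0, 0, 0], [0, 0, 1], [1, 1], [0, 1]]
print(h_index(granules), u_ratio(granules))
print(criteria_satisfied(granules, h_threshold=0.75, u_threshold=0.3))
```

If the check fails, the construction step is repeated at a different level of granularity until both thresholds are met.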

3.1 Construction of information granules

In this paper, the IG construction process is conducted using a data mining-based approach proposed by Chen [39]. In order to obtain a better result, in this paper, Chen’s original algorithm is combined with a particle swarm K-means optimization (PSKO) algorithm to build IGs. The PSKO algorithm is an improvement of the K-means algorithm [40].

3.2 Selection of granularity

The level of granularity in the proposed algorithm is determined based on the H-index and U-ratio [39]. During IG construction using the PSKO algorithm, similar IGs are grouped in a single layer. The similarity is calculated using Euclidean distance, and patterns with the same degree of similarity are assigned to the same cluster. After the clusters are built, the H-index and U-ratio are calculated based on the clustering result. The level of granularity is then adjusted by the PSKO algorithm until the H-index and U-ratio thresholds are satisfied.

3.3 Representation of information granules

This paper employs the concept of sub-attributes, called hyperboxes, to represent IGs. Let a hyperbox [b] defined in Rⁿ be fully described by lower bound (b–) and upper bound (b+). The set of all points in the n-dimensional space is an important and frequently used universal set. This set is represented as Rⁿ. Through b– and b+, the hyperbox can be expressed as [b] = [b–, b+]. Part 1 in Fig. 2 gives an illustrative example to express the implementation procedure of sub-attributes.

In Fig. 2, there are two IGs, A and B, which have only one attribute, X_i. In sub-attributes, IGs are represented by the lower and upper limits of the objects. The IGs A and B can be described as [a–, a+] and [b–, b+], respectively. Part 2 in Fig. 2 shows that granules A and B overlap, which makes them difficult to handle with knowledge acquisition tools: most knowledge acquisition algorithms are designed to deal with numeric attributes, so data mining cannot discover knowledge directly from these interval-valued IGs. In this paper, this problem is tackled by “sub-attributes”, which divide the value interval of attributes into overlapping and non-overlapping areas. Next, a Boolean variable, 0 or 1, is used to represent whether the IG contains these intervals or not.
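One way to read the sub-attribute idea is sketched below for a single attribute. This is an interpretation, not the authors' implementation: the function name `sub_attributes` and the numeric bounds for granules A and B are hypothetical, and the breakpoints are simply the sorted lower and upper bounds of all granules.

```python
def sub_attributes(granule_bounds):
    """Split one attribute's range at every granule bound and encode each
    granule as a 0/1 vector over the resulting sub-intervals.

    granule_bounds: dict mapping granule name -> (lower, upper).
    """
    # Every lower or upper bound becomes a breakpoint on the attribute axis.
    points = sorted({p for bounds in granule_bounds.values() for p in bounds})
    intervals = list(zip(points[:-1], points[1:]))

    encoding = {}
    for name, (lower, upper) in granule_bounds.items():
        # 1 if the sub-interval lies entirely inside the granule, else 0.
        encoding[name] = [1 if lower <= left and right <= upper else 0
                          for left, right in intervals]
    return intervals, encoding

# Hypothetical bounds: A = [1, 5] and B = [3, 8] overlap on [3, 5].
intervals, encoding = sub_attributes({"A": (1.0, 5.0), "B": (3.0, 8.0)})
print(intervals)   # [(1.0, 3.0), (3.0, 5.0), (5.0, 8.0)]
print(encoding)    # {'A': [1, 1, 0], 'B': [0, 1, 1]}
```

The middle sub-interval is the overlapping area shared by both granules, and the outer two are the non-overlapping areas. The resulting Boolean vectors are numeric inputs that ordinary knowledge acquisition tools can process.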

4 Experimental results and analysis

The proposed PSKO-based GrC model is verified using both balanced and imbalanced datasets. Table 1 lists all the tested datasets. By applying the balanced datasets, the proposed algorithm can be verified for its capability in extracting data from a variety of datasets, such as a large amount of data, data with multiple categories or high-dimensional data. On the other hand, testing using imbalanced dataset aims to verify the performance of the proposed algorithm in extracting data from a variety of skewed datasets.

The validation process is conducted using 10-fold cross validation. Before implementing the algorithm, all benchmark data sets are preprocessed, and in each fold the data are divided into 90% training and 10% testing sets. For each 10-fold experiment, 30 replications are executed.
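The validation protocol can be sketched as repeated 10-fold cross validation over sample indices. This is a generic outline rather than the authors' code: the function names, the use of the replication number as the shuffling seed, and the placeholder `dummy_evaluate` (which stands in for training and scoring a real classifier) are assumptions.

```python
import random

def k_fold_indices(n_samples, k=10, seed=0):
    """Yield (train, test) index lists for one shuffled k-fold split."""
    rng = random.Random(seed)
    indices = list(range(n_samples))
    rng.shuffle(indices)
    folds = [indices[f::k] for f in range(k)]        # k near-equal folds
    for f in range(k):
        test = folds[f]
        train = [idx for g in range(k) if g != f for idx in folds[g]]
        yield train, test

def repeated_cv(n_samples, evaluate, k=10, replications=30):
    """Run k-fold cross validation `replications` times with different
    shuffles and return the mean fold score of each replication."""
    scores = []
    for rep in range(replications):
        fold_scores = [evaluate(train, test)
                       for train, test in k_fold_indices(n_samples, k, seed=rep)]
        scores.append(sum(fold_scores) / k)
    return scores

def dummy_evaluate(train, test):
    """Placeholder score: the fraction of samples held out for testing.
    A real run would fit a classifier on `train` and score it on `test`."""
    return len(test) / (len(train) + len(test))

scores = repeated_cv(100, dummy_evaluate, k=10, replications=30)
print(len(scores), scores[0])
```

With 100 samples, each fold holds out 10 samples for testing and keeps 90 for training, matching the 90%/10% division described above.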

Furthermore, the fourth part, which is the classification part, is conducted by three different classifiers, namely a feed-forward neural network with back-propagation learning algorithm (BPN), a decision tree (C4.5), and a support vector machine (SVM), since they are the most widely applied classification methods. Finally, computational results obtained by the proposed PSKO-based GrC with classifiers are compared with PSKO-based GrC without classifiers, K-means-based GrC, and the numerical computing model [36]. All of these algorithms are built in C language, while the classifiers are programmed via WEKA in Windows 7 using a Core 2 Quad 2.50 GHz CPU with 4 GB of RAM.

4.1 Parameter setting

The proposed algorithm involves some parameters. Different parameter settings may give different results. In this paper, the parameter setting is determined using the Taguchi method. The parameters evaluated using the Taguchi method are the number of particles, learning factors c₁ and c₂, and inertia weight. Let each parameter be treated as a factor in the Taguchi method, while each factor has three levels. Table 2 shows the tested level for each factor.

In order to reduce computation time, each parameter combination is run using 150 iterations and 10 repetitions. The Taguchi analysis is conducted using MINITAB. The best parameter setting according to the Taguchi results is: 40 particles, 1.47 c₁, 0.5 1.47 c₂, and 0.5 inertia weight.

In this paper, the proposed algorithm is also compared with a basic classification algorithm called back propagation neural network (BPN) without GrC. This comparison is made in order to evaluate the efficacy of using GrC in classification. The parameter setting for the BPN algorithm is 0.2 learning rate, 0.8 momentum and 5000 iterations. The network structure for each dataset is listed in Table 3. This parameter setting follows the parameter setting in [39]. For some datasets which are not evaluated in [39], the network structure is subjectively determined based on the dataset.

4.2 Computational results

The computational results are summarized in Tables 4–7. The results obtained by BPN are used to analyze the effectiveness of using granular computing in the classification problem. The BPN results for balanced datasets listed in Table 4 are no better than the results of the algorithms using granular computing. The same holds for the imbalanced datasets, where the GrC-based algorithms also outperform BPN. This suggests that IGs help classifiers to obtain more information to distinguish different classes.

Further comparison is made within the GrC-based algorithms. Table 5 summarizes the computational results for balanced datasets. According to this result, the GrC-based algorithm using BPN as the classifier is relatively better than those with C4.5 and SVM as the classifier. The results also show that, among PSKO-based GrC with classified data, PSKO-based GrC without classified data, K-means-based GrC and the numerical method, PSKO-based GrC with classified data obtains better results than the other algorithms. However, if the algorithms use C4.5 or SVM as the classifier, K-means-based GrC and PSKO-based GrC without classified data have relatively better performance.

For the imbalanced datasets, using BPN as the classifier is also better than using C4.5 and SVM as the classifier. This is shown by the results summarized in Table 7. Compared with C4.5 and SVM, the proposed algorithm using BPN as the classifier can obtain higher accuracy for both training and testing.

The comparison between algorithms using BPN as the classifier for imbalanced data indicates that PSKO-based GrC, both with and without classified data, is better than K-means-based GrC and the numerical method. However, when C4.5 and SVM are used as the classifiers, K-means-based GrC can obtain relatively better results.

4.3 Statistical hypothesis

The computational results for balanced datasets indicate that PSKO-based GrC with classified data performs relatively better. Therefore, non-parametric statistical tests are conducted to determine whether PSKO-based GrC with classified data differs significantly from the other algorithms. Herein, the hypotheses are as follows: $\begin{matrix} H_{0} : μ_{PSKO GrC *} = μ_{i} \\ H_{1} : μ_{PSKO GrC *} \neq μ_{i} \end{matrix}$

where μ_PSKO GrC* is the mean of PSKO-based GrC with classified data using the BPN classifier, and μ_i is the mean of each of the other algorithms. The statistical test is conducted using SPSS. Table 8 summarizes the p-values at the 95% confidence level. This shows that the results obtained by PSKO-based GrC with classified data do not significantly differ from those obtained by PSKO-based GrC and K-means-based GrC for some datasets. However, compared with other numerical algorithms using the BPN classifier and algorithms using C4.5 and SVM classifiers, the results are significantly different.

Furthermore, for the imbalanced dataset, the algorithms using BPN as the classifier have better results than those using C4.5 and SVM. However, the performances of algorithms using BPN are relatively similar. Therefore, in order to analyze the differences between algorithms using BPN classification, a non-parametric statistical test is conducted with the following hypotheses:

Test 1 $\begin{matrix} H_{0} : μ_{PSKO GrC *} = μ_{PSKO GrC} \\ H_{1} : μ_{PSKO GrC *} \neq μ_{PSKO GrC} \end{matrix}$

Test 2 $\begin{matrix} H_{0} : μ_{PSKO GrC *} = μ_{Kmeans GrC} \\ H_{1} : μ_{PSKO GrC *} \neq μ_{Kmeans GrC} \end{matrix}$

Test 3 $\begin{matrix} H_{0} : μ_{PSKO GrC} = μ_{Kmeans GrC} \\ H_{1} : μ_{PSKO GrC} \neq μ_{Kmeans GrC} \end{matrix}$

where μ_PSKO GrC and μ_Kmeans GrC are the means of PSKO-based GrC without classified data and K-means-based GrC, respectively. All of these use BPN as the classifier. Table 9 summarizes the p-values of the statistical tests. It shows that PSKO-based GrC with classified data differs significantly from K-means-based GrC for most of the datasets. However, PSKO-based GrC with and without classified data are not significantly different, except for the BSWD and PIMA datasets. The results obtained by PSKO-based GrC without classified data and K-means-based GrC are also significantly different for the MSWD, car evaluation and Glass datasets.

5 Model evaluation results and discussion

The computational results presented in Section 4 prove that the proposed classification using granular computation with PSKO algorithm in constructing IGs has a better classification result. Therefore, in this section, the proposed algorithm is applied to real cancer prognosis data.

5.1 Prognosis data overview

The prostate cancer data were collected from a well-known teaching-oriented hospital located in Taipei. The dataset consists of 176 records, which are divided into two classes, 1 and 0. Class 1 indicates that the patient died of prostate cancer. There are 61 patients in this class, and each record is assigned to category 1, 2, 3 or 4 according to the patient's survival length. Class 0 indicates that the patient is still alive after more than 5 years. There are 115 patients in this class, and they are noted as category 5. Figure 3 shows the class distribution.

The original data collected from the hospital contains many features since it includes patient biographies and medical records. This paper does not include all of these features in the calculation. Therefore, two feature selection steps are applied. In the first step, the important features are chosen by an expert (doctor) in prostate cancer. From this step, there are six chosen features as listed in Table 10. In the second step, the feature selection is conducted using stepwise regression. As a result, there are three models suggested by the stepwise regression. Table 11 shows all the suggested models. Herein, the last model with three features is chosen for further evaluation using the proposed algorithm. These three features are biopsy Gleason score, initial prostate specific antigen (iPSA), and digital rectal examination (DRE).

5.2 Computational results

The processed prostate cancer data is now evaluated using the proposed algorithm. In this experiment, the parameter setting for the proposed algorithms is also determined using the Taguchi method. Herein, the tested parameter settings are the same as those in Section 4. The best parameter setting for the real dataset based on the Taguchi method is 40 particles, with c₁, c₂, and inertia weight set to 1.47, 0.5, and 0.5, respectively. For the BPN algorithm, the learning rate is 0.2, the momentum is 0.8, and 5000 iterations are run. The network structure is 3-7-5.

Table 12 shows the BPN result. Classification using the BPN algorithm without GrC yields 67.11% accuracy for training and 59.33% accuracy for testing data. The proposed algorithms using three different classifiers, BPN, C4.5 and SVM, are then applied to evaluate the prostate cancer data. There are 30 replications for each 10-fold validation of each tested algorithm. The results are summarized in Table 13. In general, PSKO-based GrC algorithms perform better than K-means-based GrC and numerical methods. The results also show that although the PSKO-based GrC algorithm without classified data can obtain better accuracy for the training set, its accuracy for the testing set is significantly lower. In contrast, the gap between training and testing accuracy obtained by PSKO-based GrC with classified data is not as large as the gap for PSKO-based GrC without classified data. This indicates that PSKO-based GrC without classified data might have an overfitting problem, as seen, for instance, in the result obtained by PSKO-based GrC with C4.5. Therefore, this paper prefers the result obtained by PSKO-based GrC with the BPN classifier as the best result. Although it does not yield the best training accuracy, it obtains the best testing accuracy. In addition, the gap between testing and training accuracy is quite small.

5.3 Statistical analysis

In order to analyze the differences between the algorithms, a statistical test is conducted. Herein, a non-parametric Mann-Whitney test is applied to each pair of algorithms. The hypotheses are as follows: $\begin{matrix} H_{0} : μ_{i} = μ_{j} \\ H_{1} : μ_{i} \neq μ_{j} \end{matrix}$ where μ_i and μ_j are the means of the accuracy obtained by algorithms i and j, and i ≠ j. Table 14 summarizes the test results. It reveals that the algorithms using the BPN classifier differ significantly from those using other classifiers. Among algorithms using the BPN classifier, the K-means-based GrC algorithm has significantly different results from PSKO-based GrC with or without classified data.
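For reference, the rank-based statistic behind the Mann-Whitney test can be computed as in the sketch below. It returns only the U statistics and does not compute a p-value, so it does not reproduce the results in Table 14; the function name and the toy accuracy values are hypothetical.

```python
def mann_whitney_u(x, y):
    """Return (U_x, U_y) for two samples, using average ranks for ties."""
    combined = sorted([(v, 0) for v in x] + [(v, 1) for v in y])
    ranks = [0.0] * len(combined)
    i = 0
    while i < len(combined):
        # Find the run of tied values starting at position i.
        j = i
        while j + 1 < len(combined) and combined[j + 1][0] == combined[i][0]:
            j += 1
        avg_rank = (i + j) / 2 + 1        # mean of the 1-based ranks i+1..j+1
        for t in range(i, j + 1):
            ranks[t] = avg_rank
        i = j + 1

    # Rank sum of the first sample, then the standard U formulas.
    rank_sum_x = sum(r for r, (_, group) in zip(ranks, combined) if group == 0)
    u_x = rank_sum_x - len(x) * (len(x) + 1) / 2
    u_y = len(x) * len(y) - u_x
    return u_x, u_y

# Toy accuracy samples for two hypothetical algorithms.
acc_a = [0.61, 0.63, 0.65]
acc_b = [0.70, 0.72, 0.74]
print(mann_whitney_u(acc_a, acc_b))   # (0.0, 9.0): the samples do not overlap
```

A U value of 0 for one sample means every value in it ranks below every value in the other sample; in practice the p-value is then obtained from statistical software.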

6 Conclusions

This paper proposes novel classification algorithms for imbalanced datasets. The proposed algorithms improve the sampling method with granular computing (GrC). The GrC groups similar objects into clusters called information granules (IGs). Since “normal” instances are more similar to one another than minority-class instances are, the number of IGs for the majority class will be smaller than that for the minority class. Thus, by considering the IGs, classifier algorithms can recognize data patterns more easily. In constructing the IGs, this paper employs the PSKO algorithm. After the IGs are constructed, dimensionality reduction using LSI is applied to eliminate unnecessary information in the data and reduce the computational load. Then, three different classifiers, BPN, C4.5 and SVM, are employed to classify the processed data.
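
The effect of granulation on class balance can be illustrated with a small sketch. It does not implement PSKO: a much simpler leader clustering is swapped in as a stand-in, and the radius, the data and the function names are all assumptions made for illustration. The point it shows is only that tightly grouped majority records collapse into a few granules, while dispersed minority records keep many.

```python
import math

def leader_granules(points, radius):
    """Group points into granules: each point joins the first granule whose
    leader lies within `radius`; otherwise it starts a new granule."""
    granules = []  # list of (leader, members)
    for p in points:
        for leader, members in granules:
            if math.dist(p, leader) <= radius:
                members.append(p)
                break
        else:
            granules.append((p, [p]))
    return granules

def centroid(members):
    """Describe a granule by the mean of its members."""
    return tuple(sum(c) / len(members) for c in zip(*members))

# Made-up data: 90 majority records in three tight groups, 10 dispersed minority records.
majority = [(c + 0.01 * i, 0.0) for c in (0.0, 10.0, 20.0) for i in range(30)]
minority = [(3.0 * i, 8.0) for i in range(10)]

# Granulate each class separately so that granules never mix labels.
maj_granules = leader_granules(majority, radius=1.0)
min_granules = leader_granules(minority, radius=1.0)

ratio_before = len(majority) / len(minority)          # records: majority per minority
ratio_after = len(maj_granules) / len(min_granules)   # granules: majority per minority

# The classifier would then be trained on granule descriptions, not raw records.
training_set = [(centroid(m), 0) for _, m in maj_granules] + \
               [(centroid(m), 1) for _, m in min_granules]
print(ratio_before, ratio_after, len(training_set))
```

In the proposed algorithms, the granules are built by PSKO and then reduced with LSI before classification; the sketch covers only the granulation step.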

The proposed algorithms were verified using balanced and imbalanced datasets. The results show that, in general, the proposed PSKO-based GrC algorithms with and without classified data using the BPN classifier have better results than the other algorithms. They also reveal that the proposed algorithms with BPN as the classifier perform better than those using C4.5 and SVM. The comparison between algorithms with and without GrC also shows that GrC with IGs improves the classification algorithm, since it can provide more information about the data patterns. The IGs constructed by the PSKO algorithm help to reduce the imbalance in the dataset.

Furthermore, the proposed algorithm is applied to prostate cancer data, which cover different patient conditions. The aim of this classification is to provide early detection of prostate cancer based on patients’ medical records. Therefore, finding potential prostate cancer cases, which make up the smaller portion of the patient dataset, is very important. This paper applies the proposed imbalanced data classification to analyze this dataset. The experimental results show that the proposed algorithm achieves high accuracy, greater than 70%. They also show that the best algorithm is PSKO-based GrC without classified data using BPN and C4.5 as the classifiers.

In the future, other soft computing techniques should be integrated into granular computing in order to extract more representative granular information. In order to construct a better IG set, other granular selection techniques should be evaluated, and different criteria for granularity selection should be tested. The proposed algorithm could also be applied to other prognostic problems, such as liver cancer, which is very common in Taiwan. In addition, it is necessary to find criteria that are more strongly correlated with prognosis problems.

Acknowledgments

This paper is partially supported by the Ministry of Science and Technology of Taiwan under contract number NSC102-2410-H-011-017-MY3. This support is much appreciated.

References

1. Batista G.E., Prati R.C. and Monard M.C., A study of the behavior of several methods for balancing machine learning training data, ACM SIGKDD Explorations Newsletter 6 (2004), 20–29.

2. Chawla N.V., Bowyer K.W., Hall L.O. and Kegelmeyer W.P., SMOTE: Synthetic minority over-sampling technique, Journal of Artificial Intelligence Research (2002), 321–357.

3. Keles, Hasiloglu A.S., Keles and Aksoy, Neuro-fuzzy classification of prostate cancer using NEFCLASS-J, Computers in Biology and Medicine 37 (2007), 1617–1628.

4. Sakr, Grignon D.J., Crissman, Heilbrun, Cassin, Pontes and Haas, High grade prostatic intraepithelial neoplasia (HGPIN) and prostatic adenocarcinoma between the ages of 20-69: An autopsy study of 249 cases, In Vivo 8 (1993), 439–443.

5. , Madan, Yee and Zhang, Progress of molecular targeted therapies for prostate cancers, Biochimica et Biophysica Acta (BBA) - Reviews on Cancer 1825 (2012), 140–152.

6. Miller D.C., Gruber S.B., Hollenbeck B.K., Montie J.E. and Wei J.T., Incidence of initial local therapy among men with lower-risk prostate cancer in the United States, Journal of the National Cancer Institute 98 (2006), 1134–1141.

7. Çinar, Engin E.Z. and Ateşçi Y.Z., Early prostate cancer diagnosis by using artificial neural networks and support vector machines, Expert Systems with Applications 36 (2009), 6357–6361.

8. Chen A.H. and Lin C.-H., A novel support vector sampling technique to improve classification accuracy and to identify key genes of leukaemia and prostate cancers, Expert Systems with Applications 38 (2011), 3209–3219.

9. Saritas, Ozkan I.A. and Sert I.U., Prognosis of prostate cancer by artificial neural networks, Expert Systems with Applications 37 (2010), 6646–6650.

10. Tan M.S.P.N. and Kumar, Introduction to data mining, Boston, Pearson Education, Inc, 2006.

11. Zopounidis and Doumpos, Multicriteria classification and sorting methods: A literature review, European Journal of Operational Research 138 (2002), 229–246.

12. Stefanowski, On rough set based approaches to induction of decision rules, Rough Sets in Knowledge Discovery 1 (1998), 500–529.

13. Tsumoto, Automated extraction of medical expert system rules from clinical databases based on rough set theory, Information Sciences 112 (1998), 67–84.

14. Belacel, Multicriteria assignment method PROAFTN: Methodology and medical application, European Journal of Operational Research 125 (2000), 175–183.

15. Michalowski, Rubin, Slowinski and Wilk, Triage of the child with abdominal pain: A clinical algorithm for emergency patient management, Paediatrics & Child Health 6 (2001), 23.

16. Ripley B.D., Pattern recognition and neural networks, Cambridge University Press, 2007.

17. Nieddu and Patrizi, Formal methods in pattern recognition: A review, European Journal of Operational Research 120 (2000), 459–495.

18. Rulon P.J., Tiedeman D.V., Tatsuoka M.M. and Langmuir C.R., Multivariate statistics for personnel classification, 1967.

19. Shen, Tay F.E., Qu and Shen, Fault diagnosis using rough sets theory, Computers in Industry 43 (2000), 61–72.

20. Siskos, Grigoroudis, Zopounidis and Saurais, Measuring customer satisfaction using a collective preference disaggregation model, Journal of Global Optimization 12 (1998), 175–195.

21. Flinkman, Michalowski, Nilsson, Slowinski, Susmaga and Wilk, Use of rough sets analysis to classify Siberian forest ecosystems according to net primary production of phytomass, INFOR: Information Systems and Operational Research 38 (2000), 145–160.

22. Sáez J.A., Luengo, Stefanowski and Herrera, SMOTE–IPF: Addressing the noisy and borderline examples problem in imbalanced classification by a re-sampling method with filtering, Information Sciences 291 (2015), 184–203.

23. Krawczyk, Galar, Jeleń Ł. and Herrera, Evolutionary undersampling boosting for imbalanced classification of breast cancer malignancy, Applied Soft Computing 38 (2016), 714–726.

24. Charte, Rivera A.J., del Jesus M.J. and Herrera, Addressing imbalance in multilabel classification: Measures and random resampling algorithms, Neurocomputing 163 (2015), 3–16.

25. Ramentol, Vluymans, Verbiest, Caballero, Bello, Cornelis and Herrera, IFROWANN: Imbalanced fuzzy-rough ordered weighted average nearest neighbor classification, IEEE Transactions on Fuzzy Systems 23 (2015), 1622–1637.

26. Sanz J.A., Bernardo, Herrera, Bustince and Hagras, A compact evolutionary interval-valued fuzzy rule-based classification system for the modeling and prediction of real-world financial applications with imbalanced data, IEEE Transactions on Fuzzy Systems 23 (2015), 973–990.

27. Zhao, Xu, Jia and Shang, A classification method for imbalanced data based on SMOTE and fuzzy rough nearest neighbor algorithm, in: Yao, Hu, Yu and Grzymala-Busse W.J. (Eds.), Rough Sets, Fuzzy Sets, Data Mining, and Granular Computing: 15th International Conference, RSFDGrC 2015, Tianjin, China, 2015, pp. 340–351.

28. Castellano and Fanelli A.M., Information granulation via neural network-based learning, IFSA World Congress and 20th NAFIPS International Conference, vol. 3055, 2001, pp. 3059–3064.

29. Yao, Information granulation and rough set approximation, International Journal of Intelligent Systems 16 (2001), 87–104.

30. Zadeh L.A., Fuzzy sets and information granularity, Advances in Fuzzy Set Theory and Applications 11 (1979), 3–18.

31. Zadeh L.A., Fuzzy Sets: Where Do We Stand? Where Do We Go? Toward a theory of fuzzy information granulation and its centrality in human reasoning and fuzzy logic, Fuzzy Sets and Systems 90 (1997), 111–127.

32. Bargiela and Pedrycz, Recursive information granulation: Aggregation and interpretation issues, IEEE Transactions on Systems, Man, and Cybernetics, Part B (Cybernetics) 33 (2003), 96–112.

33. Zadrozny and Elkan, Learning and making decisions when costs and probabilities are both unknown, Proceedings of the Seventh ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, ACM, 2001, pp. 204–213.

34. C.-T., Chen L.-S. and Yih, Knowledge acquisition through information granulation for imbalanced data, Expert Systems with Applications 31 (2006), 531–541.

35. Eberhart and Kennedy, A new optimizer using particle swarm theory, Proceedings of the Sixth International Symposium on Micro Machine and Human Science, Nagoya, 1995, pp. 39–43.

36. Yusup, Zain A.M. and Hashim S.Z.M., Overview of PSO for optimizing process parameters of machining, Procedia Engineering 29 (2012), 914–923.

37. Niknam and Amiri, An efficient hybrid approach based on PSO, ACO and k-means for cluster analysis, Applied Soft Computing 10 (2010), 183–197.

38. Niknam, Amiri, Olamaei and Arefi, An efficient hybrid evolutionary optimization algorithm based on PSO and SA for clustering, Journal of Zhejiang University SCIENCE A 10 (2009), 512–519.

39. Chen M.-C., Chen L.-S., Hsu C.-C. and Zeng W.-R., An information granulation based data mining approach for classifying imbalanced data, Information Sciences 178 (2008), 3214–3227.

40. Kuo R.J., Wang M.J. and Huang T.W., An application of particle swarm optimization algorithm to clustering analysis, Soft Computing 15 (2009), 533–542.