Abstract
Cancer has long been one of the most serious health challenges facing humankind. Lung cancer is among the most prevalent types of cancer and has a high death rate. However, lung cancer mortality can be reduced by periodic screening. Advances in medical science have brought numerous benefits with respect to screening equipment. Computed Tomography (CT) is one of the popular imaging techniques, and this work uses CT images for lung cancer detection. Early detection of lung cancer can prolong the lifetime of the patient and is made easier by the latest screening technology. Additionally, the accuracy of disease detection can be enhanced with automated systems, which support healthcare experts in effective diagnosis. This article presents an automated lung cancer detection system equipped with machine learning algorithms, which can differentiate between the benign, malignant and normal classes. The accuracy of the proposed lung cancer detection method is around 98.7%, which is superior to the compared approaches.
Introduction
Lung cancer is the fourth most widespread cancer, next to breast, cervical and oral cavity cancer. According to medical reports, lung cancer is observed more in men than in women. This dreadful disease is the second and sixth most frequently occurring cancer in men and women respectively [1]. The mortality rates of lung cancer are rising every year. An important reason for the increased death rates is ignorance and neglect of periodic health checkups. Early detection of any cancerous growth helps to increase the lifespan of the patient.
Computer-based diagnostic systems came into being owing to the progress of medical science and computer technology. As the diagnostic procedures are in digital format, it is easy for computer-based systems to process the data. These systems process the digital images and locate the abnormal areas present in the image. Though this may seem easy, it is quite difficult to build a good system owing to the presence of noise and other unwanted details. Thus, a good computer-based diagnostic system must deal with all these issues to attain the goal.
As the diagnostic systems assist the physician to make a decision, the reliability of the system is very important. The system is said to be reliable when the accuracy rates are reasonable. However, achieving better accuracy rates is a challenging task, as the images contain several unnecessary details. In spite of the presence of numerous lung cancer detection systems, there is a constant demand for a reliable system. Taking this challenge into consideration, this work attempts to propose a novel computer based diagnostic system for lung cancer detection by incorporating advanced image processing techniques over Computed Tomography (CT) images.
The complete work is decomposed into four key phases: CT image pre-processing, segmentation, feature extraction and classification. The pre-processing phase makes the CT images fit for the subsequent processes. Usually, this phase attempts to enhance image quality through noise removal, contrast enhancement and similar procedures. The segmentation phase extracts the regions of interest from the whole image, such that the segmented regions alone are processed. The efficiency of the classification system improves when specific regions of the CT images are focussed on. The features extracted from the segmented regions are rich enough for the classifiers to detect the abnormalities present in the CT images. The key points of this work are as follows.
The contrast of the CT image is improved by the adaptive histogram equalisation method, which is effective for both grey and colour images.
Kernelized Fuzzy C Means (KFCM) is used for the segmentation of CT images, which proves to be more robust than the regular FCM.
The features of the segmented image are extracted by means of curvelet, which is known for its discriminative power.
An ensemble classifier is used to detect the defect in the CT image.
The rest of this paper is organized in the following way. A short review of the state-of-the-art literature with respect to lung cancer detection is presented in section 2. Section 3 presents the proposed lung cancer detection approach in a detailed manner. Section 4 analyses the efficiency of the suggested solution by conducting many comparative studies. The concluding points of this article are summed up in section 5.
Review of literature
The purpose of this section is to review the recent literature with respect to lung cancer detection systems.
Several strategies for lung cancer diagnosis using CT images are investigated on the basis of image processing techniques in [1]. Lung cancer diagnosis is conducted by separating the task into stages such as pre-processing and nodule segmentation. Recent trends in the identification of lung nodules are discussed in [2]. In addition, the efficiency of the new identification strategies for lung nodules is compared and presented. In [3], a classification method for lung cancer is proposed based on a wavelet recurrent neural network. The study uses wavelets to eliminate the noise from the input image, which is then classified by the recurrent neural network. This work, however, cannot attain better accuracy rates, which means that the false positive rates are higher.
In [4], an algorithm is proposed to detect the pulmonary nodules based on cascade classifier. This work detects the pulmonary nodules and classifies them into normal and benign. A learning method is proposed on the basis of cascade classifier and applied over the detected pulmonary nodules. This work focuses more on accuracy rates, rather than on sensitivity and specificity rates. A technique to detect lung nodules from a series of CT slices is presented in [5]. This work segments the lung nodules by applying Otsu’s threshold along with some morphological operations. The geometric, histogram and texture features are extracted from the segmented nodules to carry out the process of classification. The Multilayer Perceptron (MLP) is employed as a classifier and this work involves computational overhead.
In [6], a technique based on Ek-means algorithm and Support Vector Machine (SVM) is presented to recognize and classify the lung tumour. This work pre-processes the CT images for removing the unwanted information by means of thresholding approach. The regions of interest alone are extracted and the Gray Level Co-occurrence Matrix (GLCM) features are extracted. Finally, SVM is utilized to distinguish between the cancerous and non-cancerous areas. An early lung cancer detection mechanism is proposed in [7], which exploits Hopfield Neural Network classifier for extracting the lung areas from the CT images. The edges of the lung region lobes are detected by bit planes and the diagnostic rules are framed to detect the abnormality.
The lung cancer detection algorithm based on FCM and Bayesian classification is presented in [8]. In this work, FCM is employed for segmentation and the GLCM features are extracted. Based on the feature set, the Bayesian classifier is utilized to distinguish between the normal and the cancer affected CT images. Yet, the results of this work are not convincing in terms of sensitivity and specificity rates. In [9], a technique for identifying lung cancer based on a genetic method is proposed. This work, however, suffers from time complexity.
In [10], a lung cancer detection technique based on Gabor filter and watershed segmentation is proposed. The segmentation is carried out by the watershed approach and the Gabor features are extracted from the CT images. This technique does not include classification and stops with segmentation. A lung cancer detection technique based on Local Energy based Shape Histogram (LESH) and machine learning techniques is introduced in [11]. Initially, this work pre-processes the CT images by Contrast Limited Adaptive Histogram Equalization (CLAHE) and the LESH features are extracted. Machine learning algorithms such as Extreme Learning Machine (ELM) and SVM are applied. This work is efficient, but the computational overhead can still be decreased by altering the feature extraction technique.
A new lung cancer detection technique based on the Mumford-Shah algorithm is proposed in [12]. This work removes the Gaussian noise by applying a sigma filter; the regions of interest are segmented by Otsu's thresholding and the Mumford-Shah model is applied. The texture features are extracted from the segmented regions by a spectral texture extraction technique and the classification is done by a multi-level slice classifier. However, the classification accuracy of this work can be improved further.
In [13], a lung nodule detection and segmentation technique is proposed based on a patch based multi-atlas method. This work chooses a small group of atlases by matching the target image with a large group of atlases in terms of a size and shape based feature vector. The lung nodules are then detected by means of a patch based approach, and the Laplacian of Gaussian blob detection technique is utilized to detect the segmented area of the lung nodule. However, the number of images utilized for testing is very small and hence the efficiency of this work cannot be determined. A work to enhance the lung nodules is presented in [14]. This work exploits a three dimensional multi-scale block Local Binary Pattern (LBP) filter, which can distinguish between the line based regions and the edges effectively. This work focuses only on enhancement, which is just one part of the approach proposed here.
An automatic lung nodule segmentation and classification technique is proposed in [15]. Initially, the images are pre-processed by different thresholding techniques and morphological operations. The areas of interest alone are extracted by means of a priori information and Hounsfield Units. SVM is utilized for classification. The results of this work can still be improved in terms of sensitivity and specificity.
Motivated by these research efforts, this paper aims to implement a consistent algorithm for lung cancer diagnosis that shows improved sensitivity and specificity at low time complexity. The following section describes the proposed method in detail.
Proposed lung cancer detection approach
The proposed lung cancer detection approach is based on several phases and all the involved phases are explained one after the other. Initially, the overall flow of the work is presented.
Overview of the proposed approach
Understanding the power of modularity, this work decomposes the complete functionality of the work into four phases. They are CT image pre-processing, segmentation, feature extraction and classification. The normal and the abnormal regions of the image cannot be classified on the go and it involves several important steps to achieve better classification results. Initially, the CT images are to be pre-processed to enable them for further image processing activities. The pre-processing activity of this work is concerned with the enhancement of the image contrast, such that the regions can easily be differentiated. Adaptive Histogram Equalization (AHE) technique is employed to improve the contrast of the CT images. The overall flow of the proposed approach is depicted in Fig. 1.

Fig. 1. Overall flow of the proposed classification approach.
The aim of image segmentation is to focus on all parts of the image, so as to achieve better classification results. This work segments the CT images by means of Kernelized FCM (KFCM), which performs better than the standard FCM. The segmented regions are then passed to the next phase, feature extraction, which extracts the curvelet and GLCM features. These features are sufficient for the classifiers to achieve better results.
Finally, an ensemble classifier that combines k-NN, SVM and ELM is utilized to differentiate between the normal and the abnormal areas present in the CT images. The reason for using an ensemble classifier is that the classification result is not the decision of a single classifier, which may turn out to be a false positive or a false negative. Instead, the classification decision is the result of three different classifiers, which promises enhanced classification results. The following subsections present all the sub-phases involved in the proposed approach.
Many diagnostic images suffer from low contrast and therefore the image information cannot be reliably identified. This is where the contrast improvement algorithm comes into play. The CT images are pre-processed using the AHE technique, and the key reason for the use of AHE is that this technique computes multiple histograms over different areas of the CT image. This increases the contrast in each region of the image. The computed histogram is adaptive, meaning that the intensity values are redistributed and balanced. With AHE, the edges and the local contrast of the CT image are improved, and the pre-processed images are shown in Fig. 2.

Fig. 2. (a) to (d) input images, (a1) to (d1) contrast enhanced images.
Let a CT image with $n \times n$ pixels undergo the process of contrast enhancement. The AHE operates on every pixel $p_i$ and modifies $p_i$ with respect to the intensities of its immediate neighbourhood pixels. The AHE processes the image regions separately, in contrast to the standard HE technique. Thus, the contrast of every region is improved, which raises the contrast of the entire image. The contrast enhanced images are then segmented as described in the following section.
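The region-wise behaviour described above can be illustrated with a minimal Python sketch. It is a simplification made for illustration only: the image is cut into non-overlapping tiles and each tile is equalized with its own histogram, without the interpolation between neighbouring regions that full AHE implementations normally use. The function names `equalize_block` and `tiled_ahe`, the tile size and the 8-bit intensity range are assumptions, not details taken from this work.

```python
import numpy as np

def equalize_block(block, levels=256):
    """Histogram-equalize one region using only its own histogram."""
    hist = np.bincount(block.ravel(), minlength=levels)
    cdf = np.cumsum(hist)
    cdf_min = cdf[cdf > 0][0]          # CDF value at the darkest level present
    denom = block.size - cdf_min
    if denom == 0:                     # flat region: nothing to stretch
        return block.copy()
    lut = np.round((cdf - cdf_min) / denom * (levels - 1))
    lut = np.clip(lut, 0, levels - 1).astype(np.uint8)
    return lut[block]                  # map every pixel through the lookup table

def tiled_ahe(image, tile=8, levels=256):
    """Equalize every tile separately (assumed tile size, no interpolation)."""
    img = np.asarray(image, dtype=np.uint8)
    out = np.empty_like(img)
    for r in range(0, img.shape[0], tile):
        for c in range(0, img.shape[1], tile):
            block = img[r:r + tile, c:c + tile]
            out[r:r + tile, c:c + tile] = equalize_block(block, levels)
    return out
```

Because each tile is stretched independently, a region whose intensities occupy only a narrow band is spread over the full range, which is the local contrast gain that global HE cannot provide.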
This phase aims to segment the CT image into several regions, such that each region can be examined with utmost care. It is better to process a specific region than to process the whole image, as this conserves the time and computational power of the system. Hence, the process of segmentation renders several benefits to the classification system. FCM is one of the well-known segmentation algorithms, which groups the pixels that share similar characteristic features. The term characteristic feature indicates the measure with which the clusters are formed, and it can be the intensity or the distance between the pixels. The input of the standard FCM algorithm is the image along with the total count of clusters. The segmentation outcome of the proposed approach is presented in Fig. 3.

Fig. 3. (a) to (d) are the input images, (a1) to (d1) are the segmented images.

Comparative analysis with the existing techniques.
In order to improve the capability of clustering, this work launches a kernel function that can deal with noise as well. This work employs the Gaussian Radial Basis Function (GRBF) kernel and the kernel function k is represented in the feature space as follows.
In the above equation, $\langle \beta(p), \beta(q) \rangle$ is the inner product. The GRBF kernel is denoted as
The objective function of the KFCM is presented as follows.
In the above equation, $r$ and $c$ are the total count of clusters and data points respectively. $\mu_{ij}$ is the membership of the pixel $x_j$ in the $i$th cluster and $f$ is the degree of fuzziness. $\beta$ indicates the implicit non-linear mapping.
On expanding Equation 4, Equations 5 and 6 are obtained as follows.
Based on the equations from 4 to 6, the objective function is modified as follows.
Where
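The displayed equations of this subsection are not reproduced in this text. For orientation only, the block below writes out the commonly used KFCM formulation with a Gaussian kernel, using the symbols defined above ($\beta$, $r$, $c$, $\mu_{ij}$, $x_j$, $f$). The cluster centre $v_i$ and the kernel width $\sigma$ are assumed symbols, no equation numbers are assigned, and the exact form used by this work may differ.

```latex
% Kernel as an inner product in the feature space
\[ K(p, q) = \langle \beta(p), \beta(q) \rangle \]
% GRBF kernel, with assumed width \sigma
\[ K(p, q) = \exp\!\left( -\frac{\lVert p - q \rVert^{2}}{\sigma^{2}} \right) \]
% KFCM objective with assumed cluster centres v_i
\[ J = \sum_{i=1}^{r} \sum_{j=1}^{c} \mu_{ij}^{f} \, \lVert \beta(x_j) - \beta(v_i) \rVert^{2} \]
% Expansion of the feature-space distance through the kernel
\[ \lVert \beta(x_j) - \beta(v_i) \rVert^{2} = K(x_j, x_j) + K(v_i, v_i) - 2 K(x_j, v_i) \]
% For the GRBF kernel K(x, x) = 1, hence
\[ \lVert \beta(x_j) - \beta(v_i) \rVert^{2} = 2 \left( 1 - K(x_j, v_i) \right) \]
% Modified objective
\[ J = 2 \sum_{i=1}^{r} \sum_{j=1}^{c} \mu_{ij}^{f} \left( 1 - K(x_j, v_i) \right) \]
% Membership and centre updates that minimise J
\[ \mu_{ij} = \frac{\left( 1 - K(x_j, v_i) \right)^{-1/(f-1)}}{\sum_{l=1}^{r} \left( 1 - K(x_j, v_l) \right)^{-1/(f-1)}},
   \qquad
   v_i = \frac{\sum_{j=1}^{c} \mu_{ij}^{f} K(x_j, v_i) \, x_j}{\sum_{j=1}^{c} \mu_{ij}^{f} K(x_j, v_i)} \]
```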
The overall segmentation algorithm and sample segmented images for segmentation are presented below.
Hence, the CT images are segmented by the KFCM, which deals with noise effectively and generates better clusters. This process is followed by the extraction of features from the segmented images as presented below.
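As an illustration of how such a kernelized clustering loop can be organised, the following Python sketch clusters a flat array of pixel intensities with a Gaussian kernel, alternating the usual membership and centre updates. It is a sketch under stated assumptions rather than the segmentation code of this work: the function name `kfcm`, the kernel width `sigma`, the evenly spaced initial centres and the stopping tolerance are all hypothetical choices, and spatial information is ignored.

```python
import numpy as np

def kfcm(x, n_clusters=2, f=2.0, sigma=50.0, n_iter=100, tol=1e-5):
    """Kernelized fuzzy c-means on a 1-D array of pixel intensities."""
    x = np.asarray(x, dtype=float).ravel()
    # Assumed initialisation: centres spread evenly over the intensity range.
    v = np.linspace(x.min(), x.max(), n_clusters)
    for _ in range(n_iter):
        # Gaussian kernel between every centre and every pixel: (clusters, pixels).
        k = np.exp(-((x[None, :] - v[:, None]) ** 2) / sigma ** 2)
        # Kernel-induced distance 1 - K, floored to avoid division by zero.
        d = np.maximum(1.0 - k, 1e-12)
        # Membership update: closer centres (in kernel distance) get larger weights.
        w = d ** (-1.0 / (f - 1.0))
        u = w / w.sum(axis=0, keepdims=True)
        # Centre update: kernel-weighted mean of the intensities.
        uk = (u ** f) * k
        v_new = (uk @ x) / uk.sum(axis=1)
        converged = np.max(np.abs(v_new - v)) < tol
        v = v_new
        if converged:
            break
    return u, v  # u holds the memberships computed in the last iteration
```

A hard segmentation can be read off with `u.argmax(axis=0)`, which assigns each pixel to the cluster in which its membership is largest.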
Curvelet is a multidimensional and multiresolution analytic tool, which was proposed by Candes and Donoho in the year 2000 [16]. The curvelets are simple and easy to implement. Additionally, curvelets are fast, with minimal redundancy. Initially, the Fast Fourier Transform (FFT) is applied over the input image. The obtained Fourier frequency plane is sliced into several wedges, which results in circular and angular slices. The circular slices decompose an image into several scales and the angular slices divide the image into different angles. Hence, every wedge conforms to a specific angle and scale. When the Inverse FFT (IFFT) is applied over a wedge, the curvelet coefficients are obtained for a particular scale and angle. Additionally, the curvelets can handle edge discontinuities in a better way.
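The wedge slicing described above can be illustrated with a crude Python sketch. It is not a curvelet transform: real curvelet implementations use smooth, overlapping windows and wrapping, whereas this sketch simply keeps the Fourier coefficients that fall inside one hard-edged radial band and angular range and inverts them. The function name `wedge_coefficients` and its parameters are hypothetical; radii are expressed in cycles per pixel and angles in radians, with opposite directions folded together.

```python
import numpy as np

def wedge_coefficients(image, r_min, r_max, theta_min, theta_max):
    """Keep one radial band and angular range of the FFT plane, then invert it."""
    img = np.asarray(image, dtype=float)
    rows, cols = img.shape
    spectrum = np.fft.fftshift(np.fft.fft2(img))
    # Frequency coordinates arranged like the shifted spectrum.
    fy = np.fft.fftshift(np.fft.fftfreq(rows))[:, None]
    fx = np.fft.fftshift(np.fft.fftfreq(cols))[None, :]
    radius = np.hypot(fx, fy)                    # selects the scale
    angle = np.mod(np.arctan2(fy, fx), np.pi)    # selects the orientation
    mask = ((radius >= r_min) & (radius < r_max)
            & (angle >= theta_min) & (angle < theta_max))
    # The inverse FFT of the masked spectrum gives the coefficients of this wedge.
    return np.fft.ifft2(np.fft.ifftshift(spectrum * mask))
```

Calling the function over a grid of radial bands and angular ranges yields one coefficient image per scale and orientation, from which statistics can be taken as features.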
The CT images of lungs are considered in three different scales and the GLCM coefficients are extracted. GLCM features are effective in determining the spatial relationship between pixels. The GLCM can be represented as a square matrix of size $M \times M$, where $M$ is the total number of gray levels available in an image. Each entry of the GLCM approximates the probability of a pair of intensities occurring together. The $(i, j)$th entry of the GLCM counts the cases in which a pixel with coordinate $(a, b)$ has the gray level $gl_i$ and the pixel $(a + da, b + db)$ has the gray level $gl_j$. The offsets $da$ and $db$ are computed by considering different scales and angles. The features are computed by applying a weight to every entry of the matrix and adding the weighted values. The features utilized by this work are energy, contrast, correlation, homogeneity, autocorrelation, dissimilarity and inertia. The co-occurrence is computed in three different angles, namely 0, 45 and 90 degrees. When the features are extracted, the classifier is trained with the obtained feature set as follows.
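The construction of the co-occurrence matrix and the weighted sums can be sketched in Python as follows. This is a minimal illustration, not the feature extractor of this work: the function names `glcm` and `glcm_features` are hypothetical, only four of the listed features are shown, and the homogeneity weight $1 / (1 + |i - j|)$ is one common definition among several. With row and column offsets, the angles 0, 45 and 90 degrees correspond to `(da, db)` equal to `(0, 1)`, `(-1, 1)` and `(-1, 0)`.

```python
import numpy as np

def glcm(image, da, db, levels):
    """Normalized co-occurrence matrix for the pixel offset (da, db)."""
    img = np.asarray(image, dtype=int)
    rows, cols = img.shape
    m = np.zeros((levels, levels), dtype=float)
    for a in range(rows):
        for b in range(cols):
            a2, b2 = a + da, b + db
            if 0 <= a2 < rows and 0 <= b2 < cols:
                # Count the pair (gray level at (a, b), gray level at the offset pixel).
                m[img[a, b], img[a2, b2]] += 1
    total = m.sum()
    return m / total if total > 0 else m

def glcm_features(p):
    """Weighted sums over the normalized co-occurrence matrix p."""
    i, j = np.indices(p.shape)
    return {
        "energy": float((p ** 2).sum()),
        "contrast": float(((i - j) ** 2 * p).sum()),
        "homogeneity": float((p / (1.0 + np.abs(i - j))).sum()),
        "dissimilarity": float((np.abs(i - j) * p).sum()),
    }
```

Concatenating the feature values obtained for each angle and scale gives one feature vector per segmented region.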
Ensemble classification
When it comes to supervised learning, the classifiers are given knowledge in the form of the features with which the classification is done. Normally, the process of classification requires prior knowledge about the dataset being utilised. The entire process of classification is achieved in two major phases, namely training and testing. In the training phase, the classifier is imparted knowledge by means of the extracted features. In this way, the classifier equips itself to differentiate between the objects. The classification problem can be a binary or a multiclass problem. The binary classification system involves only two classes, whereas the multiclass problem involves several classes. This work employs an ensemble classifier, as the result of a single classifier may introduce false positives or false negatives. The overall algorithm of the proposed approach is presented as follows.
In order to ensure reliability, this work employs k-NN, SVM and ELM and the final decision is taken by performing majority vote computation. The following subsections present the summary of the working principle of the employed classifiers.
k-NN classifier
k-NN is the basic classifier that deals with the nature of the training data, so as to differentiate between the normal and abnormal areas of the image. This classifier calculates the Euclidean distance between the image pixels and is denoted by
The effectiveness of the k-NN classifier is decided by the value of k and hence k must be chosen carefully. However, manual choice of k is difficult and inefficient, as it requires prior knowledge about the dataset. Additionally, arriving at an optimal value of k manually consumes more time and energy. In order to avoid this issue, an automatic k-fold cross validation scheme is employed, which can choose the value of k. The k-fold cross validation method works by decomposing the training images into k parts; a single part is considered as the testing set and the rest of the images are treated as training objects. This process is repeated for all the test samples. At last, the average is computed over all the k results attained so far and this value is fixed as k. Hence, the value of k is assigned automatically, which is effortless and does not require prior knowledge about the dataset.
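The distance computation and the neighbour vote can be sketched in a few lines of Python. The sketch assumes that each sample is a feature vector and that the Euclidean distance has its standard form, the square root of the summed squared differences. The function names `euclidean` and `knn_predict` are hypothetical, and `k` is simply passed in rather than selected by cross validation.

```python
import math
from collections import Counter

def euclidean(p, q):
    """Standard Euclidean distance between two feature vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def knn_predict(train_x, train_y, query, k=3):
    """Label the query by a vote among its k nearest training samples."""
    ranked = sorted(zip(train_x, train_y),
                    key=lambda pair: euclidean(pair[0], query))
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0]
```

An odd `k` avoids tied votes when only two classes are involved.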
SVM is a popular and promising classifier, which is fed with knowledge about the malignant and benign CT images by means of the calculated threshold [18]. Consider $1, 2, 3, \ldots, N$ CT images, which are to be classified as malignant or benign. Both these classes are divided by a hyperplane, which derives a criterion to separate the classes. Hence, the choice of hyperplane must be optimal, as the efficiency of the classifier depends on this hyperplane. The hyperplane is determined by the following equation.
In Equation (11), $L_i$ is the Lagrange multiplier that partitions the hyperplane of the classifying area $\psi_i(B, M)$. The threshold to differentiate between the benign and malignant classes is denoted by $th$. By this principle, SVM distinguishes between the malignant and benign classes of the CT lung images.
ELM is one of the fast learning classifiers and has gained considerable popularity [19]. Let there be $U$ training entities, represented as $(m_j, n_j)$, where $m_j = [m_{j1}, m_{j2}, \ldots, m_{jk}]^T \in \mathrm{Dim}^{k}$ indicates the training entity with dimension $k$, and $n_j = [n_{j1}, n_{j2}, \ldots, n_{jo}]^T \in \mathrm{Dim}^{o}$ denotes the $j$th training label with dimension $o$, which is the entire class count. A Single hidden Layer Feed-Forward Neural Network (SLFN) with a single activation function $act(x)$ and $G_n$ neurons is indicated by the following.
$wt_i$ is the weight vector $wt_i = [wt_{i1}, wt_{i2}, \ldots, wt_{in}]^T$, which links the $i$th hidden neuron with the input neurons, and $\gamma_i = [\gamma_{i1}, \gamma_{i2}, \ldots, \gamma_{ik}]^T$ is the weight vector that links the $i$th hidden neuron with the output neurons. The bias of the $i$th hidden neuron is denoted as $bs_i$. ELM does not demand any knowledge of the input data, so $wt_i$ and $bs_i$ are assigned randomly. The SLFN is represented by
Consider the ELM's hidden layer output matrix as HLM, whose $i$th column represents the output vector of the $i$th hidden neuron with respect to the inputs $m_{j1}, m_{j2}, \ldots, m_{jn}$.
The matrix format is represented by the following equation.
The output weights are calculated by the norm least-square solution and the equation is given by
Where $HL^{\dagger}$ is the Moore-Penrose generalized inverse of $HL$. The ELM training phase functions by feeding the number of classes $o$, the activation function $act(x)$, the hidden neuron count $G_n$ and the ELM count in the ensemble $E$. During training, the ELM is fed with the training samples $TS = \{(m_j, n_j) \mid m_j \in D^{k}, n_j \in D^{o}; j = 1, 2, \ldots, N\}$. The ELM is equipped by computing $\gamma$ for all $TS$ using Equation 18. Based on the knowledge gained in the training phase, the ELM can differentiate between the normal and the malignant cases.
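The training procedure described above (random input weights and biases, followed by a least-squares solution for the output weights through the Moore-Penrose inverse) can be sketched in Python with NumPy. This is a minimal sketch under stated assumptions, not the implementation of this work: the function names `elm_train` and `elm_predict`, the sigmoid activation, the normal distribution of the random weights, the hidden neuron count and the one-hot label encoding are all assumptions, and a single ELM is trained rather than an ensemble of $E$ networks.

```python
import numpy as np

def _hidden_output(m, wt, bs):
    """Hidden layer output matrix with an assumed sigmoid activation."""
    return 1.0 / (1.0 + np.exp(-(np.asarray(m, dtype=float) @ wt + bs)))

def elm_train(m, n, n_hidden=20, seed=0):
    """Train a single ELM: m holds one sample per row, n holds one-hot labels."""
    rng = np.random.default_rng(seed)
    m = np.asarray(m, dtype=float)
    n = np.asarray(n, dtype=float)
    # Input weights and biases are drawn at random and never updated.
    wt = rng.normal(size=(m.shape[1], n_hidden))
    bs = rng.normal(size=n_hidden)
    hl = _hidden_output(m, wt, bs)
    # Output weights: minimum-norm least-squares solution via the pseudo-inverse.
    gamma = np.linalg.pinv(hl) @ n
    return wt, bs, gamma

def elm_predict(m, wt, bs, gamma):
    """Return the index of the class with the largest output for each sample."""
    return (_hidden_output(m, wt, bs) @ gamma).argmax(axis=1)
```

Since only `gamma` is solved for, training reduces to one matrix pseudo-inverse, which is the reason ELM is usually fast compared with iteratively trained networks.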
The final classification results of the k-NN, SVM and ELM are obtained and the dominating result is found out. As three different classifiers are utilized, eight different scenarios may occur. For instance, when two or three classifiers produce the same result, that result is considered as dominating and is declared as final. The final result of this work is not based on the decision of a single classifier and this feature enhances the ability of the system. Additionally, in order to reduce time and computational complexity, this work exploits efficient classifiers. The false positive and false negative rates are considerably reduced, which in turn improves the sensitivity and specificity rates.
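The majority vote itself is small enough to sketch directly in Python. The function name `majority_vote` is hypothetical. Returning `None` when all three classifiers disagree is an assumption made for the sketch, since the text only describes the cases in which at least two classifiers agree; with two classes a majority always exists.

```python
from collections import Counter

def majority_vote(knn_label, svm_label, elm_label):
    """Return the label chosen by at least two of the three classifiers."""
    votes = Counter([knn_label, svm_label, elm_label])
    label, count = votes.most_common(1)[0]
    # Assumed fallback: no decision when every classifier votes differently.
    return label if count >= 2 else None
```

For two classes, the eight possible combinations of the three votes all resolve to the label that appears at least twice.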
The functionality of the proposed approach is evaluated by implementing this work in Matlab version 8.1, on a standalone computer with 8 GB RAM. The CT images are collected from different laboratories to carry out this research. The tailored dataset contains 100 images, of which 37 are benign and 63 are malignant. The effectiveness of the proposed classification system is justified by a comparative analysis in two different aspects. The initial round of comparative analysis justifies the choice of the incorporated techniques such as AHE, KFCM, curvelet features and ensemble classification. The second round of comparative analysis compares the effectiveness of the proposed classification approach with the existing techniques LESH + CC [11], Mumford-Shah algorithm [12] and apriori + SVM [15], in terms of classification accuracy, sensitivity, specificity and time consumption.
Accuracy is the most critical performance metric that proves the reliability of the proposed classification algorithm. The classification results are accurate, when the accuracy rate is the maximum and is computed by
Where the true positive, true negative, false positive and false negative rates are represented by $T_p$, $T_n$, $F_p$ and $F_n$ respectively. Sensitivity and specificity rates of the classification approach must be as high as possible, such that the $F_p$ and $F_n$ rates are minimal. The formulae for sensitivity and specificity rate computation are as follows.
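The displayed formulae are not reproduced in this text. Under the standard definitions, accuracy is $(T_p + T_n) / (T_p + T_n + F_p + F_n)$, sensitivity is $T_p / (T_p + F_n)$ and specificity is $T_n / (T_n + F_p)$. A minimal Python sketch of these definitions follows; the function name `classification_metrics` is hypothetical and the counts in the usage note are invented for illustration, not results of this work.

```python
def classification_metrics(tp, tn, fp, fn):
    """Accuracy, sensitivity and specificity from confusion-matrix counts."""
    return {
        # Share of all decisions that are correct.
        "accuracy": (tp + tn) / (tp + tn + fp + fn),
        # Share of truly positive cases that are detected.
        "sensitivity": tp / (tp + fn),
        # Share of truly negative cases that are correctly rejected.
        "specificity": tn / (tn + fp),
    }
```

For instance, with the illustrative counts `tp=45, tn=40, fp=10, fn=5`, the function gives an accuracy of 0.85, a sensitivity of 0.9 and a specificity of 0.8.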
The experimental results of the first round of comparative results are presented in Table 1.
Table 1. Performance analysis in terms of accuracy, sensitivity, specificity and time consumption
The experimental results of the proposed approach justify the choice of the techniques incorporated in it. The second to fourth columns of the table signify the comparative techniques for pre-processing, segmentation and feature extraction respectively. Columns five to seven indicate the performance of each classifier employed alone. Finally, the results of the proposed approach are presented. The proposed approach shows maximum accuracy, sensitivity and specificity; however, its time consumption is slightly greater than that of ELM. Yet the time consumption is reasonable, as the proposed approach performs ensemble classification and is not based on a single classifier. The attained time consumption is tolerable, as the sensitivity and specificity are greater. The second round of comparative analysis is performed by comparing the proposed approach with the existing techniques and the results are presented as follows.
The performance of the proposed classification approach is compared with the existing techniques and the proposed approach outperforms the others. The performance of LESH+CC is quite comparable with the proposed approach. Though the accuracy rates of LESH+CC are high, its sensitivity and specificity rates are not up to the mark. Besides this, the time consumption of LESH+CC is the highest, as it has some difficulty in feature extraction.
The next best performing classification approach is apriori+SVM, which shows better accuracy and sensitivity rates but fails to achieve high specificity rates. Lower specificity rates imply a larger number of false positives, which are accepted in order to increase the accuracy rates. The Mumford-Shah based classification approach shows poor results when compared to all the other techniques, with the least sensitivity, specificity and accuracy rates. The reason for the better performance of the proposed approach is that all the techniques involved in this work operate by themselves without any manual setting. Besides this, the better pre-processed and segmented results improve the quality of the classification process. When the initial image processing activities are optimal, the outcome of the system is efficient. Hence, a better classification approach to distinguish between malignant and benign tumours in lung CT scans is presented.
This article presents a reliable computer aided lung cancer classification system based on curvelet features and an ensemble classifier. The proposed classification approach is applied over the pre-processed and segmented images, which are obtained by AHE and KFCM respectively. The features of the segmented regions of the image are extracted by means of curvelet, which is fast and has high discrimination ability. An ensemble classifier formed by k-NN, SVM and ELM is proposed and the decisions of all the classifiers are obtained. The dominating result is chosen as the final decision and the performance of this work is tested against three different existing techniques. The proposed classification approach serves better in terms of accuracy, sensitivity and specificity rates with reasonable time consumption. In future, this work is planned to be extended by including feature normalization and selection techniques.
