Abstract
Keywords
Introduction
Mammography is the most popular imaging modality for population-based breast cancer screening. However, due to the large heterogeneity of the suspicious breast lesions (e.g., soft tissue masses) and random overlapping of the dense fibro-glandular tissues, accurate classification of suspicious breast masses detected on the mammograms is a quite difficult task for radiologists in reading and interpreting screening mammograms. As a result, although the cancer detection yield of screening mammography is typically less than 0.5 to 1%, the recall rates often reach 10% or higher in current clinical practice [1]. Thus, the majority of recalls and/or biopsies are proved to be benign. The higher false-positive recall rates add anxiety with potentially long-term psychosocial consequences and physical harms to many cancer-free women who participate in screening mammography [2]. Hence, improving specificity of breast cancer screening has high clinical impact. For example, since studies have shown that digital breast tomosynthesis (DBT) might help reduce recall rates by 16 to 30% as comparing to digital mammography [3], developing, optimizing and testing new DBT technology has attracted extensive research interest recently [4, 5]. However, adding a new imaging test, such as DBT, may increase radiation dose, screening costs and imaging reading time.
In order to help radiologists improve diagnostic specificity by reducing false-positive recall rates, developing and using computer-aided diagnosis (CAD) schemes is also an approach that has been attracting extensive research interest in the last two decades [6]. The goal of developing CAD based classification schemes is to analyze each suspicious lesion detected by radiologists on the mammograms and compute a likelihood score of the lesion being malignant, which aims to assist radiologists in making more accurate recall decisions. Unlike the commercially available CAD based detection schemes of mammograms, which detect and cue locations of the suspicious lesions on mammograms [7], developing CAD based classification schemes to classify between malignant and benign lesions still faces multiple challenges in both technical development and clinical application. In this article, we only discuss CAD based classification schemes.
Table 1 summarizes several most representative CAD schemes reported in the literature [8–15], which include size of training/testing datasets, lesion segmentation methods, type of the selected image features, machine learning classifiers, and classification performance levels namely, the areas under the receiver operating characteristic (ROC) curves (AUC). From these previously reported results, CAD schemes could yield promising performance of classifying between mammographic masses compared to the high false-positive recall rates in current clinical practice. However, previous studies have shown and discussed that without providing an intelligible reasoning process, radiologists are typically reluctantly to accept any recommendations generated by CAD schemes using a “black-box” type approach in the scheme’s decision-making process [16, 17]. In most of the previous CAD schemes, (1) the meaning of the majority of the computed image texture features are unexplainable to the human vision system or the variations of those features are also insensitive to the human eyes, and (2) the working mechanisms of the machine learning classifiers (e.g. artificial neural networks) are very complex and thus the relationship or correlation between the image features and the phenotypes of cancer development is also difficult to be understandable. As a result, no CAD based classification scheme is currently used in clinical practice.
To potentially increase radiologists’ confidence in accepting mass classification results made by CAD schemes, we in this study, developed and tested a new CAD concept and scheme that only uses 5 simple but visually sensitive image features, which are routinely evaluated by the radiologists in reading and interpreting mammograms, as well as a simple and explainable linear classifier (i.e. logistic regression). Using this new CAD scheme and the accompanying image display (“visual aid”) tool, the classification logic and decision reasoning can then be better understandable by the human observers. The detailed experimental procedures and study results, along with the discussion, to test our hypothesis are presented in the following sections of this article.
Materials and method
A testing image dataset
From the retrospectively collected de-identified full-field digital mammography (FFDM) image data in our laboratory, we assembled a reference dataset for developing and testing a new proposed CAD scheme in this study. The dataset includes FFDM images acquired from 301 women who were recalled by the radiologists during the original image reading and interpretation due to the detection of suspicious soft tissue masses on the mammograms. The biopsy was performed on each suspicious mass and the pathological reports indicated that 152 masses were malignant and the remaining 149 masses were benign. In addition, each lesion involved in this dataset was detectable in both craniocaudal (CC) and mediolateral oblique (MLO) view images. The center locations of all masses were previously marked by the radiologists on the corresponding digital mammograms.
Overview of CAD scheme
Figure 1 shows the flowchart of developing the proposed CAD scheme. In reading a mammogram, if a suspicious mass region is detected, an observer (a radiologist) clicks the seed point around the center of the lesion in the image display workstation. CAD scheme first applies a modified region-growing based algorithm to segment the mass region from the background. Subsequently, 14 features describing five visually sensitive characteristics of segmented lesions (i.e. size, shape, contrast, homogeneity and spiculation) are calculated and extracted. In CAD development stage, in order to identify an optimal approach to represent each of 5 visually sensitive features, student t-tests are then performed to select one optimal feature computation method in each of the five features groups. Lastly, a logistic regression classifier is trained to fuse the information from the five optimal features and generate a likelihood score indicating the probability of malignancy.
A lesion segmentation algorithm
The first step of our CAD scheme is to segment the suspicious mass region depicting on the mammogram. By comparing a large number of previously developed algorithms for segmenting breast masses depicting on mammograms [18], we modified a unique region growing technique that was reported in a previous study [19] and develop a new mass segmentation algorithm implemented in our new CAD scheme. Specifically, an initial growth seed was placed at the mass center location (μ
x
, μ
y
), which had been marked by the radiologists. Based on the a priori knowledge that breast masses tend to be compact and convex, a two-dimensional Gaussian function centered at the seed point (μ
x
, μ
y
) with variance σ2 was used to limit the pixels far away from the seed point. A Gaussian-constraint image was generated by the multiplication of the original image and the Gaussian function. Subsequently, a conventional region growing algorithm was applied on the Gaussian-constraint image to define the mass region. The region growing algorithm starts from the initial seed point and grows iteratively according to some similarity criterions. Here we included a parameter t to define the criterion: the candidate pixel is included in the mass region if and only if the pixel value is greater than the following threshold:
Where is the average pixel gray value inside the current segmented mass region and t is the predefined parameter.
The next step is to determine the two parameters (σ and t) involved in the algorithm. σ
i
and t
j
are denoted as candidate values for σ and t, which are sampled from a predefined range. A two-dimensional grid search method was used to apply every possible pair (σ
i
, t
j
) in the region growing algorithm to define a segmented mass region, and then compute the cost function of the specific pair. The cost function is defined as:
where circ is the standard deviation of the distances from the boundary pixels to the center pixel, which is used to measure the circularity of the segmented region; grad is the average gradient vector magnitude of the boundary pixels, which is used to measure the sharpness of the segmented boundary; and w is a predefined weighting factor which was determined by subjective evaluation based on our experimental results. The cost function is constructed based on the a priori information that the mass region tends to have a circular shape and a sharp boundary. Finally, the (σ i , t j ) pair generating the smallest cost function was considered as the parameter to segment the mass region. The segmentation results are not quite sensitive to the selection of weight factor w; simple averaging (i.e. w = 1) can get satisfactory results.
This algorithm was first applied to segment each suspicious mass region. Since a previous comprehensive review study [18] has shown that no breast mass segmentation algorithm is perfect or always superior to others, the automated segmentation result of each lesion was visually inspected. Manual correction was applied to adjust parameters used in the region growing algorithm or directly modify part of the mass boundary contour, if significant error was observed, which typically happens in approximately less than 15% of mass regions. Hence, this is an interactive CAD scheme to maintain high accuracy and reliability in computing image features. The similar mass segmentation scheme has been used and successfully tested in an interactive CAD workstation to process and analyze the suspicious mammographic masses queried by radiologists [20].
After mass region segmentation, we investigated and computed a set of image features that are easily caught by (or sensitive to) human eyes and routinely evaluated (or rated) by the radiologists in reading and interpreting mammograms. These features are computed to characterize mass size, shape, contrast, homogeneity and spiculation. Since different image features can be defined to represent each of these mass characteristics, we first computed a number of image features using different definitions or methods in each category. We then analyzed the correlation and compared their discriminatory power to select one optimal feature to represent each of these five mass characteristics. The following are brief descriptions of the computed features.
Mass size
We computed 3 image features in this category. The first one is the mass area, which is computed by automatically counting the total number of pixels inside the segmented mass region, and then multiplying the pixel size (or spatial resolution of mammogram). In addition, since in clinical practice, radiologists use the radial length of the lesion to measure mass size (based on RECIST guidelines [21]), we computed the second feature that is the normalized mean radiant length computed by the mean radial length divide by the total number of pixels inside the mass region, and the third feature that is the maximum radial length.
Mass shape
Radiologists typically rate mass shape into round, oval, or irregular. To quantify these ratings, we computed two most commonly used shape-related features [22]. The first one is a shape factor ratio defined as P2 /A, where P and A are the perimeter and area of a lesion region, respectively. The second one is a radiant length coefficient of variation. Using the radial length (r i ) that is the distance between the mass center and pixel (i) located at the mass boundary, this feature is defined as the coefficient of variation of r i , which can be computed by standard deviation of r i divided by mean value of r i .
Mass contrast
In order to compute the contrast related features, different types of mass outside surrounding area can be selected, which will have different impact on the computational results [23]. In this study, we used the method reported by te Brake et al. [24] to define the mass surrounding area. First, a morphological dilation operation with a spherical kernel of size 0.6R was performed on the segmented mass region, where R is the mean radial length () of the mass region. Then the pixels inside the dilated region but outside the mass region were labelled as “outside surrounding area.” Figure 2 shows an example of the outside surrounding area that was used to compute the contrast features. Next, three contrast related features were defined and computed. The first one is computed by the difference between the average pixel values inside the segmented mass region and its surrounding (outside) area. The second one is computed based on a distance measure between the two pixel intensity histograms, which can be computed as:
Where I is denoted as the set of pixels in the mass region, O is denoted as the set of pixels in the outside surrounding area. The last contrast related feature is computed based on the average gradient vector magnitude of the boundary pixels, which is related to the sharpness of the boundary.
The degree of mass density heterogeneity phenotypes contains biologically important tumor development patterns including the degree of tumor stiffness variation and necrosis. To quantify mass density homogeneity, we computed four features. These are 1) standard deviation of pixel intensities inside the mass region. 2) Kurtosis of pixel intensities inside the mass region. 3) Average local pixel intensity fluctuation in the mass region as defined in our previous study [22], where the local pixel intensity fluctuation of a pixel is defined as the maximum absolute difference between the pixel intensity and the intensity of pixels inside a 5×5 square kernel centered at that pixel. 4) Standard deviation of the local pixel intensity fluctuation inside the segmented mass region.
Mass spiculation
The degree of mass boundary spiculation is another primary characteristic indicating mass malignancy. In this study, we used two radial edge-gradient analysis based features to measure the spiculation of a segmented mass. First, a 3×3 mean filtering was performed on the mammogram as a preprocessing step. A morphological dilation and erosion operation was then applied to the segmented mass region, respectively. The difference between the dilated and eroded image was extracted as the “lesion boundary area.” For each pixel inside the lesion boundary area, the maximum gradient at that pixel and the radial direction from the mass center to the boundary pixel were computed, respectively. Next, the “radial angle” was obtained by the angle between the two vectors (i.e. maximum gradient and radial direction). The radial angles of all pixels in the lesion boundary were collected to form a radial angle histogram. Based on the a priori knowledge that if a mass boundary is not spiculated, the radial angle histogram will tend to be compact and accumulate near 0°, we extracted the kurtosis of the distribution as the first spiculation-related feature. Then the number of pixels whose radial angles were between 60° and 120° or –60° and –120° were counted as “spiculated” pixel number. The spiculated pixel number divided by total pixel number inside the mass boundary area was calculated as the second feature. Figure 3 shows two examples of radial angle histograms, where the first one is from a less-spiculated mass and the second one is from a spiculated mass.
In summary, a total of 14 image features were computed to quantify and represent 5 types of mass characteristics. All values in each of 14 image features were normalized with mean equal to zero and standard deviation equal to one. Figure 4 shows an example of the Spearman’s correlation matrix of 14 features, which was generated using all segmented mass regions depicting on the CC view images. The figure shows that the features belonging to the same feature group typically have higher correlation coefficients. Selecting an optimal feature in each mass characteristic group can reduce the redundancy of corresponding features and increase the robustness of the classifier developed in the following step. Hence, our goal in the next step is to identify 5 optimal features (one for each mass characteristic).
Classification and evaluation
The last step of our CAD scheme applies a machine learning model to classify the lesion by generating a likelihood score of the segmented mass region associating with a malignant mass based on 5 optimally selected image features (one in each category as discussed in section 2.3). In order to make reasoning and internal relationship (or linkage) of the classifier easy explainable or understandable, we trained and implemented a logistic regression classifier into the CAD scheme. Logistic regression is a relatively simple “linear” classifier because its output is only determined by the linear combination of the selected features. The working mechanism of the logistic regression classifier is transparent to the users as the optimized weights of the linear combination can reflect how the features contribute to the classification process.
Next, we embedded the feature selection process into a leave-one-case-out (LOCO) training and testing method to evaluate classifier performance, which enables to minimize both feature selection and classifier training/testing bias due to the limited image dataset [25]. The logistic regression based classifier was built using publically available R project for statistical computing (https://www.r-project.org/). Specifically, in each training and testing cycle of the LOCO, 300 cases were arranged as the training set. Using these 300 cases, the scheme identified one optimal feature with highest discriminatory power among all the features in the same mass characteristic group. For this purpose, a Student’s t-test was performed with the null hypothesis that the mean of feature values in the benign case group is identical to that in the malignant case group. The p-values of the t-test for all features in the same feature group were computed and ranked. The feature with smallest p-value was selected as an optimal (or “best representative”) feature of this lesion characteristic group. The classifier was then trained using the five selected features (one in each feature group) over the training dataset of 300 cases. The trained classifier was applied to the remaining case (testing case) to compute a classification score. The classification scores range from 0 to 1, indicating the probability of the mass regions associated with malignant masses. This procedure was iteratively performed 301 times using different training and testing datasets. As a result, each of 301 cases in our dataset was used as a testing case once and obtained a classification score. Subsequently, the classification scores of all cases were processed by a publically available receiver operating characteristic (ROC) curve fitting program based on the use of a maximum likelihood data analysis method (ROCKIT, http://www.xray.bsd.uchicago.edu/krl/roc_soft.htm, University of Chicago). AUC was used as the performance assessment index to evaluate the performance of our CAD scheme.
Since FFDM images are two-dimensional projection images and the image features computed from the same mass depicting on the CC and MLO view images may not be the same due to difference of overlapping breast tissue, we trained and tested two sets of logistic regression based machine learning classifiers using the image features computed from the segmented mass regions depicted on either CC or MLO view mammograms, respectively, in this study. The AUC values of two classifiers were also separately computed and evaluated. Finally, we applied and tested a simple fusion method to combine two classification scores of the same lesion depicted on two CC and MLO view mammograms [26]. Thus, we generated a mass-based classification score for all 301 lesions in our dataset. The mass-based classification performance level (AUC) was also computed and compared with the AUC values generated using a single mass region depicting on either CC or MLO view mammograms.
Finally, in order to make the decision-making process more transparent to the users, we also designed a graphic user interface (GUI). In this way, GUI will display the seed point inside the mass region by marked or clicked by the user, automatically segmented mass boundary contour, the computed feature values, and the classification score. The examples of this interface can be found in the following Results section of this article.
Result
Based on the CAD-generated mass segmentation results, the average size of benign masses was 10.37±16.39 mm2 for MLO views and 9.13±18.12 mm2 for CC views, while the average lesion size of benign masses was 13.69±11.15mm2for MLO views and 11.94±12.20 mm2 for CC views. The results indicated that (1) the differences between the malignant mass sizes and benign mass sizes are not statistically significant in CC view images of our testing dataset (p = 0.115); while the malignant masses are significantly larger than benign masses in MLO view images (p = 0.041), and (2) due to the projection difference, the size of a mass may be different in CC and MLO view images. Figure 5 shows an interface of our CAD scheme displayed on the computer workstation with two examples. One includes a benign mass region and one depicts a malignant mass region. For each detected (or queried) suspicious mass region, our CAD scheme provides the observers (e.g., radiologists) the segmented mass boundary contour, 5 normalized values in 5 image feature categories to show the quantitative levels of computed mass size, shape, contrast, homogeneity and spiculation, as well as the classification score (the probability of being malignant).
Table 2 summarizes the frequency of the image features selected in the LOCO training and testing process using segmented regions from either CC view or MLO view. It indicates that except one feature in Spiculation category of CC view images, the exactly same (or consistent) combination of 5 optimal features were selected (i.e. feature 1, 4, 6, 9 and 14) in all 301 LOCO training and testing cycles. Table 3 summarizes and compares the discriminatory power (or AUC value) of using each of the 5 most frequently selected mass characteristic features. The results show that all of these 5 visually sensitive features can make contribution to classify between the malignant and benign mass regions with the AUC values that are significantly greater than a random guess (AUC = 0.5). Figure 6 displays and compares two ROC curves generated using the logistic regression classifiers that integrate the 5 selected image features and LOCO training/testing using the mass regions segmented from the MLO and CC view mammograms respectively. The AUC value of 0.758 with 95% confidence interval of (0.702, 0.808) was obtained from the classifier using MLO view mammograms and it is significantly greater than the AUC values obtained from each of the five single optimal features using MLO view mammograms shown in Table 3. For the classifier using CC view mammograms, AUC = 0.786 with 95% confidence interval of (0.732 0.833); it is significantly greater than the AUC values obtained from four of the five single optimal features (i.e. feature 1, 4, 9, 14) using CC view images and the difference is not significant for the remaining one (i.e. feature 6). Two AUC values (i.e. obtained from the classifier using MLO and CC view mammograms respectively) were not statistically significantly different (with a 2-tailed p-value of 0.370 computed using ROCKIT program). The results show that using a simple machine learning classifier enables to further increase the classification performance than using a single image feature (Table 3).
Table 4 includes two confusion matrices of the classification results of two classifiers that were trained/tested using MLO and CC view images when applying a threshold of 0.5 (at the middle point in the range of the classification scores) to divide between malignant and benign case groups. The results show that applying the new CAD scheme to our testing dataset, 70.1% (211/301) of the mass regions depicted on either MLO view image or CC view images were correctly classified.
Table 5 summarizes and compares the classification performance levels (AUC values) of applying 3 simple and direct fusion methods to combine the classification scores of two mass regions depicted on CC and MLO view mammograms. It shows that by using the average classification score between two mass regions, the fused mass-based classification performance level yielded a “best” AUC = 0.806±0.025. It is significantly higher than the AUC value yielded using a logistic regression based classifier optimized using the mass regions depicted on MLO view images (with the computed 2-tailed p = 0.006); while the difference between AUCs obtained by average fusing and by using logistic regression optimized by the CC view mammograms are not statistically significant (p = 0.344).
Discussion
This study is part of our continuing effort to develop and evaluate computer-aided quantitative image feature analysis schemes to assist predicting cancer risk [27], improving tumor detection [28] and diagnosis [29, 30], and assessing patient prognosis or treatment efficacy [31–33]. Among them, developing a more effective CAD tool to assist classifying between malignant and benign soft breast tissue masses is also important to help increase efficacy of screening mammography. Although great research effort has been made in developing CAD schemes aiming to better classify suspicious mammographic masses, due to the difference between human vision and computer vision, the confidence level of radiologists to accept or consider the classification results generated by the “black-box” type CAD schemes is quite low [16, 17] because classification accuracy is not the only objective in many machine learning applications whereas interpretability is also an important one [34]. In order to help solve this application problem, we in this study explored a new approach to develop and test CAD scheme. Although a large number of texture features can be computed from mammograms and applied to develop CAD schemes, these features are typically not visually sensitive or easily explainable to the observers. The new CAD scheme presented in this study uses 5 simple and visually sensitive image features. Using these image features, we aim to mimic and quantify 5 important characteristics of mammographic masses, which are routinely used by radiologists to interpret or assess the likelihood of a detected suspicious breast mass being malignant. The potential advantage of this new approach is not limited to only increasing the confidence of radiologists on the CAD results as these features are visually sensitive and understandable, but also eliminates the inter-observer variability in rating these features.
Although we cannot directly compare the classification performance of our CAD schemes with the performance levels of previously reported CAD schemes (e.g., Table 1) due to the use of different testing image datasets, we demonstrated the feasibility of applying a new CAD scheme optimized using a set of simple and visually sensitive image features to yield a comparable performance level of the CAD schemes that are optimized using much more complicated texture features and machine learning classifiers (i.e., artificial neural networks). In this way, under a comparable CAD performance level, using our CAD scheme can provide radiologists not only a classification score, but also the segmented mass boundary contour along with 5 quantitative image features that are routinely visually assessed or rated in the current clinical practice (as shown in Fig. 5). This is a new “visual aided” CAD based classification approach, which enables us to provide radiologists with a new computer-assisted mammogram reading and interpretation environment in future observer performance studies, which is always an important task in medical imaging reading and diagnosis field [35].
In our study, we also compared the classification performance levels using different image feature sets and classifiers. First, we tested another feature selection method using a non-parametric (or Wilcoxon) rank sum test and re-evaluated the logistic regression classifier using the same LOCO training and testing method. The results showed that AUC values of 0.758±0.027 and 0.736±0.028 were obtained from the MLO and CC view image based classifier, respectively. Comparing to the AUC values yielded using the Student’s t-test based feature selection (as shown in Fig. 6), we observed when using Wilcoxon rank sum test based feature selection, the classifier that was trained and tested using MLO view images yielded the same classification performance (p = 1.0) as the same features were selected. In contrast, different features were selected using the Wilcoxon rank sum test for the classifier trained and tested using CC view images, which yielded a significantly lower performance (p = 0.02). This indicates that overall, using Student’s t-tests outperformed Wilcoxon rank sum tests for the feature selection tasks in this study.
Second, we also evaluated two logistic regression classifiers that were trained and tested using a total of 14 features (as shown in Table 2) from the mass regions depicting on either MLO or CC view images, respectively. The results show that AUC values of 0.730±0.029 and 0.776±0.026 were obtained using the two MLO and CC view image based classifiers that were trained and tested using the same LOCO cross-validation method embedded with feature selection. Statistically significant differences were detected between the logistic regression classifier using all 14 features and the 5 optimal features computed from the mass regions from MLO view images (p < 0.05); while for CC view images, the AUC value obtained using the 14 features based classifier was lower than using the optimal features based classifier, but the difference is not significant (p = 0.48). This indicates that removing redundant image features may potentially improve classifier performance and robustness of the classifier in future tests using new independent image datasets.
In addition, we have made a number of new observations from our study results. First, in medical image processing, many different methods or formulas can be used to compute and quantify the same category of features (i.e., contrast and spiculation). In this study, we tested several different methods to compute or quantify the 5 categories of mass features. The comparison results show that using one optimal feature in each of the 5 categories can yield better classification performance level compared to using all 14 extracted features, which indicates that the non-optimal or redundant features can be filtered out and removed by the Student’s t-test based feature selection method. Therefore, we demonstrated that an optimal feature with highest discriminatory power in one category might be easily selected by applying a simple Student’s t-test based feature selection method and can potentially improve the performance level of the classifier. Second, since mammograms are two-dimensional projection images, the image features computed from two mass regions depicting on CC and MLO view images may have substantial variation, which also affects the two classification scores. Hence, using a fusion method to combine the two classification scores yielded from two mass regions to generate a final mass-based classification score also has potential to significantly increase the classification performance level compared to using a single mass region based scheme.
In summary, we developed a new CAD scheme for classification between malignant and benign mammographic mass regions using 5 visually sensitive image features and a straightforward logistic regression model based classifier. We demonstrated that by more closely mimicking how the radiologists classify mammographic lesions, a new and simple scheme can yield a comparable classification performance as compared to other previously developed and reported schemes (e.g., Table 1).Meanwhile, due to its simplicity and more transparent reasoning of the feature selection and case classification process, using this new CAD scheme has potential to help increase radiologists’ confidence to accept or consider the CAD-generated classification results. However, despite the encouraging study results, this is a preliminary study with a number of limitations. First, this study only used a relatively small image dataset, which may not represent general breast masses in a diverse screening mammography practice. Thus, performance and robustness of the CAD scheme needs to be further tested using new larger and more diverse image databases. Second, this is only a technology development study with a proposed new CAD based classification method. How to use this new CAD scheme and GUI method to help increase radiologists’ confidence in CAD generated classification results and improve their decision-making in classifying between malignant and benign masses (i.e., reducing false-positive recalls) needs to be tested in future observer performance studies. Therefore, more research work is still needed to optimize and test the performance and robustness of this new type of CAD scheme.
Footnotes
Acknowledgments
This work is supported in part by Grants of R01CA160205 and R01CA197150 from the National Cancer Institute, National Institutes of Health. The authors would also like to acknowledge the support from the Peggy and Charles Stephenson Cancer Center, University of Oklahoma as well.
