Abstract
Introduction
Accurate, resource-conserving methods to classify skin lesions are in great demand given the rapidly increasing incidence of melanoma globally 1 and the challenges of accessing timely, high-quality care due to the global dermatology workforce shortage. 2 Dermoscopy, or epiluminescent microscopy, has improved the accuracy of melanoma classification compared with unaided visual diagnosis by clinicians, but dermoscopes are relatively expensive, limiting their usefulness in low-resource countries, and the accuracy rates vary markedly depending on the experience level of the user. 3–7 Several groups have developed computer algorithms to improve classification of skin lesions from digital images, but many require dermoscope images, and most have only been evaluated on small datasets. 8–10
We have developed a unique computerized classification system that combines aspects of computer vision technology with big data methodology. The system uses a patented algorithm combined with a large proprietary database of diagnosed lesion images to match new images with database images. In the current study, we evaluated the accuracy of the algorithm to classify melanoma lesions 10 mm or larger in a test database of histopathologically diagnosed skin lesions.
Materials and Methods
Image Database Development
Study personnel obtained written informed consent from each study participant according to the protocol approved by Fox Commercial Institutional Review Board, Ltd. (Springfield, IL).
To create a database of lesion images from a diverse population, study personnel recruited English-speaking volunteers, 18 years of age and older, from a community location in an ethnically diverse suburban neighborhood in Los Angeles County, California from April 3, 2011 through July 29, 2011. Participants self-reported basic demographic information, and they presented to study personnel one or more lesions on the visible skin that they were willing to have photographed. Study personnel then photographed the lesions using Celestron® (Torrance, CA) hand-held digital microscopes. These “microscopes” were 2-megapixel cameras with a macro lens surrounded by a ring of white light-emitting diode lights. To ensure consistent lighting and imaging distance, we attached an opaque 10-cm tube to the front of each camera. With the open end of the tube in contact with the subject's skin, all ambient light, but not the light-emitting diode light, was blocked, and the imaging distance, and therefore magnification, was fixed.
Each participant's lesion images were uploaded into the image database, and at least one of three board-certified dermatologists later reviewed and diagnosed the lesions using standard clinical criteria. For lesion images diagnosed by more than one dermatologist, we examined agreement between dermatologists, who in some cases provided more than one possible diagnosis, ranked by their degree of confidence in each diagnosis. After a subsequent qualitative review of agreement among dermatologists, we elected to eliminate the diagnoses provided by one dermatologist because of substantial inconsistencies in the data and poor agreement with the other two dermatologists. We then developed a decision-tree algorithm to assign a diagnosis to each image. If only one of the two remaining dermatologists reviewed the lesion image, the algorithm assigned the listed diagnosis if the dermatologist provided a single diagnosis, or the diagnosis with the highest confidence if the dermatologist provided more than one. If both remaining dermatologists reviewed the lesion image, the algorithm assigned a diagnosis based on the degree of confidence in each specific diagnosis; if the dermatologists' highest-confidence diagnoses did not match, the image was eliminated from the database. However, consistent with the dermatology standard of care, if any of the diagnoses was for a malignant condition, we assigned the malignant diagnosis.
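The assignment rules above can be sketched in a few lines of Python. This is a minimal illustration, not the study's actual decision tree: the `MALIGNANT` label set, the numeric confidence scores, and the function names are hypothetical, and the sketch assumes that the malignant override takes precedence over the disagreement rule, which the text implies but does not state explicitly.

```python
from typing import Optional

# Hypothetical set of malignant labels; the paper does not list them.
MALIGNANT = {"melanoma", "basal cell carcinoma", "squamous cell carcinoma"}

# One dermatologist's review: (diagnosis, confidence) pairs, assumed non-empty.
Review = list[tuple[str, float]]


def top_diagnosis(review: Review) -> str:
    """Return the diagnosis with the highest confidence in one review."""
    return max(review, key=lambda pair: pair[1])[0]


def assign_diagnosis(reviews: list[Review]) -> Optional[str]:
    """Assign one label to an image, or None if the image is eliminated."""
    # Malignant override (assumed to apply first): any malignant candidate wins.
    malignant = [pair for review in reviews for pair in review
                 if pair[0] in MALIGNANT]
    if malignant:
        return max(malignant, key=lambda pair: pair[1])[0]
    # Collect each reviewer's highest-confidence diagnosis.
    tops = {top_diagnosis(review) for review in reviews}
    # One reviewer, or two reviewers who agree, leaves exactly one label.
    if len(tops) == 1:
        return tops.pop()
    # Highest-confidence diagnoses disagree: drop the image.
    return None
```

For example, `assign_diagnosis([[("nevus", 0.9)], [("seborrheic keratosis", 0.8)]])` returns `None` because the two top diagnoses disagree, whereas adding a low-confidence `("melanoma", 0.1)` entry to either review would return `"melanoma"` under the assumed override.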
The participants self-reported birth year, gender, and race or ethnicity (American Indian/Alaska Native, Asian/Pacific Islander, black/African American, white/Caucasian, or other). Participants also self-reported Hispanic/Latino ethnicity (yes/no). Study personnel de-identified the data by creating a unique alphanumeric identifier for each participant that linked his or her demographic data to his or her individual lesion images.
Due to the low population prevalence of skin malignancies, we enriched the database with images of melanoma that had been previously confirmed by histopathology. We acquired the images from DermNet NZ, 11 a well-known and reliable source of skin-lesion images.
Image-Search Algorithm and Query Images
We created a proprietary, patent-protected, image-search algorithm that builds on proven computer vision methods, in particular from the field of content-based image retrieval (CBIR). Our algorithm compares new images of skin lesions (“query images”) with the database of diagnosed skin-lesion images (“database images”). It uses orientation- and artifact-independent image information on lesion size, color, shape, and texture to create a single high-dimensional signature for each image. The algorithm then computes the distance between the query image's signature and those of the database images to determine which database images are closest to the query image. In CBIR terms, the best matching database images are the query results. Query results are then converted into an estimate of the query diagnosis through majority voting. This is equivalent to constructing a k-nearest-neighbor classifier, 12 where a diagnosis is assigned to the query based on the frequency of diagnostic labels attached to the images in the CBIR result set. To evaluate classifier accuracy, we scored each query on the diagnosis that received the most votes.
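The retrieval-then-vote step can be illustrated with a generic k-nearest-neighbor sketch. The study's signature and distance function are proprietary and not described here, so the short numeric tuples, the Euclidean distance, the value of `k`, and the function names below are all assumptions chosen only to show the majority-voting mechanism.

```python
import math
from collections import Counter


def euclidean(a, b):
    """Euclidean distance between two equal-length signatures (assumed metric)."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def classify(query, database, k=5):
    """Return (winning diagnosis, vote counts) for a query signature.

    database is a list of (signature, diagnosis) pairs.
    """
    # Query results: the k database images closest to the query.
    nearest = sorted(database, key=lambda item: euclidean(query, item[0]))[:k]
    # Majority voting over the diagnostic labels of the query results.
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0], votes


# Toy two-dimensional signatures, purely illustrative.
toy_database = [
    ((0.0, 0.0), "nevus"), ((0.0, 1.0), "nevus"), ((1.0, 0.0), "nevus"),
    ((5.0, 5.0), "melanoma"), ((5.0, 6.0), "melanoma"), ((6.0, 5.0), "melanoma"),
]
label, votes = classify((5.2, 5.1), toy_database, k=5)
```

With the toy data, the five nearest neighbors of `(5.2, 5.1)` are three melanoma and two nevus images, so the vote returns melanoma. An odd `k` avoids ties in a two-class setting; how the actual system handles ties is not stated in the text.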
To assess the accuracy of the image-search algorithm to classify melanomas, we randomly selected 129 images of nonmelanoma lesions and 208 images of melanoma lesions, all with the largest diameter of at least 10 mm. All melanoma query images were selected from the set of images acquired from DermNet NZ to ensure confirmation of the malignancy by histopathology. The nonmelanoma query images were randomly selected from the study database of images collected at the community location; all lesions imaged at the community center were diagnosed clinically. We based the sample sizes on results from prior sample-size calculations to detect a sensitivity of 85% and a specificity of 90% at a 95% confidence level.
Data Analysis
We examined the demographic characteristics of participants who contributed lesion images to the image database by determining counts and percentages across categories within age, gender, and race/ethnicity. We also examined counts and percentages of queried and database images by melanoma and nonmelanoma diagnosis and by category of diagnosis among nonmelanoma lesions.
To evaluate the ability of the image-match algorithm to accurately discriminate between melanoma and nonmelanoma lesions, we calculated several classification accuracy measures: sensitivity (the ratio of the number of true melanomas that the algorithm correctly classified as melanoma to the number of all true melanomas), specificity (the ratio of the number of true nonmelanomas that the algorithm correctly classified as nonmelanoma to the number of all true nonmelanomas), positive predictive value (PPV) (the ratio of the number of true melanomas that the algorithm correctly classified as melanoma to the number of all lesions the algorithm classified as melanoma), negative predictive value (NPV) (the ratio of the number of true nonmelanomas that the algorithm correctly classified as nonmelanoma to the number of all lesions the algorithm classified as nonmelanoma), overall accuracy (the ratio of the number of true melanomas and true nonmelanomas correctly classified by the algorithm to the total number of lesions evaluated by the algorithm), positive likelihood ratio (the ratio of the probability that the algorithm classified a true melanoma as melanoma to the probability that it classified a true nonmelanoma as melanoma, that is, sensitivity/[1 – specificity]), and negative likelihood ratio (the ratio of the probability that the algorithm classified a true melanoma as nonmelanoma to the probability that it classified a true nonmelanoma as nonmelanoma, that is, [1 – sensitivity]/specificity). We calculated these estimates and their corresponding 95% confidence intervals using standard methods. 13,14
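The measures defined above all follow from the four cells of a 2 × 2 classification table. The sketch below computes them from hypothetical counts (90 true positives, 10 false negatives, 80 true negatives, 20 false positives) that are not the study's data. The text does not say which confidence-interval method was used, so the Wilson score interval shown here is only one common choice and should be read as an assumption.

```python
import math


def accuracy_measures(tp, fn, tn, fp):
    """Compute classification accuracy measures from 2 x 2 table counts."""
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    return {
        "sensitivity": sensitivity,
        "specificity": specificity,
        "ppv": tp / (tp + fp),
        "npv": tn / (tn + fn),
        "accuracy": (tp + tn) / (tp + fn + tn + fp),
        "lr_positive": sensitivity / (1 - specificity),
        "lr_negative": (1 - sensitivity) / specificity,
    }


def wilson_interval(successes, n, z=1.96):
    """Wilson score interval for a proportion (assumed method, 95% by default)."""
    p = successes / n
    denom = 1 + z ** 2 / n
    centre = (p + z ** 2 / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z ** 2 / (4 * n ** 2)) / denom
    return centre - half, centre + half


# Hypothetical counts for illustration only.
measures = accuracy_measures(tp=90, fn=10, tn=80, fp=20)
sens_low, sens_high = wilson_interval(successes=90, n=100)
```

With these illustrative counts, sensitivity is 0.90, specificity is 0.80, overall accuracy is 0.85, and the positive and negative likelihood ratios are about 4.5 and 0.125.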
Results
In total, 1,900 participants were recruited by study personnel and agreed to allow study team members to capture digital images of their skin lesions for inclusion in the skin-lesion image database. Study personnel asked participants to report information about specific demographic variables, and for each variable over 90% of all recruited participants provided responses (Table 1). In addition to the recruited participants, we enriched the study database with images of histopathologically diagnosed melanoma lesions, for a total of 2,202 individual image donors. Demographic data were unavailable for the individuals who provided the additional melanoma images that we acquired to enrich the study database.
Table 1. Demographics of the Individuals Who Contributed One or More Images to the Lesion Image Database
Category includes 302 individuals whose melanoma images were acquired from the DermNet NZ database. 11
Overall, the recruited participants were fairly young, with slightly more than half under the age of 35 years, but older participants, who are likely to present with different conditions and skin types and tones, were well represented (Table 1). Participants were predominantly female and ethnically diverse, with just under one-third identifying as white/Caucasian and nearly 40% identifying as Hispanic or Latino.
The study database of 11,780 images (Table 2) included all 11,478 images from the 1,900 participants and an additional 302 images from the DermNet NZ database. Melanoma diagnoses accounted for about 1 in 10 database images and nearly two-thirds of query images. The distribution of images by diagnosis was similar for nonmelanoma query images and the database images, as expected given that the nonmelanoma query images were randomly selected from the database images. The combined nevus diagnoses accounted for more than four-fifths of all nonmelanoma database images and just over three-quarters of all query diagnoses.
Table 2. Distribution of Queried and Database Images by Lesion Diagnosis
Diagnoses of all queried melanoma image lesions were confirmed by histopathology.
Nearly every measure of the algorithm's accuracy in identifying melanoma and nonmelanoma lesions 10 mm or larger exceeded 90% (Table 3). The algorithm correctly identified more than 90% of true melanoma lesions and more than 90% of true nonmelanoma lesions. Of the lesions the algorithm classified as melanoma, more than 94% were in fact melanoma; of the lesions it classified as nonmelanoma, more than 85% were not melanoma. Overall, the algorithm correctly classified nearly 91% of all lesions, with a lower confidence limit greater than 87%.
Table 3. Algorithm-Matching Results and Clinical Accuracy Measures for Melanoma Versus Nonmelanoma Lesions with Maximum Diameter of 10 mm or Larger
CI, confidence interval; FN, false negative; FP, false positive; SN, sensitivity; SP, specificity; TN, true negative; TP, true positive.
In addition, the likelihood ratios were highly discriminatory. The probability that a true melanoma would be classified as melanoma was more than 10 times the probability that a true nonmelanoma would be classified as melanoma. The probability that a true melanoma would be classified as nonmelanoma was only about one-tenth the probability that a true nonmelanoma would be classified as nonmelanoma.
Discussion
The image-matching algorithm performed with high accuracy for the classification of larger melanoma lesions, exceeding reported accuracy rates that typically range from 70% to 86% among experienced board-certified dermatologists, the current gold standard for clinical diagnosis of skin lesions. 15–17 Notably, compared with a study by Carli et al. 18 that, like the present study, specifically examined accuracy rates for classification of large lesions (10 mm or larger), the image-matching algorithm outperformed practicing dermatologists visually diagnosing lesions (algorithm versus naked eye examination: sensitivity, 90.4% versus 82.9%; specificity, 91.5% versus 75.8%; and overall accuracy, 90.8% versus 79.5%). The same study 18 also reported accuracy rates for visual examination combined with dermoscopy; the sensitivity of the algorithm was slightly lower (algorithm versus visual examination plus dermoscopy, 90.4% versus 93.3%), but the specificity and overall accuracy were substantially higher for the algorithm (algorithm versus visual examination plus dermoscopy: specificity, 91.5% versus 77.3%; overall accuracy, 90.8% versus 85.6%). Several other studies reported dermatologists' diagnostic accuracy rates using dermoscopy, and the image-matching algorithm generally achieved comparable or higher sensitivity and specificity. 3–7,15–17
The algorithm also demonstrated high PPV and NPV in this test database enriched for melanoma images. The high PPV was not surprising given that PPV and NPV are affected by the prevalence of the condition, and the prevalence in this test dataset was high. We would therefore expect the algorithm to achieve a lower PPV, and a higher NPV, in a population with a lower prevalence of melanoma, as would be encountered in a typical dermatology or primary care setting. However, the lower PPV expected in a general or screened population does not negate the value of this or any other screening test because, if the sensitivity of the test is very high, the potential benefits of the test due to increased survival and reduced healthcare costs through earlier detection may be greater than the cost of performing the test.
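The dependence of the predictive values on prevalence can be made concrete with Bayes' theorem. The sketch below holds sensitivity and specificity fixed at the rounded values quoted in this Discussion (90.4% and 91.5%) and compares the query-set prevalence (208 melanomas among 337 query images) with a 1% prevalence. The 1% figure is a hypothetical value chosen for illustration, not an estimate from this study, and the function name is likewise illustrative.

```python
def predictive_values(sensitivity, specificity, prevalence):
    """Return (PPV, NPV) for a given prevalence using Bayes' theorem."""
    true_pos = sensitivity * prevalence
    false_pos = (1 - specificity) * (1 - prevalence)
    true_neg = specificity * (1 - prevalence)
    false_neg = (1 - sensitivity) * prevalence
    ppv = true_pos / (true_pos + false_pos)
    npv = true_neg / (true_neg + false_neg)
    return ppv, npv


SENSITIVITY = 0.904  # rounded value quoted in the Discussion
SPECIFICITY = 0.915  # rounded value quoted in the Discussion

# Prevalence in the enriched query set: 208 melanomas of 337 query images.
ppv_query, npv_query = predictive_values(SENSITIVITY, SPECIFICITY, 208 / 337)

# Hypothetical low-prevalence setting (1%), for illustration only.
ppv_low, npv_low = predictive_values(SENSITIVITY, SPECIFICITY, 0.01)
```

At the query-set prevalence the sketch gives a PPV of about 94% and an NPV of about 86%, in line with the reported results. At the hypothetical 1% prevalence the PPV drops below 10% while the NPV rises above 99%, which is why the predictive values reported here should not be carried over to a general population.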
The likelihood ratios, which are mathematically related to sensitivity and specificity, are not affected by the population prevalence of the disease. The algorithm's positive likelihood ratio was above the threshold of 10, which is considered conclusive evidence that the disease is likely to be present, and the algorithm's negative likelihood ratio was at the threshold of 0.1, which is considered conclusive evidence that the disease is not likely to be present. 13
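As a worked check, the two likelihood ratios can be recomputed from the rounded sensitivity (90.4%) and specificity (91.5%) quoted earlier in this Discussion. Because these inputs are rounded, the results may differ slightly from the values tabulated in Table 3.

```latex
\[
LR^{+} = \frac{\text{sensitivity}}{1 - \text{specificity}}
       = \frac{0.904}{1 - 0.915}
       = \frac{0.904}{0.085} \approx 10.6
\]
\[
LR^{-} = \frac{1 - \text{sensitivity}}{\text{specificity}}
       = \frac{1 - 0.904}{0.915}
       = \frac{0.096}{0.915} \approx 0.105
\]
```

Neither expression contains the prevalence, which is why the likelihood ratios, unlike PPV and NPV, carry over to populations with a different proportion of melanomas.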
We designed the algorithm to model dermatologists' approach to skin-lesion analysis: mental matching of a query image (a patient's lesion) with a personal “database” of images learned during medical school, residency, and routine clinical practice. Accurate classification is limited only by the size of the database and the user's recall ability. Our system replicated this model by creating a proprietary skin-lesion image database that included diagnosed lesions from participants diverse in age, race/ethnicity, skin type, and specific diagnosis. The system will increase in robustness over time through the addition of new participant records to the system database, similar to the continually increasing expertise of dermatologists through the daily examination of skin lesions. However, unlike dermatologists, who will ultimately retire and take their expertise with them, our image database will continually increase in the number of records and the diversity of participants, lesion types, and presentations, including skin conditions in children.
During algorithm development, we discovered that lesion size and pathology were complex drivers of image-signature design. We therefore elected first to maximize algorithm accuracy for the identification of larger lesions that are more likely to be clinically important (presented here). However, we believe that optimization for larger, riskier lesions does not diminish the algorithm's usefulness given the critical global need for simple, inexpensive, and accurate tools to classify skin lesions. Other highly effective medical technologies have been developed that have a lower limit of resolution, such as positron emission tomography, which has been established as an important clinical tool in the evaluation and management of cancer despite the limitation of its use to larger tumors. 19 However, in a second phase of algorithm development, we will conduct additional assessments of smaller lesions and apply that information to improve algorithm performance for those lesions as well as for nonmelanoma malignancies, including basal and squamous cell carcinomas, and pediatric lesions. Subsequent phases could include development of lesion evolution tracking to compare multiple images of single lesions over time because detection of changes in the growth or visual presentation of lesions is another important dimension to the classification of suspicious lesions.
A potential limitation of the system is the lack of demographic information associated with the images acquired from DermNet NZ that were used to enrich the database for melanoma and that were used as query images. We elected to enrich the database with 302 histopathologically confirmed melanoma lesion images because of the low prevalence of melanoma in the population of individuals we recruited to donate skin-lesion images for the study database. The additional images provided the algorithm with a greater representation of melanoma characteristics upon which to optimize the image-matching query. If the query images and the enriched melanoma images were obtained predominantly from white individuals, then the present study results may not directly reflect the robustness of the matching algorithm within diverse populations. However, all nonmelanoma images in the study database were derived from a highly diverse population of skin-lesion image donors, and the algorithm demonstrated very high specificity, NPV, and overall accuracy and a low negative likelihood ratio, suggesting that the algorithm classification works well even in diverse populations. Importantly, however, the algorithm uses only the information in the image, without supplementary demographic information, to classify the image, and the accuracy estimates (the correct classification of melanoma and nonmelanoma images) are internally valid and not affected by the lack of demographic data for this subset of images.
Another potential limitation of our image-matching algorithm is that it currently relies on a database of clinically diagnosed, rather than biopsy-proven, skin lesions. However, the CBIR technology that drives our algorithm exploits visual characteristics of the skin-lesion images that dermatologists have identified as common to a given lesion type. Therefore, even though diagnostic error is likely more frequent in our database of clinically diagnosed lesions than it would be in a database with biopsy-proven diagnoses, the visual characteristics consistent with the lesion types assigned by the dermatologists are sufficient to inform the algorithm and result in exceptionally high sensitivity and specificity of the matches. It is possible that a comparably sized database of biopsied images would give higher classification accuracy results, but the standard of care for lesion types that are clearly benign is clinical diagnosis, and therefore development of this algorithm would not be feasible if the database were restricted to biopsy-proven lesions.
In conclusion, this newly developed algorithm has the potential to improve classification of larger melanoma skin lesions, and ultimately all skin cancers, using digital images captured with low-cost cameras.
Acknowledgments
The authors wish to acknowledge the contributions of Yvonne Chen, Vice President of Operations, Lūbax, Inc., to the administration of the study. This study was funded by Lūbax, Inc.
Disclosure Statement
R.H.C. and M.S. are employees of Lūbax, Inc. R.H.C., M.S., and S.M.E. hold stock in Lūbax, Inc. S.M.E. and E.M. are paid consultants for Lūbax, Inc. J.M.K., V.A., and J.B. declare that no competing financial interests exist. J.M.K., V.A., and J.B. served as unpaid consultants on the study design and reviewed the database and algorithm output, but they have no financial or other interests in this technology or Lūbax, Inc. The terms of this arrangement have been reviewed and approved by Fox Commercial Institutional Review Board, Ltd. in accordance with its policy on objectivity in research.
