Abstract
Introduction and Objective:
Natural language processing (NLP) software programs have been widely developed to transform complex free text into simplified, organized data. Potential applications in the field of medicine include automated report summaries, physician alerts, patient repositories, electronic medical record (EMR) billing, and quality metric reports. Despite these prospects and the recent widespread adoption of the EMR, NLP remains relatively underutilized. The objective of this study was to evaluate the performance of an internally developed NLP program in extracting select pathologic findings from radical prostatectomy specimen reports in the EMR.
Methods:
A software engineer developed an NLP program to extract key variables from prostatectomy reports in the EMR within our healthcare system. The variables included the TNM stage, Gleason grade, presence of a tertiary Gleason pattern, histologic subtype, size of dominant tumor nodule, seminal vesicle invasion (SVI), perineural invasion (PNI), angiolymphatic invasion (ALI), extracapsular extension (ECE), and surgical margin status (SMS). The program was validated by comparing NLP results with a gold standard compiled by two blinded manual reviewers for 100 randomly selected pathology reports.
Results:
NLP demonstrated 100% accuracy for identifying the Gleason grade, presence of a tertiary Gleason pattern, SVI, ALI, and ECE. It also demonstrated near-perfect accuracy for extracting histologic subtype (99.0%), PNI (98.9%), TNM stage (98.0%), SMS (97.0%), and dominant tumor size (95.7%). The overall accuracy of NLP was 98.7%. NLP generated a result in <1 second, whereas the manual reviewers averaged 3.2 minutes per report.
Conclusions:
This novel program demonstrated high accuracy and efficiency in identifying key pathologic details from the prostatectomy report within an EMR system. NLP has the potential to assist urologists by summarizing and highlighting relevant information from verbose pathology reports. It may also facilitate future urologic research through the rapid and automated creation of large databases.
Introduction
As the electronic medical record (EMR) has gained widespread use in the United States and beyond, natural language processing (NLP) programs are slowly being adopted in medicine to generate automated patient safety alerts, such as those for adverse drug events or various disease conditions. 1 The technology has also been used effectively to create large clinical and research databases. 2 We previously examined the ability of an NLP program to extract significant details from prostate biopsy reports to build a prostate cancer repository and to assist clinicians managing newly diagnosed patients at our institution. 2
For men who undergo surgery for prostate cancer, several histopathological findings of the prostatectomy specimen (i.e., stage, Gleason score, presence of extracapsular extension [ECE], and surgical margin status [SMS]) are important predictors of biochemical relapse and help dictate the postoperative surveillance schedule and need for additional treatment. 3 Despite efforts to standardize radical prostatectomy specimen reports, there is still variability in reporting methods by pathologists. 4 Urologists are often required to read through lengthy documents, such as pathology reports, diagnostic tests, and progress notes, to identify key parameters that guide decision-making. A lack of standardization may lead to clinical inefficiency and, potentially, an adverse event.
The objective of this study was to evaluate whether an internally developed NLP program could accurately extract key pathological parameters from radical prostatectomy specimen reports.
Materials and Methods
This study was a cross-sectional analysis of a prospectively generated, institutional review board-approved database of prostate cancer patients within the Kaiser Permanente Southern California (KPSC) region. The database was generated to evaluate quality-of-life outcomes in men following various prostate cancer treatments at our 14 medical centers.
NLP program development
From March 2011 to October 2013, a software engineer designed the NLP program (KPSC Clinical Information Extraction System) to extract details from radical prostatectomy specimen reports within our EMR. The software programming methodology is described in our previous study by Thomas et al. 2
Pathologic reporting of prostatectomy specimens within the KPSC region, which includes over 50 pathologists, follows current standards of practice. 4 However, no universal template is shared among pathologists. Reports generally consist of a detailed gross and microscopic description of the surgical specimen(s) and a summary of findings. Key variables programmed for NLP to extract from the report included pathologic stage (TNM), Gleason sum, tertiary Gleason pattern, histologic type, size of the dominant tumor nodule, surgical margin status (SMS), and presence of seminal vesicle invasion (SVI), perineural invasion (PNI), angiolymphatic invasion (ALI), extracapsular extension (ECE), and lymph node involvement (LNI).
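As a minimal sketch of how explicitly stated variables such as the Gleason sum and pathologic T stage might be pulled from free text, the following Python example uses two regular expressions. The patterns, the function name extract_basic_fields, and the sample wording are illustrative assumptions; they are not the rules used by the KPSC Clinical Information Extraction System.

```python
import re

# Hypothetical patterns for illustration only; not the KPSC system's rules.
# Matches forms such as "Gleason score 3 + 4", "Gleason 4+3", "Gleason sum: 3+4".
GLEASON_RE = re.compile(
    r"gleason\s+(?:score|sum|grade)?\s*:?\s*(\d)\s*\+\s*(\d)", re.IGNORECASE
)
# Matches forms such as "pT2c" or "pT3a".
T_STAGE_RE = re.compile(r"\bpT([0-4][a-c]?)", re.IGNORECASE)


def extract_basic_fields(report: str) -> dict:
    """Return the Gleason sum and pT stage if stated explicitly, else None."""
    fields = {"gleason_sum": None, "t_stage": None}

    match = GLEASON_RE.search(report)
    if match:
        # Gleason sum = primary pattern + secondary pattern.
        fields["gleason_sum"] = int(match.group(1)) + int(match.group(2))

    match = T_STAGE_RE.search(report)
    if match:
        fields["t_stage"] = "pT" + match.group(1).lower()

    return fields
```

A field left as None signals that nothing was found, which keeps a missing value distinct from a negative finding.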
NLP program validation
The NLP program was validated by comparing NLP results with gold standard results created by two independent, blinded manual reviewers who are urologists (B.J.K., M.M.). One hundred randomly selected radical prostatectomy pathology reports were used for the validation process. Each reviewer manually extracted the key variables of interest, as listed above. Discordant results between the two reviewers were resolved by consensus after discussion. The accuracy of the NLP program relative to the gold standard was quantified. Sensitivity, specificity, positive predictive value, and negative predictive value of NLP results were also determined for binary variables. Finally, we compared the time to data acquisition of the manual review versus NLP automation.
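The validation measures named above follow the standard definitions for a binary variable. The sketch below shows one way to compute them from paired NLP and gold standard labels; the function names and the example labels are hypothetical and do not reproduce the study data.

```python
def confusion_counts(nlp, gold):
    """Count agreement between NLP output and the gold standard.

    Both arguments are equal-length sequences of booleans, where True
    means the finding (for example, SVI) is present.
    """
    counts = {"tp": 0, "fp": 0, "tn": 0, "fn": 0}
    for predicted, actual in zip(nlp, gold):
        if predicted and actual:
            counts["tp"] += 1
        elif predicted and not actual:
            counts["fp"] += 1
        elif not predicted and not actual:
            counts["tn"] += 1
        else:
            counts["fn"] += 1
    return counts


def binary_metrics(tp, fp, tn, fn):
    """Standard accuracy, sensitivity, specificity, PPV, and NPV."""

    def ratio(numerator, denominator):
        # Undefined when the denominator is zero (e.g., no positive cases).
        return numerator / denominator if denominator else None

    return {
        "accuracy": ratio(tp + tn, tp + fp + tn + fn),
        "sensitivity": ratio(tp, tp + fn),
        "specificity": ratio(tn, tn + fp),
        "ppv": ratio(tp, tp + fp),
        "npv": ratio(tn, tn + fn),
    }
```

Because sensitivity depends only on the gold standard positives, a variable with few positive cases can show a large swing in sensitivity from a single missed case.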
Results
A total of 100 randomly selected radical prostatectomy specimen reports were used to validate the NLP program. Table 1 displays a summary of the pathological characteristics of the surgical specimens. All procedures were robot-assisted radical prostatectomies. One case had no evidence of cancer in the final specimen, showing only high-grade prostatic intraepithelial neoplasia (HGPIN). All other cases contained prostate adenocarcinoma. The majority of the cancers were staged as pT2c (65.7%). Pelvic lymphadenectomy was performed in 34% of cases, with three cases demonstrating metastases to regional lymph nodes. Among the 97 cases in which SMS was reported, surgical margins were positive in 21 (21.6%).
Abbreviations: ALI=angiolymphatic invasion; ECE=extracapsular extension; LNI=lymph node involvement; SMS=surgical margin status; SVI=seminal vesicle invasion.
Table 2 lists the 13 pathological variables extracted. The program achieved 100% accuracy in identifying the Gleason sum, presence of a tertiary Gleason pattern, SVI, ALI, ECE, and the M stage. It also achieved near-perfect concordance in reporting the remaining variables evaluated. The least accurate pathological parameter was the N stage (95.0%). The overall accuracy of the program was 98.7%.
Abbreviations: NLP=natural language processing; PNI=perineural invasion.
For T staging, NLP incorrectly reported pT2 in one case with only HGPIN present. The pathologist's report stated that “staging appears to be a pT2,” since there was adenocarcinoma present in the initial prostate biopsy. In the same case, the program also incorrectly selected adenocarcinoma as the histologic type.
For N staging, NLP failed to report a value in three cases in which a lymphadenectomy was performed. In these cases, the pathologist's report did not explicitly state the N stage, but rather indicated that the lymph nodes were negative for metastatic carcinoma. In two other cases, NLP reported the N stage as N0 although a lymphadenectomy was not performed.
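The two N-stage error types described above suggest a three-way rule: use an explicit stage when one is stated, infer N0 when nodes are described as negative, and otherwise report nothing. The following is a sketch of that logic under stated assumptions; the patterns and the function name infer_n_stage are hypothetical and are not taken from the study's program.

```python
import re
from typing import Optional

# Hypothetical patterns for illustration only.
EXPLICIT_N = re.compile(r"\bpN([01X])\b", re.IGNORECASE)
NODES_NEGATIVE = re.compile(
    r"\blymph nodes?\b[^.]*?\bnegative for (?:metastatic )?(?:carcinoma|tumou?r)",
    re.IGNORECASE,
)


def infer_n_stage(report: str) -> Optional[str]:
    """Return an N stage, or None when the report gives no nodal information."""
    match = EXPLICIT_N.search(report)
    if match:
        # The pathologist stated the stage directly.
        return "pN" + match.group(1).upper()
    if NODES_NEGATIVE.search(report):
        # Assumption: nodes described as negative are treated as N0.
        return "pN0"
    # No explicit stage and no nodal description: do not default to N0.
    return None
```

Returning None in the final branch avoids assigning N0 to a patient who did not undergo lymphadenectomy.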
The program made four errors in extracting the size of the dominant tumor. In three of these cases, NLP did not identify the tumor size within the gross description of the specimen. In the fourth case, the size was listed as 2 cm; however, the pathologist was actually referring to a benign cystic nodule.
The program incorrectly reported SMS as negative in three cases in which the pathologist did not report a margin status. It also reported positive PNI in a case that was indeterminate.
Table 3 lists the sensitivities and specificities of NLP in extracting the correct pathological parameters for SVI, PNI, ALI, ECE, LNI, and SMS. In general, the software demonstrated very high specificity for all variables studied (94.4%–100%). NLP also had 100% sensitivity for extracting SVI, PNI, ALI, and ECE. LNI and SMS were the least sensitive at 60.0% and 87.5%, respectively. In addition, NLP demonstrated very high positive and negative predictive values (93.3%–100%).
Finally, we compared the time to data acquisition between the program and our manual review. NLP could generate a complete result for a given case in <1 second. The manual reviewers averaged 5.3 hours to extract information from all 100 cases, or 3.2 minutes per report.
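The per-report figure follows directly from the total review time. The short sketch below reproduces that arithmetic and shows how the manual effort scales with the number of reports; the function names are illustrative, and any report count other than the study's 100 is a hypothetical input.

```python
def minutes_per_report(total_hours: float, n_reports: int) -> float:
    """Average manual review time per report, in minutes."""
    return total_hours * 60 / n_reports


def manual_hours(n_reports: int, minutes_each: float) -> float:
    """Projected manual review time, in hours, for a given number of reports."""
    return n_reports * minutes_each / 60


# Study figures: 5.3 hours for 100 reports is approximately 3.18 minutes
# per report, which the text rounds to 3.2.
per_report = round(minutes_per_report(5.3, 100), 2)

# Hypothetical extrapolation: manual hours for a 10,000-report repository
# at 3.2 minutes per report.
projected = manual_hours(10_000, 3.2)
```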
Discussion
NLP programs have revolutionized information technology in the workplace and at home, leading to more efficient communication systems and interfaces. In medicine, the technology has evolved more slowly, even since the advent of the EMR. Its prospects for clinical and research work, however, are vast, and it may become necessary as medical data continue to grow rapidly.
Our study demonstrated the overall high performance (98.7% accuracy) of an NLP program in extracting key pathological details from a radical prostatectomy specimen report. For all 13 variables studied, NLP was correct at least 95% of the time. We previously demonstrated the ability of a similar NLP program to extract key data from prostate biopsy reports in an EMR with a high degree of accuracy (97.6%). 2
NLP had difficulty with one case in which only HGPIN was identified in the prostate specimen. The challenge was likely a result of a reference made to the previous prostate biopsy report that mentioned adenocarcinoma. NLP was also more likely to err on findings that were commonly reported in ambiguous or verbose terms, such as lymph node status, PNI, and SMS. For example, some pathologists did not clearly report a positive surgical margin, but only indicated that tumor was present at the inked surface of the specimen.
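Handling this kind of indirect wording generally means mapping several phrasings to one status. The following sketch assumes a small, hypothetical phrase list and a hypothetical function name, margin_status; it is not the study's method, and a production system would need far broader coverage.

```python
import re
from typing import Optional

# Hypothetical phrase lists for illustration only.
POSITIVE_PATTERNS = [
    # "margins: positive", "margins are positive"
    re.compile(r"\bmargins?\W+(?:\w+\W+){0,2}?positive\b", re.IGNORECASE),
    # "positive surgical margin"
    re.compile(r"\bpositive\s+(?:surgical\s+)?margins?\b", re.IGNORECASE),
    # Indirect wording: "tumor is present at the inked surface"
    re.compile(
        r"\btumou?r\b[^.]*?\b(?:present at|extends to|involves)\b[^.]*?\bink",
        re.IGNORECASE,
    ),
]
NEGATIVE_PATTERNS = [
    re.compile(
        r"\bmargins?\W+(?:\w+\W+){0,2}?(?:negative|uninvolved|free)\b",
        re.IGNORECASE,
    ),
    re.compile(r"\bnegative\s+(?:surgical\s+)?margins?\b", re.IGNORECASE),
]


def margin_status(report: str) -> Optional[str]:
    """Return 'positive', 'negative', or None when no margin status is found."""
    if any(pattern.search(report) for pattern in POSITIVE_PATTERNS):
        return "positive"
    if any(pattern.search(report) for pattern in NEGATIVE_PATTERNS):
        return "negative"
    # No margin statement found: report nothing rather than assume negative.
    return None
```

The short word window after "margin" is a deliberate limit so that a finding later in the same sentence is less likely to be misattributed to the margin.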
Determining the size of the dominant tumor was also challenging due to variability in the language used to describe a tumor, such as “mass,” “lesion,” and “nodule.” Furthermore, many structures other than the tumor itself are commonly measured and described in a given report, leading to potential confusion for the software. NLP was more likely to succeed when extracting more clear-cut terms, such as the Gleason sum, tertiary pattern, and T stage.
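One way to approach the synonym and non-tumor measurement problems is to accept a measurement only from a sentence that names a tumor-like term and lacks an exclusion term. The sketch below illustrates that idea; the term lists, the choice of the largest candidate as the dominant tumor, and the function name dominant_tumor_size_cm are all assumptions, and only single measurements such as "1.5 cm" are handled.

```python
import re
from typing import Optional

# Hypothetical term lists for illustration only.
TUMOR_TERMS = re.compile(r"\b(?:tumou?r|mass|lesion|nodule)\b", re.IGNORECASE)
EXCLUDE_TERMS = re.compile(r"\b(?:benign|cystic|cyst)\b", re.IGNORECASE)
SIZE = re.compile(r"(\d+(?:\.\d+)?)\s*(cm|mm)\b", re.IGNORECASE)


def dominant_tumor_size_cm(report: str) -> Optional[float]:
    """Return the largest tumor measurement in cm, or None if none is found."""
    candidates = []
    # Split after a period followed by whitespace, so "1.5 cm" stays intact.
    for sentence in re.split(r"(?<=\.)\s+", report):
        if not TUMOR_TERMS.search(sentence):
            continue  # Measurement of something other than a tumor.
        if EXCLUDE_TERMS.search(sentence):
            continue  # For example, a benign cystic nodule.
        for value, unit in SIZE.findall(sentence):
            size = float(value)
            candidates.append(size / 10 if unit.lower() == "mm" else size)
    # Assumption: the largest qualifying measurement is the dominant tumor.
    return max(candidates) if candidates else None
```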
Our program demonstrated very high sensitivity, specificity, and predictive values for most binary parameters. Sensitivity for correctly identifying LNI was low (60%), likely due to the low incidence of lymph node metastases after radical prostatectomy and the increased complexity of programming this variable, as discussed above.
One of the most notable advantages of NLP was that it could generate results in a matter of seconds, compared with the minutes or hours a manual reviewer would need to obtain the same information. This time differential is more apparent when extrapolated over many cases. For instance, the software could substantially reduce the time and effort required to populate a clinical or research database and achieve this with a high degree of accuracy. In contrast, a manual reviewer may be prone to fatigue and, therefore, have reduced efficiency over time.
NLP has the additional advantage of identifying data that are not formally coded in the EMR. For instance, Gundlapalli et al. 5 used NLP to screen for homelessness among U.S. veterans, which is a nonmedical diagnosis. In our study, we were able to extract the most pertinent and detailed information from the radical prostatectomy report. These pathological parameters are not traditionally flagged or highlighted and, thus, require increased attention by the clinician.
Even coded information, such as International Classification of Diseases, Ninth Revision (ICD-9) codes, may lack specificity. McPeek Hinz et al. 6 used NLP to identify all historic cases of venous thromboembolic disease at their institution, which would not have been captured by ICD-9 codes alone. A similar study identified more postoperative complications using NLP than using discharge coding data. 7
National registries, such as the National Inpatient Sample and the SEER-Medicare database, contain a wealth of information on millions of patients. However, these data include hundreds of variables with different coding schemes, if coded at all, which may create challenges in performing statistical analyses. 8 Furthermore, issues arise when attempting to merge data from multiple databases that use different definitions and labels for variables, such as Medicaid and Medicare data. 9 Thus, without proper search and extraction tools available, much of this rich information remains untouched. NLP programs have the inherent versatility of extracting detailed data, presented in multiple formats, which may or may not be coded.
In addition to rapid and complete data acquisition, NLP has broad utility in clinical practice, such as detecting incidental or unusual findings. For example, Dutta et al. 10 used NLP to detect incidental findings in radiology studies, reporting 89% sensitivity and 98% specificity for identifying the need for additional imaging, and improved the inclusion of this recommendation in discharge instructions for emergency department patients. Another study used NLP to detect breast cancer recurrence, with a reported accuracy of 92% and specificity of 96%. 11
NLP software systems may also be used to improve patient monitoring and screening. Denny et al. 12 showed improved identification of patients requiring colorectal screening using NLP in comparison with manual reviews of charts and billing records. A parallel study illustrated a fully automated system that used NLP to facilitate colonoscopy surveillance intervals. 13
The technology can even be applied to text written in other languages, as demonstrated by Wang et al., 14 who extracted tumor-related data from operation notes of hepatic carcinomas written in Chinese. These are only a few examples of the expansive, but underutilized, potential of NLP in medicine.
Our study had limitations. Despite showing near-perfect accuracy, the program was still susceptible to error. It incorrectly reported adenocarcinoma as the histologic type in a case with only HGPIN, because the term adenocarcinoma was present in the report. This error represented a rare event and one that would likely have been identified by the clinician, as there were no other pertinent cancer-related data listed (i.e., Gleason grade, SVI, ECE, and SMS). In the future, it would be useful to implement an alert system to flag potential discordances between the NLP program and the pathology report for the clinician or researcher to examine further.
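An alert system of this kind could begin with simple internal consistency rules applied to the extracted fields. The sketch below is one hypothetical form such rules might take; the field names, the rules themselves, and the function name discordance_flags are assumptions for illustration and do not describe a feature of the existing program.

```python
def discordance_flags(extracted: dict) -> list:
    """Return human-readable flags for internally inconsistent NLP output.

    `extracted` is a hypothetical dictionary of NLP results, with None
    meaning that no value was found in the report.
    """
    flags = []

    # A cancer diagnosis with no grading data is suspicious, as in the
    # HGPIN-only case that was labeled adenocarcinoma.
    if (
        extracted.get("histologic_type") == "adenocarcinoma"
        and extracted.get("gleason_sum") is None
    ):
        flags.append("adenocarcinoma reported without a Gleason sum")

    # An N stage should not appear when no lymphadenectomy was performed.
    if (
        extracted.get("n_stage") is not None
        and not extracted.get("lymphadenectomy_performed", False)
    ):
        flags.append("N stage reported without a lymphadenectomy")

    # A T stage with no histologic type suggests incomplete extraction.
    if (
        extracted.get("t_stage") is not None
        and extracted.get("histologic_type") is None
    ):
        flags.append("T stage reported without a histologic type")

    return flags
```

Cases with an empty flag list would pass through unchanged, while flagged cases could be routed to a clinician or researcher for manual review.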
The validation process included only 100 cases, and there may have been other complex scenarios not yet identified. Most cases of NLP error are fortunately due to underreporting, as the software does not produce a result unless the finding is present and detected in the report. With continued fine-tuning and validation of the program, we expect to improve upon the accuracy of NLP and reduce or eliminate such errors.
Finally, there were other pathologic variables of interest not included in the program, such as the locations of positive surgical margins and the extent of margin involvement with tumor. Hence, the clinician may still be required to study the pathology report for additional details. We are in the process of expanding upon our current program to include all relevant pathologic details.
Information technology has influenced, and will likely continue to influence, healthcare in the United States and abroad, with the widespread adoption of EMRs and improved methods of data acquisition and processing, such as NLP systems. Further studies are needed to determine whether NLP technology translates to an actual reduction in medical error or improved patient safety. 1 However, it is foreseeable that with a growing patient population and the expansion of medical data, clinicians and researchers will rely upon programs utilizing NLP to facilitate their work.
Conclusions
We demonstrated in this study that NLP is an effective tool to identify key pathologic details from the prostatectomy report within an EMR system. It may be valuable to clinicians by highlighting relevant details from the pathology report that may dictate postprostatectomy monitoring and treatment. This program may also facilitate urologic research by automatically generating large databases.
Acknowledgments
The authors would like to acknowledge the staff at the Kaiser Permanente Department of Research and Evaluation. Source of funding: Intuitive Surgical, Inc.
Disclosure Statement
No competing financial interests exist.