Application of Natural Language Processing in Electronic Health Record Data Extraction for Navigating Prostate Cancer Care: A Narrative Review

Abstract

Introduction:

Natural language processing (NLP)-based data extraction from electronic health records (EHRs) holds significant potential to simplify clinical management and aid research. This review aims to evaluate the current landscape of NLP-based data extraction in prostate cancer (PCa) management.

Materials and Methods:

We conducted a literature search of the PubMed and Google Scholar databases using the keywords “Natural Language Processing,” “Prostate Cancer,” “data extraction,” and “EHR,” with variations of each. No language or time limits were imposed. All results were collected in a standardized manner, including country of origin, sample size, algorithm, objective of outcome, and model performance. The precision, recall, and F1 score of each study were collected as metrics of model performance.

Results:

Of the 14 studies included in the review, 2 focused on documenting digital rectal examinations; 1 on identifying and quantifying pain secondary to PCa; 8 on extracting staging/grading information from clinical reports, with an emphasis on TNM classification, risk stratification, and identifying metastasis; 2 on patient-centered post-treatment outcomes such as incontinence, erectile dysfunction, and bowel dysfunction; and 1 on loneliness/social isolation following PCa diagnosis. All models showed moderate to high data annotation/extraction accuracy compared with the gold standard method of manual data extraction by chart review. Despite their potential, NLP models face challenges in handling ambiguous, institution-specific language and contextual nuances, leading to occasional inaccuracies in clinical data interpretation.

Conclusion:

NLP-based data extraction has effectively extracted various outcomes from PCa patients' EHRs. It holds the potential for automating outcome monitoring and data collection, resulting in time and labor savings.

Introduction

Prostate cancer (PCa) is the second-most common malignancy in males,¹ with a high mortality burden in advanced disease. Both the disease and its treatment come with significant morbidity and quality-of-life impact.² This necessitates a nuanced approach to diagnosis, treatment, and ongoing patient care. Given the many recent advances in the treatment of advanced PCa and the consequent increase in survival rates, a new set of challenges has emerged. These include optimizing patient treatment based on evidence-based medicine, ongoing symptom monitoring, ensuring patients' adherence to interventions, and identifying suitable candidates for clinical trials.³

Electronic health records (EHRs) contain patient information that is both structured and unstructured. The unstructured data contain heterogeneous free-text content, representing a massive trove of granular, unmined data. This can be used for various purposes, such as optimizing clinical care, research, and health care quality metrics. Using EHRs for these purposes, however, can be labor-intensive and time-consuming. Conventional data analysis techniques have not been readily applied to these data sources, as they are unstructured and require novel techniques for interpretation.⁴

Integrating artificial intelligence (AI) into medicine marks a transformative era in health care, revolutionizing diagnostics, treatment, and patient care. AI amalgamates computational algorithms with vast data sets, offering novel capabilities to extract meaningful insights from complex medical information.⁵ AI-based data extraction holds the potential to extract the required data in the requisite details while preserving the meaning of the surrounding context. One desirable application of AI is automating labeled data extraction from EHRs, resulting in significant time and labor savings.⁶ Natural language processing (NLP) models represent a frontier of machine learning that utilizes extensive data sets to train neural networks for various language-related tasks. One common example is to extract information from unstructured text data.

Within this context, NLP models have emerged as powerful tools for extracting valuable insights from unstructured clinical text within EHRs. While there has been significant research on using NLP-based EHR data extraction for other cancers,⁷ there is a lack of qualitative syntheses for PCa. Therefore, the present study aimed to explore the applications of NLP-based models in extracting EHR data related to PCa. By synthesizing existing studies, we seek to provide a comprehensive overview of the current state of NLP applications in PCa, emphasizing successes, challenges, and potential avenues for future research.

Materials and Methods

To conduct this narrative review, we performed literature searches in the PubMed and Google Scholar databases for articles on NLP-based EHR data extraction related to PCa, using keywords including “Natural Language Processing,” “Data extraction,” “Electronic Health Records,” and their variations. Two authors (A.B. and R.T.) screened articles based on titles and abstracts. We then reviewed the full text of the relevant articles. We also searched the reference lists of selected articles to identify other potential published works. The final articles included original research on NLP-based data extraction concerning PCa management. Irrelevant articles, reviews, and articles on differing target diseases were excluded. A.B. and R.T. independently collected data from these articles.

All results were then collected in a standardized manner, including country of origin, sample size, algorithm, objective of outcome, and model performance. Most teams reported precision, recall, and the F1 score as metrics of model performance. In NLP-based data extraction, precision corresponds to the positive predictive value, and recall corresponds to the sensitivity (true-positive rate). The F1 score combines the two into a single metric: it is the harmonic mean of precision and recall, calculated as F1 = 2 × (precision × recall)/(precision + recall).

The F1 score ranges between 0 and 1, where a higher score indicates better model performance in precision and recall. It is beneficial when there is an uneven class distribution (class imbalance) in the data set, as it considers both false positives and false negatives in its calculation, providing a balanced assessment of the model's performance. Each model's precision, recall, and/or F1 score are recorded in Table 1.
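As a minimal illustration of how these metrics relate, the following Python sketch computes precision, recall, and the F1 score from raw counts of true positives (TP), false positives (FP), and false negatives (FN). The function name and the counts are hypothetical and are not taken from any included study.

```python
def precision_recall_f1(tp: int, fp: int, fn: int) -> tuple[float, float, float]:
    """Return (precision, recall, F1) from raw extraction counts."""
    # Precision: share of extracted items that are correct (positive predictive value).
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    # Recall: share of true items that were extracted (sensitivity).
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    # F1: harmonic mean of precision and recall.
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


# Hypothetical counts: 90 correct extractions, 10 spurious, 30 missed.
p, r, f1 = precision_recall_f1(tp=90, fp=10, fn=30)
print(f"precision={p:.2f} recall={r:.2f} F1={f1:.2f}")
```

Because the harmonic mean is pulled toward the lower of the two values, a model cannot reach a high F1 score by maximizing precision or recall alone.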

Table 1.

Overview of Included Studies

Authors' (country) study objective	Sample analyzed	Demographics ^a	NLP model	Comparator	Result of best-performing models ^b,d,e
Hernandez-Boussard et al. (USA)³ To assess documentation of UI and ED	Training set: 100 randomly selected notes. 165,367 notes from 7109 patients 200 records were processed by gold standard for comparison	White: 69.68% Hispanic: 2.15% Black: 4.03% Asian: 9.95% Other: 14.18%	GATE software comprising: ANNIE module (to detect), ConText (to determine semantic context) and JAPE (to annotate)	Manual annotation by a board-certified urology nurse	UI: F1-score affirmed: 0.8667; negated: 0.9577; discussed risk: 0.9091 ED: F1-score affirmed: 0.8489; negated: 0.9159; discussed risk: 0.9029 Accuracy of positivity and negativity was assessed against manual chart review
Bozkurt et al. (USA)⁸ To assess documentation of pretreatment DRE using NLP	Training set: 426,227 notes from 7215 patients Comparison set: patient reports that were manually annotated by subject matter experts	White: 69.68% Hispanic: 2.15% Black: 4.03% Asian: 9.95% Other: 14.18%	In-house model	Manual chart review by domain experts	5958 (82.6%) patients had at least one DRE-documented EHR. 737 (10.2%) patients' reports state directly that the patient deferred or refused DRE before treatment. Among delayed/refused cases, 357 (48.4%) had a DRE documented after treatment. Precision and recall of the NLP were 95% and 90% based on evaluation against annotation by domain experts
Bozkurt et al. (USA)⁹ To assess provider documentation of DRE	Training set: 301 manually reviewed notes.185,356 notes from 3766 patients Comparison was done against a random sample of manual chart reviews	White: 69.68% Hispanic: 2.15% Black: 4.03% Asian: 9.95% Other: 14.18%	In-house model	Manual chart review	F1 = 0.93 for current visits F1 = 0.75 for historical visits Accuracy of positivity and negativity was assessed against manual chart review
Banerjee et al. (USA)²⁰ To assess the occurrence of PCa outcomes and the risk discussion of it	Total 528,362 notes from 6595 patients Training set: 110 randomly selected tests that were manually annotated to create a dictionary following training by use of 528,162 notes	White: 69.68% Hispanic: 2.15% Black: 4.03% Asian: 9.95% Other: 14.18%	In-house model using Stanford PCa database, domain-independent Python parser, pointwise mutual information, word2vec model	117 expert annotated notes	F1 = 0.86 for both UI and BD Definition of true positive and true negative not provided
Bozkurt et al. (USA)¹¹ To extract clinical and pathologic staging from clinical notes	Total 5461 patients: 4261 patient notes for NLP evaluation 1200 patient notes for comparison Training set: 80% notes Testing set: 20% notes	White: 69.68% Hispanic: 2.15% Black: 4.03% Asian: 9.95% Other: 14.18%	Python (v3.6) using the NLTK for preprocessing, and scikit-learn for feature extraction and classification	Manual chart review	Clinical and pathologic T and N stages: rule-based NLP approach with an F1 score of 0.71 Clinical M stage: ML approach with an F1 score of 0.88 Accuracy of positivity and negativity was defined compared with manual chart review
Lenain et al. (USA)¹³ To extract staging information from pathology reports	13,595 reports from 4470 patients in an 80%/10%/10% division in training, validation, and test sets	White: 69.68% Hispanic: 2.15% Black: 4.03% Asian: 9.95% Other: 14.18%	System made on Python (version 3.6) using the NLTK for preprocessing, and scikit-learn for feature extraction and classification	Ground-truth-stage labels manually abstracted from the clinical notes.	T stage: F1 score: 0.80 N stage: F1 score: 0.77 M stage: F1 score: 0.99 Definition of true positive and true negative not provided
Alba et al. (Danish)¹⁵ To identify metastatic PCa cases	Testing: 400 instances out of notes of 1,144,610 patients	White: 60.6% Hispanic: 2.6% Black: 14% Hawaiian native: 0.6% American Indian: 0.4% Asian: 0.4% Other: 21.5%	Levenshtein edit distance algorithm; Word2vec word embeddings	Manual chart review	Cancer notes precision: 0.975. Radiology notes precision: 0.936. Specificity: 0.979. Sensitivity: 0.919 Accuracy of positivity and negativity was assessed by manual chart review
Heintzelman et al. (Finland)¹⁰ To assess pain recognition and characterization in metastatic PCa extracted from clinical notes by NLP	Training set 4409 Test set 889	Caucasian: 81.8% Black: 15.16% Hispanic: 0.03%	ClinREAD, a proprietary health care domain-oriented, rule-based NLP system (Lockheed Martin, Bethesda) built on AeroText (Rocket Software, Newton, Minnesota)	Subject matter expert evaluative^c	Interannotator agreement: Pain mention: 0.90; explicitly no pain: 0.88; start date of pain: 0.64; some pain: 0.79; end date of pain: 0.63; controlled pain: 0.81; body location of pain: 0.75; severe pain: 0.90; overall severity of pain: 0.81 System performance of accurately assessing positivity and negativity was assessed against standard comparator
Zhu et al. (USA)²¹ To assess the identification of social isolation in PCa patients by NLP	3138 patient and 150,990 notes for training; 1057 patients with 55,516 notes for testing	NA	Linguamatics I2E version 5.3, Cambridge, United Kingdom	Manual chart review	Training set: precision: 0.92, recall 0.95, F-measure: 0.93 Test set: precision: 0.9, recall: 0.97, F-measure: 0.93 Manual review was used to accurately assess positivity and negativity
Cassim et al. (South Africa)¹⁴ To extract PCa predictive information	Training set: 1000 reports. Test set: 1000 reports	NA	In-house model based on python IDE	Manual chart review	For mining algorithm: precision: 0.98/recall: 1; F-measure: 0.99/F-measure for GS analysis: 0.98 TP: correctly extracting the GS TN: correctly extracting a biopsy without a GS FP: falsely extracting a GS/FN: falsely extracting the manually coded GS
Odisho et al. (USA)¹⁸ To extract pathology data and structured data entry by clinicians of PCa patients	Training set: NA 523 cases extracted by NLP 555 cases extracted manually 319 cases extracted using structured data elements	NA	In-house Java-based NLP system	Manual review and structured data elements	GS κ: 0.93 Margin status κ: 0.90 Extracapsular extension κ: 0.91 Seminal vesicle invasion κ: 0.84 Lymph node dissected κ: 0.95 Pathologic T stage κ: 0.97 Pathologic N stage κ: 1.00 Accuracy of positivity and negativity was assessed by comparison against manual review
Gregg et al. (USA)¹⁷ To collect the components of PCa risk stratification and verify it against manual extraction	2351 patient records were reviewed to extract “risk group elements” Training by an iterative process till 90% accuracy was achieved	NA	In-house NLP system	Manual review by subject matter expert	Weighted κ statistics: PSA: 0.86 (95% CI: 0.82–0.90); GS: 0.91 (95% CI: 0.90–0.93); clinical T-stage categories: 0.89 (95% CI: 0.85–0.94) Classification of accuracy of the NLP was compared with manual data collection
Kim et al. (USA)¹⁶ To extract key pathologic parameters from pathology reports	100 pathology reports were used to validate the NLP algorithm Training set: NA	NA	KPSC Clinical Information Extraction System	Manual review by subject matter expert	Overall accuracy: 98.7% Specificity for all variables included: 94.4%–100% Sensitivity: 100% except for lymph node involvement (60%) and surgical margin status (87.5%) PPV: 98.7%–100%/NPV: 93.3%–100% NLP accuracy was assessed against manual review
Huang et al. (Singapore)¹² To extract clinically relevant information from histopathology reports	Training set: 4306 reports Validation set: 1098 reports Test set: 268 reports	NA	In-house model based on rule-based, machine learning and deep learning methods	Variables as annotated by professional UroCaRe cancer registrars	Overall accuracy: 93.3% Accuracy was >95% for all included variables except pathologic tumor (T) staging, nodal (N) staging, and tumor size (74.3%, 61.2%, and 89.7%, respectively) NLP accuracy was assessed against variables as annotated by UroCaRe cancer registrars

F-measure = 2 × precision × recall/(precision + recall).

^a Demographics: the first six studies used the same data set (from Stanford) and therefore share the same demographic breakdown.

^b F1 score of 0.90 or above: highly accurate; 0.75 to 0.90: moderately accurate; 0.5 to 0.75: inaccurate.

^c Used for evaluation of NLP performance and improvement instead of comparison.

^d κ statistic: 0.01 to 0.20: none to slight agreement; 0.21 to 0.40: fair agreement; 0.41 to 0.60: moderate agreement; 0.61 to 0.80: substantial agreement; 0.81 to 1.00: almost perfect agreement.

^e Precision and recall are calculated as follows: precision = TP/(TP+FP); recall = TP/(TP+FN).

BD = bowel dysfunction; CI = confidence interval; DRE = digital rectal examination; ED = erectile dysfunction; EHR = electronic health record; FN = false negative; FP = false positive; GS = Gleason score; ML = machine learning; NLP = natural language processing; NLTK = natural language toolkit; NPV = negative predictive value; PCa = prostate cancer; PPV = positive predictive value; PSA = prostate-specific antigen; TN = true negative; TP = true positive; UI = urinary incontinence.

Results and Discussion

Studies

We identified and included 14 studies that focused on the various NLP models in EHR data extraction for PCa. Two articles focused on documenting digital rectal examinations (DREs);^8,9 one on identifying and quantifying pain secondary to PCa;¹⁰ eight on extracting staging/grading information from pathologic/clinical reports, with an emphasis on TNM classification and identifying metastasis;^11–19 two on patient-centered outcomes (PCOs) such as incontinence, erectile dysfunction (ED), and bowel dysfunction (BD);^3,20 and one on loneliness/social isolation following a PCa diagnosis.²¹ Interestingly, most articles were from a single group from Stanford University. Given that almost all these articles were published in computer science/health data analytics journals, providing a comprehensive overview of these projects in the urology literature is imperative.

Most published works in this arena focused on developing in-house training models, except two^10,15 that used commercially available software to evaluate their data sets. All studies divided their retrospective patient cohorts into two sets: a larger set for training the model and a smaller or equal-sized set for testing it. This ensured that the populations and EHR records were cross-compatible and that baseline characteristics were similar.

Overview of NLP models

Although NLP models can extract complex information from EHRs, almost all approaches first prepare the data for this task through preprocessing. The common steps in text preparation are tokenization, stemming, and stopword removal.

Tokenization

Text is broken down into smaller pieces (tokens), such as separating a sentence into individual words. Later steps operate on these tokens rather than on the raw text.

Stemming

Words are reduced to their basic form (stem). For example, “running” becomes “run” and “jumps” becomes “jump.”

Stopword removal

Some words (such as “and,” “the,” or “in”) do not carry much meaning by themselves. Removing them helps the model focus on the important words.
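The three preprocessing steps above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the stopword list is a small hypothetical sample, and `toy_stem` is a deliberately simplified suffix stripper standing in for an established stemmer (the included studies that name a toolkit used NLTK, which is not reproduced here).

```python
import re

# Small illustrative stopword list (assumption); real lists are much longer.
STOPWORDS = {"a", "an", "and", "the", "in", "of", "on", "was", "is", "to"}


def tokenize(text: str) -> list[str]:
    """Lowercase the text and split it into alphanumeric word tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def toy_stem(token: str) -> str:
    """Strip a few common English suffixes (toy stand-in for a real stemmer)."""
    for suffix in ("ing", "ed", "s"):
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            stem = token[: -len(suffix)]
            # Collapse a doubled final consonant, e.g. "runn" -> "run".
            if stem[-1] == stem[-2] and stem[-1] not in "aeiou":
                stem = stem[:-1]
            return stem
    return token


def preprocess(text: str) -> list[str]:
    """Tokenize, drop stopwords, then stem the remaining tokens."""
    tokens = tokenize(text)
    kept = [t for t in tokens if t not in STOPWORDS]
    return [toy_stem(t) for t in kept]


print(preprocess("The patient was running and jumps in the clinic."))
```

A toy stemmer such as this one mishandles many words (for example, irregular plurals), which is one reason production pipelines rely on tested libraries instead.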

Figure 1 represents the overall scheme for NLP-based data extraction. However, variations were encountered in each study depending on the aim of the data extraction. Major variations from Figure 1 are documented below.

The articles assessing DREs aimed to develop and test methods for automatically assessing a quality metric, provider-documented pretreatment DRE, using NLP frameworks. One study aimed to develop an NLP pipeline for automatic documentation of DRE in clinical notes using domain-specific dictionaries created by clinical experts.⁸ The second approach used software to learn from clinical notes using distributional semantics algorithms and create a list of terms for the dictionary, to which terms supplied by clinical experts were added.⁹ The relative performances of both can be seen in Table 1.

Eight studies focused on extracting the stage/grade of PCa. One study used frequency counts of structured data elements as predictors derived from the Observational Medical Outcomes Partnership Common Data Model (OMOP CDM) to identify more sophisticated PCa phenotypes than the International Classification of Diseases (ICD) code-based queries are capable of. This may be particularly useful in high-precision phenotyping scenarios, such as identifying participants for clinical trials or observational research.¹⁸ The other studies followed a scheme similar to Figure 1.

FIG. 1.

This flowchart represents a simplified, combined schema for most NLP-based data extraction studies. Each study has its variations and nuances that are not represented here. Explanations:

(1) Feature extraction: Feature extraction involves selecting and transforming relevant information from raw data to create a simplified representation. In simpler terms, it is similar to picking out the essential details that best capture what you are interested in.

(2) N-gram extraction: N-grams are sequences of N words or characters that appear together in a given context. For example, in the phrase “natural language processing,” “natural language” is a bigram (2-gram), and “natural language processing” is the trigram (3-gram). N-gram extraction is about identifying and using these sequences to understand patterns and relationships in data.

(3) Word embeddings (optional): Word embeddings are a way to represent words as vectors (mathematical entities) in a multidimensional space. Each word gets a position based on its meaning and context. This representation allows computers to capture the relationships between words. While optional, it can enhance the understanding of language nuances in certain applications.

(4) Rules-based approach: A rules-based approach involves defining specific instructions or conditions to guide the processing of data. It is similar to creating a set of rules for a computer to follow. For example, in text analysis, you might have rules such as “if a word is negated, change its meaning” or “if a sentence contains certain keywords, pay special attention to it.” These rules help extract meaningful information.

Putting it together: when features are extracted from a body of text, N-gram extraction identifies meaningful word sequences (bigrams, trigrams); word embeddings, if used, represent words in a way that captures their meanings and relationships; and a rules-based approach applies specific rules for interpreting the data, refining the extraction process. Together, they make data more manageable and meaningful.

NLP = natural language processing.
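The N-gram step described in the Figure 1 legend can be illustrated with a short Python sketch. The helper name `ngrams` is hypothetical, and the snippet shows only the general technique, not the implementation used by any included study.

```python
def ngrams(tokens: list[str], n: int) -> list[tuple[str, ...]]:
    """Return every contiguous run of n tokens, in order of appearance."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


tokens = "natural language processing".split()

print(ngrams(tokens, 2))  # bigrams
print(ngrams(tokens, 3))  # trigrams
```

For this three-word phrase there are two bigrams and one trigram; asking for an N larger than the number of tokens simply returns an empty list.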

One study focused on the categorization of pain levels secondary to PCa using NLPs.¹⁰ This study categorized pain into four levels of increasing severity through consensus among NLP developers, subject matter experts, and statisticians. It used Linguamatics I2E for indexing, parsing, and querying clinical notes based on pain-related criteria. This was followed by the development of NLP algorithms for pain level identification and the association of mentions with severity levels and relevant parameters.

The NLP pipeline for data extraction about PCOs of urinary incontinence (UI), BD, and ED identified patients within a large academic EHR system using ICD-9/10 and CPT codes. The main difference in the NLP approach for this application lies in the methodology used for training the classifier.

The first approach²⁰ uses a weakly supervised NLP pipeline. In this context, weakly supervised means the model is trained with less precise or detailed annotation. The training data for this model include automated sentence annotations based on domain-specific dictionaries. The second approach,³ by contrast, utilizes a neural embedding model that incorporates a TF-IDF-weighted sentence vector generation method. TF-IDF stands for term frequency-inverse document frequency, a numerical statistic used to assess the importance of a word in a document relative to a collection of documents. This method assigns higher weights to words that are frequent within a document but rare across the collection. Results for each approach can be seen in Table 1.
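To make the weighting idea concrete, the following Python sketch computes a basic term frequency-inverse document frequency weight for each word in a tiny, made-up corpus. It assumes the simplest textbook variant (relative term frequency multiplied by the natural log of N divided by document frequency); the studies discussed here may have used smoothed or normalized variants, and the sentence-vector step is not shown.

```python
import math
from collections import Counter


def tf_idf(docs: list[list[str]]) -> list[dict[str, float]]:
    """Compute tf-idf weights for a list of tokenized documents.

    tf  = occurrences of the term in the document / tokens in the document
    idf = log(number of documents / number of documents containing the term)
    """
    n_docs = len(docs)
    # Count, for each term, how many documents contain it at least once.
    doc_freq = Counter(term for doc in docs for term in set(doc))
    weights = []
    for doc in docs:
        counts = Counter(doc)
        weights.append({
            term: (count / len(doc)) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return weights


# Hypothetical tokenized note fragments.
docs = [
    ["patient", "reports", "urinary", "incontinence"],
    ["patient", "denies", "incontinence"],
    ["patient", "reports", "erectile", "dysfunction"],
]
for term, weight in sorted(tf_idf(docs)[0].items(), key=lambda kv: -kv[1]):
    print(f"{term:15s}{weight:.3f}")
```

In this toy corpus, “patient” appears in every document and therefore receives a weight of zero, whereas “urinary,” which appears in only one document, receives the highest weight in the first note.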

Another study used NLP-based models to identify social isolation after PCa diagnosis.²¹ They generated a lexicon for social isolation using domain expert knowledge and the Loneliness Scale. This was followed by developing NLP algorithms using I2E queries to identify social isolation mentions based on defined criteria.

Clinical applications

We found that there has been significant success with high precision and recall for NLP-based data extraction and monitoring from EHRs in PCa patients. There are multiple proofs of concept for this approach to data extraction in a clinical setting. NLPs can extract data from the entire patient care process.

Digital rectal examination

Recent AUA/SUO guidelines on early detection of PCa state that clinicians may use DRE alongside prostate-specific antigen to establish the risk of clinically significant PCa.²² Despite its importance in the clinical setting, DRE rates are variable due to interprovider preferences, clinical experience, and patient refusal. However, monitoring DRE rates and results is difficult because they are recorded inconsistently, as free text, in EHRs. This has hampered efforts to enforce DRE inclusion as a quality improvement metric.

The use of NLPs in DRE recognition and evaluation can improve clinical outcomes and simplify research/quality improvement metrics due to their ability to rapidly parse through multiple EHRs, decreasing the need for manual data extraction and classification of DRE evaluation.⁹ NLP-based recognition of pretreatment DREs can be integrated into clinical support systems for advising treatment regimens.⁸ This can directly improve and standardize patient care regimens and improve/simplify information availability for tumor boards.

The importance of including pretreatment DREs as a quality improvement metric has been well-established, not just in PCa care but also in colorectal cancer and other abdominal pathologies.²³ Since most DREs, and any reasons for refusal, are documented in free text, going through multiple EHRs is labor-intensive when DRE rates are being calculated for quality improvement. NLP-based systems can recognize the presence or absence of DREs from the text of a physician's note.⁹ Boussard and colleagues also showed that NLP pipelines can significantly simplify the recognition of DREs and note whether a reason is documented for refusing a DRE.⁸ By automating this data collection, NLP models can free up significant human resources and speed up both data collection and categorical classification of the obtained data (classifying DREs as normal, abnormal, refused, etc.).
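As a rough illustration of what such categorical classification can look like, the Python sketch below applies a handful of dictionary-style rules to a single sentence. All term lists and category labels are hypothetical assumptions chosen for the example; this is not the pipeline used in the cited studies, whose dictionaries were curated by clinical experts.

```python
import re

# Hypothetical term lists (assumptions for illustration only).
DRE_MENTION = re.compile(r"\b(?:dre|digital rectal exam(?:ination)?|rectal exam)\b")
REFUSAL = re.compile(r"\b(?:refused|declined|deferred)\b")
NEGATED_FINDING = re.compile(
    r"\b(?:no|without|negative for)\s+(?:nodules?|induration|asymmetry)\b"
)
ABNORMAL = re.compile(r"\b(?:nodules?|nodular|induration|indurated|asymmetry|firm)\b")
NORMAL = re.compile(r"\b(?:normal|smooth|benign)\b")


def classify_dre(sentence: str) -> str:
    """Assign one coarse DRE category to a sentence using simple rules."""
    text = sentence.lower()
    if not DRE_MENTION.search(text):
        return "not documented"
    if REFUSAL.search(text):
        return "refused"
    # Remove negated findings ("no nodules") before searching for abnormal terms.
    cleaned = NEGATED_FINDING.sub(" ", text)
    if ABNORMAL.search(cleaned):
        return "abnormal"
    if NORMAL.search(text) or cleaned != text:
        return "normal"
    return "documented, result unclear"


for note in [
    "DRE: smooth prostate, no nodules.",
    "Digital rectal exam reveals a firm nodule on the left lobe.",
    "Patient declined DRE today.",
    "PSA was 6.2 ng/mL.",
]:
    print(f"{classify_dre(note):16s}| {note}")
```

Rules of this kind are transparent and easy to audit, but they depend heavily on local wording, which is consistent with the institution-specific language challenges noted in this review.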

This data extraction and classification function can be extended to clinical research as well.⁸ Such models can also be extended to other physical/radiologic examinations such as cervical imaging and breast mammograms.²⁴

Symptoms: pain

NLP-based approaches can be applied for real-time pain monitoring and pain-related predictor/associated variable identifications, with implications for outpatient, inpatient, postoperative, and hospice care for PCa.

While attending providers can monitor pain in real-time, there can be limitations to this manual approach due to staffing ratios, inefficiencies, physician burnout, overlooked patient complaints, and inadequate patient hand-off. This can be exacerbated in a hospice or outpatient setting. These limitations can be overcome with the ability of NLPs to process a large volume of patient data rapidly, using pain predictor variables to create an alert system to identify patients who need immediate pain attention.¹⁰

For instance, NLP-based models can highlight instances where the model identifies subtle patterns, trends, or early indicators of pain that human providers might easily overlook. The approach used by Heintzelman et al.¹⁰ focuses on identifying variables that are common predictors/associations of PCa-related pain. In a clinical setting, this can be used to stratify patients at high risk of inadequate/delayed pain treatment.¹⁰

The NLP model can go beyond keyword matching to provide a semantic understanding of patient communication. For example, it can discern the difference between acute and chronic pain descriptors,¹⁰ helping health care providers differentiate between immediate concerns and ongoing pain management strategies.

When patients receive multimodal pain management involving various medications, therapies, and interventions, the NLP model can assist in synthesizing information from diverse sources. It can help health care providers assess the overall effectiveness of the treatment plan and identify potential synergies or conflicts among different modalities.^25,26 Multidisciplinary palliative care teams can use NLP-based models to identify patients with chronic or undertreated pain who are under the care of primary care physicians.²⁶ The insights gained from the NLP model¹⁰ could contribute to further research in pain management, potentially leading to new interventions, treatment strategies, or predictive models. Delivering readily interpretable, real-time presentations illustrating the historical pain status of individual patients or patient groups holds the promise of aiding clinicians in promptly identifying those in need of heightened pain management.¹⁰

PCa staging

Even though PCa staging is one of the most essential factors for guiding treatment, surprisingly it is often not readily accessible in the EHRs as a discrete field. NLP-based approaches have achieved 94% accuracy in extracting M-stage from lung cancer pathology reports.²⁷ Similar NLP-based cancer staging has been effectively achieved in prostate, bladder, breast, and liver cancers.^7,17

Missing staging for PCa in EHRs can cause problems, especially in establishing continuity of care, as patients may visit multiple treatment sites for radiation, chemotherapy, and surgery. One group noted that up to 36% of EHRs have missing PCa stages, and NLP-based models can correctly impute 21% to 31% of the missing stages in EHRs.¹¹ Another interesting finding is that rule-based NLP approaches outperform their N-gram-based counterparts for staging detection.¹¹ The same team also investigated OMOP CDM, a distinct but complementary approach to NLP-based models in health care data analysis. They found that OMOP CDM is better for identifying metastatic PCa than ICD-searched and NLP-based models.¹⁹

Moreover, each NLP-based model might have to be retrained for each cancer type, while the OMOP CDM can likely be used for different cancers and report structures. This is why they recommended combining NLP-based models and OMOP CDMs for optimal performance for PCa.¹⁹ On an institutional level, it is likely more straightforward to implement a uniform OMOP CDM for stage extraction. NLPs have also effectively extracted high-accuracy Gleason scores from pathology reports.¹⁴ This can significantly reduce efforts for continuity of care, data collection, and finding patients for research enrollment.
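A rule-based extraction of the Gleason score can be sketched with a single regular expression. The pattern below is an assumption that covers only a few common phrasings (for example, “Gleason score 3+4=7” or “Gleason: 4 + 5”); it is not the algorithm from the cited work, and real pathology reports contain many more variants.

```python
import re
from typing import Optional

# Hypothetical pattern: "Gleason", an optional qualifier, then "primary + secondary".
GLEASON = re.compile(
    r"gleason(?:\s+(?:score|grade|sum))?\s*:?\s*(\d)\s*\+\s*(\d)",
    re.IGNORECASE,
)


def extract_gleason(report: str) -> Optional[tuple[int, int, int]]:
    """Return (primary, secondary, total) for the first match, or None."""
    match = GLEASON.search(report)
    if match is None:
        return None
    primary, secondary = int(match.group(1)), int(match.group(2))
    # The total is recomputed from the two patterns rather than read from the text.
    return primary, secondary, primary + secondary


print(extract_gleason("Prostatic adenocarcinoma, Gleason score 3+4=7, 2 of 12 cores."))
print(extract_gleason("GLEASON: 4 + 5"))
print(extract_gleason("Benign prostatic tissue."))
```

Returning the primary and secondary patterns separately, rather than only the sum, preserves the distinction between 3+4 and 4+3 disease.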

Another group used NLP-based pipelines to stratify patients into the D'Amico risk classification with more than 90% accuracy.¹⁷ Although the diagnostic and therapeutic plans, such as imaging and biopsy, are often based on cancer risk stratification, there is significant overordering of such studies, resulting in resource wastage. Significant success in reducing unnecessary medical interventions among low-risk patients has been achieved. However, this accomplishment has necessitated coordinated efforts, including extensive data collection across various practice sites, subsequent data analysis, and the implementation of comparative performance feedback and decision support interventions.¹⁷ NLP-based pipelines can be extremely efficient in automating and improving such data collection. Integrating NLP-based clinical PCa risk stratification into clinical support systems can help decrease overuse and wastage of resources.¹⁷
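Once PSA, Gleason score, and clinical T stage have been extracted, assigning a risk group is a simple rule lookup. The Python sketch below uses the commonly cited D'Amico thresholds as an assumption (low: PSA of 10 ng/mL or less, Gleason score of 6 or less, and cT1 to cT2a; intermediate: PSA above 10 and up to 20 ng/mL, Gleason score of 7, or cT2b; high: PSA above 20 ng/mL, Gleason score of 8 to 10, or cT2c or higher). The function name and input format are hypothetical, and the cited study's exact implementation is not reproduced.

```python
def damico_risk(psa: float, gleason_total: int, clinical_t: str) -> str:
    """Assign a D'Amico risk group from PSA (ng/mL), Gleason score, and clinical T stage."""
    # Accept either "cT2a" or "T2a"; compare in lowercase without the "c" prefix.
    t = clinical_t.lower().lstrip("c")
    high_t = t.startswith(("t2c", "t3", "t4"))
    if psa > 20 or gleason_total >= 8 or high_t:
        return "high"
    if psa > 10 or gleason_total == 7 or t == "t2b":
        return "intermediate"
    # Everything else is treated as low risk (assumes valid, complete inputs).
    return "low"


print(damico_risk(5.0, 6, "cT1c"))   # low
print(damico_risk(12.0, 6, "T2a"))   # intermediate
print(damico_risk(6.0, 6, "cT2c"))   # high
```

In this sketch, a single high-risk feature is enough to assign the high-risk group, so an error in any one extracted element can change the classification; this is why per-element agreement with manual review matters.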

Patient-centered outcomes

Hernandez-Boussard et al.'s work has also shown that NLP-based data extraction can identify and monitor PCOs such as UI/BD/ED with performance close to the gold standard of manual data extraction.^3,20 The recent greater emphasis placed on quality metrics other than mortality rate creates incentives for monitoring and recording post-treatment PCOs. Real-time monitoring of EHRs via automated systems can detect complaints documented by primary care providers in resource-limited settings, which tertiary care systems can monitor. New complaints can then be followed up by specialists. Other use cases include time and labor savings on data collection for retrospective studies and building databases.

Post-treatment support systems/social isolation

Another interesting application is NLP-based data extraction to identify social isolation from EHRs.¹² The authors reported a high incidence of social isolation among patients living with PCa (up to 2%), which is in line with the overall population. NLP-based models can detect such patients, and such applications can be useful to elder care or social care services. This use of NLP-based data extraction can be extended to other social phenomena, such as detecting neglect, abuse, and financial difficulties, which are all more common in patients living with cancer.²⁸ Using primary care provider notes to identify patients with social isolation and similar risk factors for mental illness can provide a database for targeted interventions. However, significant ethical and privacy issues can arise with such real-time monitoring systems.

Using NLP models to prepopulate EHR notes represents a promising approach to streamline clinical documentation workflows and enhance the efficiency of health care delivery.²⁹ As discussed above, NLP-based systems can automate the extraction of relevant patient information for PCa, such as Gleason grade, staging, metastatic status, radiologic findings, and previous treatment. Prepopulating EHR notes with pertinent clinical data not only accelerates the documentation process but can also improve the accuracy and completeness of patient records.²⁹ Moreover, NLP-enabled prepopulation reduces the burden on clinicians by minimizing manual data entry tasks,²⁹ allowing them to focus more on direct patient care.

Although one might argue that all the above tasks can be achieved more efficiently and accurately by utilizing structured clinical notes instead of NLP models, the retrospective standardization of provider notes is a formidable undertaking.³⁰ Even nearly 15 years after the widespread adoption of EHRs, the absence of consensus on standardization methods, coupled with distinct preferences among medical specialties and variations in provider practices,³⁰ complicates the realization of a unified standard of structured provider notes. Applying NLP models presents a pragmatic alternative for extracting valuable insights from free-text clinical notes.²⁰ NLP models accommodate the inherent variability in provider documentation and adapt to prevailing practices. Moreover, because NLP models work with existing free-text entries, they sidestep the inertia associated with administrative and personal habits.³¹

Thus, in the absence of consensus and of the economic and administrative will to standardize, using NLP on free-text clinical notes emerges as an effective means of harnessing valuable information for enhanced clinical decision support.

Limitations of NLPs

Despite their vast potential, NLP-based models encounter several limitations in clinical research. One significant challenge is the variability and complexity of language used in health care documents. Clinical narratives often contain colloquialisms, abbreviations, misspellings, and domain-specific jargon, making it difficult for NLP models to accurately extract and interpret relevant information.⁸ Overcoming this variability demands robust algorithms capable of handling diverse linguistic patterns and context-specific nuances. Standardizing medical terminologies used within the institution at the provider level can also improve the model's accuracy and decrease the complexity of the lexicons needed for data extraction.⁸

Our review found some heterogeneity in the success rates of NLP-based data extraction and monitoring from EHRs. The most effective models target narrower subsets of text, such as identifying UI/ED/BD, vs larger and more complex targets, such as identifying staging. While this does not diminish the potential role of NLP-based models, it may be prudent for institutions to start with narrower targets and improve models before moving to more complex targets.

Fundamentally, the NLP-based model and its utility will only be as good as the underlying data. For example, if the clinician has not yet documented the finding of interest, no software can extract it. Similarly, preprocessing can discard important textual details, resulting in missed data. The potential for biases to enter EHRs during notetaking is a critical concern, as these biases can subsequently affect the performance of models built on such data. If biases are present in the clinical notes documented in the EHR, machine learning models trained on these data may inadvertently perpetuate and amplify them.³² Therefore, addressing biases in EHR data is both ethically imperative and crucial for ensuring equitable health care outcomes when implementing machine learning models in a clinical setting.

Efforts to mitigate biases involve thoroughly examining data collection processes, monitoring model outputs for disparities, and implementing corrective measures to promote fairness and inclusivity in health care AI applications.
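One simple way to monitor model outputs for disparities is to compare recall across patient subgroups against the manual reference. The sketch below is an assumption-laden illustration: the record layout `(group, reference_label, predicted_label)`, the group labels, and the function name `recall_by_group` are hypothetical.

```python
from collections import defaultdict


def recall_by_group(records):
    """Return recall per subgroup from (group, reference, predicted) records."""
    found = defaultdict(int)
    total = defaultdict(int)
    for group, reference, predicted in records:
        if reference:  # only reference-positive cases count toward recall
            total[group] += 1
            if predicted:
                found[group] += 1
    return {group: found[group] / total[group] for group in total}


# Hypothetical records: (subgroup, chart review label, NLP label).
records = [
    ("A", True, True), ("A", True, True), ("A", True, False), ("A", False, False),
    ("B", True, True), ("B", True, False), ("B", False, True),
]
per_group = recall_by_group(records)
gap = max(per_group.values()) - min(per_group.values())
print(per_group, round(gap, 2))
```

A large gap between subgroups would be a prompt for closer review of the underlying notes and training data rather than proof of bias on its own.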

Another limitation is the need for extensive, high-quality annotated data sets for training and validation. Building such data sets demands substantial human effort and domain expertise to ensure accuracy and relevance. Such concerns can be somewhat mitigated by using software that creates lexicons by thoroughly parsing the training data. The lack of evidence from prospective and multicentric studies may be another limitation of NLP models for PCa management. This review carries the limitations inherent to a narrative review, notably less rigorous inclusion and exclusion criteria, lack of quantification of publication bias, and more subjectivity than a systematic review.

Future applications

Given the rapid development of NLP-based data extraction pipelines, there is significant potential for using such models in health care throughout the patient care process. NLP-based monitoring models have the potential to provide real-time EHR monitoring of abnormal DRE findings, generating red flag alerts in EHRs that prompt providers to act in line with major urologic society guidelines. Adherence to quality metrics such as DREs and PCOs may become intrinsically linked to reimbursement outcomes. In such a scenario, a growing impetus exists to construct structured fields documenting quality metrics.⁸,³⁰ Another possible use can be to document and compare PCO rates between providers,³³ leading to the detection of heterogeneous practice styles and personalized provider feedback. Insurance and billing practices may also benefit from such a system, as NLP-enabled EHRs can identify complications, their treatments, and associated billing codes.

Another crucial area is adverse event detection, where NLP can systematically scan EHRs and clinical notes to identify potential side effects or complications associated with specific treatments, providing a proactive approach to patient safety monitoring.³⁴,³⁵ For example, NLP-based systems have already been used to identify the occurrence of surgical-site infections.³⁶,³⁷ NLP-based models can also flag and document differential rates of incorrect discharge practices between providers. One study showed that up to 20% of patients are discharged on opioid-only analgesia, putting them at high risk for overdose or dependence. NLP-based systems can provide real-time monitoring of and alerts for such oversights while providing an easy avenue for detecting opioid overprescription.³⁶

Moreover, in patient-reported outcomes, NLP-based models can assist in extracting and analyzing subjective information from clinical narratives, offering a deeper understanding of patients' experiences and perspectives, which is valuable for outcomes research and treatment optimization. Similarly, clinical decision support systems with integrated NLPs can help suggest new and evolving treatments that might apply to a particular patient, contributing to a key facet of individualized medicine.

In addition, by enabling structured data extraction from unstructured clinical notes, NLP-based models can help build comprehensive and standardized data sets, facilitating large-scale epidemiologic studies and multicenter collaborations. Data collection can also benefit from a uniform collection style, as inter-researcher variability in data collection remains a persistent problem.³⁷ These models can also assist research by identifying patients suitable for the various available treatments.³²,³⁸

Integrating deep learning-based tools such as ChatGPT into NLP pipelines presents a promising avenue for extracting PCa-related information from clinical notes. ChatGPT's proficiency in understanding and generating human-like text enables it to decipher the nuanced language used in medical documentation.³⁹ Moreover, ChatGPT's potential role in NLP for PCa-related data extraction can extend to its ability to adapt to the evolving landscape of medical language, which incorporates new terminologies, treatment modalities, and research findings.⁴⁰ Unfortunately, integrating large language models (LLMs) into NLP-based systems is not without downsides. LLMs are known to frequently provide false or fabricated information (a phenomenon known as hallucination).⁴¹ LLM-powered NLP-based systems may therefore introduce fabricated information into records, and without adequate safeguards in place, this can have significant clinical and academic implications. A cautiously optimistic approach to incorporating LLMs into NLP-based systems may therefore be ideal.

Addressing the unmet need for Health Insurance Portability and Accountability Act (HIPAA)-compliant NLP models is a critical endeavor in the evolving health care landscape. We believe that developing NLP models fully compliant with HIPAA standards poses both challenges and opportunities. Challenges involved in NLP integration into health systems include handling protected health information (PHI), ensuring secure data storage and transmission, and implementing robust access controls that restrict access to authorized personnel.⁴² Solutions could involve innovations in encryption techniques⁴³ and secure data-sharing protocols⁴³ that ensure the safety of PHI. Perhaps the most important ethical aspect of integrating NLP models into health systems is open discussion and transparent communication with patients, who should be the ultimate decision-makers regarding their PHI.⁴⁴

Summary and Conclusion

NLP-based models hold the potential to simplify health care and cut its costs, with current models performing well in extracting data about physical examination findings, quality metrics, cancer staging from pathology reports, and miscellaneous PCa-related phenomena such as social isolation and pain. The most effective models target narrower subsets of text, such as identifying UI/ED/BD, vs larger and more complex targets, such as identifying staging. Overall, integrating NLP-based models in clinical research shows promise to enhance efficiency, scalability, and the depth of insights derived from diverse health care data sources.

As we embrace NLP-based technologies, the overarching aim is to foster a health care ecosystem that learns, adapts, and evolves, ensuring better patient outcomes. In conclusion, NLP-based data collection approaches can significantly improve certain aspects of research and clinical practice. However, important limitations remain, such as inconsistent performance, lack of an ethical framework for integration into health systems, and biases from original provider notes carried over to NLP-based data extraction. A reliable, ready-to-use commercial NLP-based automated data extraction system has yet to be identified.

Footnotes

Authors' Contributions

A.B.: Conceptualization, data curation, formal analysis, investigation, methodology, validation, visualization, writing—original draft preparation, and writing—review and editing.

R.T.: Conceptualization, data curation, formal analysis, investigation, methodology, validation, visualization, and writing—review and editing.

J.G.P.: Formal analysis, investigation, writing—original draft preparation, and writing—review and editing.

J.K.: Writing—review and editing.

D.M.L.: Conceptualization, methodology, validation, visualization, and writing—review and editing.

R.M.: Conceptualization, methodology, validation, visualization, and writing—review and editing.

D.J.P.: Conceptualization, formal analysis, investigation, methodology, validation, and writing—review and editing.

H.N.S.: Conceptualization, data curation, formal analysis, investigation, methodology, validation, visualization, writing—original draft preparation, and writing—review and editing.

Author Disclosure Statement: Conflict of Interest

No competing financial interests exist.

Funding Information

No funding was received for this article.

References

1. Bray, Ferlay, Soerjomataram, et al. Global Cancer Statistics 2018: GLOBOCAN estimates of incidence and mortality worldwide for 36 cancers in 185 countries. CA Cancer J Clin, 2018; 68(6):394–424.

2. Herr. Quality of life in prostate cancer patients. CA Cancer J Clin, 1997; 47(4):207–217.

3. Hernandez-Boussard, Kourdis, Seto, et al. Mining electronic health records to extract patient-centered outcomes following prostate cancer treatment. AMIA Annu Symp Proc, 2018; 2017:876–882.

4. Sarwar, Seifollahi, Chan, et al. The secondary use of electronic health records for data mining: Data characteristics and challenges. ACM Comput Surv, 2022; 55(2):40.

5. Paraskevopoulos, Smeets, Tian, et al. Using artificial intelligence to extract information on pathogen characteristics from scientific publications. Int J Hyg Environ Health, 2022; 245:114018.

6. Bohr, Memarzadeh. The rise of artificial intelligence in healthcare applications. Artif Intell Healthc, 2020; 2020:25–60.

7. Datta, Bernstam, Roberts. A frame semantic overview of NLP-based information extraction for cancer-related EHR notes. J Biomed Inform, 2019; 100:103301.

8. Bozkurt, Kan, Ferrari, et al. Is it possible to automatically assess pretreatment digital rectal examination documentation using natural language processing? A single-centre retrospective study. BMJ Open, 2019; 9(7):e027182.

9. Bozkurt, Park, Kan, et al. An automated feature engineering for digital rectal examination documentation using natural language processing. AMIA Annu Symp Proc, 2018; 2018:288–294.

10. Heintzelman, Taylor, Simonsen, et al. Longitudinal analysis of pain in patients with metastatic prostate cancer using natural language processing of medical record text. J Am Med Inform Assoc, 2013; 20(5):898–905.

11. Bozkurt, Magnani, Seneviratne, et al. Expanding the secondary use of prostate cancer real world data: Automated classifiers for clinical and pathological stage. Front Digit Health, 2022; 4:793316.

12. Huang, Lim FXY, Gu, et al. Natural language processing in urology: Automated extraction of clinical information from histopathology reports of uro-oncology procedures. Heliyon, 2023; 9(4):e14793.

13. Lenain, Seneviratne, Bozkurt, et al. Machine learning approaches for extracting stage from pathology reports in prostate cancer. Stud Health Technol Inform, 2019; 264:1522–1523.

14. Cassim, Mapundu, Olago, et al. Using text mining techniques to extract prostate cancer predictive information (Gleason score) from semi-structured narrative laboratory reports in the Gauteng province, South Africa. BMC Med Inform Decis Mak, 2021; 21(1):330.

15. Alba, Gao, Lee, et al. Ascertainment of veterans with metastatic prostate cancer in electronic health records: Demonstrating the case for natural language processing. JCO Clin Cancer Inform, 2021; 5:1005–1014.

16. Kim, Merchant, Zheng, et al. A natural language processing program effectively extracts key pathologic findings from radical prostatectomy reports. J Endourol, 2014; 28(12):1474–1478.

17. Gregg, Lang, Wang, et al. Automating the determination of prostate cancer risk strata from electronic medical records. JCO Clin Cancer Inform, 2017; 1:CCI.16.00045.

18. Odisho, Bridge, Webb, et al. Automating the capture of structured pathology data for prostate cancer clinical care and research. JCO Clin Cancer Inform, 2019; 3:1–8.

19. Seneviratne, Banda, Brooks, et al. Identifying cases of metastatic prostate cancer using machine learning on electronic health records. AMIA Annu Symp Proc, 2018; 2018:1498–1504.

20. Banerjee, Li, Seneviratne, et al. Weakly supervised natural language processing for assessing patient-centered outcome following prostate cancer treatment. JAMIA Open, 2019; 2(1):150–159.

21. Zhu, Lenert, Bunnell, et al. Automatically identifying social isolation from clinical narratives for patients with prostate cancer. BMC Med Inform Decis Mak, 2019; 19(1):43.

22. Wei, Barocas, Carlsson, et al. Early detection of prostate cancer: AUA/SUO Guideline Part I: Prostate cancer screening. J Urol, 2023; 210(1):46–53.

23. Gori, Dulal, Blayney, et al. Utilization of prostate cancer quality metrics for research and quality improvement: A structured review. Jt Comm J Qual Patient Saf, 2019; 45(3):217–226.

24. Diamond, Laurentiev, Yang, et al. Natural language processing to identify abnormal breast, lung, and cervical cancer screening test results from unstructured reports to support timely follow-up. Stud Health Technol Inform, 2022; 290:433–437.

25. Carlson, Hooten. Pain—Linguistics and natural language processing. Mayo Clin Proc Innov Qual Outcomes, 2020; 4(3):346–347.

26. Bacco, Russo, Ambrosio, et al. Natural language processing in low back pain and spine diseases: A systematic review. Front Surg, 2022; 9:957085.

27. Wang, Chai, Huang, et al. Prostate artery embolization on lower urinary tract symptoms related to benign prostatic hyperplasia: A systematic review and meta-analysis. World J Clin Cases, 2022; 10(32):11812–11826.

28. Liang, Hao, Wu, et al. Social isolation in adults with cancer: An evolutionary concept analysis. Front Psychol, 2022; 13:973640.

29. Kaufman, Sheehan, Stetson, et al. Natural language processing-enabled and conventional data capture methods for input to electronic health records: A comparative usability study. JMIR Med Inform, 2016; 4(4):e35.

30. Vos JFJ, Boonstra, Kooistra, et al. The influence of electronic health record use on collaboration among medical specialties. BMC Health Serv Res, 2020; 20:676.

31. Magoc, Allen, McDonnell, et al. Generalizability and portability of natural language processing system to extract individual social risk factors. Int J Med Inf, 2023; 177:105115.

32. Hovy, Prabhumoye. Five sources of bias in natural language processing. Lang Linguist Compass, 2021; 15(8):e12432.

33. Wang, Luo, Wang, et al. Natural language processing for populating lung cancer clinical research data. BMC Med Inform Decis Mak, 2019; 19(5):239.

34. Murff, FitzHenry, Matheny, et al. Automated identification of postoperative complications within an electronic medical record using natural language processing. JAMA, 2011; 306(8):848–855.

35. Barber, Garg, Persenaire, et al. Natural language processing with machine learning to predict outcomes after ovarian cancer surgery. Gynecol Oncol, 2021; 160(1):182–186.

36. Liberman, Samuels, Goggins, et al. Opioid prescriptions at hospital discharge are associated with more postdischarge healthcare utilization. J Am Heart Assoc Cardiovasc Cerebrovasc Dis, 2019; 8(3):e010664.

37. Kyte, Ives, Draper, et al. Inconsistencies in quality of life data collection in clinical trials: A potential source of bias? Interviews with research nurses and trialists. PLoS One, 2013; 8(10):e76625.

38. Britton. Healthcare reimbursement and quality improvement: Integration using the electronic medical record; Comment on “Fee-for-service Payment—An Evil Practice That Must Be Stamped Out?”. Int J Health Policy Manag, 2015; 4(8):549–551.

39. Dave, Athaluri, Singh. ChatGPT in medicine: An overview of its applications, advantages, limitations, future prospects, and ethical considerations. Front Artif Intell, 2023; 6:1169595.

40. […], Gao, Luan, et al. The impact of Chat Generative Pre-trained Transformer (ChatGPT) on oncology: Application, expectations, and future prospects. Cureus, 2023; 15(11):e48670.

41. Ling Ong, Jie Seng, Feng Law, et al. Artificial intelligence, ChatGPT, and other large language models for social determinants of health: Current state and future directions. Cell Rep Med, 2024; 5(1):101356.

42. Bradford, Hurdle, LaSalle, et al. Development of a HIPAA-compliant environment for translational research data and analytics. J Am Med Inform Assoc, 2014; 21(1):185–189.

43. HIPAA Encryption Requirements—2024 Update. HIPAA Journal. [Cited February 4, 2024]. Available from: https://www.hipaajournal.com/hipaa-encryption-requirements/

44. Khattak, Rabbi. Ethical considerations and challenges in the deployment of natural language processing systems in healthcare. Int J Appl Health Care Anal, 2023; 8(5):17–36.