Abstract
Introduction:
Natural language processing (NLP)-based data extraction from electronic health records (EHRs) holds significant potential to simplify clinical management and aid research. This review aims to evaluate the current landscape of NLP-based data extraction in prostate cancer (PCa) management.
Materials and Methods:
We conducted a literature search of PubMed and Google Scholar databases using the keywords: “Natural Language Processing,” “Prostate Cancer,” “data extraction,” and “EHR” with variations of each. No language or time limits were imposed. All results were collected in a standardized manner, including country of origin, sample size, algorithm, objective of outcome, and model performance. The precision, recall, and the F1 score of studies were collected as a metric of model performance.
Results:
Of the 14 studies included in the review, 2 articles focused on documenting digital rectal examinations, 1 on identifying and quantifying pain secondary to PCa, 8 on extracting staging/grading information from clinical reports, with an emphasis on TNM-classification, risk stratification, and identifying metastasis, 2 articles focused on patient-centered post-treatment outcomes such as incontinence, erectile, and bowel dysfunction, and 1 on loneliness/social isolation following PCa diagnosis. All models showed moderate to high data annotation/extraction accuracy compared with the gold standard method of manual data extraction by chart review. Despite their potential, NLPs face challenges in handling ambiguous, institution-specific language and context nuances, leading to occasional inaccuracies in clinical data interpretation.
Conclusion:
NLP-based data extraction has effectively extracted various outcomes from PCa patients' EHRs. It holds the potential for automating outcome monitoring and data collection, resulting in time and labor savings.
Introduction
Prostate cancer (PCa) is the second-most common malignancy in males, 1 and advanced disease carries a high mortality burden. Both the disease and its treatment come with significant morbidity and quality-of-life impact. 2 This necessitates a nuanced approach to diagnosis, treatment, and ongoing patient care. Given the myriad of recent advances in the treatment of advanced PCa and the consequent increase in survival rates, a new set of challenges has emerged. These include optimizing patient treatment based on evidence-based medicine, ongoing symptom monitoring, ensuring patients' adherence to interventions, and identifying suitable candidates for clinical trials. 3
Electronic health records (EHRs) contain patient information that is both structured and unstructured. The unstructured data contain heterogeneous free-text content, representing a massive trove of granular, unmined data. This can be used for various purposes, such as optimizing clinical care, research, and health care quality metrics. Using EHRs for these purposes, however, can be labor-intensive and time-consuming. Conventional data analysis techniques have not been readily applied to these data sources, as they are unstructured and require novel techniques for interpretation. 4
Integrating artificial intelligence (AI) into medicine marks a transformative era in health care, revolutionizing diagnostics, treatment, and patient care. AI amalgamates computational algorithms with vast data sets, offering novel capabilities to extract meaningful insights from complex medical information. 5 AI-based data extraction holds the potential to extract the required data in the requisite details while preserving the meaning of the surrounding context. One desirable application of AI is automating labeled data extraction from EHRs, resulting in significant time and labor savings. 6 Natural language processing (NLP) models represent a frontier of machine learning that utilizes extensive data sets to train neural networks for various language-related tasks. One common example is to extract information from unstructured text data.
Within this context, NLP models have emerged as powerful tools for extracting valuable insights from unstructured clinical text within EHRs. While there has been significant research on using NLP-based EHR data extraction for other cancers, 7 there is a lack of qualitative syntheses for PCa. Therefore, the present study aimed to explore the applications of NLP-based models in extracting EHR data related to PCa. By synthesizing existing studies, we seek to provide a comprehensive overview of the current state of NLP applications in PCa, emphasizing successes, challenges, and potential avenues for future research.
Materials and Methods
To conduct this narrative review, we performed literature searches in the PubMed and Google Scholar databases for articles on NLP-based EHR data extraction related to PCa, using keywords including “Natural Language Processing,” “Data extraction,” “Electronic Health Records,” and their variations. Two authors (A.B. and R.T.) screened articles based on titles and abstracts. We then reviewed the full text of the relevant articles and searched the reference lists of selected articles to identify other potentially relevant published works. The final set comprised original research on NLP-based data extraction concerning PCa management. Irrelevant articles, reviews, and articles on other target diseases were excluded. A.B. and R.T. independently collected data from these articles.
All results were then collected in a standardized manner, including country of origin, sample size, algorithm, objective of outcome, and model performance. Most teams reported precision, recall, and the F1 score as metrics of model performance. In NLP-based data extraction, precision corresponds to the positive predictive value, and recall corresponds to the sensitivity, or true positive rate. The F1 score combines the two into a single metric: it is the harmonic mean of precision and recall, calculated as F1 = 2(precision × recall)/(precision + recall).
The F1 score ranges between 0 and 1, where a higher score indicates better model performance in both precision and recall. It is particularly useful when there is an uneven class distribution (class imbalance) in the data set, as it accounts for both false positives and false negatives. Each model's precision, recall, and/or F1 score is recorded in Table 1.
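As an illustration of these definitions, the following minimal Python sketch computes precision, recall, and the F1 score from true-positive (TP), false-positive (FP), and false-negative (FN) counts. The function name and the example counts are hypothetical and are not taken from any included study.

```python
from typing import Tuple


def precision_recall_f1(tp: int, fp: int, fn: int) -> Tuple[float, float, float]:
    """Return (precision, recall, F1) from confusion-matrix counts."""
    # Precision = TP / (TP + FP); defined as 0 when nothing was extracted.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    # Recall = TP / (TP + FN); defined as 0 when there was nothing to find.
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    # F1 is the harmonic mean of precision and recall.
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Hypothetical counts: 90 correct extractions, 10 spurious, 30 missed.
p, r, f1 = precision_recall_f1(tp=90, fp=10, fn=30)
print(f"precision={p:.2f} recall={r:.2f} F1={f1:.3f}")
```

In this hypothetical case the F1 score falls between the precision (0.90) and the recall (0.75), closer to the lower of the two, which is the expected behavior of a harmonic mean.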
Table 1. Overview of Included Studies
F-measure = 2 × precision × recall/(precision + recall).
Demographics table: First six studies from the same data set (from Stanford) have the same demographic breakdown.
F1-score 0.90—Highly accurate/F1-score 0.75 to 0.90—Moderately/F1-score 0.5 to 0.75—Inaccurate.
Used for evaluation of NLP performance and improvement instead of comparison.
0.01 to 0.20: none to slight agreement; 0.21 to 0.40: fair agreement; 0.41 to 0.60: moderate agreement; 0.61 to 0.80: substantial agreement; 0.81 to 1.00: almost perfect agreement.
Precision and recall are calculated as follows: (i) TP/(TP+FP) and (ii) TP/(TP+FN).
BD = bowel dysfunction; CI = confidence interval; DRE = digital rectal examination; ED = erectile dysfunction; EHR = electronic health record; FN = false negative; FP = false positive; GS = Gleason score; ML = machine learning; NLP = natural language processing; NLTK = natural language toolkit; NPV = negative predictive value; PCa = prostate cancer; PPV = positive predictive value; PSA = prostate-specific antigen; TN = true negative; TP = true positive; UI = urinary incontinence.
Results and Discussion
Studies
We identified and included 14 studies that focused on the various NLP models in EHR data extraction for PCa. Two articles focused on documenting digital rectal examinations (DREs). 8,9 One article focused on identifying and quantifying pain secondary to PCa, 10 eight on extracting staging/grading information from pathologic/clinical reports, with an emphasis on TNM classification and identifying metastasis, 11–19 two articles focused on patient-centered outcomes (PCOs) such as incontinence, erectile dysfunction (ED), and bowel dysfunction (BD); 3,20 and one article focused on loneliness/social isolation following a PCa diagnosis. 21 Interestingly, most articles were from a single group at Stanford University. Given that almost all these articles were published in computer science/health data analytics journals, providing a comprehensive overview of these projects in the urology literature is imperative.
Most published works in this arena focused on developing in-house training models, except two 10,15 that used commercially available software to evaluate their data sets. All studies divided their retrospective patient cohorts into two sets: a larger set for training the model and a smaller or equal-sized set for testing it. This ensured that the populations and EHR records were cross-compatible and baseline characteristics were similar.
Overview of NLP models
While NLPs have the capability to extract complex information from EHRs, almost all approaches prepare the data first for this task via preprocessing. The common steps in text preparation are tokenization, stemming, and stopword removal.
Tokenization
A sentence is broken down into smaller pieces, such as separating a sentence into individual words. This helps make sense of the text.
Stemming
Simplification of words to their basic form. For example, turning words such as “running” into “run” or “jumps” into “jump.”
Stopword removal
Some words (such as “and,” “the,” or “in”) do not carry much meaning by themselves. Removing them helps the model focus on the important words.
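The three preprocessing steps above can be sketched in a few lines of Python. This is a minimal illustration only: the stopword list is deliberately short, the sentence is invented, and the stemmer is a toy suffix-stripping rule rather than an established algorithm such as the Porter stemmer. None of the names below come from the included studies.

```python
import re

# Illustrative stopword list; real pipelines use much longer lists.
STOPWORDS = {"a", "an", "and", "the", "in", "of", "on", "is", "was", "to", "with"}


def tokenize(text):
    """Lowercase the text and split it into alphanumeric word tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def toy_stem(word):
    """Very rough suffix stripping for illustration; NOT the Porter algorithm."""
    for suffix in ("ing", "ed", "s"):
        # Only strip when at least three characters would remain.
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            stem = word[: -len(suffix)]
            # Undo consonant doubling, e.g. "runn" -> "run".
            if len(stem) >= 2 and stem[-1] == stem[-2] and stem[-1] not in "aeiouls":
                stem = stem[:-1]
            return stem
    return word


def preprocess(text):
    """Tokenize, drop stopwords, then stem the remaining tokens."""
    return [toy_stem(tok) for tok in tokenize(text) if tok not in STOPWORDS]


print(preprocess("The patient was running and jumps in the hallway"))
```

Running the sketch reduces the example sentence to the tokens "patient", "run", "jump", and "hallway". The order of the steps matters: stopwords are removed before stemming so that the stopword list can be written in ordinary, unstemmed form.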
Flowchart 1 represents the overall scheme for NLP-based data extraction. However, each study varied from this scheme depending on the aim of the data extraction. Major variations from Flowchart 1 are documented below.
The articles assessing DREs aimed to develop and test methods for automatically assessing a quality metric, provider-documented pretreatment DRE, using NLP frameworks. One study aimed to develop an NLP pipeline for automatic documentation of DRE in clinical notes using domain-specific dictionaries created by clinical experts. 8 The second approach used software to learn from clinical notes using distributional semantics algorithms and create a list of terms for the dictionary to which they added terms by clinical experts. 9 The relative performances of both can be seen in Table 1.
Eight studies focused on extracting the stage/grade of PCa. One study used frequency counts of structured data elements as predictors derived from the Observational Medical Outcomes Partnership Common Data Model (OMOP CDM) to identify more sophisticated PCa phenotypes than the International Classification of Diseases (ICD) code-based queries are capable of. This may be particularly useful in high-precision phenotyping scenarios, such as identifying participants for clinical trials or observational research. 18 The other studies followed a scheme similar to Figure 1.

This flowchart represents a simplified, combined schema for most NLP-based data extraction studies. Each study has its variations and nuances that are not represented here. Explanations: (1) Feature extraction: selecting and transforming relevant information from raw data to create a simplified representation, that is, picking out the essential details that best capture the item of interest. (2) N-gram extraction: N-grams are sequences of N words or characters that appear together in a given context. For example, in the phrase “natural language processing,” “natural language” is a bigram (2-gram) and “natural language processing” is a trigram (3-gram). N-gram extraction identifies and uses these sequences to understand patterns and relationships in data. (3) Word embeddings (optional): word embeddings represent words as vectors (mathematical entities) in a multidimensional space. Each word is positioned according to its meaning and context, which allows computers to model the relationships between words. While optional, embeddings can improve the handling of language nuances in certain applications. (4) Rules-based approach: defining specific instructions or conditions to guide the processing of data. For example, in text analysis, rules might include “if a word is negated, change its meaning” or “if a sentence contains certain keywords, pay special attention to it.” These rules help extract meaningful information. Putting it together: when extracting features from a body of text, N-gram extraction identifies meaningful word sequences (bigrams, trigrams); word embeddings, if used, represent words in a way that captures their meanings and relationships; and the rules-based approach applies specific rules for interpreting the data, refining the extraction process. Together, they make data more manageable and meaningful. NLP = natural language processing.
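To make the N-gram and rules-based steps concrete, the following minimal Python sketch extracts N-grams from a token list and applies a single negation rule. The negation cues, the window size, and the example note are hypothetical choices made for illustration; they are not drawn from any of the included pipelines.

```python
def ngrams(tokens, n):
    """Return all contiguous n-token sequences as tuples."""
    return [tuple(tokens[i : i + n]) for i in range(len(tokens) - n + 1)]


# Hypothetical negation cues, chosen only for illustration.
NEGATION_CUES = {"no", "not", "denies", "without"}


def is_negated(tokens, target, window=3):
    """Rule: the target is negated if a cue occurs within `window` tokens before it.

    Returns False when the target does not occur at all.
    """
    for i, tok in enumerate(tokens):
        if tok == target:
            preceding = tokens[max(0, i - window) : i]
            if NEGATION_CUES.intersection(preceding):
                return True
    return False


tokens = "natural language processing".split()
print(ngrams(tokens, 2))
print(ngrams(tokens, 3))

note = "patient denies any bone pain today".split()
print(is_negated(note, "pain"))
```

The phrase “natural language processing” yields two bigrams and one trigram, and the rule flags “pain” as negated because “denies” appears within the three preceding tokens. A fixed window is a crude heuristic; it is used here only to show how a hand-written rule can change the interpretation of an extracted term.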
One study focused on the categorization of pain levels secondary to PCa using NLPs. 10 This study categorized pain into four levels of increasing severity through consensus among NLP developers, subject matter experts, and statisticians. It used Linguamatics I2E for indexing, parsing, and querying clinical notes based on pain-related criteria. This was followed by the development of NLP algorithms for pain level identification and the association of mentions with severity levels and relevant parameters.
The NLP pipeline for data extraction about PCOs of urinary incontinence (UI), BD, and ED identified patients within a large academic EHR system using ICD-9/10 and CPT codes. The main difference in the NLP approach for this application lies in the methodology used for training the classifier.
The first approach 20 uses a weakly supervised NLP pipeline. In this context, weakly supervised means the model is trained with less precise or detailed annotation: the training data include sentence annotations generated automatically from domain-specific dictionaries. The second approach, 3 in contrast, utilizes a neural embedding model that incorporates a TF-IDF-weighted sentence vector generation method. TF-IDF stands for term frequency-inverse document frequency, a numerical statistic used to assess the importance of a word in a document relative to a collection of documents. It assigns higher weights to words that are frequent within a document but rare across the collection. Results for each approach can be seen in Table 1.
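The idea of TF-IDF weighting can be sketched as follows. This minimal Python example uses one common formulation (term frequency multiplied by the logarithm of the inverse document frequency) and then averages word vectors weighted by those scores. The exact formulation used in the cited study is not described in this review, and the three short documents and two-dimensional "embeddings" below are invented purely for illustration.

```python
import math
from collections import Counter


def tf_idf(docs):
    """Return one {term: weight} dict per tokenized document.

    tf = term count / document length; idf = log(N / document frequency).
    """
    n_docs = len(docs)
    doc_freq = Counter(term for doc in docs for term in set(doc))
    weights = []
    for doc in docs:
        counts = Counter(doc)
        weights.append({
            term: (count / len(doc)) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return weights


def weighted_sentence_vector(tokens, term_weights, embeddings):
    """Average the word vectors of a sentence, weighting each by its TF-IDF score."""
    dim = len(next(iter(embeddings.values())))
    total = [0.0] * dim
    weight_sum = 0.0
    for tok in tokens:
        w = term_weights.get(tok, 0.0)
        if tok in embeddings and w > 0:
            total = [t + w * x for t, x in zip(total, embeddings[tok])]
            weight_sum += w
    return [t / weight_sum for t in total] if weight_sum else total


# Invented toy corpus and toy 2-dimensional word vectors.
docs = [
    ["patient", "reports", "urinary", "leakage"],
    ["patient", "denies", "urinary", "symptoms"],
    ["patient", "reports", "erectile", "dysfunction"],
]
embeddings = {"urinary": [1.0, 0.0], "leakage": [0.0, 1.0]}

weights = tf_idf(docs)
vec = weighted_sentence_vector(docs[0], weights[0], embeddings)
print(weights[0])
print(vec)
```

In this toy corpus, “patient” appears in every document and therefore receives a weight of zero, whereas “leakage” appears in only one document and receives the largest weight in the first sentence. The resulting sentence vector is consequently pulled toward the rarer, more informative word.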
Another study used NLP-based models to identify social isolation after PCa diagnosis. 21 They generated a lexicon for social isolation using domain expert knowledge and the Loneliness Scale. This was followed by developing NLP algorithms using I2E queries to identify social isolation mentions based on defined criteria.
Clinical applications
We found that there has been significant success with high precision and recall for NLP-based data extraction and monitoring from EHRs in PCa patients. There are multiple proofs of concept for this approach to data extraction in a clinical setting. NLPs can extract data from the entire patient care process.
Digital rectal examination
Recent AUA/SUO guidelines on early detection of PCa state that clinicians may use DRE alongside prostate-specific antigen to establish the risk of clinically significant PCa. 22 Despite its importance in the clinical setting, DRE rates are variable due to interprovider preferences, clinical experience, and patient refusal. However, monitoring DRE rates and results is difficult because they are recorded as inconsistent, free text in EHRs. This has hampered efforts to enforce DRE inclusion as a quality improvement metric.
The use of NLPs in DRE recognition and evaluation can improve clinical outcomes and simplify research/quality improvement metrics due to their ability to rapidly parse through multiple EHRs, decreasing the need for manual data extraction and classification of DRE evaluation. 9 NLP-based recognition of pretreatment DREs can be integrated into clinical support systems for advising treatment regimens. 8 This can directly improve and standardize patient care regimens and improve/simplify information availability for tumor boards.
The importance of including pretreatment DREs as a quality improvement metric has been well-established, not just in PCa care but also in colorectal cancer and other abdominal pathologies. 23 Since most DREs, and any reasons for refusal, are documented in free text, going through multiple EHRs to calculate DRE rates for quality improvement can be a labor-intensive process. NLP-based systems can also recognize the presence or absence of DREs from the text of a physician's note. 9 Boussard and colleagues also showed that NLP pipelines can significantly simplify the recognition of DREs and also note if there is a reason documented for refusing DREs. 8 By automating this data collection, NLPs can free up significant human resources and speed up both data collection and the categorical classification of the obtained data (classifying DREs as normal, abnormal, refused, etc.).
This data extraction and classification function can be extended to clinical research as well. 8 Such models can also be extended to other physical/radiologic examinations such as cervical imaging and breast mammograms. 24
Symptoms: pain
NLP-based approaches can be applied for real-time pain monitoring and pain-related predictor/associated variable identifications, with implications for outpatient, inpatient, postoperative, and hospice care for PCa.
While attending providers can monitor pain in real-time, there can be limitations to this manual approach due to staffing ratios, inefficiencies, physician burnout, overlooked patient complaints, and inadequate patient hand-off. This can be exacerbated in a hospice or outpatient setting. These limitations can be overcome with the ability of NLPs to process a large volume of patient data rapidly, using pain predictor variables to create an alert system to identify patients who need immediate pain attention. 10
For instance, NLP-based models can highlight instances where the model identifies subtle patterns, trends, or early indicators of pain that human providers might easily overlook. The approach used by Heintzelman et al. 10 focuses on identifying variables that are common predictors/associations of PCa-related pain. In a clinical setting, this can be used to stratify patients at high risk of inadequate/delayed pain treatment. 10
The NLP model can go beyond keyword matching to provide a semantic understanding of patient communication. For example, it can discern the difference between acute and chronic pain descriptors, 10 helping health care providers differentiate between immediate concerns and ongoing pain management strategies.
When patients receive multimodal pain management involving various medications, therapies, and interventions, the NLP model can assist in synthesizing information from diverse sources. It can help health care providers assess the overall effectiveness of the treatment plan and identify potential synergies or conflicts among different modalities. 25,26 Multidisciplinary palliative care teams can use NLP-based models to identify patients with chronic or undertreated pain who are under the care of primary care physicians. 26 The insights gained from the NLP model 10 could contribute to further research in pain management, potentially leading to new interventions, treatment strategies, or predictive models. Delivering readily interpretable, real-time presentations illustrating the historical pain status of individual patients or patient groups holds the promise of aiding clinicians in promptly identifying those in need of heightened pain management. 10
PCa staging
Even though PCa staging is one of the most essential factors for guiding treatment, surprisingly it is often not readily accessible in the EHRs as a discrete field. NLP-based approaches have achieved 94% accuracy in extracting M-stage from lung cancer pathology reports. 27 Similar NLP-based cancer staging has been effectively achieved in prostate, bladder, breast, and liver cancers. 7,17
Missing staging for PCa in EHRs can cause problems, especially in establishing continuity of care, as patients may visit multiple treatment sites for radiation, chemotherapy, and surgery. One group noted that up to 36% of EHRs have missing PCa stages, and NLP-based models can correctly impute 21% to 31% of the missing stages in EHRs. 11 Another interesting finding is that rule-based NLP approaches outperform their N-gram-based counterparts for staging detection. 11 The same team also investigated OMOP CDM, a distinct but complementary approach to NLP-based models in health care data analysis. They found that OMOP CDM is better for identifying metastatic PCa than ICD-searched and NLP-based models. 19
Moreover, each NLP-based model might have to be retrained for each cancer type, while the OMOP CDM can likely be used for different cancers and report structures. This is why they recommended combining NLP-based models and OMOP CDMs for optimal performance for PCa. 19 On an institutional level, it is likely more straightforward to implement a uniform OMOP CDM for stage extraction. NLPs have also extracted Gleason scores from pathology reports with high accuracy. 14 This can significantly reduce the effort required for continuity of care, data collection, and finding patients for research enrollment.
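As a simple illustration of how a rules-based extractor might capture Gleason scores from free text, the following Python sketch uses a single regular expression. The pattern, the function name, and the example report text are assumptions made for illustration; they do not reproduce the method of the cited study, and real pathology reports vary far more than this pattern covers.

```python
import re

# Illustrative pattern only: matches forms such as "Gleason score 3+4" or "Gleason 4 + 5".
GLEASON_PATTERN = re.compile(
    r"gleason\s*(?:score|grade)?\s*[:=]?\s*(\d)\s*\+\s*(\d)",
    re.IGNORECASE,
)


def extract_gleason(text):
    """Return (primary, secondary, total) for each Gleason pattern found in the text."""
    results = []
    for match in GLEASON_PATTERN.finditer(text):
        primary, secondary = int(match.group(1)), int(match.group(2))
        results.append((primary, secondary, primary + secondary))
    return results


# Invented report text for demonstration.
report = (
    "Prostate, left base: adenocarcinoma, Gleason score 3+4=7. "
    "Right apex: Gleason 4 + 5."
)
print(extract_gleason(report))
```

The sketch returns the primary pattern, secondary pattern, and their sum for each mention. It deliberately ignores mentions that give only a total score or a grade group, which a production pipeline would need to handle separately.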
Another group used NLP-based pipelines to stratify patients into the D'Amico risk classification with more than 90% accuracy. 17 Although the diagnostic and therapeutic plans, such as imaging and biopsy, are often based on cancer risk stratification, there is significant overordering of such studies, resulting in resource wastage. Significant success in reducing unnecessary medical interventions among low-risk patients has been achieved. However, this accomplishment has necessitated coordinated efforts, including extensive data collection across various practice sites, subsequent data analysis, and the implementation of comparative performance feedback and decision support interventions. 17 NLP-based pipelines can be extremely efficient in automating and improving such data collection. Integrating NLP-based clinical PCa risk stratification into clinical support systems can help decrease overuse and wastage of resources. 17
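Once PSA, Gleason score, and clinical T stage have been extracted, risk stratification itself is a small rule set. The following Python sketch applies the commonly cited D'Amico thresholds (low: PSA ≤10 ng/mL, Gleason ≤6, and stage up to T2a; intermediate: PSA >10 to 20 ng/mL, Gleason 7, or stage T2b; high: PSA >20 ng/mL, Gleason ≥8, or stage T2c or higher). The function name, the stage list, and the example values are illustrative assumptions and do not represent the pipeline of the cited study.

```python
# Clinical T stages ordered from lowest to highest for simple comparison.
T_STAGE_ORDER = ["T1a", "T1b", "T1c", "T2a", "T2b", "T2c", "T3a", "T3b", "T4"]


def damico_risk(psa, gleason_total, clinical_t_stage):
    """Classify a patient as 'low', 'intermediate', or 'high' risk.

    psa is in ng/mL; clinical_t_stage must be one of T_STAGE_ORDER
    (an unknown stage raises ValueError).
    """
    rank = T_STAGE_ORDER.index(clinical_t_stage)
    # High risk: any single high-risk feature is sufficient.
    if psa > 20 or gleason_total >= 8 or rank >= T_STAGE_ORDER.index("T2c"):
        return "high"
    # Intermediate risk: any single intermediate-risk feature is sufficient.
    if psa > 10 or gleason_total == 7 or clinical_t_stage == "T2b":
        return "intermediate"
    # Low risk: all features are within the low-risk range.
    return "low"


# Hypothetical patients for demonstration.
print(damico_risk(5.2, 6, "T1c"))
print(damico_risk(12.0, 6, "T2a"))
print(damico_risk(8.0, 6, "T2c"))
```

The high-risk check is evaluated first because a single high-risk feature overrides any lower-risk features. In practice, the harder problem is the upstream extraction of the three inputs from free text, which is where the NLP pipeline contributes.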
Patient-centered outcomes
Hernandez-Boussard et al.'s work has also shown that NLP-based data extraction can identify and monitor PCOs such as UI/BD/ED with accuracy close to the gold standard of manual data extraction. 3,20 The recent greater emphasis placed on quality metrics other than mortality rate creates incentives for monitoring and recording post-treatment PCOs. Real-time monitoring of EHRs via automated systems can detect complaints documented by primary care providers in resource-limited settings, which tertiary care systems can monitor. New complaints can then be followed up by specialists. Other use cases include time and labor savings on data collection for retrospective studies and building databases.
Post-treatment support systems/social isolation
Another interesting application is NLP-based data extraction to identify social isolation from EHRs. 21 The authors reported a high incidence of social isolation among patients living with PCa (up to 2%), which is in line with the overall population. NLP-based models can detect such patients, and such applications can be useful to elder or social care services. This use of NLP-based data extraction can be extended to other social phenomena, such as detecting neglect, abuse, and financial difficulties, which are all more common in patients living with cancer. 28 The prospect of using primary care provider notes to identify patients with social isolation and other similar risk factors for mental illness can provide a database for targeted interventions. However, significant ethical and privacy issues can arise with such real-time monitoring systems.
Using NLPs to prepopulate EHR notes represents a promising approach to streamline clinical documentation workflows and enhance the efficiency of health care delivery. 29 As discussed above, NLP-based systems can automate extracting relevant patient information for PCa, such as Gleason grade, staging, metastatic status, radiologic findings, and previous treatment. This prepopulation of EHR notes with pertinent clinical data not only accelerates the documentation process but can also ensure the accuracy and completeness of patient records. 29 Moreover, NLP-enabled prepopulation reduces the burden on clinicians by minimizing manual data entry tasks, 29 allowing them to focus more on direct patient care.
Although one might argue that all the above tasks can be achieved more efficiently and accurately by utilizing structured clinical notes instead of NLP models, the retrospective standardization of provider notes is a formidable undertaking. 30 In light of the nearly 15 years since the widespread adoption of EHRs, the absence of consensus on standardization methods, coupled with distinct preferences among medical specialties and variations in provider practices, 30 complicates the realization of a unified standard of structured provider notes. Applying NLP models presents a pragmatic alternative for extracting valuable insights from free-text clinical notes. 20 NLPs accommodate the inherent variability in provider documentation, offering adaptability to prevailing practices. Moreover, the inertia associated with administrative and personal habits is mitigated by the seamless integration of NLP models with existing free-text entries. 31
Thus, in the absence of consensus and of the economic and administrative will for standardization, using NLP on free-text clinical notes emerges as an effective means of harnessing valuable information for enhanced clinical decision support.
Limitations of NLPs
Despite their vast potential, NLP-based models encounter several limitations in clinical research. One significant challenge is the variability and complexity of language used in health care documents. Clinical narratives often contain colloquialisms, abbreviations, misspellings, and domain-specific jargon, making it difficult for NLPs to accurately extract and interpret relevant information. 8 This variability demands robust algorithms capable of handling diverse linguistic patterns and context-specific nuances. Standardizing medical terminologies used within the institution at the provider level can also improve the model's accuracy and decrease the complexity of the lexicons needed for data extraction. 8
Our review found some heterogeneity in the success rates of NLP-based data extraction and monitoring from EHRs. The most effective models target narrower subsets of text, such as identifying UI/ED/BD, vs larger and more complex targets, such as identifying staging. While this does not diminish the potential role of NLP-based models, it may be prudent for institutions to start with narrower targets and improve models before moving to more complex targets.
Fundamentally, the NLP-based model and its utility will only be as good as the underlying data. For example, if the clinician is yet to document the finding of interest, it is impossible for any software to extract data from it. Similarly, preprocessing can often result in important textual details being discarded, resulting in missed data. The potential for biases to infiltrate EHRs during notetaking is a critical concern, as these biases can subsequently impact the performance of models built on such data. If biases are present in the clinical notes documented in the EHR, machine learning models trained on these data may inadvertently perpetuate and amplify them. 32 Therefore, addressing biases in EHR data is ethically imperative and crucial for ensuring equitable and unbiased health care outcomes when implementing machine learning models in a clinical setting.
Efforts to mitigate biases involve thoroughly examining data collection processes, monitoring model outputs for disparities, and implementing corrective measures to promote fairness and inclusivity in health care AI applications.
Another limitation is the need for extensive, high-quality annotated data sets for training and validation. Building such data sets demands substantial human effort and domain expertise to ensure accuracy and relevance. Such concerns can be somewhat mitigated by using software that creates lexicons by thoroughly parsing the training data. The lack of evidence from prospective and multicentric studies may be another limitation of NLP models for PCa management. This review carries the limitations inherent in a narrative review, notably less rigorous inclusion and exclusion criteria, lack of quantification of publication bias, and more subjectivity than a systematic review.
Future applications
Given the rapid development of NLP-based data extraction pipelines, there is significant potential for using such models in health care throughout the patient care process. NLP-based monitoring models have the potential to provide real-time EHR monitoring of abnormal DRE findings, thereby generating red flag alerts in EHRs, prompting providers to take appropriate measures as per major urologic society guidelines to provide optimal care to patients. Adherence to quality metrics such as DREs and PCOs may become intrinsically linked to reimbursement outcomes. In such a scenario, a growing impetus exists to construct organized fields documenting quality metrics. 8,30 Another possible use can be to document and compare PCO rates between providers, 33 leading to the detection of heterogeneous practice styles and personalized provider feedback. Insurance and billing practices may also benefit from such a system, as EHRs can assess complications, their treatments, and associated billing codes.
Another crucial area is adverse event detection, where NLP can systematically scan EHRs and clinical notes to identify potential side effects or complications associated with specific treatments, providing a proactive approach to patient safety monitoring. 34,35 For example, NLP-based systems have already been used to identify the occurrence of surgical-site infections. 36,37 NLP-based models can also flag and document differential rates of incorrect discharge practices between providers. One study showed that up to 20% of patients are discharged on opioid-only analgesia, putting them at high risk for overdose or dependence. NLP-based systems can provide real-time monitoring and alerts to such oversights while providing an easy avenue for detecting opioid overprescription. 36
Moreover, in patient-reported outcomes, NLP-based models can assist in extracting and analyzing subjective information from clinical narratives, offering a deeper understanding of patients' experiences and perspectives, which is valuable for outcomes research and treatment optimization. Similarly, clinical decision support systems with integrated NLPs can help suggest new and evolving treatments that might apply to a particular patient, contributing to a key facet of individualized medicine.
In addition, by enabling structured data extraction from unstructured clinical notes, NLP-based models build comprehensive and standardized data sets, facilitating large-scale epidemiologic studies and multicenter collaborations. Data collection can also benefit from a uniform collection style, as inter-researcher variability in data collection remains a persistent problem. 37 These models can also assist research by identifying patients suitable for the various available treatments. 32,38
Integrating deep learning algorithms such as ChatGPT into NLP presents a promising avenue for extracting PCa-related information from clinical notes. ChatGPT's proficiency in understanding and generating human-like text enables it to decipher the nuanced language used in medical documentation. 39 Moreover, ChatGPT's potential role in NLP for PCa-related data extraction can extend to its ability to adapt to the evolving landscape of medical language, which incorporates new terminologies, treatment modalities, and research findings. 40 Unfortunately, integrating large language models (LLMs) into NLP-based systems is not without downsides. LLMs have been known to frequently provide false or fabricated information (a phenomenon known as hallucination). 41 It is possible that LLM-powered NLP-based systems may introduce fabricated information, and without adequate safeguards in place, this can have significant clinical and academic implications. Therefore, a cautiously optimistic approach to incorporating LLMs into NLP-based systems may be ideal.
Addressing the unmet need for Health Insurance Portability and Accountability Act (HIPAA)-compliant NLP models is a critical endeavor in the evolving health care landscape. We believe that developing NLP models fully compliant with HIPAA standards poses both challenges and opportunities. Challenges involved in NLP integration into health systems include handling protected health information (PHI), ensuring secure data storage and transmission, and implementing robust access controls that restrict access to authorized personnel. 42 Solutions could involve innovations in encryption techniques 43 and secure data-sharing protocols 43 that ensure the safety of PHI. Perhaps the most important ethical aspect of integrating NLPs into health systems is open discussion and transparent communication with patients, who should be the ultimate decision-makers regarding their PHI. 44
Summary and Conclusion
NLP-based models hold the potential to simplify and cut costs in health care, with current models showing good performance in extracting data about physical examination findings, quality metrics, cancer staging from pathology reports, and miscellaneous PCa-related phenomena such as social isolation and pain. The most effective models target narrower subsets of text, such as identifying UI/ED/BD, rather than larger and more complex targets, such as identifying staging. Overall, integrating NLP-based models in clinical research shows promise to enhance efficiency, scalability, and the depth of insights derived from diverse health care data sources.
As we embrace NLP-based technologies, the overarching aim is to foster a health care ecosystem that learns, adapts, and evolves, ensuring better patient outcomes. In conclusion, NLP-based data collection approaches can significantly improve certain aspects of research and clinical practice. However, important limitations such as inconsistent performance, lack of an ethical framework for integration into health systems, and biases from original provider notes carried over to NLP-based data extraction remain. To our knowledge, a reliable, ready-to-use commercial NLP-based automated data extraction system is not yet available.
Footnotes
Authors' Contributions
A.B.: Conceptualization, data curation, formal analysis, investigation, methodology, validation, visualization, writing—original draft preparation, and writing—review and editing.
R.T.: Conceptualization, data curation, formal analysis, investigation, methodology, validation, visualization, and writing—review and editing.
J.G.P.: Formal analysis, investigation, writing—original draft preparation, and writing—review and editing.
J.K.: Writing—review and editing.
D.M.L.: Conceptualization, methodology, validation, visualization, and writing—review and editing.
R.M.: Conceptualization, methodology, validation, visualization, and writing—review and editing.
D.J.P.: Conceptualization, formal analysis, investigation, methodology, validation, and writing—review and editing.
H.N.S.: Conceptualization, data curation, formal analysis, investigation, methodology, validation, visualization, writing—original draft preparation, and writing—review and editing.
Author Disclosure Statement/Conflict of Interest
No competing financial interests exist.
Funding Information
No funding was received for this article.
