Performance of Predefined Search Patterns for Identifying Documented Goals-of-Care Discussions in Inpatient Electronic Health Records

Abstract

Objective:

To characterize the sensitivity and predictive value of predefined search terms for identifying documented goals-of-care discussions in health records of hospitalized patients with serious illness.

Methods:

We evaluated the performance of 30 previously published and investigator-defined search terms codified into regular expressions (a type of pattern-based text search) in detecting goals-of-care documentation in a 2974-note corpus of electronic health record notes belonging to 159 inpatients enrolled in a U.S. clinical trial over 2020–2021.

Results:

Compared to conventional chart abstraction, search terms for “goals of care” and synonyms such as “GOC” had poor sensitivity (range: 29.5–38.3%) and modest positive predictive value (PPV; range: 48.3–61.7%) for identifying notes with goals-of-care documentation. Combinations of search terms demonstrated modest performance (sensitivity 62.0%, PPV 59.4%, F₁ 0.61) but fell short of more complex natural language processing models.

Conclusion:

In certain contexts, predefined regular-expression-based search terms may have suboptimal sensitivity and predictive value for identifying documented goals-of-care discussions.

Introduction

Goals-of-care discussions are an important process measure in serious illness communication.¹ However, measuring documented goals-of-care discussions in unstructured electronic health record (EHR) data can be costly and labor-intensive.² One approach to reducing the burden of abstraction is to use predefined search terms and patterns to identify text requiring further human abstraction; this approach has been used to measure many outcomes and metrics from EHR text, including goals-of-care documentation and advance care planning.^3–11 Predefined search terms and patterns are often implemented using regular expressions, a type of search pattern that combines character patterns with wildcards, repetition rules, and logical operators to refine searches. Although regular expressions are easy to develop, highly transparent, and widely accessible, there are few data examining the sensitivity and predictive value of this approach against conventional whole-chart abstraction in measuring EHR-documented goals-of-care discussions for hospitalized patients.^10–12 In this study, we evaluate the performance of predefined search terms for identifying documented goals-of-care discussions compared to manual abstraction for hospitalized patients with serious illness, using an existing corpus of manually abstracted EHR data.^13,14

Methods

We conducted a secondary analysis of existing EHR data, comprising 2974 inpatient notes from 159 patients enrolled in a randomized trial of hospitalized patients aged ≥55 years with chronic life-limiting illness (defined by diagnosis codes¹⁵) or aged ≥80 years. The parent trial enrolled all eligible patients hospitalized at a UW Medicine hospital (a Seattle-area health system) from April 2020 to March 2021 under a waiver of informed consent. Patients were randomized to usual care or a clinician-facing communication-priming intervention (Jumpstart Guide).^13,14 The primary outcome was EHR documentation of a goals-of-care discussion, defined as a discussion of the “overarching aims of medical care for a patient”¹⁶ beyond routine code status discussions or citation of historical advance care planning documents. In the trial, this outcome was measured using deep-learning natural language processing (NLP)-screened human abstraction, in which a deep-learning NLP model identified likely passages for human adjudication.²

For the current study, we used a manually abstracted validation dataset from the parent trial to evaluate the performance of predefined search terms for measuring the same outcome. The validation sample comprised 159 trial participants whose records underwent conventional chart abstraction (without NLP) for the primary outcome. Due to challenges in reaching a prespecified subgroup enrollment target, this sample had been enriched for patients with Alzheimer’s disease and related dementias (ADRD; target proportion 50%) to support an interim subgroup analysis. To measure the reference outcome in the validation sample, four research coordinators manually reviewed and coded all sampled inpatient records using a qualitative data analysis platform and a predefined codebook.² Abstractors met weekly to review coding and ensure consistency. Records from the index hospitalization preceding randomization were included in the reference corpus but excluded from consideration for the trial outcome. The resulting manually abstracted 2974-note reference sample serves as the data source for this study, with manual abstraction used as the “gold standard” for the presence or absence of goals-of-care documentation.

To assess the performance of predefined search terms and patterns for identifying goals-of-care discussions, we assembled a broad-ranging list of 30 search terms, drawing from previously published search terms^11,17 and investigator consensus.^18–20 Search terms were codified into regular expressions and grouped into thematic categories. We then compared the binary outcome of regular-expression match against conventional chart abstraction results for goals-of-care documentation, evaluating both measures at the whole-note level. We characterized the performance of each search term using sensitivity (recall), positive predictive value (PPV; precision), and F₁ (the harmonic mean of sensitivity and PPV), examining PPV and F₁ in lieu of specificity because the latter provides limited insight into the prediction of rare outcomes compared to metrics that focus on true positives in sparse situations.^21,22
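The note-level comparison described above can be sketched as follows. This is a minimal illustration in Python rather than the Stata code used in the study; the function name `note_level_metrics`, the example pattern, and the four example notes and their labels are hypothetical and are not drawn from the study corpus.

```python
import re
from typing import Optional, Sequence


def note_level_metrics(matched: Sequence[bool], gold: Sequence[bool]) -> dict:
    """Sensitivity, PPV, and F1 for a binary pattern match against reference labels."""
    tp = sum(m and g for m, g in zip(matched, gold))      # matched and truly positive
    fp = sum(m and not g for m, g in zip(matched, gold))  # matched but negative
    fn = sum(g and not m for m, g in zip(matched, gold))  # positive but not matched
    sensitivity: Optional[float] = tp / (tp + fn) if (tp + fn) else None
    ppv: Optional[float] = tp / (tp + fp) if (tp + fp) else None
    if sensitivity is None or ppv is None or sensitivity + ppv == 0:
        f1 = None  # undefined when there are no true positives
    else:
        f1 = 2 * sensitivity * ppv / (sensitivity + ppv)  # harmonic mean
    return {"sensitivity": sensitivity, "ppv": ppv, "f1": f1}


# Hypothetical notes with hypothetical reference labels (True = abstractors
# found a documented goals-of-care discussion somewhere in the note).
notes = [
    "Discussed goals of care with patient and family; prefers comfort-focused care.",
    "Will readdress goals of care if condition worsens.",
    "Family shared that time at home matters most to the patient.",
    "Code status: full code.",
]
gold = [True, False, True, False]

# Assumed encoding of one search term; the study's exact expressions are not shown here.
pattern = re.compile(r"goals[ -]of[ -]care", re.IGNORECASE)
matched = [pattern.search(note) is not None for note in notes]

metrics = note_level_metrics(matched, gold)
print(metrics)
```

In this toy example the pattern flags one true discussion, flags one aspirational mention, and misses one discussion that never uses the phrase, so sensitivity, PPV, and F₁ are each 0.5.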

We then evaluated the ensemble performance of combinations of search terms toward maximizing F₁, employing a hierarchical approach to mitigate overfitting by reducing the size and dimensionality of the search space. We first identified the highest-performing combinations of search terms within each thematic category, and then evaluated ensembles of highest-performing search terms from all themes to identify search term sets with the best overall performance (Fig. 1). We also examined search term sets constrained to include terms from the specific themes of goals of care, advance care planning, and hospice or palliative care.
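One way to picture this two-stage search is the sketch below. It assumes that a combination of terms matches a note when any member term matches (a logical OR), and it carries forward only the single best combination per theme; the text above does not state how many combinations per theme were retained, so that choice, along with all function names and the toy data, is an assumption made for illustration.

```python
from itertools import combinations
from typing import Dict, Iterable, Set, Tuple


def f1_score(matched: Set[int], positives: Set[int]) -> float:
    """F1 from sets of note IDs: 2*TP / (notes matched + notes truly positive)."""
    denom = len(matched) + len(positives)
    return 2 * len(matched & positives) / denom if denom else 0.0


def union(sets: Iterable[Set[int]]) -> Set[int]:
    out: Set[int] = set()
    for s in sets:
        out |= s
    return out


def best_subset(candidates: Dict[str, Set[int]], positives: Set[int]) -> Tuple[tuple, float]:
    """Score every non-empty OR-combination of candidates; return the best by F1."""
    best_names, best_f1 = (), -1.0
    names = sorted(candidates)
    for k in range(1, len(names) + 1):  # smaller combinations are tried first
        for combo in combinations(names, k):
            score = f1_score(union(candidates[n] for n in combo), positives)
            if score > best_f1:  # strict '>' keeps the smaller combination on ties
                best_names, best_f1 = combo, score
    return best_names, best_f1


def hierarchical_search(themes: Dict[str, Dict[str, Set[int]]], positives: Set[int]):
    # Stage 1: best combination of terms within each theme.
    theme_best = {}
    for theme, terms in themes.items():
        combo, _ = best_subset(terms, positives)
        theme_best[theme] = union(terms[n] for n in combo)
    # Stage 2: best combination of the per-theme winners across themes.
    return best_subset(theme_best, positives)


# Toy data: each term maps to the IDs of the notes it matches (hypothetical).
positives = {1, 2, 3, 4}
themes = {
    "goals_of_care": {"goals of care": {1, 5}, "GOC": {2, 6, 7}},
    "hospice": {"hospice": {3}},
    "code_status": {"code status": {1, 2, 5, 6, 7, 8, 9}},
}
best_themes, best_f1 = hierarchical_search(themes, positives)
print(best_themes, round(best_f1, 2))
```

Searching within themes first keeps the number of candidate combinations small: an exhaustive search over 30 terms would require scoring 2³⁰ − 1 subsets, whereas each stage here enumerates only a handful.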

FIG. 1.

Hierarchical approach for evaluation of combinations of 30 search terms across 10 thematic categories. POLST, Physician Orders for Life-Sustaining Treatment.

Regular-expression matching and statistical analyses were performed using Stata/MP version 18.5 (StataCorp, stata.com). Manual chart abstraction was performed using Dedoose (SocioCultural Research Consultants, dedoose.com). Study procedures were approved by the University of Washington Institutional Review Board (STUDY00007031, STUDY00011002). The parent trial was registered with ClinicalTrials.gov (NCT04281784).

Results

Between April 23, 2020, and March 26, 2021, the parent trial¹³ enrolled 2512 patients, of whom 159 were sampled in the reference dataset (Table 1). The mean age in the reference sample was 72.6 (standard deviation, 10.6) years, 42% were female, and the majority (65%) had two or more chronic life-limiting illness diagnoses. The sample was purposively enriched for patients with ADRD (80/159, 50%). The EHR note corpus for this patient sample consisted of 2974 notes, of which 295 (9.9%) notes from 54/159 (34%) patients contained documented goals-of-care discussions by manual abstraction.

Table 1.

Characteristics of Patient Sample

Patient characteristics	n = 159
Age, years, mean (standard deviation)	72.6 (10.6)
Sex, n (%)
Female	66 (42)
Male	93 (58)
Race, n (%)
Asian	24 (15)
Black	36 (23)
Native American	3 (2)
Pacific Islander	1 (1)
White	94 (59)
Unknown	1 (1)
Ethnicity, n (%)
Hispanic	9 (6)
Non-Hispanic	149 (94)
Unknown	1 (1)
Trial eligibility criteria (not mutually exclusive), n (%)
Age ≥80 years	53 (33)
Chronic life-limiting illness diagnoses^a
Cancer with poor prognosis	30 (19)
Chronic lung disease	46 (29)
Coronary artery disease	71 (45)
Congestive heart failure	67 (42)
Peripheral vascular disease	42 (26)
Severe chronic liver disease	20 (13)
Diabetes with end-organ damage	32 (20)
Moderate-to-severe chronic kidney disease	42 (26)
Alzheimer’s disease and related dementias	80 (50)
No. chronic life-limiting illness diagnoses, n (%)
0	19 (12)
1	36 (23)
2	25 (16)
3	29 (18)
≥4	50 (31)

^a Chronic life-limiting illness diagnoses are not mutually exclusive.

We constructed an a priori list of 30 search terms and regular expressions that were grouped into 10 thematic categories and examined the prevalence and performance of each search term (Table 2). Notably, search terms for “goals of care” and similar constructs had poor sensitivity and PPV (search terms 1–3: sensitivity range: 29.5–38.3%, PPV range: 48.3–61.7%). Terms pertaining to code status were more highly prevalent in the dataset, with 37% of notes and 134/159 (84%) of patients’ charts containing the term “code status”; however, the PPV of these terms was low, as expected given that our operational definition of documented goals-of-care discussions excluded routine code status discussions or documentation. Search terms related to hospice care and comfort measures had excellent PPV but low sensitivity, likely as a consequence of their narrow scope. Within thematic categories, best-performing combinations of search terms demonstrated modest performance (Supplementary Table S1), with search terms in the goals-of-care category demonstrating 54.2% sensitivity, 51.0% PPV, and F₁ = 0.53. Combinations of search terms across themes (Table 3) showed improved performance over within-theme search terms, with the highest-performing search term set demonstrating 62.0% sensitivity, 59.4% PPV, and F₁ = 0.61. Constraining the search space to search term sets that included the goals-of-care category yielded a search term set demonstrating 76.6% sensitivity, 49.1% PPV, and F₁ = 0.60.

Table 2.

Performance of Individual Search Terms

Note-level prevalence and performance for detecting GOC discussions^b (N = 2974 notes)
Search terms, by thematic category^a	n (%)	Sensitivity (%)	Positive predictive value (%)	F₁
A. Goals of care:
1. goals of care^c	183 (6)	38.3	61.7	0.47
2. GOC, GoC^d	180 (6)	29.5	48.3	0.37
3. goals of care, goals for care, goals of treatment, goals for treatment^c,e	183 (6)	38.3	61.7	0.47
4. patient goals	1 (<1)	0	0	—
5. treatment goals	1 (<1)	0.3	100	0.01
B. Quality of life:
6. quality of life^c	33 (1)	10.2	90.9	0.18
7. QOL, QoL^d	10 (<1)	2.4	70	0.05
C. Code status:
8. CPR^d	66 (2)	11.5	51.5	0.19
9. DNR^d	339 (11)	55.9	48.7	0.52
10. DNI^d	215 (7)	35.6	48.8	0.41
11. do not intub[ate], no intub[ation]^f,g	0 (0)	—	—	—
12. do not mechanica[lly ventilate], no mechanica[l ventilation]^f,g	0 (0)	—	—	—
13. do not resus[citate], no resus[citation]^g	6 (<1)	0	0	—
14. compressions	24 (1)	1.7	20.8	0.03
15. code status	1105 (37)	63.7	17	0.27
16. full code	530 (18)	19	10.6	0.14
17. shocks	3 (<1)	0	0	—
18. defibrillation	5 (<1)	1.7	100	0.03
D. Advance care planning:
19. advance[d] care planning^g	4 (<1)	0.7	50	0.01
20. ACP^d	14 (<1)	3.4	71.4	0.06
E. POLST forms:
21. POLST^d	142 (5)	24.1	50	0.33
F. Power of attorney:
22. power of attorney	56 (2)	3.7	19.6	0.06
23. POA, [D]POA^d,g	219 (7)	14.6	19.6	0.17
G. Palliative care:
24. palliat[ive]^g	165 (6)	26.1	46.7	0.33
H. Family meetings:
25. family meeting	91 (3)	17.6	57.1	0.27
26. family discussion	4 (<1)	0.3	25	0.01
27. family conference	13 (<1)	1.4	30.8	0.03
I. Hospice:
28. hospice	53 (2)	14.9	83	0.25
J. Comfort measures only care:
29. comfort measures, comfort care	69 (2)	20.3	87	0.33
30. CMO^d	1 (<1)	0.3	100	0.01

^a All phrases were evaluated as complete phrases, and no whole word matches were required. Comma-separated terms match any of the listed search terms.

^b Each search term was evaluated as a binary variable against manual abstraction, which found 295 (10%) notes with goals-of-care discussions.

^c These terms also matched hyphenated versions (e.g., “goals-of-care”).

^d These terms were treated as case-sensitive.

^e This term yielded identical results to search term no. 1 in the data and was omitted from further analysis.

^f These terms were not encountered in the sample and were omitted from further analysis.

^g These terms matched words sharing the same stem; characters within brackets were omitted from the search phrase and are only shown for readability.

ACP, advance care planning; CMO, comfort measures only; CPR, cardiopulmonary resuscitation; DNI, do not intubate; DNR, do not resuscitate; [D]POA, durable power of attorney; GOC, goals of care; POLST, Physician Orders for Life-Sustaining Treatment; QOL, quality of life.
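The matching conventions described in the notes to Table 2 (hyphenated variants, case-sensitive abbreviations, word stems, and no whole-word requirement) might be encoded as in the Python sketch below. The exact expressions used in the study are not reproduced in this report, so these patterns are assumptions; in particular, treating non-abbreviated phrases as case-insensitive is inferred from the fact that only the abbreviations are flagged as case-sensitive.

```python
import re

# Phrase that also matches hyphenated variants; assumed case-insensitive.
GOALS_OF_CARE = re.compile(r"goals[ -]of[ -]care", re.IGNORECASE)

# Abbreviation treated as case-sensitive: matches "GOC" or "GoC" only.
GOC = re.compile(r"GOC|GoC")

# Stem search: "palliat" matches palliative, palliation, and so on.
PALLIAT = re.compile(r"palliat", re.IGNORECASE)

# No \b anchors are used, because whole-word matches were not required;
# a term can therefore also match inside a longer token.
examples = {
    "hyphenated": GOALS_OF_CARE.search("Goals-of-care conversation held today") is not None,
    "lowercase_abbrev": GOC.search("goc pending") is not None,
    "uppercase_abbrev": GOC.search("GOC discussion deferred") is not None,
    "stem": PALLIAT.search("Recommend palliation of symptoms") is not None,
    "inside_token": GOC.search("seen in XGOC clinic") is not None,
}
print(examples)
```

The last example illustrates one source of false positives when word boundaries are not enforced: a short abbreviation can match inside an unrelated token.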

Table 3.

Performance of Combinations of Search Terms Across Themes

Performance in detecting GOC discussions (N = 2974 notes)
High-performing search term sets by F₁ score^a	Search terms by number	n (%)	Sensitivity (%)	Positive predictive value (%)	F₁
Search term set with highest F₁ score: quality of life, QOL, QoL, POLST, family meeting, family conference, hospice, comfort measures, comfort care	6, 7, 21, 25, 27, 28, 29	308 (10)	62.0	59.4	0.61
Search term set with highest F₁ score that includes goals-of-care theme: goals of care, GOC, GoC, quality of life, QOL, QoL, POLST, family meeting, family conference, hospice, comfort measures, comfort care	1, 2, 6, 7, 21, 25, 27, 28, 29	460 (15)	76.6	49.1	0.60
Search term set with highest F₁ score that includes advance care planning, POLST, or POA themes: goals of care, GOC, GoC, quality of life, QOL, QoL, advance[d] care planning, ACP, POLST, power of attorney, [D]POA, family meeting, family conference, hospice, comfort measures, comfort care	1, 2, 6, 7, 19, 20, 21, 22, 23, 25, 27, 28, 29	656 (22)	80.7	36.3	0.50
Search term set with highest F₁ score that includes hospice or palliative themes: quality of life, QOL, QoL, POLST, palliat[ive], family meeting, family conference, hospice, comfort measures, comfort care	6, 7, 21, 24, 25, 27, 28, 29	404 (14)	67.8	49.5	0.57

^a All phrases were evaluated as complete phrases, and no whole word matches were required. Comma-separated terms match any of the listed search terms. See Table 2 for further details about individual search terms. For combinations with identical performance, the combination with the fewest terms is listed. Terms containing brackets searched for word stems; characters within brackets were omitted from the search phrase and are only shown for readability.
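As a reading aid, the reported percentages can be related to the underlying note counts. True-positive counts are not reported in the tables; the values below (TP ≈ 113 for search term 1 and TP ≈ 183 for the highest-F₁ search term set) are inferred from the reported sensitivity and the 295 reference-positive notes, and should be read as approximations rather than study data.

```latex
\[
\text{Sensitivity} = \frac{TP}{TP + FN}, \qquad
\text{PPV} = \frac{TP}{TP + FP}, \qquad
F_1 = \frac{2 \cdot \text{Sensitivity} \cdot \text{PPV}}{\text{Sensitivity} + \text{PPV}}
    = \frac{2\,TP}{(TP + FP) + (TP + FN)}
\]

% Search term 1 ("goals of care"): 183 matched notes, 295 reference-positive notes
\[
\frac{113}{295} \approx 0.383, \qquad
\frac{113}{183} \approx 0.617, \qquad
F_1 \approx \frac{2 \times 113}{183 + 295} = \frac{226}{478} \approx 0.47
\]

% Highest-F1 search term set (Table 3): 308 matched notes
\[
\frac{183}{295} \approx 0.620, \qquad
\frac{183}{308} \approx 0.594, \qquad
F_1 \approx \frac{2 \times 183}{308 + 295} = \frac{366}{603} \approx 0.61
\]
```

Both reconstructions agree with the tabulated sensitivity, PPV, and F₁ values to the reported precision.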

Discussion

In previous reports, our research group has described the use of several NLP models for identifying documented goals-of-care discussions in different note corpora.^2,18,23 When comparing the performance of predefined search terms against a previously published 110-million-parameter deep-learning model based on Bio+Clinical BERT (Bidirectional Encoder Representations from Transformers) that was tested on the same corpus,^2,24,25 we observed that the deep-learning model substantially outperformed all identified best-performing ensembles of predefined search terms (Fig. 2). Although the superior performance of deep-learning models over regular expressions is somewhat expected,^26,27 more sophisticated NLP models also come with steep development costs as well as a lack of transparency that can raise concerns about the introduction of hidden biases.^28–31

FIG. 2.

Comparison of search term performance with BERT-based deep-learning model.^a

^a The deep-learning model is an instance of Bio+ClinicalBERT that was fine-tuned on a manually labeled external training set and is fully described and evaluated in a previous report.² Herein, the model is evaluated on this study’s 2974-note 159-patient corpus, which is a superset of the corpus tested in the previous report² that includes records from the index hospitalization preceding randomization for the parent trial. Sensitivity (i.e., recall) and positive predictive value (PPV; i.e., precision) are presented at observed discrimination thresholds with sensitivities closest to prespecified values of 70%, 80%, and 90%. BERT stands for Bidirectional Encoder Representations from Transformers (Google Research, 2018).²⁵
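Selecting an observed discrimination threshold whose sensitivity is closest to a prespecified value can be sketched as follows. This is an illustrative Python sketch, not the procedure used to produce Fig. 2; the function name, the tie-breaking rule (the higher threshold is kept when two thresholds are equally close), and the toy scores and labels are all assumptions.

```python
from typing import Sequence, Tuple


def threshold_for_sensitivity(
    scores: Sequence[float], labels: Sequence[bool], target: float
) -> Tuple[float, float, float]:
    """Return (threshold, sensitivity, ppv) at the observed threshold whose
    sensitivity is closest to `target`; a note is predicted positive when
    its score is >= the threshold."""
    n_pos = sum(labels)
    if n_pos == 0:
        raise ValueError("sensitivity is undefined without positive labels")
    best = None
    for t in sorted(set(scores), reverse=True):  # only thresholds seen in the data
        predicted = [s >= t for s in scores]
        tp = sum(p and y for p, y in zip(predicted, labels))
        sensitivity = tp / n_pos
        ppv = tp / sum(predicted)  # at least one note is predicted positive at t
        gap = abs(sensitivity - target)
        if best is None or gap < best[0]:
            best = (gap, t, sensitivity, ppv)
    return best[1], best[2], best[3]


# Hypothetical model scores and reference labels for eight notes.
scores = [0.95, 0.90, 0.80, 0.60, 0.40, 0.30, 0.20, 0.10]
labels = [True, True, False, True, True, False, True, False]

operating_point = threshold_for_sensitivity(scores, labels, target=0.80)
print(operating_point)
```

Because a regular-expression match is binary, each search term set yields a single sensitivity and PPV pair, whereas a model that outputs scores can be read at several such operating points.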

Given the plummeting explainability of more modern NLP models—which are often described as a “black box,”³¹ and have even been known to confabulate responses³²—it is tempting for clinical researchers to turn toward simpler, more explainable models such as those based on predefined search terms and regular expressions. Although such approaches may offer excellent performance for certain datasets and constructs, in other contexts such an approach may risk compromising reliability or even introducing bias in counterintuitive ways. We are frequently asked, “Can’t you just search for ‘goals of care’ in the medical record?” Within the confines of our study, which examined records of hospitalized patients in a single health system, our results suggest that “searching for ‘goals of care’” can compromise sensitivity to a surprising extent.

The low performance of intuitive search terms such as “goals of care” was somewhat surprising. However, in our abstraction work, we have frequently noted that clinicians use the words “goals of care” in an aspirational or hypothetical sense (for example, “We should have a goals-of-care discussion,” or “Will consider this approach if consistent with goals of care”), with or without subsequent documentation of an actual goals-of-care discussion with patients or family members. Conversely, inpatient clinicians often document in-depth goals-of-care discussions without actually using the term “goals of care” or its synonyms, as evidenced by the remarkably low sensitivity of goals-of-care-themed search terms for notes containing documented goals-of-care discussions.

Our study has several important limitations. First, our study sought to measure a linguistically complex construct and, for the purposes of its parent clinical trial, adopted a definition of goals-of-care documentation that is more stringent than some other definitions. More permissive definitions that include routine code status documentation may exhibit different performance than that observed in our study, and constructs of lesser complexity are likely to be more amenable to regular-expression-based strategies. Second, our study measured only a limited set of search terms. Other terms that may be used in the documentation of goals-of-care discussions may exhibit differing performance. Notably, our search terms do not capture many domains of goals-of-care documentation such as values, trade-offs, prognosis, and worries. However, these concepts are linguistically even more complex and are likely to require more sophisticated means of measurement. Third, our data were collected from patients with serious illness hospitalized in a single multihospital health system in the Pacific Northwest. Documentation practices may differ in other health systems, regions, care settings (e.g., inpatient vs. outpatient), or with different EHR systems. Fourth, although our test corpus contained 2974 notes, they were collected from a relatively small sample of 159 patients that was enriched for patients with ADRD.

Conclusion

In a 159-patient, 2974-note sample of clinical notes collected from hospitalized patients with serious illness, predefined search terms implemented using regular expressions demonstrated poor sensitivity and PPV for identifying notes with documented goals-of-care discussions extending beyond routine code status discussions. In the inpatient setting, this construct may require more sophisticated NLP approaches to measure reliably, at the cost of explainability.

Footnotes

Acknowledgments

The authors are grateful for the contributions of the late J. Randall Curtis, MD, MPH, who was a founding principal investigator of this research program.

Authors’ Contributions

R.Y.L. had full access to all the data in the study and took responsibility for the integrity of the data and the accuracy of the data analysis.

Author Disclosure Statement

The authors have no conflicts of interest to disclose.

Funding Information

This work was supported by the National Institute on Aging (R01AG062441), the National Heart, Lung, and Blood Institute (K23HL161503, K12HL137940, T32HL125195), and the Cambia Health Foundation. Infrastructure support was provided by the Institute of Translational Health Science (National Center for Advancing Translational Sciences, UL1TR002319) and by the Pulmonary and Critical Care Medicine Fellowship Research Training Program of the University of Washington (T32HL007287). R.Y.L., A.M.U., K.S.L., J.S., T.C., W.B.L., D.G.D., R.A.E., and E.K.K. receive research funding from the National Institutes of Health. T.C. also reports receiving royalties from Springer Nature. The funding sources had no role in the design and conduct of the study; collection, management, analysis, and interpretation of the data; preparation, review, or approval of the article; and decision to submit the article for publication.

Supplementary Material

References

1. Bernacki, Block, American College of Physicians High Value Care Task Force. Communication about serious illness care goals: A review and synthesis of best practices. JAMA Intern Med, 2014; 174(12):1994–2003; doi: 10.1001/jamainternmed.2014.5271

2. Lee, Kross, Torrence, et al. Assessment of natural language processing of electronic health records to measure goals-of-care discussions as a Clinical Trial Outcome. JAMA Netw Open, 2023; 6(3):e231204; doi: 10.1001/jamanetworkopen.2023.1204

3. Turchin, Kolatkar, Grant, et al. Using regular expressions to abstract blood pressure and treatment intensification information from the text of physician notes. J Am Med Inform Assoc, 2006; 13(6):691–695; doi: 10.1197/jamia.M2078

4. Meystre, Savova, Kipper-Schuler, et al. Extracting information from textual documents in the electronic health record: A review of recent research. Yearb Med Inform, 2008; 17(01):128–144; doi: 10.1055/s-0038-1638592

5. Chatham, Bradley, Schirle, et al. Detecting problematic opioid use in the electronic health record: Automation of the addiction behaviors checklist in a chronic pain population. medRxiv, 2023; 12; doi: 10.1101/2023.06.08.23290894

6. Zupanc, Lakin, Volandes, et al. Forms or free-text? Measuring advance care planning activity using electronic health records. J Pain Symptom Manage, 2023; 66(5):e615–e624; doi: 10.1016/j.jpainsymman.2023.07.016

7. Albashayreh, Bandyopadhyay, Zeinali, et al. Natural language processing accurately differentiates cancer symptom information in electronic health record narratives. JCO Clin Cancer Inform, 2024; 8:e2300235; doi: 10.1200/CCI.23.00235

8. Limsomwong, Ingviya, Fumaneeshoat. Identifying cancer patients who received palliative care using the SPICT-LIS in medical records: A rule-based algorithm and text-mining technique. BMC Palliat Care, 2024; 23(1):83; doi: 10.1186/s12904-024-01419-1

9. …, Gu, Lotter, et al. Extraction and imputation of Eastern Cooperative Oncology Group performance status from unstructured oncology notes using language models. JCO Clin Cancer Inform, 2024; 8:e2300269; doi: 10.1200/CCI.23.00269

10. Lindvall, Lilley, Zupanc, et al. Natural language processing to assess end-of-life quality indicators in cancer patients receiving palliative surgery. J Palliat Med, 2019; 22(2):183–187; doi: 10.1089/jpm.2018.0326

11. Lindvall, Deng, Moseley, et al. Natural language processing to identify advance care planning documentation in a Multisite Pragmatic Clinical Trial. J Pain Symptom Manage, 2022; 63(1):e29–e36; doi: 10.1016/j.jpainsymman.2021.06.025

12. Lilley, Lindvall, Lillemoe, et al. Measuring processes of care in palliative surgery: A novel approach using natural language processing. Ann Surg, 2018; 267(5):823–825; doi: 10.1097/SLA.0000000000002579

13. Curtis, Lee, Brumback, et al. Intervention to promote communication about goals of care for hospitalized patients with serious illness: A Randomized Clinical Trial. JAMA, 2023; 329(23):2028–2037; doi: 10.1001/jama.2023.8812

14. Curtis, Lee, Brumback, et al. Improving communication about goals of care for hospitalized patients with serious illness: Study protocol for two complementary randomized trials. Contemp Clin Trials, 2022; 120:106879; doi: 10.1016/j.cct.2022.106879

15. Wennberg, Fisher, Goodman, Skinner. Tracking the Care of Patients with Severe Chronic Illness. The Dartmouth Institute for Health Policy and Clinical Practice; 2008. Available from: https://data.dartmouthatlas.org/downloads/atlases/2008_Chronic_Care_Atlas.pdf [Last accessed: January 18, 2025].

16. Secunda, Wirpsa, Neely, et al. Use and meaning of “Goals of Care” in the healthcare literature: A systematic review and qualitative discourse analysis. J Gen Intern Med, 2020; 35(5):1559–1566; doi: 10.1007/s11606-019-05446-0

17. Chien, Shi, Chan, et al. Identification of Serious Illness Conversations in Unstructured Clinical Notes Using Deep Neural Networks. In: Artificial Intelligence in Health: First International Workshop, AIH 2018, Stockholm, Sweden, July 13-14, 2018, Revised Selected Papers. Lecture Notes in Artificial Intelligence. (Koch Fernando, Koster Andrew, Bichindaritz Isabelle, Herrero Pau, et al. eds.) Springer Nature: Switzerland; 2019, pp. 199–212.

18. Lee, Brumback, Lober, et al. Identifying goals of care conversations in the electronic health record using natural language processing and machine learning. J Pain Symptom Manage, 2021; 61(1):136–142.e2; doi: 10.1016/j.jpainsymman.2020.08.024

19. Lee, Kross, Downey, et al. Efficacy of a communication-priming intervention on documented goals-of-care discussions in hospitalized patients with serious illness: A Randomized Clinical Trial. JAMA Netw Open, 2022; 5(4):e225088; doi: 10.1001/jamanetworkopen.2022.5088

20. Curtis, Downey, Back, et al. Effect of a patient and clinician communication-priming intervention on patient-reported goals-of-care discussions between patients with serious illness and clinicians: A Randomized Clinical Trial. JAMA Intern Med, 2018; 178(7):930–940; doi: 10.1001/jamainternmed.2018.2317

21. Davis, Goadrich. The relationship between Precision-Recall and ROC curves. In: Proceedings of the 23rd International Conference on Machine Learning; Pittsburgh, PA; 2006.

22. Cook, Ramadas. When to consult precision-recall curves. Stata J Promot Commun Stat Stata, 2020; 20(1):131–148; doi: 10.1177/1536867x20909693

23. Uyeda, Curtis, Engelberg, et al. Mixed-methods evaluation of three natural language processing modeling approaches for measuring documented goals-of-care discussions in the electronic health record. J Pain Symptom Manage, 2022; 63(6):e713–e723; doi: 10.1016/j.jpainsymman.2022.02.006

24. Alsentzer, Murphy, Boag, et al. Publicly Available Clinical BERT Embeddings. 2019. Available from: http://arxiv.org/abs/1904.03323 [Last accessed: August 29, 2024].

25. Devlin, Chang, Lee, et al. BERT: Pre-training of deep bidirectional transformers for language understanding. 2019. Available from: http://arxiv.org/abs/1810.04805 [Last accessed: August 29, 2024].

26. Sejnowski. The unreasonable effectiveness of deep learning in artificial intelligence. Proc Natl Acad Sci U S A, 2020; 117(48):30033–30038; doi: 10.1073/pnas.1907373117

27. Kaplan, McCandlish, Henighan, et al. Scaling laws for neural language models. 2020. Available from: http://arxiv.org/abs/2001.08361 [Last accessed: August 29, 2024].

28. Amann, Blasimme, Vayena, et al.; Precise4Q consortium. Explainability for artificial intelligence in healthcare: A multidisciplinary perspective. BMC Med Inform Decis Mak, 2020; 20(1):310; doi: 10.1186/s12911-020-01332-6

29. Rogers, Kovaleva, Rumshisky. A primer in BERTology: What we know about how BERT works. Trans Assoc Comput Linguist, 2020; 8:842–866.

30. Gianfrancesco, Tamang, Yazdany, et al. Potential biases in machine learning algorithms using electronic health record data. JAMA Intern Med, 2018; 178(11):1544–1547; doi: 10.1001/jamainternmed.2018.3763

31. Castelvecchi. Can we open the black box of AI? Nature, 2016; 538(7623):20–23; doi: 10.1038/538020a

32. Zhang, Li, Cui, et al. Siren’s song in the AI Ocean: A survey on hallucination in large language models. 2023. Available from: http://arxiv.org/abs/2309.01219 [Last accessed: August 29, 2024].