Abstract
Background
Clinician notes are structured in a variety of ways. This research pilot tested an innovative study design and explored the impact of note formats on diagnostic accuracy and documentation review time.
Objective
To compare two formats for clinical documentation (narrative format vs. list of findings) on clinician diagnostic accuracy and documentation review time.
Method
Participants diagnosed written clinical cases, half in narrative format, and half in list format. Diagnostic accuracy (defined as including correct case diagnosis among top three diagnoses) and time spent processing the case scenario were measured for each format. Generalised linear mixed regression models and bias-corrected bootstrap percentile confidence intervals for mean paired differences were used to analyse the primary research questions.
Results
The odds of a correct diagnosis were 26% greater with list format notes than with narrative notes. However, there is insufficient evidence that this difference is significant (75% CI 0.80–1.99). On average, the list format notes required 85.6 more seconds to process and arrive at a diagnosis than narrative notes (95% CI −162.3, −2.77). Among cases where participants included the correct diagnosis, the list format notes required on average 94.17 more seconds than narrative notes (75% CI −195.9, −8.83).
Conclusion
This study offers note format considerations for those interested in improving clinical documentation and suggests directions for future research. Balancing the priority of clinician preference with value of structured data may be necessary.
Implications
This study provides a method and suggestive results for further investigation in usability of electronic documentation formats.
Keywords
Background
Research on the optimal format for clinical documentation in electronic health records (EHRs) is limited (Colicchio and Cimino, 2019; Colicchio et al., 2020). Some previous studies have indicated that clinicians feel constrained by the limitations of creating structured EHR notes (Koopman et al., 2015; Rosenbloom et al., 2011). Conversely, there are concerns that excessive narrative documentation is time-consuming to process, potentially distracting and not essential for clinical care (Rizvi et al., 2016; Clarke et al., 2014; Mazer et al., 2017; Hirschtick, 2006; Overhage and McCallie, 2020). Facilitating more accurate diagnoses or increasing efficiency could be an important justification for selection of note format, but if note format does not impact the accuracy of diagnoses or save time in the process of reviewing the notes, there may be advantages to having notes in a structured format for use in clinical decision support and research (Rosenbloom et al., 2011). There are a variety of ways that clinician notes can be structured. What is now a common way of structuring documentation evolved from Weed’s desire to bring additional structure to what was a time-based narrative (Weed, 1968a, 1968b). Sections of notes (e.g. History of Present Illness, Past History, and Physical Exam) add structure within a narrative framework. More recently, Skeff and colleagues suggested bringing even more structure to the History of Present Illness (HPI) by organising it chronologically, including dates and times, and shortening narrative to key data under each of the time points (Skeff, 2014; Mazer et al., 2017). To date, there are no data to indicate whether different note formats affect accuracy of diagnosis, and only limited data covering the effect of note format on efficiency (Hultman et al., 2019).
This exploratory research was designed to pilot test an innovative study design, and to explore the impact of different note formats on diagnostic accuracy and time spent reviewing clinical notes. The aim of the study was to examine the impact of two note formats (a list of findings (list); and narrative text) on clinician diagnoses and time spent in reviewing notes.
Method
Sample
Participants in this study were physicians. The recruitment strategy included posting information about the study on the listserv of the Society for Improvement of Diagnosis in Medicine (SIDM), notices on a University of Alabama at Birmingham (UAB) continuing education network website, and an email from the UAB Internal Medicine residency director to residents, fellows and Internal Medicine attending physicians. The goal was to recruit at least 12 providers, the suggested sample size per arm for pilot and feasibility studies (Billingham et al., 2013). These types of studies are not intended to be definitive well-powered trials, but they make it possible to estimate effect sizes and feasibility measures for a subsequent well-powered study (Teare et al., 2014; Leon et al., 2011; Lancaster et al., 2004).
Procedure
Participants were expected to review a set of cases and arrive at a differential diagnosis, rather than a single correct diagnosis, because the information in the case did not include the definitive test that could confirm a single diagnosis. The task was for the participants to process and list up to three possible diagnoses for each case using the web-based Canvas Learning Management System (LMS). Eight written cases, used in previous research (Berner et al., 1994), were revised to include updated terminology. The study cases were derived from actual cases and were clinically challenging, with varied diagnoses and multiple organ systems. The patients had been seen by internal medicine physicians and included data available before a diagnosis was established. The correct diagnosis for each case was determined by the results of the definitive test that confirmed the diagnosis, and was confirmed by an expert panel, but, as stated above, this definitive test was not included in the information presented to the participants.
To reduce potential bias, each participant was randomly assigned to sets of eight cases, half in each format (narrative and list). Each set contained the same cases, but in different case orders, and with different cases in list and narrative format; thus, all participants received the same cases but not in the same order or format. The narrative format depicted case details in paragraphs, similar to narrative physician notes, including headings of history, physical examination and laboratory data. The list format depicted all case information in bullet points under similar headings, with separately listed negative information, such as pertinent negatives or normal laboratory test values. Thus, both formats had some structure in that they used subheadings. The main difference between the two formats was that the narrative format used paragraphs of text, whereas the list format used lists of findings, similar to what Mazer et al. (2017) included in their chronological History of Present Illness. Figure 1 illustrates these formats.
Figure 1. Case description in a narrative format versus a list of findings format.
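The counterbalancing described above can be sketched as follows. This is a minimal illustration, not the authors' procedure: the case identifiers are hypothetical, and the sketch draws a fresh order and format split for each participant, whereas the study assigned participants to prepared sets.

```python
import random

# Hypothetical identifiers for the eight study cases (assumption).
CASES = [f"case_{i}" for i in range(1, 9)]


def build_case_set(rng):
    """Return the eight cases in random order, half list and half narrative."""
    order = CASES[:]
    rng.shuffle(order)  # randomise the presentation order
    # Pick which four cases appear in list format; the rest are narrative.
    list_cases = set(rng.sample(CASES, k=len(CASES) // 2))
    return [(case, "list" if case in list_cases else "narrative") for case in order]


if __name__ == "__main__":
    rng = random.Random(2021)  # seeded only so the illustration is repeatable
    for case, note_format in build_case_set(rng):
        print(case, note_format)
```

Every participant still sees all eight cases, so differences between formats are not confounded with differences between cases.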
In the study by Berner et al. (1994), from which the cases used in the present study were taken, a group of experts had arrived at consensus on whether diagnoses produced by four diagnostic decision support systems for these cases were correct or incorrect. These lists of correct and incorrect diagnoses (and their synonyms) for each case were used to evaluate the participants’ diagnoses. In addition, for diagnoses that participants listed that were not on the previously developed lists, two physicians (MLG and PID) independently reviewed the additional diagnoses and arrived at consensus on those diagnoses considered correct. For each case and subject, diagnoses were considered accurate if any of the three diagnoses listed by participants was correct. The LMS calculated the amount of time participants spent from opening the case to submitting their first diagnosis. This measure of time to process the note defined efficiency.
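The accuracy rule above (a case is scored as accurate if any of the up to three listed diagnoses is correct) can be sketched as follows. The diagnosis strings and the simple text normalisation are assumptions for illustration; in the study, matching relied on expert-developed lists and physician review rather than string comparison.

```python
def normalise(text):
    """Lower-case and collapse whitespace so trivial variants compare equal."""
    return " ".join(text.lower().split())


def is_accurate(listed, accepted):
    """True if any of the first three listed diagnoses is an accepted one.

    `accepted` holds the correct diagnosis and its synonyms for one case.
    """
    accepted_norm = {normalise(a) for a in accepted}
    return any(normalise(d) in accepted_norm for d in listed[:3])


if __name__ == "__main__":
    # Hypothetical example, not taken from the study cases.
    accepted = {"pulmonary embolism", "pulmonary thromboembolism"}
    print(is_accurate(["pneumonia", "Pulmonary Embolism"], accepted))
```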
Data analysis
Diagnostic accuracy served as the dependent variable for a generalised linear mixed regression analysis. A binary diagnostic accuracy variable was derived for each case assessed by each participant, indicating whether the diagnosis was correct. Note type (narrative or list) served as the independent variable. The "time to process the note" variable measured how many seconds the participant took to process the case. A mixed model binary logistic regression was used to test for an association between diagnostic accuracy and note type, with participant ID entered as a random effect (random intercept). Comparative analyses of narrative and list format cases were conducted by aggregating data at the participant level: each participant's data were aggregated using the median, giving paired values per participant. Due to the small sample size, bias-corrected bootstrap percentile confidence intervals were estimated for mean paired differences in time to diagnosis.
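A bias-corrected bootstrap percentile interval for a mean paired difference can be sketched as below. This is an illustrative implementation under stated assumptions, not the authors' code: the paired differences are invented, the number of resamples and the seed are arbitrary, and the function name is hypothetical.

```python
import random
from statistics import NormalDist, mean


def bc_bootstrap_ci(diffs, level=0.95, n_boot=5000, seed=0):
    """Bias-corrected bootstrap percentile CI for the mean of paired differences."""
    rng = random.Random(seed)
    n = len(diffs)
    observed = mean(diffs)
    # Resample participants with replacement and recompute the mean each time.
    boots = sorted(mean(rng.choices(diffs, k=n)) for _ in range(n_boot))

    # Bias correction: where the observed mean sits in the bootstrap distribution.
    below = sum(b < observed for b in boots)
    prop = min(max(below / n_boot, 1 / (n_boot + 1)), n_boot / (n_boot + 1))
    nd = NormalDist()
    z0 = nd.inv_cdf(prop)

    # Shift the percentile cut-offs by the bias-correction term.
    alpha = 1 - level
    lo_p = nd.cdf(2 * z0 + nd.inv_cdf(alpha / 2))
    hi_p = nd.cdf(2 * z0 + nd.inv_cdf(1 - alpha / 2))

    def quantile(p):
        idx = min(n_boot - 1, max(0, int(p * n_boot)))
        return boots[idx]

    return observed, quantile(lo_p), quantile(hi_p)


if __name__ == "__main__":
    # Hypothetical per-participant differences in seconds (narrative minus list).
    diffs = [-210.0, -35.5, 12.0, -150.0, -60.0, 40.0, -95.0, -180.5, -20.0, -75.0]
    for level in (0.95, 0.85, 0.75):
        print(level, bc_bootstrap_ci(diffs, level=level))
```

Reusing the same seed across confidence levels keeps the resamples identical, so the 75% interval is nested within the 85% and 95% intervals.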
Because this was a pilot study, and following the recommendations of previously published work, findings are reported at significance levels of 0.05, 0.15 and 0.25 (i.e. with 95%, 85% and 75% confidence intervals [CIs]) (Lee et al., 2014). In addition to the 95% CIs, it is recommended to evaluate the 75% and 85% CIs for pilot and feasibility studies to determine whether the findings merit a well-powered trial. Data analysis was done with R 4.0.4 (R Core Team, 2021) and the R packages tidyverse 1.3.0 (Wickham et al., 2019) and lme4 1.1-26 (Bates et al., 2015). In addition, SAS software version 9.4 (SAS Institute Inc, 2018) and NCSS-PASS (NCSS-PASS, 2020) were used.
The research was approved by the University of Alabama at Birmingham (UAB) Institutional Review Board.
Results
Table 1. Descriptive statistics for participants and case characteristics.
a Since some participants completed only one type of note, the total number of participants does not equal the sum of participants who completed cases with list or narrative notes.
Table 2. Descriptive statistics for participants’ diagnostic accuracy and process time to diagnose.
Table 3. Effect of the note format on diagnostic accuracy.
Table 4. Time for processing the narrative notes versus time for processing the list type of note, all diagnoses.
Table 5. Time for processing the narrative notes versus time for processing the list type of note, correct diagnoses.
For the effect of note format on diagnostic accuracy presented in Table 3, the 75% CI is 0.80–1.99, and the odds ratio (OR) is 1.26. The point estimate suggests a 26% greater odds of diagnosing accurately with the list format compared to the narrative. However, all of the CIs include an OR of 1, which means that the evidence of an association between accuracy of diagnosis and note format is insufficient, even from a pilot study point of view (Billingham et al., 2013). Based on the preliminary effect size estimated, the sample size needed for a confirmatory well-powered trial was calculated. Given the current correct diagnosis rate for list format and narrative cases (25/55 and 22/55, respectively), eight cases (four narrative and four list) in total assigned to each physician, with an intracluster correlation of 0.07, a total of 357 physicians would be required to achieve 80% power to detect a difference between the list and narrative groups at a significance level of 0.05.
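As a rough arithmetic check, a crude odds ratio can be computed directly from the correct diagnosis counts given above (25/55 for list, 22/55 for narrative). This sketch ignores the clustering of cases within participants, so it need not equal the model-based estimate of 1.26; the function names are illustrative.

```python
def odds(successes, total):
    """Odds of success: successes divided by failures."""
    return successes / (total - successes)


def crude_odds_ratio(a, n_a, b, n_b):
    """Unadjusted odds ratio of group A versus group B."""
    return odds(a, n_a) / odds(b, n_b)


# Counts reported in the text: list 25/55 correct, narrative 22/55 correct.
or_list_vs_narrative = crude_odds_ratio(25, 55, 22, 55)
print(round(or_list_vs_narrative, 2))  # 1.25
```

The crude value of 1.25 is close to the reported mixed-model odds ratio, which additionally accounts for the random intercept per participant.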
In Table 4, for all data, the bias-corrected bootstrap percentile confidence intervals for paired data show that processing the narrative notes took less time than processing the list notes (95% CI −162.3, −2.77), with a mean difference of 85.6 s and a medium effect size (d = 0.52). At a significance level of 0.05, this suggests that there is strong evidence of a greater time requirement to process list notes. The mean (SD) processing time was 265.57 (152.46) seconds for narrative notes, 351.17 (242.22) seconds for list notes and 308.37 (203.57) seconds for all types of notes. A sample size estimation was also performed (using G*Power v3.1.9.6) for the paired t-test at a significance level of 0.05 and power of 80%. The required sample size for the effect size estimated from the data (d = 0.52) is n = 32.
In Table 5, using only the correct diagnoses, the bias-corrected bootstrap percentile confidence intervals for paired data show that processing the narrative notes took less time than processing the list notes (75% CI −195.9, −8.83), with a mean difference of 94.17 s and a small effect size (d = 0.37). At a significance level of 0.25, this suggests that there is weak evidence of a greater time requirement to process list notes. The mean (SD) processing time was 330.83 (204.75) seconds for narrative notes, 425 (360.81) seconds for list notes and 377.92 (288.68) seconds for all types of notes. A sample size estimation was also performed for a paired t-test at a significance level of 0.05 and power of 80%. The required sample size for the effect size estimated from the data (d = 0.37) is n = 60.
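The order of magnitude of these paired t-test sample sizes can be checked with a normal approximation, sketched below. This is not the calculation the authors ran: dedicated power software works with the t distribution, so the approximation gives somewhat smaller values (30 and 58) than the reported n = 32 and n = 60. The function name is hypothetical.

```python
import math
from statistics import NormalDist


def paired_n_normal_approx(d, alpha=0.05, power=0.80):
    """Approximate number of pairs for a two-sided paired test.

    Uses n = ((z_{1-alpha/2} + z_{power}) / d)^2, a normal approximation
    that slightly underestimates the t-based requirement.
    """
    nd = NormalDist()
    z_alpha = nd.inv_cdf(1 - alpha / 2)
    z_power = nd.inv_cdf(power)
    return math.ceil(((z_alpha + z_power) / d) ** 2)


print(paired_n_normal_approx(0.52))  # 30
print(paired_n_normal_approx(0.37))  # 58
```

Because the required n scales with 1/d², the smaller effect size among correct diagnoses roughly doubles the number of pairs needed.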
Discussion
Preliminary estimates from this pilot study provide only very weak evidence that, for primary care physicians, note format (list or narrative) affects diagnostic accuracy. Specifically, a structured note format implemented as a list of findings may not improve diagnostic accuracy. Further, per the estimates, a much larger trial with over 350 participants would be needed to detect a statistically significant difference in diagnostic accuracy between note formats. Such a trial would require a multi-site recruitment approach. For this pilot sample, three different sources of participants were used, and very few physicians were interested. Given this difficulty in recruiting the pilot sample, it may be very difficult to recruit a sufficient sample for a larger trial.
There was clearer evidence that the list format notes took longer to process than the narrative notes. Although it is possible that participants became distracted while working on the cases, thus lengthening the time per case, it is unlikely that this occurred frequently or that it would have affected one format more than the other. However, the list format was likely to have been unfamiliar to many of the participants, and this lack of familiarity could have affected the time taken to process the notes. Although data may be recorded in an EHR template in a structured form, the note may be displayed in the more familiar narrative text. If lack of familiarity led to the longer time taken for list notes, a study that did the assessment after participants had become familiar with the format might find less difference between the formats.
Participants may have taken longer to process the list format because narrative notes can imply a diagnosis, whereas the list format required more effort to process and synthesise the list of findings. The study by Mazer et al. (2017) asked residents to compare the preparation of the usual history of present illness to one that was chronologically organised. The chronological history also requires more effortful interpretation and synthesis and residents said it took longer to prepare than the usual history. However, they also said it saved them time when presenting the case. Since residents’ usual presentation to their supervisors is in a synthesised form (verbally or in writing) like the narrative format, it is possible that faculty also may find it easier to focus on the synthesised diagnosis and miss the opportunity to educate students about the individual findings. Further research on the causes for the structured form taking more time and the implications for medical education may be warranted.
It is also possible that the minimal differences between formats were due to both formats having some structure, in that they included standard subheadings. In addition, the HPI did not have the additional chronological structure advocated by Skeff and colleagues (Skeff, 2014; Mazer et al., 2017), which might have lessened the cognitive load to a greater degree than the list format. Additional studies with more extreme differences in structure may yield different results, and comparing variations along the continuum of structure may be worth doing.
Other aspects of note style are also key considerations in recommending one style of note over the other. One benefit of a narrative note is that it provides the opportunity to capture patients’ complaints in their own words, with as much detail as possible. In some circumstances this could include contextual clues that could be important to determine the correct diagnosis, even though in the present study such a benefit was not found. Neither of those elements would be captured in a list version of the same encounter. Schwartz et al. (2012) found that even narrative notes often missed important contextual information, for example, financial reasons why the patient was not taking the prescribed medication, and Gantzer et al. (2020) emphasised the importance of capturing the patient’s story, which they felt is difficult to do with structured data in the EHR. In contrast, structured notes would greatly facilitate evaluation of note quality and research studies of quality and would facilitate the use of the data in clinical decision support systems. Clinical decision support systems, in particular, have usually required that input data be in a structured format, although the exact nature of the type of structure may vary. However, it is possible that advances in natural language processing might provide ways to quickly analyse narrative notes for these same purposes (Koleck et al., 2019).
Time demands for physicians to document their notes in the different formats were not measured in this study. It is possible that one type of note may be preferred when one is reviewing the note, while another type may be preferred when creating the notes. It is possible that the results would be different with non-primary care physicians, more routine cases, or a larger sample size, or in more time-constrained real-world settings. Given the estimates of sample size needed to confirm the difference in diagnostic accuracy, it is unlikely that a study with such a large sample would be feasible.
Conclusion
This pilot study offers note format considerations for those interested in improving clinical documentation. Future studies should consider exploring the impact on efficiency of structured notes once users are accustomed to using them. If sufficient participants could be recruited, it might also be valuable to further explore diagnostic accuracy and the relationship between accuracy and efficiency. Balancing the priority of clinician preference with the value of structured data also may be necessary. In any case, the usability of electronic documentation formats is a much needed area for research and this study provides a method and suggestive results for further investigation.
Footnotes
Declaration of conflicting interests
The author(s) declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.
Funding
The author(s) received no financial support for the research, authorship, and/or publication of this article.
