Abstract
Objectives:
Pediatric mental health prevalence rates have increased in recent years, while gaps remain in the number of available providers. Ongoing evaluation and understanding of treatment progress and engagement are critical to psychiatric care, and these details are often documented in the electronic health record (EHR). Given the utility of retrospective chart review (RCR) as a tool for psychiatrists, we developed a coding system examining common comorbid conditions (anxiety and attention-deficit/hyperactivity disorder [ADHD]) and adherence and evaluated interrater reliability.
Methods:
We created a coding system with a comprehensive manual and coding instructions that explore both symptom severity domains (anxiety, ADHD, and global) and adherence to medication. Codes were rated using Likert scales, and two independent raters coded all data.
Results:
RCR was completed for 142 patients with a total of 1139 visits over 2 years. Weighted linear kappa statistics ranged between 0.77 and 0.95, and weighted quadratic kappa statistics ranged between 0.74 and 0.96, suggesting substantial to almost perfect agreement. Interrater agreement was highest for anxiety severity.
Conclusions:
We created a novel coding system for RCR and found substantial to almost perfect interrater reliability for assessing ADHD severity, anxiety severity, global severity, and medication adherence using psychiatry encounter notes documented in an EHR. Our coding system explores conditions that are often heterogeneous and have waxing and waning presentations, using a continuum that captures the complexity of symptoms. Future directions include utilization of coding systems to explore emotion and behavior change over time to optimize treatment.
Introduction
Mental health challenges among children and adolescents have increased in recent years, with a notable rise during and after the COVID-19 pandemic (Racine et al., 2021). Specifically, the pooled prevalence estimates of clinically elevated child and adolescent anxiety doubled during the pandemic, with 20% of youth reporting significant symptoms (Fortuna et al., 2023). Attention-deficit/hyperactivity disorder (ADHD) is another common condition found in youth, with prevalence estimates between 6% and 10% prepandemic and significant symptom increase postpandemic (Rogers and MacLean, 2023; Xu et al., 2018). ADHD and anxiety often co-occur, with comorbidity estimates ranging from 25% to 40% (D’Agati et al., 2019). Given the prevalence of these conditions, the increased need for mental health services, and a shortage of available providers, it is critical to optimize efficient, targeted treatments (Yellowlees, 2022).
To understand treatment progress, ongoing symptom monitoring is needed. For example, routine outcome monitoring (ROM) offers a valuable framework for measuring treatment outcomes in everyday practice (Noorden et al., 2013). Though research protocols consistently track outcomes using validated measures, the application of these tools in real world settings may be less consistent (Krishna et al., 2019). Even with systems like ROM, providers often do not utilize reliable measures to monitor treatment outcome through clinical care (Jonášová et al., 2025). However, the prevalent use of electronic health records (EHRs) presents an opportunity to integrate ROM into everyday practice by systematically extracting information from EHR data to support standardized outcome tracking. EHRs aid in collecting and sharing health care information which supports care planning and provision such as medical appointments across various subspecialties, including psychiatry and behavioral health (Kariotis et al., 2022). Documentation includes details about symptom change between visits and medication adherence. Medication adherence is particularly important in ADHD management, where exploring medication use, including nuances like breaks during school holidays, aids in better understanding variability in outcomes (Charach et al., 2008). Poor treatment adherence for conditions such as ADHD and anxiety undermines long-term outcomes for patients and contributes to relapses or chronic impairment, making it paramount that symptom severity and adherence are measured to examine intervention effectiveness (Ferrin et al., 2025). This need is even more critical now due to pandemic disruptions in symptom monitoring and adherence (Plevinsky et al., 2020).
Retrospective chart review (RCR) is a useful research methodology utilized in health care disciplines that allows for exploration of prerecorded data and has potential to provide psychiatry with valuable research opportunities (Gearing et al., 2006; Vassar and Matthew, 2013). For example, RCR has been used to study comorbidity rates and clinical subtypes of ADHD among children (Goldstein and Schwebach, 2004; Weiss et al., 2003). Given the rich amount of data that exists in EHRs, research has explored the utilization of RCR combined with coding of clinical notes to better understand pediatric mental health presentations (Kuhns et al., 2020; Piper et al., 2022). In the last decade, there has been increased focus on translational machine learning for psychiatry to support diagnosis, prognosis, and treatment selection, especially given the shortage of providers, but more research is needed (Dwyer and Koutsouleris, 2022; Hoodbhoy et al., 2021). Coding systems and content analysis are established methods to analyze observed behavior and interview data (Chorney et al., 2015; Weston et al., 2001). Coding typically involves developing a detailed codebook with ongoing training to establish intercoder reliability (Allison et al., 2000; McHugh, 2012; Weston et al., 2001). Content analysis allows for flexible exploration of text data which can be done conventionally without preconceived categories or via a more directed approach, which is appropriate when existing theory or research exists (Hsieh and Shannon, 2005).
Of note, the Diagnostic and Statistical Manual of Mental Disorders, Fifth Edition (DSM-5) introduced severity modifiers (mild, moderate, severe) for various conditions, including ADHD, and maintained these classifications in the most recent text revision (American Psychiatric Association, 2022). These modifiers attempt to address the growing acceptance that ADHD, like many mental health conditions, is dimensional with complexity that occurs on a continuum (Lubke et al., 2009). However, there have been some questions of how reliably these severity classifications work and whether a broader assessment of global functioning similar to the clinical global impression scale may offer greater utility (Epstein and Loren, 2013). Consequently, this supported our rationale for incorporating a global severity measure into the coding system.
The current study focuses on the creation of a coding system to conduct RCR. This coding system was created as part of a larger longitudinal study examining medication use, physical growth (height/weight), and changes in behavioral symptoms in children with ADHD and those with comorbid ADHD and anxiety. Variability in documentation across providers required the research team to develop a tool to extract information about anxiety and ADHD medication adherence and global symptom severity across thousands of psychiatry encounters with different providers. Therefore, in the present study, we aim to explore the reliability of this coding system in measuring these important clinical domains by utilizing multiple coders to establish interrater reliability.
Method
Study design and sample selection
We completed an RCR and screened 1200 randomly selected EHRs of patients seen by psychiatry at a pediatric hospital in the mid-Atlantic region between January 2018 and December 2019. Following the initial random selection of 1200 EHRs, a random selection of 198 patients diagnosed with ADHD (both with and without comorbid anxiety) was included in the full sample (1598 encounters), and we explored a subset of 142 patients whose encounters were coded by the same raters. Notably, we found very limited standardized assessments associated with patient encounters for symptoms of interest, which led us to code every patient encounter to explore symptom changes over time. The clinical notes for the encounters were completed by providers across various training levels including residents, fellows, and attending psychiatrists, and note templates varied across providers. Given the lack of routine outcome measures utilized during visits, we created and piloted our coding system to explore key outcome variables across encounters through the 2-year study period. We considered key variables for RCR studies in our study design including data abstraction instruments and processes, coding manual development, ethical considerations, and interrater reliability (Gearing et al., 2006; Vassar and Matthew, 2013). Children’s National Hospital Institutional Review Board approval was obtained for this retrospective study, including waiver of consent/assent. Data were collected and de-identified to minimize risk.
Coding manual and operationalization of variables
Two PhD-level psychologists (K.R. and L.M.A.) created a coding system with a comprehensive manual and coding instructions (see full coding system in Supplemental Data). A detailed codebook and data abstraction survey were designed, and a team of psychologists, psychiatrists, and one research assistant (RA) was trained in data entry. We utilized Research Electronic Data Capture (REDCap), a secure, web-based software platform designed to support data capture with an intuitive interface and automated export procedures (Harris et al., 2009). A subset of the team was trained in coding the encounters using the coding system, rating the following variables: ADHD severity, anxiety severity, global severity, and adherence to medication. Time was spent discussing how to code for missing data and how to code details in notes that contained historical information from prior encounters (i.e., details for history and physical or new patient visit or previous follow-up).
For ADHD and anxiety, we modeled our 3-point Likert scale rating system on mild, moderate, and severe classifications introduced for ADHD in the DSM-5 (American Psychiatric Association, 2022). Detailed instructions in the coding system were created by K.R. and outlined how to review an encounter for symptoms consistent with these diagnoses. We drafted examples to demonstrate increasing impairment across home, school, or peer interactions. If the note lacked detail to provide a rating or a patient did not have a confirmed diagnosis during the encounter, coders could enter a score of 0 for “unable to assess.” ADHD coding was completed for all patients in the sample, while anxiety ratings were completed for all those with a confirmed comorbid anxiety diagnosis (N = 67). After initial coding, to distinguish cases that were “not applicable” due to a lack of diagnosis from cases that coders were unable to assess, an updated value (888) was utilized to represent cases that were not applicable. We did not have sufficient standardized rating scales (e.g., Vanderbilt) in the sample; therefore, these reports were not incorporated into coding decisions.
Given that there was a range of comorbidities in the sample, we utilized a 7-point Likert numeric scale informed by the Clinician Global Impressions–Severity (CGI-S) scale, which is used to answer the question, “Considering your total clinical experience with this population, how mentally ill is the patient at this time?” (Guy, 1976). The CGI-S has been used in research and clinic settings to rate observed and reported symptoms, behavior, and functioning in the past 7 days, on average (Busner and Targum, 2007) and has been used for RCR (Kelly, 2010). We created a global severity rating scale informed by CGI-S, and raters were given the following instructions: “Based on the information available in the clinical note, how severe are the child’s current emotional and behavioral symptoms and their impact on functioning at this time?” Since time between encounters varied in our sample, we created detailed descriptions for coders to rank overall severity accounting for number of diagnoses and degree of emotional or behavioral impairment across life domains. While a diagnostic frequency guide was provided for higher ratings (i.e., this child meets criteria for 2 or more mental health diagnoses), emphasis was placed on rating the degree of impairment and the impact across settings. Although the coding system included an “unable to assess” code, this was not utilized as all encounters were rated. Finally, as we were interested in examining medication utilization in the sample, we created a 4-point Likert scale to examine medication adherence, with ratings of poor, low, fair, and good adherence. Coders looked for specific details related to frequency of missed doses across the time interval between encounters, gaps in prescription refills, or changes in dosing or timing. As with the anxiety and ADHD codes, coders could enter a score of 0 for “unable to assess” when adherence could not be determined from the content of the clinical note, or 888 if not applicable.
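The rating scheme described above can be summarized as a small lookup structure. The following Python sketch is illustrative only and is not the authors' REDCap instrument: the domain names, the numeric values 1 to N for each Likert scale, and the choice to accept both special codes (0 and 888) for every domain are assumptions made for the example, as is the helper name `classify_code`.

```python
# Special codes described in the text.
UNABLE_TO_ASSESS = 0    # note lacked enough detail to provide a rating
NOT_APPLICABLE = 888    # e.g., no confirmed diagnosis for that domain

# Assumed numeric ranges: 3-point (ADHD, anxiety), 7-point (global),
# and 4-point (adherence: poor, low, fair, good) Likert scales.
SCALES = {
    "adhd_severity": range(1, 4),
    "anxiety_severity": range(1, 4),
    "global_severity": range(1, 8),
    "medication_adherence": range(1, 5),
}


def classify_code(domain, value):
    """Return 'rated', 'unable_to_assess', or 'not_applicable' for a code.

    Raises KeyError for an unknown domain and ValueError for a value
    outside the domain's scale (hypothetical validation rules).
    """
    scale = SCALES[domain]
    if value == UNABLE_TO_ASSESS:
        return "unable_to_assess"
    if value == NOT_APPLICABLE:
        return "not_applicable"
    if value in scale:
        return "rated"
    raise ValueError(f"{value!r} is not a valid code for {domain}")
```

A check of this kind could be run over exported ratings before any agreement analysis, so that out-of-range entries are caught early rather than silently treated as a rating category.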
Coders
Two coders (L.M.A. and W.G.-T.) completed matched ratings. One coder was a PhD-level attending psychologist at the academic medical center, and the second was a bachelor’s-level RA trained in the coding system. Before coding started, the RA completed comprehensive training on the coding system with L.M.A., who codeveloped the coding system. This included a detailed discussion on ADHD and anxiety diagnoses and evidence of impairment. Coding was completed independently, and weekly meetings were held throughout the course of coding over several months. These meetings allowed for discussion of notes that were difficult to code and exploration of examples of different severity levels from notes reviewed the week before. As we were interested in determining interrater reliability, initial codes were not altered to allow for the calculation of kappa and percent agreement. Notably, even early in the coding process, ratings showed strong similarity between coders, and as such, the coding system was not altered from its initial state.
Results
Charts for 142 patients with a total of 1139 encounters were coded by both L.M.A. and W.G.-T. and utilized for analyses. There was an average of 5.71 encounters per chart (standard deviation = 4.19). Overall, 142 patients had ADHD, 67 had at least one anxiety disorder (range from 1 to 3), and 92 had another comorbid diagnosis in addition to ADHD and anxiety (i.e., autism, depression). Interrater reliability was assessed using weighted kappa statistics for 1139 patient encounters for ADHD severity, global severity, and medication adherence level. Anxiety severity was assessed for 543 encounters. Additionally, given that adherence was so frequently coded 0 (unable to assess from the chart), additional kappa analyses were completed only for instances where at least one coder rated adherence to medication (442 encounters). Given the ordinal nature of the rating scales, both linear and quadratic weighted kappa coefficients were calculated. All analyses were conducted using SPSS version 31.0.
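For readers unfamiliar with weighted kappa, the statistic can be sketched in a few lines of Python. This is a minimal illustration of the standard definition of Cohen's weighted kappa with linear or quadratic disagreement weights, not the SPSS procedure used in the study; the function names and the convention of passing the ordered category list explicitly are assumptions made for the example.

```python
from collections import Counter


def percent_agreement(rater_a, rater_b):
    """Proportion of encounters on which both raters gave the same code."""
    matches = sum(a == b for a, b in zip(rater_a, rater_b))
    return matches / len(rater_a)


def weighted_kappa(rater_a, rater_b, categories, weights="linear"):
    """Cohen's weighted kappa for two raters on an ordinal scale.

    `categories` lists every possible code in rank order. Disagreement
    between ranks i and j is weighted |i - j| / (k - 1) for linear
    weights and the square of that for quadratic weights.
    """
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("ratings must be non-empty and of equal length")
    k = len(categories)
    if k < 2:
        raise ValueError("at least two categories are required")
    power = {"linear": 1, "quadratic": 2}[weights]
    rank = {c: i for i, c in enumerate(categories)}
    if any(v not in rank for v in list(rater_a) + list(rater_b)):
        raise ValueError("rating outside the supplied categories")

    n = len(rater_a)
    pair_counts = Counter(zip(rater_a, rater_b))
    marg_a, marg_b = Counter(rater_a), Counter(rater_b)

    def disagreement(a, b):
        return (abs(rank[a] - rank[b]) / (k - 1)) ** power

    # Observed weighted disagreement across all paired ratings.
    observed = sum(disagreement(a, b) * c for (a, b), c in pair_counts.items()) / n
    # Weighted disagreement expected by chance from the raters' marginals.
    expected = sum(
        disagreement(a, b) * marg_a[a] * marg_b[b]
        for a in categories
        for b in categories
    ) / (n * n)
    if expected == 0:
        raise ValueError("kappa is undefined when both raters use one category")
    return 1 - observed / expected
```

Quadratic weights penalize large disagreements (e.g., mild versus severe) more heavily than adjacent ones, which is why the linear and quadratic coefficients for the same ratings can differ. How a code of 0 ("unable to assess") is ranked relative to the Likert values is a decision the caller makes through the order of `categories`.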
The weighted linear kappa statistics ranged between 0.77 and 0.95, suggesting substantial to almost perfect agreement. Specifically, interrater agreement was highest for anxiety severity (κ = 0.93, p < 0.001, 95% confidence interval [CI]: [0.91, 0.96]) followed by adherence (κ = 0.90, p < 0.001, 95% CI: [0.83, 0.93]), ADHD severity (κ = 0.87, p < 0.001, 95% CI: [0.84, 0.90]), and CGI (κ = 0.86, p < 0.001, 95% CI: [0.84, 0.89]).
The weighted quadratic kappa statistics ranged between 0.74 and 0.96, suggesting substantial to almost perfect agreement across all subcategories. Specifically, interrater agreement was also highest for anxiety severity (κ = 0.95, p < 0.001, 95% CI: [0.93, 0.97]) followed by ADHD severity (κ = 0.89, p < 0.001, 95% CI: [0.86, 0.92]), CGI (κ = 0.90, p < 0.001, 95% CI: [0.88, 0.92]), and adherence (κ = 0.90, p < 0.001, 95% CI: [0.87, 0.93]), each demonstrating equally strong agreement.
We also examined medication adherence in the smaller subset of encounters (N = 442) where at least one rater endorsed an adherence rating, and results indicated substantial agreement for weighted linear kappa statistics (κ = 0.77, p < 0.001, 95% CI: [0.71, 0.83]) and weighted quadratic kappa statistics (κ = 0.74, p < 0.001, 95% CI: [0.66, 0.81]). Overall, these findings suggest a high level of consistency in adherence ratings even after removing encounters in which neither coder could rate adherence from the note.
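The subset rule used for this sensitivity analysis (keep an encounter if at least one coder entered an adherence rating) can be expressed as a simple filter. The Python sketch below is a hypothetical illustration: the function name is invented for the example, and treating both 0 ("unable to assess") and 888 ("not applicable") as non-ratings is an assumption rather than a detail confirmed by the text.

```python
# Codes that do not count as an adherence rating (assumed for illustration).
NON_RATING_CODES = frozenset({0, 888})


def rated_by_at_least_one(pairs, non_rating=NON_RATING_CODES):
    """Keep paired codes where at least one coder gave a real rating.

    `pairs` is an iterable of (coder_1_code, coder_2_code) tuples,
    one tuple per encounter.
    """
    return [
        (a, b)
        for a, b in pairs
        if a not in non_rating or b not in non_rating
    ]


# Example with made-up codes: the first and fourth encounters are dropped
# because neither coder entered an adherence rating.
example = [(0, 0), (0, 3), (4, 4), (888, 888), (2, 0)]
subset = rated_by_at_least_one(example)
```

Pairs in which only one coder rated adherence stay in the subset and count as disagreements, which is consistent with agreement being lower in the subset than in the full set of encounters.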
Discussion
Our findings demonstrate the capacity of using this novel coding system to reliably classify ADHD, anxiety, and global severity, as well as medication adherence using RCR of EHRs. Ongoing debates around assessing symptoms categorically or on a continuum invite an opportunity to explore other ways to conceptualize mental health diagnoses (Bell, 2011; Lebeau et al., 2012). We explored this in our study by creating a multicategory ordinal coding system to assess symptom severity, medication adherence, and global functioning across a 2-year treatment interval. Coders demonstrated high interrater reliability across all coded domains.
Our coding system was implemented with high interrater reliability by doctoral and nondoctoral team members, which underscores the trainability of this coding protocol and may even suggest future applications in novel technology-assisted approaches, including the training of machine learning models whose algorithms could benefit from the consistently accurate codes (Wu et al., 2024). It has been suggested that EHRs may benefit from focused design to support documentation and processes to track patient emotional and behavioral concerns over time (Cifuentes et al., 2015). Utilizing details from the EHR can ultimately enhance clinical care and guide clinical translation research, including quality improvement. Examples of standardization may include the use of smart text or point-and-click note sections built into the EHR that clearly outline details from interval history including severity of symptoms by diagnostic category since last visit, clinician-rated level of medication adherence, and examples of impairment across key domains (i.e., home, school, peers). Drop-down options aid providers in efficiently documenting a base response and additional free text space allows for more nuanced details.
Our coding system uniquely explores ADHD and anxiety symptoms with a dimensional examination of severity and impairment, which is important given the growing debate around understanding ADHD symptoms across a spectrum (Michelini et al., 2024). Outcome monitoring over time allows for patient-centered care and optimal treatment that targets symptom fluctuations. ADHD and anxiety show high comorbidity rates, and their symptom presentations influence pharmacological treatment from childhood through adulthood (D’Agati et al., 2019). Combined with adherence coding, providers can use coding information to inquire about new stressors, diagnoses, or barriers to treatment that exacerbate symptoms. Our coding system demonstrated good initial evidence of reliability as well as trainability for assessing these key domains, though more work is needed to explore its generalizability in different settings. For example, coding systems may provide a unique opportunity for retrospective review of care that can benefit multiple behavioral health providers, including in the emergency department, where standardized assessments are often unavailable (Holder et al., 2017). A coding system could examine chief complaints, symptom descriptors, and diagnoses to better understand patterns in access utilization and influence future care (i.e., questions asked, support needed in the emergency room).
In our study, we found that providers documented medication adherence less often than symptom severity. For example, many notes described symptoms (e.g., patient struggled with inattention at school this month) but did not include any statement about medication use or frequency in the interval history, making it unclear whether providers assessed medication adherence during the visit. We know from research that medication adherence predicts larger improvements in symptoms and/or functioning, especially for adolescents with ADHD (Sibley et al., 2022). This underscores the importance of both evaluating and targeting medication adherence, and our coding process highlighted potential areas of EHR standardization that may be needed.
Limitations
While this study had many strengths, there were a few notable limitations. First, we faced some of the inherent challenges that arise in an RCR using an EHR. For example, psychiatry notes were not standardized across providers with variability in documentation detail across providers or encounters. While we accounted for this during training on the coding system, the clinical variation in documentation practices may impact results (Atsma et al., 2020) and further underscores the benefit of exploring standardization of templates or interview points, especially at a training institution. Additionally, behavioral health notes in an EHR may lack full details, especially those sensitive in nature, which may leave a gap in full details reported during an encounter (Kariotis et al., 2022). A third limitation is that, due to our priority to complete full coding to assess interrater reliability, we did not complete comparison meetings to finalize codes for ratings with discrepancies. Fourth, although the global severity rating demonstrated strong interrater reliability, incorporating diagnostic count in the scale may introduce confusion, as the number of diagnoses alone does not necessarily reflect clinical severity. While we aimed to mitigate this by training coders to consider emotional and/or behavioral impairment and functional impact across domains, future work should consider replacing diagnostic count anchors with more detailed examples of functional impairment to distinguish symptom severity from clinical complexity. Last, the development of this coding system was part of a larger project, which impacted the initial development phase of research questions. 
We utilized chart review to understand patient symptoms and medication adherence over time, but it is notable that these data are several steps removed from the patient (i.e., provider hears details, details are recorded in note, team member completes chart abstraction), which underscores the complexity of chart review (Allison et al., 2000).
Conclusions
We created a novel coding system for RCR and found substantial to almost perfect interrater reliability for assessing ADHD severity, anxiety severity, global severity, and medication adherence using psychiatry encounter notes documented in an EHR over 2 years. This coding system explores conditions that are often heterogeneous and have waxing and waning presentations using a continuum that captures the complexity of symptoms. Future directions include the utilization of coding systems to explore emotion and behavior change over time to optimize treatment, standardization of documentation, and exploration of implications for technological advancements such as machine learning.
Clinical Significance
This study demonstrates clinical significance by successfully leveraging the EHR to develop a reliable coding system. EHRs are regularly used during intake and follow-up behavioral health appointments, making them an important tool to both support clinical care and inform translational research. The integration of a coding system in an EHR can bolster outcome monitoring procedures and standardize practices. Importantly, the present coding system was developed and used with comorbid diagnoses, a mental health presentation commonly treated by psychiatry and other behavioral health subspecialties.
Footnotes
Author Disclosure Statement
L.W. discloses stock holdings with Pfizer and Moderna.
Funding Information
This publication was supported by the Lambert Family Foundation. Its contents are solely the responsibility of the authors and do not necessarily represent the official views of the Lambert Family Foundation.
Supplemental Material
References
