Abstract
Background:
The impact of sex and gender on the incidence and severity of coronavirus disease 2019 (COVID-19) remains controversial. Here, we aim to describe the characteristics of COVID-19 patients at disease onset, with special focus on the diagnosis and management of female patients with COVID-19.
Methods:
We explored the unstructured free text in the electronic health records (EHRs) within the SESCAM Healthcare Network (Castilla-La Mancha, Spain). The study sample comprised the entire population with available EHRs (1,446,452 patients) from January 1st to May 1st, 2020. We extracted patients' clinical information upon diagnosis, progression, and outcome for all COVID-19 cases.
Results:
A total of 4,780 patients with a confirmed diagnosis of COVID-19 were identified. Of these, 2,443 (51%) were female, and female patients were on average 1.5 years younger than male patients (61.7 ± 19.4 vs. 63.3 ± 18.3, p = 0.0025). There were more female COVID-19 cases in the 15–59-year-old interval, with the greatest sex ratio (95% confidence interval) observed in the 30–39-year-old range (1.69; 1.35–2.11). Upon diagnosis, headache, anosmia, and ageusia were significantly more frequent in females than males. Chest X-ray imaging and blood tests were performed less frequently in females (65.5% vs. 78.3% and 49.5% vs. 63.7%, respectively), all p < 0.001. Regarding hospital resource use, females were hospitalized (44.3% vs. 62.0%) and admitted to the intensive care unit (2.8% vs. 6.3%) less often than males, all p < 0.001.
Conclusion:
Our results indicate important sex-dependent differences in the diagnosis, clinical manifestation, and treatment of patients with COVID-19. These results warrant further research to identify and close the gender gap in the ongoing pandemic.
Introduction
As of July 2020, the World Health Organization (WHO) has declared that the coronavirus disease 2019 (COVID-19) pandemic is far from controlled. The cumulative number of confirmed COVID-19 cases across 216 countries worldwide amounts to over 11,874,226; 545,481 confirmed deaths have been reported to date. 1 Daily numbers of both infections and casualties are reaching record highs in many countries, with many already experiencing ‘second waves’ after lockdowns lift. 2
Since COVID-19 was first identified on December 31, 2019 in Wuhan (Hubei Province, China), 3 many unknowns have persisted regarding the epidemiology, clinical characteristics, prognosis, and management of the disease. 4 Although substantial efforts have been aimed at improving our clinical understanding of the disease, less is known about the gendered impact of the current pandemic. Indeed, investigating sex- and gender-related issues in health care is an ongoing and unmet need, 5 and it is considered a research priority issue within the WHO's Sustainable Development Goals, a strategic opportunity to promote human rights, and achieve health for all. 6
Characterizing the extent to which COVID-19 impacts women and men differently is of vital importance to better understand the consequences of the pandemic and to design equitable health policies and effective therapeutic strategies. In this line, recent evidence suggests that there are indeed sex differences in the clinical outcomes of COVID-19. 7–9 Some hypotheses underscore the influence of hormonal factors, 10 immune response, 11 differential distribution of the angiotensin-converting enzyme 2 (ACE-2) receptors, and smoking habits, 12 among others. 13
To further characterize the gendered impact of COVID-19, here, we aimed to address whether the frequency and severity of COVID-19 affect women differently than men. In addition, we sought to explore the factors underlying these differences. To achieve these goals, we used natural language processing (NLP) and artificial intelligence to explore the unstructured, free-text clinical information captured in the electronic health records (EHRs) of a large series of test-confirmed COVID-19 cases.
Methods
This study is part of the BigCOVIData initiative 14 and was conducted in compliance with legal and regulatory requirements. 15 This study was classified as a “non-post-authorization study” (EPA) by the Spanish Agency of Medicines and Health Products (AEMPS), and was approved by the Research Ethics Committee at the University Hospital of Guadalajara (Spain). We have followed the STrengthening the Reporting of OBservational studies in Epidemiology (STROBE) guidance for reporting observational research. 16
Study design, data source, and patient population
This was a retrospective, multicenter study using secondary free-text data from patients' EHRs within the SESCAM Healthcare Network in Castilla-La Mancha, Spain. Data were retrieved from all available departments, including inpatient hospital, outpatient hospital, and emergency room, for virtually all types of provided services in each participating hospital. The study period was January 1, 2020–May 1, 2020.
The study database was fully anonymized and aggregated, so it did not contain patients' personally identifiable information. Given that clinical information was handled in an aggregate, anonymized, and irreversibly dissociated manner, patient consent regulations do not apply to the present study.
The study sample included all patients in the source population with test-confirmed COVID-19 (mainly polymerase chain reaction [PCR]+ but also IgG/IgM+).
Extracting free text from EHRs: EHRead®
To meet the study objectives, we used EHRead, a technology developed by SAVANA that applies NLP, machine learning, and deep learning to access and analyze the unstructured, free-text information written by health professionals in EHRs. The process used for the extraction of clinical data by EHRead has been previously described. 17 In brief, all extracted clinical terms are standardized according to a unique terminology. This custom-made terminology is based on the Systematized Nomenclature of Medicine-Clinical Terms (SNOMED-CT) and includes more than 400,000 medical concepts, acronyms, and laboratory parameters aggregated over the course of 5 years of free-text mining. The clinical entities detected in the unstructured free text are then classified based on EHRs' sections using a combination of regular expression rules and machine learning models. Deep learning classification methods, which rely on word embeddings and context information, are also used to determine whether the clinical information is expressed in terms of negative, speculative, or affirmative statements.
Internal validation
For particular cases where extra specifications are required (e.g., to differentiate COVID cases from other mentions of the term related to fear of the disease or potential contact), the detection output was manually reviewed in more than 5,000 reports to avoid any ambiguity associated with free-text reporting. All NLP deep learning models used here were validated using the standard training/validation/testing approach; we used a 75/12/13 split ratio in the available annotated data (between 2,000 and 3,000 records, depending on the model) to ensure efficient generalization to unseen cases. For the linguistic validation of analyzed variables regarding COVID-19 mentions, signs/symptoms (e.g., dyspnea, tachypnea, pneumonia), laboratory values (e.g., ferritin, lactate dehydrogenase [LDH]), and treatments (e.g., hydroxychloroquine, cyclosporine, Lopinavir/Ritonavir), we obtained F-scores (the harmonic mean of precision and recall) >0.80 in all cases. However, the validation of “PCR-confirmed COVID-19” returned an F-score of 0.64; although the precision in the identification of this concept was very high (0.90), the recall value was 0.5. This means that even though our model accurately identifies PCR+ cases (i.e., very low number of false positives), the prevalence data reported here may be underestimated. Importantly, out of a subsample of 964 manually reviewed clinical reports (532 from males and 432 from females), a total of 158 PCR+ cases (16.4%) were missed by the system. The proportion of female patients among the detected and missed cases was 46.2% and 38.0%, respectively; a chi-square test of independence revealed no significant differences between the two groups (p = 0.07). These data indicate that there was not a clear bias toward females in the proportion of undetected cases.
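As a quick arithmetic check of the validation metrics above, the following minimal Python sketch computes the F-score as the harmonic mean of precision and recall. The function name `f_score` is introduced here for illustration only and is not part of EHRead.

```python
def f_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (F1)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Precision and recall reported for "PCR-confirmed COVID-19"
pcr_f = f_score(0.90, 0.50)
print(round(pcr_f, 2))  # 0.64
```

Because the harmonic mean is pulled toward the smaller of the two values, a recall of 0.5 caps the F-score well below the 0.90 precision.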
Data analyses
We generated frequency tables to display the information regarding comorbidities, symptoms, and other categorical variables. Continuous variables (e.g., age) were described using summary tables containing mean, standard deviation, median, minimum and maximum values, and quartiles for each variable. To test for statistically significant differences between males and females, we used Yates-corrected chi-square tests for categorical variables (percentages) and analysis of variance for normally distributed continuous variables. Sex ratios (SRs) and their 95% confidence intervals (CIs) are presented for several epidemiological and clinical indicators. To determine whether the SRs of confirmed COVID-19 cases significantly varied across time, we performed linear regression analyses to test the null hypothesis that the slope is equal to zero. Sex differences in COVID-19-related clinical outcomes (i.e., confirmed cases, hospitalization, and intensive care unit [ICU] admission) were further confirmed via multivariate analyses, adjusting for age. All statistical inferences were performed at the 5% significance level using two-sided tests or 95% CIs.
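To illustrate the Yates-corrected chi-square test for a 2×2 table, here is a minimal Python sketch using only the standard library. The function name and the example counts are assumptions: the counts are back-calculated from the internal validation figures (964 reviewed reports, 158 missed cases, hence 806 detected; 46.2% of 806 ≈ 372 females and 38.0% of 158 ≈ 60 females) and are not taken from the study database.

```python
import math


def yates_chi_square(a: int, b: int, c: int, d: int) -> tuple[float, float]:
    """Yates-corrected chi-square statistic and p-value for the
    2x2 table [[a, b], [c, d]] (1 degree of freedom)."""
    n = a + b + c + d
    row1, row2 = a + b, c + d
    col1, col2 = a + c, b + d
    # Continuity correction: shrink |ad - bc| by n/2, never below zero.
    diff = max(abs(a * d - b * c) - n / 2, 0.0)
    chi2 = n * diff ** 2 / (row1 * row2 * col1 * col2)
    # With 1 degree of freedom, P(X > chi2) equals erfc(sqrt(chi2 / 2)).
    p_value = math.erfc(math.sqrt(chi2 / 2))
    return chi2, p_value


# Assumed counts (back-calculated from reported percentages):
# detected cases: 372 female, 434 male; missed cases: 60 female, 98 male.
chi2, p_value = yates_chi_square(372, 434, 60, 98)
print(f"chi2 = {chi2:.2f}, p = {p_value:.2f}")
```

Under these assumed counts the p-value comes out close to 0.07, in line with the value reported for the internal validation. The sketch does not guard against empty rows or columns.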
Results
From a source population of 2,045,385 individuals, we extracted and analyzed the clinical information of 1,446,452 patients with available EHRs from January 1st to May 1st, 2020. Among these, we then retrieved the clinical information upon diagnosis, progression, and outcome for 4,780 patients with a test-confirmed diagnosis of COVID-19, of whom 2,443 (51%) were women. The patient flowchart for female and male patients is depicted in Figure 1. The female/male SRs (95% CI) for hospitalization and ICU admission were 0.49 (0.43–0.55) and 0.60 (0.44–0.80), respectively. To further confirm the sex-dependent differences in the clinical outcomes related to COVID-19, we performed a multivariate analysis of the explored outcomes adjusted for age. These analyses revealed that the higher risk of hospital admission and ICU use in men was sustained after controlling for patients' age, with female/male SRs (95% CI) of 0.73 (0.65–0.83) and 0.48 (0.31–0.76), respectively. Regarding confirmed diagnosis, sex-dependent differences remained nonsignificant in the multivariate analysis (female/male SR of 0.88, 95% CI: 0.69–1.13).
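The unadjusted hospitalization SR can be approximated with a short Python sketch. Two assumptions are made here that the text does not state: that the SR is computed as a female/male odds ratio with a Wald confidence interval on the log scale, and that the hospitalization counts can be back-calculated from the reported percentages (44.3% of 2,443 females ≈ 1,082; 62.0% of the remaining 2,337 males ≈ 1,449). The function name is hypothetical.

```python
import math


def odds_ratio_ci(events_f, total_f, events_m, total_m, z=1.96):
    """Female/male odds ratio with a Wald (log-scale) confidence interval."""
    a, b = events_f, total_f - events_f  # females with / without the outcome
    c, d = events_m, total_m - events_m  # males with / without the outcome
    odds_ratio = (a / b) / (c / d)
    se_log = math.sqrt(1 / a + 1 / b + 1 / c + 1 / d)
    lower = math.exp(math.log(odds_ratio) - z * se_log)
    upper = math.exp(math.log(odds_ratio) + z * se_log)
    return odds_ratio, lower, upper


# Assumed counts, back-calculated from the reported percentages.
sr, lower, upper = odds_ratio_ci(1082, 2443, 1449, 2337)
print(f"SR = {sr:.2f} ({lower:.2f}-{upper:.2f})")
```

Under these assumptions the result is approximately 0.49 (0.43 to 0.55), which agrees with the unadjusted hospitalization SR reported above. The age-adjusted SRs come from multivariate models and cannot be reproduced from the percentages alone.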

Patient flowchart. Flowchart depicting the total number of inhabitants in the source population, the number (%) of patients with available electronic health records analyzed, the number of patients diagnosed with COVID-19, and of those, the number of hospitalizations and intensive care unit admissions. ♂ = male patients; ♀ = female patients. *Confirmed cases based on laboratory results (mainly PCR+ but also IgG/IgM+). COVID-19, coronavirus disease 2019; PCR, polymerase chain reaction; IgG, immunoglobulin G; IgM, immunoglobulin M.
Isolated COVID-19 cases were already identified in the SESCAM system in January and February 2020, yet they remained scarce up to the first week of March 2020. Shortly after, confirmed cases rose exponentially and reached a daily maximum at the end of March/early April 2020. This peak in newly reported cases was followed by a slow decrease; by early May 2020, confirmed cases had fallen to near-zero levels (Fig. 2A). As shown in Figure 2B, the proportion of COVID-19 cases in females remained stable from the beginning of the outbreak up to the plateau; by the end of the study period, the proportion of newly diagnosed patients who were female had markedly increased. Linear regression analyses showed that the SR of confirmed cases (newly identified cases in females over new cases in males) significantly increased over time, p < 0.001.
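The slope test used for the SR trend can be sketched in plain Python as an ordinary least-squares fit followed by a t statistic for the null hypothesis of zero slope. The function name and the weekly values below are made up for illustration; they are not study data, and the exact regression specification used in the analysis is not described in the text.

```python
import math


def slope_t_test(x, y):
    """OLS slope, its standard error, and the t statistic for the
    null hypothesis that the slope equals zero."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    # Residual sum of squares around the fitted line.
    sse = sum((yi - (intercept + slope * xi)) ** 2 for xi, yi in zip(x, y))
    se_slope = math.sqrt(sse / (n - 2) / sxx)
    return slope, se_slope, slope / se_slope


# Made-up weekly sex ratios, for illustration only (NOT study data).
weeks = [0, 1, 2, 3, 4]
sex_ratios = [1.0, 1.1, 1.3, 1.2, 1.5]
slope, se_slope, t_stat = slope_t_test(weeks, sex_ratios)
print(f"slope = {slope:.3f}, t = {t_stat:.2f} on {len(weeks) - 2} df")
```

The two-sided p-value would then be read from a t distribution with n − 2 degrees of freedom, which requires a statistics library or table and is left out of this sketch.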

Epidemiological curve and SRs showing COVID-19 cases within the study period.
Female COVID-19 patients were on average 1.5 years younger than males (61.7 ± 19.4 vs. 63.3 ± 18.3, p = 0.0025). In addition, there were more female patients in the 15–59-year-old interval (Fig. 3), with the greatest SR (95% CI) observed in the 30–39-year-old interval (1.686; 1.351–2.113) (Table 1).

Age and sex distribution of COVID-19 patients. Age distribution of incident cases of COVID-19 in females (left) and males (right) in the study population for the period comprised between January 1, 2020 and May 1, 2020.
Number of Coronavirus Disease 2019 Cases by Age Group and Sex
Total population of Castilla-La Mancha (Spain).
A SR of 1 indicates equal proportion of male and female patients; a SR >1 indicates higher proportion of female patients than male patients.
p-Values from Yates-corrected chi-square test on percentage difference of female versus male COVID-19 patients.
CI, confidence interval; SR, sex ratio; COVID-19, coronavirus disease 2019.
We did not observe any sex-dependent differences in the number of COVID-19 cases per 100,000 individuals; the prevalence rates for female and male patients were 239.7 and 227.6, respectively, with a corresponding SR (95% CI) of 1.054 (0.995–1.115), p = 0.0741 (Table 1). The data shown in Table 1 also indicate an age-dependent increase in reported cases in both males and females; patients aged >79 years were the most affected, with rates of 968.1 in men and 689.3 in women and a corresponding SR (95% CI) of 0.712 (0.632–0.803), p < 0.001.
Clinical Manifestations of Coronavirus Disease 2019 Upon Diagnosis
A SR of 1 indicates equal proportion of male and female patients; a SR >1 indicates higher proportion of female patients than male patients.
p-Values from Yates-corrected chi-square test of difference between percentage of patients (female vs. male) presenting with the sign/symptom. All tests were performed individually for each variable sign/symptom.
RR, respiratory rate; SD, standard deviation; bpm, beats per minute.
Regarding symptoms upon diagnosis, headache, anosmia, and ageusia were significantly more frequent in women than men, all p < 0.001 (Table 2). Interestingly, chest X-ray imaging and blood tests were performed less frequently in females (65.5% vs. 78.3% and 49.5% vs. 63.7%, respectively), all p < 0.001. Regarding hospital resource use, female COVID-19 patients were hospitalized (44.3% vs. 62.0%) and admitted to the ICU (2.8% vs. 6.3%) less often than males, all p < 0.001.
As expected, comorbidities upon COVID-19 diagnosis were more often reported in men: whereas 78.9% of female patients had at least one of the studied comorbidities at diagnosis, this percentage was 87.4% in males (p = 0.0183) (Table 3). However, depressive disorders and asthma were significantly more frequent in females, with associated ratios of 2.030 (1.616–2.565) and 1.743 (1.363–2.241), respectively.
Comorbidities of Coronavirus Disease 2019 Patients Upon Diagnosis
A SR of 1 indicates equal proportion of male and female patients; a SR >1 indicates higher proportion of female patients than male patients.
p-Values from Yates-corrected chi-square test of difference between percentage of patients (female vs. male) diagnosed with each condition or disease. All tests were performed individually for each comorbidity.
COPD, chronic obstructive pulmonary disease; HIV, human immunodeficiency virus.
According to the laboratory parameters upon COVID-19 diagnosis, men had significantly more lymphopenia and worse renal function (as per creatinine and urea values but not glomerular filtration rate) than women (Table 4). In contrast, all liver function parameters, as well as D-dimer and all acute phase reactants (except for higher C-reactive protein levels in men), were evenly distributed by sex.
Laboratory Parameters of Coronavirus Disease 2019 Patients Upon Diagnosis
A SR of 1 indicates equal proportion of male and female patients; a SR >1 indicates higher proportion of female patients than male patients.
p-Values from Yates-corrected chi-square test of difference between percentage of patients (female vs. male) in either outcome group (high levels). All tests were performed individually for each parameter.
ALT, alanine transaminase; AST, aspartate transaminase; GGT, gamma-glutamyl transpeptidase; LDH, lactate dehydrogenase.
Regarding treatments received by COVID-19 patients (Table 5), our results indicate that, except for chloroquine, the SR for all treatments analyzed was <1. Notably, most of these comparisons were statistically significant, indicating that these treatments were used less often in female patients with COVID-19.
Treatments Used in Coronavirus Disease 2019 Patients
A SR of 1 indicates equal proportion of male and female patients; a SR >1 indicates higher proportion of female patients than male patients.
p-Values from Yates-corrected chi-square test of difference between percentage of patients prescribed each therapeutic agent (male vs. female). All tests were performed individually for each treatment.
Discussion
Using a big data approach and from a population perspective, we have identified important sex-dependent differences in the clinical manifestation, diagnosis, management, and hospital resource use associated with COVID-19. Specifically, female teenagers and young adult women were significantly more affected by COVID-19 than their male counterparts in the same age ranges. In addition, our results indicate that headache, as well as ear, nose, and throat (ENT) symptoms, were significantly more frequent in female COVID-19 patients. Regarding medical outcomes, both hospitalization and ICU admission were less common in females than males. Unfortunately, basic diagnostic tests such as blood tests or imaging were used less often in women.
Our results provide further evidence of the inherent gender bias in the health system, which is thought to originate in medical school and impacts all aspects of health care. 18,19 Although this bias is well established in the context of cardiovascular, 20 respiratory, 21–23 and infectious diseases (particularly, sexually transmitted diseases 24), the impact of sex and gender in the ongoing COVID-19 pandemic is just beginning to be unraveled. 8,25,26 Beyond mechanistic and molecular studies, 5–9 more subtle and general events may already play a role in the sex-dependent management of COVID-19 patients. 27,28 One key question is whether COVID-19 affects women's reproductive health; in other coronavirus-related infectious diseases such as the severe acute respiratory syndrome and the Middle East respiratory syndrome, pregnancy has been identified as a risk factor for developing severe complications. 29,30 Finally, ovarian hormones influence inflammation, immunity, and many other aspects of women's health, 13,31 as well as the expression of ACE-2 receptors, which seem to play a role in the progression of COVID-19. 32 These effects are lost after menopause (due to ovarian insufficiency), which in most women occurs around 50 years of age. Interestingly, as shown in Table 3, mood disorders (e.g., depression) and asthma were more frequent in women than in men among COVID-19 patients. These results warrant further research on the effects of menopause in COVID-19-related health outcomes.
The increased vulnerability of women to COVID-19 is also associated with occupational risks. It is well established that most frontline health care professionals are women, which puts them at a higher risk for infection and negative clinical outcomes. 33 Further, women are more likely to serve as the primary caregivers within a household, thus becoming more exposed to the disease. This becomes worrying in disadvantaged populations and resource-poor communities, as well as countries without the benefits of a universal, free-for-all health care system.
Strengths and limitations
The main strengths of our research include immediacy, large sample size, and direct access to real-world evidence. Of note, our methodology ensures absence of any bias in patient selection as our hypothesis that gender impacts diagnosis and management of COVID-19 was assessed a posteriori. The observed change in the SR of confirmed cases at the tail of this first wave of the pandemic should be further confirmed in other cohorts and geographical locations. 34 Finally, it is unlikely that our conclusions are impacted by the limitations of pay- or copay-systems, as Spain enjoys a universal, free-for-all health care system.
Our results should be interpreted in light of the following limitations. First, this was an observational, retrospective study; therefore, any causal inferences based on the present results must be carefully interpreted. Second, given the variation in COVID-19 severity, it is possible that the free-text information available in EHRs is not homogeneous across patients seen in different points of care (i.e., primary-to-tertiary care). For instance, care providers could have been more likely to further explore (and report more often) milder symptoms in women, who in turn are more likely to be seen in primary care; on the contrary, the more severe symptoms reported in men may be related to the fact that they were more likely to be hospitalized or visit the ICU. Third, it is possible that women were more likely than men to report ENT symptoms. 35 Finally, as indicated in the methods section, our reported COVID-19 prevalence rates are probably lower than the true rates, as some cases may have been missed by the system due to heterogeneous reporting in EHRs. However, the observed low recall metrics in variables related to the identification of PCR-confirmed patients do not affect the quality of the descriptive results since our precision metrics for these concepts were high.
Implications for future research
The well-established gender bias in cardiovascular, 20 respiratory, 21–23 and other diseases should be further investigated in COVID-19 patients. Despite recent regulations and partial improvements, the attention paid to sex and gender differences in biomedical and health research is far from optimal. 36 As pointed out in recent reviews, occupational gender segregation makes women particularly vulnerable to COVID-19 since two-thirds of the health and social care workforce worldwide are women. 37 Crucially, any gender bias in the use of diagnostic testing and imaging, as evidenced in our research from a country with universal, free-for-all health care, might be magnified in less privileged settings.
Conclusion
The biological, behavioral, social, and systemic factors underlying the differences in how women and men may experience COVID-19 and its consequences cannot be oversimplified. 38 Regrettably, most research studies are systematically failing to offer comparisons between women and men, girls and boys, and people with diverse gender identities. 39 Based on the results presented here, we conclude that women were more heavily impacted by COVID-19 than men (specifically teenagers and young adults). In addition, women presented different symptoms at disease onset, clinical outcomes, and treatment patterns. These results warrant further research to identify and close the gender gap in the diagnosis and treatment of COVID-19.
Footnotes
Acknowledgments
We thank all the Savaners for helping accelerate health science with their daily work. We also thank SESCAM (Healthcare Network in Castilla-La Mancha) for its participation in the study and for supporting the development of cutting-edge technology in real time.
Authors' Contributions
J.A., I.H.M., J.L.I., A.P., M.S., and J.B.S. had the original idea of the study and developed the concept protocol; A.P., S.L., S.M., I.S., and I.Z. developed the analytical plan and conducted the statistical analyses; A.P., J.A., J.L.I., J.B.S., Y.G., S.L., S.M., I.H.M., C.D.R.-B., and I.Z. interpreted the results; C.D.R.-B. and J.B.S. wrote and edited the article; C.D.R.-B. and I.Z. are responsible for figures and data visualization; all authors contributed to drafting and interpretation, and they approved the final version.
Author Disclosure Statement
Savana employees contributed to the design, data analysis, and writing of the present study. All authors declare that there are no other direct or indirect potential conflicts to disclose.
Funding Information
The BigCOVIData study was funded by Savana.
